Talk:String (computer science)

This is the talk page for discussing improvements to the String (computer science) article.
This is not a forum for general discussion of the article's subject.

WikiProject Computer science (High-importance)

This article is within the scope of WikiProject Computer science, a collaborative effort to improve the coverage of Computer science related articles on Wikipedia. If you would like to participate, please visit the project page, where you can join the discussion and see a list of open tasks.

This article has been rated as High-importance on the project's importance scale.

String Buffer was nominated for deletion. The discussion was closed on 04 June 2013 with a consensus to merge. Its contents were merged into String (computer science). The original page is now a redirect to this page. For the contribution history and old versions of the redirected article, please see its history; for its talk page, see here.

Computing theory

The first paragraph seems too "busy" to me. What about replacing it with something like this?

A string (or string of characters) is a data type used in most programming languages to represent text, and is the focus of this article.

The computing term string is also used in a broader sense to group a sequence of entities; for example, tokens in a language grammar, or a sequence of states in automata. See the theory of computation.

This is a lot better. Also, I think the usage in computing theory could be expanded in its own paragraph: one starts with a finite alphabet, then considers all finite sequences consisting of letters from that alphabet (including the empty string) and defines concatenation of strings. The set of strings with concatenation is then a monoid.

I think I wrote most of the current paragraph and I agree your rewrite is better. Just do it! --drj

Ok, I'll move my text to the main article. I won't try to expand the second paragraph; I'm inclined to leave that to the computation article, or to whoever can concisely expand it without detracting from the rest of the page. --loh
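
For anyone expanding the theory paragraph: the monoid claim made earlier in this thread can be checked mechanically on a small slice of Σ*. The following is only a sketch; the names kleene_star and EMPTY are made up for this illustration, and the alphabet {a, b} and the length bound are arbitrary choices.

```python
from itertools import product


def kleene_star(alphabet, max_len):
    """Yield every string over `alphabet` of length 0..max_len.

    This is only a finite slice of the (infinite) set of all finite strings.
    """
    for n in range(max_len + 1):
        for letters in product(alphabet, repeat=n):
            yield "".join(letters)


EMPTY = ""  # the empty string acts as the identity element

words = list(kleene_star("ab", 2))

# Identity law: concatenating the empty string changes nothing.
identity_holds = all(EMPTY + w == w == w + EMPTY for w in words)

# Associativity law: grouping of concatenations does not matter.
associativity_holds = all(
    (u + v) + w == u + (v + w) for u in words for v in words for w in words
)

print(words, identity_holds, associativity_holds)
```

Checking a finite slice is of course not a proof, but it shows the two laws (identity and associativity) that make strings under concatenation a monoid.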

Other meanings

I think something definitely needs to be added about the other meanings of string. Ukulele is already linked to this page, which is somewhat confusing. Although, I'm not sure how much content can actually be provided for the other meanings. Maybe this article should be moved to String (computer science) or something, and this page be turned into a disambiguation page. B4hand

I propose renaming this article to Character string (computer science). Comments? - Bevo 17:03, 13 Jun 2004 (UTC)

I oppose. A string doesn't need to be a literal string in general. -- Taku 18:44, Sep 25, 2004 (UTC)

Lexicographical order

The lexicographical order on Σ* is not a well-ordering (for example, what is the least element in a*b?), but only a total order. fudo 13:53, 2 August 2005 (UTC)

The least element of a*b is b. The lexicographical order on Σ* is indeed a well-ordering. --Pexatus 00:22, 19 March 2006 (UTC)

No. There is no least element of a*b if you use the alphabetical ordering a < b. Assume there is some least element m; this means that m = a^{k}b for some non-negative integer k. Note that n = a^{k+1}b is also in a*b. n agrees with m in the first k positions, but the (k + 1)th position of n is less than the (k + 1)th position of m, so n < m, which contradicts the assumption that there exists a least element of a*b. Therefore, there is no least element of a*b. --Bkkbrad 16:28, 13 September 2006 (UTC)
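
The descending chain in the argument above is easy to see in practice. The sketch below relies on Python's built-in string comparison, which is lexicographic by code point (so 'a' < 'b'); the helper names a_star_b and shortlex_key are invented for this illustration. It also shows a length-first ("shortlex") ordering, under which b is the least element of a*b.

```python
def a_star_b(k):
    """Return the string a^k b, a member of the language a*b."""
    return "a" * k + "b"


# The first few members of a*b: b, ab, aab, ...
chain = [a_star_b(k) for k in range(6)]

# Under dictionary order each longer member is smaller than the previous one,
# so the chain b > ab > aab > ... never reaches a least element.
strictly_decreasing = all(
    chain[i + 1] < chain[i] for i in range(len(chain) - 1)
)


def shortlex_key(s):
    """Sort key for shortlex order: compare by length first, then lexicographically."""
    return (len(s), s)


# Under shortlex order the shortest member, b, is the least element.
shortlex_min = min(chain, key=shortlex_key)

print(chain, strictly_decreasing, shortlex_min)
```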

strings, not characters

There are lots of references to this page throughout Wikipedia for a "string" that is not a set of characters; a string of bits or bytes, for example. I think it is important that the article be cleaned up to make it clear that "character strings" are the most common use of the type, but the term might apply to vectors of data not representable by a string in a particular language. I've made a couple of edits in this direction, but I think some more effort needs to go into the issue. -- Mikeblas 22:39, 29 January 2006 (UTC)

Just don't forget that a vector has a fixed length, by definition (as an element in an N-dimensional space), while strings often have (chronologically) variable length. 83.255.35.89 (talk) 11:23, 4 March 2011 (UTC)

But (unfortunately?) the word "vector" in computer science is most often used to refer to a variable-sized storage (see std::vector), while the word "array" is used to refer to the fixed-sized objects you are talking about. (In linear algebra libraries the word "vector" is used for fixed-sized objects, but these also demonstrate unusual (for programming) properties such as addition doing a component-wise action rather than an append operation.) Spitzak (talk) 17:09, 4 March 2011 (UTC)

Yes, I know, it's a disaster. The designers of the C++ STL must have been rather ignorant, or why would you create that kind of mess otherwise, with concepts turned upside down? They could have called it dynamic array, string of type, or whatever; the well defined term vector was really the worst possible choice and the worst kind of hijacking of words. If they actually knew what they did, I guess they must have been inspired by the arrogant "C-syntax" (B→C→C++→Java etc), which, when spread to the world of web languages some 15 years ago, caused the equality symbol to suddenly lose its meaning in large circles, a symbol that has been established in both mathematics and everyday use for hundreds of years. Too many young (or uneducated) people are now using == and != instead of = and ≠ in any everyday context (mathematics next...?) and they would interpret the equation a=b as a definition/mutation/initialization of a...

I can't see why Wikipedia shouldn't do what it can to clarify backgrounds like this. I believe it is crucial to illustrate "misunderstandings" and unnecessary discrepancies in terminology among branches of science, so more people can see that there are other conventions than the most vulgar ones that one may want to adhere to. It does not conflict with the goal of describing actual usage and terminology or with "following the sources", as there are many kinds of sources, and plenty of room for elaborations on Wikipedia.

(As a side note, while "addition" may mean several things (as you wrote), only algebraic superposition should use the + sign really; concatenation may use &, &&, ::, |, concat, or whatnot, at least in my world ;) Regards 83.255.32.149 (talk) 04:37, 5 March 2011 (UTC)

I agree, but you probably ought not to take it too far. "Character string" is usually implied, and the article shouldn't give the impression that "string" on its own is incorrect. For the most part, I think your edits were fine, although perhaps you could revert the edits to the string oriented languages section. More effort doesn't need to go into the issue, as that would just confuse the article. --StuartBrady 22:59, 29 January 2006 (UTC)

I was confused by this too. "String" always refers to a string of characters. Vectors of other things are lists, arrays, vectors, ... I'm curious as to what language it is where the word "string" is used in reference to lists of objects. Richard W.M. Jones 09:04, 1 May 2006 (UTC)

Yes. And keep in mind that WP article titles generally reflect the most common usage for a given term (unless it needs its own disambiguation page). In computer science, "string" most commonly refers to a string of characters. Other not-so-common meanings (e.g., bitstring) can be linked to in the "See also" section. -- Loadmaster 18:35, 1 May 2007 (UTC)

My impression is that "string" virtually always means "string of characters". The only counter-example I know is that C++ std::string is based on a template and can be made to use any object. However I am almost certain this was done only to support bytes and "wchar" (16 bits, often mislabeled "Unicode"). If it was not for "wchar" then they probably would not have made it a template. Since wchar is intended to store characters (or UTF-16) then the string is still a "string of characters". I would be interested if anybody has any real examples of usage of a std::basic_string template with any object for any purpose other than storing something that would be considered "characters". Spitzak (talk) 17:19, 4 March 2011 (UTC)

Null and NUL

We should decide on just one spelling and capitalization of null. I vote for two L's. 24.186.138.188 02:03, 2 May 2006 (UTC)

"Null" generally refers to the "null character" (or a "null pointer"). "NUL" (all caps) is the mnemonic name (ASCII, EBCDIC, Unicode, etc.) of the "null character" code. -- Loadmaster 18:35, 1 May 2007 (UTC)

Origin of the Term ?

Anyone know the history behind using the term "string" to mean a sequence of characters? I mean it's not like a series of characters looks much like a ball of string... I assume its origins are as a mathematical term, but it would be interesting to know how it came to such common use in computing.

Presumably it comes from the rather obvious expression "a string of characters" (as in "there go some characters stringing by"), equivalent to "a string of pearls" or "a sequence of characters" or other similar phrases. -- Loadmaster (talk) 16:27, 9 February 2008 (UTC)

I heard that it originated because in the old days of physical type-setting, the type was held together in groups by literal string (rope). I don't have any references for this, though, so I can't back it up. Showeropera (talk) 20:48, 14 December 2017 (UTC)

Trying to stop misuse of character encodings

The edits I keep trying are to stop a whole army of uninformed but well-meaning programmers who think strlen() should parse and count the Unicode code points in a string. This is a totally useless definition and causes no end of grief when systems do this. For some reason otherwise intelligent programmers turn into these complete morons when presented with UTF-8, and this is actually seriously damaging any ability to do internationalization. If anybody can think of a wording that says these answers are different and that the fixed-size one is "better" it would help a lot. The reverted-to wording implies that the number of characters is the more important attribute, which is wrong. Spitzak (talk) 04:10, 29 May 2009 (UTC)
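
To make the distinction concrete for whoever rewords the article, here is a small sketch of the two "lengths" being argued about. The sample text is arbitrary, and c_strlen is just an illustrative name for what a C strlen() would report on the NUL-terminated UTF-8 bytes; it is not a call into any C library.

```python
# Sample text with two non-ASCII letters, written with escapes so the
# code points are unambiguous (precomposed i-diaeresis and e-acute).
text = "na\u00efve caf\u00e9"

# The UTF-8 encoding uses two bytes for each of the two accented letters.
encoded = text.encode("utf-8")

code_points = len(text)     # number of Unicode code points
byte_length = len(encoded)  # number of bytes (code units) in UTF-8

# What a C-style strlen() would see: bytes before the first NUL terminator.
c_strlen = (encoded + b"\x00").index(b"\x00")

print(code_points, byte_length, c_strlen)
```

The byte count is what is needed for allocating storage or copying; the code point count is a different quantity, and neither is the number of user-perceived characters in general.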

StringBuilder .Net type

As well as the standard String type, .Net has a StringBuilder type. I think this is implemented as a linked list, but I'm not too sure. This would be worth adding to the Implementation section. N4m3 (talk) 09:35, 28 May 2011 (UTC)

Only finite strings?

I noticed this statement:

Although formal strings can have an arbitrary (but finite) length, the length of strings in real languages is often constrained to an artificial maximum. [emphasis mine]

I would like a citation for that. What about languages with lazy evaluation, like Clojure and Haskell?

 cycle "Is this finite? "
 ⇒ "Is this finite? Is this finite? Is this finite? Is this finite? ..."

 let shouting = 'a' : shouting in putStr shouting
 ⇒ "aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa..."

--BiT (talk) 03:00, 8 December 2011 (UTC)
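
The same idea can be sketched outside Haskell and Clojure. The Python version below is only an analogy: it models a conceptually endless "string" as a lazy iterator of characters, of which only a finite prefix is ever materialized. The names shouting, prefix_cycle and prefix_shout are made up for this example.

```python
from itertools import cycle, islice


def shouting():
    """Lazily yield the character 'a' forever."""
    while True:
        yield "a"


# cycle() repeats the characters of its argument endlessly; islice() takes
# a finite prefix, so nothing infinite is ever built in memory.
prefix_cycle = "".join(islice(cycle("Is this finite? "), 40))
prefix_shout = "".join(islice(shouting(), 10))

print(prefix_cycle)
print(prefix_shout)
```

Whether such a lazy stream counts as a "string" in the formal sense is exactly the question raised above; the formal definition only covers finite sequences.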

Reverse string

In the "Formal theory" section, it might be useful to mention that:

A string s = ab, composed of zero or more characters (here, 'a' and 'b') of the alphabet, is said to be the reverse of string t if t = ba. For example, if Σ = {0,1} the string 0011001 is the reverse of 1001100. The empty string and all strings of length 1 are reverses of themselves. A string that is the reverse of itself is also called a palindrome.

The problem is, I don't know for certain whether the terms "reverse string" or "string reversal" are correct or not. FWIW, practically all programming languages/libraries that provide this operation call it "reverse". Does anyone know what the proper term should be? (Inverse? Inversion? Opposite? Transpose?) -- Loadmaster (talk) 17:05, 9 November 2012 (UTC)

Reverse would be the correct terminology, but as currently formulated your statement would either not be general enough (only working for two character strings) or, worse, likely be misinterpreted as stating that WORLDHELLO and LOWORLDHEL would be reverses of HELLOWORLD (instead of DLROWOLLEH.) --Ruud 23:09, 9 November 2012 (UTC)

How about this:

A string s = abc, composed of zero or more characters of the alphabet (here, 'a', 'b', and 'c'), is said to be the reverse of string t if t = cba. For example, if Σ = {0,1} the string 0011001 is the reverse of 1001100. A string that is the reverse of itself is also called a palindrome, which includes the empty string and all strings of length 1.

I don't see the confusion; it states that we're talking about a string of zero or more characters of the alphabet, and the example of 0011001 should make it additionally clear that we're talking about the ordering of the characters of the string, not substrings of the string. If you think this is still confusing, we could instead use HELLOWORLD and DLROWOLLEH, but this requires a larger symbol alphabet ({D,E,H,L,O,R,W}), which complicates the description somewhat. -- Loadmaster (talk) 18:02, 13 November 2012 (UTC)

The problem is that you're giving a very specific example but state it in such a form that it, at least at first reading, appears to be a very general definition. You're either using too much formal machinery for what is an informal statement or, conversely, making a statement that is not precise enough to be a formal definition. This is how one of my formal language textbooks defines a reverse:

The reverse of a string is obtained by writing the symbols in reverse order; if w is a string as shown above, then its reverse w^R is
w^R = a_n ... a_2 a_1.

Where they explained "above" that a, b, c, ... denote elements from the alphabet Σ and u, v, w, ... strings over that alphabet. --Ruud 18:45, 13 November 2012 (UTC)

I went ahead and added a "Reversal" subsection to the article, with (hopefully) simplified language. -- Loadmaster (talk) 22:25, 15 November 2012 (UTC)
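
For reference, the examples used in this thread can be checked with a few lines of code. This is just an illustrative sketch; reverse and is_palindrome are names chosen here, not taken from any particular library.

```python
def reverse(w):
    """Return w^R: the symbols of w written in reverse order."""
    return w[::-1]


def is_palindrome(w):
    """A string is a palindrome when it equals its own reverse."""
    return w == reverse(w)


# The examples discussed above.
print(reverse("0011001"))     # the binary example
print(reverse("HELLOWORLD"))  # reverses symbols, not substrings
print(is_palindrome(""), is_palindrome("a"), is_palindrome("ab"))

# Reversal interacts with concatenation as (uv)^R = v^R u^R.
u, v = "HELLO", "WORLD"
print(reverse(u + v) == reverse(v) + reverse(u))
```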

External links modified

Hello fellow Wikipedians,

I have just modified one external link on String (computer science). Please take a moment to review my edit. If you have any questions, or need the bot to ignore the links, or the page altogether, please visit this simple FaQ for additional information. I made the following changes:

Corrected formatting/usage for http://www.wearmouth.demon.co.uk/zx80.htm

Cheers. --cyberbot II (Talk to my owner: Online) 08:39, 4 April 2016 (UTC)

Discussion to move "String" to "String (disambiguation)"

In order to make way for moving Draft:String to article space to take the place as the primary topic, I've posted a proposal at Talk:String#Requested move 16 January 2017 to move the disambiguation page currently at "String" to "String (disambiguation)". Your input would be helpful to establish a common consensus on whether or not this move, or something else, should be done. I look forward to your thoughts on the matter. The Transhumanist 22:50, 16 January 2017 (UTC)

String length

In section String datatypes/Representations/Null-terminated the IBM 1401 word-mark terminated string is discussed.

Somewhat similar, "data processing" machines like the IBM 1401 used a special word mark bit to delimit strings at the left, where the operation would start at the right. This bit had to be clear in all other parts of the string. This meant that, while the IBM 1401 had a seven-bit word, almost no-one ever thought to use this as a feature, and override the assignment of the seventh bit to (for example) handle ASCII codes.

That seventh bit idea could not have been implemented. The word mark bit is implemented in hardware. The MCW (Move Characters Wordmark) instruction, for instance, moved variable-length fields terminating on the word mark. Numeric and alphabetic data were treated no differently. The Honeywell H200, H1200, H3200 and H4200 all had MCW instructions. Arithmetic operations also used word mark field demarcation. The Honeywell computers had 8-bit memory with 6 data bits, a word mark bit and an item mark bit. Steamerandy (talk) 17:26, 24 April 2017 (UTC)

It does sound like there are 1 or 2 extra bits per character. Are you saying there was no way for a program to read or write these extra bits? Or that the implementation was somehow different from having extra bits per character (perhaps it was a table of locations with the bit "set" and thus you were restricted to how many times it was turned on). I think it is obvious that instructions designed to use these bits to end strings won't work but that is not an explanation as to why this extra storage was not taken advantage of. It is also surprising that they would in effect reserve 1/4 of their memory for such a limited use, when you consider how incredibly expensive the memory was at that time. Spitzak (talk) 17:36, 24 April 2017 (UTC)

Another length prefixed representation

Siemens PLCs use a form of length-prefixed string representation with 2 length information bytes (see Siemens Docs "Working with Strings in S7-SCL"). Maximum reserved memory is 256 bytes with maximum 254 bytes of actual text, where one byte denotes the allocated/reserved range for the string (the maximum count of characters allowed to be represented) and the other byte denotes the actual, currently valid length of the string. Maybe this could be added as a length-prefixed representation variant? --Ckonnerth (talk) 17:18, 15 December 2017 (UTC)
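
If it helps whoever adds this, the layout described above could be sketched roughly as follows. This is not Siemens code: s7_encode and s7_decode are hypothetical helper names, and the order of the two header bytes (maximum length first, then current length), the ASCII encoding and the zero padding are all assumptions made for this illustration.

```python
def s7_encode(text, max_len=254):
    """Pack text as [max length][current length][characters, padded].

    Assumed layout: two header bytes followed by max_len reserved bytes.
    """
    if not 0 <= max_len <= 254:
        raise ValueError("max_len must fit the 254-character limit")
    data = text.encode("ascii")  # assumption: single-byte characters
    if len(data) > max_len:
        raise ValueError("text longer than the reserved range")
    # Pad the unused part of the reserved range (padding value is an assumption).
    return bytes([max_len, len(data)]) + data.ljust(max_len, b"\x00")


def s7_decode(buf):
    """Read back the currently valid characters from a packed buffer."""
    max_len, cur_len = buf[0], buf[1]
    if cur_len > max_len or len(buf) < 2 + cur_len:
        raise ValueError("inconsistent length bytes")
    return buf[2:2 + cur_len].decode("ascii")


packed = s7_encode("HELLO", max_len=10)
print(len(packed), packed[0], packed[1], s7_decode(packed))
```

With the default max_len of 254 the buffer occupies 256 bytes in total, which matches the figures quoted above.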

DNA??

Wondering why there's a bio-related image on this article's page, I don't see how it depicts what strings actually are in computer science. -- Preceding unsigned comment added by 67.165.80.152 (talk) 05:19, 30 March 2020 (UTC)

I've changed the image to diagram a string. Though I haven't figured out how to make it the page image yet. TripleShortOfACycle (talk - contribs) - (she/her/hers) 14:19, 31 January 2021 (UTC)

Now the page thumbnail works! It displays a diagram of a string when links to this page are hovered over. TripleShortOfACycle (talk - contribs) - (she/her/hers) 14:30, 31 January 2021 (UTC)

Distinct, unambiguous symbols

As far as I know, it is also required that each string can be uniquely decomposed into its symbols. For example, if the alphabet itself consists of strings (as in Free_monoid#Free_generators_and_rank, or in the lead of Alphabet (formal languages), with Σ = {"0", "00"}), its symbols are distinct and unambiguous (as are the members of each mathematical set), but nevertheless, a string may be composed in different ways. I guess "unambiguous" is supposed to express the requirement of unique decomposition, but I'm not sure it is precise enough. The decomposition must be unambiguous, rather than just the symbols. - Jochen Burghardt (talk) 18:03, 13 May 2024 (UTC)
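
The Σ = {"0", "00"} example can be made concrete by counting decompositions. The sketch below uses a simple dynamic-programming count; count_decompositions is a name invented for this illustration.

```python
def count_decompositions(s, alphabet):
    """Count the ways to write s as a concatenation of symbols from alphabet.

    ways[i] holds the number of decompositions of the prefix s[:i].
    """
    ways = [0] * (len(s) + 1)
    ways[0] = 1  # the empty prefix has exactly one (empty) decomposition
    for end in range(1, len(s) + 1):
        for sym in alphabet:
            # Skip empty symbols; check whether s[:end] ends with sym.
            if sym and s.endswith(sym, 0, end):
                ways[end] += ways[end - len(sym)]
    return ways[len(s)]


# With symbols "0" and "00" the string 000 is ambiguous:
# 0+0+0, 0+00 and 00+0 are all valid decompositions.
print(count_decompositions("000", {"0", "00"}))

# With single-character symbols every string decomposes uniquely.
print(count_decompositions("0101", {"0", "1"}))
```

A count greater than 1 for any string means the alphabet does not satisfy the unique-decomposition requirement described above.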
