Strings, immutability and persistence

Today's post is based on a question from StackOverflow; I liked it so much I figured hey, let's just blog it today.

When you look at a string in C#, it looks to you like a collection of characters, end of story. But of course, behind the scenes there is a data structure in memory somewhere to implement that collection of characters. In the .NET CLR, strings are laid out in memory pretty much the same way that BSTRs were implemented in OLE Automation: as a word-aligned memory buffer consisting of a four-byte integer giving the length of the string, followed by the characters of the string in two-byte chunks of UTF-16 data, followed by two zero bytes. (Recall that BSTR originally stood for "BASIC string", because the OLE Automation team was actually part of the Visual Basic team; this is the format that Visual Basic used.)

Using this as the internal implementation of strings has a number of benefits. For example: it only requires one heap allocation per string. The length can be determined without counting characters. The string can contain embedded zero bytes, unlike formats that use zero bytes as end-of-string markers. If you disregard surrogate pairs then the nth individual character can be fetched in O(1) time, unlike in variable-width encodings like UTF-8. If the string is pinned in place and contains no zero characters then the address of the string data can be passed unmodified to unmanaged code that takes a WCHAR*. And so on.
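
A few of these properties are easy to observe from C# itself. Here is a purely illustrative sketch of the embedded-zero, O(1)-length, and fixed-width-indexing points above (the class name is mine):

```csharp
using System;

class StringLayoutDemo
{
    static void Main()
    {
        // Embedded zero characters are legal: the length is stored as a
        // field, not inferred by scanning for a terminator.
        string s = "abc\0def";
        Console.WriteLine(s.Length); // 7, not 3

        // Fixed-width code units make indexing O(1)...
        Console.WriteLine(s[4]); // d

        // ...but a char is a UTF-16 code unit, not a whole character:
        // a character outside the Basic Multilingual Plane takes two chars.
        string deseret = "\U00010400"; // one code point, U+10400
        Console.WriteLine(deseret.Length); // 2: a surrogate pair
    }
}
```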

Strings are immutable in .NET, which also has many benefits. As I've discussed many times, immutable data types are easier to reason about, are threadsafe, and are more secure. (*)

One of the benefits of the immutable data types I've talked about here previously is that they are not just immutable, they are also "persistent". By "persistent", I mean an immutable data type such that common operations on that type (like adding a new item to a queue, or removing an item from a tree) can re-use most or all of the memory of an existing data structure. Since it is all immutable, you can re-use its parts without worrying about them changing on you.
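
A minimal way to see persistence in action is an immutable linked list (this toy type is mine for illustration, not a framework type): prepending allocates a single node and shares the entire existing list.

```csharp
using System;

// A toy persistent list: prepending allocates one node; everything
// else is shared with the old list. Sharing is safe only because the
// nodes can never change.
sealed class Node
{
    public readonly char Head;
    public readonly Node Tail;
    public Node(char head, Node tail) { Head = head; Tail = tail; }
}

class PersistenceDemo
{
    static void Main()
    {
        Node abc = new Node('a', new Node('b', new Node('c', null)));
        Node xabc = new Node('x', abc); // O(1); abc is not copied

        // The old list is the tail of the new one, reference-identical.
        Console.WriteLine(ReferenceEquals(xabc.Tail, abc)); // True
    }
}
```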

Strings in this format are immutable, but they are not persistent, thanks to that pesky single-buffer-with-both-a-prefix-and-suffix layout. Concatenating two strings does not re-use the content of either of the original strings; it creates a new string and copies the two strings into the new string. Taking the substring of a string does not re-use the content of the original string either. Again, it just makes a new string of the right size and makes a full copy of the data.

This means that operations on strings such as taking a substring are O(n) in the size of the substring, and concatenations are O(n+m) in the sizes of the two source strings. These operations could be O(1) or O(lg n) if strings were persistent. For example, we could treat strings as catenable deques of chars; there are ways to do very efficient concatenations of catenable deques. (**) Or, we could say that there are two kinds of strings: "regular" strings and "ropes", which are binary trees of strings. Concatenating two strings just allocates a new rope. It becomes a bit more expensive to find the nth char of a rope, and you can't pass one unmodified to unmanaged code, but you always avoid a potentially lengthy copy. Similarly with substring operations: taking the substring of a string could simply allocate a wrapper around the original string that keeps an offset and a length. No copying required.
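
A deliberately tiny rope sketch (the class names are mine) makes the trade-off concrete: concatenation becomes a single O(1) allocation, while finding the nth char has to walk the tree.

```csharp
using System;

// A toy rope: a leaf holds an ordinary string; a concat node holds two
// ropes plus their combined length. No characters are copied on concat.
abstract class Rope
{
    public abstract int Length { get; }
    public abstract char this[int i] { get; }
    public static Rope operator +(Rope l, Rope r) => new Concat(l, r); // O(1)
}

sealed class Leaf : Rope
{
    readonly string s;
    public Leaf(string s) { this.s = s; }
    public override int Length => s.Length;
    public override char this[int i] => s[i];
}

sealed class Concat : Rope
{
    readonly Rope left, right;
    readonly int length; // cached so Length stays O(1)
    public Concat(Rope l, Rope r) { left = l; right = r; length = l.Length + r.Length; }
    public override int Length => length;
    public override char this[int i] =>               // O(tree depth)
        i < left.Length ? left[i] : right[i - left.Length];
}

class RopeDemo
{
    static void Main()
    {
        Rope r = new Leaf("hello, ") + new Leaf("world"); // no copying
        Console.WriteLine(r.Length); // 12
        Console.WriteLine(r[7]);     // w
    }
}
```

A substring wrapper would be the same trick in the other direction: a node holding a reference to the original rope plus an offset and a length.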

Why don't we do "persistent" optimizations like this, since we have an immutable data structure already?

The motivation for doing so is based on the incorrect idea that O(1) is always better than O(lg n), which in turn is always better than O(n). But the asymptotic order of an operation is only relevant if, under realistic conditions, the size of the problem actually becomes large. If n stays small then every problem is O(1)!

That is the situation we are actually in. Very few developers routinely deal with strings of more than a few hundred or a few thousand characters. If you are dealing with larger strings and doing a lot of concatenation, you can use a StringBuilder; if you're only doing a small number of concatenations, it is very fast to do the copy. If you're taking substrings, odds are very good that the substrings are going to be a few dozen characters out of a thousand character string, not a few hundred thousand characters out of a million-character string. Most line-of-business programmers are not in the business of chopping up strings containing million-character DNA strands; they're in the business of parsing the name of a book out of an XML document or an address out of a comma-separated-value text file. The memory operations for making a relatively small string buffer and copying a few dozen bytes out of it are insanely fast, so fast that there is really little point in optimizing it further.
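
For the rare case where n really does get large, the StringBuilder advice matters for asymptotic reasons: n repeated concatenations copy O(n²) characters in total, while a StringBuilder appends into internal buffers so each character is copied a constant number of times. A sketch:

```csharp
using System;
using System.Text;

class ConcatDemo
{
    static void Main()
    {
        // Quadratic total copying: each += allocates a new string and
        // copies everything accumulated so far.
        string slow = "";
        for (int i = 0; i < 1000; i++)
            slow += "x";

        // Amortized linear: characters accumulate in the builder's
        // internal buffers and are copied out once by ToString.
        var sb = new StringBuilder();
        for (int i = 0; i < 1000; i++)
            sb.Append('x');
        string fast = sb.ToString();

        Console.WriteLine(slow == fast); // True: same result, less copying
        Console.WriteLine(fast.Length);  // 1000
    }
}
```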

Unless you have really good reason to believe that the "optimization" is a win, it probably isn't. I spent a summer about twelve years ago rewriting the VBScript internal string manipulations to use "ropes" -- to build up a tree on every string concatenation, and only turn it back into a "real" BSTR when necessary. Our research on real-world code showed that the "optimization" was no optimization at all; most of the time strings were being turned from a rope back into a BSTR after the second or third concatenation, and the added expense of allocating all the tree structures was actually slower than just making a copy of every BSTR every time. (***)

Now, some people are actually manipulating relatively large strings and inserting and removing substrings on a regular basis. My team does that all the time; we write code editors that have to be able to deal with enormous files being rapidly edited. Because operations on strings are O(n) and n is actually big, we do not use big strings; rather, we have immutable persistent data structures that represent edits to the document, and those edits are then fed into an engine that maintains an immutable, persistent model of the lexical and syntactic analysis of the program. We want to be able to re-use as much of the previously-seen text and analysis as possible, to ensure high performance in both memory size and speed.
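
One generic way to build such a structure (a sketch of the general idea, not a description of what Eric's team actually shipped) is to keep every version of the document as an immutable value, where each edit wraps the previous version rather than mutating it:

```csharp
using System;

// A toy persistent document: an edit wraps the previous version, so
// every old version stays valid (undo is free) and versions share
// structure instead of copying it.
abstract class Doc
{
    public abstract string Render();
}

sealed class Text : Doc
{
    readonly string s;
    public Text(string s) { this.s = s; }
    public override string Render() => s;
}

sealed class Inserted : Doc
{
    readonly Doc prev; readonly int at; readonly string what;
    public Inserted(Doc prev, int at, string what)
    { this.prev = prev; this.at = at; this.what = what; }
    // Rendering materializes the text; a real engine would analyze the
    // edits incrementally instead of rendering the whole document.
    public override string Render() => prev.Render().Insert(at, what);
}

class EditorDemo
{
    static void Main()
    {
        Doc v1 = new Text("hello world");
        Doc v2 = new Inserted(v1, 5, ",");
        Console.WriteLine(v2.Render()); // hello, world
        Console.WriteLine(v1.Render()); // hello world -- old version intact
    }
}
```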

That was some tricky code to write, and what works for us almost certainly does not work for the people doing DNA strings, or any other big-string-with-lots-of-substrings problem that you care to name. Rather than optimize CLR strings for narrow, rare cases, it's better to just keep it simple, optimize for the common case, and rely on the hardware taking care of making blindingly fast copies of short strings.

-------------

(*) Consider an attack where partially-trusted hostile code passes a string containing a "safe" path to File.Open. The security checker verifies that the path is safe for partially-trusted code to open. And then the partially-trusted code mutates the string on another thread before the file is actually opened! If they get the timing right then the security check passes and the wrong file is opened. If strings are immutable, this kind of thing does not happen.
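
With mutable buffers, the standard defense against this time-of-check/time-of-use race is to snapshot the data before checking it; immutable strings make the snapshot unnecessary because the string *is* the snapshot. A sketch of the idiom (the helper here is hypothetical, not a real framework check):

```csharp
using System;

class ToctouDemo
{
    // Hypothetical security check, for illustration only.
    static bool IsSafePath(string path) => !path.Contains("..");

    static void Main()
    {
        // Imagine hostile code handed us this mutable buffer.
        char[] buffer = "data/ok.txt".ToCharArray();

        // Snapshot into an immutable string BEFORE the check; later
        // mutations of the buffer cannot affect what we validated.
        string snapshot = new string(buffer);
        buffer[0] = '.';
        buffer[1] = '.'; // simulated mutation from another thread

        Console.WriteLine(IsSafePath(snapshot)); // True: check and use agree
        Console.WriteLine(new string(buffer));   // ..ta/ok.txt -- buffer changed
    }
}
```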

(**) My article on deques does not show how to make them efficiently catenable; see the referenced papers for details.

(***) If you happen to be one of the people with access to the VBScript source code, I think all that gear is still in there, but turned off. Search for FANCY_STRING in the sources.

Comments

  • Anonymous
    July 19, 2011
    "If you disregard surrogate pairs then the nth individual character can be fetched in O(1) time, unlike in variable-width encodings like UTF-8." So? If you disregard multibyte sequences in UTF-8, then the nth individual character can be fetched in O(1) time also. OTOH, if you disregard surrogate pairs in UTF-16, or multibyte sequences in UTF-8, your code is broken.
    [Sure, but it is far, far more likely for there to be a multi-byte character in a typical UTF-8 string than there is to be a surrogate pair in a typical UTF-16 string. Again, we are optimizing for the common case here. Strings containing modern Japanese, Greek, and so on, are very common; strings containing cuneiform or hieroglyphics are very uncommon. -- Eric]

  • Anonymous
    July 19, 2011
    @Karellen: You could perhaps put a flag somewhere (after the double-zero terminator, maybe, or at a negative offset before the length?) that indicates whether or not the string contains any surrogate pairs. Or possibly even an integer storing the first index of such a pair. Then you could get O(1) time access for all characters prior to the first such sequence... I have no idea whether the CLR does anything like this.

  • Anonymous
    July 19, 2011
    "Now, some people are actually manipulating relatively large strings and inserting and removing substrings on a regular basis..." I am the author and maintainer of a document editor for a file type that's used to pass electronic banking transactions between banks and network operators (ACH, for those who are paying attention).  It's not uncommon for these files to be as large as 200 MB in size - well over 1 million lines of text and far larger than the typical documents you'll encounter in a code editor.  Despite the allure of fancy data structures like ropes, my editor is implemented(*) as a view on top of the simplest of all data structures:  a giant array of bytes (the contents are strictly ASCII - or occasionally EBCDIC.  Yes, it's still around).   Inserts, deletes, moves are all implemented in the most straightforward way: by copying large chunks of memory around.  Even with that, today's CPUs are so fast that I could never justify the work to build, test and maintain a fancy rope-like backing store: my simple array is good enough for any user of the program.  Even full undo/redo of whole-file operations that modify a million or more lines of text are fast. Much as I like fancy data structures like ropes or deques, I suspect I'll never again use one - it's just not needed unless you're working on something that's a very serious outlier. (*) Implemented via an abstract base class.  I fully expected to have to retrofit a rope-like data structure once I got to very large files, so I built in a dependency barrier to make the retrofit easy.  Fortunately, Moore's law has outpaced e-commerce growth by a good margin since I first started on the app back in 2004.

  • Anonymous
    July 19, 2011
    "Again, surrogate pairs aren't a problem because it's trivial to identify a pair and whether you're looking at the leading or trailing value (indeed, you're never going to mistake part of a surrogate pair for some other valid encoding point), thus ensuring that the code doesn't inappropriately break the pair apart." But in what way is this different from UTF-8? UTF-8 suffix bytes always have the highest bit set and the second highest bit unset, so you can't mistake them for something else, either. The only encodings I know of in which the problem that you describe can occur is the old MBCS encodings.

  • Anonymous
    July 20, 2011
    I am so glad that the decision was made to keep things simple for the common case. Trying to solve for 100% of the cases instead of the 90% always results in problems and people developing an expertise in something which shouldn't require expertise.

  • Anonymous
    July 20, 2011
    Aren't strings in .NET technically mutable with unsafe code? I don't think it would ever be a good idea to actually do it, but it is possible, no? Every single bit in the user space of the process, including all of the code that implements the runtime, is mutable in unsafe code. That includes all the strings. -- Eric

  • Anonymous
    July 20, 2011
    "But in what way is this different from UTF-8? UTF-8 suffix bytes always have the highest bit set and the second highest bit unset, so you can't mistake them for something else, either." Okay, my oversight. You're right. My previous example still holds, and there are others. The fact remains that surrogate pairs aren't necessarily going to be as much of a problem as multi-byte UTF-8 encodings.

  • Anonymous
    July 20, 2011
    "My previous example still holds, and there are others. " - how does your example not apply equally to UTF-8? Anything that you can legitimately ignore surrogate pairs for, you can legitimately ignore UTF-8 for. "With a UTF-8 string, to implement the substring logic, you have to scan the string all over again just to translate the character index to a position in the actual data structure's storage." is a straw man - if you can gain efficiency without losing correctness with a "character index" for UTF-16 that ignores surrogate pairs, you can do the same in UTF-8 with a byte index.

  • Anonymous
    July 20, 2011
    "if you can gain efficiency without losing correctness with a "character index" for UTF-16 that ignores surrogate pairs, you can do the same in UTF-8 with a byte index." Not if you want the API to remain character-based. There's a difference between designing for absolute efficiency, and designing for usability. You may feel the latter is pointless, but thankfully not all people do (including the designers of .NET).

  • Anonymous
    July 20, 2011
    If I recall correctly, Borland Delphi's Object Pascal strings also worked the same way.

  • Anonymous
    July 21, 2011
    I have never had to deal with characters beyond 2 bytes, but it seems to me from the documentation that substring and indexing actually work with chars - as in the 2-byte char type - and not with real Unicode code points. Therefore if you really wanted to handle surrogate pairs as a single character, you would need to do it yourself. Is that true?

  • Anonymous
    July 21, 2011
    pete.d - "Not if you want the API to remain character-based." Now your arguments are just getting silly. The traditional CHAR is only one byte.  The 2-byte WCHAR was introduced later.  Personally, I like UTF-8 better as it is backward compatible with ASCII and doesn't change the definition of character.  Also, since I have only worked for American companies, all the business text I have worked with has always been ASCII and I can assume no multi-byte characters.  UTF-8 is also the standard for the web. Eric has a good argument for UTF-16, in that it is currently rare to have surrogate pairs, but 20 years ago the same thing could be said for UTF-8.  What about in 20 more years?  An argument can be made that code should just be written to handle multi-unit code points now, and in that case I think UTF-8 wins hands down.

  • Anonymous
    July 21, 2011
    "Now your arguments are just getting silly. The traditional CHAR is only one byte." Calling my posts "silly" doesn't help your position any.  I certainly disagree with your unfounded mischaracterization. In any case, what was "traditional" isn't relevant.  .NET strings are an API unto themselves, and are Unicode character based. We aren't talking about some random API elsewhere.  We're talking about .NET.

  • Anonymous
    July 21, 2011
    >> In any case, what was "traditional" isn't relevant.  .NET strings are an API unto themselves, and are Unicode character based But they aren't - they are 16-bit-char based, and a 16-bit char is not a Unicode character. He is more or less correct in that the only way in which UTF-16 is better than UTF-8 is that it makes it easier to justify ignoring proper handling of all code points, because most of those that you'll see in practice will not require surrogate pairs to encode, and you can just assume 1 char = 1 Unicode character and pretend that this works for everyone. It's more about hiding the problem by solving the most prominent manifestations and sweeping the rest under the carpet, rather than dealing with it in its entirety. That's in the perfect world, though. In practice, when .NET appeared, Win32 had been using 16-bit chars in all its Unicode-enabled APIs for, what, 7 years already? As it is, you can pass .NET strings to Win32 functions with no conversion at all - the P/Invoke marshaler just pins the string and passes the direct pointer to the buffer. That's fast. Returning strings can be handled similarly with StringBuilder, as well. And remember that e.g. WinForms did a lot of P/Invoke! Even setting performance aside, there's also the issue of learning curve and familiarity for users of existing tools. Win32 already had it that way, and it was not alone - Java is another language which does it this way, and Qt is a framework with a similar choice. If I remember correctly, Cocoa NSString is UTF-16 encoded, as well. The only UI framework that I can think of that uses UTF-8 for strings is Gtk+ - pretty much everyone else is on UTF-16. It's just one of those things that became a de facto standard back in the day when all we had was the BMP and UCS-2 was all you needed.  For better or worse, it is what we have, and changing it now is a fairly radical, breaking change - while the benefits of such a change would be very minor in comparison.

  • Anonymous
    July 21, 2011
    String itself aside, there are some areas related to strings that might merit some optimization, even if this is still only relevant in extreme cases (as long as it doesn't come with a penalty). See string.Split: ajdotnet.wordpress.com/.../the-cost-of-string-split Another example is StringBuilder with long strings (memory going to the LOH) and naive usage (i.e. not properly initializing it with the required size). Some way of at least telling the developer about the anti-pattern (e.g. some trace, or a counter with the number of reallocations) could help.

  • Anonymous
    July 22, 2011
    AlexanderJ: The implementation of strings is a trade-off. If you use the CLR's implementation of strings, then performing a Split requires creating smaller strings that just about add up in size to your original large string. The bonus of that, however, is that the larger string can then be garbage collected and only the smaller strings that you actually use need to stick around. If you look at an implementation of strings like Java's where strings can reference other strings, then doing a Split is relatively cheap -- you just create an array of references to the original string with offsets and lengths of the substrings. The unfortunate problem with that, however, is that the original large string cannot be garbage collected until every single one of those smaller strings is no longer referenced. Every call to Split is a potential memory leak! And what's the anti-pattern with StringBuilder? A StringBuilder is actually a linked list, where they try very hard to keep each item in the list out of the large object heap.

  • Anonymous
    July 25, 2011
    ""if you can gain efficiency without losing correctness with a "character index" for UTF-16 that ignores surrogate pairs, you can do the same in UTF-8 with a byte index." Not if you want the API to remain character-based." If you're ignoring surrogate pairs, your UTF-16 api isn't character-based. You're ALREADY arguing in favor of a non-character-based API for UTF-16. So why are you insisting that the UTF-8 one has to be character-based?

  • Anonymous
    July 28, 2011
    @Gabe: The fact that string.Split produces new strings for the parts was not my point. My point is that string.Split internally wastes twice as much memory as the original string needs. Regarding StringBuilder you're right, they've changed the implementation in .NET4 (until 3.5 StringBuilder maintained one buffer that was reallocated on demand). My fault.

  • Anonymous
    August 03, 2011
    Here's an interesting fact: SyncBlk for CLR objects already has a hashCode field. It's used by stock (reference-identity) Object.GetHashCode. This begs to be reused for String.GetHashCode. In fact, if CLR ever gets the notion of immutable objects, it would be nice to universally apply this optimization to GetHashCode for all of them.

  • Anonymous
    January 09, 2012
    Excellent article! But one more question: why is the string indexer read-only? What would happen if it were possible to change a string object's characters through the indexer? Like this: string mystr = "hello"; mystr[0] = 'w';