Some code I wrote a while ago came back to haunt me. You may know the feeling.
The use case was simple: A tool chain, processing some files and spitting out others (host service descriptions as input and configuration and proxy code as output – is doesn’t really matter though). At one stage I needed to process a file line by line. And, as usually, my code was nice, clean, tested, maintainable, performing quite well, robust, and unbreakable. In one word: Just perfect. Well, until someone broke it. All he did was feeding a “slightly” larger code fragment — about 5MB file size for a corresponding text file — and he encountered OutOfMemoryExceptions. Sik!
So, what on earth could have happened here? Well, one of the least things I would have expected: string.Split and StringBuilder – despite the fact that I of course had employed them correctly, concise with the general recommendations!
The Cost of String.Split
It turned out that string.Split trades memory for performance…. Take a string of a certain length, say 100, and split into, say, 10 parts. The result you’ll get will consist of the array (4 bytes per entry, 10 entries in our example), and the content fragments. This is roughly the same memory consumption for the result as for the input string (see also on MSDN).
Note: I‘m leaving out the overhead for internal management of strings and memory allocation in general.
That’s the memory consumption that is obvious and cannot be avoided (OK, I’ll get back to that one!). But what does String.Split add to that cost, if only temporarily?
String.Split internally creates an array of integers, with the length of the string as size. In other words for an input string of 100 characters it creates an int[100], using up 400 bytes! This array is used to remember the index of the found separator, by writing the found index at the next free position. That implies that only the first slots are actually used, the same number as the number of parts the string is split into. In our example 90 slots are simply wasted. This is a usage ratio of 10%! It would only need this array completely in the extreme case of a string only consisting of separators.
In case of StringSplitOptions.RemoveEmptyEntries and at least one empty entry, another temporary array for the result is created, add another array the size of the number of parts. Seems things can only get worse.
Once more: For a any given string as input String.Split returns something more than that as result – and on top of that it needs twice as much memory temporarily!
OK, so 100 characters, 200 bytes, produce 240 bytes result (40 bytes for the array) and 400 bytes temporarily on top. Who cares? But wait, remember the 5MB text file? The resulting string needed 10MB and the result ~11MB. Additionally ~20MB temporarily, of which only 4% – less than 1MB – where actually used. And all this is subject to the Large Object Heap. Well? Who cares now?
Mending…
Granted, that’s an extreme case. But how often did I use String.Split in other cases? How often within loops? How big where the input strings? How much unnecessary overhead and memory pressure did that introduce?
What if I could get rid of the temporary overhead? Great. What if I could even get rid of the result? Wait… what?
Matter of fact, quite often the result from String.Split is processed once, entry per entry, in a loop. And that’s what enumerators are for. So, if I had a respective implementation, instead of writing
string[] lines = text.Split(Environment.NewLine[0]);
foreach(string line in lines)
…
I could call a method that returns an enumerator:
var lines= text.SplitString(Environment.NewLine[0]);
foreach(string line in lines)
…
And with the lazy evaluation of enumerators I would get one part at a time. And once I’m done processing it, I drop it before going on to the next part. Granted, I still need to touch every part, but no longer all at once. When memory get’s tight, chances are all those parts are in gen 0 when the GC hits.
Meaning I would do no longer consume more than three times the memory of the input string at once. A string of 100 characters/200 bytes as input would not acquire additional 600 bytes. A 10MB string would not need additional 31MB (and only using 12MB of that anyway, subject to the LOH).
Well, I had a minute to spare… . I wrote a helper class for this post, tested to behave exactly like String.Split, so it should be possible to just replace every call to Split with the SplitString extension method, like above. You can find the code here.
That’s all for now folks,
AJ.NET
Hi AJ,
great post. I had run into the issue when trying to parse logfiles multi-threaded. I ultimately didn’t need your code, but the pointer was correct. The split operation used up all the available memory and with multiple threads running, the pressure on the GC was pretty high and the system came to a screeching halt. After retooling to use the IndexOf Method, the problem was gone.
Comment by ... the rest is just coding — June 27, 2012 @ 12:31 am