Wednesday, August 25, 2010

Tokenizing Strings

In this post I will cover some String tokenizing issues. I am sure many of you have tokenized Strings many times, but I would still like to point out some issues I recently discovered.

First of all, I was a bit surprised to discover that using the StringTokenizer class is not recommended, and it turns out that this recommendation goes back to the days of JDK 1.5: the documentation for StringTokenizer says that the class is kept for compatibility reasons. It recommends using the "split" method of "String" or the java.util.regex package instead.

However, if you take a look at the source code of String.split (as can be found for example at this link), you will notice that the method compiles the regular expression passed as an argument every time it is called.

It might therefore be better to use the "split" method of the Pattern class, and in the case of "fixed" regular expressions that are used over and over again, to keep the compiled Pattern in a static variable for tokenizing. The following example explains this paragraph:
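
A minimal sketch of the idea, assuming a hypothetical class named CommaTokenizer with a single tokenize method (both names are made up for illustration):

```java
import java.util.regex.Pattern;

// Hypothetical class name, used only for illustration.
final class CommaTokenizer {

    // Compiled once, when the class is initialized, and reused on every call.
    private static final Pattern COMMA = Pattern.compile(",");

    private CommaTokenizer() {
        // Utility class, not meant to be instantiated.
    }

    // Returns the same tokens as line.split(","), but reuses the
    // precompiled pattern instead of passing the regex as a String each time.
    static String[] tokenize(String line) {
        return COMMA.split(line);
    }
}
```

Calling CommaTokenizer.tokenize("a,b,c") returns the three tokens "a", "b" and "c".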


My experiments showed that for tokenizing 1000 Strings with a simple regular expression of ",", using Pattern.split instead of String.split saved only about 9-10 milliseconds (tests were run on an Intel i5, 4 GB RAM, Windows 7 64-bit). Still, I am sure that more complex expressions would save more time.

At this point, some may raise a question about thread safety: the static variable potentially holds shared state, and the code of the method might be used by many threads.
If we take a look at the source of Pattern, we can see that all the variables used inside the split method are local variables or parameters passed to the method, and not fields. The split method can therefore be called from many threads concurrently. I have also verified this against other sources on the internet; the Pattern documentation itself states that instances of the class are immutable and safe for use by multiple concurrent threads (instances of Matcher are not).
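
To illustrate the point, here is a small sketch in which several threads share one static Pattern. The class name SharedPatternDemo, the method name and the pool size of 4 are assumptions made for this example, not something taken from the JDK:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.regex.Pattern;

// Hypothetical demo class, used only for illustration.
final class SharedPatternDemo {

    // One Pattern instance shared by every thread.
    private static final Pattern COMMA = Pattern.compile(",");

    private SharedPatternDemo() {
    }

    // Splits the same line in `tasks` concurrent tasks and returns
    // the number of tokens each task saw.
    static List<Integer> countTokensConcurrently(String line, int tasks) {
        ExecutorService pool = Executors.newFixedThreadPool(4);
        try {
            List<Future<Integer>> futures = new ArrayList<>();
            for (int i = 0; i < tasks; i++) {
                // No locking is needed around the shared Pattern.
                futures.add(pool.submit(() -> COMMA.split(line).length));
            }
            List<Integer> counts = new ArrayList<>();
            for (Future<Integer> future : futures) {
                counts.add(future.get());
            }
            return counts;
        } catch (InterruptedException | ExecutionException e) {
            throw new IllegalStateException("Tokenizing task failed", e);
        } finally {
            pool.shutdown();
        }
    }
}
```

Every task should report the same token count for the same input. A run like this does not prove thread safety on its own; it only shows the usage pattern that the documentation allows.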

To conclude this post, I would like to suggest a way to use patterns in your code.
In the above example I presented a case where a comma pattern is used. The comma pattern is likely to be popular and might be needed by many classes.
I suggest that a PatternConstants class provide pattern constants that can be used across the code.
An example of such a class might be:
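
A minimal sketch follows. Only the comma pattern comes from this post; the WHITESPACE constant and the exact layout of the class are assumptions added for illustration:

```java
import java.util.regex.Pattern;

// Holder for precompiled patterns that are shared across the code base.
final class PatternConstants {

    // The comma pattern discussed in this post.
    public static final Pattern COMMA = Pattern.compile(",");

    // Hypothetical second constant: one or more whitespace characters.
    public static final Pattern WHITESPACE = Pattern.compile("\\s+");

    private PatternConstants() {
        // Constants holder, not meant to be instantiated.
    }
}
```

Client code would then call, for example, PatternConstants.COMMA.split(line) instead of line.split(","). To share the class across packages, declare it public in its own source file.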