[SOLVED] Java String.intern() use HashTable instead of ConcurrentHashMap

Issue

I am research String.intern() and this method have a performance penalty. I’ve compared String.intern() with ConcurrentHashMap.putIfAbsent(s,s) with Microbenchmark. Used Java1.8.0_212, Ubuntu 18.04.2 LTS

@Param({"1", "100", "10000", "1000000"})
private int size;

private StringIntern stringIntern;
private ConcurrentHashMapIntern concurrentHashMapIntern;

@Setup
public void setup(){
    stringIntern = new StringIntern();
    concurrentHashMapIntern = new ConcurrentHashMapIntern();
}
public static class StringIntern{
    public String intern(String s){
        return s.intern();
    }
}
public static class ConcurrentHashMapIntern{
    private final Map<String, String> map;

    public ConcurrentHashMapIntern(){
        map= new ConcurrentHashMap<>();
    }
    public String intern(String s){
        String existString = map.putIfAbsent(s, s);
        return (existString == null) ? s : existString;
    }
}

@Benchmark
public void intern(Blackhole blackhole){
    for(int count =0; count<size; count ++){
        blackhole.consume(stringIntern.intern("Example "+count));
    }
}
@Benchmark
public void concurrentHashMapIntern(Blackhole blackhole){
    for(int count =0; count<size; count++){
        blackhole.consume(concurrentHashMapIntern.intern("Example " +count));
    }
}

Result as expected. ConcurrentHashMap faster than String.intern() when search string.

Benchmark                             (size)  Mode  Cnt        Score        Error  Units
MyBenchmark.concurrentHashMapIntern        1  avgt    5        0.056 ±      0.007  us/op
MyBenchmark.concurrentHashMapIntern      100  avgt    5        6.094 ±      2.359  us/op
MyBenchmark.concurrentHashMapIntern    10000  avgt    5      787.802 ±    264.179  us/op
MyBenchmark.concurrentHashMapIntern  1000000  avgt    5   136504.010 ±  17872.866  us/op
MyBenchmark.intern                         1  avgt    5        0.129 ±      0.007  us/op
MyBenchmark.intern                       100  avgt    5       13.700 ±      2.404  us/op
MyBenchmark.intern                     10000  avgt    5     1618.514 ±    460.563  us/op
MyBenchmark.intern                   1000000  avgt    5  1027915.854 ± 638910.023  us/op

String.intern() slower than ConcurrentHashMap because String.intern() is native HashTable implementation. And then, read javadoc about HashTable, this documantation says:

If a thread-safe implementation is not needed, it is recommended to use HashMap in place of Hashtable. If a thread-safe highly-concurrent implementation is desired, then it is recommended to use ConcurrentHashMap in place of Hashtable.

This is very confusing situation. It recommend ConcurrentHashMap, but it using HashTable although performance penalty. Does anyone have any idea about why used native HashTable implemantation instance of ConcurrentHashMap ?

Solution

There are a number of things going on here:

  1. Your benchmarks have very large error bars. The repeat counts are probably too small. This makes the results questionable.

  2. It doesn’t look like your benchmarks are resetting the "interned string" caches after each run1. So that means that the caches are growing, and each repetition will be starting with different conditions. This may explain the error bars …

  3. Your ConcurrentHashMap is not functionally equivalent to String::intern. The latter uses a native equivalent to Reference objects to ensure that interned strings can be garbage collected. Your ConcurrentHashMap implementation doesn’t. Why does this matter?

    • Your ConcurrentHashMap is a massive memory leak.
    • A reference mechanism is expensive … at GC time. (Though possibly less expensive2 than a memory leak.)

String.intern() slower than ConcurrentHashMap because String.intern() is native HashTable implementation.

No. The real reason is that the native implementation is doing things differently:

  • The internal representations are different. The native (intern) string pool uses a custom hash table implemented in native code.
  • It has to handle references which impacts on GC performance.
  • There are also behind-the-scenes interactions with string deduping and other things.

Note that these things vary considerably across different Java versions.

This is very confusing situation. It recommend ConcurrentHashMap, but it using HashTable although performance penalty.

Now you are talking about a different scenario, that is not relevant to what you are doing.

  • Note that String::intern doesn’t use either HashTable or HashMap; see above.

  • The quote that you found is about how to get good concurrent performance from a hash table. Your benchmark is (AFAIK) single threaded. For a serial use-use case, HashMap will give better performance than the others.

Does anyone have any idea about why used native HashTable implementation instance of ConcurrentHashMap ?

It doesn’t use a hash table; see above. There are a number of reason that it doesn’t HashTable or HashMap or ConcurrentHashMap:

  • It is that it is paying more attention to memory utilization. All of the Java hash table implementations are memory hungry and that makes them unsuitable for general purpose string interning.
  • The memory and CPU overheads of using Reference classes are significant.
  • Computing a hash of a newly created string of length N is O(N) which will be significant when interning strings that may be hundreds / thousands of characters long.

Finally, be carefully that you are not focusing on the wrong problem here. If you are trying to optimize interning because it is a bottleneck in your application, the other strategy is to not intern at all. In practice, it rarely saved memory (especially compared with G1GC’s string de-duping) and rarely improves string handling performance.


In summary:

  • You are comparing apples and oranges. Your map-base implementation is not equivalent to native interning.
  • String::intern is not optimized solely (even primarily) for speed.
  • By focusing on speed, you are ignoring memory utilization … and the secondary effect of memory utilization on speed.
  • Consider the potential optimization of not interning at all.

1 – And in the native intern case, I don’t think that is possible.
2 – A Java memory leak in the regular heap impacts on long-term GC performance because the retained objects need to be repeatedly marked and copied by the GC. There may be secondary effects too.

Answered By – Stephen C

Answer Checked By – Mildred Charles (BugsFixing Admin)

Leave a Reply

Your email address will not be published. Required fields are marked *