What is string interning?
“String interning” (sometimes called “string pooling”) is an optimization that consists of storing only one copy of a string, no matter how many times the program references it. While it’s undoubtedly of great benefit to our programs, we’ll see that it also comes with a few pitfalls.
Why this optimization?
Have you ever written the exact same string several times in the same program? I know it happens to me all the time. For example, I could write:
color = "black";
and, later in the code, I would write:
strcmp(color, "black");
As you can see, I wrote the string literal “black” twice, does this mean that the program contains two copies of “black”? More importantly, does it mean that the program loads two copies in the RAM?
Fortunately, the answer is no to the two questions. Thanks to the string interning optimization, only one copy of each literal is stored in the program (i.e., in the Flash memory), and so only one is loaded in the RAM.
Demonstration
Demonstrating that string interning stores only one copy of two identical string literals is straightforward: we just need to compare the pointers. If the pointers refer to the same address, it means that there is only one copy in the RAM.
Here is a simple program that you can run on your Arduino:
const char* s1 = "black";
const char* s2 = "black";
Serial.println((intptr_t)s1);
Serial.println((intptr_t)s2);
As you can see, I cast the pointer to intptr_t
to display its value, i.e. the
address it points to. intptr_t
is an integer whose size matches the size of a
pointer. On most machines, intptr_t
is identical to int
.
On my Arduino UNO, the program above displays:
s1 = 309
s2 = 309
As you can see, the two values are identical, proving that the optimization worked.
An optimization performed by the compiler
In C++, the compiler performs the string interning, so the deduplication can only happen at compile time. It means that if your program assembles to identical strings in RAM, there would be two copies. In other words, the optimization works only for the strings known at compile time: the string literals.
In other languages, like C# or Java, the string interning also happens at runtime, because the .NET Runtime or the Java Virtual Machine implements this optimization. In C++, we don’t have a runtime environment that could perform this optimization, so we only have it at compile time.
If for some odd reason, you want to disable this optimization, GCC (the compiler
behind Arduino) supports the flag -fno-merge-constants
. By the way, the same
optimization deduplicates floating point literals.
Defeating the optimization
Unfortunately, this optimization is very fragile: there are many ways to break
it. We saw that it works fine with const char*
:
// Number of copies: 1 in Flash, 1 in RAM
const char* s1 = "black";
const char* s2 = "black";
If instead, we change the type to char[]
, the program makes a copy of the
literal when it creates the variables:
// Number of copies: 1 in Flash, 3 in RAM
const char s1[] = "black";
const char s2[] = "black";
Similarly, if we change the type to String
, the constructor makes a copy of
the string:
// Number of copies: 1 in Flash, 3 in RAM
String s1 = "black";
String s2 = "black";
Flash String
Do you think the compiler deduplicates Flash strings (also known as Progmem strings) in the same way? Unfortunately, no.
Again, it’s very easy to verify, we just need to reuse the same snippet, changing only the types of the variables:
// Number of copies: 2 in Flash, 0 in RAM
const char s1[] PROGMEM = "black";
const char s2[] PROGMEM = "black";
Serial.println((intptr_t)s1);
Serial.println((intptr_t)s2);
When I run this program on my Arduino UNO, it displays:
s1 = 122
s2 = 116
As you can see, the addresses are different, proving that the optimization didn’t work.
Unlike our previous examples, these addresses refer to the Flash memory, not to
the RAM. Indeed, the attribute PROGMEM
instructs the compiler not to load the
string in the RAM but, instead, to point to the location in the “program
memory.”
You could be tempted to keep the PROGMEM
attribute and change the type to
char*
instead of char[]
, but if you do that, the compiler simply ignores the
attribute:
// WRONG: doesn't do what you think!
// Number of copies: 1 in Flash, 2 in RAM
const char s1* const PROGMEM = "black";
const char s2* const PROGMEM = "black";
Now, what about the macro F()
that we see in all Arduino examples, it’s good,
right? I’m afraid it’s not good at all… I would even say that it’s worse
because it gives you a false sense of rightfulness. Indeed, the syntax is so
close to the classic string literal that we believe it behaves the same way.
Again, it’s very easy to demonstrate:
// Number of copies: 2 in Flash, 0 in RAM
const __FlashStringHelper* s1 = F("black");
const __FlashStringHelper* s2 = F("black");
Sadly, many people believe they are following best practices by calling F()
everywhere, but in reality, they are bloating their programs with useless copies
of the same string.
This limitation is one of the many reasons I avoid Flash strings in my programs; I only use them for long, unique, and rarely used strings, like log messages.
Comparing strings
Now that we’ve seen the benefits of string interning and all the ways we can
defeat the optimization, let’s talk about one major pitfall: the comparison of
strings with the operator ==
.
If you use the operator ==
on pointer types, it compares the addresses, not
the content. As we saw, two identical strings can have the same address when
string interning works, or different addresses when string interning doesn’t
apply.
Let see an example:
const char* color = "black";
if (color == "black") { // true because of string interning
// ...
}
All is well, but it stops working as soon as one of the strings is not included in the optimization, like in the example:
const char color[] = "black";
if (color == "black") { // false because color holds a copy
// ...
}
That is why it is crucial never to use the operator ==
to compare strings in
char*
. Instead, we must use strcmp()
, a function from the C standard library
that compares characters one by one and returns 0
if the two strings match
(it also returns -1
or +1
depending on the lexicographical order, but it’s
not our concern here).
const char color[] = "black";
if (strcmp(color, "black") == 0) { // true because it compare each character
// ...
}
Admittedly, this program is less readable, less expressive, and more error-prone, but that’s our legacy from the C language.
An alternative is to use the String class. Indeed, this class overloads the
operator ==
, so you can safely compare Strings:
String color = "black";
if (color == "black") { // true because it calls String::operator==()
// ..
}
However, that’s not a reason to use this class everywhere, because it comes with its own baggage, namely heap allocation and string duplication, but that’s a topic for another article.
Conclusion
String interning is a powerful optimization, but it can be easily defeated. In
particular, beware of the macro F()
that pretends to work like a string
literal but bloats your program with multiple copies of the same string.
I thought you might want to run the same tests on your microcontroller, so I uploaded to GitHub a sample program that shows the addresses of strings of various types. You can find it at github.com/bblanchon/cpp4arduino.
That’s all I have to say about string interning. I hope you learned something today; if so, share with your friends. If you want more articles like this, you can subscribe to the mailing list. See you soon!