“String interning” (sometimes called “string pooling”) is an optimization that consists of storing only one copy of a string, no matter how many times the program references it. While it’s undoubtedly of great benefit to our programs, we’ll see that it also comes a few pitfalls.

Why this optimization?

Did it ever happen to you to write the same string several times in the same program? I know it occurs to me all the time. For example, I could write:

color = "black";

and, later in the code, I would write:

strcmp(color, "black");

As you can see, I wrote the string literal “black” twice, does this mean that the program contains two copies of “black”? More importantly, does it mean that the program loads two copies in the RAM?

Fortunately, the answer is no to the two questions. Thanks to the string interning optimization, only one copy of each literal is stored in the program (i.e., in the Flash memory), and so only one is loaded in the RAM.

Demonstration

Demonstrating that string interning stores only one copy of two identical string literals is straightforward: we just need to compare the pointers. If the pointers refer to the same address, it means that there is only one copy in the RAM.

Here is a simple program that you can run on your Arduino:

const char* s1 = "black";
const char* s2 = "black";
Serial.println((intptr_t)s1);
Serial.println((intptr_t)s2);

As you can see, I cast the pointer to intptr_t to display its value, i.e. the address it points to. intptr_t is an integer whose size matches the size of a pointer. On most machines, intptr_t is identical to int.

On my Arduino UNO, the program above displays:

s1 = 309
s2 = 309

As you can see, the two values are identical, proving that the optimization worked.

The two pointers refering to the same string in the RAM

An optimization performed by the compiler

In C++, the compiler performs the string interning, so the deduplication can only happen at compile time. It means that if your program assembles to identical strings in RAM, there would be two copies. In other words, the optimization works only for the strings known at compile time: the string literals.

In other languages, like C# or Java, the string interning also happens at runtime, because the .NET Runtime or the Java Virtual Machine implements this optimization. In C++, we don’t have a runtime environment that could perform this optimization, so we only have it at compile time.

If for some odd reason, you want to disable this optimization, GCC (the compiler behind Arduino) supports the flag -fno-merge-constants. By the way, the same optimization deduplicates floating point literals.

Defeating the optimization

Unfortunately, this optimization is very fragile: there are many ways to break it. We saw that it works fine with const char*:

// Number of copies: 1 in Flash, 1 in RAM
const char* s1 = "black";
const char* s2 = "black";

If instead, we change the type to char[], the program makes a copy of the literal when it creates the variables:

// Number of copies: 1 in Flash, 3 in RAM
const char s1[] = "black";
const char s2[] = "black";

Similarly, if we change the type to String, the constructor makes a copy of the string:

// Number of copies: 1 in Flash, 3 in RAM
String s1 = "black";
String s2 = "black";

Flash String

Do you think the compiler deduplicates Flash strings (also known as Progmem strings) in the same way? Unfortunately, no.

Again, it’s very easy to verify, we just need to reuse the same snippet, changing only the types of the variables:

// Number of copies: 2 in Flash, 0 in RAM
const char s1[] PROGMEM = "black";
const char s2[] PROGMEM = "black";
Serial.println((intptr_t)s1);
Serial.println((intptr_t)s2);

When I run this program on my Arduino UNO, it displays:

s1 = 122
s2 = 116

As you can see, the addresses are different, proving that the optimization didn’t work.

Unlike our previous examples, these addresses refer to the Flash memory, not to the RAM. Indeed, the attribute PROGMEM instructs the compiler not to load the string in the RAM but, instead, to point to the location in the “program memory.”

You could be tempted to keep the PROGMEM attribute and change the type to char* instead of char[], but if you do that, the compiler simply ignores the attribute:

// WRONG: doesn't do what you think!
// Number of copies: 1 in Flash, 2 in RAM
const char s1* const PROGMEM = "black";
const char s2* const PROGMEM = "black";

Now, what about the macro F() that we see in all Arduino examples, it’s good, right? I’m afraid it’s not good at all… I would even say that it’s worse because it gives you a false sense of rightfulness. Indeed, the syntax is so close to the classic string literal that we believe it behaves the same way.

Again, it’s very easy to demonstrate:

// Number of copies: 2 in Flash, 0 in RAM
const __FlashStringHelper* s1 = F("black");
const __FlashStringHelper* s2 = F("black");

Sadly, many people believe they are following best practices by calling F() everywhere, but in reality, they are bloating their programs with useless copies of the same string.

This limitation is one of the many reasons I avoid Flash strings in my programs; I only use them for long, unique, and rarely used strings, like log messages.

Comparing strings

Now that we’ve seen the benefits of string interning and all the ways we can defeat the optimization, let’s talk about one major pitfall: the comparison of strings with the operator ==.

If you use the operator == on pointer types, it compares the addresses, not the content. As we saw, two identical strings can have the same address when string interning works, or different addresses when string interning doesn’t apply.

Let see an example:

const char* color = "black";
if (color == "black") { // true because of string interning
  // ...
}

All is well, but it stops working as soon as one of the strings is not included in the optimization, like in the example:

const char color[] = "black";
if (color  == "black") { // false because color holds a copy
  // ...
}

That is why it is crucial never to use the operator == to compare strings in char*. Instead, we must use strcmp(), a function from the C standard library that compares characters one by one and returns 0 if the two strings match (it also returns -1 or +1 depending on the lexicographical order, but it’s not our concern here).

const char color[] = "black";
if (strcmp(color, "black") == 0) { // true because it compare each character
  // ...
}

Admittedly, this program is less readable, less expressive, and more error-prone, but that’s our legacy from the C language.

An alternative is to use the String class. Indeed, this class overloads the operator ==, so you can safely compare Strings:

String color = "black";
if (color  == "black") { // true because it calls String::operator==()
  // ..
}

However, that’s not a reason to use this class everywhere, because it comes with its own baggage, namely heap allocation and string duplication, but that’s a topic for another article.

Conclusion

String interning is a powerful optimization, but it can be easily defeated. In particular, beware of the macro F() that pretends to work like a string literal but bloats your program with multiple copies of the same string.

I thought you might want to run the same tests on your microcontroller, so I uploaded to GitHub a sample program that shows the addresses of strings of various types. You can find it at github.com/bblanchon/cpp4arduino.

That’s all I have to say about string interning. I hope you learned something today; if so, share with your friends. If you want more articles like this, you can subscribe to the mailing list. See you soon!