Decent Character Encoding in C++26

As a C++ developer, you learn to live with certain things over the years: memory leaks that haunt your dreams, template error messages longer than a George R.R. Martin novel, and the way the standard handles character encoding.

It’s not a rant to say that the conversion APIs were so colossally terrible that they weren’t just deprecated in C++17; C++26 removes them outright. Anyone who’s ever tried to wrangle the <codecvt> facets and std::wstring_convert knows the truth: buggy, strange, and simply not reliable.

The text_encoding class provides a mechanism for identifying character encodings and is part of C++26. Yes, you read that right: C++26, not C++23. While other features are already making their rounds, std::text_encoding arrives late to our party, but it brings everything your heart desires.

The new class is based on P1885R12 “Naming Text Encodings to Demystify Them” and finally promises what we’ve been missing for years: a sensible way to name and identify character encodings. Instead of juggling ad hoc strings for UTF-8, UTF-16, ISO-8859-1, and other encodings, we get a very clean interface. Keep in mind that it identifies encodings; it does not convert text between them.

#include <text_encoding>
#include <locale>   // std::locale::encoding()
#include <print>

int main() {
    // Literal encoding known at compile-time
    constexpr std::text_encoding literal_encoding = 
        std::text_encoding::literal();
    
    // Environment encoding only at runtime
    std::text_encoding env_encoding = 
        std::text_encoding::environment();
    
    // Locale encoding from the user's preferred locale
    // (std::locale("") throws std::runtime_error if that locale is unavailable)
    std::text_encoding locale_encoding = 
        std::locale("").encoding();
    
    std::println("Literal encoding: {}", literal_encoding.name());
    std::println("Environment encoding: {}", env_encoding.name());
    std::println("Locale encoding: {}", locale_encoding.name());
}

Each text_encoding object encapsulates a character encoding scheme, uniquely identified by an enumerator in text_encoding::id and a corresponding name. No more wild guessing, just clear identification.

The true brilliance of std::text_encoding lies in its connection to the IANA Character Sets Registry. Finally, there is a standardized source for encoding names and aliases. The class supports both registered and non-registered character encodings, covering virtually every conceivable scenario.

With 266 different encoding IDs ranging from ASCII (3) through UTF-8 (106) to exotic variants like JISEncoding (16), order is finally restored.
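To make the alias handling a bit more concrete: when names are compared, case, punctuation, and leading zeros are ignored, following the charset alias matching rule from Unicode UTS #22. The sketch below mimics that rule in plain C++17 so you can play with it on today's compilers. The helper names alias_key and same_encoding_name are made up for this article and are not part of the standard library.

```cpp
#include <cctype>
#include <string>
#include <string_view>

// Hypothetical helper (not a standard API): reduce an encoding name to a
// comparison key by dropping non-alphanumerics, lowercasing letters, and
// skipping zeros that do not follow a digit in the range 1-9.
std::string alias_key(std::string_view name) {
    std::string key;
    bool in_number = false;  // true once a non-zero digit has started a number
    for (unsigned char c : name) {
        if (std::isdigit(c)) {
            if (c == '0' && !in_number) {
                continue;  // leading zero: ignored
            }
            in_number = true;
            key.push_back(static_cast<char>(c));
        } else if (std::isalpha(c)) {
            in_number = false;  // a letter ends the current number
            key.push_back(static_cast<char>(std::tolower(c)));
        }
        // Anything else (hyphens, underscores, spaces) is skipped and does
        // not end the current number.
    }
    return key;
}

// Hypothetical helper: two names match when their keys are identical.
bool same_encoding_name(std::string_view a, std::string_view b) {
    return alias_key(a) == alias_key(b);
}
```

With this, same_encoding_name("utf_08", "UTF-8") is true, while "UTF-8" and "UTF-16" stay distinct. Treat it as an illustration of the matching idea, not as a replacement for the real class.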

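#include <text_encoding>
#include <print>

int main() {
    const std::text_encoding encoding = std::text_encoding::environment();

    // Check encoding
    if (encoding.mib() == std::text_encoding::id::UTF8) {
        // UTF-8 specific handling
    }

    // Iterate through aliases
    for (const char* alias : encoding.aliases()) {
        std::println("Alias: {}", alias);
    }

    // Environment check via the public API
    // (_M_is_environment() is a private libstdc++ detail, not for user code)
    if (std::text_encoding::environment_is<std::text_encoding::id::UTF8>()) {
        std::println("The environment encoding is UTF-8");
    }
}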

The API is designed to work both at compile-time (for literal encodings) and at runtime (for environment and locale encodings). This is particularly useful when working cross-platform.
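Here is a small sketch of the compile-time side that also builds on compilers without the new header. It uses the feature-test macro __cpp_lib_text_encoding to pick std::text_encoding::literal() when available, and otherwise falls back to the old trick of inspecting the bytes of a literal. The function names literal_is_utf8 and literal_is_utf8_by_bytes are invented for this example.

```cpp
#if __has_include(<version>)
#  include <version>  // feature-test macros
#endif
#ifdef __cpp_lib_text_encoding
#  include <text_encoding>
#endif

// U+00E9 is encoded as the two bytes 0xC3 0xA9 in UTF-8.
constexpr char e_acute[] = "\u00e9";

// Fallback: guess the literal encoding from the bytes of a known character.
constexpr bool literal_is_utf8_by_bytes() {
    return sizeof(e_acute) == 3
        && static_cast<unsigned char>(e_acute[0]) == 0xC3
        && static_cast<unsigned char>(e_acute[1]) == 0xA9;
}

// Prefer the standard answer when the library provides it.
constexpr bool literal_is_utf8() {
#ifdef __cpp_lib_text_encoding
    return std::text_encoding::literal().mib() == std::text_encoding::id::UTF8;
#else
    return literal_is_utf8_by_bytes();
#endif
}
```

Because both functions are constexpr, you can write static_assert(literal_is_utf8(), "this code assumes UTF-8 literals") and let the build fail early. The byte check assumes the compiler can represent U+00E9 in its literal encoding at all; if it cannot, the literal itself may be rejected.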

GCC has experimental support for C++26 with the -std=c++26 or -std=gnu++26 option. libstdc++ has shipped an implementation of std::text_encoding since GCC 14, though like the rest of its C++26 support it is still considered experimental.

For those wanting to dive in immediately: patience, grasshopper. C++26 isn’t finished, and the implementations are experimental. But like a fine wine, std::text_encoding will also improve with time.

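So how do you get this thing to compile? Because what’s the point of having shiny new APIs if you can’t even convince your compiler to acknowledge their existence?

The plain invocation is the place to start:

g++ -std=c++26 -Wall -Wextra example.cpp -o text_encoding_demo

If you want GNU extensions as well, switch the dialect. You don’t need -fconcepts; concepts have been part of the language since C++20.

g++ -std=gnu++26 -Wall -Wextra example.cpp -o text_encoding_demo

If the linker complains about undefined text_encoding symbols, try linking libstdc++’s experimental library, which some GCC versions require for parts of the class. Note that -fexperimental-library is a Clang/libc++ flag, not a GCC one.

g++ -std=gnu++26 -Wall -Wextra example.cpp -o text_encoding_demo -lstdc++exp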

Until C++26 actually ships and your compiler catches up, you might want to:

  1. Stick with encoding libraries like ICU
  2. Use std::locale
  3. Write your own encoding detection
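As a rough idea of what the do-it-yourself route looks like on POSIX systems, the sketch below asks the C library for the codeset of the user's locale via nl_langinfo(CODESET). The function name environment_codeset is made up for this example, and the "unknown" fallback for non-POSIX platforms is an assumption of this sketch, not standard behavior.

```cpp
#include <clocale>
#include <string>
#if defined(__unix__) || defined(__APPLE__)
#  include <langinfo.h>  // nl_langinfo, CODESET (POSIX)
#endif

// Hypothetical helper: name of the character encoding of the user's locale.
// Side effect: switches the process-wide LC_CTYPE category to the
// environment's locale.
std::string environment_codeset() {
#if defined(__unix__) || defined(__APPLE__)
    std::setlocale(LC_CTYPE, "");  // adopt LANG / LC_* from the environment
    const char* codeset = nl_langinfo(CODESET);
    if (codeset != nullptr && *codeset != '\0') {
        return codeset;
    }
#endif
    return "unknown";  // assumption: placeholder where POSIX is unavailable
}
```

On a typical Linux desktop this returns something like "UTF-8", but the exact spelling is up to the C library, which is precisely the kind of guesswork the IANA-based names in std::text_encoding are meant to end. Unlike std::text_encoding::environment(), this version also changes global locale state, so call it once at startup if at all.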

If you’re feeling masochistic, you can grab the latest GCC trunk build and compile it yourself. It’s only a few hours of your life. What could go wrong?

For all C++ developers who’ve been pulling their hair out over character encodings: it’s going to get better. Not immediately, but it’s going to get a lot better.