Validating UTF-8

Validation of UTF-8 is extremely important as your program may contain many subtle bugs and security issues if it accepts invalid UTF-8. This is particularly true you pass the text to a library which depends on text inputs being valid.

I’ve seen a number of posts online asking how to validate UTF-8, so I thought I’d provide a tutorial and sample code.

UTF-8 Overview

UTF-8 stores Unicode “Code Points” in variable length sequences of bytes. The position if the most significant zero in the first byte indicates the number of bytes required to store the code point. The following table illustrates the basic scheme.

U+0000 to U+007F (0-127)  Require a 1 byte sequence
U+0080 to U+07FF          Require a 2 byte sequence
U+0800 to U+FFFF          Require a 3 byte sequence
U+10000 to U+1FFFFF       Require a 4 byte sequence

1 byte: 0aaaaaaa
2 byte: 110aaabb    10bbbbbb
3 byte: 1110aaaa    10aaaabb    10bbbbbb
4 byte: 11110aaa    10aabbbb    10bbbbcc    10cccccc

Here, the letters ‘a’, ‘b’ and ‘c’ represent bytes of the code point data stored withing the UTF-8 sequence. The 1s and 0s are fixed binary patterns that should appear in sequences of the given length.

The continuation bytes always have their most significant zero in the second most significant position. This provides the helpful property that you search forward and backward through UTF-8 text and find the start of each multi-byte character. This is not true in some other multi-byte encodings.

The scheme displayed here is the one defined by RFC-3629. The original UTF-8 encoding scheme described by RFC-2279 allowed up to six byte long sequences. Some libraries, such as libiconv, still accept the old form for compatibility. This is not recommended in new code.

As an example, lets try to encode a Unicode code point into a UTF-8 sequence. Take the Unicode smiley character U+263A ☺. If we examine the ranges in the table above, it will require a 3 byte sequence.

The binary value of 0x263A which is 0010 0110 0011 1010. We need to stuff this value into the three byte encoding sequence, which gives us: 11100000 10100110 10111010, or 0xE0 0xA6 0xBA.

Notice that we have stored no useful data in the first byte. We needed at least 12 bits to store our code point, but the two byte sequence only has 11 bits of code point storage available.

Validation Rules

Now that we’re all UTF-8 experts, lets take a look at what’s required to validate UTF-8. From examining RFC-3629, we get the following rules:

  • Sequences are limited to four bytes long.
  • You can only start a sequence with a byte that has its high bits set to one of the bit patterns: 0, 110, 1110 or 11110.
  • You must have the correct number of continuation bytes for a given lead byte.
  • Continuation bytes must have their high bits set to 10.
  • The bytes 0xC0, 0xC1, and 0xF5 to 0xFF should never appear (due to the length limitation).
  • There is only one valid way to store a code point. That is, you should not store a code point in an “overlong” sequence.
  • The range of valid code points is U+0000 to U+10FFFF (valid range of Unicode).
  • Code points in the range U+D800 to U+DFFF are illegal as they are reserved for UTF-16 surrogate pairs

One key thing to realize is that we don’t have to bother checking all of these rules for each UTF-8 sequence. A two byte sequence cannot hold the invalid values in the range U+D800 to U+DFFF, nor exceed the limit of U+10FFFF. An efficient UTF-8 validation function should do as little checking as possible.

Example code

Here is the C++ code I have been using, altered somewhat for clarity. It’s quite possible this contains an error due to the amount of bit masking going on, so take it with a grain of salt and test it before usage.

bool isValidCodePoint(uint32_t cp)
{
    if (cp >= 0xD800 && cp <= 0xDFFF) // Surrogates
    {
        return false;
    }
    return (cp <= 0x10FFFF); // Maximum value
}

bool validateUtf8(const char *buffer, size_t bufferLen)
{
    const unsigned char *start = (const unsigned char*)buffer;
    const unsigned char *end = start + bufferLen;
    const unsigned char *iter = start;
    int32_t len;
    uint32_t val;

    // Loop through UTF-8, validating each code point
    while (iter < end)
    {
        if ((*iter) < 128) // Width of 1
        {
            iter++;
            continue;
        }
        else if (((*iter) & 0xE0) == 0xC0) // Width of 2
        {
            if (iter + 2 >= end)
            {
                break;
            }

            // Check for overlong form (8th or above data bit must be set)
            if ((iter[0] & 0x1E) == 0)
            {
                break;
            }

            // Check continuation byte
            if (((iter[1] & 0xC0) != 0x80))
            {
                break;
            }

            // Don't have to check code point validity. Can't have a large
            // enough value to be one of the invalid ones.
            iter += 2;
            continue;
        }
        else
        {
            // Width of 3 or 4 is long enough we have to check the validity
            // of the code point

            if (((*iter) & 0xF0) == 0xE0) // Width of 3
            {
                if (iter + 3 >= end)
                {
                    break;
                }

                // Check continuation bytes
                if ((iter[1] & 0xC0) != 0x80 ||
                    (iter[2] & 0xC0) != 0x80)
                {
                    break;
                }

                // Convert to code point
                val = ((uint32_t) (iter[0] & 0x0F) << 12) |
                      ((uint32_t) (iter[1] & 0x3F) << 6) |
                      ((uint32_t) (iter[2] & 0x3F));

                // Check for overlong form (11th or above data bit must be set)
                if (val < (1 << 11))
                {
                    break;
                }

                len = 3;
            }
            else if (((*iter) & 0xF8) == 0xF0) // Width of 4
            {
                if (iter + 4 >= end)
                {
                    break;
                }

                // Check continuation bytes
                if ((iter[1] & 0xC0) != 0x80 ||
                    (iter[2] & 0xC0) != 0x80 ||
                    (iter[3] & 0xC0) != 0x80)
                {
                    break;
                }

                // Convert to code point
                val = ((uint32_t) (iter[0] & 0x07) << 18) |
                      ((uint32_t) (iter[1] & 0x3F) << 12) |
                      ((uint32_t) (iter[2] & 0x3F) << 6) |
                      ((uint32_t) (iter[3] & 0x3F));

                // Check for overlong form (16th or above data bit must be set)
                if (val < (1 << 11))
                {
                    break;
                }

                len = 4;
            }
            else // Invalid length, or not start byte
            {
                break;
            }

            // Check code point legality.
            if (!isValidCodePoint(val))
            {
                break;
            }

            // Increment
            iter += len;
        }
    }

    return (iter - start) == bufferLen;
}

Improving on the Example

The example code above does not have any optimizations beyond avoiding unnecessary checking. It’s possibly to considerably improve the performance. Here are two things you can look at:

Length lookup table

You can build a lookup table to get the length of the lead byte without doing so much masking. The lead byte can only have 256 possible values, so you can simply make a table of the length for each possible value.

Vectorization

Most UTF-8 text is “plain ASCII”. That is, all the bytes are single byte sequences with a zero in the high bit, except perhaps for a byte-order-mark at the start of the text. You can optimize for this case and validate multiple bytes at a time. Vectorization gets quite complex and architecture dependent, so I’ll leave that to a later post.

Advertisements

One Response to “Validating UTF-8”

  1. Chris Says:

    Do you by chance have an example of using XSAVE and XRSTOR in Intel x86_64 assembly or C++?

Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out /  Change )

Google+ photo

You are commenting using your Google+ account. Log Out /  Change )

Twitter picture

You are commenting using your Twitter account. Log Out /  Change )

Facebook photo

You are commenting using your Facebook account. Log Out /  Change )

w

Connecting to %s


%d bloggers like this: