Archive for the ‘Uncategorized’ Category

Detecting Advanced Vector Extensions (AVX) support in Visual Studio

November 7, 2011

Every so often Intel or AMD come out with new instructions for their x86 and x64 instruction sets. Many of you may have heard of the Streaming SMID Extensions (SSE) instructions. The new Advanced Vector Extensions (AVX) instructions are similar to the older SSE instructions, but use new 32 byte (yes, byte) registers.

When building with a GCC like compiler, one usually isolates AVX code like so:

#ifdef __AVX__
// AVX intrinsics or inline assembly goes here
#else
// Non AVX version with same functionality (or just error)
#endif

When you compile your code, you tell GCC the type of chip you want to target, or have it build with instructions suitable for the local machine. Inline assembly or intrinsics using AVX instructions will fail to compile if the build is not set up to allow them.

Visual Studio has a different expectation. Generally, one builds code that will run on all modern x86 or x64 chips and the program itself asks the CPU what instructions are supported. The program will branch at runtime depending on what CPU features are available. This post will give an overview of how to do this checking.

Prerequisites

AVX instructions are only supported on (at time of writing) the latest and greatest OS, compiler and CPU versions. You will need the following:

  • Visual Studio 2010 SP1 or later
  • Windows 7 SP1 or later
  • A CPU that supports AVX instructions (check documentation).

The new AVX instructions use new YMM registers which the OS must save on a context switch. If you try to use them on a Windows version prior to Windows 7 SP1 they will be considered illegal instructions, even if the CPU has AVX support.

Performing the check

You query CPU capabilities on x86 and x64 chips using the CPUID instruction. Visual studio has an intrinsic called __cpuid that wraps this instruction. You can check if your chip supports AVX instructions using the __cpuid intrinsic, but that’s not enough. You also need to check that the OS is saving the new registers on context switch. This requires using the XGETBV instruction. A new intrinsic called _xgetbv is available for this in Visual Studio 2010 SP1.

Note that MSDN currently has no documentation of _xgetbv and other intrinsics added in Visual Studio 2010 SP1.

#include <stdio.h>
#include <intrin.h>

int main()
{
    bool avxSupported = false;

    // If Visual Studio 2010 SP1 or later
#if (_MSC_FULL_VER >= 160040219)
    // Checking for AVX requires 3 things:
    // 1) CPUID indicates that the OS uses XSAVE and XRSTORE
    //     instructions (allowing saving YMM registers on context
    //     switch)
    // 2) CPUID indicates support for AVX
    // 3) XGETBV indicates the AVX registers will be saved and
    //     restored on context switch
    //
    // Note that XGETBV is only available on 686 or later CPUs, so
    // the instruction needs to be conditionally run.
    int cpuInfo[4];
    __cpuid(cpuInfo, 1);

    bool osUsesXSAVE_XRSTORE = cpuInfo[2] & (1 << 27) || false;
    bool cpuAVXSuport = cpuInfo[2] & (1 << 28) || false;

    if (osUsesXSAVE_XRSTORE && cpuAVXSuport)
    {
        // Check if the OS will save the YMM registers
        unsigned long long xcrFeatureMask = _xgetbv(_XCR_XFEATURE_ENABLED_MASK);
        avxSupported = (xcrFeatureMask & 0x6) || false;
    }
#endif

    if (avxSupported)
    {
        printf("AVX is supported\n");
    }
    else
    {
        printf("AVX is NOT supported\n");
    }

    return 0;
}

Links

Advertisements

Validating UTF-8

November 3, 2011

Validation of UTF-8 is extremely important as your program may contain many subtle bugs and security issues if it accepts invalid UTF-8. This is particularly true you pass the text to a library which depends on text inputs being valid.

I’ve seen a number of posts online asking how to validate UTF-8, so I thought I’d provide a tutorial and sample code.

UTF-8 Overview

UTF-8 stores Unicode “Code Points” in variable length sequences of bytes. The position if the most significant zero in the first byte indicates the number of bytes required to store the code point. The following table illustrates the basic scheme.

U+0000 to U+007F (0-127)  Require a 1 byte sequence
U+0080 to U+07FF          Require a 2 byte sequence
U+0800 to U+FFFF          Require a 3 byte sequence
U+10000 to U+1FFFFF       Require a 4 byte sequence

1 byte: 0aaaaaaa
2 byte: 110aaabb    10bbbbbb
3 byte: 1110aaaa    10aaaabb    10bbbbbb
4 byte: 11110aaa    10aabbbb    10bbbbcc    10cccccc

Here, the letters ‘a’, ‘b’ and ‘c’ represent bytes of the code point data stored withing the UTF-8 sequence. The 1s and 0s are fixed binary patterns that should appear in sequences of the given length.

The continuation bytes always have their most significant zero in the second most significant position. This provides the helpful property that you search forward and backward through UTF-8 text and find the start of each multi-byte character. This is not true in some other multi-byte encodings.

The scheme displayed here is the one defined by RFC-3629. The original UTF-8 encoding scheme described by RFC-2279 allowed up to six byte long sequences. Some libraries, such as libiconv, still accept the old form for compatibility. This is not recommended in new code.

As an example, lets try to encode a Unicode code point into a UTF-8 sequence. Take the Unicode smiley character U+263A ☺. If we examine the ranges in the table above, it will require a 3 byte sequence.

The binary value of 0x263A which is 0010 0110 0011 1010. We need to stuff this value into the three byte encoding sequence, which gives us: 11100000 10100110 10111010, or 0xE0 0xA6 0xBA.

Notice that we have stored no useful data in the first byte. We needed at least 12 bits to store our code point, but the two byte sequence only has 11 bits of code point storage available.

Validation Rules

Now that we’re all UTF-8 experts, lets take a look at what’s required to validate UTF-8. From examining RFC-3629, we get the following rules:

  • Sequences are limited to four bytes long.
  • You can only start a sequence with a byte that has its high bits set to one of the bit patterns: 0, 110, 1110 or 11110.
  • You must have the correct number of continuation bytes for a given lead byte.
  • Continuation bytes must have their high bits set to 10.
  • The bytes 0xC0, 0xC1, and 0xF5 to 0xFF should never appear (due to the length limitation).
  • There is only one valid way to store a code point. That is, you should not store a code point in an “overlong” sequence.
  • The range of valid code points is U+0000 to U+10FFFF (valid range of Unicode).
  • Code points in the range U+D800 to U+DFFF are illegal as they are reserved for UTF-16 surrogate pairs

One key thing to realize is that we don’t have to bother checking all of these rules for each UTF-8 sequence. A two byte sequence cannot hold the invalid values in the range U+D800 to U+DFFF, nor exceed the limit of U+10FFFF. An efficient UTF-8 validation function should do as little checking as possible.

Example code

Here is the C++ code I have been using, altered somewhat for clarity. It’s quite possible this contains an error due to the amount of bit masking going on, so take it with a grain of salt and test it before usage.

bool isValidCodePoint(uint32_t cp)
{
    if (cp >= 0xD800 && cp <= 0xDFFF) // Surrogates
    {
        return false;
    }
    return (cp <= 0x10FFFF); // Maximum value
}

bool validateUtf8(const char *buffer, size_t bufferLen)
{
    const unsigned char *start = (const unsigned char*)buffer;
    const unsigned char *end = start + bufferLen;
    const unsigned char *iter = start;
    int32_t len;
    uint32_t val;

    // Loop through UTF-8, validating each code point
    while (iter < end)
    {
        if ((*iter) < 128) // Width of 1
        {
            iter++;
            continue;
        }
        else if (((*iter) & 0xE0) == 0xC0) // Width of 2
        {
            if (iter + 2 >= end)
            {
                break;
            }

            // Check for overlong form (8th or above data bit must be set)
            if ((iter[0] & 0x1E) == 0)
            {
                break;
            }

            // Check continuation byte
            if (((iter[1] & 0xC0) != 0x80))
            {
                break;
            }

            // Don't have to check code point validity. Can't have a large
            // enough value to be one of the invalid ones.
            iter += 2;
            continue;
        }
        else
        {
            // Width of 3 or 4 is long enough we have to check the validity
            // of the code point

            if (((*iter) & 0xF0) == 0xE0) // Width of 3
            {
                if (iter + 3 >= end)
                {
                    break;
                }

                // Check continuation bytes
                if ((iter[1] & 0xC0) != 0x80 ||
                    (iter[2] & 0xC0) != 0x80)
                {
                    break;
                }

                // Convert to code point
                val = ((uint32_t) (iter[0] & 0x0F) << 12) |
                      ((uint32_t) (iter[1] & 0x3F) << 6) |
                      ((uint32_t) (iter[2] & 0x3F));

                // Check for overlong form (11th or above data bit must be set)
                if (val < (1 << 11))
                {
                    break;
                }

                len = 3;
            }
            else if (((*iter) & 0xF8) == 0xF0) // Width of 4
            {
                if (iter + 4 >= end)
                {
                    break;
                }

                // Check continuation bytes
                if ((iter[1] & 0xC0) != 0x80 ||
                    (iter[2] & 0xC0) != 0x80 ||
                    (iter[3] & 0xC0) != 0x80)
                {
                    break;
                }

                // Convert to code point
                val = ((uint32_t) (iter[0] & 0x07) << 18) |
                      ((uint32_t) (iter[1] & 0x3F) << 12) |
                      ((uint32_t) (iter[2] & 0x3F) << 6) |
                      ((uint32_t) (iter[3] & 0x3F));

                // Check for overlong form (16th or above data bit must be set)
                if (val < (1 << 11))
                {
                    break;
                }

                len = 4;
            }
            else // Invalid length, or not start byte
            {
                break;
            }

            // Check code point legality.
            if (!isValidCodePoint(val))
            {
                break;
            }

            // Increment
            iter += len;
        }
    }

    return (iter - start) == bufferLen;
}

Improving on the Example

The example code above does not have any optimizations beyond avoiding unnecessary checking. It’s possibly to considerably improve the performance. Here are two things you can look at:

Length lookup table

You can build a lookup table to get the length of the lead byte without doing so much masking. The lead byte can only have 256 possible values, so you can simply make a table of the length for each possible value.

Vectorization

Most UTF-8 text is “plain ASCII”. That is, all the bytes are single byte sequences with a zero in the high bit, except perhaps for a byte-order-mark at the start of the text. You can optimize for this case and validate multiple bytes at a time. Vectorization gets quite complex and architecture dependent, so I’ll leave that to a later post.