# Vector instructions. Part II: Vectorization

This article describes the main vectorization techniques, using examples of several simple video signal encoding/decoding algorithms.

All the examples will use pixel blocks of a certain image as the input data. For simplicity, consider a monochrome image with pixel values in the range of $\displaystyle 0...2^{bitdepth} -1$, where $\displaystyle bitdepth$ is the pixel bit depth. The image is represented by a one-dimensional array. In the examples, the bit depth is either 8 bits or lies in the range of 9 through 16 bits. Thus, the image is stored as an array of bytes or unsigned 16-bit integer variables.

The code is intentionally simplified: the image blocks in all examples are square, with a side of 4, 8, or 16 pixels. Each function works with blocks of a specific size and pixel bit depth.

The identifiers used in all the examples are as follows: src and dst are pointers to the top left corner of the source and destination blocks, respectively; stride is the actual length of one image row in pixels, which can be equal to or greater than the visible image width (e.g., for alignment purposes, so that the pixel address at the beginning of each row is a multiple of 16); and bitdepth is the pixel bit depth in case it is different from 8 bits.

### Block copy

One of the simplest image processing problems is copying a part of an image. The example below is a function that copies a block of an 8-bit image, implemented as a C++ template.

Example 1. Copying from src to dst

```cpp
#include <stdint.h>
#include <stddef.h>

template <int SIZE>
void copy_mb(const uint8_t * src,
             uint8_t * dst,
             size_t src_stride,
             size_t dst_stride)
{
    for(int i = 0; i < SIZE; i++)
    {
        for(int j = 0; j < SIZE; j++)
        {
            dst[ j ] = src[ j ];
        }
        src += src_stride;
        dst += dst_stride;
    }
}
```

The copying is done in an inner loop that could be replaced, for instance, with a call to the standard memcpy function. Instead of calling that function, we will use vector instructions that read data from memory into a register and then write the result back. There are three instruction pairs in the SSE2 instruction set for copying data blocks:

- _mm_cvtsi32_si128 and _mm_cvtsi128_si32 for 4-byte blocks,
- _mm_loadl_epi64 and _mm_storel_epi64 for 8-byte blocks,
- _mm_loadu_si128 and _mm_storeu_si128 for 16-byte blocks.

In the latter case, _mm_load_si128 and _mm_store_si128 can also be used, provided that the read and write addresses are aligned to a 16-byte boundary — that is, are multiples of 16 bytes.
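Whether the aligned variants can be used depends on how the buffers were allocated. The sketch below (the helper name and demo arrays are mine, not from the article) forces 16-byte alignment with alignas and copies one 16-byte row with the aligned pair:

```cpp
#include <emmintrin.h>
#include <stdint.h>

// Illustrative helper: copy 16 bytes between 16-byte-aligned addresses.
// _mm_load_si128/_mm_store_si128 fault if an address is not a multiple of 16,
// which is why the unaligned variants are the safe default.
inline void copy16_aligned(const uint8_t * src, uint8_t * dst)
{
    __m128i x0 = _mm_load_si128((const __m128i*)src);
    _mm_store_si128((__m128i*)dst, x0);
}

// alignas(16) guarantees the buffers themselves are 16-byte aligned.
alignas(16) uint8_t demo_src[16] = { 0, 1, 2, 3, 4, 5, 6, 7,
                                     8, 9, 10, 11, 12, 13, 14, 15 };
alignas(16) uint8_t demo_dst[16];
```

This is also why image strides are often padded so that every row starts at a multiple of 16, as mentioned in the introduction.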

Thus, we get three implementations of the copy function for the different block sizes.

Example 2. Copying from src to dst (vectorized)

```cpp
#include <emmintrin.h>
#include <stdint.h>

void copy_mb_4(const uint8_t * src,
               uint8_t * dst,
               size_t src_stride,
               size_t dst_stride)
{
    __m128i x0;
    for(int i = 0; i < 4; i++)
    {
        x0 = _mm_cvtsi32_si128(*(int32_t*) src);
        *(int32_t*) dst = _mm_cvtsi128_si32(x0);
        src += src_stride;
        dst += dst_stride;
    }
}// copy_mb_4

void copy_mb_8(const uint8_t * src,
               uint8_t * dst,
               size_t src_stride,
               size_t dst_stride)
{
    __m128i x0;
    for(int i = 0; i < 8; i++)
    {
        x0 = _mm_loadl_epi64((__m128i*)src);
        _mm_storel_epi64((__m128i*)dst, x0);
        src += src_stride;
        dst += dst_stride;
    }
}// copy_mb_8

void copy_mb_16(const uint8_t * src,
                uint8_t * dst,
                size_t src_stride,
                size_t dst_stride)
{
    __m128i x0;
    for(int i = 0; i < 16; i++)
    {
        x0 = _mm_loadu_si128((__m128i*)src);
        _mm_storeu_si128((__m128i*)dst, x0);
        src += src_stride;
        dst += dst_stride;
    }
}// copy_mb_16
```

It is also possible to get rid of the outer loop, as shown below in Example 3. Such "loop unrolling" is a well-known technique that is useful when the number of iterations is small and known in advance, and the loop body is relatively small.

Example 3. Copying from src to dst (vectorized, no loops)

```cpp
#include <emmintrin.h>
#include <stdint.h>

void copy_mb_4(const uint8_t * src,
               uint8_t * dst,
               size_t src_stride,
               size_t dst_stride)
{
    __m128i x0, x1, x2, x3;
    x0 = _mm_cvtsi32_si128(*(int32_t*)(src + 0 * src_stride));
    x1 = _mm_cvtsi32_si128(*(int32_t*)(src + 1 * src_stride));
    x2 = _mm_cvtsi32_si128(*(int32_t*)(src + 2 * src_stride));
    x3 = _mm_cvtsi32_si128(*(int32_t*)(src + 3 * src_stride));
    *(int32_t*)(dst + 0 * dst_stride) = _mm_cvtsi128_si32(x0);
    *(int32_t*)(dst + 1 * dst_stride) = _mm_cvtsi128_si32(x1);
    *(int32_t*)(dst + 2 * dst_stride) = _mm_cvtsi128_si32(x2);
    *(int32_t*)(dst + 3 * dst_stride) = _mm_cvtsi128_si32(x3);
}
```

Functions that copy 9- to 16-bit blocks are implemented similarly to the 8-bit case. All that is needed is an instruction that copies a larger number of bytes (e.g., _mm_loadl_epi64 instead of _mm_cvtsi32_si128) or the same instruction repeated several times (works best for 16x16 pixel blocks).
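For instance, a 4x4 block of 16-bit pixels occupies 8 bytes per row, so the 8-byte load/store pair applies directly. A sketch following the article's conventions (the function name is mine):

```cpp
#include <emmintrin.h>
#include <stdint.h>
#include <stddef.h>

// 4x4 block of 16-bit pixels: each row is 4 * 2 = 8 bytes,
// so _mm_loadl_epi64/_mm_storel_epi64 move one row at a time.
// Strides are in pixels, as everywhere in the article.
void copy_mb_4_16bit(const uint16_t * src,
                     uint16_t * dst,
                     size_t src_stride,
                     size_t dst_stride)
{
    for(int i = 0; i < 4; i++)
    {
        __m128i x0 = _mm_loadl_epi64((const __m128i*)src);
        _mm_storel_epi64((__m128i*)dst, x0);
        src += src_stride;
        dst += dst_stride;
    }
}
```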

### Compensation

Consider the addition of pixels from two blocks. This is an integral part of video decoding, where one pixel block (dst in this example) is computed by interpolating pixels from the current or previous frames, and another block (src) by applying the inverse discrete cosine transform to the decoded coefficients. The examples omit the computation of the src and dst blocks and only show the final compensation stage, where the pixel values from these blocks are added together and the result is written into the dst block.

In the simplest case (same pixel block variable type and no integer overflow), compensation is implemented as follows.

Example 4. Pixel compensation (16 bit)

```cpp
#include <stdint.h>
#include <stddef.h>

template <int SIZE>
void compensate_mb(const uint16_t * src,
                   uint16_t * dst,
                   size_t src_stride,
                   size_t dst_stride)
{
    for(int i = 0; i < SIZE; i++)
    {
        for(int j = 0; j < SIZE; j++)
        {
            dst[ j ] = dst[ j ] + src[ j ];
        }
        src += src_stride;
        dst += dst_stride;
    }
}
```

We have already discussed block copy instructions above. Now, we will complement them with an addition instruction (_mm_add_epi16 in this case). The resulting vectorized code will then look as shown below. (An 8x8 block is used for this example).

Example 5. Pixel compensation (16 bit), 8x8, vectorized

```cpp
#include <emmintrin.h>
#include <stdint.h>

void compensate_8(const uint16_t * src,
                  uint16_t * dst,
                  size_t src_stride,
                  size_t dst_stride)
{
    __m128i x0, x1;
    for(int i = 0; i < 8; i++)
    {
        x0 = _mm_loadu_si128((__m128i*)src); // 8 pixels
        x1 = _mm_loadu_si128((__m128i*)dst);
        x0 = _mm_add_epi16(x0, x1);
        _mm_storeu_si128((__m128i*)dst, x0);
        src += src_stride;
        dst += dst_stride;
    }
}
```

In practical video decoding, the bit depth of the block at the output of the inverse discrete cosine transform (src) will exceed the bit depth of the interpolated block (dst). In addition, the src block elements can take negative values. A more realistic implementation of compensation will thus look as shown below.

Example 6. «Realistic» pixel compensation (16 and 8 bit)

```cpp
#include <stdint.h>
#include <stddef.h>

template <int SIZE>
void compensate_mb(const int16_t * src,
                   uint8_t * dst,
                   size_t src_stride,
                   size_t dst_stride)
{
    for(int i = 0; i < SIZE; i++)
    {
        for(int j = 0; j < SIZE; j++)
        {
            int tmp = dst[ j ] + src[ j ];
            if(tmp > 255)       // maximum uint8_t value
                dst[ j ] = 255;
            else if(tmp < 0)
                dst[ j ] = 0;
            else
                dst[ j ] = tmp;
        }
        src += src_stride;
        dst += dst_stride;
    }
}
```

Here, the value of the tmp variable is clamped to the upper or lower bound so that it stays within the allowed range of dst pixel values (in this case, 0...255).

The vectorized implementation gets more complex compared to Example 5 because dst values must be converted from 8-bit to signed or unsigned 16-bit ones. It is also necessary to implement the reverse conversion into an unsigned 8-bit value with upper and lower limits. The easiest way to double the size of an unsigned integer is to use the _mm_unpacklo_epiX and _mm_unpackhi_epiX instructions, where X = 8, 16, 32, or 64. For example, a vector consisting of 8-bit unsigned integers can be converted to two vectors of 16-bit integers as follows:

```cpp
zero = _mm_setzero_si128();
x1 = x0;
x0 = _mm_unpacklo_epi8(x0, zero);
x1 = _mm_unpackhi_epi8(x1, zero);
```

Larger-sized data items are converted in a similar fashion.
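A quick standalone check of this widening behavior (the function is illustrative, not part of the article's code): after interleaving with zeros, each 16-bit lane holds exactly one of the original bytes.

```cpp
#include <emmintrin.h>
#include <stdint.h>

// Widen 16 unsigned bytes to 16 unsigned 16-bit values using
// _mm_unpacklo_epi8/_mm_unpackhi_epi8 with a zero register.
void widen_u8_to_u16(const uint8_t src[16], uint16_t dst[16])
{
    __m128i zero = _mm_setzero_si128();
    __m128i x0 = _mm_loadu_si128((const __m128i*)src);
    __m128i x1 = x0;
    x0 = _mm_unpacklo_epi8(x0, zero); // bytes 0..7  -> 16-bit lanes
    x1 = _mm_unpackhi_epi8(x1, zero); // bytes 8..15 -> 16-bit lanes
    _mm_storeu_si128((__m128i*)(dst + 0), x0);
    _mm_storeu_si128((__m128i*)(dst + 8), x1);
}
```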

To convert 16-bit values into unsigned 8-bit ones, the _mm_packus_epi16 instruction is used, which packs the contents of two vector registers into one. For any 16-bit item with a value outside the 0...255 range, it also saturates the value to that range. Thus, a vectorized implementation for an 8x8 block will look like Example 7.

Example 7. «Realistic» pixel compensation (8 bit), 8x8, vectorized

```cpp
#include <emmintrin.h>
#include <stdint.h>

void compensate_8(const int16_t * src,
                  uint8_t * dst,
                  size_t src_stride,
                  size_t dst_stride)
{
    __m128i x0, x1, zero;
    zero = _mm_setzero_si128();
    for(int i = 0; i < 8; i++)
    {
        x0 = _mm_loadu_si128((__m128i*)src); // 8 pixels
        x1 = _mm_loadl_epi64((__m128i*)dst); // 8 bit !
        x1 = _mm_unpacklo_epi8(x1, zero);    // from 8 to 16 bit
        x0 = _mm_add_epi16(x0, x1);
        x0 = _mm_packus_epi16(x0, x0);       // back to 8 bit, with clamping
        _mm_storel_epi64((__m128i*)dst, x0);
        src += src_stride;
        dst += dst_stride;
    }
}
```
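The saturating behavior of _mm_packus_epi16 can also be checked in isolation (a standalone sketch, not part of the decoder code):

```cpp
#include <emmintrin.h>
#include <stdint.h>

// Pack 8 signed 16-bit values into 8 unsigned bytes with saturation:
// values below 0 become 0, values above 255 become 255.
void pack_saturate_u8(const int16_t src[8], uint8_t dst[8])
{
    __m128i x0 = _mm_loadu_si128((const __m128i*)src);
    x0 = _mm_packus_epi16(x0, x0); // low 8 bytes hold the packed result
    _mm_storel_epi64((__m128i*)dst, x0);
}
```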

Consider now a case where dst has a pixel bit depth in the range of 9 to 16 bits, and the src block computed using the inverse discrete cosine transform consists of 32-bit values. Converting a 32-bit value to a value in the $\displaystyle 0...2^{bitdepth} -1$ range is not so simple because there is no instruction similar to _mm_packus_epi16 for an arbitrary data size. The conversion is done in two steps. First, two vectors of 32-bit signed integers are packed into one vector of 16-bit signed integers (in the range of -32,768...32,767) using the _mm_packs_epi32 instruction. Next, out-of-range values are clamped using the _mm_min_epi16 and _mm_max_epi16 instructions, which compare the corresponding elements of two registers and take the minimum and maximum, respectively. (Note that if the _mm_packus_epi32 instruction from the SSE4.1 set is used, which converts 32-bit values to unsigned 16-bit ones, one comparison will suffice.) The full code for a 4x4 block looks like Example 8 (assuming no overflow on 32-bit addition).

Example 8. «Realistic» pixel compensation (9..16 bit), 4x4, vectorized

```cpp
#include <emmintrin.h>
#include <stdint.h>

void compensate_4(const int32_t * src,
                  uint16_t * dst,
                  size_t src_stride,
                  size_t dst_stride,
                  int bitdepth)
{
    __m128i x0, x1, zero, max_val;
    zero = _mm_setzero_si128();
    max_val = _mm_set1_epi16((1 << bitdepth) - 1);
    for(int i = 0; i < 4; i++)
    {
        x0 = _mm_loadu_si128((__m128i*)src); // 4 x 32
        x1 = _mm_loadl_epi64((__m128i*)dst); // 4 x 16
        x1 = _mm_unpacklo_epi16(x1, zero);   // from 16 bit to 32 bit
        x0 = _mm_add_epi32(x0, x1);
        x0 = _mm_packs_epi32(x0, x0);        // from 32 bit to 16 bit
        /* if x0[k] < max_val, then x0[k], else max_val */
        x0 = _mm_min_epi16(x0, max_val);
        x0 = _mm_max_epi16(x0, zero);
        _mm_storel_epi64((__m128i*)dst, x0);
        src += src_stride;
        dst += dst_stride;
    }
}
```

The function uses _mm_set1_epi16 to set all 16-bit elements of a vector register to the same value. In reality, _mm_set1_epi16 is a pseudo-instruction (as are others like it) with a compiler-dependent implementation. When high performance is required, such pseudo-instructions should be kept out of inner loops, as is done here.

### Computation of metrics

The degree of difference between two images is determined using metrics, which are expressions containing pixel values from both images. Metrics can be used to identify the optimum encoding method.

Consider the computation of two metrics, the sum of absolute differences (SAD) and the mean squared error (MSE). For two identically sized images, these metrics are defined by the following formulas:

$\displaystyle SAD( a,b) =\sum _{i=0}^{h-1}\sum _{j=0}^{w-1} |a_{ij} -b_{ij} |$, and

$\displaystyle MSE( a,b) =\frac{1}{hw}\sum _{i=0}^{h-1}\sum _{j=0}^{w-1}( a_{ij} -b_{ij})^{2}$,

where $\displaystyle w$ and $\displaystyle h$ are the width and height of the image, respectively.

The code in "pure" C/C++ to compute these expressions is quite obvious and therefore not shown here. Instead, let us see how this code can be implemented using vector instructions.

There is a special SSE2 instruction, _mm_sad_epu8, that computes the sum of absolute differences between two 8-bit data blocks. It computes the SAD separately for the least and most significant halves of the registers and stores the two sums in the corresponding halves of the result. As the absolute difference between two 8-bit values cannot exceed 255, each sum cannot exceed 2,040.
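This per-half behavior is easy to verify against a scalar loop. The sketch below is standalone (the function names are mine) and uses the same half-swap reduction as the article's SAD functions:

```cpp
#include <emmintrin.h>
#include <stdint.h>
#include <stdlib.h>

// SAD of 16 byte pairs with _mm_sad_epu8: the instruction produces two
// partial sums, one for bytes 0..7 (in the low 64-bit half) and one for
// bytes 8..15 (in the high 64-bit half); adding the halves gives the total.
uint32_t sad16_sse2(const uint8_t a[16], const uint8_t b[16])
{
    __m128i x0 = _mm_loadu_si128((const __m128i*)a);
    __m128i x1 = _mm_loadu_si128((const __m128i*)b);
    __m128i s  = _mm_sad_epu8(x0, x1);
    __m128i hi = _mm_shuffle_epi32(s, _MM_SHUFFLE(1,0,3,2)); // swap halves
    s = _mm_add_epi32(s, hi);
    return (uint32_t)_mm_cvtsi128_si32(s);
}

// Scalar reference for comparison.
uint32_t sad16_scalar(const uint8_t a[16], const uint8_t b[16])
{
    uint32_t s = 0;
    for (int i = 0; i < 16; i++)
        s += (uint32_t)abs((int)a[i] - (int)b[i]);
    return s;
}
```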

For a 16x16-pixel block, the SAD is computed as follows:

Example 9. SAD for 16x16 pixels block, 8 bit

```cpp
#include <emmintrin.h>
#include <stdint.h>

int32_t sad_16(const uint8_t* src0,
               const uint8_t* src1,
               size_t src0_stride,
               size_t src1_stride)
{
    __m128i x0, x1, sum;
    sum = _mm_setzero_si128();
    for(int i = 0; i < 16; i++)
    {
        x0 = _mm_loadu_si128((__m128i*)src0);
        x1 = _mm_loadu_si128((__m128i*)src1);
        x0 = _mm_sad_epu8(x0, x1);
        sum = _mm_add_epi32(sum, x0); // sums for lower and upper halves
        src0 += src0_stride;
        src1 += src1_stride;
    }
    x0 = _mm_shuffle_epi32(sum, _MM_SHUFFLE(1,0,3,2));
    sum = _mm_add_epi32(sum, x0);
    int32_t s = _mm_cvtsi128_si32(sum); // result
    return s;
}
```

Since _mm_sad_epu8 computes two separate sums for the left and right halves of the block, the sum register also accumulates two sums. The end result thus equals the sum of all 32-bit elements of the sum register (more precisely, the first and the third). Therefore, we swap the sum register's least and most significant halves using _mm_shuffle_epi32 and store the result in the x0 register. By adding together the x0 and sum registers, we obtain the end result in the least significant 32 bits of the latter.

For smaller blocks, the computation is similar, except that the most significant half of the sum register is zero, and the extra addition is not required. This is because _mm_cvtsi32_si128 and _mm_loadl_epi64 fill the most significant 96 and, respectively, 64 bits of the registers with zeros.
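This zero-filling can be demonstrated directly: for a 4-byte load, the entire SAD lands in the low 32 bits and no half-swap is needed (an illustrative check, names are mine):

```cpp
#include <emmintrin.h>
#include <stdint.h>

// _mm_cvtsi32_si128 zero-fills bits 32..127, so after _mm_sad_epu8 the
// upper partial sum is guaranteed to be zero and the low 32 bits already
// hold the complete SAD of the 4 byte pairs.
uint32_t sad4_sse2(const uint8_t a[4], const uint8_t b[4])
{
    __m128i x0 = _mm_cvtsi32_si128(*(const int32_t*)a);
    __m128i x1 = _mm_cvtsi32_si128(*(const int32_t*)b);
    __m128i s  = _mm_sad_epu8(x0, x1);
    return (uint32_t)_mm_cvtsi128_si32(s); // no extra addition required
}
```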

At present, there is no instruction similar to _mm_sad_epu8 for data sizes larger than 8 bits. For 9- to 15-bit input data, computing the absolute difference is simple:

```cpp
x0 = _mm_sub_epi16(x0, x1); // x0 - x1
x0 = _mm_abs_epi16(x0);     // SSSE3
```

This method requires that the CPU support the SSSE3 instruction set, and it is not suitable for input data that is exactly 16 bits in size (which is rarely encountered in practice). In that case, the absolute difference can be computed as follows:

```cpp
x2 = x0;
x0 = _mm_subs_epu16(x0, x1); // x0 - x1 or 0
x1 = _mm_subs_epu16(x1, x2); // x1 - x2 or 0
x0 = _mm_xor_si128(x0, x1);  // | x0 - x1 |
```

The _mm_subs_epu16 instruction (unsigned saturated subtraction) results in zero if the subtrahend is greater than the minuend. In that case, _mm_subs_epu16(x0,x1) and _mm_subs_epu16(x1,x2) yield either the absolute differences or zeros for all element pairs in x0 and x1. All that is left is to combine these elements using OR or XOR. The absolute differences can be computed similarly for 8-bit data by replacing the 16-bit instructions with 8-bit ones (namely, _mm_subs_epu8).
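A self-contained check of this SSE2-only trick against scalar arithmetic (the function name is mine):

```cpp
#include <emmintrin.h>
#include <stdint.h>

// Absolute difference of 8 unsigned 16-bit lanes using only SSE2:
// one of the two saturated subtractions yields the true difference,
// the other yields zero, and XOR (or OR) merges them.
void absdiff_u16_sse2(const uint16_t a[8], const uint16_t b[8], uint16_t out[8])
{
    __m128i x0 = _mm_loadu_si128((const __m128i*)a);
    __m128i x1 = _mm_loadu_si128((const __m128i*)b);
    __m128i d0 = _mm_subs_epu16(x0, x1); // a - b, or 0 if b > a
    __m128i d1 = _mm_subs_epu16(x1, x0); // b - a, or 0 if a > b
    _mm_storeu_si128((__m128i*)out, _mm_xor_si128(d0, d1));
}
```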

Here is an example of computing SAD for an 8x8-pixel image block with 16 bits per pixel.

Example 10. SAD for 8x8 pixels block, 16 bit

```cpp
__m128i x0, x1, x2, sum, zero;
zero = _mm_setzero_si128();
sum = zero;
for(int i = 0; i < 8; i++)
{
    x0 = _mm_loadu_si128((__m128i*)src0); // 8 pixels, 16 bit
    x1 = _mm_loadu_si128((__m128i*)src1);
    /* | x0 - x1 | */
    x2 = x0;
    x0 = _mm_subs_epu16(x0, x1);
    x1 = _mm_subs_epu16(x1, x2);
    x0 = _mm_xor_si128(x0, x1);
    x1 = x0;
    x0 = _mm_unpacklo_epi16(x0, zero); // 16 bit to 32 bit
    x1 = _mm_unpackhi_epi16(x1, zero);
    sum = _mm_add_epi32(sum, x0);
    sum = _mm_add_epi32(sum, x1);
    src0 += src0_stride;
    src1 += src1_stride;
}
/* sum is a0,a1,a2,a3 */
x0 = _mm_shuffle_epi32(sum, _MM_SHUFFLE(2,3,0,1)); // x0 is a1,a0,a3,a2
sum = _mm_add_epi32(sum, x0);                      // a0+a1, a0+a1, a2+a3, a2+a3
x0 = _mm_shuffle_epi32(sum, _MM_SHUFFLE(1,0,3,2));
sum = _mm_add_epi32(sum, x0);                      // a0+a1+a2+a3 in the low element
int32_t s = _mm_cvtsi128_si32(sum); // result
```

Here, the absolute differences between pixel values are computed as shown earlier. However, they cannot be summed in this (16-bit) form because of the potential for overflow. Therefore, the absolute values are converted to 32 bits and then summed. When the loop completes, the only thing left is to compute the sum of all 32-bit elements in the sum register. As an alternative to the approach used in the above example, this can be done using the _mm_hadd_epi32 instruction, which sums the adjacent 32-bit register elements. This requires SSSE3 support from the CPU.

```cpp
zero = _mm_setzero_si128();
sum = _mm_hadd_epi32(sum, zero); // sum is a0+a1,a2+a3,0,0
sum = _mm_hadd_epi32(sum, zero); // sum is a0+a1+a2+a3,0,0,0
```

#### MSE computation

Strictly speaking, the quantity computed in all examples below is not the MSE but the sum of squared differences: $\displaystyle SED( a,b) =\sum _{i=0}^{h-1}\sum _{j=0}^{w-1}( a_{ij} -b_{ij})^{2} =h\,w\,MSE( a,b)$. Since squaring is involved, potential overflow must be taken into account already at 8 bits, and even more so for larger data sizes. This means that additional data size conversions will be needed.

An example of computing the SED for a block of 8-bit data is shown below.

Example 11. SED for 8x8 pixels block, 8 bit

```cpp
__m128i x0, x1, sum, zero;
zero = _mm_setzero_si128();
sum = zero;
for(int i = 0; i < 8; i++)
{
    x0 = _mm_loadl_epi64((__m128i*)src0); // 8 pixels, 8 bit
    x1 = _mm_loadl_epi64((__m128i*)src1);
    x0 = _mm_unpacklo_epi8(x0, zero); // 8 to 16 bit
    x1 = _mm_unpacklo_epi8(x1, zero);
    x0 = _mm_sub_epi16(x0, x1);  // x0 - x1
    x0 = _mm_madd_epi16(x0, x0); // (x0 - x1)^2, summed pairwise to 32 bit
    sum = _mm_add_epi32(sum, x0);
    src0 += src0_stride;
    src1 += src1_stride;
}
// sum of sum elements, as in Example 10
```

// sum of sum elements

In this example, 8-bit data are first converted to 16 bits, and then the pixel value difference is computed. The most convenient way to obtain a squared difference is the _mm_madd_epi16 instruction that converts 16-bit data directly to 32 bits and performs some of the required additions. When the loop completes, it only remains to add together the values of all elements in the sum register, like in Example 10.
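The pairwise-add behavior of _mm_madd_epi16 can be checked against scalar arithmetic with a standalone sketch (names are mine, not from the article):

```cpp
#include <emmintrin.h>
#include <stdint.h>

// Sum of squares of 8 signed 16-bit values via _mm_madd_epi16:
// madd(x, x) produces v0^2+v1^2, v2^2+v3^2, v4^2+v5^2, v6^2+v7^2
// as four 32-bit lanes; two shuffle+add steps reduce them to one.
uint32_t sum_sq_sse2(const int16_t v[8])
{
    __m128i x0 = _mm_loadu_si128((const __m128i*)v);
    __m128i s  = _mm_madd_epi16(x0, x0);
    __m128i t  = _mm_shuffle_epi32(s, _MM_SHUFFLE(1,0,3,2));
    s = _mm_add_epi32(s, t);
    t = _mm_shuffle_epi32(s, _MM_SHUFFLE(2,3,0,1));
    s = _mm_add_epi32(s, t);
    return (uint32_t)_mm_cvtsi128_si32(s);
}
```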

SED can be computed in a similar fashion for data up to 12 bits in size inclusive and for a 16x16-pixel block. For larger-sized data, conversion from 32 bits to 64 bits will be needed when summing in the loop. In addition, _mm_madd_epi16 cannot be used for 16-bit data due to the potential overflow. The _mm_mullo_epi16 and _mm_mulhi_epu16 instructions should be used instead. These are used in Example 12 below. Moreover, the absolute difference between two pixel values is computed instead of the simple difference to avoid an extraneous data size conversion.

Example 12. SED for 8x8 pixels block, 16 bit

```cpp
__m128i x0, x1, x2, sum, zero;
zero = _mm_setzero_si128();
sum = zero;
for(int i = 0; i < 8; i++)
{
    x0 = _mm_loadu_si128((__m128i*)src0); // 8 pixels, 16 bit
    x1 = _mm_loadu_si128((__m128i*)src1);
    /* | x0 - x1 | */
    x2 = x0;
    x0 = _mm_subs_epu16(x0, x1);
    x1 = _mm_subs_epu16(x1, x2);
    x0 = _mm_xor_si128(x0, x1);
    /* x0^2 */
    x1 = x0;
    x0 = _mm_mullo_epi16( x0, x0 ); // lower 16 bits of the products
    x1 = _mm_mulhi_epu16( x1, x1 ); // upper 16 bits of the products
    x2 = x0;
    x0 = _mm_unpacklo_epi16( x0, x1 ); // x0[ i ]^2, i = 0..3
    x2 = _mm_unpackhi_epi16( x2, x1 ); // x0[ i ]^2, i = 4..7
    x0 = _mm_add_epi32( x0, x2 );
    x2 = x0;
    x0 = _mm_unpacklo_epi32(x0, zero); // from 32 to 64 bit
    x2 = _mm_unpackhi_epi32(x2, zero);
    sum = _mm_add_epi64(sum, x0);
    sum = _mm_add_epi64(sum, x2);
    src0 += src0_stride;
    src1 += src1_stride;
}
// sum of sum elements
x0 = _mm_shuffle_epi32(sum, _MM_SHUFFLE(1,0,3,2));
sum = _mm_add_epi64(sum, x0);
uint64_t result;
_mm_storel_epi64((__m128i*)&result, sum);
```

### Arbitrary image areas

In the previous examples, the image blocks had a fixed size of 4, 8, or 16 pixels, but vectorization is also straightforward for arbitrary block sizes. Let us revisit Examples 1 and 6. In both of them, all data processing and copying are done in the inner loop, while the outer loop simply shifts the pointers to the next pixel row. In the vectorized versions, the inner loop comprises only one iteration, and therefore the loop statement is omitted. For a block of arbitrary width, however, several iterations are required to cover the width. In the general case, the inner loop will look as in Example 13, where SAD is computed for 8-bit images.

Example 13. SAD for WxH pixels block, 8 bit

```cpp
#include <emmintrin.h>
#include <stdint.h>
#include <stdlib.h>

uint64_t sad(const uint8_t* src0,
             const uint8_t* src1,
             size_t width,
             size_t height,
             size_t src0_stride,
             size_t src1_stride)
{
    size_t width16 = width - (width % 16); // width16 == 16*x
    __m128i x0, x1, sum;
    sum = _mm_setzero_si128();
    uint64_t sum_tail = 0;
    for(size_t i = 0; i < height; i++)
    {
        for(size_t j = 0; j < width16; j += 16)
        {
            x0 = _mm_loadu_si128((__m128i*)(src0 + j));
            x1 = _mm_loadu_si128((__m128i*)(src1 + j));
            x0 = _mm_sad_epu8(x0, x1);
            sum = _mm_add_epi64(sum, x0);
        }
        for(size_t j = width16; j < width; j++)
        {
            sum_tail += abs(src0[j] - src1[j]);
        }
        src0 += src0_stride;
        src1 += src1_stride;
    }
    x0 = _mm_shuffle_epi32(sum, _MM_SHUFFLE(1,0,3,2));
    sum = _mm_add_epi64(sum, x0);
    uint64_t sum_total;
    _mm_storel_epi64((__m128i*)&sum_total, sum);
    sum_total += sum_tail;
    return sum_total;
}
```

The length of the row will not generally be a multiple of 4, 8, or 16 bytes, so the resulting "tail" is processed separately: there are no instructions that load an arbitrary number of bytes from memory. The simplest workaround is to leave this bit of code unvectorized, as in Example 13. There, the "tail" is never longer than 15 bytes, and the performance penalty for sufficiently large image blocks is minor. However, if the algorithm being implemented requires a lot of CPU time, it is desirable to process as little data as possible with unvectorized code. In that case, the technique shown in Example 14 can be used.

Example 14. SAD for WxH pixels block, 8 bit

```cpp
#include <emmintrin.h>
#include <stdint.h>
#include <stdlib.h>

uint64_t sad(const uint8_t* src0,
             const uint8_t* src1,
             size_t width,
             size_t height,
             size_t src0_stride,
             size_t src1_stride)
{
    size_t width_r = width % 16;
    size_t width16 = width - width_r;        // width16 == 16*x
    size_t width8 = width_r - (width_r % 8); // 8 or 0
    width_r -= width8;
    size_t width4 = width_r - (width_r % 4); // 4 or 0
    width_r -= width4;                       // 0, 1, 2, or 3
    __m128i x0, x1, sum;
    sum = _mm_setzero_si128();
    uint64_t sum_tail = 0;
    for(size_t i = 0; i < height; i++)
    {
        for(size_t j = 0; j < width16; j += 16)
        {
            x0 = _mm_loadu_si128((__m128i*)(src0 + j));
            x1 = _mm_loadu_si128((__m128i*)(src1 + j));
            x0 = _mm_sad_epu8(x0, x1);
            sum = _mm_add_epi64(sum, x0);
        }
        if( width8 )
        {
            x0 = _mm_loadl_epi64((__m128i*)(src0 + width16));
            x1 = _mm_loadl_epi64((__m128i*)(src1 + width16));
            x0 = _mm_sad_epu8(x0, x1);
            sum = _mm_add_epi64(sum, x0);
        }
        if( width4 )
        {
            x0 = _mm_cvtsi32_si128(*(int32_t*)(src0 + width16 + width8));
            x1 = _mm_cvtsi32_si128(*(int32_t*)(src1 + width16 + width8));
            x0 = _mm_sad_epu8(x0, x1);
            sum = _mm_add_epi64(sum, x0);
        }
        for(size_t j = width - width_r; j < width; j++)
        {
            sum_tail += abs(src0[j] - src1[j]);
        }
        src0 += src0_stride;
        src1 += src1_stride;
    }
    x0 = _mm_shuffle_epi32(sum, _MM_SHUFFLE(1,0,3,2));
    sum = _mm_add_epi64(sum, x0);
    uint64_t sum_total;
    _mm_storel_epi64((__m128i*)&sum_total, sum);
    sum_total += sum_tail;
    return sum_total;
}
```

In this example, the number of pixels processed without vectorization never exceeds three.
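When writing such kernels, it is convenient to keep a plain scalar reference to validate the vectorized versions against. A minimal sketch (the function name is mine):

```cpp
#include <stdint.h>
#include <stddef.h>
#include <stdlib.h>

// Unvectorized reference SAD for an arbitrary WxH block of 8-bit pixels.
// Useful as ground truth when testing the SSE2 implementations above.
uint64_t sad_ref(const uint8_t * src0, const uint8_t * src1,
                 size_t width, size_t height,
                 size_t src0_stride, size_t src1_stride)
{
    uint64_t sum = 0;
    for (size_t i = 0; i < height; i++)
    {
        for (size_t j = 0; j < width; j++)
            sum += (uint64_t)abs((int)src0[j] - (int)src1[j]);
        src0 += src0_stride;
        src1 += src1_stride;
    }
    return sum;
}
```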

### Sources

1. https://software.intel.com/sites/landingpage/IntrinsicsGuide
2. https://developer.arm.com/architectures/instruction-sets/intrinsics
3. https://community.arm.com/arm-community-blogs/b/architectures-and-processors-blog/posts/coding-for-neon---part-5-rearranging-vectors

March 31, 2022


### Author

Dmitry Farafonov

An Elecard engineer. He has been working on the optimization of audio and video codecs, as well as software for audio and video signal processing, since 2015.