Auto-Vectorization and How to Make it Happen

What is Auto-Vectorization?

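Auto-vectorization applies when you do the same operation on a set of numbers (e.g. adding the elements of two parallel arrays): the compiler rewrites the loop so that a single instruction processes several elements at once.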


Instead of the following logic:
  1. Loop through the elements, starting with element 0
  2. Load the current element of array 1 into a register on the CPU
  3. Load the current element of array 2 into a register on the CPU
  4. Add the two registers, producing one sum
  5. Move on to the next element

Vectorization Logic:
  1. Loop through the elements, starting with element 0
  2. Load elements 0-7 of array 1 into a single vector register on the CPU
  3. Load elements 0-7 of array 2 into a single vector register on the CPU
  4. Add the two vector registers, producing 8 sums with one instruction
  5. Move on to the next 8 elements
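
The two sets of steps above can be sketched in plain C. This is only an illustration of the idea, not what the compiler literally emits: the function names and the LANES constant are made up for this example, and the inner loop in add_chunked stands in for what would be a single vector instruction.

```c
#include <stddef.h>

/* Assumed number of elements per vector register, matching the
   "8 at a time" description above. */
#define LANES 8

/* Scalar logic: one element per loop iteration. */
void add_scalar(const int *a, const int *b, int *out, size_t n)
{
    for (size_t i = 0; i < n; i++)
        out[i] = a[i] + b[i];
}

/* Vectorization logic, written out by hand: LANES elements per
   iteration of the outer loop, then a scalar loop for the leftovers
   when n is not a multiple of LANES. */
void add_chunked(const int *a, const int *b, int *out, size_t n)
{
    size_t i = 0;

    for (; i + LANES <= n; i += LANES) {
        /* On real hardware this whole inner loop would be one
           vector load, load, add, store sequence. */
        for (size_t lane = 0; lane < LANES; lane++)
            out[i + lane] = a[i + lane] + b[i + lane];
    }

    /* Tail: fewer than LANES elements remain. */
    for (; i < n; i++)
        out[i] = a[i] + b[i];
}
```

Both functions produce the same results; the difference is only in how many elements are handled per trip around the loop. The leftover loop at the end is something the compiler also has to generate when the array length is not a multiple of the vector width.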
You can see why vectorization can make your code around 4-10 times faster. (For timings, see my blog post "Algorithm Timing", which compares various algorithms compiled with -O3 versus -O0.)

How to make it happen

Auto-vectorization is done by the compiler under 3 conditions:
  1. Compiler flags:
    • -O3 is specified. This turns on a set of more aggressive optimizations, auto-vectorization among them, that trade compile time and code size for speed

      OR
    • The individual flags for auto-vectorization are used, such as -ftree-vectorize and -fvect-cost-model
  2. Assurance that none of the arrays overlap in memory, so that writing to one array can never change the contents of another

  3. Assurance that all of the arrays are aligned to hardware word boundaries. This means each array starts at an address that is a multiple of the vector register size (for example, 16 bytes), so a whole group of elements can be loaded in one access. Although this can waste a little space on padding, it's worth it so that it's easier to jump to the next group of elements.
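
As a rough sketch of how conditions 2 and 3 might be expressed in C: the restrict qualifier (C99) promises the compiler that the arrays do not overlap, and __builtin_assume_aligned (a GCC and Clang builtin) promises an alignment. The function name add_arrays is invented for this example, and the 16-byte figure assumes 128-bit vector registers, which hold eight int16_t values.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical example function, not taken from the original post.
   restrict: the caller promises out, a and b never overlap. */
void add_arrays(int16_t *restrict out,
                const int16_t *restrict a,
                const int16_t *restrict b,
                size_t n)
{
    /* The caller also promises every array starts on a 16-byte
       boundary (assumed vector register size: 128 bits). */
    int16_t *o = __builtin_assume_aligned(out, 16);
    const int16_t *x = __builtin_assume_aligned(a, 16);
    const int16_t *y = __builtin_assume_aligned(b, 16);

    /* A simple counted loop with no dependencies between
       iterations: a good candidate for auto-vectorization. */
    for (size_t i = 0; i < n; i++)
        o[i] = (int16_t)(x[i] + y[i]);
}
```

Compiling with something like gcc -O3 -fopt-info-vec should report whether the loop was vectorized. Note that these are promises, not checks: if a caller passes overlapping or misaligned arrays, the behavior is undefined, so the arrays need to be declared or allocated with the right alignment to begin with.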

Assembly Code Walkthrough on AArch64 - Auto-Vectorized

Here’s the C code we’re going to vectorize:



Let’s go through some Auto-Vectorized code and understand what’s going on in the assembly:



Go out there and Auto-vectorize!
