Auto-Vectorizer in Visual Studio 2012 – Rules for Loop Header

If you haven’t read previous posts in this series about auto-vectorization, you may want to begin at the beginning.

In this post, we look at those C++ loop patterns that the auto-vectorizer recognizes. We will start by explaining its rules for loop headers (defined below).

Throughout this post, we will talk about what the auto-vectorizer does, in its first release, in Visual Studio 2012.  We will extend the feature in future releases, so this advice will vary as time goes by, as the analyses tackle a wider and wider range of patterns.

Let’s focus on C++ for loops.  (The auto-vectorizer can work for certain kinds of while loops, but we’ll get to those later).  Here is the general pattern of a C++ for loop:

  for <loop-header> <loop-body>

Simply expanding <loop-header>, we get:

  for (int n = L; n < U; n += S) <loop-body>

L is the lower bound of the loop; U is the upper bound; S is the step size. (We wrote the test as n < U, but we might equally well have chosen n <= U. Rather than keep repeating that both patterns apply equally well, I will use the first form only.)  First rule:

S must be +1

Why?  Consider a simple loop that steps up, 1 at-a-time:

  for (int i = 0; i < 1000; ++i) a[i] = b[i] * b[i];

The first four iterations of this loop therefore calculate:

  a[0] = b[0] * b[0];
  a[1] = b[1] * b[1];
  a[2] = b[2] * b[2];
  a[3] = b[3] * b[3];

To vectorize this code, we would load b[0:3] into one of the vector registers, XMM0, say.  SSE provides an instruction (MOVUPS, or MOVAPS) that loads these 4 contiguous locations from memory into a vector register in one operation.  Once there, we calculate their squares quickly, again with a single vector instruction (MULPS).  And then we store the 4 results into a[0:3] with another single vector instruction (MOVUPS). The whole operation is efficient, and a worthwhile optimization.
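
For anyone curious, here is a minimal sketch, using SSE intrinsics, of roughly what the vectorized loop amounts to (the function name is invented; in reality the compiler emits this for you):

  #include <xmmintrin.h>   // SSE intrinsics

  // Sketch only: morally equivalent to the auto-vectorizer's output.
  void square_1000(float* a, const float* b) {
      for (int i = 0; i < 1000; i += 4) {       // process 4 floats per pass
          __m128 vb  = _mm_loadu_ps(&b[i]);     // MOVUPS: load b[i:i+3]
          __m128 vsq = _mm_mul_ps(vb, vb);      // MULPS: 4 multiplies at once
          _mm_storeu_ps(&a[i], vsq);            // MOVUPS: store into a[i:i+3]
      }
  }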

[NitPick asks: what’s with the words, in upper-case letters, like MOVUPS?  Answer: these are the names of particular SSE (vector) instructions.  I will include them occasionally, for anyone interested in the details, but you don’t need to understand them in order to follow this blog.  For a full list of the several hundred SSE instructions available, see the online manuals from either Intel or AMD.  Both are free.]

Now consider that same loop, but stepping up by some number, other than 1.  For example, by 3:

  for (int i = 0; i < 1000; i += 3) a[i] = b[i] * b[i];

The first four iterations would therefore calculate:

  a[0] = b[0] * b[0];
  a[3] = b[3] * b[3];
  a[6] = b[6] * b[6];
  a[9] = b[9] * b[9];

To vectorize this code, we would need to gather the four values b[0], b[3], b[6] and b[9] into one of the vector registers.  In SSE, this is not straightforward, unfortunately.  You need to load b[0] into one XMM register, b[3] into another, then mush them together so they occupy different 32-bit slots in the same XMM register.  Then repeat, with a similar mush operation to get b[6] and b[9] into the vacant slots in that same XMM register.  (For those who might care, check the MOVSS and UNPCKLPS instructions.)
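
To make that cost concrete, here is a sketch of the gather in SSE intrinsics (the helper function is invented, purely for illustration):

  #include <xmmintrin.h>

  // Hypothetical helper: gather b[0], b[3], b[6], b[9] into one XMM register.
  __m128 gather_stride3(const float* b) {
      __m128 v0 = _mm_load_ss(&b[0]);         // MOVSS: b[0] into slot 0
      __m128 v1 = _mm_load_ss(&b[3]);         // MOVSS: b[3] into slot 0
      __m128 v2 = _mm_load_ss(&b[6]);         // MOVSS
      __m128 v3 = _mm_load_ss(&b[9]);         // MOVSS
      __m128 lo = _mm_unpacklo_ps(v0, v1);    // UNPCKLPS: {b[0], b[3], 0, 0}
      __m128 hi = _mm_unpacklo_ps(v2, v3);    // UNPCKLPS: {b[6], b[9], 0, 0}
      return _mm_movelh_ps(lo, hi);           // MOVLHPS: {b[0], b[3], b[6], b[9]}
  }

Seven instructions, just to load four values that a unit-stride loop would fetch with a single MOVUPS.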

Once we have b[0], b[3], b[6] and b[9] nicely tucked into the four 32-bit slots of an XMM register, we can calculate their squares quickly, in a single vector operation. But then we have to scatter the results to a[0], a[3], a[6] and a[9].  This requires more ‘instruction gymnastics’.

All those instructions, required to gather the non-contiguous array elements of b into XMM registers and, later, to scatter results back to the non-contiguous array elements of a, swamp the benefit of the fast multiplication that calculated the squares.  On the whole, for this example, there’s a simpler technique (called “loop unrolling”), using scalar multiplications, that runs faster.
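
For the curious, here is a sketch of the kind of scalar unrolling we mean (unrolled 4-wide here; the compiler is free to pick its own factor, and the function name is invented):

  // Sketch: the stride-3 loop, unrolled 4-wide, using plain scalar multiplies.
  void square_stride3(float* a, const float* b) {
      int i = 0;
      for (; i + 9 < 1000; i += 12) {     // 4 original iterations per pass
          a[i]     = b[i]     * b[i];
          a[i + 3] = b[i + 3] * b[i + 3];
          a[i + 6] = b[i + 6] * b[i + 6];
          a[i + 9] = b[i + 9] * b[i + 9];
      }
      for (; i < 1000; i += 3)            // finish any leftover iterations
          a[i] = b[i] * b[i];
  }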

Long story short: use a step size of +1 in your C++ for loops.  Put another way, if you have step-down loops such as:

  for (int i = N; i > 0; --i) a[i] = b[i] * c[i];

then consider tweaking them to become step-up loops (safe here, because no iteration depends on the result of another), like this:

  for (int i = 1; i <= N; ++i) a[i] = b[i] * c[i];

The next auto-vectorizer rule for the loop header is:

L can be any expression

because in C++ L is evaluated just once, at the start of the loop.  If the expression for L included some variable x, say, then you could even change the value of x later, within <loop-body>.  It doesn’t matter, because that change cannot alter the already-calculated lower bound.  Here’s an example:

  int x = 12;
  for (int i = x; i < 12345; ++i) { a[i] = x; x += 3; }

This loop will have L = x = 12. Even though <loop-body> changes the value of x, it cannot alter history, as it were, to change the fact that L = 12.  L is not re-evaluated as the loop executes.  In summary, L can be any valid expression – it can even include calls to functions.
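
For instance, a lower bound computed by a function call is fine, since the call happens exactly once, before the first iteration (first_index is a hypothetical helper, used purely for illustration):

  int first_index() { return 12; }   // hypothetical: computes where to start

  void add_arrays(float* a, const float* b, const float* c) {
      // L involves a function call, but it is still evaluated only once.
      for (int i = first_index(); i < 1000; ++i)
          a[i] = b[i] + c[i];
  }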

What about U, the upper bound?  It’s a different matter altogether, and has the following rule:

U must be a constant for the duration of this loop

Recall that C++ does re-evaluate U each time around the loop.  For the auto-vectorizer to be sure that its transformations are safe, it must ensure that U does not change value as the loop executes.  In compiler jargon, we would say that its value is “loop invariant”.  To see why changing the value of U as the loop executes introduces untold complications into the life of the auto-vectorizer, consider the following contrived example:

  int u = 42;
  for (int i = 0; i < u; ++i) { a[i] = i; u -= 9; }

Before any attempt at vectorization, the loop will run like this:

  Iteration   i    u on entry to <loop-body>   u on exit from <loop-body>
  First       0    42                          33
  Second      1    33                          24
  Third       2    24                          15
  Fourth      3    15                           6
  Fifth       4     6                          -3
  Sixth       5    -3                          (does not execute)

So <loop-body> executes 5 times.  Its net effect is to set a[i] to i, for i = 0 thru 4, inclusive.  But working out, at compile-time, how many times the loop will actually execute at runtime (called the loop’s trip count) is quite tricky.  If you were the auto-vectorizer, what would you do?  Long story short, the auto-vectorizer declines the challenge, and the compiler falls back to generating scalar code.

Perhaps a more common example of this situation is where the upper bound involves a call to a function – fun in the following snippet:

  for (int i = 0; i < fun(); ++i) { a[i] = b[i] + c[i]; }

Unless the compiler can prove (to itself) that fun cannot possibly interfere with the loop body, it cannot auto-vectorize this code.
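
One common remedy (the same trick shown for vector::size() in the comments below) is to hoist the call out of the loop yourself, so that U is manifestly loop-invariant:

  const int len = fun();   // call fun() exactly once, before the loop
  for (int i = 0; i < len; ++i) { a[i] = b[i] + c[i]; }

Of course, this rewrite is only faithful if fun() really would have returned the same value on every iteration.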

 

That’s probably enough discussion for this post.  We have explained the rules, or constraints, on for loop headers that must hold if the auto-vectorizer is to stand any chance of vectorizing the loop body.  I’ve tried to make those rules clear, and obvious – probably the ones each of us would have chosen if we had been invited to play the part of a human auto-vectorizer, transforming a C++ for loop into equivalent vector instructions.

Next time, we will look at the rules for loop bodies, by working through the examples in the puzzle from the last post.

[Our astute reader, NitPick, asked only one question in this post.  Perhaps his concentration lapsed, because it seems like there are a couple more lurking in the text.  Let’s see whether your comments hit the questions that NitPick missed]

Comments

  • Anonymous
    May 01, 2012
    int u = 42;  not x

  • Anonymous
    May 01, 2012
Fixed - thank you.

  • Anonymous
    May 01, 2012
The loop variable seems to be constrained to the type int only. You kind of assumed it. It's not a general for loop, though. Even unsigned (as in size_t) doesn't seem to be vectorized. (It's a repeat comment; I don't see the other one posted.)

  • Anonymous
    May 02, 2012
@NitPick - good question (as always :-)  Yes, in the Beta drop, back in February, size_t did not vectorize.  Meantime, go with "int", which works for both x86 and x64. We have fixed these oddities - they'll show up in the next drop. I will cover the details in next week's post, adding a Rule 4 for Loop Header.  Well reminded - thank you!

  • Anonymous
    May 02, 2012
Will the auto-vectorizer kick in when using C++ container objects? For example (pun not intended): std::vector<float> a;

  • Anonymous
    May 02, 2012
    Your step-down loop sample will keep stepping down until i overflows :)

  • Anonymous
    May 02, 2012
According to one of the Q&As under the C9 video on vectorization (channel9.msdn.com/.../GoingNative-7-VC11-Auto-Vectorizer-C-NOW-LangNEXT), VC11 does vectorize strided accesses to struct fields. But if it does not even want to vectorize strided array accesses, why bother with struct fields at all? Is it possible to trick the compiler into vectorizing compile-time-constant strided array accesses by casting array pointers into struct pointers?

  • Anonymous
    May 02, 2012
@Dimitry - fixed - thank you.

  • Anonymous
    May 02, 2012
Hi Jim, thank you for the articles. I have a few questions:

  1. Will for(int i = 0; i < 1000; i++){/**/} auto-vectorize? Lots of people write i++ instead of ++i.
  2. Will for(int i = 0; i != 1000; ++i){/**/} auto-vectorize? I write != instead of < in for-conditions almost everywhere.
  3. As someone asked, will the auto-vectorizer be smart enough to see through standard library functions such as for_each and transform, used with predicates or lambdas, and containers such as array, vector (deque?) and their iterators?
  Marek
  • Anonymous
    May 02, 2012
    Is it necessary to declare U as const? For example, const int u = 10. Or does the auto-vectorizer check if u is being modified inside the loop or not?

  • Anonymous
    May 02, 2012
    Marek,

  1. i++ and ++i should result in the same assembly code for primitive types.
  2. I believe the auto-vectorizer works on much lower level constructs than the STL, so such cases are probably not analyzed. However, Stephan T. Lavavej has said that he plans to make the STL algorithms auto-vectorizer-friendly.
  • Anonymous
    May 03, 2012
    @Marek:
  1.  Yes, the two versions of the loop, with i++ (post-increment) or with ++i (pre-increment), both vectorize.  As Chris points out, the compiler generates the same machine code in either case.  In debug code, it would be "add eax, 1".  In vectorized code, it becomes "add ecx, 10h", since we are striding thru the loop 4 at a time (for an array data type of float, or int).
  2. Similarly for the comparison "i < 1000" or "i != 1000" - both vectorize ok.  Generally, folks use "!=" because it works when iterating thru any STL container.  Sometimes, "<" doesn't mean anything, and will not even compile - for example, when iterating thru a list<T>.
  3.  We have not done the work, yet, to ensure STL algorithms vectorize.  And yes, as Chris says, we'll be working with Stephan to fix this in the future.
  • Anonymous
    May 03, 2012
    "Unless the compiler can prove (to itself) that fun cannot possibly interfere with the loop body, it cannot auto-vectorize this code." This seems like a good place to use a constexpr function, no?

  • Anonymous
    May 03, 2012
@Chris: The compiler (specifically, the backend) checks automatically whether any variable, such as "u", is being modified inside the loop.  It does this in any case, independent of auto-vectorization, to work out whether it's safe, for example, to enregister that value for the duration of the loop.  I suppose this broaches the bigger question of: "why bother with 'const' if the backend is so paranoid that it checks anyway?" - but that would take us too far off-track. Remember that all compiler optimizations, and auto-vectorization in particular, are always guaranteed "safe".  A good compiler never makes code run faster, unless it can prove, 100% to itself, that the transformation yields the same result as the un-optimized version.  Contrast this with parallelization libraries, such as OpenMP, that do not perform such safety checks.  We'll see more of this a few posts ahead, when we dig into the topic of "dependence analysis".

  • Anonymous
    May 03, 2012
@C++: In some cases, yes, the auto-vectorizer will kick in for STL containers.  For example:

      vector<int> a(100), b(100), c(100);
      for (int n = 0; n < a.size(); ++n) a[n] = b[n] + c[n];

    will not vectorize.  But tweaking the loop, to pull out the call to size(), as follows, succeeds:

      int len = a.size();
      for (int n = 0; n < len; ++n) a[n] = b[n] + c[n];

    As we go thru the blog posts, we'll gather a "cookbook" of such cases, while aiming to smarten up the auto-vectorizer in future, so it recognizes them without help.

  • Anonymous
    May 03, 2012
    @yitsu: Yes, we will auto-vectorize certain special cases where a loop accesses a field in an array of structs.  These cases are where the array index is especially simple, a pattern that crops up in lots of "real world code" and several important benchmarks.  The resulting code still performs scatter/gather. The general case of vectorizing strided access thru an array requires deeper analysis than that for the above, special cases.  If the loop induction variable is 'n', then even analysing index forms of [n] or [2 * n] or [2 * n + K] ("affine" transforms) is non-trivial.  We'll get to this in a few posts ahead (dependence analysis). Whether you could persuade the auto-vectorizer into vectorizing strided access thru an array, by faking that data structure as an array of structs? - well, maybe.  But I would not encourage anyone to do this (except as an experiment in pushing the bounds of good taste :-)
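
    For example, a loop of roughly this shape (struct and field names invented, purely for illustration) is the kind of simple array-of-structs access meant above:

      struct Particle { float x, y, z, w; };
      Particle p[1000];
      // a simple index pattern on one field of an array of structs:
      for (int i = 0; i < 1000; ++i) p[i].x *= 2.0f;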

  • Anonymous
    May 03, 2012
The MSDN documentation on the __assume keyword is not very useful, and I was wondering in what ways you can use it for the auto-vectorizer (please document it more!). E.g., can I do something like:

      void vadd(T *a, const T *b, size_t n) __assume(n > 0 && n % 4 == 0);

    to alert the compiler that n is a positive multiple of 4? Or:

      __assume((uintptr_t)a & 0xF == 0 && (uintptr_t)b & 0xF == 0);

    to alert the compiler that a and b are 16-byte aligned? (I hope there are more readable ways to express this than the above.)

  • Anonymous
    May 03, 2012
@S. Colcord: In theory, a constexpr function would provide sufficient guarantees.  However, it's not yet supported by the Visual C++ compiler.  Implementing support is non-trivial: constexpr functions can be recursive.  Verifying their const-ness can therefore require lengthy analysis at compile-time.

  • Anonymous
    May 03, 2012
Jim, does VS support any pragma(s) to help / force the vectorizer? I spoke once to Herb Sutter and he seemed very much against pragmas...  ICC has a whole bunch, and I find them invaluable for optimization.

  • Anonymous
    May 03, 2012
    @Jim Hogg: It seems to me that to compile a constexpr function, you have to run it.  After all, the point is that the compiler can inline the result.  The main trick would be making sure that you don't waste time doing it more than necessary, maybe using something like a PCH.  Non-trivial, to be sure, but not impossible; gcc released support for constexpr a year ago.

  • Anonymous
    May 03, 2012
@asdf: I'd say the documentation in MSDN for __assume is clear, in the sense that it tells you enough that you can then go sprinkle them all over your code.  What's missing is any information on exactly which __assumes the optimizer acts upon! First thing to make clear is that whenever you write an __assume, you are providing a personal guarantee, signed in blood, that the condition holds true, now and forever.  And giving the optimizer the freedom to count upon that fact in order to generate faster code.  The optimizer is not forced to use that information, but it is allowed to.  However, you should be very, very careful in using __assume; because if you are wrong, or if some developer, not knowing about this personal "contract" forged between you and the compiler, changes your code in 2 years and breaks that assumption, then anything could happen! - your program might crash, or worse, might run to completion, but give wrong answers.  [You face a similar dilemma in using __declspec(restrict) and __declspec(noalias).]  Using __assume is like riding a motorcycle without a crash helmet. All that said, the auto-vectorizer (in this release) does not pay attention to any __assumes in how it vectorizes the code.

  • Anonymous
    May 03, 2012
@NitPick: I wrote a reply earlier, hit return, and it all disappeared.  Second time lucky: We have just one pragma: #pragma loop(no_vector), to disable auto-vectorization of that loop.  It's handy when you're testing, to measure the speedup without, and with, vectorization. It's also an 'escape hatch' if auto-vectorization ever makes a loop slower - although that should be very rare. Or maybe you have in mind a pragma, usually called "ivdep", short for Ignore Vector DEPendencies?  We don't support this.  Like __assume, discussed above, it's one of those "trust me, I know what I'm doing" features. [We will support an "ivdep" pragma for auto-parallelization - spreading computation across multiple cores.  But let's delay discussion of this until we're done with auto-vectorization]
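
    For concreteness, a sketch of how you would apply it (N, a, b and c stand in for whatever your loop works on):

      #pragma loop(no_vector)       // applies to the loop that follows
      for (int i = 0; i < N; ++i)
          a[i] = b[i] + c[i];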

  • Anonymous
    May 04, 2012
    @Jim   Thanks for all the replies.  Yes, those would be great.

  1. no_vector is great when I know the loop count is small.
  2. ivdep, sometimes with unroll, is very useful when the compiler can't prove loop invariance.  Consider: for (int i = 0; i < k; i++) a[i] = a[i+k] * delta;  I might know that k is always > 4, and unroll(4) would keep me safe.  Of course the compiler can create 2 versions of the loop and do the run-time test, but all that code gen would be a waste.
  3. assume_align, or pragma vector aligned, could be a good win (even when you do loop peeling to align the vector).
  4. nontemporal could be a big win on really large sets (2x on writes, if one knows what he is doing).  How big or small those wins are obviously depends on the code people are writing.  [Actually this is my second attempt at posting this; there could be an issue with the system]
  • Anonymous
    May 04, 2012
    @Jim Then I don't understand your reasoning in the post, where you said strided array accesses are not vectorized because there is little performance benefit. Without INSERTPS and EXTRACTPS, doesn't vectorizing struct accesses also require UNPCKLPS + MOVLHPS? What makes them faster than array accesses? That said, I do understand the point that getting dependence analysis right is tricky (Fourier-Motzkin, omega test, etc. come to mind), especially for a ver-1.0 vectorizer.

  • Anonymous
    May 04, 2012
    @NitPick:

  1.  Agreed.  We skip vectorizing if the loop is "too" small, choosing a trip count of 12 for the definition of "too".  Don't count on this, obviously - it's liable to change and/or include a heuristic in future.
  2. Yep, ivdep would have its uses.  But overall, we held back - in the wrong hands, it's dangerous, obviously.
  3. The trouble with __assume <anything> is the chaos that ensues if some user breaks that contract.  __declspec(align) might be useful in the future - with a runtime check that it's true.  Not sure how much this will matter going forwards, however - the difference between MOVUPS and MOVAPS is growing smaller with newer micro-architectures (so I've read - not actually measured it myself).
  4. Yes.  Again we'd have to decide whether to ask the user to provide a hint (pragma or declspec), or analyse to detect streaming access.  Another "to investigate" for the future.
  • Anonymous
    May 04, 2012
    @yitsu: Yes, you're right in principle.  In practice, for the samples of real-world code we examined, the loop body did enough computation on the enregistered data to make the overhead of scatter/gather worthwhile.  But that decision really needs to be backed by a heuristic in future. Dependence analysis - right!  Aiming to scratch that surface in a couple of posts ahead.

  • Anonymous
    May 04, 2012
    Can you also write a blog post documenting some of the __assume expressions the compiler currently recognizes to optimize code better? I have a sneaking suspicion that __assume(0) for unreachable code is the only expression the compiler recognizes because the programmer originally wrote it to handle unreachable code, then generalized it to arbitrary expression assumptions because handling unreachable code is too trivial for a compiler extension, and then went no further due to time constraints.

  • Anonymous
    May 05, 2012
    @asdf: This might be an interesting topic.  On the other hand, using __assume makes for brittle code.  It's not a technique I'd recommend for improving performance.  More fruitful areas might be to think about cache-awareness.  For example, in a matrix multiply, simply flip the conventional i-j-k loop into i-k-j and watch it go 3 times faster.
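
    For instance, here is a sketch of that interchange for square N x N matrices, stored row-major (names invented for illustration):

      // c = a * b.  Flipping i-j-k into i-k-j makes the innermost loop walk
      // b and c contiguously in memory, which is far kinder to the cache.
      void matmul_ikj(int N, const float* a, const float* b, float* c) {
          for (int i = 0; i < N * N; ++i) c[i] = 0.0f;
          for (int i = 0; i < N; ++i)
              for (int k = 0; k < N; ++k) {
                  float aik = a[i * N + k];
                  for (int j = 0; j < N; ++j)
                      c[i * N + j] += aik * b[k * N + j];
              }
      }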

  • Anonymous
    May 05, 2012
What about the double type? And what about CPUs that don't do SSE2?
    ■ AMD CPUs prior to Athlon 64, including all Socket A-based CPUs
    ■ Intel CPUs prior to Pentium 4
    ■ VIA C3
    ■ Transmeta Crusoe
    Or do these CPUs not run the target OS of VS11? I thought VS11 targets XP, or can be made to, and all those CPUs run XP. Is the compiler smart enough to not auto-vectorize if the CPU won't support it?

  • Anonymous
    May 06, 2012
    @DOUBLE SHOT: Yes, the auto-vectorizer also works on doubles.  So far, I've used only int and float in examples - where the maximum speedup is 4X.  With doubles, the maximum speedup is just 2X, since each SSE register can only hold two doubles, side-by-side. In VS11 we assume that the target machine will support SSE2.  This seems reasonable: SSE2 was introduced back in 2001, so it's been around a while now.  But if you need to target pre-SSE2 chips, just set the /arch:IA32 flag, and the compiler will continue to generate x87 floating-point instructions instead of SSE. Also, the auto-vectorizer will check, at runtime, whether the chip supports SSE4.1.  If so, it may use some of the instructions from that set.  In other words, we assume SSE2 is present, but might make use of newer instructions if the chip supports them.
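
    For example, a build line targeting those older chips might look like this (legacy.cpp is a placeholder name):

      cl /O2 /arch:IA32 legacy.cpp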

  • Anonymous
    October 16, 2012
    Jim, Is there any way to completely suppress the auto-vectorizer for all loops?

  • Anonymous
    October 17, 2012
@Susmit: I'm afraid not.  The compiler tries to perform dozens of different optimizations - auto-vectorization is just one more in the list.  We don't support switches that let you try them individually - in the typical case, you want your code to run as fast as possible. Out of interest - what would you use such a switch for?  Depending on your aims, you could use /O2 ("optimize for speed", which includes auto-vectorization) versus /O1 ("optimize for size", which doesn't).