The rise of Machine Learning is bringing back the fashion for numerical tricks and hardware acceleration. And now that I’ve completed my bacterial DNA counting adventures at ETHZ, it’s time to reboot this blog.

As someone who believes in mechanical sympathy, I’d like to share some experiments in vectorizing simple yet fundamental operations. And if you have insights on how to beat the benchmarks below, I’d gladly accept them and extend the list.

Update 2019-02-18. Correction: vmulsd is a scalar, not a vectorized, instruction. Thanks Mårten!

After working for some time with a not-so-recent version of the .NET Framework, I couldn’t wait to check out .NET Core and all the performance goodies that come with it. For benchmarking I highly recommend BenchmarkDotNet: it’s an amazing tool that will take the best care of your benchmarking needs.

So let’s dig into the Multiply-Add operation and have a look at different ways to vectorize it in .NET Framework, .NET Core and MKL. The code examples are in F#, but everything is of course applicable to other .NET languages as well. Try out your performance intuition ;)

Is vectorization more efficient on .NET Framework 4.7.2 or .NET Core 2.1.6?

The solution with examples is here. It doesn’t include the custom-built Experimental.Mkl.dll, but the list of functions (Experimental.Mkl.Functions.txt) and the template for the commands required to build it (BuildMkl.bat) are there, as well as the full benchmark outputs. Describing MKL features is not the goal of this post; MKL is used only as another reference result to compare against the .NET implementations, so its setup is not discussed in detail. If you’re interested in finding out more about the Intel MKL Builder or the library itself, please check out the Intel documentation.

Note that the tests were performed on a single Windows 10 machine only.

BenchmarkDotNet=v0.11.3, OS=Windows 10.0.17134.523 (1803/April2018Update/Redstone4)       
Intel Xeon E-2176M CPU 2.70GHz, 1 CPU, 12 logical and 6 physical cores                    
Frequency=2648441 Hz, Resolution=377.5806 ns, Timer=TSC                                   
.NET Core SDK=2.1.500                                                                     
  [Host] : .NET Core 2.1.6 (CoreCLR 4.6.27019.06, CoreFX 4.6.27019.05), 64bit RyuJIT DEBUG
  Clr    : .NET Framework 4.7.2 (CLR 4.0.30319.42000), 64bit RyuJIT-v4.7.3260.0           
  Core   : .NET Core 2.1.6 (CoreCLR 4.6.27019.06, CoreFX 4.6.27019.05), 64bit RyuJIT      

Multiply-Add

What are the first operations on vectors that you can think of? I’d bet add and scale (multiplication by a scalar) are in this list. To make things more interesting, let’s talk about their combo:

$$\mathbf{z} = a\mathbf{x} + \mathbf{y}$$

This kind of operation is not only common in numerical code, but also has potential for a performance boost, because modern processors support so-called fused multiply-add operations, or FMA, as an extension to SIMD instructions (a nice introduction to the topic is available here). Those interested in GPU computing can also benefit from FMAs in CUDA C code: the compiler is able to generate them most of the time, and writing fma directly is of course also an option.
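As a side note, the .NET Core version benchmarked below (2.1) doesn’t expose FMA to user code, but later runtimes do. A minimal sketch, assuming .NET Core 3.0+ where Math.FusedMultiplyAdd maps to a hardware FMA instruction when one is available:

let fmaScalar (a : double) (x : double) (y : double) =
    // a * x + y computed with a single rounding step
    System.Math.FusedMultiplyAdd (a, x, y)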

We consider two scenarios:

  • Out of place, $ \mathbf{z} = a\mathbf{x} + \mathbf{y} $ (the result is stored into the 3rd array)
  • In place, $ \mathbf{y} = a\mathbf{x} + \mathbf{y} $ (one of the arrays is overwritten)

In place or Out of place?

For simplicity, all arrays are of type double[] (F#’s float[]) and are assumed to have valid dimensions. The relative results for single[] (float32[]) are very similar, but with ~2x better performance.

Baseline

The baseline for our tests is a good old for loop. This is not something you’d often see in F# code, but for some applications an immutable-only style would simply be too expensive.

let forIn (a : double) (xs : double[]) (ys : double[]) =
    for i = 0 to ys.Length - 1 do
        ys.[i] <- a * xs.[i] + ys.[i]
    ys
    

let forOut (a : double) (xs : double[]) (ys : double[]) (zs : double[]) =
    for i = 0 to zs.Length - 1 do
        zs.[i] <- a * xs.[i] + ys.[i]
    zs

Vectors

For or Vectors?

.NET provides a set of SIMD-enabled types, including the Vector<'T> that we are going to test here. For example, .NET hardware intrinsics have already been applied successfully to improve the ML.NET implementation.

There are multiple ways to implement multiply-add using Vectors, and we need to take the following aspects into consideration.

Arrays and Alignment

To align or not to align?

There are two options for defining the input arrays: we can either use the data as is, or align the array lengths to the number of elements in a vector, Vector<double>.Count (which equals 4 on my machine). Intuitively, aligned arrays should be more SIMD-friendly, because handling boundary cases is then not required.
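The benchmark code below also uses a small helper, lengthAlignedLeq, for the not-aligned case. Its actual definition lives in the linked solution; a minimal sketch of what it presumably looks like:

open System.Numerics

/// The largest multiple of Vector<double>.Count that does not exceed length.
let lengthAlignedLeq (length : int) =
    length - length % Vector<double>.Count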

Multiplication Factor

(a : double) or Vector<double>(a)?

The factor a can be represented either as a double or as a Vector<double>, which is interesting to check as well.
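For illustration, here are the two representations side by side (assuming System.Numerics is opened as above); Vector<double>(a) simply broadcasts the scalar into every SIMD lane:

let a       = 2.0                // plain scalar factor
let aVector = Vector<double>(a)  // (a, a, a, a) when Vector<double>.Count = 4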

Span

double[] with Vector(xs, i * 4) or Vector<double>[] with xs.[i]?

Span<'T> can help to cast regular arrays to arrays of vectors, which is particularly helpful for overwriting the contents of the result array. When it comes to function inputs, however, we are free to choose between a Vector constructed with an offset and arrays of vectors. To be fair, another way to save the results is Vector.CopyTo, but it seems to be far less efficient, so let’s skip this case. For a comprehensive introduction to Span, have a look at this post by Adam Sitnik.
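To make the idea concrete, here is a minimal sketch of the cast and of the Vector.CopyTo alternative dismissed above; asVectors and copyToIn are illustrative names, and the array lengths are assumed to be aligned:

open System
open System.Numerics
open System.Runtime.InteropServices

/// Reinterpret a double[] as a span of SIMD vectors; no copying involved.
let asVectors (xs : double[]) =
    MemoryMarshal.Cast<double, Vector<double>>(Span<double>(xs))

/// The CopyTo variant: compute into a temporary vector, then write it back.
let copyToIn (a : double) (xs : double[]) (ys : double[]) =
    let v = Vector<double>.Count
    for i in 0 .. v .. ys.Length - v do
        (Vector(a) * Vector(xs, i) + Vector(ys, i)).CopyTo(ys, i)
    ys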

/// Out of place, arrays are not aligned, vector multiplier and inputs are not spans.
let vectorNotAlignedOut (a : double) (xs : double[]) (ys : double[]) (zs : double[]) =
    // lengthAlignedLeq: the largest multiple of Vector<double>.Count <= length (sketched above)
    let nAligned = lengthAlignedLeq xs.Length
    let zsVector = MemoryMarshal.Cast<double, Vector<double>>(Span<double>(zs, 0, nAligned))
    let aVector  = Vector(a)
    let v        = Vector<double>.Count

    // vectorized loop over the aligned prefix
    for i = 0 to zsVector.Length - 1 do
        zsVector.[i] <- aVector * Vector(xs, i * v) + Vector(ys, i * v)

    // scalar loop over the remaining tail
    for i = nAligned to zs.Length - 1 do
        zs.[i] <- a * xs.[i] + ys.[i]
    zs


/// Out of place, arrays are aligned, scalar multiplier and inputs are not spans.
let vectorScalarOut (a : double) (xs : double[]) (ys : double[]) (zs : double[]) =
    let zsVector = MemoryMarshal.Cast<double, Vector<double>>(Span<double>(zs))
    let v        = Vector<double>.Count

    // the scalar factor a triggers a call to Vector.op_Multiply on every iteration (see below)
    for i = 0 to zsVector.Length - 1 do
        zsVector.[i] <- a * Vector(xs, i * v) + Vector(ys, i * v)
    zs


/// In place, arrays are aligned, vector multiplier and inputs are not spans.
let vectorIn (a : double) (xs : double[]) (ys : double[]) =
    let ysVector = MemoryMarshal.Cast<double, Vector<double>>(Span<double>(ys))
    let aVector  = Vector(a)
    let v        = Vector<double>.Count

    for i = 0 to ysVector.Length - 1 do
        ysVector.[i] <- aVector * Vector(xs, i * v) + ysVector.[i]
    ys


/// Out of place, arrays are aligned, vector multiplier and inputs are not spans.
let vectorOut (a : double) (xs : double[]) (ys : double[]) (zs : double[]) =
    let zsVector = MemoryMarshal.Cast<double, Vector<double>>(Span<double>(zs))
    let aVector  = Vector(a)
    let v        = Vector<double>.Count

    for i = 0 to zsVector.Length - 1 do
        zsVector.[i] <- aVector * Vector(xs, i * v) + Vector(ys, i * v)
    zs


/// Out of place, arrays are aligned, vector multiplier and inputs are spans.
let vectorSpansOut (a : double) (xs : double[]) (ys : double[]) (zs : double[]) =
    let xsVector = MemoryMarshal.Cast<double, Vector<double>>(Span<double>(xs))
    let ysVector = MemoryMarshal.Cast<double, Vector<double>>(Span<double>(ys))
    let zsVector = MemoryMarshal.Cast<double, Vector<double>>(Span<double>(zs))
    let aVector  = Vector(a)

    for i = 0 to zsVector.Length - 1 do
        zsVector.[i] <- aVector * xsVector.[i] + ysVector.[i]
    zs

Here are some of the results of the vectors benchmark:

  • .NET Core code is up to ~1.8 times faster than .NET Framework (the Clr job); this is expected, because only .NET Core supports vectorization natively.
  • For the same reason the relative results are not very consistent between the frameworks, so from here on we refer only to .NET Core.
  • The aligned version is slightly faster than the not-aligned one, because the boundary conditions don’t need to be handled separately; for large arrays, however, this no longer seems to be the case.
| Method | Job | N | Mean | Error | StdDev | StdErr | Min | Q1 | Median | Q3 | Max | Op/s | Ratio | RatioSD | Gen 0/1k Op | Gen 1/1k Op | Gen 2/1k Op | Allocated Memory/Op |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| NotAlignedOut | Clr | 100 | 53.441 ns | 0.1227 ns | 0.1148 ns | 0.0296 ns | 53.264 ns | 53.348 ns | 53.438 ns | 53.518 ns | 53.646 ns | 18,712,195.0 | 1.84 | 0.01 | - | - | - | - |
| AlignedOut | Clr | 100 | 56.728 ns | 0.1804 ns | 0.1599 ns | 0.0427 ns | 56.557 ns | 56.602 ns | 56.699 ns | 56.792 ns | 57.161 ns | 17,628,008.5 | 1.95 | 0.01 | - | - | - | - |
| NotAlignedOut | Core | 100 | 29.076 ns | 0.0799 ns | 0.0708 ns | 0.0189 ns | 28.994 ns | 29.024 ns | 29.060 ns | 29.097 ns | 29.263 ns | 34,392,627.8 | 1.00 | 0.00 | - | - | - | - |
| AlignedOut | Core | 100 | 29.002 ns | 0.4895 ns | 0.4339 ns | 0.1160 ns | 28.459 ns | 28.614 ns | 28.902 ns | 29.278 ns | 29.991 ns | 34,480,284.3 | 1.00 | 0.02 | - | - | - | - |
| NotAlignedOut | Clr | 103 | 57.559 ns | 0.3002 ns | 0.2661 ns | 0.0711 ns | 57.325 ns | 57.353 ns | 57.440 ns | 57.645 ns | 58.127 ns | 17,373,491.3 | 1.80 | 0.01 | - | - | - | - |
| AlignedOut | Clr | 103 | 58.622 ns | 0.2398 ns | 0.2243 ns | 0.0579 ns | 58.206 ns | 58.375 ns | 58.711 ns | 58.783 ns | 58.931 ns | 17,058,484.3 | 1.84 | 0.01 | - | - | - | - |
| NotAlignedOut | Core | 103 | 31.917 ns | 0.0629 ns | 0.0491 ns | 0.0142 ns | 31.848 ns | 31.880 ns | 31.900 ns | 31.963 ns | 31.995 ns | 31,331,161.4 | 1.00 | 0.00 | - | - | - | - |
| AlignedOut | Core | 103 | 29.720 ns | 0.0881 ns | 0.0824 ns | 0.0213 ns | 29.612 ns | 29.634 ns | 29.727 ns | 29.769 ns | 29.884 ns | 33,647,276.1 | 0.93 | 0.00 | - | - | - | - |
  • As one might expect, in place is faster than out of place. For example, introducing another array also requires additional boundary checks; watch out for the cmp and jae instructions in the disassembly:
cmp     r14d,esi
jae     00007ff9`02324920
| Method | Job | N | Mean | Error | StdDev | StdErr | Min | Q1 | Median | Q3 | Max | Op/s | Ratio | RatioSD | Gen 0/1k Op | Gen 1/1k Op | Gen 2/1k Op | Allocated Memory/Op |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| AlignedOut | Core | 100003 | 36,877.708 ns | 356.7336 ns | 333.6889 ns | 86.1581 ns | 36,401.196 ns | 36,489.669 ns | 36,927.122 ns | 37,110.404 ns | 37,340.884 ns | 27,116.7 | 1.04 | 0.01 | - | - | - | - |
| AlignedIn | Core | 100003 | 34,712.877 ns | 231.3058 ns | 216.3635 ns | 55.8648 ns | 34,474.578 ns | 34,511.612 ns | 34,684.755 ns | 34,814.202 ns | 35,143.618 ns | 28,807.8 | 0.98 | 0.01 | - | - | - | - |
  • Multiplying by an a : double value is really not a good idea and leads to a ~2x performance decrease compared to the Vector<double>(a) version. The reason is that the single value is not automatically broadcast to a vector, so instead of a vmulpd instruction a call to a CoreLib function is performed:
call    System.Numerics.Vector`1[[System.Double, System.Private.CoreLib]].op_Multiply(Double, System.Numerics.Vector`1<Double>)
| Method | Job | N | Mean | Error | StdDev | StdErr | Min | Q1 | Median | Q3 | Max | Op/s | Ratio | RatioSD | Gen 0/1k Op | Gen 1/1k Op | Gen 2/1k Op | Allocated Memory/Op |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| NotAlignedOut | Core | 100003 | 35,495.294 ns | 324.4563 ns | 270.9357 ns | 75.1440 ns | 34,658.200 ns | 35,480.689 ns | 35,562.836 ns | 35,595.042 ns | 35,808.595 ns | 28,172.7 | 1.00 | 0.00 | - | - | - | - |
| ScalarOut | Core | 100003 | 77,333.464 ns | 749.1684 ns | 664.1182 ns | 177.4931 ns | 76,488.622 ns | 76,955.850 ns | 77,255.329 ns | 77,621.272 ns | 79,002.630 ns | 12,931.0 | 2.18 | 0.02 | - | - | - | - |
  • Finally, the Span<Vector<double>> version seems to be faster in all cases except for large arrays. This is interesting behaviour that would need closer investigation to explain; I’ll leave it for another time.
| Method | Job | N | Mean | Error | StdDev | StdErr | Min | Q1 | Median | Q3 | Max | Op/s | Ratio | RatioSD | Gen 0/1k Op | Gen 1/1k Op | Gen 2/1k Op | Allocated Memory/Op |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| AlignedOut | Core | 1003 | 304.120 ns | 1.2893 ns | 1.2060 ns | 0.3114 ns | 302.499 ns | 303.135 ns | 303.741 ns | 305.064 ns | 306.713 ns | 3,288,176.4 | 0.96 | 0.00 | - | - | - | - |
| SpansOut | Core | 1003 | 247.484 ns | 2.4008 ns | 2.2457 ns | 0.5798 ns | 244.554 ns | 245.691 ns | 246.610 ns | 248.767 ns | 253.063 ns | 4,040,668.9 | 0.78 | 0.01 | - | - | - | - |
| AlignedOut | Core | 100003 | 36,877.708 ns | 356.7336 ns | 333.6889 ns | 86.1581 ns | 36,401.196 ns | 36,489.669 ns | 36,927.122 ns | 37,110.404 ns | 37,340.884 ns | 27,116.7 | 1.04 | 0.01 | - | - | - | - |
| SpansOut | Core | 100003 | 38,141.007 ns | 418.7630 ns | 349.6861 ns | 96.9855 ns | 37,677.874 ns | 37,876.792 ns | 37,998.140 ns | 38,378.497 ns | 38,960.044 ns | 26,218.5 | 1.07 | 0.01 | - | - | - | - |

MKL

Applying matrix functions: [n x 1] or [1 x n]?

According to the documentation, MKL has multiple functions we could use to implement our multiply-add operation among the BLAS Level 1 routines and BLAS-like extensions. Without going into too much detail: there are the lower-level routines and the extensions; the latter support transpose operations and different matrix formats, as well as offsets between rows or columns. Here we have some freedom in picking the dimensions of the matrix, because the inputs are represented as 1D arrays anyway: for example, [n x 1] vs [1 x n] for an array with n elements.
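The MKL module with the extern bindings isn’t shown in this post (it ships with the linked solution), but it presumably looks something like this sketch, with signatures following the CBLAS C API:

module MKL =
    open System.Runtime.InteropServices

    [<DllImport("Experimental.Mkl.dll", CallingConvention = CallingConvention.Cdecl)>]
    extern void cblas_daxpy (int n, double a, double[] x, int incx, double[] y, int incy)

    [<DllImport("Experimental.Mkl.dll", CallingConvention = CallingConvention.Cdecl)>]
    extern void cblas_daxpby (int n, double a, double[] x, int incx, double b, double[] y, int incy)

    [<DllImport("Experimental.Mkl.dll", CallingConvention = CallingConvention.Cdecl)>]
    extern void cblas_dcopy (int n, double[] x, int incx, double[] y, int incy)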

/// In place, CBLAS daxpy 
let mklAxpyIn (a : double) (xs : double[]) (ys : double[]) =
    MKL.cblas_daxpy (xs.Length, a, xs, 1, ys, 1)
    ys


/// In place, CBLAS daxpby
let mklAxpbyIn (a : double) (xs : double[]) (ys : double[]) =
    MKL.cblas_daxpby(xs.Length, a, xs, 1, 1.0, ys, 1)
    ys
    

/// Out of place, CBLAS dcopy + daxpy 
let mklAxpyOut (a : double) (xs : double[]) (ys : double[]) (zs : double[]) =
    MKL.cblas_dcopy (xs.Length, ys, 1, zs, 1)
    MKL.cblas_daxpy (xs.Length, a, xs, 1, zs, 1)
    zs


/// Out of place, mkl_domatadd with "matrix" dimensions [1 x n]
let mklMataddRowOut (a : double) (xs : double[]) (ys : double[]) (zs : double[]) =
    MKL.MKL_Domatadd ('R', 'N', 'N', 1, xs.Length, a, xs, xs.Length, 1.0, ys, ys.Length, zs, zs.Length)
    zs


/// Out of place, mkl_domatadd with "matrix" dimensions [n x 1]
let mklMataddColOut (a : double) (xs : double[]) (ys : double[]) (zs : double[]) =
    MKL.MKL_Domatadd ('R', 'N', 'N', xs.Length, 1, a, xs, 1, 1.0, ys, 1, zs, 1)
    zs

A quick look at the results shows that one has to be very careful with function selection:

  • While the flexible extensions are very useful in some scenarios, our multiply-add use case is not that complex, and mkl_domatadd is 4-8 times slower than the simple cblas_daxpy. To be fair, the char type for the arguments in the extern definition of mkl_domatadd is not a smart choice because of the allocations it causes; it’s used here for clarity. A better solution is to use integers (see the sketch after the results table below), but the function would still be slower than cblas_daxpy (and the difference is noticeable mostly on smaller arrays anyway).

  • Interestingly, the single-row format [1 x n] is 3-7 times more efficient than the single-column format [n x 1] (for the large arrays the latter runs at 53x baseline!).

  • Out of place is around 2 times slower, because it has to be performed as a copy followed by the in-place cblas_daxpy.

| Method | Job | N | Mean | Error | StdDev | StdErr | Median | Min | Q1 | Q3 | Max | Op/s | Ratio | RatioSD | Gen 0/1k Op | Gen 1/1k Op | Gen 2/1k Op | Allocated Memory/Op |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| AxpyIn | Core | 103 | 28.40 ns | 0.1583 ns | 0.1404 ns | 0.0375 ns | 28.36 ns | 28.25 ns | 28.32 ns | 28.42 ns | 28.74 ns | 35,205,224.0 | 1.00 | 0.00 | - | - | - | - |
| AxpbyIn | Core | 103 | 33.16 ns | 0.0733 ns | 0.0612 ns | 0.0170 ns | 33.14 ns | 33.08 ns | 33.12 ns | 33.18 ns | 33.32 ns | 30,157,209.4 | 1.17 | 0.01 | - | - | - | - |
| AxpyOut | Core | 103 | 49.94 ns | 0.2313 ns | 0.2050 ns | 0.0548 ns | 49.86 ns | 49.71 ns | 49.81 ns | 49.98 ns | 50.41 ns | 20,025,979.5 | 1.76 | 0.01 | - | - | - | - |
| MataddRowOut | Core | 103 | 119.93 ns | 0.3128 ns | 0.2773 ns | 0.0741 ns | 119.95 ns | 119.48 ns | 119.72 ns | 120.03 ns | 120.47 ns | 8,338,274.5 | 4.22 | 0.02 | 0.0150 | - | - | 96 B |
| MataddColOut | Core | 103 | 385.99 ns | 0.9004 ns | 0.7519 ns | 0.2085 ns | 386.11 ns | 384.55 ns | 385.41 ns | 386.45 ns | 387.34 ns | 2,590,771.2 | 13.59 | 0.08 | 0.0148 | - | - | 96 B |
| AxpyIn | Core | 100003 | 5,357.79 ns | 148.5879 ns | 423.9297 ns | 43.7250 ns | 5,240.98 ns | 4,563.81 ns | 5,103.49 ns | 5,581.75 ns | 6,401.51 ns | 186,644.0 | 1.00 | 0.00 | - | - | - | - |
| AxpbyIn | Core | 100003 | 5,840.88 ns | 122.6757 ns | 253.3467 ns | 35.1329 ns | 5,739.96 ns | 5,540.38 ns | 5,632.27 ns | 5,933.44 ns | 6,444.85 ns | 171,207.2 | 1.11 | 0.10 | - | - | - | - |
| AxpyOut | Core | 100003 | 14,290.63 ns | 285.8218 ns | 267.3579 ns | 69.0315 ns | 14,251.62 ns | 13,936.29 ns | 14,060.52 ns | 14,463.59 ns | 14,936.70 ns | 69,975.9 | 2.77 | 0.34 | - | - | - | - |
| MataddRowOut | Core | 100003 | 40,185.48 ns | 81.8306 ns | 72.5407 ns | 19.3873 ns | 40,191.22 ns | 40,071.33 ns | 40,142.59 ns | 40,216.19 ns | 40,332.23 ns | 24,884.6 | 7.83 | 0.93 | - | - | - | 96 B |
| MataddColOut | Core | 100003 | 271,920.30 ns | 2,036.6226 ns | 1,805.4129 ns | 482.5169 ns | 271,196.07 ns | 270,153.21 ns | 270,761.24 ns | 272,728.61 ns | 276,126.84 ns | 3,677.5 | 52.96 | 6.37 | - | - | - | 96 B |
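Regarding the integer-argument variant mentioned above: a hedged sketch of what the binding might look like, passing 1-byte ASCII codes (sbyte) instead of marshalling 2-byte .NET chars. The name MKL_Domatadd2 and the exact signature are illustrative assumptions, not the actual code from the solution:

[<DllImport("Experimental.Mkl.dll", CallingConvention = CallingConvention.Cdecl)>]
extern void MKL_Domatadd2 (sbyte ordering, sbyte transa, sbyte transb,
                           int rows, int cols,
                           double alpha, double[] A, int lda,
                           double beta, double[] B, int ldb,
                           double[] C, int ldc)

// usage (row variant):
// MKL_Domatadd2 (sbyte 'R', sbyte 'N', sbyte 'N', 1, xs.Length, a, xs, xs.Length, 1.0, ys, ys.Length, zs, zs.Length)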

Results

So is there a difference between for loops and vectorized code?

  • .NET double Vectors are approximately 2x faster than the for loop. So what kind of instructions does RyuJIT use in the loops?
vmovaps xmm1,xmm0
vmulsd  xmm1,xmm1,mmword ptr [rdx+r9*8+10h]
vaddsd  xmm1,xmm1,mmword ptr [r8+r9*8+10h]
vmovsd  qword ptr [r8+r9*8+10h],xmm1

Compare this to the similar code that the legacy JIT would generate for this function (e.g. the SSE instruction addsd vs the AVX vaddsd):

movapd      xmm0,xmm1  
mulsd       xmm0,mmword ptr [r10+r9+10h]  
addsd       xmm0,mmword ptr [r8+r9+10h]  
movsd       mmword ptr [r8+r9+10h],xmm0  

And here is the more efficient version with vectors (note the pd suffix, for "packed double", in the instruction names):

vbroadcastsd ymm0,xmm0
...
vmulpd  ymm1,ymm0,ymm1
vmovupd ymm2,ymmword ptr [r9]
vaddpd  ymm1,ymm1,ymm2
vmovupd ymmword ptr [rcx],ymm1
  • MKL has some overhead or shows similar results on smaller arrays, but delivers better performance (up to 12x in this test) on large ones. One of the reasons is that MKL optimizes multiply-add operations even further by using an appropriate FMA instruction, vfmadd213pd. More information about Advanced Vector Extensions (AVX) is available here.
vfmadd213pd ymm0,ymm7,ymmword ptr [rsi+FFFFFFFFFFFFFF00h]

Summing up, the performance improvements in .NET Core and the modern RyuJIT are quite impressive, and for smaller arrays the results are even competitive with the native MKL library. Using vectorized instructions helps to make the code significantly faster. At the same time, the JIT doesn’t perform all possible optimizations (yet?), as we’ve seen with the specialized FMA instruction, and one has to be very careful when multiplying vectors by a scalar factor, because it makes quite a difference.

Tricks like casting numerical arrays to arrays of vectors and aligning array lengths not only simplify the code and make it more robust and clear, but also leverage mechanical sympathy. So let’s start writing a bit more machine-friendly code and enjoy the results!

| Method | Job | N | Mean | Error | StdDev | StdErr | Min | Q1 | Median | Q3 | Max | Op/s | Ratio | RatioSD | Gen 0/1k Op | Gen 1/1k Op | Gen 2/1k Op | Allocated Memory/Op |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| For | Core | 4 | 3.727 ns | 0.0384 ns | 0.0359 ns | 0.0093 ns | 3.682 ns | 3.707 ns | 3.718 ns | 3.751 ns | 3.800 ns | 268,323,787.9 | 1.00 | 0.00 | - | - | - | - |
| Vector | Core | 4 | 2.363 ns | 0.0137 ns | 0.0121 ns | 0.0032 ns | 2.343 ns | 2.356 ns | 2.364 ns | 2.366 ns | 2.387 ns | 423,166,784.1 | 0.63 | 0.01 | - | - | - | - |
| VectorSpans | Core | 4 | 2.437 ns | 0.0111 ns | 0.0098 ns | 0.0026 ns | 2.420 ns | 2.431 ns | 2.435 ns | 2.444 ns | 2.458 ns | 410,359,292.0 | 0.65 | 0.01 | - | - | - | - |
| Axpy | Core | 4 | 23.044 ns | 0.0560 ns | 0.0496 ns | 0.0133 ns | 22.958 ns | 23.014 ns | 23.036 ns | 23.071 ns | 23.140 ns | 43,394,548.5 | 6.19 | 0.06 | - | - | - | - |
| For | Core | 103 | 69.797 ns | 0.1113 ns | 0.1041 ns | 0.0269 ns | 69.611 ns | 69.729 ns | 69.780 ns | 69.865 ns | 70.045 ns | 14,327,201.8 | 1.00 | 0.00 | - | - | - | - |
| Vector | Core | 103 | 30.588 ns | 0.1315 ns | 0.1230 ns | 0.0318 ns | 30.453 ns | 30.512 ns | 30.534 ns | 30.679 ns | 30.866 ns | 32,692,121.7 | 0.44 | 0.00 | - | - | - | - |
| VectorSpans | Core | 103 | 26.251 ns | 0.0343 ns | 0.0286 ns | 0.0079 ns | 26.193 ns | 26.245 ns | 26.249 ns | 26.268 ns | 26.310 ns | 38,094,025.8 | 0.38 | 0.00 | - | - | - | - |
| Axpy | Core | 103 | 26.805 ns | 0.0638 ns | 0.0533 ns | 0.0148 ns | 26.701 ns | 26.783 ns | 26.803 ns | 26.831 ns | 26.910 ns | 37,305,908.8 | 0.38 | 0.00 | - | - | - | - |
| For | Core | 1003 | 614.265 ns | 1.9209 ns | 1.6041 ns | 0.4449 ns | 612.373 ns | 612.876 ns | 614.061 ns | 615.123 ns | 618.375 ns | 1,627,963.2 | 1.00 | 0.00 | - | - | - | - |
| Vector | Core | 1003 | 255.404 ns | 0.4940 ns | 0.4621 ns | 0.1193 ns | 254.778 ns | 255.094 ns | 255.329 ns | 255.930 ns | 256.335 ns | 3,915,359.8 | 0.42 | 0.00 | - | - | - | - |
| VectorSpans | Core | 1003 | 210.470 ns | 0.3816 ns | 0.3382 ns | 0.0904 ns | 209.864 ns | 210.221 ns | 210.446 ns | 210.580 ns | 211.158 ns | 4,751,267.2 | 0.34 | 0.00 | - | - | - | - |
| Axpy | Core | 1003 | 110.691 ns | 0.2618 ns | 0.2186 ns | 0.0606 ns | 110.399 ns | 110.473 ns | 110.689 ns | 110.879 ns | 110.996 ns | 9,034,150.3 | 0.18 | 0.00 | - | - | - | - |
| For | Core | 100003 | 60,780.472 ns | 148.1639 ns | 131.3434 ns | 35.1030 ns | 60,565.275 ns | 60,668.842 ns | 60,819.284 ns | 60,897.133 ns | 60,966.039 ns | 16,452.7 | 1.00 | 0.00 | - | - | - | - |
| Vector | Core | 100003 | 31,056.083 ns | 164.5393 ns | 145.8598 ns | 38.9827 ns | 30,897.109 ns | 30,965.186 ns | 30,998.533 ns | 31,125.262 ns | 31,459.747 ns | 32,199.8 | 0.51 | 0.00 | - | - | - | - |
| VectorSpans | Core | 100003 | 32,314.710 ns | 151.4383 ns | 141.6555 ns | 36.5753 ns | 32,134.773 ns | 32,228.085 ns | 32,260.626 ns | 32,414.202 ns | 32,587.829 ns | 30,945.7 | 0.53 | 0.00 | - | - | - | - |
| Axpy | Core | 100003 | 5,178.782 ns | 103.2439 ns | 241.3293 ns | 29.9332 ns | 4,488.865 ns | 5,050.068 ns | 5,193.735 ns | 5,336.972 ns | 5,628.889 ns | 193,095.6 | 0.08 | 0.01 | - | - | - | - |
  1. Sources and benchmark results
  2. BenchmarkDotNet and its DisassemblyDiagnoser
  3. Span
  4. SIMD-enabled types
  5. SIMD in C#
  6. Using .NET hardware intrinsics API to accelerate Machine Learning scenarios
  7. BLAS routines
  8. Intel AVX2
  9. FMAs in CUDA C