# [SOLVED] Is it possible to vectorize non-trivial loop in C with SIMD? (multiple length 5 double-precision dot products reusing one input)

## Issue

I have performance-critical C code in which more than 90% of the time is spent on one basic operation. The C code I am using is:

```c
static void function(double *X1, double *Y1, double *X2, double *Y2, double *output) {
    double Z1, Z2;
    int i, j, k;
    for (i = 0, j = 0; i < 25; j++) { // sweep Y
        Z1 = 0;
        Z2 = 0;
        for (k = 0; k < 5; k++, i++) { // sweep X
            Z1 += X1[k] * Y1[i];
            Z2 += X2[k] * Y2[i];
        }
        output[j] = Z1 * Z2;
    }
}
```

The lengths are fixed (`X` is 5, `Y` is 25, and the output is 5). I have tried everything I know to make this faster. When I compile this code using clang with `-O3 -march=native -Rpass-analysis=loop-vectorize -Rpass=loop-vectorize -Rpass-missed=loop-vectorize`, I get this message:

```
remark: the cost-model indicates that vectorization is not beneficial [-Rpass-missed=loop-vectorize]
```

But I assume the way to make this faster is with SIMD somehow. Any suggestions would be appreciated.

## Solution

From the extended discussion in the comments, it looks like you are mostly interested in reducing the latency between reading `X1`, `X2` and writing `output`. What you are calculating is the element-wise product of two matrix-vector products. The two MV products can run quasi-parallel (with out-of-order execution), but each MV product requires the sum of five products, which you can compute either in sequence (as you are doing at the moment) or in a tree-like reduction:

```c
Z = ((X[0]*Y[0] + X[1]*Y[1]) + X[2]*Y[2])  +  (X[3]*Y[3] + X[4]*Y[4]);
```
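To make that concrete, here is a minimal scalar sketch of the tree-like reduction (hypothetical name `function_tree`, same array layout as the original loop). With FMA contraction enabled, the compiler can fuse the multiply/add pairs along each branch:

```c
/* Tree-reduction variant of the original scalar loop: the five products of
 * each dot product are summed as ((p0 + p1) + p2) + (p3 + p4), shortening
 * the dependency chain compared to four sequential additions. */
static void function_tree(const double *X1, const double *Y1,
                          const double *X2, const double *Y2,
                          double *output) {
    for (int j = 0; j < 5; j++) {
        const double *y1 = Y1 + 5 * j; // column for output j
        const double *y2 = Y2 + 5 * j;
        double Z1 = ((X1[0]*y1[0] + X1[1]*y1[1]) + X1[2]*y1[2])
                  + (X1[3]*y1[3] + X1[4]*y1[4]);
        double Z2 = ((X2[0]*y2[0] + X2[1]*y2[1]) + X2[2]*y2[2])
                  + (X2[3]*y2[3] + X2[4]*y2[4]);
        output[j] = Z1 * Z2;
    }
}
```

Note that the tree ordering changes the rounding of the sums slightly compared to sequential accumulation, which is usually acceptable here.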

This results in a critical path of `mulsd` → `fmaddsd` → `fmaddsd` → `addsd`, which is followed by the final `Z1*Z2` multiplication. That means, assuming 4 cycles of latency per FLOP, you’ll have a latency of 20 cycles plus the latency of reading and writing memory (unless you are able to keep everything in registers, which would require you to show the surrounding code). If you accumulate the values linearly, you’ll have a critical path of `mulsd` → `fmaddsd` → `fmaddsd` → `fmaddsd` → `fmaddsd` → `mulsd` (i.e., 24 cycles plus read/write latency).

Now, if you are able to change the memory order of `Y`, it would be beneficial to transpose these matrices, because then you can easily calculate `output[0..3]` in parallel (assuming you have AVX) by first broadcast-loading each entry of `X` and doing packed accumulation.

```c
#include <immintrin.h>

void function_fma( const double* X1, const double* Y1, const double* X2, const double* Y2, double* output )
{
    // Load X1 and X2 vectors into 10 registers.
    const __m256d x1_0 = _mm256_broadcast_sd( X1 );
    const __m256d x1_1 = _mm256_broadcast_sd( X1 + 1 );
    const __m256d x1_2 = _mm256_broadcast_sd( X1 + 2 );
    const __m256d x1_3 = _mm256_broadcast_sd( X1 + 3 );
    const __m256d x1_4 = _mm256_broadcast_sd( X1 + 4 );

    const __m256d x2_0 = _mm256_broadcast_sd( X2 );
    const __m256d x2_1 = _mm256_broadcast_sd( X2 + 1 );
    const __m256d x2_2 = _mm256_broadcast_sd( X2 + 2 );
    const __m256d x2_3 = _mm256_broadcast_sd( X2 + 3 );
    const __m256d x2_4 = _mm256_broadcast_sd( X2 + 4 );

    // first four values:
    {
        // Multiply column 0
        __m256d z1 = _mm256_mul_pd( x1_0, _mm256_loadu_pd( Y1 ) );
        __m256d z2 = _mm256_mul_pd( x2_0, _mm256_loadu_pd( Y2 ) );

        // Multiply + accumulate column 1 and column 2
        z1 = _mm256_fmadd_pd( x1_1, _mm256_loadu_pd( Y1 + 5 ), z1 );
        z2 = _mm256_fmadd_pd( x2_1, _mm256_loadu_pd( Y2 + 5 ), z2 );
        z1 = _mm256_fmadd_pd( x1_2, _mm256_loadu_pd( Y1 + 10 ), z1 );
        z2 = _mm256_fmadd_pd( x2_2, _mm256_loadu_pd( Y2 + 10 ), z2 );

        // Multiply column 3 into a second accumulator (tree reduction)
        __m256d z1_ = _mm256_mul_pd( x1_3, _mm256_loadu_pd( Y1 + 15 ) );
        __m256d z2_ = _mm256_mul_pd( x2_3, _mm256_loadu_pd( Y2 + 15 ) );

        // Multiply + accumulate column 4
        z1_ = _mm256_fmadd_pd( x1_4, _mm256_loadu_pd( Y1 + 20 ), z1_ );
        z2_ = _mm256_fmadd_pd( x2_4, _mm256_loadu_pd( Y2 + 20 ), z2_ );

        z1 = _mm256_add_pd( z1, z1_ );
        z2 = _mm256_add_pd( z2, z2_ );

        // Multiply and store result (unaligned store: `output` is not
        // guaranteed to be 32-byte aligned)
        _mm256_storeu_pd( output, _mm256_mul_pd( z1, z2 ) );
    }
    // last value:
    {
        // Multiply column 0
        __m128d z1 = _mm_mul_sd( _mm256_castpd256_pd128( x1_0 ), _mm_load_sd( Y1 + 4 ) );
        __m128d z2 = _mm_mul_sd( _mm256_castpd256_pd128( x2_0 ), _mm_load_sd( Y2 + 4 ) );

        // Multiply + accumulate column 1 and column 2
        z1 = _mm_fmadd_sd( _mm256_castpd256_pd128( x1_1 ), _mm_load_sd( Y1 + 9 ), z1 );
        z2 = _mm_fmadd_sd( _mm256_castpd256_pd128( x2_1 ), _mm_load_sd( Y2 + 9 ), z2 );
        z1 = _mm_fmadd_sd( _mm256_castpd256_pd128( x1_2 ), _mm_load_sd( Y1 + 14 ), z1 );
        z2 = _mm_fmadd_sd( _mm256_castpd256_pd128( x2_2 ), _mm_load_sd( Y2 + 14 ), z2 );

        // Multiply column 3 into a second accumulator (tree reduction)
        __m128d z1_ = _mm_mul_sd( _mm256_castpd256_pd128( x1_3 ), _mm_load_sd( Y1 + 19 ) );
        __m128d z2_ = _mm_mul_sd( _mm256_castpd256_pd128( x2_3 ), _mm_load_sd( Y2 + 19 ) );

        // Multiply + accumulate column 4
        z1_ = _mm_fmadd_sd( _mm256_castpd256_pd128( x1_4 ), _mm_load_sd( Y1 + 24 ), z1_ );
        z2_ = _mm_fmadd_sd( _mm256_castpd256_pd128( x2_4 ), _mm_load_sd( Y2 + 24 ), z2_ );

        z1 = _mm_add_sd( z1, z1_ );
        z2 = _mm_add_sd( z2, z2_ );

        // Multiply and store result
        _mm_store_sd( output + 4, _mm_mul_sd( z1, z2 ) );
    }
}
```
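The kernel above assumes `Y` has been transposed so that element `k` of output `j`’s column sits at `Yt[5*k + j]` (then each group of four consecutive elements feeds one packed load). A one-time helper to produce that layout could look like this (hypothetical name `transpose5x5`, a plain sketch for a 5×5 matrix of doubles):

```c
/* Convert Y from the original layout (five consecutive length-5 columns,
 * Y[5*j + k] = element k of output j) to the transposed layout
 * Yt[5*k + j], so packed loads pick up four outputs at once. */
static void transpose5x5(const double *Y, double *Yt) {
    for (int j = 0; j < 5; j++)
        for (int k = 0; k < 5; k++)
            Yt[5*k + j] = Y[5*j + k];
}
```

If `Y` is reused across many calls, do the transposition once up front; otherwise its cost may eat the gains.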

If you don’t have FMA, you can replace the fused operations with separate multiplications and additions (this will not change the latency much, since only the additions are on the critical path — though throughput could be about 50% worse, of course). Also, if you don’t have AVX, the first four values can be calculated as two pairs of two values each.
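For illustration, a sketch of that SSE2-only fallback (hypothetical name `function_sse2`, assuming the same transposed `Y` layout as above; multiply and add are issued separately, and the first four outputs are processed as two 128-bit halves):

```c
#include <emmintrin.h> /* SSE2 */

/* SSE2 fallback: output[0..1] and output[2..3] as two 128-bit halves,
 * output[4] in scalar. Expects the transposed layout Yt[5*k + j]. */
static void function_sse2(const double *X1, const double *Y1,
                          const double *X2, const double *Y2,
                          double *output) {
    __m128d z1lo = _mm_setzero_pd(), z1hi = _mm_setzero_pd();
    __m128d z2lo = _mm_setzero_pd(), z2hi = _mm_setzero_pd();
    double z1s = 0.0, z2s = 0.0;
    for (int k = 0; k < 5; k++) {
        const __m128d x1 = _mm_set1_pd(X1[k]); // broadcast X entries
        const __m128d x2 = _mm_set1_pd(X2[k]);
        z1lo = _mm_add_pd(z1lo, _mm_mul_pd(x1, _mm_loadu_pd(Y1 + 5*k)));
        z1hi = _mm_add_pd(z1hi, _mm_mul_pd(x1, _mm_loadu_pd(Y1 + 5*k + 2)));
        z2lo = _mm_add_pd(z2lo, _mm_mul_pd(x2, _mm_loadu_pd(Y2 + 5*k)));
        z2hi = _mm_add_pd(z2hi, _mm_mul_pd(x2, _mm_loadu_pd(Y2 + 5*k + 2)));
        z1s += X1[k] * Y1[5*k + 4]; // last output stays scalar
        z2s += X2[k] * Y2[5*k + 4];
    }
    _mm_storeu_pd(output,     _mm_mul_pd(z1lo, z2lo));
    _mm_storeu_pd(output + 2, _mm_mul_pd(z1hi, z2hi));
    output[4] = z1s * z2s;
}
```

This keeps the accumulation loop rolled for brevity; unrolling it as in the FMA kernel would expose the same tree-reduction opportunity.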