## Issue

I have performance-critical **C** code where over 90% of the time is spent doing one basic operation.

The **C** code I am using is:

```
static void function(double *X1, double *Y1, double *X2, double *Y2, double *output) {
    double Z1, Z2;
    int i, j, k;
    for (i = 0, j = 0; i < 25; j++) { // sweep Y
        Z1 = 0;
        Z2 = 0;
        for (k = 0; k < 5; k++, i++) { // sweep X
            Z1 += X1[k] * Y1[i];
            Z2 += X2[k] * Y2[i];
        }
        output[j] = Z1 * Z2;
    }
}
```

The lengths are fixed (X is 5, Y is 25, and the output is 5). I have tried everything I know to make this faster. When I compile this code using clang with `-O3 -march=native -Rpass-analysis=loop-vectorize -Rpass=loop-vectorize -Rpass-missed=loop-vectorize`, I get this message:

```
remark: the cost-model indicates that vectorization is not beneficial [-Rpass-missed=loop-vectorize]
```

But I assume the way to make this faster is with SIMD somehow. Any suggestions would be appreciated.

## Solution

From the extended discussion in the comments, it looks like you are mostly interested in reducing the latency between reading `X1`, `X2` and writing `output`. What you are calculating is the element-wise product of two matrix-vector products. The two MV products can happen quasi-parallel (with out-of-order execution), but each MV product requires the sum of five products, which you can compute either in a sequence (as you are doing at the moment) or in a tree-like reduction:

```
Z = ((X[0]*Y[0] + X[1]*Y[1]) + X[2]*Y[2]) + (X[3]*Y[3] + X[4]*Y[4]);
```

This results in a critical path of `mulsd` – `fmaddsd` – `fmaddsd` – `addsd`, followed by the final multiplication `Z1*Z2`. That means, assuming 4 cycles of latency per FLOP, you'll have a latency of 20 cycles plus the latency of reading and writing memory (unless you are able to keep everything in registers, which requires you to show the surrounding code). If you accumulate the values linearly, you'll have a critical path of `mulsd` – `fmaddsd` – `fmaddsd` – `fmaddsd` – `fmaddsd` – `mulsd` (i.e. 24 cycles + read/write).
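
For reference, here is a minimal scalar sketch of that tree-shaped reduction applied to your function (the name `function_tree` is hypothetical; whether the compiler actually emits FMAs for it depends on flags such as `-mfma` and `-ffp-contract`):

```
/* Hypothetical scalar variant: same row-major layout as the question's
   code, but each 5-term dot product is summed as a tree instead of a
   chain, shortening the dependency chain per output element. */
static void function_tree(const double *X1, const double *Y1,
                          const double *X2, const double *Y2,
                          double *output) {
    for (int j = 0; j < 5; j++) {
        const double *y1 = Y1 + 5 * j;
        const double *y2 = Y2 + 5 * j;
        double z1 = ((X1[0] * y1[0] + X1[1] * y1[1]) + X1[2] * y1[2])
                  + (X1[3] * y1[3] + X1[4] * y1[4]);
        double z2 = ((X2[0] * y2[0] + X2[1] * y2[1]) + X2[2] * y2[2])
                  + (X2[3] * y2[3] + X2[4] * y2[4]);
        output[j] = z1 * z2;
    }
}
```

Without `-ffast-math`, the compiler must respect this parenthesization, so the tree shape survives into the generated code.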

Now if you are able to change the memory order of `Y`, it would be beneficial to transpose these matrices, because then you can easily calculate `output[0..3]` in parallel (assuming you have AVX) by first broadcast-loading each entry of `X` and doing packed accumulation.
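
If `Y` cannot be produced in that order directly, a one-time reshuffle does it. A minimal sketch, assuming a separate 25-element destination buffer (`transpose5x5` and `Yt` are hypothetical names):

```
/* Transpose a 5x5 row-major matrix Y into Yt, so that column k of the
   original matrix becomes contiguous at Yt + 5*k. */
static void transpose5x5(const double *Y, double *Yt) {
    for (int j = 0; j < 5; j++)       /* row index in Y    */
        for (int k = 0; k < 5; k++)   /* column index in Y */
            Yt[k * 5 + j] = Y[j * 5 + k];
}
```

The vectorized version below assumes `Y1` and `Y2` are already stored in this transposed layout: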

```
#include <immintrin.h>

void function_fma( const double* X1, const double* Y1, const double* X2, const double* Y2, double* output )
{
    // Broadcast-load each element of X1 and X2 into its own register (10 in total).
    const __m256d x1_0 = _mm256_broadcast_sd( X1 );
    const __m256d x1_1 = _mm256_broadcast_sd( X1 + 1 );
    const __m256d x1_2 = _mm256_broadcast_sd( X1 + 2 );
    const __m256d x1_3 = _mm256_broadcast_sd( X1 + 3 );
    const __m256d x1_4 = _mm256_broadcast_sd( X1 + 4 );
    const __m256d x2_0 = _mm256_broadcast_sd( X2 );
    const __m256d x2_1 = _mm256_broadcast_sd( X2 + 1 );
    const __m256d x2_2 = _mm256_broadcast_sd( X2 + 2 );
    const __m256d x2_3 = _mm256_broadcast_sd( X2 + 3 );
    const __m256d x2_4 = _mm256_broadcast_sd( X2 + 4 );
    // First four output values:
    {
        // Multiply column 0
        __m256d z1 = _mm256_mul_pd( x1_0, _mm256_loadu_pd( Y1 ) );
        __m256d z2 = _mm256_mul_pd( x2_0, _mm256_loadu_pd( Y2 ) );
        // Multiply + accumulate columns 1 and 2
        z1 = _mm256_fmadd_pd( x1_1, _mm256_loadu_pd( Y1 + 5 ), z1 );
        z2 = _mm256_fmadd_pd( x2_1, _mm256_loadu_pd( Y2 + 5 ), z2 );
        z1 = _mm256_fmadd_pd( x1_2, _mm256_loadu_pd( Y1 + 10 ), z1 );
        z2 = _mm256_fmadd_pd( x2_2, _mm256_loadu_pd( Y2 + 10 ), z2 );
        // Multiply column 3 into a second accumulator (shortens the dependency chain)
        __m256d z1_ = _mm256_mul_pd( x1_3, _mm256_loadu_pd( Y1 + 15 ) );
        __m256d z2_ = _mm256_mul_pd( x2_3, _mm256_loadu_pd( Y2 + 15 ) );
        // Multiply + accumulate column 4
        z1_ = _mm256_fmadd_pd( x1_4, _mm256_loadu_pd( Y1 + 20 ), z1_ );
        z2_ = _mm256_fmadd_pd( x2_4, _mm256_loadu_pd( Y2 + 20 ), z2_ );
        // Add both partial sums
        z1 = _mm256_add_pd( z1, z1_ );
        z2 = _mm256_add_pd( z2, z2_ );
        // Multiply and store the result (unaligned store: no alignment requirement on output)
        _mm256_storeu_pd( output, _mm256_mul_pd( z1, z2 ) );
    }
    // Last output value:
    {
        // Multiply column 0
        __m128d z1 = _mm_mul_sd( _mm256_castpd256_pd128( x1_0 ), _mm_load_sd( Y1 + 4 ) );
        __m128d z2 = _mm_mul_sd( _mm256_castpd256_pd128( x2_0 ), _mm_load_sd( Y2 + 4 ) );
        // Multiply + accumulate columns 1 and 2
        z1 = _mm_fmadd_sd( _mm256_castpd256_pd128( x1_1 ), _mm_load_sd( Y1 + 9 ), z1 );
        z2 = _mm_fmadd_sd( _mm256_castpd256_pd128( x2_1 ), _mm_load_sd( Y2 + 9 ), z2 );
        z1 = _mm_fmadd_sd( _mm256_castpd256_pd128( x1_2 ), _mm_load_sd( Y1 + 14 ), z1 );
        z2 = _mm_fmadd_sd( _mm256_castpd256_pd128( x2_2 ), _mm_load_sd( Y2 + 14 ), z2 );
        // Multiply column 3 into a second accumulator
        __m128d z1_ = _mm_mul_sd( _mm256_castpd256_pd128( x1_3 ), _mm_load_sd( Y1 + 19 ) );
        __m128d z2_ = _mm_mul_sd( _mm256_castpd256_pd128( x2_3 ), _mm_load_sd( Y2 + 19 ) );
        // Multiply + accumulate column 4
        z1_ = _mm_fmadd_sd( _mm256_castpd256_pd128( x1_4 ), _mm_load_sd( Y1 + 24 ), z1_ );
        z2_ = _mm_fmadd_sd( _mm256_castpd256_pd128( x2_4 ), _mm_load_sd( Y2 + 24 ), z2_ );
        // Add both partial sums
        z1 = _mm_add_sd( z1, z1_ );
        z2 = _mm_add_sd( z2, z2_ );
        // Multiply and store the result
        _mm_store_sd( output + 4, _mm_mul_sd( z1, z2 ) );
    }
}
```
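
A minimal, hypothetical usage sketch (the data values are made up, and `Y1t`/`Y2t` are assumed to already be in the transposed layout; `function_fma` from above is assumed to be in the same file). With clang or gcc, compile with `-mfma`, or `-march=native` on a machine that has FMA:

```
#include <stdio.h>

int main(void) {
    double X1[5] = { 1, 2, 3, 4, 5 };
    double X2[5] = { 5, 4, 3, 2, 1 };
    double Y1t[25], Y2t[25], out[5];
    for (int n = 0; n < 25; n++) {
        Y1t[n] = n;
        Y2t[n] = 25 - n;
    }
    function_fma(X1, Y1t, X2, Y2t, out);
    for (int j = 0; j < 5; j++)
        printf("output[%d] = %g\n", j, out[j]);
    return 0;
}
```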

If you don’t have FMA, you can replace each FMA with a separate multiplication and addition (this will not change the latency a lot, since only the additions are on the critical path; throughput could be about 50% worse, of course). Also, if you don’t have AVX, the first four values can be calculated in two steps of two values each, using 128-bit registers, as sketched below.
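
For illustration, a hypothetical SSE2-only fragment for the first two output values (the name `pair01_sse2` is made up; `output[2..3]` work the same way with the `Y` pointers offset by 2, and `output[4]` stays scalar):

```
#include <emmintrin.h>

/* Computes output[0..1] with plain multiply + add instead of FMA,
   assuming the same transposed Y layout as function_fma. */
static void pair01_sse2(const double *X1, const double *Y1,
                        const double *X2, const double *Y2,
                        double *output) {
    __m128d z1 = _mm_mul_pd(_mm_set1_pd(X1[0]), _mm_loadu_pd(Y1));
    __m128d z2 = _mm_mul_pd(_mm_set1_pd(X2[0]), _mm_loadu_pd(Y2));
    for (int k = 1; k < 5; k++) {
        z1 = _mm_add_pd(z1, _mm_mul_pd(_mm_set1_pd(X1[k]),
                                       _mm_loadu_pd(Y1 + 5 * k)));
        z2 = _mm_add_pd(z2, _mm_mul_pd(_mm_set1_pd(X2[k]),
                                       _mm_loadu_pd(Y2 + 5 * k)));
    }
    _mm_storeu_pd(output, _mm_mul_pd(z1, z2));
}
```

This simple version accumulates linearly; splitting the sum into two accumulators, as `function_fma` does, would shorten the addition chain in the same way.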

Answered By – chtz
