[SOLVED] Is it possible to vectorize a non-trivial loop in C with SIMD? (multiple length-5 double-precision dot products reusing one input)

Issue

I have performance-critical C code where > 90% of the time is spent doing one basic operation:

Operation

The C code I am using is:

static void function(double *X1, double *Y1, double *X2, double *Y2, double *output) {
    double Z1, Z2;
    int i, j, k;
    for (i = 0, j = 0; i < 25; j++) { // sweep Y
        Z1 = 0;
        Z2 = 0;
        for (k = 0; k < 5; k++, i++) { // sweep X
            Z1 += X1[k] * Y1[i];
            Z2 += X2[k] * Y2[i];
        }
        output[j] = Z1*Z2;
    }
}

The lengths are fixed (each X has 5 elements, each Y has 25, and the output has 5). I have tried everything I know to make this faster. When I compile this code using clang with -O3 -march=native -Rpass-analysis=loop-vectorize -Rpass=loop-vectorize -Rpass-missed=loop-vectorize, I get this message:

remark: the cost-model indicates that vectorization is not beneficial [-Rpass-missed=loop-vectorize]

But I assume the way to make this faster is with SIMD somehow. Any suggestions would be appreciated.

Solution

From the extended discussion in the comments, it looks like you are mostly interested in reducing the latency between reading X1, X2 and writing output. What you are calculating is the element-wise product of two matrix-vector products. The two MV products can happen quasi-parallel (thanks to out-of-order execution), but each MV product requires the sum of five products, which you can compute either sequentially (as you are doing at the moment) or in a tree-like reduction:

Z = ((X[0]*Y[0] + X[1]*Y[1]) + X[2]*Y[2])  +  (X[3]*Y[3] + X[4]*Y[4]);

This results in a critical path of mulsd -> fmaddsd -> fmaddsd -> addsd, which is followed by the multiplication Z1*Z2. That means, assuming 4 cycles of latency per FLOP, you'll have a latency of 20 cycles plus the latency of reading and writing memory (unless you are able to keep everything in registers, which requires seeing the surrounding code). If you accumulate the values linearly, you'll have a critical path of mulsd -> fmaddsd -> fmaddsd -> fmaddsd -> fmaddsd -> mulsd (i.e., 24 cycles + read/write).
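For reference, here is a minimal scalar sketch of that tree-shaped reduction (an illustration, not taken from the original answer); with floating-point contraction enabled (e.g., -O3 -march=native), the compiler may fuse the multiply-adds into FMAs:

static void function_tree(const double *X1, const double *Y1, const double *X2, const double *Y2, double *output) {
    for (int j = 0; j < 5; j++) {
        const double *y1 = Y1 + 5 * j;
        const double *y2 = Y2 + 5 * j;
        // Tree-shaped reduction: the two halves of each sum are independent
        // and can execute in parallel.
        double z1 = ((X1[0] * y1[0] + X1[1] * y1[1]) + X1[2] * y1[2])
                  +  (X1[3] * y1[3] + X1[4] * y1[4]);
        double z2 = ((X2[0] * y2[0] + X2[1] * y2[1]) + X2[2] * y2[2])
                  +  (X2[3] * y2[3] + X2[4] * y2[4]);
        output[j] = z1 * z2;
    }
}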

Now, if you are able to change the memory order of Y, it would be beneficial to transpose these matrices, because then you can easily calculate output[0..3] in parallel (assuming you have AVX) by first broadcast-loading each entry of X and then doing packed accumulation.
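To make that layout change concrete, here is a small hypothetical helper (an assumption for illustration, not part of the original answer) that converts Y from the layout used by the scalar code above, where the five multiplicands of each output are contiguous, into the transposed, stride-5 layout that the intrinsics version below expects:

static void transpose5(const double *Y_in, double *Y_out) {
    for (int j = 0; j < 5; j++)       // output index
        for (int k = 0; k < 5; k++)   // index into X
            Y_out[5 * k + j] = Y_in[5 * j + k];
}

With Y stored in that order, an AVX + FMA version looks like this: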

#include <immintrin.h>

void function_fma( const double* X1, const double* Y1, const double* X2, const double* Y2, double* output )
{
    // Load X1 and X2 vectors into 10 registers.
    const __m256d x1_0 = _mm256_broadcast_sd( X1 );
    const __m256d x1_1 = _mm256_broadcast_sd( X1 + 1 );
    const __m256d x1_2 = _mm256_broadcast_sd( X1 + 2 );
    const __m256d x1_3 = _mm256_broadcast_sd( X1 + 3 );
    const __m256d x1_4 = _mm256_broadcast_sd( X1 + 4 );

    const __m256d x2_0 = _mm256_broadcast_sd( X2 );
    const __m256d x2_1 = _mm256_broadcast_sd( X2 + 1 );
    const __m256d x2_2 = _mm256_broadcast_sd( X2 + 2 );
    const __m256d x2_3 = _mm256_broadcast_sd( X2 + 3 );
    const __m256d x2_4 = _mm256_broadcast_sd( X2 + 4 );

    // first four values:
    {
        // Multiply column 0
        __m256d z1 = _mm256_mul_pd( x1_0, _mm256_loadu_pd( Y1 ) );
        __m256d z2 = _mm256_mul_pd( x2_0, _mm256_loadu_pd( Y2 ) );

        // Multiply + accumulate column 1 and column 2
        z1 = _mm256_fmadd_pd( x1_1, _mm256_loadu_pd( Y1 + 5 ), z1 );
        z2 = _mm256_fmadd_pd( x2_1, _mm256_loadu_pd( Y2 + 5 ), z2 );
        z1 = _mm256_fmadd_pd( x1_2, _mm256_loadu_pd( Y1 + 10 ), z1 );
        z2 = _mm256_fmadd_pd( x2_2, _mm256_loadu_pd( Y2 + 10 ), z2 );

        // Multiply column 3
        __m256d z1_ = _mm256_mul_pd( x1_3, _mm256_loadu_pd( Y1 + 15 ) );
        __m256d z2_ = _mm256_mul_pd( x2_3, _mm256_loadu_pd( Y2 + 15 ) );

        // Multiply + accumulate column 4
        z1_ = _mm256_fmadd_pd( x1_4, _mm256_loadu_pd( Y1 + 20 ), z1_ );
        z2_ = _mm256_fmadd_pd( x2_4, _mm256_loadu_pd( Y2 + 20 ), z2_ );

        // Add both partial sum
        z1 = _mm256_add_pd( z1, z1_ );
        z2 = _mm256_add_pd( z2, z2_ );

        // Multiply and store result
        // (note: _mm256_store_pd requires output to be 32-byte aligned; use _mm256_storeu_pd otherwise)
        _mm256_store_pd(output, _mm256_mul_pd(z1, z2));
    }
    // last value:
    {
        // Multiply column 0
        __m128d z1 = _mm_mul_sd( _mm256_castpd256_pd128(x1_0), _mm_load_sd( Y1 + 4) );
        __m128d z2 = _mm_mul_sd( _mm256_castpd256_pd128(x2_0), _mm_load_sd( Y2 + 4) );

        // Multiply + accumulate column 1 and column 2
        z1 = _mm_fmadd_sd( _mm256_castpd256_pd128(x1_1), _mm_load_sd( Y1 + 9 ), z1 );
        z2 = _mm_fmadd_sd( _mm256_castpd256_pd128(x2_1), _mm_load_sd( Y2 + 9 ), z2 );
        z1 = _mm_fmadd_sd( _mm256_castpd256_pd128(x1_2), _mm_load_sd( Y1 + 14 ), z1 );
        z2 = _mm_fmadd_sd( _mm256_castpd256_pd128(x2_2), _mm_load_sd( Y2 + 14 ), z2 );

        // Multiply column 3
        __m128d z1_ = _mm_mul_sd( _mm256_castpd256_pd128(x1_3), _mm_load_sd( Y1 + 19 ) );
        __m128d z2_ = _mm_mul_sd( _mm256_castpd256_pd128(x2_3), _mm_load_sd( Y2 + 19 ) );

        // Multiply + accumulate column 4
        z1_ = _mm_fmadd_sd( _mm256_castpd256_pd128(x1_4), _mm_load_sd( Y1 + 24 ), z1_ );
        z2_ = _mm_fmadd_sd( _mm256_castpd256_pd128(x2_4), _mm_load_sd( Y2 + 24 ), z2_ );

        // Add both partial sum
        z1 = _mm_add_sd( z1, z1_ );
        z2 = _mm_add_sd( z2, z2_ );

        // Multiply and store result
        _mm_store_sd(output+4, _mm_mul_sd(z1, z2));
    }
}
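For completeness, a minimal usage sketch (hypothetical data, an illustration rather than part of the original answer); the output buffer is declared with 32-byte alignment because the first four results are written with an aligned store:

#include <stdio.h>

int main(void) {
    double X1[5] = { 1, 2, 3, 4, 5 }, X2[5] = { 5, 4, 3, 2, 1 };
    double Y1[25], Y2[25];
    for (int i = 0; i < 25; i++) { Y1[i] = 0.5 * i; Y2[i] = 25.0 - i; }  // already in the transposed layout

    _Alignas(32) double output[5];   // C11 alignment specifier, required by _mm256_store_pd
    function_fma(X1, Y1, X2, Y2, output);

    for (int j = 0; j < 5; j++)
        printf("output[%d] = %f\n", j, output[j]);
    return 0;
}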

If you don't have FMA, you can replace the fmadd intrinsics by a multiplication and an addition (this will not change the latency a lot, since only the additions are on the critical path; throughput could be about 50% worse, of course). Also, if you don't have AVX, the first four values can be calculated with SSE by doing two times two values.
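As a sketch of that substitution (a hypothetical wrapper, not part of the original answer), each _mm256_fmadd_pd(a, b, c) call in the code above can be routed through a small helper that falls back to a separate multiply and add when FMA is unavailable; an analogous __m128d wrapper would cover the _mm_fmadd_sd calls:

#include <immintrin.h>

static inline __m256d fmadd256(__m256d a, __m256d b, __m256d c) {
#ifdef __FMA__
    return _mm256_fmadd_pd(a, b, c);               // single fused multiply-add
#else
    return _mm256_add_pd(_mm256_mul_pd(a, b), c);  // multiply then add; may differ in the last bit due to intermediate rounding
#endif
}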

Answered By – chtz
