[SOLVED] CPU cycles for 16 bit and 32 bit multiplications on Intel x64 processors

Issue

I am implementing a mathematical library in C which heavily uses multiplications. Initially, all my multiplications were done using uint16_t. Recently I change many of them to uint32_t and I saw my code runtime became almost double. I was confused as I thought in Intel x64 processors, 32 and 16-bit multiplications take same clock cycles.

I wrote a diagnosis code, please find it below

#include <stdio.h>
#include <stdint.h>
#include <stdlib.h>
#include <time.h>
#include "cpucycles.c"

#define REPEAT 10000
#define OUT_REPEAT 100000

void main(){

    uint16_t a_16[REPEAT], b_16[REPEAT], c_16[REPEAT];
    uint32_t a_32[REPEAT], b_32[REPEAT], c_32[REPEAT];
    int32_t i,j;
    uint64_t clock1, clock2, CLOCK16, CLOCK32;
    uint64_t acc=0;
    time_t t;

    srand((unsigned) time(&t));

    clock1=clock2=CLOCK16=CLOCK32=0;

    for(j=0;j<OUT_REPEAT;j++){

        
        for(i=0;i<REPEAT;i++){

            a_16[i]=rand()& ( (1<<13) -1); //need 13-bit integers only
            b_16[i]=rand()& ( (1<<13) -1);

            a_32[i]=rand()&( (1<<19) -1);
            b_32[i]=rand()&( (1<<19) -1); //need 19-bit integers only
        }
        
        clock1=cpucycles();
        for(i=0;i<REPEAT;i++){
            c_16[i]=a_16[i]*b_16[i];
        }
        clock2=cpucycles();
        CLOCK16=CLOCK16+(clock2-clock1);    

        clock1=cpucycles();
        for(i=0;i<REPEAT;i++){
            c_32[i]=a_32[i]*b_32[i];
        }
        clock2=cpucycles();
        CLOCK32=CLOCK32+(clock2-clock1);    

        for(i=0;i<REPEAT;i++){
            acc=(acc+(c_32[i]-(uint32_t)c_16[i])); //this is just to prevent compiler optimization
        }
        printf("Iteration: %d,  acc:%llu\n", j, acc);
        acc=0;
    }

    printf("\n--------------------------------------------\n");
    printf("Time for 16 bit multiplication : %llu\n", CLOCK16/OUT_REPEAT);
    printf("Time for 32 bit multiplication : %llu\n", CLOCK32/OUT_REPEAT);
    printf("\n--------------------------------------------\n");
}

The cpucycles code is from ECRYPT and given below,

#include "cpucycles.h"

long long cpucycles(void)
{
  unsigned long long result;
  asm volatile(".byte 15;.byte 49;shlq $32,%%rdx;orq %%rdx,%%rax"
    : "=a" (result) ::  "%rdx");
  return result;
}

Result from one sample run, using single core and hyperthreading/TurboBoost disabled

--------------------------------------------
Time for 16 bit multiplication : 2795
Time for 32 bit multiplication : 4190

--------------------------------------------

and finally my cpuinfo (excerpt) as given by lscpu

Architecture:          x86_64
CPU op-mode(s):        32-bit, 64-bit
Model name:            Intel(R) Core(TM) i7-7700 CPU @ 3.60GHz

Now my question is,

  1. Is it correct that in x64 platforms 16-bit multiplications take almost half of the total time than 32-bit multiplications? Or I am doing something wrong.

  2. If yes, can you please show me some reference to justify this behavior?

I appreciate your help.

Solution

Is it correct that in x64 platforms 16 bit multiplications take almost half of the total time than 32 bit multiplications? Or I am doing something wrong.

No, that isn’t correct by itself. Respectively it’s not what you have actually tested.

Your short loops are trivially vectorizable, so that’s just what the compiler did. Depending on the target CPU generation, that means there is a 128, 256 or 512bit vector type available which can be partitioned into different word sizes (8bit, 16bit, 32bit, 64bit, 128bit), and can then perform vectorized multiplications on multiple elements at once. Not just the multiplication, but also loading and storing the numbers from and to memory is fully vectorized and doesn’t operate just on single elements.

To put it simple, you can fit twice as many 16bit integers in the same vector, compared to 32bit. And your code actually isn’t limited by the multiplications either – it’s limited purely by load / stores, so you have only successfully measured that 16bit integers are half as big as 32bit integers, so when vectorized and bound by load/stores you can load twice as many elements in the same time.

If you whish to benchmark a specific instruction (in this case single element multiplication), would you need to explicitly use that specific instruction by inline assembly. You would also need to be aware of all the side effects and preconditions which effect performance, pipelined super-scalar architectures are not trivial to benchmark in general.

Otherwise the compiler is expected to optimize (vectorize, fold, inline etc.) your code as much as possible.

Answered By – Ext3h

Answer Checked By – Gilberto Lyons (BugsFixing Admin)

Leave a Reply

Your email address will not be published. Required fields are marked *