[SOLVED] C++ Linux fastest way to measure time (faster than std::chrono) ? Benchmark included

Issue

#include <iostream>
#include <chrono>
using namespace std;

class MyTimer {
 private:
  std::chrono::time_point<std::chrono::steady_clock> starter;
  std::chrono::time_point<std::chrono::steady_clock> ender;

 public:
  void startCounter() {
    starter = std::chrono::steady_clock::now();
  }

  double getCounter() {
    ender = std::chrono::steady_clock::now();
    return double(std::chrono::duration_cast<std::chrono::nanoseconds>(ender - starter).count()) /
           1000000;  // millisecond output
  }
  
  // the timer needs nanosecond precision
  int64_t getCounterNs() {
    return std::chrono::duration_cast<std::chrono::nanoseconds>(std::chrono::steady_clock::now() - starter).count();
  }
};

MyTimer timer1, timer2, timerMain;
volatile int64_t dummy = 0, res1 = 0, res2 = 0;

// time run without any time measure
void func0() {
    dummy++;
}

// we're trying to measure the cost of startCounter() and getCounterNs(), not "dummy++"
void func1() {
    timer1.startCounter();  
    dummy++;
    res1 += timer1.getCounterNs();
}

void func2() {
    // start your counter here
    dummy++;
    // res2 += end your counter here
}

int main()
{
    int i, ntest = 1000 * 1000 * 100;
    double runtime0, runtime1, runtime2;  // getCounter() returns milliseconds as a double

    timerMain.startCounter();
    for (i=1; i<=ntest; i++) func0();
    runtime0 = timerMain.getCounter();
    cout << "Time0 = " << runtime0 << "ms\n";

    timerMain.startCounter();
    for (i=1; i<=ntest; i++) func1();
    runtime1 = timerMain.getCounter();
    cout << "Time1 = " << runtime1 << "ms\n";

    timerMain.startCounter();
    for (i=1; i<=ntest; i++) func2();
    runtime2 = timerMain.getCounter();
    cout << "Time2 = " << runtime2 << "ms\n";

    return 0;
}

I’m trying to profile a program where certain critical parts have execution time measured in < 50 nanoseconds. I found that my timer class using std::chrono is too expensive (code with timing takes 40% more time than code without). How can I make a faster timer class?

I think some OS-specific system calls would be the fastest solution. The platform is Linux Ubuntu.

Edit: all code is compiled with -O3. It’s ensured that each timer is only initialized once, so the measured cost is due to the startCounter()/getCounterNs() functions only. I’m not doing any text printing.

Edit 2: the accepted answer doesn’t include the method to actually convert number-of-cycles to nanoseconds. If someone can do that, it’d be very helpful.

Solution

What you want is called "micro-benchmarking", and it can get very complex. I assume you are using Ubuntu Linux on x86_64; none of the following is valid for ARM, ARM64, or other platforms.

std::chrono is implemented on Linux in libstdc++ (gcc) and libc++ (clang) as a thin wrapper around glibc, the C library, which does all the heavy lifting. If you look at the implementation of std::chrono::steady_clock::now() you will see calls to clock_gettime().

clock_gettime() is served through the vDSO, i.e. it is kernel code that runs in userspace. It should be very fast, but from time to time it has to do some housekeeping and can take much longer on every n-th call. So I would not recommend it for micro-benchmarking.

Almost every platform has a cycle counter, and on x86 it is read with the rdtsc assembly instruction. This instruction can be inserted into your code with inline asm or through the compiler-specific builtins __builtin_ia32_rdtsc() or __rdtsc().

These calls return a 64-bit integer representing the number of clock ticks since the machine powered up. rdtsc is not immediate, but it is fast: it takes roughly 15-40 cycles to complete.

It is not guaranteed on all platforms that this counter is synchronized across cores, so beware when the process gets moved from core to core. On modern systems this should not be a problem, though.

Another problem with rdtsc is that compilers will often reorder instructions when they find they have no side effects, and unfortunately rdtsc is one of them. So you may have to place barriers around these counter reads if you see that the compiler is playing tricks on you – look at the generated assembly.

An even bigger problem is the CPU's own out-of-order execution: not only can the compiler change the order of execution, the CPU can as well. Since the 486, x86 CPUs have been pipelined, so several instructions can be in flight at the same time – roughly speaking. So you might end up measuring something spurious.

I recommend you get familiar with the quantum-like problems of micro-benchmarking. It is not straightforward.

Notice that rdtsc returns a number of cycles. You have to convert it to nanoseconds using the timestamp-counter frequency.

Here is one example:

#include <iostream>
#include <cstdint>  // uint32_t, uint64_t

void dosomething() {
    // yada yada
}

int main() {
    double sum = 0;
    const uint32_t numloops = 100000000;
    for ( uint32_t j=0; j<numloops; ++j ) {
        uint64_t t0 = __builtin_ia32_rdtsc();
        dosomething();
        uint64_t t1 = __builtin_ia32_rdtsc();
        uint64_t elapsed = t1-t0;
        sum += elapsed;
    }
    std::cout << "Average:" << sum/numloops << std::endl;
}

This paper is a bit outdated (2010) but it is sufficiently up to date to give you a good introduction to micro-benchmarking:

How to Benchmark Code Execution Times on Intel® IA-32 and IA-64 Instruction Set Architectures

Answered By – user8143588

Answer Checked By – Marie Seifert (BugsFixing Admin)
