[SOLVED] Performance bottlenecks in fast evaluation of trig functions using Eigen and MEX

Issue

In a project using Matlab’s C++ MEX API, I have to compute the value exp(j * 2pi * x) for over 100,000 values of x where x is always a positive double. I’ve written some helper functions that breakdown the computation into sin/cos using euler’s formula. I then apply the method of range reduction to reduce my values to their corresponding points in the domain [0,T/4] where T is the period of the exponential I’m computing. I keep track of which quadrant in [0, T] the original value would have fallen into for later. I can then compute the trig function using a taylor series polynomial in horner form and apply the appropriate shift depending on which quadrant the original value was in. For further information on some of the concepts in this technique, check out this answer. Here is the code for this function:

Eigen::VectorXcd calcRot2(const Eigen::Ref<const Eigen::VectorXd>& idxt) {

    Eigen::VectorXd vidxt = idxt.array() - idxt.array().floor();
    Eigen::VectorXd quadrant = (vidxt.array()*2+0.5).floor();
    vidxt.array() -= (quadrant.array()*0.5);
    vidxt.array() *= 2*3.14159265358979;
    const Eigen::VectorXd sq = vidxt.array()*vidxt.array();

    Eigen::VectorXcd M(vidxt.size());
    M.real() = fastCos2(sq);
    M.imag() = fastSin2(vidxt,sq);
    M = (quadrant.array() == 1).select(-M,M);
    return M;
}

I profiled the code segment in which this function is called using std::chrono and averaged over 500 calls to the function (where each call to the mex function processes all 100,000+ values by calling calcRot2 in a loop. Each iteration passes about 200 values to calcRot2). I find the following average runtimes:

runtime with calcRot2:                   75.4694 ms
runtime with fastSin/Cos commented out:  50.2409 ms
runtime with calcRot2 commented out:     30.2547 ms

Looking at the difference between the two extreme cases, it seems like calcRot has a large contribution to the runtime. However, only a portion of that comes from the sin/cos calculation. I would assume Eigen’s implicit vectorization and the compiler would make the runtime of the other operations in the function effectively negligible. (floor shouldn’t be a problem!) Where exactly is the performance bottleneck here?

This is the compilation command I’m performing (It uses MinGW64 which I think is the same as gcc):

mex(ipath,'CFLAGS="$CFLAGS -O3 -fno-math-errno -ffast-math -fopenmp -mavx2"','LDFLAGS="$LDFLAGS -fopenmp"','DAS.cpp','DAShelper.cpp')

Reference Code

For reference, here is the code segment in the main mex function where the timer is called, and the helper function that calls calcRot2():

MEX function call:

chk1 = std::chrono::steady_clock::now();
// Calculate beamformed signal at each point
Eigen::MatrixXcd bfVec(p.nPoints,1);
#pragma omp parallel for
for (int i = 0; i < p.nPoints; i++) {
    calcPoint(idxt.col(i),SIG,p,bfVec(i));
}
chk2 = std::chrono::steady_clock::now();
auto diff3 = chk2 - chk1;

calcPoint:

void calcPoint(const Eigen::Ref<const Eigen::VectorXd>& idxt,
               const Eigen::Ref<const Eigen::MatrixXcd>& SIG,
               Parameters& p, std::complex<double>& bfVal) {
    
Eigen::VectorXcd pRot = calcRot2(idxt*p.fc/p.fs);

int j = 0;
for (auto x : idxt) {
    if(x >= 0) {
        int vIDX = static_cast<int>(x);
        bfVal += (SIG(vIDX,j)*(vIDX + 1 - x) + SIG(vIDX+1,j)*(x - vIDX))*pRot(j);
    }
    j++;
}
}

Clarification

To clarify, the line

(vidxt.array()*2+0.5).floor()

is meant to yield:

0 if vidxt is between [0,0.25] 
1 if vidxt is between [0.25,0.75]
2 if vidxt is between [0.75,1]

The idea here is that when vidxt is in the second interval (quadrants 2 and 3 on the unit circle for functions with period 2pi), then the value needs to map to its negative value. Otherwise, the range reduction maps the values to the correct values.

Solution

The benefits of Eigen’s vectorization are outweighed because you evaluate your expressions into temporary vectors. Allocating, deallocating, filling and reading these vectors has cost that seems significant. This is especially so because the expressions themselves are relatively simple (just a few scalar operations).

Expression objects

What usually helps here is aggregating into fewer expressions. For example line 3 and 4 can be collapsed into one:
vidxt.array() = 2*3.14159265358979 * (vidxt.array() - quadrant.array()*0.5);
(BTW: Note that that math.h contains a constant M_PI with pi in double precision).

Beyond that, Eigen expressions can be combined and reused. Something like this:

    auto vidxt0 = idxt.array() - idxt.array().floor();
    auto quadrant = (vidxt0*2+0.5).floor();
    auto vidxt = 2*3.14159265358979 * (vidxt0 - quadrant.array()*0.5);
    auto sq = vidxt.array().square();

    Eigen::VectorXcd M(vidxt.size());
    M.real() = fastCos2(sq);
    M.imag() = fastSin2(vidxt,sq);
    M = (quadrant.array() == 1).select(-M,M);

Note that none of the auto values are vectors. They are expression objects that behave like arrays and can be evaluated into vectors or arrays.

You can pass these on to your fastCos2 and fastSin2 function by declaring them as templates. The typical Eigen pattern would be something like

template<Derived>
void fastCos2(const Eigen::ArrayBase<Derived>& sq);

The idea here is that ultimately, everything compiles into one huge loop that gets executed when you evaluate the expression into a vector or array. If you reference the same sub-expression multiple times, the compiler may be able to eliminate the redundant computations.

Unfortunately, I could not get any better performance out of this particular code, so it is no real help here but it is still something worth exploring in these kind of cases.

fastSin/Cos return value

Speaking of temporary vectors: You didn’t include the code for your fastSin/Cos functions but it looks a lot like you return a temporary vector which is then copied into the real and imaginary parts or the actual return value. This is another temporary that you may want to avoid. Something like this:

template<class Derived1, class Derived2>
void fastCos2(const Eigen::MatrixBase<Derived1>& M, const Eigen::MatrixBase<Derived2>& sq)
{
    Eigen::MatrixBase<Derived1>& M_mut = const_cast<Eigen::MatrixBase<Derived1>&>(M);
    M_mut = sq...;
}

fastCos2(M.real(), sq);

Please refer to Eigen’s documentation on the topic of function arguments.

The downside of this approach in this particular case is that now the output is not consecutive (real and imaginary parts are interleaved). This may affect vectorization negatively. You may be able to work around this by combining the sin and cos functions into one expression for both. Benchmarking is required.

Using a plain loop

As others have pointed out, using a loop may be easier in this particular case. You noted that this was slower. I have a theory why: You did not specify -DNDEBUG in your compile options. If you don’t, all array indices in Eigen vectors are range-checked with an assertion. These cost time and prevent vectorization. If you include this compile flag, I find my code significantly faster than using Eigen expressions.

Alternatively, you can use raw C pointers to the input and output vector. Something like this:

std::ptrdiff_t n = idxt.size();
Eigen::VectorXcd M(n);
const double* iidxt = idxt.data();
std::complex<double>* iM = M.data();
for(std::ptrdiff_t j = 0; j < n; ++j) {
    double ival = iidxt[j];
    double vidxt = ival - std::floor(ival);
    double quadrant = std::floor(vidxt * 2. + 0.5);
    vidxt = (vidxt - quadrant * 0.5) * (2. * 3.14159265358979);
    double sq = vidxt * vidxt;
    // stand-in for sincos
    std::complex<double> jval(sq, vidxt + sq);
    iM[j] = quadrant == 1. ? -jval : jval;
}

Fixed sized arrays

To avoid the cost of memory allocation and make it easier for the compiler to avoid memory operations in the first place, it can help to run the computation on blocks of fixed size. Something like this:


std::ptrdiff_t n = idxt.size();
Eigen::VectorXcd M(n);
std::ptrdiff_t i;
for(i = 0; i + 4 <= n; i += 4) {
    Eigen::Array4d idxt_i = idxt.segment<4>(i);
    ...
    M.segment<4>(i) = ...;
}
if(i + 2 <= n) {
    Eigen::Array2D idxt_i = idxt.segment<2>(i);
    ...
    M.segment<2>(i) = ...;
    i += 2;
}
if(i < n) {
    // last index scalar
}

This kind of stuff needs careful tuning to ensure that vectorized code is generated and there are no unnecessary temporary values on the stack. If you can read assembler, Godbolt is very helpful.

Other remarks

  • Eigen includes vectorized versions of sin and cos. Have you compared your code to these instead of e.g. Eigen’s complex exp function?

  • Depending on your math library, there is also an explicit sincos function to compute sine and cosine in one function. It is not vectorized but still saves time on range reduction. You can (usually) access it through std::polar. Try this:

Eigen::VectorXd scale = ...;
Eigen::VectorXd phase = ...;
// M = scale * exp(-2 pi j phase)
Eigen::VectorXd M = scale.binaryExpr(-2. * M_PI * phase,
      [](double s, double p) noexcept -> std::complex<double> {
          return std::polar(s, p);
});
  • If your goal is an approximation instead of a precise result, shouldn’t your first step be to cast to single precision? Maybe after the range reduction to avoid losing too many decimal places. At the very least it will double the work done per clock cycle. Also, regular sine and cosine implementations take less time in float.

Edit

  • I had to correct myself on the cast to int64 instead of int. There is no vectorized conversion to int64_t until AVX512

  • The line (vidxt.array()*2+0.5).floor() bugs me slightly. This is meant to round down to negative infinity for [0, 0.5) and up to positive infinity for [0.5, 1), correct? vidxt is never negative. Therefore this line should be equivalent to (vidxt.array()*2).round(). With AVX2 and -ffast-math that saves one instruction. With SSE2 none of these actually vectorize, as can be seen on Godbolt

Answered By – Homer512

Answer Checked By – Willingham (BugsFixing Volunteer)

Leave a Reply

Your email address will not be published. Required fields are marked *