apulSoft Blog

May 7th, 2025 - cpp math optimization

Branchless sincos() float Approximation


Motivation

When working with FFTs and phase, conversions between complex phase responses and magnitude/phase responses are sometimes unavoidable. This leads to calculating lots of sine and cosine values of phase angles. I keep looking for the fastest solutions to be used with large buffers and SIMD. Until recently, sincos_u35() from the fantastic Sleef library was used in apulSoft plugins.

Working on a new plugin that requires even more of these calls than previous ones, I found the Sleef call to be the main performance bottleneck in my code. And I do not need 3.5 ulp of precision and require no handling of special cases like inf or nan.

Optimized Approximation

Looking at the Sleef code, I saw that there is code to take advantage of the symmetries of sine and cosine, which leads to branches/selects in the code. Doing this in SIMD routines means doing double the work and selectively throwing half of the results away = suboptimal performance.

Looking for a truly branchless solution, I stumbled upon this very nice blog post by Colin at mooooo: https://mooooo.ooo/chebyshev-sine-approximation/

It contains Rust code to calculate $sin(x)$ to 5 ulp precision by approximating a full period of $sin(x)$, which means no branches. Colin found a nice way to solve the floating point relative error issue that happens when approximating functions that intersect the x-axis. Instead of approximating $sin(x)$ using Chebyshev polynomials, he approximates $sin(x)/((x - pi)(x + pi))$. This avoids going through zero at $+-pi$, and still keeps the function to approximate symmetrical. This is very beneficial, as then half the polynomial coefficients become zero.

I found the same idea can be applied to $cos(x)$ by approximating $cos(x)/((x - 0.5pi)(x + 0.5pi))$.

I recently wrote a useful class to create Chebyshev-Polynomial-Approximations (will share in the next blog post). As I do not need as much precision as Colin, I made my own 4th-order polynomials for sin and cos. Here are plots of their relative errors:

Green: sin(x), cos(x)
Red: relative error * $10^6$



c++ Implementation

Below is the fully branchless c++ sincos() for float values between $-pi$ and $pi$ that I use with SIMD types (with fma). The $(x-a)(x+a)$ terms could of course be mathematically simplified to $x^2-a^2$, but that would work against the increased precision near the $y=0$ spots thanks to Colin's factorization. I found this to be faster than various implementations that try to share more work between $sin(x)$ and $cos(x)$. Using Estrin's scheme leads to operations that can be pipelined nicely.

// could use std::fma if faster - depends on cpu/compiler configuration
inline float _fma(float a, float b, float c) { return a * b + c; } 

// Approximate sin(x) and cos(x) between -pi and pi
// relative err |f(x)/sin(x) - 1|
// sin x: 1.32e-6 near 0
// cos x: 2.07e-6 at +-2.99

void approx_sincos (float x, float &si, float &co){
    constexpr float s0 = -0.10132104963779f;      // x
    constexpr float s1 = 0.00662060857089096f;    // x^3
    constexpr float s2 = -0.000173351320734045f;  // x^5
    constexpr float s3 = 2.48668816803878e-06f;   // x^7
    constexpr float s4 = -1.97103310997063e-08f;  // x^9

    constexpr float c0 = -0.405284410277645f;     // 1
    constexpr float c1 = 0.0383849982168558f;     // x^2
    constexpr float c2 = -0.00132798793179218f;   // x^4
    constexpr float c3 = 2.37446117208029e-05f;   // x^6
    constexpr float c4 = -2.23984068352572e-07f;  // x^8

    constexpr float pi = 3.1415927410125732f;   
    constexpr float halfPi = 1.570796326794897f;    

    auto x2 = x * x;

    // evaluate two 4th-order polynomials of (x^2) using estrin's scheme.
    auto x4 = x2 * x2;
    auto x8 = x4 * x4;
    auto poly1 = _fma(x8, s4, _fma(x4, _fma(s3, x2, s2), _fma(s1, x2, s0)));
    auto poly2 = _fma(x8, c4, _fma(x4, _fma(c3, x2, c2), _fma(c1, x2, c0)));

    si = (x - pi) * (x + pi) * x * poly1;
    co = (x - halfPi) * (x + halfPi) * poly2;
}
 
Comments