apulSoft Blog

Mar 17th, 2021 - cpp math optimization

c++ Complex Square Root


I wrote a c++ filter library to design the various filters used in the apulSoft plugins. It uses a lot of std::complex math functions when designing filters using complex poles and zeros. During profiling on OSX I found slowness and precision issues in the implementation of std::sqrt() for std::complex<double>. I looked up the implementation inside libc++ used by LLVM/Clang on OSX here.

This is a copy of the code found:

template<class _Tp>
complex<_Tp>
sqrt(const complex<_Tp>& __x)
{
    if (__libcpp_isinf_or_builtin(__x.imag()))
        return complex<_Tp>(_Tp(INFINITY), __x.imag());
    if (__libcpp_isinf_or_builtin(__x.real()))
    {
        if (__x.real() > _Tp(0))
            return complex<_Tp>(__x.real(), __libcpp_isnan_or_builtin(__x.imag()) ? __x.imag() : copysign(_Tp(0), __x.imag()));
        return complex<_Tp>(__libcpp_isnan_or_builtin(__x.imag()) ? __x.imag() : _Tp(0), copysign(__x.real(), __x.imag()));
    }
    return polar(sqrt(abs(__x)), arg(__x) / _Tp(2));
}

It's mostly a call to std::polar() which comes from the polar way of determining a complex square root. It calls atan2() (via arg()), sin() and cos(). If -ffast-math is enabled, it uses sincos(). LLVM/Clang on OSX uses library functions to evaluate trigonometry which slows things down and means the complex square root cannot be fully inlined. The whole thing is a mess because some CPU models have hardware trigonometry instructions that are slower than scalar polynomial approximations. Maybe that's why Clang on OSX uses the library calls - to use the fastest code depending on the CPU.

To make matters worse, precision can be problematic:

#include <iostream>
#include <complex>

using namespace std;

int main() {
    cout << sqrt(complex<double>(-1)) << endl;
}

If I run this in XCode 12, I get (6.12323e-17,1). The small error is annoying for filter math as it makes the number appear not pure imaginary.

So I had to come up with a replacement which avoids using std::polar(). For the complex value $c=a+ib$, the components of the root $sqrt(c) = x + iy$ can be calculated like this:

$c = sqrt(c)^2 = (x+iy)^2 = x^2 - y^2 + i*2xy = a + ib $

$=>{(a,=,x^2 - y^2),(b,=,2xy):}$

${(4x^4 - 4ax^2 - b^2,=,0),(4y^4 + 4ay^2 - b^2,=,0):}$

$=> {(x,=,+-sqrt(+-0.5(sqrt(a^2+b^2) + a))), (y,=,+-sqrt(+-0.5(sqrt(a^2+b^2) - a))):}$

We are interested in the real-positive root of $c$, so the real component $x$ has to be positive. From $b=2xy$ follows that $y$ needs to have the same sign as $b$. With $r = |c|$ the final formula is:

${(x,=,sqrt(0.5(r + a))), (y,=,sqrt(0.5(r - a))*sgn(b)):}$

$r$ is of course always positive and larger than $a$, therefore the square roots are always real. The entire process requires calculating 3 square roots and a few multiplications and additions, surely that's going to be faster than the atan2() and sincos() of the libc++ implementation.
I saw another library implementation using just two square roots and a division, but prefer the presented method as it doesn't have to deal with special cases (division by zero). Less conditional code means more options for the compiler to rearrange and optimize inlined code.

The resulting process also looks quite symmetrical for the real and imaginary parts of c. I like this as I hope to write a complex library for 64-bit float precision complex numbers that does everything using 128-bit SIMD registers in the future. This method should work just fine for that, and maybe a smart compiler can vectorize it already.

#include <complex>

template<typename T>
inline std::complex<T> sqrt_notrig(std::complex<T> c) noexcept {
    T a = c.real();
    T b = c.imag();
    T r = std::abs(c);
    return {std::sqrt(T(0.5)*(r + a)), std::copysign(std::sqrt(T(0.5)*(r - a)), b)};
}

View this routine on the godbolt.org compiler explorer here. Unfortunately, sqrtpd (2x64-bit double square root SSE2 instruction) doesn't seem to be produced by any compiler, let me know if you get one to emit it.

Bonus content: a branchless routine to calculate the primary complex root of any real value:

template<typename T>
inline std::complex<T> sqrt_to_complex(T x) noexcept {
    T r = T(0.5)*std::sqrt(std::fabs(x));
    T a = std::copysign(r, x);
    return {r + a, r - a};
}
 
Comments