Reasonably accurate/fastish tanh approximation

  • Author or source: Fuzzpilz
  • Created: 2004-08-17 22:56:29
Fairly obvious, but maybe not obvious enough, since I've seen calls to tanh() in code
snippets here.

It's this, basically:

tanh(x) = sinh(x)/cosh(x)
        = (exp(x) - exp(-x))/(exp(x) + exp(-x))
        = (exp(2x) - 1)/(exp(2x) + 1)

Combine this with a somewhat less accurate approximation for exp than usual (I use a
third-order Taylor approximation below), and you're set. Useful for waveshaping.
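A minimal C sketch of the whole idea (function names are mine, not from the original post): approximate exp with the third-order Taylor polynomial and feed it through the identity.

```c
#include <math.h>

/* third-order Taylor approximation of exp, only good near and above zero */
static float exp3(float x)
{
    return 1.0f + x + x * x * 0.5f + x * x * x * (1.0f / 6.0f);
}

/* tanh(x) = (exp(2x) - 1)/(exp(2x) + 1); evaluate at |x| and restore the sign,
   since the truncated Taylor series misbehaves for very negative arguments */
float tanh_approx(float x)
{
    float e = exp3(2.0f * fabsf(x));
    float t = (e - 1.0f) / (e + 1.0f);
    return (x < 0.0f) ? -t : t;
}
```

tanh_approx(1.0f) comes out around 0.727 versus tanhf's 0.762; the approximation always sits slightly below the real tanh.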

Notes on the exp approximation:

It only works properly for inputs above a certain value, but since tanh is odd, that
isn't a problem: evaluate everything at |x| and flip the sign of the result for
negative inputs.

exp(x) = 1 + x + x^2/(2!) + x^3/(3!) + ...

Breaking the Taylor series off after the third term, I get

1 + x + x^2/2 + x^3/6.

I can save some multiplications by using

6 + x * (6 + x * (3 + x))

instead; a division by 6 becomes necessary, but it can be lumped into the additions in
the tanh expression:

(a/6 - 1)/(a/6 + 1) = (a - 6)/(a + 6).


I haven't looked at this in very great detail, but it's always <= the real tanh (>= for
x < 0), and the greatest deviation occurs at about +/- 1.46, where the result is ca. 0.95
times the correct value.

This is still faster than calling tanh() even if you use a better approximation for the
exponential - even if you simply call exp().
There are probably additional ways of improving parts of this, and naturally if you're
going to use it you'll want to figure out whether your particular application offers
additional ways of simplifying it, but it's a good start.
/* single precision absolute value, a lot faster than fabsf() (if you use MSVC++ 6
   Standard - others' implementations might be less slow) */
float sabs(float a)
{
    int b = (*((int *)(&a))) & 0x7FFFFFFF; /* clear the sign bit */
    return *((float *)(&b));
}

/* approximates tanh(x/2) rather than tanh(x) - depending on how you're using this,
   fixing that could well be wasting a multiplication (though that isn't much, and it
   could be done with an integer addition in sabs instead) */
float tanh2(float x)
{
    float a = sabs(x);
    a = 6 + a * (6 + a * (3 + a)); /* third-order Taylor approximation of 6*exp(|x|) */
    return ((x < 0) ? -1 : 1) * (a - 6) / (a + 6);
    /* instead of using <, you can also check directly whether the sign bit is set
       ((*((int *)(&x))) & 0x80000000), but it's not really worth it */
}


Not sure why this didn't occur to me earlier, but you can easily save another two adds:
skip the constant term, i.e. use a = |x|*(6 + |x|*(3 + |x|)), so the numerator is simply
a and the denominator is a + 12:

return ((x<0)?-1:1)*a/(a+12);
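In C this looks like (a sketch; tanh2b is my name for it):

```c
#include <math.h>

/* tanh(x/2): numerator = |x|*(6 + |x|*(3 + |x|)) = P - 6,
   denominator = numerator + 12 = P + 6 */
float tanh2b(float x)
{
    float a = fabsf(x);
    a = a * (6.0f + a * (3.0f + a));
    return ((x < 0.0f) ? -a : a) / (a + 12.0f);
}
```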
You shouldn't need the sabs() on VC6 - you just need to add:

#pragma intrinsic( fabs )

before calling fabsf(), and it should go optimally fast.
  • Date: 2004-10-07 10:56:45
  • By: Laurent de Soras
You can optimise it a bit more by using the fact that tanh (x) = 1 - 2 / (exp (2*x) + 1)
AFAIK intrinsics aren't supported by VC6 Standard, but limited to Professional and Enterprise. Might be wrong, though, in which case I am a silly person. (no time to check now)
Delphi Code:

// approximates tanh(x/2) rather than tanh(x) - depending on how you're using
// this, fixing that could well be wasting a multiplication
function tanh2(x:single):Single;
var a : single;
begin
 a:=abs(x);
 a:=a*(12+a*(6+2*a)); // = 2*(|x|*(6+|x|*(3+|x|))), matching the +24 below
 if (x<0)
  then result:=-a/(a+24)
  else result:= a/(a+24);
end;
Laurent de Soras wrote:
"You can optimise it a bit more by using the fact that tanh (x) = 1 - 2 / (exp (2*x) + 1)"

It's not faster, because you'll need 3 more cycles. The routine would then look like this:

function tanh2(x:single):Single;
var a : single;
begin
 a:=abs(x);
 a:=24+a*(12+a*(6+2*a)); // = 2*(P+6), with P = 6+6|x|+3*x^2+|x|^3
 if (x<0)
  then result:= (-1+24/a)
  else result:=  (1-24/a);
end;
I must have missed this one..
but why is the comparison needed?

a simpler version avoids it entirely. In asm:

function Tanh2(x:Single):Single;
const c3  :Single=3;
      c6  :Single=6;
      c12 :Single=12;
asm
        FLD     x
        FLD     ST(0)
        FLD     c3
        FADD    ST(0),ST(1)
        FMUL    ST(0),ST(1)
        FADD    c6
        FMULP   ST(2),ST(0)
        FADD    c12
        FDIVP   ST(1),ST(0)
end;

..but almost all the CPU is wasted by the division anyway
wait.. has anyone tested this function?

Here's a test plot (image missing from the archive):

Green=the approximation suggested in this thread

Blue=another approximation that does:

function TanH3(x:Single):Single;
begin
  Result:=x - x*x*x/3 + 2*x*x*x*x*x/15; // Taylor series of tanh around 0
end;

If I didn't do anything wrong, the green one is VERY far from TanH. Blue is both closer & computationally more efficient.

But ok, this plot is for a normalized 0..1 input. When you go above that, the blue line goes crazy.
But considering that -1..1 is what matters the most for what we do, the input could still be clipped.
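For completeness, a sketch of the polynomial variant with the clipping suggested above (the function name and the exact clipping threshold are my assumptions):

```c
#include <math.h>

/* odd Taylor polynomial of tanh, usable roughly on [-1, 1];
   the input is clamped there since the polynomial blows up outside */
float tanh3_clipped(float x)
{
    if (x >  1.0f) x =  1.0f;  /* assumed clipping range, per the comment above */
    if (x < -1.0f) x = -1.0f;
    float x2 = x * x;
    return x * (1.0f - x2 * (1.0f/3.0f - (2.0f/15.0f) * x2));
}
```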
forget all this :)

it's all embarrassing bullshit and I obviously need some sleep :)
Ok ignore my above crap that I can't delete, I swear that this one does work :)

First I hadn't seen that this function was assuming x*2, so my graph was scaled by 2..

Second, the other algo (blue line) is still not to be neglected (because no FDIV) when the input is in the -1..1 range, and it does work.

Third, I'm suggesting here a version without the comparison/branching, but still, the CPU difference is negligible because of the FDIV.

Here it is (this one does NOT assume a premultiplied x)..

plain code:

function Tanh2(x:Single):Single;
var   a,b:Single;
begin
  x:=2*x;
  a:=abs(x);
  b:=6+a*(3+a);
  Result:=(x*b)/(a*b+12);
end;

function Tanh22(x:Single):Single;
const Mul2:Single=2;
      c3  :Single=3;
      c6  :Single=6;
      c12 :Single=12;
asm
        FLD     x
        FMUL    Mul2
        FLD     ST(0)
        FABS                 // a
        FLD     c3
        FADD    ST(0),ST(1)
        FMUL    ST(0),ST(1)
        FADD    c6           // b
        FMUL    ST(2),ST(0)  // x*b
        FMULP   ST(1),ST(0)  // a*b
        FADD    c12
        FDIVP   ST(1),ST(0)
end;
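A plain C port of the routine above, convenient for testing against the library tanh (the function name is mine):

```c
#include <math.h>

/* C port of the branch-free Tanh22: tanh(x) ~ (2x)*b / (|2x|*b + 12),
   with b = 6 + |2x|*(3 + |2x|) */
float tanh22_c(float x)
{
    x *= 2.0f;                 /* this version does NOT assume a premultiplied input */
    float a = fabsf(x);
    float b = 6.0f + a * (3.0f + a);
    return (x * b) / (a * b + 12.0f);
}
```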
Any suggestions about improving the 3DNow Divide-Operation??? I simply hate my code...

procedure Transistor2_3DNow(pinp,pout : PSingle; Samples:Integer;Scalar:Single);
const ng   : Array [0..1] of Integer = ($7FFFFFFF,$7FFFFFFF);
      pg   : Array [0..1] of Integer = ($80000000,$80000000);
      c2   : Array [0..1] of Single = (2,2);
      c3   : Array [0..1] of Single = (3,3);
      c6   : Array [0..1] of Single = (6,6);
      c12  : Array [0..1] of Single = (12,12);
      c24  : Array [0..1] of Single = (24,24);
asm
 // register calling convention: eax = pinp, edx = pout, ecx = Samples
 shr        ecx,1      // two samples per iteration
 movd       mm1, Scalar
 punpckldq  mm1, mm1
 movq       mm0, c2
 pfmul      mm0, mm1   // mm0 = 2*Scalar in both halves

@Start:
 movq       mm1, [eax] // mm1=input
 pfmul      mm1, mm0   // mm1=a
 movq       mm2, mm1   // mm1=a,   mm2=a

 pand       mm2, ng    // mm1=a,   mm2=|a|

 movq       mm3, mm2
 pfadd      mm3, c3    // mm1=a,   mm2=|a|, mm3=|a|+3
 pfmul      mm3, mm2   // mm1=a,   mm2=|a|, mm3=|a|*(|a|+3)
 pfadd      mm3, c6    // mm1=a,   mm2=|a|, mm3=6+|a|*(3+|a|)
 pfmul      mm3, mm2   // mm1=a,   mm2=|a|, mm3=|a|*(6+|a|*(3+|a|))
 pfadd      mm3, c12   // mm1=a,   mm2=|a|, mm3=b=12+|a|*(6+|a|*(3+|a|))
 pfmul      mm1, mm3   // mm1=a*b, mm2=|a|, mm3=b
 pfmul      mm2, mm3   // mm1=a*b, mm2=|a|*b
 pfadd      mm2, c24   // mm1=a*b, mm2=|a|*b+24

 movq       mm3, mm2   // 1/(|a|*b+24): pfrcp seed plus two refinement steps per element
 pfrcp      mm4, mm3
 punpckldq  mm3, mm3
 pfrcpit1   mm3, mm4
 pfrcpit2   mm3, mm4   // mm3 = 1/low in both halves
 movq       mm4, mm2
 punpckhdq  mm4, mm4
 pfrcp      mm5, mm4
 pfrcpit1   mm4, mm5
 pfrcpit2   mm4, mm5   // mm4 = 1/high in both halves
 punpckldq  mm3, mm4   // mm3 = (1/low, 1/high)
 pfmul      mm1, mm3   // mm1 = a*b / (|a|*b+24)
 movq       [edx],mm1
 add        eax,8
 add        edx,8
 loop       @Start
 femms
end;
mmh why the loop? You can't process more than 2 Tanh in parallel in this filter, can you?
What CPU gain did you get btw?

Anyway, it sucks that 3DNow doesn't provide division.. SSE does, though.. DIVPS (or DIVPD to get double accuracy in this case) would work here. Only problem is that on an AMD I usually get better performance out of 3DNow than SSE/SSE2.
The loop works perfectly well, but it's of course for Tanh() processing of a whole block instead of inside the moog filter.

The fact that 3DNow doesn't provide division really sucks. Anyway, this way I save a small amount of performance, but it's not huge. I believe one can optimize the 12 lines of division further, though. Also data prefetching might help a little. Or restructuring, because on AMD the instruction order does matter!

I'll SSE/SSE2 the code tonight. I think SSE gives a good performance boost, but SSE2 precision would be needed if the thing is inside the moog filter (IIR filter coefficients should always stay 64-bit to avoid digital artifacts).
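As a sketch of that SSE route (everything here - the function name, loop structure, and the assumption that the block length is a multiple of 4 - is mine, not from the thread), DIVPS maps to the _mm_div_ps intrinsic:

```c
#include <xmmintrin.h>  /* SSE intrinsics */

/* Process 4 samples at once: tanh(Scalar*x) ~ a*b / (|a|*b + 12),
   with a = 2*Scalar*x and b = 6 + |a|*(3 + |a|),
   using SSE's real division instead of a reciprocal iteration.
   n is assumed to be a multiple of 4. */
void tanh2_sse(const float *in, float *out, int n, float scalar)
{
    const __m128 sign_mask = _mm_set1_ps(-0.0f);
    const __m128 c3  = _mm_set1_ps(3.0f);
    const __m128 c6  = _mm_set1_ps(6.0f);
    const __m128 c12 = _mm_set1_ps(12.0f);
    const __m128 gain = _mm_set1_ps(2.0f * scalar);
    for (int i = 0; i < n; i += 4) {
        __m128 x = _mm_mul_ps(_mm_loadu_ps(in + i), gain);
        __m128 a = _mm_andnot_ps(sign_mask, x);            /* |x| */
        __m128 b = _mm_add_ps(c6, _mm_mul_ps(a, _mm_add_ps(c3, a)));
        __m128 num = _mm_mul_ps(x, b);
        __m128 den = _mm_add_ps(_mm_mul_ps(a, b), c12);
        _mm_storeu_ps(out + i, _mm_div_ps(num, den));      /* DIVPS */
    }
}
```

Note that DIVPS costs far fewer instructions than the 3DNow reciprocal sequence, at the price of a long-latency divide.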


Here's the Analog Devices "Sharc" DSP translation of the tanh function (inline processing of register f0):

f11 = 2.;
f0  = f0 * f11;              /* x = 2*x */
f11 = abs f0;                /* f11 = a = |2x| */
f3  = 3.;
f12 = f11+f3;                /* a + 3 */
f12 = f11*f12;               /* a*(3 + a) */
f3  = 6.;
f12 = f12+f3;                /* b = 6 + a*(3 + a) */
f0  = f0*f12;                /* numerator: 2x*b */
f12 = f11*f12;               /* a*b */
f7  = 12.;
f12 = f12+f7;                /* denominator: a*b + 12 */
f11 = 2.;
f0 = recips f12, f7=f0;      /* f0 = reciprocal seed, f7 = numerator */
f12=f0*f12;
f7=f0*f7, f0=f11-f12;        /* Newton-Raphson refinement */
f12=f0*f12;
f7=f0*f7, f0=f11-f12;
f7=f0*f7;                    /* f7 = numerator/denominator */
f0=f7;

it can be optimized further, but hey...
  • Date: 2006-02-25 09:57:21
  • By: Gene
tanh(x/2) ~ x/(abs(x) + 3/(2 + x*x))

  • Date: 2006-02-25 11:53:34
  • By: Gene
tanh(x/2) ~ x/(abs(x)+2/(2.12-1.44*abs(x)+x*x))

Maximum normalized difference 0.0063 from real tanh (x/2) - good enough now.