The 64bit compiler introduced SSE2 maths, replacing the silicon-based implementations of the FPU with software ones.

#### Taylor Series and Angle Reduction

In Delphi XE2 64bit, SSE2 is used to compute the trigonometric functions (cos, sin, etc.), and they are computed through what looks like Taylor series (with double-precision literals being coded in hexadecimal, likely to minimize compiler precision issues).

However, Taylor series only converge quickly for small values, so a large angle value first has to be reduced to the 0 .. 2PI range, which typically involves a form of floating-point Euclidean division or exponent reduction. For typical SSE2 implementations, this means that computing a trigonometric function for a high angle value is slower, as this reduction has to be performed; typically, you’re looking at something like a 25% slowdown, tops.
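To make the two steps concrete, here is a minimal Python sketch (not the XE2 code, which I have not reproduced): it reduces the angle with `math.remainder`, then sums a truncated Taylor series. The name `sin_taylor` and the choice of 15 terms are assumptions made for illustration only.

```python
import math

TWO_PI = 2.0 * math.pi

def sin_taylor(x, terms=15):
    """Sine via angle reduction followed by a truncated Taylor series.

    Illustrative only: the reduction uses the double-precision value of
    2*pi, so it loses accuracy for very large angles.
    """
    # Step 1: bring the angle into [-pi, pi], where the series converges fast.
    r = math.remainder(x, TWO_PI)
    # Step 2: sum r - r^3/3! + r^5/5! - ..., building each term from the last.
    term = r
    total = r
    for n in range(1, terms):
        term *= -r * r / ((2 * n) * (2 * n + 1))
        total += term
    return total
```

For moderate angles, `sin_taylor(100.0)` agrees with `math.sin(100.0)` to many digits; the interesting trouble only starts when the reduction step itself becomes the hard part.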

#### Bottleneck

That said, here come iga2iga2 (in the comments) and Ville Krumlinde, who both noticed a performance issue in Delphi XE2 64bit, especially when compared to other compilers. In XE2 64bit, the reduction is performed through a loop and a fixed-step reduction, which means that the greater the angle value, the slower it gets.
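Why would a fixed-step loop get slower with larger angles? Here is a hypothetical Python sketch, assuming each pass can only strip a fixed number of exponent bits (8 here, an arbitrary choice). It is not the XE2 routine, and it reduces modulo the double-precision 2PI, so it only illustrates the cost, not the accuracy.

```python
import math

TWO_PI = 2.0 * math.pi

def reduce_fixed_step(x, period=TWO_PI, bits_per_step=8):
    """Reduce x >= 0 modulo period, a bounded number of exponent bits per pass.

    Returns (remainder, number_of_passes). Hypothetical illustration only.
    """
    steps = 0
    period_exp = math.frexp(period)[1]
    while x >= period:
        x_exp = math.frexp(x)[1]
        # Subtract a multiple of period * 2^shift; shift shrinks every pass.
        shift = max(x_exp - period_exp - bits_per_step, 0)
        x = math.fmod(x, math.ldexp(period, shift))
        steps += 1
    return x, steps

for angle in (1.0, 100.0, 1e7, 1e14, 1e22):
    print(angle, reduce_fixed_step(angle)[1])
```

The number of passes grows with the exponent of the angle, which is the kind of scaling the timings below show.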

Here are some timings on a sin/cos benchmark (hundreds of thousands of calls):

Angle value | XE2-32 | XE2-64 |
---|---|---|
1.0 | 112 ms | 86 ms |
100 | 113 ms | 125 ms |
1e7 | 114 ms | 3700 ms |
1e14 | 128 ms | 7600 ms |

#### Choices, Choices, Choices

But… *timing isn’t everything*: when computing trigonometry for very large angles, you quickly run into numerical precision issues, and then you basically have three options:

- **just give up**, that’s actually what the FPU does in 32bit, f.i. look at the value of Sin(1e22) in Delphi 32bit, it’s… 1e22. Which is obviously not a valid sine value! And you’ve been living with that potential issue for all your 32bit life…
- **spit out something**, anything, under the assumption that if the user went for such an angle, it was garbage, so garbage in, garbage out, no one will notice it, you didn’t see me do it… you can’t prove anything anyway!
- **try to be accurate**, damn the timings, damn garbage in, damn the torpedoes, full precision ahead! That’s what XE2-64 is doing. I haven’t checked in detail, but the XE2 approach seems to be based on this paper: “argument reduction, for huge arguments: good to the last bit”, and it gets Sin(1e22) right.

Just try for Sin(1e22) in your favorite environment, the correct value is -0.8522, Delphi XE2 64bit Gets It Right, where other environments may just flash a bunch of random decimals to fool your eyes.
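For the curious, that value can be cross-checked by doing the reduction with far more digits of PI than a double holds. The Python sketch below is only an illustration of the principle, not the algorithm from the paper: the name `reduce_exact`, the 100-decimal PI constant and the working precision are my own choices, and it is only good for arguments of roughly this magnitude.

```python
import math
from decimal import Decimal, localcontext

# PI to 100 decimal places (enough for arguments around 1e22).
PI_100 = Decimal(
    "3.14159265358979323846264338327950288419716939937510"
    "58209749445923078164062862089986280348253421170679"
)

def reduce_exact(x):
    """Reduce a double modulo 2*PI using high-precision decimal arithmetic."""
    with localcontext() as ctx:
        ctx.prec = 120              # working precision, in decimal digits
        two_pi = 2 * PI_100
        r = Decimal(x) % two_pi     # Decimal(x) converts the double exactly
        if r < 0:
            r += two_pi
        return float(r)

# Sine of the reduced angle; should be close to -0.8522
print(math.sin(reduce_exact(1e22)))
```

Once the angle is reduced this way, any decent sine for small arguments will do the rest.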

**Update:** as pointed out by Daniel Bartlett in the comments, the AMD LibM library provides a much faster and similarly accurate implementation of sin/cos and other functions.

#### So, what gives?

If you’re after raw accuracy, you’ll have to pay for the extra execution cycles to avoid the garbage out. However, chances are, your code doesn’t have anywhere near the numerical accuracy to avoid garbage in, so no matter the precision in the reduction, you’ll still just get garbage out. And if your code was running in 32bit, chances are you had some huge garbage out already, due to the FPU giving up.

If you’re not after accuracy, f.i. if you’re just using sine/cosine for time-based animations, the extra computing precision *may* bite you, for no benefit, so you’re better off performing the reduction yourself, before calling sin/cos, using whatever low-precision implementation you wish.
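As a minimal sketch of what "performing the reduction yourself" could look like for an animation, here is a Python example that keeps the phase wrapped at every step, so sin never sees a large angle. `wrap_angle` and `Oscillator` are hypothetical names, and the wrap uses the double-precision 2PI, which is plenty for this kind of use but is not a high-precision reduction.

```python
import math

TWO_PI = 2.0 * math.pi

def wrap_angle(angle):
    """Low-precision wrap of an angle into (approximately) the 0 .. 2PI range."""
    r = math.fmod(angle, TWO_PI)
    return r + TWO_PI if r < 0.0 else r

class Oscillator:
    """Time-based sine animation that keeps its phase small."""

    def __init__(self, frequency_hz):
        self.omega = TWO_PI * frequency_hz  # angular speed in rad/s
        self.phase = 0.0

    def advance(self, dt):
        # Accumulate and wrap the phase rather than computing sin(omega * t)
        # with an ever-growing t.
        self.phase = wrap_angle(self.phase + self.omega * dt)
        return math.sin(self.phase)

osc = Oscillator(2.0)
for _ in range(600):            # ten seconds at 60 frames per second
    value = osc.advance(1.0 / 60.0)
```

The phase stays bounded however long the animation runs, so the cost of each sin call stays flat.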

In the long run, it might be preferable for Delphi to just adopt the GIGO approach, and keep the high precision implementations for a high precision maths library: in most situations, they won’t avoid GO because of GI, so it might be best to blend with the rest (in benchmarks).

64-bit gnu c++ also produces -0.8522 for sin(1e22).

The Windows (Vista) Calculator (64-bit calc.exe) gets (radians) Sin(1e22) in “no time”. 7 seconds for Delphi 64-bit sounds silly… 🙂

Interesting. In practice I don’t think it matters what you do for large angles because sensible code won’t ever make such calls.

@Hallvard Vassbotn

Windows XP too… Calc gets -0,85220084976718880177270589375303

And it is right according to WolframAlpha. Am I missing something? o.O

@Hallvard Vassbotn

The benchmark was for a whole lot of sin/cos though, not just one 😉

@David Heffernan

Yes, that’s also my POV. The amount of craftsmanship involved in the reduction code is quite impressive, but the effort would probably have been better invested in providing an arbitrary precision BCD or similar type (which we, mere mortals, could use more safely for high precision situations).

Ohh, that explains a lot… I suggest that you state that clearly in the article. To me it seemed to be the time for a single run. But I’m not a native English speaker…

@EMB

Updated, fwiw a single sin/cos is a matter of nanoseconds, the timings are in milliseconds.

Does SSE2 not have native trig? If so, what else is it missing?

Please put that into QC.

@David Heffernan

No SSE2 doesn’t have any advanced maths, only basic ops and sqrt, no trig, no log, no nothing complex.

@Uwe Schuster

Looks like a design choice, I may not agree 100% with it, but it has its merits, and they couldn’t stumble upon the high-precision reduction code by mistake.

It should be possible to get much faster + still be precise though.

Using AMD’s LibM http://developer.amd.com/libraries/LibM/Pages/default.aspx in Delphi seems to provide slightly more accurate results and is still a lot faster when it comes to large numbers.

Timings for 1000000 iterations:

value | amd_sincos (x64) | SineCosine (x64) | SineCosine (x86) |
---|---|---|---|
0.0001 | 20 ms | 65 ms | 158 ms |
0.1 | 31 ms | 64 ms | 162 ms |
1.0 | 85 ms | 128 ms | 151 ms |
100 | 84 ms | 174 ms | 138 ms |
1E7 | 164 ms | 5321 ms | 140 ms |
1E14 | 163 ms | 10395 ms | 166 ms |
1E22 | 168 ms (14 decimal places) | 16531 ms (15 d.p.) | |

They both start returning the same incorrect numbers above 1E23 or so (different to calc.exe + Wolfram Alpha), and I’m not sure why.

As for the FPU, it only guarantees results up to 2^63 (approx 9.22E18).

It must be possible to write a faster version of pRemDouble, but I’m not sure how.

@Dan Bartlett

That would be worthy of a QC!

Yes, you are right. I should have known that you would not use just a single run to measure something like this. But you usually write these numbers out explicitly, so I was lost. That’s why I was really worried, and I kind of knew I was missing something. Thanks again for this post.