Implementing cosine in C from scratch
7/19/2020
Update 7/20: See the discussion of this post on Reddit.
TL;DR: I explored how to implement cosine using a few different approaches. One of the implementations is nearly 3x faster than math.h if you are happy with 4 decimal places of precision.
Have you ever wondered how the math library in your favorite programming language implements trigonometric functions, like cosine? It is such a common function that you'll find it in any and every math library, so it must be fairly straightforward, right? Oh no. It most definitely is not.
This all started when my friend and colleague, Dr. Stephen Marz, was working on the kernel of an operating system and I suggested that he draw the cosine function on the screen. I often use cosine as a "hello, world" for graphical applications. The problem: his kernel can't use the C standard library (goodbye math.h!) and it targets the RISC-V architecture (nothing like the Intel fcos instruction!).
A long journey ensued.
It's important to note that I'm neither a mathematician nor a systems programming guru. In fact, out of my 10 years of college studying mathematics and computer science, I somehow never took a trigonometry course. So I'll walk you through my hurdles of investigating and implementing cosine in C. The goals are:
- Simple enough that even I can understand the code.
- Accurate enough that no mortal will notice the error.
- Fast enough that a video game could use it.
We'll explore several ways to compute cosine and a few optimizations. Since I'm not a C ninja, I'll be avoiding any elaborate trickery or micro-optimizations (but if you know of some, let me know!). In fact, I often chose more readable code over slightly more performant code. We'll benchmark as we go along to understand the implications of our actions. All of the code and benchmarks can be found in the GitHub repository.
Table of contents:
- Why do I care about cosine?
- Ways to compute cosine
- Benchmark setup
- First attempt: Naive Taylor series with literals
- Second attempt: Better Taylor series with literals
- Third attempt: Taylor series with running product
- Fourth attempt: Lookup table
- Fifth attempt: Lookup table with LERP
- Final comparison
- Conclusion
Why do I care about cosine?
In all my time coding, there has only been one scenario in which I've used cosine: games, games, games.
function move(obj) {
    obj.x += obj.speed * Math.sin(obj.rotation);
    obj.y += obj.speed * Math.cos(obj.rotation);
}
This is the basic code that I've used in almost all of my games to move an object in a given direction, whether it be the player from a top-down perspective or projectiles flying across the screen.
It looks like this:
Maybe don't stare at that for too long…
Ways to compute cosine
When I started down this transcendental rabbit hole, I found the Taylor series method of approximating cosine.
This seems to be the de facto way of calculating cosine in math libraries, at least at a high level. Given more terms, the approximation will be more accurate.
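For reference, the series in question is the Maclaurin series for cosine:

```latex
\cos(x) = \sum_{n=0}^{\infty} \frac{(-1)^n}{(2n)!} x^{2n}
        = 1 - \frac{x^2}{2!} + \frac{x^4}{4!} - \frac{x^6}{6!} + \cdots
```

Each additional term is divided by a rapidly growing factorial, which is why a handful of terms can go a long way near zero.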
One idea that came to mind is to use a lookup table. It's a precalculated array of values that can be used to look up the nearest cosine value given some input. This wasn't uncommon at all back when computing power was more limited. I couldn't find any notable projects on GitHub that use a table for trig functions, but I'm sure they exist.
CORDIC is another term that kept popping up in my searches. It's an iterative method for computing functions like sine and cosine using only addition, subtraction, bit shifting, and a small lookup table. It was often implemented in hardware, going back to the late 1950s, or in software that typically runs on low-end CPUs or microcontrollers, like those found in calculators. This method was quite popular in the past, and was used by the Intel 8087 math coprocessor, the TI-85 calculator, and the HP-48G calculator. However, I can't find any references to whether it's used much today. For more details, see the Wikipedia article, the original paper describing the method, or check out an implementation written in C. I won't be comparing my methods to it, but I'm a bit curious how it holds up. Figure from the original paper:
Then there's the Intel CPU fcos instruction. The instruction computes cosine given a floating-point value in radians in the range -2^63 to 2^63 stored in an FPU register, and replaces it with the result. I'm not sure whether modern Intel CPUs still use the CORDIC method, and if so, whether it's implemented in hardware or software. After disassembling several programs that use cosine, I couldn't find one that actually used the fcos instruction. Although it's fast, others have documented the inaccuracy of the instruction, most notably as the argument approaches multiples of pi/2. For more details, see this unofficial documentation or the official Intel reference.
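For the curious, invoking the instruction directly from C looks something like this (GCC/Clang extended inline-assembly syntax; the "=t" constraint pins the operand to the top of the x87 register stack). Consider it a demonstration, not a recommendation — the instruction only exists on x86, so this sketch falls back to math.h elsewhere:

```c
#include <math.h>

// Calls the x87 fcos instruction directly on x86 targets; falls back to
// math.h's cos() on other architectures so the sketch stays portable.
double fcos_asm(double x) {
#if defined(__i386__) || defined(__x86_64__)
    double result;
    __asm__ ("fcos" : "=t"(result) : "0"(x));
    return result;
#else
    return cos(x);
#endif
}
```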
Now, any sane person will just use C's math.h. And for everyday purposes, you probably should. But remember, we can't use anything from the standard library. Also, that wouldn't be any fun. I did compare the accuracy of math.h's cosine function on my computer to Wolfram Alpha. I found that math.h is accurate up to 15 digits of precision, which is far more than I'll ever need.
To see how the standard library computes cosine, I went and looked at the source of several C standard library implementations: Glibc, Newlib, Musl, etc. Even though they looked like they were using a Taylor series, these were a bit too much for me to get my head around. They were all very different from one another, often relied on several dense functions, were littered with magic numbers, had tables of precomputed values, and used extensive bit trickery. Someone spent a lot of time making these fast.
Here is a screenshot from when I was trying to walk through the relevant code in Musl. From cos() to __cos() to __rem_pio2().
Yikes.
Benchmark setup
As I progress through implementing the different methods of calculating cosine, I will be comparing them from two perspectives: runtime and accuracy. For runtime, each function is executed 100 million times using a range of input values and timed with time.h's clock function. For accuracy, it takes the difference between my function's results and math.h's results over a range of inputs, and returns the worst case. For example, an accuracy value of 0.0002 means that in the worst case, my implementation was 0.0002 off from math.h for one input over a wide range of inputs.
Here is the relevant code:
// Measures the time of many executions in seconds. Smaller number is better.
double runtime(double (*func)(double)) {
    clock_t start = clock();
    for (int i = 0; i < 100000000; i++) {
        volatile double c = func(i / 10000.0);
        (void)c;
    }
    return (clock() - start) / (double)CLOCKS_PER_SEC;
}

// Finds the worst case for accuracy compared to math.h. Smaller number is better.
double accuracy(double (*func)(double)) {
    double w = -1;
    double start = 0;
    double stop = CONST_2PI;
    double step = 0.0000001;
    for (double i = start; i < stop; i += step) {
        double c = absd(func(i) - cos(i));
        if (c > w) {
            w = c;
        }
    }
    return w;
}
The benchmark was compiled with clang 11.0.3 and run on a 13-inch 2018 MacBook Pro with a 2.7GHz i7 CPU and 16GB of RAM.
You can find all of the benchmark code in the GitHub repo. Thanks to Dr. Marz for rewriting it to have an easy-to-use interface.
First attempt: Naive Taylor series with literals
First, I tried a literal translation of the Taylor series with 4 terms:
double cos_taylor_literal_4terms_naive(double x) {
    return 1 - ((x * x) / (2)) + ((x * x * x * x) / (24))
             - ((x * x * x * x * x * x) / (720))
             + ((x * x * x * x * x * x * x * x) / (40320));
}
This seemed surprisingly accurate when I started testing it with input values like 0.1 and 0.235. My enthusiasm faded when I graphed it next to math.h:
The magenta line is math.h while the green is my function. It looks fairly accurate between -pi and +pi, but then it explodes.
Perhaps more terms will help.
double cos_taylor_literal_6terms_naive(double x) {
    return 1 - ((x * x) / (2)) + ((x * x * x * x) / (24))
             - ((x * x * x * x * x * x) / (720))
             + ((x * x * x * x * x * x * x * x) / (40320))
             - ((x * x * x * x * x * x * x * x * x * x) / (3628800))
             + ((x * x * x * x * x * x * x * x * x * x * x * x) / (479001600));
}
Now if we graph that…
Much better. But it still explodes as it approaches 2pi.
Second attempt: Better Taylor series with literals
There were 3 potential improvements that jumped out at me at this point: reduce the range of the input, reduce the number of redundant calculations, and keep adding more terms.
The first optimization I tried was range reduction. The bigger the input value, the less accurate this method is. Since cosine repeats every 2pi, we just want to do x = x % (2*pi);. However, in C the modulus operator doesn't work on floating-point numbers, so we made our own.
#define modd(x, y) ((x) - (int)((x) / (y)) * (y))

double cos_taylor_literal_6terms_2pi(double x) {
    x = modd(x, CONST_2PI);
    return 1 - ((x * x) / (2)) + ((x * x * x * x) / (24))
             - ((x * x * x * x * x * x) / (720))
             + ((x * x * x * x * x * x * x * x) / (40320))
             - ((x * x * x * x * x * x * x * x * x * x) / (3628800))
             + ((x * x * x * x * x * x * x * x * x * x * x * x) / (479001600));
}
That is better for values over 2pi, but it's still very inaccurate leading up to 2pi. We can do better, since the cosine values repeat every multiple of pi, except that the sign flips. To do that, we can mod by 2pi, subtract pi if the value is greater than pi, calculate the Taylor series, and then apply the correct sign. So we're really only calculating cosine from 0 to pi.
double cos_taylor_literal_6terms_pi(double x) {
    x = modd(x, CONST_2PI);
    char sign = 1;
    if (x > CONST_PI) {
        x -= CONST_PI;
        sign = -1;
    }
    return sign * (1 - ((x * x) / (2)) + ((x * x * x * x) / (24))
                     - ((x * x * x * x * x * x) / (720))
                     + ((x * x * x * x * x * x * x * x) / (40320))
                     - ((x * x * x * x * x * x * x * x * x * x) / (3628800))
                     + ((x * x * x * x * x * x * x * x * x * x * x * x) / (479001600)));
}
That helps the accuracy considerably as the input approaches 2pi.
The next optimization involved removing some of the redundant calculations. You'll notice the x*x everywhere in the code. All I did was reduce some of the multiplications with double xx = x * x;.
double cos_taylor_literal_6terms(double x) {
    x = modd(x, CONST_2PI);
    char sign = 1;
    if (x > CONST_PI) {
        x -= CONST_PI;
        sign = -1;
    }
    double xx = x * x;
    return sign * (1 - ((xx) / (2)) + ((xx * xx) / (24))
                     - ((xx * xx * xx) / (720))
                     + ((xx * xx * xx * xx) / (40320))
                     - ((xx * xx * xx * xx * xx) / (3628800))
                     + ((xx * xx * xx * xx * xx * xx) / (479001600)));
}
This was a huge performance win! I also tried double xxxx = xx * xx; but didn't see much of a difference, so I moved on.
I still wasn't sure how many terms to use, so I tried it with up to 10 terms to see how that improves accuracy:
double cos_taylor_literal_10terms(double x) {
    x = modd(x, CONST_2PI);
    char sign = 1;
    if (x > CONST_PI) {
        x -= CONST_PI;
        sign = -1;
    }
    double xx = x * x;
    return sign * (1 - ((xx) / (2)) + ((xx * xx) / (24))
                     - ((xx * xx * xx) / (720))
                     + ((xx * xx * xx * xx) / (40320))
                     - ((xx * xx * xx * xx * xx) / (3628800))
                     + ((xx * xx * xx * xx * xx * xx) / (479001600))
                     - ((xx * xx * xx * xx * xx * xx * xx) / (87178291200))
                     + ((xx * xx * xx * xx * xx * xx * xx * xx) / (20922789888000))
                     - ((xx * xx * xx * xx * xx * xx * xx * xx * xx) / (6402373705728000))
                     + ((xx * xx * xx * xx * xx * xx * xx * xx * xx * xx) / (2432902008176640000)));
}
At this point, when looking at the graph, the 10-term line overlaps with math.h. Progress! When comparing the worst-case accuracy with math.h, the 4 terms was a joke. The 6 terms was off by 0.0001, which is more accuracy than I need, while the 10 terms was off by a mere 0.00000000007. Woohoo!
However, more terms come at a steep runtime cost. From the benchmark, the naive 4 terms only takes 0.4 seconds, 6 terms takes 0.94 seconds, and 10 terms takes 1.46 seconds. Meanwhile, math.h only takes about 1.04 seconds with far more accuracy.
Adding more terms with this approach looked a bit ridiculous, so I moved on.
Third attempt: Taylor series with running product
After showing my progress to Dr. Marz, he did some algebra magic and sent me his modified version. His method removes a lot of redundant calculations by storing them, and has the added benefit of allowing you to specify how many terms you want. For certain applications this could be useful, so that you can have varying degrees of precision/speed as a parameter.
double cos_taylor_running_yterms(double x, int y) {
    int div = (int)(x / CONST_PI);
    x = x - (div * CONST_PI);
    char sign = 1;
    if (div % 2 != 0)
        sign = -1;
    double result = 1.0;
    double inter = 1.0;
    double num = x * x;
    for (int i = 1; i <= y; i++) {
        double comp = 2.0 * i;
        double den = comp * (comp - 1.0);
        inter *= num / den;
        if (i % 2 == 0)
            result += inter;
        else
            result -= inter;
    }
    return sign * result;
}
For the sake of the benchmarks, I didn't use this version with the second parameter. Instead, I duplicated the function and hardcoded the loop to work for fixed values (like 6, 10, and 16).
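For example, the hardcoded 10-term variant might look like the following (my reconstruction of what that duplication would look like — the exact code lives in the repo; CONST_PI is defined here so the sketch stands alone):

```c
#define CONST_PI 3.14159265358979323846

// The running-product loop from above with the term count fixed at 10,
// so the compiler is free to unroll and constant-fold the loop.
double cos_taylor_running_10terms(double x) {
    int div = (int)(x / CONST_PI);
    x = x - (div * CONST_PI);
    char sign = 1;
    if (div % 2 != 0)
        sign = -1;
    double result = 1.0;
    double inter = 1.0;
    double num = x * x;
    for (int i = 1; i <= 10; i++) {
        double comp = 2.0 * i;
        double den = comp * (comp - 1.0);
        inter *= num / den;   // running product: each term reuses the last
        if (i % 2 == 0)
            result += inter;
        else
            result -= inter;
    }
    return sign * result;
}
```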
There are definitely diminishing returns with adding more terms. At 16 terms, the worst-case accuracy is off by only 0.0000000000000009, but it takes 2.57 seconds for the runtime benchmark. It's not slow by any means, but compared to math.h… it is.
Fourth attempt: Lookup table
The other option I wanted to try is a lookup table. The idea is to precompute a bunch of values and hardcode them in an array. Lookup tables existed looong before computers, so this isn't a novel technique. In this case, I'm hoping that giving up a bit of memory will provide a big runtime win while still being accurate enough.
To generate the lookup table, Dr. Marz wrote a Python script that generates a C header file containing an array where each element is a cosine value calculated using math.h. Very nice!
from math import cos, pi

def main(f, PRECISION, NAME):
    f.write("double %s[] = {\n" % NAME)
    j = 0
    p = 0.0
    while True:
        f.write("{:.20f}, ".format(cos(p)))
        j += 1
        p += PRECISION
        if p > 2*pi:
            break
    f.write("1.0 };\n")
    f.write("const int %s_size = %d;\n" % (NAME, j+1))

if __name__ == '__main__':
    main(open("costable_1.h", "w"), 1.0, "costable_1")
    main(open("costable_0_1.h", "w"), 0.1, "costable_0_1")
    main(open("costable_0_01.h", "w"), 0.01, "costable_0_01")
    main(open("costable_0_001.h", "w"), 0.001, "costable_0_001")
    main(open("costable_0_0001.h", "w"), 0.0001, "costable_0_0001")
We wanted to test our tables with different precisions. We generated tables with 8 values, 64 values, 630 values, 6285 values, and 62833 values. The cost comes in the form of an increased executable size. The 1.0 and 0.1 tables aren't noticeable, but the other tables increased the executable size by about 5KB, 50KB, and 500KB, respectively.
// Clears the sign bit to compute the absolute value without branching.
double absd(double a) {
    *((unsigned long *)&a) &= ~(1UL << 63);
    return a;
}

double cos_table_0_01(double x) {
    x = absd(x);
    x = modd(x, CONST_2PI);
    return costable_0_01[(int)(x * 100 + 0.5)];
}
The tables seem to hit a nice balance of accuracy. The worst-case accuracy of the smallest table is 0.49, so not usable. But with each increase in table size, you get 1 more digit of precision: 0.049, 0.0049, 0.00049, and 0.000049. The runtime test for each table was about 0.38 seconds. That's fast! (Actually, we got the runtime down to about 0.33, but the code was ugly.)
Fifth attempt: Lookup table with LERP
Sure, a lookup table is great, but we can do better for values that aren't in the table. Introducing: linear interpolation. That's just a cool-sounding term for taking the weighted average between two values. Now when the input value isn't in the table, we compute an approximation based on the two nearest table entries. The code:
#define lerp(w, v1, v2) ((1.0 - (w)) * (v1) + (w) * (v2))

double cos_table_0_01_LERP(double x) {
    x = absd(x);
    x = modd(x, CONST_2PI);
    double i = x * 100.0;
    int index = (int)i;
    return lerp(i - index,               /* weight      */
                costable_0_01[index],    /* lower value */
                costable_0_01[index + 1] /* upper value */);
}
Final comparison
Here is a comparison of our functions' worst-case accuracy compared to math.h (lower is better!):
Here is a comparison of our functions' runtime to compute 100,000,000 values (lower is better!):
Conclusion
So what do I recommend using? math.h if possible. None of these functions are particularly slow, and most of them are accurate enough. But the next time I make a game that depends heavily on trig functions, I'll be using the 0_001 table.
Until the next rabbit hole.