hpdz.net

High-Precision Deep Zoom

Technical Info - Bignum Arithmetic

Introduction

This page discusses the basics of BigNum arithmetic, which refers to high-precision arithmetic using more digits than the native architecture of the processor supports directly.

Some of the content refers to older obsolete techniques that were in use prior to the migration to 64-bit code, but is kept here for historical purposes. Also see the page on floating-point arithmetic.

This page has the following topics:

The first section is a general discussion of the basics of bignum arithmetic, including Karatsuba multiplication. At this point, with only about 500 bits at most in use, I don't have any need for more advanced methods like Toom-Cook or Schonhage-Strassen, or for Fourier-transform methods either, and they are not discussed.

The second section describes some details of how I have implemented bignum math in my software and provides some speed benchmarks.

The final two sections give performance data on the Pentium-IV and Core2 architectures.

I hope this page will help any aspiring young software engineers who might be working on their own bignum arithmetic functions. This area of computer science doesn't have many practical applications, but it is a lot of fun and a great exercise to hone your programming skills (not to mention the fact that it's a pretty common college computer science class project). Plus, once you've got it working, it is really cool to have a program that can multiply thousand-digit numbers or calculate the millionth digit of pi.

General Discussion

Bignum arithmetic refers to software algorithms that perform the fundamental arithmetic operations of addition, subtraction, multiplication, and division on numbers with more digits than the hardware of the machine they are running on can support natively. These algorithms are an interesting and important branch of computer science in their own right, independent of their application to fractal art.

Background

The Pentium processors (and the closely related Core, Core2, and Athlon processors) all have built-in math units that can very rapidly perform addition, subtraction, multiplication, division, and more complex operations like square roots, logarithms, and trigonometric functions. But the built-in hardware units are limited to 52 bits of precision, which is equivalent to about 16 decimal digits (15.65, to be exact):

Digits = log10(2^52) = 52 log10(2) = 15.653559774527022151114422525674...

That is nowhere near enough to allow zooming in to 10^50 or 10^100. The only way to achieve the vastly greater precision required for animations that zoom to depths like that is by using software arithmetic operations. That means literally writing functions to perform addition, subtraction, multiplication, division, etc. for very large numbers.

Streaming SIMD Instructions

For many years, prior to the move to 64-bit code, I used SSE2 instructions for my multiplication kernel operations. The SSE2 instruction set features some extremely fast instructions that operate on multiple pieces of data at one time, but they have some limitations. In particular, they have no way to propagate carries, so some rather complicated carry-saving techniques have to be used (this is how many hardware multipliers work internally). In the end, the added speed was worth the extra complications, but the details are outside the scope of general bignum operations and won't be covered here, especially since this code is now obsolete after the migration to 64-bit arithmetic functions.

Basic Representation

High-precision numbers in my software are stored as sequences of unsigned 32-bit "digits" (referred to as "limbs" in the GNU math library). This remains the case even after the migration to 64-bit code. From the perspective of the C++ code, each digit is a 32-bit value, but the assembly language code works with them in groups of two.

Numbers are stored in a fixed-point format with one digit of integer. The smallest precision that is implemented in my software is four 32-bit digits, which means one digit of integer and three 32-bit digits to the right of the decimal point. Precision can be increased in chunks of two 32-bit digits. Previously, I had to work in even-numbered chunks like this because of the way the SSE2 instruction set works. Now, the 64-bit instructions work in groups of two 32-bit digits. In a more general case, not using 64-bit instructions, precision can be added one digit at a time.

All digits are unsigned. A separate sign bit is used.
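To make the layout concrete, here is a minimal C++ sketch of such a representation. This is illustrative only, not the actual class in the software; the struct name and the digit ordering (integer digit first) are assumptions.

#include <cstdint>

// Hypothetical fixed-point layout: one integer digit followed by NFRAC
// fractional 32-bit digits, most-significant digit first, with a separate
// sign flag since the digits themselves are unsigned.
struct FixedPointNumber {
    static const int NFRAC = 3;        // smallest supported size: 1 + 3 = 4 digits
    bool     negative;                 // separate sign bit
    uint32_t digit[1 + NFRAC];         // digit[0] = integer part, digit[1..] = fraction
};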

Addition and Subtraction

Traditional addition works by adding digits consecutively starting from the rightmost (least-significant) digit and proceeding to the left, propagating the carries that are generated from each individual digit sum as you go.

For example, suppose you want to add 3867 + 1946. You start with 7+6 and proceed like this:

Carry:    111
          3867
        + 1946
Sum:      5813

A carry is propagated into the next digit position each time two digits are added, and at most one bit of carry can be generated per digit addition. The x86 ADD and ADC instructions handle this well, and addition can be done with a simple loop.
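As an illustration, here is a minimal C++ sketch of such a loop. It is not the production assembly code; the function name is made up, the digits are assumed to be stored with the integer digit first (as in the sketch above), and a 64-bit temporary plays the role of the hardware carry flag.

#include <cstdint>
#include <cstddef>

// Add two N-digit numbers a and b into sum, propagating the carry by hand.
void bignum_add(const uint32_t* a, const uint32_t* b, uint32_t* sum, size_t ndigits)
{
    uint64_t carry = 0;
    for (size_t i = ndigits; i-- > 0; ) {      // start at the least-significant digit
        uint64_t t = (uint64_t)a[i] + b[i] + carry;
        sum[i] = (uint32_t)t;                  // low 32 bits become this digit
        carry  = t >> 32;                      // at most one bit of carry
    }
    // a nonzero final carry out of digit 0 means the fixed-point result overflowed
}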

Multiplication

Multiplication is a bit more complicated than addition. To multiply two N-digit numbers, we have to form N x N = N^2 pairs of individual digit products, then perform a whole bunch of additions.

For example, consider 3867 x 1946. When we calculate this on paper, we break the problem down into (3867x6) + (3867x4)x10 + (3867x9)x100 + (3867x1)x1000.

                3867
              x 1946
3867 x 6:      23202
3867 x 4:     15468
3867 x 9:    34803
3867 x 1:    3867
Sum:         7525182

For a fixed-point multiplication system like I use, where the first digit represents the integer part and the next three digits represent the fraction, this example shows how to multiply 3.867 x 1.946, giving a result of 7.525182.

Note that the product has twice as many digits as the original numbers did. For some purposes it makes sense to save some or all of those extra digits, but I don't do that. I truncate each multiplication product so there are always the same number of digits in the product as there were in the original numbers. So the final answer here would be 7525, corresponding to 7.525.

That truncation means that some of the digit products don't need to be calculated. For example, we don't use the 2 that comes from the product of 6x7=42 in the rightmost place. We have to be careful, because there is a carry (the 4) from that digit product into the next place, and that can affect the final answer. Careful analysis of this shows that you can drop some digit products and not affect the final answer by more than 1 bit. I am not going to reproduce that analysis here, but just show the final result. To see this better, we need to rearrange the multiplication operation above into a square matrix:

       3     8     6     7
 1   1x3   1x8   1x6   1x7
 9   9x3   9x8   9x6   9x7
 4   4x3   4x8   4x6   4x7
 6   6x3   6x8   6x6   6x7

The digit products toward the lower-right corner of this grid (the ones that feed only the least-significant digits that get truncated away) can safely be skipped without affecting the final truncated product by more than one least-significant bit.

Here is a larger table showing the 8x8 case, where an 8-digit number A = a7a6...a1a0 is multiplied by an 8-digit number B = b7b6...b1b0:

       a7    a6    a5    a4    a3    a2    a1    a0
 b7  b7a7  b7a6  b7a5  b7a4  b7a3  b7a2  b7a1  b7a0
 b6  b6a7  b6a6  b6a5  b6a4  b6a3  b6a2  b6a1  b6a0
 b5  b5a7  b5a6  b5a5  b5a4  b5a3  b5a2  b5a1  b5a0
 b4  b4a7  b4a6  b4a5  b4a4  b4a3  b4a2  b4a1  b4a0
 b3  b3a7  b3a6  b3a5  b3a4  b3a3  b3a2  b3a1  b3a0
 b2  b2a7  b2a6  b2a5  b2a4  b2a3  b2a2  b2a1  b2a0
 b1  b1a7  b1a6  b1a5  b1a4  b1a3  b1a2  b1a1  b1a0
 b0  b0a7  b0a6  b0a5  b0a4  b0a3  b0a2  b0a1  b0a0

In general, as the number of digits increases, we save almost half the multiplications.

To calculate the exact number of digit products we still have to compute, let N be the number of digits in each number we are multiplying. The number of digit products on the diagonal and in the upper left half is N(N+1)/2. The number of digit products kept from the subdiagonal is N-2. So the total is N(N+1)/2 + (N-2) = (N^2+3N-4)/2.
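Putting the pieces together, here is an illustrative C++ sketch of a truncated fixed-point multiply built this way. This is not the SSE2 or 64-bit kernel from the software; it keeps the whole i+j = N column (slightly more than the minimum counted above), and digit 0 is the integer digit as in the earlier sketches.

#include <cstdint>
#include <cstddef>
#include <vector>

// Truncated schoolbook multiply: a and b each have N digits (digit 0 is the
// integer part), and the result keeps only the top N digits of the product.
// Digit products with i + j > N are skipped entirely; the i + j == N column
// is computed only so its carry can ripple into the kept digits.
std::vector<uint32_t> truncated_mul(const uint32_t* a, const uint32_t* b, size_t N)
{
    std::vector<uint64_t> col(N + 1, 0);            // column sums, weights e^0 .. e^-N
    for (size_t i = 0; i < N; ++i) {
        for (size_t j = 0; j < N && i + j <= N; ++j) {
            uint64_t p = (uint64_t)a[i] * b[j];     // one digit product
            size_t   k = i + j;                     // its column (weight e^-k)
            col[k] += p & 0xFFFFFFFFu;              // low half stays in column k
            if (k > 0)
                col[k - 1] += p >> 32;              // high half belongs one column up
            // (for k == 0 a nonzero high half would overflow the single
            //  integer digit; this sketch ignores that case)
        }
    }
    std::vector<uint32_t> r(N);
    uint64_t carry = col[N] >> 32;                  // only the carry out of the dropped column matters
    for (size_t k = N; k-- > 0; ) {                 // propagate carries toward the integer digit
        uint64_t t = col[k] + carry;
        r[k]  = (uint32_t)t;
        carry = t >> 32;
    }
    return r;                                       // a carry out of r[0] again means overflow
}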

Squaring

If A=B in the multiplication grid above, then all the digit products in the lower left half of the grid are duplicates of the products in the upper right, since, for example, a4a6 = a6a4. The following table shows the squaring grid; the duplicated products below the main diagonal are the additional ones that can be eliminated (a code sketch exploiting this follows the table). For large values of N, the final result is that we are doing about 1/4 as many digit multiplications as before. Unlike the previous saving, taking advantage of this symmetry when squaring numbers introduces no loss of precision.

       a7    a6    a5    a4    a3    a2    a1    a0
 a7  a7a7  a7a6  a7a5  a7a4  a7a3  a7a2  a7a1  a7a0
 a6  a6a7  a6a6  a6a5  a6a4  a6a3  a6a2  a6a1  a6a0
 a5  a5a7  a5a6  a5a5  a5a4  a5a3  a5a2  a5a1  a5a0
 a4  a4a7  a4a6  a4a5  a4a4  a4a3  a4a2  a4a1  a4a0
 a3  a3a7  a3a6  a3a5  a3a4  a3a3  a3a2  a3a1  a3a0
 a2  a2a7  a2a6  a2a5  a2a4  a2a3  a2a2  a2a1  a2a0
 a1  a1a7  a1a6  a1a5  a1a4  a1a3  a1a2  a1a1  a1a0
 a0  a0a7  a0a6  a0a5  a0a4  a0a3  a0a2  a0a1  a0a0
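Here is an illustrative C++ sketch of a truncated squaring routine that uses this symmetry. Again, this is not the actual kernel; it follows the same layout and column scheme as the multiplication sketch above, forming only the products with i <= j and counting the off-diagonal ones twice.

#include <cstdint>
#include <cstddef>
#include <vector>

// Truncated squaring: like truncated_mul(a, a, N), but each off-diagonal
// product a[i]*a[j] (i < j) is formed once and added twice.  The halves are
// doubled individually so nothing overflows 64 bits.
std::vector<uint32_t> truncated_square(const uint32_t* a, size_t N)
{
    std::vector<uint64_t> col(N + 1, 0);
    for (size_t i = 0; i < N; ++i) {
        for (size_t j = i; j < N && i + j <= N; ++j) {   // only i <= j
            uint64_t p = (uint64_t)a[i] * a[j];
            uint64_t w = (i == j) ? 1 : 2;               // off-diagonal terms appear twice
            size_t   k = i + j;
            col[k] += w * (p & 0xFFFFFFFFu);
            if (k > 0)
                col[k - 1] += w * (p >> 32);
        }
    }
    std::vector<uint32_t> r(N);
    uint64_t carry = col[N] >> 32;
    for (size_t k = N; k-- > 0; ) {
        uint64_t t = col[k] + carry;
        r[k]  = (uint32_t)t;
        carry = t >> 32;
    }
    return r;
}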

Karatsuba

Karatsuba multiplication utilizes a clever identity to multiply two 2-digit numbers using only three multiplications. To see how it works, we'll have to use some more formal notation.

Let e represent the base of the number system we're using. For normal decimal numbers, e=10. For a system in a computer that uses one byte for each digit, e=256. For my system, which uses a 32-bit unsigned integer for each digit, e = 2^32 = 4294967296.

Now let A = a1*e + a0, and B=b1*e + b0. (I'm using * for multiplication here).

To form the product A*B the traditional way, we calculate

[Eq1]   A*B = (a1*b1)e^2 + (a1*b0+a0*b1)e^1 + a0*b0

This takes four multiplications.

The Karatsuba trick is based on the following fact (just write it out on paper and you will see it's true):

[Eq2]   (a1+a0)*(b1+b0)-a1*b1-a0*b0 = a1*b0+a0*b1

The right-hand side of Eq2 is exactly the same as the middle term in Eq1 (the part in front of e^1), so we can rewrite Eq1 substituting the left-hand side of Eq2:

[Eq3]   A*B = (a1*b1)e^2 + [(a1+a0)*(b1+b0)-a1*b1-a0*b0]e^1 + a0*b0

A simpler way to write this is:

[Eq4]   A*B = X*e^2 + [U-X-Y]*e^1 + Y,

where X=a1*b1, Y=a0*b0, and U=(a1+a0)*(b1+b0)

The three multiplications required are just those needed to calculate X, Y, and U. (There is a subtle implementation issue involved in calculating U...See below.)
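Here is a small, self-contained C++ illustration of one level of this identity. It is not the code from the software; it uses 16-bit digits (e = 2^16) so that all the intermediate sums and products fit comfortably in 64-bit integers and the carry issue discussed below does not arise.

#include <cstdint>

// Multiply A = a1*e + a0 by B = b1*e + b0 with e = 2^16, using the three
// Karatsuba products X = a1*b1, Y = a0*b0, U = (a1+a0)*(b1+b0).
uint64_t karatsuba2(uint16_t a1, uint16_t a0, uint16_t b1, uint16_t b0)
{
    uint64_t X = (uint64_t)a1 * b1;
    uint64_t Y = (uint64_t)a0 * b0;
    uint64_t U = (uint64_t)(a1 + a0) * (uint64_t)(b1 + b0);   // sums can be 17 bits; fine here
    // A*B = X*e^2 + (U - X - Y)*e^1 + Y    with e = 2^16
    return (X << 32) + ((U - X - Y) << 16) + Y;
}

// Example: karatsuba2(0x1234, 0x5678, 0x9abc, 0xdef0) equals
// 0x12345678ull * 0x9abcdef0ull, computed with three multiplications instead of four.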

For large numbers of digits, this method can be applied recursively. The savings can be substantial -- the number of multiplications is reduced from N^2 to about N^1.6 for very large values of N.

For small numbers of digits, the overhead of the extra additions may be greater than the saving from the reduction in multiplications. In fact, in my arithmetic functions, for N=16 the Karatsuba method was faster than my earlier conventional multiplication, which did not truncate the less-significant digit products. But after introducing the optimizations described above, the Karatsuba method is no longer faster, even for N=16. (N=16 means 16 32-bit digits, 15 of them fractional, for a total of 480 fraction bits, or about 144 decimal digits after the decimal point.)

Practical Problems with Karatsuba

One huge problem is kind of subtle, but maybe some of the more experienced software engineers spotted it in the equations above. The calculation of U requires first performing two additions, which means there is the possibility of carries. If each digit is 32 bits, then there is nowhere for the carries to go. So you either have to reduce the digit size to 31 bits or keep track of the carries separately. Neither approach is cheap or easy. My Karatsuba implementation kept a carry flag for each sum, then added the corresponding terms back in after the multiplication as needed.
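To show what that bookkeeping can look like, here is one way it can be coded in C++ for full 32-bit digits. This is an illustrative sketch, not the actual implementation; the U96 struct and the function name are made up for this example. The two digit sums are formed modulo 2^32, their carry bits are remembered, and the missing terms are added back after the 64-bit multiplication.

#include <cstdint>

// U = (a1 + a0) * (b1 + b0) can be up to 66 bits when the digits are a full
// 32 bits, so it is returned here as a 64-bit low part plus a small high part.
// Writing sa + ca*2^32 = a1 + a0 and sb + cb*2^32 = b1 + b0, the product is
//   U = sa*sb + ca*(sb << 32) + cb*(sa << 32) + (ca & cb) * 2^64.
struct U96 { uint64_t lo; uint32_t hi; };

U96 karatsuba_u(uint32_t a1, uint32_t a0, uint32_t b1, uint32_t b0)
{
    uint32_t sa = a1 + a0;                        // sum modulo 2^32 ...
    uint32_t ca = (sa < a1);                      // ... plus its carry bit
    uint32_t sb = b1 + b0;
    uint32_t cb = (sb < b1);

    uint64_t lo = (uint64_t)sa * sb;              // the one real multiplication
    uint32_t hi = 0;

    uint64_t t = ca ? ((uint64_t)sb << 32) : 0;   // add the carry terms back in
    lo += t;  hi += (lo < t);
    t = cb ? ((uint64_t)sa << 32) : 0;
    lo += t;  hi += (lo < t);
    hi += (ca & cb);                              // the 2^64 term

    return U96{lo, hi};
}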

Another problem for my operations is that a subtraction is required. This is not a big deal usually, but with the SSE2 instructions and their lack of carry or borrow, it creates some major logistical hassles in keeping track of things.

How Much Precision is Needed?

The need for high-precision arithmetic is driven by the need to take differences between numbers that are close together. It is not so much a matter of handling very large or very small numbers, but rather of handling relatively small numbers that are extremely close together.

Once a bignum arithmetic function is written, it is easy to have it run for any number of digits -- but how many is enough? The time taken to perform multiplications increases as the square of the number of digits, so we don't want to use more digits of precision than we really need given the magnification of the image we're trying to render. So we should think about how much precision is needed and use as few digits as possible to adequately render the image.

The most obvious effect of limited precision shows up when calculating the coordinates in the complex number plane that correspond to the pixel numbers in the image. This is done as follows:

X = Nx * s + X0
Y = Ny * s + Y0

where
Nx = x-coordinate pixel number
Ny = y-coordinate pixel number
s  = spacing of pixels (square!) in the complex plane
X0 = left coordinate of image
Y0 = top coordinate of image

We start running out of precision when s, the spacing of pixels in the complex plane, gets really small compared to X0 or Y0. But how small can it get before we need to add more digits?

Let d be the smallest number that can be represented by one least-significant bit in the number system in use. For IEEE-754 double-precision numbers, that is d = 2^-52 = 2.22e-16. This is the smallest number that can be added to 1.000... that will yield a result different from 1.000... The numbers that can be exactly represented with standard double-precision quantities are integer multiples of this small value.

If we want to be sure that our X and Y values are accurately located on the complex plane, we need to consider how large the deviation of any number can be from the closest exactly-representable number. Since the spacing between the exactly-representable numbers is d, we can never be more than d/2 away from any arbitrary real number.

The spacing of the pixels in the complex plane is s, and we want to be able to represent that spacing sufficiently accurately that we do not see any noticeable jitter in the positions of the pixels in the image. Let E be the maximum proportional error we are willing to tolerate in pixel location (e.g. maybe 1% or 0.1% or something like that). Since the largest absolute error possible is d/2, we have:

(d/2) / s < E

Now, we can express s in terms of more meaningful quantities like this:

s = W / N

where

W = size of image in the complex plane (determined by whichever direction is larger, usually the X-direction)

N = number of pixels in the larger direction (again, usually the X-direction)

and this leads to

W > (d N) / (2 E)

This gives the minimum size W of an image for a precision d and pixel resolution N, given our choice of relative error E.

For a 1% error (E=0.01) in a 640-pixel wide image in IEEE-754 double-precision format, we get:

W = (2.22e-16 * 640) / (2 * 0.01) = 7.1e-12

This is the smallest size image that we can draw and maintain a maximum 1% error in pixel location. If the image gets smaller than that, we have to switch to high-precision arithmetic to maintain better than 1% maximum error.
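The same bound can be turned around to estimate how many 32-bit fraction digits a given zoom depth needs. Here is a small C++ sketch of that calculation (illustrative; the function name and the rounding up to whole 32-bit digits are assumptions, not taken from the software, and the width is passed as log10(W) because W itself quickly underflows a double at deep zooms):

#include <cmath>

// How many 32-bit fraction digits are needed so that the representable-number
// spacing d = 2^-(32*ndigits) satisfies d < 2*E*W/N, the bound derived above?
int fraction_digits_needed(double log10_width, int pixels_across, double rel_error)
{
    // require 2^-bits < 2*E*W/N
    double bits = (std::log10((double)pixels_across)
                   - std::log10(2.0 * rel_error)
                   - log10_width) / std::log10(2.0);
    return (int)std::ceil(bits / 32.0);            // round up to whole 32-bit digits
}

// Example: fraction_digits_needed(std::log10(7.1e-12), 640, 0.01) returns 2
// (64 fraction bits), since about 52 bits are needed -- matching the worked
// example above, where 52-bit doubles run out right at that width.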

For still images, the human eye is a little more tolerant of pixel positioning error than it is for animations, because the eye (the brain, really) is very good at noticing even tiny amounts of temporal jitter in the locations of the pixels from one frame to the next. For animations, generally any error larger than 0.1% has a reasonable chance of being noticed. I usually switch to the next higher level of precision when the increment s becomes less than 1000 times the d-value for the current precision, which means the total image width is 320,000 times the working precision's d-value.

Implementation Details

Addition and Subtraction

Addition and subtraction scale linearly with the number of digits, so they are always much faster than multiplication. I currently implement them with regular general-purpose assembly language instructions. The SSE2 instructions do not support a carry, so they are not well-suited to multi-digit addition operations.

Division

For fractals that do not require high-precision division, which includes all the divergent (escape-time) fractals like the Mandelbrot set, a good machine-code division function is not needed. Currently I use a simple binary search algorithm to take the reciprocal of a bignum that must be greater than 1, from which a division function can be built easily.
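To show the idea (though not the actual routine), here is a tiny runnable C++ sketch of a bit-by-bit binary-search reciprocal, using a 32-bit Q16.16 fixed-point value as a stand-in for a real multi-digit number:

#include <cstdint>

// Reciprocal of a value greater than 1, by binary search one bit at a time.
// The argument and result are in Q16.16 fixed point: the value represented
// by x is x / 2^16.  Since a > 1, the reciprocal is below 1, so only the
// 16 fraction bits of the result need to be determined.
uint32_t reciprocal_q16(uint32_t a)        // requires a > 0x10000, i.e. value > 1.0
{
    uint32_t r = 0;
    for (int bit = 15; bit >= 0; --bit) {  // try 1/2, then 1/4, ... down to 2^-16
        uint32_t trial = r | (1u << bit);
        // keep the bit as long as trial * a is still below 1.0
        // (1.0 corresponds to 2^32 in the product's Q32.32 scale)
        if ((uint64_t)trial * a < ((uint64_t)1 << 32))
            r = trial;
    }
    return r;                              // largest Q16.16 value with r * a < 1
}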

Convergent fractals do require division, and that means that floating-point arithmetic is necessary to properly calculate them. Floating point was added in January 2011, and with that came an assembly-language division function, since that was the whole point of adding floating point in the first place.

Performance Data

Performance data are shown in the table below for various levels of precision.

Update Dec 2008: Improvements in the low-level assembly language code for the SSE-based multiplication have increased its speed by up to almost 50%. The table below shows the new and old speeds on the same 3.2 GHz Pentium IV.


High-Precision Arithmetic Performance on Pentium-IV System

Total   Fractional   Fractional Part   Multiplications,     Mandelbrot Operations,
Bits    Bits         Decimal Digits    Millions/second      Thousands/second
                                       New       Old        New       Old
128     96           28.89             11.64     10.64      1768      1738
192     160          48.16             8.10      7.14       1531      1488
256     224          67.43             5.93      4.57       1205      1067
320     288          86.70             3.83      2.91       979       771
384     352          105.96            3.08      2.29       840       609
448     416          125.23            2.43      1.78       722       488
512     480          144.49            2.03      1.45       573       388

Updated Benchmark: Now this software can draw a 640x480 image of the interior of the Mandelbrot set (centered at 0+0i in the complex number plane) with 1000 iterations at 106 digits of precision (352 bits of fraction) in 6.1 minutes on a 3.2 GHz Pentium IV [previously 8.4 minutes].

Comparison with Core2 System

The Core2 processor is touted as having an intrinsically faster architecture than the Pentium-IV. How does a single core of the quad-core Core2 running at 2.6 GHz compare to the Pentium-IV running at 3.2 GHz? Here are the results:

High-Precision Arithmetic Performance on Core2

Total   Fractional   Fractional Part   Multiplications,   Mandelbrot Operations,
Bits    Bits         Decimal Digits    Millions/second    Thousands/second
128     96           28.89             21.32              1734
192     160          48.16             11.86              1491
256     224          67.43             8.31               1311
320     288          86.70             5.71               1086
384     352          105.96            4.44               953
448     416          125.23            3.49               829
512     480          144.49            2.87               643

Overall, the basic multiplication is significantly faster on the Core2. The Mandelbrot operation numbers are a tiny bit slower on the Core2 for lower precisions, which is probably sampling error. There is a small but noticeable increase in speed at all precisions starting at 256 total bits.