hpdz.net

High-Precision Deep Zoom

Technical Info - Bignum Arithmetic

Introduction

This page discusses the basics of BigNum arithmetic, which refers to high-precision arithmetic using more digits than the native architecture of the processor supports directly.

Some of the content refers to older obsolete techniques that were in use prior to the migration to 64-bit code, but is kept here for historical purposes. Also see the page on floating-point arithmetic.

This page has the following topics:

The first section is a general discussion of the basics of bignum arithmetic, including Karatsuba multiplication. At this point, with only about 500 bits at most in use, I don't have any need for more advanced methods like Toom-Cook or Schonhage-Strassen, or for Fourier-transform methods either, and they are not discussed.

The second section describes some details of how I have implemented bignum math in my software and provides some speed benchmarks.

The final two sections give performance data on the Pentium-IV and Core2 architectures.

I hope this page will help any aspiring young software engineers who might be working on their own bignum arithmetic functions. This area of computer science doesn't have many practical applications, but it is a lot of fun and a great exercise to hone your programming skills (not to mention the fact that it's a pretty common college computer science class project). Plus, once you've got it working, it is really cool to have a program that can multiply thousand-digit numbers or calculate the millionth digit of pi.

General Discussion

Bignum arithmetic refers to software algorithms that perform the fundamental arithmetic operations of addition, subtraction, multiplication, and division on numbers with more digits than the hardware of the machine they are running on can support natively. These algorithms are an interesting and important branch of computer science in their own right, independent of their application to fractal art.

Background

The Pentium processors (and the closely related Core, Core2, and Athlon processors) all have built-in math units that can very rapidly perform addition, subtraction, multiplication, division, and more complex operations like square roots, logarithms, and trigonometric functions. But the built-in hardware units are limited to 52 bits of precision, which is equivalent to about 16 decimal digits (15.65, to be exact):

Digits = log10(2^52) = 52 log10(2) = 15.653559774527022151114422525674...

That is nowhere near enough to allow zooming in to 10^50 or 10^100. The only way to achieve the vastly greater precision required for animations that zoom to depths like that is by using software arithmetic operations. That means literally writing functions to perform addition, subtraction, multiplication, division, etc. for very large numbers.

Streaming SIMD Instructions

For many years, prior to the move to 64-bit code, I used SSE2 instructions for my multiplication kernel operations. The SSE2 instruction set features some extremely fast instructions that operate on multiple pieces of data at one time, but they have some limitations. In particular, they have no way to propagate carries, so some rather complicated carry-saving techniques have to be used (this is how many hardware multipliers work internally). In the end, the added speed was worth the extra complications, but the details are outside the scope of general bignum operations and won't be covered here, especially since this code is now obsolete after the migration to 64-bit arithmetic functions.

Basic Representation

High-precision numbers in my software are stored as sequences of unsigned 32-bit "digits" (referred to as "limbs" in the GNU math library). This remains the case even after the migration to 64-bit code. From the perspective of the C++ code, each digit is a 32-bit value, but the assembly language code works with them in groups of two.

Numbers are stored in a fixed-point format with one digit of integer. The smallest precision that is implemented in my software is four 32-bit digits, which means one digit of integer and three 32-bit digits to the right of the decimal point. Precision can be increased in chunks of two 32-bit digits. Previously, I had to work in even-numbered chunks like this because of the way the SSE2 instruction set works. Now, the 64-bit instructions work in groups of two 32-bit digits. In a more general case, not using 64-bit instructions, precision can be added one digit at a time.

All digits are unsigned. A separate sign bit is used.
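To make the layout concrete, here is a minimal C++ sketch of such a representation. This is illustrative only, not the actual class in the software; the struct name and the digit ordering (integer digit first) are assumptions.

#include <cstdint>

// Hypothetical fixed-point layout: one integer digit followed by NFRAC
// fractional 32-bit digits, most-significant digit first, with a separate
// sign flag since the digits themselves are unsigned.
struct FixedPointNumber {
    static const int NFRAC = 3;        // smallest supported size: 1 + 3 = 4 digits
    bool     negative;                 // separate sign bit
    uint32_t digit[1 + NFRAC];         // digit[0] = integer part, digit[1..] = fraction
};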

Addition and Subtraction

Traditional addition works by adding digits consecutively starting from the rightmost (least-significant) digit and proceeding to the left, propagating the carries that are generated from each individual digit sum as you go.

For example, suppose you want to add 3867 + 1946. You start with 7+6 and proceed like this:

Carry:    111
          3867
        + 1946
Sum:      5813

A carry is propagated into the next digit position each time two digits are added, and at most one bit of carry can be generated per digit addition. The x86 ADD and ADC instructions handle this well, and addition can be done with a simple loop.
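As an illustration, here is a minimal C++ sketch of such a loop. It is not the production assembly code; the function name is made up, the digits are assumed to be stored with the integer digit first (as in the sketch above), and a 64-bit temporary plays the role of the hardware carry flag.

#include <cstdint>
#include <cstddef>

// Add two N-digit numbers a and b into sum, propagating the carry by hand.
void bignum_add(const uint32_t* a, const uint32_t* b, uint32_t* sum, size_t ndigits)
{
    uint64_t carry = 0;
    for (size_t i = ndigits; i-- > 0; ) {      // start at the least-significant digit
        uint64_t t = (uint64_t)a[i] + b[i] + carry;
        sum[i] = (uint32_t)t;                  // low 32 bits become this digit
        carry  = t >> 32;                      // at most one bit of carry
    }
    // a nonzero final carry out of digit 0 means the fixed-point result overflowed
}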

Multiplication

Multiplication is a bit more complicated than addition. To multiply two N-digit numbers, we have to form N x N = N^2 pairs of individual digit products, then perform a whole bunch of additions.

For example, consider 3867 x 1946. When we calculate this on paper, we break the problem down into (3867x6) + (3867x4)x10 + (3867x9)x100 + (3867x1)x1000.

                3867
              x 1946
3867 x 6:      23202
3867 x 4:     15468
3867 x 9:    34803
3867 x 1:    3867
Sum:         7525182

For a fixed-point multiplication system like I use, where the first digit represents the integer part and the next three digits represent the fraction, this example shows how to multiply 3.867 x 1.946, giving a result of 7.525182.

Note that the product has twice as many digits as the original numbers did. For some purposes it makes sense to save some or all of those extra digits, but I don't do that. I truncate each multiplication product so there are always the same number of digits in the product as there were in the original numbers. So the final answer here would be 7525, corresponding to 7.525.

That truncation means that some of the digit products don't need to be calculated. For example, we don't use the 2 that comes from the product of 6x7=42 in the rightmost place. We have to be careful, because there is a carry (the 4) from that digit product into the next place, and that can affect the final answer. Careful analysis of this shows that you can drop some digit products and not affect the final answer by more than 1 bit. I am not going to reproduce that analysis here, but just show the final result. To see this better, we need to rearrange the multiplication operation above into a square matrix:

       3     8     6     7
 1   1x3   1x8   1x6   1x7
 9   9x3   9x8   9x6   9x7
 4   4x3   4x8   4x6   4x7
 6   6x3   6x8   6x6   6x7

The digit products toward the lower-right corner of this grid (the ones that feed only the least-significant digits that get truncated away) can safely be skipped without affecting the final truncated product by more than one least-significant bit.

Here is a larger table showing the 8x8 case, where an 8-digit number A = a7a6...a1a0 is multiplied by an 8-digit number B = b7b6...b1b0:

       a7    a6    a5    a4    a3    a2    a1    a0
 b7  b7a7  b7a6  b7a5  b7a4  b7a3  b7a2  b7a1  b7a0
 b6  b6a7  b6a6  b6a5  b6a4  b6a3  b6a2  b6a1  b6a0
 b5  b5a7  b5a6  b5a5  b5a4  b5a3  b5a2  b5a1  b5a0
 b4  b4a7  b4a6  b4a5  b4a4  b4a3  b4a2  b4a1  b4a0
 b3  b3a7  b3a6  b3a5  b3a4  b3a3  b3a2  b3a1  b3a0
 b2  b2a7  b2a6  b2a5  b2a4  b2a3  b2a2  b2a1  b2a0
 b1  b1a7  b1a6  b1a5  b1a4  b1a3  b1a2  b1a1  b1a0
 b0  b0a7  b0a6  b0a5  b0a4  b0a3  b0a2  b0a1  b0a0

In general, as the number of digits increases, we save almost half the multiplications.

To calculate the exact number of digit products we still have to compute, let N be the number of digits in each number we are multiplying. The number of digit products on the diagonal and in the upper left half is N(N+1)/2. The number of digit products kept from the subdiagonal is N-2. So the total is N(N+1)/2 + (N-2) = (N^2+3N-4)/2.
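Putting the pieces together, here is an illustrative C++ sketch of a truncated fixed-point multiply built this way. This is not the SSE2 or 64-bit kernel from the software; it keeps the whole i+j = N column (slightly more than the minimum counted above), and digit 0 is the integer digit as in the earlier sketches.

#include <cstdint>
#include <cstddef>
#include <vector>

// Truncated schoolbook multiply: a and b each have N digits (digit 0 is the
// integer part), and the result keeps only the top N digits of the product.
// Digit products with i + j > N are skipped entirely; the i + j == N column
// is computed only so its carry can ripple into the kept digits.
std::vector<uint32_t> truncated_mul(const uint32_t* a, const uint32_t* b, size_t N)
{
    std::vector<uint64_t> col(N + 1, 0);            // column sums, weights e^0 .. e^-N
    for (size_t i = 0; i < N; ++i) {
        for (size_t j = 0; j < N && i + j <= N; ++j) {
            uint64_t p = (uint64_t)a[i] * b[j];     // one digit product
            size_t   k = i + j;                     // its column (weight e^-k)
            col[k] += p & 0xFFFFFFFFu;              // low half stays in column k
            if (k > 0)
                col[k - 1] += p >> 32;              // high half belongs one column up
            // (for k == 0 a nonzero high half would overflow the single
            //  integer digit; this sketch ignores that case)
        }
    }
    std::vector<uint32_t> r(N);
    uint64_t carry = col[N] >> 32;                  // only the carry out of the dropped column matters
    for (size_t k = N; k-- > 0; ) {                 // propagate carries toward the integer digit
        uint64_t t = col[k] + carry;
        r[k]  = (uint32_t)t;
        carry = t >> 32;
    }
    return r;                                       // a carry out of r[0] again means overflow
}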

Squaring

If A=B in the multiplication grid above, then all the digit products in the lower left half of the grid are duplicates of the products in the upper right, since, for example, a4a6 = a6a4. The following table shows the squaring grid; the duplicated products below the main diagonal are the additional ones that can be eliminated (a code sketch exploiting this follows the table). For large values of N, the final result is that we are doing about 1/4 as many digit multiplications as before. Unlike the previous saving, taking advantage of this symmetry when squaring numbers introduces no loss of precision.

       a7    a6    a5    a4    a3    a2    a1    a0
 a7  a7a7  a7a6  a7a5  a7a4  a7a3  a7a2  a7a1  a7a0
 a6  a6a7  a6a6  a6a5  a6a4  a6a3  a6a2  a6a1  a6a0
 a5  a5a7  a5a6  a5a5  a5a4  a5a3  a5a2  a5a1  a5a0
 a4  a4a7  a4a6  a4a5  a4a4  a4a3  a4a2  a4a1  a4a0
 a3  a3a7  a3a6  a3a5  a3a4  a3a3  a3a2  a3a1  a3a0
 a2  a2a7  a2a6  a2a5  a2a4  a2a3  a2a2  a2a1  a2a0
 a1  a1a7  a1a6  a1a5  a1a4  a1a3  a1a2  a1a1  a1a0
 a0  a0a7  a0a6  a0a5  a0a4  a0a3  a0a2  a0a1  a0a0
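Here is an illustrative C++ sketch of a truncated squaring routine that uses this symmetry. Again, this is not the actual kernel; it follows the same layout and column scheme as the multiplication sketch above, forming only the products with i <= j and counting the off-diagonal ones twice.

#include <cstdint>
#include <cstddef>
#include <vector>

// Truncated squaring: like truncated_mul(a, a, N), but each off-diagonal
// product a[i]*a[j] (i < j) is formed once and added twice.  The halves are
// doubled individually so nothing overflows 64 bits.
std::vector<uint32_t> truncated_square(const uint32_t* a, size_t N)
{
    std::vector<uint64_t> col(N + 1, 0);
    for (size_t i = 0; i < N; ++i) {
        for (size_t j = i; j < N && i + j <= N; ++j) {   // only i <= j
            uint64_t p = (uint64_t)a[i] * a[j];
            uint64_t w = (i == j) ? 1 : 2;               // off-diagonal terms appear twice
            size_t   k = i + j;
            col[k] += w * (p & 0xFFFFFFFFu);
            if (k > 0)
                col[k - 1] += w * (p >> 32);
        }
    }
    std::vector<uint32_t> r(N);
    uint64_t carry = col[N] >> 32;
    for (size_t k = N; k-- > 0; ) {
        uint64_t t = col[k] + carry;
        r[k]  = (uint32_t)t;
        carry = t >> 32;
    }
    return r;
}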

Karatsuba

Karatsuba multiplication utilizes a clever identity to multiply two 2-digit numbers using only three multiplications. To see how it works, we'll have to use some more formal notation.

Let e represent the base of the number system we're using. For normal decimal numbers, e=10. For a system in a computer that uses one byte for each digit, e=256. For my system, which uses a 32-bit unsigned integer for each digit, e = 2^32 = 4294967296.

Now let A = a1*e + a0, and B=b1*e + b0. (I'm using * for multiplication here).

To form the product A*B the traditional way, we calculate

[Eq1]   A*B = (a1*b1)e^2 + (a1*b0+a0*b1)e^1 + a0*b0

This takes four multiplications.

The Karatsuba trick is based on the following fact (just write it out on paper and you will see it's true):

[Eq2]   (a1+a0)*(b1+b0)-a1*b1-a0*b0 = a1*b0+a0*b1

The right-hand side of Eq2 is exactly the same as the middle term in Eq1 (the part in front of e^1), so we can rewrite Eq1 substituting the left-hand side of Eq2:

[Eq3]   A*B = (a1*b1)e^2 + [(a1+a0)*(b1+b0)-a1*b1-a0*b0]e^1 + a0*b0

A simpler way to write this is:

[Eq4]   A*B = X*e^2 + [U-X-Y]*e^1 + Y,

where X=a1*b1, Y=a0*b0, and U=(a1+a0)*(b1+b0)

The three multiplications required are just those needed to calculate X, Y, and U. (There is a subtle implementation issue involved in calculating U...See below.)
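Here is a small, self-contained C++ illustration of one level of this identity. It is not the code from the software; it uses 16-bit digits (e = 2^16) so that all the intermediate sums and products fit comfortably in 64-bit integers and the carry issue discussed below does not arise.

#include <cstdint>

// Multiply A = a1*e + a0 by B = b1*e + b0 with e = 2^16, using the three
// Karatsuba products X = a1*b1, Y = a0*b0, U = (a1+a0)*(b1+b0).
uint64_t karatsuba2(uint16_t a1, uint16_t a0, uint16_t b1, uint16_t b0)
{
    uint64_t X = (uint64_t)a1 * b1;
    uint64_t Y = (uint64_t)a0 * b0;
    uint64_t U = (uint64_t)(a1 + a0) * (uint64_t)(b1 + b0);   // sums can be 17 bits; fine here
    // A*B = X*e^2 + (U - X - Y)*e^1 + Y    with e = 2^16
    return (X << 32) + ((U - X - Y) << 16) + Y;
}

// Example: karatsuba2(0x1234, 0x5678, 0x9abc, 0xdef0) equals
// 0x12345678ull * 0x9abcdef0ull, computed with three multiplications instead of four.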

For large numbers of digits, this method can be applied recursively. The savings can be substantial -- the number of multiplications is reduced from N^2 to about N^1.6 for very large values of N.

For small numbers of digits, the overhead of the extra additions may be greater than the saving from the reduction in multiplications. In fact, in my arithmetic functions, for N=16 the Karatsuba method was faster than my earlier conventional multiplication, which did not truncate the less-significant digit products. But after introducing the optimizations described above, the Karatsuba method is no longer faster, even for N=16. (N=16 means 16 32-bit digits, 15 of them fractional, for a total of 480 fraction bits, or about 144 decimal digits after the decimal point.)

Practical Problems with Karatsuba

One huge problem is kind of subtle, but maybe some of the more experienced software engineers spotted it in the equations above. The calculation of U requires first performing two additions, which means there is the possibility of carries. If each digit is 32 bits, then there is nowhere for the carries to go. So you either have to reduce the digit size to 31 bits or keep track of the carries separately. Neither approach is cheap or easy. My Karatsuba implementation kept a carry flag for each sum, then added the corresponding terms back in after the multiplication as needed.
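To show what that bookkeeping can look like, here is one way it can be coded in C++ for full 32-bit digits. This is an illustrative sketch, not the actual implementation; the U96 struct and the function name are made up for this example. The two digit sums are formed modulo 2^32, their carry bits are remembered, and the missing terms are added back after the 64-bit multiplication.

#include <cstdint>

// U = (a1 + a0) * (b1 + b0) can be up to 66 bits when the digits are a full
// 32 bits, so it is returned here as a 64-bit low part plus a small high part.
// Writing sa + ca*2^32 = a1 + a0 and sb + cb*2^32 = b1 + b0, the product is
//   U = sa*sb + ca*(sb << 32) + cb*(sa << 32) + (ca & cb) * 2^64.
struct U96 { uint64_t lo; uint32_t hi; };

U96 karatsuba_u(uint32_t a1, uint32_t a0, uint32_t b1, uint32_t b0)
{
    uint32_t sa = a1 + a0;                        // sum modulo 2^32 ...
    uint32_t ca = (sa < a1);                      // ... plus its carry bit
    uint32_t sb = b1 + b0;
    uint32_t cb = (sb < b1);

    uint64_t lo = (uint64_t)sa * sb;              // the one real multiplication
    uint32_t hi = 0;

    uint64_t t = ca ? ((uint64_t)sb << 32) : 0;   // add the carry terms back in
    lo += t;  hi += (lo < t);
    t = cb ? ((uint64_t)sa << 32) : 0;
    lo += t;  hi += (lo < t);
    hi += (ca & cb);                              // the 2^64 term

    return U96{lo, hi};
}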

Another problem for my operations is that a subtraction is required. This is not a big deal usually, but with the SSE2 instructions and their lack of carry or borrow, it creates some major logistical hassles in keeping track of things.

How Much Precision is Needed?

The need for high-precision arithmetic is driven by the need to take differences between numbers that are close together. It is not so much a matter of handling very large or very small numbers, but rather of handling relatively small numbers that are extremely close together.

Once a bignum arithmetic function is written, it is easy to have it run for any number of digits -- but how many is enough? The time taken to perform multiplications increases as the square of the number of digits, so we don't want to use more digits of precision than we really need given the magnification of the image we're trying to render. So we should think about how much precision is needed and use as few digits as possible to adequately render the image.

The most obvious effect of limited precision shows up when calculating the coordinates in the complex number plane that correspond to the pixel numbers in the image. This is done as follows:

X = Nx * s + X0
Y = Ny * s + Y0

where
Nx = x-coordinate pixel number
Ny = y-coordinate pixel number
s  = spacing of pixels (square!) in the complex plane
X0 = left coordinate of image
Y0 = top coordinate of image

We start running out of precision when s, the spacing of pixels in the complex plane, gets really small compared to X0 or Y0. But how small can it get before we need to add more digits?

Let d be the smallest number that can be represented by one least-significant bit in the number system in use. For IEEE-754 double-precision numbers, that is d = 2^-52 = 2.22e-16. This is the smallest number that can be added to 1.000... that will yield a result different from 1.000... The numbers that can be exactly represented with standard double-precision quantities are integer multiples of this small value.

If we want to be sure that our X and Y values are accurately located on the complex plane, we need to consider how large the deviation of any number can be from the closest exactly-representable number. Since the spacing between the exactly-representable numbers is d, we can never be more than d/2 away from any arbitrary real number.

The spacing of the pixels in the complex plane is s, and we want to be able to represent that spacing sufficiently accurately that we do not see any noticeable jitter in the positions of the pixels in the image. Let E be the maximum proportional error we are willing to tolerate in pixel location (e.g. maybe 1% or 0.1% or something like that). Since the largest absolute error possible is d/2, we have:

(d/2) / s < E

Now, we can express s in terms of more meaningful quantities like this:

s = W / N

where

W = size of image in the complex plane (determined by whichever direction is larger, usually the X-direction)

N = number of pixels in the larger direction (again, usually the X-direction)

and this leads to

W > (d N) / (2 E)

This gives the minimum size W of an image for a precision d and pixel resolution N, given our choice of relative error E.

For a 1% error (E=0.01) in a 640-pixel wide image in IEEE-754 double-precision format, we get:

W = (2.22e-16 * 640) / (2 * 0.01) = 7.1e-12

This is the smallest size image that we can draw and maintain a maximum 1% error in pixel location. If the image gets smaller than that, we have to switch to high-precision arithmetic to maintain better than 1% maximum error.
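The same bound can be turned around to estimate how many 32-bit fraction digits a given zoom depth needs. Here is a small C++ sketch of that calculation (illustrative; the function name and the rounding up to whole 32-bit digits are assumptions, not taken from the software, and the width is passed as log10(W) because W itself quickly underflows a double at deep zooms):

#include <cmath>

// How many 32-bit fraction digits are needed so that the representable-number
// spacing d = 2^-(32*ndigits) satisfies d < 2*E*W/N, the bound derived above?
int fraction_digits_needed(double log10_width, int pixels_across, double rel_error)
{
    // require 2^-bits < 2*E*W/N
    double bits = (std::log10((double)pixels_across)
                   - std::log10(2.0 * rel_error)
                   - log10_width) / std::log10(2.0);
    return (int)std::ceil(bits / 32.0);            // round up to whole 32-bit digits
}

// Example: fraction_digits_needed(std::log10(7.1e-12), 640, 0.01) returns 2
// (64 fraction bits), since about 52 bits are needed -- matching the worked
// example above, where 52-bit doubles run out right at that width.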

For still images, the human eye is a little more tolerant of pixel positioning error than it is for animations, because the eye (the brain, really) is very good at noticing even tiny amounts of temporal jitter in the locations of the pixels from one frame to the next. For animations, generally any error larger than 0.1% has a reasonable chance of being noticed. I usually switch to the next higher level of precision when the increment s becomes less than 1000 times the d-value for the current precision, which means the total image width is 320,000 times the working precision's d-value.

Implementation Details

Addition and Subtraction

Addition and subtraction scale linearly with the number of digits, so they are always much faster than multiplication. I currently implement them with regular general-purpose assembly language instructions. The SSE2 instructions do not support a carry, so they are not well-suited to multi-digit addition operations.

Division

For fractals that do not require high-precision division, which includes all the divergent (escape-time) fractals like the Mandelbrot set, a good machine-code division function is not needed. Currently I use a simple binary search algorithm to take the reciprocal of a bignum that must be greater than 1, from which a division function can be built easily.
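To show the idea (though not the actual routine), here is a tiny runnable C++ sketch of a bit-by-bit binary-search reciprocal, using a 32-bit Q16.16 fixed-point value as a stand-in for a real multi-digit number:

#include <cstdint>

// Reciprocal of a value greater than 1, by binary search one bit at a time.
// The argument and result are in Q16.16 fixed point: the value represented
// by x is x / 2^16.  Since a > 1, the reciprocal is below 1, so only the
// 16 fraction bits of the result need to be determined.
uint32_t reciprocal_q16(uint32_t a)        // requires a > 0x10000, i.e. value > 1.0
{
    uint32_t r = 0;
    for (int bit = 15; bit >= 0; --bit) {  // try 1/2, then 1/4, ... down to 2^-16
        uint32_t trial = r | (1u << bit);
        // keep the bit as long as trial * a is still below 1.0
        // (1.0 corresponds to 2^32 in the product's Q32.32 scale)
        if ((uint64_t)trial * a < ((uint64_t)1 << 32))
            r = trial;
    }
    return r;                              // largest Q16.16 value with r * a < 1
}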

Convergent fractals do require division, and that means that floating-point arithmetic is necessary to properly calculate them. Floating point was added in January 2011, and with that came an assembly-language division function, since that was the whole point of adding floating point in the first place.

Performance Data

Performance data are shown in the table below for various levels of precision.

Update Dec 2008: Improvements in the low-level assembly language code for the SSE-based multiplication have increased its speed by up to almost 50%. The table below shows the new and old speeds on the same 3.2 GHz Pentium IV.


High-Precision Arithmetic Performance on Pentium-IV System

Total   Fractional   Fractional Part   Multiplications,     Mandelbrot Operations,
Bits    Bits         Decimal Digits    Millions/second      Thousands/second
                                       New       Old        New       Old
128     96           28.89             11.64     10.64      1768      1738
192     160          48.16             8.10      7.14       1531      1488
256     224          67.43             5.93      4.57       1205      1067
320     288          86.70             3.83      2.91       979       771
384     352          105.96            3.08      2.29       840       609
448     416          125.23            2.43      1.78       722       488
512     480          144.49            2.03      1.45       573       388

Updated Benchmark: Now this software can draw a 640x480 image of the interior of the Mandelbrot set (centered at 0+0i in the complex number plane) with 1000 iterations at 106 digits of precision (352 bits of fraction) in 6.1 minutes on a 3.2 GHz Pentium IV [previously 8.4 minutes].

Comparison with Core2 System

The Core2 processor is touted as having an intrinsically faster architecture than the Pentium-IV. How does a single core of the quad-core Core2 running at 2.6 GHz compare to the Pentium-IV running at 3.2 GHz? Here are the results:

High-Precision Arithmetic Performance on Core2

Total   Fractional   Fractional Part   Multiplications,   Mandelbrot Operations,
Bits    Bits         Decimal Digits    Millions/second    Thousands/second
128     96           28.89             21.32              1734
192     160          48.16             11.86              1491
256     224          67.43             8.31               1311
320     288          86.70             5.71               1086
384     352          105.96            4.44               953
448     416          125.23            3.49               829
512     480          144.49            2.87               643

Overall, the basic multiplication is significantly faster on the Core2. The Mandelbrot operation numbers are a tiny bit slower on the Core2 for lower precisions, which is probably sampling error. There is a small but noticeable increase in speed at all precisions starting at 256 total bits.