# hpdz.net

## Technical Info - Bignum Arithmetic

### Introduction

This page discusses the basics of BigNum arithmetic, which refers to high-precision arithmetic using more digits than the native architecure of the processor supports directly.

Some of the content refers to older obsolete techniques that were in use prior to the migration to 64-bit code, but is kept here for historical purposes. Also see the page on floating-point arithmetic.

The first section is a general discussion of the basics of bignum arithmetic, including Karatsuba multiplication. At this point, with only about 500 bits at most in use, I don't have any need for more advanced methods like Toom-Cook or Schonhage-Strassen, or for Fourier-transform methods either, and they are not discussed.

The second section describes some details of how I have implemented bignum math in my software and provides some speed benchmarks.

The final two sections give performance data on the Pentium-IV and Core2 architectures.

I hope this page will help any aspiring young software engineers that might be working on their own bignum arithmetic functions. This area of computer science doesn't have many practical applications, but it is a lot of fun and a great exercise to hone your programming skills (not to mention the fact it's a pretty common college computer science class project). Plus, once you've got it working, it is really cool to have a program that can multiply thousand-digit numbers or calculate the millionth digit of pi.

### General Discussion

Bignum arithmetic refers to software algorithms that perform the fundamental arithmetic operations of addition, subtraction, multiplication, and division on numbers with more digits than the hardware of the machine they are running on can support natively. These algorithms are an interesting and important branch of computer science in their own right, independent of their application to fractal art.

#### Background

The Pentium processors (and the closely related Core, Core2, and Athlon processors) all have built-in math units that can very rapidly perform addition, subtraction, multiplication, division, and more complex operations like square roots, logarithms, and trigonometric functions. But the built-in hardware units are limited to 52 bits of precision, which is the equivalent of just about 16 digits of decimal numbers (it's actually 15.65 digits):

Digits = Log10 (252) = 52 Log10(2) = 15.653559774527022151114422525674...

That is nowhere near enough to allow zooming in to 1050 or 10100. The only way to achieve the vastly greater precision required for animations that zoom to depths like that is by using software arithmetic operations. That means literally writing functions to perform addition, subtraction, multiplication, division, etc. for very large numbers.

#### Streaming SIMD Instructions

For many years, I used SSE2 instructions for my multiplication kernel operations, prior to the move to 64-bit code. The SSE2 instruction set features some extremely fast instructions that operate on multiple pieces of data at one time, but they have some limitations. In particular, they have no capability of propagating carries, so some rather complicated carry-saving techniques have to be used. This is how many hardware multipliers work internally. In the end, the added speed is worth the extra complications, but the details are outside the scope of general bignum operations and we won't go into it here, especially since this code is now obsolete after migrating to 64-bit arithemtic functions.

#### Basic Representation

High-precision numbers in my software are stored as sequences of unsigned 32-bit "digits" (referred to as "limbs" in the GNU math library). This remains the case even after the migration to 64-bit code. From the perspective of the C++ code, each digit is a 32-bit value, but the assembly language code works with them in groups of two.

Numbers are stored in a fixed-point format with one digit of integer. The smallest precision that is implemented in my software is four 32-bit digits, which means one digit of integer and three 32-bit digits to the right of the decimal point. Precision can be increased in chunks of two 32-bit digits. Previously, I had to work in even-numbered chunks like this because of the way the SSE2 instruction set works. Now, the 64-bit instructions work in groups of two 32-bit digits. In a more general case, not using 64-bit instructions, precision can be added one digit at a time.

All digits are unsigned. A separate sign bit is used.

Traditional addition works by adding digits consecutively starting from the rightmost (least-significant) digit and proceeding to the left, propagating the carries that are generated from each individual digit sum as you go.

For example, suppose you want to add 3867 + 1946. You start with 7+6 and proceed like this:

 Carry: 1 1 1 3 8 6 7 1 9 4 6 Sum: 5 8 1 3

Carries are propagated upward each time two digits are added. At most, only one bit of carry can be generated each time two digits are added. The x86 ADD and ADC instructions work well for this, and addition can be done with a simple loop.

#### Multiplication

Multiplication is a bit more complicated than addition. To multiply two N-digit numbers, we have to form N x N = N2 pairs of individual digit products, then perform a whole bunch of additions.

For example, consider 3867 x 1946. When we calculate this on paper, we break the problem down into (3867x6) + (3867x4)x10 + (3867x9)x100 + (3867x1)x1000.

 3 8 6 7 1 9 4 6 3867x6: 2 3 2 0 2 3867x4: 1 5 4 6 8 3867x9: 3 4 8 0 3 3867x1: 3 8 6 7 Sum: 7 5 2 5 1 8 2

For a fixed-point multiplication system like I use, where the first digit represents the integer part and the next three digits represent the fraction, this example shows how to multiply 3.867 x 1.946, giving a result of 7.525182.

Note that the product has twice as many digits as the original numbers did. For some purposes, it makes sense to save some of them or all of them, but I don't do that. I truncate each multiplication product so there are always the same number of digits in the product as there were in the original numbers. So the final answer here would be 7525, corresponding to 7.525.

That truncation means that some of the digit products don't need to be calculated. For example, we don't use the 2 that comes from the product of 6x7=42 in the rightmost place. We have to be careful, because there is a carry (the 4) from that digit product into the next place, and that can affect the final answer. Careful analysis of this shows that you can drop some digit products and not affect the final answer by more than 1 bit. I am not going to reproduce that analysis here, but just show the final result. To see this better, we need to rearrange the multiplication operation above into a square matrix:

 3 8 6 7 1 1x3 1x8 1x6 1x7 9 9x3 9x8 9x6 9x7 4 4x3 4x8 4x6 4x7 6 6x3 6x8 6x6 6x7

The multiplications in dim coloring can safely be skipped without affecting the final truncated product by more than one least-significant bit.

Here is a larger table showing the 8x8 case, where an 8-digit number A=a7a6...a1a0 is multiplied by an 8-digit number B=b7a6...b1b0:

 a7 a6 a5 a4 a3 a2 a1 a0 b7 b7a7 b7a6 b7a5 b7a4 b7a3 b7a2 b7a1 b7a0 b6 b6a7 b6a6 b6a5 b6a4 b6a3 b6a2 b6a1 b6a0 b5 b5a7 b5a6 b5a5 b5a4 b5a3 b5a2 b5a1 b5a0 b4 b4a7 b4a6 b4a5 b4a4 b3a3 b4a2 b4a1 b4a0 b3 b3a7 b3a6 b3a5 b3a4 b3a3 b3a2 b3a1 b3a0 b2 b2a7 b2a6 b2a5 b2a4 b2a3 b2a2 b2a1 b2a0 b1 b1a7 b1a6 b1a5 b1a4 b1a3 b1a2 b1a1 b1a0 b0 b0a7 b0a6 b0a5 b0a4 b0a3 b0a2 b0a1 b0a0

In general, as the number of digits increases, we save almost half the multiplications.

To calculate the exact number of digits saved, let N be the number of digits in each number we are multiplying. The number of digit products on the diagonal and in the upper left half is N(N+1)/2. The number of digit products in the subdiagonal is N-2. So the total is N(N+1)/2 + (N-2) = (N2+N-4)/2.

#### Squaring

If A=B in the multiplication grid above, then all the digit products in the lower left half of the grid are duplicates of the products in the upper right, since, for example, a4a6 = a6a4. The following table shows the additional digit products that can be eliminated, dimmed in a different color. For large values of N, the final result is that we are doing about 1/4 as many digit multiplications as before. Unlike the previous saving, taking advantage of this symmetry when squaring numbers introduces no loss of precision.

 a7 a6 a5 a4 a3 a2 a1 a0 a7 a7a7 a7a6 a7a5 a7a4 a7a3 a7a2 a7a1 a7a0 a6 a6a7 a6a6 a6a5 a6a4 a6a3 a6a2 a6a1 a6a0 a5 a5a7 a5a6 a5a5 a5a4 a5a3 a5a2 a5a1 a5a0 a4 a4a7 a4a6 a4a5 a4a4 a3a3 a4a2 a4a1 a4a0 a3 a3a7 a3a6 a3a5 a3a4 a3a3 a3a2 a3a1 a3a0 a2 a2a7 a2a6 a2a5 a2a4 a2a3 a2a2 a2a1 a2a0 a1 a1a7 a1a6 a1a5 a1a4 a1a3 a1a2 a1a1 a1a0 a0 a0a7 a0a6 a0a5 a0a4 a0a3 a0a2 a0a1 a0a0

#### Karatsuba

Karatsuba multiplication utilizes a clever identity to multiply two 2-digit numbers using only three multiplications. To see how it works, we'll have to use some more formal notation.

Let e represent the base of the number system we're using. For normal decimal numbers, e=10. For a system in a computer that uses one byte for each digit, e=256. For my system, which uses a 32-bit unsigned integer for each digit, e=232=4294967296.

Now let A = a1*e + a0, and B=b1*e + b0. (I'm using * for multiplication here).

To form the product of A*B, the traditional way, we calculate

[Eq1]   A*B = (a1*b1)e2 + (a1*b0+a0*b1)e1 + a0*b0

This takes four multiplications.

The Karatsuba trick is based on the following fact (just write it out on paper and you will see it's true):

[Eq2]   (a1+a0)*(b1+b0)-a1*b1-a0*b0 = a1*b0+a0*b1

The right-hand side of Eq2 is exactly the same as the middle term in Eq1 (the part in front of e1) so we can rewrite Eq1 substituting the left-hand side of Eq2:

[Eq3]   A*B = (a1*b1)e2 + [(a1+a0)*(b1+b0)-a1*b1-a0*b0]e1 + a0*b0

A simpler way to write this is:

[Eq4]   A*B = Xe2 + [U-X-Y]e1 + Y,

where X=a1*b1, Y=a0*b0, and U=(a1+a0)*(b1+b0)

The three multiplications required are just those needed to calculate X, Y, and U. (There is a subtle implementation issue involved in calculating U...See below.)

For large numbers of digits, this method can be applied recursively. The savings can be substantial -- the number of multiplications is reduced from N2 to N1.6 for very large values of N.

For small numbers of digits, the overhead of the extra additions may be greater than the saving from the reduction in multiplications. In fact, in my arithmetic functions, for N=16, the Karatsuba method was faster than my earlier conventional multiplication without truncation of less-significant digit products. But after introducing the optimizations described above, the Karatsuba method no longer faster, even for N=16. (N=16 means 15 32-bit digits, for a total of 480 bits, or 144 decimal digits after the decimal point).

#### Practical Problems with Karatsuba

One huge problem is kind of subtle, but maybe some of more experienced software engineers spotted it in the above equations. The calculation of U requires first performing two additions, which means there is the possibility of carries. If each digit is 32 bits, then there is nowhere for the carries to go. So you either have to reduce the digit size to 31 bits or keep track of them separately. Neither approach is cheap or easy. My Karatsuba implementation kept a carry flag for each sum, then added terms in after doing the multiplication as needed.

Another problem for my operations is that a subtraction is required. This is not a big deal usually, but with the SSE2 instructions and their lack of carry or borrow, it creates some major logistical hassles in keeping track of things.

### How Much Precision is Needed?

The need for high-precision arithmetic is driven by the need to take differences between numbers that are close together. It is not so much trying to handle very large or very small numbers, but rather relatively small numbers that are extremely close together.

Once a bignum arithmetic function is written, it is easy to have it run for any number of digits -- but how many is enough? The time taken to perform multiplications increases as the square of the number of digits, so we don't want to use more digits of precision than we really need given the magnification of the image we're trying to render. So we should think about how much precision is needed and use as few digits as possible to adequately render the image.

The most obvious effect of limited precision shows up when calculating the coordinates in the complex number plane that correspond to the pixel numbers in the image. This is done as follows:

X = Nx * s + X0
Y = Ny * s + Y0

where
Nx = x-coordinate pixel number
Ny = y-coordinate pixel number
s  = spacing of pixels (square!) in the complex plane
X0 = left coordinate of image
Y0 = top coordinate of image

We start running out of precision when s, the spacing of pixels in the complex plane, gets really small compared to X0 or Y0. But how small can it get before we need to add more digits?

Let d be the smallest number that can be represented by one least-significant bit in the number system in use. For the IEEE-754 double-precision numbers, that is  d = 2-52 = 2.22e-16. This is the smallest number that can be added to 1.000... that will yield a result different from 1.000... The numbers that can be exactly represented with standard double-precision quantities are integer multiples of this small value.

If we want to be sure that our X and Y values are accurately located on the complex plane, we need to consider how large the deviation of any number can be from the closest exactly-representable number. Since the spacing between the exactly-representable numbers is d, we can never be more than d/2 away from any arbitrary real number.

The spacing of the pixels in the complex plane is s, and we want to be able to represent that spacing sufficiently accurately that we do not see any noticeable jitter in the positions of the pixels in the image. Let E be the maximum proportional error we are willing to tolerate in pixel location (e.g. maybe 1% or 0.1% or something like that). Since the largest absolute error possible is d/2, we have:

(d/2) / s < E

Now, we can express s in terms of more meaningful quantities like this:

s = W / N

where

W = size of image in the complex plane (determined by whichever direction is larger, usually the X-direction)

N = number of pixels in the larger direction (again, usually the X-direction)

W > (d N) / (2 E)

This gives the minimum size W of an image for a precision d and pixel resolution N, given our choice of relative error E.

For a 1% error (E=0.01) in a 640-pixel wide image in IEEE-754 double-precision format, we get:

W = (2.22e-16 * 640) / (2 * 0.01) = 7.1e-12

This is the smallest size image that we can draw and maintain a maximum 1% error in pixel location. If the image gets smaller than that, we have to switch to high-precision arithmetic to maintain better than 1% maximum error.

For still images, the human eye is a little more tolerant of pixel positioning error than in animations because the eye (the brain, really) is very good at noticing even tiny amounts of temporal jitter in the locations of the pixels from one frame to the next. For animations, generally any error larger than 0.1% has a reasonable chance of being noticed. I usually switch to the next higher level of precision when the increment s becomes less than 1000 times the d-value for the current precision, which means the total image width is 320,000 times the working precision's d-value.

### Implementation Details

Addition and subtraction scale linearly with the number of digits, so they are always much faster than multiplication. I currently implement then with regular general-purpose assembly language instructions. The SSE2 instructions do not support a carry, so they are not well-suited to multi-digit addition operations.

#### Division

For fractals that do not require high-precision division, which are all the convergent type fractals like the Mandelbrot set, a good machine-code division function is not needed. Currently I use a simple binary search algorithm to take the reciprocal of a bignum that must be greater than 1, from which a division function can be built easily.

Convergent fractals do require division, and that means that floating-point arithmetic is necessary to properly calculate them. Floating point was added in January 2011, and with that came an assembly-language division function, since that was the whole point of adding floating point in the first place.

### Performance Data

Performance data are shown in the table below for various levels of precision.

Update Dec 2008: Improvements in the low-level assembly language code for the SSE-based multiplication have given speed improvements up to almost 50%. The table below shows the new speeds and the old speeds on the same 3.2 GHz Pentium IV.

Explanation of table data:

• Total Bits is the total number of bits, which is the integer part plus the fractional part. The integer part always has 32 bits.
• Fractional Bits is the number of bits allocated to the part of the number to the right of the decimal point.
• Fractional Part Decimal Digits is the number of base-10 digits after the decimal point that can be represented by the Fractional Bits
• Multiplications is the rate at which individual multiplications are performed with no other operations included
• Mandelbrot Operations is the rate at which the Z=Z2+C operation in complex numbers can be performed.
High-Precision Arithmetic Performance on Pentium-IV System
Total Bits Fractional Bits Fractional Part Decimal Digits Multiplications, Millions/second Mandelbrot Operations, Thousands/second
NewOldNewOld
128 96 28.89 11.64 10.64 1768 1738
192 160 48.16 8.10 7.14 1531 1488
256 224 67.43 5.93 4.57 1205 1067
320 288 86.70 3.83 2.91 979 771
384 352 105.96 3.08 2.29 840 609
448 416 125.23 2.43 1.78 722 488
512 480 144.49 2.03 1.45 573 388

Updated Benchmark: Now this software can draw a 640x480 image of the interior of the Mandelbrot set (centered at 0+0i in the complex number plane) with 1000 iterations at 106 digits of precision (352 bits of fraction) in 6.1 minutes on a 3.2 GHz Pentium IV [previously 8.4 minutes].

### Comparison with Core2 System

The Core2 processor is touted as having an intrinsically faster architecture than the Pentium-IV. How does a single core of the quad-core Core2 running at 2.6 GHz compare to the Pentium-IV running at 3.2 GHz? Here are the results:

High-Precision Arithmetic Performance on Core2
Total Bits Fractional Bits Fractional Part Decimal Digits Multiplications, Millions/second Mandelbrot Operations, Thousands/second
128 96 28.89 21.32 1734
192 160 48.16 11.86 1491
256 224 67.43 8.31 1311
320 288 86.70 5.71 1086
384 352 105.96 4.44 953
448 416 125.23 3.49 829
512 480 144.49 2.87 643

Overall, the basic multiplication is significantly faster on the Core2. The Mandelbrot operation numbers are a tiny bit slower on the Core2 for lower precisions, which is probably sampling error. There is a small but noticeable increase in speed at all precisions starting at 256 total bits.