hpdz.net

High-Precision Deep Zoom

Technical Info - 64-Bit Migration

Introduction

If this website had a blog, that might be a good place to put the content of this page. This is a summary of the adventures encountered during the migration of the fractal software to a 64-bit platform, sharing the fun details of how this process worked and the obstacles encountered. Mostly it is just venting various frustrations, but I have tried to include some practical advice for others who may be facing this kind of code migration.

I have also included some images of what happens when things go wrong with the core arithmetic operations. These are way down at the bottom of the page.

The complete story of all the obstacles along the path is told below. First, the most important stuff: the software is MUCH faster now!

See also the pages on high-precision floating-point and bignum arithmetic.

Performance Improvements

The motivation for moving the code to a 64-bit platform was speeding up the core high-precision arithmetic code, particularly multiplication. It was pretty fast before on the 32-bit platform since it used the SSE2 instructions to perform multiplications, but the possibility that a 64-bit machine could get even better calculation speed was very enticing. The bottom line is that all this effort paid off, and the 64-bit code is significantly faster, up to twice as fast as the 32-bit code.

The following tables demonstrate exactly what kind of speed increases were achieved with this code migration. This is data from tests of the multiplication function and Mandelbrot function in isolation, not part of an image rendering, run in a single thread on a 2.4 GHz Core2.

Speed of Calculations on Core2 System (single thread)
Precision	Total Bits	Fractional Bits	Fractional Part Decimal Digits	Multiplications, Millions/second (64-bit)	Multiplications, Millions/second (32-bit)	Mandelbrot Operations, Thousands/second (64-bit)	Mandelbrot Operations, Thousands/second (32-bit)
4	128	96	28.89	22.8	21.32	3740	1734
6	192	160	48.16	15.2	11.86	2850	1491
8	256	224	67.43	12.7	8.31	2440	1311
10	320	288	86.70	9.0	5.71	1900	1086
12	384	352	105.96	7.8	4.44	1680	953
14	448	416	125.23	5.7	3.49	1330	829
16	512	480	144.49	5.1	2.87	1210	643

You may note that 5.1 million multiplications per second on a 2.4 GHz CPU implies a throughput of one 512x512-bit multiplication per roughly 470 clock cycles (2.4 billion cycles per second divided by 5.1 million multiplications per second).

Explanation of table data:

Precision is how I refer to these various number sizes. This is the number of 32-bit "digits" that a number is built out of. One of these holds the integer part of the number and the rest hold the fractional part.
Total Bits is the total number of bits, which is the integer part plus the fractional part. The integer part always has 32 bits.
Fractional Bits is the number of bits allocated to the part of the number to the right of the decimal point.
Fractional Part Decimal Digits is the number of base-10 digits after the decimal point that can be represented by the Fractional Bits
Multiplications is the rate at which individual multiplications are performed with no other operations included
Mandelbrot Operations is the rate at which the Z=Z²+C operation in complex numbers can be performed. This involves three multiplications and five additions, as well as an integer comparison.
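The size columns in the table follow directly from the precision value. Here is a minimal sketch of that arithmetic; the struct and function names are hypothetical and are not taken from the fractal software.

```cpp
#include <cassert>
#include <cmath>

// Hypothetical helper: sizes derived from a precision of N 32-bit "digits",
// one of which holds the integer part.
struct NumberSize {
    int total_bits;
    int fractional_bits;
    double fractional_decimal_digits;
};

NumberSize describe_precision(int precision) {
    assert(precision >= 1);
    const int total = 32 * precision;
    const int frac = total - 32;  // one 32-bit digit is the integer part
    // Each bit is worth log10(2), about 0.30103, decimal digits.
    return {total, frac, frac * std::log10(2.0)};
}
```

For precision 16 this gives 512 total bits, 480 fractional bits, and roughly 144.49 decimal digits, matching the last row of the table.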

A real-world "bottom line" kind of test is to set up an image and see how fast it actually draws. That also facilitates comparison with other software that may not be able to provide timings of low-level isolated arithmetic operations. The following table shows the total throughput of the quad-core system with the new 64-bit software and compares this to 32-bit code running in the 64-bit environment under WoW64 (the 32-bit emulation system in 64-bit Windows), and to the 32-bit code running in a native 32-bit environment (Windows XP-Pro).

The test image is a 500x500 pixel image centered at (-0.75,0) with a size of 0.4 in each direction in the complex number plane (i.e. left=-0.95, right=-0.55, top=0.2, bottom=-0.2) with a maximum iteration count of 5000.
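For readers who want to reproduce the test image in other software, here is a rough sketch of the per-pixel work in ordinary double precision. It is not the high-precision code: the function names are hypothetical, sampling at pixel centers is an assumption, and the bailout test is a floating-point comparison here rather than the integer comparison used with fixed-point numbers.

```cpp
#include <cassert>

// Iterate Z = Z^2 + C for one point. Returns the iteration at which |Z|^2
// first exceeded 4, or max_iter if it never did.
int escape_count(double cx, double cy, int max_iter) {
    assert(max_iter >= 0);
    double x = 0.0, y = 0.0;
    for (int i = 0; i < max_iter; ++i) {
        const double xx = x * x;      // multiplication 1
        const double yy = y * y;      // multiplication 2
        if (xx + yy > 4.0) return i;  // addition 1, then the comparison
        const double xy = x * y;      // multiplication 3
        x = xx - yy + cx;             // additions 2 and 3
        y = xy + xy + cy;             // additions 4 and 5
    }
    return max_iter;
}

// Map a pixel of the 500x500 test image to a point in the complex plane.
// Window: left=-0.95, right=-0.55, top=0.2, bottom=-0.2.
void pixel_to_point(int px, int py, double& cx, double& cy) {
    const int size = 500;
    const double left = -0.95, top = 0.2, width = 0.4;
    cx = left + (px + 0.5) * width / size;  // pixel centers: an assumption
    cy = top - (py + 0.5) * width / size;
}
```

Calling escape_count on each of the 250,000 points with max_iter set to 5000 does the same work as the timed test, minus the extra precision.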

Test Image Drawing Times, seconds (Core2 quad)
Precision	64-bit	32-bit WoW64	Native 32-bit (XP)
4	84.2	95.6	95.0
6	113	143	142
8	129	194	193
10	161	250	250
12	182	297	297
14	225	363	360
16	243	424	424

Note that times in this table are on a 2.4 GHz quad-core Core2 system with FOUR cores working together. To compare to a single-core system, multiply all drawing times by 4.

You can see the 64-bit code is up to 75% faster than the 32-bit code. Also, the WoW64 emulation in Windows 7 works very well, since the 32-bit run times are nearly identical.

With such an impressive gain in performance, the story behind how it all came about must make fascinating reading, right? You be the judge. Read on.

The 64-bit Migration Saga

Writing this is kind of painful. It is hard to relive all the frustration involved in getting this working, but if I don't capture this now, I may never have the motivation to do so again. Just to add to the fun, Expression crashed and lost about three hours of work I put into typing all this, and it doesn't seem to save a working draft like Word does. Now Ctrl-S is my favorite keystroke. Let me do it again. There. Better.

Background and Motivation: Every 10% counts!
Windows 7: Is your supercomputer fast enough to run our new OS?
Compile Please: You'll need a 64-bit number to count all the errors
Run Please: The joys of browsing EXE files
64-bit multiplication and addition
Wacky bug images

Background and Motivation - Every 10% counts

The fractal software that generates all the images uses specialized code to perform the core arithmetic operations like multiplication and addition. This is needed because the built-in hardware in the CPU only provides the equivalent of about 16 digits of precision, while the deep zoom fractals require literally hundreds of digits of precision. Internally, the fractal software represents those extra digits of precision as a string of 32-bit integers. One 32-bit value represents the integer component of a number, and the remaining 32-bit integers represent the fractional part of a number. This is a form of fixed-point arithmetic, and it is described in more detail on the page for Big Num arithmetic.
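To make the layout concrete, here is a small sketch of such a fixed-point number. The type name, the digit order (integer part first), and the omission of sign handling are all assumptions made for illustration; the actual software may differ on each point.

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

// Hypothetical layout: digits[0] is the 32-bit integer part, digits[1..] are
// successive 32-bit fractional digits, most significant first.
using BigFixed = std::vector<std::uint32_t>;

// Approximate value as a double; only useful for inspecting small examples,
// since a double cannot hold more than the first couple of digits.
double to_double(const BigFixed& n) {
    assert(!n.empty());
    double value = 0.0;
    double scale = 1.0;  // weight of the current digit
    for (std::uint32_t d : n) {
        value += d * scale;
        scale /= 4294967296.0;  // each digit is worth 2^-32 of the previous one
    }
    return value;
}
```

With this layout, the digits {3, 0x80000000, 0} represent 3.5.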

SSE2-based 32-bit code

The high-precision arithmetic code has been improved and fine-tuned over several years, and is pretty fast. It uses certain specialized instructions that are part of what Intel calls Streaming SIMD Extensions 2 (SSE2) to do multiplications much faster than is possible using the simple arithmetic instructions in the CPU. These instructions are part of the Single Instruction Multiple Data (SIMD) subsystem in all modern processors. These instructions are designed to process multimedia video and audio data extremely fast, and they execute in a special, highly efficient subsystem in the processor. The star of the show for the purposes of the fractal software is an instruction that performs two 32x32 bit multiplications at once, giving two 64-bit products.

The SSE2 instructions are tedious to use here because they were not designed for this kind of application. They have some significant limitations (like, for example, no carry bit) that require a lot of tricks to work around. Those tricks resulted in some very complicated code to shuffle data around to make it aligned with the way those instructions want to work, and the 32-bit SSE2-based multiplication code spends more time doing that than it spends actually doing multiplication. Still, the SSE2 instructions execute extremely fast, and even with all the complications, this code was significantly faster than any multiplication function written with the basic instruction set. This SSE2-based multiplication function has been at the core of the high-precision arithmetic in this software for years, with occasional incremental improvements and optimizations.

Typically, an improvement to the arithmetic code has involved spending dozens of hours on something that results in a 10% improvement, or less. Those 10% improvements do add up, but the code is very mature now, and it's been squeezed as hard as it can be within the limits of a 32-bit environment.

The potential of 64-bit code

I had been wondering for a few years whether using the regular, non-SSE2, 64-bit multiply and add instructions would be worthwhile. They operate on the same basic amount of data (64 bits) but they are much easier to use than the SSE2 instructions, so there would be less bookkeeping overhead. However, they do not execute in the SIMD subsystem of the processor, so it was not clear whether porting to these instructions would be worthwhile. The advantages are tantalizing:

A single 64x64 bit multiplication instruction that gives a 128-bit product
Standard arithmetic operations, with a carry flag, that operate on 64-bit quantities
Far fewer instructions to achieve the same computational end result
Eight new registers to hold intermediate results during big multiplications

So I did some initial sketching of how a 64-bit multiplier would look and decided it looked promising. Multiplying a 2x2 block of 32-bit integers requires about 25 SSE2 instructions to load them, multiply them, and add the result into the final product accumulator. This same operation takes seven 64-bit instructions. Multiplying a 4x4 block of 32-bit values (i.e. finding the product of a 128-bit chunk of two high-precision numbers) took over a hundred SSE2 instructions. With 64-bit code, it takes about forty regular instructions.
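To see why the 2x2 block is so much cheaper in 64-bit code, it helps to write out what it involves when only 32x32-bit products are available. The sketch below does this in portable C++; the names are hypothetical, and it shows the arithmetic only, not the SSE2 data shuffling. On an x64 CPU the whole function body corresponds to a single MUL instruction.

```cpp
#include <cstdint>

struct U128 {  // hypothetical 128-bit result type
    std::uint64_t hi;
    std::uint64_t lo;
};

// 64x64 -> 128-bit product assembled from four 32x32 -> 64-bit products.
U128 mul_2x2(std::uint64_t a, std::uint64_t b) {
    const std::uint64_t mask = 0xFFFFFFFFu;
    const std::uint64_t a0 = a & mask, a1 = a >> 32;
    const std::uint64_t b0 = b & mask, b1 = b >> 32;

    const std::uint64_t p00 = a0 * b0;
    const std::uint64_t p01 = a0 * b1;
    const std::uint64_t p10 = a1 * b0;
    const std::uint64_t p11 = a1 * b1;

    // Middle column: three values below 2^32 each, so no 64-bit overflow.
    const std::uint64_t mid = (p00 >> 32) + (p01 & mask) + (p10 & mask);

    U128 r;
    r.lo = (mid << 32) | (p00 & mask);
    r.hi = p11 + (p01 >> 32) + (p10 >> 32) + (mid >> 32);
    return r;
}
```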

Furthermore, addition of two big numbers takes half as many operations on a 64-bit system as on a 32-bit system. Addition is not as time-consuming as multiplication, but there is a lot of addition going on, and making the addition operation faster can have a huge impact on the time it takes to render an image. Code that performs addition is limited to using the regular, non-SIMD instructions because the SSE2 instructions don't have any carry bit, which is essential for performing sums of strings of digits. The SSE instructions are good at performing operations on multiple independent data streams simultaneously, but don't work so well on streams of data that have dependencies, like the digits in an addition. Using the 64-bit addition would clearly double the speed of addition.
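The dependency between digits is easy to see in a portable sketch of multi-digit addition. The template below is illustrative only: the function name and the least-significant-first digit order are assumptions, and real code would use the CPU's add-with-carry instruction instead of detecting wraparound by comparison.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Adds b into a and returns the carry out of the top digit. Each step needs
// the carry from the step before it, which is why independent SIMD lanes do
// not help here.
template <typename Digit>
bool add_in_place(std::vector<Digit>& a, const std::vector<Digit>& b) {
    assert(a.size() == b.size());
    bool carry = false;
    for (std::size_t i = 0; i < a.size(); ++i) {
        const Digit sum = static_cast<Digit>(a[i] + b[i] + (carry ? 1 : 0));
        // Unsigned wraparound reveals whether this digit overflowed.
        carry = carry ? (sum <= a[i]) : (sum < a[i]);
        a[i] = sum;
    }
    return carry;
}
```

A 512-bit number takes sixteen loop steps with std::uint32_t digits and eight with std::uint64_t digits, which is where the halving of the operation count comes from.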

Finally, when a processor is running in 64-bit mode, it has eight new registers available, so intermediate results of multiplying larger blocks of data can be held in register space rather than having to be written out to memory. This greatly reduces the potential for cache-miss problems. At the very end of the 4x4 block multiplication, the result can be summed into the final accumulator with sequential memory accesses, which helps ensure all the data is kept in the processor cache efficiently.

So after looking at all this, the potential for huge speed gains seemed very likely, and I decided to take the plunge and migrate to a 64-bit environment.

Windows 7 - Is your supercomputer fast enough to run our new OS?

The first step, of course, was getting a 64-bit operating system installed. Unfortunately, code that uses 64-bit instructions won't run on a computer with a 32-bit operating system. The instructions are just not available if the OS puts the CPU into 32-bit mode. So the upgrade to a 64-bit OS, a move I have been rightfully resisting for several years, was now unavoidable. A quick trip to Sam's Club, and I had my Windows 7 upgrade. I created a new partition on the hard drive in my quad-core Core2 system, made an image backup of the previous WinXP partition, and inserted my shiny new Windows 7 CD.

Installation

Overall the Win7 installation went well and was relatively painless. At one point the installer just sits there idle for a long time and doesn't give any feedback about what's going on, and it looks like it's stalled. I rebooted several times during this process before deciding to just let it sit for a while, and eventually, it did complete the installation.

First major mistake

Unfortunately, I installed Win7 over the original partition that had XP, and the backup partition wouldn't run properly if I booted to it! I was able to recover by doing a simple repair of XP using the installation disk, and this happened with no loss of data and minimal pain. Well, maybe not minimal. The WinXP repair overwrote the Win7 boot manager, which made it impossible to boot to Win7. Keep reading.

Boot managers

The Windows XP boot manager cannot boot Windows 7. Apparently Win7 uses a completely different way of booting than pre-Vista versions of Windows, and the boot manager on earlier versions of Windows cannot handle this. So you have to make sure that the active partition on the boot hard drive is the Win7 partition, and you must use its boot manager to handle dual-boot operation. There's a nice tool called EasyBCD that helped get this working right after my XP repair overwrote the Win7 boot loader.

Drivers

The drivers that came with my system labeled "Vista" seem to work fine with Windows 7. As far as I can tell, that is the way it is supposed to be.

Opinions on Windows 7

First, although I have now been using Windows 7 for several weeks, I may not be qualified to have an opinion on it. All I've done so far with it is start up the remote debugging server and debug my fractal software. Still, along the way, I have noticed a few things.

Win7 is a pretty slick-looking environment to work in. It has a bunch of new features in the way the start menu works and how the taskbar icons behave. It definitely helps get work done. It also looks neat with this slick semi-translucent blurring effect in all the window frames so you can kind of see through it to what's behind, sort of like looking through a piece of frosted glass. The warning and error sounds are also nicer, less annoying than in Win XP and Win2000. Your desktop background can be a slide show, with changing images. Kind of cool.

And I really like the new Calculator application. It has some nice new features including a date difference calculator and a unit conversion mode. The Scientific mode has hyperbolic trig functions right on the keypad, and the Programmer mode shows the binary representation of the number in the display automatically. But, unfortunately, it still doesn't allow you to work with fractional numbers in hex, just whole numbers.

I would say that unless you really need to access greater than 4GB of memory, or unless you have a special piece of software that for some reason absolutely needs to run on a 64-bit system, it's not clear that it's worth it to upgrade to Win7 from XP. Windows 7 says it requires 2 GB of memory at a minimum, and it will test your hardware to determine if it's fast enough to support the slick video features of the "Aero" desktop scheme, like the transparent window borders and animated window opening/closing. It is kind of funny that this piece of hardware, which is faster than what used to be considered supercomputer-level technology way back when, is just barely fast enough to support all the pretty features of Aero, which is just the polish on the user interface.

Running 32-bit code

Once Win7 was installed, the first thing I wanted to do was test whether the 32-bit software would run on it, which it did without a problem. This was no surprise, since the 64-bit versions of Windows are supposed to be able to run 32-bit code without any problems. This is possible because of a tricky thunking layer called WoW64 (Windows on Windows64) that switches the processor between 32-bit and 64-bit modes and translates stack frames for API function calls and other things like that.

Furthermore, as you can tell from the data at the top of the page, there is no perceptible difference in performance when running a 32-bit application in a 64-bit environment versus running it in a native 32-bit environment. At least not when that application spends 99% of its time in a tight loop doing arithmetic operations. Your mileage may vary if you have software that spends the majority of its time calling Windows API functions or doing multimedia.

Compile Please - You'll need a 64-bit number to count all the errors

This software was originally written about 9 years ago, and at the time, I never imagined trying to compile it for a 64-bit platform. Overall, modifying the code to compile to an x64 platform ended up being a relatively painless process, but it was not without some major headaches, and the first pass generated quite the error list from the C++ compiler.

No more inline assembly language

The biggest difference for this project is that the x64 compiler, for some reason unfathomable to me, no longer supports inline assembly language. That's right, __asm is GONE! Most of the core arithmetic code for addition and multiplication was written with inline assembly language, so this was a big problem. Now, for x64 platforms, Visual Studio requires MASM. MASM? Yes, it's still around, and Microsoft has extended it to support x64 instructions and those extra eight 64-bit registers.

Using MASM with Visual Studio is relatively straightforward. You make a Custom Build Step and type in the command line arguments you need for ml64.exe (VS only automatically does this for ASM files on 32-bit builds). Some maddening bugs I found:

You have to put the /Fo option that specifies the output path for the object file on the command line before the name of the input file, or else the /Fo option is ignored. This behavior is not documented, and it took a lot of searching around on support forums to track down one little post mentioning it. Microsoft has even officially confirmed that this is a real bug.
The /Ta option that VS appends to its default command line for ml.exe will cause ml64.exe to crash, big time. I didn't experiment with whether /Ta works with the 32-bit assembler, but if you pass it to ml64.exe, it will bring down the entire VS application.

One very nice thing about 64-bit assembly language is that there are no more segments or models. All I have at the beginning of my file is a single quiet little ".code" directive. Some applications may also need a ".data" somewhere, I suppose.

Be careful with PROC declarations in 64-bit code. If you add the "C" language specifier, the assembler (even the 64-bit assembler) will try to be helpful and emit the usual

push ebp

mov ebp, esp

without telling you. This was nice for 32-bit code, but the calling convention in 64-bit code is totally different (see below), and this is unnecessary. So I just left off the "C". Of course, within the C++ code that calls the assembly language subroutines, you have to wrap them all up with extern "C" {} unless you like keeping track of the C++ name-mangling.

Pointers and integers

After removing all the inline assembly code, the next huge batch of errors and warnings I faced had to do with sizes of pointers and integers. Fortunately, Microsoft made some decisions for 64-bit code that kept this to a minimum. Pointers in 64-bit code are 64 bits, while integers (except long long) are still 32 bits. This was good news, because otherwise binary files would be messed up. For example, sizeof(int) would change from 4 to 8, etc. I would have to go into the source code and change every sizeof(int) to sizeof(DWORD) or something like that. Strictly speaking, this is how the code should have been in the first place (i.e. never assume anything about the size of int) but, well, like I said, I never imagined porting this code to a 64-bit platform.

Another nice decision was making WPARAM and LPARAM, the types of the data that goes with the SendMessage API function, both become 64 bits, so you can still pass pointers via SendMessage. But I had quite a few places where I would pass an integer value via SendMessage, and on the receiving end, assigning it to an int or unsigned causes a warning because it's an assignment from a 64-bit value to a 32-bit variable. This is perfectly acceptable as long as you are aware that the upper 32 bits will be lost. For all cases where I was doing this, it was not a problem because the values involved were small enough to easily fit into a 32-bit number. I don't need a 64-bit value to represent, say, the line number of an image; 32 bits is more than enough for that.

Overall, although I had to fix a large number of this kind of warning, nothing required any significant code rewrite, just a bunch of (int) casts to make the compiler warnings go away.

Oh, and time_t, size_t, and ptrdiff_t are all now 64 bits, so code that treated them like integers can also generate lots of warnings. The main problem I had with this was the return value from strlen, which, yes, is size_t, and is now 64 bits, so you can't assign its return value to an int without getting a warning. I find that quite silly. I doubt anyone's ever had a string long enough to require even a 32-bit length (insert the name of your favorite windbag here).
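One way to keep such narrowing conversions explicit without scattering bare (int) casts is a small checked helper. This is only a sketch of that idea; the helper names are hypothetical, and the original code simply used casts.

```cpp
#include <cassert>
#include <climits>
#include <cstddef>
#include <cstring>

// Hypothetical helper: narrow a size_t to int, asserting in debug builds
// that no bits are lost.
inline int to_int(std::size_t n) {
    assert(n <= static_cast<std::size_t>(INT_MAX));
    return static_cast<int>(n);
}

// strlen returns size_t (64 bits on x64); this wrapper hands back an int.
inline int string_length(const char* s) {
    return to_int(std::strlen(s));
}
```

The same pattern fits values that arrive through WPARAM or LPARAM, where the upper 32 bits are expected to be zero.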

64-Bit calling conventions

Once I got MASM running, not crashing, and putting the OBJ file in the right place, it was time to deal with the calling convention in 64-bit code. One of the major reasons I loved inline assembly language is that it took care of referencing passed parameters and local variables on the stack. That help is all gone when you work in pure assembly language, and you have to keep track of the stack frame manually. Fortunately, the calling convention for 64-bit code is actually quite a bit simpler than 32-bit code, and it's a snap to handle even in hand-written assembly language.

In 64-bit code, the first four integer parameters are passed in registers. Starting with the first parameter, they go in RCX, RDX, R8, and R9 (floating point values are passed in the XMM registers). These parameters are not pushed onto the stack, but the stack pointer is decremented as if they were, creating a 32-byte space on the stack that can be used to save any of the parameters or anything else that the called function wants to do with it. If there are more than four parameters, then they are pushed on the stack after this space. The caller is responsible for setting up the stack frame and cleaning it up.
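Seen from the C++ side, a routine written in assembly language might be declared as in the sketch below, with comments noting where each integer argument lands under the Windows x64 convention. The function name and parameter list are hypothetical, and the C++ body is only a stand-in so the example is complete without an assembler.

```cpp
#include <cstddef>
#include <cstdint>

// Hypothetical routine; in a real build the body would live in the .asm file.
// Windows x64 register assignment for the arguments:
//   dst   -> RCX
//   a     -> RDX
//   b     -> R8
//   count -> R9
// The integer return value comes back in RAX.
extern "C" std::uint64_t AddDigits(std::uint64_t* dst,
                                   const std::uint64_t* a,
                                   const std::uint64_t* b,
                                   std::size_t count);

// C++ stand-in: adds a and b digit by digit (least significant first, an
// assumption) and returns the final carry.
extern "C" std::uint64_t AddDigits(std::uint64_t* dst,
                                   const std::uint64_t* a,
                                   const std::uint64_t* b,
                                   std::size_t count) {
    std::uint64_t carry = 0;
    for (std::size_t i = 0; i < count; ++i) {
        const std::uint64_t s = a[i] + carry;
        const std::uint64_t c1 = (s < carry) ? 1u : 0u;  // overflow adding the carry
        const std::uint64_t sum = s + b[i];
        const std::uint64_t c2 = (sum < s) ? 1u : 0u;    // overflow adding b[i]
        dst[i] = sum;
        carry = c1 + c2;  // at most one of the two can be set
    }
    return carry;
}
```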

By convention, a function in a 64-bit program will create enough stack space in its prolog code to handle all the parameter-passing needs of any function it calls, and the stack pointer remains constant within the function's own code. This is done mainly to help with exception handling and stack unwinding. None of the code in my arithmetic routines is complicated enough to warrant considering this kind of thing, but it is a nice convention. Also by convention, the stack pointer is supposed to always be 16-byte aligned. This helps enormously for my application because it means all arrays of big number data, even ones created in local variables, will always be 16-byte aligned, so the multiplication function will never write a quantity that crosses a cache line.

Different __unaligned qualifiers?

In a couple of places, I use the Windows Shell function SHBrowseForFolder. I'm not a big fan of the clunky dialog box it gives you, but that's a different matter. This function takes a BROWSEINFO structure and returns a pointer to a structure called an ITEMIDLIST, and if you do the usual thing to assign this to a pointer:

ITEMIDLIST *idl = SHBrowseForFolder (&bi);

you'll get

warning C4090: 'initializing' : different '__unaligned' qualifiers

The exact problem here is not completely clear to me but the obvious guess worked:

ITEMIDLIST __unaligned *idl = SHBrowseForFolder (&bi);

Run Please - The joys of browsing EXE files

Eventually, all the little details of ints and pointers and __unaligned got sorted out and I got an error-free, warning-free build and my first 64-bit EXE file was suddenly sitting there. I copied it to the Win7 system (I have Visual Studio installed on a different system that doesn't support 64-bit execution) and excitedly double-clicked it and got a rather unhelpful dialog box with the obscure message that my program could not be run due to error 0xC000007B.

This was not entirely unexpected. I figured something had to go wrong the first time I tried this, so I was prepared mentally. I just took a deep breath and started trying to figure out what this meant. Well, after a lot of searching the best I could come up with, from the Windows error reference list, was that this is STATUS_INVALID_IMAGE_FORMAT and means:

{Bad Image} The program is either not designed to run on Windows or it contains an error. Try installing the program again using the original installation media or contact your system administrator or the software vendor for support.

My first guess was that I hadn't targeted the x64 platform correctly in the Visual Studio project. I double-checked and made sure it was set to build as a 64-bit program, and tried recompiling a few times. Same result.

"Bad Image"? Was this referring to the EXE file itself, or to some other possibly incompatible thing, like an embedded BMP image for the toolbar? This message was not helpful. Neither were the Windows system logs. I was hoping something somewhere would go into more detail about exactly what the system didn't like about my program, but nothing offered any help.

So I built a clean, new project and tried running just the shell application that Visual Studio creates. I thought that maybe something in the project file for this program -- which has been ported through several versions of updates to Visual Studio over a ten-year period -- might have some obscure setting that was preventing x64 code from working right. Trying to start from scratch with a fresh, new project file that this current version of VS created itself seemed like it might help. Sure enough, both a simple Windows program and an MFC multi-doc program worked fine as x64. Once I had a project working, I copied my source code files into it and rebuilt. Unfortunately, that gave the same result, the 0xC000007B error.

Now I was starting to get nervous. Something in my own code was causing the EXE file to crash, not a bad project setting. I tried to track down where this could be, and ended up overloading _tWinMain itself with some code to pop up a message box, and even this was not executing. The program was crashing before it started running. That gave me a sick feeling, the sick feeling that comes from starting to think about needing to debug the startup code. This is an obscure bit of code that runs prior to entering the main body of an application, code that sets up global static objects and that sort of thing. In C++ any global objects that have constructors get instantiated and initialized before the program starts, and if there's an error in one of those, you'll never get the program running. Furthermore, those objects are initialized in an unspecified order, which is up to the compiler, so it seemed reasonable that some bug that the 32-bit compiler covered up by chance was exposed in the 64-bit compiler.

Debugging the startup code is very difficult. The VS debugger doesn't help much. You have to just step through the raw machine code with no symbols or anything, and doing this remotely is a big problem. There are tools that can help with this. It seemed that the best thing for this was to get the nearly 700 MB Windows Driver Kit, which contains WinDbg, and I started to download it, not looking forward to using it at all.

Meanwhile, I started down a second line of attack, trying to see what was in the EXE file that might be objectionable. Any time you're browsing through an EXE file, you're not having a good day, let me tell you. Fortunately, there are some decent tools to help with this, although not many work with 64-bit EXE files. I started randomly browsing around through all the different subsections of my "bad" file, hoping to stumble across something that would give me a clue. Most of it seemed innocent enough, although I had no idea what most of it meant. Then this caught my eye in the "Manifest" section under ".rsrc":

<?xml version="1.0" encoding="UTF-8" standalone="yes"?>
<assembly xmlns="urn:schemas-microsoft-com:asm.v1" manifestVersion="1.0">
<dependency>
<dependentAssembly>
<assemblyIdentity type="win32" name="Microsoft.Windows.Common-Controls" version="6.0.0.0" processorArchitecture="x86" publicKeyToken="6595b64144ccf1df" language="*">
</assemblyIdentity>
</dependentAssembly>
</dependency>
</assembly>

[This, by the way, is where Expression previously crashed, right after I pasted that bit of XML code into this document, apparently in a non-text manner that blew its mind. Seems to be working OK now.]

Do you see "win32" and "x86" there? That jumped out at me, and I immediately realized where this came from. This is asking the system to load the Version 6 Windows Common Controls, which are the radio buttons and check boxes and stuff that look nicer in Windows XP than they did in Windows 2000. That came directly from this line of code in one of my source files:

#pragma comment(linker,"/manifestdependency:\"type='win32' name='Microsoft.Windows.Common-Controls' version='6.0.0.0' processorArchitecture='x86' publicKeyToken='6595b64144ccf1df' language='*'\"")

Deleting this gave me, finally, a working x64 EXE file. There it is.
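Deleting the line also drops the request for the Version 6 common controls. An alternative, which was not tried for this project, is to keep the pragma and replace the hard-coded architecture with a wildcard so the same source line serves both 32-bit and 64-bit builds. This is a sketch of that form, guarded so that other compilers ignore it:

```cpp
// Wildcard processorArchitecture: one line intended for both x86 and x64 builds.
#ifdef _MSC_VER
#pragma comment(linker,"/manifestdependency:\"type='win32' name='Microsoft.Windows.Common-Controls' version='6.0.0.0' processorArchitecture='*' publicKeyToken='6595b64144ccf1df' language='*'\"")
#endif
```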

64-Bit Multiplication and Addition

After all that drama, the actual arithmetic code is pretty anticlimactic. It is essentially the same as before, but uses the 64-bit MUL instruction to produce a 128-bit product, rather than using the SSE2 PMULUDQ instruction to multiply two pairs of 32-bit numbers and produce two 64-bit results. The same optimizations apply regarding eliminating unused digits below the threshold of precision and taking advantage of symmetry when multiplying a number by itself.

The smallest operation that forms a building block for larger multiplications is what I call a 2x2 multiplication, which calculates the product of two clusters of two 32-bit "digits". This is essentially a 64x64 bit multiplication to yield a 128-bit product, and can be done with a single instruction in the 64-bit instruction set. That result is summed into the product accumulator with the same carry-save technique that was used in the SSE2 code, only this time the carries are from 64-bit accumulators rather than 32-bit accumulators.

The other building block is a 4x4 multiplication, which calculates the product of two 4-digit clusters from each of the multiplicands. The result of this 128x128 bit multiplication is a 256 bit product that is built up with carry saves in a group of registers in the CPU, then summed into the product accumulator in memory when the calculation is complete.
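As a reference for what a 4x4 block computes, here is a plain schoolbook version in portable C++. It is a sketch for checking results, not the optimized routine: the function name and least-significant-first digit order are assumptions, and it propagates carries immediately instead of using the carry-save accumulation in registers described above.

```cpp
#include <array>
#include <cstddef>
#include <cstdint>

// Product of two 4-digit clusters (128 x 128 bits) as an 8-digit (256-bit)
// result, using 32-bit digits.
std::array<std::uint32_t, 8> mul_4x4(const std::array<std::uint32_t, 4>& a,
                                     const std::array<std::uint32_t, 4>& b) {
    std::array<std::uint32_t, 8> r{};  // zero-initialized accumulator
    for (std::size_t i = 0; i < 4; ++i) {
        std::uint64_t carry = 0;
        for (std::size_t j = 0; j < 4; ++j) {
            // A 32x32 product plus a digit plus a carry always fits in 64 bits.
            const std::uint64_t t =
                static_cast<std::uint64_t>(a[i]) * b[j] + r[i + j] + carry;
            r[i + j] = static_cast<std::uint32_t>(t);
            carry = t >> 32;
        }
        r[i + 4] = static_cast<std::uint32_t>(carry);
    }
    return r;
}
```

This version performs sixteen 32x32-bit multiplications. With 64-bit multiplies the same block needs only four.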

Each large product is broken down into blocks of either 2x2 or 4x4 digit products. Digit products too insignificant to influence the final product by more than 1 bit are not calculated, although any digit product that needs to be calculated will include everything in the smallest 2x2 or 4x4 block that encloses it. That means that for digit products along the diagonal of the product matrix, about three extra digits of significance are calculated. This is something that could be optimized, and the next generation of the code will have specialized routines for calculating the near-diagonal digit products that do not calculate the extra products, if this turns out to speed things up. The 2x2 operation (two 32-bit digit products) is nearly atomic on a 64-bit machine; trying to calculate a single-digit (32-bit) product and sum it into the partial product accumulator is almost not worth the effort since the 32-bit MUL instruction places the result in EDX:EAX as two distinct 32-bit values. It is awkward in a 64-bit environment to sum those into a single 64-bit value, and on first assessment, the code to do that seems slower than the code to just do a 64x64 bit multiplication. So the extra 3 digits may be free. This needs a little study.

The elemental 2x2 and 4x4 operations are encapsulated into macros that are used to build up the larger multiplication functions. This means the code is fully unwound, with no loops.

As in the 32-bit SSE2 code, squaring a number (multiplying it by itself) is optimized further by taking advantage of the digit symmetry. A series of specialized macros for on-diagonal and off-diagonal 2x2 and 4x4 blocks is used to build up code for larger products. This avoids, on average for big numbers, about half the multiplications that are needed when the multiplicands are different.
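The symmetry trick can be shown with a short portable sketch: compute each off-diagonal digit product once, double the running total, then add the diagonal squares. This illustrates the idea only and is not the macro-based code; the function name and the least-significant-first 32-bit digit order are assumptions.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// Square of an n-digit number, returned as 2n digits.
std::vector<std::uint32_t> square(const std::vector<std::uint32_t>& a) {
    const std::size_t n = a.size();
    std::vector<std::uint32_t> r(2 * n, 0);

    // 1. Off-diagonal products a[i]*a[j] with i < j, each computed once.
    for (std::size_t i = 0; i < n; ++i) {
        std::uint64_t carry = 0;
        for (std::size_t j = i + 1; j < n; ++j) {
            const std::uint64_t t =
                static_cast<std::uint64_t>(a[i]) * a[j] + r[i + j] + carry;
            r[i + j] = static_cast<std::uint32_t>(t);
            carry = t >> 32;
        }
        r[i + n] = static_cast<std::uint32_t>(carry);
    }

    // 2. Double the off-diagonal total by shifting left one bit.
    std::uint32_t shifted_in = 0;
    for (std::size_t k = 0; k < 2 * n; ++k) {
        const std::uint32_t shifted_out = r[k] >> 31;
        r[k] = (r[k] << 1) | shifted_in;
        shifted_in = shifted_out;
    }

    // 3. Add the diagonal squares a[i]^2 at digit position 2i.
    std::uint64_t carry = 0;
    for (std::size_t i = 0; i < n; ++i) {
        const std::uint64_t sq = static_cast<std::uint64_t>(a[i]) * a[i];
        std::uint64_t t = (sq & 0xFFFFFFFFu) + r[2 * i] + carry;
        r[2 * i] = static_cast<std::uint32_t>(t);
        t = (sq >> 32) + r[2 * i + 1] + (t >> 32);
        r[2 * i + 1] = static_cast<std::uint32_t>(t);
        carry = t >> 32;
    }
    return r;
}
```

For n digits this performs n(n+1)/2 digit multiplications instead of n squared, which approaches half as n grows.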

Invoking all these macros correctly is a bit error-prone, since each one needs two base references into the multiplicand arrays as well as a pointer into the partial product array, and some mistakes were made along the way, leading to code that sort of multiplied two numbers, but scrambled a few digits along the way. Generating deep-zoom Mandelbrot set images is a good way to test that everything is working properly, and that is how I found most of the bugs in this highly error-prone and tedious process. Some of the resulting images looked interesting, so I thought I would publish them here. See the next section.

Bug images

The following are some of the "fractal" images that resulted from coding errors in the arithmetic core. Notice how the images get more complex as the magnitude of the digits increases.

All except the first one (the one with the white circle) were generated while zooming in to the fixed point in the Mandelbrot set at (0,1) at various magnifications. I use that point to test high-precision arithmetic code because you can quickly set the magnification to any value and you'll always get a recognizable fractal image without having to hunt around manually. The white circle image was generated with buggy precision 4 code when trying to draw the image of the entire Mandelbrot set.
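The point (0,1), that is C = i, is in the Mandelbrot set because its orbit quickly settles into a repeating two-step cycle and never escapes. A quick double-precision check of that behavior is sketched below; the function name is hypothetical.

```cpp
#include <complex>

// Returns true if the orbit of 0 under Z = Z^2 + C stays within |Z| <= 2
// for max_iter iterations.
bool stays_bounded(std::complex<double> c, int max_iter) {
    std::complex<double> z(0.0, 0.0);
    for (int i = 0; i < max_iter; ++i) {
        z = z * z + c;
        if (std::norm(z) > 4.0) return false;  // norm() is |Z| squared
    }
    return true;
}
```

For C = i the orbit is i, -1+i, -i, -1+i, -i, and so on, so the function returns true for any iteration limit.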

Some of these actually look pretty cool in their own right. They remind me of some of the IFS fractal images. Maybe I should explore the art of buggy arithmetic code images?

Click on a thumbnail to download a 500x500 image.

An error in the squaring code	Precision 10 error	Precision 12 error	Precision 16 error 1
Precision 16 error 2	Precision 16 error 3	Precision 16 error 4	Precision 16 error 5