Morgan Herrington's Blog Adventures in Porting and Tuning

Thursday Sep 25, 2008

Most engineers rarely have to write assembly code, yet my experience has been that large applications (particularly if they've been around for a decade or so) contain at least a few hand-written assembly functions. Hopefully these functions will be inspected during a porting project to see if they can be rewritten in C or replaced with calls to the appropriate graphics, atomic, or numeric library.

Since working with assembly is not very common, I won't dwell on the topic very often. However, in the last six months I've seen the same mistake in three unrelated projects (porting from 32-bit to 64-bit SPARC), so this particular topic deserves to be mentioned.

Some 64-bit assembly porting is fairly mechanical: converting to the 64-bit calling conventions, using 64-bit registers, adjusting the bias of stack offsets, and accounting for 64-bit sizes and alignments. However, the following snippet illustrates a slight quirk of the SPARC architecture that is the root cause of a particular porting problem. Here %o2 points to the destination buffer, %o3 points to the source buffer, and the pointer in %o4 marks the end of the copy:

top_of_loop:
    add     %o2,1,%o2       # inc to the next destination location
    ldsb    [%o3],%o1       # load byte from source buffer
    add     %o3,1,%o3       # inc to the next source location
    cmp     %o2,%o4         # check for end of loop
    bcs     top_of_loop     # if not done, then branch to top
    stb     %o1,[%o2-1]     # store to dest buffer in delay slot of branch

The add instruction updates all 64 bits of the output register, and the memory accesses don't need to change. However, this sequence doesn't quite work in 64-bit mode. Unfortunately, it works so much of the time that regression tests could easily miss the failure case.
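
As a rough sketch of what the loop is meant to do, here is a C equivalent. The name copy_bytes and its signature are hypothetical, chosen only for illustration. The loop is a do/while because the assembly always stores one byte before the branch decides whether to go around again:

```c
/* Hypothetical C equivalent of the assembly loop above.
   dst plays the role of %o2, src of %o3, and end of %o4. */
void copy_bytes(unsigned char *dst, const unsigned char *src,
                const unsigned char *end)
{
    do {
        /* load a byte, advance both pointers, store the byte */
        *dst++ = *src++;
    } while (dst < end);    /* unsigned "less than" on the full pointers */
}
```

When this is compiled for a 64-bit target, the compiler compares the full 64-bit pointers, which is exactly the behavior the hand-written version has to reproduce.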

The problem is that, unlike the x64 architecture, which has two sizes of compare instruction ("cmpl" and "cmpq"), SPARC has a single compare instruction that sets two sets of condition codes: %icc, which reflects the low 32 bits of the result, and %xcc, which reflects all 64 bits. The conditional branch in the code sequence above inspects the 32-bit condition codes (%icc), so it jumps based on a 32-bit comparison.
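
The two sets of condition codes can be modeled in a few lines of C. This is only a sketch of the carry bit, which is the bit that bcs tests after a cmp. The function names icc_carry and xcc_carry are made up for illustration:

```c
#include <stdbool.h>
#include <stdint.h>

/* Carry bit in the 32-bit condition codes after "cmp a,b":
   set when the low 32 bits of a are below the low 32 bits of b. */
bool icc_carry(uint64_t a, uint64_t b)
{
    return (uint32_t)a < (uint32_t)b;
}

/* Carry bit in the 64-bit condition codes after the same "cmp a,b":
   set when a is below b as a full 64-bit unsigned value. */
bool xcc_carry(uint64_t a, uint64_t b)
{
    return a < b;
}
```

For two pointers whose upper 32 bits are identical, the two functions always agree, which is why the unported branch appears to work. They disagree only when the upper halves differ, for example a = 0xFFFFFFF1 and b = 0x100000010.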

To correctly base the branch on the 64-bit condition codes, it needs to be rewritten to use the extended condition codes, %xcc:

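    bcs     %xcc,top_of_loop

I can't explain why this particular architecture feature is so easily missed, but I can point out that the original sequence works correctly unless the memory buffer pointed to by %o2 crosses a 4GB boundary, where the low 32 bits of the pointer wrap around to zero. (Because bcs is an unsigned test, a 2GB boundary is harmless here; a signed branch such as bl would fail at a 2GB boundary instead.) At least for one application, the failure only occurred at a single customer site and only once every few weeks (and was, therefore, tough to diagnose).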
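
To see how rarely the failure shows up, the loop's control flow can be simulated without touching memory. The sketch below is a model under stated assumptions, not code from any of the projects: loop_iterations is a hypothetical helper that treats the pointers as plain 64-bit integers and counts how many bytes the loop would store with either kind of branch.

```c
#include <stdbool.h>
#include <stdint.h>

/* Count the passes through the copy loop for a destination pointer dst
   and an end pointer end. use_xcc selects the ported branch (64-bit
   compare); otherwise the original 32-bit compare is modeled. */
uint64_t loop_iterations(uint64_t dst, uint64_t end, bool use_xcc)
{
    uint64_t count = 0;
    bool branch_taken;

    do {
        dst += 1;       /* add %o2,1,%o2 */
        count += 1;     /* the stb in the delay slot always executes */
        branch_taken = use_xcc ? (dst < end)
                               : ((uint32_t)dst < (uint32_t)end);
    } while (branch_taken);

    return count;
}
```

In this model, a 32-byte buffer from 0xFFFFFFF0 to 0x100000010 is copied in full (32 passes) with the %xcc branch, but the 32-bit branch falls out of the loop after a single byte. The same 32-byte buffer placed across the 2GB mark (0x7FFFFFF0 to 0x80000010) is copied in full either way.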