Alfred Huang's Weblog

Disclaimer: Whatever I suggest in my blog is what I would do; it is not necessarily the only way to do it.


Monday May 22, 2006

On Studio and Gcc style inline assembly

A question has been raised about the optimization effect of Sun Studio's inline assembly mechanism compared with that of gcc-style extended inline assembly. Let's start with a brief introduction to what they are, and then discuss the optimization effect each may have.

At the simplest level, both compilers support the asm() statement, the insert-as-is, non-optimized form of inline assembly. This form simply inserts the enclosed assembly string(s) as is, with no facility for passing arguments and no optimization.

As an enhancement, Studio supports .il inline template files, which are used somewhat like include files. A .il file may contain multiple inline templates, each of the following form:

    .inline "template_name",0
       "assembly code"
    .end

In a sense, Studio treats each inline template as a function definition, with argument passing and return values following the calling convention in the ABI of the corresponding platform.

For example, let's add 8 numbers together and return the result. The inline template in the foo.il file may be as follows:

    .inline multi_add,0
    movq    %rdi, %rax
    addq    %rsi, %rax
    addq    %rdx, %rax
    addq    %rcx, %rax
    addq    %r8,  %rax
    addq    %r9,  %rax
    addq    (%rsp),%rax
    addq    8(%rsp), %rax
    .end

The foo.c file may look like:

    int multi_add(int,int,int,int,int,int,int,int);
    int foo()
    {
       return multi_add(1,2,3,4,5,6,7,8);
    }

Compile foo.c together with foo.il:

    cc -S -O -xarch=amd64 foo.il foo.c

The resulting foo.s will contain the optimized code:

    foo:  ...
    / ASM INLINE BEGIN:   multi_add
          movq   $36, %rax
    / ASM INLINE END

Based on the calling convention of the AMD64 ABI, the first six integral arguments are passed in %rdi, %rsi, %rdx, %rcx, %r8 and %r9; any extras are passed on the stack, hence (%rsp) and 8(%rsp) in this case. If optimization of the template is not desired, the -Wu,-no_a2lf option can be used:

    cc -S -O -xarch=amd64 -Wu,-no_a2lf foo.il foo.c

The resulting foo.s will then contain the following:

        push       $8           
        push       $7          
        movq       $1,%rdi     
        movq       $2,%rsi     
        movq       $3,%rdx     
        movq       $4,%rcx     
        movq       $5,%r8      
        movq       $6,%r9                  
        / INLINE: multi_add
        movq    %rdi, %rax
        addq    %rsi, %rax
        addq    %rdx, %rax
        addq    %rcx, %rax
        addq    %r8, %rax
        addq    %r9, %rax
        addq    (%rsp), %rax
        addq    0x8(%rsp), %rax
        / INLINE_END

Studio treats an inline template as a function call, so function argument loading takes place before the template body is inserted. If -Wu,-no_a2lf is not used, all these assembly instructions are inserted into Studio's intermediate representation stream and all specified optimizations take place, resulting in a single "movq $36, %rax".

In this manner, the user-specified inline template is assimilated into Studio's optimization and code generation mechanism. Note further that all registers specified in the inline template will be "virtualized" and reallocated by the code generator, so the final appearance of the inline template may be drastically different from the original. The main catch of this approach is that users must understand the calling convention of the underlying platform.
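To see why the optimized output can collapse to a single constant load, it may help to look at a plain C stand-in for the template. This is only a sketch of the idea, not how Studio represents the template internally; the names multi_add_c and foo_c are hypothetical and do not appear in the original example.

```c
/* Plain C stand-in for the multi_add template (hypothetical names).
 * Once the adds are visible to the optimizer, a call with eight
 * constants can typically be folded to one constant. */
static long multi_add_c(long a, long b, long c, long d,
                        long e, long f, long g, long h)
{
    return a + b + c + d + e + f + g + h;
}

long foo_c(void)
{
    /* 1 + 2 + ... + 8 = 36, the constant seen in the optimized foo.s */
    return multi_add_c(1, 2, 3, 4, 5, 6, 7, 8);
}
```

Compiling such a file with optimization enabled and inspecting the assembly should show foo_c returning the constant directly, which mirrors what happens to the template once its instructions are part of the intermediate representation.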

Gcc-style extended asm takes a different approach. Basically it has the form:

    asm("template"
        : output operands
        : input operands
        : clobber list);

This approach gives the user the flexibility to specify the input and output arguments (operands), and lets the user inform the compiler of the resources used within the template so that the compiler can avoid resource conflicts. Other than that, the template is treated as a black box whose content is not looked into. The template body is inserted as is after argument substitution, so the optimization effect may be limited.
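As a minimal sketch of how the operand sections fit together, the function below adds two integers with a single x86 addl. The name add_two is hypothetical. The __asm__ spelling is used because it is also accepted under strict -std modes, and the non-x86 fallback is an assumption added only so the example builds elsewhere.

```c
/* Adds b to a with one x86 addl where available (hypothetical helper). */
int add_two(int a, int b)
{
#if (defined(__GNUC__) || defined(__clang__)) && \
    (defined(__i386__) || defined(__x86_64__))
    __asm__("addl %1, %0"
            : "+r" (a)   /* output: a is both read and written */
            : "r" (b)    /* input: b in any general register */
            : "cc");     /* clobber: addl modifies the flags */
#else
    a += b;              /* portable fallback for other targets */
#endif
    return a;
}
```

The "+r" constraint matters here: with "=r" the compiler would be free to assume the old value of a is never read by the template.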

For example, let's add 3 constants and a global variable together with the following asm():

    int mem;
    int foo()
    {
            int res = 0;    /* must be initialized: the template reads %0 */

            asm ("addl %1, %0\n"
                 "\taddl %2, %0\n"
                 "\taddl %3, %0\n"
                 "\taddl %4, %0\n"
                 : "+r" (res)    /* read-write operand */
                 :  "g" (1),
                    "g" (2),
                    "g" (3),
                    "m" (mem)
                );

            return res;
    }

And the resulting assembly will look like this, with the template body between /APP and /NO_APP:

    .globl foo
            .type   foo, @function
    foo:
    /APP
            addl $1, %edx
            addl $2, %edx
            addl $3, %edx
            addl mem, %edx
    /NO_APP
            movl    %edx, %eax
            ret

Note that the template body is inserted as is into the final assembly after argument substitution; no constant folding or other extensive optimization is performed on it. Some forms of optimization may still be performed, however. For example, the entire template may be moved out of a loop if it turns out to be loop invariant, that is, when the operands and the "clobber list" indicate no change in memory or registers.
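The loop-invariant case can be sketched as follows. The function name sum_doubled is hypothetical, and whether the asm is actually hoisted depends on the compiler and optimization level, so treat this as an illustration rather than a guarantee. The asm has no "memory" clobber and is not marked volatile, so as far as the compiler can tell its result depends only on x.

```c
/* Sums 2*x over n iterations (hypothetical example).
 * The asm depends only on x, which does not change inside the loop,
 * so an optimizing compiler may compute it once outside the loop. */
int sum_doubled(int x, int n)
{
    int total = 0;
    int i;

    for (i = 0; i < n; i++) {
        int r = x;
#if (defined(__GNUC__) || defined(__clang__)) && \
    (defined(__i386__) || defined(__x86_64__))
        __asm__("addl %1, %0"   /* r = r + x, i.e. 2*x */
                : "+r" (r)
                : "r" (x)
                : "cc");
#else
        r += x;                 /* portable fallback for other targets */
#endif
        total += r;
    }
    return total;
}
```

If the template had side effects the compiler cannot see, writing it as asm volatile (or adding a "memory" clobber) is the usual gcc way to tell the compiler not to move or delete it.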

Both styles of inline assembly mechanism have their pros and cons. Which one to use depends on one's needs, and it is possible to convert from one style to the other.

Comments:

The first C function in your example should be called 'foo' and it should call 'multi_add'. You got the names reversed, which makes it a little harder to figure out what's going on. Other than that, this is a great example. I'll have to remember it's here next time I have to fix a bug involving .il files. ;-)

Posted by Chris Quenelle on May 23, 2006 at 08:46 PM PDT #

Oh, yes. Typo! I'll have it fixed. Thanks! Alfred

Posted by Alfred Huang on May 24, 2006 at 07:58 PM PDT #

Hi Alfred,

thanks for this really nice blog. The inline assembly templates of Sun Studio look nice, but I think they have some real problems in conjunction with optimization. First of all, the documentation of 'inline' doesn't say a word about the interaction of optimization with inline assembler templates. It was your blog where I read for the first time about the optimization of inline assembler templates and the magic "-Wu,-no_a2lf" option to prevent it. By the way, the same option for the C++ studio compiler CC is "-Qoption ube -no_a2lf".

Consider the following small C++ program that tries to query the frame pointer of a function:

//---------------- inline.cpp ----------------
#include <stdio.h>
extern "C" void* get_fp_il();

void* get_fp() {
    void *fp = get_fp_il();
    return fp;
}

int main(int argc, char **argv) {
    printf("rbp=%p\n", get_fp());
}
//--------------------------------------------

and the corresponding .il file

//--------------- inline.il --------------------
.inline get_fp_il,0
movq %rbp, %rax
.end
//---------------------------------------------

If compiled with "-xarch=amd64 -xO1", 'get_fp_il()' will always return 0, because the inlined assembler code will be optimized away and the actual content of %rax will be printed in the 'main()' method. If compiled without optimization, it correctly returns the frame pointer. Is this the expected behaviour? If I use the "-no_a2lf" option for ube, the program also works in the optimized version - but in my opinion it should always work correctly, even without the option.

By the way, this example is adapted from the OpenJDK source code (see: http://www.nabble.com/Bug-in-os%3A%3Acurrent_frame%28%29-_get_previous_fp%28%29-or-why-parenthesis-somtimes-really-matter-to15360880.html).

What is your opinion? Any comments?

Regards,
Volker

Posted by Volker H. Simonis on February 12, 2008 at 08:54 AM PST #

Hi Alfred,

Thanks for a great comparison of these two alternative ways for using inline assembly.

One comment: I tried to run the gcc style example routine.
It seems like there is one error in the code. Shouldn't the "int res" variable have been initialized to 0?

On my computers it either returns a constant (different from the expected answer) or a random number.

Regards,
Olav

Posted by Olav Sandstaa on June 29, 2008 at 06:01 AM PDT #
