x86/x64 Compiler Backend Team Weblog

20070601 Friday June 01, 2007

GCC-style asm inlining support in Sun Studio 12 compilers


Sun Studio 12 Asm Statements

 

Introduction

    In order to support developers used to Gcc's Inline Assembly Feature,
Sun Studio 12 has implemented a compatible interface to allow the C and C++
programmer to insert assembly instructions into the code stream generated
by the compiler.  This feature has several advantages over the Inline
Assembly feature supported by prior Sun Studio releases: the routine
containing the inline assembly can still be optimized, the syntax is
compatible with Gcc, and the compiler has more flexibility to choose
registers efficiently.

    In this new scheme, inline assembly is written as an asm statement
in the source language, which has the following form:

    asm("<inst> %0, %1\n" : <outputs> : <inputs> : <clobber list>);

Where <inst> is an assembly-language opcode, <outputs> is a comma-separated
list of outputs; likewise with <inputs>.  Each input or output consists of
a constraint string and an expression from the source language enclosed in
parentheses.  These expressions provide the inputs to pass to the asm statement
or the outputs to store the results of the asm statement to.  The clobber list
is a comma-separated list of strings that name machine registers (other than
inputs or outputs) that one or more of the instructions in the asm statement
are known to write to.  A typical function in C containing an asm statement
might look like this:

#include <stdio.h>
void foo() {
      int result, source = 3;

      asm("movl %1, %0\n" : "=m" (result) : "r" (source));
      printf("result = %d (expected 3)\n", result);
}

    The %0 and %1 in the above example are placeholders for "result" and "source",
respectively.  The compiler evaluates "source" and loads it into a free
register, denoted by %1.  It then generates the movl instruction to move that
register into the memory location corresponding to the variable "result",
denoted by %0.

    There is an alternative notation for placeholders that users may find
more readable.  Rather than using %0, %1, %2, etc.  to denote positional
arguments, the user may refer to arguments symbolically:

#include <stdio.h>
void foo() {
   int result, source = 3;

   asm("movl %[input], %[output]\n" : [output] "=m" (result) : [input] "r" (source));
   printf("result = %d (expected 3)\n", result);
}


    In the above example, the names input and output have no special
meaning; any names could be used, but each must match a corresponding
square-bracketed name in the input or output lists of the asm
statement.

    These are very simple examples.  In actuality, an asm statement
may have more than one instruction and the constraints can get quite
complex.  With flexibility of expression comes some degree of complexity
which we will try to demystify in the sections that follow.

The Instruction String

    The instruction(s) to be executed are contained in one or more
quoted strings which precede the first colon in an asm statement.
The compiler does not parse the contents of these strings except
to scan for placeholders that it needs to replace with the arguments
of the asm statement.  So, the compiler knows nothing of the semantics
of the instructions in an asm statement other than what it is told via
the constraints on the input and output arguments and the contents
of the clobber list.  Within the instruction strings, any percent sign
that does not introduce a placeholder must be doubled.  For example,
in the following asm statement, the %eax register must be written as
"%%eax" but in the clobber list, no percent sign is needed:

    asm("movl  %0, %%eax\n" : : "r" (foo) : "eax");
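The percent-sign rule can be seen more fully in a complete function.  The
following is a minimal sketch, assuming an x86 target and a compiler that
accepts the Gcc-style syntax described here; the function name copy_via_eax
is hypothetical and chosen only for illustration.

```c
/* Copies "value" into "result" by way of %eax (illustrative only). */
static int copy_via_eax(int value)
{
    int result;

    __asm__("movl %1, %%eax\n\t"   /* literal register: percent sign doubled */
            "movl %%eax, %0\n\t"   /* %0 and %1 are placeholders */
            : "=r" (result)
            : "r" (value)
            : "eax");              /* clobber list: no percent sign */
    return result;
}
```

Because "eax" appears in the clobber list, the compiler should not pick
%eax for either placeholder, so the two moves cannot interfere with the
operands.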

Inputs and Outputs

    For an asm statement to affect a program, it most often must
    be able to receive information from expressions in the source
    language and be able to assign to variables (or other lvalues)
    in the program.  This is accomplished by passing outputs and
    inputs into the asm statement in a manner similar to the arguments
    to a function call. 

Expressions

    The source language expressions for inputs may be rvalues or lvalues. 
    Outputs must be lvalues.  Expressions may be of arbitrary complexity
    and are enclosed in parentheses following the constraint string.
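As a small sketch of what "arbitrary complexity" can mean in practice, the
function below passes a computed rvalue as the input and a structure member
reached through a pointer as the output lvalue.  The type struct pair and
the function store_scaled are hypothetical names used only for this
illustration; an x86 target is assumed.

```c
struct pair { int lo; int hi; };

/* Stores v[i] * 2 + 1 into p->hi through a single movl. */
static void store_scaled(struct pair *p, const int *v, int i)
{
    __asm__("movl %1, %0\n\t"
            : "=r" (p->hi)             /* output: any lvalue */
            : "r" (v[i] * 2 + 1));     /* input: an arbitrary rvalue */
}
```

The compiler evaluates the input expression before the asm statement and
stores the output register back to p->hi afterwards.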


Unused inputs and outputs

    If there is no use of an input or output in an asm statement's
    instruction string, then no loads from or stores to that variable
    are generated.  This saves registers for those arguments that
    are used in the asm instruction string.  There is one exception
    to this rule: If an input or output is constrained to a specific
    hardware register (as opposed to a register class), then it must
    be loaded or stored even if it is not referred to in the instruction
    string.  This is because its value may be used implicitly by the
    instructions in the asm statement.
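A minimal sketch of this exception, assuming an x86 target: the cltd
instruction sign-extends %eax into %edx and names neither register
explicitly, so the instruction string contains no placeholders at all.
The operands are still loaded and stored because they are bound to
specific registers.  The function name sign_fill is hypothetical, and the
specific-register constraints "a" and "d" are taken from the tables in the
Constraints section.

```c
/* Returns -1 if x is negative and 0 otherwise. */
static int sign_fill(int x)
{
    int high;

    __asm__("cltd\n\t"        /* implicitly reads %eax, writes %edx */
            : "=d" (high)     /* stored from %edx although %0 is unused */
            : "a" (x));       /* loaded into %eax although %1 is unused */
    return high;
}
```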


Constraints        

Register Constraints

        Integer

        In the descriptions that follow, only one size of register
        is listed in the tables, but in most cases the size of the
        register actually chosen depends on the type of the source
        expression being loaded into or stored from it.  See
        "Matching register types to input and output types" below
        for more details about how the compiler chooses the size
        of register to use.

        Register classes

            The following constraints specify a class of integer
            register that the compiler may choose from when it
            needs a register within an asm statement:

            Constraint    Register class
            g or r        rax, rbx, rcx, rdx, rbp, rsi, rdi, rsp, r8 - r15
            R             eax, ebx, ecx, edx, ebp, esi, edi, esp (legacy registers)
            q             al, bl, cl, dl
            Q             ah, bh, ch, dh
            A             eax or edx (used for returning 64-bit values)

        Specific registers

            The following constraints may be used to lock a source
            variable or expression to a specific hardware register:

            Constraint    Register
                          64-bit    32-bit

            a             rax       eax
            b             rbx       ebx
            c             rcx       ecx
            d             rdx       edx
            di            rdi       edi
            si            rsi       esi


        Floating point

        XMM and MMX registers

            The following constraints are used to specify that the
            source variable or expression should occupy an XMM or
            MMX register:

            Constraint    Register class
            x             xmm0 - xmm15
            y             mm0 - mm7

            Note:  Be sure to specify -xarch=sse2 when using
                   these constraints if compiling in 32-bit mode.
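As an illustration of the "x" constraint, here is a minimal sketch that
adds two doubles with the SSE2 addsd instruction.  It assumes a target
where SSE2 is enabled (see the note above for 32-bit mode); the function
name add_sd is hypothetical.  The "+" modifier used here is described in
the Modifiers section.

```c
/* Returns a + b, computed in XMM registers chosen by the compiler. */
static double add_sd(double a, double b)
{
    __asm__("addsd %1, %0\n\t"
            : "+x" (a)        /* read and written, held in an XMM register */
            : "x" (b));       /* read only, held in an XMM register */
    return a;
}
```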

        x87 Floating point stack

            The following constraints are used to refer to variables
            or expressions loaded on the x87 floating point stack:

            Constraint    Register
            f                        ST(0) - ST(7)
            t                        ST(0) (top of the FP stack)
            u                        ST(1) (register just below the top of the FP stack)


 Memory Constraints

        A memory constraint has the form "<m>"  where <m> is
        one of the following letters:

        Constraint    Description
        m                    Memory operand of any general addressing mode
        o                    Offsettable addressing mode
        V                    Non-offsettable addressing mode
        <                    Autodecrement addressing mode
        >                    Autoincrement addressing mode
       
        These constraints instruct the compiler to generate a
        memory reference wherever this argument's placeholder
        occurs in the instruction string.
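A minimal sketch of the "m" constraint, assuming an x86 target: the
addition below is performed directly on the memory location, without first
loading the old value into a register.  The function name add_to_memory is
hypothetical, and the "+" modifier is described in the Modifiers section.

```c
/* Adds delta to *total in place. */
static void add_to_memory(int *total, int delta)
{
    __asm__("addl %1, %0\n\t"
            : "+m" (*total)   /* memory operand, read and written */
            : "r" (delta)
            : "cc");          /* addl modifies the condition codes */
}
```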

 Immediate Constraints

        An immediate constraint has the form "<i>"  where <i> is
        one of the following letters:
   
        Constraint    Description
        i             Any sized constant
        e             Constant in range -2147483648 - 2147483647
        n             A constant less than a word wide
        I             Constant in range 0 - 31
        J             Constant in range 0 - 63
        K             0xff
        L             0xffff
        M             Constant in range 0 - 3
        N             Constant in range 0 - 255
        Z             Constant in range 0 - 0xffffffff
        E             Floating point operand (native const double)
        F             Floating point operand (const double)
        G             Standard 80387 floating point constant
        s             Constant not known at compile time (symbolic)

        These constraints instruct the compiler to generate an
        immediate operand wherever this argument's placeholder
        occurs in the instruction string.
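A minimal sketch of an immediate constraint in C, assuming an x86 target:
the "I" constraint (a constant in the range 0 - 31) suits a 32-bit shift
count, so the shift below is emitted with an immediate operand rather than
a count in %cl.  The function name times_eight is hypothetical.

```c
/* Multiplies v by 8 using a shift with an immediate count. */
static unsigned int times_eight(unsigned int v)
{
    __asm__("shll %1, %0\n\t"
            : "+r" (v)
            : "I" (3)         /* immediate operand, must be in 0 - 31 */
            : "cc");          /* shll modifies the condition codes */
    return v;
}
```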

 Digit Constraints

        Digit constraints are of the form "<n>" where <n> is a number
        which corresponds to the position of an output.  This constraint
        is only allowed on an input and the digit must refer to an output.
        Its effect is to make the constrained input occupy the same
        location (register or memory) as the indicated output.

        The example below illustrates the use of digit constraints.

        asm ("addl %1,%0 \n\t"
                     :"=r"(foo)
                     :"r"(bar),"0"(foo)
                );

        The simple example above essentially implements foo = foo + bar;
        The "0" in the input constraint indicates that variable foo
        needs to be loaded into the same register which will also
        contain the output result. It is also possible to specify a
        particular register as shown below:

        asm ("addl %1,%0 \n\t"
                     :"=a"(foo)
                     :"b"(bar),"0"(foo)
        );

               
        In this case, the compiler will generate code to load variable foo
        into register %eax (since that input is constrained to output 0 and
        output 0 is constrained to %eax by the "=a" constraint) and bar will
        be loaded into register %ebx and the result foo will be available in
        register %eax.

        Here is another example of using digit constraints to shift a value
        by a given shift count:

        int shift_count = 5;
        int shifted_value = 37;

        asm ("sarl %1, %0\n\t"
             : "=r" (shifted_value)
             : "c" ((char) shift_count), "0" (shifted_value)
            );

        In this example, the variable "shift_count" is loaded into the %cl
        register (note that the cast is required to convert the 32-bit integer
        "shift_count" to an 8-bit value as required by the sarl instruction).
        The variable "shifted_value" is loaded into a register chosen by the
        compiler with the proviso that the compiler will choose the same
        register to hold the result of the sarl instruction as requested by
        the "0" digit constraint.

 Multiple Constraints

        More than one constraint letter may be used in a
        constraint string. When this occurs, the compiler
        looks at the input or output to determine which
        constraint is the best match for the given expression.
        If the constraint string contains an immediate constraint,
        and the input is a constant of the correct type, then
        the input will be treated as an immediate.  Otherwise,
        if the constraint string contains a memory constraint
        and the input or output is an lvalue, then a memory
        reference will be generated. Failing this, if the
        constraint string contains a register constraint then
        the input will be loaded into or the output will be
        written to a register. The example below illustrates
        usage of multiple constraints:

        asm ("mulq %3"
                    : "=a"(low),"=d"(high)
                    : "a"(word),"rm"(foo)   
        );

        The mulq instruction multiplies the contents of a 64-bit
        memory or register by the contents of %rax and the result
        is available in the %rdx, %rax register pair - the high
        64-bits in %rdx and low 64 bits in %rax.

        One of the operands of the multiply, the variable foo in
        the example above, can be available in either memory or in a
        register. The "rm" constraint used in the example allows the
        compiler to choose the most appropriate location.

        The example above also shows an interesting instance of constraints
        usage. Although there is no explicit reference to %0 or %1
        in the asm template, the mulq instruction implicitly returns
        the results in %rax and %rdx, therefore "=a" and "=d" must be
        indicated as output constraints. Similarly, the first input
        operand (word) is expected to be available in the %rax register.
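The same pattern can be wrapped in a complete function.  The sketch below
uses the 32-bit mull instruction instead of mulq so that it does not depend
on 64-bit mode; the product is returned in the %edx, %eax pair in the same
way.  The function name mul_wide and its out-parameter are hypothetical,
while the operand names mirror the example above.

```c
/* Multiplies word by foo; returns the low 32 bits, stores the high 32. */
static unsigned int mul_wide(unsigned int word, unsigned int foo,
                             unsigned int *high)
{
    unsigned int low;

    __asm__("mull %3\n\t"
            : "=a" (low), "=d" (*high)   /* implicit results in %eax, %edx */
            : "a" (word), "rm" (foo)     /* %eax implicit; foo: reg or mem */
            : "cc");
    return low;
}
```

Only %3 appears in the instruction string; operands 0 through 2 are still
loaded and stored because they are tied to specific registers.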
       

     Modifiers

    Certain modifier characters may be included in a constraint string to control
    how the compiler applies that constraint.  They are:

        Modifier    Description
        =                Operand is only written
        +                Operand is read and written
        &                Operand is clobbered early
        %                This operand and the following one are commutative   
        #                Ignore all characters up to the next comma as constraints
        *                Ignore the following character when choosing register preferences


            Note: If = or + are specified in a constraint string, they must be the first    
                      character in the string.

        The following example shows a use of the "+" modifier:

        asm ("sarl %1, %0\n\t"
             : "+r" (shifted_value)
             : "c" ((char) shift_count)
            );

        The variable "shifted_value" in the example above is both an
        input and an output. The compiler would generate code to load
        "shifted_value" into a general purpose register and ensure
        that "shifted_value" is available as an output in that same
        register. The same effect can be achieved using digit constraints
        (see example above) as well.  However, if there is no explicit reference
        to the input parameter in the asm template, it is more concise to use
        the "+" modifier instead.

        The compiler normally makes the assumption that all inputs to
        an asm statement are consumed before any outputs are written
        to in the instructions which constitute the asm's instruction
        string.  If this is not the case for a particular instruction
        sequence, the user must inform the compiler which outputs are
        written early (i.e. before the last input is used).  This
        rule allows the compiler to use registers efficiently by
        choosing the same register for an input and an output under
        normal conditions, but allows the user to override this
        behavior when it would be semantically incorrect to do so.
        The use of the early clobber ("&") modifier provides the means
        to communicate this information to the compiler.  A register
        chosen for an operand marked as early clobber may not be used
        to hold any of the input operands.  The following example illustrates
        the use of early clobber:

        asm (
            "    subq    %2,%2        \n"
            ".align 16            \n"
            "1:    movq    (%4,%2,8),%0    \n"
            "    adcq    (%5,%2,8),%0    \n"
            "    movq    %0,(%3,%2,8)    \n"
            "    leaq    1(%2),%2    \n"
            "    loop    1b        \n"
            "    sbbq    %0,%0        \n"
            : "=&a"(ret),"+c"(n),"=&r"(i)
            : "r"(rp),"r"(ap),"r"(bp)
            : "cc"
            );
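A smaller case where early clobber matters can be sketched as follows,
assuming an x86 target.  The output is written by the first instruction,
before the remaining inputs have been read, so without "&" the compiler
would be free to place "sum" in the same register as "b" or "c".  The
function name add_three is hypothetical.

```c
/* Returns a + b + c. */
static int add_three(int a, int b, int c)
{
    int sum;

    __asm__("movl %1, %0\n\t"   /* sum is written here ...          */
            "addl %2, %0\n\t"   /* ... before b and c are consumed  */
            "addl %3, %0\n\t"
            : "=&r" (sum)       /* early clobber: shares no input register */
            : "r" (a), "r" (b), "r" (c)
            : "cc");
    return sum;
}
```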


    Matching register types to input and output types

        The register chosen by the compiler must match the
        type of the input or output in the source code.  There
        are two ways for the user to affect what type of
        register the compiler will choose for any given input
        or output.  The first is to insert a size letter between
        the "%" and the digit in the placeholder in the instruction
        string such as:
            asm("movl %l1, %l0\n" : "=r" (result) : "r" (source));
        This will choose a 32-bit register for each of the
        registers chosen to hold "result" and "source".  The
        supported types are:

            Type letter    Register size
                b                    8-bits
                h                    16-bits
                l                    32-bits
                q                    64-bits

        The second way to affect the type of the register
        chosen is by changing the type of the source expression
        passed to the asm statement.  By default the type of
        register is chosen based on the type of the input or
        output expression.  Casting this expression will also
        influence the size of register chosen to hold that
        expression in the code generated for the asm statement.
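The second approach (controlling register size through the expression
type) can be sketched as follows, assuming an x86 target.  Both operands
have an 8-bit type, so 8-bit register names are substituted for the
placeholders and match the movb opcode.  The function name low_byte is
hypothetical; the "q" constraint is taken from the register class table
above.

```c
/* Returns the low 8 bits of x. */
static unsigned char low_byte(unsigned int x)
{
    unsigned char out;

    __asm__("movb %1, %0\n\t"
            : "=q" (out)                  /* 8-bit lvalue: 8-bit register */
            : "q" ((unsigned char) x));   /* cast selects an 8-bit register */
    return out;
}
```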

 The Clobber List

        Some instructions implicitly modify a register or the
        user may insert a specific register name in the instruction
        string such as: 

                    asm("movl  %0, %%eax\n" : : "r" (var) : "eax");


        In such cases the modified register should be placed in the
        clobber list (the comma-separated list of strings following the
        third colon) to inform the compiler that this register is written
        to by the asm statement.  This allows the compiler to keep enough
        information about the liveness of registers around an asm
        statement to continue to do normal optimizations.  Without
        this information, the compiler would have to forgo many
        optimizations in any routine that contained asm statements.
        Note that outputs need not be placed in the clobber list.
        The compiler knows that they are written to already.

        The following example shows a use of clobber lists:

            __asm__("movl %0,%%ecx         \n\t"
                    "movl %1,%0               \n\t"
                    "movl %%ecx,%1               \n\t"
                    :"=a"(bar),"=b"(foo)
                    :"0"(bar),"1"(foo)
                    :"ecx"
                );


        The values of variable foo and variable bar are swapped
        in the example above, using %ecx as an intermediate place holder.
        Any value held in the register %ecx earlier will be lost
        after executing the asm template; therefore, "ecx" must be
        mentioned in the clobber list.


Current Limitations and Known Bugs

    No alternative constraints

    Gcc allows an operand's constraint string to have more than one series
    of constraint letters in a comma-separated list from which the best
    matching constraint is chosen based on the cost of loading that operand
    for each legal alternative constraint.  Sun Studio 12 currently implements
    only the simpler multiple constraint syntax described above.

    Assembler is not operand sensitive

    At present, the Sun Studio 12 assembler requires that the type of the
    opcode for any given instruction matches the types of its operands.
    Gcc's assembler, by contrast, can infer the suffix required for an
    opcode from the types of the operands of the instruction.  This is a
    limitation when writing asm statements intended to work interchangeably
    on 32-bit and 64-bit platforms.  Most often such asm statements must
    be split into 32-bit and 64-bit versions surrounded by appropriate
    #ifdefs as in the following example:

        void f () {};

        int main () {
                void (*fptr)() = 0;
       
        #ifdef __amd64
                asm ("movq %[f], %[fptr]"
        #else
                asm ("movl %[f], %[fptr]"
        #endif
                     : [fptr] "=m" (fptr)
                     : [f] "r" (f));
       
                if ( fptr != f ) return 1;
                return 0;
        }


    As another example of operand sensitivity, the following
    program will fail to assemble because of type mismatches
    between the opcode and one of its operands:

        int main() {
            int a, res;
            char b;
   
            /* The input argument (constraint "c") is of
               the wrong type.  The movl instruction
               expects 32-bit integers as its operands. */
            asm("movl %1, %0\n\t" : "=r" (res): "c" (b));

            /* The sete instruction requires an
               8-bit result register, but res is
               a 32-bit integer. */
            asm("sete  %0\n\t" : "=r" (res));

            /* Variable "a" is an int, but the shrl
               instruction requires an 8-bit shift
               count in register %cl. */
            asm ("shrl %1, %0\n\t" : "+r" (res) : "c" (a));
        }
       
        The user will see assembly errors such as the following:
        Assembler:
            "/tmp/srscott/yabeAAAJqaGsx", line 14 : Syntax error
            Near line: "movl %cl, %edx"
            "/tmp/srscott/yabeAAAJqaGsx", line 18 : Syntax error
            Near line: "sete  %eax"
            "/tmp/srscott/yabeAAAJqaGsx", line 23 : Syntax error
            Near line: "shrl %ecx, %eax"


        The following modifications will allow it to compile without
        errors:

        int main() {
            int a, res;
            char b;
   
            /* Cast the second argument to the required type. */
            asm("movl %1, %0\n\t" : "=r" (res): "c" ((int) b));

            /* Use an 8-bit lvalue for the output argument. */
            asm("sete  %0\n\t" : "=r" (b));

            /* Cast the second argument to the required type. */
            asm ("shrl %1, %0\n\t" : "+r" (res) : "c" ((char) a));
        }

    Inefficiency of memory constraints

    Memory constraints lead to an extra level of indirection which requires
    an extra register to hold the address.  This will not impact correctness,
    but is less efficient than the user intended when the address is simple
    enough to fit one of the addressing modes supported for that instruction.

    Immediate constraints do not work in C++

        The following program will compile and execute correctly when compiled
        using the Sun Studio 12 C compiler, but C++ has a bug relating to the
        "i" constraint that prevents successful compilation:

                int main() {
                        int res=0, inp=3;

                        asm("\tmovl %1, %0\n": "=m" (res) : "i" (4));
                        if (inp == 3 && res == 4) return 0;
                        return 1;
                }


        This problem can be worked around by storing the immediate value
        in a variable and using that variable with a "r" constraint:

                int main() {
                        int res=0, inp=3;
                        const int imm = 4;

                        asm("\tmovl %1, %0\n": "=m" (res) : "r" (imm));
                        if (inp == 3 && res == 4) return 0;
                        return 1;
                }

    Support for x87 floating point constraints when optimizing

    When optimizing, support for x87 floating point constraints is incomplete.
    We intend to solidify this area in a future patch to Sun Studio 12.

Conclusion

    This article has attempted to explain the syntax and semantics of Sun Studio 12's new
Asm Statement and provide examples of how to work around known differences from the
Gcc Asm Statement.  This article reflects the current state of Sun Studio 12 with respect
to this feature as of the SS12 patch 1 release.  Some of what is described here may not
work with the Sun Studio 12 FCS release.  We intend to improve our compatibility with
Gcc in future patches of Sun Studio 12.  As we do so, many of the limitations and known bugs
described above will be removed.  We hope that you have found this article useful.  Any
comments are welcome.


Posted by x86be ( Jun 01 2007, 06:47:43 PM PDT )

Thread Local Storage support in Sun Studio compilers for x86/64


Typically, all the threads belonging to a particular process share the
same address space, so a global or static variable resides in the same
memory location for every thread. While such sharing of global
variables has its own advantages, sometimes it is desirable to have
a facility that allows a thread to have its private copy of a global
variable and not share it with other threads. Such a mechanism is called
Thread Local Storage (TLS). Several vendors of compiler and runtime systems
have implemented support for TLS. Sun Microsystems was an early adopter
of TLS, implementing support for it on the SPARC Solaris platform.

Sun Studio compilers introduced support for TLS on the x86/x64 platform
in Sun Studio 10. TLS support for Linux is being introduced in
Sun Studio 12.

A variable is declared as thread local by using the keyword __thread e.g.

__thread int x;

A thread local variable can be statically initialized e.g.

__thread int x = 1;

However, dynamic initialization is not allowed. The & operator of a
thread local variable is evaluated at run time and returns the address
of the variable in the current thread. For more language specific
issues please refer to the C or C++ user's guide.
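The per-thread behavior can be sketched with a small pthreads program
fragment.  This is a minimal illustration, assuming a compiler that
accepts the __thread keyword and a program linked with thread support
(for example, the -mt option used in the benchmark later in this post);
the names counter, worker and tls_demo are hypothetical.

```c
#include <pthread.h>
#include <stddef.h>

static __thread int counter = 1;   /* static initialization is allowed */

/* Runs in a second thread and modifies only that thread's copy. */
static void *worker(void *arg)
{
    int *seen = arg;

    counter += 41;
    *seen = counter;
    return NULL;
}

/* Returns 1 if the worker saw 42 while the caller's copy stayed at 1. */
static int tls_demo(void)
{
    pthread_t tid;
    int seen = 0;

    if (pthread_create(&tid, NULL, worker, &seen) != 0)
        return -1;
    pthread_join(tid, NULL);
    return seen == 42 && counter == 1;
}
```

Each thread starts with its own copy of counter holding the initial value
1, so the update made by the worker is not visible in the calling thread.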

Besides language specific and compiler code generator support, TLS
requires extensive support from the runtime system including the dynamic and
static linkers, libc and threads library.

There are four models of thread local storage allocation:
general dynamic, local dynamic, initial exec and local exec.
The dynamic models are needed for supporting TLS in shared libraries,
and the static models are recommended for use in applications which
are statically linked. Static TLS is significantly faster than
dynamic TLS. The compiler is required to generate different
code sequences for accessing thread local data in static or
dynamic TLS so that the linker can take the appropriate actions
to resolve the address of thread local data. The details of
run time handling of thread local storage are described in the
Solaris linker guide.

The Solaris specifications for TLS are a little different from the GNU
variant for IA-32 and x86-64. The implementation of TLS support in
Sun Studio 12 compilers for Linux on x86/x64 follows the GNU
variant of the TLS specifications.  Details about the GNU variant are
available in "ELF Handling for Thread-Local Storage", which is also an
excellent general description of TLS.

The user is expected to tell the compiler whether to use
static or dynamic TLS with the compile time flag -xthreadvar.
When building a module with dynamic TLS access, it is
recommended to use -xthreadvar=dynamic. Just using -xthreadvar will
also result in dynamic TLS. Static TLS is indicated by
-xthreadvar=no%dynamic.

In the absence of the -xthreadvar flag, the compiler infers the
anticipated use from whether position independent code is being
generated (-KPIC/-Kpic): if so, it assumes dynamic TLS; otherwise, the
code generator chooses the static TLS model by default
(-xthreadvar=no%dynamic).

The choice of the local dynamic or local exec model is made by the
compiler or the linker under appropriate conditions.

The following example shows usage of TLS and compares it with
an alternative approach using pthread_setspecific and pthread_getspecific:

% cat tlsbench.c

/* Copyright 2003 Sun Microsystems, Inc. All Rights Reserved
 */

#include <pthread.h>
#include <thread.h>
#include <stdio.h>
#include <stdlib.h>
#include <sys/time.h>

#define NUM_THREADS 5
#define NO_KEYS 8
#define LOOP_COUNT 1000000

/* compile as cc -o tlsbench -mt -xO3 tlsbench.c */

static void *thread_func_tls(void *arg);
static void *thread_func(void *arg);

/* global data */
static pthread_t tid[NUM_THREADS];  /* array of thread IDs */
static __thread int keys_tls[NO_KEYS];
static pthread_key_t keys[NO_KEYS]; /* list of keys */

static void *thread_func_tls(void *arg)
{
  int i, j, val;
  for(i = 0; i < LOOP_COUNT; i++){
    for(j = 0; j < NO_KEYS; j++)
      /* now get and set the local values */
      keys_tls[j] = j;
  }
  for(i = 0; i < LOOP_COUNT; i++){
    for(j = 0; j < NO_KEYS; j++){
      val = keys_tls[j];
      if ( val != j ){
        (void) fprintf(stderr, "Error getting val: %d\n", val);
        exit(-1);
      }
    }
  }
  return 0;
}
static void *thread_func(void *arg)
{
  int i, j, val;
  for(i = 0; i < LOOP_COUNT; i++){
    for(j = 0; j < NO_KEYS; j++)
      /* now get and set the local values */
      pthread_setspecific(keys[j], (const void *)&j);
  }
  for(i = 0; i < LOOP_COUNT; i++){
    for(j = 0; j < NO_KEYS; j++){
      val = *(int *)(pthread_getspecific(keys[j]));
      if ( val != j ){
        (void) fprintf(stderr, "Error getting val: %d\n", val);
        exit(-1);
      }
    }
  }
  return 0;
}

int main(int argc, char *argv[])
{
  int i;
  hrtime_t start, end;
  for ( i = 0; i < NO_KEYS; i++)
    pthread_key_create(&keys[i], NULL);
  /* first time get/set */
  start = gethrtime();
  for ( i = 0; i < NUM_THREADS; i++)
    pthread_create(&tid[i], NULL, thread_func, NULL);
  for ( i = 0; i < NUM_THREADS; i++)
    pthread_join(tid[i], NULL);
  end = gethrtime();
  (void) printf("Avg time using pthread_[get|set] = %.2f sec\n", (end - start)/1000000000.0 );
  /* now time tls */
  start = gethrtime();
  for ( i = 0; i < NUM_THREADS; i++)
    pthread_create(&tid[i], NULL, thread_func_tls, NULL);

  for ( i = 0; i < NUM_THREADS; i++)
    pthread_join(tid[i], NULL);
  end = gethrtime();
  (void) printf("Avg time using TLS          = %.2f sec\n", (end - start)/1000000000.0 );
  return 0;
}


This example and a few others are available in Sun Studio installations
e.g. in /opt/SUNWspro/examples/general/tls








Posted by x86be ( Jun 01 2007, 06:44:41 PM PDT )

Compiler commentary support in Sun Studio compilers for x86/x64

Compiler commentary is a feature of the Sun Studio compilers and
tools that allows the compiler to indicate to the user what transformations
and optimizations were performed to generate code. Compiler commentary
support in the Sun Studio compilers for x86/x64 was introduced in
Sun Studio 11. For the SPARC platform, compiler commentary
support has been available for several releases.

Compiler commentary messages can be viewed using the graphical
user interface of the Sun Studio Analyzer or by using the
command line utility er_src. Commentary messages will be
interleaved with the source or disassembly. In order to generate compiler
commentary, a program must be compiled with the -g option.

The performance analyzer has a source and a disassembly tab
to view annotated source and disassembly. If compiler commentary
messages were generated, they will be highlighted with a
blue color along with the source and/or disassembly of the
executable. The following example illustrates the use of
er_src to view compiler commentary messages.

% cat commentary.c

#include <stdio.h>
void foo ()
{
  printf ("foo\n");
}
void bar ()
{
  foo();
  printf ("bar\n");
}

int main (){
  bar();
}

% cc -O4 -g commentary.c
% er_src a.out
Source file: ./commentary.c
Object file: ./a.out
Load Object: ./a.out

  
  
  
     1. #include <stdio.h>
     2. void foo ()
     3. {
        <Function: foo>
     4.   int i;
     5.   printf ("foo\n");
     6. }
     7. void bar ()
     8. {
        <Function: bar>
  
   Function foo inlined from source file commentary.c into the code for the following line.  0 loops inlined
     9.   foo();
    10.   printf ("bar\n");
    11. }
    12.
    13. int main (){
        <Function: main>
  
   Function bar inlined from source file commentary.c into the code for the following line.  0 loops inlined
   Function foo inlined from source file commentary.c into inline copy of function bar.  0 loops inlined
    14.   bar();
    15. }


It is also possible to see commentary messages with source and disassembly together:

% er_src -disasm main a.out

Annotated disassembly
---------------------------------------
Source file: ./commentary.c
Object file: ./a.out
Load Object: ./a.out

  
  
  
     1. #include <stdio.h>
     2. void foo ()
     3. {
        <Function: foo>
        [ 3]  80506f4:  pushl   %ebp
        [ 3]  80506f5:  movl    %esp,%ebp
     4.   printf ("foo\n");
        [ 4]  80506f7:  subl    $0x14,%esp
        [ 4]  80506fa:  pushl   $0x80507a0
        [ 4]  80506ff:  call    printf [ 0x80505bc, .-0x143 ]
     5. }
        [ 5]  8050704:  leave  
        [ 5]  8050705:  ret    
     6. void bar ()
     7. {
        <Function: bar>
        [ 7]  8050710:  pushl   %ebp
        [ 7]  8050711:  movl    %esp,%ebp
        [ 4]  8050713:  subl    $0x14,%esp
        [ 4]  8050716:  pushl   $0x80507a0
        [ 4]  805071b:  call    printf [ 0x80505bc, .-0x15f ]
  
   Function foo inlined from source file commentary.c into the code for the following line.  0 loops inlined
     8.   foo();
     9.   printf ("bar\n");
        [ 9]  8050720:  addl    $4,%esp
        [ 9]  8050723:  pushl   $0x8050798
        [ 9]  8050728:  call    printf [ 0x80505bc, .-0x16c ]
    10. }
        [10]  805072d:  leave  
        [10]  805072e:  ret    
    11.
    12. int main (){
        <Function: main>
        [12]  8050738:  pushl   %ebp
        [12]  8050739:  movl    %esp,%ebp
        [ 4]  805073b:  subl    $0x14,%esp
        [ 4]  805073e:  pushl   $0x80507a0
        [ 4]  8050743:  call    printf [ 0x80505bc, .-0x187 ]
        [ 9]  8050748:  addl    $4,%esp
        [ 9]  805074b:  pushl   $0x8050798
        [ 9]  8050750:  call    printf [ 0x80505bc, .-0x194 ]
  
   Function bar inlined from source file commentary.c into the code for the following line.  0 loops inlined
   Function foo inlined from source file commentary.c into inline copy of function bar.  0 loops inlined
    13.   bar();
    14. }
        [14]  8050755:  xorl    %eax,%eax
        [14]  8050757:  leave  
        [14]  8050758:  ret    


Besides the commentary message about inlining in the example above,
a wide variety of messages about compiler transformations are
generated. These messages can be broadly classified into the
following categories:

* Frontend generated messages
* Iropt (the intermediate level optimizer) generated messages
  - these messages are often about loop transformations such as
    unrolling, fusion, fission etc. Iropt also generates a class
    of messages about parallelization and also about inlining.
* Code generator messages
  - the SPARC code generator inserts messages about modulo scheduling
    and related pipelining and loop unrolling issues.

Presently on the x86 platform, the Sun Studio compilers generate
compiler commentary messages only from the frontend and iropt.
Messages about transformations done in the x86 code generator
are a work in progress.

Some sample messages are shown below:

* Function <name> not inlined because it contains too many calls
* Call to function <name> was tail-call optimized
* <loop> not parallelized because it contains
  multiple exit points
* <loop1> fused with <loop2>, new loop <loop3>
* <loop> unrolled <number> times

Shown below is another example, a Fortran 90 program on
which microvectorization was performed:

% cat test.f90
subroutine add1(a,b,n)
integer a(n), b(n)
a(:) = b(:) + 1
end

% f90 -fast -xvector=simd test.f90 -g -c

% er_src test.o

Source file: ./test.f90
Object file: ./test.o
Load Object: ./test.o

  
  
  
     1. subroutine add1(a,b,n)
     2. integer a(n), b(n)
   
    Array statement below generated loop L1
    L1 is micro-vectorized
    L1 multi-versioned for microvectorizing. Specialized version is L2
    L2 cloned for peeling.  Clone is L4

    L1 cloned for microvectorizing-epilog.  Clone is L8

    L4 cloned for microvectorizing-epilog.  Clone is L6
    L2 had iterations peeled off for better unrolling and/or parallelization
     3. a(:) = b(:) + 1
     4. end



Besides messages about which transformations were performed,
commentary messages are also inserted to explain why a certain
optimization was not performed. The motivation is that
the user can get a better understanding of how the compiler
attempted to optimize a given piece of code. If a
desirable optimization was not performed, the user can
make an informed decision about modifying that part
of the code or trying a different compile time option.


Please read the man page for er_src to learn about its various
options. It is possible to view commentary about only
a selected subset of compiler transformations. The Performance
Analyzer man pages and user manuals also have details about
using er_src and viewing annotated source and disassembly.



Posted by x86be ( Jun 01 2007, 06:41:26 PM PDT ) Permalink Comments [0]

20060213 Monday February 13, 2006

News



This web page contains news, general information and technical tips for the x86/x64 common compiler backend in the Sun Studio C, C++ and Fortran compilers and for the x86/x64 assembler.

What's New in Studio 11 from the x86/x64 compiler backend?


Performance Records   (Sun Studio 11)

Sun Studio 11 on x86/x64 has achieved many performance records.

Read about the records in

 » Vijay Tatkar's 1/13/2006 blog

Hardware Capability Support  (Sun Studio 11)

In Solaris 10, hardware capability checking is done through certain markings in executable files (e.g. a.out).  The linker makes these markings according to the markings in the object (.o) files and library files.  The x86/x64 Sun Studio 11 compilers (C, C++ and Fortran) and assembler now by default produce such markings in the .o files, according to the instructions that have been generated in them.


To learn more about Hardware Capability marking and how to turn it off, read the following article:
 » Alfred Huang's blog on Jan 13, 2006 - "Ridding or modifying hardware capabilities info"

Media Intrinsic functions (Sun Studio 11)

There are many low-level SSE/SSE2 instructions designed for efficient media code manipulation.  Prior to Sun Studio 11, users could take advantage of such instructions only through assembler template .il files.  Studio 11 offers a set of 128-bit media integer intrinsic functions that use the XMM registers.

The needed header files can be found in the installation directory SUNWspro/prod/include/cc/
sunmedia_types.h
sunmedia_intrin.h
sys/sunmedia_intrin.h
sys/sunmedia_types.h
 
The optimization backend (when using -xOn, where n = 1 to 5) recognizes these intrinsic functions when the following options are used:

-xarch=amd64 -xbuiltin

Note:
The non-optimization backend (without any -xOn, where n = 1 to 5) needs venus patch 120759-01, available since 12/16/2005, to recognize these integer intrinsic functions.

The intrinsic functions for floating point instructions with 128-bit XMM registers are not available in Sun Studio 11 yet.

SSE3 instructions support (Sun Studio 11)

The x86/x64 assembler in Studio 11 recognizes the new SSE3 instructions.

New -xarch={amd64a, sse2a, ssea, pentium_proa} flags (Sun Studio 11)

The new flags -xarch=amd64a, sse2a, ssea and pentium_proa were added to generate 3DNOW!, 3DNOW! extension and MMX extension instructions.

Note: The AMD64 instruction prefetchw is generated only when one of these flag values is set.


Revamp of Profile feedback scheme (Sun Studio 11)

x86/x64 Sun Studio 11 has improved the profile feedback scheme to help boost performance.  For example, the following sequence of commands can be used:

//first phase of compilation and execution
>cc -xprofile=collect[:name] -O test.c
>./a.out
//second phase of compilation and execution
>cc -xprofile=use[:name] -O test.c
>./a.out

Where name is your profile data file name; the default name is a.out.profile.

Be sure to use the same version of the compiler in both phases.  It is not recommended to run the first phase with an older compiler such as Sun Studio 10, keep the profile data file, and then use Sun Studio 11 for the second-phase compilation with -xprofile=use.

TLS support (Sun Studio 10 and 11)

Thread-local storage (TLS) provides a mechanism that lets each thread have its own copy of a global variable.  Basic support for thread-local storage on the x86 and x64 platforms was introduced in the Sun Studio 10 compilers.  It coincided with linker support for TLS in Solaris 10 for the x64 platform (linker support for x86 was already in Solaris 9).

A variable can be declared thread-local using the __thread keyword, e.g.,

   __thread int i;

Studio 10 compilers supported the general dynamic and initial exec models of TLS.  Studio 11 compilers enhanced it further with support for local dynamic and local exec models.

The compile time option -xthreadvar can be used to enable the use of TLS with position independent code in shared libraries.

-xthreadvar can take either of the two values:
 
  dynamic or no%dynamic.

-xthreadvar=no%dynamic generates code for faster access of a thread variable but such object files can only be used in executables.
-xthreadvar=dynamic is required for accessing thread local variables through dynamic loading.  For more details,  please refer to the C User's Guide.

--written by Deepankar Bairagi

Medium Model support (Sun Studio 11)

With 64-bit addresses and instructions available, different memory models are emerging in the compiler.  Sun Studio 11 supports the medium model.

A new flag and its values were introduced for specifying memory models:

-xmodel={kernel, small, medium}  
Note (provided by Lawrence Crowl):

-xmodel is amd64-only and affects both data and code references,
while
-xcode is sparc-only and affects only data references.

sparc                  data references      code references

-xcode=abs32           absolute 32 bits     independent 32 bits
-xcode=abs44           absolute 44 bits     independent 32 bits
-xcode=abs64           absolute 64 bits     independent 32 bits
-xcode=pic13 (-Kpic)   independent 13 bits  independent 32 bits
-xcode=pic32 (-KPIC)   independent 32 bits  independent 32 bits

amd64                  data references      code references

-xmodel=small          absolute 32 bits     absolute 32 bits
-xmodel=medium         absolute 64 bits     absolute 32 bits
-xmodel=small -KPIC    independent 32 bits  independent 32 bits
-xmodel=medium -KPIC   independent 64 bits  independent 32 bits

 The following articles give more details:

  » Alfred Huang's blog on 01/22/06 - "AMD64 Memory Models",
  » Alfred Huang's blog on 01/31/06 - "Kpic under Small model versus Medium Model",
  » Chris Quennel's article on 12/06/2005 - "Sun Studio Support for the AMD64 Medium Memory Model"


Latest x86/x64 Compiler Backend patches - as of 05/31/2007
(click link for the bug fix list or download patch)

Sun Studio 11: 120759-13
Sun Studio 10: 117846-17
Sun Studio 9  : 115982-04
Sun Studio 8  : 112756-13



--created and written by Mei Chung

Posted by x86be ( Feb 13 2006, 05:44:19 PM PST ) Permalink Comments [0]

