
Friday June 01, 2007
GCC-style asm inlining support in Sun Studio 12 compilers
Sun Studio 12 Asm Statements
Introduction
To support developers used to GCC's inline assembly feature, Sun Studio 12 has implemented a compatible interface that allows the C and C++ programmer to insert assembly instructions into the code stream generated by the compiler. This feature has several advantages over the inline assembly feature supported by prior Sun Studio releases: the routine containing the inline assembly can still be optimized, the syntax is compatible with GCC, and the compiler has more flexibility to choose registers efficiently.
In this new scheme the inline assembly takes the form of an asm statement in the source language that has the following form:
asm("<inst> %0, %1\n" : <outputs> : <inputs> : <clobber list>);
Here <inst> is an assembly-language opcode, and <outputs> and <inputs> are comma-separated lists of outputs and inputs. Each input or output consists of a constraint string and an expression from the source language enclosed in parentheses. These expressions provide the inputs passed to the asm statement and the locations in which its results are stored. The clobber list is a comma-separated list of strings that name machine registers (other than inputs or outputs) that one or more of the instructions in the asm statement are known to write to. A typical function in C containing an asm statement might look like this:
#include <stdio.h>
void foo()
{
    int result, source = 3;
    asm("movl %1, %0\n" : "=m" (result) : "r" (source));
    printf("result = %d (expected 3)\n", result);
}
The %0 and %1 in the above example are placeholders for "result" and "source", respectively. The compiler will evaluate "source" and load it into a free register, denoted by %1. It will then generate the movl instruction to move that register into the memory location corresponding to the variable "result", denoted by %0.
There is an alternative notation for placeholders that users may find more readable. Rather than using %0, %1, %2, etc. to denote positional arguments, the user may refer to arguments symbolically:
#include <stdio.h>
void foo()
{
    int result, source = 3;
    asm("movl %[input], %[output]\n" : [output] "=m" (result) : [input] "r" (source));
    printf("result = %d (expected 3)\n", result);
}
In the above example, the names input and output have no special meaning. They could be any names, but each must match a corresponding square-bracketed name in the input or output lists of the asm statement.
These are very simple examples. In actuality, an asm statement may have more than one instruction and the constraints can get quite complex. With flexibility of expression comes some degree of complexity which we will try to demystify in the sections that follow.
The Instruction String
The instruction(s) to be executed are contained in one or more quoted strings which precede the first colon in an asm statement. The compiler does not parse the contents of these strings except to scan for placeholders that it needs to replace with the arguments of the asm statement. So, the compiler knows nothing of the semantics of the instructions in an asm statement other than what it is told via the constraints on the input and output arguments and the contents of the clobber list. Within the instruction strings, any percent sign that does not introduce a placeholder must be doubled. For example, in the following asm statement, the %eax register must be written as "%%eax" but in the clobber list, no percent sign is needed:
asm("movl %0, %%eax\n" : : "r" (foo) : "eax");
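The percent-doubling rule and the clobber list can be combined in a small runnable sketch. The function name copy_through_eax is hypothetical, the __asm__ spelling is used so the code also builds in strict ISO modes, and the plain C fallback for non-x86 targets is an assumption added for portability rather than part of the Sun Studio feature.

```c
/* Hypothetical helper: copies a value by way of %eax. */
int copy_through_eax(int value)
{
    int result;
#if defined(__i386__) || defined(__i386) || defined(__x86_64__) || defined(__amd64)
    /* %0 and %1 are placeholders; %%eax is a literal register name,
       so its percent sign is doubled inside the instruction string. */
    __asm__("movl %1, %%eax\n\t"
            "movl %%eax, %0\n\t"
            : "=r" (result)
            : "r" (value)
            : "eax");          /* no percent sign in the clobber list */
#else
    result = value;            /* fallback for targets without x86 asm */
#endif
    return result;
}
```

Because "eax" appears in the clobber list, the compiler will not place either operand in that register.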
Inputs and Outputs
For an asm statement to affect a program, it most often must be able to receive information from expressions in the source language and be able to assign to variables (or other lvalues) in the program. This is accomplished by passing outputs and inputs into the asm statement in a manner similar to the arguments to a function call.
Expressions
The source language expressions for inputs may be rvalues or lvalues. Outputs must be lvalues. Expressions may be of arbitrary complexity and are enclosed in parentheses following the constraint string.
Unused inputs and outputs
If there is no use of an input or output in an asm statement's instruction string, then no loads from or stores to that variable are generated. This saves registers for those arguments that are used in the asm instruction string. There is one exception to this rule: if an input or output is constrained to a specific hardware register (as opposed to a register class), then it must be loaded or stored even if it is not referred to in the instruction string. This is because its value may be used implicitly by the instructions in the asm statement.
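A minimal sketch of the exception for specific hardware registers: the cltd instruction names no operands, yet it reads %eax and writes %edx, so both operands must be pinned with the "a" and "d" constraints even though no placeholder appears in the instruction string. The function name sign_word and the portable fallback are assumptions for illustration.

```c
/* Hypothetical helper: returns -1 for negative x, 0 otherwise. */
int sign_word(int x)
{
    int high;
#if defined(__i386__) || defined(__i386) || defined(__x86_64__) || defined(__amd64)
    /* cltd sign-extends %eax into %edx; the operands are implicit,
       so neither %0 nor %1 is mentioned in the string. */
    __asm__("cltd" : "=d" (high) : "a" (x));
#else
    high = (x < 0) ? -1 : 0;   /* fallback for targets without x86 asm */
#endif
    return high;
}
```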
Constraints
Register Constraints
Integer
In the descriptions that follow, only one size of register is listed in the tables, but in most cases the size of the register actually chosen depends on the type of the source expression being loaded into or stored from it. See "Matching register types to input and output types" below for more details about how the compiler chooses the size of register to use.
Register classes
The following constraints specify a class of integer register that the compiler may choose from when it needs a register within an asm statement:
Constraint   Register class
g or r       rax, rbx, rcx, rdx, rbp, rsi, rdi, rsp, r8 - r15
R            eax, ebx, ecx, edx, ebp, esi, edi, esp (legacy registers)
q            al, bl, cl, dl
Q            ah, bh, ch, dh
A            eax or edx (used for returning 64-bit values)
Specific registers
The following constraints may be used to lock a source variable or expression to a specific hardware register:
Constraint   64-bit register   32-bit register
a            rax               eax
b            rbx               ebx
c            rcx               ecx
d            rdx               edx
di           rdi               edi
si           rsi               esi
Floating point XMM and MMX registers
The following constraints are used to specify that the source variable or expression should occupy an XMM or MMX register:
Constraint   Register class
x            xmm0 - xmm15
y            mm0 - mm7
Note: Be sure to specify -xarch=sse2 when using these constraints if compiling in 32-bit mode.
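As an illustration of the "x" constraint, the sketch below adds two floats in XMM registers with addss. The function name add_in_xmm is hypothetical. The __SSE__ test is a macro defined by GCC-compatible compilers and is an assumption here; with another compiler the guard may need adjusting, otherwise the plain C fallback is used.

```c
/* Hypothetical helper: returns a + b, computed in XMM registers. */
float add_in_xmm(float a, float b)
{
#if (defined(__i386__) || defined(__x86_64__)) && defined(__SSE__)
    /* "+x": a lives in an XMM register and is both read and written.
       "x":  b is loaded into another XMM register. */
    __asm__("addss %1, %0" : "+x" (a) : "x" (b));
#else
    a = a + b;                 /* fallback when SSE asm is unavailable */
#endif
    return a;
}
```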
x87 Floating point stack
The following constraints are used to refer to variables or expressions loaded on the x87 floating point stack:
Constraint   Register
f            ST(0) - ST(7)
t            ST(0) (top of the FP stack)
u            ST(1) (register just below the top of the FP stack)
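A small sketch of the "t" constraint: the x87 fabs instruction operates implicitly on ST(0), so the input is tied to the output with a digit constraint and the output is pinned to the top of the FP stack. The function name abs_x87 is hypothetical and the fallback branch is an assumption for non-x86 targets. This is written against GCC-compatible behavior; x87 support in Sun Studio 12 may be incomplete when optimizing.

```c
/* Hypothetical helper: absolute value computed on the x87 stack. */
double abs_x87(double x)
{
    double result;
#if defined(__i386__) || defined(__i386) || defined(__x86_64__) || defined(__amd64)
    /* "=t": result is produced in ST(0).
       "0":  x is loaded into the same location as output 0. */
    __asm__("fabs" : "=t" (result) : "0" (x));
#else
    result = (x < 0) ? -x : x; /* fallback for targets without x87 */
#endif
    return result;
}
```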
Memory Constraints
A memory constraint has the form "<m>" where <m> is one of the following letters:
Constraint   Description
m            Memory operand of any general addressing mode
o            Offsettable addressing mode
V            Non-offsettable addressing mode
<            Autodecrement addressing mode
>            Autoincrement addressing mode
These constraints instruct the compiler to generate a memory reference wherever this argument's placeholder occurs in the instruction string.
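A minimal sketch of a memory operand: incl is applied directly to the variable's memory location, so the placeholder is replaced by a memory reference rather than a register. The function name incremented is hypothetical, and the fallback branch is an assumption for non-x86 targets.

```c
/* Hypothetical helper: returns start + 1, incrementing in memory. */
int incremented(int start)
{
    int counter = start;
#if defined(__i386__) || defined(__i386) || defined(__x86_64__) || defined(__amd64)
    /* "+m": counter is read and written in place in memory.
       "cc": incl modifies the condition codes. */
    __asm__("incl %0" : "+m" (counter) : : "cc");
#else
    counter = counter + 1;     /* fallback for targets without x86 asm */
#endif
    return counter;
}
```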
Immediate Constraints
An immediate constraint has the form "<i>" where <i> is one of the following letters:
Constraint   Description
i            Any sized constant
e            Constant in range -2147483648 - 2147483647
n            A constant less than a word wide
I            Constant in range 0 - 31
J            Constant in range 0 - 63
K            0xff
L            0xffff
M            Constant in range 0 - 3
N            Constant in range 0 - 255
Z            Constant in range 0 - 0xffffffff
E            Floating point operand (native const double)
F            Floating point operand (const double)
G            Standard 80387 floating point constant
s            Constant not known at compile time (symbolic)
These constraints instruct the compiler to generate an immediate operand wherever this argument's placeholder occurs in the instruction string.
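As a sketch of an immediate operand, the "I" constraint below passes a shift count in the range 0 - 31 directly into the instruction. The function name times_eight is hypothetical and the fallback is an assumption for non-x86 targets. This is a C example; the C++ limitation on immediate constraints described later in this article may apply to similar code.

```c
/* Hypothetical helper: multiplies by 8 using an immediate shift count. */
unsigned int times_eight(unsigned int value)
{
#if defined(__i386__) || defined(__i386) || defined(__x86_64__) || defined(__amd64)
    /* "I" (3): the constant 3 is emitted as an immediate operand
       in place of %1; "cc" records that the flags are modified. */
    __asm__("shll %1, %0" : "+r" (value) : "I" (3) : "cc");
#else
    value <<= 3;               /* fallback for targets without x86 asm */
#endif
    return value;
}
```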
Digit Constraints
Digit constraints are of the form "<n>" where <n> is a number which corresponds to the position of an output. This constraint is only allowed on an input and the digit must refer to an output. The effect is to bind the constrained input to the same location that the indicated output uses.
The example below illustrates the use of digit constraints.
asm ("addl %1,%0 \n\t"
:"=r"(foo)
:"r"(bar),"0"(foo) );
The simple example above essentially implements foo = foo + bar; The "0" in the input constraint indicates that variable foo needs to be loaded into the same register which will also contain the output result. It is also possible to specify a particular register as shown below:
asm ("addl %1,%0 \n\t"
:"=a"(foo)
:"b"(bar),"0"(foo) );
In this case, the compiler will generate code to load variable foo into register %eax (since that input is constrained to output 0 and output 0 is constrained to %eax by the "=a" constraint) and bar will be loaded into register %ebx and the result foo will be available in register %eax.
Here is another example of using digit constraints to shift a value by a given shift count:
int shift_count = 5; int shifted_value = 37;
asm ("sarl %1, %0\n\t" : "=r" (shifted_value) : "c" ((char) shift_count), "0" (shifted_value) );
In this example, the variable "shift_count" is loaded into the %cl register (note that the cast is required to convert the 32-bit integer "shift_count" to the 8-bit value required by the sarl instruction). The variable "shifted_value" is loaded into a register chosen by the compiler, with the proviso that the compiler will choose the same register to hold the result of the sarl instruction, as requested by the "0" digit constraint.
Multiple Constraints
More than one constraint letter may be used in a constraint string. When this occurs, the compiler looks at the input or output to determine which constraint is the best match for the given expression. If the constraint string contains an immediate constraint, and the input is a constant of the correct type, then the input will be treated as an immediate. Otherwise, if the constraint string contains a memory constraint and the input or output is an lvalue, then a memory reference will be generated. Failing this, if the constraint string contains a register constraint then the input will be loaded into or the output will be written to a register. The example below illustrates usage of multiple constraints:
asm ("mulq %3"
: "=a"(low),"=d"(high)
: "a"(word),"rm"(foo) );
The mulq instruction multiplies the contents of a 64-bit memory or register by the contents of %rax and the result is available in the %rdx, %rax register pair - the high 64-bits in %rdx and low 64 bits in %rax.
One of the operands of the multiply, the variable foo in the example above, can be available in either memory or in a register. The "rm" constraint used in the example allows the compiler to choose the most appropriate location.
The example above also shows an interesting instance of constraints usage. Although there is no explicit reference to %0 or %1 in the asm template, the mulq instruction implicitly returns the results in %rax and %rdx, therefore "=a" and "=d" must be indicated as output constraints. Similarly, the first input operand (word) is expected to be available in the %rax register.
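The same pattern can be sketched with the 32-bit mull instruction, which assembles in both 32-bit and 64-bit mode. The function name mul32_wide is hypothetical, and the fallback branch is an assumption for non-x86 targets. As with mulq, the outputs in %eax and %edx are implicit, and the second factor may be in a register or in memory.

```c
#include <stdint.h>

/* Hypothetical helper: full 64-bit product of two 32-bit values. */
uint64_t mul32_wide(uint32_t a, uint32_t b)
{
#if defined(__i386__) || defined(__i386) || defined(__x86_64__) || defined(__amd64)
    uint32_t low, high;
    /* mull multiplies %eax by the operand and leaves the product
       in %edx:%eax. "rm" lets the compiler keep b in a register
       or reference it directly in memory. */
    __asm__("mull %3"
            : "=a" (low), "=d" (high)
            : "a" (a), "rm" (b)
            : "cc");
    return ((uint64_t)high << 32) | low;
#else
    return (uint64_t)a * b;    /* fallback for targets without x86 asm */
#endif
}
```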
Modifiers
Certain modifier characters may be included in a constraint string to control how the compiler applies that constraint. They are:
Modifier   Description
=          Operand is only written
+          Operand is read and written
&          Operand is clobbered early
%          This operand and the following one are commutative
#          Ignore all characters up to the next comma as constraints
*          Ignore the following character when choosing register preferences
Note: If = or + is specified in a constraint string, it must be the first character in the string.
The following example shows a use of the "+" modifier:
asm ("sarl %1, %0\n\t" : "+r" (shifted_value) : "c" ((char) shift_count) );
The variable "shifted_value" in the example above is both an input and an output. The compiler would generate code to load "shifted_value" into a general purpose register and ensure that "shifted_value" is available as an output in that same register. The same effect can be achieved using digit constraints (see example above) as well. However, if there is no explicit reference to the input parameter in the asm template, it is more concise to use "+" modifier instead.
The compiler normally makes the assumption that all inputs to an asm statement are consumed before any outputs are written to in the instructions which constitute the asm's instruction string. If this is not the case for a particular instruction sequence, the user must inform the compiler which outputs are written early (i.e. before the last input is used). This rule allows the compiler to use registers efficiently by choosing the same register for an input and an output under normal conditions, but allows the user to override this behavior when it would be semantically incorrect to do so. The use of the early clobber ("&") modifier provides the means to communicate this information to the compiler. A register chosen for an operand marked as early clobber may not be used to hold any of the input operands. The following example illustrates the use of early clobber:
asm (
"       subq  %2,%2          \n"
"       .align 16            \n"
"1:     movq  (%4,%2,8),%0   \n"
"       adcq  (%5,%2,8),%0   \n"
"       movq  %0,(%3,%2,8)   \n"
"       leaq  1(%2),%2       \n"
"       loop  1b             \n"
"       sbbq  %0,%0          \n"
: "=&a"(ret),"+c"(n),"=&r"(i)
: "r"(rp),"r"(ap),"r"(bp)
: "cc" );
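A smaller sketch of the same idea: the output below is written by the first instruction while the second input has not yet been read, so it is marked early clobber. Without the "&", the compiler would be free to assign the output and the input b to the same register, and the first movl would overwrite b. The function name add_early is hypothetical and the fallback is an assumption for non-x86 targets.

```c
/* Hypothetical helper: returns a + b using a two-instruction sequence. */
int add_early(int a, int b)
{
    int out;
#if defined(__i386__) || defined(__i386) || defined(__x86_64__) || defined(__amd64)
    __asm__("movl %1, %0\n\t"   /* out is written here...            */
            "addl %2, %0\n\t"   /* ...before b (%2) is consumed here */
            : "=&r" (out)       /* "&": must not share an input's register */
            : "r" (a), "r" (b)
            : "cc");
#else
    out = a + b;               /* fallback for targets without x86 asm */
#endif
    return out;
}
```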
Matching register types to input and output types
The register chosen by the compiler must match the type of the input or output in the source code. There are two ways for the user to affect what type of register the compiler will choose for any given input or output. The first is to insert a size letter between the "%" and the digit in the placeholder in the instruction string, such as:
asm("movl %l1, %l0\n" : "=r" (result) : "r" (source));
This will choose a 32-bit register for each of the registers chosen to hold "result" and "source". The supported size letters are:
Type letter   Register size
b             8 bits
h             16 bits
l             32 bits
q             64 bits
The second way to affect the type of the register chosen is by changing the type of the source expression passed to the asm statement. By default the type of register is chosen based on the type of the input or output expression. Casting this expression will also influence the size of register chosen to hold that expression in the code generated for the asm statement.
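The casting approach can be sketched as follows: the cast to unsigned char makes the compiler choose an 8-bit register for the input, which is what movzbl expects as its source, while the 32-bit output gets a 32-bit register. The function name low_byte is hypothetical and the fallback is an assumption for non-x86 targets.

```c
/* Hypothetical helper: returns the low 8 bits of x, zero-extended. */
unsigned int low_byte(unsigned int x)
{
    unsigned int wide;
#if defined(__i386__) || defined(__i386) || defined(__x86_64__) || defined(__amd64)
    /* The cast selects a byte register for %1; "q" restricts it
       to a register that has an 8-bit form. */
    __asm__("movzbl %1, %0" : "=r" (wide) : "q" ((unsigned char) x));
#else
    wide = x & 0xffu;          /* fallback for targets without x86 asm */
#endif
    return wide;
}
```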
The Clobber List
Some instructions implicitly modify a register, or the user may insert a specific register name in the instruction string, such as:
asm("movl %0, %%eax\n" : : "r" (var) : "eax");
In such cases the modified register should be placed in the clobber list (the comma-separated list of strings following the third colon) to inform the compiler that this register is written to by the asm statement. This allows the compiler to keep enough information about the liveness of registers around an asm statement to continue to do normal optimizations. Without this information, the compiler would have to forgo many optimizations in any routine that contained asm statements. Note that outputs need not be placed in the clobber list. The compiler knows that they are written to already.
The following example shows a use of clobber lists:
__asm__("movl %0,%%ecx \n\t"
        "movl %1,%0    \n\t"
        "movl %%ecx,%1 \n\t"
        :"=a"(bar),"=b"(foo)
        :"0"(bar),"1"(foo)
        :"ecx" );
The values of variable foo and variable bar are swapped in the example above, using %ecx as an intermediate place holder. Any value held in the register %ecx earlier will be lost after executing the asm template; therefore, "ecx" must be mentioned in the clobber list.
Current Limitations and Known Bugs
No alternative constraints
GCC allows an operand's constraint string to have more than one series of constraint letters in a comma-separated list from which the best matching constraint is chosen based on the cost of loading that operand for each legal alternative constraint. Sun Studio 12 currently implements only the simpler multiple constraint syntax described above.
Assembler is not operand sensitive
At present, the Sun Studio 12 assembler requires that the type of the opcode for any given instruction matches the types of its operands. GCC's assembler, by contrast, can infer the suffix required for an opcode from the types of the operands of the instruction. This is a limitation when writing asm statements intended to work interchangeably on 32-bit and 64-bit platforms. Most often such asm statements must be split into 32-bit and 64-bit versions surrounded by appropriate #ifdefs, as in the following example:
void f () {};

int main ()
{
    void (*fptr)() = 0;
#ifdef __amd64
    asm ("movq %[f], %[fptr]"
#else
    asm ("movl %[f], %[fptr]"
#endif
         : [fptr] "=m" (fptr) : [f] "r" (f));
    if ( fptr != f ) return 1;
    return 0;
}
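One way to keep such splits in a single place is to select the opcode once with a macro and rely on string-literal concatenation in the instruction string. This is only a sketch: the macro MOV_PTR and the functions copy_pointer and pointer_roundtrips are hypothetical names, the extra macro tests beyond __amd64 are assumptions about other compilers, and the fallback branch is added for non-x86 targets.

```c
/* Hypothetical macro: pointer-sized move opcode for the target. */
#if defined(__amd64) || defined(__x86_64__)
#define MOV_PTR "movq"
#elif defined(__i386) || defined(__i386__)
#define MOV_PTR "movl"
#endif

/* Copies a pointer through a register-to-register move. */
void *copy_pointer(void *p)
{
    void *out;
#ifdef MOV_PTR
    /* Adjacent string literals are concatenated by the compiler,
       giving "movq %1, %0" or "movl %1, %0". */
    __asm__(MOV_PTR " %1, %0" : "=r" (out) : "r" (p));
#else
    out = p;                   /* fallback for targets without x86 asm */
#endif
    return out;
}

/* Small self-check: returns 1 if the pointer survives the copy. */
int pointer_roundtrips(void)
{
    int probe = 0;
    return copy_pointer(&probe) == (void *) &probe;
}
```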
As another example of operand sensitivity, the following program will fail to assemble because of type mismatches between the opcode and one of its operands:
int main()
{
    int a, res;
    char b;

    /* The input operand with the "c" constraint is of the wrong type.
       The movl instruction expects 32-bit integers as its operands. */
    asm("movl %1, %0\n\t" : "=r" (res): "c" (b));

    /* The sete instruction requires an 8-bit result register,
       but res is a 32-bit integer. */
    asm("sete %0\n\t" : "=r" (res));

    /* Variable "a" is an int, but the shrl instruction requires
       an 8-bit shift count in register %cl. */
    asm ("shrl %1, %0\n\t" : "+r" (res) : "c" (a));
}

The user will see assembly errors such as the following:

Assembler:
"/tmp/srscott/yabeAAAJqaGsx", line 14 : Syntax error
Near line: "movl %cl, %edx"
"/tmp/srscott/yabeAAAJqaGsx", line 18 : Syntax error
Near line: "sete %eax"
"/tmp/srscott/yabeAAAJqaGsx", line 23 : Syntax error
Near line: "shrl %ecx, %eax"
The following modifications will allow it to compile without errors:
int main()
{
    int a, res;
    char b;

    /* Cast the second argument to the required type. */
    asm("movl %1, %0\n\t" : "=r" (res): "c" ((int) b));

    /* Use an 8-bit lvalue for the output argument. */
    asm("sete %0\n\t" : "=r" (b));

    /* Cast the second argument to the required type. */
    asm ("shrl %1, %0\n\t" : "+r" (res) : "c" ((char) a));
}
Inefficiency of memory constraints
Memory constraints lead to an extra level of indirection which requires an extra register to hold the address. This will not impact correctness, but is less efficient than the user intended when the address is simple enough to fit one of the addressing modes supported for that instruction.
Immediate constraints do not work in C++
The following program will compile and execute correctly when compiled using the Sun Studio 12 C compiler, but C++ has a bug relating to the "i" constraint that prevents successful compilation:
int main() {
int res=0, inp=3;
asm("\tmovl %1, %0\n": "=m" (res) : "i" (4));
if (inp == 3 && res == 4) return 0;
return 1; }
This problem can be worked around by storing the immediate value in a variable and using that variable with a "r" constraint:
int main() {
int res=0, inp=3;
const int imm = 4;
asm("\tmovl %1, %0\n": "=m" (res) : "r" (imm));
if (inp == 3 && res == 4) return 0;
return 1; }
Support for x87 floating point constraints when optimizing
When optimizing, support for x87 floating point constraints is incomplete. We intend to solidify this area in a future patch to Sun Studio 12.
Conclusion
This article has attempted to explain the syntax and semantics of Sun Studio 12's new Asm Statement and to provide examples of how to work around known differences from the GCC Asm Statement. It reflects the state of Sun Studio 12 with respect to this feature as of the SS12 patch 1 release. Some of what is described here may not work with the Sun Studio 12 FCS release. We intend to improve our compatibility with GCC in future patches of Sun Studio 12. As we do so, many of the limitations and known bugs described above will be removed. We hope that you have found this article useful. Any comments are welcome.
Posted by x86be
( Jun 01 2007, 06:47:43 PM PDT )
Thread Local Storage support in Sun Studio compilers for x86/64
Typically, all the threads of a multithreaded process share the same address space, so a global or static variable resides in the same memory location for every thread. While such sharing of global variables has its own advantages, sometimes it is desirable to have a facility that allows a thread to have its private copy of a global variable and not share it with other threads. Such a mechanism is called Thread Local Storage (TLS). Several vendors of compiler and runtime systems have implemented support for TLS. Sun Microsystems was an early adopter of TLS, implementing support for it on the SPARC/Solaris platform.
Sun Studio compilers introduced support for TLS on the x86/x64 platform in Sun Studio 10. TLS support for Linux is being introduced in Sun Studio 12.
A variable is declared as thread local by using the keyword __thread e.g.
__thread int x;
A thread local variable can be statically initialized e.g.
__thread int x = 1;
However, dynamic initialization is not allowed. The & operator of a thread local variable is evaluated at run time and returns the address of the variable in the current thread. For more language specific issues please refer to the C or C++ user's guide.
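The per-thread behavior can be sketched with a short pthreads program fragment. All names here (counter, worker_result, run_worker, calling_thread_value) are hypothetical, and the code must be built with thread support enabled (for example with the -mt option used for the benchmark later in this article). The worker thread starts from its own statically initialized copy of the variable and modifies only that copy.

```c
#include <pthread.h>
#include <stddef.h>

static __thread int counter = 1;   /* one statically initialized copy per thread */
static int worker_result;          /* ordinary global, shared by all threads */

/* Runs in the new thread: updates that thread's private copy. */
static void *worker(void *arg)
{
    (void) arg;
    counter += 41;                 /* touches only the worker's copy */
    worker_result = counter;
    return NULL;
}

/* Starts one worker and returns the value it saw, or -1 on failure. */
int run_worker(void)
{
    pthread_t tid;
    if (pthread_create(&tid, NULL, worker, NULL) != 0)
        return -1;
    pthread_join(tid, NULL);
    return worker_result;
}

/* Returns the calling thread's own copy of the variable. */
int calling_thread_value(void)
{
    return counter;
}
```

After run_worker() returns, the calling thread's copy still holds its initial value, because the worker's increment was applied to a different instance of counter.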
Besides language specific and compiler code generator support, TLS requires extensive support from the runtime system including the dynamic and static linkers, libc and threads library.
There are four models of thread local storage allocation: general dynamic, local dynamic, initial exec and local exec. Dynamic models are needed for supporting TLS in shared libraries, and the static models are recommended for use in applications which are statically linked. Static TLS is significantly faster than dynamic TLS. The compiler is required to generate different code sequences for accessing thread local data in static or dynamic TLS so that the linker can take the appropriate actions to resolve the address of thread local data. The details of run time handling of thread local storage are described in the Solaris linker guide.
The Solaris specifications for TLS differ slightly from the GNU variant for IA-32 and x86-64. The implementation of TLS support in the Sun Studio 12 compilers for Linux on x86/x64 follows the GNU variant of the TLS specifications. Details about the GNU variant are available in ELF Handling for Thread-Local Storage, which is also an excellent general description of TLS.
The user is expected to tell the compiler whether to use static or dynamic TLS with the compile time flag -xthreadvar. When building a module with dynamic TLS access, it is recommended to use -xthreadvar=dynamic. Just using -xthreadvar will also result in dynamic TLS. Static TLS is indicated by -xthreadvar=no%dynamic.
In the absence of the -xthreadvar flag, the compiler infers the anticipated use from whether position independent code is being generated (-KPIC/-Kpic): if so, it assumes dynamic TLS; otherwise, the code generator chooses the static TLS model by default (-xthreadvar=no%dynamic).
The local dynamic and local exec models are selected by the compiler or the linker under appropriate conditions.
The following example shows usage of TLS and compares it with an alternative approach using pthread_setspecific and pthread_getspecific:
% cat tlsbench.c
/* Copyright 2003 Sun Microsystems, Inc. All Rights Reserved */
#include <pthread.h>
#include <thread.h>
#include <stdio.h>
#include <stdlib.h>
#include <sys/time.h>

#define NUM_THREADS 5
#define NO_KEYS 8
#define LOOP_COUNT 1000000

/* compile as cc -o tlsbench -mt -xO3 tlsbench.c */

static void *thread_func_tls(void *arg);
static void *thread_func(void *arg);

/* global data */
static pthread_t tid[NUM_THREADS];      /* array of thread IDs */
static __thread int keys_tls[NO_KEYS];
static pthread_key_t keys[NO_KEYS];     /* list of keys */

static void *thread_func_tls(void *arg)
{
    int i, j, val;

    for(i = 0; i < LOOP_COUNT; i++){
        for(j = 0; j < NO_KEYS; j++)
            /* now get and set the local values */
            keys_tls[j] = j;
    }
    for(i = 0; i < LOOP_COUNT; i++){
        for(j = 0; j < NO_KEYS; j++){
            val = keys_tls[j];
            if ( val != j ){
                (void) fprintf(stderr, "Error getting val: %d\n", val);
                exit(-1);
            }
        }
    }
    return 0;
}

static void *thread_func(void *arg)
{
    int i, j, val;

    for(i = 0; i < LOOP_COUNT; i++){
        for(j = 0; j < NO_KEYS; j++)
            /* now get and set the local values */
            pthread_setspecific(keys[j], (const void *)&j);
    }
    for(i = 0; i < LOOP_COUNT; i++){
        for(j = 0; j < NO_KEYS; j++){
            val = *(int *)(pthread_getspecific(keys[j]));
            if ( val != j ){
                (void) fprintf(stderr, "Error getting val: %d\n", val);
                exit(-1);
            }
        }
    }
    return 0;
}

int main(int argc, char *argv[])
{
    int i;
    hrtime_t start, end;

    for ( i = 0; i < NO_KEYS; i++)
        pthread_key_create(&keys[i], NULL);

    /* first time get/set */
    start = gethrtime();
    for ( i = 0; i < NUM_THREADS; i++)
        pthread_create(&tid[i], NULL, thread_func, NULL);
    for ( i = 0; i < NUM_THREADS; i++)
        pthread_join(tid[i], NULL);
    end = gethrtime();
    (void) printf("Avg time using pthread_[get|set] = %.2f sec\n",
                  (end - start)/1000000000.0 );

    /* now time tls */
    start = gethrtime();
    for ( i = 0; i < NUM_THREADS; i++)
        pthread_create(&tid[i], NULL, thread_func_tls, NULL);
    for ( i = 0; i < NUM_THREADS; i++)
        pthread_join(tid[i], NULL);
    end = gethrtime();
    (void) printf("Avg time using TLS = %.2f sec\n",
                  (end - start)/1000000000.0 );
    return 0;
}
This example and a few others are available in Sun Studio installations e.g. in /opt/SUNWspro/examples/general/tls
Posted by x86be
( Jun 01 2007, 06:44:41 PM PDT )
Compiler commentary support in Sun Studio compilers for x86/x64
Compiler commentary is a feature of the Sun Studio compilers and tools that allows the compiler to indicate to the user what transformations and optimizations were performed to generate code. Compiler commentary support for Sun Studio compiler for x86/x64 was introduced in Sun Studio 11. For the sparc platform, compiler commentary support in Sun Studio compiler has been available for several releases.
Compiler commentary messages can be viewed using the graphical user interface of the Sun Studio Analyzer or by using the command line utility er_src. Commentary messages will be interleaved with the source or disassembly. In order to generate compiler commentary, a program must be compiled with the -g option.
The Performance Analyzer has source and disassembly tabs for viewing annotated source and disassembly. If compiler commentary messages were generated, they are highlighted in blue alongside the source and/or disassembly of the executable. The following examples illustrate the use of er_src to view compiler commentary messages.
% cat example.c
#include <stdio.h>
void foo ()
{
printf ("foo\n");
}
void bar ()
{
foo();
printf ("bar\n");
}

int main (){
bar();
}

% cc -O4 -g example.c
% er_src a.out
Source file: ./commentary.c
Object file: ./a.out
Load Object: ./a.out

     1. #include <stdio.h>
     2. void foo ()
     3. {
        <Function: foo>
     4. int i;
     5. printf ("foo\n");
     6. }
     7. void bar ()
     8. {
        <Function: bar>

        Function foo inlined from source file commentary.c into the code for the following line. 0 loops inlined
     9. foo();
    10. printf ("bar\n");
    11. }
    12.
    13. int main (){
        <Function: main>

        Function bar inlined from source file commentary.c into the code for the following line. 0 loops inlined
        Function foo inlined from source file commentary.c into inline copy of function bar. 0 loops inlined
    14. bar();
    15. }
It is also possible to see commentary messages with source and disassembly together:
% er_src -disasm main a.out
Annotated disassembly
---------------------------------------
Source file: ./commentary.c
Object file: ./a.out
Load Object: ./a.out

     1. #include <stdio.h>
     2. void foo ()
     3. {
        <Function: foo>
        [ 3]  80506f4:  pushl  %ebp
        [ 3]  80506f5:  movl   %esp,%ebp
     4. printf ("foo\n");
        [ 4]  80506f7:  subl   $0x14,%esp
        [ 4]  80506fa:  pushl  $0x80507a0
        [ 4]  80506ff:  call   printf [ 0x80505bc, .-0x143 ]
     5. }
        [ 5]  8050704:  leave
        [ 5]  8050705:  ret
     6. void bar ()
     7. {
        <Function: bar>
        [ 7]  8050710:  pushl  %ebp
        [ 7]  8050711:  movl   %esp,%ebp
        [ 4]  8050713:  subl   $0x14,%esp
        [ 4]  8050716:  pushl  $0x80507a0
        [ 4]  805071b:  call   printf [ 0x80505bc, .-0x15f ]

        Function foo inlined from source file commentary.c into the code for the following line. 0 loops inlined
     8. foo();
     9. printf ("bar\n");
        [ 9]  8050720:  addl   $4,%esp
        [ 9]  8050723:  pushl  $0x8050798
        [ 9]  8050728:  call   printf [ 0x80505bc, .-0x16c ]
    10. }
        [10]  805072d:  leave
        [10]  805072e:  ret
    11.
    12. int main (){
        <Function: main>
        [12]  8050738:  pushl  %ebp
        [12]  8050739:  movl   %esp,%ebp
        [ 4]  805073b:  subl   $0x14,%esp
        [ 4]  805073e:  pushl  $0x80507a0
        [ 4]  8050743:  call   printf [ 0x80505bc, .-0x187 ]
        [ 9]  8050748:  addl   $4,%esp
        [ 9]  805074b:  pushl  $0x8050798
        [ 9]  8050750:  call   printf [ 0x80505bc, .-0x194 ]

        Function bar inlined from source file commentary.c into the code for the following line. 0 loops inlined
        Function foo inlined from source file commentary.c into inline copy of function bar. 0 loops inlined
    13. bar();
    14. }
        [14]  8050755:  xorl   %eax,%eax
        [14]  8050757:  leave
        [14]  8050758:  ret
Besides the commentary message about inlining in the example above, a wide variety of messages about compiler transformations are generated. These messages can be broadly classified into the following categories:
* Frontend generated messages
* Iropt (the intermediate level optimizer) generated messages - these messages are often about loop transformations such as unrolling, fusion, fission etc. Iropt also generates messages about parallelization and about inlining.
* Code generator messages - the SPARC code generator inserts messages about modulo scheduling and related pipelining and loop unrolling issues.
Presently on the x86 platform, the Sun Studio compilers generate compiler commentary messages only from the frontend and iropt. Messages about transformations done in the x86 code generator are a work in progress.
Some sample messages are shown below:
* Function <name> not inlined because it contains too many calls
* Call to function <name> was tail-call optimized
* <loop> not parallelized because it contains multiple exit points
* <loop1> fused with <loop2>, new loop <loop3>
* <loop> unrolled <number> times
Shown below is another example, a Fortran 90 program on which microvectorization was performed:
% cat test.f90
subroutine add1(a,b,n)
integer a(n), b(n)
a(:) = b(:) + 1
end

% f90 -fast -xvector=simd test.f90 -g -c
% er_src test.o
Source file: ./test.f90
Object file: ./test.o
Load Object: ./test.o

     1. subroutine add1(a,b,n)
     2. integer a(n), b(n)

        Array statement below generated loop L1
        L1 is micro-vectorized
        L1 multi-versioned for microvectorizing. Specialized version is L2
        L2 cloned for peeling. Clone is L4
        L1 cloned for microvectorizing-epilog. Clone is L8
        L4 cloned for microvectorizing-epilog. Clone is L6
        L2 had iterations peeled off for better unrolling and/or parallelization
     3. a(:) = b(:) + 1
     4. end
Besides messages about which transformations were performed, commentary messages are also inserted about why a certain optimization was not performed. The motivation is that the user can get a better understanding of how the compiler attempted to optimize a certain piece of code. If a desirable optimization was not performed, the user can make an informed decision about modifying part of the code or trying a different compile time option.
Please read the man page for er_src to learn about various options for er_src. It is possible to view commentary about only a select subset of compiler transformations. The performance analyzer man pages and user manuals also have details about using er_src and viewing annotated source and disassembly.
Posted by x86be
( Jun 01 2007, 06:41:26 PM PDT )

Monday February 13, 2006
News
This web page contains news/general info/technical tips for
the x86/x64 common compiler backend in Sun Studio compiler C, C++ and Fortran
and the x86/x64 assembler.
What's New in Studio11 from x86/x64
compiler backend?
Performance Records (Sun
Studio 11)
x86/64 Sun Studio 11 has achieved many performance records
Read the records from
» Vijay Tatkar's -1/13/2006 blog
Hardware Capability Support (Sun
Studio 11)
In Solaris 10, hardware capability checking is done by the presence of
certain markings in the executable files (e.g. a.out). The linker makes
the markings according to the markings in the object (.o) files and
library files. The x86/x64 Sun Studio 11 compilers (C, C++ and Fortran) and
assembler now by default produce such markings in the .o files,
according to the instructions that have been generated in them.
To learn more about Hardware Capability marking and how to turn it off, read the following article:
» Alfred Huang's blog on Jan 13, 2006 - "Ridding or modifying hardware capabilities info"
Media Intrinsic functions (Sun Studio 11)
There are many lower-level SSE/SSE2
instructions designed for efficient media code manipulation. Prior to Sun
Studio 11, users could take advantage of such instructions only through
assembler template (.il) files. Studio 11 offers a set of 128-bit integer
media intrinsic functions that use the XMM registers.
The needed header files can be found in the installation directory SUNWspro/prod/include/cc/
sunmedia_types.h
sunmedia_intrin.h
sys/sunmedia_intrin.h
sys/sunmedia_types.h
The optimization backend (when using -xOn, where n is 1 through 5) recognizes
these intrinsic functions when the following options are used:
-xarch=amd64 -xbuiltin
Note: The non-optimization backend (without any -xOn) needs venus patch 120759-01, available since 12/16/2005, to recognize these integer intrinsic functions.
The intrinsic functions for floating point instructions with 128-bit XMM registers are not available in Sun Studio 11 yet.
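As an illustration of the style of code that 128-bit integer intrinsics allow, here is a minimal sketch. The function names declared in sunmedia_intrin.h are not listed in this article, so the sketch uses the widely available SSE2 intrinsics from <emmintrin.h> as a stand-in; that header choice is an assumption, and the names add4_i32 and add4_selfcheck are hypothetical. A scalar fallback is used when SSE2 is not enabled.

```c
#include <assert.h>
#include <stdint.h>
#if defined(__SSE2__)
#include <emmintrin.h>  /* stand-in header (assumption), not the Sun header */
#endif

/* Adds four 32-bit integers at once using one 128-bit XMM operation. */
void add4_i32(const int32_t *a, const int32_t *b, int32_t *out)
{
    assert(a && b && out);
#if defined(__SSE2__)
    /* Unaligned loads, one packed 32-bit add, one unaligned store. */
    __m128i va = _mm_loadu_si128((const __m128i *)a);
    __m128i vb = _mm_loadu_si128((const __m128i *)b);
    _mm_storeu_si128((__m128i *)out, _mm_add_epi32(va, vb));
#else
    /* Scalar fallback for builds without SSE2. */
    for (int i = 0; i < 4; i++)
        out[i] = a[i] + b[i];
#endif
}

/* Hypothetical self-check: returns 1 if all four lanes are correct. */
int add4_selfcheck(void)
{
    const int32_t a[4] = {1, 2, 3, 4};
    const int32_t b[4] = {10, 20, 30, 40};
    int32_t out[4] = {0, 0, 0, 0};

    add4_i32(a, b, out);
    return out[0] == 11 && out[1] == 22 && out[2] == 33 && out[3] == 44;
}
```

Compared with an assembler template, the intrinsic form leaves register allocation and scheduling to the compiler.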
SSE3 instructions support (Sun Studio 11)
The x86/x64 assembler in Studio 11 recognizes the new SSE3 instructions.
New -xarch={amd64a, sse2a, ssea, pentium_proa} flags (Sun Studio 11)
The new flags -xarch=amd64a, sse2a, ssea and pentium_proa were
added for generating 3DNOW!, 3DNOW! extension and MMX extension instructions.
Note: The AMD64 instruction prefetchw is generated only when one of these flag values is set.
Revamp of Profile feedback scheme (Sun Studio 11)
x86/x64 Sun Studio 11 has an improved profile
feedback scheme to help boost performance. One can use a
sequence of commands such as the following to take advantage of it:
//first phase of compilation and execution
>cc -xprofile=collect[:name] -O test.c
>./a.out
//second phase of compilation and execution
>cc -xprofile=use[:name] -O test.c
>./a.out
where name is your profile data file name; the default name is a.out.profile.
Be sure to use the same version of the compiler in both phases. It is not
recommended to run the first phase with an older compiler such as Sun Studio
10, keep the profile data file, and then run the second-phase compilation
with -xprofile=use under Sun Studio 11.
TLS support (Sun Studio 10 and 11)
Thread-Local
Storage (TLS) provides a mechanism that allows threads to have global
variables which are local to a thread. Basic support for Thread-Local
Storage on the x86 and x64 platforms was introduced in the Sun Studio 10
compilers. It coincided with linker support for TLS in Solaris
10 for the x64 platform (linker support for x86 has been available since Solaris 9).
A variable can be declared thread-local using the __thread keyword, e.g.,
__thread int i;
Studio 10 compilers supported the general dynamic and initial exec
models of TLS. Studio 11 compilers extend this with support
for the local dynamic and local exec models.
The compile-time option -xthreadvar can be used to enable the use of TLS with position-independent code in shared libraries.
-xthreadvar can take either of two values:
dynamic or no%dynamic.
-xthreadvar=no%dynamic generates code for faster access to a thread
variable, but such object files can only be used in executables.
-xthreadvar=dynamic is required for accessing thread-local variables
through dynamic loading. For more details, please refer to the C
User's Guide.
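The effect of __thread can be seen in a short sketch like the one below, which assumes POSIX threads are available and may need the platform's threading flag at link time. The names tls_counter, worker and tls_copies_are_independent are hypothetical. Each thread increments its own copy of the variable, so the counts do not interfere and the creating thread's copy is left untouched.

```c
#include <assert.h>
#include <pthread.h>

/* One instance of this variable exists per thread. */
static __thread int tls_counter = 0;

/* Increments this thread's copy *arg times, then reports the result
   back through the same pointer. */
static void *worker(void *arg)
{
    int *loops_in_count_out = arg;
    int loops = *loops_in_count_out;

    for (int i = 0; i < loops; i++)
        tls_counter++;

    *loops_in_count_out = tls_counter;
    return NULL;
}

/* Returns 1 if each worker saw only its own increments and the calling
   thread's copy was left untouched; returns 0 otherwise. */
int tls_copies_are_independent(int loops_a, int loops_b)
{
    pthread_t ta, tb;
    int a = loops_a, b = loops_b;

    assert(loops_a >= 0 && loops_b >= 0);
    tls_counter = 0;  /* the calling thread's own copy */

    if (pthread_create(&ta, NULL, worker, &a) != 0)
        return 0;
    if (pthread_create(&tb, NULL, worker, &b) != 0) {
        pthread_join(ta, NULL);
        return 0;
    }
    pthread_join(ta, NULL);
    pthread_join(tb, NULL);

    return a == loops_a && b == loops_b && tls_counter == 0;
}
```

With an ordinary global in place of the __thread variable, the two workers would race on a single shared counter and the final check would not be expected to hold.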
--written by Deepankar Bairagi
Medium Model support (Sun Studio 11)
With 64-bit addresses and instructions available on x64, different memory models
are emerging in the compiler. Sun Studio 11 supports the medium model.
A new flag and its values are introduced for specifying the memory model:
-xmodel={kernel, small, medium}
Note: (provided by Lawrence Crowl)
-xmodel is amd64-only and affects both data and code references, while -xcode is sparc-only and only affects data references.

sparc                    data references        code references
-xcode=abs32             absolute 32 bits       independent 32 bits
-xcode=abs44             absolute 44 bits       independent 32 bits
-xcode=abs64             absolute 64 bits       independent 32 bits
-xcode=pic13 (-Kpic)     independent 13 bits    independent 32 bits
-xcode=pic32 (-KPIC)     independent 32 bits    independent 32 bits

amd64                    data references        code references
-xmodel=small            absolute 32 bits       absolute 32 bits
-xmodel=medium           absolute 64 bits       absolute 32 bits
-xmodel=small -KPIC      independent 32 bits    independent 32 bits
-xmodel=medium -KPIC     independent 64 bits    independent 32 bits
The following articles give more details and information:
» Alfred Huang's blog on 01/22/06 - "AMD64 Memory Models",
» Alfred Huang's blog on 01/31/06 - "Kpic under Small model versus Medium Model",
» Chris Quennel's article on 12/06/2005 - "Sun Studio Support for the AMD64 Medium Memory Model"
Latest x86/x64 Compiler Backend patches - as of 05/31/2007
(click link for the bug fix list or download patch)
Sun Studio 11: 120759-13
Sun Studio 10: 117846-17
Sun Studio 9 : 115982-04
Sun Studio 8 : 112756-13
--created and written by Mei Chung
Posted by x86be
( Feb 13 2006, 05:44:19 PM PST )