Disclaimer: Whatever I suggest in my blog is what I would do; it is not necessarily the only way to do it.
Compatibility between Sun and GNU Assembler
The latest patch for the Sun Studio 12 Assembler has some modifications that may be of interest to developers with existing assembly source previously written for the GNU Assembler: the Sun Studio 12 Assembler is now more syntactically compatible with the GNU Assembler!
There are 2 new assembler options:
Major area of changes:
For example:
mov $10, %ax
can now be used for
movw $10, %ax
Note that if we have:
mov $10, mem
the Sun Assembler will default to
movl $10, mem
but will give an error if -C is used, for compatibility with the GNU Assembler.
For example: "fbe -m64 -a32 file.s" can assemble
lea 123(%eax,%r10d),%eax
Posted at 04:37PM Dec 10, 2007 by alblog in Sun | Comments[0]
Sun Studio Express: A Peek into the Features of the Sun Studio Tool Set
Sun Microsystems is offering free download and usage of the latest Sun Studio under development as a preview with the Sun Studio Express program. Sun Studio is the most complete development tool set available on the market supporting all development stages of highly optimized, serial and parallel applications.
At a glance, Sun Studio consists of high-performance, optimizing C, C++, and Fortran compilers for the Solaris OS on SPARC and Solaris/Linux OS on x86/x64 platforms, plus command-line tools and a NetBeans-based Integrated Development Environment (IDE) for application performance analysis and debugging of mixed source language applications.
As an overview of the newest Sun Studio Express release, with its extensive feature set, Sun Studio can be best described in terms of Performance, Productivity and Parallelism.
| Performance |
| Advanced Compiler Optimization: extensive machine-independent and machine-dependent optimizations; vectorization support; instruction scheduling; data prefetching support; code/data relocation for optimal cache use; profile feedback optimization; cross-file interprocedural optimization; Gcc-compatible asm inlining support; Sun-specific asm inlining with virtualization support; Gcc/Icc-compatible SSEx intrinsics support (x86/x64); binary optimization (SPARC) |
| Optimized Libraries: Availability of Sun Performance Library, Rogue Wave Tools.h++, SCL, STLport libraries to increase application performance |
| Productivity |
| Sun Studio provides a complete software tool set for source code compatibility across Solaris/SPARC, Solaris/x86 and Linux platforms: NetBeans 6-based IDE with a convenient user interface; DBX debugger facilitating multi-threaded and multi-language debugging, with extensive debugging features including ksh scripting and runtime memory checking; easy-to-use Performance Analyzer tool for multi-language and multi-process application analysis, including memory allocation tracing and hardware counter profiling; Thread Analyzer to track down hard-to-detect race conditions and deadlocks at execution time; parallel make utility, dmake, to speed up the build process |
| Parallelism |
| Multi-threaded model supporting both UNIX and POSIX threads; automatic parallelization/vectorization; OpenMP 2.5 API support for C, C++ and Fortran; Thread Local Storage support; MPI library support (SunCluster 6.0); Performance Analyzer capable of analyzing multi-threaded, parallel programs; Thread Analyzer to detect data races and deadlocks during execution of parallel, multi-threaded programs; data race and deadlock tracking with the static source code analyzer, lock_lint |
The newest preview release is Sun Studio Express 3.
With the extensive optimization investment in the newest Sun Studio, users can expect visible performance improvements in their applications and benchmarks.
The following list shows some of the more important new features and significant optimization enhancements to be found in Express 3. The list is by no means exhaustive.
The entire Sun Studio development is now available on Linux with the same set of compilers, tools and libraries as found in Solaris-based Sun Studio. Users can now expect source level compatibility for their applications on both Solaris and Linux OS.
Programs developed with Sun Studio under Solaris or Linux enjoy the same source, same components, same performance and same features within boundaries of what the system allows.
Sun Studio C compiler now supports the following features for Gcc-compatibility:
Alignment Attribute
int __attribute__ ((aligned (16))) i1 = 0;
Symbol Visibility
References to the symbol with the following attributes bind to the definition of the first dynamic module that defines the symbol.
int __attribute__ ((visibility ("default"))) i2 = 0;
int __global i3 = 0;
References to the symbol with the following attributes from within the dynamic module being linked bind to the symbol defined within the module. Outside of the module, the symbol appears as though it were global.
int __attribute__ ((visibility ("protected"))) i4 = 0;
int __symbolic i5 = 0;
References to the symbol with the following attributes within a dynamic module bind to a definition within that module. The symbol will not be visible outside of the module.
int __attribute__ ((visibility ("internal"))) i6 = 0;
int __attribute__ ((visibility ("hidden"))) i7 = 0;
int __hidden i8 = 0;
Other compatibility supports include
Value returning blocks
#define maxint(a,b) \
({int _a = (a), _b = (b); _a > _b ? _a : _b; })
Support for __typeof__
#define max(a,b) \
({ __typeof__ (a) _a = (a); \
__typeof__ (b) _b = (b); \
_a > _b ? _a : _b; })
Zero length arrays
Zero-length arrays are useful as the last element of a structure that heads a variable-length object.
struct foo {
int len;
int flexarray [0];
};
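As a minimal sketch of how such a structure is typically used, the following allocates a struct foo with room for n trailing ints in a single block. The helper names foo_create and foo_sum are hypothetical, and the zero-length array is a GNU C extension, so this assumes a compiler that accepts it.

```c
#include <stdlib.h>

/* Same layout as the struct in the text: a length followed by a
   zero-length trailing array (a GNU C extension). */
struct foo {
    int len;
    int flexarray[0];
};

/* Hypothetical helper: allocate a foo with room for n ints and copy them in.
   The caller releases the object with free(). */
struct foo *foo_create(const int *values, int n)
{
    struct foo *p = malloc(sizeof(struct foo) + (size_t)n * sizeof(int));
    if (p == NULL)
        return NULL;
    p->len = n;
    for (int i = 0; i < n; i++)
        p->flexarray[i] = values[i];
    return p;
}

/* Hypothetical helper: sum the trailing elements, to show that they are
   addressable as an ordinary array. */
int foo_sum(const struct foo *p)
{
    int total = 0;
    for (int i = 0; i < p->len; i++)
        total += p->flexarray[i];
    return total;
}
```

Note that sizeof(struct foo) covers only the len field; the storage for the array elements comes entirely from the extra bytes requested from malloc.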
Support for the Open Source Libraries of BOOST and LOKI is now available
BOOST http://www.boost.org
LOKI http://sourceforge.net/projects/loki-lib
Support of STLport library under Linux OS is also available.
Interval arithmetic is now supported on Solaris x86 platform with the -xarch=sse2 -xia options.
mt-safe -xprofile under x86/x64
Profile feedback optimization for multi-threaded application is now available under x86/x64.
MMX and SSEx vector instructions can be generated via the use of inline intrinsics. Function prototypes are available in mmintrin.h (MMX instructions), xmmintrin.h (SSE instructions), emmintrin.h (SSE2 instructions) and pmmintrin.h (SSE3 instructions).
The compiler's optimizer can in some cases vectorize matrix operations in loops. Alternatively, one may use SSEx intrinsics to perform the task explicitly, resulting in low-level vectorized code that is largely suitable for image processing and floating point operations.
These new intrinsics are largely compatible with those supported in Gcc and Intel Icc, enabling easy migration for these popular intrinsics.
As an example, using SSE intrinsics, we may convert the following loop:
float in1[4] = { 1.0, 2.0, 3.0, 4.0 };
float in2[4] = { -1.0, -2.0, -3.0, -4.0 };
float out[4];
for (i = 0; i < 4; i++) {
out[i] = in1[i] + in2[i];
}
with a vectorized version, provided in1, in2 and out are 16-byte aligned:
#include <emmintrin.h>
__m128 x, y, z;
x = _mm_load_ps(in1);
y = _mm_load_ps(in2);
z = _mm_add_ps(x, y);
_mm_store_ps(out, z);
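For readers who want to try this, here is a minimal self-contained sketch of the same four-element addition wrapped in a function. The name add4_sse is hypothetical, and the code assumes an x86/x64 target and a compiler that provides xmmintrin.h.

```c
#include <xmmintrin.h>  /* SSE: __m128, _mm_load_ps, _mm_add_ps, _mm_store_ps */

/* Hypothetical helper: out[i] = a[i] + b[i] for i = 0..3 using one SSE add.
   All three pointers must be 16-byte aligned, as _mm_load_ps and
   _mm_store_ps require. */
void add4_sse(const float *a, const float *b, float *out)
{
    __m128 x = _mm_load_ps(a);    /* load four floats from a */
    __m128 y = _mm_load_ps(b);    /* load four floats from b */
    __m128 z = _mm_add_ps(x, y);  /* element-wise sum of x and y */
    _mm_store_ps(out, z);         /* store four floats to out */
}
```

The arrays passed in can be given the required alignment with the alignment attribute, for example float in1[4] __attribute__ ((aligned (16))).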
Major improvements in the performance of the Fast Fourier Transform software for the 1- and 2-dimensional cases on SPARC and x86/x64, and for the 3-dimensional case on SPARC.
All new IDE, based on NetBeans 6.
New and improved user interface.
Tightly integrated with C/C++ project system.
New persistent, global breakpoints to be shared between debugged executables.
All Sun Studio compilers now generate DWARF debugging information by default. This makes it easier for 3rd party debuggers and other tools to interoperate with Studio compilers. The DWARF format also enhances debugging of optimized code.
Debugging of OpenMP code is now improved.
Race Condition/ Deadlock Data Collection
The collect command accepts a new option, -r [datarace|deadlock|all|on|off], which specifies collecting data race detection and deadlock data.
Time Duration
The collect command accepts a new option, -t <duration>, which specifies a time range for data collection.
Count Data (SPARC)
Function and instruction count data can be recorded using the collect -c on option. Data will be collected for the executable only, not the shared libraries. The executable must be compiled with the -xbinopt=prepare flag.
Clock-based Dataspace Profiling (SPARC)
Since UltraSparc-T1 does not support Hardware Counter Profiling, Performance Analyzer now has Clock-based Dataspace Profiling. This is requested by prepending + to the <interval> argument of the collect -p option. It will generate additional data if the instruction immediately preceding the interrupt PC is a memory operation.
Descendant Processes Data Collection
Data may now be selectively collected on descendant processes by using a new collect option, -F =<regex>.
Process Attachment
The collect command now accepts a new option, -P <pid>, which specifies attaching to the process <pid> and collecting data from it.
The Thread Analyzer detects data races and deadlocks that occur during the execution of a single or multi-threaded process.
Use -xinstrument=datarace during compilation time to instrument the program for data race detection. No instrumentation is needed for deadlock detection.
Use collect -r all to collect data race and deadlock data at runtime.
Review the collected data with the new RACES and DEADLOCK tabs in the Performance Analyzer or by using the GUI tool tha, the Thread Analyzer.
The static data race and deadlock utility, lock_lint, is now available for the x86 platform. Use the -Zll flag with the C compiler to create the lock_lint database files.
Sun Studio is famous for its compatibility with past releases. The compilers are tested extensively with past Sun Studio releases from Workshop 5.0 onwards and on Solaris from Solaris 8 onwards.
Sun Studio has been tested with 400+ Open Source Applications. The Solaris Enterprise System is built entirely with Sun Studio and the Java infrastructure.
All popular benchmarks tested regularly, particularly SPEC CPU 2000, CPU 2006 and SPEC OMP.
| Description | Link |
| Technical articles, Demos, Whitepapers, CodeSample, Heroes Profiles, Documentation, Customer endorsements | http://developers.sun.com/sunstudio |
| Technical documentation including installation guides, individual product books, FAQs, READMEs, etc | http://docs.sun.com |
| Service Plans | http://www.sun.com/service/serviceplans/software.jsp |
| Service Plans for Developer Support | http://www.sun.com/service/serviceplans/developer/buyingguide.pdf |
| Expert Assistance Program | http://developers.sun.com/services/expertassistance/index.jsp |
| Studio 11 Download (and older versions) or via media kits (charge for shipping and handling only) | http://developers.sun.com/sunstudio/downloads/ |
| Download of Express program binaries | http://developers.sun.com/sunstudio/downloads/express.jsp |
| Learning site for SunStudio 11 and Realtime programming for Solaris | http://developers.sun.com/sunstudio/learning/ |
| Take a short product tour | http://developers.sun.com/sunstudio/product_tour.jsp |
| Forums for posting questions on compilers, tools, IDE; regularly answered by engineers | http://developers.sun.com/sunstudio/community/forums.jsp |
Posted at 02:09PM Jan 04, 2007 by alblog in Sun | Comments[0]
On Studio and Gcc style inline assembly
A question has been raised on the optimization effect of Sun Studio's inline assembly mechanism and that of gcc style enhanced inline assembly. Let's start with a brief introduction of what they are and then discuss the optimization effect they may have.
At the simplest level, both compilers support the asm() statement, which is the insert-as-is, non-optimized form of inline assembly. It simply inserts the enclosed assembly string(s) as is, with no facility for arguments or optimization.
As an enhancement, Studio supports the .il inline template file, used in a way similar to an include file. A .il file may contain multiple inline templates, where each template is of the following form:
.inline "template_name",0
"assembly code"
.end
In a sense, Studio treats each inline template as a function definition and handles argument passing and the return value according to the calling convention in the ABI of the corresponding platform.
For example, let's add 8 numbers together and return their sum. The inline template in the foo.il file may be as follows:
.inline multi_add,0
movq %rdi, %rax
addq %rsi, %rax
addq %rdx, %rax
addq %rcx, %rax
addq %r8, %rax
addq %r9, %rax
addq (%rsp), %rax
addq 8(%rsp), %rax
.end
The foo.c file may look like:
int multi_add(int,int,int,int,int,int,int,int);
int foo()
{
return multi_add(1,2,3,4,5,6,7,8);
}
Compile the foo.c with foo.il:
cc -S -O -xarch=amd64 foo.il foo.c
The resulting foo.s will contain the optimized code:
foo: ...
/ ASM INLINE BEGIN: multi_add
movq $36, %rax
/ ASM INLINE END
Based on the calling convention of the AMD64 ABI, the first six integral arguments are passed in %rdi, %rsi, %rdx, %rcx, %r8 and %r9, any extras are to be passed on the stack, hence (%rsp) and 8(%rsp) in this case. If optimization is not desired, a -Wu,-no_a2lf option can be used:
cc -S -O -xarch=amd64 -Wu,-no_a2lf foo.il foo.c
Thus the result foo.s will contain the following:
push $8
push $7
movq $1,%rdi
movq $2,%rsi
movq $3,%rdx
movq $4,%rcx
movq $5,%r8
movq $6,%r9
/ INLINE: multi_add
movq %rdi, %rax
addq %rsi, %rax
addq %rdx, %rax
addq %rcx, %rax
addq %r8, %rax
addq %r9, %rax
addq (%rsp), %rax
addq 0x8(%rsp), %rax
/ INLINE_END
Studio treats an inline template as a function call, so function argument loading takes place before the template body is inserted. If -Wu,-no_a2lf is not used, all these assembly instructions are inserted into Studio's intermediate representation stream and all specified optimizations take place, resulting in a single "movq $36, %rax".
In this manner, the user-specified inline template is assimilated into Studio's optimization and code generation mechanism. Further note that all registers specified in the inline template will be "virtualized" and reallocated by the code generator, so the final appearance of the inline template may be drastically different from its original. The main catch of the approach is that users must understand the calling convention of the underlying platform.
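To make the "template as function" view concrete, here is a rough C analogue of the multi_add template and its caller. This is only an illustration under the assumption that the template behaves like an ordinary function summing eight arguments; the names multi_add_c and foo_c are hypothetical and nothing here is Studio-specific.

```c
/* C analogue of the multi_add template: the sum of eight integer arguments.
   Under the AMD64 calling convention the first six arrive in registers and
   the last two on the stack, as described in the text. */
long multi_add_c(long a, long b, long c, long d,
                 long e, long f, long g, long h)
{
    return a + b + c + d + e + f + g + h;
}

/* With constant arguments, an optimizing compiler is free to fold the whole
   call down to the constant 36, mirroring the single movq shown above. */
long foo_c(void)
{
    return multi_add_c(1, 2, 3, 4, 5, 6, 7, 8);
}
```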
Gcc style extended asm inline assembly takes a different approach. Basically it has the form:
asm("template"
: output arguments
: input arguments
: clobber list);
This approach gives the user the flexibility to specify the output and input arguments, and to inform the compiler of the resources used within the template, thus allowing the compiler to avoid resource conflicts. Other than that, the template is treated as a black box whose content is not looked into. The template body is inserted as is after argument substitution, so the optimization effect may be limited.
For example, let's add 3 constants and a static variable together with the following asm():
int mem;
int foo()
{
int res;
asm ("movl %1, %0\n"
"\taddl %2, %0\n"
"\taddl %3, %0\n"
"\taddl %4, %0\n"
: "=r" (res)
: "g" (1),
"g" (2),
"g" (3),
"m" (mem)
);
return res;
}
And the result will simply be:
.globl foo
.type foo, @function
foo:
/APP
movl $1, %edx
addl $2, %edx
addl $3, %edx
addl mem, %edx
/NO_APP
movl %edx, %eax
ret
Note that the template body is inserted as is into the final assembly after argument substitution; neither constant folding nor any other extensive optimization is performed. Some forms of optimization may still be performed, however. For example, the entire template may be moved out of a loop if it turns out to be loop invariant and the "clobber list" indicates no change in memory or registers.
Both styles of inline assembly mechanism have their pros and cons. The choice depends on one's needs, and it is possible to convert from one style to the other.
UPDATE 03/10/2009: I have heard in the latest version of Sun Studio, the optimization limit may be removed and full optimization may be performed by default.
Posted at 01:40PM May 22, 2006 by alblog in Sun | Comments[7]
The humble Frame Pointer
Most architectures have a dedicated register called the frame or base pointer. On the 32-bit x86 platform it is the %ebp register; on 64-bit amd64 it is the %rbp register.
The main purpose of the base pointer (BP) is to allow access to the stack variables. Since the stack grows downward on x86, one can access the function arguments using the (BP+offset) memory address mode, and the stack locals using (BP-offset). Before the base pointer can be used, the compiler needs to set it up in the function prologue. The most common form of BP setup code at the very beginning of a 32-bit mode x86 function is
pushl %ebp
movl %esp, %ebp
The first instruction saves the old BP of the previous stack frame, which can later be restored in the function epilogue before returning to the caller. The second instruction sets up the new BP of the function by making it the same as the current stack pointer. So after the function prologue, we have
|        ...         |
----------------------
| possible arguments |  previous stack frame
----------------------
|   return address   |  return address to caller
----------------------
|       old BP       |  BP of caller
---------------------- <-- BP of this frame
| possible locals... |
----------------------
In a sense the above BP setup instruction sequence is sacred. Many applications use the sequence to "unwind" the stack, that is, to backtrace the stack frames from the current frame to its caller and so forth. These applications look for the existence of this sequence in the function prologue to find the saved old BP, so as to get to the previous frame. The kinds of applications doing this include debuggers, code analyzers, multi-threaded function cleanup handlers, and even C++ exception handling code. If the above sequence is not detected, the stack backtrace cannot be carried on, and undefined results including a core dump may occur.
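The chain that such tools follow can be modeled in a few lines of C. The sketch below walks a hand-built chain of frame records rather than the real machine stack, so it is safe to run anywhere; the struct layout mirrors the diagram above, while the names frame and backtrace_chain, and the convention that the outermost frame has a NULL saved BP, are assumptions made for illustration.

```c
#include <stddef.h>

/* Model of what sits at the BP of each frame after the standard prologue:
   the saved caller BP, with the return address just above it. */
struct frame {
    struct frame *saved_bp;  /* old BP pushed by "pushl %ebp" */
    void         *ret_addr;  /* return address pushed by the call */
};

/* Follow the saved-BP chain starting at bp, storing up to max return
   addresses in out. Returns the number of frames visited. The walk stops
   at a NULL saved BP (an assumption of this sketch). */
size_t backtrace_chain(const struct frame *bp, void **out, size_t max)
{
    size_t n = 0;
    while (bp != NULL && n < max) {
        out[n++] = bp->ret_addr;
        bp = bp->saved_bp;
    }
    return n;
}
```

If any function in the chain skipped the BP setup, the saved_bp link for its frame would simply not exist, which is why the walk breaks down in that case.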
Of course the world is never so nice as to always leave the humble BP in place for every stack backtrace. The compiler sometimes decides the BP is not really needed, since it can still access the function arguments and locals using the stack pointer, so the BP setup code and the subsequent restore may be optimized away. Moreover, on a platform like 32-bit x86, where registers are always in short supply, the availability of %ebp as a general purpose register is a blessing. Hence the BP setup code sometimes does not show up, and applications may break. Compilers usually have a way to turn off this form of optimization.
Sun Studio 10/11 compilers do not take users into the danger zone with this kind of optimization by default. For safety's sake, one has to ask for it explicitly with the -xregs=frameptr option. Admittedly, -xregs=frameptr is not very intuitive, but it means what it says: use BP as a general purpose register, so BP is no longer a dedicated base pointer and the BP setup code may not be generated.
Some architectures nowadays have a new concept of an unwind information section. The compiler generates information about the stack status at compile time and records it in special sections. Users of such a section can obtain the displacement to the previous stack frame at a given point in the code. This eliminates the need for the "sacred" BP setup code in the function prologue. The AMD64 ABI defines an equivalent of this kind of unwind information section, namely the .eh_frame section.
Hence Sun Studio in 64-bit mode does not really need the BP setup code, and users are free to invoke -xregs=frameptr at any time. The Studio and Solaris C++ exception handling library is intelligent enough to handle stack unwinding based on what it finds: it obtains the previous stack frame from the .eh_frame section if one is found, and otherwise assumes the existence of the BP setup code. But for 32-bit mode x86, where .eh_frame is not defined in the ABI, the BP setup code is a must for any stack unwinding to take place.
Posted at 12:59AM Mar 13, 2006 by alblog in Sun | Comments[0]
A look into the AMD64 Aggregate Argument Passing
Recently there have been questions on argument passing for AMD64. As part of the calling convention, argument passing and returned values are described in detail in the AMD64 ABI. The portion on passing scalar arguments is clear and straightforward, but the description for passing aggregates is pretty algorithmic and rather obscure. Maybe I can help by explaining it with examples.
Generally speaking, all objects that can be accommodated in registers are passed in registers until the designated registers run out, and the memory stack is then used. Regardless of the actual object size, all arguments are passed in a multiple of 8 bytes.
To start the topic on passing aggregates, let me reiterate argument passing for the most common scalar types, namely integer and floating point types. The first six integer arguments (class INTEGER) are passed in %rdi, %rsi, %rdx, %rcx, %r8 and %r9, then on the memory stack. Likewise, the first 8 float and double arguments (class SSE) are passed in %xmm0 to %xmm7, then on the memory stack.
Aggregate arguments larger than 16 bytes (2 EightBytes) are always passed on the stack; it is the aggregate argument smaller than or equal to 16 bytes that is the most interesting. First of all, you have to figure out which fields of the aggregate belong to the 1st and 2nd EightBytes. This can be achieved with knowledge of the possible padding used between the fields of a struct.
Example 1:
struct S { short i;
float f1;
short j;
float f2;
} s;
Since the alignment of f1 and f2 is 4, there is a padding of 2 bytes
between i and f1, and between j and f2. So we have
---------
| i | 2 bytes ---
--------- |
| pad | 2 bytes |--- 1st EightByte
--------- |
| f1 | 4 bytes ---
---------
| j | 2 bytes ---
--------- |
| pad | 2 bytes |--- 2nd EightByte
--------- |
| f2 | 4 bytes ---
---------
Now the rule in the AMD64 ABI calls for considering 2 adjacent fields
in an EightByte recursively in a merge step. I will not repeat the
rules here, but one of the rules is that if one class is INTEGER, the
result class is INTEGER. In this case, since i is of class INTEGER
and f1 is of class SSE, the result is INTEGER. Hence the 1st
EightByte has class INTEGER and the 2nd EightByte also has class
INTEGER. If object 's' is passed as the first argument, it will
then be passed in %rdi and %rsi, in which i and f1 are contained in
%rdi, while j and f2 are contained in %rsi.
Surprise?
Example 2:
struct S { float f[4]; } s;
Since 's' is exactly 128 bits in size, which fits an xmm register,
would 's' be passed in a single xmm register? The answer is no.
Writing f1 to f4 for f[0] to f[3], we have in this case:
---------
| f1 | 4 bytes ---
--------- |--- 1st EightByte
| f2 | 4 bytes ---
---------
| f3 | 4 bytes ---
--------- |--- 2nd EightByte
| f4 | 4 bytes ---
---------
Since f1 and f2 are both of class SSE, the result class for the 1st
EightByte is SSE, likewise for the 2nd EightByte. Hence if object
's' is passed as the first argument, it will then be passed in %xmm0
and %xmm1, where %xmm0 contains the value of f1 and f2, while %xmm1
contains the value of f3 and f4.
Example 3:
Should the entire aggregate reside in one single class of register
when being passed? Again the answer is no. Consider the following
case:
struct S { int i;
float f1;
float f2;
float f3;
} s;
---------
| i | 4 bytes ---
--------- |--- 1st EightByte
| f1 | 4 bytes ---
---------
| f2 | 4 bytes ---
--------- |--- 2nd EightByte
| f3 | 4 bytes ---
---------
Note i is of class INTEGER and f1 is of class SSE, and one of the merge
rules says if one class is INTEGER, the result is INTEGER, so the 1st
EightByte is of class INTEGER, while the 2nd EightByte is of class
SSE. Hence if object 's' is passed as the first argument, its first
8 bytes containing i and f1 are passed in %rdi, whereas the remaining
8 bytes containing f2 and f3 are passed in %xmm0.
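The three examples can also be checked mechanically. The following is a minimal sketch in C, not the ABI's full algorithm: it handles only the INTEGER and SSE classes and aggregates of at most 16 bytes, and applies the single merge rule quoted above. All names (arg_class, field, classify, S1 to S3) are hypothetical.

```c
#include <stddef.h>

enum arg_class { CLASS_NONE, CLASS_SSE, CLASS_INTEGER };

struct field {
    size_t offset;       /* byte offset within the aggregate, e.g. from offsetof */
    enum arg_class cls;  /* CLASS_INTEGER for integer types, CLASS_SSE for float/double */
};

/* Simplified merge step covering only INTEGER and SSE: INTEGER wins. */
static enum arg_class merge(enum arg_class a, enum arg_class b)
{
    if (a == CLASS_NONE) return b;
    if (b == CLASS_NONE) return a;
    if (a == CLASS_INTEGER || b == CLASS_INTEGER) return CLASS_INTEGER;
    return CLASS_SSE;
}

/* Classify the two EightBytes of an aggregate of at most 16 bytes.
   out[0] is the class of the 1st EightByte, out[1] of the 2nd. */
void classify(const struct field *fields, size_t nfields, enum arg_class out[2])
{
    out[0] = out[1] = CLASS_NONE;
    for (size_t i = 0; i < nfields; i++) {
        size_t eb = fields[i].offset / 8;  /* which EightByte the field starts in */
        if (eb < 2)
            out[eb] = merge(out[eb], fields[i].cls);
    }
}

/* The structs from Examples 1, 2 and 3. */
struct S1 { short i; float f1; short j; float f2; };
struct S2 { float f[4]; };
struct S3 { int i; float f1; float f2; float f3; };
```

Feeding each struct's field offsets (obtained with offsetof, which accounts for the padding) into classify reproduces the results above: INTEGER/INTEGER for Example 1, SSE/SSE for Example 2, and INTEGER/SSE for Example 3.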
Hope these little examples provide some insights into the interpretation of the aggregate argument passing rule in the AMD64 ABI.
Posted at 10:06AM Feb 21, 2006 by alblog in Sun | Comments[1]
Kpic under Small model versus Medium model
This continues my previous discussion of the medium model. Some people pointed out that their applications that previously hit "address does not fit" would actually link and run using -Kpic, without using -xmodel=medium.
Yes, Position Independent Code may buy you further addressing capability, but it has a limit based on code size and the number of global statics, and it may not be as efficient as medium model code.
So what is the difference between Position Independent Code under small model and the medium model code?
PIC code goes through the Global Offset Table (GOT), which the linker usually creates beneath the text section. It contains the actual 64-bit addresses of the static objects. Access to an object in PIC mode consists of a 4-byte displacement reference from the referencing point in the text to the corresponding entry in the GOT; the 64-bit address in the GOT is then picked up and referenced indirectly. With the full 64-bit address referenced indirectly, the 2G addressability limit discussed in my previous blog entry is overcome. In a sense, it is pretty similar to the medium model, where a full 64-bit address is explicitly loaded using the "movabsq" instruction and then referenced indirectly.
The difference is that PIC code requires the distance between the referencing point and the actual GOT entry to be within the 2G limit. A reference from the lower addresses of a very large text section to its corresponding GOT entry may be out of reach, leading to another "address does not fit" error.
Moreover, in PIC mode all global statics are accessed through the GOT, leading to a degradation in performance, whereas under the medium model only objects larger than 65535 bytes reside in the special ".lXXXX" sections, which require 64-bit access.
Hence for applications with a normal code size and number of global statics, using -Kpic will get around the problem if some degradation in performance can be tolerated.
Posted at 01:13PM Jan 31, 2006 by alblog in Sun | Comments[0]
AMD64 Memory Models
I recall the first production quality compiler I worked on was for the 80286 in the mid 80's. Memory models such as "small", "medium" and "large", with the common extended keywords "__far" and "__near", were popular with that 16-bit segmented memory architecture. Thereafter, the 80386 "linearized" the address space with its 32-bit pointers, and the model-related terms simply became obsolete and disappeared.
It is interesting to see the AMD64 ABI reintroduce memory models to the x86 world, this time with a 64-bit architecture. So why are there memory models in the x64 architecture? I would guess it is for performance's sake. The x64 architecture basically has only one instruction that truly loads a 64-bit address into a register, namely "movabsq". All other memory-related instructions contain only a 4-byte displacement, which is then sign-extended to 64 bits. If an address beyond the 32-bit range is to be accessed, the compiler needs to load that "far" address with a movabsq instruction and then reference it indirectly, which is not as efficient as a single, direct access. Hence the "small" and "kernel" models in the x64 ABI assume certain address range limitations, so as to allow efficient memory access. "Medium" and "large" allow more flexibility in the address ranges at the expense of less efficient memory access.
To save time, let's talk about the "small" and "medium" models only. "Small", being the default of most x64 compilers, is often mistaken to be equivalent to the de facto model in 32-bit x86, if there were one. I'm afraid the small model of x64 is actually smaller than the de facto 32-bit x86 model. We have encountered many people complaining that their applications ran in 32-bit x86 but hit the linker's "address does not fit" error when ported to the default x64 model.
According to the AMD64 ABI, the small model allows a data address space of [-2^31, 2^31-1], with the linker limiting allocation of symbols to [0, 2^31-2^24-1]. What that means is:
The effective address (i.e. offset+symbol at runtime) is limited to
[-2^31, 2^31-1]
If the "symbol" in offset+symbol is limited by the linker to
[0, 2^31-2^24-1]
then the compiler can safely generate an "offset" in offset+symbol in the range
[-2^31, 2^24]
Note that the EA limitation means the upper 33 bits must be all 0s or all 1s for an address to be valid; the ABI states that the linker must issue an error otherwise. So only 31 bits are truly eligible for memory address computation in the x64 small model, versus 32 bits in 32-bit x86, roughly half the space.
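The ranges are easy to get wrong by one, so here is a small sketch in C that encodes them. The limits are the ones quoted above; the macro and function names are hypothetical and the code only models the arithmetic, not any linker behavior.

```c
#include <stdint.h>
#include <stdbool.h>

/* Limits quoted in the text for the small model. */
#define SMALL_EA_MIN   (-(INT64_C(1) << 31))                          /* -2^31 */
#define SMALL_EA_MAX   ((INT64_C(1) << 31) - 1)                       /* 2^31 - 1 */
#define SMALL_SYM_MAX  ((INT64_C(1) << 31) - (INT64_C(1) << 24) - 1)  /* 2^31 - 2^24 - 1 */

/* True if the effective address fits in a sign-extended 32-bit value,
   i.e. its upper 33 bits are all 0s or all 1s. */
bool ea_fits_small_model(int64_t ea)
{
    return ea >= SMALL_EA_MIN && ea <= SMALL_EA_MAX;
}

/* True if a symbol at this address is within the range the linker
   allocates symbols in under the small model. */
bool symbol_fits_small_model(int64_t sym)
{
    return sym >= 0 && sym <= SMALL_SYM_MAX;
}
```

The symbol limit and the offset limit fit together: the largest symbol address, 2^31-2^24-1, plus the largest offset, 2^24, is exactly 2^31-1, the top of the effective address range.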
So what should be done when you get a linker "address does not fit" error? If changing the source code is not an option, one option in Sun Studio 11 is to try -xmodel=medium.
In the recent x64 ABI, the medium model is actually quite efficient. Not all static data is accessed by loading a "far" 64-bit address followed by an indirect reference. The medium model is now defined with extra data sections. Normally, there are the ".data", ".bss" etc data sections. Under the medium model, data objects larger than 65535 bytes are allocated in the corresponding ".ldata" and ".lbss" data sections. Data in the "normal" data sections is referenced efficiently with direct access, whereas data in the ".l" data sections is referenced with indirect access. What this means is that the performance hit may be minimized while the data access range is increased.
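As a rough model of the placement rule just described, the sketch below picks a section name from an object's size and whether it is initialized. The 65535-byte threshold and the section names come from the text; the function name medium_model_section is hypothetical, and a real compiler's decision may involve more than size alone.

```c
#include <stddef.h>
#include <stdbool.h>

/* Threshold quoted in the text: objects larger than this many bytes
   go to the ".l" sections under the medium model. */
#define LARGE_OBJECT_THRESHOLD 65535u

/* Hypothetical helper: name of the data section an object of the given
   size would land in under the medium model. */
const char *medium_model_section(size_t size, bool initialized)
{
    bool large = size > LARGE_OBJECT_THRESHOLD;
    if (initialized)
        return large ? ".ldata" : ".data";
    return large ? ".lbss" : ".bss";
}
```

For example, a 64 KB uninitialized buffer (65536 bytes) would land in ".lbss" and be reached indirectly, while a 4 KB table would stay in ".data" or ".bss" with direct access.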
Posted at 09:57PM Jan 22, 2006 by alblog in Sun | Comments[0]
Ridding or modifying hardware capabilities info
Yes, yes, time to start blogging at Sun. Give people more info on Sun Studio ...
Ever experienced the following? You created an app targeted at certain "newer" platforms, then one day came across an "older" machine and carelessly tried to run your app on it. What happened? When an instruction specific to the newer machine was executed on the older one, which did not recognize it, the run simply ended with a rude message from the O.S.: "Invalid Instruction". At this stage, it may take some effort to figure out the culprit.
Solaris 10 has a new feature, namely the Hardware Capabilities checking mechanism, in which the runtime linker stops such an app before running it.
Let's say we have a pack.o which contains x86's SSE2 and AMD's 3DNow instructions. You can check it using a file command:
% file pack.o
pack.o: ELF 32-bit LSB relocatable 80386 Version 1 [SSE2 AMD_3DNow]
Apparently, any app linking in pack.o will not be able to run on a non-AMD machine. The runtime linker under Solaris 10 will return an error:
% a.out
ld.so.1: a.out: fatal: hardware capability unsupported: 0x100 [ AMD_3DNow ]
The trick is that the compiler, during the assembly phase, emits a section marking the categories of all the instructions encountered, as listed in /usr/include/sys/auxv_386.h. The runtime linker checks the combined bits of all instructions against the underlying platform.
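The check itself amounts to simple bit arithmetic, which the following sketch models in C. It is not the runtime linker's code: the function names are hypothetical, the 0x100 value for AMD_3DNow is taken from the error message above, and the SSE2 value is an assumption made for illustration only.

```c
#include <stdint.h>

/* Illustrative capability bits. 0x100 for AMD_3DNow matches the error
   message in the text; the SSE2 value is an assumption of this sketch. */
#define CAP_AMD_3DNOW  0x0100u
#define CAP_SSE2       0x1000u

/* Combine the capability masks of all objects in a link with bitwise OR. */
uint32_t combine_caps(const uint32_t *object_caps, int count)
{
    uint32_t all = 0;
    for (int i = 0; i < count; i++)
        all |= object_caps[i];
    return all;
}

/* Return the required bits that the machine lacks; 0 means the program
   may run, anything else corresponds to the "unsupported" error. */
uint32_t missing_caps(uint32_t required, uint32_t machine)
{
    return required & ~machine;
}
```

With one object needing SSE2 and another needing SSE2 and AMD_3DNow, a machine offering only SSE2 leaves 0x100 missing, which is the value reported in the error message.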
But there is a need for exceptions. Some code is written with different fragments for different platforms, and a choice is made at runtime based on checking the CPUID bits.
In this case a possible resolution may be
% cc -S -xarch=sse2 file.c
% fbe -nH file.s <-- file.o will have no marking
For relocatable object file:
% more mapfile
hwcap_1 = SSE2 OVERRIDE;
% cc -c -xarch=sse2 file.c -o file_pre.o
% ld -r -Mmapfile -o file.o file_pre.o
% file file_pre.o
file_pre.o: ELF 32-bit LSB relocatable 80386 Version 1 [SSE2 AMD_3DNow]
% file file.o
file.o: ELF 32-bit LSB relocatable 80386 Version 1 [SSE2]
For shared library:
% more mapfile
hwcap_1 = SSE2 OVERRIDE;
% cc -Mmapfile -G -o file.so file.o
% file file.o
file.o: ELF 32-bit LSB relocatable 80386 Version 1 [SSE2 AMD_3DNow]
% file file.so
file.so: ELF 32-bit LSB dynamic lib 80386 Version 1 [SSE2], dynamically linked, not stripped
Likewise for executable file.
One last trick, if you have a .s assembly file, you can assemble it with -showimap for verbose instruction category output.
% fbe -showimap pack.s
line 47: movdl => SSE2
line 51: movdl => SSE2
line 56: prefetchw => 3DNow
line 57: movq => SSE2
...
Posted at 06:00PM Jan 13, 2006 by alblog in Sun | Comments[0]