compiler thoughts

Friday July 14, 2006

inline assembler

One of the useful features of gcc is inline assembler. There is an old-style inline asm and a newer one that accepts C expressions as operands. GCC doesn't contain an assembler, so it cannot look inside the assembler instructions and parse them. Users have to specify clobbered registers and tell the compiler about input/output operands. The syntax analysis is done by the assembler that came with the operating system, so it's tough to associate an error reported by the assembler with the source code.
gcc4ss went a step further. The Sun codegen has assembler functionality built in, and not just an assembler but an optimizing assembler. Simple assembler templates are processed early and codegen can optimize them. Templates with control flow or volatile marking are considered complex, and codegen inlines them as-is.
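
For readers who haven't used the newer form, here is a minimal sketch of extended inline asm with C expressions as operands. The function name asm_add is made up for illustration, and the non-SPARC branches exist only so the snippet builds on other machines. It shows the operand and clobber syntax only; how the gcc4ss codegen classifies or optimizes such a template is not something this sketch demonstrates.

```c
/* Add two ints with extended inline asm. asm_add is an illustrative name. */
int asm_add(int a, int b)
{
    int sum;
#if defined(__sparc__)
    /* "=r": output operand in a register; "r": input operands in registers.
       No control flow and no volatile marking in this template. */
    __asm__("add %1, %2, %0" : "=r"(sum) : "r"(a), "r"(b));
#elif defined(__x86_64__) || defined(__i386__)
    /* Assumption: AT&T syntax, the gcc default on x86. addl modifies the
       condition flags, so "cc" is listed as clobbered. */
    sum = a;
    __asm__("addl %1, %0" : "+r"(sum) : "r"(b) : "cc");
#else
    sum = a + b; /* plain C fallback where the asm syntax is unknown */
#endif
    return sum;
}
```

The compiler never parses the template string itself. It only substitutes registers for %0, %1, %2 and trusts the constraint and clobber lists, which is why a mistake there tends to show up as an assembler error or a silent miscompile rather than a compiler diagnostic.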

During gcc4ss development the inline asm implementation turned out to be very efficient, fast and reliable, so we implemented most of the other GNU language extensions using inline asm in the front-end/IR generator.

Posted by alexey ( Jul 14 2006, 04:47:54 PM PDT )

Friday April 21, 2006

gcc flag compatibility

As you know, gcc4ss accepts all the flags that plain gcc has. In the first release we mapped -mcpu flags to their -xarch equivalents, but didn't do any backwards mapping from studio flags (-xarch, -xtarget, -xchip) to gcc flags (-mcpu, -mtune), and that caused some confusion. Some users noticed -mcpu=v7 being passed to the 'cc1' component and concluded that gcc4ss generates v7 code. That's not the case. Sun IR generation in cc1 is the same for any -mcpu flag, and the actual architecture and tuning are controlled by the backend flags passed to the 'cg' component.

For the next gcc4ss update the flag mapping -x* <-> -m* is going to be more consistent. A draft of the new mapping is here: http://cooltools.sunsource.net/gcc/mcpu.html

The default architecture is still v8plus (for 32-bit) and v9 (for 64-bit).
In GCC terminology the v8plus architecture is -mcpu=v9.
Don't confuse it with gcc's -mv8plus flag. v8plus mode in gcc is actually the default on Solaris, but it gets overridden by the -mcpu=v7 default. So the gcc 4.0 and 4.1 default on Solaris is the v7 ISA. Confusing, right? gcc 4.2 is switching to -mcpu=v9 for Solaris 7+, so it's going to generate the v8plus ISA by default.
For gcc4ss, -mv8plus is going to work as its name suggests. -mv8plus in gcc4ss will turn the v8plus ISA on, whereas in plain gcc 4.x -mv8plus makes no difference at all. The GCC docs don't help here. Docs quote:
"With -mv8plus, GCC generates code for the SPARC-V8+ ABI. The difference from the V8 ABI is that the global and out registers are considered 64-bit wide. This is enabled by default on Solaris in 32-bit mode for all SPARC-V9 processors."
So it's natural to expect the flag to do something, and I've seen quite a few gcc users adding -mv8plus hoping to get the v8plus ISA and being disappointed that their code didn't improve. The problem is that the gcc driver passes -mcpu=v7 to cc1, which overrides both the -mv8plus flag and the v8plus default. Even if the gcc driver didn't pass it, cc1's internal default of -mcpu=v7 is stronger than the -mv8plus flag, so users don't see any v8plus code unless they specify -mcpu=v9. At least -mno-v8plus works as expected.
gcc4ss's -mv8plus is really going to change the ISA to v8plus.
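
One way to see which ISA a given set of flags really selected is to ask the preprocessor. The sketch below is only a rough check: the function name target_isa is made up, and the macro names are an assumption based on what gcc and Sun Studio commonly predefine on SPARC, so verify them against the output of `gcc -dM -E - < /dev/null` with your own flags before relying on them.

```c
/* Report which SPARC ISA the compiler claims to target.
   The macro names below are assumptions (common gcc / Sun Studio
   predefines), not a list taken from the gcc4ss documentation. */
const char *target_isa(void)
{
#if !defined(__sparc__) && !defined(__sparc)
    return "not sparc";
#elif defined(__sparcv9) || defined(__arch64__)
    return "sparc v9 (64-bit)";
#elif defined(__sparc_v9__) || defined(__sparcv8plus)
    return "sparc v8plus (v9 instructions, 32-bit ABI)";
#elif defined(__sparc_v8__) || defined(__sparcv8)
    return "sparc v8";
#else
    return "sparc, no v8/v9 macro seen (possibly v7)";
#endif
}
```

Printing the result from a small test program built once with -mv8plus alone and once with -mcpu=v9 should make the difference described above visible, if the macros behave as assumed.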

Posted by alexey ( Apr 21 2006, 05:22:44 PM PDT )

Monday March 13, 2006

inlining

There was a controversial posting regarding inlining in gcc/gcc4ss and I feel it needs to be clarified.

gcc4ss uses the gimple form of tree representation that was first introduced in gcc 4.0. gcc4ss runs most of the gcc optimizations without any interference and takes whatever gimple form is left after those optimizations. If the optimization level was -O3, the gimple form for a given function may already contain some inlined functions, so IR conversion may not happen for those inlined functions. Therefore the IR passed to scg4ss may contain fewer functions than were present in the original code. Since gcc inlines mostly very small functions, this frees a few extra cycles for the scg4ss optimizer. In C code that's a rare situation, but in C++ this approach is considered quite helpful. The scg4ss profiler will not be able to collect any statistics for such inlined functions, but in the case of C++ it's even better not to be distracted by lots of small methods.

gcc4ss also preserves gcc function attributes related to inlining when there is an equivalent in scg4ss's IR. Those attributes are nicely matched with profile feedback and cross file optimizations in scg4ss.
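
As a reminder of what such attributes look like on the gcc side, here is a small sketch using always_inline and noinline, which are standard gcc function attributes. The function names are made up, and the post does not say which attributes actually have an scg4ss IR equivalent, so treat the choice of these two as an assumption.

```c
/* always_inline asks gcc to inline the function even where its own
   heuristics would not; noinline forbids inlining. square and
   sum_of_squares are illustrative names. */
static inline int square(int x) __attribute__((always_inline));
static inline int square(int x) { return x * x; }

int sum_of_squares(const int *v, int n) __attribute__((noinline));
int sum_of_squares(const int *v, int n)
{
    int i, total = 0;
    for (i = 0; i < n; i++)
        total += square(v[i]); /* expanded in place, no call emitted */
    return total;
}
```

With this arrangement square should never appear as a separate function in the IR handed to the back end, while sum_of_squares always stays a real call that a profiler can attribute time to.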

The funny thing about inlining is that a few testcases in the gcc testsuite had to be modified to pass with gcc4ss, because scg4ss is quite aggressive about inlining at -O3 and sees more opportunities to inline where plain gcc 4.0 does not.

Posted by alexey ( Mar 13 2006, 01:33:52 AM PST )

Thursday March 09, 2006

gcc4ss flags

The goal of 100% compatibility of gcc4ss (GCC for SPARC Systems) with plain GCC wouldn't be achieved if we didn't support all gcc flags. So we do! gcc4ss accepts all gcc flags, plus we added more to control the Sun Code Generator for SPARC Systems (scg4ss).

The maximum optimization level is still -O3 (same as GCC). At -O3 gcc4ss performs initial inlining and passes IR (Internal Representation) to scg4ss, which does the advanced optimizations and further inlining. scg4ss's heuristics are tuned for SPARC processors and can be driven by profile feedback and inter-module/inter-procedure analysis. Unfortunately I'm not in a position to talk about exact numbers, but grab your favourite app and measure -O2 vs -O3 performance with gcc4ss. And send us your results of course!

On top of -O3 we added the -fast flag. Those familiar with Sun Studio know about this flag already. -fast is a macro for -O3 -xtarget=native -fns -fsimple=2 and other flags. -xtarget=native detects the architecture, chip and cache of the machine on which the compiler is running, so you don't have to worry about an improper -xarch or -xchip on your build server. Of course there is also -xtarget=generic in scg4ss for the 'blended' arch/chip model. -fns and -fsimple=2 allow scg4ss's optimizer to perform aggressive floating point transformations which do not strictly conform to IEEE 754, but make floating point code run much faster. Once you're comfortable with -O3, try -fast instead. That's what we use to run SPEC benchmarks.
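
To see why relaxing IEEE 754 rules can change results, consider floating point reassociation, one typical transformation in this class. The sketch below is generic C with made-up function names; whether scg4ss performs exactly this rewrite under -fsimple=2 is an assumption, not something stated in the flag documentation quoted here.

```c
/* The same three values summed in two different groupings. A strictly
   conforming compiler must keep each grouping as written; a relaxed
   floating point mode may be allowed to treat them as interchangeable. */
double sum_left(double a, double b, double c)
{
    return (a + b) + c;
}

double sum_right(double a, double b, double c)
{
    return a + (b + c);
}
```

Under strict evaluation, sum_left(1e30, -1e30, 1.0) gives 1.0 while sum_right(1e30, -1e30, 1.0) gives 0.0, because the 1.0 is lost when it is added to -1e30 first. Code that depends on that kind of detail should be checked carefully before switching to -fast.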

As an extra topping to your -fast shake you can add the -xipo flag to do inter-procedural optimizations. scg4ss's internal representation is stored within the object file and fetched back at link time, hence the optimizer can see the IR for all modules at once. During an -xipo build each module is compiled at an -O0-like level, so all the .o files are built quickly, but linking takes quite some time, because the optimizer needs to recompile all modules at the original optimization level and call the code generator for each .o again. -xipo works best with -xprofile.

The -xprofile flag should be used in two steps: step one collects training data with -xprofile=collect, and step two uses the profile data with -xprofile=use. Normally you don't have to use -xipo during the 'collect' phase if you want to use it during 'use', but it's recommended to keep the optimization level and other flags the same between the two phases.

There are a bunch of other performance-related flags.
Please read about them here:
http://cooltools.sunsource.net/gcc/flags.html

Alexey.

Posted by alexey ( Mar 09 2006, 02:08:17 PM PST )

Tuesday March 07, 2006

GCC for SPARC Systems is live!

Finally we got "GCC for SPARC Systems" out:
http://cooltools.sunsource.net/gcc/

Compiler optimizations are important for getting the most out of SPARC chips. Apps on x86 chips usually don't gain much when recompiled for a specific chip. The difference between the -mcpu=i386 and -mcpu=i686 gcc switches is relatively small. On SPARC the correct chip switch matters more. Apps compiled for US2 will suffer on early US3.

Plain GCC uses the SPARC V7 ISA as its default. That means no integer mul/div instructions and scheduling for a pretty old chip. Not every GCC user knows about the -mcpu flag and its effect. Some users use -mcpu=v9 for all their needs, because other -mcpu switches don't produce a worthwhile speedup. Some users add only the -mv8plus flag, see no difference in run-time and get upset. A few internal details of SPARC chips are not public, and it's tough for GCC to do perfect tuning. Read-after-write penalties are not modelled correctly in the GCC scheduler yet. The cost of the prefetch instruction and the implications of prefetch on the different SPARC chips didn't make it into the GCC backend. Users of gcc 4.0.x have to cope with early US3 tuning. The future gcc 4.2 release is going to have some support for Niagara. So Niagara users have the option to wait until a stable gcc 4.2 is released, try its development snapshot, or go with gcc4ss+scg4ss, which already has Niagara-specific tuning.
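
To make the cost of "no integer mul/div instructions" concrete: without a hardware multiply, every `a * b` in the source typically turns into a call to a software routine. The sketch below shows the general shift-and-add idea in C. The name soft_umul is made up, and this is not the actual runtime routine, which is hand-written assembler.

```c
/* Shift-and-add multiply: the kind of loop a software routine has to
   run when the ISA offers no integer multiply instruction.
   Illustrative only; soft_umul is a hypothetical name. */
unsigned soft_umul(unsigned a, unsigned b)
{
    unsigned result = 0;
    while (b != 0) {
        if (b & 1u)
            result += a; /* add the multiplicand for this set bit */
        a <<= 1;         /* move to the next bit weight */
        b >>= 1;
    }
    return result;
}
```

One iteration per significant bit of the multiplier, plus the call overhead, is a lot to pay for something a newer chip does in a single instruction, which is part of why moving off the V7 default matters for integer code.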

With GCC for SPARC Systems (gcc4ss), code generation, tuning, insn scheduling, etc. are done by the Sun Code Generator for SPARC Systems (scg4ss), which knows everything about the underlying architecture. Going from GCC's V7 to tuned code normally buys a nice speedup for integer code and a lot more for floating point.

The default arch selection for gcc4ss is v8plus (since it's tough nowadays to find a pre-v8plus chip) and tuning is done for the 'blended' model, which tries to produce close-to-optimum tuning on most modern SPARC chips. So just recompiling with gcc4ss with the same old gcc flags can give a significant boost in performance.

Alexey.

Posted by alexey ( Mar 07 2006, 09:21:54 AM PST )

Disclaimer:

This site is a personal blog and is to be used for informational purposes only. The views expressed on this blog are those of the author only, and should not be attributed to any past or present employers.
