Tuesday December 06, 2005 | Sun Sensible Innovative Performance Ideas from Nicolai Kosche |
DProfile - Piloting Sun Studio 11 Performance Analyzer

Fasten your seat belt, we're about to go on a multi-dimensional ride into your program - look inside the machine! I have a sample scalability problem that I will now debug with Sun Studio 11 Performance Analyzer with DProfile.

I've recompiled my program with these flags:

    cc -xhwcprof -g -c -xO4 th.c

and linked with these flags:

    cc -xO4 -o fn3.out fn3.o th.o -lthread -xhwcprof -g

Collect the experiment with 6 software threads for analysis using the collect command:

    % collect -p +on -A copy -F on fn3.out -N 6

I have used an UltraSPARC-IV+ processor in this test, so this is the .er.rc file that I created to define the perspectives for Analyzer. Note that the latter portion of the file contains the processor-specific objects for the Sun Fire F6900 Server running the UltraSPARC-IV+ processor.

    en_desc on
    mobj_define Vaddr VADDR
    mobj_define Paddr PADDR
    mobj_define Process PID
    mobj_define Thread (PID*1000)+THRID
    mobj_define ThreadID THRID
    mobj_define Seconds (TSTAMP/1000000000)
    mobj_define Minutes (TSTAMP/60000000000)
    mobj_define US4p_L1DataCacheLine (VADDR&0x1fe0)>>5
    mobj_define US4p_L2CacheLine (PADDR&0x7ffc0)>>6
    mobj_define US4p_L3CacheLine (PADDR&0x7fffc0)>>6
    mobj_define VA_L2 VADDR>>6
    mobj_define VA_L1 VADDR>>5
    mobj_define PA_L2 PADDR>>6
    mobj_define PA_L1 PADDR>>5
    mobj_define US4p_T512_8k (VADDR&0x1fe000)>>13
    mobj_define US4p_T512_64k (VADDR&0xff0000)>>16
    mobj_define US4p_T512_512k (VADDR&0x7f80000)>>19
    mobj_define US4p_T512_4M (VADDR&0x3fc00000)>>22
    mobj_define US4p_T512_32M (VADDR&0x1fe000000)>>25
    mobj_define US4p_T512_256M (VADDR&0xff0000000)>>28
    mobj_define Vpage_32M VADDR>>25
    mobj_define Vpage_256M VADDR>>28
    mobj_define Ppage_32M PADDR>>25
    mobj_define Ppage_256M PADDR>>28
    mobj_define Processor CPUID&0x1ff
    mobj_define Core CPUID&0x3ff
    mobj_define Processor_Board (CPUID&0x1fc)>>2
    mobj_define CoreID CPUID>>9

Fire up analyzer, and let's roll:

    % analyzer

This screen appears:

Note in the upper left corner two columns: User CPU and Max. Mem. Stall. User CPU refers to the total execution time of the application, while Max. Mem. Stall refers to the time spent in the memory subsystem. I will only focus on the memory subsystem column in my blog. (In our scaling problem, we're spending almost all of the time in the memory subsystem.)
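To make the .er.rc expressions above a little more concrete, here is a minimal C sketch of the arithmetic behind four of them. The masks and shifts are copied from the file; the function names and the `field` helper are mine, invented for illustration, and this is not how Analyzer itself evaluates the expressions.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical helper: keep the bits selected by `mask`, then shift right.
 * This is the shape of every (ADDR&mask)>>shift expression in the file. */
static uint64_t field(uint64_t addr, uint64_t mask, unsigned shift)
{
    assert(shift < 64);   /* shifting a 64-bit value by 64 or more is undefined in C */
    return (addr & mask) >> shift;
}

/* mobj_define US4p_L1DataCacheLine (VADDR&0x1fe0)>>5 */
uint64_t us4p_l1_data_cache_line(uint64_t vaddr)
{
    return field(vaddr, 0x1fe0, 5);
}

/* mobj_define US4p_L2CacheLine (PADDR&0x7ffc0)>>6 */
uint64_t us4p_l2_cache_line(uint64_t paddr)
{
    return field(paddr, 0x7ffc0, 6);
}

/* mobj_define VA_L2 VADDR>>6  (no mask: every 64-byte chunk gets its own id) */
uint64_t va_l2(uint64_t vaddr)
{
    return vaddr >> 6;
}

/* mobj_define PA_L2 PADDR>>6 */
uint64_t pa_l2(uint64_t paddr)
{
    return paddr >> 6;
}
```

With these definitions, physical addresses 0x1000 and 0x81000 both evaluate to US4p_L2CacheLine 0x40, while their PA_L2 values differ (0x40 and 0x2040). So one US4p_L2CacheLine value can cover several distinct 64-byte chunks of memory, and the unmasked VA_L2 and PA_L2 objects are what tell those chunks apart.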
You will need to find the Data Presentation button:
Selecting the Data Presentation button brings up this dialog:

I'll select the exclusive metrics and percentage reporting:

Now I'll select the Tabs option in the panel, and see all of the available perspectives in the analyzer tabs:

The Performance Analyzer built-in objects are on the left; the right column has the built-in virtual and physical page objects and all of the objects defined via the .er.rc file. These are the default settings in Sun Studio 11 Performance Analyzer.

Since we are looking at a scalability problem, I will enable the cache hierarchy object US4p_L2CacheLine, its associated virtual address object VA_L2, and its associated physical address object PA_L2. Since we will likely want to look at virtual addresses, I will also enable the Vaddr object, along with the Seconds, Core and Thread objects, to give you a sense of what DProfile can do for your understanding of my application. Press OK and let's look inside the machine!

We can look at the Source of the function and see that one C language statement is taking all of the time in the program:

One statement is causing the problem. But why? Processors share data via L2 cache lines. Let's see what the L2 cache line profile looks like. Just press the tab associated with L2 cache lines:

We see one cache line is taking 94% of all memory system time. We can drill down and find out why!
Click on the first L2 cache line, and press the
The dialog below appears. The filter clause for the L2 cache line is in the dialog box. The current filter is also displayed; it is currently empty (all data is viewed). You can AND the filter clause with the current filter, OR it with the current filter, or SET (assign) the current filter to the selected clause. Press the SET button to assign the filter, and then OK.

Now you can press the VA_L2 tab. This tab returns the virtual address groupings that mapped into this L2 cache line. The question we are asking here is: how many copies of my virtual address space were mapped into the L2 cache line? One virtual address range was mapped into this one hot line!
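The AND, OR and SET choices in the filter dialog behave like set operations on the profile events that stay visible. Here is a small C sketch of that idea, under loud assumptions: the `event` struct, the `filter_t` bitmask and every function name are hypothetical, and an empty filter (all data viewed) is modeled as all bits set. It illustrates the semantics only, not Analyzer's internals.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical profile event; a real experiment records far more. */
struct event {
    uint64_t vaddr;
    uint64_t paddr;
};

/* A filter is modeled as a bitmask: bit i set means event i is shown. */
typedef uint32_t filter_t;

/* Clause selecting events whose US4p_L2CacheLine value equals `line`. */
filter_t clause_l2_line(const struct event *ev, size_t n, uint64_t line)
{
    filter_t f = 0;
    assert(n <= 32);                 /* one bit per event in a 32-bit mask */
    for (size_t i = 0; i < n; i++)
        if (((ev[i].paddr & 0x7ffc0) >> 6) == line)
            f |= (filter_t)1 << i;
    return f;
}

/* Clause selecting events whose VA_L2 value (VADDR>>6) equals `group`. */
filter_t clause_va_l2(const struct event *ev, size_t n, uint64_t group)
{
    filter_t f = 0;
    assert(n <= 32);
    for (size_t i = 0; i < n; i++)
        if ((ev[i].vaddr >> 6) == group)
            f |= (filter_t)1 << i;
    return f;
}

/* SET replaces the current filter, AND narrows it, OR widens it. */
filter_t filter_set(filter_t current, filter_t clause) { (void)current; return clause; }
filter_t filter_and(filter_t current, filter_t clause) { return current & clause; }
filter_t filter_or(filter_t current, filter_t clause)  { return current | clause; }
```

In these terms, the walkthrough first does a SET with the hot L2 cache line clause and later ANDs in a VA_L2 clause, so each step can only shrink the set of events being charged.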
What virtual addresses were in this range? Press the
Now AND the clauses together: We look at Vaddr and see the 6 addresses that are taking all the time in the L2 cache line: One L2 Cache Line; six addresses; six threads! Coincidence?
We can view the Seconds tab and see when in time these filtered costs occurred. We confirm that each address was used by just one thread; the six threads were falsely sharing the L2 cache line. We can view every object in hardware and software... and understand their relationships! FAST! False sharing of all hardware structures is detected:
Scalability at your fingertips: DProfile - Look Inside the Machine!
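The "one line, six addresses, six threads" pattern comes down to address arithmetic: six adjacent per-thread items fall into the same VADDR>>6 group. Here is a minimal C sketch that counts the groups touched by a layout. The 8-byte element size, the base addresses and all names are my assumptions, not taken from the profiled program, and the padded struct shows one commonly used remedy that this walkthrough does not itself cover.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define L2_LINE_SHIFT 6                       /* the >>6 in VA_L2 and PA_L2 */
#define L2_LINE_BYTES (1u << L2_LINE_SHIFT)   /* 64 bytes */

/* Hypothetical padded per-thread counter: each one fills a 64-byte slot. */
struct padded_counter {
    long count;
    char pad[L2_LINE_BYTES - sizeof(long)];
};

/* Count the distinct addr>>6 groups touched by n elements placed `stride`
 * bytes apart, starting at address `base`. Addresses only increase, so a
 * new group starts exactly when the group id changes. */
size_t distinct_l2_groups(uint64_t base, size_t stride, size_t n)
{
    size_t groups = 0;
    uint64_t prev = 0;
    assert(stride > 0);
    for (size_t i = 0; i < n; i++) {
        uint64_t g = (base + (uint64_t)i * stride) >> L2_LINE_SHIFT;
        if (i == 0 || g != prev)
            groups++;
        prev = g;
    }
    return groups;
}
```

Six 8-byte elements starting at a 64-byte-aligned address occupy a single group, which matches the one hot line in the Vaddr tab. The same six elements spaced sizeof(struct padded_counter) bytes apart occupy six groups, so each thread would work on a line of its own.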
[ T: NiagaraCMT ] ( Dec 06 2005, 02:28:25 PM PST )

DProfile - The Scalability Infrastructure

We have talked about all the Perspectives, Insight and Knowledge available with DProfile. The purpose of all this technology is scalability. As outlined in the Getting Ready for CMT section, scalability is key. And keeping all hardware structures equally utilized is the cornerstone of scalability: keeping all Virtual Processors busy, keeping all Memory Boards equally busy, and using your caches and banks uniformly. My previous entry showed you how to profile all of these hardware components by teaching Sun Studio 11 Performance Analyzer about Perspectives. Here, I'd like to show you some pseudo-code fragments that, when run in multiple Software Threads within a Process, cause a scalability problem:
There is only one central lock that guards the shared variable var. I chose this obvious case because its Data Movement Profile is dominated by one cache line taking >95% of time between all Threads. Here is another pseudocode fragment running in multiple Software Threads that exhibits the same scalability problem:

    for (;(array[index]++ <= limit);) ;

The array is global, with a dedicated index for every Virtual Thread. This is not as obvious, but this case also has the same Data Movement Profile as the previous obvious example: one cache line holding the entire array is being passed among all Threads, and is taking >95% of time. This is false sharing, the next inhibitor to scalability after lock contention. While the first example is easy to detect with lockstat and DProfile, only DProfile can identify the second example. Any time only a few hardware elements within a structure are being utilized (detected by DProfile), a scalability problem is available for resolution with DProfile. Here are some examples:
One Memory Board used more frequently than others. With DProfile you can select each of these objects, Filter, and then identify what Software View components are responsible for the underutilization.

[ T: NiagaraCMT ] ( Dec 06 2005, 10:59:05 AM PST )

DProfile - Teaching Analyzer Perspectives

I've changed Dataspace Profiling to DProfile. The application costs are broken down between execution time and memory-subsystem time in the Functions tab. You will be able to view and operate on the memory-subsystem time through all of the perspectives reviewed earlier. Many of the perspectives in the perspectives table are built-in, such as the load object, function, PC, data object, and virtual and physical pages (8k, 64k, 512k and 4M). All other perspectives are programmed into Performance Analyzer through the .er.rc file using the expression grammar. All of the profile perspectives of the machine are created by these expressions. This blog will cover the machine-independent and machine-dependent human-readable expressions. Future entries will include tools that create expressions as well.

Time

Performance Analyzer has the Timeline view. A simpler view is provided by the Seconds and Minutes perspectives. These provide a breakdown of memory-subsystem time in seconds or minutes. By selecting the column heading, you change the sort order of the object. By selecting "Max Mem Time", you will order from the most costly to the least costly time intervals. By selecting Name, you will order chronologically. Selecting the Graphical radio button gives you an insightful graphical view of your application through time.

    en_desc on
    mobj_define Seconds (TSTAMP/1000000000)
    mobj_define Minutes (TSTAMP/60000000000)

The first line in the .er.rc file instructs the engine behind Performance Analyzer to analyze the entire process tree (all descendants) created by the collect command. The second line defines Seconds from the collected time stamps.
The third line defines Minutes from the collected time stamps.

Software Execution Threads

An application may span multiple processes and contain many threads. Two perspectives (objects) are useful in analysis: the Thread and the ThreadID. Either represents the Software Execution object within our application. The Thread is a unique identifier across the application, while the ThreadID is a unique identifier only within each Process of the application.

    mobj_define Thread (PID*1000)+THRID
    mobj_define ThreadID THRID

The application will allocate memory in a Process, through the use of virtual memory. Solaris will allocate and map physical memory for this virtual memory.

    mobj_define Vaddr VADDR
    mobj_define Paddr PADDR
    mobj_define Process PID

Since we're announcing the Sun Fire CoolThreads Servers using the UltraSPARC T1 processor, here are the hardware-specific Sun Fire CoolThreads Server formulas you add to your .er.rc file to identify cache hierarchy and hardware objects in the system:

    mobj_define UST1_Bank (PADDR&0xc0)>>6
    mobj_define UST1_L2CacheLine (PADDR&0x3ffc0)>>6
    mobj_define UST1_L1DataCacheLine (PADDR&0x7f0)>>4
    mobj_define UST1_Strand (CPUID)
    mobj_define UST1_Core (CPUID&0x1c)>>2
    mobj_define VA_L2 VADDR>>6
    mobj_define VA_L1 VADDR>>4
    mobj_define PA_L2 PADDR>>6
    mobj_define PA_L1 PADDR>>4
    mobj_define Vpage_256M VADDR>>28
    mobj_define Ppage_256M PADDR>>28

Niagara has L2 Cache and physical memory grouped by UST1_Bank. Based on Paddr, a Bank is selected; accesses are then serviced by the portion of the L2 cache in that bank. If a reference misses the L2 Cache, the memory controller associated with that Bank will service the miss. UST1_L2CacheLine returns the unique identifier for any L2 Cache Line in an UltraSPARC_T1 processor. UST1_L1DataCacheLine returns the unique identifier for an L1 Cache Line within one UST1_Core. This does not return the unique identifier for an L1 Cache Line within a UST1_Processor.
UST1_Strand returns the identifier for each UltraSPARC_T1 Virtual Processor. UST1_Core returns the identifier for the UltraSPARC_T1 Execution Core. VA_L2 returns the identifier for Virtual Memory grouped in UltraSPARC_T1 L2CacheLine-sized chunks. VA_L1 returns the identifier for Virtual Memory grouped in UltraSPARC_T1 L1DataCacheLine-sized chunks. The previous two formulas are useful for relating Hardware View cache line costs back to Program Address Space. You'll filter on a hardware object, and relate it back to the virtual memory allocations for that hardware object. PA_L2 and PA_L1 provide similar grouping in Physical Memory Paddr. These formulas are useful for relating Hardware View cache line costs back to Solaris physical address allocations. I'll show you how to use these formulas in my next entry.

[ T: NiagaraCMT ] ( Dec 06 2005, 10:56:34 AM PST )
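As a rough illustration of the UltraSPARC T1 formulas in the entry above, this C sketch applies the same masks and shifts to sample values. The constants come straight from the mobj_define lines; the function names and the `ust1_field` helper are hypothetical, and the sketch makes no claim about how Analyzer evaluates the expressions.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical helper: keep the bits in `mask`, then shift right. */
static uint64_t ust1_field(uint64_t value, uint64_t mask, unsigned shift)
{
    assert(shift < 64);   /* shifting a 64-bit value by 64 or more is undefined in C */
    return (value & mask) >> shift;
}

/* mobj_define UST1_Bank (PADDR&0xc0)>>6 : two bits, so values 0..3 */
uint64_t ust1_bank(uint64_t paddr)
{
    return ust1_field(paddr, 0xc0, 6);
}

/* mobj_define UST1_L2CacheLine (PADDR&0x3ffc0)>>6 */
uint64_t ust1_l2_cache_line(uint64_t paddr)
{
    return ust1_field(paddr, 0x3ffc0, 6);
}

/* mobj_define UST1_L1DataCacheLine (PADDR&0x7f0)>>4 */
uint64_t ust1_l1_data_cache_line(uint64_t paddr)
{
    return ust1_field(paddr, 0x7f0, 4);
}

/* mobj_define UST1_Core (CPUID&0x1c)>>2 : drops the two low strand bits */
uint64_t ust1_core(uint64_t cpuid)
{
    return ust1_field(cpuid, 0x1c, 2);
}
```

Because UST1_Bank uses bits 6 and 7 of Paddr, consecutive 64-byte chunks of physical memory rotate through banks 0, 1, 2, 3, and the low two bits of a UST1_L2CacheLine value are always its bank. Likewise, CPUID values 0 through 3 map to core 0 and values 4 through 7 map to core 1.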