Yufei Zhu's Weblog

Tuesday Dec 06, 2005

UltraSPARC T1 performance on SAP

sap.html

We just launched UltraSPARC T1 based systems today. There are a lot of debates on whether it is a good fit for running SAP. Is response time an issue due to CMT architecture?

I have spent a lot of time working on SAP Sales and Distribution performance on Sun Fire T2000 system and it runs very well with very good response time.

Unlike any other cpu architecture, on UltraSPARC T1, each core has 4 hardware threads sharing one pipeline. There are a number of ways for a hardware thread to be scheduled or unscheduled on the single pipeline. Every clock tick a different thread is scheduled to run in round robin order. If the hardware thread is stalled because of the cache misses, the hardware will skip this thread and schedule next thread to run. If the hardware thread doesn't have any work to do, it also will be skipped and next thread is scheduled on the pipeline. This will make more efficient use of execution unit and avoid waste of cpu cycles.

CPU speed is only a small factor on how well an application can run on a system. More often, the cpu is stalled due to memory access. This simply means even when you get a system clocked at 3.0Ghz, the amount of useful cpu cycles you could get out of it is much less than that. How much less, it is depending on the cpu architecture and the application itself. But on Sun Fire T2000 system, the useful cpu cycles that is doing the real work is close to cpu speed.

The Solaris CMT scheduler is very smart. If you only have a few processes, the os scheduler will try to schedule them across cores evenly, so you don't run into a situation where all 4 threads on one core is 100% busy while the other cores are idle. Running a single threaded batch job can be an issue on UltraSPARC T1 based systems. But if the application has small working set, it should perform close to other cpu clocked at 1.2Ghz.

Let's look at some of the scaling data we collected. The experiment was done without much tunning. We incrementally enabled more hardware threads for each core. This is running SAP SD ECC 5.0 on Sun Fire T2000 box with 8 cores and 32GBytes memory.

    # of threads on each core		Performance/baseline
1 1x
2 1.7x
3 2.2x
4 2.5x
With two threads, the performance is improved by 70% than just using one hardware thread. Adding the third thread, it improves another 30%. Adding the fourth thread, it improves 14% more. More hardware threads are making execution units less idle.

Here are a few tips tunning SAP on UltraSPARC T1 based system. Remember, TLB is a precious resource, getting less tlb misses is one of the keys to high performance. Solaris integrated several changes to get executable's text, heap and anon on large page by default. Here is a few parameter you should use in SAP profile for all SAP instances.

es/use_shared_memory = true

These parameters will get extended memory and all other buffers to use shared memory segment. If "es/use_shared_memory" is not set, the extended memory will be on allocated on anon page. "

Also set this parameter

ipc/sem_fast_watchdog = 0

This will disable a watchdog process that is constantly checking if the shared memory segment is healthy, this process consumes a lot of cpu.

We are also using pool to reduce the number of shared memory segments it tries to allocate.

For example,

ipc/shm_psize_10 = 1290000000
ipc/shm_psize_01 = -10
ipc/shm_psize_02 = -10
ipc/shm_psize_03 = -10
ipc/shm_psize_04 = -10
................

This will tell operating system that SAP wants to allocated one 1.29GBytes shared memory segment, The first buffer with key 01 will be allocated from this big segment etc.

Note that SAP has different types of buffers. Most of them can be allocated from this pool if you can find the key
and set ipc/shm_psize_<key> = -10. And you can specify the size if you  don't like the default.

Most importantly, always do parameter checking before you start up SAP. This will iron out any potential errors. Generally if there is warning saying that certain parameter is unknown, it is ok to ingore them

sappfpar check pf=<profile name>


Here is an example output.

sappfpar check pf=E50_D10_ngpae2

================================================================================
== Checking profile: E50_D10_ngpae2
================================================================================


Shared memory disposition overview
================================================================
Shared memory pools
Key: 10 Pool
Size configured.....: 1290000000 (1230.2 MB)
Size min. estimated.: 1270473596 (1211.6 MB)
Advised Size........: 1274000000 (1215.0 MB)

Shared memories inside of pool 10
Key: 1 Size: 2500 ( 0.0 MB) System administration
Key: 2 Size: 213570652 ( 203.7 MB) Disp. administration tables
Key: 3 Size: 178144000 ( 169.9 MB) Disp. communication areas
Key: 4 Size: 509248 ( 0.5 MB) statistic area
Key: 6 Size: 308838400 ( 294.5 MB) ABAP program buffer
Key: 7 Size: 14838 ( 0.0 MB) Update task administration
Key: 8 Size: 67108964 ( 64.0 MB) Paging buffer
Key: 9 Size: 134217828 ( 128.0 MB) Roll buffer
Key: 11 Size: 1000000 ( 1.0 MB) Factory calender buffer
Key: 12 Size: 6000000 ( 5.7 MB) TemSe Char-Code convert Buf.
Key: 13 Size: 40500000 ( 38.6 MB) Alert Area
Key: 14 Size: 4400000 ( 4.2 MB) Presentation buffer
Key: 16 Size: 22400 ( 0.0 MB) Semaphore activity monitoring
Key: 17 Size: 2672386 ( 2.5 MB) Roll administration
Key: 18 Size: 917604 ( 0.9 MB) Paging adminitration
Key: 19 Size: 60000000 ( 57.2 MB) Table-buffer
Key: 30 Size: 13312 ( 0.0 MB) Taskhandler runtime admin.
Key: 31 Size: 4806000 ( 4.6 MB) Dispatcher request queue
Key: 33 Size: 10240000 ( 9.8 MB) Table buffer, part.buffering
Key: 41 Size: 25010000 ( 23.9 MB) DB statistics buffer
Key: 42 Size: 7232992 ( 6.9 MB) DB TTAB buffer
Key: 43 Size: 32534392 ( 31.0 MB) DB FTAB buffer
Key: 44 Size: 7958392 ( 7.6 MB) DB IREC buffer
Key: 45 Size: 4886392 ( 4.7 MB) DB short nametab buffer
Key: 46 Size: 20480 ( 0.0 MB) DB sync table
Key: 47 Size: 3073024 ( 2.9 MB) DB CUA buffer
Key: 48 Size: 300000 ( 0.3 MB) Number range buffer
Key: 49 Size: 2769392 ( 2.6 MB) Spool admin (SpoolWP+DiaWP)
Key: 51 Size: 3200000 ( 3.1 MB) Extended memory admin.
Key: 52 Size: 40000 ( 0.0 MB) Message Server buffer
Key: 54 Size: 4202496 ( 4.0 MB) Export/Import buffer
Key: 55 Size: 8192 ( 0.0 MB) Spool local printer+joblist
Key: 57 Size: 1048576 ( 1.0 MB) Profilparameter in shared mem
Key: 58 Size: 4096 ( 0.0 MB) Enqueue ID for reset
Key: 62 Size: 85983232 ( 82.0 MB) Memory pipes
Key: 63 Size: 409600 ( 0.4 MB) ICMAN shared memory
Key: 64 Size: 4202496 ( 4.0 MB) Online Text Repository Buf.
Key: 65 Size: 4202496 ( 4.0 MB) Export/Import Shared Memory
Key: 1002 Size: 400000 ( 0.4 MB) Performance monitoring V01.0
Key: 58900110 Size: 4096 ( 0.0 MB) SCSA area


Nr of operating system shared memory segments: 1

Shared memory resource requirements estimated
================================================================
Total Nr of shared segments required.....: 1
System-imposed number of shared memories.: 1000
Shared memory segment size required min..: 1274000000 (1215.0 MB)
System-imposed maximum segment size......: 44828721152 (42752.0 MB)

Swap space requirements estimated
================================================
Shared memory....................: 1218.8 MB
..in pool 10 1211.6 MB, 98% used
..not in pool: 0.0 MB
Processes........................: 107.9 MB
Extended Memory .................: 2000.0 MB
------------------------------------------------
Total, minimum requirement.......: 3326.7 MB
Process local heaps, worst case..: 1907.3 MB
Total, worst case requirement....: 5234.1 MB

Errors detected..................: 0

After SAP instance is started up, always run "pmap -xs <pid>" to see the address space layout of sap processes and if they are on large page as much as possible.

Here is an example output
pmap -xs 15272

15272: dw.sapE50_D10 pf=/usr/sap/E50/SYS/profile/E50_D10_ngpae2
Address Kbytes RSS Anon Locked Pgsz Mode Mapped File
0000000080000000 4096 4096 - - 4M r-x-- disp+work
0000000080400000 24576 16384 - - - r-x-- disp+work
0000000081C00000 3840 512 - - - r-x-- disp+work
0000000081FC0000 32 8 - - - r-x-- disp+work
00000000822C6000 40 8 - - - rwx-- disp+work
00000000822D0000 192 192 128 - 64K rwx-- disp+work
0000000082300000 256 - - - - rwx-- disp+work
0000000082340000 192 192 192 - 64K rwx-- disp+work
0000000082370000 64 - - - - rwx-- disp+work
0000000082380000 64 64 64 - 64K rwx-- disp+work
0000000082390000 640 - - - - rwx-- disp+work
0000000082430000 384 384 384 - 64K rwx-- disp+work
0000000082490000 64 64 - - - rwx-- disp+work
00000000824A0000 192 192 192 - 64K rwx-- disp+work
00000000824D0000 192 - - - - rwx-- disp+work
0000000082500000 704 704 704 - 64K rwx-- disp+work
00000000825B0000 1664 128 - - - rwx-- disp+work
0000000082750000 16 - - - - rwx-- disp+work
0000000084000000 36864 36864 36864 - 4M rwx-- [ heap ]
0000010000000000 1048576 1048576 - 1048576 256M rwxsR [ ism shmid=0x30001f9 ]
0000010040000000 208896 208896 - 208896 4M rwxsR [ ism shmid=0x30001f9 ]
000001004CC00000 2240 2240 - 2240 64K rwxsR [ ism shmid=0x30001f9 ]
000001004CE30000 56 56 - 56 8K rwxsR [ ism shmid=0x30001f9 ]
0000010050000000 8 8 - 8 8K rwxsR [ ism shmid=0x3000263 ]
0000010050400000 4096 4096 - 4096 4M rwxsR [ ism shmid=0x100005e ]
0000010060000000 1835008 1835008 - 1835008 256M rwxsR [ ism shmid=null ]
00000100D0000000 212992 212992 - 212992 4M rwxsR [ ism shmid=null ]
...........
...........

---------------- ---------- ---------- ---------- ----------
total Kb 3428296 3391800 40344 3311872



cpustat is good tool to use if you suspect tlb misses are too high on UltraSPARC T1 based system.

/usr/sbin/cpustat -c pic0=ITLB_miss,pic1=Instr_cnt \
-c pic0=DTLB_miss,pic1=Instr_cnt 1 10

trapstat is also a great tool to show how much cpu time is spent in itlb and dtlb misses.

Here is an example output "trapstat -T 10"

cpu m size| itlb-miss %tim itsb-miss %tim | dtlb-miss %tim dtsb-miss %tim |%tim
----------+-------------------------------+-------------------------------+----
0 u 8k| 1261 0.1 44 0.0 | 17422 0.8 50 0.0 | 0.9
0 u 64k| 2560 0.2 3 0.0 | 166073 7.1 9 0.0 | 7.3
0 u 4m| 1378 0.1 1 0.0 | 43193 2.6 51 0.0 | 2.8
0 u 256m| 0 0.0 0 0.0 | 33228 1.5 53 0.0 | 1.5
- - - - - + - - - - - - - - - - - - - - - + - - - - - - - - - - - - - - - + - -
0 k 8k| 560 0.0 0 0.0 | 26409 1.1 18 0.0 | 1.1
0 k 64k| 0 0.0 0 0.0 | 0 0.0 0 0.0 | 0.0
0 k 4m| 0 0.0 0 0.0 | 640 0.0 2 0.0 | 0.0
0 k 256m| 0 0.0 0 0.0 | 2719 0.2 0 0.0 | 0.2
-------------------------------------------------------------------------------


Always check the SAP R/3 buffer stats from SAP snapshot to make sure there is no swapping.

Also make sure proper number of update processes have been configured so the update queue length is less than 5 on average. The VB-Que from system enqueue performance output file ENQHIS should be a good indicator on if more update processes should be configured.

We also recommend using 4 SAP instances on 8 core based Sun Fire T2000 system.

Check out  SAP SD 2 tier benchmark website  for our latest benchmark results on Sun Fire T2000.

[T:]

Comments:

Post a Comment:
  • HTML Syntax: NOT allowed

Calendar

Feeds

Search

Links

Navigation

Referrers