Russ Blaine has described aspects of the x86 and x64 system call implementation in OpenSolaris. In this entry I'll describe the codeflow from userland to kernel and back for a SPARC system call.
Making A System Call
An application making a system call actually calls a libc wrapper function which performs any required posturing and then enters the kernel with a software trap instruction. This means that user code and compilers do not need to know the runes to enter the kernel, and allows binaries to work on later versions of the OS where perhaps the runes have been modified, system call numbers newly overloaded etc.
OpenSolaris for SPARC supports 3 software traps for entering the kernel:
| S/W Trap # | Instruction | Description |
|---|---|---|
| 0x0 | ta 0x0 | Used for system calls for binaries running in SunOS 4.x binary compatibility mode. |
| 0x8 | ta 0x8 | 32-bit (ILP32) binary running on 64-bit (LP64) kernel |
| 0x40 | ta 0x40 | 64-bit (LP64) binary running on 64-bit (LP64) kernel |
Since OpenSolaris (like Solaris from Solaris 10 onwards) no longer includes a 32-bit kernel, the trap for an ILP32 syscall on an ILP32 kernel is no longer implemented.
In the wrapper function the syscall arguments are rearranged if necessary (the kernel function implementing the syscall may expect them in a different order to the syscall API, for example multiple related system calls may share a single system call number and select behaviour based on an additional argument passed into the kernel). It then places the system call number in register %g1 and executes one of the above trap-always instructions (e.g., the 32-bit libc will use ta 0x8 while the 64-bit libc will use ta 0x40). There's a lot more activity and posturing in the wrapper functions than described here, but for our purposes we simply note that it all boils down to a ta instruction to enter the kernel.
Handling A System Call Trap
A ta n instruction, as executed in userland by the wrapper function, results in a trap of type 0x100 + n being taken, and we move from traplevel 0 (where all userland and most kernel code executes) to traplevel 1 in nucleus context. Code that executes in nucleus context has to be handcrafted in assembler since nucleus context does not comply with the ABI conventions and is generally much more restricted in what it can do. The task of the trap handler executing at traplevel 1 is to provide the necessary glue to get us back to TL0 and running privileged (kernel) C code that implements the actual system call.
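The arithmetic is easy to sketch in C. This is only an illustration, not kernel code: the function names are mine, the trap numbers come from the table above, and the 32-byte slot size is an assumption based on a trap table entry holding 8 four-byte instructions.

```c
#include <assert.h>
#include <stdint.h>

/* Software trap numbers from the table above. */
enum {
    ST_SUNOS4 = 0x00,   /* SunOS 4.x compatibility */
    ST_ILP32  = 0x08,   /* 32-bit binary */
    ST_LP64   = 0x40    /* 64-bit binary */
};

/* Assumption: each trap table slot is 8 instructions of 4 bytes each. */
#define TRAP_ENTRY_BYTES 32u

/* Trap type raised by "ta n": software traps start at 0x100. */
static uint32_t trap_type_for_ta(uint32_t n)
{
    assert(n < 0x80);   /* software trap numbers are 7 bits wide */
    return 0x100u + n;
}

/* Byte offset of a trap type's slot from the start of the TL0 table. */
static uint32_t trap_table_offset(uint32_t tt)
{
    return tt * TRAP_ENTRY_BYTES;
}

/* Which software trap a libc wrapper would use for a given data model. */
static uint32_t syscall_trap_number(int is_64bit)
{
    return is_64bit ? ST_LP64 : ST_ILP32;
}
```

So trap_type_for_ta(syscall_trap_number(0)) gives 0x108 and the 64-bit case gives 0x140, which are the two trap types we follow through the trap table.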
The trap table entries for sun4u and sun4v for these traps are identical. I'm going to follow the two regular syscall traps and ignore the SunOS 4.x trap. Note that a trap table handler has just 8 instructions dedicated to it in the trap table - it must use these to do a little work and then to branch elsewhere:
/*
* SYSCALL is used for system calls on both ILP32 and LP64 kernels
* depending on the "which" parameter (should be either syscall_trap
* or syscall_trap32).
*/
#define SYSCALL(which) \
TT_TRACE(trace_gen) ;\
set (which), %g1 ;\
ba,pt %xcc, sys_trap ;\
sub %g0, 1, %g4 ;\
.align 32
...
...
trap_table:
scb:
trap_table0:
/* hardware traps */
...
...
/* user traps */
GOTO(syscall_trap_4x); /* 100 old system call */
...
SYSCALL(syscall_trap32); /* 108 ILP32 system call on LP64 */
...
SYSCALL(syscall_trap) /* 140 LP64 system call */
...
So in both cases we branch to sys_trap, requesting a TL0 handler of syscall_trap32 for an ILP32 syscall and syscall_trap for an LP64 syscall. In both cases we request PIL to remain as it currently is (always 0 since we came from userland); that is what the -1 placed in %g4 by the sub in the delay slot asks for. sys_trap is generic glue code that is used to take us from nucleus (TL>0) context back to TL0 running a specified handler (address in %g1, usually written in C) at a chosen PIL. The specified handler is called with arguments as given by registers %g2 and %g3 at the time we branch to sys_trap: the SYSCALL macro above does not move anything into these registers (no arguments to be passed to the handler). sys_trap handlers are always called with a first argument pointing to a struct regs that provides access to all the register values at the time of branching to sys_trap; for syscalls these will include the system call number in %g1 and the arguments in the output registers. Note that %g1 as prepared in the wrapper and %g1 as used in the SYSCALL macro for the trap table entry are not the same register - on trap we move from regular globals (as userland executes in) on to alternate globals - but the sys_trap glue collects all the correct (user) registers together and makes them available in the struct regs it passes to the handler.
sys_trap is also responsible for setting up our return linkage. When the TL0 handling is complete the handler will return, restoring the stack pointer and program counter as constructed in sys_trap. Since we trapped from userland it will be user_rtt that is interposed as the glue that TL0 handling code will return into, and which will get us back out of the kernel and into userland again.
Aside: Fancy Improving Something In OpenSolaris?
Adam Leventhal logged bug 4816328 "system call traps should go straight to user_trap" some time ago. As described above, the SYSCALL macro branches to sys_trap:
ENTRY_NP(sys_trap)
!
! force tl=1, update %cwp, branch to correct handler
!
wrpr %g0, 1, %tl
rdpr %tstate, %g5
btst TSTATE_PRIV, %g5
and %g5, TSTATE_CWP, %g6
bnz,pn %xcc, priv_trap
wrpr %g0, %g6, %cwp
ALTENTRY(user_trap)
...
...
Well we know that we're at TL1 and that we were unprivileged before the trap, so (aside from the current window pointer manipulation, which Adam explains in the bug report - it's not required coming from a syscall trap) we could save a few instructions by going straight to user_trap from the trap table. Adam's benchmarking suggests that this can save around 45ns per system call - more than 1% of a quick system call!
syscall_trap32(struct regs *rp);
We'll follow the ILP32 syscall route; the route for LP64 is analogous with trivial differences in terms of not having to clear the upper 32 bits of arguments etc. You can view the source here. This runs at TL0 as a sys_trap handler so could be written in C, however for performance and hands-on-DIY assembler-level reasons it is in assembler. Our task is to look up and call the nominated system call handler, and to perform the required housekeeping along the way.
ENTRY_NP(syscall_trap32)
ldx [THREAD_REG + T_CPU], %g1 ! get cpu pointer
mov %o7, %l0 ! save return addr
First note that we do not obtain a new register window here - we will squat within the window that sys_trap crafted for itself. Normally this would mean that you'd have to live within the output registers, but by agreement handlers called via sys_trap are permitted to use registers %l0 thru %l3.
We begin by loading a pointer to the cpu this thread is executing on into %g1, and saving the return PC (as constructed by sys_trap and handed to us in %o7) in %l0.
!
! If the trapping thread has the address mask bit clear, then it's
! a 64-bit process, and has no business calling 32-bit syscalls.
!
ldx [%o0 + TSTATE_OFF], %l1 ! saved %tstate.am is that
andcc %l1, TSTATE_AM, %l1 ! of the trapping proc
be,pn %xcc, _syscall_ill32 !
mov %o0, %l1 ! save reg pointer
The comment says it all. The AM bit in the PSTATE at the time we trapped (executed the ta instruction) is available in the %tstate register after the trap, and sys_trap preserved that for us in the regs structure before it could be modified by further traps. Assuming we're not a 64-bit app making a 32-bit syscall:
srl %i0, 0, %o0 ! copy 1st arg, clear high bits
srl %i1, 0, %o1 ! copy 2nd arg, clear high bits
ldx [%g1 + CPU_STATS_SYS_SYSCALL], %g2
inc %g2 ! cpu_stats.sys.syscall++
stx %g2, [%g1 + CPU_STATS_SYS_SYSCALL]
The libc wrapper placed up to the first 6 arguments in %o0 thru %o5 (with the rest, if any, on stack). During sys_trap a SAVE instruction was performed to obtain a new register window, so those arguments are now available in the corresponding input registers (despite us not performing a save in syscall_trap32 itself). We're going to call the real handler so we prepare the arguments in our outputs (which we're sharing with sys_trap but outputs are understood to be volatile across calls). The shift-right-logical by 0 bits is a 32-bit operation (i.e., not srlx) so it performs no shifting but it does clear the uppermost 32 bits of the arguments. We also increment the statistic counting the number of system calls made by this cpu; this statistic is in the cpu_t and the offset, like most, is generated for us by genasym.
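For readers who think more easily in C, here is a rough sketch of the two bit-level operations seen so far: the address-mask test and the srl-by-0 truncation. The function names are mine, and the bit positions (PSTATE.AM as bit 3, saved PSTATE starting at bit 8 of %tstate) are assumptions stated for illustration; the kernel takes TSTATE_AM from its own headers.

```c
#include <stdint.h>

/*
 * Assumed encodings: PSTATE.AM is bit 3 and the saved PSTATE field
 * starts at bit 8 of %tstate.
 */
#define PSTATE_AM            0x008ULL
#define TSTATE_PSTATE_SHIFT  8
#define TSTATE_AM            (PSTATE_AM << TSTATE_PSTATE_SHIFT)

/* Nonzero if the trapping process ran with address masking on (32-bit). */
static int caller_is_ilp32(uint64_t saved_tstate)
{
    return (saved_tstate & TSTATE_AM) != 0;
}

/* Equivalent of "srl %iN, 0, %oN": keep the low 32 bits, zero the rest. */
static uint64_t clear_high_bits(uint64_t arg)
{
    return (uint64_t)(uint32_t)arg;
}
```

A caller for which caller_is_ilp32() returns 0 is the case that gets sent to _syscall_ill32; everything else has each argument passed through clear_high_bits() on its way from the inputs to the outputs.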
!
! Set new state for LWP
!
ldx [THREAD_REG + T_LWP], %l2
mov LWP_SYS, %g3
srl %i2, 0, %o2 ! copy 3rd arg, clear high bits
stb %g3, [%l2 + LWP_STATE]
srl %i3, 0, %o3 ! copy 4th arg, clear high bits
ldx [%l2 + LWP_RU_SYSC], %g2 ! pesky statistics
srl %i4, 0, %o4 ! copy 5th arg, clear high bits
addx %g2, 1, %g2
stx %g2, [%l2 + LWP_RU_SYSC]
srl %i5, 0, %o5 ! copy 6th arg, clear high bits
! args for direct syscalls now set up
We continue preparing arguments as above. Interleaved with these instructions we change the lwp_state member of the associated lwp structure (there must be one - a user thread made a syscall, this is not a kernel thread) to indicate it is running in-kernel (LWP_SYS, would have been LWP_USER prior to this update) and increment the count of the number of syscalls made by this particular lwp (there is a 1:1 correspondence between user threads and lwps these days).
Next we write a TRAPTRACE entry - only on DEBUG kernels. That's a topic for another day - I'll skip the code here, too.
While we're on the subject of tracing, note that the next code snippet includes mentions of SYSCALLTRACE. This is not defined in normal production kernels. But, of course, one of the great beauties of DTrace is that it doesn't require custom kernels to perform its tracing since it can insert/enable probes on-the-fly - so SYSCALLTRACE is near worthless now!
!
! Test for pre-system-call handling
!
ldub [THREAD_REG + T_PRE_SYS], %g3 ! pre-syscall proc?
#ifdef SYSCALLTRACE
sethi %hi(syscalltrace), %g4
ld [%g4 + %lo(syscalltrace)], %g4
orcc %g3, %g4, %g0 ! pre_syscall OR syscalltrace?
#else
tst %g3 ! is pre_syscall flag set?
#endif /* SYSCALLTRACE */
bnz,pn %icc, _syscall_pre32 ! yes - pre_syscall needed
nop
! Fast path invocation of new_mstate
mov LMS_USER, %o0
call syscall_mstate
mov LMS_SYSTEM, %o1
lduw [%l1 + O0_OFF + 4], %o0 ! reload 32-bit args
lduw [%l1 + O1_OFF + 4], %o1
lduw [%l1 + O2_OFF + 4], %o2
lduw [%l1 + O3_OFF + 4], %o3
lduw [%l1 + O4_OFF + 4], %o4
lduw [%l1 + O5_OFF + 4], %o5
! lwp_arg now set up
3:
If curthread->t_pre_sys flag is set then we branch to _syscall_pre32 to call pre_syscall. If that does not abort the call it will reload the outputs with the args (they were lost on the call to _syscall_pre32) using lduw instructions from the regs area and loading from just the lower 32-bit word of the args (we can no longer use srl by 0 since no registers have the arguments anymore) and branch back to label 3 above (as if we'd done the same after a call to syscall_mstate).
If we don't have pre-syscall work to perform then call syscall_mstate(LMS_USER, LMS_SYSTEM) to record the transition from user to system state for microstate accounting purposes. Microstate accounting is always performed now - it used not to be the default and was enabled when desired.
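To give a feel for what "recording a transition" means, here is a toy model in C. It is not the kernel's syscall_mstate: the structure, the two-state enum and the explicit timestamp argument are all hypothetical, chosen only to show the idea of charging elapsed time to the state being left.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical subset of microstates; the real list is longer. */
enum mstate { MS_USER, MS_SYSTEM, MS_NSTATES };

/* Hypothetical per-lwp accounting record. */
struct ms_acct {
    enum mstate cur;              /* state currently being charged */
    uint64_t    state_start;      /* timestamp when "cur" was entered */
    uint64_t    acct[MS_NSTATES]; /* time accumulated in each state */
};

/*
 * Charge the time since the last transition to "from", then start
 * timing "to". The timestamp is passed in to keep the sketch testable.
 */
static void mstate_transition(struct ms_acct *ms, enum mstate from,
    enum mstate to, uint64_t now)
{
    assert(now >= ms->state_start);
    ms->acct[from] += now - ms->state_start;
    ms->cur = to;
    ms->state_start = now;
}
```

In this model the inbound call is mstate_transition(ms, MS_USER, MS_SYSTEM, now) and the outbound call at the end of the syscall swaps the two states.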
After the unconditional call to syscall_mstate we reload the arguments from the regs struct into the output registers (as after the pre-syscall work). Evidently our earlier srl work on the args is a complete waste of time (although not expensive) since we always end up loading them from the passed regs structure. This appears to be a hangover from days when microstate accounting was not always enabled.
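The "+ 4" in those lduw instructions is worth a second look: SPARC is big-endian, so the low 32 bits of a saved 64-bit register live in the second word of its slot. A small C sketch of the same load, with the byte layout spelled out by hand so it behaves the same on any host (the function names are mine):

```c
#include <stdint.h>

/* Store a 64-bit value the way big-endian SPARC lays it out in memory. */
static void store_be64(uint8_t *p, uint64_t v)
{
    for (int i = 0; i < 8; i++)
        p[i] = (uint8_t)(v >> (56 - 8 * i));
}

/* Equivalent of lduw: load a big-endian 32-bit word, zero-extended. */
static uint64_t load_be32(const uint8_t *p)
{
    return ((uint64_t)p[0] << 24) | ((uint64_t)p[1] << 16) |
           ((uint64_t)p[2] << 8)  |  (uint64_t)p[3];
}

/* Reload a 32-bit argument from a saved 64-bit register slot. */
static uint64_t reload_arg32(const uint8_t *slot)
{
    return load_be32(slot + 4);   /* low word lives at offset +4 */
}
```

The result is the same value that srl by 0 produced from the register itself, which is why either approach yields correctly truncated 32-bit arguments.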
Aside: Another Performance Opportunity?
So we see that our original argument shuffling is always undone as we have to reload after a call for microstate accounting, at least. But those reloads are made from the regs structure (cache/memory accesses) while it is clear that the input registers remain untouched and we could simply perform register-to-register manipulations (srl for the 32-bit version, mov for the 64-bit version). Reading through and documenting code like this really is worthwhile - I'll log a bug now!
!
! Call the handler. The %o's have been set up.
!
lduw [%l1 + G1_OFF + 4], %g1 ! get 32-bit code
set sysent32, %g3 ! load address of vector table
cmp %g1, NSYSCALL ! check range
sth %g1, [THREAD_REG + T_SYSNUM] ! save syscall code
bgeu,pn %ncc, _syscall_ill32
sll %g1, SYSENT_SHIFT, %g4 ! delay - get index
add %g3, %g4, %g5 ! g5 = addr of sysentry
ldx [%g5 + SY_CALLC], %g3 ! load system call handler
brnz,a,pt %g1, 4f ! check for indir()
mov %g5, %l4 ! save addr of sysentry
!
! Yuck. If %g1 is zero, that means we're doing a syscall() via the
! indirect system call. That means we have to check the
! flags of the targetted system call, not the indirect system call
! itself. See return value handling code below.
!
set sysent32, %l4 ! load address of vector table
cmp %o0, NSYSCALL ! check range
bgeu,pn %ncc, 4f ! out of range, let C handle it
sll %o0, SYSENT_SHIFT, %g4 ! delay - get index
add %g4, %l4, %l4 ! compute & save addr of sysent
4:
call %g3 ! call system call handler
nop
We load the nominated syscall number into %g1, sanity-check it for range, and look up the entry at that index in the table of 32-bit system calls sysent32 and extract the registered handler (the real implementation). Ignoring the indirect syscall cruft we then call the handler and the real work of the syscall is executed. Eric Schrock has described the sysent/sysent32 table in his blog entry on adding system calls to Solaris.
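Here is roughly the same lookup written as a C sketch, including the indirect-syscall wrinkle. The table, its four entries, the two-argument handler type and all the names are made up for illustration; the real sysent32 entries carry more fields and the real handlers take more arguments.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

typedef int64_t (*sy_call_t)(uint64_t, uint64_t);

/* Hypothetical, much-reduced table entry. */
struct sysent_sketch {
    uint16_t  sy_flags;
    sy_call_t sy_callc;
};

#define NSYSCALL_SKETCH 4u

static int64_t sk_indir(uint64_t a, uint64_t b) { (void)a; (void)b; return 0; }
static int64_t sk_nosys(uint64_t a, uint64_t b) { (void)a; (void)b; return -1; }
static int64_t sk_add(uint64_t a, uint64_t b)   { return (int64_t)(a + b); }

static const struct sysent_sketch sketch_table[NSYSCALL_SKETCH] = {
    { 0, sk_indir },   /* 0: indirect system call */
    { 0, sk_nosys },
    { 1, sk_add },     /* the flags value 1 is arbitrary */
    { 0, sk_nosys },
};

struct lookup {
    sy_call_t handler;                  /* what gets called (%g3) */
    const struct sysent_sketch *flags;  /* entry whose flags apply (%l4) */
};

/* Returns 0 on success, -1 if the code is out of range. */
static int lookup_syscall(uint32_t code, uint64_t arg0, struct lookup *out)
{
    assert(out != NULL);
    if (code >= NSYSCALL_SKETCH)
        return -1;                      /* _syscall_ill32 in the original */
    out->handler = sketch_table[code].sy_callc;
    out->flags = &sketch_table[code];
    if (code == 0) {
        /* Indirect call: flags come from the target, if it is in range. */
        out->flags = (arg0 < NSYSCALL_SKETCH) ? &sketch_table[arg0]
                                              : &sketch_table[0];
    }
    return 0;
}
```

Note that for the indirect case the handler is still entry 0; only the entry consulted later for return-value flags changes, and an out-of-range target is left for the C code to reject.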
!
! If handler returns long long then we need to split the 64 bit
! return value in %o0 into %o0 and %o1 for ILP32 clients.
!
lduh [%l4 + SY_FLAGS], %g4 ! load sy_flags
andcc %g4, SE_64RVAL | SE_32RVAL2, %g0 ! check for 64-bit return
bz,a,pt %xcc, 5f
srl %o0, 0, %o0 ! 32-bit only
srl %o0, 0, %o1 ! lower 32 bits into %o1
srlx %o0, 32, %o0 ! upper 32 bits into %o0
For ILP32 clients we need to massage 64-bit return types into 2 adjacent and paired registers.
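In C the massage looks something like the sketch below. The struct and function names are mine, and the "wide" parameter stands in for the sy_flags test in the assembly above.

```c
#include <stdint.h>

struct rval_pair {
    uint64_t o0;   /* upper 32 bits when the return value is 64-bit */
    uint64_t o1;   /* lower 32 bits when the return value is 64-bit */
};

/*
 * o0/o1 are the handler's output registers on return; "wide" is nonzero
 * when the flags say the value in o0 must be split across two registers.
 */
static struct rval_pair fixup_rval32(uint64_t o0, uint64_t o1, int wide)
{
    struct rval_pair r;

    if (!wide) {
        r.o0 = (uint32_t)o0;   /* 32-bit only: just clear the high bits */
        r.o1 = o1;             /* untouched */
    } else {
        r.o1 = (uint32_t)o0;   /* lower 32 bits into %o1 */
        r.o0 = o0 >> 32;       /* upper 32 bits into %o0 */
    }
    return r;
}
```

So a handler returning 0x0000000100000002 with the flag set hands userland 1 in %o0 and 2 in %o1.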
!
! Check for post-syscall processing.
! This tests all members of the union containing t_astflag, t_post_sys,
! and t_sig_check with one test.
!
ld [THREAD_REG + T_POST_SYS_AST], %g1
tst %g1 ! need post-processing?
bnz,pn %icc, _syscall_post32 ! yes - post_syscall or AST set
mov LWP_USER, %g1
stb %g1, [%l2 + LWP_STATE] ! set lwp_state
stx %o0, [%l1 + O0_OFF] ! set rp->r_o0
stx %o1, [%l1 + O1_OFF] ! set rp->r_o1
clrh [THREAD_REG + T_SYSNUM] ! clear syscall code
ldx [%l1 + TSTATE_OFF], %g1 ! get saved tstate
ldx [%l1 + nPC_OFF], %g2 ! get saved npc (new pc)
mov CCR_IC, %g3
sllx %g3, TSTATE_CCR_SHIFT, %g3
add %g2, 4, %g4 ! calc new npc
andn %g1, %g3, %g1 ! clear carry bit for no error
stx %g2, [%l1 + PC_OFF]
stx %g4, [%l1 + nPC_OFF]
stx %g1, [%l1 + TSTATE_OFF]
If post-syscall processing is required then branch to _syscall_post32 which will call post_syscall and then "return" by jumping to the return address passed by sys_trap (which is always user_rtt for syscalls). If not then change the lwp_state back to LWP_USER and stash the return value (possibly in 2 registers as above) in the regs structure, clear the curthread->t_sysnum since we're no longer executing a syscall, and step the PC and nPC values on so that the RETRY instruction at the end of user_rtt which we're about to "return" into will not simply re-execute the ta instruction.
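The register bookkeeping on the fast return path can be sketched in C as follows. The three-field structure is a stand-in for the real struct regs, and the encodings (icc carry as bit 0 of CCR, CCR starting at bit 32 of %tstate) are assumptions made for illustration.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Assumed encodings: icc carry is CCR bit 0; CCR starts at %tstate bit 32. */
#define CCR_IC            0x1ULL
#define TSTATE_CCR_SHIFT  32

/* Hypothetical cut-down version of the saved register set. */
struct regs_sketch {
    uint64_t r_tstate;
    uint64_t r_pc;
    uint64_t r_npc;
};

/* Mark the syscall as successful and step past the ta instruction. */
static void syscall_return_ok(struct regs_sketch *rp)
{
    assert(rp != NULL);
    rp->r_tstate &= ~(CCR_IC << TSTATE_CCR_SHIFT);  /* carry clear: no error */
    rp->r_pc = rp->r_npc;          /* old nPC becomes the new PC */
    rp->r_npc = rp->r_npc + 4;     /* and nPC moves one instruction on */
}
```

Without the PC/nPC step the saved PC would still point at the ta instruction, and the return to userland would trap straight back into the kernel.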
! fast path outbound microstate accounting call
mov LMS_SYSTEM, %o0
call syscall_mstate
mov LMS_USER, %o1
jmp %l0 + 8
nop
Transition our state from system to user again (for microstate accounting purposes) and "return" through user_rtt as arranged by sys_trap. It is the task of user_rtt to get us back out of the kernel to resume at the instruction indicated by the saved PC and nPC (which we stepped on above) and continue execution in userland.