Weblog

Wednesday October 10, 2007
Dump device configuration - part2
Dump device configuration - part2
To my earlier
post
on 'why the dump device configured so late', I got a very interesting
reply, questioning the claim: 'window will be significantly reduced'.
Let me be frank, as of the first post, I didn't conduct any studies as
to quantify the gain.
So the mail prompted me to do some tests to quantify the gain.
I want to understand how many function calls get made from the word
'go' [i.e from
main()]
till
dumpinit()
gets called; And also figure out how many modules get loaded in the
same window.
That seems a pretty arduous task. At the same time I can't spend days
trying to just get this data. This is where OpenSolaris rocks - if not
for DTrace, I would have given up on this. DTrace kernel module gets
loaded very early in the boot right after the kernel starts [i.e
unix/genunix/krtld modules are loaded]. DTrace supports a feature
called 'Anonymous tracing' - which can be used in this context just to
get the data I am interested. This won't get me right from
main()
but as close to
main()
as possible. This option is good enough to get an idea of what
the numbers look like.
I needed answers for atleast the following questions :
- How many function calls are made from the point
DTrace can start to the point of dumpinit() ?
- How many function calls are made from the point
root is mounted to the point of dumpinit() ?
- How many function calls are made from
start_other_cpus()
[this routine starts the rest of the CPUs in a multi cpu system; Till
this point, only the boot cpu does all the work; This opens up scope
for race conditions / corruption / hangs - So even if we could capture
crash file from this point, that would still be a nice win] to the
point of dumpinit() ?
I put together the following simple script to answer the 3rd question.
# cat mp_init.d
BEGIN {
dumpinit = 0;
}
/* start collecting data here */
fbt::start_other_cpus:return
/dumpinit == 0/
{
dumpinit = 1;
}
fbt:::entry
/dumpinit == 1/
{
@[probefunc]=count();
}
/* stop collecting data here */
fbt::dumpinit:return
/dumpinit == 1/
{
dumpinit = 0;
}
#
And invoke the script as follows :
#
dtrace -As ./mp_init.d
dtrace: saved anonymous enabling in
/kernel/drv/dtrace.conf
dtrace: added forceload directives to
/etc/system
dtrace: run update_drv(1M) or reboot
to enable changes
#
Then reboot the system and when system comes back, just run :
#
dtrace -ae > mp_init.out
mp_init.out has thousands of
lines which look something like :
tsc_scalehrtime
369164
syscall_exit
375467
Module information can be gotten using 'kmdb' - For this I booted the
system under kmdb, put a break point in
start_other_cpus()
and let it continue. When the break point is hit, I ran
'::modinfo' and saved the o/p; Then I put one more break point in
dumpinit()
and again let it continue; When the new breakpoint is hit collect the
modinfo o/p again and comparing this with the previous one gives you an
idea of how many modules got loaded in between.
With simple changes to the script, I got answers for the 1st and 2nd
questions as well. Module info can be gotten as described above.
Then I used awk(1) to post process
mp_init.out
and the resulting numbers look like this :
Q) How many function calls are made from the
point DTrace can start to the point of dumpinit() ?
A) 59 million function calls were made from
the point dtrace can start
to the point of
dumpinit:return
Q) How many function calls are made from the point
root is mounted to the point of dumpinit() ?
A) 52 million calls were made from mounting root
to dumpinit:return
- 120 modules got
loaded in the same window
Q) How many function calls are made from
start_other_cpus()
to the point of dumpinit() ?
A) 43 million calls were made from
start_other_cpus() to dumpinit:return
- 52 modules got
loaded in the same window
The data speaks for itself !! Data modified by 43million routines can
be captured in a dump, if we could advance dump device configuration to
just before
start_other_cpus().
Gain is more if it could be advanced closer to mount of root FS.
( Oct 10 2007, 05:36:01 AM PDT )
Permalink

Tuesday September 25, 2007
Why is dump device configured so late ??
Why is dump device configured so late ??
The other day I was playing around with kmdb [setting breakpoints,
single stepping, clearing bkpts etc..] and at one stage, thought, Let
me take a dump and then issued :
[0]> $<systemdump
nopanicdebug:
0
= 0x1
panic[cpu0]/thread=fffffffffbc25d20: BAD TRAP:
type=d (#gp General protection)
rp=fffffffffbc46670 addr=f000ff53f000ff00
fffffffffbc46550 unix:die+ea ()
fffffffffbc46660 unix:trap+3de ()
fffffffffbc46670 unix:cmntrap+e9 ()
skipping system dump - no dump device
configured
rebooting...
It is the last two lines of o/p that matter to this post (so lot was
left out intentionally).
I couldn't collect the crash file - the reason being 'no dump device
configured'; I felt bad that the information I wanted was gone and I
had to repeat the whole cycle to get to that state ...
I know that by default 'primary swap partition' itself acts as a dump
device. And swap device is specified as part of install itself and is
actually available in /etc/vfstab as (something like) :
/dev/dsk/c1t0d0s1
-
- swap
- no -
That brings me to the question: Where in the boot sequence do we
configure swap as dump device ?
Looking a the code to see where the above error msg comes from: it
turns out to be
dumpsys()
uprintf(
"skipping system dump - no dump device configured\n");
dumpvp
not being set is the reason why
dumpsys()
prints out this msg; To understand where it gets set, I put a
break point in
dumpinit()
and then cotinued.
The state when the brkpt is hit :
[1]> $c
dumpinit(ffffff0154b1c880, 0, 0)
swapctl+0x68a(1,
fffffd7fffdffc90, ffffff00042acee4)
uadmin+0x10f(10,
[1]> ::ptree
fffffffffbc250b0 sched
ffffff01513923a8 fsflush
ffffff015138e3a8 pageout
ffffff015138c3a8 init
ffffff0152b7a3a8 svc.configd
ffffff0152aaf3a8 svc.startd
ffffff01548f13a8 fs-usr
ffffff01549ce3a8 swapadd
ffffff0154a1b3a8 swap
[1]
That means, in response to the cmd 'swap -1 -a /dev/dsk/c1t0d0s1',
dumpvp
gets set in the kernel. So if the system were to panic till this
happened we would not be able to collect a crash file for subsequent
analysis. Running '::modinfo' shows 161 kernel modules loaded (on my
test system) - All this state also would be lost.
I am wondering why should it be pushed to this stage - why can't it be
done
earlier?
The way /etc/system is read, /etc/vfstab could also be read, look out
for an entry with FStype as 'swap' and try configuring it as the
dumpvp
at the earliest possible moment - Wondering, if this can be done
immediatly after /dev, /devices and disk dirver (sd?) modules are
loaded.
Even with this change, we can't capture the panics that happen before
dumpvp is configured, but that window will be significantly reduced !!!
Moinak was suggesting,
explore the option of making it part of bootenv.rc - good point, but
that makes it x86 specific - needs more investigation.
Comments please.
( Sep 25 2007, 01:23:43 AM PDT )
Permalink

Thursday September 06, 2007
How do you access structure members in assembly files...
How do you access structure members in assembly files...
If you look at the assembly files in OpenSolaris kernel [fortunately
there are not many of them], you will see instructions like the one
below at quite a few places :
movq T_PROCP(%r13), %r14 /*
%r14 = proc */ <-- from
swtch.s
In this particular case, %r13 holds the thread pointer and it is
accessing the proc pointer from it and storing it in %r14. How it is
accessing the member is the focus of this post.
If this was a c-source file, we could have done something similar to:
proc_t *p = t->t_procp;
and compiler will take care of the rest.
But if we need to code it in assembly, we have to explicitly pass the
'offset' of the member [here, t_procp] to the instrcution. But if we
were to hard code the 'offset', it would be a nightmare to maintain it,
whenever structures changed.
So how OpenSolaris addresses this is pretty neat.
There is a file called
offsets.in,
which has entries as shown below :
_kthread
THREAD_SIZE
t_pcb
T_LABEL
t_lock
.....
t_procp
t_link
This file gets compiled into a header file called assym.h and it can be
included in any assembly file that makes references to these members.
This
Makefile
takes care of building the header file. Looking at the assym.h, after
building it, the entry for t_procp looks like this :
#define T_PROCP 0xdc
This is exactly what we need to feed to the assembly instruction !!
If we need to access a structure member that is not already present in
the
offsets.in,
all we need to do is, just add the 'appropriate structure and its
member', compile and start accessing in the assembly code.
What if the structure I am interested in is not of global signifcance -
i.e no other module other than mine may care about it - in which case I
don't want to clutter the
offsets.in
but want my own offsets.in which does the same job.
An example for this can be found here
offsets.in
- the structure and offsets that are declared here are of no
significance to sparc so just confine it to x86.
You can use make(1S)'s AS_INC_PATH env varaible in your Makefiles to
point to the location of where the generated heder files are present -
with this varaible, you don't end up including the header file path for
your c-files [confine it to assembly files].
( Sep 06 2007, 09:46:06 PM PDT )
Permalink

Monday August 20, 2007
lx brand internals - part1
lx brand internals - part1...
If you are here, I assume that you already read the brandz design
document available as
http://opensolaris.org/os/community/brandz/design/
and are interested
in knowing the real low level details of how lx brand does its magic.
If so continue...
This post may get modifed in the future as my understanding of things
improve :-)
Now that you have got an lx zone [say, lx1] installed, what
happens when you issue, 'zoneadm -z lx1 boot'...
/usr/lib/zones/zoneadmd is the workhorse. It does all the heavy lifting
w.r.t setting up the virtual platform; of that
server()
is the routine that does the actual work.
zsched process of a zone is equivalent to sched of global zone - all
kernel threads related to the zone hang off from the zsched process.
zoneadmd creates zsched through
server()->
zone_ready()->vplat_create()->
zone_create()->
newproc()
zsched() launches zone_start_init() which will eventually launch the
'init' of the lx brand installed [say, centOS3.0 init].
brand_platform_iter_gmounts() mounts the file sytems mentioned in the
config file /usr/lib/brand/lx/platform.xml [starting with keyword
'global']
No Inherited-Package-Dirs for lx brands.
Mount the rest of the file systems from the above xml file [starting
with keyword 'mount']
boot/halt scripts allow the brand to perform any custom configuration,
etc
lx brand boot callback: lx_support is called from
zone_bootup()->do_subproc() to do any additional setup before the
zone is booted - say, setting the 'restrat' attribute for init.
Then make the zone_boot() system call – as the comment in zone.c says,
this doesn't need to do any work other than to just find the status set
by the init() of the brand.
Understanding how init of the brand gets started, throws a lot of light
around how brandz works:
zone_start_init() execs /sbin/init of the brand.
elfexec() realizes that this is a branded process, and
hence execs the helper library:
/Brand/lx1/root/native/usr/lib/lx_brand.so.1
through lx_elfexec() and then maps in linux binary and its
interpreter as follows :
CPU
ID
FUNCTION:NAME
1
48637
elf32exec:entry
zsched
vp: /Brandz/lx1/root/sbin/init
genunix`gexec+0x374
genunix`exec_common+0x471
genunix`exec_init+0x275
genunix`start_init_common+0xfb
genunix`zone_start_init+0x3d
unix`thread_start+0x8
1
48637
elf32exec:entry
zsched
vp: /Brandz/lx1/root/native/usr/lib/lx_brand.so.1
lx_brand`lx_elfexec+0xfd
elfexec`elf32exec+0xf3e
genunix`gexec+0x374
genunix`exec_common+0x471
genunix`exec_init+0x275
genunix`start_init_common+0xfb
genunix`zone_start_init+0x3d
unix`thread_start+0x8
1
48627
mapexec32_brand:entry
zsched vp:
/Brandz/lx1/root/sbin/init
lx_brand`lx_elfexec+0x169
elfexec`elf32exec+0xf3e
genunix`gexec+0x374
genunix`exec_common+0x471
genunix`exec_init+0x275
genunix`start_init_common+0xfb
genunix`zone_start_init+0x3d
unix`thread_start+0x8
1
48627
mapexec32_brand:entry
zsched vp:
/Brandz/lx1/root/lib/ld-2.3.2.so
lx_brand`lx_elfexec+0x2aa
elfexec`elf32exec+0xf3e
genunix`gexec+0x374
genunix`exec_common+0x471
genunix`exec_init+0x275
genunix`start_init_common+0xfb
genunix`zone_start_init+0x3d
unix`thread_start+0x8
AT_ENTRY auxv vector points to _start() of lx_brand.so.1 – solaris
linker jumps to this after its boot-strapping.
exec_args() copies the arugments to the user stack. And execution in
the userland starts at _rt_boot() and stack looks something like this :
----------------
|
| <-- %ebp
----------------
| argc
|
----------------
| argv[]
|
----------------
| NULL
|
----------------
|
envp
|
----------------
| NULL
|
----------------
| auxv[]
|
----------------
When the linker turns over control to the entry in the executable [i.e
_start() in our case], the stack looks exactly how it was when kernel
gave the control to the linker i.e _rt _boot() :
_ rt_ boot()-> setup()->elf_entry_ pt()->_start()
_start() calls lx_init() which registers the user-space handler()
with the lx brand module
Handler is: lx_handler_table() - this looks like this:
LM2`lx_brand.so.1`lx_handler_table:
pushl $0x0
LM2`lx_brand.so.1`lx_handler_table+5:
jmp +0x1112 <LM2`lx_brand.so.1`lx_handler>
LM2`lx_brand.so.1`lx_handler_table+0xa:
nop
LM2`lx_brand.so.1`lx_handler_table+0xb:
nop
LM2`lx_brand.so.1`lx_handler_table+0xc:
nop
LM2`lx_brand.so.1`lx_handler_table+0xd:
nop
LM2`lx_brand.so.1`lx_handler_table+0xe:
nop
LM2`lx_brand.so.1`lx_handler_table+0xf:
nop
LM2`lx_brand.so.1`lx_handler_table+0x10:
pushl $0x10
LM2`lx_brand.so.1`lx_handler_table+0x15:
jmp +0x1102 <LM2`lx_brand.so.1`lx_handler>
LM2`lx_brand.so.1`lx_handler_table+0x1a:
nop
LM2`lx_brand.so.1`lx_handler_table+0x1b:
nop
Where we jump into this table from the kernel is dictated by the system
call made. lx_handler() takes the system call number as argument and
proceeds to emulate it.
Ok...so now the Q is how do we initially get into the kernel when a
system call is made by a lx process ??
32bit linux makes use of int80 to make a system call – brandz replaces
it with its own handlers for branded processes :
- brand_interpositioning_enable() installs the new handler on a
context switch
- brand_interpositioning_disable() restores the old handler
while switching out
- The new handler is: brand_sys_int80
Take a closer look at brand_sys_int80; In this post let us focus on
64bit
solaris kernel :
Upon int80, the stack looks like this when we enter brand_sys_int80()
[refer to 5-27 vol3a of intel
programmers manual]
[1]> <rsp,5/Jn
0xffffff00046c3fd8:
feb888bd <-- %EIP of linux app that caused the 'int 80'
43
<-- Code Segment selector
212 <-- EFLAGS
8047cec <-- %esp - stack pointer as of 'int 80'
4b
<-- %SS
brand_sys_int80() switches to the kernel stack of the running thread,
and retrieves a lot of brand information
and then jumps to lx_brand_int80_callback(). As of the jump, the stack
looks like this :
[1]> <rsp,8/Jn
0xffffff00046c3ee8:
feb888bd
<-- return address into app's user land
ffffff014ba301c0 <-- Pointer to proc brand data
structure
ffffff0152573708 <-- lwp brand data
ffffff00046c3fd8 <-- Same as above 'rsp'
fffffffff7cddc60 <--
lx_brand`lx_brand_int80_callback()
lx_brand_int80_callback() takes the system call number present int
%rax, left shifts it by 4 (as each entry
in the lx_handler_table() is 16bytes long] and that is the address it
needs to jump to, to emulate the system
call. Then it swaps the 'return address that is available on stack with
this newly computed address - such that
when it issues 'return from interrupt [iretq]', it starts off at the
new address. The actual 'user land addrss from
the application' is saved into %rax.
The stack as of iretq:
[1]> <rsp,6/Jn
0xffffff00046c3fd8:
fefddb40
43
212
8047cec
4b
[1]> <rax=J
feb888bd
See %rax holds, what used to be on stack !!!! And what is on stack is
the address in the lx_handler_table()
where it jumps to...From here on it executes the emulation code and is
the content of next post...
( Aug 20 2007, 02:38:51 AM PDT )
Permalink

Friday December 08, 2006
Enuf error checking done in your code ??
Enuf error checking done in your code ??
I remember having read somewhere sometime back, that, more than 20% of
the solaris code just does error checking ...and my first reaction was,
wow thats lot and close to paranoia ...and with the passage of time I
kind of forgot about it.
Couple of weeks back I got an escalation and, as usual, it started off
as, 3rd party application core dumped inside a solaris library (and to
be more specific in libaio on solaris10).
libaio allows applications to do IO in an asynchronous way on files
that can not make use of kernel async IO (for ex: aioread()/aiowrite()
on a device file make use of kernel aio; whereas the same call on a
regular file will make use of aio features provided the library) - How
library does aio is very simple: it creates a few worker threads which
will pick up requests and perform IO in the background using pread()
and pwrite() calls.
So when an application issues a aio read or write for the first time on
a file descriptor (fd) that can't make use of kaio, library creates a
bunch of worker threads to execute the aio requests. Refer to the
routine
__uaio_init()
to see how the initialization is happening.
This post is concerned more with the following few lines of code from
__uaio_init() :
194 /*
195 * Create the minimum number of read/write workers.
196 */
197 for (i = 0; i < _min_workers; i++)
198 (void) _aio_create_worker(NULL, AIOREAD);
As the comment says, it is creating a bunch of worker threads. But what
is interesting is, we are not checking for the return value of
_aio_create_worker() routine - This routine creates a user land thread
[From solaris9 onwards, libthread moved to 1:1 model - which
means, for every user land thread created, there will be an lwp in the
kernel to which it is tied to].
For some reason if this call fails, we continue as if everything is
hunky-dory and set __uaio_ok which means initialization is complete.
Down the line when it actually tries to take off a worker from the Q,
accesses a NULL pointer and thus shoots itself in the foot.
That makes me think that there is never enough error checking !! more
is always merrier here...
( Dec 08 2006, 02:54:49 AM PST )
Permalink

Monday November 13, 2006
One ASSERT per line ...
One ASSERT per line...
I just came across the following assert() in cpu.c
666 ASSERT(t->t_weakbound_cpu == CPU ||
667 (t->t_nomigrate < 0 && t->t_preempt > 0) ||
668 CPU_ON_INTR(CPU) || t->t_flag & T_INTR_THREAD ||
669 getpil() >= DISP_LEVEL);
A system with debug kernel would panic, if any of the above conditions
were to fail. For me, I believe, on sparc system, it would take 2mins
to decipher which of the conditions failed.
And on amd64, definitely longer.
If only these asserts were one per line :
- the panic message would tell me what had gone
wrong - letting me focus on the core of the problem, rather than which
one failed.
- Definitely improves the readability of the code,
which is what solaris is known for.
If only somebody could find an RTI advocate, who will approve the
changes, I will definitely be indebted to him/them.
( Nov 13 2006, 03:37:39 AM PST )
Permalink

Friday March 24, 2006
libaio hang ??
libaio2.html
Recently I came across a core file from an application linked with
libaio. Customer said the application hung and he used gcore(1) to
collect the core file.
After initial investigation it turned out that an mmap() call from
libaio failed as there is not enough virtual memory for the process
left. libaio issues mmap() to set aside stack for user land threads
that it creates which service the 'async IO' requests.
As it is a 32bit process, I assumed, at one point the process may have
grown close to 4Gb and mmap() call around that time failed with 'no
memory'.
With these points, we put together a test case whch calls malloc() till
a failure happens, and then issue an aio call which results in mmap()
failing. So we throught the application wass hitting the 32bit limit.
One smart customer facing engineer collected the output of 'ulimit -a'
and there it showed that the 'vmemory' was limited to ~600Mb.
Interestingly size of the core filed collected was also around the same.
So the recommendation was to set it to 'unlimited'; if not, atleast
raise it to a signifcant number.
But the point to keep in mind, while setting to 'unlimited' is, it
needs to be ensured that there is enough swap
space configured on the system. Otherwise a run away process can starve
others.
As the problem happened on s9, I can't print the stack here.
( Mar 24 2006, 02:34:35 AM PST )
Permalink

Wednesday September 21, 2005
data loss root caused...
moddebug fun on DEBUG builds...
Few weeks back I was looking at a data loss issue reported by a
customer and in this post I describe how it is root caused.
This initially came in as a NFSv3 issue. Customer doesn't see the
problem in case of ufs. But sees it only if he uses nfs. Even with nfs,
he doesn't see the problem if 'noac' mount option is used (noac
suppresses caching). Ofcourse customer doesn't want to use noac,
as performance is penalized. Even when he switched NFS
servers, the problem persisted. Customer doesn't see the problem
with solaris10 on a sparc box but sees it on an opteron box with same
solaris 10. He sees it with both 32bit and 64bit kernel.
Couple of enterprising 'nfs' engineers, after looking at the data from
customer site, cooked up a 10 line reproducible test case and proved
that the problem is not in NFS land but in the 'vm' layer beneath.
Armed with a simple test case, a killer weapon called DTrace and
a killer system called V40z I took 5hrs to zero-in on the problematic
routine.
What the test case does is: it opens an existing file, lseek's to an
offset: 2633940 [== (321*8192) + (4096+212)] and does a write of
60bytes and then seeks back to the beginning of the file and starts
reading the whole file in 8k chunks. And when finished reading, closes
the file.
Thats all and last 60bytes written is lost, somehow. And the problem
happens every time.
Whenever a user reads or writes to a file, the data is
read/copied into the kernel address space before being pushed to
userland or the FS. Where to copy into the kernel is controlled
by a segment driver called: segmap. What it does is, given
a vnode and an offset it churns out a kernel VA where a page can be
mapped to, such that FileSystems can copy the data from this page or
into the page depending upon whether its a write or read request.
With 64bit kernel, 64Mb of kernel VA is set aside for segmap. And each
'smap' structure maps 8Kb of this VA. Thus at boot time 8192
'smap' structures are created to map the whole 64Mb.
Primarily 'smap' structure contains the vnode/offset
members, beside some other information.
Even in 64bit land, page size on an x86 box is still 4Kb only.
Thus given a segmap addres, kernel needs 2pages to map 8k size.
If you refer back to the offset where the write is happening, its
actually happening to a 2nd page of a segmap address.
The stack trace when we load a PTE for the segmap address as part of
write is:
hat_memload:entry adr: fffffe80c0ddb000 pp: (pfn:dfc87)
genunix`segmap_fault+0x1c9
genunix`as_fault+0x205
unix`pagefault+0x8b
unix`trap+0xdbc
unix`_cmntrap+0x11b
unix`kcopy+0x2c
genunix`uiomove+0xdf
nfs`writerp+0x87
nfs`nfs3_write+0x3ad
genunix`fop_write+0xb
So what this means is we got an address: fffffe80c0ddb000
from segmap for the vnode/offset we passed. And the page number kernel
set aside for this is: dfc87. Ones we load a translation entry for this
page, and when the write happens to the page, processor marks the
page dirty in one of the bits of the PTE.
What happens ususally is, fsflush or say pageout, finds that this page
is dirty (refer to pvn_getdirty()), syncs these flags from PTE into the
page structure and wirtes the contents to the backign store before
disposing the page.
What the test case does next is, seeks back to the beginning and read
the page in 8k chunks :
As the thread reads the pages in 8k chunks, when it comes to offset
(321*8192), in the next read it has to return (4102+212+the
60bytes that we wrote last), as segmap already has an entry for
this vnode/offset it returns the segmap address the corresponds to the
first page:
segmap_fault:entry addr: fffffe80c0dda000, len: 8192
genunix`segmap_fault+0x1c9
genunix`segmap_getmapflt+0x3ab
nfs`nfs3_read+0x1da
genunix`fop_read+0xb
genunix`read+0x188
genunix`read32+0xe
unix`sys_syscall32+0xd9
Its understandable that this address and the earlier segmap address are
4k apart.
The catch is in segmap_getmapflt() - What it does is, it does a
hat_probe() on the beginning address, as this is the first time we are
accessing this address, there is no valid PTE for this address and ones
hat_probe() returns zero saying there is no valid PTE for this address,
it thinks same is valid even for the 2nd page thats part of this
segmap address.
So it now goes ahead and installs a new PTE for both the pages.
And thus a valid entry for the 2nd page is lost :
x86pte_set:return new pte: 80000000dfc87103, old pte: 80000000dfc87163
unix`hati_pte_map+0x93
unix`hat_kmap_load+0b7
unix`hat_memload+0x87
genunix`segmap_fault+0x1c9
genunix`segmap_getmapflt+0x3ab
nfs`nfs3_read+0x1da
genunix`fop_read+0xb
genunix`read+0x188
genunix`read32+0xe
unix`sys_syscall32+0xd9
x86pte_set() installs a new pte and returns the old pte - as we can see
the page involved is: dfc87.
The returned pte has bits 0x60 (which stand for PT_MOD | PT_REF).
Ones the new pte is installed, the 60bytes written is
guanranteed to be lost and are lost.
The problem doesn't happen in sparc land as the page size is 8Kb.
As you may have guessed by now, if the test case reads in 4k chunks,
the problem would not have happened.
This problem is fixed both in s10 and nevada.
( Sep 21 2005, 09:43:45 PM PDT )
Permalink

Monday September 05, 2005
my experiments with reiserfs - part1
moddebug fun on DEBUG builds...
The other day I was playing around on my home system (booted in RHL)
and decided to check out reiserfs and this post briefly talks about
couple of points about my initial experiments with reiserfs:
As I don't have a spare disk on my system, I decided to create reiserfs
on a 64Mb USB memory stick.
As I have had prior exposure to solaris-ufs, I thought, having a journal
or not is a mount time option. But it turned out that, whether I
eventually mount it with log or nolog, mkreiserfs sets aside space for
log. So with default journal size, I was left with less than 50%
of the stick capacity for my use. I thought this is too much and then
re-read the man page, and figured there is this '-s' option which
sets journal size. So I recreated the FS with '-s 513' (as per the man
page min is 513), but this time mount fails to mount the stick
saying its not 'reiserfs' partition; which has come as a surprise
as mkreisefs didn't complain at the time of FS creation.
I would have loved to see an option to mkreiserfs to avoid journal.
This could have been small disk friendly...
So I didn't have a choice but to settle with less than 50% of disk
space for my files.
debugreiserfs(1) is a definite hackers love. Very nice command if you
are curious about where (to the granularity of 'byte') exactly is
data/metadata.
As the disk space left is very less, reiserfs didn't bother to use
internal nodes to store the tree. root block itself is a leaf node,
where the directory entries and stat data is available. Whereas I
wanted to see the 'formatted internal nodes' in action as I populate
the FS.
Looks like my only choice is to get a bigger stick!!! or are there any
work arounds ??
( Sep 05 2005, 05:21:44 AM PDT )
Permalink

Thursday August 11, 2005
moddebug fun...
moddebug fun...
There is a kernel variable 'moddebug' which can be set to different
values
mentioned in usr/src/uts/common/sys/modctl.h :
#define MODDEBUG_LOADMSG
0x80000000 /* print "[un]loading..." msg
*/
#define MODDEBUG_ERRMSG
0x40000000 /* print detailed
error msgs */
#define MODDEBUG_LOADMSG2
0x20000000 /* print 2nd level msgs */
....
This variable controls the verbosity level of messages printed by the
kernel (module
loading and unloading subsystem of the kernel, to be precise).
This value can be set either through /etc/system or on a running system
using
'mdb -kw'. Advantage with the later option is, system doesn't
need a reboot and will
become effective immdiately.
Modules get loaded on demand or 'cos of a modload(1M) request.
And unloaded in response to modunload(1M) or whenever there is
shortage
of memory or by 'mod auto unload thread' in case of DEBUG kernels.
Messages are printed onto the console as well as to the messages file.
It will be fun to watch what all modules are getting loaded and
unloaded -
# mdb -kw
Loading modules: [ unix krtld ... ]
> moddebug/W 80000000
moddebug:
0
= 0x80000000
> $q
Because I am not running a debug kernel and the possibilty of memory
shortage is
less, I will issue a modunload and see what happens:
# modunload -i 0
# dmesg
...
Aug 11 14:52:07 t1 genunix: [ID 762576 kern.notice] uninstalled elfexec
Aug 11 14:52:07 t1 genunix: [ID 659089 kern.notice] unloading elfexec,
module id 62, loadcnt 2.
Aug 11 14:52:07 t1 genunix: [ID 762576 kern.notice] uninstalled intpexec
Aug 11 14:52:07 t1 genunix: [ID 659089 kern.notice] unloading intpexec,
module id 101, loadcnt 3.
Aug 11 14:52:23 t1 genunix: [ID 943528 kern.notice] load
'exec/intpexec' id 101 loaded @ 0x120460f/0x78117f60 size 1049/156
Aug 11 14:52:23 t1 genunix: [ID 131579 kern.notice] installing
intpexec, module id 101.
Aug 11 14:52:23 t1 genunix: [ID 943528 kern.notice] load 'exec/elfexec'
id 62 loaded @ 0x1325e6f/0x14dcce7 size 18763/259
Aug 11 14:52:23 t1 genunix: [ID 131579 kern.notice] installing elfexec,
module id 62.
#
Modules 'intpexec' and 'elfexec' were unloaded; but had to be loaded
immediatly
because I ran 'dmesg' command which happens to be a shell script which
in turn
calls ELF binaries !!!
( Aug 11 2005, 02:55:45 AM PDT )
Permalink

Tuesday August 02, 2005
swap space full; processes are dumping core...
swap full; processes are dumping core...
Last week, I was looking at a few core files, along with a colleague of
mine.
Fujitsu was doing some kind of stress testing on an s10 box and they
were
seeing core dumps from random processes.
The stress test involved filling up the whole 'swap' space on the
system and then
issuing 'init 6'.
Some of the cores are because of SIGBUS and some are from SIGSEGV.
All the cores have atleast one thread with following kind of stack :
fec7bbe0 libc_psr.so.1`memset+0x88(fec7bca8, ff1bc214, fc150c00,
b400000, fc000, fb4fc000)
fec7bc40 libc.so.1`_thrp_create+0x1f0(0, fc000, 0, 0, 80, fec7bed4)
fec7be68 libc.so.1`pthread_create+0x1e8(fec7bf3c, 0, 32e7c, 15a608, 0,
ff1e8bc0)
fec7bed8 startd_thread_create+0x10(32e7c, 15a608, 0, f, 52000, 15a608)
fec7bf40 restarter_event_thread+0x1f4(3eb800, 57a460, f5ef8, f5ef8,
56f50, 56f1c)
fec7bfa0 libc.so.1`_lwp_start(0, 0, 0, 0, 0, 0)
After an hour of doing the usual things (look at the regs, look at the
instructions that caused these
signals etc), it suddenly occurred to us, that memory for thread stacks
is allocated through mmap(2)
system call, and is passed an important option: MAP_NORESERVE -
which says don't reserve
swap space for this mapping. If this stack ever needs to be written out
to swap, system looks for
swap space that time and if space is not available then, process will
terminate with either SIGBUS
or SIGSEGV as mentioned clearly in the mmap() man page...So this is
what is causing the random
MT processes to dump core in our case. Spent 1hr on a behaviour
which is clearly documented...
hence, I thought nobody should repeat this mistake and so this post...
I believe, this flag was introduced in the first place for the
following reason:
default stack size of a userland thread is 1Mb, and not many threads
could potentially have stacks
that run into multiple pages (with a page size of 8Kb); So if mmap()
sets aside swap space as well,
for every thread created, with a few thousand threads itself, swap
requirements of the system will run
into few Gbytes whereas these threads may never actually need to be
paged out at all...
( Aug 02 2005, 01:04:11 AM PDT )
Permalink

Wednesday July 27, 2005
UFS file system defragmentation...
UFS defragmentation...
A while back I worked on a utility to defragment a Unix File System
(UFS).
I haven't integrated it yet into the source base for two reasons:
1. Couldn't run it on any customer system - to give it a taste of real
world.
2. This involves modifying the block numbers underneath a file. And
looks
like cluster folks have a project, which depends on block
numbers not
changing underneath a file.
But as this project hasn't made it, this constraint can be overlooked
(I hope).
In this post, I will explain a little bit about why I did this, what it
does
and what it doesn't do.
Few customers reported the following problem:
# df -k .
Filesystem
kbytes used avail capacity Mounted
on
/dev/md/dsk/d40 26614583 24092573
2255865 92% /stats
#
# dd bs=1024 if=/dev/zero of=test count=8
write: No space left on device
5+0 records in
5+0 records out
#
What this means is, even though there is ~2.2Gb of space is avaialbe,
'dd' failed to create a file of size 8k.
This is because the file system is so fragmented, that it could not
locate 8Kb of contiguous space and hence writes from 'dd' failed.
UFS terminology:
Being a block based file system, allocation unit under ufs is a block of
size 8k. To minimize wastage of space when files of small size are
created,
a block is actually divided into framgnets of 1k size (this can be
changed
while creating a UFS FS. see mkfs(1M)).
Fragments are the ones which are actually numbered.
What this means is a block is nothing but 8contiguous fragments and the
fragment number of the leading fragment should be a multiple of 8.
So if we need to store 8 files, each of size, say, 200bytes, ufs sets
aside
8fragments instead of 8blocks and this results in huge space saving
(8*1k Vs
8*8k). What does the math say, if we have a million of such small files?
In the above case, we also got the 'fstyp -v' o/p of the file system:
# fstyp -v /dev/md/rdsk/d40 | sed -e '/^$/,$d'
ufs
magic 11954 format dynamic
time Tue Aug 27 19:20:27 2002
sblkno 16 cblkno
24 iblkno
32 dblkno 808
sbsize 2048 cgsize 8192
cgoffset 128 cgmask 0xffffffe0
ncg 522
size
27028032 blocks 26614583
bsize 8192 shift
13 mask 0xffffe000
fsize 1024 shift
10 mask 0xfffffc00
^^^^^^^^^^^^
frag 8
shift 3 fsbtodb 1
minfree 1% maxbpg
2048 optim time
maxcontig 16 rotdelay 0ms
rps 120
csaddr 808 cssize
9216 shift
9 mask 0xfffffe00
ntrak 19 nsect
248 spc
4712 ncyl 11472
cpg 22
bpg 6479
fpg 51832
ipg 6208
nindir 2048 inopb
64 nspf 2
nbfree 0
ndir 38886 nifree 1891931
nffree 4858090
^^^^^^^^^^
^^^^^^^^^^^^^^^
cgrotor 81 fmod
0 ronly 0
From this o/p, we can see that number of free blocks (nbfree) on the
system is 0. And all the free space is available in the form of
fragments
(nffree) - that means no 8 fragments are contiguous.
How did the FS get into this state? The system serves as a web server
and it
creates and deletes tons of small files and thus the fragmentation over
a period of time.
Traditionally the best way to avoid this is, have the frament size and
block
size match such that 'df -k' o/p and 'dd' action are consistent. So if
fragment size was 8k in the above example, 'df -k' would have showed
100%
full, so you can't expect 'dd' to succeed anyway. What that means is
allocate
an 8k sized block even if the request was for just a few 100bytes
!!! So
you may end up buying more disk space.
The other way to fix this problem was, ones we start getting 'dd'
failures,
take the file system offline - use the traditional tools ufsdump(1M)/
ufsrestore(1M) to dump the files to a backup media and restore it back.
As you can see, there are two disadvantages with this approach:
unavailability
of the FS for a while and a backup device.
So one approach to fix the problem is, reconsider the block allocation
techniques of ufs (which will be an expensive proposition). And the
other
approach is, defragment the file system on the fly. This is what I have
implemented.
Initially I thought of adding a ufs kernel thread to do the
defragmentation
whenever the frament/block ratio reaches a threshold; but on popular
advise,
I switched to a utility which is called 'fsdefrag' - which can be
launched
from a crontab file if required.
What this does is, it sweeps through the FS checking for all the
cylinder
groups (another ufs thing), starts from the worst affected cylinder
group to
the relatively better ones, and tries to reallocate the fragments set
aside for
an inode such that lot of contiguous fragments can be made available at
the end
of the exercise. For the inode that is touched, its fragment numbers
will
change to the newer ones and old ones get freed after data is copied.
There are some checks made in deciding whether a given inode's data
block is a
an appropriate one or not.
What this is not:
Once I say, defragmentation, first thing that comes to the mind of
database
folks is, their ~2Gb file, which has blocks spread all over the FS will
be
made contiguous such that access times improve in a big way. This
utility
doesn't look at files which occupy more than 96k (in UFS, upto 96kb,
files
are allocated direct blocks i.e block numbers are stored in the inode
itself). So if you have read this far and this paragraph let you down,
I am
extremely sorry.
If you are interested in this, we can have a discussion on 'ufs:
discuss' forum
off opensolaris.org/
( Jul 27 2005, 07:22:22 AM PDT )
Permalink

Tuesday July 12, 2005
system crash file layout
kernel memory corruption detected !! and root caused
Sometime back while working on a pet project (may be next blog), I studied the layout of
a system crash file and this blog is about that. This is applicable for crash files from
sun4u platform (but, things should not be very much different in x86/64 land, but I don't
bet on it).
In this document i am explaing the layout of the system crash file on the
swap disk and how it is present in the vmcore.* subsequent to running
a 'savecore'.
Use: 1. Getting to know the layout
2. How we can extract unix.* from vcore.* and thus avoid copying over
unix.* while downloading cores (Not a major gain, but for fun...)
3. How we can avoid kvm_* calls and do the vitual to physical translation
on our own and then read the appropriate location in vmcore.* (again
for fun...)
No sooner the system comes back after a crash, crash dump is avialable
on primary swap device and it starts at:
offset = (size of the device - 64k);
'offset' is truncated to get aligned on a 64k boundary
and grows upwards. At the offset, calculated as above, there will be a
dumphdr_t structure which describes the dump.
No need to know the name of the primary swap device, the following lines
do the job:
dumpfd=open("/dev/dump", O_RDONLY, 0444)
ioctl(dumpfd, DIOCGETDEV, dumpfile) <- this gets dump device into
"dumpfile"
dumpfd=open(dumpfile, O_RDONLY, 0444)
Seek to 64k:
llseek(dumpfd, -DUMP_OFFSET, SEEK_END) & -DUMP_OFFSET;
And read the dumphdr_t:
pread64(dumpfd, &dumphdr, sizeof(dumphdr_t), endoff);
dumphdr contains info to describe the various sections of crash dump,
their start points, their sizes and some more info like crash time, utsname...
sections:
1. dump_start of dumphdr denotes the beginning of crash and there will be a
blank page (size: 8192) to start with (Even this contains one copy of
dumphdr_t, so its not exactly blank :-))
2. Then starts the symbol table in compressed form (its no longer compressed !!)
which will subsequently become 'unix.*' upon running savecore. Its exact
size is available in dump_ksyms_size/dump_ksyms_csize of dumphdr.
3. This is followed by the page translation map. Its an array of structures
of type mem_vtop:
struct mem_vtop {
struct as *m_as;
void *m_va;
pfn_t m_pfn;
}
How does the system populate this? It walks through all the segements of
an address space (assuming we are dumping only kernel pages, then it will
just be 'kas') and for each pages of VA mapped, it will find out underlying
page frame number and updates the above structure. This is the key.
In our debuggers, whenever we try to access a VA, debugger just looks at
this map, finds out the PFN, seek to the appropriate page/offset and read
whatever is present there (So we should be able to write our own read
routines instead of using kvm_read() for fun).
How big is it? its determined by the 'dump_nvtop' member of dumphdr.
(size is: dump_nvtop*sizeof(mem_vtop_t))
4. This is followed by dumping all the Page Frame Numbers (PFN) of all the
memory pages in use, in an array form.
How big is this section? Its determined by the 'dump_npages' member of
dumphdr_t.
(size is: dump_npages*sizeof(pfn_t))
5. This is followed by the contents of all the PFNs noted above, in compressed
form. Each compressed buffer is preceded by the compressed buffer size
(which is a uint32_t).
Big picture:
| |
------------------------- ^ start
| | |
| | pagesize (8192)
| | |
ksyms start ------------------------- -
| | ^
| | |
| | dump_ksyms_csize/dump_ksyms_size
| | |
| | v
page xlation map------------------------- -
start | | |
| | |
| |
| | dump_nvtop * sizeof(mem_vtop_t)
| |
| | |
| | |
| | v
pfn start ------------------------- -
| | ^
| | |
| | |
| |
| | dump_npages * sizeof(pfn_t)
| |
| | |
| | |
Acutal dump strt------------------------- v
| |
| |
| |
| |
| |
------------------------- <--- this offset is 64k aligned
| dumphdr stored here |
| |
_________________________
| | ^
| | |
| | 64k
| | |
------------------------- v End of swap device
Layout of vmcore.*:
-------------------
1. First page will be a blank page.
2. Kernel symbol table is read from swap device into the vmcore.*, uncompress
it (nothing happens here, as it was not compressed in the first hand) and
then write to it unix.*;
Which means we don't need to pull over unix.* while copying over the crash
dumps, we can extract it from vmcore.* (some more fun....). All we need
to do is: open vmcore.*; seek by 8192; and then read dumphdr.dump_ksyms_size
bytes into unix.*
(Now the question is: as we don't have access to swap device, how do we
find out the value of dumphdr.dump_ksyms_size?
The thing is point-1 is wrong - First page is not exactly blank - but it
contains coredhr - so read dumphdr_t bytes from offset:0 and then get
corehdr.dump_ksyms_size bytes, populate unix.* with that many bytes from
vmcore.* - Now we have a unix.* !!
3. Pfn table is dumped next. As the above diagram indicates, we need to
dump (dump_npages * sizeof(pfn_t)) bytes from appropriate offset i.e
dumphdr.dump_pfn
4. Now we create the page transalation info by reading the contents from
swap device at offset: 'page xlation map start' - in swap device data
type is mem_vtop_t and in core file its: dump_map_t
struct dump_map {
offset_t dm_first;
offset_t dm_next;
offset_t dm_data;
struct as *dm_as;
uintptr_t dm_va;
}
There will be 'dump_nvtop' mem_vtop_t structures in swap device.
'dump_nvtop' can be higher than or equal or 'dump_npages' - as a page
could have been mapped at multiple VAes.
Now we read each mem_vtop_t, look at its PFN, then find out its location
in pfn_map (created in step 3) and update the 'dm_data' variable as
follows:
dm_data = (dump_data + (location * PAGESIZE))
For 'dump_data_start', please refer to the below figure.
5. Then the actual pages are dumped. These pages are stored in compressed
form. How they are stored in the dump device is: compressed size followed
by the compressed contents.
To extract them, something like this is done:
pread(dumpfd, &csize, sizeof (uint32_t), dumpoff);
^^^^^ <- this gets us the compressed size
dumpoff += sizeof (uint32_t); <- go past the data type
pread(dumpfd, inbuf, csize, dumpoff) <- read in the compressed buf;
decompress(inbuf, outbuf, csize, pagesize); <- uncompress it
^^^^^^^^ <- expected uncompressed size
write(corefd, outbuf, pagesize,corehdr.dump_data+x*pagesize) <- write
the uncompressed buf to the vmcore.* file
Big picture (vmcore.X):
offset: 0 --------------------------
| corehdr_t |^
| ||
| |8192
| ||
corehdr. | |v
dump_kyms --------------------------
| |^
| Kernel ||
| Symbols (A.K.A) ||
| unix.* |unix.* size aligned to page size
| ||
| ||
| |v
dump_pfn --------------------------
| |^
| Number of elements: ||
| (dump_npages) |pfn_table (PFNs of all dumped pages)
| || (aligned to page size)
| |v
dump_map --------------------------
| |^
| Array of dump_map_t ||
| structures ||
| (dump_nvtop) |Xlation map
| ||
| ||
| |v
dump_start --------------------------
| |^
| ||
| ||
| |Contents of Physical pages
| |are here
| ||
| ||
| |v
--------------------------
So given this picture how do we do the xlation given a VA:
1. First search for the VA in the tranlation map starting at: corehdr.dump_map
2. Get the 'dm_data' of the corresponding dump_map_t
3. Seek to that offset, and just read.
( Jul 12 2005, 11:50:47 PM PDT )
Permalink

Thursday June 30, 2005
softcall processing in solaris
kernel memory corruption detected !! and root caused
Points to note before i move to softcall processing (this is for sun4u only):
1. There exists an intr_vector array (each element of which is a 'struct intr_vector'
(refer to ivintr.h)). HW drivers register the interrupts that the HW
generates with this array. The index into this array will be the interrupt
number. And subsequently whenever that piece of HW interrups the CPU, it will
send this number.
2. A part of this array is used for softints. Softints are something which
are generated by the CPU itself (using SOFTINT register).
3. Again there exists an array called softcalls[], each element of which is
a 'struct softcall' (refer to common/os/softint.c)
Softcall is a mechanism whereby kernel can execute a function at PIL-1.
How do we do this?
1. The first step is to make an entry in the above array (i.e intr_vector[])
using add_softintr(). add_softintr() returns the index of this entry.
e.g: siron_inum=add_softintr(PIL_1, softlevel1,0)
2. From now on, we make a call to softcall() to execute a function at PIL-1.
What softcall does is: it adds an entry for this function in a softcall array -
and then makes a call to siron().
e.g: softcall(callout_execute, ct);
What siron() does is, it calls setsoftint()
e.g: setsoftint(siron_inum)
what setsoftint() does is, it checks whether the pending flag is set
for this number (in this case its siron_inum) in the intr_vector, and if its
not set, then it sets a bit in the SOFTINT register/sets the pending flags,
which triggers a level-1 interrupt when the PIL on the cpu goes below 1.
(one more thing it does is, takes off one free intr_req structure from the
freelist, fills it up with the interrupt number (i.e siron_number in this case)
and adds it to the list corresponding to this PIL - i.e to
cpu.machcpu.intr_pool[])
3. Now when a level-1 interrupt occurs,
pil_interrup() {
Takes off the first intr_req from the intr_pool[], note the
intr number and return the intr_req struct to the free pool.
intr_thread() {
SERVE_INTR(this is a macro) {
looks up the interrupt number in the intr_vector
and calls the handler
(i.e softlevel1 at index siron_inum)
softlevel1() {
softint() { }
}
}
}
}
}
what softint() does is, goes through the softcall array(i.e softcalls[])
and calls all the handlers registered (e.g callout_execute() at line:28)
Caveats:
1. siron() doesn't call setsoftint() if a global flag siron_pending is already
set. This gets unset in softlevel1(). For this reason, we can't guarantee on
which cpu a real-time callout runs.
(explained in detail: some other thread running on a different CPU may have called
softcall() and this resulted in siron_pending getting set and then it sets a
bit in SOFTINT registe of that cpu;
subsequently when clock runs and calls softcall(callout_execute), this results in
adding an entry to the softcall table, but as siron_pending is already set, this
just returns).
Subsequently when a level-1 interrupt happens on the first cpu, it will process
besides other entries, RT callouts as well.
2. One more thing to note is we are calling setsoftint() with siron_inum, and if
pending flag is alread set for this index in intr_vector[], it just retuns.
For this reason, in cyclics, it creates its own interrupt numbers for each cpu
for level1 and level10 interrupts.
Refer to cbe_configure() of sun4u/io/cbe.c
This way we don't depend on a softint that has to happen on a differet CPU.
3. cyclics also makes use of SOFTINT feature to generate soft interrupts, but
it doesn't go through the softcall array but what it does is:
add_softintr(PIL_1, (intrfunc)cbe_level1, 0);
cbe_softint() -> setsoftint(data->cbe_level1_inum);
cbe_level1() -> cyclic_softint() -> (*handler)(arg)
More on this later..
( Jun 30 2005, 10:36:44 PM PDT )
Permalink