« June 2005
SunMonTueWedThuFriSat
   
1
2
3
4
5
6
7
9
10
11
12
13
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
  
       
Today
XML

Blog::Navigation

Blog::Editing

Bookmarks::Blogroll

Bookmarks::News

Blog::Referers

Today's Page Hits: 4

Site notes

This page validates as XHTML 1.0, and will look much better in a browser that supports web standards, but it is accessible to any browser or Internet device. It was created using techniques detailed at glish.com/css/.

Powered by Roller Weblogger.
« Previous month (May 2005) | Main | Next month (Jul 2005) »
20050614 Tuesday June 14, 2005
Small logic error, Big Trouble

Small logic error, Big Trouble



            In this writeup, I will briefly explain how a small logic error
can result
in unpredictable behavior elsewhere. It may be a little
confusing since
all the aspects are not described. I will try to cover
those in future entries.

Last year, I got an escalation stating that Oracle 9i instances
configured to use DISM (Dynamic Intimate Shared Memory) could not be
shutdown ( on Solaris 9). Rather worse, they could not be killed either.
Only way out of the situation was to reboot. It was not acceptable for
large systems. This does not show up if Oracle is not shutdown
or DISM was not used. The bug id in question was 4878077. To complicate
the matters, it was  reported by a few customers, and the reproducing
the problem was not easy.


The stack of the application shows the following.
It is trying to destroy the DISM segments before it can exit.

genunix:cv_wait+0x3c(0x70004818040, 0x142e668)
unix:page_lock_es+0x1d0(, 0x1, 0x142d0c0, 0x0)
unix:page_lookup_create+0x1bc(0x30005156df0?, 0x600276e55f0000?, , 0x0)
unix:page_lookup - frame recycled
genunix:segspt_free_pages+0x1f0(0x30012f71ea8, 0x0, 0x21400000)
genunix:segspt_free+0x2c(0x30012f71ea8)
genunix:seg_free+0x2c(0x30012f71ea8)
genunix:segspt_unmap+0x44(0x30012f71ea8, 0x0, 0x21400000)
genunix:as_unmap+0xd8(0x3000770f738, 0x0, 0x21400000)
genunix:sptdestroy+0xc(0x3000770f738, 0x30013dbe2a8)
shmsys:kshmdt+0x1b8(0x3000692b450?, 0x20000000)
shmsys:shmsys+0x9c()
 


Dumping the page structure gives..

> 0x70004818000::print page_t
{
p_vnode = 0x30005156df0
p_hash = 0
p_vpnext = 0x700055e81a8
p_vpprev = 0x70004818078
p_next = 0x70004818000
p_prev = 0x70004818000
p_offset = 0x600276e55f0000
p_selock = 0x1 <<<<<<<==== shared lock set
p_lckcnt = 0
p_cowcnt = 0
p_cv = {
_opaque = 0x1
}
p_io_cv = {
_opaque = 0
}
p_iolock_state = 0
p_szc = 0x3
p_fsdata = 0
p_state = 0
p_msresv = [ 0, 0, 0 ]
}




Basically, the thread is waiting for the exclusive lock on the page. In the coredump
we do not see any thread that could potentially decrement the shared lock ( p_selock)
of this page. Hence it cannot progress from here.

From first glance it appears to be an impossible situation to get into.

If we look into the p_szc, it is 3. That means it is a large page, 4M to be exact.
This structure says that this is a constituent of a large page. Logically,
all the constituent pages should be having same flags set.

Looking at the adjacent page's p_selock, they are not locked.

> 0x700055e81a8::print page_t p_selock
p_selock = 0
> 0x70004818078::print page_t p_selock
p_selock = 0
>

This leads us to believe that some strange things would have occurred sometime ago.

Only way to find out what happened here was to employ a debug buffer to trace the
p_selock status to identify the transitions along with the thread id. This was
done on Solaris 9. In S10, Dtrace(1M) would have been the tool of choice.

With p_selock tracing in place, I could narrow down an area where the shared locks
are incremented and identified the root cause.

In file common/vm/seg_spt.c the routine segspt_dismpagelock() does caching of
DISM pages by creating a ppa array if it is NULL. For large pages, it covers all
the constituent pages.


Original logic:


Loop through all the pages, npgs = 512 which amount to 4M large page.
large page boundaries are controlled by the counter lpg_cnt.
tot_pages is the total number of pages in the anon_map.
pplist[] is array of pointers to pagelist in DISM segment.

for (an_idx = 0; an_idx < tot_npages; ) {
ap = anon_get_ptr(amp->ahp, an_idx);
/*
* Cache only mlocked pages. For large pages
* if one (constituend) page is mlocked
* all pages for that large page
* are cached also. This is for quick
* lookups of ppa array;
*/
if ((ap != NULL) && (lpg_cnt != 0 ||
(sptd->spt_ppa_lckcnt[an_idx] != 0))) {

swap_xlate(ap, &vp, &off);
pp = page_lookup_nowait(vp, off, SE_SHARED);
(A)---> if (pp != NULL) {
(B)---> if (lpg_cnt == 0) {
npgs = page_get_pagesize(pp->p_szc) >> PAGESHIFT;
if (!IS_P2ALIGNED(an_idx, npgs)) {
(D)--> an_idx = P2ALIGN(an_idx, npgs);
(C)--> page_unlock(pp);
continue;
}
}
if (++lpg_cnt == npgs)
lpg_cnt = 0;

/*
* availrmem is decremented only
* for unlocked pages
*/
if (sptd->spt_ppa_lckcnt[an_idx] == 0)
claim_availrmem++;
pplist[an_idx] = pp;
}
}
(F)--> an_idx++;
}



Start of a large page is denoted by lpg_cnt = 0 (B).
an_idx is adjusted (D) so that it can track the large pages with npgs (512).

Assuming at (A) we have pp as NULL..

We still increment the an_idx (F), but not lpg_cnt. This results in a situation
where we get misalignment. For the next iteration, we get a page in pplist[] which will have
sharedlock incremented twice. This was confirmed by tracing information.

Here is a snippet from the trace buffer..


In this example, we got pp null for an_idx = 96103. It would skip that and results in an lpg_cnt mis-alignment.

73641486.484357 0.007360 539 1 0x302db346d20 98 dism_page_lock1 time: 0xa3d00e pp: 0x700166617d0 lcnt: 358 selock: 1 an_idx: 96102 *****

73641486.496997 0.012640 539 1 0x302db346d20 98 dism_page_lock1 time: 0xa3d00e pp: 0x700166618c0 lcnt: 359 selock: 1 an_idx: 96104 *****

index 96103 is missing..

2. Note that we check for beginning of large page based on lpg_cnt==0 at (B).
Since we missed one index we end up with the first page of the subsequent page as the last page of this
large page ( assuming one instance of pp was NULL).

Going forward, we see the following traces..

73641488.005500 0.014320 539 1 0x302db346d20 98 dism_page_lock1 time: 0xa3d00e pp: 0x7001613e000 lcnt: 511 selock: 1 an_idx: 96256
** This index should be the beginning of next large page because (an_idx % 512 == 0), but landed here due to a missed an_idx earlier.

73641488.017500 0.012000 539 1 0x302db346d20 98 dism_page_lock1 time: 0xa3d00e pp: 0x7001613e078 lcnt: 0 selock: 1 an_idx: 96257
** We get a misaligned an_idx with lpg_cnt == 0, we should have gotten an_idx as 96256.

73641488.060541 0.043041 539 1 0x302db346d20 98 dism_page_lock4 time: 0xa3d00e pp: 0x7001613e078 an_idx: 96256

We release the lock for this page ( an_idx = 96257) at (C) and adjust an_idx (D) to beginning of large page an_idx, 96256.

73641488.064061 0.003520 539 1 0x302db346d20 98 dism_page_lock1 time: 0xa3d00e pp: 0x7001613e000 lcnt: 0 selock: 2 an_idx: 96256

Start locking page from that index again. We end up with additional lock for the page 0x7001613e000 at an_idx=96256,
since it was locked as part of the previous large page as well. This happened due to skip in an_idx
and not incrementing lpg_cnt.

73641488.066221 0.002160 539 1 0x302db346d20 98 dism_page_lock1 time: 0xa3d00e pp: 0x7001613e078 lcnt: 1 selock: 1 an_idx: 96257
73641488.076301 0.010080 539 1 0x302db346d20 98 dism_page_lock1 time: 0xa3d00e pp: 0x7001613e0f0 lcnt: 2 selock: 1 an_idx: 96258

Once we get into to this sitatuion, it is not possible to bail out. Extra lock remains forever and hence the
hang while exiting. A temporary fix to detect and correct the condition resulted in a clean shutdown.


Once the logic was understood, the fix was simple. There were two choices.

(1) Make sure that pp never becomes NULL.
(2) If it is NULL, make sure that the whole large page is not cached in pplist.

Option (1) was simpler.

A one line fix did the trick.

pp = page_lookup(vp, off, SE_SHARED);

New code around the fix is here.

A SunAlert was created to notify the customers.



Technorati Tag:
Technorati Tag:
Technorati Tag:
















 









 











 









 











 









 











 









 

20050608 Wednesday June 08, 2005
Introduction test My name is Narayana Kadoor working in Solaris sustaining group.
I graduated from Indian Institute of Science, and worked at Wipro,
Bangalore for about 5 years, before joining Sun.

Apart from looking at the code and fixing bugs, I spend my free time
on electronics, photography, and reading. Hopefully, I will find time
to post some blog entries too..
Copyright (C) 2003, Narayana Kadoor's Weblog