Alok Aggarwal's Weblog


20090309 Monday March 09, 2009

 Serving Up Lzma

I just pushed the changes that add LZMA to (Open)Solaris
and also allow lofi(7D) to use LZMA as one of the
supported compression algorithms.

On an snv_111 machine, here's what you will see -

Usage: lofiadm -a file [ device ]
       [-c aes-128-cbc|aes-192-cbc|aes-256-cbc|des3-cbc|blowfish-cbc]
       [-e] [-k keyfile] [-T[token]:[manuf]:[serial]:key]
       lofiadm -d file | device
       lofiadm -C [gzip|gzip-6|gzip-9|lzma] [-s segment_size] file
       lofiadm -U file
       lofiadm [ file | device ]

So, if you take a largish file and compress it with gzip
and with lzma, the size difference is quite noticeable.

contraption# du -h solaris.orig
 2.2G   solaris.orig
contraption# lofiadm -C lzma solaris.orig
contraption# du -h solaris.orig
 555M   solaris.orig
contraption# lofiadm -U solaris.orig
contraption# lofiadm -C gzip solaris.orig
contraption# du -h solaris.orig
 702M   solaris.orig
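
To put numbers on that difference, here is a minimal sketch that turns the du -h figures above into ratios. It assumes du -h reports binary units (1G = 1024M) and that the 2.2G figure is rounded, so the results are approximate. The function names are mine, not part of lofiadm.

```python
# Sizes taken from the du -h output above, expressed in MiB.
# Assumption: du -h uses binary units, so 2.2G is roughly 2.2 * 1024 MiB.
ORIGINAL_MIB = 2.2 * 1024
SIZES_MIB = {"lzma": 555, "gzip": 702}


def ratio(compressed_mib, original_mib=ORIGINAL_MIB):
    """Compressed size as a fraction of the original size."""
    return compressed_mib / original_mib


def relative_saving(smaller_mib, larger_mib):
    """How much smaller one compressed image is than the other, as a fraction."""
    return (larger_mib - smaller_mib) / larger_mib


if __name__ == "__main__":
    for name, size in SIZES_MIB.items():
        print(f"{name}: {ratio(size):.1%} of the original size")
    saving = relative_saving(SIZES_MIB["lzma"], SIZES_MIB["gzip"])
    print(f"lzma image is {saving:.1%} smaller than the gzip image")
```

On these figures the LZMA image comes out at roughly a quarter of the original, and about a fifth smaller than the gzip image.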

With LZMA support now available for both userland and
kernel consumers, it should be very easy for other Solaris
utilities (zfs?) to provide support for it.


( Mar 10 2009, 03:38:07 PM EDT / Mar 09 2009, 02:44:26 PM EDT )
Trackback: http://blogs.sun.com/aalok/entry/serving_up_lzma

20080512 Monday May 12, 2008

 Compressed lofi for LiveCD - why

The OpenSolaris LiveCD contains hsfs filesystems that
are compressed with lofi compression. Primary among
these are solaris.zlib, which maps to /usr, and solarismisc.zlib,
which maps to /mnt/misc.

The /usr filesystem contains essential components to
allow for the LiveCD to boot into a desktop. As a result
the layout of this filesystem is carefully ordered such
that accesses are sequential as opposed to being completely
random. This careful ordering of contents allows for the
LiveCD to boot into a desktop in a reasonable amount of
time (~3 minutes on most systems).

Since hsfs is the only OpenSolaris filesystem that allows
files to be ordered a certain way, via the '-sort' flag
to mkisofs(8), it was the obvious choice for the /usr
filesystem. That is also the primary reason why compressed
lofi is used for the LiveCD as opposed to, say, ZFS or dcfs(7FS).

More details can be found in Moinak's slides here.


( May 12 2008, 03:32:25 PM EDT / May 12 2008, 03:32:25 PM EDT )
Trackback: http://blogs.sun.com/aalok/entry/compressed_lofi_for_livecd_why

20080430 Wednesday April 30, 2008

 Lzma Numbers

I recently wrote that LZMA has been used to pack more languages
onto the LiveCD. Here are some charts that show how LZMA
stacks up against some of the other popular compression algorithms.
(Apologies for the poor image quality; open in another window
for a clearer image.)


These tests were run on a LiveCD archive using 7za(1). As you'll note, the compression ratio provided by LZMA is about 35% better than gzip-9. However, LZMA is more CPU intensive, and as a result its compression and decompression speeds are slower than the alternatives. So, for some use cases the CPU versus compression tradeoff might make LZMA unsuitable, but for the LiveCD use case it is reasonable, provided we architect our solution such that the decompression speed isn't a bottleneck. (Compression speed isn't a problem for the LiveCD architecture.)


( Apr 30 2008, 06:52:18 PM EDT / Apr 30 2008, 06:52:18 PM EDT )
Trackback: http://blogs.sun.com/aalok/entry/lzma_compression

20080424 Thursday April 24, 2008

 Lzma on OpenSolaris

The OpenSolaris 2008.05 release that is going to come
out sometime in May is going to have two versions of
the same LiveCD, one with a limited set of languages and
locales and another one with a fuller set of languages.

One of the big challenges with creating a LiveCD with a
full set of languages was that there was a limited amount
of free space on the CD to allow for including
all the languages. How do you cram more stuff on the CD?
Compress it harder, I say! Even better, compress it with
LZMA!

The OpenSolaris kernel did not have an in-kernel implementation
of LZMA that could be taken advantage of (why we need an
in-kernel implementation, I'll answer in a separate blog entry).
So, in our quest to provide one, we started looking at the LZMA SDK.
Some of the challenges with porting the source from this SDK to the
OpenSolaris kernel were that our lawyers were not amenable to the licensing
terms and that the compression code was all written in C++ (which,
for the uninitiated, is strongly discouraged in the kernel).

If you've ever dealt with lawyers you'll be quick to spot
that the licensing can be particularly troublesome. It was.
But only until we contacted the author of LZMA, Igor Pavlov.
Igor was not only willing to relicense the source code under
CDDL (which would of course be agreeable to the lawyers) but
also willing to rewrite the compression code in C. And, he
did that in just a couple of weeks - truly outstanding.
That, to me, is the power behind open source and the sharing
opportunities it provides for the broader good.

So, thank you Igor for an excellent compression algorithm
in LZMA and thanks for all your assistance in making the
OpenSolaris 2008.05 release what it is. We look forward to
working with you in the future too.


( Apr 24 2008, 06:33:46 PM EDT / Apr 24 2008, 06:33:46 PM EDT )
Trackback: http://blogs.sun.com/aalok/entry/lzma_on_opensolaris

20070830 Thursday August 30, 2007

 Multiboot - Solaris and Ubuntu

I've recently been futzing with getting my laptop
to run both Solaris and Ubuntu. Ubuntu mostly
because I want to run VMware, which does not
support Solaris as the host operating system (yet?).
I wanted to run VMware mostly to cut down my
development time (I'll save the answer to how I do
that for another day).

I failed miserably in trying to get Ubuntu grub to
boot Solaris. I later found out that it doesn't
work because the required changes in Solaris grub haven't
gone back into the mainstream grub code.

I also realized that the order in which the two operating
systems are installed matters, primarily because
of this deficiency in grub - Ubuntu must be installed first
and Solaris second. This results in Solaris grub being
installed in the master boot record which can then be
taught about where to find Ubuntu by adding an entry such
as this to /boot/grub/menu.lst -

title           Ubuntu, kernel 2.6.20-15-generic
root            (hd0,1)
kernel         /boot/vmlinuz-2.6.20-15-generic root=UUID=91647296-9aca-4d1f-bdfd-7894ff9f0807 ro quiet splash
initrd          /boot/initrd.img-2.6.20-15-generic
quiet
savedefault

Having said this, I also found by trial and error that
if you do install Solaris first and Ubuntu second, with
the result that Ubuntu grub lands in the MBR, you can salvage
the situation by manually slamming Solaris grub into the MBR.

In order to do this, boot off of the Solaris media and
get a shell. Then utter the following incantation -

# /sbin/installgrub /boot/grub/stage1 /boot/grub/stage2 /dev/rdsk/cNdNsN

where cNdNsN is the root slice. This restores sanity and
you can now add the lines for Ubuntu to the menu.lst.
Please note that the Solaris release on the media should be
as close as possible to the installed Solaris release (if not
the same).



( Aug 30 2007, 11:02:00 AM EDT / Aug 30 2007, 11:01:06 AM EDT )
Trackback: http://blogs.sun.com/aalok/entry/multiboot_solaris_and_ubuntu

20070829 Wednesday August 29, 2007

 Marvell Ethernet on Solaris

I've got a Sony Vaio that has a Marvell 88E8055 gigabit
ethernet card that doesn't work out of the box on
OpenSolaris.

The bundled SK98sol driver is old and dated. The new
driver must be downloaded either from Sysconnect in
the case of 64-bit or from Marvell in the case of 32-bit.
Update -- if you're doing this on a laptop, you want to
download the driver for "PCI Express Desktop Adapter"
from the Sysconnect website.

After downloading the driver, the pre-existing SK98sol
package needs to be removed prior to adding the downloaded
SKGEsol package (remember to also remove the 'sk98sol'
entries from /etc/driver_aliases). Once the SKGEsol package
has been added Solaris needs to be informed about the
new driver by doing the following -

- Get the PCI vendor and device ID for the ethernet card by either
  running 'prtconf -v' or '/usr/X11/bin/scanpci'. The
  PCI ID for my machine was 'pciex11ab,4363'.

- Either use 'add_drv' or 'update_drv' to associate
  that pci id with the skge driver. Something like this -
  # rem_drv skge
  # add_drv -m '* 0660 root sys' -i '"pciex11ab,4363"' skge

The driver should now attach and be ready to be plumbed.
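
If you end up doing this for more than one card, the alias string is easy to pick apart. The sketch below is only an illustration: it assumes the alias has the same shape as mine (a 'pci' or 'pciex' prefix, then hex vendor and device IDs separated by a comma), and the helper names are made up. It just builds the add_drv command line shown above as a string; it does not run anything.

```python
import re

# Shape of the alias as seen above: bus prefix, hex vendor ID, comma, hex device ID.
# "pciex" is listed first so it is tried before the shorter "pci" prefix.
ALIAS_RE = re.compile(r"^(pciex|pci)([0-9a-f]{1,4}),([0-9a-f]{1,4})$")


def parse_alias(alias):
    """Split an alias such as 'pciex11ab,4363' into bus, vendor ID and device ID."""
    match = ALIAS_RE.match(alias)
    if match is None:
        raise ValueError(f"unrecognised PCI alias: {alias!r}")
    bus, vendor, device = match.groups()
    return {"bus": bus, "vendor": int(vendor, 16), "device": int(device, 16)}


def add_drv_command(alias, driver, perm="* 0660 root sys"):
    """Build (but do not run) the add_drv invocation used in the steps above."""
    parse_alias(alias)  # validate the alias before quoting it
    return f"add_drv -m '{perm}' -i '\"{alias}\"' {driver}"


if __name__ == "__main__":
    ids = parse_alias("pciex11ab,4363")
    # 0x11ab is the PCI vendor ID assigned to Marvell.
    print(f"vendor 0x{ids['vendor']:04x}, device 0x{ids['device']:04x}")
    print(add_drv_command("pciex11ab,4363", "skge"))
```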


( Sep 06 2007, 11:56:54 AM EDT / Aug 29 2007, 03:00:00 PM EDT )
Trackback: http://blogs.sun.com/aalok/entry/marvell_ethernet_on_solaris

20070828 Tuesday August 28, 2007

 NFS Namespace Extensions

So, for those of you that haven't kept up with
projects going on in the NFS space, one of them
is NFSv4 namespace extensions. The two namespace
extensions are "mirror-mounts" and "referrals".

I just noticed that a demo based on a prototype
that we did earlier this year was posted a few
weeks back here. Avid viewers will note
how the referral functionality can be leveraged
to create a very basic global namespace.

Once the code is back in OpenSolaris, it will
be available for anyone interested in extending
this code in interesting ways.

The timing of these OpenSolaris projects is
quite nice considering the renewed momentum
at the IETF NFSv4 WG with respect to Federated
File Systems.



( Aug 28 2007, 02:04:00 PM EDT / Aug 28 2007, 02:03:02 PM EDT )
Trackback: http://blogs.sun.com/aalok/entry/nfs_namespace_extensions

20070511 Friday May 11, 2007

 A Global Namespace with NFSv4

The NFSv4 specification has provisions in it that allow
for constructing a "Global Namespace" for files.

Let's start by defining what is meant by a Global Namespace.
A Wikipedia search doesn't quite yield what we're looking
for but it results in a link to "Global filesystem" which
oddly enough sort of captures the essence of a global namespace.

So, to rephrase what Wikipedia has to say, a Global Namespace
is a flat namespace where filesystems hosted on a number of
different servers can be aggregated such that they appear as
a single filesystem to all the clients. Okay, that sounds
rather dry, just what good is that?

Consider a typical enterprise, for example, Sun. The enterprise
spans multiple countries across multiple geographies. This
brings about a need for separating the IT network that takes
into account the location affinity -- the US west coast users
associate with the .sfbay domain, US central with .central,
UK with .uk, China with .prc and so on. Each location has to
know the names of the servers that host the relevant filesystems
such that those filesystems can be mounted either a priori or
as they're accessed (via the automounter and such). Additionally,
there's obviously the administrative overhead relating to
configuring the mounts or the automounts as well as maps for
the latter.

What if I were to replace this with, say, a single server
(appropriately replicated across .sfbay, .central, .uk, etc)
that acts as the "root" of the namespace? All the clients
across the enterprise need to know just about this single
server (even that might not be needed in the presence of
something like this) from which they mount the root filesystem.
And, subsequently as they access the directories which don't
exist in, say the .sfbay domain, they get appropriately
redirected to the server in the .central domain that hosts
these directories (or filesystems to be accurate). The clients
automagically mount the absent filesystem(s) from the .central
server and allow access -- all transparently, without any
user intervention and without the need to configure any
automounter maps.

This is, in essence, a Global Namespace for files (grossly
oversimplified but conveys the gist nevertheless).

The NFSv4 protocol allows for such a facility via the
use of Referrals, Replication and Migration. All the
gory details of this facility can be found in RFC 3530 Section 6
as well as the latest internet draft for NFSv4.1. From
a high level this facility allows for an NFSv4 capable server
to indicate to an equally capable client that a particular
filesystem does not exist on the server in question. The
client can subsequently query the server as to where that
filesystem actually resides to which the server replies with
a list of locations. The client can then initiate a mount
from any of those locations.
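
To make that exchange concrete, here is a toy simulation, loosely modeled on the flow just described. Everything in it is hypothetical: the server names, the paths and the helper functions are made up, and nothing here speaks the real protocol. The only thing borrowed from RFC 3530 is the NFS4ERR_MOVED status code, which is what a server returns when the filesystem lives elsewhere.

```python
NFS4_OK = 0
NFS4ERR_MOVED = 10019  # RFC 3530 status for "this filesystem is not here"

# Hypothetical servers. Each path holds either local data or a referral,
# i.e. a list of "server:path" locations where the filesystem really lives.
SERVERS = {
    "root.example": {
        "/": {"data": "namespace root"},
        "/sfbay": {"referral": ["sfbay.example:/export/sfbay"]},
        "/central": {"referral": ["central1.example:/export/central",
                                  "central2.example:/export/central"]},
    },
    "sfbay.example": {"/export/sfbay": {"data": "sfbay files"}},
    "central1.example": {"/export/central": {"data": "central files"}},
    "central2.example": {"/export/central": {"data": "central files"}},
}


def lookup(server, path):
    """Server side: answer with data, or say the filesystem has moved."""
    entry = SERVERS[server][path]
    if "referral" in entry:
        return NFS4ERR_MOVED, None
    return NFS4_OK, entry["data"]


def get_fs_locations(server, path):
    """Server side: list the locations for a referred filesystem."""
    return list(SERVERS[server][path]["referral"])


def resolve(server, path, max_hops=4):
    """Client side: follow referrals until some server returns data."""
    for _ in range(max_hops):
        status, data = lookup(server, path)
        if status == NFS4_OK:
            return server, path, data
        # The filesystem is elsewhere: ask where, then retry at the first location.
        location = get_fs_locations(server, path)[0]
        server, path = location.split(":", 1)
    raise RuntimeError("too many referrals")


if __name__ == "__main__":
    print(resolve("root.example", "/central"))
```

A real client would pick among the returned locations rather than always taking the first, which is where the richer NFSv4.1 location information comes in.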

The NFSv4.1 spec allows for the primary server to return a
much richer set of location information compared to that
supported by NFSv4.0. The richer location information allows
for the client to ascertain which of the locations will be
better equipped, for example to deliver a high QoS.

So, ultimately this functionality enables us to tie together
a number of disjoint servers such that they appear as a
single server. Did I mention single? And, the fact that
NFSv4 is a standard protocol helps immensely -- you can
construct a Global Namespace that comprises heterogeneous
servers and clients so long as they support NFSv4 in general
and referrals/replication-migration in particular.

The logical next question is - does OpenSolaris support this
NFSv4 feature? No, not yet. But, follow the details here.



( May 11 2007, 09:01:24 AM EDT / May 11 2007, 09:01:02 AM EDT )
Trackback: http://blogs.sun.com/aalok/entry/nfsv4_and_a_global_namespace

20070510 Thursday May 10, 2007

 iSCSI at ATLOSUG

At the last ATLOSUG meeting a couple of days ago, Ryan Matteson
talked about iSCSI in OpenSolaris. He even did a demo.
I never realized it was just *that* easy to set up the target
and the initiator, except of course for setting ACLs, which
seemed like a royal PITA.

Ryan's presentation can be found here.




( May 10 2007, 11:30:52 AM EDT / May 10 2007, 11:30:52 AM EDT )
Trackback: http://blogs.sun.com/aalok/entry/iscsi_at_atlosug

20070406 Friday April 06, 2007

 Sun SPOTs

A few weeks back Project Blackbox was doing roadshows and as part of
that it visited Atlanta as well. So, I went over and checked it out mostly
'cause I was curious to see how it had been engineered.


On the way out, I noticed this guy listening to music on his Powerbook
and tinkering with a little gizmo almost a third of the size of a credit card.
Upon asking what it was, he said it was a Sun SPOT - a battery powered device
that he was using to gather environmental data on the Blackbox. The device had
been attached to the Blackbox and it was gathering data such as temperature and
humidity. And, this dude was sucking data from it by connecting to it wirelessly
from his Powerbook.

It turns out these devices on the Blackbox also serve as GPS tracking
devices.

So, what are Sun SPOTs? Pictures can be found here but these are
basically ARM powered devices that run Java and can be programmed
in a variety of ways using the SDK that comes with the kit. More information
on the ways this toy can be used can be found here and here.

Pretty cool stuff!



( Apr 06 2007, 05:32:51 PM EDT / Apr 06 2007, 05:19:09 PM EDT )
Trackback: http://blogs.sun.com/aalok/entry/sun_spots

20070302 Friday March 02, 2007

 Sun Rays in Classrooms

A couple of days back I happened to notice an article on the Sun internal webpage, the summary
of which said something like:

"Sun Provides Sun Ray Solution to Chitkara Institute

At Chitkara, Sun has installed four high end SUN servers with the best in class Solaris
operating system. 300 thin clients are spread all over the campus connected on Nortel
switches through a Fibre backbone.
The Sun Ray architecture is comprised of Sun Ray thin clients and Sun servers. By
moving resources to a central location on the network and removing complexity, a much
more flexible and easy-to-maintain environment is achieved."

(Apparently, the actual article appeared in Chandigarh, India, Tribune News Service.)

And the first thought that came to my mind was, "Chitkara? Is it the same as the Chitkara I
knew when I was in high school in India? If so, there's an Institute by that name?"

Once I read the entire article, it was clear to me that it was the Chitkara I used to know:
the same guy who taught Math classes out of his house (a couple of miles from my parents'
house) when I was about to get into engineering school (around 1994). And, yes, it had
been turned into a full scale Institute, or rather a group of Institutions. Wow!

This was amazing news from my perspective on a couple of counts.

First, one of the well known educational institutions in the city of Chandigarh, where most
of the population has typically been Microsoft savvy (that is true for the majority of India as well),
had adopted a Sun solution running Solaris! This is a huge acknowledgement of the fact that
Solaris has come a long way in terms of being more approachable and user friendly. Mind
you, there is still a lot of catching up left to do, but it is approachable enough for students in
this case.

Secondly, there were going to be Sun Rays in classrooms, just as a bunch of other educational
institutions around the world have adopted. I've always thought of the Sun Ray technology as
one of the best pieces of technology created at Sun. And, where better to put them than in the
classrooms -- no need to have 300 different workstations that need to be maintained on a regular
basis and draw a significant amount of power, the students can't do crazy stuff on 'em and they
can't even physically break the damn thing coz it's just a dumb thin client (well, for the most part).
Just ideal! Plus, you get all the benefits of a carrier grade OS in Solaris.

And, thinking about it for a moment - it doesn't even have to be Solaris. You could very well be
serving up Windows or even Linux sessions (how long will it be before a Mac OSX virtual client
is available?). So, you could very well bring up a screen on the Sun Ray that gives the users the
ability to choose from the available operating environments -- they could choose whichever they
like, potentially based upon the kind of activity they wish to undertake.

Having been a long time user of the Sun Ray at Home solution, I wonder how long it will be
before service providers like Comcast, in addition to providing you internet access, also serve up
virtual desktops for an additional charge. The users benefit because they don't have to be sysadmins
anymore and the service provider benefits because it gets to deliver value added services.

So, really, how long will it be before we hit a tipping point for the Sun Ray technology?


( Mar 02 2007, 11:04:17 AM EST / Mar 02 2007, 10:42:39 AM EST )
Trackback: http://blogs.sun.com/aalok/entry/sun_rays_in_classrooms

20051026 Wednesday October 26, 2005

 NAS Industry Conference '05

The NAS Industry Conference concluded last week in Santa Clara. There
were a lot of good talks from various vendors; all of these should be
available later this week on the nasconf website.

Some of the talks I liked included the EMC Keynote and the Filebench talk by Eric.
The EMC Keynote on "Global File Virtualization" gave a very good overview of the
various components needed in order to provide file virtualization and what standards
based technologies (pNFS, NFSv4 referrals, CIFS) can be pieced together to provide
such a solution. Eric, who's a pretty entertaining speaker, talked about the use of a new
filesystem benchmarking tool, Filebench, and how that can be used to determine
performance bottlenecks in NFSv4.

Other talks from Sun included the ACLs tutorial by Sam and Lisa, Filebench tutorial by
Richard McDougall, Observability talk by Bill Baker and IETF NFSv4 Minor Version Update
by Spencer. And, oh yeah, yours truly gave a talk on Checksums for NFSv4 (yeah they
serve a real purpose because the TCP checksum *is* very weak :) ) as well. In case the
pdf version on the conference website renders screwy, here's a staroffice version that
should work.

Overall, it was a good conference. And, thank you Audrey for doing such a good job
of organizing this once more.



( Oct 28 2005, 12:36:43 PM EDT / Oct 26 2005, 05:44:54 PM EDT )
Trackback: http://blogs.sun.com/aalok/entry/nas_industry_conference_05

20050614 Tuesday June 14, 2005

 Diskset Import - An Introduction to the source

One of my most significant contributions (along with Steve Peng) to
Solaris 10 was to add support for import/export of disksets to SVM. So, what
is import/export of disksets? Simply put, you've got a bunch of disks
encapsulated in an SVM diskset and you want to disconnect them from one
host, connect them to a different host, and get your SVM configuration
back. Why might you want to do this? Say you want to consolidate your storage,
or in case of a disaster you want to move your storage from one server to
another; then you might want to do something just like this.

SVM stores its configuration information for the local set in a regular
metadb (the one that can be seen by metadb(1M) without arguments). The diskset
related configuration is stored in a diskset metadb (one that can be seen by
'metadb -s <diskset>' command) that resides on most (if not all) of the disks that are
a part of that diskset. Additionally, the local set metadb has knowledge about
the disksets including information on where to find the diskset metadbs.

The problem with moving storage from one server to another is that you lose
the local metadb and thus don't know where to find the diskset metadbs (and the
associated configuration). In order to implement diskset import we needed to
figure out which of the recently connected disks in the target system have a diskset
metadb on them, read the configuration in from that metadb and populate the
kernel structures with the read in configuration information. That was the
scope of the problem in a nutshell.

We started out by writing the code to scan the disks for diskset metadbs
(entirely in userland). If you want to follow the conversation with code
references, pull up metaimport.c. This is essentially the source of
metaimport(1M). The code starts out by scanning the available set of disks,
pruning the disks that are in use, and then for each drive that's left it
calls meta_get_set_info - this is the heart of the scanning code. It checks
to see if a diskset metadb exists on the passed in disk and, if one exists,
reads it in and does a sanity check on the metadata information read in. It
also does the work of figuring out the new disk names, i.e. a disk named c1t1d1
in the source system might be named c2t2d22 in the target system and you need
to correct the related metadata information in the diskset metadb to reflect
the fresh state of affairs. Upon its return, meta_get_set_info has a list
of disks that comprise a diskset.
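
As a rough mental model of that scan (and nothing more), here is a toy sketch. It is not the metaimport code: the dictionary layout, the field names and scan_for_disksets itself are all invented for illustration. It only mirrors the three steps described above: prune in-use disks, keep the ones carrying a diskset metadb, and note where the recorded disk name no longer matches.

```python
def scan_for_disksets(disks):
    """Group importable disks by diskset and record any name changes.

    Each disk is a dict with hypothetical keys:
      'name'    - the device name on this (target) host
      'in_use'  - True if the disk is already in use and must be skipped
      'metadb'  - None, or a dict with 'set_id' and 'recorded_name'
                  (the name the disk had on the source host)
    """
    sets = {}
    for disk in disks:
        if disk["in_use"]:
            continue  # prune disks that are in use
        metadb = disk.get("metadb")
        if metadb is None:
            continue  # no diskset metadb on this disk
        entry = sets.setdefault(metadb["set_id"], {"disks": [], "renames": {}})
        entry["disks"].append(disk["name"])
        if metadb["recorded_name"] != disk["name"]:
            # The metadata still carries the old name and needs correcting.
            entry["renames"][metadb["recorded_name"]] = disk["name"]
    return sets


if __name__ == "__main__":
    found = scan_for_disksets([
        {"name": "c2t2d22", "in_use": False,
         "metadb": {"set_id": "setA", "recorded_name": "c1t1d1"}},
        {"name": "c0t0d0", "in_use": True, "metadb": None},
    ])
    print(found)
```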

Once we've built up the list of disksets and the disks that comprise each
of those disksets, we pass all of this information to meta_imp_set, which does
the real work of populating the information in the kernel via ioctls. The
MD_DB_USEDEV ioctl creates the kernel structures (akin to what happens when
creating the initial configuration). The MD_IOCIMP_LOAD ioctl then snarfs in
the detailed configuration; the heart of this code is in md_imp_snarf_set.
The ops vector for each of the modules (stripe, mirror, etc) was expanded to
include an import op. So, for example, the stripe ops vector now looked
something like this -

md_ops_t stripe_md_ops = {
        stripe_open,            /* open */
        stripe_close,           /* close */
        md_stripe_strategy,     /* strategy */
        NULL,                   /* print */
        stripe_dump,            /* dump */
        NULL,                   /* read */
        NULL,                   /* write */
        md_stripe_ioctl,        /* stripe_ioctl, */
        stripe_snarf,           /* stripe_snarf */
        stripe_halt,            /* stripe_halt */
        NULL,                   /* aread */
        NULL,                   /* awrite */
        stripe_imp_set,         /* import set */
        stripe_named_services
};

The import op for each of the modules handled creating detailed configuration
as well as updating out-of-date information.

md_imp_snarf_set calls the import op for each of the modules that appear in
the diskset configuration. So, if there's a stripe in the diskset configuration
stripe_imp_set gets called and so on. Subsequently, the code does exactly
what it says :)

        /*
         * Fixup
         * (1) locator block
         * (2) locator name block if necessary
         * (3) master block
         * (4) directory block
         * calls appropriate writes to push changes out
         */
        if ((err = md_imp_db(setno)) != 0)
                goto cleanup;

        /*
         * Create set in MD_LOCAL_SET
         */
        if ((err = md_imp_create_set(setno)) != 0)
                goto cleanup;


It fixes up another set of out-of-date information and creates the appropriate
structures in the local set to inform the local set about the diskset
configuration and where to find it. That's it, we're done with our job in the
kernel and we return to userland.
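
The per-module import dispatch described above boils down to a table of function slots, one of which is the import op. The sketch below shows only that pattern; it is not the kernel code. The table, the handler bodies and imp_snarf_set are stand-ins I made up, and a module without an import op is simply skipped here, which is my assumption rather than something taken from the source.

```python
def stripe_import(setno):
    """Stand-in for a stripe module's import op."""
    return f"stripe records imported for set {setno}"


def mirror_import(setno):
    """Stand-in for a mirror module's import op."""
    return f"mirror records imported for set {setno}"


# One ops table per module, each with an 'import' slot (None if unsupported).
MODULE_OPS = {
    "stripe": {"import": stripe_import},
    "mirror": {"import": mirror_import},
    "other": {"import": None},  # hypothetical module with no import op
}


def imp_snarf_set(setno, modules_in_config):
    """Call the import op of every module that appears in the configuration."""
    results = []
    for module in modules_in_config:
        import_op = MODULE_OPS[module]["import"]
        if import_op is not None:
            results.append(import_op(setno))
    return results


if __name__ == "__main__":
    for line in imp_snarf_set(1, ["stripe", "mirror", "other"]):
        print(line)
```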

In the userland, the only other thing that needs to be done is to inform the
rpc daemon that stores the knowledge about disksets (rpc.metad) about the
existence of this imported set. This is accomplished via the clnt_resnarf_set routine.

So there you have it - a 15,000 ft overview of the implementation of diskset
import.




( Jun 14 2005, 12:52:07 PM EDT / Jun 14 2005, 12:38:33 PM EDT )
Trackback: http://blogs.sun.com/aalok/entry/diskset_import_an_introduction_to

 Debugging on Sparc


While debugging x86/x64 crash dumps has been fairly extensively talked
about at various places (most recently here), I haven't come across any
resources that talk about debugging sparc dumps (other than numerous
bug reports). Now that OpenSolaris is live, it'll be relatively easier for
developers outside Sun to debug problems. Thus, the motivation behind
this entry.

Most of the time when you get a crash dump from a kernel panic, it's
either during development (in which case it's easy to debug because
you *know* exactly what caused the code to fail) or it's while the code is
in production. It's harder to debug when you're given a dump obtained
from a production machine primarily because first you need to find out
what caused the code to fail and second you need to simulate the failure
in the lab.

A lot of the time, finding the root cause entails figuring out what
parameters were passed to functions and what the local variables
looked like at a certain point in time. I'll walk through an example to
demonstrate how function arguments and local variables can be excavated
from a dump.

Parameter passing on sparc -  A brief overview

Unlike x86, which passes function arguments on the stack, and x64, which passes
function arguments (at least most of them) in registers, sparc uses register
windows to pass parameters. The caller places arguments in %o0, %o1 .. %o5;
once the callee's save instruction rotates the window, the callee sees them
in %i0, %i1 .. %i5, with %i0 holding the first parameter and so on. If there
are more than six input parameters to a function, parameters after the
sixth are passed on the stack. %i6 contains the frame pointer (%fp).

Local variables are allocated at an offset to the frame pointer.

Stack Format

The frame structure is defined in the system file usr/include/sys/frame.h
and it looks as follows -

struct frame {
        long    fr_local[8];            /* saved locals */
        long    fr_arg[6];              /* saved arguments [0 - 5] */
        struct frame *fr_savfp;         /* saved frame pointer */
        long    fr_savpc;               /* saved program counter */
#if !defined(__sparcv9)
        char    *fr_stret;              /* struct return addr */
#endif  /* __sparcv9 */
        long    fr_argd[6];             /* arg dump area */
        long    fr_argx[1];             /* array of args past the sixth */
};

So the input parameters are in the fr_arg array.

Excavating arguments with an NFSv4 bug

Using the bug 6268686 as an example and referencing OpenSolaris, let's look
at the stack trace that resulted in the panic -

> $C

000002a1012203d1 vpanic(1295800, 7aabd868, 7aabd880, 851, 2400, 2a1012210fc)
000002a101220481 assfail+0x74(7aabd868, 7aabd880, 851, 18c6000, 1295800, 0)
000002a101220531 nfs4_make_dotdot+0x4f4(2a101220df8, 2388873b24c20,
fffffffffffffff8, 301412eb920, 2a101221238, 1)
000002a101220941 nfs4lookupnew_otw+0x7d8(301ef0d4dc0, 2a101221530,
2a101221528, 301412eb920, df8475800, 38285c955c0)
000002a101220a71 nfs4_lookup+0x114(301ef0d4dc0, 2a101221530, 2a101221528,
301412eb920, 0, 391f52d44a8)
000002a101220b41 fop_lookup+0x28(301ef0d4dc0, 2a101221530, 2a101221528,
7aa69c2c, 0, 600045703c0)
000002a101220c01 lookuppnvp+0x344(2a1012217f0, 0, 600045703c0, 2a101221528,
2a101221530, 6000008dbc0)
000002a101220e41 lookuppnat+0x120(301ef0d4dc0, 0, 1, 0, 2a101221930, 0)
000002a101220f01 lookupnameat+0x5c(0, 0, 1, 0, 2a101221930, 0)
000002a101221011 vn_openat+0x164(1, 400, 1, 1, 0, 1)
000002a1012211d1 copen+0x260(ffffffffffd19553, 87aa3, 0, 50400, 0, 1)
000002a1012212e1 syscall_trap32+0x1e8(87aa3, 0, 50400, 0, 0, 0)

To set the context for this bug, we were trying to look up a directory and it
so happened that we ended up calling nfs4_make_dotdot to get an rnode. The
comments in the code explain fairly well under what circumstances this function
is called -

/*
* nfs4_make_dotdot() - find or create a parent vnode of a non-root node.
*
* Our caller has a filehandle for ".." relative to a particular
* directory object. We want to find or create a parent vnode
* with that filehandle and return it.
.. snip

Like the comments say, we had a filehandle for ".." relative to the directory
object we're trying to look up. So, to start off, what was the pathname we were
trying to look up? To determine this, we'd like to know what arguments were
passed into the nfs4_make_dotdot function. Check the source and the function
is defined in uts/common/fs/nfs/nfs4_subr.c as -

int
nfs4_make_dotdot(nfs4_sharedfh_t *fhp, hrtime_t t, vnode_t *dvp,
cred_t *cr, vnode_t **vpp, int need_start_op)

The interesting bit is the passed in directory vnode pointer, dvp, and it's
passed in the %i2 register. If we can find out the dvp, we'll also know
the path we're playing with here.

64-bit sparc has a notion of stack bias: the frame pointer is offset by 0x7ff
(2047), and you need to add the stack bias to the frame pointer in order to
get the actual address of the stack frame.
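
For anyone who wants to check the arithmetic outside mdb, here is a small sketch of it. It assumes the __sparcv9 layout of struct frame shown earlier (8-byte longs and pointers, fr_stret compiled out), and the helper names are mine. The frame address used in the example is the one $C printed next to nfs4_make_dotdot.

```python
STACK_BIAS = 0x7ff  # 2047, the stack bias on 64-bit sparc
WORD = 8            # size of a long or pointer in the __sparcv9 layout

# Byte offsets of the struct frame members, derived from the declaration above.
FRAME_OFFSETS = {
    "fr_local": 0,          # long fr_local[8]
    "fr_arg": 8 * WORD,     # long fr_arg[6]
    "fr_savfp": 14 * WORD,  # struct frame *fr_savfp
    "fr_savpc": 15 * WORD,  # long fr_savpc
    "fr_argd": 16 * WORD,   # long fr_argd[6]
    "fr_argx": 22 * WORD,   # long fr_argx[1]
}


def unbias(frame_addr):
    """Actual address of the frame for a biased frame address."""
    return frame_addr + STACK_BIAS


def member_addr(frame_addr, member, index=0):
    """Address of one element of a struct frame member."""
    return unbias(frame_addr) + FRAME_OFFSETS[member] + index * WORD


if __name__ == "__main__":
    frame = 0x2a101220531
    print(hex(unbias(frame)))
    print(hex(member_addr(frame, "fr_arg", 2)))  # where the saved %i2 lives
```

This is the same addition as the `+0x7ff` in the mdb expressions; `::print struct frame` then does the member offsets for you.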

Applying that to the frame pointer for nfs4_make_dotdot and dumping out
the frame, we have -

> 000002a101220531+0x7ff::print struct frame
{
fr_local = [ 0, 0, 0x381a4c49000, 0x2a101221108, 0x2a101220ee0,
0x2a101221118, 0x7aabd800, 0x7aabd800 ]
fr_arg = [ 0x2a101220df8, 0x2388873b24c20, 0xfffffffffffffff8,
0x301412eb920, 0x2a101221238, 0x1 ]
fr_savfp = 0x2a101220941
fr_savpc = 0x7aa6b474
fr_argd = [ 0x1, 0x5bc679f3060, 0, 0x2a101221250, 0x200000000,
0x5bc679f3178 ]
fr_argx = [ 0 ]
}

i2 here looks bogus, darn! Let's back up one function higher to
nfs4lookupnew_otw and see if we can fish dvp out of its frame easily.
A quick look at the source in uts/common/fs/nfs/nfs4_vnops.c and -

static int
nfs4lookupnew_otw(vnode_t *dvp, char *nm, vnode_t **vpp, cred_t *cr)

The same dvp we're looking for should be in i0 provided it's not been
overwritten. Dump out the frame -

> 000002a101220941+0x7ff::print struct frame
{
fr_local = [ 0x391f52d4450, 0x381a4c49000, 0x2388873c342a0, 0x600074432a8,
0x2388873b24c20, 0, 0x1, 0x391f52d44e8 ]
fr_arg = [ 0x301ef0d4dc0, 0x2a101221530, 0x2a101221528, 0x301412eb920,
0xdf8475800, 0x38285c955c0 ]
fr_savfp = 0x2a101220a71
fr_savpc = 0x7aa69d40
fr_argd = [ 0x38f6901b110, 0x311fe247dc0, 0x2a101221704, 0x2a1012216fc,
0x2a101220c71, 0x7aa7ad20 ]
fr_argx = [ 0x2a101221264 ]
}

Quick check to see if it's been overwritten -

> nfs4lookupnew_otw::dis!grep i0
[ .. elided ]

It's not overwritten, we're in luck! Double check to see if it's a vnode.

> 0x301ef0d4dc0::whatis
301ef0d4dc0 is 301ef0d4dc0+0, bufctl 301ecba50c8 allocated from vn_cache

It sure is a vnode. Dumping out the path is now easy -

> 301ef0d4dc0::print vnode_t v_data |::print rnode4_t r_svnode.sv_name
r_svnode.sv_name = 0x391d8a24090
> 0x391d8a24090::print nfs4_fname_t fn_parent fn_name
fn_parent = 0x3b6565ba920
fn_name = 0x3292727cbe0 "uts"
> 0x3b6565ba920::print nfs4_fname_t fn_parent fn_name
fn_parent = 0x353987b1550
fn_name = 0x4f42a102fe0 "src"
> 0x353987b1550::print nfs4_fname_t fn_parent fn_name
fn_parent = 0

We're operating on ./src/uts and this isn't handled correctly in the lookup
handling routine (it's fixed now).
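
The walk above just follows fn_parent until it hits 0 and reads the components in reverse. A minimal sketch of that walk, with a stand-in class in place of the real nfs4_fname_t, might look like the following. The class Fname and the function fname_to_path are hypothetical, and since the dump does not show the fn_name of the top-most entry, the sketch assumes "." for it to match the ./src/uts reading above.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Fname:
    """Stand-in for the two nfs4_fname_t fields printed above."""
    fn_name: str
    fn_parent: Optional["Fname"] = None


def fname_to_path(leaf: Fname) -> str:
    """Follow fn_parent links up to the top and join the names in order."""
    parts = []
    node = leaf
    while node is not None:
        parts.append(node.fn_name)
        node = node.fn_parent
    return "/".join(reversed(parts))


# The top entry's name is an assumption ("."); "src" and "uts" are from
# the mdb output.
top = Fname(".")
src = Fname("src", fn_parent=top)
uts = Fname("uts", fn_parent=src)

print(fname_to_path(uts))
```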

As I mentioned earlier, local variables are stored at an offset from the
frame pointer. Now that we have the frame pointer, we can dig out the local
variables. The variable of interest in this case is the error structure
declared on the stack in nfs4_make_dotdot.

A close look at the disassembly of the function and we can see -

nfs4_make_dotdot+0x27c: mov %l2, %o0
nfs4_make_dotdot+0x280: mov 0xc, %o3
nfs4_make_dotdot+0x284: call +0x15c58 nfs4_end_fop
nfs4_make_dotdot+0x288: mov %l7, %o5
nfs4_make_dotdot+0x28c: ba -0xd0 nfs4_make_dotdot+0x1bc
nfs4_make_dotdot+0x290: cmp %i5, 0
nfs4_make_dotdot+0x294: add %fp, 0x797, %o4
nfs4_make_dotdot+0x298: mov %l2, %o0
nfs4_make_dotdot+0x29c: mov 0xc, %o3
nfs4_make_dotdot+0x2a0: mov %l1, %i1
nfs4_make_dotdot+0x2a4: call +0x15c38 nfs4_end_fop
nfs4_make_dotdot+0x2a8: mov %l1, %o5
nfs4_make_dotdot+0x2ac: ld [%fp + 0x7bb], %i3 <------
nfs4_make_dotdot+0x2b0: cmp %i3, 0

The error structure is stored at fp + 0x7bb (fp is the fr_savfp in
nfs4_make_dotdot's frame). Dump it out -

> 0x2a101220941+0x7bb::print nfs4_error_t
{
error = 0
stat = 0t10006 (NFS4ERR_SERVERFAULT)
rpc_status = 0 (RPC_SUCCESS)
}
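
To make the offsets a little more concrete: because the displacement in the disassembly (0x7bb) is smaller than the stack bias (0x7ff), the local actually sits below the real frame address. The sketch below only redoes that arithmetic; the helper names are made up for the example.

```python
STACK_BIAS = 0x7FF  # 64-bit SPARC stack bias


def local_address(fp: int, disp: int) -> int:
    """Address used by an instruction such as 'ld [%fp + disp], ...'."""
    return fp + disp


def bytes_below_frame(disp: int) -> int:
    """How far below the real (unbiased) frame address the local lives."""
    return STACK_BIAS - disp


fp = 0x2A101220941  # fr_savfp from nfs4_make_dotdot's frame

print(hex(local_address(fp, 0x7BB)))   # address handed to ::print
print(bytes_below_frame(0x7BB))        # distance below the real frame
print(bytes_below_frame(0x797))        # same for the 'add %fp, 0x797' operand
```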

This reveals a secondary problem in the code which is that there are no
checks for errors like NFS4ERR_SERVERFAULT (again, now fixed).


( Jun 14 2005, 12:53:42 PM EDT / Jun 14 2005, 11:50:01 AM EDT ) Permalink Comments [0]
Trackback: http://blogs.sun.com/aalok/entry/debugging_on_sparc

20050524 Tuesday May 24, 2005

 RPC Versioning for rpc.metad

One of my earliest contributions to Solaris 10 was to version the rpc.metad daemon. In this post I'll talk about the problem I was trying to solve and why I was trying to solve it as a precursor to how I actually solved it.

What's this rpc.metad?

rpc.metad is one of the SVM RPC daemons, and its entire purpose in life is to let multiple hosts share common storage that is going to be used for shared SVM disksets*. The daemon communicates between the hosts in question while configuring the volumes and making changes to the configuration. As an example, while adding disks to a shared diskset with two hosts A and B, it probes each of the hosts with the question, "Hey, I see a disk c1t0d0 with these characteristics. Do you see the same disk on your side? Okay, both of us are seeing the same disk; now I'm going to add this disk to a diskset called oracle, you okay with that?" "Yeah, go ahead and use it, I'm not using it."

* Shared SVM disksets, here, refer to the concept where one and only one host can access the diskset at any given point in time. This configuration is mostly used in high availability (HA) kind of environments or in conjunction with the clustering software.

Why did it need to be versioned?

Early on in the Solaris 10 development cycle, support was added to SVM so you could create multi-terabyte volumes as well as leverage multi-terabyte LUNs. As part of this effort, changes were made to a lot of the structures used internally by SVM (not the on-disk structures). When these changes were made, care was taken to ensure that nothing broke if you wanted to upgrade or downgrade the machines.

However, during the external code review process (I think that's what it was; anyhow, you get the idea), one of the reviewers pointed out that there was a case where backward compatibility was broken. I'll expound on this soon, but let me take this opportunity to explain a little bit about the code review process here in the Solaris organization.

Solaris code review process - A detour

Every change to the Solaris source, i.e. every bug fix (no matter how trivial) and every project, needs to be reviewed by someone other than the engineer making the change. Typically, a bug fix gets reviewed by one or two engineers. A bigger change, i.e. a project or an RFE, needs to be reviewed not only by engineers inside the same technology group but also by engineers in different (but hopefully related) technology groups (external code review). The motivation is to get multiple sets of eyes on the change so as to provide healthy criticism of it. This helps in making sure that - a) the right fix is being made and the fix won't cause future bugs, b) other areas of code are not being overlooked, and c) it's not going to break other Solaris functionality. Code reviews are just one of the process-related tasks we in the Solaris organization undertake in order to make sure the Solaris code is always high quality.

Back to the real topic ..

So, back to the original topic: one of the external code reviewers mentioned that since rpc.metad uses the changed structures, it would likely not be able to talk to other rpc.metad processes that have an "older" view of those structures. This was going to be a particular problem when the clustering software is used in a rolling upgrade scenario, where the cluster nodes are upgraded one after another, i.e. some of the nodes can be running an older version of Solaris (say Solaris 8) while others are running an upgraded version (say Solaris 10). Each of these nodes needs to be able to communicate with the others in such a scenario. The problem this presents is - you've got a Solaris 8 rpc.metad that knows about the Solaris 8 version of the structures and a Solaris 10 rpc.metad that knows about the Solaris 10 version of the structures. The result - the two rpc.metad processes can't communicate with each other!

The solution to this problem was to version the Solaris 10 rpc.metad, i.e. make it understand the older structure definitions as well as the newer structure definitions by leveraging the versioning capabilities of the RPC framework. Simple as that.
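
To give a feel for the idea before the follow-up post, here is a toy sketch of version-based dispatch. It is not the rpc.metad code, and the wire layouts are invented purely for illustration: version 1 is assumed to carry a 32-bit block count and version 2 a 64-bit one. The point is only that every RPC call carries a version number, so a server that registers more than one version can pick the matching structure definition for each caller.

```python
import struct

# Hypothetical wire layouts (NOT the real SVM structures): version 1
# carries a 32-bit block count, version 2 widens it to 64 bits.
DECODERS = {
    1: lambda payload: struct.unpack(">I", payload)[0],
    2: lambda payload: struct.unpack(">Q", payload)[0],
}


def handle_request(version: int, payload: bytes) -> int:
    """Decode the block count using the layout the caller's version implies."""
    try:
        decode = DECODERS[version]
    except KeyError:
        raise ValueError(f"unsupported version {version}") from None
    return decode(payload)


# An "old" peer speaks version 1, an upgraded peer speaks version 2;
# the same server understands both.
old_peer_payload = struct.pack(">I", 4096)
new_peer_payload = struct.pack(">Q", 2**40)

print(handle_request(1, old_peer_payload))
print(handle_request(2, new_peer_payload))
```

An upgraded node talking to an older one would simply send version 1 requests, which the older daemon already understands.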

In a future post I'll go over the details of how I implemented the versioning changes. So long!


( Jun 10 2005, 01:27:30 PM EDT / May 24 2005, 04:04:52 PM EDT ) Permalink Comments [0]
Trackback: http://blogs.sun.com/aalok/entry/rpc_versioning_for_rpc_metad

