Sam Falkner's Weblog
Friday Nov 14, 2008
nnodes
NFS has always been able to share an ordinary file system, and thus has always dealt with vnodes. With the advent of Parallel NFS (pNFS), which has stripes of data distributed across multiple servers, this is no longer the case. The initial implementation of pNFS uses the DMU API provided by ZFS to store its stripe data. The pNFS protocol also requires that a server community support proxy i/o; that is, a client must be able to perform all i/o against one server, and if the data requested is on another node, then the server must perform the i/o by proxy. Neither the DMU nor proxy i/o is accessed via vnodes.
Another change brought by pNFS is the distributed nature of the server. What has always been confined to a single server is now distributed over multiple servers. This necessitates a control protocol for communication among the server nodes, and implies that some server tasks may have longer latencies than before. In pathological cases, e.g. a reboot of one particular server, the latencies may be very high. This will likely require changes in how the NFS server is implemented, and it will become advantageous for the NFS server code to use new APIs with different design goals from vnodes. Asynchronous methods become more desirable, to deal with the new latencies involved with processing client requests.
Enter nnodes. nnodes can be thought of as vnodes, but customized for the needs of NFS (especially pNFS), and with three distinct ops-vector/private-data sections. The figure below shows an NFS server implementation interacting with an nnode that is backed by a vnode.
The fact that the nnode has three distinct sections (data, metadata, and state) makes it easy to mix and match commonly needed implementations. Here are some examples:
But this is just the beginning. There will doubtless be more constructions in the future.
nnodes also serve as a place to cache or store information relevant to a file-like object. For example, in the pNFS data server, we can cache stateids that are known to be valid. Thus, the data server will not need to contact the metadata server on every i/o operation.
Today I have been writing a more verbose comment header for the header file "nnode.h". Look for it in our repository soon.
Posted at 03:54PM Nov 14, 2008 by samf in NFS
Sunday May 25, 2008
goodbye MANPATH
I don't usually blog about every config file change I make, but here's one I'm particularly happy with, as it's removing a kludge. This is a change made to my zsh configuration that only runs on Solaris:
-
-# Sun's annoying man command...
-
-manpath=(/usr/share/man)
-
-for dir in $path
-do
- mdir=${dir%/*}/man
- test -d $mdir && manpath=($manpath $mdir)
-done
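For anyone curious what the removed loop was doing: it walked every directory in the search path, swapped the last path component for "man", and kept the result if it existed. Here is a minimal sketch of the same idea in plain sh rather than zsh. The sample path and the function name derive_mandirs are made up for illustration, and the existence test is left out so the output does not depend on the machine it runs on.

```shell
# Sketch of the idea behind the removed loop, in POSIX sh.
# sample_path and derive_mandirs are hypothetical names for illustration.
sample_path="/usr/bin:/opt/tool/bin:/usr/local/bin"

derive_mandirs() {
    # Split the colon-separated list in $1 and print one candidate per entry.
    old_ifs=$IFS
    IFS=:
    for dir in $1; do
        # ${dir%/*} strips the last path component: /usr/bin -> /usr
        printf '%s\n' "${dir%/*}/man"
    done
    IFS=$old_ifs
}

derive_mandirs "$sample_path"
```

The original added a "test -d" check before keeping each directory; that is the part that made the result depend on what was actually installed.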
I was happy to play a very small role in this putback to Nevada. Thanks Mike!
Posted at 11:18PM May 25, 2008 by samf in Solaris
Wednesday Apr 02, 2008
smcup/rmcup: hate
Some terminals have the capability to save and restore themselves. Some programs take advantage of this, so that when you exit the program, your screen is restored to its previous state. In terms of terminfo, these capabilities are known as smcup and rmcup.
I hate this. Hate.
Let's say you want to run a command, but want to look at its man page first. The man command sends its output to your $PAGER, which is less. The less command saves/restores your screen. So, you scroll to exactly the example you want: perfect! You hit 'q' to quit... and the example is erased. Hate. Many more examples are possible, but you get the idea.
Here's how I eradicate this in my world. I use this on Solaris 10 and 11, and MacOS 10.4 and 10.5. Your mileage will probably vary, but feel free to give it a try. It's in my .zshrc file, so it's using zsh's builtin "[[" and "]]" operators, as well as "$(command)". If this fails for you, you can probably just replace "[[" with "[", "]]" with "]", and "$(command)" with "`command`".
TERMINFO="/tmp/$(id -un)-terminfo-$(uname -s)"
export TERMINFO
if [[ ! -d $TERMINFO ]]; then
    mkdir -p $TERMINFO
    infocmp | sed -e 's/smcup.*,//' -e 's/rmcup.*,//' -e '/^[ \t]*$/d' \
        > $TERMINFO/fixed
    sed -e '1d' -e '3,$d' < $TERMINFO/fixed | grep -w $TERM >/dev/null 2>&1
    if [[ $? -ne 0 ]]; then
        mv $TERMINFO/fixed $TERMINFO/broken
        sed -e "2s/^/$TERM|/" < $TERMINFO/broken > $TERMINFO/fixed
    fi
    tic $TERMINFO/fixed
fi
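One thing to watch for if you adapt this: the sed patterns above are greedy, so on a line that holds several capabilities they remove everything from smcup (or rmcup) through the last comma on that line. Whether that matters depends on how your infocmp lays out its output. The sketch below compares the greedy pattern with a narrower one on a made-up sample line (the line is hypothetical, not copied from a real terminal entry), and the narrower pattern is only a suggestion that assumes the capability value contains no escaped commas.

```shell
# Hypothetical sample line in the style of infocmp output.
sample='rmam=\E[?7l, rmcup=\E[?1049l, rmir=\E[4l, smcup=\E[?1049h, smir=\E[4h,'

# Patterns as used in the script above: each match runs to the last comma.
greedy=$(printf '%s\n' "$sample" | sed -e 's/smcup.*,//' -e 's/rmcup.*,//')

# Narrower variant (an assumption, not the original): stop at the first comma.
narrow=$(printf '%s\n' "$sample" | sed -e 's/smcup=[^,]*, *//' -e 's/rmcup=[^,]*, *//')

printf 'greedy: %s\n' "$greedy"
printf 'narrow: %s\n' "$narrow"
```

On this sample, the greedy form also drops rmir and smir, while the narrow form keeps them. If your terminal still behaves after the original script, there is no need to change anything.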
Posted at 12:00PM Apr 02, 2008 by samf in General | Comments[2]
Thursday Mar 13, 2008
Using SMF to make DTrace a better printf replacement
When DTrace was introduced, life got much easier for kernel developers. For example, it became very easy to check if and when a certain function was being called. Before DTrace, if a debugger breakpoint was too heavy, we would often use printf()s. With DTrace, no more recompile/reboot! Sure enough, one thing often discussed in the hallways was, "Yea! No more printf()s!"
Yet I still find myself using printf()s. Not as often, certainly, and not in as many situations, but they haven't disappeared. I think the reason I still use printf()s is that printf()s are "always on". With a printf() in the kernel, all output goes to /var/adm/messages. For example, if I'm writing bleeding edge code, I'd like to know if a certain routine fails. The routine might work at first, then fail a week later. With a printf(), I can always go back and look to see what happened. With DTrace, it's only there if I was running my script.
If only there were a way that I could use DTrace instead. If DTrace were always running on my development box, I wouldn't miss those unexpected things, and I wouldn't find myself wishing I had had a printf() at some point. If only there were some facility to start a D script, so that you wouldn't forget to launch it every day when you start working...
Duh. Use SMF!
An SMF service to run a D script
You can find an SMF manifest for running a D script as a service here. You write a script, point this service at the script, enable it via SMF, and the script is always running! Output goes to /var/adm/messages via syslog. Here's how to use it (you can also read these steps in the manifest itself):
Example
Let's try an example. Suppose I want to know when nfs4_setattr() is being called. Let's make a script:
#! /usr/sbin/dtrace -s
#pragma D option quiet
fbt::nfs4_setattr:entry
{
    printf("setattr called on %s\n", stringof(args[0]->v_path));
}
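Once a script like this is saved, it still has to be attached to the service. Here is a rough sketch of that step, not copied from the manifest: it assumes the service can be addressed as "dtsyslog" and that the script/path property is of type astring, so check the manifest before relying on either. The run() wrapper and the DRY_RUN variable are my own illustration; by default the commands are only printed, so the sketch is safe to try on a machine without SMF.

```shell
# Sketch: point the dtsyslog service at a D script and enable it.
# Assumptions: the "dtsyslog" service name, the "script/path" property,
# and the astring type. run() and DRY_RUN are hypothetical helpers.
DRY_RUN=${DRY_RUN:-1}

run() {
    if [ "$DRY_RUN" = 1 ]; then
        # Only show what would be executed.
        printf '+ %s\n' "$*"
    else
        "$@"
    fi
}

configure_dtsyslog() {
    script=$1
    run svccfg -s dtsyslog setprop script/path = astring: "$script"
    run svcadm refresh dtsyslog   # make the new property value visible to the service
    run svcadm enable dtsyslog
}

configure_dtsyslog /var/tmp/logger.d
```

Setting DRY_RUN=0 runs the commands for real, which needs the manifest to be imported first and the usual privileges.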
Note the "#pragma" line: there is a way with the SMF properties that we could have passed a "-q" option to dtrace, but I prefer the pragma, as it makes the script more self-contained. So, we store this file as /var/tmp/logger.d, use svccfg to set the script/path property, enable the service, and we're good to go!
burr/3/* echo hello > foo
burr/4/* chmod 666 foo
burr/5/* tail /var/adm/messages
Mar 13 14:02:50 burr pseudo: [ID 129642 kern.info] pseudo-device: fbt0
Mar 13 14:02:50 burr genunix: [ID 936769 kern.info] fbt0 is /pseudo/fbt@0
Mar 13 14:02:50 burr pseudo: [ID 129642 kern.info] pseudo-device: systrace0
Mar 13 14:02:50 burr genunix: [ID 936769 kern.info] systrace0 is /pseudo/systrace@0
Mar 13 14:02:50 burr pseudo: [ID 129642 kern.info] pseudo-device: profile0
Mar 13 14:02:50 burr genunix: [ID 936769 kern.info] profile0 is /pseudo/profile@0
Mar 13 14:02:50 burr pseudo: [ID 129642 kern.info] pseudo-device: lockstat0
Mar 13 14:02:50 burr genunix: [ID 936769 kern.info] lockstat0 is /pseudo/lockstat@0
Mar 13 14:21:38 burr dtsyslog: [ID 702911 daemon.notice] setattr called on /home/samf/foo
burr/6/*
Other Possibilities
You will likely want to edit your D script from time to time. When you do, just edit the script, and invoke "svcadm restart dtsyslog" to pick up the changes.
If you wish to have more than one script, you can easily create multiple instances of the dtsyslog service. Look at the man page for svccfg. Just edit the script/path property for each instance.
You can control the output via /etc/syslog.conf. The messages are coming as daemon.notice by default, but you can also configure that via svccfg.
I hope you find this useful!
Posted at 03:15PM Mar 13, 2008 by samf in DTrace | Comments[1]
Wednesday May 23, 2007
nfs4trace: a New Direction
My previous DTrace provider for NFS has been floundering, due to a needed design change, and my priorities with pNFS. Fortunately, a new design is being worked on by two engineers recently assigned to the project.
Posted at 02:00PM May 23, 2007 by samf in NFS
Thursday Jun 08, 2006
nfs4trace release: all ops covered!
I have just pushed a new release of the DTrace provider for NFSv4. This is in fact the second release I've made -- the first release didn't get a blog entry. You can be sure to see all releases by watching the announcements section of the nfs4trace project. An RSS feed can be had here: http://www.opensolaris.org/rss/os/project/nfs4trace/announcements/rss2.xml
This release is based on Nevada build 41. I mention this so that you will know what bugs and bug fixes are present in this release. The release is in the form of bfu archives, so follow the usual procedure to install them.
New Features
This release of nfs4trace provides a set of probes for every NFSv4 operation. There is an overarching probe for the compound op, called "op-compound". Each op within a compound has its own probe, e.g. "op-lookup". For every probe, args[0] is a pointer to a nfs4_dtrace_info_t. This has the following fields:
Besides the normal compound operations, there is a probe for all callback-related operations. The probe "cb-compound" is analogous to "op-compound", but covers the callback channel. Each operation within a callback, e.g. OP_CB_GETATTR, also has a probe -- for this example, "cb-getattr".
To see a complete list of all probes, run
What's Left?
Eventually, there will be probes for every nontrivial attribute, which will fire regardless of which operation is using them. But not all of these attributes are accounted for yet. Stay tuned.
What else is left? Your feedback will help decide this! Please try this, and let me know if you have any input. Thanks!
Posted at 09:44AM Jun 08, 2006 by samf in NFS | Comments[1]
Thursday Feb 09, 2006
SMF Manifest for Perforce
If you want to run a Perforce server from Solaris 10 or greater, you should be using SMF instead of /etc/init.d scripts or inetd. I haven't seen an SMF manifest for a Perforce server as of yet; so, I created one. It handles p4d (the main server) and p4p (the proxy server). You can grab it from here: http://blogs.sun.com/roller/resources/samf/perforcexml.txt
The manifest itself has a quick "cheat sheet" to run either p4d (the "main" server) or p4p (the proxy server). But let's look at the more correct way to configure and run the services, and then we'll look at how to run multiple instances of the services.
Importing the manifest and configuring a default instance
First, let's take the downloaded "perforcexml.txt" file and put it in its proper place; then import it into SMF. Assume you've saved perforcexml.txt in /tmp.
# cd /var/svc/manifest/application
# mv /tmp/perforcexml.txt ./perforce.xml
# chown root:sys perforce.xml
# chmod 644 perforce.xml
# svccfg import perforce.xml
Notice that we didn't edit perforce.xml. Now, let's configure it from within SMF. We'll make the following assumptions:
# svccfg
svc:> select p4d
svc:/application/perforce/p4d> editprop
editprop throws us into $EDITOR, your favorite editor. From here, we make the following changes:
# Property group "executables"
# delprop executables
# addpg executables application
setprop executables/client = astring:"/usr/local/perforce/bin/p4"
setprop executables/server = astring:"/usr/local/perforce/bin/p4d"
# Property group "options"
# delprop options
# addpg options application
# setprop options/journal = astring: (journal)
# setprop options/port = astring: (1666)
setprop options/adminuser = astring:"bob"
setprop options/root = astring:"/var/perforce"
The lines that we changed are the lines that are now uncommented. Save and quit your editor.
svc:/application/perforce/p4d> quit
# svcadm refresh p4d
We're ready! Make sure that /var/perforce is owned by daemon (the login that will actually be running the p4d daemon), and kick it off.
# chown daemon /var/perforce
# svcadm enable p4d
# svcs -x
#
Woohoo! No problems! If there had been a problem, "svcs -x" would have shown it. Test a perforce client against the server. If there are any problems, check "svcs -x" again.
The usual caveats apply: if you kill p4d, or run "p4 admin stop", SMF will immediately restart p4d. To stop the p4d server correctly, run "svcadm disable p4d". This will correctly shut down p4d (via "p4 admin stop").
Starting a second instance on the same machine
We can use SMF to start another Perforce repository on the same machine. The two instances can be administered separately (e.g. one can be taken down while the other one is active). For the second instance, we'll assume the following:
When creating the second instance, we will want to borrow some of the properties from the default instance. The "editprop" command comes in handy here. If your favorite editor allows you to save a range of lines, then you can save the relevant section to a temporary file.
# svccfg
svc:> select p4d
svc:/application/perforce/p4d> editprop
Now we save the following lines into a temporary file:
# Property group "options"
# delprop options
# addpg options application
# setprop options/journal = astring: (journal)
# setprop options/port = astring: (1666)
# setprop options/adminuser = astring: (bob)
# setprop options/root = astring: (/var/perforce)
Now quit your editor, and we'll create the new instance.
svc:/application/perforce/p4d> add alt
svc:/application/perforce/p4d> select alt
svc:/application/perforce/p4d:alt> editprop
We will see the following:
select svc:/application/perforce/p4d:alt
Now read the temporary file into your editor, and edit it to look like this:
select svc:/application/perforce/p4d:alt
# Property group "options"
# delprop options
addpg options application
# setprop options/journal = astring: (journal)
setprop options/port = astring:"1667"
setprop options/adminuser = astring:"carl"
setprop options/root = astring:"/var/altperforce"
Save and quit your editor. We're almost done!
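As an aside, the editor session is not the only way in. The same subcommands could probably be written to a file and handed to svccfg in one go. Below is a sketch of a helper that prints such a file for a given instance; the function name emit_alt_instance is made up, the values are the ones used in this walkthrough, and I have not run the generated file through svccfg, so treat it as a starting point rather than a recipe.

```shell
# Sketch: print svccfg subcommands for an additional p4d instance.
# emit_alt_instance is a hypothetical helper; the property names and the
# astring:"..." form mirror what editprop showed above.
emit_alt_instance() {
    inst=$1
    port=$2
    admin=$3
    root=$4
    cat <<EOF
select svc:/application/perforce/p4d
add $inst
select svc:/application/perforce/p4d:$inst
addpg options application
setprop options/port = astring:"$port"
setprop options/adminuser = astring:"$admin"
setprop options/root = astring:"$root"
EOF
}

emit_alt_instance alt 1667 carl /var/altperforce
```

If you redirect the output to a file, "svccfg -f" on that file should apply it as an alternative to the interactive "add alt" and editprop steps; the refresh and enable steps that follow stay the same.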
svc:/application/perforce/p4d:alt> quit
# chown daemon /var/altperforce
# svcadm refresh p4d
svcadm: Pattern 'p4d' matches multiple instances:
svc:/application/perforce/p4d:alt
svc:/application/perforce/p4d:default
Oops, things really have changed! Let's do it right:
# svcadm refresh p4d:alt
# svcadm enable p4d:alt
# svcs -x
Success. The same technique can be applied for p4p (the proxy server). The configurable options for p4p are pretty much the same as p4d.
Closing
I hope you find this useful. This was my first learning experience for SMF, and it was really fun. But it's just a small sample of what SMF can do.
If I make any more changes to this manifest, either from your feedback or from changes to Perforce itself, I will blog about it here.
Technorati tags: OpenSolaris Solaris SMF Perforce
Posted at 12:00PM Feb 09, 2006 by samf in Solaris
Friday Dec 23, 2005
A DTrace Provider for NFS
As a kernel developer working on NFSv4, I love DTrace. The fbt and syscall providers are my constant companions. But last June, at the NFSv4 Bakeathon, I had a more difficult problem -- an infinite loop involving volatile filehandles. The problem was reproducible (good), but it took a couple of minutes running a test suite to get to that point (bad). Thousands of over-the-wire operations occurred before the more interesting things began to happen. Using snoop, which "knows" the NFSv4 protocol, I knew what Solaris NFSv4 was doing, but I didn't know why it was doing it. DTrace could tell me the "why" part, but I couldn't think of a script that would get me just the right information, without drowning me in uninteresting information.
Sleep deprived, I began whining about how DTrace should have a new provider; one that "knows" the NFSv4 protocol, and lets you write protocol related events into your scripts. A provider that lets you do things like grab filehandles, and stuff them into variables, to be used later in conditionals.
Upon getting home, I began writing such a provider. In fact, I wrote one that could solve the problem discussed above. As it grew, there were some parts of the code that were getting messy, so now I'm in the midst of refactoring the provider.
A New Provider
Here is how the NFSv4 provider is starting to look. As usual for DTrace probes, we have:
provider:module:function:name
Provider is either "nfs" or "nfs-server". Plain old "nfs" is the NFS client -- the thing that lets you mount remote file systems. Yes, this means that technically there are really two providers, but they will have nearly the same structure. Learning one provider will be sufficient to learn both.
Module is either "nfs" or "nfssrv", again for client and server. I would recommend just leaving this part blank, since it doesn't buy you any more than the provider slot. If you do not leave it blank, make sure that provider and module match. If they don't match, you won't get any probes. For example, "nfs:nfssrv:<something>:<something>" will not match any probes.
Function is an operation or an attribute. More on that below. :-)
Name is either "start" or "done". For a probe on a client operation, "start" would fire when the client sent the operation over-the-wire to the server, and "done" would fire when it received its response from the server. For the server, "start" would fire when the server received a request, and "done" would fire when the server answered the request. Callback functions, where the server makes a request of the client, are just what you would expect -- client "start" fires when the client first receives a request from the server, etc.
The Probes
As mentioned above, the function slot is an operation or an attribute. Operations are of the form "op-<operation-name>", where <operation-name> is an operation defined in the NFSv4 protocol. Examples:
nfs::op-read:start /* client did a read operation */
nfs-server::op-read:start /* server got a read request */
nfs::op-compound:done /* client's compound op finished */
For attribute probes, we use "attr-<attribute-name>", where <attribute-name> is a type defined in the NFSv4 protocol. It can be either an argument to an operation (examples:)
nfs::attr-seqid4:start /* client sent a sequence ID with some op */
nfs::attr-filehandle:done /* client received a filehandle */
or an attribute of a file. Examples:
nfs::attr-owner:start /* client is sending an owner file attribute */
nfs::attr-filehandle:done /* client received a filehandle */
Above, notice that attr-filehandle makes sense as an argument to an operation, or an attribute of a file. The nice thing about the attr-* probes is that you can trace these attributes going over the wire, without caring about what operation is sending them, or how it's being sent.
Arguments to the Probes: args[0] and args[1]
For all probes, there are two arguments.
args[0] is the same for all probes: it is a structure that holds things such as the tag and transaction id (xid) of the operation. It will likely hold the network address of the machine that it's talking with; more things may be added in the future.
The second argument, args[1], is different for every probe. For op-* probes, it's the structure that is being sent or received. For attr-* probes, it's just the attribute itself. Examples:
Any working examples yet?
The provider is still being implemented, and it doesn't yet handle very many of the details of the NFSv4 protocol. But the framework is there, and it does handle "compound", which is the fundamental over-the-wire operation in NFSv4. As its name implies, a compound operation can do more than one thing. For example, it can create a new file and get the new file's attributes, all in one over-the-wire operation. As mentioned before, compounds have a field called a "tag", which is a descriptive string attached that very briefly describes the purpose of the request. The tag for our hypothetical create-file-and-get-its-attributes operation might simply be "create".
Here is a simple D script that uses the op-compound probe to show which over-the-wire requests the client spends the most time waiting upon. The requests are collated by the tag. You could use such a script to help get an idea of what compounds are taking the most time, and maybe start thinking about where to do further performance analysis.
Following the example script is example output. While running this example script, I used "tar" to copy a bunch of files within an NFS mounted directory, and then used "rm -rf" to delete them. Don't treat this as a serious benchmark or anything -- it's just to give an idea of what's easily possible with the new provider.
Results:
total time spent in calls
  renew                 276
  commit              11054
  link                21647
  rmdir              367802
  mkdir              422512
  access             599114
  readdir            753105
  lookup valid       822106
  lookup             862495
  read              1026893
  close             1095138
  getattr           1841042
  write             6674532
  remove            6802618
  setattr           7101811
  open              7539189
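Raw totals like these are easier to read as shares of the whole. Here is a small sketch that does the arithmetic with awk; the numbers are copied from the results above, the variable names are my own, and nothing in it assumes a particular time unit.

```shell
# Tag and time columns copied from the results in this post.
results='renew 276
commit 11054
link 21647
rmdir 367802
mkdir 422512
access 599114
readdir 753105
lookup valid 822106
lookup 862495
read 1026893
close 1095138
getattr 1841042
write 6674532
remove 6802618
setattr 7101811
open 7539189'

# Sum the last field of every line.
total=$(printf '%s\n' "$results" | awk '{ sum += $NF } END { print sum }')

# Print each tag with its percentage of the total, keeping the input order.
shares=$(printf '%s\n' "$results" | awk -v total="$total" '{
    t = $NF
    tag = $0
    sub(/[ \t]+[0-9]+$/, "", tag)   # drop the trailing number, keep multi-word tags
    printf "%-14s %5.1f%%\n", tag, 100 * t / total
}')

printf 'total: %s\n%s\n' "$total" "$shares"
```

Seen this way, write, remove, setattr, and open together account for roughly 78% of the total in this run, which is about what you would expect from a tar copy followed by "rm -rf".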
Monday Nov 21, 2005
IETF, ZFS and DTrace
What have I been up to?
I recently traveled to beautiful Vancouver for the 64th IETF. There, Lisa and I attended many meetings, most notably the NFSv4 working group meeting, and presented our recent Internet Draft.
The Internet Draft concerns NFSv4 ACLs. It attempts to clear up ambiguities in the NFSv4 spec, RFC3530. It also proposes a way for ACLs and UNIX-style modes to live together in harmony. Hopefully, it will end up as one or more RFCs, and NFSv4 clients and servers can have a truly useful ACL model.
Meanwhile, as has been mentioned in many other places, ZFS has been released. Check out blog entries from Lisa and Mark on how ACLs work in ZFS.
I also gave a presentation on DTrace at a joint meeting of the Front Range OpenSolaris User Group (FROSUG) and the Front Range Unix User Group (FRUUG). Why was I, a humble NFS engineer, giving a presentation on DTrace? Well, for one thing, I use DTrace quite a bit. But I'll be blogging more about DTrace and NFS a bit later.
Technorati tags: IETF NFS OpenSolaris Solaris ZFS DTrace
Posted at 05:00PM Nov 21, 2005 by samf in NFS | Comments[0]
Tuesday Jun 14, 2005
ACLs Everywhere
I was tempted to call this posting ACLs are my resume, but that's a bit extreme. Still, Access Control Lists (ACLs) seem to follow me around. Here are some of the places where I have worked on ACLs.
Solaris 10
Lisa and I managed to integrate support for NFSv4 ACLs into Solaris 10. The effort to add ACL support began late in the Solaris 10 release cycle. Some of the problems we hit (outlined below) weren't even thought about when we began. We couldn't do much testing against other vendors until very late in the release cycle. There were a few show stopper bugs we had to fix before Solaris 10 could ship. This was one of the most intense but rewarding projects I've worked on!
NFSv4 ACL support breaks down into two big pieces: support for the over-the-wire operations involving ACLs, and translation between the various ACL models (more on them below). The over-the-wire pieces of ACL handling are scattered throughout NFSv4. nfs4_vnops.c has the usual vnode ops. See nfs4_getsecattr() and nfs4_setsecattr() for the front end (as far as the file system is concerned) to ACLs. Other pieces are in nfs4_client.c and nfs4_xdr.c. The translators are contained in nfs4_acl.c. For this article, I will focus on the translators.
Translation
Solaris has had support for ACLs for a long time. The ACL model supported before Solaris 10 is called POSIX-draft. This was supposed to become a POSIX standard, but the effort was abandoned. The latest draft is what was implemented for Solaris. For on-disk file systems, the Solaris UFS filesystem implements POSIX-draft ACLs. For versions two and three of the Network File System (NFS), an undocumented side-band protocol enables users to manipulate ACLs on the server. To the best of my knowledge, the only implementation of this protocol outside of Solaris is for Linux.
NFSv4 introduces a powerful new ACL model. It's powerful enough that every POSIX-draft ACL can be translated into an NFSv4 ACL. But NFSv4 ACLs can go beyond POSIX-draft semantics; thus, not all NFSv4 ACLs can be translated into POSIX-draft ACLs.
The presence of two ACL models makes it desirable to seamlessly translate between the two, so we implemented ACL translation in the kernel for NFSv4. Translation gives us many benefits:
At Connectathon 2005, we gave a presentation on implementing NFSv4 ACLs in Solaris 10. However, now that Solaris is open, I would like to talk about the translators at the code level.
Translating POSIX-draft ACLs into NFSv4 ACLs
In the Solaris kernel, ACLs are passed around inside of vsecattr_t structures. The main entry point for translating POSIX-draft ACLs into NFSv4 ACLs is vs_aent_to_ace4(). Here is what the call stack looks like when a POSIX-draft ACL is translated into an NFSv4 ACL.
vs_aent_to_ace4()
    ln_aent_to_ace4() (once for the regular ACL, once for the default ACL)
        ln_aent_preprocess()
        for every ACE:
            mode_to_ace4_access()
            access_mask_set()
            ace4_make_deny()
                access_mask_set()
Translating NFSv4 ACLs into POSIX-draft ACLs
The entry point for translating NFSv4 ACLs into POSIX-draft ACLs is vs_ace4_to_aent(). Here is what the call stack looks like when making this translation.
vs_ace4_to_aent()
    ln_ace4_to_aent()
        ace4_list_init()
        ace4_to_aent_legal()
        ace4vals_find()
        ace4_list_to_aent()
        ace4_list_free()
To see when some things go wrong, you can turn on nfs4_acl_debug. The debugging code isn't very complete, and it might be replaced by static DTrace probes some day. But for now, you can do this:
# mdb -kw
> nfs4_acl_debug/W 1
Future ACL support in Solaris
As you can see from reading the code, and perhaps from all of the debugging prints, there are lots of things that can go wrong when translating ACLs from NFSv4 into POSIX-draft format. Besides that, wouldn't it be nice to be able to set an arbitrary NFSv4 ACL, and use the new ACL model?
Things are getting much better with the arrival of ZFS. The goal of ZFS's ACL implementation is to implement NFSv4 ACLs in a way that is compatible with Solaris. The ZFS ACL model is still in flux, but it is rapidly solidifying. We will be releasing an Internet Draft in the near future, in which we will propose a way for UNIX and UNIX-like systems to support NFSv4 ACLs. If all goes well, this will be the ACL model used by ZFS.
Posted at 09:20AM Jun 14, 2005 by samf in NFS | Comments[0]
Friday May 13, 2005
Friday the 13th
I must be pretty brave to be making my first post on Friday the 13th. I'm Sam Falkner, and I work in Solaris, specifically focusing on NFS. I've been at Sun since 1992, with a roughly two year hiatus in 1997 and 1998. I've worked on ufsdump/ufsrestore, Backup Copilot, "Online: Backup 2.0", CacheFS, UFS, and finally NFS.
I don't know how often I'll be posting, since things are pretty hectic right now. If you haven't taken the plunge into RSS yet, I would highly recommend it! I recently started using Sage under Firefox. It's easy and it makes infrequent blog posters less annoying. At least I hope so.
Posted at 09:51AM May 13, 2005 by samf in General | Comments[0]