Tuesday September 01, 2009 | Saurabh Mishra's Weblog |
Writing a new Ethernet device driver for Solaris

This blog entry describes what you should keep in mind while writing a new Ethernet device driver for Solaris. We will not go into LSO, hardware checksum offload or multiple RX rings, as I have not written code for these features. Most Ethernet controllers have descriptor-based TX and RX.

The starting point for writing a new device driver is getting attach() and detach() working. That is fairly easy, but you will usually want to do the following things in attach() :

- Get the vendor/device-id and make sure you have the correct chip by looking at the revision.
- Pre-allocate all DMA buffers for TX. You have to pre-allocate all RX buffers anyway. This is the simplest model you can think of, but it requires bcopy (an extra copy during TX/RX). But hey, you are just starting...
- Allocate interrupts, and register with MAC and MII.
- Reset the PHY if required, and do it before starting MII (the mii_start() function). Reset the device too...
- You must enable device interrupts before returning from attach(), and this should be the last operation before returning.
- The MII layer in Solaris takes care of PHY operations and dladm link properties too, so you need to have getprop and setprop in the MAC callbacks (m_callback). MII can also take care of some common statistics and ndd. You need to implement the PHY read/write/reset operations, which are PHY specific.
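One thing I'd like to point out here is that it helps to have a single pair of DMA alloc and free functions to allocate and free a DMA handle and its memory. It simplifies the code a lot. The same function can be used to allocate the TX/RX descriptor rings, the DMA buffers for TX/RX and the memory for statistics or a control block. You need to pass a DMA attribute structure and a flag (DMA read/write flag). A typical example of such a function looks like this :-

	/*
	 * The field names follow their uses below. The function names
	 * xxxx_dma_alloc()/xxxx_dma_free() and the xxxx_t soft-state type
	 * are placeholders; xxxx_mem_attr is the driver's
	 * ddi_device_acc_attr_t, defined elsewhere.
	 */
	typedef struct xxxx_dma_data {
		ddi_dma_handle_t	hdl;	/* DMA handle */
		ddi_acc_handle_t	acchdl;	/* access handle for the memory */
		ddi_dma_cookie_t	cookie;	/* first DMA cookie */
		uint_t			count;	/* number of cookies */
		caddr_t			addr;	/* kernel virtual address */
		size_t			len;	/* real length allocated */
	} xxxx_dma_t;

	xxxx_dma_t *
	xxxx_dma_alloc(xxxx_t *xxxxp, size_t size, ddi_dma_attr_t *attr,
	    uint_t flag)
	{
		int err;
		xxxx_dma_t *dma;

		dma = kmem_zalloc(sizeof (xxxx_dma_t), KM_SLEEP);

		/* Step 1: allocate the DMA handle. */
		err = ddi_dma_alloc_handle(xxxxp->xxxx_dip, attr,
		    DDI_DMA_SLEEP, NULL, &dma->hdl);
		if (err != DDI_SUCCESS) {
			goto fail;
		}

		/* Step 2: allocate DMA-able memory for the handle. */
		err = ddi_dma_mem_alloc(dma->hdl,
		    size, &xxxx_mem_attr, DDI_DMA_CONSISTENT, DDI_DMA_SLEEP, NULL,
		    &dma->addr, &dma->len, &dma->acchdl);
		if (err != DDI_SUCCESS) {
			ddi_dma_free_handle(&dma->hdl);
			goto fail;
		}

		/* Step 3: bind the memory to get the device-visible address. */
		err = ddi_dma_addr_bind_handle(dma->hdl, NULL, dma->addr,
		    dma->len, flag | DDI_DMA_CONSISTENT, DDI_DMA_SLEEP,
		    NULL, &dma->cookie, &dma->count);
		if (err != DDI_SUCCESS) {
			ddi_dma_mem_free(&dma->acchdl);
			ddi_dma_free_handle(&dma->hdl);
			goto fail;
		}

		return (dma);
	fail:
		kmem_free(dma, sizeof (xxxx_dma_t));
		return (NULL);
	}

	void
	xxxx_dma_free(xxxx_dma_t *dma)
	{
		/* Undo the three steps of the alloc function in reverse order. */
		if (dma != NULL) {
			(void) ddi_dma_unbind_handle(dma->hdl);
			ddi_dma_mem_free(&dma->acchdl);
			ddi_dma_free_handle(&dma->hdl);
			kmem_free(dma, sizeof (xxxx_dma_t));
		}
	}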
Some of the corner cases you must take care of:

- Make sure you handle the RX FIFO overflow interrupt properly. The driver may not have enough RX descriptors left to receive further packets, so you must consume the RX descriptors that have already been posted. Some chips require a reset on RX FIFO overflow.
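One of the general considerations in this entry is the hash-based multicast filter with a reference count per bit. Here is a minimal userland sketch of that bookkeeping. The names mc_filter_t, mc_hash_bit(), mc_add() and mc_remove() are made up for illustration, and the hash used here (low six bits of the last address byte) is only a placeholder; a real chip defines its own hash, usually derived from a CRC of the address.

```c
#include <assert.h>
#include <stdint.h>

#define	MC_HASH_BITS	64

/* Hypothetical per-device multicast state. */
typedef struct mc_filter {
	uint64_t hash;			/* value programmed into the chip */
	uint32_t refcnt[MC_HASH_BITS];	/* addresses mapped to each bit */
} mc_filter_t;

/* Placeholder hash, NOT what any real chip uses. */
unsigned
mc_hash_bit(const uint8_t addr[6])
{
	return (addr[5] & (MC_HASH_BITS - 1));
}

/* Add an address: bump the count for its bit and set the bit. */
void
mc_add(mc_filter_t *f, const uint8_t addr[6])
{
	unsigned bit = mc_hash_bit(addr);

	f->refcnt[bit]++;
	f->hash |= (UINT64_C(1) << bit);
}

/* Remove an address: clear the bit only when its last user goes away. */
void
mc_remove(mc_filter_t *f, const uint8_t addr[6])
{
	unsigned bit = mc_hash_bit(addr);

	assert(f->refcnt[bit] > 0);
	if (--f->refcnt[bit] == 0)
		f->hash &= ~(UINT64_C(1) << bit);
}
```

Two addresses that hash to the same bit keep that bit set until both have been removed. In a driver, the resulting 64-bit value would then be written to the chip's hash registers.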
General things that you may want to consider:

- Call mac_tx_update() outside the lock.
- Try to raise a software interrupt whenever a hardware interrupt is raised. Don't spend too much time processing packets in the hardware interrupt context.
- Make sure the chip is quiesced when detach is called.
- Use DDI's ddi_periodic_add(9F) instead of timeout(9F).
- Test suspend/resume and quiesce (for fast reboot to work).
- I think most multicast filters are hash-based, but I have seen a CAM (Content Addressable Memory) based filter too. It can get tricky to support multicasting, and in that case just enable ALL multicast. Hash-based multicast filters are easy to implement. You can keep a reference count for every bit in the 64-bit variable. Once the reference count for a bit reaches zero, you clear the bit. Otherwise it should remain set.
- Make sure you handle link status changes properly, and re-program the MAC registers if required for a different link speed/duplex.
- Look for memory leaks (set kmem_flags = 0xf in /etc/system and take a crash dump; then run ::findleaks in mdb).

You can use NICDRV or HCTS for testing. NICDRV will stress test most of the components in your driver, including MAXQ, FTP, Ping with different payloads, load/unload of the driver, Multicast, dladm(1m) features, VLAN, VNIC etc.

EOI (End-of-Interrupt) vs Directed-EOI

This post is to help us distinguish between EOI and Directed-EOI. When the EOI register of a local APIC is cleared (written with 0), the local APIC does two things :-

- Clear the appropriate bit in the ISR register of the local APIC.
- Issue a broadcast EOI message to all the IOAPICs in the system.
In Solaris, we clear the EOI register of the local APIC at two different places :-

- For edge-triggered interrupts, we clear the EOI register while raising the TPR (Task Priority Register), i.e. in apic_intr_enter().
- For level-triggered interrupts, we clear the EOI register when exiting from the interrupt handler, i.e. in apic_intr_exit().
The notion of Directed-EOI came from the x2APIC specification. Directed-EOI does not generate a broadcast EOI message to all the IOAPICs. Instead, we clear the ISR bit in the local APIC (by writing 0 to the EOI register of the local APIC) and then clear the appropriate vector index in the IOAPIC. Some CPUs are capable of suppressing the broadcast EOI message, and that's when Directed-EOI comes in handy. Note that Directed-EOI has no meaning for edge-triggered interrupts, so for those we don't send any Directed-EOI.

x2APIC and a new device driver for Broadcom Fast Ethernet chips
http://saurabhslr.blogspot.com (2009-06-16 14:26:57.0)

Install-Time-Update (ITU) and Driver Binding in Solaris

If you have ever wondered how to create install-time driver updates for Solaris 10 and Nevada, then you may want to read this blog entry, as it involves a few tricks here and there.
There are two ways to make your device work with Solaris. The install-time-update (aka ITU DU or ITU diskette) is only required when the disk drive will become the Solaris
boot drive. In all other cases, you should be able to generate a package and run the pkgadd(1m) command to install the driver package on a running Solaris system.
ITU Method

In order to install Solaris onto a bootable drive supported by your driver, you can use an Install Time Update (ITU).
The ITU must have your driver (both 32-bit and 64-bit binaries) and the PCI-IDs of the devices your driver supports.

How to construct an ITU
For Solaris 10
# mkdir -p /var/tmp/your_driver.5.10
Copy your driver and your_driver.conf file into the current directory.
# mkdir -p kernel/drv/amd64
VVVV = Vendor-id
The output of the pkg_drv(1m) will resemble the output below :-
input file: drv=your_driver
## Building pkgmap from package prototype file.
Copy the following files from '/tmp/12546' as follows :-
# cd /var/tmp/your_driver.5.10
You can run the 'pkgproto' command or make a prototype file manually :
bash-3.2# cat > prototype
Make sure you include both the 32-bit and 64-bit binaries of your driver. Once this is completed, we will construct the package again to include the 64-bit binary of the driver.
# cd /var/tmp/your_driver.5.10
This will create '/tmp/PKG' directory under /tmp and that's where
the package is. For example :-
bash-3.2# pkgmk -r . -d /tmp
Do the following things to repack the package in the DU (Diskette) :-
# cd /tmp

For Solaris Nevada

Repeat the same steps as we did for Solaris 10, except for the following
things :-
Once you have created ITU for Solaris 10 and Nevada, we will bundle
them in one DVD/CD (or ISO file). In the directories
'/var/tmp/your_driver.5.11' and '/var/tmp/your_driver.5.10', you will find
a directory called 'PKG'. You must copy the files under 'PKG' to
one directory in order to bundle them together.
# mkdir -p /var/tmp/YOUR_DRIVER-DU
# mkisofs -o your_driver.iso -r /var/tmp/YOUR_DRIVER-DU
This will create an ISO file 'your_driver.iso' and a DVD/CD can be
burned by running the following command line at the prompt :-
# cdrw -i /var/tmp/YOUR_DRIVER-DU/your_driver.iso
In order to install Solaris on boot drives, you use the Solaris
Installer DVD and choose option '5' (Apply Driver Updates). Kindly
follow the instructions when prompted.

The other way is to bundle the device driver in the Solaris bootable media itself, or for network installation. Kindly follow the instructions described at this link. The page at the above link describes how to pack/unpack the Solaris miniroot in order to make changes to the Solaris bootable media.

Driver Binding in Solaris

Driver binding in Solaris is not so easy to understand. Solaris binds a driver based on precedence. This precedence list is maintained in the 'compatible' property of the device node. The two functions responsible for creating the 'compatible' property and for finding the correct binding for the driver are add_compatible() and ddi_compatible_driver_major() respectively.

* pciVVVV,DDDD.SSSS.ssss.RR (0)
* pciexVVVV,DDDD.SSSS.ssss.RR (0)
RR = Revision number

You can get the 'compatible' property by running the 'prtconf -vp' command. If Solaris fails to find a binding using the 'compatible' property, it then tries by 'nodename', and the 'nodename' is constructed from the Subsystem-vendor-id (SSSS) and Subsystem-device-id (ssss) of the device. The PCI-ID which we have been seeing here is embedded in the PCI config space of the device. Device drivers and device firmware must make sure that proper PCI-IDs are chosen to avoid conflicts with existing PCI-IDs. If your device is a PCI-Express based card, then you must add PCI-IDs of the form 'pciexVVVV,DDDD.SSSS' to /etc/driver_aliases, either directly or via the add_drv(1m) or pkg_drv(1m) command.
Solaris APIC implementation with respect to MSI/MSI-x interrupts

What's Local APIC

The local APIC (LAPIC) is part of the CPU chip. It (a) contains the mechanism for generating/accepting interrupts, (b) contains a timer, (c) manages all external interrupts for the processor and (d) accepts and generates inter-processor interrupts (IPIs).

What's IOAPIC

This is a separate chip that is wired to the local APIC so that it can forward interrupts to the appropriate CPU (and to its local APIC).

What's Local APIC Table

Interrupt vectors are numbered 0x00 through 0xFF in the APIC, and 0x00...0x1F are reserved for exceptions. The interrupt vectors in the range 0x20...0xFF are available for programming interrupts in the APIC. Like the local APIC, the IOAPIC assigns a priority to the interrupt based on the vector number: it uses the top 4 bits of the vector number to distinguish priority and ignores the lower 4 bits. For example, if the vector number is 0x3F then the priority is 0x3. In Solaris, this priority mask is represented by APIC_IPL_MASK (0xF0) and the vector mask is represented by APIC_VECTOR_MASK (0x0F). Since we can't use the vector range 0x00...0x1F, Solaris uses APIC_BASE_VECT (0x20) as the base vector and APIC_MAX_VECTOR (0xFF) as the highest vector in the local APIC. APIC_AVAIL_VECTOR is calculated with this formula :- APIC_MAX_VECTOR+1-APIC_BASE_VECT, which translates to (0xFF+1-0x20), i.e. 224 vectors in decimal. Note that vectors are grouped in 16 priority groups and each group has 0x10 vectors. The 16 vectors in a group share the same priority.

APIC Data Structures in Solaris

Here is the big picture on how the various APIC data structures are related to each other. These data structures are described below :-
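The vector arithmetic from the Local APIC Table section can be checked with a few lines of userland C. This is only a sketch: the macro names and values are the ones quoted in this entry (APIC_VECTOR_PER_IPL is taken to be 0x10, the group size described above), while the helper functions vector_priority(), vector_index_in_group() and vector_is_usable() are hypothetical names used for illustration.

```c
#include <assert.h>

#define	APIC_BASE_VECT		0x20	/* first vector usable for interrupts */
#define	APIC_MAX_VECTOR		0xff	/* highest vector in the local APIC */
#define	APIC_AVAIL_VECTOR	(APIC_MAX_VECTOR + 1 - APIC_BASE_VECT)
#define	APIC_IPL_MASK		0xf0	/* top 4 bits: priority group */
#define	APIC_VECTOR_MASK	0x0f	/* low 4 bits: slot within the group */
#define	APIC_VECTOR_PER_IPL	0x10	/* vectors per priority group */

/* Priority group of a vector: its top four bits. */
unsigned
vector_priority(unsigned vector)
{
	return ((vector & APIC_IPL_MASK) >> 4);
}

/* Position of a vector inside its priority group. */
unsigned
vector_index_in_group(unsigned vector)
{
	return (vector & APIC_VECTOR_MASK);
}

/* Vectors below APIC_BASE_VECT are reserved for exceptions. */
int
vector_is_usable(unsigned vector)
{
	return (vector >= APIC_BASE_VECT && vector <= APIC_MAX_VECTOR);
}
```

With these definitions, vector 0x3F falls in priority group 0x3 at slot 0xF, and APIC_AVAIL_VECTOR evaluates to 224, matching the numbers in the text.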
A typical apic_irq_t entry in the apic_irq_table[] looks like this :-

> ::interrupts
> apic_irq_table+(0t22*8)/J
> fffffffec10d7f38::print apic_irq_t

apic_ipltopri[] - This array holds the Solaris IPL to APIC priority mapping. For example :-

> apic_ipltopri::print

Note the order of priority assignment. Higher vector numbers are assigned to higher IPLs. Also note that 0x20 is given to index 1, 2 and 3, which means that IPLs 1, 2 and 3 share the same vector range 0x20...0x2F. apic_ipltopri[] is declared as :-

uchar_t apic_ipltopri[MAXIPL + 1]; /* unix ipl to apic pri */

apic_vectortoipl[] - This array is a bit complex. The main purpose of this array is to initialize the apic_ipltopri[] array.

apic_init() [.] }
uchar_t apic_vectortoipl[APIC_AVAIL_VECTOR / APIC_VECTOR_PER_IPL] = {

Note that IPL 5 uses the vector range 0x40...0x5F (or 0x20...0x3F once the base vector is subtracted, as an optimization) and that's why vector index 2 and 3 have IPL 5. Similarly vector index 4 and 5 have IPL 6 (0x60...0x7F, or 0x40...0x5F once the base vector is subtracted).

* IPL Vector range. as passed to intr_enter

apic_vector_to_irq[] - This array holds the IRQ number for a given vector number. If an element of this array contains APIC_RESV_IRQ (0xFE), it means that the vector is free and can be allocated. The apic_navail_vector() function checks this array to figure out how many vectors are available.

Here is an example of how IPL is mapped to vector priority in Solaris :-

Let's say we got a network interrupt at IPL 6 (ath, the wifi interrupt) with vector number 0x60 (as shown above in the ::interrupts output). Solaris will now block all interrupts at and below IPL 6, which is done by the apic_intr_enter() function. The caller of this function actually subtracts 0x20 (APIC_BASE_VECT) from the vector number. This is done for optimization, but let's come to the point: the apic_ipls[] array is used to get the IPL, which in turn gives the priority that will be programmed into the APIC register. So we first get nipl as

nipl = apic_ipls[vector]; // vector is 0x40 not 0x60 as mentioned above and nipl will be 0x6
apicadr[APIC_TASK_REG] = apic_ipltopri[nipl];

So we write 0x70 to the APIC task register to block interrupts. Note that Solaris uses the range 0x60...0x7F for IPL 6 :-

* IPL Vector range. as passed to apic_intr_enter()

and it does not matter whether you write 0x70 or 0x7F, as they all do the same work, which is to block interrupts at IPL 6 or below.

Solaris x86 Interrupt Handling

Now that we have glanced through the data structures involved, let's look at how Solaris x86 handles interrupts. I prefer to describe interrupt handling before describing how interrupts are allocated, because I feel interrupt handling is easier to understand. Let's first go through how Solaris x86 is designed in terms of psm ops. For example, PCI Express has its own psm ops, which is apic_ops, and PCI has its own psm_ops, which is uppc_ops. In fact xVM (the Xen-based hypervisor) has its own psm_ops called xen_psm_ops. It is psm_install() that is responsible for installing a psm in the Solaris x86 world. apic_probe_common() is what gets called when psm_install() jumps into psm_probe() for each psm_ops. apic_probe_common() does many things, one of them being the mapping of 'apicadr[]' (you have seen this before; I referred to it when setting the APIC priority, i.e. the task register). The apic_cpus[] array also gets initialized from ACPI, i.e. by acpi_probe(), because the ACPI tables have all the information such as the local APIC CPU id, version etc.

Now let's see what happens when the local APIC generates an interrupt. The interrupt could come from the IOAPIC or be an MSI/MSI-x generated interrupt (an in-band message). Solaris calls cmnint() or _interrupt(). These are the same, and they call do_interrupt() once regs is set up. do_interrupt() first sets the PIL so that the CPU does not get any interrupt at or below that PIL. Raising the priority of the CPU is done through the setlvl function pointer. This pointer gets set to the appropriate psm_ops's psm_intr_enter, and in our case it will be apic_intr_enter().

Now comes the dispatching part, which is done by calling switch_sp_and_call() once the stack of the interrupt thread is set up. Recall that Solaris handles interrupts in thread context if the PIL is at or below LOCK_LEVEL (0xa). High-level interrupts (0xb...0xf) are handled on the current thread's stack. switch_sp_and_call() can dispatch three types of interrupts: (a) software interrupts, (b) high-level interrupts and (c) normal device interrupts. In our example we have been looking at the wifi interrupt, which is (c) and maps to the dispatch_hardint() routine. dispatch_hardint() calls av_dispatch_autovect() after enabling interrupts.

Now that we are touching the av_dispatch_autovect() routine, I must explain what the autovect[] array is. If you remember add_avintr(), which is responsible for registering a hardware interrupt handler, then I think you can skip this part. autovect[] has MAX_VECT (256) elements and each element is of type 'struct av_head'. The first pointer in 'struct av_head' points to a 'struct autovec', and the autovec structure has all the information about the interrupt handler, the arguments passed to the interrupt handler, the priority level etc. Note that more than one interrupt handler can share the same vector, and they are linked by 'av_link' in 'struct autovec'. For example :-

> ::interrupts
> ::sizeof 'struct av_head'
> autovect+(0x10*0t22)=J // Take the IRQ and index into autovect[] array.
> fffffffffbc52ba0::print 'struct av_head'
> 0xfffffffec50d2cc0::print 'struct autovec'
> 0xfffffffec10d2f40::print 'struct autovec'

1 2391 av_dispatch_autovect:entry ath`ath_intr, 13

There is a very interesting blog by Anish at this link on APIC and Solaris x86 interrupt handling.

How does the Solaris APIC implementation allocate interrupts

Now that we have looked at how the APIC is structured in Solaris x86 and how interrupts are handled, let's look at how interrupts are allocated.
There are three types of interrupts: DDI_INTR_TYPE_FIXED, DDI_INTR_TYPE_MSI and DDI_INTR_TYPE_MSIX, in the order in which they evolved. The Solaris DDI routine ddi_intr_get_supported_types() can be called to retrieve the types of interrupt supported by the bus. In the case of MSI, apic_alloc_msi_vectors() gets called, and in the case of MSI-x, apic_alloc_msix_vectors() gets called to allocate the appropriate number of interrupt vectors. Note that MSI supports 32 vectors per device function and MSI-x supports 2048 vectors per device function; however, in Solaris x86 we only support 2 MSI-x interrupt vectors per device (which is the reason I started studying APIC and MSI-x). On SPARC, Solaris supports far more MSI-x interrupts, configured by the #msix-request property in DDI. This hard limit is determined by the i_ddi_get_msix_alloc_limit() function; however, even on SPARC it seems we limit it to 8.

msix_alloc_limit = MAX(DDI_MAX_MSIX_ALLOC, ddi_msix_alloc_limit);
/* Default number of MSI-X resources to allocate */

These limits will change when the Interrupt Resource Management (IRM) framework is integrated in Solaris. Anyway, let's get back to the topic. Depending upon the interrupt type and the bus intr ops, Solaris will jump to the interrupt ops. In our case, we get into pci_common_intr_ops() from ddi_intr_alloc(9F) to allocate the interrupts with cmd DDI_INTROP_ALLOC. We will not get into FIXED type interrupts, as they are hard wired via the IOAPIC and fairly easy (I suppose). It's the psm_intr_ops which gets into action with cmd PSM_INTR_OP_ALLOC_VECTORS, and we land up in apic_intr_ops().

apic_intr_ops

The next step is to check whether we have enough IRQs in the apic_irq_table[]. This is done by the function apic_check_free_irqs(). If we succeed in finding enough IRQ entries in the table, apic_alloc_msi_vectors() proceeds to allocate the IRQ, which is done by apic_allocate_irq(). The IRQ number returned by this function is finally used to index into the autovect[] table. We will go into autovect[] again soon, but for now let's see how we select the CPU. The selection of the CPU for this IRQ is done by apic_bind_intr() for the first interrupt in 'count' vectors, and subsequent vectors are bound to the same CPU. These steps are done in a loop 'count' times. Now that we have set up the IRQ in the apic_irq_table[] with priority, vector, target CPU etc, we are set to enable the interrupt. BTW, all this is mostly done in the driver's attach(9E) entry point, usually in two phases: (i) add interrupts by allocating them and (ii) enable interrupts.

apic_alloc_msix_vectors() - This function does similar work to what is done for MSI interrupts, except that we allocate the vector (apart from allocating the IRQ entry in the apic_irq_table[]) and bind the interrupt to a CPU by calling apic_bind_intr() for each request in 'count'. MSI-x does not have the limitation of contiguous vectors that MSI has. Vector allocation is done by the routine apic_allocate_vector(), which returns a free vector by walking the apic_vector_to_irq[] table and looking for an APIC_RESV_IRQ slot. The range is determined by the priority passed to it. For example, if the priority passed is 6, then the range would be

highest = apic_ipltopri[ipl] + APIC_VECTOR_MASK;

so highest is 0x7f (0x70 + 0x0f) and lowest would be 0x60 (0x50 + 0x10), and this matches our observation at the beginning of the blog. A typical flow of this dance is as follows :-

1 22557 apic_alloc_msix_vectors:entry name pciex8086,10a7, inum : 0, count : 2, pri :6
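To make the range walk concrete, here is a simplified userland simulation of the search described above. It is a sketch, not the Solaris routine: vector_to_irq[], vector_table_init() and alloc_vector() are illustrative names, the two priorities are passed in directly instead of being read from apic_ipltopri[], and any vectors the real code might skip inside the range are ignored. The values 0x70 and 0x50 used in the usage note are the ones quoted in this entry for IPL 6 and IPL 5.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

#define	APIC_VECTOR_MASK	0x0f
#define	APIC_VECTOR_PER_IPL	0x10
#define	APIC_RESV_IRQ		0xfe	/* marks a free vector slot */
#define	NVECTORS		256

/* Simulated vector -> IRQ table. */
uint8_t vector_to_irq[NVECTORS];

/* Mark every vector as free. */
void
vector_table_init(void)
{
	memset(vector_to_irq, APIC_RESV_IRQ, sizeof (vector_to_irq));
}

/*
 * pri is the APIC priority of the requested IPL, prev_pri the APIC
 * priority of the next lower IPL. Returns the claimed vector, or 0 if
 * the range is exhausted.
 */
int
alloc_vector(uint8_t pri, uint8_t prev_pri, uint8_t irq)
{
	int highest = pri + APIC_VECTOR_MASK;
	int lowest = prev_pri + APIC_VECTOR_PER_IPL;

	assert(highest < NVECTORS);
	assert(irq != APIC_RESV_IRQ);

	for (int v = lowest; v <= highest; v++) {
		if (vector_to_irq[v] == APIC_RESV_IRQ) {
			vector_to_irq[v] = irq;	/* claim the slot */
			return (v);
		}
	}
	return (0);
}
```

Calling alloc_vector(0x70, 0x50, irq) on a freshly initialized table hands out 0x60 first, then 0x61, and so on up to 0x7f, which is the 0x60...0x7F range worked out in the text for priority 6.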
Now let's talk about how a driver enables interrupts once they are allocated. Interrupts can be enabled as a block (more than one at once, with ddi_intr_block_enable(9F)) or by explicitly calling ddi_intr_enable(9F) for each interrupt; here we will discuss ddi_intr_enable(9F). Once again we end up in pci_common_intr_ops() and call pci_enable_intr(), which mainly does two things :-

- Translate the interrupt if needed. This is done by apic_introp_xlate(). If the interrupt is MSI or MSI-x, we call apic_setup_irq_table() if the IRQ entry in the apic_irq_table[] is not set up. In our example we have already done this, so apic_introp_xlate() just returns the IRQ number from 'apic_vector_to_irq[airqp->airq_vector]'. airqp is an entry in the apic_irq_table[], which gets assigned by calling apic_find_irq().
- Add the interrupt handler by calling add_avintr(). We have already touched on this routine in this blog, but it is worth mentioning at which point in the life cycle of setting up interrupts we bind an interrupt handler (ISR or Interrupt Service Routine) to a vector. The main task of add_avintr() is to insert the 'autovec' at the appropriate index by calling insert_av(). The other, and most important, thing is to program the interrupt, which is done by addspl(). addspl() is another function pointer from the family of setlvl, setspl etc. In the APIC case it will be apic_addspl(), which is just a wrapper over apic_addspl_common(). There are four arguments passed to it :-

apic_addspl_common(int irqno, int ipl, int min_ipl, int max_ipl)

We first get the pointer from apic_irq_table[] by indexing with irqno, and check whether we need to upgrade the vector or just check the IPL in case this interrupt needs to be shared. Eventually we land up in apic_setup_io_intr(), which does the main task. In fact it is apic_rebind() that binds an interrupt to a CPU, and apic_rebind() is called from apic_setup_io_intr(). Since we are discussing MSI/MSI-x, once apic_rebind() has done its sanity checks it will call apic_pci_msi_enable_vector(). The following statement is what we write to program the interrupt :-

/* MSI Address */

apic_pci_msi_enable_mode() is also called from apic_rebind() to enable the interrupt once it's programmed. That's how per-vector masking is controlled, I suppose.

Since we are touching on how we bind an interrupt to a CPU, I should also mention how Solaris selects the CPU to bind an interrupt to. The routine apic_bind_intr() is responsible for doing this, and the decision is based on the value of the tunable 'apic_intr_policy'. You can define three types of policy: (a) INTR_ROUND_ROBIN_WITH_AFFINITY - a round-robin and affinity based policy which returns the same CPU for the same dip (or device). This is the default policy. (b) INTR_LOWEST_PRIORITY - I don't know, because it's not implemented. (c) INTR_ROUND_ROBIN - select the CPU in round-robin fashion using the 'apic_next_bind_cpu' global variable. Choosing between INTR_ROUND_ROBIN_WITH_AFFINITY and INTR_ROUND_ROBIN may not be easy, but I think the decision should be based on throughput vs locality awareness.
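The difference between the two implemented policies can be sketched in userland C as follows. This is an illustration of the idea only, not the body of apic_bind_intr(): the binder_t structure, the POLICY_* enum (with arbitrary numeric values) and bind_intr() are hypothetical, and a device is represented by an opaque pointer standing in for the dip.

```c
#include <assert.h>
#include <stddef.h>

/* Stand-ins for the two implemented policies; values are arbitrary. */
typedef enum {
	POLICY_RR_WITH_AFFINITY,
	POLICY_RR
} bind_policy_t;

#define	MAX_DEVS	16

typedef struct binder {
	bind_policy_t policy;
	int ncpus;
	int next_bind_cpu;		/* plays the role of apic_next_bind_cpu */
	const void *dev[MAX_DEVS];	/* devices that already have a CPU */
	int dev_cpu[MAX_DEVS];
	int ndevs;
} binder_t;

/* Pick the CPU for one interrupt of device 'dip'. */
int
bind_intr(binder_t *b, const void *dip)
{
	int cpu;

	assert(b->ncpus > 0);

	/* Affinity: reuse the CPU already chosen for this device. */
	if (b->policy == POLICY_RR_WITH_AFFINITY) {
		for (int i = 0; i < b->ndevs; i++) {
			if (b->dev[i] == dip)
				return (b->dev_cpu[i]);
		}
	}

	/* Otherwise take the next CPU in round-robin order. */
	cpu = b->next_bind_cpu;
	b->next_bind_cpu = (b->next_bind_cpu + 1) % b->ncpus;

	if (b->policy == POLICY_RR_WITH_AFFINITY && b->ndevs < MAX_DEVS) {
		b->dev[b->ndevs] = dip;
		b->dev_cpu[b->ndevs] = cpu;
		b->ndevs++;
	}
	return (cpu);
}
```

With affinity, every interrupt of one device lands on the same CPU (better locality); with plain round-robin, the interrupts of a multi-vector device are spread over the CPUs (potentially better throughput).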
xVM experience so far
Multi-CPU Binding in Solaris

We are working on a framework which would allow processes/threads to have affinity to more than one CPU. The affinities can be divided into three categories: (a) strong affinity, (b) weak affinity and (c) negative affinity.

CEC 2006...

Luckily I got a chance to attend CEC 2006, which was held in San Francisco from 1st Oct to 4th Oct, this time as an attendee. Interestingly, we had a demo on Solaris 10 adoption, Sun Cluster 3.2 and ZFS which was very well received.

Latency group (lgroup) in Solaris on NUMA aware machines

All of you will have heard about NUMA (Non-Uniform Memory Access) machines. I'm going to describe how the memory latency groups (called lgroups in Solaris) are laid out. While working on the Multi-CPU binding project, I had to learn these aspects in order to choose, for a thread, the lgroup with the least latency from its earlier home lgroup.

CoolThreads(TM) Technology Blow Away IBM, Dell in Latest SPEC Java(TM) Benchmark Performance Results

In the latest update on prenews, the latest Java benchmarks show that Niagara based systems beat IBM and DELL systems. Here's the quote :-
The Sun Fire T1000 and T2000 servers equipped with the single UltraSPARC T1 processor and Solaris 10 OS out-performed a range of Intel Xeon-based 2-4 way servers running Microsoft Windows Server 2003 OS, and IBM 2-4 way Power 5 and p5+ based servers running AIX, all while consuming less power, resulting in over 5X better performance per watt, and up to 4X less space -- critical factors affecting data center efficiency, capacity, and costs.
VFS/Vnode Layer in Solaris

In the past I have mostly written about dispatcher locks (thread locks), the scheduler, signals and procfs. This is the first time I'm writing about filesystems. I hope it will help increase awareness of filesystems, so that developing filesystem-specific things on Solaris becomes easier.
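The read operation covered in this entry maps the file through segmap in MAXBSIZE (8192 byte) windows, so a file offset uoff has to be split into a window-aligned off and an offset mapon inside that window. The following userland sketch shows just that arithmetic. The window_t type and split_offset() are made-up names, MAXBOFFSET and MAXBMASK are defined here in the usual way from MAXBSIZE, and the per-window byte count n (limited by the window end and by the bytes still requested) is an assumption about how a read loop would typically use the two values.

```c
#include <assert.h>
#include <stdint.h>

#define	MAXBSIZE	8192
#define	MAXBOFFSET	(MAXBSIZE - 1)
#define	MAXBMASK	(~(uint64_t)MAXBOFFSET)

typedef struct window {
	uint64_t off;	/* start of the MAXBSIZE-aligned window */
	uint64_t mapon;	/* offset of uoff inside that window */
	uint64_t n;	/* bytes to copy from this window */
} window_t;

/* Split file offset 'uoff' for a request of 'resid' remaining bytes. */
window_t
split_offset(uint64_t uoff, uint64_t resid)
{
	window_t w;

	w.off = uoff & MAXBMASK;
	w.mapon = uoff & MAXBOFFSET;

	/* Never copy past the end of the window or past the request. */
	w.n = MAXBSIZE - w.mapon;
	if (w.n > resid)
		w.n = resid;

	assert(w.off + w.mapon == uoff);
	return (w);
}
```

For example, a 10000 byte read at offset 20000 starts in the window at 16384, at mapon 3616, and can take 4576 bytes from that window before moving on to the next one.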
f_bsize // block size
f_frsize // block size. UFS has a fragment size to accommodate small files.
f_blocks // total number of blocks in the filesystem
f_bfree // free blocks
f_files = (fsfilcnt64_t)-1;
f_ffree = (fsfilcnt64_t)-1;
f_favail = (fsfilcnt64_t)-1;
f_fsid // filesystem id
(void) strcpy(sp->f_basetype, vfssw[vfsp->vfs_fstype].vsw_name); // name
f_flag = vf_to_stf(vfsp->vfs_flag); // flag
f_namemax // MAX filename size.

(d) sync operation : For a read-only filesystem, we don't need to implement sync. Otherwise, it's used for flushing dirty pages in the filesystem.

(e) root operation : Used by filesystem lookups to determine the root (or mount point). We are required to hold the vnode.

The vnode layer exports the following operations. We will focus on the operations which are required to support reading from the filesystem. Write operations are very tricky, as you need to implement a host of other operations and filesystem locking.

(a) read : This operation is invoked whenever read(2) is called. In this routine, we use segmap to read the data of the file. We force-fault the pages using segmap_getmapflt(segkmap, vp, (off + mapon), ...) and then uiomove() is called to copy the data back to userland. We also release the smp (segmap entry) using segmap_release() once uiomove() is done. Please note that segmap uses 8192 (MAXBSIZE) byte windows, so you are required to manage the offset (off) and mapon accordingly. They are calculated as :

off = uoff & (offset_t)MAXBMASK;
mapon = (u_offset_t)(uoff & (offset_t)MAXBOFFSET);

(b) getattr : In this operation, we need to return the 'vattr' structure. 'ls -l' reads this structure. The following members are relevant here :-

va_type // type of vnode
va_mode // mode
va_uid // uid
va_gid // gid
va_atime.tv_sec // access time
va_mtime.tv_sec // modification time
va_ctime.tv_sec // change time
va_size // size
va_nlink // link count
va_blksize // block size
va_nblocks // number of blocks

(c) lookup : This is the heart of any filesystem. We must provide lookup in the filesystem before we can read files or search in a directory. This routine understands the filesystem structure. In this operation, you can also use the DNLC (Directory Name Lookup Cache) to speed up the fs lookup. The vnode and name are cached, so we don't need to go to the disk every time to search for a file/directory. dnlc_enter() can be used to put an entry in the DNLC, and dnlc_lookup() can be used to check whether a vnode can be found in the DNLC given the name. Both routines increment v_count using VN_HOLD().

(d) getpage_miss/getpage : This routine reads a block of a file given the offset. Here we need to set up the page using page_create_va() and prepare for reading the block data using pageio_setup(). In order to issue the IO, we do the following things in order: bdev_strategy(), biowait() and then pageio_done(). In order to support read-ahead, we can use the pvn_read_kluster() routines. The filesystem-specific getpage() routine calls getpage_miss() to read the block. In getpage(), we also do a page_lookup() in order to avoid going to disk if the page is already in memory.

(e) readdir : This operation is used to read the directory entries. The uio_offset passed in the uio structure is the key thing here. If uio_offset is the same as the file size, then we have read all the directory entries. If that's not the case, then we read directory entries starting from the last offset, which is passed to us in uio_offset. At the end, we are required to return the new offset in uio_offset, so that the next time readdir() is called we can read more directory entries.

There are a host of other functions which are required when write is also supported on the filesystem, for instance putpage, write etc. In order to support mmap(), we need to use the segvn segment driver instead of segmap.

(2006-01-04 19:30:00.0)

An interesting signal delivery related problem

Recently, we found an interesting performance problem using Dtrace.
The problem showed up when using a virtual timer created with setitimer(2). The interval passed was 10 ms (one clock tick), but the SIGVTALRM signal used to arrive late, sometimes by 6 ticks or more. Now how will you Dtrace the code, and from where will you start tracing? I'll start tracing from signal generation to delivery. In the Solaris kernel, we use sigtoproc() to post a signal, and eat_signal() is called on the thread to get the thread on proc (TS_ONPROC) depending upon its state (TS_RUN, TS_SLEEP, TS_STOPPED). psig() is called when the kernel finds a pending signal (for instance when returning from a trap). The program spins in userland after setting up the timer. Since the state of the thread would be TS_ONPROC, the target CPU has to be poked if the thread happens to be running on a different CPU. So I started tracing the following functions: sigtoproc(), eat_signal(), poke_cpu() and psig(). Now let's take a look at the Dtrace probe output:

CPU Probe ID Function
8 11263 eat_signal:entry 1027637980027920 sig : 28
8 2981 poke_cpu:entry 1027637980030560 cpu : 9
8 11263 eat_signal:entry 1027637990025440 sig : 28
8 2981 poke_cpu:entry 1027637990032160 cpu : 9
8 11263 eat_signal:entry 1027638000036320 sig : 28
8 2981 poke_cpu:entry 1027638000043600 cpu : 9
8 11263 eat_signal:entry 1027638010025520 sig : 28
8 2981 poke_cpu:entry 1027638010032240 cpu : 9
8 11263 eat_signal:entry 1027638020023840 sig : 28
8 2981 poke_cpu:entry 1027638020031280 cpu : 9
8 11263 eat_signal:entry 1027638030028720 sig : 28
8 2981 poke_cpu:entry 1027638030035920 cpu : 9
8 11263 eat_signal:entry 1027638040024480 sig : 28
[.]
9 8317 psig:entry 1027638170086480 sig : 28

If you calculate the difference between the timestamps of psig() and the first eat_signal(), you will notice that the difference is huge.
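The gap can be computed directly from the two timestamps in the trace. The snippet below is a small sketch of that arithmetic, assuming the values are in nanoseconds (as DTrace's timestamp variable is) and that one clock tick is 10 ms, as stated above. TICKS_PER_SEC, ns_to_ticks() and ns_to_ms() are names chosen for this example.

```c
#include <assert.h>
#include <stdint.h>

#define	NANOSEC		1000000000ULL
#define	TICKS_PER_SEC	100ULL		/* one tick = 10 ms (assumed) */

/* Whole clock ticks contained in 'ns' nanoseconds. */
uint64_t
ns_to_ticks(uint64_t ns)
{
	return (ns / (NANOSEC / TICKS_PER_SEC));
}

/* Whole milliseconds contained in 'ns' nanoseconds. */
uint64_t
ns_to_ms(uint64_t ns)
{
	return (ns / 1000000ULL);
}
```

Feeding in the psig() timestamp minus the first eat_signal() timestamp (1027638170086480 - 1027637980027920 = 190058560 ns) gives 19 ticks, or 190 ms.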
1027638170086480-1027637980027920 190058560 = 19 ticks (190 ms) We also noticed that CPU 8 (from where sigtoproc() is being called by clock_tick()) is poking CPU 9, however CPU 9 is not preempting the current running thread (program which is spinning). So why and how will it happen? In order to understand this, I'll first describe a bit on how preemption works in Solaris. In order to preempt a running thread, kernel sets t_astflag (using aston() macro) and also sets appropriate CPU preemption flag. There are two CPU preemption flags viz: cpu_runrun for user level preemptions and cpu_kprunrun for kernel level preemptions. RT threads can preempt TS or SYS or IA class threads since kernel level preemptions typically kicks off when current running threads priority is <= 100 (KPQPRI). For signal we don't set CPU level preemption flags. We just need to set t_sig_check and t_astflag followed by poke call. Since we are interested in user level preemption, we should know what happens when CPU 8 poked CPU 9 (using cross call). If the current running thread on CPU 9 is in userland, then we call user_rtt() which calls trap() if the checks for t_astflag succeeds. So lets check whether t_astflag would be set when we call eat_signal() or not. And that's where the problem was. If the target thread in eat_signal() is TS_ONPROC, we should set t_astflag and then poke the CPU. It will be clear from the following probe that the running thread on CPU 9 was getting preempted because the time quantum finished and clock would have set t_astflag in cpu_surrender(). 9 15055 post_syscall:entry 1027637970269440 8 11263 eat_signal:entry 1027637980027920 sig : 28 8 2981 poke_cpu:entry 1027637980030560 cpu : 9 [.] 8 11263 eat_signal:entry 1027638040024480 sig : 28 8 2981 poke_cpu:entry 1027638040026800 cpu : 9 [.] 
8    11263     eat_signal:entry    1027638110024160  sig : 28
8    2981      poke_cpu:entry      1027638110026560  cpu : 9
8    2435      cpu_surrender:entry 1027638170024720  t:3001b7af3e0
8    2981      poke_cpu:entry      1027638170027280  cpu : 9
8    11263     eat_signal:entry    1027638170032720  sig : 28
9    2919      poke_cpu_intr:entry 1027638170033760
8    2981      poke_cpu:entry      1027638170034400  cpu : 9
9    3390      trap:entry          1027638170037840  type :512, pc: 10984, ast:1
8    2981      poke_cpu:entry      1027638170038640  cpu : 9
9    2919      poke_cpu_intr:entry 1027638170045680
9    1497      trap_cleanup:entry  1027638170054880  0
9    8317      psig:entry          1027638170086480  sig : 28
9    2278      trap_rtt:entry      1027638170117440
9    15055     post_syscall:entry  1027638170143360
9    8317      psig:entry          1027638170150880  sig : 2
So DTrace did help us find out where the problem could be. This is just one example. Happy DTracing...
(2005-07-21 00:47:44.0)
Some time back I had a problem with my desktop: it started crawling whenever the Java ticker kicked in. I think I must share this with the rest of the world. I'll also share a kernel problem that we cracked, which was related to performance. DTrace has helped in solving many problems so far.
My desktop running Solaris 10 was crawling, and I noticed that Xsun was eating up 68% of the CPU. From prstat(1M):
# prstat
PID USERNAME SIZE RSS STATE PRI NICE TIME CPU PROCESS/NLWP
594 ****** 85M 78M run 30 0 14:03:19 68% Xsun/1
796 root 16M 13M sleep 59 0 1:06:25 5.8% stfontserverd/18
[.]
I then started DTracing Xsun and noticed that the lwp_sigmask() system call was being made far too frequently by Xsun. Here is the data :-
# ./syscall.d
^C
Ran for 26 seconds
writev 2832
pollsys 3261
read 5910
doorfs 27199
lwp_sigmask 217592
LWP ID COUNT
1 217592
libc.so.1`__systemcall6+0x20
libc.so.1`pthread_sigmask+0x1b4
libc.so.1`sigprocmask+0x20
libc.so.1`sighold+0x54
libST.so.1`fsexchange+0x78
libST.so.1`FSSessionDisposeFontInstance+0x8c
9063
libc.so.1`__systemcall6+0x20
libc.so.1`pthread_sigmask+0x1b4
libc.so.1`sigprocmask+0x20
libc.so.1`sigrelse+0x54
libST.so.1`fsexchange+0xc0
libST.so.1`FSSessionGetFontRenderingParams+0x8c
...and many more such stack traces from libST.so.1`fsexchange().
In fact the stack is like this:-
libc.so.1`__systemcall6+0x20
libc.so.1`pthread_sigmask+0x1b4
libc.so.1`sigprocmask+0x20
libc.so.1`sighold+0x54
libST.so.1`fsexchange+0x90
libST.so.1`FSSessionGetFontRenderingParams+0x8c
libST.so.1`GetRenderProps+0x344
libST.so.1`GlyphVectorRepQuery+0xf4
libST.so.1`STGlyphVectorQuery+0xd0
SUNWXst.so.1`_XSTUseCache+0x68
Notice that in this stack trace we are calling sighold() and sigrelse() far too frequently. So this process is blocking and unblocking signals for some reason. It looks like we are rendering characters, but why do we block and unblock signals in this path? Here is the DTrace script which was used :-
#!/usr/sbin/dtrace -s

#pragma D option quiet

BEGIN
{
        start = timestamp;
}

/* Count every system call made by Xsun, keyed by syscall name. */
syscall:::entry
/execname == "Xsun"/
{
        @s[probefunc] = count();
}

/* For lwp_sigmask, also count per LWP and per user stack (6 frames). */
syscall::lwp_sigmask:entry
/execname == "Xsun"/
{
        @c[curthread->t_tid] = count();
        @st[ustack(6)] = count();
}

END
{
        printf("Ran for %d seconds\n\n", (timestamp - start) / 1000000000);
        trunc(@s, 5);           /* keep only the five most frequent syscalls */
        printa(@s);
        printf("\n%-10s %-10s\n", "LWP ID", "COUNT");
        printa("%-10d %@d\n", @c);
        printa(@st);
}
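To make the pattern in those stacks concrete, here is a hypothetical C sketch of a routine that blocks and unblocks a signal around every small exchange. It is not the libST code: exchange_once(), run_exchanges() and the choice of SIGALRM are assumptions made for illustration, and sigprocmask() stands in for sighold()/sigrelse(), which the stacks above show ending up in sigprocmask() anyway.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <signal.h>
#include <stddef.h>

static unsigned long mask_calls;	/* sigprocmask() calls made so far */

/* Block or unblock one signal; every call changes the signal mask once. */
static void
set_blocked(int sig, int how)
{
	sigset_t set;
	int rc;

	sigemptyset(&set);
	sigaddset(&set, sig);
	rc = sigprocmask(how, &set, NULL);
	assert(rc == 0);
	(void) rc;
	mask_calls++;
}

/* Hypothetical stand-in for one request/reply exchange. */
static void
exchange_once(void)
{
	set_blocked(SIGALRM, SIG_BLOCK);	/* plays the role of sighold() */
	/* ... send the request and read the reply here ... */
	set_blocked(SIGALRM, SIG_UNBLOCK);	/* plays the role of sigrelse() */
}

/* Run n exchanges and return how many mask changes they cost. */
unsigned long
run_exchanges(unsigned long n)
{
	unsigned long i;

	mask_calls = 0;
	for (i = 0; i < n; i++)
		exchange_once();
	return (mask_calls);
}
```

Every exchange costs two mask changes, so a caller that performs many small exchanges per rendered string would pile up calls quickly, which is the kind of count the lwp_sigmask line above shows.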
In fact DTrace can help us solve much more complex problems. Happy DTracing...
(2005-07-18 19:44:08.0)
It's been a little while since I have written about a topic in the Solaris kernel. I'll soon blog on how preemption (both user and kernel level) works in Solaris. With the help of Jonathan Chew, Andrei Doffee and Eric Saxe in Solaris Kernel Development, I'm currently working on a project which will enable you to specify multi-CPU binding and define affinity between processes/lwps. We are still at the stage of drafting and developing a prototype. I'm getting to learn lgroup (latency group) and HLS (Hierarchical Lgroup Support) too. lgroup improves performance on NUMA (Non-Uniform Memory Access) machines like the E15k, E25k, Serengeti 6800 and so on. It is my pleasure to work with the Solaris Kernel Development engineers on this project. I'm sure I'll learn loads of things as we go along.
In the meantime, we recently cracked a problem in Solaris which delayed the response of a thread when a signal was pending. These days my colleague Sudheer Abdul Salam (we call him hot gun in the Solaris Kernel Sustaining group) partners with all of us when working on a bug. We truly believe in team building and don't hesitate to take help from others. At the same time, we don't hesitate to play pranks too :-)
Now that our group owns picld(1M), I've been a little busy with a few bugs in this area too. picld(1M) has interfaces which allow you to get the platform information in the form of a tree (an abstract configuration of the system). The current users of the picld(1M) interfaces are prtpicl(1M) and SunMC, and I'm sure third-party applications will be using picld too.
This weekend (2nd/3rd July) I'm going for a long drive and a long trek too. I'm hoping that it'll just rain and rain.
The weather in Bangalore is rejuvenating us (my uncle and myself), and the Western Ghats are attracting us again with their lush green forests and beautiful mountain ranges.
(2005-06-29 20:28:18.0)
I'm going to write about a compiler reordering problem in the door_return() function which was observed in July 2002. The customer was able to reproduce the problem for us, and it took me a while to figure out that it was a compiler reordering problem. I must thank our customers for being so co-operative when we get such issues. I must have provided instrumented kernels at least five times before I found the problem. It's bug 4699850. The symptom was very clear. The system used to panic in the Solaris kernel dispatcher routines, and one of the symptoms was the system panicking in dispdeq() while removing a kernel thread from the dispatch queue of a CPU. We know that the compiler can reorder C statements if they are independent. Assume this piece of C code:
#define THREAD_SET_STATE(tp, state, lp) \
t_lockp is a pointer to a dispatcher lock, and we don't know whether lp is held or not. When a thread is made TS_ONPROC, the t_lockp of the corresponding thread points to the cpu_thread_lock of the CPU (cpu_t). In the above C code, the stores can be reordered by the compiler, so lp should be held while setting the thread's state. In door_return(), when the server thread is about to hand off to the client thread to return the results, it makes the client thread TS_ONPROC and calls shuttle_resume() on the client thread. The responsibility of shuttle_resume() is to make the client/server thread TS_ONPROC while the caller sleeps on the shuttle_lock sync object. While putting a thread on proc, the dispatcher routines need not hold cpu_thread_lock, and hence if we call THREAD_ONPROC() in door_return(), we effectively lose the thread lock on the client thread. Now let's look at the two stores again.
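As a rough illustration of why the order of those two stores matters, here is a small user-level model in C11. It is not the Solaris macro and not the fix that went into the kernel (the remedy described here is to hold lp); it swaps in C11 release/acquire atomics simply to show one way of pinning the order. All names (model_thread_t, model_set_state() and so on) and the state values are hypothetical.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stddef.h>

/* Hypothetical stand-ins for a dispatcher lock and a kernel thread. */
typedef struct model_lock { int held; } model_lock_t;

typedef struct model_thread {
	_Atomic int t_state;			/* thread state */
	_Atomic(model_lock_t *) t_lockp;	/* lock that protects t_state */
} model_thread_t;

enum { MODEL_TS_SLEEP = 1, MODEL_TS_ONPROC = 4 };	/* illustrative values */

/*
 * Store the state first, then publish the new lock pointer with release
 * ordering, so neither the compiler nor the CPU may move the t_state
 * store after the t_lockp store.
 */
void
model_set_state(model_thread_t *tp, int state, model_lock_t *lp)
{
	assert(lp != NULL);
	atomic_store_explicit(&tp->t_state, state, memory_order_relaxed);
	atomic_store_explicit(&tp->t_lockp, lp, memory_order_release);
}

/*
 * Reader side: if the acquire load observes the new lock pointer, the
 * state stored before it is guaranteed to be visible as well.
 */
int
model_get_state(model_thread_t *tp, model_lock_t **lpp)
{
	*lpp = atomic_load_explicit(&tp->t_lockp, memory_order_acquire);
	return (atomic_load_explicit(&tp->t_state, memory_order_relaxed));
}
```

With two plain, independent stores there is no such guarantee, and an observer on another CPU can pair the new lock pointer with the old state.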
If t_lockp reaches global visibility before t_state, we can effectively lose the thread lock on the thread. Assume another thread on a different CPU is sending a signal to the client door thread. Once the thread lock is lost on the client thread, the thread which is sending the signal could see the old state of the client thread (in this case it happens to be TS_SLEEP). Since the state is TS_SLEEP, eat_signal() will do setrun() on the client thread, which enqueues the client thread in the dispatch queue of the CPU. As a result, we can see some very strange things happening, which included the dispdeq() panic. The following code in door_return() was faulty:
int
I used TNF (Trace Normal Form) to track down this problem. But now we have a powerful tool to trace from userland to the kernel, and of course it's DTrace.
Technorati Tag: OpenSolaris Technorati Tag: Solaris Technorati Tag: DTrace
(2005-06-13 12:00:00.0)