Saurabh Mishra's Weblog

20090901 Tuesday September 01, 2009

Writing a new Ethernet device driver for Solaris

This blog entry describes what you should keep in mind while writing a new Ethernet device driver for Solaris. It does not cover LSO, hardware checksum offload or multiple RX rings, as I have not written code for these features.

Most Ethernet controllers have descriptor-based TX and RX. The starting point for writing a new device driver is getting attach() and detach() working. That's fairly easy, but you will mostly want to do the following things in attach() :

- Get the vendor/device-id and make sure we have correct chip by looking at the revision.

- Pre-allocate all DMA buffers for TX. You have to pre-allocate all RX buffers anyway. This is the simplest model you can think of, but it requires a bcopy (an extra copy during TX/RX). But hey, you are just starting...

- Allocate interrupts, Register MAC and MII.

- Reset PHY if required and do it before starting MII (mii_start() function). Reset the device too...

- You must enable device interrupts before returning from attach() and this should be the last operation before returning from attach().

- The MII layer in Solaris takes care of PHY operations and dladm link properties too, so you need getprop and setprop entries in the MAC callbacks (m_callback). MII can also take care of some common statistics and ndd. You need to implement the PHY read/write/reset operations, which are PHY specific.

One thing I'd like to point out here is that you should have a single pair of DMA alloc and free functions to allocate and free a DMA handle and its memory. It simplifies the code a lot. The same function can be used to allocate the TX/RX descriptor rings, the DMA buffers for TX/RX and the memory for statistics or a control block. You need to pass a DMA attribute structure and a flag (DMA read/write flag). A typical example of such a function looks like this :-

typedef struct  xxxx_dma_data {
        ddi_dma_handle_t        hdl;
        ddi_acc_handle_t        acchdl;
        ddi_dma_cookie_t        cookie;
        caddr_t                 addr;
        size_t                  len;
        uint_t                  count;
} xxxx_dma_t;


xxxx_dma_t *
xxxx_alloc_a_dma_blk(xxxx_t *xxxxp, ddi_dma_attr_t *attr, int size, int flag)
{

	int err;
	xxxx_dma_t *dma;

	dma = kmem_zalloc(sizeof (xxxx_dma_t), KM_SLEEP);

	err = ddi_dma_alloc_handle(xxxxp->xxxx_dip, attr,
	    DDI_DMA_SLEEP, NULL, &dma->hdl);

	if (err != DDI_SUCCESS) {
		goto fail;
	}

	err = ddi_dma_mem_alloc(dma->hdl,
	    size, &xxxx_mem_attr, DDI_DMA_CONSISTENT, DDI_DMA_SLEEP, NULL,
	    &dma->addr, &dma->len, &dma->acchdl);

	if (err != DDI_SUCCESS) {
		ddi_dma_free_handle(&dma->hdl);
		goto fail;
	}

	err = ddi_dma_addr_bind_handle(dma->hdl, NULL, dma->addr,
	    dma->len, flag | DDI_DMA_CONSISTENT, DDI_DMA_SLEEP,
	    NULL, &dma->cookie, &dma->count);

	if (err != DDI_SUCCESS) {
		ddi_dma_mem_free(&dma->acchdl);
		ddi_dma_free_handle(&dma->hdl);
		goto fail;
	}

	return (dma);
fail:
	kmem_free(dma, sizeof (xxxx_dma_t));
	return (NULL);

}

void
xxxx_free_a_dma_blk(xxxx_dma_t *dma)
{

	if (dma != NULL) {
		(void) ddi_dma_unbind_handle(dma->hdl);
		ddi_dma_mem_free(&dma->acchdl);
		ddi_dma_free_handle(&dma->hdl);
		kmem_free(dma, sizeof (xxxx_dma_t));
	}

}


Some of the corner cases you must take care of:

- Test the code path where there are no more TX descriptors available for the driver to send a packet. You must call mac_tx_update() once a descriptor is reclaimed. Some drivers start reclaiming only once a threshold is reached.

- Make sure you handle the RX FIFO overflow interrupt properly. The driver may not have enough RX descriptors to receive further packets, so you must consume the posted RX descriptors. Some chips require a reset on RX FIFO overflow.
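The TX-exhaustion corner case can be modelled in user space before it is tested on hardware. The following is only a minimal sketch under stated assumptions: `tx_ring_t`, `tx_send()`, `tx_reclaim()`, `notify_mac()` and the ring size of 4 are hypothetical names and values, and `notify_mac()` merely stands in for the real mac_tx_update() call.

```c
#include <assert.h>
#include <stdbool.h>

#define TX_RING_SIZE 4	/* hypothetical; real rings are much larger */

typedef struct tx_ring {
	unsigned head;		/* next descriptor the driver fills */
	unsigned tail;		/* oldest descriptor not yet reclaimed */
	unsigned used;		/* descriptors currently owned by hardware */
	bool blocked;		/* a send failed because the ring was full */
	unsigned updates;	/* number of notify_mac() calls (for the demo) */
} tx_ring_t;

/* Stand-in for mac_tx_update(); a real driver calls it outside its lock. */
static void
notify_mac(tx_ring_t *r)
{
	r->updates++;
}

/* Try to queue one packet; returns false when no descriptor is free. */
static bool
tx_send(tx_ring_t *r)
{
	if (r->used == TX_RING_SIZE) {
		r->blocked = true;	/* remember that the stack is waiting */
		return (false);
	}
	r->head = (r->head + 1) % TX_RING_SIZE;
	r->used++;
	return (true);
}

/* Reclaim up to 'done' completed descriptors, then wake the stack. */
static void
tx_reclaim(tx_ring_t *r, unsigned done)
{
	bool wake = false;

	while (done-- > 0 && r->used > 0) {
		r->tail = (r->tail + 1) % TX_RING_SIZE;
		r->used--;
	}
	if (r->blocked && r->used < TX_RING_SIZE) {
		r->blocked = false;
		wake = true;
	}
	/* A real driver would drop its TX lock here, before notifying. */
	if (wake)
		notify_mac(r);
}
```

Filling the ring, attempting one more send and then reclaiming a single descriptor should produce exactly one notification; later reclaims stay silent until a send fails again.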

General things that you may want to consider:

- Call mac_tx_update() outside lock.

- Try to raise a software interrupt whenever a hardware interrupt is raised. Don't spend too much time processing pkts in the hardware interrupt context.

- Make sure chip is quiesced when detach is called.

- Use DDI's ddi_periodic_add(9F) instead of timeout(9F).

- Test suspend/resume and quiesce (for fast reboot to work).

- I think most multicast filters are hash-based, but I have seen a CAM (Content Addressable Memory) based filter too. If it gets tricky to support multicast filtering, just enable ALL multicast. Hash-based multicast filters are easy to implement. You can keep a reference count for every bit in the 64-bit hash variable. Once the reference count for a bit reaches zero, you clear the bit; otherwise it remains set.

- Make sure you handle link status change properly and re-program the MAC register if required at different link speed/duplex.

- Look for memory leaks (enable kmem_flags = 0xf in /etc/system and take crash dump; then run ::findleaks in mdb)
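The reference-counted hash filter mentioned in the list above could be sketched as follows. This is a user-space illustration, not code from any driver: `mcast_filter_t`, `mcast_add()`, `mcast_remove()` and `mcast_bit()` are hypothetical names, and the choice of a reflected CRC-32 with the top 6 bits as the bit index is an assumption, since the hash function and bit selection are chip specific.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

typedef struct mcast_filter {
	uint64_t hash;		/* value that would be programmed into the chip */
	uint16_t refcnt[64];	/* number of addresses mapping to each bit */
} mcast_filter_t;

/* Bitwise reflected CRC-32 (assumption: the chip hashes this way). */
static uint32_t
ether_crc32(const uint8_t *buf, size_t len)
{
	uint32_t crc = 0xFFFFFFFFu;

	for (size_t i = 0; i < len; i++) {
		crc ^= buf[i];
		for (int b = 0; b < 8; b++)
			crc = (crc >> 1) ^ (0xEDB88320u & (0u - (crc & 1u)));
	}
	return (crc);
}

/* Map a MAC address to one of the 64 filter bits (top 6 CRC bits). */
static unsigned
mcast_bit(const uint8_t mac[6])
{
	return (ether_crc32(mac, 6) >> 26);
}

static void
mcast_add(mcast_filter_t *f, const uint8_t mac[6])
{
	unsigned bit = mcast_bit(mac);

	if (f->refcnt[bit]++ == 0)
		f->hash |= (1ULL << bit);	/* first user sets the bit */
}

static void
mcast_remove(mcast_filter_t *f, const uint8_t mac[6])
{
	unsigned bit = mcast_bit(mac);

	if (f->refcnt[bit] == 0)
		return;				/* nothing to remove */
	if (--f->refcnt[bit] == 0)
		f->hash &= ~(1ULL << bit);	/* last user clears the bit */
}
```

After every add or remove, the driver would write `f->hash` to the chip's hash filter registers. Two addresses that collide on the same bit keep it set until both are removed.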


You can use NICDRV or HCTS for testing. NICDRV will stress test most of the components in your driver, including MAXQ, FTP, ping with different payloads, load/unload of the driver, multicast, dladm(1m) features, VLAN, VNIC etc.

(2009-09-01 14:04:45.0)

20090731 Friday July 31, 2009

EOI (End-of-Interrupt) vs Directed-EOI

This post is to help us distinguish between EOI and Directed-EOI. When we clear the EOI register of a local APIC (by writing 0 to it), the local APIC does two things :-

- Clear the appropriate bit in the ISR register of the local APIC.

- Issue a broadcast EOI message to all the IOAPICs in the system.

In Solaris, we clear EOI register of the local APIC at two different places :-

- For edge interrupts, we clear the EOI register while raising the TPR (Task Priority Register), i.e. in apic_intr_enter().

- For level-triggered interrupts, we clear the EOI register when exiting from the interrupt handler, i.e. in apic_intr_exit().

The notion of Directed-EOI comes from the x2APIC specification. With Directed-EOI, no broadcast EOI message is generated to all the IOAPICs. Instead, we clear the ISR bit in the local APIC (by writing 0 to the EOI register in the local APIC) and then clear the appropriate vector in the IOAPIC ourselves. Some CPUs are capable of suppressing the broadcast EOI message, and that's when Directed-EOI comes in handy. Note that Directed-EOI has no meaning for edge-triggered interrupts; for those we don't send any Directed-EOI.


(2009-07-31 16:40:04.0)

20090616 Tuesday June 16, 2009

x2APIC and a new device driver for Broadcom Fast Ethernet chips


It's been quite a while since I wrote something technical on my blog. I have been working on quite a few things of late. After integrating x2APIC, a new local APIC model which uses MSRs (Model Specific Registers) on future-generation Intel processors, I took up a small challenge to work on device drivers, and an Ethernet controller at that. Having gained no knowledge of networking and device drivers in past years, I thought this was the time to jump in. Better late than never, you know. So this blog is really about two major things :-


x2APIC - A new Local APIC (Advanced Programmable Interrupt Controller). It improves performance as the local APIC registers can be written in parallel. With xAPIC (the MMIO model), we used to map the local APIC registers into memory and hence any write to that I/O space used to get serialized. x2APIC has some improvements in IPIs (Inter-Processor Interrupts) too. It also extends support to Local APIC IDs > 255, but I don't think any BIOS programs Local APIC IDs > 255 as of now.

Broadcom Fast Ethernet (SUNWbfe) - This is a project which turned out to be a good experience. I had no prior knowledge of writing device drivers or of Ethernet controllers. Initially, I was quite confused about the ring architecture, descriptors and buffers. I was not able to fit everything into a big picture and convince myself that it works. I managed to learn about them after spending some two weeks looking for documents on how TX/RX rings are organized. So the first thing was to document how a TX/RX ring is organized and it's well described here.

Solaris now has support for the Broadcom 100Base-T Fast Ethernet controller. It is a somewhat old Ethernet controller but a popular one. Moreover, it makes more sense on netbooks than on laptops. This chip has only one TX and one RX ring. The number of descriptors is programmable, and it supports multicast through a CAM (Content Addressable Memory with 64 entries). It does not support jumbo frames though, and hence the MTU is 1500. Having integrated bfe in Solaris Nevada the other day, my next target is to add support for Atheros/Attansic Ethernet controllers. They come in three flavors :-

- Atheros/Attansic L2 Fast Ethernet as device-id 0x2048

- Atheros/Attansic AR8121/AR8113 PCI-E Ethernet Controller as device-id 0x1026

- Atheros/Attansic L1 Gigabit Ethernet 10/100/1000 Base as device-id 0x1048

The plan is to support all three chips in atge (a new device driver, SUNWatge). I have started the work and I expect to complete it in a two to three month timeframe.

/Saurabh

http://saurabhslr.blogspot.com

(2009-06-16 14:26:57.0)

20080419 Saturday April 19, 2008

Install-Time-Update (ITU) and Driver Binding in Solaris

If you ever wondered how to create install-time driver updates for Solaris 10 and Nevada, then you may want to read this blog entry, as it involves a few tricks here and there. There are two ways to make your device work with Solaris. The install-time update (aka ITU DU or ITU diskette) is only required when the disk drive will become the Solaris boot drive. In all other cases, you should be able to generate a package and run the pkgadd(1m) command to install the driver package on a running Solaris system.

ITU Method

In order to install Solaris onto a bootable drive supported by your driver, you can use an Install Time Update (ITU). The ITU must have your driver (both 32-bit and 64-bit binaries) and PCI-IDs of the device your driver supports.

How to construct an ITU

  • Make sure you have Solaris 10 and Nevada binaries of your driver for both the 32-bit and 64-bit operating system, and the your_driver.conf (driver configuration) file. You should get the pkg_drv(1m) command by installing the SUNWpkgd package from this link

    In order to create an ITU for Solaris 10 and Nevada, you would want to create two directories and run pkg_drv(1m) there.

For Solaris 10

# mkdir -p /var/tmp/your_driver.5.10
# cd /var/tmp/your_driver.5.10

Copy your driver and the your_driver.conf file into the current directory.

# mkdir -p kernel/drv/amd64
# cp <32-bit binary of your driver> .
# cp <32-bit binary of your driver> kernel/drv/your_driver
# cp <64-bit binary of your driver> kernel/drv/amd64
# cp your_driver.conf .
# pkg_drv -i '"pciVVVV,DDDD.SSSS.ssss"' -o `pwd`/PKG -c scsi -r 5.10 your_driver

VVVV = Vendor-id
DDDD = Device-id
SSSS = Subsystem-vendor-id
ssss = Subsystem-device-id
PKG = your_driver.
'-c scsi' is the device class; in this example we are discussing a disk drive.

The output of the pkg_drv(1m) will resemble the output below :-

input file: drv=your_driver
input file: conf=your_driver.conf
WARNING: pkg_drv: pkg/driver name exists in /etc/driver_aliases
Suggested Package Naming Conventions: 8 characters, with the first capitalized characters uniquely specifying the company (e.g. stock market ticker). The remaining characters specify the driver (e.g. SUNWcadd for a CAD driver from Sun Microsystems). The driver name must be unique across all Solaris platforms and releases.

## Building pkgmap from package prototype file.
## Processing pkginfo file.
## Attempting to volumize 8 entries in pkgmap.
part 1 -- 276 blocks, 29 entries
## Packaging one part.
/tmp/12546/PKG/pkgmap
/tmp/12546/PKG/pkginfo
/tmp/12546/PKG/reloc/boot/solaris/devicedb/master
/tmp/12546/PKG/install/copyright
/tmp/12546/PKG/install/depend
/tmp/12546/PKG/install/i.master
/tmp/12546/PKG/reloc/kernel/drv/your_driver
/tmp/12546/PKG/reloc/kernel/drv/your_driver.conf
/tmp/12546/PKG/install/postinstall
/tmp/12546/PKG/install/postremove
/tmp/12546/PKG/install/r.master
## Validating control scripts.
## Packaging complete.
output pkg: See package directory PKG in /tmp/12546
pkg_drv: 2 warnings 0 errors


bash-3.2# find /tmp/12546
/tmp/12546
/tmp/12546/PKG
/tmp/12546/PKG/pkgmap
/tmp/12546/PKG/pkginfo
/tmp/12546/PKG/reloc
/tmp/12546/PKG/reloc/boot
/tmp/12546/PKG/reloc/boot/solaris
/tmp/12546/PKG/reloc/boot/solaris/devicedb
/tmp/12546/PKG/reloc/boot/solaris/devicedb/master
/tmp/12546/PKG/reloc/kernel
/tmp/12546/PKG/reloc/kernel/drv
/tmp/12546/PKG/reloc/kernel/drv/your_driver
/tmp/12546/PKG/reloc/kernel/drv/your_driver.conf
/tmp/12546/PKG/install
/tmp/12546/PKG/install/copyright
/tmp/12546/PKG/install/depend
/tmp/12546/PKG/install/i.master
/tmp/12546/PKG/install/postinstall
/tmp/12546/PKG/install/postremove
/tmp/12546/PKG/install/r.master

Copy the following files from '/tmp/12546' :-

# cd /var/tmp/your_driver.5.10
# cp /tmp/12546/PKG/pkgmap .
# cp /tmp/12546/PKG/pkginfo .
# cp /tmp/12546/PKG/install/postinstall .
# cp /tmp/12546/PKG/install/postremove .
# cp /tmp/12546/PKG/install/copyright .

You can run 'pkgproto' command or make a prototype file manually :

bash-3.2# cat > prototype
i copyright
i postremove
i postinstall
i pkginfo
d none kernel 0755 root sys
d none kernel/drv 0755 root sys
d none kernel/drv/amd64 0755 root sys
f none kernel/drv/amd64/your_driver 0644 root sys
f none kernel/drv/your_driver 0644 root sys
f none kernel/drv/your_driver.conf 0644 root sys

Make sure you include both the 32-bit and 64-bit binaries of your driver. Once this is done, we construct the package again so that it includes the 64-bit binary of the driver.

# cd /var/tmp/your_driver.5.10
# pkgmk -r . -d /tmp

This will create the 'PKG' directory under /tmp and that's where the package is. For example :-

bash-3.2# pkgmk -r . -d /tmp
## Building pkgmap from package prototype file.
## Processing pkginfo file.
## Attempting to volumize 6 entries in pkgmap.
part 1 -- 444 blocks, 23 entries
## Packaging one part.
/tmp/PKG/pkgmap
/tmp/PKG/pkginfo
/tmp/PKG/install/copyright
/tmp/PKG/reloc/kernel/drv/amd64/your_driver
/tmp/PKG/reloc/kernel/drv/your_driver
/tmp/PKG/reloc/kernel/drv/your_driver.conf
/tmp/PKG/install/postinstall
/tmp/PKG/install/postremove
## Validating control scripts.
## Packaging complete.
bash-3.2#

Do the following to repack the package into the DU (diskette) layout :-

# cd /tmp
# find PKG -print | cpio -o > /tmp/pkg_of_your_driver
# compress /tmp/pkg_of_your_driver
# cd /var/tmp/your_driver.5.10/PKG
# cp /tmp/pkg_of_your_driver.Z PKG/DU/sol_210/i86pc/Product/your_driver.Z

For Solaris Nevada

Repeat the same steps as we did for Solaris 10 except for following things :-

  • Create a new directory '/var/tmp/your_driver.5.11' since you are working on Solaris Nevada. Make sure the pkg_drv(1m) command is run with '-r 5.11'.

  • When copying your_driver.Z to the DU, make sure you change 'sol_210' to 'sol_211' in the path 'PKG/DU/sol_210/i86pc/Product/your_driver.Z'.

Once you have created ITU for Solaris 10 and Nevada, we will bundle them in one DVD/CD (or ISO file). In the directories '/var/tmp/your_driver.5.11' and '/var/tmp/your_driver.5.10', you will find a directory called 'PKG'. You must copy the files under 'PKG' to one directory in order to bundle them together.

# mkdir -p /var/tmp/YOUR_DRIVER-DU
# cd /var/tmp/YOUR_DRIVER-DU
# cp -rf /var/tmp/your_driver.5.11/PKG/* .
# cp -rf /var/tmp/your_driver.5.10/PKG/* .


Please run the following command to make an ISO file from the directory /var/tmp/YOUR_DRIVER-DU :

# mkisofs -o your_driver.iso -r /var/tmp/YOUR_DRIVER-DU

This will create an ISO file 'your_driver.iso' and a DVD/CD can be burned by running the following command line at the prompt :-

# cdrw -i /var/tmp/YOUR_DRIVER-DU/your_driver.iso

In order to install Solaris on boot drives, use the Solaris Installer DVD and choose option '5' (Apply Driver Updates). Kindly follow the instructions when prompted.

The other way is to bundle the device driver into the Solaris bootable media itself, or into a network installation image. Kindly follow the instructions described at this link. It describes how to pack/unpack the Solaris miniroot in order to make changes to Solaris bootable media.

Driver Binding in Solaris

Driver binding in Solaris is not so easy to understand. Solaris binds a driver based on precedence. This precedence list is maintained in the 'compatible' property of the device node. The two functions responsible for creating the 'compatible' property and for finding the correct driver binding are add_compatible() and ddi_compatible_driver_major() respectively.

The responsibility of the add_compatible() function is to create the 'compatible' property for driver binding in the order described below. For a PCI card, the precedence is as follows :-

 *   pciVVVV,DDDD.SSSS.ssss.RR     (0)
 *   pciVVVV,DDDD.SSSS.ssss        (1)
 *   pciSSSS,ssss                  (2)
 *   pciVVVV,DDDD.RR               (3)
 *   pciVVVV,DDDD                  (4)
 *   pciclass,CCSSPP               (5)
 *   pciclass,CCSS                 (6)

For a PCI Express card, the precedence looks like this :

 *   pciexVVVV,DDDD.SSSS.ssss.RR   (0)
 *   pciexVVVV,DDDD.SSSS.ssss      (1)
 *   pciexVVVV,DDDD.RR             (2)
 *   pciexVVVV,DDDD                (3)
 *   pciexclass,CCSSPP             (4)
 *   pciexclass,CCSS               (5)
 *   pciVVVV,DDDD.SSSS.ssss.RR     (6)
 *   pciVVVV,DDDD.SSSS.ssss        (7)
 *   pciSSSS,ssss                  (8)
 *   pciVVVV,DDDD.RR               (9)
 *   pciVVVV,DDDD                  (10)
 *   pciclass,CCSSPP               (11)
 *   pciclass,CCSS                 (12)

RR = Revision number
CC = Class code, SS = Subclass code, PP = Programming interface (in the pciclass/pciexclass entries)
(0) = the highest precedence
(12) = the lowest precedence ((6) in the PCI list).
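To make the PCI precedence list concrete, here is a small user-space sketch that builds the seven names in that order and then picks the highest-precedence name for which an alias exists. It is not the add_compatible() source: `pci_ids_t`, `build_pci_compatible()` and `first_match()` are hypothetical names, the unpadded hex formatting of the IDs follows the "pci1002,4379.1025.10a.80" example shown later in the mdb output, and the zero-padded class format and the class code used below are assumptions.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

#define NCOMPAT		7	/* entries (0)..(6) of the PCI list */
#define COMPAT_LEN	40

typedef struct pci_ids {
	unsigned vid, did;		/* VVVV, DDDD */
	unsigned subvid, subdid;	/* SSSS, ssss */
	unsigned rev;			/* RR */
	unsigned class_code;		/* 24 bits: class, subclass, prog-if */
} pci_ids_t;

/* Fill 'out' with the compatible names, highest precedence first. */
static void
build_pci_compatible(const pci_ids_t *p, char out[NCOMPAT][COMPAT_LEN])
{
	unsigned v = p->vid & 0xffff, d = p->did & 0xffff;
	unsigned sv = p->subvid & 0xffff, sd = p->subdid & 0xffff;
	unsigned r = p->rev & 0xff, c = p->class_code & 0xffffff;

	snprintf(out[0], COMPAT_LEN, "pci%x,%x.%x.%x.%x", v, d, sv, sd, r);
	snprintf(out[1], COMPAT_LEN, "pci%x,%x.%x.%x", v, d, sv, sd);
	snprintf(out[2], COMPAT_LEN, "pci%x,%x", sv, sd);
	snprintf(out[3], COMPAT_LEN, "pci%x,%x.%x", v, d, r);
	snprintf(out[4], COMPAT_LEN, "pci%x,%x", v, d);
	snprintf(out[5], COMPAT_LEN, "pciclass,%06x", c);
	snprintf(out[6], COMPAT_LEN, "pciclass,%04x", c >> 8);
}

/* Return the index of the first name that has an alias, or -1. */
static int
first_match(char list[NCOMPAT][COMPAT_LEN], const char *aliases[],
    int naliases)
{
	for (int i = 0; i < NCOMPAT; i++)
		for (int j = 0; j < naliases; j++)
			if (strcmp(list[i], aliases[j]) == 0)
				return (i);
	return (-1);
}
```

With aliases for both "pciclass,0101" and "pci1002,4379", the lookup stops at entry (4), because the vendor/device name comes before the class names in the list.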

You can get the 'compatible' property by running the 'prtconf -vp' command. If Solaris fails to find a binding using the 'compatible' property, it then tries the 'nodename', which is constructed from the Subsystem-vendor-id (SSSS) and Subsystem-device-id (ssss) of the device. The PCI-IDs we have been looking at are embedded in the PCI config space of the device.

Device drivers and device firmware must make sure that proper PCI-IDs are chosen to avoid conflicts with existing PCI-IDs. If your device is a PCI Express card, then you must add PCI-IDs of the form 'pciexVVVV,DDDD.SSSS' to /etc/driver_aliases, either directly or via the add_drv(1m) or pkg_drv(1m) command.

(2008-04-19 13:24:57.0)

20080306 Thursday March 06, 2008

Solaris APIC implementation with respect to MSI/MSI-x interrupts
Here's some basic information on APIC before we dive into Solaris details; if you want more detail on APIC, you can refer to this Wiki.  The Solaris details are based on Solaris Nevada Build 84.

What's Local APIC 

The Local APIC (LAPIC) is part of the CPU chip. It (a) provides a mechanism for generating/accepting interrupts, (b) contains a timer, (c) manages all external interrupts for the processor and (d) accepts and generates inter-processor interrupts (IPIs).

What's IOAPIC

This is a separate chip that is wired to the local APIC so that it can forward interrupts to the appropriate CPU (and to local APIC). 

What's Local APIC Table 

Interrupt vectors are numbered 0x00 through 0xFF in the APIC, and 0x00...0x1F are reserved for exceptions. The interrupt vectors in the range 0x20...0xFF are available for programming interrupts in the APIC. Like the Local APIC, the IOAPIC assigns a priority to an interrupt based on its vector number: it uses the top 4 bits of the vector number as the priority and ignores the lower 4 bits. For example, if the vector number is 0x3F then the priority is 0x3. In Solaris, this priority mask is represented by APIC_IPL_MASK (0xF0) and the vector mask by APIC_VECTOR_MASK (0x0F).

Since we can't use the vector range 0x00...0x1F, Solaris defines APIC_BASE_VECT (0x20) as the base vector and APIC_MAX_VECTOR (0xFF) as the highest vector in the local APIC. APIC_AVAIL_VECTOR is calculated with this formula :-

APIC_MAX_VECTOR+1-APIC_BASE_VECT, which translates to (0xFF+1-0x20), i.e. 224 vectors in decimal.

Note that vectors are grouped in 16 priority groups and each group has 0x10 vectors. The 16 vectors in a group share the same priority.
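The vector arithmetic above can be checked with a few lines of C. The macro names and the values of the masks and vector limits are the ones quoted in the text; the values 4 for APIC_IPL_SHIFT and 0x10 for APIC_VECTOR_PER_IPL are inferred from the "top 4 bits" and "0x10 vectors per group" description, and the helpers `vector_priority()` and `vector_in_group()` are hypothetical.

```c
#include <assert.h>

#define APIC_BASE_VECT		0x20	/* first usable vector */
#define APIC_MAX_VECTOR		0xff	/* last vector */
#define APIC_AVAIL_VECTOR	(APIC_MAX_VECTOR + 1 - APIC_BASE_VECT)
#define APIC_IPL_MASK		0xf0	/* top 4 bits: priority group */
#define APIC_VECTOR_MASK	0x0f	/* low 4 bits: vector in group */
#define APIC_IPL_SHIFT		4	/* inferred from the masks */
#define APIC_VECTOR_PER_IPL	0x10	/* inferred: 16 vectors per group */

/* Priority group of a vector, e.g. 0x3f -> 0x3. */
static unsigned
vector_priority(unsigned vector)
{
	return ((vector & APIC_IPL_MASK) >> APIC_IPL_SHIFT);
}

/* Position of a vector inside its priority group, e.g. 0x3f -> 0xf. */
static unsigned
vector_in_group(unsigned vector)
{
	return (vector & APIC_VECTOR_MASK);
}
```

224 available vectors divided into groups of 16 leaves 14 usable priority groups, which is the size of the apic_vectortoipl[] array described further down.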

APIC Data Structures in Solaris

Here is the big picture on how the various APIC data structures are related to each other. These data structures are described below :-




apic_irq_table[] - Holds all IRQ entries. Each entry is of type apic_irq_t and the total size of the table is APIC_MAX_VECTOR + 1. Note that IRQ has no meaning with respect to MSI/MSI-x.

A typical apic_irq_t entry in apic_irq_table[] looks like this :-

> ::interrupts
IRQ  Vect IPL Bus    Trg Type   CPU Share APIC/INT# ISR(s)
22   0x61 6   PCI    Lvl Fixed  1   2     0x0/0x16  bge_intr, ata_intr

> apic_irq_table+(0t22*8)/J
apic_irq_table+0xb0:            fffffffec10d7f38

> fffffffec10d7f38::print apic_irq_t
{
    airq_mps_intr_index = 0xfffd
    airq_intin_no = 0x16                 // set since it's FIXED type interrupt.
    airq_ioapicindex = 0
    airq_dip = 0xfffffffec01fd9c0    // dev info
    airq_major = 0xca
    airq_rdt_entry = 0xa061
    airq_cpu = 0x1
    airq_temp_cpu = 0x1
    airq_vector = 0x61    // note that it matches with ::interrupts output
    airq_share = 0x2       // two interrupts are sharing the same IRQ and vector
    airq_share_id = 0
    airq_ipl = 0x6         // IPL
    airq_iflag = {
        intr_po = 0x3
        intr_el = 0x3
        bustype = 0xd
    }
    airq_origirq = 0xa
    airq_busy = 0
    airq_next = 0
}
> 0xfffffffec01fd9c0::print 'struct dev_info' ! grep name
    devi_binding_name = 0xfffffffec01fcf88 "pci-ide"
    devi_node_name = 0xfffffffec01fcf88 "pci-ide"
    devi_compat_names = 0xfffffffec0206940 "pci1002,4379.1025.10a.80"
    devi_rebinding_name = 0
>

apic_ipltopri[] - This array maps a Solaris IPL to an APIC priority. For example :-

> apic_ipltopri::print
[ 0x10, 0x20, 0x20, 0x20, 0x30, 0x50, 0x70, 0x80, 0x80, 0x80, 0x90, 0xa0, 0xb0, 0xc0, 0xd0,
0xf0, 0 ]
>

Note the order of priority assignment. Higher vector numbers are assigned to higher IPLs. Also note that 0x20 is given to indexes 1, 2 and 3, which means that IPLs 1, 2 and 3 share the same vector range 0x20...0x2F.

And apic_ipltopri[] is declared as :- 

uchar_t apic_ipltopri[MAXIPL + 1];      /* unix ipl to apic pri */

apic_vectortoipl[] - This array is a bit complex. Its main purpose is to initialize the apic_ipltopri[] array.

apic_init()
{
        [.]
        apic_ipltopri[0] = APIC_VECTOR_PER_IPL; /* leave 0 for idle */
        for (i = 0; i < (APIC_AVAIL_VECTOR / APIC_VECTOR_PER_IPL); i++) {
                if ((i < ((APIC_AVAIL_VECTOR / APIC_VECTOR_PER_IPL) - 1)) &&
                    (apic_vectortoipl[i + 1] == apic_vectortoipl[i]))
                        /* get to highest vector at the same ipl */
                        continue;
                for (; j <= apic_vectortoipl[i]; j++) {
                        apic_ipltopri[j] = (i << APIC_IPL_SHIFT) +
                            APIC_BASE_VECT;
                }
        }

        [.]

}

uchar_t apic_vectortoipl[APIC_AVAIL_VECTOR / APIC_VECTOR_PER_IPL] = {
        3, 4, 5, 5, 6, 6, 9, 10, 11, 12, 13, 14, 15, 15
};

Note that IPL 5 shares the vector range 0x40...0x5F (or 0x20...0x3F as passed to intr_enter, an optimization) and that's why vector indexes 2 and 3 have IPL 5. Similarly, vector indexes 4 and 5 have IPL 6 (0x60...0x7F, or 0x40...0x5F as passed to intr_enter).

 *      IPL             Vector range.           as passed to intr_enter
 *      0               none.
 *      1,2,3           0x20-0x2f               0x0-0xf
 *      4               0x30-0x3f               0x10-0x1f
 *      5               0x40-0x5f               0x20-0x3f
 *      6               0x60-0x7f               0x40-0x5f
 *      7,8,9           0x80-0x8f               0x60-0x6f
 *      10              0x90-0x9f               0x70-0x7f
 *      11              0xa0-0xaf               0x80-0x8f
 *      ...             ...
 *      15              0xe0-0xef               0xc0-0xcf
 *      15              0xf0-0xff               0xd0-0xdf
 */
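The loop quoted from apic_init() can be replayed in user space to see that it really produces the apic_ipltopri[] contents printed by mdb earlier. This is a sketch, not the kernel source: `build_ipltopri()` is a hypothetical wrapper, MAXIPL = 16 is inferred from the 17 elements mdb printed, APIC_IPL_SHIFT = 4 is inferred from the priority masks, and starting `j` at 1 is an assumption because that part of apic_init() is elided above.

```c
#include <assert.h>
#include <string.h>

#define MAXIPL			16	/* inferred: mdb printed 17 elements */
#define APIC_BASE_VECT		0x20
#define APIC_AVAIL_VECTOR	(0xff + 1 - APIC_BASE_VECT)
#define APIC_VECTOR_PER_IPL	0x10
#define APIC_IPL_SHIFT		4	/* inferred from APIC_IPL_MASK */
#define NGROUPS			(APIC_AVAIL_VECTOR / APIC_VECTOR_PER_IPL)

/* One IPL per group of 16 vectors, starting at vector 0x20. */
static const unsigned char apic_vectortoipl[NGROUPS] = {
	3, 4, 5, 5, 6, 6, 9, 10, 11, 12, 13, 14, 15, 15
};

static void
build_ipltopri(unsigned char ipltopri[MAXIPL + 1])
{
	int i, j = 1;	/* assumption: IPL 0 is handled separately */

	memset(ipltopri, 0, MAXIPL + 1);
	ipltopri[0] = APIC_VECTOR_PER_IPL;	/* leave 0 for idle */
	for (i = 0; i < NGROUPS; i++) {
		/* skip to the highest vector group at the same IPL */
		if (i < NGROUPS - 1 &&
		    apic_vectortoipl[i + 1] == apic_vectortoipl[i])
			continue;
		/* every IPL up to this group's IPL gets this group's base */
		for (; j <= apic_vectortoipl[i]; j++)
			ipltopri[j] = (unsigned char)
			    ((i << APIC_IPL_SHIFT) + APIC_BASE_VECT);
	}
}
```

The `continue` is what makes IPL 5 map to 0x50 rather than 0x40, and IPL 6 to 0x70 rather than 0x60: for an IPL that owns two groups, the task priority is taken from the higher group so that both are masked.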

apic_vector_to_irq[] - This array holds the IRQ number for a given vector number. If an element of this array contains APIC_RESV_IRQ (0xFE), the vector is free and can be allocated. The apic_navail_vector() function checks this array to figure out how many vectors are available.

Here is an example of how IPL to vector priority is mapped in Solaris :-

Let's say we got a network interrupt at IPL 6  (ath - wifi interrupt) with vector number 0x60 (as shown above in the ::interrupts output).  Solaris will now block all interrupts at and below IPL 6, which is done by the apic_intr_enter() function. The caller of this function actually subtracts 0x20 (APIC_BASE_VECT) from the vector number. This is done as an optimization, but let's come to the point: the apic_ipls[] array is used to get the IPL which will be programmed in the APIC register. So we first get nipl as

         nipl = apic_ipls[vector];      // vector is 0x40 not 0x60 as mentioned above and nipl will be 0x6
        *vectorp = irq = apic_vector_to_irq[vector + APIC_BASE_VECT];      // This is done to get actual vector and irq.

and then this statement blocks all the interrupts at and below the vector priority (or IPL).

        apicadr[APIC_TASK_REG] = apic_ipltopri[nipl];

So we write 0x70 to the APIC task register to block interrupts. Note that Solaris uses the range 0x60...0x7F for IPL 6 :-

*      IPL          Vector range.           as passed to apic_intr_enter()
*      6               0x60-0x7f               0x40-0x5f

and it does not matter whether you write 0x70 or 0x7F as they both do the same work, which is to block interrupts at IPL 6 or below.
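Why 0x70 and 0x7F behave the same can be shown with a tiny model of the task priority comparison. This is a sketch under assumptions: `ipl_of()` and `tpr_blocks()` are hypothetical helpers, the lookup through apic_vectortoipl[] stands in for the apic_ipls[] array (whose real initialization is not shown in this post), and the rule that only the top 4 bits of the task register take part in the comparison is my reading of how the priority groups described above work.

```c
#include <assert.h>
#include <stdbool.h>

#define APIC_BASE_VECT	0x20
#define APIC_IPL_SHIFT	4	/* inferred from APIC_IPL_MASK (0xF0) */

/* Same table as apic_vectortoipl[]: one IPL per group of 16 vectors. */
static const unsigned char vectortoipl[14] = {
	3, 4, 5, 5, 6, 6, 9, 10, 11, 12, 13, 14, 15, 15
};

/* IPL of a vector as passed to apic_intr_enter(), i.e. minus 0x20. */
static unsigned
ipl_of(unsigned vector_minus_base)
{
	return (vectortoipl[vector_minus_base >> APIC_IPL_SHIFT]);
}

/*
 * A vector is held back when its priority group (top 4 bits) is less
 * than or equal to the group written to the task priority register.
 * The low 4 bits of the register value do not matter here.
 */
static bool
tpr_blocks(unsigned tpr, unsigned vector)
{
	return ((vector >> APIC_IPL_SHIFT) <= (tpr >> APIC_IPL_SHIFT));
}
```

For the wifi interrupt, `ipl_of(0x60 - APIC_BASE_VECT)` yields 6, apic_ipltopri[6] is 0x70, and with 0x70 in the task register every vector up to 0x7F is held back while 0x80 and above (IPL 7 and higher) still get through.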

Solaris x86 Interrupt Handling 

Now that we have glanced through the data structures involved, let's look at how Solaris x86 handles interrupts. I prefer to describe interrupt handling before describing how interrupts are allocated because I feel interrupt handling is easier to understand.

Let's first go through how Solaris x86 is designed in terms of psm ops.  For example, PCI Express has its own psm_ops, which is apic_ops, and PCI has its own psm_ops, which is uppc_ops. In fact xVM (the Xen-based hypervisor) has its own psm_ops called xen_psm_ops. It is psm_install() that is responsible for installing a psm in the Solaris x86 world.

apic_probe_common() is what gets called when psm_install() jumps into psm_probe() for each psm_ops. apic_probe_common() does many things, one of them being mapping 'apicadr[]' (you have seen this before; I referred to it when setting the APIC priority, i.e. the task register). The apic_cpus[] array also gets initialized from ACPI, i.e. by acpi_probe(), because the ACPI tables have all the information such as the local APIC CPU id, version etc.

Now let's see what happens when the local APIC generates an interrupt. The interrupt could come from the IOAPIC or be an MSI/MSI-x generated interrupt (in-band message). Solaris calls cmnint() or _interrupt(). These are the same and call do_interrupt() once regs is set up. do_interrupt() first sets the PIL so that the CPU does not get any interrupt at or below that PIL. Raising the priority of the CPU is done through the setlvl function pointer. This pointer is set to the appropriate psm_ops's psm_intr_enter, and in our case it will be apic_intr_enter(). Next comes the dispatching part, which is done by calling switch_sp_and_call() once the stack of the interrupt thread is set up. Recall that Solaris handles interrupts in thread context if the PIL is at or below LOCK_LEVEL (0xa). High-level interrupts (0xb...0xf) are handled on the current thread's stack.

switch_sp_and_call() can dispatch three types of interrupts -- (a) software interrupts (b) high level interrupts and (c) normal device interrupts.

In our example, we have been looking at the wifi interrupt, which is case (c) and maps to the dispatch_hardint() routine. dispatch_hardint() calls av_dispatch_autovect() after enabling interrupts. Now that we are touching the av_dispatch_autovect() routine, I must explain what the autovect[] array is. If you remember add_avintr(), which is responsible for registering a hardware interrupt handler, then I think you can skip this part. autovect[] has MAX_VECT (256) elements and each element is of type 'struct av_head'. The first pointer in 'struct av_head' points to a 'struct autovec', and the autovec structure has all the information about the interrupt handler, the arguments passed to it, the priority level etc. Note that more than one interrupt handler can share the same vector, and they are linked through 'av_link' in 'struct autovec'. For example :-

> ::interrupts
IRQ  Vect IPL Bus    Trg Type   CPU Share APIC/INT# ISR(s)
22   0x61 6   PCI    Lvl Fixed  1   2     0x0/0x16  bge_intr, ata_intr

> ::sizeof 'struct av_head'
sizeof (struct av_head) = 0x10

> autovect+(0x10*0t22)=J                 // Take the IRQ and index into autovect[] array.
                fffffffffbc52ba0

> fffffffffbc52ba0::print 'struct av_head'
{
    avh_link = 0xfffffffec50d2cc0
    avh_hi_pri = 0x6        // take a look at bge_intr() and its priority below
    avh_lo_pri = 0x5        // take a look at ata_intr() and its priority below
}

> 0xfffffffec50d2cc0::print 'struct autovec'
{
    av_link = 0xfffffffec10d2f40
    av_vector = bge_intr
    av_intarg1 = 0xfffffffec50d5000
    av_intarg2 = 0
    av_ticksp = 0xfffffffec506ae20
    av_prilevel = 0x6
    av_intr_id = 0xfffffffec537a078
    av_dip = 0xfffffffec01f8400
}

> 0xfffffffec10d2f40::print 'struct autovec'
{
    av_link = 0
    av_vector = ata_intr
    av_intarg1 = 0xfffffffec00bc8c0
    av_intarg2 = 0
    av_ticksp = 0xfffffffec0528898
    av_prilevel = 0x5
    av_intr_id = 0xfffffffec10cbe78
    av_dip = 0xfffffffec01fd9c0
}
>
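The chain seen in the mdb output (bge_intr followed by ata_intr on the same IRQ) can be modelled in user space as a linked list walk. This is only a simplified sketch of what a dispatcher for shared vectors does, not the av_dispatch_autovect() source: the structures keep just the fields needed here, there is no locking or timing, and `dispatch_vector()`, `fake_bge_intr()` and `fake_ata_intr()` are hypothetical names.

```c
#include <assert.h>
#include <stddef.h>

/* Handler type: returns nonzero when the device claimed the interrupt. */
typedef unsigned (*avfunc)(void *arg1, void *arg2);

struct autovec {
	struct autovec *av_link;	/* next handler sharing this vector */
	avfunc av_vector;		/* the interrupt handler itself */
	void *av_intarg1;
	void *av_intarg2;
	unsigned av_prilevel;
};

struct av_head {
	struct autovec *avh_link;	/* first handler in the chain */
	unsigned short avh_hi_pri;
	unsigned short avh_lo_pri;
};

/* Call every handler on the chain; return how many claimed the interrupt. */
static unsigned
dispatch_vector(struct av_head *head)
{
	unsigned claimed = 0;

	for (struct autovec *av = head->avh_link; av != NULL;
	    av = av->av_link) {
		if (av->av_vector == NULL)
			continue;
		if (av->av_vector(av->av_intarg1, av->av_intarg2) != 0)
			claimed++;
	}
	return (claimed);
}

/* Hypothetical handlers: arg1 points at a hit counter. */
static unsigned
fake_bge_intr(void *arg1, void *arg2)
{
	(void) arg2;
	(*(int *)arg1)++;
	return (1);	/* "my device raised this interrupt" */
}

static unsigned
fake_ata_intr(void *arg1, void *arg2)
{
	(void) arg2;
	(*(int *)arg1)++;
	return (0);	/* "not mine" */
}
```

Because the line is shared, both handlers run on every interrupt even though only one device raised it, which is why each handler has to check its own device before claiming the interrupt.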

Here's an example which we have been discussing :-

bash-3.00# dtrace -n av_dispatch_autovect:entry'/`autovect[args[0]].avh_link->av_vector/{@[args[0]]=count(); printf("%a, %x", `autovect[args[0]].avh_link->av_vector, args[0])}'

  1   2391       av_dispatch_autovect:entry ath`ath_intr, 13
  1   2391       av_dispatch_autovect:entry ath`ath_intr, 13
  1   2391       av_dispatch_autovect:entry ath`ath_intr, 13
  1   2391       av_dispatch_autovect:entry ath`ath_intr, 13
 

There is a very interesting blog by Anish at this link on APIC and Solaris x86 interrupt handling.
 

How does the Solaris APIC implementation allocate interrupts

Now that we have looked at how APIC is structured in Solaris x86 and how interrupts are handled, let's look at how interrupts are allocated. There are three types of interrupts --  DDI_INTR_TYPE_FIXED, DDI_INTR_TYPE_MSI and DDI_INTR_TYPE_MSIX, in the order in which they evolved. The Solaris DDI routine ddi_intr_get_supported_types() can be called to retrieve the types of interrupt supported by the bus.

In the case of MSI, apic_alloc_msi_vectors() gets called, and in the case of MSI-x, apic_alloc_msix_vectors() gets called to allocate the appropriate number of interrupt vectors. Note that MSI supports 32 vectors per device function and MSI-x supports 2048 vectors per device function; however, in Solaris x86 we only support 2 MSI-x interrupt vectors per device (the reason I studied APIC and MSI-x). On SPARC, Solaris supports far more MSI-x interrupts, configured by the #msix-request property in DDI. The hard limit is determined by the i_ddi_get_msix_alloc_limit() function; however, even on SPARC it seems we limit it to 8.

msix_alloc_limit = MAX(DDI_MAX_MSIX_ALLOC, ddi_msix_alloc_limit);

/* Default number of MSI-X resources to allocate */
#define DDI_DEFAULT_MSIX_ALLOC  2

/* Maximum number of MSI-X resources to allocate */
#define DDI_MAX_MSIX_ALLOC      8

These limits will change when the Interrupt Resource Management (IRM) framework is integrated into Solaris.

Anyway, let's get back to the topic. Depending upon the interrupt type and the bus intr ops, Solaris will jump to the appropriate interrupt ops. In our case, we get into pci_common_intr_ops() from ddi_intr_alloc(9F) to allocate the interrupts with cmd DDI_INTROP_ALLOC. We will not get into FIXED type interrupts as they are hard wired via the IOAPIC and fairly easy (I suppose). It's psm_intr_ops which gets into action with cmd PSM_INTR_OP_ALLOC_VECTORS, and we land up in apic_intr_ops().

apic_intr_ops
{
        [.]
        case PSM_INTR_OP_ALLOC_VECTORS:
                if (hdlp->ih_type == DDI_INTR_TYPE_MSI)
                        *result = apic_alloc_msi_vectors(dip, hdlp->ih_inum,
                            hdlp->ih_scratch1, hdlp->ih_pri,
                            (int)(uintptr_t)hdlp->ih_scratch2);
                else
                        *result = apic_alloc_msix_vectors(dip, hdlp->ih_inum,
                            hdlp->ih_scratch1, hdlp->ih_pri,
                            (int)(uintptr_t)hdlp->ih_scratch2);
                break;
                [.]
}


apic_alloc_msi_vectors() - This function allocates 'count' vectors for the device. 'count' has to be a power of 2 and the priority is passed by the caller. The first thing this function does is check whether we have enough vectors available at that priority to satisfy the request; this is done by the routine apic_navail_vector(). We then search for contiguous vectors, and the value returned by apic_find_multi_vectors() is our starting point. It seems MSI has this constraint of handing out contiguous vectors only. I don't know why.

The next step is to check whether we have enough IRQs in the apic_irq_table[]. This is done by the function apic_check_free_irqs(). If we succeed in finding enough IRQ entries in the table, apic_alloc_msi_vectors() proceeds to allocate the IRQ, which is done by apic_allocate_irq(). The IRQ number returned by this function is finally used to index into the autovect[] table to reach the appropriate vector. We will go into autovect[] again soon, but for now let's see how we select the CPU. The selection of the CPU for this IRQ is done by apic_bind_intr() for the first interrupt of the 'count' vectors, and subsequent vectors are bound to the same CPU. These steps are done in a loop 'count' times.
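
The contiguous-vector search described above can be sketched in user-space C. This is only an illustration, not the pcplusmp code: the `used[]` table and the function names are hypothetical, and any alignment requirement on the starting vector is not modeled. One likely reason for the constraint is that a multi-message MSI function has a single message data register and distinguishes its vectors by changing the low-order bits of that value, so the vectors have to form one block.

```c
#include <stdbool.h>

/* Returns true when count is a non-zero power of two. */
static bool is_power_of_two(unsigned count)
{
    return count != 0 && (count & (count - 1)) == 0;
}

/*
 * Hypothetical helper: return the index of the first run of `count`
 * contiguous free slots in used[lowest..highest], or -1 if there is
 * no such run or count is not a power of two.
 */
static int find_contiguous_free(const bool used[], int lowest, int highest,
    unsigned count)
{
    int run = 0;

    if (!is_power_of_two(count))
        return -1;

    for (int i = lowest; i <= highest; i++) {
        run = used[i] ? 0 : run + 1;      /* a used slot breaks the run */
        if ((unsigned)run == count)
            return i - (int)count + 1;    /* first slot of the run */
    }
    return -1;
}
```

A caller would pass the vector range that belongs to the requested priority as `lowest` and `highest`.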

Now that we have set up the IRQ in the apic_irq_table[] with priority, vector, target CPU etc., we are ready to enable the interrupt. BTW, all this is mostly done in the driver's attach(9E) entry point, usually in two phases within attach(9E): (i) add interrupts by allocating them and (ii) enable interrupts.

apic_alloc_msix_vectors() - This function does similar work to the MSI case, except that for each request in 'count' we allocate the vector (apart from allocating the IRQ entry in the apic_irq_table[]) and bind the interrupt to a CPU by calling apic_bind_intr(). MSI-x does not have the contiguous-vector limitation that MSI has. Vector allocation is done by the routine apic_allocate_vector(), which returns a free vector by walking the apic_vector_to_irq[] table and looking for an APIC_RESV_IRQ slot. The range is determined by the priority passed to it. For example, if the priority passed is 6, then the range would be

        highest = apic_ipltopri[ipl] + APIC_VECTOR_MASK;
        lowest = apic_ipltopri[ipl - 1] + APIC_VECTOR_PER_IPL;

        if (highest < lowest) /* Both ipl and ipl - 1 map to same pri */
                lowest -= APIC_VECTOR_PER_IPL;

highest is 0x7f (0x70 + 0x0f) and lowest would be 0x60 (0x50+0x10) and this matches with our observation in the beginning of the blog.
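
The range computation above can be checked with a small stand-alone C sketch. The two constants take the values used in the arithmetic above (0x0f and 0x10); the `vec_range` struct and the function name are hypothetical, and the two `apic_ipltopri[]` entries are passed in as plain arguments instead of being read from the real table.

```c
#define APIC_VECTOR_PER_IPL 0x10
#define APIC_VECTOR_MASK    0x0f

struct vec_range {
    int lowest;
    int highest;
};

/*
 * pri_ipl stands for apic_ipltopri[ipl] and pri_ipl_below for
 * apic_ipltopri[ipl - 1].
 */
static struct vec_range vector_range(int pri_ipl, int pri_ipl_below)
{
    struct vec_range r;

    r.highest = pri_ipl + APIC_VECTOR_MASK;
    r.lowest = pri_ipl_below + APIC_VECTOR_PER_IPL;

    if (r.highest < r.lowest)   /* both ipl and ipl - 1 map to same pri */
        r.lowest -= APIC_VECTOR_PER_IPL;

    return r;
}
```

With 0x70 and 0x50 as inputs this gives the 0x60 to 0x7f range quoted above, and the vector 96 (0x60) returned by apic_allocate_vector() in the trace further down falls at the start of it.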

A typical flow of this dance is as follows :-

  1  22557    apic_alloc_msix_vectors:entry name pciex8086,10a7, inum : 0, count : 2, pri :6
              pcplusmp`apic_intr_ops+0x114
              npe`pci_common_intr_ops+0x8f1
              npe`npe_intr_ops+0x21
              unix`i_ddi_intr_ops+0x54
              unix`i_ddi_intr_ops+0x54
              genunix`ddi_intr_alloc+0x263
              igb`igb_alloc_intrs_msix+0x134
              igb`igb_alloc_intrs+0x64
              igb`igb_attach+0xcb
              genunix`devi_attach+0x87

  1  22485         apic_navail_vector:entry name : pciex8086,10a7, pri 6
  1  22486        apic_navail_vector:return                31
  1  22547          apic_allocate_irq:entry        72
  1  22419         apic_find_free_irq:entry start :72, end : 253
  1  22417          apic_find_io_intr:entry        72
  1  22548         apic_allocate_irq:return                72
  1  22479       apic_allocate_vector:entry ipl : 6, irq: 72, pri: 1
  1  22480      apic_allocate_vector:return                96
  1  22473             apic_bind_intr:entry name : pciex8086,10a7, irq  72
  1  22474            apic_bind_intr:return                 0

Now let's talk about how the driver enables interrupts once they are allocated. Interrupts can be enabled as a block (more than one at once, with ddi_intr_block_enable(9F)) or by explicitly calling ddi_intr_enable(9F) for each interrupt; we will discuss ddi_intr_enable(9F) here. Once again we end up in pci_common_intr_ops() and call pci_enable_intr(), which mainly does two things :-

-  Translate the interrupt if needed. This is done by apic_introp_xlate(). If the interrupt is MSI or MSI-x, we call apic_setup_irq_table() if the IRQ entry in the apic_irq_table[] is not setup. In our example, we have already done this so apic_introp_xlate() just returns IRQ number from 'apic_vector_to_irq[airqp->airq_vector]'. airqp is an entry in the apic_irq_table[] which gets assigned by calling apic_find_irq().

-  Add the interrupt handler by calling add_avintr(). We have already touched on this routine in this blog, but it is worth mentioning at which point in the life cycle of setting up interrupts we bind an interrupt handler (ISR or Interrupt Service Routine) to a vector. The main task of add_avintr() is to insert the 'autovec' at the appropriate index and call insert_av(). The other, and most important, task is to program the interrupt, which is done by addspl(). addspl() is another function pointer from the family of setlvl, setspl etc. In the APIC case it will be apic_addspl(), which is just a wrapper over apic_addspl_common(). There are four arguments passed to it :-

apic_addspl_common(int irqno, int ipl, int min_ipl, int max_ipl)

We first get the pointer from apic_irq_table[] by indexing with irqno and check whether we need to upgrade the vector or just check the IPL in case this interrupt needs to be shared. Eventually we land up in apic_setup_io_intr(), which does the main task. In fact it is apic_rebind(), called from apic_setup_io_intr(), that binds an interrupt to a CPU. Since we are discussing MSI/MSI-x, once apic_rebind() has done its sanity checks it will call apic_pci_msi_enable_vector(). The following statements are what we use to program the interrupt :-

        /* MSI Address */
        msi_addr = (MSI_ADDR_HDR | (target_apic_id << MSI_ADDR_DEST_SHIFT));
        msi_addr |= ((MSI_ADDR_RH_FIXED << MSI_ADDR_RH_SHIFT) |
            (MSI_ADDR_DM_PHYSICAL << MSI_ADDR_DM_SHIFT));

        /* MSI Data: MSI is edge triggered according to spec */
        msi_data = ((MSI_DATA_TM_EDGE << MSI_DATA_TM_SHIFT) | vector);
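
To make the bit layout concrete, here is a stand-alone C sketch of the same composition. The constant values are my assumptions, taken from the usual x86 MSI message format (0xFEE in the top address bits, destination APIC ID starting at bit 12, redirection hint at bit 3, destination mode at bit 2, trigger mode at bit 15 of the data word); they are not copied from the Solaris headers, and the two helper function names are hypothetical.

```c
#include <stdint.h>

/* Assumed values, following the x86 MSI message layout. */
#define MSI_ADDR_HDR          0xFEE00000u
#define MSI_ADDR_DEST_SHIFT   12
#define MSI_ADDR_RH_FIXED     0x0u
#define MSI_ADDR_RH_SHIFT     3
#define MSI_ADDR_DM_PHYSICAL  0x0u
#define MSI_ADDR_DM_SHIFT     2
#define MSI_DATA_TM_EDGE      0x0u
#define MSI_DATA_TM_SHIFT     15

/* Build the MSI address for a given destination local APIC id. */
static uint32_t msi_address(uint32_t target_apic_id)
{
    uint32_t msi_addr;

    msi_addr = (MSI_ADDR_HDR | (target_apic_id << MSI_ADDR_DEST_SHIFT));
    msi_addr |= ((MSI_ADDR_RH_FIXED << MSI_ADDR_RH_SHIFT) |
        (MSI_ADDR_DM_PHYSICAL << MSI_ADDR_DM_SHIFT));
    return msi_addr;
}

/* Build the MSI data word: edge triggered, vector in the low byte. */
static uint32_t msi_data(uint32_t vector)
{
    return ((MSI_DATA_TM_EDGE << MSI_DATA_TM_SHIFT) | vector);
}
```

Under these assumptions, binding vector 0x60 to the CPU with APIC id 1 would program address 0xFEE01000 and data 0x60 into the device.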

apic_pci_msi_enable_mode() is also called from apic_rebind() to enable the interrupt once it's programmed. That's how per-vector masking is controlled I suppose.

Since we have touched on how we bind an interrupt to a CPU, I should also mention how Solaris selects the CPU to bind an interrupt to. The routine apic_bind_intr() is responsible for this, and the decision is based on the value of the tunable 'apic_intr_policy'. You can define three types of policy: (a) INTR_ROUND_ROBIN_WITH_AFFINITY - round robin and affinity based policy which returns the same CPU for the same dip (or device). This is the default policy. (b) INTR_LOWEST_PRIORITY - I don't know because it's not implemented, and (c) INTR_ROUND_ROBIN - select the CPU in round-robin fashion using the 'apic_next_bind_cpu' global variable. Choosing between INTR_ROUND_ROBIN_WITH_AFFINITY and INTR_ROUND_ROBIN may not be easy, but I think the decision should be based on throughput vs locality awareness.
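
The difference between the two implemented policies can be sketched in a few lines of user-space C. This is a simplified model and not the apic_bind_intr() code: the `binder` struct, the fixed-size device table and the function names are hypothetical, and INTR_LOWEST_PRIORITY simply falls back to round robin here because it is not modeled.

```c
enum intr_policy {
    INTR_ROUND_ROBIN_WITH_AFFINITY,
    INTR_LOWEST_PRIORITY,
    INTR_ROUND_ROBIN
};

#define MAX_DEVS 8

struct binder {
    enum intr_policy policy;
    int ncpus;
    int next_bind_cpu;           /* plays the role of apic_next_bind_cpu */
    const void *dev[MAX_DEVS];   /* devices that already have a CPU */
    int dev_cpu[MAX_DEVS];
    int ndevs;
};

/* Hand out CPUs 0, 1, ..., ncpus - 1, 0, 1, ... in turn. */
static int round_robin(struct binder *b)
{
    int cpu = b->next_bind_cpu;

    b->next_bind_cpu = (b->next_bind_cpu + 1) % b->ncpus;
    return cpu;
}

/* Pick a CPU for one interrupt of the device identified by dip. */
static int bind_intr(struct binder *b, const void *dip)
{
    if (b->policy == INTR_ROUND_ROBIN_WITH_AFFINITY) {
        /* Same device: reuse the CPU chosen earlier. */
        for (int i = 0; i < b->ndevs; i++) {
            if (b->dev[i] == dip)
                return b->dev_cpu[i];
        }
        int cpu = round_robin(b);
        if (b->ndevs < MAX_DEVS) {
            b->dev[b->ndevs] = dip;
            b->dev_cpu[b->ndevs] = cpu;
            b->ndevs++;
        }
        return cpu;
    }
    /* INTR_ROUND_ROBIN (and, in this sketch only, INTR_LOWEST_PRIORITY) */
    return round_robin(b);
}
```

With the affinity policy all interrupts of one device land on one CPU, which helps locality; with plain round robin they are spread over the CPUs, which may help throughput.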

(2008-03-06 11:50:32.0) Permalink

20080111 Friday January 11, 2008

xVM experience so far

I recently configured xVM on Solaris with HVM (hardware-assisted virtual machine) and PV (Paravirtualized) guest (domU) domains. I could easily install Solaris 10 Update 5 as an HVM domU, boot it, configure the network interface and assign an IP address. The plan is to have multiple domUs as a testbed running Solaris 10 and Solaris Nevada. This would cut down on machines, and sanity checks can be done quickly as I don't have to install/boot the OS every time. I can easily run functional tests, if not performance benchmarks. The performance of Solaris 10 as an HVM domain is not as good as that of Solaris Nevada (PV domU), especially when there is more than one VCPU, but I guess it's being worked on. I think the performance would improve drastically when we have PV (Paravirtualized) drivers for Solaris 10. I'll soon experiment with installing xVM on my laptop and configuring Windows XP as an HVM domain.

Here's a small demo describing my experience so far with xVM :-

For installing the Solaris PV domU, I used this sample script.

bash-3.2# cat snv.1.py
name = 'solaris-pv'
memory = '1024'
vcpus = 4
# for installation
disk = [ 'file:/var/tmp/solarisdvd.iso,6:cdrom,r', 'phy:/dev/zvol/dsk/snv-pool/vol,0,w' ]
on_poweroff = 'restart'
on_reboot = 'restart'
on_crash = 'preserve'

In 'disk', you will see 'file' and 'phy'; they specify what kind of media it is. Once you have specified the location in 'disk', you also specify the type of access, such as read (r) or write (w).

Once you run '# xm create script.py', you will see the OS installation screen. Once the installation was completed, I used a similar script but removed the solarisdvd entry from 'disk' (mentioned in the .py file).

name = 'solaris-pv'
memory = '1024'
vcpus = 4
disk = [ 'phy:/dev/zvol/dsk/snv-pool/vol,0,w' ]
on_poweroff = 'destroy'
on_reboot = 'restart'
on_crash = 'preserve'
vif = [ 'mac=0:14:4f:2:12:35, ip=10.5.63.98, bridge=nge1' ]

With the 'vif' property you can specify which network interface you want. You can also set the 'config/default-nic' property in the xvm/xend service if you want to override the NIC. Finally, once you have booted the guest domain, you will see the interface as rtls0. You can run 'dladm show-dev' to see whether the network interface is really configured, and run ifconfig(1m) to plumb the interface.

You can see the resources of each as follows.

bash-3.2# xm list
Name          ID   Mem VCPUs  State   Time(s)
Domain-0       0  4973     4  r-----   4019.6
S10U5HVM       8  2056     1  r-----     40.8
solaris-pv    10  1024     1  r-----      5.0



I also found following links to be very helpful as I learnt how to configure domU.
Write-up from Chris Beal
Write-up from mbrowarski

(2008-01-11 16:43:43.0) Permalink

20061010 Tuesday October 10, 2006

Multi-CPU Binding in Solaris

We are working on a framework which would allow processes/threads to have affinity to more than one CPU. The affinities could be divided into three categories -- (a) strong affinity (b) weak affinity and (c) negative affinity.

(a) strong affinity :- This type of affinity would allow processes/threads to run only on specified CPUs.

(b) weak affinity :- This type of affinity would allow processes/threads to run on its home lgroup or CPUs specified or any CPUs if it can't run on home lgroup/CPUs. The order is also followed in the same way when Solaris Dispatcher would choose a CPU.

(c) negative affinity :- This type of affinity would allow processes/threads to not run on the CPUs specified.

At present, only strong/negative affinity can change a thread's home lgroup, so on a NUMA aware machine users need to be more cautious. These affinities are stored in a bitmask of CPUs (cpuset_t). During the offline phase, the CPU is removed from the thread's bitmask, and if it happens to be the only CPU in the bitmask, we generate an event using contract fs so that application programs can take appropriate action when affinity is revoked during offline, or when a CPU goes out of a processor set.
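
The offline handling described above can be sketched as follows. This is a user-space illustration and not the prototype's code: a 64-bit integer stands in for cpuset_t, so CPU ids must be below 64, and the type and function names are hypothetical. The return value tells the caller when the last CPU has gone and the affinity has to be revoked (the point at which the real framework would generate the contract event).

```c
#include <stdbool.h>
#include <stdint.h>

typedef uint64_t cpumask_t;   /* hypothetical stand-in for cpuset_t */

static bool cpu_in_mask(cpumask_t mask, unsigned cpu)
{
    return ((mask >> cpu) & 1u) != 0;
}

/*
 * Remove `cpu` from the thread's affinity mask when it goes offline.
 * Returns true when the mask became empty, i.e. the affinity must be
 * revoked because there is no CPU left where the thread may run.
 */
static bool affinity_cpu_offline(cpumask_t *mask, unsigned cpu)
{
    if (!cpu_in_mask(*mask, cpu))
        return false;             /* thread had no affinity to this CPU */

    *mask &= ~((cpumask_t)1 << cpu);
    return *mask == 0;
}
```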

The boundaries laid down by CPU partitions will still be there, and Multi-CPU binding will not allow processes/threads to cross partitions (or processor sets).

Idle thread is also modified to accordingly look for work. Strong affinity threads can't be stolen if a thread doesn't have that CPU in its bitmask. Weak affinity threads can be stolen. Run queue balancing done by setbackdq() is done for all the affinities.

An example of it :-

bash-3.00# ./pbind -s 528-530 `pgrep aff`

bash-3.00# dtrace -s ./a.d ## D script capturing context switches.
CPU no. of times ran
529 197
528 208
530 210

bash-3.00# ./pbind -q `pgrep aff`
process id 3211: not bound
process id 3211: strong affinity to: 528-530

bash-3.00# psradm -f 529 528

bash-3.00# dtrace -s ./a.d ## D script capturing context switches.
CPU no. of times ran
530 255


If you were to offline CPU 530 as well, this would cause us to revoke the affinities, because this process had strong affinity and there wouldn't be any CPU left where it can run. The purpose is to allow offline (for DR or other FMA events). The same holds true for processor sets if a CPU is removed from the pset and it happens to be the last CPU in the thread's CPU bitmask.

We can preserve affinity to a CPU when a CPU is offlined so that when it is brought back users don't have to bother about finding a suitable CPU provided it's not the last CPU in its bitmask. I'm not sure whether it would be good or do we really want to do this. I do have a prototype based on that.

The above demo just shows what we are trying to achieve; it's in the prototyping stage.

(2006-10-10 16:09:17.0) Permalink

20061006 Friday October 06, 2006

CEC 2006...

Luckily I got the chance to attend CEC 2006, which was organized in San Francisco from 1st Oct to 4th Oct, this time as an attendee. Interestingly, we had a demo on Solaris 10 adoption, Sun Cluster 3.2 and ZFS which was very well received.

- Day one :- I attended a very good talk on 'Successes in High Performance Computing (HPC): Details of how Sun was successful at Tokyo Tech and others. Sun in the HPC space - New Systems, New Entry Points, New Margins' by Dean Russell and John Fragalla. This was very informative and it showed me how complex such an infrastructure can be to set up. Titech (Tokyo Institute of Technology) used 600+ Galaxy 4 machines to build this super-computer. It was done by the Sun CRS team.

I also attended 'Advanced techniques in the modern microprocessor design' which was okay.

The talk on 'Sun's New x64 Data Server' by Trungchau Ngo was on the Thumper architecture (x4500 machines) and it was informative for me. It demonstrated how cheap disks (250 GB SATA disks) could be combined with ZFS for RAID to get maximum performance at a lower price.

The last talk that I attended on day one was 'Introduction to Logical Domains (LDOMs)' and it was also good. Sun has sun4v architecture.

These topics were interesting to me as I think there is lot of potential in Storage and Virtualization.

- Day two: I could not attend many talks. I attended 'Demo LDOMs on Niagara' which was good. I also attended a talk on VMware and Xen which was very informative. Parallels have something similar where MacOS runs as the main OS and Solaris as a guest OS.

- Day three: I attended Bryan Cantrill's talk on Dtrace. Unfortunately I was working on a hot issue which required an immediate update.

During these three days, I also got chance to see the city and it's pretty good. You can find some photographs here.

On Sunday, Chandan and I biked 9-10 miles from the hotel to the town of Sausalito, which is a historic place.

(2006-10-06 09:51:58.0) Permalink

20060609 Friday June 09, 2006

Latency group (lgroup) in Solaris on NUMA aware machines

All of you would have heard about NUMA (Non-uniform-memory-access) machines. I'm going to describe how the memory latency groups (called lgroups in Solaris) are laid out. While working on the Multi-CPU binding project, I had to learn these aspects in order to implement choosing, for a thread, the lgroup with the least latency from its earlier home lgroup.

The figure below describes how the lgroup structures are laid out on SPARC based NUMA aware machines. The root lgroup (0) is the topmost level of the hierarchy and holds all the resource sets in the system. lgroup ids 1, 2 and 3 have four CPUs each (one system board) and are leaf nodes in this case. On SPARC, the remote latency from lgroup 2 to 1 or to 3 is the same, i.e. they are equidistant, so there is just a local and a remote latency. In Solaris, we have something called an lgroup partition load (lpl_t) which represents the leaf nodes having CPUs and memory. Each cpu_t (CPU structure) has a cpu_lpl. lpl's are also used when CPU partitions are created (processor sets are the best example). There's a global table of lgroups called lgrp_table[]. Each partition has its lpl's in cp_lgrploads[] (cpupart_t). Both tables are indexed by lgroup id. A thread is homed to an lpl within the CPU partition.


On a 4-way amd64, the lgroup representation is quite interesting, as apart from local memory we have remote memory that is one or two hops away. For example, psrinfo(1M) revealed this :-
0 on-line since 06/09/2006 06:49:25
1 on-line since 06/09/2006 06:49:31
2 on-line since 06/09/2006 06:49:33
3 on-line since 06/09/2006 06:49:35

Each CPU is a leaf lgroup. The diagram below explains this very well. In this kind of configuration, we have non-leaf nodes 5, 6, 7 and 8 representing resource sets which are one hop away. For example, lgroup id 5 holds 1,2,3 (local and one hop away from lgroup 1). The root lgroup id (0) holds everything.

On SPARC we have two levels of memory hierarchy, whereas a 4-way amd64 has three levels. An 8-way amd64 should have four levels of memory hierarchy. The scheduling of a thread starts from its home lgroup and goes up the hierarchy. For example, if the home of a thread (t->t_lpl) is lgroup 1 (CPU 0 is the resource set), then we first look at CPU 0, and if the thread can't run there, we look at the parent of lgroup 1 (lpl_parent), which is lgroup 5 having 1,2,3 as resource sets. The same is true when the idle thread steals work from other CPUs. The locality is kept in mind.
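
The walk up the hierarchy can be sketched like this. It is a user-space illustration only: the `lpl` struct below is a hypothetical, much reduced version of lpl_t, CPU sets are plain 32-bit masks, and "available" just means that the caller considers the CPU a candidate.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical, reduced stand-in for lpl_t. */
struct lpl {
    int id;               /* lgroup id */
    uint32_t cpus;        /* bitmask of CPUs in this resource set */
    struct lpl *parent;   /* like lpl_parent; NULL at the root */
};

/*
 * Start at the thread's home lgroup and walk towards the root.
 * Return the first available CPU found in the closest lgroup,
 * or -1 when no CPU is available anywhere.
 */
static int choose_cpu(const struct lpl *home, uint32_t available)
{
    for (const struct lpl *l = home; l != NULL; l = l->parent) {
        uint32_t candidates = l->cpus & available;

        for (int cpu = 0; cpu < 32; cpu++) {
            if (candidates & (UINT32_C(1) << cpu))
                return cpu;
        }
    }
    return -1;
}
```

For the 4-way amd64 example, lgroup 1 would hold CPU 0, lgroup 5 would hold the CPUs of lgroups 1, 2 and 3, and the root would hold all four CPUs, so a thread homed in lgroup 1 only ends up on the farthest CPU when nothing closer is available.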

The lgroup hierarchical representation is more interesting when there are three hops (for example on an 8-way amd64 box). I'll leave that for next time. Thanks to Jonathan Chew for taking the time to explain all this. I thought it'd be worth blogging about since it's a somewhat complex design.

(2006-06-09 09:20:01.0) Permalink Comments [3]

20060512 Friday May 12, 2006

CoolThreads(TM) Technology Blow Away IBM, Dell in Latest SPEC Java(TM) Benchmark Performance Results

In the latest update on prenews, the latest Java benchmarks show that Niagara based systems beat IBM and Dell systems. Here's the quote :-

The Sun Fire T1000 and T2000 servers equipped with the single UltraSPARC T1 processor and Solaris 10 OS out-performed a range of Intel Xeon-based 2-4 way servers running Microsoft Windows Server 2003 OS, and IBM 2-4 way Power 5 and p5+ based servers running AIX, all while consuming less power, resulting in over 5X better performance per watt, and up to 4X less space -- critical factors affecting data center efficiency, capacity, and costs.
(2006-05-12 11:21:04.0) Permalink

20060104 Wednesday January 04, 2006

VFS/Vnode Layer in Solaris

In the past I have mostly written about dispatcher locks (thread locks), the scheduler, signals and procfs. This is the first time I'm writing about filesystems. I hope it'll help increase awareness of filesystems so that developing filesystem specific things on Solaris is made easy.

In this blog, I'll describe how to implement the VFS (Virtual Filesystem) layer and the Vnode layer for any filesystem. There are two ways you can read disk data :-

(a) using buffer cache : bread() is used to read a block of the device. The block number is always with respect to the device. brelse() must be called once buffer data is read from buf_t->b_un.b_addr

(b) using segmap driver and setting up the pages.

Using VFS layer, we can export following filesystem operations :-

(a) mount : In this operation, we first need to see whether the device can be mounted or not. We also need to read the super-block (depending upon whether it's a primary partition or a logical partition). We are also required to create a pseudo device using the following calls

pseudodev = makedevice(getmajor(xdev), minor); // xdev is the device passed to mount(1m)
devvp = makespecvp(xdev, VBLK);                // devvp is used to do reads

Once the pseudo device is created, we open the device to read the super-block and check the filesystem signature. This information is copied to the in-core super-block. Now comes the hard work of mounting the filesystem. Here we get the vnode for the mount point and mark it VROOT (vp->v_flag). The VFS structure is also filled in. For instance, vfs_data will point to the fs structure (struct ufsvfs), which holds the super-block and other general information about the filesystem. VFS layer routines take care of adding the vfs structure to the global array 'vfssw' of struct vfssw type.

(b) unmount : This operation is very critical. Unmount should not go through while processes are inside the mount point unless the force flag (-f) is passed to umount(1m). We need to maintain a reference count so that we don't allow unmount to go through while a process's current working directory is inside the mount point. For this we can increment the reference count whenever a vnode is allocated and decrement it whenever a vnode is released via VOP_INACTIVE(). Hence the xxx_unmount() operation should first check whether it's safe to unmount the filesystem or not. The DNLC will be purged by VFS layer routines before we land in the filesystem specific unmount operation.

(c) stat on the filesystem : df(1m) calls stat for each mount point. In this operation, we are required to return the following information in the statvfs64 structure :

f_bsize    // block size
f_frsize   // block size. UFS has fragment size to accommodate small files.
f_blocks   // total number of blocks in the filesystem 
f_bfree    // free blocks
f_files = (fsfilcnt64_t)-1;
f_ffree = (fsfilcnt64_t)-1;
f_favail = (fsfilcnt64_t)-1;
f_fsid     // filesystem id
(void) strcpy(sp->f_basetype, vfssw[vfsp->vfs_fstype].vsw_name);   // name
f_flag = vf_to_stf(vfsp->vfs_flag);   // flag
f_namemax            // MAX filename size.


(d) sync operation : For read-only filesystem, we don't need to implement sync. Otherwise, it's used for flushing dirty pages in the filesystem.

(e) root operation : used by filesystem lookups to determine the root (or mount point). We are required to hold the vnode.
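
The reference counting described for unmount above can be sketched in user-space C. This is a deliberately reduced model and not kernel code: `struct myfs` and the `myfs_*` functions are hypothetical names, the counter stands for the per-filesystem count of live vnodes, and the `force` argument plays the role of the -f flag.

```c
#include <errno.h>
#include <stdbool.h>

/* Hypothetical per-filesystem state. */
struct myfs {
    int vnode_refs;   /* vnodes handed out and not yet inactivated */
    bool mounted;
};

/* Called whenever a vnode is allocated for this filesystem. */
static void myfs_vnode_alloc(struct myfs *fs)
{
    fs->vnode_refs++;
}

/* Called from the filesystem's inactive routine when a vnode goes away. */
static void myfs_vnode_inactive(struct myfs *fs)
{
    if (fs->vnode_refs > 0)
        fs->vnode_refs--;
}

/* Refuse to unmount a busy filesystem unless the caller forces it. */
static int myfs_unmount(struct myfs *fs, bool force)
{
    if (fs->vnode_refs > 0 && !force)
        return EBUSY;

    fs->mounted = false;
    return 0;
}
```

A process whose current working directory is inside the mount point keeps one vnode referenced, so an unforced unmount returns EBUSY until that process moves away.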

Vnode layer exports following operations. We will focus on operations which are required to support read operations on the filesystem. Write operations are very tricky as you need to implement host of other operations and locking the filesystem.

(a) read : This operation is invoked whenever read(2) is called. In this routine, we use segmap to read the data of the file. We force fault the pages using

segmap_getmapflt(segkmap, vp, (off + mapon), , 1, S_READ);

and then uiomove() is called to copy the data back to userland. We also release the smp (segmap entry) using segmap_release() once uiomove() is done. Please note that segmap uses 8192 (MAXBSIZE), so accordingly you're required to manage the offset (off) and mapon, which are calculated as :

off = uoff & (offset_t)MAXBMASK;
mapon = (u_offset_t)(uoff & (offset_t)MAXBOFFSET);

(b) getattr : In this operation, we need to return the 'vattr' structure. 'ls -l' reads this structure. The following members are relevant here :-

va_type   // type of vnode
va_mode   // mode
va_uid    // uid
va_gid     // gid
va_atime.tv_sec // access time
va_mtime.tv_sec    // modification time
va_ctime.tv_sec    // attribute (inode) change time
va_size     // size
va_nlink    // link count 
va_blksize  // block size
va_nblocks  // number of blocks


(c) lookup : This is the heart of any filesystem. We must provide lookup in the filesystem before we can read files or search in a directory. This routine understands the filesystem structure. In this operation, you can also use the DNLC (Directory name lookup cache) to speed up the fs lookup. The vnode and name will be cached so that we don't have to go to the disk every time to search for a file/directory. dnlc_enter() can be used to put an entry in the DNLC, and dnlc_lookup() can be used to check whether a vnode can be found in the DNLC given the name. Both routines increment v_count using VN_HOLD().

(d) getpage_miss/getpage : This routine will read the block of a file given the offset. Here we need to setup the page using page_create_va() and prepare for reading the block data using pageio_setup(). In order to issue the IO, we do following things in order -- bdev_strategy(), biowait() and then pageio_done(). In order to support read-ahead, we can use pvn_read_kluster() routines. Filesystem specific getpage() routine will call getpage_miss() to read the block. In getpage(), we also do page_lookup() in order to save going to disk if page is already there in memory.

(e) readdir : This operation is used to read the directory entries. The uio_offset passed in the uio structure is the key thing here. If uio_offset is the same as the file size, then we have read all the directory entries. If that's not the case, then we read directory entries starting from the last offset, which is passed to us in uio_offset. At the end, we are required to return the new offset in uio_offset, so that the next time readdir() is called, we can read more directory entries.
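
Two of the vnode operations above involve offset bookkeeping that is easy to get wrong, so here is a small user-space C sketch of both. It is not kernel code. For read, MAXBSIZE is 8192 as stated above, and MAXBOFFSET and MAXBMASK are assumed to be MAXBSIZE - 1 and its complement. For readdir, the offset is counted in entries instead of bytes to keep things short, and the `dir` struct and function names are hypothetical.

```c
#include <stddef.h>
#include <stdint.h>

#define MAXBSIZE   8192u
#define MAXBOFFSET ((uint64_t)MAXBSIZE - 1)   /* assumed definition */
#define MAXBMASK   (~MAXBOFFSET)              /* assumed definition */

struct map_pos {
    uint64_t off;     /* start of the MAXBSIZE window containing uoff */
    uint64_t mapon;   /* offset of uoff inside that window */
};

/* Split a file offset the way the read path above does. */
static struct map_pos split_offset(uint64_t uoff)
{
    struct map_pos p;

    p.off = uoff & MAXBMASK;
    p.mapon = uoff & MAXBOFFSET;
    return p;
}

/* Hypothetical in-memory directory. */
struct dir {
    const char *const *names;
    size_t nentries;
};

/*
 * Copy up to `max` entry names starting at *offset, then store the new
 * offset so that the next call resumes where this one stopped.
 * Returns the number of entries copied; 0 means end of directory.
 */
static size_t read_dir(const struct dir *d, size_t *offset,
    const char **out, size_t max)
{
    size_t n = 0;

    while (*offset < d->nentries && n < max) {
        out[n++] = d->names[*offset];
        (*offset)++;
    }
    return n;
}
```

For a file offset of 10000, split_offset() yields the window starting at 8192 and a position of 1808 inside it, and `off + mapon` is always the original offset.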

There are a host of other functions which are required when write is also supported on the filesystem, for instance putpage, write etc. In order to support mmap(), we need to use the segvn segment driver instead of segmap.

(2006-01-04 19:30:00.0) Permalink Comments [2]

20050721 Thursday July 21, 2005

An interesting signal delivery related problem

Recently, we found an interesting performance problem using Dtrace. The problem showed up when using a virtual timer created with setitimer(2). The interval passed was 10 ms (one clock tick), but the SIGVTALRM signal used to arrive late, sometimes by 6 ticks or more. Now how will you Dtrace the code and where will you start tracing from? I'll start tracing from signal generation to delivery. In the Solaris kernel we use sigtoproc() to post a signal, and eat_signal() is called on the thread to get it on proc (TS_ONPROC), depending upon its state (TS_RUN, TS_SLEEP, TS_STOPPED). psig() is called when the kernel finds a pending signal (for instance when returning from a trap).

The program spins in userland after setting up the timer. Since the state of the thread would be TS_ONPROC, the target CPU has to be poked if the thread happens to be running on a different CPU. So I started tracing the following functions: sigtoproc(), eat_signal(), poke_cpu() and psig(). Now let's take a look at the Dtrace probe output:


CPU Probe ID              Function
  8  11263                 eat_signal:entry  1027637980027920 sig : 28
  8   2981                   poke_cpu:entry  1027637980030560 cpu : 9
  8  11263                 eat_signal:entry  1027637990025440 sig : 28
  8   2981                   poke_cpu:entry  1027637990032160 cpu : 9
  8  11263                 eat_signal:entry  1027638000036320 sig : 28
  8   2981                   poke_cpu:entry  1027638000043600 cpu : 9
  8  11263                 eat_signal:entry  1027638010025520 sig : 28
  8   2981                   poke_cpu:entry  1027638010032240 cpu : 9
  8  11263                 eat_signal:entry  1027638020023840 sig : 28
  8   2981                   poke_cpu:entry  1027638020031280 cpu : 9
  8  11263                 eat_signal:entry  1027638030028720 sig : 28
  8   2981                   poke_cpu:entry  1027638030035920 cpu : 9
  8  11263                 eat_signal:entry  1027638040024480 sig : 28
[.]
  9   8317                       psig:entry  1027638170086480 sig : 28

If you calculate the difference (ie timestamp) between psig() and the first eat_signal(), you will notice that the difference is huge.

1027638170086480-1027637980027920
190058560 = 19 ticks (190 ms)
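
The same arithmetic as a tiny C sketch, assuming the timestamps are in nanoseconds and a clock tick is 10 ms as mentioned above; the function name is made up for illustration.

```c
#include <stdint.h>

#define NSEC_PER_TICK 10000000ull   /* 10 ms clock tick, in nanoseconds */

/* Whole clock ticks elapsed between two nanosecond timestamps. */
static uint64_t delay_ticks(uint64_t first_ns, uint64_t last_ns)
{
    return (last_ns - first_ns) / NSEC_PER_TICK;
}
```

Feeding it the first eat_signal() timestamp and the psig() timestamp from the output above gives 19 ticks.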


We also noticed that CPU 8 (from where sigtoproc() is being called by clock_tick()) is poking CPU 9, but CPU 9 is not preempting the currently running thread (the program which is spinning). So why does this happen? To understand this, I'll first describe a bit about how preemption works in Solaris. In order to preempt a running thread, the kernel sets t_astflag (using the aston() macro) and also sets the appropriate CPU preemption flag. There are two CPU preemption flags: cpu_runrun for user level preemptions and cpu_kprunrun for kernel level preemptions. RT threads can preempt TS, SYS or IA class threads since kernel level preemption typically kicks in when the currently running thread's priority is <= 100 (KPQPRI). For signals we don't set the CPU level preemption flags. We just need to set t_sig_check and t_astflag, followed by a poke call.

Since we are interested in user level preemption, we should know what happens when CPU 8 pokes CPU 9 (using a cross call). If the currently running thread on CPU 9 is in userland, then we call user_rtt(), which calls trap() if the check for t_astflag succeeds. So let's check whether t_astflag is set when we call eat_signal(). And that's where the problem was: if the target thread in eat_signal() is TS_ONPROC, we should set t_astflag and then poke the CPU, but t_astflag was not being set. It is clear from the following probe output that the running thread on CPU 9 was only getting preempted because its time quantum finished and clock would have set t_astflag in cpu_surrender().

  9  15055               post_syscall:entry  1027637970269440
  8  11263                 eat_signal:entry  1027637980027920 sig : 28
  8   2981                   poke_cpu:entry  1027637980030560 cpu : 9
[.]
  8  11263                 eat_signal:entry  1027638040024480 sig : 28
  8   2981                   poke_cpu:entry  1027638040026800 cpu : 9
[.]
  8  11263                 eat_signal:entry  1027638110024160 sig : 28
  8   2981                   poke_cpu:entry  1027638110026560 cpu : 9
  8   2435              cpu_surrender:entry  1027638170024720 t:3001b7af3e0
  8   2981                   poke_cpu:entry  1027638170027280 cpu : 9
  8  11263                 eat_signal:entry  1027638170032720 sig : 28
  9   2919              poke_cpu_intr:entry  1027638170033760
  8   2981                   poke_cpu:entry  1027638170034400 cpu : 9
  9   3390                       trap:entry  1027638170037840 type :512, pc: 10984, ast:1
  8   2981                   poke_cpu:entry  1027638170038640 cpu : 9
  9   2919              poke_cpu_intr:entry  1027638170045680
  9   1497               trap_cleanup:entry  1027638170054880 0
  9   8317                       psig:entry  1027638170086480 sig : 28
  9   2278                   trap_rtt:entry  1027638170117440
  9  15055               post_syscall:entry  1027638170143360
  9   8317                       psig:entry  1027638170150880 sig : 2

So Dtrace did help us in finding out where the problem could be. This is just one example. Happy Dtracing...
(2005-07-21 00:47:44.0) Permalink Comments [0]

20050718 Monday July 18, 2005

Dtrace rocks...


Sometime back I had a problem with my desktop and as a result it started crawling whenever Java ticker used to kick in. I think I must share this with the rest of the world. I'd also share a kernel problem that we cracked and it was related to performance. So Dtrace has helped in solving many problems so far.

My desktop running Solaris 10 started crawling when I noticed that Xsun is eating up 68% of CPU. From prstat(1M)

# prstat
   PID USERNAME  SIZE   RSS STATE  PRI NICE      TIME  CPU PROCESS/NLWP
   594 ******     85M   78M run     30    0  14:03:19  68% Xsun/1
   796 root       16M   13M sleep   59    0   1:06:25 5.8% stfontserverd/18
[.]

I then started Dtrac'ing Xsun and noticed that it was making the lwp_sigmask() system call far too frequently. Here is the data :-

# ./syscall.d
^C
Ran for 26 seconds


  writev                                                         2832
  pollsys                                                        3261
  read                                                           5910
  doorfs                                                        27199
  lwp_sigmask                                                  217592

LWP ID     COUNT
1          217592

              libc.so.1`__systemcall6+0x20
              libc.so.1`pthread_sigmask+0x1b4
              libc.so.1`sigprocmask+0x20
              libc.so.1`sighold+0x54
              libST.so.1`fsexchange+0x78
              libST.so.1`FSSessionDisposeFontInstance+0x8c
             9063

              libc.so.1`__systemcall6+0x20
              libc.so.1`pthread_sigmask+0x1b4
              libc.so.1`sigprocmask+0x20
              libc.so.1`sigrelse+0x54
              libST.so.1`fsexchange+0xc0
              libST.so.1`FSSessionGetFontRenderingParams+0x8c

...and many more such stack traces from libST.so.1`fsexchange().

In fact, the full stack looks like this :-

              libc.so.1`__systemcall6+0x20
              libc.so.1`pthread_sigmask+0x1b4
              libc.so.1`sigprocmask+0x20
              libc.so.1`sighold+0x54
              libST.so.1`fsexchange+0x90
              libST.so.1`FSSessionGetFontRenderingParams+0x8c
              libST.so.1`GetRenderProps+0x344
              libST.so.1`GlyphVectorRepQuery+0xf4
              libST.so.1`STGlyphVectorQuery+0xd0
              SUNWXst.so.1`_XSTUseCache+0x68

Notice that in these stack traces we are calling sighold() and sigrelse() far too frequently, so this process keeps blocking and unblocking signals for some reason. It looks like we are rendering characters, but why do we block and unblock signals in this path? Here is the Dtrace script which was used :-

#!/usr/sbin/dtrace -s

#pragma D option quiet

BEGIN
{
        start = timestamp;

}

syscall:::entry
/execname == "Xsun"/
{
        @s[probefunc] = count();
}

syscall::lwp_sigmask:entry
/execname == "Xsun"/
{
        @c[curthread->t_tid] = count();
        @st[ustack(6)] = count();
}

END
{
        printf("Ran for %d seconds\n\n", (timestamp - start) / 1000000000);

        trunc(@s,5);
        printa(@s);

        printf("\n%-10s %-10s\n", "LWP ID", "COUNT");
        printa("%-10d %@d\n", @c);

        printa(@st);
}
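
To make the cost of those sighold()/sigrelse() pairs concrete, here is a minimal sketch in C of the block/unblock pattern that the stack traces suggest. It is only an illustration: guarded_exchange() and sig_is_blocked() are hypothetical names, and this is not the libST code, which I have not shown here. Each sigprocmask() call is a trip into the kernel (it shows up as lwp_sigmask in the output above), so a routine that does this around every small request pays for at least two system calls per request.

```c
#define _POSIX_C_SOURCE 200809L
#include <signal.h>
#include <stddef.h>

/*
 * Returns 1 if sig is blocked in the caller's signal mask, 0 if it is
 * not, and -1 on error. Passing a NULL set only queries the mask.
 */
static int
sig_is_blocked(int sig)
{
    sigset_t cur;

    if (sigprocmask(SIG_BLOCK, NULL, &cur) != 0)
        return (-1);
    return (sigismember(&cur, sig));
}

/*
 * Hypothetical stand-in for a library routine that guards a short
 * critical section by blocking one signal around it. Returns 1 if the
 * signal was blocked inside the section and the original mask was
 * restored afterwards, 0 if not, and -1 on error.
 */
static int
guarded_exchange(int sig)
{
    sigset_t block, saved;
    int before, inside, after;

    before = sig_is_blocked(sig);

    sigemptyset(&block);
    sigaddset(&block, sig);

    /* Same effect as sighold(sig): first system call. */
    if (sigprocmask(SIG_BLOCK, &block, &saved) != 0)
        return (-1);

    inside = sig_is_blocked(sig);
    /* ... the real work of the critical section would go here ... */

    /* Restore the saved mask, like sigrelse(sig): second system call. */
    if (sigprocmask(SIG_SETMASK, &saved, NULL) != 0)
        return (-1);

    after = sig_is_blocked(sig);
    return (inside == 1 && after == before);
}
```

Calling guarded_exchange(SIGUSR1) in a loop and watching the process with the script above should show the signal-mask system call count climbing with every iteration, which is the pattern seen in Xsun.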

In fact Dtrace could help us in solving much more complex problems. Happy Dtrac'ing...
(2005-07-18 19:44:08.0) Permalink

20050629 Wednesday June 29, 2005

Current activities

It's been a little while since I have written about a topic in Solaris Kernel. I'll soon blog on how Preemption (both user and kernel level) works in Solaris. With the help of Jonathan Chew, Andrei Doffee and Eric Saxe in Solaris Kernel Development, I'm currently working on a project which will enable you to specify multi-CPU binding and define affinity between the processes/lwps. We are still in the stage of drafting and developing a prototype. I'm getting to learn lgroup (latency group) and HLS (Hierarchical Lgroup Support) too. lgroup improves the performance on NUMA (Non-uniform memory access) machines like E15k, E25k, Serengeti 6800 and so on. It is my pleasure to work with Solaris Kernel Development engineers on this project. I'm sure I'll learn loads of things as we go along.

In the meantime, we recently cracked a problem in Solaris which delayed the response of a thread when a signal was pending. These days my colleague Sudheer Abdul Salam (we call him hot gun in the Solaris Kernel Sustaining group) partners with all of us when working on a bug. We truly believe in team building and don't hesitate to take help from others. At the same time, we don't hesitate to play pranks too :-)

Now that our group owns picld(1M), I've been a little busy with a few bugs in this area too. picld(1M) has interfaces which allow you to get the platform information in tree form (an abstract configuration of the system). The current users of the picld(1M) interfaces are prtpicl(1M) and SunMC, and I'm sure third-party applications will be using picld too.

This weekend (2nd/3rd July) I'm going for a long drive and a long trek too. I'm hoping that it'll just rain and rain. The weather in Bangalore is rejuvenating us (my uncle and myself), and the Western Ghats are attracting us again with their lush green forests and beautiful mountain ranges.
(2005-06-29 20:28:18.0) Permalink Comments [0]

20050613 Monday June 13, 2005

Compiler reordering problem

I'm going to write about a compiler reordering problem in the door_return() function which was observed in July 2002. The customer was able to reproduce the problem for us, and it took me a while to figure out that it was a compiler reordering problem. I must thank our customers for being so co-operative when we get such issues. I must have given them instrumented kernels at least five times before I found the problem. It's bug 4699850.

The symptom was very clear. The system used to panic in the Solaris kernel dispatcher routines, and one of the symptoms was a panic in dispdeq() while removing a kernel thread from the dispatch queue of a CPU.

We know that the compiler can reorder C statements if they are independent. Assume this piece of C code:

#define THREAD_SET_STATE(tp, state, lp) \
((tp)->t_state = state, (tp)->t_lockp = lp)

t_lockp is a pointer to a dispatcher lock and we don't know whether lp is held or not. When a thread is made TS_ONPROC, the t_lockp of the corresponding thread points to the cpu_thread_lock of the CPU (cpu_t). In the C code above, the two stores can be reordered by the compiler, so lp should be held while setting the thread's state.

In door_return(), when the server thread is about to hand off to the client thread to return the results, it makes the client thread TS_ONPROC and calls shuttle_resume() on the client thread. The responsibility of shuttle_resume() is to make the client/server thread TS_ONPROC while the caller sleeps on the shuttle_lock sync object.

While putting a thread onproc, dispatcher routines need not hold cpu_thread_lock, and hence in door_return(), if we call THREAD_ONPROC(), we effectively lose the thread lock on the client thread.

Now let's look at the two stores again. If t_lockp reaches global visibility before t_state, we can effectively lose the thread lock on the thread. Assume another thread on a different CPU is sending a signal to the client door thread. Once the thread lock is lost on the client thread, the thread which is sending the signal could see the old state of the client thread (in this case it happens to be TS_SLEEP). Since the state is TS_SLEEP, eat_signal() will do setrun() on the client thread, which enqueues the client thread in the dispatch queue of the CPU even though it is about to run. As a result, we can see some very strange things happening, which also included the dispdeq() panic.
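
The ordering requirement described above can be sketched in portable C. This is only a model under my own assumptions: it uses C11 atomics (which did not exist when this bug was found), the model_* names and the state values are hypothetical, and it is not how the Solaris kernel expresses the ordering. The point is simply that the store to t_state must become visible no later than the store to t_lockp, and a release store on the lock pointer gives exactly that guarantee to any reader that loads the pointer with acquire semantics.

```c
#include <stdatomic.h>
#include <stddef.h>

/* Illustrative state values for the model, not taken from the kernel. */
enum { MODEL_TS_SLEEP = 1, MODEL_TS_ONPROC = 4 };

/* Hypothetical stand-in for a dispatcher lock. */
typedef struct model_lock {
    int ml_word;
} model_lock_t;

/* Hypothetical stand-in for the two kthread_t fields discussed above. */
typedef struct model_thread {
    _Atomic int t_state;
    _Atomic(model_lock_t *) t_lockp;
} model_thread_t;

/*
 * Ordered version of "t_state = state, t_lockp = lp". The release store
 * on t_lockp forbids the compiler and the CPU from moving the earlier
 * t_state store after it.
 */
static void
model_set_state(model_thread_t *tp, int state, model_lock_t *lp)
{
    atomic_store_explicit(&tp->t_state, state, memory_order_relaxed);
    atomic_store_explicit(&tp->t_lockp, lp, memory_order_release);
}

/*
 * Reader side: load the lock pointer with acquire first. If the new
 * pointer is seen, the matching new state is guaranteed to be seen too.
 */
static int
model_observe_state(model_thread_t *tp, model_lock_t **lpp)
{
    *lpp = atomic_load_explicit(&tp->t_lockp, memory_order_acquire);
    return (atomic_load_explicit(&tp->t_state, memory_order_relaxed));
}
```

With plain stores and no ordering, a reader could observe the new lock pointer together with the stale MODEL_TS_SLEEP state, which is the window in which the signalling thread wrongly decided to call setrun().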

The faulty code was in door_return(). The listing below shows it after the fix, as its comment explains:

int
door_return(caddr_t data_ptr, size_t data_size,
    door_desc_t *desc_ptr, uint_t desc_num, caddr_t sp)
{
[.]
        tlp = caller->t_lockp;
        /*
         * Setting t_disp_queue prevents erroneous preemptions
         * if this thread is still in execution on another
         * processor
         */
        caller->t_disp_queue = cp->cpu_disp;
        CL_ACTIVE(caller);
        /*
         * We are calling thread_onproc() instead of
         * THREAD_ONPROC() because compiler can reorder
         * the two stores of t_state and t_lockp in
         * THREAD_ONPROC().
         */
        thread_onproc(caller, cp);
        disp_lock_exit_high(tlp);
        shuttle_resume(caller, &door_knob);
[.]
}


I used TNF (Trace Normal Form) to track down this problem. But now we have a powerful tool to trace from userland to kernel, and of course it's Dtrace.



(2005-06-13 12:00:00.0) Permalink Comments [0]

