Anish's Weblog

Tuesday June 14, 2005

Hardware interrupts overview for Solaris X86

Welcome to OpenSolaris, and the wonders of Solaris 10.

This paper provides a brief introduction to hardware interrupts on x86 platforms. It is relevant for Intel and AMD based platforms. Interrupts are handled by interrupt controller hardware in the system and are mostly delivered as sideband signals. In-band interrupts, e.g. Message Signaled Interrupts (MSIs), introduced with the PCI v2.2 specification, will be discussed in another blog. MSIs are becoming a mainstay with the advent of new interconnects like PCI-Express.

There are mainly two kinds of hardware interrupt controllers commonly used on x86 platforms:

1. 82c59(A) PIC(Programmable Interrupt Controller)

This is supported by the Solaris uppc(7d) module and its source is located at usr/src/uts/i86pc/io/psm/uppc.c.  Each PIC can handle 8 vectored, prioritized interrupt sources, and two PICs are cascaded together to provide 16 interrupts on x86 systems.  However, one of the pins, IRQ2 of the 1st PIC, is used to cascade the 2nd PIC, so there are only 15 interrupt sources.  The PIC cannot be used for multiprocessor (MP) systems without major modifications.

2. APIC (Advanced Programmable Interrupt controller)

This is supported by the Solaris pcplusmp(7d) module and its source is located at usr/src/uts/i86pc/io/pcplusmp/apic.c.  It consists of two components: the I/O APIC and the Local APIC.  The Local APIC is embedded in the CPU, while the I/O APIC is used for connecting the interrupting sources.  The Local APIC can also send interprocessor interrupts from one CPU to another, so the APIC is widely used on x86 MP systems.  Each system can have multiple I/O APICs and each I/O APIC can have 4, 16, 20 or 24 interrupt pins.  Since the Local APIC is embedded in the CPU and the I/O APIC can handle more than 16 interrupt sources, even single-CPU systems use the APIC.

Many systems have I/O APICs with 4 inputs (this is typically done for PCI-X slotted systems, where each slot is given a dedicated I/O APIC, so that INTA-INTD of each slot have dedicated inputs).

Solaris supports multiple I/O APICs.
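
The cascaded PIC arrangement described above can be sketched in a few lines of C. This is only an illustrative model, not code from uppc(7d): the names pic_pin, irq_is_usable(), irq_to_pic_pin() and usable_irq_count() are made up for this example. It shows why two 8-input PICs give 15 usable sources rather than 16.

```c
#include <assert.h>
#include <stdbool.h>

#define PIC_PINS    8   /* inputs per 82C59A */
#define CASCADE_IRQ 2   /* pin of the 1st PIC that is wired to the 2nd PIC */

/* Hypothetical helper type: which PIC and which pin an IRQ lands on. */
struct pic_pin {
    int pic;    /* 0 = 1st (master) PIC, 1 = 2nd (cascaded) PIC */
    int pin;    /* input pin on that PIC, 0..7 */
};

/* A device can use every IRQ except the one consumed by the cascade. */
bool irq_is_usable(int irq)
{
    return irq >= 0 && irq < 2 * PIC_PINS && irq != CASCADE_IRQ;
}

/* IRQ 0..7 live on the 1st PIC, IRQ 8..15 on the 2nd PIC. */
struct pic_pin irq_to_pic_pin(int irq)
{
    struct pic_pin p;

    assert(irq >= 0 && irq < 2 * PIC_PINS);
    p.pic = irq / PIC_PINS;
    p.pin = irq % PIC_PINS;
    return p;
}

/* Count the IRQs that remain available to devices. */
int usable_irq_count(void)
{
    int n = 0;

    for (int irq = 0; irq < 2 * PIC_PINS; irq++)
        if (irq_is_usable(irq))
            n++;
    return n;
}
```

For instance, irq_to_pic_pin(10) reports pin 2 of the 2nd PIC, and usable_irq_count() counts every IRQ except IRQ2.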

Here is a diagram that shows the APIC on a two-CPU system:

          processor #1                      processor #2
          +-----------+                     +-----------+
          |           |                     |           |
          |    CPU    |                     |    CPU    |
          |           |                     |           |
          +-----------+                     +-----------+
          | local APIC|                     | local APIC|
          +-----------+                     +-----------+
                ^                                 ^
                |                                 |
                |                                 |
                v                                 v   Processor system bus
<----------------------------------------------------------------->
                                 ^
                                 |
                                 |
                  +--------------|--------------+
                  |              |              |
                  |              v              |
                  |          +--------+         |
                  |          |        |         |
                  |          | Bridge |         |
                  |          |        |         |
                  |          +--------+         |
                  |              ^              |
                  |              |              |
                  |              v     PCI      |
                  |     <------------------>    |
                  |              ^              |
                  |              |              |
                  |              v              |
                  |         +----------+        |
                  |         |          |        |
                  |         | I/O APIC |<-------|--- External
                  |         |          |<-------|--- Interrupts
                  |         +----------+        |
                  |                             |
                  +-----------------------------+
                          System Chip set

Solaris x86 interrupt handling overview

When a device driver adds an interrupt through ddi_add_intr(9f), the call eventually reaches uppc_addspl() in uppc(7d) for the PIC, or apic_addspl() in pcplusmp(7d) for the APIC.  The interrupt pin is identified and then enabled.  The uppc(7d) case is quite simple: the interrupt pin to enable on the PICs is basically mapped 1-1 to the "IRQ#" or the "interrupt" property of the device on Solaris.  For the APIC (pcplusmp(7d)) it is a lot more complicated, as internally it uses either the MP Spec. 1.4[2] or the ACPI specification[3] to locate the right interrupt pin of the right I/O APIC for the device. The system BIOS sets up how the interrupts are routed and saves that information in either the MP Specification table or somewhere that ACPI can easily access. pcplusmp(7d) then accesses that information to initialize and add the interrupts.

During the ddi_add_intr(9f) call, the device interrupt handler's entry point is stored in autovect[] and the interrupt pin is enabled through uppc_addspl() (for PIC) or apic_addspl() (for APIC).  Before the interrupt pin is enabled, an interrupt "vector" (referred to as "vector" from now on) has to be selected for the CPU to trigger when that particular interrupt comes in.  Intel CPUs have a total of 256 vectors; the first 32 are reserved for special functions, so the first vector available for devices is 32 (hex 0x20).

For uppc(7d), vectors are set up such that the 1st pin, IRQ0, is mapped to 32, IRQ1 to 33 and so on.  For pcplusmp(7d) it is not as simple.  Solaris handles interrupts based on interrupt priority and each device is assigned a unique priority (which can be modified by the device driver).  Say a device "abc" is assigned priority 5: then all other interrupts at priority 5 or lower can NOT be triggered while the interrupt handler of "abc" is executing.  However, an interrupt of priority 6 or higher is allowed to trigger.  Since the APIC has a mechanism to prioritize interrupts, pcplusmp(7d) needs to select the vectors accordingly.
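
The vector and priority rules above can be written down as a minimal sketch. The function names here are invented for illustration and are not taken from uppc(7d) or pcplusmp(7d). The last helper reflects a property of the Local APIC hardware (its priority class is the upper four bits of the vector number); how pcplusmp(7d) actually picks vectors for a given Solaris priority is not shown here.

```c
#include <assert.h>
#include <stdbool.h>

#define FIRST_DEV_VECTOR 0x20   /* vectors 0..31 are reserved by the CPU */
#define PIC_IRQS         16     /* two cascaded 8-input PICs */

/* uppc-style mapping as described in the text: IRQ n uses vector 32 + n. */
int uppc_irq_to_vector(int irq)
{
    assert(irq >= 0 && irq < PIC_IRQS);
    return FIRST_DEV_VECTOR + irq;
}

/*
 * Priority rule from the text: while a handler runs at cur_pri, only an
 * interrupt with a strictly higher priority is allowed to trigger.
 */
bool intr_may_trigger(int cur_pri, int new_pri)
{
    return new_pri > cur_pri;
}

/* Local APIC hardware priority class: the upper 4 bits of the vector. */
int apic_vector_priority_class(int vector)
{
    assert(vector >= 0 && vector < 256);
    return vector >> 4;
}
```

With the "abc" example, intr_may_trigger(5, 5) and intr_may_trigger(5, 4) are false, while intr_may_trigger(5, 6) is true.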

In order to handle the interrupt priority properly, there are a few internal interface calls provided by uppc(7d) and pcplusmp(7d): uppc_intr_enter()/uppc_intr_exit() for uppc(7d), and apic_intr_enter()/apic_intr_exit() for pcplusmp(7d).  After the interrupt is triggered but before the interrupt handler is called, uppc_intr_enter() or apic_intr_enter() is called to set up the interrupt priority so that all other interrupts with the same or lower priority are blocked.  After the interrupt handler has completed, uppc_intr_exit() or apic_intr_exit() is called to restore the interrupt priority.
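
The enter/handler/exit sequence can be modeled in user-level C as follows. This is a simplified sketch, not the Solaris implementation: model_intr_enter() and model_intr_exit() are hypothetical stand-ins for the real *_intr_enter()/*_intr_exit() routines, and the "current priority" is just a variable rather than controller state.

```c
#include <assert.h>

typedef unsigned int (*intr_handler_t)(void *);

static int cur_ipl = 0;     /* hypothetical current interrupt priority */

int model_current_ipl(void)
{
    return cur_ipl;
}

/*
 * Stand-in for uppc_intr_enter()/apic_intr_enter(): raise the priority
 * and return the previous one, or -1 if the interrupt must stay blocked
 * because it is not above the current priority.
 */
int model_intr_enter(int new_ipl)
{
    int old = cur_ipl;

    if (new_ipl <= cur_ipl)
        return -1;
    cur_ipl = new_ipl;
    return old;
}

/* Stand-in for uppc_intr_exit()/apic_intr_exit(): restore the priority. */
void model_intr_exit(int old_ipl)
{
    assert(old_ipl >= 0);
    cur_ipl = old_ipl;
}

/* Enter, run the handler, exit. Returns 1 if the handler ran, 0 if blocked. */
int model_dispatch(int ipl, intr_handler_t handler, void *arg)
{
    int old = model_intr_enter(ipl);

    if (old < 0)
        return 0;
    (void) handler(arg);
    model_intr_exit(old);
    return 1;
}

/* Example handler: counts how many times it was called. */
unsigned int sample_handler(void *arg)
{
    int *count = arg;

    (*count)++;
    return 1;
}
```

While a priority 5 handler is active in this model, model_dispatch() refuses another priority 5 interrupt but lets a priority 6 interrupt run, and the priority falls back to 5 afterwards.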

On the x86 platform, all the local variables of the interrupt handler are on the stack. Also, if the interrupt handler needs to call another function, the parameters passed to that function are on the stack too, i.e. all interrupt handlers use the stack one way or the other.

Solaris code that handles interrupts

Below are a few code snippets that deal with interrupts.

To begin with, the 256 vector entries are defined in the autovect[] table shown below:

usr/src/uts/common/io/avintr.c

#define MAX_VECT 256
struct av_head autovect[MAX_VECT];

usr/src/uts/common/sys/avintr.h

struct autovec {

    /*
     * Interrupt handler and argument to pass to it.
     */
    struct autovec *av_link;    /* pointer to next one in chain */
    uint_t (*av_vector)();
    caddr_t av_intarg;
    uint_t av_prilevel;         /* priority level */

    /*
     * Interrupt handle/id (like intrspec structure pointer) used to
     * identify a specific instance of interrupt handler in case we
     * have to remove the interrupt handler later.
     *
     */
    void *av_intr_id;
    dev_info_t *av_dip;
};

av_vector is the device interrupt handler.

struct av_head {
    struct autovec *avh_link;
    ushort_t avh_hi_pri;
    ushort_t avh_lo_pri;
};
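
To make the role of these structures concrete, here is a small user-level sketch of how a per-vector chain of handlers could be built and walked. It assumes simplified copies of the two structures (only the fields needed here, with plain C types), and sketch_add_handler() and sketch_dispatch() are invented names, not the kernel's routines. The assumption that avh_hi_pri/avh_lo_pri track the highest and lowest priority on the chain is inferred from the field names.

```c
#include <assert.h>
#include <stddef.h>

#define MAX_VECT 256

/* Simplified copies of the structures above, for illustration only. */
struct autovec {
    struct autovec *av_link;            /* next handler on this vector */
    unsigned int (*av_vector)(void *);  /* device interrupt handler */
    void *av_intarg;                    /* argument passed to the handler */
    unsigned int av_prilevel;           /* priority level */
};

struct av_head {
    struct autovec *avh_link;
    unsigned short avh_hi_pri;
    unsigned short avh_lo_pri;
};

static struct av_head autovect[MAX_VECT];

/* Hypothetical helper: link a handler onto a vector's chain. */
void sketch_add_handler(int vec, struct autovec *av)
{
    struct av_head *h;

    assert(vec >= 0 && vec < MAX_VECT);
    h = &autovect[vec];
    if (h->avh_link == NULL) {
        h->avh_hi_pri = h->avh_lo_pri = (unsigned short)av->av_prilevel;
    } else {
        if (av->av_prilevel > h->avh_hi_pri)
            h->avh_hi_pri = (unsigned short)av->av_prilevel;
        if (av->av_prilevel < h->avh_lo_pri)
            h->avh_lo_pri = (unsigned short)av->av_prilevel;
    }
    av->av_link = h->avh_link;
    h->avh_link = av;
}

/* Call every handler on the vector; count how many returned non-zero. */
int sketch_dispatch(int vec)
{
    int claimed = 0;

    assert(vec >= 0 && vec < MAX_VECT);
    for (struct autovec *av = autovect[vec].avh_link; av != NULL;
        av = av->av_link) {
        if (av->av_vector(av->av_intarg) != 0)
            claimed++;
    }
    return claimed;
}

/* Example handlers: one claims the interrupt, one does not. */
unsigned int claim_handler(void *arg)
{
    int *hits = arg;

    (*hits)++;
    return 1;
}

unsigned int ignore_handler(void *arg)
{
    (void) arg;
    return 0;
}
```

Walking the whole chain is what allows several devices to share one vector: each handler checks its own device and reports whether the interrupt was its own.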

- All interrupts run at some priority which has a ceiling of LOCK_LEVEL.
Interrupts below LOCK_LEVEL run as threads.

usr/src/uts/intel/sys/machlock.h

#define CLOCK_LEVEL 10
#define LOCK_LEVEL 10

- The following sequence shows what is done for each CPU to allocate
 enough interrupt threads to handle the interrupts. Since interrupts
are prioritized, one interrupt thread per priority should be sufficient.

usr/src/uts/i86pc/os/mp_startup.c

void
start_other_cpus(int cprboot)
{
    for (who = 0; who < NCPU; who++) {
        mp_startup_init(who);
        ...
    }
    ...
}

void
mp_startup_init(int cpun)
{
    ...
    init_intr_threads(cp);
    ...
}

usr/src/uts/i86pc/os/intr.c

/*
 * Allocate threads and stacks for interrupt handling.
 */
#define NINTR_THREADS (LOCK_LEVEL-1) /* number of interrupt threads */

void
init_intr_threads(struct cpu *cp)
{
    int i;

    for (i = 0; i < NINTR_THREADS; i++)
        (void) thread_create_intr(cp);
    ...
}

usr/src/uts/common/disp/thread.c

void
thread_create_intr(struct cpu *cp)
{
    ...
}

- In the actual code handling the interrupts on x86, *setlvl is the
wrapper for uppc_intr_enter() or apic_intr_enter().



NOTE: Calling the interrupt handler is done in low level assembly code
which is not discussed here.

References:

1. See Chapters 5 and 7 of the Intel Architecture Software Developer's Manual Volume 3: System Programmer Guide for details on how interrupts work on the x86 platform.
2. Intel Multi-Processor Specification v1.4
3. Advanced Configuration & Power Interface (ACPI) specification home


PS: Lots of thanks to Johnny Cheung, also in Solaris I/O, for originally contributing to this material.

Posted by anish (Jun 14 2005, 10:23:20 AM PDT)

Using Cfgadm with InfiniBand

Welcome to OpenSolaris, and the wonders of Solaris 10.

This provides a brief introduction to InfiniBand device management with cfgadm(1m).

Introduction

An InfiniBand (IB) device is enumerated by the IB nexus driver, ib(7D), based on the interfaces provided by the IB Device Manager (IBDM). The IB nexus driver creates and initializes five types of device nodes:
    • IB Port devices
    • IB HCA service (HCA_SVC) devices
    • IB Virtual Physical Point of Attachment (VPPA) devices
    • I/O Controller (IOC)
    • IB Pseudo devices
See ib(7d), ibdm(7d), ibtl(7d) and ib(4) for details on the InfiniBand nexus driver and the Device Manager.

Attachment Point format

The InfiniBand cfgadm plugin supports the above five device nodes as 'dynamic' attachment points:

Device Type        Attachment point format
Port devices       ib::PORT_GUID,0,service-name
HCA_SVC devices    ib::HCA_GUID,0,service-name
VPPA devices       ib::PORT_GUID,P_Key,service-name
IOC devices        ib::IOC-GUID
Pseudo devices     ib::driver_name,unit-address

where
service-name    is the name of the communication service
P_Key           is the Partition Key


In addition, two 'static' attachment points are supported:

Static attachment          Attachment point format
IB Fabric                  ib
Host Channel Adapter(s)    hca:HCA-GUID
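
As a small illustration of the formats above, the following C sketch builds attachment point strings. The helper names are made up for this example and are not part of cfgadm or any Solaris library. The formatting choices (GUIDs in uppercase hex without leading zeros, P_Key in lowercase hex) are assumptions inferred from sample cfgadm output such as ib::2C90109764441,ffff,ipib.

```c
#include <assert.h>
#include <inttypes.h>
#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>
#include <string.h>

/* Port device: ib::PORT_GUID,0,service-name */
int ib_port_ap_id(char *buf, size_t len, uint64_t port_guid,
    const char *service)
{
    assert(buf != NULL && service != NULL);
    return snprintf(buf, len, "ib::%" PRIX64 ",0,%s", port_guid, service);
}

/* VPPA device: ib::PORT_GUID,P_Key,service-name */
int ib_vppa_ap_id(char *buf, size_t len, uint64_t port_guid,
    uint16_t pkey, const char *service)
{
    assert(buf != NULL && service != NULL);
    return snprintf(buf, len, "ib::%" PRIX64 ",%x,%s", port_guid,
        (unsigned int)pkey, service);
}

/* Static HCA attachment point: hca:HCA-GUID */
int ib_hca_ap_id(char *buf, size_t len, uint64_t hca_guid)
{
    assert(buf != NULL);
    return snprintf(buf, len, "hca:%" PRIX64, hca_guid);
}

/* Dynamic attachment points all start with "ib::". */
bool ib_ap_id_is_dynamic(const char *ap_id)
{
    assert(ap_id != NULL);
    return strncmp(ap_id, "ib::", 4) == 0;
}
```

A string built this way can be passed to cfgadm as the attachment point argument.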

The example below shows all InfiniBand devices:
# cfgadm -a ib hca:2C90109764440 
Ap_Id                          Type         Receptacle   Occupant     Condition
hca:2C90109764440              IB-HCA       connected    configured   ok
ib                             IB-Fabric    connected    configured   ok
ib::2C90109764440,0,svch       IB-HCA_SVC   connected    unconfigured unknown
ib::2C90109764441,0,psvc       IB-PORT      connected    unconfigured unknown
ib::2C90109764441,ffff,ipib    IB-VPPA      connected    unconfigured unknown
ib::2C90109764442,0,psvc       IB-PORT      connected    unconfigured unknown
ib::2C90109764442,ffff,ipib    IB-VPPA      connected    unconfigured unknown
ib::daplt,0                    IB-PSEUDO    connected    configured   ok
ib::rpcib,0                    IB-PSEUDO    connected    configured   ok
# 

The example below shows how to list all kernel clients of a given InfiniBand Host Channel Adapter.
# cfgadm -x list_clients  hca:2C90109764440
Ap_Id                          IB Client                 Alternate HCA
ib::daplt,0                    daplt                     no
ib::rpcib,0                    nfs/ib                    no
-                              ibmf                      no
-                              ibdm                      no
# 

Configure/Unconfigure an IB device

Using cfgadm commands, the five device types listed above can be configured for operation or unconfigured. The example below uses an IB Pseudo device, but it applies to any device when the appropriate attachment point is supplied as the command line argument:

Configuring a device:

# cfgadm -a ib::daplt,0             
Ap_Id                          Type         Receptacle   Occupant     Condition
ib::daplt,0                    IB-PSEUDO    connected    unconfigured unknown
# cfgadm -yc configure ib::daplt,0  
# cfgadm -a ib::daplt,0           
Ap_Id                          Type         Receptacle   Occupant     Condition
ib::daplt,0                    IB-PSEUDO    connected    configured   ok
#

Unconfiguring a device:

# cfgadm -a ib::daplt,0
Ap_Id                          Type         Receptacle   Occupant     Condition
ib::daplt,0                    IB-PSEUDO    connected    configured   ok
# cfgadm -yc unconfigure ib::daplt,0
# cfgadm -a ib::daplt,0             
Ap_Id                          Type         Receptacle   Occupant     Condition
ib::daplt,0                    IB-PSEUDO    connected    unconfigured unknown

The example below shows how to unconfigure all kernel clients of a given InfiniBand Host Channel Adapter.
# cfgadm -x unconfig_clients hca:2C90109764440 
Unconfigure Clients of HCA /devices/ib:2C90109764440
This operation will unconfigure IB clients of this HCA
Continue (yes/no)? yes <<<<<<< 
# 

Communication Service Commands

InfiniBand Port/HCA-SVC/VPPA devices use communication services. There are certain operations allowed for communication services like adding a service, removing a service or listing known services. See examples below:

Add a communication service:
# cfgadm -x list_services ib   
PORT communication services:
                psvc

VPPA communication services:
                ipib
HCA communication services:
                svch
# cfgadm -o comm=port,service=srp -x add_service ib
# cfgadm -x list_services ib                       
PORT communication services:
                srp <<<<<<<<<<<<<<
                psvc

VPPA communication services:
                ipib
HCA communication services:
                svch
#

Delete a communication service:
# cfgadm -x list_services ib                       
PORT communication services:
                srp
                psvc

VPPA communication services:
                ipib
HCA communication services:
                svch
# cfgadm -o comm=port,service=srp -x delete_service ib
# cfgadm -x list_services ib                          
PORT communication services:
                psvc

VPPA communication services:
                ipib
HCA communication services:
                svch
#

Note that the examples are shown only for Port devices but are applicable to all three device types.

Other useful Commands

Two more useful commands provided by the InfiniBand cfgadm plugin are:
  • update_pkey_tbls
    It updates the P_Key information inside ibtl(7d), i.e. the InfiniBand Transport Layer module. ibtl(7d) reads the P_Key tables for all ports of all the HCAs seen by the host.
  • update_ioc_conf
    It updates the properties of all IOC devices if the ib static attachment point is supplied. If an IOC attachment point is supplied, then only that IOC's properties are updated. The properties updated are:
    port-list, port-entries, service-id, and service-name


Posted by anish (Jun 14 2005, 10:03:15 AM PDT)

Message Signaled Interrupts

Welcome to OpenSolaris, and the wonders of Solaris 10.

This paper provides a brief introduction to in-band interrupts, i.e. Message Signaled Interrupts. All flavors of PCI (PCI 2.2 onwards, PCI-X, PCI-Express) support Message Signaled Interrupts (referred to as MSIs henceforth).

Introduction

MSIs, unlike fixed interrupts, are in-band messages targeting an address range in the host bridge. Since the messages are in-band, the receipt of the message can be used to "push" any data associated with the interrupt. MSIs are, by definition, unshared. Each MSI message assigned to a device is guaranteed to be a unique message in the system. PCI functions can request between 1 and 32 MSI messages, in powers of two. The system software may allocate fewer MSI messages to a function than the function requested. The host bridge will have some limitation on the number of unique MSI messages that can be allocated for devices.
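
The "powers of two" rule and the possibility of a smaller grant can be sketched as below. This is an illustrative policy only: msi_round_request() and msi_grant() are invented names, and granting the largest power of two that fits is an assumption for the example, not the allocation policy of Solaris or of any particular host bridge.

```c
#include <assert.h>

#define MSI_MAX_MESSAGES 32     /* a function can request 1..32 messages */

/* Round the number of messages a function wants up to a power of two. */
int msi_round_request(int wanted)
{
    int n = 1;

    assert(wanted >= 1 && wanted <= MSI_MAX_MESSAGES);
    while (n < wanted)
        n <<= 1;
    return n;
}

/*
 * Grant the largest power of two that does not exceed either the
 * request or what the system has available; 0 means nothing to grant.
 */
int msi_grant(int requested, int available)
{
    int limit, n = 1;

    assert(requested >= 1 && requested <= MSI_MAX_MESSAGES);
    if (available < 1)
        return 0;
    limit = requested < available ? requested : available;
    while (n * 2 <= limit)
        n <<= 1;
    return n;
}
```

A function that wants 3 messages would request 4; if only 5 messages are free when another function requests 8, this policy hands out 4.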

The introduction of PCI-Express [1] extended PCI and MSI by requiring the use of MSI for PCI functions. PCI-Express is a serial point-to-point bus with no external wires. For legacy purposes, PCI-Express includes INTx (INTA-INTD) emulation messages for compatibility with existing software. However, within any one PCI-Express domain, the four INTx emulation messages are shared by any device using INTx emulation within that hierarchy. Thus, depending on INTx emulation is generally a bad idea due to the nature of its implementation.

Extended MSI (MSI-X)

A PCI-SIG MSI-X ECN [2] extended MSI by adding the ability for a function to allocate more (up to 2048) messages, making the address and data value used for each message independent of any other MSI-X message, and allowing software to choose to use the same MSI address/data value in multiple MSI-X "slots", as an architected method for dealing with the case when the system allocates fewer MSI/X messages to the device than the device requested.
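
The idea of reusing one address/data pair in several MSI-X "slots" can be sketched like this. The structure and function names are hypothetical, the round-robin assignment is just one possible choice, and real MSI-X table entries carry more fields than shown here.

```c
#include <assert.h>
#include <stdint.h>

#define MSIX_MAX_ENTRIES 2048

/* Simplified MSI-X slot: only the message address and data. */
struct msix_entry {
    uint64_t addr;
    uint32_t data;
};

/*
 * The device has n_slots interrupt sources but the system granted only
 * n_granted unique messages. Slot i reuses granted message i % n_granted,
 * so several slots end up sharing the same address/data pair.
 */
void msix_fill_table(struct msix_entry *table, int n_slots,
    const struct msix_entry *granted, int n_granted)
{
    assert(table != 0 && granted != 0);
    assert(n_slots >= 1 && n_slots <= MSIX_MAX_ENTRIES);
    assert(n_granted >= 1 && n_granted <= n_slots);
    for (int i = 0; i < n_slots; i++)
        table[i] = granted[i % n_granted];
}
```

With 5 slots and 2 granted messages, slots 0, 2 and 4 share the first message and slots 1 and 3 share the second, so the handler for a shared message has to check which of its sources actually fired.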

Implementation Notes

MSI and MSI-X shall be collectively referred to as MSI/X henceforth. MSI/X is always edge triggered, since the interrupt is signaled with a posted write command by the device targeting a pre-allocated area of "memory" on the host bridge. However, some host bridges have the ability to "latch" the acceptance of an MSI/X message and can effectively treat it as a level signaled interrupt.

Devices are permitted to send more than one MSI/X message prior to an outstanding interrupt being serviced. However, the PCI specifications state that there is no guarantee that additional MSI/X messages will be serviced until the first of a set of MSI/X messages targeting the same address/data values has been serviced. Therefore, there is only a guarantee of servicing one MSI/X message per set of MSI/X messages. Other than certain devices that send periodic interrupts, devices should, in general, only send one MSI/X message per interrupt source until that interrupt has been serviced.

With MSI/X, vectors must be allocated by the implementation and assigned to the device. Default interrupt priority is assigned based on the class code of the device. Native PCI devices should avoid using INTx or INTx emulation when MSI/X is available in the device and supported by the host bridge implementation.

New interrupt DDI interfaces

Upcoming version(s) of Solaris support MSIs and have new DDI interfaces to register/unregister interrupts. In addition, these new interfaces allow a driver to:
  • Get and set device's interrupt capabilities
  • Get and set device's interrupt priority
  • Get information if an interrupt is pending
  • Set and clear interrupt mask

Reference:

1. PCI Express Base Specification v1.0a
2. PCI Express Engineering Change Notice - MSI-X Oct. 31, 2003

Posted by anish (Jun 14 2005, 10:00:07 AM PDT)

