Charlie Chen's Weblog

Wednesday Dec 12, 2007

How does MSI/MSI-X work - MSI-X part

While MSI defines its control information directly in config space through a PCI capability, MSI-X puts its vector table and pending bits in system-addressable memory mapped through the function's PCI BARs.

This is the structure of the MSI-X capability, with the number in parentheses indicating the width in bits:

msg control(16) nxt ptr(8) Cap ID(8)
tab offset(29)            tab BIR(3)
PBA offset(29)            PBA BIR(3)
  • Cap ID is 0x11.
  • msg control defines the number of vectors in the MSI-X table (encoded as N-1 in its low 11 bits), and also holds the function mask and MSI-X enable bits.
  • BIR indicates which BAR maps the memory segment holding the MSI-X table or the MSI-X PBA: 0 for the BAR at 0x10, 1 for 0x14, and so on; 6 and 7 are reserved. The table and the PBA may use the same BAR.
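
To make the field layout above concrete, here is a minimal Python sketch that decodes the 12 bytes of the capability. The function names are made up for illustration; the bit positions of the table size (bits 10:0, stored as N-1), function mask (bit 14) and enable (bit 15) come from the PCI specification rather than from this post, so treat them as assumptions to check against your copy of the spec.

```python
import struct

MSIX_CAP_ID = 0x11


def decode_msix_capability(cap: bytes) -> dict:
    """Decode the 12-byte MSI-X capability (little-endian config space)."""
    # Cap ID(8), nxt ptr(8), msg control(16), table dword(32), PBA dword(32)
    cap_id, next_ptr, msg_ctrl, table_reg, pba_reg = struct.unpack_from("<BBHII", cap)
    if cap_id != MSIX_CAP_ID:
        raise ValueError("not an MSI-X capability")
    return {
        "next_ptr": next_ptr,
        "table_size": (msg_ctrl & 0x7FF) + 1,      # stored as N-1
        "function_mask": bool(msg_ctrl & (1 << 14)),
        "enabled": bool(msg_ctrl & (1 << 15)),
        "table_bir": table_reg & 0x7,              # low 3 bits
        "table_offset": table_reg & 0xFFFFFFF8,    # upper 29 bits, 8-byte aligned
        "pba_bir": pba_reg & 0x7,
        "pba_offset": pba_reg & 0xFFFFFFF8,
    }


def bar_config_offset(bir: int) -> int:
    """Config space offset of the BAR selected by a BIR value (0..5)."""
    if not 0 <= bir <= 5:
        raise ValueError("BIR values 6 and 7 are reserved")
    return 0x10 + 4 * bir
```

For example, a table dword of 0x00002003 means the table lives at offset 0x2000 inside the BAR found at config offset 0x1C.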

A table entry is defined as:

vector control(32) msg data(32) msg upper-addr(32) msg addr(32)
  • Thus one entry uses 16 bytes, and the nth entry's address can be calculated as:
table base address + 16*n
  • MSI-X message data, unlike MSI message data, must not be modified by the device function.
  • vector control: its least significant bit indicates whether this vector is masked from sending its message to the system. If set, the function is prohibited from sending this interrupt upstream. The default value is 1 (masked).
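
The entry layout and the mask bit can be sketched with a plain bytearray standing in for the BAR-mapped table. This is only a toy model: the helper names are hypothetical, and a real driver would go through MMIO accessors rather than a Python buffer.

```python
import struct

ENTRY_SIZE = 16      # msg addr, msg upper-addr, msg data, vector control
VECTOR_CTRL_OFF = 12
MASK_BIT = 0x1       # bit 0 of vector control


def entry_address(table_base: int, n: int) -> int:
    """Address of the nth table entry."""
    return table_base + ENTRY_SIZE * n


def new_table(num_vectors: int) -> bytearray:
    """Build a table in its default state: every vector masked."""
    table = bytearray(ENTRY_SIZE * num_vectors)
    for n in range(num_vectors):
        struct.pack_into("<I", table, ENTRY_SIZE * n + VECTOR_CTRL_OFF, MASK_BIT)
    return table


def program_vector(table: bytearray, n: int, addr: int, data: int) -> None:
    """Write address/data for vector n and clear its mask bit."""
    struct.pack_into("<IIII", table, ENTRY_SIZE * n,
                     addr & 0xFFFFFFFF, addr >> 32, data, 0)


def is_masked(table: bytearray, n: int) -> bool:
    (ctrl,) = struct.unpack_from("<I", table, ENTRY_SIZE * n + VECTOR_CTRL_OFF)
    return bool(ctrl & MASK_BIT)
```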

The PBA (Pending Bit Array) consists of a series of 64-bit qwords, where each bit corresponds to one vector entry of the MSI-X table.
These formulas can be used to calculate the qword address and bit number of the pending bit for the nth vector:

qword address = PBA base + (n / 64)*8
qword bit# = n % 64
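
The two formulas translate directly into code, with the division taken as integer division. The function names below are just for illustration, and the PBA is modeled as a little-endian byte buffer.

```python
def pending_bit_location(pba_base: int, n: int) -> tuple[int, int]:
    """Return (qword address, bit number) of the pending bit for vector n."""
    return pba_base + (n // 64) * 8, n % 64


def is_pending(pba: bytes, n: int) -> bool:
    """Test vector n's pending bit in a PBA image."""
    offset, bit = pending_bit_location(0, n)
    qword = int.from_bytes(pba[offset:offset + 8], "little")
    return bool((qword >> bit) & 1)
```

So vector 70 lives in the second qword (PBA base + 8), at bit 6.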

The MSI-X table and PBA should not share a memory page with data used for other purposes. The suggestion is to treat 8 KB as the natural boundary, because some architectures use 8 KB as their normal page size.
 

Thursday Nov 29, 2007

How does MSI/MSI-X work - MSI part?

1. Background 

MSI, Message Signaled Interrupts, raises an interrupt with an in-band PCI memory write message, instead of the conventional out-of-band PCI INTx pin.

When the system wants to use MSI, it sets up the MSI capability's control registers. Simply put, it writes the address register (32b or 64b) and the data register, and sets the enable bit of the MSI control register. When the device wants to send an interrupt, it writes the data in the data register to the address specified in the address register.

MSI-X is an extension to MSI that supports more vectors. MSI can support at most 32 vectors, while MSI-X can support up to 2048.

Using MSI can lower interrupt latency by giving every kind of interrupt its own vector/handler. When the kernel sees the message, it vectors directly to the interrupt service routine associated with the address/data. The address/data (vector) is allocated by the system, while the driver needs to register a handler with the vector.

By allocating a vector area for all kinds of PCI devices in general, the system reaches a general solution for reporting interrupts quickly.

2. MSI registers

capability id: 8b, val = 0x5, the code of this capability.

next pointer: 8b, the offset of the next capability, 0 if this is the last one.

message control: 16b
bit 8: whether per-vector masking is supported
bit 7: whether the function can generate a 64b message address
bits 6-4: number of vectors allocated by the system, computed as 2 raised to this 3-bit value.
bits 3-1: number of vectors requested by the function, computed as 2 raised to this 3-bit value.
bit 0: whether MSI is enabled; disabled by default. The driver must not use this bit to mask interrupts.
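
As a quick illustration of the bit fields just listed, this small sketch unpacks a 16-bit message control value. The function and key names are invented for the example.

```python
def decode_msi_control(ctrl: int) -> dict:
    """Split the 16-bit MSI message control register into its fields."""
    return {
        "enabled": bool(ctrl & 0x1),                     # bit 0
        "vectors_requested": 1 << ((ctrl >> 1) & 0x7),   # bits 3-1, 2**value
        "vectors_allocated": 1 << ((ctrl >> 4) & 0x7),   # bits 6-4, 2**value
        "addr64": bool(ctrl & (1 << 7)),                 # bit 7
        "per_vector_masking": bool(ctrl & (1 << 8)),     # bit 8
    }
```

For instance, 0x01A7 decodes as enabled, 8 vectors requested, 4 allocated, 64b address and per-vector masking both supported.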

message address: 32b

message upper address: 32b, only present when 64b addressing is indicated by bit 7 of the control register

message data: 16b, specified by the system; the function may modify the low-order bits to identify the interrupt source.

mask bits: 32b
pending bits: 32b
Every vector is given a number, corresponding to a bit in the mask and pending registers. A set mask bit indicates that the function is prohibited from sending the associated message. A set pending bit indicates that the function has a pending associated message. This mechanism enables software to disable or defer message sending on a per-vector basis.
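
Here is a toy model of the per-vector mask and pending behaviour described above, together with the low-order data bits identifying the source. The class and its methods are hypothetical; the list `sent` stands in for memory writes going upstream, and delivering a deferred message as soon as its mask bit is cleared is how I read the mechanism, so treat that part as an assumption.

```python
class MsiFunction:
    """Toy model of an MSI function with per-vector masking."""

    def __init__(self, address: int, data: int, allocated_vectors: int):
        self.address = address
        self.data = data
        self.allocated = allocated_vectors  # power of two, at most 32
        self.mask = 0
        self.pending = 0
        self.sent = []                      # (address, data) writes "on the bus"

    def _send(self, vector: int) -> None:
        # The function replaces the low-order data bits with the vector number.
        data = (self.data & ~(self.allocated - 1) & 0xFFFF) | vector
        self.sent.append((self.address, data))

    def raise_interrupt(self, vector: int) -> None:
        bit = 1 << vector
        if self.mask & bit:
            self.pending |= bit             # defer: remember it in the pending bits
        else:
            self._send(vector)

    def write_mask(self, value: int) -> None:
        self.mask = value & 0xFFFFFFFF
        for vector in range(self.allocated):
            bit = 1 << vector
            if (self.pending & bit) and not (self.mask & bit):
                self.pending &= ~bit        # unmasked: deliver the deferred message
                self._send(vector)
```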

Thursday Nov 15, 2007

Virtualization race prediction

Virtualization means an efficient way of managing hardware resources, which would be better done by firmware and the operating system. So, when OS or processor suppliers realize that virtualization can be an essential tool for customers, this resource management thing will no longer be a toy, but an internalized organ. That's not good for virtualization ISVs such as VMware.

A reasonable prediction is that VMware will have a hard time facing competitors: Linux's Xen, MS's Virtual PC/Server, Sun's built-in containers, Oracle's upcoming offering, etc.

Virtualization, like everything else in system software land, should be free and as invisible as possible, or, users should not pay too much for such an admin tool.

Thursday Nov 08, 2007

simulate Ethernet controller - concepts

To simulate an Ethernet controller, we need to:

1) simulate just enough of the registers' side effects to mimic the chip's logic;
2) control when events occur, to control the chip's performance.

Register accesses on a chip are just like RAM operations, with one exception: a register read/write has side effects, such as causing a packet to be sent or clearing on read, while RAM is a totally passive part. These side effects are the key to simulating an IO module's logic, no matter what kind of module it is or how complex: SCSI controller, simple UART, or NIC.
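
A tiny sketch of what "registers with side effects" could look like in a simulator. The register names and offsets (ICR, TDT) are made up for this example and do not describe any particular chip: reading ICR clears it, and writing TDT pushes queued frames onto a fake wire.

```python
class NicRegisters:
    """Toy register file: RAM-like storage plus read/write side effects."""

    ICR = 0x00  # interrupt cause register, read-to-clear (hypothetical)
    TDT = 0x04  # transmit doorbell (hypothetical)

    def __init__(self):
        self.regs = {self.ICR: 0, self.TDT: 0}
        self.tx_queue = []  # frames the driver has prepared
        self.wire = []      # frames that have "left" the chip

    def read(self, offset: int) -> int:
        value = self.regs[offset]
        if offset == self.ICR:
            self.regs[offset] = 0            # side effect: read clears
        return value

    def write(self, offset: int, value: int) -> None:
        self.regs[offset] = value
        if offset == self.TDT:
            while self.tx_queue:             # side effect: send packets
                self.wire.append(self.tx_queue.pop(0))
            self.regs[self.ICR] |= 0x1       # report "transmit done"
```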

Just as important as interpreting the registers correctly is making the asynchronous path up from IO to the cpu predictable. This is unlike the real world, and the difference is very important for a simulator. In the real world, there are two ways of communication between modules, including the cpu, memory, and various IO modules: the synchronous way and the asynchronous way.

synchronous way: From cpu down to IO, all accesses can be predicted. This path is deterministic.
asynchronous way: From IO up to cpu, interrupts happen randomly. Obviously, this path is not deterministic.

In a simulator, we need to handle the asynchronous way semi-deterministically: at the least, we should guarantee that an asynchronous event happens in a timely manner, that is, it is not delayed so much that it drastically changes the behavior of the upper-layer application.

The method is to time all the events coming from IO land. The simulator must already implement some form of central clock. The task here is to relate the IO events to this central clock.

The result is that asynchronous events are changed into synchronous ones. The essential point is that all events in a simulator should be timed. How to do that smartly is the difficult part.
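
One simple (certainly not the smartest) way to relate IO events to a central clock is a time-ordered event queue. The sketch below is an assumption about how such a clock might look, not a description of any real simulator; the class and method names are invented.

```python
import heapq
import itertools


class SimClock:
    """Central clock that fires IO events at deterministic simulated times."""

    def __init__(self):
        self.now = 0
        self._events = []               # heap of (time, sequence, callback)
        self._seq = itertools.count()   # tie-breaker keeps insertion order

    def schedule(self, delay: int, callback) -> None:
        """Ask for callback to run `delay` ticks from now."""
        heapq.heappush(self._events, (self.now + delay, next(self._seq), callback))

    def advance(self, ticks: int) -> None:
        """Move time forward, firing every event that falls due on the way."""
        deadline = self.now + ticks
        while self._events and self._events[0][0] <= deadline:
            when, _, callback = heapq.heappop(self._events)
            self.now = when
            callback()
        self.now = deadline
```

A simulated NIC would call `schedule()` when the driver rings a doorbell, and the cpu model would call `advance()` as it executes, so an interrupt always arrives at the same simulated time on every run.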