Arbitration and Translation, Part 3

This post is the third in a series.  You can see the others here, Part 1 and Part 2.

What is an Arbiter?

In the NT PnP subsystem, an arbiter is an interface that a bus driver can expose which is able to intelligently assign PnP resources of a single specific type (memory, I/O ports, DMA channels, interrupts, bus numbers) to its children.  In general, an arbiter cannot assign resources that it has not claimed from its parent.

The PnP manager itself exposes five arbiters, one for each type listed above.  These arbiters are relatively dumb.  They give out ranges of numbers, with the only criteria being these:

· Is this range free?  If so, you can have it.

· If the range is already claimed, but with the shareable flag, and your claim is marked shareable, you can have it too.

· If any part of the range is already claimed as exclusive, you can’t have it.

These arbiters aren’t bus-specific, but they don’t have to be.  They’re enough to get started.
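The three rules above can be modeled in a few lines.  This is only a sketch, not the PnP manager's code: the names Claim, RangeArbiter and try_claim are made up for illustration, and it assumes that any overlap between two shareable claims is acceptable.

```python
from dataclasses import dataclass, field


@dataclass
class Claim:
    start: int        # first value in the range
    end: int          # last value in the range (inclusive)
    shareable: bool
    owner: str        # hypothetical label for whoever holds the claim


@dataclass
class RangeArbiter:
    """Toy model of a 'dumb' arbiter for one resource type."""
    claims: list = field(default_factory=list)

    def try_claim(self, start, end, shareable, owner):
        for existing in self.claims:
            overlaps = start <= existing.end and existing.start <= end
            # An overlap is tolerated only when both sides are shareable.
            if overlaps and not (existing.shareable and shareable):
                return False
        # Either the range is free or every overlapping claim is shareable.
        self.claims.append(Claim(start, end, shareable, owner))
        return True


arbiter = RangeArbiter()
arbiter.try_claim(0x3F8, 0x3FF, False, "serial")   # free range: granted
arbiter.try_claim(0x3FC, 0x400, True, "other")     # hits an exclusive claim: refused
```

Nothing here knows anything about a bus, which is the point.  Any intelligence about what a range means has to come from a bus-specific arbiter.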

Yesterday, I covered the translator interface and what it does.  The arbiter interface is similar.  Both are about manipulating the resources for child devices and putting them in less domain-specific terms.  The difference between a translator and an arbiter is simply that translator interfaces are sufficient when you cannot really change the resources available to a child device and arbiters are necessary when you can.  Translators are, as you might expect, much simpler.

HALMPS

To illustrate the difference, I want to talk about HALMPS.  This was a HAL that was shipped as part of Windows NT 3.5 through Windows Server 2003.  It might have even shipped in Server 2008.  I don’t remember when it got pulled from the tree. 

It ran on machines that conformed to the Intel Multiprocessor Specification, versions 1.1 through 1.4.  If you’re curious, you can find it here. That spec has been entirely obsoleted by ACPI.  MPS was simple where ACPI is very complex.  But MPS can’t describe a machine that changes configuration dynamically at run time while ACPI can.  As it turns out, this adds a whole lot of complexity.

MPS describes a system in terms of, among other things, the number of local APICs (which deliver interrupts to processors) and I/O APICs (which collect interrupts from devices).  It says which pins on which I/O APICs each PCI device is connected to.  This is actually encoded as device-function-IntPin, and HALMPS represents a device’s “IRQ” in exactly this form.  You can see this in Device Manager on a machine running HALMPS.  The assigned IRQ is just these values all run together.  This was very confusing to many people, as they might see two devices with the same IRQ in Device Manager, but that just meant that those two devices occupied the same slot on two different buses.  They might have been sharing interrupts, or they might not.

The important part of the story here is that the BIOS picked all interrupt-related routing and it was fixed forever at boot.  There aren’t any decisions to make, except one.  The OS gets to pick which of the processors get targeted by a specific I/O APIC input.

When we were gluing PnP onto the side of the NT driver model, during the development of Windows 2000, the existing scheme for choosing a target processor set for a device’s interrupts involved the driver calling HalGetInterruptVector.  The target IDT entries, the processor set mask and the IRQL for the device all had to be chosen there.  Furthermore, if two devices shared interrupts, they had to get the same answer, even if one driver was PnP-aware and one made this obsolete call.  So I left the IRQ-to-IDT mapping code in the HAL.

If a PnP driver made a resource claim for an interrupt, then that claim would make its way toward the root of the PnP tree (see yesterday’s post) and it would reach an interrupt translator at the HAL device node.  The HAL would see the device’s claim, do the math on how the interrupt was routed, including which I/O APIC and which pin on that I/O APIC, and then make an internal call to HalGetInterruptVector, which would choose a target processor set, an IRQL and a vector.  The target processor set (actually the APIC cluster ID) was then encoded in the upper 24 bits of the “Vector” that the device was assigned and the IDT entry was encoded in the lower 8 bits.  This was then presented to the root interrupt arbiter within the PnP manager, where it was claimed.
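To make the packing concrete, here is a small sketch of that encoding.  The function names are invented for this post and are not HAL routines.  The only thing taken from the description above is the layout: cluster ID in the upper 24 bits of a 32-bit value, IDT entry in the lower 8.

```python
def encode_halmps_vector(cluster_id, idt_entry):
    """Pack an APIC cluster ID and an IDT entry into one 32-bit 'Vector'."""
    if not 0 <= idt_entry <= 0xFF:
        raise ValueError("IDT entry must fit in 8 bits")
    if not 0 <= cluster_id <= 0xFFFFFF:
        raise ValueError("cluster ID must fit in 24 bits")
    return (cluster_id << 8) | idt_entry


def decode_halmps_vector(vector):
    """Split a packed 'Vector' back into (cluster_id, idt_entry)."""
    return (vector >> 8) & 0xFFFFFF, vector & 0xFF


# A packed value of 0x151 means cluster 1, IDT entry 0x51.
cluster, idt = decode_halmps_vector(0x151)
```

Read this way, a claim on 0x151 in the root arbiter is a claim on IDT entry 0x51 for the processors in cluster 1.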

Just for fun, I fired up a VM running HALMPS and dumped this out in the debugger.  You can see the relevant parts here:

 0: kd> !translator

DEVNODE 83373ee0 (HTREE\ROOT\0)
  BusNumber Translator
    Resources:    nt!IopTranslatorHandlerCm
    Requirements: nt!IopTranslatorHandlerIo
  Port Translator
    Resources:    nt!IopTranslatorHandlerCm
    Requirements: nt!IopTranslatorHandlerIo
  Memory Translator
    Resources:    nt!IopTranslatorHandlerCm
    Requirements: nt!IopTranslatorHandlerIo
  Dma Translator
    Resources:    nt!IopTranslatorHandlerCm
    Requirements: nt!IopTranslatorHandlerIo
  Interrupt Translator
    Resources:    nt!IopTranslatorHandlerCm
    Requirements: nt!IopTranslatorHandlerIo

    DEVNODE 8336f3c0 (PCI_HAL\PNP0A03\0)
      Interrupt Translator
        Resources:    hal!HalpIrqTranslateResourcesPci
        Requirements: hal!HalpIrqTranslateRequirementsPci

      DEVNODE 833ba008 (PCI\VEN_8086&DEV_7110&SUBSYS_00000000&REV_01\2&ebb567f&0&38)
        Interrupt Translator
          Resources:    hal!HalIrqTranslateResourcesIsa
          Requirements: hal!HalIrqTranslateResourceRequirementsIsa

      DEVNODE 833baee0 (PCI\VEN_8086&DEV_7111&SUBSYS_00000000&REV_01\2&ebb567f&0&39)
        Interrupt Translator
          Resources:    hal!HalIrqTranslateResourcesIsa
          Requirements: hal!HalIrqTranslateResourceRequirementsIsa
0: kd> !arbiter 

  Interrupt Arbiter "RootIRQ" at 808a5620
    Allocated ranges:
      0000000000000000 - 0000000000000000   B   833c0988  (\Driver\PCI_HAL)
      0000000000000001 - 0000000000000001   B   833c0988  (\Driver\PCI_HAL)
       (some lines omitted for brevity)
      000000000000002f - 000000000000002f   B   833c0988  (\Driver\PCI_HAL)
      00000000000000ff - 00000000000000ff   B   833c0988  (\Driver\PCI_HAL)
      0000000000000151 - 0000000000000151       83373250  (atapi)
      0000000000000152 - 0000000000000152   B   8336df10 
      0000000000000161 - 0000000000000161    
        0000000000000161 - 0000000000000161  CB   833c0988  (\Driver\PCI_HAL)
        0000000000000161 - 0000000000000161  CB   8336d838  (Serial)
      0000000000000172 - 0000000000000172       8336da80  (i8042prt)
      0000000000000181 - 0000000000000181       8336d5f0  (Serial)
      0000000000000182 - 0000000000000182       833bd1c0  (i8042prt)
      0000000000000192 - 0000000000000192       8336d3a8  (fdc)
      00000000000001a2 - 00000000000001a2       83373030  (atapi)
      00000000000001b1 - 00000000000001b1    
        00000000000001b1 - 00000000000001b1  CB   833c0988  (\Driver\PCI_HAL)
        00000000000001b1 - 00000000000001b1  CB   833bdf10 
      00000000000001b2 - 00000000000001b2   B   833bd408

HALMACPI

Now let’s contrast that with HALMACPI.  This is the HAL that runs (to this day) on any machine that conforms to the ACPI spec and has more than one processor, which is nearly anything you can go out and buy.

The ACPI spec says a few things about interrupts:

· There are a discrete number of I/O APICs and their base addresses are listed in ACPI tables.

· ISAPnP- or ACPI-enumerated devices are attached to I/O APIC inputs and those attachments are described in the ACPI namespace under each device.  A device can be moved from one input to another by invoking the _SRS method under the device.

· PCI devices are either directly attached to I/O APIC inputs or they are attached to IRQ steering “link nodes” which themselves can be attached to one of a set of I/O APIC inputs.  The set of possible attachments is described under the link node (which is itself sort of a device) in the ACPI namespace.  The exact pin that they are attached to can be changed by invoking the link node’s _SRS method.

This is entirely different from HALMPS.  Now we have a choice about how devices are routed, if the motherboard designer designs the board that way and if the BIOS guy exposes the functionality.  If we want to move one or a group of PCI devices from one IRQ to another, we can.  I put an interrupt arbiter in the ACPI driver, as that was where it was possible, or at least easy, to interact with all the various parts of the ACPI namespace.

An arbiter gets requests like: “Here’s a set of four devices, each of which has a fairly complex set of possible interrupt assignments.  Please find the optimal configuration which satisfies all the requirements.”  When a device needs I/O ports, memory ranges and interrupts, these requests get made by the PnP manager to each type of arbiter simultaneously.  If a fit can be found, all the devices eventually get IRP_MN_START_DEVICE with a resource set that meets their needs.

Note that this problem is NP-complete.  So we don’t look at every possible solution.  There are a bunch of heuristics about which parts of the solution space to look at first and how long to spend looking.

In truth, the NT PnP team came to a fairly painful conclusion after a couple of years of tweaking these algorithms.  (It was painful mostly because it took so long to fully understand the situation.)  The first major truth is that you can no longer add a truly new bus architecture to a PC because Windows 95 (and now many other OSes) only understood PCI.  At the point that a largely-deployed OS that did PnP natively existed, every machine had to expose the interfaces that that OS understood.  Thus we have HyperTransport, PCI Express and lots of internal bus architectures that never got widely published, all of which pretend to be PCI at a PnP level so that they work with old OSes which do PnP natively.

The kicker is that all of those, particularly the chipset-internal ones, have deviations from the PCI spec.  I’ve sat in meetings with chipset designers who said that their devices didn’t have to be PCI-compliant because they were inside of a chipset.  From a hardware guy’s perspective, this makes perfect sense.  It doesn’t have PCI pins, it doesn’t have any PCI logic, so it isn’t PCI.  But, for various reasons, it does have a PCI configuration space.  When I point out to them that there’s no way for the OS to differentiate between these “non-PCI PCI devices” and real PCI devices, they shrug and say that’s not their problem, since the BIOS sets it all up right anyhow.

And that’s the second major truth.  The BIOS sets most or all of it up anyhow.

So the arbiter interface and NT PnP, in general, have a way of asking about how a device was configured by the BIOS.  When a device is first discovered, the PnP manager sends IRP_MN_QUERY_RESOURCES.  This IRP asks the question “what resources is this device using, right now?”  The PCI driver will look at a device’s Base Address Registers and its Interrupt Line register and send that claim back in response.  The PnP manager then calls into the relevant arbiters with the device’s PDO (or a proxy PDO if the driver is an NT4-style non-PnP driver) and claims those ranges unconditionally for the device, with a flag saying that this is a “boot reservation.”  See the ‘B’ in some lines of the debugger dump above, and you’ll see these boot claims.

When the device stack for the device is being built, the PnP manager sends IRP_MN_QUERY_RESOURCE_REQUIREMENTS to ask “what is the set of all possible sets of resources this device could use?”  And once the FDO and filters have been loaded, it sends IRP_MN_FILTER_RESOURCE_REQUIREMENTS to ask “what modifications would you like to make to this claim that the bus driver has generated on your behalf?”

The resulting claim set is sent to the arbiters.  Now those arbiters know what resources the device booted with, if the device was present in the machine at boot time.  So they, for the most part, just choose what the BIOS chose. This is what makes slippery chipsets work just fine.  The BIOS is the expert and NT leaves that alone.
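The “just choose what the BIOS chose” policy can be sketched like this.  It is an illustration only: choose_assignment and its parameters are hypothetical, and the real arbiters deal in ranges and alternatives rather than single values.  The sketch also assumes that is_free treats the device’s own boot reservation as available to that device.

```python
def choose_assignment(candidates, boot_value, is_free):
    """Pick one value from candidates, preferring the boot configuration.

    candidates: values the device says it can use, in preference order
    boot_value: what the BIOS programmed, or None if the device had no boot config
    is_free:    callable that reports whether a value can still be claimed
    """
    # Prefer the BIOS choice whenever it is still a legal, available option.
    if boot_value is not None and boot_value in candidates and is_free(boot_value):
        return boot_value
    # Otherwise fall back to the first candidate that fits.
    for value in candidates:
        if is_free(value):
            return value
    return None


taken = {0xD000}
pick = choose_assignment([0xC000, 0xD000, 0xE000], 0xE000,
                         lambda v: v not in taken)
```

Here the device ends up at 0xE000 even though 0xC000 was listed first and was free, because that is where the BIOS put it.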

Some resource types don’t work this way.  Most notably, there’s no notion of which IRQ a device was connected to at boot time if your machine is running with the APIC enabled.  The BIOS only configures the IRQ routing for the PIC (not APIC) interrupt controller, in preparation for running Windows 98, which never supported APICs.  So the ACPI IRQ arbiter, when running on an APIC system, throws away the boot claims.

Note that the boot claim system has some interesting properties.  There may be conflicts, and sometimes that’s okay.  BIOSes tend to make claims for ACPI-enumerated dummy devices like “Motherboard Resources” when there is a device which must claim some I/O ports but which mostly doesn’t ever get a driver loaded.  The most famous example of this tends to be an SMBus controller.  Most machines don’t run a driver on it, but the BIOS needs to access it in System Management Mode.  So it will claim the ports.  Sometimes, people write drivers for them and then those drivers show up as a conflict with a boot claim.  This is mostly benign.

Message-Signaled Interrupts

Interrupt arbitration tends to be the most complicated part of the system.  Or, at least, it seems that way to me, since I’m still messing around with it almost fifteen years after I first began.  Most of the other arbiters haven’t changed much in years beginning with a ‘2’. 

Devices which can generate Message-Signaled Interrupts don’t need to use an I/O APIC input.  But they can, usually, also use one, particularly if the OS in question doesn’t understand MSI.  With MSI, the interrupt is sent by doing a short busmaster burst involving 32 bits of data to a special address.  The device need not understand the address nor the data.  It just gets told, “when you want to trigger this interrupt, send this blob here.”

The PCI Spec has taken two passes at defining how this should be configured in a device, both of which have proved insufficient for representing the problem at hand.  “MSI” was introduced in PCI 2.2 and it involved writing a single address into the device, and a single data value too.  If the device wanted to send more than one interrupt, it could vary N low-order bits of the data value, at the OS’s discretion.  This meant that the data values were constrained to a naturally aligned range of values, and that range was a power of 2 in length.  (See the PCI Spec for the scary details.)
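The alignment rule is easier to see with numbers.  Below is a sketch of which data values a multi-message MSI device would end up using.  The helper name is invented, and the set of legal message counts (powers of two up to 32) comes from the PCI spec rather than from anything above.

```python
def msi_data_values(base_data, count):
    """List the data values a multi-message MSI device may send.

    base_data: the single data value written into the device
    count:     number of messages granted by the OS
    """
    if count not in (1, 2, 4, 8, 16, 32):
        raise ValueError("MSI message count must be a power of 2, at most 32")
    if base_data % count != 0:
        raise ValueError("base data value must be naturally aligned to count")
    # The device varies only the low-order log2(count) bits of the data.
    return [base_data | i for i in range(count)]


values = msi_data_values(0x40, 4)   # 0x40, 0x41, 0x42, 0x43
```

Every one of those messages goes to the same address, which is what sets up the problem described next.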

Given the way Intel defined the special address/data format in the Software Developer’s Manual, Volume 3, Chapter 8, Section 11 (<www.intel.com/products/processor/manuals/>), the address determines the target processor set.  This means that MSI (as defined in PCI 2.2) can only work if every interrupt targets the same processor or set of processors.  You can’t choose to send one interrupt message to one processor and one to another.

Thus MSI-X was defined in PCI 3.0.  Both still exist, and they’ve been carried into PCI-X and PCI Express.  MSI-X allows each interrupt message to have separate address and data values.  It also allows as many as 2048 messages per PCI function.

Given that the processor-set-to-address mapping was fixed by Intel, virtualization and large numbers of cores are forcing another level of indirection through I/O MMUs, called “VT-d” by Intel and “IOMMU” by AMD.

The fundamental problem here is that the PCI spec never should have tried to define message-signaled interrupts at all.  They just don’t have anything to do with the PCI bus.  Every interesting thing about them is external to the PCI bus.  (Full disclosure:  I didn’t always understand this, and I sat on the committee that defined MSI-X.)  The only thing the PCI spec allows you to do is to have a defined mechanism for telling the device to target a busmaster transaction to a specific address with specific data when the device needs attention. 

There’s no standard mechanism for telling a PCI NIC to send your network data to a specific address, as that’s just part of the definition of the device behavior.  You don’t want to standardize that because it removes degrees of freedom when you want to do it differently in the future.  There shouldn’t be one for interrupts, either, on exactly the same grounds.  I’ll quit ranting now.

What you really need is a way to say, for example, “my device needs to trigger 36 interrupts, two per core in this 16-core machine, plus four more for various housekeeping tasks.”  That’s not really expressible in the PCI capability structs which define MSI and MSI-X, but it is expressible inside of Windows.

Once the PnP manager has assigned IDT vectors, IRQLs, target processors and the lot, you need a way of programming these into the device.  This is expressible in the PCI spec, though it’s redundant in my mind.  Whether the bus driver does it or the function driver does it doesn’t matter much.

Mechanically, it works like this:

1. The PnP manager sends IRP_MN_QUERY_RESOURCE_REQUIREMENTS.  The PCI driver reads the various capability structs and some registry keys that were set during INF processing (since, as we saw above, the PCI spec can’t express everything necessary) and responds to this IRP with some interrupt claims.  Typically, there will be three possibilities expressed in the resultant IO Resource Requirements List: lots of message-signaled interrupts, one message-signaled interrupt and, lastly, one line-based interrupt.

2. The PnP manager builds the rest of the device stack and sends IRP_MN_FILTER_RESOURCE_REQUIREMENTS.  If the device is trying really hard to squeeze out performance by targeting specific interrupts at specific processors, the FDO (usually NDIS or storport, along with the miniport) will “filter” that claim to affinitize certain interrupts to certain cores, and possibly to cut down the total number of messages in the first claim to some multiple of the number of cores actually installed.

3. The PnP manager passes these sets of claims to the interrupt arbiter in the ACPI driver, which looks at them and tries to satisfy them in the order that they’re listed.  If there are enough free IDT entries (and the underlying processor and chipset support MSI at all) then the first claim gets satisfied.  If not, it goes for the single message claim.  If that can’t be satisfied, it will back off to the line-based interrupt, which is usually shared with something else and will almost certainly succeed.

4. The PnP manager translates these resources down to the bus terms.  (See yesterday’s post.)  This involves changing these vector and target processor sets into addresses and data again.  These values end up in your interrupt resources in your raw resource list.

5. The PnP manager translates these “up” into processor-relative terms.  This populates the translated resource list with Vector, Level and Affinity for each interrupt message.

6. The PnP manager sends IRP_MN_START_DEVICE with both lists.  The PCI driver sees the IRP first (since the FDO handles start on the way up, remember) and programs the MSI or MSI-X capability structures, if they exist.  The FDO sees the IRP next, and stores the information for calling IoConnectInterruptEx.  It may use the raw resources to derive address and data values if it likes.
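The fallback in step 3 can be sketched as a walk down the list of alternatives.  This is a simplification with made-up names: the dictionaries stand in for entries in the IO Resource Requirements List, a single counter stands in for per-processor IDT bookkeeping, and the line-based claim is assumed to always succeed, where the text above only says “almost certainly.”

```python
def pick_interrupt_alternative(alternatives, free_idt_entries, msi_supported):
    """Return the first alternative that can be satisfied, in listed order."""
    for alt in alternatives:
        if alt["kind"] == "msi":
            # Message-signaled claims need platform support and enough vectors.
            if msi_supported and alt["count"] <= free_idt_entries:
                return alt
        else:
            # Line-based interrupts are usually shared, so assume success.
            return alt
    return None


requirements = [
    {"kind": "msi", "count": 36},   # lots of message-signaled interrupts
    {"kind": "msi", "count": 1},    # one message-signaled interrupt
    {"kind": "line", "count": 1},   # one line-based interrupt
]
choice = pick_interrupt_alternative(requirements, free_idt_entries=10,
                                    msi_supported=True)
```

With only ten free IDT entries, the 36-message claim fails and the device gets a single message.  On a platform without MSI support, the same list falls all the way through to the line-based interrupt.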

ACPI IRQ Arbiter Dumps

The ACPI IRQ arbiter handles all this by considering a list of things simultaneously.

· Free IDT entries on all the potential cores.

· Free I/O APIC inputs for devices which have some flexibility.

· Whether MSI is available in the processor and the chipset.

· Whether the device has an MSI request.

Since that arbiter is looking across a couple of dimensions simultaneously, dumping it is a little more complicated.  The default debugger command “!arbiter” will show you the IRQ claims.  “!acpiirqarb” will show you the other state.  I’ll walk through these dumps below.

This first dump is of the default arbiter in the PnP manager.  It says that lots of vectors are reserved for internal use and lots of vectors are assigned to ACPI (across every core) for redistribution to other devices.

 0: kd> !arbiter 4

DEVNODE fffffa8003c49d90 (HTREE\ROOT\0)
  Interrupt Arbiter "RootIRQ" at fffff800014aebc0
    Allocated ranges:
      0000000000000000 - 0000000000000000   B   fffffa8003c48e30 
      0000000000000001 - 0000000000000001   B   fffffa8003c48e30 
       (lines omitted)
      000000000000003f - 000000000000003f   B   fffffa8003c48e30 
      0000000000000051 - 0000000000000051       fffffa8003c4dbd0  (ACPI)
       (lines omitted)
      00000000000000bd - 00000000000000bd       fffffa8003c4dbd0  (ACPI)
      00000000000000be - 00000000000000be       fffffa8003c4dbd0  (ACPI)
      00000000000000ff - 00000000000000ff   B   fffffa8003c48e30 
    Possible allocation:
      < none >

This next dump is of the arbiter state of the ACPI IRQ arbiter.  It’s just the “IRQ” part, as that’s what’s done in terms that !arbiter can interpret.

     DEVNODE fffffa8003c394a0 (ACPI_HAL\PNP0C08\0)
      Interrupt Arbiter "ACPI_IRQ" at fffff880010fdfc0
        Allocated ranges:
          0000000000000002 - 0000000000000002   B   fffffa80039a05c0 
          0000000000000008 - 0000000000000008       fffffa80039a0c20 
          0000000000000009 - 0000000000000009 S     fffffa8003c4dbd0  (ACPI)
          000000000000000d - 000000000000000d   B   fffffa80039a07e0 
          0000000000000010 - 0000000000000010 S  
            0000000000000010 - 0000000000000010 S     fffffa800399b060  (pciide)
            0000000000000010 - 0000000000000010 S     fffffa80039b3a20  (usbuhci)
          0000000000000012 - 0000000000000012 S  
            0000000000000012 - 0000000000000012 S     fffffa80039ab060  (pciide)
            0000000000000012 - 0000000000000012 S     fffffa80039b1060  (usbehci)
            0000000000000012 - 0000000000000012 S     fffffa80039ae060  (usbuhci)
          0000000000000013 - 0000000000000013 S  
            0000000000000013 - 0000000000000013 S     fffffa80039ac060  (pciide)
            0000000000000013 - 0000000000000013 S     fffffa80039b2a20  (usbuhci)
            0000000000000013 - 0000000000000013 S     fffffa80039afa20  (usbuhci)
          0000000000000015 - 0000000000000015 S     fffffa80039b2060  (usbuhci)
          0000000000000017 - 0000000000000017 S  
            0000000000000017 - 0000000000000017 S     fffffa80039aea20  (usbehci)
            0000000000000017 - 0000000000000017 S     fffffa80039af060  (usbuhci)
          00000000fffffff9 - 00000000fffffff9       fffffa8003997a20  (pci)
          00000000fffffffa - 00000000fffffffa       fffffa80039b0060  (pci)
          00000000fffffffb - 00000000fffffffb       fffffa80039b1a20  (pci)
          00000000fffffffc - 00000000fffffffc       fffffa80039b0a20  (pci)
          00000000fffffffd - 00000000fffffffd       fffffa80039b7a20  (pci)
          00000000fffffffe - 00000000fffffffe       fffffa80039b7060  (pci)
        Possible allocation:
          < none >

The large numbers for IRQs are placeholders for MSI assignments, which on this machine all belong to PCI Express root ports.

!acpiirqarb tells us about the other internal arbiter state, including IDT assignments on every core and state of the ACPI link nodes, which exist but aren’t used in APIC mode in this machine.  It also details all the I/O APICs in the machine, including the metadata on all the inputs.

The “not on bus” claims are interesting.  They’re the inverse of the IDT entries that got claimed above in the root arbiter.  It means, essentially, that ACPI can’t give them out because it doesn’t own them.

 0: kd> !acpiirqarb


Processor 0 (0, 0):
Device Object: 0000000000000000
Current IDT Allocation:
    0000000000000000 - 0000000000000050       00000000   A:0000000000000000 IRQ:0
    0000000000000061 - 0000000000000061 S  
      0000000000000061 - 0000000000000061 S B   fffffa80039ac060  (pciide)  A:fffff8a00189b230 IRQ:13
      0000000000000061 - 0000000000000061 S B   fffffa80039b2a20  (usbuhci)  A:fffff8a0018d8e90 IRQ:13
      0000000000000061 - 0000000000000061 S B   fffffa80039afa20  (usbuhci)  A:fffff8a0001986b0 IRQ:13
    0000000000000081 - 0000000000000081   D   fffffa80039a05c0   A:fffff8a0018f74e0 IRQ:2
    00000000000000a0 - 00000000000000a1   D   fffffa80039b7a20  (pci)  A:fffff8a0017eaa10 IRQ:fffffffd
    00000000000000a2 - 00000000000000a2 S B   fffffa80039b2060  (usbuhci)  A:fffff8a0006cc860 IRQ:15
    00000000000000b1 - 00000000000000b1 S B   fffffa8003c4dbd0  (ACPI)  A:fffff8a0001120f0 IRQ:9
    00000000000000bf - ffffffffffffffff       00000000   A:0000000000000000 IRQ:10

Possible IDT Allocation:
    < none >


Processor 1 (0, 1):
Device Object: 0000000000000000
Current IDT Allocation:
    0000000000000000 - 0000000000000050       00000000   A:0000000000000000 IRQ:0
    0000000000000061 - 0000000000000061 S  
      0000000000000061 - 0000000000000061 S B   fffffa80039ac060  (pciide)  A:fffff8a00189b230 IRQ:13
      0000000000000061 - 0000000000000061 S B   fffffa80039b2a20  (usbuhci)  A:fffff8a0018d8e90 IRQ:13
      0000000000000061 - 0000000000000061 S B   fffffa80039afa20  (usbuhci)  A:fffff8a0001986b0 IRQ:13
    0000000000000081 - 0000000000000081   D   fffffa80039a05c0   A:fffff8a0018f74e0 IRQ:2
    00000000000000a0 - 00000000000000a1   D   fffffa80039b7a20  (pci)  A:fffff8a0017eaa10 IRQ:fffffffd
    00000000000000a2 - 00000000000000a2 S B   fffffa80039b2060  (usbuhci)  A:fffff8a0006cc860 IRQ:15
    00000000000000b1 - 00000000000000b1 S B   fffffa8003c4dbd0  (ACPI)  A:fffff8a0001120f0 IRQ:9
    00000000000000bf - ffffffffffffffff       00000000   A:0000000000000000 IRQ:0

Possible IDT Allocation:
    < none >


Processor 2 (0, 2):
Device Object: 0000000000000000
Current IDT Allocation:
    0000000000000000 - 0000000000000050       00000000   A:0000000000000000 IRQ:0
    0000000000000061 - 0000000000000061 S  
      0000000000000061 - 0000000000000061 S B   fffffa80039ac060  (pciide)  A:fffff8a00189b230 IRQ:13
      0000000000000061 - 0000000000000061 S B   fffffa80039b2a20  (usbuhci)  A:fffff8a0018d8e90 IRQ:13
      0000000000000061 - 0000000000000061 S B   fffffa80039afa20  (usbuhci)  A:fffff8a0001986b0 IRQ:13
    0000000000000081 - 0000000000000081   D   fffffa80039a05c0   A:fffff8a0018f74e0 IRQ:2
    00000000000000a0 - 00000000000000a1   D   fffffa80039b7a20  (pci)  A:fffff8a0017eaa10 IRQ:fffffffd
    00000000000000a2 - 00000000000000a2 S B   fffffa80039b2060  (usbuhci)  A:fffff8a0006cc860 IRQ:15
    00000000000000b1 - 00000000000000b1 S B   fffffa8003c4dbd0  (ACPI)  A:fffff8a0001120f0 IRQ:9
    00000000000000bf - ffffffffffffffff       00000000   A:0000000000000000 IRQ:0

Possible IDT Allocation:
    < none >


Processor 3 (0, 3):
Device Object: 0000000000000000
Current IDT Allocation:
    0000000000000000 - 0000000000000050       00000000   A:0000000000000000 IRQ:0
    0000000000000061 - 0000000000000061 S  
      0000000000000061 - 0000000000000061 S B   fffffa80039ac060  (pciide)  A:fffff8a00189b230 IRQ:13
      0000000000000061 - 0000000000000061 S B   fffffa80039b2a20  (usbuhci)  A:fffff8a0018d8e90 IRQ:13
      0000000000000061 - 0000000000000061 S B   fffffa80039afa20  (usbuhci)  A:fffff8a0001986b0 IRQ:13
    0000000000000081 - 0000000000000081   D   fffffa80039a05c0   A:fffff8a0018f74e0 IRQ:2
    00000000000000a0 - 00000000000000a1   D   fffffa80039b7a20  (pci)  A:fffff8a0017eaa10 IRQ:fffffffd
    00000000000000a2 - 00000000000000a2 S B   fffffa80039b2060  (usbuhci)  A:fffff8a0006cc860 IRQ:15
    00000000000000b1 - 00000000000000b1 S B   fffffa8003c4dbd0  (ACPI)  A:fffff8a0001120f0 IRQ:9
    00000000000000bf - ffffffffffffffff       00000000   A:0000000000000000 IRQ:0

Possible IDT Allocation:
    < none >


Processor 4 (0, 4):
Device Object: 0000000000000000
Current IDT Allocation:
    0000000000000000 - 0000000000000050       00000000   A:0000000000000000 IRQ:0
    0000000000000051 - 0000000000000051 S  
      0000000000000051 - 0000000000000051 S B   fffffa80039ab060  (pciide)  A:fffff8a0006eaa10 IRQ:12
      0000000000000051 - 0000000000000051 S B   fffffa80039b1060  (usbehci)  A:fffff8a0006d83e0 IRQ:12
      0000000000000051 - 0000000000000051 S B   fffffa80039ae060  (usbuhci)  A:fffff8a0006c5a80 IRQ:12
    0000000000000090 - 0000000000000091   D   fffffa8003997a20  (pci)  A:fffff8a0006edeb0 IRQ:fffffff9
    00000000000000a0 - 00000000000000a0   D   fffffa80039b0060  (pci)  A:fffff8a001909750 IRQ:fffffffa
    00000000000000a1 - 00000000000000a1   D   fffffa80039a0c20   A:fffff8a00187f840 IRQ:8
    00000000000000b0 - 00000000000000b0   D   fffffa80039b0a20  (pci)  A:fffff8a001890b50 IRQ:fffffffc
    00000000000000bf - ffffffffffffffff       00000000   A:0000000000000000 IRQ:0

Possible IDT Allocation:
    < none >


Processor 5 (0, 5):
Device Object: 0000000000000000
Current IDT Allocation:
    0000000000000000 - 0000000000000050       00000000   A:0000000000000000 IRQ:0
    0000000000000051 - 0000000000000051 S  
      0000000000000051 - 0000000000000051 S B   fffffa80039ab060  (pciide)  A:fffff8a0006eaa10 IRQ:12
      0000000000000051 - 0000000000000051 S B   fffffa80039b1060  (usbehci)  A:fffff8a0006d83e0 IRQ:12
      0000000000000051 - 0000000000000051 S B   fffffa80039ae060  (usbuhci)  A:fffff8a0006c5a80 IRQ:12
    0000000000000090 - 0000000000000091   D   fffffa8003997a20  (pci)  A:fffff8a0006edeb0 IRQ:fffffff9
    00000000000000a0 - 00000000000000a0   D   fffffa80039b0060  (pci)  A:fffff8a001909750 IRQ:fffffffa
    00000000000000a1 - 00000000000000a1   D   fffffa80039a0c20   A:fffff8a00187f840 IRQ:8
    00000000000000b0 - 00000000000000b0   D   fffffa80039b0a20  (pci)  A:fffff8a001890b50 IRQ:fffffffc
    00000000000000bf - ffffffffffffffff       00000000   A:0000000000000000 IRQ:0

Possible IDT Allocation:
    < none >


Processor 6 (0, 6):
Device Object: 0000000000000000
Current IDT Allocation:
    0000000000000000 - 0000000000000050       00000000   A:0000000000000000 IRQ:0
    0000000000000051 - 0000000000000051 S  
      0000000000000051 - 0000000000000051 S B   fffffa80039ab060  (pciide)  A:fffff8a0006eaa10 IRQ:12
      0000000000000051 - 0000000000000051 S B   fffffa80039b1060  (usbehci)  A:fffff8a0006d83e0 IRQ:12
      0000000000000051 - 0000000000000051 S B   fffffa80039ae060  (usbuhci)  A:fffff8a0006c5a80 IRQ:12
    0000000000000090 - 0000000000000091   D   fffffa8003997a20  (pci)  A:fffff8a0006edeb0 IRQ:fffffff9
    00000000000000a0 - 00000000000000a0   D   fffffa80039b0060  (pci)  A:fffff8a001909750 IRQ:fffffffa
    00000000000000a1 - 00000000000000a1   D   fffffa80039a0c20   A:fffff8a00187f840 IRQ:8
    00000000000000b0 - 00000000000000b0   D   fffffa80039b0a20  (pci)  A:fffff8a001890b50 IRQ:fffffffc
    00000000000000bf - ffffffffffffffff       00000000   A:0000000000000000 IRQ:0

Possible IDT Allocation:
    < none >


Processor 7 (0, 7):
Device Object: 0000000000000000
Current IDT Allocation:
    0000000000000000 - 0000000000000050       00000000   A:0000000000000000 IRQ:0
    0000000000000051 - 0000000000000051 S  
      0000000000000051 - 0000000000000051 S B   fffffa80039ab060  (pciide)  A:fffff8a0006eaa10 IRQ:12
      0000000000000051 - 0000000000000051 S B   fffffa80039b1060  (usbehci)  A:fffff8a0006d83e0 IRQ:12
      0000000000000051 - 0000000000000051 S B   fffffa80039ae060  (usbuhci)  A:fffff8a0006c5a80 IRQ:12
    0000000000000090 - 0000000000000091   D   fffffa8003997a20  (pci)  A:fffff8a0006edeb0 IRQ:fffffff9
    00000000000000a0 - 00000000000000a0   D   fffffa80039b0060  (pci)  A:fffff8a001909750 IRQ:fffffffa
    00000000000000a1 - 00000000000000a1   D   fffffa80039a0c20   A:fffff8a00187f840 IRQ:8
    00000000000000b0 - 00000000000000b0   D   fffffa80039b0a20  (pci)  A:fffff8a001890b50 IRQ:fffffffc
    00000000000000bf - ffffffffffffffff       00000000   A:0000000000000000 IRQ:0

Possible IDT Allocation:
    < none >


Processor 8 (0, 8):
Device Object: 0000000000000000
Current IDT Allocation:
    0000000000000000 - 0000000000000050       00000000   A:0000000000000000 IRQ:0
    0000000000000071 - 0000000000000071 S  
      0000000000000071 - 0000000000000071 S B   fffffa800399b060  (pciide)  A:fffff8a000154d20 IRQ:10
      0000000000000071 - 0000000000000071 S B   fffffa80039b3a20  (usbuhci)  A:fffff8a0000a0b20 IRQ:10
    0000000000000091 - 0000000000000091   D   fffffa80039a07e0   A:fffff8a00193a8a0 IRQ:d
    00000000000000a0 - 00000000000000a0   D   fffffa80039b1a20  (pci)  A:fffff8a00193a870 IRQ:fffffffb
    00000000000000b0 - 00000000000000b1   D   fffffa80039b7060  (pci)  A:fffff8a00197a3e0 IRQ:fffffffe
    00000000000000b2 - 00000000000000b2 S  
      00000000000000b2 - 00000000000000b2 S B   fffffa80039aea20  (usbehci)  A:fffff8a00197a3b0 IRQ:17
      00000000000000b2 - 00000000000000b2 S B   fffffa80039af060  (usbuhci)  A:fffff8a0011c5460 IRQ:17
    00000000000000bf - ffffffffffffffff       00000000   A:0000000000000000 IRQ:0

Possible IDT Allocation:
    < none >


Processor 9 (0, 9):
Device Object: 0000000000000000
Current IDT Allocation:
    0000000000000000 - 0000000000000050       00000000   A:0000000000000000 IRQ:0
    0000000000000071 - 0000000000000071 S  
      0000000000000071 - 0000000000000071 S B   fffffa800399b060  (pciide)  A:fffff8a000154d20 IRQ:10
      0000000000000071 - 0000000000000071 S B   fffffa80039b3a20  (usbuhci)  A:fffff8a0000a0b20 IRQ:10
    0000000000000091 - 0000000000000091   D   fffffa80039a07e0   A:fffff8a00193a8a0 IRQ:d
    00000000000000a0 - 00000000000000a0   D   fffffa80039b1a20  (pci)  A:fffff8a00193a870 IRQ:fffffffb
    00000000000000b0 - 00000000000000b1   D   fffffa80039b7060  (pci)  A:fffff8a00197a3e0 IRQ:fffffffe
    00000000000000b2 - 00000000000000b2 S  
      00000000000000b2 - 00000000000000b2 S B   fffffa80039aea20  (usbehci)  A:fffff8a00197a3b0 IRQ:17
      00000000000000b2 - 00000000000000b2 S B   fffffa80039af060  (usbuhci)  A:fffff8a0011c5460 IRQ:17
    00000000000000bf - ffffffffffffffff       00000000   A:0000000000000000 IRQ:0

Possible IDT Allocation:
    < none >


Processor 10 (0, 10):
Device Object: 0000000000000000
Current IDT Allocation:
    0000000000000000 - 0000000000000050       00000000   A:0000000000000000 IRQ:0
    0000000000000071 - 0000000000000071 S  
      0000000000000071 - 0000000000000071 S B   fffffa800399b060  (pciide)  A:fffff8a000154d20 IRQ:10
      0000000000000071 - 0000000000000071 S B   fffffa80039b3a20  (usbuhci)  A:fffff8a0000a0b20 IRQ:10
    0000000000000091 - 0000000000000091   D   fffffa80039a07e0   A:fffff8a00193a8a0 IRQ:d
    00000000000000a0 - 00000000000000a0   D   fffffa80039b1a20  (pci)  A:fffff8a00193a870 IRQ:fffffffb
    00000000000000b0 - 00000000000000b1   D   fffffa80039b7060  (pci)  A:fffff8a00197a3e0 IRQ:fffffffe
    00000000000000b2 - 00000000000000b2 S  
      00000000000000b2 - 00000000000000b2 S B   fffffa80039aea20  (usbehci)  A:fffff8a00197a3b0 IRQ:17
      00000000000000b2 - 00000000000000b2 S B   fffffa80039af060  (usbuhci)  A:fffff8a0011c5460 IRQ:17
    00000000000000bf - ffffffffffffffff       00000000   A:0000000000000000 IRQ:0

Possible IDT Allocation:
    < none >


Processor 11 (0, 11):
Device Object: 0000000000000000
Current IDT Allocation:
    0000000000000000 - 0000000000000050       00000000   A:0000000000000000 IRQ:0
    0000000000000071 - 0000000000000071 S  
      0000000000000071 - 0000000000000071 S B   fffffa800399b060  (pciide)  A:fffff8a000154d20 IRQ:10
      0000000000000071 - 0000000000000071 S B   fffffa80039b3a20  (usbuhci)  A:fffff8a0000a0b20 IRQ:10
    0000000000000091 - 0000000000000091   D   fffffa80039a07e0   A:fffff8a00193a8a0 IRQ:d
    00000000000000a0 - 00000000000000a0   D   fffffa80039b1a20  (pci)  A:fffff8a00193a870 IRQ:fffffffb
    00000000000000b0 - 00000000000000b1   D   fffffa80039b7060  (pci)  A:fffff8a00197a3e0 IRQ:fffffffe
    00000000000000b2 - 00000000000000b2 S  
      00000000000000b2 - 00000000000000b2 S B   fffffa80039aea20  (usbehci)  A:fffff8a00197a3b0 IRQ:17
      00000000000000b2 - 00000000000000b2 S B   fffffa80039af060  (usbuhci)  A:fffff8a0011c5460 IRQ:17
    00000000000000bf - ffffffffffffffff       00000000   A:0000000000000000 IRQ:0

Possible IDT Allocation:
    < none >

Interrupt Controller (Inputs: 0x0-0x17  Dev: 0000000000000000):
       (00)Cur:IDT-00 Ref-0 edg hi   Pos:IDT-00 Ref-0 edg hi 
       (01)Cur:IDT-00 Ref-0 edg hi   Pos:IDT-00 Ref-0 edg hi 
       (02)Cur:IDT-81 Ref-1 edg hi   Pos:IDT-00 Ref-0 edg hi 
       (03)Cur:IDT-00 Ref-0 edg hi   Pos:IDT-00 Ref-0 edg hi 
       (04)Cur:IDT-00 Ref-0 edg hi   Pos:IDT-00 Ref-0 edg hi 
       (05)Cur:IDT-00 Ref-0 edg hi   Pos:IDT-00 Ref-0 edg hi 
       (06)Cur:IDT-00 Ref-0 edg hi   Pos:IDT-00 Ref-0 edg hi 
       (07)Cur:IDT-00 Ref-0 edg hi   Pos:IDT-00 Ref-0 edg hi 
       (08)Cur:IDT-a1 Ref-1 edg hi   Pos:IDT-00 Ref-0 edg hi 
       (09)Cur:IDT-b1 Ref-1 lev hi   Pos:IDT-00 Ref-0 edg hi 
       (0a)Cur:IDT-00 Ref-0 edg hi   Pos:IDT-00 Ref-0 edg hi 
       (0b)Cur:IDT-00 Ref-0 edg hi   Pos:IDT-00 Ref-0 edg hi 
       (0c)Cur:IDT-00 Ref-0 edg hi   Pos:IDT-00 Ref-0 edg hi 
       (0d)Cur:IDT-91 Ref-1 edg hi   Pos:IDT-00 Ref-0 edg hi 
       (0e)Cur:IDT-00 Ref-0 edg hi   Pos:IDT-00 Ref-0 edg hi 
       (0f)Cur:IDT-00 Ref-0 edg hi   Pos:IDT-00 Ref-0 edg hi 
       (10)Cur:IDT-71 Ref-2 lev low  Pos:IDT-00 Ref-0 edg hi 
       (11)Cur:IDT-00 Ref-0 edg hi   Pos:IDT-00 Ref-0 edg hi 
       (12)Cur:IDT-51 Ref-3 lev low  Pos:IDT-00 Ref-0 edg hi 
       (13)Cur:IDT-61 Ref-3 lev low  Pos:IDT-00 Ref-0 edg hi 
       (14)Cur:IDT-00 Ref-0 edg hi   Pos:IDT-00 Ref-0 edg hi 
       (15)Cur:IDT-a2 Ref-1 lev low  Pos:IDT-00 Ref-0 edg hi 
       (16)Cur:IDT-00 Ref-0 edg hi   Pos:IDT-00 Ref-0 edg hi 
       (17)Cur:IDT-b2 Ref-2 lev low  Pos:IDT-00 Ref-0 edg hi 

Link Node: LNKA
       Current   IRQ: 0x0 - 0 reference(s)
       Possible  IRQ: 0x0 - 0 reference(s)
       Preferred IRQ: 0xffffffff - ResourceOverride (IO_List) 0000000000000000

Link Node: LNKB
       Current   IRQ: 0x0 - 0 reference(s)
       Possible  IRQ: 0x0 - 0 reference(s)
       Preferred IRQ: 0xffffffff - ResourceOverride (IO_List) 0000000000000000

Link Node: LNKC
       Current   IRQ: 0x0 - 0 reference(s)
       Possible  IRQ: 0x0 - 0 reference(s)
       Preferred IRQ: 0xffffffff - ResourceOverride (IO_List) 0000000000000000

Link Node: LNKD
       Current   IRQ: 0x0 - 0 reference(s)
       Possible  IRQ: 0x0 - 0 reference(s)
       Preferred IRQ: 0xffffffff - ResourceOverride (IO_List) 0000000000000000

Link Node: LNKE
       Current   IRQ: 0x0 - 0 reference(s)
       Possible  IRQ: 0x0 - 0 reference(s)
       Preferred IRQ: 0xffffffff - ResourceOverride (IO_List) 0000000000000000

Link Node: LNKF
       Current   IRQ: 0x0 - 0 reference(s)
       Possible  IRQ: 0x0 - 0 reference(s)
       Preferred IRQ: 0xffffffff - ResourceOverride (IO_List) 0000000000000000

Link Node: LNKG
       Current   IRQ: 0x0 - 0 reference(s)
       Possible  IRQ: 0x0 - 0 reference(s)
       Preferred IRQ: 0xffffffff - ResourceOverride (IO_List) 0000000000000000

Link Node: LNKH
       Current   IRQ: 0x0 - 0 reference(s)
       Possible  IRQ: 0x0 - 0 reference(s)
       Preferred IRQ: 0xffffffff - ResourceOverride (IO_List) 0000000000000000

In conclusion, arbitration is complicated and we keep adjusting it.  Windows 7 actually added a little bit of knowledge about VT-d to interrupt arbitration so that we could easily go beyond 64 cores. 

People have been asking us for years to document the interfaces so that driver writers outside Microsoft could write their own arbiters.  This would be most useful for “converged NICs,” where a single PCI function exposes a bus driver, which in turn exposes a NIC, an RDMA device, an iSCSI initiator and/or an FCoE HBA.  These bus drivers jump through many hoops to do second-level interrupt dispatch for their children, which they wouldn’t have to do if they could write an interrupt arbiter.
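To make "second-level interrupt dispatch" concrete, here is a rough sketch of its shape, written in Python only for readability. Real bus drivers do this in kernel-mode C with ISRs and DPCs. Everything here is hypothetical: the `ParentBusDriver` class, the idea of a per-child bit in a "cause register," and the child names are assumptions for illustration, not a description of any actual converged NIC driver.

```python
class ParentBusDriver:
    """Hypothetical parent that owns one interrupt and fans it out to children."""

    def __init__(self):
        # Maps a bit position in the (assumed) cause register to a child handler.
        self.handlers = {}

    def register_child(self, cause_bit, handler):
        self.handlers[cause_bit] = handler

    def on_interrupt(self, cause_register):
        """First-level handler: decode the cause bits, then call each child."""
        serviced = []
        for bit, handler in sorted(self.handlers.items()):
            if cause_register & (1 << bit):
                handler()
                serviced.append(bit)
        return serviced


# Three hypothetical children sharing the parent's single interrupt.
log = []
bus = ParentBusDriver()
bus.register_child(0, lambda: log.append("nic"))
bus.register_child(1, lambda: log.append("rdma"))
bus.register_child(2, lambda: log.append("iscsi"))

# Bits 0 and 2 are set, so only the NIC and iSCSI handlers run.
serviced = bus.on_interrupt(0b101)
```

The point of the sketch is the extra hop: every child interrupt goes through the parent's handler first.  With an interrupt arbiter, each child could be handed its own interrupt and the parent would drop out of the path.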

It’s particularly difficult, though, to do interrupt arbitration in a distributed manner.  I/O port or memory arbitration can be done locally on the bus related to the device.  But interrupts are often run as side-band signals straight from one part of the motherboard to another.  It’s difficult to prove that you can make this code work if it’s decentralized.

We wrote a simple bus driver that claims resources and doles them out to its children.  It’s called “MF.sys” and it works so long as the resources you need for one child are completely disjoint from the resources you need for another child.  This tends not to be the case with converged NICs.  Some register or some interrupt gets used for some shared purpose.
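The disjointness condition is easy to state as a check.  The sketch below is not how MF.sys is implemented; it only illustrates the condition, under the assumption that each child's needs can be written as inclusive ranges tagged with a resource type.  The child names, ranges, and function names are all made up.

```python
from itertools import combinations


def ranges_overlap(a, b):
    """True if two inclusive (start, end) ranges share at least one value."""
    return a[0] <= b[1] and b[0] <= a[1]


def find_conflicts(children):
    """Return (child_a, child_b, kind) for every pair of children that
    need overlapping ranges of the same resource kind."""
    conflicts = []
    for (name_a, res_a), (name_b, res_b) in combinations(children.items(), 2):
        for kind_a, range_a in res_a:
            for kind_b, range_b in res_b:
                if kind_a == kind_b and ranges_overlap(range_a, range_b):
                    conflicts.append((name_a, name_b, kind_a))
    return conflicts


# Hypothetical converged NIC: the memory windows are disjoint, but the
# RDMA and iSCSI children both want interrupt 1.
children = {
    "nic":   [("memory", (0x0000, 0x0FFF)), ("interrupt", (0, 0))],
    "rdma":  [("memory", (0x1000, 0x1FFF)), ("interrupt", (1, 1))],
    "iscsi": [("memory", (0x2000, 0x2FFF)), ("interrupt", (1, 1))],
}

conflicts = find_conflicts(children)
```

Here `conflicts` comes back non-empty because of the shared interrupt, which is the kind of layout that a purely disjoint split cannot express.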

For now, though, the best answer I can give is that all this information is mostly useful for debugging.

- Jake Oshins
