Hyper-V Virtual Switch Architecture
Hyper-V Virtual Switch (also referred to as VMSWITCH) is the foundational component of network device virtualization in Hyper-V. It powers some of the largest data centers in the world, including Windows Azure. In this post, I will talk about its high-level architecture.
Standards Compliant Virtual Switch
As I mentioned in my previous post, the main goal in building VMSWITCH was to build a standards-compliant, high-performance virtual switch. There are many ways to interpret "standards compliant" because there are many standards. To be specific, our goal was to build packet forwarding based on 802.1Q and to mimic physical network semantics (such as link up/down) as closely as possible. This was to make sure that whatever works in a physical network works in the virtual environment as well. We defined three main objects in VMSWITCH: vSwitch, vPort and NIC. A vSwitch is an instance of a virtual switch that provides packet forwarding and other switch features such as QoS and ACLs. A vPort is analogous to a physical switch port and holds the configuration for those features. Finally, a NIC object acts as the endpoint that connects to a vPort. This is similar to a physical network, where a host has a physical NIC that connects to a physical port on a physical switch.
The clear separation of these objects allowed us to model our switch design and code independently of the type of network device we virtualize (e.g. legacy NIC versus virtual NIC), and to support network semantics similar to those of a physical network. For example, by disconnecting a NIC from a vPort, we could generate a link down notification. It also gave us the flexibility to support various types of vSwitch (discussed later) without requiring any changes in our core VMSWITCH implementation.
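To make the object model concrete, here is a minimal Python sketch of the three objects and of how disconnecting a NIC from a vPort could surface as a link down notification. The class names, fields and the `on_link_change` callback are my own illustration, not the actual VMSWITCH data structures.

```python
from dataclasses import dataclass, field
from enum import Enum
from typing import List, Optional


class NicKind(Enum):
    PHYSICAL = "physical"
    HOST_VIRTUAL = "host-virtual"
    VM_VIRTUAL = "vm-virtual"
    VM_LEGACY = "vm-legacy"


@dataclass
class Nic:
    name: str
    kind: NicKind
    link_events: List[str] = field(default_factory=list)

    def on_link_change(self, up: bool) -> None:
        # Hypothetical hook: record the notification the endpoint would see.
        self.link_events.append("link-up" if up else "link-down")


@dataclass
class VPort:
    name: str
    nic: Optional[Nic] = None

    def connect(self, nic: Nic) -> None:
        if self.nic is not None:
            raise ValueError(f"{self.name} already has a NIC connected")
        self.nic = nic
        nic.on_link_change(up=True)

    def disconnect(self) -> None:
        if self.nic is None:
            return
        nic, self.nic = self.nic, None
        nic.on_link_change(up=False)


@dataclass
class VSwitch:
    name: str
    ports: List[VPort] = field(default_factory=list)

    def create_port(self, name: str) -> VPort:
        port = VPort(name)
        self.ports.append(port)
        return port


switch = VSwitch("vs0")
port = switch.create_port("p0")
nic = Nic("vm1-nic", NicKind.VM_VIRTUAL)
port.connect(nic)
port.disconnect()
print(nic.link_events)  # ['link-up', 'link-down']
```

Note that neither `VPort` nor `VSwitch` inspects `nic.kind`, which is the point of the separation: the switch logic does not depend on what sits at the other end of the port.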
The picture below shows an example setup: multiple vSwitches are created, each with a different number of vPorts, and the vPorts connect to different types of NICs.
This flexibility allowed the creation of different types of vSwitch in the Hyper-V UI. If you use the Hyper-V UI, you will notice that the vSwitch creation page offers three types of vSwitch: External, Internal and Private. The external switch provides connectivity via a physical NIC, the internal switch only provides connectivity to the root partition via a host virtual NIC, and the private switch only provides inter-VM connectivity. The key here is that in the VMSWITCH implementation there is no vSwitch type; every vSwitch is the same. The UI classifies a vSwitch based on the types of NICs connected to its vPorts. This kept the VMSWITCH design simpler while allowing us to provide a friendlier UX for our customers.
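The classification can be sketched as a pure function over the set of connected NIC types. This is only my reading of the three descriptions above; the exact rule the UI applies, the string constants, and the treatment of a switch with no NICs connected are all assumptions.

```python
from typing import Iterable

# Hypothetical labels for the NIC types; not identifiers from VMSWITCH.
PHYSICAL = "physical"
HOST_VIRTUAL = "host-virtual"
VM_VIRTUAL = "vm-virtual"
VM_LEGACY = "vm-legacy"


def classify_vswitch(connected_nic_kinds: Iterable[str]) -> str:
    """Derive a UI label from the NIC types connected to a vSwitch."""
    kinds = set(connected_nic_kinds)
    if PHYSICAL in kinds:
        return "External"   # reaches the physical network
    if HOST_VIRTUAL in kinds:
        return "Internal"   # reaches the root partition only
    return "Private"        # VM NICs only (assumed default when empty)


print(classify_vswitch([PHYSICAL, HOST_VIRTUAL, VM_VIRTUAL]))  # External
print(classify_vswitch([HOST_VIRTUAL, VM_VIRTUAL]))            # Internal
print(classify_vswitch([VM_VIRTUAL, VM_LEGACY]))               # Private
```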
Types of NIC Objects
There are four types of NIC objects supported in VMSWITCH and these are described below.
Physical NIC: Also referred to as a Protocol NIC, this is the object that represents a physical NIC in VMSWITCH. VMSWITCH uses this type of NIC to send packets to and receive packets from the physical network. Please note that the code that handles I/O on a NIC abstracts the NIC object and hides NIC-specific behavior. All other code in VMSWITCH is agnostic to the type of NIC and operates on a NIC as a generic object.
Host Virtual NIC: This is a virtual NIC that is created in the host (or root partition) of Hyper-V. It allows the host to communicate with virtual machines or the physical network. This type of NIC is especially useful in a single-NIC setup, but it is also used where multiple NICs are teamed together and all I/O flows via VMSWITCH, such as two 10 Gbps NICs teamed together in a server.
VM Virtual NIC: This type of NIC is created in the VM over a virtual bus (called VMBUS). It is only available on operating systems that support VMBUS. Such operating systems are also referred to as enlightened operating systems. This NIC provides high-performance networking to VMs.
VM Legacy NIC: This type of NIC is created in a VM by emulating a real hardware NIC (Intel DEC 21140A adapter). It is provided to support unenlightened operating systems as well as PXE boot. The PXE boot issue is solved in recent Hyper-V releases by an enlightened UEFI; however, this NIC is still supported for unenlightened operating systems.
Control and Data Plane
The control interface to VMSWITCH is via IOCTLs, the standard mechanism in Windows for user-mode applications to interface with kernel-mode components. The management application invokes various IOCTLs to carry out operations such as creating a vSwitch or connecting a vPort to a NIC.
The data plane interface is specific to each type of NIC that connects to a vPort. The overall control and data plane interactions are shown in the picture below.
This picture shows a vSwitch with four vPorts, each connected to a different type of NIC. The control plane interaction happens via IOCTLs. The picture also shows the internal representation of the various NICs in VMSWITCH. Data transfer happens via an interface specific to the type of NIC.
A physical NIC connected to VMSWITCH is represented as an NDIS protocol and does packet I/O via the NDIS interfaces defined for a protocol driver. The host vNIC is represented by an NDIS miniport in VMSWITCH and does packet I/O via the NDIS interfaces defined for a miniport driver. The legacy NIC is supported by emulation running in the VMWP process and is represented in VMSWITCH as a direct I/O NIC. It uses IRP_MJ_READ and IRP_MJ_WRITE to do packet I/O with VMSWITCH. Lastly, a virtual NIC in the VM is represented as a VM NIC in VMSWITCH and uses VMBUS interfaces for packet I/O.
Core Packet Processing
The core VMSWITCH packet processing logic has three layers: Ingress, Forwarding and Egress. In the initial version, we called them Source, Route and Destination. In the source processing step, we applied the policies specific to the source vPort to the packet. If the packet was not dropped, we did the forwarding processing to compute the packet's list of destination vPorts. Then, for each destination, we applied the policies specific to that destination vPort. Packets that were not dropped by these policies were delivered to their destination vPort(s).
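The three steps can be sketched in a few lines of Python. This is a toy model under my own assumptions: policies are plain predicates, and the forwarding step is a static MAC table lookup that floods on a miss. The real forwarding logic and policy types in VMSWITCH are not shown in this post, and every name below is hypothetical.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass(frozen=True)
class Packet:
    src_port: str   # vPort the packet arrived on
    dst_mac: str    # destination MAC address
    vlan: int       # 802.1Q VLAN ID


# A policy returns True to allow the packet and False to drop it.
Policy = Callable[[Packet], bool]


class Pipeline:
    def __init__(self) -> None:
        self.ports: List[str] = []
        self.mac_table: Dict[str, str] = {}          # MAC -> vPort name
        self.ingress: Dict[str, List[Policy]] = {}   # per source vPort
        self.egress: Dict[str, List[Policy]] = {}    # per destination vPort

    def add_port(self, port: str, mac: str) -> None:
        self.ports.append(port)
        self.mac_table[mac] = port
        self.ingress[port] = []
        self.egress[port] = []

    def _forward(self, pkt: Packet) -> List[str]:
        # Assumed rule: a known MAC maps to one vPort; anything else is
        # flooded. The source vPort is never a destination.
        port = self.mac_table.get(pkt.dst_mac)
        candidates = [port] if port is not None else self.ports
        return [p for p in candidates if p != pkt.src_port]

    def process(self, pkt: Packet) -> List[str]:
        # Ingress: apply the source vPort's policies.
        if not all(allow(pkt) for allow in self.ingress.get(pkt.src_port, [])):
            return []
        # Forwarding, then egress policies for each destination vPort.
        return [
            dst for dst in self._forward(pkt)
            if all(allow(pkt) for allow in self.egress.get(dst, []))
        ]


pipeline = Pipeline()
pipeline.add_port("p1", "aa:aa")
pipeline.add_port("p2", "bb:bb")
pipeline.add_port("p3", "cc:cc")
pipeline.egress["p3"].append(lambda pkt: pkt.vlan == 10)

print(pipeline.process(Packet("p1", "bb:bb", vlan=20)))  # ['p2']
print(pipeline.process(Packet("p1", "ff:ff", vlan=20)))  # ['p2'], p3 drops VLAN 20
```

`process` returns the list of vPorts the packet would be delivered to, so a drop at ingress and a drop at every egress both show up as an empty list.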
All NIC entry points to VMSWITCH do the interface-specific processing on the packets and convert them from the interface-specific format to a common format that is fed to the VMSWITCH packet processing pipeline. The VMSWITCH packet processing code is largely independent of the type of NIC, except where offload handling is needed, since offloads are specific to the type of NIC. The modularity of this architecture made it easy for us to move VMSWITCH to Extensible VMSWITCH during the Windows Server 2012 release with minimal changes to our core packet processing logic.
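As a rough illustration of that conversion step, the sketch below has one small adapter per interface, each producing the same `CommonPacket` type. The input shapes are deliberately simplified stand-ins (a list of byte buffers for NDIS, a flat buffer for direct I/O, a dict for a VMBUS message); they are assumptions for the example and do not reflect the real NDIS, IRP or VMBUS structures.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class CommonPacket:
    source_port: str
    frame: bytes


def from_ndis(source_port: str, buffers: List[bytes]) -> CommonPacket:
    # Stand-in for NDIS input: one frame split across chained buffers.
    return CommonPacket(source_port, b"".join(buffers))


def from_direct_io(source_port: str, write_buffer: bytes) -> CommonPacket:
    # Stand-in for a legacy NIC write: one flat buffer per frame.
    return CommonPacket(source_port, bytes(write_buffer))


def from_vmbus(source_port: str, message: dict) -> CommonPacket:
    # Stand-in for a VMBUS message: hypothetical dict with a "data" field.
    return CommonPacket(source_port, bytes(message["data"]))


def frame_length(packet: CommonPacket) -> int:
    # Code past the entry points only ever sees CommonPacket.
    return len(packet.frame)


packets = [
    from_ndis("p1", [b"\x01\x02", b"\x03"]),
    from_direct_io("p2", b"\x01\x02\x03"),
    from_vmbus("p3", {"data": b"\x01\x02\x03"}),
]
print([frame_length(p) for p in packets])  # [3, 3, 3]
```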
Performance
The packet processing pipeline of VMSWITCH is largely lock free, allowing for parallel processing of packets. By using features such as VMQ, VMSWITCH is able to utilize any available CPU for packet processing and to avoid serialization as much as possible. Most operations are lock free; where a lock is needed, it is either held for a short duration or replaced by interlocked operations. As an example of a lock-free operation, VMSWITCH maintains various packet counters on vSwitch, vPort and NIC objects. Updating these counters atomically can add significant CPU cycles to the overall packet processing cost. To reduce (or almost eliminate) this overhead, we use per-processor packet counters that are updated at DISPATCH_LEVEL, which makes the updates lock free. A per-processor counter design makes querying packet counters less efficient, because we now have to add up all the per-processor counters. However, since counters are queried far less often than they are updated on the data path, the optimization helps where it matters: the critical path, i.e. the core packet processing. Similar techniques are used to make sure that overall packet processing is highly parallel and efficient. This, coupled with offload support, provides the performance needed by some of the most demanding workloads running in VMs.
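The per-processor counter idea can be sketched as follows. Python cannot pin code to a processor or raise IRQL, so in this sketch the caller passes the processor index explicitly; in the kernel, running at DISPATCH_LEVEL is what guarantees the code stays on its processor while it updates its own slot. The class and method names are hypothetical.

```python
from typing import List


class PerProcessorCounter:
    def __init__(self, processor_count: int) -> None:
        if processor_count <= 0:
            raise ValueError("processor_count must be positive")
        # One slot per processor. A real implementation would also pad
        # the slots so they do not share a cache line.
        self._slots: List[int] = [0] * processor_count

    def add(self, processor: int, amount: int = 1) -> None:
        # Hot path: only code running on `processor` touches this slot,
        # so no lock or interlocked operation is needed.
        self._slots[processor] += amount

    def total(self) -> int:
        # Slow path: a query has to sum every per-processor slot.
        return sum(self._slots)


packets_received = PerProcessorCounter(processor_count=4)
packets_received.add(processor=0)
packets_received.add(processor=3, amount=5)
print(packets_received.total())  # 6
```

The trade-off is visible in the two methods: `add` touches a single slot, while `total` walks all of them, which is acceptable because queries are rare compared to data path updates.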
This is all I have for today. In future posts I will talk about VMSWITCH offload support, the extensibility model and more.