Private Cloud Infrastructure as a Service Fabric Management




1 Introduction

Fabric Management is the toolset responsible for managing workloads across virtual hosts, virtual networks, and storage. Fabric Management provides the orchestration necessary to manage the life cycle of a consumer's workload. In delivering Infrastructure as a Service capability in a Private Cloud, the emphasis is on services and service management. Services must be defined by the architect in a manner consistent with the Service Operations Layer processes and modeled in a form that can be consumed by the Management Layer toolset.

Before we can model services effectively to realize the benefits of private cloud computing, the elements of Infrastructure as a Service must be abstracted into artifacts that can be modeled at design time. Capturing the design in a form that can be persisted allows it to be submitted into the organization's change control process and validated, through peer review and/or systematic checks, to verify conformance to operational and business governance and compliance requirements.

Validated design models may then be transformed into service definition templates that are presented to the Management Layer toolset, which implements the definition at run time to deploy services from the template. Fabric Management is responsible for maintaining the resources required to resolve the logical abstractions in a service definition into physical resources, which are allocated and deployed to meet a Service Management request for Infrastructure Services.

This article discusses the service definitions and resource abstractions that enable Fabric Management to respond to Service Management requests for infrastructure in a Private Cloud.

2 Service Definition Basics

The artifacts that make up a service are designed by the architect at design-time and captured in the form of a service definition or model. The service definition consists of the logical representation of the resources that make up the service and how each should be configured and sequenced. Configuration is represented as properties of a resource in the service definition.

A simple definition may include a single compute resource that is associated with a specific type of storage and attached to a specific network. This definition includes three obvious resources associated with the service:

  1. Compute Resource
  2. Storage Resource
  3. Network Resource

The private cloud fabric management toolset must resolve the logical service definition configuration into actual physical resources that can be allocated and deployed in the infrastructure from shared resources. As currently represented, the Management Layer would have wide discretion over how the service is deployed: a compute resource could be allocated from any shared pool, imaged with any operating system, and associated with any storage and network. The problem with this simple model is that not enough information has been defined in the service definition to consider it a valid, well-formed model.

To further refine the service definition, we must provide configuration properties on each resource that are meaningful enough to facilitate review and validation of the model at design time, and representative enough of the runtime configuration of the private cloud that the Management Layer tooling can resolve the defined configuration into physical resources in the infrastructure.

An expanded definition may include the same resources with additional properties:

  1. Compute Resource:
    1. Location: North America
    2. Group: Manufacturing
    3. Type: Medium
  2. Storage Resource:
    1. Group: Manufacturing
    2. Type: (Resilient, Fast)
    3. Size: 500 GB
  3. Network Resource:
    1. Subnet: Production

This expanded definition constrains the Management Layer to provisioning a Medium compute resource from a Manufacturing resource group located in a datacenter in North America. This compute resource would also be provisioned with 500 GB of storage that exhibits the Resilient and Fast characteristics. Resilient storage implies some level of high availability; in resolving the logical property Resilient, the Management Layer may be instructed by the Service Operations Layer Service Asset and Configuration Management process definition to assemble the storage resource as a mirrored array. Further expansion of the definition may conclude that a striped-mirror set is appropriate to accommodate the storage characteristic Fast. Lastly, this compute resource is also associated with the Production network in the organization.
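To make this resolution concrete, the following Python sketch shows one way the expanded definition could be captured as data and how the logical storage characteristics might be mapped to an array layout. The data structure, function, and mapping policy are illustrative assumptions, not a Fabric Management API.

  # Hypothetical, design-time representation of the expanded service definition.
  service_definition = {
      "compute": {"location": "North America", "group": "Manufacturing", "type": "Medium"},
      "storage": {"group": "Manufacturing", "type": {"Resilient", "Fast"}, "size_gb": 500},
      "network": {"subnet": "Production"},
  }

  def resolve_storage_layout(characteristics):
      """Resolve logical storage characteristics into an illustrative array layout.

      In practice the mapping would be drawn from the Service Asset and
      Configuration Management process definition, not hard-coded here.
      """
      if {"Resilient", "Fast"} <= characteristics:
          return "Striped-Mirror Set"
      if "Resilient" in characteristics:
          return "Mirrored Array"
      return "Simple Volume"

  print(resolve_storage_layout(service_definition["storage"]["type"]))  # Striped-Mirror Set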

Some less obvious resources that must be allocated and managed with this service are the operating system image provisioned on the compute resource, the individual storage devices that are assembled into a highly available array, the storage device identifiers such as Logical Unit Numbers (LUNs), and the interconnect paths between the resources. These additional resources and/or configuration characteristics may be implicitly assigned or inherited. This inheritance may also have applied to the simpler, less qualified model above.

Service definition models are a tool to aid the architect in designing services end-to-end and validating those designs against industry and organization best practices and requirements. The output of a model is a service definition template that is used by the Management Layer tooling to instantiate the service following a Service Operations request. It is important to note that a service operations request can originate externally from a consumer or internally as the result of a monitoring trigger to expand or contract a service.

3 Abstracting Private Cloud Infrastructure Resources

Physical hardware resources are abstracted through virtualization by the hypervisor. This abstraction allows multiple guest operating systems to be hosted on a single host while hiding the details of the physical hardware platform, presenting a shared pool of resources available to all virtual machines running on the host. Fabric Management is the tooling responsible for the allocation and de-allocation of virtualized resources in a private cloud. These resources are classified into three primary categories:

  • Compute services supply the physical resources, such as CPU, Random Access Memory (RAM), NIC, video, and storage, used by the provider to deliver VMs to consumers. This category contains the physical server hardware and parent OS.
  • Storage provides physical storage devices to the provider, which exposes these services to consumers as virtual disks. This includes the provisioning steps required for device discovery, allocation and de-allocation of devices, formation of higher-level storage arrays, establishment of paths, and access to vendor-specific capabilities.
  • Network services provide addressing and packet delivery for the provider's physical infrastructure and the consumer's VMs. Network capability includes physical and virtual network switches, routers, firewalls, and Virtual Local Area Networks (VLANs).

Each of these categories is in fact a component within the Private Cloud Reference Model Infrastructure Layer. Fabric Management uses automation and orchestration to perform management operations on these resources in response to Service Operations Layer requests for service or service updates.

Within each category, the available physical resources must be abstracted into well-known terms that are available at design time and mapped to physical resources that can be provisioned by fabric management at runtime.

3.1 Compute Resources

Compute resource virtual machines are allocated and provisioned from a host group defined in a private cloud. A private cloud may have many host groups defined. Each host group is made up of common, homogeneous physical hardware, allowing the same automation to be used across all physical servers in the host group.

Compute resources are commonly referenced by their geographic location and their type or capacity properties. Geographic location is the physical location of the datacenter where the host group resides; when a private cloud spans datacenters in multiple geographic locations, additional properties are required to determine placement. This property influences fabric management virtual machine placement during provisioning.

Type or capacity refers to the library image associated with a virtual machine and the quotas that are reserved on the host group server before provisioning. This property also influences virtual machine placement, as fabric management seeks the most suitable host candidate on which to instantiate the virtual machine during provisioning of the compute resource.
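As a simple illustration of how these properties could constrain placement, the Python sketch below filters host groups by geographic location and then selects the host with the most remaining headroom for the requested type. The inventory, quota table, and selection policy are hypothetical assumptions, not how any particular fabric management product performs placement.

  # Hypothetical host group inventory; capacities are illustrative.
  HOST_GROUPS = [
      {"name": "Manufacturing-East", "location": "North America",
       "hosts": [{"name": "host01", "free_cpu": 8, "free_ram_gb": 64},
                 {"name": "host02", "free_cpu": 16, "free_ram_gb": 128}]},
      {"name": "Manufacturing-EU", "location": "Europe",
       "hosts": [{"name": "host11", "free_cpu": 32, "free_ram_gb": 256}]},
  ]

  # Illustrative quota reserved on a host for each virtual machine "type".
  VM_TYPES = {"Medium": {"cpu": 4, "ram_gb": 16}}

  def place_vm(location, vm_type):
      """Return the most suitable (host group, host) pair for the request, or None."""
      quota = VM_TYPES[vm_type]
      candidates = [
          (group["name"], host)
          for group in HOST_GROUPS if group["location"] == location
          for host in group["hosts"]
          if host["free_cpu"] >= quota["cpu"] and host["free_ram_gb"] >= quota["ram_gb"]
      ]
      # Prefer the host with the most free capacity remaining after the reservation.
      return max(candidates, key=lambda c: (c[1]["free_cpu"], c[1]["free_ram_gb"]), default=None)

  print(place_vm("North America", "Medium"))  # selects host02 in Manufacturing-East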

3.2 Storage Resources

Virtualizing storage involves abstracting the physical location of storage from the named representation that virtual machines use to access it. Fabric Management exposes storage volumes to virtual machines and manages the mapping of volumes to physical resources.

Fabric Management accomplishes storage virtualization through a series of related steps:

3.2.1 Discovery

Discovery involves locating all directly attached storage and storage available to the host through network attached storage controllers. Fabric Management is responsible for discovering all manageable entities in the storage environment including devices, arrays and volumes.

3.2.2 Provisioning

Provisioning involves allocating physical storage devices from a free pool and assembling them into an array that meets the requirements of the Service Operations request for storage. Once an array has been created, one or more volumes are created, exposed to the host group server(s) (the initiators), brought online, and made available to host group virtual machines.

Provisioning also includes creating snapshots, which are delta copies of volumes, and clones, which are full copies.

3.2.3 Monitoring

Monitoring of long-running storage operations, and of the health of storage resources over time, is the responsibility of Fabric Management. During provisioning, storage management automation initiates the long-running operation, updates the Configuration Management status, and then either polls for completion status or subscribes to an event notification indicating completion. It is a consideration for the service definition architect to decide whether a long-running operation should pause the orchestration pending its completion, or whether the orchestration can continue asynchronously.
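A minimal sketch of the polling choice follows, assuming a hypothetical storage_api and Configuration Management (cmdb) interface; the call names are assumptions, not a real API.

  import time

  def wait_for_operation(storage_api, cmdb, operation_id, poll_interval=30, timeout=3600):
      """Poll a long-running storage operation and keep Configuration Management current.

      The alternative design is to subscribe to an event notification and let the
      orchestration continue asynchronously; this sketch shows the pause-the-workflow choice.
      """
      deadline = time.time() + timeout
      while time.time() < deadline:
          status = storage_api.get_operation_status(operation_id)  # hypothetical call
          cmdb.update_status(operation_id, status)                 # keep configuration management in sync
          if status in ("Completed", "Failed"):
              return status
          time.sleep(poll_interval)
      return "TimedOut"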

Over the lifetime of the service, storage resources will eventually fail. Because these resources in a private cloud are generally composed with some level of high availability, no impact to service availability is immediately experienced; however, the service is now in a degraded state. Storage monitoring is required to identify failed components, or events that may indicate a possible future failure, update configuration management with the failed status, and create a Service Operations Incident.

3.3 Network Resources

Abstracting physical network requirements into logical network definitions allows the architect to describe at design time how a service should be connected to one or more networks, a description that Fabric Management resolves into physical network resources at deployment time.

A logical network, together with one or more associated network sites, is a user-defined, named grouping of IP subnets, VLANs, or IP subnet/VLAN pairs that is used to organize and simplify network assignments. Some possible examples include BACKEND, FRONTEND, PRODUCTION, MANAGEMENT, and BACKUP. Logical networks represent an abstraction of the underlying physical network infrastructure, which enables you to model the network based on business needs and connectivity properties. After a logical network is created, it can be used to specify the network on which a host or a virtual machine (stand-alone or part of a service) is deployed. Architects can assign logical networks as part of virtual machine and service creation without having to understand the network details.

You can use logical networks to describe networks with different purposes, to isolate traffic, and to provision networks for different types of service-level agreements (SLAs). For example, for a tiered application, you may group the IP subnets and VLANs that are used for the front-end web tier as the FRONTEND logical network, and the IP subnets and VLANs that are used for back-end servers, such as application and database servers, as BACKEND. When a self-service user models the application as a service, they can pick the logical network that the virtual machines in each tier of the service connect to.
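The sketch below shows how such logical networks and their network sites might be captured as simple data structures at design time; the names, subnets, and VLAN IDs are hypothetical examples.

  # Hypothetical logical network definitions: each groups subnet/VLAN pairs by purpose.
  LOGICAL_NETWORKS = {
      "FRONTEND":   [{"site": "NA-DC1", "subnet": "10.10.1.0/24", "vlan": 101}],
      "BACKEND":    [{"site": "NA-DC1", "subnet": "10.10.2.0/24", "vlan": 102}],
      "MANAGEMENT": [{"site": "NA-DC1", "subnet": "10.10.250.0/24", "vlan": 250}],
  }

  def sites_for(logical_network):
      """Return the subnet/VLAN pairs Fabric Management could use to realize the network."""
      return LOGICAL_NETWORKS.get(logical_network, [])

  # A service tier simply names the logical network; the details are resolved at deployment time.
  web_tier = {"name": "Web", "logical_network": "FRONTEND"}
  print(sites_for(web_tier["logical_network"]))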

4 Automation and Orchestration

In this section we define the terms automation and orchestration in a private cloud.

Automation is the collection of related tasks that together perform a system management operation on a managed entity. These tasks are encoded in a form that can be evaluated and executed by a scripting technology available on the host group servers. The scripting technology must allow for passing of parameters, parameter validation, flow control, and error handling.

IT organizations have used some degree of automation for some time, possibly in an informal manner. In a private cloud, however, all system management operations must use repeatable forms of automation to guarantee predictable results when provisioning and de-provisioning resources. The following tasks, for example, may be automated to bring a storage unit online:

  1. Determine ideal storage unit candidates that meet the desired characteristics
  2. Determine if Host Group has Path to Array
  3. Allocate Devices
  4. Create Array
  5. Create Volume
  6. Create Volume Identification
  7. Enable Host Group Access to Volume

Orchestration is the stitching together of many automated tasks into a service definition workflow that accomplishes a holistic set of operations across several layers of the private cloud.
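The sketch below chains the storage tasks listed above into a single workflow to illustrate how individual automation steps compose into an orchestration; the fabric object and its methods are hypothetical stand-ins for platform automation, not an actual interface.

  def bring_storage_unit_online(fabric, host_group, requirements):
      """Illustrative orchestration chaining the automated tasks listed above."""
      devices = fabric.find_candidate_devices(requirements)            # 1. candidate storage units
      if not fabric.has_path(host_group, devices):                     # 2. host group path to array
          raise RuntimeError("No path from host group to storage devices")
      allocated = fabric.allocate_devices(devices)                     # 3. allocate devices
      array = fabric.create_array(allocated, requirements["layout"])   # 4. create array
      volume = fabric.create_volume(array, requirements["size_gb"])    # 5. create volume
      lun = fabric.assign_volume_identifier(volume)                    # 6. volume identification (LUN)
      fabric.unmask_volume(volume, host_group)                         # 7. enable host group access
      return lun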

Posting status events and updating configuration management are the responsibility of each respective automation and orchestration operation.

5 Orchestration in a Private Cloud

This section highlights common orchestration found in a private cloud. Much of this orchestration will be provided by the platform, while the rest must be custom developed to meet unique organizational requirements. In some cases orchestration is provided by the platform but includes integration points that allow custom extensions to be implemented and integrated with the platform orchestration.

5.1 Server Orchestration

Server orchestration includes host server orchestration and virtual machine orchestration. Each automation task in the orchestration is responsible for checking the success or failure of an operation and reporting status back to the orchestration. The orchestration is responsible for updating configuration management with current state within a workflow and completion status when done.

5.1.1 Provision Virtual Machine

This orchestration expands the Service Operations request and initializes properties used during provisioning to determine the resource pool, library image, storage, and network characteristics. Once the request is expanded and validated, the provision virtual machine orchestration begins a long-running process of identifying an ideal host within the specified geographic location and resource group on which to create the virtual machine using the configured library image and storage. Network configuration for the virtual machine allocates an IP address from a pool associated with the VLAN specified in the Service Operations request.
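A compressed sketch of this orchestration follows; every helper shown (expand_request, select_host, ip_pool_for_vlan, and so on) is a hypothetical stand-in for platform functionality, used only to show the flow.

  def provision_virtual_machine(request, fabric, cmdb):
      """Illustrative provision virtual machine orchestration; all calls are hypothetical."""
      spec = fabric.expand_request(request)        # resolve pool, library image, storage, network
      fabric.validate(spec)
      host = fabric.select_host(spec["location"], spec["resource_group"], spec["type"])
      vm = fabric.create_vm(host, image=spec["library_image"], storage=spec["storage"])
      ip = fabric.ip_pool_for_vlan(spec["vlan"]).allocate()   # IP address from the VLAN's pool
      fabric.configure_network(vm, ip, spec["vlan"])
      cmdb.record(vm, status="Provisioning")                  # keep configuration management current
      return vm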

5.1.2 Poll Provisioning Status

The ability to poll the status of a provisioning request is needed for three reasons.

First, provisioning a new virtual machine is a long-running process, and the ability to poll it allows fabric management and system administrators to view the automation step that is currently executing.

Second, once a virtual machine is started, its configuration occurs asynchronously; as in the first scenario, the ability to poll the process provides visibility into the current state of the machine.

Third, the provision virtual machine orchestration may schedule the create automation to run asynchronously and suspend pending its completion. In this case, polling the status is necessary to resume the provision virtual machine orchestration.

5.1.3 Remove Virtual Machine

This orchestration is responsible for taking a virtual machine offline and returning all resources to their associated pools. Note that it is not the responsibility of this orchestration to validate that the virtual machine is a candidate for deletion; the higher-level service management processes validate that the virtual machine is not a dependency of another service in the private cloud before issuing a Fabric Management operation to remove it.

5.1.4 Error Handling

Each step in fabric management orchestration has the potential to report errors. The processing of errors is generally handled through a common event handler. This handler is responsible for logging the event through an appropriate platform event sink and creating a Service Operations Incident.
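A minimal sketch of such a common handler follows; the event sink and Service Operations incident interfaces are assumptions for illustration.

  def handle_orchestration_error(step_name, error, event_sink, service_ops):
      """Common error handler: log the failure and raise a Service Operations Incident."""
      event_sink.log_error(source=step_name, message=str(error))   # hypothetical platform event sink
      incident_id = service_ops.create_incident(                   # hypothetical incident API
          title=f"Fabric Management step '{step_name}' failed",
          details=str(error),
      )
      return incident_id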