Refine your application platform

Once you start improving your organization's platform engineering practices, you might find that you need to address some challenges with your application platform first. Your application platform includes all the resources used to power an application, such as an app built directly on Azure Kubernetes Service (AKS).

In this article, learn more about the often missed or forgotten aspects of creating a well-architected application platform: infrastructure management, governance, observability, and security.

  • Infrastructure management: Use Infrastructure as Code (IaC) and automation, combined with templates, to simplify and standardize infrastructure and application deployment, while also providing mechanisms for change management and configuration drift detection.
  • Governance: Use IaC templates to implement strategies for initial deployment compliance and ongoing maintenance, add policy-based tools, and create a policy as code (PaC) practice for centralized standards.
  • Observability: Implement role-specific observability and logging with standardized dashboards, ensure appropriate access and retention policies, and monitor resource limits and key metrics using tools like Azure Policy.
  • Security: Build in security across all layers of the application platform with the principle of least privilege, unified security management in DevOps, threat detection, and the use of tools to address vulnerabilities and manage the software supply chain.
  • Cost management: Manage costs by identifying workload owners and mapping resources, enforce mandatory properties for deploying resources and assign costs to teams, manage shared resource costs, and use tools like Microsoft Cost Management for monitoring spending and setting alerts.

Deciding when and where to invest

If you have more than one application platform, it can be tricky to decide when and where to invest in improvements that solve problems like high costs or poor observability. If you're starting fresh, the Azure Architecture Center has several potential patterns for you to evaluate. Beyond that, here are a few questions to consider as you begin to plan what you want to do:

Question: Do you want to adapt your existing application platform, start fresh, or use a combination of these approaches?
Tips: Even if you're happy with what you have now or are starting fresh, you may want to think about how to adapt to change over time. Immediate changes rarely work. Your application platforms are a moving target; your ideal system changes as time passes. You want to factor this thinking and any related migration plans into your go-forward design. The infrastructure as code (IaC) and templating approaches already covered in Apply software engineering systems can help you manage some of this variation for new applications.

Question: If you want to change what you're doing today, what products, services, or investments are you happy with?
Tips: As the saying goes, "if it isn't broken, don't fix it." Don't change things without a reason to do so. However, if you have any home-grown solutions, consider whether it's time to move towards an existing product to save on long-term maintenance. For example, if you're operating your own monitoring solution, do you want to remove that burden from your ops team and migrate to a managed product?

Question: Where do you see the most change happening over time? Are any of these in areas that are common to all (or most) of your organization's app types?
Tips: Areas that you or your internal customers aren't happy with, and that aren't likely to change frequently, are great places to start. These have the biggest return on investment over the long term. This can also help you iron out how you would facilitate migrating to a new solution. For example, app models tend to be fluid, but log analysis tools tend to have a longer shelf life. You can also start with new projects and applications while you confirm that the direction change has the desired returns.

Question: Are you investing in custom solutions in areas with the highest value-add? Do you feel strongly that a unique app infrastructure platform capability is part of your competitive advantage?
Tips: If you've identified gaps, before building something custom, consider which areas vendors are most likely to invest in and focus your custom thinking elsewhere. Start by thinking of yourself as an integrator rather than a custom app infrastructure or app model provider. Anything you build will have to be maintained, and long-term maintenance dwarfs up-front costs. If you feel the urgent need to custom build a solution in an area you suspect vendors will cover long term, plan for sunsetting or long-term support. Your internal customers will typically be as happy (if not happier) with an off-the-shelf product as with a custom one.

Adapting your existing application platform investments can be a good way to get going. When you make updates, consider starting with new applications to simplify piloting ideas before any kind of roll-out. Factor in this change through IaC and application templating. Invest in custom solutions for your unique needs in high impact, high value areas. Otherwise, try to use an off-the-shelf solution. As with engineering systems, focus on automating provisioning, tracking, and deployment rather than assuming one rigid path to help you manage change over time.

Infrastructure management

As mentioned in Apply software engineering systems, IaC and automation tools can be combined with templates to standardize infrastructure and application deployment. To reduce the burden of platform specifics on the end user, you should abstract platform details by breaking down choices into relatable naming conventions, for example:

  • Resource type categories (high compute, high memory)
  • Resource size categories (t-shirt sizing: small, medium, and large)
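As an illustration, the category-to-configuration mapping described above can be sketched in a few lines of Python. The category names, SKU strings, and sizes here are hypothetical placeholders, not real Azure SKUs:

```python
# Hypothetical mapping from relatable size categories to concrete,
# pretested platform configurations; SKU names are illustrative only.
SIZE_PRESETS = {
    ("high-compute", "small"): {"sku": "compute-optimized-2cpu", "cpu": 2, "memory_gb": 4},
    ("high-compute", "large"): {"sku": "compute-optimized-8cpu", "cpu": 8, "memory_gb": 16},
    ("high-memory", "small"): {"sku": "memory-optimized-2cpu", "cpu": 2, "memory_gb": 16},
    ("high-memory", "large"): {"sku": "memory-optimized-8cpu", "cpu": 8, "memory_gb": 64},
}

def resolve_preset(resource_type: str, size: str) -> dict:
    """Translate the friendly category names a dev team supplies
    into the tested preset configuration behind the template."""
    try:
        return SIZE_PRESETS[(resource_type, size)]
    except KeyError:
        raise ValueError(
            f"No preset for {resource_type}/{size}; request a one-off configuration"
        )
```

Dev teams supply only the two friendly parameters; the platform team owns and iterates the presets behind them.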

The goal should be to have templates that represent general requirements and have been tested with preset configurations, so dev teams can get started immediately by supplying minimal parameters and without needing to review options. However, there will be occasions where teams need to change more options on published templates than are available or desirable. For example, an approved design might need a specific configuration that falls outside the supported template defaults. In this instance, operations or platform engineering teams can create a one-off configuration, and then decide whether the template needs to incorporate those changes as a default.

You can track changes using IaC tools with drift detection features that can automatically remediate drift (GitOps). Examples of these tools are Terraform and cloud-native IaC tools (for example, Cluster API, Crossplane, and Azure Service Operator v2). Outside of IaC tool drift detection, there are cloud configuration tools that can query for resource configurations, such as Azure Resource Graph. These serve two purposes: you can monitor for changes made outside of the infrastructure code, and you can review changed preset configurations. To avoid being too rigid, you can also implement tolerances in deployments with predefined limits. For example, you can use Azure Policy to limit the number of Kubernetes nodes that a deployment can have.
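A minimal, tool-agnostic sketch of drift detection with tolerances might look like the following. The configuration keys and tolerance ranges are illustrative, not tied to any particular IaC tool:

```python
def detect_drift(desired, actual, tolerances=None):
    """Return the keys whose actual values drifted from the desired
    configuration, honoring numeric tolerance ranges (lo, hi) so that
    deployments aren't flagged for changes within predefined limits."""
    tolerances = tolerances or {}
    drift = {}
    for key, want in desired.items():
        have = actual.get(key)
        if key in tolerances and isinstance(have, (int, float)):
            lo, hi = tolerances[key]
            if not (lo <= have <= hi):
                drift[key] = {"desired": want, "actual": have}
        elif have != want:
            drift[key] = {"desired": want, "actual": have}
    return drift
```

A node count of 4 against a desired 3 with a tolerance of (2, 5) would pass, while a change to the SKU would still be reported.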

Self-managed or managed?

In public clouds you have the choice to consume SaaS, PaaS, or IaaS. To learn more about SaaS, PaaS, and IaaS, see the training module Describe cloud concepts. PaaS services offer streamlined development experiences but are more prescriptive with their app models. Ultimately, there's a trade-off between ease of use and control that you need to evaluate.

During platform design, evaluate and prioritize the services you want to offer or move to. For example, whether you build apps directly on Azure Kubernetes Service (AKS) or through Azure Container Apps (ACA) depends on your requirements for the service and on your in-house capacity and skill set. The same goes for function-style services like Azure Functions or Azure App Service. ACA, Azure Functions, and App Service reduce complexity, while AKS provides more flexibility and control. More experimental app models like the OSS Radius incubation project try to provide a balance between the two, but are generally in earlier stages of maturity than cloud services with full support and a presence in established IaC formats.

The problems you identified when you planned should help you evaluate which end of this scale is right for you. Be sure to factor your own internal existing skill set as you make a decision.

Shared vs. dedicated resources

Within your organization, there are many resources that can be shared by multiple applications to increase utilization and cost effectiveness. Each of the resources that can be shared has its own set of considerations. For example, the following are considerations for sharing Kubernetes clusters, but some also apply to other types of resources:

  • Organization: Sharing resources like clusters within, rather than across, organizational boundaries can improve how they align with organizational direction, requirements, priority, etc.
  • Application tenancy: Applications can have different tenancy isolation requirements; you need to review individual application security and regulatory compliance to determine whether an application can coexist with others. For example, in Kubernetes, applications can use namespace isolation. But you should also consider application tenancy for different environment types. For example, it's often best to avoid mixing test applications with production applications on the same clusters, to avoid unexpected impacts due to misconfigurations or security issues. Or you might opt to first test and tune on dedicated Kubernetes clusters to track down these issues before deploying to a shared cluster. Regardless, consistency in your approach is the key to avoiding confusion and mistakes.
  • Resource consumption: Understand each application's resource usage and spare capacity, and project whether sharing is viable. You should also be aware of limits on the resources consumed (data center capacity or subscription limits). The goal is to avoid having to move your application and dependencies due to resource constraints in a shared environment, or suffering live-site incidents due to capacity exhaustion. Using resource limits, representative testing, monitoring, alerting, and reporting can help identify resource consumption and protect against applications consuming too many resources and impacting other applications.
  • Optimize shared configurations: Shared resources such as shared clusters require extra consideration and configuration. These considerations include cross-charging, resource allocation, permissions management, workload ownership, data sharing, upgrade coordination, workload placement, capacity management, and establishing, managing, and iterating a baseline configuration. Shared resources have benefits, but if the standard configurations are too restrictive and don't evolve, they become obsolete.

Some of these issues are simplified by PaaS solutions, but many of these points apply even to something like sharing a database. Sharing has both upsides and downsides, so consider the trade-offs carefully.
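The resource consumption projection mentioned above can be sketched simply: sum the projected demand of candidate workloads against the shared resource's capacity, reserving headroom for spikes and growth. The capacity figures and 20 percent headroom here are arbitrary examples:

```python
def sharing_viable(cluster_capacity, workloads, headroom=0.2):
    """Project combined workload demand against shared capacity.
    Sharing is viable only if every resource stays below capacity
    minus a reserved headroom fraction for spikes and growth."""
    for resource, capacity in cluster_capacity.items():
        demand = sum(w.get(resource, 0) for w in workloads)
        if demand > capacity * (1 - headroom):
            return False
    return True
```

If the projection fails for any resource, either expand the shared environment or place the workload on a dedicated resource instead.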

For more information on the Kubernetes cluster aspects of this article, see the Azure Kubernetes Service (AKS) multi-tenancy documentation.

Governance

Governance is a key part of enabling self-service with guardrails, but applying compliance rules in a way that doesn't impact time to business value for applications is a common challenge. Governance is a broad topic, but if this is a problem you're encountering, keep in mind both aspects of this space:

  • Initial deployment compliance (start right): This can be achieved with standardized IaC templates that are made available through catalogs, with permission management and policies to ensure only allowed resources and configurations can be deployed.
  • Maintaining compliance (stay right): Policy-based tools can prevent or alert you when there are resource changes. Beyond your core infrastructure, consider tools that also support compliance inside resources like Kubernetes, along with the OSs used in your containers or VMs. For example, you might want to enforce a locked-down OS configuration or install security software using tools such as Windows Group Policy, SELinux, AppArmor, Azure Policy, or Kyverno. If developers only have access to IaC repositories, you can add approval workflows to review proposed changes and prevent direct access to resource control planes (for example, Azure).

Maintaining compliance requires tooling to access, report, and act on issues. For example, Azure Policy can be used with many Azure services for auditing, reporting, and remediation. It also has different modes such as Audit, Deny, and DeployIfNotExists depending on your needs.
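To illustrate the difference between effects like Audit and Deny, here's a toy policy evaluator. This is not Azure Policy's actual engine or schema; the rule shape, field names, and effects are simplified stand-ins:

```python
def evaluate_policy(resource, policy):
    """Check one resource against a single rule. The policy's effect
    decides whether a violation is merely reported (Audit) or
    blocks the deployment (Deny)."""
    field, allowed = policy["field"], policy["allowed_values"]
    compliant = resource.get(field) in allowed
    if compliant:
        return {"compliant": True, "action": "none"}
    action = "deny" if policy["effect"] == "Deny" else "report"
    return {"compliant": False, "action": action}
```

The same rule can start life in Audit mode to measure impact, then be promoted to Deny once you're confident it won't break applications.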

While policies can enforce compliance, they can also break applications unexpectedly. Therefore, consider evolving to a policy as code (PaC) practice when operating at scale. As a key part of your start right and stay right approach, PaC provides:

  • Centrally managed standards
  • Version control for your policies
  • Automated testing & validation
  • Reduced time to roll out
  • Continuous deployment

PaC can help to minimize the blast radius of a potentially bad policy with capabilities such as:

  • Policy definitions stored as code in a repository that is reviewed and approved.
  • Automation to provide testing and validation.
  • Ring-based gradual rollout of policies and remediation on existing resources.
  • Remediation tasks with safety built in, such as controls that stop the remediation task if more than 90 percent of deployments fail.
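The ring-based rollout with a built-in safety stop can be sketched as follows. The 90 percent threshold mirrors the example above; the `remediate` callback and ring contents are hypothetical:

```python
def rollout_policy(rings, remediate, failure_threshold=0.9):
    """Apply a remediation callback ring by ring. Halt the entire
    rollout if the failure rate within any ring exceeds the
    threshold, limiting the blast radius of a bad policy."""
    completed = []
    for ring in rings:
        results = [remediate(resource) for resource in ring]
        failures = results.count(False)
        if ring and failures / len(ring) > failure_threshold:
            return {"status": "halted", "completed_rings": completed}
        completed.append(ring)
    return {"status": "succeeded", "completed_rings": completed}
```

Early rings would typically contain low-risk, non-production resources, so a bad policy is caught before it reaches critical workloads.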

Observability

To support your applications and infrastructure, you need observability and logging across the entire stack that your platform engineering, operations, and developer teams can use to see what is happening.

Illustration of a Grafana dashboard.

However, requirements differ per role. For example, platform engineering and operations teams require dashboards to review the health and capacity of the infrastructure, with suitable alerts. Developers require application metrics, logs, and traces to troubleshoot, plus customized dashboards that show application and infrastructure health. One problem either of these roles might encounter is cognitive overload from too much information, or knowledge gaps due to a lack of useful information.

To resolve these challenges, consider the following:

  • Standards: Apply logging standards to make it easier to create and reuse standardized dashboards and simplify ingestion processing through something like the OpenTelemetry observability framework.
  • Permissions: Consider providing team- or application-level dashboards using something like Grafana to provide rolled-up data for anyone interested, along with a facility for trusted members of application teams to securely access logs when needed.
  • Retention: Retaining logs and metrics can be expensive, and can create unintended risks or compliance violations. Establish retention defaults and publish them as a part of your start right guidance.
  • Monitor resource limits: Operations teams should be able to identify and track any limitations for a given type of resource. When possible, these limitations should be factored into IaC templates or policies using tools like Azure Policy. Operations should then proactively monitor using dashboards in something like Grafana and expand shared resources where automated scaling isn't possible or enabled. For example, monitor the number of K8s cluster nodes for capacity as apps are onboarded and modified over time. Alerting is needed, and these definitions should be stored as code so they can be programmatically added to resources.
  • Identify key capacity and health metrics: Monitor and alert on OS and shared resources (for example, CPU, memory, and storage) for starvation, with metrics collection using something like Prometheus or Azure Container Insights. You can monitor sockets and ports in use, network bandwidth consumption of chatty apps, and the number of stateful applications hosted on the cluster.
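As a simple illustration of monitoring resource limits, the following sketch raises alerts once usage approaches a known limit; in practice you'd feed it metrics from something like Prometheus, and the 80 percent warning fraction is an arbitrary example:

```python
def capacity_alerts(metrics, limits, warn_at=0.8):
    """Compare current usage against known resource limits and emit
    an alert string for each metric that crosses the warning
    fraction of its limit."""
    alerts = []
    for name, used in metrics.items():
        limit = limits.get(name)
        if limit and used / limit >= warn_at:
            alerts.append(f"{name} at {used}/{limit} ({used / limit:.0%} of limit)")
    return alerts
```

Storing alert definitions like these as code lets you programmatically attach them to resources, as noted above.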

Security

Security is required at every layer: code, container, cluster, and cloud/infrastructure. Every organization has its own security requirements, but at a high level, these are some things to consider for your platform:

  • Follow the principle of least privilege.
  • Unify your DevOps security management across multiple pipelines.
  • Ensure contextual insights are visible so that you can identify and remediate your most critical risks.
  • Enable detection and response to modern threats across your cloud workloads at runtime.

To help resolve problems in this area, you need to evaluate tools that work across your engineering systems, application systems, resources, and services across clouds and hybrid environments (for example, Microsoft Defender for Cloud). Beyond application security, consider the following areas.

Permissions requirements can differ by environment. For example, in some organizations, individual teams aren't allowed to access production resources and new applications can't automatically deploy until reviews are complete. However, in dev and test environments, automated resource and app deployment, and access to clusters for troubleshooting might be permitted.

Managing identity access to services, applications, and infrastructure at scale can be challenging. You want identity providers that create, maintain, and manage identity information while providing authentication services to applications and services, and that can integrate with role-based access control (RBAC) authorization systems for at-scale authentication and authorization management. For example, you can use Azure RBAC and Microsoft Entra ID to provide authentication and authorization at scale for Azure services like Azure Kubernetes Service, without needing to set up permissions directly on every individual cluster.
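To make the scope-based RBAC idea concrete, here's a toy authorization check. The role names, permission sets, and scope paths are illustrative, not actual Azure RBAC definitions:

```python
# Hypothetical role definitions mapping role names to permitted actions.
ROLE_PERMISSIONS = {
    "reader": {"get", "list"},
    "contributor": {"get", "list", "create", "update"},
    "owner": {"get", "list", "create", "update", "delete", "assign-roles"},
}

def is_authorized(assignments, identity, scope, action):
    """Authorize an action if any role assigned to the identity at
    this scope, or at a parent scope, grants the permission. Scope
    inheritance is what avoids per-cluster permission setup."""
    for assigned_scope, roles in assignments.get(identity, {}).items():
        if scope == assigned_scope or scope.startswith(assigned_scope + "/"):
            if any(action in ROLE_PERMISSIONS[r] for r in roles):
                return True
    return False
```

Because a role granted at a parent scope flows down to child resources, one assignment can cover many clusters, which supports the principle of least privilege when roles are kept narrow.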

Applications might need access to an identity to access cloud resources like storage. You need to review requirements and assess how your identity provider can support this in the most secure way possible. For example, within AKS, cloud native apps can utilize a Workload Identity that federates with Microsoft Entra ID to allow containerized workloads to authenticate. This approach allows applications to access cloud resources without secret exchanges within application code.

Cost management

Cost is another problem that might bubble to the top for your platform engineering efforts. To properly manage your application platform, you need a way to identify workload owners. You want a way to get an inventory of resources that maps to owners through a particular set of metadata. For example, within Azure, you can use AKS labels and Azure Resource Manager tags, along with concepts like projects in Azure Deployment Environments, to group your resources at different levels. For this to work, the chosen metadata must be enforced as mandatory properties (using something like Azure Policy) when deploying workloads and resources. This helps with cost apportionment, resource-to-solution mapping, owner identification, and so on. Consider running regular reporting to track orphaned resources.
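A regular report for resources missing mandatory metadata might look like this sketch. The tag names are examples; in practice, enforcement at deployment time would come from something like Azure Policy:

```python
# Illustrative mandatory tag names; your organization would define its own.
MANDATORY_TAGS = {"owner", "cost-center", "project"}

def find_untagged(resources):
    """Report resources missing any mandatory tag so they can be
    assigned an owner or investigated as orphaned."""
    flagged = []
    for resource in resources:
        missing = MANDATORY_TAGS - set(resource.get("tags", {}))
        if missing:
            flagged.append(resource["name"])
    return flagged
```

Running a report like this on a schedule surfaces orphaned resources before they accumulate untracked cost.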

Beyond tracking, you might need to assign costs to individual application teams for their resource usage, using this same metadata with cost management systems like Microsoft Cost Management. While this method tracks resources provisioned by the application teams, it doesn't cover the cost of shared resources such as your identity provider, logging and metric storage, and network bandwidth consumption. For shared resources, you can divide the operational costs equally among the individual teams, or provide dedicated systems (for example, logging storage) where there's nonuniform consumption. Some shared resource types might be able to provide insights on resource consumption; for example, Kubernetes has tools such as OpenCost or Kubecost that can help.
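The apportionment approach described above, tagged resource costs per team plus an equal split of shared costs, can be sketched as follows (the tag name `owner` and the cost figures are illustrative):

```python
def apportion_costs(resource_costs, shared_cost):
    """Sum tagged resource costs per team, then split the shared
    platform cost equally across all teams with tagged resources."""
    totals = {}
    for item in resource_costs:
        team = item["tags"].get("owner", "unassigned")
        totals[team] = totals.get(team, 0.0) + item["cost"]
    share = shared_cost / len(totals) if totals else 0.0
    return {team: round(cost + share, 2) for team, cost in totals.items()}
```

An equal split is the simplest policy; where consumption is nonuniform, you'd substitute per-team usage data from tools like OpenCost instead.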

You should also look for cost analysis tooling that lets you review current spending. For example, in the Azure portal, cost alerts and budget alerts can track consumption of resources in a group and send notifications when you hit preset thresholds.