Stretched Azure Stack HCI - CSV Volume Status: "Offline Pending" after attempting to change Volume Owner Node

Brad 26 Reputation points
2022-01-20T10:01:11.987+00:00

We have a brand new 4x node identical physical Stretched Azure Stack HCI Cluster. 2 Nodes per site. We are on v21H2

When we attempt to change the owner node of a CSV volume from one node to another (in the same site), we occasionally see the CSV volume go offline and its status move from Online to Offline Pending. After about 10 - 15 minutes in the Offline Pending state, the cluster service on the node restarts, the CSV volume ownership moves to the node we originally wanted, and the volume becomes accessible again.
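For reference, the ownership move was performed with something like the following (assuming the FailoverClusters PowerShell module; the equivalent "Move" action in Failover Cluster Manager behaves the same, and the volume/node names below are placeholders):

```powershell
Get-ClusterSharedVolume                       # list CSVs and their current owner nodes
Move-ClusterSharedVolume -Name "Cluster Virtual Disk (Volume01)" -Node "Node02"

# While a volume is stuck in Offline Pending, this shows the per-node CSV state:
Get-ClusterSharedVolumeState -Name "Cluster Virtual Disk (Volume01)"
```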

This is happening randomly on all nodes, which makes it harder to diagnose.

This Stretched Azure Stack HCI Cluster is currently in pre-production, so has no guests hosted at the moment.

The cluster validation report runs successfully; the only warnings relate to the ClusterPerformanceHistory volume not using default settings - which is not something we have adjusted.

Any pointers on how to pinpoint/diagnose this issue would be greatly appreciated.

Azure Stack HCI

1 answer

  1. Trent Helms - MSFT 2,536 Reputation points Microsoft Employee
    2022-01-20T13:54:11.597+00:00

    Hi @Brad ,

    There are a lot of possibilities here that are difficult to answer without knowing the full setup of the environment. However, I will attempt to give you some information that can help you troubleshoot the cause of this issue.

    First, I would ensure the nodes are all fully up-to-date with the latest patches. This will help to ensure you aren't running into any issues that may have already been resolved.
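A quick way to compare patch levels across the cluster is to query each node in one pass (a sketch, assuming remoting is enabled between nodes):

```powershell
# Show the five most recent hotfixes on every cluster node:
$nodes = (Get-ClusterNode).Name
Invoke-Command -ComputerName $nodes -ScriptBlock {
    Get-HotFix | Sort-Object InstalledOn -Descending | Select-Object -First 5
}
```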

    Next, I would confirm your networking is properly set up between sites. We frequently see stretched configurations that are not designed per our best practices, and these misconfigurations have caused numerous issues with stretched clusters, so it is worth taking the time to verify. The most common ones include multiple paths between sites, SR traffic using unintended NICs, attempting to use a stretched L2 network for the cluster hosts (this is OK for the VM traffic, though), and bandwidth bottlenecks between sites. The requirements can be found here - https://video2.skills-academy.com/en-us/azure-stack/hci/concepts/host-network-requirements#stretched-clusters
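A few cmdlets that help verify the site layout and spot SR traffic on unintended NICs (a sketch; adapt to your environment):

```powershell
# Nodes should map to the expected sites/fault domains:
Get-ClusterFaultDomain

# Review which cluster networks exist and their roles:
Get-ClusterNetwork | Format-Table Name, Role, Address

# Check whether Storage Replica traffic is constrained to specific interfaces;
# if SR is riding an unintended NIC, Set-SRNetworkConstraint can pin it.
Get-SRNetworkConstraint
```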

    Another thing to consider: does this issue happen without SR configured? If so, that is a simpler configuration to troubleshoot. If not, it could point to a possible issue with the stretch configuration. If you are using synchronous replication, you could also try asynchronous replication to see if it changes the behavior.
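Switching an existing partnership to asynchronous for a test looks roughly like this (the computer and replication-group names below are placeholders for your SR configuration - `Get-SRPartnership` shows the real ones):

```powershell
Get-SRPartnership   # note the source/destination computer and RG names

Set-SRPartnership -SourceComputerName "Node01" -SourceRGName "rg01" `
    -DestinationComputerName "Node03" -DestinationRGName "rg02" `
    -ReplicationMode Asynchronous
```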

    As for data to collect, the best option is the SDDC diagnostics - https://github.com/PowerShell/PrivateCloud.DiagnosticInfo. This gives you the best overall picture, as you can review the cluster and SR logs for clues as to what is happening. Look at the logs on the node you are attempting to move the resource from first. Hopefully this leads you to an answer, but if you need additional help, I would suggest opening a case so we can deep-dive into the logs with you further.
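Collecting the diagnostics and a cluster log from the source node looks like this (the output path and node name are just examples):

```powershell
# Install and run the SDDC diagnostic collector from any cluster node:
Install-Module PrivateCloud.DiagnosticInfo -Force
Get-SDDCDiagnosticInfo -WriteToPath "C:\SDDC"

# The cluster log from the node you moved the resource from is often
# the most useful single file:
Get-ClusterLog -Node "Node01" -UseLocalTime -Destination "C:\SDDC"
```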

    Hope this helps!
    Trent