资源传播失败:ClusterResourcePlacementRolloutStarted 为 false

本文介绍如何排查 ClusterResourcePlacementRolloutStarted 在 Azure Kubernetes Fleet Manager 中使用 ClusterResourcePlacement API 对象传播资源时出现的问题。

现象

使用 ClusterResourcePlacement Azure Kubernetes Fleet Manager 中的 API 对象传播资源时,所选资源不会在所有计划群集中推出, ClusterResourcePlacementRolloutStarted 条件状态显示为 False

注意

若要详细了解推出为何不启动,可以检查 推出控制器 日志。

原因

群集资源放置推出策略被阻止,因为 RollingUpdate 配置过于严格。

疑难解答步骤

  1. ClusterResourcePlacement“状态”部分中,检查placementStatuses标识状态设置为FalseRolloutStarted群集。
  2. 找到标识的群集的相应 ClusterResourceBinding 位置。 有关详细信息,请参阅 如何查找最新的 ClusterResourceBinding 资源? 此资源应指示 Work 状态(是创建还是更新)。
  3. 验证其值 maxUnavailable 并确保 maxSurge 它们符合预期。

案例研究

在以下示例中,尝试 ClusterResourcePlacement 将命名空间传播到三个成员群集。 但是,在初始创建ClusterResourcePlacement期间,中心群集上不存在命名空间,并且机群当前包含两个名为和kind-cluster-2的成员群集kind-cluster-1

ClusterResourcePlacement 规范

spec:
  policy:
    numberOfClusters: 3
    placementType: PickN
  resourceSelectors:
  - group: ""
    kind: Namespace
    name: test-ns
    version: v1
  revisionHistoryLimit: 10
  strategy:
    type: RollingUpdate

ClusterResourcePlacement 状态

status:
  conditions:
  - lastTransitionTime: "2024-05-07T23:08:53Z"
    message: could not find all the clusters needed as specified by the scheduling
      policy
    observedGeneration: 1
    reason: SchedulingPolicyUnfulfilled
    status: "False"
    type: ClusterResourcePlacementScheduled
  - lastTransitionTime: "2024-05-07T23:08:53Z"
    message: All 2 cluster(s) start rolling out the latest resource
    observedGeneration: 1
    reason: RolloutStarted
    status: "True"
    type: ClusterResourcePlacementRolloutStarted
  - lastTransitionTime: "2024-05-07T23:08:53Z"
    message: No override rules are configured for the selected resources
    observedGeneration: 1
    reason: NoOverrideSpecified
    status: "True"
    type: ClusterResourcePlacementOverridden
  - lastTransitionTime: "2024-05-07T23:08:53Z"
    message: Works(s) are successfully created or updated in the 2 target clusters'
      namespaces
    observedGeneration: 1
    reason: WorkSynchronized
    status: "True"
    type: ClusterResourcePlacementWorkSynchronized
  - lastTransitionTime: "2024-05-07T23:08:53Z"
    message: The selected resources are successfully applied to 2 clusters
    observedGeneration: 1
    reason: ApplySucceeded
    status: "True"
    type: ClusterResourcePlacementApplied
  - lastTransitionTime: "2024-05-07T23:08:53Z"
    message: The selected resources in 2 cluster are available now
    observedGeneration: 1
    reason: ResourceAvailable
    status: "True"
    type: ClusterResourcePlacementAvailable
  observedResourceIndex: "0"
  placementStatuses:
  - clusterName: kind-cluster-2
    conditions:
    - lastTransitionTime: "2024-05-07T23:08:53Z"
      message: 'Successfully scheduled resources for placement in kind-cluster-2 (affinity
        score: 0, topology spread score: 0): picked by scheduling policy'
      observedGeneration: 1
      reason: Scheduled
      status: "True"
      type: Scheduled
    - lastTransitionTime: "2024-05-07T23:08:53Z"
      message: Detected the new changes on the resources and started the rollout process
      observedGeneration: 1
      reason: RolloutStarted
      status: "True"
      type: RolloutStarted
    - lastTransitionTime: "2024-05-07T23:08:53Z"
      message: No override rules are configured for the selected resources
      observedGeneration: 1
      reason: NoOverrideSpecified
      status: "True"
      type: Overridden
    - lastTransitionTime: "2024-05-07T23:08:53Z"
      message: All of the works are synchronized to the latest
      observedGeneration: 1
      reason: AllWorkSynced
      status: "True"
      type: WorkSynchronized
    - lastTransitionTime: "2024-05-07T23:08:53Z"
      message: All corresponding work objects are applied
      observedGeneration: 1
      reason: AllWorkHaveBeenApplied
      status: "True"
      type: Applied
    - lastTransitionTime: "2024-05-07T23:08:53Z"
      message: All corresponding work objects are available
      observedGeneration: 1
      reason: AllWorkAreAvailable
      status: "True"
      type: Available
  - clusterName: kind-cluster-1
    conditions:
    - lastTransitionTime: "2024-05-07T23:08:53Z"
      message: 'Successfully scheduled resources for placement in kind-cluster-1 (affinity
        score: 0, topology spread score: 0): picked by scheduling policy'
      observedGeneration: 1
      reason: Scheduled
      status: "True"
      type: Scheduled
    - lastTransitionTime: "2024-05-07T23:08:53Z"
      message: Detected the new changes on the resources and started the rollout process
      observedGeneration: 1
      reason: RolloutStarted
      status: "True"
      type: RolloutStarted
    - lastTransitionTime: "2024-05-07T23:08:53Z"
      message: No override rules are configured for the selected resources
      observedGeneration: 1
      reason: NoOverrideSpecified
      status: "True"
      type: Overridden
    - lastTransitionTime: "2024-05-07T23:08:53Z"
      message: All of the works are synchronized to the latest
      observedGeneration: 1
      reason: AllWorkSynced
      status: "True"
      type: WorkSynchronized
    - lastTransitionTime: "2024-05-07T23:08:53Z"
      message: All corresponding work objects are applied
      observedGeneration: 1
      reason: AllWorkHaveBeenApplied
      status: "True"
      type: Applied
    - lastTransitionTime: "2024-05-07T23:08:53Z"
      message: All corresponding work objects are available
      observedGeneration: 1
      reason: AllWorkAreAvailable
      status: "True"
      type: Available

上述输出指示中心群集上从未存在资源 test-ns 命名空间,并显示以下 ClusterResourcePlacement 条件状态:

  • 条件 ClusterResourcePlacementScheduled 状态显示为 False,因为指定的策略旨在选取三个群集,但计划程序只能容纳两个当前可用群集和已加入群集中的位置。
  • 条件 ClusterResourcePlacementRolloutStarted 状态显示为 True,因为推出过程已从选择了两个群集开始。
  • 条件 ClusterResourcePlacementOverridden 状态显示为 True,因为未为所选资源配置替代规则。
  • 条件 ClusterResourcePlacementWorkSynchronized 状态显示为 True
  • 条件 ClusterResourcePlacementApplied 状态显示为 True
  • 条件 ClusterResourcePlacementAvailable 状态显示为 True

若要确保跨相关群集无缝传播命名空间,请继续在中心群集上创建 test-ns 命名空间。

在中心群集上创建命名空间“test-ns”后,ClusterResourcePlacement 状态

status:
  conditions:
  - lastTransitionTime: "2024-05-07T23:08:53Z"
    message: could not find all the clusters needed as specified by the scheduling
      policy
    observedGeneration: 1
    reason: SchedulingPolicyUnfulfilled
    status: "False"
    type: ClusterResourcePlacementScheduled
  - lastTransitionTime: "2024-05-07T23:13:51Z"
    message: The rollout is being blocked by the rollout strategy in 2 cluster(s)
    observedGeneration: 1
    reason: RolloutNotStartedYet
    status: "False"
    type: ClusterResourcePlacementRolloutStarted
  observedResourceIndex: "1"
  placementStatuses:
  - clusterName: kind-cluster-2
    conditions:
    - lastTransitionTime: "2024-05-07T23:08:53Z"
      message: 'Successfully scheduled resources for placement in kind-cluster-2 (affinity
        score: 0, topology spread score: 0): picked by scheduling policy'
      observedGeneration: 1
      reason: Scheduled
      status: "True"
      type: Scheduled
    - lastTransitionTime: "2024-05-07T23:13:51Z"
      message: The rollout is being blocked by the rollout strategy
      observedGeneration: 1
      reason: RolloutNotStartedYet
      status: "False"
      type: RolloutStarted
  - clusterName: kind-cluster-1
    conditions:
    - lastTransitionTime: "2024-05-07T23:08:53Z"
      message: 'Successfully scheduled resources for placement in kind-cluster-1 (affinity
        score: 0, topology spread score: 0): picked by scheduling policy'
      observedGeneration: 1
      reason: Scheduled
      status: "True"
      type: Scheduled
    - lastTransitionTime: "2024-05-07T23:13:51Z"
      message: The rollout is being blocked by the rollout strategy
      observedGeneration: 1
      reason: RolloutNotStartedYet
      status: "False"
      type: RolloutStarted
  selectedResources:
  - kind: Namespace
    name: test-ns
    version: v1

在前面的输出中, ClusterResourcePlacementScheduled 条件状态显示为 False。 状态 ClusterResourcePlacementRolloutStarted 也显示为 False 消息: The rollout is being blocked by the rollout strategy in 2 cluster(s)

ClusterResourceSnapshot通过在“如何查找最新的 ClusterResourceBinding”资源中运行命令来检查最新资源?

最新 ClusterResourceSnapshot

apiVersion: placement.kubernetes-fleet.io/v1
kind: ClusterResourceSnapshot
metadata:
  annotations:
    kubernetes-fleet.io/number-of-enveloped-object: "0"
    kubernetes-fleet.io/number-of-resource-snapshots: "1"
    kubernetes-fleet.io/resource-hash: 72344be6e268bc7af29d75b7f0aad588d341c228801aab50d6f9f5fc33dd9c7c
  creationTimestamp: "2024-05-07T23:13:51Z"
  generation: 1
  labels:
    kubernetes-fleet.io/is-latest-snapshot: "true"
    kubernetes-fleet.io/parent-CRP: crp-3
    kubernetes-fleet.io/resource-index: "1"
  name: crp-3-1-snapshot
  ownerReferences:
  - apiVersion: placement.kubernetes-fleet.io/v1beta1
    blockOwnerDeletion: true
    controller: true
    kind: ClusterResourcePlacement
    name: crp-3
    uid: b4f31b9a-971a-480d-93ac-93f093ee661f
  resourceVersion: "14434"
  uid: 85ee0e81-92c9-4362-932b-b0bf57d78e3f
spec:
  selectedResources:
  - apiVersion: v1
    kind: Namespace
    metadata:
      labels:
        kubernetes.io/metadata.name: test-ns
      name: test-ns
    spec:
      finalizers:
      - kubernetes

在规范中 ClusterResourceSnapshot ,该 selectedResources 部分现在显示命名空间 test-ns

ClusterResourceBinding检查是否kind-cluster-1在创建命名空间后更新了命名空间test-ns。 有关详细信息,请参阅 如何查找最新的 ClusterResourceBinding 资源?

kind-cluster-1 的 ClusterResourceBinding

apiVersion: placement.kubernetes-fleet.io/v1
kind: ClusterResourceBinding
metadata:
  creationTimestamp: "2024-05-07T23:08:53Z"
  finalizers:
  - kubernetes-fleet.io/work-cleanup
  generation: 2
  labels:
    kubernetes-fleet.io/parent-CRP: crp-3
  name: crp-3-kind-cluster-1-7114c253
  resourceVersion: "14438"
  uid: 0db4e480-8599-4b40-a1cc-f33bcb24b1a7
spec:
  applyStrategy:
    type: ClientSideApply
  clusterDecision:
    clusterName: kind-cluster-1
    clusterScore:
      affinityScore: 0
      priorityScore: 0
    reason: picked by scheduling policy
    selected: true
  resourceSnapshotName: crp-3-0-snapshot
  schedulingPolicySnapshotName: crp-3-0
  state: Bound
  targetCluster: kind-cluster-1
status:
  conditions:
  - lastTransitionTime: "2024-05-07T23:13:51Z"
    message: The resources cannot be updated to the latest because of the rollout
      strategy
    observedGeneration: 2
    reason: RolloutNotStartedYet
    status: "False"
    type: RolloutStarted
  - lastTransitionTime: "2024-05-07T23:08:53Z"
    message: No override rules are configured for the selected resources
    observedGeneration: 2
    reason: NoOverrideSpecified
    status: "True"
    type: Overridden
  - lastTransitionTime: "2024-05-07T23:08:53Z"
    message: All of the works are synchronized to the latest
    observedGeneration: 2
    reason: AllWorkSynced
    status: "True"
    type: WorkSynchronized
  - lastTransitionTime: "2024-05-07T23:08:53Z"
    message: All corresponding work objects are applied
    observedGeneration: 2
    reason: AllWorkHaveBeenApplied
    status: "True"
    type: Applied
  - lastTransitionTime: "2024-05-07T23:08:53Z"
    message: All corresponding work objects are available
    observedGeneration: 2
    reason: AllWorkAreAvailable
    status: "True"
    type: Available

保持不变 ClusterResourceBinding 。 在规范中 ClusterResourceBinding ,仍 resourceSnapshotName 引用旧 ClusterResourceSnapshot 名称。 如果用户没有显式 RollingUpdate 输入,因为应用了默认值,则会出现此问题:

  • 该值 maxUnavailable 配置为 25% × 3(所需数字),舍入为 1
  • 该值 maxSurge 配置为 25% × 3(所需数字),舍入为 1

为何未更新 ClusterResourceBinding

最初,创建后 ClusterResourcePlacement 会生成两 ClusterResourceBindings 个。 但是,由于推出不适用于初始阶段,条件 ClusterResourcePlacementRolloutStarted 设置为 True

在中心群集上创建 test-ns 命名空间时,推出控制器尝试更新两个现有 ClusterResourceBindings命名空间。 但是, maxUnavailable 由于 1 缺少成员群集,因此 RollingUpdate 配置过于严格。

注意

在更新期间,如果其中一个绑定无法应用,它还会违反 RollingUpdate 配置,这会导致 maxUnavailable 设置为 1

解决方法

在这种情况下,若要解决此问题,请考虑手动设置为 maxUnavailable 大于 1 放宽 RollingUpdate 配置的值。 或者,可以加入第三个成员群集。

联系我们寻求帮助

如果你有任何疑问或需要帮助,请创建支持请求联系 Azure 社区支持。 你还可以将产品反馈提交到 Azure 反馈社区