Troubleshooting Lab: Modify an existing Azure Resource Manager template

Diagnostic Scenarios

Scenario 1: Root Cause

An infrastructure team modifies an existing ARM template to add a second managed disk to a production VM. The original template worked correctly. After modification, the deployment is submitted via Azure CLI and fails immediately with the following output:

$ az deployment group create \
    --resource-group rg-prod-eastus \
    --template-file vm-template.json \
    --parameters @vm-params.json

{
  "error": {
    "code": "InvalidTemplate",
    "message": "Deployment template validation failed: 'The template resource 'osdisk-prod-01' at line '47' and column '9' is not valid: The language expression property 'parameters' doesn't exist, please see https://aka.ms/arm-template-expressions for usage details.'",
    "target": "osdisk-prod-01"
  }
}

The responsible engineer reports that the diskSizeGB parameter was correctly added to the template's parameters section. They also mention that they changed the apiVersion of the Microsoft.Compute/disks resource from 2022-07-02 to 2023-04-02 to use newer features. The relevant snippet of the modified template is:

"diskSizeGB": {
  "type": "int",
  "defaultValue": 128
},
{
  "type": "Microsoft.Compute/disks",
  "apiVersion": "2023-04-02",
  "name": "osdisk-prod-01",
  "location": "[resourceGroup().location]",
  "properties": {
    "diskSizeGB": "[parameter('diskSizeGB')]",
    "creationData": {
      "createOption": "Empty"
    }
  }
}

The template has 3 other resources that were previously deployed successfully and were not modified.

What is the root cause of the failure?

A) The apiVersion 2023-04-02 is not compatible with the Microsoft.Compute/disks type, causing schema rejection.

B) The template function has the wrong name: parameter instead of parameters, making the expression invalid.

C) The diskSizeGB parameter was declared as int, but the diskSizeGB field in properties requires a string.

D) The creationData.createOption field is incompatible with the diskSizeGB property when both are declared simultaneously.


Scenario 2: Action Decision

The cause of the problem has been identified: a production ARM template references a network resource via dependsOn using an outdated literal name. The network resource was renamed in a previous template modification, but the dependsOn entry was not updated. As a result, every new environment deployment fails with an error stating that the dependency resource was not found.

The environment has the following constraints:

  • The affected deployment runs from a CI/CD pipeline that executes on every commit to the main branch
  • The template file is versioned in a Git repository with protected branch policy: all changes require a pull request approved by two reviewers
  • The team only has one reviewer available at the moment
  • The pipeline can be manually paused by the technical lead without impacting already provisioned resources
  • A staging environment exists with an identical template where the fix can be tested

What is the correct action to take at this time?

A) Edit the template file directly in the main branch via the repository interface to fix the dependsOn immediately, avoiding more pipeline failures.

B) Pause the pipeline, open a pull request with the dependsOn fix, test in staging and wait for approval according to the branch policy.

C) Temporarily remove the incorrect dependsOn entry to unblock the pipeline and add the correct version in a second pull request later.

D) Revert the template to the version before the network resource rename to restore consistency between the declared name and the dependency.


Scenario 3: Root Cause

An administrator modifies an ARM template to provision three storage accounts using a copy loop. The template validates without errors using the az deployment group validate command, but when executing the actual deployment, only two accounts are created. No error is returned in the command output.

The environment uses the brazilsouth region. Previously created accounts in the same resource group are operational. The loop snippet in the template is:

"copy": {
  "name": "storageCopy",
  "count": "[parameters('storageCount')]"
},
"name": "[concat('stprod', copyIndex())]"

The parameter file used in the deployment contains:

{
  "$schema": "https://schema.management.azure.com/schemas/2019-04-01/deploymentParameters.json#",
  "contentVersion": "1.0.0.0",
  "parameters": {
    "storageCount": {
      "value": 2
    }
  }
}

The administrator states they intended to create three accounts and believes the template is configured for that. They also mention testing the template last week with storageCount equal to 5 and all accounts were created correctly.

What is the root cause of the observed behavior?

A) The copyIndex() function starts at 0, generating names like stprod0, stprod1, which causes conflicts with existing accounts and prevents creation of the third.

B) The storageCount parameter in the parameter file has value 2, and ARM created exactly the number of resources declared in that file.

C) The brazilsouth region has a limitation of two storage accounts per resource group in deployments via copy loop.

D) The az deployment group validate command does not execute the copy loop completely, masking the count error during validation.


Scenario 4: Diagnostic Sequence

A team receives the following report from an operator:

"I tried to redeploy the ARM template for the staging environment. The deployment was submitted but got stuck in the Running state for more than forty minutes. I canceled it manually. Now, when trying to redeploy, I get a different error than the original."

The new error observed after cancellation is:

{
  "error": {
    "code": "DeploymentOperationFailed",
    "message": "Resource 'vnet-staging-01' failed with error: Another operation (PUT) is in progress on resource 'vnet-staging-01'. Please retry later."
  }
}

The available investigation steps are listed below out of order:

  1. Check the current state of the vnet-staging-01 resource in the portal or via az network vnet show
  2. Identify if there's another active operation in the resource group via az deployment operation group list
  3. Wait for the stuck operation to complete or expire before redeploying
  4. Confirm that the template doesn't contain circular references between vnet-staging-01 and other resources
  5. Check the resource group deployment history to identify the canceled operation

What is the correct investigation sequence?

A) 4 → 1 → 5 → 2 → 3

B) 5 → 2 → 1 → 3 → 4

C) 1 → 4 → 2 → 5 → 3

D) 2 → 5 → 1 → 4 → 3


Answer Key and Explanations

Answer Key: Scenario 1

Answer: B

The error message is explicit: "The language expression property 'parameters' doesn't exist". This occurs because the expression in the template uses "[parameter('diskSizeGB')]" instead of "[parameters('diskSizeGB')]" with the final s. ARM doesn't recognize parameter as a valid template function and rejects the expression during validation, before even attempting to provision any resource.

The decisive clue is in the error message, which points to the language expression, not to the resource type or property combination.
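
A misspelled function name like this one can be caught before the template is ever submitted. The following is a minimal sketch, not an ARM validator: it walks a parsed template and flags any function name that is not in a small allow-list. The `KNOWN_FUNCTIONS` set is a deliberately incomplete assumption made for illustration, and the helper names `unknown_functions` and `scan` are hypothetical.

```python
import re
from typing import Iterator, List, Tuple

# Deliberately small subset of ARM template function names (an assumption
# for this sketch; the real list is much longer).
KNOWN_FUNCTIONS = {
    "parameters", "variables", "concat", "copyIndex",
    "resourceGroup", "resourceId", "reference", "subscription",
}

STRING_LITERAL = re.compile(r"'(?:[^']|'')*'")
FUNCTION_CALL = re.compile(r"([A-Za-z_][A-Za-z0-9_]*)\s*\(")


def unknown_functions(value: str) -> List[str]:
    """Return function names in an ARM expression that are not in KNOWN_FUNCTIONS."""
    is_expression = (
        value.startswith("[") and value.endswith("]") and not value.startswith("[[")
    )
    if not is_expression:
        return []
    # Blank out string literals so text inside quotes is never read as a call.
    body = STRING_LITERAL.sub("''", value[1:-1])
    return [name for name in FUNCTION_CALL.findall(body) if name not in KNOWN_FUNCTIONS]


def scan(node, path: str = "$") -> Iterator[Tuple[str, str]]:
    """Walk parsed template JSON and yield (json_path, unknown_function_name)."""
    if isinstance(node, dict):
        for key, value in node.items():
            yield from scan(value, f"{path}.{key}")
    elif isinstance(node, list):
        for index, value in enumerate(node):
            yield from scan(value, f"{path}[{index}]")
    elif isinstance(node, str):
        for name in unknown_functions(node):
            yield path, name


# The disk resource from Scenario 1, reduced to the relevant fields.
template = {
    "resources": [
        {
            "name": "osdisk-prod-01",
            "location": "[resourceGroup().location]",
            "properties": {"diskSizeGB": "[parameter('diskSizeGB')]"},
        }
    ]
}

for json_path, name in scan(template):
    print(f"{json_path}: unknown function '{name}'")
```

Because the allow-list is incomplete, a hit only means "check this expression", not "this expression is wrong".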

The information about the apiVersion change is purposefully irrelevant: it doesn't cause the described error and serves to divert diagnosis toward alternative A. The API version affects the set of available properties, but doesn't generate the specific message observed.

Alternative C represents a reasoning error about types: ARM accepts integers in numeric schema fields without manual conversion. Alternative D describes a non-existent incompatibility: creationData and diskSizeGB normally coexist in the resource schema.

The most dangerous distractor is A, as it leads the operator to investigate apiVersion compatibility, which is a legitimate cause in other contexts but doesn't explain the specific error message in this scenario.


Answer Key: Scenario 2

Answer: B

The critical constraint of the scenario is the protected branch policy requiring two reviewers. Editing the main branch directly, as proposed in alternative A, violates this policy and may be blocked by the repository itself or introduce additional risks without review.

The correct action is to pause the pipeline to stop continuous failures without impacting already provisioned resources, then follow the established process: open pull request, validate in staging and wait for approval. The scenario confirms that pausing the pipeline is an available operation without impact.

Alternative C represents a technically functional solution, but unnecessarily divides the fix into two pull requests, creating a window where the template would be without the declared dependency, which could cause race conditions during parallel deployments.

Alternative D would be regressive: reverting to the previous version would restore the old network resource name, but would undo any other modifications made since then, potentially causing new problems in the environment.

The central reasoning error of distractors A and C is prioritizing fix speed over process, ignoring the governance constraint explicitly stated in the scenario.
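
The staging validation in alternative B could include an automated check for the underlying defect, so that a renamed resource with a stale dependsOn entry is caught during review. The sketch below is an illustration under stated assumptions: it only handles literal names (entries written as template expressions are skipped), the function name `dangling_depends_on` is hypothetical, and the resource names in the sample template are invented for the example.

```python
from typing import List, Tuple


def dangling_depends_on(template: dict) -> List[Tuple[str, str]]:
    """Return (resource_name, dependency) pairs whose literal dependency
    does not match the name of any resource declared in the template."""
    resources = template.get("resources", [])
    declared = {resource["name"] for resource in resources}
    dangling = []
    for resource in resources:
        for dependency in resource.get("dependsOn", []):
            if dependency.startswith("["):
                # Expressions such as resourceId(...) are not evaluated here.
                continue
            # A literal may be 'name' or 'type/name'; compare the last segment.
            if dependency.split("/")[-1] not in declared:
                dangling.append((resource["name"], dependency))
    return dangling


# Hypothetical template: the virtual network was renamed, the dependency was not.
template = {
    "resources": [
        {"type": "Microsoft.Network/virtualNetworks", "name": "vnet-new"},
        {
            "type": "Microsoft.Network/networkInterfaces",
            "name": "nic-01",
            "dependsOn": ["vnet-old"],
        },
    ]
}

for resource_name, dependency in dangling_depends_on(template):
    print(f"{resource_name} depends on undeclared resource '{dependency}'")
```

A check like this complements the pull request review; it does not replace the two-reviewer policy.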


Answer Key: Scenario 3

Answer: B

The behavior is exactly as expected. The parameter file sets storageCount to 2, and ARM creates exactly two resources, because that is the value supplied at deployment time. The template itself may be "configured for three" in the administrator's intention, but the source of truth for the deployment is the parameter file, and it says two.

The decisive clue is the presence of the parameter file with explicit value 2. The scenario mentions that previous tests with storageCount: 5 worked correctly, which eliminates any hypothesis of a bug in the loop or the copyIndex() function.
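
The precedence described above can be reproduced offline. This is a minimal sketch, assuming the template declares a defaultValue of 3 for storageCount (the scenario does not show the template's parameters section, so that default is hypothetical); the function name `effective_copy_names` is also invented for illustration. It mirrors the rule that a value in the parameter file overrides the template default, and that copyIndex() starts at 0.

```python
from typing import List


def effective_copy_names(
    template_parameters: dict,
    parameter_file: dict,
    count_param: str = "storageCount",
    prefix: str = "stprod",
) -> List[str]:
    """Return the resource names the copy loop would generate."""
    supplied = parameter_file.get("parameters", {})
    if count_param in supplied:
        # A value in the parameter file takes precedence over the default.
        count = supplied[count_param]["value"]
    else:
        count = template_parameters[count_param]["defaultValue"]
    # copyIndex() is zero-based, so the names run from prefix0 to prefix(count-1).
    return [f"{prefix}{index}" for index in range(count)]


# Hypothetical template default of 3; parameter file taken from the scenario.
template_parameters = {"storageCount": {"type": "int", "defaultValue": 3}}
parameter_file = {"parameters": {"storageCount": {"value": 2}}}

print(effective_copy_names(template_parameters, parameter_file))
```

Comparing the printed list with the intended number of accounts before deploying makes this kind of mismatch visible immediately.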

The information about the brazilsouth region is purposefully irrelevant and serves to induce the reader to investigate non-existent regional limitations for this type of resource.

Alternative A describes real behavior of copyIndex() starting at 0, but this doesn't cause the described problem: names like stprod0 and stprod1 are valid and don't conflict with each other. Alternative D describes a real limitation of the validate command regarding some runtime checks, but the copy loop is processed correctly during validation and is not the cause here.

The most dangerous distractor is A, as it directs investigation toward copyIndex() behavior, which is technically correct but completely irrelevant to the observed symptom.


Answer Key: Scenario 4

Answer: B

The correct sequence is: 5 → 2 → 1 → 3 → 4.

Progressive diagnostic reasoning moves from what is already known to what still needs to be discovered:

Step 5 (deployment history) establishes context: the manually canceled operation may have left the resource in an inconsistent state. This is the diagnostic entry point.

Step 2 (list active operations) confirms if there's still an operation running in the resource group, which is directly indicated by the received error message.

Step 1 (check current resource state) determines whether the vnet-staging-01 resource is still in a transitional provisioning state, which guides the next decision.

Step 3 (wait for the stuck operation to complete or expire) is the appropriate corrective action once the diagnosis is complete.

Step 4 (check circular reference) is the only step that investigates a structural cause in the template, not an environmental state condition. It's relevant in other deployment failure contexts, but not the cause of the specific message observed here, which clearly points to a concurrent operation. Therefore, it's executed last, only to rule out residual hypotheses.

Alternative A starts with the circular reference step, which is an unlikely cause given the explicit concurrent operation error, demonstrating inverted reasoning. Alternative C starts with the current resource state without first understanding the history, losing the necessary context to interpret what the state means.
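
Steps 5, 2 and 1 all end in reading a provisioning state, and step 3 is the decision that follows from them. The sketch below shows that decision as a pure function over data already collected, for example from the JSON output of the CLI commands named in the steps. It is a heuristic under stated assumptions: the set of terminal states is a simplification, the function name `reasons_to_wait` is hypothetical, and the deployment name in the sample data is invented.

```python
from typing import List

# Assumption for this sketch: anything outside these states is still in progress.
TERMINAL_STATES = {"Succeeded", "Failed", "Canceled"}


def reasons_to_wait(deployments: List[dict], resource: dict) -> List[str]:
    """Return the reasons a redeployment should be postponed (empty if none)."""
    reasons = []
    for deployment in deployments:
        state = deployment["properties"]["provisioningState"]
        if state not in TERMINAL_STATES:
            reasons.append(f"deployment {deployment['name']} is {state}")
    state = resource.get("provisioningState")
    if state not in TERMINAL_STATES:
        reasons.append(f"resource {resource['name']} is {state}")
    return reasons


# Hypothetical snapshot: the deployment was canceled, but the PUT on the
# virtual network has not finished yet.
deployments = [{"name": "staging-deploy", "properties": {"provisioningState": "Canceled"}}]
resource = {"name": "vnet-staging-01", "provisioningState": "Updating"}

for reason in reasons_to_wait(deployments, resource):
    print("wait:", reason)
```

An empty result only means that nothing observed blocks a retry; the template check in step 4 remains a separate, final verification.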


Troubleshooting Tree: Modify an existing Azure Resource Manager template

Color Legend:

| Color | Node Type |
| --- | --- |
| Dark Blue | Initial symptom (entry point) |
| Blue | Diagnostic question |
| Red | Identified cause |
| Green | Recommended action or resolution |
| Orange | Intermediate validation or verification |

To use this tree when facing a real problem, start with the root node and answer each question based on what was directly observed in the environment. The first step is always to classify whether the failure occurs in validation or execution, as this defines which branch to investigate. Follow the branches answering only what can be verified at that moment, without skipping steps. When reaching an identified cause node, apply the corresponding action and return to the validation node to confirm the correction before ending the diagnosis.
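
The first classification step can be sketched in code. The mapping below is an assumption limited to the two error codes that appear in this lab (InvalidTemplate in Scenario 1, DeploymentOperationFailed in Scenario 4); real deployments return many other codes, which this sketch reports as unknown. The function name `classify_failure` is hypothetical.

```python
# Assumption: only the two error codes seen in this lab are mapped.
VALIDATION_CODES = {"InvalidTemplate"}
EXECUTION_CODES = {"DeploymentOperationFailed"}


def classify_failure(error_payload: dict) -> str:
    """Return 'validation', 'execution', or 'unknown' for a deployment error."""
    code = error_payload.get("error", {}).get("code")
    if code in VALIDATION_CODES:
        return "validation"
    if code in EXECUTION_CODES:
        return "execution"
    return "unknown"


print(classify_failure({"error": {"code": "InvalidTemplate"}}))
print(classify_failure({"error": {"code": "DeploymentOperationFailed"}}))
```

An unknown result is a prompt to read the full error message before choosing a branch, not a reason to guess.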