Troubleshooting Lab: Manage costs by using alerts, budgets, and Azure Advisor recommendations
Diagnostic Scenarios
Scenario 1: Root Cause
A company's FinOps team configured a monthly budget of USD 5,000 for a production subscription. Three alerts were defined with the following settings:
Budget name: prod-monthly-budget
Amount: 5000 USD
Reset period: Monthly
Start date: 2024-01-01
Alert 1: Type=Actual, Threshold=80%, Recipients=finops@company.com
Alert 2: Type=Actual, Threshold=100%, Recipients=finops@company.com
Alert 3: Type=Forecasted, Threshold=110%, Recipients=finops@company.com
On the 18th of the current month, Azure Cost Management displays an accumulated cost of USD 4,320, representing 86.4% of the budget. The responsible analyst has not received any email notifications so far.
Additional information collected by the analyst:
- The subscription is active and resources are generating charges normally
- The analyst confirmed they can access the Azure portal without problems
- The finops@company.com address is active and received other corporate emails today
- The budget was created 3 days ago, with a retroactive start date of 2024-01-01
What is the root cause of the missing notifications?
A) The recipient's email address was rejected by Microsoft's email server because it belongs to an external corporate domain.
B) The budget was created after the cost had already exceeded the 80% threshold, and Actual-type alerts do not trigger retroactively for costs already incurred before budget creation.
C) The Forecasted alert type conflicts with Actual-type alerts in the same budget, blocking the sending of all notifications.
D) The monthly reset period restarted the accumulation on the budget creation day, zeroing the costs considered for threshold calculation.
Scenario 2: Action Decision
The cause of the problem has been identified: a set of development VMs was mistakenly left running over the weekend, generating USD 1,200 in unplanned charges in 48 hours. The monthly budget for the development subscription is USD 3,000, and the accumulated cost for the month is already USD 2,850, or 95% of the limit.
The infrastructure manager determined that the team must act immediately. The following restrictions apply at the moment:
- It's 11 PM on Sunday; the approval team is not available
- The development VMs do not have active workloads at the moment
- The subscription has a corporate policy that requires formal approval for permanent resource deletion
- The team has Contributor permission on the subscription, without billing access
What is the correct action to take at this moment?
A) Delete the VMs immediately to stop charges, given that there's no active workload and the cost is near the limit.
B) Shut down (deallocate) the VMs to stop compute charges, without deleting resources, respecting the approval policy for deletion.
C) Assign an Azure Policy with a deny effect to block the creation of new resources in the subscription until the next billing cycle.
D) Wait until Monday to trigger the formal approval process before any action, avoiding violating the corporate policy.
Scenario 3: Root Cause
An administrator reviews Azure Advisor and finds the following recommendation in the cost category:
Recommendation: Resize or shutdown underutilized virtual machines
Resource: vm-analytics-prod (Standard_E16s_v3)
Savings: ~USD 890/month
Confidence: High
Observation: Average CPU utilization: 4.2% over 14 days
Max CPU utilization: 11.7% over 14 days
Average memory: 18% over 14 days
The administrator decides to act on the recommendation and resizes the VM to Standard_E4s_v3 at 10 AM on a Tuesday. At 2 PM the same day, they return to Azure Advisor expecting to see the recommendation removed from the list, but it still appears with the same data.
The administrator concludes that the resize failed silently and opens a support ticket. The Azure portal, however, confirms that the VM is running with the Standard_E4s_v3 SKU.
Additional information:
- The VM is running continuously and healthy on the new SKU since the resize
- No Azure Monitor alerts were triggered for the VM
- The administrator has Owner role on the subscription
- The VM's region supports the Standard_E4s_v3 SKU without restrictions
What is the root cause of the behavior observed in Azure Advisor?
A) The resize was applied but generated a new resource with a different ID, causing Advisor to continue monitoring the original resource.
B) Azure Advisor does not update recommendations in real-time; there's a lag in the update cycle, and the recommendation still reflects data collected before the action.
C) The recommendation persists because the Standard_E4s_v3 SKU is also being flagged as underutilized based on the same historical data.
D) Azure Advisor requires the administrator to manually mark the recommendation as "Implemented" for it to be removed from the list after the action.
Scenario 4: Diagnostic Sequence
An analyst receives the following report: "I configured a budget last week, but alerts never arrive. We've already spent 110% of the budget amount this month and I haven't received anything."
The analyst needs to investigate the problem. The following investigation steps are available, presented out of order:
[P] Verify if the budget has alerts configured with valid recipients
[Q] Confirm the current accumulated cost in Azure Cost Management
[R] Verify if the alert type is Actual or Forecasted and what threshold is defined
[S] Check if the recipient's email address is on Azure's suppression list
[T] Confirm if the budget start date covers the current cost period
What is the correct investigation sequence?
A) Q -> T -> P -> R -> S
B) P -> R -> Q -> S -> T
C) T -> Q -> P -> S -> R
D) Q -> P -> R -> T -> S
Answer Key and Explanations
Answer Key: Scenario 1
Answer: B
The diagnosis rests on the combination of two facts: the budget was created only 3 days ago, and the cost, now at 86.4%, had already passed the 80% threshold by then. Actual-type alerts are triggered when the cost crosses the threshold during the active monitoring cycle. If the cost had already exceeded 80% before the budget existed, the crossing that would have triggered the alert lies in the past, and the system does not issue retroactive notifications.
The detail that the email address works and received other messages today is a deliberate distractor. It may lead the analyst to suspect an email delivery problem, which is a valid hypothesis in other contexts but is not the cause here. The address is operational; the alert simply was never generated.
Alternative C describes a conflict between alert types that doesn't exist in the platform. Alternative D confuses the reset period behavior with threshold detection behavior. The consequence of acting based on distractor A would be escalating to email support, wasting time without solving the real problem.
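The crossing behavior described in this explanation can be illustrated with a toy model. This is only a sketch of the reasoning, not how Azure evaluates budgets internally: the function names are hypothetical, and every reading except the final 86.4% is invented for illustration.

```python
def first_crossing(readings, threshold_pct):
    """Return the index of the first reading at which the threshold is crossed.

    A crossing needs a reading below the threshold followed by one at or
    above it. The first reading has no predecessor, so it can never count.
    """
    for i in range(1, len(readings)):
        if readings[i - 1] < threshold_pct <= readings[i]:
            return i
    return None


def percent_of_budget(cost_usd, budget_usd):
    """Express an accumulated cost as a percentage of the budget amount."""
    return round(cost_usd / budget_usd * 100, 1)


# Scenario 1: USD 4,320 accumulated against a USD 5,000 budget.
current_pct = percent_of_budget(4320, 5000)

# Hypothetical readings since the budget was created three days ago;
# only the last value comes from the scenario.
late_budget = first_crossing([82.0, 83.5, 85.0, current_pct], 80)

# Hypothetical contrast: the budget already existed while cost was below 80%.
early_budget = first_crossing([70.0, 78.0, 81.0, current_pct], 80)

print(current_pct, late_budget, early_budget)
```

In this model the late-created budget never observes a reading below 80%, so no crossing is recorded, while the budget that existed earlier records one at the third reading.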
Answer Key: Scenario 2
Answer: B
The scenario imposes three simultaneous restrictions that eliminate the other alternatives: it is outside business hours, corporate policy prohibits deletion without approval, and the team has Contributor permission (sufficient to deallocate VMs; deleting them is technically possible but would violate the policy).
Shutting down (deallocating) the VMs solves the immediate problem, the ongoing compute charges, without violating any restriction. VMs in the deallocated state don't generate compute (CPU/RAM) charges, only managed disk costs, which are significantly lower. Note that a VM shut down only from inside the guest OS stays in the Stopped state and continues to bill for compute; it must be stopped through Azure (portal, CLI, or API) to reach the Stopped (deallocated) state.
Alternative A directly violates the corporate approval policy for deletion, even though there is no active workload. Alternative C doesn't address the immediate cost problem, because the existing VMs would keep accumulating charges. Alternative D ignores the urgency and passively accepts that the budget will be exceeded overnight. The most dangerous distractor is A: it is technically effective, but it creates a compliance risk whose consequences may cost more than the extra spending itself.
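The urgency behind rejecting alternative D can be checked with a short calculation. The sketch below uses only the figures from the scenario and assumes, as a simplification, that the VMs keep billing at the weekend's average hourly rate and that no other resource adds cost in the meantime.

```python
budget_usd = 3000
accumulated_usd = 2850
weekend_charge_usd = 1200
weekend_hours = 48

# Share of the budget already consumed.
used_pct = accumulated_usd * 100 / budget_usd

# Average burn rate of the forgotten VMs (assumed to stay constant).
hourly_rate_usd = weekend_charge_usd / weekend_hours

# Remaining headroom and the time it lasts at that rate.
headroom_usd = budget_usd - accumulated_usd
hours_to_limit = headroom_usd / hourly_rate_usd

print(f"{used_pct:.0f}% used, {hourly_rate_usd:.0f} USD/h, "
      f"{hours_to_limit:.0f} h until the budget is reached")
```

Under that assumption the remaining USD 150 lasts about six hours from 11 PM on Sunday, so the budget would be reached around 5 AM on Monday if the VMs were left running.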
Answer Key: Scenario 3
Answer: B
Azure Advisor processes historical utilization data and updates its recommendations periodically, not in real time. The resize was applied successfully at 10 AM, but by 2 PM the same day Advisor had not yet run an update cycle that includes the new resource state, so the recommendation still displays the data that led to the original suggestion.
The portal confirming the correct SKU is the decisive clue that the action succeeded. The administrator's Owner role and the region's support for the SKU are irrelevant to the diagnosis: they confirm that nothing technical prevented the resize, but they don't explain the observed behavior.
Alternative D represents a common diagnostic error: assuming the system requires manual confirmation when, in reality, the problem is simply temporal. The consequence of acting based on this alternative would be manually marking the recommendation as implemented, which could mask a legitimate recommendation if Advisor detected underutilization in the new SKU in the future.
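The timing argument can be expressed as a simple comparison, assuming a last-refreshed timestamp for the recommendation is available. The `last_updated` name, the calendar date, and the 3 AM refresh time below are placeholders for illustration, not values from the scenario or a confirmed API field.

```python
from datetime import datetime


def classify(last_updated: datetime, change_time: datetime) -> str:
    """Tell whether a recommendation can already reflect a change."""
    if last_updated < change_time:
        return "stale: refreshed before the change, wait for the next cycle"
    return "current: refreshed after the change, investigate further"


# Resize at 10 AM, as in Scenario 3; the calendar date is a placeholder.
resize_time = datetime(2024, 1, 9, 10, 0)

# Hypothetical: the recommendation was last refreshed earlier that morning.
last_updated = datetime(2024, 1, 9, 3, 0)

status = classify(last_updated, resize_time)
print(status)
```

A "stale" result matches answer B: the recommendation predates the resize, so its persistence says nothing about whether the resize succeeded.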
Answer Key: Scenario 4
Answer: A
The correct investigation sequence follows the logic of progressive elimination, from most fundamental to most specific:
| Step | Action | Reason |
|---|---|---|
| Q | Confirm accumulated cost | Validate the reported symptom with objective data |
| T | Verify budget start date | Ensure the billing period is covered |
| P | Verify alerts and recipients | Confirm the alert structure exists |
| R | Verify type and threshold | Confirm if the alert should have triggered |
| S | Check email suppression list | Investigate delivery problem as last resort |
Starting with P or R before confirming that the cost actually exceeded the threshold (Q) and that the budget covers the correct period (T) represents a sequence error: the analyst would be investigating alert configuration without having validated the prerequisites for it to trigger. The email suppression list (S) is the hypothesis most external to the budget system and should be investigated last, only if all other elements are correct.
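The ordering above can be sketched as a checklist that stops at the first step that does not hold. The step labels come from the scenario; the function name and the sample results are hypothetical, and real checks would be performed in the portal or via CLI rather than supplied as booleans.

```python
from typing import Optional

# Investigation order from the answer key: most fundamental first.
ORDER = ["Q", "T", "P", "R", "S"]

DESCRIPTIONS = {
    "Q": "accumulated cost confirms the reported symptom",
    "T": "budget start date covers the current cost period",
    "P": "alerts exist with valid recipients",
    "R": "alert type and threshold mean the alert should have fired",
    "S": "recipient address is not on the suppression list",
}


def first_failed_step(results: dict) -> Optional[str]:
    """Return the first step, in investigation order, whose check failed."""
    for step in ORDER:
        if not results[step]:
            return step
    return None


# Hypothetical outcome: both R and S look wrong, but R is reached first.
results = {"Q": True, "T": True, "P": True, "R": False, "S": False}

failed = first_failed_step(results)
print(failed, "-", DESCRIPTIONS[failed])
```

Because the steps run in a fixed order, a finding at R is reported before the suppression list is even considered, which mirrors treating S as the last resort.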
Troubleshooting Tree: Manage costs by using alerts, budgets, and Azure Advisor recommendations
Color Legend:
- Dark blue: initial symptom, investigation entry point
- Blue: diagnostic question node, binary or state decision
- Red: identified cause, confirmed problem origin
- Green: recommended action or applicable resolution
To use this tree when facing a real problem, start from the root node describing the observed symptom and follow the branches by answering each question based on what you can verify directly in the portal or via CLI. The goal is to reach a red node as quickly as possible, confirming the cause, before executing any corrective action indicated by the corresponding green node. Never skip intermediate validation steps: a wrong diagnosis followed by fast action costs more than a slow diagnosis followed by the right action.