Troubleshooting Lab: Azure Firewall Manager Policies
Diagnostic Scenarios
Scenario 1: Root Cause
A network team received complaints that a critical application hosted in a spoke VNet stopped receiving external traffic after a Friday-night maintenance window. The environment uses Azure Firewall Manager with a hierarchical Firewall Policy: a corporate-level parent policy and a child policy associated with the managed hub.
During maintenance, the following actions were performed:
- Updated the child policy with new application rules for a new SaaS application
- Added a custom FQDN tag in the parent policy
- Renewed the TLS certificate used by TLS inspection in the child policy
- Restarted the Azure Firewall via the portal
After the window, the team observes the following behavior:
Connection attempt to 10.20.5.10:443 from 203.0.113.45
Action: Deny
Rule Collection: DefaultDeny
Policy: ChildPolicy-Prod
Matched Rule: None
Threat Intel: No match
The traffic in question was always allowed by a network rule in the parent policy, never in the child. The child policy has Threat Intelligence configured in Alert and Deny mode. The firewall SKU is Standard.
What is the root cause of the problem?
A) The Azure Firewall restart cleared the state of active rules, requiring manual reapplication of policies.
B) The child policy overrode the network rule of the parent policy because network rules in child policies take priority over those in the parent policy.
C) The TLS certificate renewed in the child policy is invalid, causing blocking of all inspected HTTPS traffic.
D) The network rule from the parent policy is not inherited by the child policy because the associated firewall uses Standard SKU, which does not support policy hierarchy.
Scenario 2: Action Decision
You are the architect responsible for a hub-and-spoke topology with Azure Virtual WAN. The environment has a Secured Virtual Hub with Azure Firewall managed by Firewall Manager. An internal audit identified that the following behavior is occurring:
Source: 10.1.0.5 (Spoke A)
Destination: 10.2.0.8 (Spoke B)
Port: 1433
Action: Allow
Rule: AllowAll-Network
Policy: HubPolicy-Dev
The HubPolicy-Dev policy was mistakenly associated with the production hub during an update. It contains an AllowAll-Network network rule that allows all traffic between spokes without restriction. The correct policy is HubPolicy-Prod, which contains specific rules per segment.
The environment is in active production with dozens of established connections. No maintenance window is available for the next 6 hours.
What is the correct action to take at this moment?
A) Immediately disassociate HubPolicy-Dev from the hub and associate HubPolicy-Prod, accepting the momentary disruption as necessary to restore security posture.
B) Edit HubPolicy-Dev directly, adding the restriction rules from HubPolicy-Prod, without changing the association, until the maintenance window is available.
C) Create a deny rule with higher priority within HubPolicy-Dev to block unrestricted traffic until the swap can be made in the maintenance window.
D) Make no changes until the maintenance window and log the incident, as any modification in production without a window may cause greater impact than the existing vulnerability.
Scenario 3: Root Cause
An administrator reports that after associating a new Firewall Policy with an existing Azure Firewall via Firewall Manager, application rules that allow access to *.microsoft.com stopped working for a specific group of users. Other users were not affected.
The environment has the following characteristics:
- Firewall SKU: Premium
- TLS Inspection: enabled
- The new policy replaced the previous one, which did not have TLS Inspection configured
- The affected group uses devices with a different corporate root certificate than the organization's standard
- The firewall's intermediate certificate was issued by an internal CA
- The firewall log shows:
Category: AzureFirewallApplicationRule
Action: Deny
Rule: AllowMicrosoftFQDN
Fqdn: login.microsoft.com
Protocol: Https
TlsInspectionEnabled: true
TlsInspectionResult: CertificateValidationFailed
The administrator immediately suspects that the application rule was configured incorrectly. He also mentions that the SKU was upgraded from Standard to Premium three weeks ago, but everything was working until the policy change.
What is the root cause of the problem?
A) The AllowMicrosoftFQDN application rule was configured without the HTTPS protocol, preventing TLS inspection from processing requests correctly.
B) The root certificate of the internal CA that issued the firewall's intermediate certificate is not present in the trusted certificate repository of the affected group's devices.
C) The SKU upgrade from Standard to Premium generated incompatibility with the previous policy, corrupting inherited application rules.
D) Enabling TLS Inspection in the new policy requires that the FQDN *.microsoft.com be added to the TLS inspection exclusion list, as Microsoft FQDNs do not support TLS interception.
Scenario 4: Diagnostic Sequence
An engineer receives the following alert at 2:37 PM in a production environment:
Alert: Azure Firewall - High Threat Intelligence Hit Rate
Hub: SecuredHub-EastUS
Policy: CorporatePolicy-Prod
Period: Last 15 minutes
Threat Intel Hits: 847
Action Configured: Alert Only
Shortly after, users report generalized slowness in internet access from all spokes connected to the hub. The firewall is responding, but with high latency. The team suspects that the volume of Threat Intelligence hits and the slowness are related, but is not sure how to proceed.
The available investigation steps are:
1. Check in the Diagnostic Settings logs whether the Threat Intelligence hits correspond to legitimate or known malicious destination IPs
2. Analyze Azure Firewall metrics in Azure Monitor to identify throughput, active connections, and SNAT port usage
3. Confirm whether the policy is in Alert Only or Alert and Deny mode for Threat Intelligence
4. Check the Azure Activity Log for recent configuration changes to the policy
5. Check whether the firewall SKU supports the current connection volume and whether scaling is needed
What is the correct investigation sequence?
A) 3 -> 1 -> 4 -> 2 -> 5
B) 2 -> 3 -> 1 -> 4 -> 5
C) 4 -> 1 -> 3 -> 2 -> 5
D) 1 -> 4 -> 2 -> 3 -> 5
Answer Key and Explanations
Answer Key: Scenario 1
Answer: D
The root cause is that Azure Firewall with Standard SKU does not support policy hierarchy. Parent-child inheritance in Firewall Manager is a functionality exclusive to the Premium SKU. With Standard SKU, only one policy can be associated with the firewall, and it is applied flat, without inheritance from a parent policy.
The definitive clue in the statement is the combination of two elements: the use of policy hierarchy (parent and child policy) and the explicitly declared Standard SKU. The log confirms that traffic was processed only by ChildPolicy-Prod and fell into the DefaultDeny rule, meaning the parent policy rule was never evaluated.
The irrelevant information in the scenario is the TLS certificate renewal. It is plausible as an initial suspicion, but the log shows that the blocking occurred in a network rule, before even reaching the application layer where TLS Inspection would act.
The main reasoning error of the distractors is focusing on firewall restart (A), certificate (C), or child rule precedence over parent (B). Alternative B is the most dangerous distractor: the engineer could spend hours reviewing rule priorities when the problem is an SKU limitation that makes hierarchy inoperative.
The consequence of acting based on B would be reordering rules in both policies without effect, while the environment remains blocked.
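The log reading used in this explanation can be sketched in a few lines of Python. This is only an illustration, assuming the simplified "Key: Value" layout shown in the scenario (not the actual Azure Firewall log schema); the function names `parse_log_record` and `fell_through_to_default_deny` are hypothetical helpers introduced here.

```python
# Minimal sketch: read the scenario's simplified log excerpt and check
# whether the deny came from "no rule matched" rather than an explicit rule.
# The "Key: Value" layout mirrors the excerpt in Scenario 1 only.

SCENARIO_1_LOG = """
Connection attempt to 10.20.5.10:443 from 203.0.113.45
Action: Deny
Rule Collection: DefaultDeny
Policy: ChildPolicy-Prod
Matched Rule: None
Threat Intel: No match
"""


def parse_log_record(text: str) -> dict:
    """Turn 'Key: Value' lines into a dictionary, skipping free-text lines."""
    record = {}
    for line in text.strip().splitlines():
        # Split on colon-plus-space so "10.20.5.10:443" is not mistaken for a key.
        key, sep, value = line.partition(": ")
        if sep:
            record[key.strip()] = value.strip()
    return record


def fell_through_to_default_deny(record: dict) -> bool:
    """True when traffic was denied because no rule and no threat intel entry matched."""
    return (
        record.get("Action") == "Deny"
        and record.get("Matched Rule") == "None"
        and record.get("Threat Intel") == "No match"
    )


record = parse_log_record(SCENARIO_1_LOG)
print(record["Policy"], fell_through_to_default_deny(record))
```

A result of `True` for a record attributed to ChildPolicy-Prod matches the reasoning above: nothing in the evaluated policy matched the flow, so rule reordering or certificate checks are not the first place to look.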
Answer Key: Scenario 2
Answer: C
The correct action is to add a high-priority deny rule within HubPolicy-Dev to eliminate unrestricted access between spokes, without replacing the associated policy, until the maintenance window allows for a controlled swap.
The determining constraints are: an active production environment, established connections, and no maintenance window for 6 hours. An immediate policy swap (A) is technically correct in security terms, but it ignores the critical constraint of operational impact. Swapping the policy associated with a hub in Firewall Manager causes a configuration update that can reset active connections and cause disruption.
Alternative B is dangerous because editing HubPolicy-Dev to replicate HubPolicy-Prod is laborious, error-prone, and makes both policies redundant, creating unnecessary operational complexity. Alternative D represents decision paralysis: an AllowAll rule in production is an active vulnerability that justifies immediate action even outside a maintenance window.
Alternative C balances immediate risk mitigation with operational stability, which is the correct criterion for decisions made when no maintenance window is available.
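The effect of a higher-priority deny rule can be illustrated with a small Python model of priority-ordered evaluation, where a lower priority number is evaluated first. This is a simplified sketch: it flattens rule collection groups and collections into one priority per rule, and the rule name `TempDeny-SpokeA-to-SpokeB-SQL`, the priority values, and the spoke CIDR ranges are assumptions made for illustration, not values from the scenario.

```python
from dataclasses import dataclass
from ipaddress import ip_address, ip_network
from typing import Optional


@dataclass(frozen=True)
class NetworkRule:
    name: str
    priority: int          # lower number is evaluated first
    action: str            # "Allow" or "Deny"
    source: str            # CIDR range
    destination: str       # CIDR range
    port: Optional[int]    # None matches any port


def evaluate(rules, src: str, dst: str, port: int):
    """Return (action, rule name) of the first matching rule in priority order."""
    for rule in sorted(rules, key=lambda r: r.priority):
        if (
            ip_address(src) in ip_network(rule.source)
            and ip_address(dst) in ip_network(rule.destination)
            and rule.port in (None, port)
        ):
            return rule.action, rule.name
    return "Deny", "DefaultDeny"  # nothing matched


# Rule from the scenario; its priority and address ranges are assumed.
allow_all = NetworkRule("AllowAll-Network", 200, "Allow", "10.0.0.0/8", "10.0.0.0/8", None)
# Hypothetical temporary rule added ahead of it.
temp_deny = NetworkRule("TempDeny-SpokeA-to-SpokeB-SQL", 100, "Deny", "10.1.0.0/16", "10.2.0.0/16", 1433)

print(evaluate([allow_all], "10.1.0.5", "10.2.0.8", 1433))
print(evaluate([allow_all, temp_deny], "10.1.0.5", "10.2.0.8", 1433))
```

In this model the audited flow (10.1.0.5 to 10.2.0.8 on port 1433) is allowed with only `AllowAll-Network` present and denied once the temporary rule is added, while flows the deny rule does not cover still fall through to the allow rule. A real mitigation would need deny rules scoped to each unwanted path.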
Answer Key: Scenario 3
Answer: B
The root cause is the absence of the internal CA root certificate in the trusted certificate repository of the affected group's devices. When Azure Firewall Premium performs TLS Inspection, it intercepts the connection, decrypts and inspects the traffic, then presents the client with a new certificate signed by the intermediate CA certificate configured on the firewall, which chains to the internal CA. If the client device does not trust this CA, certificate validation fails and the connection is blocked.
The log confirms this precisely: TlsInspectionResult: CertificateValidationFailed. The fact that only the group with a different corporate root certificate is affected is the definitive clue, as other devices already trust the internal CA.
The irrelevant information is the mention of the SKU upgrade three weeks ago. The Premium SKU was working normally before the policy change. The problem arose with the new policy that enabled TLS Inspection, not with the SKU upgrade.
The most dangerous distractor is D: the statement that Microsoft FQDNs do not support TLS inspection is partially true in specific contexts, making the distractor convincing. Indeed, Microsoft recommends configuring TLS Inspection exclusions for services like Microsoft 365 for compatibility and performance reasons, but this is not an absolute technical limitation of the product, and the log clearly shows that the failure is certificate validation on the client, not destination service incompatibility.
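The trust relationship behind this answer can be sketched as a toy Python model. It performs no cryptography and only checks whether the root of the presented chain appears in a device's trust store; all certificate names (`Contoso-Internal-Root-CA`, `AzFW-Intermediate-CA`, `Other-Corp-Root-CA`) are hypothetical placeholders, since the scenario does not name them.

```python
# Toy model of the client-side trust decision during TLS inspection.
# No real certificate parsing or signature checks are performed.

def client_trusts_chain(chain, trust_store) -> bool:
    """chain is ordered leaf -> intermediate -> root; trust requires a known root."""
    return bool(chain) and chain[-1] in trust_store


# Chain the firewall presents after intercepting the connection (names assumed).
presented_chain = [
    "login.microsoft.com (issued on the fly by the firewall)",
    "AzFW-Intermediate-CA",
    "Contoso-Internal-Root-CA",
]

standard_devices = {"Contoso-Internal-Root-CA", "Public-Root-CA"}
affected_devices = {"Other-Corp-Root-CA", "Public-Root-CA"}

print(client_trusts_chain(presented_chain, standard_devices))
print(client_trusts_chain(presented_chain, affected_devices))
```

Under these assumptions the standard devices accept the chain and the affected group rejects it, which is consistent with only that group failing after TLS Inspection was enabled.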
Answer Key: Scenario 4
Answer: A
The correct sequence is 3 -> 1 -> 4 -> 2 -> 5.
Progressive diagnostic reasoning requires starting with the most immediate and determining information before moving to metrics and capacity analysis.
Step 3 comes first because the alert mentions 847 hits with Alert Only action. Confirming Threat Intelligence mode is the quickest step and directs the entire diagnosis: if it's in Alert Only, the hits are not the cause of slowness, as traffic is not being blocked.
Step 1 follows to identify if IPs are legitimate or malicious. If legitimate, there's a false positive at scale, which changes the action path. If malicious, the volume of attempts may indicate an ongoing attack.
Step 4 checks recent changes via Activity Log, which can explain both hit volume and slowness, if a route or policy change unexpectedly redirected traffic.
Step 2 analyzes throughput and connection metrics, which require context from previous steps to be interpreted correctly.
Step 5 comes last, because SKU scaling is a capacity decision that only makes sense after excluding configuration causes and anomalous traffic.
Alternative B is the most attractive distractor because it starts with metrics, which seems logical given slowness. However, starting with metric analysis without knowing Threat Intelligence mode leads the engineer to correlate slowness and hits without the necessary context to determine if this correlation is real.
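As a rough illustration, the sequence and the first decision it enables can be written as a short Python sketch. The step descriptions paraphrase the list in the scenario, the mode strings follow this document's wording ("Alert Only", "Alert and Deny") rather than any API value, and the function names are hypothetical.

```python
# Sketch of the investigation order from the answer key (3 -> 1 -> 4 -> 2 -> 5).

STEPS = {
    1: "Classify Threat Intelligence hits as legitimate or known malicious",
    2: "Analyze throughput, active connections, and SNAT port usage",
    3: "Confirm the Threat Intelligence mode of the policy",
    4: "Review recent policy changes in the Activity Log",
    5: "Check whether the SKU supports the load and whether scaling is needed",
}

SEQUENCE = (3, 1, 4, 2, 5)


def build_runbook(steps, sequence):
    """Return the steps in investigation order, rejecting incomplete sequences."""
    if sorted(sequence) != sorted(steps):
        raise ValueError("sequence must use every step exactly once")
    return [steps[number] for number in sequence]


def hits_can_block_traffic(mode: str) -> bool:
    """Only a denying mode lets Threat Intelligence hits block traffic."""
    return mode == "Alert and Deny"


for position, description in enumerate(build_runbook(STEPS, SEQUENCE), start=1):
    print(f"{position}. {description}")
print(hits_can_block_traffic("Alert Only"))
```

With the alert's configured action of Alert Only, the final check returns `False`, which is why the hits alone do not account for blocked traffic and the remaining steps look for other causes of the slowness.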
Troubleshooting Tree: Azure Firewall Manager Policies
Color legend:
| Color | Node type |
|---|---|
| Dark blue | Initial symptom (tree root) |
| Medium blue | Diagnostic question |
| Red | Identified cause |
| Green | Recommended action or resolution |
| Orange | Intermediate validation or verification |
To use this tree when facing a real problem, start with the root node by identifying whether the observed behavior is unexpected blocking or slowness without explicit blocking. Follow the diagnostic questions by answering with what you can verify directly in logs, metrics, or Activity Log, without assuming the cause. Each branch eliminates a hypothesis or confirms a path until you reach an identified cause node, from which the corresponding recommended action points to the next concrete step.