Troubleshooting Lab: Create a network security group (NSG)
Diagnostic Scenarios
Scenario 1: Root Cause
An operations team reports that a production VM has stopped responding to SSH connections (port 22) from an internal bastion server with IP 10.1.0.5. The VM is in a subnet called snet-backend in VNet vnet-prod.
The on-call engineer checks the configuration and gathers the following information:
NSG associated with subnet snet-backend:
Inbound rules:
Priority | Name              | Source      | Port | Action
---------|-------------------|-------------|------|-------
100      | Allow-SSH-Bastion | 10.1.0.0/24 | 22   | Allow
300      | Allow-HTTPS       | *           | 443  | Allow
65000    | AllowVnetInBound  | VirtualNet. | *    | Allow
65500    | DenyAllInBound    | *           | *    | Deny
NSG associated with VM NIC:
Inbound rules:
Priority | Name             | Source      | Port | Action
---------|------------------|-------------|------|-------
150      | Deny-SSH         | *           | 22   | Deny
65000    | AllowVnetInBound | VirtualNet. | *    | Allow
65500    | DenyAllInBound   | *           | *    | Deny
The engineer also verifies that the VM is running, the SSH service is active, and the disk shows no alerts. The resource group was created three days ago as part of a new deployment.
What is the root cause of the connectivity failure?
A) The Allow-SSH-Bastion rule in the subnet NSG uses prefix 10.1.0.0/24, which doesn't cover IP 10.1.0.5, causing blocking at the first evaluation level.
B) The Deny-SSH rule in the NIC NSG, with priority 150, denies SSH traffic after the subnet NSG allows the connection, blocking access before reaching the VM.
C) The VM's SSH service is not accessible because the recent deployment may have introduced incorrect configuration in the operating system.
D) The AllowVnetInBound rule with priority 65000 in the NIC NSG overrides the Deny-SSH rule because service tags have precedence over port-based rules.
Scenario 2: Action Decision
The cause has been identified: an NSG associated with a production subnet contains a rule with priority 200 that denies all outbound traffic destined for the range 10.2.0.0/16. This range corresponds to the database subnet in another VNet connected via peering. All applications in the subnet stopped communicating with the databases approximately 40 minutes ago.
The environment is operating during peak hours. The database team confirmed that the servers are operational. The engineer has Contributor permission on the resource group containing the NSG. The security team has not yet been notified about the change that caused the problem.
What is the correct action to take at this moment?
A) Immediately remove the deny rule from the NSG, restore connectivity, and notify the security team after normalization.
B) Create a new outbound rule with priority 100, source *, destination 10.2.0.0/16, port *, action Allow, to override the deny rule without removing it, and notify the security team in parallel.
C) Wait for formal notification and approval from the security team before any changes, as the rule may have been created intentionally as a security control.
D) Temporarily disassociate the NSG from the subnet to restore traffic while the rule analysis is conducted.
Scenario 3: Root Cause
A developer reports that a newly provisioned VM in subnet snet-app cannot access an external service on port 443. The NSG associated with the VM's NIC was created yesterday and has no custom outbound rules. The output from the command executed on the VM is:
$ curl -v https://api.example.com
* Trying 203.0.113.45:443...
* connect to 203.0.113.45 port 443 failed: Connection timed out
* Failed to connect to api.example.com port 443 after 130004 ms
curl: (28) Connection timed out after 130003 milliseconds
The engineer verifies that the VM has a public IP assigned. He also confirms that there is no Azure Firewall and no user-defined routes (UDRs) configured in the subnet. The NSG for subnet snet-app has the following outbound rules:
Priority | Name                  | Destination | Port | Action
---------|-----------------------|-------------|------|-------
100      | Deny-Internet-Egress  | Internet    | *    | Deny
65000    | AllowVnetOutBound     | VNet.       | *    | Allow
65001    | AllowInternetOutBound | Internet    | *    | Allow
65500    | DenyAllOutBound       | *           | *    | Deny
The developer suggests that the timeout indicates a DNS problem. The network team mentions that the VM's public IP was assigned yesterday.
What is the root cause of the problem?
A) The public IP assigned to the VM was recently provisioned and has not yet propagated correctly, preventing outbound routing to the internet.
B) The absence of custom outbound rules in the NIC NSG causes outbound traffic to be blocked by default, as the default behavior is to deny everything.
C) The Deny-Internet-Egress rule with priority 100 in the subnet NSG blocks all outbound traffic destined to the Internet tag, overriding the default AllowInternetOutBound rule.
D) The timeout on port 443 indicates that the problem is DNS resolution, as the destination IP address is being resolved incorrectly within the VNet.
Scenario 4: Diagnostic Sequence
A production VM is not receiving HTTP traffic (port 80) from external clients. The responsible engineer has access to the Azure portal and the VM. Below are five possible investigation steps, presented out of order:
[P] Verify if the web application is running and listening on port 80 inside the VM
[Q] Use the "IP Flow Verify" functionality in Network Watcher to test if the NSG blocks inbound traffic on port 80
[R] Check if there are NSGs associated with both the NIC and subnet of the VM and list all inbound rules from both
[S] Confirm that the VM is running and accessible via the Azure portal
[T] Review NSG Flow Logs to identify if packets were actually blocked and by which rule
What is the correct sequence for diagnostic investigation?
A) S -> R -> Q -> T -> P
B) Q -> R -> T -> S -> P
C) R -> Q -> S -> T -> P
D) S -> Q -> R -> P -> T
Answer Key and Explanations
Answer Key: Scenario 1
Answer: B
Explanation:
- The determining clue is in the NIC NSG configuration: the Deny-SSH rule with priority 150 denies all traffic on port 22 regardless of source. For inbound traffic, Azure evaluates the subnet NSG first, then the NIC NSG. The subnet NSG allows traffic from the bastion (priority 100, 10.1.0.0/24 covers 10.1.0.5), but the NIC NSG then denies it. The traffic does not reach the VM.
- The irrelevant information in this scenario is the fact that the resource group was created three days ago and that the SSH service is active on the VM. These details are plausible as clues but don't influence the diagnosis, because the NSG evaluation chain completely explains the blocking.
- Alternative A is a distractor that forces the reader to recalculate the CIDR: 10.1.0.0/24 covers all addresses from 10.1.0.0 to 10.1.0.255, so 10.1.0.5 is within range. This calculation error is common under pressure. Alternative D represents a serious misconception: service tags don't have special precedence over port-based rules; evaluation is strictly by priority number.
- The most dangerous distractor is C: acting on it would lead the engineer to investigate the VM indefinitely without finding any problem, while traffic remains blocked by the NIC NSG.
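The two-level evaluation described above can be sketched in Python. This is a simplified model, not Azure's implementation: `Rule`, `first_match`, and `inbound_verdict` are hypothetical names, and the service-tag default rule (AllowVnetInBound) is left out because lower-numbered rules already match this flow at both levels.

```python
import ipaddress
from dataclasses import dataclass


@dataclass(frozen=True)
class Rule:
    priority: int
    name: str
    source: str  # CIDR prefix, or "*" for any source
    port: str    # port number as text, or "*" for any port
    action: str  # "Allow" or "Deny"


def first_match(rules, src_ip, port):
    """Return the matching rule with the lowest priority number, or None."""
    addr = ipaddress.ip_address(src_ip)
    for rule in sorted(rules, key=lambda r: r.priority):
        source_ok = rule.source == "*" or addr in ipaddress.ip_network(rule.source)
        port_ok = rule.port == "*" or rule.port == str(port)
        if source_ok and port_ok:
            return rule
    return None


def inbound_verdict(subnet_rules, nic_rules, src_ip, port):
    """Inbound traffic is checked against the subnet NSG first, then the NIC NSG."""
    for level, rules in (("subnet", subnet_rules), ("nic", nic_rules)):
        rule = first_match(rules, src_ip, port)
        if rule is None or rule.action == "Deny":
            return ("Deny", level, rule.name if rule else None)
    return ("Allow", None, None)


# Rules from Scenario 1 (service-tag default rule omitted, see the note above).
SUBNET_NSG = [
    Rule(100, "Allow-SSH-Bastion", "10.1.0.0/24", "22", "Allow"),
    Rule(300, "Allow-HTTPS", "*", "443", "Allow"),
    Rule(65500, "DenyAllInBound", "*", "*", "Deny"),
]
NIC_NSG = [
    Rule(150, "Deny-SSH", "*", "22", "Deny"),
    Rule(65500, "DenyAllInBound", "*", "*", "Deny"),
]

# The CIDR check that rules out alternative A.
bastion_in_prefix = ipaddress.ip_address("10.1.0.5") in ipaddress.ip_network("10.1.0.0/24")

verdict = inbound_verdict(SUBNET_NSG, NIC_NSG, "10.1.0.5", 22)
print(bastion_in_prefix, verdict)
```

In this model the subnet level passes the flow through Allow-SSH-Bastion, and the NIC level stops it at Deny-SSH, which matches answer B.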
Answer Key: Scenario 2
Answer: B
Explanation:
- The critical constraint of the scenario is the combination of two factors: environment during peak hours (active production impact) and uncertainty about the rule's intentionality (security team not consulted). Immediately removing the rule (alternative A) restores service but undoes a change that may have been intentional without prior validation with the responsible team. Creating an override rule with priority 100 (alternative B) immediately and reversibly restores traffic while keeping the original rule intact for later analysis and notifying the security team in parallel. This approach balances operational urgency with governance responsibility.
- Alternative C ignores the criticality of production impact: waiting for formal approval while applications have been failing for 40 minutes is operationally unacceptable without at least urgent escalation.
- Alternative D is the most dangerous: disassociating the NSG from the subnet removes all security rules, potentially exposing other subnet resources that depend on the NSG's other rules for protection. This action solves the immediate symptom by creating a broader security problem.
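The override in alternative B works because a lower priority number is evaluated first. The following Python sketch illustrates that under a simplified first-match model; the rule names `Deny-DB-Range` and `Allow-DB-Override` and the address 10.2.0.10 are hypothetical, since the scenario does not name them.

```python
import ipaddress


def outbound_action(rules, dest_ip):
    """Return (action, rule name) of the first matching rule by ascending priority."""
    addr = ipaddress.ip_address(dest_ip)
    # Each rule is a tuple: (priority, name, destination prefix or "*", action).
    for priority, name, destination, action in sorted(rules):
        if destination == "*" or addr in ipaddress.ip_network(destination):
            return action, name
    return "Deny", None


# Current state: the priority-200 rule blocks the database range.
current_rules = [
    (200, "Deny-DB-Range", "10.2.0.0/16", "Deny"),  # hypothetical name
    (65500, "DenyAllOutBound", "*", "Deny"),
]

# Alternative B: add an allow rule at priority 100 and leave the deny rule intact.
override = (100, "Allow-DB-Override", "10.2.0.0/16", "Allow")  # hypothetical name
patched_rules = current_rules + [override]

before = outbound_action(current_rules, "10.2.0.10")
after = outbound_action(patched_rules, "10.2.0.10")
print(before, after)
```

Deleting the override tuple returns the model to its original state, which mirrors why the approach is reversible and leaves the priority-200 rule available for the security team to review.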
Answer Key: Scenario 3
Answer: C
Explanation:
- The diagnostic key is in the subnet NSG outbound rules: the Deny-Internet-Egress rule with priority 100 denies all traffic destined for the Internet tag. This rule has higher priority than AllowInternetOutBound (priority 65001), which is the default rule that would normally allow internet access. For outbound traffic, evaluation starts with the NIC NSG and then moves to the subnet NSG. The NIC NSG has no custom rules, so it applies only the default rules, which include AllowInternetOutBound, and allows the traffic. The subnet NSG then blocks it.
- The irrelevant information is the VM's public IP and the absence of Azure Firewall and UDRs. Both are plausible details that steer reasoning toward wrong paths but have no relation to the actual blocking.
- Alternative B represents a classic error: the default NSG behavior for outbound traffic is not to deny everything. The default rules AllowVnetOutBound and AllowInternetOutBound are present in every NSG. Alternative D accommodates the developer's DNS suggestion without technical basis: the curl output shows that the hostname already resolved to 203.0.113.45, and the timeout indicates that packets are leaving and not receiving a response (or being blocked along the way), not that DNS failed. Adopting this hypothesis would lead to a completely fruitless DNS investigation.
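The difference between a DNS failure and a connection timeout can also be seen programmatically. In curl, a resolution failure is reported as exit code 6, whereas the output in this scenario shows code 28 (timeout) after an IP address was already printed. The Python sketch below makes the same distinction with standard-library socket exceptions; `classify_connect_failure` and `probe` are hypothetical helper names.

```python
import socket


def classify_connect_failure(exc: BaseException) -> str:
    """Map a socket exception to a coarse failure class."""
    if isinstance(exc, socket.gaierror):
        return "dns"      # name resolution failed; no TCP connection was attempted
    if isinstance(exc, (socket.timeout, TimeoutError)):
        return "timeout"  # address was known, but no reply arrived in time
    if isinstance(exc, ConnectionRefusedError):
        return "refused"  # the remote host actively rejected the connection
    return "other"


def probe(host: str, port: int, timeout: float = 5.0) -> str:
    """Try a TCP connection and report 'connected' or the failure class."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return "connected"
    except OSError as exc:
        return classify_connect_failure(exc)
```

Calling `probe("api.example.com", 443)` from the affected VM would be expected to return "timeout" rather than "dns" if the scenario's diagnosis holds, because traffic dropped by a deny rule typically produces no reply at all.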
Answer Key: Scenario 4
Answer: A
Explanation:
- The correct diagnostic sequence follows the principle of progression from simplest and most comprehensive to most specific and costly. Step S confirms the VM is operational before any network analysis, avoiding NSG investigation when the problem might be trivial. Step R maps all existing rules in both NSGs, creating complete visibility before any testing. Step Q uses IP Flow Verify to objectively validate whether the NSG is causing the blocking, without ambiguity. Step T accesses flow logs to confirm which specific rule is acting, if Q indicates blocking. Step P checks the application inside the VM, relevant only if previous steps rule out the NSG as the cause.
- Alternative B starts with IP Flow Verify before confirming the VM is running, which can generate false results or wasted effort if the VM is down. Alternative C starts by listing rules without validating VM state, which can also be fruitless. Alternative D jumps from Q directly to the application without validating with logs, which prevents precise identification of the responsible rule if the NSG is the cause.
- The most common reasoning error that distractors exploit is the inversion between infrastructure validation (NSG) and application validation: investigating the application before ruling out network causes is one of the most frequent sources of prolonged diagnostics.
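The S -> R -> Q -> T -> P ordering can be expressed as a short decision function. This is a hedged Python sketch only: the function name, its parameters, and the returned messages are hypothetical, and the inputs stand in for observations the engineer would gather manually from the portal, Network Watcher, and the VM.

```python
from typing import Optional, Sequence


def diagnose(vm_running: bool,
             nsg_levels: Sequence[str],
             flow_verify_access: str,
             flow_log_rule: Optional[str],
             app_listening: bool) -> str:
    """Walk the five steps in order and stop at the first conclusive finding."""
    # Step S: confirm the VM is running before any network analysis.
    if not vm_running:
        return "VM is not running"
    # Step R: nsg_levels lists where NSGs are associated, e.g. ["subnet", "nic"].
    # Step Q: flow_verify_access is the IP Flow Verify result, "Allow" or "Deny".
    if nsg_levels and flow_verify_access == "Deny":
        # Step T: flow logs identify which rule dropped the packets, if available.
        rule = flow_log_rule or "unidentified rule"
        return f"NSG blocks port 80: {rule}"
    # Step P: check the application only after the NSG has been ruled out.
    if not app_listening:
        return "Application is not listening on port 80"
    return "No cause found in these five steps"


result = diagnose(True, ["subnet", "nic"], "Deny", "Deny-HTTP", False)
print(result)
```

The example input uses a made-up rule name, `Deny-HTTP`. Note that the application check is never reached while the NSG finding is still open, which reflects the ordering argued for above.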
Troubleshooting Tree: Create a network security group (NSG)
Color Legend:
| Color | Node Type |
|---|---|
| Dark Blue | Initial symptom (entry point) |
| Blue | Diagnostic question (investigation decision) |
| Orange | Intermediate verification (validation before concluding) |
| Red | Identified cause |
| Green | Recommended action or resolution |
When facing a real problem, start with the root node identifying the blocked traffic symptom and navigate the branches by answering each question based on what is directly observable: VM state, NSG association existence, IP Flow Verify result, affected traffic direction, and presence of deny rules with dominant priority. Each bifurcation eliminates a class of causes until the path converges on an identified cause and concrete action, avoiding premature interventions based on unvalidated hypotheses.