Troubleshooting Lab: Create a network security group (NSG)

Diagnostic Scenarios

Scenario 1 — Root Cause

An operations team reports that a production VM has stopped responding to SSH connections (port 22) from an internal bastion server with IP 10.1.0.5. The VM is in a subnet called snet-backend in VNet vnet-prod.

The on-call engineer checks the configuration and gathers the following information:

NSG associated with subnet snet-backend:

Inbound rules:
Priority | Name              | Source         | Port | Action
---------|-------------------|----------------|------|-------
100      | Allow-SSH-Bastion | 10.1.0.0/24    | 22   | Allow
300      | Allow-HTTPS       | *              | 443  | Allow
65000    | AllowVnetInBound  | VirtualNetwork | *    | Allow
65500    | DenyAllInBound    | *              | *    | Deny

NSG associated with VM NIC:

Inbound rules:
Priority | Name             | Source         | Port | Action
---------|------------------|----------------|------|-------
150      | Deny-SSH         | *              | 22   | Deny
65000    | AllowVnetInBound | VirtualNetwork | *    | Allow
65500    | DenyAllInBound   | *              | *    | Deny

The engineer also verifies that the VM is running, the SSH service is active, and the disk shows no alerts. The resource group was created three days ago as part of a new deployment.

What is the root cause of the connectivity failure?

A) The Allow-SSH-Bastion rule in the subnet NSG uses prefix 10.1.0.0/24, which doesn't cover IP 10.1.0.5, causing blocking at the first evaluation level.

B) The Deny-SSH rule in the NIC NSG, with priority 150, denies SSH traffic after the subnet NSG allows the connection, blocking the traffic before it reaches the VM.

C) The VM's SSH service is not accessible because the recent deployment may have introduced incorrect configuration in the operating system.

D) The AllowVnetInBound rule with priority 65000 in the NIC NSG overrides the Deny-SSH rule because service tags have precedence over port-based rules.

Scenario 2 — Action Decision

The cause has been identified: an NSG associated with a production subnet contains a rule with priority 200 that denies all outbound traffic destined to range 10.2.0.0/16. This range corresponds to the database subnet in another VNet connected via peering. All applications in the subnet stopped communicating with the databases approximately 40 minutes ago.

The environment is operating during peak hours. The database team confirmed that the servers are operational. The engineer has Contributor permission on the resource group containing the NSG. The security team has not yet been notified about the change that caused the problem.

What is the correct action to take at this moment?

A) Immediately remove the deny rule from the NSG, restore connectivity, and notify the security team after normalization.

B) Create a new outbound rule with priority 100, source *, destination 10.2.0.0/16, port *, action Allow, to override the deny rule without removing it, and notify the security team in parallel.

C) Wait for formal notification and approval from the security team before any changes, as the rule may have been created intentionally as a security control.

D) Temporarily disassociate the NSG from the subnet to restore traffic while the rule analysis is conducted.

Scenario 3 — Root Cause

A developer reports that a newly provisioned VM in subnet snet-app cannot access an external service on port 443. The NSG associated with the VM's NIC was created yesterday and has no custom outbound rules. The output from the command executed on the VM is:

$ curl -v https://api.example.com
* Trying 203.0.113.45:443...
* connect to 203.0.113.45 port 443 failed: Connection timed out
* Failed to connect to api.example.com port 443 after 130004 ms
curl: (28) Connection timed out after 130003 milliseconds

The engineer verifies that the VM has a public IP assigned and confirms that no Azure Firewall or user-defined routes (UDRs) are configured for the subnet. The NSG for subnet snet-app has the following outbound rules:

Priority | Name                  | Destination    | Port | Action
---------|-----------------------|----------------|------|-------
100      | Deny-Internet-Egress  | Internet       | *    | Deny
65000    | AllowVnetOutBound     | VirtualNetwork | *    | Allow
65001    | AllowInternetOutBound | Internet       | *    | Allow
65500    | DenyAllOutBound       | *              | *    | Deny

The developer suggests that the timeout indicates a DNS problem. The network team mentions that the VM's public IP was assigned yesterday.

What is the root cause of the problem?

A) The public IP assigned to the VM was recently provisioned and has not yet propagated correctly, preventing outbound routing to the internet.

B) The absence of custom outbound rules in the NIC NSG causes outbound traffic to be blocked by default, as the default behavior is to deny everything.

C) The Deny-Internet-Egress rule with priority 100 in the subnet NSG blocks all outbound traffic destined to the Internet tag, overriding the default AllowInternetOutBound rule.

D) The timeout on port 443 indicates that the problem is DNS resolution, as the destination IP address is being resolved incorrectly within the VNet.

Scenario 4 — Diagnostic Sequence

A production VM is not receiving HTTP traffic (port 80) from external clients. The responsible engineer has access to the Azure portal and the VM. Below are five possible investigation steps, presented out of order:

[P] Verify if the web application is running and listening on port 80 inside the VM
[Q] Use the "IP Flow Verify" functionality in Network Watcher to test if the NSG blocks inbound traffic on port 80
[R] Check if there are NSGs associated with both the NIC and subnet of the VM and list all inbound rules from both
[S] Confirm that the VM is running and accessible via the Azure portal
[T] Review NSG Flow Logs to identify if packets were actually blocked and by which rule

What is the correct sequence for diagnostic investigation?

A) S -> R -> Q -> T -> P

B) Q -> R -> T -> S -> P

C) R -> Q -> S -> T -> P

D) S -> Q -> R -> P -> T

Answer Key and Explanations

Answer Key — Scenario 1

Answer: B

Explanation:

The determining clue is in the NIC NSG configuration: the Deny-SSH rule with priority 150 denies all traffic on port 22 regardless of source. For inbound traffic, Azure evaluates the subnet NSG first, then the NIC NSG. The subnet NSG allows traffic from the bastion (priority 100, 10.1.0.0/24 covers 10.1.0.5), but the NIC NSG denies it subsequently. The traffic does not reach the VM.
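The two-level evaluation described above can be sketched in Python. This is a simplified model, not Azure's implementation: it matches only on source prefix and port, and the `VNET` prefix standing in for the VirtualNetwork tag is an assumption, since the scenario does not give the address space of vnet-prod.

```python
from dataclasses import dataclass
from ipaddress import ip_address, ip_network

VNET = "10.1.0.0/16"  # assumption: stand-in for the VirtualNetwork service tag


@dataclass(frozen=True)
class Rule:
    priority: int
    name: str
    source: str  # CIDR prefix or "*"
    port: str    # port number as text or "*"
    action: str  # "Allow" or "Deny"


def first_match(rules, src_ip, port):
    """Return the matching rule with the lowest priority number, or None."""
    for rule in sorted(rules, key=lambda r: r.priority):
        src_ok = rule.source == "*" or ip_address(src_ip) in ip_network(rule.source)
        port_ok = rule.port in ("*", str(port))
        if src_ok and port_ok:
            return rule
    return None


def inbound_verdict(subnet_rules, nic_rules, src_ip, port):
    """Inbound traffic must pass the subnet NSG first, then the NIC NSG."""
    hit = None
    for level, rules in (("subnet", subnet_rules), ("nic", nic_rules)):
        hit = first_match(rules, src_ip, port)
        if hit is None or hit.action == "Deny":
            return ("Deny", level, hit.name if hit else None)
    return ("Allow", "nic", hit.name)


subnet_nsg = [
    Rule(100, "Allow-SSH-Bastion", "10.1.0.0/24", "22", "Allow"),
    Rule(300, "Allow-HTTPS", "*", "443", "Allow"),
    Rule(65000, "AllowVnetInBound", VNET, "*", "Allow"),
    Rule(65500, "DenyAllInBound", "*", "*", "Deny"),
]
nic_nsg = [
    Rule(150, "Deny-SSH", "*", "22", "Deny"),
    Rule(65000, "AllowVnetInBound", VNET, "*", "Allow"),
    Rule(65500, "DenyAllInBound", "*", "*", "Deny"),
]

print(inbound_verdict(subnet_nsg, nic_nsg, "10.1.0.5", 22))
```

Under these assumptions the SSH flow from the bastion passes the subnet NSG and is denied at the NIC level by Deny-SSH, which matches answer B.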
Two details in this scenario are irrelevant: the resource group was created three days ago, and the SSH service is active on the VM. Both are plausible as clues, but neither influences the diagnosis, because the NSG evaluation chain fully explains the blocking.
Alternative A is a distractor that forces the reader to recalculate the CIDR: 10.1.0.0/24 covers all addresses from 10.1.0.0 to 10.1.0.255, so 10.1.0.5 is within range. This calculation error is common under pressure. Alternative D represents a serious misconception: service tags don't have special precedence over port-based rules; evaluation is strictly by priority number.
The most dangerous distractor is C: acting on it would lead the engineer to investigate the VM indefinitely without finding any problem, while traffic remains blocked by the NIC NSG.
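The CIDR calculation that rules out alternative A can be checked quickly with Python's standard `ipaddress` module. This is only a verification aid for the arithmetic, using the prefix and address from the scenario.

```python
from ipaddress import ip_address, ip_network

prefix = ip_network("10.1.0.0/24")
bastion = ip_address("10.1.0.5")

# First and last address covered by the prefix
print(prefix[0], prefix[-1])

# Membership test: is the bastion inside the prefix?
print(bastion in prefix)
```

The prefix spans 10.1.0.0 through 10.1.0.255, so the membership test is true for 10.1.0.5.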

Answer Key — Scenario 2

Answer: B

Explanation:

The critical constraint of the scenario is the combination of two factors: the environment is at peak hours (active production impact), and it is uncertain whether the rule was intentional (the security team has not been consulted). Immediately removing the rule (alternative A) restores service but undoes a change that may have been intentional, without prior validation by the responsible team. Creating an override rule with priority 100 (alternative B) restores traffic immediately and reversibly, because the lower priority number is evaluated before the deny rule at priority 200. The original rule stays intact for later analysis, and the security team is notified in parallel. This approach balances operational urgency with governance responsibility.
Alternative C ignores the criticality of production impact: waiting for formal approval while applications have been failing for 40 minutes is operationally unacceptable without at least urgent escalation.
Alternative D is the most dangerous: disassociating the NSG from the subnet removes all security rules, potentially exposing other subnet resources that depend on the NSG's other rules for protection. This action solves the immediate symptom by creating a broader security problem.
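The effect of the override in alternative B can be illustrated with a small Python model of priority ordering. The rule names `Deny-DB-Range` and `Allow-DB-Override` and the database address `10.2.0.10` are hypothetical, since the scenario does not name them, and the default rules are reduced to DenyAllOutBound for brevity.

```python
from ipaddress import ip_address, ip_network


def decide(rules, dest_ip):
    """Apply the matching rule with the lowest priority number."""
    for priority, name, destination, action in sorted(rules):
        if destination == "*" or ip_address(dest_ip) in ip_network(destination):
            return action, name
    return "Deny", None


# Current state: the deny rule at priority 200 (name is hypothetical)
rules = [
    (200, "Deny-DB-Range", "10.2.0.0/16", "Deny"),
    (65500, "DenyAllOutBound", "*", "Deny"),
]
db_ip = "10.2.0.10"  # hypothetical database address inside 10.2.0.0/16

before = decide(rules, db_ip)

# Alternative B: add an allow rule at priority 100 and keep the deny rule
patched = rules + [(100, "Allow-DB-Override", "10.2.0.0/16", "Allow")]
after = decide(patched, db_ip)

print(before, after)
```

Deleting the priority 100 rule returns the model to its original state, which is the reversibility that the explanation relies on.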

Answer Key — Scenario 3

Answer: C

Explanation:

The diagnostic key is in the subnet NSG outbound rules: the Deny-Internet-Egress rule with priority 100 denies all traffic destined to the Internet tag. Because a lower number means higher priority, this rule is evaluated before AllowInternetOutBound (priority 65001), the default rule that would normally allow internet access. The NIC NSG has no custom rules, so it applies only the default rules, which include AllowInternetOutBound. For outbound traffic, evaluation starts with the NIC NSG and then moves to the subnet NSG: the NIC NSG allows the traffic, and the subnet NSG then blocks it.
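The outbound order (NIC NSG first, then subnet NSG) can be sketched in Python as follows. The model is deliberately coarse: it resolves a destination to either the VirtualNetwork or the Internet tag using an assumed address space (`VNET_SPACE`), which is not stated in the scenario, and it ignores ports and protocols.

```python
from ipaddress import ip_address, ip_network

VNET_SPACE = ip_network("10.0.0.0/8")  # assumption: not given in the scenario

DEFAULT_OUTBOUND = [
    (65000, "AllowVnetOutBound", "VirtualNetwork", "Allow"),
    (65001, "AllowInternetOutBound", "Internet", "Allow"),
    (65500, "DenyAllOutBound", "*", "Deny"),
]


def tag_of(dest_ip):
    """Coarse stand-in for service tag resolution."""
    return "VirtualNetwork" if ip_address(dest_ip) in VNET_SPACE else "Internet"


def first_match(rules, dest_ip):
    """Return (action, name) of the matching rule with the lowest priority number."""
    tag = tag_of(dest_ip)
    for priority, name, destination, action in sorted(rules):
        if destination in ("*", tag):
            return action, name
    return "Deny", None


def outbound_verdict(nic_rules, subnet_rules, dest_ip):
    """Outbound traffic must pass the NIC NSG first, then the subnet NSG."""
    name = None
    for level, rules in (("nic", nic_rules), ("subnet", subnet_rules)):
        action, name = first_match(rules, dest_ip)
        if action == "Deny":
            return "Deny", level, name
    return "Allow", "subnet", name


nic_nsg = list(DEFAULT_OUTBOUND)  # no custom rules
subnet_nsg = [(100, "Deny-Internet-Egress", "Internet", "Deny")] + DEFAULT_OUTBOUND

print(outbound_verdict(nic_nsg, subnet_nsg, "203.0.113.45"))
```

In this model the NIC NSG allows the flow through AllowInternetOutBound, and the subnet NSG then denies it through Deny-Internet-Egress.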
The irrelevant information is the VM's public IP and the absence of Azure Firewall and UDR. Both are plausible data that direct reasoning toward wrong paths but have no relation to the actual blocking.
Alternative B represents a classic error: the default NSG behavior for outbound traffic is not to deny everything. The default rules AllowVnetOutBound and AllowInternetOutBound are present in every NSG. Alternative D accommodates the developer's DNS suggestion without technical basis: the curl output shows "Trying 203.0.113.45:443", so the name was already resolved to an address before the connection attempt. The timeout indicates that packets are either dropped along the way or never answered, not that DNS failed. Adopting this hypothesis would lead to a fruitless DNS investigation.
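The distinction between a DNS failure and a connection timeout can be read directly from curl's messages. The following Python sketch is a hypothetical helper, not part of any tool: it looks for curl's "Could not resolve host" message (exit code 6) and, failing that, for a "Trying <address>" line followed by a timeout (exit code 28). It handles IPv4 addresses only.

```python
import re


def classify_curl(verbose_output):
    """Classify curl -v output as a DNS failure or a post-DNS connect timeout."""
    if "Could not resolve host" in verbose_output:
        return "dns-failure"
    resolved = re.search(r"Trying \d+\.\d+\.\d+\.\d+", verbose_output)
    if resolved and "timed out" in verbose_output:
        return "connect-timeout-after-dns"
    return "unknown"


# Output taken from the scenario
sample = """\
* Trying 203.0.113.45:443...
* connect to 203.0.113.45 port 443 failed: Connection timed out
* Failed to connect to api.example.com port 443 after 130004 ms
curl: (28) Connection timed out after 130003 milliseconds
"""

print(classify_curl(sample))
```

For the scenario's output the helper reports a timeout after successful resolution, which points the investigation at the network path rather than at DNS.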

Answer Key — Scenario 4

Answer: A

Explanation:

The correct diagnostic sequence follows the principle of progression from simplest and most comprehensive to most specific and costly. Step S confirms the VM is operational before any network analysis, avoiding NSG investigation when the problem might be trivial. Step R maps all existing rules in both NSGs, creating complete visibility before any testing. Step Q uses IP Flow Verify to objectively validate whether the NSG is causing the blocking, without ambiguity. Step T accesses flow logs to confirm which specific rule is acting, if Q indicates blocking. Step P checks the application inside the VM, relevant only if previous steps rule out the NSG as the cause.
Alternative B starts with IP Flow Verify before confirming the VM is running, which can generate false results or wasted effort if the VM is down. Alternative C starts by listing rules without validating VM state, which can also be fruitless. Alternative D jumps from Q directly to the application without validating with logs, which prevents precise identification of the responsible rule if the NSG is the cause.
The most common reasoning error that distractors exploit is the inversion between infrastructure validation (NSG) and application validation: investigating the application before ruling out network causes is one of the most frequent sources of prolonged diagnostics.
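The sequence S -> R -> Q -> T -> P can be expressed as a short Python routine. This is a sketch under stated assumptions: the five checks are passed in as callables that the reader would supply, the stub results and the rule name `Deny-HTTP` are hypothetical, and T and P are made conditional on the result of Q, following the explanation above.

```python
def diagnose(checks):
    """Run the checks in the order S, R, Q, T, P and return (trace, conclusion)."""
    trace = []

    def run(step):
        trace.append(step)
        return checks[step]()

    if not run("S"):                 # S: is the VM running?
        return trace, "VM is not running"
    run("R")                         # R: inventory of NSG rules on NIC and subnet
    if not run("Q"):                 # Q: does IP Flow Verify allow the flow?
        rule = run("T")              # T: flow logs name the blocking rule
        return trace, f"blocked by NSG rule {rule}"
    if not run("P"):                 # P: is the application listening on port 80?
        return trace, "application is not listening on port 80"
    return trace, "no cause found by these five steps"


# Hypothetical stub results for illustration only
checks = {
    "S": lambda: True,
    "R": lambda: ["Deny-HTTP"],
    "Q": lambda: False,
    "T": lambda: "Deny-HTTP",
    "P": lambda: True,
}

print(diagnose(checks))
```

With these stubs the routine stops after T, because the NSG already explains the failure and the application check is not needed.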

Troubleshooting Tree: Create a network security group (NSG)

Color Legend:

Color     | Node Type
----------|----------
Dark Blue | Initial symptom (entry point)
Blue      | Diagnostic question (investigation decision)
Orange    | Intermediate verification (validation before concluding)
Red       | Identified cause
Green     | Recommended action or resolution

When facing a real problem, start with the root node identifying the blocked traffic symptom and navigate the branches by answering each question based on what is directly observable: VM state, NSG association existence, IP Flow Verify result, affected traffic direction, and presence of deny rules with dominant priority. Each bifurcation eliminates a class of causes until the path converges on an identified cause and concrete action, avoiding premature interventions based on unvalidated hypotheses.
