Troubleshooting Lab: Evaluate Effective Security Rules in NSGs

Diagnostic Scenarios

Scenario 1 — Root Cause

The operations team reports that a VM named vm-app01 stopped receiving HTTPS connections on port 443 from external clients. The VM was working normally until a colleague applied a new security policy the previous afternoon.

The administrator checks the NSG associated with the VM's NIC and finds the following inbound rules:

Priority	Name	Port	Source	Destination	Action
100	Allow-HTTPS	443	Any	Any	Allow
200	Allow-HTTP	80	Any	Any	Allow
65000	AllowVnetInBound	Any	VirtualNetwork	Any	Allow
65500	DenyAllInBound	Any	Any	Any	Deny

The NSG appears correct. The administrator also verifies that the public IP is correctly associated with the NIC and that the service inside the VM is active and listening on port 443. Two data disks were also recently attached to the VM by the same colleague.

Next, he runs IP Flow Verify and gets:

Direction : Inbound
Protocol  : TCP
LocalPort : 443
RemotePort: 52341
LocalIP   : 10.0.1.10
RemoteIP  : 187.45.32.10
Access    : Deny
RuleName  : Deny-All-External
NSG       : nsg-subnet-frontend

What is the root cause of the blocking?

A) The default DenyAllInBound rule in the NIC NSG is blocking traffic because no rule with priority lower than 65500 properly covers external traffic.

B) The NSG associated with the subnet contains a rule named Deny-All-External with sufficient priority to block traffic before the NIC NSG is evaluated.

C) The data disks added to the VM caused a change in the network interface state, invalidating the NIC NSG rules.

D) IP Flow Verify is analyzing the VM's internal IP, and the result doesn't reflect the real behavior of external traffic arriving through the public IP.

Scenario 2 — Action Decision

The cause of a connectivity issue has been identified: a deny rule with priority 150 was added to the subnet NSG, blocking all outbound traffic to the range 10.20.0.0/16, which corresponds to the database subnet in a peered VNet. The application on vm-app02 has been unable to connect to the database since the last maintenance window.

The environment has the following constraints:

This NSG is associated with a subnet containing 14 other production VMs besides vm-app02
The security team that created the rule is unavailable until the next day
There's a scheduled maintenance window in 6 hours
The application on vm-app02 is partially degraded, not completely offline

What is the correct action to take at this moment?

A) Immediately delete the priority 150 rule from the subnet NSG to restore connectivity, since the cause has been confirmed.

B) Modify the rule priority from 150 to 160, reducing its effectiveness while waiting for the security team.

C) Wait for the scheduled maintenance window and, before making any changes, engage the security team to validate the impact on other subnet VMs.

D) Create an allow rule with priority 100 in the vm-app02 NIC NSG for destination 10.20.0.0/16, bypassing the subnet blocking rule without altering the shared NSG.

Scenario 3 — Root Cause

A developer reports that he can reach a development VM called vm-dev03 over SSH (port 22) from his corporate laptop (10.50.1.15), but a second developer with a laptop on the same network segment (10.50.1.22) gets a timeout when attempting the same access.

The administrator checks the NSG of the subnet where vm-dev03 is hosted:

Inbound rules (Subnet NSG):
Priority 100 | Allow-SSH-CorpNet | TCP 22 | Source: 10.50.0.0/16 | Allow
Priority 200 | Deny-All          | Any    | Source: Any           | Deny

The NIC NSG has no custom rules beyond the defaults. The administrator confirms that both laptops are on the same corporate VLAN and that no changes were made to the NSG in the last 48 hours. He also verifies that vm-dev03 was migrated to a new physical host by the infrastructure team on the morning of the same day.

When running IP Flow Verify for IP 10.50.1.22 on port 22, the result is:

Access    : Allow
RuleName  : Allow-SSH-CorpNet
NSG       : nsg-subnet-dev

What is the most likely root cause of the timeout reported by the second developer?

A) The migration to a new physical host corrupted the effective NSG rules associated with the NIC, which now shows inconsistent state.

B) IP Flow Verify confirms that the NSG allows the traffic; the timeout cause is outside the NSG scope, possibly in the VM's operating system, local firewall, or SSH application itself.

C) The Allow-SSH-CorpNet rule covers the 10.50.0.0/16 block, but IP 10.50.1.22 is outside that range, so traffic falls into the Deny-All rule.

D) The NIC NSG without custom rules automatically inherits an implicit denial that only affects secondary connections to the VM, explaining why the second developer is blocked.

Scenario 4 — Diagnostic Sequence

An administrator receives the following report: "The web application on vm-web01 stopped responding on port 80 for external users. Internally, the IT team can access it normally."

The available investigation steps are:

Step P: Check if the HTTP service is active and listening on port 80 inside the VM (netstat or ss).
Step Q: Run IP Flow Verify simulating inbound traffic from an external IP on port 80 to identify which NSG and which rule are blocking the traffic.
Step R: Review the effective rules of vm-web01's NIC to get a consolidated view of NIC and subnet NSGs.
Step S: Check if there are differences between allowed sources in NSG rules, comparing what covers internal versus external traffic.
Step T: Confirm that the fix was effective by accessing the application from an external IP after any adjustment.

What is the most logical diagnostic sequence?

A) P → R → Q → S → T

B) Q → P → S → R → T

C) R → S → Q → P → T

D) S → Q → R → P → T

Answer Key and Explanations

Answer Key — Scenario 1

Answer: B

The definitive clue is in the IP Flow Verify output: the RuleName field points to Deny-All-External and the NSG field points to nsg-subnet-frontend, which is the subnet NSG, not the NIC NSG. For inbound traffic, the subnet NSG is evaluated first. The fact that the NIC NSG has an Allow-HTTPS rule at priority 100 is irrelevant because the packet never gets evaluated by it.

The information about the two data disks added is intentionally irrelevant. Data disks have no effect on NSG rules or on a NIC's network behavior. Including this detail tests the reader's ability to ignore recent changes that have no technical relationship with the symptom.

Alternative A represents a classic error: focusing on the default DenyAllInBound rule of the NIC NSG without noticing that the IP Flow Verify output already names a different NSG and a different rule. Alternative D describes a misconception about how IP Flow Verify works. For inbound traffic, Azure translates the public IP to the private IP before the NSG is processed, so NSG rules are evaluated against the VM's private address. Testing against 10.0.1.10 therefore reflects how external traffic is actually filtered, and the result is reliable. Acting on alternative A would lead the administrator to modify the wrong NSG without solving the problem.
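
The inbound evaluation order described in this answer can be sketched in a few lines of Python. This is a simplified model, not Azure's implementation: it matches on destination port only and ignores source, protocol, and service tags, which is why the AllowVnetInBound default rule is left out. The priority of Deny-All-External is not given in the scenario, so the value 100 used here is an assumption.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Rule:
    priority: int
    name: str
    port: Optional[int]  # None means "any port"
    action: str          # "Allow" or "Deny"


def first_match(rules, port):
    """Return the matching rule with the lowest priority number, or None."""
    for rule in sorted(rules, key=lambda r: r.priority):
        if rule.port is None or rule.port == port:
            return rule
    return None


def evaluate_inbound(subnet_rules, nic_rules, port):
    """Inbound: subnet NSG first, then NIC NSG. Both must allow the packet."""
    for nsg_name, rules in (("subnet", subnet_rules), ("nic", nic_rules)):
        rule = first_match(rules, port)
        if rule is None or rule.action == "Deny":
            # A deny at this NSG ends evaluation; later NSGs never see the packet.
            return ("Deny", nsg_name, rule.name if rule else None)
    return ("Allow", nsg_name, rule.name)


# Subnet NSG: the priority 100 is hypothetical, the scenario does not state it.
subnet_nsg = [Rule(100, "Deny-All-External", None, "Deny")]

# NIC NSG: port-based rules from the scenario table.
nic_nsg = [
    Rule(100, "Allow-HTTPS", 443, "Allow"),
    Rule(200, "Allow-HTTP", 80, "Allow"),
    Rule(65500, "DenyAllInBound", None, "Deny"),
]

print(evaluate_inbound(subnet_nsg, nic_nsg, 443))
```

In this model the packet is denied at the subnet NSG, so the Allow-HTTPS rule on the NIC NSG is never consulted, which mirrors the IP Flow Verify output in the scenario.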

Answer Key — Scenario 2

Answer: C

The critical constraint in the scenario is that the subnet NSG is shared by 14 other production VMs. Deleting the rule (alternative A) without validation from the security team could create unintended network exposure for all of these VMs. Changing the priority from 150 to 160 (alternative B) also modifies the shared NSG and has no effect unless another rule sits between those two priorities. The degradation is partial, not a complete outage, which removes the urgency that would justify unilateral action.

Alternative D looks surgical but does not work. For outbound traffic, the NIC NSG is evaluated before the subnet NSG, but traffic must be allowed by both. An allow rule at priority 100 in the vm-app02 NIC NSG only lets the packet pass the first check; the subnet NSG then evaluates it and the priority 150 deny rule still blocks it. A rule in one NSG cannot override a deny in the other.

That leaves alternative C: leave the shared NSG untouched, engage the security team to validate the impact on the other subnet VMs, and make the change in a controlled window rather than ad hoc.

The most dangerous distractor is alternative A. Deleting the rule without understanding the context of the other 14 VMs could result in broad network exposure in a production environment.

Answer Key — Scenario 3

Answer: B

The most important clue is the IP Flow Verify result: Access: Allow. This means that from the NSG perspective, traffic from IP 10.50.1.22 on port 22 is allowed. The NSG is not blocking. The timeout reported by the second developer, therefore, originates from another component in the network path: the operating system firewall (iptables, ufw, firewalld), a hosts.allow/hosts.deny rule, the SSH service configured to accept only specific IPs, or another mechanism outside the NSG scope.
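
Once the NSG is ruled out, a plain TCP probe from the second developer's laptop can help narrow down where the connection dies. The sketch below uses only the Python standard library; the function name `probe_tcp` and the three-second timeout are choices made for this example. As a rule of thumb, a timeout suggests packets are being silently dropped somewhere on the path (for example by a host firewall), while a refusal suggests the host was reached but nothing accepted the connection on that port.

```python
import socket


def probe_tcp(host, port, timeout=3.0):
    """Attempt a TCP connection and classify the outcome."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return "open"
    except socket.timeout:
        # No answer at all: packets are likely dropped along the path.
        return "timeout"
    except ConnectionRefusedError:
        # The host answered with a reset: reachable, but no listener or an explicit reject.
        return "refused"
    except OSError as exc:
        # Name resolution failures, unreachable networks, and similar errors.
        return f"error: {exc}"
```

Running `probe_tcp("<vm-dev03 address>", 22)` from both laptops and comparing the results gives a quick indication of whether the VM treats the two source addresses differently.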

The information about the physical host migration is intentionally irrelevant. Host migrations (such as those performed by Azure Live Migration) do not affect or corrupt NSG rules; these are maintained by Azure's network layer independently of the underlying host.

Alternative C represents an arithmetic error: IP 10.50.1.22 is within the 10.50.0.0/16 block (which covers from 10.50.0.0 to 10.50.255.255), so the Allow-SSH-CorpNet rule does cover it. Alternative D describes non-existent behavior in Azure; NIC NSGs without custom rules don't have any differentiated logic based on "secondary connections". Acting based on alternative A would lead the administrator to open an unnecessary infrastructure ticket while the real problem remains in the VM's operating system.
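
The range check behind alternative C can be verified with Python's standard `ipaddress` module, using the addresses and prefix from the scenario:

```python
import ipaddress

network = ipaddress.ip_network("10.50.0.0/16")

# Both laptops from the scenario fall inside the allowed source prefix.
for laptop in ("10.50.1.15", "10.50.1.22"):
    print(laptop, ipaddress.ip_address(laptop) in network)

# First and last address covered by the /16.
print(network[0], network[-1])
```

Both membership checks print `True`, and the last line prints `10.50.0.0 10.50.255.255`, confirming that the Allow-SSH-CorpNet rule covers the second developer's address.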

Answer Key — Scenario 4

Answer: A

The sequence P → R → Q → S → T respects the correct diagnostic progression:

P first checks if the service is functional inside the VM. If the service isn't listening on port 80, no NSG investigation will be relevant.

R gets the consolidated view of the NIC's effective rules, which shows the rules from both the NIC NSG and the subnet NSG as they apply to the interface. This may already reveal the rule responsible for blocking without needing additional tools.

Q uses IP Flow Verify to confirm the diagnosis, identifying exactly which rule in which NSG blocks external traffic. It comes after R because R provides context that makes Q's result easier to interpret.

S compares the sources covered by the rules, which explains why internal traffic works and external doesn't, completing the causal reasoning.

T validates that the applied correction was effective before closing the diagnosis.
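
Step S in the sequence above can be illustrated with a short Python sketch that checks which rule a given source address hits. The scenario does not list the actual rules on vm-web01, so the rule set, rule names, and sample IP addresses below are hypothetical. The sketch handles only `Any` and CIDR prefixes as sources; service tags such as VirtualNetwork or Internet are not modeled.

```python
import ipaddress


def covers(source, ip):
    """True if a rule source ("Any" or a CIDR prefix) covers the given IP."""
    if source in ("Any", "*"):
        return True
    return ipaddress.ip_address(ip) in ipaddress.ip_network(source)


def matching_rule(rules, ip, port):
    """Return (name, action) of the first rule, by priority, matching ip and port."""
    for priority, name, rule_port, source, action in sorted(rules, key=lambda r: r[0]):
        if rule_port in (None, port) and covers(source, ip):
            return name, action
    return None


# Hypothetical rule set: HTTP is allowed only from an internal prefix.
rules = [
    (100, "Allow-HTTP-Internal", 80, "10.0.0.0/8", "Allow"),
    (4096, "Deny-All", None, "Any", "Deny"),
]

internal_ip = "10.0.2.5"     # hypothetical IT team workstation
external_ip = "203.0.113.7"  # address from a documentation range

print("internal:", matching_rule(rules, internal_ip, 80))
print("external:", matching_rule(rules, external_ip, 80))
```

With this rule set the internal address matches the allow rule while the external address falls through to the deny rule, which is the kind of source mismatch step S is meant to surface.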

Alternative B starts with IP Flow Verify before checking if the service is active, which could lead to modifying NSGs when the problem isn't even at the network layer. Alternative C reverses the logic by reviewing effective rules before confirming if there's actually a service problem. Starting with diagnosis before establishing the problem scope is the most common error under pressure.

Troubleshooting Tree: Evaluate Effective Security Rules in NSGs

Color Legend:

Dark blue: initial symptom, diagnosis entry point
Blue: diagnostic question, decision node
Red: identified cause
Green: recommended action or resolution
Orange: validation or intermediate verification step

To use this tree when facing a real problem, start with the root node describing the connectivity symptom. At each blue node, answer the question based on what you observed in the environment: service state, IP Flow Verify result, rule name and indicated NSG. Follow the corresponding path until reaching a red node that identifies the cause. From the cause, the green node indicates the action to take. Always finish with the orange validation node before closing the incident.
