Troubleshooting Lab: Use Azure Network Watcher and Connection Monitor
Diagnostic Scenarios
Scenario 1: Root Cause
A development team reports that a VM in the East US region cannot reach an internal service hosted on another VM in the same VNet, in subnet 10.2.1.0/24. The application times out when it tries to connect on port 8080.
The administrator runs IP Flow Verify in Network Watcher with the following parameters:
Direction: Outbound
Protocol: TCP
Source IP: 10.2.0.4
Source Port: 5000
Destination IP: 10.2.1.10
Destination Port: 8080
Result: Allow
Applied Rule: AllowVnetOutBound
The administrator then runs the same test against the destination VM, with the direction reversed:
Direction: Inbound
Protocol: TCP
Source IP: 10.2.0.4
Source Port: 5000
Destination IP: 10.2.1.10
Destination Port: 8080
Result: Deny
Applied Rule: DenyAllInBound
Additional environment information:
- The destination VM was created three days ago from a custom image
- Network Watcher is enabled in the East US region
- The storage account used by NSG Flow Logs was created on the same day
- The destination VM's operating system is Windows Server 2022
What is the root cause of the observed problem?
A. The NSG associated with the destination VM's network adapter does not have an inbound rule allowing port 8080, and the default DenyAllInBound rule is being applied
B. Network Watcher is not enabled on the destination subnet, causing IP Flow Verify to return incorrect results
C. The destination VM was created from a custom image that blocks external connections by default at the operating system level
D. NSG Flow Logs is not collecting sufficient data because the storage account was created recently, preventing proper traffic analysis
Scenario 2: Decision for Action
The cause of a connectivity failure between two services in different regions has been identified: the Connection Monitor that monitors the endpoint api.interno.contoso.com on port 443 reports failure for 100% of probes, even though the destination service is operational and reachable from other sources.
When inspecting the Connection Monitor configuration, the administrator observes:
Test Group: prod-api-monitor
Source: vm-monitor-eastus (10.1.0.5)
Destination: api.interno.contoso.com:443
Protocol: TCP
Interval: 30 seconds
Threshold: 5% failure
Extension status on source VM:
NetworkWatcherAgentLinux -> ProvisioningState: Failed
AzureMonitorLinuxAgent -> ProvisioningState: Succeeded
The environment is in production with an active SLA. The security team does not allow VM reboots during business hours without an approved maintenance window. The next available window is in 48 hours.
What is the correct action to take at this moment?
A. Immediately recreate the source VM, as the failed extension cannot be repaired without replacing the VM
B. Remove and reinstall the NetworkWatcherAgentLinux extension via portal or CLI without rebooting the VM, and verify if the Connection Monitor returns to operation
C. Wait for the 48-hour maintenance window and reboot the VM to force automatic reinstallation of all extensions
D. Change the Connection Monitor's source endpoint to another available VM in the same subnet as a definitive solution, without addressing the failed extension
Scenario 3: Root Cause
An administrator configures Packet Capture in Network Watcher to capture traffic from a Linux VM for 10 minutes. After the capture window ends, they try to download the generated file from the storage account configured as the destination.
The file is not present in the storage account. No error message was displayed during configuration.
Information collected during investigation:
VM: vm-captura-prod
Installed extension: AzureNetworkWatcherExtension -> ProvisioningState: Succeeded
Storage account: stcapturaprod (LRS, East US)
Configured container: packetcaptures
Packet Capture Status: Stopped (Completed)
Additional verification in portal:
Packet Capture - Details:
Destination file: /var/captures/capture01.cap
Storage account: (not configured)
Status: Stopped
The administrator also confirms that the subnet's NSG allows outbound traffic to the Internet and that the VM has an associated public IP.
What is the root cause of the observed problem?
A. The AzureNetworkWatcherExtension extension failed silently during capture, as the Succeeded status only reflects installation, not execution
B. Packet Capture was configured with destination to the VM's local file system, not to the storage account, so the file is on the VM and not in storage
C. The storage account uses LRS redundancy, which is not compatible with Network Watcher's Packet Capture
D. The subnet's NSG blocked the file writing to the storage account by not allowing outbound traffic to the Azure Storage endpoint
Scenario 4: Diagnostic Sequence
An administrator receives the following report: VMs in a specific subnet suddenly lost Internet connectivity. No infrastructure changes were recorded in the changelog for the last 24 hours, but the network team applied updates to route tables the previous week.
Available investigation steps are:
[P] Execute Next Hop in Network Watcher to verify the effective route type for an external IP from one of the affected VMs
[Q] Check if the subnet's NSG has an explicit outbound rule blocking Internet traffic
[R] Check the VM's effective route table in the portal to identify if there's a UDR overriding the default Internet route
[S] Execute IP Flow Verify with destination to an external IP to confirm if blocking occurs at the NSG layer
[T] Open a ticket with the network team requesting review of changes made the previous week
What is the correct investigation sequence?
A. T, Q, S, P, R
B. P, R, S, Q, T
C. Q, S, P, R, T
D. S, Q, T, P, R
Answer Key and Explanations
Answer Key: Scenario 1
Answer: A
The decisive clue is the IP Flow Verify output for the destination VM: the result is Deny, with the DenyAllInBound rule applied. This is the lowest-priority default inbound rule of every NSG, and it applies when no other inbound rule matches the evaluated flow. The NSG associated with the destination VM's network adapter or subnet therefore has no rule allowing inbound traffic on port 8080.
The information about the VM being created from a custom image is irrelevant for network diagnosis. Custom images may contain OS-level firewall configurations, but IP Flow Verify operates at the Azure NSG layer, not within the OS. If the blocking were at the Windows firewall, IP Flow Verify would return Allow, not Deny.
Alternative B is factually incorrect: Network Watcher operates at the region level, not subnet level. Alternative D confuses the purpose of NSG Flow Logs (which record flows already decided by the NSG) with IP Flow Verify (which simulates the NSG decision). The most dangerous distractor is alternative C, as it leads the administrator to investigate the destination VM's operating system without first correcting the real cause in the NSG, wasting time and potentially creating unnecessary exceptions in the OS firewall.
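The reasoning above, reading an outbound and an inbound IP Flow Verify result together, can be sketched as a small decision function. This is only an illustration under simple assumptions: the function name `locate_block` and the returned strings are hypothetical, and the inputs are the "Allow" or "Deny" values reported by each test.

```python
def locate_block(outbound_access: str, inbound_access: str) -> str:
    """Map two IP Flow Verify results to the layer to investigate next.

    outbound_access: result of the Outbound test on the source VM.
    inbound_access:  result of the Inbound test on the destination VM.
    """
    outbound_allowed = outbound_access.strip().lower() == "allow"
    inbound_allowed = inbound_access.strip().lower() == "allow"

    if not outbound_allowed:
        # The flow is dropped before it leaves the source.
        return "source NSG (outbound rule)"
    if not inbound_allowed:
        # The flow leaves the source but is dropped at the destination.
        return "destination NSG (inbound rule)"
    # Both NSG evaluations allow the flow, so the cause lies elsewhere.
    return "beyond the NSG layer (OS firewall, application, or routing)"


# Scenario 1: outbound Allow, inbound Deny.
print(locate_block("Allow", "Deny"))
```

For the values in Scenario 1 the function points at the destination NSG. Only when both tests return Allow does it suggest looking at the operating system firewall or other layers.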
Answer Key: Scenario 2
Answer: B
The cause was explicitly stated: the NetworkWatcherAgentLinux extension has ProvisioningState: Failed. This extension is the component that allows Connection Monitor to execute probes from the source VM. Without it working, all probes fail regardless of the actual connectivity state.
The critical constraint in this scenario is that the environment is in production with an active SLA and VM reboots are not allowed during business hours. Alternative B solves the problem within these constraints: Azure VM extensions can be removed and reinstalled via the portal or the CLI (az vm extension delete followed by az vm extension set) without rebooting the VM.
Alternative A is technically unnecessary and violates the production impact constraint. Alternative C respects the time constraint but unnecessarily delays a fix that can be done now without impact. Alternative D is the most dangerous: it treats the symptom by moving the source, but doesn't fix the failed extension, leaving the VM in a degraded state and potentially generating inconsistent monitoring data long-term.
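As a rough sketch of the remove-and-reinstall step from alternative B, the snippet below only builds and prints the two Azure CLI command lines; it does not run them. The resource group name `rg-monitoring` is a hypothetical placeholder, the VM name is taken from the scenario, and the publisher value `Microsoft.Azure.NetworkWatcher` is assumed to be the publisher of the Linux agent extension. Verify the flags against your installed CLI version before use.

```python
import shlex

EXTENSION_NAME = "NetworkWatcherAgentLinux"
# Assumption: publisher of the Network Watcher agent extension.
EXTENSION_PUBLISHER = "Microsoft.Azure.NetworkWatcher"


def reinstall_commands(resource_group: str, vm_name: str) -> list[list[str]]:
    """Return the delete and set commands, in the order they should run."""
    delete_cmd = [
        "az", "vm", "extension", "delete",
        "--resource-group", resource_group,
        "--vm-name", vm_name,
        "--name", EXTENSION_NAME,
    ]
    install_cmd = [
        "az", "vm", "extension", "set",
        "--resource-group", resource_group,
        "--vm-name", vm_name,
        "--name", EXTENSION_NAME,
        "--publisher", EXTENSION_PUBLISHER,
    ]
    return [delete_cmd, install_cmd]


# "rg-monitoring" is a hypothetical resource group name.
for cmd in reinstall_commands("rg-monitoring", "vm-monitor-eastus"):
    print(shlex.join(cmd))
```

Printing the commands first makes it easy to review them in a change request before anything touches the production VM.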
Answer Key: Scenario 3
Answer: B
The decisive detail is in the "Packet Capture - Details" section of the portal: the Storage account field appears as (not configured), while the Destination file field points to /var/captures/capture01.cap, a local path in the VM's file system. This indicates that during configuration the administrator selected a local file destination instead of a storage account, so the file was successfully written to the VM, not to storage.
The information about the VM's public IP and NSG outbound permission is irrelevant for the diagnosis, as the problem doesn't involve network connectivity to storage. The Stopped (Completed) status confirms that the capture was executed and completed normally.
Alternative A is a distractor that exploits distrust in the Succeeded status, but this status reflects the extension's installation and functioning, and the capture's own Completed status confirms it was executed. Alternative C is factually incorrect: the storage account's redundancy type (LRS, GRS, etc.) doesn't affect Packet Capture compatibility. Alternative D would be plausible if the storage account field were configured, but the details clearly show it wasn't.
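The check described above, looking at which destination fields are actually set on the capture, can be sketched as follows. The `CaptureDetails` class and its field names are hypothetical stand-ins for the values shown in the portal, not an Azure SDK type.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class CaptureDetails:
    """Hypothetical container for the fields shown in the portal details."""
    status: str
    local_file_path: Optional[str] = None
    storage_account: Optional[str] = None


def capture_locations(details: CaptureDetails) -> list[str]:
    """List every place where the capture file should be found."""
    locations = []
    if details.storage_account:
        locations.append(f"storage account: {details.storage_account}")
    if details.local_file_path:
        locations.append(f"VM file system: {details.local_file_path}")
    return locations


# Values from Scenario 3: only the local path is configured.
scenario = CaptureDetails(
    status="Stopped",
    local_file_path="/var/captures/capture01.cap",
)
print(capture_locations(scenario))
```

With the Scenario 3 values, the only location returned is the VM's local file system, which is why the file never appears in stcapturaprod.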
Answer Key: Scenario 4
Answer: B
The correct sequence is P, R, S, Q, T, following the principle of progressive diagnosis from routing layer to security layer, from general to specific.
The reasoning is as follows. Start with Next Hop (P) to get the effective next hop type for an external IP right away. If the result is None or an unexpected VirtualAppliance, the problem is in the routing layer. Next, check the effective routes (R) to identify whether a UDR is overriding the default Internet route, which would be consistent with the changes the network team made the previous week. Only after confirming or ruling out a routing problem, use IP Flow Verify (S) to check for NSG blocking. Then inspect the NSG rules directly (Q) to confirm or rule out this layer. Finally, engage the network team (T) only when the diagnosis is complete and there is concrete evidence to present.
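To illustrate how a UDR can override the default Internet route, here is a minimal sketch of route selection: the longest matching prefix wins, and when prefixes are equal a user-defined route is preferred over a BGP route, which is preferred over a system route. The `Route` class, the source labels, the 10.2.0.0/16 address space, and the external IP 203.0.113.10 are assumptions made for the example.

```python
import ipaddress
from dataclasses import dataclass
from typing import Optional

# Tie-break order when two routes share the same prefix (lower wins).
_SOURCE_RANK = {"User": 0, "BGP": 1, "Default": 2}


@dataclass(frozen=True)
class Route:
    prefix: str          # address prefix, e.g. "0.0.0.0/0"
    next_hop_type: str   # e.g. "Internet", "VirtualAppliance", "VnetLocal"
    source: str          # "User", "BGP", or "Default" (simplified labels)


def select_route(routes: list[Route], destination_ip: str) -> Optional[Route]:
    """Pick the route by longest prefix, then by source priority."""
    dest = ipaddress.ip_address(destination_ip)
    candidates = [r for r in routes if dest in ipaddress.ip_network(r.prefix)]
    if not candidates:
        return None
    return min(
        candidates,
        key=lambda r: (
            -ipaddress.ip_network(r.prefix).prefixlen,  # longest prefix first
            _SOURCE_RANK[r.source],                     # then User > BGP > Default
        ),
    )


system_routes = [
    Route("10.2.0.0/16", "VnetLocal", "Default"),  # assumed VNet address space
    Route("0.0.0.0/0", "Internet", "Default"),
]
udr = Route("0.0.0.0/0", "VirtualAppliance", "User")  # hypothetical UDR

print(select_route(system_routes, "203.0.113.10").next_hop_type)
print(select_route(system_routes + [udr], "203.0.113.10").next_hop_type)
```

The first call resolves to Internet; once the UDR for 0.0.0.0/0 is added, the same destination resolves to VirtualAppliance, which is the pattern Next Hop (P) and the effective routes view (R) would expose.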
Sequence A starts with escalation (T) before any technical diagnosis, which is inefficient and imprecise. Sequence C starts with manual NSG inspection before using automated diagnostic tools, reversing the logical order. Sequence D starts with IP Flow Verify without first understanding the routing layer, potentially leading to a wrong conclusion if the problem is in a UDR and not in the NSG.
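The ordering in the correct sequence can also be read as a short decision procedure: routing first, NSG second, escalation last. The sketch below is a simplification under that assumption; the function name `next_action` and the returned strings are hypothetical.

```python
from typing import Optional


def next_action(next_hop_type: str, ip_flow_access: Optional[str] = None) -> str:
    """Suggest the next investigation step from the results gathered so far.

    next_hop_type:  value returned by Next Hop for an external IP (step P).
    ip_flow_access: "Allow" or "Deny" from IP Flow Verify (step S), if already run.
    """
    if next_hop_type != "Internet":
        # None, VirtualAppliance, etc. point at the routing layer.
        return "routing: review effective routes for a UDR (step R)"
    if ip_flow_access is None:
        return "routing looks fine: run IP Flow Verify (step S)"
    if ip_flow_access.strip().lower() == "deny":
        return "NSG: inspect outbound rules (step Q)"
    return "no routing or NSG cause found: escalate with evidence (step T)"


print(next_action("VirtualAppliance"))
print(next_action("Internet"))
print(next_action("Internet", "Deny"))
```

Starting from the Next Hop result keeps an NSG result from being misread while the real cause is a UDR, which is the failure mode of sequence D.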
Troubleshooting Tree: Use Azure Network Watcher and Connection Monitor
Color Legend:
| Color | Node Type |
|---|---|
| Dark blue | Initial symptom (entry point) |
| Medium blue | Diagnostic question (decision) |
| Red | Identified cause |
| Green | Recommended action or resolution |
| Orange | Validation or intermediate verification |
To use this tree when facing a real problem, always start from the root node describing the observed symptom. At each question node, answer based on what you can verify directly in the portal or via CLI, without assuming the answer. Follow the path corresponding to your observation until reaching an identified cause node (red), which indicates what should be fixed, or a recommended action node (green), which indicates the next concrete step. Orange nodes indicate that additional verification is needed before concluding the diagnosis.