Theoretical Foundation: Troubleshoot network connectivity
1. Initial Intuition
Imagine you called someone and the call didn't connect. To discover the problem, you follow a progressive logic: does your phone have signal? Is the number correct? Is the other person's line active? Is call blocking configured? Each check eliminates a hypothesis and points to the next one.
Diagnosing network connectivity in Azure follows exactly this structured reasoning. When a VM cannot communicate with another, when a user cannot access a service, or when an application is timing out, the cause can be at any point in the path: an NSG blocking traffic, an incorrect route, a disconnected peering, a firewall dropping packets, DNS resolving to the wrong address, or simply a destination service that is not listening on the expected port.
Network connectivity troubleshooting in Azure is the ability to use a set of tools and a systematic methodology to locate exactly where in the path communication is being interrupted.
2. Context
All concepts studied in previous modules converge here: VNets, subnets, peerings, NSGs, UDRs, public IPs. A connectivity problem can originate from any of these components.
Azure Network Watcher is the central network diagnostic service. It contains a set of specific tools for each type of problem. Understanding what each tool does and when to use it is the core of this module.
3. Building the Concepts
3.1 Azure Network Watcher
Network Watcher is a regional service that needs to be enabled in each region where you want to use its diagnostic tools. It is automatically enabled when you create a VNet in a region, but it can be missing if, for example, the subscription opted out of automatic enablement or the Network Watcher resource was deleted.
Path: Monitor > Network Watcher or search for "Network Watcher" in the portal.
Network Watcher organizes its tools in categories. The main tools are listed in section 3.2 by the question each one answers.
3.2 Main Tools and What Each One Answers
| Tool | Question it answers | Layer |
|---|---|---|
| IP Flow Verify | "Would this packet be allowed or blocked by NSG?" | L4: NSG |
| Next Hop | "Where does this traffic go? Which route is being used?" | L3: Routing |
| Effective Security Rules | "Which NSG rules are currently active on this NIC?" | L4: NSG |
| Connection Troubleshoot | "Does this TCP/ICMP connection work between these two points?" | L3-L7: End-to-end |
| VPN Troubleshoot | "Why is the VPN Gateway or VPN connection having problems?" | VPN |
| Packet Capture | "What exactly is passing through this NIC?" | L2-L7: Raw capture |
| NSG Flow Logs | "What traffic passed (or was blocked) in recent days?" | L4: History |
| Topology | "What does the network topology of this region look like visually?" | Overview |
| Connection Monitor | "Is this connection working continuously?" | Continuous monitoring |
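3.3 Diagnostic Methodology: From Simple to Specific
Before opening tools, it's important to have a methodology. Random investigation wastes time. The most effective model is to divide the problem into layers and check them in order: NSG rules, then routing, then any firewall/NVA in the path, then the destination VM's operating system.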
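The layered approach can be sketched in Python as an ordered list of checks that stops at the first failure. The layer names follow the order this module recommends (NSG, routing, firewall/NVA, destination OS). The check functions are hypothetical placeholders that return hard-coded results; in practice each would wrap one diagnostic tool.

```python
from typing import Callable, List, Optional, Tuple

# A check is a (layer name, function) pair; the function returns True
# when that layer looks healthy.
Check = Tuple[str, Callable[[], bool]]


def first_failing_layer(checks: List[Check]) -> Optional[str]:
    """Run the checks in order and return the first layer that fails."""
    for layer, is_healthy in checks:
        if not is_healthy():
            return layer
    return None


# Hypothetical results for one investigation: NSG is fine, routing is not.
checks: List[Check] = [
    ("NSG", lambda: True),
    ("Routing", lambda: False),
    ("Firewall/NVA", lambda: True),
    ("Destination OS", lambda: True),
]

print(first_failing_layer(checks))
```

Stopping at the first failing layer mirrors the progressive reasoning: there is little value in inspecting the destination OS while routing is still sending traffic elsewhere.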
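4. Structural View
The Path of a Packet and Where It Can Be Blocked
A packet leaves the source VM's OS, passes the NSGs on the source NIC and subnet, follows the effective route (possibly through a firewall or NVA), passes the NSGs on the destination subnet and NIC, and finally reaches the destination VM's OS. Each hop is a potential blocking point. The right tool diagnoses each specific point.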
5. Practical Operationβ
Tool 1: IP Flow Verify
What it does: verifies if a packet with certain parameters (protocol, port, direction, source and destination IP) would be allowed or denied by NSG rules applied to a specific NIC.
Path: Network Watcher > IP Flow Verify
Required parameters:
- VM (and its NIC)
- Direction: Inbound or Outbound
- Protocol: TCP or UDP
- Local IP (from the VM)
- Local port
- Remote IP (source or destination)
- Remote port
Result: indicates "Access Allowed" or "Access Denied", and which specific NSG rule made the decision.
# Via CLI (the VM and NIC are given as full resource IDs, so no resource group flag is needed)
az network watcher test-ip-flow \
  --direction Inbound \
  --protocol TCP \
  --local 10.0.1.4:80 \
  --remote 40.68.100.50:12345 \
  --vm /subscriptions/<sub-id>/resourceGroups/rg-producao/providers/Microsoft.Compute/virtualMachines/vm-web \
  --nic /subscriptions/<sub-id>/resourceGroups/rg-producao/providers/Microsoft.Network/networkInterfaces/nic-vm-web
Non-obvious behavior: IP Flow Verify only checks NSGs. It doesn't consider Azure Firewall, UDRs, or VM operating system firewalls. If the answer is "Access Allowed" but the connection still fails, the problem is at another layer.
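To make the "which rule made the decision" idea concrete, here is a minimal Python sketch of first-match-by-priority evaluation. It is a deliberate simplification and not Azure's implementation: it assumes rules match only on protocol, source prefix, and a single destination port, and it ignores service tags, port ranges, and destination prefixes. The rule named allow-http is hypothetical.

```python
import ipaddress
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Rule:
    name: str
    priority: int   # lower number is evaluated first
    protocol: str   # "TCP", "UDP" or "*"
    source: str     # source CIDR prefix
    dest_port: str  # a single port or "*"
    access: str     # "Allow" or "Deny"


def evaluate(rules: List[Rule], protocol: str, source_ip: str,
             dest_port: int) -> Optional[Rule]:
    """Return the first rule, in priority order, that matches the packet."""
    ip = ipaddress.ip_address(source_ip)
    for rule in sorted(rules, key=lambda r: r.priority):
        if rule.protocol not in ("*", protocol):
            continue
        if ip not in ipaddress.ip_network(rule.source):
            continue
        if rule.dest_port not in ("*", str(dest_port)):
            continue
        return rule
    return None


rules = [
    Rule("DenyAllInBound", 65500, "*", "0.0.0.0/0", "*", "Deny"),
    Rule("allow-http", 100, "TCP", "0.0.0.0/0", "80", "Allow"),
]

for port in (80, 22):
    hit = evaluate(rules, "TCP", "40.68.100.50", port)
    print(port, hit.access, hit.name)
```

The sketch only answers the NSG question. Like IP Flow Verify itself, it says nothing about routes, firewalls, or the guest OS.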
Tool 2: Next Hop
What it does: for a source IP (VM) and a destination IP, it informs what the next hop is and which route is being used.
# --vm is given by name, so --resource-group must be the VM's resource group
az network watcher show-next-hop \
  --resource-group rg-producao \
  --vm vm-app \
  --source-ip 10.0.1.4 \
  --dest-ip 8.8.8.8
Typical result:
{
  "nextHopIpAddress": "",
  "nextHopType": "Internet",
  "routeTableId": "System Route"
}
If Next Hop returns None, traffic is being dropped by routing (blackhole). If it returns an unexpected IP, a UDR is redirecting traffic to the wrong place.
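The way a next hop is chosen can be illustrated with a longest-prefix-match lookup in Python. This is a sketch under assumptions: the route table below is hypothetical, and tie-breaking between user-defined, BGP, and system routes with the same prefix is left out.

```python
import ipaddress
from typing import Dict, List, Optional

# Hypothetical effective routes for a NIC in VNet 10.0.0.0/16.
routes: List[Dict[str, str]] = [
    {"prefix": "10.0.0.0/16", "next_hop_type": "VnetLocal", "source": "System Route"},
    {"prefix": "0.0.0.0/0", "next_hop_type": "Internet", "source": "System Route"},
    {"prefix": "192.168.0.0/16", "next_hop_type": "None", "source": "System Route"},
    # A UDR (hypothetical) that sends one subnet through an appliance.
    {"prefix": "10.0.2.0/24", "next_hop_type": "VirtualAppliance", "source": "rt-example"},
]


def next_hop(dest_ip: str, table: List[Dict[str, str]]) -> Optional[Dict[str, str]]:
    """Pick the matching route with the longest prefix, or None if nothing matches."""
    ip = ipaddress.ip_address(dest_ip)
    matches = [r for r in table if ip in ipaddress.ip_network(r["prefix"])]
    if not matches:
        return None
    return max(matches, key=lambda r: ipaddress.ip_network(r["prefix"]).prefixlen)


for dest in ("8.8.8.8", "10.0.2.4", "192.168.1.10"):
    route = next_hop(dest, routes)
    print(dest, route["next_hop_type"], route["source"])
```

In this example 10.0.2.4 matches both the VNet route and the more specific UDR, so the UDR wins. That is the "unexpected IP" case. A destination in 192.168.0.0/16 lands on a route whose next hop type is None, which is the blackhole case.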
Tool 3: Effective Security Rules
What it does: displays all NSG rules currently effective on a NIC, including rules from the NIC's NSG and the subnet's NSG. Each NSG's rules are listed by priority, and traffic must be allowed by both NSGs to pass. It clearly shows Azure's default rules and user-created rules.
# --vm is given by name, so --resource-group must be the VM's resource group
az network watcher show-security-group-view \
  --resource-group rg-producao \
  --vm vm-web
Difference from IP Flow Verify: Effective Security Rules shows all rules (complete view); IP Flow Verify simulates a specific packet and tells which rule applies. Use Effective Security Rules for auditing; use IP Flow Verify for diagnosing a specific flow.
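Because a NIC can be covered by two NSGs (one on the subnet, one on the NIC), it helps to see how their decisions combine. The following is a minimal sketch, assuming Azure's documented processing order: inbound traffic is checked by the subnet NSG first and then the NIC NSG, outbound traffic in the reverse order, and a level with no NSG associated allows everything. The function name is hypothetical.

```python
from typing import Optional


def blocking_level(direction: str,
                   subnet_decision: Optional[str],
                   nic_decision: Optional[str]) -> Optional[str]:
    """Return which NSG level blocks the flow first, or None if it passes.

    Each decision is "Allow", "Deny", or None when no NSG is associated
    at that level.
    """
    if direction == "Inbound":
        order = [("subnet NSG", subnet_decision), ("NIC NSG", nic_decision)]
    else:
        order = [("NIC NSG", nic_decision), ("subnet NSG", subnet_decision)]
    for level, decision in order:
        if decision == "Deny":
            return level
    return None


print(blocking_level("Inbound", "Allow", "Deny"))   # NIC NSG
print(blocking_level("Outbound", "Deny", "Allow"))  # subnet NSG
print(blocking_level("Inbound", "Allow", None))     # None
```

This is why an allow rule added to only one of the two NSGs often appears to "not work": the other level still denies the flow.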
Tool 4: Connection Troubleshoot
What it does: tests end-to-end connectivity between a source (Azure VM) and a destination (VM, URL, or IP). Verifies if the TCP or ICMP connection works, and in case of failure, indicates the problem point.
az network watcher test-connectivity \
--resource-group NetworkWatcherRG \
--source-resource /subscriptions/<sub-id>/resourceGroups/rg-producao/providers/Microsoft.Compute/virtualMachines/vm-app \
--dest-address 10.0.2.4 \
--dest-port 1433 \
--protocol Tcp
Result includes:
- connectionStatus: Reachable or Unreachable
- avgLatencyInMs: average latency
- hops: list of hops with status at each one, showing where the block occurs
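When the command is run with JSON output, the result can be scanned for the first hop that reports a problem. The sketch below uses a hypothetical, trimmed sample: connectionStatus, avgLatencyInMs, and hops come from the description above, while the per-hop issues list and its fields are assumptions about the output shape that should be checked against real output.

```python
import json
from typing import Optional

# Hypothetical output, trimmed to the fields discussed above.
SAMPLE = """
{
  "connectionStatus": "Unreachable",
  "avgLatencyInMs": null,
  "hops": [
    {"type": "Source", "address": "10.0.1.4", "issues": []},
    {"type": "VirtualNetwork", "address": "10.0.2.4",
     "issues": [{"origin": "Inbound", "severity": "Error",
                 "type": "NetworkSecurityRule"}]}
  ]
}
"""


def first_blocked_hop(result: dict) -> Optional[dict]:
    """Return the first hop that lists an issue, or None if reachable."""
    if result.get("connectionStatus") == "Reachable":
        return None
    for hop in result.get("hops", []):
        if hop.get("issues"):
            return hop
    return None


result = json.loads(SAMPLE)
hop = first_blocked_hop(result)
if hop is not None:
    print(hop["address"], hop["issues"][0]["type"])
```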
# Test connectivity to external URL
az network watcher test-connectivity \
--resource-group NetworkWatcherRG \
--source-resource /subscriptions/<sub-id>/resourceGroups/rg-producao/providers/Microsoft.Compute/virtualMachines/vm-app \
--dest-address "https://www.microsoft.com" \
--dest-port 443 \
--protocol Https
Limitation: Connection Troubleshoot requires the Network Watcher Agent to be installed on the source VM. On Windows VMs, the extension type is NetworkWatcherAgentWindows; on Linux VMs, it's NetworkWatcherAgentLinux. The portal typically installs it under the name AzureNetworkWatcherExtension and automatically requests installation if needed.
Tool 5: VPN Troubleshoot
What it does: diagnoses problems in VPN Gateways and site-to-site VPN connections. Analyzes detailed gateway logs and returns a report with the identified problem.
# The gateway is given as a full resource ID, so no resource group flag is needed
az network watcher troubleshooting start \
  --resource /subscriptions/<sub-id>/resourceGroups/rg-networking/providers/Microsoft.Network/virtualNetworkGateways/vpn-gw-prod \
  --resource-type vnetGateway \
  --storage-account <storage-account-id> \
  --storage-path "https://minhaconta.blob.core.windows.net/diagnostics"
VPN diagnosis saves results to a Storage Account (mandatory) and can take several minutes. The result includes a status (Healthy, NotConnected, Unknown) and problem details.
Tool 6: Packet Capture
What it does: captures network traffic on a VM NIC and saves it to a .cap file (Wireshark compatible). It's the last resort tool when others haven't identified the problem.
# --vm is given by name, so --resource-group must be the VM's resource group
az network watcher packet-capture create \
  --resource-group rg-producao \
  --vm vm-web \
  --name capture-vm-web-01 \
  --storage-account /subscriptions/<sub-id>/resourceGroups/rg-monitoring/providers/Microsoft.Storage/storageAccounts/diagnosticstore \
  --file-path /var/captures/capture-vm-web-01.cap \
  --time-limit 120 \
  --filters '[{"protocol": "TCP", "remotePort": "80"}]'
Key points:
- Requires Network Watcher Agent on the VM
- The capture is started remotely and runs through the agent extension on the VM, so no manual login to the guest OS is required
- Can be filtered by protocol, port, IP to reduce volume
- Time or file size limit automatically stops capture
Tool 7: NSG Flow Logs
What it does: logs all traffic (allowed and denied) passing through an NSG, with timestamps, IPs, ports, protocol and decision. It's a historical log, not a real-time diagnostic tool.
Versions:
- Version 1: logs basic flow (allowed/denied, IPs, ports)
- Version 2: adds bytes and packets information per flow
Logs are saved to a Storage Account in JSON format and can be integrated with Traffic Analytics (Log Analytics) for visualization and KQL queries.
# Enable NSG Flow Logs version 2
# --location is the region of the NSG (brazilsouth is assumed here)
az network watcher flow-log create \
  --location brazilsouth \
  --resource-group NetworkWatcherRG \
  --name flow-log-nsg-backend \
  --nsg /subscriptions/<sub-id>/resourceGroups/rg-producao/providers/Microsoft.Network/networkSecurityGroups/nsg-subnet-backend \
  --storage-account /subscriptions/<sub-id>/resourceGroups/rg-monitoring/providers/Microsoft.Storage/storageAccounts/flowlogstore \
  --enabled true \
  --format JSON \
  --log-version 2 \
  --retention 30 \
  --traffic-analytics true \
  --workspace /subscriptions/<sub-id>/resourceGroups/rg-monitoring/providers/Microsoft.OperationalInsights/workspaces/law-monitoring \
  --interval 10
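Inside the JSON written to the Storage Account, each flow is recorded as a comma-separated tuple. The Python sketch below splits one version 2 tuple into named fields. The field order reflects the commonly documented version 2 layout and should be verified against your own logs; the sample line and field names are hypothetical.

```python
from typing import Dict, Optional

# Assumed version 2 layout: the last four counters are empty when a flow begins.
FIELDS = [
    "timestamp", "src_ip", "dst_ip", "src_port", "dst_port",
    "protocol",            # T = TCP, U = UDP
    "direction",           # I = inbound, O = outbound
    "decision",            # A = allowed, D = denied
    "state",               # B = begin, C = continuing, E = end
    "packets_src_to_dst", "bytes_src_to_dst",
    "packets_dst_to_src", "bytes_dst_to_src",
]


def parse_flow_tuple(line: str) -> Dict[str, Optional[str]]:
    """Split one flow tuple into a dict; empty fields become None."""
    values = line.strip().split(",")
    if len(values) != len(FIELDS):
        raise ValueError(f"expected {len(FIELDS)} fields, got {len(values)}")
    return {name: (value or None) for name, value in zip(FIELDS, values)}


# Hypothetical sample: an inbound SSH attempt that the NSG denied.
flow = parse_flow_tuple("1709290800,203.0.113.50,10.0.1.4,54321,22,T,I,D,B,,,,")
print(flow["src_ip"], "->", flow["dst_ip"], flow["dst_port"], flow["decision"])
```

For anything beyond spot checks, Traffic Analytics and KQL are usually more practical than parsing the blobs by hand.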
Tool 8: Connection Monitor
What it does: monitors connections continuously and periodically, with alerts when connectivity changes. Unlike Connection Troubleshoot (point-in-time), Connection Monitor runs tests at regular intervals.
Use cases:
- Continuously monitor connectivity between application layers
- Detect intermittent problems that don't appear in point-in-time tests
- Measure latency over time for SLAs
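Why continuous probing catches what a single test misses can be shown with a small Python sketch. It classifies a series of probe results and computes the failure percentage. The probe data and the function name are hypothetical; this is not how Connection Monitor is implemented, only an illustration of the idea.

```python
from typing import List, Tuple


def summarize(probes: List[bool]) -> Tuple[str, float]:
    """Classify a series of probe results (True = probe succeeded)."""
    if not probes:
        return "no data", 0.0
    failed = probes.count(False)
    loss_percent = 100.0 * failed / len(probes)
    if failed == 0:
        status = "healthy"
    elif failed == len(probes):
        status = "down"
    else:
        status = "intermittent"
    return status, loss_percent


# Hypothetical series: 20 probes, 2 of them failed.
probes = [True] * 9 + [False] + [True] * 9 + [False]
print(summarize(probes))
```

A single point-in-time test against this connection would succeed 90% of the time and report nothing wrong, while the series reveals the intermittent loss.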
6. Implementation Methods
6.1 Azure Portal
When to use: interactive diagnosis, exploratory investigation, cases where CLI is less convenient.
The portal offers all Network Watcher tools with a visual interface and immediate feedback. The interactive Topology diagram is exclusive to the portal: the CLI (az network watcher show-topology) returns the topology only as JSON, with no visual rendering.
Portal advantage for Topology: visually shows how VNets, subnets, VMs, NSGs and peerings are connected in the region, facilitating topology understanding before diagnosing.
6.2 Azure CLI
When to use: automated diagnostic scripts, environments without portal access, pipeline integration.
CLI covers all Network Watcher tools with JSON output that can be processed by scripts. It's the most efficient approach for batch diagnosis or automation.
Check if Network Watcher is enabled in a region:
az network watcher list --output table
Enable Network Watcher in a region:
az network watcher configure \
--resource-group NetworkWatcherRG \
--locations brazilsouth \
--enabled true
6.3 PowerShell
# IP Flow Verify
Test-AzNetworkWatcherIPFlow `
-NetworkWatcher (Get-AzNetworkWatcher -Location "brazilsouth") `
-TargetVirtualMachineId (Get-AzVM -Name "vm-web" -ResourceGroupName "rg-producao").Id `
-Direction "Inbound" `
-Protocol "TCP" `
-RemoteIPAddress "40.68.100.50" `
-LocalIPAddress "10.0.1.4" `
-LocalPort "80" `
-RemotePort "54321"
# Next Hop
Get-AzNetworkWatcherNextHop `
-NetworkWatcher (Get-AzNetworkWatcher -Location "brazilsouth") `
-TargetVirtualMachineId (Get-AzVM -Name "vm-app" -ResourceGroupName "rg-producao").Id `
-SourceIPAddress "10.0.1.4" `
-DestinationIPAddress "8.8.8.8"
7. Control and Security
Required Permissions for Network Watcher
| Operation | Minimum permission |
|---|---|
| Use IP Flow Verify, Next Hop, Effective Security Rules | Network Contributor on VM and network resources |
| Packet Capture | Network Contributor + access to Storage Account |
| NSG Flow Logs | Network Contributor + Storage Blob Data Contributor on Storage |
| Connection Troubleshoot | Network Contributor + access to source VM |
| VPN Troubleshoot | Network Contributor on Gateway + Storage |
NSG Flow Logs and Sensitive Data
NSG Flow Logs record source and destination IP addresses of all traffic. In environments processing personal data (LGPD/GDPR), these logs may contain sensitive information. Consider:
- Minimum necessary retention (configurable, recommended 90 days for analysis)
- Storage Account encryption
- Restricted access via RBAC to Storage containing logs
8. Decision Making
Which tool to use for each symptom?
| Symptom | Initial tool | Next step if not resolved |
|---|---|---|
| VM not responding on port X | IP Flow Verify (check NSG) | Connection Troubleshoot (check complete path) |
| Traffic going to wrong place | Next Hop (check route) | Effective Security Rules (check if NSG interferes) |
| "Works sometimes, not always" | Connection Monitor (continuous monitoring) | NSG Flow Logs (flow history) |
| VPN doesn't connect | VPN Troubleshoot | Check IKE/IPSec configuration |
| Want to see what's going through | Packet Capture | Analyze the capture file in Wireshark |
| Traffic security audit | NSG Flow Logs + Traffic Analytics | KQL queries in Log Analytics |
| Don't know where to start | Topology (overview) | IP Flow Verify + Next Hop in sequence |
When the problem is in the OS vs. Azure network?
9. Best Practices
Enable NSG Flow Logs on all production NSGs from the start: historical logs are invaluable for retroactive diagnosis. When a problem is reported as "it happened yesterday at 2 PM", it's impossible to see what occurred without Flow Logs. The storage cost of logs is low compared to the diagnostic value.
Use Traffic Analytics for pattern visualization: raw NSG Flow Logs are difficult to analyze. Integrated with Log Analytics with Traffic Analytics enabled, you have visual dashboards of flows and can write KQL queries to investigate specific patterns.
Create a network diagnosis infrastructure as code: standardized diagnostic scripts that execute IP Flow Verify, Next Hop, and Connection Troubleshoot in sequence for an IP pair reduce diagnosis time and ensure nothing is forgotten.
Install Network Watcher Agent on all production VMs: without the agent, Connection Troubleshoot and Packet Capture don't work. Install the agent at VM creation time via extension, don't wait until you need it.
Document expected network topology: having a reference diagram of expected topology makes it easier to identify when Next Hop returns an unexpected value. Without reference, it's hard to know if a specific next hop is correct or wrong.
10. Common Errors
Using IP Flow Verify and concluding that "NSG is ok" without checking other layers
IP Flow Verify says the packet would be allowed by the NSG. The administrator concludes the problem isn't in the Azure network. But the connection still fails because there's an Azure Firewall in the path (UDR redirects to the Firewall) with a rule blocking the traffic. IP Flow Verify doesn't analyze Azure Firewall, only NSGs. The complete sequence should include Next Hop to check if there's an NVA/Firewall in the path.
Not checking the correct direction in IP Flow Verify
The administrator checks IP Flow Verify in the Inbound direction to the destination VM and finds "Allowed". But the problem is in the Outbound traffic from the source VM (an NSG on the source subnet is blocking outbound). It's necessary to check both directions: Outbound at the source AND Inbound at the destination.
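The two checks are easy to get wrong because "local" and "remote" swap between them. As a sketch, the Python helper below derives both parameter sets from a single flow. The function and dictionary keys are hypothetical and only mirror the IP Flow Verify parameters listed earlier; the helper does not call Azure.

```python
from typing import Dict, List


def ip_flow_checks(src_ip: str, src_port: int, dst_ip: str, dst_port: int,
                   protocol: str = "TCP") -> List[Dict[str, str]]:
    """Build the two IP Flow Verify checks needed for one flow."""
    return [
        {   # Run against the SOURCE VM: its own address is "local".
            "target": "source VM",
            "direction": "Outbound",
            "protocol": protocol,
            "local": f"{src_ip}:{src_port}",
            "remote": f"{dst_ip}:{dst_port}",
        },
        {   # Run against the DESTINATION VM: local and remote swap.
            "target": "destination VM",
            "direction": "Inbound",
            "protocol": protocol,
            "local": f"{dst_ip}:{dst_port}",
            "remote": f"{src_ip}:{src_port}",
        },
    ]


for check in ip_flow_checks("10.0.1.4", 12345, "10.0.2.4", 1433):
    print(check)
```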
Testing with Connection Troubleshoot without Network Watcher Agent installed
The tool returns an "agent not found" error or fails silently. The administrator interprets this as a network problem when it's actually the absence of the agent. Always confirm the agent is installed before using Connection Troubleshoot.
Confusing "Access Allowed" in IP Flow Verify with "connection works"
"Access Allowed" means the NSG would permit the traffic. It doesn't mean the connection will work: there could be a UDR redirecting to a Firewall that blocks, there could be a routing blackhole, there could be the VM's Windows Firewall blocking. IP Flow Verify is just one piece of the diagnosis.
Analyzing NSG Flow Logs without considering that the log is sampled
During high traffic periods, NSG Flow Logs may not record 100% of flows. It uses sampling. For security forensic analysis where every flow matters, consider Packet Capture together with Flow Logs.
11. Operation and Maintenance
KQL Queries for Traffic Analytics
With NSG Flow Logs and Traffic Analytics enabled, queries in Log Analytics reveal patterns:
Traffic blocked by NSG in the last 24 hours:
AzureNetworkAnalytics_CL
| where TimeGenerated > ago(24h)
| where FlowStatus_s == "D" // D = Denied
| summarize count() by NSGName_s, NSGRuleNumber_d, DestIP_s, DestPort_d
| sort by count_ desc
| top 20 by count_
Top 10 traffic sources for a specific VM:
AzureNetworkAnalytics_CL
| where TimeGenerated > ago(1h)
| where DestIP_s == "10.0.1.4"
| summarize Bytes = sum(OutboundBytes_d) by SrcIP_s
| sort by Bytes desc
| top 10 by Bytes
Connections blocked by specific IP (incident investigation):
AzureNetworkAnalytics_CL
| where TimeGenerated between (datetime(2024-03-01) .. datetime(2024-03-02))
| where SrcIP_s == "203.0.113.50"
| project TimeGenerated, SrcIP_s, SrcPort_d, DestIP_s, DestPort_d, FlowStatus_s, NSGName_s
| sort by TimeGenerated asc
Latency Diagnosis Between Regions
For latency problems between VMs in different regions (Global VNet Peering), Connection Monitor with continuous measurement is the most suitable tool:
# Create Connection Monitor between VMs in different regions
az network watcher connection-monitor create \
--name cm-prod-dr-connectivity \
--resource-group NetworkWatcherRG \
--location brazilsouth \
--source-resource /subscriptions/<sub-id>/resourceGroups/rg-producao/providers/Microsoft.Compute/virtualMachines/vm-prod \
--dest-resource /subscriptions/<sub-id>/resourceGroups/rg-dr/providers/Microsoft.Compute/virtualMachines/vm-dr \
--dest-port 443 \
--monitoring-interval 30
Important Limits
| Item | Limit |
|---|---|
| Active Packet Capture per region | 10 simultaneous |
| Maximum Packet Capture duration | 5 hours |
| Maximum Packet Capture file size | 1 GB |
| NSG Flow Logs maximum retention | 365 days |
| Connection Monitor: minimum intervals | 30 seconds |
12. Integration and Automation
Automated Diagnosis with Azure Functions
For environments where connectivity problems are frequent, an Azure Function can execute diagnostics automatically in response to alerts.
This transforms manual diagnosis from 20-30 minutes into an automatic report generated in seconds when an alert is triggered.
### Integration with Microsoft Sentinel
NSG Flow Logs sent to Log Analytics can be connected to Microsoft Sentinel for security analysis and anomaly detection:
```kql
// Detect port scanning (many different ports from the same IP)
AzureNetworkAnalytics_CL
| where TimeGenerated > ago(1h)
| where FlowStatus_s == "D"
| summarize PortsAttempted = dcount(DestPort_d) by SrcIP_s
| where PortsAttempted > 20
| sort by PortsAttempted desc
```
Systematic Diagnosis Script
A script that executes the complete diagnosis sequence for a source-destination pair:
#!/bin/bash
# Runs IP Flow Verify, Next Hop, and Connection Troubleshoot in sequence
# for one source/destination pair.
SOURCE_VM="vm-app"
SOURCE_IP="10.0.1.4"
DEST_IP="10.0.2.4"
DEST_PORT="1433"
RG="rg-producao"   # resource group of the source VM

# Resolve the VM and its first NIC to full resource IDs
SOURCE_VM_ID=$(az vm show --name "$SOURCE_VM" --resource-group "$RG" --query id -o tsv)
SOURCE_NIC=$(az vm show --name "$SOURCE_VM" --resource-group "$RG" \
  --query "networkProfile.networkInterfaces[0].id" -o tsv)

echo "=== 1. IP Flow Verify (Outbound from source) ==="
az network watcher test-ip-flow \
  --direction Outbound \
  --protocol TCP \
  --local "$SOURCE_IP:12345" \
  --remote "$DEST_IP:$DEST_PORT" \
  --vm "$SOURCE_VM_ID" \
  --nic "$SOURCE_NIC"

echo "=== 2. Next Hop from source to destination ==="
# --vm is given by name here, so the VM's resource group is required
az network watcher show-next-hop \
  --resource-group "$RG" \
  --vm "$SOURCE_VM" \
  --source-ip "$SOURCE_IP" \
  --dest-ip "$DEST_IP"

echo "=== 3. Connection Troubleshoot ==="
az network watcher test-connectivity \
  --source-resource "$SOURCE_VM_ID" \
  --dest-address "$DEST_IP" \
  --dest-port "$DEST_PORT" \
  --protocol Tcp
13. Final Summary
Essential points:
- Network Watcher is the central network diagnostics service in Azure. It needs to be enabled per region.
- Each tool diagnoses a specific layer: IP Flow Verify (NSG), Next Hop (routing), Connection Troubleshoot (end-to-end), Packet Capture (raw), NSG Flow Logs (historical).
- The correct methodology is progressive: check NSG, then routing, then firewall/NVA, then destination VM OS.
- NSG Flow Logs is a historical monitoring tool, not real-time diagnosis. Enable preventively.
Critical differences:
- IP Flow Verify checks only NSGs. Doesn't analyze Azure Firewall, UDRs, or OS firewalls. "Allowed" doesn't mean the connection works.
- Connection Troubleshoot vs. Connection Monitor: Troubleshoot is point-in-time (executes once); Monitor is continuous (executes periodically with alerts).
- Effective Security Rules vs. IP Flow Verify: Effective Security Rules shows all active rules (complete view); IP Flow Verify simulates a specific packet and says which rule decides.
- Next hop "None" means blackhole (traffic discarded by routing), not absence of route. It's different from route not found.
What needs to be remembered:
- Connection Troubleshoot and Packet Capture require the Network Watcher Agent installed on the VM. Install as extension during provisioning.
- The most efficient diagnosis order: IP Flow Verify → Next Hop → Connection Troubleshoot → Packet Capture.
- NSG Flow Logs save to Storage Account and can be integrated with Log Analytics via Traffic Analytics for KQL queries.
- VPN Troubleshoot saves results to Storage Account (mandatory) and can take several minutes to complete the analysis.
- For intermittent problems, Connection Monitor with continuous monitoring is much more effective than point-in-time tests.
- IP Flow Verify evaluates one NIC and one direction per run: always test Outbound at the source AND Inbound at the destination.