Troubleshooting Lab: Implement Azure Extended Network
Diagnostic Scenarios
Scenario 1: Root Cause
An operations team reports that after deploying Azure Extended Network, VMs in Azure with extended IPs can successfully initiate connections to on-premises servers without issues. However, when an on-premises server attempts to initiate a connection to one of these VMs in Azure, the connection times out with no response.
The responsible engineer verified the following items and confirmed they are correct:
- The NSG associated with the VNet subnet in Azure allows inbound traffic from the 10.10.5.0/24 range (on-premises subnet)
- The Azure Extended Network appliance in Azure is in Running state in the portal
- Windows firewall on the target VMs is disabled for testing purposes
- The operating system version of the on-premises appliance is Windows Server 2019
The engineer also collected the output of the route print command from the on-premises server attempting to initiate the connection:
IPv4 Route Table
Active Routes:
Network Destination    Netmask          Gateway      Interface
0.0.0.0                0.0.0.0          10.10.5.1    10.10.5.20
10.10.5.0              255.255.255.0    On-link      10.10.5.20
127.0.0.0              255.0.0.0        On-link      127.0.0.1
What is the root cause of the observed problem?
A) The NSG is incorrectly configured; the inbound rule needs to reference the VirtualNetwork service tag instead of the specific IP prefix.
B) The on-premises server lacks a route for the extended subnet prefix pointing to the local Azure Extended Network appliance, causing traffic to follow the default gateway.
C) The on-premises appliance has the network extension service stopped; the Running state visible in the portal only reflects the Azure-side appliance.
D) The Windows Server 2019 version of the on-premises appliance is incompatible with the current Azure Extended Network version, requiring an upgrade to Windows Server 2022.
Scenario 2: Action Decision
The infrastructure team identified that the Azure Extended Network appliance on the Azure side experienced a hardware failure in the underlying VM and was deallocated by the platform. As a consequence, all Layer 2 extended communication between on-premises and Azure is interrupted, affecting 14 production VMs that depend on the extended subnet.
The cause was confirmed: the primary appliance VM in Azure was deallocated and did not restart automatically.
The environment has the following characteristics:
- A secondary Azure Extended Network appliance already provisioned in Azure, previously configured as a failover endpoint
- The on-premises appliance is already configured with both endpoints (primary and secondary)
- The team has full contributor permissions on the subscription
- The process of recreating and reconfiguring a new primary appliance would take approximately 45 minutes
- The business team has been reporting critical production impact for 12 minutes
What is the correct action to take at this moment?
A) Immediately start recreating the primary appliance from scratch, as it is the originally configured endpoint and restoring it ensures the environment returns to the documented desired state.
B) Check the state of the secondary appliance in Azure and, if operational, force the on-premises appliance to actively point to the secondary endpoint to restore communication immediately.
C) Open a Microsoft support ticket to request restoration of the primary appliance VM, since the failure was in the Azure platform infrastructure.
D) Restart all 14 production VMs in the extended subnet to force TCP session reestablishment through the secondary path once automatic failover occurs.
Scenario 3: Root Cause
An organization completed Azure Extended Network configuration three days ago and the environment was working correctly. This morning, users report that VMs with extended IPs in Azure are inaccessible from the on-premises environment. The on-duty engineer collected the following information:
The on-premises appliance event log (Windows Admin Center) displays:
[2025-06-10 07:14:32] ExtendedNetwork: VXLAN tunnel to remote endpoint 10.0.1.4 - Status: UNREACHABLE
[2025-06-10 07:14:32] ExtendedNetwork: Last successful heartbeat: 2025-06-10 02:58:17
[2025-06-10 07:14:33] ExtendedNetwork: Attempting reconnect to primary endpoint...
[2025-06-10 07:14:33] ExtendedNetwork: Reconnect failed - no route to host
The engineer verified the following additional items:
- The site-to-site VPN service between on-premises and Azure is active and carrying normal traffic according to the VPN gateway dashboard in the Azure portal
- The Azure Extended Network appliance in Azure (10.0.1.4) is in Running status in the portal
- A new security policy was applied to the on-premises perimeter firewall in the early morning of the same day, at 03:05
- The number of active VMs in the VNet has not changed
What is the root cause of the observed failure?
A) The site-to-site VPN gateway entered a degraded state even while showing normal traffic; the VXLAN tunnel failure is a consequence of intermittent UDP packet loss in the VPN.
B) The Azure Extended Network appliance in Azure had its network interface disconnected from the subnet, making the IP 10.0.1.4 inactive despite the VM's Running status.
C) The new firewall policy applied on-premises is blocking the UDP traffic necessary for VXLAN encapsulation between the local appliance and the Azure endpoint, preventing tunnel establishment.
D) The on-premises appliance lost the DNS registration of the Azure endpoint after the firewall policy was applied, preventing resolution of IP 10.0.1.4.
Scenario 4: Diagnostic Sequence
An engineer receives an alert: VMs with extended IPs in Azure are not responding to pings from on-premises servers. The environment uses site-to-site VPN as the underlying connectivity. The engineer needs to diagnose the problem systematically.
The available investigation steps are:
P: Verify if the site-to-site VPN between on-premises and Azure is active and passing traffic, confirming that the underlying Layer 3 connectivity is functional.
Q: Check the on-premises appliance log to see if the VXLAN tunnel is established and if there are recent heartbeats with the Azure endpoint.
R: Test direct connectivity (ping or traceroute) from the on-premises appliance to the Azure appliance IP (10.0.1.x) using the management interface.
S: Confirm that the target VMs in Azure are running and their NSGs allow ICMP or the tested port from the on-premises subnet.
T: Verify if the Extended Network service is active on the on-premises appliance through Windows Admin Center.
What is the correct diagnostic sequence?
A) T, Q, P, R, S
B) P, T, Q, R, S
C) Q, P, T, S, R
D) S, P, Q, T, R
Answer Key and Explanations
Answer Key: Scenario 1
Answer: B
The definitive clue is in the route print output: the on-premises server's route table contains only the on-link route for 10.10.5.0/24, the loopback route, and the default route (0.0.0.0/0 via 10.10.5.1). There is no specific route for the extended subnet prefix pointing to the local Azure Extended Network appliance.
When the server attempts to initiate a connection to a VM in Azure whose IP belongs to the extended subnet, the operating system doesn't find a more specific route and forwards the packet to the default gateway (10.10.5.1), which doesn't know how to forward traffic through the VXLAN tunnel. The packet never reaches the extension appliance.
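The forwarding decision described here can be sketched as a longest-prefix-match lookup over the three routes from the route print output. This is a minimal illustration, not how Windows implements routing. The scenario does not state the extended subnet's range, so the prefix 10.20.5.0/24, the VM address 10.20.5.10, and the appliance address 10.10.5.50 are all hypothetical.

```python
import ipaddress

# Routes from the scenario's route print output: (destination, gateway).
ROUTES = [
    ("0.0.0.0/0", "10.10.5.1"),
    ("10.10.5.0/24", "On-link"),
    ("127.0.0.0/8", "On-link"),
]


def lookup(routes, destination):
    """Return the (network, gateway) pair with the longest matching prefix."""
    addr = ipaddress.ip_address(destination)
    matches = [
        (ipaddress.ip_network(net), gateway)
        for net, gateway in routes
        if addr in ipaddress.ip_network(net)
    ]
    return max(matches, key=lambda item: item[0].prefixlen)


# Hypothetical address of an Azure VM in the extended subnet.
EXTENDED_VM = "10.20.5.10"

# With the table as collected, only the default route matches.
print(lookup(ROUTES, EXTENDED_VM)[1])  # 10.10.5.1

# Hypothetical fix: a specific route via the local appliance (address assumed).
FIXED_ROUTES = ROUTES + [("10.20.5.0/24", "10.10.5.50")]
print(lookup(FIXED_ROUTES, EXTENDED_VM)[1])  # 10.10.5.50
```

Once a more specific route exists, it wins over the default route, which is the behavior answer B relies on.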
The NSG information (option A) is irrelevant to this diagnosis: the problem occurs before traffic even reaches Azure. Option C is ruled out because the scenario confirms that traffic initiated by Azure VMs reaches on-premises servers, proving the on-premises appliance is functional. Option D is a distractor based on the OS version: nothing in the scenario or in the solution documentation links Windows Server 2019 to the problem.
The most dangerous error would be acting on the NSG (option A) without first validating the route table, wasting production time without addressing the real cause.
Answer Key: Scenario 2
Answer: B
The environment already has all prerequisites to use the secondary appliance immediately: it's provisioned, the on-premises appliance already knows both endpoints, and the team has permission to act. With critical production impact for over 10 minutes, the priority is to restore service through the fastest available path.
Option A is the most dangerous: recreating the primary appliance is the correct long-term recovery action, but spending 45 minutes on it while a provisioned secondary is already available is unjustifiable under production impact. Option C delegates to Microsoft an action the team can and should execute on its own with the permissions it already has. Option D is incorrect because restarting the target VMs does not move traffic to the secondary path; failover depends on the on-premises appliance actively pointing to the secondary endpoint.
The decisive constraint in this scenario is time: all options except B ignore that an immediate recovery path is already available and operational.
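The time-based reasoning can be sketched as picking the available recovery path with the shortest time to restore. The 45-minute figure comes from the scenario; the 5-minute estimate for switching to the secondary endpoint and all names in the code are assumptions made for illustration.

```python
from dataclasses import dataclass


@dataclass
class RecoveryPath:
    name: str
    minutes_to_restore: int
    available: bool


def fastest_available(paths):
    """Pick the available path that restores service soonest, or None."""
    candidates = [p for p in paths if p.available]
    return min(candidates, key=lambda p: p.minutes_to_restore, default=None)


paths = [
    # 45 minutes is stated in the scenario.
    RecoveryPath("recreate primary appliance", 45, True),
    # 5 minutes is an assumed estimate, valid only if the secondary is healthy.
    RecoveryPath("switch to secondary endpoint", 5, True),
]

choice = fastest_available(paths)
print(choice.name)  # switch to secondary endpoint
```

If the health check in option B shows the secondary is not operational, marking that path as unavailable makes the rebuild the only remaining choice.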
Answer Key: Scenario 3
Answer: C
The temporal correlation is the central clue: the last successful heartbeat was at 02:58:17, and the new firewall policy was applied at 03:05, about seven minutes later. No heartbeat has succeeded since the change to the on-premises perimeter firewall.
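The timeline can be checked mechanically from the log excerpt. The sketch below extracts the last successful heartbeat from the log line shown in Scenario 3 and compares it with the policy change time. The scenario gives the change time only as 03:05, so the seconds value of 00 is an assumption.

```python
import re
from datetime import datetime

# Log line copied from the Scenario 3 excerpt.
LOG_LINE = (
    "[2025-06-10 07:14:32] ExtendedNetwork: "
    "Last successful heartbeat: 2025-06-10 02:58:17"
)

# 03:05 on the same day per the scenario; the seconds are assumed to be 00.
POLICY_APPLIED = datetime(2025, 6, 10, 3, 5, 0)

HEARTBEAT_RE = re.compile(
    r"Last successful heartbeat: (\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2})"
)


def last_heartbeat(line):
    """Return the heartbeat timestamp found in a log line, or None."""
    match = HEARTBEAT_RE.search(line)
    if match is None:
        return None
    return datetime.strptime(match.group(1), "%Y-%m-%d %H:%M:%S")


heartbeat = last_heartbeat(LOG_LINE)
gap = POLICY_APPLIED - heartbeat
print(heartbeat < POLICY_APPLIED, gap)  # True 0:06:43
```

A last success shortly before the change, with none afterwards, is what makes the firewall policy the leading suspect.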
VXLAN operates over UDP, typically on port 4789. If the new firewall policy blocked this UDP traffic leaving the on-premises appliance toward Azure, the tunnel can no longer be established or maintained, even though the underlying Layer 3 VPN remains active. This is exactly the described situation: the VPN is functional (normal traffic in the gateway dashboard), but the VXLAN traveling over it is blocked.
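To make the encapsulation concrete: per RFC 7348, each encapsulated frame travels as the payload of a UDP datagram (IANA-assigned destination port 4789), prefixed by an 8-byte VXLAN header. The sketch below builds and parses that header using only the standard library. The VNI value 5001 is hypothetical, and nothing here reflects how the Azure Extended Network appliance is implemented internally.

```python
import struct

VXLAN_UDP_PORT = 4789        # IANA-assigned destination port for VXLAN
VXLAN_FLAG_VNI_VALID = 0x08  # "I" flag: the VNI field is valid


def build_vxlan_header(vni):
    """Pack the 8-byte header: flags(8) reserved(24) VNI(24) reserved(8)."""
    if not 0 <= vni < 2**24:
        raise ValueError("VNI must fit in 24 bits")
    return struct.pack("!II", VXLAN_FLAG_VNI_VALID << 24, vni << 8)


def parse_vxlan_header(data):
    """Return the VNI from a VXLAN header, or None if the I flag is not set."""
    if len(data) < 8:
        raise ValueError("VXLAN header is 8 bytes")
    flags_word, vni_word = struct.unpack("!II", data[:8])
    if not (flags_word >> 24) & VXLAN_FLAG_VNI_VALID:
        return None
    return vni_word >> 8


header = build_vxlan_header(5001)  # hypothetical VNI
print(len(header), parse_vxlan_header(header))  # 8 5001
```

Because the tunnel is ordinary UDP from the firewall's point of view, a policy that drops UDP 4789 breaks it without affecting other traffic on the VPN.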
The information about the number of active VMs in the VNet is irrelevant and was purposefully included as a distractor.
Option A is the most dangerous distractor: the active VPN with normal traffic rules out the gateway degradation hypothesis. Option B is ruled out because the Azure appliance is in Running state and the scenario reports no change on the Azure side. Option D is technically implausible: the appliance addresses the endpoint by IP (10.0.1.4), not by name, and the no route to host error points to a reachability problem, not name resolution.
Answer Key: Scenario 4
Answer: B (P, T, Q, R, S)
The correct diagnostic reasoning moves from the most fundamental layer to the most specific, progressively eliminating hypotheses.
The first step is to confirm that the underlying Layer 3 connectivity (VPN) is active (P), because without it no other verification makes sense: VXLAN cannot exist without the channel that transports it.
With the VPN confirmed, the next step is to verify if the Extended Network service is active on the on-premises appliance (T), because a stopped service would explain the failure without any evidence of network problems.
Next, check the VXLAN tunnel state in the appliance logs (Q) to determine if the tunnel is established or attempting reconnection.
With evidence of tunnel problems, the fourth step is to test direct Layer 3 connectivity between the two appliances (R), isolating whether the problem is in the network path between them.
Finally, with the entire tunneling chain validated, check the state of target VMs and their NSGs (S), which represent the application layer and access control, not the transport mechanism.
Sequence A (T, Q, P, R, S) is close to correct but checks the local service before confirming the underlying infrastructure, which can lead to premature conclusions if the VPN is degraded. Sequences C and D start with intermediate or destination components, violating the principle of diagnosing from the most fundamental to the most specific layer.
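The layered approach can be sketched as an ordered list of checks that stops at the first failure. The check functions below are hypothetical stubs that return fixed values; in practice each would wrap the corresponding manual verification (P, T, Q, R, S).

```python
def run_diagnosis(steps):
    """Run checks in order; return the label of the first failing step, or None."""
    for label, check in steps:
        if not check():
            return label
    return None


# Hypothetical stubs in the order of answer B. Here the tunnel check fails.
steps = [
    ("P: VPN active and passing traffic", lambda: True),
    ("T: Extended Network service running", lambda: True),
    ("Q: VXLAN tunnel established", lambda: False),
    ("R: appliance-to-appliance reachability", lambda: True),
    ("S: target VMs running and NSGs allow traffic", lambda: True),
]

print(run_diagnosis(steps))  # Q: VXLAN tunnel established
```

Stopping at the first failed layer keeps later checks from being interpreted while a more fundamental problem is still open.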
Troubleshooting Tree: Implement Azure Extended Network
Color Legend:
| Color | Node Type |
|---|---|
| Dark Blue | Initial symptom (entry point) |
| Blue | Diagnostic question (binary decision or observable) |
| Red | Identified cause |
| Green | Recommended action or resolution |
| Orange | Intermediate verification or validation |
When facing a real problem, start at the root node (observed symptom) and answer each diagnostic question based on what is immediately verifiable in the environment. Follow the branch corresponding to the obtained answer. Orange nodes indicate points where additional verification is needed before concluding the diagnosis. When reaching a red node, the cause is identified; the immediately connected green node indicates the recommended corrective action. Never skip steps: the tree was built to eliminate hypotheses from the most fundamental to the most specific layer, avoiding corrective actions based on assumptions.