Troubleshooting Lab: Implement Bidirectional Forwarding Detection

Diagnostic Scenarios

Scenario 1 — Root Cause

A company has an ExpressRoute circuit with private peering connecting the on-premises datacenter to Azure. BFD was enabled on the MSEE two weeks ago and is operational. The on-premises CPE device is a Cisco IOS-XE router. The responsible engineer reports that after a maintenance window last Friday, the BFD session never came back up, although the BGP session reestablished normally and traffic is flowing without apparent interruption.

During the investigation, the engineer collects the following outputs:

# Cisco IOS-XE — BFD verification
Router# show bfd neighbors details

IPv4 Sessions
NeighAddr                              LD/RD         RH/RS     State     Int
10.0.0.1                              4097/0        Down      Down      Gi0/0/1

Session state is DOWN and not using echo function
Local Diag: 0, Demand mode: 0, Poll bit: 0
MinTxInt: 300000, MinRxInt: 300000, Multiplier: 3
Received MinRxInt: 0, Received Multiplier: 0
Holddown (hits): 900ms (0)

The engineer also observes that the Gi0/0/1 interface is up/up, the MTU was changed from 1500 to 9000 during maintenance, and BGP is receiving 14 prefixes normally.

What is the root cause of the BFD session establishment failure?

A. The 300 ms transmission interval is below the minimum supported by the MSEE, which requires at least 1000 ms.

B. The Received MinRxInt: 0 field indicates that no BFD packets are arriving from the MSEE peer, which points to a reachability or BFD packet filtering problem after maintenance.

C. Changing the MTU to 9000 bytes fragmented the BFD packets, which are discarded by the MSEE as it doesn't support jumbo frames.

D. BGP reestablished the session before BFD could negotiate the parameters, causing a state conflict that prevents BFD from coming up.

Scenario 2 — Action Decision

The cause of the problem has been identified: BFD authentication was inadvertently enabled on the on-premises CPE device during maintenance. The Azure VPN Gateway in the adjacent test environment doesn't support this feature, but the scenario in question concerns the production ExpressRoute circuit, where the peer is the MSEE, which doesn't support BFD authentication either.

The environment has the following restrictions:

The ExpressRoute circuit is carrying active production traffic
There's no approved maintenance window until next weekend
The BGP session is stable and traffic is not impacted
The security team requires prior approval for any configuration changes on edge devices

What is the correct action to take at this time?

A. Immediately remove the BFD authentication configuration from the CPE, as BFD failure represents a latent risk of slow convergence in case of link failure.

B. Restart the BGP session on the CPE to force renegotiation of BFD parameters along with BGP.

C. Document the identified problem, open an approval request with the security team, and wait for the maintenance window to remove the authentication configuration.

D. Completely disable BFD on the CPE until the maintenance window is approved, ensuring there are no session attempts with invalid parameters.

Scenario 3 — Root Cause

An engineer is configuring BFD over a Site-to-Site VPN tunnel between Azure VPN Gateway and an on-premises firewall. The IKEv2 tunnel has been established successfully. After applying the BFD configuration on the firewall, the session remains in Init state for several minutes and then drops with local diagnostic No Diagnostic.

The engineer verifies the following environment information:

# On-premises firewall — applied BFD configuration
neighbor 172.16.0.1 bfd
  bfd interval 750 min_rx 750 multiplier 4
  
# Connectivity verification
ping 172.16.0.1 source 172.16.0.2 count 100
Success rate is 100 percent (100/100)

# Route to BFD peer
172.16.0.1/32 via tunnel0 [1/0]

The engineer notes that ping works perfectly and the IPsec tunnel has been active for 3 days without interruption. The firewall firmware version is recent and supported. The Azure VPN Gateway SKU is VpnGw1.

What is the root cause of the observed behavior?

A. The 750 ms interval is incompatible with Azure VPN Gateway, which requires a minimum value of 1000 ms for BFD sessions over VPN.

B. The multiplier 4 exceeds the maximum value supported by Azure VPN Gateway for BFD sessions over IKEv2 tunnels.

C. BFD is not enabled or not supported on the VpnGw1 SKU of Azure VPN Gateway, so the Azure side never responds to the BFD control packets sent by the firewall.

D. The peer address 172.16.0.1 belongs to the internal tunnel address space, which is not routable by the Azure VPN Gateway control plane for BFD purposes.

Scenario 4 — Diagnostic Sequence

An operator receives the following alert at 03:42:

"ExpressRoute circuit — BGP session down. Failover to secondary circuit in progress."

When starting the investigation, they have access to the on-premises CPE and Azure portal metrics. BFD was enabled on the primary circuit. The following investigation steps are available, out of order:

1. Verify if BFD packets are being received on the CPE with show bfd neighbors details and confirm the Received MinRxInt value
2. Check the state of the CPE's physical interface connected to the provider with show interfaces
3. Check the BitsInPerSecond and BitsOutPerSecond metrics history of the circuit in the Azure portal to identify the exact time of the failure
4. Confirm if the secondary circuit took over correctly by checking the active BGP routes on the CPE
5. Check the connectivity provider logs to identify if there was a physical event at the provider layer

What is the correct investigation sequence?

A. 4 → 2 → 1 → 3 → 5

B. 2 → 1 → 3 → 5 → 4

C. 3 → 2 → 1 → 5 → 4

D. 1 → 3 → 2 → 5 → 4

Answer Key and Explanations

Answer Key — Scenario 1

Answer: B

The definitive clue is in the command output: Received MinRxInt: 0 and Received Multiplier: 0, together with a remote discriminator (RD) of 0. These values indicate that the local device has not received any BFD control packet from the MSEE peer. A BFD session only advances from Down to Init when a control packet from the peer arrives. With the received values at zero, the peer's packets aren't reaching the CPE, which points to blocking, filtering, or incorrect routing of BFD packets introduced during maintenance.
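The state progression mentioned here can be sketched as a small state machine. This is a minimal illustration of the receive-side transitions described in RFC 5880, not the implementation running on any router or on the MSEE; the names State, on_packet, and run are hypothetical.

```python
from enum import Enum


class State(Enum):
    ADMIN_DOWN = 0
    DOWN = 1
    INIT = 2
    UP = 3


def on_packet(local: State, remote: State) -> State:
    """Next local state after receiving a control packet that reports `remote`."""
    if local is State.ADMIN_DOWN:
        return local                      # session disabled: packet is discarded
    if remote is State.ADMIN_DOWN:
        return State.DOWN                 # peer signalled an administrative shutdown
    if local is State.DOWN:
        if remote is State.DOWN:
            return State.INIT             # peer is alive but has not seen us yet
        if remote is State.INIT:
            return State.UP               # peer has already seen our packets
        return State.DOWN
    if local is State.INIT:
        return State.UP if remote in (State.INIT, State.UP) else State.INIT
    # local is UP
    return State.DOWN if remote is State.DOWN else State.UP


def run(received_states) -> State:
    """Replay a sequence of received remote states, starting from Down."""
    state = State.DOWN
    for remote in received_states:
        state = on_packet(state, remote)
    return state


print(run([]))                        # no packets from the peer
print(run([State.DOWN, State.UP]))    # normal handshake
```

With an empty list of received packets the session never leaves Down, which is the situation the zeroed Received fields describe.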

The MTU change to 9000 bytes is irrelevant information included intentionally. BFD control packets are small (well under 100 bytes), and raising the MTU never forces a small packet to be fragmented. The fact that BGP is established and receiving prefixes confirms that the link and IP transport are operational, which eliminates the hypothesis of a physical problem.
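The size argument can be checked with simple arithmetic. The sketch below assumes an unauthenticated BFD control packet (24-byte mandatory section per RFC 5880) carried over UDP and IPv4 without IP options; the function names are illustrative only.

```python
BFD_CONTROL_BYTES = 24   # mandatory section of a BFD control packet, no authentication
UDP_HEADER_BYTES = 8
IPV4_HEADER_BYTES = 20   # assumes no IP options


def bfd_ip_packet_size() -> int:
    """Size of the IP packet that carries one BFD control packet."""
    return BFD_CONTROL_BYTES + UDP_HEADER_BYTES + IPV4_HEADER_BYTES


def needs_fragmentation(packet_bytes: int, mtu: int) -> bool:
    """A packet is fragmented only when it is larger than the interface MTU."""
    return packet_bytes > mtu


size = bfd_ip_packet_size()
for mtu in (1500, 9000):
    print(f"MTU {mtu}: packet {size} bytes, fragmentation needed: "
          f"{needs_fragmentation(size, mtu)}")
```

Under these assumptions the packet fits comfortably within both the old and the new MTU, so the MTU change cannot explain the failure.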

The most dangerous distractor is option C, as the MTU change during maintenance creates a false temporal correlation that can divert diagnosis down a dead end.
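The check used in this explanation (zeroed Received fields) can be automated when many sessions need review. The following is a rough sketch that parses an excerpt shaped like the output in Scenario 1; real output varies by platform and release, and parse_fields and peer_is_silent are hypothetical helpers, not part of any Cisco tooling.

```python
import re

# Excerpt copied from the Scenario 1 output.
SAMPLE = """\
Session state is DOWN and not using echo function
Local Diag: 0, Demand mode: 0, Poll bit: 0
MinTxInt: 300000, MinRxInt: 300000, Multiplier: 3
Received MinRxInt: 0, Received Multiplier: 0
"""


def parse_fields(text: str) -> dict:
    """Return {'Field name': int} for every 'Field name: <integer>' pair."""
    pairs = re.findall(r"([A-Za-z ]+):\s*(\d+)", text)
    return {name.strip(): int(value) for name, value in pairs}


def peer_is_silent(fields: dict) -> bool:
    """True when nothing has been learned from the peer's control packets."""
    return (fields.get("Received MinRxInt") == 0
            and fields.get("Received Multiplier") == 0)


fields = parse_fields(SAMPLE)
if peer_is_silent(fields):
    print("No BFD packets received from the peer: check filtering and routing")
```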

Answer Key — Scenario 2

Answer: C

The cause is identified and the correction is technically simple (remove the authentication configuration). However, the scenario imposes two critical restrictions: absence of an approved maintenance window and requirement for prior approval from the security team for changes to edge devices.

Option A ignores both restrictions. While the technical reasoning is correct (absence of functional BFD is a convergence risk), acting without approval in production violates the established process and can generate unplanned impacts. Option D creates a different problem: disabling BFD removes the protection of fast failure detection without solving the root cause, and would also require approval by the same logic.

The central point of this scenario is that the correct decision is not always the most technically elegant one. When there are process restrictions and immediate impact is zero (traffic flowing normally through BGP), the correct action is to follow the approval process and execute during the appropriate window.
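To put the latent convergence risk in numbers, the sketch below compares the BFD detection time from Scenario 1 with a BGP hold timer. It simplifies detection time to interval times multiplier (the real value depends on what both peers negotiate), and the 180-second hold time is an assumption that should be replaced with the value actually configured.

```python
def bfd_detection_ms(interval_us: int, multiplier: int) -> float:
    """Simplified BFD detection time in milliseconds."""
    return interval_us * multiplier / 1000


BFD_INTERVAL_US = 300_000   # MinRxInt from the Scenario 1 output
BFD_MULTIPLIER = 3          # Multiplier from the Scenario 1 output
BGP_HOLD_TIME_S = 180       # assumption: verify against the actual BGP configuration

bfd_ms = bfd_detection_ms(BFD_INTERVAL_US, BFD_MULTIPLIER)
print(f"Failure detection with BFD: {bfd_ms:.0f} ms")
print(f"Failure detection with BGP hold timer only: {BGP_HOLD_TIME_S} s")
print(f"Detection is about {BGP_HOLD_TIME_S * 1000 / bfd_ms:.0f}x slower without BFD")
```

The gap is what makes option A tempting, but it is a risk that materializes only if the link fails before the window, which is why it does not override the approval process.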

Answer Key — Scenario 3

Answer: C

The No Diagnostic code combined with a session that never gets past Init indicates that the BFD handshake with the Azure side never completed. Unlike Scenario 1, here IP connectivity is proven (100% ping success, route present, tunnel active for 3 days), which eliminates any hypothesis related to reachability.

The VpnGw1 gateway is the determining element: it doesn't offer BFD support. Since the Azure side doesn't process the BFD packets it receives, the on-premises firewall keeps sending control packets without completing the handshake, until the session attempt times out and drops with No Diagnostic.

The most dangerous distractor is option A, as interval values outside specification are a real and common cause of BFD failure, and the absence of explicit error messages can lead the operator to investigate parameters instead of checking SKU support.

Answer Key — Scenario 4

Answer: B

The correct sequence follows progressive diagnostic logic from nearest to farthest, and from physical to logical:

2 (physical interface) is the starting point because a failure at the physical layer would explain everything else. If the interface is down, there's no need to investigate BFD or metrics.

1 (BFD state) comes next because, if the interface is up, the next step is to understand if BFD correctly detected the failure or if BGP dropped due to timeout.

3 (portal metrics) allows correlating the time of failure with traffic behavior, identifying if it was a gradual degradation or abrupt drop, which helps direct investigation toward the provider.

5 (provider logs) is consulted after exhausting local verifications, as it depends on third-party communication and usually has response latency.

4 (failover verification) closes the cycle by confirming that the secondary circuit took over correctly. This step is validation, not root cause diagnosis, so it belongs at the end.

Sequence A makes the classic mistake of checking the result (failover) before understanding the cause. Sequence D starts with BFD before validating the physical layer, skipping a more basic step that could end the diagnosis earlier.
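The ordering logic in this explanation can be expressed as a sort over the five steps. The phase ranks below are a hypothetical encoding of "local physical, local logical, Azure-side metrics, third party, validation"; they are an illustration of the reasoning, not a standard classification.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Step:
    number: int   # numbering used in the scenario
    label: str
    phase: int    # hypothetical rank: lower values are checked first


STEPS = [
    Step(1, "BFD state on the CPE", phase=1),                   # local, logical
    Step(2, "Physical interface on the CPE", phase=0),          # local, physical
    Step(3, "Azure portal traffic metrics", phase=2),           # remote, self-service
    Step(4, "Secondary circuit failover validation", phase=4),  # validation, not diagnosis
    Step(5, "Connectivity provider logs", phase=3),             # third party, slow response
]


def investigation_order(steps):
    """Return step numbers sorted from earliest to latest phase."""
    return [s.number for s in sorted(steps, key=lambda s: s.phase)]


print(" → ".join(str(n) for n in investigation_order(STEPS)))
```

With these ranks the printed sequence is 2 → 1 → 3 → 5 → 4, matching option B.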

Troubleshooting Tree: Implement Bidirectional Forwarding Detection

Color Legend:

Color	Node Type
Dark blue	Initial symptom (entry point)
Blue	Diagnostic question (binary or state decision)
Orange	Intermediate verification or validation
Red	Identified cause
Green	Recommended action or resolution

When facing a real problem, start at the root node and answer each question based on what is directly observable in the environment, without assuming causes. At each branch, the correct path is determined by what you can confirm, not what you suspect. When an intermediate verification node (orange) appears, it indicates that additional inspection is needed before confirming the cause. Only when reaching a red identified cause node is the diagnosis complete and the corresponding action (green) can be safely executed.
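The traversal rule described above can be sketched in a few lines. The tree below is purely illustrative: its nodes are assembled from the scenarios in this lab and do not reproduce the original diagram, and the walk function and node names are hypothetical. The key behavior is that the walk stops when an observation is missing instead of guessing a branch.

```python
# Illustrative tree only; node kinds follow the color legend (question, cause).
TREE = {
    "start": {"kind": "question", "text": "Is the CPE interface up/up?",
              "yes": "rx", "no": "phys"},
    "rx": {"kind": "question", "text": "Is Received MinRxInt greater than 0?",
           "yes": "params", "no": "silent"},
    "phys": {"kind": "cause", "text": "Physical layer failure",
             "action": "Engage the connectivity provider"},
    "silent": {"kind": "cause", "text": "No BFD packets arriving from the peer",
               "action": "Review filtering, routing, and peer-side BFD support"},
    "params": {"kind": "cause", "text": "Session parameter or authentication mismatch",
               "action": "Compare the BFD configuration on both peers"},
}


def walk(tree, observations, node_id="start"):
    """Follow confirmed observations; return (cause, action) or None if unconfirmed."""
    node = tree[node_id]
    while node["kind"] == "question":
        answer = observations.get(node_id)
        if answer is None:
            return None               # not observed yet: stop instead of assuming
        node_id = node["yes"] if answer else node["no"]
        node = tree[node_id]
    return node["text"], node["action"]


print(walk(TREE, {"start": True, "rx": False}))
print(walk(TREE, {"start": True}))    # second question not answered yet
```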
