Troubleshooting Lab: Create and configure an IPsec/Internet Key Exchange (IKE) policy

Diagnostic Scenarios

Scenario 1: Root Cause

An infrastructure team reports that a Site-to-Site VPN tunnel between Azure and a corporate firewall had been stable for months. After a maintenance window on Friday, the tunnel stopped coming up. No changes were made to the Azure VPN Gateway. The network team reports that the TLS certificate of the firewall management portal was renewed during maintenance and that the device firmware was updated to the latest version.

When checking the Azure VPN Gateway logs, the team observes the following repeated message:

IKE SA MM: FAILED - No proposal match found
Phase 1 negotiation failed
Local policy: AES256-SHA256-DHGroup14
Remote proposal received: AES128-SHA1-DHGroup2

The Azure gateway is configured with a custom IPsec/IKE policy. The policy on the on-premises firewall also uses custom configuration. The gateway SKU is VpnGw2 and is operational. The gateway's public IP address has not changed.

What is the root cause of the tunnel establishment failure?

A) The TLS certificate renewal of the firewall portal changed the authentication credentials used in IKE phase 1, invalidating the existing SA.

B) The firmware update on the firewall reset the device's custom IPsec/IKE policy to manufacturer defaults, which are incompatible with the policy configured in Azure.

C) The VpnGw2 SKU does not support custom IPsec/IKE policies after firmware updates on remote devices, requiring gateway recreation.

D) The custom policy in Azure expired after the maintenance window, as custom policies have limited validity tied to the active SA lifetime.


Scenario 2: Action Decision

The cause of a Site-to-Site VPN tunnel failure has been identified: the IPsec/IKE policy configured in Azure specifies PfsGroup PFS24, but the on-premises device is a legacy appliance that only supports PFS2 and PFS14. The connection is in production and supports integration with a payment system that operates 24 hours a day. The team has approval to make changes only in scheduled maintenance windows, which occur on Sundays between 2 AM and 4 AM. Today is Thursday and the next available window is in 10 days.

The team has the following options available:

  • Update the IPsec/IKE policy in Azure now, outside the window
  • Wait for the maintenance window and update the policy to PFS14
  • Create a second VPN connection with compatible policy and redirect traffic
  • Remove the custom policy now to restore default behavior

What is the correct action to take at this time?

A) Update the connection's IPsec/IKE policy immediately, replacing PFS24 with PFS14, since the cause has been identified and the current impact is greater than the risk of change.

B) Wait for the maintenance window in 10 days and apply the fix with PFS14, respecting the established change control process.

C) Remove the custom policy from the connection now, outside the window, to restore Azure's default policy and allow the tunnel to negotiate automatically with the legacy device.

D) Create a new parallel VPN connection with the corrected policy and redirect traffic to it before removing the problematic connection, executing everything within the next maintenance window.


Scenario 3: Root Cause

An engineer is deploying a new Site-to-Site VPN connection with a custom IPsec/IKE policy. The Local Network Gateway and Azure VPN Gateway were created correctly. The connection was provisioned and appears with Connected status in the Azure portal. However, no traffic passes through the tunnel. Pings from virtual machines in the Azure VNet to hosts on the on-premises network fail completely.

The engineer verifies and confirms:

  • The on-premises VPN device is online and without alerts
  • The route to the Azure VNet prefix exists in the firewall routing table
  • There are no health alerts on the Azure VPN Gateway
  • The connection status is Connected

The policy configured in Azure is:

# Custom IPsec/IKE policy: IKE (phase 1) and IPsec (phase 2) parameters
$policy = New-AzIpsecPolicy `
    -IkeEncryption AES256 `
    -IkeIntegrity SHA384 `
    -DhGroup DHGroup24 `
    -IpsecEncryption AES256 `
    -IpsecIntegrity SHA256 `
    -PfsGroup PFS24 `
    -SALifeTimeSeconds 27000 `
    -SADataSizeKilobytes 102400000

The on-premises device is configured with the same phase 1 and phase 2 parameters. The network security group associated with the VM subnet allows all outbound traffic. The GatewaySubnet has no associated NSG.

When capturing packets on the on-premises device, the engineer observes that IPsec packets arrive correctly at the firewall, but no packets return toward Azure.

What is the root cause of the problem?

A) The value SADataSizeKilobytes 102400000 is above the limit supported by Azure for custom policies, causing silent packet drops in phase 2.

B) The custom policy in Azure is correct, but the network prefixes configured in the Local Network Gateway do not include the addresses of the on-premises hosts that need to be reached, preventing return traffic from being properly encapsulated by the remote device.

C) The Connected status indicates only that phase 1 was completed; phase 2 failed silently due to PfsGroup PFS24 incompatibility with the remote device.

D) The network security group on the VM subnet is blocking return traffic, even though it allows outbound, because inbound rules for IPsec responses were not configured.


Scenario 4: Diagnostic Sequence

A Site-to-Site VPN tunnel with a custom IPsec/IKE policy fails intermittently: the tunnel goes down and comes back up by itself every few hours, without intervention. The environment uses an Azure VPN Gateway with the VpnGw1 SKU and a connection with a custom policy applied.

The available investigation steps are:

  • Step P: Compare the SALifeTimeSeconds value configured in the Azure policy with the lifetime configured on the on-premises device
  • Step Q: Check Azure VPN Gateway logs for SA renegotiation messages or rekey failures
  • Step R: Confirm if the VpnGw1 SKU supports the number of active connections on the gateway
  • Step S: Verify if the on-premises device initiates renegotiation or if Azure initiates, and if both sides accept the proposal during rekey
  • Step T: Review if the custom policy specifies a SADataSizeKilobytes too low that causes frequent rekey due to volume exhaustion

What is the correct investigation sequence for this symptom?

A) R, Q, P, S, T

B) Q, P, T, S, R

C) P, R, T, Q, S

D) T, P, R, Q, S


Answer Key and Explanations

Answer Key: Scenario 1

Answer: B

The decisive clue is in the log message: Azure is proposing AES256-SHA256-DHGroup14, exactly what was configured in the custom policy, and the remote device is sending AES128-SHA1-DHGroup2. This divergence indicates that the on-premises device started using different parameters from what was configured before the maintenance.

The firmware update is the real cause: many manufacturers reset custom configurations to factory defaults during firmware updates, especially when there are changes in the configuration schema between versions. The visible result is exactly what was observed: the proposal sent by the remote device changes to the manufacturer's defaults.

The TLS certificate renewal of the management portal has no relation to IKE authentication credentials; the management plane and the tunnel are completely separate. This detail was purposefully included as irrelevant information to test whether the reader would incorrectly associate the management certificate with tunnel authentication. The VpnGw2 SKU has no limitations related to remote device firmware updates. Custom policies in Azure do not expire over time; they persist while the connection exists.

The most dangerous distractor is alternative A: an incorrect diagnosis would lead the team to investigate authentication and IKE certificates, wasting time while the real cause is the phase 1 policy reset on the firewall.
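The comparison behind this diagnosis can be sketched in PowerShell. This is a minimal illustration, not a parser for real gateway logs: it assumes both proposal strings have the three-part `Encryption-Integrity-DhGroup` shape shown in the log excerpt, and the variable names are invented for the example.

```powershell
# Proposal strings copied from the Scenario 1 log excerpt.
$localPolicy    = 'AES256-SHA256-DHGroup14'   # Azure custom policy
$remoteProposal = 'AES128-SHA1-DHGroup2'      # what the firewall now offers

# Assumed field order: encryption, integrity, DH group.
$fields     = 'Encryption', 'Integrity', 'DhGroup'
$localPart  = $localPolicy -split '-'
$remotePart = $remoteProposal -split '-'

# Build one row per field and flag whether both sides agree.
$comparison = for ($i = 0; $i -lt $fields.Count; $i++) {
    [pscustomobject]@{
        Field  = $fields[$i]
        Azure  = $localPart[$i]
        Remote = $remotePart[$i]
        Match  = ($localPart[$i] -eq $remotePart[$i])
    }
}

$comparison | Format-Table -AutoSize
```

Every field differs, which is consistent with the whole policy on the remote device having changed rather than a single parameter having been mistyped.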


Answer Key: Scenario 2

Answer: D

The context establishes explicit constraints: the system is a 24-hour production payment system and changes are only allowed in scheduled windows. The cause has been identified and the solution is known. The decision criterion is not technical; it is operational.

Alternative D is correct because it creates the new connection with the corrected policy and only performs the cutover within the maintenance window, respecting all constraints. This minimizes risk to production traffic and maintains compliance with the change process.

Alternative A ignores the maintenance window restriction for a critical system. Even if technically valid, acting outside the window on a payment system violates change control and can cause unplanned unavailability during reconfiguration. Alternative B waits 10 days without offering a functional alternative, meaning the tunnel remains inoperative for the entire period. Alternative C removes the custom policy outside the window, which also violates the process and can introduce unexpected behavior in a payment system.


Answer Key: Scenario 3

Answer: B

The critical clue is in the packet capture: IPsec packets arrive correctly at the on-premises device, but no packets return toward Azure. This means that phase 1 and phase 2 were successfully established (the tunnel is Connected and packets reach the destination), but the on-premises device is not encapsulating return traffic to send it to Azure.

The described behavior indicates that the on-premises firewall receives the packets, decapsulates them correctly, but when trying to respond, cannot find an active IPsec SA for the destination (the Azure VM IP addresses). This occurs when the prefixes configured in the Local Network Gateway in Azure do not properly cover the on-premises source addresses that need to return traffic. The on-premises device encapsulates return traffic based on traffic selectors negotiated in phase 2, which derive from the Local Network Gateway prefixes.

The value SADataSizeKilobytes 102400000 is valid in Azure. The Connected status confirms that both phase 1 and phase 2 were established, ruling out alternative C. The VM subnet NSG is not the cause either: the capture shows that no packets leave the on-premises device toward Azure, so there is no return traffic for an NSG to block, and encapsulated IPsec traffic terminates at the gateway rather than at the VMs. The absence of an NSG on the GatewaySubnet is irrelevant to this diagnosis and was purposefully included as a distraction.
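One way to check the Local Network Gateway prefixes is with the Az PowerShell module, as in the sketch below. The resource names (`lng-onprem`, `rg-vpn`) and the added prefix `10.20.0.0/24` are hypothetical placeholders, not values from the scenario. Because the second half changes a live resource, treat it as an outline to adapt rather than a script to run as-is.

```powershell
# Hypothetical resource names; replace with the ones in your environment.
$lng = Get-AzLocalNetworkGateway -Name 'lng-onprem' -ResourceGroupName 'rg-vpn'

# Prefixes Azure currently treats as on-premises for this gateway.
$lng.LocalNetworkAddressSpace.AddressPrefixes

# If the range that hosts the unreachable machines is missing, resubmit the
# full prefix list with it included. 10.20.0.0/24 is a made-up example.
$prefixes = @($lng.LocalNetworkAddressSpace.AddressPrefixes) + '10.20.0.0/24'
Set-AzLocalNetworkGateway -LocalNetworkGateway $lng -AddressPrefix $prefixes
```

Compare the listed prefixes against the actual addresses of the on-premises hosts before changing anything; the first two commands are read-only.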


Answer Key: Scenario 4

Answer: B

The correct sequence is: Q, P, T, S, R.

For an intermittent failure with periodic reconnection, the diagnostic reasoning should move from general to specific:

  1. Q first: gateway logs show what's happening at the time of the drop; if there's renegotiation or rekey error, this confirms the problem is in the SA lifecycle.
  2. P next: with the pattern confirmed in logs, compare the SALifeTimeSeconds on both sides. Asymmetric lifetime is the most common cause of periodic reconnections with custom policies.
  3. T in sequence: if the lifetimes in seconds match, check the volume limit (SADataSizeKilobytes), which can force rekey before time if it's too low for the traffic volume.
  4. S after: identify which side initiates the rekey and if the proposal is accepted; this reveals if there's incompatibility during renegotiation.
  5. R last: SKU capacity is the least likely hypothesis for this specific symptom and should only be checked after exhausting causes related to the policy.

Sequence A starts with SKU capacity, which is unlikely for the described symptom. Sequences C and D begin with specific parameters without first observing what the logs reveal, inverting the progressive diagnostic logic.
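The check in step T comes down to simple arithmetic, sketched below. The lifetime and data-size values are borrowed from the Scenario 3 policy purely as sample inputs, since Scenario 4 gives none. The 200 Mbps sustained throughput is a hypothetical figure, and the unit conversion uses decimal kilobytes as an approximation.

```powershell
# Sample inputs: policy values from Scenario 3, throughput is hypothetical.
$saLifeTimeSeconds   = 27000
$saDataSizeKilobytes = 102400000
$avgThroughputMbps   = 200

# Megabits per second -> kilobytes per second (decimal units, approximate).
$kilobytesPerSecond = $avgThroughputMbps * 1000 / 8

# Time until the data-size limit forces a rekey at that sustained rate.
$secondsToVolumeRekey = $saDataSizeKilobytes / $kilobytesPerSecond

# Whichever limit is reached first determines the rekey interval.
$effectiveRekeySeconds = [math]::Min($saLifeTimeSeconds, $secondsToVolumeRekey)
$effectiveRekeySeconds   # 4096
```

With these inputs the volume limit is reached after 4096 seconds (about 68 minutes), well before the 27000-second lifetime. A busy tunnel can therefore rekey far more often than the time-based setting suggests, which is why step T follows step P.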


Troubleshooting Tree: Create and configure an IPsec/IKE policy


Color Legend:

  • Dark blue: Initial symptom (entry point)
  • Blue: Diagnostic question (binary decision or observable)
  • Red: Identified cause or corrective action
  • Orange: Intermediate validation or verification

To use this tree when facing a real problem, start at the root node and answer each question based on what you observe in the environment, without assuming the cause. Follow the path that corresponds to the observed state, not the expected state. When reaching a red node, you have the cause or corrective action; when reaching an orange node, execute the validation before considering the problem resolved. If validation fails, return to the immediately previous question node and reevaluate the given answer.