Troubleshooting Lab: Configure TLS Termination and End-to-End TLS Encryption

Diagnostic Scenarios

Scenario 1: Root Cause

An operations team reports that an application published via Azure Application Gateway v2 started returning HTTP 502 errors to all external clients after a maintenance window performed the previous night. Before the maintenance, the application was working normally.

The responsible engineer confirms that two changes were made during maintenance: the TLS certificate installed on the backend servers was renewed, and the operating system of the backend pool VMs was updated. The backend servers respond normally to internal tests via HTTP on port 80.

The Application Gateway logs show the following:

[ERROR] Backend connect error: SSL handshake failed
[ERROR] Backend: 10.0.1.10:443
[ERROR] Certificate chain validation failed: unable to get local issuer certificate
[INFO] Frontend listener: HTTPS/443 - OK
[INFO] Health probe: HTTP/80 - Healthy

The backend health probe is configured for HTTP on port 80 and returns Healthy status for all pool instances.

What is the root cause of the observed 502 error?

A) The frontend listener TLS certificate was corrupted during maintenance, preventing the establishment of new TLS connections with clients
B) The renewed certificate on the backends was issued by a different CA than the previous one, and the new root certificate was not updated as a Trusted Root Certificate in the Application Gateway's backend HTTP settings
C) The VM operating system update changed the port configuration, causing the backends to stop listening on port 443
D) The health probe is configured for HTTP and therefore doesn't detect the TLS failure, which indirectly causes the 502 error


Scenario 2: Action Decision

The cause of the problem has been identified: the custom domain certificate associated with Azure Front Door expired 18 hours ago. Front Door is using a BYOC (Bring Your Own Certificate) certificate stored in Azure Key Vault. The PKI team has already replaced the certificate in Key Vault with a new valid version, but Front Door continues to return TLS errors to clients.

The environment has the following constraints:

  • The application is in production with an active 99.9% SLA
  • The custom domain cannot be removed and recreated without Change Advisory Board approval, a process that takes at least 4 hours
  • The team has full access to the Azure portal and CLI
  • 20 minutes have passed since the certificate replacement in Key Vault without Front Door automatically detecting the update

What is the correct action to take at this moment?

A) Wait another 40 minutes, as Front Door can take up to 1 hour to automatically sync the new certificate version from Key Vault
B) Remove the custom domain from Front Door and recreate it pointing to the new certificate
C) Access the custom domain settings in Front Door and manually trigger the certificate update, pointing to the new version in Key Vault
D) Replace the BYOC certificate with a Front Door-managed certificate to eliminate the Key Vault dependency


Scenario 3: Root Cause

A corporate web application is published via Azure Application Gateway v2 with end-to-end TLS configured. The backends use certificates issued by the company's internal CA, and the CA root certificate is correctly registered as a Trusted Root Certificate in the backend HTTP settings.

In the last week, three of the eight backend pool servers started presenting intermittent failures. The collected logs show:

[WARN] Backend: 10.0.2.14:443 - Connection established
[WARN] Backend: 10.0.2.14:443 - SSL handshake timeout after 10000ms
[WARN] Backend: 10.0.2.15:443 - Connection established
[WARN] Backend: 10.0.2.15:443 - SSL handshake timeout after 10000ms
[INFO] Backend: 10.0.2.10:443 - Handshake OK
[INFO] Backend: 10.0.2.11:443 - Handshake OK
[INFO] Backend: 10.0.2.12:443 - Handshake OK

The infrastructure team reports that the three affected servers received a security configuration update that restricted the accepted cipher suites to a more restrictive set, aligned with an internal hardening policy. The other five servers have not been updated yet.

The health probe is configured for HTTPS and intermittently marks the three servers as Unhealthy.

What is the root cause of the observed handshake failures?

A) The internal CA root certificate is expired, causing the Application Gateway to reject the trust chain on the three servers
B) The HTTPS health probe is overloading the three servers with simultaneous checks, causing timeout due to connection exhaustion
C) The cipher suites enabled on the three servers after hardening are incompatible with the cipher suites allowed by the SSL Policy configured on the Application Gateway
D) The individual certificate of two of the three affected servers was revoked by the internal CA during the hardening process


Scenario 4: Diagnostic Sequence

An engineer receives a ticket reporting that a new route published on Azure Application Gateway returns SSL_ERROR_HANDSHAKE_FAILURE error in users' browsers when trying to access https://app.contoso.com/api/v2. Other routes on the same gateway work normally.

The engineer has access to the Azure portal, Key Vault, and gateway logs, and has identified the following steps that need to be executed but still needs to put them in the correct order:

[P] Verify if the HTTPS listener associated with the /api/v2 route has a valid and non-expired TLS certificate
[Q] Analyze the Application Gateway access logs to identify the error code returned from the backend
[R] Confirm that the Backend HTTP Setting for the /api/v2 route points to HTTPS protocol and correct port
[S] Test the backend endpoint directly via curl with verbose TLS to isolate if the problem occurs before or after the gateway
[T] Verify if the SSL Policy applied to the gateway is compatible with the failing clients

Which sequence represents the correct diagnostic order, from outermost to innermost?

A) T, P, Q, S, R
B) P, T, Q, R, S
C) Q, P, T, R, S
D) P, Q, R, S, T


Answer Key and Explanations

Answer Key: Scenario 1

Answer: B

The determining clue is the combination of two facts: the certificate was renewed (not just reissued by the same CA), and the log explicitly reports "unable to get local issuer certificate", which is the trust chain validation failure raised on the Application Gateway side when it tries to verify the certificate presented by the backend.

When the new certificate was issued by a different CA than the original, the Application Gateway can no longer build the trust chain from the previously registered Trusted Root Certificate. The result is handshake failure and the consequent 502 error for clients.

The information about the VM OS update is the irrelevant element of the scenario: the backends respond normally on port 80 (confirmed by the health probe), and the logs show that the gateway reaches the backend on port 443 but the TLS handshake fails, which rules out any port or connectivity issue.

Option A is a distractor that directs reasoning to the wrong side of the flow: the log confirms that the frontend listener is OK. Option D represents a classic reasoning error: the health probe working on HTTP indicates only that the servers are alive at the application layer; it is completely blind to TLS failures on port 443. Acting based on this distractor would lead the engineer to erroneously conclude that the backends are healthy and investigate the gateway in the wrong direction.
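
The elimination above can be sketched as a small triage function over the log excerpt from the scenario. This is only an illustration: the `[LEVEL] message` layout is copied from the excerpt, the function name `triage` is hypothetical, and real Application Gateway diagnostic logs use a different, structured format.

```python
# Triage sketch for the Scenario 1 log excerpt (illustrative log format).
LOG_LINES = [
    "[ERROR] Backend connect error: SSL handshake failed",
    "[ERROR] Backend: 10.0.1.10:443",
    "[ERROR] Certificate chain validation failed: unable to get local issuer certificate",
    "[INFO] Frontend listener: HTTPS/443 - OK",
    "[INFO] Health probe: HTTP/80 - Healthy",
]


def triage(lines):
    """Return the findings that the log lines support, in a fixed order."""
    text = "\n".join(lines)
    findings = []
    if "Frontend listener: HTTPS/443 - OK" in text:
        # The client-facing side works, so option A is ruled out.
        findings.append("frontend listener is fine")
    if "unable to get local issuer certificate" in text:
        # The gateway cannot build the backend chain up to a root it trusts.
        findings.append("backend chain is untrusted: check the trusted root")
    if "Health probe: HTTP/" in text and "SSL handshake failed" in text:
        # An HTTP probe never performs a TLS handshake.
        findings.append("probe uses HTTP, so Healthy says nothing about TLS")
    return findings


findings = triage(LOG_LINES)
for finding in findings:
    print(finding)
```

The third finding is the reason option D is a symptom of poor monitoring rather than the root cause: the probe explains why the failure went undetected, not why the handshake fails.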


Answer Key: Scenario 2

Answer: C

When Azure Front Door uses a BYOC certificate stored in Key Vault, it maintains a reference to the certificate version. Simply replacing the secret in Key Vault does not trigger an immediate automatic sync in Front Door. The correct mechanism to force adoption of the new certificate without domain recreation is to manually update the certificate reference in the Front Door custom domain settings, pointing to the new version available in Key Vault.

Option A describes a delay that does not match real behavior: Front Door may take a few minutes to propagate configuration changes, but waiting without triggering the update is not acceptable in an environment with an active SLA. Option B violates the scenario's most critical constraint: removing and recreating the domain requires an approval that takes at least 4 hours, which is too long while the production outage continues. Option D is technically valid as a permanent solution, but it does not solve the immediate problem: migrating to a managed certificate involves a provisioning process that can take minutes to hours, and it is unnecessary when a valid certificate is already available in Key Vault.
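
After triggering the update, it helps to confirm which certificate the endpoint is actually serving. The following is a minimal sketch, assuming the `notAfter` value of the served certificate has already been obtained separately; the string format is the one Python's `ssl` module uses for `notAfter`. The dates and the helper name `hours_since_expiry` are hypothetical.

```python
import ssl


def hours_since_expiry(not_after: str, now_epoch: float) -> float:
    """Hours elapsed since notAfter; negative while the certificate is still valid."""
    expiry_epoch = ssl.cert_time_to_seconds(not_after)
    return (now_epoch - expiry_epoch) / 3600.0


# Hypothetical values: the old certificate expired 18 hours before "now",
# and the replacement is valid for another year.
old_not_after = "Jun  1 12:00:00 2024 GMT"
new_not_after = "Jun  1 12:00:00 2025 GMT"
now = ssl.cert_time_to_seconds("Jun  2 06:00:00 2024 GMT")

old_age = hours_since_expiry(old_not_after, now)
new_age = hours_since_expiry(new_not_after, now)

print("old certificate expired:", old_age > 0)
print("new certificate expired:", new_age > 0)
```

If the served certificate still reports the old `notAfter`, Front Door has not yet picked up the new version and the reference needs to be checked again.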


Answer Key: Scenario 3

Answer: C

The root cause lies in cipher suite incompatibility between the hardened backends and the Application Gateway's SSL Policy. The process of elimination is straightforward: the five unaffected servers use certificates from the same internal CA and work normally, which rules out any problem with the Trusted Root Certificate. The only variable that differentiates the three affected servers from the others is the update to the accepted cipher suites.

The Application Gateway, when trying to negotiate the outbound TLS handshake with these backends, finds no common cipher suite between what its SSL Policy allows and what the servers now accept. The negotiation fails in the ClientHello/ServerHello phase, resulting in timeout or rejection, exactly as described in the logs.

The irrelevant element purposely included is the intermittent behavior of the health probe: it follows the same pattern as TLS failures because the HTTPS probe also tries to negotiate TLS with the backend, but this describes the health probe symptom, not the cause of the failures.

Option D is the most dangerous distractor: an engineer who rushes to check certificate revocation would waste time on a completely wrong path, as revoked certificates produce different error messages (related to CRL or OCSP), not handshake timeouts.
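
The "no common cipher suite" condition can be checked as a simple intersection. The sketch below uses made-up lists: the suite names are real IANA identifiers, but which suites the gateway offers and which each server accepts are assumptions for illustration, and the real values must come from the gateway's SSL Policy and from each server's TLS configuration.

```python
# Suites the gateway is assumed to offer (hypothetical list).
gateway_policy = [
    "TLS_ECDHE_RSA_WITH_AES_128_GCM_SHA256",
    "TLS_ECDHE_RSA_WITH_AES_256_GCM_SHA384",
]

# Suites each backend is assumed to accept (hypothetical lists).
backends = {
    # Not yet hardened: still shares a suite with the gateway.
    "10.0.2.10": [
        "TLS_ECDHE_RSA_WITH_AES_256_GCM_SHA384",
        "TLS_RSA_WITH_AES_128_CBC_SHA",
    ],
    # Hardened: accepts only a suite the gateway does not offer.
    "10.0.2.14": ["TLS_ECDHE_ECDSA_WITH_CHACHA20_POLY1305_SHA256"],
}


def common_suites(offered, accepted):
    """Return the offered suites that the server also accepts, in offer order."""
    accepted_set = set(accepted)
    return [suite for suite in offered if suite in accepted_set]


results = {ip: common_suites(gateway_policy, suites) for ip, suites in backends.items()}

for ip, shared in results.items():
    status = "handshake possible" if shared else "no common suite"
    print(ip, status, shared)
```

An empty intersection for a server points to the hardening change, not to the certificate or the trusted root.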


Answer Key: Scenario 4

Answer: B

The correct diagnostic sequence for a TLS handshake failure affecting only a specific route works from the point closest to the client toward the point closest to the backend, validating each layer before advancing.

P comes first because the SSL_ERROR_HANDSHAKE_FAILURE error is reported by the client's browser, and the first hypothesis to eliminate is a problem in the listener certificate itself: if the certificate is expired or missing in the listener associated with the /api/v2 route, the handshake fails before any other layer.

T comes next because, if the listener certificate is valid, the next candidate is an SSL Policy incompatibility between the gateway and the failing client browsers.

Q comes after validating the client side because, with the listener and policy confirmed as correct, the gateway logs will reveal if the problem is in backend communication or if the gateway is rejecting the request for another reason.

R enters to verify if the Backend HTTP Setting is configured correctly for the affected route, since other routes work, suggesting there may be a specific configuration for this route that differs from the others.

S closes the sequence with a direct backend test, which only makes sense after confirming that all gateway components are configured correctly and the problem is beyond the gateway.

Sequence C seems intuitive because it starts with logs, but analyzing logs before confirming the listener's basic configuration generates noise: the logs would show the error without the engineer knowing yet whether the listener even has a valid certificate associated.
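
The outermost-to-innermost rule can be expressed as a sort on a layer index. In this sketch the letters and descriptions come from the scenario, while the numeric layer values are an illustrative encoding of the explanation above (0 is closest to the client).

```python
# Each step maps to (layer index, description); the dictionary is deliberately
# listed out of order to show that the sort recovers the sequence.
STEPS = {
    "Q": (2, "gateway access logs"),
    "S": (4, "direct backend test with curl"),
    "P": (0, "listener certificate"),
    "T": (1, "SSL Policy compatibility with clients"),
    "R": (3, "Backend HTTP Setting for the route"),
}

# Sort step letters by their layer index, client side first.
order = sorted(STEPS, key=lambda step: STEPS[step][0])

for step in order:
    print(step, "-", STEPS[step][1])
```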


Troubleshooting Tree: Configure TLS Termination and End-to-End TLS Encryption

Color Legend:

Color       Meaning
Dark blue   Initial symptom (root node)
Blue        Diagnostic question (decision)
Red         Identified cause
Green       Recommended action or resolution
Orange      Validation or intermediate verification node

To use this tree when facing a real problem, start with the root node describing the general TLS failure symptom. The first bifurcation separates errors that manifest on the client side (handshake with the gateway listener) from errors that occur in backend communication (502). From this separation, each branch asks an objective and verifiable question in the Azure portal, gateway logs, or directly on the servers. Follow the answers without skipping steps: each orange node represents an intermediate checkpoint before declaring a cause. When reaching a red node, the cause is identified and the following green node indicates the corresponding corrective action.