Troubleshooting Lab: Provision a container by using Azure Container Apps

Diagnostic Scenarios

Scenario 1: Root Cause

The platform team deployed a container app in production using the consumption plan. The environment was created three weeks ago and hosts four other applications without issues. The new application processes webhooks received from an external partner and requires external ingress enabled.

After deployment, system logs show the revision as active and healthy. The partner's integration team reports that all requests sent to the public endpoint return connection refused. The team verifies in the portal that the URL was generated correctly and that the container is running with one active replica.

The responsible engineer notes that, for cost savings, the application was configured with minimum replicas equal to zero. He also observes that the environment uses a custom virtual network created six months ago, and that the network security group associated with the Container Apps subnet has the following inbound rules:

Priority  Name                  Port  Protocol  Action
100       Allow-HTTPS-Internet  443   TCP       Allow
200       Allow-HTTP-Internal   80    TCP       Allow
300       Deny-All-Inbound      *     *         Deny

The application exposes port 3000 internally and the --target-port was configured correctly as 3000. The ingress is configured as external with http transport.

What is the root cause of the observed behavior?

A) Minimum replicas configured as zero cause the application to scale to zero during low demand periods; since no previous request warmed the instance, the partner's first connection arrives before the replica is provisioned and is refused.

B) The NSG associated with the subnet blocks inbound traffic on the port used by the Container Apps data plane to route external requests to the container, as no rule allows traffic on the infrastructure subnet ports.

C) Ingress configured as http does not support POST webhooks with payloads above 1 MB; the external partner is likely sending large payloads that are rejected at the ingress layer before reaching the container.

D) The public URL generated for the container app is valid for only 24 hours after first deployment; after this period, it's necessary to reissue the managed TLS certificate before the endpoint responds again.


Scenario 2: Action Decision

A container app in production is stuck in a degraded state. The team has identified the cause: the new revision deployed 20 minutes ago contains a configuration error in an environment variable that points to an invalid connection string. The container starts, fails to connect to the database, and restarts in a loop.

The environment operates in multiple revision mode. The previous revision is healthy and still exists in the environment. The SLA contract with the client requires that the service be restored within a maximum of 15 minutes from incident opening, which was opened 8 minutes ago.

The application receives 100% of traffic on the faulty revision because the team didn't configure traffic splitting rules during deployment. The database is accessible and healthy. The team has Contributor access to the resource group.

What is the correct action to take at this moment?

A) Fix the environment variable directly in the faulty revision via portal or CLI, forcing a restart of the revision so it re-reads the configurations and reconnects to the database.

B) Redirect 100% of traffic to the previous healthy revision immediately, restoring service within the SLA, and handle the environment variable correction in a separate window without time pressure.

C) Deploy a new revision with the corrected environment variable and wait for it to become healthy before transferring traffic, ensuring the fix is validated before being exposed to the client.

D) Deactivate the faulty revision in the portal so that Container Apps automatically redistributes traffic to the previous healthy revision, without needing to configure traffic rules manually.


Scenario 3: Root Cause

A development team uses Azure Container Apps to host an internal authentication API. The application has Dapr enabled for communication with other services in the same environment. The environment doesn't have a custom virtual network.

After an infrastructure update made by the platform team, the API starts returning 500 Internal Server Error on all calls that depend on communication with another service in the environment. Direct calls to the API that don't involve Dapr continue working normally.

Application logs show:

[2025-03-15 14:32:11] ERROR Dapr sidecar not reachable at localhost:3500
[2025-03-15 14:32:11] ERROR connection refused: 127.0.0.1:3500
[2025-03-15 14:32:14] ERROR Dapr sidecar not reachable at localhost:3500

The platform team reports that during the update, the following changes were made to the container app definition:

  1. The image was updated to version v2.1.3
  2. The minimum number of replicas was reduced from 2 to 1
  3. The daprEnabled field was changed from true to false to "save resources during load testing"
  4. A LOG_LEVEL environment variable was added with value debug

The application is in single revision mode. The environment has 6 other applications with Dapr enabled working normally.

What is the root cause of the observed problem?

A) The reduction of minimum replicas from 2 to 1 caused a race condition where the Dapr sidecar cannot register with the control plane when there's only one available replica.

B) The image update to v2.1.3 introduced an incompatibility with the Dapr runtime version used by the environment, causing communication failure between the application and the sidecar.

C) The daprEnabled field was set to false in the new revision, removing the Dapr sidecar from the container; without the sidecar, the application cannot reach port 3500 that it uses for Dapr communication.

D) Adding the LOG_LEVEL=debug variable conflicts with internal Dapr configurations that use the same environment variable, corrupting sidecar initialization in the new revision.


Scenario 4: Diagnostic Sequence

A container app that processes events from Azure Event Hubs stopped consuming messages. The environment is healthy, other applications work normally, and the backlog of unconsumed messages in the Event Hub keeps growing. The container is running with active replicas.

An engineer needs to diagnose the problem. The available investigation steps are:

P: Check container logs for connection or authentication errors with Event Hub.

Q: Confirm if the scaling rule based on Event Hub is configured and if the referenced connection string secret exists and is accessible in the container app.

R: Verify if the current number of replicas is within configured minimum and maximum limits and if the KEDA scaler is active.

S: Confirm if the consumer group configured in the scaling rule matches the consumer group that the application uses internally to consume messages.

T: Check the environment's System Logs in Log Analytics to identify if there are provisioning errors or control plane failures.

What is the most logical and efficient diagnostic sequence?

A) T, Q, P, R, S

B) R, P, T, Q, S

C) P, T, Q, S, R

D) Q, P, R, S, T


Answer Key and Explanations

Answer Key: Scenario 1

Answer: B

The NSG blocks traffic that the Container Apps data plane needs in order to route external requests to the container. In environments with a custom virtual network, Container Apps requires specific inbound rules in the NSG for management and data traffic. External traffic arriving at the public endpoint doesn't travel directly to the container on port 443: it passes through the environment infrastructure, which uses additional internal ports to communicate with hosted applications. The NSG in this scenario allows only ports 443 and 80. Because Deny-All-Inbound sits at priority 300, it is evaluated before the default NSG rules (priority 65000 and above), including the one that would otherwise admit Azure Load Balancer traffic, so the internal routing is blocked.
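The rule evaluation behind this answer can be sketched in Python. This is a simplified model, not how Azure implements NSGs: it matches on destination port only, because the scenario's table lists no source or destination columns. Port 30000 is an assumption, taken from the 30000-32767 Azure Load Balancer range that Microsoft's Container Apps networking guidance has listed for custom virtual networks; check the current documentation for your environment type.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Rule:
    priority: int
    name: str
    port: str    # a single port number, or "*" for any port
    action: str  # "Allow" or "Deny"


# Inbound rules copied from the scenario's NSG table.
RULES = [
    Rule(100, "Allow-HTTPS-Internet", "443", "Allow"),
    Rule(200, "Allow-HTTP-Internal", "80", "Allow"),
    Rule(300, "Deny-All-Inbound", "*", "Deny"),
]


def evaluate(rules, port):
    """Return (action, rule name) of the first match, lowest priority number first."""
    for rule in sorted(rules, key=lambda r: r.priority):
        if rule.port == "*" or rule.port == str(port):
            return rule.action, rule.name
    # Simplification: the built-in default rules are not modeled here.
    return "Deny", "no-match"


# 30000 is an assumed example of an infrastructure port (see the note above).
for port in (443, 80, 30000):
    action, name = evaluate(RULES, port)
    print(f"port {port}: {action} by {name}")
```

In this model, any port other than 443 or 80 falls through to Deny-All-Inbound, which is the behavior the explanation describes.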

The definitive clue in the statement is the combination of a custom virtual network and an NSG with a blanket deny rule, while the container appears healthy and the URL was generated. If the problem were a cold start (alternative A), the connection would fail only on the first call after a period of inactivity, not on every call. The webhook payload size (alternative C) is irrelevant to a connection refused error, which occurs before any data is transmitted. URL expiry (alternative D) is false: Container Apps URLs don't expire.

The most dangerous distractor is A: the zero minimum replicas setting stands out in the statement and leads the reader to focus on cold start instead of examining the network configuration.

Answer Key: Scenario 2

Answer: B

The critical constraint of the scenario is the 15-minute SLA with 8 minutes already consumed: 7 minutes remain. The correct action is to immediately redirect traffic to the previous healthy revision, which still exists in the environment because the revision mode is multiple. This restores service in seconds with minimal risk, as the previous revision has already been validated in production.
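The arithmetic and the shape of the rollback can be sketched in Python. This is an illustration only: the revision names are hypothetical, and the `revisionName`/`weight` keys are meant to mirror the traffic entries of a Container Apps ingress configuration, which should be checked against the current resource schema before use.

```python
from datetime import timedelta

SLA_LIMIT = timedelta(minutes=15)  # from the scenario
ELAPSED = timedelta(minutes=8)     # time since the incident was opened


def remaining_budget(limit, elapsed):
    """Time left before the SLA is breached (never negative)."""
    return max(limit - elapsed, timedelta(0))


def route_all_traffic_to(healthy, revisions):
    """Build a traffic list that sends 100% to one revision and 0% to the rest."""
    if healthy not in revisions:
        raise ValueError(f"unknown revision: {healthy}")
    return [
        {"revisionName": name, "weight": 100 if name == healthy else 0}
        for name in revisions
    ]


# Hypothetical revision names, for illustration only.
revisions = ["webapp--v1", "webapp--v2"]
print(remaining_budget(SLA_LIMIT, ELAPSED))
print(route_all_traffic_to("webapp--v1", revisions))
```

The point of the sketch is that the rollback is a pure routing change: no image is built and no replica has to become healthy before traffic moves.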

Alternative C would be technically correct without time pressure, but deploying a new revision and waiting for it to be validated consumes time that the SLA doesn't allow. Alternative A wouldn't work because revisions in Container Apps are immutable: you cannot edit the environment variables of an existing revision, and any such edit generates a new revision. Alternative D is attractive, but deactivating a revision in Container Apps doesn't automatically redistribute its traffic to other active revisions; requests routed to it would simply go unserved, worsening the incident.

The most dangerous distractor is C, as the logic of "validate before exposing" is sound under normal conditions but ignores the time constraint stated in the scenario.

Answer Key: Scenario 3

Answer: C

The log is unambiguous: connection refused: 127.0.0.1:3500. Port 3500 is the local port of the Dapr sidecar. When daprEnabled is false, Container Apps doesn't inject the sidecar alongside the application container, so any call from the application to localhost:3500 is refused because nothing is listening on that port. Since the revision mode is single, the new revision completely replaced the previous one, removing Dapr from the entire application.
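As a minimal sketch of reading that evidence mechanically, the Python snippet below filters the scenario's log lines for errors against the local Dapr port. The log text is copied from the scenario; the function name and the pattern are illustrative choices, not part of any Dapr or Azure tooling.

```python
import re

# Log lines copied from the scenario.
LOG = """\
[2025-03-15 14:32:11] ERROR Dapr sidecar not reachable at localhost:3500
[2025-03-15 14:32:11] ERROR connection refused: 127.0.0.1:3500
[2025-03-15 14:32:14] ERROR Dapr sidecar not reachable at localhost:3500
"""

# Matches a loopback address followed by the Dapr port used in the scenario.
LOOPBACK_DAPR = re.compile(r"\b(?:localhost|127\.0\.0\.1):3500\b")


def sidecar_errors(log_text):
    """Return the ERROR lines that mention the local Dapr port."""
    return [
        line
        for line in log_text.splitlines()
        if " ERROR " in line and LOOPBACK_DAPR.search(line)
    ]


hits = sidecar_errors(LOG)
print(f"{len(hits)} of {len(LOG.splitlines())} lines point at the local sidecar port")
```

Every error targets the loopback address, which points to something missing inside the replica itself and not to a remote dependency.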

The definitive clue is in the list of changes provided by the platform team: item 3 is the direct cause, and the log confirms exactly the expected behavior when the sidecar is not present.

The irrelevant information in the scenario is the reduction of minimum replicas from 2 to 1 (alternative A). It's technically visible and plausible as a cause, but the connection refused behavior on localhost has no relation to the number of replicas. The image update (alternative B) and the LOG_LEVEL variable (alternative D) are intentional noise representing the diagnostic error of focusing on the most technical or most recent change instead of examining each alteration individually against the observed symptom.

Answer Key: Scenario 4

Answer: A

The correct sequence is T, Q, P, R, S.

The progressive reasoning starts with the environment's control plane (T), as an infrastructure failure would immediately rule out most remaining hypotheses without needing to investigate the application. With the environment confirmed healthy, the next step is to verify the scaling rule configuration and connection secret (Q), as without a valid rule or accessible credentials, KEDA would never trigger the scaler. With configuration validated, examining container logs (P) reveals application errors that could indicate authentication or consumption logic problems. Next, checking the current state of replicas and scaler (R) shows if KEDA is detecting messages and trying to scale. Finally, validating the consumer group (S) is the most specific step and only makes sense after confirming that all scaling infrastructure is operational.
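The layered reasoning above can be sketched as an ordered walk that stops at the first failing check. The step letters and descriptions come from the scenario; the findings passed in are hypothetical values for illustration, since real checks would query logs and configuration.

```python
# Steps in the order given by the answer key: T, Q, P, R, S.
SEQUENCE = [
    ("T", "environment system logs (control plane)"),
    ("Q", "scaling rule and connection string secret"),
    ("P", "container logs"),
    ("R", "replica count and KEDA scaler state"),
    ("S", "consumer group match"),
]


def first_failure(sequence, results):
    """Walk the steps in order and return the first step whose check failed.

    `results` maps a step letter to True (check passed) or False (check failed).
    Steps after the first failure are not examined. Returns None if all pass.
    """
    for letter, description in sequence:
        if not results[letter]:
            return letter, description
    return None


# Hypothetical findings, for illustration only: everything passes except S.
findings = {"T": True, "Q": True, "P": True, "R": True, "S": False}
print(first_failure(SEQUENCE, findings))
```

Ordering the steps from the broadest layer to the most specific one means an early failure saves the later checks.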

Sequence B starts with replicas (R) before verifying if the configuration controlling the scaler is correct, which is inefficient. Sequence C starts with application logs (P) before confirming that scaling infrastructure and credentials are intact, diagnosing the wrong layer first. Sequence D starts with the scaling rule (Q), which is reasonable, but skips the environment's control plane, which could reveal the cause more quickly if it were an infrastructure failure.


Troubleshooting Tree: Provision a container by using Azure Container Apps

Legend

Color      Node type
Dark blue  Initial symptom (entry point)
Blue       Diagnostic question
Red        Identified cause
Green      Recommended action or resolution
Orange     Intermediate validation or verification

To use this tree when facing a real problem, start with the root node and answer each question based on what is directly observable in the environment: the portal, Log Analytics logs, and CLI command output. At each fork, choose the path that corresponds to what you see, not what you suspect. The goal is to eliminate hypotheses through evidence before executing any corrective action. Upon reaching an identified cause node, execute the corresponding action and validate the result at the associated verification node before closing the diagnosis.
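The traversal described above can be sketched in Python. The tree below is a tiny hypothetical example built from Scenario 3, not a reproduction of the diagram; only the node kinds follow the legend.

```python
# A tiny hypothetical tree; the node kinds follow the legend above.
TREE = {
    "start": {"kind": "symptom", "text": "Calls through Dapr fail", "next": "q1"},
    "q1": {
        "kind": "question",
        "text": "Do logs show connection refused on localhost:3500?",
        "yes": "c1",
        "no": "q_other",
    },
    "c1": {"kind": "cause", "text": "Dapr sidecar is not running", "next": "a1"},
    "a1": {"kind": "action", "text": "Re-enable Dapr on the app", "next": "v1"},
    "v1": {"kind": "verification", "text": "Confirm calls through Dapr succeed"},
    "q_other": {"kind": "question", "text": "Investigate other branches"},
}


def walk(tree, answers, start="start"):
    """Follow the tree from the root, using observed answers at each question.

    `answers` maps a question node id to "yes" or "no". The walk stops at a
    node with no outgoing edge, or at a question that has no recorded answer.
    """
    path, node_id = [], start
    while node_id is not None:
        node = tree[node_id]
        path.append((node["kind"], node["text"]))
        if node["kind"] == "question":
            node_id = node.get(answers.get(node_id))
        else:
            node_id = node.get("next")
    return path


for kind, text in walk(TREE, {"q1": "yes"}):
    print(f"{kind}: {text}")
```

Because the walk halts at any question without a recorded answer, it cannot reach a cause or action node on suspicion alone, which matches the rule of choosing paths by evidence.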