Troubleshooting Lab: Manage sizing and scaling for containers, including Azure Container Instances and Azure Container Apps
Diagnostic Scenarios
Scenario 1 - Root Cause
An operations team reports that a container group in Azure Container Instances was successfully deployed, but the application inside the container terminates after a few seconds and the container restarts in a loop. The team verified that the image is correct and was tested locally without issues.
The environment has the following configurations:
{
  "name": "processador-relatorios",
  "properties": {
    "restartPolicy": "Always",
    "containers": [
      {
        "name": "app",
        "properties": {
          "image": "meuregistry.azurecr.io/processador:v2",
          "resources": {
            "requests": {
              "cpu": 0.5,
              "memoryInGB": 0.3
            }
          },
          "environmentVariables": [
            { "name": "ENV", "value": "production" }
          ]
        }
      }
    ]
  }
}
The operator executed the following command to check events:
az container show \
  --name processador-relatorios \
  --resource-group rg-producao \
  --query "containers[0].instanceView.events" \
  --output table
Output:
Name     Count  FirstTimestamp        LastTimestamp         Message
-------  -----  --------------------  --------------------  ----------------------------------
Pulling  1      2024-11-10T10:00:01Z  2024-11-10T10:00:01Z  Pulling image from registry
Pulled   1      2024-11-10T10:00:04Z  2024-11-10T10:00:04Z  Successfully pulled image
Started  12     2024-11-10T10:00:05Z  2024-11-10T10:04:47Z  Started container
Killed   11     2024-11-10T10:00:07Z  2024-11-10T10:04:49Z  OOMKilled
The team also reports that the variable ENV=production is necessary and correctly configured. The container registry is accessible and the image was pulled without errors.
What is the root cause of the observed behavior?
A) The image is corrupted in the registry, as it was pulled but the container cannot initialize correctly.
B) The restart policy Always is forcing unnecessary restarts even when the container terminates normally.
C) The allocated memory is insufficient for the application, so the kernel terminates the process when it exceeds its memory limit.
D) The environment variable ENV=production is activating a heavier execution mode that consumes more CPU than allocated.
Scenario 2 - Action Decision
The platform team identified the cause of a problem in a production environment: the active revision of an Azure Container Apps application has maxReplicas set to 2, while the current load requires at least 8 replicas to process messages within SLA. The KEDA-based scaling rule is functional, and logs confirm that the scaler is attempting to create new replicas but is blocked by the configured limit.
The environment has the following restrictions:
- The application is in production and processing active financial transactions
- Changing the active revision requires creating a new revision in multiple revisions mode
- The environment is configured in single revision mode, meaning any new deployment immediately replaces the current revision
- The team has Contributor permission on the resource group
- There is no scheduled maintenance window for the next 4 hours
What is the correct action to take at this moment?
A) Delete the Container App and redeploy it with maxReplicas: 8, as the service does not allow scaling configuration updates without recreation.
B) Update the scaling configuration in place via az containerapp update, raising maxReplicas to 8, without recreating the app or changing the revision mode.
C) Enable multiple revisions mode, deploy a new revision with maxReplicas: 8 and gradually migrate traffic.
D) Wait for the maintenance window to apply the change, as modifying scaling configurations in production without a window represents operational risk.
Scenario 3 - Root Cause
An engineering team configured an application in Azure Container Apps to scale based on messages in an Azure Service Bus queue. During the week, the system worked correctly. On the following Monday, after an infrastructure update performed by the security team, scaling stopped working: the queue accumulates messages but no additional replicas are created.
The engineering team verified the following points:
- The container image was not changed
- The KEDA scaling rule configuration is identical to the previous week
- The current number of replicas is fixed at 1, which is the minReplicas value
- Azure Monitor does not register application errors
- The security team reports that they rotated Key Vault secrets and updated network policies for resources
The relevant scaling configuration is:
{
  "scale": {
    "minReplicas": 1,
    "maxReplicas": 10,
    "rules": [
      {
        "name": "servicebus-rule",
        "custom": {
          "type": "azure-servicebus",
          "metadata": {
            "queueName": "fila-pedidos",
            "messageCount": "5"
          },
          "auth": [
            {
              "secretRef": "sb-connection-string",
              "triggerParameter": "connection"
            }
          ]
        }
      }
    ]
  }
}
What is the root cause of the problem?
A) The messageCount property was discontinued in KEDA and should be replaced with queueLength for the trigger to work correctly.
B) The secret referenced in secretRef: sb-connection-string in the Container App was not updated after rotation, causing KEDA to fail authentication with the queue.
C) The updated network policies blocked outbound traffic from the Container App to Azure Monitor, preventing scaling metrics collection.
D) The value of minReplicas: 1 prevents KEDA from creating new replicas, as the scaler interprets the existing replica as the maximum supported load.
Scenario 4 - Diagnostic Sequence
An operator receives the following alert: an application deployed in Azure Container Apps is responding with latency 10 times above normal during peak hours. The operator suspects a scaling problem.
The available investigation steps are:
- Step P: Check application logs in Log Analytics to identify internal errors that could explain the slowness independently of scaling
- Step Q: Confirm if the current number of active replicas is below the configured maxReplicas value
- Step R: Inspect the scaling configuration to verify minReplicas, maxReplicas values and active trigger rules
- Step S: Check CPU and memory metrics of active replicas to confirm if they are saturated
- Step T: Manually test the application endpoint with a load tool to reproduce the behavior and confirm the symptom
What is the correct investigation sequence?
A) T, S, R, Q, P
B) R, Q, S, P, T
C) T, P, R, S, Q
D) P, T, S, R, Q
Answer Key and Explanations
Answer Key - Scenario 1
Answer: C
The OOMKilled event in the event log is the definitive clue. This event indicates that the operating system kernel terminated the container process because it attempted to use more memory than the allocated limit, which in this case is 0.3 GB (approximately 307 MB). This value is extremely low for a report processing application in a production environment.
The information about the ENV=production variable is intentionally irrelevant to the diagnosis: it is present in the statement to distract and induce alternative D. The fact that the image was successfully pulled (Pulled event) eliminates alternative A. The Always restart policy (alternative B) explains why the container restarts after being killed, but is not the cause of the termination itself.
The most dangerous distractor is alternative D: a less experienced operator might associate ENV=production with different application behavior and direct the diagnosis toward environment variables, ignoring the OOMKilled that is explicitly in the logs.
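The reading of the event table described above can be sketched in a few lines of Python. This is a minimal illustration, not a definitive tool: the helper name `diagnose_restart_loop` is hypothetical, and the event key names (`name`, `count`, `message`) are assumptions modeled on the column headers shown in Scenario 1.

```python
import json

# Sample events modeled on the Scenario 1 table. The key names are
# assumptions based on the table columns, not a documented schema.
SAMPLE_EVENTS = json.loads("""
[
  {"name": "Pulling", "count": 1,  "message": "Pulling image from registry"},
  {"name": "Pulled",  "count": 1,  "message": "Successfully pulled image"},
  {"name": "Started", "count": 12, "message": "Started container"},
  {"name": "Killed",  "count": 11, "message": "OOMKilled"}
]
""")


def diagnose_restart_loop(events, memory_in_gb):
    """Return a short diagnosis string for a restarting container (hypothetical helper)."""
    oom_kills = sum(e["count"] for e in events if "OOMKilled" in e["message"])
    starts = sum(e["count"] for e in events if e["name"] == "Started")
    if oom_kills > 0:
        # Treat the requested GB as binary units, as the explanation does.
        limit_mib = memory_in_gb * 1024
        return (f"memory limit exceeded: {oom_kills} OOM kills across "
                f"{starts} starts with a limit of about {limit_mib:.0f} MiB")
    if starts > 1:
        return "restart loop without OOM kills: check exit codes and logs"
    return "no restart loop detected"


print(diagnose_restart_loop(SAMPLE_EVENTS, memory_in_gb=0.3))
```

The check looks at the kill reason first and the restart count second, which mirrors the explanation: the restart policy accounts for the loop, while OOMKilled accounts for the termination.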
Answer Key - Scenario 2
Answer: B
Azure Container Apps allows updating scaling configuration, including maxReplicas, directly via az containerapp update, with no need to recreate the app or manage revisions by hand. Because scale settings are part of the revision template, the platform rolls the change out as a new revision on its own; in single revision mode that revision replaces the previous one once it is ready, without downtime.
Alternative A is incorrect: the service does not require recreation to change scaling. Alternative C would be valid in a context of gradual deployment or A/B testing, but it introduces unnecessary complexity and is not applicable in single revision mode without first changing the app's revision mode, which increases risk in production. Alternative D ignores that the problem is actively impacting SLA now, which makes waiting unjustifiable when a safe action is available immediately.
The critical restriction that eliminates C is precisely the already active single revision mode: changing to multiple revisions during a production incident is a higher risk action than applying az containerapp update directly.
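As a rough illustration of the action in alternative B, the sketch below only assembles the command line and does not run it. The `--name`, `--resource-group`, `--max-replicas` and `--min-replicas` flags belong to az containerapp update; the helper name `build_scale_update_command`, the app name `app-pagamentos` and the resource group `rg-producao` are hypothetical, since Scenario 2 does not name them.

```python
import shlex


def build_scale_update_command(app_name, resource_group, max_replicas, min_replicas=None):
    """Build (but do not run) an `az containerapp update` argument list."""
    if max_replicas < 1:
        raise ValueError("max_replicas must be at least 1")
    if min_replicas is not None and min_replicas > max_replicas:
        raise ValueError("min_replicas cannot exceed max_replicas")
    argv = [
        "az", "containerapp", "update",
        "--name", app_name,
        "--resource-group", resource_group,
        "--max-replicas", str(max_replicas),
    ]
    if min_replicas is not None:
        argv += ["--min-replicas", str(min_replicas)]
    return argv


# Hypothetical app and resource group names, used only for illustration.
cmd = build_scale_update_command("app-pagamentos", "rg-producao", max_replicas=8)
print(shlex.join(cmd))
```

Building the argument list separately from executing it lets the command be reviewed, or pasted into a change record, before anyone touches production.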
Answer Key - Scenario 3
Answer: B
The central clue is in the timeline: the problem started after the security team rotated secrets. The KEDA configuration references the sb-connection-string secret to authenticate with the Service Bus queue. When the underlying connection string is rotated, whether in Key Vault or at the Service Bus itself, the value stored as a secret in the Container App becomes outdated. KEDA then uses an invalid credential and cannot query the queue, so it does not detect the accumulated messages and does not trigger scaling.
Alternative A is incorrect: messageCount is a valid parameter in the KEDA scaler for Service Bus. Alternative C is a plausible distractor, but Azure Monitor is not the mechanism by which KEDA queries the queue; KEDA accesses Service Bus directly via authenticated connection. Alternative D represents a misconception about how minReplicas works: it defines the floor of replicas, not a load limit.
The information about network policies is present as irrelevant data to induce diagnosis toward alternative C. The fact that there are no errors in Azure Monitor reinforces that the problem is not in application execution, but in the scaler's ability to observe the queue.
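To make the mechanism concrete, here is a simplified model of how a queue-length trigger turns into a replica count. It assumes the common target-average formula (ceiling of queue length divided by messageCount, clamped between minReplicas and maxReplicas) and ignores stabilization windows and polling intervals. The function name `desired_replicas` is hypothetical, and treating an unreadable queue as "stay at the floor" is an assumption that matches the symptom in the scenario rather than a statement about KEDA internals.

```python
import math


def desired_replicas(queue_length, message_count, min_replicas, max_replicas):
    """Approximate the replica count for a queue-length trigger (simplified model).

    queue_length is None when the scaler cannot read the queue, for example
    after an authentication failure. This sketch assumes the app then stays
    at min_replicas, as observed in Scenario 3.
    """
    if queue_length is None:
        return min_replicas
    wanted = math.ceil(queue_length / message_count)
    return max(min_replicas, min(max_replicas, wanted))


# Values from the Scenario 3 configuration: messageCount 5, floor 1, ceiling 10.
print(desired_replicas(23, 5, 1, 10))    # scaler can read the queue
print(desired_replicas(None, 5, 1, 10))  # scaler cannot authenticate
```

In this model minReplicas only sets the lower bound of the clamp, which is why alternative D does not hold: a floor of 1 never stops the count from rising when the queue is readable.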
Answer Key - Scenario 4
Answer: A
The correct sequence is: T, S, R, Q, P.
The reasoning starts from the observed symptom: high latency during peak. The first step is to reproduce and confirm the symptom (T), avoiding investigating a problem that may no longer be occurring. Next, checking if active replicas are saturated in CPU and memory (S) confirms if the problem is processing capacity. With saturation confirmed, inspecting the scaling configuration (R) reveals if trigger parameters and limits are correct. Checking if the number of replicas is below maxReplicas (Q) determines if there is room for scaling or if the ceiling has been reached. Lastly, investigating application logs (P) covers the hypothesis that slowness has an internal cause, not related to scaling.
Sequence B makes the error of starting with configuration before confirming the symptom and replica state. Sequences C and D place application log investigation before validating infrastructure metrics, which inverts diagnostic priority for a typically operational problem like replica saturation.
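The ordering T, S, R, Q, P can be sketched as a chain of early exits. This is only an illustration: the function name `triage`, its parameters and the wording of each outcome are hypothetical, and the conclusions attached to each branch are one reasonable reading of the explanation above, not an official runbook.

```python
def triage(symptom_reproduced, replicas_saturated, rules_look_correct,
           current_replicas, max_replicas):
    """Walk the T, S, R, Q, P order and return the next action (hypothetical outcomes)."""
    # Step T: confirm the symptom still exists before investigating further.
    if not symptom_reproduced:
        return "T: symptom not reproduced, keep monitoring"
    # Step S: saturation tells us whether capacity is the likely problem.
    if not replicas_saturated:
        return "P: replicas are not saturated, look for an internal cause in the logs"
    # Step R: check limits and trigger rules.
    if not rules_look_correct:
        return "R: scaling rules or limits are misconfigured, correct them"
    # Step Q: is there still room below the ceiling?
    if current_replicas >= max_replicas:
        return "Q: ceiling reached, consider raising maxReplicas"
    # Step P: scaling has room and rules look right, so check the application itself.
    return "P: scaling has headroom, inspect application logs"


print(triage(True, True, True, current_replicas=2, max_replicas=2))
```

Each check runs only after the previous one has confirmed that the investigation is still on the scaling path, which is the property that sequences B, C and D lose.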
Troubleshooting Tree: Manage sizing and scaling for containers, including Azure Container Instances and Azure Container Apps
Color Legend:
| Color | Node Type |
|---|---|
| Dark blue | Initial symptom (entry point) |
| Medium blue | Diagnostic question |
| Red | Identified cause |
| Green | Recommended action or resolution |
| Orange | Intermediate validation or verification |
To use this tree when facing a real problem, start from the root node by identifying the observed symptom: a container terminating in a loop, or scaling that does not happen. Answer each yes/no question with what you can directly observe, whether via az container show, Log Analytics logs, or inspection of the Container App configuration. Each branch eliminates a class of causes until you reach the red node that names the root cause, followed by the corresponding green action. Don't skip intermediate questions: the most common mistake in these scenarios is acting on the most visible symptom before validating the real cause.