Troubleshooting Lab: Provision a container by using Azure Container Instances

Diagnostic Scenarios

Scenario 1 — Root Cause

An operations team deployed a container group in Azure Container Instances using a YAML file. The group contains two containers: a main application and a log collection sidecar. The deployment completed successfully according to the Azure portal, and the container group status appears as Running.

However, the team reports that the application's public endpoint is not responding. When trying to access http://<public-ip>:8080, the connection is refused.

The YAML used in the deployment contains the following relevant section:

ipAddress:
  type: Public
  ports:
    - protocol: TCP
      port: 80
containers:
  - name: app-principal
    image: meuregistry.azurecr.io/webapp:latest
    ports:
      - port: 8080
    resources:
      requests:
        cpu: 1
        memoryInGb: 1
  - name: sidecar-logs
    image: meuregistry.azurecr.io/logcollector:v2
    resources:
      requests:
        cpu: 0.5
        memoryInGb: 0.5

The team confirms that the webapp:latest image exposes the application on port 8080 internally and that the registry credentials were provided correctly. The sidecar is functional and sending logs to Azure Monitor.

What is the root cause of the endpoint inaccessibility?

A) The Public IP type is not compatible with container groups that have more than one container; a Virtual Network is required.

B) The port exposed in ipAddress is 80, but the application container listens on port 8080; external traffic arrives on 80 and has no mapping to the actual application port.

C) The sidecar is consuming part of the CPU and memory quota, causing throttling in the main container and preventing it from accepting connections.

D) The webapp:latest image uses a floating tag, which causes periodic container recreation and interrupts active connections during pull.

Scenario 2 — Action Decision

The cause of a compliance failure has been identified: a production container group is deployed without Azure Virtual Network integration, while a new company security policy requires all production ACI containers to communicate exclusively over a private network, with no public IP exposed.

The current container group processes real-time orders for an e-commerce platform, with peak usage between 6 PM and 10 PM. It's 2 PM on a Friday. The network team confirmed that the subnet delegated to ACI is already provisioned and available. The development team reported that no code changes are necessary, only infrastructure reconfiguration.

Recreating the container group with VNet integration requires destroying and redeploying the resource, as this property cannot be changed on an existing group.

What is the correct action to take at this moment?

A) Immediately recreate the container group with VNet configuration, taking advantage of the 4 hours remaining before peak usage.

B) Remove the public IP from the current container group via Azure portal to immediately comply with the policy, and plan the VNet migration for the next maintenance cycle.

C) Plan the container group recreation with VNet for a low-traffic time outside peak hours, document the temporary non-compliance risk, and communicate to stakeholders.

D) Request a permanent exception to the security policy for this container group, arguing that production recreation represents operational risk.

Scenario 3 — Root Cause

A CI/CD pipeline automatically deploys a container to ACI after each successful build. The pipeline uses the following commands to recreate the container:

az container delete \
  --resource-group rg-staging \
  --name processador-staging \
  --yes

az container create \
  --resource-group rg-staging \
  --name processador-staging \
  --image meuregistry.azurecr.io/processador:latest \
  --cpu 2 \
  --memory 4 \
  --restart-policy Never \
  --environment-variables ENV=staging DB_HOST=db.internal.empresa.com \
  --registry-login-server meuregistry.azurecr.io \
  --registry-username $ACR_USER \
  --registry-password $ACR_PASS

After a recent pipeline execution, the team observes that the container starts normally but terminates in less than 10 seconds without producing any output in the logs. The final recorded status is Terminated with exit code 0.

Name                  State        ExitCode    StartTime                    FinishTime
--------------------  -----------  ----------  ---------------------------  ---------------------------
processador-staging   Terminated   0           2024-11-15T14:03:22+00:00   2024-11-15T14:03:31+00:00

The team verified that the DB_HOST variable is correct and that the database is accessible. The container registry is authenticated and the image was downloaded successfully.

What is the root cause of the observed behavior?

A) The Never policy prevents the container from restarting after completion, but the real problem is that the application terminated with silent failure; the exit code 0 was returned incorrectly by the ACI layer.

B) The ENV=staging variable is being interpreted as a command by the image entrypoint, causing early termination.

C) The main process of the processador:latest image completed its execution normally; exit code 0 with restart-policy Never is the expected behavior for a finite execution task that terminated successfully.

D) The previous az container delete command did not wait for complete resource deletion, causing state conflict during recreation and premature container execution.

Scenario 4 — Diagnostic Sequence

A container group was deployed in ACI and the status displayed in the portal is Waiting. The container never advances to the Running state. The team needs to diagnose the cause.

The following investigation steps are available, out of order:

Step P: Execute az container logs to check if there's output from the main process before the failure.
Step Q: Execute az container show to inspect the current state of the container group and check recent events.
Step R: Verify if the container registry credentials provided in the creation command are correct and if the referenced image exists in the registry.
Step S: Check the container group events for messages like Failed to pull image or ImagePullBackOff.
Step T: Review the requested resources (CPU and memory) and compare with available limits in the region and SKU used.

What is the correct diagnostic sequence?

A) P, Q, R, S, T

B) Q, S, R, T, P

C) R, S, Q, T, P

D) S, Q, R, P, T

Answer Key and Explanations

Answer Key — Scenario 1

Answer: B

The decisive clue is in the YAML itself: the ipAddress section exposes only port 80, but the application container declares and listens on port 8080. In ACI, the ports listed in ipAddress.ports are the only ones opened on the container group's public IP, and ACI does not support port mapping, so a request to port 80 is not translated to 8080. A request to <public-ip>:8080 therefore reaches a port that was never opened, and a request to port 80 finds no process listening. The fix is to list port 8080 in ipAddress.ports so that it matches the container's port.
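
On a deployed group, the mismatch can be confirmed by printing both port lists side by side. The following is a minimal sketch, assuming the Azure CLI is installed and signed in; the function name show_ports and the names rg-lab and my-group in the usage line are hypothetical.

```shell
# Print the ports opened on the public IP and the ports each container declares.
# $1 = resource group, $2 = container group name (hypothetical values in the usage note).
show_ports() {
  echo "Public IP ports:"
  az container show --resource-group "$1" --name "$2" \
    --query "ipAddress.ports[].port" --output tsv
  echo "Container ports:"
  az container show --resource-group "$1" --name "$2" \
    --query "containers[].ports[].port" --output tsv
}
```

Usage: show_ports rg-lab my-group. For the group in Scenario 1, the first list would be expected to show 80 and the second 8080, which makes the gap visible without reading the original YAML.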

The information about the working sidecar and its logs in Azure Monitor is irrelevant to the diagnosis and was included deliberately as a distraction. The sidecar exposes no ports externally, and its operation has no effect on the main container's port exposure.

The distractors exploit the misconception of looking for the cause in more complex places: resource throttling, floating tag behavior, or multi-container architectural restrictions. The most dangerous distractor is A, as it could lead to unnecessary network topology refactoring when the problem is simply a port not declared in the IP block.

Answer Key — Scenario 2

Answer: C

The constraints decide this one. It's 2 PM on a Friday, peak traffic runs from 6 PM to 10 PM, and recreating the container group means destroying the existing resource, which causes downtime. Doing this immediately (alternative A) introduces instability risk 4 hours before peak, without a sufficient validation window.

Alternative B is technically invalid: removing the public IP from an existing container group doesn't satisfy the policy, which requires VNet integration, and furthermore may not be possible without recreating the resource depending on current configuration.

Alternative D gives up on compliance without technical justification, since the delegated subnet is available and the team confirmed that the change is viable.

The correct action is to plan the recreation for low-traffic hours, formally document the temporary non-compliance state, and communicate risks. This simultaneously respects the security policy, business continuity, and process governance.
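
The planned recreation could be prepared along these lines. This is a sketch under stated assumptions, not a definitive procedure: it relies on az container export to save the current definition as YAML and on az container create --file to redeploy it, and it assumes the YAML is edited by hand between the two steps to add the delegated subnet and switch the IP address type to Private. The function names and any argument values are hypothetical. It is worth checking that the exported file still contains everything the group needs, since secret values may not be included.

```shell
# Save the current definition of a container group to a YAML file.
# $1 = resource group, $2 = container group name, $3 = output file.
backup_group() {
  az container export --resource-group "$1" --name "$2" --file "$3"
}

# Delete the group and redeploy it from the edited YAML file.
# Run only inside the agreed low-traffic window: the group is unavailable
# between the delete and the end of the create.
redeploy_group() {
  az container delete --resource-group "$1" --name "$2" --yes &&
    az container create --resource-group "$1" --file "$3"
}
```

Keeping the backup step separate from the redeploy step lets the team export and review the YAML ahead of time, so that only the delete and create run during the maintenance window.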

Answer Key — Scenario 3

Answer: C

Exit code 0 is the central and definitive clue. In the operating system and container model, exit code 0 means successful termination without error. Combined with restart-policy Never, the described behavior is exactly what's expected for a finite processing task: the container executed its workload, completed successfully, and terminated without restarting.
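
The interaction between restart policy and exit code can be written down as a small decision function. This is an illustrative sketch of the three ACI restart policies (Always, OnFailure, Never), not platform code; the function name will_restart is hypothetical.

```shell
# Report whether a container would be restarted, given its restart policy
# and the exit code of its main process.
# $1 = restart policy (Always, OnFailure, Never), $2 = exit code.
will_restart() {
  policy="$1"
  exit_code="$2"
  case "$policy" in
    Always) echo yes ;;   # restarted regardless of exit code
    OnFailure)            # restarted only on a nonzero exit code
      if [ "$exit_code" -ne 0 ]; then echo yes; else echo no; fi ;;
    Never) echo no ;;     # runs at most once
    *) echo "unknown policy: $policy" >&2; return 1 ;;
  esac
}
```

For the pipeline in this scenario, will_restart Never 0 prints no, which matches the observed Terminated state.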

The team probably expected continuous service behavior, but the image was built for one-time execution. The information about DB_HOST being correct and registry authentication working is relevant to confirm there was no infrastructure failure, but doesn't change the diagnosis.

The most dangerous distractor is A, as it introduces the hypothesis that ACI would have masked an error exit code, which doesn't correspond to the platform's behavior. Acting based on this hypothesis would lead investigation in the wrong direction, possibly rebuilding the image or changing the application unnecessarily.
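
To test hypothesis A directly, the recorded state can be read back from the platform instead of inferred. A minimal sketch, assuming the Azure CLI is installed and signed in and that the group has a single container; the function name container_state is hypothetical.

```shell
# Show the recorded state of the first container in a group.
# $1 = resource group, $2 = container group name.
container_state() {
  az container show --resource-group "$1" --name "$2" \
    --query "containers[0].instanceView.currentState.{State:state, ExitCode:exitCode, Start:startTime, Finish:finishTime}" \
    --output table
}
```

Usage: container_state rg-staging processador-staging. Comparing the reported exit code with what the image's entrypoint returns when run locally is one way to confirm that nothing is being masked.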

Answer Key — Scenario 4

Answer: B

The correct sequence is Q, S, R, T, P, and the logic is progression from general to specific.

Q first: inspecting the container group's general state with az container show provides an immediate view of events recorded by the platform, without prior hypotheses.

S next: the container group events directly reveal if there was an image pull failure, which is the most common cause of Waiting state. Checking this before any other hypothesis is efficient.

R after: if events indicate pull failure, verifying credentials and image existence in the registry confirms or rules out this hypothesis precisely.

T next: if the image was downloaded correctly but the container still hasn't started, investigating resource limitations in the region is the next logical step.

P last: az container logs is only useful if the container managed to start and produce output. In Waiting state, this command will probably return nothing relevant and should be reserved for when previous steps don't identify the cause.

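Alternative A makes the classic mistake of checking logs before understanding the system state. Alternative C checks credentials before observing any events, which is blind diagnosis. Alternative D reads specific events before the overall state and runs az container logs before reviewing resources, even though logs are the least likely source of information for a container that never started.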
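
The order Q, S, R, P can be sketched as a single helper. This assumes the Azure CLI is installed and signed in and that the image lives in Azure Container Registry; the function name diagnose_waiting and its argument layout are hypothetical. Step T is left out because it is a manual comparison against documented limits.

```shell
# Walk the diagnostic steps for a container group stuck in Waiting.
# $1 = resource group, $2 = container group name,
# $3 = registry name, $4 = repository name.
diagnose_waiting() {
  # Step Q: overall provisioning state and instance state of the group.
  az container show --resource-group "$1" --name "$2" \
    --query "{Provisioning:provisioningState, State:instanceView.state}" \
    --output table
  # Step S: platform events per container (look for image pull failures).
  az container show --resource-group "$1" --name "$2" \
    --query "containers[].instanceView.events[].{Time:lastTimestamp, Name:name, Message:message}" \
    --output table
  # Step R: confirm that the referenced tag exists in the registry.
  az acr repository show-tags --name "$3" --repository "$4" --output table
  # Step P, last: logs only help if the process ever started.
  az container logs --resource-group "$1" --name "$2"
}
```

Each step narrows the hypothesis before the next one runs, so in practice the walk can stop as soon as one step explains the Waiting state.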

Troubleshooting Tree: Provision a container by using Azure Container Instances

Color Legend:

Color	Node Type
Dark blue (navy)	Initial symptom, entry point
Blue	Diagnostic question or decision point
Orange	Intermediate check or state validation
Red	Identified cause
Green	Recommended action or resolution

When facing a real problem with Azure Container Instances, start at the root node and answer the first observable question: is the container in Waiting, Running but not responding, or terminating immediately? Each answer leads to a specific branch of progressively more precise questions. Follow only the path that corresponds to what you observe, without skipping steps. Orange nodes indicate intermediate checks that confirm or rule out a hypothesis before declaring the cause. When reaching a red node, the cause is identified; the green node immediately below indicates the correct action to execute.
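
The entry decision described above can be sketched as a dispatch on the observed symptom. The branch targets below are drawn from the four scenarios in this lab, not from the diagram itself, and the symptom keywords and the function name first_check are hypothetical.

```shell
# Map an observed symptom to the first check to run.
# $1 = waiting | running-unreachable | terminated
first_check() {
  case "$1" in
    waiting)
      echo "inspect container group events for image pull failures" ;;
    running-unreachable)
      echo "compare ipAddress ports with the container's ports" ;;
    terminated)
      echo "read the exit code and the restart policy" ;;
    *)
      echo "unrecognized symptom: $1" >&2
      return 1 ;;
  esac
}
```

Usage: first_check waiting prints the starting point for the Scenario 4 sequence.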
