Troubleshooting Lab: Create and manage an Azure Container Registry

Diagnostic Scenarios

Scenario 1: Root Cause

A development team reports that images pushed to the registry work normally in the staging environment, but the production environment, located in another Azure region, experiences high latency and occasional pull timeouts. The registry was created three weeks ago in the East US region on the Premium SKU. The team confirms that the registry is publicly accessible and that production uses the same credentials as staging. The network team reports that no firewall rules block communication between the production region and East US.

In the production pipeline logs, the observed error is:

Error response from daemon: Get "https://meuregistry.azurecr.io/v2/": net/http: request canceled while waiting for connection (Client.Timeout exceeded)

What is causing this behavior?

A) The credentials used in production have expired and need to be renewed via az acr credential renew
B) The registry does not have a replica in the production region, forcing all traffic to traverse regions with accumulated latency
C) The Premium SKU has a concurrent request limit that is being reached by the production load
D) The public endpoint of the registry is being blocked by a Microsoft Entra Conditional Access policy applied to the subscription


Scenario 2: Action Decision

The security team has confirmed that the registry's admin account is enabled and that its credentials were compromised. Three production applications authenticate to the registry with the admin account username and password. They run on Azure Container Instances and Azure Kubernetes Service. The identity team confirms that managed identities are already configured on these workloads but have never been used for registry authentication. No maintenance window is scheduled in the next 8 hours.

What is the correct action to take at this moment?

A) Immediately disable the admin account and reconfigure applications to use managed identity, accepting temporary production interruption
B) Rotate the admin account credentials via az acr credential renew and, in parallel, start migration to managed identity without interrupting production
C) Delete the registry and recreate with admin account disabled, restoring images from an external backup
D) Revoke the compromised token in Microsoft Entra ID and wait for the maintenance window for any registry changes


Scenario 3: Root Cause

An engineer executes the following command to check available images in the registry:

az acr repository list --name meuregistry --output table

The command returns an empty list, without errors. Next, the same engineer runs a local build and pushes the image manually:

docker build -t meuregistry.azurecr.io/api:v2 .
docker push meuregistry.azurecr.io/api:v2

The push completes successfully. When the engineer lists the repositories again, the api repository (holding the v2 tag) appears. The engineer confirms that builds run via az acr task in the last 24 hours finished without errors and that the task logs show a Succeeded status. The registry is on the Standard SKU and has no retention policy configured.

What is the root cause of the empty list before the manual push?

A) The Standard SKU does not support the az acr repository list command; it's necessary to use the REST API directly
B) The tasks executed via az acr task published the images to a different registry than expected, incorrectly configured in the task definition
C) The quarantine policy was active and was retaining images before release
D) The az acr repository list command failed silently due to a user permission issue in the repository read layer


Scenario 4: Diagnostic Sequence

A CI/CD pipeline fails when trying to push an image to the registry. The returned error is:

unauthorized: authentication required, visit https://aka.ms/acr/authorization for more information.

The pipeline runs on a hosted agent in Azure DevOps and uses a service connection configured six months ago. Images were pushed successfully until yesterday. No changes were made to the pipeline. The registry is operational and accessible.

The following investigation steps are available, out of order:

  1. Verify if the AcrPush role is still assigned to the service principal used by the service connection
  2. Execute az acr login --name meuregistry on the agent to test interactive authentication
  3. Check the registry audit log to identify which identity attempted to authenticate
  4. Verify the expiration date of the service principal secret associated with the service connection
  5. Confirm if the registry underwent network or firewall configuration changes in the last 24 hours

What is the correct investigation sequence?

A) 5, 3, 1, 4, 2
B) 2, 1, 4, 3, 5
C) 3, 5, 1, 4, 2
D) 1, 2, 3, 4, 5


Answer Key and Explanations

Answer Key: Scenario 1

Answer: B

The central clue is the combination of a Premium registry located in a single region (East US) and a production environment in another region. The Premium SKU offers geo-replication, but it is not enabled automatically; a replica must be created explicitly in the production region. Without it, all pull traffic crosses regions, resulting in high latency and timeouts.
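The check-then-create step for a replica can be sketched with the Azure CLI. This is a minimal sketch, not a definitive procedure: the helper name ensure_replica is hypothetical, and brazilsouth in the example is only a placeholder, since the scenario does not name the production region.

```shell
# Sketch: create a geo-replica only if the target region does not have one yet.
# ensure_replica is a hypothetical helper name; geo-replication needs the Premium SKU.
ensure_replica() {
  local registry="$1" region="$2" existing
  # Regions that already hold a replica of this registry.
  existing="$(az acr replication list --registry "$registry" --query "[].location" --output tsv)" || return 1
  if printf '%s\n' "$existing" | grep -qixF "$region"; then
    echo "replica already present in $region"
  else
    az acr replication create --registry "$registry" --location "$region" >/dev/null || return 1
    echo "replica created in $region"
  fi
}

# Example (the region is an assumption): ensure_replica meuregistry brazilsouth
```

Wrapping the commands in a function keeps the check idempotent, so the same call can run from a pipeline without failing when the replica already exists.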

The note about the absence of firewall rules is there to eliminate a suspect, not to point to the cause. The timeout error indicates a connectivity or distance problem, not an authentication problem.

Alternative A can be discarded because the error is not authentication-related. Alternative C is a plausible distractor, but Premium has the highest throughput limits of the ACR SKUs, and reaching them would surface as throttling (HTTP 429), not as a connection timeout. Alternative D would produce an HTTP 401 authentication error, not a connection timeout.
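The distinction between a timeout and an authentication failure can be checked from the production side with a plain HTTP probe. The sketch below assumes curl is available on the production host; probe_registry is a hypothetical helper name and the 10-second limit is arbitrary. An unauthenticated request to the /v2/ endpoint of a reachable registry is expected to be answered with an authentication challenge (HTTP 401), while curl exit code 28 means the operation timed out.

```shell
# Sketch: tell a network timeout apart from an authentication challenge.
# probe_registry is a hypothetical helper; the 10-second limit is arbitrary.
probe_registry() {
  local login_server="$1" code status=0
  code="$(curl --silent --output /dev/null --max-time 10 \
    --write-out '%{http_code}' "https://$login_server/v2/")" || status=$?
  if [ "$status" -eq 28 ]; then
    echo "timeout: network path or distance problem"
  elif [ "$code" = "401" ]; then
    echo "reachable: endpoint answered with an authentication challenge"
  else
    echo "curl exit $status, HTTP $code"
  fi
}

# Example: probe_registry meuregistry.azurecr.io
```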

The most dangerous distractor is D: an operator under pressure could open tickets with the identity team and lose time investigating Conditional Access policies while the real problem is the replication topology.


Answer Key: Scenario 2

Answer: B

The constraints decide this scenario: there is no maintenance window, the managed identities are configured but have never been tested for registry authentication, and the applications are in production. Disabling the admin account before validating that managed identity authentication works would cause an immediate and potentially prolonged interruption.

The correct action is to rotate the credentials to neutralize the immediate compromise, distribute the new password to the three applications, and, in parallel, migrate to managed identity in a controlled and validated manner before disabling the admin account.
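The order of operations (contain first, migrate next, disable last) can be sketched with the Azure CLI. This is a sketch under assumptions: the function names are hypothetical, and the principal ID passed to grant_acr_pull stands for whichever managed identity each workload uses.

```shell
# Sketch of the containment-then-migration order; function names are hypothetical.

# 1. Regenerate both admin passwords so the leaked values stop working.
rotate_admin_passwords() {
  local registry="$1"
  az acr credential renew --name "$registry" --password-name password  >/dev/null || return 1
  az acr credential renew --name "$registry" --password-name password2 >/dev/null || return 1
  echo "admin passwords rotated for $registry"
}

# 2. Grant pull rights on the registry to a managed identity (by principal ID).
grant_acr_pull() {
  local registry="$1" principal_id="$2" scope
  scope="$(az acr show --name "$registry" --query id --output tsv)" || return 1
  az role assignment create --assignee "$principal_id" --role AcrPull --scope "$scope" >/dev/null || return 1
  echo "AcrPull granted on $scope"
}

# 3. Only after every workload pulls with its identity, turn the admin account off.
disable_admin_account() {
  az acr update --name "$1" --admin-enabled false >/dev/null || return 1
  echo "admin account disabled for $1"
}
```

After step 1 the applications that still use the admin account need the new password, so the rotation and the credential update should be done together. For AKS, az aks update with --attach-acr is an alternative way to grant the cluster pull access.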

Alternative A ignores the critical constraint that managed identities have never been used to authenticate to the registry; assuming they will work without validation in production is a high risk. Alternative C is destructive and unnecessary. Alternative D is incorrect because registry admin account credentials are not tokens managed by Microsoft Entra ID; they are local service credentials, and revoking something in Entra ID does not invalidate them.


Answer Key: Scenario 3

Answer: B

The clue that confirms this cause is the fact that tasks reported Succeeded but no image appeared in the queried registry. If there were a permission problem (alternative D), the command would return an error, not an empty list. If quarantine were active (alternative C), the images would exist in the repository in a retained state, and az acr repository list would list them, since the command lists repositories, not just released images.

The information about Standard SKU and absence of retention policy is irrelevant and was included to divert focus: Standard SKU fully supports az acr repository list, and image retention would not explain the absence of repositories.

The real cause is that the task definition points to a different registry. This is a silent and common configuration error when there are multiple registries in an organization (dev, staging, prod). The most dangerous distractor is D, as it would lead the engineer to investigate permissions and potentially change roles unnecessarily.
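One way to confirm where the task output actually landed is to search every registry in the subscription for the expected repository. The sketch below assumes the signed-in identity can list and read all registries; find_repository is a hypothetical helper name.

```shell
# Sketch: print the registries in the current subscription that contain a repository.
# find_repository is a hypothetical helper name.
find_repository() {
  local repo="$1" registry
  for registry in $(az acr list --query "[].name" --output tsv); do
    # Exact, whole-line match against the repository names of this registry.
    if az acr repository list --name "$registry" --output tsv | grep -qxF "$repo"; then
      echo "$registry"
    fi
  done
}

# Example: find_repository api
```

If the repository shows up under another registry, az acr task show on the task in question can then be used to inspect the definition that produced it.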


Answer Key: Scenario 4

Answer: A

The correct sequence is: 5, 3, 1, 4, 2.

Sound diagnostic reasoning starts with what changed in the environment before investigating the identity. Step 5 quickly rules out network or firewall changes that could explain the failure. Step 3 uses the audit log to identify which identity is trying to authenticate, avoiding assumptions. With the identity confirmed, step 1 verifies whether the role is still assigned. Step 4 checks secret expiration, which is the most likely cause given that no changes were made to the pipeline and it was working until yesterday. Step 2 comes last because testing interactive authentication on the agent has limited value for diagnosing a service principal failure.

Alternative D (1, 2, 3, 4, 5) represents the error of jumping directly to the most familiar hypothesis without validating the context first. Alternative B starts with interactive authentication, which doesn't diagnose the service connection problem. The most likely cause is the expired service principal secret, and the correct sequence reaches this point by progressive elimination.
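The sequence 5, 3, 1, 4, 2 can be sketched as a set of Azure CLI checks. This is a sketch under assumptions: all function names are hypothetical, app_id stands for the application (client) ID of the service principal behind the service connection, and step 3 assumes the registry's diagnostic logs are sent to a Log Analytics workspace whose ID is passed in.

```shell
# Step 5: current network exposure of the registry.
check_network() {
  az acr show --name "$1" \
    --query "{publicNetworkAccess: publicNetworkAccess, defaultAction: networkRuleSet.defaultAction}"
}

# Step 3: recent login attempts (assumes diagnostic logs reach the given workspace).
recent_logins() {
  az monitor log-analytics query --workspace "$1" --analytics-query \
    "ContainerRegistryLoginEvents | where TimeGenerated > ago(1d) | project TimeGenerated, Identity, ResultType, CallerIpAddress"
}

# Step 1: roles the service principal holds on the registry, inherited ones included.
roles_on_registry() {
  local registry="$1" app_id="$2" scope
  scope="$(az acr show --name "$registry" --query id --output tsv)" || return 1
  az role assignment list --assignee "$app_id" --scope "$scope" --include-inherited \
    --query "[].roleDefinitionName" --output tsv
}

# Helper: succeeds when ISO 8601 UTC timestamp $1 is earlier than $2.
is_before() {
  local a b
  a="$(printf '%s' "$1" | cut -c1-19 | tr -cd '0-9')"
  b="$(printf '%s' "$2" | cut -c1-19 | tr -cd '0-9')"
  [ "$a" -lt "$b" ]
}

# Step 4: report client secrets whose end date has already passed.
expired_secrets() {
  local app_id="$1" now end
  now="$(date -u +%Y-%m-%dT%H:%M:%S)"
  az ad app credential list --id "$app_id" --query "[].endDateTime" --output tsv |
    while read -r end; do
      if is_before "$end" "$now"; then echo "expired: $end"; fi
    done
}

# Step 2, kept for last: az acr login --name meuregistry
```

Comparing the timestamps as digit strings avoids depending on a particular date implementation, at the cost of assuming both values are expressed in UTC.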


Troubleshooting Tree: Create and manage an Azure Container Registry

Color Legend:

Color | Node Type
Dark Blue | Initial symptom (entry point)
Blue | Diagnostic question
Red | Identified cause
Green | Recommended action or resolution
Orange | Intermediate validation or verification

To use this tree when facing a real problem, identify the observed symptom and locate the corresponding entry node at the root. From there, answer each diagnostic question based on what is verifiable in the environment: logs, configurations, role assignments, registry state. Follow the path to a red node (identified cause) and, from it, take the recommended action in the corresponding green node. Orange nodes indicate a validation that should be performed before considering the problem closed.