Skip to main content

Theoretical Foundation: Query and Analyze Logs in Azure Monitor


1. Initial Intuition​

Imagine that your company maintains a central file with records of everything that happens: who entered the building, which systems were accessed, what errors occurred, what changes were made. This file grows terabytes per day. To extract value from it, you need a query language that allows you to ask things like: "Show me all access denials from the last 6 hours, grouped by user, ordered by who tried the most times."

Kusto Query Language (KQL) is this query language for logs stored in Azure Monitor Log Analytics. It's a language designed specifically to query large volumes of time-series data quickly and expressively.

The beauty of KQL is that it reads like a chain of transformations: you take a table of data, filter what doesn't matter, select the relevant fields, group, calculate, and present the result. It's intuitive once you understand the pattern table | operator | operator | operator.


2. Context​

2.1 KQL within Azure Monitor​

100%
Scroll para zoom Β· Arraste para mover Β· πŸ“± Pinch para zoom no celular

2.2 Main tables you need to know​

Before writing queries, you need to know where the data is. The most important tables in AZ-104:

TableContent
AzureActivityActivity Log: resource creations, modifications, deletions
AzureDiagnosticsDiagnostic logs from multiple resources (legacy)
StorageBlobLogsBlob Storage operations
AzureMetricsMetrics exported to Log Analytics
SecurityEventWindows security events (Event Log)
SyslogLinux system logs
HeartbeatPeriodic heartbeat from VMs monitored by agent
SigninLogsLogin attempts to Azure AD
AuditLogsAdministrative actions in Azure AD
EventGeneric Windows Event Log events
PerfVM performance counters (CPU, memory, disk)

3. Concept Building​

3.1 KQL mental model: data pipeline​

The central concept of KQL is the pipe operator (|): each operator receives a table as input, applies a transformation, and passes the result to the next operator.

table | operator1 | operator2 | operator3

It's exactly like Unix/Linux pipes: cat file.log | grep ERROR | sort | uniq -c


3.2 Fundamental operators​

where: Filters rows based on a condition. The most used operator.

AzureActivity
| where TimeGenerated > ago(24h)
| where ActivityStatus == "Failed"

project: Selects only the columns you want (like SELECT in SQL).

AzureActivity
| where TimeGenerated > ago(24h)
| project TimeGenerated, Caller, OperationName, ActivityStatus

extend: Adds a new calculated column without removing existing ones.

Perf
| where CounterName == "% Processor Time"
| extend CPUCategory = iff(CounterValue > 80, "High", "Normal")

summarize: Aggregates data, like GROUP BY in SQL. Always used with aggregation functions.

AzureActivity
| where TimeGenerated > ago(7d)
| summarize OperationCount = count() by Caller, ActivityStatus
| order by OperationCount desc

order by / sort by: Orders results.

SecurityEvent
| where EventID == 4625 // Failed logon
| summarize FailedAttempts = count() by Account
| order by FailedAttempts desc
| take 10

take / limit: Returns only the first N results. Useful for quick exploration.

distinct: Returns unique values from a column.

AzureActivity
| distinct Caller

count: Counts the total number of records.

AzureActivity
| where TimeGenerated > ago(24h)
| count

3.3 Time functions: essential for logs​

ago(): Returns a timestamp in the past relative to the current moment.

| where TimeGenerated > ago(1h)    // last hour
| where TimeGenerated > ago(7d) // last 7 days
| where TimeGenerated > ago(30m) // last 30 minutes

bin(): Groups timestamps into regular intervals. Essential for creating time series.

AzureActivity
| where TimeGenerated > ago(24h)
| summarize EventCount = count() by bin(TimeGenerated, 1h)
| order by TimeGenerated asc

startofday(), startofweek(), startofmonth(): Returns the beginning of the period relative to a timestamp.


3.4 String functions​

contains: Checks if a string contains a substring (case-insensitive by default).

| where OperationName contains "delete"

startswith, endswith: Check prefix or suffix.

tolower(), toupper(): Normalize capitalization.

split(): Splits a string into an array by separator.

strcat(): Concatenates strings.

extract(): Extracts a pattern via regex.

| extend IPAddress = extract(@"(\d+\.\d+\.\d+\.\d+)", 1, CallerIpAddress)

3.5 Aggregation functions​

FunctionDescriptionExample
count()Counts recordssummarize Total = count()
sum(field)Sums valuessummarize TotalBytes = sum(ResponseBodySize)
avg(field)Averagesummarize AvgLatency = avg(DurationMs)
max(field)Maximum valuesummarize PeakCPU = max(CounterValue)
min(field)Minimum valuesummarize MinMemory = min(Available_MBytes)
percentile(field, N)N percentilesummarize P95 = percentile(DurationMs, 95)
dcount(field)Distinct countsummarize UniqueUsers = dcount(Caller)
makeset(field)Creates an array of distinct valuessummarize ResourceList = makeset(Resource)

3.6 join: joining tables​

The join combines two tables based on a common field. Similar to SQL JOIN but with specific types:

// Cross-reference application errors with security events in the same period
let errorTimes = AppExceptions
| where TimeGenerated > ago(1h)
| project ErrorTime = TimeGenerated, ProblemId;

SecurityEvent
| where TimeGenerated > ago(1h)
| join kind=inner errorTimes on $left.TimeGenerated == $right.ErrorTime

Most used join types:

  • inner: Only records with match on both sides
  • leftouter: All from left, with right match when exists
  • anti: Records from left that DON'T have match on right

3.7 let: variables and subqueries​

The let defines named variables or subqueries for reuse.

let threshold = 80;
let highCPUVMs = Perf
| where CounterName == "% Processor Time"
| where CounterValue > threshold
| distinct Computer;

Heartbeat
| where Computer in (highCPUVMs)
| summarize LastHeartbeat = max(TimeGenerated) by Computer

3.8 union: combining results from multiple tables​

// See all Storage and SQL errors in the same result
union StorageBlobLogs, AzureDiagnostics
| where TimeGenerated > ago(1h)
| where Level == "Error"
| project TimeGenerated, ResourceType, OperationName, Level
| order by TimeGenerated desc

4. Structural View​

100%
Scroll para zoom Β· Arraste para mover Β· πŸ“± Pinch para zoom no celular

5. Practical Operation​

5.1 Essential queries for AZ-104​

Who deleted resources in the last 24 hours?

AzureActivity
| where TimeGenerated > ago(24h)
| where OperationName contains "delete"
| where ActivityStatus == "Succeeded"
| project TimeGenerated, Caller, OperationName, ResourceGroup, Resource
| order by TimeGenerated desc

Which VMs are without heartbeat (potentially offline)?

Heartbeat
| where TimeGenerated > ago(5m)
| summarize LastHeartbeat = max(TimeGenerated) by Computer
| where LastHeartbeat < ago(5m)
| order by LastHeartbeat asc

Top 10 IPs with Windows login failures:

SecurityEvent
| where EventID == 4625
| where TimeGenerated > ago(24h)
| summarize FailedAttempts = count() by IpAddress
| top 10 by FailedAttempts

VM CPU usage in the last 4 hours (time series):

Perf
| where TimeGenerated > ago(4h)
| where CounterName == "% Processor Time"
| where InstanceName == "_Total"
| summarize AvgCPU = avg(CounterValue) by Computer, bin(TimeGenerated, 15m)
| render timechart

Storage Account errors in the last 6 hours:

StorageBlobLogs
| where TimeGenerated > ago(6h)
| where StatusCode >= 400
| summarize ErrorCount = count() by StatusCode, OperationName
| order by ErrorCount desc

Azure changes (Activity Log) by operation type:

AzureActivity
| where TimeGenerated > ago(7d)
| where ActivityStatus == "Succeeded"
| summarize OperationCount = count() by OperationName
| top 20 by OperationCount
| render barchart

Which users performed most actions in the subscription:

AzureActivity
| where TimeGenerated > ago(30d)
| summarize ActionCount = count() by Caller
| where Caller !contains "microsoft" // exclude internal Microsoft services
| top 10 by ActionCount

5.2 The render operator: inline visualization​

The render transforms query results into a chart directly in Log Analytics Explorer:

// Time series of events per hour
AzureActivity
| where TimeGenerated > ago(24h)
| summarize Count = count() by bin(TimeGenerated, 1h)
| render timechart

// Bar chart by operation type
AzureActivity
| summarize Count = count() by Category
| render barchart

// Pie chart of distribution
AzureActivity
| summarize Count = count() by ActivityStatus
| render piechart

Render types: timechart, barchart, columnchart, piechart, scatterchart, table.


5.3 Non-obvious behaviors​

Case-sensitivity: KQL is case-sensitive by default. "Error" is not equal to "error". Use =~ for case-insensitive comparison:

| where ActivityStatus =~ "succeeded"  // case-insensitive
| where ActivityStatus == "Succeeded" // case-sensitive

Timestamps: The TimeGenerated field is always in UTC. If you're in UTC-3, an event at 2 PM local time appears as 5 PM in KQL.

Dynamic data types: Some fields are of type dynamic (embedded JSON). To access properties:

| extend statusCode = tostring(Properties.statusCode)
| extend vmName = tostring(Properties.entity.name)

6. Implementation Methods​

6.1 Log Analytics Workspace (Azure Portal)​

When to use: Interactive investigation, troubleshooting, data exploration, creating visualizations, saving queries for reuse.

Access: Azure Monitor > Logs or Log Analytics Workspace > Logs

The portal offers:

  • IntelliSense to autocomplete tables and fields
  • History of executed queries
  • Save queries in "Saved Queries" to share with team
  • Pin to dashboard
  • Export results to CSV or JSON

6.2 Azure CLI​

# Execute KQL query via CLI
az monitor log-analytics query \
--workspace <workspace-id> \
--analytics-query "AzureActivity | where TimeGenerated > ago(1h) | take 10" \
--output table

# Query with time parameters
az monitor log-analytics query \
--workspace <workspace-id> \
--analytics-query "AzureActivity | where ActivityStatus == 'Failed' | count" \
--start-time 2025-01-15T00:00:00Z \
--end-time 2025-01-15T23:59:59Z

When to use: Automation scripts, periodic reports, CI/CD pipeline integration.


6.3 Azure PowerShell​

# Execute KQL query
$query = @"
AzureActivity
| where TimeGenerated > ago(24h)
| where ActivityStatus == 'Failed'
| project TimeGenerated, Caller, OperationName
| order by TimeGenerated desc
"@

Invoke-AzOperationalInsightsQuery `
-WorkspaceId <workspace-id> `
-Query $query |
Select-Object -ExpandProperty Results |
Format-Table

6.4 REST API (Azure Monitor Query API)​

For integration with applications and external systems:

curl -X POST \
"https://api.loganalytics.io/v1/workspaces/<workspace-id>/query" \
-H "Authorization: Bearer <token>" \
-H "Content-Type: application/json" \
-d '{
"query": "AzureActivity | where TimeGenerated > ago(1h) | count",
"timespan": "PT1H"
}'

6.5 Azure Monitor Workbooks​

Workbooks are interactive dashboards that combine text, KQL queries, metrics, and visualizations. They're ideal for repeatable operations or security reports.

Access: Azure Monitor > Workbooks > + New

You can add multiple sections, each with its own KQL query and visualization type.


7. Control and Security​

7.1 Log access control​

RoleAccess
Log Analytics ReaderQuery logs, view configurations
Log Analytics ContributorEverything in Reader + create alerts and workbooks
Monitoring ReaderQuery logs and metrics (read-only)
Owner/Contributor on subscriptionFull access

Table-level RBAC: For sensitive data, you can restrict access by table. A user can have workspace access but not be able to query SecurityEvent or SigninLogs.

az monitor log-analytics workspace table update \
--resource-group myRG \
--workspace-name myLAW \
--name SecurityEvent \
--plan Analytics

7.2 Queries that expose sensitive data​

Some tables contain PII (Personally Identifiable Information) or sensitive data:

  • SigninLogs: IP addresses, usernames, user location
  • AuditLogs: User actions including data access
  • SecurityEvent: Usernames and login activity
  • StorageBlobLogs: File names accessed, client IPs

Restrict access to these tables to only teams that need them.


8. Decision Making​

8.1 Which operator to use for each need​

NeedOperatorExample
Filter by conditionwherewhere Level == "Error"
Select columnsprojectproject Time, User, Action
Add calculated columnextendextend Severity = iff(Code>400,"High","Low")
Count and groupsummarize + count()summarize Count = count() by User
Sort resultsorder byorder by Count desc
Limit resultstake / toptop 10 by Count
Unique valuesdistinctdistinct User
Join tablesjoinjoin kind=inner T2 on Field
Union resultsunionunion T1, T2
Create variableletlet threshold = 80;
Visualize chartrenderrender timechart

8.2 Incident investigation strategy​

PhaseRecommended queryObjective
Detectionwhere Level == "Error" with count()Quantify the problem
Time contextbin(TimeGenerated, 5m) with render timechartSee when it started
Identify sourcesummarize count() by Computer/User/ResourceDiscover origin
Detailsproject with specific fields + take 100Examine individual events
Correlationjoin with another tableCross with other events

9. Best Practices​

  • Always filter by TimeGenerated first. Queries without time filter scan entire history and are slow and expensive. Put where TimeGenerated > ago(Xh) as the first line after the table.
  • Use project to reduce columns before summarize. Fewer columns processed = faster queries.
  • Use top N by field instead of order by | take N. top is optimized for this pattern.
  • Save useful queries in "Saved Queries" in the portal to share with the team and reuse during incidents.
  • Use let for complex subqueries and to define thresholds that can be easily adjusted at the beginning of the query.
  • Test queries with take 100 first to validate logic before executing aggregations on large volumes.
  • Use render timechart for any temporal analysis. Patterns become immediately visible in charts that would be invisible in a table of numbers.
  • Document investigation queries in a repository (GitHub, Azure DevOps) to build an operations runbook.

10. Common Errors​

ErrorWhy it happensHow to avoid
Slow query or timeoutNo TimeGenerated filterAlways start with where TimeGenerated > ago(Xh)
Incorrect results due to caseComparison with == on field with variable capitalizationUse =~ for case-insensitive
dynamic field not accessibleTrying to compare JSON field directlyUse tostring(Properties.field) before comparing
summarize removing unexpected columnssummarize only keeps columns listed in byUse project to select before summarize
join returning Cartesian productMany matches on both sidesPre-filter tables before join or use kind=leftouter
ago() returning wrong periodConfusing ago(7d) with "date X"Use datetime(2025-01-15) for absolute dates
Results with UTC vs local timeTimeGenerated in UTCUse datetime_local_to_utc() or adjust in presentation
Correct query but no resultsData still in transit (ingestion latency)Wait 5-10 minutes and retry

11. Operation and Maintenance​

11.1 Log Analytics health queries​

Ingestion volume per table:

Usage
| where TimeGenerated > ago(7d)
| summarize TotalGB = round(sum(Quantity) / 1024, 2) by DataType
| order by TotalGB desc

Ingestion latency:

AzureActivity
| where TimeGenerated > ago(1h)
| extend IngestDelay = ingestion_time() - TimeGenerated
| summarize AvgDelayMin = avg(IngestDelay / 1m)

Slowest executed queries (requires LAW diagnostics enabled):

LAQueryLogs
| where TimeGenerated > ago(1d)
| order by ResponseDurationMs desc
| take 20
| project TimeGenerated, RequestClientApp, ResponseDurationMs, QueryText

11.2 Important limits​

AspectLimit
Maximum query execution time10 minutes
Maximum result size64 MB
Maximum number of rows in result500,000
Maximum time window per queryLimited by workspace retention
Concurrent queries per workspace200
join columns on right sideMaximum 1 million records

12. Integration and Automation​

12.1 Alert Rules based on KQL​

# Create alert based on KQL query
az monitor scheduled-query alert create \
--resource-group myRG \
--name "Failed-Login-Alert" \
--scopes <workspace-resource-id> \
--condition-query "SecurityEvent | where EventID == 4625 | where TimeGenerated > ago(5m) | summarize count() | where count_ > 10" \
--condition-threshold 0 \
--condition-operator "GreaterThan" \
--evaluation-frequency 5m \
--window-size 5m \
--severity 2 \
--action-groups <action-group-id>

12.2 Automated workbook for daily report​

Create Workbooks with queries parameterized by time interval. Configure for automatic publishing or email sending via Logic Apps that execute KQL queries via API.


12.3 Exporting query results via Pipeline​

# Export query result to CSV via CLI
az monitor log-analytics query \
--workspace <workspace-id> \
--analytics-query "AzureActivity | where TimeGenerated > ago(7d) | where ActivityStatus == 'Failed' | project TimeGenerated, Caller, OperationName" \
--output json | \
jq -r '.[] | [.TimeGenerated, .Caller, .OperationName] | @csv' > failed-operations.csv

13. Final Summary​

Essential concepts:

  • KQL uses the pipeline pattern: table | operator | operator. Each operator transforms the output of the previous one.
  • The most important operators are: where (filter), project (select columns), extend (add columns), summarize (aggregate), order by (sort), take/top (limit).
  • Always filter by TimeGenerated as the first operator for performance and cost.
  • render transforms results into inline charts (timechart, barchart, piechart).

Critical differences:

  • == vs =~: == is case-sensitive; =~ is case-insensitive.
  • project vs extend: project selects only the listed columns (removes others). extend adds new columns without removing existing ones.
  • take vs top: take N returns the first N without specific ordering. top N by field returns the N largest/smallest by a field (more efficient than order by | take).
  • count (tabular operator) vs count() (aggregation function): | count counts everything and returns one row. summarize count() by field groups and counts by category.
  • ago() vs datetime(): ago(7d) is relative to current moment. datetime(2025-01-15) is absolute.

What needs to be remembered:

  • The TimeGenerated field is always in UTC.
  • dynamic type fields (JSON) require tostring(field.subfield) before comparing or using.
  • The summarize operator removes all columns that don't appear in the by. Use project beforehand to select what you need.
  • Queries without time filter are slow, expensive, and often timeout.
  • let is powerful for creating named subqueries and defining constants that make query adjustments easier.
  • Log ingestion latency is 2 to 5 minutes. Queries for very recent events may not return results yet.
  • Save useful queries in Saved Queries of the workspace for reuse during incidents, when time is critical.