azure-troubleshoot

from majiayu000

Troubleshoot Azure using tool-first access, falling back to Azure CLI when necessary. Focus on Virtual Machines, AKS, Azure Container Registry, Storage Accounts, and Log Analytics.

Updated Jan 5, 2026

When & Why to Use This Skill

This Claude skill provides a framework for diagnosing and resolving issues in Azure. It takes a tool-first approach: Log Analytics Kusto (KQL) queries retrieve logs and diagnostic data first, and Azure CLI serves as a fallback for deeper inspection. It covers Virtual Machines, AKS, Azure Container Registry, Storage Accounts, and Log Analytics.

Use Cases

  • Diagnosing Virtual Machine boot failures and network misconfigurations by analyzing Heartbeat logs and NIC settings.
  • Troubleshooting AKS cluster issues, including pod scheduling failures, node pressure, and container crashes using KubeEvents and KubePodInventory.
  • Resolving Azure Container Registry (ACR) authentication errors and image pull failures through Activity Log analysis.
  • Identifying Storage Account access bottlenecks related to firewall restrictions or expired SAS tokens using StorageBlobLogs.
  • Performing proactive cloud monitoring and performance tuning by executing scoped KQL queries within Log Analytics.
name: azure-troubleshoot
description: "Troubleshoot Azure using tool-first access, falling back to Azure CLI when necessary. Focus on Virtual Machines, AKS, Azure Container Registry, Storage Accounts, and Log Analytics."

Azure Troubleshooting Skill

General Guidance

Always use tool-based queries first to fetch logs, metrics, and diagnostic data.
Only fall back to Azure CLI for deeper or unsupported inspection.

Investigations should:

  1. Use Log Analytics Kusto queries with proper scoping
  2. Use Activity Logs to identify failures
  3. Use metrics when diagnosing performance issues
  4. Provide minimal, targeted remediation advice
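
When no query tool is available, the Azure CLI fallback mentioned above could be prepared along these lines. This is a minimal sketch, not part of the skill itself: the helper name `log_analytics_cli_command` is hypothetical, the workspace ID is a placeholder, and depending on the installed CLI version the `az monitor log-analytics query` command may need the `log-analytics` extension. The function only builds the argument list; it does not run anything.

```python
import shlex
from typing import List


def log_analytics_cli_command(workspace_id: str, kql: str, timespan: str = "PT1H") -> List[str]:
    """Return the argv for `az monitor log-analytics query` (nothing is executed)."""
    return [
        "az", "monitor", "log-analytics", "query",
        "--workspace", workspace_id,   # workspace (customer) ID, a GUID
        "--analytics-query", kql,      # the KQL text to run
        "--timespan", timespan,        # ISO 8601 duration; PT1H means the last hour
        "--output", "json",
    ]


# Placeholder workspace ID; replace with a real one before running the command.
cmd = log_analytics_cli_command("<workspace-guid>", "Heartbeat | take 5")
print(shlex.join(cmd))
```

Returning a list rather than a single string keeps the KQL text as one argument, so pipes and quotes inside the query are not reinterpreted by the shell.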

Core Services Covered

Virtual Machines

Common issues:

  • Boot failures
  • OS/disk failures
  • NIC/IP misconfiguration

Investigations:

  • Inspect boot diagnostics logs
  • Query Heartbeat table for VM status
  • Check Activity Logs for failed start/stop operations
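
As an illustration of the Heartbeat check above, the following sketch builds a KQL query that reports when each machine last sent a heartbeat. The function name `heartbeat_query`, the `MinutesSilent` column, and the resource ID are illustrative assumptions; the sketch also assumes the VM reports to the workspace through an Azure Monitor agent.

```python
def heartbeat_query(vm_resource_id: str, lookback: str = "1h") -> str:
    """Build a KQL query showing the most recent heartbeat for one VM."""
    return "\n".join([
        "Heartbeat",
        f"| where TimeGenerated > ago({lookback})",         # narrow the time range first
        f'| where _ResourceId =~ "{vm_resource_id}"',       # case-insensitive match on the VM
        "| summarize LastHeartbeat = max(TimeGenerated) by Computer",
        "| extend MinutesSilent = datetime_diff('minute', now(), LastHeartbeat)",
    ])


# Placeholder resource ID for illustration only.
print(heartbeat_query(
    "/subscriptions/<sub-id>/resourceGroups/<rg>/providers/"
    "Microsoft.Compute/virtualMachines/<vm-name>"
))
```

If the query returns no rows for the chosen window, the VM sent no heartbeats in that period, which is itself a signal worth following up with boot diagnostics and the Activity Log.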

AKS

Common issues:

  • Pod scheduling failures
  • Node pressure
  • Image pull errors (ACR auth)
  • Container crashes

Investigations:

  • Query KubeEvents
  • Query KubePodInventory
  • Inspect ContainerLog
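
A starting point for the KubeEvents investigation might look like the sketch below, which groups warning events by reason so scheduling failures, image pull errors, and crash loops surface first. The function name `kube_warning_events_query` is hypothetical, and the sketch assumes Container insights is enabled on the cluster so that `KubeEvents` is populated.

```python
def kube_warning_events_query(cluster_resource_id: str, lookback: str = "1h") -> str:
    """Build a KQL query that counts warning events per reason and object."""
    return "\n".join([
        "KubeEvents",
        f"| where TimeGenerated > ago({lookback})",
        f'| where _ResourceId =~ "{cluster_resource_id}"',
        '| where KubeEventType == "Warning"',
        "| summarize Events = count(), LastSeen = max(TimeGenerated) by Reason, Namespace, Name",
        "| order by Events desc",
    ])


# Placeholder cluster resource ID for illustration only.
print(kube_warning_events_query(
    "/subscriptions/<sub-id>/resourceGroups/<rg>/providers/"
    "Microsoft.ContainerService/managedClusters/<cluster-name>"
))
```

The pods named in the result can then be looked up in KubePodInventory and ContainerLog for status and log detail.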

Azure Container Registry (ACR)

Common issues:

  • Permission denied (RBAC)
  • Token expiration

Investigations:

  • Query Activity Logs for push/write or pull/read failures
  • Check repository event logs
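
The Activity Log check above could be expressed as a query against the `AzureActivity` table, as in this sketch. Several things here are assumptions: the Activity Log must be exported to the workspace, the helper name `acr_failed_activity_query` is hypothetical, the exact spelling of the failure status may vary (both forms are matched), and whether push and pull operations show up at all depends on what the registry emits.

```python
def acr_failed_activity_query(registry_resource_id: str, lookback: str = "24h") -> str:
    """Build a KQL query for failed push/pull operations on one registry."""
    return "\n".join([
        "AzureActivity",
        f"| where TimeGenerated > ago({lookback})",
        f'| where _ResourceId =~ "{registry_resource_id}"',
        # Assumption: the status is spelled "Failure" or "Failed"; match either.
        '| where ActivityStatusValue in~ ("Failure", "Failed")',
        '| where OperationNameValue contains "push/write"'
        ' or OperationNameValue contains "pull/read"',
        "| project TimeGenerated, OperationNameValue, ActivityStatusValue, Caller",
        "| order by TimeGenerated desc",
    ])


# Placeholder registry resource ID for illustration only.
print(acr_failed_activity_query(
    "/subscriptions/<sub-id>/resourceGroups/<rg>/providers/"
    "Microsoft.ContainerRegistry/registries/<registry-name>"
))
```

The `Caller` column identifies the principal behind each failure, which helps separate RBAC problems from expired tokens.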

Storage Accounts

Common issues:

  • Firewall-restricted access
  • SAS token expiration
  • Object not found

Investigations:

  • Query StorageBlobLogs
  • Validate network (firewall) configuration and access permissions
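
To make the StorageBlobLogs step concrete, the sketch below builds a query for denied (403) and not-found (404) requests, which correspond to the firewall, SAS, and missing-object issues listed above. The function name `blob_access_failures_query` is hypothetical. It assumes blob resource logs are sent to the workspace and that they are recorded against the account's blob service sub-resource, which is why the resource ID is matched by prefix rather than exactly.

```python
def blob_access_failures_query(account_resource_id: str, lookback: str = "1h") -> str:
    """Build a KQL query that groups 403 and 404 blob requests by cause."""
    return "\n".join([
        "StorageBlobLogs",
        f"| where TimeGenerated > ago({lookback})",
        # Prefix match: logs are assumed to carry the blob service sub-resource ID.
        f'| where _ResourceId startswith "{account_resource_id}"',
        # tostring() keeps the filter valid whether the column is text or numeric.
        '| where tostring(StatusCode) in ("403", "404")',
        "| summarize Requests = count() by StatusCode, StatusText, OperationName,"
        " AuthenticationType, CallerIpAddress",
        "| order by Requests desc",
    ])


# Placeholder storage account resource ID for illustration only.
print(blob_access_failures_query(
    "/subscriptions/<sub-id>/resourceGroups/<rg>/providers/"
    "Microsoft.Storage/storageAccounts/<account-name>"
))
```

Grouping by `AuthenticationType` and `CallerIpAddress` helps tell a rejected SAS token apart from a caller blocked by network rules.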

Log Analytics

Best practices:

  • Always filter by _ResourceId
  • Narrow time range
  • Query only the tables relevant to the service
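
The three practices above can be folded into one small helper that every query starts from. This is a sketch under stated assumptions: `scoped_query` is a hypothetical name, and the lookback check accepts only simple KQL timespan literals such as `30m`, `1h`, or `2d`.

```python
import re

# Simple KQL timespan literals: digits followed by s, m, h, or d.
_LOOKBACK = re.compile(r"\d+[smhd]")


def scoped_query(table: str, resource_id: str, lookback: str = "1h") -> str:
    """Return a KQL prefix limited to one table, one resource, and a short window."""
    if not _LOOKBACK.fullmatch(lookback):
        raise ValueError(f"unsupported lookback: {lookback!r}")
    # Escape backslashes and double quotes so the ID stays a valid KQL string literal.
    safe_id = resource_id.replace("\\", "\\\\").replace('"', '\\"')
    return "\n".join([
        table,
        f"| where TimeGenerated > ago({lookback})",
        f'| where _ResourceId =~ "{safe_id}"',
    ])


# Service-specific filters are appended after the scoped prefix.
print(scoped_query("Heartbeat", "/subscriptions/<sub-id>/...") + "\n| take 10")
```

Placing the time filter first follows the usual KQL advice to reduce the data set as early as possible.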

Workflow

  1. Identify target service
  2. Query Log Analytics with scoped KQL
  3. Query Activity Logs
  4. Review metrics
  5. Interpret patterns
  6. Recommend targeted fixes
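
The workflow can be sketched as a small planner that expands a service name into ordered steps. The table lists come from the sections above; the service keys and the function name `investigation_plan` are hypothetical, and ACR has no Log Analytics table listed because its investigations above rely on Activity Logs.

```python
from typing import Dict, List

# Tables named in the sections above, keyed by illustrative service labels.
SERVICE_TABLES: Dict[str, List[str]] = {
    "virtual-machines": ["Heartbeat"],
    "aks": ["KubeEvents", "KubePodInventory", "ContainerLog"],
    "acr": [],  # investigated through Activity Logs
    "storage": ["StorageBlobLogs"],
}


def investigation_plan(service: str) -> List[str]:
    """Expand the six-step workflow into concrete steps for one service."""
    tables = SERVICE_TABLES[service]  # KeyError signals an unsupported service
    steps = [f"Identify target service: {service}"]
    steps += [f"Query Log Analytics table {t} with scoped KQL" for t in tables]
    steps += [
        "Query Activity Logs",
        "Review metrics",
        "Interpret patterns",
        "Recommend targeted fixes",
    ]
    return steps


for number, step in enumerate(investigation_plan("aks"), start=1):
    print(f"{number}. {step}")
```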