azure-troubleshoot
Troubleshoot Azure using tool-first access, falling back to Azure CLI when necessary. Focus on Virtual Machines, AKS, Azure Container Registry, Storage Accounts, and Log Analytics.
When & Why to Use This Skill
This Claude skill provides a framework for diagnosing and resolving issues in Azure. It takes a tool-first approach: Log Analytics queries written in Kusto Query Language (KQL) retrieve logs and diagnostic data first, and Azure CLI serves as a fallback for deeper or unsupported inspection. To keep investigations short and minimize downtime, it concentrates on Virtual Machines, AKS, Azure Container Registry, Storage Accounts, and Log Analytics.
Use Cases
- Diagnosing Virtual Machine boot failures and network misconfigurations by analyzing Heartbeat logs and NIC settings.
- Troubleshooting AKS cluster issues, including pod scheduling failures, node pressure, and container crashes using KubeEvents and KubePodInventory.
- Resolving Azure Container Registry (ACR) authentication errors and image pull failures through Activity Log analysis.
- Identifying Storage Account access bottlenecks related to firewall restrictions or expired SAS tokens using StorageBlobLogs.
- Performing proactive cloud monitoring and performance tuning by executing scoped KQL queries within Log Analytics.
Azure Troubleshooting Skill
General Guidance
Always use tool-based queries first to fetch logs, metrics, and diagnostic data.
Only fall back to Azure CLI for deeper or unsupported inspection.
Investigations should:
- Use Log Analytics Kusto queries with proper scoping
- Use Activity Logs to identify failures
- Use metrics when diagnosing performance issues
- Provide minimal, targeted remediation advice
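The fallback path can be sketched as follows. This is a minimal illustration, assuming the fallback runs the same KQL through the `az monitor log-analytics query` command; the helper name `cli_fallback_command` and the workspace GUID are hypothetical, and the command is only built here, not executed.

```python
from typing import List


def cli_fallback_command(workspace_id: str, kql: str, timespan: str = "PT1H") -> List[str]:
    """Build (but do not run) an Azure CLI call that executes a KQL query."""
    # An argument list avoids shell-quoting problems with the pipes and
    # quotes that KQL queries contain.
    return [
        "az", "monitor", "log-analytics", "query",
        "--workspace", workspace_id,   # Log Analytics workspace GUID
        "--analytics-query", kql,      # the same scoped KQL used tool-first
        "--timespan", timespan,        # ISO 8601 duration, PT1H = one hour
        "--output", "json",
    ]


cmd = cli_fallback_command(
    "00000000-0000-0000-0000-000000000000",  # placeholder workspace ID (assumption)
    "Heartbeat | where TimeGenerated > ago(1h) | take 5",
)
print(cmd)
```

On a machine where the Azure CLI is installed and logged in, the list could be passed to `subprocess.run(cmd, capture_output=True, text=True)`.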
Core Services Covered
Virtual Machines
Common issues:
- Boot failures
- OS/disk failures
- NIC/IP misconfiguration
Investigations:
- Inspect boot diagnostics logs
- Query the `Heartbeat` table for VM status
- Check Activity Logs for failed start/stop operations
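As a rough sketch of the VM status check, the helper below builds a KQL query that reports how long a VM has been silent in the `Heartbeat` table. The function name, the 24-hour default window, and the resource ID are illustrative assumptions, not part of the skill.

```python
def vm_heartbeat_query(vm_resource_id: str, lookback: str = "24h") -> str:
    """Return KQL that shows the last heartbeat of one VM and the minutes since."""
    return "\n".join([
        "Heartbeat",
        f"| where TimeGenerated > ago({lookback})",
        # _ResourceId scopes the query to a single VM; =~ ignores case.
        f'| where _ResourceId =~ "{vm_resource_id}"',
        "| summarize LastHeartbeat = max(TimeGenerated) by Computer",
        '| extend MinutesSilent = datetime_diff("minute", now(), LastHeartbeat)',
    ])


# Placeholder resource ID, for illustration only.
vm_id = (
    "/subscriptions/00000000-0000-0000-0000-000000000000/resourceGroups/rg-demo"
    "/providers/Microsoft.Compute/virtualMachines/vm-demo"
)
print(vm_heartbeat_query(vm_id))
```

An empty result is itself a signal: no heartbeat arrived in the window, which points to a stopped VM, a failed boot, or a monitoring agent that is not reporting.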
AKS
Common issues:
- Pod scheduling failures
- Node pressure
- Image pull errors (ACR auth)
- Container crashes
Investigations:
- Query `KubeEvents`
- Query `KubePodInventory`
- Inspect `ContainerLog`
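The three AKS investigations could be expressed as scoped queries along these lines. This is a sketch that assumes the standard Container insights columns (`KubeEventType`, `Reason`, `ContainerRestartCount`, `LogEntry`); the helper name and the cluster resource ID are hypothetical.

```python
from typing import Dict


def aks_queries(cluster_resource_id: str, lookback: str = "1h") -> Dict[str, str]:
    """Return one scoped KQL query per AKS table named in this section."""
    # Shared scoping clause: narrow time window, single cluster.
    scope = (
        f"| where TimeGenerated > ago({lookback})\n"
        f'| where _ResourceId =~ "{cluster_resource_id}"\n'
    )
    return {
        # Scheduling failures and image pull errors surface as Warning events.
        "warning_events": (
            "KubeEvents\n" + scope
            + '| where KubeEventType == "Warning"\n'
            + "| summarize Events = count() by Reason, Namespace, Name\n"
            + "| order by Events desc"
        ),
        # Containers that keep restarting indicate crashes.
        "restarting_containers": (
            "KubePodInventory\n" + scope
            + "| summarize Restarts = max(ContainerRestartCount) by Namespace, Name, ContainerName\n"
            + "| where Restarts > 0\n"
            + "| order by Restarts desc"
        ),
        # Latest raw log lines for manual inspection.
        "recent_container_logs": (
            "ContainerLog\n" + scope
            + "| project TimeGenerated, ContainerID, LogEntrySource, LogEntry\n"
            + "| order by TimeGenerated desc\n"
            + "| take 50"
        ),
    }


# Placeholder cluster ID, for illustration only.
cluster_id = (
    "/subscriptions/00000000-0000-0000-0000-000000000000/resourceGroups/rg-demo"
    "/providers/Microsoft.ContainerService/managedClusters/aks-demo"
)
for name, query in aks_queries(cluster_id).items():
    print(f"-- {name}\n{query}\n")
```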
Azure Container Registry (ACR)
Common issues:
- Permission denied (RBAC)
- Token expiration
Investigations:
- Query Activity Logs for `push/write` or `pull/read` failures
- Check repository event logs
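One way to look for these failures is sketched below. It assumes the Activity Log is exported to the `AzureActivity` table in Log Analytics and that failed operations carry an `ActivityStatusValue` of `Failure`; both are assumptions to verify against the workspace, and the helper name and registry ID are hypothetical.

```python
def acr_activity_failures_query(registry_resource_id: str, lookback: str = "24h") -> str:
    """Return KQL listing failed push/write and pull/read operations on a registry."""
    return "\n".join([
        "AzureActivity",
        f"| where TimeGenerated > ago({lookback})",
        f'| where _ResourceId =~ "{registry_resource_id}"',
        # endswith is case-insensitive, so upper-cased operation names also match.
        '| where OperationNameValue endswith "push/write"'
        ' or OperationNameValue endswith "pull/read"',
        # Assumed status value; adjust if the workspace records a different one.
        '| where ActivityStatusValue =~ "Failure"',
        "| project TimeGenerated, Caller, OperationNameValue, ActivityStatusValue",
        "| order by TimeGenerated desc",
    ])


# Placeholder registry ID, for illustration only.
registry_id = (
    "/subscriptions/00000000-0000-0000-0000-000000000000/resourceGroups/rg-demo"
    "/providers/Microsoft.ContainerRegistry/registries/acrdemo"
)
print(acr_activity_failures_query(registry_id))
```

The `Caller` column shows which identity made the failing request, which helps separate RBAC denials from expired tokens.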
Storage Accounts
Common issues:
- Firewall-restricted access
- SAS token expiration
- Object not found
Investigations:
- Query `StorageBlobLogs`
- Validate configuration and permissions
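The `StorageBlobLogs` check might look like the sketch below, which groups denied and not-found requests so that firewall blocks, expired SAS tokens, and missing objects can be told apart. The helper name and account ID are hypothetical, and the mapping of HTTP 403 and 404 to the issues above is an assumption to confirm against the `StatusText` values returned.

```python
def storage_failure_query(account_resource_id: str, lookback: str = "1h") -> str:
    """Return KQL summarizing 403 and 404 blob requests for one storage account."""
    return "\n".join([
        "StorageBlobLogs",
        f"| where TimeGenerated > ago({lookback})",
        # Blob logs are recorded under the account's blob service child
        # resource, so match on the account ID as a prefix.
        f'| where _ResourceId startswith "{account_resource_id}"',
        # tostring() keeps the filter valid whether the column is text or numeric.
        '| where tostring(StatusCode) in ("403", "404")',
        "| summarize Requests = count() by StatusCode, StatusText, AuthenticationType, CallerIpAddress",
        "| order by Requests desc",
    ])


# Placeholder storage account ID, for illustration only.
account_id = (
    "/subscriptions/00000000-0000-0000-0000-000000000000/resourceGroups/rg-demo"
    "/providers/Microsoft.Storage/storageAccounts/stdemo"
)
print(storage_failure_query(account_id))
```

Grouping by `AuthenticationType` and `CallerIpAddress` is a design choice: SAS problems cluster on one authentication type, while firewall rejections cluster on caller addresses outside the allowed ranges.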
Log Analytics
Best practices:
- Always filter by `_ResourceId`
- Narrow the time range
- Query only the tables relevant to the service
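These three practices can be combined in one small helper. The following is a minimal sketch, not a prescribed implementation; the name `scoped_query` and its parameters are illustrative.

```python
def scoped_query(table: str, resource_id: str, lookback: str = "1h") -> str:
    """Return KQL limited to one table, one resource, and one time window."""
    # Escape characters that would otherwise end the KQL string literal early.
    safe_id = resource_id.replace("\\", "\\\\").replace('"', '\\"')
    return "\n".join([
        table,                                        # only the relevant table
        f"| where TimeGenerated > ago({lookback})",   # narrow time range
        f'| where _ResourceId =~ "{safe_id}"',        # filter by _ResourceId
    ])


# Placeholder resource ID, for illustration only.
print(scoped_query(
    "Heartbeat",
    "/subscriptions/00000000-0000-0000-0000-000000000000/resourceGroups/rg-demo"
    "/providers/Microsoft.Compute/virtualMachines/vm-demo",
    lookback="30m",
))
```

Placing the time filter first is deliberate: it lets the query engine discard most rows before any other condition is evaluated.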
Workflow
- Identify target service
- Query Log Analytics with scoped KQL
- Query Activity Logs
- Review metrics
- Interpret patterns
- Recommend targeted fixes
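The workflow above can be sketched as an ordered plan per service. The table mapping follows the service sections of this document; the function name and service keys are hypothetical, and ACR has no entry because no Log Analytics table is named for it here.

```python
from typing import Dict, List

# Log Analytics tables named in this document for each covered service.
SERVICE_TABLES: Dict[str, List[str]] = {
    "vm": ["Heartbeat"],
    "aks": ["KubeEvents", "KubePodInventory", "ContainerLog"],
    "acr": [],  # investigated through Activity Logs and repository event logs
    "storage": ["StorageBlobLogs"],
}


def investigation_plan(service: str) -> List[str]:
    """Return the workflow steps, in order, for one covered service."""
    if service not in SERVICE_TABLES:
        raise ValueError(f"unsupported service: {service}")
    steps = [f"Identify target service: {service}"]
    # One scoped query per relevant table, and no others.
    steps += [
        f"Query Log Analytics table {table} with scoped KQL"
        for table in SERVICE_TABLES[service]
    ]
    steps += [
        "Query Activity Logs",
        "Review metrics",
        "Interpret patterns",
        "Recommend targeted fixes",
    ]
    return steps


for number, step in enumerate(investigation_plan("aks"), start=1):
    print(f"{number}. {step}")
```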