grafana
Grafana, Loki, and Prometheus operations for the fzymgc-house Kubernetes cluster. Provides unified access to the observability stack via on-demand MCP invocation.
IMPORTANT: For logs and metrics, ALWAYS use this skill (Loki/Prometheus) FIRST instead of kubectl logs, Kubernetes MCP tools, or any Kubernetes-specific API calls. Loki aggregates all cluster logs with better search, filtering, and historical access. Prometheus provides proper metrics with time-series queries.
Use when working with:
- (1) Dashboards - Grafana dashboard search, view, create, update panels/queries
- (2) Metrics - Prometheus PromQL queries, label/metric exploration, instant and range queries
- (3) Logs - Loki LogQL queries, log pattern analysis, recent log viewing
- (4) Alerting - Grafana alert rules and contact points
- (5) Incidents - Grafana Incident management, Sift AI-powered investigations
- (6) OnCall - Grafana OnCall schedules, shifts, who's on-call
- (7) Profiling - Pyroscope CPU/memory profiles
Invokes Grafana MCP server on-demand witho
When & Why to Use This Skill
This Claude skill provides a unified observability interface for Kubernetes clusters by integrating Grafana, Loki, and Prometheus. It enables advanced log aggregation, time-series metric analysis, and dashboard management, offering a more efficient alternative to manual kubectl commands for system health monitoring and troubleshooting.
Use Cases
- Log Analysis & Troubleshooting: Rapidly search and filter aggregated cluster logs using Loki to identify error patterns and perform historical root cause analysis.
- Performance Monitoring: Execute PromQL queries to monitor real-time system metrics, track resource usage, and visualize performance trends across the Kubernetes environment.
- Incident Response & Management: Streamline the incident lifecycle by managing Grafana alerts, coordinating on-call schedules, and utilizing AI-powered investigations.
- Dashboard Orchestration: Search for, view, and update Grafana dashboards and panels to maintain high-level visibility into service health.
- Resource Profiling: Analyze CPU and memory profiles using Pyroscope to detect bottlenecks and optimize application performance within the cluster.
Grafana Operations
⚠️ ALWAYS USE LOKI/PROMETHEUS FIRST
When investigating logs or metrics, DO NOT use kubectl logs, Kubernetes MCP tools, or direct Kubernetes API calls. Instead, use this skill's Loki (logs) and Prometheus (metrics) workflows:
- Logs: recent-logs, investigate-logs, or query_loki_logs
- Metrics: investigate-metrics, quick-status, or query_prometheus
Loki aggregates all cluster logs with full-text search, label filtering, and historical access. Prometheus provides proper time-series metrics with PromQL queries.
Gateway Script
All operations use the gateway script at ${CLAUDE_PLUGIN_ROOT}/skills/grafana/scripts/grafana_mcp.py.
Commands
# Discovery
${CLAUDE_PLUGIN_ROOT}/skills/grafana/scripts/grafana_mcp.py list-tools
${CLAUDE_PLUGIN_ROOT}/skills/grafana/scripts/grafana_mcp.py describe <tool_name>
# Tool invocation (raw MCP tools use JSON)
${CLAUDE_PLUGIN_ROOT}/skills/grafana/scripts/grafana_mcp.py <tool_name> '<json_arguments>'
# Compound workflows (recommended - use CLI flags)
${CLAUDE_PLUGIN_ROOT}/skills/grafana/scripts/grafana_mcp.py investigate-logs --app nginx --time-range 1h
${CLAUDE_PLUGIN_ROOT}/skills/grafana/scripts/grafana_mcp.py investigate-metrics --job api --metric http_requests_total
${CLAUDE_PLUGIN_ROOT}/skills/grafana/scripts/grafana_mcp.py quick-status
${CLAUDE_PLUGIN_ROOT}/skills/grafana/scripts/grafana_mcp.py find-dashboard "api latency"
${CLAUDE_PLUGIN_ROOT}/skills/grafana/scripts/grafana_mcp.py recent-logs --minutes 5 --app nginx
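When the raw tool form above is driven from a script, building the JSON argument with a serializer avoids hand-escaping quotes. The following is a minimal Python sketch, not part of the skill itself: the helper names `gateway_path` and `raw_tool_argv` are hypothetical, and the script path simply mirrors the one shown in this document.

```python
import json
import os
from pathlib import Path


def gateway_path() -> Path:
    """Locate the gateway script under CLAUDE_PLUGIN_ROOT (layout as shown in this document)."""
    root = os.environ.get("CLAUDE_PLUGIN_ROOT", ".")
    return Path(root) / "skills" / "grafana" / "scripts" / "grafana_mcp.py"


def raw_tool_argv(tool_name: str, arguments: dict) -> list[str]:
    """Build the argv for a raw MCP tool call: <script> <tool_name> <json_arguments>."""
    # json.dumps takes care of quoting and escaping inside the JSON argument
    payload = json.dumps(arguments, separators=(",", ":"))
    return [str(gateway_path()), tool_name, payload]


if __name__ == "__main__":
    # Equivalent to: grafana_mcp.py list_datasources '{"type":"loki"}'
    print(raw_tool_argv("list_datasources", {"type": "loki"}))
```

Passing the resulting list to `subprocess.run` without `shell=True` keeps the JSON intact as a single argument.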
Output Options
--format yaml # YAML output (default)
--format json # Compact JSON
--format compact # Minimal output
--brief # Essential fields only
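For scripted use, the JSON format is the easiest one to parse. The sketch below is an illustration under assumptions: `run_json` is a hypothetical helper, and it assumes the gateway accepts `--format json` after the command's own arguments, which this document does not confirm. The shape of the decoded result depends on the tool being called.

```python
import json
import subprocess
from typing import Callable, Sequence


def run_json(
    argv: Sequence[str],
    runner: Callable[..., subprocess.CompletedProcess] = subprocess.run,
):
    """Run a gateway command with --format json and decode its stdout."""
    # Assumption: the --format flag may be appended after the command arguments
    command = [*argv, "--format", "json"]
    # check=True raises CalledProcessError if the gateway exits non-zero
    result = runner(command, capture_output=True, text=True, check=True)
    return json.loads(result.stdout)
```

The `runner` parameter exists only so the helper can be exercised without a live Grafana instance; in normal use the default `subprocess.run` is enough.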
Quick Reference
| Task | Start With |
|---|---|
| Investigate issue | Investigate an Issue |
| Explore data | Explore Available Data |
| Manage dashboards | Manage Dashboards |
| Set up alerting | Set Up Alerting |
| Handle incidents | Handle Incidents |
| Check on-call | Check On-Call |
Compound Workflows
PREFER these over raw MCP tools - they handle datasource discovery, time formatting, and multi-step operations automatically. Only use raw tools (e.g., query_loki_logs, query_prometheus) when workflows don't meet your specific needs:
investigate-logs
Find errors in Loki logs for an application:
${CLAUDE_PLUGIN_ROOT}/skills/grafana/scripts/grafana_mcp.py investigate-logs --app nginx --time-range 1h --pattern error
Options: --app, --namespace, --time-range (default: 1h), --pattern
investigate-metrics
Check Prometheus metric health:
${CLAUDE_PLUGIN_ROOT}/skills/grafana/scripts/grafana_mcp.py investigate-metrics --job api --metric http_requests_total
Options: --job, --metric, --time-range (default: 1h)
quick-status
System health overview from Prometheus/Loki:
${CLAUDE_PLUGIN_ROOT}/skills/grafana/scripts/grafana_mcp.py quick-status
find-dashboard
Search Grafana dashboards:
${CLAUDE_PLUGIN_ROOT}/skills/grafana/scripts/grafana_mcp.py find-dashboard "api latency"
recent-logs
View recent Loki logs (cluster-wide or filtered):
# Last 5 minutes of all cluster logs
${CLAUDE_PLUGIN_ROOT}/skills/grafana/scripts/grafana_mcp.py recent-logs
# Last 10 minutes for a specific app (by app.kubernetes.io/name)
${CLAUDE_PLUGIN_ROOT}/skills/grafana/scripts/grafana_mcp.py recent-logs --minutes 10 --app nginx
# Filter by namespace
${CLAUDE_PLUGIN_ROOT}/skills/grafana/scripts/grafana_mcp.py recent-logs --minutes 5 --namespace monitoring
# Arbitrary label filters (repeatable)
${CLAUDE_PLUGIN_ROOT}/skills/grafana/scripts/grafana_mcp.py recent-logs --minutes 5 --label pod=nginx-abc123
# Combine filters with line pattern matching
${CLAUDE_PLUGIN_ROOT}/skills/grafana/scripts/grafana_mcp.py recent-logs --minutes 5 --app api --filter error --limit 100
Options: --minutes (default: 5), --app, --namespace, --label KEY=VALUE (repeatable), --filter, --limit (default: 50)
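The options listed above map naturally onto a small argument builder. This is a sketch only: `recent_logs_args` is a hypothetical helper, and it uses just the flags and defaults documented for recent-logs, with `--label` repeated once per key/value pair.

```python
from typing import Mapping, Optional


def recent_logs_args(
    minutes: int = 5,
    app: Optional[str] = None,
    namespace: Optional[str] = None,
    labels: Optional[Mapping[str, str]] = None,
    line_filter: Optional[str] = None,
    limit: int = 50,
) -> list[str]:
    """Build the recent-logs sub-command and flags (append to the gateway script path)."""
    args = ["recent-logs", "--minutes", str(minutes)]
    if app:
        args += ["--app", app]
    if namespace:
        args += ["--namespace", namespace]
    # --label is repeatable: emit one KEY=VALUE pair per label
    for key, value in (labels or {}).items():
        args += ["--label", f"{key}={value}"]
    if line_filter:
        args += ["--filter", line_filter]
    args += ["--limit", str(limit)]
    return args


if __name__ == "__main__":
    # Mirrors: recent-logs --minutes 5 --app api --filter error --limit 100
    print(recent_logs_args(minutes=5, app="api", line_filter="error", limit=100))
```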
Core Workflows
Investigate an Issue
Find relevant datasources
${CLAUDE_PLUGIN_ROOT}/skills/grafana/scripts/grafana_mcp.py list_datasources '{"type":"loki"}'
Check log patterns (Loki)
${CLAUDE_PLUGIN_ROOT}/skills/grafana/scripts/grafana_mcp.py query_loki_stats '{"datasourceUid":"...","logql":"{app=\"...\"}"}'
${CLAUDE_PLUGIN_ROOT}/skills/grafana/scripts/grafana_mcp.py query_loki_logs '{"datasourceUid":"...","logql":"{app=\"...\"} |= \"error\"","limit":20}'
Check metrics (Prometheus)
${CLAUDE_PLUGIN_ROOT}/skills/grafana/scripts/grafana_mcp.py query_prometheus '{"datasourceUid":"...","expr":"rate(errors[5m])","startTime":"now-1h","queryType":"range","stepSeconds":60}'
Use Sift for AI analysis
${CLAUDE_PLUGIN_ROOT}/skills/grafana/scripts/grafana_mcp.py find_error_pattern_logs '{"name":"Investigation","labels":{"service":"..."}}'
${CLAUDE_PLUGIN_ROOT}/skills/grafana/scripts/grafana_mcp.py find_slow_requests '{"name":"Latency check","labels":{"service":"..."}}'
For detailed query syntax: loki.md, prometheus.md
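The raw Loki and Prometheus calls in this workflow nest a quoted LogQL or PromQL string inside a JSON argument, which is easy to get wrong by hand. A minimal sketch of building both payloads follows. The function names are hypothetical; the field names (`datasourceUid`, `logql`, `limit`, `expr`, `startTime`, `queryType`, `stepSeconds`) are the ones used in the commands of this section.

```python
import json


def logql_quote(value: str) -> str:
    """Quote a value as a LogQL string literal, escaping backslashes and double quotes."""
    return '"' + value.replace("\\", "\\\\").replace('"', '\\"') + '"'


def loki_logs_payload(datasource_uid: str, app: str, needle: str, limit: int = 20) -> str:
    """JSON argument for query_loki_logs: select one app, keep lines containing needle."""
    logql = f"{{app={logql_quote(app)}}} |= {logql_quote(needle)}"
    return json.dumps({"datasourceUid": datasource_uid, "logql": logql, "limit": limit})


def prometheus_range_payload(
    datasource_uid: str, expr: str, lookback: str = "1h", step_seconds: int = 60
) -> str:
    """JSON argument for a query_prometheus range query over the last `lookback`."""
    return json.dumps(
        {
            "datasourceUid": datasource_uid,
            "expr": expr,
            "startTime": f"now-{lookback}",
            "queryType": "range",
            "stepSeconds": step_seconds,
        }
    )
```

With the defaults, a one-hour range at a 60-second step asks for roughly 60 samples per series, which is usually enough to see a trend without flooding the output.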
Explore Available Data
List datasources
${CLAUDE_PLUGIN_ROOT}/skills/grafana/scripts/grafana_mcp.py list_datasources '{}'
Discover labels/metrics
# Prometheus
${CLAUDE_PLUGIN_ROOT}/skills/grafana/scripts/grafana_mcp.py list_prometheus_label_names '{"datasourceUid":"..."}'
${CLAUDE_PLUGIN_ROOT}/skills/grafana/scripts/grafana_mcp.py list_prometheus_metric_names '{"datasourceUid":"..."}'
# Loki
${CLAUDE_PLUGIN_ROOT}/skills/grafana/scripts/grafana_mcp.py list_loki_label_names '{"datasourceUid":"..."}'
${CLAUDE_PLUGIN_ROOT}/skills/grafana/scripts/grafana_mcp.py list_loki_label_values '{"datasourceUid":"...","labelName":"app"}'
Find existing dashboards
${CLAUDE_PLUGIN_ROOT}/skills/grafana/scripts/grafana_mcp.py search_dashboards '{"query":"..."}'
${CLAUDE_PLUGIN_ROOT}/skills/grafana/scripts/grafana_mcp.py get_dashboard_summary '{"uid":"..."}'
Manage Dashboards
Find dashboard
${CLAUDE_PLUGIN_ROOT}/skills/grafana/scripts/grafana_mcp.py search_dashboards '{"query":"..."}'
Understand structure
${CLAUDE_PLUGIN_ROOT}/skills/grafana/scripts/grafana_mcp.py get_dashboard_summary '{"uid":"..."}'
${CLAUDE_PLUGIN_ROOT}/skills/grafana/scripts/grafana_mcp.py get_dashboard_panel_queries '{"uid":"..."}'
Modify with patches
${CLAUDE_PLUGIN_ROOT}/skills/grafana/scripts/grafana_mcp.py update_dashboard '{"uid":"...","operations":[...],"message":"..."}'
For full operations: dashboards.md
Set Up Alerting
Review existing rules
${CLAUDE_PLUGIN_ROOT}/skills/grafana/scripts/grafana_mcp.py list_alert_rules '{"limit":20}'
${CLAUDE_PLUGIN_ROOT}/skills/grafana/scripts/grafana_mcp.py list_contact_points '{}'
Create new rule - use describe create_alert_rule to see required parameters
For alert configuration: alerting.md
Handle Incidents
Check active incidents
${CLAUDE_PLUGIN_ROOT}/skills/grafana/scripts/grafana_mcp.py list_incidents '{"status":"active"}'
Create incident (notifies people - confirm first)
${CLAUDE_PLUGIN_ROOT}/skills/grafana/scripts/grafana_mcp.py create_incident '{"title":"...","severity":"...","roomPrefix":"inc"}'
Add investigation notes
${CLAUDE_PLUGIN_ROOT}/skills/grafana/scripts/grafana_mcp.py add_activity_to_incident '{"incidentId":"...","body":"Findings..."}'
For incident management: incidents.md
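Because creating an incident notifies people, automation around create_incident benefits from an explicit confirmation gate. The sketch below shows one way to enforce that. Both function names are hypothetical, and the payload fields are the ones used in the create_incident command of this section; valid severity values are not listed in this document and must come from your Grafana Incident setup.

```python
import json
from typing import Callable, Optional


def create_incident_args(
    title: str,
    severity: str,
    confirm: Callable[[str], bool],
    room_prefix: str = "inc",
) -> Optional[list[str]]:
    """Return the create_incident tool arguments only after explicit confirmation."""
    question = f"Create incident '{title}' (severity {severity})? This notifies people."
    if not confirm(question):
        return None  # caller must not invoke the gateway
    payload = {"title": title, "severity": severity, "roomPrefix": room_prefix}
    return ["create_incident", json.dumps(payload)]


def ask_on_terminal(question: str) -> bool:
    """Interactive confirmation; anything other than 'y' counts as a refusal."""
    return input(f"{question} [y/N] ").strip().lower() == "y"
```

Passing `ask_on_terminal` as `confirm` gives an interactive prompt; a non-interactive caller can supply its own callback that records who approved the action.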
Check On-Call
Find who's on-call
${CLAUDE_PLUGIN_ROOT}/skills/grafana/scripts/grafana_mcp.py list_oncall_schedules '{}'
${CLAUDE_PLUGIN_ROOT}/skills/grafana/scripts/grafana_mcp.py get_current_oncall_users '{"scheduleId":"..."}'
Review alert groups
${CLAUDE_PLUGIN_ROOT}/skills/grafana/scripts/grafana_mcp.py list_alert_groups '{"state":"new"}'
For on-call operations: oncall.md
Tool Discovery
When unsure about tool parameters:
# List all available tools
${CLAUDE_PLUGIN_ROOT}/skills/grafana/scripts/grafana_mcp.py list-tools
# Get tool schema and description
${CLAUDE_PLUGIN_ROOT}/skills/grafana/scripts/grafana_mcp.py describe <tool_name>
Domain References
Load these as needed for detailed operations:
- dashboards.md - Dashboard CRUD, panel queries, deeplinks
- prometheus.md - PromQL queries, metrics exploration
- loki.md - LogQL queries, log analysis
- alerting.md - Alert rules, contact points
- incidents.md - Incident management, Sift investigations
- oncall.md - Schedules, shifts, users
- pyroscope.md - CPU/memory profiling
Best Practices
- Prefer workflows over raw tools: Use recent-logs instead of manual query_loki_logs, investigate-logs instead of hand-crafting Loki queries, etc. Workflows handle datasource discovery, time formatting, and label normalization automatically
- Use describe before calling unfamiliar raw tools to see required parameters
- Query stats before logs: Use query_loki_stats to check volume before query_loki_logs
- Use dashboard summary: Prefer get_dashboard_summary over full get_dashboard_by_uid
- Patch, don't replace: Use update_dashboard with operations for targeted changes
- Confirm incident creation: Creating incidents notifies people - always confirm first
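The "query stats before logs" practice can be turned into a simple decision step. The sketch below is hypothetical throughout: `plan_log_query` is not part of the skill, the thresholds are arbitrary, and it assumes the query_loki_stats result exposes an `entries` count, which should be verified against the actual tool output before relying on it.

```python
def plan_log_query(stats: dict, max_lines: int = 100, narrow_above: int = 100_000) -> dict:
    """Decide whether to run query_loki_logs based on a stats result."""
    # Assumption: the stats result carries an "entries" field with the matching line count
    entries = int(stats.get("entries", 0))
    if entries == 0:
        return {"run": False, "reason": "no matching log lines"}
    if entries > narrow_above:
        return {"run": False, "reason": "too many lines; narrow the selector or time range"}
    # Never request more lines than exist, and cap at max_lines
    return {"run": True, "limit": min(entries, max_lines)}
```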