application-metrics
Guide for instrumenting applications with metrics. Use when adding observability, monitoring, metrics, counters, gauges, or instrumentation to code. Covers API endpoints, databases, queues, caching, and locks.
When & Why to Use This Skill
This Claude skill provides a comprehensive framework for instrumenting applications with robust observability metrics. It guides developers through implementing operational counters, resource utilization gauges, and performance latency tracking across critical infrastructure components like APIs, databases, and message queues. By following standardized naming conventions and avoiding common anti-patterns like high cardinality, it ensures that system health and business logic are transparent and actionable for SRE and DevOps teams.
Use Cases
- API Performance Optimization: Implementing request tracking, error rate monitoring, and latency percentiles (p95, p99) to identify and resolve REST/GraphQL bottlenecks.
- Database Health Monitoring: Tracking connection pool utilization, slow query counts, and transaction rollback rates to prevent database-related application failures.
- Distributed System Reliability: Monitoring message queue depths, consumer lag, and dead-letter queue sizes to ensure asynchronous data processing remains healthy.
- Standardizing Observability: Establishing a unified metric naming convention across microservices to improve dashboard consistency and alerting accuracy.
- Resource & Cache Management: Measuring cache hit/miss ratios and memory eviction rates to optimize infrastructure costs and application speed.
| name | application-metrics |
|---|---|
| description | Guide for instrumenting applications with metrics. Use when adding observability, monitoring, metrics, counters, gauges, or instrumentation to code. Covers API endpoints, databases, queues, caching, and locks. |
| allowed-tools | Read, Grep, Edit, Write |
Application Metrics Instrumentation
Practical patterns for adding observability to applications.
Five Metric Types
| Type | Purpose | Example |
|---|---|---|
| Operational Counters | Track discrete events (success/failure) | api.requests.success_total |
| Resource Utilization | Current capacity usage (gauges) | db.connections.active |
| Performance/Latency | Speed with explicit units | api.request.duration_ms |
| Data Volume | Information flow rates | queue.messages.bytes_total |
| Business Logic | Domain-specific value | orders.completed_total |
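The first three types map onto the counter, gauge, and timing primitives that most metrics libraries expose. As a minimal, library-agnostic sketch (the `Metrics` class and metric names here are illustrative, not any particular client's API):

```python
from collections import defaultdict

class Metrics:
    """Toy in-memory registry illustrating the three core primitives."""
    def __init__(self):
        self.counters = defaultdict(int)  # operational counters: only go up
        self.gauges = {}                  # resource utilization: set to current value
        self.timings = defaultdict(list)  # performance/latency samples, in ms

    def incr(self, name, n=1):
        self.counters[name] += n

    def set_gauge(self, name, value):
        self.gauges[name] = value

    def observe_ms(self, name, value_ms):
        self.timings[name].append(value_ms)

metrics = Metrics()
metrics.incr("api.requests.success_total")
metrics.set_gauge("db.connections.active", 12)
metrics.observe_ms("api.request.duration_ms", 42.5)
```

Data-volume and business-logic metrics reuse the same primitives; only the semantics of the name differ.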
Naming Convention
<system>.<component>.<operation>.<metric_type>
Examples:
- myapp.api.users.requests_total
- myapp.db.queries.duration_ms
- myapp.cache.items.hit_total
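Centralizing name construction in one helper keeps the convention consistent across a codebase. A sketch (the helper name is an assumption, not from any library):

```python
def metric_name(system, component, operation, metric_type):
    """Build a metric name following <system>.<component>.<operation>.<metric_type>."""
    return ".".join([system, component, operation, metric_type])

print(metric_name("myapp", "api", "users", "requests_total"))
# → myapp.api.users.requests_total
```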
Component Checklists
API Endpoints
- Request count by endpoint and method
- Response time (p50, p95, p99)
- Error rate by status code
- Authentication failures
- Request/response payload sizes
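A decorator is a convenient place to capture most of this checklist in one spot. A minimal sketch, assuming handlers return their HTTP status code and using tuple keys as low-cardinality labels (names are illustrative):

```python
import time
from collections import defaultdict

# (endpoint, method, status) tuples act as bounded label sets
request_counts = defaultdict(int)
durations_ms = defaultdict(list)

def instrument(endpoint, method):
    """Record request count by endpoint/method/status and latency in ms."""
    def decorator(handler):
        def wrapper(*args, **kwargs):
            start = time.perf_counter()
            status = 500  # recorded if the handler raises
            try:
                status = handler(*args, **kwargs)
                return status
            finally:
                elapsed = (time.perf_counter() - start) * 1000
                request_counts[(endpoint, method, status)] += 1
                durations_ms[(endpoint, method)].append(elapsed)
        return wrapper
    return decorator

def percentile(samples, p):
    """Nearest-rank percentile over recorded latency samples."""
    s = sorted(samples)
    return s[min(len(s) - 1, int(p / 100 * len(s)))]

@instrument("users", "GET")
def get_users():
    return 200  # this sketch's handlers return their HTTP status

get_users()
```

In production the percentile computation usually happens in the metrics backend over histogram buckets, not in application code; the function above just shows what p95/p99 mean over raw samples.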
Database
- Connection pool (active, idle, waiting)
- Query duration by operation type
- Slow query count (threshold-based)
- Error count by type (timeout, constraint, connection)
- Transaction commit/rollback rates
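Pool state fits gauges (a snapshot of current capacity), while query timing and slow-query detection fit counters. A sketch with an assumed 100 ms slow-query threshold (tune per workload; the metric names are illustrative):

```python
import time
from collections import defaultdict

SLOW_QUERY_MS = 100  # assumed threshold; tune per workload
counters = defaultdict(int)
gauges = {}

def record_pool(active, idle, waiting):
    """Gauge snapshot of the connection pool."""
    gauges["myapp.db.pool.active"] = active
    gauges["myapp.db.pool.idle"] = idle
    gauges["myapp.db.pool.waiting"] = waiting

def timed_query(op, run):
    """Time a query callable, counting errors and slow queries."""
    start = time.perf_counter()
    try:
        return run()
    except Exception:
        counters[f"myapp.db.{op}.errors_total"] += 1
        raise
    finally:
        elapsed = (time.perf_counter() - start) * 1000
        counters[f"myapp.db.{op}.queries_total"] += 1
        if elapsed > SLOW_QUERY_MS:
            counters[f"myapp.db.{op}.slow_total"] += 1

record_pool(active=12, idle=8, waiting=0)
timed_query("select", lambda: [("row",)])
```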
Message Queues
- Messages produced/consumed per topic
- Queue depth (current backlog)
- Processing latency (end-to-end)
- Consumer lag
- Dead letter queue size
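Depth, lag, and DLQ size are all gauges derived from broker state rather than events in application code. A sketch of the arithmetic, assuming Kafka-style offsets (names and parameters are illustrative):

```python
def queue_gauges(latest_offset, committed_offset, queue_depth, dlq_size):
    """Derive queue-health gauges; lag = produced but not yet consumed."""
    return {
        "myapp.queue.orders.depth": queue_depth,
        "myapp.queue.orders.consumer_lag": max(0, latest_offset - committed_offset),
        "myapp.queue.orders.dlq_size": dlq_size,
    }

def processing_latency_ms(produced_at_s, consumed_at_s):
    """End-to-end latency from a produced-at timestamp carried in the message."""
    return (consumed_at_s - produced_at_s) * 1000

g = queue_gauges(latest_offset=1050, committed_offset=1000,
                 queue_depth=50, dlq_size=3)
```

End-to-end latency requires the producer to stamp each message with a timestamp; the consumer computes the difference on receipt.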
Caching
- Hit/miss ratio
- Eviction count and reason
- Cache size (entries and bytes)
- TTL expiration rate
- Connection pool status
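Hit ratio is derived from two counters, and evictions stay low-cardinality by labeling on a small fixed set of reasons. A minimal sketch (class and reason names are illustrative):

```python
from collections import defaultdict

class CacheMetrics:
    """Track hit/miss ratio, evictions by reason, and size gauges."""
    def __init__(self):
        self.hits = 0
        self.misses = 0
        self.evictions = defaultdict(int)  # bounded reasons: "ttl", "capacity", ...
        self.entries = 0                   # gauge: current entry count
        self.bytes = 0                     # gauge: current size in bytes

    def record_lookup(self, hit):
        if hit:
            self.hits += 1
        else:
            self.misses += 1

    def record_eviction(self, reason):
        self.evictions[reason] += 1

    def hit_ratio(self):
        total = self.hits + self.misses
        return self.hits / total if total else 0.0

m = CacheMetrics()
for found in [True, True, True, False]:
    m.record_lookup(found)
m.record_eviction("ttl")
```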
Locks/Synchronization
- Acquisition time
- Contention count (failed acquisitions)
- Hold duration
- Timeout count
- Deadlock occurrences
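Acquisition time, hold duration, and timeouts can all be captured in one wrapper around the lock. A sketch using a context manager around `threading.Lock` (the metric lists are illustrative; deadlock detection is out of scope here and usually comes from the database or runtime):

```python
import threading
import time
from contextlib import contextmanager

lock = threading.Lock()
acquire_ms = []   # time spent waiting for the lock (contention signal)
hold_ms = []      # time spent holding the lock
timeouts = 0      # failed acquisitions

@contextmanager
def timed_lock(lock, timeout_s=1.0):
    """Acquire with a timeout, recording wait time, hold time, and timeouts."""
    global timeouts
    start = time.perf_counter()
    acquired = lock.acquire(timeout=timeout_s)
    acquire_ms.append((time.perf_counter() - start) * 1000)
    if not acquired:
        timeouts += 1
        yield False
        return
    held_from = time.perf_counter()
    try:
        yield True
    finally:
        lock.release()
        hold_ms.append((time.perf_counter() - held_from) * 1000)
```

Usage: `with timed_lock(lock) as acquired:` and branch on `acquired` so the timeout path is handled rather than silently skipped.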
Anti-patterns to Avoid
- Unbounded label cardinality - Never use user IDs, session tokens, or request IDs as labels
- Missing failure paths - Always instrument errors alongside successes
- No heartbeat metric - Add a constant gauge (e.g., app.up = 1) to verify instrumentation works
- Inconsistent naming - Stick to one convention across the codebase
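The cardinality rule in practice: collapse unbounded values into a small fixed set before using them as labels. A sketch (the bucketing function is an illustrative assumption):

```python
from collections import defaultdict

counters = defaultdict(int)

def status_class(code):
    """Collapse HTTP status codes into a bounded label set: 2xx/3xx/4xx/5xx."""
    return f"{code // 100}xx"

# Good: at most a handful of label values
counters[("api.requests_total", status_class(404))] += 1

# Bad: one time series per user — unbounded cardinality; never do this
# counters[("api.requests_total", user_id)] += 1

# Heartbeat: a constant gauge proving the instrumentation pipeline is alive
gauges = {"app.up": 1}
```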
Full Reference
For detailed examples, patterns, and rationale, fetch the complete guide: https://pierrezemb.fr/posts/practical-guide-to-application-metrics/