$
copilot --idea "AKS Cluster Health Dashboard"
→
Build an interactive cluster health report using kubectl, KQL, and Copilot diagnostics
The Problem
Your AKS cluster is running 50 microservices. Something is slow. Alerts are firing. The Kubernetes dashboard shows green everywhere because nobody configured proper health checks. Sound familiar?
What You'll Build
An end-to-end cluster health investigation workflow that produces:
- A categorised list of every unhealthy or underperforming workload
- Root cause analysis for failing pods
- Resource right-sizing recommendations
- KQL queries for ongoing monitoring
Step-by-Step Walkthrough
Phase 1: Cluster Overview
Make sure your kubeconfig points at the cluster you want to inspect (for AKS, `az aks get-credentials` writes the context for you), then ask:
$
"Connect to my AKS cluster and give me a full health report.
Check nodes, pods, deployments, and services.
Flag anything that's not Running or Ready."
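Copilot will run kubectl commands (typically variations of `kubectl get nodes`, `kubectl get pods --all-namespaces`, and `kubectl describe`) and interpret the output. It'll catch things you'd miss scrolling through raw output, such as a pod that reports Running while one of its containers is not Ready.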
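To sanity-check what Copilot reports, the "not Running or Ready" rule can be reproduced by hand. The following is a minimal sketch, assuming the pod list has been saved with `kubectl get pods --all-namespaces -o json`; the function name `flag_unhealthy_pods` and the sample pods are made up for illustration.

```python
def flag_unhealthy_pods(pod_list):
    """Return (namespace, name, reason) for each pod that is not Running and Ready.

    `pod_list` is the parsed JSON from `kubectl get pods --all-namespaces -o json`.
    """
    flagged = []
    for pod in pod_list.get("items", []):
        meta = pod.get("metadata", {})
        status = pod.get("status", {})
        ns, name = meta.get("namespace"), meta.get("name")
        phase = status.get("phase", "Unknown")
        if phase == "Succeeded":
            continue  # completed Job pods are expected to have stopped
        if phase != "Running":
            flagged.append((ns, name, f"phase={phase}"))
            continue
        # A pod can be Running while individual containers are not Ready.
        not_ready = [
            c.get("name")
            for c in status.get("containerStatuses", [])
            if not c.get("ready", False)
        ]
        if not_ready:
            flagged.append((ns, name, "not ready: " + ", ".join(not_ready)))
    return flagged


# Hypothetical sample shaped like kubectl's JSON output.
sample = {
    "items": [
        {"metadata": {"namespace": "shop", "name": "web-6f4"},
         "status": {"phase": "Running",
                    "containerStatuses": [{"name": "web", "ready": True}]}},
        {"metadata": {"namespace": "shop", "name": "cart-7d9"},
         "status": {"phase": "Running",
                    "containerStatuses": [{"name": "cart", "ready": False}]}},
        {"metadata": {"namespace": "shop", "name": "db-0"},
         "status": {"phase": "Pending"}},
        {"metadata": {"namespace": "batch", "name": "report-job-x1"},
         "status": {"phase": "Succeeded"}},
    ]
}

for row in flag_unhealthy_pods(sample):
    print(row)
```

With real data, load the saved file with `json.load` and pass the result in place of `sample`.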
Phase 2: Deep Dive on Failures
$
"Find all pods in CrashLoopBackOff or Error state.
For each one, show me the last 50 log lines
and explain the likely root cause."
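The prompt above boils down to two steps: find the failing containers, then fetch their recent logs. Here is a rough sketch of that logic, assuming the same `kubectl get pods -o json` structure as input; the helper names are hypothetical, and the code only builds the `kubectl logs` command rather than running it.

```python
def find_crashing_containers(pod_list):
    """Return containers whose current state reason is CrashLoopBackOff or Error."""
    hits = []
    for pod in pod_list.get("items", []):
        meta = pod.get("metadata", {})
        for cs in pod.get("status", {}).get("containerStatuses", []):
            state = cs.get("state", {})
            waiting = state.get("waiting") or {}
            terminated = state.get("terminated") or {}
            reason = waiting.get("reason") or terminated.get("reason")
            if reason in ("CrashLoopBackOff", "Error"):
                hits.append({
                    "namespace": meta.get("namespace"),
                    "pod": meta.get("name"),
                    "container": cs.get("name"),
                    "reason": reason,
                    "restarts": cs.get("restartCount", 0),
                })
    return hits


def logs_command(hit, tail=50):
    """Build the kubectl command that shows the last `tail` log lines."""
    cmd = ["kubectl", "logs", hit["pod"], "-n", hit["namespace"],
           "-c", hit["container"], f"--tail={tail}"]
    if hit["restarts"] > 0:
        # After a restart, the crash output lives in the previous instance.
        cmd.append("--previous")
    return cmd


# Hypothetical sample pod stuck in CrashLoopBackOff.
sample = {
    "items": [
        {"metadata": {"namespace": "shop", "name": "api-5c8"},
         "status": {"containerStatuses": [
             {"name": "api", "restartCount": 7,
              "state": {"waiting": {"reason": "CrashLoopBackOff"}}}]}},
    ]
}

for hit in find_crashing_containers(sample):
    print(" ".join(logs_command(hit)))
```

The resulting list can be passed to `subprocess.run` when you want to execute the command rather than print it.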
$
"Check for pods stuck in Pending state.
Is it a resource constraint, node affinity issue,
or missing PersistentVolume?"
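For Pending pods, the scheduler usually records why it could not place the pod in the `PodScheduled` condition message. The sketch below sorts that message into the three causes named in the prompt. The keyword matching is an assumption: scheduler message wording differs between Kubernetes versions, so treat the patterns as a starting point.

```python
def classify_pending(pod):
    """Return a list of likely causes for a Pending pod, or None if it is not Pending."""
    status = pod.get("status", {})
    if status.get("phase") != "Pending":
        return None
    for cond in status.get("conditions", []):
        if cond.get("type") == "PodScheduled" and cond.get("status") == "False":
            message = cond.get("message", "").lower()
            causes = []
            if "insufficient" in message:
                causes.append("resource constraint")
            if "affinity" in message or "selector" in message:
                causes.append("node affinity issue")
            if "persistentvolumeclaim" in message:
                causes.append("missing PersistentVolume")
            return causes or ["unschedulable (other)"]
    # Scheduled but not started yet, for example still pulling an image.
    return ["scheduled, waiting to start"]


# Hypothetical pods with messages in the style the scheduler emits.
cpu_starved = {"status": {"phase": "Pending", "conditions": [
    {"type": "PodScheduled", "status": "False",
     "message": "0/3 nodes are available: 3 Insufficient cpu."}]}}
no_volume = {"status": {"phase": "Pending", "conditions": [
    {"type": "PodScheduled", "status": "False",
     "message": "0/2 nodes are available: 2 pod has unbound immediate PersistentVolumeClaims."}]}}

print(classify_pending(cpu_starved))
print(classify_pending(no_volume))
```

A single message can name several causes at once, which is why the function returns a list.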
Phase 3: Resource Right-Sizing
$
"Compare resource requests vs actual usage for every deployment.
Flag any deployment requesting more than 3x what it uses.
Generate updated resource specs."
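The 3x rule is simple arithmetic once CPU quantities are in the same unit. A minimal sketch, assuming you have already collected requested and observed CPU per deployment as Kubernetes quantity strings (for example from the deployment spec and `kubectl top`); the function names and the sample figures are illustrative.

```python
def parse_cpu_millicores(quantity):
    """Convert a Kubernetes CPU quantity ('250m', '1', '1500000n') to millicores."""
    quantity = quantity.strip()
    if quantity.endswith("m"):
        return float(quantity[:-1])
    if quantity.endswith("n"):
        return float(quantity[:-1]) / 1_000_000  # nanocores to millicores
    return float(quantity) * 1000  # whole or fractional cores


def over_requested(requests, usage, factor=3.0):
    """Return (name, requested_m, used_m) where the request exceeds factor x usage."""
    flagged = []
    for name, requested in requests.items():
        if name not in usage:
            continue  # no usage data, nothing to compare
        requested_m = parse_cpu_millicores(requested)
        used_m = parse_cpu_millicores(usage[name])
        if requested_m > factor * used_m:
            flagged.append((name, requested_m, used_m))
    return flagged


# Hypothetical figures for three deployments.
requests = {"api": "1", "worker": "200m", "cache": "500m"}
usage = {"api": "120m", "worker": "150m", "cache": "160m"}

for name, requested_m, used_m in over_requested(requests, usage):
    print(f"{name}: requests {requested_m:.0f}m, uses {used_m:.0f}m")
```

Here `api` and `cache` are flagged (1000m against 120m, and 500m against 160m), while `worker` stays under the threshold.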
$
"Find deployments with no resource limits set.
These are ticking time bombs — generate limits
based on the p95 usage from the last 7 days."
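Deriving a limit from p95 usage can be sketched in a few lines. This version uses the nearest-rank percentile and then adds headroom and rounding; the 20% headroom and the 50m rounding step are assumptions chosen for illustration, and the samples are expected as millicore readings you have exported from your metrics store.

```python
import math


def percentile(samples, pct):
    """Nearest-rank percentile of a non-empty list of numbers."""
    if not samples:
        raise ValueError("no samples")
    ordered = sorted(samples)
    rank = math.ceil(pct * len(ordered) / 100)
    return ordered[max(rank, 1) - 1]


def suggest_cpu_limit(samples_millicores, headroom=1.2, step=50):
    """Suggest a CPU limit: p95 usage plus headroom, rounded up to `step` millicores."""
    p95 = percentile(samples_millicores, 95)
    return math.ceil(p95 * headroom / step) * step


# Hypothetical 7-day series: mostly quiet, with occasional spikes.
samples = [80] * 90 + [140] * 8 + [400] * 2

print(percentile(samples, 95))      # p95 ignores the rare 400m spikes
print(suggest_cpu_limit(samples))   # limit in millicores
```

Using p95 rather than the maximum keeps rare spikes from inflating the limit, at the cost of throttling during those spikes.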
Phase 4: Generate KQL Monitoring Queries
$
"Write KQL queries for Azure Monitor that detect:
1. Pod restarts > 3 in the last hour
2. Node CPU > 80% sustained for 10 minutes
3. Container OOMKilled events
4. Failed liveness/readiness probes
Create an Azure Monitor workbook template with these."
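Before trusting generated queries, it helps to pin down what rules 1 and 2 mean, because both are easy to get subtly wrong. Restart counts are cumulative, so "more than 3 in the last hour" is a difference within the window, and "sustained for 10 minutes" means an unbroken run above the threshold. The sketch below expresses both rules in plain Python over exported samples. It is a reference for checking query results, not a replacement for the KQL, and the function names are hypothetical.

```python
from datetime import datetime, timedelta


def restarts_in_window(samples, now, window=timedelta(hours=1)):
    """samples: (timestamp, cumulative_restart_count) pairs. Returns restarts inside the window."""
    counts = [count for ts, count in samples if now - window <= ts <= now]
    if len(counts) < 2:
        return 0
    return max(counts) - min(counts)


def cpu_sustained(samples, threshold=80.0, duration=timedelta(minutes=10)):
    """samples: time-ordered (timestamp, cpu_percent) pairs.

    True if CPU stays above `threshold` for an unbroken run of at least `duration`.
    """
    run_start = None
    for ts, pct in samples:
        if pct > threshold:
            if run_start is None:
                run_start = ts
            if ts - run_start >= duration:
                return True
        else:
            run_start = None  # any dip resets the run
    return False


# Hypothetical one-minute samples.
t0 = datetime(2024, 1, 1, 12, 0)
hot = [(t0 + timedelta(minutes=i), 85.0) for i in range(11)]
dipped = [(ts, 60.0 if i == 5 else pct) for i, (ts, pct) in enumerate(hot)]
restarts = [(t0, 10), (t0 + timedelta(minutes=30), 12), (t0 + timedelta(minutes=59), 15)]

print(cpu_sustained(hot), cpu_sustained(dipped))
print(restarts_in_window(restarts, now=t0 + timedelta(hours=1)))
```

A query that averages CPU over the whole window would fire on `dipped` if the mean stays above 80%, which is a looser rule than the unbroken run checked here.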
Phase 5: Produce the Report
$
"Summarise everything we found into a health report.
Include severity ratings, remediation steps, and
estimated cost savings from right-sizing."
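The cost estimate in the report is back-of-the-envelope arithmetic: reclaimed CPU multiplied by a unit price. A minimal sketch follows; the price per vCPU-hour is a placeholder, not a real Azure rate, and the 730 hours per month figure is the usual averaging convention.

```python
HOURS_PER_MONTH = 730  # common averaging convention for monthly estimates


def monthly_savings(reclaimed_millicores, price_per_vcpu_hour):
    """Estimate monthly savings from CPU requests that right-sizing frees up."""
    vcpus = reclaimed_millicores / 1000
    return round(vcpus * price_per_vcpu_hour * HOURS_PER_MONTH, 2)


# Hypothetical: right-sizing frees 4000m of requests at a placeholder price of 0.05 per vCPU-hour.
print(monthly_savings(4000, 0.05))
```

Savings only materialise if the freed capacity lets the cluster run on fewer or smaller nodes, so treat the figure as an upper bound.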
Pro Tips
• Pipe large kubectl outputs through `head -100` to stay within context limits
• Use the azure-diagnostics skill for AppLens-powered root cause analysis
• The azure-kubernetes skill has Day-2 operations knowledge built in
• Run this monthly as a proactive health check, not just during incidents
What You'll Learn
• Kubernetes troubleshooting patterns and kubectl mastery
• KQL query language for Azure Monitor
• Resource management and cost optimisation in AKS
• Building operational runbooks with AI assistance