AKS Cluster Health Dashboard — Copilot CLI Ideas

$ copilot --idea "AKS Cluster Health Dashboard"

advanced ⏱ 1-2 hours Azure Projects

→ Build an interactive cluster health report using kubectl, KQL, and Copilot diagnostics

The Problem

Your AKS cluster is running 50 microservices. Something is slow. Alerts are firing. The Kubernetes dashboard shows green everywhere because nobody configured proper health checks. Sound familiar?

What You'll Build

An end-to-end cluster health investigation workflow that produces:

- A categorised list of every unhealthy or underperforming workload

- Root cause analysis for failing pods

- Resource right-sizing recommendations

- KQL queries for ongoing monitoring

Step-by-Step Walkthrough

Phase 1: Cluster Overview

Make sure your kubeconfig is set, then ask:

$ "Connect to my AKS cluster and give me a full health report.

Check nodes, pods, deployments, and services.

Flag anything that's not Running or Ready."

Copilot will run kubectl commands and interpret the output. It'll catch things you'd miss scrolling through raw output.

Phase 2: Deep Dive on Failures

$ "Find all pods in CrashLoopBackOff or Error state.

For each one, show me the last 50 log lines

and explain the likely root cause."

$ "Check for pods stuck in Pending state.

Is it a resource constraint, node affinity issue,

or missing PersistentVolume?"

Phase 3: Resource Right-Sizing

$ "Compare resource requests vs actual usage for every deployment.

Flag any deployment requesting more than 3x what it uses.

Generate updated resource specs."

$ "Find deployments with no resource limits set.

These are ticking time bombs — generate limits

based on the p95 usage from the last 7 days."

Phase 4: Generate KQL Monitoring Queries

$ "Write KQL queries for Azure Monitor that detect:

1. Pod restarts > 3 in the last hour

2. Node CPU > 80% sustained for 10 minutes

3. Container OOMKilled events

4. Failed liveness/readiness probes

Create an Azure Monitor workbook template with these."

Phase 5: Produce the Report

$ "Summarise everything we found into a health report.

Include severity ratings, remediation steps, and

estimated cost savings from right-sizing."

Pro Tips

• Pipe large kubectl outputs through `head -100` to stay within context limits

• Use the azure-diagnostics skill for AppLens-powered root cause analysis

• The azure-kubernetes skill has Day-2 operations knowledge built in

• Run this monthly as a proactive health check, not just during incidents

What You'll Learn

• Kubernetes troubleshooting patterns and kubectl mastery

• KQL query language for Azure Monitor

• Resource management and cost optimisation in AKS

• Building operational runbooks with AI assistance