Ceph Troubleshooting Cheatsheet
Panic Button Guide
This document contains a collection of essential commands for diagnosing and fixing common issues in a Ceph cluster (Ubuntu 22.04 / Squid).
All commands are executed on the Admin Node (ceph-mgr-1) with sudo access.
1. General Health Check
The first step when an issue occurs. Do not take any action before checking this.
| Command | Function | Explanation |
|---|---|---|
| ceph -s | Cluster status | Main dashboard. Check whether health is HEALTH_OK, WARN, or ERR. |
| ceph health detail | Error detail | If status is WARN/ERR, this gives the specific reason. |
| ceph -w | Live log (CCTV) | Watch cluster logs in real time. Press Ctrl+C to exit. |
| ceph df | Capacity | Check global disk usage (raw vs. usable). |
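For scripted triage, the health token can be pulled straight out of `ceph -s` output. A minimal sketch, shown against a captured sample so it runs without a live cluster; `health_token` is a hypothetical helper name:

```shell
# Extract the first HEALTH_* token from `ceph -s` output.
health_token() {
  grep -o 'HEALTH_[A-Z]*' | head -n 1
}

# Demo against captured sample output (no cluster needed):
sample='  cluster:
    health: HEALTH_WARN'
echo "$sample" | health_token
```

On a real admin node you would pipe the live command instead: `sudo ceph -s | health_token`.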
Verify Running Services
To see everything running in the background via the orchestrator:
# View all running services (MDS, MON, OSD, MGR, etc.)
sudo ceph orch ps
# View only services with errors or warnings
sudo ceph orch ps --status_class error
sudo ceph orch ps --status_class warning
# Check specific daemon logs
sudo cephadm logs --name mon.ceph-osd-1

2. Monitor & Quorum Issues
Monitors require a strict quorum (majority). If you have 3 monitors, 2 must be online.
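The majority rule is simple arithmetic: a quorum needs floor(n/2) + 1 monitors alive. A quick sketch of what that means for common monitor counts:

```shell
# Quorum majority: floor(n/2) + 1 monitors must be up.
quorum_needed() { echo $(( $1 / 2 + 1 )); }

for n in 1 3 5; do
  echo "$n mons: need $(quorum_needed $n), can lose $(( n - $(quorum_needed $n) ))"
done
```

This is why even monitor counts buy nothing: 4 monitors still tolerate only 1 failure, the same as 3.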
Clock Skew
Ceph monitors are highly sensitive to time drift. If you see clock skew detected on mon.X:
# Force chrony to sync immediately
sudo chronyc -a makestep
sudo systemctl restart chrony
# Verify sync
chronyc sources

Loss of Quorum / Stuck Monitors
If a monitor refuses to join the quorum:
# Check monitor quorum status
sudo ceph quorum_status --format json-pretty
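To see at a glance which monitors made it into quorum, the `quorum_names` array can be scraped from that JSON even without jq. A rough grep/sed sketch (`in_quorum` is a hypothetical helper), demoed against sample JSON:

```shell
# List monitors currently in quorum from `ceph quorum_status` JSON.
in_quorum() {
  grep -o '"quorum_names":\[[^]]*\]' | sed 's/.*\[//; s/\]//; s/"//g; s/,/ /g'
}

sample='{"election_epoch":12,"quorum":[0,1],"quorum_names":["ceph-mgr-1","ceph-osd-1"]}'
echo "$sample" | in_quorum
```

Compare the printed names against `ceph orch host ls` to spot which monitor is missing.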
# Force remove a dead monitor
sudo ceph orch daemon rm mon.<hostname> --force

3. Disk & Host Issues
Check Inventory
# View nodes recognized by the cluster
sudo ceph orch host ls
# View disks (use --wide to see rejection reasons)
sudo ceph orch device ls --wide --refresh

Zapping Disks
If a disk doesn't appear or is marked "LVM/Insufficient Space," you may need to zap it. This destroys all data on that device.
# Format: ceph orch device zap <hostname> <device> --force
sudo ceph orch device zap ceph-osd-1 /dev/sdb --force

4. OSD Issues & Placement Groups (PGs)
OSD Tree Visualization
sudo ceph osd tree

- Status: up (alive), down (dead).
- Membership: in (storing data), out (evacuated).
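When the tree is long, the down OSDs can be filtered out with awk over the STATUS column. A sketch run against captured sample output (`down_osds` is a hypothetical helper):

```shell
# Print OSDs reported "down" in `ceph osd tree` output
# (numeric-ID rows only, so host/root rows are skipped).
down_osds() {
  awk '$1 ~ /^[0-9]+$/ && / down / { print "osd." $1 }'
}

sample='ID  CLASS  WEIGHT   TYPE NAME      STATUS  REWEIGHT
-1          0.29399  root default
 0   hdd    0.09799      osd.0      up      1.00000
 1   hdd    0.09799      osd.1      down    1.00000'
echo "$sample" | down_osds
```

On a live cluster: `sudo ceph osd tree | down_osds`.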
Ghost OSDs
If an OSD crashes and won't come back up:
# Mark OSD out to force rebalancing
sudo ceph osd out osd.<ID>
# Safely purge permanently dead OSD
sudo ceph orch osd rm <ID> --replace

PG States
If ceph -s shows PGs that are not active+clean:
- pgs inconsistent: Bitrot detected. Run: sudo ceph pg repair <pg_id>
- pgs stale: Primary OSD is down. Bring it back online.
- pgs peering: OSDs are syncing data. Usually resolves itself.
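With multiple inconsistent PGs, the repair commands can be generated from `ceph health detail` instead of typed one by one. A sketch against sample output (`repair_cmds` is a hypothetical helper); it only prints the commands, so review them before piping anywhere:

```shell
# Turn `ceph health detail` output into `ceph pg repair` commands.
repair_cmds() {
  grep -o 'pg [0-9]*\.[0-9a-f]* is [^ ]*inconsistent' | awk '{ print "ceph pg repair " $2 }'
}

sample='HEALTH_ERR 2 scrub errors; Possible data damage: 1 pg inconsistent
    pg 2.1a is active+clean+inconsistent, acting [0,1,2]'
echo "$sample" | repair_cmds
```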
5. CephFS & MDS Issues
# Check MDS status
sudo ceph fs status
# View top clients (IO consumers)
sudo ceph tell mds.cephfs:0 client ls

MDS Stuck
If the active MDS hangs:
# Force failover to standby
sudo ceph mds fail <mds_name>

6. Emergency & Maintenance
Deleting a Pool
Restricted Operation
Pool deletion is locked by default.
sudo ceph config set mon mon_allow_pool_delete true
sudo ceph osd pool delete <pool> <pool> --yes-i-really-really-mean-it
sudo ceph config set mon mon_allow_pool_delete false   # Mandatory!
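Because forgetting the re-lock is the common mistake, the three steps can be wrapped so the guard flag is always restored even when the delete fails. A sketch, assuming the ceph CLI on PATH; `delete_pool` is a hypothetical helper:

```shell
# Delete a pool and always re-lock mon_allow_pool_delete afterwards.
delete_pool() {
  pool=$1
  ceph config set mon mon_allow_pool_delete true
  ceph osd pool delete "$pool" "$pool" --yes-i-really-really-mean-it
  rc=$?
  ceph config set mon mon_allow_pool_delete false   # mandatory re-lock
  return $rc
}
```

Usage: `delete_pool scratch-pool` (run with sudo, or as a user with admin keyring access).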
Full Cluster Lockdown
If OSDs hit the full ratio (95% by default), the cluster blocks writes and becomes effectively read-only.
# Temporarily raise threshold to 99% to allow deletions
sudo ceph osd set-full-ratio 0.99
# ... delete files ...
# Revert to 0.95
sudo ceph osd set-full-ratio 0.95

Maintenance Mode
Prevent data rebalance storms during host reboots:
# Set flags
sudo ceph osd set noout
sudo ceph osd set norebalance
# ... Reboot Nodes ...
# Unset flags
sudo ceph osd unset noout
sudo ceph osd unset norebalance

7. Orchestrator & Deployment
Stray Daemons
If HEALTH_WARN reports "stray daemon not managed by cephadm":
# Identify stray
sudo ceph health detail
# Adopt it
sudo ceph orch daemon add <type> <host>

SSH/Host Check
If ceph orch host ls shows blank statuses:
# Force immediate SSH check
sudo ceph orch host check <hostname>
# Force manager failover if orchestrator hangs
sudo ceph mgr fail

8. Docker Runtime (Critical)
Upstream Docker Warning
Using get.docker.com or official Docker repos instead of Ubuntu's docker.io often breaks cephadm. Revert to stable packages if you experience orchestrator hangs.
Recovering Daemons
# Redeploy mon/osd container completely
sudo ceph orch daemon redeploy mon.<host>
sudo ceph orch daemon redeploy osd.<id>

BlueStore Slow Ops
If seeing BLUESTORE_SLOW_OP_ALERT:
- Check ceph -s for active rebalancing.
- Run ceph osd perf to identify bottlenecked OSD IDs.
- Check dmesg for hardware I/O errors.
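The `ceph osd perf` step can be automated: flag any OSD whose commit latency exceeds a threshold. A sketch against captured sample output (`slow_osds` is a hypothetical helper; the threshold argument is in milliseconds):

```shell
# Print OSDs whose commit_latency(ms) exceeds the given limit.
slow_osds() {
  awk -v limit="$1" '$1 ~ /^[0-9]+$/ && $2 + 0 > limit { print "osd." $1 " (" $2 " ms)" }'
}

sample='osd  commit_latency(ms)  apply_latency(ms)
  2                   5                  5
  1                 812                812'
echo "$sample" | slow_osds 100
```

On a live cluster: `sudo ceph osd perf | slow_osds 100`, then cross-check the flagged IDs against dmesg on their hosts.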