Ceph Troubleshooting Cheatsheet

Panic Button Guide

This document contains a collection of essential commands for diagnosing and fixing common issues in a Ceph cluster (Ubuntu 22.04 / Squid).

All commands are executed on the Admin Node (ceph-mgr-1) with sudo access.

1. General Health Check

The first step when an issue occurs. Do not take any action before checking this.

| Command | Function | Explanation |
| --- | --- | --- |
| ceph -s | Cluster Status | Main dashboard. Check if HEALTH_OK, WARN, or ERR. |
| ceph health detail | Error Detail | If status is WARN/ERR, this provides the specific reason. |
| ceph -w | Live Log (CCTV) | Watch cluster logs in real-time. Press Ctrl+C to exit. |
| ceph df | Capacity | Check global disk usage (Raw vs Usable). |
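The checks above can be wired into a small triage script. A minimal sketch, assuming a POSIX shell on the admin node; the helper name check_health is my own, not a Ceph command:

```shell
# Minimal triage sketch: dig into details only when the cluster is
# not healthy. Feed check_health the first token of `ceph health`:
#   check_health "$(ceph health | awk '{print $1}')"
check_health() {
    case "$1" in
        HEALTH_OK)   echo "cluster healthy, nothing to do" ;;
        HEALTH_WARN) echo "degraded: run 'ceph health detail'" ;;
        HEALTH_ERR)  echo "CRITICAL: run 'ceph health detail' now" ;;
        *)           echo "unknown status: $1" ;;
    esac
}

check_health HEALTH_WARN
```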

Verify Running Services

To see everything running in the background via the orchestrator:

bash
# View all running services (MDS, MON, OSD, MGR, etc.)
sudo ceph orch ps

# View only services with errors or warnings
sudo ceph orch ps --status_class error
sudo ceph orch ps --status_class warning

# Check specific daemon logs
sudo cephadm logs --name mon.ceph-osd-1

2. Monitor & Quorum Issues

Monitors require a strict quorum (majority). If you have 3 monitors, 2 must be online.
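The majority rule is simple arithmetic: a cluster of n monitors needs floor(n/2) + 1 of them in quorum. A quick sketch (quorum_size is an illustrative helper, not a Ceph command):

```shell
# Strict majority: floor(n/2) + 1 monitors must be up.
quorum_size() {
    echo $(( $1 / 2 + 1 ))
}

quorum_size 3   # -> 2 (a 3-mon cluster survives 1 failure)
quorum_size 5   # -> 3 (a 5-mon cluster survives 2 failures)
```

This is also why even monitor counts buy nothing: 4 monitors still need 3 in quorum, so they tolerate the same single failure as 3.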

Clock Skew

Ceph monitors are highly sensitive to time drift. If you see clock skew detected on mon.X:

bash
# Force chrony to sync immediately
sudo chronyc makestep
sudo systemctl restart chrony
# Verify sync
chronyc sources
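Monitors start warning once skew exceeds mon_clock_drift_allowed (0.05 s by default). A sketch of the comparison, assuming the measured offset comes from chronyc tracking; skew_ok is an illustrative helper, not a Ceph command:

```shell
# Compare a measured clock offset against the allowed drift
# (mon_clock_drift_allowed, 0.05 s by default).
skew_ok() {
    # $1 = absolute offset in seconds, $2 = allowed drift
    awk -v off="$1" -v max="$2" 'BEGIN { exit !(off + 0 <= max + 0) }'
}

skew_ok 0.03 0.05 && echo "within tolerance"
skew_ok 0.20 0.05 || echo "clock skew: resync with chrony"
```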

Loss of Quorum / Stuck Monitors

If a monitor refuses to join the quorum:

bash
# Check monitor quorum status
sudo ceph quorum_status --format json-pretty

# Force remove a dead monitor
sudo ceph orch daemon rm mon.<hostname> --force
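The quorum_status JSON lists the members (quorum_names) and the current leader (quorum_leader_name). A sketch of extracting the leader with plain sed; the sample output is captured here for illustration, in practice pipe from the command above:

```shell
# Illustrative sample of `ceph quorum_status --format json` output.
sample='{"quorum_names":["ceph-mgr-1","ceph-osd-1"],"quorum_leader_name":"ceph-mgr-1"}'

leader=$(printf '%s' "$sample" | sed -n 's/.*"quorum_leader_name":"\([^"]*\)".*/\1/p')
echo "leader: $leader"
```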

3. Disk & Host Issues

Check Inventory

bash
# View nodes recognized by the cluster
sudo ceph orch host ls

# View disks (use --wide to see rejection reasons)
sudo ceph orch device ls --wide --refresh

Zapping Disks

If a disk doesn't appear or is marked "LVM/Insufficient Space," you may need to zap it. This destroys all data on that device.

bash
# Format: ceph orch device zap <hostname> <device> --force
sudo ceph orch device zap ceph-osd-1 /dev/sdb --force

4. OSD Issues & Placement Groups (PGs)

OSD Tree Visualization

bash
sudo ceph osd tree

  • Status: up (alive), down (dead).
  • Membership: in (storing data), out (evacuated).

Ghost OSDs

If an OSD crashes and won't come back up:

bash
# Mark OSD out to force rebalancing
sudo ceph osd out osd.<ID>

# Safely remove a permanently dead OSD (keeps the ID reserved for a replacement disk)
sudo ceph orch osd rm <ID> --replace

PG States

If ceph -s shows PGs that are not active+clean:

  • pgs inconsistent: Bitrot detected. Run: sudo ceph pg repair <pg_id>
  • pgs stale: Primary OSD is down. Bring it back online.
  • pgs peering: OSDs are syncing data. Usually resolves itself.
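The bullet list above can be sketched as a lookup helper; pg_action is hypothetical, and the recommendations simply mirror the list:

```shell
# Map a problem PG state to the first action to try.
pg_action() {
    case "$1" in
        *inconsistent*) echo "run: ceph pg repair <pg_id>" ;;
        *stale*)        echo "bring the primary OSD back online" ;;
        *peering*)      echo "wait: usually resolves itself" ;;
        *)              echo "check: ceph health detail" ;;
    esac
}

pg_action "active+clean+inconsistent"
```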

5. CephFS & MDS Issues

bash
# Check MDS status
sudo ceph fs status

# View top clients (IO consumers)
sudo ceph tell mds.cephfs:0 client ls

MDS Stuck

If the active MDS hangs:

bash
# Force failover to standby
sudo ceph mds fail <mds_name>

6. Emergency & Maintenance

Deleting a Pool

Restricted Operation

Pool deletion is locked by default.

  1. sudo ceph config set mon mon_allow_pool_delete true
  2. sudo ceph osd pool delete <pool> <pool> --yes-i-really-mean-it
  3. sudo ceph config set mon mon_allow_pool_delete false (Mandatory!)

Full Cluster Lockdown

If OSDs reach the full ratio (95% by default), Ceph blocks writes and the cluster becomes effectively read-only.

bash
# Temporarily raise threshold to 99% to allow deletions
sudo ceph osd set-full-ratio 0.99
# ... delete files ...
# Revert to 0.95
sudo ceph osd set-full-ratio 0.95
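The full check itself is just used/total against the ratio. A sketch, assuming usage figures from ceph osd df; osd_full is an illustrative helper, not a Ceph command:

```shell
# An OSD counts as full when used/total reaches the full ratio.
osd_full() {
    # $1 = used bytes, $2 = total bytes, $3 = full ratio
    awk -v u="$1" -v t="$2" -v r="$3" 'BEGIN { exit !(u / t >= r) }'
}

osd_full 960 1000 0.95 && echo "FULL: writes blocked"
osd_full 800 1000 0.95 || echo "below full ratio"
```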

Maintenance Mode

Prevent data rebalance storms during host reboots:

bash
# Set flags
sudo ceph osd set noout
sudo ceph osd set norebalance

# ... Reboot Nodes ...

# Unset flags
sudo ceph osd unset noout
sudo ceph osd unset norebalance
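The set/reboot/unset sequence is easy to get half-done by hand; a sketch of wrapping it in one helper (with_maint_flags is my own name, and CEPH is overridable so the sketch can be dry-run with CEPH="echo ceph"):

```shell
# Wrap any disruptive action between set/unset of the flags above.
CEPH="${CEPH:-sudo ceph}"

with_maint_flags() {
    $CEPH osd set noout
    $CEPH osd set norebalance
    "$@"                          # the actual reboot/upgrade step
    $CEPH osd unset norebalance
    $CEPH osd unset noout
}

# Usage: with_maint_flags sudo systemctl reboot
```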

7. Orchestrator & Deployment

Stray Daemons

If HEALTH_WARN reports "stray daemon not managed by cephadm":

bash
# Identify stray
sudo ceph health detail

# Adopt it into cephadm (run on the host where the stray daemon lives)
sudo cephadm adopt --style legacy --name <type>.<id>

SSH/Host Check

If ceph orch host ls shows blank statuses:

bash
# Force an immediate connectivity/config check of the host
sudo ceph cephadm check-host <hostname>

# Force manager failover if orchestrator hangs
sudo ceph mgr fail

8. Docker Runtime (Critical)

Upstream Docker Warning

Installing Docker from get.docker.com or the upstream Docker repositories instead of Ubuntu's docker.io package often breaks cephadm. Revert to the distro packages if you experience orchestrator hangs.

Recovering Daemons

bash
# Redeploy mon/osd container completely
sudo ceph orch daemon redeploy mon.<host>
sudo ceph orch daemon redeploy osd.<id>

BlueStore Slow Ops

If seeing BLUESTORE_SLOW_OP_ALERT:

  1. Check ceph -s for active rebalancing.
  2. Run ceph osd perf to identify bottlenecked IDs.
  3. Check dmesg for hardware I/O errors.
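Step 2 can be automated by flagging outliers in the perf listing. A sketch; the sample, the column layout, and the 100 ms threshold are illustrative, in practice the data comes from sudo ceph osd perf:

```shell
# Flag OSDs whose commit latency exceeds a threshold.
# Illustrative sample of `ceph osd perf`-style output:
sample='osd commit_latency(ms) apply_latency(ms)
  2                 450               430
  1                   4                 3
  0                   5                 5'

printf '%s\n' "$sample" | awk 'NR > 1 && $2 > 100 { print "slow: osd." $1 }'
```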