Ceph Troubleshooting Cheatsheet

Panic Button Guide

This document contains a collection of essential commands for diagnosing and fixing common issues in a Ceph cluster (Ubuntu 22.04 / Squid).

All commands are executed on the Admin Node (ceph-mgr-1) with sudo access.

1. General Health Check

The first step when an issue occurs. Do not take any action before checking this.

| Command | Function | Explanation |
| --- | --- | --- |
| ceph -s | Cluster Status | Main dashboard. Check if HEALTH_OK, WARN, or ERR. |
| ceph health detail | Error Detail | If status is WARN/ERR, this provides the specific reason. |
| ceph -w | Live Log (CCTV) | Watch cluster logs in real-time. Press Ctrl+C to exit. |
| ceph df | Capacity | Check global disk usage (Raw vs Usable). |
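The checks above can be wired into a small triage script. A minimal sketch, assuming a POSIX shell on the admin node; the helper name check_health is my own, not a Ceph command:

```shell
# Minimal triage sketch: dig into details only when the cluster is
# not healthy. Feed check_health the first token of `ceph health`:
#   check_health "$(ceph health | awk '{print $1}')"
check_health() {
    case "$1" in
        HEALTH_OK)   echo "cluster healthy, nothing to do" ;;
        HEALTH_WARN) echo "degraded: run 'ceph health detail'" ;;
        HEALTH_ERR)  echo "CRITICAL: run 'ceph health detail' now" ;;
        *)           echo "unknown status: $1" ;;
    esac
}

check_health HEALTH_WARN
```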

Verify Running Services

To see everything running in the background via the orchestrator:

bash
# View all running services (MDS, MON, OSD, MGR, etc.)
sudo ceph orch ps

# View only services with errors or warnings
sudo ceph orch ps --status_class error
sudo ceph orch ps --status_class warning

# Check specific daemon logs
sudo cephadm logs --name mon.ceph-osd-1

2. Monitor & Quorum Issues

Monitors require a strict quorum (majority). If you have 3 monitors, 2 must be online.
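The majority rule is simple arithmetic: a cluster of n monitors needs floor(n/2) + 1 of them in quorum. A quick sketch (quorum_size is an illustrative helper, not a Ceph command):

```shell
# Strict majority: floor(n/2) + 1 monitors must be up.
quorum_size() {
    echo $(( $1 / 2 + 1 ))
}

quorum_size 3   # -> 2 (a 3-mon cluster survives 1 failure)
quorum_size 5   # -> 3 (a 5-mon cluster survives 2 failures)
```

This is also why even monitor counts buy nothing: 4 monitors still need 3 in quorum, so they tolerate the same single failure as 3.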

Clock Skew

Ceph monitors are highly sensitive to time drift. If you see clock skew detected on mon.X:

bash
# Force chrony to sync immediately
sudo chronyc makestep
sudo systemctl restart chrony
# Verify sync
chronyc sources
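Monitors start warning once skew exceeds mon_clock_drift_allowed (0.05 s by default). A sketch of the comparison, assuming the measured offset comes from chronyc tracking; skew_ok is an illustrative helper, not a Ceph command:

```shell
# Compare a measured clock offset against the allowed drift
# (mon_clock_drift_allowed, 0.05 s by default).
skew_ok() {
    # $1 = absolute offset in seconds, $2 = allowed drift
    awk -v off="$1" -v max="$2" 'BEGIN { exit !(off + 0 <= max + 0) }'
}

skew_ok 0.03 0.05 && echo "within tolerance"
skew_ok 0.20 0.05 || echo "clock skew: resync with chrony"
```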

Loss of Quorum / Stuck Monitors

If a monitor refuses to join the quorum:

bash
# Check monitor quorum status
sudo ceph quorum_status --format json-pretty

# Force remove a dead monitor
sudo ceph orch daemon rm mon.<hostname> --force
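The quorum_status JSON lists the members (quorum_names) and the current leader (quorum_leader_name). A sketch of extracting the leader with plain sed; the sample output is captured here for illustration, in practice pipe from the command above:

```shell
# Illustrative sample of `ceph quorum_status --format json` output.
sample='{"quorum_names":["ceph-mgr-1","ceph-osd-1"],"quorum_leader_name":"ceph-mgr-1"}'

leader=$(printf '%s' "$sample" | sed -n 's/.*"quorum_leader_name":"\([^"]*\)".*/\1/p')
echo "leader: $leader"
```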

3. Disk & Host Issues

Check Inventory

bash
# View nodes recognized by the cluster
sudo ceph orch host ls

# View disks (use --wide to see rejection reasons)
sudo ceph orch device ls --wide --refresh

Zapping Disks

If a disk doesn't appear or is marked "LVM/Insufficient Space," you may need to zap it. This destroys all data on that device.

bash
# Format: ceph orch device zap <hostname> <device> --force
sudo ceph orch device zap ceph-osd-1 /dev/sdb --force

4. OSD Issues & Placement Groups (PGs)

OSD Tree Visualization

bash
sudo ceph osd tree

  • Status: up (alive), down (dead).
  • Membership: in (storing data), out (evacuated).

Ghost OSDs

If an OSD crashes and won't come back up:

bash
# Mark OSD out to force rebalancing
sudo ceph osd out osd.<ID>

# Safely remove a permanently dead OSD (keeps the ID reserved for a replacement disk)
sudo ceph orch osd rm <ID> --replace

PG States

If ceph -s shows PGs that are not active+clean:

  • pgs inconsistent: Bitrot detected. Run: sudo ceph pg repair <pg_id>
  • pgs stale: Primary OSD is down. Bring it back online.
  • pgs peering: OSDs are syncing data. Usually resolves itself.
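The bullet list above can be sketched as a lookup helper; pg_action is hypothetical, and the recommendations simply mirror the list:

```shell
# Map a problem PG state to the first action to try.
pg_action() {
    case "$1" in
        *inconsistent*) echo "run: ceph pg repair <pg_id>" ;;
        *stale*)        echo "bring the primary OSD back online" ;;
        *peering*)      echo "wait: usually resolves itself" ;;
        *)              echo "check: ceph health detail" ;;
    esac
}

pg_action "active+clean+inconsistent"
```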

5. CephFS & MDS Issues

bash
# Check MDS status
sudo ceph fs status

# View top clients (IO consumers)
sudo ceph tell mds.cephfs:0 client ls

MDS Stuck

If the active MDS hangs:

bash
# Force failover to standby
sudo ceph mds fail <mds_name>

6. Emergency & Maintenance

Deleting a Pool

Restricted Operation

Pool deletion is locked by default.

  1. sudo ceph config set mon mon_allow_pool_delete true
  2. sudo ceph osd pool delete <pool> <pool> --yes-i-really-mean-it
  3. sudo ceph config set mon mon_allow_pool_delete false (Mandatory!)

Full Cluster Lockdown

If OSDs reach the full ratio (95% by default), Ceph blocks writes and the cluster becomes effectively read-only.

bash
# Temporarily raise threshold to 99% to allow deletions
sudo ceph osd set-full-ratio 0.99
# ... delete files ...
# Revert to 0.95
sudo ceph osd set-full-ratio 0.95
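The full check itself is just used/total against the ratio. A sketch, assuming usage figures from ceph osd df; osd_full is an illustrative helper, not a Ceph command:

```shell
# An OSD counts as full when used/total reaches the full ratio.
osd_full() {
    # $1 = used bytes, $2 = total bytes, $3 = full ratio
    awk -v u="$1" -v t="$2" -v r="$3" 'BEGIN { exit !(u / t >= r) }'
}

osd_full 960 1000 0.95 && echo "FULL: writes blocked"
osd_full 800 1000 0.95 || echo "below full ratio"
```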

Maintenance Mode

Prevent data rebalance storms during host reboots:

bash
# Set flags
sudo ceph osd set noout
sudo ceph osd set norebalance

# ... Reboot Nodes ...

# Unset flags
sudo ceph osd unset noout
sudo ceph osd unset norebalance
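The set/reboot/unset sequence is easy to get half-done by hand; a sketch of wrapping it in one helper (with_maint_flags is my own name, and CEPH is overridable so the sketch can be dry-run with CEPH="echo ceph"):

```shell
# Wrap any disruptive action between set/unset of the flags above.
CEPH="${CEPH:-sudo ceph}"

with_maint_flags() {
    $CEPH osd set noout
    $CEPH osd set norebalance
    "$@"                          # the actual reboot/upgrade step
    $CEPH osd unset norebalance
    $CEPH osd unset noout
}

# Usage: with_maint_flags sudo systemctl reboot
```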

7. Orchestrator & Deployment

Stray Daemons

If HEALTH_WARN reports "stray daemon not managed by cephadm":

bash
# Identify stray
sudo ceph health detail

# Adopt it into cephadm (run on the host where the stray daemon lives)
sudo cephadm adopt --style legacy --name <type>.<id>

SSH/Host Check

If ceph orch host ls shows blank statuses:

bash
# Force an immediate connectivity/config check of the host
sudo ceph cephadm check-host <hostname>

# Force manager failover if orchestrator hangs
sudo ceph mgr fail

8. Docker Runtime (Critical)

Upstream Docker Warning

Installing Docker from get.docker.com or the upstream Docker repositories instead of Ubuntu's docker.io package often breaks cephadm. Revert to the distro packages if you experience orchestrator hangs.

Recovering Daemons

bash
# Redeploy mon/osd container completely
sudo ceph orch daemon redeploy mon.<host>
sudo ceph orch daemon redeploy osd.<id>

BlueStore Slow Ops

If seeing BLUESTORE_SLOW_OP_ALERT:

  1. Check ceph -s for active rebalancing.
  2. Run ceph osd perf to identify bottlenecked IDs.
  3. Check dmesg for hardware I/O errors.
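Step 2 can be automated by flagging outliers in the perf listing. A sketch; the sample, the column layout, and the 100 ms threshold are illustrative, in practice the data comes from sudo ceph osd perf:

```shell
# Flag OSDs whose commit latency exceeds a threshold.
# Illustrative sample of `ceph osd perf`-style output:
sample='osd commit_latency(ms) apply_latency(ms)
  2                 450               430
  1                   4                 3
  0                   5                 5'

printf '%s\n' "$sample" | awk 'NR > 1 && $2 > 100 { print "slow: osd." $1 }'
```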