Storage Strategy: Erasure Coding

This document explains the Storage Pool configuration using Erasure Coding (EC) with a 2+2 scheme. This strategy provides an excellent balance of redundancy and storage efficiency for our 8 TB raw cluster (4 × 2 TB drives).

1. Architectural Concept

Instead of using 3x Replication (which consumes 200% extra storage), we use Erasure Coding.

  • Data Chunks (k): 2
  • Parity Chunks (m): 2
  • Efficiency: 50% usable (~4.0 TB of the 8.0 TB raw).
  • Safety: tolerates the simultaneous failure of any two OSD nodes.
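
The bullet math above can be checked with a few lines of shell (the numbers mirror this cluster; the variable names are illustrative):

```shell
# Usable capacity for an EC pool = raw * k / (k + m).
K=2; M=2; RAW_TB=8
USABLE_TB=$(( RAW_TB * K / (K + M) ))   # 8 * 2 / 4 = 4 TB usable
REPL3_TB=$(( RAW_TB / 3 ))              # same raw under 3x replication: 2 TB
echo "EC ${K}+${M} usable: ${USABLE_TB} TB (3x replication: ${REPL3_TB} TB)"
```

EC 2+2 thus doubles the usable space of 3x replication while still surviving two failures.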

2. Metadata Server (MDS)

High Availability

Before creating a file system, we must enable the MDS service. We place it on ceph-osd-1 and ceph-osd-2 for redundancy.

Run on the Admin Node (ceph-mgr-1):

```bash
# Deploy MDS service named 'cephfs'
sudo ceph orch apply mds cephfs --placement="ceph-osd-1,ceph-osd-2"

# Verify service is running
sudo ceph orch ps --daemon_type mds
```

3. Configure EC Profile

We create a profile with k=2, m=2 and a failure domain of host, so that no two chunks of the same object are stored on the same node.

```bash
sudo ceph osd erasure-code-profile set ec-2plus2 \
  k=2 \
  m=2 \
  crush-failure-domain=host
```
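
To confirm the profile was stored as intended, it can be read back with the `erasure-code-profile get` subcommand (run on the Admin Node, like the commands above):

```bash
# Dump the profile; output should list k=2, m=2, crush-failure-domain=host
sudo ceph osd erasure-code-profile get ec-2plus2
```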

4. Create Storage Pools

CephFS requires two pools: Metadata (structure) and Data (content).

4.1 Metadata Pool (Replicated)

Metadata Restriction

Metadata must not use Erasure Coding. It requires fast, atomic access provided by standard replication.

```bash
# Create metadata pool (Size 3)
sudo ceph osd pool create cephfs_metadata 32 32 replicated
sudo ceph osd pool set cephfs_metadata size 3
sudo ceph osd pool set cephfs_metadata min_size 2
```
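
The cost of size=3 is easy to quantify: every byte of metadata is written to three OSDs, while min_size=2 keeps the pool serving I/O as long as two of the three copies are up. A quick sketch with illustrative numbers:

```shell
# Replicated size=3: raw consumption is 3x the stored metadata.
SIZE=3; RAW_GB=100
USABLE_GB=$(( RAW_GB / SIZE ))  # 100 GB of raw space holds ~33 GB of metadata
echo "size=${SIZE}: ${RAW_GB} GB raw stores ${USABLE_GB} GB of metadata"
```

Metadata is small relative to file data, so this overhead is a worthwhile trade for the fast, atomic access CephFS needs.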

4.2 Data Pool (Erasure Coded)

This is the primary data pool, which holds the actual file contents.

```bash
# Create data pool using the ec-2plus2 profile
sudo ceph osd pool create cephfs_data 32 32 erasure ec-2plus2

# Allow overwrites (Mandatory for CephFS)
sudo ceph osd pool set cephfs_data allow_ec_overwrites true

# Enable bulk mode for the primary data pool
sudo ceph osd pool set cephfs_data bulk true
```
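
The flags can be verified before moving on (a quick sanity check; `ceph osd pool get` and `ceph osd pool ls detail` are the standard read-back commands):

```bash
# Should report: allow_ec_overwrites: true
sudo ceph osd pool get cephfs_data allow_ec_overwrites

# Show the pool definition, including the EC profile in use
sudo ceph osd pool ls detail | grep cephfs_data
```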

5. Activate File System

Combine the pools into a single File System.

```bash
# ceph fs new <fs_name> <metadata_pool> <data_pool>
# --force is required because the default data pool is erasure-coded
sudo ceph fs new cephfs cephfs_metadata cephfs_data --force

# Verify status
sudo ceph fs status
```
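
Once `ceph fs status` shows an active MDS, the file system can be mounted from a client. A minimal sketch using the kernel client, assuming the `admin` user's keyring is available on the client and `/mnt/cephfs` as the mount point (adjust both to your environment):

```bash
# Mount the root of the 'cephfs' file system
sudo mkdir -p /mnt/cephfs
sudo mount -t ceph :/ /mnt/cephfs -o name=admin,fs=cephfs
```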

6. Tuning & Capacity

Available Capacity (Why 3.5 TiB?)

Capacity Calculation Details

If ceph fs status shows 3539G instead of 4.0 TiB, this is normal:

  1. Nearfull Ratios: Ceph reserves headroom; the nearfull warning fires at 85% and writes stop entirely at the 95% full ratio.
  2. Metadata Overhead: Space is reserved for journals and indexing.
  3. GiB vs GB: Hard drive manufacturers use Base 10 (TB). Ceph uses Base 2 (TiB). 8.0 TB RAW is actually ~7.27 TiB.
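
The unit arithmetic in point 3 can be checked directly (a sketch; awk is used only for the floating-point division):

```shell
# Base-10 TB vs base-2 TiB: 8 * 10^12 bytes / 2^40 bytes-per-TiB
TIB=$(awk 'BEGIN { printf "%.2f", 8e12 / (2^40) }')
# After the 2+2 EC split (50% efficiency), before reservations and overhead
EC_TIB=$(awk 'BEGIN { printf "%.2f", 8e12 / (2^40) / 2 }')
echo "8 TB raw = ${TIB} TiB; EC-usable ~= ${EC_TIB} TiB"
```

That ~3.64 TiB figure, minus the nearfull headroom and metadata overhead, is what lands near the 3539G reported by `ceph fs status`.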

Metadata Pool Size

Thin Provisioning

Both pools sit on the same physical disks, so AVAIL is thin-provisioned: it reports how much data of that pool's type could fit if nothing else grew. Because metadata is replicated 3x while data is protected by 2+2 EC (50% overhead), the metadata pool's "Available" figure will always show lower than the data pool's.
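
A small sketch of why the two AVAIL figures differ, using an illustrative 6 TiB of free raw space on the shared disks:

```shell
# Same free raw space, divided by each pool's protection overhead
RAW_FREE_TIB=6
META_AVAIL=$(( RAW_FREE_TIB / 3 ))  # replicated size=3 -> 2 TiB
DATA_AVAIL=$(( RAW_FREE_TIB / 2 ))  # EC 2+2 -> 50% -> 3 TiB
echo "metadata AVAIL: ${META_AVAIL} TiB, data AVAIL: ${DATA_AVAIL} TiB"
```

Writing to either pool shrinks both numbers, since they draw from the same raw space.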