* Braindump: path names, partition labels, FHS, auto-discovery
@ 2012-03-06 21:19 Tommi Virtanen
  2012-03-06 22:29 ` Greg Farnum
  2012-03-07  9:55 ` David McBride
  0 siblings, 2 replies; 11+ messages in thread
From: Tommi Virtanen @ 2012-03-06 21:19 UTC (permalink / raw)
  To: ceph-devel

As you may have noticed, the docs [1] and Chef cookbooks [2] currently
use /srv/osd.$id and such paths. That's, shall we say, Not Ideal(tm).

[1] http://ceph.newdream.net/docs/latest/ops/install/mkcephfs/#creating-a-ceph-conf-file
[2] https://github.com/ceph/ceph-cookbooks/blob/master/ceph/recipes/bootstrap_osd.rb#L70


I initially used /srv purely because I needed to get them going quickly,
and that directory was guaranteed to exist. Let's figure out the long-term
goal.

The kinds of things we have:

- configuration, edited by humans (ONLY)
- machine-editable state similar to configuration
- OSD data is typically a dedicated filesystem, accommodate that
- OSD journal can be just about any file, including block devices

OSD journal flexibility is limiting for automation; we should support three
major use cases:

- OSD journal may be fixed-basename file inside osd data directory
- OSD journal may be a file on a shared SSD
- OSD journal may be a block device (e.g. full SSD, partition on SSD,
2nd LUN on the same RAID with different tuning)

Requirements:

- FHS compliant: http://www.pathname.com/fhs/
- works well with Debian and RPM packaging
- OSD creation/teardown is completely automated
- ceph.conf is static for the whole cluster; not edited by per-machine
automation
- we're assuming GPT partitions, at least for now

Desirable things:

- ability to isolate daemons from each other more, e.g.
AppArmor/SELinux/different uids; e.g. do not assume all daemons can
mkdir in the same directory (ceph-mon vs ceph-osd)
- ability to move OSD data disk from server A to server B (e.g.
chassis swap due to faulty mother board)


The Plan (ta-daaa!):

(These will be just the defaults -- if you're hand-rolling your setup,
and disagree, just override them.)

(Apologies if this gets sketchy, I haven't had time to distill these
thoughts into something prettier.)

- FHS says human-editable configuration goes in /etc
- FHS says machine-editable state goes in /var/lib/ceph
- use /var/lib/ceph/mon/$id/ for mon.$id
- use /var/lib/ceph/osd-journal/$id for osd.$id journal; symlink to
actual location
- use /var/lib/ceph/osd-data/$id for osd.$id data; may be a symlink to
actual location?
- embed the same random UUID in osd data & osd journal at ceph-osd
mkfs time, for safety
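
To make those defaults concrete, here is a rough sketch of creating osd.12
by hand under this layout. Purely illustrative: the partition labels, the
use of /dev/disk/by-partlabel, and the exact ceph-osd flags are assumptions
layered on the plan above, not existing tooling.

  # hypothetical sketch; osd id 12 and the device labels are placeholders
  id=12
  mkdir -p /var/lib/ceph/osd-data/$id /var/lib/ceph/osd-journal
  mount /dev/disk/by-partlabel/ceph-osd-data-$id /var/lib/ceph/osd-data/$id
  ln -s /dev/disk/by-partlabel/ceph-osd-journal-$id /var/lib/ceph/osd-journal/$id
  ceph-osd -i $id --mkfs \
      --osd-data /var/lib/ceph/osd-data/$id \
      --osd-journal /var/lib/ceph/osd-journal/$id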

On a disk hot plug event (and at bootup):
- found = {}
- scan the partitions for partition label with the prefix
"ceph-osd-data-". Take the remaining portion as $id and mount the fs
in /var/lib/ceph/osd-data/$id. Add $id to found (TODO handle
pre-existing). if osd-data/$id/journal exists, symlink osd-journal/$id
to it (TODO handle pre-existing).
- scan for partition label with the prefix "ceph-osd-journal-" and
special GUID type. Take the remaining portion as $id and symlink the
block device to /var/lib/ceph/osd-journal/$id. Add $id to found. (TODO
handle pre-existing)
- for each $id in found, if we have both osd-journal and osd-data,
start a ceph-osd for it
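
Roughly, the hook could look like the following shell sketch. This is
illustrative only; it assumes udev exposes the GPT labels under
/dev/disk/by-partlabel and it glosses over the pre-existing/duplicate
handling flagged as TODO above.

  #!/bin/sh
  # sketch of the discovery pass described above; not existing tooling
  mkdir -p /var/lib/ceph/osd-data /var/lib/ceph/osd-journal
  for dev in /dev/disk/by-partlabel/ceph-osd-data-*; do
      [ -e "$dev" ] || continue
      id=${dev##*-}
      mkdir -p /var/lib/ceph/osd-data/$id
      mountpoint -q /var/lib/ceph/osd-data/$id || mount "$dev" /var/lib/ceph/osd-data/$id
      # a journal embedded in the data dir is used unless a dedicated one shows up
      [ -e /var/lib/ceph/osd-data/$id/journal ] && \
          ln -sfn /var/lib/ceph/osd-data/$id/journal /var/lib/ceph/osd-journal/$id
  done
  for dev in /dev/disk/by-partlabel/ceph-osd-journal-*; do
      [ -e "$dev" ] || continue
      ln -sfn "$dev" /var/lib/ceph/osd-journal/${dev##*-}
  done
  # start a ceph-osd for every id that ended up with both data and journal
  for data in /var/lib/ceph/osd-data/*; do
      id=${data##*/}
      [ -e /var/lib/ceph/osd-journal/$id ] && ceph-osd -i "$id"
  done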


Moving journal

As an admin, I want to move an OSD data disk from one physical host
(chassis) to another (e.g. for maintenance of non-hotswap power
supply).
I might have a single SSD, divided into multiple partitions, each
acting as the journal for a single OSD data disk. I want to spread the
load evenly across the rest of the cluster, so I move the OSD data
disks to multiple destination machines, as long as they have 1 slot
free. Naturally, I cannot easily saw the SSD apart and move it
physically.

I would like to be able to:

1. shut down the osd daemon
2. explicitly flush out & invalidate the journal on SSD (after this,
the journal would not be marked with the osd id and fsid anymore)
3. move the HDD
4. on the new host, assign a blank SSD partition and initialize it
with the right fsid etc metadata

It may actually be nicer to think of this as:

1. shut down the osd daemon
2. move the journal inside the osd data dir, invalidate the old one
(flushing it is an optimization)
3. physically move the HDD
4. move the journal from inside the osd data dir to assigned block device
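
A hedged sketch of that flow in commands (the --flush-journal/--mkjournal
spellings and the --osd-journal override are from memory, so verify against
your build; the /etc/init.d/ceph syntax is the one discussed later in this
thread; the id and device paths are placeholders):

  id=12
  /etc/init.d/ceph stop osd.$id                          # 1. shut down the osd daemon
  ceph-osd -i $id --flush-journal                        # 2. drain the external journal...
  ceph-osd -i $id --mkjournal \
      --osd-journal /var/lib/ceph/osd-data/$id/journal   #    ...and recreate it inside the data dir
  # 3. physically move the HDD to the new host, then there:
  rm /var/lib/ceph/osd-data/$id/journal                  # 4. drop the embedded (still empty) journal...
  ceph-osd -i $id --mkjournal \
      --osd-journal /dev/disk/by-partlabel/ceph-osd-journal-$id   #    ...and recreate it on the assigned device
  /etc/init.d/ceph start osd.$id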


* Re: Braindump: path names, partition labels, FHS, auto-discovery
  2012-03-06 21:19 Braindump: path names, partition labels, FHS, auto-discovery Tommi Virtanen
@ 2012-03-06 22:29 ` Greg Farnum
  2012-03-07  9:55 ` David McBride
  1 sibling, 0 replies; 11+ messages in thread
From: Greg Farnum @ 2012-03-06 22:29 UTC (permalink / raw)
  To: Tommi Virtanen; +Cc: ceph-devel

On Tuesday, March 6, 2012 at 1:19 PM, Tommi Virtanen wrote:
> As you may have noticed, the docs [1] and Chef cookbooks [2] currently
> use /srv/osd.$id and such paths. That's, shall we say, Not Ideal(tm).
> 
> [1] http://ceph.newdream.net/docs/latest/ops/install/mkcephfs/#creating-a-ceph-conf-file
> [2] https://github.com/ceph/ceph-cookbooks/blob/master/ceph/recipes/bootstrap_osd.rb#L70
> 
> 
> I initially used /srv purely because I needed to get them going quick,
> and that directory was guaranteed to exist. Let's figure out the long
> term goal.
> 
> The kinds of things we have:
> 
> - configuration, edited by humans (ONLY)
> - machine-editable state similar to configuration
> - OSD data is typically a dedicated filesystem, accommodate that
> - OSD journal can be just about any file, including block devices
> 
> OSD journal flexibility is limiting for automation.. support three
> major use cases:
> 
> - OSD journal may be fixed-basename file inside osd data directory
> - OSD journal may be a file on a shared SSD
> - OSD journal may be a block device (e.g. full SSD, partition on SSD,
> 2nd LUN on the same RAID with different tuning)
> 
> Requirements:
> 
> - FHS compliant: http://www.pathname.com/fhs/
> - works well with Debian and RPM packaging
> - OSD creation/teardown is completely automated
> - ceph.conf is static for the whole cluster; not edited by per-machine
> automation
> - we're assuming GPT partitions, at least for now
> 
> Desirable things:
> 
> - ability to isolate daemons from each other more, e.g.
> AppArmor/SELinux/different uids; e.g. do not assume all daemons can
> mkdir in the same directory (ceph-mon vs ceph-osd)
> - ability to move OSD data disk from server A to server B (e.g.
> chassis swap due to faulty mother board)
> 
> 
> The Plan (ta-daaa!):
> 
> (These will be just the defaults -- if you're hand-rolling your setup,
> and disagree, just override them.)
> 
> (Apologies if this gets sketchy, I haven't had time to distill these
> thoughts into something prettier.)
> 
> - FHS says human-editable configuration goes in /etc
> - FHS says machine-editable state goes in /var/lib/ceph
> - use /var/lib/ceph/mon/$id/ for mon.$id
> - use /var/lib/ceph/osd-journal/$id for osd.$id journal; symlink to
> actual location
> - use /var/lib/ceph/osd-data/$id for osd.$id data; may be a symlink to
> actual location?
> - embed the same random UUID in osd data & osd journal at ceph-osd
> mkfs time, for safety
> 
> On a disk hot plug event (and at bootup):
> - found = {}
> - scan the partitions for partition label with the prefix
> "ceph-osd-data-". Take the remaining portion as $id and mount the fs
> in /var/lib/ceph/osd-data/$id. Add $id to found (TODO handle
> pre-existing). if osd-data/$id/journal exists, symlink osd-journal/$id
> to it (TODO handle pre-existing).
> - scan for partition label with the prefix "ceph-osd-journal-" and
> special GUID type. Take the remaining portion as $id and symlink the
> block device to /var/lib/ceph/osd-journal/$id. Add $id to found. (TODO
> handle pre-existing)
> - for each $id in found, if we have both osd-journal and osd-data,
> start a ceph-osd for it
> 
> 
> Moving journal
> 
> As an admin, I want to move an OSD data disk from one physical host
> (chassis) to another (e.g. for maintenance of non-hotswap power
> supply).
> I might have a single SSD, divided into multiple partitions, each
> acting as the journal for a single OSD data disk. I want to spread the
> load evenly across the rest of the cluster, so I move the OSD data
> disks to multiple destination machines, as long as they have 1 slot
> free. Naturally, I cannot easily saw the SSD apart and move it
> physically.
> 
> I would like to be able to:
> 
> 1. shut down the osd daemon
> 2. explicitly flush out & invalidate the journal on SSD (after this,
> the journal would not be marked with the osd id and fsid anymore)
> 3. move the HDD
> 4. on the new host, assign a blank SSD partition and initialize it
> with the right fsid etc metadata

I have no thoughts on the rest of it, but I believe what you're asking for here is the existing
ceph-osd --flushjournal
Although this doesn't invalidate the existing journal (now, at least), it will let you do prototyping without
much difficulty. :)
-Greg

 
> 
> It may actually be nicer to think of this as:
> 
> 1. shut down the osd daemon
> 2. move the journal inside the osd data dir, invalidate the old one
> (flushing it is an optimization)
> 3. physically move the HDD
> 4. move the journal from inside the osd data dir to assigned block device





* Re: Braindump: path names, partition labels, FHS, auto-discovery
  2012-03-06 21:19 Braindump: path names, partition labels, FHS, auto-discovery Tommi Virtanen
  2012-03-06 22:29 ` Greg Farnum
@ 2012-03-07  9:55 ` David McBride
  2012-03-07 20:54   ` Sage Weil
  1 sibling, 1 reply; 11+ messages in thread
From: David McBride @ 2012-03-07  9:55 UTC (permalink / raw)
  To: Tommi Virtanen; +Cc: ceph-devel

On Tue, 2012-03-06 at 13:19 -0800, Tommi Virtanen wrote:

> - scan the partitions for partition label with the prefix
> "ceph-osd-data-".

Thought: I'd consider not using a numbered partition label as the
primary identifier for an OSD.

There are failure modes that can occur, for example, if you have disks
from multiple different Ceph clusters accessible to a given host, or if
you have a partially failed OSD disk (or a historical copy of one)
accessible at the same time as the current instance.

(Though you might reasonably rule these cases as out-of-scope.)

To make handling cases like these straightforward, I suspect Ceph may
want to use something functionally equivalent to an MD superblock --
though in practice, with an OSD, this could simply be a file containing
the appropriate meta-data.

In fact, I imagine that the OSDs could already contain the necessary
fields -- a reference to their parent cluster's UUID, to ensure foreign
volumes aren't mistakenly mounted; something like mdadm's event-counters
to distinguish between current/historical versions of the same OSD.
(Configuration epoch-count?); a UUID reference to that OSD's journal
file, etc.
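
For illustration only, such a "superblock" really could just be a small
key/value file in the OSD data directory. The file name and field names
below are made up for the sake of the example, not Ceph's actual on-disk
metadata:

  # hypothetical contents of /var/lib/ceph/osd-data/12/superblock
  # (file name, fields and values are all placeholders)
  cluster_uuid: 11111111-2222-3333-4444-555555555555
  osd_uuid:     aaaaaaaa-bbbb-cccc-dddd-eeeeeeeeeeee
  journal_uuid: aaaaaaaa-bbbb-cccc-dddd-eeeeeeeeeeee
  epoch:        42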

 - - -

Perhaps related to this, I've been looking to determine whether it's
feasible to build and configure a Ceph cluster incrementally -- building
an initial cluster containing just a single MON node, and then piecewise
adding additional OSDs / MDSs / MONs to build up to the full-set.

In part, this is so that the processes for initially setting up the
cluster and for expanding the cluster once it's in operation are
identical.  But this is also to avoid needing to hand-maintain a
configuration file, replicated across all hosts, that enumerates all of
the different cluster elements -- replicating a function already handled
better by the MON elements.

I can almost see the ceph.conf file only being used at cluster
initialization-time, then discarded in favour of run-time commands that
update the live cluster state.

Is this practical?  (Or even desirable?)

Cheers,
David
-- 
David McBride <dwm@doc.ic.ac.uk>
Department of Computing, Imperial College, London



* Re: Braindump: path names, partition labels, FHS, auto-discovery
  2012-03-07  9:55 ` David McBride
@ 2012-03-07 20:54   ` Sage Weil
  2012-03-19  9:21     ` Bernard Grymonpon
  0 siblings, 1 reply; 11+ messages in thread
From: Sage Weil @ 2012-03-07 20:54 UTC (permalink / raw)
  To: David McBride; +Cc: Tommi Virtanen, ceph-devel

On Wed, 7 Mar 2012, David McBride wrote:
> On Tue, 2012-03-06 at 13:19 -0800, Tommi Virtanen wrote:
> 
> > - scan the partitions for partition label with the prefix
> > "ceph-osd-data-".
> 
> Thought: I'd consider not using a numbered partition label as the
> primary identifier for an OSD.
> 
> There are failure modes that can occur, for example, if you have disks
> from multiple different Ceph clusters accessible to a given host, or if
> you have a partially failed (or historical copy) of an OSD disk
> accessible at the same time as a current instance.
> 
> (Though you might reasonably rule these cases as out-of-scope.)
> 
> To make handling cases like these straightforward, I suspect Ceph may
> want to use something functionally equivalent to an MD superblock --
> though in practice, with an OSD, this could simply be a file containing
> the appropriate meta-data.
> 
> In fact, I imagine that the OSDs could already contain the necessary
> fields -- a reference to their parent cluster's UUID, to ensure foreign
> volumes aren't mistakenly mounted; something like mdadm's event-counters
> to distinguish between current/historical versions of the same OSD.
> (Configuration epoch-count?); a UUID reference to that OSD's journal
> file, etc.

We're mostly there.  Each cluster has a uuid, and each ceph-osd instance 
gets a uuid when you do ceph-osd --mkfs.  That uuid is recorded in the osd 
data dir and in the journal, so you know that they go together.  

I think the 'epoch count' type stuff is sort of subsumed by all the osdmap 
versioning and so forth... are you imagining a duplicate/backup instance 
of an osd drive getting plugged in or something?  We don't guard for 
that, but I'm not sure offhand how we would.  :/

Anyway, I suspect the missing piece here is to incorporate the uuids into 
the path names somehow.  

TV wrote:
> - FHS says human-editable configuration goes in /etc
> - FHS says machine-editable state goes in /var/lib/ceph
> - use /var/lib/ceph/mon/$id/ for mon.$id
> - use /var/lib/ceph/osd-journal/$id for osd.$id journal; symlink to
> actual location
> - use /var/lib/ceph/osd-data/$id for osd.$id data; may be a symlink to
> actual location?

I wonder if these should be something like

 /var/lib/ceph/$cluster_uuid/mon/$id
 /var/lib/ceph/$cluster_uuid/osd-data/$osd_uuid.$id
 /var/lib/ceph/$cluster_uuid/osd-journal/$osd_uuid.$id

so that cluster instances don't stomp on one another.  OTOH, that would 
imply that we should do something like

 /etc/ceph/$cluster_uuid/ceph.conf, keyring, etc.

too.


>  - - -
> 
> Perhaps related to this, I've been looking to determine whether it's
> feasible to build and configure a Ceph cluster incrementally -- building
> an initial cluster containing just a single MON node, and then piecewise
> adding additional OSDs / MDSs / MONs to build up to the full-set.
> 
> In part, this is so that the processes for initially setting up the
> cluster and for expanding the cluster once its in operation are
> identical.  But this is also to avoid needing to hand-maintain a
> configuration file, replicated across all hosts, that enumerates all of
> the different cluster elements -- replicating a function already handled
> better by the MON elements.
> 
> I can almost see the ceph.conf file only being used at cluster
> initialization-time, then discarded in favour of run-time commands that
> update the live cluster state.
> 
> Is this practical?  (Or even desirable?)

This is exactly what the eventual chef/juju/etc building blocks will do.  
The tricky part is really the monitor cluster bootstrap (because you may 
have 3 of them coming up in parallel, and they need to form an initial 
quorum in a safe/sane way).  Once that happens, expanding the cluster is 
pretty mechanical.

The goal is to provide building blocks (simple scripts, hooks, whatever) 
for doing things like mapping a new block device to the proper location, 
starting up the appropriate ceph-osd, initializing/labeling a new device, 
creating a new ceph-osd on it and adding it to the cluster, etc.  The 
chef/juju/whatever scripts would then build on the common set of tools.

Most of the pieces are worked out in TV's head or mine, but we haven't had 
time to put it all together.  First we need to get our new qa hardware 
online..

sage


* Re: Braindump: path names, partition labels, FHS, auto-discovery
  2012-03-07 20:54   ` Sage Weil
@ 2012-03-19  9:21     ` Bernard Grymonpon
  2012-03-20  7:25       ` Sage Weil
  2012-03-27 18:21       ` Tommi Virtanen
  0 siblings, 2 replies; 11+ messages in thread
From: Bernard Grymonpon @ 2012-03-19  9:21 UTC (permalink / raw)
  To: ceph-devel

Sage Weil <sage <at> newdream.net> writes:

> 
> On Wed, 7 Mar 2012, David McBride wrote:
> > On Tue, 2012-03-06 at 13:19 -0800, Tommi Virtanen wrote:
> > 
> > > - scan the partitions for partition label with the prefix
> > > "ceph-osd-data-".
> > 
> > Thought: I'd consider not using a numbered partition label as the
> > primary identifier for an OSD.
> > 

<snip>

> > To make handling cases like these straightforward, I suspect Ceph may
> > want to use something functionally equivalent to an MD superblock --
> > though in practice, with an OSD, this could simply be a file containing
> > the appropriate meta-data.
> > 
> > In fact, I imagine that the OSDs could already contain the necessary
> > fields -- a reference to their parent cluster's UUID, to ensure foreign
> > volumes aren't mistakenly mounted; something like mdadm's event-counters
> > to distinguish between current/historical versions of the same OSD.
> > (Configuration epoch-count?); a UUID reference to that OSD's journal
> > file, etc.
> 
> We're mostly there.  Each cluster has a uuid, and each ceph-osd instance 
> gets a uuid when you do ceph-osd --mkfs.  That uuid is recorded in the osd 
> data dir and in the journal, so you know that they go together.  
> 
> I think the 'epoch count' type stuff is sort of subsumed by all the osdmap 
> versioning and so forth... are you imagining a duplicate/backup instance 
> of an osd drive getting plugged in or something?  We don't guard for 
> that, but I'm not sure offhand how we would.  :/
> 
> Anyway, I suspect the missing piece here is to incorporate the uuids into 
> the path names somehow.  

I would discourage using the disk labels, as you might not always be able to
set these (consider imported LUNs from other storage boxes, or internal
regulations on labeling disks...). I would trust the sysadmin to know which
mounts go where to get everything in place (he can use the labels in his
fstab or some clever boot script himself), and then use the ceph metadata to
start only "sane" OSDs/MONs/...

In my opinion, an OSD should be able to figure out by itself whether it has a
"good" dataset to "boot" with - and it is up to the mon to either accept or
reject this OSD as a good/valid part of the cluster, or decide that it needs
re-syncing.

> TV wrote:
> > - FHS says human-editable configuration goes in /etc
> > - FHS says machine-editable state goes in /var/lib/ceph
> > - use /var/lib/ceph/mon/$id/ for mon.$id
> > - use /var/lib/ceph/osd-journal/$id for osd.$id journal; symlink to
> > actual location
> > - use /var/lib/ceph/osd-data/$id for osd.$id data; may be a symlink to
> > actual location?
> 
> I wonder if these should be something like
> 
>  /var/lib/ceph/$cluster_uuid/mon/$id
>  /var/lib/ceph/$cluster_uuid/osd-data/$osd_uuid.$id
>  /var/lib/ceph/$cluster_uuid/osd-journal/$osd_uuid.$id

The numbering of the MONs/OSDs is a bit of a hassle now; best would be (in my
opinion):

/var/lib/ceph/$cluster_uuid/osd/$osd_uuid/data
/var/lib/ceph/$cluster_uuid/osd/$osd_uuid/journal
/var/lib/ceph/$cluster_uuid/mon/$mon_uuid/

Journal and data go together for the OSD - so no need to split these at a
lower level. One can't have an OSD without both, so it seems fair to put them
next to each other...


> so that cluster instances don't stomp on one another.  OTOH, that would 
> imply that we should do something like
> 
>  /etc/ceph/$cluster_uuid/ceph.conf, keyring, etc.

Ack, although at cluster creation the cluster_uuid is unknown, which creates a
bit of a chicken-and-egg situation.



> > Perhaps related to this, I've been looking to determine whether it's
> > feasible to build and configure a Ceph cluster incrementally -- building
> > an initial cluster containing just a single MON node, and then piecewise
> > adding additional OSDs / MDSs / MONs to build up to the full-set.

This would be ideal - especially for use in chef (and probably other
deployment automation tools).

> > 
> > In part, this is so that the processes for initially setting up the
> > cluster and for expanding the cluster once its in operation are
> > identical.  But this is also to avoid needing to hand-maintain a
> > configuration file, replicated across all hosts, that enumerates all of
> > the different cluster elements -- replicating a function already handled
> > better by the MON elements.
> > 
> > I can almost see the ceph.conf file only being used at cluster
> > initialization-time, then discarded in favour of run-time commands that
> > update the live cluster state.
> > 
> > Is this practical?  (Or even desirable?)
> 
> This is exactly what the eventual chef/juju/etc building blocks will do.  
> The tricky part is really the monitor cluster bootstrap (because you may 
> have 3 of them coming up in parallel, and they need to form an initial 
> quorum in a safe/sane way).  Once that happens, expanding the cluster is 
> pretty mechanical.
> 
> The goal is to provide building blocks (simple scripts, hooks, whatever) 
> for doing things like mapping a new block device to the proper location, 
> starting up the appropriate ceph-osd, initializing/labeling a new device, 
> creating a new ceph-osd on it and adding it to the cluster, etc.  The 
> chef/juju/whatever scripts would then build on the common set of tools.
> 
> Most of the pieces are worked out in TV's head or mine, but we haven't had 
> time to put it all together.  First we need to get our new qa hardware 
> online..

As I've been constructing some cookbooks to set up a default cluster, this is
what I bumped into:

- the numbering (0, 1, ...) of the OSDs and their need to keep the same number
  throughout the lifetime of the cluster is a bit of a hassle. Each OSD needs
  to have a complete view of all the components of the cluster before it can
  determine its own ID. A random, auto-generated UUID would be nice (I
  currently solved this by assigning each cluster a global "clustername",
  searching the chef server for all nodes, looking for the highest indexed
  OSD, and incrementing this to determine the new OSD's index - there must be
  a better way).

- the configfile needs to be the same on all hosts - which is only partially
  true. From my point of view, an OSD should only have some way of contacting
  one mon, which would inform the OSD of the cluster layout. So, only the
  mon info should be there (together with the info for the OSD itself,
  obviously)

- there is a chicken-and-egg problem in the authentication of an OSD to the
  mon. An OSD should have permission to join the mon, for which we need to
  add the OSD to the mon. As chef works on the node, and can't trigger stuff
  on other nodes, the node that will hold the OSD needs some way of
  authenticating itself to the mon (I solved this by storing the
  "client.admin" secret on the mon node, then pulling it from there on the
  osd node and using it to register myself with the mon. It is like putting a
  copy of your house key on your front door...). I see no obvious solution
  here.

- the current (debian) start/stop scripts are a hassle to work with, as chef
  doesn't understand the third parameter (/etc/init.d/ceph start mon.0). Each
  mon / osd / ... should have its own start/stop script.

- there should be some way to ask a local running OSD/MON for its status,
  without having to go through the monitor-nodes. Sort of "ceph-local-daemon
  --uuid=xxx --type=mon status", which would inform us if it is running,
  healthy, part of the cluster, lost in space...

- growing the cluster bit by bit would be ideal; this is how chef works (it
  handles nodes one by one, not a bunch of nodes in one go)

- ideally, there would be an automatic crushmap-expansion command which would
  add a device to an existing crushmap (or remove one). Now, the crushmap
  needs to be reconstructed completely, and if your numbering changes somehow,
  you're screwed. Ideal would be "take the current crushmap and add OSD with
  uuid xxx" - "take the current crushmap and remove OSD xxx"

Just my thoughts! I've been following the ceph project for a while now, set up
a couple of test clusters in the past and over the last two weeks, and made
the cookbooks to make my life easier (and bumped into a lot of ops trouble
doing this...).

Rgds,
Bernard







* Re: Braindump: path names, partition labels, FHS, auto-discovery
  2012-03-19  9:21     ` Bernard Grymonpon
@ 2012-03-20  7:25       ` Sage Weil
  2012-03-20  7:55         ` Bernard Grymonpon
  2012-03-27 11:29         ` David McBride
  2012-03-27 18:21       ` Tommi Virtanen
  1 sibling, 2 replies; 11+ messages in thread
From: Sage Weil @ 2012-03-20  7:25 UTC (permalink / raw)
  To: Bernard Grymonpon; +Cc: ceph-devel

On Mon, 19 Mar 2012, Bernard Grymonpon wrote:
> Sage Weil <sage <at> newdream.net> writes:
> 
> > 
> > On Wed, 7 Mar 2012, David McBride wrote:
> > > On Tue, 2012-03-06 at 13:19 -0800, Tommi Virtanen wrote:
> > > 
> > > > - scan the partitions for partition label with the prefix
> > > > "ceph-osd-data-".
> > > 
> > > Thought: I'd consider not using a numbered partition label as the
> > > primary identifier for an OSD.
> > > 
> 
> <snip>
> 
> > > To make handling cases like these straightforward, I suspect Ceph may
> > > want to use something functionally equivalent to an MD superblock --
> > > though in practice, with an OSD, this could simply be a file containing
> > > the appropriate meta-data.
> > > 
> > > In fact, I imagine that the OSDs could already contain the necessary
> > > fields -- a reference to their parent cluster's UUID, to ensure foreign
> > > volumes aren't mistakenly mounted; something like mdadm's event-counters
> > > to distinguish between current/historical versions of the same OSD.
> > > (Configuration epoch-count?); a UUID reference to that OSD's journal
> > > file, etc.
> > 
> > We're mostly there.  Each cluster has a uuid, and each ceph-osd instance 
> > gets a uuid when you do ceph-osd --mkfs.  That uuid is recorded in the osd 
> > data dir and in the journal, so you know that they go together.  
> > 
> > I think the 'epoch count' type stuff is sort of subsumed by all the osdmap 
> > versioning and so forth... are you imagining a duplicate/backup instance 
> > of an osd drive getting plugged in or something?  We don't guard for 
> > that, but I'm not sure offhand how we would.  :/
> > 
> > Anyway, I suspect the missing piece here is to incorporate the uuids into 
> > the path names somehow.  
> 
> I would discourage using the disk-labels, as you might not always be able to
> set these (consider imported luns from other storage boxes, or some internal
> regulations in labeling disks...). I would trust the sysadmin to know which
> mounts go where to get everything in place (he himself can use the labels in
> his fstab or some clever bootscript), and then use the ceph-metadata to start
> only "sane" OSDs/MONs/...

The goal is to make this optional.  I.e., provide tools to use via udev 
to mount disks in good locations based on labels, but not require them if 
the sysadmin has some other idea about how it should be done.

Ideally, the start/stop scripts should be able to look in /var/lib/ceph 
and start daemons for whatever it sees there that looks sane.

> In my opinion, a OSD should be able to figure out himself if he has a "good"
> dataset to "boot" with - and it is up to the mon to either reject or accept
> this OSD as a good/valid part of the cluster, or if it needs re-syncing.

Yes.
 
> > TV wrote:
> > > - FHS says human-editable configuration goes in /etc
> > > - FHS says machine-editable state goes in /var/lib/ceph
> > > - use /var/lib/ceph/mon/$id/ for mon.$id
> > > - use /var/lib/ceph/osd-journal/$id for osd.$id journal; symlink to
> > > actual location
> > > - use /var/lib/ceph/osd-data/$id for osd.$id data; may be a symlink to
> > > actual location?
> > 
> > I wonder if these should be something like
> > 
> >  /var/lib/ceph/$cluster_uuid/mon/$id
> >  /var/lib/ceph/$cluster_uuid/osd-data/$osd_uuid.$id
> >  /var/lib/ceph/$cluster_uuid/osd-journal/$osd_uuid.$id
> 
> The numbering of the MON/OSD's is a bit a hassle now, best would be (in my
> opinion)
> 
> /var/lib/ceph/$cluster_uuid/osd/$osd_uuid/data
> /var/lib/ceph/$cluster_uuid/osd/$osd_uuid/journal
> /var/lib/ceph/$cluster_uuid/osd/$mon_uuid/
> 
> Journal and data go together for the OSD - so no need to split these on a
> lower level. One can't have a OSD without both, so seems fair to put them next
> to each other...

Currently the ceph-osd is told which id to be on startup; the only real 
shift here would be to let you specify some uuids instead and have it pull 
its rank (id) out of the .../whoami file.

Monitors have user-friendly names ('foo', `hostname`).  We could add uuids 
there too, but I'm less sure how useful that'll be.

> > so that cluster instances don't stomp on one another.  OTOH, that would 
> > imply that we should do something like
> > 
> >  /etc/ceph/$cluster_uuid/ceph.conf, keyring, etc.
> 
> Ack, although at cluster creation, the cluster_uuid is unknown, which kind of
> gives a chicken-egg situation.

Making the mkfs process take the cluster_uuid as input is easy, although 
it makes it possible for a bad sysadmin to share a uuid across clusters.


> As I've been constructing some cookbooks to setup a default cluster, this is
> what I bumped into:
> 
> - the numbering (0, 1, ...) of the OSDs and their need to keep the same number
>   throughout the lifetime of the cluster is a bit a hassle. Each OSD needs to
>   have a complete view of all the components of the cluster before it can
>   determine it's own ID. A random, auto-generated UUID would be nice (I
>   currently solved this by assigning each cluster a global "clustername", and
>   search the chef server for all nodes, look for the highest indexed OSDs, and
>   increment this to determine the new OSD's index - there must be a better
>   way).

The 'ceph osd create' command will handle the allocation of a new unique 
id for you.  We could supplement that with a uuid to make it a bit more 
robust (if we add the osd uuids to the osdmap... probably a good idea 
anyway).
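
For instance, a cookbook can capture the newly allocated id directly
(sketch; assumes a suitably privileged key is available on the node):

  # 'ceph osd create' prints the id it allocated on stdout
  osd_id=$(ceph osd create)
  echo "allocated osd.$osd_id"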

> - the configfile needs to be the same on all hosts - which is only partially
>   true. From my point of view, a OSD should only have some way of contacting
>   one mon, which would inform the OSD of the cluster layout. So, only the
>   mon-info should be there (together with the info for the OSD itself,
>   obviously)

It doesn't, actually; it's only needed to bootstrap (to find the monitor(s) 
on startup) and to set any config values that are non-default.  The 
current start/stop script wants to see the local instances there, but that 
can be replaced by looking for directories in /var/lib/ceph/.

> - there is a chicken-egg problem in the authentication of a osd to the mon. An
>   OSD should have permission to join the mon, for which we need to add the OSD
>   to the mon. As chef works on the node, and can't trigger stuff on other
>   nodes, the node that will hold the OSD needs some way of authenticating
>   itself to the mon (I solved this by storing the "client.admin" secret on the
>   mon-node, and then pulling this from there on the osd node, and using it to
>   register myself to the mon. It is like putting a copy of your homekey on
>   your front door...). I see no obvious solution here.

We've set up a special key that has permission to create new osds only, 
but again it's pretty bad security.  Chef's model just doesn't work well 
here.

> - the current (debian) start/stop scripts are a hassle to work with, as chef
>   doesn't understand the third parameter (/etc/init.d/ceph start mon.0). Each
>   mon / osd / ... should have its own start/stop script.
> 
> - there should be some way to ask a local running OSD/MON for its status,
>   without having to go through the monitor-nodes. Sort of "ceph-local-daemon
>   --uuid=xxx --type=mon status", which would inform us if it is running,
>   healthy, part of the cluster, lost in space...

Each daemon has a socket in /var/run/ceph to communicate with it; adding a 
health command would be pretty straightforward.
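
For example (the socket file name and the command set vary by version;
'help' lists what a given daemon actually supports):

  ceph --admin-daemon /var/run/ceph/ceph-osd.0.asok help
  ceph --admin-daemon /var/run/ceph/ceph-osd.0.asok version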

> - growing the cluster bit by bit would be ideal, this is how chef works (it
>   handles node per node, not a bunch of nodes in one go) 

This works now, with the exception of monitor cluster bootstrap being 
awkward.

> - ideal, there would be a automatic-crushmap-expansion command which would add
>   a device to an existing crushmap (or remove one). Now, the crushmap needs to
>   be reconstructed completely, and if your numbering changes somehow, you're
>   screwed. Ideal would be "take the current crushmap and add OSD with uuid
>   xxx" - "take the current crushmap and remove OSD xxx"

You can do this now, too:

 ceph osd crush add <$osdnum> <osd.$osdnum> <weight> host=foo rack=bar [...]

The crush map has an alphanumeric name that crush ignores (at least for 
devices), although osd.$num is what we generate by default.  The 
keys/values are crush types for other levels of the hierarchy, so that you 
can specify where in the tree the new item should be placed.
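
For example, dropping a new osd.12 into a specific host and rack slot (the
names after the weight are placeholders for your own hierarchy):

 ceph osd crush add 12 osd.12 1.0 host=node3 rack=r2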


The questions for me now are what we should use for default locations and 
document as best practice.

 - do we want $cluster_uuid all over the place?
 - should we allow osds to be started by $uuid instead of rank?
 - is it sufficient for init scripts to blindly start everything in 
   /var/lib/ceph, or do we need equivalent functionality to the 'auto 
   start = false' in ceph.conf (that Wido is using)?
 - is a single init script still appropriate, or do we want something 
   better?  (I'm not very familiar with the new best practices for upstart 
   or systemd for multi-instance services like this.)
 - uuids for monitors?
 - osd uuids in osdmap?

sage


* Re: Braindump: path names, partition labels, FHS, auto-discovery
  2012-03-20  7:25       ` Sage Weil
@ 2012-03-20  7:55         ` Bernard Grymonpon
  2012-03-27 11:29         ` David McBride
  1 sibling, 0 replies; 11+ messages in thread
From: Bernard Grymonpon @ 2012-03-20  7:55 UTC (permalink / raw)
  To: ceph-devel


On 20 Mar 2012, at 08:25, Sage Weil wrote:

> On Mon, 19 Mar 2012, Bernard Grymonpon wrote:
>> Sage Weil <sage <at> newdream.net> writes:
>> 
>>> 
>>> On Wed, 7 Mar 2012, David McBride wrote:
>>>> On Tue, 2012-03-06 at 13:19 -0800, Tommi Virtanen wrote:
>>>> 
>>>>> - scan the partitions for partition label with the prefix
>>>>> "ceph-osd-data-".
>>>> 
>>>> Thought: I'd consider not using a numbered partition label as the
>>>> primary identifier for an OSD.
>>>> 
>> 
>> <snip>
>> 
>>>> To make handling cases like these straightforward, I suspect Ceph may
>>>> want to use something functionally equivalent to an MD superblock --
>>>> though in practice, with an OSD, this could simply be a file containing
>>>> the appropriate meta-data.
>>>> 
>>>> In fact, I imagine that the OSDs could already contain the necessary
>>>> fields -- a reference to their parent cluster's UUID, to ensure foreign
>>>> volumes aren't mistakenly mounted; something like mdadm's event-counters
>>>> to distinguish between current/historical versions of the same OSD.
>>>> (Configuration epoch-count?); a UUID reference to that OSD's journal
>>>> file, etc.
>>> 
>>> We're mostly there.  Each cluster has a uuid, and each ceph-osd instance 
>>> gets a uuid when you do ceph-osd --mkfs.  That uuid is recorded in the osd 
>>> data dir and in the journal, so you know that they go together.  
>>> 
>>> I think the 'epoch count' type stuff is sort of subsumed by all the osdmap 
>>> versioning and so forth... are you imagining a duplicate/backup instance 
>>> of an osd drive getting plugged in or something?  We don't guard for 
>>> that, but I'm not sure offhand how we would.  :/
>>> 
>>> Anyway, I suspect the missing piece here is to incorporate the uuids into 
>>> the path names somehow.  
>> 
>> I would discourage using the disk-labels, as you might not always be able to
>> set these (consider imported luns from other storage boxes, or some internal
>> regulations in labeling disks...). I would trust the sysadmin to know which
>> mounts go where to get everything in place (he himself can use the labels in
>> his fstab or some clever bootscript), and then use the ceph-metadata to start
>> only "sane" OSDs/MONs/...
> 
> The goal is to make this optional.  I.e., provide tools to use via udev 
> to mount disks in good locations based on labels, but not require them if 
> the sysadmin has some other idea about how it should be done.
> 
> Ideally, the start/stop scripts should be able to look in /var/lib/ceph 
> and start daemons for whatever it sees there that looks sane.

Start/stop scripts should not be that intelligent in my opinion - a start/stop script should just start or stop whatever it is told to start/stop (usually through a simple config file, pointing to the correct folders). If a sysadmin decides to make a backup copy of some data in /var/lib/ceph, it should not result in suddenly spawning new instances... 

Also, a start/stop script should clearly start/stop one specific osd/mon... you don't want to restart/start/stop each and every daemon every time (and an optional third parameter to an init script is uncommon).

> 
>> In my opinion, a OSD should be able to figure out himself if he has a "good"
>> dataset to "boot" with - and it is up to the mon to either reject or accept
>> this OSD as a good/valid part of the cluster, or if it needs re-syncing.
> 
> Yes.
> 
>>> TV wrote:
>>>> - FHS says human-editable configuration goes in /etc
>>>> - FHS says machine-editable state goes in /var/lib/ceph
>>>> - use /var/lib/ceph/mon/$id/ for mon.$id
>>>> - use /var/lib/ceph/osd-journal/$id for osd.$id journal; symlink to
>>>> actual location
>>>> - use /var/lib/ceph/osd-data/$id for osd.$id data; may be a symlink to
>>>> actual location?
>>> 
>>> I wonder if these should be something like
>>> 
>>>  /var/lib/ceph/$cluster_uuid/mon/$id
>>>  /var/lib/ceph/$cluster_uuid/osd-data/$osd_uuid.$id
>>>  /var/lib/ceph/$cluster_uuid/osd-journal/$osd_uuid.$id
>> 
>> The numbering of the MON/OSD's is a bit a hassle now, best would be (in my
>> opinion)
>> 
>> /var/lib/ceph/$cluster_uuid/osd/$osd_uuid/data
>> /var/lib/ceph/$cluster_uuid/osd/$osd_uuid/journal
>> /var/lib/ceph/$cluster_uuid/osd/$mon_uuid/
>> 
>> Journal and data go together for the OSD - so no need to split these on a
>> lower level. One can't have a OSD without both, so seems fair to put them next
>> to each other...
> 
> Currently the ceph-osd it told which id to be on startup; the only real 
> shift here would be to let you specify some uuids instead and have it pull 
> it's rank (id) out of the .../whoami file.
> 
> Monitors have user-friendly names ('foo', `hostname`).  We could add uuids 
> there too, but I'm less sure how useful that'll be.

Consistency would be the word you're looking for... both in ceph and in the storage field. Storage ops people are used to long random strings identifying parts (LUNs, identifiers, ...). 

Allowing the sysadmin to specify the UUIDs themselves would give the best of both worlds: lazy admins use the generated UUIDs, others generate their own (I can imagine that having the node identified by its hostname, or some other label, might be useful in a 10+ node cluster...). 

> 
>>> so that cluster instances don't stomp on one another.  OTOH, that would 
>>> imply that we should do something like
>>> 
>>> /etc/ceph/$cluster_uuid/ceph.conf, keyring, etc.
>> 
>> Ack, although at cluster creation, the cluster_uuid is unknown, which kind of
>> gives a chicken-egg situation.
> 
> Making the mkfs process take the cluster_uuid as input is easy, although 
> it makes it possible for a bad sysadmin to share a uuid across clusters.

Don't care for bad sysadmins :)

> 
> 
>> As I've been constructing some cookbooks to setup a default cluster, this is
>> what I bumped into:
>> 
>> - the numbering (0, 1, ...) of the OSDs and their need to keep the same number
>>  throughout the lifetime of the cluster is a bit a hassle. Each OSD needs to
>>  have a complete view of all the components of the cluster before it can
>>  determine it's own ID. A random, auto-generated UUID would be nice (I
>>  currently solved this by assigning each cluster a global "clustername", and
>>  search the chef server for all nodes, look for the highest indexed OSDs, and
>>  increment this to determine the new OSD's index - there must be a better
>>  way).
> 
> The 'ceph osd create' command will handle the allocation of a new unique 
> id for you.  We could supplement that with a uuid to make it a bit more 
> robust (if we add the osd uuids to the osdmap... probbaly a good idea 
> anyway).

For this to work, you need a connection to the monitor(s), which gives security issues and makes the creation of an OSD a two-node operation. An OSD should generate a UUID itself, and that is its one and only identifier. Once joined in a cluster for the first time, it might record the cluster uuid in its metadata. If the uuid of the osd clashes with an existing uuid, the mon should reject it.

> 
>> - the configfile needs to be the same on all hosts - which is only partially
>>  true. From my point of view, a OSD should only have some way of contacting
>>  one mon, which would inform the OSD of the cluster layout. So, only the
>>  mon-info should be there (together with the info for the OSD itself,
>>  obviously)
> 
> It doesn't, actually; it's only need to bootstrap (to find the monitor(s) 
> on startup) and to set any config values that are non-default.  The 
> current start/stop script wants to see the loca instances there, but that 
> can be replaced by looking for directories in /var/lib/ceph/.
> 
>> - there is a chicken-egg problem in the authentication of a osd to the mon. An
>>  OSD should have permission to join the mon, for which we need to add the OSD
>>  to the mon. As chef works on the node, and can't trigger stuff on other
>>  nodes, the node that will hold the OSD needs some way of authenticating
>>  itself to the mon (I solved this by storing the "client.admin" secret on the
>>  mon-node, and then pulling this from there on the osd node, and using it to
>>  register myself to the mon. It is like putting a copy of your homekey on
>>  your front door...). I see no obvious solution here.
> 
> We've set up a special key that has permission to create new osds only, 
> but again it's pretty bad security.  Chef's model just doesn't work well 
> here.

There will always be some sort of "master key" for the cluster to create/accept new instances (either this, or no security at all). I don't see a way around it (or you will give up parts of the security model).

The more I think about it, all of the security between mons and osds is a bit strange - most of the time your storage cluster will be on an isolated, dedicated network (the private network/public network parameters do this already). Security and rights towards the client nodes are still needed, though... 

> 
>> - the current (debian) start/stop scripts are a hassle to work with, as chef
>>  doesn't understand the third parameter (/etc/init.d/ceph start mon.0). Each
>>  mon / osd / ... should have its own start/stop script.
>> 
>> - there should be some way to ask a local running OSD/MON for its status,
>>  without having to go through the monitor-nodes. Sort of "ceph-local-daemon
>>  --uuid=xxx --type=mon status", which would inform us if it is running,
>>  healthy, part of the cluster, lost in space...
> 
> Each daemon has a socket in /var/run/ceph to communicate with it; adding a 
> health command would be pretty straightforward.
> 
>> - growing the cluster bit by bit would be ideal, this is how chef works (it
>>  handles node per node, not a bunch of nodes in one go) 
> 
> This works now, with the exception of monitor cluster bootstrap being 
> awkward.

How is the initial number of PGs determined? If you start with no OSDs and add them, do the PGs grow?

> 
>> - ideal, there would be a automatic-crushmap-expansion command which would add
>>  a device to an existing crushmap (or remove one). Now, the crushmap needs to
>>  be reconstructed completely, and if your numbering changes somehow, you're
>>  screwed. Ideal would be "take the current crushmap and add OSD with uuid
>>  xxx" - "take the current crushmap and remove OSD xxx"
> 
> You can do this now, too:
> 
> ceph osd crush add <$osdnum> <osd.$osdnum> <weight> host=foo rack=bar [...]
> 
> The crush map has a alphanumeric name that crush ignores (at least for 
> devices), although osd.$num is what we generate by default.  The 
> keys/values are crush types for other levels of the hierarchy, so that you 
> can specify where in the tree the new item should be placed.

Nice, I'll have a look at this later this week.

> 
> The questions for me now are what we should use for default locations and 
> document as best practice.
> 
> - do we want $cluster_uuid all over the place?

In my opinion - no. I don't see a single machine serving two clusters at once; only in very special test cases might that happen. If an OSD knows which cluster it belongs to (and records this in its metadata), that would be fine.

> - should we allow osds to be started by $uuid instead of rank?

Yes, please. Numbering things is a pain if you don't have/control all the nodes at once. 

> - is it sufficient for init scripts to blindly start everything in 
>   /var/lib/ceph, or do we need equivalent functionality to the 'auto 
>   start = false' in ceph.conf (that Wido is using)?
> - is a single init script still appropriate, or do we want something 
>   better?  (I'm not very familiar with the new best practices for upstart 
>   or systemd for multi-instance services like this.)

Start/stop scripts should be stupid in my opinion - see above. 

> - uuids for monitors?

yes

> - osd uuids in osdmap?

Yes, lose the "rank" completely if possible.

Rgds,
Bernard

> 
> sage
> 



* Re: Braindump: path names, partition labels, FHS, auto-discovery
  2012-03-20  7:25       ` Sage Weil
  2012-03-20  7:55         ` Bernard Grymonpon
@ 2012-03-27 11:29         ` David McBride
  2012-03-27 19:40           ` Sage Weil
  1 sibling, 1 reply; 11+ messages in thread
From: David McBride @ 2012-03-27 11:29 UTC (permalink / raw)
  To: Sage Weil; +Cc: Ceph Development Mailing List

On Tue, 2012-03-20 at 00:25 -0700, Sage Weil wrote:

> Currently the ceph-osd it told which id to be on startup; the only real 
> shift here would be to let you specify some uuids instead and have it pull 
> it's rank (id) out of the .../whoami file.

I'm increasingly coming to believe that an OSD's rank should not be
exposed to the admin/user.  While it's clearly important as an internal
implementation detail, I can't currently see a reason why the admin
needs to know an OSD's rank, or why it can't (in principle) be
dynamically managed on the administrator's behalf.

A non-trivial part of the complexity of (re-)configuring a running
cluster as nodes are added and removed is the correct numbering of OSDs.

At the moment I'm still experimenting -- I don't know what's supposed to
happen when low-numbered OSDs are removed; do all the existing ones
renumber?  Or do you get fragmentation in the OSD number space?

If it's the former, then the rank of an OSD is metadata that can change
during its lifetime -- meaning that it's probably not a good idea to use
it in path-names, for example.

I suspect that using UUIDs and/or human-readable labels to refer to OSDs
is probably going to be superior to using the OSDs' rank.

> We've set up a special key that has permission to create new osds only, 
> but again it's pretty bad security.  Chef's model just doesn't work well 
> here.

If I understand the model correctly:

 - Each individual daemon generates its own keys.  This is a secret key 
   in the symmetric-cryptography sense.
 - Each daemon needs to have its keys registered with the MON cluster.
   (The MON operates as a centrally-trusted key distribution centre.)
 - To register keys with the MON cluster, a cluster administrative 
   privilege is required.
 - The registration process also updates the effective access-control 
   privilege set associated with that key.

Two thoughts:

 1. I suspect that some small changes to the bootstrap procedure could 
    result in a more streamlined process:

     - OSDs stop being responsible for generating their own keys.  
       Their keys are instead generated by a MON node and are stored in 
       the MON cluster.  As a result, the problem to be solved changes: 

         * before, an OSD needed to have the necessary privileges to
           update the cluster-wide configuration; 
         * now, the MON node only needs to have the necessary privileges
           to install an OSD's secret key on that OSD's host.

     - It should then be straightforward to set up a privileged, 
       automated process -- probably on a MON node -- to stage a copy
       of an OSD's secret key to the appropriate location on that OSD's 
       host.

       (This assumes that an OSD's host can be automatically determined 
       and authenticated using some existing means (SSH keys, Kerberos, 
       etc.) -- which I'd expect to be the case for any non-trivial 
       installation.)

     - This automated process could be triggered by the OSD   
       installation process -- either by an out-of-band script, or 
       conceivably by a key-solicitation message sent in-band by the 
       OSD itself.

 2. This sounds very similar to the model for Kerberos 5, as used in 
    the MIT Kerberos implementation and Microsoft's Active Directory.  
    It might be an interesting (future!) project to see how difficult 
    it would be to modify the Ceph daemons and protocols to use 
    Kerberos-based authentication as an alternative to cephx, possibly 
    via the GSS-API.

Aha, I note that this is probably not a surprise -- the 0.18 
release-notes, for example, note the similarity between cephx and
Kerberos.  I presume the goal in rolling your own was to avoid adding
barriers to deployment?

>  - is a single init script still appropriate, or do we want something 
>    better?  (I'm not very familiar with the new best practices for upstart 
>    or systemd for multi-instance services like this.)

At risk of duplicating part of the functionality of Upstart or systemd,
the traditional solution for this kind of problem is for the
multi-process tool to implement its own master-control process, and have
init start that.  Responsibility for (re-)starting sub-processes is then
delegated to this master daemon.  

As well as allowing sophisticated tool-specific logic for determining
what processes should/should not be started, this can be a useful point
of control for enforcing privilege separation and resource-limits
between sub-processes.

This approach can be powerful, but does require the overhead of
implementing and managing all of the necessary logic yourself.  I can't
comment on whether tools like Upstart, systemd, or something else could
easily be used to avoid incurring this additional cost.  This may be
something worth discussing with upstream developers..?

Cheers,
David
-- 
David McBride <dwm@doc.ic.ac.uk>
Department of Computing, Imperial College, London



* Re: Braindump: path names, partition labels, FHS, auto-discovery
  2012-03-19  9:21     ` Bernard Grymonpon
  2012-03-20  7:25       ` Sage Weil
@ 2012-03-27 18:21       ` Tommi Virtanen
  1 sibling, 0 replies; 11+ messages in thread
From: Tommi Virtanen @ 2012-03-27 18:21 UTC (permalink / raw)
  To: Bernard Grymonpon; +Cc: ceph-devel

On Mon, Mar 19, 2012 at 02:21, Bernard Grymonpon <bernard@openminds.be> wrote:
> As I've been constructing some cookbooks to setup a default cluster, this is
> what I bumped into:
>
> - the numbering (0, 1, ...) of the OSDs and their need to keep the same number
>  throughout the lifetime of the cluster is a bit a hassle. Each OSD needs to
>  have a complete view of all the components of the cluster before it can
>  determine it's own ID. A random, auto-generated UUID would be nice (I
>  currently solved this by assigning each cluster a global "clustername", and
>  search the chef server for all nodes, look for the highest indexed OSDs, and
>  increment this to determine the new OSD's index - there must be a better
>  way).

That's why you ask the monitors to assign you one:
https://github.com/ceph/ceph-cookbooks/blob/1d381a3e1dd767c4c8ab668878285b545a70846a/ceph/recipes/bootstrap_osd.rb#L47

As far as I know, chef does NOT provide the necessary atomicity to
reliably allocate unique ids.

> - the configfile needs to be the same on all hosts - which is only partially
>  true. From my point of view, a OSD should only have some way of contacting
>  one mon, which would inform the OSD of the cluster layout. So, only the
>  mon-info should be there (together with the info for the OSD itself,
>  obviously)

Can't rely on a single mon, that'd be a single point of failure.

The only thing the config really needs is the monitor locations. I
expect the rest of this to slowly go away, as we improve the defaults
in the code and the cookbook:

https://github.com/ceph/ceph-cookbooks/blob/1d381a3e1dd767c4c8ab668878285b545a70846a/ceph/templates/default/ceph.conf.erb
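
A minimal sketch of such a config, carrying little more than the monitor
locations (host names, addresses and the auth line are placeholders; the
cookbook template above is the authoritative version):

  [global]
          auth supported = cephx

  [mon.a]
          host = mon-a
          mon addr = 192.0.2.10:6789

  [mon.b]
          host = mon-b
          mon addr = 192.0.2.11:6789

  [mon.c]
          host = mon-c
          mon addr = 192.0.2.12:6789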

> - there is a chicken-egg problem in the authentication of a osd to the mon. An
>  OSD should have permission to join the mon, for which we need to add the OSD
>  to the mon. As chef works on the node, and can't trigger stuff on other
>  nodes, the node that will hold the OSD needs some way of authenticating
>  itself to the mon (I solved this by storing the "client.admin" secret on the
>  mon-node, and then pulling this from there on the osd node, and using it to
>  register myself to the mon. It is like putting a copy of your homekey on
>  your front door...). I see no obvious solution here.

I use a "bootstrap-osd" key that can create new OSDs and authorize
keys for them. It's less powerful than client.admin.

https://github.com/ceph/ceph-cookbooks/blob/1d381a3e1dd767c4c8ab668878285b545a70846a/ceph/recipes/single_mon.rb#L49

https://github.com/ceph/ceph-cookbooks/blob/1d381a3e1dd767c4c8ab668878285b545a70846a/ceph/recipes/bootstrap_osd.rb#L20

> - the current (debian) start/stop scripts are a hassle to work with, as chef
>  doesn't understand the third parameter (/etc/init.d/ceph start mon.0). Each
>  mon / osd / ... should have its own start/stop script.

The cookbook uses upstart jobs, and runs an instance per osd id etc.

https://github.com/ceph/ceph-cookbooks/blob/1d381a3e1dd767c4c8ab668878285b545a70846a/ceph/templates/default/upstart-ceph-osd.conf.erb
https://github.com/ceph/ceph-cookbooks/tree/1d381a3e1dd767c4c8ab668878285b545a70846a/ceph/templates/default
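
With instance jobs, a single daemon can then be handled on its own, e.g.
(the job and instance variable names depend on the cookbook's templates, so
treat these as placeholders):

  start ceph-osd id=0
  stop ceph-osd id=0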

> - there should be some way to ask a local running OSD/MON for its status,
>  without having to go through the monitor-nodes. Sort of "ceph-local-daemon
>  --uuid=xxx --type=mon status", which would inform us if it is running,
>  healthy, part of the cluster, lost in space...

That'd be the "admin socket". It's unfortunately not well documented currently.

> - growing the cluster bit by bit would be ideal; this is how chef works (it
>  handles nodes one by one, not a bunch of nodes in one go)

The cookbook handles this, with some limitations that will be removed
once we have resources to work on it again.

> - ideally, there would be an automatic crushmap-expansion command which would
>  add a device to an existing crushmap (or remove one). Now the crushmap needs
>  to be reconstructed completely, and if your numbering changes somehow, you're
>  screwed. Ideal would be "take the current crushmap and add OSD with uuid
>  xxx" - "take the current crushmap and remove OSD xxx"

Exists already.
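
Concretely, something along these lines (a sketch; the bucket hierarchy is an
example and the exact verbs have shifted between releases):

  # Add or reposition a single device in the live crushmap...
  ceph osd crush set 12 osd.12 1.0 pool=default host=node12
  # ...and take it out again when the osd goes away.
  ceph osd crush remove osd.12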

> Just my thoughts! I've been following the ceph project for a while now, set up
>  a couple of test clusters in the past and over the last two weeks, and made
>  the cookbooks to make my life easier (and bumped into a lot of ops trouble
>  doing this...).

To summarize:

Status of the cookbook at https://github.com/ceph/ceph-cookbooks is:

- it assumes you only run a single monitor
- it assumes you run 1 osd per node, as a subdirectory of /srv

Both of these restrictions will eventually be lifted; that was just to
get started.

Right now, I know we have one admin looking to lift the "1 osd per
node" limitation (he's ok doing mons manually), but other than that
I'm the only person who has put time into the cookbooks, and I'm
currently busy setting up our automated test infrastructure. We're
hiring Chef devopsy people, come help us!

* Re: Braindump: path names, partition labels, FHS, auto-discovery
  2012-03-27 11:29         ` David McBride
@ 2012-03-27 19:40           ` Sage Weil
  2012-03-27 20:16             ` Tommi Virtanen
  0 siblings, 1 reply; 11+ messages in thread
From: Sage Weil @ 2012-03-27 19:40 UTC (permalink / raw)
  To: David McBride; +Cc: Ceph Development Mailing List

On Tue, 27 Mar 2012, David McBride wrote:
> On Tue, 2012-03-20 at 00:25 -0700, Sage Weil wrote:
> 
> > Currently the ceph-osd is told which id to be on startup; the only real
> > shift here would be to let you specify some uuids instead and have it pull
> > its rank (id) out of the .../whoami file.
> 
> I'm increasingly coming to believe that an OSD's rank should not be
> exposed to the admin/user.  While it's clearly important as an internal
> implementation detail, I can't currently see a reason why the admin
> needs to know an OSD's rank, or why it can't (in principle) be
> dynamically managed on the administrator's behalf.
> 
> A non-trivial part of the complexity of (re-)configuring a running
> cluster as nodes are added and removed is the correct numbering of OSDs.
> 
> At the moment I'm still experimenting -- I don't know what's supposed to
> happen when low-numbered OSDs are removed; do all the existing ones
> renumber?  Or do you get fragmentation in the OSD number space?
> 
> If it's the former, then the rank of an OSD is metadata that can change
> during its lifetime -- meaning that it's probably not a good idea to use
> it in path-names, for example.
> 
> I suspect that using UUIDs and/or human-readable labels to refer to OSDs
> is probably going to be superior to using the OSDs' rank.

The ranks don't change, so at least that part is not a problem.  If you 
remove old osds, there's a gap in the id space, but that's not a problem.

I don't think they can be hidden entirely because they are tied to the 
CRUSH map, which can be (and often must be) manipulated directly.
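
"Directly" here usually means the decompile/edit/recompile cycle, where the
ids are visible as device entries; a sketch:

  ceph osd getcrushmap -o crushmap
  crushtool -d crushmap -o crushmap.txt   # ids appear as "device N osd.N"
  # ... edit crushmap.txt as needed ...
  crushtool -c crushmap.txt -o crushmap.new
  ceph osd setcrushmap -i crushmap.new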

> > We've set up a special key that has permission to create new osds only, 
> > but again it's pretty bad security.  Chef's model just doesn't work well 
> > here.
> 
> If I understand the model correctly:
> 
>  - Each individual daemon generates its own keys.  This is a secret key 
>    in the symmetric-cryptography sense.
>  - Each daemon needs to have its keys registered with the MON cluster.
>    (The MON operates as a centrally-trusted key distribution centre.)
>  - To register keys with the MON cluster, a cluster administrative 
>    privilege is required.
>  - The registration process also updates the effective access-control 
>    privilege set associated with that key.

Alternatively, as TV noted, you can use a 'provisioning' key whose only
ability is to add new keys with specific privs.

> Two thoughts:
> 
>  1. I suspect that some small changes to the bootstrap procedure could 
>     result in a more streamlined process:
> 
>      - OSDs stop being responsible for generating their own keys.  
>        Their keys are instead generated by a MON node and are stored in 
>        the MON cluster.  As a result, the problem to be solved changes: 
> 
>          * before, an OSD needed to have the necessary privileges to
>            update the cluster-wide configuration; 
>          * now, the MON node only needs to have the necessary privileges
>            to install an OSD's secret key on that OSD's host.
> 
>      - It should then be straightforward to set up a privileged, 
>        automated process -- probably on a MON node -- to stage a copy
>        of an OSD's secret key to the appropriate location on that OSD's
>        host.
> 
>        (This assumes that an OSD's host can be automatically determined
>        and authenticated using some existing means (SSH keys, Kerberos, 
>        etc.) -- which I'd expect to be the case for any non-trivial 
>        installation.)
> 
>      - This automated process could be triggered by the OSD   
>        installation process -- either by an out-of-band script, or 
>        conceivably by a key-solicitation message sent in-band by the 
>        OSD itself.

I don't have much of an opinion on the strategy in general (TV probably 
does), but this is already possible today.  If you pass --mkkey along 
with --mkfs to ceph-osd it will generate a key; if you don't, you can 
copy one into place yourself.
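
A sketch of both variants (the id and paths are examples; by default the key
lands in the "keyring" file inside the osd data directory):

  # Variant 1: let ceph-osd generate its own key at mkfs time.
  ceph-osd -i 12 --mkfs --mkkey

  # Variant 2: mkfs without a key, then copy one generated elsewhere
  # (e.g. on a mon host) into place.
  ceph-osd -i 12 --mkfs
  scp mon-host:/some/staging/path/osd.12.keyring \
      /var/lib/ceph/osd-data/12/keyring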

>  2. This sounds very similar to the model for Kerberos 5, as used in 
>     the MIT Kerberos implementation and Microsoft's Active Directory.  
>     It might be an interesting (future!) project to see how difficult 
>     it would be to modify the Ceph daemons and protocols to use 
>     Kerberos-based authentication as an alternative to cephx, possibly 
>     via the GSS-API.
> 
> Aha, I note that this is probably not a surprise -- the 0.18 
> release-notes, for example, note the similarity between cephx and
> Kerberos.  I presume the goal in rolling your own was to avoid adding
> barriers to deployment?

That, and we were concerned about scalability issues with Kerberos 
itself... it didn't map cleanly onto the distributed nature of Ceph.

sage


* Re: Braindump: path names, partition labels, FHS, auto-discovery
  2012-03-27 19:40           ` Sage Weil
@ 2012-03-27 20:16             ` Tommi Virtanen
  0 siblings, 0 replies; 11+ messages in thread
From: Tommi Virtanen @ 2012-03-27 20:16 UTC (permalink / raw)
  To: Sage Weil; +Cc: David McBride, Ceph Development Mailing List

On Tue, Mar 27, 2012 at 12:40, Sage Weil <sage@newdream.net> wrote:
>> Two thoughts:
>>
>>  1. I suspect that some small changes to the bootstrap procedure could
>>     result in a more streamlined process:
>>
>>      - OSDs stop being responsible for generating their own keys.
>>        Their keys are instead generated by a MON node and are stored in
>>        the MON cluster.  As a result, the problem to be solved changes:
>>
>>          * before, an OSD needed to have the necessary privileges to
>>            update the cluster-wide configuration;
>>          * now, the MON node only needs to have the necessary privileges
>>            to install an OSD's secret key on that OSD's host.

Perhaps, but the devil is in the details. In general, there is no
single mon node; any assumption of one is unwanted. With Chef or
similar automation software, the monitors do not know when a new OSD
is being set up; they cannot easily initiate actions at that time.
Triggering actions from the storage node is just easier.

The bootstrap key mechanism is not perfect, but it plays well with the
limitations of Chef, and it works. Once the rest reaches a similar
level of functionality, I'll happily revisit it, but for now it's
plenty fine.

Thread overview: 11+ messages
2012-03-06 21:19 Braindump: path names, partition labels, FHS, auto-discovery Tommi Virtanen
2012-03-06 22:29 ` Greg Farnum
2012-03-07  9:55 ` David McBride
2012-03-07 20:54   ` Sage Weil
2012-03-19  9:21     ` Bernard Grymonpon
2012-03-20  7:25       ` Sage Weil
2012-03-20  7:55         ` Bernard Grymonpon
2012-03-27 11:29         ` David McBride
2012-03-27 19:40           ` Sage Weil
2012-03-27 20:16             ` Tommi Virtanen
2012-03-27 18:21       ` Tommi Virtanen
