###################
Ceph Storage System
###################

The ACM uses Ceph to manage its collection of storage.  Ceph itself has
extensive documentation_.

.. _documentation: http://docs.ceph.com/docs/master/

Configuration
=============

The global parameters of our Ceph cluster are available for perusal in
file:///afs/acm.jhu.edu/readonly/group/admins.pub/ceph.conf .  The various
Ceph worker nodes should all have ``/etc/ceph/ceph.conf`` symlinked to that
location.  (As usual, be sure to use the RO mountpoint for the benefits of
replication; see :doc:`../afsdoc/admins-pub` for details.)

.. note::

   This file should be kept up to date.  Do note, however, that the setting
   ``mon_host`` gives the clients a list of possible mon locations; only one
   of them needs to be correct for an operation to succeed, as the mons
   maintain their own address lists in the monmap itself.

Authentication
==============

Ceph rolls its own authentication and authorization layers, which is a bit of
a bummer, but it is what it is, I suppose.  Thankfully, we don't fiddle with
this often.  You can inspect the database with ``ceph auth list``.

.. warning::

   This list command displays secrets on the console, and there does not seem
   to be an option to suppress them!

.. _ceph_new-user:

Creating a New Ceph User
------------------------

Run ``ceph auth add ${NAME}``, optionally followed by one or more
``${CAPTY} ${CAPVAL}`` pairs, e.g.
``mon "allow r" osd "allow class-read object_prefix rbd_children"`` or
somesuch.  ``ceph auth caps`` can be used to change a user's caps later.  The
equivalent of extracting a keytab is done with ``ceph auth get-key``.
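
For concreteness, here is a rough sketch of the whole dance.  The client name
and caps below are made-up placeholders (only ``volumes_2`` is a pool actually
mentioned on this page); substitute whatever the new user really needs::

   # Create the user with an initial set of caps.
   ceph auth add client.example mon 'allow r' \
       osd 'allow class-read object_prefix rbd_children, allow rwx pool=volumes_2'

   # Change caps later; note that this replaces the whole cap list rather
   # than appending to it.
   ceph auth caps client.example mon 'allow r' osd 'allow rwx pool=volumes_2'

   # The keytab-extraction equivalent.
   ceph auth get-key client.example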

Maintenance Tasks
=================

Getting Cluster Status
----------------------

On magellan or the other Ceph worker (mon or osd) nodes, commands like
``ceph status`` or ``ceph status -w`` will show you what's up on the storage
cluster (and keep you up to date).  ``ceph pg dump_stuck`` is useful while
Ceph is rebalancing itself.

Identifying RBD Objects
-----------------------

To map an RBD name to the Ceph internal object identifier prefix (used by
OSDs as part of the filename), sometimes this works::

   VNAME=volume-5612246f-a1ec-4faa-9a41-ac33876805bc
   POOL=volumes_2
   rados -p ${POOL} get rbd_id.${VNAME} - | strings

And sometimes you need to use the "dot rbd" object instead::

   VNAME=home-1
   POOL=rbdafs-home
   rados -p ${POOL} get ${VNAME}.rbd - | strings | grep rb

.. note::

   I have no idea why one and not the other; I suspect the latter is the
   older way of doing things?

Rebalancing Scrubs
------------------

You can see the days of the week on which the last round of deep-scrubs
occurred with a command like the following (taken from
http://cephnotes.ksperis.com/blog/2013/08/27/deep-scrub-distribution ) ::

   for date in `ceph pg dump | grep active | awk '{print $22}'`; do date +%A -d $date; done | sort | uniq -c

Re-balancing the Ceph deep-scrub schedule may be done with something like the
following.  Note that this takes a week to run; "a week" comes from the fact
that ``osd deep scrub interval`` is set, by default, to 1 week. ::

   ceph pg dump | awk '/active/{ print $1 }' \
     | (while read i; do
          echo $i
          ceph pg deep-scrub $i
          sleep $((604800 / `ceph pg stat | sed -e 's/^.*: \([0-9]*\) pgs:.*$/\1/'`))
        done)

What's On An OSD
----------------

Run::

   ceph pg ls-by-osd osd.$X

Slowly Easing In or Out OSDs
----------------------------

Try some shell scripting::

   # This script assumes oldschool weight 8 osds!  No longer valid.  Await
   # useful scripts in admins.pub/scripts/ceph
   await_ceph() {
     until ceph health | grep HEALTH_OK; do echo -n 'Waiting... '; date; sleep 15; done
   }

   for w in `seq 1 8`; do
     await_ceph
     for osd in 2 3 4 5; do ceph osd crush reweight osd.${osd} $w; done
     sleep 30
   done

Quickly Removing an OSD
-----------------------

Avoid double-computation of the new CRUSH maps by running
``ceph osd crush rm ${OSD}`` rather than marking it ``out`` and then taking
it ``down``.  This will trigger only one round of backfilling and will result
in less data motion; a rough sketch of the full command sequence is given at
the bottom of this page.  Otherwise, follow the procedure in
http://ceph.com/docs/v0.78/rados/operations/add-or-rm-osds/#removing-osds-manual

.. note::

   I assume that even after a ``ceph osd crush rm``, Ceph will still allow
   the OSD in question to participate in replication of its PGs.

.. note::

   The officially documented procedure for adding an OSD to the cluster uses
   ``ceph osd crush add``, which likewise computes the CRUSH map only once.

Creating a New Mon
------------------

Original reference at
http://ceph.com/docs/master/rados/operations/add-or-rm-mons/

* Work in a temporary directory, for starters::

     cd /tmp

* Grab the monitors' keyring, so the mon can authenticate to the others::

     ceph auth get mon. -o keyring

* Grab the current mon map::

     ceph mon getmap -o monmap

* Create the data store for the mon.  This should create a directory and
  files at ``/var/lib/ceph/mon/ceph-$MON_ID``::

     sudo ceph-mon -i $MON_ID --mkfs --monmap monmap --keyring keyring

* Add the mon to the mon map::

     ceph mon add $MON_ID $IP_ADDR

* Start the mon::

     sudo service ceph -a start mon.$MON_ID

* Delete the temporary copy of the monitors' keyring and the mon map::

     rm keyring monmap

* Update ``ceph.conf``.

Removing a Mon
--------------

Original reference at
http://ceph.com/docs/master/rados/operations/add-or-rm-mons/

* First, shut down the mon::

     sudo service ceph -a stop mon.$MON_ID

* Remove it from the known set of mons::

     ceph mon remove $MON_ID

* Delete its data store::

     rm -r /var/lib/ceph/mon/ceph-$MON_ID

* Update ``ceph.conf``.

ZFS and Ceph
============

Look at https://github.com/zfsonlinux/zfs/issues/4913#issuecomment-268182335 .
Basically, follow those instructions :)  But they can break things, so maybe
don't follow them.

Miscellany
==========

CERN also uses Ceph; pay attention to everything they have to say on the
matter.  Notably, this includes
http://www.slideshare.net/Inktank_Ceph/scaling-ceph-at-cern (via
http://ceph.com/cephdays/frankfurt/ ).
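
As promised under "Quickly Removing an OSD" above, here is a minimal sketch of
the single-rebalance removal sequence.  The OSD id (``12``) is a made-up
placeholder, and the ``service`` invocation assumes the same sysvinit-style
init script used in the mon procedures above; adjust both for the host in
question::

   OSD_ID=12

   # Drop the OSD from the CRUSH map first; this is the step that triggers
   # the single round of backfilling.
   ceph osd crush rm osd.${OSD_ID}

   # Once backfilling settles (watch "ceph status"), stop the daemon on its
   # host and retire the OSD's auth entry and id.
   sudo service ceph stop osd.${OSD_ID}
   ceph auth del osd.${OSD_ID}
   ceph osd rm ${OSD_ID}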