###################
Ceph Storage System
###################

The ACM uses Ceph to manage its collection of storage.  Ceph itself has
extensive documentation_.

.. _documentation: http://docs.ceph.com/docs/master/

Configuration
=============

The global parameters of our Ceph cluster are available for perusal in
file:///afs/acm.jhu.edu/readonly/group/admins.pub/ceph.conf .  The various
Ceph worker nodes should all have ``/etc/ceph/ceph.conf`` symlinked to that
location.  (As usual, be sure to use the RO mountpoint for the benefits of
replication; see :doc:`../afsdoc/admins-pub` for details.)

.. note::

   This file should be kept up to date.  Do note, however, that the setting
   ``mon_host`` gives the clients a list of possible mon locations; only one
   of them needs to be correct for an operation to succeed, as the mons
   maintain their own address lists in the monmap itself.

Authentication
==============

Ceph rolls its own authentication and authorization layers, which is a bit of
a bummer, but it is what it is, I suppose.  Thankfully, we don't fiddle with
this often.  You can inspect the database with ``ceph auth list``.

.. warning::

   This list command displays secrets on the console, and there does not seem
   to be an option to suppress them!

.. _ceph_new-user:

Creating a New Ceph User
------------------------

Run ``ceph auth add ${NAME}``, optionally followed by one or more
``${CAPTY} ${CAPVAL}`` pairs, e.g.
``mon "allow r" osd "allow class-read object_prefix rbd_children"`` or
somesuch.  ``ceph auth caps`` can be used to change a user's caps later.  The
equivalent of extracting a keytab is done with ``ceph auth get-key``.
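
For concreteness, here is a rough sketch of the whole dance.  The client name
and caps below are made-up placeholders (only ``volumes_2`` is a pool actually
mentioned on this page); substitute whatever the new user really needs::

   # Create the user with an initial set of caps.
   ceph auth add client.example mon 'allow r' \
       osd 'allow class-read object_prefix rbd_children, allow rwx pool=volumes_2'

   # Change caps later; note that this replaces the whole cap list rather
   # than appending to it.
   ceph auth caps client.example mon 'allow r' osd 'allow rwx pool=volumes_2'

   # The keytab-extraction equivalent.
   ceph auth get-key client.example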

Maintenance Tasks
=================

Getting Cluster Status
----------------------

On magellan or the other Ceph worker (mon or osd) nodes, commands like
``ceph status`` or ``ceph status -w`` will show you what's up on the storage
cluster (and keep you up to date).  ``ceph pg dump_stuck`` is useful while
Ceph is rebalancing itself.

Identifying RBD Objects
-----------------------

To map an RBD name to the Ceph internal object identifier prefix (used by
OSDs as part of the filename), sometimes this works::

   VNAME=volume-5612246f-a1ec-4faa-9a41-ac33876805bc
   POOL=volumes_2
   rados -p ${POOL} get rbd_id.${VNAME} - | strings

And sometimes you need to use the "dot rbd" object instead::

   VNAME=home-1
   POOL=rbdafs-home
   rados -p ${POOL} get ${VNAME}.rbd - | strings | grep rb

.. note::

   I have no idea why one and not the other; I suspect the latter is the
   older way of doing things?

Rebalancing Scrubs
------------------

You can see the days of the week on which the last round of deep-scrubs
occurred with a command like the following (taken from
http://cephnotes.ksperis.com/blog/2013/08/27/deep-scrub-distribution ) ::

   for date in `ceph pg dump | grep active | awk '{print $22}'`; do date +%A -d $date; done | sort | uniq -c

Re-balancing the Ceph deep-scrub schedule may be done with something like the
following.  Note that this takes a week to run; "a week" comes from the fact
that ``osd deep scrub interval`` is set, by default, to 1 week. ::

   ceph pg dump | awk '/active/{ print $1 }' \
     | (while read i; do
          echo $i
          ceph pg deep-scrub $i
          sleep $((604800 / `ceph pg stat | sed -e 's/^.*: \([0-9]*\) pgs:.*$/\1/'`))
        done)

What's On An OSD
----------------

Run::

   ceph pg ls-by-osd osd.$X

Slowly Easing In or Out OSDs
----------------------------

Try some shell scripting::

   # This script assumes oldschool weight 8 osds!  No longer valid.  Await
   # useful scripts in admins.pub/scripts/ceph
   await_ceph() {
     until ceph health | grep HEALTH_OK; do echo -n 'Waiting... '; date; sleep 15; done
   }

   for w in `seq 1 8`; do
     await_ceph
     for osd in 2 3 4 5; do ceph osd crush reweight osd.${osd} $w; done
     sleep 30
   done

Quickly Removing an OSD
-----------------------

Avoid double-computation of the new CRUSH maps by running
``ceph osd crush rm ${OSD}`` rather than marking it ``out`` and then taking
it ``down``.  This will trigger only one round of backfilling and will result
in less data motion; a rough sketch of the full command sequence is given at
the bottom of this page.  Otherwise, follow the procedure in
http://ceph.com/docs/v0.78/rados/operations/add-or-rm-osds/#removing-osds-manual

.. note::

   I assume that even after a ``ceph osd crush rm``, Ceph will still allow
   the OSD in question to participate in replication of its PGs.

.. note::

   The officially documented procedure for adding an OSD to the cluster uses
   ``ceph osd crush add``, which likewise computes the CRUSH map only once.

Creating a New Mon
------------------

Original reference at
http://ceph.com/docs/master/rados/operations/add-or-rm-mons/

* Work in a temporary directory, for starters::

     cd /tmp

* Grab the monitors' keyring, so the mon can authenticate to the others::

     ceph auth get mon. -o keyring

* Grab the current mon map::

     ceph mon getmap -o monmap

* Create the data store for the mon.  This should create a directory and
  files at ``/var/lib/ceph/mon/ceph-$MON_ID``::

     sudo ceph-mon -i $MON_ID --mkfs --monmap monmap --keyring keyring

* Add the mon to the mon map::

     ceph mon add $MON_ID $IP_ADDR

* Start the mon::

     sudo service ceph -a start mon.$MON_ID

* Delete the temporary copy of the monitors' keyring and the mon map::

     rm keyring monmap

* Update ``ceph.conf``.

Removing a Mon
--------------

Original reference at
http://ceph.com/docs/master/rados/operations/add-or-rm-mons/

* First, shut down the mon::

     sudo service ceph -a stop mon.$MON_ID

* Remove it from the known set of mons::

     ceph mon remove $MON_ID

* Delete its data store::

     rm -r /var/lib/ceph/mon/ceph-$MON_ID

* Update ``ceph.conf``.

ZFS and Ceph
============

Look at https://github.com/zfsonlinux/zfs/issues/4913#issuecomment-268182335 .
Basically, follow those instructions :)  But they can break things, so maybe
don't follow them.

Miscellany
==========

CERN also uses Ceph; pay attention to everything they have to say on the
matter.  Notably, this includes
http://www.slideshare.net/Inktank_Ceph/scaling-ceph-at-cern (via
http://ceph.com/cephdays/frankfurt/ ).
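
As promised under "Quickly Removing an OSD" above, here is a minimal sketch of
the single-rebalance removal sequence.  The OSD id (``12``) is a made-up
placeholder, and the ``service`` invocation assumes the same sysvinit-style
init script used in the mon procedures above; adjust both for the host in
question::

   OSD_ID=12

   # Drop the OSD from the CRUSH map first; this is the step that triggers
   # the single round of backfilling.
   ceph osd crush rm osd.${OSD_ID}

   # Once backfilling settles (watch "ceph status"), stop the daemon on its
   # host and retire the OSD's auth entry and id.
   sudo service ceph stop osd.${OSD_ID}
   ceph auth del osd.${OSD_ID}
   ceph osd rm ${OSD_ID}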