From: "Piotr Dałek" <piotr.dalek@corp.ovh.com>
To: ceph-devel <ceph-devel@vger.kernel.org>,
	ceph-users <ceph-users@lists.ceph.com>
Subject: Snap trim queue length issues
Date: Thu, 14 Dec 2017 15:36:35 +0100	[thread overview]
Message-ID: <82009aab-6b20-ef21-9bbd-76fddf84e0a3@corp.ovh.com> (raw)

Hi,

We recently ran into low disk space issues on our clusters, and it wasn't 
because of actual data. The affected clusters host VMs and volumes, so 
naturally there are snapshots involved. For some time we observed increased 
disk space usage that we couldn't explain: there was a discrepancy between 
what Ceph reported and the actual space used on disk. We finally found out 
that the snap trim queues were both long and not getting any shorter; 
decreasing the snap trim sleep and increasing the max concurrent snap trims 
helped reverse the trend - we're safe now.
The problem is, we weren't aware of this issue for some time, and there's no 
easy (and fast[1]) way to check for it. I made a pull request[2] that makes 
snap trim queue lengths available to monitoring tools and also generates a 
health warning when things get out of control, so an admin can act before 
all hell breaks loose.
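For reference, the knobs we touched look roughly like this as a ceph.conf 
fragment. This is a sketch using the option names I know from Luminous 
(osd_snap_trim_sleep, osd_pg_max_concurrent_snap_trims); the exact names, 
defaults, and safe values may differ on your release, so treat these as 
assumptions to verify against your own docs:

```
[osd]
# Seconds to sleep between snap trim operations; lowering it makes
# trimming more aggressive at the cost of more client I/O impact.
osd_snap_trim_sleep = 0

# How many snap trims a PG may run concurrently; raising it helps
# drain a long queue faster.
osd_pg_max_concurrent_snap_trims = 4
```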

My question is: how many Jewel users would be interested in such a feature? 
There are a lot of changes between Luminous and Jewel, so it's not going to 
be a straight backport, but it's not a big patch either, so I don't mind 
doing it myself. Having some support from users would help push this into 
the next Jewel release.

Thanks!


[1] one of our guys hacked together a bash one-liner that prints the snap 
trim queue lengths for all PGs, but a full run takes over an hour to 
complete on a cluster with over 20k PGs...
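A script in that spirit might look like the sketch below. It assumes `jq` 
is installed and that `ceph pg <pgid> query` exposes the trim queue as a 
"snap_trimq" interval-set string (e.g. "[1~3,8~2]") -- the field name and 
location vary by release, so the cluster-facing loop is left commented out 
as a hypothetical:

```shell
#!/bin/bash
# Count the intervals in a snap_trimq interval-set string.
# "[1~3,8~2]" has 2 intervals; "[]" has 0.
trimq_len() {
    echo "$1" | tr -d '[]' | awk -F',' '{ print ($0 == "" ? 0 : NF) }'
}

# Hypothetical per-PG loop (slow on big clusters -- one query per PG):
# for pg in $(ceph pg ls -f json | jq -r '.pg_stats[].pgid'); do
#     q=$(ceph pg "$pg" query | jq -r '.info.snap_trimq // "[]"')
#     echo "$pg $(trimq_len "$q")"
# done
```

Note this only counts intervals, not individual snapshots; it's meant as a 
cheap trend indicator, not an exact queue depth.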
[2] https://github.com/ceph/ceph/pull/19520

-- 
Piotr Dałek
piotr.dalek@corp.ovh.com
https://www.ovh.com/us/

Thread overview: 5+ messages

2017-12-14 14:36 Piotr Dałek [this message]
2017-12-14 16:31 ` Snap trim queue length issues David Turner
2017-12-15  9:00   ` [ceph-users] " Piotr Dałek
2017-12-15 14:58     ` Sage Weil
2017-12-18  8:52       ` Piotr Dałek
