* Snap trim queue length issues
@ 2017-12-14 14:36 Piotr Dałek
       [not found] ` <82009aab-6b20-ef21-9bbd-76fddf84e0a3-Rm6v+N6rxxBWk0Htik3J/w@public.gmane.org>
  0 siblings, 1 reply; 5+ messages in thread
From: Piotr Dałek @ 2017-12-14 14:36 UTC (permalink / raw)
  To: ceph-devel, ceph-users

Hi,

We recently ran into low disk space issues on our clusters, and it wasn't 
because of actual data. The affected clusters host VMs and volumes, so 
naturally there are snapshots involved. For some time we observed increased 
disk space usage that we couldn't explain, as there was a discrepancy 
between what Ceph reported and the actual space used on disks. We finally 
found out that snap trim queues were both long and not getting any shorter, 
and that decreasing snap trim sleep and increasing max concurrent snap 
trims helped reverse the trend - we're safe now.
The problem is that we weren't aware of this issue for quite some time, and 
there's no easy (and fast[1]) way to check for it. I made a pull request[2] 
that makes snap trim queue lengths available to monitoring tools and also 
generates a health warning when things get out of control, so an admin can 
act before hell breaks loose.

My question is, how many Jewel users would be interested in such a feature? 
There are a lot of changes between Luminous and Jewel, so it's not going to 
be a straight backport, but it's not a big patch either, so I don't mind 
doing the work myself. Having some support from users would help push this 
into the next Jewel release.

Thanks!


[1] one of our guys hacked together a bash one-liner that printed the snap 
trim queue lengths for all PGs, but a full run takes over an hour to 
complete on a cluster with over 20k PGs... (a rough sketch of the idea is 
included after footnote [2])
[2] https://github.com/ceph/ceph/pull/19520
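
Here is that rough sketch in Python. It's not our actual one-liner; the 
'snap_trimq' field name, its interval-set format and the 'pg dump' output 
layout are assumptions that vary between releases, so treat it as an 
illustration only:

  #!/usr/bin/env python
  # Walk every PG and sum its snap trim queue length. This is slow on big
  # clusters, which is exactly the problem described in [1].
  import json
  import subprocess

  def pg_ids():
      # Assumes 'ceph pg dump pgs_brief -f json' returns a flat list of
      # dicts with a 'pgid' key; newer releases wrap this differently.
      out = subprocess.check_output(['ceph', 'pg', 'dump', 'pgs_brief',
                                     '-f', 'json'])
      return [pg['pgid'] for pg in json.loads(out)]

  def trimq_len(pgid):
      # Assumes 'ceph pg <pgid> query' exposes 'snap_trimq' as an interval
      # set string like "[1~3,8~2]"; sum the interval lengths.
      out = subprocess.check_output(['ceph', 'pg', pgid, 'query',
                                     '-f', 'json'])
      q = json.loads(out).get('snap_trimq', '[]')
      return sum(int(p.split('~')[1])
                 for p in q.strip('[]').split(',') if '~' in p)

  total = sum(trimq_len(pgid) for pgid in pg_ids())
  print("total snap trim queue entries: %d" % total)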

-- 
Piotr Dałek
piotr.dalek@corp.ovh.com
https://www.ovh.com/us/

* Re: Snap trim queue length issues
       [not found] ` <82009aab-6b20-ef21-9bbd-76fddf84e0a3-Rm6v+N6rxxBWk0Htik3J/w@public.gmane.org>
@ 2017-12-14 16:31   ` David Turner
  2017-12-15  9:00     ` [ceph-users] " Piotr Dałek
  0 siblings, 1 reply; 5+ messages in thread
From: David Turner @ 2017-12-14 16:31 UTC (permalink / raw)
  To: Piotr Dałek; +Cc: ceph-devel, ceph-users


I've tracked this in a much more manual way.  I would grab a random subset 
of PGs in the pool and query each of them, counting how many objects were 
in their queues.  After that, you average it out over the number of PGs you 
queried and multiply it back out by how many PGs are in the pool.  That 
gave us a relatively accurate size of the snaptrimq - well enough to be 
monitored, at least.  We could run this in a matter of minutes with a 
subset of 200 PGs, and it was generally accurate in a pool with 32k PGs.
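
Roughly, that sampling estimate looks like the sketch below. This isn't my 
actual script - the 'snap_trimq' field name and the 'pg dump' JSON layout 
are assumptions that differ between releases, and the pool id is a 
placeholder:

  import json
  import random
  import subprocess

  POOL_ID = 5   # placeholder: numeric id of the pool to estimate
  SAMPLE = 200  # number of PGs to sample, as described above

  def trimq_len(pgid):
      # Assumed field name and interval-set format ("[1~3,8~2]").
      out = subprocess.check_output(['ceph', 'pg', pgid, 'query',
                                     '-f', 'json'])
      q = json.loads(out).get('snap_trimq', '[]')
      return sum(int(p.split('~')[1])
                 for p in q.strip('[]').split(',') if '~' in p)

  # Collect the pool's PG ids, query a random subset, average the queue
  # length, then scale back out by the total PG count.
  out = subprocess.check_output(['ceph', 'pg', 'dump', 'pgs_brief',
                                 '-f', 'json'])
  pgs = [pg['pgid'] for pg in json.loads(out)
         if pg['pgid'].startswith('%d.' % POOL_ID)]
  sample = random.sample(pgs, min(SAMPLE, len(pgs)))
  avg = sum(trimq_len(p) for p in sample) / float(len(sample))
  print("estimated snaptrimq size: ~%d objects" % (avg * len(pgs)))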

I also created a daemon that ran against the cluster, watching cluster load 
and modifying snap_trim_sleep accordingly.  With the combination of those 
two things we were able to keep up with deleting hundreds of GB of 
snapshots per day while not killing VM performance.  We hit a bug where we 
had to disable snap trimming completely for about a week, and on a dozen 
OSDs for about a month.  We ended up with a snaptrimq of over 100M objects, 
but with these tools we were able to catch up within a couple of weeks 
while also taking care of the daily snapshots being added to the queue.
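
The skeleton of such a daemon could look like the sketch below. This is not 
the daemon I actually ran: the load signal is a placeholder you would wire 
into your own monitoring, the thresholds are made up, and injectargs 
changes are runtime-only (they don't persist across OSD restarts):

  import subprocess
  import time

  def set_snap_trim_sleep(seconds):
      # Inject the new value into all running OSDs at runtime.
      subprocess.check_call(['ceph', 'tell', 'osd.*', 'injectargs',
                             '--osd_snap_trim_sleep %s' % seconds])

  def cluster_busy():
      # Placeholder: plug in your own load signal here, e.g. client I/O
      # latency or op queue depth from your monitoring system.
      return False

  while True:
      # Back off snap trimming when the cluster is loaded, speed it up
      # when it's idle.
      set_snap_trim_sleep(1.0 if cluster_busy() else 0.05)
      time.sleep(60)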

This was all on a Hammer cluster.  The changes that moved snap trimming 
into the main OSD thread made our use case unviable on Jewel, until later 
Jewel changes that happened after I left.  It's exciting that this will 
actually be a reportable value from the cluster.

Sorry that this story doesn't really answer your question, except to say 
that people aware of this problem likely have a workaround for it.  However, 
I'm certain that a lot more clusters are impacted by this than are aware of 
it, and being able to quickly see it would be beneficial for 
troubleshooting.  Backporting would be nice.  I run a few Jewel clusters 
that host some VMs, and it would be nice to see how well those clusters 
handle snap trimming, though they are much less demanding in how many 
snapshots they take.

On Thu, Dec 14, 2017 at 9:36 AM Piotr Dałek <piotr.dalek@corp.ovh.com> wrote:

> We recently ran into low disk space issues on our clusters, and it wasn't
> [..]

* Re: [ceph-users] Snap trim queue length issues
  2017-12-14 16:31   ` David Turner
@ 2017-12-15  9:00     ` Piotr Dałek
       [not found]       ` <81eabcfe-59b1-70c9-4f4f-2abbc86b9456-Rm6v+N6rxxBWk0Htik3J/w@public.gmane.org>
  0 siblings, 1 reply; 5+ messages in thread
From: Piotr Dałek @ 2017-12-15  9:00 UTC (permalink / raw)
  To: David Turner; +Cc: ceph-devel, ceph-users

On 17-12-14 05:31 PM, David Turner wrote:
> I've tracked this in a much more manual way.  I would grab a random subset 
> [..]
> 
> This was all on a Hammer cluster.  The changes that moved snap trimming 
> into the main OSD thread made our use case unviable on Jewel, until later 
> Jewel changes that happened after I left.  It's exciting that this will 
> actually be a reportable value from the cluster.
> 
> Sorry that this story doesn't really answer your question, except to say 
> that people aware of this problem likely have a workaround for it.  However, 
> I'm certain that a lot more clusters are impacted by this than are aware of 
> it, and being able to quickly see it would be beneficial for troubleshooting. 
> Backporting would be nice.  I run a few Jewel clusters that host some VMs, 
> and it would be nice to see how well those clusters handle snap trimming, 
> though they are much less demanding in how many snapshots they take.

Thanks for your response, it pretty much confirms what I thought:
- users aware of the issue have their own hacks, which don't need to be 
efficient or convenient;
- users unaware of the issue are, well, unaware, and at risk of serious 
service disruption once all disk space is used up.

Hopefully that will be convincing enough for the devs. ;)

-- 
Piotr Dałek
piotr.dalek@corp.ovh.com
https://www.ovh.com/us/

* Re: Snap trim queue length issues
       [not found]       ` <81eabcfe-59b1-70c9-4f4f-2abbc86b9456-Rm6v+N6rxxBWk0Htik3J/w@public.gmane.org>
@ 2017-12-15 14:58         ` Sage Weil
       [not found]           ` <alpine.DEB.2.11.1712151454030.2838-qHenpvqtifaMSRpgCs4c+g@public.gmane.org>
  0 siblings, 1 reply; 5+ messages in thread
From: Sage Weil @ 2017-12-15 14:58 UTC (permalink / raw)
  To: Piotr Dałek; +Cc: ceph-devel, ceph-users

On Fri, 15 Dec 2017, Piotr Dałek wrote:
> On 17-12-14 05:31 PM, David Turner wrote:
> > I've tracked this in a much more manual way.  I would grab a random subset
> > [..]
> 
> Thanks for your response, it pretty much confirms what I thought:
> - users aware of the issue have their own hacks, which don't need to be 
> efficient or convenient;
> - users unaware of the issue are, well, unaware, and at risk of serious 
> service disruption once all disk space is used up.
> 
> Hopefully that will be convincing enough for the devs. ;)

Your PR looks great!  I commented with a nit on the format of the warning 
itself.

I expect this is trivial to backport to luminous; it will need to be 
partially reimplemented for jewel (with some care around the pg_stat_t and 
a different check for the jewel-style health checks).

sage

* Re: Snap trim queue length issues
       [not found]           ` <alpine.DEB.2.11.1712151454030.2838-qHenpvqtifaMSRpgCs4c+g@public.gmane.org>
@ 2017-12-18  8:52             ` Piotr Dałek
  0 siblings, 0 replies; 5+ messages in thread
From: Piotr Dałek @ 2017-12-18  8:52 UTC (permalink / raw)
  To: Sage Weil; +Cc: ceph-devel, ceph-users

On 17-12-15 03:58 PM, Sage Weil wrote:
> On Fri, 15 Dec 2017, Piotr Dałek wrote:
>> [..]
> 
> Your PR looks great!  I commented with a nit on the format of the warning
> itself.

I just addressed the comments.

> I expect this is trivial to backport to luminous; it will need to be
> partially reimplemented for jewel (with some care around the pg_stat_t and
> a different check for the jewel-style health checks).

Yeah, that's why I expected some resistance here and asked for comments. I 
really don't mind reimplementing this; it's not a big deal.

-- 
Piotr Dałek
piotr.dalek@corp.ovh.com
https://www.ovh.com/us/