From: David Turner
Subject: Re: Snap trim queue length issues
Date: Thu, 14 Dec 2017 16:31:20 +0000
To: Piotr Dałek <piotr.dalek@corp.ovh.com>
Cc: ceph-devel, ceph-users
List-Id: ceph-devel.vger.kernel.org

I've tracked this in a much more manual way. I would grab a random subset of PGs in the pool and query each of them, counting how many objects were in their snap trim queues. You then average that out over the number of PGs you queried and multiply it back out by the number of PGs in the pool. That gave us a relatively accurate size of the snaptrimq - well enough to be monitored, at least. We could run this in a matter of minutes with a subset of 200 PGs, and it was generally accurate on a pool with 32k PGs.
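A minimal sketch of that sampling estimate (an illustration, not the script we actually ran): it assumes the ceph CLI is available and that "ceph pg <pgid> query" exposes the snap trim queue in some form - the exact JSON fields differ between releases, so treat snap_trimq / snap_trimq_len below as placeholders to adapt.

#!/usr/bin/env python3
# Rough sketch of the sampling estimate described above - not the actual
# script. Where (and whether) the snap trim queue shows up in the JSON
# output differs between Ceph releases, so adjust queue_len() as needed.
import json
import random
import subprocess

POOL = "rbd"       # pool to sample; adjust for your cluster
SAMPLE_SIZE = 200  # ~200 PGs was enough to be reasonably accurate at 32k PGs

def ceph_json(*args):
    """Run a ceph CLI command and parse its JSON output."""
    out = subprocess.check_output(("ceph",) + args + ("--format", "json"))
    return json.loads(out)

def pool_pgids(pool):
    """List the pgids of a pool; the output layout varies by release."""
    dump = ceph_json("pg", "ls-by-pool", pool)
    entries = dump if isinstance(dump, list) else dump.get("pg_stats", [])
    return [e["pgid"] for e in entries]

def queue_len(pgid):
    """Best-effort snap trim queue length for one PG from "ceph pg query"."""
    q = ceph_json("pg", pgid, "query")
    for scope in (q, q.get("info", {})):
        if "snap_trimq_len" in scope:        # newer releases: a plain counter
            return scope["snap_trimq_len"]
        trimq = scope.get("snap_trimq", "").strip("[]")
        if trimq:
            return len(trimq.split(","))     # crude: counts queued snap intervals
    return 0

pgids = pool_pgids(POOL)
sample = random.sample(pgids, min(SAMPLE_SIZE, len(pgids)))
if not sample:
    raise SystemExit("no PGs found for pool %s" % POOL)
per_pg = sum(queue_len(p) for p in sample) / len(sample)
print("estimated snap trim queue for pool %s: ~%d" % (POOL, per_pg * len(pgids)))

With 32k PGs, a 200-PG sample keeps this to a couple hundred pg queries, which is why it finishes in minutes rather than the hour-plus a full per-PG sweep takes.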

I also created a daemon that ran against the cluster, watching for cluster load and modifying the snap_trim_sleep accordingly. With the combination of those two things we were able to keep up with deleting hundreds of GB of snapshots per day while not killing VM performance. We hit a bug where we had to disable snap trimming completely for about a week, and on a dozen OSDs for about a month. We ended up with a snaptrimq of over 100M objects, but with these tools we were able to catch up within a couple of weeks while still absorbing the daily snapshots being added to the queue.
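A minimal sketch of that kind of control loop (a hypothetical reconstruction, not the original daemon): osd_snap_trim_sleep, "ceph osd perf" and "ceph tell osd.* injectargs" are real knobs and commands, but the JSON field names and the thresholds below are assumptions to tune for your own cluster.

#!/usr/bin/env python3
# Toy "watch cluster load, adjust snap_trim_sleep" loop - a sketch only.
import json
import subprocess
import time

BUSY_LATENCY_MS = 50   # assumed threshold for "the cluster is busy"
SLEEP_WHEN_BUSY = 1.0  # back off snap trimming under client load
SLEEP_WHEN_IDLE = 0.1  # trim more aggressively when the cluster is quiet

def avg_commit_latency_ms():
    """Average commit latency across OSDs, per "ceph osd perf" (field names
    vary by release, so adjust the parsing for your version)."""
    out = subprocess.check_output(["ceph", "osd", "perf", "--format", "json"])
    infos = json.loads(out).get("osd_perf_infos", [])
    lats = [o["perf_stats"]["commit_latency_ms"] for o in infos]
    return sum(lats) / len(lats) if lats else 0.0

def set_snap_trim_sleep(seconds):
    """Inject a new osd_snap_trim_sleep into every OSD at runtime."""
    subprocess.check_call(["ceph", "tell", "osd.*", "injectargs",
                           "--osd_snap_trim_sleep %s" % seconds])

current = None
while True:
    busy = avg_commit_latency_ms() > BUSY_LATENCY_MS
    wanted = SLEEP_WHEN_BUSY if busy else SLEEP_WHEN_IDLE
    if wanted != current:  # only poke the OSDs when the value actually changes
        set_snap_trim_sleep(wanted)
        current = wanted
    time.sleep(60)

The point is just that the sleep only needs to be low while clients are quiet; anything that can read a load signal and inject the option will do.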

This was all on a Hammer cluster. The changes that moved the snap trimming queues into the main OSD thread made our use case unviable on Jewel until later Jewel changes that happened after I left. It's exciting that this will actually be a reportable value from the cluster.

Sorry that this story doesn't really answer your question, except to say that people aware of this problem likely have a workaround for it. However, I'm certain that a lot more clusters are impacted by this than are aware of it, and being able to see it quickly would be helpful when troubleshooting problems. Backporting would be nice. I run a few Jewel clusters that host some VMs, and it would be nice to see how well they handle snap trimming, but they are much less critical and take far fewer snapshots.

On Thu, Dec 14, 2017 at 9:36 AM Piotr Dałek <piotr.dalek@corp.ovh.com> wrote:
> Hi,
>
> We recently ran into low disk space issues on our clusters, and it wasn't
> because of actual data. On those affected clusters we're hosting VMs and
> volumes, so naturally there are snapshots involved. For some time, we
> observed increased disk space usage that we couldn't explain, as there was
> a discrepancy between what Ceph reported and the actual space used on disks.
> We finally found out that snap trim queues were both long and not getting
> any shorter, and decreasing snap trim sleep and increasing max concurrent
> snap trims helped reverse the trend - we're safe now.
>
> The problem is, we haven't been aware of this issue for some time, and
> there's no easy (and fast[1]) way to check this. I made a pull request[2]
> that makes snap trim queue lengths available to monitoring tools
> and also generates a health warning when things go out of control, so an
> admin can act before hell breaks loose.
>
> My question is, how many Jewel users would be interested in such a feature?
> There are a lot of changes between Luminous and Jewel, and it's not going to
> be a straight backport, but it's not a big patch either, so I won't mind
> doing it myself. But having some support from users would be helpful in
> pushing this into the next Jewel release.
>
> Thanks!
>
> [1] one of our guys hacked a bash one-liner that printed out snap trim queue
> lengths for all PGs, but a full run takes over an hour to complete on a
> cluster with over 20k PGs...
> [2] https://github.com/ceph/ceph/pull/19520
>
> --
> Piotr Dałek
> piotr.dalek@corp.ovh.com
> https://www.ovh.com/us/