* snap_trimming + backfilling is inefficient with many purged_snaps
@ 2014-09-18 12:50 ` Dan Van Der Ster
  2014-09-18 17:03   ` Florian Haas
From: Dan Van Der Ster @ 2014-09-18 12:50 UTC (permalink / raw)
  To: Ceph Development; +Cc: Florian Haas

(moving this discussion to -devel)

> Begin forwarded message:
> 
> From: Florian Haas <florian@hastexo.com>
> Date: 17 Sep 2014 18:02:09 CEST
> Subject: Re: [ceph-users] RGW hung, 2 OSDs using 100% CPU
> To: Dan Van Der Ster <daniel.vanderster@cern.ch>
> Cc: Craig Lewis <clewis@centraldesktop.com>, "ceph-users@lists.ceph.com" <ceph-users@lists.ceph.com>
> 
> On Wed, Sep 17, 2014 at 5:42 PM, Dan Van Der Ster
> <daniel.vanderster@cern.ch> wrote:
>> From: Florian Haas <florian@hastexo.com>
>> Sent: Sep 17, 2014 5:33 PM
>> To: Dan Van Der Ster
>> Cc: Craig Lewis <clewis@centraldesktop.com>;ceph-users@lists.ceph.com
>> Subject: Re: [ceph-users] RGW hung, 2 OSDs using 100% CPU
>> 
>> On Wed, Sep 17, 2014 at 5:24 PM, Dan Van Der Ster
>> <daniel.vanderster@cern.ch> wrote:
>>> Hi Florian,
>>> 
>>>> On 17 Sep 2014, at 17:09, Florian Haas <florian@hastexo.com> wrote:
>>>> 
>>>> Hi Craig,
>>>> 
>>>> just dug this up in the list archives.
>>>> 
>>>> On Fri, Mar 28, 2014 at 2:04 AM, Craig Lewis <clewis@centraldesktop.com>
>>>> wrote:
>>>>> In the interest of removing variables, I removed all snapshots on all
>>>>> pools,
>>>>> then restarted all ceph daemons at the same time.  This brought up osd.8
>>>>> as
>>>>> well.
>>>> 
>>>> So just to summarize this: your 100% CPU problem at the time went away
>>>> after you removed all snapshots, and the actual cause of the issue was
>>>> never found?
>>>> 
>>>> I am seeing a similar issue now, and have filed
>>>> http://tracker.ceph.com/issues/9503 to make sure it doesn't get lost
>>>> again. Can you take a look at that issue and let me know if anything
>>>> in the description sounds familiar?
>>> 
>>> 
>>> Could your ticket be related to the snap trimming issue I’ve finally
>>> narrowed down in the past couple days?
>>> 
>>>  http://tracker.ceph.com/issues/9487
>>> 
>>> Bump up debug_osd to 20 then check the log during one of your incidents.
>>> If it is busy logging the snap_trimmer messages, then it’s the same issue.
>>> (The issue is that rbd pools have many purged_snaps, but sometimes after
>>> backfilling a PG the purged_snaps list is lost and thus the snap trimmer
>>> becomes very busy whilst re-trimming thousands of snaps. During that time (a
>>> few minutes on my cluster) the OSD is blocked.)
>> 
>> That sounds promising, thank you! debug_osd=10 should actually be
>> sufficient as those snap_trim messages get logged at that level. :)
>> 
>> Do I understand your issue report correctly in that you have found
>> setting osd_snap_trim_sleep to be ineffective, because it's being
>> applied when iterating from PG to PG, rather than from snap to snap?
>> If so, then I'm guessing that that can hardly be intentional…


I’m beginning to agree with you on that guess. AFAICT, the normal behavior of the snap trimmer is to trim one single snap, the one which is in the snap_trimq but not yet in purged_snaps. So the only time the current sleep implementation could be useful is if we rm’d a snap across many PGs at once, e.g. rm a pool snap or an rbd snap. But those aren’t a huge problem anyway since you’d at most need to trim O(100) PGs.

We could move the snap trim sleep into the SnapTrimmer state machine, for example in ReplicatedPG::NotTrimming::react. This should allow other IOs to get through to the OSD, but of course the trimming PG would remain locked. And it would be locked for even longer now due to the sleep.

To solve that we could limit the number of trims per instance of the SnapTrimmer, like I’ve done in this pull req: https://github.com/ceph/ceph/pull/2516
Breaking out of the trimmer like that should allow IOs to the trimming PG to get through.

The second aspect of this issue is why the purged_snaps are being lost to begin with. I’ve managed to reproduce that on my test cluster. All you have to do is create many pool snaps (e.g. of a nearly empty pool), then rmsnap all those snapshots. Then use crush reweight to move the PGs around. With debug_osd>=10, you will see “adding snap 1 to purged_snaps”, which is one signature of this lost purged_snaps issue. To reproduce slow requests, the number of snaps purged needs to be O(10000).
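
For reference, the reproducer boils down to roughly the following (pool name, snapshot count and OSD id are arbitrary placeholders, not the exact values I used):

  ceph osd pool create snaptest 64 64
  for i in $(seq 1 10000); do rados -p snaptest mksnap snap-$i; done
  for i in $(seq 1 10000); do rados -p snaptest rmsnap snap-$i; done
  ceph tell osd.* injectargs '--debug_osd 10'
  ceph osd crush reweight osd.0 0.5    # nudge a CRUSH weight to move PGs around
  # then watch the OSD logs for "adding snap 1 to purged_snaps"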

Looking forward to any ideas someone might have.

Cheers, Dan





* Re: snap_trimming + backfilling is inefficient with many purged_snaps
  2014-09-18 12:50 ` snap_trimming + backfilling is inefficient with many purged_snaps Dan Van Der Ster
@ 2014-09-18 17:03   ` Florian Haas
From: Florian Haas @ 2014-09-18 17:03 UTC (permalink / raw)
  To: Dan Van Der Ster; +Cc: Ceph Development

Hi Dan,

saw the pull request, and can confirm your observations, at least
partially. Comments inline.

On Thu, Sep 18, 2014 at 2:50 PM, Dan Van Der Ster
<daniel.vanderster@cern.ch> wrote:
>>> Do I understand your issue report correctly in that you have found
>>> setting osd_snap_trim_sleep to be ineffective, because it's being
>>> applied when iterating from PG to PG, rather than from snap to snap?
>>> If so, then I'm guessing that that can hardly be intentional…
>
>
> I’m beginning to agree with you on that guess. AFAICT, the normal behavior of the snap trimmer is to trim one single snap, the one which is in the snap_trimq but not yet in purged_snaps. So the only time the current sleep implementation could be useful is if we rm’d a snap across many PGs at once, e.g. rm a pool snap or an rbd snap. But those aren’t a huge problem anyway since you’d at most need to trim O(100) PGs.

Hmm. I'm actually seeing this in a system where the problematic snaps
could *only* have been RBD snaps.

> We could move the snap trim sleep into the SnapTrimmer state machine, for example in ReplicatedPG::NotTrimming::react. This should allow other IOs to get through to the OSD, but of course the trimming PG would remain locked. And it would be locked for even longer now due to the sleep.
>
> To solve that we could limit the number of trims per instance of the SnapTrimmer, like I’ve done in this pull req: https://github.com/ceph/ceph/pull/2516
> Breaking out of the trimmer like that should allow IOs to the trimming PG to get through.
>
> The second aspect of this issue is why are the purged_snaps being lost to begin with. I’ve managed to reproduce that on my test cluster. All you have to do is create many pool snaps (e.g. of a nearly empty pool), then rmsnap all those snapshots. Then use crush reweight to move the PGs around. With debug_osd>=10, you will see "adding snap 1 to purged_snaps”, which is one signature of this lost purged_snaps issue. To reproduce slow requests the number of snaps purged needs to be O(10000).

Hmmm, I'm not sure I can confirm that. I see "adding snap X to
purged_snaps", but only after the snap has been purged. See
https://gist.github.com/fghaas/88db3cd548983a92aa35. Of course, the
fact that the OSD tries to trim a snap only to get an ENOENT is
probably indicative of something being fishy with the snaptrimq and/or
the purged_snaps list as well.

> Looking forward to any ideas someone might have.

So am I. :)

Cheers,
Florian

* Re: snap_trimming + backfilling is inefficient with many purged_snaps
@ 2014-09-18 19:03       ` Florian Haas
  2014-09-18 19:12       ` Dan van der Ster
  2014-09-18 19:31       ` Dan van der Ster
From: Florian Haas @ 2014-09-18 19:03 UTC (permalink / raw)
  To: Mango Thirtyfour; +Cc: ceph-devel

On Thu, Sep 18, 2014 at 8:56 PM, Mango Thirtyfour
<daniel.vanderster@cern.ch> wrote:
> Hi Florian,
>
> On Sep 18, 2014 7:03 PM, Florian Haas <florian@hastexo.com> wrote:
>>
>> Hi Dan,
>>
>> saw the pull request, and can confirm your observations, at least
>> partially. Comments inline.
>>
>> On Thu, Sep 18, 2014 at 2:50 PM, Dan Van Der Ster
>> <daniel.vanderster@cern.ch> wrote:
>> >>> Do I understand your issue report correctly in that you have found
>> >>> setting osd_snap_trim_sleep to be ineffective, because it's being
>> >>> applied when iterating from PG to PG, rather than from snap to snap?
>> >>> If so, then I'm guessing that that can hardly be intentional…
>> >
>> >
>> > I’m beginning to agree with you on that guess. AFAICT, the normal behavior of the snap trimmer is to trim one single snap, the one which is in the snap_trimq but not yet in purged_snaps. So the only time the current sleep implementation could be useful is if we rm’d a snap across many PGs at once, e.g. rm a pool snap or an rbd snap. But those aren’t a huge problem anyway since you’d at most need to trim O(100) PGs.
>>
>> Hmm. I'm actually seeing this in a system where the problematic snaps
>> could *only* have been RBD snaps.
>>
>
> True, as am I. The current sleep is useful in this case, but since we'd normally only expect up to ~100 of these PGs per OSD, the trimming of 1 snap across all of those PGs would finish rather quickly anyway. Latency would surely be increased momentarily, but I wouldn't expect 90s slow requests like I have with the 30000 snap_trimq single PG.
>
> Possibly the sleep is useful in both places.
>
>> > We could move the snap trim sleep into the SnapTrimmer state machine, for example in ReplicatedPG::NotTrimming::react. This should allow other IOs to get through to the OSD, but of course the trimming PG would remain locked. And it would be locked for even longer now due to the sleep.
>> >
>> > To solve that we could limit the number of trims per instance of the SnapTrimmer, like I’ve done in this pull req: https://github.com/ceph/ceph/pull/2516
>> > Breaking out of the trimmer like that should allow IOs to the trimming PG to get through.
>> >
>> > The second aspect of this issue is why are the purged_snaps being lost to begin with. I’ve managed to reproduce that on my test cluster. All you have to do is create many pool snaps (e.g. of a nearly empty pool), then rmsnap all those snapshots. Then use crush reweight to move the PGs around. With debug_osd>=10, you will see "adding snap 1 to purged_snaps”, which is one signature of this lost purged_snaps issue. To reproduce slow requests the number of snaps purged needs to be O(10000).
>>
>> Hmmm, I'm not sure if I confirm that. I see "adding snap X to
>> purged_snaps", but only after the snap has been purged. See
>> https://gist.github.com/fghaas/88db3cd548983a92aa35. Of course, the
>> fact that the OSD tries to trim a snap only to get an ENOENT is
>> probably indicative of something being fishy with the snaptrimq and/or
>> the purged_snaps list as well.
>>
>
> With such a long snap_trimq there in your log, I suspect you're seeing the exact same behavior as I am. In my case the first snap trimmed is snap 1, of course because that is the first rm'd snap, and the contents of your pool are surely different. I also see the ENOENT messages... again confirming those snaps were already trimmed. Anyway, what I've observed is that a large snap_trimq like that will block the OSD until they are all re-trimmed.

That's... a mess.

So what is your workaround for recovery? My hunch would be to

- stop all access to the cluster;
- set nodown and noout so that other OSDs don't mark spinning OSDs
down (which would cause all sorts of primary and PG reassignments,
useless backfill/recovery when mon osd down out interval expires,
etc.);
- set osd_snap_trim_sleep to a ridiculously high value like 10 or 30
so that at least *between* PGs, the OSD has a chance to respond to
heartbeats and do whatever else it needs to do;
- let the snap trim play itself out over several hours (days?).
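
In concrete commands, that hunch would be something like the following (the sleep value is a wild guess):

  # after stopping client access:
  ceph osd set nodown
  ceph osd set noout
  ceph tell osd.* injectargs '--osd_snap_trim_sleep 10'
  # or set "osd snap trim sleep = 10" under [osd] in ceph.conf and restart the OSDs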

That sounds utterly awful, but if anyone has a better idea (other than
"wait until the patch is merged"), I'd be all ears.

Cheers
Florian

* Re: snap_trimming + backfilling is inefficient with many purged_snaps
  2014-09-18 19:03       ` Florian Haas
@ 2014-09-18 19:12       ` Dan van der Ster
  2014-09-18 21:19         ` Florian Haas
  2014-09-18 19:31       ` Dan van der Ster
From: Dan van der Ster @ 2014-09-18 19:12 UTC (permalink / raw)
  To: Florian Haas; +Cc: ceph-devel

Hi,

September 18 2014 9:03 PM, "Florian Haas" <florian@hastexo.com> wrote: 
> On Thu, Sep 18, 2014 at 8:56 PM, Dan van der Ster <daniel.vanderster@cern.ch> wrote:
>> With such a long snap_trimq there in your log, I suspect you're seeing the exact same behavior as I
>> am. In my case the first snap trimmed is snap 1, of course because that is the first rm'd snap, and
>> the contents of your pool are surely different. I also see the ENOENT messages... again confirming
>> those snaps were already trimmed. Anyway, what I've observed is that a large snap_trimq like that
>> will block the OSD until they are all re-trimmed.
> 
> That's... a mess.
> 
> So what is your workaround for recovery? My hunch would be to
> 
> - stop all access to the cluster;
> - set nodown and noout so that other OSDs don't mark spinning OSDs
> down (which would cause all sorts of primary and PG reassignments,
> useless backfill/recovery when mon osd down out interval expires,
> etc.);
> - set osd_snap_trim_sleep to a ridiculously high value like 10 or 30
> so that at least *between* PGs, the OSD has a chance to respond to
> heartbeats and do whatever else it needs to do;
> - let the snap trim play itself out over several hours (days?).
> 

What I've been doing is I just continue draining my OSDs, two at a time. Each time, 1-2 other OSDs become blocked for a couple minutes (out of the ~1 hour it takes to drain) while a single PG re-trims, leading to ~100 slow requests. The OSD must still be responding to the peer pings, since other OSDs do not mark it down. Luckily this doesn't happen with every single movement of our pool 5 PGs, otherwise it would be a disaster like you said.

Cheers, Dan

> That sounds utterly awful, but if anyone has a better idea (other than
> "wait until the patch is merged"), I'd be all ears.
> 
> Cheers
> Florian

* Re: snap_trimming + backfilling is inefficient with many purged_snaps
  2014-09-18 19:03       ` Florian Haas
  2014-09-18 19:12       ` Dan van der Ster
@ 2014-09-18 19:31       ` Dan van der Ster
From: Dan van der Ster @ 2014-09-18 19:31 UTC (permalink / raw)
  To: Florian Haas; +Cc: ceph-devel

-- Dan van der Ster || Data & Storage Services || CERN IT Department --

September 18 2014 9:12 PM, "Dan van der Ster" <daniel.vanderster@cern.ch> wrote: 
> What I've been doing is I just continue draining my OSDs, two at a time. Each time, 1-2 other OSDs
> become blocked for a couple minutes (out of the ~1 hour it takes to drain) while a single PG
> re-trims, leading to ~100 slow requests. The OSD must still be responding to the peer pings, since
> other OSDs do not mark it down. Luckily this doesn't happen with every single movement of our pool
> 5 PGs, otherwise it would be a disaster like you said.

Two other, riskier work-arounds that I haven't tried yet are:

1. Lower the osd_snap_trim_thread_timeout from 3600s to something like 10 or 20s, so that these long trim operations are simply killed. I have no idea whether this is safe.
2. Pay close attention to the slow requests and manually mark the affected OSDs down when they become blocked. By marking the trimming OSD down, the IOs should go elsewhere until the OSD can recover again later. But I don't know how the backfilling OSD will behave if it is manually marked down while trimming.
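
Roughly, and completely untested (the OSD id is just an example):

  # 1) in ceph.conf, [osd] section, then restart the OSDs:
  #      osd snap trim thread timeout = 20
  # 2) when an OSD gets stuck re-trimming:
  ceph osd down 12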

Cheers, Dan

* Re: snap_trimming + backfilling is inefficient with many purged_snaps
  2014-09-18 19:12       ` Dan van der Ster
@ 2014-09-18 21:19         ` Florian Haas
From: Florian Haas @ 2014-09-18 21:19 UTC (permalink / raw)
  To: Dan van der Ster

On Thu, Sep 18, 2014 at 9:12 PM, Dan van der Ster
<daniel.vanderster@cern.ch> wrote:
> What I've been doing is I just continue draining my OSDs, two at a time. Each time, 1-2 other OSDs become blocked for a couple minutes (out of the ~1 hour it takes to drain) while a single PG re-trims, leading to ~100 slow requests. The OSD must still be responding to the peer pings, since other OSDs do not mark it down. Luckily this doesn't happen with every single movement of our pool 5 PGs, otherwise it would be a disaster like you said.

So just to clarify, what you're doing is out of the OSDs that are
spinning, you mark 2 out and wait for them to go empty?

What I'm seeing in my environment is that the OSDs *do* go down.
Marking them out seems not to help much as the problem then promptly
pops up elsewhere.

So, disaster is a pretty good description. Would anyone from the core
team like to suggest another course of action or workaround, or are
Dan and I generally on the right track to make the best out of a
pretty bad situation?

It would be helpful for others who bought into the "snapshots are
awesome, cheap and you can have as many as you want" mantra, so that
they don't have their cluster blow up in their faces at some point.
Because right now, to me it seems that as you go past maybe a few
thousand snapshots and then at some point want to remove lots of them
at the same time, you'd better be scared. Happy to stand corrected,
though. :)

Cheers,
Florian

* Re: snap_trimming + backfilling is inefficient with many purged_snaps
@ 2014-09-18 22:27               ` Sage Weil
  2014-09-19  6:12                 ` Florian Haas
From: Sage Weil @ 2014-09-18 22:27 UTC (permalink / raw)
  To: Florian Haas; +Cc: Dan van der Ster, ceph-devel

On Fri, 19 Sep 2014, Florian Haas wrote:
> Hi Sage,
> 
> was the off-list reply intentional?

Whoops!  Nope :)

> On Thu, Sep 18, 2014 at 11:47 PM, Sage Weil <sweil@redhat.com> wrote:
> >> So, disaster is a pretty good description. Would anyone from the core
> >> team like to suggest another course of action or workaround, or are
> >> Dan and I generally on the right track to make the best out of a
> >> pretty bad situation?
> >
> > The short term fix would probably be to just prevent backfill for the time
> > being until the bug is fixed.
> 
> As in, osd max backfills = 0?

Yeah :)
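
Concretely, something along the lines of

  ceph tell osd.* injectargs '--osd_max_backfills 0'

or the equivalent in ceph.conf, until the fix is merged.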

Just managed to reproduce the problem...

sage

> > The root of the problem seems to be that it is trying to trim snaps that
> > aren't there.  I'm trying to reproduce the issue now!  Hopefully the fix
> > is simple...
> >
> >         http://tracker.ceph.com/issues/9487
> >
> > Thanks!
> > sage
> 
> Thanks. :)
> 
> Cheers,
> Florian
> 
> 


* Re: snap_trimming + backfilling is inefficient with many purged_snaps
  2014-09-18 22:27               ` Sage Weil
@ 2014-09-19  6:12                 ` Florian Haas
  2014-09-19  8:41                   ` Dan Van Der Ster
From: Florian Haas @ 2014-09-19  6:12 UTC (permalink / raw)
  To: Sage Weil; +Cc: Dan van der Ster, ceph-devel

On Fri, Sep 19, 2014 at 12:27 AM, Sage Weil <sweil@redhat.com> wrote:
> On Fri, 19 Sep 2014, Florian Haas wrote:
>> Hi Sage,
>>
>> was the off-list reply intentional?
>
> Whoops!  Nope :)
>
>> On Thu, Sep 18, 2014 at 11:47 PM, Sage Weil <sweil@redhat.com> wrote:
>> >> So, disaster is a pretty good description. Would anyone from the core
>> >> team like to suggest another course of action or workaround, or are
>> >> Dan and I generally on the right track to make the best out of a
>> >> pretty bad situation?
>> >
>> > The short term fix would probably be to just prevent backfill for the time
>> > being until the bug is fixed.
>>
>> As in, osd max backfills = 0?
>
> Yeah :)
>
> Just managed to reproduce the problem...
>
> sage

Saw the wip branch. Color me freakishly impressed on the turnaround. :) Thanks!

Cheers,
Florian


* Re: snap_trimming + backfilling is inefficient with many purged_snaps
  2014-09-19  6:12                 ` Florian Haas
@ 2014-09-19  8:41                   ` Dan Van Der Ster
  2014-09-19 12:58                     ` Dan van der Ster
From: Dan Van Der Ster @ 2014-09-19  8:41 UTC (permalink / raw)
  To: Florian Haas; +Cc: Sage Weil, ceph-devel

> On 19 Sep 2014, at 08:12, Florian Haas <florian@hastexo.com> wrote:
> 
> On Fri, Sep 19, 2014 at 12:27 AM, Sage Weil <sweil@redhat.com> wrote:
>> On Fri, 19 Sep 2014, Florian Haas wrote:
>>> Hi Sage,
>>> 
>>> was the off-list reply intentional?
>> 
>> Whoops!  Nope :)
>> 
>>> On Thu, Sep 18, 2014 at 11:47 PM, Sage Weil <sweil@redhat.com> wrote:
>>>>> So, disaster is a pretty good description. Would anyone from the core
>>>>> team like to suggest another course of action or workaround, or are
>>>>> Dan and I generally on the right track to make the best out of a
>>>>> pretty bad situation?
>>>> 
>>>> The short term fix would probably be to just prevent backfill for the time
>>>> being until the bug is fixed.
>>> 
>>> As in, osd max backfills = 0?
>> 
>> Yeah :)
>> 
>> Just managed to reproduce the problem...
>> 
>> sage
> 
> Saw the wip branch. Color me freakishly impressed on the turnaround. :) Thanks!

Indeed :) Thanks Sage!
wip-9487-dumpling fixes the problem on my test cluster. Trying in prod now…
Cheers, Dan


* Re: snap_trimming + backfilling is inefficient with many purged_snaps
  2014-09-19  8:41                   ` Dan Van Der Ster
@ 2014-09-19 12:58                     ` Dan van der Ster
  2014-09-19 15:19                       ` Sage Weil
  2014-09-19 15:37                       ` Dan van der Ster
From: Dan van der Ster @ 2014-09-19 12:58 UTC (permalink / raw)
  To: Sage Weil; +Cc: ceph-devel, Florian Haas

On Fri, Sep 19, 2014 at 10:41 AM, Dan Van Der Ster
<daniel.vanderster@cern.ch> wrote:
>> On 19 Sep 2014, at 08:12, Florian Haas <florian@hastexo.com> wrote:
>>
>> On Fri, Sep 19, 2014 at 12:27 AM, Sage Weil <sweil@redhat.com> wrote:
>>> On Fri, 19 Sep 2014, Florian Haas wrote:
>>>> Hi Sage,
>>>>
>>>> was the off-list reply intentional?
>>>
>>> Whoops!  Nope :)
>>>
>>>> On Thu, Sep 18, 2014 at 11:47 PM, Sage Weil <sweil@redhat.com> wrote:
>>>>>> So, disaster is a pretty good description. Would anyone from the core
>>>>>> team like to suggest another course of action or workaround, or are
>>>>>> Dan and I generally on the right track to make the best out of a
>>>>>> pretty bad situation?
>>>>>
>>>>> The short term fix would probably be to just prevent backfill for the time
>>>>> being until the bug is fixed.
>>>>
>>>> As in, osd max backfills = 0?
>>>
>>> Yeah :)
>>>
>>> Just managed to reproduce the problem...
>>>
>>> sage
>>
>> Saw the wip branch. Color me freakishly impressed on the turnaround. :) Thanks!
>
> Indeed :) Thanks Sage!
> wip-9487-dumpling fixes the problem on my test cluster. Trying in prod now…

Final update, after 4 hours in prod and after draining 8 OSDs -- zero
slow requests :)

Thanks again!

Dan

* Re: snap_trimming + backfilling is inefficient with many purged_snaps
  2014-09-19 12:58                     ` Dan van der Ster
@ 2014-09-19 15:19                       ` Sage Weil
  2014-09-19 15:37                       ` Dan van der Ster
From: Sage Weil @ 2014-09-19 15:19 UTC (permalink / raw)
  To: Dan van der Ster; +Cc: ceph-devel, Florian Haas

On Fri, 19 Sep 2014, Dan van der Ster wrote:
> Final update, after 4 hours in prod and after draining 8 OSDs -- zero
> slow requests :)

That's great news!

But, please be careful.  This code hasn't been reviewed yet or been through 
any testing!  I would hold off on further backfills until it's merged.

Thanks!
sage


* Re: snap_trimming + backfilling is inefficient with many purged_snaps
  2014-09-19 12:58                     ` Dan van der Ster
  2014-09-19 15:19                       ` Sage Weil
@ 2014-09-19 15:37                       ` Dan van der Ster
  2014-10-15 14:47                         ` Dan Van Der Ster
From: Dan van der Ster @ 2014-09-19 15:37 UTC (permalink / raw)
  To: Sage Weil; +Cc: ceph-devel, Florian Haas

September 19 2014 5:19 PM, "Sage Weil" <sweil@redhat.com> wrote: 
> On Fri, 19 Sep 2014, Dan van der Ster wrote:
> 
>> On Fri, Sep 19, 2014 at 10:41 AM, Dan Van Der Ster
>> <daniel.vanderster@cern.ch> wrote:
>>>> On 19 Sep 2014, at 08:12, Florian Haas <florian@hastexo.com> wrote:
>>>> 
>>>> On Fri, Sep 19, 2014 at 12:27 AM, Sage Weil <sweil@redhat.com> wrote:
>>>>> On Fri, 19 Sep 2014, Florian Haas wrote:
>>>>>> Hi Sage,
>>>>>> 
>>>>>> was the off-list reply intentional?
>>>>> 
>>>>> Whoops! Nope :)
>>>>> 
>>>>>> On Thu, Sep 18, 2014 at 11:47 PM, Sage Weil <sweil@redhat.com> wrote:
>>>>>>>> So, disaster is a pretty good description. Would anyone from the core
>>>>>>>> team like to suggest another course of action or workaround, or are
>>>>>>>> Dan and I generally on the right track to make the best out of a
>>>>>>>> pretty bad situation?
>>>>>>> 
>>>>>>> The short term fix would probably be to just prevent backfill for the time
>>>>>>> being until the bug is fixed.
>>>>>> 
>>>>>> As in, osd max backfills = 0?
>>>>> 
>>>>> Yeah :)
>>>>> 
>>>>> Just managed to reproduce the problem...
>>>>> 
>>>>> sage
>>>> 
>>>> Saw the wip branch. Color me freakishly impressed on the turnaround. :) Thanks!
>>> 
>>> Indeed :) Thanks Sage!
>>> wip-9487-dumpling fixes the problem on my test cluster. Trying in prod now?
>> 
>> Final update, after 4 hours in prod and after draining 8 OSDs -- zero
>> slow requests :)
> 
> That's great news!
> 
> But, please be careful. This code hasn't been reviewed yet or been through
> any testing! I would hold off on further backfills until it's merged.

Roger; I've been watching it very closely and so far it seems to work very well. Looking forward to that merge :)

Cheers, Dan


> 
> Thanks!
> sage

* Re: snap_trimming + backfilling is inefficient with many purged_snaps
@ 2014-09-21 13:33                           ` Florian Haas
  2014-09-21 14:26                           ` Dan van der Ster
  2014-09-21 19:41                           ` Sage Weil
From: Florian Haas @ 2014-09-21 13:33 UTC (permalink / raw)
  To: Alphe Salas; +Cc: Dan van der Ster, ceph-devel, Sage Weil

On Sat, Sep 20, 2014 at 9:08 PM, Alphe Salas <asalas@kepler.cl> wrote:
> Real field testings and proof workout are better than any unit testing ... I
> would follow Dan's notice of resolution because it's based on a real problem and
> not a phony-style test ground.

That statement is almost an insult to the authors and maintainers of
the testing framework around Ceph. Therefore, I'm taking the liberty
to register my objection.

That said, I'm not sure that wip-9487-dumpling is the final fix to the
issue. On the system where I am seeing the issue, even with the fix
deployed, OSDs still not only go crazy snap trimming (which by itself
would be understandable, as the system has indeed recently had
thousands of snapshots removed), but they also still produce the
previously seen ENOENT messages indicating they're trying to trim
snaps that aren't there.

That system, however, has PGs marked as recovering, not backfilling as
in Dan's system. Not sure if wip-9487 falls short of fixing the issue
at its root. Sage, whenever you have time, would you mind commenting?

Cheers,
Florian


* Re: snap_trimming + backfilling is inefficient with many purged_snaps
  2014-09-21 13:33                           ` Florian Haas
@ 2014-09-21 14:26                           ` Dan van der Ster
  2014-09-21 15:27                             ` Florian Haas
  2014-09-21 19:41                           ` Sage Weil
From: Dan van der Ster @ 2014-09-21 14:26 UTC (permalink / raw)
  To: Florian Haas; +Cc: ceph-devel, Sage Weil

Hi Florian,

September 21 2014 3:33 PM, "Florian Haas" <florian@hastexo.com> wrote: 
> That said, I'm not sure that wip-9487-dumpling is the final fix to the
> issue. On the system where I am seeing the issue, even with the fix
> deployed, osd's still not only go crazy snap trimming (which by itself
> would be understandable, as the system has indeed recently had
> thousands of snapshots removed), but they also still produce the
> previously seen ENOENT messages indicating they're trying to trim
> snaps that aren't there.
> 

You should be able to tell exactly how many snaps need to be trimmed. Check the current purged_snaps with

ceph pg x.y query

and also check the snap_trimq from debug_osd=10. The problem fixed in wip-9487 is the (mis)communication of purged_snaps to a new OSD. But if in your cluster purged_snaps is "correct" (which it should be after the fix from Sage), and it still has lots of snaps to trim, then I believe the only thing to do is let those snaps all get trimmed. (my other patch linked sometime earlier in this thread might help by breaking up all that trimming work into smaller pieces, but that was never tested).
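
For example (the pg and OSD ids here are made up):

  ceph pg 5.3f query | grep purged_snaps
  ceph tell osd.12 injectargs '--debug_osd 10'
  # then look for the snap_trimq lines in that OSD's log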

Entering the realm of speculation, I wonder if your OSDs are getting interrupted, marked down or out, or crashing before they have the opportunity to persist purged_snaps. purged_snaps is updated in ReplicatedPG::WaitingOnReplicas::react, but if the primary is too busy to actually send that transaction to its peers, then eventually it (or the new primary) needs to start again, and no progress is ever made. If this is what is happening on your cluster, then again, perhaps my osd_snap_trim_max patch could be a solution.

Cheers, Dan

> That system, however, has PGs marked as recovering, not backfilling as
> in Dan's system. Not sure if wip-9487 falls short of fixing the issue
> at its root. Sage, whenever you have time, would you mind commenting?
> 
> Cheers,
> Florian


* Re: snap_trimming + backfilling is inefficient with many purged_snaps
  2014-09-21 14:26                           ` Dan van der Ster
@ 2014-09-21 15:27                             ` Florian Haas
  2014-09-21 19:52                               ` Sage Weil
From: Florian Haas @ 2014-09-21 15:27 UTC (permalink / raw)
  To: Dan van der Ster; +Cc: ceph-devel, Sage Weil

On Sun, Sep 21, 2014 at 4:26 PM, Dan van der Ster
<daniel.vanderster@cern.ch> wrote:
> Hi Florian,
>
> September 21 2014 3:33 PM, "Florian Haas" <florian@hastexo.com> wrote:
>> That said, I'm not sure that wip-9487-dumpling is the final fix to the
>> issue. On the system where I am seeing the issue, even with the fix
>> deployed, osd's still not only go crazy snap trimming (which by itself
>> would be understandable, as the system has indeed recently had
>> thousands of snapshots removed), but they also still produce the
>> previously seen ENOENT messages indicating they're trying to trim
>> snaps that aren't there.
>>
>
> You should be able to tell exactly how many snaps need to be trimmed. Check the current purged_snaps with
>
> ceph pg x.y query
>
> and also check the snap_trimq from debug_osd=10. The problem fixed in wip-9487 is the (mis)communication of purged_snaps to a new OSD. But if in your cluster purged_snaps is "correct" (which it should be after the fix from Sage), and it still has lots of snaps to trim, then I believe the only thing to do is let those snaps all get trimmed. (my other patch linked sometime earlier in this thread might help by breaking up all that trimming work into smaller pieces, but that was never tested).

Yes, it does indeed look like the system does have thousands of
snapshots left to trim. That said, since the PGs are locked during
this time, this creates a situation where the cluster is becoming
unusable with no way for the user to recover.

> Entering the realm of speculation, I wonder if your OSDs are getting interrupted, marked down, out, or crashing before they have the opportunity to persist purged_snaps? purged_snaps is updated in ReplicatedPG::WaitingOnReplicas::react, but if the primary is too busy to actually send that transaction to its peers, so then eventually it or the new primary needs to start again, and no progress is ever made. If this is what is happening on your cluster, then again, perhaps my osd_snap_trim_max patch could be a solution.

Since the snap trimmer immediately jacks the affected OSDs up to 100%
CPU utilization, and they stop even responding to heartbeats, yes they
do get marked down, and that makes the issue much worse. Even when
setting nodown, though, that doesn't change the fact that the
affected OSDs just spin practically indefinitely.

So, even with the patch for 9487, which fixes *your* issue of the
cluster trying to trim tons of snaps when in fact it should be
trimming only a handful, the user is still in a world of pain when
they do indeed have tons of snaps to trim. And obviously, neither of
osd max backfills nor osd recovery max active help here, because even
a single backfill/recovery makes the OSD go nuts.

There is the silly option of setting osd_snap_trim_sleep to say 61
minutes, and restarting the ceph-osd daemons before the snap trim can
kick in, i.e. hourly, via a cron job. Of course, while this prevents
the OSD from going into a death spin, it only perpetuates the problem
until a patch for this issue is available, because snap trimming never
even runs, let alone completes.

This is particularly bad because a user can get themselves a
non-functional cluster simply by trying to delete thousands of
snapshots at once. If you consider a tiny virtualization cluster of
just 100 persistent VMs, out of which you take one snapshot an hour,
then deleting the snapshots taken in one month puts you well above
that limit. So we're not talking about outrageous numbers here. I
don't think anyone can fault any user for attempting this.

What makes the situation even worse is that there is no cluster-wide
limit to the number of snapshots, or even say snapshots per RBD
volume, or snapshots per PG, nor any limit on the number of snapshots
deleted concurrently.

So yes, I think your patch absolutely still has merit, as would any
means of reducing the number of snapshots an OSD will trim in one go.
As it is, the situation looks really really bad, specifically
considering that RBD and RADOS are meant to be super rock solid, as
opposed to say CephFS which is in an experimental state. And contrary
to CephFS snapshots, I can't recall any documentation saying that RBD
snapshots will break your system.

Cheers,
Florian


* Re: snap_trimming + backfilling is inefficient with many purged_snaps
  2014-09-21 13:33                           ` Florian Haas
  2014-09-21 14:26                           ` Dan van der Ster
@ 2014-09-21 19:41                           ` Sage Weil
From: Sage Weil @ 2014-09-21 19:41 UTC (permalink / raw)
  To: Alphe Salas; +Cc: Dan van der Ster, ceph-devel, Florian Haas

On Sat, 20 Sep 2014, Alphe Salas wrote:
> Real field testings and proof workout are better than any unit testing ... I
> would follow Dan's notice of resolution because it's based on a real problem and
> not a phony-style test ground.

It's been reviewed and looks right, but the rados torture tests are 
pretty ... torturous, and this code is delicate.  I would still wait.

> Sage, apart from that problem, is there a solution to the ever-expanding replicas
> problem?

Discard for the kernel RBD client should go upstream this cycle.

As for RADOS consuming more data when RBD blocks are overwritten, I still 
have yet to see any actual evidence of this, and have a hard time seeing 
how it could happen.  A sequence of steps to reproduce would be the next 
step.

sage


* Re: snap_trimming + backfilling is inefficient with many purged_snaps
  2014-09-21 15:27                             ` Florian Haas
@ 2014-09-21 19:52                               ` Sage Weil
  2014-09-22 17:06                                 ` Florian Haas
From: Sage Weil @ 2014-09-21 19:52 UTC (permalink / raw)
  To: Florian Haas; +Cc: Dan van der Ster, ceph-devel

On Sun, 21 Sep 2014, Florian Haas wrote:
> So yes, I think your patch absolutely still has merit, as would any
> means of reducing the number of snapshots an OSD will trim in one go.
> As it is, the situation looks really really bad, specifically
> considering that RBD and RADOS are meant to be super rock solid, as
> opposed to say CephFS which is in an experimental state. And contrary
> to CephFS snapshots, I can't recall any documentation saying that RBD
> snapshots will break your system.

Yeah, it sounds like a separate issue, and no, the limit is not 
documented because it's definitely not the intended behavior. :)

...and I see you already have a log attached to #9503.  Will take a look.

Thanks!
sage



* Re: snap_trimming + backfilling is inefficient with many purged_snaps
  2014-09-21 19:52                               ` Sage Weil
@ 2014-09-22 17:06                                 ` Florian Haas
  2014-09-23 13:20                                   ` Florian Haas
From: Florian Haas @ 2014-09-22 17:06 UTC (permalink / raw)
  To: Sage Weil; +Cc: Dan van der Ster, ceph-devel

On Sun, Sep 21, 2014 at 9:52 PM, Sage Weil <sweil@redhat.com> wrote:
> On Sun, 21 Sep 2014, Florian Haas wrote:
>> So yes, I think your patch absolutely still has merit, as would any
>> means of reducing the number of snapshots an OSD will trim in one go.
>> As it is, the situation looks really really bad, specifically
>> considering that RBD and RADOS are meant to be super rock solid, as
>> opposed to say CephFS which is in an experimental state. And contrary
>> to CephFS snapshots, I can't recall any documentation saying that RBD
>> snapshots will break your system.
>
> Yeah, it sounds like a separate issue, and no, the limit is not
> documented because it's definitely not the intended behavior. :)
>
> ...and I see you already have a log attached to #9503.  Will take a look.

I've already updated that issue in Redmine, but for the list archives
I should also add this here: Dan's patch for #9503, together with
Sage's for #9487, makes the problem go away in an instant. I've
already pointed out that I owe Dan dinner, and Sage, well I already
owe Sage pretty much lifelong full board. :)

Everyone with a ton of snapshots in their clusters (not sure where the
threshold is, but it gets nasty somewhere between 1,000 and 10,000 I
imagine) should probably update to 0.67.11 and 0.80.6 as soon as they
come out, otherwise Terrible Things Will Happen™ if you're ever forced
to delete a large number of snaps at once.

Thanks again to Dan and Sage,
Florian

* Re: snap_trimming + backfilling is inefficient with many purged_snaps
  2014-09-22 17:06                                 ` Florian Haas
@ 2014-09-23 13:20                                   ` Florian Haas
  2014-09-23 20:00                                     ` Gregory Farnum
From: Florian Haas @ 2014-09-23 13:20 UTC (permalink / raw)
  To: Sage Weil; +Cc: Dan van der Ster, ceph-devel

On Mon, Sep 22, 2014 at 7:06 PM, Florian Haas <florian@hastexo.com> wrote:
> I've already updated that issue in Redmine, but for the list archives
> I should also add this here: Dan's patch for #9503, together with
> Sage's for #9487, makes the problem go away in an instant. I've
> already pointed out that I owe Dan dinner, and Sage, well I already
> owe Sage pretty much lifelong full board. :)

Looks like I was a bit too eager: the cluster behaves nicely with
these patches as long as nothing happens to any OSDs, but it does flag
PGs as incomplete when an OSD goes down. Once the mon osd down out
interval expires things seem to recover/backfill normally, but it's
still disturbing to see this in the interim.

I've updated http://tracker.ceph.com/issues/9503 with a pg query from
one of the affected PGs, within the mon osd down out interval, while
it was marked incomplete.
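
If anyone wants to gather the same data on their own cluster while an
OSD is down, something along these lines should work (the pg id below
is just an example):

    ceph health detail | grep incomplete   # list PGs currently flagged incomplete
    ceph pg 2.3f query > pg-2.3f.json      # full peering state for one of them

which is more or less how the attachment above was produced.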

Dan or Sage, any ideas as to what might be causing this?

Cheers,
Florian


* Re: snap_trimming + backfilling is inefficient with many purged_snaps
  2014-09-23 13:20                                   ` Florian Haas
@ 2014-09-23 20:00                                     ` Gregory Farnum
  2014-10-16  9:04                                       ` Florian Haas
       [not found]                                     ` <C207F487-4FD4-45FF-AE41-6A0E706C9D38@cern.ch>
  1 sibling, 1 reply; 26+ messages in thread
From: Gregory Farnum @ 2014-09-23 20:00 UTC (permalink / raw)
  To: Florian Haas; +Cc: Sage Weil, Dan van der Ster, ceph-devel

On Tue, Sep 23, 2014 at 6:20 AM, Florian Haas <florian@hastexo.com> wrote:
> On Mon, Sep 22, 2014 at 7:06 PM, Florian Haas <florian@hastexo.com> wrote:
>> On Sun, Sep 21, 2014 at 9:52 PM, Sage Weil <sweil@redhat.com> wrote:
>>> On Sun, 21 Sep 2014, Florian Haas wrote:
>>>> So yes, I think your patch absolutely still has merit, as would any
>>>> means of reducing the number of snapshots an OSD will trim in one go.
>>>> As it is, the situation looks really really bad, specifically
>>>> considering that RBD and RADOS are meant to be super rock solid, as
>>>> opposed to say CephFS which is in an experimental state. And contrary
>>>> to CephFS snapshots, I can't recall any documentation saying that RBD
>>>> snapshots will break your system.
>>>
>>> Yeah, it sounds like a separate issue, and no, the limit is not
>>> documented because it's definitely not the intended behavior. :)
>>>
>>> ...and I see you already have a log attached to #9503.  Will take a look.
>>
>> I've already updated that issue in Redmine, but for the list archives
>> I should also add this here: Dan's patch for #9503, together with
>> Sage's for #9487, makes the problem go away in an instant. I've
>> already pointed out that I owe Dan dinner, and Sage, well I already
>> owe Sage pretty much lifelong full board. :)
>
> Looks like I was bit too eager: while the cluster is behaving nicely
> with these patches while nothing happens to any OSDs, it does flag PGs
> as incomplete when an OSD goes down. Once the mon osd down out
> interval expires things seem to recover/backfill normally, but it's
> still disturbing to see this in the interim.
>
> I've updated http://tracker.ceph.com/issues/9503 with a pg query from
> one of the affected PGs, within the mon osd down out interval, while
> it was marked incomplete.
>
> Dan or Sage, any ideas as to what might be causing this?

That *looks* like it's just because the pool has both size and
min_size set to 2?
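
You can double-check what the pool is actually set to with something
like

    ceph osd dump | grep "^pool"    # shows "... replicated size N min_size M ..." per pool

or, if memory serves, "ceph osd pool get <poolname> min_size" on
recent releases.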
-Greg
Software Engineer #42 @ http://inktank.com | http://ceph.com


* Re: snap_trimming + backfilling is inefficient with many purged_snaps
       [not found]                                       ` <CAPUexz8MBvCbPaYYk2SWQqbgSLYBqmOArVwZkNqgH-=E0_V7cQ@mail.gmail.com>
@ 2014-09-24  0:05                                         ` Sage Weil
  2014-09-24 23:01                                           ` Florian Haas
  0 siblings, 1 reply; 26+ messages in thread
From: Sage Weil @ 2014-09-24  0:05 UTC (permalink / raw)
  To: Florian Haas; +Cc: Dan Van Der Ster, ceph-devel

Sam and I discussed this on IRC and have, we think, two simpler patches that 
solve the problem more directly.  See wip-9487.  Queued for testing now.  
Once that passes we can backport and test for firefly and dumpling too.

Note that this won't make the next dumpling or firefly point releases 
(which are imminent).  Should be in the next ones, though.

Upside is it looks like Sam found #9113 (snaptrimmer memory leak) at the 
same time, yay!

sage



* Re: snap_trimming + backfilling is inefficient with many purged_snaps
  2014-09-24  0:05                                         ` Sage Weil
@ 2014-09-24 23:01                                           ` Florian Haas
  0 siblings, 0 replies; 26+ messages in thread
From: Florian Haas @ 2014-09-24 23:01 UTC (permalink / raw)
  To: Sage Weil; +Cc: Dan Van Der Ster, ceph-devel

On Wed, Sep 24, 2014 at 1:05 AM, Sage Weil <sweil@redhat.com> wrote:
> Sam and I discussed this on IRC and have we think two simpler patches that
> solve the problem more directly.  See wip-9487.

So I understand this makes Dan's patch (and the config parameter it
introduces) unnecessary, but is it correct to assume that, just like
Dan's patch, yours too will only be effective if osd snap trim sleep
is set greater than 0?

> Queued for testing now.
> Once that passes we can backport and test for firefly and dumpling too.
>
> Note that this won't make the next dumpling or firefly point releases
> (which are imminent).  Should be in the next ones, though.

OK, just in case anyone else runs into problems after removing tons of
snapshots with <=0.67.11, what's the plan to get them going again
until 0.67.12 comes out? Install the autobuild package from the wip
branch?
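
In the meantime I suppose the stop-gap Sage suggested earlier in the
thread (keep backfill switched off until the fix lands) is the safest
bet, i.e. something like

    # runtime, all OSDs; equivalently set "osd max backfills = 0" in
    # the [osd] section of ceph.conf and restart
    ceph tell osd.* injectargs '--osd-max-backfills 0'

and then raise it again once everything is running patched packages.
Treat that as a sketch from memory, not something I have re-verified
here.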

Cheers,
Florian


* Re: snap_trimming + backfilling is inefficient with many purged_snaps
  2014-09-19 15:37                       ` Dan van der Ster
       [not found]                         ` <CAME-gARt1NZmEFj6SCpxkfxnibXyR7+AdYKO4YNkQc_n+XJuXQ@mail.gmail.com>
@ 2014-10-15 14:47                         ` Dan Van Der Ster
  2014-10-15 17:50                           ` Samuel Just
  1 sibling, 1 reply; 26+ messages in thread
From: Dan Van Der Ster @ 2014-10-15 14:47 UTC (permalink / raw)
  To: Sage Weil; +Cc: ceph-devel, Florian Haas

Hi Sage,

> On 19 Sep 2014, at 17:37, Dan Van Der Ster <daniel.vanderster@cern.ch> wrote:
> 
> September 19 2014 5:19 PM, "Sage Weil" <sweil@redhat.com> wrote: 
>> On Fri, 19 Sep 2014, Dan van der Ster wrote:
>> 
>>> On Fri, Sep 19, 2014 at 10:41 AM, Dan Van Der Ster
>>> <daniel.vanderster@cern.ch> wrote:
>>>>> On 19 Sep 2014, at 08:12, Florian Haas <florian@hastexo.com> wrote:
>>>>> 
>>>>> On Fri, Sep 19, 2014 at 12:27 AM, Sage Weil <sweil@redhat.com> wrote:
>>>>>> On Fri, 19 Sep 2014, Florian Haas wrote:
>>>>>>> Hi Sage,
>>>>>>> 
>>>>>>> was the off-list reply intentional?
>>>>>> 
>>>>>> Whoops! Nope :)
>>>>>> 
>>>>>>> On Thu, Sep 18, 2014 at 11:47 PM, Sage Weil <sweil@redhat.com> wrote:
>>>>>>>>> So, disaster is a pretty good description. Would anyone from the core
>>>>>>>>> team like to suggest another course of action or workaround, or are
>>>>>>>>> Dan and I generally on the right track to make the best out of a
>>>>>>>>> pretty bad situation?
>>>>>>>> 
>>>>>>>> The short term fix would probably be to just prevent backfill for the time
>>>>>>>> being until the bug is fixed.
>>>>>>> 
>>>>>>> As in, osd max backfills = 0?
>>>>>> 
>>>>>> Yeah :)
>>>>>> 
>>>>>> Just managed to reproduce the problem...
>>>>>> 
>>>>>> sage
>>>>> 
>>>>> Saw the wip branch. Color me freakishly impressed on the turnaround. :) Thanks!
>>>> 
>>>> Indeed :) Thanks Sage!
>>>> wip-9487-dumpling fixes the problem on my test cluster. Trying in prod now?
>>> 
>>> Final update, after 4 hours in prod and after draining 8 OSDs -- zero
>>> slow requests :)
>> 
>> That's great news!
>> 
>> But, please be careful. This code hasn't been reviewed yet or been through
>> any testing! I would hold off on further backfills until it's merged.


Any news on those merges? It would be good to get this fixed on the dumpling and firefly branches. We're kind of stuck at the moment :(

Cheers, Dan




* Re: snap_trimming + backfilling is inefficient with many purged_snaps
  2014-10-15 14:47                         ` Dan Van Der Ster
@ 2014-10-15 17:50                           ` Samuel Just
  0 siblings, 0 replies; 26+ messages in thread
From: Samuel Just @ 2014-10-15 17:50 UTC (permalink / raw)
  To: Dan Van Der Ster; +Cc: Sage Weil, ceph-devel, Florian Haas

It's in giant; the firefly backport will happen once we are happy with
the fallout from the 0.80.7 release.
-Sam

On Wed, Oct 15, 2014 at 7:47 AM, Dan Van Der Ster
<daniel.vanderster@cern.ch> wrote:
> Hi Sage,
>
>> On 19 Sep 2014, at 17:37, Dan Van Der Ster <daniel.vanderster@cern.ch> wrote:
>>
>> September 19 2014 5:19 PM, "Sage Weil" <sweil@redhat.com> wrote:
>>> On Fri, 19 Sep 2014, Dan van der Ster wrote:
>>>
>>>> On Fri, Sep 19, 2014 at 10:41 AM, Dan Van Der Ster
>>>> <daniel.vanderster@cern.ch> wrote:
>>>>>> On 19 Sep 2014, at 08:12, Florian Haas <florian@hastexo.com> wrote:
>>>>>>
>>>>>> On Fri, Sep 19, 2014 at 12:27 AM, Sage Weil <sweil@redhat.com> wrote:
>>>>>>> On Fri, 19 Sep 2014, Florian Haas wrote:
>>>>>>>> Hi Sage,
>>>>>>>>
>>>>>>>> was the off-list reply intentional?
>>>>>>>
>>>>>>> Whoops! Nope :)
>>>>>>>
>>>>>>>> On Thu, Sep 18, 2014 at 11:47 PM, Sage Weil <sweil@redhat.com> wrote:
>>>>>>>>>> So, disaster is a pretty good description. Would anyone from the core
>>>>>>>>>> team like to suggest another course of action or workaround, or are
>>>>>>>>>> Dan and I generally on the right track to make the best out of a
>>>>>>>>>> pretty bad situation?
>>>>>>>>>
>>>>>>>>> The short term fix would probably be to just prevent backfill for the time
>>>>>>>>> being until the bug is fixed.
>>>>>>>>
>>>>>>>> As in, osd max backfills = 0?
>>>>>>>
>>>>>>> Yeah :)
>>>>>>>
>>>>>>> Just managed to reproduce the problem...
>>>>>>>
>>>>>>> sage
>>>>>>
>>>>>> Saw the wip branch. Color me freakishly impressed on the turnaround. :) Thanks!
>>>>>
>>>>> Indeed :) Thanks Sage!
>>>>> wip-9487-dumpling fixes the problem on my test cluster. Trying in prod now?
>>>>
>>>> Final update, after 4 hours in prod and after draining 8 OSDs -- zero
>>>> slow requests :)
>>>
>>> That's great news!
>>>
>>> But, please be careful. This code hasn't been reviewed yet or been through
>>> any testing! I would hold off on further backfills until it's merged.
>
>
> Any news on those merges? It would be good to get this fixed on the dumpling and firefly branches. We're kind of stuck at the moment :(
>
> Cheers, Dan
>
>


* Re: snap_trimming + backfilling is inefficient with many purged_snaps
  2014-09-23 20:00                                     ` Gregory Farnum
@ 2014-10-16  9:04                                       ` Florian Haas
  2014-10-16 13:54                                         ` Gregory Farnum
  0 siblings, 1 reply; 26+ messages in thread
From: Florian Haas @ 2014-10-16  9:04 UTC (permalink / raw)
  To: Gregory Farnum; +Cc: Sage Weil, Dan van der Ster, ceph-devel

Hi Greg,

sorry, this somehow got stuck in my drafts folder.

On Tue, Sep 23, 2014 at 10:00 PM, Gregory Farnum <greg@inktank.com> wrote:
> On Tue, Sep 23, 2014 at 6:20 AM, Florian Haas <florian@hastexo.com> wrote:
>> On Mon, Sep 22, 2014 at 7:06 PM, Florian Haas <florian@hastexo.com> wrote:
>>> On Sun, Sep 21, 2014 at 9:52 PM, Sage Weil <sweil@redhat.com> wrote:
>>>> On Sun, 21 Sep 2014, Florian Haas wrote:
>>>>> So yes, I think your patch absolutely still has merit, as would any
>>>>> means of reducing the number of snapshots an OSD will trim in one go.
>>>>> As it is, the situation looks really really bad, specifically
>>>>> considering that RBD and RADOS are meant to be super rock solid, as
>>>>> opposed to say CephFS which is in an experimental state. And contrary
>>>>> to CephFS snapshots, I can't recall any documentation saying that RBD
>>>>> snapshots will break your system.
>>>>
>>>> Yeah, it sounds like a separate issue, and no, the limit is not
>>>> documented because it's definitely not the intended behavior. :)
>>>>
>>>> ...and I see you already have a log attached to #9503.  Will take a look.
>>>
>>> I've already updated that issue in Redmine, but for the list archives
>>> I should also add this here: Dan's patch for #9503, together with
>>> Sage's for #9487, makes the problem go away in an instant. I've
>>> already pointed out that I owe Dan dinner, and Sage, well I already
>>> owe Sage pretty much lifelong full board. :)
>>
>> Looks like I was bit too eager: while the cluster is behaving nicely
>> with these patches while nothing happens to any OSDs, it does flag PGs
>> as incomplete when an OSD goes down. Once the mon osd down out
>> interval expires things seem to recover/backfill normally, but it's
>> still disturbing to see this in the interim.
>>
>> I've updated http://tracker.ceph.com/issues/9503 with a pg query from
>> one of the affected PGs, within the mon osd down out interval, while
>> it was marked incomplete.
>>
>> Dan or Sage, any ideas as to what might be causing this?
>
> That *looks* like it's just because the pool has both size and
> min_size set to 2?

Correct. But the documentation did not reflect that this is a
perfectly expected side effect of having min_size > 1.

pg-states.rst says:

*Incomplete*
  Ceph detects that a placement group is missing a necessary period of history
  from its log.  If you see this state, report a bug, and try to start any
  failed OSDs that may contain the needed information.

So if min_size > 1 and replicas < min_size, then the incomplete state
is not a bug but a perfectly expected occurrence, correct?

It's still a bit weird in that the PG seems to behave differently
depending on min_size. If min_size == 1 (the default), a PG with no
remaining replicas is "stale". The exception is when a replica fails
first, the primary is then written to and subsequently fails as well,
and the old replica later comes back up but can't go primary because
its data is now outdated; in that case the PG goes "down". Either way,
it never goes "incomplete".

So is the documentation wrong, or is there something fishy with the
reported state of the PGs?
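
For what it's worth, if someone hits this and needs the PG to serve
I/O again before the down OSD returns or gets marked out, temporarily
lowering min_size should do it, at the price of accepting writes with
only a single copy:

    ceph osd pool set <poolname> min_size 1
    # ...and set it back to 2 once recovery has completed

I haven't re-verified that on this particular cluster, so consider it
a sketch.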

Cheers,
Florian


* Re: snap_trimming + backfilling is inefficient with many purged_snaps
  2014-10-16  9:04                                       ` Florian Haas
@ 2014-10-16 13:54                                         ` Gregory Farnum
  0 siblings, 0 replies; 26+ messages in thread
From: Gregory Farnum @ 2014-10-16 13:54 UTC (permalink / raw)
  To: Florian Haas; +Cc: Sage Weil, Dan van der Ster, ceph-devel

On Thu, Oct 16, 2014 at 2:04 AM, Florian Haas <florian@hastexo.com> wrote:
> Hi Greg,
>
> sorry, this somehow got stuck in my drafts folder.
>
> On Tue, Sep 23, 2014 at 10:00 PM, Gregory Farnum <greg@inktank.com> wrote:
>> On Tue, Sep 23, 2014 at 6:20 AM, Florian Haas <florian@hastexo.com> wrote:
>>> On Mon, Sep 22, 2014 at 7:06 PM, Florian Haas <florian@hastexo.com> wrote:
>>>> On Sun, Sep 21, 2014 at 9:52 PM, Sage Weil <sweil@redhat.com> wrote:
>>>>> On Sun, 21 Sep 2014, Florian Haas wrote:
>>>>>> So yes, I think your patch absolutely still has merit, as would any
>>>>>> means of reducing the number of snapshots an OSD will trim in one go.
>>>>>> As it is, the situation looks really really bad, specifically
>>>>>> considering that RBD and RADOS are meant to be super rock solid, as
>>>>>> opposed to say CephFS which is in an experimental state. And contrary
>>>>>> to CephFS snapshots, I can't recall any documentation saying that RBD
>>>>>> snapshots will break your system.
>>>>>
>>>>> Yeah, it sounds like a separate issue, and no, the limit is not
>>>>> documented because it's definitely not the intended behavior. :)
>>>>>
>>>>> ...and I see you already have a log attached to #9503.  Will take a look.
>>>>
>>>> I've already updated that issue in Redmine, but for the list archives
>>>> I should also add this here: Dan's patch for #9503, together with
>>>> Sage's for #9487, makes the problem go away in an instant. I've
>>>> already pointed out that I owe Dan dinner, and Sage, well I already
>>>> owe Sage pretty much lifelong full board. :)
>>>
>>> Looks like I was bit too eager: while the cluster is behaving nicely
>>> with these patches while nothing happens to any OSDs, it does flag PGs
>>> as incomplete when an OSD goes down. Once the mon osd down out
>>> interval expires things seem to recover/backfill normally, but it's
>>> still disturbing to see this in the interim.
>>>
>>> I've updated http://tracker.ceph.com/issues/9503 with a pg query from
>>> one of the affected PGs, within the mon osd down out interval, while
>>> it was marked incomplete.
>>>
>>> Dan or Sage, any ideas as to what might be causing this?
>>
>> That *looks* like it's just because the pool has both size and
>> min_size set to 2?
>
> Correct. But the documentation did not reflect that this is a
> perfectly expected side effect of having min_size > 1.
>
> pg-states.rst says:
>
> *Incomplete*
>   Ceph detects that a placement group is missing a necessary period of history
>   from its log.  If you see this state, report a bug, and try to start any
>   failed OSDs that may contain the needed information.
>
> So if min_size > 1 and replicas < min_size, then the incomplete state
> is not a bug but a perfectly expected occurrence, correct?
>
> It's still a bit weird in that the PG seems to behave differently
> depending on min_size. If min_size == 1 (default), then a PG with no
> remaining replicas is stale, unless a replica failed first and the
> primary was written to, after which it also failed, and the replica
> then comes up and can't go primary because it now has outdated data,
> in which case the PG goes "down". It never goes "incomplete".
>
> So is the documentation wrong, or is there something fishy with the
> reported state of the PGs?

I guess the documentation is wrong, although I thought we'd fixed that
particular one. :/ Giant actually distinguishes between these
conditions by adding an "undersized" state to the PG, so it'll be
easier to diagnose.
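
So on giant you should be able to spot these directly with something
along the lines of

    ceph pg dump | grep undersized
    # or just: ceph health detail

rather than having to query individual PGs; I'm going from memory on
the exact output, so take that as a sketch.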
-Greg
Software Engineer #42 @ http://inktank.com | http://ceph.com


end of thread

Thread overview: 26+ messages
     [not found] <CAPUexz-ff+fTmU0J0TrJ80p+6334BBPvM7EWW5=eGa8uEomRew@mail.gmail.com>
2014-09-18 12:50 ` snap_trimming + backfilling is inefficient with many purged_snaps Dan Van Der Ster
2014-09-18 17:03   ` Florian Haas
     [not found]     ` <541b2add.ea5cb40a.250c.3ed0SMTPIN_ADDED_BROKEN@mx.google.com>
2014-09-18 19:03       ` Florian Haas
2014-09-18 19:12       ` Dan van der Ster
2014-09-18 21:19         ` Florian Haas
     [not found]           ` <alpine.DEB.2.00.1409181446110.19460@cobra.newdream.net>
     [not found]             ` <CAPUexz-HXn=x_b=CJev46jWFsSrUpXc7UO7kzWcRe4Yrm6VL3g@mail.gmail.com>
2014-09-18 22:27               ` Sage Weil
2014-09-19  6:12                 ` Florian Haas
2014-09-19  8:41                   ` Dan Van Der Ster
2014-09-19 12:58                     ` Dan van der Ster
2014-09-19 15:19                       ` Sage Weil
2014-09-19 15:37                       ` Dan van der Ster
     [not found]                         ` <CAME-gARt1NZmEFj6SCpxkfxnibXyR7+AdYKO4YNkQc_n+XJuXQ@mail.gmail.com>
2014-09-21 13:33                           ` Florian Haas
2014-09-21 14:26                           ` Dan van der Ster
2014-09-21 15:27                             ` Florian Haas
2014-09-21 19:52                               ` Sage Weil
2014-09-22 17:06                                 ` Florian Haas
2014-09-23 13:20                                   ` Florian Haas
2014-09-23 20:00                                     ` Gregory Farnum
2014-10-16  9:04                                       ` Florian Haas
2014-10-16 13:54                                         ` Gregory Farnum
     [not found]                                     ` <C207F487-4FD4-45FF-AE41-6A0E706C9D38@cern.ch>
     [not found]                                       ` <CAPUexz8MBvCbPaYYk2SWQqbgSLYBqmOArVwZkNqgH-=E0_V7cQ@mail.gmail.com>
2014-09-24  0:05                                         ` Sage Weil
2014-09-24 23:01                                           ` Florian Haas
2014-09-21 19:41                           ` Sage Weil
2014-10-15 14:47                         ` Dan Van Der Ster
2014-10-15 17:50                           ` Samuel Just
2014-09-18 19:31       ` Dan van der Ster
