* removed_snaps
@ 2017-10-11 16:13 Sage Weil
  2017-10-11 18:38 ` removed_snaps Gregory Farnum
  0 siblings, 1 reply; 12+ messages in thread
From: Sage Weil @ 2017-10-11 16:13 UTC (permalink / raw)
  To: ceph-devel

I'm working on removing the removed_snaps field from pg_pool_t (and 
thus the OSDMap) as this can get very large for clusters that have aged 
and use snapshots.

	https://github.com/ceph/ceph/blob/master/src/osd/osd_types.h#L1320

The short version of the plan is to only include recently removed snaps in 
the map.  Once all PGs in a pool have reported that the snap has been 
trimmed, we can safely retire the snapid from that set.

There are a couple of possible problems related to the fact that the 
OSD currently sanitizes the SnapContext with every write by removing 
snapids that appear in removed_snaps.  This is meant to deal with race 
conditions where the IO was submitted before the snap was removed, but by 
the time it arrives the OSD has already logically removed it (or scheduled 
it for removal).
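
Roughly, that per-write sanitization amounts to something like the sketch 
below.  This is a toy stand-in (std::set instead of Ceph's 
interval_set<snapid_t>, and a simplified snapc), not the actual OSD code:

#include <cstdint>
#include <set>
#include <vector>

struct SimpleSnapContext {
  uint64_t seq = 0;                 // newest snap the client knows about
  std::vector<uint64_t> snaps;      // snap ids, newest first
};

// Drop any snap ids the OSD already considers removed, so a racing write
// cannot create a clone for a snapshot that is being deleted.
void sanitize_snapc(SimpleSnapContext& snapc,
                    const std::set<uint64_t>& removed_snaps) {
  std::vector<uint64_t> kept;
  kept.reserve(snapc.snaps.size());
  for (uint64_t s : snapc.snaps) {
    if (!removed_snaps.count(s))
      kept.push_back(s);
  }
  snapc.snaps.swap(kept);
}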

I see two categories of problems:

1. The IO was prepared at the librados client before the snap was removed, 
the IO is delayed (e.g., client can't connect, or PG is inactive, etc.) 
until after the snap is deleted and retired from removed_snaps, and then 
the IO is processed.  This could trigger a clone on the OSD that will 
then never get cleaned up.

Specifically, a librbd example:

 a. librbd prepares an IOContext that includes snap S
 b. librbd initiates a librados IO on that ioc.  That request is 
    delayed.
 c. Snap S is removed (someone tells the mon to delete it)
    - either our librbd client did it, or another one did and we get a 
      notify telling us to refresh our header; doesn't matter which.
 d. S is included in OSDMap removed_snaps.
 e. all OSDs prune S
 f. S is removed from removed_snaps some time later
 g. The request from (b) finally reaches the OSD and triggers a clone S
    (which will never get cleaned up)

I think we can fix this problem by making Objecter slightly smarter: it 
can watch the OSDMaps it receives and prune the snapcs of in-flight 
requests.  In the above scenario, sometime between d and g, when it 
finally gets a recent OSDMap, it will have pruned S from the request's 
snapc.  If it didn't get the map, and the request is still tagged with an 
old OSDMap, the OSD can kill the session to force a reconnect/resend.  It 
can do this for any incoming client request tagged with an OSDMap prior to 
a low-water-mark epoch in the OSDMap for which older removed_snaps 
have been pruned (e.g. removed_snaps_pruned_epoch, or some better name).
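
As a sketch of the Objecter side, assuming the map hands us a per-epoch 
delta of newly removed snaps (the names InflightOp and new_removed_snaps 
below are made up for illustration, not the real Objecter interfaces):

#include <algorithm>
#include <cstdint>
#include <list>
#include <set>
#include <vector>

struct InflightOp {
  uint64_t map_epoch = 0;            // epoch the op was prepared against
  std::vector<uint64_t> snaps;       // the op's snapc snap ids
};

void on_new_osdmap(uint64_t new_epoch,
                   const std::set<uint64_t>& new_removed_snaps,  // delta for this epoch
                   std::list<InflightOp>& inflight) {
  for (auto& op : inflight) {
    // We already walk in-flight ops on map changes (for CRUSH re-mapping);
    // pruning the short snap vector at the same time is cheap.
    auto& v = op.snaps;
    v.erase(std::remove_if(v.begin(), v.end(),
                           [&](uint64_t s) { return new_removed_snaps.count(s); }),
            v.end());
    op.map_epoch = new_epoch;        // retarget to the new epoch
  }
}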

In the extreme case where a client is disconnected for so long that they 
can't advance their map to the current one due to the mon having trimmed 
maps, and they have outstanding writes with snapcs, the client 
can choose to fail the requests with EIO or ESTALE or something, or crash 
itself, or otherwise behave as if it has been blacklisted/fenced (if it's 
RBD, it probably has anyway).

2. There is a bug, and the librbd image is out of sync: it thinks that 
snap S still exists but in reality it has been pruned.  If this happens, 
then the librbd client may use S and it may trigger a clone.  However, 
that snap is still referenced from the image, so it will presumably 
eventually get deleted and cleaned up.

Greg suggested the possibility of a similar CephFS bug, where for example 
a CephFS client gets confused and continues sending snapcs with removed 
snaps.  I think we can catch this category of bug (and probably #2 as 
well) if we make the OSD log an error/warning to the cluster log if it 
gets an incoming request including a deleted snapid when the request is 
marked with an epoch later than the one in which the snap was deleted.  
Doing this properly would mean adding the deleted_epoch for each removed 
snapid to pg_pool_t; maybe worth it, maybe not?
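
The check itself would be trivial if we had that per-snapid deletion 
epoch; something like the sketch below, where snap_deleted_epoch is a 
hypothetical stand-in for whatever we'd add to pg_pool_t:

#include <cstdint>
#include <iostream>
#include <map>
#include <vector>

// Return true if the request looks buggy: it was tagged with an epoch at or
// after the snap's deletion, yet its snapc still names that snap.
bool snapc_names_stale_snap(uint64_t request_epoch,
                            const std::vector<uint64_t>& request_snaps,
                            const std::map<uint64_t, uint64_t>& snap_deleted_epoch) {
  for (uint64_t s : request_snaps) {
    auto it = snap_deleted_epoch.find(s);
    if (it != snap_deleted_epoch.end() && request_epoch >= it->second) {
      std::cerr << "client sent deleted snap " << s
                << " (deleted in epoch " << it->second
                << ", request epoch " << request_epoch << ")\n";
      return true;   // in the real OSD this would go to the cluster log
    }
  }
  return false;
}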

Thoughts?
sage


* Re: removed_snaps
  2017-10-11 16:13 removed_snaps Sage Weil
@ 2017-10-11 18:38 ` Gregory Farnum
  2017-10-11 19:02   ` removed_snaps Sage Weil
  0 siblings, 1 reply; 12+ messages in thread
From: Gregory Farnum @ 2017-10-11 18:38 UTC (permalink / raw)
  To: Sage Weil; +Cc: ceph-devel, Mclean, Patrick

On Wed, Oct 11, 2017 at 9:13 AM, Sage Weil <sweil@redhat.com> wrote:
> I'm working on removing the removed_snaps field from pg_pool_t (and
> thus the OSDMap) as this can get very large for clusters that have aged
> and use snapshots.
>
>         https://github.com/ceph/ceph/blob/master/src/osd/osd_types.h#L1320
>
> The short version of the plan is only include recently removed snaps in
> the map.  Once all PGs in a pool have reported that the snap has been
> trimmed, we can safely retire the snapid from that set.
>
> There are a couple of possible problems related to the fact that the
> OSD currently santizes the SnapContext with every write by removing
> snapids that appear in removed_snaps.  This is meant to deal with race
> conditions where the IO was submitted before the snap was removed, but the
> OSD has already logically removed it (or scheduled it for removal).
>
> I see two categories of problems:
>
> 1. The IO was prepared at the librados client before the snap was removed,
> the IO is delayed (e.g., client can't connect, or PG is inactive, etc.)
> until after the snap is deleted and retired from removed_snaps, and then
> the IO is processed.  This could trigger a clone on the OSD that will
> then never get cleaned up.
>
> Specifically, a librbd example:
>
>  a. librbd prepares a IOContext that includes snap S
>  b. librbd initiates a librados IO with on that ioc.  That request is
>     delayed.
>  c. Snap S is removed (someone tells the mon to delete it)
>     - either our librbd client did it, or another one did and we get a
>       notify telling us to refresh our header; doesn't matter which.
>  d. S is included in OSDMap removed_snaps.
>  e. all OSDs prune S
>  f. S is removed from removed_snaps some time later
>  g. The request from (b) finally reaches the OSD and triggers a clone S
>     (which will never get cleaned up)
>
> I think we can fix this problem by making Objecter slightly smarter: it
> can watch OSDMaps it receives and prune snapc's for in-flight requests.
> In the above scenario, sometime between d and g, when it finally
> gets a recent OSDMap, it will have pruned S from the request's snapc.  If
> it didn't get the map, and the request is still tagged with an old
> OSDMap, the OSD can kill the OSD session to force a reconnect/resend. It
> can do this for any incoming client request tagged with an OSDMap prior to
> a low-water-mark epoch in the OSDMap for which older removed_snaps
> have been proved (e.g. removed_snaps_pruned_epoch, or some better name).
>
> In the extreme case where a client is disconnected for so long that they
> can't advance their map to the current one due to the mon having trimmed
> maps, and they have outstanding writes with snapcs, the client
> can choose to fail the requests with EIO or ESTALE or something, or crash
> itself, or otherwise behave as if it has been blacklisted/fenced (if it's
> RBD, it probably has anyway).

This all sounds good, except...
How on earth do we process these snapc trims in a way that doesn't
bring us back down to effectively single-threaded transfer speeds?

We might be able to do something with publishing a list of trimmed
snaps and having the dispatch threads check their op snapcs against
that shared data, but I'm not sure and we'd need to performance test
very carefully...


> 2. There is a bug, and the librbd image is out of sync: it thinks that
> snap S still exists but in reality it has been pruned.  If this happens,
> then the librbd client may use S and it may trigger a clone.  However,
> that snap still is referenced from the image, so it will presumably
> eventually get deleted and cleaned up.
>
> Greg suggested the possibility of a similar CephFS bug, where for example
> a CephFS client gets confused and continues sending snapcs with removed
> snaps.  I think we can catch this category of bug (and probably #2 as
> well) if we make the OSD log an error/warning to the cluster log if it
> gets an incoming request including a deleted snapid when the request is
> marked with an epoch after when the snap was deleted.  This would
> would mean adding the deleted_epoch for each removed snapid to pg_pool_t
> to do properly; maybe worth it, maybe not?

I don't quite understand; can you expand on this?


Let me also suggest one other idea that came up when I was discussing
snapshots with Patrick, by first laying out the problems we're aware
of:
1) the OSDMap itself might grow too large
2) comparing very large interval sets of snapIDs is computationally
expensive for the OSD. We do this
  a) when processing a new OSDMap against each PG's local list of deleted snaps
  b) when comparing the deleted snaps against incoming snapcontexts
(although this is usually not an issue because incoming snapcontexts
tend to be pretty small)

So we want to limit the size of the OSDMap, and we also want to avoid
very large comparisons. Why don't we attack these problems
individually?

1) Switch from a deleted_snaps set to a deleting_snaps set in the
OSDMap, and trim it based on per-PG feedback to the manager.
2) Maintain a *full* deleted_snaps set in the PG info on the OSDs.
3) Only load into memory the deleted_snaps we reasonably expect to
see, *along* with a boundary snapid indicating what range the set is
valid for.
4) If we get a snapid smaller than the range is valid for (in a
client's SnapContext, or I suppose in the deleting_snaps map), load
more deleted snapids off disk to do the comparison.

I haven't sketched out all the code paths, but when I skimmed things
over I think that snapcontext check (and possible off-disk loading) is
actually in an easy-to-wait location.  This avoids us needing to
change the client wire protocol or introduce more ordering
dependencies.
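
To make (2)-(4) concrete, here's a rough sketch of the in-memory window
plus on-demand load; none of these names correspond to real Ceph
structures, and load_older_from_disk stands in for an omap/on-disk read:

#include <cstdint>
#include <functional>
#include <set>

struct DeletedSnapCache {
  uint64_t valid_from = 0;            // set is authoritative for ids >= this
  std::set<uint64_t> deleted;         // in-memory window of deleted snap ids
  // Stand-in for reading older deleted ids out of the PG's on-disk info.
  std::function<std::set<uint64_t>(uint64_t lo, uint64_t hi)> load_older_from_disk;

  bool is_deleted(uint64_t snapid) {
    if (snapid < valid_from) {
      // Older than our window: extend the window from disk (possibly slow,
      // which is why this needs to sit in an easy-to-wait code path).
      auto older = load_older_from_disk(snapid, valid_from);
      deleted.insert(older.begin(), older.end());
      valid_from = snapid;
    }
    return deleted.count(snapid) != 0;
  }
};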

The biggest downside I can see is that it adds a pretty obvious
DoS/resource consumption attack for malicious clients, but we're not
exactly immune to those in general. Thoughts?
-Greg


* Re: removed_snaps
  2017-10-11 18:38 ` removed_snaps Gregory Farnum
@ 2017-10-11 19:02   ` Sage Weil
  2017-10-11 19:37     ` removed_snaps Sage Weil
  2017-10-11 21:25     ` removed_snaps Gregory Farnum
  0 siblings, 2 replies; 12+ messages in thread
From: Sage Weil @ 2017-10-11 19:02 UTC (permalink / raw)
  To: Gregory Farnum; +Cc: ceph-devel, Mclean, Patrick

On Wed, 11 Oct 2017, Gregory Farnum wrote:
> On Wed, Oct 11, 2017 at 9:13 AM, Sage Weil <sweil@redhat.com> wrote:
> > I'm working on removing the removed_snaps field from pg_pool_t (and
> > thus the OSDMap) as this can get very large for clusters that have aged
> > and use snapshots.
> >
> >         https://github.com/ceph/ceph/blob/master/src/osd/osd_types.h#L1320
> >
> > The short version of the plan is only include recently removed snaps in
> > the map.  Once all PGs in a pool have reported that the snap has been
> > trimmed, we can safely retire the snapid from that set.
> >
> > There are a couple of possible problems related to the fact that the
> > OSD currently santizes the SnapContext with every write by removing
> > snapids that appear in removed_snaps.  This is meant to deal with race
> > conditions where the IO was submitted before the snap was removed, but the
> > OSD has already logically removed it (or scheduled it for removal).
> >
> > I see two categories of problems:
> >
> > 1. The IO was prepared at the librados client before the snap was removed,
> > the IO is delayed (e.g., client can't connect, or PG is inactive, etc.)
> > until after the snap is deleted and retired from removed_snaps, and then
> > the IO is processed.  This could trigger a clone on the OSD that will
> > then never get cleaned up.
> >
> > Specifically, a librbd example:
> >
> >  a. librbd prepares a IOContext that includes snap S
> >  b. librbd initiates a librados IO with on that ioc.  That request is
> >     delayed.
> >  c. Snap S is removed (someone tells the mon to delete it)
> >     - either our librbd client did it, or another one did and we get a
> >       notify telling us to refresh our header; doesn't matter which.
> >  d. S is included in OSDMap removed_snaps.
> >  e. all OSDs prune S
> >  f. S is removed from removed_snaps some time later
> >  g. The request from (b) finally reaches the OSD and triggers a clone S
> >     (which will never get cleaned up)
> >
> > I think we can fix this problem by making Objecter slightly smarter: it
> > can watch OSDMaps it receives and prune snapc's for in-flight requests.
> > In the above scenario, sometime between d and g, when it finally
> > gets a recent OSDMap, it will have pruned S from the request's snapc.  If
> > it didn't get the map, and the request is still tagged with an old
> > OSDMap, the OSD can kill the OSD session to force a reconnect/resend. It
> > can do this for any incoming client request tagged with an OSDMap prior to
> > a low-water-mark epoch in the OSDMap for which older removed_snaps
> > have been proved (e.g. removed_snaps_pruned_epoch, or some better name).
> >
> > In the extreme case where a client is disconnected for so long that they
> > can't advance their map to the current one due to the mon having trimmed
> > maps, and they have outstanding writes with snapcs, the client
> > can choose to fail the requests with EIO or ESTALE or something, or crash
> > itself, or otherwise behave as if it has been blacklisted/fenced (if it's
> > RBD, it probably has anyway).
> 
> This all sounds good, except...
> How on earth do we process these snapc trims in a way that doesn't
> bring us back down to effectively single-threaded transfer speeds?
> 
> We might be able to do something with publishing a list of trimmed
> snaps and having the dispatch threads check their op snapcs against
> that shared data, but I'm not sure and we'd need to performance test
> very carefully...

The only Objecter change here is that when we get a new osdmap we would 
apply removed_snaps_this_epoch to the current in-flight requests.  We 
already walk all in-flight ops on map changes to recheck the CRUSH 
mapping, and the snapc lists are short, and we can construct the OSDMap so 
that it has the list of newly-deleted snaps for just that epoch in a new 
field, so I don't think this will have any measurable impact on 
performance.

Is there a different effect I'm missing?

> > 2. There is a bug, and the librbd image is out of sync: it thinks that
> > snap S still exists but in reality it has been pruned.  If this happens,
> > then the librbd client may use S and it may trigger a clone.  However,
> > that snap still is referenced from the image, so it will presumably
> > eventually get deleted and cleaned up.
> >
> > Greg suggested the possibility of a similar CephFS bug, where for example
> > a CephFS client gets confused and continues sending snapcs with removed
> > snaps.  I think we can catch this category of bug (and probably #2 as
> > well) if we make the OSD log an error/warning to the cluster log if it
> > gets an incoming request including a deleted snapid when the request is
> > marked with an epoch after when the snap was deleted.  This would
> > would mean adding the deleted_epoch for each removed snapid to pg_pool_t
> > to do properly; maybe worth it, maybe not?
> 
> I don't quite understand; can you expand on this?

The annoyingly confusing thing about this is that snapids are kind of like 
time points, but the set of existing snapids has nothing to do with what 
we've trimmed from removed_snaps... it's not like we can just drop low 
snapids out of the set, since the snapid is the snap creation "when" and 
not the deletion "when."

So, if we want to catch buggy clients that try to use snapcontexts with 
deleted snapids, we need to know which epoch each snapid was deleted in so 
that we can compare that to the request epoch.

I'm not so sure this will work out.  For one, there is a sort of race 
condition when librbd is setting up the snapc itself, since a 
concurrent client might delete the snap before it refreshes, so that it 
constructs a snapc with a bad snap after librados sees the osdmap deleting 
it.  :/  Maybe it's not worth worrying about this piece.


> Let me also suggest one other idea that came up when I was discussing
> snapshots with Patrick, by first laying out the problems we're aware
> of:
> 1) the OSDMap itself might grow too large
> 2) comparing very large interval sets of snapIDs is computationally
> expensive for the OSD. We do this
>   a) when processing a new OSDMap against each PG's local list of deleted snaps
>   b) when comparing the deleted snaps against incoming snapcontexts
> (although this is usually not an issue because incoming snapcontexts
> tend to be pretty small)
> 
> So we want to limit the size of the OSDMap, and we also want to avoid
> very large comparisons. Why don't we attack these problems
> individually?
> 
> 1) Switch from a deleted_snaps set to a deleting_snaps set in the
> OSDMap, and trim it based on per-PG feedback to the manager.
> 2) Maintain a *full* deleted_snaps set in the PG info on the OSDs.
> 3) Only load into memory the deleted_snaps we reasonably expect to
> see, *along* with a boundary snapid indicating what range the set is
> valid for.
> 4) If we get a snapid smaller than the range is valid for (in a
> client's SnapContext, or I suppose in the deleting_snaps map), load
> load more deleted snapids off disk to do the comparison.
> 
> I haven't sketched out all the code paths but when I skimmed things
> over I think that snapcontext check (and possible off-disk loading) is
> actually in any easy-to-wait location. This avoids us needing to
> change the client wire protocol or introduce more ordering
> dependencies.
> 
> The biggest downside I can see is that it adds a pretty obvious
> DoS/resource consumption attack for malicious clients, but we're not
> exactly immune to those in general. Thoughts?

I think this would definitely work.  I'm still hoping that we can avoid 
having the full set of deleted snaps maintained on the OSD at all, 
though--that will cut out a lot of complexity (and memory (and storage)).


FWIW what I'm currently trying is adding new fields to the map for snapids 
deleted *this* epoch.  That makes it easy for Objecter to adjust its 
requests accordingly.  On the OSD, it also makes the interval_set 
calculations on map updates faster (we already know what the new snapids 
are).

And I think it will open up the possibility of removing removed_snaps 
entirely from pg_pool_t if the PGs track the same set locally.  In fact, I 
think in the limit we'll end up with each osdmap epoch just containing (1) 
newly deleted snapids this epoch, and (2) newly purged snapids this epoch.  
The PGs will maintain the set of deleting snapids (in pg_info_t or 
similar, sharing it during peering etc), and we'll keep that window 
short-ish to prevent it from getting too big.  I'm thinking a 1-5 day 
window... basically, whatever window of booting laggy clients you can 
tolerate.  The OSDs will only be pruning incoming snapcs against that 
deleting_snaps set (small!), and OSDMaps themselves won't be big (each one 
will only include newly deleted or purged ids).  If a PG falls behind and 
jumps a map gap, it will backfill anyway and get mopped up that way...
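
Shape-wise, I'm imagining something like the sketch below: per-epoch
deltas in the map, and a locally maintained deleting set in each PG.
Field names are guesses, not the actual pg_pool_t/pg_info_t members:

#include <cstdint>
#include <set>

// What a single osdmap epoch would carry for one pool.
struct PoolSnapDelta {
  std::set<uint64_t> new_removed_snaps;  // deleted in this epoch
  std::set<uint64_t> new_purged_snaps;   // fully trimmed as of this epoch
};

// What each PG would track locally (pg_info_t or similar).
struct PgDeletingSnaps {
  std::set<uint64_t> deleting;           // removed but not yet purged here

  void apply_epoch(const PoolSnapDelta& d) {
    deleting.insert(d.new_removed_snaps.begin(), d.new_removed_snaps.end());
    for (uint64_t s : d.new_purged_snaps)
      deleting.erase(s);                 // retire once purged
  }
};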

But... the cost of eliminating all of that crap is losing the client's 
ability to have an in-flight request that is days old reconnect and 
proceed.  Fair trade?

sage


* Re: removed_snaps
  2017-10-11 19:02   ` removed_snaps Sage Weil
@ 2017-10-11 19:37     ` Sage Weil
  2017-10-11 21:25     ` removed_snaps Gregory Farnum
  1 sibling, 0 replies; 12+ messages in thread
From: Sage Weil @ 2017-10-11 19:37 UTC (permalink / raw)
  To: Gregory Farnum; +Cc: ceph-devel, Mclean, Patrick

On Wed, 11 Oct 2017, Sage Weil wrote:
> FWIW what I'm currently trying is adding new fields to the map for snapids 
> ...

I added a "Plan B" section to the pad

	http://pad.ceph.com/p/removing_removed_snaps

with specifics (and a couple questions).

sage


* Re: removed_snaps
  2017-10-11 19:02   ` removed_snaps Sage Weil
  2017-10-11 19:37     ` removed_snaps Sage Weil
@ 2017-10-11 21:25     ` Gregory Farnum
  2017-10-12 14:02       ` removed_snaps Sage Weil
  1 sibling, 1 reply; 12+ messages in thread
From: Gregory Farnum @ 2017-10-11 21:25 UTC (permalink / raw)
  To: Sage Weil; +Cc: ceph-devel, Mclean, Patrick

On Wed, Oct 11, 2017 at 12:02 PM, Sage Weil <sweil@redhat.com> wrote:
> On Wed, 11 Oct 2017, Gregory Farnum wrote:
>> On Wed, Oct 11, 2017 at 9:13 AM, Sage Weil <sweil@redhat.com> wrote:
>> > I'm working on removing the removed_snaps field from pg_pool_t (and
>> > thus the OSDMap) as this can get very large for clusters that have aged
>> > and use snapshots.
>> >
>> >         https://github.com/ceph/ceph/blob/master/src/osd/osd_types.h#L1320
>> >
>> > The short version of the plan is only include recently removed snaps in
>> > the map.  Once all PGs in a pool have reported that the snap has been
>> > trimmed, we can safely retire the snapid from that set.
>> >
>> > There are a couple of possible problems related to the fact that the
>> > OSD currently santizes the SnapContext with every write by removing
>> > snapids that appear in removed_snaps.  This is meant to deal with race
>> > conditions where the IO was submitted before the snap was removed, but the
>> > OSD has already logically removed it (or scheduled it for removal).
>> >
>> > I see two categories of problems:
>> >
>> > 1. The IO was prepared at the librados client before the snap was removed,
>> > the IO is delayed (e.g., client can't connect, or PG is inactive, etc.)
>> > until after the snap is deleted and retired from removed_snaps, and then
>> > the IO is processed.  This could trigger a clone on the OSD that will
>> > then never get cleaned up.
>> >
>> > Specifically, a librbd example:
>> >
>> >  a. librbd prepares a IOContext that includes snap S
>> >  b. librbd initiates a librados IO with on that ioc.  That request is
>> >     delayed.
>> >  c. Snap S is removed (someone tells the mon to delete it)
>> >     - either our librbd client did it, or another one did and we get a
>> >       notify telling us to refresh our header; doesn't matter which.
>> >  d. S is included in OSDMap removed_snaps.
>> >  e. all OSDs prune S
>> >  f. S is removed from removed_snaps some time later
>> >  g. The request from (b) finally reaches the OSD and triggers a clone S
>> >     (which will never get cleaned up)
>> >
>> > I think we can fix this problem by making Objecter slightly smarter: it
>> > can watch OSDMaps it receives and prune snapc's for in-flight requests.
>> > In the above scenario, sometime between d and g, when it finally
>> > gets a recent OSDMap, it will have pruned S from the request's snapc.  If
>> > it didn't get the map, and the request is still tagged with an old
>> > OSDMap, the OSD can kill the OSD session to force a reconnect/resend. It
>> > can do this for any incoming client request tagged with an OSDMap prior to
>> > a low-water-mark epoch in the OSDMap for which older removed_snaps
>> > have been proved (e.g. removed_snaps_pruned_epoch, or some better name).
>> >
>> > In the extreme case where a client is disconnected for so long that they
>> > can't advance their map to the current one due to the mon having trimmed
>> > maps, and they have outstanding writes with snapcs, the client
>> > can choose to fail the requests with EIO or ESTALE or something, or crash
>> > itself, or otherwise behave as if it has been blacklisted/fenced (if it's
>> > RBD, it probably has anyway).
>>
>> This all sounds good, except...
>> How on earth do we process these snapc trims in a way that doesn't
>> bring us back down to effectively single-threaded transfer speeds?
>>
>> We might be able to do something with publishing a list of trimmed
>> snaps and having the dispatch threads check their op snapcs against
>> that shared data, but I'm not sure and we'd need to performance test
>> very carefully...
>
> The only Objecter change here is that when we get a new osdmap we would
> apply removed_snaps_this_epoch to the current in-flight requests.  We
> already walk all in-flight ops on map changes to recheck the CRUSH
> mapping, and the snapc lists are short, and we can construct the OSDMap so
> that it has the list of newly-deleted snaps for just that epoch in a new
> field, so I don't think this will have any measureable impact on
> performance.
>
> Is there a different effect I'm missing?

I was worried about the map walk. I hadn't thought about the CRUSH
mapping changes so I guess I don't know how that's implemented, but if
it works that's good!

>
>> > 2. There is a bug, and the librbd image is out of sync: it thinks that
>> > snap S still exists but in reality it has been pruned.  If this happens,
>> > then the librbd client may use S and it may trigger a clone.  However,
>> > that snap still is referenced from the image, so it will presumably
>> > eventually get deleted and cleaned up.
>> >
>> > Greg suggested the possibility of a similar CephFS bug, where for example
>> > a CephFS client gets confused and continues sending snapcs with removed
>> > snaps.  I think we can catch this category of bug (and probably #2 as
>> > well) if we make the OSD log an error/warning to the cluster log if it
>> > gets an incoming request including a deleted snapid when the request is
>> > marked with an epoch after when the snap was deleted.  This would
>> > would mean adding the deleted_epoch for each removed snapid to pg_pool_t
>> > to do properly; maybe worth it, maybe not?
>>
>> I don't quite understand; can you expand on this?
>
> The annoyingly confusing thing about this is that snapids are kind of like
> time points, but the set of existing snapids has nothing to do with what
> we've trimmed from removed_snaps... it's not like we can just drop low
> snapids out of the set, since the snapid is the snap creation "when" and
> not the deletion "when."
>
> So, if we want to catch buggy clients that try to use snapcontexts with
> deleted snapids, we need to know which epoch each snapid was deleted in so
> that we can compare that to the request epoch.
>
> I'm not so sure this will work out.  For one, there is a sort of race
> condition when librbd is setting up the snapc itself, since a
> concurrent client might delete the snap before it refreshes, so that it
> constructs a snapc with a bad snap after librados sees the osdmap deleting
> it.  :/  Maybe it's not worth worrying about this piece.

Oh, so instead of having just "interval_set<snapid> deleted_snaps",
we'd have to add "map<snapid, epoch_t> deleted_snap_epochs" (or
equivalent) and then compare those (plus a floor epoch for
oldest-known-deletion) to any incoming snapcontext?

And for the warning, we'd have some of the snapids that should have
been deleted still around for comparison. So we wouldn't have any way
of reliably identifying them all, but we'd expect to see some of them
turning up if it's an ongoing problem?

>
>
>> Let me also suggest one other idea that came up when I was discussing
>> snapshots with Patrick, by first laying out the problems we're aware
>> of:
>> 1) the OSDMap itself might grow too large
>> 2) comparing very large interval sets of snapIDs is computationally
>> expensive for the OSD. We do this
>>   a) when processing a new OSDMap against each PG's local list of deleted snaps
>>   b) when comparing the deleted snaps against incoming snapcontexts
>> (although this is usually not an issue because incoming snapcontexts
>> tend to be pretty small)
>>
>> So we want to limit the size of the OSDMap, and we also want to avoid
>> very large comparisons. Why don't we attack these problems
>> individually?
>>
>> 1) Switch from a deleted_snaps set to a deleting_snaps set in the
>> OSDMap, and trim it based on per-PG feedback to the manager.
>> 2) Maintain a *full* deleted_snaps set in the PG info on the OSDs.
>> 3) Only load into memory the deleted_snaps we reasonably expect to
>> see, *along* with a boundary snapid indicating what range the set is
>> valid for.
>> 4) If we get a snapid smaller than the range is valid for (in a
>> client's SnapContext, or I suppose in the deleting_snaps map), load
>> load more deleted snapids off disk to do the comparison.
>>
>> I haven't sketched out all the code paths but when I skimmed things
>> over I think that snapcontext check (and possible off-disk loading) is
>> actually in any easy-to-wait location. This avoids us needing to
>> change the client wire protocol or introduce more ordering
>> dependencies.
>>
>> The biggest downside I can see is that it adds a pretty obvious
>> DoS/resource consumption attack for malicious clients, but we're not
>> exactly immune to those in general. Thoughts?
>
> I think this would definitly work.  I'm still hoping that we can avoid
> having the full set of deleted snaps maintained on the OSD at all,
> though--that will cut out a lot of complexity (and memory (and storage)).
>
>
> FWIW what I'm currently trying is adding new fields to the map for snapids
> deleted *this* epoch.  That makes it easy for Objecter to adjust its
> requests accordingly.  On the OSD, it also makes the interval_set
> calculcations on map updates faster (we already know what hte new snapids
> are).
>
> And I think it will open up the possibility of removing removed_snaps
> entirely from pg_pool_t if the PGs track the same set locally.  In fact, I
> think in the limit we'll end up with each osdmap epoch just containing (1)
> newly deleted snapids this epoch, and (2) newly purged snapids this epoch.
> The PGs will maintain the set of deleting snapids (in pg_info_t or
> similar, sharing it during peering etc), and we'll keep that window
> short-ish to prevent it from getting too big.  I'm thinking a 1-5 day
> window... basically, whatever the window of laggy clients getting the boot
> is that you can tolerate.  The OSDs will only be pruning incoming snapcs
> against that deleting_snaps set (small!), OSDMaps themselves won't be big
> (each one will only include newly deleted or purged ids).  If a PG falls
> behind and jumps a map gap, it will backfill anyway and get mopped up that
> way...
>
> But... the cost of eliminating all of that crap is losing the client's
> ability to have an in-flight request that is days old reconnect and
> proceed.  Fair trade?

Well, thinking about it more, I'm not sure how we could sensibly choose
a good boundary for keeping some of the deleted snaps out of memory
anyway. So I think we have to go this direction. But I want to think
about it a bit more because I'm still concerned about the buggy
resurrection of snapids.

Maybe we store the full set on disk, cache what we see in-memory, and
check validity against the disk on any snapid we don't have cached? :/
-Greg


* Re: removed_snaps
  2017-10-11 21:25     ` removed_snaps Gregory Farnum
@ 2017-10-12 14:02       ` Sage Weil
  2017-10-13 17:22         ` removed_snaps Gregory Farnum
  0 siblings, 1 reply; 12+ messages in thread
From: Sage Weil @ 2017-10-12 14:02 UTC (permalink / raw)
  To: Gregory Farnum; +Cc: ceph-devel, Mclean, Patrick

On Wed, 11 Oct 2017, Gregory Farnum wrote:
> On Wed, Oct 11, 2017 at 12:02 PM, Sage Weil <sweil@redhat.com> wrote:
> > On Wed, 11 Oct 2017, Gregory Farnum wrote:
> >> On Wed, Oct 11, 2017 at 9:13 AM, Sage Weil <sweil@redhat.com> wrote:
> >> > I'm working on removing the removed_snaps field from pg_pool_t (and
> >> > thus the OSDMap) as this can get very large for clusters that have aged
> >> > and use snapshots.
> >> >
> >> >         https://github.com/ceph/ceph/blob/master/src/osd/osd_types.h#L1320
> >> >
> >> > The short version of the plan is only include recently removed snaps in
> >> > the map.  Once all PGs in a pool have reported that the snap has been
> >> > trimmed, we can safely retire the snapid from that set.
> >> >
> >> > There are a couple of possible problems related to the fact that the
> >> > OSD currently santizes the SnapContext with every write by removing
> >> > snapids that appear in removed_snaps.  This is meant to deal with race
> >> > conditions where the IO was submitted before the snap was removed, but the
> >> > OSD has already logically removed it (or scheduled it for removal).
> >> >
> >> > I see two categories of problems:
> >> >
> >> > 1. The IO was prepared at the librados client before the snap was removed,
> >> > the IO is delayed (e.g., client can't connect, or PG is inactive, etc.)
> >> > until after the snap is deleted and retired from removed_snaps, and then
> >> > the IO is processed.  This could trigger a clone on the OSD that will
> >> > then never get cleaned up.
> >> >
> >> > Specifically, a librbd example:
> >> >
> >> >  a. librbd prepares a IOContext that includes snap S
> >> >  b. librbd initiates a librados IO with on that ioc.  That request is
> >> >     delayed.
> >> >  c. Snap S is removed (someone tells the mon to delete it)
> >> >     - either our librbd client did it, or another one did and we get a
> >> >       notify telling us to refresh our header; doesn't matter which.
> >> >  d. S is included in OSDMap removed_snaps.
> >> >  e. all OSDs prune S
> >> >  f. S is removed from removed_snaps some time later
> >> >  g. The request from (b) finally reaches the OSD and triggers a clone S
> >> >     (which will never get cleaned up)
> >> >
> >> > I think we can fix this problem by making Objecter slightly smarter: it
> >> > can watch OSDMaps it receives and prune snapc's for in-flight requests.
> >> > In the above scenario, sometime between d and g, when it finally
> >> > gets a recent OSDMap, it will have pruned S from the request's snapc.  If
> >> > it didn't get the map, and the request is still tagged with an old
> >> > OSDMap, the OSD can kill the OSD session to force a reconnect/resend. It
> >> > can do this for any incoming client request tagged with an OSDMap prior to
> >> > a low-water-mark epoch in the OSDMap for which older removed_snaps
> >> > have been proved (e.g. removed_snaps_pruned_epoch, or some better name).
> >> >
> >> > In the extreme case where a client is disconnected for so long that they
> >> > can't advance their map to the current one due to the mon having trimmed
> >> > maps, and they have outstanding writes with snapcs, the client
> >> > can choose to fail the requests with EIO or ESTALE or something, or crash
> >> > itself, or otherwise behave as if it has been blacklisted/fenced (if it's
> >> > RBD, it probably has anyway).
> >>
> >> This all sounds good, except...
> >> How on earth do we process these snapc trims in a way that doesn't
> >> bring us back down to effectively single-threaded transfer speeds?
> >>
> >> We might be able to do something with publishing a list of trimmed
> >> snaps and having the dispatch threads check their op snapcs against
> >> that shared data, but I'm not sure and we'd need to performance test
> >> very carefully...
> >
> > The only Objecter change here is that when we get a new osdmap we would
> > apply removed_snaps_this_epoch to the current in-flight requests.  We
> > already walk all in-flight ops on map changes to recheck the CRUSH
> > mapping, and the snapc lists are short, and we can construct the OSDMap so
> > that it has the list of newly-deleted snaps for just that epoch in a new
> > field, so I don't think this will have any measureable impact on
> > performance.
> >
> > Is there a different effect I'm missing?
> 
> I was worried about the map walk. I hadn't thought about the CRUSH
> mapping changes so I guess I don't know how that's implemented, but if
> it works that's good!
> 
> >
> >> > 2. There is a bug, and the librbd image is out of sync: it thinks that
> >> > snap S still exists but in reality it has been pruned.  If this happens,
> >> > then the librbd client may use S and it may trigger a clone.  However,
> >> > that snap still is referenced from the image, so it will presumably
> >> > eventually get deleted and cleaned up.
> >> >
> >> > Greg suggested the possibility of a similar CephFS bug, where for example
> >> > a CephFS client gets confused and continues sending snapcs with removed
> >> > snaps.  I think we can catch this category of bug (and probably #2 as
> >> > well) if we make the OSD log an error/warning to the cluster log if it
> >> > gets an incoming request including a deleted snapid when the request is
> >> > marked with an epoch after when the snap was deleted.  This would
> >> > would mean adding the deleted_epoch for each removed snapid to pg_pool_t
> >> > to do properly; maybe worth it, maybe not?
> >>
> >> I don't quite understand; can you expand on this?
> >
> > The annoyingly confusing thing about this is that snapids are kind of like
> > time points, but the set of existing snapids has nothing to do with what
> > we've trimmed from removed_snaps... it's not like we can just drop low
> > snapids out of the set, since the snapid is the snap creation "when" and
> > not the deletion "when."
> >
> > So, if we want to catch buggy clients that try to use snapcontexts with
> > deleted snapids, we need to know which epoch each snapid was deleted in so
> > that we can compare that to the request epoch.
> >
> > I'm not so sure this will work out.  For one, there is a sort of race
> > condition when librbd is setting up the snapc itself, since a
> > concurrent client might delete the snap before it refreshes, so that it
> > constructs a snapc with a bad snap after librados sees the osdmap deleting
> > it.  :/  Maybe it's not worth worrying about this piece.
> 
> Oh, so instead of having just "interval_set<snapid> deleted_snaps",
> we'd have to add "map<snapid, epoch_t> deleted_snap_epochs" (or
> equivalent) and then compare those (plus a floor epoch for
> oldest-known-deletion) to any incoming snapcontext?
> 
> And for the warning, we'd have some of the snapids that should have
> been deleted still around for comparison. So we wouldn't have any way
> of reliably identifying them all, but we'd expect to see some of them
> turning up if it's an ongoing problem?

Yeah exactly.  I think this check isn't so bad to implement, but not 
strictly necessary.

> >> Let me also suggest one other idea that came up when I was discussing
> >> snapshots with Patrick, by first laying out the problems we're aware
> >> of:
> >> 1) the OSDMap itself might grow too large
> >> 2) comparing very large interval sets of snapIDs is computationally
> >> expensive for the OSD. We do this
> >>   a) when processing a new OSDMap against each PG's local list of deleted snaps
> >>   b) when comparing the deleted snaps against incoming snapcontexts
> >> (although this is usually not an issue because incoming snapcontexts
> >> tend to be pretty small)
> >>
> >> So we want to limit the size of the OSDMap, and we also want to avoid
> >> very large comparisons. Why don't we attack these problems
> >> individually?
> >>
> >> 1) Switch from a deleted_snaps set to a deleting_snaps set in the
> >> OSDMap, and trim it based on per-PG feedback to the manager.
> >> 2) Maintain a *full* deleted_snaps set in the PG info on the OSDs.
> >> 3) Only load into memory the deleted_snaps we reasonably expect to
> >> see, *along* with a boundary snapid indicating what range the set is
> >> valid for.
> >> 4) If we get a snapid smaller than the range is valid for (in a
> >> client's SnapContext, or I suppose in the deleting_snaps map), load
> >> load more deleted snapids off disk to do the comparison.
> >>
> >> I haven't sketched out all the code paths but when I skimmed things
> >> over I think that snapcontext check (and possible off-disk loading) is
> >> actually in any easy-to-wait location. This avoids us needing to
> >> change the client wire protocol or introduce more ordering
> >> dependencies.
> >>
> >> The biggest downside I can see is that it adds a pretty obvious
> >> DoS/resource consumption attack for malicious clients, but we're not
> >> exactly immune to those in general. Thoughts?
> >
> > I think this would definitly work.  I'm still hoping that we can avoid
> > having the full set of deleted snaps maintained on the OSD at all,
> > though--that will cut out a lot of complexity (and memory (and storage)).
> >
> >
> > FWIW what I'm currently trying is adding new fields to the map for snapids
> > deleted *this* epoch.  That makes it easy for Objecter to adjust its
> > requests accordingly.  On the OSD, it also makes the interval_set
> > calculcations on map updates faster (we already know what hte new snapids
> > are).
> >
> > And I think it will open up the possibility of removing removed_snaps
> > entirely from pg_pool_t if the PGs track the same set locally.  In fact, I
> > think in the limit we'll end up with each osdmap epoch just containing (1)
> > newly deleted snapids this epoch, and (2) newly purged snapids this epoch.
> > The PGs will maintain the set of deleting snapids (in pg_info_t or
> > similar, sharing it during peering etc), and we'll keep that window
> > short-ish to prevent it from getting too big.  I'm thinking a 1-5 day
> > window... basically, whatever the window of laggy clients getting the boot
> > is that you can tolerate.  The OSDs will only be pruning incoming snapcs
> > against that deleting_snaps set (small!), OSDMaps themselves won't be big
> > (each one will only include newly deleted or purged ids).  If a PG falls
> > behind and jumps a map gap, it will backfill anyway and get mopped up that
> > way...
> >
> > But... the cost of eliminating all of that crap is losing the client's
> > ability to have an in-flight request that is days old reconnect and
> > proceed.  Fair trade?
> 
> Well, thinking about it more I'm not sure how we could sensibly store
> a good boundary for storing some of the deleted snaps out of memory
> anyway. So I think we have to go this direction. But I want to think
> about it a bit more because I'm still concerned about the buggy
> resurrection of snapids.
> 
> Maybe we store the full set on disk, cache what we see in-memory, and
> check validity against the disk on any snapid we don't have cached? :/

Just had another thought last night: the mon can preserve the full history 
of deletions, by epoch.  When the objecter encounters a map gap it can 
request the removed_snaps over the gap period from the mon at the same 
time it's getting the next map (i.e., the oldest full map stored by 
the mon).  Since this is a pretty rare/exceptional thing, I don't 
worry much about the extra work for the mon, and it avoids the ugly 
client-must-crash behavior... a laggy client will always be able to catch 
up.
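
As a toy sketch of the gap-fill, assuming the mon keeps removed snapids
keyed by the epoch they were removed in (purely illustrative, not actual
mon code):

#include <cstdint>
#include <map>
#include <set>

using Epoch = uint64_t;

// Everything removed in (gap_start, gap_end]: the client's last epoch up to
// the oldest full map the mon still has.
std::set<uint64_t> removed_snaps_in_range(
    const std::map<Epoch, std::set<uint64_t>>& removed_by_epoch,
    Epoch gap_start, Epoch gap_end) {
  std::set<uint64_t> out;
  for (auto it = removed_by_epoch.upper_bound(gap_start);
       it != removed_by_epoch.end() && it->first <= gap_end; ++it) {
    out.insert(it->second.begin(), it->second.end());
  }
  return out;   // the Objecter would prune these from its in-flight snapcs
}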

I updated the plan B notes in the pad to include this bit!

sage


* Re: removed_snaps
  2017-10-12 14:02       ` removed_snaps Sage Weil
@ 2017-10-13 17:22         ` Gregory Farnum
  2017-10-13 21:32           ` removed_snaps Sage Weil
  0 siblings, 1 reply; 12+ messages in thread
From: Gregory Farnum @ 2017-10-13 17:22 UTC (permalink / raw)
  To: Sage Weil; +Cc: ceph-devel, Mclean, Patrick

On Thu, Oct 12, 2017 at 7:02 AM, Sage Weil <sweil@redhat.com> wrote:
> Just had another thought last night: the mon can preserve the full history
> of deletions, by epoch.  When the objecter encounters a map gap it can
> request the removed_snaps over teh gap period from the mon at the same
> time it's getting the next map (i.e., the oldest full map stored by
> the mon).  Since this is a pretty rare/excpetional thing, I don't
> worry much about the extra work for the mon, and it avoids the ugly
> client-must-crash behavior... a laggy client will always be able to catch
> up.

That seems useful and will probably work in a practical sense, but I'm
still a bit concerned. There's an in-built assumption here that the
OSD map epoch of a client is a useful proxy for "has the correct set
of snapids". And...well, it's a *reasonable* proxy, especially if the
Objecter starts trimming snapids. But CephFS certainly doesn't have
any explicit relationship (or even much of an implicit one) between
the OSDMap epochs and the set of live snapshots. I don't think RBD
does either, although since it passes around snapids via header
objects and watch-notify it might come closer?

I'm tossing around in my head if there's some good way to attach a
"valid as of this epoch" tag to snapcontexts generated by external
systems. All snapshot users *do* already maintain a snapid_t for
versioning that they use; maybe we can tie into or extend it somehow?
(A trivial but presumably too-slow implementation for CephFS could do
something like, on every load of a SnapRealm in the MDS, validate the
snap ids against the monitor's full list and attach the current osd
epoch to it.)


Moving on to the stuff actually written down:
How comfortable are we with the size of the currently-deleting
snapshots maps, for computation purposes? I don't have a good way of
quantifying that cost but I'm definitely tempted to split the sets
into
newly_deleted_snaps (for *this epoch*)
deleting_snaps (which are kept around until removed_snaps_lb_epoch)
newly_purged_snaps (also for this epoch, which I think is how you have
it written?)
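
i.e., something shaped like this (only the set names come from this
discussion; the struct itself is just for illustration):

#include <cstdint>
#include <set>

struct PoolRemovedSnapState {
  std::set<uint64_t> newly_deleted_snaps;  // deleted in *this* epoch only
  std::set<uint64_t> deleting_snaps;       // kept until removed_snaps_lb_epoch
  std::set<uint64_t> newly_purged_snaps;   // purged in this epoch only
};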

There are also two questions down at the bottom. For (1) I think it's
good to keep the deleted snaps set for all time (always good for
debugging!), but we need to be careful: if there is a divergence
between RADOS' metadata and those of RBD or CephFS, we need a
mechanism for re-deleting snaps even if they were already zapped.
For (2), yes the removed_snaps_lb_epoch should be per-pool, not
global. We don't have any other global snap data, why would we
introduce a linkage? :)
-Greg


* Re: removed_snaps
  2017-10-13 17:22         ` removed_snaps Gregory Farnum
@ 2017-10-13 21:32           ` Sage Weil
  2017-10-13 22:25             ` removed_snaps Gregory Farnum
  0 siblings, 1 reply; 12+ messages in thread
From: Sage Weil @ 2017-10-13 21:32 UTC (permalink / raw)
  To: Gregory Farnum; +Cc: ceph-devel, Mclean, Patrick

On Fri, 13 Oct 2017, Gregory Farnum wrote:
> On Thu, Oct 12, 2017 at 7:02 AM, Sage Weil <sweil@redhat.com> wrote:
> > Just had another thought last night: the mon can preserve the full history
> > of deletions, by epoch.  When the objecter encounters a map gap it can
> > request the removed_snaps over teh gap period from the mon at the same
> > time it's getting the next map (i.e., the oldest full map stored by
> > the mon).  Since this is a pretty rare/excpetional thing, I don't
> > worry much about the extra work for the mon, and it avoids the ugly
> > client-must-crash behavior... a laggy client will always be able to catch
> > up.
> 
> That seems useful and will probably work in a practical sense, but I'm
> still a bit concerned. There's an in-built assumption here that the
> OSD map epoch of a client is a useful proxy for "has the correct set
> of snapids". And...well, it's a *reasonable* proxy, especially if the
> Objecter starts trimming snapids. But CephFS certainly doesn't have
> any explicit relationship (or even much of an implicit one) between
> the OSDMap epochs and the set of live snapshots. I don't think RBD
> does either, although since it passes around snapids via header
> objects and watch-notify it might come closer?
> 
> I'm tossing around in my head if there's some good way to attach a
> "valid as of this epoch" tag to snapcontexts generated by external
> systems. All snapshot users *do* already maintain a snapid_t for
> versioning that they use; maybe we can tie into or extend it somehow?
> (A trivial but presumably too-slow implementation for CephFS could do
> something like, on every load of a SnapRealm in the MDS, validate the
> snap ids against the monitor's full list and attach the current osd
> epoch to it.)

Yeah, the "snapc valid as of this epoch" is exactly it.

Unfortunately I'm not sure attaching anything to the snapc's 
snap_seq helps, since that's the time point when the snapc was last 
modified, but we actually want a lower bound on when we know the 
snapc to be accurate (i.e., to reflect any completed snap creates/deletes).

For librbd, maybe this is the epoch we load the head object.. *if* we also 
modify it to do a 2PC on snap removal.  Although probably this 
problem isn't so bad since we have exclusive locking and the 
exclusive lock holder is the one doing the snap deletions?

For CephFS... yeah.  IIRC the client only hears about SnapRealm changes 
when they happen, but we actually want to roll the "valid as of epoch" 
value forward when no snaps are created or deleted.  Maybe it can be baked 
into the client/mds heartbeat?

Anyway, in order to actually use the value once we have it, the librados 
snapc methods would need to get an additional valid_as_of_epoch argument 
and refresh it periodically.

...

In any case, in order to use the value properly we need to apply any 
removed_snaps between the valid_as_of_epoch and the current epoch at the 
time the op is submitted.  Which means the OSDMap needs to have a window 
of recent removed_snaps that's big enough to do that, and/or the librados 
user has to have a defined way of refreshing itself if its snapc is too 
stale.
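
A sketch of what that submit-time adjustment could look like, assuming the
client caches per-epoch removed-snap deltas over a recent window (all the
names here are illustrative):

#include <algorithm>
#include <cstdint>
#include <map>
#include <set>
#include <vector>

// Prune from `snaps` everything removed in (valid_as_of_epoch, current_epoch].
// Returns false if the cached window does not reach back far enough, i.e. the
// caller's snapc is too stale and it must refresh itself first.
bool adjust_snapc_for_submit(std::vector<uint64_t>& snaps,
                             uint64_t valid_as_of_epoch,
                             uint64_t current_epoch,
                             uint64_t oldest_cached_epoch,
                             const std::map<uint64_t, std::set<uint64_t>>& removed_by_epoch) {
  if (valid_as_of_epoch + 1 < oldest_cached_epoch)
    return false;  // window too small; snapc may still contain removed snaps
  for (auto it = removed_by_epoch.upper_bound(valid_as_of_epoch);
       it != removed_by_epoch.end() && it->first <= current_epoch; ++it) {
    const auto& removed = it->second;
    snaps.erase(std::remove_if(snaps.begin(), snaps.end(),
                               [&](uint64_t s) { return removed.count(s); }),
                snaps.end());
  }
  return true;
}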

The bad news is that my branch removes the removed_snaps entirely from the 
OSDMap--we only have deltas.  I can add it back without much trouble.

The better news is that this means we don't have to have a wide window 
between advertising removed vs purged snaps anymore--we can declare them 
purged as soon as they are purged.  Which has the added benefit of being 
able to explicitly track/monitor the 'removing' set of snaps on the mon, 
which seems like something that will be of interest to ops folks.

So.. yeah, this is an open issue.  I think what I have now will 
mostly work, but it may leak clones occasionally if clients are super 
stale.  Maybe that's not a big deal...


> Moving on to the stuff actually written down:
> How comfortable are we with the size of the currently-deleting
> snapshots maps, for computation purposes? I don't have a good way of
> quantifying that cost but I'm definitely tempted to split the sets
> into
> newly_deleted_snaps (for *this epoch*)
> deleting_snaps (which are kept around until removed_snaps_lb_epoch)
> newly_purged_snaps (also for this epoch, which I think is how you have
> it written?)

The code I have now has 2 sets: removed_snaps and purged_snaps.  
purged_snaps was already there before; removed_snaps is the superset that 
includes stuff not yet purged.  (I made it a superset thinking that 
the filter_snapc() thing on every op will be faster if it only has to 
filter against a single set instead of 2 of them, and that matters more 
than the memory.)

> There are also two questions down at the bottom. For (1) I think it's
> good to keep the deleted snaps set for all time (always good for
> debugging!), but we need to be careful: if there is a divergence
> between RADOS' metadata and those of RBD or CephFS, we need a
> mechanism for re-deleting snaps even if they were already zapped.

I'm preserving for all time, but yeah, no mechanism to force a re-removal.  
Hrm.

> For (2), yes the removed_snaps_lb_epoch should be per-pool, not
> global. We don't have any other global snap data, why would we
> introduce a linkage? :)

Yep!

Thanks-
sage


* Re: removed_snaps
  2017-10-13 21:32           ` removed_snaps Sage Weil
@ 2017-10-13 22:25             ` Gregory Farnum
  2017-10-14 15:56               ` removed_snaps Sage Weil
  0 siblings, 1 reply; 12+ messages in thread
From: Gregory Farnum @ 2017-10-13 22:25 UTC (permalink / raw)
  To: Sage Weil; +Cc: ceph-devel, Mclean, Patrick

On Fri, Oct 13, 2017 at 2:32 PM, Sage Weil <sweil@redhat.com> wrote:
> On Fri, 13 Oct 2017, Gregory Farnum wrote:
>> Moving on to the stuff actually written down:
>> How comfortable are we with the size of the currently-deleting
>> snapshots maps, for computation purposes? I don't have a good way of
>> quantifying that cost but I'm definitely tempted to split the sets
>> into
>> newly_deleted_snaps (for *this epoch*)
>> deleting_snaps (which are kept around until removed_snaps_lb_epoch)
>> newly_purged_snaps (also for this epoch, which I think is how you have
>> it written?)
>
> The code I have now has 2 sets: removed_snaps and purged_snaps.
> purged_snapd was already there before; removed_snaps is the superset that
> includes stuff not yet purged.  (I made it a superset thinking that
> the filter_snapc() thing on every op will be faster if it only has to
> filter against a single set instead of 2 of them, and that matters more
> than the memory.)

No, what I mean is:
* we know that computing the overlap of interval sets can be expensive
* so we want to only compute overlaps of small sets
* this is especially true if we're doing it on all outstanding
operations in librados, within a single thread

So we definitely don't want to repeat comparisons if we don't have to.
And I suspect there will be occasional periods of intense snapshot
deletion from admins where they remove a number large enough to cause
trouble, and we don't want that to result in noticeably slower IO
submission on the client-side!

So we should have a separate set of snaps which were just added to the
removed_snaps during this epoch, that librados and the OSD map
processing can use in their normal-course-of-business scanning.
-Greg


* Re: removed_snaps
  2017-10-13 22:25             ` removed_snaps Gregory Farnum
@ 2017-10-14 15:56               ` Sage Weil
  2017-10-16 15:27                 ` removed_snaps Gregory Farnum
  0 siblings, 1 reply; 12+ messages in thread
From: Sage Weil @ 2017-10-14 15:56 UTC (permalink / raw)
  To: Gregory Farnum; +Cc: ceph-devel, Mclean, Patrick

On Fri, 13 Oct 2017, Gregory Farnum wrote:
> On Fri, Oct 13, 2017 at 2:32 PM, Sage Weil <sweil@redhat.com> wrote:
> > On Fri, 13 Oct 2017, Gregory Farnum wrote:
> >> Moving on to the stuff actually written down:
> >> How comfortable are we with the size of the currently-deleting
> >> snapshots maps, for computation purposes? I don't have a good way of
> >> quantifying that cost but I'm definitely tempted to split the sets
> >> into
> >> newly_deleted_snaps (for *this epoch*)
> >> deleting_snaps (which are kept around until removed_snaps_lb_epoch)
> >> newly_purged_snaps (also for this epoch, which I think is how you have
> >> it written?)
> >
> > The code I have now has 2 sets: removed_snaps and purged_snaps.
> > purged_snapd was already there before; removed_snaps is the superset that
> > includes stuff not yet purged.  (I made it a superset thinking that
> > the filter_snapc() thing on every op will be faster if it only has to
> > filter against a single set instead of 2 of them, and that matters more
> > than the memory.)
> 
> No, what I mean is:
> * we know that computing the overlap of interval sets can be expensive
> * so we want to only compute overlaps of small sets
> * this is especially true if we're doing it on all outstanding
> operations in librados, within a single thread
> 
> So we definitely don't want to repeat comparisons if we don't have to.
> And I suspect there will be occasional periods of intense snapshot
> deletion from admins where they remove a number large enough to cause
> trouble, and we don't want that to result in noticeably slower IO
> submission on the client-side!
> 
> So we should have a separate set of snaps which were just added to the
> removed_snaps during this epoch, that librados and the OSD map
> processing can use in their normal-course-of-business scanning.

I definitely get that we want to avoid any set intersections (and I think 
the current code actually has only 1 of them, during peering activate).  
But I'm not quite following which lookup you're thinking of.  FTR 
filter_snapc iterates over the vector<snapid_t> in SnapContext and does an 
interval_set lookup on each one to make sure it isn't deleted.

I think there are basically two cases:

1- The request comes in with the same epoch as our epoch, and the mimic 
feature, so we could skip the filter_snapc entirely (client will have done 
it).  Yay!  Hopefully this will be true for most requests in steady state.

2- The request comes in with an older osdmap epoch, so we have to 
filter_snapc against the recent_removed_snaps interval_set<>.  This is 
what we do today, except instead of recent_removed_snaps we have 
removed_snaps_over_all_time, so we are going from a huge set to a 
relatively small one.
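
In other words, roughly (hand-waving the names; this is not the actual
OSD dispatch code):

  // the two cases above, using filter_snapc() as paraphrased earlier
  void maybe_filter_snapc(epoch_t op_map_epoch, epoch_t cur_epoch,
                          bool client_has_mimic_feature,
                          const interval_set<snapid_t>& recent_removed_snaps,
                          std::vector<snapid_t>& snaps) {
    if (op_map_epoch == cur_epoch && client_has_mimic_feature)
      return;                                    // case 1: client already filtered
    // case 2: older map epoch -> filter against the (now small) recent set
    filter_snapc(snaps, recent_removed_snaps);
  }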

We could do something like

3- Maintain a small interval_set with a smaller range of epochs (e.g., 
only the last 20 epochs instead of the last 500) and filter against that 
if the request's epoch is recent enough.

or,

4- Remember the new_removed_snaps for the last N epochs and filter against 
them individually.  This turns into m log(n) lookups instead of one 
log(n*500) lookup (where m << 500)... probably a small win?
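
For (4) the shape would be something like (illustrative only, names
made up):

  #include <map>

  // keep the per-epoch deltas for the last N epochs and filter an old
  // op only against the epochs it actually missed
  std::map<epoch_t, interval_set<snapid_t>> new_removed_by_epoch;

  void filter_since(epoch_t op_epoch, std::vector<snapid_t>& snaps) {
    for (auto it = new_removed_by_epoch.upper_bound(op_epoch);
         it != new_removed_by_epoch.end(); ++it) {
      // m small lookups instead of one lookup in a 500-epoch-wide set
      filter_snapc(snaps, it->second);           // as paraphrased earlier
    }
  }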

sage

* Re: removed_snaps
  2017-10-14 15:56               ` removed_snaps Sage Weil
@ 2017-10-16 15:27                 ` Gregory Farnum
  2017-10-16 18:38                   ` removed_snaps Sage Weil
  0 siblings, 1 reply; 12+ messages in thread
From: Gregory Farnum @ 2017-10-16 15:27 UTC (permalink / raw)
  To: Sage Weil; +Cc: ceph-devel, Mclean, Patrick

On Sat, Oct 14, 2017 at 8:56 AM, Sage Weil <sweil@redhat.com> wrote:
> On Fri, 13 Oct 2017, Gregory Farnum wrote:
>> On Fri, Oct 13, 2017 at 2:32 PM, Sage Weil <sweil@redhat.com> wrote:
>> > On Fri, 13 Oct 2017, Gregory Farnum wrote:
>> >> Moving on to the stuff actually written down:
>> >> How comfortable are we with the size of the currently-deleting
>> >> snapshots maps, for computation purposes? I don't have a good way of
>> >> quantifying that cost but I'm definitely tempted to split the sets
>> >> into
>> >> newly_deleted_snaps (for *this epoch*)
>> >> deleting_snaps (which are kept around until removed_snaps_lb_epoch)
>> >> newly_purged_snaps (also for this epoch, which I think is how you have
>> >> it written?)
>> >
>> > The code I have now has 2 sets: removed_snaps and purged_snaps.
> >> > purged_snaps was already there before; removed_snaps is the superset that
>> > includes stuff not yet purged.  (I made it a superset thinking that
>> > the filter_snapc() thing on every op will be faster if it only has to
>> > filter against a single set instead of 2 of them, and that matters more
>> > than the memory.)
>>
>> No, what I mean is:
>> * we know that computing the overlap of interval sets can be expensive
>> * so we want to only compute overlaps of small sets
>> * this is especially true if we're doing it on all outstanding
>> operations in librados, within a single thread
>>
>> So we definitely don't want to repeat comparisons if we don't have to.
>> And I suspect there will be occasional periods of intense snapshot
>> deletion from admins where they remove a number large enough to cause
>> trouble, and we don't want that to result in noticeably slower IO
>> submission on the client-side!
>>
>> So we should have a separate set of snaps which were just added to the
>> removed_snaps during this epoch, that librados and the OSD map
>> processing can use in their normal-course-of-business scanning.
>
> I definitely get that we want to avoid any set intersections (and I think
> the current code actually has only 1 of them, during peering activate).
> But I'm not quite following which lookup you're thinking of.  FTR
> filter_snapc iterates over the vector<snapid_t> in SnapContext and does an
> interval_set lookup on each one to make sure it isn't deleted.
>
> I think there are basically two cases:
>
> 1- The request comes in with the same epoch as our epoch, and the mimic
> feature, so we could skip the filter_snapc entirely (client will have done
> it).  Yay!  Hopefully this will be true for most requests in steady state.
>
> 2- The request comes in with an older osdmap epoch, so we have to
> filter_snapc against the recent_removed_snaps interval_set<>.  This is
> what we do today, except instead of recent_removed_snaps we have
> removed_snaps_over_all_time, so we are going from a huge set to a
> relatively small one.

You're leaving out

3) Every client runs a set intersection on its outstanding ops on
every map update (which we did not run before).

This is the one I'm thinking about when I say we should have a
this-epoch-only set of additions.
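
i.e. on the client side, something like (a sketch; these are not the
real Objecter names or types):

  #include <algorithm>
  #include <list>
  #include <vector>

  struct InflightOp {                    // stand-in for the real op struct
    std::vector<snapid_t> snaps;         // the op's SnapContext snap list
  };

  // on each map update, prune in-flight ops against *this epoch's*
  // removals only, not the whole accumulated set
  void prune_inflight(const interval_set<snapid_t>& this_epoch_removed,
                      std::list<InflightOp*>& ops) {
    if (this_epoch_removed.empty())
      return;                            // common case: nothing to do at all
    for (auto* op : ops) {
      auto& snaps = op->snaps;
      snaps.erase(std::remove_if(snaps.begin(), snaps.end(),
                    [&](snapid_t s) { return this_epoch_removed.contains(s); }),
                  snaps.end());
    }
  }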
-Greg

>
> We could do something like
>
> 3- Maintain a small interval_set with a smaller range of epochs (e.g.,
> only the last 20 epochs instead of the last 500) and filter against that
> if the request's epoch is recent enough.
>
> or,
>
> 4- Remember the new_removed_snaps for the last N epochs and filter against
> them individually.  This turns into m log(n) lookups instead of one
> log(n*500) lookup (where m << 500)... probably a small win?
>
> sage

* Re: removed_snaps
  2017-10-16 15:27                 ` removed_snaps Gregory Farnum
@ 2017-10-16 18:38                   ` Sage Weil
  0 siblings, 0 replies; 12+ messages in thread
From: Sage Weil @ 2017-10-16 18:38 UTC (permalink / raw)
  To: Gregory Farnum; +Cc: ceph-devel, Mclean, Patrick

On Mon, 16 Oct 2017, Gregory Farnum wrote:
> On Sat, Oct 14, 2017 at 8:56 AM, Sage Weil <sweil@redhat.com> wrote:
> > On Fri, 13 Oct 2017, Gregory Farnum wrote:
> >> On Fri, Oct 13, 2017 at 2:32 PM, Sage Weil <sweil@redhat.com> wrote:
> >> > On Fri, 13 Oct 2017, Gregory Farnum wrote:
> >> >> Moving on to the stuff actually written down:
> >> >> How comfortable are we with the size of the currently-deleting
> >> >> snapshots maps, for computation purposes? I don't have a good way of
> >> >> quantifying that cost but I'm definitely tempted to split the sets
> >> >> into
> >> >> newly_deleted_snaps (for *this epoch*)
> >> >> deleting_snaps (which are kept around until removed_snaps_lb_epoch)
> >> >> newly_purged_snaps (also for this epoch, which I think is how you have
> >> >> it written?)
> >> >
> >> > The code I have now has 2 sets: removed_snaps and purged_snaps.
> > >> > purged_snaps was already there before; removed_snaps is the superset that
> >> > includes stuff not yet purged.  (I made it a superset thinking that
> >> > the filter_snapc() thing on every op will be faster if it only has to
> >> > filter against a single set instead of 2 of them, and that matters more
> >> > than the memory.)
> >>
> >> No, what I mean is:
> >> * we know that computing the overlap of interval sets can be expensive
> >> * so we want to only compute overlaps of small sets
> >> * this is especially true if we're doing it on all outstanding
> >> operations in librados, within a single thread
> >>
> >> So we definitely don't want to repeat comparisons if we don't have to.
> >> And I suspect there will be occasional periods of intense snapshot
> >> deletion from admins where they remove a number large enough to cause
> >> trouble, and we don't want that to result in noticeably slower IO
> >> submission on the client-side!
> >>
> >> So we should have a separate set of snaps which were just added to the
> >> removed_snaps during this epoch, that librados and the OSD map
> >> processing can use in their normal-course-of-business scanning.
> >
> > I definitely get that we want to avoid any set intersections (and I think
> > the current code actually has only 1 of them, during peering activate).
> > But I'm not quite following which lookup you're thinking of.  FTR
> > filter_snapc iterates over the vector<snapid_t> in SnapContext and does an
> > interval_set lookup on each one to make sure it isn't deleted.
> >
> > I think there are basically two cases:
> >
> > 1- The request comes in with the same epoch as our epoch, and the mimic
> > feature, so we could skip the filter_snapc entirely (client will have done
> > it).  Yay!  Hopefully this will be true for most requests in steady state.
> >
> > 2- The request comes in with an older osdmap epoch, so we have to
> > filter_snapc against the recent_removed_snaps interval_set<>.  This is
> > what we do today, except instead of recent_removed_snaps we have
> > removed_snaps_over_all_time, so we are going from a huge set to a
> > relatively small one.
> 
> You're leaving out
> 
> 3) Every client runs a set intersection on its outstanding ops on
> every map update (which we did not run before).
> 
> This is the one I'm thinking about when I say we should have a
> this-epoch-only set of additions.

Oh, I see what you mean.

Each OSDMap::Incremental has the new_removed_snaps, so it will be a small 
list to filter against (especially compared to the CRUSH calculation).  I'm 
also going to switch it to a flat_set<>, which gives log(n) lookup with a 
single allocation.
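
Roughly this shape (a sketch, not the actual patch; I'm assuming
boost::container::flat_set for the flat set type):

  #include <boost/container/flat_set.hpp>
  #include <map>

  // per pool, just the snaps removed in this epoch; contiguous storage,
  // so one allocation and a log(n) binary-search lookup
  std::map<int64_t, boost::container::flat_set<snapid_t>> new_removed_snaps;

  bool newly_removed(int64_t pool, snapid_t s) {
    auto p = new_removed_snaps.find(pool);
    return p != new_removed_snaps.end() && p->second.count(s);
  }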

sage

end of thread

Thread overview: 12+ messages
2017-10-11 16:13 removed_snaps Sage Weil
2017-10-11 18:38 ` removed_snaps Gregory Farnum
2017-10-11 19:02   ` removed_snaps Sage Weil
2017-10-11 19:37     ` removed_snaps Sage Weil
2017-10-11 21:25     ` removed_snaps Gregory Farnum
2017-10-12 14:02       ` removed_snaps Sage Weil
2017-10-13 17:22         ` removed_snaps Gregory Farnum
2017-10-13 21:32           ` removed_snaps Sage Weil
2017-10-13 22:25             ` removed_snaps Gregory Farnum
2017-10-14 15:56               ` removed_snaps Sage Weil
2017-10-16 15:27                 ` removed_snaps Gregory Farnum
2017-10-16 18:38                   ` removed_snaps Sage Weil
