* removed_snaps
@ 2017-10-11 16:13 Sage Weil
  2017-10-11 18:38 ` removed_snaps Gregory Farnum
  0 siblings, 1 reply; 12+ messages in thread

From: Sage Weil @ 2017-10-11 16:13 UTC (permalink / raw)
To: ceph-devel

I'm working on removing the removed_snaps field from pg_pool_t (and thus
the OSDMap) as this can get very large for clusters that have aged and
use snapshots.

https://github.com/ceph/ceph/blob/master/src/osd/osd_types.h#L1320

The short version of the plan is to only include recently removed snaps
in the map. Once all PGs in a pool have reported that the snap has been
trimmed, we can safely retire the snapid from that set.

There are a couple of possible problems related to the fact that the OSD
currently sanitizes the SnapContext with every write by removing snapids
that appear in removed_snaps. This is meant to deal with race conditions
where the IO was submitted before the snap was removed, but the OSD has
already logically removed it (or scheduled it for removal).

I see two categories of problems:

1. The IO was prepared at the librados client before the snap was
removed, the IO is delayed (e.g., client can't connect, or PG is
inactive, etc.) until after the snap is deleted and retired from
removed_snaps, and then the IO is processed. This could trigger a clone
on the OSD that will then never get cleaned up.

Specifically, a librbd example:

a. librbd prepares an IOContext that includes snap S
b. librbd initiates a librados IO on that ioc. That request is delayed.
c. Snap S is removed (someone tells the mon to delete it)
   - either our librbd client did it, or another one did and we get a
     notify telling us to refresh our header; doesn't matter which.
d. S is included in OSDMap removed_snaps.
e. all OSDs prune S
f. S is removed from removed_snaps some time later
g. The request from (b) finally reaches the OSD and triggers a clone S
   (which will never get cleaned up)

I think we can fix this problem by making Objecter slightly smarter: it
can watch OSDMaps it receives and prune snapc's for in-flight requests.
In the above scenario, sometime between d and g, when it finally gets a
recent OSDMap, it will have pruned S from the request's snapc. If it
didn't get the map, and the request is still tagged with an old OSDMap,
the OSD can kill the OSD session to force a reconnect/resend. It can do
this for any incoming client request tagged with an OSDMap prior to a
low-water-mark epoch in the OSDMap for which older removed_snaps have
been pruned (e.g. removed_snaps_pruned_epoch, or some better name).

In the extreme case where a client is disconnected for so long that they
can't advance their map to the current one due to the mon having trimmed
maps, and they have outstanding writes with snapcs, the client can
choose to fail the requests with EIO or ESTALE or something, or crash
itself, or otherwise behave as if it has been blacklisted/fenced (if
it's RBD, it probably has anyway).

2. There is a bug, and the librbd image is out of sync: it thinks that
snap S still exists but in reality it has been pruned. If this happens,
then the librbd client may use S and it may trigger a clone. However,
that snap still is referenced from the image, so it will presumably
eventually get deleted and cleaned up.

Greg suggested the possibility of a similar CephFS bug, where for
example a CephFS client gets confused and continues sending snapcs with
removed snaps. I think we can catch this category of bug (and probably
#2 as well) if we make the OSD log an error/warning to the cluster log
if it gets an incoming request including a deleted snapid when the
request is marked with an epoch after when the snap was deleted. This
would mean adding the deleted_epoch for each removed snapid to pg_pool_t
to do properly; maybe worth it, maybe not?

Thoughts?
sage ^ permalink raw reply [flat|nested] 12+ messages in thread
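The Objecter change proposed above boils down to re-sanitizing the snapc
of each in-flight op whenever a new map arrives. A minimal stand-alone
sketch of that pruning step, using simplified stand-ins (`SnapContext`,
`prune_snapc`) rather than the actual Ceph classes:

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>
#include <set>
#include <vector>

using snapid_t = uint64_t;

// Simplified stand-in for the snap context an in-flight request carries.
struct SnapContext {
  snapid_t seq;
  std::vector<snapid_t> snaps;  // existing snaps, newest first
};

// On receipt of a new OSDMap, drop any snapids the map reports as
// removed -- the same sanitization the OSD does on every write today,
// done at the client so the OSD no longer needs the full history.
void prune_snapc(SnapContext& snapc, const std::set<snapid_t>& removed) {
  snapc.snaps.erase(
      std::remove_if(snapc.snaps.begin(), snapc.snaps.end(),
                     [&](snapid_t s) { return removed.count(s) != 0; }),
      snapc.snaps.end());
}
```

Since the Objecter already iterates over all in-flight ops on every map
change to recheck CRUSH mappings, this check would ride along on the
same walk, against only the snapids removed in the new epoch.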
* Re: removed_snaps
  2017-10-11 16:13 removed_snaps Sage Weil
@ 2017-10-11 18:38 ` Gregory Farnum
  2017-10-11 19:02   ` removed_snaps Sage Weil
  0 siblings, 1 reply; 12+ messages in thread

From: Gregory Farnum @ 2017-10-11 18:38 UTC (permalink / raw)
To: Sage Weil; +Cc: ceph-devel, Mclean, Patrick

On Wed, Oct 11, 2017 at 9:13 AM, Sage Weil <sweil@redhat.com> wrote:
> I'm working on removing the removed_snaps field from pg_pool_t (and
> thus the OSDMap) as this can get very large for clusters that have aged
> and use snapshots.
>
> https://github.com/ceph/ceph/blob/master/src/osd/osd_types.h#L1320
>
> The short version of the plan is to only include recently removed snaps
> in the map. Once all PGs in a pool have reported that the snap has been
> trimmed, we can safely retire the snapid from that set.
>
> There are a couple of possible problems related to the fact that the
> OSD currently sanitizes the SnapContext with every write by removing
> snapids that appear in removed_snaps. This is meant to deal with race
> conditions where the IO was submitted before the snap was removed, but
> the OSD has already logically removed it (or scheduled it for removal).
>
> I see two categories of problems:
>
> 1. The IO was prepared at the librados client before the snap was
> removed, the IO is delayed (e.g., client can't connect, or PG is
> inactive, etc.) until after the snap is deleted and retired from
> removed_snaps, and then the IO is processed. This could trigger a clone
> on the OSD that will then never get cleaned up.
>
> Specifically, a librbd example:
>
> a. librbd prepares an IOContext that includes snap S
> b. librbd initiates a librados IO on that ioc. That request is delayed.
> c. Snap S is removed (someone tells the mon to delete it)
>    - either our librbd client did it, or another one did and we get a
>      notify telling us to refresh our header; doesn't matter which.
> d. S is included in OSDMap removed_snaps.
> e. all OSDs prune S
> f. S is removed from removed_snaps some time later
> g. The request from (b) finally reaches the OSD and triggers a clone S
>    (which will never get cleaned up)
>
> I think we can fix this problem by making Objecter slightly smarter: it
> can watch OSDMaps it receives and prune snapc's for in-flight requests.
> In the above scenario, sometime between d and g, when it finally gets a
> recent OSDMap, it will have pruned S from the request's snapc. If it
> didn't get the map, and the request is still tagged with an old OSDMap,
> the OSD can kill the OSD session to force a reconnect/resend. It can do
> this for any incoming client request tagged with an OSDMap prior to a
> low-water-mark epoch in the OSDMap for which older removed_snaps have
> been pruned (e.g. removed_snaps_pruned_epoch, or some better name).
>
> In the extreme case where a client is disconnected for so long that
> they can't advance their map to the current one due to the mon having
> trimmed maps, and they have outstanding writes with snapcs, the client
> can choose to fail the requests with EIO or ESTALE or something, or
> crash itself, or otherwise behave as if it has been blacklisted/fenced
> (if it's RBD, it probably has anyway).

This all sounds good, except...
How on earth do we process these snapc trims in a way that doesn't
bring us back down to effectively single-threaded transfer speeds?

We might be able to do something with publishing a list of trimmed
snaps and having the dispatch threads check their op snapcs against
that shared data, but I'm not sure and we'd need to performance test
very carefully...

> 2. There is a bug, and the librbd image is out of sync: it thinks that
> snap S still exists but in reality it has been pruned. If this happens,
> then the librbd client may use S and it may trigger a clone. However,
> that snap still is referenced from the image, so it will presumably
> eventually get deleted and cleaned up.
>
> Greg suggested the possibility of a similar CephFS bug, where for
> example a CephFS client gets confused and continues sending snapcs with
> removed snaps. I think we can catch this category of bug (and probably
> #2 as well) if we make the OSD log an error/warning to the cluster log
> if it gets an incoming request including a deleted snapid when the
> request is marked with an epoch after when the snap was deleted. This
> would mean adding the deleted_epoch for each removed snapid to
> pg_pool_t to do properly; maybe worth it, maybe not?

I don't quite understand; can you expand on this?

Let me also suggest one other idea that came up when I was discussing
snapshots with Patrick, by first laying out the problems we're aware of:
1) the OSDMap itself might grow too large
2) comparing very large interval sets of snapIDs is computationally
expensive for the OSD. We do this
  a) when processing a new OSDMap against each PG's local list of
     deleted snaps
  b) when comparing the deleted snaps against incoming snapcontexts
     (although this is usually not an issue because incoming
     snapcontexts tend to be pretty small)

So we want to limit the size of the OSDMap, and we also want to avoid
very large comparisons. Why don't we attack these problems individually?

1) Switch from a deleted_snaps set to a deleting_snaps set in the
OSDMap, and trim it based on per-PG feedback to the manager.
2) Maintain a *full* deleted_snaps set in the PG info on the OSDs.
3) Only load into memory the deleted_snaps we reasonably expect to see,
*along* with a boundary snapid indicating what range the set is valid
for.
4) If we get a snapid smaller than the range is valid for (in a client's
SnapContext, or I suppose in the deleting_snaps map), load more deleted
snapids off disk to do the comparison.

I haven't sketched out all the code paths, but when I skimmed things
over I think that snapcontext check (and possible off-disk loading) is
actually in an easy-to-wait location. This avoids us needing to change
the client wire protocol or introduce more ordering dependencies.

The biggest downside I can see is that it adds a pretty obvious
DoS/resource consumption attack for malicious clients, but we're not
exactly immune to those in general. Thoughts?
-Greg
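Greg's points (2)-(4) amount to a bounded in-memory cache over a full
on-disk set. A rough stand-alone sketch of that lookup path; the names
(`DeletedSnapCache`, `valid_from`, the `on_disk` set standing in for the
PG-info copy) are illustrative, not actual Ceph code:

```cpp
#include <cassert>
#include <cstdint>
#include <set>

using snapid_t = uint64_t;

// Bounded cache over the full deleted-snaps set kept in PG info.
// Only snapids >= valid_from are guaranteed to be represented in
// `cached`; anything older requires consulting the full set (a disk
// load in the real design).
struct DeletedSnapCache {
  snapid_t valid_from;                // cache authoritative for [valid_from, inf)
  std::set<snapid_t> cached;          // deleted snapids within that range
  const std::set<snapid_t>* on_disk;  // stand-in for the on-disk full set

  bool is_deleted(snapid_t s) {
    if (s >= valid_from)
      return cached.count(s) != 0;
    // Cache miss below the boundary: "load" the older portion and
    // extend the validity range so repeated probes stay cheap.
    for (auto it = on_disk->lower_bound(s);
         it != on_disk->end() && *it < valid_from; ++it)
      cached.insert(*it);
    valid_from = s;
    return cached.count(s) != 0;
  }
};
```

The common case (recent snapids in a client snapc, or the deleting set
in the map) never crosses the boundary; only an unusually old snapid
pays for the wider load, which is also where the DoS concern comes from.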
* Re: removed_snaps
  2017-10-11 18:38 ` removed_snaps Gregory Farnum
@ 2017-10-11 19:02 ` Sage Weil
  2017-10-11 19:37   ` removed_snaps Sage Weil
  2017-10-11 21:25   ` removed_snaps Gregory Farnum
  0 siblings, 2 replies; 12+ messages in thread

From: Sage Weil @ 2017-10-11 19:02 UTC (permalink / raw)
To: Gregory Farnum; +Cc: ceph-devel, Mclean, Patrick

On Wed, 11 Oct 2017, Gregory Farnum wrote:
> On Wed, Oct 11, 2017 at 9:13 AM, Sage Weil <sweil@redhat.com> wrote:
> > I'm working on removing the removed_snaps field from pg_pool_t (and
> > thus the OSDMap) as this can get very large for clusters that have
> > aged and use snapshots.
> >
> > https://github.com/ceph/ceph/blob/master/src/osd/osd_types.h#L1320
> >
> > The short version of the plan is to only include recently removed
> > snaps in the map. Once all PGs in a pool have reported that the snap
> > has been trimmed, we can safely retire the snapid from that set.
> >
> > There are a couple of possible problems related to the fact that the
> > OSD currently sanitizes the SnapContext with every write by removing
> > snapids that appear in removed_snaps. This is meant to deal with race
> > conditions where the IO was submitted before the snap was removed,
> > but the OSD has already logically removed it (or scheduled it for
> > removal).
> >
> > I see two categories of problems:
> >
> > 1. The IO was prepared at the librados client before the snap was
> > removed, the IO is delayed (e.g., client can't connect, or PG is
> > inactive, etc.) until after the snap is deleted and retired from
> > removed_snaps, and then the IO is processed. This could trigger a
> > clone on the OSD that will then never get cleaned up.
> >
> > Specifically, a librbd example:
> >
> > a. librbd prepares an IOContext that includes snap S
> > b. librbd initiates a librados IO on that ioc. That request is
> >    delayed.
> > c. Snap S is removed (someone tells the mon to delete it)
> >    - either our librbd client did it, or another one did and we get a
> >      notify telling us to refresh our header; doesn't matter which.
> > d. S is included in OSDMap removed_snaps.
> > e. all OSDs prune S
> > f. S is removed from removed_snaps some time later
> > g. The request from (b) finally reaches the OSD and triggers a clone
> >    S (which will never get cleaned up)
> >
> > I think we can fix this problem by making Objecter slightly smarter:
> > it can watch OSDMaps it receives and prune snapc's for in-flight
> > requests. In the above scenario, sometime between d and g, when it
> > finally gets a recent OSDMap, it will have pruned S from the
> > request's snapc. If it didn't get the map, and the request is still
> > tagged with an old OSDMap, the OSD can kill the OSD session to force
> > a reconnect/resend. It can do this for any incoming client request
> > tagged with an OSDMap prior to a low-water-mark epoch in the OSDMap
> > for which older removed_snaps have been pruned (e.g.
> > removed_snaps_pruned_epoch, or some better name).
> >
> > In the extreme case where a client is disconnected for so long that
> > they can't advance their map to the current one due to the mon having
> > trimmed maps, and they have outstanding writes with snapcs, the
> > client can choose to fail the requests with EIO or ESTALE or
> > something, or crash itself, or otherwise behave as if it has been
> > blacklisted/fenced (if it's RBD, it probably has anyway).
>
> This all sounds good, except...
> How on earth do we process these snapc trims in a way that doesn't
> bring us back down to effectively single-threaded transfer speeds?
>
> We might be able to do something with publishing a list of trimmed
> snaps and having the dispatch threads check their op snapcs against
> that shared data, but I'm not sure and we'd need to performance test
> very carefully...

The only Objecter change here is that when we get a new osdmap we would
apply removed_snaps_this_epoch to the current in-flight requests. We
already walk all in-flight ops on map changes to recheck the CRUSH
mapping, and the snapc lists are short, and we can construct the OSDMap
so that it has the list of newly-deleted snaps for just that epoch in a
new field, so I don't think this will have any measurable impact on
performance.

Is there a different effect I'm missing?

> > 2. There is a bug, and the librbd image is out of sync: it thinks
> > that snap S still exists but in reality it has been pruned. If this
> > happens, then the librbd client may use S and it may trigger a clone.
> > However, that snap still is referenced from the image, so it will
> > presumably eventually get deleted and cleaned up.
> >
> > Greg suggested the possibility of a similar CephFS bug, where for
> > example a CephFS client gets confused and continues sending snapcs
> > with removed snaps. I think we can catch this category of bug (and
> > probably #2 as well) if we make the OSD log an error/warning to the
> > cluster log if it gets an incoming request including a deleted snapid
> > when the request is marked with an epoch after when the snap was
> > deleted. This would mean adding the deleted_epoch for each removed
> > snapid to pg_pool_t to do properly; maybe worth it, maybe not?
>
> I don't quite understand; can you expand on this?

The annoyingly confusing thing about this is that snapids are kind of
like time points, but the set of existing snapids has nothing to do with
what we've trimmed from removed_snaps... it's not like we can just drop
low snapids out of the set, since the snapid is the snap creation "when"
and not the deletion "when."

So, if we want to catch buggy clients that try to use snapcontexts with
deleted snapids, we need to know which epoch each snapid was deleted in
so that we can compare that to the request epoch.

I'm not so sure this will work out. For one, there is a sort of race
condition when librbd is setting up the snapc itself, since a concurrent
client might delete the snap before it refreshes, so that it constructs
a snapc with a bad snap after librados sees the osdmap deleting it. :/
Maybe it's not worth worrying about this piece.

> Let me also suggest one other idea that came up when I was discussing
> snapshots with Patrick, by first laying out the problems we're aware
> of:
> 1) the OSDMap itself might grow too large
> 2) comparing very large interval sets of snapIDs is computationally
> expensive for the OSD. We do this
>   a) when processing a new OSDMap against each PG's local list of
>      deleted snaps
>   b) when comparing the deleted snaps against incoming snapcontexts
>      (although this is usually not an issue because incoming
>      snapcontexts tend to be pretty small)
>
> So we want to limit the size of the OSDMap, and we also want to avoid
> very large comparisons. Why don't we attack these problems
> individually?
>
> 1) Switch from a deleted_snaps set to a deleting_snaps set in the
> OSDMap, and trim it based on per-PG feedback to the manager.
> 2) Maintain a *full* deleted_snaps set in the PG info on the OSDs.
> 3) Only load into memory the deleted_snaps we reasonably expect to
> see, *along* with a boundary snapid indicating what range the set is
> valid for.
> 4) If we get a snapid smaller than the range is valid for (in a
> client's SnapContext, or I suppose in the deleting_snaps map), load
> more deleted snapids off disk to do the comparison.
>
> I haven't sketched out all the code paths, but when I skimmed things
> over I think that snapcontext check (and possible off-disk loading) is
> actually in an easy-to-wait location. This avoids us needing to change
> the client wire protocol or introduce more ordering dependencies.
>
> The biggest downside I can see is that it adds a pretty obvious
> DoS/resource consumption attack for malicious clients, but we're not
> exactly immune to those in general. Thoughts?

I think this would definitely work. I'm still hoping that we can avoid
having the full set of deleted snaps maintained on the OSD at all,
though--that will cut out a lot of complexity (and memory (and
storage)).

FWIW what I'm currently trying is adding new fields to the map for
snapids deleted *this* epoch. That makes it easy for Objecter to adjust
its requests accordingly. On the OSD, it also makes the interval_set
calculations on map updates faster (we already know what the new snapids
are). And I think it will open up the possibility of removing
removed_snaps entirely from pg_pool_t if the PGs track the same set
locally.

In fact, I think in the limit we'll end up with each osdmap epoch just
containing (1) newly deleted snapids this epoch, and (2) newly purged
snapids this epoch. The PGs will maintain the set of deleting snapids
(in pg_info_t or similar, sharing it during peering etc), and we'll keep
that window short-ish to prevent it from getting too big. I'm thinking a
1-5 day window... basically, whatever the window of laggy clients
getting the boot is that you can tolerate. The OSDs will only be pruning
incoming snapcs against that deleting_snaps set (small!), OSDMaps
themselves won't be big (each one will only include newly deleted or
purged ids). If a PG falls behind and jumps a map gap, it will backfill
anyway and get mopped up that way...

But... the cost of eliminating all of that crap is losing the client's
ability to have an in-flight request that is days old reconnect and
proceed. Fair trade?

sage
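The per-epoch delta scheme Sage describes (each map increment carries
only the snapids deleted or purged in that epoch, folded into PG-local
state) could look roughly like this. All type and field names here
(`OSDMapIncremental`, `new_removed_snaps`, `new_purged_snaps`,
`PGSnapState`) are illustrative guesses, not the actual Ceph structures:

```cpp
#include <cassert>
#include <cstdint>
#include <set>

using snapid_t = uint64_t;
using epoch_t = uint32_t;

// Per-epoch delta: only the snapids whose state changed in this epoch,
// so individual map increments stay small regardless of cluster age.
struct OSDMapIncremental {
  epoch_t epoch;
  std::set<snapid_t> new_removed_snaps;  // deleted: start trimming
  std::set<snapid_t> new_purged_snaps;   // trimmed by every PG: retire
};

// PG-local view: snaps deleted but not yet purged everywhere. This is
// the only set the OSD needs to prune incoming snapcs against.
struct PGSnapState {
  std::set<snapid_t> deleting;

  void apply(const OSDMapIncremental& inc) {
    deleting.insert(inc.new_removed_snaps.begin(),
                    inc.new_removed_snaps.end());
    for (snapid_t s : inc.new_purged_snaps)
      deleting.erase(s);
  }
};
```

Because every epoch's delta is small and the accumulated `deleting` set
is bounded by the purge window, neither the map nor the hot-path
comparison grows with the cluster's full deletion history.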
* Re: removed_snaps
  2017-10-11 19:02 ` removed_snaps Sage Weil
@ 2017-10-11 19:37 ` Sage Weil
  2017-10-11 21:25   ` removed_snaps Gregory Farnum
  1 sibling, 0 replies; 12+ messages in thread

From: Sage Weil @ 2017-10-11 19:37 UTC (permalink / raw)
To: Gregory Farnum; +Cc: ceph-devel, Mclean, Patrick

On Wed, 11 Oct 2017, Sage Weil wrote:
> FWIW what I'm currently trying is adding new fields to the map for
> snapids ...

I added a "Plan B" section to the pad

http://pad.ceph.com/p/removing_removed_snaps

with specifics (and a couple questions).

sage
* Re: removed_snaps
  2017-10-11 19:02 ` removed_snaps Sage Weil
  2017-10-11 19:37 ` removed_snaps Sage Weil
@ 2017-10-11 21:25 ` Gregory Farnum
  2017-10-12 14:02   ` removed_snaps Sage Weil
  1 sibling, 1 reply; 12+ messages in thread

From: Gregory Farnum @ 2017-10-11 21:25 UTC (permalink / raw)
To: Sage Weil; +Cc: ceph-devel, Mclean, Patrick

On Wed, Oct 11, 2017 at 12:02 PM, Sage Weil <sweil@redhat.com> wrote:
> On Wed, 11 Oct 2017, Gregory Farnum wrote:
>> On Wed, Oct 11, 2017 at 9:13 AM, Sage Weil <sweil@redhat.com> wrote:
>> > I'm working on removing the removed_snaps field from pg_pool_t (and
>> > thus the OSDMap) as this can get very large for clusters that have
>> > aged and use snapshots.
>> >
>> > https://github.com/ceph/ceph/blob/master/src/osd/osd_types.h#L1320
>> >
>> > The short version of the plan is to only include recently removed
>> > snaps in the map. Once all PGs in a pool have reported that the snap
>> > has been trimmed, we can safely retire the snapid from that set.
>> >
>> > There are a couple of possible problems related to the fact that the
>> > OSD currently sanitizes the SnapContext with every write by removing
>> > snapids that appear in removed_snaps. This is meant to deal with
>> > race conditions where the IO was submitted before the snap was
>> > removed, but the OSD has already logically removed it (or scheduled
>> > it for removal).
>> >
>> > I see two categories of problems:
>> >
>> > 1. The IO was prepared at the librados client before the snap was
>> > removed, the IO is delayed (e.g., client can't connect, or PG is
>> > inactive, etc.) until after the snap is deleted and retired from
>> > removed_snaps, and then the IO is processed. This could trigger a
>> > clone on the OSD that will then never get cleaned up.
>> >
>> > Specifically, a librbd example:
>> >
>> > a. librbd prepares an IOContext that includes snap S
>> > b. librbd initiates a librados IO on that ioc. That request is
>> >    delayed.
>> > c. Snap S is removed (someone tells the mon to delete it)
>> >    - either our librbd client did it, or another one did and we get
>> >      a notify telling us to refresh our header; doesn't matter
>> >      which.
>> > d. S is included in OSDMap removed_snaps.
>> > e. all OSDs prune S
>> > f. S is removed from removed_snaps some time later
>> > g. The request from (b) finally reaches the OSD and triggers a clone
>> >    S (which will never get cleaned up)
>> >
>> > I think we can fix this problem by making Objecter slightly smarter:
>> > it can watch OSDMaps it receives and prune snapc's for in-flight
>> > requests. In the above scenario, sometime between d and g, when it
>> > finally gets a recent OSDMap, it will have pruned S from the
>> > request's snapc. If it didn't get the map, and the request is still
>> > tagged with an old OSDMap, the OSD can kill the OSD session to force
>> > a reconnect/resend. It can do this for any incoming client request
>> > tagged with an OSDMap prior to a low-water-mark epoch in the OSDMap
>> > for which older removed_snaps have been pruned (e.g.
>> > removed_snaps_pruned_epoch, or some better name).
>> >
>> > In the extreme case where a client is disconnected for so long that
>> > they can't advance their map to the current one due to the mon
>> > having trimmed maps, and they have outstanding writes with snapcs,
>> > the client can choose to fail the requests with EIO or ESTALE or
>> > something, or crash itself, or otherwise behave as if it has been
>> > blacklisted/fenced (if it's RBD, it probably has anyway).
>>
>> This all sounds good, except...
>> How on earth do we process these snapc trims in a way that doesn't
>> bring us back down to effectively single-threaded transfer speeds?
>>
>> We might be able to do something with publishing a list of trimmed
>> snaps and having the dispatch threads check their op snapcs against
>> that shared data, but I'm not sure and we'd need to performance test
>> very carefully...
>
> The only Objecter change here is that when we get a new osdmap we would
> apply removed_snaps_this_epoch to the current in-flight requests. We
> already walk all in-flight ops on map changes to recheck the CRUSH
> mapping, and the snapc lists are short, and we can construct the OSDMap
> so that it has the list of newly-deleted snaps for just that epoch in a
> new field, so I don't think this will have any measurable impact on
> performance.
>
> Is there a different effect I'm missing?

I was worried about the map walk. I hadn't thought about the CRUSH
mapping changes so I guess I don't know how that's implemented, but if
it works that's good!

>> > 2. There is a bug, and the librbd image is out of sync: it thinks
>> > that snap S still exists but in reality it has been pruned. If this
>> > happens, then the librbd client may use S and it may trigger a
>> > clone. However, that snap still is referenced from the image, so it
>> > will presumably eventually get deleted and cleaned up.
>> >
>> > Greg suggested the possibility of a similar CephFS bug, where for
>> > example a CephFS client gets confused and continues sending snapcs
>> > with removed snaps. I think we can catch this category of bug (and
>> > probably #2 as well) if we make the OSD log an error/warning to the
>> > cluster log if it gets an incoming request including a deleted
>> > snapid when the request is marked with an epoch after when the snap
>> > was deleted. This would mean adding the deleted_epoch for each
>> > removed snapid to pg_pool_t to do properly; maybe worth it, maybe
>> > not?
>>
>> I don't quite understand; can you expand on this?
>
> The annoyingly confusing thing about this is that snapids are kind of
> like time points, but the set of existing snapids has nothing to do
> with what we've trimmed from removed_snaps... it's not like we can just
> drop low snapids out of the set, since the snapid is the snap creation
> "when" and not the deletion "when."
>
> So, if we want to catch buggy clients that try to use snapcontexts with
> deleted snapids, we need to know which epoch each snapid was deleted in
> so that we can compare that to the request epoch.
>
> I'm not so sure this will work out. For one, there is a sort of race
> condition when librbd is setting up the snapc itself, since a
> concurrent client might delete the snap before it refreshes, so that it
> constructs a snapc with a bad snap after librados sees the osdmap
> deleting it. :/ Maybe it's not worth worrying about this piece.

Oh, so instead of having just "interval_set<snapid> deleted_snaps", we'd
have to add "map<snapid, epoch_t> deleted_snap_epochs" (or equivalent)
and then compare those (plus a floor epoch for oldest-known-deletion) to
any incoming snapcontext?

And for the warning, we'd have some of the snapids that should have been
deleted still around for comparison. So we wouldn't have any way of
reliably identifying them all, but we'd expect to see some of them
turning up if it's an ongoing problem?

>> Let me also suggest one other idea that came up when I was discussing
>> snapshots with Patrick, by first laying out the problems we're aware
>> of:
>> 1) the OSDMap itself might grow too large
>> 2) comparing very large interval sets of snapIDs is computationally
>> expensive for the OSD. We do this
>>   a) when processing a new OSDMap against each PG's local list of
>>      deleted snaps
>>   b) when comparing the deleted snaps against incoming snapcontexts
>>      (although this is usually not an issue because incoming
>>      snapcontexts tend to be pretty small)
>>
>> So we want to limit the size of the OSDMap, and we also want to avoid
>> very large comparisons. Why don't we attack these problems
>> individually?
>>
>> 1) Switch from a deleted_snaps set to a deleting_snaps set in the
>> OSDMap, and trim it based on per-PG feedback to the manager.
>> 2) Maintain a *full* deleted_snaps set in the PG info on the OSDs.
>> 3) Only load into memory the deleted_snaps we reasonably expect to
>> see, *along* with a boundary snapid indicating what range the set is
>> valid for.
>> 4) If we get a snapid smaller than the range is valid for (in a
>> client's SnapContext, or I suppose in the deleting_snaps map), load
>> more deleted snapids off disk to do the comparison.
>>
>> I haven't sketched out all the code paths, but when I skimmed things
>> over I think that snapcontext check (and possible off-disk loading) is
>> actually in an easy-to-wait location. This avoids us needing to change
>> the client wire protocol or introduce more ordering dependencies.
>>
>> The biggest downside I can see is that it adds a pretty obvious
>> DoS/resource consumption attack for malicious clients, but we're not
>> exactly immune to those in general. Thoughts?
>
> I think this would definitely work. I'm still hoping that we can avoid
> having the full set of deleted snaps maintained on the OSD at all,
> though--that will cut out a lot of complexity (and memory (and
> storage)).
>
> FWIW what I'm currently trying is adding new fields to the map for
> snapids deleted *this* epoch. That makes it easy for Objecter to adjust
> its requests accordingly. On the OSD, it also makes the interval_set
> calculations on map updates faster (we already know what the new
> snapids are).
>
> And I think it will open up the possibility of removing removed_snaps
> entirely from pg_pool_t if the PGs track the same set locally. In fact,
> I think in the limit we'll end up with each osdmap epoch just
> containing (1) newly deleted snapids this epoch, and (2) newly purged
> snapids this epoch. The PGs will maintain the set of deleting snapids
> (in pg_info_t or similar, sharing it during peering etc), and we'll
> keep that window short-ish to prevent it from getting too big. I'm
> thinking a 1-5 day window... basically, whatever the window of laggy
> clients getting the boot is that you can tolerate. The OSDs will only
> be pruning incoming snapcs against that deleting_snaps set (small!),
> OSDMaps themselves won't be big (each one will only include newly
> deleted or purged ids). If a PG falls behind and jumps a map gap, it
> will backfill anyway and get mopped up that way...
>
> But... the cost of eliminating all of that crap is losing the client's
> ability to have an in-flight request that is days old reconnect and
> proceed. Fair trade?

Well, thinking about it more, I'm not sure how we could sensibly choose
a good boundary for keeping some of the deleted snaps out of memory
anyway. So I think we have to go this direction. But I want to think
about it a bit more because I'm still concerned about the buggy
resurrection of snapids. Maybe we store the full set on disk, cache what
we see in-memory, and check validity against the disk on any snapid we
don't have cached? :/
-Greg
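The epoch-tagged check being discussed here could be modeled as below.
`DeletedSnapEpochs`, `deleted_in`, and `floor_epoch` are illustrative
names for the hypothetical "map<snapid, epoch_t> deleted_snap_epochs"
plus oldest-known-deletion floor; this is a sketch, not Ceph code:

```cpp
#include <cassert>
#include <cstdint>
#include <map>

using snapid_t = uint64_t;
using epoch_t = uint32_t;

// Hypothetical per-pool record of when each snapid was deleted, plus a
// floor below which deletion history has been trimmed away.
struct DeletedSnapEpochs {
  epoch_t floor_epoch;                     // older deletions are trimmed
  std::map<snapid_t, epoch_t> deleted_in;  // snapid -> deletion epoch

  // A request tagged with req_epoch that still carries snapid s looks
  // buggy (worth a cluster-log warning) only if the snap's deletion was
  // already visible in the map epoch the client claims to have seen.
  bool should_warn(snapid_t s, epoch_t req_epoch) const {
    auto it = deleted_in.find(s);
    if (it == deleted_in.end())
      return false;  // not known-deleted, or deleted before floor_epoch
    return req_epoch > it->second;
  }
};
```

As Greg notes, once entries older than `floor_epoch` are trimmed this
can no longer identify every buggy snapid, but an ongoing problem would
still surface through the entries that remain.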
* Re: removed_snaps 2017-10-11 21:25 ` removed_snaps Gregory Farnum @ 2017-10-12 14:02 ` Sage Weil 2017-10-13 17:22 ` removed_snaps Gregory Farnum 0 siblings, 1 reply; 12+ messages in thread From: Sage Weil @ 2017-10-12 14:02 UTC (permalink / raw) To: Gregory Farnum; +Cc: ceph-devel, Mclean, Patrick On Wed, 11 Oct 2017, Gregory Farnum wrote: > On Wed, Oct 11, 2017 at 12:02 PM, Sage Weil <sweil@redhat.com> wrote: > > On Wed, 11 Oct 2017, Gregory Farnum wrote: > >> On Wed, Oct 11, 2017 at 9:13 AM, Sage Weil <sweil@redhat.com> wrote: > >> > I'm working on removing the removed_snaps field from pg_pool_t (and > >> > thus the OSDMap) as this can get very large for clusters that have aged > >> > and use snapshots. > >> > > >> > https://github.com/ceph/ceph/blob/master/src/osd/osd_types.h#L1320 > >> > > >> > The short version of the plan is only include recently removed snaps in > >> > the map. Once all PGs in a pool have reported that the snap has been > >> > trimmed, we can safely retire the snapid from that set. > >> > > >> > There are a couple of possible problems related to the fact that the > >> > OSD currently santizes the SnapContext with every write by removing > >> > snapids that appear in removed_snaps. This is meant to deal with race > >> > conditions where the IO was submitted before the snap was removed, but the > >> > OSD has already logically removed it (or scheduled it for removal). > >> > > >> > I see two categories of problems: > >> > > >> > 1. The IO was prepared at the librados client before the snap was removed, > >> > the IO is delayed (e.g., client can't connect, or PG is inactive, etc.) > >> > until after the snap is deleted and retired from removed_snaps, and then > >> > the IO is processed. This could trigger a clone on the OSD that will > >> > then never get cleaned up. > >> > > >> > Specifically, a librbd example: > >> > > >> > a. librbd prepares a IOContext that includes snap S > >> > b. librbd initiates a librados IO with on that ioc. 
That request is > >> > delayed. > >> > c. Snap S is removed (someone tells the mon to delete it) > >> > - either our librbd client did it, or another one did and we get a > >> > notify telling us to refresh our header; doesn't matter which. > >> > d. S is included in OSDMap removed_snaps. > >> > e. all OSDs prune S > >> > f. S is removed from removed_snaps some time later > >> > g. The request from (b) finally reaches the OSD and triggers a clone S > >> > (which will never get cleaned up) > >> > > >> > I think we can fix this problem by making Objecter slightly smarter: it > >> > can watch OSDMaps it receives and prune snapc's for in-flight requests. > >> > In the above scenario, sometime between d and g, when it finally > >> > gets a recent OSDMap, it will have pruned S from the request's snapc. If > >> > it didn't get the map, and the request is still tagged with an old > >> > OSDMap, the OSD can kill the OSD session to force a reconnect/resend. It > >> > can do this for any incoming client request tagged with an OSDMap prior to > >> > a low-water-mark epoch in the OSDMap for which older removed_snaps > >> > have been pruned (e.g. removed_snaps_pruned_epoch, or some better name). > >> > > >> > In the extreme case where a client is disconnected for so long that they > >> > can't advance their map to the current one due to the mon having trimmed > >> > maps, and they have outstanding writes with snapcs, the client > >> > can choose to fail the requests with EIO or ESTALE or something, or crash > >> > itself, or otherwise behave as if it has been blacklisted/fenced (if it's > >> > RBD, it probably has anyway). > >> > >> This all sounds good, except... > >> How on earth do we process these snapc trims in a way that doesn't > >> bring us back down to effectively single-threaded transfer speeds? 
> >> > >> We might be able to do something with publishing a list of trimmed > >> snaps and having the dispatch threads check their op snapcs against > >> that shared data, but I'm not sure and we'd need to performance test > >> very carefully... > > > > The only Objecter change here is that when we get a new osdmap we would > > apply removed_snaps_this_epoch to the current in-flight requests. We > > already walk all in-flight ops on map changes to recheck the CRUSH > > mapping, and the snapc lists are short, and we can construct the OSDMap so > > that it has the list of newly-deleted snaps for just that epoch in a new > > field, so I don't think this will have any measurable impact on > > performance. > > > > Is there a different effect I'm missing? > > I was worried about the map walk. I hadn't thought about the CRUSH > mapping changes so I guess I don't know how that's implemented, but if > it works that's good! > > > > >> > 2. There is a bug, and the librbd image is out of sync: it thinks that > >> > snap S still exists but in reality it has been pruned. If this happens, > >> > then the librbd client may use S and it may trigger a clone. However, > >> > that snap still is referenced from the image, so it will presumably > >> > eventually get deleted and cleaned up. > >> > > >> > Greg suggested the possibility of a similar CephFS bug, where for example > >> > a CephFS client gets confused and continues sending snapcs with removed > >> > snaps. I think we can catch this category of bug (and probably #2 as > >> > well) if we make the OSD log an error/warning to the cluster log if it > >> > gets an incoming request including a deleted snapid when the request is > >> > marked with an epoch after when the snap was deleted. This > >> > would mean adding the deleted_epoch for each removed snapid to pg_pool_t > >> > to do properly; maybe worth it, maybe not? > >> > >> I don't quite understand; can you expand on this? 
> > > > The annoyingly confusing thing about this is that snapids are kind of like > > time points, but the set of existing snapids has nothing to do with what > > we've trimmed from removed_snaps... it's not like we can just drop low > > snapids out of the set, since the snapid is the snap creation "when" and > > not the deletion "when." > > > > So, if we want to catch buggy clients that try to use snapcontexts with > > deleted snapids, we need to know which epoch each snapid was deleted in so > > that we can compare that to the request epoch. > > > > I'm not so sure this will work out. For one, there is a sort of race > > condition when librbd is setting up the snapc itself, since a > > concurrent client might delete the snap before it refreshes, so that it > > constructs a snapc with a bad snap after librados sees the osdmap deleting > > it. :/ Maybe it's not worth worrying about this piece. > > Oh, so instead of having just "interval_set<snapid> deleted_snaps", > we'd have to add "map<snapid, epoch_t> deleted_snap_epochs" (or > equivalent) and then compare those (plus a floor epoch for > oldest-known-deletion) to any incoming snapcontext? > > And for the warning, we'd have some of the snapids that should have > been deleted still around for comparison. So we wouldn't have any way > of reliably identifying them all, but we'd expect to see some of them > turning up if it's an ongoing problem? Yeah exactly. I think this check isn't so bad to implement, but not strictly necessary. > >> Let me also suggest one other idea that came up when I was discussing > >> snapshots with Patrick, by first laying out the problems we're aware > >> of: > >> 1) the OSDMap itself might grow too large > >> 2) comparing very large interval sets of snapIDs is computationally > >> expensive for the OSD. 
We do this > >> a) when processing a new OSDMap against each PG's local list of deleted snaps > >> b) when comparing the deleted snaps against incoming snapcontexts > >> (although this is usually not an issue because incoming snapcontexts > >> tend to be pretty small) > >> > >> So we want to limit the size of the OSDMap, and we also want to avoid > >> very large comparisons. Why don't we attack these problems > >> individually? > >> > >> 1) Switch from a deleted_snaps set to a deleting_snaps set in the > >> OSDMap, and trim it based on per-PG feedback to the manager. > >> 2) Maintain a *full* deleted_snaps set in the PG info on the OSDs. > >> 3) Only load into memory the deleted_snaps we reasonably expect to > >> see, *along* with a boundary snapid indicating what range the set is > >> valid for. > >> 4) If we get a snapid smaller than the range is valid for (in a > >> client's SnapContext, or I suppose in the deleting_snaps map), load > >> more deleted snapids off disk to do the comparison. > >> > >> I haven't sketched out all the code paths but when I skimmed things > >> over I think that snapcontext check (and possible off-disk loading) is > >> actually in an easy-to-wait location. This avoids us needing to > >> change the client wire protocol or introduce more ordering > >> dependencies. > >> > >> The biggest downside I can see is that it adds a pretty obvious > >> DoS/resource consumption attack for malicious clients, but we're not > >> exactly immune to those in general. Thoughts? > > > > I think this would definitely work. I'm still hoping that we can avoid > > having the full set of deleted snaps maintained on the OSD at all, > > though--that will cut out a lot of complexity (and memory (and storage)). > > > > > > FWIW what I'm currently trying is adding new fields to the map for snapids > > deleted *this* epoch. That makes it easy for Objecter to adjust its > > requests accordingly. 
On the OSD, it also makes the interval_set > > calculations on map updates faster (we already know what the new snapids > > are). > > > > And I think it will open up the possibility of removing removed_snaps > > entirely from pg_pool_t if the PGs track the same set locally. In fact, I > > think in the limit we'll end up with each osdmap epoch just containing (1) > > newly deleted snapids this epoch, and (2) newly purged snapids this epoch. > > The PGs will maintain the set of deleting snapids (in pg_info_t or > > similar, sharing it during peering etc), and we'll keep that window > > short-ish to prevent it from getting too big. I'm thinking a 1-5 day > > window... basically, whatever the window of laggy clients getting the boot > > is that you can tolerate. The OSDs will only be pruning incoming snapcs > > against that deleting_snaps set (small!), OSDMaps themselves won't be big > > (each one will only include newly deleted or purged ids). If a PG falls > > behind and jumps a map gap, it will backfill anyway and get mopped up that > > way... > > > > But... the cost of eliminating all of that crap is losing the client's > > ability to have an in-flight request that is days old reconnect and > > proceed. Fair trade? > > Well, thinking about it more I'm not sure how we could sensibly store > a good boundary for storing some of the deleted snaps out of memory > anyway. So I think we have to go this direction. But I want to think > about it a bit more because I'm still concerned about the buggy > resurrection of snapids. > > Maybe we store the full set on disk, cache what we see in-memory, and > check validity against the disk on any snapid we don't have cached? :/ Just had another thought last night: the mon can preserve the full history of deletions, by epoch. When the objecter encounters a map gap it can request the removed_snaps over the gap period from the mon at the same time it's getting the next map (i.e., the oldest full map stored by the mon). 
Since this is a pretty rare/exceptional thing, I don't worry much about the extra work for the mon, and it avoids the ugly client-must-crash behavior... a laggy client will always be able to catch up. I updated the plan B notes in the pad to include this bit! sage ^ permalink raw reply [flat|nested] 12+ messages in thread
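[Editorial note: the map-gap catch-up described above can be sketched as follows — a client whose next available map is beyond the mon's oldest full map asks the mon for all snap deletions over the skipped epoch range and prunes its in-flight snapcs with the result. The function and argument names are invented for illustration; this is not the actual Objecter/mon protocol.]

```python
def catch_up_over_gap(mon_history, client_epoch, target_epoch, inflight_snapcs):
    """mon_history: dict mapping epoch -> set of snapids removed in that epoch.

    Returns the in-flight snap lists with every snapid deleted anywhere in
    (client_epoch, target_epoch] pruned out, as the Objecter would do before
    resending the ops.
    """
    removed = set()
    for e in range(client_epoch + 1, target_epoch + 1):
        removed |= mon_history.get(e, set())
    return [[s for s in snapc if s not in removed] for snapc in inflight_snapcs]
```

Since gaps are rare, asking the mon to union a range of per-epoch deltas on demand is cheap, and the laggy client never has to fail or crash.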
* Re: removed_snaps 2017-10-12 14:02 ` removed_snaps Sage Weil @ 2017-10-13 17:22 ` Gregory Farnum 2017-10-13 21:32 ` removed_snaps Sage Weil 0 siblings, 1 reply; 12+ messages in thread From: Gregory Farnum @ 2017-10-13 17:22 UTC (permalink / raw) To: Sage Weil; +Cc: ceph-devel, Mclean, Patrick On Thu, Oct 12, 2017 at 7:02 AM, Sage Weil <sweil@redhat.com> wrote: > Just had another thought last night: the mon can preserve the full history > of deletions, by epoch. When the objecter encounters a map gap it can > request the removed_snaps over the gap period from the mon at the same > time it's getting the next map (i.e., the oldest full map stored by > the mon). Since this is a pretty rare/exceptional thing, I don't > worry much about the extra work for the mon, and it avoids the ugly > client-must-crash behavior... a laggy client will always be able to catch > up. That seems useful and will probably work in a practical sense, but I'm still a bit concerned. There's an in-built assumption here that the OSD map epoch of a client is a useful proxy for "has the correct set of snapids". And...well, it's a *reasonable* proxy, especially if the Objecter starts trimming snapids. But CephFS certainly doesn't have any explicit relationship (or even much of an implicit one) between the OSDMap epochs and the set of live snapshots. I don't think RBD does either, although since it passes around snapids via header objects and watch-notify it might come closer? I'm tossing around in my head if there's some good way to attach a "valid as of this epoch" tag to snapcontexts generated by external systems. All snapshot users *do* already maintain a snapid_t for versioning that they use; maybe we can tie into or extend it somehow? (A trivial but presumably too-slow implementation for CephFS could do something like, on every load of a SnapRealm in the MDS, validate the snap ids against the monitor's full list and attach the current osd epoch to it.) 
Moving on to the stuff actually written down: How comfortable are we with the size of the currently-deleting snapshots maps, for computation purposes? I don't have a good way of quantifying that cost but I'm definitely tempted to split the sets into newly_deleted_snaps (for *this epoch*) deleting_snaps (which are kept around until removed_snaps_lb_epoch) newly_purged_snaps (also for this epoch, which I think is how you have it written?) There are also two questions down at the bottom. For (1) I think it's good to keep the deleted snaps set for all time (always good for debugging!), but we need to be careful: if there is a divergence between RADOS' metadata and those of RBD or CephFS, we need a mechanism for re-deleting snaps even if they were already zapped. For (2), yes the removed_snaps_lb_epoch should be per-pool, not global. We don't have any other global snap data, why would we introduce a linkage? :) -Greg ^ permalink raw reply [flat|nested] 12+ messages in thread
* Re: removed_snaps 2017-10-13 17:22 ` removed_snaps Gregory Farnum @ 2017-10-13 21:32 ` Sage Weil 2017-10-13 22:25 ` removed_snaps Gregory Farnum 0 siblings, 1 reply; 12+ messages in thread From: Sage Weil @ 2017-10-13 21:32 UTC (permalink / raw) To: Gregory Farnum; +Cc: ceph-devel, Mclean, Patrick On Fri, 13 Oct 2017, Gregory Farnum wrote: > On Thu, Oct 12, 2017 at 7:02 AM, Sage Weil <sweil@redhat.com> wrote: > > Just had another thought last night: the mon can preserve the full history > > of deletions, by epoch. When the objecter encounters a map gap it can > > request the removed_snaps over teh gap period from the mon at the same > > time it's getting the next map (i.e., the oldest full map stored by > > the mon). Since this is a pretty rare/excpetional thing, I don't > > worry much about the extra work for the mon, and it avoids the ugly > > client-must-crash behavior... a laggy client will always be able to catch > > up.e > > That seems useful and will probably work in a practical sense, but I'm > still a bit concerned. There's an in-built assumption here that the > OSD map epoch of a client is a useful proxy for "has the correct set > of snapids". And...well, it's a *reasonable* proxy, especially if the > Objecter starts trimming snapids. But CephFS certainly doesn't have > any explicit relationship (or even much of an implicit one) between > the OSDMap epochs and the set of live snapshots. I don't think RBD > does either, although since it passes around snapids via header > objects and watch-notify it might come closer? > > I'm tossing around in my head if there's some good way to attach a > "valid as of this epoch" tag to snapcontexts generated by external > systems. All snapshot users *do* already maintain a snapid_t for > versioning that they use; maybe we can tie into or extend it somehow? 
> (A trivial but presumably too-slow implementation for CephFS could do > something like, on every load of a SnapRealm in the MDS, validate the > snap ids against the monitor's full list and attach the current osd > epoch to it.) Yeah, the "snapc valid as of this epoch" is exactly it. Unfortunately I'm not sure attaching anything to the snapc's snap_seq helps, since that's the time point when the snapc was last modified, but we actually want a lower bound on when we know the snapc to be accurate (i.e., reflect any completed snap creates/deletes). For librbd, maybe this is the epoch we load the head object.. *if* we also modify it to do a 2PC on snap removal. Although probably this problem isn't so bad since we have exclusive locking and the exclusive lock holder is the one doing the snap deletions? For CephFS... yeah. IIRC the client only hears about SnapRealm changes when they happen, but we actually want to roll the "valid as of epoch" value forward when no snaps are created or deleted. Maybe it can be baked into the client/mds heartbeat? Anyway, in order to actually use the value once we have it, the librados snapc methods would need to get an additional valid_as_of_epoch argument and refresh it periodically. ... In any case, in order to use the value properly we need to apply any removed_snaps between the valid_as_of_epoch and the current epoch at the time the op is submitted. Which means the OSDMap needs to have a window of recent removed_snaps that's big enough to do that, and/or the librados user has to have a defined way of refreshing itself if its snapc is too stale. The bad news is that my branch removes the removed_snaps entirely from the OSDMap--we only have deltas. I can add it back without much trouble. The better news is that this means we don't have to have a wide window between advertising removed vs purged snaps anymore--we can declare them purged as soon as they are purged. 
Which has the added benefit of being able to explicitly track/monitor the 'removing' set of snaps on the mon, which seems like something that will be of interest to ops folks. So.. yeah, this is an open issue. I think what I have now will mostly work, but it may leak clones occasionally if clients are super stale. Maybe that's not a big deal... > Moving on to the stuff actually written down: > How comfortable are we with the size of the currently-deleting > snapshots maps, for computation purposes? I don't have a good way of > quantifying that cost but I'm definitely tempted to split the sets > into > newly_deleted_snaps (for *this epoch*) > deleting_snaps (which are kept around until removed_snaps_lb_epoch) > newly_purged_snaps (also for this epoch, which I think is how you have > it written?) The code I have now has 2 sets: removed_snaps and purged_snaps. purged_snaps was already there before; removed_snaps is the superset that includes stuff not yet purged. (I made it a superset thinking that the filter_snapc() thing on every op will be faster if it only has to filter against a single set instead of 2 of them, and that matters more than the memory.) > There are also two questions down at the bottom. For (1) I think it's > good to keep the deleted snaps set for all time (always good for > debugging!), but we need to be careful: if there is a divergence > between RADOS' metadata and those of RBD or CephFS, we need a > mechanism for re-deleting snaps even if they were already zapped. I'm preserving for all time, but yeah, no mechanism to force a re-removal. Hrm. > For (2), yes the removed_snaps_lb_epoch should be per-pool, not > global. We don't have any other global snap data, why would we > introduce a linkage? :) Yep! Thanks- sage ^ permalink raw reply [flat|nested] 12+ messages in thread
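[Editorial note: the two-set design described above — removed_snaps as a superset of purged_snaps, updated by per-epoch deltas — can be modeled like this. `PoolSnapState` and its methods are illustrative stand-ins for the pg_pool_t fields under discussion, not real Ceph code.]

```python
class PoolSnapState:
    def __init__(self):
        self.removed = set()   # deleted snapids, possibly not yet trimmed everywhere
        self.purged = set()    # trimmed on all PGs; maintained as a subset of removed

    def apply_incremental(self, new_removed=(), new_purged=()):
        # each osdmap epoch only carries the deltas, so these sets are small
        self.removed |= set(new_removed)
        self.purged |= set(new_purged)
        assert self.purged <= self.removed, "purged must remain a subset of removed"

    def filter_snapc(self, snaps):
        # per-op filtering only needs the single superset, not both sets
        return [s for s in snaps if s not in self.removed]
```

Keeping removed ⊇ purged as an invariant is what lets filter_snapc() consult one set per op, trading a little memory for a faster hot path.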
* Re: removed_snaps 2017-10-13 21:32 ` removed_snaps Sage Weil @ 2017-10-13 22:25 ` Gregory Farnum 2017-10-14 15:56 ` removed_snaps Sage Weil 0 siblings, 1 reply; 12+ messages in thread From: Gregory Farnum @ 2017-10-13 22:25 UTC (permalink / raw) To: Sage Weil; +Cc: ceph-devel, Mclean, Patrick On Fri, Oct 13, 2017 at 2:32 PM, Sage Weil <sweil@redhat.com> wrote: > On Fri, 13 Oct 2017, Gregory Farnum wrote: >> Moving on to the stuff actually written down: >> How comfortable are we with the size of the currently-deleting >> snapshots maps, for computation purposes? I don't have a good way of >> quantifying that cost but I'm definitely tempted to split the sets >> into >> newly_deleted_snaps (for *this epoch*) >> deleting_snaps (which are kept around until removed_snaps_lb_epoch) >> newly_purged_snaps (also for this epoch, which I think is how you have >> it written?) > > The code I have now has 2 sets: removed_snaps and purged_snaps. > purged_snapd was already there before; removed_snaps is the superset that > includes stuff not yet purged. (I made it a superset thinking that > the filter_snapc() thing on every op will be faster if it only has to > filter against a single set instead of 2 of them, and that matters more > than the memory.) No, what I mean is: * we know that computing the overlap of interval sets can be expensive * so we want to only compute overlaps of small sets * this is especially true if we're doing it on all outstanding operations in librados, within a single thread So we definitely don't want to repeat comparisons if we don't have to. And I suspect there will be occasional periods of intense snapshot deletion from admins where they remove a number large enough to cause trouble, and we don't want that to result in noticeably slower IO submission on the client-side! 
So we should have a separate set of snaps which were just added to the removed_snaps during this epoch, that librados and the OSD map processing can use in their normal-course-of-business scanning. -Greg ^ permalink raw reply [flat|nested] 12+ messages in thread
* Re: removed_snaps 2017-10-13 22:25 ` removed_snaps Gregory Farnum @ 2017-10-14 15:56 ` Sage Weil 2017-10-16 15:27 ` removed_snaps Gregory Farnum 0 siblings, 1 reply; 12+ messages in thread From: Sage Weil @ 2017-10-14 15:56 UTC (permalink / raw) To: Gregory Farnum; +Cc: ceph-devel, Mclean, Patrick On Fri, 13 Oct 2017, Gregory Farnum wrote: > On Fri, Oct 13, 2017 at 2:32 PM, Sage Weil <sweil@redhat.com> wrote: > > On Fri, 13 Oct 2017, Gregory Farnum wrote: > >> Moving on to the stuff actually written down: > >> How comfortable are we with the size of the currently-deleting > >> snapshots maps, for computation purposes? I don't have a good way of > >> quantifying that cost but I'm definitely tempted to split the sets > >> into > >> newly_deleted_snaps (for *this epoch*) > >> deleting_snaps (which are kept around until removed_snaps_lb_epoch) > >> newly_purged_snaps (also for this epoch, which I think is how you have > >> it written?) > > > > The code I have now has 2 sets: removed_snaps and purged_snaps. > > purged_snapd was already there before; removed_snaps is the superset that > > includes stuff not yet purged. (I made it a superset thinking that > > the filter_snapc() thing on every op will be faster if it only has to > > filter against a single set instead of 2 of them, and that matters more > > than the memory.) > > No, what I mean is: > * we know that computing the overlap of interval sets can be expensive > * so we want to only compute overlaps of small sets > * this is especially true if we're doing it on all outstanding > operations in librados, within a single thread > > So we definitely don't want to repeat comparisons if we don't have to. > And I suspect there will be occasional periods of intense snapshot > deletion from admins where they remove a number large enough to cause > trouble, and we don't want that to result in noticeably slower IO > submission on the client-side! 
> > So we should have a separate set of snaps which were just added to the > removed_snaps during this epoch, that librados and the OSD map > processing can use in their normal-course-of-business scanning. I definitely get that we want to avoid any set intersections (and I think the current code actually has only 1 of them, during peering activate). But I'm not quite following which lookup you're thinking of. FTR filter_snapc iterates over the vector<snapid_t> in SnapContext and does an interval_set lookup on each one to make sure it isn't deleted. I think there are basically two cases: 1- The request comes in with the same epoch as our epoch, and the mimic feature, so we could skip the filter_snapc entirely (client will have done it). Yay! Hopefully this will be true for most requests in steady state. 2- The request comes in with an older osdmap epoch, so we have to filter_snapc against the recent_removed_snaps interval_set<>. This is what we do today, except instead of recent_removed_snaps we have removed_snaps_over_all_time, so we are going from a huge set to a relatively small one. We could do something like 3- Maintain a small interval_set with a smaller range of epochs (e.g., only the last 20 epochs instead of the last 500) and filter against that if the request's epoch is recent enough. or, 4- Remember the new_removed_snaps for the last N epochs and filter against them individually. This turns into m log(n) lookups instead of one log(n*500) lookup (where m << 500)... probably a small win? sage ^ permalink raw reply [flat|nested] 12+ messages in thread
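[Editorial note: the dispatch-side decision in cases 1 and 2 above amounts to a short branch — skip filtering for an up-to-date client that already sanitized its snapc, otherwise filter against the recent removed-snap set. A minimal sketch, with all names invented for illustration:]

```python
def maybe_filter_snapc(req_epoch, osd_epoch, client_prunes, recent_removed, snaps):
    """Decide whether the OSD must sanitize an incoming op's snap list.

    client_prunes models the mimic feature bit: the client promises it has
    already pruned its snapc against the map it is tagged with.
    """
    if req_epoch == osd_epoch and client_prunes:
        # case 1: request epoch matches ours; the client already filtered
        return snaps
    # case 2: older epoch (or legacy client) -> filter against recent removals
    return [s for s in snaps if s not in recent_removed]
```

In steady state most requests hit case 1 and the OSD does no per-op set lookups at all, which is the payoff of moving the pruning into the Objecter.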
* Re: removed_snaps 2017-10-14 15:56 ` removed_snaps Sage Weil @ 2017-10-16 15:27 ` Gregory Farnum 2017-10-16 18:38 ` removed_snaps Sage Weil 0 siblings, 1 reply; 12+ messages in thread From: Gregory Farnum @ 2017-10-16 15:27 UTC (permalink / raw) To: Sage Weil; +Cc: ceph-devel, Mclean, Patrick On Sat, Oct 14, 2017 at 8:56 AM, Sage Weil <sweil@redhat.com> wrote: > On Fri, 13 Oct 2017, Gregory Farnum wrote: >> On Fri, Oct 13, 2017 at 2:32 PM, Sage Weil <sweil@redhat.com> wrote: >> > On Fri, 13 Oct 2017, Gregory Farnum wrote: >> >> Moving on to the stuff actually written down: >> >> How comfortable are we with the size of the currently-deleting >> >> snapshots maps, for computation purposes? I don't have a good way of >> >> quantifying that cost but I'm definitely tempted to split the sets >> >> into >> >> newly_deleted_snaps (for *this epoch*) >> >> deleting_snaps (which are kept around until removed_snaps_lb_epoch) >> >> newly_purged_snaps (also for this epoch, which I think is how you have >> >> it written?) >> > >> > The code I have now has 2 sets: removed_snaps and purged_snaps. >> > purged_snapd was already there before; removed_snaps is the superset that >> > includes stuff not yet purged. (I made it a superset thinking that >> > the filter_snapc() thing on every op will be faster if it only has to >> > filter against a single set instead of 2 of them, and that matters more >> > than the memory.) >> >> No, what I mean is: >> * we know that computing the overlap of interval sets can be expensive >> * so we want to only compute overlaps of small sets >> * this is especially true if we're doing it on all outstanding >> operations in librados, within a single thread >> >> So we definitely don't want to repeat comparisons if we don't have to. 
>> And I suspect there will be occasional periods of intense snapshot >> deletion from admins where they remove a number large enough to cause >> trouble, and we don't want that to result in noticeably slower IO >> submission on the client-side! >> >> So we should have a separate set of snaps which were just added to the >> removed_snaps during this epoch, that librados and the OSD map >> processing can use in their normal-course-of-business scanning. > > I definitely get that we want to avoid any set intersections (and I think > the current code actually has only 1 of them, during peering activate). > But I'm not quite following which lookup you're thinking of. FTR > filter_snapc iterates over the vector<snapid_t> in SnapContext and does an > interval_set lookup on each one to make sure it isn't deleted. > > I think there are basically two cases: > > 1- The request comes in with the same epoch as our epoch, and the mimic > feature, so we could skip the filter_snapc entirely (client will have done > it). Yay! Hopefully this will be true for most requests in steady state. > > 2- The request comes in with an older osdmap epoch, so we have to > filter_snapc against the recent_removed_snaps interval_set<>. This is > what we do today, except instead of recent_removed_snaps we have > removed_snaps_over_all_time, so we are going from a huge set to a > relatively small one. You're leaving out 3) Every client runs a set intersection on its outstanding ops on every map update (which we did not run before). This is the one I'm thinking about when I say we should have a this-epoch-only set of additions. -Greg > > We could do something like > > 3- Maintain a small interval_set with a smaller range of epochs (e.g., > only the last 20 epochs instead of the last 500) and filter against that > if the request's epoch is recent enough. > > or, > > 4- Remember the new_removed_snaps for the last N epochs and filter against > them individually. 
This turns into m log(n) lookups instead of one > log(n*500) lookup (where m << 500)... probably a small win? > > sage ^ permalink raw reply [flat|nested] 12+ messages in thread
* Re: removed_snaps 2017-10-16 15:27 ` removed_snaps Gregory Farnum @ 2017-10-16 18:38 ` Sage Weil 0 siblings, 0 replies; 12+ messages in thread From: Sage Weil @ 2017-10-16 18:38 UTC (permalink / raw) To: Gregory Farnum; +Cc: ceph-devel, Mclean, Patrick On Mon, 16 Oct 2017, Gregory Farnum wrote: > On Sat, Oct 14, 2017 at 8:56 AM, Sage Weil <sweil@redhat.com> wrote: > > On Fri, 13 Oct 2017, Gregory Farnum wrote: > >> On Fri, Oct 13, 2017 at 2:32 PM, Sage Weil <sweil@redhat.com> wrote: > >> > On Fri, 13 Oct 2017, Gregory Farnum wrote: > >> >> Moving on to the stuff actually written down: > >> >> How comfortable are we with the size of the currently-deleting > >> >> snapshots maps, for computation purposes? I don't have a good way of > >> >> quantifying that cost but I'm definitely tempted to split the sets > >> >> into > >> >> newly_deleted_snaps (for *this epoch*) > >> >> deleting_snaps (which are kept around until removed_snaps_lb_epoch) > >> >> newly_purged_snaps (also for this epoch, which I think is how you have > >> >> it written?) > >> > > >> > The code I have now has 2 sets: removed_snaps and purged_snaps. > >> > purged_snapd was already there before; removed_snaps is the superset that > >> > includes stuff not yet purged. (I made it a superset thinking that > >> > the filter_snapc() thing on every op will be faster if it only has to > >> > filter against a single set instead of 2 of them, and that matters more > >> > than the memory.) > >> > >> No, what I mean is: > >> * we know that computing the overlap of interval sets can be expensive > >> * so we want to only compute overlaps of small sets > >> * this is especially true if we're doing it on all outstanding > >> operations in librados, within a single thread > >> > >> So we definitely don't want to repeat comparisons if we don't have to. 
> >> And I suspect there will be occasional periods of intense snapshot > >> deletion from admins where they remove a number large enough to cause > >> trouble, and we don't want that to result in noticeably slower IO > >> submission on the client-side! > >> > >> So we should have a separate set of snaps which were just added to the > >> removed_snaps during this epoch, that librados and the OSD map > >> processing can use in their normal-course-of-business scanning. > > > > I definitely get that we want to avoid any set intersections (and I think > > the current code actually has only 1 of them, during peering activate). > > But I'm not quite following which lookup you're thinking of. FTR > > filter_snapc iterates over the vector<snapid_t> in SnapContext and does an > > interval_set lookup on each one to make sure it isn't deleted. > > > > I think there are basically two cases: > > > > 1- The request comes in with the same epoch as our epoch, and the mimic > > feature, so we could skip the filter_snapc entirely (client will have done > > it). Yay! Hopefully this will be true for most requests in steady state. > > > > 2- The request comes in with an older osdmap epoch, so we have to > > filter_snapc against the recent_removed_snaps interval_set<>. This is > > what we do today, except instead of recent_removed_snaps we have > > removed_snaps_over_all_time, so we are going from a huge set to a > > relatively small one. > > You're leaving out > > 3) Every client runs a set intersection on its outstanding ops on > every map update (which we did not run before). > > This is the one I'm thinking about when I say we should have a > this-epoch-only set of additions. Oh, I see what you mean. Each OSDMap::Incremental has the new_removed_snaps, so it will be a small list to filter against (esp compared to the CRUSH calculation). I'm also going to switch it to a flat_set<>, which gives log(n) lookup w/ 1 allocation. sage ^ permalink raw reply [flat|nested] 12+ messages in thread
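[Editorial note: the client-side step in point (3) — on each OSDMap::Incremental, prune in-flight ops only against that epoch's small new_removed_snaps, held in a flat sorted container so membership is a single binary search, as with a flat_set<> — can be sketched roughly like this. The names are illustrative, not the actual Objecter code.]

```python
import bisect

def on_incremental(new_removed_snaps, inflight_ops):
    """Apply one epoch's newly-removed snapids to all outstanding ops."""
    flat = sorted(new_removed_snaps)  # flat_set-like: one contiguous allocation

    def deleted(s):
        # log(n) membership test over the small per-epoch delta
        i = bisect.bisect_left(flat, s)
        return i < len(flat) and flat[i] == s

    for op in inflight_ops:
        op["snaps"] = [s for s in op["snaps"] if not deleted(s)]
    return inflight_ops
```

Because the per-epoch delta is tiny compared to the all-time removed set, this scan stays negligible next to the CRUSH remapping work the client already does on every map change.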
end of thread, other threads:[~2017-10-16 18:38 UTC | newest] Thread overview: 12+ messages (download: mbox.gz / follow: Atom feed) -- links below jump to the message on this page -- 2017-10-11 16:13 removed_snaps Sage Weil 2017-10-11 18:38 ` removed_snaps Gregory Farnum 2017-10-11 19:02 ` removed_snaps Sage Weil 2017-10-11 19:37 ` removed_snaps Sage Weil 2017-10-11 21:25 ` removed_snaps Gregory Farnum 2017-10-12 14:02 ` removed_snaps Sage Weil 2017-10-13 17:22 ` removed_snaps Gregory Farnum 2017-10-13 21:32 ` removed_snaps Sage Weil 2017-10-13 22:25 ` removed_snaps Gregory Farnum 2017-10-14 15:56 ` removed_snaps Sage Weil 2017-10-16 15:27 ` removed_snaps Gregory Farnum 2017-10-16 18:38 ` removed_snaps Sage Weil