* A design for CephFS forward scrub with multiple MDS
@ 2016-09-20 17:16 Douglas Fuller
  2016-09-21  6:29 ` Gregory Farnum
  2016-09-21 12:56 ` John Spray
  0 siblings, 2 replies; 8+ messages in thread
From: Douglas Fuller @ 2016-09-20 17:16 UTC (permalink / raw)
  To: Ceph Development; +Cc: John Spray, Gregory Farnum

This message assembles some discussions we’ve had recently about performing CephFS forward scrub with multiple, active MDSs. I have been doing some implementation work in this area, and it has become a large enough departure from current practice that it’s probably time to revisit the design altogether. The intent is to summarize the discussions I’ve had so far and to serve as a straw man for any changes that may be needed. It contains a couple of questions as well.

Currently, CephFS forward scrub proceeds straightforwardly by enqueuing inodes onto a stack as they are found, completing each parent directory once all of its children have been scrubbed. In a multi-MDS system, this will need to be extended to handle subtrees present on other MDSs.

The proposed design is as follows:

We scrub a local subtree as we would in the single-MDS case: follow the directory hierarchy downward, pushing found items onto a stack and completing directories once all their children are complete. When a subtree boundary is encountered, send a message to the authoritative MDS for that subtree requesting that it be scrubbed. When subtree scrubbing is complete, send a message to the requesting MDS with the completion information and relevant rstats for the parent directory inode (NB: do we have to block the scrubbing of all ancestors, then?).
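To make the message flow concrete, here is a rough, self-contained sketch of what I have in mind. Everything below -- the type aliases, the two message structs and their fields, the coordinator class -- is an illustrative stand-in, not the real MDS classes or wire messages:

#include <algorithm>
#include <cstdint>
#include <iostream>
#include <map>
#include <vector>

using mds_rank_t = int32_t;
using inodeno_t  = uint64_t;

// recursive stats carried back by the remote scrub completion
struct rstat_t {
  uint64_t rfiles = 0, rsubdirs = 0, rbytes = 0;
};

// "please scrub this subtree" and its reply (names are made up)
struct ScrubSubtreeRequest  { inodeno_t subtree_root; mds_rank_t requester; };
struct ScrubSubtreeComplete { inodeno_t subtree_root; bool ok; rstat_t rstats; };

class ScrubCoordinator {
  // parent inode -> subtree roots it is still waiting on
  std::map<inodeno_t, std::vector<inodeno_t>> waiting;

  void send_to(mds_rank_t /*auth*/, const ScrubSubtreeRequest & /*req*/) {
    // stand-in for the messenger: deliver the request to the auth MDS
  }

public:
  // we hit a subtree boundary under 'parent' whose root is owned by 'auth'
  void on_subtree_boundary(inodeno_t parent, inodeno_t subtree_root,
                           mds_rank_t auth, mds_rank_t whoami) {
    waiting[parent].push_back(subtree_root);
    send_to(auth, ScrubSubtreeRequest{subtree_root, whoami});
  }

  // a completion (with rstats, and an ok/damage flag) arrived for a
  // subtree hanging off 'parent'
  void handle_complete(inodeno_t parent, const ScrubSubtreeComplete &m) {
    auto &roots = waiting[parent];
    roots.erase(std::remove(roots.begin(), roots.end(), m.subtree_root),
                roots.end());
    if (roots.empty()) {
      // every remote subtree reported back; only now can the parent's
      // rstats be validated and the parent marked scrub-complete
      std::cout << "parent 0x" << std::hex << parent
                << " can now complete\n";
      waiting.erase(parent);
    }
  }
};

The key property of the sketch is that a parent stays on the waiting list until every remote subtree hanging off it has reported back, which is exactly what the NB about blocking ancestors is asking about.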

When popping an inode from the scrub stack, it’s important to note that its authority may have been changed by some intervening export. The scrubbing MDS will drop any file inode for which it is no longer authoritative, assuming it will be handled by the correct MDS. For directory inodes, we forward a request to the authoritative MDS to scrub the directory. This may result in attempts to scrub the same inodes more than once (though we track this and can avoid most of the work), but it seems necessary in order to guarantee that no directories are missed due to splits or exports (NB: this is correct, right?).
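As a minimal sketch of that pop-time check (again with made-up stand-in types and stub helpers, not the real ScrubStack interface):

#include <cstdint>
#include <stack>

using mds_rank_t = int32_t;
using inodeno_t  = uint64_t;

struct ScrubEntry {
  inodeno_t ino;
  bool is_dir;
};

// trivial stand-ins for "are we still auth for this inode?", "who is?",
// the normal local scrub path, and the remote scrub request
static bool is_auth(inodeno_t) { return true; }
static mds_rank_t authority_of(inodeno_t) { return 0; }
static void scrub_local(const ScrubEntry &) {}
static void request_remote_scrub(mds_rank_t, inodeno_t) {}

static void scrub_pop(std::stack<ScrubEntry> &stack) {
  ScrubEntry e = stack.top();
  stack.pop();
  if (is_auth(e.ino)) {
    scrub_local(e);                 // unchanged single-MDS behaviour
  } else if (e.is_dir) {
    // a directory that migrated away since it was pushed: ask its new
    // auth MDS to scrub it, possibly duplicating work but never
    // skipping a directory
    request_remote_scrub(authority_of(e.ino), e.ino);
  }
  // a non-auth file inode is simply dropped; its new auth MDS scrubs
  // it when it walks the containing directory
}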

Outbound scrub requests will need to be tracked and restarted in the case of MDS failure.
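One way to track them, sketched with hypothetical names only; the real hook would live wherever MDS failure notifications are handled:

#include <cstdint>
#include <map>
#include <set>

using mds_rank_t = int32_t;
using inodeno_t  = uint64_t;

class OutboundScrubTracker {
  // target rank -> subtree roots we are still waiting on from that rank
  std::map<mds_rank_t, std::set<inodeno_t>> outstanding;

public:
  void sent(mds_rank_t target, inodeno_t root) {
    outstanding[target].insert(root);
  }
  void completed(mds_rank_t target, inodeno_t root) {
    outstanding[target].erase(root);
  }
  // called from the failure-notification path: re-resolve each pending
  // subtree's new authority and reissue the request via 'resend'
  template <typename ResendFn>
  void handle_mds_failure(mds_rank_t failed, ResendFn resend) {
    auto it = outstanding.find(failed);
    if (it == outstanding.end())
      return;
    for (inodeno_t root : it->second)
      resend(root);
    outstanding.erase(it);
  }
};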

In the case of a badly thrashing directory hierarchy, many unnecessary sub-scrub requests may be created and duplicate work attempted. We can short-circuit the duplicate work by noting (as we do in the single-MDS case) when we have already scrubbed an inode and bailing out when we attempt to do it again. I’m not sure whether extra or unnecessary requests are avoidable, or whether they will pose a serious performance concern.
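The short-circuit itself can be as simple as remembering what has already been scrubbed in the current pass. A toy sketch, ignoring how the real code actually tags inodes:

#include <cstdint>
#include <unordered_set>

using inodeno_t = uint64_t;

class ScrubDedup {
  std::unordered_set<inodeno_t> scrubbed_this_pass;

public:
  // returns true only the first time an inode is seen in this pass,
  // i.e. only when the caller should actually do the scrub work
  bool should_scrub(inodeno_t ino) {
    return scrubbed_this_pass.insert(ino).second;
  }
  void reset() { scrubbed_this_pass.clear(); }
};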

Additions, criticisms, clarifications, tomatoes, and other reactions would be appreciated.

Cheers,
—Doug


* Re: A design for CephFS forward scrub with multiple MDS
  2016-09-20 17:16 A design for CephFS forward scrub with multiple MDS Douglas Fuller
@ 2016-09-21  6:29 ` Gregory Farnum
  2016-09-21 10:58   ` Douglas Fuller
  2016-09-21 12:56 ` John Spray
  1 sibling, 1 reply; 8+ messages in thread
From: Gregory Farnum @ 2016-09-21  6:29 UTC (permalink / raw)
  To: Douglas Fuller; +Cc: Ceph Development, John Spray

On Tue, Sep 20, 2016 at 10:16 AM, Douglas Fuller <dfuller@redhat.com> wrote:
> This serves to assemble some discussions we’ve had recently surrounding performing CephFS forward scrub in the case of multiple, active MDSs. I have been doing some implementation work recently in this area and it became a large enough departure from current practice that it’s probably time to revisit the design altogether. This message is intended to summarize the discussions I’ve had so far and to serve as a straw man for any changes that may be needed. It contains a couple questions as well.
>
> Currently, CephFS forward scrub proceeds straightforwardly by enqueuing inodes onto a stack as they are found, completing each parent directory once all of its children have been scrubbed. In a multi-MDS system, this will need to be extended to handle subtrees present on other MDSs.
>
> The proposed design is as follows:
>
> We scrub a local subtree as we would in the single-MDS case: follow the directory hierarchy downward, pushing found items onto a stack and completing directories once all their children are complete. When a subtree boundary is encountered, send a message to the authoritative MDS for that subtree requesting that it be scrubbed. When subtree scrubbing is complete, send a message to the requesting MDS with the completion information and relevant rstats for the parent directory inode (NB: do we have to block the scrubbing of all ancestors, then?).
>
> When popping an inode from the scrub stack, it’s important to note that its authority may have been changed by some intervening export. The scrubbing MDS will drop any file inode for which it is no longer authoritative, assuming this would be handled by the correct MDS. For directory inodes, forward a request to the authoritative MDS to scrub the directory. This may result in attempts to scrub the same inodes more than once (though we track this and can avoid most of the work), it seems necessary in order to guarantee no directories are missed due to splits or exports (NB: this is correct, right?).

I think we need to spell this out a little more. Some thoughts:
* right now, the ScrubStack is just a CInode*. This needs to turn into
a two-way reference.
* When we freeze a tree for export, we need a new step that removes it
from the ScrubStack and sets up the "remote scrub" state we'd have if
it were a freshly-encountered subtree boundary
  * this may involve some delayed execution of remote scrub requests,
or of bundling up the need for a scrub in the exported state
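A rough, standalone sketch of that freeze-time hook, using stand-in types only; in particular it assumes each stack entry can be tied back to its enclosing subtree root, which is itself an open question:

#include <cstdint>
#include <list>
#include <set>

using mds_rank_t = int32_t;
using inodeno_t  = uint64_t;

struct StackEntry {
  inodeno_t ino;
  inodeno_t subtree_root;   // nearest enclosing subtree root
};

class ScrubStackSketch {
  std::list<StackEntry> stack;
  std::set<inodeno_t> pending_remote;   // subtree roots awaiting remote scrub

public:
  void push(inodeno_t ino, inodeno_t subtree_root) {
    stack.push_back({ino, subtree_root});
  }

  // called while the subtree is frozen, before the export commits
  void on_export_freeze(inodeno_t subtree_root, mds_rank_t new_auth) {
    // pull everything under the exported subtree off the local stack
    stack.remove_if([&](const StackEntry &e) {
      return e.subtree_root == subtree_root;
    });
    // and record it as a remote scrub, exactly as if we had just
    // walked onto a foreign subtree boundary; the request (or bundled
    // scrub state) would then go to new_auth, either immediately or
    // once the export has been acked
    pending_remote.insert(subtree_root);
    (void)new_auth;
  }
};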

> Outbound scrub requests will need to be tracked and restarted in the case of MDS failure.
>
> It may be the case that, in the case of a badly thrashing directory hierarchy, that many unnecessary sub-scrub requests may be created and duplicate work attempted. We can short-circuit the duplicate work by noting (as we do in the single-MDS case) when we have already scrubbed an inode and bailing when we attempt to do it again. I’m not sure that extra or unnecessary requests are avoidable or if they will pose a serious performance concern.

I think a good design won't let this be much of a problem. If
subtrees move continuously we might have to "chase" the scrub (which
perhaps argues for sending the scrub state along with the metadata
export); otherwise, more fragmentation will require more messages,
but the system should handle that (it will presumably be constant
state at the boundaries).
-Greg


* Re: A design for CephFS forward scrub with multiple MDS
  2016-09-21  6:29 ` Gregory Farnum
@ 2016-09-21 10:58   ` Douglas Fuller
  2016-09-21 14:24     ` Gregory Farnum
  0 siblings, 1 reply; 8+ messages in thread
From: Douglas Fuller @ 2016-09-21 10:58 UTC (permalink / raw)
  To: Gregory Farnum; +Cc: Ceph Development, John Spray


> On Sep 21, 2016, at 2:29 AM, Gregory Farnum <gfarnum@redhat.com> wrote:
> 
> On Tue, Sep 20, 2016 at 10:16 AM, Douglas Fuller <dfuller@redhat.com> wrote:
>> 
>> When popping an inode from the scrub stack, it’s important to note that its authority may have been changed by some intervening export. The scrubbing MDS will drop any file inode for which it is no longer authoritative, assuming this would be handled by the correct MDS. For directory inodes, forward a request to the authoritative MDS to scrub the directory. This may result in attempts to scrub the same inodes more than once (though we track this and can avoid most of the work), it seems necessary in order to guarantee no directories are missed due to splits or exports (NB: this is correct, right?).
> 
> I think we need to spell this out a little more. Some thoughts:
> * right now, the ScrubStack is just a CInode*. This needs to turn into
> a two-way reference.

I hadn’t gone down to the datatype level of detail here. I agree it can’t be a CInode* anymore, and figured it’d have to be something we could fetch if it is exported while on the stack.

> * When we freeze a tree for export, we need a new step that removes it
> from the ScrubStack and sets up the "remote scrub" state we'd have if
> it were a freshly-encountered subtree boundary
>  * this may involve some delayed execution of remote scrub requests,
> or of bundling up the need for a scrub in the exported state

Directories don’t know where their subtree roots are, so I’m not sure how we would remove subdirectories and their contained files from the stack if one of their parents were exported. I think the stack could be “dumb” in some sense and not care what happens to the items on it. If we pop a file inode for which we are not authoritative, we drop it on the floor, assuming its parent directory will cause it to be scrubbed elsewhere. If we pop a directory inode for which we are not authoritative, we send a request to the authoritative MDS to scrub it.

Some duplicate work is created here, since a subtree could be exported and we would then end up requesting multiple scrub operations (which could race one another) in the same directory hierarchy. That’s inefficient, but it can be handled fairly well by the existing code. If we want to avoid it, we could do either of the following (the first option is sketched after this list):
* When we pop a directory inode for which we are not authoritative, trace back to the nearest subtree root. We would need to maintain state for that subtree root anyway, so that could be checked to avoid duplication of work.
* Create a wrapper data structure for scrubbing a given subtree and link scrub stack elements back to that. The problem would then be maintaining that data structure in the face of subtree changes.
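A minimal sketch of the first option, with hypothetical structures rather than the real MDCache: walk back up from the popped directory to the nearest subtree root and send at most one remote request per subtree root:

#include <cstdint>
#include <set>

using inodeno_t = uint64_t;

struct DirNode {
  inodeno_t ino;
  DirNode *parent = nullptr;      // nullptr at the filesystem root
  bool is_subtree_root = false;
};

// per-subtree bookkeeping: a remote scrub request is only sent once
std::set<inodeno_t> subtrees_already_requested;

const DirNode *nearest_subtree_root(const DirNode *dir) {
  while (dir->parent && !dir->is_subtree_root)
    dir = dir->parent;
  return dir;
}

bool should_request_remote_scrub(const DirNode *popped_dir) {
  const DirNode *root = nearest_subtree_root(popped_dir);
  // insert() reports false in .second if we already asked for this subtree
  return subtrees_already_requested.insert(root->ino).second;
}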

>> Outbound scrub requests will need to be tracked and restarted in the case of MDS failure.
>> 
>> It may be the case that, in the case of a badly thrashing directory hierarchy, that many unnecessary sub-scrub requests may be created and duplicate work attempted. We can short-circuit the duplicate work by noting (as we do in the single-MDS case) when we have already scrubbed an inode and bailing when we attempt to do it again. I’m not sure that extra or unnecessary requests are avoidable or if they will pose a serious performance concern.
> 
> I think a good design won't let this be much a problem. If subtrees
> move continuously we might have to "chase" the scrub (which perhaps
> argues for sending the scrub state along with the metadata export),
> but otherwise more fragmentation will require more messages but the
> system should handle that (it will presumably be constant state at the
> boundaries).
> -Greg



* Re: A design for CephFS forward scrub with multiple MDS
  2016-09-20 17:16 A design for CephFS forward scrub with multiple MDS Douglas Fuller
  2016-09-21  6:29 ` Gregory Farnum
@ 2016-09-21 12:56 ` John Spray
  2016-09-21 13:25   ` Douglas Fuller
  1 sibling, 1 reply; 8+ messages in thread
From: John Spray @ 2016-09-21 12:56 UTC (permalink / raw)
  To: Douglas Fuller; +Cc: Ceph Development, Gregory Farnum

On Tue, Sep 20, 2016 at 6:16 PM, Douglas Fuller <dfuller@redhat.com> wrote:
> This serves to assemble some discussions we’ve had recently surrounding performing CephFS forward scrub in the case of multiple, active MDSs. I have been doing some implementation work recently in this area and it became a large enough departure from current practice that it’s probably time to revisit the design altogether. This message is intended to summarize the discussions I’ve had so far and to serve as a straw man for any changes that may be needed. It contains a couple questions as well.
>
> Currently, CephFS forward scrub proceeds straightforwardly by enqueuing inodes onto a stack as they are found, completing each parent directory once all of its children have been scrubbed. In a multi-MDS system, this will need to be extended to handle subtrees present on other MDSs.
>
> The proposed design is as follows:
>
> We scrub a local subtree as we would in the single-MDS case: follow the directory hierarchy downward, pushing found items onto a stack and completing directories once all their children are complete. When a subtree boundary is encountered, send a message to the authoritative MDS for that subtree requesting that it be scrubbed. When subtree scrubbing is complete, send a message to the requesting MDS with the completion information and relevant rstats for the parent directory inode (NB: do we have to block the scrubbing of all ancestors, then?).

I think we have to block, yes -- otherwise we can't claim to have
really validated the recursive statistics at the upper levels.

> When popping an inode from the scrub stack, it’s important to note that its authority may have been changed by some intervening export. The scrubbing MDS will drop any file inode for which it is no longer authoritative, assuming this would be handled by the correct MDS. For directory inodes, forward a request to the authoritative MDS to scrub the directory. This may result in attempts to scrub the same inodes more than once (though we track this and can avoid most of the work), it seems necessary in order to guarantee no directories are missed due to splits or exports (NB: this is correct, right?).

Yes, I think this sounds right.  I was fuzzy on this part when we
talked yesterday but it makes more sense after sleeping on it: when
something in our stack gets migrated away, we don't just forget about
it, we treat it as a new bud on the tree that needs to be sent off to
another mds and scrubbed.

>
> Outbound scrub requests will need to be tracked and restarted in the case of MDS failure.

One thing we didn't discuss was the backwards case, where I (an MDS)
am told by another MDS to scrub a subtree, but he fails before I can
tell him the result of my scrub.  Simplest thing seems to be to abort
scrubs in this case, and say that (for the moment) a scrub is only
guaranteed to complete if the MDS where it was initiated stays online?

John

> It may be the case that, in the case of a badly thrashing directory hierarchy, that many unnecessary sub-scrub requests may be created and duplicate work attempted. We can short-circuit the duplicate work by noting (as we do in the single-MDS case) when we have already scrubbed an inode and bailing when we attempt to do it again. I’m not sure that extra or unnecessary requests are avoidable or if they will pose a serious performance concern.
>
> Additions, criticisms, clarifications, tomatoes, and other reactions would be appreciated.
>
> Cheers,
> —Doug


* Re: A design for CephFS forward scrub with multiple MDS
  2016-09-21 12:56 ` John Spray
@ 2016-09-21 13:25   ` Douglas Fuller
  2016-09-21 13:45     ` John Spray
  0 siblings, 1 reply; 8+ messages in thread
From: Douglas Fuller @ 2016-09-21 13:25 UTC (permalink / raw)
  To: John Spray; +Cc: Ceph Development, Gregory Farnum


>> Outbound scrub requests will need to be tracked and restarted in the case of MDS failure.
> 
> One thing we didn't discuss was the backwards case, where I (an MDS)
> am told by another MDS to scrub a subtree, but he fails before I can
> tell him the result of my scrub.  Simplest thing seems to be to abort
> scrubs in this case, and say that (for the moment) a scrub is only
> guaranteed to complete if the MDS where it was initiated stays online?

That makes sense as a first pass. For the future, we could resend the completion to the new subtree root owner after reconnecting and at least update the rstats. The scrub may even complete in that case.


* Re: A design for CephFS forward scrub with multiple MDS
  2016-09-21 13:25   ` Douglas Fuller
@ 2016-09-21 13:45     ` John Spray
  0 siblings, 0 replies; 8+ messages in thread
From: John Spray @ 2016-09-21 13:45 UTC (permalink / raw)
  To: Douglas Fuller; +Cc: Ceph Development, Gregory Farnum

On Wed, Sep 21, 2016 at 2:25 PM, Douglas Fuller <dfuller@redhat.com> wrote:
>
>>> Outbound scrub requests will need to be tracked and restarted in the case of MDS failure.
>>
>> One thing we didn't discuss was the backwards case, where I (an MDS)
>> am told by another MDS to scrub a subtree, but he fails before I can
>> tell him the result of my scrub.  Simplest thing seems to be to abort
>> scrubs in this case, and say that (for the moment) a scrub is only
>> guaranteed to complete if the MDS where it was initiated stays online?
>
> That makes sense as a first pass. For the future, we could resend the completion to the new subtree root owner after reconnecting and at least update the rstats. The scrub may even complete in that case.

Yes, although for the completely general case (including all MDSs
offline simultaneously) we would need to start persisting something.
Not sure if we'd ever want to do that silently inside the MDSs though,
as spending IOPs on scrub is probably not desired behaviour fresh from
a power cycle.  #futurework

John


* Re: A design for CephFS forward scrub with multiple MDS
  2016-09-21 10:58   ` Douglas Fuller
@ 2016-09-21 14:24     ` Gregory Farnum
  2016-09-21 15:04       ` Douglas Fuller
  0 siblings, 1 reply; 8+ messages in thread
From: Gregory Farnum @ 2016-09-21 14:24 UTC (permalink / raw)
  To: Douglas Fuller; +Cc: Ceph Development, John Spray

On Wed, Sep 21, 2016 at 3:58 AM, Douglas Fuller <dfuller@redhat.com> wrote:
>
>> On Sep 21, 2016, at 2:29 AM, Gregory Farnum <gfarnum@redhat.com> wrote:
>>
>> On Tue, Sep 20, 2016 at 10:16 AM, Douglas Fuller <dfuller@redhat.com> wrote:
>>>
>>> When popping an inode from the scrub stack, it’s important to note that its authority may have been changed by some intervening export. The scrubbing MDS will drop any file inode for which it is no longer authoritative, assuming this would be handled by the correct MDS. For directory inodes, forward a request to the authoritative MDS to scrub the directory. This may result in attempts to scrub the same inodes more than once (though we track this and can avoid most of the work), it seems necessary in order to guarantee no directories are missed due to splits or exports (NB: this is correct, right?).
>>
>> I think we need to spell this out a little more. Some thoughts:
>> * right now, the ScrubStack is just a CInode*. This needs to turn into
>> a two-way reference.
>
> I wasn’t at the datatype level of detail here. I agree it can’t be a CInode* anymore, and figured it’d have to be something we could fetch if it is exported while on the stack.
>
>> * When we freeze a tree for export, we need a new step that removes it
>> from the ScrubStack and sets up the "remote scrub" state we'd have if
>> it were a freshly-encountered subtree boundary
>>  * this may involve some delayed execution of remote scrub requests,
>> or of bundling up the need for a scrub in the exported state
>
> Directories don’t know where their subtree roots are, so I’m not sure how we would remove subdirectories and their contained files from the stack if one of their parents were exported. I think the stack could be “dumb” in some sense and not care what happens to the items on it. If we pop a file inode for which we are not authoritative, we drop it on the floor, assuming its parent directory will cause it to be scrubbed elsewhere. If we pop a directory inode for which we are not authoritative, we send a request to the authoritative MDS to scrub it.

Well, we can't keep these inodes around the way the code currently
works: unless I'm much mistaken, nothing is keeping them updated so
they're just out-of-date copies of metadata. PIN_SCRUBQUEUE exists to
keep inodes in-memory once on the scrub queue but it was explicitly
never designed to interact with multi-mds systems and needs to be
cleaned up; the current behavior is just broken. Luckily, there are
some not-too-ridiculous solutions.
* We already freeze a subtree before exporting it. I don't remember if
that involves actually touching every in-memory CDentry/CInode
underneath, or marking a flag on the root?
* We obviously walk our way through the whole subtree when bundling it
up for export
So, in one of those passes (or in a new one tacked on to freezing), we
can detect that an inode is on the scrub queue and remove it.
-Greg


* Re: A design for CephFS forward scrub with multiple MDS
  2016-09-21 14:24     ` Gregory Farnum
@ 2016-09-21 15:04       ` Douglas Fuller
  0 siblings, 0 replies; 8+ messages in thread
From: Douglas Fuller @ 2016-09-21 15:04 UTC (permalink / raw)
  To: Gregory Farnum; +Cc: Ceph Development, John Spray


> Well, we can't keep these inodes around the way the code currently
> works: unless I'm much mistaken, nothing is keeping them updated so
> they're just out-of-date copies of metadata. PIN_SCRUBQUEUE exists to
> keep inodes in-memory once on the scrub queue but it was explicitly
> never designed to interact with multi-mds systems and needs to be
> cleaned up; the current behavior is just broken. Luckily, there are
> some not-too-ridiculous solutions.
> * We already freeze a subtree before exporting it. I don't remember if
> that involves actually touching every in-memory CDentry/CInode
> underneath, or marking a flag on the root?
> * We obviously walk our way through the whole subtree when bundling it
> up for export
> So, in one of those passes (or in a new one tacked on to freezing), we
> can detect that an inode is on the scrub queue and remove it.

There’s also:
* Represent the stack of inodes with a different data structure (like an inode_t). Then we could leave the current behavior intact. 
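Roughly what I mean by that, as an illustrative-only sketch: entries carry an inode number (plus a hint for refetching) instead of a CInode*, so nothing on the stack goes stale when the in-memory inode moves:

#include <cstdint>
#include <string>
#include <vector>

using inodeno_t = uint64_t;

struct ScrubStackEntry {
  inodeno_t ino;          // stable identifier, unlike a CInode pointer
  std::string path_hint;  // enough to refetch/trace the inode if needed
  bool is_dir;
};

class DetachedScrubStack {
  std::vector<ScrubStackEntry> entries;

public:
  void push(ScrubStackEntry e) { entries.push_back(std::move(e)); }

  bool pop(ScrubStackEntry &out) {
    if (entries.empty())
      return false;
    out = entries.back();
    entries.pop_back();
    // the caller refetches the inode by number; if it is no longer
    // local/authoritative, it forwards or drops as discussed above
    return true;
  }
};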

Your subtree bundling suggestion does look interesting, though. I’ll look into it.


