All of lore.kernel.org
 help / color / mirror / Atom feed
From: Filipe Manana <fdmanana@gmail.com>
To: Scott Talbert <scott.talbert@hgst.com>
Cc: "linux-btrfs@vger.kernel.org" <linux-btrfs@vger.kernel.org>
Subject: Re: Device replace issues and disabling it until they are solved
Date: Fri, 27 May 2016 10:43:47 +0100	[thread overview]
Message-ID: <CAL3q7H67b3JNMogY5DS4X3JTLQEugB4E48kjLx6fE69VwXGpDQ@mail.gmail.com> (raw)
In-Reply-To: <alpine.DEB.2.10.1605261506540.25778@dispatch>

On Thu, May 26, 2016 at 8:09 PM, Scott Talbert <scott.talbert@hgst.com> wrote:
> On Wed, 11 May 2016, Filipe Manana wrote:
>
>>> I've noticed some time ago that our device replace implementation is
>>> unreliable. Basically under several situations it ends up not copying
>>> extents (both data and metadata) from the old device to the new device
>>> (I briefly talked about some of the problems with Josef at LSF).
>>>
>>> Essentially it operates by iterating all the extents of the old device
>>> using the device tree's commit root, and for each device extent found,
>>> does a direct copy of it into the new device (it does not use the
>>> regular write paths). While it's doing this no extents can be
>>> allocated from the new device, this is only possible once it finishes
>>> copying all device extents and replaces the old device with the new
>>> device in the list of devices available for allocations.
>>>
>>> There are 4 main flaws with the implementation:
>>>
>>> 1) It processes each device extent only once. Clearly after it
>>> processes (copies) a device extent it's possible for future data and
>>> metadata writes to allocate extents from the corresponding block
>>> group, so any new extents are not copied into the new device.
>>>
>>> 2) Before it copies a device extent, it sets the corresponding block
>>> group to RO mode (as of commit [1], which sadly does not really fix an
>>> issue nor does it explain how the problem it claims to fix happens) to
>>> prevent new allocations from allocating extents from the block group
>>> while we are processing it, and once we finish processing it, it sets
>>> the block group back to RW mode. This is flawed too, as when it sets
>>> the block group to RO mode, some concurrent task might have just
>>> allocated an extent from the block group and ends up writing to it
>>> while or after we process the device extent.
>>>
>>> 3) Since it iterates over the device tree in an ascending fashion, it
>>> never considers the possibility for new device extents with an offset
>>> lower then the current search offset from being allocated for new
>>> block groups (just take a look at the while loop at
>>> scrub_enumerate_chunks() to see how flawed it is). So it totally
>>> ignores that device holes can be used while we are doing the replace
>>> and never looks back to check for any and copy the respective device
>>> extent into the new device. This is much more likely to happen
>>> nowadays since empty block groups are automatically removed by the
>>> cleaner kthread, while in the past this could only happen through
>>> balance operations triggered from userspace.
>>>
>>> 4) When we finish iterating "all" device extents, we call
>>> btrfs_dev_replace_finishing(), which flushes all delalloc, commits the
>>> current transaction, replaces the old device in all extent mappings
>>> with the new device, removes old device from the list of available
>>> devices for allocation and adds the new device to that list. So when
>>> flushing dellaloc more extents are allocated from existing block
>>> groups and new block groups might be allocated, with all the
>>> corresponding bios writing to the old device and not the new one, and
>>> after this flushing we surely don't copy all new or changed extents to
>>> the new device.
>>>
>>> These are the 4 major issues I see with device replace. Issue number 2
>>> is trivial to solve but it's pointless to do so without addressing the
>>> other issues.
>>>
>>> Solving this isn't easy, not without some trade offs at least. We can
>>> keep in memory structures to track which new block groups are
>>> allocated while we are doing the replace and which new extents are
>>> allocated from existing block groups. Besides the needed
>>> synchronization between the allocator and the device replace code
>>> (which would bring yet more complexity), this can imply using a lot of
>>> memory if there's a significant number of writes happening all the
>>> time and ending up in an endless copy loop. Maybe some sort of write
>>> throttling would work out. There's also a bit of a chicken and the egg
>>> problem, as while the replace operation is ongoing we update a
>>> progress item in the devices tree (btrfs_run_dev_replace()) which in
>>> turn results in allocating new metadata extents.
>>>
>>> Anyway, I'm just hoping someone tells me that I'm completely wrong and
>>> misunderstood the current implementation we have. If not, and given
>>> how serious this is (affecting both data and metadata, and defeating
>>> the main purpose of raid), shouldn't we disable this feature, like we
>>> did for snapshot aware defrag a couple years ago, until we have these
>>> issues solved?
>>
>>
>> On IRC, Stefan Behrens just pointed me out that for locations of the
>> old device behind the current left cursor, we make all writes go both
>> into the old and new devices (volumes.c:__btrfs_map_block()). This
>> solves problems 1 and 3, which are really the hard problems. The
>> remaining ones are easy to fix and I'll address them soon.
>
>
> Hi Filipe,

Hi Scott,

>
> Does your recent patch set (from May 20) address all of these issues?

Yes.

>
> Scott
>
> --
> To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html



-- 
Filipe David Manana,

"Reasonable men adapt themselves to the world.
 Unreasonable men adapt the world to themselves.
 That's why all progress depends on unreasonable men."

  reply	other threads:[~2016-05-27  9:43 UTC|newest]

Thread overview: 9+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2016-05-11  9:09 Device replace issues and disabling it until they are solved Filipe Manana
2016-05-11  9:29 ` Qu Wenruo
2016-05-11  9:39   ` Filipe Manana
2016-05-11 15:13 ` Filipe Manana
2016-05-26 19:09   ` Scott Talbert
2016-05-27  9:43     ` Filipe Manana [this message]
2016-06-02 10:03       ` Yauhen Kharuzhy
2016-06-02 10:05         ` Filipe Manana
2016-06-02 13:58           ` Scott Talbert

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=CAL3q7H67b3JNMogY5DS4X3JTLQEugB4E48kjLx6fE69VwXGpDQ@mail.gmail.com \
    --to=fdmanana@gmail.com \
    --cc=linux-btrfs@vger.kernel.org \
    --cc=scott.talbert@hgst.com \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.