From: Song Liu <songliubraving@fb.com>
To: Coly Li <colyli@suse.de>
Cc: NeilBrown <neilb@suse.com>,
linux-raid <linux-raid@vger.kernel.org>,
"linux-block@vger.kernel.org" <linux-block@vger.kernel.org>
Subject: Re: [RFC] How to handle an ugly md raid0 sector map bug ?
Date: Fri, 23 Aug 2019 17:17:06 +0000 [thread overview]
Message-ID: <9008538C-A2BE-429C-A90E-18FBB91E7B34@fb.com> (raw)
In-Reply-To: <f3c41c4b-5b1d-bd2f-ad2d-9aa5108ad798@suse.de>
> On Aug 23, 2019, at 10:03 AM, Coly Li <colyli@suse.de> wrote:
>
> On 2019/8/24 12:37 上午, Song Liu wrote:
>> Thanks Coly and Neil.
>>
>>> On Aug 22, 2019, at 5:02 PM, NeilBrown <neilb@suse.com> wrote:
>>>
>>> On Thu, Aug 22 2019, Coly Li wrote:
>>>
>>>> Hi folks,
>>>>
>>>> First line: This bug only influences md raid0 device which applies all
>>>> the following conditions,
>>>> 1) Assembled by component disks with different sizes.
>>>> 2) Created and used under Linux kernel before (including) Linux v3.12,
>>>> then upgrade to Linux kernel after (including) Linux v3.13.
>>>> 3) New data are written to md raid0 in new kernel >= Linux v3.13.
>>>> Then the md raid0 may have inconsistent sector mapping and experience
>>>> data corruption.
>>>>
>>>> Recently I receive a bug report that customer encounter file system
>>>> corruption after upgrading their kernel from Linux 3.12 to 4.4. It turns
>>>> out to be the underlying md raid0 corruption after the kernel upgrade.
>>>>
>>>> I find it is because a sector map bug in md raid0 code include and
>>>> before Linux v3.12. Here is the buggy code piece I copied from stable
>>>> Linux v3.12.74 drivers/md/raid0.c:raid0_make_request(),
>>>>
>>>> 547 sector_offset = bio->bi_sector;
>>>> 548 zone = find_zone(mddev->private, §or_offset);
>>>> 549 tmp_dev = map_sector(mddev, zone, bio->bi_sector,
>>>> 550 §or_offset);
>>>
>>> I don't think this code is buggy. The mapping may not be the mapping
>>> you would expect, but it is the mapping that md/raid0 had always used up
>>> to this time.
>>>
>>>>
>>>> At line 548 after find_zone() returns, sector_offset is updated to be an
>>>> offset inside current zone. Then at line 549 the third parameter of
>>>> calling map_sector() should be the updated sector_offset, but
>>>> bio->bi_sector (original LBA or md raid0 device) is used. If the raid0
>>>> device has *multiple zones*, except the first zone, the mapping <dev,
>>>> sector> pair returned by map_sector() for all rested zones are
>>>> unexpected and wrong.
>>>>
>>>> The buggy code was introduced since Linux v2.6.31 in commit fbb704efb784
>>>> ("md: raid0 :Enables chunk size other than powers of 2."), unfortunate
>>>> the mistaken mapping calculation has stable and unique result too, so it
>>>> works without obvious problem until commit 20d0189b1012 ("block:
>>>> Introduce new bio_split()") merged into Linux v3.13.
>>>>
>>>> This patch fixed the mistaken mapping in the following lines of change,
>>>> 654 - sector_offset = bio->bi_iter.bi_sector;
>>>> 655 - zone = find_zone(mddev->private, §or_offset);
>>>> 656 - tmp_dev = map_sector(mddev, zone, bio->bi_iter.bi_sector,
>>>> 657 - §or_offset);
>>>>
>>>> 694 + zone = find_zone(mddev->private, §or);
>>>> 695 + tmp_dev = map_sector(mddev, zone, sector, §or);
>>>> At line 695 of this patch, the third parameter of calling map_sector()
>>>> is fixed to 'sector', this is the correct value which contains the
>>>> sector offset inside the corresponding zone.
>>>
>>> This is buggy because, as you say, the third argument to map_sector has
>>> changed.
>>> Previously it was bio->bi_iter.bi_sector. Now it is 'sector' which
>>> find_zone has just modified.
>>>
>>>>
>>>> The this patch implicitly *changes* md raid0 on-disk layout. If a md
>>>> raid0 has component disks with *different* sizes, then it will contain
>>>> multiple zones. If such multiple zones raid0 device is created before
>>>> Linux v3.13, all data chunks after first zone will be mapped to
>>>> different location in kernel after (including) Linux v3.13. The result
>>>> is, data written in the LBA after first zone will be treated as
>>>> corruption. A worse case is, if the md raid0 has data chunks filled in
>>>> first md raid0 zone in Linux v3.12 (or earlier kernels), then update to
>>>> Linux v3.13 (or later kernels) and fill more data chunks in second and
>>>> rested zone. Then in neither Linux v3.12 no Linux v3.13, there is always
>>>> partial data corrupted.
>>>>
>>>> Currently there is no way to tell whether a md raid0 device is mapped in
>>>> wrong calculation in kernel before (including) Linux v3.12 or in correct
>>>> calculation in kernels after (including) Linux v3.13. If a md raid0
>>>> device (contains multiple zones) created and used crossing these kernel
>>>> version, there is possibility and different mapping calculation
>>>> generation different/inconsistent on-disk layout in different md raid0
>>>> zones, and results data corruption.
>>>>
>>>> For our enterprise Linux products we can handle it properly for a few
>>>> product number of kernels. But for upstream and stable kernels, I don't
>>>> have idea how to fix this ugly problem in a generic way.
>>>>
>>>> Neil Brown discussed with me offline, he proposed a temporary workaround
>>>> that only permit to assemble md raid0 device with identical component
>>>> disk size, and reject to assemble md raid0 device with component disks
>>>> with different sizes. We can stop this workaround when there is a proper
>>>> way to fix the problem.
>>>>
>>>> I suggest our developer community to work together for a solution, this
>>>> is the motivation I post this email for your comments.
>>>
>>> There are four separate cases that we need to consider:
>>> - v1.x metadata
>>> - v0.90 metadata
>>> - LVM metadata (via dm-raid)
>>> - no metadata (array created with "mdadm --build").
>>>
>>> For v1.x metadata, I think we can add a new "feature_map" flag.
>>> If this flag isn't set, raid0 with non-uniform device sizes will not be
>>> assembled.
>>> If it is set, then:
>>> if 'layout' is 0, use the old mapping
>>> if 'layout' is 1, use the new mapping
>>>
>>> For v0.90 metadata we don't have feature-flags. We could
>>> The gvalid_words field is unused and always set to zero.
>>> So we could start storing some feature bits there.
>>>
>>> For LVM/dm-raid, I suspect it doesn't support varying
>>> sized devices, but we would need to check.
>>>
>>> For "no metadata" arrays ... we could possibly just stop supporting
>>> them - I doubt they are used much.
>>
>> So for an existing array, we really cannot tell whether it is broken or
>> not, right? If this is the case, we only need to worry about new arrays.
>>
>> For new arrays, I guess we can only allow v1.x raid0 to have non-uniform
>> devices sizes, and use the new feature_map bit.
>>
>> Would this work? If so, we only have 1 case to work on.
>
> It seems v1.2 support started since Linux v2.16, so it may also have
> problem for multiple zones.
>
For v1.2 metadata, we still need the feature_map bit, meaning this
non-uniform array is safe to assemble. If the array doesn't have
this bit, and is non-uniform size, we refuse to assemble it.
Thanks,
Song
next prev parent reply other threads:[~2019-08-23 17:17 UTC|newest]
Thread overview: 19+ messages / expand[flat|nested] mbox.gz Atom feed top
2019-08-22 10:38 [RFC] How to handle an ugly md raid0 sector map bug ? Coly Li
2019-08-23 0:02 ` NeilBrown
2019-08-23 16:37 ` Song Liu
2019-08-23 17:03 ` Coly Li
2019-08-23 17:17 ` Song Liu [this message]
2019-08-23 17:47 ` Coly Li
2019-09-09 6:57 ` [PATCH] md/raid0: avoid RAID0 data corruption due to layout confusion NeilBrown
2019-09-09 6:58 ` [PATCH 2/2] md: add feature flag MD_FEATURE_RAID0_LAYOUT NeilBrown
2019-09-09 15:33 ` Guoqing Jiang
2019-09-09 23:26 ` NeilBrown
2019-09-09 14:56 ` [PATCH] md/raid0: avoid RAID0 data corruption due to layout confusion Song Liu
2019-09-09 23:33 ` NeilBrown
2019-09-10 15:45 ` Song Liu
2019-09-10 16:01 ` Guoqing Jiang
2019-09-10 23:08 ` NeilBrown
2019-09-11 9:56 ` Song Liu
2019-09-11 22:48 ` NeilBrown
2019-09-09 15:09 ` Guoqing Jiang
2019-09-09 23:34 ` NeilBrown
Reply instructions:
You may reply publicly to this message via plain-text email
using any one of the following methods:
* Save the following mbox file, import it into your mail client,
and reply-to-all from there: mbox
Avoid top-posting and favor interleaved quoting:
https://en.wikipedia.org/wiki/Posting_style#Interleaved_style
* Reply using the --to, --cc, and --in-reply-to
switches of git-send-email(1):
git send-email \
--in-reply-to=9008538C-A2BE-429C-A90E-18FBB91E7B34@fb.com \
--to=songliubraving@fb.com \
--cc=colyli@suse.de \
--cc=linux-block@vger.kernel.org \
--cc=linux-raid@vger.kernel.org \
--cc=neilb@suse.com \
/path/to/YOUR_REPLY
https://kernel.org/pub/software/scm/git/docs/git-send-email.html
* If your mail client supports setting the In-Reply-To header
via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line
before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).