From: NeilBrown <neilb@suse.com>
To: Coly Li <colyli@suse.de>, Song Liu <songliubraving@fb.com>,
	linux-raid@vger.kernel.org
Cc: linux-block@vger.kernel.org
Subject: Re: [RFC] How to handle an ugly md raid0 sector map bug ?
Date: Fri, 23 Aug 2019 10:02:08 +1000	[thread overview]
Message-ID: <87blwghhq7.fsf@notabene.neil.brown.name> (raw)
In-Reply-To: <10ca59ff-f1ba-1464-030a-0d73ff25d2de@suse.de>


On Thu, Aug 22 2019, Coly Li wrote:

> Hi folks,
>
> Up front: this bug only affects an md raid0 device that meets all of
> the following conditions,
> 1) Assembled from component disks of different sizes.
> 2) Created and used under a kernel up to and including Linux v3.12,
> then upgraded to a kernel from Linux v3.13 onward.
> 3) New data are written to the md raid0 under the new kernel (Linux
> v3.13 or later).
> Such an md raid0 device may end up with an inconsistent sector mapping
> and suffer data corruption.
>
> Recently I received a bug report from a customer who encountered file
> system corruption after upgrading their kernel from Linux 3.12 to 4.4.
> It turned out to be corruption of the underlying md raid0 that appeared
> after the kernel upgrade.
>
> I found it is caused by a sector mapping bug in the md raid0 code up
> to and including Linux v3.12. Here is the buggy code, copied from
> stable Linux v3.12.74 drivers/md/raid0.c:raid0_make_request():
>
> 547         sector_offset = bio->bi_sector;
> 548         zone = find_zone(mddev->private, &sector_offset);
> 549         tmp_dev = map_sector(mddev, zone, bio->bi_sector,
> 550                              &sector_offset);

I don't think this code is buggy.  The mapping may not be the mapping
you would expect, but it is the mapping that md/raid0 had always used up
to this time.
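
As a toy model (plain Python, not kernel code; the zone layout and chunk
size are invented for illustration), here is a simplified
find_zone()/map_sector() pair called with both conventions, showing that
the choice of third argument selects a different member disk for chunks
past the first zone:

```python
CHUNK = 2  # chunk size in sectors (tiny, for illustration)

# 3-disk raid0 where dev2 holds one chunk and dev0/dev1 hold five:
# zone 0 stripes one chunk deep over all three disks, zone 1 stripes
# the remaining four chunks of dev0 and dev1.
zones = [
    {"end": 3 * CHUNK,  "dev_start": 0,     "devs": [0, 1, 2]},
    {"end": 11 * CHUNK, "dev_start": CHUNK, "devs": [0, 1]},
]

def find_zone(zones, sector):
    """Like raid0.c:find_zone(): return the zone, plus the sector
    rewritten as an offset from the start of that zone."""
    prev_end = 0
    for z in zones:
        if sector < z["end"]:
            return z, sector - prev_end
        prev_end = z["end"]
    raise ValueError("sector beyond end of array")

def map_sector(zone, sector, zone_off):
    """Simplified raid0.c:map_sector() (non-power-of-2 branch).
    'sector' is the disputed third argument; 'zone_off' is the
    zone-relative offset that find_zone() computed."""
    ndev = len(zone["devs"])
    sect_in_chunk = sector % CHUNK
    chunk = zone_off // (CHUNK * ndev)            # chunk index on the member
    dev = zone["devs"][(sector // CHUNK) % ndev]  # which member disk
    return dev, zone["dev_start"] + chunk * CHUNK + sect_in_chunk

bi_sector = 4 * CHUNK                   # the second chunk of zone 1
zone, off = find_zone(zones, bi_sector)
old = map_sector(zone, bi_sector, off)  # <= v3.12: absolute bi_sector
new = map_sector(zone, off, off)        # >= v3.13: zone-relative offset
print("old:", old, "new:", new)         # old: (0, 2) new: (1, 2)
```

The per-member offsets agree, but the two conventions pick different
member disks for zone-1 chunks (here zone 0 holds an odd number of
chunks, so the device rotation falls out of phase), which is why data
written under one kernel reads back as garbage under the other.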

>
> At line 548, when find_zone() returns, sector_offset has been updated
> to an offset inside the current zone. Then at line 549 the third
> parameter of map_sector() should be the updated sector_offset, but
> bio->bi_sector (the original LBA on the md raid0 device) is passed
> instead. If the raid0 device has *multiple zones*, then for every zone
> except the first, the <dev, sector> pair returned by map_sector() is
> unexpected and wrong.
>
> The buggy code was introduced in Linux v2.6.31 by commit fbb704efb784
> ("md: raid0 :Enables chunk size other than powers of 2."). Fortunately
> the mistaken mapping calculation still produces a stable and unique
> result, so it worked without obvious problems until commit 20d0189b1012
> ("block: Introduce new bio_split()") was merged in Linux v3.13.
>
> That commit fixed the mistaken mapping with the following lines of change:
> 654 -       sector_offset = bio->bi_iter.bi_sector;
> 655 -       zone = find_zone(mddev->private, &sector_offset);
> 656 -       tmp_dev = map_sector(mddev, zone, bio->bi_iter.bi_sector,
> 657 -                            &sector_offset);
>
> 694 +               zone = find_zone(mddev->private, &sector);
> 695 +               tmp_dev = map_sector(mddev, zone, sector, &sector);
> At line 695 of this patch, the third parameter of map_sector() is fixed
> to 'sector', which is the correct value: the sector offset inside the
> corresponding zone.

This is the buggy change: as you say, the third argument to map_sector
changed.  Previously it was bio->bi_iter.bi_sector; now it is 'sector',
which find_zone has just modified.

>
> This patch implicitly *changes* the md raid0 on-disk layout. If an md
> raid0 array has component disks of *different* sizes, it contains
> multiple zones. If such a multi-zone raid0 device was created before
> Linux v3.13, all data chunks after the first zone are mapped to
> different locations by Linux v3.13 and later kernels. As a result, data
> written to LBAs after the first zone will read back as corruption. A
> worse case: if the md raid0 had data chunks written into the first zone
> under Linux v3.12 (or an earlier kernel), was then updated to Linux
> v3.13 (or a later kernel), and had more data chunks written into the
> second and subsequent zones, then under neither v3.12 nor v3.13 is all
> of the data readable; part of it is always corrupted.
>
> Currently there is no way to tell whether an md raid0 device was mapped
> with the old calculation of kernels up to and including Linux v3.12, or
> with the new calculation of Linux v3.13 and later. If a multi-zone md
> raid0 device was created and used across these kernel versions, the two
> mapping calculations may have produced different, inconsistent on-disk
> layouts in different md raid0 zones, resulting in data corruption.
>
> For our enterprise Linux products we can handle this properly for the
> small number of product kernels. But for the upstream and stable
> kernels, I have no idea how to fix this ugly problem in a generic way.
>
> Neil Brown discussed this with me offline. He proposed a temporary
> workaround: only permit assembling md raid0 devices whose component
> disks are all the same size, and refuse to assemble md raid0 devices
> with component disks of different sizes. We can drop this workaround
> once there is a proper way to fix the problem.
>
> I suggest our developer community work together on a solution; that is
> my motivation for posting this email for your comments.

There are four separate cases that we need to consider:
 - v1.x metadata
 - v0.90 metadata
 - LVM metadata (via dm-raid)
 - no metadata (array created with "mdadm --build").

For v1.x metadata, I think we can add a new "feature_map" flag.
If this flag isn't set, raid0 with non-uniform device sizes will not be
assembled.
If it is set, then:
  if 'layout' is 0, use the old mapping
  if 'layout' is 1, use the new mapping
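
A sketch of that decision (a Python stand-in, not kernel code; the flag
value and the dict standing in for the mdp_superblock_1 fields are
hypothetical):

```python
# Hypothetical bit value; a real patch would allocate it in md_p.h.
MD_FEATURE_RAID0_LAYOUT = 1 << 7

def raid0_choose_mapping(sb, dev_sectors):
    """Assembly-time decision sketched above, for v1.x metadata.
    sb: dict standing in for the relevant mdp_superblock_1 fields;
    dev_sectors: usable size of each component device."""
    if len(set(dev_sectors)) == 1:
        return "either"  # single zone; old and new mappings agree
    if not sb["feature_map"] & MD_FEATURE_RAID0_LAYOUT:
        raise RuntimeError("ambiguous raid0 layout; refusing to assemble")
    return "old" if sb["layout"] == 0 else "new"
```

mdadm would then be responsible for setting the flag and the 'layout'
value, both at creation time and on existing arrays that predate it.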

For v0.90 metadata we don't have feature-flags, but the gvalid_words
field is unused and always set to zero, so we could start storing some
feature bits there.

For LVM/dm-raid, I suspect it doesn't support varying-sized devices,
but we would need to check.

For "no metadata" arrays ... we could possibly just stop supporting
them - I doubt they are used much.

Once this is decided, we then need to work out how to
make it all usable via mdadm.

mdadm needs to set the relevant flags and 'layout' values
when creating an array, and it needs to be able to set them correctly
on an existing array that doesn't have them set.

We also need to decide what happens when a new mdadm is run on an old
kernel.
If we create an array with a new feature bit, the old kernel won't
be able to assemble it.  But if we don't set the feature bit, then
the array will not assemble after an upgrade.

It might be easier to think about this when we actually have concrete
code, but if anyone does have useful input, I'd be very happy to read
it.

NeilBrown

