linux-block.vger.kernel.org archive mirror
* [RFC] How to handle an ugly md raid0 sector map bug ?
@ 2019-08-22 10:38 Coly Li
  2019-08-23  0:02 ` NeilBrown
From: Coly Li @ 2019-08-22 10:38 UTC
  To: linux-raid, Song Liu; +Cc: NeilBrown, linux-block

Hi folks,

Up front: this bug only affects md raid0 devices that meet all of the
following conditions,
1) Assembled from component disks with different sizes.
2) Created and used under a kernel up to and including Linux v3.12,
then upgraded to Linux v3.13 or later.
3) New data are written to the md raid0 under the new kernel (>= Linux v3.13).
Such an md raid0 may have an inconsistent sector mapping and experience
data corruption.

Recently I received a bug report that a customer encountered file system
corruption after upgrading their kernel from Linux 3.12 to 4.4. It turned
out to be corruption of the underlying md raid0 after the kernel upgrade.

I found it is caused by a sector mapping bug in the md raid0 code up to
and including Linux v3.12. Here is the buggy code piece, copied from
stable Linux v3.12.74 drivers/md/raid0.c:raid0_make_request(),

547         sector_offset = bio->bi_sector;
548         zone = find_zone(mddev->private, &sector_offset);
549         tmp_dev = map_sector(mddev, zone, bio->bi_sector,
550                              &sector_offset);

At line 548, after find_zone() returns, sector_offset has been updated
to an offset inside the current zone. Then at line 549 the third
parameter of map_sector() should be this updated sector_offset, but
bio->bi_sector (the original LBA of the md raid0 device) is used
instead. If the raid0 device has *multiple zones*, the <dev, sector>
pair returned by map_sector() is unexpected and wrong for every zone
except the first.
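
For reference, the zone lookup behaves roughly as follows. This is a
simplified paraphrase of the v3.12-era helper, not a verbatim copy,

/*
 * Find the zone which holds a particular array-absolute offset, and
 * rewrite *sectorp to an offset inside that zone.  This is why passing
 * the original bio->bi_sector to map_sector() afterwards is surprising.
 */
static struct strip_zone *find_zone(struct r0conf *conf,
                                    sector_t *sectorp)
{
        int i;
        struct strip_zone *z = conf->strip_zone;
        sector_t sector = *sectorp;

        for (i = 0; i < conf->nr_strip_zones; i++)
                if (sector < z[i].zone_end) {
                        if (i)
                                *sectorp = sector - z[i-1].zone_end;
                        return z + i;
                }
        BUG();
}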

The buggy code was introduced in Linux v2.6.31 by commit fbb704efb784
("md: raid0 :Enables chunk size other than powers of 2."). Unfortunately
the mistaken mapping calculation also produces a stable and unique
result, so it worked without obvious problems until commit 20d0189b1012
("block: Introduce new bio_split()") was merged into Linux v3.13.

That commit fixed the mistaken mapping with the following lines of change,
654 -       sector_offset = bio->bi_iter.bi_sector;
655 -       zone = find_zone(mddev->private, &sector_offset);
656 -       tmp_dev = map_sector(mddev, zone, bio->bi_iter.bi_sector,
657 -                            &sector_offset);

694 +               zone = find_zone(mddev->private, &sector);
695 +               tmp_dev = map_sector(mddev, zone, sector, &sector);
At line 695 of this patch, the third parameter of map_sector() is fixed
to 'sector', which is the correct value containing the sector offset
inside the corresponding zone.

But this patch implicitly *changes* the md raid0 on-disk layout. If an
md raid0 has component disks with *different* sizes, it will contain
multiple zones. If such a multi-zone raid0 device was created before
Linux v3.13, all data chunks after the first zone will be mapped to
different locations by kernels from Linux v3.13 onward. The result is
that data written to LBAs beyond the first zone will appear corrupted.
A worse case is when the md raid0 has data chunks filled into the first
zone under Linux v3.12 (or an earlier kernel), is then updated to Linux
v3.13 (or a later kernel), and more data chunks are filled into the
second and subsequent zones. Then under neither Linux v3.12 nor Linux
v3.13 is all the data readable; some data is always corrupted.
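
To make the divergence concrete, here is a minimal userspace model of
the device-selection step only. The geometry is invented for
illustration: three disks of 1, 2 and 2 chunks, so zone 0 holds array
chunks 0-2 across all three disks and zone 1 holds array chunks 3-4
across the two larger disks,

#include <stdio.h>

int main(void)
{
        unsigned zone1_first_chunk = 3; /* array chunk where zone 1 starts */
        unsigned zone1_nb_dev = 2;      /* devices backing zone 1 */

        for (unsigned c = 3; c <= 4; c++) {
                /* pre-v3.13 rule: device picked from the array-absolute chunk */
                unsigned old_dev = c % zone1_nb_dev;
                /* v3.13+ rule: device picked from the zone-relative chunk */
                unsigned new_dev = (c - zone1_first_chunk) % zone1_nb_dev;

                printf("array chunk %u: old -> dev %u, new -> dev %u\n",
                       c, old_dev, new_dev);
        }
        return 0;
}

Both rules stripe across the zone, but chunks 3 and 4 swap devices
between the two rules, which is exactly the inconsistency described
above.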

Currently there is no way to tell whether an md raid0 device is mapped
with the wrong calculation of kernels up to and including Linux v3.12,
or with the correct calculation of kernels from Linux v3.13 onward. If
a multi-zone md raid0 device is created and used across these kernel
versions, the two mapping calculations may generate different,
inconsistent on-disk layouts in different md raid0 zones, resulting in
data corruption.

For our enterprise Linux products we can handle this properly for a
small number of product kernels. But for upstream and stable kernels, I
have no idea how to fix this ugly problem in a generic way.

Neil Brown discussed this with me offline; he proposed a temporary
workaround that only permits assembling md raid0 devices whose component
disks have identical sizes, and rejects assembly when the component
disks have different sizes. We can drop this workaround once there is a
proper way to fix the problem.

I suggest our developer community work together on a solution; that is
why I am posting this email for your comments.

Thanks in advance.

-- 

Coly Li


* Re: [RFC] How to handle an ugly md raid0 sector map bug ?
  2019-08-22 10:38 [RFC] How to handle an ugly md raid0 sector map bug ? Coly Li
@ 2019-08-23  0:02 ` NeilBrown
  2019-08-23 16:37   ` Song Liu
From: NeilBrown @ 2019-08-23  0:02 UTC
  To: Coly Li, Song Liu, linux-raid; +Cc: linux-block


On Thu, Aug 22 2019, Coly Li wrote:

> Hi folks,
>
> Up front: this bug only affects md raid0 devices that meet all of the
> following conditions,
> 1) Assembled from component disks with different sizes.
> 2) Created and used under a kernel up to and including Linux v3.12,
> then upgraded to Linux v3.13 or later.
> 3) New data are written to the md raid0 under the new kernel (>= Linux v3.13).
> Such an md raid0 may have an inconsistent sector mapping and experience
> data corruption.
>
> Recently I received a bug report that a customer encountered file system
> corruption after upgrading their kernel from Linux 3.12 to 4.4. It turned
> out to be corruption of the underlying md raid0 after the kernel upgrade.
>
> I found it is caused by a sector mapping bug in the md raid0 code up to
> and including Linux v3.12. Here is the buggy code piece, copied from
> stable Linux v3.12.74 drivers/md/raid0.c:raid0_make_request(),
>
> 547         sector_offset = bio->bi_sector;
> 548         zone = find_zone(mddev->private, &sector_offset);
> 549         tmp_dev = map_sector(mddev, zone, bio->bi_sector,
> 550                              &sector_offset);

I don't think this code is buggy.  The mapping may not be the mapping
you would expect, but it is the mapping that md/raid0 had always used up
to this time.

>
> At line 548, after find_zone() returns, sector_offset has been updated
> to an offset inside the current zone. Then at line 549 the third
> parameter of map_sector() should be this updated sector_offset, but
> bio->bi_sector (the original LBA of the md raid0 device) is used
> instead. If the raid0 device has *multiple zones*, the <dev, sector>
> pair returned by map_sector() is unexpected and wrong for every zone
> except the first.
>
> The buggy code was introduced in Linux v2.6.31 by commit fbb704efb784
> ("md: raid0 :Enables chunk size other than powers of 2."). Unfortunately
> the mistaken mapping calculation also produces a stable and unique
> result, so it worked without obvious problems until commit 20d0189b1012
> ("block: Introduce new bio_split()") was merged into Linux v3.13.
>
> That commit fixed the mistaken mapping with the following lines of change,
> 654 -       sector_offset = bio->bi_iter.bi_sector;
> 655 -       zone = find_zone(mddev->private, &sector_offset);
> 656 -       tmp_dev = map_sector(mddev, zone, bio->bi_iter.bi_sector,
> 657 -                            &sector_offset);
>
> 694 +               zone = find_zone(mddev->private, &sector);
> 695 +               tmp_dev = map_sector(mddev, zone, sector, &sector);
> At line 695 of this patch, the third parameter of map_sector() is fixed
> to 'sector', which is the correct value containing the sector offset
> inside the corresponding zone.

This is buggy because, as you say, the third argument to map_sector has
changed.
Previously it was bio->bi_iter.bi_sector.  Now it is 'sector' which
find_zone has just modified.

>
> But this patch implicitly *changes* the md raid0 on-disk layout. If an
> md raid0 has component disks with *different* sizes, it will contain
> multiple zones. If such a multi-zone raid0 device was created before
> Linux v3.13, all data chunks after the first zone will be mapped to
> different locations by kernels from Linux v3.13 onward. The result is
> that data written to LBAs beyond the first zone will appear corrupted.
> A worse case is when the md raid0 has data chunks filled into the first
> zone under Linux v3.12 (or an earlier kernel), is then updated to Linux
> v3.13 (or a later kernel), and more data chunks are filled into the
> second and subsequent zones. Then under neither Linux v3.12 nor Linux
> v3.13 is all the data readable; some data is always corrupted.
>
> Currently there is no way to tell whether an md raid0 device is mapped
> with the wrong calculation of kernels up to and including Linux v3.12,
> or with the correct calculation of kernels from Linux v3.13 onward. If
> a multi-zone md raid0 device is created and used across these kernel
> versions, the two mapping calculations may generate different,
> inconsistent on-disk layouts in different md raid0 zones, resulting in
> data corruption.
>
> For our enterprise Linux products we can handle this properly for a
> small number of product kernels. But for upstream and stable kernels, I
> have no idea how to fix this ugly problem in a generic way.
>
> Neil Brown discussed this with me offline; he proposed a temporary
> workaround that only permits assembling md raid0 devices whose component
> disks have identical sizes, and rejects assembly when the component
> disks have different sizes. We can drop this workaround once there is a
> proper way to fix the problem.
>
> I suggest our developer community work together on a solution; that is
> why I am posting this email for your comments.

There are four separate cases that we need to consider:
 - v1.x metadata
 - v0.90 metadata
 - LVM metadata (via dm-raid)
 - no metadata (array created with "mdadm --build").

For v1.x metadata, I think we can add a new "feature_map" flag.
If this flag isn't set, raid0 with non-uniform device sizes will not be
assembled.
If it is set, then:
  if 'layout' is 0, use the old mapping
  if 'layout' is 1, use the new mapping

For v0.90 metadata we don't have feature-flags.  The gvalid_words field
is unused and always set to zero, so we could start storing some
feature bits there.

For LVM/dm-raid, I suspect it doesn't support devices of varying
sizes, but we would need to check.

For "no metadata" arrays ... we could possibly just stop supporting
them - I doubt they are used much.
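
Concretely, for the v1.x case the assemble-time decision might look
something like the sketch below; the flag value and all names here are
placeholders for illustration, not a settled design,

#include <stdbool.h>
#include <errno.h>

#define SKETCH_FEATURE_RAID0_LAYOUT 4096   /* hypothetical feature bit */

/* Return 0 and set *use_new_mapping, or -EINVAL if assembly must be
 * refused (multi-zone array without the feature flag set). */
static int sketch_pick_layout(int nr_zones, unsigned int feature_map,
                              unsigned int layout, bool *use_new_mapping)
{
        if (nr_zones == 1) {
                *use_new_mapping = false;   /* layouts agree; pick either */
                return 0;
        }
        if (!(feature_map & SKETCH_FEATURE_RAID0_LAYOUT))
                return -EINVAL;             /* refuse to assemble */
        *use_new_mapping = (layout == 1);   /* 0 -> old, 1 -> new */
        return 0;
}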

Once this is decided, we then need to work out how to
make it all usable via mdadm.

mdadm needs to set the relevant flags and 'layout' values
when creating an array, and it needs to be able to set them correctly
on an existing array that doesn't have them set.

We also need to decide what happens when a new mdadm is run on an old
kernel.
If we create an array with a new feature bit, the old kernel won't
be able to assemble it.  But if we don't set the feature bit, then
the array will not assemble after an upgrade.

It might be easier to think about this when we actually have concrete
code, but if anyone does have useful input, I'd be very happy to read
it.

NeilBrown



* Re: [RFC] How to handle an ugly md raid0 sector map bug ?
  2019-08-23  0:02 ` NeilBrown
@ 2019-08-23 16:37   ` Song Liu
  2019-08-23 17:03     ` Coly Li
From: Song Liu @ 2019-08-23 16:37 UTC
  To: NeilBrown; +Cc: Coly Li, linux-raid, linux-block

Thanks Coly and Neil. 

> On Aug 22, 2019, at 5:02 PM, NeilBrown <neilb@suse.com> wrote:
> 
> On Thu, Aug 22 2019, Coly Li wrote:
> 
>> [...]
> 
> There are four separate cases that we need to consider:
> - v1.x metadata
> - v0.90 metadata
> - LVM metadata (via dm-raid)
> - no metadata (array created with "mdadm --build").
> 
> For v1.x metadata, I think we can add a new "feature_map" flag.
> If this flag isn't set, raid0 with non-uniform device sizes will not be
> assembled.
> If it is set, then:
>  if 'layout' is 0, use the old mapping
>  if 'layout' is 1, use the new mapping
> 
> For v0.90 metadata we don't have feature-flags.  The gvalid_words field
> is unused and always set to zero, so we could start storing some
> feature bits there.
> 
> For LVM/dm-raid, I suspect it doesn't support devices of varying
> sizes, but we would need to check.
> 
> For "no metadata" arrays ... we could possibly just stop supporting
> them - I doubt they are used much.

So for an existing array, we really cannot tell whether it is broken or 
not, right? If this is the case, we only need to worry about new arrays.

For new arrays, I guess we can only allow v1.x raid0 to have non-uniform
device sizes, and use the new feature_map bit. 

Would this work? If so, we only have 1 case to work on. 

Thanks,
Song


* Re: [RFC] How to handle an ugly md raid0 sector map bug ?
  2019-08-23 16:37   ` Song Liu
@ 2019-08-23 17:03     ` Coly Li
  2019-08-23 17:17       ` Song Liu
From: Coly Li @ 2019-08-23 17:03 UTC
  To: Song Liu; +Cc: NeilBrown, linux-raid, linux-block

On 2019/8/24 12:37 AM, Song Liu wrote:
> Thanks Coly and Neil. 
> 
>> On Aug 22, 2019, at 5:02 PM, NeilBrown <neilb@suse.com> wrote:
>>
>> On Thu, Aug 22 2019, Coly Li wrote:
>>
>>> [...]
> 
> So for an existing array, we really cannot tell whether it is broken or 
> not, right? If this is the case, we only need to worry about new arrays.
> 
> For new arrays, I guess we can only allow v1.x raid0 to have non-uniform
> device sizes, and use the new feature_map bit. 
> 
> Would this work? If so, we only have 1 case to work on. 

It seems v1.2 support has been present since Linux v2.16, so it may also
have this problem for multiple zones.


-- 

Coly Li


* Re: [RFC] How to handle an ugly md raid0 sector map bug ?
  2019-08-23 17:03     ` Coly Li
@ 2019-08-23 17:17       ` Song Liu
  2019-08-23 17:47         ` Coly Li
From: Song Liu @ 2019-08-23 17:17 UTC
  To: Coly Li; +Cc: NeilBrown, linux-raid, linux-block



> On Aug 23, 2019, at 10:03 AM, Coly Li <colyli@suse.de> wrote:
> 
> On 2019/8/24 12:37 AM, Song Liu wrote:
>> Thanks Coly and Neil. 
>> 
>>> On Aug 22, 2019, at 5:02 PM, NeilBrown <neilb@suse.com> wrote:
>>> 
>>> On Thu, Aug 22 2019, Coly Li wrote:
>>> 
>>>> [...]
>> 
>> So for an existing array, we really cannot tell whether it is broken or 
>> not, right? If this is the case, we only need to worry about new arrays.
>> 
>> For new arrays, I guess we can only allow v1.x raid0 to have non-uniform
>> device sizes, and use the new feature_map bit. 
>> 
>> Would this work? If so, we only have 1 case to work on. 
> 
> It seems v1.2 support has been present since Linux v2.16, so it may also
> have this problem for multiple zones.
> 

For v1.2 metadata, we still need the feature_map bit, meaning this
non-uniform array is safe to assemble. If the array doesn't have
this bit and has non-uniform sizes, we refuse to assemble it.

Thanks,
Song






* Re: [RFC] How to handle an ugly md raid0 sector map bug ?
  2019-08-23 17:17       ` Song Liu
@ 2019-08-23 17:47         ` Coly Li
  2019-09-09  6:57           ` [PATCH] md/raid0: avoid RAID0 data corruption due to layout confusion NeilBrown
From: Coly Li @ 2019-08-23 17:47 UTC
  To: Song Liu; +Cc: NeilBrown, linux-raid, linux-block

On 2019/8/24 1:17 AM, Song Liu wrote:
> 
> 
>> On Aug 23, 2019, at 10:03 AM, Coly Li <colyli@suse.de> wrote:
>>
>> On 2019/8/24 12:37 AM, Song Liu wrote:
>>> Thanks Coly and Neil. 
>>>
>>>> On Aug 22, 2019, at 5:02 PM, NeilBrown <neilb@suse.com> wrote:
>>>>
>>>> On Thu, Aug 22 2019, Coly Li wrote:
>>>>
>>>>> [...]
>>>
>>> So for an existing array, we really cannot tell whether it is broken or 
>>> not, right? If this is the case, we only need to worry about new arrays.
>>>
>>> For new arrays, I guess we can only allow v1.x raid0 to have non-uniform
>>> device sizes, and use the new feature_map bit. 
>>>
>>> Would this work? If so, we only have 1 case to work on. 
>>
>> It seems v1.2 support has been present since Linux v2.16, so it may also
>> have this problem for multiple zones.
>>
> 
> For v1.2 metadata, we still need the feature_map bit, meaning this
> non-uniform array is safe to assemble. If the array doesn't have
> this bit and has non-uniform sizes, we refuse to assemble it.

Yes, it makes sense. I understand you now :-)

-- 

Coly Li


* [PATCH] md/raid0: avoid RAID0 data corruption due to layout confusion.
  2019-08-23 17:47         ` Coly Li
@ 2019-09-09  6:57           ` NeilBrown
  2019-09-09  6:58             ` [PATCH 2/2] md: add feature flag MD_FEATURE_RAID0_LAYOUT NeilBrown
                               ` (2 more replies)
From: NeilBrown @ 2019-09-09  6:57 UTC
  To: Coly Li, Song Liu; +Cc: NeilBrown, linux-block, linux-raid



If the drives in a RAID0 are not all the same size, the array is
divided into zones.
The first zone covers all drives, to the size of the smallest.
The second zone covers all drives larger than the smallest, up to
the size of the second smallest - etc.
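
To illustrate the zone geometry with invented sizes (not taken from
any real array), a small userspace sketch,

#include <stdio.h>

int main(void)
{
        /* hypothetical device sizes in chunks, sorted ascending */
        unsigned size[] = { 4, 8, 8, 16 };
        unsigned ndev = 4, prev = 0;

        for (unsigned i = 0; i < ndev; i++) {
                if (size[i] == prev)
                        continue;             /* same size: no new zone */
                printf("zone: %u device(s), %u chunks deep\n",
                       ndev - i, size[i] - prev);
                prev = size[i];
        }
        return 0;
}

This prints a 4-device zone 4 chunks deep, a 3-device zone 4 chunks
deep, and a 1-device zone 8 chunks deep.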

A change in Linux 3.14 unintentionally changed the layout for the
second and subsequent zones.  All the correct data is still stored, but
each chunk may be assigned to a different device than in pre-3.14 kernels.
This can lead to data corruption.

It is not possible to determine what layout to use - it depends on which
kernel the data was written by.
So we add a module parameter to allow the old (0) or new (1) layout to be
specified, and refuse to assemble an affected array if that parameter is
not set.

Fixes: 20d0189b1012 ("block: Introduce new bio_split()")
cc: stable@vger.kernel.org (3.14+)
Signed-off-by: NeilBrown <neilb@suse.de>
---

This and the next patch are my proposal for how to address
this problem.  I haven't actually tested .....

NeilBrown

 drivers/md/raid0.c | 28 +++++++++++++++++++++++++++-
 drivers/md/raid0.h | 14 ++++++++++++++
 2 files changed, 41 insertions(+), 1 deletion(-)

diff --git a/drivers/md/raid0.c b/drivers/md/raid0.c
index bf5cf184a260..a8888c12308a 100644
--- a/drivers/md/raid0.c
+++ b/drivers/md/raid0.c
@@ -19,6 +19,9 @@
 #include "raid0.h"
 #include "raid5.h"
 
+static int default_layout = -1;
+module_param(default_layout, int, 0644);
+
 #define UNSUPPORTED_MDDEV_FLAGS		\
 	((1L << MD_HAS_JOURNAL) |	\
 	 (1L << MD_JOURNAL_CLEAN) |	\
@@ -139,6 +142,19 @@ static int create_strip_zones(struct mddev *mddev, struct r0conf **private_conf)
 	}
 	pr_debug("md/raid0:%s: FINAL %d zones\n",
 		 mdname(mddev), conf->nr_strip_zones);
+
+	if (conf->nr_strip_zones == 1) {
+		conf->layout = RAID0_ORIG_LAYOUT;
+	} else if (default_layout == RAID0_ORIG_LAYOUT ||
+		   default_layout == RAID0_ALT_MULTIZONE_LAYOUT) {
+		conf->layout = default_layout;
+	} else {
+		pr_err("md/raid0:%s: cannot assemble multi-zone RAID0 with default_layout setting\n",
+		       mdname(mddev));
+		pr_err("md/raid0: please set raid.default_layout to 0 or 1\n");
+		err = -ENOTSUPP;
+		goto abort;
+	}
 	/*
 	 * now since we have the hard sector sizes, we can make sure
 	 * chunk size is a multiple of that sector size
@@ -547,10 +563,12 @@ static void raid0_handle_discard(struct mddev *mddev, struct bio *bio)
 
 static bool raid0_make_request(struct mddev *mddev, struct bio *bio)
 {
+	struct r0conf *conf = mddev->private;
 	struct strip_zone *zone;
 	struct md_rdev *tmp_dev;
 	sector_t bio_sector;
 	sector_t sector;
+	sector_t orig_sector;
 	unsigned chunk_sects;
 	unsigned sectors;
 
@@ -584,8 +602,16 @@ static bool raid0_make_request(struct mddev *mddev, struct bio *bio)
 		bio = split;
 	}
 
+	orig_sector = sector;
 	zone = find_zone(mddev->private, &sector);
-	tmp_dev = map_sector(mddev, zone, sector, &sector);
+	switch (conf->layout) {
+	case RAID0_ORIG_LAYOUT:
+		tmp_dev = map_sector(mddev, zone, orig_sector, &sector);
+		break;
+	case RAID0_ALT_MULTIZONE_LAYOUT:
+		tmp_dev = map_sector(mddev, zone, sector, &sector);
+		break;
+	}
 	bio_set_dev(bio, tmp_dev->bdev);
 	bio->bi_iter.bi_sector = sector + zone->dev_start +
 		tmp_dev->data_offset;
diff --git a/drivers/md/raid0.h b/drivers/md/raid0.h
index 540e65d92642..3816e5477db1 100644
--- a/drivers/md/raid0.h
+++ b/drivers/md/raid0.h
@@ -8,11 +8,25 @@ struct strip_zone {
 	int	 nb_dev;	/* # of devices attached to the zone */
 };
 
+/* Linux 3.14 (20d0189b1012) made an unintended change to
+ * the RAID0 layout for multi-zone arrays (where devices aren't all
+ * the same size).
+ * RAID0_ORIG_LAYOUT restores the original layout
+ * RAID0_ALT_MULTIZONE_LAYOUT uses the altered layout
+ * The layouts are identical when there is only one zone (all
+ * devices the same size).
+ */
+
+enum r0layout {
+	RAID0_ORIG_LAYOUT = 1,
+	RAID0_ALT_MULTIZONE_LAYOUT = 2,
+};
 struct r0conf {
 	struct strip_zone	*strip_zone;
 	struct md_rdev		**devlist; /* lists of rdevs, pointed to
 					    * by strip_zone->dev */
 	int			nr_strip_zones;
+	enum r0layout		layout;
 };
 
 #endif
-- 
2.14.0.rc0.dirty




* [PATCH 2/2] md: add feature flag MD_FEATURE_RAID0_LAYOUT
  2019-09-09  6:57           ` [PATCH] md/raid0: avoid RAID0 data corruption due to layout confusion NeilBrown
@ 2019-09-09  6:58             ` NeilBrown
  2019-09-09 15:33               ` Guoqing Jiang
  2019-09-09 14:56             ` [PATCH] md/raid0: avoid RAID0 data corruption due to layout confusion Song Liu
  2019-09-09 15:09             ` Guoqing Jiang
From: NeilBrown @ 2019-09-09  6:58 UTC
  To: Coly Li, Song Liu; +Cc: NeilBrown, linux-block, linux-raid



Due to a bug introduced in Linux 3.14 we cannot determine the
correct layout for a multi-zone RAID0 array - there are two
possibiities.

It is possible to tell the kernel which to choose using a module
parameter, but this can be clumsy to use.  It would be best if
the choice were recorded in the metadata.
So add a feature flag for this purpose.
If it is set, then the 'layout' field of the superblock is used
to determine which layout to use.

If this flag is not set, then mddev->layout gets set to -1,
which causes the module parameter to be required.

Signed-off-by: NeilBrown <neilb@suse.de>
---
 drivers/md/md.c                | 13 +++++++++++++
 drivers/md/raid0.c             |  2 ++
 include/uapi/linux/raid/md_p.h |  2 ++
 3 files changed, 17 insertions(+)

diff --git a/drivers/md/md.c b/drivers/md/md.c
index 1f70ec595282..a41fce7f8b4c 100644
--- a/drivers/md/md.c
+++ b/drivers/md/md.c
@@ -1232,6 +1232,8 @@ static int super_90_validate(struct mddev *mddev, struct md_rdev *rdev)
 			mddev->new_layout = mddev->layout;
 			mddev->new_chunk_sectors = mddev->chunk_sectors;
 		}
+		if (mddev->level == 0)
+			mddev->layout = -1;
 
 		if (sb->state & (1<<MD_SB_CLEAN))
 			mddev->recovery_cp = MaxSector;
@@ -1647,6 +1649,10 @@ static int super_1_load(struct md_rdev *rdev, struct md_rdev *refdev, int minor_
 		rdev->ppl.sector = rdev->sb_start + rdev->ppl.offset;
 	}
 
+	if ((le32_to_cpu(sb->feature_map) & MD_FEATURE_RAID0_LAYOUT) &&
+	    sb->level != 0)
+		return -EINVAL;
+
 	if (!refdev) {
 		ret = 1;
 	} else {
@@ -1757,6 +1763,10 @@ static int super_1_validate(struct mddev *mddev, struct md_rdev *rdev)
 			mddev->new_chunk_sectors = mddev->chunk_sectors;
 		}
 
+		if (mddev->level == 0 &&
+		    !(le32_to_cpu(sb->feature_map) & MD_FEATURE_RAID0_LAYOUT))
+			mddev->layout = -1;
+
 		if (le32_to_cpu(sb->feature_map) & MD_FEATURE_JOURNAL)
 			set_bit(MD_HAS_JOURNAL, &mddev->flags);
 
@@ -6852,6 +6862,9 @@ static int set_array_info(struct mddev *mddev, mdu_array_info_t *info)
 	mddev->external	     = 0;
 
 	mddev->layout        = info->layout;
+	if (mddev->level == 0)
+		/* Cannot trust RAID0 layout info here */
+		mddev->layout = -1;
 	mddev->chunk_sectors = info->chunk_size >> 9;
 
 	if (mddev->persistent) {
diff --git a/drivers/md/raid0.c b/drivers/md/raid0.c
index a8888c12308a..6f095b0b6f5c 100644
--- a/drivers/md/raid0.c
+++ b/drivers/md/raid0.c
@@ -145,6 +145,8 @@ static int create_strip_zones(struct mddev *mddev, struct r0conf **private_conf)
 
 	if (conf->nr_strip_zones == 1) {
 		conf->layout = RAID0_ORIG_LAYOUT;
+	} else if (mddev->layout == RAID0_ORIG_LAYOUT ||
+		   mddev->layout == RAID0_ALT_MULTIZONE_LAYOUT) {
 	} else if (default_layout == RAID0_ORIG_LAYOUT ||
 		   default_layout == RAID0_ALT_MULTIZONE_LAYOUT) {
 		conf->layout = default_layout;
diff --git a/include/uapi/linux/raid/md_p.h b/include/uapi/linux/raid/md_p.h
index b0d15c73f6d7..1f2d8c81f0e0 100644
--- a/include/uapi/linux/raid/md_p.h
+++ b/include/uapi/linux/raid/md_p.h
@@ -329,6 +329,7 @@ struct mdp_superblock_1 {
 #define	MD_FEATURE_JOURNAL		512 /* support write cache */
 #define	MD_FEATURE_PPL			1024 /* support PPL */
 #define	MD_FEATURE_MULTIPLE_PPLS	2048 /* support for multiple PPLs */
+#define	MD_FEATURE_RAID0_LAYOUT		4096 /* layout is meaningful for RAID0 */
 #define	MD_FEATURE_ALL			(MD_FEATURE_BITMAP_OFFSET	\
 					|MD_FEATURE_RECOVERY_OFFSET	\
 					|MD_FEATURE_RESHAPE_ACTIVE	\
@@ -341,6 +342,7 @@ struct mdp_superblock_1 {
 					|MD_FEATURE_JOURNAL		\
 					|MD_FEATURE_PPL			\
 					|MD_FEATURE_MULTIPLE_PPLS	\
+					|MD_FEATURE_RAID0_LAYOUT	\
 					)
 
 struct r5l_payload_header {
-- 
2.14.0.rc0.dirty




* Re: [PATCH] md/raid0: avoid RAID0 data corruption due to layout confusion.
  2019-09-09  6:57           ` [PATCH] md/raid0: avoid RAID0 data corruption due to layout confusion NeilBrown
  2019-09-09  6:58             ` [PATCH 2/2] md: add feature flag MD_FEATURE_RAID0_LAYOUT NeilBrown
@ 2019-09-09 14:56             ` Song Liu
  2019-09-09 23:33               ` NeilBrown
  2019-09-09 15:09             ` Guoqing Jiang
  2 siblings, 1 reply; 19+ messages in thread
From: Song Liu @ 2019-09-09 14:56 UTC
  To: NeilBrown; +Cc: Coly Li, NeilBrown, linux-block, linux-raid

Hi Neil,

> On Sep 9, 2019, at 7:57 AM, NeilBrown <neilb@suse.de> wrote:
> 
> 
> If the drives in a RAID0 are not all the same size, the array is
> divided into zones.
> The first zone covers all drives, to the size of the smallest.
> The second zone covers all drives larger than the smallest, up to
> the size of the second smallest - etc.
> 
> A change in Linux 3.14 unintentionally changed the layout for the
> second and subsequent zones.  All the correct data is still stored, but
> each chunk may be assigned to a different device than in pre-3.14 kernels.
> This can lead to data corruption.
> 
> It is not possible to determine what layout to use - it depends on which
> kernel the data was written by.
> So we add a module parameter to allow the old (0) or new (1) layout to be
> specified, and refuse to assemble an affected array if that parameter is
> not set.
> 
> Fixes: 20d0189b1012 ("block: Introduce new bio_split()")
> cc: stable@vger.kernel.org (3.14+)
> Signed-off-by: NeilBrown <neilb@suse.de>

Thanks for the patches. They look great. However, I am having problems
applying them (not sure whether it is a problem on my side). Could you
please push them somewhere so I can use cherry-pick instead?

Thanks,
Song


* Re: [PATCH] md/raid0: avoid RAID0 data corruption due to layout confusion.
  2019-09-09  6:57           ` [PATCH] md/raid0: avoid RAID0 data corruption due to layout confusion NeilBrown
  2019-09-09  6:58             ` [PATCH 2/2] md: add feature flag MD_FEATURE_RAID0_LAYOUT NeilBrown
  2019-09-09 14:56             ` [PATCH] md/raid0: avoid RAID0 data corruption due to layout confusion Song Liu
@ 2019-09-09 15:09             ` Guoqing Jiang
  2019-09-09 23:34               ` NeilBrown
From: Guoqing Jiang @ 2019-09-09 15:09 UTC
  To: NeilBrown, Coly Li, Song Liu; +Cc: NeilBrown, linux-block, linux-raid



On 9/9/19 8:57 AM, NeilBrown wrote:
> 
> If the drives in a RAID0 are not all the same size, the array is
> divided into zones.
> The first zone covers all drives, to the size of the smallest.
> The second zone covers all drives larger than the smallest, up to
> the size of the second smallest - etc.
> 
> A change in Linux 3.14 unintentionally changed the layout for the
> second and subsequent zones.  All the correct data is still stored, but
> each chunk may be assigned to a different device than in pre-3.14 kernels.
> This can lead to data corruption.
> 
> It is not possible to determine what layout to use - it depends on which
> kernel the data was written by.
> So we add a module parameter to allow the old (0) or new (1) layout to be
> specified, and refuse to assemble an affected array if that parameter is
> not set.
> 
> Fixes: 20d0189b1012 ("block: Introduce new bio_split()")
> cc: stable@vger.kernel.org (3.14+)
> Signed-off-by: NeilBrown <neilb@suse.de>
> ---
> 
> This and the next patch are my proposal for how to address
> this problem.  I haven't actually tested .....
> 
> NeilBrown
> 
>   drivers/md/raid0.c | 28 +++++++++++++++++++++++++++-
>   drivers/md/raid0.h | 14 ++++++++++++++
>   2 files changed, 41 insertions(+), 1 deletion(-)
> 
> diff --git a/drivers/md/raid0.c b/drivers/md/raid0.c
> index bf5cf184a260..a8888c12308a 100644
> --- a/drivers/md/raid0.c
> +++ b/drivers/md/raid0.c
> @@ -19,6 +19,9 @@
>   #include "raid0.h"
>   #include "raid5.h"
>   
> +static int default_layout = -1;
> +module_param(default_layout, int, 0644);
> +
>   #define UNSUPPORTED_MDDEV_FLAGS		\
>   	((1L << MD_HAS_JOURNAL) |	\
>   	 (1L << MD_JOURNAL_CLEAN) |	\
> @@ -139,6 +142,19 @@ static int create_strip_zones(struct mddev *mddev, struct r0conf **private_conf)
>   	}
>   	pr_debug("md/raid0:%s: FINAL %d zones\n",
>   		 mdname(mddev), conf->nr_strip_zones);
> +
> +	if (conf->nr_strip_zones == 1) {
> +		conf->layout = RAID0_ORIG_LAYOUT;
> +	} else if (default_layout == RAID0_ORIG_LAYOUT ||
> +		   default_layout == RAID0_ALT_MULTIZONE_LAYOUT) {
> +		conf->layout = default_layout;
> +	} else {
> +		pr_err("md/raid0:%s: cannot assemble multi-zone RAID0 with default_layout setting\n",
> +		       mdname(mddev));
> +		pr_err("md/raid0: please set raid.default_layout to 0 or 1\n");

Maybe "1 or 2" to align with the definition of below r0layout?

[snip]

> +enum r0layout {
> +	RAID0_ORIG_LAYOUT = 1,
> +	RAID0_ALT_MULTIZONE_LAYOUT = 2,
> +};
>   struct r0conf {
>   	struct strip_zone	*strip_zone;
>   	struct md_rdev		**devlist; /* lists of rdevs, pointed to
>   					    * by strip_zone->dev */
>   	int			nr_strip_zones;
> +	enum r0layout		layout;
>   };
>   
>   #endif
> 

Thanks,
Guoqing


* Re: [PATCH 2/2] md: add feature flag MD_FEATURE_RAID0_LAYOUT
  2019-09-09  6:58             ` [PATCH 2/2] md: add feature flag MD_FEATURE_RAID0_LAYOUT NeilBrown
@ 2019-09-09 15:33               ` Guoqing Jiang
  2019-09-09 23:26                 ` NeilBrown
From: Guoqing Jiang @ 2019-09-09 15:33 UTC
  To: NeilBrown, Coly Li, Song Liu; +Cc: NeilBrown, linux-block, linux-raid

Hi Neil,

On 9/9/19 8:58 AM, NeilBrown wrote:
> 
> Due to a bug introduced in Linux 3.14 we cannot determine the
> correct layout for a multi-zone RAID0 array - there are two
> possibiities.

possibilities.

> 
> It is possible to tell the kernel which to choose using a module
> parameter, but this can be clumsy to use.  It would be best if
> the choice were recorded in the metadata.
> So add a feature flag for this purpose.
> If it is set, then the 'layout' field of the superblock is used
> to determine which layout to use.
> 
> If this flag is not set, then mddev->layout gets set to -1,
> which causes the module parameter to be required.

Could you point out where the flag is set? Thanks.

> 
> Signed-off-by: NeilBrown <neilb@suse.de>
> ---
>   drivers/md/md.c                | 13 +++++++++++++
>   drivers/md/raid0.c             |  2 ++
>   include/uapi/linux/raid/md_p.h |  2 ++
>   3 files changed, 17 insertions(+)
> 
> diff --git a/drivers/md/md.c b/drivers/md/md.c
> index 1f70ec595282..a41fce7f8b4c 100644
> --- a/drivers/md/md.c
> +++ b/drivers/md/md.c
> @@ -1232,6 +1232,8 @@ static int super_90_validate(struct mddev *mddev, struct md_rdev *rdev)
>   			mddev->new_layout = mddev->layout;
>   			mddev->new_chunk_sectors = mddev->chunk_sectors;
>   		}
> +		if (mddev->level == 0)
> +			mddev->layout = -1;
>   
>   		if (sb->state & (1<<MD_SB_CLEAN))
>   			mddev->recovery_cp = MaxSector;
> @@ -1647,6 +1649,10 @@ static int super_1_load(struct md_rdev *rdev, struct md_rdev *refdev, int minor_
>   		rdev->ppl.sector = rdev->sb_start + rdev->ppl.offset;
>   	}
>   
> +	if ((le32_to_cpu(sb->feature_map) & MD_FEATURE_RAID0_LAYOUT) &&
> +	    sb->level != 0)
> +		return -EINVAL;
> +
>   	if (!refdev) {
>   		ret = 1;
>   	} else {
> @@ -1757,6 +1763,10 @@ static int super_1_validate(struct mddev *mddev, struct md_rdev *rdev)
>   			mddev->new_chunk_sectors = mddev->chunk_sectors;
>   		}
>   
> +		if (mddev->level == 0 &&
> +		    !(le32_to_cpu(sb->feature_map) & MD_FEATURE_RAID0_LAYOUT))
> +			mddev->layout = -1;
> +
>   		if (le32_to_cpu(sb->feature_map) & MD_FEATURE_JOURNAL)
>   			set_bit(MD_HAS_JOURNAL, &mddev->flags);
>   
> @@ -6852,6 +6862,9 @@ static int set_array_info(struct mddev *mddev, mdu_array_info_t *info)
>   	mddev->external	     = 0;
>   
>   	mddev->layout        = info->layout;
> +	if (mddev->level == 0)
> +		/* Cannot trust RAID0 layout info here */
> +		mddev->layout = -1;
>   	mddev->chunk_sectors = info->chunk_size >> 9;
>   
>   	if (mddev->persistent) {
> diff --git a/drivers/md/raid0.c b/drivers/md/raid0.c
> index a8888c12308a..6f095b0b6f5c 100644
> --- a/drivers/md/raid0.c
> +++ b/drivers/md/raid0.c
> @@ -145,6 +145,8 @@ static int create_strip_zones(struct mddev *mddev, struct r0conf **private_conf)
>   
>   	if (conf->nr_strip_zones == 1) {
>   		conf->layout = RAID0_ORIG_LAYOUT;
> +	} else if (mddev->layout == RAID0_ORIG_LAYOUT ||
> +		   mddev->layout == RAID0_ALT_MULTIZONE_LAYOUT) {

Maybe "conf->layout = mddev->layout" here? Otherwise seems conf->layout is not set accordingly, just 
my 2 cents.

>   	} else if (default_layout == RAID0_ORIG_LAYOUT ||
>   		   default_layout == RAID0_ALT_MULTIZONE_LAYOUT) {
>   		conf->layout = default_layout;
> diff --git a/include/uapi/linux/raid/md_p.h b/include/uapi/linux/raid/md_p.h
> index b0d15c73f6d7..1f2d8c81f0e0 100644
> --- a/include/uapi/linux/raid/md_p.h
> +++ b/include/uapi/linux/raid/md_p.h
> @@ -329,6 +329,7 @@ struct mdp_superblock_1 {
>   #define	MD_FEATURE_JOURNAL		512 /* support write cache */
>   #define	MD_FEATURE_PPL			1024 /* support PPL */
>   #define	MD_FEATURE_MULTIPLE_PPLS	2048 /* support for multiple PPLs */
> +#define	MD_FEATURE_RAID0_LAYOUT		4096 /* layout is meaningful for RAID0 */
>   #define	MD_FEATURE_ALL			(MD_FEATURE_BITMAP_OFFSET	\
>   					|MD_FEATURE_RECOVERY_OFFSET	\
>   					|MD_FEATURE_RESHAPE_ACTIVE	\
> @@ -341,6 +342,7 @@ struct mdp_superblock_1 {
>   					|MD_FEATURE_JOURNAL		\
>   					|MD_FEATURE_PPL			\
>   					|MD_FEATURE_MULTIPLE_PPLS	\
> +					|MD_FEATURE_RAID0_LAYOUT	\
>   					)
>   
>   struct r5l_payload_header {
> 

Thanks,
Guoqing

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [PATCH 2/2] md: add feature flag MD_FEATURE_RAID0_LAYOUT
  2019-09-09 15:33               ` Guoqing Jiang
@ 2019-09-09 23:26                 ` NeilBrown
  0 siblings, 0 replies; 19+ messages in thread
From: NeilBrown @ 2019-09-09 23:26 UTC (permalink / raw)
  To: Guoqing Jiang, Coly Li, Song Liu; +Cc: NeilBrown, linux-block, linux-raid


On Mon, Sep 09 2019, Guoqing Jiang wrote:

> Hi Neil,
>
> On 9/9/19 8:58 AM, NeilBrown wrote:
>> 
>> Due to a bug introduced in Linux 3.14 we cannot determine the
>> correct layout for a multi-zone RAID0 array - there are two
>> possibiities.
>
> possibilities.

Thanks.

>
>> 
>> It is possible to tell the kernel which to choose using a module
>> parameter, but this can be clumsy to use.  It would be best if
>> the choice were recorded in the metadata.
>> So add a feature flag for this purpose.
>> If it is set, then the 'layout' field of the superblock is used
>> to determine which layout to use.
>> 
>> If this flag is not set, then mddev->layout gets set to -1,
>> which causes the module parameter to be required.
>
> Could you point out where the flag is set? Thanks.

It isn't set by the kernel - the kernel doesn't know when to set it.

We would need to change mdadm to set the flag, either when creating an
array, or when asked to via --update.

Actually.... that would be a problem if someone used the new mdadm on an
old kernel.  The old kernel would refuse to assemble the array with the
flag set.
Maybe that is what we want anyway.  We *want* people to never use
multi-zone RAID0 on old kernels, because the result could be data
corruption.

So - mdadm needs to add the flag, and maybe warn if the kernel is too
old.
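
A minimal sketch of what that mdadm-side change might look like, assuming
the v1.x superblock definitions from include/uapi/linux/raid/md_p.h with
patch 2/2 applied (the helper name and call site are hypothetical, not
actual mdadm code):

    #include <linux/raid/md_p.h>   /* struct mdp_superblock_1, feature bits */
    #include <asm/byteorder.h>     /* __cpu_to_le32() */

    /* Hypothetical helper: record the chosen RAID0 layout in a v1.x
     * superblock and set the feature bit so that newer kernels know
     * sb->layout is meaningful. */
    static void sb1_set_raid0_layout(struct mdp_superblock_1 *sb, __u32 layout)
    {
            sb->layout = __cpu_to_le32(layout);
            sb->feature_map |= __cpu_to_le32(MD_FEATURE_RAID0_LAYOUT);
    }

mdadm would call something like this when creating a multi-zone RAID0 or on
an explicit --update, and per the concern above it could first check the
running kernel version and warn if it predates the flag.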


>
>> 
>> Signed-off-by: NeilBrown <neilb@suse.de>
>> ---
>>   drivers/md/md.c                | 13 +++++++++++++
>>   drivers/md/raid0.c             |  2 ++
>>   include/uapi/linux/raid/md_p.h |  2 ++
>>   3 files changed, 17 insertions(+)
>> 
>> diff --git a/drivers/md/md.c b/drivers/md/md.c
>> index 1f70ec595282..a41fce7f8b4c 100644
>> --- a/drivers/md/md.c
>> +++ b/drivers/md/md.c
>> @@ -1232,6 +1232,8 @@ static int super_90_validate(struct mddev *mddev, struct md_rdev *rdev)
>>   			mddev->new_layout = mddev->layout;
>>   			mddev->new_chunk_sectors = mddev->chunk_sectors;
>>   		}
>> +		if (mddev->level == 0)
>> +			mddev->layout = -1;
>>   
>>   		if (sb->state & (1<<MD_SB_CLEAN))
>>   			mddev->recovery_cp = MaxSector;
>> @@ -1647,6 +1649,10 @@ static int super_1_load(struct md_rdev *rdev, struct md_rdev *refdev, int minor_
>>   		rdev->ppl.sector = rdev->sb_start + rdev->ppl.offset;
>>   	}
>>   
>> +	if ((le32_to_cpu(sb->feature_map) & MD_FEATURE_RAID0_LAYOUT) &&
>> +	    sb->level != 0)
>> +		return -EINVAL;
>> +
>>   	if (!refdev) {
>>   		ret = 1;
>>   	} else {
>> @@ -1757,6 +1763,10 @@ static int super_1_validate(struct mddev *mddev, struct md_rdev *rdev)
>>   			mddev->new_chunk_sectors = mddev->chunk_sectors;
>>   		}
>>   
>> +		if (mddev->level == 0 &&
>> +		    !(le32_to_cpu(sb->feature_map) & MD_FEATURE_RAID0_LAYOUT))
>> +			mddev->layout = -1;
>> +
>>   		if (le32_to_cpu(sb->feature_map) & MD_FEATURE_JOURNAL)
>>   			set_bit(MD_HAS_JOURNAL, &mddev->flags);
>>   
>> @@ -6852,6 +6862,9 @@ static int set_array_info(struct mddev *mddev, mdu_array_info_t *info)
>>   	mddev->external	     = 0;
>>   
>>   	mddev->layout        = info->layout;
>> +	if (mddev->level == 0)
>> +		/* Cannot trust RAID0 layout info here */
>> +		mddev->layout = -1;
>>   	mddev->chunk_sectors = info->chunk_size >> 9;
>>   
>>   	if (mddev->persistent) {
>> diff --git a/drivers/md/raid0.c b/drivers/md/raid0.c
>> index a8888c12308a..6f095b0b6f5c 100644
>> --- a/drivers/md/raid0.c
>> +++ b/drivers/md/raid0.c
>> @@ -145,6 +145,8 @@ static int create_strip_zones(struct mddev *mddev, struct r0conf **private_conf)
>>   
>>   	if (conf->nr_strip_zones == 1) {
>>   		conf->layout = RAID0_ORIG_LAYOUT;
>> +	} else if (mddev->layout == RAID0_ORIG_LAYOUT ||
>> +		   mddev->layout == RAID0_ALT_MULTIZONE_LAYOUT) {
>
> Maybe "conf->layout = mddev->layout" here? Otherwise seems conf->layout is not set accordingly, just 
> my 2 cents.
>

Yes, of course.  Thanks.
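
With that fix folded in, the selection logic in create_strip_zones() would
read roughly as follows (a sketch of both patches combined with the fixes
from this review, not the final committed code):

    if (conf->nr_strip_zones == 1) {
            conf->layout = RAID0_ORIG_LAYOUT;
    } else if (mddev->layout == RAID0_ORIG_LAYOUT ||
               mddev->layout == RAID0_ALT_MULTIZONE_LAYOUT) {
            conf->layout = mddev->layout;   /* from the superblock (patch 2/2) */
    } else if (default_layout == RAID0_ORIG_LAYOUT ||
               default_layout == RAID0_ALT_MULTIZONE_LAYOUT) {
            conf->layout = default_layout;  /* from the module parameter (patch 1/2) */
    } else {
            pr_err("md/raid0:%s: cannot assemble multi-zone RAID0 with default_layout setting\n",
                   mdname(mddev));
            /* ...print the hint and refuse to assemble, as in patch 1/2 */
    }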

Thanks for your review,
NeilBrown


>>   	} else if (default_layout == RAID0_ORIG_LAYOUT ||
>>   		   default_layout == RAID0_ALT_MULTIZONE_LAYOUT) {
>>   		conf->layout = default_layout;
>> diff --git a/include/uapi/linux/raid/md_p.h b/include/uapi/linux/raid/md_p.h
>> index b0d15c73f6d7..1f2d8c81f0e0 100644
>> --- a/include/uapi/linux/raid/md_p.h
>> +++ b/include/uapi/linux/raid/md_p.h
>> @@ -329,6 +329,7 @@ struct mdp_superblock_1 {
>>   #define	MD_FEATURE_JOURNAL		512 /* support write cache */
>>   #define	MD_FEATURE_PPL			1024 /* support PPL */
>>   #define	MD_FEATURE_MULTIPLE_PPLS	2048 /* support for multiple PPLs */
>> +#define	MD_FEATURE_RAID0_LAYOUT		4096 /* layout is meaningful for RAID0 */
>>   #define	MD_FEATURE_ALL			(MD_FEATURE_BITMAP_OFFSET	\
>>   					|MD_FEATURE_RECOVERY_OFFSET	\
>>   					|MD_FEATURE_RESHAPE_ACTIVE	\
>> @@ -341,6 +342,7 @@ struct mdp_superblock_1 {
>>   					|MD_FEATURE_JOURNAL		\
>>   					|MD_FEATURE_PPL			\
>>   					|MD_FEATURE_MULTIPLE_PPLS	\
>> +					|MD_FEATURE_RAID0_LAYOUT	\
>>   					)
>>   
>>   struct r5l_payload_header {
>> 
>
> Thanks,
> Guoqing


^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [PATCH] md/raid0: avoid RAID0 data corruption due to layout confusion.
  2019-09-09 14:56             ` [PATCH] md/raid0: avoid RAID0 data corruption due to layout confusion Song Liu
@ 2019-09-09 23:33               ` NeilBrown
  2019-09-10 15:45                 ` Song Liu
  0 siblings, 1 reply; 19+ messages in thread
From: NeilBrown @ 2019-09-09 23:33 UTC (permalink / raw)
  To: Song Liu; +Cc: Coly Li, NeilBrown, linux-block, linux-raid


On Mon, Sep 09 2019, Song Liu wrote:

> Hi Neil,
>
>> On Sep 9, 2019, at 7:57 AM, NeilBrown <neilb@suse.de> wrote:
>> 
>> 
>> If the drives in a RAID0 are not all the same size, the array is
>> divided into zones.
>> The first zone covers all drives, to the size of the smallest.
>> The second zone covers all drives larger than the smallest, up to
>> the size of the second smallest - etc.
>> 
>> A change in Linux 3.14 unintentionally changed the layout for the
>> second and subsequent zones.  All the correct data is still stored, but
>> each chunk may be assigned to a different device than in pre-3.14 kernels.
>> This can lead to data corruption.
>> 
>> It is not possible to determine what layout to use - it depends which
>> kernel the data was written by.
>> So we add a module parameter to allow the old (0) or new (1) layout to be
>> specified, and refuse to assemble an affected array if that parameter is
>> not set.
>> 
>> Fixes: 20d0189b1012 ("block: Introduce new bio_split()")
>> cc: stable@vger.kernel.org (3.14+)
>> Signed-off-by: NeilBrown <neilb@suse.de>
>
>> Thanks for the patches. They look great. However, I am having problems
>> applying them (not sure whether it is a problem on my side). Could you
> please push it somewhere so I can use cherry-pick instead?

I rebased them on block/for-next, fixed the problems that Guoqing found,
and pushed them to 
  https://github.com/neilbrown/linux md/raid0

NeilBrown
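
As an aside, the zone division described in the quoted commit message can
be made concrete with a toy userspace calculation (illustrative only; the
sizes are hypothetical and counted in chunks):

    #include <stdio.h>

    int main(void)
    {
            /* Three hypothetical member drives, sorted by size, in chunks. */
            long size[] = { 100, 200, 300 };
            int n = 3, i;
            long prev = 0;

            for (i = 0; i < n; i++) {
                    /* Zone i stripes over the n - i drives that still have room. */
                    long len = (size[i] - prev) * (n - i);
                    if (len)
                            printf("zone %d: %d drive(s), %ld chunks of array space\n",
                                   i, n - i, len);
                    prev = size[i];
            }
            return 0;
    }

This prints zone 0 spanning all three drives (300 chunks), zone 1 spanning
two (200 chunks) and zone 2 spanning one (100 chunks); it is the
chunk-to-device mapping inside zones 1 and 2 that the two layouts disagree
on.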


^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [PATCH] md/raid0: avoid RAID0 data corruption due to layout confusion.
  2019-09-09 15:09             ` Guoqing Jiang
@ 2019-09-09 23:34               ` NeilBrown
  0 siblings, 0 replies; 19+ messages in thread
From: NeilBrown @ 2019-09-09 23:34 UTC (permalink / raw)
  To: Guoqing Jiang, Coly Li, Song Liu; +Cc: NeilBrown, linux-block, linux-raid


On Mon, Sep 09 2019, Guoqing Jiang wrote:

> On 9/9/19 8:57 AM, NeilBrown wrote:
>> 
>> If the drives in a RAID0 are not all the same size, the array is
>> divided into zones.
>> The first zone covers all drives, to the size of the smallest.
>> The second zone covers all drives larger than the smallest, up to
>> the size of the second smallest - etc.
>> 
>> A change in Linux 3.14 unintentionally changed the layout for the
>> second and subsequent zones.  All the correct data is still stored, but
>> each chunk may be assigned to a different device than in pre-3.14 kernels.
>> This can lead to data corruption.
>> 
>> It is not possible to determine what layout to use - it depends which
>> kernel the data was written by.
>> So we add a module parameter to allow the old (0) or new (1) layout to be
>> specified, and refuse to assemble an affected array if that parameter is
>> not set.
>> 
>> Fixes: 20d0189b1012 ("block: Introduce new bio_split()")
>> cc: stable@vger.kernel.org (3.14+)
>> Signed-off-by: NeilBrown <neilb@suse.de>
>> ---
>> 
>> This and the next patch are my proposal for how to address
>> this problem.  I haven't actually tested .....
>> 
>> NeilBrown
>> 
>>   drivers/md/raid0.c | 28 +++++++++++++++++++++++++++-
>>   drivers/md/raid0.h | 14 ++++++++++++++
>>   2 files changed, 41 insertions(+), 1 deletion(-)
>> 
>> diff --git a/drivers/md/raid0.c b/drivers/md/raid0.c
>> index bf5cf184a260..a8888c12308a 100644
>> --- a/drivers/md/raid0.c
>> +++ b/drivers/md/raid0.c
>> @@ -19,6 +19,9 @@
>>   #include "raid0.h"
>>   #include "raid5.h"
>>   
>> +static int default_layout = -1;
>> +module_param(default_layout, int, 0644);
>> +
>>   #define UNSUPPORTED_MDDEV_FLAGS		\
>>   	((1L << MD_HAS_JOURNAL) |	\
>>   	 (1L << MD_JOURNAL_CLEAN) |	\
>> @@ -139,6 +142,19 @@ static int create_strip_zones(struct mddev *mddev, struct r0conf **private_conf)
>>   	}
>>   	pr_debug("md/raid0:%s: FINAL %d zones\n",
>>   		 mdname(mddev), conf->nr_strip_zones);
>> +
>> +	if (conf->nr_strip_zones == 1) {
>> +		conf->layout = RAID0_ORIG_LAYOUT;
>> +	} else if (default_layout == RAID0_ORIG_LAYOUT ||
>> +		   default_layout == RAID0_ALT_MULTIZONE_LAYOUT) {
>> +		conf->layout = default_layout;
>> +	} else {
>> +		pr_err("md/raid0:%s: cannot assemble multi-zone RAID0 with default_layout setting\n",
>> +		       mdname(mddev));
>> +		pr_err("md/raid0: please set raid.default_layout to 0 or 1\n");
>
> Maybe "1 or 2" to align with the definition of below r0layout?

Thanks.  Fixed.

NeilBrown

>
> [snip]
>
>> +enum r0layout {
>> +	RAID0_ORIG_LAYOUT = 1,
>> +	RAID0_ALT_MULTIZONE_LAYOUT = 2,
>> +};
>>   struct r0conf {
>>   	struct strip_zone	*strip_zone;
>>   	struct md_rdev		**devlist; /* lists of rdevs, pointed to
>>   					    * by strip_zone->dev */
>>   	int			nr_strip_zones;
>> +	enum r0layout		layout;
>>   };
>>   
>>   #endif
>> 
>
> Thanks,
> Guoqing


^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [PATCH] md/raid0: avoid RAID0 data corruption due to layout confusion.
  2019-09-09 23:33               ` NeilBrown
@ 2019-09-10 15:45                 ` Song Liu
  2019-09-10 16:01                   ` Guoqing Jiang
  0 siblings, 1 reply; 19+ messages in thread
From: Song Liu @ 2019-09-10 15:45 UTC (permalink / raw)
  To: NeilBrown, Guoqing Jiang; +Cc: Coly Li, NeilBrown, linux-block, linux-raid



> On Sep 10, 2019, at 12:33 AM, NeilBrown <neilb@suse.de> wrote:
> 
> On Mon, Sep 09 2019, Song Liu wrote:
> 
>> Hi Neil,
>> 
>>> On Sep 9, 2019, at 7:57 AM, NeilBrown <neilb@suse.de> wrote:
>>> 
>>> 
>>> If the drives in a RAID0 are not all the same size, the array is
>>> divided into zones.
>>> The first zone covers all drives, to the size of the smallest.
>>> The second zone covers all drives larger than the smallest, up to
>>> the size of the second smallest - etc.
>>> 
>>> A change in Linux 3.14 unintentionally changed the layout for the
>>> second and subsequent zones.  All the correct data is still stored, but
>>> each chunk may be assigned to a different device than in pre-3.14 kernels.
>>> This can lead to data corruption.
>>> 
>>> It is not possible to determine what layout to use - it depends which
>>> kernel the data was written by.
>>> So we add a module parameter to allow the old (0) or new (1) layout to be
>>> specified, and refuse to assemble an affected array if that parameter is
>>> not set.
>>> 
>>> Fixes: 20d0189b1012 ("block: Introduce new bio_split()")
>>> cc: stable@vger.kernel.org (3.14+)
>>> Signed-off-by: NeilBrown <neilb@suse.de>
>> 
>> Thanks for the patches. They look great. However, I am having problems
>> applying them (not sure whether it is a problem on my side). Could you
>> please push it somewhere so I can use cherry-pick instead?
> 
> I rebased them on block/for-next, fixed the problems that Guoqing found,
> and pushed them to 
>  https://github.com/neilbrown/linux md/raid0
> 
> NeilBrown

Thanks Neil!

Guoqing, if this looks good, please reply with your Reviewed-by
or Acked-by. 

Thanks,
Song


^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [PATCH] md/raid0: avoid RAID0 data corruption due to layout confusion.
  2019-09-10 15:45                 ` Song Liu
@ 2019-09-10 16:01                   ` Guoqing Jiang
  2019-09-10 23:08                     ` NeilBrown
  0 siblings, 1 reply; 19+ messages in thread
From: Guoqing Jiang @ 2019-09-10 16:01 UTC (permalink / raw)
  To: Song Liu, NeilBrown, Guoqing Jiang
  Cc: Coly Li, NeilBrown, linux-block, linux-raid



On 9/10/19 5:45 PM, Song Liu wrote:
> 
> 
>> On Sep 10, 2019, at 12:33 AM, NeilBrown <neilb@suse.de> wrote:
>>
>> On Mon, Sep 09 2019, Song Liu wrote:
>>
>>> Hi Neil,
>>>
>>>> On Sep 9, 2019, at 7:57 AM, NeilBrown <neilb@suse.de> wrote:
>>>>
>>>>
>>>> If the drives in a RAID0 are not all the same size, the array is
>>>> divided into zones.
>>>> The first zone covers all drives, to the size of the smallest.
>>>> The second zone covers all drives larger than the smallest, up to
>>>> the size of the second smallest - etc.
>>>>
>>>> A change in Linux 3.14 unintentionally changed the layout for the
>>>> second and subsequent zones.  All the correct data is still stored, but
>>>> each chunk may be assigned to a different device than in pre-3.14 kernels.
>>>> This can lead to data corruption.
>>>>
>>>> It is not possible to determine what layout to use - it depends which
>>>> kernel the data was written by.
>>>> So we add a module parameter to allow the old (0) or new (1) layout to be
>>>> specified, and refuse to assemble an affected array if that parameter is
>>>> not set.
>>>>
>>>> Fixes: 20d0189b1012 ("block: Introduce new bio_split()")
>>>> cc: stable@vger.kernel.org (3.14+)
>>>> Signed-off-by: NeilBrown <neilb@suse.de>
>>>
>>> Thanks for the patches. They look great. However, I am having problems
>>> applying them (not sure whether it is a problem on my side). Could you
>>> please push it somewhere so I can use cherry-pick instead?
>>
>> I rebased them on block/for-next, fixed the problems that Guoqing found,
>> and pushed them to
>>   https://github.com/neilbrown/linux md/raid0
>>
>> NeilBrown
> 
> Thanks Neil!

Thanks for the explanation about setting the flag.

> 
> Guoqing, if this looks good, please reply with your Reviewed-by
> or Acked-by.

No more comments from my side, but I am not sure if it would be better/possible to use a
sysfs node to control the behavior instead of a module parameter; then we could support
different raid0 layouts dynamically.

Anyway, Acked-by: Guoqing Jiang <guoqing.jiang@cloud.ionos.com>

Thanks,
Guoqing

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [PATCH] md/raid0: avoid RAID0 data corruption due to layout confusion.
  2019-09-10 16:01                   ` Guoqing Jiang
@ 2019-09-10 23:08                     ` NeilBrown
  2019-09-11  9:56                       ` Song Liu
  0 siblings, 1 reply; 19+ messages in thread
From: NeilBrown @ 2019-09-10 23:08 UTC (permalink / raw)
  To: Guoqing Jiang, Song Liu, Guoqing Jiang
  Cc: Coly Li, NeilBrown, linux-block, linux-raid


On Tue, Sep 10 2019, Guoqing Jiang wrote:

> On 9/10/19 5:45 PM, Song Liu wrote:
>> 
>> 
>>> On Sep 10, 2019, at 12:33 AM, NeilBrown <neilb@suse.de> wrote:
>>>
>>> On Mon, Sep 09 2019, Song Liu wrote:
>>>
>>>> Hi Neil,
>>>>
>>>>> On Sep 9, 2019, at 7:57 AM, NeilBrown <neilb@suse.de> wrote:
>>>>>
>>>>>
>>>>> If the drives in a RAID0 are not all the same size, the array is
>>>>> divided into zones.
>>>>> The first zone covers all drives, to the size of the smallest.
>>>>> The second zone covers all drives larger than the smallest, up to
>>>>> the size of the second smallest - etc.
>>>>>
>>>>> A change in Linux 3.14 unintentionally changed the layout for the
>>>>> second and subsequent zones.  All the correct data is still stored, but
>>>>> each chunk may be assigned to a different device than in pre-3.14 kernels.
>>>>> This can lead to data corruption.
>>>>>
>>>>> It is not possible to determine what layout to use - it depends which
>>>>> kernel the data was written by.
>>>>> So we add a module parameter to allow the old (0) or new (1) layout to be
>>>>> specified, and refuse to assemble an affected array if that parameter is
>>>>> not set.
>>>>>
>>>>> Fixes: 20d0189b1012 ("block: Introduce new bio_split()")
>>>>> cc: stable@vger.kernel.org (3.14+)
>>>>> Signed-off-by: NeilBrown <neilb@suse.de>
>>>>
>>>> Thanks for the patches. They look great. However, I am having problems
>>>> applying them (not sure whether it is a problem on my side). Could you
>>>> please push it somewhere so I can use cherry-pick instead?
>>>
>>> I rebased them on block/for-next, fixed the problems that Guoqing found,
>>> and pushed them to
>>>   https://github.com/neilbrown/linux md/raid0
>>>
>>> NeilBrown
>> 
>> Thanks Neil!
>
> Thanks for the explanation about setting the flag.
>
>> 
>> Guoqing, if this looks good, please reply with your Reviewed-by
>> or Acked-by.
>
> No more comments from my side, but I am not sure if it would be better/possible to use a
> sysfs node to control the behavior instead of a module parameter; then we could support
> different raid0 layouts dynamically.

A strength of module parameters is that you can set them in
  /etc/modprobe.d/00-local.conf
and then they are automatically set on boot.
For sysfs, you need some tool to set them.
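
For example, something like the following should persist the choice across
boots (parameter name per patch 1/2, which declares default_layout in the
raid0 module):

    # /etc/modprobe.d/00-local.conf
    # 1 = RAID0_ORIG_LAYOUT (pre-3.14), 2 = RAID0_ALT_MULTIZONE_LAYOUT (3.14+)
    options raid0 default_layout=2

And because the parameter is declared with mode 0644, it should also appear
writable under /sys/module/raid0/parameters/default_layout at runtime, which
partly addresses the sysfs point.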

>
> Anyway, Acked-by: Guoqing Jiang <guoqing.jiang@cloud.ionos.com>
>

Thanks,
NeilBrown


> Thanks,
> Guoqing


^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [PATCH] md/raid0: avoid RAID0 data corruption due to layout confusion.
  2019-09-10 23:08                     ` NeilBrown
@ 2019-09-11  9:56                       ` Song Liu
  2019-09-11 22:48                         ` NeilBrown
  0 siblings, 1 reply; 19+ messages in thread
From: Song Liu @ 2019-09-11  9:56 UTC (permalink / raw)
  To: NeilBrown
  Cc: Guoqing Jiang, Song Liu, Guoqing Jiang, Coly Li, NeilBrown,
	linux-block, linux-raid

On Wed, Sep 11, 2019 at 12:10 AM NeilBrown <neilb@suse.de> wrote:
>
> On Tue, Sep 10 2019, Guoqing Jiang wrote:
>
> > On 9/10/19 5:45 PM, Song Liu wrote:
> >>
> >>
> >>> On Sep 10, 2019, at 12:33 AM, NeilBrown <neilb@suse.de> wrote:
> >>>
> >>> On Mon, Sep 09 2019, Song Liu wrote:
> >>>
> >>>> Hi Neil,
> >>>>
> >>>>> On Sep 9, 2019, at 7:57 AM, NeilBrown <neilb@suse.de> wrote:
> >>>>>
> >>>>>
> >>>>> If the drives in a RAID0 are not all the same size, the array is
> >>>>> divided into zones.
> >>>>> The first zone covers all drives, to the size of the smallest.
> >>>>> The second zone covers all drives larger than the smallest, up to
> >>>>> the size of the second smallest - etc.
> >>>>>
> >>>>> A change in Linux 3.14 unintentionally changed the layout for the
> >>>>> second and subsequent zones.  All the correct data is still stored, but
> >>>>> each chunk may be assigned to a different device than in pre-3.14 kernels.
> >>>>> This can lead to data corruption.
> >>>>>
> >>>>> It is not possible to determine what layout to use - it depends which
> >>>>> kernel the data was written by.
> >>>>> So we add a module parameter to allow the old (0) or new (1) layout to be
> >>>>> specified, and refuse to assemble an affected array if that parameter is
> >>>>> not set.
> >>>>>
> >>>>> Fixes: 20d0189b1012 ("block: Introduce new bio_split()")
> >>>>> cc: stable@vger.kernel.org (3.14+)
> >>>>> Signed-off-by: NeilBrown <neilb@suse.de>
> >>>>
> >>>> Thanks for the patches. They look great. However, I am having problems
> >>>> applying them (not sure whether it is a problem on my side). Could you
> >>>> please push it somewhere so I can use cherry-pick instead?
> >>>
> >>> I rebased them on block/for-next, fixed the problems that Guoqing found,
> >>> and pushed them to
> >>>   https://github.com/neilbrown/linux md/raid0
> >>>
> >>> NeilBrown
> >>
> >> Thanks Neil!
> >
> > Thanks for the explanation about setting the flag.
> >
> >>
> >> Guoqing, if this looks good, please reply with your Reviewed-by
> >> or Acked-by.
> >
> > No more comments from my side, but I am not sure if it would be better/possible to use a
> > sysfs node to control the behavior instead of a module parameter; then we could support
> > different raid0 layouts dynamically.
>
> A strength of module parameters is that you can set them in
>   /etc/modprobe.d/00-local.conf
> and then they are automatically set on boot.
> For sysfs, you need some tool to set them.
>
> >
> > Anyway, Acked-by: Guoqing Jiang <guoqing.jiang@cloud.ionos.com>
> >

I am adding the following change to the 1/2. Please let me know if it doesn't
make sense.

Thanks,
Song

diff --git i/drivers/md/raid0.c w/drivers/md/raid0.c
index a9fcff50bbfc..54d0064787a8 100644
--- i/drivers/md/raid0.c
+++ w/drivers/md/raid0.c
@@ -615,6 +615,10 @@ static bool raid0_make_request(struct mddev *mddev, struct bio *bio)
        case RAID0_ALT_MULTIZONE_LAYOUT:
                tmp_dev = map_sector(mddev, zone, sector, &sector);
                break;
+       default:
+               WARN(1, "md/raid0:%s: Invalid layout\n", mdname(mddev));
+               bio_io_error(bio);
+               return true;
        }

        if (unlikely(is_mddev_broken(tmp_dev, "raid0"))) {
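
For context, the switch this default case lands in chooses between the two
historical mappings; reconstructed roughly from patch 1/2 (a sketch with
variable names as in the patch, not a verbatim excerpt):

    sector_t orig_sector = sector;              /* array-relative LBA */

    zone = find_zone(mddev->private, &sector);  /* sector becomes zone-relative */
    switch (conf->layout) {
    case RAID0_ORIG_LAYOUT:
            /* pre-3.14 behaviour: map using the array-relative sector */
            tmp_dev = map_sector(mddev, zone, orig_sector, &sector);
            break;
    case RAID0_ALT_MULTIZONE_LAYOUT:
            /* 3.14+ behaviour: map using the zone-relative sector */
            tmp_dev = map_sector(mddev, zone, sector, &sector);
            break;
    default:
            WARN(1, "md/raid0:%s: Invalid layout\n", mdname(mddev));
            bio_io_error(bio);
            return true;
    }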

^ permalink raw reply related	[flat|nested] 19+ messages in thread

* Re: [PATCH] md/raid0: avoid RAID0 data corruption due to layout confusion.
  2019-09-11  9:56                       ` Song Liu
@ 2019-09-11 22:48                         ` NeilBrown
  0 siblings, 0 replies; 19+ messages in thread
From: NeilBrown @ 2019-09-11 22:48 UTC (permalink / raw)
  To: Song Liu
  Cc: Guoqing Jiang, Song Liu, Guoqing Jiang, Coly Li, NeilBrown,
	linux-block, linux-raid


On Wed, Sep 11 2019, Song Liu wrote:

> On Wed, Sep 11, 2019 at 12:10 AM NeilBrown <neilb@suse.de> wrote:
>>
>> On Tue, Sep 10 2019, Guoqing Jiang wrote:
>>
>> > On 9/10/19 5:45 PM, Song Liu wrote:
>> >>
>> >>
>> >>> On Sep 10, 2019, at 12:33 AM, NeilBrown <neilb@suse.de> wrote:
>> >>>
>> >>> On Mon, Sep 09 2019, Song Liu wrote:
>> >>>
>> >>>> Hi Neil,
>> >>>>
>> >>>>> On Sep 9, 2019, at 7:57 AM, NeilBrown <neilb@suse.de> wrote:
>> >>>>>
>> >>>>>
>> >>>>> If the drives in a RAID0 are not all the same size, the array is
>> >>>>> divided into zones.
>> >>>>> The first zone covers all drives, to the size of the smallest.
>> >>>>> The second zone covers all drives larger than the smallest, up to
>> >>>>> the size of the second smallest - etc.
>> >>>>>
>> >>>>> A change in Linux 3.14 unintentionally changed the layout for the
>> >>>>> second and subsequent zones.  All the correct data is still stored, but
>> >>>>> each chunk may be assigned to a different device than in pre-3.14 kernels.
>> >>>>> This can lead to data corruption.
>> >>>>>
>> >>>>> It is not possible to determine what layout to use - it depends which
>> >>>>> kernel the data was written by.
>> >>>>> So we add a module parameter to allow the old (0) or new (1) layout to be
>> >>>>> specified, and refuse to assemble an affected array if that parameter is
>> >>>>> not set.
>> >>>>>
>> >>>>> Fixes: 20d0189b1012 ("block: Introduce new bio_split()")
>> >>>>> cc: stable@vger.kernel.org (3.14+)
>> >>>>> Signed-off-by: NeilBrown <neilb@suse.de>
>> >>>>
>> >>>> Thanks for the patches. They look great. However, I am having problems
>> >>>> applying them (not sure whether it is a problem on my side). Could you
>> >>>> please push it somewhere so I can use cherry-pick instead?
>> >>>
>> >>> I rebased them on block/for-next, fixed the problems that Guoqing found,
>> >>> and pushed them to
>> >>>   https://github.com/neilbrown/linux md/raid0
>> >>>
>> >>> NeilBrown
>> >>
>> >> Thanks Neil!
>> >
>> > Thanks for the explanation about setting the flag.
>> >
>> >>
>> >> Guoqing, if this looks good, please reply with your Reviewed-by
>> >> or Acked-by.
>> >
>> > No more comments from my side, but I am not sure if it would be better/possible to use a
>> > sysfs node to control the behavior instead of a module parameter; then we could support
>> > different raid0 layouts dynamically.
>>
>> A strength of module parameters is that you can set them in
>>   /etc/modprobe.d/00-local.conf
>> and then they are automatically set on boot.
>> For sysfs, you need some tool to set them.
>>
>> >
>> > Anyway, Acked-by: Guoqing Jiang <guoqing.jiang@cloud.ionos.com>
>> >
>
> I am adding the following change to the 1/2. Please let me know if it doesn't
> make sense.

I don't object, though with the current code it is impossible for that
warning to fire.
Code might change in the future though, and it's better to be safe than
sorry.

Thanks,
NeilBrown

>
> Thanks,
> Song
>
> diff --git i/drivers/md/raid0.c w/drivers/md/raid0.c
> index a9fcff50bbfc..54d0064787a8 100644
> --- i/drivers/md/raid0.c
> +++ w/drivers/md/raid0.c
> @@ -615,6 +615,10 @@ static bool raid0_make_request(struct mddev *mddev, struct bio *bio)
>         case RAID0_ALT_MULTIZONE_LAYOUT:
>                 tmp_dev = map_sector(mddev, zone, sector, &sector);
>                 break;
> +       default:
> +               WARN(1, "md/raid0:%s: Invalid layout\n", mdname(mddev));
> +               bio_io_error(bio);
> +               return true;
>         }
>
>         if (unlikely(is_mddev_broken(tmp_dev, "raid0"))) {


^ permalink raw reply	[flat|nested] 19+ messages in thread

end of thread, other threads:[~2019-09-11 22:49 UTC | newest]

Thread overview: 19+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2019-08-22 10:38 [RFC] How to handle an ugly md raid0 sector map bug ? Coly Li
2019-08-23  0:02 ` NeilBrown
2019-08-23 16:37   ` Song Liu
2019-08-23 17:03     ` Coly Li
2019-08-23 17:17       ` Song Liu
2019-08-23 17:47         ` Coly Li
2019-09-09  6:57           ` [PATCH] md/raid0: avoid RAID0 data corruption due to layout confusion NeilBrown
2019-09-09  6:58             ` [PATCH 2/2] md: add feature flag MD_FEATURE_RAID0_LAYOUT NeilBrown
2019-09-09 15:33               ` Guoqing Jiang
2019-09-09 23:26                 ` NeilBrown
2019-09-09 14:56             ` [PATCH] md/raid0: avoid RAID0 data corruption due to layout confusion Song Liu
2019-09-09 23:33               ` NeilBrown
2019-09-10 15:45                 ` Song Liu
2019-09-10 16:01                   ` Guoqing Jiang
2019-09-10 23:08                     ` NeilBrown
2019-09-11  9:56                       ` Song Liu
2019-09-11 22:48                         ` NeilBrown
2019-09-09 15:09             ` Guoqing Jiang
2019-09-09 23:34               ` NeilBrown
