linux-block.vger.kernel.org archive mirror
* [RFC] How to handle an ugly md raid0 sector map bug ?
@ 2019-08-22 10:38 Coly Li
  2019-08-23  0:02 ` NeilBrown
From: Coly Li @ 2019-08-22 10:38 UTC
  To: linux-raid, Song Liu; +Cc: NeilBrown, linux-block

Hi folks,

Up front: this bug only affects md raid0 devices that meet all of the
following conditions,
1) Assembled from component disks with different sizes.
2) Created and used under a kernel up to and including Linux v3.12,
then upgraded to Linux v3.13 or later.
3) New data are written to the md raid0 under the new kernel (>= Linux v3.13).
Such an md raid0 may have an inconsistent sector mapping and experience
data corruption.

Recently I received a bug report that a customer encountered file system
corruption after upgrading their kernel from Linux 3.12 to 4.4. It turned
out to be corruption of the underlying md raid0 after the kernel upgrade.

I found it is caused by a sector mapping bug in the md raid0 code up to
and including Linux v3.12. Here is the buggy code piece, copied from
stable Linux v3.12.74 drivers/md/raid0.c:raid0_make_request(),

547         sector_offset = bio->bi_sector;
548         zone = find_zone(mddev->private, &sector_offset);
549         tmp_dev = map_sector(mddev, zone, bio->bi_sector,
550                              &sector_offset);

At line 548, after find_zone() returns, sector_offset has been updated
to an offset inside the current zone. Then at line 549 the third
parameter of map_sector() should be this updated sector_offset, but
bio->bi_sector (the original LBA of the md raid0 device) is used
instead. If the raid0 device has *multiple zones*, the <dev, sector>
pair returned by map_sector() is unexpected and wrong for every zone
except the first.
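
For reference, the zone lookup behaves roughly as follows. This is a
simplified paraphrase of the v3.12-era helper, not a verbatim copy,

/*
 * Find the zone which holds a particular array-absolute offset, and
 * rewrite *sectorp to an offset inside that zone.  This is why passing
 * the original bio->bi_sector to map_sector() afterwards is surprising.
 */
static struct strip_zone *find_zone(struct r0conf *conf,
                                    sector_t *sectorp)
{
        int i;
        struct strip_zone *z = conf->strip_zone;
        sector_t sector = *sectorp;

        for (i = 0; i < conf->nr_strip_zones; i++)
                if (sector < z[i].zone_end) {
                        if (i)
                                *sectorp = sector - z[i-1].zone_end;
                        return z + i;
                }
        BUG();
}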

The buggy code was introduced in Linux v2.6.31 by commit fbb704efb784
("md: raid0 :Enables chunk size other than powers of 2."). Unfortunately
the mistaken mapping calculation also produces a stable and unique
result, so it worked without obvious problems until commit 20d0189b1012
("block: Introduce new bio_split()") was merged into Linux v3.13.

That commit fixed the mistaken mapping with the following lines of change,
654 -       sector_offset = bio->bi_iter.bi_sector;
655 -       zone = find_zone(mddev->private, &sector_offset);
656 -       tmp_dev = map_sector(mddev, zone, bio->bi_iter.bi_sector,
657 -                            &sector_offset);

694 +               zone = find_zone(mddev->private, &sector);
695 +               tmp_dev = map_sector(mddev, zone, sector, &sector);
At line 695 of this patch, the third parameter of map_sector() is fixed
to 'sector', which is the correct value containing the sector offset
inside the corresponding zone.

But this patch implicitly *changes* the md raid0 on-disk layout. If an
md raid0 has component disks with *different* sizes, it will contain
multiple zones. If such a multi-zone raid0 device was created before
Linux v3.13, all data chunks after the first zone will be mapped to
different locations by kernels from Linux v3.13 onward. The result is
that data written to LBAs beyond the first zone will appear corrupted.
A worse case is when the md raid0 has data chunks filled into the first
zone under Linux v3.12 (or an earlier kernel), is then updated to Linux
v3.13 (or a later kernel), and more data chunks are filled into the
second and subsequent zones. Then under neither Linux v3.12 nor Linux
v3.13 is all the data readable; some data is always corrupted.
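
To make the divergence concrete, here is a minimal userspace model of
the device-selection step only. The geometry is invented for
illustration: three disks of 1, 2 and 2 chunks, so zone 0 holds array
chunks 0-2 across all three disks and zone 1 holds array chunks 3-4
across the two larger disks,

#include <stdio.h>

int main(void)
{
        unsigned zone1_first_chunk = 3; /* array chunk where zone 1 starts */
        unsigned zone1_nb_dev = 2;      /* devices backing zone 1 */

        for (unsigned c = 3; c <= 4; c++) {
                /* pre-v3.13 rule: device picked from the array-absolute chunk */
                unsigned old_dev = c % zone1_nb_dev;
                /* v3.13+ rule: device picked from the zone-relative chunk */
                unsigned new_dev = (c - zone1_first_chunk) % zone1_nb_dev;

                printf("array chunk %u: old -> dev %u, new -> dev %u\n",
                       c, old_dev, new_dev);
        }
        return 0;
}

Both rules stripe across the zone, but chunks 3 and 4 swap devices
between the two rules, which is exactly the inconsistency described
above.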

Currently there is no way to tell whether an md raid0 device is mapped
with the wrong calculation of kernels up to and including Linux v3.12,
or with the correct calculation of kernels from Linux v3.13 onward. If
a multi-zone md raid0 device is created and used across these kernel
versions, the two mapping calculations may generate different,
inconsistent on-disk layouts in different md raid0 zones, resulting in
data corruption.

For our enterprise Linux products we can handle this properly for a
small number of product kernels. But for upstream and stable kernels, I
have no idea how to fix this ugly problem in a generic way.

Neil Brown discussed this with me offline; he proposed a temporary
workaround that only permits assembling md raid0 devices whose component
disks have identical sizes, and rejects assembly when the component
disks have different sizes. We can drop this workaround once there is a
proper way to fix the problem.

I suggest our developer community work together on a solution; that is
why I am posting this email for your comments.

Thanks in advance.

-- 

Coly Li


* Re: [RFC] How to handle an ugly md raid0 sector map bug ?
  2019-08-22 10:38 [RFC] How to handle an ugly md raid0 sector map bug ? Coly Li
@ 2019-08-23  0:02 ` NeilBrown
  2019-08-23 16:37   ` Song Liu
From: NeilBrown @ 2019-08-23  0:02 UTC
  To: Coly Li, Song Liu, linux-raid; +Cc: linux-block


On Thu, Aug 22 2019, Coly Li wrote:

> Hi folks,
>
> Up front: this bug only affects md raid0 devices that meet all of the
> following conditions,
> 1) Assembled from component disks with different sizes.
> 2) Created and used under a kernel up to and including Linux v3.12,
> then upgraded to Linux v3.13 or later.
> 3) New data are written to the md raid0 under the new kernel (>= Linux v3.13).
> Such an md raid0 may have an inconsistent sector mapping and experience
> data corruption.
>
> Recently I received a bug report that a customer encountered file system
> corruption after upgrading their kernel from Linux 3.12 to 4.4. It turned
> out to be corruption of the underlying md raid0 after the kernel upgrade.
>
> I found it is caused by a sector mapping bug in the md raid0 code up to
> and including Linux v3.12. Here is the buggy code piece, copied from
> stable Linux v3.12.74 drivers/md/raid0.c:raid0_make_request(),
>
> 547         sector_offset = bio->bi_sector;
> 548         zone = find_zone(mddev->private, &sector_offset);
> 549         tmp_dev = map_sector(mddev, zone, bio->bi_sector,
> 550                              &sector_offset);

I don't think this code is buggy.  The mapping may not be the mapping
you would expect, but it is the mapping that md/raid0 had always used up
to this time.

>
> At line 548, after find_zone() returns, sector_offset has been updated
> to an offset inside the current zone. Then at line 549 the third
> parameter of map_sector() should be this updated sector_offset, but
> bio->bi_sector (the original LBA of the md raid0 device) is used
> instead. If the raid0 device has *multiple zones*, the <dev, sector>
> pair returned by map_sector() is unexpected and wrong for every zone
> except the first.
>
> The buggy code was introduced in Linux v2.6.31 by commit fbb704efb784
> ("md: raid0 :Enables chunk size other than powers of 2."). Unfortunately
> the mistaken mapping calculation also produces a stable and unique
> result, so it worked without obvious problems until commit 20d0189b1012
> ("block: Introduce new bio_split()") was merged into Linux v3.13.
>
> That commit fixed the mistaken mapping with the following lines of change,
> 654 -       sector_offset = bio->bi_iter.bi_sector;
> 655 -       zone = find_zone(mddev->private, &sector_offset);
> 656 -       tmp_dev = map_sector(mddev, zone, bio->bi_iter.bi_sector,
> 657 -                            &sector_offset);
>
> 694 +               zone = find_zone(mddev->private, &sector);
> 695 +               tmp_dev = map_sector(mddev, zone, sector, &sector);
> At line 695 of this patch, the third parameter of map_sector() is fixed
> to 'sector', which is the correct value containing the sector offset
> inside the corresponding zone.

This is buggy because, as you say, the third argument to map_sector has
changed.
Previously it was bio->bi_iter.bi_sector.  Now it is 'sector' which
find_zone has just modified.

>
> But this patch implicitly *changes* the md raid0 on-disk layout. If an
> md raid0 has component disks with *different* sizes, it will contain
> multiple zones. If such a multi-zone raid0 device was created before
> Linux v3.13, all data chunks after the first zone will be mapped to
> different locations by kernels from Linux v3.13 onward. The result is
> that data written to LBAs beyond the first zone will appear corrupted.
> A worse case is when the md raid0 has data chunks filled into the first
> zone under Linux v3.12 (or an earlier kernel), is then updated to Linux
> v3.13 (or a later kernel), and more data chunks are filled into the
> second and subsequent zones. Then under neither Linux v3.12 nor Linux
> v3.13 is all the data readable; some data is always corrupted.
>
> Currently there is no way to tell whether an md raid0 device is mapped
> with the wrong calculation of kernels up to and including Linux v3.12,
> or with the correct calculation of kernels from Linux v3.13 onward. If
> a multi-zone md raid0 device is created and used across these kernel
> versions, the two mapping calculations may generate different,
> inconsistent on-disk layouts in different md raid0 zones, resulting in
> data corruption.
>
> For our enterprise Linux products we can handle this properly for a
> small number of product kernels. But for upstream and stable kernels, I
> have no idea how to fix this ugly problem in a generic way.
>
> Neil Brown discussed this with me offline; he proposed a temporary
> workaround that only permits assembling md raid0 devices whose component
> disks have identical sizes, and rejects assembly when the component
> disks have different sizes. We can drop this workaround once there is a
> proper way to fix the problem.
>
> I suggest our developer community work together on a solution; that is
> why I am posting this email for your comments.

There are four separate cases that we need to consider:
 - v1.x metadata
 - v0.90 metadata
 - LVM metadata (via dm-raid)
 - no metadata (array created with "mdadm --build").

For v1.x metadata, I think we can add a new "feature_map" flag.
If this flag isn't set, raid0 with non-uniform device sizes will not be
assembled.
If it is set, then:
  if 'layout' is 0, use the old mapping
  if 'layout' is 1, use the new mapping

For v0.90 metadata we don't have feature-flags.  The gvalid_words field
is unused and always set to zero, so we could start storing some
feature bits there.

For LVM/dm-raid, I suspect it doesn't support devices of varying
sizes, but we would need to check.

For "no metadata" arrays ... we could possibly just stop supporting
them - I doubt they are used much.
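
Concretely, for the v1.x case the assemble-time decision might look
something like the sketch below; the flag value and all names here are
placeholders for illustration, not a settled design,

#include <stdbool.h>
#include <errno.h>

#define SKETCH_FEATURE_RAID0_LAYOUT 4096   /* hypothetical feature bit */

/* Return 0 and set *use_new_mapping, or -EINVAL if assembly must be
 * refused (multi-zone array without the feature flag set). */
static int sketch_pick_layout(int nr_zones, unsigned int feature_map,
                              unsigned int layout, bool *use_new_mapping)
{
        if (nr_zones == 1) {
                *use_new_mapping = false;   /* layouts agree; pick either */
                return 0;
        }
        if (!(feature_map & SKETCH_FEATURE_RAID0_LAYOUT))
                return -EINVAL;             /* refuse to assemble */
        *use_new_mapping = (layout == 1);   /* 0 -> old, 1 -> new */
        return 0;
}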

Once this is decided, we then need to work out how to
make it all usable via mdadm.

mdadm needs to set the relevant flags and 'layout' values
when creating an array, and it needs to be able to set them correctly
on an existing array that doesn't have them set.

We also need to decide what happens when a new mdadm is run on an old
kernel.
If we create an array with a new feature bit, the old kernel won't
be able to assemble it.  But if we don't set the feature bit, then
the array will not assemble after an upgrade.

It might be easier to think about this when we actually have concrete
code, but if anyone does have useful input, I'd be very happy to read
it.

NeilBrown



* Re: [RFC] How to handle an ugly md raid0 sector map bug ?
  2019-08-23  0:02 ` NeilBrown
@ 2019-08-23 16:37   ` Song Liu
  2019-08-23 17:03     ` Coly Li
From: Song Liu @ 2019-08-23 16:37 UTC
  To: NeilBrown; +Cc: Coly Li, linux-raid, linux-block

Thanks Coly and Neil. 

> On Aug 22, 2019, at 5:02 PM, NeilBrown <neilb@suse.com> wrote:
> 
> On Thu, Aug 22 2019, Coly Li wrote:
> 
>> [...]
> 
> There are four separate cases that we need to consider:
> - v1.x metadata
> - v0.90 metadata
> - LVM metadata (via dm-raid)
> - no metadata (array created with "mdadm --build").
> 
> For v1.x metadata, I think we can add a new "feature_map" flag.
> If this flag isn't set, raid0 with non-uniform device sizes will not be
> assembled.
> If it is set, then:
>  if 'layout' is 0, use the old mapping
>  if 'layout' is 1, use the new mapping
> 
> For v0.90 metadata we don't have feature-flags.  The gvalid_words field
> is unused and always set to zero, so we could start storing some
> feature bits there.
> 
> For LVM/dm-raid, I suspect it doesn't support devices of varying
> sizes, but we would need to check.
> 
> For "no metadata" arrays ... we could possibly just stop supporting
> them - I doubt they are used much.

So for an existing array, we really cannot tell whether it is broken or 
not, right? If this is the case, we only need to worry about new arrays.

For new arrays, I guess we can only allow v1.x raid0 to have non-uniform
device sizes, and use the new feature_map bit. 

Would this work? If so, we only have 1 case to work on. 

Thanks,
Song


* Re: [RFC] How to handle an ugly md raid0 sector map bug ?
  2019-08-23 16:37   ` Song Liu
@ 2019-08-23 17:03     ` Coly Li
  2019-08-23 17:17       ` Song Liu
From: Coly Li @ 2019-08-23 17:03 UTC
  To: Song Liu; +Cc: NeilBrown, linux-raid, linux-block

On 2019/8/24 12:37 AM, Song Liu wrote:
> Thanks Coly and Neil. 
> 
>> On Aug 22, 2019, at 5:02 PM, NeilBrown <neilb@suse.com> wrote:
>>
>> On Thu, Aug 22 2019, Coly Li wrote:
>>
>>> [...]
> 
> So for an existing array, we really cannot tell whether it is broken or 
> not, right? If this is the case, we only need to worry about new arrays.
> 
> For new arrays, I guess we can only allow v1.x raid0 to have non-uniform
> device sizes, and use the new feature_map bit. 
> 
> Would this work? If so, we only have 1 case to work on. 

It seems v1.2 support has been present since Linux v2.16, so it may also
have this problem for multiple zones.


-- 

Coly Li


* Re: [RFC] How to handle an ugly md raid0 sector map bug ?
  2019-08-23 17:03     ` Coly Li
@ 2019-08-23 17:17       ` Song Liu
  2019-08-23 17:47         ` Coly Li
From: Song Liu @ 2019-08-23 17:17 UTC
  To: Coly Li; +Cc: NeilBrown, linux-raid, linux-block



> On Aug 23, 2019, at 10:03 AM, Coly Li <colyli@suse.de> wrote:
> 
> On 2019/8/24 12:37 AM, Song Liu wrote:
>> Thanks Coly and Neil. 
>> 
>>> On Aug 22, 2019, at 5:02 PM, NeilBrown <neilb@suse.com> wrote:
>>> 
>>> On Thu, Aug 22 2019, Coly Li wrote:
>>> 
>>>> [...]
>> 
>> So for an existing array, we really cannot tell whether it is broken or 
>> not, right? If this is the case, we only need to worry about new arrays.
>> 
>> For new arrays, I guess we can only allow v1.x raid0 to have non-uniform
>> device sizes, and use the new feature_map bit. 
>> 
>> Would this work? If so, we only have 1 case to work on. 
> 
> It seems v1.2 support has been present since Linux v2.16, so it may also
> have this problem for multiple zones.
> 

For v1.2 metadata, we still need the feature_map bit, meaning this
non-uniform array is safe to assemble. If the array doesn't have
this bit and has non-uniform sizes, we refuse to assemble it.

Thanks,
Song






* Re: [RFC] How to handle an ugly md raid0 sector map bug ?
  2019-08-23 17:17       ` Song Liu
@ 2019-08-23 17:47         ` Coly Li
  2019-09-09  6:57           ` [PATCH] md/raid0: avoid RAID0 data corruption due to layout confusion NeilBrown
From: Coly Li @ 2019-08-23 17:47 UTC
  To: Song Liu; +Cc: NeilBrown, linux-raid, linux-block

On 2019/8/24 1:17 AM, Song Liu wrote:
> 
> 
>> On Aug 23, 2019, at 10:03 AM, Coly Li <colyli@suse.de> wrote:
>>
>> On 2019/8/24 12:37 AM, Song Liu wrote:
>>> Thanks Coly and Neil. 
>>>
>>>> On Aug 22, 2019, at 5:02 PM, NeilBrown <neilb@suse.com> wrote:
>>>>
>>>> On Thu, Aug 22 2019, Coly Li wrote:
>>>>
>>>>> [...]
>>>
>>> So for an existing array, we really cannot tell whether it is broken or 
>>> not, right? If this is the case, we only need to worry about new arrays.
>>>
>>> For new arrays, I guess we can only allow v1.x raid0 to have non-uniform
>>> device sizes, and use the new feature_map bit. 
>>>
>>> Would this work? If so, we only have 1 case to work on. 
>>
>> It seems v1.2 support has been present since Linux v2.16, so it may also
>> have this problem for multiple zones.
>>
> 
> For v1.2 metadata, we still need the feature_map bit, meaning this
> non-uniform array is safe to assemble. If the array doesn't have
> this bit and has non-uniform sizes, we refuse to assemble it.

Yes, it makes sense. I understand you now :-)

-- 

Coly Li


* [PATCH] md/raid0: avoid RAID0 data corruption due to layout confusion.
  2019-08-23 17:47         ` Coly Li
@ 2019-09-09  6:57           ` NeilBrown
  2019-09-09  6:58             ` [PATCH 2/2] md: add feature flag MD_FEATURE_RAID0_LAYOUT NeilBrown
                               ` (2 more replies)
From: NeilBrown @ 2019-09-09  6:57 UTC
  To: Coly Li, Song Liu; +Cc: NeilBrown, linux-block, linux-raid



If the drives in a RAID0 are not all the same size, the array is
divided into zones.
The first zone covers all drives, to the size of the smallest.
The second zone covers all drives larger than the smallest, up to
the size of the second smallest - etc.
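
To illustrate the zone geometry with invented sizes (not taken from
any real array), a small userspace sketch,

#include <stdio.h>

int main(void)
{
        /* hypothetical device sizes in chunks, sorted ascending */
        unsigned size[] = { 4, 8, 8, 16 };
        unsigned ndev = 4, prev = 0;

        for (unsigned i = 0; i < ndev; i++) {
                if (size[i] == prev)
                        continue;             /* same size: no new zone */
                printf("zone: %u device(s), %u chunks deep\n",
                       ndev - i, size[i] - prev);
                prev = size[i];
        }
        return 0;
}

This prints a 4-device zone 4 chunks deep, a 3-device zone 4 chunks
deep, and a 1-device zone 8 chunks deep.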

A change in Linux 3.14 unintentionally changed the layout for the
second and subsequent zones.  All the correct data is still stored, but
each chunk may be assigned to a different device than in pre-3.14 kernels.
This can lead to data corruption.

It is not possible to determine what layout to use - it depends on which
kernel the data was written by.
So we add a module parameter to allow the old (0) or new (1) layout to be
specified, and refuse to assemble an affected array if that parameter is
not set.

Fixes: 20d0189b1012 ("block: Introduce new bio_split()")
cc: stable@vger.kernel.org (3.14+)
Signed-off-by: NeilBrown <neilb@suse.de>
---

This and the next patch are my proposal for how to address
this problem.  I haven't actually tested .....

NeilBrown

 drivers/md/raid0.c | 28 +++++++++++++++++++++++++++-
 drivers/md/raid0.h | 14 ++++++++++++++
 2 files changed, 41 insertions(+), 1 deletion(-)

diff --git a/drivers/md/raid0.c b/drivers/md/raid0.c
index bf5cf184a260..a8888c12308a 100644
--- a/drivers/md/raid0.c
+++ b/drivers/md/raid0.c
@@ -19,6 +19,9 @@
 #include "raid0.h"
 #include "raid5.h"
 
+static int default_layout = -1;
+module_param(default_layout, int, 0644);
+
 #define UNSUPPORTED_MDDEV_FLAGS		\
 	((1L << MD_HAS_JOURNAL) |	\
 	 (1L << MD_JOURNAL_CLEAN) |	\
@@ -139,6 +142,19 @@ static int create_strip_zones(struct mddev *mddev, struct r0conf **private_conf)
 	}
 	pr_debug("md/raid0:%s: FINAL %d zones\n",
 		 mdname(mddev), conf->nr_strip_zones);
+
+	if (conf->nr_strip_zones == 1) {
+		conf->layout = RAID0_ORIG_LAYOUT;
+	} else if (default_layout == RAID0_ORIG_LAYOUT ||
+		   default_layout == RAID0_ALT_MULTIZONE_LAYOUT) {
+		conf->layout = default_layout;
+	} else {
+		pr_err("md/raid0:%s: cannot assemble multi-zone RAID0 with default_layout setting\n",
+		       mdname(mddev));
+		pr_err("md/raid0: please set raid.default_layout to 0 or 1\n");
+		err = -ENOTSUPP;
+		goto abort;
+	}
 	/*
 	 * now since we have the hard sector sizes, we can make sure
 	 * chunk size is a multiple of that sector size
@@ -547,10 +563,12 @@ static void raid0_handle_discard(struct mddev *mddev, struct bio *bio)
 
 static bool raid0_make_request(struct mddev *mddev, struct bio *bio)
 {
+	struct r0conf *conf = mddev->private;
 	struct strip_zone *zone;
 	struct md_rdev *tmp_dev;
 	sector_t bio_sector;
 	sector_t sector;
+	sector_t orig_sector;
 	unsigned chunk_sects;
 	unsigned sectors;
 
@@ -584,8 +602,16 @@ static bool raid0_make_request(struct mddev *mddev, struct bio *bio)
 		bio = split;
 	}
 
+	orig_sector = sector;
 	zone = find_zone(mddev->private, &sector);
-	tmp_dev = map_sector(mddev, zone, sector, &sector);
+	switch (conf->layout) {
+	case RAID0_ORIG_LAYOUT:
+		tmp_dev = map_sector(mddev, zone, orig_sector, &sector);
+		break;
+	case RAID0_ALT_MULTIZONE_LAYOUT:
+		tmp_dev = map_sector(mddev, zone, sector, &sector);
+		break;
+	}
 	bio_set_dev(bio, tmp_dev->bdev);
 	bio->bi_iter.bi_sector = sector + zone->dev_start +
 		tmp_dev->data_offset;
diff --git a/drivers/md/raid0.h b/drivers/md/raid0.h
index 540e65d92642..3816e5477db1 100644
--- a/drivers/md/raid0.h
+++ b/drivers/md/raid0.h
@@ -8,11 +8,25 @@ struct strip_zone {
 	int	 nb_dev;	/* # of devices attached to the zone */
 };
 
+/* Linux 3.14 (20d0189b1012) made an unintended change to
+ * the RAID0 layout for multi-zone arrays (where devices aren't all
+ * the same size).
+ * RAID0_ORIG_LAYOUT restores the original layout
+ * RAID0_ALT_MULTIZONE_LAYOUT uses the altered layout
+ * The layouts are identical when there is only one zone (all
+ * devices the same size).
+ */
+
+enum r0layout {
+	RAID0_ORIG_LAYOUT = 1,
+	RAID0_ALT_MULTIZONE_LAYOUT = 2,
+};
 struct r0conf {
 	struct strip_zone	*strip_zone;
 	struct md_rdev		**devlist; /* lists of rdevs, pointed to
 					    * by strip_zone->dev */
 	int			nr_strip_zones;
+	enum r0layout		layout;
 };
 
 #endif
-- 
2.14.0.rc0.dirty




* [PATCH 2/2] md: add feature flag MD_FEATURE_RAID0_LAYOUT
  2019-09-09  6:57           ` [PATCH] md/raid0: avoid RAID0 data corruption due to layout confusion NeilBrown
@ 2019-09-09  6:58             ` NeilBrown
  2019-09-09 15:33               ` Guoqing Jiang
  2019-09-09 14:56             ` [PATCH] md/raid0: avoid RAID0 data corruption due to layout confusion Song Liu
  2019-09-09 15:09             ` Guoqing Jiang
From: NeilBrown @ 2019-09-09  6:58 UTC
  To: Coly Li, Song Liu; +Cc: NeilBrown, linux-block, linux-raid



Due to a bug introduced in Linux 3.14 we cannot determine the
correct layout for a multi-zone RAID0 array - there are two
possibiities.

It is possible to tell the kernel which to choose using a module
parameter, but this can be clumsy to use.  It would be best if
the choice were recorded in the metadata.
So add a feature flag for this purpose.
If it is set, then the 'layout' field of the superblock is used
to determine which layout to use.

If this flag is not set, then mddev->layout gets set to -1,
which causes the module parameter to be required.

Signed-off-by: NeilBrown <neilb@suse.de>
---
 drivers/md/md.c                | 13 +++++++++++++
 drivers/md/raid0.c             |  2 ++
 include/uapi/linux/raid/md_p.h |  2 ++
 3 files changed, 17 insertions(+)

diff --git a/drivers/md/md.c b/drivers/md/md.c
index 1f70ec595282..a41fce7f8b4c 100644
--- a/drivers/md/md.c
+++ b/drivers/md/md.c
@@ -1232,6 +1232,8 @@ static int super_90_validate(struct mddev *mddev, struct md_rdev *rdev)
 			mddev->new_layout = mddev->layout;
 			mddev->new_chunk_sectors = mddev->chunk_sectors;
 		}
+		if (mddev->level == 0)
+			mddev->layout = -1;
 
 		if (sb->state & (1<<MD_SB_CLEAN))
 			mddev->recovery_cp = MaxSector;
@@ -1647,6 +1649,10 @@ static int super_1_load(struct md_rdev *rdev, struct md_rdev *refdev, int minor_
 		rdev->ppl.sector = rdev->sb_start + rdev->ppl.offset;
 	}
 
+	if ((le32_to_cpu(sb->feature_map) & MD_FEATURE_RAID0_LAYOUT) &&
+	    sb->level != 0)
+		return -EINVAL;
+
 	if (!refdev) {
 		ret = 1;
 	} else {
@@ -1757,6 +1763,10 @@ static int super_1_validate(struct mddev *mddev, struct md_rdev *rdev)
 			mddev->new_chunk_sectors = mddev->chunk_sectors;
 		}
 
+		if (mddev->level == 0 &&
+		    !(le32_to_cpu(sb->feature_map) & MD_FEATURE_RAID0_LAYOUT))
+			mddev->layout = -1;
+
 		if (le32_to_cpu(sb->feature_map) & MD_FEATURE_JOURNAL)
 			set_bit(MD_HAS_JOURNAL, &mddev->flags);
 
@@ -6852,6 +6862,9 @@ static int set_array_info(struct mddev *mddev, mdu_array_info_t *info)
 	mddev->external	     = 0;
 
 	mddev->layout        = info->layout;
+	if (mddev->level == 0)
+		/* Cannot trust RAID0 layout info here */
+		mddev->layout = -1;
 	mddev->chunk_sectors = info->chunk_size >> 9;
 
 	if (mddev->persistent) {
diff --git a/drivers/md/raid0.c b/drivers/md/raid0.c
index a8888c12308a..6f095b0b6f5c 100644
--- a/drivers/md/raid0.c
+++ b/drivers/md/raid0.c
@@ -145,6 +145,8 @@ static int create_strip_zones(struct mddev *mddev, struct r0conf **private_conf)
 
 	if (conf->nr_strip_zones == 1) {
 		conf->layout = RAID0_ORIG_LAYOUT;
+	} else if (mddev->layout == RAID0_ORIG_LAYOUT ||
+		   mddev->layout == RAID0_ALT_MULTIZONE_LAYOUT) {
 	} else if (default_layout == RAID0_ORIG_LAYOUT ||
 		   default_layout == RAID0_ALT_MULTIZONE_LAYOUT) {
 		conf->layout = default_layout;
diff --git a/include/uapi/linux/raid/md_p.h b/include/uapi/linux/raid/md_p.h
index b0d15c73f6d7..1f2d8c81f0e0 100644
--- a/include/uapi/linux/raid/md_p.h
+++ b/include/uapi/linux/raid/md_p.h
@@ -329,6 +329,7 @@ struct mdp_superblock_1 {
 #define	MD_FEATURE_JOURNAL		512 /* support write cache */
 #define	MD_FEATURE_PPL			1024 /* support PPL */
 #define	MD_FEATURE_MULTIPLE_PPLS	2048 /* support for multiple PPLs */
+#define	MD_FEATURE_RAID0_LAYOUT		4096 /* layout is meaningful for RAID0 */
 #define	MD_FEATURE_ALL			(MD_FEATURE_BITMAP_OFFSET	\
 					|MD_FEATURE_RECOVERY_OFFSET	\
 					|MD_FEATURE_RESHAPE_ACTIVE	\
@@ -341,6 +342,7 @@ struct mdp_superblock_1 {
 					|MD_FEATURE_JOURNAL		\
 					|MD_FEATURE_PPL			\
 					|MD_FEATURE_MULTIPLE_PPLS	\
+					|MD_FEATURE_RAID0_LAYOUT	\
 					)
 
 struct r5l_payload_header {
-- 
2.14.0.rc0.dirty




* Re: [PATCH] md/raid0: avoid RAID0 data corruption due to layout confusion.
  2019-09-09  6:57           ` [PATCH] md/raid0: avoid RAID0 data corruption due to layout confusion NeilBrown
  2019-09-09  6:58             ` [PATCH 2/2] md: add feature flag MD_FEATURE_RAID0_LAYOUT NeilBrown
@ 2019-09-09 14:56             ` Song Liu
  2019-09-09 23:33               ` NeilBrown
  2019-09-09 15:09             ` Guoqing Jiang
  2 siblings, 1 reply; 19+ messages in thread
From: Song Liu @ 2019-09-09 14:56 UTC
  To: NeilBrown; +Cc: Coly Li, NeilBrown, linux-block, linux-raid

Hi Neil,

> On Sep 9, 2019, at 7:57 AM, NeilBrown <neilb@suse.de> wrote:
> 
> 
> If the drives in a RAID0 are not all the same size, the array is
> divided into zones.
> The first zone covers all drives, to the size of the smallest.
> The second zone covers all drives larger than the smallest, up to
> the size of the second smallest - etc.
> 
> A change in Linux 3.14 unintentionally changed the layout for the
> second and subsequent zones.  All the correct data is still stored, but
> each chunk may be assigned to a different device than in pre-3.14 kernels.
> This can lead to data corruption.
> 
> It is not possible to determine what layout to use - it depends on which
> kernel the data was written by.
> So we add a module parameter to allow the old (0) or new (1) layout to be
> specified, and refuse to assemble an affected array if that parameter is
> not set.
> 
> Fixes: 20d0189b1012 ("block: Introduce new bio_split()")
> cc: stable@vger.kernel.org (3.14+)
> Signed-off-by: NeilBrown <neilb@suse.de>

Thanks for the patches. They look great. However, I am having problems
applying them (not sure whether it is a problem on my side). Could you
please push them somewhere so I can use cherry-pick instead?

Thanks,
Song


* Re: [PATCH] md/raid0: avoid RAID0 data corruption due to layout confusion.
  2019-09-09  6:57           ` [PATCH] md/raid0: avoid RAID0 data corruption due to layout confusion NeilBrown
  2019-09-09  6:58             ` [PATCH 2/2] md: add feature flag MD_FEATURE_RAID0_LAYOUT NeilBrown
  2019-09-09 14:56             ` [PATCH] md/raid0: avoid RAID0 data corruption due to layout confusion Song Liu
@ 2019-09-09 15:09             ` Guoqing Jiang
  2019-09-09 23:34               ` NeilBrown
From: Guoqing Jiang @ 2019-09-09 15:09 UTC
  To: NeilBrown, Coly Li, Song Liu; +Cc: NeilBrown, linux-block, linux-raid



On 9/9/19 8:57 AM, NeilBrown wrote:
> 
> If the drives in a RAID0 are not all the same size, the array is
> divided into zones.
> The first zone covers all drives, to the size of the smallest.
> The second zone covers all drives larger than the smallest, up to
> the size of the second smallest - etc.
> 
> A change in Linux 3.14 unintentionally changed the layout for the
> second and subsequent zones.  All the correct data is still stored, but
> each chunk may be assigned to a different device than in pre-3.14 kernels.
> This can lead to data corruption.
> 
> It is not possible to determine what layout to use - it depends on which
> kernel the data was written by.
> So we add a module parameter to allow the old (0) or new (1) layout to be
> specified, and refuse to assemble an affected array if that parameter is
> not set.
> 
> Fixes: 20d0189b1012 ("block: Introduce new bio_split()")
> cc: stable@vger.kernel.org (3.14+)
> Signed-off-by: NeilBrown <neilb@suse.de>
> ---
> 
> This and the next patch are my proposal for how to address
> this problem.  I haven't actually tested .....
> 
> NeilBrown
> 
>   drivers/md/raid0.c | 28 +++++++++++++++++++++++++++-
>   drivers/md/raid0.h | 14 ++++++++++++++
>   2 files changed, 41 insertions(+), 1 deletion(-)
> 
> diff --git a/drivers/md/raid0.c b/drivers/md/raid0.c
> index bf5cf184a260..a8888c12308a 100644
> --- a/drivers/md/raid0.c
> +++ b/drivers/md/raid0.c
> @@ -19,6 +19,9 @@
>   #include "raid0.h"
>   #include "raid5.h"
>   
> +static int default_layout = -1;
> +module_param(default_layout, int, 0644);
> +
>   #define UNSUPPORTED_MDDEV_FLAGS		\
>   	((1L << MD_HAS_JOURNAL) |	\
>   	 (1L << MD_JOURNAL_CLEAN) |	\
> @@ -139,6 +142,19 @@ static int create_strip_zones(struct mddev *mddev, struct r0conf **private_conf)
>   	}
>   	pr_debug("md/raid0:%s: FINAL %d zones\n",
>   		 mdname(mddev), conf->nr_strip_zones);
> +
> +	if (conf->nr_strip_zones == 1) {
> +		conf->layout = RAID0_ORIG_LAYOUT;
> +	} else if (default_layout == RAID0_ORIG_LAYOUT ||
> +		   default_layout == RAID0_ALT_MULTIZONE_LAYOUT) {
> +		conf->layout = default_layout;
> +	} else {
> +		pr_err("md/raid0:%s: cannot assemble multi-zone RAID0 with default_layout setting\n",
> +		       mdname(mddev));
> +		pr_err("md/raid0: please set raid.default_layout to 0 or 1\n");

Maybe "1 or 2" to align with the definition of below r0layout?

[snip]

> +enum r0layout {
> +	RAID0_ORIG_LAYOUT = 1,
> +	RAID0_ALT_MULTIZONE_LAYOUT = 2,
> +};
>   struct r0conf {
>   	struct strip_zone	*strip_zone;
>   	struct md_rdev		**devlist; /* lists of rdevs, pointed to
>   					    * by strip_zone->dev */
>   	int			nr_strip_zones;
> +	enum r0layout		layout;
>   };
>   
>   #endif
> 

Thanks,
Guoqing


* Re: [PATCH 2/2] md: add feature flag MD_FEATURE_RAID0_LAYOUT
  2019-09-09  6:58             ` [PATCH 2/2] md: add feature flag MD_FEATURE_RAID0_LAYOUT NeilBrown
@ 2019-09-09 15:33               ` Guoqing Jiang
  2019-09-09 23:26                 ` NeilBrown
From: Guoqing Jiang @ 2019-09-09 15:33 UTC
  To: NeilBrown, Coly Li, Song Liu; +Cc: NeilBrown, linux-block, linux-raid

Hi Neil,

On 9/9/19 8:58 AM, NeilBrown wrote:
> 
> Due to a bug introduced in Linux 3.14 we cannot determine the
> correct layout for a multi-zone RAID0 array - there are two
> possibiities.

possibilities.

> 
> It is possible to tell the kernel which to choose using a module
> parameter, but this can be clumsy to use.  It would be best if
> the choice were recorded in the metadata.
> So add a feature flag for this purpose.
> If it is set, then the 'layout' field of the superblock is used
> to determine which layout to use.
> 
> If this flag is not set, then mddev->layout gets set to -1,
> which causes the module parameter to be required.

Could you point out where the flag is set? Thanks.

> 
> Signed-off-by: NeilBrown <neilb@suse.de>
> ---
>   drivers/md/md.c                | 13 +++++++++++++
>   drivers/md/raid0.c             |  2 ++
>   include/uapi/linux/raid/md_p.h |  2 ++
>   3 files changed, 17 insertions(+)
> 
> diff --git a/drivers/md/md.c b/drivers/md/md.c
> index 1f70ec595282..a41fce7f8b4c 100644
> --- a/drivers/md/md.c
> +++ b/drivers/md/md.c
> @@ -1232,6 +1232,8 @@ static int super_90_validate(struct mddev *mddev, struct md_rdev *rdev)
>   			mddev->new_layout = mddev->layout;
>   			mddev->new_chunk_sectors = mddev->chunk_sectors;
>   		}
> +		if (mddev->level == 0)
> +			mddev->layout = -1;
>   
>   		if (sb->state & (1<<MD_SB_CLEAN))
>   			mddev->recovery_cp = MaxSector;
> @@ -1647,6 +1649,10 @@ static int super_1_load(struct md_rdev *rdev, struct md_rdev *refdev, int minor_
>   		rdev->ppl.sector = rdev->sb_start + rdev->ppl.offset;
>   	}
>   
> +	if ((le32_to_cpu(sb->feature_map) & MD_FEATURE_RAID0_LAYOUT) &&
> +	    sb->level != 0)
> +		return -EINVAL;
> +
>   	if (!refdev) {
>   		ret = 1;
>   	} else {
> @@ -1757,6 +1763,10 @@ static int super_1_validate(struct mddev *mddev, struct md_rdev *rdev)
>   			mddev->new_chunk_sectors = mddev->chunk_sectors;
>   		}
>   
> +		if (mddev->level == 0 &&
> +		    !(le32_to_cpu(sb->feature_map) & MD_FEATURE_RAID0_LAYOUT))
> +			mddev->layout = -1;
> +
>   		if (le32_to_cpu(sb->feature_map) & MD_FEATURE_JOURNAL)
>   			set_bit(MD_HAS_JOURNAL, &mddev->flags);
>   
> @@ -6852,6 +6862,9 @@ static int set_array_info(struct mddev *mddev, mdu_array_info_t *info)
>   	mddev->external	     = 0;
>   
>   	mddev->layout        = info->layout;
> +	if (mddev->level == 0)
> +		/* Cannot trust RAID0 layout info here */
> +		mddev->layout = -1;
>   	mddev->chunk_sectors = info->chunk_size >> 9;
>   
>   	if (mddev->persistent) {
> diff --git a/drivers/md/raid0.c b/drivers/md/raid0.c
> index a8888c12308a..6f095b0b6f5c 100644
> --- a/drivers/md/raid0.c
> +++ b/drivers/md/raid0.c
> @@ -145,6 +145,8 @@ static int create_strip_zones(struct mddev *mddev, struct r0conf **private_conf)
>   
>   	if (conf->nr_strip_zones == 1) {
>   		conf->layout = RAID0_ORIG_LAYOUT;
> +	} else if (mddev->layout == RAID0_ORIG_LAYOUT ||
> +		   mddev->layout == RAID0_ALT_MULTIZONE_LAYOUT) {

Maybe "conf->layout = mddev->layout" here? Otherwise seems conf->layout is not set accordingly, just 
my 2 cents.

>   	} else if (default_layout == RAID0_ORIG_LAYOUT ||
>   		   default_layout == RAID0_ALT_MULTIZONE_LAYOUT) {
>   		conf->layout = default_layout;
> diff --git a/include/uapi/linux/raid/md_p.h b/include/uapi/linux/raid/md_p.h
> index b0d15c73f6d7..1f2d8c81f0e0 100644
> --- a/include/uapi/linux/raid/md_p.h
> +++ b/include/uapi/linux/raid/md_p.h
> @@ -329,6 +329,7 @@ struct mdp_superblock_1 {
>   #define	MD_FEATURE_JOURNAL		512 /* support write cache */
>   #define	MD_FEATURE_PPL			1024 /* support PPL */
>   #define	MD_FEATURE_MULTIPLE_PPLS	2048 /* support for multiple PPLs */
> +#define	MD_FEATURE_RAID0_LAYOUT		4096 /* layout is meaningful for RAID0 */
>   #define	MD_FEATURE_ALL			(MD_FEATURE_BITMAP_OFFSET	\
>   					|MD_FEATURE_RECOVERY_OFFSET	\
>   					|MD_FEATURE_RESHAPE_ACTIVE	\
> @@ -341,6 +342,7 @@ struct mdp_superblock_1 {
>   					|MD_FEATURE_JOURNAL		\
>   					|MD_FEATURE_PPL			\
>   					|MD_FEATURE_MULTIPLE_PPLS	\
> +					|MD_FEATURE_RAID0_LAYOUT	\
>   					)
>   
>   struct r5l_payload_header {
> 

Thanks,
Guoqing

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [PATCH 2/2] md: add feature flag MD_FEATURE_RAID0_LAYOUT
  2019-09-09 15:33               ` Guoqing Jiang
@ 2019-09-09 23:26                 ` NeilBrown
  0 siblings, 0 replies; 19+ messages in thread
From: NeilBrown @ 2019-09-09 23:26 UTC (permalink / raw)
  To: Guoqing Jiang, Coly Li, Song Liu; +Cc: NeilBrown, linux-block, linux-raid


On Mon, Sep 09 2019, Guoqing Jiang wrote:

> Hi Neil,
>
> On 9/9/19 8:58 AM, NeilBrown wrote:
>> 
>> Due to a bug introduced in Linux 3.14 we cannot determine the
>> correct layout for a multi-zone RAID0 array - there are two
>> possibiities.
>
> possibilities.

Thanks.

>
>> 
>> It is possible to tell the kernel which to choose using a module
>> parameter, but this can be clumsy to use.  It would be best if
>> the choice were recorded in the metadata.
>> So add a feature flag for this purpose.
>> If it is set, then the 'layout' field of the superblock is used
>> to determine which layout to use.
>> 
>> If this flag is not set, then mddev->layout gets set to -1,
>> which causes the module parameter to be required.
>
> Could you point out where the flag is set? Thanks.

It isn't set by the kernel - the kernel doesn't know when to set it.

We would need to change mdadm to set the flag, either when creating an
array, or when asked to via --update.

Actually.... that would be a problem if someone used the new mdadm on an
old kernel.  The old kernel would refuse to assemble the array with the
flag set.
Maybe that is what we want anyway.  We *want* people to never use
multi-zone RAID0 on old kernels, because the result could be data
corruption.

So - mdadm needs to add the flag, and maybe warn if the kernel is too
old.
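
A minimal sketch of what that mdadm-side change might look like, assuming
the v1.x superblock definitions from include/uapi/linux/raid/md_p.h with
patch 2/2 applied (the helper name and call site are hypothetical, not
actual mdadm code):

    #include <linux/raid/md_p.h>   /* struct mdp_superblock_1, feature bits */
    #include <asm/byteorder.h>     /* __cpu_to_le32() */

    /* Hypothetical helper: record the chosen RAID0 layout in a v1.x
     * superblock and set the feature bit so that newer kernels know
     * sb->layout is meaningful. */
    static void sb1_set_raid0_layout(struct mdp_superblock_1 *sb, __u32 layout)
    {
            sb->layout = __cpu_to_le32(layout);
            sb->feature_map |= __cpu_to_le32(MD_FEATURE_RAID0_LAYOUT);
    }

mdadm would call something like this when creating a multi-zone RAID0 or on
an explicit --update, and per the concern above it could first check the
running kernel version and warn if it predates the flag.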


>
>> 
>> Signed-off-by: NeilBrown <neilb@suse.de>
>> ---
>>   drivers/md/md.c                | 13 +++++++++++++
>>   drivers/md/raid0.c             |  2 ++
>>   include/uapi/linux/raid/md_p.h |  2 ++
>>   3 files changed, 17 insertions(+)
>> 
>> diff --git a/drivers/md/md.c b/drivers/md/md.c
>> index 1f70ec595282..a41fce7f8b4c 100644
>> --- a/drivers/md/md.c
>> +++ b/drivers/md/md.c
>> @@ -1232,6 +1232,8 @@ static int super_90_validate(struct mddev *mddev, struct md_rdev *rdev)
>>   			mddev->new_layout = mddev->layout;
>>   			mddev->new_chunk_sectors = mddev->chunk_sectors;
>>   		}
>> +		if (mddev->level == 0)
>> +			mddev->layout = -1;
>>   
>>   		if (sb->state & (1<<MD_SB_CLEAN))
>>   			mddev->recovery_cp = MaxSector;
>> @@ -1647,6 +1649,10 @@ static int super_1_load(struct md_rdev *rdev, struct md_rdev *refdev, int minor_
>>   		rdev->ppl.sector = rdev->sb_start + rdev->ppl.offset;
>>   	}
>>   
>> +	if ((le32_to_cpu(sb->feature_map) & MD_FEATURE_RAID0_LAYOUT) &&
>> +	    sb->level != 0)
>> +		return -EINVAL;
>> +
>>   	if (!refdev) {
>>   		ret = 1;
>>   	} else {
>> @@ -1757,6 +1763,10 @@ static int super_1_validate(struct mddev *mddev, struct md_rdev *rdev)
>>   			mddev->new_chunk_sectors = mddev->chunk_sectors;
>>   		}
>>   
>> +		if (mddev->level == 0 &&
>> +		    !(le32_to_cpu(sb->feature_map) & MD_FEATURE_RAID0_LAYOUT))
>> +			mddev->layout = -1;
>> +
>>   		if (le32_to_cpu(sb->feature_map) & MD_FEATURE_JOURNAL)
>>   			set_bit(MD_HAS_JOURNAL, &mddev->flags);
>>   
>> @@ -6852,6 +6862,9 @@ static int set_array_info(struct mddev *mddev, mdu_array_info_t *info)
>>   	mddev->external	     = 0;
>>   
>>   	mddev->layout        = info->layout;
>> +	if (mddev->level == 0)
>> +		/* Cannot trust RAID0 layout info here */
>> +		mddev->layout = -1;
>>   	mddev->chunk_sectors = info->chunk_size >> 9;
>>   
>>   	if (mddev->persistent) {
>> diff --git a/drivers/md/raid0.c b/drivers/md/raid0.c
>> index a8888c12308a..6f095b0b6f5c 100644
>> --- a/drivers/md/raid0.c
>> +++ b/drivers/md/raid0.c
>> @@ -145,6 +145,8 @@ static int create_strip_zones(struct mddev *mddev, struct r0conf **private_conf)
>>   
>>   	if (conf->nr_strip_zones == 1) {
>>   		conf->layout = RAID0_ORIG_LAYOUT;
>> +	} else if (mddev->layout == RAID0_ORIG_LAYOUT ||
>> +		   mddev->layout == RAID0_ALT_MULTIZONE_LAYOUT) {
>
> Maybe "conf->layout = mddev->layout" here? Otherwise seems conf->layout is not set accordingly, just 
> my 2 cents.
>

Yes, of course.  Thanks.
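
With that fix folded in, the selection logic in create_strip_zones() would
read roughly as follows (a sketch of both patches combined with the fixes
from this review, not the final committed code):

    if (conf->nr_strip_zones == 1) {
            conf->layout = RAID0_ORIG_LAYOUT;
    } else if (mddev->layout == RAID0_ORIG_LAYOUT ||
               mddev->layout == RAID0_ALT_MULTIZONE_LAYOUT) {
            conf->layout = mddev->layout;   /* from the superblock (patch 2/2) */
    } else if (default_layout == RAID0_ORIG_LAYOUT ||
               default_layout == RAID0_ALT_MULTIZONE_LAYOUT) {
            conf->layout = default_layout;  /* from the module parameter (patch 1/2) */
    } else {
            pr_err("md/raid0:%s: cannot assemble multi-zone RAID0 with default_layout setting\n",
                   mdname(mddev));
            /* ...print the hint and refuse to assemble, as in patch 1/2 */
    }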

Thanks for your review,
NeilBrown


>>   	} else if (default_layout == RAID0_ORIG_LAYOUT ||
>>   		   default_layout == RAID0_ALT_MULTIZONE_LAYOUT) {
>>   		conf->layout = default_layout;
>> diff --git a/include/uapi/linux/raid/md_p.h b/include/uapi/linux/raid/md_p.h
>> index b0d15c73f6d7..1f2d8c81f0e0 100644
>> --- a/include/uapi/linux/raid/md_p.h
>> +++ b/include/uapi/linux/raid/md_p.h
>> @@ -329,6 +329,7 @@ struct mdp_superblock_1 {
>>   #define	MD_FEATURE_JOURNAL		512 /* support write cache */
>>   #define	MD_FEATURE_PPL			1024 /* support PPL */
>>   #define	MD_FEATURE_MULTIPLE_PPLS	2048 /* support for multiple PPLs */
>> +#define	MD_FEATURE_RAID0_LAYOUT		4096 /* layout is meaningful for RAID0 */
>>   #define	MD_FEATURE_ALL			(MD_FEATURE_BITMAP_OFFSET	\
>>   					|MD_FEATURE_RECOVERY_OFFSET	\
>>   					|MD_FEATURE_RESHAPE_ACTIVE	\
>> @@ -341,6 +342,7 @@ struct mdp_superblock_1 {
>>   					|MD_FEATURE_JOURNAL		\
>>   					|MD_FEATURE_PPL			\
>>   					|MD_FEATURE_MULTIPLE_PPLS	\
>> +					|MD_FEATURE_RAID0_LAYOUT	\
>>   					)
>>   
>>   struct r5l_payload_header {
>> 
>
> Thanks,
> Guoqing


^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [PATCH] md/raid0: avoid RAID0 data corruption due to layout confusion.
  2019-09-09 14:56             ` [PATCH] md/raid0: avoid RAID0 data corruption due to layout confusion Song Liu
@ 2019-09-09 23:33               ` NeilBrown
  2019-09-10 15:45                 ` Song Liu
  0 siblings, 1 reply; 19+ messages in thread
From: NeilBrown @ 2019-09-09 23:33 UTC (permalink / raw)
  To: Song Liu; +Cc: Coly Li, NeilBrown, linux-block, linux-raid


On Mon, Sep 09 2019, Song Liu wrote:

> Hi Neil,
>
>> On Sep 9, 2019, at 7:57 AM, NeilBrown <neilb@suse.de> wrote:
>> 
>> 
>> If the drives in a RAID0 are not all the same size, the array is
>> divided into zones.
>> The first zone covers all drives, to the size of the smallest.
>> The second zone covers all drives larger than the smallest, up to
>> the size of the second smallest - etc.
>> 
>> A change in Linux 3.14 unintentionally changed the layout for the
>> second and subsequent zones.  All the correct data is still stored, but
>> each chunk may be assigned to a different device than in pre-3.14 kernels.
>> This can lead to data corruption.
>> 
>> It is not possible to determine what layout to use - it depends which
>> kernel the data was written by.
>> So we add a module parameter to allow the old (0) or new (1) layout to be
>> specified, and refuse to assemble an affected array if that parameter is
>> not set.
>> 
>> Fixes: 20d0189b1012 ("block: Introduce new bio_split()")
>> cc: stable@vger.kernel.org (3.14+)
>> Signed-off-by: NeilBrown <neilb@suse.de>
>
>> Thanks for the patches. They look great. However, I am having problems
>> applying them (not sure whether it is a problem on my side). Could you
> please push it somewhere so I can use cherry-pick instead?

I rebased them on block/for-next, fixed the problems that Guoqing found,
and pushed them to 
  https://github.com/neilbrown/linux md/raid0

NeilBrown
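
As an aside, the zone division described in the quoted commit message can
be made concrete with a toy userspace calculation (illustrative only; the
sizes are hypothetical and counted in chunks):

    #include <stdio.h>

    int main(void)
    {
            /* Three hypothetical member drives, sorted by size, in chunks. */
            long size[] = { 100, 200, 300 };
            int n = 3, i;
            long prev = 0;

            for (i = 0; i < n; i++) {
                    /* Zone i stripes over the n - i drives that still have room. */
                    long len = (size[i] - prev) * (n - i);
                    if (len)
                            printf("zone %d: %d drive(s), %ld chunks of array space\n",
                                   i, n - i, len);
                    prev = size[i];
            }
            return 0;
    }

This prints zone 0 spanning all three drives (300 chunks), zone 1 spanning
two (200 chunks) and zone 2 spanning one (100 chunks); it is the
chunk-to-device mapping inside zones 1 and 2 that the two layouts disagree
on.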


^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [PATCH] md/raid0: avoid RAID0 data corruption due to layout confusion.
  2019-09-09 15:09             ` Guoqing Jiang
@ 2019-09-09 23:34               ` NeilBrown
  0 siblings, 0 replies; 19+ messages in thread
From: NeilBrown @ 2019-09-09 23:34 UTC (permalink / raw)
  To: Guoqing Jiang, Coly Li, Song Liu; +Cc: NeilBrown, linux-block, linux-raid


On Mon, Sep 09 2019, Guoqing Jiang wrote:

> On 9/9/19 8:57 AM, NeilBrown wrote:
>> 
>> If the drives in a RAID0 are not all the same size, the array is
>> divided into zones.
>> The first zone covers all drives, to the size of the smallest.
>> The second zone covers all drives larger than the smallest, up to
>> the size of the second smallest - etc.
>> 
>> A change in Linux 3.14 unintentionally changed the layout for the
>> second and subsequent zones.  All the correct data is still stored, but
>> each chunk may be assigned to a different device than in pre-3.14 kernels.
>> This can lead to data corruption.
>> 
>> It is not possible to determine what layout to use - it depends which
>> kernel the data was written by.
>> So we add a module parameter to allow the old (0) or new (1) layout to be
>> specified, and refuse to assemble an affected array if that parameter is
>> not set.
>> 
>> Fixes: 20d0189b1012 ("block: Introduce new bio_split()")
>> cc: stable@vger.kernel.org (3.14+)
>> Signed-off-by: NeilBrown <neilb@suse.de>
>> ---
>> 
>> This and the next patch are my proposal for how to address
>> this problem.  I haven't actually tested .....
>> 
>> NeilBrown
>> 
>>   drivers/md/raid0.c | 28 +++++++++++++++++++++++++++-
>>   drivers/md/raid0.h | 14 ++++++++++++++
>>   2 files changed, 41 insertions(+), 1 deletion(-)
>> 
>> diff --git a/drivers/md/raid0.c b/drivers/md/raid0.c
>> index bf5cf184a260..a8888c12308a 100644
>> --- a/drivers/md/raid0.c
>> +++ b/drivers/md/raid0.c
>> @@ -19,6 +19,9 @@
>>   #include "raid0.h"
>>   #include "raid5.h"
>>   
>> +static int default_layout = -1;
>> +module_param(default_layout, int, 0644);
>> +
>>   #define UNSUPPORTED_MDDEV_FLAGS		\
>>   	((1L << MD_HAS_JOURNAL) |	\
>>   	 (1L << MD_JOURNAL_CLEAN) |	\
>> @@ -139,6 +142,19 @@ static int create_strip_zones(struct mddev *mddev, struct r0conf **private_conf)
>>   	}
>>   	pr_debug("md/raid0:%s: FINAL %d zones\n",
>>   		 mdname(mddev), conf->nr_strip_zones);
>> +
>> +	if (conf->nr_strip_zones == 1) {
>> +		conf->layout = RAID0_ORIG_LAYOUT;
>> +	} else if (default_layout == RAID0_ORIG_LAYOUT ||
>> +		   default_layout == RAID0_ALT_MULTIZONE_LAYOUT) {
>> +		conf->layout = default_layout;
>> +	} else {
>> +		pr_err("md/raid0:%s: cannot assemble multi-zone RAID0 with default_layout setting\n",
>> +		       mdname(mddev));
>> +		pr_err("md/raid0: please set raid.default_layout to 0 or 1\n");
>
> Maybe "1 or 2" to align with the definition of below r0layout?

Thanks.  Fixed.

NeilBrown

>
> [snip]
>
>> +enum r0layout {
>> +	RAID0_ORIG_LAYOUT = 1,
>> +	RAID0_ALT_MULTIZONE_LAYOUT = 2,
>> +};
>>   struct r0conf {
>>   	struct strip_zone	*strip_zone;
>>   	struct md_rdev		**devlist; /* lists of rdevs, pointed to
>>   					    * by strip_zone->dev */
>>   	int			nr_strip_zones;
>> +	enum r0layout		layout;
>>   };
>>   
>>   #endif
>> 
>
> Thanks,
> Guoqing


^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [PATCH] md/raid0: avoid RAID0 data corruption due to layout confusion.
  2019-09-09 23:33               ` NeilBrown
@ 2019-09-10 15:45                 ` Song Liu
  2019-09-10 16:01                   ` Guoqing Jiang
  0 siblings, 1 reply; 19+ messages in thread
From: Song Liu @ 2019-09-10 15:45 UTC (permalink / raw)
  To: NeilBrown, Guoqing Jiang; +Cc: Coly Li, NeilBrown, linux-block, linux-raid



> On Sep 10, 2019, at 12:33 AM, NeilBrown <neilb@suse.de> wrote:
> 
> On Mon, Sep 09 2019, Song Liu wrote:
> 
>> Hi Neil,
>> 
>>> On Sep 9, 2019, at 7:57 AM, NeilBrown <neilb@suse.de> wrote:
>>> 
>>> 
>>> If the drives in a RAID0 are not all the same size, the array is
>>> divided into zones.
>>> The first zone covers all drives, to the size of the smallest.
>>> The second zone covers all drives larger than the smallest, up to
>>> the size of the second smallest - etc.
>>> 
>>> A change in Linux 3.14 unintentionally changed the layout for the
>>> second and subsequent zones.  All the correct data is still stored, but
>>> each chunk may be assigned to a different device than in pre-3.14 kernels.
>>> This can lead to data corruption.
>>> 
>>> It is not possible to determine what layout to use - it depends which
>>> kernel the data was written by.
>>> So we add a module parameter to allow the old (0) or new (1) layout to be
>>> specified, and refuse to assemble an affected array if that parameter is
>>> not set.
>>> 
>>> Fixes: 20d0189b1012 ("block: Introduce new bio_split()")
>>> cc: stable@vger.kernel.org (3.14+)
>>> Signed-off-by: NeilBrown <neilb@suse.de>
>> 
>> Thanks for the patches. They look great. However, I am having problems
>> applying them (not sure whether it is a problem on my side). Could you
>> please push it somewhere so I can use cherry-pick instead?
> 
> I rebased them on block/for-next, fixed the problems that Guoqing found,
> and pushed them to 
>  https://github.com/neilbrown/linux md/raid0
> 
> NeilBrown

Thanks Neil!

Guoqing, if this looks good, please reply with your Reviewed-by
or Acked-by. 

Thanks,
Song


^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [PATCH] md/raid0: avoid RAID0 data corruption due to layout confusion.
  2019-09-10 15:45                 ` Song Liu
@ 2019-09-10 16:01                   ` Guoqing Jiang
  2019-09-10 23:08                     ` NeilBrown
  0 siblings, 1 reply; 19+ messages in thread
From: Guoqing Jiang @ 2019-09-10 16:01 UTC (permalink / raw)
  To: Song Liu, NeilBrown, Guoqing Jiang
  Cc: Coly Li, NeilBrown, linux-block, linux-raid



On 9/10/19 5:45 PM, Song Liu wrote:
> 
> 
>> On Sep 10, 2019, at 12:33 AM, NeilBrown <neilb@suse.de> wrote:
>>
>> On Mon, Sep 09 2019, Song Liu wrote:
>>
>>> Hi Neil,
>>>
>>>> On Sep 9, 2019, at 7:57 AM, NeilBrown <neilb@suse.de> wrote:
>>>>
>>>>
>>>> If the drives in a RAID0 are not all the same size, the array is
>>>> divided into zones.
>>>> The first zone covers all drives, to the size of the smallest.
>>>> The second zone covers all drives larger than the smallest, up to
>>>> the size of the second smallest - etc.
>>>>
>>>> A change in Linux 3.14 unintentionally changed the layout for the
>>>> second and subsequent zones.  All the correct data is still stored, but
>>>> each chunk may be assigned to a different device than in pre-3.14 kernels.
>>>> This can lead to data corruption.
>>>>
>>>> It is not possible to determine what layout to use - it depends which
>>>> kernel the data was written by.
>>>> So we add a module parameter to allow the old (0) or new (1) layout to be
>>>> specified, and refuse to assemble an affected array if that parameter is
>>>> not set.
>>>>
>>>> Fixes: 20d0189b1012 ("block: Introduce new bio_split()")
>>>> cc: stable@vger.kernel.org (3.14+)
>>>> Signed-off-by: NeilBrown <neilb@suse.de>
>>>
>>> Thanks for the patches. They look great. However, I am having problems
>>> applying them (not sure whether it is a problem on my side). Could you
>>> please push it somewhere so I can use cherry-pick instead?
>>
>> I rebased them on block/for-next, fixed the problems that Guoqing found,
>> and pushed them to
>>   https://github.com/neilbrown/linux md/raid0
>>
>> NeilBrown
> 
> Thanks Neil!

Thanks for the explanation about setting the flag.

> 
> Guoqing, if this looks good, please reply with your Reviewed-by
> or Acked-by.

No more comments from my side, but I am not sure if it would be better/possible to use a
sysfs node to control the behavior instead of a module parameter; then we could support
different raid0 layouts dynamically.

Anyway, Acked-by: Guoqing Jiang <guoqing.jiang@cloud.ionos.com>

Thanks,
Guoqing

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [PATCH] md/raid0: avoid RAID0 data corruption due to layout confusion.
  2019-09-10 16:01                   ` Guoqing Jiang
@ 2019-09-10 23:08                     ` NeilBrown
  2019-09-11  9:56                       ` Song Liu
  0 siblings, 1 reply; 19+ messages in thread
From: NeilBrown @ 2019-09-10 23:08 UTC (permalink / raw)
  To: Guoqing Jiang, Song Liu, Guoqing Jiang
  Cc: Coly Li, NeilBrown, linux-block, linux-raid


On Tue, Sep 10 2019, Guoqing Jiang wrote:

> On 9/10/19 5:45 PM, Song Liu wrote:
>> 
>> 
>>> On Sep 10, 2019, at 12:33 AM, NeilBrown <neilb@suse.de> wrote:
>>>
>>> On Mon, Sep 09 2019, Song Liu wrote:
>>>
>>>> Hi Neil,
>>>>
>>>>> On Sep 9, 2019, at 7:57 AM, NeilBrown <neilb@suse.de> wrote:
>>>>>
>>>>>
>>>>> If the drives in a RAID0 are not all the same size, the array is
>>>>> divided into zones.
>>>>> The first zone covers all drives, to the size of the smallest.
>>>>> The second zone covers all drives larger than the smallest, up to
>>>>> the size of the second smallest - etc.
>>>>>
>>>>> A change in Linux 3.14 unintentionally changed the layout for the
>>>>> second and subsequent zones.  All the correct data is still stored, but
>>>>> each chunk may be assigned to a different device than in pre-3.14 kernels.
>>>>> This can lead to data corruption.
>>>>>
>>>>> It is not possible to determine what layout to use - it depends which
>>>>> kernel the data was written by.
>>>>> So we add a module parameter to allow the old (0) or new (1) layout to be
>>>>> specified, and refuse to assemble an affected array if that parameter is
>>>>> not set.
>>>>>
>>>>> Fixes: 20d0189b1012 ("block: Introduce new bio_split()")
>>>>> cc: stable@vger.kernel.org (3.14+)
>>>>> Signed-off-by: NeilBrown <neilb@suse.de>
>>>>
>>>> Thanks for the patches. They look great. However, I am having problems
>>>> applying them (not sure whether it is a problem on my side). Could you
>>>> please push it somewhere so I can use cherry-pick instead?
>>>
>>> I rebased them on block/for-next, fixed the problems that Guoqing found,
>>> and pushed them to
>>>   https://github.com/neilbrown/linux md/raid0
>>>
>>> NeilBrown
>> 
>> Thanks Neil!
>
> Thanks for the explanation about setting the flag.
>
>> 
>> Guoqing, if this looks good, please reply with your Reviewed-by
>> or Acked-by.
>
> No more comments from my side, but I am not sure if it would be better/possible to use a
> sysfs node to control the behavior instead of a module parameter; then we could support
> different raid0 layouts dynamically.

A strength of module parameters is that you can set them in
  /etc/modprobe.d/00-local.conf
and then they are automatically set on boot.
For sysfs, you need some tool to set them.
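
For example, something like the following should persist the choice across
boots (parameter name per patch 1/2, which declares default_layout in the
raid0 module):

    # /etc/modprobe.d/00-local.conf
    # 1 = RAID0_ORIG_LAYOUT (pre-3.14), 2 = RAID0_ALT_MULTIZONE_LAYOUT (3.14+)
    options raid0 default_layout=2

And because the parameter is declared with mode 0644, it should also appear
writable under /sys/module/raid0/parameters/default_layout at runtime, which
partly addresses the sysfs point.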

>
> Anyway, Acked-by: Guoqing Jiang <guoqing.jiang@cloud.ionos.com>
>

Thanks,
NeilBrown


> Thanks,
> Guoqing


^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [PATCH] md/raid0: avoid RAID0 data corruption due to layout confusion.
  2019-09-10 23:08                     ` NeilBrown
@ 2019-09-11  9:56                       ` Song Liu
  2019-09-11 22:48                         ` NeilBrown
  0 siblings, 1 reply; 19+ messages in thread
From: Song Liu @ 2019-09-11  9:56 UTC (permalink / raw)
  To: NeilBrown
  Cc: Guoqing Jiang, Song Liu, Guoqing Jiang, Coly Li, NeilBrown,
	linux-block, linux-raid

On Wed, Sep 11, 2019 at 12:10 AM NeilBrown <neilb@suse.de> wrote:
>
> On Tue, Sep 10 2019, Guoqing Jiang wrote:
>
> > On 9/10/19 5:45 PM, Song Liu wrote:
> >>
> >>
> >>> On Sep 10, 2019, at 12:33 AM, NeilBrown <neilb@suse.de> wrote:
> >>>
> >>> On Mon, Sep 09 2019, Song Liu wrote:
> >>>
> >>>> Hi Neil,
> >>>>
> >>>>> On Sep 9, 2019, at 7:57 AM, NeilBrown <neilb@suse.de> wrote:
> >>>>>
> >>>>>
> >>>>> If the drives in a RAID0 are not all the same size, the array is
> >>>>> divided into zones.
> >>>>> The first zone covers all drives, to the size of the smallest.
> >>>>> The second zone covers all drives larger than the smallest, up to
> >>>>> the size of the second smallest - etc.
> >>>>>
> >>>>> A change in Linux 3.14 unintentionally changed the layout for the
> >>>>> second and subsequent zones.  All the correct data is still stored, but
> >>>>> each chunk may be assigned to a different device than in pre-3.14 kernels.
> >>>>> This can lead to data corruption.
> >>>>>
> >>>>> It is not possible to determine what layout to use - it depends which
> >>>>> kernel the data was written by.
> >>>>> So we add a module parameter to allow the old (0) or new (1) layout to be
> >>>>> specified, and refuse to assemble an affected array if that parameter is
> >>>>> not set.
> >>>>>
> >>>>> Fixes: 20d0189b1012 ("block: Introduce new bio_split()")
> >>>>> cc: stable@vger.kernel.org (3.14+)
> >>>>> Signed-off-by: NeilBrown <neilb@suse.de>
> >>>>
> >>>> Thanks for the patches. They look great. However, I am having problems
> >>>> applying them (not sure whether it is a problem on my side). Could you
> >>>> please push it somewhere so I can use cherry-pick instead?
> >>>
> >>> I rebased them on block/for-next, fixed the problems that Guoqing found,
> >>> and pushed them to
> >>>   https://github.com/neilbrown/linux md/raid0
> >>>
> >>> NeilBrown
> >>
> >> Thanks Neil!
> >
> > Thanks for the explanation about setting the flag.
> >
> >>
> >> Guoqing, if this looks good, please reply with your Reviewed-by
> >> or Acked-by.
> >
> > No more comments from my side, but I am not sure if it would be better/possible to use a
> > sysfs node to control the behavior instead of a module parameter; then we could support
> > different raid0 layouts dynamically.
>
> A strength of module parameters is that you can set them in
>   /etc/modprobe.d/00-local.conf
> and then they are automatically set on boot.
> For sysfs, you need some tool to set them.
>
> >
> > Anyway, Acked-by: Guoqing Jiang <guoqing.jiang@cloud.ionos.com>
> >

I am adding the following change to the 1/2. Please let me know if it doesn't
make sense.

Thanks,
Song

diff --git i/drivers/md/raid0.c w/drivers/md/raid0.c
index a9fcff50bbfc..54d0064787a8 100644
--- i/drivers/md/raid0.c
+++ w/drivers/md/raid0.c
@@ -615,6 +615,10 @@ static bool raid0_make_request(struct mddev *mddev, struct bio *bio)
        case RAID0_ALT_MULTIZONE_LAYOUT:
                tmp_dev = map_sector(mddev, zone, sector, &sector);
                break;
+       default:
+               WARN(1, "md/raid0:%s: Invalid layout\n", mdname(mddev));
+               bio_io_error(bio);
+               return true;
        }

        if (unlikely(is_mddev_broken(tmp_dev, "raid0"))) {
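
For context, the switch this default case lands in chooses between the two
historical mappings; reconstructed roughly from patch 1/2 (a sketch with
variable names as in the patch, not a verbatim excerpt):

    sector_t orig_sector = sector;              /* array-relative LBA */

    zone = find_zone(mddev->private, &sector);  /* sector becomes zone-relative */
    switch (conf->layout) {
    case RAID0_ORIG_LAYOUT:
            /* pre-3.14 behaviour: map using the array-relative sector */
            tmp_dev = map_sector(mddev, zone, orig_sector, &sector);
            break;
    case RAID0_ALT_MULTIZONE_LAYOUT:
            /* 3.14+ behaviour: map using the zone-relative sector */
            tmp_dev = map_sector(mddev, zone, sector, &sector);
            break;
    default:
            WARN(1, "md/raid0:%s: Invalid layout\n", mdname(mddev));
            bio_io_error(bio);
            return true;
    }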

^ permalink raw reply related	[flat|nested] 19+ messages in thread

* Re: [PATCH] md/raid0: avoid RAID0 data corruption due to layout confusion.
  2019-09-11  9:56                       ` Song Liu
@ 2019-09-11 22:48                         ` NeilBrown
  0 siblings, 0 replies; 19+ messages in thread
From: NeilBrown @ 2019-09-11 22:48 UTC (permalink / raw)
  To: Song Liu
  Cc: Guoqing Jiang, Song Liu, Guoqing Jiang, Coly Li, NeilBrown,
	linux-block, linux-raid


On Wed, Sep 11 2019, Song Liu wrote:

> On Wed, Sep 11, 2019 at 12:10 AM NeilBrown <neilb@suse.de> wrote:
>>
>> On Tue, Sep 10 2019, Guoqing Jiang wrote:
>>
>> > On 9/10/19 5:45 PM, Song Liu wrote:
>> >>
>> >>
>> >>> On Sep 10, 2019, at 12:33 AM, NeilBrown <neilb@suse.de> wrote:
>> >>>
>> >>> On Mon, Sep 09 2019, Song Liu wrote:
>> >>>
>> >>>> Hi Neil,
>> >>>>
>> >>>>> On Sep 9, 2019, at 7:57 AM, NeilBrown <neilb@suse.de> wrote:
>> >>>>>
>> >>>>>
>> >>>>> If the drives in a RAID0 are not all the same size, the array is
>> >>>>> divided into zones.
>> >>>>> The first zone covers all drives, to the size of the smallest.
>> >>>>> The second zone covers all drives larger than the smallest, up to
>> >>>>> the size of the second smallest - etc.
>> >>>>>
>> >>>>> A change in Linux 3.14 unintentionally changed the layout for the
>> >>>>> second and subsequent zones.  All the correct data is still stored, but
>> >>>>> each chunk may be assigned to a different device than in pre-3.14 kernels.
>> >>>>> This can lead to data corruption.
>> >>>>>
>> >>>>> It is not possible to determine what layout to use - it depends which
>> >>>>> kernel the data was written by.
>> >>>>> So we add a module parameter to allow the old (0) or new (1) layout to be
>> >>>>> specified, and refuse to assemble an affected array if that parameter is
>> >>>>> not set.
>> >>>>>
>> >>>>> Fixes: 20d0189b1012 ("block: Introduce new bio_split()")
>> >>>>> cc: stable@vger.kernel.org (3.14+)
>> >>>>> Signed-off-by: NeilBrown <neilb@suse.de>
>> >>>>
>> >>>> Thanks for the patches. They look great. However, I am having problems
>> >>>> applying them (not sure whether it is a problem on my side). Could you
>> >>>> please push it somewhere so I can use cherry-pick instead?
>> >>>
>> >>> I rebased them on block/for-next, fixed the problems that Guoqing found,
>> >>> and pushed them to
>> >>>   https://github.com/neilbrown/linux md/raid0
>> >>>
>> >>> NeilBrown
>> >>
>> >> Thanks Neil!
>> >
>> > Thanks for the explanation about setting the flag.
>> >
>> >>
>> >> Guoqing, if this looks good, please reply with your Reviewed-by
>> >> or Acked-by.
>> >
>> > No more comments from my side, but I am not sure if it would be better/possible to use a
>> > sysfs node to control the behavior instead of a module parameter; then we could support
>> > different raid0 layouts dynamically.
>>
>> A strength of module parameters is that you can set them in
>>   /etc/modprobe.d/00-local.conf
>> and then they are automatically set on boot.
>> For sysfs, you need some tool to set them.
>>
>> >
>> > Anyway, Acked-by: Guoqing Jiang <guoqing.jiang@cloud.ionos.com>
>> >
>
> I am adding the following change to the 1/2. Please let me know if it doesn't
> make sense.

I don't object, though with the current code it is impossible for that
warning to fire.
Code might change in the future though, and it's better to be safe than
sorry.

Thanks,
NeilBrown

>
> Thanks,
> Song
>
> diff --git i/drivers/md/raid0.c w/drivers/md/raid0.c
> index a9fcff50bbfc..54d0064787a8 100644
> --- i/drivers/md/raid0.c
> +++ w/drivers/md/raid0.c
> @@ -615,6 +615,10 @@ static bool raid0_make_request(struct mddev *mddev, struct bio *bio)
>         case RAID0_ALT_MULTIZONE_LAYOUT:
>                 tmp_dev = map_sector(mddev, zone, sector, &sector);
>                 break;
> +       default:
> +               WARN(1, "md/raid0:%s: Invalid layout\n", mdname(mddev));
> +               bio_io_error(bio);
> +               return true;
>         }
>
>         if (unlikely(is_mddev_broken(tmp_dev, "raid0"))) {


^ permalink raw reply	[flat|nested] 19+ messages in thread

end of thread, other threads:[~2019-09-11 22:49 UTC | newest]

Thread overview: 19+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2019-08-22 10:38 [RFC] How to handle an ugly md raid0 sector map bug ? Coly Li
2019-08-23  0:02 ` NeilBrown
2019-08-23 16:37   ` Song Liu
2019-08-23 17:03     ` Coly Li
2019-08-23 17:17       ` Song Liu
2019-08-23 17:47         ` Coly Li
2019-09-09  6:57           ` [PATCH] md/raid0: avoid RAID0 data corruption due to layout confusion NeilBrown
2019-09-09  6:58             ` [PATCH 2/2] md: add feature flag MD_FEATURE_RAID0_LAYOUT NeilBrown
2019-09-09 15:33               ` Guoqing Jiang
2019-09-09 23:26                 ` NeilBrown
2019-09-09 14:56             ` [PATCH] md/raid0: avoid RAID0 data corruption due to layout confusion Song Liu
2019-09-09 23:33               ` NeilBrown
2019-09-10 15:45                 ` Song Liu
2019-09-10 16:01                   ` Guoqing Jiang
2019-09-10 23:08                     ` NeilBrown
2019-09-11  9:56                       ` Song Liu
2019-09-11 22:48                         ` NeilBrown
2019-09-09 15:09             ` Guoqing Jiang
2019-09-09 23:34               ` NeilBrown
