* [RFC] How to handle an ugly md raid0 sector map bug ?
@ 2019-08-22 10:38 Coly Li
  2019-08-23  0:02 ` NeilBrown
  0 siblings, 1 reply; 19+ messages in thread
From: Coly Li @ 2019-08-22 10:38 UTC (permalink / raw)
To: linux-raid, Song Liu; +Cc: NeilBrown, linux-block

Hi folks,

First line: This bug only affects md raid0 devices that meet all of the
following conditions,
1) Assembled from component disks with different sizes.
2) Created and used under a Linux kernel up to and including Linux v3.12,
   then upgraded to a kernel from Linux v3.13 onward.
3) New data are written to the md raid0 under the new kernel >= Linux v3.13.
Then the md raid0 may have inconsistent sector mapping and experience
data corruption.

Recently I received a bug report that a customer encountered file system
corruption after upgrading their kernel from Linux 3.12 to 4.4. It turned
out to be corruption of the underlying md raid0 after the kernel upgrade.

I found it is caused by a sector mapping bug in the md raid0 code of
kernels up to and including Linux v3.12. Here is the buggy code piece I
copied from stable Linux v3.12.74 drivers/md/raid0.c:raid0_make_request(),

547	sector_offset = bio->bi_sector;
548	zone = find_zone(mddev->private, &sector_offset);
549	tmp_dev = map_sector(mddev, zone, bio->bi_sector,
550			     &sector_offset);

At line 548, after find_zone() returns, sector_offset has been updated to
an offset inside the current zone. Then at line 549 the third parameter
of map_sector() should be the updated sector_offset, but bio->bi_sector
(the original LBA of the md raid0 device) is used instead. If the raid0
device has *multiple zones*, the <dev, sector> pairs returned by
map_sector() for all zones except the first are unexpected and wrong.
The buggy code was introduced in Linux v2.6.31 by commit fbb704efb784
("md: raid0 :Enables chunk size other than powers of 2."). Unfortunately
the mistaken mapping calculation was also stable and deterministic, so it
worked without obvious problems until commit 20d0189b1012 ("block:
Introduce new bio_split()") was merged into Linux v3.13. That patch fixed
the mistaken mapping with the following lines of change,

654 -	sector_offset = bio->bi_iter.bi_sector;
655 -	zone = find_zone(mddev->private, &sector_offset);
656 -	tmp_dev = map_sector(mddev, zone, bio->bi_iter.bi_sector,
657 -			     &sector_offset);

694 +	zone = find_zone(mddev->private, &sector);
695 +	tmp_dev = map_sector(mddev, zone, sector, &sector);

At line 695 of this patch, the third parameter of map_sector() is fixed
to 'sector', the correct value which contains the sector offset inside
the corresponding zone.

This patch implicitly *changes* the md raid0 on-disk layout. If a md
raid0 has component disks with *different* sizes, it contains multiple
zones. If such a multiple-zone raid0 device was created before Linux
v3.13, all data chunks after the first zone will be mapped to different
locations by kernels from Linux v3.13 onward. As a result, data written
at LBAs beyond the first zone will be seen as corrupted. A worse case is
when the md raid0 has data chunks written into the first zone under Linux
v3.12 (or earlier kernels), then the system is updated to Linux v3.13 (or
later kernels) and more data chunks are written into the second and later
zones. Then under neither Linux v3.12 nor Linux v3.13 is the data fully
consistent; there is always partial data corruption.

Currently there is no way to tell whether a md raid0 device was mapped
with the wrong calculation of kernels up to and including Linux v3.12, or
with the correct calculation of kernels from Linux v3.13 onward.
If a md raid0 device containing multiple zones was created and used
across these kernel versions, the two different mapping calculations may
have produced different/inconsistent on-disk layouts in different md
raid0 zones, resulting in data corruption.

For our enterprise Linux products we can handle this properly for a small
number of product kernels. But for upstream and stable kernels, I have no
idea how to fix this ugly problem in a generic way.

Neil Brown discussed it with me offline; he proposed a temporary
workaround that only permits assembling md raid0 devices whose component
disks have identical sizes, and rejects assembling md raid0 devices whose
component disks have different sizes. We can drop this workaround once
there is a proper way to fix the problem.

I suggest our developer community work together on a solution; this is my
motivation for posting this email for your comments. Thanks in advance.

--
Coly Li

^ permalink raw reply	[flat|nested] 19+ messages in thread
* Re: [RFC] How to handle an ugly md raid0 sector map bug ?
  2019-08-22 10:38 [RFC] How to handle an ugly md raid0 sector map bug ? Coly Li
@ 2019-08-23  0:02 ` NeilBrown
  2019-08-23 16:37   ` Song Liu
  0 siblings, 1 reply; 19+ messages in thread
From: NeilBrown @ 2019-08-23 0:02 UTC (permalink / raw)
To: Coly Li, Song Liu, linux-raid; +Cc: linux-block

On Thu, Aug 22 2019, Coly Li wrote:

> Hi folks,
>
> [...]
>
> I find it is because a sector map bug in md raid0 code include and
> before Linux v3.12. Here is the buggy code piece I copied from stable
> Linux v3.12.74 drivers/md/raid0.c:raid0_make_request(),
>
> 547	sector_offset = bio->bi_sector;
> 548	zone = find_zone(mddev->private, &sector_offset);
> 549	tmp_dev = map_sector(mddev, zone, bio->bi_sector,
> 550			     &sector_offset);

I don't think this code is buggy. The mapping may not be the mapping
you would expect, but it is the mapping that md/raid0 had always used up
to this time.

> At line 548 after find_zone() returns, sector_offset is updated to be an
> offset inside current zone. Then at line 549 the third parameter of
> calling map_sector() should be the updated sector_offset, but
> bio->bi_sector (original LBA of md raid0 device) is used.
>
> [...]
>
> This patch fixed the mistaken mapping in the following lines of change,
>
> 654 -	sector_offset = bio->bi_iter.bi_sector;
> 655 -	zone = find_zone(mddev->private, &sector_offset);
> 656 -	tmp_dev = map_sector(mddev, zone, bio->bi_iter.bi_sector,
> 657 -			     &sector_offset);
>
> 694 +	zone = find_zone(mddev->private, &sector);
> 695 +	tmp_dev = map_sector(mddev, zone, sector, &sector);
>
> At line 695 of this patch, the third parameter of calling map_sector()
> is fixed to 'sector', this is the correct value which contains the
> sector offset inside the corresponding zone.

This is buggy because, as you say, the third argument to map_sector has
changed.
Previously it was bio->bi_iter.bi_sector. Now it is 'sector' which
find_zone has just modified.

> [...]

There are four separate cases that we need to consider:
- v1.x metadata
- v0.90 metadata
- LVM metadata (via dm-raid)
- no metadata (array created with "mdadm --build").

For v1.x metadata, I think we can add a new "feature_map" flag.
If this flag isn't set, raid0 with non-uniform device sizes will not be
assembled.
If it is set, then:
  if 'layout' is 0, use the old mapping
  if 'layout' is 1, use the new mapping

For v0.90 metadata we don't have feature-flags. The gvalid_words field
is unused and always set to zero, so we could start storing some feature
bits there.

For LVM/dm-raid, I suspect it doesn't support varying sized devices, but
we would need to check.

For "no metadata" arrays ... we could possibly just stop supporting
them - I doubt they are used much.
Once this is decided, we then need to work out how to make it all usable
via mdadm. mdadm needs to set the relevant flags and 'layout' values when
creating an array, and it needs to be able to set them correctly on an
existing array that doesn't have them set.

We also need to decide what happens when a new mdadm is run on an old
kernel. If we create an array with a new feature bit, the old kernel
won't be able to assemble it. But if we don't set the feature bit, then
the array will not assemble after an upgrade.

It might be easier to think about this when we actually have concrete
code, but if anyone does have useful input, I'd be very happy to read it.

NeilBrown

^ permalink raw reply	[flat|nested] 19+ messages in thread
* Re: [RFC] How to handle an ugly md raid0 sector map bug ?
  2019-08-23  0:02 ` NeilBrown
@ 2019-08-23 16:37 ` Song Liu
  2019-08-23 17:03   ` Coly Li
  0 siblings, 1 reply; 19+ messages in thread
From: Song Liu @ 2019-08-23 16:37 UTC (permalink / raw)
To: NeilBrown; +Cc: Coly Li, linux-raid, linux-block

Thanks Coly and Neil.

> On Aug 22, 2019, at 5:02 PM, NeilBrown <neilb@suse.com> wrote:
>
> On Thu, Aug 22 2019, Coly Li wrote:
> [...]
>
> There are four separate cases that we need to consider:
> - v1.x metadata
> - v0.90 metadata
> - LVM metadata (via dm-raid)
> - no metadata (array created with "mdadm --build").
>
> For v1.x metadata, I think we can add a new "feature_map" flag.
> If this flag isn't set, raid0 with non-uniform device sizes will not be
> assembled.
> If it is set, then:
>   if 'layout' is 0, use the old mapping
>   if 'layout' is 1, use the new mapping
>
> For v0.90 metadata we don't have feature-flags. The gvalid_words field
> is unused and always set to zero, so we could start storing some feature
> bits there.
>
> For LVM/dm-raid, I suspect it doesn't support varying sized devices, but
> we would need to check.
>
> For "no metadata" arrays ... we could possibly just stop supporting
> them - I doubt they are used much.

So for an existing array, we really cannot tell whether it is broken or
not, right? If this is the case, we only need to worry about new arrays.

For new arrays, I guess we can only allow v1.x raid0 to have non-uniform
device sizes, and use the new feature_map bit.

Would this work? If so, we only have 1 case to work on.

Thanks,
Song

^ permalink raw reply	[flat|nested] 19+ messages in thread
* Re: [RFC] How to handle an ugly md raid0 sector map bug ?
  2019-08-23 16:37 ` Song Liu
@ 2019-08-23 17:03 ` Coly Li
  2019-08-23 17:17   ` Song Liu
  0 siblings, 1 reply; 19+ messages in thread
From: Coly Li @ 2019-08-23 17:03 UTC (permalink / raw)
To: Song Liu; +Cc: NeilBrown, linux-raid, linux-block

On 2019/8/24 12:37 AM, Song Liu wrote:
> Thanks Coly and Neil.
>
>> On Aug 22, 2019, at 5:02 PM, NeilBrown <neilb@suse.com> wrote:
>> [...]
>>
>> For v1.x metadata, I think we can add a new "feature_map" flag.
>> If this flag isn't set, raid0 with non-uniform device sizes will not be
>> assembled.
>> If it is set, then:
>>   if 'layout' is 0, use the old mapping
>>   if 'layout' is 1, use the new mapping
>> [...]
>
> So for an existing array, we really cannot tell whether it is broken or
> not, right? If this is the case, we only need to worry about new arrays.
>
> For new arrays, I guess we can only allow v1.x raid0 to have non-uniform
> device sizes, and use the new feature_map bit.
>
> Would this work? If so, we only have 1 case to work on.

It seems v1.2 support started since Linux v2.16, so it may also have
this problem with multiple zones.

--
Coly Li

^ permalink raw reply	[flat|nested] 19+ messages in thread
* Re: [RFC] How to handle an ugly md raid0 sector map bug ?
  2019-08-23 17:03 ` Coly Li
@ 2019-08-23 17:17 ` Song Liu
  2019-08-23 17:47   ` Coly Li
  0 siblings, 1 reply; 19+ messages in thread
From: Song Liu @ 2019-08-23 17:17 UTC (permalink / raw)
To: Coly Li; +Cc: NeilBrown, linux-raid, linux-block

> On Aug 23, 2019, at 10:03 AM, Coly Li <colyli@suse.de> wrote:
>
> On 2019/8/24 12:37 AM, Song Liu wrote:
>> [...]
>>
>> For new arrays, I guess we can only allow v1.x raid0 to have non-uniform
>> device sizes, and use the new feature_map bit.
>>
>> Would this work? If so, we only have 1 case to work on.
>
> It seems v1.2 support started since Linux v2.16, so it may also have
> this problem with multiple zones.

For v1.2 metadata, we still need the feature_map bit, meaning this
non-uniform array is safe to assemble. If the array doesn't have this
bit and has non-uniform device sizes, we refuse to assemble it.

Thanks,
Song

^ permalink raw reply	[flat|nested] 19+ messages in thread
* Re: [RFC] How to handle an ugly md raid0 sector map bug ? 2019-08-23 17:17 ` Song Liu @ 2019-08-23 17:47 ` Coly Li 2019-09-09 6:57 ` [PATCH] md/raid0: avoid RAID0 data corruption due to layout confusion NeilBrown 0 siblings, 1 reply; 19+ messages in thread From: Coly Li @ 2019-08-23 17:47 UTC (permalink / raw) To: Song Liu; +Cc: NeilBrown, linux-raid, linux-block On 2019/8/24 1:17 上午, Song Liu wrote: > > >> On Aug 23, 2019, at 10:03 AM, Coly Li <colyli@suse.de> wrote: >> >> On 2019/8/24 12:37 上午, Song Liu wrote: >>> Thanks Coly and Neil. >>> >>>> On Aug 22, 2019, at 5:02 PM, NeilBrown <neilb@suse.com> wrote: >>>> >>>> On Thu, Aug 22 2019, Coly Li wrote: >>>> >>>>> Hi folks, >>>>> >>>>> First line: This bug only influences md raid0 device which applies all >>>>> the following conditions, >>>>> 1) Assembled by component disks with different sizes. >>>>> 2) Created and used under Linux kernel before (including) Linux v3.12, >>>>> then upgrade to Linux kernel after (including) Linux v3.13. >>>>> 3) New data are written to md raid0 in new kernel >= Linux v3.13. >>>>> Then the md raid0 may have inconsistent sector mapping and experience >>>>> data corruption. >>>>> >>>>> Recently I receive a bug report that customer encounter file system >>>>> corruption after upgrading their kernel from Linux 3.12 to 4.4. It turns >>>>> out to be the underlying md raid0 corruption after the kernel upgrade. >>>>> >>>>> I find it is because a sector map bug in md raid0 code include and >>>>> before Linux v3.12. Here is the buggy code piece I copied from stable >>>>> Linux v3.12.74 drivers/md/raid0.c:raid0_make_request(), >>>>> >>>>> 547 sector_offset = bio->bi_sector; >>>>> 548 zone = find_zone(mddev->private, §or_offset); >>>>> 549 tmp_dev = map_sector(mddev, zone, bio->bi_sector, >>>>> 550 §or_offset); >>>> >>>> I don't think this code is buggy. The mapping may not be the mapping >>>> you would expect, but it is the mapping that md/raid0 had always used up >>>> to this time. 
>>>> >>>>> >>>>> At line 548 after find_zone() returns, sector_offset is updated to be an >>>>> offset inside the current zone. Then at line 549 the third parameter of >>>>> calling map_sector() should be the updated sector_offset, but >>>>> bio->bi_sector (the original LBA of the md raid0 >>>>> device) is used. If the raid0 device has *multiple zones*, then except for the first zone, the mapping <dev, >>>>> sector> pair returned by map_sector() for all the remaining zones is >>>>> unexpected and wrong. >>>>> >>>>> The buggy code was introduced in Linux v2.6.31 in commit fbb704efb784 >>>>> ("md: raid0 :Enables chunk size other than powers of 2."); unfortunately >>>>> the mistaken mapping calculation also has a stable and unique result, so it >>>>> worked without obvious problems until commit 20d0189b1012 ("block: >>>>> Introduce new bio_split()") was merged into Linux v3.13. >>>>> >>>>> This patch fixed the mistaken mapping in the following lines of change, >>>>> 654 - sector_offset = bio->bi_iter.bi_sector; >>>>> 655 - zone = find_zone(mddev->private, &sector_offset); >>>>> 656 - tmp_dev = map_sector(mddev, zone, bio->bi_iter.bi_sector, >>>>> 657 - &sector_offset); >>>>> >>>>> 694 + zone = find_zone(mddev->private, &sector); >>>>> 695 + tmp_dev = map_sector(mddev, zone, sector, &sector); >>>>> At line 695 of this patch, the third parameter of calling map_sector() >>>>> is fixed to 'sector', which is the correct value containing the >>>>> sector offset inside the corresponding zone. >>>> >>>> This is buggy because, as you say, the third argument to map_sector has >>>> changed. >>>> Previously it was bio->bi_iter.bi_sector. Now it is 'sector' which >>>> find_zone has just modified. >>>> >>>>> >>>>> Thus this patch implicitly *changes* the md raid0 on-disk layout. If a md >>>>> raid0 has component disks with *different* sizes, then it will contain >>>>> multiple zones.
If such a multiple-zone raid0 device is created before >>>>> Linux v3.13, all data chunks after the first zone will be mapped to >>>>> different locations in kernels after (including) Linux v3.13. The result >>>>> is, data written in the LBAs after the first zone will be treated as >>>>> corruption. A worse case is, if the md raid0 has data chunks filled in >>>>> the first md raid0 zone in Linux v3.12 (or earlier kernels), then updates to >>>>> Linux v3.13 (or later kernels) and fills more data chunks in the second and >>>>> remaining zones. Then neither under Linux v3.12 nor under Linux v3.13 is the data >>>>> fully intact; there is always partially corrupted data. >>>>> >>>>> Currently there is no way to tell whether a md raid0 device is mapped with the >>>>> wrong calculation in kernels before (including) Linux v3.12 or with the correct >>>>> calculation in kernels after (including) Linux v3.13. If a md raid0 >>>>> device (containing multiple zones) is created and used across these kernel >>>>> versions, there is a possibility that the different mapping calculations >>>>> generate different/inconsistent on-disk layouts in different md raid0 >>>>> zones, and result in data corruption. >>>>> >>>>> For our enterprise Linux products we can handle it properly for a small >>>>> number of product kernels. But for upstream and stable kernels, I have no >>>>> idea how to fix this ugly problem in a generic way. >>>>> >>>>> Neil Brown discussed this with me offline; he proposed a temporary workaround >>>>> that only permits assembling md raid0 devices with identical component >>>>> disk sizes, and refuses to assemble md raid0 devices with component disks >>>>> of different sizes. We can stop this workaround when there is a proper >>>>> way to fix the problem. >>>>> >>>>> I suggest our developer community work together on a solution; this >>>>> is the motivation for posting this email for your comments.
>>>> >>>> There are four separate cases that we need to consider: >>>> - v1.x metadata >>>> - v0.90 metadata >>>> - LVM metadata (via dm-raid) >>>> - no metadata (array created with "mdadm --build"). >>>> >>>> For v1.x metadata, I think we can add a new "feature_map" flag. >>>> If this flag isn't set, raid0 with non-uniform device sizes will not be >>>> assembled. >>>> If it is set, then: >>>> if 'layout' is 0, use the old mapping >>>> if 'layout' is 1, use the new mapping >>>> >>>> For v0.90 metadata we don't have feature-flags. >>>> The gvalid_words field is unused and always set to zero, >>>> so we could start storing some feature bits there. >>>> >>>> For LVM/dm-raid, I suspect it doesn't support varying >>>> sized devices, but we would need to check. >>>> >>>> For "no metadata" arrays ... we could possibly just stop supporting >>>> them - I doubt they are used much. >>> >>> So for an existing array, we really cannot tell whether it is broken or >>> not, right? If this is the case, we only need to worry about new arrays. >>> >>> For new arrays, I guess we can only allow v1.x raid0 to have non-uniform >>> device sizes, and use the new feature_map bit. >>> >>> Would this work? If so, we only have 1 case to work on. >> >> It seems v1.2 support started in Linux v2.16, so it may also have >> this problem with multiple zones. >> > > For v1.2 metadata, we still need the feature_map bit, meaning this > non-uniform array is safe to assemble. If the array doesn't have > this bit, and is non-uniform in size, we refuse to assemble it. Yes, it makes sense. I understand you now :-) -- Coly Li ^ permalink raw reply [flat|nested] 19+ messages in thread
* [PATCH] md/raid0: avoid RAID0 data corruption due to layout confusion. 2019-08-23 17:47 ` Coly Li @ 2019-09-09 6:57 ` NeilBrown 2019-09-09 6:58 ` [PATCH 2/2] md: add feature flag MD_FEATURE_RAID0_LAYOUT NeilBrown ` (2 more replies) 0 siblings, 3 replies; 19+ messages in thread From: NeilBrown @ 2019-09-09 6:57 UTC (permalink / raw) To: Coly Li, Song Liu; +Cc: NeilBrown, linux-block, linux-raid [-- Attachment #1: Type: text/plain, Size: 4277 bytes --] If the drives in a RAID0 are not all the same size, the array is divided into zones. The first zone covers all drives, to the size of the smallest. The second zone covers all drives larger than the smallest, up to the size of the second smallest - etc. A change in Linux 3.14 unintentionally changed the layout for the second and subsequent zones. All the correct data is still stored, but each chunk may be assigned to a different device than in pre-3.14 kernels. This can lead to data corruption. It is not possible to determine what layout to use - it depends on which kernel the data was written by. So we add a module parameter to allow the old (0) or new (1) layout to be specified, and refuse to assemble an affected array if that parameter is not set. Fixes: 20d0189b1012 ("block: Introduce new bio_split()") cc: stable@vger.kernel.org (3.14+) Signed-off-by: NeilBrown <neilb@suse.de> --- This and the next patch are my proposal for how to address this problem. I haven't actually tested .....
NeilBrown drivers/md/raid0.c | 28 +++++++++++++++++++++++++++- drivers/md/raid0.h | 14 ++++++++++++++ 2 files changed, 41 insertions(+), 1 deletion(-) diff --git a/drivers/md/raid0.c b/drivers/md/raid0.c index bf5cf184a260..a8888c12308a 100644 --- a/drivers/md/raid0.c +++ b/drivers/md/raid0.c @@ -19,6 +19,9 @@ #include "raid0.h" #include "raid5.h" +static int default_layout = -1; +module_param(default_layout, int, 0644); + #define UNSUPPORTED_MDDEV_FLAGS \ ((1L << MD_HAS_JOURNAL) | \ (1L << MD_JOURNAL_CLEAN) | \ @@ -139,6 +142,19 @@ static int create_strip_zones(struct mddev *mddev, struct r0conf **private_conf) } pr_debug("md/raid0:%s: FINAL %d zones\n", mdname(mddev), conf->nr_strip_zones); + + if (conf->nr_strip_zones == 1) { + conf->layout = RAID0_ORIG_LAYOUT; + } else if (default_layout == RAID0_ORIG_LAYOUT || + default_layout == RAID0_ALT_MULTIZONE_LAYOUT) { + conf->layout = default_layout; + } else { + pr_err("md/raid0:%s: cannot assemble multi-zone RAID0 with default_layout setting\n", + mdname(mddev)); + pr_err("md/raid0: please set raid.default_layout to 0 or 1\n"); + err = -ENOTSUPP; + goto abort; + } /* * now since we have the hard sector sizes, we can make sure * chunk size is a multiple of that sector size @@ -547,10 +563,12 @@ static void raid0_handle_discard(struct mddev *mddev, struct bio *bio) static bool raid0_make_request(struct mddev *mddev, struct bio *bio) { + struct r0conf *conf = mddev->private; struct strip_zone *zone; struct md_rdev *tmp_dev; sector_t bio_sector; sector_t sector; + sector_t orig_sector; unsigned chunk_sects; unsigned sectors; @@ -584,8 +602,16 @@ static bool raid0_make_request(struct mddev *mddev, struct bio *bio) bio = split; } + orig_sector = sector; zone = find_zone(mddev->private, &sector); - tmp_dev = map_sector(mddev, zone, sector, &sector); + switch (conf->layout) { + case RAID0_ORIG_LAYOUT: + tmp_dev = map_sector(mddev, zone, orig_sector, &sector); + break; + case RAID0_ALT_MULTIZONE_LAYOUT: + tmp_dev = map_sector(mddev,
zone, sector, &sector); + break; + } bio_set_dev(bio, tmp_dev->bdev); bio->bi_iter.bi_sector = sector + zone->dev_start + tmp_dev->data_offset; diff --git a/drivers/md/raid0.h b/drivers/md/raid0.h index 540e65d92642..3816e5477db1 100644 --- a/drivers/md/raid0.h +++ b/drivers/md/raid0.h @@ -8,11 +8,25 @@ struct strip_zone { int nb_dev; /* # of devices attached to the zone */ }; +/* Linux 3.14 (20d0189b1012) made an unintended change to + * the RAID0 layout for multi-zone arrays (where devices aren't all + * the same size). + * RAID0_ORIG_LAYOUT restores the original layout + * RAID0_ALT_MULTIZONE_LAYOUT uses the altered layout + * The layouts are identical when there is only one zone (all + * devices the same size). + */ + +enum r0layout { + RAID0_ORIG_LAYOUT = 1, + RAID0_ALT_MULTIZONE_LAYOUT = 2, +}; struct r0conf { struct strip_zone *strip_zone; struct md_rdev **devlist; /* lists of rdevs, pointed to * by strip_zone->dev */ int nr_strip_zones; + enum r0layout layout; }; #endif -- 2.14.0.rc0.dirty [-- Attachment #2: signature.asc --] [-- Type: application/pgp-signature, Size: 832 bytes --] ^ permalink raw reply related [flat|nested] 19+ messages in thread
* [PATCH 2/2] md: add feature flag MD_FEATURE_RAID0_LAYOUT 2019-09-09 6:57 ` [PATCH] md/raid0: avoid RAID0 data corruption due to layout confusion NeilBrown @ 2019-09-09 6:58 ` NeilBrown 2019-09-09 15:33 ` Guoqing Jiang 2019-09-09 14:56 ` [PATCH] md/raid0: avoid RAID0 data corruption due to layout confusion Song Liu 2019-09-09 15:09 ` Guoqing Jiang 2 siblings, 1 reply; 19+ messages in thread From: NeilBrown @ 2019-09-09 6:58 UTC (permalink / raw) To: Coly Li, Song Liu; +Cc: NeilBrown, linux-block, linux-raid [-- Attachment #1: Type: text/plain, Size: 3755 bytes --] Due to a bug introduced in Linux 3.14 we cannot determine the correctly layout for a multi-zone RAID0 array - there are two possibiities. It is possible to tell the kernel which to chose using a module parameter, but this can be clumsy to use. It would be best if the choice were recorded in the metadata. So add a feature flag for this purpose. If it is set, then the 'layout' field of the superblock is used to determine which layout to use. If this flag is not set, then mddev->layout gets set to -1, which causes the module parameter to be required. 
Signed-off-by: NeilBrown <neilb@suse.de> --- drivers/md/md.c | 13 +++++++++++++ drivers/md/raid0.c | 2 ++ include/uapi/linux/raid/md_p.h | 2 ++ 3 files changed, 17 insertions(+) diff --git a/drivers/md/md.c b/drivers/md/md.c index 1f70ec595282..a41fce7f8b4c 100644 --- a/drivers/md/md.c +++ b/drivers/md/md.c @@ -1232,6 +1232,8 @@ static int super_90_validate(struct mddev *mddev, struct md_rdev *rdev) mddev->new_layout = mddev->layout; mddev->new_chunk_sectors = mddev->chunk_sectors; } + if (mddev->level == 0) + mddev->layout = -1; if (sb->state & (1<<MD_SB_CLEAN)) mddev->recovery_cp = MaxSector; @@ -1647,6 +1649,10 @@ static int super_1_load(struct md_rdev *rdev, struct md_rdev *refdev, int minor_ rdev->ppl.sector = rdev->sb_start + rdev->ppl.offset; } + if ((le32_to_cpu(sb->feature_map) & MD_FEATURE_RAID0_LAYOUT) && + sb->level != 0) + return -EINVAL; + if (!refdev) { ret = 1; } else { @@ -1757,6 +1763,10 @@ static int super_1_validate(struct mddev *mddev, struct md_rdev *rdev) mddev->new_chunk_sectors = mddev->chunk_sectors; } + if (mddev->level == 0 && + !(le32_to_cpu(sb->feature_map) & MD_FEATURE_RAID0_LAYOUT)) + mddev->layout = -1; + if (le32_to_cpu(sb->feature_map) & MD_FEATURE_JOURNAL) set_bit(MD_HAS_JOURNAL, &mddev->flags); @@ -6852,6 +6862,9 @@ static int set_array_info(struct mddev *mddev, mdu_array_info_t *info) mddev->external = 0; mddev->layout = info->layout; + if (mddev->level == 0) + /* Cannot trust RAID0 layout info here */ + mddev->layout = -1; mddev->chunk_sectors = info->chunk_size >> 9; if (mddev->persistent) { diff --git a/drivers/md/raid0.c b/drivers/md/raid0.c index a8888c12308a..6f095b0b6f5c 100644 --- a/drivers/md/raid0.c +++ b/drivers/md/raid0.c @@ -145,6 +145,8 @@ static int create_strip_zones(struct mddev *mddev, struct r0conf **private_conf) if (conf->nr_strip_zones == 1) { conf->layout = RAID0_ORIG_LAYOUT; + } else if (mddev->layout == RAID0_ORIG_LAYOUT || + mddev->layout == RAID0_ALT_MULTIZONE_LAYOUT) { } else if (default_layout == 
RAID0_ORIG_LAYOUT || default_layout == RAID0_ALT_MULTIZONE_LAYOUT) { conf->layout = default_layout; diff --git a/include/uapi/linux/raid/md_p.h b/include/uapi/linux/raid/md_p.h index b0d15c73f6d7..1f2d8c81f0e0 100644 --- a/include/uapi/linux/raid/md_p.h +++ b/include/uapi/linux/raid/md_p.h @@ -329,6 +329,7 @@ struct mdp_superblock_1 { #define MD_FEATURE_JOURNAL 512 /* support write cache */ #define MD_FEATURE_PPL 1024 /* support PPL */ #define MD_FEATURE_MULTIPLE_PPLS 2048 /* support for multiple PPLs */ +#define MD_FEATURE_RAID0_LAYOUT 4096 /* layout is meaningful for RAID0 */ #define MD_FEATURE_ALL (MD_FEATURE_BITMAP_OFFSET \ |MD_FEATURE_RECOVERY_OFFSET \ |MD_FEATURE_RESHAPE_ACTIVE \ @@ -341,6 +342,7 @@ struct mdp_superblock_1 { |MD_FEATURE_JOURNAL \ |MD_FEATURE_PPL \ |MD_FEATURE_MULTIPLE_PPLS \ + |MD_FEATURE_RAID0_LAYOUT \ ) struct r5l_payload_header { -- 2.14.0.rc0.dirty [-- Attachment #2: signature.asc --] [-- Type: application/pgp-signature, Size: 832 bytes --] ^ permalink raw reply related [flat|nested] 19+ messages in thread
* Re: [PATCH 2/2] md: add feature flag MD_FEATURE_RAID0_LAYOUT 2019-09-09 6:58 ` [PATCH 2/2] md: add feature flag MD_FEATURE_RAID0_LAYOUT NeilBrown @ 2019-09-09 15:33 ` Guoqing Jiang 2019-09-09 23:26 ` NeilBrown 0 siblings, 1 reply; 19+ messages in thread From: Guoqing Jiang @ 2019-09-09 15:33 UTC (permalink / raw) To: NeilBrown, Coly Li, Song Liu; +Cc: NeilBrown, linux-block, linux-raid Hi Neil, On 9/9/19 8:58 AM, NeilBrown wrote: > > Due to a bug introduced in Linux 3.14 we cannot determine the > correctly layout for a multi-zone RAID0 array - there are two > possibiities. possibilities. > > It is possible to tell the kernel which to chose using a module > parameter, but this can be clumsy to use. It would be best if > the choice were recorded in the metadata. > So add a feature flag for this purpose. > If it is set, then the 'layout' field of the superblock is used > to determine which layout to use. > > If this flag is not set, then mddev->layout gets set to -1, > which causes the module parameter to be required. Could you point where the flag is set? Thanks. 
> > Signed-off-by: NeilBrown <neilb@suse.de> > --- > drivers/md/md.c | 13 +++++++++++++ > drivers/md/raid0.c | 2 ++ > include/uapi/linux/raid/md_p.h | 2 ++ > 3 files changed, 17 insertions(+) > > diff --git a/drivers/md/md.c b/drivers/md/md.c > index 1f70ec595282..a41fce7f8b4c 100644 > --- a/drivers/md/md.c > +++ b/drivers/md/md.c > @@ -1232,6 +1232,8 @@ static int super_90_validate(struct mddev *mddev, struct md_rdev *rdev) > mddev->new_layout = mddev->layout; > mddev->new_chunk_sectors = mddev->chunk_sectors; > } > + if (mddev->level == 0) > + mddev->layout = -1; > > if (sb->state & (1<<MD_SB_CLEAN)) > mddev->recovery_cp = MaxSector; > @@ -1647,6 +1649,10 @@ static int super_1_load(struct md_rdev *rdev, struct md_rdev *refdev, int minor_ > rdev->ppl.sector = rdev->sb_start + rdev->ppl.offset; > } > > + if ((le32_to_cpu(sb->feature_map) & MD_FEATURE_RAID0_LAYOUT) && > + sb->level != 0) > + return -EINVAL; > + > if (!refdev) { > ret = 1; > } else { > @@ -1757,6 +1763,10 @@ static int super_1_validate(struct mddev *mddev, struct md_rdev *rdev) > mddev->new_chunk_sectors = mddev->chunk_sectors; > } > > + if (mddev->level == 0 && > + !(le32_to_cpu(sb->feature_map) & MD_FEATURE_RAID0_LAYOUT)) > + mddev->layout = -1; > + > if (le32_to_cpu(sb->feature_map) & MD_FEATURE_JOURNAL) > set_bit(MD_HAS_JOURNAL, &mddev->flags); > > @@ -6852,6 +6862,9 @@ static int set_array_info(struct mddev *mddev, mdu_array_info_t *info) > mddev->external = 0; > > mddev->layout = info->layout; > + if (mddev->level == 0) > + /* Cannot trust RAID0 layout info here */ > + mddev->layout = -1; > mddev->chunk_sectors = info->chunk_size >> 9; > > if (mddev->persistent) { > diff --git a/drivers/md/raid0.c b/drivers/md/raid0.c > index a8888c12308a..6f095b0b6f5c 100644 > --- a/drivers/md/raid0.c > +++ b/drivers/md/raid0.c > @@ -145,6 +145,8 @@ static int create_strip_zones(struct mddev *mddev, struct r0conf **private_conf) > > if (conf->nr_strip_zones == 1) { > conf->layout = RAID0_ORIG_LAYOUT; > + } 
else if (mddev->layout == RAID0_ORIG_LAYOUT || > + mddev->layout == RAID0_ALT_MULTIZONE_LAYOUT) { Maybe "conf->layout = mddev->layout" here? Otherwise seems conf->layout is not set accordingly, just my 2 cents. > } else if (default_layout == RAID0_ORIG_LAYOUT || > default_layout == RAID0_ALT_MULTIZONE_LAYOUT) { > conf->layout = default_layout; > diff --git a/include/uapi/linux/raid/md_p.h b/include/uapi/linux/raid/md_p.h > index b0d15c73f6d7..1f2d8c81f0e0 100644 > --- a/include/uapi/linux/raid/md_p.h > +++ b/include/uapi/linux/raid/md_p.h > @@ -329,6 +329,7 @@ struct mdp_superblock_1 { > #define MD_FEATURE_JOURNAL 512 /* support write cache */ > #define MD_FEATURE_PPL 1024 /* support PPL */ > #define MD_FEATURE_MULTIPLE_PPLS 2048 /* support for multiple PPLs */ > +#define MD_FEATURE_RAID0_LAYOUT 4096 /* layout is meaningful for RAID0 */ > #define MD_FEATURE_ALL (MD_FEATURE_BITMAP_OFFSET \ > |MD_FEATURE_RECOVERY_OFFSET \ > |MD_FEATURE_RESHAPE_ACTIVE \ > @@ -341,6 +342,7 @@ struct mdp_superblock_1 { > |MD_FEATURE_JOURNAL \ > |MD_FEATURE_PPL \ > |MD_FEATURE_MULTIPLE_PPLS \ > + |MD_FEATURE_RAID0_LAYOUT \ > ) > > struct r5l_payload_header { > Thanks, Guoqing ^ permalink raw reply [flat|nested] 19+ messages in thread
* Re: [PATCH 2/2] md: add feature flag MD_FEATURE_RAID0_LAYOUT 2019-09-09 15:33 ` Guoqing Jiang @ 2019-09-09 23:26 ` NeilBrown 0 siblings, 0 replies; 19+ messages in thread From: NeilBrown @ 2019-09-09 23:26 UTC (permalink / raw) To: Guoqing Jiang, Coly Li, Song Liu; +Cc: NeilBrown, linux-block, linux-raid [-- Attachment #1: Type: text/plain, Size: 5068 bytes --] On Mon, Sep 09 2019, Guoqing Jiang wrote: > Hi Neil, > > On 9/9/19 8:58 AM, NeilBrown wrote: >> >> Due to a bug introduced in Linux 3.14 we cannot determine the >> correctly layout for a multi-zone RAID0 array - there are two >> possibiities. > > possibilities. Thanks. > >> >> It is possible to tell the kernel which to chose using a module >> parameter, but this can be clumsy to use. It would be best if >> the choice were recorded in the metadata. >> So add a feature flag for this purpose. >> If it is set, then the 'layout' field of the superblock is used >> to determine which layout to use. >> >> If this flag is not set, then mddev->layout gets set to -1, >> which causes the module parameter to be required. > > Could you point where the flag is set? Thanks. It isn't set by the kernel - the kernel doesn't know when to set it. We would need to change mdadm to set the flag, either when creating an array, or when asked to via --update. Actually.... that would be a problem if someone used the new mdadm on an old kernel. The old kernel would refuse to assemble the array with the flag set. Maybe that is what we want anyway. We *want* people to never use multi-zone RAID0 on old kernels, because the result could be data corruption. So - mdadm needs to add the flag, and maybe warn if the kernel is too old.
> >> >> Signed-off-by: NeilBrown <neilb@suse.de> >> --- >> drivers/md/md.c | 13 +++++++++++++ >> drivers/md/raid0.c | 2 ++ >> include/uapi/linux/raid/md_p.h | 2 ++ >> 3 files changed, 17 insertions(+) >> >> diff --git a/drivers/md/md.c b/drivers/md/md.c >> index 1f70ec595282..a41fce7f8b4c 100644 >> --- a/drivers/md/md.c >> +++ b/drivers/md/md.c >> @@ -1232,6 +1232,8 @@ static int super_90_validate(struct mddev *mddev, struct md_rdev *rdev) >> mddev->new_layout = mddev->layout; >> mddev->new_chunk_sectors = mddev->chunk_sectors; >> } >> + if (mddev->level == 0) >> + mddev->layout = -1; >> >> if (sb->state & (1<<MD_SB_CLEAN)) >> mddev->recovery_cp = MaxSector; >> @@ -1647,6 +1649,10 @@ static int super_1_load(struct md_rdev *rdev, struct md_rdev *refdev, int minor_ >> rdev->ppl.sector = rdev->sb_start + rdev->ppl.offset; >> } >> >> + if ((le32_to_cpu(sb->feature_map) & MD_FEATURE_RAID0_LAYOUT) && >> + sb->level != 0) >> + return -EINVAL; >> + >> if (!refdev) { >> ret = 1; >> } else { >> @@ -1757,6 +1763,10 @@ static int super_1_validate(struct mddev *mddev, struct md_rdev *rdev) >> mddev->new_chunk_sectors = mddev->chunk_sectors; >> } >> >> + if (mddev->level == 0 && >> + !(le32_to_cpu(sb->feature_map) & MD_FEATURE_RAID0_LAYOUT)) >> + mddev->layout = -1; >> + >> if (le32_to_cpu(sb->feature_map) & MD_FEATURE_JOURNAL) >> set_bit(MD_HAS_JOURNAL, &mddev->flags); >> >> @@ -6852,6 +6862,9 @@ static int set_array_info(struct mddev *mddev, mdu_array_info_t *info) >> mddev->external = 0; >> >> mddev->layout = info->layout; >> + if (mddev->level == 0) >> + /* Cannot trust RAID0 layout info here */ >> + mddev->layout = -1; >> mddev->chunk_sectors = info->chunk_size >> 9; >> >> if (mddev->persistent) { >> diff --git a/drivers/md/raid0.c b/drivers/md/raid0.c >> index a8888c12308a..6f095b0b6f5c 100644 >> --- a/drivers/md/raid0.c >> +++ b/drivers/md/raid0.c >> @@ -145,6 +145,8 @@ static int create_strip_zones(struct mddev *mddev, struct r0conf **private_conf) >> >> if 
(conf->nr_strip_zones == 1) { >> conf->layout = RAID0_ORIG_LAYOUT; >> + } else if (mddev->layout == RAID0_ORIG_LAYOUT || >> + mddev->layout == RAID0_ALT_MULTIZONE_LAYOUT) { > > Maybe "conf->layout = mddev->layout" here? Otherwise seems conf->layout is not set accordingly, just > my 2 cents. > Yes, of course. thanks. Thanks for your review, NeilBrown >> } else if (default_layout == RAID0_ORIG_LAYOUT || >> default_layout == RAID0_ALT_MULTIZONE_LAYOUT) { >> conf->layout = default_layout; >> diff --git a/include/uapi/linux/raid/md_p.h b/include/uapi/linux/raid/md_p.h >> index b0d15c73f6d7..1f2d8c81f0e0 100644 >> --- a/include/uapi/linux/raid/md_p.h >> +++ b/include/uapi/linux/raid/md_p.h >> @@ -329,6 +329,7 @@ struct mdp_superblock_1 { >> #define MD_FEATURE_JOURNAL 512 /* support write cache */ >> #define MD_FEATURE_PPL 1024 /* support PPL */ >> #define MD_FEATURE_MULTIPLE_PPLS 2048 /* support for multiple PPLs */ >> +#define MD_FEATURE_RAID0_LAYOUT 4096 /* layout is meaningful for RAID0 */ >> #define MD_FEATURE_ALL (MD_FEATURE_BITMAP_OFFSET \ >> |MD_FEATURE_RECOVERY_OFFSET \ >> |MD_FEATURE_RESHAPE_ACTIVE \ >> @@ -341,6 +342,7 @@ struct mdp_superblock_1 { >> |MD_FEATURE_JOURNAL \ >> |MD_FEATURE_PPL \ >> |MD_FEATURE_MULTIPLE_PPLS \ >> + |MD_FEATURE_RAID0_LAYOUT \ >> ) >> >> struct r5l_payload_header { >> > > Thanks, > Guoqing [-- Attachment #2: signature.asc --] [-- Type: application/pgp-signature, Size: 832 bytes --] ^ permalink raw reply [flat|nested] 19+ messages in thread
* Re: [PATCH] md/raid0: avoid RAID0 data corruption due to layout confusion. 2019-09-09 6:57 ` [PATCH] md/raid0: avoid RAID0 data corruption due to layout confusion NeilBrown 2019-09-09 6:58 ` [PATCH 2/2] md: add feature flag MD_FEATURE_RAID0_LAYOUT NeilBrown @ 2019-09-09 14:56 ` Song Liu 2019-09-09 23:33 ` NeilBrown 2019-09-09 15:09 ` Guoqing Jiang 2 siblings, 1 reply; 19+ messages in thread From: Song Liu @ 2019-09-09 14:56 UTC (permalink / raw) To: NeilBrown; +Cc: Coly Li, NeilBrown, linux-block, linux-raid Hi Neil, > On Sep 9, 2019, at 7:57 AM, NeilBrown <neilb@suse.de> wrote: > > > If the drives in a RAID0 are not all the same size, the array is > divided into zones. > The first zone covers all drives, to the size of the smallest. > The second zone covers all drives larger than the smallest, up to > the size of the second smallest - etc. > > A change in Linux 3.14 unintentionally changed the layout for the > second and subsequent zones. All the correct data is still stored, but > each chunk may be assigned to a different device than in pre-3.14 kernels. > This can lead to data corruption. > > It is not possible to determine what layout to use - it depends which > kernel the data was written by. > So we add a module parameter to allow the old (0) or new (1) layout to be > specified, and refused to assemble an affected array if that parameter is > not set. > > Fixes: 20d0189b1012 ("block: Introduce new bio_split()") > cc: stable@vger.kernel.org (3.14+) > Signed-off-by: NeilBrown <neilb@suse.de> Thanks for the patches. They look great. However, I am having problems applying them (not sure whether it is a problem on my side). Could you please push it somewhere so I can use cherry-pick instead? Thanks, Song ^ permalink raw reply [flat|nested] 19+ messages in thread
* Re: [PATCH] md/raid0: avoid RAID0 data corruption due to layout confusion. 2019-09-09 14:56 ` [PATCH] md/raid0: avoid RAID0 data corruption due to layout confusion Song Liu @ 2019-09-09 23:33 ` NeilBrown 2019-09-10 15:45 ` Song Liu 0 siblings, 1 reply; 19+ messages in thread From: NeilBrown @ 2019-09-09 23:33 UTC (permalink / raw) To: Song Liu; +Cc: Coly Li, NeilBrown, linux-block, linux-raid [-- Attachment #1: Type: text/plain, Size: 1472 bytes --] On Mon, Sep 09 2019, Song Liu wrote: > Hi Neil, > >> On Sep 9, 2019, at 7:57 AM, NeilBrown <neilb@suse.de> wrote: >> >> >> If the drives in a RAID0 are not all the same size, the array is >> divided into zones. >> The first zone covers all drives, to the size of the smallest. >> The second zone covers all drives larger than the smallest, up to >> the size of the second smallest - etc. >> >> A change in Linux 3.14 unintentionally changed the layout for the >> second and subsequent zones. All the correct data is still stored, but >> each chunk may be assigned to a different device than in pre-3.14 kernels. >> This can lead to data corruption. >> >> It is not possible to determine what layout to use - it depends which >> kernel the data was written by. >> So we add a module parameter to allow the old (0) or new (1) layout to be >> specified, and refused to assemble an affected array if that parameter is >> not set. >> >> Fixes: 20d0189b1012 ("block: Introduce new bio_split()") >> cc: stable@vger.kernel.org (3.14+) >> Signed-off-by: NeilBrown <neilb@suse.de> > > Thanks for the patches. They look great. However, I am having problem > apply them (not sure whether it is a problem on my side). Could you > please push it somewhere so I can use cherry-pick instead? 
I rebased them on block/for-next, fixed the problems that Guoqing found, and pushed them to https://github.com/neilbrown/linux md/raid0 NeilBrown [-- Attachment #2: signature.asc --] [-- Type: application/pgp-signature, Size: 832 bytes --] ^ permalink raw reply [flat|nested] 19+ messages in thread
* Re: [PATCH] md/raid0: avoid RAID0 data corruption due to layout confusion. 2019-09-09 23:33 ` NeilBrown @ 2019-09-10 15:45 ` Song Liu 2019-09-10 16:01 ` Guoqing Jiang 0 siblings, 1 reply; 19+ messages in thread From: Song Liu @ 2019-09-10 15:45 UTC (permalink / raw) To: NeilBrown, Guoqing Jiang; +Cc: Coly Li, NeilBrown, linux-block, linux-raid > On Sep 10, 2019, at 12:33 AM, NeilBrown <neilb@suse.de> wrote: > > On Mon, Sep 09 2019, Song Liu wrote: > >> Hi Neil, >> >>> On Sep 9, 2019, at 7:57 AM, NeilBrown <neilb@suse.de> wrote: >>> >>> >>> If the drives in a RAID0 are not all the same size, the array is >>> divided into zones. >>> The first zone covers all drives, to the size of the smallest. >>> The second zone covers all drives larger than the smallest, up to >>> the size of the second smallest - etc. >>> >>> A change in Linux 3.14 unintentionally changed the layout for the >>> second and subsequent zones. All the correct data is still stored, but >>> each chunk may be assigned to a different device than in pre-3.14 kernels. >>> This can lead to data corruption. >>> >>> It is not possible to determine what layout to use - it depends which >>> kernel the data was written by. >>> So we add a module parameter to allow the old (0) or new (1) layout to be >>> specified, and refused to assemble an affected array if that parameter is >>> not set. >>> >>> Fixes: 20d0189b1012 ("block: Introduce new bio_split()") >>> cc: stable@vger.kernel.org (3.14+) >>> Signed-off-by: NeilBrown <neilb@suse.de> >> >> Thanks for the patches. They look great. However, I am having problem >> apply them (not sure whether it is a problem on my side). Could you >> please push it somewhere so I can use cherry-pick instead? > > I rebased them on block/for-next, fixed the problems that Guoqing found, > and pushed them to > https://github.com/neilbrown/linux md/raid0 > > NeilBrown Thanks Neil! Guoqing, if this looks good, please reply with your Reviewed-by or Acked-by. 
Thanks, Song ^ permalink raw reply [flat|nested] 19+ messages in thread
* Re: [PATCH] md/raid0: avoid RAID0 data corruption due to layout confusion. 2019-09-10 15:45 ` Song Liu @ 2019-09-10 16:01 ` Guoqing Jiang 2019-09-10 23:08 ` NeilBrown 0 siblings, 1 reply; 19+ messages in thread From: Guoqing Jiang @ 2019-09-10 16:01 UTC (permalink / raw) To: Song Liu, NeilBrown, Guoqing Jiang Cc: Coly Li, NeilBrown, linux-block, linux-raid On 9/10/19 5:45 PM, Song Liu wrote: > > >> On Sep 10, 2019, at 12:33 AM, NeilBrown <neilb@suse.de> wrote: >> >> On Mon, Sep 09 2019, Song Liu wrote: >> >>> Hi Neil, >>> >>>> On Sep 9, 2019, at 7:57 AM, NeilBrown <neilb@suse.de> wrote: >>>> >>>> >>>> If the drives in a RAID0 are not all the same size, the array is >>>> divided into zones. >>>> The first zone covers all drives, to the size of the smallest. >>>> The second zone covers all drives larger than the smallest, up to >>>> the size of the second smallest - etc. >>>> >>>> A change in Linux 3.14 unintentionally changed the layout for the >>>> second and subsequent zones. All the correct data is still stored, but >>>> each chunk may be assigned to a different device than in pre-3.14 kernels. >>>> This can lead to data corruption. >>>> >>>> It is not possible to determine what layout to use - it depends which >>>> kernel the data was written by. >>>> So we add a module parameter to allow the old (0) or new (1) layout to be >>>> specified, and refused to assemble an affected array if that parameter is >>>> not set. >>>> >>>> Fixes: 20d0189b1012 ("block: Introduce new bio_split()") >>>> cc: stable@vger.kernel.org (3.14+) >>>> Signed-off-by: NeilBrown <neilb@suse.de> >>> >>> Thanks for the patches. They look great. However, I am having problem >>> apply them (not sure whether it is a problem on my side). Could you >>> please push it somewhere so I can use cherry-pick instead? >> >> I rebased them on block/for-next, fixed the problems that Guoqing found, >> and pushed them to >> https://github.com/neilbrown/linux md/raid0 >> >> NeilBrown > > Thanks Neil! 
Thanks for the explanation about setting the flag. > > Guoqing, if this looks good, please reply with your Reviewed-by > or Acked-by. No more comments from my side, but I am not sure if it is better/possible to use one sysfs node to control the behavior instead of a module parameter, so that we can support different raid0 layouts dynamically. Anyway, Acked-by: Guoqing Jiang <guoqing.jiang@cloud.ionos.com> Thanks, Guoqing ^ permalink raw reply [flat|nested] 19+ messages in thread
* Re: [PATCH] md/raid0: avoid RAID0 data corruption due to layout confusion. 2019-09-10 16:01 ` Guoqing Jiang @ 2019-09-10 23:08 ` NeilBrown 2019-09-11 9:56 ` Song Liu 0 siblings, 1 reply; 19+ messages in thread From: NeilBrown @ 2019-09-10 23:08 UTC (permalink / raw) To: Guoqing Jiang, Song Liu, Guoqing Jiang Cc: Coly Li, NeilBrown, linux-block, linux-raid [-- Attachment #1: Type: text/plain, Size: 2442 bytes --] On Tue, Sep 10 2019, Guoqing Jiang wrote: > On 9/10/19 5:45 PM, Song Liu wrote: >> >> >>> On Sep 10, 2019, at 12:33 AM, NeilBrown <neilb@suse.de> wrote: >>> >>> On Mon, Sep 09 2019, Song Liu wrote: >>> >>>> Hi Neil, >>>> >>>>> On Sep 9, 2019, at 7:57 AM, NeilBrown <neilb@suse.de> wrote: >>>>> >>>>> >>>>> If the drives in a RAID0 are not all the same size, the array is >>>>> divided into zones. >>>>> The first zone covers all drives, to the size of the smallest. >>>>> The second zone covers all drives larger than the smallest, up to >>>>> the size of the second smallest - etc. >>>>> >>>>> A change in Linux 3.14 unintentionally changed the layout for the >>>>> second and subsequent zones. All the correct data is still stored, but >>>>> each chunk may be assigned to a different device than in pre-3.14 kernels. >>>>> This can lead to data corruption. >>>>> >>>>> It is not possible to determine what layout to use - it depends which >>>>> kernel the data was written by. >>>>> So we add a module parameter to allow the old (0) or new (1) layout to be >>>>> specified, and refused to assemble an affected array if that parameter is >>>>> not set. >>>>> >>>>> Fixes: 20d0189b1012 ("block: Introduce new bio_split()") >>>>> cc: stable@vger.kernel.org (3.14+) >>>>> Signed-off-by: NeilBrown <neilb@suse.de> >>>> >>>> Thanks for the patches. They look great. However, I am having problem >>>> apply them (not sure whether it is a problem on my side). Could you >>>> please push it somewhere so I can use cherry-pick instead? 
>>> >>> I rebased them on block/for-next, fixed the problems that Guoqing found, >>> and pushed them to >>> https://github.com/neilbrown/linux md/raid0 >>> >>> NeilBrown >> >> Thanks Neil! > > Thanks for the explanation about set the flag. > >> >> Guoqing, if this looks good, please reply with your Reviewed-by >> or Acked-by. > > No more comments from my side, but I am not sure if it is better/possible to use one > sysfs node to control the behavior instead of module parameter, then we can support > different raid0 layout dynamically. A strength of module parameters is that you can set them in /etc/modprobe.d/00-local.conf and then they are automatically set on boot. For sysfs, you need some tool to set them. > > Anyway, Acked-by: Guoqing Jiang <guoqing.jiang@cloud.ionos.com> > Thanks, NeilBrown > Thanks, > Guoqing [-- Attachment #2: signature.asc --] [-- Type: application/pgp-signature, Size: 832 bytes --] ^ permalink raw reply [flat|nested] 19+ messages in thread
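For reference, NeilBrown's modprobe.d suggestion above would look roughly like this (a sketch under assumptions: that the parameter keeps the name default_layout from the patch, and the value convention 1 = RAID0_ORIG_LAYOUT / 2 = RAID0_ALT_MULTIZONE_LAYOUT that Guoqing's review settles on):

```
# /etc/modprobe.d/00-local.conf
# Declare which multi-zone RAID0 layout this machine's arrays were
# written with, so md/raid0 will assemble them with the right mapping:
#   1 = pre-3.14 layout (RAID0_ORIG_LAYOUT)
#   2 = 3.14-and-later layout (RAID0_ALT_MULTIZONE_LAYOUT)
options raid0 default_layout=1
```

Single-zone arrays (all members the same size) are unaffected and assemble without this setting.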
* Re: [PATCH] md/raid0: avoid RAID0 data corruption due to layout confusion. 2019-09-10 23:08 ` NeilBrown @ 2019-09-11 9:56 ` Song Liu 2019-09-11 22:48 ` NeilBrown 0 siblings, 1 reply; 19+ messages in thread From: Song Liu @ 2019-09-11 9:56 UTC (permalink / raw) To: NeilBrown Cc: Guoqing Jiang, Song Liu, Guoqing Jiang, Coly Li, NeilBrown, linux-block, linux-raid On Wed, Sep 11, 2019 at 12:10 AM NeilBrown <neilb@suse.de> wrote: > > On Tue, Sep 10 2019, Guoqing Jiang wrote: > > > On 9/10/19 5:45 PM, Song Liu wrote: > >> > >> > >>> On Sep 10, 2019, at 12:33 AM, NeilBrown <neilb@suse.de> wrote: > >>> > >>> On Mon, Sep 09 2019, Song Liu wrote: > >>> > >>>> Hi Neil, > >>>> > >>>>> On Sep 9, 2019, at 7:57 AM, NeilBrown <neilb@suse.de> wrote: > >>>>> > >>>>> > >>>>> If the drives in a RAID0 are not all the same size, the array is > >>>>> divided into zones. > >>>>> The first zone covers all drives, to the size of the smallest. > >>>>> The second zone covers all drives larger than the smallest, up to > >>>>> the size of the second smallest - etc. > >>>>> > >>>>> A change in Linux 3.14 unintentionally changed the layout for the > >>>>> second and subsequent zones. All the correct data is still stored, but > >>>>> each chunk may be assigned to a different device than in pre-3.14 kernels. > >>>>> This can lead to data corruption. > >>>>> > >>>>> It is not possible to determine what layout to use - it depends which > >>>>> kernel the data was written by. > >>>>> So we add a module parameter to allow the old (0) or new (1) layout to be > >>>>> specified, and refused to assemble an affected array if that parameter is > >>>>> not set. > >>>>> > >>>>> Fixes: 20d0189b1012 ("block: Introduce new bio_split()") > >>>>> cc: stable@vger.kernel.org (3.14+) > >>>>> Signed-off-by: NeilBrown <neilb@suse.de> > >>>> > >>>> Thanks for the patches. They look great. However, I am having problem > >>>> apply them (not sure whether it is a problem on my side). 
Could you > >>>> please push it somewhere so I can use cherry-pick instead? > >>> > >>> I rebased them on block/for-next, fixed the problems that Guoqing found, > >>> and pushed them to > >>> https://github.com/neilbrown/linux md/raid0 > >>> > >>> NeilBrown > >> > >> Thanks Neil! > > > > Thanks for the explanation about set the flag. > > > >> > >> Guoqing, if this looks good, please reply with your Reviewed-by > >> or Acked-by. > > > > No more comments from my side, but I am not sure if it is better/possible to use one > > sysfs node to control the behavior instead of module parameter, then we can support > > different raid0 layout dynamically. > > A strength of module parameters is that you can set them in > /etc/modprobe.d/00-local.conf > and then they are automatically set on boot. > For sysfs, you need some tool to set them. > > > > > Anyway, Acked-by: Guoqing Jiang <guoqing.jiang@cloud.ionos.com> > > I am adding the following change to the 1/2. Please let me know if it doesn't make sense. Thanks, Song diff --git i/drivers/md/raid0.c w/drivers/md/raid0.c index a9fcff50bbfc..54d0064787a8 100644 --- i/drivers/md/raid0.c +++ w/drivers/md/raid0.c @@ -615,6 +615,10 @@ static bool raid0_make_request(struct mddev *mddev, struct bio *bio) case RAID0_ALT_MULTIZONE_LAYOUT: tmp_dev = map_sector(mddev, zone, sector, &sector); break; + default: + WARN(1, "md/raid0:%s: Invalid layout\n", mdname(mddev)); + bio_io_error(bio); + return true; } if (unlikely(is_mddev_broken(tmp_dev, "raid0"))) { ^ permalink raw reply related [flat|nested] 19+ messages in thread
* Re: [PATCH] md/raid0: avoid RAID0 data corruption due to layout confusion. 2019-09-11 9:56 ` Song Liu @ 2019-09-11 22:48 ` NeilBrown 0 siblings, 0 replies; 19+ messages in thread From: NeilBrown @ 2019-09-11 22:48 UTC (permalink / raw) To: Song Liu Cc: Guoqing Jiang, Song Liu, Guoqing Jiang, Coly Li, NeilBrown, linux-block, linux-raid [-- Attachment #1: Type: text/plain, Size: 3549 bytes --] On Wed, Sep 11 2019, Song Liu wrote: > On Wed, Sep 11, 2019 at 12:10 AM NeilBrown <neilb@suse.de> wrote: >> >> On Tue, Sep 10 2019, Guoqing Jiang wrote: >> >> > On 9/10/19 5:45 PM, Song Liu wrote: >> >> >> >> >> >>> On Sep 10, 2019, at 12:33 AM, NeilBrown <neilb@suse.de> wrote: >> >>> >> >>> On Mon, Sep 09 2019, Song Liu wrote: >> >>> >> >>>> Hi Neil, >> >>>> >> >>>>> On Sep 9, 2019, at 7:57 AM, NeilBrown <neilb@suse.de> wrote: >> >>>>> >> >>>>> >> >>>>> If the drives in a RAID0 are not all the same size, the array is >> >>>>> divided into zones. >> >>>>> The first zone covers all drives, to the size of the smallest. >> >>>>> The second zone covers all drives larger than the smallest, up to >> >>>>> the size of the second smallest - etc. >> >>>>> >> >>>>> A change in Linux 3.14 unintentionally changed the layout for the >> >>>>> second and subsequent zones. All the correct data is still stored, but >> >>>>> each chunk may be assigned to a different device than in pre-3.14 kernels. >> >>>>> This can lead to data corruption. >> >>>>> >> >>>>> It is not possible to determine what layout to use - it depends which >> >>>>> kernel the data was written by. >> >>>>> So we add a module parameter to allow the old (0) or new (1) layout to be >> >>>>> specified, and refused to assemble an affected array if that parameter is >> >>>>> not set. >> >>>>> >> >>>>> Fixes: 20d0189b1012 ("block: Introduce new bio_split()") >> >>>>> cc: stable@vger.kernel.org (3.14+) >> >>>>> Signed-off-by: NeilBrown <neilb@suse.de> >> >>>> >> >>>> Thanks for the patches. They look great. 
However, I am having problem >> >>>> apply them (not sure whether it is a problem on my side). Could you >> >>>> please push it somewhere so I can use cherry-pick instead? >> >>> >> >>> I rebased them on block/for-next, fixed the problems that Guoqing found, >> >>> and pushed them to >> >>> https://github.com/neilbrown/linux md/raid0 >> >>> >> >>> NeilBrown >> >> >> >> Thanks Neil! >> > >> > Thanks for the explanation about set the flag. >> > >> >> >> >> Guoqing, if this looks good, please reply with your Reviewed-by >> >> or Acked-by. >> > >> > No more comments from my side, but I am not sure if it is better/possible to use one >> > sysfs node to control the behavior instead of module parameter, then we can support >> > different raid0 layout dynamically. >> >> A strength of module parameters is that you can set them in >> /etc/modprobe.d/00-local.conf >> and then they are automatically set on boot. >> For sysfs, you need some tool to set them. >> >> > >> > Anyway, Acked-by: Guoqing Jiang <guoqing.jiang@cloud.ionos.com> >> > > > I am adding the following change to the 1/2. Please let me know if it doesn't > make sense. I don't object, though with the current code it is impossible for that warning to fire. Code might change in the future though, and it's better to be safe than sorry. 
Thanks, NeilBrown > > Thanks, > Song > > diff --git i/drivers/md/raid0.c w/drivers/md/raid0.c > index a9fcff50bbfc..54d0064787a8 100644 > --- i/drivers/md/raid0.c > +++ w/drivers/md/raid0.c > @@ -615,6 +615,10 @@ static bool raid0_make_request(struct mddev > *mddev, struct bio *bio) > case RAID0_ALT_MULTIZONE_LAYOUT: > tmp_dev = map_sector(mddev, zone, sector, &sector); > break; > + default: > + WARN(1, "md/raid0:%s: Invalid layout\n", mdname(mddev)); > + bio_io_error(bio); > + return true; > } > > if (unlikely(is_mddev_broken(tmp_dev, "raid0"))) { [-- Attachment #2: signature.asc --] [-- Type: application/pgp-signature, Size: 832 bytes --] ^ permalink raw reply [flat|nested] 19+ messages in thread
* Re: [PATCH] md/raid0: avoid RAID0 data corruption due to layout confusion. 2019-09-09 6:57 ` [PATCH] md/raid0: avoid RAID0 data corruption due to layout confusion NeilBrown 2019-09-09 6:58 ` [PATCH 2/2] md: add feature flag MD_FEATURE_RAID0_LAYOUT NeilBrown 2019-09-09 14:56 ` [PATCH] md/raid0: avoid RAID0 data corruption due to layout confusion Song Liu @ 2019-09-09 15:09 ` Guoqing Jiang 2019-09-09 23:34 ` NeilBrown 2 siblings, 1 reply; 19+ messages in thread From: Guoqing Jiang @ 2019-09-09 15:09 UTC (permalink / raw) To: NeilBrown, Coly Li, Song Liu; +Cc: NeilBrown, linux-block, linux-raid On 9/9/19 8:57 AM, NeilBrown wrote: > > If the drives in a RAID0 are not all the same size, the array is > divided into zones. > The first zone covers all drives, to the size of the smallest. > The second zone covers all drives larger than the smallest, up to > the size of the second smallest - etc. > > A change in Linux 3.14 unintentionally changed the layout for the > second and subsequent zones. All the correct data is still stored, but > each chunk may be assigned to a different device than in pre-3.14 kernels. > This can lead to data corruption. > > It is not possible to determine what layout to use - it depends which > kernel the data was written by. > So we add a module parameter to allow the old (0) or new (1) layout to be > specified, and refused to assemble an affected array if that parameter is > not set. > > Fixes: 20d0189b1012 ("block: Introduce new bio_split()") > cc: stable@vger.kernel.org (3.14+) > Signed-off-by: NeilBrown <neilb@suse.de> > --- > > This and the next patch are my proposal for how to address > this problem. I haven't actually tested ..... 
> > NeilBrown > > drivers/md/raid0.c | 28 +++++++++++++++++++++++++++- > drivers/md/raid0.h | 14 ++++++++++++++ > 2 files changed, 41 insertions(+), 1 deletion(-) > > diff --git a/drivers/md/raid0.c b/drivers/md/raid0.c > index bf5cf184a260..a8888c12308a 100644 > --- a/drivers/md/raid0.c > +++ b/drivers/md/raid0.c > @@ -19,6 +19,9 @@ > #include "raid0.h" > #include "raid5.h" > > +static int default_layout = -1; > +module_param(default_layout, int, 0644); > + > #define UNSUPPORTED_MDDEV_FLAGS \ > ((1L << MD_HAS_JOURNAL) | \ > (1L << MD_JOURNAL_CLEAN) | \ > @@ -139,6 +142,19 @@ static int create_strip_zones(struct mddev *mddev, struct r0conf **private_conf) > } > pr_debug("md/raid0:%s: FINAL %d zones\n", > mdname(mddev), conf->nr_strip_zones); > + > + if (conf->nr_strip_zones == 1) { > + conf->layout = RAID0_ORIG_LAYOUT; > + } else if (default_layout == RAID0_ORIG_LAYOUT || > + default_layout == RAID0_ALT_MULTIZONE_LAYOUT) { > + conf->layout = default_layout; > + } else { > + pr_err("md/raid0:%s: cannot assemble multi-zone RAID0 with default_layout setting\n", > + mdname(mddev)); > + pr_err("md/raid0: please set raid.default_layout to 0 or 1\n"); Maybe "1 or 2" to align with the definition of below r0layout? [snip] > +enum r0layout { > + RAID0_ORIG_LAYOUT = 1, > + RAID0_ALT_MULTIZONE_LAYOUT = 2, > +}; > struct r0conf { > struct strip_zone *strip_zone; > struct md_rdev **devlist; /* lists of rdevs, pointed to > * by strip_zone->dev */ > int nr_strip_zones; > + enum r0layout layout; > }; > > #endif > Thanks, Guoqing ^ permalink raw reply [flat|nested] 19+ messages in thread
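To make concrete why the two layouts in the enum above only diverge past the first zone, here is a toy model of the chunk-to-device mapping (an illustration, not the kernel's map_sector(); the zone lookup is simplified and the chunk and device sizes are made up):

```python
from dataclasses import dataclass

CHUNK = 8  # chunk size in sectors -- a made-up, tiny value to keep the numbers readable

@dataclass
class Zone:
    start: int   # first array sector covered by this zone
    nb_dev: int  # number of member devices striped in this zone

def map_device(abs_sector: int, zones: list, layout: int) -> int:
    """Index (within its zone) of the device a sector lands on.

    layout 1 ~ RAID0_ORIG_LAYOUT: chunk number computed from the
    absolute array sector (what pre-3.14 kernels effectively did).
    layout 2 ~ RAID0_ALT_MULTIZONE_LAYOUT: chunk number computed from
    the zone-relative sector (what 3.14 and later kernels do).
    """
    # find_zone(): the zone with the highest start at or below the sector
    zone = max((z for z in zones if abs_sector >= z.start),
               key=lambda z: z.start)
    sector = abs_sector if layout == 1 else abs_sector - zone.start
    return (sector // CHUNK) % zone.nb_dev

# Hypothetical array of three members sized 8, 24 and 24 sectors:
# zone 0 stripes all 3 devices over array sectors 0..23,
# zone 1 stripes the 2 larger devices over array sectors 24..55.
zones = [Zone(start=0, nb_dev=3), Zone(start=24, nb_dev=2)]

for s in (0, 10, 24, 32):
    print(s, map_device(s, zones, 1), map_device(s, zones, 2))
```

In the first zone the two layouts always agree (the zone starts at sector 0, so absolute and zone-relative sectors coincide); from sector 24 on, the first chunk of zone 1, they can pick different devices — exactly the silent chunk relocation the patch guards against.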
* Re: [PATCH] md/raid0: avoid RAID0 data corruption due to layout confusion. 2019-09-09 15:09 ` Guoqing Jiang @ 2019-09-09 23:34 ` NeilBrown 0 siblings, 0 replies; 19+ messages in thread From: NeilBrown @ 2019-09-09 23:34 UTC (permalink / raw) To: Guoqing Jiang, Coly Li, Song Liu; +Cc: NeilBrown, linux-block, linux-raid [-- Attachment #1: Type: text/plain, Size: 2972 bytes --] On Mon, Sep 09 2019, Guoqing Jiang wrote: > On 9/9/19 8:57 AM, NeilBrown wrote: >> >> If the drives in a RAID0 are not all the same size, the array is >> divided into zones. >> The first zone covers all drives, to the size of the smallest. >> The second zone covers all drives larger than the smallest, up to >> the size of the second smallest - etc. >> >> A change in Linux 3.14 unintentionally changed the layout for the >> second and subsequent zones. All the correct data is still stored, but >> each chunk may be assigned to a different device than in pre-3.14 kernels. >> This can lead to data corruption. >> >> It is not possible to determine what layout to use - it depends which >> kernel the data was written by. >> So we add a module parameter to allow the old (0) or new (1) layout to be >> specified, and refused to assemble an affected array if that parameter is >> not set. >> >> Fixes: 20d0189b1012 ("block: Introduce new bio_split()") >> cc: stable@vger.kernel.org (3.14+) >> Signed-off-by: NeilBrown <neilb@suse.de> >> --- >> >> This and the next patch are my proposal for how to address >> this problem. I haven't actually tested ..... 
>> >> NeilBrown >> >> drivers/md/raid0.c | 28 +++++++++++++++++++++++++++- >> drivers/md/raid0.h | 14 ++++++++++++++ >> 2 files changed, 41 insertions(+), 1 deletion(-) >> >> diff --git a/drivers/md/raid0.c b/drivers/md/raid0.c >> index bf5cf184a260..a8888c12308a 100644 >> --- a/drivers/md/raid0.c >> +++ b/drivers/md/raid0.c >> @@ -19,6 +19,9 @@ >> #include "raid0.h" >> #include "raid5.h" >> >> +static int default_layout = -1; >> +module_param(default_layout, int, 0644); >> + >> #define UNSUPPORTED_MDDEV_FLAGS \ >> ((1L << MD_HAS_JOURNAL) | \ >> (1L << MD_JOURNAL_CLEAN) | \ >> @@ -139,6 +142,19 @@ static int create_strip_zones(struct mddev *mddev, struct r0conf **private_conf) >> } >> pr_debug("md/raid0:%s: FINAL %d zones\n", >> mdname(mddev), conf->nr_strip_zones); >> + >> + if (conf->nr_strip_zones == 1) { >> + conf->layout = RAID0_ORIG_LAYOUT; >> + } else if (default_layout == RAID0_ORIG_LAYOUT || >> + default_layout == RAID0_ALT_MULTIZONE_LAYOUT) { >> + conf->layout = default_layout; >> + } else { >> + pr_err("md/raid0:%s: cannot assemble multi-zone RAID0 with default_layout setting\n", >> + mdname(mddev)); >> + pr_err("md/raid0: please set raid.default_layout to 0 or 1\n"); > > Maybe "1 or 2" to align with the definition of below r0layout? Thanks. Fixed. NeilBrown > > [snip] > >> +enum r0layout { >> + RAID0_ORIG_LAYOUT = 1, >> + RAID0_ALT_MULTIZONE_LAYOUT = 2, >> +}; >> struct r0conf { >> struct strip_zone *strip_zone; >> struct md_rdev **devlist; /* lists of rdevs, pointed to >> * by strip_zone->dev */ >> int nr_strip_zones; >> + enum r0layout layout; >> }; >> >> #endif >> > > Thanks, > Guoqing [-- Attachment #2: signature.asc --] [-- Type: application/pgp-signature, Size: 832 bytes --] ^ permalink raw reply [flat|nested] 19+ messages in thread
Thread overview: 19+ messages (newest: 2019-09-11 22:49 UTC)
2019-08-22 10:38 [RFC] How to handle an ugly md raid0 sector map bug ? Coly Li
2019-08-23  0:02 ` NeilBrown
2019-08-23 16:37   ` Song Liu
2019-08-23 17:03     ` Coly Li
2019-08-23 17:17       ` Song Liu
2019-08-23 17:47         ` Coly Li
2019-09-09  6:57 ` [PATCH] md/raid0: avoid RAID0 data corruption due to layout confusion NeilBrown
2019-09-09  6:58   ` [PATCH 2/2] md: add feature flag MD_FEATURE_RAID0_LAYOUT NeilBrown
2019-09-09 15:33     ` Guoqing Jiang
2019-09-09 23:26       ` NeilBrown
2019-09-09 14:56   ` [PATCH] md/raid0: avoid RAID0 data corruption due to layout confusion Song Liu
2019-09-09 23:33     ` NeilBrown
2019-09-10 15:45       ` Song Liu
2019-09-10 16:01         ` Guoqing Jiang
2019-09-10 23:08           ` NeilBrown
2019-09-11  9:56             ` Song Liu
2019-09-11 22:48               ` NeilBrown
2019-09-09 15:09   ` Guoqing Jiang
2019-09-09 23:34     ` NeilBrown