* Possible bug in DM-RAID.
@ 2015-10-20 15:12 Austin S Hemmelgarn
  2015-10-21  1:39 ` Neil Brown
  2015-10-21 13:19 ` Austin S Hemmelgarn
  0 siblings, 2 replies; 6+ messages in thread
From: Austin S Hemmelgarn @ 2015-10-20 15:12 UTC (permalink / raw)
  To: Linux-Kernel mailing list, linux-raid


I think I've stumbled upon a bug in DM-RAID.  The primary symptom is that when
creating a new DM-RAID based device (using either LVM or dmsetup) in a RAID1
configuration, it very quickly claims one by one that all of the disks failed
except the first, and goes degraded.  When this happens on a given system, the
disks always 'fail' in the reverse of the order of the mirror numbers.  All of
the other RAID profiles work just fine.  Curiously, it also only seems to
happen for 'big' devices (I haven't been able to determine exactly what the
minimum size is, but I see it 100% of the time with 32G devices, never with 16G
ones, and only intermittently with 24G).
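
For reference, the kind of invocation that triggers this for me is roughly the
following (the VG and LV names here are placeholders, not my actual setup):

  # 32G RAID1 LV: reliably hits the failure described above
  lvcreate --type raid1 -m 1 -L 32G -n raidtest somevg
  # 16G RAID1 LV: never hits it here
  lvcreate --type raid1 -m 1 -L 16G -n raidtest2 somevg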

Here's what I got from dmesg when creating a 32G LVM volume that exhibited
this issue:
[66318.401295] device-mapper: raid: Superblocks created for new array
[66318.450452] md/raid1:mdX: active with 2 out of 2 mirrors
[66318.450467] Choosing daemon_sleep default (5 sec)
[66318.450482] created bitmap (32 pages) for device mdX
[66318.450495] attempt to access beyond end of device
[66318.450501] dm-91: rw=13329, want=0, limit=8192
[66318.450506] md: super_written gets error=-5, uptodate=0
[66318.450513] md/raid1:mdX: Disk failure on dm-92, disabling device.
               md/raid1:mdX: Operation continuing on 1 devices.
[66318.459815] attempt to access beyond end of device
[66318.459819] dm-89: rw=13329, want=0, limit=8192
[66318.459822] md: super_written gets error=-5, uptodate=0
[66318.492852] attempt to access beyond end of device
[66318.492862] dm-89: rw=13329, want=0, limit=8192
[66318.492868] md: super_written gets error=-5, uptodate=0
[66318.627183] mdX: bitmap file is out of date, doing full recovery
[66318.714107] mdX: bitmap initialized from disk: read 3 pages, set 65536 of 65536 bits
[66318.782045] RAID1 conf printout:
[66318.782054]  --- wd:1 rd:2
[66318.782061]  disk 0, wo:0, o:1, dev:dm-90
[66318.782068]  disk 1, wo:1, o:0, dev:dm-92
[66318.836598] RAID1 conf printout:
[66318.836607]  --- wd:1 rd:2
[66318.836614]  disk 0, wo:0, o:1, dev:dm-90

And here's output for a 24G LVM volume that didn't display the issue.
[66343.407954] device-mapper: raid: Superblocks created for new array
[66343.479065] md/raid1:mdX: active with 2 out of 2 mirrors
[66343.479078] Choosing daemon_sleep default (5 sec)
[66343.479101] created bitmap (24 pages) for device mdX
[66343.629329] mdX: bitmap file is out of date, doing full recovery
[66343.677374] mdX: bitmap initialized from disk: read 2 pages, set 49152 of 49152 bits

I'm using a lightly patched version of 4.2.3
(the source can be found at https://github.com/ferroin/linux)
but none of the patches I'm using come anywhere near anything in the block layer,
let alone the DM/MD code.

I've attempted to bisect this, although it got kind of complicated.  So far I've
determined that the first commit that I see this issue on is d3b178a: md: Skip cluster setup for dm-raid
Prior to that commit, I can't initialize any dm-raid devices due to the bug it fixes.
I have not tested anything prior to d51e4fe (the merge commit that pulled in the md-cluster code),
but I do distinctly remember that I did not see this issue in 3.19.
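
For completeness, the bisection itself was just the usual procedure, roughly
along these lines (an illustrative outline, not a transcript of my exact
commands):

  git bisect start
  git bisect bad <a tree where I see the failure>
  git bisect good <a tree where RAID1 creation still worked for me>
  # at each step: build, boot, try creating a 32G RAID1 LV,
  # then mark the revision with 'git bisect good' or 'git bisect bad'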

I'll be happy to provide more info if needed.




* Re: Possible bug in DM-RAID.
  2015-10-20 15:12 Possible bug in DM-RAID Austin S Hemmelgarn
@ 2015-10-21  1:39 ` Neil Brown
  2015-10-21 14:11   ` Heinz Mauelshagen
  2015-10-21 13:19 ` Austin S Hemmelgarn
  1 sibling, 1 reply; 6+ messages in thread
From: Neil Brown @ 2015-10-21  1:39 UTC (permalink / raw)
  To: Austin S Hemmelgarn, Linux-Kernel mailing list, linux-raid,
	device-mapper development



Added dm-devel, which is probably the more appropriate list for dm
things.

NeilBrown

Austin S Hemmelgarn <ahferroin7@gmail.com> writes:

> I think I've stumbled upon a bug in DM-RAID.  The primary symptom is that when
> creating a new DM-RAID based device (using either LVM or dmsetup) in a RAID1
> configuration, it very quickly claims one by one that all of the disks failed
> except the first, and goes degraded.  When this happens on a given system, the
> disks always 'fail' in the reverse of the order of the mirror numbers.  All of
> the other RAID profiles work just fine.  Curiously, it also only seems to
> happen for 'big' devices (I haven't been able to determine exactly what the
> minimum size is, but I see it 100% of the time with 32G devices, never with 16G
> ones, and only intermittently with 24G).
>
> Here's what I got from dmesg when creating a 32G LVM volume that exhibited
> this issue:
> [66318.401295] device-mapper: raid: Superblocks created for new array
> [66318.450452] md/raid1:mdX: active with 2 out of 2 mirrors
> [66318.450467] Choosing daemon_sleep default (5 sec)
> [66318.450482] created bitmap (32 pages) for device mdX
> [66318.450495] attempt to access beyond end of device
> [66318.450501] dm-91: rw=13329, want=0, limit=8192
> [66318.450506] md: super_written gets error=-5, uptodate=0
> [66318.450513] md/raid1:mdX: Disk failure on dm-92, disabling device.
>                md/raid1:mdX: Operation continuing on 1 devices.
> [66318.459815] attempt to access beyond end of device
> [66318.459819] dm-89: rw=13329, want=0, limit=8192
> [66318.459822] md: super_written gets error=-5, uptodate=0
> [66318.492852] attempt to access beyond end of device
> [66318.492862] dm-89: rw=13329, want=0, limit=8192
> [66318.492868] md: super_written gets error=-5, uptodate=0
> [66318.627183] mdX: bitmap file is out of date, doing full recovery
> [66318.714107] mdX: bitmap initialized from disk: read 3 pages, set 65536 of 65536 bits
> [66318.782045] RAID1 conf printout:
> [66318.782054]  --- wd:1 rd:2
> [66318.782061]  disk 0, wo:0, o:1, dev:dm-90
> [66318.782068]  disk 1, wo:1, o:0, dev:dm-92
> [66318.836598] RAID1 conf printout:
> [66318.836607]  --- wd:1 rd:2
> [66318.836614]  disk 0, wo:0, o:1, dev:dm-90
>
> And here's output for a 24G LVM volume that didn't display the issue.
> [66343.407954] device-mapper: raid: Superblocks created for new array
> [66343.479065] md/raid1:mdX: active with 2 out of 2 mirrors
> [66343.479078] Choosing daemon_sleep default (5 sec)
> [66343.479101] created bitmap (24 pages) for device mdX
> [66343.629329] mdX: bitmap file is out of date, doing full recovery
> [66343.677374] mdX: bitmap initialized from disk: read 2 pages, set 49152 of 49152 bits
>
> I'm using a lightly patched version of 4.2.3
> (the source can be found at https://github.com/ferroin/linux)
> but none of the patches I'm using come anywhere near anything in the block layer,
> let alone the DM/MD code.
>
> I've attempted to bisect this, although it got kind of complicated.  So far I've
> determined that the first commit that I see this issue on is d3b178a: md: Skip cluster setup for dm-raid
> Prior to that commit, I can't initialize any dm-raid devices due to the bug it fixes.
> I have not tested anything prior to d51e4fe (the merge commit that pulled in the md-cluster code),
> but I do distinctly remember that I did not see this issue in 3.19.
>
> I'll be happy to provide more info if needed.



* Re: Possible bug in DM-RAID.
  2015-10-20 15:12 Possible bug in DM-RAID Austin S Hemmelgarn
  2015-10-21  1:39 ` Neil Brown
@ 2015-10-21 13:19 ` Austin S Hemmelgarn
  2015-10-21 13:24   ` Austin S Hemmelgarn
  1 sibling, 1 reply; 6+ messages in thread
From: Austin S Hemmelgarn @ 2015-10-21 13:19 UTC (permalink / raw)
  To: Linux-Kernel mailing list, linux-raid, dm-devel


On 2015-10-20 11:12, Austin S Hemmelgarn wrote:
> I think I've stumbled upon a bug in DM-RAID.  The primary symptom is that when
> creating a new DM-RAID based device (using either LVM or dmsetup) in a RAID1
> configuration, it very quickly claims one by one that all of the disks failed
> except the first, and goes degraded.  When this happens on a given system, the
> disks always 'fail' in the reverse of the order of the mirror numbers.  All of
> the other RAID profiles work just fine.  Curiously, it also only seems to
> happen for 'big' devices (I haven't been able to determine exactly what the
> minimum size is, but I see it 100% of the time with 32G devices, never with 16G
> ones, and only intermittently with 24G).
OK, I've done some more experimentation, and have figured out that
adjusting the sync region size from the default (and thus adjusting the
bitmap size) can temporarily work around this; see the sketch after the
list below.  If I adjust things so that the bitmap is smaller than 32
pages, then everything works fine until I try to reboot, at which point
the device does one of the following (in order of decreasing probability):
1. Fails just like I've outlined above.
2. Refuses to activate at all (if using LVM, you get some complaint 
about 'expected raid1 segment type, but got NULL' or 'reload ioctl on 
failed')
3. It works for a while, and then one of the first two things happens 
the next time I reboot.
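
To be concrete, the sort of adjustment I mean looks roughly like this
(hypothetical VG/LV names; the page arithmetic is my reading of the md bitmap
code, so take it with a grain of salt):

  # The dmesg output in my original mail shows 65536 regions for the 32G LV
  # ("set 65536 of 65536 bits"), i.e. 512 KiB regions.  The in-memory md
  # bitmap keeps 2048 16-bit counters per page, hence "created bitmap
  # (32 pages)".  Doubling the region size halves the region count, so the
  # bitmap drops to 16 pages and creation succeeds (until the reboot
  # problems described above):
  lvcreate --type raid1 -m 1 -R 1M -L 32G -n raidtest somevg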
>
> Here's what I got from dmesg when creating a 32G LVM volume that exhibited
> this issue:
> [66318.401295] device-mapper: raid: Superblocks created for new array
> [66318.450452] md/raid1:mdX: active with 2 out of 2 mirrors
> [66318.450467] Choosing daemon_sleep default (5 sec)
> [66318.450482] created bitmap (32 pages) for device mdX
> [66318.450495] attempt to access beyond end of device
> [66318.450501] dm-91: rw=13329, want=0, limit=8192
> [66318.450506] md: super_written gets error=-5, uptodate=0
> [66318.450513] md/raid1:mdX: Disk failure on dm-92, disabling device.
>                 md/raid1:mdX: Operation continuing on 1 devices.
> [66318.459815] attempt to access beyond end of device
> [66318.459819] dm-89: rw=13329, want=0, limit=8192
> [66318.459822] md: super_written gets error=-5, uptodate=0
> [66318.492852] attempt to access beyond end of device
> [66318.492862] dm-89: rw=13329, want=0, limit=8192
> [66318.492868] md: super_written gets error=-5, uptodate=0
> [66318.627183] mdX: bitmap file is out of date, doing full recovery
> [66318.714107] mdX: bitmap initialized from disk: read 3 pages, set 65536 of 65536 bits
> [66318.782045] RAID1 conf printout:
> [66318.782054]  --- wd:1 rd:2
> [66318.782061]  disk 0, wo:0, o:1, dev:dm-90
> [66318.782068]  disk 1, wo:1, o:0, dev:dm-92
> [66318.836598] RAID1 conf printout:
> [66318.836607]  --- wd:1 rd:2
> [66318.836614]  disk 0, wo:0, o:1, dev:dm-90
>
> And here's output for a 24G LVM volume that didn't display the issue.
> [66343.407954] device-mapper: raid: Superblocks created for new array
> [66343.479065] md/raid1:mdX: active with 2 out of 2 mirrors
> [66343.479078] Choosing daemon_sleep default (5 sec)
> [66343.479101] created bitmap (24 pages) for device mdX
> [66343.629329] mdX: bitmap file is out of date, doing full recovery
> [66343.677374] mdX: bitmap initialized from disk: read 2 pages, set 49152 of 49152 bits
>
> I'm using a lightly patched version of 4.2.3
> (the source can be found at https://github.com/ferroin/linux)
> but none of the patches I'm using come anywhere near anything in the block layer,
> let alone the DM/MD code.
>
> I've attempted to bisect this, although it got kind of complicated.  So far I've
> determined that the first commit that I see this issue on is d3b178a: md: Skip cluster setup for dm-raid
> Prior to that commit, I can't initialize any dm-raid devices due to the bug it fixes.
> I have not tested anything prior to d51e4fe (the merge commit that pulled in the md-cluster code),
> but I do distinctly remember that I did not see this issue in 3.19.
>
> I'll be happy to provide more info if needed.





* Re: Possible bug in DM-RAID.
  2015-10-21 13:19 ` Austin S Hemmelgarn
@ 2015-10-21 13:24   ` Austin S Hemmelgarn
  0 siblings, 0 replies; 6+ messages in thread
From: Austin S Hemmelgarn @ 2015-10-21 13:24 UTC (permalink / raw)
  To: Linux-Kernel mailing list, linux-raid, dm-devel, Neil Brown


On 2015-10-21 09:19, Austin S Hemmelgarn wrote:
Hmm, dm-devel@redhat.org seems to have bounced for me.  Any ideas why 
RedHat would be blocking inbound mail from Google's mail servers?
> On 2015-10-20 11:12, Austin S Hemmelgarn wrote:
>> I think I've stumbled upon a bug in DM-RAID.  The primary symptom is
>> that when
>> creating a new DM-RAID based device (using either LVM or dmsetup) in a
>> RAID1
>> configuration, it very quickly claims one by one that all of the disks
>> failed
>> except the first, and goes degraded.  When this happens on a given
>> system, the
>> disks always 'fail' in the reverse of the order of the mirror
>> numbers.  All of
>> the other RAID profiles work just fine.  Curiously, it also only seems to
>> happen for 'big' devices (I haven't been able to determine exactly
>> what the
>> minimum size is, but I see it 100% of the time with 32G devices, never
>> with 16G
>> ones, and only intermittently with 24G).
> OK, I've done some more experimentation, and have figured out that
> adjusting the sync region size from the default (and thus adjusting the
> bitmap size) can temporarily work around this.  If I adjust things so
> that the bitmap is less than 32 pages, then everything works fine, until
> I try to reboot, at which point the device either (in order of
> decreasing probability):
> 1. Fails just like I've outlined above.
> 2. Refuses to activate at all (if using LVM, you get some complaint
> about 'expected raid1 segment type, but got NULL' or 'reload ioctl on
> failed')
> 3. It works for a while, and then one of the first two things happens
> the next time I reboot.
>>
>> Here's what I got from dmesg when creating a 32G LVM volume that
>> exhibited
>> this issue:
>> [66318.401295] device-mapper: raid: Superblocks created for new array
>> [66318.450452] md/raid1:mdX: active with 2 out of 2 mirrors
>> [66318.450467] Choosing daemon_sleep default (5 sec)
>> [66318.450482] created bitmap (32 pages) for device mdX
>> [66318.450495] attempt to access beyond end of device
>> [66318.450501] dm-91: rw=13329, want=0, limit=8192
>> [66318.450506] md: super_written gets error=-5, uptodate=0
>> [66318.450513] md/raid1:mdX: Disk failure on dm-92, disabling device.
>>                 md/raid1:mdX: Operation continuing on 1 devices.
>> [66318.459815] attempt to access beyond end of device
>> [66318.459819] dm-89: rw=13329, want=0, limit=8192
>> [66318.459822] md: super_written gets error=-5, uptodate=0
>> [66318.492852] attempt to access beyond end of device
>> [66318.492862] dm-89: rw=13329, want=0, limit=8192
>> [66318.492868] md: super_written gets error=-5, uptodate=0
>> [66318.627183] mdX: bitmap file is out of date, doing full recovery
>> [66318.714107] mdX: bitmap initialized from disk: read 3 pages, set
>> 65536 of 65536 bits
>> [66318.782045] RAID1 conf printout:
>> [66318.782054]  --- wd:1 rd:2
>> [66318.782061]  disk 0, wo:0, o:1, dev:dm-90
>> [66318.782068]  disk 1, wo:1, o:0, dev:dm-92
>> [66318.836598] RAID1 conf printout:
>> [66318.836607]  --- wd:1 rd:2
>> [66318.836614]  disk 0, wo:0, o:1, dev:dm-90
>>
>> And here's output for a 24G LVM volume that didn't display the issue.
>> [66343.407954] device-mapper: raid: Superblocks created for new array
>> [66343.479065] md/raid1:mdX: active with 2 out of 2 mirrors
>> [66343.479078] Choosing daemon_sleep default (5 sec)
>> [66343.479101] created bitmap (24 pages) for device mdX
>> [66343.629329] mdX: bitmap file is out of date, doing full recovery
>> [66343.677374] mdX: bitmap initialized from disk: read 2 pages, set
>> 49152 of 49152 bits
>>
>> I'm using a lightly patched version of 4.2.3
>> (the source can be found at https://github.com/ferroin/linux)
>> but none of the patches I'm using come anywhere near anything in the
>> block layer,
>> let alone the DM/MD code.
>>
>> I've attempted to bisect this, although it got kind of complicated.
>> So far I've
>> determined that the first commit that I see this issue on is d3b178a:
>> md: Skip cluster setup for dm-raid
>> Prior to that commit, I can't initialize any dm-raid devices due to
>> the bug it fixes.
>> I have not tested anything prior to d51e4fe (the merge commit that
>> pulled in the md-cluster code),
>> but I do distinctly remember that I did not see this issue in 3.19.
>>
>> I'll be happy to provide more info if needed.




* Re: Possible bug in DM-RAID.
  2015-10-21  1:39 ` Neil Brown
@ 2015-10-21 14:11   ` Heinz Mauelshagen
  2015-10-21 15:08     ` [dm-devel] " Austin S Hemmelgarn
  0 siblings, 1 reply; 6+ messages in thread
From: Heinz Mauelshagen @ 2015-10-21 14:11 UTC (permalink / raw)
  To: device-mapper development, Austin S Hemmelgarn,
	Linux-Kernel mailing list, linux-raid




Neil,

this looks like an incarnation of the md bitmap flaw (the one with the bogus
cluster slot number) that results in an invalid bitmap header page index.


Austin,
this is the upstream commit you need to fix your problem:

commit da6fb7a9e5bd6f04f7e15070f630bdf1ea502841
Author: NeilBrown <neilb@suse.com>
Date:   Thu Oct 1 16:03:38 2015 +1000

     md/bitmap: don't pass -1 to bitmap_storage_alloc.

     Passing -1 to bitmap_storage_alloc() causes page->index to be set to
     -1, which is quite problematic.

     So only pass ->cluster_slot if mddev_is_clustered().

     Fixes: b97e92574c0b ("Use separate bitmaps for each nodes in the 
cluster")
     Cc: stable@vger.kernel.org (v4.1+)
     Signed-off-by: NeilBrown <neilb@suse.com>

diff --git a/drivers/md/bitmap.c b/drivers/md/bitmap.c
index e51de52..48b5890 100644
--- a/drivers/md/bitmap.c
+++ b/drivers/md/bitmap.c
@@ -1997,7 +1997,8 @@ int bitmap_resize(struct bitmap *bitmap, sector_t blocks,
         if (bitmap->mddev->bitmap_info.offset || bitmap->mddev->bitmap_info.file)
                 ret = bitmap_storage_alloc(&store, chunks,
                                            !bitmap->mddev->bitmap_info.external,
-                                           bitmap->cluster_slot);
+                                           mddev_is_clustered(bitmap->mddev)
+                                           ? bitmap->cluster_slot : 0);
         if (ret)
                 goto err;
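
If you would rather stay on your patched 4.2.3 tree than jump to a newer one,
cherry-picking just that commit should be sufficient; an untested sketch,
assuming you fetch it from Linus' mainline tree:

  git fetch https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git master
  git cherry-pick da6fb7a9e5bd6f04f7e15070f630bdf1ea502841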


On 10/21/2015 03:39 AM, Neil Brown wrote:
> Added dm-devel, which is probably the more appropriate list for dm
> things.
>
> NeilBrown
>
> Austin S Hemmelgarn <ahferroin7@gmail.com> writes:
>
>> I think I've stumbled upon a bug in DM-RAID.  The primary symptom is that when
>> creating a new DM-RAID based device (using either LVM or dmsetup) in a RAID1
>> configuration, it very quickly claims one by one that all of the disks failed
>> except the first, and goes degraded.  When this happens on a given system, the
>> disks always 'fail' in the reverse of the order of the mirror numbers.  All of
>> the other RAID profiles work just fine.  Curiously, it also only seems to
>> happen for 'big' devices (I haven't been able to determine exactly what the
>> minimum size is, but I see it 100% of the time with 32G devices, never with 16G
>> ones, and only intermittently with 24G).
>>
>> Here's what I got from dmesg when creating a 32G LVM volume that exhibited
>> this issue:
>> [66318.401295] device-mapper: raid: Superblocks created for new array
>> [66318.450452] md/raid1:mdX: active with 2 out of 2 mirrors
>> [66318.450467] Choosing daemon_sleep default (5 sec)
>> [66318.450482] created bitmap (32 pages) for device mdX
>> [66318.450495] attempt to access beyond end of device
>> [66318.450501] dm-91: rw=13329, want=0, limit=8192
>> [66318.450506] md: super_written gets error=-5, uptodate=0
>> [66318.450513] md/raid1:mdX: Disk failure on dm-92, disabling device.
>>                 md/raid1:mdX: Operation continuing on 1 devices.
>> [66318.459815] attempt to access beyond end of device
>> [66318.459819] dm-89: rw=13329, want=0, limit=8192
>> [66318.459822] md: super_written gets error=-5, uptodate=0
>> [66318.492852] attempt to access beyond end of device
>> [66318.492862] dm-89: rw=13329, want=0, limit=8192
>> [66318.492868] md: super_written gets error=-5, uptodate=0
>> [66318.627183] mdX: bitmap file is out of date, doing full recovery
>> [66318.714107] mdX: bitmap initialized from disk: read 3 pages, set 65536 of 65536 bits
>> [66318.782045] RAID1 conf printout:
>> [66318.782054]  --- wd:1 rd:2
>> [66318.782061]  disk 0, wo:0, o:1, dev:dm-90
>> [66318.782068]  disk 1, wo:1, o:0, dev:dm-92
>> [66318.836598] RAID1 conf printout:
>> [66318.836607]  --- wd:1 rd:2
>> [66318.836614]  disk 0, wo:0, o:1, dev:dm-90
>>
>> And here's output for a 24G LVM volume that didn't display the issue.
>> [66343.407954] device-mapper: raid: Superblocks created for new array
>> [66343.479065] md/raid1:mdX: active with 2 out of 2 mirrors
>> [66343.479078] Choosing daemon_sleep default (5 sec)
>> [66343.479101] created bitmap (24 pages) for device mdX
>> [66343.629329] mdX: bitmap file is out of date, doing full recovery
>> [66343.677374] mdX: bitmap initialized from disk: read 2 pages, set 49152 of 49152 bits
>>
>> I'm using a lightly patched version of 4.2.3
>> (the source can be found at https://github.com/ferroin/linux)
>> but none of the patches I'm using come anywhere near anything in the block layer,
>> let alone the DM/MD code.
>>
>> I've attempted to bisect this, although it got kind of complicated.  So far I've
>> determined that the first commit that I see this issue on is d3b178a: md: Skip cluster setup for dm-raid
>> Prior to that commit, I can't initialize any dm-raid devices due to the bug it fixes.
>> I have not tested anything prior to d51e4fe (the merge commit that pulled in the md-cluster code),
>> but I do distinctly remember that I did not see this issue in 3.19.
>>
>> I'll be happy to provide more info if needed.
>>
>>
>> --
>> dm-devel mailing list
>> dm-devel@redhat.com
>> https://www.redhat.com/mailman/listinfo/dm-devel



* Re: [dm-devel] Possible bug in DM-RAID.
  2015-10-21 14:11   ` Heinz Mauelshagen
@ 2015-10-21 15:08     ` Austin S Hemmelgarn
  0 siblings, 0 replies; 6+ messages in thread
From: Austin S Hemmelgarn @ 2015-10-21 15:08 UTC (permalink / raw)
  To: Heinz Mauelshagen, device-mapper development,
	Linux-Kernel mailing list, linux-raid


Thanks for the quick response.  I've cloned the current mainline master
branch (which has the commit), built it, and tested it, and everything
works, so it looks like this was indeed the bug I was seeing (that, or
something else between 4.2.3 and what I tested fixed things).
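
For anyone else who runs into this, a quick way to check whether a given
kernel checkout already contains the fix (assuming you are inside the git
tree) is something like:

  git merge-base --is-ancestor da6fb7a9e5bd6f04f7e15070f630bdf1ea502841 HEAD \
      && echo "fix present" || echo "fix missing"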

On 2015-10-21 10:11, Heinz Mauelshagen wrote:
>
> Neil,
>
> this looks like an incarnation of the md bitmap flaw (the one with the bogus
> slot number) leading to the false bitmap header page index.
>
>
> Austin,
> this is the respective upstream commit you need to fix your problem:
>
> commit da6fb7a9e5bd6f04f7e15070f630bdf1ea502841
> Author: NeilBrown <neilb@suse.com>
> Date:   Thu Oct 1 16:03:38 2015 +1000
>
>      md/bitmap: don't pass -1 to bitmap_storage_alloc.
>
>      Passing -1 to bitmap_storage_alloc() causes page->index to be set to
>      -1, which is quite problematic.
>
>      So only pass ->cluster_slot if mddev_is_clustered().
>
>      Fixes: b97e92574c0b ("Use separate bitmaps for each nodes in the cluster")
>      Cc: stable@vger.kernel.org (v4.1+)
>      Signed-off-by: NeilBrown <neilb@suse.com>
>
> diff --git a/drivers/md/bitmap.c b/drivers/md/bitmap.c
> index e51de52..48b5890 100644
> --- a/drivers/md/bitmap.c
> +++ b/drivers/md/bitmap.c
> @@ -1997,7 +1997,8 @@ int bitmap_resize(struct bitmap *bitmap, sector_t blocks,
>          if (bitmap->mddev->bitmap_info.offset || bitmap->mddev->bitmap_info.file)
>                  ret = bitmap_storage_alloc(&store, chunks,
>                                             !bitmap->mddev->bitmap_info.external,
> -                                           bitmap->cluster_slot);
> +                                           mddev_is_clustered(bitmap->mddev)
> +                                           ? bitmap->cluster_slot : 0);
>          if (ret)
>                  goto err;
>
>
> On 10/21/2015 03:39 AM, Neil Brown wrote:
>> Added dm-devel, which is probably the more appropriate list for dm
>> things.
>>
>> NeilBrown
>>
>> Austin S Hemmelgarn <ahferroin7@gmail.com> writes:
>>
>>> I think I've stumbled upon a bug in DM-RAID.  The primary symptom is that when
>>> creating a new DM-RAID based device (using either LVM or dmsetup) in a RAID1
>>> configuration, it very quickly claims one by one that all of the disks failed
>>> except the first, and goes degraded.  When this happens on a given system, the
>>> disks always 'fail' in the reverse of the order of the mirror numbers.  All of
>>> the other RAID profiles work just fine.  Curiously, it also only seems to
>>> happen for 'big' devices (I haven't been able to determine exactly what the
>>> minimum size is, but I see it 100% of the time with 32G devices, never with 16G
>>> ones, and only intermittently with 24G).
>>>
>>> Here's what I got from dmesg when creating a 32G LVM volume that exhibited
>>> this issue:
>>> [66318.401295] device-mapper: raid: Superblocks created for new array
>>> [66318.450452] md/raid1:mdX: active with 2 out of 2 mirrors
>>> [66318.450467] Choosing daemon_sleep default (5 sec)
>>> [66318.450482] created bitmap (32 pages) for device mdX
>>> [66318.450495] attempt to access beyond end of device
>>> [66318.450501] dm-91: rw=13329, want=0, limit=8192
>>> [66318.450506] md: super_written gets error=-5, uptodate=0
>>> [66318.450513] md/raid1:mdX: Disk failure on dm-92, disabling device.
>>>                 md/raid1:mdX: Operation continuing on 1 devices.
>>> [66318.459815] attempt to access beyond end of device
>>> [66318.459819] dm-89: rw=13329, want=0, limit=8192
>>> [66318.459822] md: super_written gets error=-5, uptodate=0
>>> [66318.492852] attempt to access beyond end of device
>>> [66318.492862] dm-89: rw=13329, want=0, limit=8192
>>> [66318.492868] md: super_written gets error=-5, uptodate=0
>>> [66318.627183] mdX: bitmap file is out of date, doing full recovery
>>> [66318.714107] mdX: bitmap initialized from disk: read 3 pages, set 65536 of 65536 bits
>>> [66318.782045] RAID1 conf printout:
>>> [66318.782054]  --- wd:1 rd:2
>>> [66318.782061]  disk 0, wo:0, o:1, dev:dm-90
>>> [66318.782068]  disk 1, wo:1, o:0, dev:dm-92
>>> [66318.836598] RAID1 conf printout:
>>> [66318.836607]  --- wd:1 rd:2
>>> [66318.836614]  disk 0, wo:0, o:1, dev:dm-90
>>>
>>> And here's output for a 24G LVM volume that didn't display the issue.
>>> [66343.407954] device-mapper: raid: Superblocks created for new array
>>> [66343.479065] md/raid1:mdX: active with 2 out of 2 mirrors
>>> [66343.479078] Choosing daemon_sleep default (5 sec)
>>> [66343.479101] created bitmap (24 pages) for device mdX
>>> [66343.629329] mdX: bitmap file is out of date, doing full recovery
>>> [66343.677374] mdX: bitmap initialized from disk: read 2 pages, set 49152 of 49152 bits
>>>
>>> I'm using a lightly patched version of 4.2.3
>>> (the source can be found at https://github.com/ferroin/linux)
>>> but none of the patches I'm using come anywhere near anything in the block layer,
>>> let alone the DM/MD code.
>>>
>>> I've attempted to bisect this, although it got kind of complicated.  So far I've
>>> determined that the first commit that I see this issue on is d3b178a: md: Skip cluster setup for dm-raid
>>> Prior to that commit, I can't initialize any dm-raid devices due to the bug it fixes.
>>> I have not tested anything prior to d51e4fe (the merge commit that pulled in the md-cluster code),
>>> but I do distinctly remember that I did not see this issue in 3.19.
>>>
>>> I'll be happy to provide more info if needed.
>>>
>>>
>>> --
>>> dm-devel mailing list
>>> dm-devel@redhat.com
>>> https://www.redhat.com/mailman/listinfo/dm-devel



