* compute_blocknr: map not correct error during RAID6 reshape 6 -> 7 disks, mdadm 3.1.2 / kernel 2.6.34-rc3
From: Brett King @ 2010-04-16  1:56 UTC (permalink / raw)
  To: linux-raid

Hi All,
I'm currently encountering an error when growing a 6-disk RAID6 array
(2TB disks) to 7 disks. The reshape stalls, with many
"compute_blocknr: map not correct" errors in the system log.

array:~ # mdadm -V
mdadm - v3.1.2 - 10th March 2010
array:~ # uname -a
Linux array 2.6.34-rc3-11-default #1 SMP 2010-04-09 18:24:53 +0200 x86_64 x86_64 x86_64 GNU/Linux
array:~ # cat /proc/mdstat
Personalities : [raid6] [raid5] [raid4]
md2000 : active raid6 sdl[0] sda[6] sdq[5] sdp[4] sdo[3] sdn[2] sdm[1]
      7814057808 blocks super 1.1 level 6, 4k chunk, algorithm 18 [7/7] [UUUUUUU]
      [=================>...]  reshape = 87.9% (1717986916/1953514452) finish=15863.5min speed=247K/sec

unused devices: <none>
array:~ #

COMMAND:

array:~ # mdadm -A /dev/md2000 /dev/sda /dev/sd[l-q]
mdadm: /dev/md2000 has been started with 7 drives.
array:~ # cat /proc/mdstat
Personalities : [raid6] [raid5] [raid4]
md2000 : active raid6 sdl[0] sda[6] sdq[5] sdp[4] sdo[3] sdn[2] sdm[1]
      7814057808 blocks super 1.1 level 6, 4k chunk, algorithm 18 [7/7] [UUUUUUU]
      [=================>...]  reshape = 87.9% (1717808872/1953514452) finish=151.3min speed=25946K/sec

unused devices: <none>
array:~ # cat /proc/mdstat
Personalities : [raid6] [raid5] [raid4]
md2000 : active raid6 sdl[0] sda[6] sdq[5] sdp[4] sdo[3] sdn[2] sdm[1]
      7814057808 blocks super 1.1 level 6, 4k chunk, algorithm 18 [7/7] [UUUUUUU]
      [=================>...]  reshape = 87.9% (1717986916/1953514452) finish=111.4min speed=35228K/sec

unused devices: <none>
array:~ #

As you can see, the reshape jumps back a few blocks and then gets
stuck, throwing many "compute_blocknr: map not correct" errors in
syslog.

SYSLOG:

Apr 15 23:10:11 array kernel: [  765.216458] md: md2000 stopped.
Apr 15 23:10:11 array kernel: [  765.261491] md: bind<sdm>
Apr 15 23:10:11 array kernel: [  765.261679] md: bind<sdn>
Apr 15 23:10:11 array kernel: [  765.261864] md: bind<sdo>
Apr 15 23:10:11 array kernel: [  765.262002] md: bind<sdp>
Apr 15 23:10:11 array kernel: [  765.262136] md: bind<sdq>
Apr 15 23:10:11 array kernel: [  765.273414] md: bind<sdl>
Apr 15 23:10:11 array kernel: [  765.280031] async_tx: api initialized (async)
Apr 15 23:10:11 array kernel: [  765.283014] xor: automatically using best checksumming function: generic_sse
Apr 15 23:10:11 array kernel: [  765.300671]    generic_sse:  6006.000 MB/sec
Apr 15 23:10:11 array kernel: [  765.300676] xor: using function: generic_sse (6006.000 MB/sec)
Apr 15 23:10:11 array kernel: [  765.376648] raid6: int64x1   1466 MB/s
Apr 15 23:10:11 array kernel: [  765.444542] raid6: int64x2   1815 MB/s
Apr 15 23:10:11 array kernel: [  765.512417] raid6: int64x4   1262 MB/s
Apr 15 23:10:12 array kernel: [  765.580300] raid6: int64x8   1393 MB/s
Apr 15 23:10:12 array kernel: [  765.648185] raid6: sse2x1    3960 MB/s
Apr 15 23:10:12 array kernel: [  765.716074] raid6: sse2x2    4649 MB/s
Apr 15 23:10:12 array kernel: [  765.783954] raid6: sse2x4    5007 MB/s
Apr 15 23:10:12 array kernel: [  765.783959] raid6: using algorithm sse2x4 (5007 MB/s)
Apr 15 23:10:12 array kernel: [  765.800602] md: raid6 personality registered for level 6
Apr 15 23:10:12 array kernel: [  765.800611] md: raid5 personality registered for level 5
Apr 15 23:10:12 array kernel: [  765.800617] md: raid4 personality registered for level 4
Apr 15 23:10:12 array kernel: [  765.805135] raid5: reshape will continue
Apr 15 23:10:12 array kernel: [  765.805153] raid5: device sdl operational as raid disk 0
Apr 15 23:10:12 array kernel: [  765.805158] raid5: device sdq operational as raid disk 5
Apr 15 23:10:12 array kernel: [  765.805161] raid5: device sdp operational as raid disk 4
Apr 15 23:10:12 array kernel: [  765.805165] raid5: device sdo operational as raid disk 3
Apr 15 23:10:12 array kernel: [  765.805169] raid5: device sdn operational as raid disk 2
Apr 15 23:10:12 array kernel: [  765.805172] raid5: device sdm operational as raid disk 1
Apr 15 23:10:12 array kernel: [  765.806332] raid5: allocated 7438kB for md2000
Apr 15 23:10:12 array kernel: [  765.806457] 0: w=1 pa=18 pr=6 m=2 a=18 r=7 op1=0 op2=0
Apr 15 23:10:12 array kernel: [  765.806463] 5: w=2 pa=18 pr=6 m=2 a=18 r=7 op1=1 op2=0
Apr 15 23:10:12 array kernel: [  765.806468] 4: w=3 pa=18 pr=6 m=2 a=18 r=7 op1=0 op2=0
Apr 15 23:10:12 array kernel: [  765.806472] 3: w=4 pa=18 pr=6 m=2 a=18 r=7 op1=0 op2=0
Apr 15 23:10:12 array kernel: [  765.806477] 2: w=5 pa=18 pr=6 m=2 a=18 r=7 op1=0 op2=0
Apr 15 23:10:12 array kernel: [  765.806481] 1: w=6 pa=18 pr=6 m=2 a=18 r=7 op1=0 op2=0
Apr 15 23:10:12 array kernel: [  765.806485] raid5: raid level 6 set md2000 active with 6 out of 7 devices, algorithm 18
Apr 15 23:10:12 array kernel: [  765.806490] RAID5 conf printout:
Apr 15 23:10:12 array kernel: [  765.806493]  --- rd:7 wd:6
Apr 15 23:10:12 array kernel: [  765.806496]  disk 0, o:1, dev:sdl
Apr 15 23:10:12 array kernel: [  765.806499]  disk 1, o:1, dev:sdm
Apr 15 23:10:12 array kernel: [  765.806502]  disk 2, o:1, dev:sdn
Apr 15 23:10:12 array kernel: [  765.806505]  disk 3, o:1, dev:sdo
Apr 15 23:10:12 array kernel: [  765.806508]  disk 4, o:1, dev:sdp
Apr 15 23:10:12 array kernel: [  765.806511]  disk 5, o:1, dev:sdq
Apr 15 23:10:12 array kernel: [  765.806513] ...ok start reshape thread
Apr 15 23:10:12 array kernel: [  765.806595] md2000: detected capacity change from 0 to 8001595195392
Apr 15 23:10:12 array kernel: [  765.806603] md: reshape of RAID array md2000
Apr 15 23:10:12 array kernel: [  765.806610] md: minimum _guaranteed_ speed: 1000 KB/sec/disk.
Apr 15 23:10:12 array kernel: [  765.806615] md: using maximum available idle IO bandwidth (but not more than 200000 KB/sec) for reshape.
Apr 15 23:10:12 array kernel: [  765.806632] md: using 128k window, over a total of 1953514452 blocks.
Apr 15 23:10:13 array kernel: [  766.600756]  md2000: unknown partition table
Apr 15 23:10:20 array kernel: [  774.298298] compute_blocknr: map not correct
Apr 15 23:10:20 array kernel: [  774.298306] compute_blocknr: map not correct
Apr 15 23:10:20 array kernel: [  774.298311] compute_blocknr: map not correct
Apr 15 23:10:20 array kernel: [  774.298315] compute_blocknr: map not correct
Apr 15 23:10:20 array kernel: [  774.298322] compute_blocknr: map not correct
Apr 15 23:10:20 array kernel: [  774.298326] compute_blocknr: map not correct
Apr 15 23:10:20 array kernel: [  774.298329] compute_blocknr: map not correct
Apr 15 23:10:20 array kernel: [  774.298332] compute_blocknr: map not correct
Apr 15 23:10:20 array kernel: [  774.298336] compute_blocknr: map not correct

Any commands relating to the array hang after this, and the system
needs a hard reset to recover.

I found some earlier reports of this error message from around 2004,
when the kernel still required LBD (large block device) support to be
explicitly enabled. That has long been the default on x86_64, so I'm
thinking this reshape is hitting another limit above the 2^30 mark.
The strange thing is that I've previously grown larger RAID6 arrays
(e.g. 13TB) built from smaller 1TB disks without issue, on earlier
kernels (e.g. 2.6.27) and mdadm versions (e.g. 3.0.2). The trigger now
seems to be the larger 2TB disks: growing this array from 4 disks to
5, then 5 to 6, plus adding a Q disk along the way, was all fine.
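
As a back-of-the-envelope check (my own arithmetic - assuming the 4k
chunk and the 5 data disks of the target 7-disk RAID6 layout), the
stall point sits almost exactly where a 32-bit chunk number would
overflow:

  #include <stdio.h>
  #include <stdint.h>

  int main(void)
  {
      /* Figures from /proc/mdstat above. */
      const uint64_t stall_kb          = 1717986916;  /* per-device reshape position (KB) */
      const uint64_t sectors_per_chunk = 4096 / 512;  /* 4k chunk = 8 sectors */
      const uint64_t data_disks        = 5;           /* assumed: 7-disk RAID6 = 5 data disks */

      uint64_t stripe       = (stall_kb * 2) / sectors_per_chunk; /* KB -> sectors -> stripes */
      uint64_t chunk_number = stripe * data_disks;

      printf("stripe       = %llu\n", (unsigned long long)stripe);       /* 429496729 */
      printf("chunk_number = %llu\n", (unsigned long long)chunk_number); /* 2147483645 */
      printf("2^31         = %llu\n", 1ULL << 31);                       /* 2147483648 */
      return 0;
  }

chunk_number lands just 3 short of 2^31, so the very next stripes
would wrap a 32-bit value - consistent with a size limit rather than
anything wrong with the disks themselves.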

I've also tried adjusting stripe_cache_size, as suggested for a
similar issue reported on this list, but the reshape doesn't budge. Am
I correct in expecting the reshape to continue automatically as soon
as this value is modified?
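
For reference, I've been setting it through sysfs (the 8192 below is
just one of the values I tried, not a recommendation):

array:~ # echo 8192 > /sys/block/md2000/md/stripe_cache_size

The write is accepted, but the reshape position in /proc/mdstat never
moves.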

I'm open to trying any commands, patches, debugging etc. that may get
the reshape moving again. This is one of several arrays in a ~20TB LVM
volume group, and all of the data is inaccessible until this is
resolved!

Thanks in advance everyone.


* Re: compute_blocknr: map not correct error during RAID6 reshape 6 -> 7 disks, mdadm 3.1.2 / kernel 2.6.34-rc3
From: Brett King @ 2010-04-20  3:43 UTC (permalink / raw)
  To: linux-raid

Hello,
It's been quiet on this issue to date - I assume everyone is busy with
important fixes and features, so my apologies for the annoyance. At
this point, though, I'm looking for some collective guidance on my
next move, as I can't leave things in this state.

To summarize: I am attempting to grow a RAID6 array from 6 to 7x 2TB
disks, but the reshape stalls at 87.9%, throwing 'compute_blocknr:
map not correct' errors. The system still responds, but needs a hard
reset before anything MD-related will work again.

I've tried various combinations of kernel and mdadm versions, all with
the same result: kernels 2.6.31, 2.6.33, 2.6.34-rc3 and -rc4, each
with mdadm 2.6.8, 3.0.2 and 3.1.2.

The only kernel source file containing this string is
drivers/md/raid5.c, so at least I'm looking in the right place.
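
For reference, that's from a quick grep over my local checkout of the
tree (the path is just where I unpacked it):

array:~/linux # grep -rl "compute_blocknr: map not correct" drivers/md/
drivers/md/raid5.c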

So I'm asking you, fellow linux-raid'ers: what can be done in this
situation? I've filed a bug on bugzilla, and would now like to know
whether others have hit this issue, whether a fix is in the pipeline,
and any tips or tricks that might get the reshape completed and the
data accessible again.

The only other non-destructive option I can see right now is to pull
the disks, restore from backup onto a new array, and wait for a future
kernel / mdadm version that allows the reshape to complete.

Cheers,
Brett.


* Re: compute_blocknr: map not correct error during RAID6 reshape 6 -> 7 disks, mdadm 3.1.2 / kernel 2.6.34-rc3
From: Neil Brown @ 2010-04-20  4:14 UTC (permalink / raw)
  To: Brett King; +Cc: linux-raid

On Tue, 20 Apr 2010 13:43:08 +1000
Brett King <king.br@gmail.com> wrote:

> Hello,
> It's been quiet on this issue to date - I assume everyone is busy
> with important fixes and features, so my apologies for the annoyance.
> At this point, though, I'm looking for some collective guidance on my
> next move, as I can't leave things in this state.
> 
> To summarize: I am attempting to grow a RAID6 array from 6 to 7x 2TB
> disks, but the reshape stalls at 87.9%, throwing 'compute_blocknr:
> map not correct' errors. The system still responds, but needs a hard
> reset before anything MD-related will work again.

Thanks for the reminder.

Please try the following patch.

NeilBrown

From 45b14940d3fbf1891d5c2f99f334cc4f7d9e36d3 Mon Sep 17 00:00:00 2001
From: NeilBrown <neilb@suse.de>
Date: Tue, 20 Apr 2010 14:13:34 +1000
Subject: [PATCH] md/raid5: allow for more than 2^31 chunks.

With many large drives and small chunk sizes it is possible
to create a RAID5 with more than 2^31 chunks.  Make sure this
works.

Reported-by: Brett King <king.br@gmail.com>
Signed-off-by: NeilBrown <neilb@suse.de>

diff --git a/drivers/md/raid5.c b/drivers/md/raid5.c
index e3e9a36..20e4840 100644
--- a/drivers/md/raid5.c
+++ b/drivers/md/raid5.c
@@ -1650,8 +1650,8 @@ static sector_t raid5_compute_sector(raid5_conf_t *conf, sector_t r_sector,
 				     int previous, int *dd_idx,
 				     struct stripe_head *sh)
 {
-	long stripe;
-	unsigned long chunk_number;
+	sector_t stripe;
+	sector_t chunk_number;
 	unsigned int chunk_offset;
 	int pd_idx, qd_idx;
 	int ddf_layout = 0;
@@ -1671,17 +1671,12 @@ static sector_t raid5_compute_sector(raid5_conf_t *conf, sector_t r_sector,
 	 */
 	chunk_offset = sector_div(r_sector, sectors_per_chunk);
 	chunk_number = r_sector;
-	BUG_ON(r_sector != chunk_number);
 
 	/*
 	 * Compute the stripe number
 	 */
-	stripe = chunk_number / data_disks;
-
-	/*
-	 * Compute the data disk and parity disk indexes inside the stripe
-	 */
-	*dd_idx = chunk_number % data_disks;
+	stripe = chunk_number;
+	*dd_idx = sector_div(stripe, data_disks);
 
 	/*
 	 * Select the parity disk based on the user selected algorithm.
@@ -1870,14 +1865,14 @@ static sector_t compute_blocknr(struct stripe_head *sh, int i, int previous)
 				 : conf->algorithm;
 	sector_t stripe;
 	int chunk_offset;
-	int chunk_number, dummy1, dd_idx = i;
+	sector_t chunk_number;
+	int dummy1, dd_idx = i;
 	sector_t r_sector;
 	struct stripe_head sh2;
 
 
 	chunk_offset = sector_div(new_sector, sectors_per_chunk);
 	stripe = new_sector;
-	BUG_ON(new_sector != stripe);
 
 	if (i == sh->pd_idx)
 		return 0;
@@ -1970,7 +1965,7 @@ static sector_t compute_blocknr(struct stripe_head *sh, int i, int previous)
 	}
 
 	chunk_number = stripe * data_disks + i;
-	r_sector = (sector_t)chunk_number * sectors_per_chunk + chunk_offset;
+	r_sector = chunk_number * sectors_per_chunk + chunk_offset;
 
 	check = raid5_compute_sector(conf, r_sector,
 				     previous, &dummy1, &sh2);
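
The gist of the change: chunk_number and stripe are widened to the
64-bit sector_t, the two BUG_ON()s that guarded against the old,
narrower variables truncating the sector number are dropped, and the
plain / and % pair becomes sector_div(), since in-kernel code can't
use ordinary 64-bit division on 32-bit hosts. Roughly, sector_div()
behaves like this sketch (semantics only - the real thing is an
arch-specific macro that modifies its first argument in place):

  /* sketch: divide the 64-bit n by base, return the remainder */
  static inline u32 sector_div_sketch(sector_t *n, u32 base)
  {
      u32 rem = (u32)(*n % base);  /* remainder handed back */
      *n /= base;                  /* quotient left in n */
      return rem;
  }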


* Re: compute_blocknr: map not correct error during RAID6 reshape 6 -> 7 disks, mdadm 3.1.2 / kernel 2.6.34-rc3
From: Brett King @ 2010-04-22  3:43 UTC (permalink / raw)
  To: linux-raid

Hi All,
Just a quick update to advise that the patch below worked - the array
finished its reshape and the filesystem mounted without error.

-    int chunk_number, dummy1, dd_idx = i;
+    int dummy1, dd_idx = i;
+    sector_t chunk_number;

I'll rebuild the array with a larger chunk size to avoid similar
issues in the future. Thanks, everyone!
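
For anyone curious, my rough arithmetic behind that choice:

  5 data disks x 1953514452 KB  = 9767572260 KB of data
  9767572260 KB / 4 KB chunks   = ~2.44 billion chunks  (> 2^31 = ~2.15 billion)
  9767572260 KB / 512 KB chunks = ~19 million chunks

A bigger chunk keeps the chunk count well clear of 32-bit territory on
kernels without the fix.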

