* RAID6 rebuild oddity
From: Brad Campbell @ 2017-03-24  7:44 UTC (permalink / raw)
  To: Linux-RAID

I'm in the process of setting up a new little array: 8 x 6TB drives in a 
RAID6. While I have the luxury of a long burn-in period, I've been 
beating it up and have seen some odd performance anomalies.

I have one in front of me now, so I thought I'd lay out the data and see 
if anyone has any ideas as to what might be going on.

Here's the current state. I did this by removing and adding /dev/sdb 
without a write intent bitmap to deliberately cause a rebuild.

/dev/md0:
         Version : 1.2
   Creation Time : Wed Mar 22 14:01:41 2017
      Raid Level : raid6
      Array Size : 35162348160 (33533.43 GiB 36006.24 GB)
   Used Dev Size : 5860391360 (5588.90 GiB 6001.04 GB)
    Raid Devices : 8
   Total Devices : 8
     Persistence : Superblock is persistent

     Update Time : Fri Mar 24 15:34:28 2017
           State : clean, degraded, recovering
  Active Devices : 7
Working Devices : 8
  Failed Devices : 0
   Spare Devices : 1

          Layout : left-symmetric
      Chunk Size : 64K

  Rebuild Status : 0% complete

            Name : test:0  (local to host test)
            UUID : 93a09ba7:f159e9f5:7c478f16:6ca8858e
          Events : 394

     Number   Major   Minor   RaidDevice State
        8       8       16        0      spare rebuilding   /dev/sdb
        1       8       32        1      active sync   /dev/sdc
        2       8       48        2      active sync   /dev/sdd
        3       8       64        3      active sync   /dev/sde
        4       8       80        4      active sync   /dev/sdf
        5       8       96        5      active sync   /dev/sdg
        6       8      128        6      active sync   /dev/sdi
        7       8      144        7      active sync   /dev/sdj


Here's the iostat output (hope it doesn't wrap).
avg-cpu:  %user   %nice %system %iowait  %steal   %idle
            0.05    0.00   10.42    7.85    0.00   81.68

Device:         rrqm/s   wrqm/s     r/s     w/s    rkB/s    wkB/s avgrq-sz avgqu-sz   await r_await w_await  svctm  %util
sda               0.00     1.60    0.00    1.40     0.00     8.80    12.57     0.02   12.86    0.00   12.86  12.86   1.80
sdb               0.00 18835.60    0.00  657.80     0.00 90082.40   273.89     3.72    4.71    0.00    4.71   0.85  55.80
sdc           20685.80     0.00  244.20    0.00 87659.20     0.00   717.93     8.65   34.10   34.10    0.00   2.15  52.40
sdd           20664.60     0.00  244.60    0.00 87652.00     0.00   716.70     8.72   34.28   34.28    0.00   2.19  53.60
sde           20647.80     0.00  240.40    0.00 87556.80     0.00   728.43     9.13   36.54   36.54    0.00   2.30  55.40
sdf           20622.40     0.00  242.40    0.00 87556.80     0.00   722.42     8.73   34.60   34.60    0.00   2.20  53.40
sdg           20596.00     0.00  239.20    0.00 87556.80     0.00   732.08     9.32   37.54   37.54    0.00   2.37  56.60
sdh               0.00     1.60    0.00    1.40     0.00     8.80    12.57     0.01    7.14    0.00    7.14   7.14   1.00
sdi           20575.80     0.00  238.20    0.00 86999.20     0.00   730.47     8.53   34.06   34.06    0.00   2.20  52.40
sdj           22860.80     0.00  475.80    0.00 101773.60     0.00   427.80   245.09  513.25  513.25    0.00   2.10 100.00
md1               0.00     0.00    0.00    0.00     0.00     0.00     0.00     0.00    0.00    0.00    0.00   0.00   0.00
md2               0.00     0.00    0.00    2.00     0.00     8.00     8.00     0.00    0.00    0.00    0.00   0.00   0.00
md0               0.00     0.00    0.00    0.00     0.00     0.00     0.00     0.00    0.00    0.00    0.00   0.00   0.00

The long and short of it is that /dev/sdj is the last drive in the array, 
and it is getting hit with a completely different read pattern to the other 
drives, causing it to bottleneck the rebuild process.

I *thought* the rebuild process was "read one stripe, calculate the 
missing bit and write it out to the drive being rebuilt".

I've seen this behaviour now a number of times, but this is the first 
time I've been able to reliably reproduce it. Of course it takes about 
20 hours to complete the rebuild, so it's a slow diagnostic process.

I've set the stripe cache size to 8192. Didn't make a dent.
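
That was just the usual md sysfs knob, i.e. something like:

echo 8192 > /sys/block/md0/md/stripe_cache_size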

The bottleneck drive seems to change depending on the load. I've seen it 
happen simply dd'ing the array to /dev/null where the transfer rate 
slows to < 150MB/s. Stop and restart the transfer and it's back up to 
500MB/s.
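
(That test is nothing more elaborate than, say:

dd if=/dev/md0 of=/dev/null bs=1M

with the block size being an arbitrary choice.)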

I've reproduced this on kernel 4.6.4 & 4.10.5, so I'm not sure what is 
going on at the moment. There is obviously a sub-optimal read pattern 
being fed to sdj. I had a look at it with blktrace, but went cross-eyed 
trying to figure out what was going on.

The drives are all on individual lanes on a SAS controller, are using 
the deadline scheduler, and I can get about 160MB/s sustained from all 
drives simultaneously using dd.
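
Nothing exotic there either; roughly:

for d in sdb sdc sdd sde sdf sdg sdi sdj; do
        echo deadline > /sys/block/$d/queue/scheduler
done
for d in sdb sdc sdd sde sdf sdg sdi sdj; do
        dd if=/dev/$d of=/dev/null bs=1M count=8192 &
done; wait

with the dd size here being purely illustrative.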

It's not important, but I thought since I was seeing it and I have a 
month or so of extra time with this array before it needs to do useful 
work, I'd ask.

Regards,
Brad


* Re: RAID6 rebuild oddity
From: NeilBrown @ 2017-03-29  4:08 UTC (permalink / raw)
  To: Brad Campbell, Linux-RAID


On Fri, Mar 24 2017, Brad Campbell wrote:

> I'm in the process of setting up a new little array. 8 x 6TB drives in a 
> RAID6. While I have the luxury of a long burn in period I've been 
> beating it up and have seen some odd performance anomalies.
>
> I have one in front of me now, so I thought I'd lay out the data and see 
> if anyone has any ideas as to what might be going on.
>
> Here's the current state. I did this by removing and adding /dev/sdb 
> without a write intent bitmap to deliberately cause a rebuild.
>
> /dev/md0:
>          Version : 1.2
>    Creation Time : Wed Mar 22 14:01:41 2017
>       Raid Level : raid6
>       Array Size : 35162348160 (33533.43 GiB 36006.24 GB)
>    Used Dev Size : 5860391360 (5588.90 GiB 6001.04 GB)
>     Raid Devices : 8
>    Total Devices : 8
>      Persistence : Superblock is persistent
>
>      Update Time : Fri Mar 24 15:34:28 2017
>            State : clean, degraded, recovering
>   Active Devices : 7
> Working Devices : 8
>   Failed Devices : 0
>    Spare Devices : 1
>
>           Layout : left-symmetric
>       Chunk Size : 64K
>
>   Rebuild Status : 0% complete
>
>             Name : test:0  (local to host test)
>             UUID : 93a09ba7:f159e9f5:7c478f16:6ca8858e
>           Events : 394
>
>      Number   Major   Minor   RaidDevice State
>         8       8       16        0      spare rebuilding   /dev/sdb
>         1       8       32        1      active sync   /dev/sdc
>         2       8       48        2      active sync   /dev/sdd
>         3       8       64        3      active sync   /dev/sde
>         4       8       80        4      active sync   /dev/sdf
>         5       8       96        5      active sync   /dev/sdg
>         6       8      128        6      active sync   /dev/sdi
>         7       8      144        7      active sync   /dev/sdj
>
>
> Here's the iostat output (hope it doesn't wrap).
> avg-cpu:  %user   %nice %system %iowait  %steal   %idle
>             0.05    0.00   10.42    7.85    0.00   81.68
>
> Device:         rrqm/s   wrqm/s     r/s     w/s    rkB/s    wkB/s 
> avgrq-sz avgqu-sz   await r_await w_await  svctm  %util
> sda               0.00     1.60    0.00    1.40     0.00     8.80 
> 12.57     0.02   12.86    0.00   12.86  12.86   1.80
> sdb               0.00 18835.60    0.00  657.80     0.00 90082.40 
> 273.89     3.72    4.71    0.00    4.71   0.85  55.80
> sdc           20685.80     0.00  244.20    0.00 87659.20     0.00 
> 717.93     8.65   34.10   34.10    0.00   2.15  52.40
> sdd           20664.60     0.00  244.60    0.00 87652.00     0.00 
> 716.70     8.72   34.28   34.28    0.00   2.19  53.60
> sde           20647.80     0.00  240.40    0.00 87556.80     0.00 
> 728.43     9.13   36.54   36.54    0.00   2.30  55.40
> sdf           20622.40     0.00  242.40    0.00 87556.80     0.00 
> 722.42     8.73   34.60   34.60    0.00   2.20  53.40
> sdg           20596.00     0.00  239.20    0.00 87556.80     0.00 
> 732.08     9.32   37.54   37.54    0.00   2.37  56.60
> sdh               0.00     1.60    0.00    1.40     0.00     8.80 
> 12.57     0.01    7.14    0.00    7.14   7.14   1.00
> sdi           20575.80     0.00  238.20    0.00 86999.20     0.00 
> 730.47     8.53   34.06   34.06    0.00   2.20  52.40
> sdj           22860.80     0.00  475.80    0.00 101773.60     0.00 
> 427.80   245.09  513.25  513.25    0.00   2.10 100.00
> md1               0.00     0.00    0.00    0.00     0.00     0.00 
> 0.00     0.00    0.00    0.00    0.00   0.00   0.00
> md2               0.00     0.00    0.00    2.00     0.00     8.00 
> 8.00     0.00    0.00    0.00    0.00   0.00   0.00
> md0               0.00     0.00    0.00    0.00     0.00     0.00 
> 0.00     0.00    0.00    0.00    0.00   0.00   0.00
>
> The long and short is /dev/sdj is the last drive in the array, and is 
> getting hit with a completely different read pattern to the other 
> drives, causing it to bottleneck the rebuild process.

sdj is getting twice the utilization of the others but only about 10%
more rKB/sec.  That suggests lots of seeking.

Does "fuser /dev/sdj" report anything funny?

Is there filesystem IO happening? If so, what filesystem?
Have you told the filesystem about the RAID layout?
Maybe the filesystem keeps reading some index blocks that are always on
the same drive.

>
> I *thought* the rebuild process was "read one stripe, calculate the 
> missing bit and write it out to the drive being rebuilt".

Yep, that is what it does.


>
> I've seen this behaviour now a number of times, but this is the first 
> time I've been able to reliably reproduce it. Of course it takes about 
> 20 hours to complete the rebuild, so it's a slow diagnostic process.
>
> I've set the stripe cache size to 8192. Didn't make a dent.
>
> The bottleneck drive seems to change depending on the load. I've seen it 
> happen simply dd'ing the array to /dev/null where the transfer rate 
> slows to < 150MB/s. Stop and restart the transfer and it's back up to 
> 500MB/s.

So the problem moves from drive to drive?  Strongly suggests filesystem
metadata access to me.

>
> I've reproduced this on kernel 4.6.4 & 4.10.5, so I'm not sure what is 
> going on at the moment. There is obviously a sub-optimal read pattern 
> getting fed to sdj. I had a look at it with blocktrace, but went cross 
> eyed trying to figure out what was going on.

If you can capture several seconds of trace on all drives plus the
array, compress it and host it somewhere, I can pick it up and have a
look.
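
Something along these lines would do (the duration and archive name are
only placeholders):

  blktrace -w 5 /dev/sd[bcdefgij] /dev/md0
  tar cjf blktrace-capture.tar.bz2 *.blktrace.*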

NeilBrown

>
> The drives are all on individual lanes on a SAS controller, are set with 
> the deadline scheduler and I can get about 160MB/s sustained from all 
> drives simultaneously using dd.
>
> It's not important, but I thought since I was seeing it and I have a 
> month or so of extra time with this array before it needs to do useful 
> work, I'd ask.
>
> Regards,
> Brad



* Re: RAID6 rebuild oddity
From: Brad Campbell @ 2017-03-29  8:12 UTC (permalink / raw)
  To: NeilBrown, Linux-RAID

On 29/03/17 12:08, NeilBrown wrote:

>
> sdj is getting twice the utilization of the others but only about 10%
> more rKB/sec.  That suggests lots of seeking.

Yes, something is not entirely sequential.

> Does "fuser /dev/sdj" report anything funny?

No. No output. As far as I can tell nothing should be touching the disks 
other than the md kernel thread.

> Is there filesystem IO happening? If so, what filesystem?
> Have you told the filesystem about the RAID layout?
> Maybe the filesystem keeps reading some index blocks that are always on
> the same drive.

No. I probably wasn't as clear as I should have been in the initial 
post. There was nothing mounted at the time.

Right now the array contains one large LUKS container (dm-crypt). This 
was mounted and a continuous dd done to the dm device to zero it out :

4111195+1 records in
4111195+1 records out
34487205507072 bytes (34 TB) copied, 57781.1 s, 597 MB/s

So there is no filesystem on the drive.

I failed and removed sdi :

root@test:~# mdadm --fail /dev/md0 /dev/sdi
mdadm: set /dev/sdi faulty in /dev/md0
root@test:~# mdadm --remove /dev/md0 /dev/sdi
mdadm: hot removed /dev/sdi from /dev/md0
root@test:~# mdadm --zero-superblock /dev/sdi
root@test:~# mdadm --add /dev/md0 /dev/sdi
mdadm: added /dev/sdi

Rebooted the machine to remove all tweaks of things like stripe cache 
size, readahead, NCQ and anything else.

I opened the LUKS container, dd'd a meg to the start to write to the 
array and kick off the resync, then closed the LUKS container. At this 
point dm should no longer be touching the drive and I've verified the 
device has gone.

I then ran sync a couple of times and waited a couple of minutes until I 
was positive _nothing_ was touching md0, then ran :

blktrace -w 5 /dev/sd[bcdefgij] /dev/md0
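
So the whole sequence was roughly (the dm mapping name here is just 
illustrative):

cryptsetup luksOpen /dev/md0 md0_crypt
dd if=/dev/zero of=/dev/mapper/md0_crypt bs=1M count=1
cryptsetup luksClose md0_crypt
sync; sync
blktrace -w 5 /dev/sd[bcdefgij] /dev/md0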

> So the problem moves from drive to drive?  Strongly suggests filesystem
> metadata access to me.

Again, sorry for not being clear. The bottleneck changes on a per-resync 
basis. For example, in the reproduction I've done now I popped out sdi 
rather than sdb, and the bottleneck is now sdg. It stays the same drive 
if the exact circumstances remain the same.

>
> If you can capture several seconds of trace on all drives plus the
> array, compress it and host it somewhere, I can pick it up and have
> look.

I've captured 5 seconds. I was overly optimistic initially and tried 20 
seconds, but that resulted in 100M bzipped.

This is 19M. The kernel is a clean 4.10.6 compiled up half an hour ago 
as I forgot to include block tracing in there.

http://www.fnarfbargle.com/private/170329-Resync/resync-blktrace.tar.bz2

root@test:~/bench# mdadm --detail /dev/md0
/dev/md0:
         Version : 1.2
   Creation Time : Wed Mar 22 14:01:41 2017
      Raid Level : raid6
      Array Size : 35162348160 (33533.43 GiB 36006.24 GB)
   Used Dev Size : 5860391360 (5588.90 GiB 6001.04 GB)
    Raid Devices : 8
   Total Devices : 8
     Persistence : Superblock is persistent

     Update Time : Wed Mar 29 16:08:20 2017
           State : clean, degraded, recovering
  Active Devices : 7
Working Devices : 8
  Failed Devices : 0
   Spare Devices : 1

          Layout : left-symmetric
      Chunk Size : 64K

  Rebuild Status : 2% complete

            Name : test:0  (local to host test)
            UUID : 93a09ba7:f159e9f5:7c478f16:6ca8858e
          Events : 715

     Number   Major   Minor   RaidDevice State
        8       8       16        0      active sync   /dev/sdb
        1       8       32        1      active sync   /dev/sdc
        2       8       48        2      active sync   /dev/sdd
        3       8       64        3      active sync   /dev/sde
        4       8       80        4      active sync   /dev/sdf
        5       8       96        5      active sync   /dev/sdg
        9       8      128        6      spare rebuilding   /dev/sdi
        7       8      144        7      active sync   /dev/sdj


avg-cpu:  %user   %nice %system %iowait  %steal   %idle
            0.08    0.00    4.77    4.75    0.00   90.41

Device:         rrqm/s   wrqm/s     r/s     w/s    rkB/s    wkB/s avgrq-sz avgqu-sz   await r_await w_await  svctm  %util
sda               0.00     1.40    0.00    3.00     0.00    10.60     7.07     0.01    4.67    0.00    4.67   4.67   1.40
sdb           14432.20     0.00  120.20    0.00 57295.20     0.00   953.33     2.04   16.71   16.71    0.00   3.23  38.80
sdc           14432.20     0.00  120.20    0.00 57295.20     0.00   953.33     1.99   16.26   16.26    0.00   3.21  38.60
sdd           14432.20     0.00  120.80    0.00 57602.40     0.00   953.68     2.14   17.55   17.55    0.00   3.18  38.40
sde           14432.20     0.00  120.80    0.00 57602.40     0.00   953.68     2.02   16.57   16.57    0.00   3.20  38.60
sdf           14432.20     0.00  120.80    0.00 57602.40     0.00   953.68     2.08   17.04   17.04    0.00   3.25  39.20
sdg           16135.40     0.00  224.60    0.00 65811.20     0.00   586.03   126.46  575.26  575.26    0.00   4.45 100.00
sdh               0.00     1.40    0.00    3.00     0.00    10.60     7.07     0.02    6.67    0.00    6.67   6.67   2.00
sdi               0.00 14982.60    0.00  123.20     0.00 59801.60   970.81     4.52   35.57    0.00   35.57   3.25  40.00
sdj           14432.40     0.00  120.80    0.00 57603.20     0.00   953.70     1.84   15.07   15.07    0.00   3.06  37.00
md1               0.00     0.00    0.00    0.00     0.00     0.00     0.00     0.00    0.00    0.00    0.00   0.00   0.00
md2               0.00     0.00    0.00    2.20     0.00     8.80     8.00     0.00    0.00    0.00    0.00   0.00   0.00
md0               0.00     0.00    0.00    0.00     0.00     0.00     0.00     0.00    0.00    0.00    0.00   0.00   0.00


Thanks for having a look. It seems odd to me, but I can't figure it out.

Regards,
Brad


* Re: RAID6 rebuild oddity
From: NeilBrown @ 2017-03-30  0:49 UTC (permalink / raw)
  To: Brad Campbell, Linux-RAID; +Cc: Dan Williams


On Wed, Mar 29 2017, Brad Campbell wrote:

> On 29/03/17 12:08, NeilBrown wrote:
>
>>
>> sdj is getting twice the utilization of the others but only about 10%
>> more rKB/sec.  That suggests lots of seeking.
>
> Yes, something is not entirely sequential.
>
>> Does "fuser /dev/sdj" report anything funny?
>
> No. No output. As far as I can tell nothing should be touching the disks 
> other than the md kernel thread.
>
>> Is there filesystem IO happening? If so, what filesystem?
>> Have you told the filesystem about the RAID layout?
>> Maybe the filesystem keeps reading some index blocks that are always on
>> the same drive.
>
> No. I probably wasn't as clear as I should have been in the initial 
> post. There was nothing mounted at the time.
>
> Right now the array contains one large LUKS container (dm-crypt). This 
> was mounted and a continuous dd done to the dm device to zero it out :
>
> 4111195+1 records in
> 4111195+1 records out
> 34487205507072 bytes (34 TB) copied, 57781.1 s, 597 MB/s
>
> So there is no filesystem on the drive.
>
> I failed and removed sdi :
>
> root@test:~# mdadm --fail /dev/md0 /dev/sdi
> mdadm: set /dev/sdi faulty in /dev/md0
> root@test:~# mdadm --remove /dev/md0 /dev/sdi
> mdadm: hot removed /dev/sdi from /dev/md0
> root@test:~# mdadm --zero-superblock /dev/sdi
> root@test:~# mdadm --add /dev/md0 /dev/sdi
> mdadm: added /dev/sdi
>
> Rebooted the machine to remove all tweaks of things like stripe cache 
> size, readahead, NCQ and anything else.
>
> I opened the LUKS container, dd'd a meg to the start to write to the 
> array and kick off the resync, then closed the LUKS container. At this 
> point dm should no longer be touching the drive and I've verified the 
> device has gone.
>
> I then ran sync a couple of times and waited a couple of minutes until I 
> was positive _nothing_ was touching md0, then ran :
>
> blktrace -w 5 /dev/sd[bcdefgij] /dev/md0
>
>> So the problem moves from drive to drive?  Strongly suggests filesystem
>> metadata access to me.
>
> Again, sorry for me not being clear. The situation changes on a resync 
> specific basis. For example the reproduction I've done now I popped out 
> sdi rather than sdb, and now the bottleneck is sdg. It is the same if 
> the exact circumstances remain the same.

Now *that* is interesting!

In the first example you gave, device 0 was rebuilding and device 7 was
slow.
Now device 6 is rebuilding and device 5 is slow.  Surely you see the
pattern?

Also, I previously observed that the slow device was getting 10% more
read traffic, but I was being imprecise.
In the first example it was 16.2%, in the second it is 14.5%.
These are close to 1/7.

There are 8 devices in the array, so there are 8 different layouts for
stripes (the parity block can be on 8 different devices).  8 is close to
7.

How recovery *should* work on RAID6 is that for each strip we read all
blocks except for the Q block (which isn't needed to rebuild a single
device) and the block which has failed (of course).  For the stripe
where the Q block has failed, we don't read the P block.  Just read all
the data and calculate Q from that.

So for every 8-device strip, we read from 6 devices.  5 Data plus parity
when data has failed, 6 data when P or Q has failed.
So across the 8 different strip layouts there are 48 reads
spread across 7 working devices....  Using lower-case for blocks that
aren't read, and assuming the first device is being recovered:

           d  D  D  D  D  D  P  q
           d  D  D  D  D  P  q  D
           d  D  D  D  P  q  D  D
           d  D  D  P  q  D  D  D
           d  D  P  q  D  D  D  D
           d  P  q  D  D  D  D  D
           p  q  D  D  D  D  D  D
           q  D  D  D  D  D  D  p
totals:    0  7  7  7  7  7  7  6
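
(Those totals can be sanity-checked with a quick sketch - assuming the
left-symmetric layout where Q sits immediately after P, and device 0
being the one recovered:

  awk 'BEGIN {
      n = 8; f = 0                          # 8 devices, device 0 rebuilding
      for (s = 0; s < n; s++) {             # one stripe per parity position
          p = n - 1 - (s % n)               # left-symmetric: P rotates backwards
          q = (p + 1) % n                   # Q immediately follows P
          for (d = 0; d < n; d++) {
              if (d == f) continue          # failed device: nothing to read
              if (d == q) continue          # Q is never needed to rebuild one device
              if (d == p && f == q) continue  # rebuilding Q: data alone is enough
              reads[d]++
          }
      }
      for (d = 0; d < n; d++) printf "%d ", reads[d]; print ""
  }'

which prints the same 0 7 7 7 7 7 7 6.)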

The device just "before" the missing device gets fewer reads,
because we don't need to read P from that device.
But we are seeing that it gets more reads.  The numbers suggest that
it gets 8 reads (1/7 more than the others).  But the numbers also
suggest that it gets lots of seeks.  Maybe after all the other devices
have been read for one stripe, md/raid6 decides it does need those
blocks and goes back to read them?  That would explain the seeking
and the increased number of reads.

Time to look at the blktrace...
Looking at sdj, each request is Queued 8 times (I wonder why) but the
requests are completely sequential.

  blkparse sdj* | awk '$6 == "Q" && $8 != prev {print $8, $8-prev ; prev=$8}'  | awk '$2 != 8'

This means the 'Q' block is being read.  Not what I expected, but not
too surprising.

Looking at sdg, which is the problem device:

  blkparse sdg* | awk '$6 == "Q" && $8 != prev {print $8-prev ; prev=$8}'  \
    | sort | uniq -c | sort -n | tail -60

There are lots of places we skip forward 904 sectors.  That is an 8
sector read and a 896 sector seek. 896 sectors = 448K or 7 chunks
(64K chunk size).
These 7-chunk seeks are separated by 16 8-sector reads (16 x 8 sectors =
64K, i.e. exactly one chunk read in 4K pieces).  So it seems to
be reading all the P (or maybe Q) blocks from a sequence of stripes.

There are also lots of large back/forward seeks. They range from -31328
to -71280 backward and 27584 to 66408  (with a couple of smaller ones).

If we ignore all reads to sdg that are the result of seeking backwards -
i.e. that are earlier than the largest offset seen so far:

  blkparse sdg* | awk '$6 == "Q" && $8 != prev {if ($8>max) {print $8; max=$8} ; prev=$8}' | awk '{print $1-prev; prev=$1}' | grep  -v 8

we see that (after an initial section) every block is being read.
So it seems that sdg is seeking backwards to read some blocks that it
has already read.

So what really happens is that all the working devices read all blocks
(although they don't really need Q), and sdg needs to seek backwards to
re-read 1/8 of all blocks, probably the P block.

So we should be seeing the rkB/s for the slow disk 1/8 more than the
others.  i.e. 12.5%, not 14% or 16%.  The difference bothers me, but not
very much.

Now to look at the code.

need_this_block() always returns 1 when s->syncing, so that is why
we already read the Q block.  The distinction between recovery and
resync isn't very strong in raid5/6.

So on the stripes where Q has failed, we read all the Data and P.
Then handle_parity_checks6() notices that there are enough devices that
something is worth checking, so it checks the P.  This is destructive
of P!  The D blocks are xored into P and P is checked for all-zero.

I always get lost following this next bit....
I think that after deciding zero_sum_result is zero, check_state is set
to check_state_compute_result.
But before that is processed, we go around the loop again and notice
that P is missing (because handle_parity_checks6() cleared R5_UPTODATE)
so need_this_block() decides that we need to read it back in again.
We don't calculate it because we only calculate blocks on failed
devices (I'm sure we didn't use to do that).
So we read P back in ... just as we see from the trace.

Patch below makes the symptom go away in my testing, which confirms
that I'm on the right track.

Dan Williams changed the code to only recompute blocks from failed
devices, way back in 2008.
Commit: c337869d9501 ("md: do not compute parity unless it is on a failed drive")

Back then we didn't have raid6 in the same code, so the fix was probably
fine.

I wonder what the best fix is.... Dan... Any thoughts?

NeilBrown

This works for simple tests but might not be correct.

diff --git a/drivers/md/raid5.c b/drivers/md/raid5.c
index c523fd69a7bc..2eb45d57226c 100644
--- a/drivers/md/raid5.c
+++ b/drivers/md/raid5.c
@@ -3618,8 +3618,9 @@ static int fetch_block(struct stripe_head *sh, struct stripe_head_state *s,
 		BUG_ON(test_bit(R5_Wantread, &dev->flags));
 		BUG_ON(sh->batch_head);
 		if ((s->uptodate == disks - 1) &&
+		    ((sh->qd_idx >= 0 && sh->pd_idx == disk_idx) ||
 		    (s->failed && (disk_idx == s->failed_num[0] ||
-				   disk_idx == s->failed_num[1]))) {
+				   disk_idx == s->failed_num[1])))) {
 			/* have disk failed, and we're requested to fetch it;
 			 * do compute it
 			 */



* Re: RAID6 rebuild oddity
From: Brad Campbell @ 2017-03-30  1:22 UTC (permalink / raw)
  To: NeilBrown, Linux-RAID; +Cc: Dan Williams

> NeilBrown
>
> This works for simple tests but might not be correct.
>
> diff --git a/drivers/md/raid5.c b/drivers/md/raid5.c
> index c523fd69a7bc..2eb45d57226c 100644
> --- a/drivers/md/raid5.c
> +++ b/drivers/md/raid5.c
> @@ -3618,8 +3618,9 @@ static int fetch_block(struct stripe_head *sh, struct stripe_head_state *s,
>   		BUG_ON(test_bit(R5_Wantread, &dev->flags));
>   		BUG_ON(sh->batch_head);
>   		if ((s->uptodate == disks - 1) &&
> +		    ((sh->qd_idx >= 0 && sh->pd_idx == disk_idx) ||
>   		    (s->failed && (disk_idx == s->failed_num[0] ||
> -				   disk_idx == s->failed_num[1]))) {
> +				   disk_idx == s->failed_num[1])))) {
>   			/* have disk failed, and we're requested to fetch it;
>   			 * do compute it
>   			 */

G'day Neil,

Thanks for the in-depth analysis. I managed to follow most of it and 
when I get some time I'll get my head into the code and see if I can 
*really* follow along.

As I'm not particularly fussed about the integrity of the system in its 
current state, I patched the kernel, re-booted and kicked off the resync 
again.


Before patch :

brad@test:~$ cat /proc/mdstat
Personalities : [linear] [raid0] [raid1] [raid10] [raid6] [raid5] [raid4]
md0 : active raid6 sdb[8] sdj[7] sdi[9] sdg[5] sdf[4] sde[3] sdd[2] sdc[1]
       35162348160 blocks super 1.2 level 6, 64k chunk, algorithm 2 [8/7] [UUUUUU_U]
       [==============>......]  recovery = 72.0% (4222847296/5860391360) finish=541.6min speed=50382K/sec


avg-cpu:  %user   %nice %system %iowait  %steal   %idle
            1.32    0.00    7.02    1.37    0.00   90.29

Device:         rrqm/s   wrqm/s     r/s     w/s    rkB/s    wkB/s avgrq-sz avgqu-sz   await r_await w_await  svctm  %util
sda               0.00    19.00   10.20   22.40   114.40   139.00    15.55     0.52   15.95   26.27   11.25   5.03  16.40
sdb           16514.60     0.00  141.00    0.00 67860.00     0.00   962.55     2.64   18.91   18.91    0.00   3.21  45.20
sdc           16514.60     0.00  140.60    0.00 67655.20     0.00   962.38     2.06   14.77   14.77    0.00   2.79  39.20
sdd           16514.60     0.00  140.60    0.00 67655.20     0.00   962.38     2.64   18.92   18.92    0.00   3.04  42.80
sde           16514.60     0.00  140.60    0.00 67655.20     0.00   962.38     2.49   17.85   17.85    0.00   3.14  44.20
sdf           16514.60     0.00  140.60    0.00 67655.20     0.00   962.38     2.24   16.05   16.05    0.00   2.86  40.20
sdg           18459.80     0.00  269.40    0.00 74772.00     0.00   555.10    84.84  327.45  327.45    0.00   3.71 100.00
sdh               0.00    19.00    1.80   22.40    24.80   139.00    13.54     0.25   10.17   10.00   10.18   5.87  14.20
sdi               0.00 16390.20    0.00  136.60     0.00 66138.40   968.35     1.19    8.74    0.00    8.74   3.38  46.20
sdj           16514.40     0.00  140.80    0.00 67654.40     0.00   961.00     2.25   16.11   16.11    0.00   2.86  40.20
md1               0.00     0.00    0.00    4.80     0.00     4.60     1.92     0.00    0.00    0.00    0.00   0.00   0.00
md2               0.00     0.00   12.00   32.60   139.20   131.20    12.13     0.00    0.00    0.00    0.00   0.00   0.00
md0               0.00     0.00    0.00    0.00     0.00     0.00     0.00     0.00    0.00    0.00    0.00   0.00   0.00
dm-0              0.00     0.00    0.00    0.00     0.00     0.00     0.00     0.00    0.00    0.00    0.00   0.00   0.00

After Patch :

root@test:~# cat /proc/mdstat
Personalities : [linear] [raid0] [raid1] [raid10] [raid6] [raid5] [raid4]
md0 : active raid6 sdb[8] sdj[7] sdi[9] sdg[5] sdf[4] sde[3] sdd[2] sdc[1]
       35162348160 blocks super 1.2 level 6, 64k chunk, algorithm 2 [8/7] [UUUUUU_U]
       [==============>......]  recovery = 73.3% (4297641980/5860391360) finish=284.7min speed=91465K/sec


avg-cpu:  %user   %nice %system %iowait  %steal   %idle
            0.08    0.00    6.96    0.00    0.00   92.97

Device:         rrqm/s   wrqm/s     r/s     w/s    rkB/s    wkB/s avgrq-sz avgqu-sz   await r_await w_await  svctm  %util
sda               0.00     0.00    0.00    0.20     0.00     0.20     2.00     0.00    0.00    0.00    0.00   0.00   0.00
sdb           28232.80     0.00  237.40    0.00 114502.40     0.00   964.64     8.53   36.05   36.05    0.00   4.15  98.60
sdc           28232.80     0.00  236.00    0.00 113875.20     0.00   965.04     1.21    5.13    5.13    0.00   1.77  41.80
sdd           28232.80     0.00  236.00    0.00 113875.20     0.00   965.04     1.28    5.42    5.42    0.00   1.85  43.60
sde           28232.80     0.00  236.80    0.00 114195.20     0.00   964.49     2.74   11.65   11.65    0.00   2.77  65.60
sdf           28232.80     0.00  236.00    0.00 113875.20     0.00   965.04     1.56    6.63    6.63    0.00   2.14  50.60
sdg           28232.80     0.00  234.80    0.00 113273.60     0.00   964.85     3.06   12.86   12.86    0.00   2.82  66.20
sdh               0.00     0.00    0.00    0.20     0.00     0.20     2.00     0.00    0.00    0.00    0.00   0.00   0.00
sdi               0.00 28175.00    0.00  245.80     0.00 113683.20   925.01     0.96    3.91    0.00    3.91   2.89  71.00
sdj           28232.80     0.00  236.00    0.00 113875.20     0.00   965.04     1.53    6.50    6.50    0.00   2.07  48.80
md1               0.00     0.00    0.00    0.00     0.00     0.00     0.00     0.00    0.00    0.00    0.00   0.00   0.00
md2               0.00     0.00    0.00    0.00     0.00     0.00     0.00     0.00    0.00    0.00    0.00   0.00   0.00
md0               0.00     0.00    0.00    0.00     0.00     0.00     0.00     0.00    0.00    0.00    0.00   0.00   0.00
dm-0              0.00     0.00    0.00    0.00     0.00     0.00     0.00     0.00    0.00    0.00    0.00   0.00   0.00

So that has significantly increased the rebuild speed, as you would expect.


Regards,
Brad
-- 
Dolphins are so intelligent that within a few weeks they can
train Americans to stand at the edge of the pool and throw them
fish.



* Re: RAID6 rebuild oddity
From: NeilBrown @ 2017-03-30  1:53 UTC (permalink / raw)
  To: Brad Campbell, Linux-RAID; +Cc: Dan Williams


On Thu, Mar 30 2017, Brad Campbell wrote:

>        [==============>......]  recovery = 72.0% (4222847296/5860391360) 
> finish=541.6min speed=50382K/sec
...
>        [==============>......]  recovery = 73.3% (4297641980/5860391360) 
> finish=284.7min speed=91465K/sec
>

>
> So that has significantly increased the rebuild speed, as you would expect.
>

Not quite twice as fast, but close!
Everyone has been suffering with that poor speed for years and nobody
noticed until you did.  Thanks for being thorough!  And thanks for
confirming the patch.

NeilBrown



* Re: RAID6 rebuild oddity
From: Brad Campbell @ 2017-03-30  3:09 UTC (permalink / raw)
  To: NeilBrown, Linux-RAID; +Cc: Dan Williams

On 30/03/17 09:53, NeilBrown wrote:
> On Thu, Mar 30 2017, Brad Campbell wrote:
>
>>        [==============>......]  recovery = 72.0% (4222847296/5860391360)
>> finish=541.6min speed=50382K/sec
> ...
>>        [==============>......]  recovery = 73.3% (4297641980/5860391360)
>> finish=284.7min speed=91465K/sec
>>
>
>>
>> So that has significantly increased the rebuild speed, as you would expect.
>>
>
> Not quite twice as fast, but close!

It's actually more than twice as fast.
       [================>....]  recovery = 84.1% (4933147664/5860391360) finish=151.4min speed=102062K/sec

That sample was taken at an inopportune moment when things were just 
getting up to speed. We're progressing towards slower parts of the disk, 
and it's still twice as fast as it was.

> Everyone has been suffering with that poor speed for years and nobody
> noticed until you did.  Thanks for being thorough!  And thanks for
> confirming the patch.

I'm baffled it has been around this long and it hasn't come up previously.

Again, I really appreciate you taking a look at this.

Regards,
Brad


* Re: RAID6 rebuild oddity
From: Dan Williams @ 2017-03-31 18:45 UTC (permalink / raw)
  To: NeilBrown; +Cc: Brad Campbell, Linux-RAID

On Wed, Mar 29, 2017 at 5:49 PM, NeilBrown <neilb@suse.com> wrote:
> On Wed, Mar 29 2017, Brad Campbell wrote:
>
>> On 29/03/17 12:08, NeilBrown wrote:
>>
>>>
>>> sdj is getting twice the utilization of the others but only about 10%
>>> more rKB/sec.  That suggests lots of seeking.
>>
>> Yes, something is not entirely sequential.
>>
>>> Does "fuser /dev/sdj" report anything funny?
>>
>> No. No output. As far as I can tell nothing should be touching the disks
>> other than the md kernel thread.
>>
>>> Is there filesystem IO happening? If so, what filesystem?
>>> Have you told the filesystem about the RAID layout?
>>> Maybe the filesystem keeps reading some index blocks that are always on
>>> the same drive.
>>
>> No. I probably wasn't as clear as I should have been in the initial
>> post. There was nothing mounted at the time.
>>
>> Right now the array contains one large LUKS container (dm-crypt). This
>> was mounted and a continuous dd done to the dm device to zero it out :
>>
>> 4111195+1 records in
>> 4111195+1 records out
>> 34487205507072 bytes (34 TB) copied, 57781.1 s, 597 MB/s
>>
>> So there is no filesystem on the drive.
>>
>> I failed and removed sdi :
>>
>> root@test:~# mdadm --fail /dev/md0 /dev/sdi
>> mdadm: set /dev/sdi faulty in /dev/md0
>> root@test:~# mdadm --remove /dev/md0 /dev/sdi
>> mdadm: hot removed /dev/sdi from /dev/md0
>> root@test:~# mdadm --zero-superblock /dev/sdi
>> root@test:~# mdadm --add /dev/md0 /dev/sdi
>> mdadm: added /dev/sdi
>>
>> Rebooted the machine to remove all tweaks of things like stripe cache
>> size, readahead, NCQ and anything else.
>>
>> I opened the LUKS container, dd'd a meg to the start to write to the
>> array and kick off the resync, then closed the LUKS container. At this
>> point dm should no longer be touching the drive and I've verified the
>> device has gone.
>>
>> I then ran sync a couple of times and waited a couple of minutes until I
>> was positive _nothing_ was touching md0, then ran :
>>
>> blktrace -w 5 /dev/sd[bcdefgij] /dev/md0
>>
>>> So the problem moves from drive to drive?  Strongly suggests filesystem
>>> metadata access to me.
>>
>> Again, sorry for me not being clear. The situation changes on a resync
>> specific basis. For example the reproduction I've done now I popped out
>> sdi rather than sdb, and now the bottleneck is sdg. It is the same if
>> the exact circumstances remain the same.
>
> Now *that* is interesting!
>
> In the first example you gave, device 0 was rebuilding and device 7 was
> slow.
> Now device 6 is rebuilding and device5 is slow.  Surely you see the
> pattern?
>
> Also, I previously observed that the slow device was getting 10% more
> read traffic, but I was being imprecise.
> In the first example  it was 16.2%, in the second it is 14.5.
> These are close to 1/7.
>
> There are 8 devices in the array, so there are 8 different layouts for
> stripes (the parity block can be on 8 different devices).  8 is close to
> 7.
>
> How recovery *should* work on RAID6 is that for each strip we read all
> blocks except for the Q block (which isn't needed to rebuild a single
> device) and the block which has failed (of course).  For the stripe
> where the Q block has failed, we don't read the P block.  Just read all
> the data and calculate Q from that.
>
> So for every 8-device strip, we read from 6 devices.  5 Data plus parity
> when data has failed, 6 data when P or Q has failed.
> So across each of the 8 different strip layouts, there are 48 reads
> Spread across 7 working devices....  using lower-case for blocks that
> aren't read and assuming first device is being recovered:
>
>            d  D  D  D  D  D  P  q
>            d  D  D  D  D  P  q  D
>            d  D  D  D  P  q  D  D
>            d  D  D  P  q  D  D  D
>            d  D  P  q  D  D  D  D
>            d  P  q  D  D  D  D  D
>            p  q  D  D  D  D  D  D
>            q  D  D  D  D  D  D  p
> totals:    0  7  7  7  7  7  7  6
>
> The device just "before" the missing device gets fewer reads,
> because we don't need to read P from that device.
> But we are seeing that it gets more reads.  The numbers suggest that
> it gets 8 reads (1/7 more than the others).  But the numbers also
> suggest that it gets lots of seeks.  Maybe after all the other devices
> have been read for one stripe, md/raid6 decides it does need those
> blocks and goes back to read them?  That would explain the seeking
> and the increased number of reads.
>
> Time to look at the blktrace...
> Looking at sdj, each request is Queued 8 times (I wonder why) but the
> requests are completely sequential.
>
>   blkparse sdj* | awk '$6 == "Q" && $8 != prev {print $8, $8-prev ; prev=$8}'  | awk '$2 != 8'
>
> This means the 'Q' block is being read.  Not what I expected, but not
> too surprising.
>
> Looking at sdg, which is the problem device:
>
>   blkparse sdg* | awk '$6 == "Q" && $8 != prev {print $8-prev ; prev=$8}'  \
>     | sort | uniq -c | sort -n | tail -60
>
> There are lots of places we skip forward 904 sectors.  That is an 8
> sector read and a 896 sector seek. 896 sectors = 448K or 7 chunks
> (64K chunk size).
> These 7-chunk seeks are separated by 16 8-sector reads.  So it seems to
> be reading all the P (or maybe Q) blocks from a sequence of stripes.
>
> There are also lots of large back/forward seeks. They range from -31328
> to -71280 backward and 27584 to 66408  (with a couple of smaller ones).
>
> If we ignore all reads to sdg that are the result of seeking backwards -
> i.e. that are earlier that the largest offset seen so far:
>
>   blkparse sdg* | awk '$6 == "Q" && $8 != prev {if ($8>max) {print $8; max=$8} ; prev=$8}' | awk '{print $1-prev; prev=$1}' | grep  -v 8
>
> we see that (after an initial section) every block is being read.
> So it seems that sdg is seeking backwards to read some blocks that it
> has already read.
>
> So what really happens is that all the working devices read all blocks
> (although they don't really need Q), and sdg needs to seek backwards to
> re-read 1/8 of all blocks, probably the P block.
>
> So we should be seeing the rkB/s for the slow disk 1/8 more than the
> others.  i.e. 12.5%, not 14% or 16%.  The difference bothers me, but not
> very much.
>
> Now to look at the code.
>
> need_this_block() always returns 1 when s->syncing, so that is why
> we already read the Q block.  The distinction between recovery and
> resync isn't very strong in raid5/6.
>
> So on the stripes where Q has failed, we read all the Data and P.
> Then handle_parity_checks6() notice that there are enough devices that
> something is worth checking, so it checks the P.  This is destructive!
> of P.  The D blocks are xored into P and P is checked for all-zero.
>
> I always get lost following this next bit....
> I think that after deciding zero_sum_result is zero, check_state is set
> to check_state_compute_result.
> But before that is processed, we go around the loop again and notice
> that P is missing (because handle_parity_checks6() cleared R5_UPTODATE)
> so need_this_block() decided that we need to read it back in again.
> We don't calculate it because we only calculate blocks on failed
> devices (I'm sure we didn't used to do that).
> So we read P back in ... just as we see from the trace.
>
> Patch below makes the symptom go away in my testing, which confirms
> that I'm on the right track.
>
> Dan Williams changed the code to only recompute blocks from failed
> devices, way back in 2008.
> Commit: c337869d9501 ("md: do not compute parity unless it is on a failed drive")
>
> Back then we didn't have raid6 in the same code, so the fix was probably
> fine.
>
> I wonder what the best fix is.... Dan... Any thoughts?

First thought is "Wow!" that's a great bit of detective work.

Second thought is that since we validated P as being in-sync, maybe we
can flag it as safe to compute and not re-read it. I wish I had
written a unit test for the condition where we were missing some
writebacks after computing that led to commit c337869d9501  ("md: do
not compute parity unless it is on a failed drive").

However, I think we already have enough information to trust P.

                        /* The only possible failed device holds Q, so it
                         * makes sense to check P (If anything else were failed,
                         * we would have used P to recreate it).
                         */

So I think your patch is actually already correct; it would just need a
comment like the following so the exception is clear:

                /*
                 * In the raid6 case if the only non-uptodate disk is P
                 * then we already trusted P to compute the other failed
                 * drives. It is safe to compute rather than re-read P.
                 */

Thoughts?


* Re: RAID6 rebuild oddity
From: NeilBrown @ 2017-04-03  2:09 UTC (permalink / raw)
  To: Dan Williams; +Cc: Brad Campbell, Linux-RAID


On Fri, Mar 31 2017, Dan Williams wrote:

> On Wed, Mar 29, 2017 at 5:49 PM, NeilBrown <neilb@suse.com> wrote:
>> On Wed, Mar 29 2017, Brad Campbell wrote:
>>
>>> On 29/03/17 12:08, NeilBrown wrote:
>>>
>>>>
>>>> sdj is getting twice the utilization of the others but only about 10%
>>>> more rKB/sec.  That suggests lots of seeking.
>>>
>>> Yes, something is not entirely sequential.
>>>
>>>> Does "fuser /dev/sdj" report anything funny?
>>>
>>> No. No output. As far as I can tell nothing should be touching the disks
>>> other than the md kernel thread.
>>>
>>>> Is there filesystem IO happening? If so, what filesystem?
>>>> Have you told the filesystem about the RAID layout?
>>>> Maybe the filesystem keeps reading some index blocks that are always on
>>>> the same drive.
>>>
>>> No. I probably wasn't as clear as I should have been in the initial
>>> post. There was nothing mounted at the time.
>>>
>>> Right now the array contains one large LUKS container (dm-crypt). This
>>> was mounted and a continuous dd done to the dm device to zero it out :
>>>
>>> 4111195+1 records in
>>> 4111195+1 records out
>>> 34487205507072 bytes (34 TB) copied, 57781.1 s, 597 MB/s
>>>
>>> So there is no filesystem on the drive.
>>>
>>> I failed and removed sdi :
>>>
>>> root@test:~# mdadm --fail /dev/md0 /dev/sdi
>>> mdadm: set /dev/sdi faulty in /dev/md0
>>> root@test:~# mdadm --remove /dev/md0 /dev/sdi
>>> mdadm: hot removed /dev/sdi from /dev/md0
>>> root@test:~# mdadm --zero-superblock /dev/sdi
>>> root@test:~# mdadm --add /dev/md0 /dev/sdi
>>> mdadm: added /dev/sdi
>>>
>>> Rebooted the machine to remove all tweaks of things like stripe cache
>>> size, readahead, NCQ and anything else.
>>>
>>> I opened the LUKS container, dd'd a meg to the start to write to the
>>> array and kick off the resync, then closed the LUKS container. At this
>>> point dm should no longer be touching the drive and I've verified the
>>> device has gone.
>>>
>>> I then ran sync a couple of times and waited a couple of minutes until I
>>> was positive _nothing_ was touching md0, then ran :
>>>
>>> blktrace -w 5 /dev/sd[bcdefgij] /dev/md0
>>>
>>>> So the problem moves from drive to drive?  Strongly suggests filesystem
>>>> metadata access to me.
>>>
>>> Again, sorry for me not being clear. The situation changes on a resync
>>> specific basis. For example the reproduction I've done now I popped out
>>> sdi rather than sdb, and now the bottleneck is sdg. It is the same if
>>> the exact circumstances remain the same.
>>
>> Now *that* is interesting!
>>
>> In the first example you gave, device 0 was rebuilding and device 7 was
>> slow.
>> Now device 6 is rebuilding and device5 is slow.  Surely you see the
>> pattern?
>>
>> Also, I previously observed that the slow device was getting 10% more
>> read traffic, but I was being imprecise.
>> In the first example  it was 16.2%, in the second it is 14.5.
>> These are close to 1/7.
>>
>> There are 8 devices in the array, so there are 8 different layouts for
>> stripes (the parity block can be on 8 different devices).  8 is close to
>> 7.
>>
>> How recovery *should* work on RAID6 is that for each strip we read all
>> blocks except for the Q block (which isn't needed to rebuild a single
>> device) and the block which has failed (of course).  For the stripe
>> where the Q block has failed, we don't read the P block.  Just read all
>> the data and calculate Q from that.
>>
>> So for every 8-device strip, we read from 6 devices.  5 Data plus parity
>> when data has failed, 6 data when P or Q has failed.
>> So across each of the 8 different strip layouts, there are 48 reads
>> Spread across 7 working devices....  using lower-case for blocks that
>> aren't read and assuming first device is being recovered:
>>
>>            d  D  D  D  D  D  P  q
>>            d  D  D  D  D  P  q  D
>>            d  D  D  D  P  q  D  D
>>            d  D  D  P  q  D  D  D
>>            d  D  P  q  D  D  D  D
>>            d  P  q  D  D  D  D  D
>>            p  q  D  D  D  D  D  D
>>            q  D  D  D  D  D  D  p
>> totals:    0  7  7  7  7  7  7  6
>>
>> The device just "before" the missing device gets fewer reads,
>> because we don't need to read P from that device.
>> But we are seeing that it gets more reads.  The numbers suggest that
>> it gets 8 reads (1/7 more than the others).  But the numbers also
>> suggest that it gets lots of seeks.  Maybe after all the other devices
>> have been read for one stripe, md/raid6 decides it does need those
>> blocks and goes back to read them?  That would explain the seeking
>> and the increased number of reads.
>>
>> Time to look at the blktrace...
>> Looking at sdj, each request is Queued 8 times (I wonder why) but the
>> requests are completely sequential.
>>
>>   blkparse sdj* | awk '$6 == "Q" && $8 != prev {print $8, $8-prev ; prev=$8}'  | awk '$2 != 8'
>>
>> This means the 'Q' block is being read.  Not what I expected, but not
>> too surprising.
>>
>> Looking at sdg, which is the problem device:
>>
>>   blkparse sdg* | awk '$6 == "Q" && $8 != prev {print $8-prev ; prev=$8}'  \
>>     | sort | uniq -c | sort -n | tail -60
>>
>> There are lots of places we skip forward 904 sectors.  That is an 8
>> sector read and a 896 sector seek. 896 sectors = 448K or 7 chunks
>> (64K chunk size).
>> These 7-chunk seeks are separated by 16 8-sector reads.  So it seems to
>> be reading all the P (or maybe Q) blocks from a sequence of stripes.
>>
>> There are also lots of large back/forward seeks. They range from -31328
>> to -71280 backward and 27584 to 66408  (with a couple of smaller ones).
>>
>> If we ignore all reads to sdg that are the result of seeking backwards -
>> i.e. that are earlier that the largest offset seen so far:
>>
>>   blkparse sdg* | awk '$6 == "Q" && $8 != prev {if ($8>max) {print $8; max=$8} ; prev=$8}' | awk '{print $1-prev; prev=$1}' | grep  -v 8
>>
>> we see that (after an initial section) every block is being read.
>> So it seems that sdg is seeking backwards to read some blocks that it
>> has already read.
>>
>> So what really happens is that all the working devices read all blocks
>> (although they don't really need Q), and sdg needs to seek backwards to
>> re-read 1/8 of all blocks, probably the P block.
>>
>> So we should be seeing the rkB/s for the slow disk 1/8 more than the
>> others.  i.e. 12.5%, not 14% or 16%.  The difference bothers me, but not
>> very much.
>>
>> Now to look at the code.
>>
>> need_this_block() always returns 1 when s->syncing, so that is why
>> we already read the Q block.  The distinction between recovery and
>> resync isn't very strong in raid5/6.
>>
>> So on the stripes where Q has failed, we read all the Data and P.
>> Then handle_parity_checks6() notice that there are enough devices that
>> something is worth checking, so it checks the P.  This is destructive!
>> of P.  The D blocks are xored into P and P is checked for all-zero.
>>
>> I always get lost following this next bit....
>> I think that after deciding zero_sum_result is zero, check_state is set
>> to check_state_compute_result.
>> But before that is processed, we go around the loop again and notice
>> that P is missing (because handle_parity_checks6() cleared R5_UPTODATE)
>> so need_this_block() decided that we need to read it back in again.
>> We don't calculate it because we only calculate blocks on failed
>> devices (I'm sure we didn't used to do that).
>> So we read P back in ... just as we see from the trace.
>>
>> Patch below makes the symptom go away in my testing, which confirms
>> that I'm on the right track.
>>
>> Dan Williams changed the code to only recompute blocks from failed
>> devices, way back in 2008.
>> Commit: c337869d9501 ("md: do not compute parity unless it is on a failed drive")
>>
>> Back then we didn't have raid6 in the same code, so the fix was probably
>> fine.
>>
>> I wonder what the best fix is.... Dan... Any thoughts?
>
> First thought is "Wow!" that's a great bit of detective work.

Not many problems allow for that sort of detective work... It is fun
when they come along :-)

>
> Second thought is that since we validated P as being in-sync maybe we
> can flag at is safe for compute and not re-read it. I wish I had
> written a unit test for the condition where we were missing some
> writebacks after computing that lead to commit c337869d9501  ("md: do
> not compute parity unless it is on a failed drive").
>
> However, I think we already have enough information to trust P.
>
>                         /* The only possible failed device holds Q, so it
>                          * makes sense to check P (If anything else were failed,
>                          * we would have used P to recreate it).
>                          */
>
> So I think your patch is actually already correct, would just need
> comment like the following so the exception is clear:
>
>                 /*
>                  * In the raid6 case if the only non-uptodate disk is P
>                  * then we already trusted P to compute the other failed
>                  * drives. It is safe to compute rather than re-read P.
>                  */
>
> Thoughts?

I'm convinced.  If we have all of D and Q, we *must* have read (or
generated) P recently, so we cannot hide an error here.
I extended the comment to explain why we don't always compute missing
blocks.
I think your response qualifies for a Reviewed-by - hope that's OK.

Thanks,
NeilBrown


