On Fri, Mar 31 2017, Dan Williams wrote:

> On Wed, Mar 29, 2017 at 5:49 PM, NeilBrown wrote:
>> On Wed, Mar 29 2017, Brad Campbell wrote:
>>
>>> On 29/03/17 12:08, NeilBrown wrote:
>>>
>>>> sdj is getting twice the utilization of the others but only about
>>>> 10% more rKB/sec. That suggests lots of seeking.
>>>
>>> Yes, something is not entirely sequential.
>>>
>>>> Does "fuser /dev/sdj" report anything funny?
>>>
>>> No. No output. As far as I can tell nothing should be touching the
>>> disks other than the md kernel thread.
>>>
>>>> Is there filesystem IO happening? If so, what filesystem?
>>>> Have you told the filesystem about the RAID layout?
>>>> Maybe the filesystem keeps reading some index blocks that are
>>>> always on the same drive.
>>>
>>> No. I probably wasn't as clear as I should have been in the initial
>>> post. There was nothing mounted at the time.
>>>
>>> Right now the array contains one large LUKS container (dm-crypt).
>>> This was mounted and a continuous dd done to the dm device to zero
>>> it out:
>>>
>>> 4111195+1 records in
>>> 4111195+1 records out
>>> 34487205507072 bytes (34 TB) copied, 57781.1 s, 597 MB/s
>>>
>>> So there is no filesystem on the drive.
>>>
>>> I failed and removed sdi:
>>>
>>> root@test:~# mdadm --fail /dev/md0 /dev/sdi
>>> mdadm: set /dev/sdi faulty in /dev/md0
>>> root@test:~# mdadm --remove /dev/md0 /dev/sdi
>>> mdadm: hot removed /dev/sdi from /dev/md0
>>> root@test:~# mdadm --zero-superblock /dev/sdi
>>> root@test:~# mdadm --add /dev/md0 /dev/sdi
>>> mdadm: added /dev/sdi
>>>
>>> Rebooted the machine to remove all tweaks of things like stripe
>>> cache size, readahead, NCQ and anything else.
>>>
>>> I opened the LUKS container, dd'd a meg to the start to write to
>>> the array and kick off the resync, then closed the LUKS container.
>>> At this point dm should no longer be touching the drive, and I've
>>> verified the device has gone.
>>>
>>> I then ran sync a couple of times and waited a couple of minutes
>>> until I was positive _nothing_ was touching md0, then ran:
>>>
>>> blktrace -w 5 /dev/sd[bcdefgij] /dev/md0
>>>
>>>> So the problem moves from drive to drive? Strongly suggests
>>>> filesystem metadata access to me.
>>>
>>> Again, sorry for not being clear. The situation changes on a
>>> per-resync basis. For example, in the reproduction I've done now I
>>> popped out sdi rather than sdb, and now the bottleneck is sdg. The
>>> behaviour stays the same if the exact circumstances stay the same.
>>
>> Now *that* is interesting!
>>
>> In the first example you gave, device 0 was rebuilding and device 7
>> was slow.
>> Now device 6 is rebuilding and device 5 is slow. Surely you see the
>> pattern?
>>
>> Also, I previously observed that the slow device was getting 10%
>> more read traffic, but I was being imprecise.
>> In the first example it was 16.2%; in the second it is 14.5%.
>> These are close to 1/7.
>>
>> There are 8 devices in the array, so there are 8 different layouts
>> for stripes (the parity block can be on 8 different devices). 8 is
>> close to 7.
>>
>> How recovery *should* work on RAID6 is that for each stripe we read
>> all blocks except for the Q block (which isn't needed to rebuild a
>> single device) and the block which has failed (of course). For the
>> stripe where the Q block has failed, we don't read the P block
>> either: we just read all the data and calculate Q from that.
>>
>> So for every 8-device stripe, we read from 6 devices: 5 data blocks
>> plus P when a data block has failed, 6 data blocks when P or Q has
>> failed.
>>
>> So across each of the 8 different stripe layouts, there are 48 reads
>> spread across 7 working devices. Using lower-case for blocks that
>> aren't read, and assuming the first device is being recovered:
>>
>> d D D D D D P q
>> d D D D D P q D
>> d D D D P q D D
>> d D D P q D D D
>> d D P q D D D D
>> d P q D D D D D
>> p q D D D D D D
>> q D D D D D D p
>> totals: 0 7 7 7 7 7 7 6
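>>
>> (As a sanity check on that table, a throw-away awk sketch - my toy
>> model of the rotation, not the kernel's layout code - counts the
>> reads each device should get:
>>
>> awk 'BEGIN {
>>     n = 8; failed = 0              # 8 devices, device 0 rebuilding
>>     for (r = 0; r < n; r++) {
>>         p = (n - 2 - r + n) % n    # P steps backwards one row at a time
>>         q = (p + 1) % n            # Q sits just after P
>>         for (d = 0; d < n; d++) {
>>             if (d == failed || d == q) continue  # never read these
>>             if (d == p && q == failed) continue  # Q failed: skip P too
>>             reads[d]++
>>         }
>>     }
>>     for (d = 0; d < n; d++) printf "%d ", reads[d]
>>     print ""
>> }'
>>
>> which prints "0 7 7 7 7 7 7 6", matching the totals above.)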
>>
>> The device just "before" the missing device gets fewer reads,
>> because we don't need to read P from that device.
>> But we are seeing that it gets more reads. The numbers suggest that
>> it gets 8 reads (1/7 more than the others). But the numbers also
>> suggest that it gets lots of seeks. Maybe after all the other
>> devices have been read for one stripe, md/raid6 decides it does need
>> those blocks after all and goes back to read them? That would
>> explain the seeking and the increased number of reads.
>>
>> Time to look at the blktrace...
>> Looking at sdj, each request is Queued 8 times (I wonder why) but
>> the requests are completely sequential:
>>
>> blkparse sdj* | awk '$6 == "Q" && $8 != prev {print $8, $8-prev; prev=$8}' | awk '$2 != 8'
>>
>> (the second awk shows only the steps that are not exactly 8 sectors,
>> and nothing shows up.)
>> This means the 'Q' block is being read. Not what I expected, but not
>> too surprising.
>>
>> Looking at sdg, which is the problem device:
>>
>> blkparse sdg* | awk '$6 == "Q" && $8 != prev {print $8-prev; prev=$8}' \
>>     | sort | uniq -c | sort -n | tail -60
>>
>> There are lots of places where we skip forward 904 sectors. That is
>> an 8-sector read plus an 896-sector seek, and 896 sectors = 448K = 7
>> chunks (64K chunk size).
>> These 7-chunk seeks are separated by 16 8-sector reads. So it seems
>> to be reading all the P (or maybe Q) blocks from a sequence of
>> stripes.
>>
>> There are also lots of large backward/forward seeks. They range from
>> -31328 to -71280 backward, and 27584 to 66408 forward (with a couple
>> of smaller ones).
>>
>> If we ignore all reads to sdg that are the result of seeking
>> backwards, i.e. that are earlier than the largest offset seen so
>> far:
>>
>> blkparse sdg* | awk '$6 == "Q" && $8 != prev {if ($8>max) {print $8; max=$8}; prev=$8}' | awk '{print $1-prev; prev=$1}' | grep -v 8
>>
>> we see that (after an initial section) every block is being read.
>> So it seems that sdg is seeking backwards to re-read some blocks
>> that it has already read.
>>
>> So what really happens is that all the working devices read all
>> blocks (even though they don't really need Q), and sdg additionally
>> seeks backwards to re-read 1/8 of all blocks - probably the P
>> blocks.
>>
>> So we should be seeing rkB/s for the slow disk 1/8 more than the
>> others, i.e. 12.5%, not 14% or 16%. The difference bothers me, but
>> not very much.
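>>
>> (A variation on the pipelines above - untested - should put a number
>> on the re-read fraction directly, by counting queued reads that land
>> behind the highest offset seen so far:
>>
>> blkparse sdg* | awk '
>>     $6 == "Q" && $8 != prev {
>>         total++
>>         if ($8 > max) max = $8   # new high-water mark: forward progress
>>         else back++              # behind the mark: a re-read
>>         prev = $8
>>     }
>>     END { printf "%.1f%% re-reads\n", 100 * back / total }'
>>
>> If 1/8 of all blocks are read twice, about 1/9 (11%) of the observed
>> requests should land behind the mark.)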
>>
>> Now to look at the code.
>>
>> need_this_block() always returns 1 when s->syncing, so that is why
>> the Q block gets read. The distinction between recovery and resync
>> isn't very strong in raid5/6.
>>
>> So on the stripes where Q has failed, we read all the data and P.
>> Then handle_parity_checks6() notices that there are enough devices
>> that something is worth checking, so it checks the P. This is
>> destructive of P! The D blocks are xored into P and the result is
>> checked for all-zero.
>>
>> I always get lost following this next bit....
>> I think that after deciding zero_sum_result is zero, check_state is
>> set to check_state_compute_result.
>> But before that is processed, we go around the loop again and notice
>> that P is missing (because handle_parity_checks6() cleared
>> R5_UPTODATE), so need_this_block() decides that we need to read it
>> back in again. We don't calculate it, because we only calculate
>> blocks on failed devices (I'm sure we didn't used to do that).
>> So we read P back in ... just as we see from the trace.
>>
>> Patch below makes the symptom go away in my testing, which confirms
>> that I'm on the right track.
>>
>> Dan Williams changed the code to only recompute blocks from failed
>> devices, way back in 2008.
>> Commit: c337869d9501 ("md: do not compute parity unless it is on a
>> failed drive")
>>
>> Back then we didn't have raid6 in the same code, so the fix was
>> probably fine.
>>
>> I wonder what the best fix is.... Dan... Any thoughts?
>
> First thought is "Wow!" - that's a great bit of detective work.

Not many problems allow for that sort of detective work... It is fun
when they come along :-)

> Second thought is that since we validated P as being in-sync, maybe
> we can flag it as safe to compute and not re-read it. I wish I had
> written a unit test for the condition, where we were missing some
> writebacks after computing, that led to commit c337869d9501 ("md: do
> not compute parity unless it is on a failed drive").
>
> However, I think we already have enough information to trust P:
>
> /* The only possible failed device holds Q, so it
>  * makes sense to check P (If anything else were failed,
>  * we would have used P to recreate it).
>  */
>
> So I think your patch is actually already correct; it would just need
> a comment like the following so the exception is clear:
>
> /*
>  * In the raid6 case if the only non-uptodate disk is P
>  * then we already trusted P to compute the other failed
>  * drives. It is safe to compute rather than re-read P.
>  */
>
> Thoughts?

I'm convinced. If we have all of D and Q, we *must* have read (or
generated) P recently, so we cannot be hiding an error here.
I extended the comment to explain why we don't always compute missing
blocks.

I think your response qualifies for a Reviewed-by - hope that's OK.

Thanks,
NeilBrown
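
PS: a loop-device array should show the same back-seeks without
needing 34TB of spinning rust. Something like this (an untested
sketch; device names and sizes are arbitrary, and the members need to
be big enough that the rebuild outlives the 5-second trace window):

for i in $(seq 0 7); do
    truncate -s 1G /tmp/d$i
    losetup /dev/loop$i /tmp/d$i
done
mdadm --create /dev/md0 --level=6 --raid-devices=8 --chunk=64 /dev/loop[0-7]
mdadm --wait /dev/md0                 # let the initial sync finish
mdadm --fail /dev/md0 /dev/loop0      # then kick out device 0 ...
mdadm --remove /dev/md0 /dev/loop0
mdadm --zero-superblock /dev/loop0
mdadm --add /dev/md0 /dev/loop0       # ... and re-add it to start recovery
# the device just "before" the replaced one should show the back-seeks
blktrace -w 5 /dev/loop7
blkparse loop7* | awk '$6 == "Q" && $8 != prev {print $8-prev; prev=$8}' | sort | uniq -c | sort -n | tail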