From: Dan Williams <dan.j.williams@intel.com>
To: NeilBrown <neilb@suse.com>
Cc: Brad Campbell <lists2009@fnarfbargle.com>,
	Linux-RAID <linux-raid@vger.kernel.org>
Subject: Re: RAID6 rebuild oddity
Date: Fri, 31 Mar 2017 11:45:41 -0700
Message-ID: <CAPcyv4gwG6yRMKF4wdhT_cPfCn=eucLX=ON7DWTA2gw0ptgmbQ@mail.gmail.com>
In-Reply-To: <87wpb7612w.fsf@notabene.neil.brown.name>

On Wed, Mar 29, 2017 at 5:49 PM, NeilBrown <neilb@suse.com> wrote:
> On Wed, Mar 29 2017, Brad Campbell wrote:
>
>> On 29/03/17 12:08, NeilBrown wrote:
>>
>>>
>>> sdj is getting twice the utilization of the others but only about 10%
>>> more rKB/sec.  That suggests lots of seeking.
>>
>> Yes, something is not entirely sequential.
>>
>>> Does "fuser /dev/sdj" report anything funny?
>>
>> No. No output. As far as I can tell nothing should be touching the disks
>> other than the md kernel thread.
>>
>>> Is there filesystem IO happening? If so, what filesystem?
>>> Have you told the filesystem about the RAID layout?
>>> Maybe the filesystem keeps reading some index blocks that are always on
>>> the same drive.
>>
>> No. I probably wasn't as clear as I should have been in the initial
>> post. There was nothing mounted at the time.
>>
>> Right now the array contains one large LUKS container (dm-crypt). This
>> was opened and a continuous dd was done to the dm device to zero it out:
>>
>> 4111195+1 records in
>> 4111195+1 records out
>> 34487205507072 bytes (34 TB) copied, 57781.1 s, 597 MB/s
>>
>> So there is no filesystem on the drive.
>>
>> I failed and removed sdi:
>>
>> root@test:~# mdadm --fail /dev/md0 /dev/sdi
>> mdadm: set /dev/sdi faulty in /dev/md0
>> root@test:~# mdadm --remove /dev/md0 /dev/sdi
>> mdadm: hot removed /dev/sdi from /dev/md0
>> root@test:~# mdadm --zero-superblock /dev/sdi
>> root@test:~# mdadm --add /dev/md0 /dev/sdi
>> mdadm: added /dev/sdi
>>
>> Rebooted the machine to remove all tweaks of things like stripe cache
>> size, readahead, NCQ and anything else.
>>
>> I opened the LUKS container, dd'd a meg to the start to write to the
>> array and kick off the resync, then closed the LUKS container. At this
>> point dm should no longer be touching the drive and I've verified the
>> device has gone.
>>
>> I then ran sync a couple of times and waited a couple of minutes until I
>> was positive _nothing_ was touching md0, then ran:
>>
>> blktrace -w 5 /dev/sd[bcdefgij] /dev/md0
>>
>>> So the problem moves from drive to drive?  Strongly suggests filesystem
>>> metadata access to me.
>>
>> Again, sorry for not being clear. The situation changes from one resync
>> to the next. For example, in the reproduction I've done now I popped out
>> sdi rather than sdb, and now the bottleneck is sdg. The bottleneck device
>> stays the same if the exact circumstances remain the same.
>
> Now *that* is interesting!
>
> In the first example you gave, device 0 was rebuilding and device 7 was
> slow.
> Now device 6 is rebuilding and device 5 is slow.  Surely you see the
> pattern?
>
> Also, I previously observed that the slow device was getting 10% more
> read traffic, but I was being imprecise.
> In the first example it was 16.2%, in the second it is 14.5%.
> These are close to 1/7 (about 14.3%).
>
> There are 8 devices in the array, so there are 8 different layouts for
> stripes (the parity block can be on 8 different devices).  8 is close to
> 7.
>
> How recovery *should* work on RAID6 is that for each stripe we read all
> blocks except for the Q block (which isn't needed to rebuild a single
> device) and the block which has failed (of course).  For the stripe
> where the Q block has failed, we don't read the P block.  Just read all
> the data and calculate Q from that.
>
> So for every 8-device stripe, we read from 6 devices: 5 data blocks plus
> parity when a data block has failed, 6 data blocks when P or Q has failed.
> So across the 8 different stripe layouts there are 48 reads in total,
> spread across 7 working devices.  Using lower-case for blocks that
> aren't read and assuming the first device is being recovered:
>
>            d  D  D  D  D  D  P  q
>            d  D  D  D  D  P  q  D
>            d  D  D  D  P  q  D  D
>            d  D  D  P  q  D  D  D
>            d  D  P  q  D  D  D  D
>            d  P  q  D  D  D  D  D
>            p  q  D  D  D  D  D  D
>            q  D  D  D  D  D  D  p
> totals:    0  7  7  7  7  7  7  6
>
> The device just "before" the missing device gets fewer reads,
> because we don't need to read P from that device.
> But we are seeing that it gets more reads.  The numbers suggest that
> it gets 8 reads (1/7 more than the others).  But the numbers also
> suggest that it gets lots of seeks.  Maybe after all the other devices
> have been read for one stripe, md/raid6 decides it does need those
> blocks and goes back to read them?  That would explain the seeking
> and the increased number of reads.
>
> Time to look at the blktrace...
> Looking at sdj, each request is Queued 8 times (I wonder why) but the
> requests are completely sequential.
>
>   blkparse sdj* | awk '$6 == "Q" && $8 != prev {print $8, $8-prev ; prev=$8}'  | awk '$2 != 8'
>
> This means the 'Q' block is being read.  Not what I expected, but not
> too surprising.
>
> Looking at sdg, which is the problem device:
>
>   blkparse sdg* | awk '$6 == "Q" && $8 != prev {print $8-prev ; prev=$8}'  \
>     | sort | uniq -c | sort -n | tail -60
>
> There are lots of places we skip forward 904 sectors.  That is an
> 8-sector read and an 896-sector seek.  896 sectors = 448K, or 7 chunks
> (64K chunk size).
> These 7-chunk seeks are separated by 16 8-sector reads.  So it seems to
> be reading all the P (or maybe Q) blocks from a sequence of stripes.
>
> There are also lots of large back/forward seeks. They range from -31328
> to -71280 backward and 27584 to 66408 forward (with a couple of smaller ones).
>
> If we ignore all reads to sdg that are the result of seeking backwards -
> i.e. that are earlier than the largest offset seen so far:
>
>   blkparse sdg* | awk '$6 == "Q" && $8 != prev {if ($8>max) {print $8; max=$8} ; prev=$8}' | awk '{print $1-prev; prev=$1}' | grep  -v 8
>
> we see that (after an initial section) every block is being read.
> So it seems that sdg is seeking backwards to read some blocks that it
> has already read.
>
> So what really happens is that all the working devices read all blocks
> (although they don't really need Q), and sdg needs to seek backwards to
> re-read 1/8 of all blocks, probably the P block.
>
> So we should be seeing rkB/s for the slow disk about 1/8 higher than the
> others, i.e. 12.5%, not 14% or 16%.  The difference bothers me, but not
> very much.
>
> Now to look at the code.
>
> need_this_block() always returns 1 when s->syncing, so that is why
> we read the Q block as well.  The distinction between recovery and
> resync isn't very strong in raid5/6.
>
> So on the stripes where Q has failed, we read all the data and P.
> Then handle_parity_checks6() notices that there are enough devices that
> something is worth checking, so it checks P.  This check is destructive
> of P: the D blocks are xored into P, and the result is checked for all-zero.
>
> I always get lost following this next bit....
> I think that after deciding zero_sum_result is zero, check_state is set
> to check_state_compute_result.
> But before that is processed, we go around the loop again and notice
> that P is missing (because handle_parity_checks6() cleared R5_UPTODATE)
> so need_this_block() decides that we need to read it back in again.
> We don't calculate it because we only calculate blocks on failed
> devices (I'm sure we didn't use to do that).
> So we read P back in ... just as we see from the trace.
>
> Patch below makes the symptom go away in my testing, which confirms
> that I'm on the right track.
>
> Dan Williams changed the code to only recompute blocks from failed
> devices, way back in 2008.
> Commit: c337869d9501 ("md: do not compute parity unless it is on a failed drive")
>
> Back then we didn't have raid6 in the same code, so the fix was probably
> fine.
>
> I wonder what the best fix is.... Dan... Any thoughts?

First thought is "Wow!"  That's a great bit of detective work.

Second thought is that since we validated P as being in-sync, maybe we
can flag it as safe for compute and not re-read it. I wish I had
written a unit test for the condition (missing some writebacks after
computing) that led to commit c337869d9501 ("md: do not compute parity
unless it is on a failed drive").

However, I think we already have enough information to trust P.

                        /* The only possible failed device holds Q, so it
                         * makes sense to check P (If anything else were failed,
                         * we would have used P to recreate it).
                         */

So I think your patch is actually already correct; it would just need a
comment like the following so the exception is clear:

                /*
                 * In the raid6 case if the only non-uptodate disk is P
                 * then we already trusted P to compute the other failed
                 * drives. It is safe to compute rather than re-read P.
                 */
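
For concreteness, this is roughly the shape I have in mind in
fetch_block() (an untested sketch from memory of the raid5.c flow, not
your actual patch, so the exact condition and surrounding branches may
differ):

    if (need_this_block(sh, s, disk_idx, disks)) {
        /*
         * In the raid6 case if the only non-uptodate disk is P
         * then we already trusted P to compute the other failed
         * drives. It is safe to compute rather than re-read P.
         */
        if ((s->uptodate == disks - 1) &&
            ((sh->qd_idx >= 0 && sh->pd_idx == disk_idx) ||
             (s->failed && (disk_idx == s->failed_num[0] ||
                            disk_idx == s->failed_num[1])))) {
            /* schedule an in-core compute instead of a re-read */
            set_bit(STRIPE_COMPUTE_RUN, &sh->state);
            set_bit(STRIPE_OP_COMPUTE_BLK, &s->ops_request);
            set_bit(R5_Wantcompute, &dev->flags);
            sh->ops.target = disk_idx;
            sh->ops.target2 = -1;
            s->req_compute = 1;
            s->uptodate++;
        } else if (test_bit(R5_Insync, &dev->flags)) {
            /* fall back to reading the block from the member disk */
            set_bit(R5_LOCKED, &dev->flags);
            set_bit(R5_Wantread, &dev->flags);
            s->locked++;
        }
    }

i.e. when P is the only non-uptodate block on a raid6 stripe, treat it
as a valid compute target rather than scheduling a read.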

Thoughts?

Thread overview: 9+ messages
2017-03-24  7:44 RAID6 rebuild oddity Brad Campbell
2017-03-29  4:08 ` NeilBrown
2017-03-29  8:12   ` Brad Campbell
2017-03-30  0:49     ` NeilBrown
2017-03-30  1:22       ` Brad Campbell
2017-03-30  1:53         ` NeilBrown
2017-03-30  3:09           ` Brad Campbell
2017-03-31 18:45       ` Dan Williams [this message]
2017-04-03  2:09         ` NeilBrown
