From: Qu Wenruo <quwenruo@cn.fujitsu.com>
To: Chris Mason <clm@fb.com>, <linux-btrfs@vger.kernel.org>
Cc: <kreijack@inwind.it>, <ce3g8jdj@umail.furryterror.org>
Subject: Re: [PATCH] btrfs: raid56: Use correct stolen pages to calculate P/Q
Date: Wed, 23 Nov 2016 08:26:04 +0800	[thread overview]
Message-ID: <f182add7-4a1c-b882-4732-338f32debf87@cn.fujitsu.com> (raw)
In-Reply-To: <fd6f5be1-9f18-2311-403b-06bfd7abd28d@fb.com>



At 11/23/2016 02:58 AM, Chris Mason wrote:
> On 11/21/2016 03:50 AM, Qu Wenruo wrote:
>> In the following situation, scrub will calculate wrong parity to
>> overwrite correct one:
>>
>> RAID5 full stripe:
>>
>> Before
>> |     Dev 1      |     Dev  2     |     Dev 3     |
>> | Data stripe 1  | Data stripe 2  | Parity Stripe |
>> --------------------------------------------------- 0
>> | 0x0000 (Bad)   |     0xcdcd     |     0x0000    |
>> --------------------------------------------------- 4K
>> |     0xcdcd     |     0xcdcd     |     0x0000    |
>> ...
>> |     0xcdcd     |     0xcdcd     |     0x0000    |
>> --------------------------------------------------- 64K
>>
>> After scrubbing dev3 only:
>>
>> |     Dev 1      |     Dev  2     |     Dev 3     |
>> | Data stripe 1  | Data stripe 2  | Parity Stripe |
>> --------------------------------------------------- 0
>> | 0xcdcd (Good)  |     0xcdcd     | 0xcdcd (Bad)  |
>> --------------------------------------------------- 4K
>> |     0xcdcd     |     0xcdcd     |     0x0000    |
>> ...
>> |     0xcdcd     |     0xcdcd     |     0x0000    |
>> --------------------------------------------------- 64K
>>
>> The call trace of such corruption is as follows:
>>
>> scrub_bio_end_io_worker() get called for each extent read out
>>    |- scrub_block_complete()
>>    |- Data extent csum mismatch
>>    |- scrub_handle_errored_block
>>       |- scrub_recheck_block()
>>          |- scrub_submit_raid56_bio_wait()
>>             |- raid56_parity_recover()
>>
>> Now we have a rbio with correct data stripe 1 recovered.
>> Let's call it "good_rbio".
>>
>> scrub_parity_check_and_repair()
>> |- raid56_parity_submit_scrub_rbio()
>>    |- lock_stripe_add()
>>    |  |- steal_rbio()
>>    |     |- Recovered data is stolen from "good_rbio" and stored
>>    |        into rbio->stripe_pages[]
>>    |        Now rbio->bio_pages[] still hold the bad data read from disk.
>>    |- async_scrub_parity()
>>       |- scrub_parity_work() (delayed_call to scrub_parity_work)
>>
>> scrub_parity_work()
>> |- raid56_parity_scrub_stripe()
>>    |- validate_rbio_for_parity_scrub()
>>       |- finish_parity_scrub()
>>          |- Recalculate parity using *BAD* pages in rbio->bio_pages[]
>>             So good parity is overwritten with *BAD* one
>>
>> The fix is to introduce 2 new members, bad_ondisk_a/b, to struct
>> btrfs_raid_bio, to tell the scrub code which data pages to use when
>> re-calculating parity.
>>
>> Reported-by: Goffredo Baroncelli <kreijack@inwind.it>
>> Signed-off-by: Qu Wenruo <quwenruo@cn.fujitsu.com>
>> ---
>> Thanks to the above hell of delayed function calls and convoluted
>> code logic, such a bug is quite hard to trace.
>>
>> The damn kernel scrub is already multi-threaded, so why make such
>> meaningless delayed function calls again and again?
>>
>> What's wrong with single thread scrub?
>> We could handle each raid56 full stripe in an easy and straightforward
>> way, where the only delayed work is waking up the waiter:
>>
>>     lock_full_stripe()
>>     if (!is_parity_stripe()) {
>>         prepare_data_stripe_bios()
>>         submit_and_wait_bios()
>>         if (check_csum() == 0)
>>             goto out;
>>     }
>>     prepare_full_stripe_bios()
>>     submit_and_wait_bios()
>>
>>     recover_raid56_stripes();
>>     prepare_full_stripe_write_bios()
>>     submit_and_wait_bios()
>>
>> out:
>>     unlock_full_stripe()
>>
>> We really need to re-work the whole damn scrub code.
>>
>> Also, we need to enhance btrfs-progs to detect such scrub problems (my
>> submitted offline scrub is good enough for that usage), and tools to
>> corrupt extents reliably so we can put this into xfstests test cases.
>>
>> The RAID56 scrub code is neither well-tested nor well-designed.
>
> Great description, thanks for tracking this down.
>
>> @@ -2352,7 +2370,16 @@ static noinline void finish_parity_scrub(struct
>> btrfs_raid_bio *rbio,
>>          void *parity;
>>          /* first collect one page from each data stripe */
>>          for (stripe = 0; stripe < nr_data; stripe++) {
>> -            p = page_in_rbio(rbio, stripe, pagenr, 0);
>> +
>> +            /*
>> +             * Use the stolen recovered pages instead of the bad
>> +             * on-disk pages
>> +             */
>> +            if (stripe == rbio->bad_ondisk_a ||
>> +                stripe == rbio->bad_ondisk_b)
>> +                p = rbio_stripe_page(rbio, stripe, pagenr);
>> +            else
>> +                p = page_in_rbio(rbio, stripe, pagenr, 0);
>>              pointers[stripe] = kmap(p);
>>          }
>>
>>
>
> We're changing which pages we kmap() but not which ones we kunmap(). Can
> you please update the kunmap loop to use this pointers array?  Also it
> looks like this kmap is never unmapped.

Oh, I forgot that.
I'll update it soon.
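
Something like this, perhaps (untested sketch; pages[] would be a new
local array recording which page each slot mapped, so the kunmap loop
stays balanced with the kmap loop):

```c
/* Sketch: record the page chosen for each data stripe so the
 * corresponding kunmap loop unmaps exactly what was mapped. */
for (stripe = 0; stripe < nr_data; stripe++) {
	if (stripe == rbio->bad_ondisk_a ||
	    stripe == rbio->bad_ondisk_b)
		p = rbio_stripe_page(rbio, stripe, pagenr);
	else
		p = page_in_rbio(rbio, stripe, pagenr, 0);
	pages[stripe] = p;
	pointers[stripe] = kmap(p);
}

/* ... P/Q calculation using pointers[] ... */

for (stripe = 0; stripe < nr_data; stripe++)
	kunmap(pages[stripe]);
```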

This reminds me, is there any kernel debug option to catch pages that
are kmapped but never kunmapped?

Thanks,
Qu

>
> pointers[stripe++] = kmap(q_page);
>
> -chris
>
>


