From: junxiao.bi@oracle.com
To: Song Liu <song@kernel.org>,
	Linux regressions mailing list <regressions@lists.linux.dev>
Cc: gregkh@linuxfoundation.org, linux-kernel@vger.kernel.org,
	linux-raid@vger.kernel.org, stable@vger.kernel.org,
	Dan Moulding <dan@danm.net>
Subject: Re: [REGRESSION] 6.7.1: md: raid5 hang and unresponsive system; successfully bisected
Date: Fri, 1 Mar 2024 12:26:54 -0800	[thread overview]
Message-ID: <739634c3-3e21-44dd-abb1-356cf54e54fd@oracle.com> (raw)
In-Reply-To: <CAPhsuW5QYTjBvjAjf8SdcKmPGO20e5-p57n6af5FaXudSiOCmg@mail.gmail.com>

Hi Dan & Song,

I have not root-caused this yet, but would like to share some findings
from the vmcore Dan shared. From what I can see, this doesn't look like
an md issue, but rather something wrong in the block layer or below.

1. There were multiple processes hung on IO for over 15 minutes.

crash> ps -m | grep UN
[0 00:15:50.424] [UN]  PID: 957      TASK: ffff88810baa0ec0  CPU: 1   COMMAND: "jbd2/dm-3-8"
[0 00:15:56.151] [UN]  PID: 1835     TASK: ffff888108a28ec0  CPU: 2   COMMAND: "dd"
[0 00:15:56.187] [UN]  PID: 876      TASK: ffff888108bebb00  CPU: 3   COMMAND: "md0_reclaim"
[0 00:15:56.185] [UN]  PID: 1914     TASK: ffff8881015e6740  CPU: 1   COMMAND: "kworker/1:2"
[0 00:15:56.255] [UN]  PID: 403      TASK: ffff888101351d80  CPU: 7   COMMAND: "kworker/u21:1"

2. Let's pick md0_reclaim and take a look: it is waiting for the
superblock update to complete. We can see there were two pending
superblock writes and other pending IO on the underlying physical
disks, which is what caused these processes to hang. (A rough sketch
of the wait/wake pair involved follows the structure dumps below.)

crash> bt 876
PID: 876      TASK: ffff888108bebb00  CPU: 3    COMMAND: "md0_reclaim"
  #0 [ffffc900008c3d10] __schedule at ffffffff81ac18ac
  #1 [ffffc900008c3d70] schedule at ffffffff81ac1d82
  #2 [ffffc900008c3d88] md_super_wait at ffffffff817df27a
  #3 [ffffc900008c3dd0] md_update_sb at ffffffff817df609
  #4 [ffffc900008c3e20] r5l_do_reclaim at ffffffff817d1cf4
  #5 [ffffc900008c3e98] md_thread at ffffffff817db1ef
  #6 [ffffc900008c3ef8] kthread at ffffffff8114f8ee
  #7 [ffffc900008c3f30] ret_from_fork at ffffffff8108bb98
  #8 [ffffc900008c3f50] ret_from_fork_asm at ffffffff81000da1

crash> mddev.pending_writes,disks 0xffff888108335800
   pending_writes = {
     counter = 2  <<<<<<<<<< 2 active superblock writes
   },
   disks = {
     next = 0xffff88810ce85a00,
     prev = 0xffff88810ce84c00
   },
crash> list -l md_rdev.same_set -s md_rdev.kobj.name,nr_pending 0xffff88810ce85a00
ffff88810ce85a00
   kobj.name = 0xffff8881067c1a00 "dev-dm-1",
   nr_pending = {
     counter = 0
   },
ffff8881083ace00
   kobj.name = 0xffff888100a93280 "dev-sde",
   nr_pending = {
     counter = 10 <<<<
   },
ffff8881010ad200
   kobj.name = 0xffff8881012721c8 "dev-sdc",
   nr_pending = {
     counter = 8 <<<<<
   },
ffff88810ce84c00
   kobj.name = 0xffff888100325f08 "dev-sdd",
   nr_pending = {
     counter = 2 <<<<<
   },
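
For reference, the wait/wake pair md0_reclaim is stuck on looks roughly
like this (paraphrased from my reading of drivers/md/md.c, not verbatim
kernel code):

/* Paraphrased sketch: md_update_sb() submits one superblock bio per
 * rdev and bumps mddev->pending_writes; the bio completion handler
 * super_written() drops the count and wakes sb_wait, so md_super_wait()
 * cannot return until every superblock bio completes. In the vmcore,
 * pending_writes is still 2, so md0_reclaim stays blocked here.
 */
static void md_super_wait(struct mddev *mddev)
{
        wait_event(mddev->sb_wait,
                   atomic_read(&mddev->pending_writes) == 0);
}

static void super_written(struct bio *bio)
{
        struct md_rdev *rdev = bio->bi_private;
        struct mddev *mddev = rdev->mddev;

        /* error handling and rdev accounting elided */
        if (atomic_dec_and_test(&mddev->pending_writes))
                wake_up(&mddev->sb_wait);
        bio_put(bio);
}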

3. From the block layer, I can find the in-flight IO for the md
superblock write; it has been pending for 955s, which matches the hang
time of "md0_reclaim". (The arithmetic behind that number is spelled
out after the crash output below.)

crash> request.q,mq_hctx,cmd_flags,rq_flags,start_time_ns,bio,biotail,state,__data_len,flush,end_io ffff888103b4c300
   q = 0xffff888103a00d80,
   mq_hctx = 0xffff888103c5d200,
   cmd_flags = 38913,
   rq_flags = 139408,
   start_time_ns = 1504179024146,
   bio = 0x0,
   biotail = 0xffff888120758e40,
   state = MQ_RQ_COMPLETE,
   __data_len = 0,
   flush = {
     seq = 3, <<<< REQ_FSEQ_PREFLUSH |  REQ_FSEQ_DATA
     saved_end_io = 0x0
   },
   end_io = 0xffffffff815186e0 <mq_flush_data_end_io>,

crash> p tk_core.timekeeper.tkr_mono.base
$1 = 2459916243002
crash> eval 2459916243002-1504179024146
hexadecimal: de86609f28
     decimal: 955737218856  <<<<<<< IO pending time is 955s
       octal: 15720630117450
      binary: 0000000000000000000000001101111010000110011000001001111100101000

crash> bio.bi_iter,bi_end_io 0xffff888120758e40
   bi_iter = {
     bi_sector = 8, <<<< super block offset
     bi_size = 0,
     bi_idx = 0,
     bi_bvec_done = 0
   },
   bi_end_io = 0xffffffff817dca50 <super_written>,
crash> dev -d | grep ffff888103a00d80
     8 ffff8881003ab000   sdd        ffff888103a00d80       0 0     0
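
Just to make the unit conversion explicit, here is the same arithmetic
in C (values copied from the crash output above):

#include <stdio.h>

int main(void)
{
        /* Values copied from the crash output above; plain ns arithmetic. */
        unsigned long long now_ns   = 2459916243002ULL; /* tkr_mono.base         */
        unsigned long long start_ns = 1504179024146ULL; /* request.start_time_ns */
        unsigned long long age_ns   = now_ns - start_ns;

        /* prints: pending for 955737218856 ns (~955 s) */
        printf("pending for %llu ns (~%llu s)\n", age_ns, age_ns / 1000000000ULL);
        return 0;
}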

4. Checking the above request: even though its state is
"MQ_RQ_COMPLETE", it is still pending. That's because each md
superblock write was marked with REQ_PREFLUSH | REQ_FUA, so it is
handled in 3 steps: pre_flush, data, and post_flush. As each step
completes, it is marked in "request.flush.seq"; here the value is 3,
which is REQ_FSEQ_PREFLUSH | REQ_FSEQ_DATA, so the last step,
"post_flush", has not been done. Another weird thing is that
blk_flush_queue.flush_data_in_flight is still 1 even though the "data"
step is already done. (The flush sequence flags are sketched after the
structure dumps below.)

crash> blk_mq_hw_ctx.fq 0xffff888103c5d200
   fq = 0xffff88810332e240,
crash> blk_flush_queue 0xffff88810332e240
struct blk_flush_queue {
   mq_flush_lock = {
     {
       rlock = {
         raw_lock = {
           {
             val = {
               counter = 0
             },
             {
               locked = 0 '\000',
               pending = 0 '\000'
             },
             {
               locked_pending = 0,
               tail = 0
             }
           }
         }
       }
     }
   },
   flush_pending_idx = 1,
   flush_running_idx = 1,
   rq_status = 0 '\000',
   flush_pending_since = 4296171408,
   flush_queue = {{
       next = 0xffff88810332e250,
       prev = 0xffff88810332e250
     }, {
       next = 0xffff888103b4c348, <<<< the request is in this list
       prev = 0xffff888103b4c348
     }},
   flush_data_in_flight = 1,  >>>>>> still 1
   flush_rq = 0xffff888103c2e000
}

crash> list 0xffff888103b4c348
ffff888103b4c348
ffff88810332e260

crash> request.tag,state,ref 0xffff888103c2e000 >>>> flush_rq of hw queue
   tag = -1,
   state = MQ_RQ_IDLE,
   ref = {
     counter = 0
   },
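
For context on the seq value, the flush sequence bookkeeping lives in
block/blk-flush.c; the flag values below are from my reading of that
file, and the helper is purely illustrative (not kernel code):

/* Flush sequence steps as accumulated in request.flush.seq
 * (flag values as I recall them from block/blk-flush.c).
 */
enum {
        REQ_FSEQ_PREFLUSH  = (1 << 0),   /* pre-flush step completed  */
        REQ_FSEQ_DATA      = (1 << 1),   /* data write step completed */
        REQ_FSEQ_POSTFLUSH = (1 << 2),   /* post-flush step completed */
        REQ_FSEQ_DONE      = (1 << 3),
};

/* Illustrative helper mirroring the reasoning above: seq == 3 means
 * only PREFLUSH and DATA are marked done, so the post-flush step is
 * still outstanding and the request stays parked on the flush queue
 * even though its blk-mq state reads MQ_RQ_COMPLETE.
 */
static int postflush_outstanding(unsigned int seq)
{
        return (seq & REQ_FSEQ_DATA) && !(seq & REQ_FSEQ_POSTFLUSH);
}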

5. It looks like the block layer or the underlying driver
(scsi/virtio-scsi) may have some issue that leaves the IO request from
the md layer in a partially completed state. I can't see how this can
be related to commit bed9e27baf52 ("Revert "md/raid5: Wait for
MD_SB_CHANGE_PENDING in raid5d"").


Dan,

Are you able to reproduce this using a regular SCSI disk? I would like
to rule out whether this is related to virtio-scsi.

And I see the kernel version in the vmcore is 6.8.0-rc5; is this the
official mainline v6.8-rc5 without any other patches?


Thanks,

Junxiao.

On 2/23/24 6:13 PM, Song Liu wrote:
> Hi,
>
> On Fri, Feb 23, 2024 at 12:07 AM Linux regression tracking (Thorsten
> Leemhuis) <regressions@leemhuis.info> wrote:
>> On 21.02.24 00:06, Dan Moulding wrote:
>>> Just a friendly reminder that this regression still exists on the
>>> mainline. It has been reverted in 6.7 stable. But I upgraded a
>>> development system to 6.8-rc5 today and immediately hit this issue
>>> again. Then I saw that it hasn't yet been reverted in Linus' tree.
>> Song Liu, what's the status here? I'm aware that you fixed quite a
>> few regressions recently, but it seems like resolving this one is
>> stalled. Or were you able to reproduce the issue or make some progress
>> and I just missed it?
> Sorry for the delay with this issue. I have been occupied with some
> other stuff this week.
>
> I haven't had any luck reproducing this issue. I will spend more time
> looking into it next week.
>
>> And if not, what's the way forward here wrt the release of 6.8?
>> Revert the culprit and try again later? Or is that not an option for one
>> reason or another?
> If we don't make progress with it in the next week, we will do the revert,
> same as we did with stable kernels.
>
>> Or do we assume that this is not a real issue? That it's caused by some
>> oddity (bit-flip in the metadata or something like that?) only to be
>> found in Dan's setup?
> I don't think this is because of oddities. Hopefully we can get more
> information about this soon.
>
> Thanks,
> Song
>
>> Ciao, Thorsten (wearing his 'the Linux kernel's regression tracker' hat)
>> --
>> Everything you wanna know about Linux kernel regression tracking:
>> https://linux-regtracking.leemhuis.info/about/#tldr
>> If I did something stupid, please tell me, as explained on that page.
>>
>> #regzbot poke
>>

Thread overview: 56+ messages
2024-01-23  0:56 [REGRESSION] 6.7.1: md: raid5 hang and unresponsive system; successfully bisected Dan Moulding
2024-01-23  1:08 ` Song Liu
2024-01-23  1:35 ` Dan Moulding
2024-01-23  6:35   ` Song Liu
2024-01-23 21:53     ` Dan Moulding
2024-01-23 22:21       ` Song Liu
2024-01-23 23:58         ` Dan Moulding
2024-01-25  0:01           ` Song Liu
2024-01-25 16:44             ` junxiao.bi
2024-01-25 19:40               ` Song Liu
2024-01-25 20:31               ` Dan Moulding
2024-01-26  3:30                 ` Carlos Carvalho
2024-01-26 15:46                   ` Dan Moulding
2024-01-30 16:26                     ` Blazej Kucman
2024-01-30 20:21                       ` Song Liu
2024-01-31  1:26                       ` Song Liu
2024-01-31  2:13                         ` Yu Kuai
2024-01-31  2:41                       ` Yu Kuai
2024-01-31  4:55                         ` Song Liu
2024-01-31 13:36                           ` Blazej Kucman
2024-02-01  1:39                             ` Yu Kuai
2024-01-26 16:21                   ` Roman Mamedov
2024-01-31 17:37                 ` junxiao.bi
2024-02-06  8:07                 ` Song Liu
2024-02-06 20:56                   ` Dan Moulding
2024-02-06 21:34                     ` Song Liu
2024-02-20 23:06 ` Dan Moulding
2024-02-20 23:15   ` junxiao.bi
2024-02-21 14:50     ` Mateusz Kusiak
2024-02-21 19:15       ` junxiao.bi
2024-02-23 17:44     ` Dan Moulding
2024-02-23 19:18       ` junxiao.bi
2024-02-23 20:22         ` Dan Moulding
2024-02-23  8:07   ` Linux regression tracking (Thorsten Leemhuis)
2024-02-24  2:13     ` Song Liu
2024-02-25 17:46       ` Thomas B. Clark
2024-02-26  1:17         ` Thomas B. Clark
2024-02-26 17:35           ` Song Liu
2024-03-01 20:26       ` junxiao.bi [this message]
2024-03-01 23:12         ` Dan Moulding
2024-03-02  0:05           ` Song Liu
2024-03-06  8:38             ` Linux regression tracking (Thorsten Leemhuis)
2024-03-06 17:13               ` Song Liu
2024-03-02 16:55         ` Dan Moulding
2024-03-07  3:34         ` Yu Kuai
2024-03-08 23:49         ` junxiao.bi
2024-03-10  5:13           ` Dan Moulding
2024-03-11  1:50           ` Yu Kuai
2024-03-12 22:56             ` junxiao.bi
2024-03-13  1:20               ` Yu Kuai
2024-03-14 18:20                 ` junxiao.bi
2024-03-14 22:36                   ` Song Liu
2024-03-15  1:30                   ` Yu Kuai
2024-03-14 16:12             ` Dan Moulding
2024-03-15  1:17               ` Yu Kuai
2024-03-19 14:16                 ` Dan Moulding
