From: junxiao.bi@oracle.com
To: Song Liu <song@kernel.org>,
	Linux regressions mailing list <regressions@lists.linux.dev>
Cc: gregkh@linuxfoundation.org, linux-kernel@vger.kernel.org,
	linux-raid@vger.kernel.org, stable@vger.kernel.org,
	Dan Moulding <dan@danm.net>
Subject: Re: [REGRESSION] 6.7.1: md: raid5 hang and unresponsive system; successfully bisected
Date: Fri, 1 Mar 2024 12:26:54 -0800	[thread overview]
Message-ID: <739634c3-3e21-44dd-abb1-356cf54e54fd@oracle.com> (raw)
In-Reply-To: <CAPhsuW5QYTjBvjAjf8SdcKmPGO20e5-p57n6af5FaXudSiOCmg@mail.gmail.com>

Hi Dan & Song,

I have not root-caused this yet, but would like to share some findings
from the vmcore Dan shared. From what I can see, this doesn't look like
an md issue, but rather something wrong in the block layer or below.

1. There were multiple processes hung on IO for over 15 minutes.

crash> ps -m | grep UN
[0 00:15:50.424] [UN]  PID: 957      TASK: ffff88810baa0ec0  CPU: 1   COMMAND: "jbd2/dm-3-8"
[0 00:15:56.151] [UN]  PID: 1835     TASK: ffff888108a28ec0  CPU: 2   COMMAND: "dd"
[0 00:15:56.187] [UN]  PID: 876      TASK: ffff888108bebb00  CPU: 3   COMMAND: "md0_reclaim"
[0 00:15:56.185] [UN]  PID: 1914     TASK: ffff8881015e6740  CPU: 1   COMMAND: "kworker/1:2"
[0 00:15:56.255] [UN]  PID: 403      TASK: ffff888101351d80  CPU: 7   COMMAND: "kworker/u21:1"

2. Let's pick md0_reclaim and take a look: it is waiting for the
superblock update to complete. We can see there were two pending
superblock writes and other pending IO on the underlying physical
disks, which is what caused these processes to hang. (A rough sketch
of the wait/wake pair involved follows the structure dumps below.)

crash> bt 876
PID: 876      TASK: ffff888108bebb00  CPU: 3    COMMAND: "md0_reclaim"
  #0 [ffffc900008c3d10] __schedule at ffffffff81ac18ac
  #1 [ffffc900008c3d70] schedule at ffffffff81ac1d82
  #2 [ffffc900008c3d88] md_super_wait at ffffffff817df27a
  #3 [ffffc900008c3dd0] md_update_sb at ffffffff817df609
  #4 [ffffc900008c3e20] r5l_do_reclaim at ffffffff817d1cf4
  #5 [ffffc900008c3e98] md_thread at ffffffff817db1ef
  #6 [ffffc900008c3ef8] kthread at ffffffff8114f8ee
  #7 [ffffc900008c3f30] ret_from_fork at ffffffff8108bb98
  #8 [ffffc900008c3f50] ret_from_fork_asm at ffffffff81000da1

crash> mddev.pending_writes,disks 0xffff888108335800
   pending_writes = {
     counter = 2  <<<<<<<<<< 2 active superblock writes
   },
   disks = {
     next = 0xffff88810ce85a00,
     prev = 0xffff88810ce84c00
   },
crash> list -l md_rdev.same_set -s md_rdev.kobj.name,nr_pending 0xffff88810ce85a00
ffff88810ce85a00
   kobj.name = 0xffff8881067c1a00 "dev-dm-1",
   nr_pending = {
     counter = 0
   },
ffff8881083ace00
   kobj.name = 0xffff888100a93280 "dev-sde",
   nr_pending = {
     counter = 10 <<<<
   },
ffff8881010ad200
   kobj.name = 0xffff8881012721c8 "dev-sdc",
   nr_pending = {
     counter = 8 <<<<<
   },
ffff88810ce84c00
   kobj.name = 0xffff888100325f08 "dev-sdd",
   nr_pending = {
     counter = 2 <<<<<
   },
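
For reference, the wait/wake pair md0_reclaim is stuck on looks roughly
like this (paraphrased from my reading of drivers/md/md.c, not verbatim
kernel code):

/* Paraphrased sketch: md_update_sb() submits one superblock bio per
 * rdev and bumps mddev->pending_writes; the bio completion handler
 * super_written() drops the count and wakes sb_wait, so md_super_wait()
 * cannot return until every superblock bio completes. In the vmcore,
 * pending_writes is still 2, so md0_reclaim stays blocked here.
 */
static void md_super_wait(struct mddev *mddev)
{
        wait_event(mddev->sb_wait,
                   atomic_read(&mddev->pending_writes) == 0);
}

static void super_written(struct bio *bio)
{
        struct md_rdev *rdev = bio->bi_private;
        struct mddev *mddev = rdev->mddev;

        /* error handling and rdev accounting elided */
        if (atomic_dec_and_test(&mddev->pending_writes))
                wake_up(&mddev->sb_wait);
        bio_put(bio);
}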

3. From the block layer, I can find the in-flight IO for the md
superblock write; it has been pending for 955s, which matches the hang
time of "md0_reclaim". (The arithmetic behind that number is spelled
out after the crash output below.)

crash> request.q,mq_hctx,cmd_flags,rq_flags,start_time_ns,bio,biotail,state,__data_len,flush,end_io ffff888103b4c300
   q = 0xffff888103a00d80,
   mq_hctx = 0xffff888103c5d200,
   cmd_flags = 38913,
   rq_flags = 139408,
   start_time_ns = 1504179024146,
   bio = 0x0,
   biotail = 0xffff888120758e40,
   state = MQ_RQ_COMPLETE,
   __data_len = 0,
   flush = {
     seq = 3, <<<< REQ_FSEQ_PREFLUSH |  REQ_FSEQ_DATA
     saved_end_io = 0x0
   },
   end_io = 0xffffffff815186e0 <mq_flush_data_end_io>,

crash> p tk_core.timekeeper.tkr_mono.base
$1 = 2459916243002
crash> eval 2459916243002-1504179024146
hexadecimal: de86609f28
     decimal: 955737218856  <<<<<<< IO pending time is 955s
       octal: 15720630117450
      binary: 0000000000000000000000001101111010000110011000001001111100101000

crash> bio.bi_iter,bi_end_io 0xffff888120758e40
   bi_iter = {
     bi_sector = 8, <<<< super block offset
     bi_size = 0,
     bi_idx = 0,
     bi_bvec_done = 0
   },
   bi_end_io = 0xffffffff817dca50 <super_written>,
crash> dev -d | grep ffff888103a00d80
     8 ffff8881003ab000   sdd        ffff888103a00d80       0 0     0
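
Just to make the unit conversion explicit, here is the same arithmetic
in C (values copied from the crash output above):

#include <stdio.h>

int main(void)
{
        /* Values copied from the crash output above; plain ns arithmetic. */
        unsigned long long now_ns   = 2459916243002ULL; /* tkr_mono.base         */
        unsigned long long start_ns = 1504179024146ULL; /* request.start_time_ns */
        unsigned long long age_ns   = now_ns - start_ns;

        /* prints: pending for 955737218856 ns (~955 s) */
        printf("pending for %llu ns (~%llu s)\n", age_ns, age_ns / 1000000000ULL);
        return 0;
}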

4. Checking the above request: even though its state is
"MQ_RQ_COMPLETE", it is still pending. That's because each md
superblock write was marked with REQ_PREFLUSH | REQ_FUA, so it is
handled in 3 steps: pre_flush, data, and post_flush. As each step
completes, it is marked in "request.flush.seq"; here the value is 3,
which is REQ_FSEQ_PREFLUSH | REQ_FSEQ_DATA, so the last step,
"post_flush", has not been done. Another weird thing is that
blk_flush_queue.flush_data_in_flight is still 1 even though the "data"
step is already done. (The flush sequence flags are sketched after the
structure dumps below.)

crash> blk_mq_hw_ctx.fq 0xffff888103c5d200
   fq = 0xffff88810332e240,
crash> blk_flush_queue 0xffff88810332e240
struct blk_flush_queue {
   mq_flush_lock = {
     {
       rlock = {
         raw_lock = {
           {
             val = {
               counter = 0
             },
             {
               locked = 0 '\000',
               pending = 0 '\000'
             },
             {
               locked_pending = 0,
               tail = 0
             }
           }
         }
       }
     }
   },
   flush_pending_idx = 1,
   flush_running_idx = 1,
   rq_status = 0 '\000',
   flush_pending_since = 4296171408,
   flush_queue = {{
       next = 0xffff88810332e250,
       prev = 0xffff88810332e250
     }, {
       next = 0xffff888103b4c348, <<<< the request is in this list
       prev = 0xffff888103b4c348
     }},
   flush_data_in_flight = 1,  >>>>>> still 1
   flush_rq = 0xffff888103c2e000
}

crash> list 0xffff888103b4c348
ffff888103b4c348
ffff88810332e260

crash> request.tag,state,ref 0xffff888103c2e000 >>>> flush_rq of hw queue
   tag = -1,
   state = MQ_RQ_IDLE,
   ref = {
     counter = 0
   },
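
For context on the seq value, the flush sequence bookkeeping lives in
block/blk-flush.c; the flag values below are from my reading of that
file, and the helper is purely illustrative (not kernel code):

/* Flush sequence steps as accumulated in request.flush.seq
 * (flag values as I recall them from block/blk-flush.c).
 */
enum {
        REQ_FSEQ_PREFLUSH  = (1 << 0),   /* pre-flush step completed  */
        REQ_FSEQ_DATA      = (1 << 1),   /* data write step completed */
        REQ_FSEQ_POSTFLUSH = (1 << 2),   /* post-flush step completed */
        REQ_FSEQ_DONE      = (1 << 3),
};

/* Illustrative helper mirroring the reasoning above: seq == 3 means
 * only PREFLUSH and DATA are marked done, so the post-flush step is
 * still outstanding and the request stays parked on the flush queue
 * even though its blk-mq state reads MQ_RQ_COMPLETE.
 */
static int postflush_outstanding(unsigned int seq)
{
        return (seq & REQ_FSEQ_DATA) && !(seq & REQ_FSEQ_POSTFLUSH);
}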

5. It looks like the block layer or the underlying driver
(scsi/virtio-scsi) may have some issue that leaves the IO request from
the md layer in a partially completed state. I can't see how this can
be related to commit bed9e27baf52 ("Revert "md/raid5: Wait for
MD_SB_CHANGE_PENDING in raid5d"").


Dan,

Are you able to reproduce this using a regular SCSI disk? I would like
to rule out whether this is related to virtio-scsi.

And I see the kernel version in the vmcore is 6.8.0-rc5; is this the
official mainline v6.8-rc5 without any other patches?


Thanks,

Junxiao.

On 2/23/24 6:13 PM, Song Liu wrote:
> Hi,
>
> On Fri, Feb 23, 2024 at 12:07 AM Linux regression tracking (Thorsten
> Leemhuis) <regressions@leemhuis.info> wrote:
>> On 21.02.24 00:06, Dan Moulding wrote:
>>> Just a friendly reminder that this regression still exists on the
>>> mainline. It has been reverted in 6.7 stable. But I upgraded a
>>> development system to 6.8-rc5 today and immediately hit this issue
>>> again. Then I saw that it hasn't yet been reverted in Linus' tree.
>> Song Liu, what's the status here? I'm aware that you fixed quite a
>> few regressions recently, but it seems like resolving this one is
>> stalled. Or were you able to reproduce the issue or make some progress
>> and I just missed it?
> Sorry for the delay with this issue. I have been occupied with some
> other stuff this week.
>
> I haven't had any luck reproducing this issue. I will spend more time
> looking into it next week.
>
>> And if not, what's the way forward here wrt the release of 6.8?
>> Revert the culprit and try again later? Or is that not an option for one
>> reason or another?
> If we don't make progress with it in the next week, we will do the revert,
> same as we did with stable kernels.
>
>> Or do we assume that this is not a real issue? That it's caused by some
>> oddity (bit-flip in the metadata or something like that?) only to be
>> found in Dan's setup?
> I don't think this is because of oddities. Hopefully we can get more
> information about this soon.
>
> Thanks,
> Song
>
>> Ciao, Thorsten (wearing his 'the Linux kernel's regression tracker' hat)
>> --
>> Everything you wanna know about Linux kernel regression tracking:
>> https://linux-regtracking.leemhuis.info/about/#tldr
>> If I did something stupid, please tell me, as explained on that page.
>>
>> #regzbot poke
>>

Thread overview: 56+ messages
2024-01-23  0:56 [REGRESSION] 6.7.1: md: raid5 hang and unresponsive system; successfully bisected Dan Moulding
2024-01-23  1:08 ` Song Liu
2024-01-23  1:35 ` Dan Moulding
2024-01-23  6:35   ` Song Liu
2024-01-23 21:53     ` Dan Moulding
2024-01-23 22:21       ` Song Liu
2024-01-23 23:58         ` Dan Moulding
2024-01-25  0:01           ` Song Liu
2024-01-25 16:44             ` junxiao.bi
2024-01-25 19:40               ` Song Liu
2024-01-25 20:31               ` Dan Moulding
2024-01-26  3:30                 ` Carlos Carvalho
2024-01-26 15:46                   ` Dan Moulding
2024-01-30 16:26                     ` Blazej Kucman
2024-01-30 20:21                       ` Song Liu
2024-01-31  1:26                       ` Song Liu
2024-01-31  2:13                         ` Yu Kuai
2024-01-31  2:41                       ` Yu Kuai
2024-01-31  4:55                         ` Song Liu
2024-01-31 13:36                           ` Blazej Kucman
2024-02-01  1:39                             ` Yu Kuai
2024-01-26 16:21                   ` Roman Mamedov
2024-01-31 17:37                 ` junxiao.bi
2024-02-06  8:07                 ` Song Liu
2024-02-06 20:56                   ` Dan Moulding
2024-02-06 21:34                     ` Song Liu
2024-02-20 23:06 ` Dan Moulding
2024-02-20 23:15   ` junxiao.bi
2024-02-21 14:50     ` Mateusz Kusiak
2024-02-21 19:15       ` junxiao.bi
2024-02-23 17:44     ` Dan Moulding
2024-02-23 19:18       ` junxiao.bi
2024-02-23 20:22         ` Dan Moulding
2024-02-23  8:07   ` Linux regression tracking (Thorsten Leemhuis)
2024-02-24  2:13     ` Song Liu
2024-02-25 17:46       ` Thomas B. Clark
2024-02-26  1:17         ` Thomas B. Clark
2024-02-26 17:35           ` Song Liu
2024-03-01 20:26       ` junxiao.bi [this message]
2024-03-01 23:12         ` Dan Moulding
2024-03-02  0:05           ` Song Liu
2024-03-06  8:38             ` Linux regression tracking (Thorsten Leemhuis)
2024-03-06 17:13               ` Song Liu
2024-03-02 16:55         ` Dan Moulding
2024-03-07  3:34         ` Yu Kuai
2024-03-08 23:49         ` junxiao.bi
2024-03-10  5:13           ` Dan Moulding
2024-03-11  1:50           ` Yu Kuai
2024-03-12 22:56             ` junxiao.bi
2024-03-13  1:20               ` Yu Kuai
2024-03-14 18:20                 ` junxiao.bi
2024-03-14 22:36                   ` Song Liu
2024-03-15  1:30                   ` Yu Kuai
2024-03-14 16:12             ` Dan Moulding
2024-03-15  1:17               ` Yu Kuai
2024-03-19 14:16                 ` Dan Moulding
