From: john.p.donnelly@oracle.com
To: Waiman Long <longman@redhat.com>,
	chenguanyou <chenguanyou9338@gmail.com>,
	gregkh@linuxfoundation.org
Cc: dave@stgolabs.net, hdanton@sina.com,
	linux-kernel@vger.kernel.org, mazhenhua@xiaomi.com,
	mingo@redhat.com, peterz@infradead.org, quic_aiquny@quicinc.com,
	will@kernel.org, sashal@kernel.org
Subject: Re: [PATCH v5] locking/rwsem: Make handoff bit handling more consistent
Date: Tue, 26 Apr 2022 16:22:05 -0500
Message-ID: <c0a068e7-18e3-87df-676c-e8270cd732b6@oracle.com>
In-Reply-To: <020aef66-6911-77e7-fd1a-25506dfcd3df@redhat.com>

On 4/26/22 3:21 PM, Waiman Long wrote:
> On 4/20/22 09:55, john.p.donnelly@oracle.com wrote:
>> On 4/12/22 11:28 AM, john.p.donnelly@oracle.com wrote:
>>> On 4/11/22 4:07 PM, Waiman Long wrote:
>>>>
>>>> On 4/11/22 17:03, john.p.donnelly@oracle.com wrote:
>>>>>
>>>>>>>
>>>>>>> I have reached out to Waiman and he suggested this for our next 
>>>>>>> test pass:
>>>>>>>
>>>>>>>
>>>>>>> 1ee326196c6658 locking/rwsem: Always try to wake waiters in 
>>>>>>> out_nolock path
>>>>>>
>>>>>> Does this commit help to avoid the lockup problem?
>>>>>>
>>>>>> Commit 1ee326196c6658 fixes a potential missed wakeup problem when 
>>>>>> a reader first in the wait queue is interrupted out without 
>>>>>> acquiring the lock. It is actually not a fix for commit 
>>>>>> d257cc8cb8d5. However, this commit changes the out_nolock path 
>>>>>> behavior of writers by leaving the handoff bit set when the wait 
>>>>>> queue isn't empty. That likely makes the missed wakeup problem 
>>>>>> easier to reproduce.
>>>>>>
>>>>>> Cheers,
>>>>>> Longman
>>>>>>
>>>>>
>>>>> Hi,
>>>>>
>>>>>
>>>>> We are testing now
>>>>>
>>>>> ETA for fio soak test completion is  ~15hr from now.
>>>>>
>>>>> I wanted to share the stack traces for future reference + occurrences.
>>>>>
>>>> I am looking forward to your testing results tomorrow.
>>>>
>>>> Cheers,
>>>> Longman
>>>>
>>> Hi
>>>
>>>   Our 24hr fio soak test with :
>>>
>>>   1ee326196c6658 locking/rwsem: Always try to wake waiters in 
>>> out_nolock path
>>>
>>>
>>>   applied to 5.15.30  passed.
>>>
>>>   I suggest you append  1ee326196c6658 with :
>>>
>>>
>>>   cc: stable
>>>
>>>    Fixes: d257cc8cb8d5 ("locking/rwsem: Make handoff bit handling 
>>> more consistent")
>>>
>>>
>>> I'll leave the implementation details up to the core maintainers how 
>>> to do that ;-)
>>>
>>> ...
>>>
>>> Thank you
>>>
>>> John.
>>
>> Hi ,
>>
>>
>>  We have observed another panic with :
>>
>>  1ee326196c6658 locking/rwsem: Always try to wake waiters in out_nolock
>>  path
>>
>>  Applied to 5.15.30 :
>>
>>
> Sorry for the late reply as I was busy with other important tasks.
> 
> When you said panic, you mean a system hang, not an actual panic. Right?

Hi,

Our setups enable all of the panic-on-hung-task, panic-on-oops and 
related sysctls (paths below are relative to /proc):

./sys/kernel/hardlockup_panic
./sys/kernel/hung_task_panic
./sys/kernel/max_rcu_stall_to_panic
./sys/kernel/panic
./sys/kernel/panic_on_io_nmi
./sys/kernel/panic_on_oops
./sys/kernel/panic_on_rcu_stall
./sys/kernel/panic_on_unrecovered_nmi
./sys/kernel/panic_on_warn
./sys/kernel/panic_print
./sys/kernel/softlockup_panic
./sys/kernel/unknown_nmi_panic
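
For anyone comparing setups, here is a minimal sketch of how these 
settings could be dumped on a test machine. It assumes the files live 
under /proc/sys/kernel; the helper name read_panic_sysctls is only 
illustrative, and which entries exist depends on the kernel version and 
config, so missing ones are skipped:

```python
from pathlib import Path

# Names taken from the list above; not every kernel exposes all of them.
PANIC_SYSCTLS = [
    "hardlockup_panic", "hung_task_panic", "max_rcu_stall_to_panic", "panic",
    "panic_on_io_nmi", "panic_on_oops", "panic_on_rcu_stall",
    "panic_on_unrecovered_nmi", "panic_on_warn", "panic_print",
    "softlockup_panic", "unknown_nmi_panic",
]


def read_panic_sysctls(root="/proc/sys/kernel"):
    """Return {name: value} for each listed sysctl readable under root."""
    values = {}
    for name in PANIC_SYSCTLS:
        path = Path(root) / name
        try:
            values[name] = path.read_text().strip()
        except OSError:
            # Absent on this kernel/config, or not readable: skip it.
            continue
    return values


if __name__ == "__main__":
    for name, value in read_panic_sysctls().items():
        print(f"kernel.{name} = {value}")
```

A non-zero value for an entry such as hung_task_panic means the 
corresponding condition escalates to a panic instead of only logging.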


The machine is unusable when this occurs.


> 
> 
>> PID: 3789   TASK: ffff900fc409b300  CPU: 29  COMMAND: "dio/dm-0"
>>  #0 [fffffe00006bce50] crash_nmi_callback at ffffffff97c772c3
>>  #1 [fffffe00006bce58] nmi_handle at ffffffff97c40778
>>  #2 [fffffe00006bcea0] default_do_nmi at ffffffff988161e2
>>  #3 [fffffe00006bcec8] exc_nmi at ffffffff9881648d
>>  #4 [fffffe00006bcef0] end_repeat_nmi at ffffffff98a0153b
>>     [exception RIP: _raw_spin_lock_irq+35]
>>     RIP: ffffffff98827333  RSP: ffffa9320917fc78  RFLAGS: 00000046
>>     RAX: 0000000000000000  RBX: ffff900fc409b300  RCX: 0000000000000000
>>     RDX: 0000000000000000  RSI: 0000000000000000  RDI: 0000000000000000
>>     RBP: ffffa9320917fd20   R8: 0000000000000000   R9: 0000000000000000
>>     R10: 0000000000000000  R11: 0000000000000000  R12: ffff90006259546c
>>     R13: ffffa9320917fcb0  R14: ffff900062595458  R15: 0000000000000000
>>     ORIG_RAX: ffffffffffffffff  CS: 0010  SS: 0018
>> --- <NMI exception stack> ---
>>  #5 [ffffa9320917fc78] _raw_spin_lock_irq at ffffffff98827333
>>  #6 [ffffa9320917fc78] rwsem_down_write_slowpath at ffffffff97d25d49
>>  #7 [ffffa9320917fd28] ext4_map_blocks at ffffffffc104b6dc [ext4]
>>  #8 [ffffa9320917fd98] ext4_convert_unwritten_extents at 
>> ffffffffc10369e0 [ext4]
>>  #9 [ffffa9320917fdf0] ext4_dio_write_end_io at ffffffffc103b2aa [ext4]
>> #10 [ffffa9320917fe18] iomap_dio_complete at ffffffff98013f45
>> #11 [ffffa9320917fe48] iomap_dio_complete_work at ffffffff98014047
>> #12 [ffffa9320917fe60] process_one_work at ffffffff97cd9191
>> #13 [ffffa9320917fea8] rescuer_thread at ffffffff97cd991b
>> #14 [ffffa9320917ff10] kthread at ffffffff97ce11f7
>> #15 [ffffa9320917ff50] ret_from_fork at ffffffff97c04cf2
>> crash>
>>
>>
>> The failure is observed running the "fio test suite" as a 24-hour soak 
>> test on an LVM volume composed of four NVMe devices, on a 72-core Intel 
>> server. The test cycles through a variety of file-system types.
>>
>>
>> This kernel has these commits
>>
>> 1ee326196c6658 locking/rwsem: Always try to wake waiters in 
>> out_nolock  path
>>
>> d257cc8cb8d5 ("locking/rwsem: Make handoff bit handling more consistent")
>>
>> In earlier testing I had reverted d257cc8cb8d5 and did not observe 
>> said panics.  I still feel d257cc8cb8d5 is the root cause.
> 
> So it is possible that 1ee326196c6658 does not completely eliminate the 
> missed wakeup situation.
> 
> Regards,
> Longman
> 

