* [PATCH 0/1] rwsem_down_write_slowpath check if oob() before skipping schedule()
@ 2023-05-30 19:04 Dave Rolenc
  2023-05-31  6:22 ` Philippe Gerum
  0 siblings, 1 reply; 5+ messages in thread
From: Dave Rolenc @ 2023-05-30 19:04 UTC (permalink / raw)
  To: xenomai; +Cc: Dave Rolenc, Russell Johnson

This fixes a timing-dependent bug that would cause a tight loop without
schedule() when running in-band. The rwsem would get released enough to
allow this tight loop without schedule() to execute when the handoff
bit got set, but it would never allow the previous on-CPU owner to
finish releasing the rwsem. Checking whether we are running oob before
skipping the schedule() seems to fix the problem. This problem was nasty
when the global kernfs_rwsem was the target semaphore, since many tasks
ended up in D-state waiting on the kernfs_rwsem.

We would typically see this problem when repeatedly starting/stopping an
EVL-enabled application, which would create and tear-down a lot of EVL
elements. Eventually, the timing worked out to where we'd get a stuck
CPU message and a bunch of tasks in D-state waiting on the kernfs_rwsem.
The culprit task was stuck in a forever loop in
rwsem_down_write_slowpath() with the handoff bit set, skipping the
schedule() call.

This patch is based on 5.15.98.

This is also my first patch submission to this group, so if you need me
to follow some specific procedure or checklist for patch submission,
please let me know and I'll comply.

Dave Rolenc (1):
  rwsem_down_write_slowpath check if oob() before skipping schedule()

 kernel/locking/rwsem.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

-- 
2.39.1



^ permalink raw reply	[flat|nested] 5+ messages in thread

* Re: [PATCH 0/1] rwsem_down_write_slowpath check if oob() before skipping schedule()
  2023-05-30 19:04 [PATCH 0/1] rwsem_down_write_slowpath check if oob() before skipping schedule() Dave Rolenc
@ 2023-05-31  6:22 ` Philippe Gerum
  2023-05-31 15:27   ` Dave Rolenc
  0 siblings, 1 reply; 5+ messages in thread
From: Philippe Gerum @ 2023-05-31  6:22 UTC (permalink / raw)
  To: Dave Rolenc; +Cc: xenomai, Russell Johnson


Dave Rolenc <Dave.Rolenc@kratosdefense.com> writes:

> This fixes a timing-dependent bug that would cause a tight loop without
> schedule() when running in-band. The rwsem would get released enough to
> allow this tight loop without schedule() to execute when the handoff
> bit got set, but it would never allow the previous on-CPU owner to
> finish releasing the rwsem. Checking whether we are running oob before
> skipping the schedule() seems to fix the problem. This problem was nasty
> when the global kernfs_rwsem was the target semaphore, since many tasks
> ended up in D-state waiting on the kernfs_rwsem.
>
> We would typically see this problem when repeatedly starting/stopping an
> EVL-enabled application, which would create and tear-down a lot of EVL
> elements. Eventually, the timing worked out to where we'd get a stuck
> CPU message and a bunch of tasks in D-state waiting on the kernfs_rwsem.
> The culprit task was stuck in a forever loop in
> rwsem_down_write_slowpath() with the handoff bit set, skipping the
> schedule() call.
>
> This patch is based on 5.15.98.
>
> This is also my first patch submission to this group, so if you need me
> to follow some specific procedure or checklist for patch submission,
> please let me know and I'll comply.
>
> Dave Rolenc (1):
>   rwsem_down_write_slowpath check if oob() before skipping schedule()
>
>  kernel/locking/rwsem.c | 2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)

Mm, nasty. Thanks for digging into this. What's even worse and the root
cause of that issue is that no task on the oob stage should ever run
rwsem_down_write_slowpath() in the first place. Generally speaking, oob
execution is an NMI-like context, therefore taking regular locks and/or
(re)scheduling is not allowed, which means rwsem_down_write_slowpath()
should not be entered by a task running oob. In other words, Houston
we've had a problem here.

In order to figure out what happened, could you please check whether the
following patch causes a kernel splat when the bug triggers?  We are
trying to find out who calls rwsem_down_write_slowpath() from the wrong
context. The second hunk in this patch makes sure nothing weird
happens when transitioning from the in-band to the oob context - when
Dovetail is enabled, the tail code of the logic performing such a
transition lives in the regular/in-band schedule() code, but that should
never leak into the common in-band code.

diff --git a/kernel/locking/rwsem.c b/kernel/locking/rwsem.c
index f0287a16b4ec8..12f4956842fdb 100644
--- a/kernel/locking/rwsem.c
+++ b/kernel/locking/rwsem.c
@@ -1070,6 +1070,8 @@ rwsem_down_write_slowpath(struct rw_semaphore *sem, int state)
 	struct rwsem_waiter waiter;
 	DEFINE_WAKE_Q(wake_q);
 
+	WARN_ON_ONCE(running_oob());
+
 	/* do optimistic spinning and steal lock if possible */
 	if (rwsem_can_spin_on_owner(sem) && rwsem_optimistic_spin(sem)) {
 		/* rwsem_optimistic_spin() implies ACQUIRE on success */
@@ -1155,6 +1157,7 @@ rwsem_down_write_slowpath(struct rw_semaphore *sem, int state)
 		}
 
 		schedule();
+		WARN_ON_ONCE(running_oob());
 		lockevent_inc(rwsem_sleep_writer);
 		set_current_state(state);
 trylock_again:

-- 
Philippe.

^ permalink raw reply related	[flat|nested] 5+ messages in thread

* Re: [PATCH 0/1] rwsem_down_write_slowpath check if oob() before skipping schedule()
  2023-05-31  6:22 ` Philippe Gerum
@ 2023-05-31 15:27   ` Dave Rolenc
  2023-05-31 16:55     ` Philippe Gerum
  0 siblings, 1 reply; 5+ messages in thread
From: Dave Rolenc @ 2023-05-31 15:27 UTC (permalink / raw)
  To: Philippe Gerum; +Cc: xenomai, Russell Johnson

> What's even worse and the root cause of that issue is that no task on
> the oob stage should ever run rwsem_down_write_slowpath() in the first
> place.

That is probably true, and was likely my misunderstanding of the context
in which it could run. I was simply trying to keep as much of the normal
code path found in vanilla as possible when constructing the patch.

My first attempt at a fix simply commented out the OWNER_NULL check and
goto that skips the schedule() call like this:

                if (waiter.handoff_set) {
                        enum owner_state owner_state;

                        preempt_disable();
                        owner_state = rwsem_spin_on_owner(sem);
                        preempt_enable();

                     /*   if (owner_state == OWNER_NULL)
                                goto trylock_again; */
                }

                schedule();
                lockevent_inc(rwsem_sleep_writer);
                set_current_state(state);
trylock_again:
                raw_spin_lock_irq(&sem->wait_lock);
        }


That test with the commented-out lines worked and the problem went
away, but it seemed like a pretty big hammer. I figured trying the oob
check to preserve as much of the normal vanilla kernel code path as
possible would probably be better. So I ended up with this:

                if (waiter.handoff_set) {
                        enum owner_state owner_state;

                        preempt_disable();
                        owner_state = rwsem_spin_on_owner(sem);
                        preempt_enable();

                        if (owner_state == OWNER_NULL && running_oob())
                                goto trylock_again;
                }

                schedule();
                lockevent_inc(rwsem_sleep_writer);
                set_current_state(state);
trylock_again:
                raw_spin_lock_irq(&sem->wait_lock);
        }


Assuming you're correct that this code never runs out of band, that
just means the schedule() call will never get skipped by the goto, since
running_oob() always returns false. The important part I observed is
that once the goto trylock_again gets hit, the task never gets out of
that loop. If we make it so the schedule() chunk never gets skipped,
things work fine.

Empirical evidence shows that never skipping that schedule() call makes
the problem go away. My test scenario, which is way too involved to
package up for you and involves custom hardware, will run a couple hours
max before getting a stuck CPU. With the patch it ran over 4 days
without issue. Assuming running_oob() always returns false, the code
should be roughly equivalent to commenting out the lines as I did in my
first attempt. My first attempt at commenting out the lines also worked
fine for over 24 hours.  I wish I had more of a definitive answer to the
other task involved, but the stack traces didn't really help there. Not
having a fully working kernel debugger kind of limited what I could see,
so I had to sample stack traces and figure out which code path it was
taking by piecing together the information I had.

I can definitely run your suggested test with the WARN_ON_ONCE if you
really want, but I don't think the cause is some oob context running
this code. It was just my misunderstanding that this code could run
oob, coupled with my desire not to delete those lines if at all possible.

Do you have any thoughts on how to proceed?

Dave

^ permalink raw reply	[flat|nested] 5+ messages in thread

* Re: [PATCH 0/1] rwsem_down_write_slowpath check if oob() before skipping schedule()
  2023-05-31 15:27   ` Dave Rolenc
@ 2023-05-31 16:55     ` Philippe Gerum
  2023-05-31 21:48       ` [External] - " Dave Rolenc
  0 siblings, 1 reply; 5+ messages in thread
From: Philippe Gerum @ 2023-05-31 16:55 UTC (permalink / raw)
  To: Dave Rolenc; +Cc: xenomai, Russell Johnson


Dave Rolenc <Dave.Rolenc@kratosdefense.com> writes:

>> What's even worse and the root cause of that issue is that no task on
>> the oob stage should ever run rwsem_down_write_slowpath() in the first
>> place.
>

[snip]

>
> Assuming you're correct that this code never runs out of band, that
> just means the schedule() call will never get skipped by the goto, since
> running_oob() always returns false. The important part I observed is
> that once the goto trylock_again gets hit, the task never gets out of
> that loop. If we make it so the schedule() chunk never gets skipped,
> things work fine.
>
> Empirical evidence shows that never skipping that schedule() call makes
> the problem go away. My test scenario, which is way too involved to
> package up for you and involves custom hardware, will run a couple hours
> max before getting a stuck CPU. With the patch it ran over 4 days
> without issue. Assuming running_oob() always returns false, the code
> should be roughly equivalent to commenting out the lines as I did in my
> first attempt. My first attempt at commenting out the lines also worked
> fine for over 24 hours.  I wish I had more of a definitive answer to the
> other task involved, but the stack traces didn't really help there. Not
> having a fully working kernel debugger kind of limited what I could see,
> so I had to sample stack traces and figure out which code path it was
> taking by piecing together the information I had.
>
> I can definitely run your suggested test with the WARN_ON_ONCE if you
> really want, but I don't think the cause is some oob context running
> this code. It was just my misunderstanding that this code could run
> oob, coupled with my desire not to delete those lines if at all possible.
>

Ok, got it. Adding running_oob(), which should always evaluate to false,
only prevents the code from spinning, papering over the issue. Replacing
running_oob() by dovetailing() would achieve the same purpose. If so, my
patch does not bring anything valuable. Besides, if the Dovetail debug
is compiled in, a bad (oob) context running strictly in-band code would
most certainly have caused some existing assertions to trigger anyway.

> Do you have any thoughts on how to proceed?
>

First thing would be to enable CONFIG_DEBUG_RWSEMS and re-run an
overnight test without any patch in, hoping for the native debug
infrastructure to give us some hint.

-- 
Philippe.

^ permalink raw reply	[flat|nested] 5+ messages in thread

* RE: [External] - Re: [PATCH 0/1] rwsem_down_write_slowpath check if oob() before skipping schedule()
  2023-05-31 16:55     ` Philippe Gerum
@ 2023-05-31 21:48       ` Dave Rolenc
  0 siblings, 0 replies; 5+ messages in thread
From: Dave Rolenc @ 2023-05-31 21:48 UTC (permalink / raw)
  To: Philippe Gerum; +Cc: xenomai, Russell Johnson

> First thing would be to enable CONFIG_DEBUG_RWSEMS and re-run an
> overnight test without any patch in, hoping for the native debug
> infrastructure to give us some hint.

That didn't turn up anything useful, but I did find this in the 5.15.99
changelog, which sounds similar to what I am seeing
(https://mirrors.edge.kernel.org/pub/linux/kernel/v5.x/ChangeLog-5.15.99):

>> commit db1c5ec57611de759b56c82d273c30ebfc7c25c0
>> Author: Waiman Long <longman@redhat.com>
>> Date:   Wed Jan 25 19:36:25 2023 -0500

>>    locking/rwsem: Prevent non-first waiter from spinning in down_write() slowpath
    
>>    commit b613c7f31476c44316bfac1af7cac714b7d6bef9 upstream.
    
>>    A non-first waiter can potentially spin in the for loop of
>>    rwsem_down_write_slowpath() without sleeping but fail to acquire the
>>   lock even if the rwsem is free if the following sequence happens:
    
>>      Non-first RT waiter    First waiter      Lock holder
>>      -------------------    ------------      -----------
>>      Acquire wait_lock
>>      rwsem_try_write_lock():
>>        Set handoff bit if RT or
>>          wait too long
>>        Set waiter->handoff_set
>>      Release wait_lock
>>                             Acquire wait_lock
>>                             Inherit waiter->handoff_set
>>                             Release wait_lock
>>                                               Clear owner
>>                                               Release lock
>>      if (waiter.handoff_set) {
>>        rwsem_spin_on_owner();
>>        if (OWNER_NULL)
>>          goto trylock_again;
>>      }
>>      trylock_again:
>>      Acquire wait_lock
>>      rwsem_try_write_lock():
>>         if (first->handoff_set && (waiter != first))
>>            return false;
>>      Release wait_lock
    
>>    A non-first waiter cannot really acquire the rwsem even if it mistakenly
>>    believes that it can spin on OWNER_NULL value. If that waiter happens
>>    to be an RT task running on the same CPU as the first waiter, it can
>>    block the first waiter from acquiring the rwsem leading to live lock.
>>    Fix this problem by making sure that a non-first waiter cannot spin in
>>    the slowpath loop without sleeping.

It looks like there are some other fixes to rwsem.c in there as well.

The answer may be to try something more recent than 5.15.98. Looks like
5.15.112 is out there, so I'll give that a try.

Thanks!

Dave






^ permalink raw reply	[flat|nested] 5+ messages in thread

end of thread, other threads:[~2023-05-31 21:50 UTC | newest]

Thread overview: 5+ messages
2023-05-30 19:04 [PATCH 0/1] rwsem_down_write_slowpath check if oob() before skipping schedule() Dave Rolenc
2023-05-31  6:22 ` Philippe Gerum
2023-05-31 15:27   ` Dave Rolenc
2023-05-31 16:55     ` Philippe Gerum
2023-05-31 21:48       ` [External] - " Dave Rolenc
