From: Hillf Danton <hdanton@sina.com>
To: Doug Anderson <dianders@chromium.org>
Cc: Waiman Long <longman@redhat.com>,
	Peter Zijlstra <peterz@infradead.org>,
	Will Deacon <will@kernel.org>,
	Davidlohr Bueso <dave@stgolabs.net>,
	linux-mm@kvack.org, LKML <linux-kernel@vger.kernel.org>
Subject: Re: [PATCH v5] locking/rwsem: Make handoff bit handling more consistent
Date: Tue, 19 Jul 2022 18:41:04 +0800	[thread overview]
Message-ID: <20220719104104.1634-1-hdanton@sina.com> (raw)
In-Reply-To: <CAD=FV=URCo5xv3k3jWbxV1uRkUU5k6bcnuB1puZhxayEyVc6-A@mail.gmail.com>

On Mon, 18 Jul 2022 17:27:28 -0700 Doug Anderson <dianders@chromium.org> wrote:
> 
> I've been tracking down an occasional hang at reboot on my system and
> I've ended up at this as the first bad commit. I will not pretend to
> understand the intricacies of the rwsem implementation, but I can
> describe what I saw. I have also produced a fairly small test case
> that reproduces the problem rather quickly.
> 
> First, what I saw:
> 
> My system failed to fully boot up and eventually the "hung task"
> detection kicked in. Many tasks in my system were hung, all waiting on
> the "kernfs_rwsem". No task actually held the semaphore--it only had
> tasks waiting.
> 
> Of the waiting tasks, 3 were doing a down_write(). The rest were all
> waiting on down_read().
> 
> Two of the tasks waiting on the down_write() were pinned to CPU0. One
> of them was a bound kworker; the other was a threaded IRQ handler. The
> threaded IRQ handler was set to "real time" priority because
> setup_irq_thread() calls sched_set_fifo().
> 
> At the time the hung task detector kicked in, the real time task was
> actually active on a CPU. Specifically, it was running in the for (;;)
> loop in rwsem_down_write_slowpath(), and rwsem_try_write_lock() had
> clearly just returned false, which meant we didn't get the lock.
> Everything else was sitting in schedule().
> 
> I managed to get the real time task into kgdb and I could analyze its
> state as well as the state of "sem". The real time task was _not_ the
> first waiter. The kworker was the first waiter. The
> "waiter.handoff_set" was set to "true" for the real time task. The
> rwsem owner was OWNER_NULL.
> 
> Looking through the code, here is what was happening:
> 
> 1. The function rwsem_try_write_lock() was instantly returning false
> since `handoff` was set and we were not the first waiter.
> 2. After we got back into rwsem_down_write_slowpath() we would see the
> handoff bit set and try to spin on the owner. There was no owner, so
> this was a no-op.
> 3. Since there was no owner, we went right back to the start of the loop.
> 
> So basically the real time thread (the threaded IRQ handler) was
> pinned to CPU0 and spinning as fast as possible. The "first waiter"
> for the semaphore was blocked from running because it could only run
> on CPU0 but did _not_ have real time priority.
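> 
> To make the livelock concrete, here is roughly the shape of that loop
> (a condensed paraphrase of rwsem_down_write_slowpath() in mainline
> around that point, not a verbatim excerpt), with the three steps above
> marked:
> 
> for (;;) {
>         if (rwsem_try_write_lock(sem, &waiter))
>                 break;
> 
>         raw_spin_unlock_irq(&sem->wait_lock);
> 
>         /*
>          * 1. rwsem_try_write_lock() returned false right away:
>          *    handoff is set and this waiter is not first.
>          */
>         if (waiter.handoff_set) {
>                 enum owner_state owner_state;
> 
>                 /*
>                  * 2. Spin on the owner -- there is no owner, so this
>                  *    returns almost immediately.
>                  */
>                 preempt_disable();
>                 owner_state = rwsem_spin_on_owner(sem);
>                 preempt_enable();
> 
>                 /* 3. Owner is OWNER_NULL: go around again, never sleeping. */
>                 if (owner_state == OWNER_NULL)
>                         goto trylock_again;
>         }
> 
>         schedule();     /* the spinning RT waiter never gets here */
> trylock_again:
>         raw_spin_lock_irq(&sem->wait_lock);
> }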
> 
> -
> 
> So all the analysis above was done on the Chrome OS 5.15 kernel
> branch, which has ${SUBJECT} patch from the stable tree. The code
> looks reasonably the same on mainline.
> 
> ...and also, I coded up a test case that can reproduce this on
> mainline. It's ugly/hacky but it gets the job done. This reproduces
> the problem at the top of mainline as of commit 80e19f34c288 ("Merge
> tag 'hte/for-5.19' of
> git://git.kernel.org/pub/scm/linux/kernel/git/tegra/linux").
> 
> For me, I was only able to reproduce this without "lockdep" enabled.
> My lockdep configs were:
> 
> CONFIG_DEBUG_RT_MUTEXES=y
> CONFIG_DEBUG_SPINLOCK=y
> CONFIG_DEBUG_MUTEXES=y
> CONFIG_PROVE_RCU=y
> CONFIG_PROVE_LOCKING=y
> CONFIG_DEBUG_ATOMIC_SLEEP=y
> 
> I don't know for sure if disabling lockdep is actually required to
> reproduce.
> 
> -
> 
> OK, so here's my hacky test case. In my case, I put a call to this
> test function in a convenient debugfs "show" function to make it easy
> to trigger. You can put it wherever.
> 
> struct test_data {
>         struct rw_semaphore *rwsem;
>         int i;
>         bool should_sleep;
> };
> 
> static int test_thread_fn(void *data)
> {
>         struct test_data *test_data = data;
>         struct rw_semaphore *rwsem = test_data->rwsem;
>         ktime_t start;
> 
>         trace_printk("Starting\n");
>         start = ktime_get();
>         while (ktime_to_ms(ktime_sub(ktime_get(), start)) < 60000) {
>                 trace_printk("About to grab\n");
>                 down_write(rwsem);
>                 trace_printk("Grabbed write %d\n", test_data->i);
>                 schedule();
>                 up_write(rwsem);
>                 trace_printk("Released write %d\n", test_data->i);
>                 if (test_data->should_sleep)
>                         msleep(1);
>         }
>         trace_printk("Done\n");
> 
>         return 0;
> }
> 
> static void test(void)
> {
>         static struct task_struct *t[10];
>         static struct test_data test_data[10];
>         static DECLARE_RWSEM(rwsem);
>         int i;
> 
>         trace_printk("About to create threads\n");
> 
>         for (i = 0; i < ARRAY_SIZE(t); i++) {
>                 test_data[i].rwsem = &rwsem;
>                 test_data[i].i = i;
> 
>                 if (i == ARRAY_SIZE(t) - 1) {
>                         /*
>                          * Last thread will be bound to CPU0 and realtime.
>                          * Have it sleep to give other threads a chance to
>                          * run and contend.
>                          */
>                         test_data[i].should_sleep = true;
>                         t[i] = kthread_create_on_cpu(test_thread_fn,
>                                                      &test_data[i], 0,
>                                                      "test0 FIFO-%u");
>                         sched_set_fifo(t[i]);
>                 } else if (i == ARRAY_SIZE(t) - 2) {
>                         /* 2nd to last thread will be bound to CPU0 */
>                         t[i] = kthread_create_on_cpu(test_thread_fn,
>                                                      &test_data[i], 0,
>                                                      "test0-%u");
>                 } else {
>                         /* All other threads are just normal */
>                         t[i] = kthread_create(test_thread_fn,
>                                               &test_data[i], "test");
>                 }
>                 wake_up_process(t[i]);
>                 msleep(10);
>         }
> }
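> 
> In case it is useful, here is a minimal sketch of how test() could be
> wired into debugfs (the file name and the DEFINE_SHOW_ATTRIBUTE name
> below are just placeholders, not what I actually used):
> 
> #include <linux/debugfs.h>
> #include <linux/init.h>
> #include <linux/seq_file.h>
> 
> static int rwsem_test_show(struct seq_file *s, void *unused)
> {
>         test();         /* kick off the reproducer threads */
>         seq_puts(s, "rwsem handoff test started\n");
>         return 0;
> }
> DEFINE_SHOW_ATTRIBUTE(rwsem_test);
> 
> static int __init rwsem_test_init(void)
> {
>         /* cat /sys/kernel/debug/rwsem_test to start the test */
>         debugfs_create_file("rwsem_test", 0444, NULL, NULL,
>                             &rwsem_test_fops);
>         return 0;
> }
> late_initcall(rwsem_test_init);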
> 
> -
> 
> With the reproducer above, I was able to:
> 
> 1. Validate that on chromeos-5.15 reverting ${SUBJECT} patch made the
> problem go away.
> 
> 2. Go to mainline at exactly the commit hash of ${SUBJECT} patch, see
> the problem, then revert ${SUBJECT} patch and see the problem go away.
> 
> Thus I'm fairly confident that the problem is related to ${SUBJECT} patch.
> 
> -
> 
> I'm hoping that someone on this thread can propose a fix. I'm happy to
> test, but I was hoping not to have to become an expert on the rwsem
> implementation to try to figure out the proper fix.
> 

See if it makes sense to only allow the first waiter to spin on owner.

Hillf

--- mainline/kernel/locking/rwsem.c
+++ b/kernel/locking/rwsem.c
@@ -337,7 +337,7 @@ struct rwsem_waiter {
 	unsigned long timeout;
 
 	/* Writer only, not initialized in reader */
-	bool handoff_set;
+	bool handoff_set, first;
 };
 #define rwsem_first_waiter(sem) \
 	list_first_entry(&sem->wait_list, struct rwsem_waiter, list)
@@ -604,6 +604,7 @@ static inline bool rwsem_try_write_lock(
 
 	lockdep_assert_held(&sem->wait_lock);
 
+	waiter->first = first;
 	count = atomic_long_read(&sem->count);
 	do {
 		bool has_handoff = !!(count & RWSEM_FLAG_HANDOFF);
@@ -1114,6 +1115,7 @@ rwsem_down_write_slowpath(struct rw_sema
 	waiter.type = RWSEM_WAITING_FOR_WRITE;
 	waiter.timeout = jiffies + RWSEM_WAIT_TIMEOUT;
 	waiter.handoff_set = false;
+	waiter.first = false;
 
 	raw_spin_lock_irq(&sem->wait_lock);
 	rwsem_add_waiter(sem, &waiter);
@@ -1158,7 +1160,7 @@ rwsem_down_write_slowpath(struct rw_sema
 		 * In this case, we attempt to acquire the lock again
 		 * without sleeping.
 		 */
-		if (waiter.handoff_set) {
+		if (waiter.handoff_set && waiter.first) {
 			enum owner_state owner_state;
 
 			preempt_disable();

