From: Hillf Danton <hdanton@sina.com>
To: Kuyo Chang <kuyo.chang@mediatek.com>
Cc: peterz@infradead.org, mgorman@suse.de,
Waiman Long <longman@redhat.com>,
linux-kernel@vger.kernel.org, linux-mm@kvack.org,
jing-ting.wu@mediatek.com
Subject: Re: BUG: list_add corruption while doing migrate_swap -> balance_push
Date: Wed, 7 Sep 2022 20:00:18 +0800
Message-ID: <20220907120018.2594-1-hdanton@sina.com>
In-Reply-To: <6dab6e564e43c952f63f83ef868da6ed829fc1a8.camel@mediatek.com>

On 6 Sep 2022 20:54:58 +0800 Kuyo Chang <kuyo.chang@mediatek.com> wrote:
> Hi,
>
> [Syndrome]
> A list_add corruption error occurs on kernel-5.15. The log shows:
> list_add corruption. prev->next should be next (ffffff81a6f08ba0), but
> was 0000000000000000. (prev=ffffff81a6f05930).
>
> The call trace is as below:
> ipanic_die
> notify_die
> die
> bug_handler
> brk_handler
> do_debug_exception
> el1_dbg
> el1h_64_sync_handler
> el1h_64_sync
> __list_add_valid
> cpu_stop_queue_work
> stop_one_cpu_nowait
> balance_push
> __schedule
> schedule
> do_sched_yield
> __arm64_sys_sched_yield
> invoke_syscall
> el0_svc_common
> do_el0_svc
> el0_svc
> el0t_64_sync_handler
> el0t_64_sync
>
> [Analysis]
> From a memory dump and analysis of the stopper->works list, the
> erroneous code flow is as follows:
>
> migrate_swap
> ->stop_two_cpus
> ->cpu_stop_queue_two_works
> ->__cpu_stop_queue_work (add work->list to stopper->works respectively)
> ->list_add_tail(&work->list, &stopper->works);
> ->wake_up_q(&wakeq);
> ->wait_for_completion(&done.completion);
> ->wait_for_common
> ->schedule_timeout
> ->schedule
>
> At this point CPU hotplug is triggered.
> It registers balance_callback through the flow below:
> cpu_down(cpuid)
> ->_cpu_down
> ->cpuhp_set_state()
> ->set_cpu_dying(cpuid, true)
> ->sched_cpu_deactivate
> ->balance_push_set(cpuid, true)
> ->rq->balance_callback = &balance_push_callback;
>
>
> Finally,
> ->__schedule
> ->__balance_callbacks
> ->do_balance_callbacks(rq, __splice_balance_callbacks(rq, false));
> ->balance_push
> ->stop_one_cpu_nowait
> *work_buf = (struct cpu_stop_work){ .fn = fn, .arg = arg,
> .caller = _RET_IP_, };
> At this point the list_head *next and *prev are initialized to NULL!!
> ->cpu_stop_queue_work
> ->__list_add_valid
>
> Do you have any suggestion for this issue?
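
The failure described in the analysis above can be reproduced in userspace.
What follows is only a minimal sketch, not kernel code: struct work is a
hypothetical stand-in for struct cpu_stop_work, and list_add_valid() is a
simplified imitation of the check behind the "list_add corruption" message.
It assumes the work item is still linked on the list when it is
re-initialised with a compound literal, which zeroes the omitted list member.

```c
#include <stdbool.h>
#include <stddef.h>

struct list_head { struct list_head *next, *prev; };

/* Hypothetical stand-in for struct cpu_stop_work (only the fields used here). */
struct work {
	struct list_head list;
	int (*fn)(void *);
	void *arg;
};

static void init_list_head(struct list_head *h)
{
	h->next = h;
	h->prev = h;
}

/* Simplified imitation of the sanity check done before linking a node. */
static bool list_add_valid(struct list_head *new, struct list_head *prev,
			   struct list_head *next)
{
	return next->prev == prev && prev->next == next &&
	       new != prev && new != next;
}

/* Append to the tail; returns false instead of linking when the check fails. */
static bool list_add_tail_checked(struct list_head *new, struct list_head *head)
{
	struct list_head *prev = head->prev;

	if (!list_add_valid(new, prev, head))
		return false;
	new->next = head;
	new->prev = prev;
	prev->next = new;
	head->prev = new;
	return true;
}

/* Returns true when the second queue attempt is rejected as corrupt. */
static bool requeue_after_reinit_is_rejected(void)
{
	struct list_head works;
	struct work w = { .fn = NULL };

	init_list_head(&works);
	if (!list_add_tail_checked(&w.list, &works))
		return false;

	/* Re-initialise while still linked: the omitted .list becomes {NULL, NULL}. */
	w = (struct work){ .fn = NULL, .arg = NULL };

	/* works.prev still points at w.list, whose next pointer is now NULL. */
	return !list_add_tail_checked(&w.list, &works);
}
```

Calling requeue_after_reinit_is_rejected() returns true in this sketch: the
tail of the list is the work item itself, so the check sees prev->next == NULL
where it expects the list head, which has the same shape as the report above.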

See if making balance_push() non-reentrant removes the chance of a
double list add in your case.

Hillf

--- linux-5.15/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -8815,6 +8815,7 @@ static int __balance_push_cpu_stop(void
 		cpu = select_fallback_rq(rq->cpu, p);
 		rq = __migrate_task(rq, &rf, p, cpu);
 	}
+	this_cpu_ptr(&push_work)->queued = 0;

 	rq_unlock(rq, &rf);
 	raw_spin_unlock_irq(&p->pi_lock);
@@ -8838,6 +8839,8 @@ static void balance_push(struct rq *rq)

 	lockdep_assert_rq_held(rq);

+	if (WARN_ON_ONCE(this_cpu_ptr(&push_work)->queued != 0))
+		return;
 	/*
 	 * Ensure the thing is persistent until balance_push_set(.on = false);
 	 */
@@ -8877,6 +8880,7 @@ static void balance_push(struct rq *rq)
 		return;
 	}

+	this_cpu_ptr(&push_work)->queued = 1;
 	get_task_struct(push_task);
 	/*
 	 * Temporarily drop rq->lock such that we can wake-up the stop task.
--- a/include/linux/stop_machine.h
+++ b/include/linux/stop_machine.h
@@ -27,6 +27,7 @@ struct cpu_stop_work {
 	unsigned long caller;
 	void *arg;
 	struct cpu_stop_done *done;
+	unsigned queued;
 };

 int stop_one_cpu(unsigned int cpu, cpu_stop_fn_t fn, void *arg);
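
The idea behind the diff above, reduced to a userspace sketch. The names
struct push_work, try_queue() and run_and_complete() are hypothetical and
exist only for illustration. The sketch is single-threaded; in the patch the
flag appears to be read and written with the rq lock held, which this model
does not attempt to reproduce.

```c
#include <stdbool.h>

/* Hypothetical model of a per-CPU work item carrying a "queued" flag. */
struct push_work {
	bool queued;	/* set when queued, cleared by the stop function */
	int runs;	/* how many times the stop function has run */
};

/* Models the guard in balance_push(): refuse to queue an in-flight item. */
static bool try_queue(struct push_work *w)
{
	if (w->queued)
		return false;
	w->queued = true;
	return true;
}

/* Models the stop function finishing: only now may the item be reused. */
static void run_and_complete(struct push_work *w)
{
	w->runs++;
	w->queued = false;
}
```

With this guard a second try_queue() on the same item fails until
run_and_complete() has cleared the flag, so the item is never re-initialised
while it is still linked on the stopper's list.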