* Re: BUG: list_add corruption while doing migrate_swap -> balance_push
From: Hillf Danton @ 2022-09-07 12:00 UTC
To: Kuyo Chang
Cc: peterz, mgorman, Waiman Long, linux-kernel, linux-mm, jing-ting.wu
On 6 Sep 2022 20:54:58 +0800 Kuyo Chang <kuyo.chang@mediatek.com> wrote
> Hi,
>
> [Syndrome]
> A list_add corruption error occurs on kernel 5.15. The log shows:
> list_add corruption. prev->next should be next (ffffff81a6f08ba0), but
> was 0000000000000000. (prev=ffffff81a6f05930).
>
> The call trace is as follows:
> ipanic_die
> notify_die
> die
> bug_handler
> brk_handler
> do_debug_exception
> el1_dbg
> el1h_64_sync_handler
> el1h_64_sync
> __list_add_valid
> cpu_stop_queue_work
> stop_one_cpu_nowait
> balance_push
> __schedule
> schedule
> do_sched_yield
> __arm64_sys_sched_yield
> invoke_syscall
> el0_svc_common
> do_el0_svc
> el0_svc
> el0t_64_sync_handler
> el0t_64_sync
>
> [Analysis]
> From a memory dump and an analysis of the stopper->works list, the
> failing code flow is as follows:
>
> migrate_swap
> ->stop_two_cpus
> ->cpu_stop_queue_two_works
> ->__cpu_stop_queue_work (add work->list to stopper->works respectively)
> ->list_add_tail(&work->list, &stopper->works);
> ->wake_up_q(&wakeq);
> ->wait_for_completion(&done.completion);
> ->wait_for_common
> ->schedule_timeout
> ->schedule
>
> At this point, CPU hotplug is triggered.
> It registers the balance callback through the flow below:
> cpu_down(cpuid)
> ->_cpu_down
> ->cpuhp_set_state()
> ->set_cpu_dying(cpuid, true)
> ->sched_cpu_deactivate
> ->balance_push_set(cpuid, true)
> ->rq->balance_callback = &balance_push_callback;
>
>
> Finally,
> ->__schedule
> ->__balance_callbacks
> ->do_balance_callbacks(rq, __splice_balance_callbacks(rq, false));
> ->balance_push
> ->stop_one_cpu_nowait
> *work_buf = (struct cpu_stop_work){ .fn = fn, .arg = arg,
> .caller = _RET_IP_, };
> At this point the list_head's next and prev pointers are initialized to NULL!
> ->cpu_stop_queue_work
> ->__list_add_valid
>
> Do you have any suggestions for this issue?
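The failure described in the analysis can be modelled in userspace. The sketch below is only an illustration under stated assumptions: `struct stop_work`, `list_add_tail_checked()` and `requeue_after_reinit_is_rejected()` are hypothetical names, the list helpers are simplified stand-ins for the kernel ones, and the check returns false where the kernel's debug check would report a BUG. It assumes the work item is still on the list when it is re-initialized with a compound literal, which zeroes every member not named in the initializer, including the embedded list pointers.

```c
#include <stdbool.h>
#include <stddef.h>

/* Minimal userspace stand-in for the kernel's struct list_head. */
struct list_head {
	struct list_head *next, *prev;
};

/* Hypothetical, simplified stand-in for struct cpu_stop_work. */
struct stop_work {
	struct list_head list;
	int (*fn)(void *);
	void *arg;
};

static void init_list_head(struct list_head *head)
{
	head->next = head;
	head->prev = head;
}

/* Same conditions as the kernel's debug check, but reported via the return value. */
static bool list_add_valid(struct list_head *entry, struct list_head *prev,
			   struct list_head *next)
{
	return next->prev == prev && prev->next == next &&
	       entry != prev && entry != next;
}

/* Append entry before head; refuse (return false) if the list looks corrupt. */
static bool list_add_tail_checked(struct list_head *entry, struct list_head *head)
{
	struct list_head *prev = head->prev, *next = head;

	if (!list_add_valid(entry, prev, next))
		return false;
	next->prev = entry;
	entry->next = next;
	entry->prev = prev;
	prev->next = entry;
	return true;
}

/* Returns true if queueing the same, still-linked work item again is rejected. */
static bool requeue_after_reinit_is_rejected(void)
{
	struct list_head works;
	struct stop_work work;

	init_list_head(&works);

	work = (struct stop_work){ .arg = NULL };
	if (!list_add_tail_checked(&work.list, &works))
		return false;

	/* Re-initialize while still queued: list.next and list.prev become NULL. */
	work = (struct stop_work){ .arg = NULL };

	/* The list tail is the work item itself, and its next pointer is now NULL. */
	return !list_add_tail_checked(&work.list, &works);
}
```

In this model the second add fails on the `prev->next == next` condition, with `prev` being the re-initialized work item, which has the same shape as the "prev->next should be next ... but was 0000000000000000" message in the report.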
See if making balance_push() non-reentrant removes the chance of a
double list add in your case.
Hillf
--- linux-5.15/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -8815,6 +8815,7 @@ static int __balance_push_cpu_stop(void
 		cpu = select_fallback_rq(rq->cpu, p);
 		rq = __migrate_task(rq, &rf, p, cpu);
 	}
+	this_cpu_ptr(&push_work)->queued = 0;
 	rq_unlock(rq, &rf);
 	raw_spin_unlock_irq(&p->pi_lock);
@@ -8838,6 +8839,8 @@ static void balance_push(struct rq *rq)
 	lockdep_assert_rq_held(rq);
+	if (WARN_ON_ONCE(this_cpu_ptr(&push_work)->queued != 0))
+		return;
 	/*
 	 * Ensure the thing is persistent until balance_push_set(.on = false);
 	 */
@@ -8877,6 +8880,7 @@ static void balance_push(struct rq *rq)
 		return;
 	}
+	this_cpu_ptr(&push_work)->queued = 1;
 	get_task_struct(push_task);
 	/*
 	 * Temporarily drop rq->lock such that we can wake-up the stop task.
--- a/include/linux/stop_machine.h
+++ b/include/linux/stop_machine.h
@@ -27,6 +27,7 @@ struct cpu_stop_work {
 	unsigned long caller;
 	void *arg;
 	struct cpu_stop_done *done;
+	unsigned queued;
 };
 int stop_one_cpu(unsigned int cpu, cpu_stop_fn_t fn, void *arg);
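The guard pattern used in the patch can be sketched in isolation. This is a single-threaded userspace model, not kernel code: `struct push_work_model`, `try_queue_push()` and `push_work_done()` are hypothetical names, and no locking or concurrency is modelled. It only shows the intended state machine: a work item is queued at most once until its handler has cleared the flag.

```c
#include <stdbool.h>

/* Hypothetical model of a per-CPU work item carrying a "queued" flag. */
struct push_work_model {
	unsigned queued;       /* 1 while the work is pending, 0 otherwise */
	unsigned times_queued; /* how often the work was actually queued */
};

/* Models the check at the top of balance_push(): refuse a second queueing. */
static bool try_queue_push(struct push_work_model *w)
{
	if (w->queued != 0)
		return false;
	w->queued = 1;
	w->times_queued++;
	return true;
}

/* Models the end of the stopper callback: the work may be queued again. */
static void push_work_done(struct push_work_model *w)
{
	w->queued = 0;
}
```

A second `try_queue_push()` before `push_work_done()` returns false and leaves the item untouched, which is the property the test patch checks with WARN_ON_ONCE().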