linux-kernel.vger.kernel.org archive mirror
* BUG: list_add corruption while doing migrate_swap -> balance_push
@ 2022-09-06 12:54 Kuyo Chang
  2022-09-07  8:31 ` Mel Gorman
  0 siblings, 1 reply; 2+ messages in thread
From: Kuyo Chang @ 2022-09-06 12:54 UTC (permalink / raw)
  To: mingo, peterz, juri.lelli, vincent.guittot, dietmar.eggemann,
	rostedt, bsegall, mgorman, bristot
  Cc: linux-kernel, wsd_upstream, linux-mediatek, jing-ting.wu,
	yt.chang, jonathan.jmchen

Hi,

[Syndrome]
We hit a list_add corruption error on kernel-5.15. The log shows:
list_add corruption. prev->next should be next (ffffff81a6f08ba0), but
was 0000000000000000. (prev=ffffff81a6f05930).

The call trace is as follows:
ipanic_die
notify_die
die
bug_handler
brk_handler
do_debug_exception
el1_dbg
el1h_64_sync_handler
el1h_64_sync
__list_add_valid
cpu_stop_queue_work
stop_one_cpu_nowait
balance_push
__schedule
schedule
do_sched_yield
__arm64_sys_sched_yield
invoke_syscall
el0_svc_common
do_el0_svc
el0_svc
el0t_64_sync_handler
el0t_64_sync

[Analysis]
From a memory dump and analysis of the stopper->works list, the code
flow leading to the error is as follows:

migrate_swap 
->stop_two_cpus
	->cpu_stop_queue_two_works
		->__cpu_stop_queue_work (adds each work->list to its stopper->works)
			->list_add_tail(&work->list, &stopper->works);
	->wake_up_q(&wakeq);	
->wait_for_completion(&done.completion);
->wait_for_common
->schedule_timeout
->schedule

At this point, CPU hotplug is triggered.
It registers the balance callback through the following flow:
cpu_down(cpuid)
->_cpu_down
->cpuhp_set_state()
->set_cpu_dying(cpuid, true)
->sched_cpu_deactivate
->balance_push_set(cpuid, true)
	->rq->balance_callback = &balance_push_callback;


Finally, 
->__schedule
->__balance_callbacks
->do_balance_callbacks(rq, __splice_balance_callbacks(rq, false));
->balance_push
->stop_one_cpu_nowait
	*work_buf = (struct cpu_stop_work){ .fn = fn, .arg = arg, .caller = _RET_IP_, };
	At this point the next and prev pointers of the work's list_head are initialized to NULL.
->cpu_stop_queue_work
->__list_add_valid

So it hits this check in __list_add_valid:
if (CHECK_DATA_CORRUPTION(next->prev != prev,
	"list_add corruption. next->prev should be prev (%px), but was %px. (next=%px).\n",
	prev, next->prev, next)
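
To illustrate the analysis above, here is a minimal userspace sketch in C.
It is not kernel code: the struct layouts are simplified stand-ins, and the
helper names (list_add_valid, list_add_tail_checked,
requeue_after_reinit_is_rejected) are hypothetical. It assumes the work
buffer is still linked on the list when it is reassigned with a compound
literal. Under that assumption the omitted .list member is zeroed, and the
next insertion fails the prev->next check.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdio.h>

/* Simplified userspace stand-ins, not the kernel definitions. */
struct list_head { struct list_head *next, *prev; };

struct cpu_stop_work {
	struct list_head list;
	int (*fn)(void *arg);
	void *arg;
};

static void init_list_head(struct list_head *h)
{
	h->next = h;
	h->prev = h;
}

/* Sketch of the link checks performed before an insertion. */
static bool list_add_valid(struct list_head *new, struct list_head *prev,
			   struct list_head *next)
{
	if (next->prev != prev) {
		fprintf(stderr, "next->prev should be prev (%p), but was %p\n",
			(void *)prev, (void *)next->prev);
		return false;
	}
	if (prev->next != next) {
		fprintf(stderr, "prev->next should be next (%p), but was %p\n",
			(void *)next, (void *)prev->next);
		return false;
	}
	if (new == prev || new == next) {
		fprintf(stderr, "double add\n");
		return false;
	}
	return true;
}

/* Insert at the tail only if the neighbouring links are consistent. */
static bool list_add_tail_checked(struct list_head *new, struct list_head *head)
{
	struct list_head *prev = head->prev, *next = head;

	if (!list_add_valid(new, prev, next))
		return false;
	next->prev = new;
	new->next = next;
	new->prev = prev;
	prev->next = new;
	return true;
}

static int dummy_fn(void *arg) { (void)arg; return 0; }

/* Returns true if the insertion after the reassignment is rejected. */
static bool requeue_after_reinit_is_rejected(void)
{
	struct list_head works;
	struct cpu_stop_work buf = { .fn = dummy_fn };
	struct cpu_stop_work other = { .fn = dummy_fn };

	init_list_head(&works);
	if (!list_add_tail_checked(&buf.list, &works))
		return false;

	/* Assumption: buf is still linked here. Members omitted from the
	 * compound literal are zero-initialized, so buf.list.next and
	 * buf.list.prev both become NULL. */
	buf = (struct cpu_stop_work){ .fn = dummy_fn, .arg = NULL };

	/* works.prev still points at buf.list, whose next is now NULL. */
	return !list_add_tail_checked(&other.list, &works);
}
```

Calling requeue_after_reinit_is_rejected() returns true in this sketch,
because the tail entry's next pointer no longer points back at the list
head. Whether the same sequence happens in the kernel depends on the work
buffer really being queued when stop_one_cpu_nowait reassigns it.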

Do you have any suggestions for this issue?
Thank you.



* Re: BUG: list_add corruption while doing migrate_swap -> balance_push
  2022-09-06 12:54 BUG: list_add corruption while doing migrate_swap -> balance_push Kuyo Chang
@ 2022-09-07  8:31 ` Mel Gorman
  0 siblings, 0 replies; 2+ messages in thread
From: Mel Gorman @ 2022-09-07  8:31 UTC (permalink / raw)
  To: Kuyo Chang
  Cc: mingo, peterz, juri.lelli, vincent.guittot, dietmar.eggemann,
	rostedt, bsegall, bristot, linux-kernel, wsd_upstream,
	linux-mediatek, jing-ting.wu, yt.chang, jonathan.jmchen

On Tue, Sep 06, 2022 at 08:54:58PM +0800, Kuyo Chang wrote:
> Hi,
> 
> [Syndrome]
> We hit a list_add corruption error on kernel-5.15. The log shows:
> list_add corruption. prev->next should be next (ffffff81a6f08ba0), but
> was 0000000000000000. (prev=ffffff81a6f05930).
> 

Is this a vanilla 5.15 kernel or modified? Does it happen on 5.19 or
6.0-rc4? Given that this appears to be a bug triggered by NUMA balancing
racing against CPU hotplug, is there a reproducer for this?

-- 
Mel Gorman
SUSE Labs
