linux-mm.kvack.org archive mirror
From: Hillf Danton <hdanton@sina.com>
To: Kuyo Chang <kuyo.chang@mediatek.com>
Cc: peterz@infradead.org, mgorman@suse.de,
	Waiman Long <longman@redhat.com>,
	linux-kernel@vger.kernel.org, linux-mm@kvack.org,
	jing-ting.wu@mediatek.com
Subject: Re: BUG: list_add corruption while doing migrate_swap -> balance_push
Date: Wed,  7 Sep 2022 20:00:18 +0800
Message-ID: <20220907120018.2594-1-hdanton@sina.com>
In-Reply-To: <6dab6e564e43c952f63f83ef868da6ed829fc1a8.camel@mediatek.com>

On 6 Sep 2022 20:54:58 +0800 Kuyo Chang <kuyo.chang@mediatek.com> wrote:
> Hi,
> 
> [Syndrome]
> A list_add corruption error occurs on kernel 5.15. The log shows:
> list_add corruption. prev->next should be next (ffffff81a6f08ba0), but was 0000000000000000. (prev=ffffff81a6f05930).
> 
> The call trace is as follows:
> ipanic_die
> notify_die
> die
> bug_handler
> brk_handler
> do_debug_exception
> el1_dbg
> el1h_64_sync_handler
> el1h_64_sync
> __list_add_valid
> cpu_stop_queue_work
> stop_one_cpu_nowait
> balance_push
> __schedule
> schedule
> do_sched_yield
> __arm64_sys_sched_yield
> invoke_syscall
> el0_svc_common
> do_el0_svc
> el0_svc
> el0t_64_sync_handler
> el0t_64_sync
> 
> [Analysis]
> From a memory dump and analysis of the stopper->works list, the failing
> code flow is as follows:
> 
> migrate_swap 
> ->stop_two_cpus
> 	->cpu_stop_queue_two_works
> 		->__cpu_stop_queue_work (add work->list to stopper->works respectively)
> 			->list_add_tail(&work->list, &stopper->works);
> 	->wake_up_q(&wakeq);	
> ->wait_for_completion(&done.completion);
> ->wait_for_common
> ->schedule_timeout
> ->schedule
> 
> At this point CPU hotplug is triggered.
> It registers the balance_callback via the flow below:
> cpu_down(cpuid)
> ->_cpu_down
> ->cpuhp_set_state()
> ->set_cpu_dying(cpuid, true)
> ->sched_cpu_deactivate
> ->balance_push_set(cpuid, true)
> 	->rq->balance_callback = &balance_push_callback;
> 
> 
> Finally, 
> ->__schedule
> ->__balance_callbacks
> ->do_balance_callbacks(rq, __splice_balance_callbacks(rq, false));
> ->balance_push
> ->stop_one_cpu_nowait
> 	*work_buf = (struct cpu_stop_work){ .fn = fn, .arg = arg, .caller = _RET_IP_, };
> At this point the list_head's next and prev pointers are initialized to NULL!
> ->cpu_stop_queue_work
> ->__list_add_valid
> 
> Do you have any suggestions for this issue?

See if making balance_push() non-reentrant removes the chance of a
double list add in your case.

Hillf

--- linux-5.15/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -8815,6 +8815,7 @@ static int __balance_push_cpu_stop(void
 		cpu = select_fallback_rq(rq->cpu, p);
 		rq = __migrate_task(rq, &rf, p, cpu);
 	}
+	this_cpu_ptr(&push_work)->queued = 0;
 
 	rq_unlock(rq, &rf);
 	raw_spin_unlock_irq(&p->pi_lock);
@@ -8838,6 +8839,8 @@ static void balance_push(struct rq *rq)
 
 	lockdep_assert_rq_held(rq);
 
+	if (WARN_ON_ONCE(this_cpu_ptr(&push_work)->queued != 0))
+		return;
 	/*
 	 * Ensure the thing is persistent until balance_push_set(.on = false);
 	 */
@@ -8877,6 +8880,7 @@ static void balance_push(struct rq *rq)
 		return;
 	}
 
+	this_cpu_ptr(&push_work)->queued = 1;
 	get_task_struct(push_task);
 	/*
 	 * Temporarily drop rq->lock such that we can wake-up the stop task.
--- a/include/linux/stop_machine.h
+++ b/include/linux/stop_machine.h
@@ -27,6 +27,7 @@ struct cpu_stop_work {
 	unsigned long		caller;
 	void			*arg;
 	struct cpu_stop_done	*done;
+	unsigned		queued;
 };
 
 int stop_one_cpu(unsigned int cpu, cpu_stop_fn_t fn, void *arg);

