From mboxrd@z Thu Jan  1 00:00:00 1970
Return-Path: <linux-kernel-owner@vger.kernel.org>
Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand
        id S1753521AbcLGT5e (ORCPT <rfc822;w@1wt.eu>);
        Wed, 7 Dec 2016 14:57:34 -0500
Received: from foss.arm.com ([217.140.101.70]:47164 "EHLO foss.arm.com"
        rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP
        id S1752724AbcLGT5d (ORCPT <rfc822;linux-kernel@vger.kernel.org>);
        Wed, 7 Dec 2016 14:57:33 -0500
Date: Wed, 7 Dec 2016 19:56:44 +0000
From: Mark Rutland <mark.rutland@arm.com>
To: Peter Zijlstra <peterz@infradead.org>
Cc: linux-kernel@vger.kernel.org, Ingo Molnar <mingo@redhat.com>,
        Arnaldo Carvalho de Melo <acme@kernel.org>,
        Thomas Gleixner <tglx@linutronix.de>,
        Sebastian Andrzej Siewior <bigeasy@linutronix.de>,
        jeremy.linton@arm.com
Subject: Re: Perf hotplug lockup in v4.9-rc8
Message-ID: <20161207195643.GA9027@leverpostej>
References: <20161207135217.GA25605@leverpostej>
 <20161207175347.GB13840@leverpostej>
 <20161207183455.GQ3124@twins.programming.kicks-ass.net>
MIME-Version: 1.0
Content-Type: text/plain; charset=us-ascii
Content-Disposition: inline
In-Reply-To: <20161207183455.GQ3124@twins.programming.kicks-ass.net>
User-Agent: Mutt/1.5.21 (2010-09-15)
Sender: linux-kernel-owner@vger.kernel.org
List-ID: <linux-kernel.vger.kernel.org>
X-Mailing-List: linux-kernel@vger.kernel.org

On Wed, Dec 07, 2016 at 07:34:55PM +0100, Peter Zijlstra wrote:
> On Wed, Dec 07, 2016 at 05:53:47PM +0000, Mark Rutland wrote:
> > On Wed, Dec 07, 2016 at 01:52:17PM +0000, Mark Rutland wrote:
> > > Hi all
> > > 
> > > Jeremy noticed a kernel lockup on arm64 when the perf tool was used in
> > > parallel with hotplug, which I've reproduced on arm64 and x86(-64) with
> > > v4.9-rc8. In both cases I'm using defconfig; I've tried enabling lockdep
> > > but it was silent for arm64 and x86.
> > 
> > It looks like we're trying to install a task-bound event into a context
> > where task_cpu(ctx->task) is dead, and thus the cpu_function_call() in
> > perf_install_in_context() fails. We retry repeatedly.
> > 
> > On !PREEMPT (as with x86 defconfig), we manage to prevent the hotplug
> > machinery from making progress, and this turns into a livelock.
> > 
> > On PREEMPT (as with arm64 defconfig), I'm somewhat lost.
> 
> So the problem is that even with PREEMPT we can hit a blocked task
> that has a 'dead' cpu.
> 
> We'll spin until either the task wakes up or the CPU does, either can
> take a very long time.
> 
> How exactly your test-case triggers this, all it executes is 'true' and
> that really shouldn't block much, is a mystery still.

The perf tool forks a helper process, which blocks on a pipe and, once
signalled, execs the target (i.e. true). The main perf process opens
(enable-on-exec) events on the helper, then writes to the pipe to wake
it up.

... so now I see why that makes us see a dead task_cpu(); thanks for the
explanation above!
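
For reference, the helper handshake described above looks roughly like
the userspace sketch below. This is not the perf tool's code:
run_gated() and its setup callback are made-up names, and the callback
only marks the point where perf would open its enable-on-exec events
against the helper's pid.

```c
#include <errno.h>
#include <stdio.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/*
 * Hypothetical helper (not taken from the perf tool): fork a child that
 * blocks on a pipe, let the parent do its setup against the child's pid,
 * then release the child so that it execs the target.
 *
 * Returns the target's exit status, 127 if the exec failed, or -1 on error.
 */
int run_gated(char *const argv[], void (*setup)(pid_t pid))
{
	int go[2];
	int status;
	pid_t pid;
	char c = 0;

	if (pipe(go) < 0)
		return -1;

	pid = fork();
	if (pid < 0) {
		close(go[0]);
		close(go[1]);
		return -1;
	}

	if (pid == 0) {
		/* Child: sleep in read() until the parent signals. */
		close(go[1]);
		if (read(go[0], &c, 1) != 1)
			_exit(127);
		close(go[0]);
		execvp(argv[0], argv);
		_exit(127);	/* only reached if the exec failed */
	}

	/*
	 * Parent: the child is blocked, or soon will be. The CPU it last
	 * ran on can be taken offline while it sleeps.
	 */
	close(go[0]);
	if (setup)
		setup(pid);	/* stand-in for opening the events */

	/* Wake the child; it execs the target once read() returns. */
	if (write(go[1], &c, 1) != 1)
		perror("write");
	close(go[1]);

	while (waitpid(pid, &status, 0) < 0) {
		if (errno != EINTR)
			return -1;
	}

	return WIFEXITED(status) ? WEXITSTATUS(status) : -1;
}
```

Calling run_gated() with an argv of { "true", NULL } and a NULL callback
mirrors the test-case: the helper spends the whole setup window blocked
in read(), so the event gets installed on a task that is not running.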

[...]

> @@ -2352,6 +2357,28 @@ perf_install_in_context(struct perf_event_context *ctx,
>  		return;
>  	}
>  	raw_spin_unlock_irq(&ctx->lock);
> +
> +	raw_spin_lock_irq(&task->pi_lock);
> +	if (!(task->state == TASK_RUNNING || task->state == TASK_WAKING)) {

For a moment I thought there was a remaining race here with the lazy
ctx-switch if the new task was RUNNING on an online CPU, but I guess
we'll retry the cpu_function_call() in that case.

I'll attack this tomorrow when I can think again...

Thanks,
Mark.