From mboxrd@z Thu Jan 1 00:00:00 1970
Return-Path:
Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand
	id S1753521AbcLGT5e (ORCPT ); Wed, 7 Dec 2016 14:57:34 -0500
Received: from foss.arm.com ([217.140.101.70]:47164 "EHLO foss.arm.com"
	rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP
	id S1752724AbcLGT5d (ORCPT ); Wed, 7 Dec 2016 14:57:33 -0500
Date: Wed, 7 Dec 2016 19:56:44 +0000
From: Mark Rutland
To: Peter Zijlstra
Cc: linux-kernel@vger.kernel.org, Ingo Molnar, Arnaldo Carvalho de Melo,
	Thomas Gleixner, Sebastian Andrzej Siewior, jeremy.linton@arm.com
Subject: Re: Perf hotplug lockup in v4.9-rc8
Message-ID: <20161207195643.GA9027@leverpostej>
References: <20161207135217.GA25605@leverpostej>
	<20161207175347.GB13840@leverpostej>
	<20161207183455.GQ3124@twins.programming.kicks-ass.net>
MIME-Version: 1.0
Content-Type: text/plain; charset=us-ascii
Content-Disposition: inline
In-Reply-To: <20161207183455.GQ3124@twins.programming.kicks-ass.net>
User-Agent: Mutt/1.5.21 (2010-09-15)
Sender: linux-kernel-owner@vger.kernel.org
List-ID:
X-Mailing-List: linux-kernel@vger.kernel.org

On Wed, Dec 07, 2016 at 07:34:55PM +0100, Peter Zijlstra wrote:
> On Wed, Dec 07, 2016 at 05:53:47PM +0000, Mark Rutland wrote:
> > On Wed, Dec 07, 2016 at 01:52:17PM +0000, Mark Rutland wrote:
> > > Hi all
> > >
> > > Jeremy noticed a kernel lockup on arm64 when the perf tool was used in
> > > parallel with hotplug, which I've reproduced on arm64 and x86(-64) with
> > > v4.9-rc8. In both cases I'm using defconfig; I've tried enabling lockdep
> > > but it was silent for arm64 and x86.
> >
> > It looks like we're trying to install a task-bound event into a context
> > where task_cpu(ctx->task) is dead, and thus the cpu_function_call() in
> > perf_install_in_context() fails. We retry repeatedly.
> >
> > On !PREEMPT (as with x86 defconfig), we manage to prevent the hotplug
> > machinery from making progress, and this turns into a livelock.
> >
> > On PREEMPT (as with arm64 defconfig), I'm somewhat lost.
>
> So the problem is that even with PREEMPT we can hit a blocked task
> that has a 'dead' cpu.
>
> We'll spin until either the task wakes up or the CPU does, either can
> take a very long time.
>
> How exactly your test-case triggers this, all it executes is 'true' and
> that really shouldn't block much, is a mystery still.

The perf tool forks a helper process, which blocks on a pipe, and once
signalled, execs the target (i.e. true).

The main perf process opens (enable-on-exec) events on that, then writes
to the pipe to wake up the helper.

... so now I see why that makes us see a dead task_cpu(); thanks for the
explanation above!

[...]

> @@ -2352,6 +2357,28 @@ perf_install_in_context(struct perf_event_context *ctx,
>  		return;
>  	}
>  	raw_spin_unlock_irq(&ctx->lock);
> +
> +	raw_spin_lock_irq(&task->pi_lock);
> +	if (!(task->state == TASK_RUNNING || task->state == TASK_WAKING)) {

For a moment I thought there was a remaining race here with the lazy
ctx-switch if the new task was RUNNING on an online CPU, but I guess
we'll retry the cpu_function_call() in that case.

I'll attack this tomorrow when I can think again...

Thanks,
Mark.
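
For anyone trying to follow the helper handshake described above, here
is a minimal userspace sketch of the fork/pipe/exec pattern. It is not
taken from the perf sources: run_gated() is a hypothetical name, and the
step where events would be opened is only marked with a comment. It just
shows why the child sits blocked (in read()) while the parent sets up.

```c
#include <stdlib.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/*
 * Hypothetical helper (illustration only, not perf code): fork a child
 * that blocks on a pipe until the parent releases it, then execs argv[0].
 * Returns the child's exit status, or -1 on error.
 */
int run_gated(char *const argv[])
{
	int go[2];
	pid_t pid;
	int status;
	ssize_t n;
	char c = 0;

	if (pipe(go) < 0)
		return -1;

	pid = fork();
	if (pid < 0) {
		close(go[0]);
		close(go[1]);
		return -1;
	}

	if (pid == 0) {
		/* Child: sleep in read() until the parent writes a byte. */
		close(go[1]);
		if (read(go[0], &c, 1) != 1)
			_exit(126);	/* parent went away without signalling */
		close(go[0]);
		execvp(argv[0], argv);
		_exit(127);		/* only reached if exec failed */
	}

	/* Parent: keep only the write end. */
	close(go[0]);

	/*
	 * A profiler would open its enable-on-exec events on 'pid' here.
	 * The child is blocked at this point, so the CPU it last ran on
	 * may have been taken offline in the meantime.
	 */

	/* Release the child; closing the pipe also unblocks it on error. */
	n = write(go[1], &c, 1);
	close(go[1]);

	if (waitpid(pid, &status, 0) < 0 || n != 1)
		return -1;

	return WIFEXITED(status) ? WEXITSTATUS(status) : -1;
}
```

Calling run_gated() with an argv of { "true", NULL } in a loop while
CPUs are being hotplugged approximates the workload shape, though
without the event setup it will not by itself exercise
perf_install_in_context().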