Date: Fri, 13 Jul 2012 16:25:05 +0200 (CEST)
From: Thomas Gleixner
To: Jan Kara
Cc: Jeff Moyer, LKML, linux-fsdevel@vger.kernel.org, Tejun Heo,
    Jens Axboe, mgalbraith@suse.com
Subject: Re: Deadlocks due to per-process plugging
In-Reply-To: <20120713123318.GB20361@quack.suse.cz>
References: <20120711133735.GA8122@quack.suse.cz>
    <20120711201601.GB9779@quack.suse.cz>
    <20120713123318.GB20361@quack.suse.cz>

On Fri, 13 Jul 2012, Jan Kara wrote:
> On Thu 12-07-12 16:15:29, Thomas Gleixner wrote:
> > > Ah, I didn't know this. Thanks for the hint. So in the kdump I have
> > > I can see requests queued in tsk->plug even though the process is
> > > sleeping in TASK_UNINTERRUPTIBLE state. So the only way unplug could
> > > have been omitted is if tsk_is_pi_blocked() was true. Rummaging
> > > through the dump... indeed the task has pi_blocked_on =
> > > 0xffff8802717d79c8. The dump is from an -rt kernel (I just didn't
> > > originally think that makes any difference), so actually every mutex
> > > is an rtmutex and thus tsk_is_pi_blocked() is true whenever we are
> > > sleeping on a mutex. So this looks like a bug in the rtmutex code.
> >
> > Well, the reason why this check is there is that the task which is
> > blocked on a lock can hold another lock which might cause a deadlock
> > in the flush path.
>
>   OK. Let me understand the details. The block layer needs just
> queue_lock for unplug to succeed. That is a spinlock, but in an RT
> kernel even a process holding a spinlock can be preempted, if I
> remember correctly. So that condition is there effectively to avoid
> unplugging when a task is scheduled away while holding queue_lock?
> Did I get it right?

blk_flush_plug_list() does not only take queue_lock. Other locks can be
taken in the callbacks, the elevator ...

> > > Thomas, you seem to have added that condition... Any idea how to
> > > avoid the deadlock?
> >
> > Good question. We could do the flush when the blocked task does not
> > hold a lock itself. Might be worth a try.
>
>   Yeah, that should work for avoiding the deadlock as well.

Though we don't have a lock-held count except when lockdep is enabled,
which you probably don't want on a production system.

But we only care about a task being scheduled out while blocked on a
"sleeping spinlock", i.e. a spinlock or rwlock. So the patch below
should allow the unplug to take place when the task is blocked on a
mutex etc.
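For reference, the check in question gates the plug flush in the
scheduler's submit-work path. Roughly, from kernel/sched/core.c of that
era (trimmed; the exact shape may differ between trees):

static inline void sched_submit_work(struct task_struct *tsk)
{
	/*
	 * If the task is blocked on an rtmutex (tsk_is_pi_blocked()),
	 * the plugged I/O is deliberately *not* flushed, because the
	 * flush path may need locks the blocked task already holds.
	 */
	if (!tsk->state || tsk_is_pi_blocked(tsk))
		return;
	/*
	 * If we are going to sleep and we have plugged IO queued,
	 * make sure to submit it to avoid deadlocks.
	 */
	if (blk_needs_flush_plug(tsk))
		blk_schedule_flush_plug(tsk);
}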
Thanks,

	tglx
----
Index: linux-stable-rt/include/linux/sched.h
===================================================================
--- linux-stable-rt.orig/include/linux/sched.h
+++ linux-stable-rt/include/linux/sched.h
@@ -2145,9 +2145,10 @@ extern unsigned int sysctl_sched_cfs_ban
 extern int rt_mutex_getprio(struct task_struct *p);
 extern void rt_mutex_setprio(struct task_struct *p, int prio);
 extern void rt_mutex_adjust_pi(struct task_struct *p);
+extern bool pi_blocked_on_rt_lock(struct task_struct *tsk);
 static inline bool tsk_is_pi_blocked(struct task_struct *tsk)
 {
-	return tsk->pi_blocked_on != NULL;
+	return tsk->pi_blocked_on != NULL && pi_blocked_on_rt_lock(tsk);
 }
 #else
 static inline int rt_mutex_getprio(struct task_struct *p)
Index: linux-stable-rt/kernel/rtmutex.c
===================================================================
--- linux-stable-rt.orig/kernel/rtmutex.c
+++ linux-stable-rt/kernel/rtmutex.c
@@ -699,6 +699,11 @@ static int adaptive_wait(struct rt_mutex
 # define pi_lock(lock) raw_spin_lock_irq(lock)
 # define pi_unlock(lock) raw_spin_unlock_irq(lock)
 
+bool pi_blocked_on_rt_lock(struct task_struct *tsk)
+{
+	return tsk->pi_blocked_on && tsk->pi_blocked_on->savestate;
+}
+
 /*
  * Slow path lock function spin_lock style: this variant is very
  * careful not to miss any non-lock wakeups.
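For completeness: the savestate flag consulted above is recorded when
the rtmutex waiter is initialized. In the -rt patchset the sleeping
spinlock slow path blocks with savestate = true, while the plain mutex
slow path blocks with savestate = false. A rough sketch, assuming the
3.x-rt helper names; the bodies are abbreviated for illustration:

/*
 * "Sleeping spinlock" (spinlock_t/rwlock_t mapped onto an rtmutex):
 * waiter initialized with savestate = true, so with the patch above
 * tsk_is_pi_blocked() stays true and the plug flush is still skipped.
 */
static void noinline __sched rt_spin_lock_slowlock(struct rt_mutex *lock)
{
	struct rt_mutex_waiter waiter;

	rt_mutex_init_waiter(&waiter, true);
	/* ... enqueue waiter, set current->pi_blocked_on, schedule() ... */
}

/*
 * Plain mutex slow path: waiter initialized with savestate = false,
 * so tsk_is_pi_blocked() is now false and schedule() flushes the
 * plug list before the task blocks.
 */
static int __sched
rt_mutex_slowlock(struct rt_mutex *lock, int state,
		  struct hrtimer_sleeper *timeout, int detect_deadlock)
{
	struct rt_mutex_waiter waiter;

	rt_mutex_init_waiter(&waiter, false);
	/* ... enqueue waiter, set current->pi_blocked_on, schedule() ... */
	return 0;
}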