Date: Mon, 2 Feb 2015 16:11:59 +0100
From: Peter Zijlstra
To: Oleg Nesterov
Cc: Darren Hart, Thomas Gleixner, Jerome Marchand, Larry Woodman,
    Mateusz Guzik, linux-kernel@vger.kernel.org
Subject: Re: [PATCH 0/1] futex: check PF_KTHREAD rather than !p->mm to filter out kthreads
Message-ID: <20150202151159.GE26304@twins.programming.kicks-ass.net>
In-Reply-To: <20150202140515.GA26398@redhat.com>

On Mon, Feb 02, 2015 at 03:05:15PM +0100, Oleg Nesterov wrote:
> First of all, why exactly do we need this mm/PF_KTHREAD check added by
> f0d71b3dcb8332f7971 ? Of course, it is simply wrong to declare a random
> kernel thread to be the owner, as the changelog says. But why is a
> kthread worse than a random user-space task, say, /sbin/init?

As the changelog says, we _should_ equally disallow other userspace
tasks that do not share the futex value with us; it's just that at the
time we could not come up with a sensible (and cheap) way of testing
for this.

> IIUC, the fact that we can abuse ->pi_state_list is not that bad, no
> matter whether this (k)thread will exit or not. AFAICS, the only
> problem is that we can boost the prio of this thread. Or did I miss
> another problem?

No, that's it.

> I am asking because we need to backport some fixes, and I am trying to
> convince myself that I actually understand what I am trying to do ;)
>
> And another question. Let's forget about this ->mm check. I simply
> cannot understand this
>
>	ret = (p->flags & PF_EXITPIDONE) ? -ESRCH : -EAGAIN;
>
> logic in attach_to_pi_owner(). First of all, why do we need to retry
> if PF_EXITING is set but PF_EXITPIDONE is not? Why can we not simply
> ignore PF_EXITING and rely on exit_pi_state_list() if PF_EXITPIDONE
> is not set?
>
> I must have missed something, but this looks buggy: I do not see any
> preemption point in this "retry" loop. Suppose that max_cpus=1 and an
> rt_task() preempts the non-rt PF_EXITING owner. Looks like
> futex_lock_pi() can spin forever in this case? (OK, ignoring RT
> throttling.)

This is not something I've ever looked at before; 778e9a9c3e71
("pi-futex: fix exit races and locking problems") seems to suggest it's
possible to get onto tsk->pi_state_list after exit_pi_state_list().

So while the below shows preemption points, those don't actually help
against RT tasks: a FIFO-99 task will always be more eligible to run
than most others.

So yes, I do like your proposal of putting PF_EXITPIDONE under the
->pi_lock section that handles exit_pi_state_list(). I further think
we can remove the smp_mb(); raw_spin_unlock_wait() from do_exit() --
this would offset the new unconditional ->pi_lock acquisition in
exit_pi_state_list(). The comment there suggests robust futexes are
involved, but I cannot find any except the PI state muck testing
->flags.
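To make that concrete, a rough sketch (not a patch) of the shape the
proposal could take -- assuming exit_pi_state_list() runs for every
exiting task and holds ->pi_lock across the whole cleanup; both are
assumptions of this sketch, not the current code:

void exit_pi_state_list(struct task_struct *curr)
{
        raw_spin_lock_irq(&curr->pi_lock);
        while (!list_empty(&curr->pi_state_list)) {
                /* ... existing per-pi_state cleanup ... */
        }
        /*
         * Publish PF_EXITPIDONE under the same ->pi_lock that
         * attach_to_pi_owner() takes: anyone acquiring the lock after
         * us sees the flag and can fail with -ESRCH; anyone who got in
         * before us attached a pi_state that the loop above has
         * already cleaned up.
         */
        curr->flags |= PF_EXITPIDONE;
        raw_spin_unlock_irq(&curr->pi_lock);
}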
As for the recursive fault: I think the safer option is to set
EXITPIDONE and not register more PI states, as opposed to allowing
more and more states to be added. Yes, we'll leak whatever is
currently there, but there is no point in letting it get worse.

do_exit()
{
        exit_signals(tsk);              /* sets PF_EXITING */
        smp_mb();
        raw_spin_unlock_wait(&tsk->pi_lock);

        exit_mm() {
                mm_release() {
                        exit_pi_state_list();
                }
        }

        tsk->flags |= PF_EXITPIDONE;
}

vs

futex_lock_pi()
{
retry:
        ...
        ret = futex_lock_pi_atomic() {
                attach_to_pi_owner() {
                        raw_spin_lock(&tsk->pi_lock);
                        if (PF_EXITING) {
                                ret = PF_EXITPIDONE ? -ESRCH : -EAGAIN;
                                raw_spin_unlock(&tsk->pi_lock);
                                return ret;
                        }
                }
        }
        if (ret) {
                switch (ret) {
                ...
                case -EAGAIN:
                        ...
                        cond_resched();
                        goto retry;
                }
        }
}

vs

futex_requeue()
{
retry:
        ...
        ret = futex_proxy_trylock_atomic() {
                ret = futex_lock_pi_atomic() {
                        attach_to_pi_owner() {
                                raw_spin_lock(&tsk->pi_lock);
                                if (PF_EXITING) {
                                        ret = PF_EXITPIDONE ? -ESRCH : -EAGAIN;
                                        raw_spin_unlock(&tsk->pi_lock);
                                        return ret;
                                }
                        }
                }
        }
        if (ret > 0) {
                ret = lookup_pi_state() {
                        attach_to_pi_owner() {
                                raw_spin_lock(&tsk->pi_lock);
                                if (PF_EXITING) {
                                        ret = PF_EXITPIDONE ? -ESRCH : -EAGAIN;
                                        raw_spin_unlock(&tsk->pi_lock);
                                        return ret;
                                }
                        }
                }
        }
        ...
        switch (ret) {
        ...
        case -EAGAIN:
                ...
                cond_resched();
                goto retry;
        }
}

vs
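Whichever of those chains races against do_exit(), they all funnel
into the same attach_to_pi_owner() test. A hypothetical attach-side
counterpart to the exit sketch above, under the same assumption that
PF_EXITPIDONE is published under ->pi_lock -- an illustration of the
proposal, not the current code:

static int attach_to_pi_owner(struct task_struct *p, ...)
{
        raw_spin_lock_irq(&p->pi_lock);
        if (unlikely(p->flags & PF_EXITPIDONE)) {
                /* exit_pi_state_list() has run; the owner is gone. */
                raw_spin_unlock_irq(&p->pi_lock);
                return -ESRCH;
        }
        /*
         * PF_EXITING alone is no longer a reason to retry: since
         * PF_EXITPIDONE is set under this lock, the exit path has not
         * run yet and must still see -- and clean up -- whatever we
         * attach here. The unbounded -EAGAIN loop that can starve
         * behind an RT owner goes away.
         */
        /* ... allocate pi_state, add it to p->pi_state_list ... */
        raw_spin_unlock_irq(&p->pi_lock);
        return 0;
}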