Date: Mon, 2 Feb 2015 16:11:59 +0100
From: Peter Zijlstra
To: Oleg Nesterov
Cc: Darren Hart, Thomas Gleixner, Jerome Marchand, Larry Woodman,
    Mateusz Guzik, linux-kernel@vger.kernel.org
Subject: Re: [PATCH 0/1] futex: check PF_KTHREAD rather than !p->mm to filter out kthreads
Message-ID: <20150202151159.GE26304@twins.programming.kicks-ass.net>
In-Reply-To: <20150202140515.GA26398@redhat.com>

On Mon, Feb 02, 2015 at 03:05:15PM +0100, Oleg Nesterov wrote:
> First of all, why exactly do we need this mm/PF_KTHREAD check added by
> f0d71b3dcb8332f7971 ? Of course, it is simply wrong to declare a random
> kernel thread to be the owner, as the changelog says. But why is a
> kthread worse than a random user-space task, say, /sbin/init?

As the changelog says, we _should_ equally disallow other userspace
tasks that do not share the futex value with us; it's just that at the
time we could not come up with a sensible (and cheap) way of testing
for this.

> IIUC, the fact that we can abuse ->pi_state_list is not that bad, no
> matter whether this (k)thread will exit or not. AFAICS, the only
> problem is that we can boost the prio of this thread. Or did I miss
> another problem?

No, that's it.

> I am asking because we need to backport some fixes, and I am trying to
> convince myself that I actually understand what I am trying to do ;)
>
> And another question. Let's forget about this ->mm check. I simply
> cannot understand this
>
>	ret = (p->flags & PF_EXITPIDONE) ? -ESRCH : -EAGAIN;
>
> logic in attach_to_pi_owner(). First of all, why do we need to retry
> if PF_EXITING is set but PF_EXITPIDONE is not? Why can we not simply
> ignore PF_EXITING and rely on exit_pi_state_list() if PF_EXITPIDONE
> is not set?
>
> I must have missed something, but this looks buggy: I do not see any
> preemption point in this "retry" loop. Suppose that max_cpus=1 and an
> rt_task() preempts the non-rt PF_EXITING owner. Looks like
> futex_lock_pi() can spin forever in this case? (OK, ignoring RT
> throttling.)

This is not something I've ever looked at before; 778e9a9c3e71
("pi-futex: fix exit races and locking problems") seems to suggest it's
possible to get onto tsk->pi_state_list after exit_pi_state_list().

So while the below shows preemption points, those don't actually help
against RT tasks: a FIFO-99 task will always be more eligible to run
than most others.

So yes, I do like your proposal of putting PF_EXITPIDONE under the
->pi_lock section that handles exit_pi_state_list(). I further think
we can remove the smp_mb(); raw_spin_unlock_wait() from do_exit() --
this would offset the new unconditional ->pi_lock acquisition in
exit_pi_state_list(). The comment there suggests robust futexes are
involved, but I cannot find any except the PI state muck testing
->flags.
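To make that concrete, a rough sketch (not a patch) of the shape the
proposal could take -- assuming exit_pi_state_list() runs for every
exiting task and holds ->pi_lock across the whole cleanup; both are
assumptions of this sketch, not the current code:

void exit_pi_state_list(struct task_struct *curr)
{
        raw_spin_lock_irq(&curr->pi_lock);
        while (!list_empty(&curr->pi_state_list)) {
                /* ... existing per-pi_state cleanup ... */
        }
        /*
         * Publish PF_EXITPIDONE under the same ->pi_lock that
         * attach_to_pi_owner() takes: anyone acquiring the lock after
         * us sees the flag and can fail with -ESRCH; anyone who got in
         * before us attached a pi_state that the loop above has
         * already cleaned up.
         */
        curr->flags |= PF_EXITPIDONE;
        raw_spin_unlock_irq(&curr->pi_lock);
}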
As for the recursive fault: I think the safer option is to set
EXITPIDONE and not register more PI states, as opposed to allowing
more and more states to be added. Yes, we'll leak whatever is
currently there, but there is no point in letting it get worse.

do_exit()
{
        exit_signals(tsk);              /* sets PF_EXITING */
        smp_mb();
        raw_spin_unlock_wait(&tsk->pi_lock);

        exit_mm() {
                mm_release() {
                        exit_pi_state_list();
                }
        }

        tsk->flags |= PF_EXITPIDONE;
}

vs

futex_lock_pi()
{
retry:
        ...
        ret = futex_lock_pi_atomic() {
                attach_to_pi_owner() {
                        raw_spin_lock(&tsk->pi_lock);
                        if (PF_EXITING) {
                                ret = PF_EXITPIDONE ? -ESRCH : -EAGAIN;
                                raw_spin_unlock(&tsk->pi_lock);
                                return ret;
                        }
                }
        }
        if (ret) {
                switch (ret) {
                ...
                case -EAGAIN:
                        ...
                        cond_resched();
                        goto retry;
                }
        }
}

vs

futex_requeue()
{
retry:
        ...
        ret = futex_proxy_trylock_atomic() {
                ret = futex_lock_pi_atomic() {
                        attach_to_pi_owner() {
                                raw_spin_lock(&tsk->pi_lock);
                                if (PF_EXITING) {
                                        ret = PF_EXITPIDONE ? -ESRCH : -EAGAIN;
                                        raw_spin_unlock(&tsk->pi_lock);
                                        return ret;
                                }
                        }
                }
        }
        if (ret > 0) {
                ret = lookup_pi_state() {
                        attach_to_pi_owner() {
                                raw_spin_lock(&tsk->pi_lock);
                                if (PF_EXITING) {
                                        ret = PF_EXITPIDONE ? -ESRCH : -EAGAIN;
                                        raw_spin_unlock(&tsk->pi_lock);
                                        return ret;
                                }
                        }
                }
        }
        ...
        switch (ret) {
        ...
        case -EAGAIN:
                ...
                cond_resched();
                goto retry;
        }
}

vs
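Whichever of those chains races against do_exit(), they all funnel
into the same attach_to_pi_owner() test. A hypothetical attach-side
counterpart to the exit sketch above, under the same assumption that
PF_EXITPIDONE is published under ->pi_lock -- an illustration of the
proposal, not the current code:

static int attach_to_pi_owner(struct task_struct *p, ...)
{
        raw_spin_lock_irq(&p->pi_lock);
        if (unlikely(p->flags & PF_EXITPIDONE)) {
                /* exit_pi_state_list() has run; the owner is gone. */
                raw_spin_unlock_irq(&p->pi_lock);
                return -ESRCH;
        }
        /*
         * PF_EXITING alone is no longer a reason to retry: since
         * PF_EXITPIDONE is set under this lock, the exit path has not
         * run yet and must still see -- and clean up -- whatever we
         * attach here. The unbounded -EAGAIN loop that can starve
         * behind an RT owner goes away.
         */
        /* ... allocate pi_state, add it to p->pi_state_list ... */
        raw_spin_unlock_irq(&p->pi_lock);
        return 0;
}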