From mboxrd@z Thu Jan  1 00:00:00 1970
Return-Path: <linux-kernel-owner@vger.kernel.org>
Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand
	id S933383AbbBBOGz (ORCPT <rfc822;w@1wt.eu>);
	Mon, 2 Feb 2015 09:06:55 -0500
Received: from mx1.redhat.com ([209.132.183.28]:34346 "EHLO mx1.redhat.com"
	rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP
	id S932206AbbBBOGx (ORCPT <rfc822;linux-kernel@vger.kernel.org>);
	Mon, 2 Feb 2015 09:06:53 -0500
Date: Mon, 2 Feb 2015 15:05:15 +0100
From: Oleg Nesterov <oleg@redhat.com>
To: Darren Hart <darren@dvhart.com>, Peter Zijlstra <peterz@infradead.org>,
        Thomas Gleixner <tglx@linutronix.de>
Cc: Jerome Marchand <jmarchan@redhat.com>, Larry Woodman <lwoodman@redhat.com>,
        Mateusz Guzik <mguzik@redhat.com>, linux-kernel@vger.kernel.org
Subject: [PATCH 0/1] futex: check PF_KTHREAD rather than !p->mm to filter
	out kthreads
Message-ID: <20150202140515.GA26398@redhat.com>
MIME-Version: 1.0
Content-Type: text/plain; charset=us-ascii
Content-Disposition: inline
User-Agent: Mutt/1.5.18 (2008-05-17)
Sender: linux-kernel-owner@vger.kernel.org
List-ID: <linux-kernel.vger.kernel.org>
X-Mailing-List: linux-kernel@vger.kernel.org

Thomas, et al, could you please review?

The change looks trivial, but I simply can not understand this logic,
please help.


First of all, why exactly do we need this mm/PF_KTHREAD check added by
f0d71b3dcb8332f7971 ? Of course, it is simply wrong to declare a random
kernel thread to be the owner as the changelog says. But why kthread is
worse than a random user-space task, say, /sbin/init?

IIUC, the fact that we can abuse ->pi_state_list is not that bad, no matter
if this (k)thread will exit or not. AFAICS, the only problem is that we can
boost the prio of this thread. Or I missed another problem?

I am asking because we need to backport some fixes, and I am trying to
convince myself that I actually understand what I am trying to do ;)


And another question. Lets forget about this ->mm check. I simply can not
understand this

	ret = (p->flags & PF_EXITPIDONE) ? -ESRCH : -EAGAIN

logic in attach_to_pi_owner(). First of all, why do we need to retry if
PF_EXITING is set but PF_EXITPIDONE is not? Why we can not simply ignore
PF_EXITING and rely on exit_pi_state_list() if PF_EXITPIDONE is not set?

I must have missed something but this looks buggy, I do not see any
preemption point in this "retry" loop. Suppose that max_cpus=1 and rt_task()
preempts the non-rt PF_EXITING owner. Looks like futex_lock_pi() can spin
forever in this case? (OK, ignoring RT throttling).

IOW. Could you explain why the patch below is wrong correctness-wise?
Lets ignore the fact it is obviously ugly and suboptimal...

Thanks,

Oleg.

--- x/kernel/exit.c
+++ x/kernel/exit.c
@@ -684,13 +684,7 @@ void do_exit(long code)
 	if (unlikely(tsk->flags & PF_EXITING)) {
 		pr_alert("Fixing recursive fault but reboot is needed!\n");
 		/*
-		 * We can do this unlocked here. The futex code uses
-		 * this flag just to verify whether the pi state
-		 * cleanup has been done or not. In the worst case it
-		 * loops once more. We pretend that the cleanup was
-		 * done as there is no way to return. Either the
-		 * OWNER_DIED bit is set by now or we push the blocked
-		 * task into the wait for ever nirwana as well.
+		 * XXX: ADD COMMENT
 		 */
 		tsk->flags |= PF_EXITPIDONE;
 		set_current_state(TASK_UNINTERRUPTIBLE);
@@ -699,8 +693,7 @@ void do_exit(long code)
 
 	exit_signals(tsk);  /* sets PF_EXITING */
 	/*
-	 * tsk->flags are checked in the futex code to protect against
-	 * an exiting task cleaning up the robust pi futexes.
+	 * XXX: anobody else needs this ?
 	 */
 	smp_mb();
 	raw_spin_unlock_wait(&tsk->pi_lock);
@@ -779,12 +772,6 @@ void do_exit(long code)
 	 * Make sure we are holding no locks:
 	 */
 	debug_check_no_locks_held();
-	/*
-	 * We can do this unlocked here. The futex code uses this flag
-	 * just to verify whether the pi state cleanup has been done
-	 * or not. In the worst case it loops once more.
-	 */
-	tsk->flags |= PF_EXITPIDONE;
 
 	if (tsk->io_context)
 		exit_io_context(tsk);
--- x/kernel/fork.c
+++ x/kernel/fork.c
@@ -803,8 +803,7 @@ void mm_release(struct task_struct *tsk,
 		tsk->compat_robust_list = NULL;
 	}
 #endif
-	if (unlikely(!list_empty(&tsk->pi_state_list)))
-		exit_pi_state_list(tsk);
+	exit_pi_state_list(tsk);
 #endif
 
 	uprobe_free_utask(tsk);
--- x/kernel/futex.c
+++ x/kernel/futex.c
@@ -722,6 +722,7 @@ void exit_pi_state_list(struct task_stru
 	 * versus waiters unqueueing themselves:
 	 */
 	raw_spin_lock_irq(&curr->pi_lock);
+	curr->flags |= PF_EXITPIDONE;
 	while (!list_empty(head)) {
 
 		next = head->next;
@@ -912,17 +913,15 @@ static int attach_to_pi_owner(u32 uval, 
 	 * p->pi_lock:
 	 */
 	raw_spin_lock_irq(&p->pi_lock);
-	if (unlikely(p->flags & PF_EXITING)) {
+	if (unlikely(p->flags & PF_EXITPIDONE)) {
 		/*
 		 * The task is on the way out. When PF_EXITPIDONE is
 		 * set, we know that the task has finished the
 		 * cleanup:
 		 */
-		int ret = (p->flags & PF_EXITPIDONE) ? -ESRCH : -EAGAIN;
-
 		raw_spin_unlock_irq(&p->pi_lock);
 		put_task_struct(p);
-		return ret;
+		return -ESRCH;
 	}
 
 	/*