From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S933383AbbBBOGz (ORCPT ); Mon, 2 Feb 2015 09:06:55 -0500 Received: from mx1.redhat.com ([209.132.183.28]:34346 "EHLO mx1.redhat.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S932206AbbBBOGx (ORCPT ); Mon, 2 Feb 2015 09:06:53 -0500 Date: Mon, 2 Feb 2015 15:05:15 +0100 From: Oleg Nesterov To: Darren Hart , Peter Zijlstra , Thomas Gleixner Cc: Jerome Marchand , Larry Woodman , Mateusz Guzik , linux-kernel@vger.kernel.org Subject: [PATCH 0/1] futex: check PF_KTHREAD rather than !p->mm to filter out kthreads Message-ID: <20150202140515.GA26398@redhat.com> MIME-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline User-Agent: Mutt/1.5.18 (2008-05-17) Sender: linux-kernel-owner@vger.kernel.org List-ID: X-Mailing-List: linux-kernel@vger.kernel.org Thomas, et al, could you please review? The change looks trivial, but I simply can not understand this logic, please help. First of all, why exactly do we need this mm/PF_KTHREAD check added by f0d71b3dcb8332f7971 ? Of course, it is simply wrong to declare a random kernel thread to be the owner as the changelog says. But why kthread is worse than a random user-space task, say, /sbin/init? IIUC, the fact that we can abuse ->pi_state_list is not that bad, no matter if this (k)thread will exit or not. AFAICS, the only problem is that we can boost the prio of this thread. Or I missed another problem? I am asking because we need to backport some fixes, and I am trying to convince myself that I actually understand what I am trying to do ;) And another question. Lets forget about this ->mm check. I simply can not understand this ret = (p->flags & PF_EXITPIDONE) ? -ESRCH : -EAGAIN logic in attach_to_pi_owner(). First of all, why do we need to retry if PF_EXITING is set but PF_EXITPIDONE is not? Why we can not simply ignore PF_EXITING and rely on exit_pi_state_list() if PF_EXITPIDONE is not set? I must have missed something but this looks buggy, I do not see any preemption point in this "retry" loop. Suppose that max_cpus=1 and rt_task() preempts the non-rt PF_EXITING owner. Looks like futex_lock_pi() can spin forever in this case? (OK, ignoring RT throttling). IOW. Could you explain why the patch below is wrong correctness-wise? Lets ignore the fact it is obviously ugly and suboptimal... Thanks, Oleg. --- x/kernel/exit.c +++ x/kernel/exit.c @@ -684,13 +684,7 @@ void do_exit(long code) if (unlikely(tsk->flags & PF_EXITING)) { pr_alert("Fixing recursive fault but reboot is needed!\n"); /* - * We can do this unlocked here. The futex code uses - * this flag just to verify whether the pi state - * cleanup has been done or not. In the worst case it - * loops once more. We pretend that the cleanup was - * done as there is no way to return. Either the - * OWNER_DIED bit is set by now or we push the blocked - * task into the wait for ever nirwana as well. + * XXX: ADD COMMENT */ tsk->flags |= PF_EXITPIDONE; set_current_state(TASK_UNINTERRUPTIBLE); @@ -699,8 +693,7 @@ void do_exit(long code) exit_signals(tsk); /* sets PF_EXITING */ /* - * tsk->flags are checked in the futex code to protect against - * an exiting task cleaning up the robust pi futexes. + * XXX: anobody else needs this ? */ smp_mb(); raw_spin_unlock_wait(&tsk->pi_lock); @@ -779,12 +772,6 @@ void do_exit(long code) * Make sure we are holding no locks: */ debug_check_no_locks_held(); - /* - * We can do this unlocked here. The futex code uses this flag - * just to verify whether the pi state cleanup has been done - * or not. In the worst case it loops once more. - */ - tsk->flags |= PF_EXITPIDONE; if (tsk->io_context) exit_io_context(tsk); --- x/kernel/fork.c +++ x/kernel/fork.c @@ -803,8 +803,7 @@ void mm_release(struct task_struct *tsk, tsk->compat_robust_list = NULL; } #endif - if (unlikely(!list_empty(&tsk->pi_state_list))) - exit_pi_state_list(tsk); + exit_pi_state_list(tsk); #endif uprobe_free_utask(tsk); --- x/kernel/futex.c +++ x/kernel/futex.c @@ -722,6 +722,7 @@ void exit_pi_state_list(struct task_stru * versus waiters unqueueing themselves: */ raw_spin_lock_irq(&curr->pi_lock); + curr->flags |= PF_EXITPIDONE; while (!list_empty(head)) { next = head->next; @@ -912,17 +913,15 @@ static int attach_to_pi_owner(u32 uval, * p->pi_lock: */ raw_spin_lock_irq(&p->pi_lock); - if (unlikely(p->flags & PF_EXITING)) { + if (unlikely(p->flags & PF_EXITPIDONE)) { /* * The task is on the way out. When PF_EXITPIDONE is * set, we know that the task has finished the * cleanup: */ - int ret = (p->flags & PF_EXITPIDONE) ? -ESRCH : -EAGAIN; - raw_spin_unlock_irq(&p->pi_lock); put_task_struct(p); - return ret; + return -ESRCH; } /*