linux-kernel.vger.kernel.org archive mirror
 help / color / mirror / Atom feed
* [GIT PULL] futex fixlet
@ 2011-12-29 21:07 Ingo Molnar
  2011-12-30  1:26 ` Linus Torvalds
  0 siblings, 1 reply; 5+ messages in thread
From: Ingo Molnar @ 2011-12-29 21:07 UTC (permalink / raw)
  To: Linus Torvalds; +Cc: linux-kernel, Peter Zijlstra, Thomas Gleixner, Darren Hart

Linus,

Please pull the latest core-urgent-for-linus git tree from:

   git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip.git core-urgent-for-linus

Please have a good look at it - it was probably not tested
by everyone with the usual vigor, due to holiday excesse^W 
festivities.

 Thanks,

	Ingo

------------------>
Peter Zijlstra (1):
      futex: Fix uninterruptible loop due to gate_area


 include/linux/mm.h |    1 +
 kernel/futex.c     |   40 +++++++++++++++++++++++++++++++++++-----
 mm/mmap.c          |    5 +++++
 3 files changed, 41 insertions(+), 5 deletions(-)

diff --git a/include/linux/mm.h b/include/linux/mm.h
index 4baadd1..3025cbc 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -1395,6 +1395,7 @@ extern int may_expand_vm(struct mm_struct *mm, unsigned long npages);
 extern int install_special_mapping(struct mm_struct *mm,
 				   unsigned long addr, unsigned long len,
 				   unsigned long flags, struct page **pages);
+extern bool is_special_mapping(struct vm_area_struct *vma);
 
 extern unsigned long get_unmapped_area(struct file *, unsigned long, unsigned long, unsigned long, unsigned long);
 
diff --git a/kernel/futex.c b/kernel/futex.c
index ea87f4d..4d66cd3 100644
--- a/kernel/futex.c
+++ b/kernel/futex.c
@@ -59,6 +59,7 @@
 #include <linux/magic.h>
 #include <linux/pid.h>
 #include <linux/nsproxy.h>
+#include <linux/mm.h>
 
 #include <asm/futex.h>
 
@@ -236,7 +237,7 @@ get_futex_key(u32 __user *uaddr, int fshared, union futex_key *key, int rw)
 	unsigned long address = (unsigned long)uaddr;
 	struct mm_struct *mm = current->mm;
 	struct page *page, *page_head;
-	int err, ro = 0;
+	int err, ro = 0, no_mapping_tries = 0;
 
 	/*
 	 * The futex address must be "naturally" aligned.
@@ -317,13 +318,42 @@ again:
 	if (!page_head->mapping) {
 		unlock_page(page_head);
 		put_page(page_head);
+
 		/*
-		* ZERO_PAGE pages don't have a mapping. Avoid a busy loop
-		* trying to find one. RW mapping would have COW'd (and thus
-		* have a mapping) so this page is RO and won't ever change.
-		*/
+		 * ZERO_PAGE pages don't have a mapping. Avoid a busy loop
+		 * trying to find one. RW mapping would have COW'd (and thus
+		 * have a mapping) so this page is RO and won't ever change.
+		 */
 		if ((page_head == ZERO_PAGE(address)))
 			return -EFAULT;
+
+		/*
+		 * Similar problem for the gate area.
+		 */
+		if (in_gate_area(mm, address))
+			return -EFAULT;
+
+		/*
+		 * There is a special class of pages that will have no mapping
+		 * and yet is perfectly valid and not going anywhere. These
+		 * are the pages from install_special_mapping(). Since looking
+		 * up the vma is expensive, don't do so on the first go round.
+		 */
+		if (no_mapping_tries) {
+			struct vm_area_struct *vma;
+
+			err = 0;
+			down_read(&mm->mmap_sem);
+			vma = find_vma(mm, address);
+			if (vma && is_special_mapping(vma))
+				err = -EFAULT;
+			up_read(&mm->mmap_sem);
+
+			if (err)
+				return err;
+		}
+
+		++no_mapping_tries;
 		goto again;
 	}
 
diff --git a/mm/mmap.c b/mm/mmap.c
index eae90af..50fde2e 100644
--- a/mm/mmap.c
+++ b/mm/mmap.c
@@ -2479,6 +2479,11 @@ out:
 	return ret;
 }
 
+bool is_special_mapping(struct vm_area_struct *vma)
+{
+	return vma->vm_ops == &special_mapping_vmops;
+}
+
 static DEFINE_MUTEX(mm_all_locks_mutex);
 
 static void vm_lock_anon_vma(struct mm_struct *mm, struct anon_vma *anon_vma)

^ permalink raw reply related	[flat|nested] 5+ messages in thread

* Re: [GIT PULL] futex fixlet
  2011-12-29 21:07 [GIT PULL] futex fixlet Ingo Molnar
@ 2011-12-30  1:26 ` Linus Torvalds
  2011-12-30 17:07   ` Peter Zijlstra
  0 siblings, 1 reply; 5+ messages in thread
From: Linus Torvalds @ 2011-12-30  1:26 UTC (permalink / raw)
  To: Ingo Molnar; +Cc: linux-kernel, Peter Zijlstra, Thomas Gleixner, Darren Hart

On Thu, Dec 29, 2011 at 1:07 PM, Ingo Molnar <mingo@elte.hu> wrote:
>
> Please pull the latest core-urgent-for-linus git tree from:
>
>   git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip.git core-urgent-for-linus
>
> Please have a good look at it - it was probably not tested
> by everyone with the usual vigor, due to holiday excesse^W
> festivities.

Ugh.

It's not about the testing, that thing is disgusting, and I really dislike it.

Why do we even bother to check "page->mapping" at all? What's the
*use* of that loop? My gut feeling is that *that* is the fundamental
problem, and we should just get rid of it, rather than add all these
totally random work-arounds for the problem.

Peter Z? That "if (!page->mapping) goto again" actually goes back to
38d47c1b7075, in 2008. Why does it exist in the first place? There's
no comment nor explanation in the changelog.

Why don't we just unconditionally return -EFAULT? What is the retry
actually supposed to *do*? If somebody races with a mmap/munmap, why
the hell would we care? What is it doing?

So there's no way I'll pull that patch. Even if the patch is
technically correct, what it *really* needs is the comment about why
retrying would be needed and a good idea in the first place. The
comments about all the cases where retrying cannot work are completely
pointless without that much more important commentary.

I really don't understand what race we could be fixing by retrying,
and why we should care about that race.

                Linus

^ permalink raw reply	[flat|nested] 5+ messages in thread

* Re: [GIT PULL] futex fixlet
  2011-12-30  1:26 ` Linus Torvalds
@ 2011-12-30 17:07   ` Peter Zijlstra
  2011-12-30 20:06     ` Linus Torvalds
  0 siblings, 1 reply; 5+ messages in thread
From: Peter Zijlstra @ 2011-12-30 17:07 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Ingo Molnar, linux-kernel, Thomas Gleixner, Darren Hart, Hugh Dickins

On Thu, 2011-12-29 at 17:26 -0800, Linus Torvalds wrote:
> Why do we even bother to check "page->mapping" at all? What's the
> *use* of that loop? My gut feeling is that *that* is the fundamental
> problem, and we should just get rid of it, rather than add all these
> totally random work-arounds for the problem.
> 
> Peter Z? That "if (!page->mapping) goto again" actually goes back to
> 38d47c1b7075, in 2008. Why does it exist in the first place? There's
> no comment nor explanation in the changelog.
> 
> Why don't we just unconditionally return -EFAULT? What is the retry
> actually supposed to *do*? If somebody races with a mmap/munmap, why
> the hell would we care? What is it doing? 

Vague memory seems to suggest it was to do with an unmap race, now the
only such race we care about is swapping, if someone has a futex and
does munmap+mmap under us we really don't care and you get to keep
whatever result that yields.

That said, the ->mapping test is wrong because ->mapping is not actually
cleared when the page is unmapped.

Also, I suspect the is_page_cache_freeable() test in pageout() avoids
the worst of it. It keeps the page around if we have a reference to it,
so a minor fault will then quickly re-instate the same page without loss
of data.

So I _think_ you're completely right and we can simply kill the whole
thing, but I've been trying very hard to forget everything kernel
related for a week, and I really shouldn't kick-start my brain until
somewhere next week.



^ permalink raw reply	[flat|nested] 5+ messages in thread

* Re: [GIT PULL] futex fixlet
  2011-12-30 17:07   ` Peter Zijlstra
@ 2011-12-30 20:06     ` Linus Torvalds
  2011-12-30 21:01       ` Hugh Dickins
  0 siblings, 1 reply; 5+ messages in thread
From: Linus Torvalds @ 2011-12-30 20:06 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Ingo Molnar, linux-kernel, Thomas Gleixner, Darren Hart, Hugh Dickins

[-- Attachment #1: Type: text/plain, Size: 935 bytes --]

On Fri, Dec 30, 2011 at 9:07 AM, Peter Zijlstra <a.p.zijlstra@chello.nl> wrote:
>
> So I _think_ you're completely right and we can simply kill the whole
> thing, but I've been trying very hard to forget everything kernel
> related for a week, and I really shouldn't kick-start my brain until
> somewhere next week.

Ok.

Ingo, I'd suggest you put a patch like that into -next, and mark it
for stable after it has gotten lots more testing.

Something like the appended *totally* untested code. It compiles, but
maybe there really is some "!page->mapping" case that could possibly
matter.

I can't see it, though - but we should definitely get some testing for
this before I'd put it in a release. Since we do that
"get_user_pages_fast()" and ask for a writable page, the result should
be as stable as it will ever get. There is still the race with
hugepage splitting, but at least that one has a comment.

                       Linus

[-- Attachment #2: patch.diff --]
[-- Type: text/x-patch, Size: 1093 bytes --]

 kernel/futex.c |   20 ++++++++++++--------
 1 files changed, 12 insertions(+), 8 deletions(-)

diff --git a/kernel/futex.c b/kernel/futex.c
index ea87f4d2f455..8d2ee2f66418 100644
--- a/kernel/futex.c
+++ b/kernel/futex.c
@@ -310,21 +310,25 @@ again:
 	if (page != page_head) {
 		get_page(page_head);
 		put_page(page);
+		/* Avoid "unused label" for non-THP config */
+		if (0)
+			goto again;
 	}
 #endif
 
 	lock_page(page_head);
+
+	/*
+	 * If the page doesn't have a mapping associated with it,
+	 * this is either an anonymous page that was accessed RO
+	 * (so no COW etc) and we would just return EFAULT anyway,
+	 * or some special case page (ZERO_PAGE, gate area, special
+	 * mapping..).
+	 */
 	if (!page_head->mapping) {
 		unlock_page(page_head);
 		put_page(page_head);
-		/*
-		* ZERO_PAGE pages don't have a mapping. Avoid a busy loop
-		* trying to find one. RW mapping would have COW'd (and thus
-		* have a mapping) so this page is RO and won't ever change.
-		*/
-		if ((page_head == ZERO_PAGE(address)))
-			return -EFAULT;
-		goto again;
+		return -EFAULT;
 	}
 
 	/*

^ permalink raw reply related	[flat|nested] 5+ messages in thread

* Re: [GIT PULL] futex fixlet
  2011-12-30 20:06     ` Linus Torvalds
@ 2011-12-30 21:01       ` Hugh Dickins
  0 siblings, 0 replies; 5+ messages in thread
From: Hugh Dickins @ 2011-12-30 21:01 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Peter Zijlstra, Ingo Molnar, linux-kernel, Thomas Gleixner, Darren Hart

On Fri, 30 Dec 2011, Linus Torvalds wrote:
> On Fri, Dec 30, 2011 at 9:07 AM, Peter Zijlstra <a.p.zijlstra@chello.nl> wrote:
> >
> > So I _think_ you're completely right and we can simply kill the whole
> > thing, but I've been trying very hard to forget everything kernel
> > related for a week, and I really shouldn't kick-start my brain until
> > somewhere next week.
> 
> Ok.
> 
> Ingo, I'd suggest you put a patch like that into -next, and mark it
> for stable after it has gotten lots more testing.

Yes, we shouldn't rush in what we don't understand.  But mostly I agree
with you, and Peter, that we've gathered too much cargo-cult in here.

> 
> Something like the appended *totally* untested code. It compiles, but
> maybe there really is some "!page->mapping" case that could possibly
> matter.

I think there is such a case.  I located (Peter's reply to) 2008-04-08
lkml mail, in which Nick says

> What I'm worried about with this is invalidate or truncate races.
> With direct IO, it obviously doesn't matter because the only
> requirement is that the page existed at the address at some point
> during the syscall... 
> 
> So I'd really like you to not carry the page around in the key, but
> just continue using the same key we have now. Also, lock the page
> and ensure it hasn't been truncated before taking the inode from the
> key and incrementing its count (page lock's extra atomics should be
> more or less cancelled out by fewer mmap_sem atomic ops).

and rewrites what Peter posted earlier to include this fragment

> +       lock_page(page);
> +       if (!page->mapping) { /* PageAnon pages shouldn't get caught here */
> +               unlock_page(page);
> +               put_page(page);
> +               goto again;
> +       }

Right, it's standard procedure to check page->mapping after locking the
page, to make sure that it hasn't just been truncated (or invalidated).

(And since the PageAnon bit is actually kept inside page->mapping, it's
impossible to find a PageAnon with !page->mapping: which suits fine here.)

Now, I don't think truncation (or hole punching) is an issue: it would
be okay to return -EFAULT if we caught a moment when it was truncated.

But invalidation?  Someone doing an echo to /proc/sys/vm/drop_caches
at the wrong moment causing this futex path to fail?  The invalidate
used in drop_caches does back off if the page is mapped, but who's to
say that page reclaim didn't just unmap it independently.  And it can't
do much on shmem mappings, which would be the common shared futex case:
but we kept that general, so it could be a real filesystem mapping.

I believe that's why Nick put in the "goto again" that's caused
so much trouble.  But I'm not sure what's the best way forward.

My first thought was to fall back to the original pre-38d47c1b70
code in the !page->mapping case: we only want this page and its
page->mapping as a quick way to discover page->mapping->host.
So fall back to mmap_sem, find_extend_vma,
vma->vm_file->f_path.dentry->d_inode?

But the old code had its own problem, with VM_NONLINEAR mappings,
where you really need page->index to find the offset.  So the old
code had its own fallback to get_user_pages and looking up the page.
We don't really want to have a fallback to a fallback like that.

Oh, I've just had a slightly evil thought: although the !page->mapping
needs fallback to discover the inode, its page->index remains valid
until the page is actually freed (and we rely upon that thoughout
mm/truncate.c).

If we're not rushing for 3.2 final (this is no new regression),
I'd prefer to do such a patch later in the day if that's okay -
or by all means beat me to it.

Or am I just getting over-excited about invalidation?

Hugh

> 
> I can't see it, though - but we should definitely get some testing for
> this before I'd put it in a release. Since we do that
> "get_user_pages_fast()" and ask for a writable page, the result should
> be as stable as it will ever get. There is still the race with
> hugepage splitting, but at least that one has a comment.
> 
>  kernel/futex.c |   20 ++++++++++++--------
>  1 files changed, 12 insertions(+), 8 deletions(-)
> 
> diff --git a/kernel/futex.c b/kernel/futex.c
> index ea87f4d2f455..8d2ee2f66418 100644
> --- a/kernel/futex.c
> +++ b/kernel/futex.c
> @@ -310,21 +310,25 @@ again:
>  	if (page != page_head) {
>  		get_page(page_head);
>  		put_page(page);
> +		/* Avoid "unused label" for non-THP config */
> +		if (0)
> +			goto again;
>  	}
>  #endif
>  
>  	lock_page(page_head);
> +
> +	/*
> +	 * If the page doesn't have a mapping associated with it,
> +	 * this is either an anonymous page that was accessed RO
> +	 * (so no COW etc) and we would just return EFAULT anyway,
> +	 * or some special case page (ZERO_PAGE, gate area, special
> +	 * mapping..).
> +	 */
>  	if (!page_head->mapping) {
>  		unlock_page(page_head);
>  		put_page(page_head);
> -		/*
> -		* ZERO_PAGE pages don't have a mapping. Avoid a busy loop
> -		* trying to find one. RW mapping would have COW'd (and thus
> -		* have a mapping) so this page is RO and won't ever change.
> -		*/
> -		if ((page_head == ZERO_PAGE(address)))
> -			return -EFAULT;
> -		goto again;
> +		return -EFAULT;
>  	}
>  
>  	/*

^ permalink raw reply	[flat|nested] 5+ messages in thread

end of thread, other threads:[~2011-12-30 21:01 UTC | newest]

Thread overview: 5+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2011-12-29 21:07 [GIT PULL] futex fixlet Ingo Molnar
2011-12-30  1:26 ` Linus Torvalds
2011-12-30 17:07   ` Peter Zijlstra
2011-12-30 20:06     ` Linus Torvalds
2011-12-30 21:01       ` Hugh Dickins

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).