linux-kernel.vger.kernel.org archive mirror
* smp race fix between invalidate_inode_pages* and do_no_page
@ 2005-12-13 19:37 Andrea Arcangeli
  2005-12-13 21:02 ` Andrew Morton
  0 siblings, 1 reply; 23+ messages in thread
From: Andrea Arcangeli @ 2005-12-13 19:37 UTC (permalink / raw)
  To: linux-kernel; +Cc: Andrew Morton, Hugh Dickins

Thanks to the debugging checks in the sles9 kernel (wrongly removed by
somebody while integrating the objrmap/anon-vma code into mainline) I
identified a race condition between the pagecache invalidate and
do_no_page.

A similar race condition existed with truncate too, and it was fixed by
relying on the i_size shrinking first, with proper memory barriers.

But with the cache invalidates (including the old cache invalidates,
before I added the more aggressive invalidate_inode_pages2 for O_DIRECT)
the i_size doesn't shrink. So we can end up with stale contents mapped
in userland with page->mapping = NULL. The page_mapped check before
removing the page from the pagecache is worthless without my fix because
do_no_page may be running from under us.

I resurrected at least one warn-on in mainline that should allow
catching those conditions. I had two bugons triggering in sles9: one is
the one I resurrected here, the other was in objrmap. Depending on the
timings, it was whichever triggered first that noticed the problem.

The basic issue is that locking down the page is totally useless to
block do_no_page: do_no_page will find the page in the cache regardless
of the page lock, and it will map it into the address space of the task
while remove_from_page_cache is running.

To fix the smp race without being able to use the i_size shrinking on
the truncate side (and being read on the ->nopage side), I added a new
data structure that tracks whether invalidates are being run under us.
It's the same as the seqlock, but it allows scheduling inside the write
critical section, so I called it seqschedlock.

The patch is untested (I only checked that it boots) and it will be
tested soon in the real world scenario that triggered the race
condition, but in the meantime I'd like to hear comments about this
approach, or whether you have ideas on a simpler way to fix it.

Clearly taking the page lock in do_no_page would fix it too, but it
would destroy the scalability of page faults. The seqschedlock instead
is 100% smp scalable in the fast path (exactly like RCU and seqlocks),
it's lightweight in the write path too, and it doesn't introduce
latencies in freeing memory.

My main concern is the yield() instead of a waitqueue, but perhaps that
can be improved somehow (I haven't thought much about it yet). I'm not
sure whether a waitqueue would be better off inside or outside the
seqschedlock, but I believe these are secondary matters; in practice it
should never trigger.

One could argue that it's not a bug if an invalidate leaves stale
orphaned pagecache mapped in userspace, but I believe it's a bug, or at
the very least it breaks Andrew's efforts to make O_DIRECT fully
coherent at the mmap level. It's true that the one racy page fault
that happens at the same time as O_DIRECT deserves to see the old data,
but all the following ones (once the read/write O_DIRECT call returned)
definitely should not see the old data anymore. So either we give up on
trying to make invalidate_inode_pages2 mmap coherent (and we resurrect
my code that only cleared the uptodate bitflag and we allow
not-up-to-date pages to be mapped in userland) or we should fix this
race too IMHO (this way or in some other way).

Thanks!

Signed-off-by: Andrea Arcangeli <andrea@suse.de>

diff -r 48eb6d55cbd3 fs/inode.c
--- a/fs/inode.c	Tue Dec 13 08:48:29 2005 +0800
+++ b/fs/inode.c	Tue Dec 13 20:30:19 2005 +0100
@@ -197,6 +197,7 @@
 	INIT_RADIX_TREE(&inode->i_data.page_tree, GFP_ATOMIC);
 	rwlock_init(&inode->i_data.tree_lock);
 	spin_lock_init(&inode->i_data.i_mmap_lock);
+	seqschedlock_init(&inode->i_data.invalidate_lock);
 	INIT_LIST_HEAD(&inode->i_data.private_list);
 	spin_lock_init(&inode->i_data.private_lock);
 	INIT_RAW_PRIO_TREE_ROOT(&inode->i_data.i_mmap);
diff -r 48eb6d55cbd3 include/linux/fs.h
--- a/include/linux/fs.h	Tue Dec 13 08:48:29 2005 +0800
+++ b/include/linux/fs.h	Tue Dec 13 20:30:19 2005 +0100
@@ -10,6 +10,7 @@
 #include <linux/limits.h>
 #include <linux/ioctl.h>
 #include <linux/rcuref.h>
+#include <linux/seqschedlock.h>
 
 /*
  * It's silly to have NR_OPEN bigger than NR_FILE, but you can change
@@ -349,6 +350,7 @@
 	struct list_head	i_mmap_nonlinear;/*list VM_NONLINEAR mappings */
 	spinlock_t		i_mmap_lock;	/* protect tree, count, list */
 	unsigned int		truncate_count;	/* Cover race condition with truncate */
+	seqschedlock_t		invalidate_lock;/* Cover race condition between ->nopage and invalidate_inode_pages2 */
 	unsigned long		nrpages;	/* number of total pages */
 	pgoff_t			writeback_index;/* writeback starts here */
 	struct address_space_operations *a_ops;	/* methods */
diff -r 48eb6d55cbd3 mm/filemap.c
--- a/mm/filemap.c	Tue Dec 13 08:48:29 2005 +0800
+++ b/mm/filemap.c	Tue Dec 13 20:30:19 2005 +0100
@@ -116,6 +116,9 @@
 	page->mapping = NULL;
 	mapping->nrpages--;
 	pagecache_acct(-1);
+
+	/* mapped pagecache can't be dropped! */
+	WARN_ON(page_mapped(page) && !PageSwapCache(page));
 }
 
 void remove_from_page_cache(struct page *page)
diff -r 48eb6d55cbd3 mm/memory.c
--- a/mm/memory.c	Tue Dec 13 08:48:29 2005 +0800
+++ b/mm/memory.c	Tue Dec 13 20:30:19 2005 +0100
@@ -2006,6 +2006,7 @@
 	struct address_space *mapping = NULL;
 	pte_t entry;
 	unsigned int sequence = 0;
+	unsigned int invalidate_lock;
 	int ret = VM_FAULT_MINOR;
 	int anon = 0;
 
@@ -2014,10 +2015,24 @@
 
 	if (vma->vm_file) {
 		mapping = vma->vm_file->f_mapping;
+	retry:
 		sequence = mapping->truncate_count;
-		smp_rmb(); /* serializes i_size against truncate_count */
-	}
-retry:
+		/* Implicit smp_rmb() serializes i_size against truncate_count */
+
+		switch (read_seqschedlock(&mapping->invalidate_lock,
+					  &invalidate_lock)) {
+		default:
+			BUG();
+		case SEQSCHEDLOCK_WRITER_RUNNING:
+			yield();
+			/* fall through */
+		case SEQSCHEDLOCK_WRITER_RACE:
+			cond_resched();
+			goto retry;
+		case SEQSCHEDLOCK_SUCCESS:
+			;
+		}
+	}
 	new_page = vma->vm_ops->nopage(vma, address & PAGE_MASK, &ret);
 	/*
 	 * No smp_rmb is needed here as long as there's a full
@@ -2056,13 +2071,26 @@
 	 * invalidated this page.  If unmap_mapping_range got called,
 	 * retry getting the page.
 	 */
-	if (mapping && unlikely(sequence != mapping->truncate_count)) {
-		pte_unmap_unlock(page_table, ptl);
-		page_cache_release(new_page);
-		cond_resched();
-		sequence = mapping->truncate_count;
-		smp_rmb();
-		goto retry;
+	if (likely(mapping)) {
+		if (unlikely(sequence != mapping->truncate_count)) {
+		race_failure:
+			pte_unmap_unlock(page_table, ptl);
+			page_cache_release(new_page);
+			cond_resched();
+			goto retry;
+		}
+		switch (read_seqschedunlock(&mapping->invalidate_lock,
+					    invalidate_lock)) {
+		default:
+			BUG();
+		case SEQSCHEDLOCK_WRITER_RUNNING:
+			yield();
+			/* fall through */
+		case SEQSCHEDLOCK_WRITER_RACE:
+			goto race_failure;
+		case SEQSCHEDLOCK_SUCCESS:
+			;
+		}
 	}
 
 	/*
diff -r 48eb6d55cbd3 mm/truncate.c
--- a/mm/truncate.c	Tue Dec 13 08:48:29 2005 +0800
+++ b/mm/truncate.c	Tue Dec 13 20:30:19 2005 +0100
@@ -210,9 +210,13 @@
 			next++;
 			if (PageDirty(page) || PageWriteback(page))
 				goto unlock;
+
+			write_seqschedlock(&mapping->invalidate_lock);
 			if (page_mapped(page))
 				goto unlock;
 			ret += invalidate_complete_page(mapping, page);
+			write_seqschedunlock(&mapping->invalidate_lock);
+
 unlock:
 			unlock_page(page);
 			if (next > end)
@@ -276,6 +280,8 @@
 				break;
 			}
 			wait_on_page_writeback(page);
+
+			write_seqschedlock(&mapping->invalidate_lock);
 			while (page_mapped(page)) {
 				if (!did_range_unmap) {
 					/*
@@ -302,6 +308,8 @@
 					set_page_dirty(page);
 				ret = -EIO;
 			}
+			write_seqschedunlock(&mapping->invalidate_lock);
+
 			unlock_page(page);
 		}
 		pagevec_release(&pvec);
diff -r 48eb6d55cbd3 include/linux/seqschedlock.h
--- /dev/null	Thu Jan  1 00:00:00 1970 +0000
+++ b/include/linux/seqschedlock.h	Tue Dec 13 20:30:19 2005 +0100
@@ -0,0 +1,129 @@
+/*
+ * Like the seqlock but allows scheduling inside the write critical section.
+ * Copyright (c) 2005 Novell, Inc.
+ *
+ * This program is free software; you can redistribute it and/or
+ * modify it under the terms of version 2 of the GNU General Public
+ * License as published by the Free Software Foundation.
+ *
+ * This program is distributed in the hope that it will be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
+ * GNU General Public License for more details.
+ *
+ * You should have received a copy of the GNU General Public License
+ * along with this program; if not, contact Novell, Inc.
+ *
+ * Started by Andrea Arcangeli to fix the SMP race between
+ * invalidate_inode_pages2() and do_no_page().
+ */
+
+#include <linux/config.h>
+
+#ifndef __LINUX_SEQSCHEDLOCK_H
+#define __LINUX_SEQSCHEDLOCK_H
+
+#define SEQSCHEDLOCK_WRITER_RUNNING	2	/* big collision */
+#define SEQSCHEDLOCK_WRITER_RACE	1	/* minor collision */
+#define SEQSCHEDLOCK_SUCCESS		0
+
+#ifdef CONFIG_SMP
+
+typedef struct seqschedlock_s {
+	atomic_t writers;
+	atomic_t sequence;
+} seqschedlock_t;
+
+#define SEQSCHEDLOCK_UNLOCKED	{ ATOMIC_INIT(0), ATOMIC_INIT(0), }
+
+static inline void write_seqschedlock(seqschedlock_t * sl)
+{
+	atomic_inc(&sl->writers);
+	smp_wmb();
+}
+
+static inline void write_seqschedunlock(seqschedlock_t * sl)
+{
+	smp_wmb();
+	atomic_inc(&sl->sequence);
+	smp_wmb();
+	atomic_dec(&sl->writers);
+}
+
+/*
+ * Perhaps this should be changed to wait in some waitqueue
+ * automatically instead of asking the caller to handle
+ * the race condition.
+ */
+static inline int read_seqschedlock(const seqschedlock_t * sl,
+				    unsigned int * sequence)
+{
+	unsigned int writers, sequence1, sequence2;
+
+	sequence1 = atomic_read(&sl->sequence);
+	smp_rmb();
+	writers = atomic_read(&sl->writers);
+	smp_rmb();
+	sequence2 = atomic_read(&sl->sequence);
+	smp_rmb();
+
+	if (unlikely(writers))
+		return SEQSCHEDLOCK_WRITER_RUNNING;
+	else if (unlikely(sequence1 != sequence2))
+		return SEQSCHEDLOCK_WRITER_RACE;
+	else {
+		*sequence = sequence1;
+		return SEQSCHEDLOCK_SUCCESS;
+	}
+}
+
+static inline int read_seqschedunlock(const seqschedlock_t * sl,
+				      unsigned int prev_sequence)
+{
+	unsigned int writers, sequence;
+
+	smp_rmb();
+	writers = atomic_read(&sl->writers);
+	smp_rmb();
+	sequence = atomic_read(&sl->sequence);
+
+	if (unlikely(writers))
+		return SEQSCHEDLOCK_WRITER_RUNNING;
+	else if (unlikely(sequence != prev_sequence))
+		return SEQSCHEDLOCK_WRITER_RACE;
+	else
+		return SEQSCHEDLOCK_SUCCESS;
+}
+
+#else /* CONFIG_SMP */
+
+typedef struct seqschedlock_s {
+} seqschedlock_t;
+
+#define SEQSCHEDLOCK_UNLOCKED	{ }
+
+static inline void write_seqschedlock(seqschedlock_t * sl)
+{
+}
+
+static inline void write_seqschedunlock(seqschedlock_t * sl)
+{
+}
+
+static inline int read_seqschedlock(const seqschedlock_t * sl,
+				    unsigned int * sequence)
+{
+	return SEQSCHEDLOCK_SUCCESS;
+}
+
+static inline int read_seqschedunlock(const seqschedlock_t * sl,
+				      unsigned int prev_sequence)
+{
+	return SEQSCHEDLOCK_SUCCESS;
+}
+
+#endif  /* CONFIG_SMP */
+
+#define seqschedlock_init(x)	do { *(x) = (seqschedlock_t) SEQSCHEDLOCK_UNLOCKED; } while (0)
+
+#endif /* __LINUX_SEQSCHEDLOCK_H */

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: smp race fix between invalidate_inode_pages* and do_no_page
  2005-12-13 19:37 smp race fix between invalidate_inode_pages* and do_no_page Andrea Arcangeli
@ 2005-12-13 21:02 ` Andrew Morton
  2005-12-13 21:14   ` Andrea Arcangeli
  0 siblings, 1 reply; 23+ messages in thread
From: Andrew Morton @ 2005-12-13 21:02 UTC (permalink / raw)
  To: Andrea Arcangeli; +Cc: linux-kernel, hugh

Andrea Arcangeli <andrea@suse.de> wrote:
>
> ...
>
> Clearly taking the page lock in do_no_page would fix it too, but it
> would destroy the scalability of page faults.

It's always bugged me that filemap_nopage() doesn't lock the page.  There
might be additional uglies which could be tidied up if we were to do so.

The scalability loss would happen if there are multiple processes/threads
faulting in the same page I guess.   I wonder how important that would be.

I suppose that even if we did lock the page in filemap_nopage(), the
coverage isn't sufficient here - it needs to extend into do_no_page()?


> The seqschedlock instead
> is 100% smp scalable in the fast path (exactly like RCU and seqlocks),
> it's lightweight in the write path too, and it doesn't introduce
> latencies in freeing memory.

Why this rather than down_read/down_write?  We might even be able to hoist
ext3_inode's i_truncate_sem into the address_space, for zero space cost on
most Linux inodes (dunno).

Is there some way in which we can use mapping->truncate_count to tell
do_no_page() that it raced with invalidate()?  By checking it after the pte
has been established?

> My main concern is the yield() instead of a waitqueue but perhaps that
> can be improved somehow (I didn't think much about it yet). Not sure if
> a waitqueue would be better off inside or outside the seqschedlock, but
> I believe these are secondary matters, in practice it should never
> trigger.

yield() is pretty sucky.



* Re: smp race fix between invalidate_inode_pages* and do_no_page
  2005-12-13 21:02 ` Andrew Morton
@ 2005-12-13 21:14   ` Andrea Arcangeli
  2005-12-16 13:51     ` Andrea Arcangeli
  0 siblings, 1 reply; 23+ messages in thread
From: Andrea Arcangeli @ 2005-12-13 21:14 UTC (permalink / raw)
  To: Andrew Morton; +Cc: linux-kernel, hugh

On Tue, Dec 13, 2005 at 01:02:27PM -0800, Andrew Morton wrote:
> It's always bugged me that filemap_nopage() doesn't lock the page.  There
> might be additional uglies which could be tidied up if we were to do so.
> 
> The scalability loss would happen if there are multiple processes/threads
> faulting in the same page I guess.   I wonder how important that would be.

I don't know for sure. I know that threads faulting on the same page
at the same time is common, but I don't know about processes; I expect
it is possible.

> I suppose that even if we did lock the page in filemap_nopage(), the
> coverage isn't sufficient here - it needs to extend into do_no_page()?

I think so. For invalidate_mapping_pages (the simpler case, since it's
less aggressive) we'd need to drop the page lock only after increasing
the mapcount, so page_mapped() will notice it has to skip the page.

> Why this rather than down_read/down_write?  We might even be able to hoist
> ext3_inode's i_truncate_sem into the address_space, for zero space cost on
> most Linux inodes (dunno).

The only reason for not using semaphores is to keep the fast path 100%
scalable, without risking cacheline bouncing and without requiring
exclusive access to a cacheline in a cpu (i.e. writes).

> Is there some way in which we can use mapping->truncate_count to tell
> do_no_page() that it raced with invalidate()?  By checking it after the pte
> has been established?

I tried that but failed. The reason is that truncate_count alone is
worthless: we combine truncate_count with the i_size information, and
we write and read them in reverse order to do the locking for truncate.

But invalidates can't change the i_size, so truncate_count alone can't
be used.

I could have saved 4 bytes by using truncate_count instead of
sl->sequence, but the ugliness of the code it would have generated made
me waste the 4 bytes.

> yield() is pretty sucky.

I wonder if we can use a waitqueue to wait for a wakeup from the writer
while still leaving the fast path readonly. In theory the yield()
should never trigger; when it triggers it means we hit the race
condition. But I also don't like yield(). Anyway, it was pointless to
make it more complicated before hearing some comments; I'd rather not
invest too much time in the seqschedlocks before knowing whether they
will be used ;).


* Re: smp race fix between invalidate_inode_pages* and do_no_page
  2005-12-13 21:14   ` Andrea Arcangeli
@ 2005-12-16 13:51     ` Andrea Arcangeli
  2006-01-10  6:24       ` Andrea Arcangeli
  0 siblings, 1 reply; 23+ messages in thread
From: Andrea Arcangeli @ 2005-12-16 13:51 UTC (permalink / raw)
  To: Andrew Morton; +Cc: linux-kernel, hugh

There was a minor buglet in the previous patch; an update is here:

	http://www.kernel.org/pub/linux/kernel/people/andrea/patches/v2.6/2.6.15-rc5/seqschedlock-2


* Re: smp race fix between invalidate_inode_pages* and do_no_page
  2005-12-16 13:51     ` Andrea Arcangeli
@ 2006-01-10  6:24       ` Andrea Arcangeli
  2006-01-10  6:48         ` Andrea Arcangeli
  2006-01-11  4:08         ` Nick Piggin
  0 siblings, 2 replies; 23+ messages in thread
From: Andrea Arcangeli @ 2006-01-10  6:24 UTC (permalink / raw)
  To: Andrew Morton; +Cc: linux-kernel, hugh

On Fri, Dec 16, 2005 at 02:51:47PM +0100, Andrea Arcangeli wrote:
> There was a minor buglet in the previous patch an update is here:
> 
> 	http://www.kernel.org/pub/linux/kernel/people/andrea/patches/v2.6/2.6.15-rc5/seqschedlock-2

JFYI: I got a few hours ago positive confirmation of the fix from the
customer that was capable of reproducing this. I guess this is good
enough for production use (it's at the very least certainly better than
the previous code and it's guaranteed not to hurt the scalability of the
fast path in smp, so it's the least intrusive fix I could imagine).

So we can start to think about whether we should use this new primitive
I created, and whether to replace the yield() with a proper waitqueue
(and how). Or whether to take the risk of hurting the scalability of
nopage page faults a bit, by rewriting the fix with a find_lock_page in
the do_no_page handler, which would avoid the need for my new locking
primitive.

Comments welcome thanks!


* Re: smp race fix between invalidate_inode_pages* and do_no_page
  2006-01-10  6:24       ` Andrea Arcangeli
@ 2006-01-10  6:48         ` Andrea Arcangeli
  2006-01-11  4:08         ` Nick Piggin
  1 sibling, 0 replies; 23+ messages in thread
From: Andrea Arcangeli @ 2006-01-10  6:48 UTC (permalink / raw)
  To: Andrew Morton; +Cc: linux-kernel, hugh

On Tue, Jan 10, 2006 at 07:24:25AM +0100, Andrea Arcangeli wrote:
> On Fri, Dec 16, 2005 at 02:51:47PM +0100, Andrea Arcangeli wrote:
> > There was a minor buglet in the previous patch an update is here:
> > 
> > 	http://www.kernel.org/pub/linux/kernel/people/andrea/patches/v2.6/2.6.15-rc5/seqschedlock-2
> 
> JFYI: I got a few hours ago positive confirmation of the fix from the
> customer that was capable of reproducing this. I guess this is good
> enough for production use (it's at the very least certainly better than
> the previous code and it's guaranteed not to hurt the scalability of the
> fast path in smp, so it's the least intrusive fix I could imagine).
> 
> So we can start to think about whether we should use this new primitive
> I created, and whether to replace the yield() with a proper waitqueue
> (and how). Or whether to take the risk of hurting the scalability of
> nopage page faults a bit, by rewriting the fix with a find_lock_page in
> the do_no_page handler, which would avoid the need for my new locking
> primitive.

Another possible way to fix this is to put the page_count check back in
the invalidate_*_* under the tree_lock (exactly like the VM does in
shrink_caches and exactly like 2.4 does too!), and to stop removing
pages from pagecache if their page_count is > 1 (we would go back to
clear PG_uptodate). But then we'd have a problem once again with the
PG_uptodate bitflag being cleared by invalidate_*_* while do_no_page is
running, and in turn a mapped page could have PG_uptodate clear 8);
that's the invariant that led us to start zapping the ptes and dropping
pagecache, so then we could stop zapping the ptes too like 2.4 and
allowing PG_uptodate to be clear (there's nothing fundamentally wrong
with that, as long as the buffers BH_uptodate are clear too).

Side note: in 2.6 invalidate_mapping_pages has been smp racy at least
since 2.6.5; it basically broke when the page_count check was replaced
with a page_mapping check long ago. But it's probably so infrequent,
and the race so tiny, that it never happened there; the first bugchecks
in the sles VM (those bugchecks unfortunately removed during the
mainline merging) started to trigger only very recently, when we
started zapping ptes and dropping pages from invalidate_inode_pages2
like mainline does. The bug was invisible through
invalidate_mapping_pages (apparently only jffs2 uses
invalidate_mapping_pages in a way that could oops; even nfs uses
invalidate_inode_pages2).


* Re: smp race fix between invalidate_inode_pages* and do_no_page
  2006-01-10  6:24       ` Andrea Arcangeli
  2006-01-10  6:48         ` Andrea Arcangeli
@ 2006-01-11  4:08         ` Nick Piggin
  2006-01-11  8:23           ` Andrea Arcangeli
  1 sibling, 1 reply; 23+ messages in thread
From: Nick Piggin @ 2006-01-11  4:08 UTC (permalink / raw)
  To: Andrea Arcangeli; +Cc: Andrew Morton, linux-kernel, hugh

Andrea Arcangeli wrote:
> On Fri, Dec 16, 2005 at 02:51:47PM +0100, Andrea Arcangeli wrote:
> 
>>There was a minor buglet in the previous patch an update is here:
>>
>>	http://www.kernel.org/pub/linux/kernel/people/andrea/patches/v2.6/2.6.15-rc5/seqschedlock-2
> 
> 
> JFYI: I got a few hours ago positive confirmation of the fix from the
> customer that was capable of reproducing this. I guess this is good
> enough for production use (it's at the very least certainly better than
> the previous code and it's guaranteed not to hurt the scalability of the
> fast path in smp, so it's the least intrusive fix I could imagine).
> 
> So we can start to think about whether we should use this new primitive
> I created, and whether to replace the yield() with a proper waitqueue
> (and how). Or whether to take the risk of hurting the scalability of
> nopage page faults a bit, by rewriting the fix with a find_lock_page in
> the do_no_page handler, which would avoid the need for my new locking
> primitive.
> 
> Comments welcome thanks!

I'd be inclined to think a lock_page is not a big SMP scalability
problem because the struct page's cacheline(s) will be written to
several times in the process of refcounting anyway. Such a workload
would also be running into tree_lock as well.

Not that I have done any measurements.

-- 
SUSE Labs, Novell Inc.



* Re: smp race fix between invalidate_inode_pages* and do_no_page
  2006-01-11  4:08         ` Nick Piggin
@ 2006-01-11  8:23           ` Andrea Arcangeli
  2006-01-11  8:51             ` Andrew Morton
  2006-01-11  9:34             ` Nick Piggin
  0 siblings, 2 replies; 23+ messages in thread
From: Andrea Arcangeli @ 2006-01-11  8:23 UTC (permalink / raw)
  To: Nick Piggin; +Cc: Andrew Morton, linux-kernel, hugh

On Wed, Jan 11, 2006 at 03:08:31PM +1100, Nick Piggin wrote:
> I'd be inclined to think a lock_page is not a big SMP scalability
> problem because the struct page's cacheline(s) will be written to
> several times in the process of refcounting anyway. Such a workload
> would also be running into tree_lock as well.

I seem to recall you wanted to make the tree_lock a readonly lock for
readers, for the exact same scalability reason? do_no_page is quite a
fast path for the tree lock too. But I totally agree the atomic_inc is
unavoidable, good point, so it's worth more to remove the tree_lock
than to remove the page lock: the tree_lock can be avoided, the
atomic_inc on page->_count cannot.

The other bonus that makes this attractive is that then we can drop the
*whole* vm_truncate_count mess... vm_truncate_count and
inode->truncate_count exist for the single reason that do_no_page
must not map into the pte a page that is under truncation. We can
provide the same guarantee with the page lock by doing what
invalidate_inode_pages2_range does (that is, checking page_mapping
under the page lock and executing unmap_mapping_range with the page
lock held if needed). That will free 4 bytes per vma (without even
counting the truncate_count on every inode out there, which could be an
even larger gain). On my system I have 9191 vmas in use; that's 36K of
ram saved on x86, and on x86-64 it's 72K of physical ram saved, since
there it's an unsigned long after a pointer, and the vma must not be hw
aligned (and in fact it isn't, so the saving is real). On the inode
side it saves 4 bytes * 1384 on my current system; on a busy nfs server
it can save a lot more. The inode also must not be hw aligned, and
correctly it isn't. On a server with lots more vmas and inodes it'll
save more ram.

So if I make this change, this could give me a grant of a lifetime
guarantee for seccomp in the kernel, which takes less than 1kbyte on
x86, right? (on a normal desktop I'll save at minimum 30 times more
than what I cost to the kernel users ;) Just kidding of course...


* Re: smp race fix between invalidate_inode_pages* and do_no_page
  2006-01-11  8:23           ` Andrea Arcangeli
@ 2006-01-11  8:51             ` Andrew Morton
  2006-01-11  9:02               ` Andrea Arcangeli
  2006-01-11  9:34             ` Nick Piggin
  1 sibling, 1 reply; 23+ messages in thread
From: Andrew Morton @ 2006-01-11  8:51 UTC (permalink / raw)
  To: Andrea Arcangeli; +Cc: nickpiggin, linux-kernel, hugh

Andrea Arcangeli <andrea@suse.de> wrote:
>
>  On Wed, Jan 11, 2006 at 03:08:31PM +1100, Nick Piggin wrote:
>  > I'd be inclined to think a lock_page is not a big SMP scalability
>  > problem because the struct page's cacheline(s) will be written to
>  > several times in the process of refcounting anyway. Such a workload
>  > would also be running into tree_lock as well.
> 
>  I seem to recall you wanted to make the tree_lock a readonly lock for
>  readers for the exact same scalability reason? do_no_page is quite a
>  fast path for the tree lock too. But I totally agree the unavoidable is
>  the atomic_inc though, good point, so it worth more to remove the
>  tree_lock than to remove the page lock, the tree_lock can be avoided the
>  atomic_inc on page->_count not.
> 
>  The other bonus that makes this attractive is that then we can drop the
>  *whole* vm_truncate_count mess... vm_truncate_count and
>  inode->trunate_count exists for the only single reason that do_no_page
>  must not map into the pte a page that is under truncation.

I think you'll find this hard - filemap_nopage() is the first to find the
page but we need lock coverage up in do_no_page().  So the ->nopage
protocol will need to be changed to "must return with the page locked".  Or
we add a new ->nopage_locked and call that if the vm_ops implements it.

But I agree it's a good change if we can pull it off.


* Re: smp race fix between invalidate_inode_pages* and do_no_page
  2006-01-11  8:51             ` Andrew Morton
@ 2006-01-11  9:02               ` Andrea Arcangeli
  2006-01-11  9:06                 ` Andrew Morton
  2006-01-11  9:39                 ` Nick Piggin
  0 siblings, 2 replies; 23+ messages in thread
From: Andrea Arcangeli @ 2006-01-11  9:02 UTC (permalink / raw)
  To: Andrew Morton; +Cc: nickpiggin, linux-kernel, hugh

On Wed, Jan 11, 2006 at 12:51:34AM -0800, Andrew Morton wrote:
> Andrea Arcangeli <andrea@suse.de> wrote:
> >
> >  On Wed, Jan 11, 2006 at 03:08:31PM +1100, Nick Piggin wrote:
> >  > I'd be inclined to think a lock_page is not a big SMP scalability
> >  > problem because the struct page's cacheline(s) will be written to
> >  > several times in the process of refcounting anyway. Such a workload
> >  > would also be running into tree_lock as well.
> > 
> >  I seem to recall you wanted to make the tree_lock a readonly lock for
> >  readers for the exact same scalability reason? do_no_page is quite a
> >  fast path for the tree lock too. But I totally agree the unavoidable is
> >  the atomic_inc though, good point, so it worth more to remove the
> >  tree_lock than to remove the page lock, the tree_lock can be avoided the
> >  atomic_inc on page->_count not.
> > 
> >  The other bonus that makes this attractive is that then we can drop the
> >  *whole* vm_truncate_count mess... vm_truncate_count and
> >  inode->trunate_count exists for the only single reason that do_no_page
> >  must not map into the pte a page that is under truncation.
> 
> I think you'll find this hard - filemap_nopage() is the first to find the
> page but we need lock coverage up in do_no_page().  So the ->nopage
> protocol will need to be changed to "must return with the page locked".  Or
> we add a new ->nopage_locked and call that if the vm_ops implements it.

Can't we avoid changing the protocol and use lock_page in do_no_page
instead? All we need to check before bailing out and trying again is
that page->mapping is still there, and then we have to make
"page_mapping() == True" before unlocking (then the other side will
have to block on the pte_lock when running unmap_mapping_range a second
time with the page lock held).

The main scary thing as far as I can tell, is the blocking lock_page. We
can't just do TryLockPage...

> But I agree it's a good change if we can pull it off.

Ok good thanks!


* Re: smp race fix between invalidate_inode_pages* and do_no_page
  2006-01-11  9:02               ` Andrea Arcangeli
@ 2006-01-11  9:06                 ` Andrew Morton
  2006-01-11  9:13                   ` Andrea Arcangeli
  2006-01-11  9:39                 ` Nick Piggin
  1 sibling, 1 reply; 23+ messages in thread
From: Andrew Morton @ 2006-01-11  9:06 UTC (permalink / raw)
  To: Andrea Arcangeli; +Cc: nickpiggin, linux-kernel, hugh

Andrea Arcangeli <andrea@suse.de> wrote:
>
>  On Wed, Jan 11, 2006 at 12:51:34AM -0800, Andrew Morton wrote:
>  > Andrea Arcangeli <andrea@suse.de> wrote:
>  > >
>  > >  On Wed, Jan 11, 2006 at 03:08:31PM +1100, Nick Piggin wrote:
>  > >  > I'd be inclined to think a lock_page is not a big SMP scalability
>  > >  > problem because the struct page's cacheline(s) will be written to
>  > >  > several times in the process of refcounting anyway. Such a workload
>  > >  > would also be running into tree_lock as well.
>  > > 
>  > >  I seem to recall you wanted to make the tree_lock a readonly lock for
>  > >  readers for the exact same scalability reason? do_no_page is quite a
>  > >  fast path for the tree lock too. But I totally agree the unavoidable is
>  > >  the atomic_inc though, good point, so it worth more to remove the
>  > >  tree_lock than to remove the page lock, the tree_lock can be avoided the
>  > >  atomic_inc on page->_count not.
>  > > 
>  > >  The other bonus that makes this attractive is that then we can drop the
>  > >  *whole* vm_truncate_count mess... vm_truncate_count and
>  > >  inode->trunate_count exists for the only single reason that do_no_page
>  > >  must not map into the pte a page that is under truncation.
>  > 
>  > I think you'll find this hard - filemap_nopage() is the first to find the
>  > page but we need lock coverage up in do_no_page().  So the ->nopage
>  > protocol will need to be changed to "must return with the page locked".  Or
>  > we add a new ->nopage_locked and call that if the vm_ops implements it.
> 
>  Can't we avoid changing the protocol and use lock_page in do_no_page
>  instead?

Confused.  do_no_page() doesn't have a page to lock until it has called
->nopage.


^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: smp race fix between invalidate_inode_pages* and do_no_page
  2006-01-11  9:06                 ` Andrew Morton
@ 2006-01-11  9:13                   ` Andrea Arcangeli
  2006-01-11 20:49                     ` Hugh Dickins
  0 siblings, 1 reply; 23+ messages in thread
From: Andrea Arcangeli @ 2006-01-11  9:13 UTC (permalink / raw)
  To: Andrew Morton; +Cc: nickpiggin, linux-kernel, hugh

On Wed, Jan 11, 2006 at 01:06:38AM -0800, Andrew Morton wrote:
> Andrea Arcangeli <andrea@suse.de> wrote:
> >
> >  On Wed, Jan 11, 2006 at 12:51:34AM -0800, Andrew Morton wrote:
> >  > Andrea Arcangeli <andrea@suse.de> wrote:
> >  > >
> >  > >  On Wed, Jan 11, 2006 at 03:08:31PM +1100, Nick Piggin wrote:
> >  > >  > I'd be inclined to think a lock_page is not a big SMP scalability
> >  > >  > problem because the struct page's cacheline(s) will be written to
> >  > >  > several times in the process of refcounting anyway. Such a workload
> >  > >  > would also be running into tree_lock as well.
> >  > > 
> >  > >  I seem to recall you wanted to make the tree_lock a readonly lock for
> >  > >  readers for the exact same scalability reason? do_no_page is quite a
> >  > >  fast path for the tree lock too. But I totally agree that the unavoidable
> >  > >  part is the atomic_inc, good point, so it's worth more to remove the
> >  > >  tree_lock than to remove the page lock: the tree_lock can be avoided, the
> >  > >  atomic_inc on page->_count cannot.
> >  > > 
> >  > >  The other bonus that makes this attractive is that then we can drop the
> >  > >  *whole* vm_truncate_count mess... vm_truncate_count and
> >  > >  inode->truncate_count exist for the single reason that do_no_page
> >  > >  must not map into the pte a page that is under truncation.
> >  > 
> >  > I think you'll find this hard - filemap_nopage() is the first to find the
> >  > page but we need lock coverage up in do_no_page().  So the ->nopage
> >  > protocol will need to be changed to "must return with the page locked".  Or
> >  > we add a new ->nopage_locked and call that if the vm_ops implements it.
> > 
> >  Can't we avoid changing the protocol and use lock_page in do_no_page
> >  instead?
> 
> Confused.  do_no_page() doesn't have a page to lock until it has called
> ->nopage.

yes, I mean doing lock_page after ->nopage returned it here:


	lock_page(page);
	if (mapping && !page->mapping)
		goto bail_out;
	page_table = pte_offset_map_lock(mm, pmd, address, &ptl);
[..]
			page_add_file_rmap()
			unlock_page()

That should be enough no? Imagine the truncate side implemented exactly
like invalidate_inode_pages2:

	lock_page(page)
	if (page_mapped(page))	
		unmap_mapping_pages()
	truncate_full_page(page)
	unlock_page(page)

Either the pte is dropped by unmap_mapping_pages and we're safe, or
->nopage returns an already truncated page and page->mapping is null and
we bail out.

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: smp race fix between invalidate_inode_pages* and do_no_page
  2006-01-11  8:23           ` Andrea Arcangeli
  2006-01-11  8:51             ` Andrew Morton
@ 2006-01-11  9:34             ` Nick Piggin
  1 sibling, 0 replies; 23+ messages in thread
From: Nick Piggin @ 2006-01-11  9:34 UTC (permalink / raw)
  To: Andrea Arcangeli; +Cc: Andrew Morton, linux-kernel, hugh

Andrea Arcangeli wrote:
> On Wed, Jan 11, 2006 at 03:08:31PM +1100, Nick Piggin wrote:
> 
>>I'd be inclined to think a lock_page is not a big SMP scalability
>>problem because the struct page's cacheline(s) will be written to
>>several times in the process of refcounting anyway. Such a workload
>>would also be running into tree_lock as well.
> 
> 
> I seem to recall you wanted to make the tree_lock a readonly lock for
> readers for the exact same scalability reason? do_no_page is quite a

I think Bill Irwin or Peter Chubb made the tree_lock a reader-writer
lock back in the day.

I have some patches (ref:lockless pagecache) that completely removes
the tree_lock from read-side operations like find_get_page and
find_lock_page, and turns the write side back into a regular spinlock.
You must be thinking of that?

> fast path for the tree lock too. But I totally agree that the unavoidable
> part is the atomic_inc, good point, so it's worth more to remove the
> tree_lock than to remove the page lock: the tree_lock can be avoided, the
> atomic_inc on page->_count cannot.
> 

Yep, my thinking as well.

> The other bonus that makes this attractive is that then we can drop the
> *whole* vm_truncate_count mess... vm_truncate_count and
> inode->truncate_count exist for the single reason that do_no_page
> must not map into the pte a page that is under truncation. We can
> provide the same guarantee with the page lock by doing like
> invalidate_inode_pages2_range (that is, checking page_mapping under the
> page lock and executing unmap_mapping_range with the page lock held if
> needed). That will free 4 bytes per vma (without even counting the
> truncate_count on every inode out there! that could be an even larger
> gain); on my system I have 9191 vmas in use, so that's 36K of ram saved
> on x86, and on x86-64 it's 72K of physical ram saved,
> since it's an unsigned long after a pointer, and the vma need
> not be hw aligned (and in fact it isn't, so the saving is real). On the
> inodes side it saves 4 bytes
> * 1384 on my current system; on a busy nfs server it can save a lot
> more. The inode also need not be hw aligned, and indeed it isn't. On a
> server with lots more vmas and lots more inodes it'll save more ram.
> 
> So if I make this change this could give me a grant for lifetime
> guarantee of seccomp in the kernel that takes less than 1kbyte on a x86,
> right? (on a normal desktop I'll save at minimum 30 times more than what
> I cost to the kernel users ;) Just kidding of course...
> 

Sounds like a good idea (and your proposed implementation -
lock_page and recheck mapping in do_no_page sounds sane).

Thanks,
Nick

-- 
SUSE Labs, Novell Inc.

Send instant messages to your online friends http://au.messenger.yahoo.com 

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: smp race fix between invalidate_inode_pages* and do_no_page
  2006-01-11  9:02               ` Andrea Arcangeli
  2006-01-11  9:06                 ` Andrew Morton
@ 2006-01-11  9:39                 ` Nick Piggin
  1 sibling, 0 replies; 23+ messages in thread
From: Nick Piggin @ 2006-01-11  9:39 UTC (permalink / raw)
  To: Andrea Arcangeli; +Cc: Andrew Morton, linux-kernel, hugh

Andrea Arcangeli wrote:

> The main scary thing as far as I can tell, is the blocking lock_page. We
> can't just do TryLockPage...
> 

Hopefully that should be OK... it should not tend to get tripped up on read
because filemap_nopage needs to take the lock and wait in that case anyway.
Hopefully other lock_page users will be in the noise?

Another option might be a spinbit in page->flags but that seems like overkill.

-- 
SUSE Labs, Novell Inc.


^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: smp race fix between invalidate_inode_pages* and do_no_page
  2006-01-11  9:13                   ` Andrea Arcangeli
@ 2006-01-11 20:49                     ` Hugh Dickins
  2006-01-11 21:05                       ` Andrew Morton
  2006-01-13  7:35                       ` Nick Piggin
  0 siblings, 2 replies; 23+ messages in thread
From: Hugh Dickins @ 2006-01-11 20:49 UTC (permalink / raw)
  To: Andrea Arcangeli; +Cc: Andrew Morton, nickpiggin, linux-kernel

On Wed, 11 Jan 2006, Andrea Arcangeli wrote:
> On Wed, Jan 11, 2006 at 01:06:38AM -0800, Andrew Morton wrote:
> > 
> > Confused.  do_no_page() doesn't have a page to lock until it has called
> > ->nopage.
> 
> yes, I mean doing lock_page after ->nopage returned it here:
> 
> 	lock_page(page);
> 	if (mapping && !page->mapping)
> 		goto bail_out;
> 	page_table = pte_offset_map_lock(mm, pmd, address, &ptl);
> [..]
> 			page_add_file_rmap()
> 			unlock_page()

I've rather chosen this mail at random from the thread, and can't
come up with better than a few random remarks on it all, sorry.

Though using lock_page may solve more than one problem, without seeing
a full implementation I'm less sure than the rest of you that it will
really do the job.

And less confident than you and Nick that adding a sleeping lock here
in nopage won't present a scalability issue.  Though it gives me a
special thrill to see Nick arguing in favour of extra locking ;)

Keep in mind that in the truncate case, it's the original shared file
page that needs to be locked and tested, when deciding whether it's
okay to insert a COWed anonymous copy (even COWed areas must SIGBUS).

Don't forget to update mm/fremap.c's install_page, which we forgot
originally when doing the truncate_count business.

But when fixing that (not me who noticed the problem), I found we
didn't really need to rely on truncate_count there, just mapping and
i_size: because in install_page we could be sure it was (or had been)
a page cache page with mapping set.  The need for truncate_count
arises from the uncertainty when do_no_page returns from ->nopage,
whether it was a pagecache kind of nopage (which would set mapping),
or a weird driver kind of nopage (which would not).

If coming here again, I think I'd have a VM_ flag for that: almost
corresponds to VM_RESERVED (as I think you were keying off in your SuSE
tree) but better to make it specific to pagecache, or truncatability -
to say mapping ought to be non-NULL, and if it's found NULL after
grabbing pte_lock, then it must have been truncated/invalidated off.

Didn't examine in detail, but I didn't care for your seqschedlock
implementation.  Partly the overdesign of a new locking scheme for
just one usage (seqschedlock.h missing from your recent send, by
the way, but I assume same as last month's) - fine when more uses
come along later, but seemed premature, and better to concentrate
on the immediate need for now.

And partly that you seemed to be leaving the truncate case handled
by one technique (truncate_count) and the invalidate case handled by
another (seqschedlock): much more satisfying to handle them both in
the same-ish way.

But yes, they do differ significantly: treatment of COW pages,
applicability of i_size, wait rather than error out when invalidate.
Perhaps I underestimate the effect of those differences.

I don't think you'll be able to eliminate truncate_count so easily,
since I put it to further use, with a matching vm_truncate_count,
to handle the latency while unmapping vmas from the prio_tree.

I've never got to grips with invalidate_inode_pages, or its more
exciting sequel "Invalidate Inode Pages II".  Last time I looked
(2.6.10-ish) both looked racy, neither very serious.  Alarmed now
to see that while I was attacking latency below unmap_mapping_range,
Andrew was adding unmap_mapping_range calls into iipages2.

I'd been working on the assumption that i_sem is held across it,
but that seems not to be the case, in NFS uses at least (I've
given up further researches to write this mail).  But miraculously
(or by Andrew's careful checking) it looks to me like it is okay:
if there are concurrent unmappings, i_mmap_lock protects truncate_
count and vm_truncate_count modifications, and unmap_mapping_range_tree
won't do worse than fail to skip vmas it has already dealt with.

But I do not know what guarantees invalidate_inode_pages2 is supposed
to give.  As soon as you emerge from iipages2, its work could be undone:
I suppose it provides a serialization point, ensures that you go down to
filesystem once, rather than being satisfied by the pagecache.

Hugh

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: smp race fix between invalidate_inode_pages* and do_no_page
  2006-01-11 20:49                     ` Hugh Dickins
@ 2006-01-11 21:05                       ` Andrew Morton
  2006-01-13  7:35                       ` Nick Piggin
  1 sibling, 0 replies; 23+ messages in thread
From: Andrew Morton @ 2006-01-11 21:05 UTC (permalink / raw)
  To: Hugh Dickins; +Cc: andrea, nickpiggin, linux-kernel

Hugh Dickins <hugh@veritas.com> wrote:
>
>  But I do not know what guarantees invalidate_inode_pages2 is supposed
>  to give.  As soon as you emerge from iipages2, its work could be undone:

yup.  It cannot become a hard guarantee unless we add some new really big
locks.

So it can be fooled by silly or poorly-designed apps.  What we're aiming
for here is predictable behaviour for sanely-implemented applications and
refusal to oops or to expose insecure data for poorly-designed ones or
exploits.

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: smp race fix between invalidate_inode_pages* and do_no_page
  2006-01-11 20:49                     ` Hugh Dickins
  2006-01-11 21:05                       ` Andrew Morton
@ 2006-01-13  7:35                       ` Nick Piggin
  2006-01-13  7:47                         ` Andrew Morton
  1 sibling, 1 reply; 23+ messages in thread
From: Nick Piggin @ 2006-01-13  7:35 UTC (permalink / raw)
  To: Hugh Dickins; +Cc: Andrea Arcangeli, Andrew Morton, linux-kernel

Hugh Dickins wrote:
> On Wed, 11 Jan 2006, Andrea Arcangeli wrote:
> 
>>On Wed, Jan 11, 2006 at 01:06:38AM -0800, Andrew Morton wrote:
>>
>>>Confused.  do_no_page() doesn't have a page to lock until it has called
>>>->nopage.
>>
>>yes, I mean doing lock_page after ->nopage returned it here:
>>
>>	lock_page(page);
>>	if (mapping && !page->mapping)
>>		goto bail_out;
>>	page_table = pte_offset_map_lock(mm, pmd, address, &ptl);
>>[..]
>>			page_add_file_rmap()
>>			unlock_page()
> 
> 
> I've rather chosen this mail at random from the thread, and can't
> come up with better than a few random remarks on it all, sorry.
> 
> Though using lock_page may solve more than one problem, without seeing
> a full implementation I'm less sure than the rest of you that it will
> really do the job.
> 
> And less confident than you and Nick that adding a sleeping lock here
> in nopage won't present a scalability issue.  Though it gives me a
> special thrill to see Nick arguing in favour of extra locking ;)
> 

I argue in favour of extra locking simply because the concept doesn't
fundamentally harm scalability ie. because it is very short held and
there are other unavoidable cacheline conflicts already there.

Having it a sleeping lock rather than spinning is another thing though.
However unless you imagine it running into other long held page lockers
(I guess reclaim might be one, but quite rare -- any other significant
lock_page users that we might hit?), then I don't think it would impact
scalability much on workloads that don't already hurt.

-- 
SUSE Labs, Novell Inc.


^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: smp race fix between invalidate_inode_pages* and do_no_page
  2006-01-13  7:35                       ` Nick Piggin
@ 2006-01-13  7:47                         ` Andrew Morton
  2006-01-13 10:37                           ` Nick Piggin
  0 siblings, 1 reply; 23+ messages in thread
From: Andrew Morton @ 2006-01-13  7:47 UTC (permalink / raw)
  To: Nick Piggin; +Cc: hugh, andrea, linux-kernel

Nick Piggin <nickpiggin@yahoo.com.au> wrote:
>
> (I guess reclaim might be one, but quite rare -- any other significant
>  lock_page users that we might hit?)

The only time 2.6 holds lock_page() for a significant duration is when
bringing the page uptodate with readpage or memset.

The scalability risk here is 100 CPUs all faulting in the same file in the
same pattern.  Like the workload which caused the page_table_lock splitup
(that was with anon pages).  All the CPUs could pretty easily get into sync
and start arguing over every single page's lock.

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: smp race fix between invalidate_inode_pages* and do_no_page
  2006-01-13  7:47                         ` Andrew Morton
@ 2006-01-13 10:37                           ` Nick Piggin
  2006-03-31 12:36                             ` Andrea Arcangeli
  0 siblings, 1 reply; 23+ messages in thread
From: Nick Piggin @ 2006-01-13 10:37 UTC (permalink / raw)
  To: Andrew Morton; +Cc: hugh, andrea, linux-kernel

Andrew Morton wrote:
> Nick Piggin <nickpiggin@yahoo.com.au> wrote:
> 
>>(I guess reclaim might be one, but quite rare -- any other significant
>> lock_page users that we might hit?)
> 
> 
> The only time 2.6 holds lock_page() for a significant duration is when
> bringing the page uptodate with readpage or memset.
> 

Yes that's what I thought. And we don't really need to worry about this
case because filemap_nopage has to deal with it anyway (ie. we shouldn't
see a locked !uptodate page in do_no_page).

> The scalability risk here is 100 CPUs all faulting in the same file in the
> same pattern.  Like the workload which caused the page_table_lock splitup
> (that was with anon pages).  All the CPUs could pretty easily get into sync
> and start arguing over every single page's lock.
> 

Yes, but in that case they're still going to hit the tree_lock anyway, and
if they do have a chance of synching up, the cacheline bouncing from count
and mapcount accounting is almost as likely to cause it as the lock_page
itself.

I did a nopage microbenchmark like you describe a while back. IIRC single
threaded is 2.5 times *more* throughput than 64 CPUs, even when those 64 are
faulting their own NUMA memory (and obviously different pages). Thanks to
tree_lock.

-- 
SUSE Labs, Novell Inc.


^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: smp race fix between invalidate_inode_pages* and do_no_page
  2006-01-13 10:37                           ` Nick Piggin
@ 2006-03-31 12:36                             ` Andrea Arcangeli
  2006-04-02  5:17                               ` Nick Piggin
  2006-04-02  5:21                               ` Andrew Morton
  0 siblings, 2 replies; 23+ messages in thread
From: Andrea Arcangeli @ 2006-03-31 12:36 UTC (permalink / raw)
  To: Nick Piggin; +Cc: Andrew Morton, hugh, linux-kernel

On Fri, Jan 13, 2006 at 09:37:39PM +1100, Nick Piggin wrote:
> Andrew Morton wrote:
> >Nick Piggin <nickpiggin@yahoo.com.au> wrote:
> >
> >>(I guess reclaim might be one, but quite rare -- any other significant
> >>lock_page users that we might hit?)
> >
> >
> >The only time 2.6 holds lock_page() for a significant duration is when
> >bringing the page uptodate with readpage or memset.
> >
> 
> Yes that's what I thought. And we don't really need to worry about this
> case because filemap_nopage has to deal with it anyway (ie. we shouldn't
> see a locked !uptodate page in do_no_page).
> 
> >The scalability risk here is 100 CPUs all faulting in the same file in the
> >same pattern.  Like the workload which caused the page_table_lock splitup
> >(that was with anon pages).  All the CPUs could pretty easily get into sync
> >and start arguing over every single page's lock.
> >
> 
> Yes, but in that case they're still going to hit the tree_lock anyway, and
> if they do have a chance of synching up, the cacheline bouncing from count
> and mapcount accounting is almost as likely to cause it as the lock_page
> itself.
> 
> I did a nopage microbenchmark like you describe a while back. IIRC single
> threaded is 2.5 times *more* throughput than 64 CPUs, even when those 64 are
> faulting their own NUMA memory (and obviously different pages). Thanks to
> tree_lock.

This is the discussed patch that fixes the smp race between do_no_page
and invalidate_inode_pages2.

As a reminder, the race saw the light of day when
invalidate_inode_pages2 started dropping pages from the pagecache
without verifying that their page_count was 1 (i.e. only in pagecache)
while keeping the tree_lock held (that's what shrink_list does). In the
past (including 2.4) invalidate_inode_pages2 was implemented by
clearing only the PG_uptodate/PG_dirty bitflags while leaving the page
in pagecache, exactly to avoid breaking such an invariant. But that
allowed not-uptodate pagecache to be mapped in userland. Now instead we
enforce a new invariant that mapped pagecache has to be uptodate, so
we have to add synchronization between invalidate_inode_pages2 and
do_no_page using something else, not only the tree_lock + page_count.

Without this patch some pages could be mapped in userland without being
part of pagecache, and without being freeable either, despite belonging
to regular filebacked vmas. This was triggering crashes with sles9 after
the last invalidate_inode_pages2 from mainline, because of the more
restrictive checks with bug-ons in the page_add_file_rmap equivalents
that required page->mapping not to be null for pagecache. It looks
inferior that mainline increases page->mapcount for things like the
zeropage, instead of making sure to add only filebacked pages in the
rmap layer and so verifying explicitly that page->mapping is not null in
page_add_file_rmap.

The new PG_truncate bitflag could also be used to eliminate the
truncate_count from all vmas etc.; that would save substantial memory
and remove some complexity, as truncate_count has grown a lot since the
time we introduced it.

PG_truncate is needed as well because we can't know in do_no_page if
page->mapping is legitimately null or not (think of bttv and other
device drivers returning page->mapping null because their pages are
private but not reserved, etc.).

From: Andrea Arcangeli <andrea@suse.de>
Subject: avoid race between invalidate_inode_pages2 and do_no_page

Use page lock and new bitflag to serialize.

Signed-off-by: Andrea Arcangeli <andrea@suse.de>

 include/linux/page-flags.h |    6 ++++++
 mm/memory.c                |   18 ++++++++++++++++++
 mm/page_alloc.c            |    4 ++++
 mm/truncate.c              |    8 ++++++++
 4 files changed, 36 insertions(+)

--- linux-2.6-1/include/linux/page-flags.h	2006-03-29 19:10:03.000000000 +0200
+++ linux-2.6-2/include/linux/page-flags.h	2006-03-30 19:16:29.000000000 +0200
@@ -76,6 +76,8 @@
 #define PG_nosave_free		18	/* Free, should not be written */
 #define PG_uncached		19	/* Page has been mapped as uncached */
 
+#define PG_truncate		20	/* Pagecache has been truncated/invalidated */
+
 /*
  * Global page accounting.  One instance per CPU.  Only unsigned longs are
  * allowed.
@@ -330,6 +332,10 @@ extern void __mod_page_state_offset(unsi
 #define __SetPageCompound(page)	__set_bit(PG_compound, &(page)->flags)
 #define __ClearPageCompound(page) __clear_bit(PG_compound, &(page)->flags)
 
+#define PageTruncate(page)	test_bit(PG_truncate, &(page)->flags)
+#define SetPageTruncate(page)	set_bit(PG_truncate, &(page)->flags)
+#define ClearPageTruncate(page)	clear_bit(PG_truncate, &(page)->flags)
+
 #ifdef CONFIG_SWAP
 #define PageSwapCache(page)	test_bit(PG_swapcache, &(page)->flags)
 #define SetPageSwapCache(page)	set_bit(PG_swapcache, &(page)->flags)
--- linux-2.6-1/mm/memory.c	2006-03-29 19:10:04.000000000 +0200
+++ linux-2.6-2/mm/memory.c	2006-03-30 20:14:58.000000000 +0200
@@ -2088,6 +2088,14 @@ retry:
 		anon = 1;
 	}
 
+	lock_page(new_page);
+	if (unlikely(PageTruncate(new_page))) {
+		unlock_page(new_page);
+		page_cache_release(new_page);
+		BUG_ON(!mapping);
+		cond_resched();
+		goto retry;
+	}
 	page_table = pte_offset_map_lock(mm, pmd, address, &ptl);
 	/*
 	 * For a file-backed vma, someone could have truncated or otherwise
@@ -2096,6 +2104,7 @@ retry:
 	 */
 	if (mapping && unlikely(sequence != mapping->truncate_count)) {
 		pte_unmap_unlock(page_table, ptl);
+		unlock_page(new_page);
 		page_cache_release(new_page);
 		cond_resched();
 		sequence = mapping->truncate_count;
@@ -2128,7 +2137,9 @@ retry:
 			inc_mm_counter(mm, file_rss);
 			page_add_file_rmap(new_page);
 		}
+		unlock_page(new_page);
 	} else {
+		unlock_page(new_page);
 		/* One of our sibling threads was faster, back out. */
 		page_cache_release(new_page);
 		goto unlock;
--- linux-2.6-1/mm/page_alloc.c	2006-03-29 19:10:04.000000000 +0200
+++ linux-2.6-2/mm/page_alloc.c	2006-03-30 20:03:42.000000000 +0200
@@ -148,6 +148,7 @@ static void bad_page(struct page *page)
 			1 << PG_locked	|
 			1 << PG_active	|
 			1 << PG_dirty	|
+			1 << PG_truncate	|
 			1 << PG_reclaim |
 			1 << PG_slab    |
 			1 << PG_swapcache |
@@ -380,6 +381,8 @@ static inline int free_pages_check(struc
 		bad_page(page);
 	if (PageDirty(page))
 		__ClearPageDirty(page);
+	if (PageTruncate(page))
+		ClearPageTruncate(page);
 	/*
 	 * For now, we report if PG_reserved was found set, but do not
 	 * clear it, and do not free the page.  But we shall soon need
@@ -520,6 +523,7 @@ static int prep_new_page(struct page *pa
 			1 << PG_locked	|
 			1 << PG_active	|
 			1 << PG_dirty	|
+			1 << PG_truncate	|
 			1 << PG_reclaim	|
 			1 << PG_slab    |
 			1 << PG_swapcache |
--- linux-2.6-1/mm/truncate.c	2006-02-24 04:36:27.000000000 +0100
+++ linux-2.6-2/mm/truncate.c	2006-03-30 19:15:33.000000000 +0200
@@ -78,6 +78,14 @@ invalidate_complete_page(struct address_
 	write_unlock_irq(&mapping->tree_lock);
 	ClearPageUptodate(page);
 	page_cache_release(page);	/* pagecache ref */
+	/*
+	 * After page truncate is set we must guarantee that
+	 * a new radix tree lookup won't find the same PG_truncate
+	 * page again. The __remove_from_page_cache has taken
+	 * care to provide this guarantee. Of course PG_truncate
+	 * can be changed only under the page_lock().
+	 */
+	SetPageTruncate(page);
 	return 1;
 }
 

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: smp race fix between invalidate_inode_pages* and do_no_page
  2006-03-31 12:36                             ` Andrea Arcangeli
@ 2006-04-02  5:17                               ` Nick Piggin
  2006-04-02  5:21                               ` Andrew Morton
  1 sibling, 0 replies; 23+ messages in thread
From: Nick Piggin @ 2006-04-02  5:17 UTC (permalink / raw)
  To: Andrea Arcangeli; +Cc: Andrew Morton, hugh, linux-kernel

Hi Andrea, seems good to me.

Andrea Arcangeli wrote:

> The PG_truncate is needed as well because we can't know in do_no_page if
> page->mapping is legitimately null or not (think bttv and other device
> drivers returning page->mapping null because they're private but not
> reserved pages etc..)
> 
> From: Andrea Arcangeli <andrea@suse.de>
> Subject: avoid race between invalidate_inode_pages2 and do_no_page
> 
> Use page lock and new bitflag to serialize.
> 

As a clean upstream solution, could we make truncatable (ie. regular
file backed) mappings set a vma flag which changes its nopage protocol
to return a locked page?

filemap_nopage itself, rather than do_no_page would then take care of
handling the truncate races. That seems to be a better layer to handle
it in.

Also, after this there should be no reason for truncate_count, right?
(At least in its truncate/nopage capacity.) If so, we should remove it
as part of the same patch / patchset.

-- 
SUSE Labs, Novell Inc.

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: smp race fix between invalidate_inode_pages* and do_no_page
  2006-03-31 12:36                             ` Andrea Arcangeli
  2006-04-02  5:17                               ` Nick Piggin
@ 2006-04-02  5:21                               ` Andrew Morton
  2006-04-07 19:18                                 ` Hugh Dickins
  1 sibling, 1 reply; 23+ messages in thread
From: Andrew Morton @ 2006-04-02  5:21 UTC (permalink / raw)
  To: Andrea Arcangeli; +Cc: nickpiggin, hugh, linux-kernel

Andrea Arcangeli <andrea@suse.de> wrote:
>
> On Fri, Jan 13, 2006 at 09:37:39PM +1100, Nick Piggin wrote:
> > Andrew Morton wrote:
> > >Nick Piggin <nickpiggin@yahoo.com.au> wrote:
> > >
> > >>(I guess reclaim might be one, but quite rare -- any other significant
> > >>lock_page users that we might hit?)
> > >
> > >
> > >The only time 2.6 holds lock_page() for a significant duration is when
> > >bringing the page uptodate with readpage or memset.
> > >
> > 
> > Yes that's what I thought. And we don't really need to worry about this
> > case because filemap_nopage has to deal with it anyway (ie. we shouldn't
> > see a locked !uptodate page in do_no_page).
> > 
> > >The scalability risk here is 100 CPUs all faulting in the same file in the
> > >same pattern.  Like the workload which caused the page_table_lock splitup
> > >(that was with anon pages).  All the CPUs could pretty easily get into sync
> > >and start arguing over every single page's lock.
> > >
> > 
> > Yes, but in that case they're still going to hit the tree_lock anyway, and
> > if they do have a chance of synching up, the cacheline bouncing from count
> > and mapcount accounting is almost as likely to cause it as the lock_page
> > itself.
> > 
> > I did a nopage microbenchmark like you describe a while back. IIRC single
> > threaded is 2.5 times *more* throughput than 64 CPUs, even when those 64 are
> > faulting their own NUMA memory (and obviously different pages). Thanks to
> > tree_lock.
> 
> This is the discussed patch that fixes the smp race between do_no_page
> and invalidate_inode_pages2.

Again, please try to provide as complete a possible description of a bug
when proposing a fix for it.

> As a reminder the race has seen the light of the day when
> invalidate_inode_pages2 started dropping pages from the pagecache
> without verifying that their page_count was 1 (i.e. only in pagecache)
> while keeping the tree_lock held (that's what shrink_list does). In the
> past (including 2.4) to implement invalidate_inode_pages2 we were
> > clearing only the PG_uptodate/PG_dirty bitflags while leaving the page in
> pagecache exactly to avoid breaking such invariant. But that was
> allowing not uptodate pagecache to be mapped in userland. Now instead we
> enforced a new invariant that mapped pagecache has to be uptodate, so
> we've to add synchronization between invalidate_inode_pages2 and
> do_no_page using something else and not only the tree_lock + page_count.
> 
> Without this patch some pages could be mapped in userland without being
> part of pagecache and without being freeable as well despite being part
> of regular filebacked vmas. This was triggering crashes with sles9 after
> the last invalidate_inode_pages2 from mainline, because of the more
> > restrictive checks with bug-ons in page_add_file_rmap equivalents that
> > required page->mapping not to be null for pagecache. It looks inferior
> that you increase page->mapcount for things like the zeropage in
> mainline, instead of making sure to add only filebacked pages in the
> rmap layer and so verifying explicitly that page->mapping is not null in
> page_add_file_rmap.
> 
> The new PG_truncate bitflag can be used as well to eliminate the
> truncate_count from all vmas etc... that would save substantial memory
> > and remove some complexity, truncate_count has grown a lot since the time we
> introduced it.
> 
> The PG_truncate is needed as well because we can't know in do_no_page if
> > page->mapping is legitimately null or not (think bttv and other device
> drivers returning page->mapping null because they're private but not
> reserved pages etc..)
> 
> From: Andrea Arcangeli <andrea@suse.de>
> Subject: avoid race between invalidate_inode_pages2 and do_no_page
> 
> Use page lock and new bitflag to serialize.
> 
> Signed-off-by: Andrea Arcangeli <andrea@suse.de>

afaict, the race is this:

invalidate_inode_pages2_range():		do_no_page():

	unmap_mapping_range(page)
						  filemap_nopage()
						    page_cache_lookup()
	invalidate_complete_page(page)
	  __remove_from_page_cache()
						install page in pagetables

yes?

The patch will pretty clearly fix that.  But it really would be better to
place all the cost over on the invalidate side, rather than adding overhead
to the pagefault path.

Could we perhaps check the page_count(page) and/or page_mapped(page) in
invalidate_complete_page(), when we have tree_lock?

Do you have handy any test code which others can use to recreate this bug?


* Re: smp race fix between invalidate_inode_pages* and do_no_page
  2006-04-02  5:21                               ` Andrew Morton
@ 2006-04-07 19:18                                 ` Hugh Dickins
  0 siblings, 0 replies; 23+ messages in thread
From: Hugh Dickins @ 2006-04-07 19:18 UTC (permalink / raw)
  To: Andrew Morton; +Cc: Andrea Arcangeli, nickpiggin, linux-kernel

On Sat, 1 Apr 2006, Andrew Morton wrote:
> Andrea Arcangeli <andrea@suse.de> wrote:
> > 
> > The new PG_truncate bitflag can be used as well to eliminate the
> > truncate_count from all vmas etc... That would save substantial memory
> > and remove some complexity; truncate_count has grown a lot since we
> > introduced it.

truncate_count is playing a useful role with the vma tree, allowing
us to find our place again after allowing preemption in.  Though I'm
sometimes wondering whether just to make the vma _bigger_ and add a
list instead of that truncate_count: construct a list of vmas from
the prio_tree for unmap_mapping_range, so it can keep its place in that,
instead of having to restart the tree whenever preempted.  But I
think NFS doesn't hold i_sem across unmapping, so it wouldn't work.

> > The PG_truncate is needed as well because we can't know in do_no_page if
> > page->mapping is legitimately null or not (think of bttv and other device
> > drivers returning pages with page->mapping null because they're private
> > but not reserved pages etc.).

That part is easily dealt with, as Nick and I have suggested,
just by marking vmas liable to having their pages truncated.
But that's certainly no more than one part of a larger solution.

> The patch will pretty clearly fix that.  But it really would be better to
> place all the cost over on the invalidate side, rather than adding overhead
> to the pagefault path.

I see ~2% slowdown in relevant faulting tests e.g. the lmbench ones.
I'd certainly prefer not to be locking pages here: or not without
word from the high-scalability people that Andrea's patch really in
practice doesn't hurt them (but the ~2% I see is on UP as much as MP).

> Could we perhaps check the page_count(page) and/or page_mapped(page) in
> invalidate_complete_page(), when we have tree_lock?

I'm thinking on it, but not finding it easy.  And I've just realized
that invalidate_inode_pages2 isn't the only problematic case here
(though agreed the more urgent one): MADV_REMOVE likewise needs to
invalidate pages without adjusting i_size, so also cannot rely on
the truncate_count/i_size method.

> Do you have handy any test code which others can use to recreate this bug?

I'd be glad of that too.

Hugh



Thread overview: 23+ messages
2005-12-13 19:37 smp race fix between invalidate_inode_pages* and do_no_page Andrea Arcangeli
2005-12-13 21:02 ` Andrew Morton
2005-12-13 21:14   ` Andrea Arcangeli
2005-12-16 13:51     ` Andrea Arcangeli
2006-01-10  6:24       ` Andrea Arcangeli
2006-01-10  6:48         ` Andrea Arcangeli
2006-01-11  4:08         ` Nick Piggin
2006-01-11  8:23           ` Andrea Arcangeli
2006-01-11  8:51             ` Andrew Morton
2006-01-11  9:02               ` Andrea Arcangeli
2006-01-11  9:06                 ` Andrew Morton
2006-01-11  9:13                   ` Andrea Arcangeli
2006-01-11 20:49                     ` Hugh Dickins
2006-01-11 21:05                       ` Andrew Morton
2006-01-13  7:35                       ` Nick Piggin
2006-01-13  7:47                         ` Andrew Morton
2006-01-13 10:37                           ` Nick Piggin
2006-03-31 12:36                             ` Andrea Arcangeli
2006-04-02  5:17                               ` Nick Piggin
2006-04-02  5:21                               ` Andrew Morton
2006-04-07 19:18                                 ` Hugh Dickins
2006-01-11  9:39                 ` Nick Piggin
2006-01-11  9:34             ` Nick Piggin
