* [PATCH 0/12] tmpfs: convert from old swap vector to radix tree
From: Hugh Dickins @ 2011-06-14 10:40 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Linus Torvalds, Christoph Hellwig, Robin Holt, Nick Piggin,
	Rik van Riel, Andrea Arcangeli, Miklos Szeredi,
	KAMEZAWA Hiroyuki, Shaohua Li, Tim Chen, Zhang, Yanmin,
	linux-kernel, linux-mm

Here's my third patchset for mmotm, completing the series.
Based on 3.0-rc3, plus the 14 patches in the June 5th "mm: tmpfs and trunc
changes" series, plus the 7 in the June 9th "tmpfs: simplify by splice
instead of readpage" series, which were in preparation for it.

I'm not sure who would really be interested in it: I'm Cc'ing this
header mail as notification to a number of people who might care;
but, reluctant to spam you all with the 14+7+12 patches themselves,
I hope you can pick them up from the list if you want (or ask me).

What's it about?  Extending tmpfs to MAX_LFS_FILESIZE by abandoning
its peculiar swap vector, instead keeping a file's swap entries in
the same radix tree as its struct page pointers: thus saving memory,
and simplifying its code and locking.

 1/12 radix_tree: exceptional entries and indices
 2/12 mm: let swap use exceptional entries
 3/12 tmpfs: demolish old swap vector support
 4/12 tmpfs: miscellaneous trivial cleanups
 5/12 tmpfs: copy truncate_inode_pages_range
 6/12 tmpfs: convert shmem_truncate_range to radix-swap
 7/12 tmpfs: convert shmem_unuse_inode to radix-swap
 8/12 tmpfs: convert shmem_getpage_gfp to radix-swap
 9/12 tmpfs: convert mem_cgroup shmem to radix-swap
10/12 tmpfs: convert shmem_writepage and enable swap
11/12 tmpfs: use kmemdup for short symlinks
12/12 mm: a few small updates for radix-swap

 fs/stack.c                 |    5 
 include/linux/memcontrol.h |    8 
 include/linux/radix-tree.h |   36 
 include/linux/shmem_fs.h   |   17 
 include/linux/swapops.h    |   23 
 init/main.c                |    2 
 lib/radix-tree.c           |   29 
 mm/filemap.c               |   74 -
 mm/memcontrol.c            |   66 -
 mm/mincore.c               |   10 
 mm/shmem.c                 | 1515 +++++++++++------------------------
 mm/swapfile.c              |   20 
 mm/truncate.c              |    8 
 13 files changed, 669 insertions(+), 1144 deletions(-)

Hugh

* [PATCH 1/12] radix_tree: exceptional entries and indices
From: Hugh Dickins @ 2011-06-14 10:42 UTC (permalink / raw)
  To: Andrew Morton; +Cc: linux-kernel, linux-mm

The radix_tree is used by several subsystems for different purposes.
A major use is to store the struct page pointers of a file's pagecache
for memory management.  But what if mm wanted to store something other
than page pointers there too?

The low bit of a radix_tree entry is already used to denote an indirect
pointer, for internal use, and the unlikely radix_tree_deref_retry() case.
Define the next bit as denoting an exceptional entry, and supply inline
functions radix_tree_exception() to return non-0 in either unlikely case,
and radix_tree_exceptional_entry() to return non-0 in the second case.

If a subsystem already uses radix_tree with that bit set, no problem:
it does not affect internal workings at all, but is defined for the
convenience of those storing well-aligned pointers in the radix_tree.
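
For illustration (this sketch is not part of the patch), a caller storing
small values might encode and test entries along these lines;
encode_value() and decode_value() are hypothetical helpers, not kernel API:

	static inline void *encode_value(unsigned long value)
	{
		return (void *)((value << RADIX_TREE_EXCEPTIONAL_SHIFT) |
				RADIX_TREE_EXCEPTIONAL_ENTRY);
	}

	static inline unsigned long decode_value(void *entry)
	{
		return (unsigned long)entry >> RADIX_TREE_EXCEPTIONAL_SHIFT;
	}

	/* while walking slots under rcu_read_lock() */
	void *entry = radix_tree_deref_slot(slot);
	if (radix_tree_exception(entry)) {
		if (radix_tree_exceptional_entry(entry))
			value = decode_value(entry);	/* not a struct page */
		else
			goto repeat;	/* radix_tree_deref_retry() case */
	}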

The radix_tree_gang_lookups have an implicit assumption that the caller
can deduce the offset of each entry returned e.g. by the page->index of
a struct page.  But that may not be feasible for some kinds of item to
be stored there.

radix_tree_gang_lookup_slot() now allows for an optional indices argument,
an output array in which to return those offsets.  The same could be added
to other radix_tree_gang_lookups, but for now keep it to the only one
for which we need it.
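
A minimal usage sketch of the new argument (not taken from the patch;
mapping and start stand for some address_space and starting index, and
the handling of each entry is elided):

	void **slots[PAGEVEC_SIZE];
	unsigned long indices[PAGEVEC_SIZE];
	unsigned int i, nr;

	rcu_read_lock();
	nr = radix_tree_gang_lookup_slot(&mapping->page_tree, slots,
					 indices, start, PAGEVEC_SIZE);
	for (i = 0; i < nr; i++) {
		void *entry = radix_tree_deref_slot(slots[i]);
		/*
		 * indices[i] is the index at which entry was found, even
		 * when entry is not a struct page carrying a page->index.
		 */
		...
	}
	rcu_read_unlock();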

Signed-off-by: Hugh Dickins <hughd@google.com>
---
 include/linux/radix-tree.h |   36 ++++++++++++++++++++++++++++++++---
 lib/radix-tree.c           |   29 ++++++++++++++++++----------
 mm/filemap.c               |    4 +--
 3 files changed, 54 insertions(+), 15 deletions(-)

--- linux.orig/include/linux/radix-tree.h	2011-06-13 13:26:07.566101333 -0700
+++ linux/include/linux/radix-tree.h	2011-06-13 13:26:44.426284119 -0700
@@ -39,7 +39,15 @@
  * when it is shrunk, before we rcu free the node. See shrink code for
  * details.
  */
-#define RADIX_TREE_INDIRECT_PTR	1
+#define RADIX_TREE_INDIRECT_PTR		1
+/*
+ * A common use of the radix tree is to store pointers to struct pages;
+ * but shmem/tmpfs needs also to store swap entries in the same tree:
+ * those are marked as exceptional entries to distinguish them.
+ * EXCEPTIONAL_ENTRY tests the bit, EXCEPTIONAL_SHIFT shifts content past it.
+ */
+#define RADIX_TREE_EXCEPTIONAL_ENTRY	2
+#define RADIX_TREE_EXCEPTIONAL_SHIFT	2
 
 #define radix_tree_indirect_to_ptr(ptr) \
 	radix_tree_indirect_to_ptr((void __force *)(ptr))
@@ -174,6 +182,28 @@ static inline int radix_tree_deref_retry
 }
 
 /**
+ * radix_tree_exceptional_entry	- radix_tree_deref_slot gave exceptional entry?
+ * @arg:	value returned by radix_tree_deref_slot
+ * Returns:	0 if well-aligned pointer, non-0 if exceptional entry.
+ */
+static inline int radix_tree_exceptional_entry(void *arg)
+{
+	/* Not unlikely because radix_tree_exception often tested first */
+	return (unsigned long)arg & RADIX_TREE_EXCEPTIONAL_ENTRY;
+}
+
+/**
+ * radix_tree_exception	- radix_tree_deref_slot returned either exception?
+ * @arg:	value returned by radix_tree_deref_slot
+ * Returns:	0 if well-aligned pointer, non-0 if either kind of exception.
+ */
+static inline int radix_tree_exception(void *arg)
+{
+	return unlikely((unsigned long)arg &
+		(RADIX_TREE_INDIRECT_PTR | RADIX_TREE_EXCEPTIONAL_ENTRY));
+}
+
+/**
  * radix_tree_replace_slot	- replace item in a slot
  * @pslot:	pointer to slot, returned by radix_tree_lookup_slot
  * @item:	new item to store in the slot.
@@ -194,8 +224,8 @@ void *radix_tree_delete(struct radix_tre
 unsigned int
 radix_tree_gang_lookup(struct radix_tree_root *root, void **results,
 			unsigned long first_index, unsigned int max_items);
-unsigned int
-radix_tree_gang_lookup_slot(struct radix_tree_root *root, void ***results,
+unsigned int radix_tree_gang_lookup_slot(struct radix_tree_root *root,
+			void ***results, unsigned long *indices,
 			unsigned long first_index, unsigned int max_items);
 unsigned long radix_tree_next_hole(struct radix_tree_root *root,
 				unsigned long index, unsigned long max_scan);
--- linux.orig/lib/radix-tree.c	2011-06-13 13:26:07.566101333 -0700
+++ linux/lib/radix-tree.c	2011-06-13 13:26:44.426284119 -0700
@@ -823,8 +823,8 @@ unsigned long radix_tree_prev_hole(struc
 EXPORT_SYMBOL(radix_tree_prev_hole);
 
 static unsigned int
-__lookup(struct radix_tree_node *slot, void ***results, unsigned long index,
-	unsigned int max_items, unsigned long *next_index)
+__lookup(struct radix_tree_node *slot, void ***results, unsigned long *indices,
+	unsigned long index, unsigned int max_items, unsigned long *next_index)
 {
 	unsigned int nr_found = 0;
 	unsigned int shift, height;
@@ -857,12 +857,16 @@ __lookup(struct radix_tree_node *slot, v
 
 	/* Bottom level: grab some items */
 	for (i = index & RADIX_TREE_MAP_MASK; i < RADIX_TREE_MAP_SIZE; i++) {
-		index++;
 		if (slot->slots[i]) {
-			results[nr_found++] = &(slot->slots[i]);
-			if (nr_found == max_items)
+			results[nr_found] = &(slot->slots[i]);
+			if (indices)
+				indices[nr_found] = index;
+			if (++nr_found == max_items) {
+				index++;
 				goto out;
+			}
 		}
+		index++;
 	}
 out:
 	*next_index = index;
@@ -918,8 +922,8 @@ radix_tree_gang_lookup(struct radix_tree
 
 		if (cur_index > max_index)
 			break;
-		slots_found = __lookup(node, (void ***)results + ret, cur_index,
-					max_items - ret, &next_index);
+		slots_found = __lookup(node, (void ***)results + ret, NULL,
+				cur_index, max_items - ret, &next_index);
 		nr_found = 0;
 		for (i = 0; i < slots_found; i++) {
 			struct radix_tree_node *slot;
@@ -944,6 +948,7 @@ EXPORT_SYMBOL(radix_tree_gang_lookup);
  *	radix_tree_gang_lookup_slot - perform multiple slot lookup on radix tree
  *	@root:		radix tree root
  *	@results:	where the results of the lookup are placed
+ *	@indices:	where their indices should be placed (but usually NULL)
  *	@first_index:	start the lookup from this key
  *	@max_items:	place up to this many items at *results
  *
@@ -958,7 +963,8 @@ EXPORT_SYMBOL(radix_tree_gang_lookup);
  *	protection, radix_tree_deref_slot may fail requiring a retry.
  */
 unsigned int
-radix_tree_gang_lookup_slot(struct radix_tree_root *root, void ***results,
+radix_tree_gang_lookup_slot(struct radix_tree_root *root,
+			void ***results, unsigned long *indices,
 			unsigned long first_index, unsigned int max_items)
 {
 	unsigned long max_index;
@@ -974,6 +980,8 @@ radix_tree_gang_lookup_slot(struct radix
 		if (first_index > 0)
 			return 0;
 		results[0] = (void **)&root->rnode;
+		if (indices)
+			indices[0] = 0;
 		return 1;
 	}
 	node = indirect_to_ptr(node);
@@ -987,8 +995,9 @@ radix_tree_gang_lookup_slot(struct radix
 
 		if (cur_index > max_index)
 			break;
-		slots_found = __lookup(node, results + ret, cur_index,
-					max_items - ret, &next_index);
+		slots_found = __lookup(node, results + ret,
+				indices ? indices + ret : NULL,
+				cur_index, max_items - ret, &next_index);
 		ret += slots_found;
 		if (next_index == 0)
 			break;
--- linux.orig/mm/filemap.c	2011-06-13 13:26:07.566101333 -0700
+++ linux/mm/filemap.c	2011-06-13 13:26:44.430284135 -0700
@@ -843,7 +843,7 @@ unsigned find_get_pages(struct address_s
 	rcu_read_lock();
 restart:
 	nr_found = radix_tree_gang_lookup_slot(&mapping->page_tree,
-				(void ***)pages, start, nr_pages);
+				(void ***)pages, NULL, start, nr_pages);
 	ret = 0;
 	for (i = 0; i < nr_found; i++) {
 		struct page *page;
@@ -906,7 +906,7 @@ unsigned find_get_pages_contig(struct ad
 	rcu_read_lock();
 restart:
 	nr_found = radix_tree_gang_lookup_slot(&mapping->page_tree,
-				(void ***)pages, index, nr_pages);
+				(void ***)pages, NULL, index, nr_pages);
 	ret = 0;
 	for (i = 0; i < nr_found; i++) {
 		struct page *page;

* [PATCH 2/12] mm: let swap use exceptional entries
From: Hugh Dickins @ 2011-06-14 10:43 UTC (permalink / raw)
  To: Andrew Morton; +Cc: linux-kernel, linux-mm

If swap entries are to be stored along with struct page pointers in
a radix tree, they need to be distinguished as exceptional entries.

Most of the handling of swap entries in radix tree will be contained
in shmem.c, but a few functions in filemap.c's common code need to
check for their appearance: find_get_page(), find_lock_page(),
find_get_pages() and find_get_pages_contig().

So as not to slow their fast paths, tuck those checks inside the
existing checks for unlikely radix_tree_deref_slot(); except for
find_lock_page(), where it is an added test.  And make it a BUG
in find_get_pages_tag(), which is not applied to tmpfs files.

Part of the reason for eliminating shmem_readpage() earlier was to
minimize the places where common code would need to allow for swap
entries.

The swp_entry_t known to swapfile.c must be massaged into a
slightly different form when stored in the radix tree, just
as it gets massaged into a pte_t when stored in page tables.

In an i386 kernel this limits its information (type and page offset)
to 30 bits: given 32 "types" of swapfile (5 bits) and 4kB pagesize,
that leaves 25 bits of page offset, for a maximum swapfile size of
128GB.  That is less than the 512GB we
previously allowed with X86_PAE (where the swap entry can occupy the
entire upper 32 bits of a pte_t), but not a new limitation on 32-bit
without PAE; and there's not a new limitation on 64-bit (where swap
filesize is already limited to 16TB by a 32-bit page offset).  Thirty
areas of 128GB is probably still enough swap for a 64GB 32-bit machine.

Provide swp_to_radix_entry() and radix_to_swp_entry() conversions,
and enforce filesize limit in read_swap_header(), just as for ptes.
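
A rough sketch of how the conversions are intended to be used by the
later shmem patches (locking, error handling and the surrounding code
are elided; mapping, index, swap and page are placeholders):

	/* store a swap entry where a struct page pointer would normally go */
	error = radix_tree_insert(&mapping->page_tree, index,
				  swp_to_radix_entry(swap));

	/* on lookup, an exceptional entry is swap; anything else is a page */
	entry = radix_tree_lookup(&mapping->page_tree, index);
	if (radix_tree_exceptional_entry(entry))
		swap = radix_to_swp_entry(entry);
	else
		page = entry;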

Signed-off-by: Hugh Dickins <hughd@google.com>
---
 include/linux/swapops.h |   23 +++++++++++++++++
 mm/filemap.c            |   49 ++++++++++++++++++++++++--------------
 mm/swapfile.c           |   20 +++++++++------
 3 files changed, 66 insertions(+), 26 deletions(-)

--- linux.orig/include/linux/swapops.h	2011-06-13 13:26:07.506101039 -0700
+++ linux/include/linux/swapops.h	2011-06-13 13:27:34.522532530 -0700
@@ -1,3 +1,8 @@
+#ifndef _LINUX_SWAPOPS_H
+#define _LINUX_SWAPOPS_H
+
+#include <linux/radix-tree.h>
+
 /*
  * swapcache pages are stored in the swapper_space radix tree.  We want to
  * get good packing density in that tree, so the index should be dense in
@@ -76,6 +81,22 @@ static inline pte_t swp_entry_to_pte(swp
 	return __swp_entry_to_pte(arch_entry);
 }
 
+static inline swp_entry_t radix_to_swp_entry(void *arg)
+{
+	swp_entry_t entry;
+
+	entry.val = (unsigned long)arg >> RADIX_TREE_EXCEPTIONAL_SHIFT;
+	return entry;
+}
+
+static inline void *swp_to_radix_entry(swp_entry_t entry)
+{
+	unsigned long value;
+
+	value = entry.val << RADIX_TREE_EXCEPTIONAL_SHIFT;
+	return (void *)(value | RADIX_TREE_EXCEPTIONAL_ENTRY);
+}
+
 #ifdef CONFIG_MIGRATION
 static inline swp_entry_t make_migration_entry(struct page *page, int write)
 {
@@ -169,3 +190,5 @@ static inline int non_swap_entry(swp_ent
 	return 0;
 }
 #endif
+
+#endif /* _LINUX_SWAPOPS_H */
--- linux.orig/mm/filemap.c	2011-06-13 13:26:44.430284135 -0700
+++ linux/mm/filemap.c	2011-06-13 13:27:34.526532556 -0700
@@ -717,9 +717,12 @@ repeat:
 		page = radix_tree_deref_slot(pagep);
 		if (unlikely(!page))
 			goto out;
-		if (radix_tree_deref_retry(page))
+		if (radix_tree_exception(page)) {
+			if (radix_tree_exceptional_entry(page))
+				goto out;
+			/* radix_tree_deref_retry(page) */
 			goto repeat;
-
+		}
 		if (!page_cache_get_speculative(page))
 			goto repeat;
 
@@ -756,7 +759,7 @@ struct page *find_lock_page(struct addre
 
 repeat:
 	page = find_get_page(mapping, offset);
-	if (page) {
+	if (page && !radix_tree_exception(page)) {
 		lock_page(page);
 		/* Has the page been truncated? */
 		if (unlikely(page->mapping != mapping)) {
@@ -852,11 +855,14 @@ repeat:
 		if (unlikely(!page))
 			continue;
 
-		/*
-		 * This can only trigger when the entry at index 0 moves out
-		 * of or back to the root: none yet gotten, safe to restart.
-		 */
-		if (radix_tree_deref_retry(page)) {
+		if (radix_tree_exception(page)) {
+			if (radix_tree_exceptional_entry(page))
+				continue;
+			/*
+			 * radix_tree_deref_retry(page):
+			 * can only trigger when entry at index 0 moves out of
+			 * or back to root: none yet gotten, safe to restart.
+			 */
 			WARN_ON(start | i);
 			goto restart;
 		}
@@ -915,12 +921,16 @@ repeat:
 		if (unlikely(!page))
 			continue;
 
-		/*
-		 * This can only trigger when the entry at index 0 moves out
-		 * of or back to the root: none yet gotten, safe to restart.
-		 */
-		if (radix_tree_deref_retry(page))
+		if (radix_tree_exception(page)) {
+			if (radix_tree_exceptional_entry(page))
+				break;
+			/*
+			 * radix_tree_deref_retry(page):
+			 * can only trigger when entry at index 0 moves out of
+			 * or back to root: none yet gotten, safe to restart.
+			 */
 			goto restart;
+		}
 
 		if (!page_cache_get_speculative(page))
 			goto repeat;
@@ -980,12 +990,15 @@ repeat:
 		if (unlikely(!page))
 			continue;
 
-		/*
-		 * This can only trigger when the entry at index 0 moves out
-		 * of or back to the root: none yet gotten, safe to restart.
-		 */
-		if (radix_tree_deref_retry(page))
+		if (radix_tree_exception(page)) {
+			BUG_ON(radix_tree_exceptional_entry(page));
+			/*
+			 * radix_tree_deref_retry(page):
+			 * can only trigger when entry at index 0 moves out of
+			 * or back to root: none yet gotten, safe to restart.
+			 */
 			goto restart;
+		}
 
 		if (!page_cache_get_speculative(page))
 			goto repeat;
--- linux.orig/mm/swapfile.c	2011-06-13 13:26:07.506101039 -0700
+++ linux/mm/swapfile.c	2011-06-13 13:27:34.526532556 -0700
@@ -1937,20 +1937,24 @@ static unsigned long read_swap_header(st
 
 	/*
 	 * Find out how many pages are allowed for a single swap
-	 * device. There are two limiting factors: 1) the number of
-	 * bits for the swap offset in the swp_entry_t type and
-	 * 2) the number of bits in the a swap pte as defined by
-	 * the different architectures. In order to find the
-	 * largest possible bit mask a swap entry with swap type 0
+	 * device. There are three limiting factors: 1) the number
+	 * of bits for the swap offset in the swp_entry_t type, and
+	 * 2) the number of bits in the swap pte as defined by the
+	 * the different architectures, and 3) the number of free bits
+	 * in an exceptional radix_tree entry. In order to find the
+	 * largest possible bit mask, a swap entry with swap type 0
 	 * and swap offset ~0UL is created, encoded to a swap pte,
-	 * decoded to a swp_entry_t again and finally the swap
+	 * decoded to a swp_entry_t again, and finally the swap
 	 * offset is extracted. This will mask all the bits from
 	 * the initial ~0UL mask that can't be encoded in either
 	 * the swp_entry_t or the architecture definition of a
-	 * swap pte.
+	 * swap pte.  Then the same is done for a radix_tree entry.
 	 */
 	maxpages = swp_offset(pte_to_swp_entry(
-			swp_entry_to_pte(swp_entry(0, ~0UL)))) + 1;
+			swp_entry_to_pte(swp_entry(0, ~0UL))));
+	maxpages = swp_offset(radix_to_swp_entry(
+			swp_to_radix_entry(swp_entry(0, maxpages)))) + 1;
+
 	if (maxpages > swap_header->info.last_page) {
 		maxpages = swap_header->info.last_page + 1;
 		/* p->max is an unsigned int: don't overflow it */

* [PATCH 3/12] tmpfs: demolish old swap vector support
From: Hugh Dickins @ 2011-06-14 10:45 UTC (permalink / raw)
  To: Andrew Morton; +Cc: linux-kernel, linux-mm

The maximum size of a shmem/tmpfs file has been limited by the maximum
size of its triple-indirect swap vector.  With 4kB page size, maximum
filesize was just over 2TB on a 32-bit kernel, but sadly one eighth of
that on a 64-bit kernel.  (With 8kB page size, maximum filesize was
just over 4TB on a 64-bit kernel, but 16TB on a 32-bit kernel,
MAX_LFS_FILESIZE being then more restrictive than swap vector layout.)

It's a shame that tmpfs should be more restrictive than ramfs, and this
limitation has now been noticed.  Add another level to the swap vector?
No, it became obscure and hard to maintain, once I complicated it to
make use of highmem pages nine years ago: better choose another way.

Surely, if 2.4 had had the radix tree pagecache introduced in 2.5,
then tmpfs would never have invented its own peculiar radix tree:
we would have fitted swap entries into the common radix tree instead,
in much the same way as we fit swap entries into page tables.

And why should each file have a separate radix tree for its pages
and for its swap entries?  The swap entries are required precisely
where and when the pages are not.  We want to put them together in
a single radix tree: which can then avoid much of the locking which
was needed to prevent them from being exchanged underneath us.
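
Conceptually (a sketch using the helpers from patches 1 and 2, not code
from this patch), a lookup in the combined tree then has three outcomes:

	entry = radix_tree_lookup(&mapping->page_tree, index);
	if (!entry)
		;					/* hole: not in memory, not on swap */
	else if (radix_tree_exceptional_entry(entry))
		swap = radix_to_swp_entry(entry);	/* swapped out */
	else
		page = entry;				/* resident struct page */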

This also avoids the waste of memory devoted to swap vectors, first
in the shmem_inode itself, then at least two more pages once a file
grew beyond 16 data pages (pages accounted by df and du, but not by
memcg).  Allocated upfront, to avoid allocation when under swapping
pressure, but pure waste when CONFIG_SWAP is not set - I have never
spattered around the ifdefs to prevent that, preferring this move
to sharing the common radix tree instead.

There are three downsides to sharing the radix tree.  One, that it
binds tmpfs more tightly to the rest of mm, either requiring knowledge
of swap entries in the radix tree there, or duplication of its code here
in shmem.c.  I believe that the simplifications and memory savings
(and probable higher performance, not yet measured) justify that.

Two, that on HIGHMEM systems with SWAP enabled, it's the lowmem radix
nodes that cannot be freed under memory pressure - whereas before it
was the less precious highmem swap vector pages that could not be freed.
I'm hoping that 64-bit has now been accessible for long enough that
the highmem argument has grown much less persuasive.

Three, that swapoff is slower than it used to be on tmpfs files, since
it's using a simple generic mechanism not tailored to it: I find this
noticeable, and shall want to improve, but maybe nobody else will notice.

So... now remove most of the old swap vector code from shmem.c.  But,
for the moment, keep the simple i_direct vector of 16 pages, with simple
accessors shmem_put_swap() and shmem_get_swap(), as a toy implementation
to help mark where swap needs to be handled in subsequent patches.
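
For example, freeing a swap entry through the toy accessors looks like
this (mirroring the shmem_truncate_range() loop in the patch below,
run under info->lock):

	swap = shmem_get_swap(info, index);
	if (swap.val) {
		free_swap_and_cache(swap);
		shmem_put_swap(info, index, (swp_entry_t){0});
		info->swapped--;
	}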

Signed-off-by: Hugh Dickins <hughd@google.com>
---
 include/linux/shmem_fs.h |    2 
 mm/shmem.c               |  782 +++----------------------------------
 2 files changed, 84 insertions(+), 700 deletions(-)

--- linux.orig/include/linux/shmem_fs.h	2011-06-13 13:26:07.446100738 -0700
+++ linux/include/linux/shmem_fs.h	2011-06-13 13:27:59.634657055 -0700
@@ -17,9 +17,7 @@ struct shmem_inode_info {
 	unsigned long		flags;
 	unsigned long		alloced;	/* data pages alloced to file */
 	unsigned long		swapped;	/* subtotal assigned to swap */
-	unsigned long		next_index;	/* highest alloced index + 1 */
 	struct shared_policy	policy;		/* NUMA memory alloc policy */
-	struct page		*i_indirect;	/* top indirect blocks page */
 	union {
 		swp_entry_t	i_direct[SHMEM_NR_DIRECT]; /* first blocks */
 		char		inline_symlink[SHMEM_SYMLINK_INLINE_LEN];
--- linux.orig/mm/shmem.c	2011-06-13 13:26:07.446100738 -0700
+++ linux/mm/shmem.c	2011-06-13 13:27:59.634657055 -0700
@@ -66,37 +66,9 @@ static struct vfsmount *shm_mnt;
 #include <asm/div64.h>
 #include <asm/pgtable.h>
 
-/*
- * The maximum size of a shmem/tmpfs file is limited by the maximum size of
- * its triple-indirect swap vector - see illustration at shmem_swp_entry().
- *
- * With 4kB page size, maximum file size is just over 2TB on a 32-bit kernel,
- * but one eighth of that on a 64-bit kernel.  With 8kB page size, maximum
- * file size is just over 4TB on a 64-bit kernel, but 16TB on a 32-bit kernel,
- * MAX_LFS_FILESIZE being then more restrictive than swap vector layout.
- *
- * We use / and * instead of shifts in the definitions below, so that the swap
- * vector can be tested with small even values (e.g. 20) for ENTRIES_PER_PAGE.
- */
-#define ENTRIES_PER_PAGE (PAGE_CACHE_SIZE/sizeof(unsigned long))
-#define ENTRIES_PER_PAGEPAGE ((unsigned long long)ENTRIES_PER_PAGE*ENTRIES_PER_PAGE)
-
-#define SHMSWP_MAX_INDEX (SHMEM_NR_DIRECT + (ENTRIES_PER_PAGEPAGE/2) * (ENTRIES_PER_PAGE+1))
-#define SHMSWP_MAX_BYTES (SHMSWP_MAX_INDEX << PAGE_CACHE_SHIFT)
-
-#define SHMEM_MAX_BYTES  min_t(unsigned long long, SHMSWP_MAX_BYTES, MAX_LFS_FILESIZE)
-#define SHMEM_MAX_INDEX  ((unsigned long)((SHMEM_MAX_BYTES+1) >> PAGE_CACHE_SHIFT))
-
 #define BLOCKS_PER_PAGE  (PAGE_CACHE_SIZE/512)
 #define VM_ACCT(size)    (PAGE_CACHE_ALIGN(size) >> PAGE_SHIFT)
 
-/* info->flags needs VM_flags to handle pagein/truncate races efficiently */
-#define SHMEM_PAGEIN	 VM_READ
-#define SHMEM_TRUNCATE	 VM_WRITE
-
-/* Definition to limit shmem_truncate's steps between cond_rescheds */
-#define LATENCY_LIMIT	 64
-
 /* Pretend that each entry is of this size in directory's i_size */
 #define BOGO_DIRENT_SIZE 20
 
@@ -107,7 +79,7 @@ struct shmem_xattr {
 	char value[0];
 };
 
-/* Flag allocation requirements to shmem_getpage and shmem_swp_alloc */
+/* Flag allocation requirements to shmem_getpage */
 enum sgp_type {
 	SGP_READ,	/* don't exceed i_size, don't allocate page */
 	SGP_CACHE,	/* don't exceed i_size, may allocate page */
@@ -137,56 +109,6 @@ static inline int shmem_getpage(struct i
 			mapping_gfp_mask(inode->i_mapping), fault_type);
 }
 
-static inline struct page *shmem_dir_alloc(gfp_t gfp_mask)
-{
-	/*
-	 * The above definition of ENTRIES_PER_PAGE, and the use of
-	 * BLOCKS_PER_PAGE on indirect pages, assume PAGE_CACHE_SIZE:
-	 * might be reconsidered if it ever diverges from PAGE_SIZE.
-	 *
-	 * Mobility flags are masked out as swap vectors cannot move
-	 */
-	return alloc_pages((gfp_mask & ~GFP_MOVABLE_MASK) | __GFP_ZERO,
-				PAGE_CACHE_SHIFT-PAGE_SHIFT);
-}
-
-static inline void shmem_dir_free(struct page *page)
-{
-	__free_pages(page, PAGE_CACHE_SHIFT-PAGE_SHIFT);
-}
-
-static struct page **shmem_dir_map(struct page *page)
-{
-	return (struct page **)kmap_atomic(page, KM_USER0);
-}
-
-static inline void shmem_dir_unmap(struct page **dir)
-{
-	kunmap_atomic(dir, KM_USER0);
-}
-
-static swp_entry_t *shmem_swp_map(struct page *page)
-{
-	return (swp_entry_t *)kmap_atomic(page, KM_USER1);
-}
-
-static inline void shmem_swp_balance_unmap(void)
-{
-	/*
-	 * When passing a pointer to an i_direct entry, to code which
-	 * also handles indirect entries and so will shmem_swp_unmap,
-	 * we must arrange for the preempt count to remain in balance.
-	 * What kmap_atomic of a lowmem page does depends on config
-	 * and architecture, so pretend to kmap_atomic some lowmem page.
-	 */
-	(void) kmap_atomic(ZERO_PAGE(0), KM_USER1);
-}
-
-static inline void shmem_swp_unmap(swp_entry_t *entry)
-{
-	kunmap_atomic(entry, KM_USER1);
-}
-
 static inline struct shmem_sb_info *SHMEM_SB(struct super_block *sb)
 {
 	return sb->s_fs_info;
@@ -303,468 +225,56 @@ static void shmem_recalc_inode(struct in
 	}
 }
 
-/**
- * shmem_swp_entry - find the swap vector position in the info structure
- * @info:  info structure for the inode
- * @index: index of the page to find
- * @page:  optional page to add to the structure. Has to be preset to
- *         all zeros
- *
- * If there is no space allocated yet it will return NULL when
- * page is NULL, else it will use the page for the needed block,
- * setting it to NULL on return to indicate that it has been used.
- *
- * The swap vector is organized the following way:
- *
- * There are SHMEM_NR_DIRECT entries directly stored in the
- * shmem_inode_info structure. So small files do not need an addional
- * allocation.
- *
- * For pages with index > SHMEM_NR_DIRECT there is the pointer
- * i_indirect which points to a page which holds in the first half
- * doubly indirect blocks, in the second half triple indirect blocks:
- *
- * For an artificial ENTRIES_PER_PAGE = 4 this would lead to the
- * following layout (for SHMEM_NR_DIRECT == 16):
- *
- * i_indirect -> dir --> 16-19
- * 	      |	     +-> 20-23
- * 	      |
- * 	      +-->dir2 --> 24-27
- * 	      |	       +-> 28-31
- * 	      |	       +-> 32-35
- * 	      |	       +-> 36-39
- * 	      |
- * 	      +-->dir3 --> 40-43
- * 	       	       +-> 44-47
- * 	      	       +-> 48-51
- * 	      	       +-> 52-55
- */
-static swp_entry_t *shmem_swp_entry(struct shmem_inode_info *info, unsigned long index, struct page **page)
-{
-	unsigned long offset;
-	struct page **dir;
-	struct page *subdir;
-
-	if (index < SHMEM_NR_DIRECT) {
-		shmem_swp_balance_unmap();
-		return info->i_direct+index;
-	}
-	if (!info->i_indirect) {
-		if (page) {
-			info->i_indirect = *page;
-			*page = NULL;
-		}
-		return NULL;			/* need another page */
-	}
-
-	index -= SHMEM_NR_DIRECT;
-	offset = index % ENTRIES_PER_PAGE;
-	index /= ENTRIES_PER_PAGE;
-	dir = shmem_dir_map(info->i_indirect);
-
-	if (index >= ENTRIES_PER_PAGE/2) {
-		index -= ENTRIES_PER_PAGE/2;
-		dir += ENTRIES_PER_PAGE/2 + index/ENTRIES_PER_PAGE;
-		index %= ENTRIES_PER_PAGE;
-		subdir = *dir;
-		if (!subdir) {
-			if (page) {
-				*dir = *page;
-				*page = NULL;
-			}
-			shmem_dir_unmap(dir);
-			return NULL;		/* need another page */
-		}
-		shmem_dir_unmap(dir);
-		dir = shmem_dir_map(subdir);
-	}
-
-	dir += index;
-	subdir = *dir;
-	if (!subdir) {
-		if (!page || !(subdir = *page)) {
-			shmem_dir_unmap(dir);
-			return NULL;		/* need a page */
-		}
-		*dir = subdir;
-		*page = NULL;
-	}
-	shmem_dir_unmap(dir);
-	return shmem_swp_map(subdir) + offset;
-}
-
-static void shmem_swp_set(struct shmem_inode_info *info, swp_entry_t *entry, unsigned long value)
+static void shmem_put_swap(struct shmem_inode_info *info, pgoff_t index,
+			   swp_entry_t swap)
 {
-	long incdec = value? 1: -1;
-
-	entry->val = value;
-	info->swapped += incdec;
-	if ((unsigned long)(entry - info->i_direct) >= SHMEM_NR_DIRECT) {
-		struct page *page = kmap_atomic_to_page(entry);
-		set_page_private(page, page_private(page) + incdec);
-	}
+	if (index < SHMEM_NR_DIRECT)
+		info->i_direct[index] = swap;
 }
 
-/**
- * shmem_swp_alloc - get the position of the swap entry for the page.
- * @info:	info structure for the inode
- * @index:	index of the page to find
- * @sgp:	check and recheck i_size? skip allocation?
- * @gfp:	gfp mask to use for any page allocation
- *
- * If the entry does not exist, allocate it.
- */
-static swp_entry_t *shmem_swp_alloc(struct shmem_inode_info *info,
-			unsigned long index, enum sgp_type sgp, gfp_t gfp)
+static swp_entry_t shmem_get_swap(struct shmem_inode_info *info, pgoff_t index)
 {
-	struct inode *inode = &info->vfs_inode;
-	struct shmem_sb_info *sbinfo = SHMEM_SB(inode->i_sb);
-	struct page *page = NULL;
-	swp_entry_t *entry;
-
-	if (sgp != SGP_WRITE &&
-	    ((loff_t) index << PAGE_CACHE_SHIFT) >= i_size_read(inode))
-		return ERR_PTR(-EINVAL);
-
-	while (!(entry = shmem_swp_entry(info, index, &page))) {
-		if (sgp == SGP_READ)
-			return shmem_swp_map(ZERO_PAGE(0));
-		/*
-		 * Test used_blocks against 1 less max_blocks, since we have 1 data
-		 * page (and perhaps indirect index pages) yet to allocate:
-		 * a waste to allocate index if we cannot allocate data.
-		 */
-		if (sbinfo->max_blocks) {
-			if (percpu_counter_compare(&sbinfo->used_blocks,
-						sbinfo->max_blocks - 1) >= 0)
-				return ERR_PTR(-ENOSPC);
-			percpu_counter_inc(&sbinfo->used_blocks);
-			inode->i_blocks += BLOCKS_PER_PAGE;
-		}
-
-		spin_unlock(&info->lock);
-		page = shmem_dir_alloc(gfp);
-		spin_lock(&info->lock);
-
-		if (!page) {
-			shmem_free_blocks(inode, 1);
-			return ERR_PTR(-ENOMEM);
-		}
-		if (sgp != SGP_WRITE &&
-		    ((loff_t) index << PAGE_CACHE_SHIFT) >= i_size_read(inode)) {
-			entry = ERR_PTR(-EINVAL);
-			break;
-		}
-		if (info->next_index <= index)
-			info->next_index = index + 1;
-	}
-	if (page) {
-		/* another task gave its page, or truncated the file */
-		shmem_free_blocks(inode, 1);
-		shmem_dir_free(page);
-	}
-	if (info->next_index <= index && !IS_ERR(entry))
-		info->next_index = index + 1;
-	return entry;
+	return (index < SHMEM_NR_DIRECT) ?
+		info->i_direct[index] : (swp_entry_t){0};
 }
 
-/**
- * shmem_free_swp - free some swap entries in a directory
- * @dir:        pointer to the directory
- * @edir:       pointer after last entry of the directory
- * @punch_lock: pointer to spinlock when needed for the holepunch case
- */
-static int shmem_free_swp(swp_entry_t *dir, swp_entry_t *edir,
-						spinlock_t *punch_lock)
-{
-	spinlock_t *punch_unlock = NULL;
-	swp_entry_t *ptr;
-	int freed = 0;
-
-	for (ptr = dir; ptr < edir; ptr++) {
-		if (ptr->val) {
-			if (unlikely(punch_lock)) {
-				punch_unlock = punch_lock;
-				punch_lock = NULL;
-				spin_lock(punch_unlock);
-				if (!ptr->val)
-					continue;
-			}
-			free_swap_and_cache(*ptr);
-			*ptr = (swp_entry_t){0};
-			freed++;
-		}
-	}
-	if (punch_unlock)
-		spin_unlock(punch_unlock);
-	return freed;
-}
-
-static int shmem_map_and_free_swp(struct page *subdir, int offset,
-		int limit, struct page ***dir, spinlock_t *punch_lock)
-{
-	swp_entry_t *ptr;
-	int freed = 0;
-
-	ptr = shmem_swp_map(subdir);
-	for (; offset < limit; offset += LATENCY_LIMIT) {
-		int size = limit - offset;
-		if (size > LATENCY_LIMIT)
-			size = LATENCY_LIMIT;
-		freed += shmem_free_swp(ptr+offset, ptr+offset+size,
-							punch_lock);
-		if (need_resched()) {
-			shmem_swp_unmap(ptr);
-			if (*dir) {
-				shmem_dir_unmap(*dir);
-				*dir = NULL;
-			}
-			cond_resched();
-			ptr = shmem_swp_map(subdir);
-		}
-	}
-	shmem_swp_unmap(ptr);
-	return freed;
-}
-
-static void shmem_free_pages(struct list_head *next)
-{
-	struct page *page;
-	int freed = 0;
-
-	do {
-		page = container_of(next, struct page, lru);
-		next = next->next;
-		shmem_dir_free(page);
-		freed++;
-		if (freed >= LATENCY_LIMIT) {
-			cond_resched();
-			freed = 0;
-		}
-	} while (next);
-}
-
-void shmem_truncate_range(struct inode *inode, loff_t start, loff_t end)
+void shmem_truncate_range(struct inode *inode, loff_t lstart, loff_t lend)
 {
+	struct address_space *mapping = inode->i_mapping;
 	struct shmem_inode_info *info = SHMEM_I(inode);
-	unsigned long idx;
-	unsigned long size;
-	unsigned long limit;
-	unsigned long stage;
-	unsigned long diroff;
-	struct page **dir;
-	struct page *topdir;
-	struct page *middir;
-	struct page *subdir;
-	swp_entry_t *ptr;
-	LIST_HEAD(pages_to_free);
-	long nr_pages_to_free = 0;
-	long nr_swaps_freed = 0;
-	int offset;
-	int freed;
-	int punch_hole;
-	spinlock_t *needs_lock;
-	spinlock_t *punch_lock;
-	unsigned long upper_limit;
+	pgoff_t start = (lstart + PAGE_CACHE_SIZE - 1) >> PAGE_CACHE_SHIFT;
+	pgoff_t end = (lend >> PAGE_CACHE_SHIFT);
+	pgoff_t index;
+	swp_entry_t swap;
 
-	truncate_inode_pages_range(inode->i_mapping, start, end);
+	truncate_inode_pages_range(mapping, lstart, lend);
 
-	inode->i_ctime = inode->i_mtime = CURRENT_TIME;
-	idx = (start + PAGE_CACHE_SIZE - 1) >> PAGE_CACHE_SHIFT;
-	if (idx >= info->next_index)
-		return;
+	if (end > SHMEM_NR_DIRECT)
+		end = SHMEM_NR_DIRECT;
 
 	spin_lock(&info->lock);
-	info->flags |= SHMEM_TRUNCATE;
-	if (likely(end == (loff_t) -1)) {
-		limit = info->next_index;
-		upper_limit = SHMEM_MAX_INDEX;
-		info->next_index = idx;
-		needs_lock = NULL;
-		punch_hole = 0;
-	} else {
-		if (end + 1 >= inode->i_size) {	/* we may free a little more */
-			limit = (inode->i_size + PAGE_CACHE_SIZE - 1) >>
-							PAGE_CACHE_SHIFT;
-			upper_limit = SHMEM_MAX_INDEX;
-		} else {
-			limit = (end + 1) >> PAGE_CACHE_SHIFT;
-			upper_limit = limit;
-		}
-		needs_lock = &info->lock;
-		punch_hole = 1;
-	}
-
-	topdir = info->i_indirect;
-	if (topdir && idx <= SHMEM_NR_DIRECT && !punch_hole) {
-		info->i_indirect = NULL;
-		nr_pages_to_free++;
-		list_add(&topdir->lru, &pages_to_free);
-	}
-	spin_unlock(&info->lock);
-
-	if (info->swapped && idx < SHMEM_NR_DIRECT) {
-		ptr = info->i_direct;
-		size = limit;
-		if (size > SHMEM_NR_DIRECT)
-			size = SHMEM_NR_DIRECT;
-		nr_swaps_freed = shmem_free_swp(ptr+idx, ptr+size, needs_lock);
-	}
-
-	/*
-	 * If there are no indirect blocks or we are punching a hole
-	 * below indirect blocks, nothing to be done.
-	 */
-	if (!topdir || limit <= SHMEM_NR_DIRECT)
-		goto done2;
-
-	/*
-	 * The truncation case has already dropped info->lock, and we're safe
-	 * because i_size and next_index have already been lowered, preventing
-	 * access beyond.  But in the punch_hole case, we still need to take
-	 * the lock when updating the swap directory, because there might be
-	 * racing accesses by shmem_getpage(SGP_CACHE), shmem_unuse_inode or
-	 * shmem_writepage.  However, whenever we find we can remove a whole
-	 * directory page (not at the misaligned start or end of the range),
-	 * we first NULLify its pointer in the level above, and then have no
-	 * need to take the lock when updating its contents: needs_lock and
-	 * punch_lock (either pointing to info->lock or NULL) manage this.
-	 */
-
-	upper_limit -= SHMEM_NR_DIRECT;
-	limit -= SHMEM_NR_DIRECT;
-	idx = (idx > SHMEM_NR_DIRECT)? (idx - SHMEM_NR_DIRECT): 0;
-	offset = idx % ENTRIES_PER_PAGE;
-	idx -= offset;
-
-	dir = shmem_dir_map(topdir);
-	stage = ENTRIES_PER_PAGEPAGE/2;
-	if (idx < ENTRIES_PER_PAGEPAGE/2) {
-		middir = topdir;
-		diroff = idx/ENTRIES_PER_PAGE;
-	} else {
-		dir += ENTRIES_PER_PAGE/2;
-		dir += (idx - ENTRIES_PER_PAGEPAGE/2)/ENTRIES_PER_PAGEPAGE;
-		while (stage <= idx)
-			stage += ENTRIES_PER_PAGEPAGE;
-		middir = *dir;
-		if (*dir) {
-			diroff = ((idx - ENTRIES_PER_PAGEPAGE/2) %
-				ENTRIES_PER_PAGEPAGE) / ENTRIES_PER_PAGE;
-			if (!diroff && !offset && upper_limit >= stage) {
-				if (needs_lock) {
-					spin_lock(needs_lock);
-					*dir = NULL;
-					spin_unlock(needs_lock);
-					needs_lock = NULL;
-				} else
-					*dir = NULL;
-				nr_pages_to_free++;
-				list_add(&middir->lru, &pages_to_free);
-			}
-			shmem_dir_unmap(dir);
-			dir = shmem_dir_map(middir);
-		} else {
-			diroff = 0;
-			offset = 0;
-			idx = stage;
+	for (index = start; index < end; index++) {
+		swap = shmem_get_swap(info, index);
+		if (swap.val) {
+			free_swap_and_cache(swap);
+			shmem_put_swap(info, index, (swp_entry_t){0});
+			info->swapped--;
 		}
 	}
 
-	for (; idx < limit; idx += ENTRIES_PER_PAGE, diroff++) {
-		if (unlikely(idx == stage)) {
-			shmem_dir_unmap(dir);
-			dir = shmem_dir_map(topdir) +
-			    ENTRIES_PER_PAGE/2 + idx/ENTRIES_PER_PAGEPAGE;
-			while (!*dir) {
-				dir++;
-				idx += ENTRIES_PER_PAGEPAGE;
-				if (idx >= limit)
-					goto done1;
-			}
-			stage = idx + ENTRIES_PER_PAGEPAGE;
-			middir = *dir;
-			if (punch_hole)
-				needs_lock = &info->lock;
-			if (upper_limit >= stage) {
-				if (needs_lock) {
-					spin_lock(needs_lock);
-					*dir = NULL;
-					spin_unlock(needs_lock);
-					needs_lock = NULL;
-				} else
-					*dir = NULL;
-				nr_pages_to_free++;
-				list_add(&middir->lru, &pages_to_free);
-			}
-			shmem_dir_unmap(dir);
-			cond_resched();
-			dir = shmem_dir_map(middir);
-			diroff = 0;
-		}
-		punch_lock = needs_lock;
-		subdir = dir[diroff];
-		if (subdir && !offset && upper_limit-idx >= ENTRIES_PER_PAGE) {
-			if (needs_lock) {
-				spin_lock(needs_lock);
-				dir[diroff] = NULL;
-				spin_unlock(needs_lock);
-				punch_lock = NULL;
-			} else
-				dir[diroff] = NULL;
-			nr_pages_to_free++;
-			list_add(&subdir->lru, &pages_to_free);
-		}
-		if (subdir && page_private(subdir) /* has swap entries */) {
-			size = limit - idx;
-			if (size > ENTRIES_PER_PAGE)
-				size = ENTRIES_PER_PAGE;
-			freed = shmem_map_and_free_swp(subdir,
-					offset, size, &dir, punch_lock);
-			if (!dir)
-				dir = shmem_dir_map(middir);
-			nr_swaps_freed += freed;
-			if (offset || punch_lock) {
-				spin_lock(&info->lock);
-				set_page_private(subdir,
-					page_private(subdir) - freed);
-				spin_unlock(&info->lock);
-			} else
-				BUG_ON(page_private(subdir) != freed);
-		}
-		offset = 0;
-	}
-done1:
-	shmem_dir_unmap(dir);
-done2:
-	if (inode->i_mapping->nrpages && (info->flags & SHMEM_PAGEIN)) {
+	if (mapping->nrpages) {
+		spin_unlock(&info->lock);
 		/*
-		 * Call truncate_inode_pages again: racing shmem_unuse_inode
-		 * may have swizzled a page in from swap since
-		 * truncate_pagecache or generic_delete_inode did it, before we
-		 * lowered next_index.  Also, though shmem_getpage checks
-		 * i_size before adding to cache, no recheck after: so fix the
-		 * narrow window there too.
+		 * A page may have meanwhile sneaked in from swap.
 		 */
-		truncate_inode_pages_range(inode->i_mapping, start, end);
+		truncate_inode_pages_range(mapping, lstart, lend);
+		spin_lock(&info->lock);
 	}
 
-	spin_lock(&info->lock);
-	info->flags &= ~SHMEM_TRUNCATE;
-	info->swapped -= nr_swaps_freed;
-	if (nr_pages_to_free)
-		shmem_free_blocks(inode, nr_pages_to_free);
 	shmem_recalc_inode(inode);
 	spin_unlock(&info->lock);
 
-	/*
-	 * Empty swap vector directory pages to be freed?
-	 */
-	if (!list_empty(&pages_to_free)) {
-		pages_to_free.prev->next = NULL;
-		shmem_free_pages(pages_to_free.next);
-	}
+	inode->i_ctime = inode->i_mtime = CURRENT_TIME;
 }
 EXPORT_SYMBOL_GPL(shmem_truncate_range);
 
@@ -797,19 +307,6 @@ static int shmem_setattr(struct dentry *
 				if (page)
 					unlock_page(page);
 			}
-			/*
-			 * Reset SHMEM_PAGEIN flag so that shmem_truncate can
-			 * detect if any pages might have been added to cache
-			 * after truncate_inode_pages.  But we needn't bother
-			 * if it's being fully truncated to zero-length: the
-			 * nrpages check is efficient enough in that case.
-			 */
-			if (newsize) {
-				struct shmem_inode_info *info = SHMEM_I(inode);
-				spin_lock(&info->lock);
-				info->flags &= ~SHMEM_PAGEIN;
-				spin_unlock(&info->lock);
-			}
 		}
 		if (newsize != oldsize) {
 			i_size_write(inode, newsize);
@@ -859,106 +356,28 @@ static void shmem_evict_inode(struct ino
 	end_writeback(inode);
 }
 
-static inline int shmem_find_swp(swp_entry_t entry, swp_entry_t *dir, swp_entry_t *edir)
-{
-	swp_entry_t *ptr;
-
-	for (ptr = dir; ptr < edir; ptr++) {
-		if (ptr->val == entry.val)
-			return ptr - dir;
-	}
-	return -1;
-}
-
 static int shmem_unuse_inode(struct shmem_inode_info *info, swp_entry_t entry, struct page *page)
 {
-	struct address_space *mapping;
+	struct address_space *mapping = info->vfs_inode.i_mapping;
 	unsigned long idx;
-	unsigned long size;
-	unsigned long limit;
-	unsigned long stage;
-	struct page **dir;
-	struct page *subdir;
-	swp_entry_t *ptr;
-	int offset;
 	int error;
 
-	idx = 0;
-	ptr = info->i_direct;
-	spin_lock(&info->lock);
-	if (!info->swapped) {
-		list_del_init(&info->swaplist);
-		goto lost2;
-	}
-	limit = info->next_index;
-	size = limit;
-	if (size > SHMEM_NR_DIRECT)
-		size = SHMEM_NR_DIRECT;
-	offset = shmem_find_swp(entry, ptr, ptr+size);
-	if (offset >= 0) {
-		shmem_swp_balance_unmap();
-		goto found;
-	}
-	if (!info->i_indirect)
-		goto lost2;
-
-	dir = shmem_dir_map(info->i_indirect);
-	stage = SHMEM_NR_DIRECT + ENTRIES_PER_PAGEPAGE/2;
-
-	for (idx = SHMEM_NR_DIRECT; idx < limit; idx += ENTRIES_PER_PAGE, dir++) {
-		if (unlikely(idx == stage)) {
-			shmem_dir_unmap(dir-1);
-			if (cond_resched_lock(&info->lock)) {
-				/* check it has not been truncated */
-				if (limit > info->next_index) {
-					limit = info->next_index;
-					if (idx >= limit)
-						goto lost2;
-				}
-			}
-			dir = shmem_dir_map(info->i_indirect) +
-			    ENTRIES_PER_PAGE/2 + idx/ENTRIES_PER_PAGEPAGE;
-			while (!*dir) {
-				dir++;
-				idx += ENTRIES_PER_PAGEPAGE;
-				if (idx >= limit)
-					goto lost1;
-			}
-			stage = idx + ENTRIES_PER_PAGEPAGE;
-			subdir = *dir;
-			shmem_dir_unmap(dir);
-			dir = shmem_dir_map(subdir);
-		}
-		subdir = *dir;
-		if (subdir && page_private(subdir)) {
-			ptr = shmem_swp_map(subdir);
-			size = limit - idx;
-			if (size > ENTRIES_PER_PAGE)
-				size = ENTRIES_PER_PAGE;
-			offset = shmem_find_swp(entry, ptr, ptr+size);
-			shmem_swp_unmap(ptr);
-			if (offset >= 0) {
-				shmem_dir_unmap(dir);
-				ptr = shmem_swp_map(subdir);
-				goto found;
-			}
-		}
-	}
-lost1:
-	shmem_dir_unmap(dir-1);
-lost2:
-	spin_unlock(&info->lock);
+	for (idx = 0; idx < SHMEM_NR_DIRECT; idx++)
+		if (shmem_get_swap(info, idx).val == entry.val)
+			goto found;
 	return 0;
 found:
-	idx += offset;
-	ptr += offset;
+	spin_lock(&info->lock);
+	if (shmem_get_swap(info, idx).val != entry.val) {
+		spin_unlock(&info->lock);
+		return 0;
+	}
 
 	/*
 	 * Move _head_ to start search for next from here.
 	 * But be careful: shmem_evict_inode checks list_empty without taking
 	 * mutex, and there's an instant in list_move_tail when info->swaplist
-	 * would appear empty, if it were the only one on shmem_swaplist.  We
-	 * could avoid doing it if inode NULL; or use this minor optimization.
+	 * would appear empty, if it were the only one on shmem_swaplist.
 	 */
 	if (shmem_swaplist.next != &info->swaplist)
 		list_move_tail(&shmem_swaplist, &info->swaplist);
@@ -968,19 +387,17 @@ found:
 	 * but also to hold up shmem_evict_inode(): so inode cannot be freed
 	 * beneath us (pagelock doesn't help until the page is in pagecache).
 	 */
-	mapping = info->vfs_inode.i_mapping;
 	error = add_to_page_cache_locked(page, mapping, idx, GFP_NOWAIT);
 	/* which does mem_cgroup_uncharge_cache_page on error */
 
 	if (error != -ENOMEM) {
 		delete_from_swap_cache(page);
 		set_page_dirty(page);
-		info->flags |= SHMEM_PAGEIN;
-		shmem_swp_set(info, ptr, 0);
+		shmem_put_swap(info, idx, (swp_entry_t){0});
+		info->swapped--;
 		swap_free(entry);
 		error = 1;	/* not an error, but entry was found */
 	}
-	shmem_swp_unmap(ptr);
 	spin_unlock(&info->lock);
 	return error;
 }
@@ -1017,7 +434,14 @@ int shmem_unuse(swp_entry_t entry, struc
 	mutex_lock(&shmem_swaplist_mutex);
 	list_for_each_safe(p, next, &shmem_swaplist) {
 		info = list_entry(p, struct shmem_inode_info, swaplist);
-		found = shmem_unuse_inode(info, entry, page);
+		if (!info->swapped) {
+			spin_lock(&info->lock);
+			if (!info->swapped)
+				list_del_init(&info->swaplist);
+			spin_unlock(&info->lock);
+		}
+		if (info->swapped)
+			found = shmem_unuse_inode(info, entry, page);
 		cond_resched();
 		if (found)
 			break;
@@ -1041,7 +465,7 @@ out:
 static int shmem_writepage(struct page *page, struct writeback_control *wbc)
 {
 	struct shmem_inode_info *info;
-	swp_entry_t *entry, swap;
+	swp_entry_t swap, oswap;
 	struct address_space *mapping;
 	unsigned long index;
 	struct inode *inode;
@@ -1067,6 +491,15 @@ static int shmem_writepage(struct page *
 		WARN_ON_ONCE(1);	/* Still happens? Tell us about it! */
 		goto redirty;
 	}
+
+	/*
+	 * Just for this patch, we have a toy implementation,
+	 * which can swap out only the first SHMEM_NR_DIRECT pages:
+	 * for simple demonstration of where we need to think about swap.
+	 */
+	if (index >= SHMEM_NR_DIRECT)
+		goto redirty;
+
 	swap = get_swap_page();
 	if (!swap.val)
 		goto redirty;
@@ -1087,22 +520,19 @@ static int shmem_writepage(struct page *
 	spin_lock(&info->lock);
 	mutex_unlock(&shmem_swaplist_mutex);
 
-	if (index >= info->next_index) {
-		BUG_ON(!(info->flags & SHMEM_TRUNCATE));
-		goto unlock;
-	}
-	entry = shmem_swp_entry(info, index, NULL);
-	if (entry->val) {
+	oswap = shmem_get_swap(info, index);
+	if (oswap.val) {
 		WARN_ON_ONCE(1);	/* Still happens? Tell us about it! */
-		free_swap_and_cache(*entry);
-		shmem_swp_set(info, entry, 0);
+		free_swap_and_cache(oswap);
+		shmem_put_swap(info, index, (swp_entry_t){0});
+		info->swapped--;
 	}
 	shmem_recalc_inode(inode);
 
 	if (add_to_swap_cache(page, swap, GFP_ATOMIC) == 0) {
 		delete_from_page_cache(page);
-		shmem_swp_set(info, entry, swap.val);
-		shmem_swp_unmap(entry);
+		shmem_put_swap(info, index, swap);
+		info->swapped++;
 		swap_shmem_alloc(swap);
 		spin_unlock(&info->lock);
 		BUG_ON(page_mapped(page));
@@ -1110,13 +540,7 @@ static int shmem_writepage(struct page *
 		return 0;
 	}
 
-	shmem_swp_unmap(entry);
-unlock:
 	spin_unlock(&info->lock);
-	/*
-	 * add_to_swap_cache() doesn't return -EEXIST, so we can safely
-	 * clear SWAP_HAS_CACHE flag.
-	 */
 	swapcache_free(swap, NULL);
 redirty:
 	set_page_dirty(page);
@@ -1230,12 +654,10 @@ static int shmem_getpage_gfp(struct inod
 	struct shmem_sb_info *sbinfo;
 	struct page *page;
 	struct page *prealloc_page = NULL;
-	swp_entry_t *entry;
 	swp_entry_t swap;
 	int error;
-	int ret;
 
-	if (idx >= SHMEM_MAX_INDEX)
+	if (idx > (MAX_LFS_FILESIZE >> PAGE_CACHE_SHIFT))
 		return -EFBIG;
 repeat:
 	page = find_lock_page(mapping, idx);
@@ -1272,37 +694,22 @@ repeat:
 
 	spin_lock(&info->lock);
 	shmem_recalc_inode(inode);
-	entry = shmem_swp_alloc(info, idx, sgp, gfp);
-	if (IS_ERR(entry)) {
-		spin_unlock(&info->lock);
-		error = PTR_ERR(entry);
-		goto out;
-	}
-	swap = *entry;
-
+	swap = shmem_get_swap(info, idx);
 	if (swap.val) {
 		/* Look it up and read it in.. */
 		page = lookup_swap_cache(swap);
 		if (!page) {
-			shmem_swp_unmap(entry);
 			spin_unlock(&info->lock);
 			/* here we actually do the io */
 			if (fault_type)
 				*fault_type |= VM_FAULT_MAJOR;
 			page = shmem_swapin(swap, gfp, info, idx);
 			if (!page) {
-				spin_lock(&info->lock);
-				entry = shmem_swp_alloc(info, idx, sgp, gfp);
-				if (IS_ERR(entry))
-					error = PTR_ERR(entry);
-				else {
-					if (entry->val == swap.val)
-						error = -ENOMEM;
-					shmem_swp_unmap(entry);
-				}
-				spin_unlock(&info->lock);
-				if (error)
+				swp_entry_t nswap = shmem_get_swap(info, idx);
+				if (nswap.val == swap.val) {
+					error = -ENOMEM;
 					goto out;
+				}
 				goto repeat;
 			}
 			wait_on_page_locked(page);
@@ -1312,14 +719,12 @@ repeat:
 
 		/* We have to do this with page locked to prevent races */
 		if (!trylock_page(page)) {
-			shmem_swp_unmap(entry);
 			spin_unlock(&info->lock);
 			wait_on_page_locked(page);
 			page_cache_release(page);
 			goto repeat;
 		}
 		if (PageWriteback(page)) {
-			shmem_swp_unmap(entry);
 			spin_unlock(&info->lock);
 			wait_on_page_writeback(page);
 			unlock_page(page);
@@ -1327,7 +732,6 @@ repeat:
 			goto repeat;
 		}
 		if (!PageUptodate(page)) {
-			shmem_swp_unmap(entry);
 			spin_unlock(&info->lock);
 			unlock_page(page);
 			page_cache_release(page);
@@ -1338,7 +742,6 @@ repeat:
 		error = add_to_page_cache_locked(page, mapping,
 						 idx, GFP_NOWAIT);
 		if (error) {
-			shmem_swp_unmap(entry);
 			spin_unlock(&info->lock);
 			if (error == -ENOMEM) {
 				/*
@@ -1358,16 +761,14 @@ repeat:
 			goto repeat;
 		}
 
-		info->flags |= SHMEM_PAGEIN;
-		shmem_swp_set(info, entry, 0);
-		shmem_swp_unmap(entry);
 		delete_from_swap_cache(page);
+		shmem_put_swap(info, idx, (swp_entry_t){0});
+		info->swapped--;
 		spin_unlock(&info->lock);
 		set_page_dirty(page);
 		swap_free(swap);
 
 	} else if (sgp == SGP_READ) {
-		shmem_swp_unmap(entry);
 		page = find_get_page(mapping, idx);
 		if (page && !trylock_page(page)) {
 			spin_unlock(&info->lock);
@@ -1378,7 +779,6 @@ repeat:
 		spin_unlock(&info->lock);
 
 	} else if (prealloc_page) {
-		shmem_swp_unmap(entry);
 		sbinfo = SHMEM_SB(inode->i_sb);
 		if (sbinfo->max_blocks) {
 			if (percpu_counter_compare(&sbinfo->used_blocks,
@@ -1393,34 +793,24 @@ repeat:
 		page = prealloc_page;
 		prealloc_page = NULL;
 
-		entry = shmem_swp_alloc(info, idx, sgp, gfp);
-		if (IS_ERR(entry))
-			error = PTR_ERR(entry);
-		else {
-			swap = *entry;
-			shmem_swp_unmap(entry);
-		}
-		ret = error || swap.val;
-		if (ret)
+		swap = shmem_get_swap(info, idx);
+		if (swap.val)
 			mem_cgroup_uncharge_cache_page(page);
 		else
-			ret = add_to_page_cache_lru(page, mapping,
+			error = add_to_page_cache_lru(page, mapping,
 						idx, GFP_NOWAIT);
 		/*
 		 * At add_to_page_cache_lru() failure,
 		 * uncharge will be done automatically.
 		 */
-		if (ret) {
+		if (swap.val || error) {
 			shmem_unacct_blocks(info->flags, 1);
 			shmem_free_blocks(inode, 1);
 			spin_unlock(&info->lock);
 			page_cache_release(page);
-			if (error)
-				goto out;
 			goto repeat;
 		}
 
-		info->flags |= SHMEM_PAGEIN;
 		info->alloced++;
 		spin_unlock(&info->lock);
 		clear_highpage(page);
@@ -2627,7 +2017,7 @@ int shmem_fill_super(struct super_block
 		goto failed;
 	sbinfo->free_inodes = sbinfo->max_inodes;
 
-	sb->s_maxbytes = SHMEM_MAX_BYTES;
+	sb->s_maxbytes = MAX_LFS_FILESIZE;
 	sb->s_blocksize = PAGE_CACHE_SIZE;
 	sb->s_blocksize_bits = PAGE_CACHE_SHIFT;
 	sb->s_magic = TMPFS_MAGIC;
@@ -2869,7 +2259,7 @@ out4:
 void mem_cgroup_get_shmem_target(struct inode *inode, pgoff_t pgoff,
 					struct page **pagep, swp_entry_t *ent)
 {
-	swp_entry_t entry = { .val = 0 }, *ptr;
+	swp_entry_t entry = { .val = 0 };
 	struct page *page = NULL;
 	struct shmem_inode_info *info = SHMEM_I(inode);
 
@@ -2877,16 +2267,13 @@ void mem_cgroup_get_shmem_target(struct
 		goto out;
 
 	spin_lock(&info->lock);
-	ptr = shmem_swp_entry(info, pgoff, NULL);
 #ifdef CONFIG_SWAP
-	if (ptr && ptr->val) {
-		entry.val = ptr->val;
+	entry = shmem_get_swap(info, pgoff);
+	if (entry.val)
 		page = find_get_page(&swapper_space, entry.val);
-	} else
+	else
 #endif
 		page = find_get_page(inode->i_mapping, pgoff);
-	if (ptr)
-		shmem_swp_unmap(ptr);
 	spin_unlock(&info->lock);
 out:
 	*pagep = page;
@@ -2969,7 +2356,6 @@ out:
 #define shmem_get_inode(sb, dir, mode, dev, flags)	ramfs_get_inode(sb, dir, mode, dev)
 #define shmem_acct_size(flags, size)		0
 #define shmem_unacct_size(flags, size)		do {} while (0)
-#define SHMEM_MAX_BYTES				MAX_LFS_FILESIZE
 
 #endif /* CONFIG_SHMEM */
 
@@ -2993,7 +2379,7 @@ struct file *shmem_file_setup(const char
 	if (IS_ERR(shm_mnt))
 		return (void *)shm_mnt;
 
-	if (size < 0 || size > SHMEM_MAX_BYTES)
+	if (size < 0 || size > MAX_LFS_FILESIZE)
 		return ERR_PTR(-EINVAL);
 
 	if (shmem_acct_size(flags, size))

^ permalink raw reply	[flat|nested] 71+ messages in thread

* [PATCH 3/12] tmpfs: demolish old swap vector support
@ 2011-06-14 10:45   ` Hugh Dickins
  0 siblings, 0 replies; 71+ messages in thread
From: Hugh Dickins @ 2011-06-14 10:45 UTC (permalink / raw)
  To: Andrew Morton; +Cc: linux-kernel, linux-mm

The maximum size of a shmem/tmpfs file has been limited by the maximum
size of its triple-indirect swap vector.  With 4kB page size, maximum
filesize was just over 2TB on a 32-bit kernel, but sadly one eighth of
that on a 64-bit kernel.  (With 8kB page size, maximum filesize was
just over 4TB on a 64-bit kernel, but 16TB on a 32-bit kernel,
MAX_LFS_FILESIZE being then more restrictive than swap vector layout.)
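
For concreteness, those figures follow directly from the swap vector
macros removed below (SHMEM_NR_DIRECT is 16, and a swp_entry_t holds one
unsigned long, so 4 bytes on a 32-bit kernel, 8 bytes on a 64-bit one):

	ENTRIES_PER_PAGE = PAGE_CACHE_SIZE / sizeof(unsigned long)
	                 = 1024 on 32-bit, 512 on 64-bit (with 4kB pages)

	SHMSWP_MAX_INDEX = 16 + (ENTRIES_PER_PAGE^2 / 2) * (ENTRIES_PER_PAGE + 1)
	                 ~ 537 million pages on 32-bit, just over 2TB
	                 ~  67 million pages on 64-bit, about 256GB

	Halving the entries per page cuts the triple-indirect capacity by
	roughly 2^3, hence the factor of one eighth.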

It's a shame that tmpfs should be more restrictive than ramfs, and this
limitation has now been noticed.  Add another level to the swap vector?
No, it became obscure and hard to maintain, once I complicated it to
make use of highmem pages nine years ago: better choose another way.

Surely, if 2.4 had had the radix tree pagecache introduced in 2.5,
then tmpfs would never have invented its own peculiar radix tree:
we would have fitted swap entries into the common radix tree instead,
in much the same way as we fit swap entries into page tables.

And why should each file have a separate radix tree for its pages
and for its swap entries?  The swap entries are required precisely
where and when the pages are not.  We want to put them together in
a single radix tree: which can then avoid much of the locking which
was needed to prevent them from being exchanged underneath us.
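
How can swap entries and struct page pointers share one radix tree slot?
The sketch below shows only the idea: the helper names and the exact bit
layout here are assumptions, not the exceptional-entry code that patches
1-2/12 actually add.  A swap entry is shifted up and tagged in its low
bit, which a word-aligned struct page pointer can never have set:

	/*
	 * Illustration only: names and encoding are assumed, not taken
	 * from patches 1-2/12.
	 */
	static inline void *swap_slot_encode(swp_entry_t swap)
	{
		return (void *)((swap.val << 1) | 1);
	}

	static inline swp_entry_t swap_slot_decode(void *slot)
	{
		return (swp_entry_t){ (unsigned long)slot >> 1 };
	}

	static inline int slot_holds_swap(void *slot)
	{
		return (unsigned long)slot & 1;
	}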

This also avoids the waste of memory devoted to swap vectors, first
in the shmem_inode itself, then at least two more pages once a file
grew beyond 16 data pages (pages accounted by df and du, but not by
memcg).  Allocated upfront, to avoid allocation when under swapping
pressure, but pure waste when CONFIG_SWAP is not set - I have never
spattered around the ifdefs to prevent that, preferring this move
to sharing the common radix tree instead.

There are three downsides to sharing the radix tree.  One, that it
binds tmpfs more tightly to the rest of mm, either requiring knowledge
of swap entries in the radix tree there, or duplication of its code here
in shmem.c.  I believe that the simplifications and memory savings
(and probable higher performance, not yet measured) justify that.

Two, that on HIGHMEM systems with SWAP enabled, it's the lowmem radix
nodes that cannot be freed under memory pressure - whereas before it
was the less precious highmem swap vector pages that could not be freed.
I'm hoping that 64-bit has now been accessible for long enough, that
the highmem argument has grown much less persuasive.

Three, that swapoff is slower than it used to be on tmpfs files, since
it's using a simple generic mechanism not tailored to it: I find this
noticeable, and shall want to improve it, but maybe nobody else will notice.

So... now remove most of the old swap vector code from shmem.c.  But,
for the moment, keep the simple i_direct vector of 16 pages, with simple
accessors shmem_put_swap() and shmem_get_swap(), as a toy implementation
to help mark where swap needs to be handled in subsequent patches.
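
Condensed from the diff: wherever an old swap entry has to be cleared
(the truncation loop shown here; writepage and swapoff differ only in
how they dispose of the swap itself), the pattern with those accessors
is simply:

	spin_lock(&info->lock);
	swap = shmem_get_swap(info, index);
	if (swap.val) {
		free_swap_and_cache(swap);
		shmem_put_swap(info, index, (swp_entry_t){0});
		info->swapped--;
	}
	spin_unlock(&info->lock);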

Signed-off-by: Hugh Dickins <hughd@google.com>
---
 include/linux/shmem_fs.h |    2 
 mm/shmem.c               |  782 +++----------------------------------
 2 files changed, 84 insertions(+), 700 deletions(-)

--- linux.orig/include/linux/shmem_fs.h	2011-06-13 13:26:07.446100738 -0700
+++ linux/include/linux/shmem_fs.h	2011-06-13 13:27:59.634657055 -0700
@@ -17,9 +17,7 @@ struct shmem_inode_info {
 	unsigned long		flags;
 	unsigned long		alloced;	/* data pages alloced to file */
 	unsigned long		swapped;	/* subtotal assigned to swap */
-	unsigned long		next_index;	/* highest alloced index + 1 */
 	struct shared_policy	policy;		/* NUMA memory alloc policy */
-	struct page		*i_indirect;	/* top indirect blocks page */
 	union {
 		swp_entry_t	i_direct[SHMEM_NR_DIRECT]; /* first blocks */
 		char		inline_symlink[SHMEM_SYMLINK_INLINE_LEN];
--- linux.orig/mm/shmem.c	2011-06-13 13:26:07.446100738 -0700
+++ linux/mm/shmem.c	2011-06-13 13:27:59.634657055 -0700
@@ -66,37 +66,9 @@ static struct vfsmount *shm_mnt;
 #include <asm/div64.h>
 #include <asm/pgtable.h>
 
-/*
- * The maximum size of a shmem/tmpfs file is limited by the maximum size of
- * its triple-indirect swap vector - see illustration at shmem_swp_entry().
- *
- * With 4kB page size, maximum file size is just over 2TB on a 32-bit kernel,
- * but one eighth of that on a 64-bit kernel.  With 8kB page size, maximum
- * file size is just over 4TB on a 64-bit kernel, but 16TB on a 32-bit kernel,
- * MAX_LFS_FILESIZE being then more restrictive than swap vector layout.
- *
- * We use / and * instead of shifts in the definitions below, so that the swap
- * vector can be tested with small even values (e.g. 20) for ENTRIES_PER_PAGE.
- */
-#define ENTRIES_PER_PAGE (PAGE_CACHE_SIZE/sizeof(unsigned long))
-#define ENTRIES_PER_PAGEPAGE ((unsigned long long)ENTRIES_PER_PAGE*ENTRIES_PER_PAGE)
-
-#define SHMSWP_MAX_INDEX (SHMEM_NR_DIRECT + (ENTRIES_PER_PAGEPAGE/2) * (ENTRIES_PER_PAGE+1))
-#define SHMSWP_MAX_BYTES (SHMSWP_MAX_INDEX << PAGE_CACHE_SHIFT)
-
-#define SHMEM_MAX_BYTES  min_t(unsigned long long, SHMSWP_MAX_BYTES, MAX_LFS_FILESIZE)
-#define SHMEM_MAX_INDEX  ((unsigned long)((SHMEM_MAX_BYTES+1) >> PAGE_CACHE_SHIFT))
-
 #define BLOCKS_PER_PAGE  (PAGE_CACHE_SIZE/512)
 #define VM_ACCT(size)    (PAGE_CACHE_ALIGN(size) >> PAGE_SHIFT)
 
-/* info->flags needs VM_flags to handle pagein/truncate races efficiently */
-#define SHMEM_PAGEIN	 VM_READ
-#define SHMEM_TRUNCATE	 VM_WRITE
-
-/* Definition to limit shmem_truncate's steps between cond_rescheds */
-#define LATENCY_LIMIT	 64
-
 /* Pretend that each entry is of this size in directory's i_size */
 #define BOGO_DIRENT_SIZE 20
 
@@ -107,7 +79,7 @@ struct shmem_xattr {
 	char value[0];
 };
 
-/* Flag allocation requirements to shmem_getpage and shmem_swp_alloc */
+/* Flag allocation requirements to shmem_getpage */
 enum sgp_type {
 	SGP_READ,	/* don't exceed i_size, don't allocate page */
 	SGP_CACHE,	/* don't exceed i_size, may allocate page */
@@ -137,56 +109,6 @@ static inline int shmem_getpage(struct i
 			mapping_gfp_mask(inode->i_mapping), fault_type);
 }
 
-static inline struct page *shmem_dir_alloc(gfp_t gfp_mask)
-{
-	/*
-	 * The above definition of ENTRIES_PER_PAGE, and the use of
-	 * BLOCKS_PER_PAGE on indirect pages, assume PAGE_CACHE_SIZE:
-	 * might be reconsidered if it ever diverges from PAGE_SIZE.
-	 *
-	 * Mobility flags are masked out as swap vectors cannot move
-	 */
-	return alloc_pages((gfp_mask & ~GFP_MOVABLE_MASK) | __GFP_ZERO,
-				PAGE_CACHE_SHIFT-PAGE_SHIFT);
-}
-
-static inline void shmem_dir_free(struct page *page)
-{
-	__free_pages(page, PAGE_CACHE_SHIFT-PAGE_SHIFT);
-}
-
-static struct page **shmem_dir_map(struct page *page)
-{
-	return (struct page **)kmap_atomic(page, KM_USER0);
-}
-
-static inline void shmem_dir_unmap(struct page **dir)
-{
-	kunmap_atomic(dir, KM_USER0);
-}
-
-static swp_entry_t *shmem_swp_map(struct page *page)
-{
-	return (swp_entry_t *)kmap_atomic(page, KM_USER1);
-}
-
-static inline void shmem_swp_balance_unmap(void)
-{
-	/*
-	 * When passing a pointer to an i_direct entry, to code which
-	 * also handles indirect entries and so will shmem_swp_unmap,
-	 * we must arrange for the preempt count to remain in balance.
-	 * What kmap_atomic of a lowmem page does depends on config
-	 * and architecture, so pretend to kmap_atomic some lowmem page.
-	 */
-	(void) kmap_atomic(ZERO_PAGE(0), KM_USER1);
-}
-
-static inline void shmem_swp_unmap(swp_entry_t *entry)
-{
-	kunmap_atomic(entry, KM_USER1);
-}
-
 static inline struct shmem_sb_info *SHMEM_SB(struct super_block *sb)
 {
 	return sb->s_fs_info;
@@ -303,468 +225,56 @@ static void shmem_recalc_inode(struct in
 	}
 }
 
-/**
- * shmem_swp_entry - find the swap vector position in the info structure
- * @info:  info structure for the inode
- * @index: index of the page to find
- * @page:  optional page to add to the structure. Has to be preset to
- *         all zeros
- *
- * If there is no space allocated yet it will return NULL when
- * page is NULL, else it will use the page for the needed block,
- * setting it to NULL on return to indicate that it has been used.
- *
- * The swap vector is organized the following way:
- *
- * There are SHMEM_NR_DIRECT entries directly stored in the
- * shmem_inode_info structure. So small files do not need an addional
- * allocation.
- *
- * For pages with index > SHMEM_NR_DIRECT there is the pointer
- * i_indirect which points to a page which holds in the first half
- * doubly indirect blocks, in the second half triple indirect blocks:
- *
- * For an artificial ENTRIES_PER_PAGE = 4 this would lead to the
- * following layout (for SHMEM_NR_DIRECT == 16):
- *
- * i_indirect -> dir --> 16-19
- * 	      |	     +-> 20-23
- * 	      |
- * 	      +-->dir2 --> 24-27
- * 	      |	       +-> 28-31
- * 	      |	       +-> 32-35
- * 	      |	       +-> 36-39
- * 	      |
- * 	      +-->dir3 --> 40-43
- * 	       	       +-> 44-47
- * 	      	       +-> 48-51
- * 	      	       +-> 52-55
- */
-static swp_entry_t *shmem_swp_entry(struct shmem_inode_info *info, unsigned long index, struct page **page)
-{
-	unsigned long offset;
-	struct page **dir;
-	struct page *subdir;
-
-	if (index < SHMEM_NR_DIRECT) {
-		shmem_swp_balance_unmap();
-		return info->i_direct+index;
-	}
-	if (!info->i_indirect) {
-		if (page) {
-			info->i_indirect = *page;
-			*page = NULL;
-		}
-		return NULL;			/* need another page */
-	}
-
-	index -= SHMEM_NR_DIRECT;
-	offset = index % ENTRIES_PER_PAGE;
-	index /= ENTRIES_PER_PAGE;
-	dir = shmem_dir_map(info->i_indirect);
-
-	if (index >= ENTRIES_PER_PAGE/2) {
-		index -= ENTRIES_PER_PAGE/2;
-		dir += ENTRIES_PER_PAGE/2 + index/ENTRIES_PER_PAGE;
-		index %= ENTRIES_PER_PAGE;
-		subdir = *dir;
-		if (!subdir) {
-			if (page) {
-				*dir = *page;
-				*page = NULL;
-			}
-			shmem_dir_unmap(dir);
-			return NULL;		/* need another page */
-		}
-		shmem_dir_unmap(dir);
-		dir = shmem_dir_map(subdir);
-	}
-
-	dir += index;
-	subdir = *dir;
-	if (!subdir) {
-		if (!page || !(subdir = *page)) {
-			shmem_dir_unmap(dir);
-			return NULL;		/* need a page */
-		}
-		*dir = subdir;
-		*page = NULL;
-	}
-	shmem_dir_unmap(dir);
-	return shmem_swp_map(subdir) + offset;
-}
-
-static void shmem_swp_set(struct shmem_inode_info *info, swp_entry_t *entry, unsigned long value)
+static void shmem_put_swap(struct shmem_inode_info *info, pgoff_t index,
+			   swp_entry_t swap)
 {
-	long incdec = value? 1: -1;
-
-	entry->val = value;
-	info->swapped += incdec;
-	if ((unsigned long)(entry - info->i_direct) >= SHMEM_NR_DIRECT) {
-		struct page *page = kmap_atomic_to_page(entry);
-		set_page_private(page, page_private(page) + incdec);
-	}
+	if (index < SHMEM_NR_DIRECT)
+		info->i_direct[index] = swap;
 }
 
-/**
- * shmem_swp_alloc - get the position of the swap entry for the page.
- * @info:	info structure for the inode
- * @index:	index of the page to find
- * @sgp:	check and recheck i_size? skip allocation?
- * @gfp:	gfp mask to use for any page allocation
- *
- * If the entry does not exist, allocate it.
- */
-static swp_entry_t *shmem_swp_alloc(struct shmem_inode_info *info,
-			unsigned long index, enum sgp_type sgp, gfp_t gfp)
+static swp_entry_t shmem_get_swap(struct shmem_inode_info *info, pgoff_t index)
 {
-	struct inode *inode = &info->vfs_inode;
-	struct shmem_sb_info *sbinfo = SHMEM_SB(inode->i_sb);
-	struct page *page = NULL;
-	swp_entry_t *entry;
-
-	if (sgp != SGP_WRITE &&
-	    ((loff_t) index << PAGE_CACHE_SHIFT) >= i_size_read(inode))
-		return ERR_PTR(-EINVAL);
-
-	while (!(entry = shmem_swp_entry(info, index, &page))) {
-		if (sgp == SGP_READ)
-			return shmem_swp_map(ZERO_PAGE(0));
-		/*
-		 * Test used_blocks against 1 less max_blocks, since we have 1 data
-		 * page (and perhaps indirect index pages) yet to allocate:
-		 * a waste to allocate index if we cannot allocate data.
-		 */
-		if (sbinfo->max_blocks) {
-			if (percpu_counter_compare(&sbinfo->used_blocks,
-						sbinfo->max_blocks - 1) >= 0)
-				return ERR_PTR(-ENOSPC);
-			percpu_counter_inc(&sbinfo->used_blocks);
-			inode->i_blocks += BLOCKS_PER_PAGE;
-		}
-
-		spin_unlock(&info->lock);
-		page = shmem_dir_alloc(gfp);
-		spin_lock(&info->lock);
-
-		if (!page) {
-			shmem_free_blocks(inode, 1);
-			return ERR_PTR(-ENOMEM);
-		}
-		if (sgp != SGP_WRITE &&
-		    ((loff_t) index << PAGE_CACHE_SHIFT) >= i_size_read(inode)) {
-			entry = ERR_PTR(-EINVAL);
-			break;
-		}
-		if (info->next_index <= index)
-			info->next_index = index + 1;
-	}
-	if (page) {
-		/* another task gave its page, or truncated the file */
-		shmem_free_blocks(inode, 1);
-		shmem_dir_free(page);
-	}
-	if (info->next_index <= index && !IS_ERR(entry))
-		info->next_index = index + 1;
-	return entry;
+	return (index < SHMEM_NR_DIRECT) ?
+		info->i_direct[index] : (swp_entry_t){0};
 }
 
-/**
- * shmem_free_swp - free some swap entries in a directory
- * @dir:        pointer to the directory
- * @edir:       pointer after last entry of the directory
- * @punch_lock: pointer to spinlock when needed for the holepunch case
- */
-static int shmem_free_swp(swp_entry_t *dir, swp_entry_t *edir,
-						spinlock_t *punch_lock)
-{
-	spinlock_t *punch_unlock = NULL;
-	swp_entry_t *ptr;
-	int freed = 0;
-
-	for (ptr = dir; ptr < edir; ptr++) {
-		if (ptr->val) {
-			if (unlikely(punch_lock)) {
-				punch_unlock = punch_lock;
-				punch_lock = NULL;
-				spin_lock(punch_unlock);
-				if (!ptr->val)
-					continue;
-			}
-			free_swap_and_cache(*ptr);
-			*ptr = (swp_entry_t){0};
-			freed++;
-		}
-	}
-	if (punch_unlock)
-		spin_unlock(punch_unlock);
-	return freed;
-}
-
-static int shmem_map_and_free_swp(struct page *subdir, int offset,
-		int limit, struct page ***dir, spinlock_t *punch_lock)
-{
-	swp_entry_t *ptr;
-	int freed = 0;
-
-	ptr = shmem_swp_map(subdir);
-	for (; offset < limit; offset += LATENCY_LIMIT) {
-		int size = limit - offset;
-		if (size > LATENCY_LIMIT)
-			size = LATENCY_LIMIT;
-		freed += shmem_free_swp(ptr+offset, ptr+offset+size,
-							punch_lock);
-		if (need_resched()) {
-			shmem_swp_unmap(ptr);
-			if (*dir) {
-				shmem_dir_unmap(*dir);
-				*dir = NULL;
-			}
-			cond_resched();
-			ptr = shmem_swp_map(subdir);
-		}
-	}
-	shmem_swp_unmap(ptr);
-	return freed;
-}
-
-static void shmem_free_pages(struct list_head *next)
-{
-	struct page *page;
-	int freed = 0;
-
-	do {
-		page = container_of(next, struct page, lru);
-		next = next->next;
-		shmem_dir_free(page);
-		freed++;
-		if (freed >= LATENCY_LIMIT) {
-			cond_resched();
-			freed = 0;
-		}
-	} while (next);
-}
-
-void shmem_truncate_range(struct inode *inode, loff_t start, loff_t end)
+void shmem_truncate_range(struct inode *inode, loff_t lstart, loff_t lend)
 {
+	struct address_space *mapping = inode->i_mapping;
 	struct shmem_inode_info *info = SHMEM_I(inode);
-	unsigned long idx;
-	unsigned long size;
-	unsigned long limit;
-	unsigned long stage;
-	unsigned long diroff;
-	struct page **dir;
-	struct page *topdir;
-	struct page *middir;
-	struct page *subdir;
-	swp_entry_t *ptr;
-	LIST_HEAD(pages_to_free);
-	long nr_pages_to_free = 0;
-	long nr_swaps_freed = 0;
-	int offset;
-	int freed;
-	int punch_hole;
-	spinlock_t *needs_lock;
-	spinlock_t *punch_lock;
-	unsigned long upper_limit;
+	pgoff_t start = (lstart + PAGE_CACHE_SIZE - 1) >> PAGE_CACHE_SHIFT;
+	pgoff_t end = (lend >> PAGE_CACHE_SHIFT);
+	pgoff_t index;
+	swp_entry_t swap;
 
-	truncate_inode_pages_range(inode->i_mapping, start, end);
+	truncate_inode_pages_range(mapping, lstart, lend);
 
-	inode->i_ctime = inode->i_mtime = CURRENT_TIME;
-	idx = (start + PAGE_CACHE_SIZE - 1) >> PAGE_CACHE_SHIFT;
-	if (idx >= info->next_index)
-		return;
+	if (end > SHMEM_NR_DIRECT)
+		end = SHMEM_NR_DIRECT;
 
 	spin_lock(&info->lock);
-	info->flags |= SHMEM_TRUNCATE;
-	if (likely(end == (loff_t) -1)) {
-		limit = info->next_index;
-		upper_limit = SHMEM_MAX_INDEX;
-		info->next_index = idx;
-		needs_lock = NULL;
-		punch_hole = 0;
-	} else {
-		if (end + 1 >= inode->i_size) {	/* we may free a little more */
-			limit = (inode->i_size + PAGE_CACHE_SIZE - 1) >>
-							PAGE_CACHE_SHIFT;
-			upper_limit = SHMEM_MAX_INDEX;
-		} else {
-			limit = (end + 1) >> PAGE_CACHE_SHIFT;
-			upper_limit = limit;
-		}
-		needs_lock = &info->lock;
-		punch_hole = 1;
-	}
-
-	topdir = info->i_indirect;
-	if (topdir && idx <= SHMEM_NR_DIRECT && !punch_hole) {
-		info->i_indirect = NULL;
-		nr_pages_to_free++;
-		list_add(&topdir->lru, &pages_to_free);
-	}
-	spin_unlock(&info->lock);
-
-	if (info->swapped && idx < SHMEM_NR_DIRECT) {
-		ptr = info->i_direct;
-		size = limit;
-		if (size > SHMEM_NR_DIRECT)
-			size = SHMEM_NR_DIRECT;
-		nr_swaps_freed = shmem_free_swp(ptr+idx, ptr+size, needs_lock);
-	}
-
-	/*
-	 * If there are no indirect blocks or we are punching a hole
-	 * below indirect blocks, nothing to be done.
-	 */
-	if (!topdir || limit <= SHMEM_NR_DIRECT)
-		goto done2;
-
-	/*
-	 * The truncation case has already dropped info->lock, and we're safe
-	 * because i_size and next_index have already been lowered, preventing
-	 * access beyond.  But in the punch_hole case, we still need to take
-	 * the lock when updating the swap directory, because there might be
-	 * racing accesses by shmem_getpage(SGP_CACHE), shmem_unuse_inode or
-	 * shmem_writepage.  However, whenever we find we can remove a whole
-	 * directory page (not at the misaligned start or end of the range),
-	 * we first NULLify its pointer in the level above, and then have no
-	 * need to take the lock when updating its contents: needs_lock and
-	 * punch_lock (either pointing to info->lock or NULL) manage this.
-	 */
-
-	upper_limit -= SHMEM_NR_DIRECT;
-	limit -= SHMEM_NR_DIRECT;
-	idx = (idx > SHMEM_NR_DIRECT)? (idx - SHMEM_NR_DIRECT): 0;
-	offset = idx % ENTRIES_PER_PAGE;
-	idx -= offset;
-
-	dir = shmem_dir_map(topdir);
-	stage = ENTRIES_PER_PAGEPAGE/2;
-	if (idx < ENTRIES_PER_PAGEPAGE/2) {
-		middir = topdir;
-		diroff = idx/ENTRIES_PER_PAGE;
-	} else {
-		dir += ENTRIES_PER_PAGE/2;
-		dir += (idx - ENTRIES_PER_PAGEPAGE/2)/ENTRIES_PER_PAGEPAGE;
-		while (stage <= idx)
-			stage += ENTRIES_PER_PAGEPAGE;
-		middir = *dir;
-		if (*dir) {
-			diroff = ((idx - ENTRIES_PER_PAGEPAGE/2) %
-				ENTRIES_PER_PAGEPAGE) / ENTRIES_PER_PAGE;
-			if (!diroff && !offset && upper_limit >= stage) {
-				if (needs_lock) {
-					spin_lock(needs_lock);
-					*dir = NULL;
-					spin_unlock(needs_lock);
-					needs_lock = NULL;
-				} else
-					*dir = NULL;
-				nr_pages_to_free++;
-				list_add(&middir->lru, &pages_to_free);
-			}
-			shmem_dir_unmap(dir);
-			dir = shmem_dir_map(middir);
-		} else {
-			diroff = 0;
-			offset = 0;
-			idx = stage;
+	for (index = start; index < end; index++) {
+		swap = shmem_get_swap(info, index);
+		if (swap.val) {
+			free_swap_and_cache(swap);
+			shmem_put_swap(info, index, (swp_entry_t){0});
+			info->swapped--;
 		}
 	}
 
-	for (; idx < limit; idx += ENTRIES_PER_PAGE, diroff++) {
-		if (unlikely(idx == stage)) {
-			shmem_dir_unmap(dir);
-			dir = shmem_dir_map(topdir) +
-			    ENTRIES_PER_PAGE/2 + idx/ENTRIES_PER_PAGEPAGE;
-			while (!*dir) {
-				dir++;
-				idx += ENTRIES_PER_PAGEPAGE;
-				if (idx >= limit)
-					goto done1;
-			}
-			stage = idx + ENTRIES_PER_PAGEPAGE;
-			middir = *dir;
-			if (punch_hole)
-				needs_lock = &info->lock;
-			if (upper_limit >= stage) {
-				if (needs_lock) {
-					spin_lock(needs_lock);
-					*dir = NULL;
-					spin_unlock(needs_lock);
-					needs_lock = NULL;
-				} else
-					*dir = NULL;
-				nr_pages_to_free++;
-				list_add(&middir->lru, &pages_to_free);
-			}
-			shmem_dir_unmap(dir);
-			cond_resched();
-			dir = shmem_dir_map(middir);
-			diroff = 0;
-		}
-		punch_lock = needs_lock;
-		subdir = dir[diroff];
-		if (subdir && !offset && upper_limit-idx >= ENTRIES_PER_PAGE) {
-			if (needs_lock) {
-				spin_lock(needs_lock);
-				dir[diroff] = NULL;
-				spin_unlock(needs_lock);
-				punch_lock = NULL;
-			} else
-				dir[diroff] = NULL;
-			nr_pages_to_free++;
-			list_add(&subdir->lru, &pages_to_free);
-		}
-		if (subdir && page_private(subdir) /* has swap entries */) {
-			size = limit - idx;
-			if (size > ENTRIES_PER_PAGE)
-				size = ENTRIES_PER_PAGE;
-			freed = shmem_map_and_free_swp(subdir,
-					offset, size, &dir, punch_lock);
-			if (!dir)
-				dir = shmem_dir_map(middir);
-			nr_swaps_freed += freed;
-			if (offset || punch_lock) {
-				spin_lock(&info->lock);
-				set_page_private(subdir,
-					page_private(subdir) - freed);
-				spin_unlock(&info->lock);
-			} else
-				BUG_ON(page_private(subdir) != freed);
-		}
-		offset = 0;
-	}
-done1:
-	shmem_dir_unmap(dir);
-done2:
-	if (inode->i_mapping->nrpages && (info->flags & SHMEM_PAGEIN)) {
+	if (mapping->nrpages) {
+		spin_unlock(&info->lock);
 		/*
-		 * Call truncate_inode_pages again: racing shmem_unuse_inode
-		 * may have swizzled a page in from swap since
-		 * truncate_pagecache or generic_delete_inode did it, before we
-		 * lowered next_index.  Also, though shmem_getpage checks
-		 * i_size before adding to cache, no recheck after: so fix the
-		 * narrow window there too.
+		 * A page may have meanwhile sneaked in from swap.
 		 */
-		truncate_inode_pages_range(inode->i_mapping, start, end);
+		truncate_inode_pages_range(mapping, lstart, lend);
+		spin_lock(&info->lock);
 	}
 
-	spin_lock(&info->lock);
-	info->flags &= ~SHMEM_TRUNCATE;
-	info->swapped -= nr_swaps_freed;
-	if (nr_pages_to_free)
-		shmem_free_blocks(inode, nr_pages_to_free);
 	shmem_recalc_inode(inode);
 	spin_unlock(&info->lock);
 
-	/*
-	 * Empty swap vector directory pages to be freed?
-	 */
-	if (!list_empty(&pages_to_free)) {
-		pages_to_free.prev->next = NULL;
-		shmem_free_pages(pages_to_free.next);
-	}
+	inode->i_ctime = inode->i_mtime = CURRENT_TIME;
 }
 EXPORT_SYMBOL_GPL(shmem_truncate_range);
 
@@ -797,19 +307,6 @@ static int shmem_setattr(struct dentry *
 				if (page)
 					unlock_page(page);
 			}
-			/*
-			 * Reset SHMEM_PAGEIN flag so that shmem_truncate can
-			 * detect if any pages might have been added to cache
-			 * after truncate_inode_pages.  But we needn't bother
-			 * if it's being fully truncated to zero-length: the
-			 * nrpages check is efficient enough in that case.
-			 */
-			if (newsize) {
-				struct shmem_inode_info *info = SHMEM_I(inode);
-				spin_lock(&info->lock);
-				info->flags &= ~SHMEM_PAGEIN;
-				spin_unlock(&info->lock);
-			}
 		}
 		if (newsize != oldsize) {
 			i_size_write(inode, newsize);
@@ -859,106 +356,28 @@ static void shmem_evict_inode(struct ino
 	end_writeback(inode);
 }
 
-static inline int shmem_find_swp(swp_entry_t entry, swp_entry_t *dir, swp_entry_t *edir)
-{
-	swp_entry_t *ptr;
-
-	for (ptr = dir; ptr < edir; ptr++) {
-		if (ptr->val == entry.val)
-			return ptr - dir;
-	}
-	return -1;
-}
-
 static int shmem_unuse_inode(struct shmem_inode_info *info, swp_entry_t entry, struct page *page)
 {
-	struct address_space *mapping;
+	struct address_space *mapping = info->vfs_inode.i_mapping;
 	unsigned long idx;
-	unsigned long size;
-	unsigned long limit;
-	unsigned long stage;
-	struct page **dir;
-	struct page *subdir;
-	swp_entry_t *ptr;
-	int offset;
 	int error;
 
-	idx = 0;
-	ptr = info->i_direct;
-	spin_lock(&info->lock);
-	if (!info->swapped) {
-		list_del_init(&info->swaplist);
-		goto lost2;
-	}
-	limit = info->next_index;
-	size = limit;
-	if (size > SHMEM_NR_DIRECT)
-		size = SHMEM_NR_DIRECT;
-	offset = shmem_find_swp(entry, ptr, ptr+size);
-	if (offset >= 0) {
-		shmem_swp_balance_unmap();
-		goto found;
-	}
-	if (!info->i_indirect)
-		goto lost2;
-
-	dir = shmem_dir_map(info->i_indirect);
-	stage = SHMEM_NR_DIRECT + ENTRIES_PER_PAGEPAGE/2;
-
-	for (idx = SHMEM_NR_DIRECT; idx < limit; idx += ENTRIES_PER_PAGE, dir++) {
-		if (unlikely(idx == stage)) {
-			shmem_dir_unmap(dir-1);
-			if (cond_resched_lock(&info->lock)) {
-				/* check it has not been truncated */
-				if (limit > info->next_index) {
-					limit = info->next_index;
-					if (idx >= limit)
-						goto lost2;
-				}
-			}
-			dir = shmem_dir_map(info->i_indirect) +
-			    ENTRIES_PER_PAGE/2 + idx/ENTRIES_PER_PAGEPAGE;
-			while (!*dir) {
-				dir++;
-				idx += ENTRIES_PER_PAGEPAGE;
-				if (idx >= limit)
-					goto lost1;
-			}
-			stage = idx + ENTRIES_PER_PAGEPAGE;
-			subdir = *dir;
-			shmem_dir_unmap(dir);
-			dir = shmem_dir_map(subdir);
-		}
-		subdir = *dir;
-		if (subdir && page_private(subdir)) {
-			ptr = shmem_swp_map(subdir);
-			size = limit - idx;
-			if (size > ENTRIES_PER_PAGE)
-				size = ENTRIES_PER_PAGE;
-			offset = shmem_find_swp(entry, ptr, ptr+size);
-			shmem_swp_unmap(ptr);
-			if (offset >= 0) {
-				shmem_dir_unmap(dir);
-				ptr = shmem_swp_map(subdir);
-				goto found;
-			}
-		}
-	}
-lost1:
-	shmem_dir_unmap(dir-1);
-lost2:
-	spin_unlock(&info->lock);
+	for (idx = 0; idx < SHMEM_NR_DIRECT; idx++)
+		if (shmem_get_swap(info, idx).val == entry.val)
+			goto found;
 	return 0;
 found:
-	idx += offset;
-	ptr += offset;
+	spin_lock(&info->lock);
+	if (shmem_get_swap(info, idx).val != entry.val) {
+		spin_unlock(&info->lock);
+		return 0;
+	}
 
 	/*
 	 * Move _head_ to start search for next from here.
 	 * But be careful: shmem_evict_inode checks list_empty without taking
 	 * mutex, and there's an instant in list_move_tail when info->swaplist
-	 * would appear empty, if it were the only one on shmem_swaplist.  We
-	 * could avoid doing it if inode NULL; or use this minor optimization.
+	 * would appear empty, if it were the only one on shmem_swaplist.
 	 */
 	if (shmem_swaplist.next != &info->swaplist)
 		list_move_tail(&shmem_swaplist, &info->swaplist);
@@ -968,19 +387,17 @@ found:
 	 * but also to hold up shmem_evict_inode(): so inode cannot be freed
 	 * beneath us (pagelock doesn't help until the page is in pagecache).
 	 */
-	mapping = info->vfs_inode.i_mapping;
 	error = add_to_page_cache_locked(page, mapping, idx, GFP_NOWAIT);
 	/* which does mem_cgroup_uncharge_cache_page on error */
 
 	if (error != -ENOMEM) {
 		delete_from_swap_cache(page);
 		set_page_dirty(page);
-		info->flags |= SHMEM_PAGEIN;
-		shmem_swp_set(info, ptr, 0);
+		shmem_put_swap(info, idx, (swp_entry_t){0});
+		info->swapped--;
 		swap_free(entry);
 		error = 1;	/* not an error, but entry was found */
 	}
-	shmem_swp_unmap(ptr);
 	spin_unlock(&info->lock);
 	return error;
 }
@@ -1017,7 +434,14 @@ int shmem_unuse(swp_entry_t entry, struc
 	mutex_lock(&shmem_swaplist_mutex);
 	list_for_each_safe(p, next, &shmem_swaplist) {
 		info = list_entry(p, struct shmem_inode_info, swaplist);
-		found = shmem_unuse_inode(info, entry, page);
+		if (!info->swapped) {
+			spin_lock(&info->lock);
+			if (!info->swapped)
+				list_del_init(&info->swaplist);
+			spin_unlock(&info->lock);
+		}
+		if (info->swapped)
+			found = shmem_unuse_inode(info, entry, page);
 		cond_resched();
 		if (found)
 			break;
@@ -1041,7 +465,7 @@ out:
 static int shmem_writepage(struct page *page, struct writeback_control *wbc)
 {
 	struct shmem_inode_info *info;
-	swp_entry_t *entry, swap;
+	swp_entry_t swap, oswap;
 	struct address_space *mapping;
 	unsigned long index;
 	struct inode *inode;
@@ -1067,6 +491,15 @@ static int shmem_writepage(struct page *
 		WARN_ON_ONCE(1);	/* Still happens? Tell us about it! */
 		goto redirty;
 	}
+
+	/*
+	 * Just for this patch, we have a toy implementation,
+	 * which can swap out only the first SHMEM_NR_DIRECT pages:
+	 * for simple demonstration of where we need to think about swap.
+	 */
+	if (index >= SHMEM_NR_DIRECT)
+		goto redirty;
+
 	swap = get_swap_page();
 	if (!swap.val)
 		goto redirty;
@@ -1087,22 +520,19 @@ static int shmem_writepage(struct page *
 	spin_lock(&info->lock);
 	mutex_unlock(&shmem_swaplist_mutex);
 
-	if (index >= info->next_index) {
-		BUG_ON(!(info->flags & SHMEM_TRUNCATE));
-		goto unlock;
-	}
-	entry = shmem_swp_entry(info, index, NULL);
-	if (entry->val) {
+	oswap = shmem_get_swap(info, index);
+	if (oswap.val) {
 		WARN_ON_ONCE(1);	/* Still happens? Tell us about it! */
-		free_swap_and_cache(*entry);
-		shmem_swp_set(info, entry, 0);
+		free_swap_and_cache(oswap);
+		shmem_put_swap(info, index, (swp_entry_t){0});
+		info->swapped--;
 	}
 	shmem_recalc_inode(inode);
 
 	if (add_to_swap_cache(page, swap, GFP_ATOMIC) == 0) {
 		delete_from_page_cache(page);
-		shmem_swp_set(info, entry, swap.val);
-		shmem_swp_unmap(entry);
+		shmem_put_swap(info, index, swap);
+		info->swapped++;
 		swap_shmem_alloc(swap);
 		spin_unlock(&info->lock);
 		BUG_ON(page_mapped(page));
@@ -1110,13 +540,7 @@ static int shmem_writepage(struct page *
 		return 0;
 	}
 
-	shmem_swp_unmap(entry);
-unlock:
 	spin_unlock(&info->lock);
-	/*
-	 * add_to_swap_cache() doesn't return -EEXIST, so we can safely
-	 * clear SWAP_HAS_CACHE flag.
-	 */
 	swapcache_free(swap, NULL);
 redirty:
 	set_page_dirty(page);
@@ -1230,12 +654,10 @@ static int shmem_getpage_gfp(struct inod
 	struct shmem_sb_info *sbinfo;
 	struct page *page;
 	struct page *prealloc_page = NULL;
-	swp_entry_t *entry;
 	swp_entry_t swap;
 	int error;
-	int ret;
 
-	if (idx >= SHMEM_MAX_INDEX)
+	if (idx > (MAX_LFS_FILESIZE >> PAGE_CACHE_SHIFT))
 		return -EFBIG;
 repeat:
 	page = find_lock_page(mapping, idx);
@@ -1272,37 +694,22 @@ repeat:
 
 	spin_lock(&info->lock);
 	shmem_recalc_inode(inode);
-	entry = shmem_swp_alloc(info, idx, sgp, gfp);
-	if (IS_ERR(entry)) {
-		spin_unlock(&info->lock);
-		error = PTR_ERR(entry);
-		goto out;
-	}
-	swap = *entry;
-
+	swap = shmem_get_swap(info, idx);
 	if (swap.val) {
 		/* Look it up and read it in.. */
 		page = lookup_swap_cache(swap);
 		if (!page) {
-			shmem_swp_unmap(entry);
 			spin_unlock(&info->lock);
 			/* here we actually do the io */
 			if (fault_type)
 				*fault_type |= VM_FAULT_MAJOR;
 			page = shmem_swapin(swap, gfp, info, idx);
 			if (!page) {
-				spin_lock(&info->lock);
-				entry = shmem_swp_alloc(info, idx, sgp, gfp);
-				if (IS_ERR(entry))
-					error = PTR_ERR(entry);
-				else {
-					if (entry->val == swap.val)
-						error = -ENOMEM;
-					shmem_swp_unmap(entry);
-				}
-				spin_unlock(&info->lock);
-				if (error)
+				swp_entry_t nswap = shmem_get_swap(info, idx);
+				if (nswap.val == swap.val) {
+					error = -ENOMEM;
 					goto out;
+				}
 				goto repeat;
 			}
 			wait_on_page_locked(page);
@@ -1312,14 +719,12 @@ repeat:
 
 		/* We have to do this with page locked to prevent races */
 		if (!trylock_page(page)) {
-			shmem_swp_unmap(entry);
 			spin_unlock(&info->lock);
 			wait_on_page_locked(page);
 			page_cache_release(page);
 			goto repeat;
 		}
 		if (PageWriteback(page)) {
-			shmem_swp_unmap(entry);
 			spin_unlock(&info->lock);
 			wait_on_page_writeback(page);
 			unlock_page(page);
@@ -1327,7 +732,6 @@ repeat:
 			goto repeat;
 		}
 		if (!PageUptodate(page)) {
-			shmem_swp_unmap(entry);
 			spin_unlock(&info->lock);
 			unlock_page(page);
 			page_cache_release(page);
@@ -1338,7 +742,6 @@ repeat:
 		error = add_to_page_cache_locked(page, mapping,
 						 idx, GFP_NOWAIT);
 		if (error) {
-			shmem_swp_unmap(entry);
 			spin_unlock(&info->lock);
 			if (error == -ENOMEM) {
 				/*
@@ -1358,16 +761,14 @@ repeat:
 			goto repeat;
 		}
 
-		info->flags |= SHMEM_PAGEIN;
-		shmem_swp_set(info, entry, 0);
-		shmem_swp_unmap(entry);
 		delete_from_swap_cache(page);
+		shmem_put_swap(info, idx, (swp_entry_t){0});
+		info->swapped--;
 		spin_unlock(&info->lock);
 		set_page_dirty(page);
 		swap_free(swap);
 
 	} else if (sgp == SGP_READ) {
-		shmem_swp_unmap(entry);
 		page = find_get_page(mapping, idx);
 		if (page && !trylock_page(page)) {
 			spin_unlock(&info->lock);
@@ -1378,7 +779,6 @@ repeat:
 		spin_unlock(&info->lock);
 
 	} else if (prealloc_page) {
-		shmem_swp_unmap(entry);
 		sbinfo = SHMEM_SB(inode->i_sb);
 		if (sbinfo->max_blocks) {
 			if (percpu_counter_compare(&sbinfo->used_blocks,
@@ -1393,34 +793,24 @@ repeat:
 		page = prealloc_page;
 		prealloc_page = NULL;
 
-		entry = shmem_swp_alloc(info, idx, sgp, gfp);
-		if (IS_ERR(entry))
-			error = PTR_ERR(entry);
-		else {
-			swap = *entry;
-			shmem_swp_unmap(entry);
-		}
-		ret = error || swap.val;
-		if (ret)
+		swap = shmem_get_swap(info, idx);
+		if (swap.val)
 			mem_cgroup_uncharge_cache_page(page);
 		else
-			ret = add_to_page_cache_lru(page, mapping,
+			error = add_to_page_cache_lru(page, mapping,
 						idx, GFP_NOWAIT);
 		/*
 		 * At add_to_page_cache_lru() failure,
 		 * uncharge will be done automatically.
 		 */
-		if (ret) {
+		if (swap.val || error) {
 			shmem_unacct_blocks(info->flags, 1);
 			shmem_free_blocks(inode, 1);
 			spin_unlock(&info->lock);
 			page_cache_release(page);
-			if (error)
-				goto out;
 			goto repeat;
 		}
 
-		info->flags |= SHMEM_PAGEIN;
 		info->alloced++;
 		spin_unlock(&info->lock);
 		clear_highpage(page);
@@ -2627,7 +2017,7 @@ int shmem_fill_super(struct super_block
 		goto failed;
 	sbinfo->free_inodes = sbinfo->max_inodes;
 
-	sb->s_maxbytes = SHMEM_MAX_BYTES;
+	sb->s_maxbytes = MAX_LFS_FILESIZE;
 	sb->s_blocksize = PAGE_CACHE_SIZE;
 	sb->s_blocksize_bits = PAGE_CACHE_SHIFT;
 	sb->s_magic = TMPFS_MAGIC;
@@ -2869,7 +2259,7 @@ out4:
 void mem_cgroup_get_shmem_target(struct inode *inode, pgoff_t pgoff,
 					struct page **pagep, swp_entry_t *ent)
 {
-	swp_entry_t entry = { .val = 0 }, *ptr;
+	swp_entry_t entry = { .val = 0 };
 	struct page *page = NULL;
 	struct shmem_inode_info *info = SHMEM_I(inode);
 
@@ -2877,16 +2267,13 @@ void mem_cgroup_get_shmem_target(struct
 		goto out;
 
 	spin_lock(&info->lock);
-	ptr = shmem_swp_entry(info, pgoff, NULL);
 #ifdef CONFIG_SWAP
-	if (ptr && ptr->val) {
-		entry.val = ptr->val;
+	entry = shmem_get_swap(info, pgoff);
+	if (entry.val)
 		page = find_get_page(&swapper_space, entry.val);
-	} else
+	else
 #endif
 		page = find_get_page(inode->i_mapping, pgoff);
-	if (ptr)
-		shmem_swp_unmap(ptr);
 	spin_unlock(&info->lock);
 out:
 	*pagep = page;
@@ -2969,7 +2356,6 @@ out:
 #define shmem_get_inode(sb, dir, mode, dev, flags)	ramfs_get_inode(sb, dir, mode, dev)
 #define shmem_acct_size(flags, size)		0
 #define shmem_unacct_size(flags, size)		do {} while (0)
-#define SHMEM_MAX_BYTES				MAX_LFS_FILESIZE
 
 #endif /* CONFIG_SHMEM */
 
@@ -2993,7 +2379,7 @@ struct file *shmem_file_setup(const char
 	if (IS_ERR(shm_mnt))
 		return (void *)shm_mnt;
 
-	if (size < 0 || size > SHMEM_MAX_BYTES)
+	if (size < 0 || size > MAX_LFS_FILESIZE)
 		return ERR_PTR(-EINVAL);
 
 	if (shmem_acct_size(flags, size))


^ permalink raw reply	[flat|nested] 71+ messages in thread

* [PATCH 4/12] tmpfs: miscellaneous trivial cleanups
  2011-06-14 10:40 ` Hugh Dickins
@ 2011-06-14 10:48   ` Hugh Dickins
  -1 siblings, 0 replies; 71+ messages in thread
From: Hugh Dickins @ 2011-06-14 10:48 UTC (permalink / raw)
  To: Andrew Morton; +Cc: linux-kernel, linux-mm

While shmem.c is at its smallest, make a number of boring nitpicky
cleanups, mostly for consistency of variable naming.  Things like "swap"
instead of "entry", "pgoff_t index" instead of "unsigned long idx".

And since everything else here is prefixed "shmem_",
better change init_tmpfs() to shmem_init().
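
Purely to illustrate, the renames amount to no more than:

	unsigned long idx     ->  pgoff_t index
	swp_entry_t entry     ->  swp_entry_t swap
	init_tmpfs()          ->  shmem_init()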

Signed-off-by: Hugh Dickins <hughd@google.com>
---
 include/linux/shmem_fs.h |    2 
 init/main.c              |    2 
 mm/shmem.c               |  216 ++++++++++++++++++-------------------
 3 files changed, 109 insertions(+), 111 deletions(-)

--- linux.orig/include/linux/shmem_fs.h	2011-06-13 13:27:59.634657055 -0700
+++ linux/include/linux/shmem_fs.h	2011-06-13 13:28:25.822786909 -0700
@@ -47,7 +47,7 @@ static inline struct shmem_inode_info *S
 /*
  * Functions in mm/shmem.c called directly from elsewhere:
  */
-extern int init_tmpfs(void);
+extern int shmem_init(void);
 extern int shmem_fill_super(struct super_block *sb, void *data, int silent);
 extern struct file *shmem_file_setup(const char *name,
 					loff_t size, unsigned long flags);
--- linux.orig/init/main.c	2011-06-13 13:26:07.386100444 -0700
+++ linux/init/main.c	2011-06-13 13:28:25.822786909 -0700
@@ -714,7 +714,7 @@ static void __init do_basic_setup(void)
 {
 	cpuset_init_smp();
 	usermodehelper_init();
-	init_tmpfs();
+	shmem_init();
 	driver_init();
 	init_irq_proc();
 	do_ctors();
--- linux.orig/mm/shmem.c	2011-06-13 13:27:59.634657055 -0700
+++ linux/mm/shmem.c	2011-06-13 13:28:25.822786909 -0700
@@ -28,7 +28,6 @@
 #include <linux/file.h>
 #include <linux/mm.h>
 #include <linux/module.h>
-#include <linux/percpu_counter.h>
 #include <linux/swap.h>
 
 static struct vfsmount *shm_mnt;
@@ -51,6 +50,7 @@ static struct vfsmount *shm_mnt;
 #include <linux/shmem_fs.h>
 #include <linux/writeback.h>
 #include <linux/blkdev.h>
+#include <linux/percpu_counter.h>
 #include <linux/splice.h>
 #include <linux/security.h>
 #include <linux/swapops.h>
@@ -63,7 +63,6 @@ static struct vfsmount *shm_mnt;
 #include <linux/magic.h>
 
 #include <asm/uaccess.h>
-#include <asm/div64.h>
 #include <asm/pgtable.h>
 
 #define BLOCKS_PER_PAGE  (PAGE_CACHE_SIZE/512)
@@ -201,7 +200,7 @@ static void shmem_free_inode(struct supe
 }
 
 /**
- * shmem_recalc_inode - recalculate the size of an inode
+ * shmem_recalc_inode - recalculate the block usage of an inode
  * @inode: inode to recalc
  *
  * We have to calculate the free blocks since the mm can drop
@@ -356,19 +355,20 @@ static void shmem_evict_inode(struct ino
 	end_writeback(inode);
 }
 
-static int shmem_unuse_inode(struct shmem_inode_info *info, swp_entry_t entry, struct page *page)
+static int shmem_unuse_inode(struct shmem_inode_info *info,
+			     swp_entry_t swap, struct page *page)
 {
 	struct address_space *mapping = info->vfs_inode.i_mapping;
-	unsigned long idx;
+	pgoff_t index;
 	int error;
 
-	for (idx = 0; idx < SHMEM_NR_DIRECT; idx++)
-		if (shmem_get_swap(info, idx).val == entry.val)
+	for (index = 0; index < SHMEM_NR_DIRECT; index++)
+		if (shmem_get_swap(info, index).val == swap.val)
 			goto found;
 	return 0;
 found:
 	spin_lock(&info->lock);
-	if (shmem_get_swap(info, idx).val != entry.val) {
+	if (shmem_get_swap(info, index).val != swap.val) {
 		spin_unlock(&info->lock);
 		return 0;
 	}
@@ -387,15 +387,15 @@ found:
 	 * but also to hold up shmem_evict_inode(): so inode cannot be freed
 	 * beneath us (pagelock doesn't help until the page is in pagecache).
 	 */
-	error = add_to_page_cache_locked(page, mapping, idx, GFP_NOWAIT);
+	error = add_to_page_cache_locked(page, mapping, index, GFP_NOWAIT);
 	/* which does mem_cgroup_uncharge_cache_page on error */
 
 	if (error != -ENOMEM) {
 		delete_from_swap_cache(page);
 		set_page_dirty(page);
-		shmem_put_swap(info, idx, (swp_entry_t){0});
+		shmem_put_swap(info, index, (swp_entry_t){0});
 		info->swapped--;
-		swap_free(entry);
+		swap_free(swap);
 		error = 1;	/* not an error, but entry was found */
 	}
 	spin_unlock(&info->lock);
@@ -405,9 +405,9 @@ found:
 /*
  * shmem_unuse() search for an eventually swapped out shmem page.
  */
-int shmem_unuse(swp_entry_t entry, struct page *page)
+int shmem_unuse(swp_entry_t swap, struct page *page)
 {
-	struct list_head *p, *next;
+	struct list_head *this, *next;
 	struct shmem_inode_info *info;
 	int found = 0;
 	int error;
@@ -432,8 +432,8 @@ int shmem_unuse(swp_entry_t entry, struc
 	radix_tree_preload_end();
 
 	mutex_lock(&shmem_swaplist_mutex);
-	list_for_each_safe(p, next, &shmem_swaplist) {
-		info = list_entry(p, struct shmem_inode_info, swaplist);
+	list_for_each_safe(this, next, &shmem_swaplist) {
+		info = list_entry(this, struct shmem_inode_info, swaplist);
 		if (!info->swapped) {
 			spin_lock(&info->lock);
 			if (!info->swapped)
@@ -441,7 +441,7 @@ int shmem_unuse(swp_entry_t entry, struc
 			spin_unlock(&info->lock);
 		}
 		if (info->swapped)
-			found = shmem_unuse_inode(info, entry, page);
+			found = shmem_unuse_inode(info, swap, page);
 		cond_resched();
 		if (found)
 			break;
@@ -467,7 +467,7 @@ static int shmem_writepage(struct page *
 	struct shmem_inode_info *info;
 	swp_entry_t swap, oswap;
 	struct address_space *mapping;
-	unsigned long index;
+	pgoff_t index;
 	struct inode *inode;
 
 	BUG_ON(!PageLocked(page));
@@ -577,35 +577,33 @@ static struct mempolicy *shmem_get_sbmpo
 }
 #endif /* CONFIG_TMPFS */
 
-static struct page *shmem_swapin(swp_entry_t entry, gfp_t gfp,
-			struct shmem_inode_info *info, unsigned long idx)
+static struct page *shmem_swapin(swp_entry_t swap, gfp_t gfp,
+			struct shmem_inode_info *info, pgoff_t index)
 {
 	struct mempolicy mpol, *spol;
 	struct vm_area_struct pvma;
-	struct page *page;
 
 	spol = mpol_cond_copy(&mpol,
-				mpol_shared_policy_lookup(&info->policy, idx));
+			mpol_shared_policy_lookup(&info->policy, index));
 
 	/* Create a pseudo vma that just contains the policy */
 	pvma.vm_start = 0;
-	pvma.vm_pgoff = idx;
+	pvma.vm_pgoff = index;
 	pvma.vm_ops = NULL;
 	pvma.vm_policy = spol;
-	page = swapin_readahead(entry, gfp, &pvma, 0);
-	return page;
+	return swapin_readahead(swap, gfp, &pvma, 0);
 }
 
 static struct page *shmem_alloc_page(gfp_t gfp,
-			struct shmem_inode_info *info, unsigned long idx)
+			struct shmem_inode_info *info, pgoff_t index)
 {
 	struct vm_area_struct pvma;
 
 	/* Create a pseudo vma that just contains the policy */
 	pvma.vm_start = 0;
-	pvma.vm_pgoff = idx;
+	pvma.vm_pgoff = index;
 	pvma.vm_ops = NULL;
-	pvma.vm_policy = mpol_shared_policy_lookup(&info->policy, idx);
+	pvma.vm_policy = mpol_shared_policy_lookup(&info->policy, index);
 
 	/*
 	 * alloc_page_vma() will drop the shared policy reference
@@ -614,19 +612,19 @@ static struct page *shmem_alloc_page(gfp
 }
 #else /* !CONFIG_NUMA */
 #ifdef CONFIG_TMPFS
-static inline void shmem_show_mpol(struct seq_file *seq, struct mempolicy *p)
+static inline void shmem_show_mpol(struct seq_file *seq, struct mempolicy *mpol)
 {
 }
 #endif /* CONFIG_TMPFS */
 
-static inline struct page *shmem_swapin(swp_entry_t entry, gfp_t gfp,
-			struct shmem_inode_info *info, unsigned long idx)
+static inline struct page *shmem_swapin(swp_entry_t swap, gfp_t gfp,
+			struct shmem_inode_info *info, pgoff_t index)
 {
-	return swapin_readahead(entry, gfp, NULL, 0);
+	return swapin_readahead(swap, gfp, NULL, 0);
 }
 
 static inline struct page *shmem_alloc_page(gfp_t gfp,
-			struct shmem_inode_info *info, unsigned long idx)
+			struct shmem_inode_info *info, pgoff_t index)
 {
 	return alloc_page(gfp);
 }
@@ -646,7 +644,7 @@ static inline struct mempolicy *shmem_ge
  * vm. If we swap it in we mark it dirty since we also free the swap
  * entry since a page cannot live in both the swap and page cache
  */
-static int shmem_getpage_gfp(struct inode *inode, pgoff_t idx,
+static int shmem_getpage_gfp(struct inode *inode, pgoff_t index,
 	struct page **pagep, enum sgp_type sgp, gfp_t gfp, int *fault_type)
 {
 	struct address_space *mapping = inode->i_mapping;
@@ -657,10 +655,10 @@ static int shmem_getpage_gfp(struct inod
 	swp_entry_t swap;
 	int error;
 
-	if (idx > (MAX_LFS_FILESIZE >> PAGE_CACHE_SHIFT))
+	if (index > (MAX_LFS_FILESIZE >> PAGE_CACHE_SHIFT))
 		return -EFBIG;
 repeat:
-	page = find_lock_page(mapping, idx);
+	page = find_lock_page(mapping, index);
 	if (page) {
 		/*
 		 * Once we can get the page lock, it must be uptodate:
@@ -681,7 +679,7 @@ repeat:
 	radix_tree_preload_end();
 
 	if (sgp != SGP_READ && !prealloc_page) {
-		prealloc_page = shmem_alloc_page(gfp, info, idx);
+		prealloc_page = shmem_alloc_page(gfp, info, index);
 		if (prealloc_page) {
 			SetPageSwapBacked(prealloc_page);
 			if (mem_cgroup_cache_charge(prealloc_page,
@@ -694,7 +692,7 @@ repeat:
 
 	spin_lock(&info->lock);
 	shmem_recalc_inode(inode);
-	swap = shmem_get_swap(info, idx);
+	swap = shmem_get_swap(info, index);
 	if (swap.val) {
 		/* Look it up and read it in.. */
 		page = lookup_swap_cache(swap);
@@ -703,9 +701,9 @@ repeat:
 			/* here we actually do the io */
 			if (fault_type)
 				*fault_type |= VM_FAULT_MAJOR;
-			page = shmem_swapin(swap, gfp, info, idx);
+			page = shmem_swapin(swap, gfp, info, index);
 			if (!page) {
-				swp_entry_t nswap = shmem_get_swap(info, idx);
+				swp_entry_t nswap = shmem_get_swap(info, index);
 				if (nswap.val == swap.val) {
 					error = -ENOMEM;
 					goto out;
@@ -740,7 +738,7 @@ repeat:
 		}
 
 		error = add_to_page_cache_locked(page, mapping,
-						 idx, GFP_NOWAIT);
+						 index, GFP_NOWAIT);
 		if (error) {
 			spin_unlock(&info->lock);
 			if (error == -ENOMEM) {
@@ -762,14 +760,14 @@ repeat:
 		}
 
 		delete_from_swap_cache(page);
-		shmem_put_swap(info, idx, (swp_entry_t){0});
+		shmem_put_swap(info, index, (swp_entry_t){0});
 		info->swapped--;
 		spin_unlock(&info->lock);
 		set_page_dirty(page);
 		swap_free(swap);
 
 	} else if (sgp == SGP_READ) {
-		page = find_get_page(mapping, idx);
+		page = find_get_page(mapping, index);
 		if (page && !trylock_page(page)) {
 			spin_unlock(&info->lock);
 			wait_on_page_locked(page);
@@ -793,12 +791,12 @@ repeat:
 		page = prealloc_page;
 		prealloc_page = NULL;
 
-		swap = shmem_get_swap(info, idx);
+		swap = shmem_get_swap(info, index);
 		if (swap.val)
 			mem_cgroup_uncharge_cache_page(page);
 		else
 			error = add_to_page_cache_lru(page, mapping,
-						idx, GFP_NOWAIT);
+						index, GFP_NOWAIT);
 		/*
 		 * At add_to_page_cache_lru() failure,
 		 * uncharge will be done automatically.
@@ -841,7 +839,7 @@ nospace:
 	 * but must also avoid reporting a spurious ENOSPC while working on a
 	 * full tmpfs.
 	 */
-	page = find_get_page(mapping, idx);
+	page = find_get_page(mapping, index);
 	spin_unlock(&info->lock);
 	if (page) {
 		page_cache_release(page);
@@ -872,20 +870,20 @@ static int shmem_fault(struct vm_area_st
 }
 
 #ifdef CONFIG_NUMA
-static int shmem_set_policy(struct vm_area_struct *vma, struct mempolicy *new)
+static int shmem_set_policy(struct vm_area_struct *vma, struct mempolicy *mpol)
 {
-	struct inode *i = vma->vm_file->f_path.dentry->d_inode;
-	return mpol_set_shared_policy(&SHMEM_I(i)->policy, vma, new);
+	struct inode *inode = vma->vm_file->f_path.dentry->d_inode;
+	return mpol_set_shared_policy(&SHMEM_I(inode)->policy, vma, mpol);
 }
 
 static struct mempolicy *shmem_get_policy(struct vm_area_struct *vma,
 					  unsigned long addr)
 {
-	struct inode *i = vma->vm_file->f_path.dentry->d_inode;
-	unsigned long idx;
+	struct inode *inode = vma->vm_file->f_path.dentry->d_inode;
+	pgoff_t index;
 
-	idx = ((addr - vma->vm_start) >> PAGE_SHIFT) + vma->vm_pgoff;
-	return mpol_shared_policy_lookup(&SHMEM_I(i)->policy, idx);
+	index = ((addr - vma->vm_start) >> PAGE_SHIFT) + vma->vm_pgoff;
+	return mpol_shared_policy_lookup(&SHMEM_I(inode)->policy, index);
 }
 #endif
 
@@ -1016,7 +1014,8 @@ static void do_shmem_file_read(struct fi
 {
 	struct inode *inode = filp->f_path.dentry->d_inode;
 	struct address_space *mapping = inode->i_mapping;
-	unsigned long index, offset;
+	pgoff_t index;
+	unsigned long offset;
 	enum sgp_type sgp = SGP_READ;
 
 	/*
@@ -1032,7 +1031,8 @@ static void do_shmem_file_read(struct fi
 
 	for (;;) {
 		struct page *page = NULL;
-		unsigned long end_index, nr, ret;
+		pgoff_t end_index;
+		unsigned long nr, ret;
 		loff_t i_size = i_size_read(inode);
 
 		end_index = i_size >> PAGE_CACHE_SHIFT;
@@ -1270,8 +1270,9 @@ static int shmem_statfs(struct dentry *d
 	buf->f_namelen = NAME_MAX;
 	if (sbinfo->max_blocks) {
 		buf->f_blocks = sbinfo->max_blocks;
-		buf->f_bavail = buf->f_bfree =
-				sbinfo->max_blocks - percpu_counter_sum(&sbinfo->used_blocks);
+		buf->f_bavail =
+		buf->f_bfree  = sbinfo->max_blocks -
+				percpu_counter_sum(&sbinfo->used_blocks);
 	}
 	if (sbinfo->max_inodes) {
 		buf->f_files = sbinfo->max_inodes;
@@ -1480,8 +1481,8 @@ static void *shmem_follow_link_inline(st
 static void *shmem_follow_link(struct dentry *dentry, struct nameidata *nd)
 {
 	struct page *page = NULL;
-	int res = shmem_getpage(dentry->d_inode, 0, &page, SGP_READ, NULL);
-	nd_set_link(nd, res ? ERR_PTR(res) : kmap(page));
+	int error = shmem_getpage(dentry->d_inode, 0, &page, SGP_READ, NULL);
+	nd_set_link(nd, error ? ERR_PTR(error) : kmap(page));
 	if (page)
 		unlock_page(page);
 	return page;
@@ -1592,7 +1593,6 @@ out:
 	return err;
 }
 
-
 static const struct xattr_handler *shmem_xattr_handlers[] = {
 #ifdef CONFIG_TMPFS_POSIX_ACL
 	&generic_acl_access_handler,
@@ -2052,14 +2052,14 @@ static struct kmem_cache *shmem_inode_ca
 
 static struct inode *shmem_alloc_inode(struct super_block *sb)
 {
-	struct shmem_inode_info *p;
-	p = (struct shmem_inode_info *)kmem_cache_alloc(shmem_inode_cachep, GFP_KERNEL);
-	if (!p)
+	struct shmem_inode_info *info;
+	info = kmem_cache_alloc(shmem_inode_cachep, GFP_KERNEL);
+	if (!info)
 		return NULL;
-	return &p->vfs_inode;
+	return &info->vfs_inode;
 }
 
-static void shmem_i_callback(struct rcu_head *head)
+static void shmem_destroy_callback(struct rcu_head *head)
 {
 	struct inode *inode = container_of(head, struct inode, i_rcu);
 	INIT_LIST_HEAD(&inode->i_dentry);
@@ -2072,25 +2072,24 @@ static void shmem_destroy_inode(struct i
 		/* only struct inode is valid if it's an inline symlink */
 		mpol_free_shared_policy(&SHMEM_I(inode)->policy);
 	}
-	call_rcu(&inode->i_rcu, shmem_i_callback);
+	call_rcu(&inode->i_rcu, shmem_destroy_callback);
 }
 
-static void init_once(void *foo)
+static void shmem_init_inode(void *foo)
 {
-	struct shmem_inode_info *p = (struct shmem_inode_info *) foo;
-
-	inode_init_once(&p->vfs_inode);
+	struct shmem_inode_info *info = foo;
+	inode_init_once(&info->vfs_inode);
 }
 
-static int init_inodecache(void)
+static int shmem_init_inodecache(void)
 {
 	shmem_inode_cachep = kmem_cache_create("shmem_inode_cache",
 				sizeof(struct shmem_inode_info),
-				0, SLAB_PANIC, init_once);
+				0, SLAB_PANIC, shmem_init_inode);
 	return 0;
 }
 
-static void destroy_inodecache(void)
+static void shmem_destroy_inodecache(void)
 {
 	kmem_cache_destroy(shmem_inode_cachep);
 }
@@ -2193,21 +2192,20 @@ static const struct vm_operations_struct
 #endif
 };
 
-
 static struct dentry *shmem_mount(struct file_system_type *fs_type,
 	int flags, const char *dev_name, void *data)
 {
 	return mount_nodev(fs_type, flags, data, shmem_fill_super);
 }
 
-static struct file_system_type tmpfs_fs_type = {
+static struct file_system_type shmem_fs_type = {
 	.owner		= THIS_MODULE,
 	.name		= "tmpfs",
 	.mount		= shmem_mount,
 	.kill_sb	= kill_litter_super,
 };
 
-int __init init_tmpfs(void)
+int __init shmem_init(void)
 {
 	int error;
 
@@ -2215,18 +2213,18 @@ int __init init_tmpfs(void)
 	if (error)
 		goto out4;
 
-	error = init_inodecache();
+	error = shmem_init_inodecache();
 	if (error)
 		goto out3;
 
-	error = register_filesystem(&tmpfs_fs_type);
+	error = register_filesystem(&shmem_fs_type);
 	if (error) {
 		printk(KERN_ERR "Could not register tmpfs\n");
 		goto out2;
 	}
 
-	shm_mnt = vfs_kern_mount(&tmpfs_fs_type, MS_NOUSER,
-				tmpfs_fs_type.name, NULL);
+	shm_mnt = vfs_kern_mount(&shmem_fs_type, MS_NOUSER,
+				 shmem_fs_type.name, NULL);
 	if (IS_ERR(shm_mnt)) {
 		error = PTR_ERR(shm_mnt);
 		printk(KERN_ERR "Could not kern_mount tmpfs\n");
@@ -2235,9 +2233,9 @@ int __init init_tmpfs(void)
 	return 0;
 
 out1:
-	unregister_filesystem(&tmpfs_fs_type);
+	unregister_filesystem(&shmem_fs_type);
 out2:
-	destroy_inodecache();
+	shmem_destroy_inodecache();
 out3:
 	bdi_destroy(&shmem_backing_dev_info);
 out4:
@@ -2247,37 +2245,37 @@ out4:
 
 #ifdef CONFIG_CGROUP_MEM_RES_CTLR
 /**
- * mem_cgroup_get_shmem_target - find a page or entry assigned to the shmem file
+ * mem_cgroup_get_shmem_target - find page or swap assigned to the shmem file
  * @inode: the inode to be searched
- * @pgoff: the offset to be searched
+ * @index: the page offset to be searched
  * @pagep: the pointer for the found page to be stored
- * @ent: the pointer for the found swap entry to be stored
+ * @swapp: the pointer for the found swap entry to be stored
  *
  * If a page is found, refcount of it is incremented. Callers should handle
  * these refcount.
  */
-void mem_cgroup_get_shmem_target(struct inode *inode, pgoff_t pgoff,
-					struct page **pagep, swp_entry_t *ent)
+void mem_cgroup_get_shmem_target(struct inode *inode, pgoff_t index,
+				 struct page **pagep, swp_entry_t *swapp)
 {
-	swp_entry_t entry = { .val = 0 };
-	struct page *page = NULL;
 	struct shmem_inode_info *info = SHMEM_I(inode);
+	struct page *page = NULL;
+	swp_entry_t swap = {0};
 
-	if ((pgoff << PAGE_CACHE_SHIFT) >= i_size_read(inode))
+	if ((index << PAGE_CACHE_SHIFT) >= i_size_read(inode))
 		goto out;
 
 	spin_lock(&info->lock);
 #ifdef CONFIG_SWAP
-	entry = shmem_get_swap(info, pgoff);
-	if (entry.val)
-		page = find_get_page(&swapper_space, entry.val);
+	swap = shmem_get_swap(info, index);
+	if (swap.val)
+		page = find_get_page(&swapper_space, swap.val);
 	else
 #endif
-		page = find_get_page(inode->i_mapping, pgoff);
+		page = find_get_page(inode->i_mapping, index);
 	spin_unlock(&info->lock);
 out:
 	*pagep = page;
-	*ent = entry;
+	*swapp = swap;
 }
 #endif
 
@@ -2294,23 +2292,23 @@ out:
 
 #include <linux/ramfs.h>
 
-static struct file_system_type tmpfs_fs_type = {
+static struct file_system_type shmem_fs_type = {
 	.name		= "tmpfs",
 	.mount		= ramfs_mount,
 	.kill_sb	= kill_litter_super,
 };
 
-int __init init_tmpfs(void)
+int __init shmem_init(void)
 {
-	BUG_ON(register_filesystem(&tmpfs_fs_type) != 0);
+	BUG_ON(register_filesystem(&shmem_fs_type) != 0);
 
-	shm_mnt = kern_mount(&tmpfs_fs_type);
+	shm_mnt = kern_mount(&shmem_fs_type);
 	BUG_ON(IS_ERR(shm_mnt));
 
 	return 0;
 }
 
-int shmem_unuse(swp_entry_t entry, struct page *page)
+int shmem_unuse(swp_entry_t swap, struct page *page)
 {
 	return 0;
 }
@@ -2320,34 +2318,34 @@ int shmem_lock(struct file *file, int lo
 	return 0;
 }
 
-void shmem_truncate_range(struct inode *inode, loff_t start, loff_t end)
+void shmem_truncate_range(struct inode *inode, loff_t lstart, loff_t lend)
 {
-	truncate_inode_pages_range(inode->i_mapping, start, end);
+	truncate_inode_pages_range(inode->i_mapping, lstart, lend);
 }
 EXPORT_SYMBOL_GPL(shmem_truncate_range);
 
 #ifdef CONFIG_CGROUP_MEM_RES_CTLR
 /**
- * mem_cgroup_get_shmem_target - find a page or entry assigned to the shmem file
+ * mem_cgroup_get_shmem_target - find page or swap assigned to the shmem file
  * @inode: the inode to be searched
- * @pgoff: the offset to be searched
+ * @index: the page offset to be searched
  * @pagep: the pointer for the found page to be stored
- * @ent: the pointer for the found swap entry to be stored
+ * @swapp: the pointer for the found swap entry to be stored
  *
  * If a page is found, refcount of it is incremented. Callers should handle
  * these refcount.
  */
-void mem_cgroup_get_shmem_target(struct inode *inode, pgoff_t pgoff,
-					struct page **pagep, swp_entry_t *ent)
+void mem_cgroup_get_shmem_target(struct inode *inode, pgoff_t index,
+				 struct page **pagep, swp_entry_t *swapp)
 {
 	struct page *page = NULL;
 
-	if ((pgoff << PAGE_CACHE_SHIFT) >= i_size_read(inode))
+	if ((index << PAGE_CACHE_SHIFT) >= i_size_read(inode))
 		goto out;
-	page = find_get_page(inode->i_mapping, pgoff);
+	page = find_get_page(inode->i_mapping, index);
 out:
 	*pagep = page;
-	*ent = (swp_entry_t){ .val = 0 };
+	*swapp = (swp_entry_t){0};
 }
 #endif
 

^ permalink raw reply	[flat|nested] 71+ messages in thread

* [PATCH 5/12] tmpfs: copy truncate_inode_pages_range
  2011-06-14 10:40 ` Hugh Dickins
@ 2011-06-14 10:49   ` Hugh Dickins
  -1 siblings, 0 replies; 71+ messages in thread
From: Hugh Dickins @ 2011-06-14 10:49 UTC (permalink / raw)
  To: Andrew Morton; +Cc: linux-kernel, linux-mm

Bring truncate.c's code for truncate_inode_pages_range() inline into
shmem_truncate_range(), replacing its first call (there's a followup
call below, but leave that one, it will disappear next).

Don't play with it yet, apart from leaving out the cleancache flush,
and (importantly) the nrpages == 0 skip, and moving shmem_setattr()'s
partial page preparation into its partial page handling.
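
As a side note for readers (this snippet is not part of the patch): the
partial-page boundary arithmetic used below can be checked in isolation.
Assuming 4096-byte pages and a hypothetical truncation starting at byte
offset lstart = 10000, a standalone userspace sketch of the same
calculation is:

	#include <stdio.h>

	#define PAGE_CACHE_SHIFT 12			/* assumed 4096-byte pages */
	#define PAGE_CACHE_SIZE  (1UL << PAGE_CACHE_SHIFT)

	int main(void)
	{
		unsigned long lstart = 10000;		/* hypothetical truncation offset */
		unsigned long start, partial;

		/* same expressions as in shmem_truncate_range() below */
		start = (lstart + PAGE_CACHE_SIZE - 1) >> PAGE_CACHE_SHIFT;
		partial = lstart & (PAGE_CACHE_SIZE - 1);

		/*
		 * Prints start=3 partial=1808: whole pages from index 3 on
		 * are truncated, and bytes 1808..4095 of page 2 are zeroed
		 * by the partial page handling (zero_user_segment).
		 */
		printf("start=%lu partial=%lu\n", start, partial);
		return 0;
	}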

Signed-off-by: Hugh Dickins <hughd@google.com>
---
 mm/shmem.c |   99 ++++++++++++++++++++++++++++++++++++++++-----------
 1 file changed, 79 insertions(+), 20 deletions(-)

--- linux.orig/mm/shmem.c	2011-06-13 13:28:25.822786909 -0700
+++ linux/mm/shmem.c	2011-06-13 13:28:44.330878656 -0700
@@ -50,6 +50,7 @@ static struct vfsmount *shm_mnt;
 #include <linux/shmem_fs.h>
 #include <linux/writeback.h>
 #include <linux/blkdev.h>
+#include <linux/pagevec.h>
 #include <linux/percpu_counter.h>
 #include <linux/splice.h>
 #include <linux/security.h>
@@ -242,11 +243,88 @@ void shmem_truncate_range(struct inode *
 	struct address_space *mapping = inode->i_mapping;
 	struct shmem_inode_info *info = SHMEM_I(inode);
 	pgoff_t start = (lstart + PAGE_CACHE_SIZE - 1) >> PAGE_CACHE_SHIFT;
+	unsigned partial = lstart & (PAGE_CACHE_SIZE - 1);
 	pgoff_t end = (lend >> PAGE_CACHE_SHIFT);
+	struct pagevec pvec;
 	pgoff_t index;
 	swp_entry_t swap;
+	int i;
 
-	truncate_inode_pages_range(mapping, lstart, lend);
+	BUG_ON((lend & (PAGE_CACHE_SIZE - 1)) != (PAGE_CACHE_SIZE - 1));
+
+	pagevec_init(&pvec, 0);
+	index = start;
+	while (index <= end && pagevec_lookup(&pvec, mapping, index,
+			min(end - index, (pgoff_t)PAGEVEC_SIZE - 1) + 1)) {
+		mem_cgroup_uncharge_start();
+		for (i = 0; i < pagevec_count(&pvec); i++) {
+			struct page *page = pvec.pages[i];
+
+			/* We rely upon deletion not changing page->index */
+			index = page->index;
+			if (index > end)
+				break;
+
+			if (!trylock_page(page))
+				continue;
+			WARN_ON(page->index != index);
+			if (PageWriteback(page)) {
+				unlock_page(page);
+				continue;
+			}
+			truncate_inode_page(mapping, page);
+			unlock_page(page);
+		}
+		pagevec_release(&pvec);
+		mem_cgroup_uncharge_end();
+		cond_resched();
+		index++;
+	}
+
+	if (partial) {
+		struct page *page = NULL;
+		shmem_getpage(inode, start - 1, &page, SGP_READ, NULL);
+		if (page) {
+			zero_user_segment(page, partial, PAGE_CACHE_SIZE);
+			set_page_dirty(page);
+			unlock_page(page);
+			page_cache_release(page);
+		}
+	}
+
+	index = start;
+	for ( ; ; ) {
+		cond_resched();
+		if (!pagevec_lookup(&pvec, mapping, index,
+			min(end - index, (pgoff_t)PAGEVEC_SIZE - 1) + 1)) {
+			if (index == start)
+				break;
+			index = start;
+			continue;
+		}
+		if (index == start && pvec.pages[0]->index > end) {
+			pagevec_release(&pvec);
+			break;
+		}
+		mem_cgroup_uncharge_start();
+		for (i = 0; i < pagevec_count(&pvec); i++) {
+			struct page *page = pvec.pages[i];
+
+			/* We rely upon deletion not changing page->index */
+			index = page->index;
+			if (index > end)
+				break;
+
+			lock_page(page);
+			WARN_ON(page->index != index);
+			wait_on_page_writeback(page);
+			truncate_inode_page(mapping, page);
+			unlock_page(page);
+		}
+		pagevec_release(&pvec);
+		mem_cgroup_uncharge_end();
+		index++;
+	}
 
 	if (end > SHMEM_NR_DIRECT)
 		end = SHMEM_NR_DIRECT;
@@ -289,24 +367,7 @@ static int shmem_setattr(struct dentry *
 	if (S_ISREG(inode->i_mode) && (attr->ia_valid & ATTR_SIZE)) {
 		loff_t oldsize = inode->i_size;
 		loff_t newsize = attr->ia_size;
-		struct page *page = NULL;
 
-		if (newsize < oldsize) {
-			/*
-			 * If truncating down to a partial page, then
-			 * if that page is already allocated, hold it
-			 * in memory until the truncation is over, so
-			 * truncate_partial_page cannot miss it were
-			 * it assigned to swap.
-			 */
-			if (newsize & (PAGE_CACHE_SIZE-1)) {
-				(void) shmem_getpage(inode,
-					newsize >> PAGE_CACHE_SHIFT,
-						&page, SGP_READ, NULL);
-				if (page)
-					unlock_page(page);
-			}
-		}
 		if (newsize != oldsize) {
 			i_size_write(inode, newsize);
 			inode->i_ctime = inode->i_mtime = CURRENT_TIME;
@@ -318,8 +379,6 @@ static int shmem_setattr(struct dentry *
 			/* unmap again to remove racily COWed private pages */
 			unmap_mapping_range(inode->i_mapping, holebegin, 0, 1);
 		}
-		if (page)
-			page_cache_release(page);
 	}
 
 	setattr_copy(inode, attr);

^ permalink raw reply	[flat|nested] 71+ messages in thread

* [PATCH 6/12] tmpfs: convert shmem_truncate_range to radix-swap
  2011-06-14 10:40 ` Hugh Dickins
@ 2011-06-14 10:51   ` Hugh Dickins
  -1 siblings, 0 replies; 71+ messages in thread
From: Hugh Dickins @ 2011-06-14 10:51 UTC (permalink / raw)
  To: Andrew Morton; +Cc: linux-kernel, linux-mm

Disable the toy swapping implementation in shmem_writepage() - it's
hard to support two schemes at once - and convert shmem_truncate_range()
to a lockless gang lookup of swap entries along with pages, freeing both.

Since the second loop tightens its noose until all entries of either
kind have been squeezed out (and we shall make sure that there's not
an instant when neither is visible), there is no longer a need for
yet another pass below.

shmem_radix_tree_replace() compensates for the lockless lookup by
checking that the expected entry is in place, under lock, before
replacing it.  Here it just deletes, but will be used in later
patches to substitute swap entry for page or page for swap entry.
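
For anyone new to the exceptional-entry trick this series relies on, a
sketch only, not part of the patch: swap entries can share the radix
tree with struct page pointers because page pointers are word-aligned,
leaving low bits free for a tag.  The real encoding is defined earlier
in the series (the radix-tree and swapops patches); the shift and tag
value below are assumptions, purely for illustration.

	#include <assert.h>
	#include <stdint.h>
	#include <stdio.h>

	#define EXCEPTIONAL_TAG		0x2UL	/* assumed low-bit tag; real pointers never have it set */
	#define EXCEPTIONAL_SHIFT	2

	static void *swp_to_entry(uintptr_t swp_val)
	{
		return (void *)((swp_val << EXCEPTIONAL_SHIFT) | EXCEPTIONAL_TAG);
	}

	static int entry_is_swap(void *entry)
	{
		return ((uintptr_t)entry & EXCEPTIONAL_TAG) != 0;
	}

	static uintptr_t entry_to_swp(void *entry)
	{
		return (uintptr_t)entry >> EXCEPTIONAL_SHIFT;
	}

	int main(void)
	{
		void *entry = swp_to_entry(0x1234);

		/* a gang lookup classifies this as exceptional, not a page */
		assert(entry_is_swap(entry));
		printf("swap value recovered: %#lx\n",
		       (unsigned long)entry_to_swp(entry));
		return 0;
	}

shmem_find_get_pages_and_swap() below makes exactly this page-or-swap
distinction with radix_tree_exceptional_entry(), and shmem_free_swap()
converts the entry back with radix_to_swp_entry().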

Signed-off-by: Hugh Dickins <hughd@google.com>
---
 mm/shmem.c |  192 ++++++++++++++++++++++++++++++++++++++-------------
 1 file changed, 146 insertions(+), 46 deletions(-)

--- linux.orig/mm/shmem.c	2011-06-13 13:28:44.330878656 -0700
+++ linux/mm/shmem.c	2011-06-13 13:29:36.311136453 -0700
@@ -238,6 +238,111 @@ static swp_entry_t shmem_get_swap(struct
 		info->i_direct[index] : (swp_entry_t){0};
 }
 
+/*
+ * Replace item expected in radix tree by a new item, while holding tree lock.
+ */
+static int shmem_radix_tree_replace(struct address_space *mapping,
+			pgoff_t index, void *expected, void *replacement)
+{
+	void **pslot;
+	void *item = NULL;
+
+	VM_BUG_ON(!expected);
+	pslot = radix_tree_lookup_slot(&mapping->page_tree, index);
+	if (pslot)
+		item = radix_tree_deref_slot_protected(pslot,
+							&mapping->tree_lock);
+	if (item != expected)
+		return -ENOENT;
+	if (replacement)
+		radix_tree_replace_slot(pslot, replacement);
+	else
+		radix_tree_delete(&mapping->page_tree, index);
+	return 0;
+}
+
+/*
+ * Like find_get_pages, but collecting swap entries as well as pages.
+ */
+static unsigned shmem_find_get_pages_and_swap(struct address_space *mapping,
+					pgoff_t start, unsigned int nr_pages,
+					struct page **pages, pgoff_t *indices)
+{
+	unsigned int i;
+	unsigned int ret;
+	unsigned int nr_found;
+
+	rcu_read_lock();
+restart:
+	nr_found = radix_tree_gang_lookup_slot(&mapping->page_tree,
+				(void ***)pages, indices, start, nr_pages);
+	ret = 0;
+	for (i = 0; i < nr_found; i++) {
+		struct page *page;
+repeat:
+		page = radix_tree_deref_slot((void **)pages[i]);
+		if (unlikely(!page))
+			continue;
+		if (radix_tree_exception(page)) {
+			if (radix_tree_exceptional_entry(page))
+				goto export;
+			/* radix_tree_deref_retry(page) */
+			goto restart;
+		}
+		if (!page_cache_get_speculative(page))
+			goto repeat;
+
+		/* Has the page moved? */
+		if (unlikely(page != *((void **)pages[i]))) {
+			page_cache_release(page);
+			goto repeat;
+		}
+export:
+		indices[ret] = indices[i];
+		pages[ret] = page;
+		ret++;
+	}
+	if (unlikely(!ret && nr_found))
+		goto restart;
+	rcu_read_unlock();
+	return ret;
+}
+
+/*
+ * Remove swap entry from radix tree, free the swap and its page cache.
+ */
+static int shmem_free_swap(struct address_space *mapping,
+			   pgoff_t index, void *radswap)
+{
+	int error;
+
+	spin_lock_irq(&mapping->tree_lock);
+	error = shmem_radix_tree_replace(mapping, index, radswap, NULL);
+	spin_unlock_irq(&mapping->tree_lock);
+	if (!error)
+		free_swap_and_cache(radix_to_swp_entry(radswap));
+	return error;
+}
+
+/*
+ * Pagevec may contain swap entries, so shuffle up pages before releasing.
+ */
+static void shmem_pagevec_release(struct pagevec *pvec)
+{
+	int i, j;
+
+	for (i = 0, j = 0; i < pagevec_count(pvec); i++) {
+		struct page *page = pvec->pages[i];
+		if (!radix_tree_exceptional_entry(page))
+			pvec->pages[j++] = page;
+	}
+	pvec->nr = j;
+	pagevec_release(pvec);
+}
+
+/*
+ * Remove range of pages and swap entries from radix tree, and free them.
+ */
 void shmem_truncate_range(struct inode *inode, loff_t lstart, loff_t lend)
 {
 	struct address_space *mapping = inode->i_mapping;
@@ -246,36 +351,44 @@ void shmem_truncate_range(struct inode *
 	unsigned partial = lstart & (PAGE_CACHE_SIZE - 1);
 	pgoff_t end = (lend >> PAGE_CACHE_SHIFT);
 	struct pagevec pvec;
+	pgoff_t indices[PAGEVEC_SIZE];
+	long nr_swaps_freed = 0;
 	pgoff_t index;
-	swp_entry_t swap;
 	int i;
 
 	BUG_ON((lend & (PAGE_CACHE_SIZE - 1)) != (PAGE_CACHE_SIZE - 1));
 
 	pagevec_init(&pvec, 0);
 	index = start;
-	while (index <= end && pagevec_lookup(&pvec, mapping, index,
-			min(end - index, (pgoff_t)PAGEVEC_SIZE - 1) + 1)) {
+	while (index <= end) {
+		pvec.nr = shmem_find_get_pages_and_swap(mapping, index,
+			min(end - index, (pgoff_t)PAGEVEC_SIZE - 1) + 1,
+							pvec.pages, indices);
+		if (!pvec.nr)
+			break;
 		mem_cgroup_uncharge_start();
 		for (i = 0; i < pagevec_count(&pvec); i++) {
 			struct page *page = pvec.pages[i];
 
-			/* We rely upon deletion not changing page->index */
-			index = page->index;
+			index = indices[i];
 			if (index > end)
 				break;
 
-			if (!trylock_page(page))
+			if (radix_tree_exceptional_entry(page)) {
+				nr_swaps_freed += !shmem_free_swap(mapping,
+								index, page);
 				continue;
-			WARN_ON(page->index != index);
-			if (PageWriteback(page)) {
-				unlock_page(page);
+			}
+
+			if (!trylock_page(page))
 				continue;
+			if (page->mapping == mapping) {
+				VM_BUG_ON(PageWriteback(page));
+				truncate_inode_page(mapping, page);
 			}
-			truncate_inode_page(mapping, page);
 			unlock_page(page);
 		}
-		pagevec_release(&pvec);
+		shmem_pagevec_release(&pvec);
 		mem_cgroup_uncharge_end();
 		cond_resched();
 		index++;
@@ -295,59 +408,47 @@ void shmem_truncate_range(struct inode *
 	index = start;
 	for ( ; ; ) {
 		cond_resched();
-		if (!pagevec_lookup(&pvec, mapping, index,
-			min(end - index, (pgoff_t)PAGEVEC_SIZE - 1) + 1)) {
+		pvec.nr = shmem_find_get_pages_and_swap(mapping, index,
+			min(end - index, (pgoff_t)PAGEVEC_SIZE - 1) + 1,
+							pvec.pages, indices);
+		if (!pvec.nr) {
 			if (index == start)
 				break;
 			index = start;
 			continue;
 		}
-		if (index == start && pvec.pages[0]->index > end) {
-			pagevec_release(&pvec);
+		if (index == start && indices[0] > end) {
+			shmem_pagevec_release(&pvec);
 			break;
 		}
 		mem_cgroup_uncharge_start();
 		for (i = 0; i < pagevec_count(&pvec); i++) {
 			struct page *page = pvec.pages[i];
 
-			/* We rely upon deletion not changing page->index */
-			index = page->index;
+			index = indices[i];
 			if (index > end)
 				break;
 
+			if (radix_tree_exceptional_entry(page)) {
+				nr_swaps_freed += !shmem_free_swap(mapping,
+								index, page);
+				continue;
+			}
+
 			lock_page(page);
-			WARN_ON(page->index != index);
-			wait_on_page_writeback(page);
-			truncate_inode_page(mapping, page);
+			if (page->mapping == mapping) {
+				VM_BUG_ON(PageWriteback(page));
+				truncate_inode_page(mapping, page);
+			}
 			unlock_page(page);
 		}
-		pagevec_release(&pvec);
+		shmem_pagevec_release(&pvec);
 		mem_cgroup_uncharge_end();
 		index++;
 	}
 
-	if (end > SHMEM_NR_DIRECT)
-		end = SHMEM_NR_DIRECT;
-
 	spin_lock(&info->lock);
-	for (index = start; index < end; index++) {
-		swap = shmem_get_swap(info, index);
-		if (swap.val) {
-			free_swap_and_cache(swap);
-			shmem_put_swap(info, index, (swp_entry_t){0});
-			info->swapped--;
-		}
-	}
-
-	if (mapping->nrpages) {
-		spin_unlock(&info->lock);
-		/*
-		 * A page may have meanwhile sneaked in from swap.
-		 */
-		truncate_inode_pages_range(mapping, lstart, lend);
-		spin_lock(&info->lock);
-	}
-
+	info->swapped -= nr_swaps_freed;
 	shmem_recalc_inode(inode);
 	spin_unlock(&info->lock);
 
@@ -552,11 +653,10 @@ static int shmem_writepage(struct page *
 	}
 
 	/*
-	 * Just for this patch, we have a toy implementation,
-	 * which can swap out only the first SHMEM_NR_DIRECT pages:
-	 * for simple demonstration of where we need to think about swap.
+	 * Disable even the toy swapping implementation, while we convert
+	 * functions one by one to having swap entries in the radix tree.
 	 */
-	if (index >= SHMEM_NR_DIRECT)
+	if (index < ULONG_MAX)
 		goto redirty;
 
 	swap = get_swap_page();

^ permalink raw reply	[flat|nested] 71+ messages in thread

* [PATCH 7/12] tmpfs: convert shmem_unuse_inode to radix-swap
  2011-06-14 10:40 ` Hugh Dickins
@ 2011-06-14 10:52   ` Hugh Dickins
  -1 siblings, 0 replies; 71+ messages in thread
From: Hugh Dickins @ 2011-06-14 10:52 UTC (permalink / raw)
  To: Andrew Morton; +Cc: linux-kernel, linux-mm

Convert shmem_unuse_inode() to use a lockless gang lookup of the radix
tree, searching for matching swap.

This is somewhat slower than the old method: partly because of repeated
radix tree descents, partly because of copying entries up, but probably
most because the old method noted once a vector page had been cleared of
swap and then skipped it.  Perhaps we can devise a use of radix tree
tagging to achieve that later.
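
If the gang-lookup pattern is unfamiliar, the search in shmem_find_swap()
below reduces to this standalone sketch.  It is illustrative only:
toy_gang_lookup() is a made-up stand-in for radix_tree_gang_lookup_slot()
over a flat array, and the RCU protection and deref-retry of the real
code are left out.

#include <stdio.h>

#define BATCH	16
#define NSLOTS	1000

static void *slots[NSLOTS];		/* sparse "tree": NULL means empty */

/* Gather up to max non-empty slots at or above start (stand-in for gang lookup). */
static unsigned int toy_gang_lookup(unsigned long start, void **values,
				    unsigned long *indices, unsigned int max)
{
	unsigned int n = 0;
	unsigned long i;

	for (i = start; i < NSLOTS && n < max; i++) {
		if (slots[i]) {
			values[n] = slots[i];
			indices[n] = i;
			n++;
		}
	}
	return n;
}

/* Return index of item, or -1UL if absent: same loop shape as shmem_find_swap(). */
static unsigned long toy_find(void *item)
{
	void *values[BATCH];
	unsigned long indices[BATCH];
	unsigned int nr_found = 1;

	indices[0] = -1UL;
	while (nr_found) {
		/* resume just after the last index seen in the previous batch */
		unsigned long index = indices[nr_found - 1] + 1;
		unsigned int i;

		nr_found = toy_gang_lookup(index, values, indices, BATCH);
		for (i = 0; i < nr_found; i++)
			if (values[i] == item)
				return indices[i];
	}
	return -1UL;
}

int main(void)
{
	static int a, b;

	slots[3] = &a;
	slots[700] = &b;
	printf("found at index %lu\n", toy_find(&b));	/* prints 700 */
	return 0;
}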

shmem_add_to_page_cache() uses shmem_radix_tree_replace() to compensate
for the lockless lookup by checking that the expected entry is in place,
under lock.  It is not very satisfactory to be copying this much from
add_to_page_cache_locked(), but I think easier to sell than insisting
that every caller of add_to_page_cache*() go through the extras.
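
In shape, that check-under-lock amounts to a conditional replace: look the
slot up again with the tree lock held, and only install the new page if the
slot still holds what the lockless lookup returned.  A minimal userspace
model of that contract (a pthread mutex standing in for mapping->tree_lock,
a flat array standing in for the radix tree):

#include <errno.h>
#include <pthread.h>
#include <stddef.h>

#define NSLOTS	1024

static void *slots[NSLOTS];
static pthread_mutex_t tree_lock = PTHREAD_MUTEX_INITIALIZER;

/*
 * Illustrative only: replace slots[index] by replacement if and only if
 * it still holds expected -- the same shape as shmem_radix_tree_replace().
 * A NULL replacement models deletion of the entry.
 */
static int replace_if_expected(unsigned long index, void *expected,
			       void *replacement)
{
	int error = 0;

	pthread_mutex_lock(&tree_lock);
	if (slots[index] != expected)
		error = -ENOENT;	/* raced: someone changed the slot */
	else
		slots[index] = replacement;
	pthread_mutex_unlock(&tree_lock);
	return error;
}

int main(void)
{
	static int old_entry, new_page;
	int ok, raced;

	slots[7] = &old_entry;
	ok = replace_if_expected(7, &old_entry, &new_page);	/* 0: still as expected */
	raced = replace_if_expected(7, &old_entry, NULL);	/* -ENOENT: it changed */
	return !(ok == 0 && raced == -ENOENT);
}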

Signed-off-by: Hugh Dickins <hughd@google.com>
---
 mm/shmem.c |  133 +++++++++++++++++++++++++++++++++++++++++----------
 1 file changed, 107 insertions(+), 26 deletions(-)

--- linux.orig/mm/shmem.c	2011-06-13 13:29:36.311136453 -0700
+++ linux/mm/shmem.c	2011-06-13 13:29:44.087175010 -0700
@@ -262,6 +262,55 @@ static int shmem_radix_tree_replace(stru
 }
 
 /*
+ * Like add_to_page_cache_locked, but error if expected item has gone.
+ */
+static int shmem_add_to_page_cache(struct page *page,
+				   struct address_space *mapping,
+				   pgoff_t index, gfp_t gfp, void *expected)
+{
+	int error;
+
+	VM_BUG_ON(!PageLocked(page));
+	VM_BUG_ON(!PageSwapBacked(page));
+
+	error = mem_cgroup_cache_charge(page, current->mm,
+						gfp & GFP_RECLAIM_MASK);
+	if (error)
+		goto out;
+	if (!expected)
+		error = radix_tree_preload(gfp & GFP_RECLAIM_MASK);
+	if (!error) {
+		page_cache_get(page);
+		page->mapping = mapping;
+		page->index = index;
+
+		spin_lock_irq(&mapping->tree_lock);
+		if (!expected)
+			error = radix_tree_insert(&mapping->page_tree,
+							index, page);
+		else
+			error = shmem_radix_tree_replace(mapping, index,
+							expected, page);
+		if (!error) {
+			mapping->nrpages++;
+			__inc_zone_page_state(page, NR_FILE_PAGES);
+			__inc_zone_page_state(page, NR_SHMEM);
+			spin_unlock_irq(&mapping->tree_lock);
+		} else {
+			page->mapping = NULL;
+			spin_unlock_irq(&mapping->tree_lock);
+			page_cache_release(page);
+		}
+		if (!expected)
+			radix_tree_preload_end();
+	}
+	if (error)
+		mem_cgroup_uncharge_cache_page(page);
+out:
+	return error;
+}
+
+/*
  * Like find_get_pages, but collecting swap entries as well as pages.
  */
 static unsigned shmem_find_get_pages_and_swap(struct address_space *mapping,
@@ -309,6 +358,42 @@ export:
 }
 
 /*
+ * Lockless lookup of swap entry in radix tree, avoiding refcount on pages.
+ */
+static pgoff_t shmem_find_swap(struct address_space *mapping, void *radswap)
+{
+	void  **slots[PAGEVEC_SIZE];
+	pgoff_t indices[PAGEVEC_SIZE];
+	unsigned int nr_found;
+
+restart:
+	nr_found = 1;
+	indices[0] = -1;
+	while (nr_found) {
+		pgoff_t index = indices[nr_found - 1] + 1;
+		unsigned int i;
+
+		rcu_read_lock();
+		nr_found = radix_tree_gang_lookup_slot(&mapping->page_tree,
+					slots, indices, index, PAGEVEC_SIZE);
+		for (i = 0; i < nr_found; i++) {
+			void *item = radix_tree_deref_slot(slots[i]);
+			if (radix_tree_deref_retry(item)) {
+				rcu_read_unlock();
+				goto restart;
+			}
+			if (item == radswap) {
+				rcu_read_unlock();
+				return indices[i];
+			}
+		}
+		rcu_read_unlock();
+		cond_resched();
+	}
+	return -1;
+}
+
+/*
  * Remove swap entry from radix tree, free the swap and its page cache.
  */
 static int shmem_free_swap(struct address_space *mapping,
@@ -515,23 +600,21 @@ static void shmem_evict_inode(struct ino
 	end_writeback(inode);
 }
 
+/*
+ * If swap found in inode, free it and move page from swapcache to filecache.
+ */
 static int shmem_unuse_inode(struct shmem_inode_info *info,
 			     swp_entry_t swap, struct page *page)
 {
 	struct address_space *mapping = info->vfs_inode.i_mapping;
+	void *radswap;
 	pgoff_t index;
 	int error;
 
-	for (index = 0; index < SHMEM_NR_DIRECT; index++)
-		if (shmem_get_swap(info, index).val == swap.val)
-			goto found;
-	return 0;
-found:
-	spin_lock(&info->lock);
-	if (shmem_get_swap(info, index).val != swap.val) {
-		spin_unlock(&info->lock);
+	radswap = swp_to_radix_entry(swap);
+	index = shmem_find_swap(mapping, radswap);
+	if (index == -1)
 		return 0;
-	}
 
 	/*
 	 * Move _head_ to start search for next from here.
@@ -547,23 +630,30 @@ found:
 	 * but also to hold up shmem_evict_inode(): so inode cannot be freed
 	 * beneath us (pagelock doesn't help until the page is in pagecache).
 	 */
-	error = add_to_page_cache_locked(page, mapping, index, GFP_NOWAIT);
+	error = shmem_add_to_page_cache(page, mapping, index,
+						GFP_NOWAIT, radswap);
 	/* which does mem_cgroup_uncharge_cache_page on error */
 
 	if (error != -ENOMEM) {
+		/*
+		 * Truncation and eviction use free_swap_and_cache(), which
+		 * only does trylock page: if we raced, best clean up here.
+		 */
 		delete_from_swap_cache(page);
 		set_page_dirty(page);
-		shmem_put_swap(info, index, (swp_entry_t){0});
-		info->swapped--;
-		swap_free(swap);
+		if (!error) {
+			spin_lock(&info->lock);
+			info->swapped--;
+			spin_unlock(&info->lock);
+			swap_free(swap);
+		}
 		error = 1;	/* not an error, but entry was found */
 	}
-	spin_unlock(&info->lock);
 	return error;
 }
 
 /*
- * shmem_unuse() search for an eventually swapped out shmem page.
+ * Search through swapped inodes to find and replace swap by page.
  */
 int shmem_unuse(swp_entry_t swap, struct page *page)
 {
@@ -576,20 +666,12 @@ int shmem_unuse(swp_entry_t swap, struct
 	 * Charge page using GFP_KERNEL while we can wait, before taking
 	 * the shmem_swaplist_mutex which might hold up shmem_writepage().
 	 * Charged back to the user (not to caller) when swap account is used.
-	 * add_to_page_cache() will be called with GFP_NOWAIT.
+	 * shmem_add_to_page_cache() will be called with GFP_NOWAIT.
 	 */
 	error = mem_cgroup_cache_charge(page, current->mm, GFP_KERNEL);
 	if (error)
 		goto out;
-	/*
-	 * Try to preload while we can wait, to not make a habit of
-	 * draining atomic reserves; but don't latch on to this cpu,
-	 * it's okay if sometimes we get rescheduled after this.
-	 */
-	error = radix_tree_preload(GFP_KERNEL);
-	if (error)
-		goto uncharge;
-	radix_tree_preload_end();
+	/* No radix_tree_preload: swap entry keeps a place for page in tree */
 
 	mutex_lock(&shmem_swaplist_mutex);
 	list_for_each_safe(this, next, &shmem_swaplist) {
@@ -608,7 +690,6 @@ int shmem_unuse(swp_entry_t swap, struct
 	}
 	mutex_unlock(&shmem_swaplist_mutex);
 
-uncharge:
 	if (!found)
 		mem_cgroup_uncharge_cache_page(page);
 	if (found < 0)

^ permalink raw reply	[flat|nested] 71+ messages in thread

* [PATCH 8/12] tmpfs: convert shmem_getpage_gfp to radix-swap
  2011-06-14 10:40 ` Hugh Dickins
@ 2011-06-14 10:53   ` Hugh Dickins
  -1 siblings, 0 replies; 71+ messages in thread
From: Hugh Dickins @ 2011-06-14 10:53 UTC (permalink / raw)
  To: Andrew Morton; +Cc: linux-kernel, linux-mm

Convert shmem_getpage_gfp(), the engine-room of shmem, to expect
page or swap entry returned from radix tree by find_lock_page().
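
The page-or-swap distinction relies on the exceptional-entry convention
used throughout this series: struct page pointers are word-aligned, so a
low tag bit can mark slots that hold a shifted swap value rather than a
pointer.  A standalone sketch of that idea (constants and helpers here are
made up for illustration; the real encoding is whatever
radix_tree_exceptional_entry() and swp_to_radix_entry()/radix_to_swp_entry()
implement):

#include <assert.h>
#include <stdio.h>

#define EXCEPTIONAL_BIT		2UL	/* free because pointers are >= 4-byte aligned */
#define EXCEPTIONAL_SHIFT	2

/* Encode a plain value so it can share a slot type with aligned pointers. */
static void *value_to_entry(unsigned long val)
{
	return (void *)((val << EXCEPTIONAL_SHIFT) | EXCEPTIONAL_BIT);
}

/* Does this slot hold a tagged value rather than a pointer? */
static int entry_is_exceptional(void *entry)
{
	return ((unsigned long)entry & EXCEPTIONAL_BIT) != 0;
}

/* Recover the value from a tagged entry. */
static unsigned long entry_to_value(void *entry)
{
	return (unsigned long)entry >> EXCEPTIONAL_SHIFT;
}

int main(void)
{
	static int page;			/* stands in for a struct page */
	void *entry = value_to_entry(12345);	/* stands in for a swap entry */

	assert(!entry_is_exceptional(&page));
	assert(entry_is_exceptional(entry));
	printf("decoded value: %lu\n", entry_to_value(entry));	/* 12345 */
	return 0;
}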

Whereas the repetitive old method proceeded mainly under info->lock,
dropping and repeating whenever one of the conditions needed was not
met, now we can proceed without it, leaving shmem_add_to_page_cache()
to check for a race.

This way there is no need to preallocate a page, no need for an early
radix_tree_preload(), no need for mem_cgroup_shmem_charge_fallback().

Move the error unwinding down to the bottom instead of repeating it
throughout.  ENOSPC handling is a little different from before: there
is no longer any race between find_lock_page() and finding swap, but
we can arrive at ENOSPC before calling shmem_recalc_inode(), which
might occasionally discover freed space.
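
The unwinding itself is the usual goto ladder: each failure jumps to the
label that releases whatever had been acquired by that point, and the
labels fall through in reverse order of acquisition.  A shape-only sketch
with hypothetical acquire/release helpers, nothing taken from the patch:

#include <stdio.h>

static int acquire_a(void) { return 0; }
static int acquire_b(void) { return 0; }
static int acquire_c(void) { return -1; }	/* pretend the last step fails */
static void release_b(void) { puts("undo b"); }
static void release_a(void) { puts("undo a"); }

static int do_all(void)
{
	int error;

	error = acquire_a();
	if (error)
		goto failed;
	error = acquire_b();
	if (error)
		goto undo_a;
	error = acquire_c();
	if (error)
		goto undo_b;
	return 0;

undo_b:				/* labels undo in reverse order of acquisition */
	release_b();
undo_a:
	release_a();
failed:
	return error;
}

int main(void)
{
	printf("do_all: %d\n", do_all());	/* fails at c, undoes b then a */
	return 0;
}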

Be stricter about checking i_size before returning.  info->lock is now
used for little but the alloced, swapped and i_blocks updates.  Move the
i_blocks updates out from under the max_blocks check, so that even an
unlimited size=0 mount can show accurate du.

Signed-off-by: Hugh Dickins <hughd@google.com>
---
 mm/shmem.c |  259 ++++++++++++++++++++++-----------------------------
 1 file changed, 112 insertions(+), 147 deletions(-)

--- linux.orig/mm/shmem.c	2011-06-13 13:29:44.087175010 -0700
+++ linux/mm/shmem.c	2011-06-13 13:29:55.115229689 -0700
@@ -166,15 +166,6 @@ static struct backing_dev_info shmem_bac
 static LIST_HEAD(shmem_swaplist);
 static DEFINE_MUTEX(shmem_swaplist_mutex);
 
-static void shmem_free_blocks(struct inode *inode, long pages)
-{
-	struct shmem_sb_info *sbinfo = SHMEM_SB(inode->i_sb);
-	if (sbinfo->max_blocks) {
-		percpu_counter_add(&sbinfo->used_blocks, -pages);
-		inode->i_blocks -= pages*BLOCKS_PER_PAGE;
-	}
-}
-
 static int shmem_reserve_inode(struct super_block *sb)
 {
 	struct shmem_sb_info *sbinfo = SHMEM_SB(sb);
@@ -219,9 +210,12 @@ static void shmem_recalc_inode(struct in
 
 	freed = info->alloced - info->swapped - inode->i_mapping->nrpages;
 	if (freed > 0) {
+		struct shmem_sb_info *sbinfo = SHMEM_SB(inode->i_sb);
+		if (sbinfo->max_blocks)
+			percpu_counter_add(&sbinfo->used_blocks, -freed);
 		info->alloced -= freed;
+		inode->i_blocks -= freed * BLOCKS_PER_PAGE;
 		shmem_unacct_blocks(info->flags, freed);
-		shmem_free_blocks(inode, freed);
 	}
 }
 
@@ -888,205 +882,180 @@ static int shmem_getpage_gfp(struct inod
 	struct page **pagep, enum sgp_type sgp, gfp_t gfp, int *fault_type)
 {
 	struct address_space *mapping = inode->i_mapping;
-	struct shmem_inode_info *info = SHMEM_I(inode);
+	struct shmem_inode_info *info;
 	struct shmem_sb_info *sbinfo;
 	struct page *page;
-	struct page *prealloc_page = NULL;
 	swp_entry_t swap;
 	int error;
+	int once = 0;
 
 	if (index > (MAX_LFS_FILESIZE >> PAGE_CACHE_SHIFT))
 		return -EFBIG;
 repeat:
+	swap.val = 0;
 	page = find_lock_page(mapping, index);
-	if (page) {
+	if (radix_tree_exceptional_entry(page)) {
+		swap = radix_to_swp_entry(page);
+		page = NULL;
+	}
+
+	if (sgp != SGP_WRITE &&
+	    ((loff_t)index << PAGE_CACHE_SHIFT) >= i_size_read(inode)) {
+		error = -EINVAL;
+		goto failed;
+	}
+
+	if (page || (sgp == SGP_READ && !swap.val)) {
 		/*
 		 * Once we can get the page lock, it must be uptodate:
 		 * if there were an error in reading back from swap,
 		 * the page would not be inserted into the filecache.
 		 */
-		BUG_ON(!PageUptodate(page));
-		goto done;
+		BUG_ON(page && !PageUptodate(page));
+		*pagep = page;
+		return 0;
 	}
 
 	/*
-	 * Try to preload while we can wait, to not make a habit of
-	 * draining atomic reserves; but don't latch on to this cpu.
+	 * Fast cache lookup did not find it:
+	 * bring it back from swap or allocate.
 	 */
-	error = radix_tree_preload(gfp & GFP_RECLAIM_MASK);
-	if (error)
-		goto out;
-	radix_tree_preload_end();
-
-	if (sgp != SGP_READ && !prealloc_page) {
-		prealloc_page = shmem_alloc_page(gfp, info, index);
-		if (prealloc_page) {
-			SetPageSwapBacked(prealloc_page);
-			if (mem_cgroup_cache_charge(prealloc_page,
-					current->mm, GFP_KERNEL)) {
-				page_cache_release(prealloc_page);
-				prealloc_page = NULL;
-			}
-		}
-	}
+	info = SHMEM_I(inode);
+	sbinfo = SHMEM_SB(inode->i_sb);
 
-	spin_lock(&info->lock);
-	shmem_recalc_inode(inode);
-	swap = shmem_get_swap(info, index);
 	if (swap.val) {
 		/* Look it up and read it in.. */
 		page = lookup_swap_cache(swap);
 		if (!page) {
-			spin_unlock(&info->lock);
 			/* here we actually do the io */
 			if (fault_type)
 				*fault_type |= VM_FAULT_MAJOR;
 			page = shmem_swapin(swap, gfp, info, index);
 			if (!page) {
-				swp_entry_t nswap = shmem_get_swap(info, index);
-				if (nswap.val == swap.val) {
-					error = -ENOMEM;
-					goto out;
-				}
-				goto repeat;
+				error = -ENOMEM;
+				goto failed;
 			}
-			wait_on_page_locked(page);
-			page_cache_release(page);
-			goto repeat;
 		}
 
 		/* We have to do this with page locked to prevent races */
-		if (!trylock_page(page)) {
-			spin_unlock(&info->lock);
-			wait_on_page_locked(page);
-			page_cache_release(page);
-			goto repeat;
-		}
-		if (PageWriteback(page)) {
-			spin_unlock(&info->lock);
-			wait_on_page_writeback(page);
-			unlock_page(page);
-			page_cache_release(page);
-			goto repeat;
-		}
+		lock_page(page);
 		if (!PageUptodate(page)) {
-			spin_unlock(&info->lock);
-			unlock_page(page);
-			page_cache_release(page);
 			error = -EIO;
-			goto out;
+			goto failed;
 		}
+		wait_on_page_writeback(page);
 
-		error = add_to_page_cache_locked(page, mapping,
-						 index, GFP_NOWAIT);
-		if (error) {
-			spin_unlock(&info->lock);
-			if (error == -ENOMEM) {
-				/*
-				 * reclaim from proper memory cgroup and
-				 * call memcg's OOM if needed.
-				 */
-				error = mem_cgroup_shmem_charge_fallback(
-						page, current->mm, gfp);
-				if (error) {
-					unlock_page(page);
-					page_cache_release(page);
-					goto out;
-				}
-			}
-			unlock_page(page);
-			page_cache_release(page);
-			goto repeat;
+		/* Someone may have already done it for us */
+		if (page->mapping) {
+			if (page->mapping == mapping &&
+			    page->index == index)
+				goto done;
+			error = -EEXIST;
+			goto failed;
 		}
 
-		delete_from_swap_cache(page);
-		shmem_put_swap(info, index, (swp_entry_t){0});
+		error = shmem_add_to_page_cache(page, mapping, index,
+					gfp, swp_to_radix_entry(swap));
+		if (error)
+			goto failed;
+
+		spin_lock(&info->lock);
 		info->swapped--;
+		shmem_recalc_inode(inode);
 		spin_unlock(&info->lock);
+
+		delete_from_swap_cache(page);
 		set_page_dirty(page);
 		swap_free(swap);
 
-	} else if (sgp == SGP_READ) {
-		page = find_get_page(mapping, index);
-		if (page && !trylock_page(page)) {
-			spin_unlock(&info->lock);
-			wait_on_page_locked(page);
-			page_cache_release(page);
-			goto repeat;
+	} else {
+		if (shmem_acct_block(info->flags)) {
+			error = -ENOSPC;
+			goto failed;
 		}
-		spin_unlock(&info->lock);
-
-	} else if (prealloc_page) {
-		sbinfo = SHMEM_SB(inode->i_sb);
 		if (sbinfo->max_blocks) {
 			if (percpu_counter_compare(&sbinfo->used_blocks,
-						sbinfo->max_blocks) >= 0 ||
-			    shmem_acct_block(info->flags))
-				goto nospace;
+						sbinfo->max_blocks) >= 0) {
+				error = -ENOSPC;
+				goto unacct;
+			}
 			percpu_counter_inc(&sbinfo->used_blocks);
-			inode->i_blocks += BLOCKS_PER_PAGE;
-		} else if (shmem_acct_block(info->flags))
-			goto nospace;
-
-		page = prealloc_page;
-		prealloc_page = NULL;
-
-		swap = shmem_get_swap(info, index);
-		if (swap.val)
-			mem_cgroup_uncharge_cache_page(page);
-		else
-			error = add_to_page_cache_lru(page, mapping,
-						index, GFP_NOWAIT);
-		/*
-		 * At add_to_page_cache_lru() failure,
-		 * uncharge will be done automatically.
-		 */
-		if (swap.val || error) {
-			shmem_unacct_blocks(info->flags, 1);
-			shmem_free_blocks(inode, 1);
-			spin_unlock(&info->lock);
-			page_cache_release(page);
-			goto repeat;
 		}
 
+		page = shmem_alloc_page(gfp, info, index);
+		if (!page) {
+			error = -ENOMEM;
+			goto decused;
+		}
+
+		SetPageSwapBacked(page);
+		__set_page_locked(page);
+		error = shmem_add_to_page_cache(page, mapping, index,
+								gfp, NULL);
+		if (error)
+			goto decused;
+		lru_cache_add_anon(page);
+
+		spin_lock(&info->lock);
 		info->alloced++;
+		inode->i_blocks += BLOCKS_PER_PAGE;
+		shmem_recalc_inode(inode);
 		spin_unlock(&info->lock);
+
 		clear_highpage(page);
 		flush_dcache_page(page);
 		SetPageUptodate(page);
 		if (sgp == SGP_DIRTY)
 			set_page_dirty(page);
-
-	} else {
-		spin_unlock(&info->lock);
-		error = -ENOMEM;
-		goto out;
 	}
 done:
-	*pagep = page;
-	error = 0;
-out:
-	if (prealloc_page) {
-		mem_cgroup_uncharge_cache_page(prealloc_page);
-		page_cache_release(prealloc_page);
+	/* Perhaps the file has been truncated since we checked */
+	if (sgp != SGP_WRITE &&
+	    ((loff_t)index << PAGE_CACHE_SHIFT) >= i_size_read(inode)) {
+		error = -EINVAL;
+		goto trunc;
 	}
-	return error;
+	*pagep = page;
+	return 0;
 
-nospace:
 	/*
-	 * Perhaps the page was brought in from swap between find_lock_page
-	 * and taking info->lock?  We allow for that at add_to_page_cache_lru,
-	 * but must also avoid reporting a spurious ENOSPC while working on a
-	 * full tmpfs.
+	 * Error recovery.
 	 */
-	page = find_get_page(mapping, index);
+trunc:
+	ClearPageDirty(page);
+	delete_from_page_cache(page);
+	spin_lock(&info->lock);
+	info->alloced--;
+	inode->i_blocks -= BLOCKS_PER_PAGE;
 	spin_unlock(&info->lock);
+decused:
+	if (sbinfo->max_blocks)
+		percpu_counter_add(&sbinfo->used_blocks, -1);
+unacct:
+	shmem_unacct_blocks(info->flags, 1);
+failed:
+	if (swap.val && error != -EINVAL) {
+		struct page *test = find_get_page(mapping, index);
+		if (test && !radix_tree_exceptional_entry(test))
+			page_cache_release(test);
+		/* Have another try if the entry has changed */
+		if (test != swp_to_radix_entry(swap))
+			error = -EEXIST;
+	}
 	if (page) {
+		unlock_page(page);
 		page_cache_release(page);
+	}
+	if (error == -ENOSPC && !once++) {
+		info = SHMEM_I(inode);
+		spin_lock(&info->lock);
+		shmem_recalc_inode(inode);
+		spin_unlock(&info->lock);
 		goto repeat;
 	}
-	error = -ENOSPC;
-	goto out;
+	if (error == -EEXIST)
+		goto repeat;
+	return error;
 }
 
 static int shmem_fault(struct vm_area_struct *vma, struct vm_fault *vmf)
@@ -1095,9 +1064,6 @@ static int shmem_fault(struct vm_area_st
 	int error;
 	int ret = VM_FAULT_LOCKED;
 
-	if (((loff_t)vmf->pgoff << PAGE_CACHE_SHIFT) >= i_size_read(inode))
-		return VM_FAULT_SIGBUS;
-
 	error = shmem_getpage(inode, vmf->pgoff, &vmf->page, SGP_CACHE, &ret);
 	if (error)
 		return ((error == -ENOMEM) ? VM_FAULT_OOM : VM_FAULT_SIGBUS);
@@ -2164,8 +2130,7 @@ static int shmem_remount_fs(struct super
 	if (config.max_inodes < inodes)
 		goto out;
 	/*
-	 * Those tests also disallow limited->unlimited while any are in
-	 * use, so i_blocks will always be zero when max_blocks is zero;
+	 * Those tests disallow limited->unlimited while any are in use;
 	 * but we must separately disallow unlimited->limited, because
 	 * in that case we have no record of how much is already in use.
 	 */

^ permalink raw reply	[flat|nested] 71+ messages in thread

* [PATCH 9/12] tmpfs: convert mem_cgroup shmem to radix-swap
  2011-06-14 10:40 ` Hugh Dickins
@ 2011-06-14 10:54   ` Hugh Dickins
  -1 siblings, 0 replies; 71+ messages in thread
From: Hugh Dickins @ 2011-06-14 10:54 UTC (permalink / raw)
  To: Andrew Morton; +Cc: linux-kernel, linux-mm

Remove mem_cgroup_shmem_charge_fallback(): it was only required
when we had to move swappage to filecache with GFP_NOWAIT.

Remove the GFP_NOWAIT special case from mem_cgroup_cache_charge(),
by moving its call out from shmem_add_to_page_cache() to two of its
three callers.  But leave it doing mem_cgroup_uncharge_cache_page() on
error: although asymmetrical, it's easier for all three callers to handle.
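
The asymmetry is easier to see in miniature: the caller charges up front,
and the add path takes responsibility for dropping the charge if the insert
fails, so no caller needs its own uncharge branch.  A toy model with
made-up charge/uncharge/add helpers (not the memcg API):

#include <stdio.h>

static int charged;				/* toy stand-in for the memcg charge */

static int charge(void)    { charged = 1; return 0; }
static void uncharge(void) { charged = 0; }

/* The add path undoes the caller's charge on failure (the asymmetry). */
static int add_to_cache(int simulate_failure)
{
	if (simulate_failure) {
		uncharge();
		return -1;
	}
	return 0;
}

static int caller(int simulate_failure)
{
	int error = charge();

	if (!error)
		error = add_to_cache(simulate_failure);
	/* on error the charge is already dropped: no cleanup needed here */
	return error;
}

int main(void)
{
	int error;

	error = caller(0);
	printf("success: error=%d charged=%d\n", error, charged);	/* charged stays 1 */

	error = caller(1);
	printf("failure: error=%d charged=%d\n", error, charged);	/* charged back to 0 */
	return 0;
}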

These two changes would also be appropriate if anyone were
to start using shmem_read_mapping_page_gfp() with GFP_NOWAIT.

Remove mem_cgroup_get_shmem_target(): mc_handle_file_pte() can test
radix_tree_exceptional_entry() to get what it needs for itself.

Signed-off-by: Hugh Dickins <hughd@google.com>
---
 include/linux/memcontrol.h |    8 ---
 include/linux/shmem_fs.h   |    2 
 mm/memcontrol.c            |   66 +++------------------------
 mm/shmem.c                 |   83 ++++-------------------------------
 4 files changed, 20 insertions(+), 139 deletions(-)

--- linux.orig/include/linux/memcontrol.h	2011-06-13 13:26:07.126099155 -0700
+++ linux/include/linux/memcontrol.h	2011-06-13 13:30:05.951283422 -0700
@@ -76,8 +76,6 @@ extern void mem_cgroup_uncharge_end(void
 
 extern void mem_cgroup_uncharge_page(struct page *page);
 extern void mem_cgroup_uncharge_cache_page(struct page *page);
-extern int mem_cgroup_shmem_charge_fallback(struct page *page,
-			struct mm_struct *mm, gfp_t gfp_mask);
 
 extern void mem_cgroup_out_of_memory(struct mem_cgroup *mem, gfp_t gfp_mask);
 int task_in_mem_cgroup(struct task_struct *task, const struct mem_cgroup *mem);
@@ -206,12 +204,6 @@ static inline void mem_cgroup_uncharge_c
 {
 }
 
-static inline int mem_cgroup_shmem_charge_fallback(struct page *page,
-			struct mm_struct *mm, gfp_t gfp_mask)
-{
-	return 0;
-}
-
 static inline void mem_cgroup_add_lru_list(struct page *page, int lru)
 {
 }
--- linux.orig/include/linux/shmem_fs.h	2011-06-13 13:28:25.822786909 -0700
+++ linux/include/linux/shmem_fs.h	2011-06-14 00:45:20.625161293 -0700
@@ -57,8 +57,6 @@ extern struct page *shmem_read_mapping_p
 					pgoff_t index, gfp_t gfp_mask);
 extern void shmem_truncate_range(struct inode *inode, loff_t start, loff_t end);
 extern int shmem_unuse(swp_entry_t entry, struct page *page);
-extern void mem_cgroup_get_shmem_target(struct inode *inode, pgoff_t pgoff,
-					struct page **pagep, swp_entry_t *ent);
 
 static inline struct page *shmem_read_mapping_page(
 				struct address_space *mapping, pgoff_t index)
--- linux.orig/mm/memcontrol.c	2011-06-13 13:26:07.446100738 -0700
+++ linux/mm/memcontrol.c	2011-06-14 00:50:17.346633542 -0700
@@ -35,7 +35,6 @@
 #include <linux/limits.h>
 #include <linux/mutex.h>
 #include <linux/rbtree.h>
-#include <linux/shmem_fs.h>
 #include <linux/slab.h>
 #include <linux/swap.h>
 #include <linux/swapops.h>
@@ -2690,30 +2689,6 @@ int mem_cgroup_cache_charge(struct page
 		return 0;
 	if (PageCompound(page))
 		return 0;
-	/*
-	 * Corner case handling. This is called from add_to_page_cache()
-	 * in usual. But some FS (shmem) precharges this page before calling it
-	 * and call add_to_page_cache() with GFP_NOWAIT.
-	 *
-	 * For GFP_NOWAIT case, the page may be pre-charged before calling
-	 * add_to_page_cache(). (See shmem.c) check it here and avoid to call
-	 * charge twice. (It works but has to pay a bit larger cost.)
-	 * And when the page is SwapCache, it should take swap information
-	 * into account. This is under lock_page() now.
-	 */
-	if (!(gfp_mask & __GFP_WAIT)) {
-		struct page_cgroup *pc;
-
-		pc = lookup_page_cgroup(page);
-		if (!pc)
-			return 0;
-		lock_page_cgroup(pc);
-		if (PageCgroupUsed(pc)) {
-			unlock_page_cgroup(pc);
-			return 0;
-		}
-		unlock_page_cgroup(pc);
-	}
 
 	if (unlikely(!mm))
 		mm = &init_mm;
@@ -3303,31 +3278,6 @@ void mem_cgroup_end_migration(struct mem
 	cgroup_release_and_wakeup_rmdir(&mem->css);
 }
 
-/*
- * A call to try to shrink memory usage on charge failure at shmem's swapin.
- * Calling hierarchical_reclaim is not enough because we should update
- * last_oom_jiffies to prevent pagefault_out_of_memory from invoking global OOM.
- * Moreover considering hierarchy, we should reclaim from the mem_over_limit,
- * not from the memcg which this page would be charged to.
- * try_charge_swapin does all of these works properly.
- */
-int mem_cgroup_shmem_charge_fallback(struct page *page,
-			    struct mm_struct *mm,
-			    gfp_t gfp_mask)
-{
-	struct mem_cgroup *mem;
-	int ret;
-
-	if (mem_cgroup_disabled())
-		return 0;
-
-	ret = mem_cgroup_try_charge_swapin(mm, page, gfp_mask, &mem);
-	if (!ret)
-		mem_cgroup_cancel_charge_swapin(mem); /* it does !mem check */
-
-	return ret;
-}
-
 #ifdef CONFIG_DEBUG_VM
 static struct page_cgroup *lookup_page_cgroup_used(struct page *page)
 {
@@ -5086,15 +5036,17 @@ static struct page *mc_handle_file_pte(s
 		pgoff = pte_to_pgoff(ptent);
 
 	/* page is moved even if it's not RSS of this task(page-faulted). */
-	if (!mapping_cap_swap_backed(mapping)) { /* normal file */
-		page = find_get_page(mapping, pgoff);
-	} else { /* shmem/tmpfs file. we should take account of swap too. */
-		swp_entry_t ent;
-		mem_cgroup_get_shmem_target(inode, pgoff, &page, &ent);
+	page = find_get_page(mapping, pgoff);
+
+#ifdef CONFIG_SWAP
+	/* shmem/tmpfs may report page out on swap: account for that too. */
+	if (radix_tree_exceptional_entry(page)) {
+		swp_entry_t swap = radix_to_swp_entry(page);
 		if (do_swap_account)
-			entry->val = ent.val;
+			*entry = swap;
+		page = find_get_page(&swapper_space, swap.val);
 	}
-
+#endif
 	return page;
 }
 
--- linux.orig/mm/shmem.c	2011-06-13 13:29:55.115229689 -0700
+++ linux/mm/shmem.c	2011-06-14 00:45:20.685161581 -0700
@@ -262,15 +262,11 @@ static int shmem_add_to_page_cache(struc
 				   struct address_space *mapping,
 				   pgoff_t index, gfp_t gfp, void *expected)
 {
-	int error;
+	int error = 0;
 
 	VM_BUG_ON(!PageLocked(page));
 	VM_BUG_ON(!PageSwapBacked(page));
 
-	error = mem_cgroup_cache_charge(page, current->mm,
-						gfp & GFP_RECLAIM_MASK);
-	if (error)
-		goto out;
 	if (!expected)
 		error = radix_tree_preload(gfp & GFP_RECLAIM_MASK);
 	if (!error) {
@@ -300,7 +296,6 @@ static int shmem_add_to_page_cache(struc
 	}
 	if (error)
 		mem_cgroup_uncharge_cache_page(page);
-out:
 	return error;
 }
 
@@ -660,7 +655,6 @@ int shmem_unuse(swp_entry_t swap, struct
 	 * Charge page using GFP_KERNEL while we can wait, before taking
 	 * the shmem_swaplist_mutex which might hold up shmem_writepage().
 	 * Charged back to the user (not to caller) when swap account is used.
-	 * shmem_add_to_page_cache() will be called with GFP_NOWAIT.
 	 */
 	error = mem_cgroup_cache_charge(page, current->mm, GFP_KERNEL);
 	if (error)
@@ -954,8 +948,11 @@ repeat:
 			goto failed;
 		}
 
-		error = shmem_add_to_page_cache(page, mapping, index,
-					gfp, swp_to_radix_entry(swap));
+		error = mem_cgroup_cache_charge(page, current->mm,
+						gfp & GFP_RECLAIM_MASK);
+		if (!error)
+			error = shmem_add_to_page_cache(page, mapping, index,
+						gfp, swp_to_radix_entry(swap));
 		if (error)
 			goto failed;
 
@@ -990,8 +987,11 @@ repeat:
 
 		SetPageSwapBacked(page);
 		__set_page_locked(page);
-		error = shmem_add_to_page_cache(page, mapping, index,
-								gfp, NULL);
+		error = mem_cgroup_cache_charge(page, current->mm,
+						gfp & GFP_RECLAIM_MASK);
+		if (!error)
+			error = shmem_add_to_page_cache(page, mapping, index,
+						gfp, NULL);
 		if (error)
 			goto decused;
 		lru_cache_add_anon(page);
@@ -2448,42 +2448,6 @@ out4:
 	return error;
 }
 
-#ifdef CONFIG_CGROUP_MEM_RES_CTLR
-/**
- * mem_cgroup_get_shmem_target - find page or swap assigned to the shmem file
- * @inode: the inode to be searched
- * @index: the page offset to be searched
- * @pagep: the pointer for the found page to be stored
- * @swapp: the pointer for the found swap entry to be stored
- *
- * If a page is found, refcount of it is incremented. Callers should handle
- * these refcount.
- */
-void mem_cgroup_get_shmem_target(struct inode *inode, pgoff_t index,
-				 struct page **pagep, swp_entry_t *swapp)
-{
-	struct shmem_inode_info *info = SHMEM_I(inode);
-	struct page *page = NULL;
-	swp_entry_t swap = {0};
-
-	if ((index << PAGE_CACHE_SHIFT) >= i_size_read(inode))
-		goto out;
-
-	spin_lock(&info->lock);
-#ifdef CONFIG_SWAP
-	swap = shmem_get_swap(info, index);
-	if (swap.val)
-		page = find_get_page(&swapper_space, swap.val);
-	else
-#endif
-		page = find_get_page(inode->i_mapping, index);
-	spin_unlock(&info->lock);
-out:
-	*pagep = page;
-	*swapp = swap;
-}
-#endif
-
 #else /* !CONFIG_SHMEM */
 
 /*
@@ -2529,31 +2493,6 @@ void shmem_truncate_range(struct inode *
 }
 EXPORT_SYMBOL_GPL(shmem_truncate_range);
 
-#ifdef CONFIG_CGROUP_MEM_RES_CTLR
-/**
- * mem_cgroup_get_shmem_target - find page or swap assigned to the shmem file
- * @inode: the inode to be searched
- * @index: the page offset to be searched
- * @pagep: the pointer for the found page to be stored
- * @swapp: the pointer for the found swap entry to be stored
- *
- * If a page is found, refcount of it is incremented. Callers should handle
- * these refcount.
- */
-void mem_cgroup_get_shmem_target(struct inode *inode, pgoff_t index,
-				 struct page **pagep, swp_entry_t *swapp)
-{
-	struct page *page = NULL;
-
-	if ((index << PAGE_CACHE_SHIFT) >= i_size_read(inode))
-		goto out;
-	page = find_get_page(inode->i_mapping, index);
-out:
-	*pagep = page;
-	*swapp = (swp_entry_t){0};
-}
-#endif
-
 #define shmem_vm_ops				generic_file_vm_ops
 #define shmem_file_operations			ramfs_file_operations
 #define shmem_get_inode(sb, dir, mode, dev, flags)	ramfs_get_inode(sb, dir, mode, dev)

^ permalink raw reply	[flat|nested] 71+ messages in thread

- *
- * If a page is found, refcount of it is incremented. Callers should handle
- * these refcount.
- */
-void mem_cgroup_get_shmem_target(struct inode *inode, pgoff_t index,
-				 struct page **pagep, swp_entry_t *swapp)
-{
-	struct page *page = NULL;
-
-	if ((index << PAGE_CACHE_SHIFT) >= i_size_read(inode))
-		goto out;
-	page = find_get_page(inode->i_mapping, index);
-out:
-	*pagep = page;
-	*swapp = (swp_entry_t){0};
-}
-#endif
-
 #define shmem_vm_ops				generic_file_vm_ops
 #define shmem_file_operations			ramfs_file_operations
 #define shmem_get_inode(sb, dir, mode, dev, flags)	ramfs_get_inode(sb, dir, mode, dev)

^ permalink raw reply	[flat|nested] 71+ messages in thread

* [PATCH 10/12] tmpfs: convert shmem_writepage and enable swap
  2011-06-14 10:40 ` Hugh Dickins
@ 2011-06-14 10:56   ` Hugh Dickins
  -1 siblings, 0 replies; 71+ messages in thread
From: Hugh Dickins @ 2011-06-14 10:56 UTC (permalink / raw)
  To: Andrew Morton; +Cc: linux-kernel, linux-mm

Convert shmem_writepage() to use shmem_delete_from_page_cache(), which
in turn uses shmem_radix_tree_replace() to substitute the swap entry
for the page pointer atomically in the radix tree.

As with shmem_add_to_page_cache(), it's not entirely satisfactory to be
copying such code from delete_from_swap_cache, but again judged easier
to sell than making its other callers go through the extras.

Remove the toy implementation's shmem_put_swap() and shmem_get_swap(),
now unreferenced, and the hack to disable swap: it's now good to go.

The way things have worked out, info->lock no longer helps to guard the
shmem_swaplist: we increment swapped under shmem_swaplist_mutex only.
That global mutex exclusion between shmem_writepage() and shmem_unuse()
is not pretty, and we ought to find another way; but it's been forced
on us by recent race discoveries, not a consequence of this patchset.

And what has become of the WARN_ON_ONCE(1) and free_swap_and_cache() if
a swap entry was found already present?  That's no longer possible: the
(unknown) one inserting this page into filecache would hit the swap
entry occupying that slot.

Signed-off-by: Hugh Dickins <hughd@google.com>
---
 mm/shmem.c |   88 +++++++++++++++++++++------------------------------
 1 file changed, 37 insertions(+), 51 deletions(-)

--- linux.orig/mm/shmem.c	2011-06-14 00:45:20.685161581 -0700
+++ linux/mm/shmem.c	2011-06-14 00:54:36.499917716 -0700
@@ -6,7 +6,8 @@
  *		 2000-2001 Christoph Rohland
  *		 2000-2001 SAP AG
  *		 2002 Red Hat Inc.
- * Copyright (C) 2002-2005 Hugh Dickins.
+ * Copyright (C) 2002-2011 Hugh Dickins.
+ * Copyright (C) 2011 Google Inc.
  * Copyright (C) 2002-2005 VERITAS Software Corporation.
  * Copyright (C) 2004 Andi Kleen, SuSE Labs
  *
@@ -219,19 +220,6 @@ static void shmem_recalc_inode(struct in
 	}
 }
 
-static void shmem_put_swap(struct shmem_inode_info *info, pgoff_t index,
-			   swp_entry_t swap)
-{
-	if (index < SHMEM_NR_DIRECT)
-		info->i_direct[index] = swap;
-}
-
-static swp_entry_t shmem_get_swap(struct shmem_inode_info *info, pgoff_t index)
-{
-	return (index < SHMEM_NR_DIRECT) ?
-		info->i_direct[index] : (swp_entry_t){0};
-}
-
 /*
  * Replace item expected in radix tree by a new item, while holding tree lock.
  */
@@ -300,6 +288,25 @@ static int shmem_add_to_page_cache(struc
 }
 
 /*
+ * Like delete_from_page_cache, but substitutes swap for page.
+ */
+static void shmem_delete_from_page_cache(struct page *page, void *radswap)
+{
+	struct address_space *mapping = page->mapping;
+	int error;
+
+	spin_lock_irq(&mapping->tree_lock);
+	error = shmem_radix_tree_replace(mapping, page->index, page, radswap);
+	page->mapping = NULL;
+	mapping->nrpages--;
+	__dec_zone_page_state(page, NR_FILE_PAGES);
+	__dec_zone_page_state(page, NR_SHMEM);
+	spin_unlock_irq(&mapping->tree_lock);
+	page_cache_release(page);
+	BUG_ON(error);
+}
+
+/*
  * Like find_get_pages, but collecting swap entries as well as pages.
  */
 static unsigned shmem_find_get_pages_and_swap(struct address_space *mapping,
@@ -664,14 +671,10 @@ int shmem_unuse(swp_entry_t swap, struct
 	mutex_lock(&shmem_swaplist_mutex);
 	list_for_each_safe(this, next, &shmem_swaplist) {
 		info = list_entry(this, struct shmem_inode_info, swaplist);
-		if (!info->swapped) {
-			spin_lock(&info->lock);
-			if (!info->swapped)
-				list_del_init(&info->swaplist);
-			spin_unlock(&info->lock);
-		}
 		if (info->swapped)
 			found = shmem_unuse_inode(info, swap, page);
+		else
+			list_del_init(&info->swaplist);
 		cond_resched();
 		if (found)
 			break;
@@ -694,10 +697,10 @@ out:
 static int shmem_writepage(struct page *page, struct writeback_control *wbc)
 {
 	struct shmem_inode_info *info;
-	swp_entry_t swap, oswap;
 	struct address_space *mapping;
-	pgoff_t index;
 	struct inode *inode;
+	swp_entry_t swap;
+	pgoff_t index;
 
 	BUG_ON(!PageLocked(page));
 	mapping = page->mapping;
@@ -720,55 +723,38 @@ static int shmem_writepage(struct page *
 		WARN_ON_ONCE(1);	/* Still happens? Tell us about it! */
 		goto redirty;
 	}
-
-	/*
-	 * Disable even the toy swapping implementation, while we convert
-	 * functions one by one to having swap entries in the radix tree.
-	 */
-	if (index < ULONG_MAX)
-		goto redirty;
-
 	swap = get_swap_page();
 	if (!swap.val)
 		goto redirty;
 
 	/*
 	 * Add inode to shmem_unuse()'s list of swapped-out inodes,
-	 * if it's not already there.  Do it now because we cannot take
-	 * mutex while holding spinlock, and must do so before the page
-	 * is moved to swap cache, when its pagelock no longer protects
+	 * if it's not already there.  Do it now before the page is
+	 * moved to swap cache, when its pagelock no longer protects
 	 * the inode from eviction.  But don't unlock the mutex until
-	 * we've taken the spinlock, because shmem_unuse_inode() will
-	 * prune a !swapped inode from the swaplist under both locks.
+	 * we've incremented swapped, because shmem_unuse_inode() will
+	 * prune a !swapped inode from the swaplist under this mutex.
 	 */
 	mutex_lock(&shmem_swaplist_mutex);
 	if (list_empty(&info->swaplist))
 		list_add_tail(&info->swaplist, &shmem_swaplist);
 
-	spin_lock(&info->lock);
-	mutex_unlock(&shmem_swaplist_mutex);
-
-	oswap = shmem_get_swap(info, index);
-	if (oswap.val) {
-		WARN_ON_ONCE(1);	/* Still happens? Tell us about it! */
-		free_swap_and_cache(oswap);
-		shmem_put_swap(info, index, (swp_entry_t){0});
-		info->swapped--;
-	}
-	shmem_recalc_inode(inode);
-
 	if (add_to_swap_cache(page, swap, GFP_ATOMIC) == 0) {
-		delete_from_page_cache(page);
-		shmem_put_swap(info, index, swap);
-		info->swapped++;
 		swap_shmem_alloc(swap);
+		shmem_delete_from_page_cache(page, swp_to_radix_entry(swap));
+
+		spin_lock(&info->lock);
+		info->swapped++;
+		shmem_recalc_inode(inode);
 		spin_unlock(&info->lock);
+
+		mutex_unlock(&shmem_swaplist_mutex);
 		BUG_ON(page_mapped(page));
 		swap_writepage(page, wbc);
 		return 0;
 	}
 
-	spin_unlock(&info->lock);
+	mutex_unlock(&shmem_swaplist_mutex);
 	swapcache_free(swap, NULL);
 redirty:
 	set_page_dirty(page);

^ permalink raw reply	[flat|nested] 71+ messages in thread

* [PATCH 11/12] tmpfs: use kmemdup for short symlinks
  2011-06-14 10:40 ` Hugh Dickins
@ 2011-06-14 10:57   ` Hugh Dickins
  -1 siblings, 0 replies; 71+ messages in thread
From: Hugh Dickins @ 2011-06-14 10:57 UTC (permalink / raw)
  To: Andrew Morton; +Cc: linux-kernel, linux-mm

But we've not yet removed the old swp_entry_t i_direct[16] from
shmem_inode_info.  That's because it was still being shared with the
inline symlink.  Remove it now (saving 64 or 128 bytes from shmem inode
size), and use kmemdup() for short symlinks, say, those up to 128 bytes.
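
(An aside: kmemdup() is just the usual allocate-and-copy rolled into
one; ignoring slab bookkeeping, it behaves like the sketch below, where
kmemdup_sketch is only a name for illustration.)

static inline void *kmemdup_sketch(const void *src, size_t len, gfp_t gfp)
{
	void *p = kmalloc(len, gfp);

	if (p)
		memcpy(p, src, len);
	return p;
}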

I wonder why mpol_free_shared_policy() is done in shmem_destroy_inode()
rather than shmem_evict_inode(), where we usually do such freeing?  I
guess it doesn't matter, and I'm not into NUMA mpol testing right now.

Signed-off-by: Hugh Dickins <hughd@google.com>
---
 include/linux/shmem_fs.h |   11 +++--------
 mm/shmem.c               |   31 ++++++++++++++++++-------------
 2 files changed, 21 insertions(+), 21 deletions(-)

--- linux.orig/include/linux/shmem_fs.h	2011-06-14 00:45:20.625161293 -0700
+++ linux/include/linux/shmem_fs.h	2011-06-14 00:54:49.667983016 -0700
@@ -8,20 +8,15 @@
 
 /* inode in-kernel data */
 
-#define SHMEM_NR_DIRECT 16
-
-#define SHMEM_SYMLINK_INLINE_LEN (SHMEM_NR_DIRECT * sizeof(swp_entry_t))
-
 struct shmem_inode_info {
 	spinlock_t		lock;
 	unsigned long		flags;
 	unsigned long		alloced;	/* data pages alloced to file */
-	unsigned long		swapped;	/* subtotal assigned to swap */
-	struct shared_policy	policy;		/* NUMA memory alloc policy */
 	union {
-		swp_entry_t	i_direct[SHMEM_NR_DIRECT]; /* first blocks */
-		char		inline_symlink[SHMEM_SYMLINK_INLINE_LEN];
+		unsigned long	swapped;	/* subtotal assigned to swap */
+		char		*symlink;	/* unswappable short symlink */
 	};
+	struct shared_policy	policy;		/* NUMA memory alloc policy */
 	struct list_head	swaplist;	/* chain of maybes on swap */
 	struct list_head	xattr_list;	/* list of shmem_xattr */
 	struct inode		vfs_inode;
--- linux.orig/mm/shmem.c	2011-06-14 00:54:36.499917716 -0700
+++ linux/mm/shmem.c	2011-06-14 00:54:49.667983016 -0700
@@ -73,6 +73,9 @@ static struct vfsmount *shm_mnt;
 /* Pretend that each entry is of this size in directory's i_size */
 #define BOGO_DIRENT_SIZE 20
 
+/* Symlink up to this size is kmalloc'ed instead of using a swappable page */
+#define SHORT_SYMLINK_LEN 128
+
 struct shmem_xattr {
 	struct list_head list;	/* anchored by shmem_inode_info->xattr_list */
 	char *name;		/* xattr name */
@@ -585,7 +588,8 @@ static void shmem_evict_inode(struct ino
 			list_del_init(&info->swaplist);
 			mutex_unlock(&shmem_swaplist_mutex);
 		}
-	}
+	} else
+		kfree(info->symlink);
 
 	list_for_each_entry_safe(xattr, nxattr, &info->xattr_list, list) {
 		kfree(xattr->name);
@@ -1173,7 +1177,7 @@ static struct inode *shmem_get_inode(str
 
 #ifdef CONFIG_TMPFS
 static const struct inode_operations shmem_symlink_inode_operations;
-static const struct inode_operations shmem_symlink_inline_operations;
+static const struct inode_operations shmem_short_symlink_operations;
 
 static int
 shmem_write_begin(struct file *file, struct address_space *mapping,
@@ -1638,10 +1642,13 @@ static int shmem_symlink(struct inode *d
 
 	info = SHMEM_I(inode);
 	inode->i_size = len-1;
-	if (len <= SHMEM_SYMLINK_INLINE_LEN) {
-		/* do it inline */
-		memcpy(info->inline_symlink, symname, len);
-		inode->i_op = &shmem_symlink_inline_operations;
+	if (len <= SHORT_SYMLINK_LEN) {
+		info->symlink = kmemdup(symname, len, GFP_KERNEL);
+		if (!info->symlink) {
+			iput(inode);
+			return -ENOMEM;
+		}
+		inode->i_op = &shmem_short_symlink_operations;
 	} else {
 		error = shmem_getpage(inode, 0, &page, SGP_WRITE, NULL);
 		if (error) {
@@ -1664,9 +1671,9 @@ static int shmem_symlink(struct inode *d
 	return 0;
 }
 
-static void *shmem_follow_link_inline(struct dentry *dentry, struct nameidata *nd)
+static void *shmem_follow_short_symlink(struct dentry *dentry, struct nameidata *nd)
 {
-	nd_set_link(nd, SHMEM_I(dentry->d_inode)->inline_symlink);
+	nd_set_link(nd, SHMEM_I(dentry->d_inode)->symlink);
 	return NULL;
 }
 
@@ -1914,9 +1921,9 @@ static ssize_t shmem_listxattr(struct de
 }
 #endif /* CONFIG_TMPFS_XATTR */
 
-static const struct inode_operations shmem_symlink_inline_operations = {
+static const struct inode_operations shmem_short_symlink_operations = {
 	.readlink	= generic_readlink,
-	.follow_link	= shmem_follow_link_inline,
+	.follow_link	= shmem_follow_short_symlink,
 #ifdef CONFIG_TMPFS_XATTR
 	.setxattr	= shmem_setxattr,
 	.getxattr	= shmem_getxattr,
@@ -2259,10 +2266,8 @@ static void shmem_destroy_callback(struc
 
 static void shmem_destroy_inode(struct inode *inode)
 {
-	if ((inode->i_mode & S_IFMT) == S_IFREG) {
-		/* only struct inode is valid if it's an inline symlink */
+	if ((inode->i_mode & S_IFMT) == S_IFREG)
 		mpol_free_shared_policy(&SHMEM_I(inode)->policy);
-	}
 	call_rcu(&inode->i_rcu, shmem_destroy_callback);
 }
 

^ permalink raw reply	[flat|nested] 71+ messages in thread

* [PATCH 12/12] mm: a few small updates for radix-swap
  2011-06-14 10:40 ` Hugh Dickins
@ 2011-06-14 10:59   ` Hugh Dickins
  -1 siblings, 0 replies; 71+ messages in thread
From: Hugh Dickins @ 2011-06-14 10:59 UTC (permalink / raw)
  To: Andrew Morton; +Cc: linux-kernel, linux-mm

Remove PageSwapBacked (!page_is_file_cache) cases from
add_to_page_cache_locked() and add_to_page_cache_lru():
those pages now go through shmem_add_to_page_cache().

Remove a comment on maximum tmpfs size from fsstack_copy_inode_size(),
and add a comment on swap entries to invalidate_mapping_pages().

And mincore_page() uses find_get_page() on what might be shmem or a
tmpfs file: allow for a radix_tree_exceptional_entry(), and proceed to
find_get_page() on swapper_space if so (oh, swapper_space needs #ifdef).
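
mincore_page() and memcg's mc_handle_file_pte() end up sharing the same
lookup pattern; distilled into one sketch (resolve_shmem_page() is a
made-up name for illustration, not something the patch adds):

static struct page *resolve_shmem_page(struct address_space *mapping,
				       pgoff_t index)
{
	struct page *page = find_get_page(mapping, index);

#ifdef CONFIG_SWAP
	/* exceptional entry: the page was swapped out, so look in
	 * swap cache instead (it may well not be there) */
	if (radix_tree_exceptional_entry(page)) {
		swp_entry_t swap = radix_to_swp_entry(page);
		page = find_get_page(&swapper_space, swap.val);
	}
#endif
	return page;	/* NULL when nothing is in core */
}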

Signed-off-by: Hugh Dickins <hughd@google.com>
---
 fs/stack.c    |    5 +----
 mm/filemap.c  |   21 +++------------------
 mm/mincore.c  |   10 ++++++----
 mm/truncate.c |    8 ++++++++
 4 files changed, 18 insertions(+), 26 deletions(-)

--- linux.orig/fs/stack.c	2011-06-14 01:22:10.768120780 -0700
+++ linux/fs/stack.c	2011-06-14 01:23:26.088494288 -0700
@@ -29,10 +29,7 @@ void fsstack_copy_inode_size(struct inod
 	 *
 	 * We don't actually know what locking is used at the lower level;
 	 * but if it's a filesystem that supports quotas, it will be using
-	 * i_lock as in inode_add_bytes().  tmpfs uses other locking, and
-	 * its 32-bit is (just) able to exceed 2TB i_size with the aid of
-	 * holes; but its i_blocks cannot carry into the upper long without
-	 * almost 2TB swap - let's ignore that case.
+	 * i_lock as in inode_add_bytes().
 	 */
 	if (sizeof(i_blocks) > sizeof(long))
 		spin_lock(&src->i_lock);
--- linux.orig/mm/filemap.c	2011-06-14 01:22:10.768120780 -0700
+++ linux/mm/filemap.c	2011-06-14 01:23:26.088494288 -0700
@@ -33,7 +33,6 @@
 #include <linux/cpuset.h>
 #include <linux/hardirq.h> /* for BUG_ON(!in_atomic()) only */
 #include <linux/memcontrol.h>
-#include <linux/mm_inline.h> /* for page_is_file_cache() */
 #include <linux/cleancache.h>
 #include "internal.h"
 
@@ -465,6 +464,7 @@ int add_to_page_cache_locked(struct page
 	int error;
 
 	VM_BUG_ON(!PageLocked(page));
+	VM_BUG_ON(PageSwapBacked(page));
 
 	error = mem_cgroup_cache_charge(page, current->mm,
 					gfp_mask & GFP_RECLAIM_MASK);
@@ -482,8 +482,6 @@ int add_to_page_cache_locked(struct page
 		if (likely(!error)) {
 			mapping->nrpages++;
 			__inc_zone_page_state(page, NR_FILE_PAGES);
-			if (PageSwapBacked(page))
-				__inc_zone_page_state(page, NR_SHMEM);
 			spin_unlock_irq(&mapping->tree_lock);
 		} else {
 			page->mapping = NULL;
@@ -505,22 +503,9 @@ int add_to_page_cache_lru(struct page *p
 {
 	int ret;
 
-	/*
-	 * Splice_read and readahead add shmem/tmpfs pages into the page cache
-	 * before shmem_readpage has a chance to mark them as SwapBacked: they
-	 * need to go on the anon lru below, and mem_cgroup_cache_charge
-	 * (called in add_to_page_cache) needs to know where they're going too.
-	 */
-	if (mapping_cap_swap_backed(mapping))
-		SetPageSwapBacked(page);
-
 	ret = add_to_page_cache(page, mapping, offset, gfp_mask);
-	if (ret == 0) {
-		if (page_is_file_cache(page))
-			lru_cache_add_file(page);
-		else
-			lru_cache_add_anon(page);
-	}
+	if (ret == 0)
+		lru_cache_add_file(page);
 	return ret;
 }
 EXPORT_SYMBOL_GPL(add_to_page_cache_lru);
--- linux.orig/mm/mincore.c	2011-06-14 01:22:10.768120780 -0700
+++ linux/mm/mincore.c	2011-06-14 01:23:26.088494288 -0700
@@ -69,13 +69,15 @@ static unsigned char mincore_page(struct
 	 * file will not get a swp_entry_t in its pte, but rather it is like
 	 * any other file mapping (ie. marked !present and faulted in with
 	 * tmpfs's .fault). So swapped out tmpfs mappings are tested here.
-	 *
-	 * However when tmpfs moves the page from pagecache and into swapcache,
-	 * it is still in core, but the find_get_page below won't find it.
-	 * No big deal, but make a note of it.
 	 */
 	page = find_get_page(mapping, pgoff);
 	if (page) {
+#ifdef CONFIG_SWAP
+		if (radix_tree_exceptional_entry(page)) {
+			swp_entry_t swap = radix_to_swp_entry(page);
+			page = find_get_page(&swapper_space, swap.val);
+		}
+#endif
 		present = PageUptodate(page);
 		page_cache_release(page);
 	}
--- linux.orig/mm/truncate.c	2011-06-14 01:22:10.768120780 -0700
+++ linux/mm/truncate.c	2011-06-14 01:23:26.092494303 -0700
@@ -331,6 +331,14 @@ unsigned long invalidate_mapping_pages(s
 	unsigned long count = 0;
 	int i;
 
+	/*
+	 * Note: this function may get called on a shmem/tmpfs mapping:
+	 * pagevec_lookup() might then return 0 prematurely (because it
+	 * got a gangful of swap entries); but it's hardly worth worrying
+	 * about - it can rarely have anything to free from such a mapping
+	 * (most pages are dirty), and already skips over any difficulties.
+	 */
+
 	pagevec_init(&pvec, 0);
 	while (index <= end && pagevec_lookup(&pvec, mapping, index,
 			min(end - index, (pgoff_t)PAGEVEC_SIZE - 1) + 1)) {

^ permalink raw reply	[flat|nested] 71+ messages in thread

* Re: [PATCH 11/12] tmpfs: use kmemdup for short symlinks
  2011-06-14 10:57   ` Hugh Dickins
@ 2011-06-14 11:16     ` Pekka Enberg
  -1 siblings, 0 replies; 71+ messages in thread
From: Pekka Enberg @ 2011-06-14 11:16 UTC (permalink / raw)
  To: Hugh Dickins; +Cc: Andrew Morton, linux-kernel, linux-mm

On Tue, Jun 14, 2011 at 1:57 PM, Hugh Dickins <hughd@google.com> wrote:
> But we've not yet removed the old swp_entry_t i_direct[16] from
> shmem_inode_info.  That's because it was still being shared with the
> inline symlink.  Remove it now (saving 64 or 128 bytes from shmem inode
> size), and use kmemdup() for short symlinks, say, those up to 128 bytes.
>
> I wonder why mpol_free_shared_policy() is done in shmem_destroy_inode()
> rather than shmem_evict_inode(), where we usually do such freeing?  I
> guess it doesn't matter, and I'm not into NUMA mpol testing right now.
>
> Signed-off-by: Hugh Dickins <hughd@google.com>

Reviewed-by: Pekka Enberg <penberg@kernel.org>

^ permalink raw reply	[flat|nested] 71+ messages in thread

* Re: [PATCH 1/12] radix_tree: exceptional entries and indices
  2011-06-14 10:42   ` Hugh Dickins
@ 2011-06-14 11:22     ` Pekka Enberg
  -1 siblings, 0 replies; 71+ messages in thread
From: Pekka Enberg @ 2011-06-14 11:22 UTC (permalink / raw)
  To: Hugh Dickins; +Cc: Andrew Morton, linux-kernel, linux-mm

Hi Hugh!

On Tue, Jun 14, 2011 at 1:42 PM, Hugh Dickins <hughd@google.com> wrote:
> @@ -39,7 +39,15 @@
>  * when it is shrunk, before we rcu free the node. See shrink code for
>  * details.
>  */
> -#define RADIX_TREE_INDIRECT_PTR        1
> +#define RADIX_TREE_INDIRECT_PTR                1
> +/*
> + * A common use of the radix tree is to store pointers to struct pages;
> + * but shmem/tmpfs needs also to store swap entries in the same tree:
> + * those are marked as exceptional entries to distinguish them.
> + * EXCEPTIONAL_ENTRY tests the bit, EXCEPTIONAL_SHIFT shifts content past it.
> + */
> +#define RADIX_TREE_EXCEPTIONAL_ENTRY   2
> +#define RADIX_TREE_EXCEPTIONAL_SHIFT   2
>
>  #define radix_tree_indirect_to_ptr(ptr) \
>        radix_tree_indirect_to_ptr((void __force *)(ptr))
> @@ -174,6 +182,28 @@ static inline int radix_tree_deref_retry
>  }
>
>  /**
> + * radix_tree_exceptional_entry        - radix_tree_deref_slot gave exceptional entry?
> + * @arg:       value returned by radix_tree_deref_slot
> + * Returns:    0 if well-aligned pointer, non-0 if exceptional entry.
> + */
> +static inline int radix_tree_exceptional_entry(void *arg)
> +{
> +       /* Not unlikely because radix_tree_exception often tested first */
> +       return (unsigned long)arg & RADIX_TREE_EXCEPTIONAL_ENTRY;
> +}
> +
> +/**
> + * radix_tree_exception        - radix_tree_deref_slot returned either exception?
> + * @arg:       value returned by radix_tree_deref_slot
> + * Returns:    0 if well-aligned pointer, non-0 if either kind of exception.
> + */
> +static inline int radix_tree_exception(void *arg)
> +{
> +       return unlikely((unsigned long)arg &
> +               (RADIX_TREE_INDIRECT_PTR | RADIX_TREE_EXCEPTIONAL_ENTRY));
> +}

Would something like radix_tree_augmented() be a better name for this
(with RADIX_TREE_AUGMENTED_MASK defined)? This one seems too easy to
confuse with radix_tree_exceptional_entry() to me which is not the
same thing, right?

                                Pekka

^ permalink raw reply	[flat|nested] 71+ messages in thread

* Re: [PATCH 0/12] tmpfs: convert from old swap vector to radix tree
  2011-06-14 10:40 ` Hugh Dickins
@ 2011-06-14 17:29   ` Linus Torvalds
  -1 siblings, 0 replies; 71+ messages in thread
From: Linus Torvalds @ 2011-06-14 17:29 UTC (permalink / raw)
  To: Hugh Dickins
  Cc: Andrew Morton, Christoph Hellwig, Robin Holt, Nick Piggin,
	Rik van Riel, Andrea Arcangeli, Miklos Szeredi,
	KAMEZAWA Hiroyuki, Shaohua Li, Tim Chen, Zhang, Yanmin,
	linux-kernel, linux-mm

On Tue, Jun 14, 2011 at 3:40 AM, Hugh Dickins <hughd@google.com> wrote:
>
> thus saving memory, and simplifying its code and locking.
>
>  13 files changed, 669 insertions(+), 1144 deletions(-)

Hey, I can Ack this just based on the fact that for once "simplifying
its code" clearly also removes code. Yay! Too many times the code
becomes "simpler" but bigger.

                       Linus

^ permalink raw reply	[flat|nested] 71+ messages in thread

* Re: [PATCH 0/12] tmpfs: convert from old swap vector to radix tree
  2011-06-14 17:29   ` Linus Torvalds
@ 2011-06-14 18:20     ` Rik van Riel
  -1 siblings, 0 replies; 71+ messages in thread
From: Rik van Riel @ 2011-06-14 18:20 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Hugh Dickins, Andrew Morton, Christoph Hellwig, Robin Holt,
	Nick Piggin, Andrea Arcangeli, Miklos Szeredi, KAMEZAWA Hiroyuki,
	Shaohua Li, Tim Chen, Zhang, Yanmin, linux-kernel, linux-mm

On 06/14/2011 01:29 PM, Linus Torvalds wrote:
> On Tue, Jun 14, 2011 at 3:40 AM, Hugh Dickins<hughd@google.com>  wrote:
>>
>> thus saving memory, and simplifying its code and locking.
>>
>>   13 files changed, 669 insertions(+), 1144 deletions(-)
>
> Hey, I can Ack this just based on the fact that for once "simplifying
> its code" clearly also removes code. Yay! Too many times the code
> becomes "simpler" but bigger.

I looked through Hugh's patches for a while and didn't
see anything wrong with the code.  Consider all patches

Acked-by: Rik van Riel <riel@redhat.com>

-- 
All rights reversed

^ permalink raw reply	[flat|nested] 71+ messages in thread

* Re: [PATCH 1/12] radix_tree: exceptional entries and indices
  2011-06-14 11:22     ` Pekka Enberg
  (?)
@ 2011-06-15  0:24     ` Hugh Dickins
  -1 siblings, 0 replies; 71+ messages in thread
From: Hugh Dickins @ 2011-06-15  0:24 UTC (permalink / raw)
  To: Pekka Enberg; +Cc: Andrew Morton, linux-kernel, linux-mm

Hi Pekka!

Thanks for taking a look.

On Tue, 14 Jun 2011, Pekka Enberg wrote:
> On Tue, Jun 14, 2011 at 1:42 PM, Hugh Dickins <hughd@google.com> wrote:
> > @@ -39,7 +39,15 @@
> >  * when it is shrunk, before we rcu free the node. See shrink code for
> >  * details.
> >  */
> > -#define RADIX_TREE_INDIRECT_PTR        1
> > +#define RADIX_TREE_INDIRECT_PTR                1
> > +/*
> > + * A common use of the radix tree is to store pointers to struct pages;
> > + * but shmem/tmpfs needs also to store swap entries in the same tree:
> > + * those are marked as exceptional entries to distinguish them.
> > + * EXCEPTIONAL_ENTRY tests the bit, EXCEPTIONAL_SHIFT shifts content past it.
> > + */
> > +#define RADIX_TREE_EXCEPTIONAL_ENTRY   2
> > +#define RADIX_TREE_EXCEPTIONAL_SHIFT   2
> >
> >  #define radix_tree_indirect_to_ptr(ptr) \
> >        radix_tree_indirect_to_ptr((void __force *)(ptr))
> > @@ -174,6 +182,28 @@ static inline int radix_tree_deref_retry
> >  }
> >
> >  /**
> > + * radix_tree_exceptional_entry        - radix_tree_deref_slot gave exceptional entry?
> > + * @arg:       value returned by radix_tree_deref_slot
> > + * Returns:    0 if well-aligned pointer, non-0 if exceptional entry.
> > + */
> > +static inline int radix_tree_exceptional_entry(void *arg)
> > +{
> > +       /* Not unlikely because radix_tree_exception often tested first */
> > +       return (unsigned long)arg & RADIX_TREE_EXCEPTIONAL_ENTRY;
> > +}
> > +
> > +/**
> > + * radix_tree_exception        - radix_tree_deref_slot returned either exception?
> > + * @arg:       value returned by radix_tree_deref_slot
> > + * Returns:    0 if well-aligned pointer, non-0 if either kind of exception.
> > + */
> > +static inline int radix_tree_exception(void *arg)
> > +{
> > +       return unlikely((unsigned long)arg &
> > +               (RADIX_TREE_INDIRECT_PTR | RADIX_TREE_EXCEPTIONAL_ENTRY));
> > +}
> 
> Would something like radix_tree_augmented() be a better name for this
> (with RADIX_TREE_AUGMENTED_MASK defined)? This one seems too easy to
> confuse with radix_tree_exceptional_entry() to me which is not the
> same thing, right?

They're not _quite_ the same thing, and I agree that a different naming
which makes that clearer (without going on and on) would be welcome.

But I don't think the word "augmented" helps or really fits in there.

What I had in mind was: there are two exceptional conditions which you
can meet in reading the radix tree, and radix_tree_exception() covers
both of those conditions.

One exceptional condition is the radix_tree_deref_retry() case, a
momentary condition where you just have to go back and read it again.

The other exceptional condition is the radix_tree_exceptional_entry():
you've read a valid entry, but it's not the usual type of thing stored
there, you need to be careful to process it differently (not try to
increment its "page" count in our case).
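
In code the distinction comes out roughly like this (a condensed sketch
of the lookup pattern, not the exact filemap.c code; the re-check of the
slot after taking the speculative reference is left out, and the name
lookup_sketch is just for illustration):

static struct page *lookup_sketch(struct address_space *mapping,
				  pgoff_t index)
{
	void **slot;
	struct page *page;

	rcu_read_lock();
repeat:
	page = NULL;
	slot = radix_tree_lookup_slot(&mapping->page_tree, index);
	if (slot) {
		page = radix_tree_deref_slot(slot);
		if (radix_tree_exception(page)) {
			/* momentary condition: go back and read again */
			if (radix_tree_deref_retry(page))
				goto repeat;
			/* valid but exceptional: a swap entry, not a page;
			 * return it unreferenced for the caller to decode */
			goto out;
		}
		if (page && !page_cache_get_speculative(page))
			goto repeat;
	}
out:
	rcu_read_unlock();
	return page;
}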

I'm fairly happy with "radix_tree_exceptional_entry" for the second;
we could make the test for both more explicit by calling it
"radix_tree_exceptional_entry_or_deref_retry", but
I grow bored before I reach the end of that!

Hugh

^ permalink raw reply	[flat|nested] 71+ messages in thread

* [PATCH v2 12/12] mm: a few small updates for radix-swap
  2011-06-14 10:59   ` Hugh Dickins
@ 2011-06-15  0:49     ` Hugh Dickins
  -1 siblings, 0 replies; 71+ messages in thread
From: Hugh Dickins @ 2011-06-15  0:49 UTC (permalink / raw)
  To: Andrew Morton; +Cc: linux-kernel, linux-mm

Remove PageSwapBacked (!page_is_file_cache) cases from
add_to_page_cache_locked() and add_to_page_cache_lru():
those pages now go through shmem_add_to_page_cache().

Remove a comment on maximum tmpfs size from fsstack_copy_inode_size(),
and add a comment on swap entries to invalidate_mapping_pages().

And mincore_page() uses find_get_page() on what might be shmem or a
tmpfs file: allow for a radix_tree_exceptional_entry(), and proceed to
find_get_page() on swapper_space if so (oh, swapper_space needs #ifdef).

v2: Fix NULL dereference I introduced in mincore_page(): if the swap
cache lookup came back NULL, v1 went on to test PageUptodate(NULL);
so resolve the exceptional entry before the if (page) check.

Signed-off-by: Hugh Dickins <hughd@google.com>
---
 fs/stack.c    |    5 +----
 mm/filemap.c  |   21 +++------------------
 mm/mincore.c  |   10 ++++++----
 mm/truncate.c |    8 ++++++++
 4 files changed, 18 insertions(+), 26 deletions(-)

--- linux.orig/fs/stack.c	2011-06-14 01:22:10.768120780 -0700
+++ linux/fs/stack.c	2011-06-14 01:23:26.088494288 -0700
@@ -29,10 +29,7 @@ void fsstack_copy_inode_size(struct inod
 	 *
 	 * We don't actually know what locking is used at the lower level;
 	 * but if it's a filesystem that supports quotas, it will be using
-	 * i_lock as in inode_add_bytes().  tmpfs uses other locking, and
-	 * its 32-bit is (just) able to exceed 2TB i_size with the aid of
-	 * holes; but its i_blocks cannot carry into the upper long without
-	 * almost 2TB swap - let's ignore that case.
+	 * i_lock as in inode_add_bytes().
 	 */
 	if (sizeof(i_blocks) > sizeof(long))
 		spin_lock(&src->i_lock);
--- linux.orig/mm/filemap.c	2011-06-14 01:22:10.768120780 -0700
+++ linux/mm/filemap.c	2011-06-14 01:23:26.088494288 -0700
@@ -33,7 +33,6 @@
 #include <linux/cpuset.h>
 #include <linux/hardirq.h> /* for BUG_ON(!in_atomic()) only */
 #include <linux/memcontrol.h>
-#include <linux/mm_inline.h> /* for page_is_file_cache() */
 #include <linux/cleancache.h>
 #include "internal.h"
 
@@ -465,6 +464,7 @@ int add_to_page_cache_locked(struct page
 	int error;
 
 	VM_BUG_ON(!PageLocked(page));
+	VM_BUG_ON(PageSwapBacked(page));
 
 	error = mem_cgroup_cache_charge(page, current->mm,
 					gfp_mask & GFP_RECLAIM_MASK);
@@ -482,8 +482,6 @@ int add_to_page_cache_locked(struct page
 		if (likely(!error)) {
 			mapping->nrpages++;
 			__inc_zone_page_state(page, NR_FILE_PAGES);
-			if (PageSwapBacked(page))
-				__inc_zone_page_state(page, NR_SHMEM);
 			spin_unlock_irq(&mapping->tree_lock);
 		} else {
 			page->mapping = NULL;
@@ -505,22 +503,9 @@ int add_to_page_cache_lru(struct page *p
 {
 	int ret;
 
-	/*
-	 * Splice_read and readahead add shmem/tmpfs pages into the page cache
-	 * before shmem_readpage has a chance to mark them as SwapBacked: they
-	 * need to go on the anon lru below, and mem_cgroup_cache_charge
-	 * (called in add_to_page_cache) needs to know where they're going too.
-	 */
-	if (mapping_cap_swap_backed(mapping))
-		SetPageSwapBacked(page);
-
 	ret = add_to_page_cache(page, mapping, offset, gfp_mask);
-	if (ret == 0) {
-		if (page_is_file_cache(page))
-			lru_cache_add_file(page);
-		else
-			lru_cache_add_anon(page);
-	}
+	if (ret == 0)
+		lru_cache_add_file(page);
 	return ret;
 }
 EXPORT_SYMBOL_GPL(add_to_page_cache_lru);
--- linux.orig/mm/mincore.c	2011-06-14 01:22:10.768120780 -0700
+++ linux/mm/mincore.c	2011-06-14 17:41:15.760211585 -0700
@@ -69,12 +69,14 @@ static unsigned char mincore_page(struct
 	 * file will not get a swp_entry_t in its pte, but rather it is like
 	 * any other file mapping (ie. marked !present and faulted in with
 	 * tmpfs's .fault). So swapped out tmpfs mappings are tested here.
-	 *
-	 * However when tmpfs moves the page from pagecache and into swapcache,
-	 * it is still in core, but the find_get_page below won't find it.
-	 * No big deal, but make a note of it.
 	 */
 	page = find_get_page(mapping, pgoff);
+#ifdef CONFIG_SWAP
+	if (radix_tree_exceptional_entry(page)) {
+		swp_entry_t swap = radix_to_swp_entry(page);
+		page = find_get_page(&swapper_space, swap.val);
+	}
+#endif
 	if (page) {
 		present = PageUptodate(page);
 		page_cache_release(page);
--- linux.orig/mm/truncate.c	2011-06-14 01:22:10.768120780 -0700
+++ linux/mm/truncate.c	2011-06-14 01:23:26.092494303 -0700
@@ -331,6 +331,14 @@ unsigned long invalidate_mapping_pages(s
 	unsigned long count = 0;
 	int i;
 
+	/*
+	 * Note: this function may get called on a shmem/tmpfs mapping:
+	 * pagevec_lookup() might then return 0 prematurely (because it
+	 * got a gangful of swap entries); but it's hardly worth worrying
+	 * about - it can rarely have anything to free from such a mapping
+	 * (most pages are dirty), and already skips over any difficulties.
+	 */
+
 	pagevec_init(&pvec, 0);
 	while (index <= end && pagevec_lookup(&pvec, mapping, index,
 			min(end - index, (pgoff_t)PAGEVEC_SIZE - 1) + 1)) {

^ permalink raw reply	[flat|nested] 71+ messages in thread

* Re: [PATCH 1/12] radix_tree: exceptional entries and indices
  2011-06-14 10:42   ` Hugh Dickins
@ 2011-06-17 23:38     ` Andrew Morton
  -1 siblings, 0 replies; 71+ messages in thread
From: Andrew Morton @ 2011-06-17 23:38 UTC (permalink / raw)
  To: Hugh Dickins; +Cc: linux-kernel, linux-mm

On Tue, 14 Jun 2011 03:42:27 -0700 (PDT)
Hugh Dickins <hughd@google.com> wrote:

> The radix_tree is used by several subsystems for different purposes.
> A major use is to store the struct page pointers of a file's pagecache
> for memory management.  But what if mm wanted to store something other
> than page pointers there too?
> 
> The low bit of a radix_tree entry is already used to denote an indirect
> pointer, for internal use, and the unlikely radix_tree_deref_retry() case.
> Define the next bit as denoting an exceptional entry, and supply inline
> functions radix_tree_exception() to return non-0 in either unlikely case,
> and radix_tree_exceptional_entry() to return non-0 in the second case.
> 
> If a subsystem already uses radix_tree with that bit set, no problem:
> it does not affect internal workings at all, but is defined for the
> convenience of those storing well-aligned pointers in the radix_tree.
> 
> The radix_tree_gang_lookups have an implicit assumption that the caller
> can deduce the offset of each entry returned e.g. by the page->index of
> a struct page.  But that may not be feasible for some kinds of item to
> be stored there.
> 
> radix_tree_gang_lookup_slot() allows for an optional indices argument,
> an output array in which to return those offsets.  The same could be added
> to other radix_tree_gang_lookups, but for now keep it to the only one
> for which we need it.

Yes, the RADIX_TREE_INDIRECT_PTR hack is internal-use-only, and doesn't
operate on (and hence doesn't corrupt) client-provided items.

This patch uses bit 1 and uses it against client items, so for
practical purposes it can only be used when the client is storing
addresses.  And it needs new APIs to access that flag.

All a bit ugly.  Why not just add another tag for this?  Or reuse an
existing tag if the current tags aren't all used for these types of
pages?



^ permalink raw reply	[flat|nested] 71+ messages in thread

* Re: [PATCH 1/12] radix_tree: exceptional entries and indices
  2011-06-17 23:38     ` Andrew Morton
@ 2011-06-18  0:07       ` Randy Dunlap
  -1 siblings, 0 replies; 71+ messages in thread
From: Randy Dunlap @ 2011-06-18  0:07 UTC (permalink / raw)
  To: Andrew Morton; +Cc: Hugh Dickins, linux-kernel, linux-mm

On Fri, 17 Jun 2011 16:38:54 -0700 Andrew Morton wrote:

> On Tue, 14 Jun 2011 03:42:27 -0700 (PDT)
> Hugh Dickins <hughd@google.com> wrote:
> 
> > The radix_tree is used by several subsystems for different purposes.
> > A major use is to store the struct page pointers of a file's pagecache
> > for memory management.  But what if mm wanted to store something other
> > than page pointers there too?
> > 
> > The low bit of a radix_tree entry is already used to denote an indirect
> > pointer, for internal use, and the unlikely radix_tree_deref_retry() case.
> > Define the next bit as denoting an exceptional entry, and supply inline
> > functions radix_tree_exception() to return non-0 in either unlikely case,
> > and radix_tree_exceptional_entry() to return non-0 in the second case.
> > 
> > If a subsystem already uses radix_tree with that bit set, no problem:
> > it does not affect internal workings at all, but is defined for the
> > convenience of those storing well-aligned pointers in the radix_tree.
> > 
> > The radix_tree_gang_lookups have an implicit assumption that the caller
> > can deduce the offset of each entry returned e.g. by the page->index of
> > a struct page.  But that may not be feasible for some kinds of item to
> > be stored there.
> > 
> > radix_tree_gang_lookup_slot() allows for an optional indices argument,
> > an output array in which to return those offsets.  The same could be added
> > to other radix_tree_gang_lookups, but for now keep it to the only one
> > for which we need it.
> 
> Yes, the RADIX_TREE_INDIRECT_PTR hack is internal-use-only, and doesn't
> operate on (and hence doesn't corrupt) client-provided items.
> 
> This patch uses bit 1 and uses it against client items, so for
> > practical purposes it can only be used when the client is storing
> addresses.  And it needs new APIs to access that flag.
> 
> All a bit ugly.  Why not just add another tag for this?  Or reuse an
> existing tag if the current tags aren't all used for these types of
> pages?


And regardless of the patch path that is taken, update test(s) if
applicable.  I thought that someone from Red Hat had a kernel loadable
module for testing radix-tree -- or maybe that was for rbtree (?) --
but I can't find that just now.

And one Andrew Morton has a userspace radix tree test harness at
http://userweb.kernel.org/~akpm/stuff/rtth.tar.gz

---
~Randy
*** Remember to use Documentation/SubmitChecklist when testing your code ***

^ permalink raw reply	[flat|nested] 71+ messages in thread

* Re: [PATCH 1/12] radix_tree: exceptional entries and indices
  2011-06-18  0:07       ` Randy Dunlap
@ 2011-06-18  0:12         ` Randy Dunlap
  -1 siblings, 0 replies; 71+ messages in thread
From: Randy Dunlap @ 2011-06-18  0:12 UTC (permalink / raw)
  To: akpm; +Cc: Hugh Dickins, linux-kernel, linux-mm

On Fri, 17 Jun 2011 17:07:42 -0700 Randy Dunlap wrote:

> On Fri, 17 Jun 2011 16:38:54 -0700 Andrew Morton wrote:
> 
> > On Tue, 14 Jun 2011 03:42:27 -0700 (PDT)
> > Hugh Dickins <hughd@google.com> wrote:
> > 
> > > The radix_tree is used by several subsystems for different purposes.
> > > A major use is to store the struct page pointers of a file's pagecache
> > > for memory management.  But what if mm wanted to store something other
> > > than page pointers there too?
> > > 
> > > The low bit of a radix_tree entry is already used to denote an indirect
> > > pointer, for internal use, and the unlikely radix_tree_deref_retry() case.
> > > Define the next bit as denoting an exceptional entry, and supply inline
> > > functions radix_tree_exception() to return non-0 in either unlikely case,
> > > and radix_tree_exceptional_entry() to return non-0 in the second case.
> > > 
> > > If a subsystem already uses radix_tree with that bit set, no problem:
> > > it does not affect internal workings at all, but is defined for the
> > > convenience of those storing well-aligned pointers in the radix_tree.
> > > 
> > > The radix_tree_gang_lookups have an implicit assumption that the caller
> > > can deduce the offset of each entry returned e.g. by the page->index of
> > > a struct page.  But that may not be feasible for some kinds of item to
> > > be stored there.
> > > 
> > > radix_tree_gang_lookup_slot() allows for an optional indices argument,
> > > an output array in which to return those offsets.  The same could be added
> > > to other radix_tree_gang_lookups, but for now keep it to the only one
> > > for which we need it.
> > 
> > Yes, the RADIX_TREE_INDIRECT_PTR hack is internal-use-only, and doesn't
> > operate on (and hence doesn't corrupt) client-provided items.
> > 
> > This patch uses bit 1 and uses it against client items, so for
> > > practical purposes it can only be used when the client is storing
> > addresses.  And it needs new APIs to access that flag.
> > 
> > All a bit ugly.  Why not just add another tag for this?  Or reuse an
> > existing tag if the current tags aren't all used for these types of
> > pages?
> 
> 
> And regardless of the patch path that is taken, update test(s) if
> applicable.  I thought that someone from Red Hat had a kernel loadable
> module for testing radix-tree -- or maybe that was for rbtree (?) --
> but I can't find that just now.

http://people.redhat.com/jmoyer/radix-tree/


> And one Andrew Morton has a userspace radix tree test harness at
> http://userweb.kernel.org/~akpm/stuff/rtth.tar.gz


---
~Randy
*** Remember to use Documentation/SubmitChecklist when testing your code ***

^ permalink raw reply	[flat|nested] 71+ messages in thread

* Re: [PATCH 1/12] radix_tree: exceptional entries and indices
  2011-06-17 23:38     ` Andrew Morton
@ 2011-06-18  0:13       ` Hugh Dickins
  -1 siblings, 0 replies; 71+ messages in thread
From: Hugh Dickins @ 2011-06-18  0:13 UTC (permalink / raw)
  To: Andrew Morton; +Cc: linux-kernel, linux-mm

On Fri, 17 Jun 2011, Andrew Morton wrote:
> On Tue, 14 Jun 2011 03:42:27 -0700 (PDT)
> Hugh Dickins <hughd@google.com> wrote:
> 
> > The low bit of a radix_tree entry is already used to denote an indirect
> > pointer, for internal use, and the unlikely radix_tree_deref_retry() case.
> > Define the next bit as denoting an exceptional entry, and supply inline
> > functions radix_tree_exception() to return non-0 in either unlikely case,
> > and radix_tree_exceptional_entry() to return non-0 in the second case.
> 
> Yes, the RADIX_TREE_INDIRECT_PTR hack is internal-use-only, and doesn't
> operate on (and hence doesn't corrupt) client-provided items.
> 
> This patch uses bit 1 and uses it against client items, so for
> practical purposes it can only be used when the client is storing
> addresses.  And it needs new APIs to access that flag.
> 
> All a bit ugly.  Why not just add another tag for this?  Or reuse an
> existing tag if the current tags aren't all used for these types of
> pages?

I couldn't see how to use tags without losing the "lockless" lookups:
because the tag is a separate bit from the entry itself, unless you're
under tree_lock, there would be races when changing from page pointer
to swap entry or back, when slot was updated but tag not or vice versa.
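
Concretely, a tag-based scheme would need two separate writes under
tree_lock, which a lockless reader could observe half-done - a sketch,
where swap_marker and PAGECACHE_TAG_SWAP are made-up names just for
illustration:

	spin_lock_irq(&mapping->tree_lock);
	/* two distinct updates: the slot contents and the separate tag bit */
	radix_tree_replace_slot(slot, swap_marker);
	radix_tree_tag_set(&mapping->page_tree, index, PAGECACHE_TAG_SWAP);
	spin_unlock_irq(&mapping->tree_lock);
	/*
	 * A concurrent lockless lookup may meanwhile see the new slot value
	 * without the tag, or (on the reverse change) the tag without the
	 * new value.
	 */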

Perhaps solvable, like seqlocks, by having two tag bits, the combination
saying come back and look again in a moment.  Hah, that can/is already
done with the low bit, the deref_retry.  So, yes, we could use one tag
bit: but it would be messier (could no longer use the slow-path-slightly-
modified find_get_page() etc).  I thought, while we've got a nearby bit
available, let's put it to use.
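
Whereas with the exceptional bit, all of the information lives in the
slot value itself, so a single radix_tree_replace_slot() is enough for
the lockless reader - roughly along the lines of the 2/12 helpers
(a sketch, details may differ):

static inline void *swp_to_radix_entry(swp_entry_t entry)
{
	unsigned long value = entry.val << RADIX_TREE_EXCEPTIONAL_SHIFT;

	return (void *)(value | RADIX_TREE_EXCEPTIONAL_ENTRY);
}

static inline swp_entry_t radix_to_swp_entry(void *arg)
{
	swp_entry_t entry;

	entry.val = (unsigned long)arg >> RADIX_TREE_EXCEPTIONAL_SHIFT;
	return entry;
}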

Hugh

^ permalink raw reply	[flat|nested] 71+ messages in thread

* Re: [PATCH 1/12] radix_tree: exceptional entries and indices
  2011-06-18  0:12         ` Randy Dunlap
@ 2011-06-18  1:52           ` Hugh Dickins
  -1 siblings, 0 replies; 71+ messages in thread
From: Hugh Dickins @ 2011-06-18  1:52 UTC (permalink / raw)
  To: Randy Dunlap; +Cc: akpm, linux-kernel, linux-mm

On Fri, 17 Jun 2011, Randy Dunlap wrote:
> > 
> > And regardless of the patch path that is taken, update test(s) if
> > applicable.

Thanks for the links, Randy, I hadn't thought of those at all.

> > I thought that someone from Red Hat had a kernel loadable
> > module for testing radix-tree -- or maybe that was for rbtree (?) --
> > but I can't find that just now.
> 
> http://people.redhat.com/jmoyer/radix-tree/

This one just tests that radix_tree_preload() goes deep enough:
not affected by the little change I've made.

> > And one Andrew Morton has a userspace radix tree test harness at
> > http://userweb.kernel.org/~akpm/stuff/rtth.tar.gz

This should still be as relevant as it was before, but I notice its
radix_tree.c is almost identical to the source currently in the kernel
tree, so I ought at the least to keep it in synch.

Whether there's anything suitable for testing here in the changes that
I've made, I'll have to look into later.

Hugh

^ permalink raw reply	[flat|nested] 71+ messages in thread

* Re: [PATCH 1/12] radix_tree: exceptional entries and indices
  2011-06-18  0:13       ` Hugh Dickins
@ 2011-06-18 21:48         ` Andrew Morton
  -1 siblings, 0 replies; 71+ messages in thread
From: Andrew Morton @ 2011-06-18 21:48 UTC (permalink / raw)
  To: Hugh Dickins; +Cc: linux-kernel, linux-mm

On Fri, 17 Jun 2011 17:13:38 -0700 (PDT) Hugh Dickins <hughd@google.com> wrote:

> On Fri, 17 Jun 2011, Andrew Morton wrote:
> > On Tue, 14 Jun 2011 03:42:27 -0700 (PDT)
> > Hugh Dickins <hughd@google.com> wrote:
> > 
> > > The low bit of a radix_tree entry is already used to denote an indirect
> > > pointer, for internal use, and the unlikely radix_tree_deref_retry() case.
> > > Define the next bit as denoting an exceptional entry, and supply inline
> > > functions radix_tree_exception() to return non-0 in either unlikely case,
> > > and radix_tree_exceptional_entry() to return non-0 in the second case.
> > 
> > Yes, the RADIX_TREE_INDIRECT_PTR hack is internal-use-only, and doesn't
> > operate on (and hence doesn't corrupt) client-provided items.
> > 
> > This patch uses bit 1 and uses it against client items, so for
> > practical purposes it can only be used when the client is storing
> > addresses.  And it needs new APIs to access that flag.
> > 
> > All a bit ugly.  Why not just add another tag for this?  Or reuse an
> > existing tag if the current tags aren't all used for these types of
> > pages?
> 
> I couldn't see how to use tags without losing the "lockless" lookups:

So lockless pagecache broke the radix-tree tag-versus-item coherency as
well as the address_space nrpages-vs-radix-tree coherency.  Isn't it
fun learning these things.

> because the tag is a separate bit from the entry itself, unless you're
> under tree_lock, there would be races when changing from page pointer
> to swap entry or back, when slot was updated but tag not or vice versa.

So...  take tree_lock?  What effect does that have?  It'd better be
"really bad", because this patchset does nothing at all to improve core
MM maintainability :(


^ permalink raw reply	[flat|nested] 71+ messages in thread

* Re: [PATCH 2/12] mm: let swap use exceptional entries
  2011-06-14 10:43   ` Hugh Dickins
@ 2011-06-18 21:52     ` Andrew Morton
  -1 siblings, 0 replies; 71+ messages in thread
From: Andrew Morton @ 2011-06-18 21:52 UTC (permalink / raw)
  To: Hugh Dickins; +Cc: linux-kernel, linux-mm

On Tue, 14 Jun 2011 03:43:47 -0700 (PDT) Hugh Dickins <hughd@google.com> wrote:

> --- linux.orig/mm/filemap.c	2011-06-13 13:26:44.430284135 -0700
> +++ linux/mm/filemap.c	2011-06-13 13:27:34.526532556 -0700
> @@ -717,9 +717,12 @@ repeat:
>  		page = radix_tree_deref_slot(pagep);
>  		if (unlikely(!page))
>  			goto out;
> -		if (radix_tree_deref_retry(page))
> +		if (radix_tree_exception(page)) {
> +			if (radix_tree_exceptional_entry(page))
> +				goto out;
> +			/* radix_tree_deref_retry(page) */
>  			goto repeat;
> -
> +		}
>  		if (!page_cache_get_speculative(page))
>  			goto repeat;

All the crap^Wnice changes made to filemap.c really need some comments,
please.  Particularly when they're keyed off the bland-sounding
"radix_tree_exception()".  Apparently they have something to do with
swap, but how is the poor reader to know this?

Also, commenting out a function call might be meaningful information for
Hugh-right-now, but for other people later on, they're just a big WTF.


^ permalink raw reply	[flat|nested] 71+ messages in thread

* Re: [PATCH 2/12] mm: let swap use exceptional entries
  2011-06-14 10:43   ` Hugh Dickins
@ 2011-06-18 21:55     ` Andrew Morton
  -1 siblings, 0 replies; 71+ messages in thread
From: Andrew Morton @ 2011-06-18 21:55 UTC (permalink / raw)
  To: Hugh Dickins; +Cc: linux-kernel, linux-mm

On Tue, 14 Jun 2011 03:43:47 -0700 (PDT) Hugh Dickins <hughd@google.com> wrote:

> In an i386 kernel this limits its information (type and page offset)
> to 30 bits: given 32 "types" of swapfile and 4kB pagesize, that's
> a maximum swapfile size of 128GB.  Which is less than the 512GB we
> previously allowed with X86_PAE (where the swap entry can occupy the
> entire upper 32 bits of a pte_t), but not a new limitation on 32-bit
> without PAE; and there's not a new limitation on 64-bit (where swap
> filesize is already limited to 16TB by a 32-bit page offset).

hm.

>  Thirty
> areas of 128GB is probably still enough swap for a 64GB 32-bit machine.

What if it was only one area?  128GB is close enough to 64GB (or, more
realistically, 32GB) to be significant.  For the people out there who
are using a single 200GB swap partition and actually needed that much,
what happens?  swapon fails?


^ permalink raw reply	[flat|nested] 71+ messages in thread

* Re: [PATCH 2/12] mm: let swap use exceptional entries
  2011-06-18 21:52     ` Andrew Morton
@ 2011-07-12 22:08       ` Hugh Dickins
  -1 siblings, 0 replies; 71+ messages in thread
From: Hugh Dickins @ 2011-07-12 22:08 UTC (permalink / raw)
  To: Andrew Morton; +Cc: linux-kernel, linux-mm

On Sat, 18 Jun 2011, Andrew Morton wrote:
> On Tue, 14 Jun 2011 03:43:47 -0700 (PDT) Hugh Dickins <hughd@google.com> wrote:
> 
> > --- linux.orig/mm/filemap.c	2011-06-13 13:26:44.430284135 -0700
> > +++ linux/mm/filemap.c	2011-06-13 13:27:34.526532556 -0700
> > @@ -717,9 +717,12 @@ repeat:
> >  		page = radix_tree_deref_slot(pagep);
> >  		if (unlikely(!page))
> >  			goto out;
> > -		if (radix_tree_deref_retry(page))
> > +		if (radix_tree_exception(page)) {
> > +			if (radix_tree_exceptional_entry(page))
> > +				goto out;
> > +			/* radix_tree_deref_retry(page) */
> >  			goto repeat;
> > -
> > +		}
> >  		if (!page_cache_get_speculative(page))
> >  			goto repeat;
> 
> All the crap^Wnice changes made to filemap.c really need some comments,
> please.  Particularly when they're keyed off the bland-sounding
> "radix_tree_exception()".  Apparently they have something to do with
> swap, but how is the poor reader to know this?

The naming was intentionally bland, because other filesystems might
in future have other uses for such exceptional entries.

(I think the field size would generally defeat it, but you can,
for example, imagine a small filesystem wanting to save sector number
there when a page is evicted.)

But let's go bland when it's more familiar, and such uses materialize -
particularly since I only placed those checks in places where they're
needed now for shmem/tmpfs/swap.

I'll keep the bland naming, if that's okay, but send a patch adding
a line of comment in such places.  Mentioning shmem, tmpfs, swap.

> 
> Also, commenting out a function call might be meaningful information for
> Hugh-right-now, but for other people later on, they're just a big WTF.

Ah yes, I hadn't realized at all that those look like commented-out
function calls.  No, they're comments on what the else case is that
we have arrived at there.  I'll make those clearer too.
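
For example, in find_get_page() the intention is a comment along these
lines (a sketch only, exact wording to be settled in that patch):

		if (radix_tree_exception(page)) {
			if (radix_tree_deref_retry(page))
				/*
				 * Transient condition which can only trigger
				 * when entry at index 0 moves out of or back
				 * to root: none yet gotten, safe to restart.
				 */
				goto repeat;
			/*
			 * Otherwise, shmem/tmpfs must be storing a swap entry
			 * here as an exceptional entry: so return it without
			 * attempting to raise page count.
			 */
			goto out;
		}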

Hugh

^ permalink raw reply	[flat|nested] 71+ messages in thread

* Re: [PATCH 2/12] mm: let swap use exceptional entries
  2011-06-18 21:55     ` Andrew Morton
@ 2011-07-12 22:35       ` Hugh Dickins
  -1 siblings, 0 replies; 71+ messages in thread
From: Hugh Dickins @ 2011-07-12 22:35 UTC (permalink / raw)
  To: Andrew Morton; +Cc: linux-kernel, linux-mm

On Sat, 18 Jun 2011, Andrew Morton wrote:
> On Tue, 14 Jun 2011 03:43:47 -0700 (PDT) Hugh Dickins <hughd@google.com> wrote:
> 
> > In an i386 kernel this limits its information (type and page offset)
> > to 30 bits: given 32 "types" of swapfile and 4kB pagesize, that's
> > a maximum swapfile size of 128GB.  Which is less than the 512GB we
> > previously allowed with X86_PAE (where the swap entry can occupy the
> > entire upper 32 bits of a pte_t), but not a new limitation on 32-bit
> > without PAE; and there's not a new limitation on 64-bit (where swap
> > filesize is already limited to 16TB by a 32-bit page offset).
> 
> hm.
> 
> >  Thirty
> > areas of 128GB is probably still enough swap for a 64GB 32-bit machine.
> 
> What if it was only one area?  128GB is close enough to 64GB (or, more
> realistically, 32GB) to be significant.  For the people out there who
> are using a single 200GB swap partition and actually needed that much,
> what happens?  swapon fails?

No, it doesn't fail: it just trims back the amount of swap that is used
(and counted) to the maximum that the running kernel supports (just like
when you switch between 64bit and 32bit-PAE and 32bit-nonPAE kernels
using the same large swap device, the 64bit being able to access more
of it than the 32bit-PAE kernel, and that more than the 32bit-nonPAE).
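
The clamp is applied when the swap header is read, roughly like this
(a sketch of the mm/swapfile.c logic, not the exact hunk):

	/* maximum page offset that a swp_entry_t / swap pte can encode */
	maxpages = swp_offset(pte_to_swp_entry(
			swp_entry_to_pte(swp_entry(0, ~0UL)))) + 1;
	last_page = swap_header->info.last_page;
	if (last_page > maxpages)
		printk(KERN_WARNING
		       "Truncating oversized swap area, only using %luk out of %luk\n",
		       maxpages << (PAGE_SHIFT - 10),
		       last_page << (PAGE_SHIFT - 10));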

I'd grown to think that the users of large amounts of RAM may like to
have a little swap for leeway, but live in dread of the slow death that a
large amount of swap can result in.  Maybe that's just one class of user.

I'd worry more about this if it were a new limitation for 64bit; but it's
just a lower limitation for the 32bit-PAE case.  If it actually proves
to be an issue (and we abandon our usual mantra to go to 64bit), then I
don't think having 32 distinct areas is sacrosanct: we can (configurably
or tunably) lower the number of areas and increase their size; but I
doubt we shall need to bother.
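
The arithmetic behind that trade-off, for the i386 case being discussed
(illustrative names, not the exact swapops.h macros):

/* 32-bit swp_entry_t minus the 2 radix-tree marker bits = 30 info bits */
#define SWP_INFO_BITS	30
#define SWP_TYPE_BITS	5			/* 2^5 = 32 swap areas */
#define SWP_OFFSET_BITS	(SWP_INFO_BITS - SWP_TYPE_BITS)	/* 25-bit page offset */
/*
 * 2^25 pages * 4kB per page = 128GB per area; dropping one type bit
 * would halve the number of areas and double each one to 256GB.
 */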

ARM is getting LPAE?  Then I guess this is a good moment to enforce
the new limit.

Hugh

^ permalink raw reply	[flat|nested] 71+ messages in thread

* Re: [PATCH 1/12] radix_tree: exceptional entries and indices
  2011-06-18 21:48         ` Andrew Morton
@ 2011-07-12 22:56           ` Hugh Dickins
  -1 siblings, 0 replies; 71+ messages in thread
From: Hugh Dickins @ 2011-07-12 22:56 UTC (permalink / raw)
  To: Andrew Morton; +Cc: linux-kernel, linux-mm

On Sat, 18 Jun 2011, Andrew Morton wrote:
> On Fri, 17 Jun 2011 17:13:38 -0700 (PDT) Hugh Dickins <hughd@google.com> wrote:
> > On Fri, 17 Jun 2011, Andrew Morton wrote:
> > > On Tue, 14 Jun 2011 03:42:27 -0700 (PDT)
> > > Hugh Dickins <hughd@google.com> wrote:
> > > 
> > > > The low bit of a radix_tree entry is already used to denote an indirect
> > > > pointer, for internal use, and the unlikely radix_tree_deref_retry() case.
> > > > Define the next bit as denoting an exceptional entry, and supply inline
> > > > functions radix_tree_exception() to return non-0 in either unlikely case,
> > > > and radix_tree_exceptional_entry() to return non-0 in the second case.
> > > 
> > > Yes, the RADIX_TREE_INDIRECT_PTR hack is internal-use-only, and doesn't
> > > operate on (and hence doesn't corrupt) client-provided items.
> > > 
> > > This patch uses bit 1 and uses it against client items, so for
> > > practical purposes it can only be used when the client is storing
> > > addresses.  And it needs new APIs to access that flag.
> > > 
> > > All a bit ugly.  Why not just add another tag for this?  Or reuse an
> > > existing tag if the current tags aren't all used for these types of
> > > pages?
> > 
> > I couldn't see how to use tags without losing the "lockless" lookups:
> 
> So lockless pagecache broke the radix-tree tag-versus-item coherency as
> well as the address_space nrpages-vs-radix-tree coherency.

I don't think that remark is fair to lockless pagecache at all.  If we
want the scalability advantage of lockless lookup, yes, we don't have
strict coherency with tagging at that time.  But those places that need
to worry about that coherency, can lock to do so.

> Isn't it fun learning these things.
> 
> > because the tag is a separate bit from the entry itself, unless you're
> > under tree_lock, there would be races when changing from page pointer
> > to swap entry or back, when slot was updated but tag not or vice versa.
> 
> So...  take tree_lock?

I wouldn't call that an improvement...

> What effect does that have?

... but admit I have not measured: I rather assume that if we now change
tmpfs from lockless to locked lookup, someone else will soon come up with
the regression numbers.

> It'd better be
> "really bad", because this patchset does nothing at all to improve core
> MM maintainability :(

I was aiming to improve shmem.c maintainability; and you have good grounds
to accuse me of hurting shmem.c maintainability when I highmem-ized the
swap vector nine years ago.

I was not aiming to improve core MM maintainability, nor to harm it.
I am extending the use to which the radix-tree can be put, but is that
so bad?

Hugh

^ permalink raw reply	[flat|nested] 71+ messages in thread

* Re: [PATCH 1/12] radix_tree: exceptional entries and indices
@ 2011-07-12 22:56           ` Hugh Dickins
  0 siblings, 0 replies; 71+ messages in thread
From: Hugh Dickins @ 2011-07-12 22:56 UTC (permalink / raw)
  To: Andrew Morton; +Cc: linux-kernel, linux-mm

On Sat, 18 Jun 2011, Andrew Morton wrote:
> On Fri, 17 Jun 2011 17:13:38 -0700 (PDT) Hugh Dickins <hughd@google.com> wrote:
> > On Fri, 17 Jun 2011, Andrew Morton wrote:
> > > On Tue, 14 Jun 2011 03:42:27 -0700 (PDT)
> > > Hugh Dickins <hughd@google.com> wrote:
> > > 
> > > > The low bit of a radix_tree entry is already used to denote an indirect
> > > > pointer, for internal use, and the unlikely radix_tree_deref_retry() case.
> > > > Define the next bit as denoting an exceptional entry, and supply inline
> > > > functions radix_tree_exception() to return non-0 in either unlikely case,
> > > > and radix_tree_exceptional_entry() to return non-0 in the second case.
> > > 
> > > Yes, the RADIX_TREE_INDIRECT_PTR hack is internal-use-only, and doesn't
> > > operate on (and hence doesn't corrupt) client-provided items.
> > > 
> > > This patch uses bit 1 and uses it against client items, so for
> > > practical purpoese it can only be used when the client is storing
> > > addresses.  And it needs new APIs to access that flag.
> > > 
> > > All a bit ugly.  Why not just add another tag for this?  Or reuse an
> > > existing tag if the current tags aren't all used for these types of
> > > pages?
> > 
> > I couldn't see how to use tags without losing the "lockless" lookups:
> 
> So lockless pagecache broke the radix-tree tag-versus-item coherency as
> well as the address_space nrpages-vs-radix-tree coherency.

I don't think that remark is fair to lockless pagecache at all.  If we
want the scalability advantage of lockless lookup, yes, we don't have
strict coherency with tagging at that time.  But those places that need
to worry about that coherency, can lock to do so.

> Isn't it fun learning these things.
> 
> > because the tag is a separate bit from the entry itself, unless you're
> > under tree_lock, there would be races when changing from page pointer
> > to swap entry or back, when slot was updated but tag not or vice versa.
> 
> So...  take tree_lock?

I wouldn't call that an improvement...

> What effect does that have?

... but admit I have not measured: I rather assume that if we now change
tmpfs from lockless to locked lookup, someone else will soon come up with
the regression numbers.

> It'd better be
> "really bad", because this patchset does nothing at all to improve core
> MM maintainability :(

I was aiming to improve shmem.c maintainability; and you have good grounds
to accuse me of hurting shmem.c maintainability when I highmem-ized the
swap vector nine years ago.

I was not aiming to improve core MM maintainability, nor to harm it.
I am extending the use to which the radix-tree can be put, but is that
so bad?
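
For reference, the helpers described in the quoted changelog amount to
roughly the following (a sketch reconstructed from that description; the
patch itself is authoritative):

#define RADIX_TREE_INDIRECT_PTR		1	/* low bit: internal use only */
#define RADIX_TREE_EXCEPTIONAL_ENTRY	2	/* next bit: not a page pointer */

static inline int radix_tree_exception(void *arg)
{
	/* non-0 in either unlikely case: indirect/retry or exceptional */
	return unlikely((unsigned long)arg &
		(RADIX_TREE_INDIRECT_PTR | RADIX_TREE_EXCEPTIONAL_ENTRY));
}

static inline int radix_tree_exceptional_entry(void *arg)
{
	/* non-0 only for an exceptional entry, such as a shmem swap entry */
	return (unsigned long)arg & RADIX_TREE_EXCEPTIONAL_ENTRY;
}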

Hugh

^ permalink raw reply	[flat|nested] 71+ messages in thread

* Re: [PATCH 1/12] radix_tree: exceptional entries and indices
  2011-07-12 22:56           ` Hugh Dickins
@ 2011-07-12 23:24             ` Andrew Morton
  -1 siblings, 0 replies; 71+ messages in thread
From: Andrew Morton @ 2011-07-12 23:24 UTC (permalink / raw)
  To: Hugh Dickins; +Cc: linux-kernel, linux-mm

<tries to remember what this is all about>

On Tue, 12 Jul 2011 15:56:14 -0700 (PDT)
Hugh Dickins <hughd@google.com> wrote:

> On Sat, 18 Jun 2011, Andrew Morton wrote:
> > On Fri, 17 Jun 2011 17:13:38 -0700 (PDT) Hugh Dickins <hughd@google.com> wrote:
> > > On Fri, 17 Jun 2011, Andrew Morton wrote:
> > > > On Tue, 14 Jun 2011 03:42:27 -0700 (PDT)
> > > > Hugh Dickins <hughd@google.com> wrote:
> > > > 
> > > > > The low bit of a radix_tree entry is already used to denote an indirect
> > > > > pointer, for internal use, and the unlikely radix_tree_deref_retry() case.
> > > > > Define the next bit as denoting an exceptional entry, and supply inline
> > > > > functions radix_tree_exception() to return non-0 in either unlikely case,
> > > > > and radix_tree_exceptional_entry() to return non-0 in the second case.
> > > > 
> > > > Yes, the RADIX_TREE_INDIRECT_PTR hack is internal-use-only, and doesn't
> > > > operate on (and hence doesn't corrupt) client-provided items.
> > > > 
> > > > This patch uses bit 1 and uses it against client items, so for
> > > > practical purposes it can only be used when the client is storing
> > > > addresses.  And it needs new APIs to access that flag.
> > > > 
> > > > All a bit ugly.  Why not just add another tag for this?  Or reuse an
> > > > existing tag if the current tags aren't all used for these types of
> > > > pages?
> > > 
> > > I couldn't see how to use tags without losing the "lockless" lookups:
> > 
> > So lockless pagecache broke the radix-tree tag-versus-item coherency as
> > well as the address_space nrpages-vs-radix-tree coherency.
> 
> I don't think that remark is fair to lockless pagecache at all.  If we
> want the scalability advantage of lockless lookup, yes, we don't have
> strict coherency with tagging at that time.  But those places that need
> to worry about that coherency, can lock to do so.

Nobody thought about these issues, afaik.  Things have broken and the
code has become significantly more complex/fragile.

Does the locking in mapping_tagged() make any sense?

> > Isn't it fun learning these things.
> > 
> > > because the tag is a separate bit from the entry itself, unless you're
> > > under tree_lock, there would be races when changing from page pointer
> > > to swap entry or back, when slot was updated but tag not or vice versa.
> > 
> > So...  take tree_lock?
> 
> I wouldn't call that an improvement...

I wouldn't call the proposed changes to radix-tree.c an improvement,
either.  It's an expedient, once-off, single-caller hack.

If the cost of adding locking is negligible then that is a superior fix.

> > What effect does that have?
> 
> ... but admit I have not measured: I rather assume that if we now change
> tmpfs from lockless to locked lookup, someone else will soon come up with
> the regression numbers.
> 
> > It'd better be
> > "really bad", because this patchset does nothing at all to improve core
> > MM maintainability :(
> 
> I was aiming to improve shmem.c maintainability; and you have good grounds
> to accuse me of hurting shmem.c maintainability when I highmem-ized the
> swap vector nine years ago.
> 
> I was not aiming to improve core MM maintainability, nor to harm it.
> I am extending the use to which the radix-tree can be put, but is that
> so bad?

I find it hard to believe that this wart added to the side of the
radix-tree code will find any other users.  And the wart spreads
contagion into core filemap pagecache lookup.

It's pretty nasty stuff.  Please, what is a better way of doing all this?

^ permalink raw reply	[flat|nested] 71+ messages in thread

* Re: [PATCH 1/12] radix_tree: exceptional entries and indices
@ 2011-07-12 23:24             ` Andrew Morton
  0 siblings, 0 replies; 71+ messages in thread
From: Andrew Morton @ 2011-07-12 23:24 UTC (permalink / raw)
  To: Hugh Dickins; +Cc: linux-kernel, linux-mm

<tries to remember what this is all about>

On Tue, 12 Jul 2011 15:56:14 -0700 (PDT)
Hugh Dickins <hughd@google.com> wrote:

> On Sat, 18 Jun 2011, Andrew Morton wrote:
> > On Fri, 17 Jun 2011 17:13:38 -0700 (PDT) Hugh Dickins <hughd@google.com> wrote:
> > > On Fri, 17 Jun 2011, Andrew Morton wrote:
> > > > On Tue, 14 Jun 2011 03:42:27 -0700 (PDT)
> > > > Hugh Dickins <hughd@google.com> wrote:
> > > > 
> > > > > The low bit of a radix_tree entry is already used to denote an indirect
> > > > > pointer, for internal use, and the unlikely radix_tree_deref_retry() case.
> > > > > Define the next bit as denoting an exceptional entry, and supply inline
> > > > > functions radix_tree_exception() to return non-0 in either unlikely case,
> > > > > and radix_tree_exceptional_entry() to return non-0 in the second case.
> > > > 
> > > > Yes, the RADIX_TREE_INDIRECT_PTR hack is internal-use-only, and doesn't
> > > > operate on (and hence doesn't corrupt) client-provided items.
> > > > 
> > > > This patch uses bit 1 and uses it against client items, so for
> > > > practical purposes it can only be used when the client is storing
> > > > addresses.  And it needs new APIs to access that flag.
> > > > 
> > > > All a bit ugly.  Why not just add another tag for this?  Or reuse an
> > > > existing tag if the current tags aren't all used for these types of
> > > > pages?
> > > 
> > > I couldn't see how to use tags without losing the "lockless" lookups:
> > 
> > So lockless pagecache broke the radix-tree tag-versus-item coherency as
> > well as the address_space nrpages-vs-radix-tree coherency.
> 
> I don't think that remark is fair to lockless pagecache at all.  If we
> want the scalability advantage of lockless lookup, yes, we don't have
> strict coherency with tagging at that time.  But those places that need
> to worry about that coherency, can lock to do so.

Nobody thought about these issues, afaik.  Things have broken and the
code has become significantly more complex/fragile.

Does the locking in mapping_tagged() make any sense?

> > Isn't it fun learning these things.
> > 
> > > because the tag is a separate bit from the entry itself, unless you're
> > > under tree_lock, there would be races when changing from page pointer
> > > to swap entry or back, when slot was updated but tag not or vice versa.
> > 
> > So...  take tree_lock?
> 
> I wouldn't call that an improvement...

I wouldn't call the proposed changes to radix-tree.c an improvement,
either.  It's an expedient, once-off, single-caller hack.

If the cost of adding locking is negligible then that is a superior fix.

> > What effect does that have?
> 
> ... but admit I have not measured: I rather assume that if we now change
> tmpfs from lockless to locked lookup, someone else will soon come up with
> the regression numbers.
> 
> > It'd better be
> > "really bad", because this patchset does nothing at all to improve core
> > MM maintainability :(
> 
> I was aiming to improve shmem.c maintainability; and you have good grounds
> to accuse me of hurting shmem.c maintainability when I highmem-ized the
> swap vector nine years ago.
> 
> I was not aiming to improve core MM maintainability, nor to harm it.
> I am extending the use to which the radix-tree can be put, but is that
> so bad?

I find it hard to believe that this wart added to the side of the
radix-tree code will find any other users.  And the wart spreads
contagion into core filemap pagecache lookup.

It's pretty nasty stuff.  Please, what is a better way of doing all this?

^ permalink raw reply	[flat|nested] 71+ messages in thread

* Re: [PATCH 1/12] radix_tree: exceptional entries and indices
  2011-07-12 23:24             ` Andrew Morton
@ 2011-07-13 22:27               ` Hugh Dickins
  -1 siblings, 0 replies; 71+ messages in thread
From: Hugh Dickins @ 2011-07-13 22:27 UTC (permalink / raw)
  To: Andrew Morton; +Cc: linux-kernel, linux-mm

On Tue, 12 Jul 2011, Andrew Morton wrote:
> Hugh Dickins <hughd@google.com> wrote:
> > On Sat, 18 Jun 2011, Andrew Morton wrote:
> > > On Fri, 17 Jun 2011 17:13:38 -0700 (PDT) Hugh Dickins <hughd@google.com> wrote:
> > > > 
> > > > I couldn't see how to use tags without losing the "lockless" lookups:
> > > 
> > > So lockless pagecache broke the radix-tree tag-versus-item coherency as
> > > well as the address_space nrpages-vs-radix-tree coherency.
> > 
> > I don't think that remark is fair to lockless pagecache at all.  If we
> > want the scalability advantage of lockless lookup, yes, we don't have
> > strict coherency with tagging at that time.  But those places that need
> > to worry about that coherency, can lock to do so.
> 
> Nobody thought about these issues, afaik.  Things have broken and the
> code has become significantly more complex/fragile.
> 
> Does the locking in mapping_tagged() make any sense?

Not really, but it's reassuring for mapping_mapped(),
which doesn't know how radix_tree_tagged() is implemented nowadays :)

I think the "lockless pagecache" change there was a blanket replacement
of read_lock_irqsave() by rcu_read_lock().  The effective change came
two years earlier, when radix_tree_tagged() changed from looking at
an array in the rnode to looking at a bit in the root.
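
From memory of the 3.0-era lib/radix-tree.c, the test is now just a bit
check on the root, with no descent and no need for tree_lock (sketch only;
check the source for the exact form):

/* per-tree tag bits are packed into the high bits of root->gfp_mask */
static inline int root_tag_get(struct radix_tree_root *root, unsigned int tag)
{
	return (__force unsigned)root->gfp_mask & (1 << (tag + __GFP_BITS_SHIFT));
}

int radix_tree_tagged(struct radix_tree_root *root, unsigned int tag)
{
	return root_tag_get(root, tag);
}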

> 
> > > Isn't it fun learning these things.
> > > 
> > > > because the tag is a separate bit from the entry itself, unless you're
> > > > under tree_lock, there would be races when changing from page pointer
> > > > to swap entry or back, when slot was updated but tag not or vice versa.
> > > 
> > > So...  take tree_lock?
> > 
> > I wouldn't call that an improvement...
> 
> I wouldn't call the proposed changes to radix-tree.c an improvement,
> either.  It's an expedient, once-off, single-caller hack.

I do hope you're using "radix-tree.c" there as shorthand for the
scattered little additions to filemap.c and radix-tree.h.  (And
I just did a diffstat on the filemap.c changes, they stand at 44
insertions, 44 deletions - mapping_cap_swap_backed stuff goes away.)

The only change to radix-tree.c is adding an indices argument to
radix_tree_gang_lookup_slot() and __lookup(): because hitherto the
gang lookup code has implicitly assumed that it's being used on
struct page pointers, or something like, from which each index
can be deduced by the caller.

Whilst that addition offers nothing to existing users who can deduce
the index of each item found, I claim that it's a clear improvement
to the interface.
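
In sketch, the changed prototype looks like this (see the patch for the
exact version); existing callers just pass indices == NULL, and the
examine() caller below is purely hypothetical:

unsigned int
radix_tree_gang_lookup_slot(struct radix_tree_root *root,
			void ***results, unsigned long *indices,
			unsigned long first_index, unsigned int max_items);

	void **slots[16];		/* slots found by the gang lookup */
	unsigned long indices[16];	/* and the index of each one */
	unsigned int i, nr;

	rcu_read_lock();
	nr = radix_tree_gang_lookup_slot(&mapping->page_tree, slots, indices,
					 start, 16);
	for (i = 0; i < nr; i++)
		examine(indices[i], radix_tree_deref_slot(slots[i]));
	rcu_read_unlock();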

If that's unacceptable to you, then perhaps shmem.c needs its own
variant of the code in radix-tree.c, to fill in that lacuna.  I chose
that approach for some parts of the code (e.g. shmem_file_splice_read),
but I'd really prefer not to duplicate large parts of radix-tree.c.
No, it would be better to stick with shmem.c's own peculiar radix tree
in that case, though I don't relish extending it to larger filesizes.

(I should warn that I do have an expedient, once-off, single-caller
hack to come for radix-tree.c.  As was admitted in the changelog, the
loser in these shmem.c changes is swapoff, where shmem_unuse does get
slower.  I had a shock when I found it 20 times slower on this laptop:
though later most of that turned out to be an artifact of lockdep
and prove_rcu.  But even without the debug, I've found the gang lookup
method too slow, tried a number of different things including tagging,
but the only change which has given clear benefit is doing the lookup
directly on the rnodes, instead of gang lookup going back and forth.)

> 
> If the cost of adding locking is negligible then that is a superior fix.

Various people, first at SGI then more recently at Intel, have chipped
away at the locking in shmem_getpage(), citing this or that benchmark.
Locking here is not negligible for them, and I'm trying to extend their
work, not regress it.

> 
> > > What effect does that have?
> > 
> > ... but admit I have not measured: I rather assume that if we now change
> > tmpfs from lockless to locked lookup, someone else will soon come up with
> > the regression numbers.
> > 
> > > It'd better be
> > > "really bad", because this patchset does nothing at all to improve core
> > > MM maintainability :(
> > 
> > I was aiming to improve shmem.c maintainability; and you have good grounds
> > to accuse me of hurting shmem.c maintainability when I highmem-ized the
> > swap vector nine years ago.
> > 
> > I was not aiming to improve core MM maintainability, nor to harm it.
> > I am extending the use to which the radix-tree can be put, but is that
> > so bad?
> 
> I find it hard to believe that this wart added to the side of the
> radix-tree code will find any other users.  And the wart spreads
> contagion into core filemap pagecache lookup.
> 
> It's pretty nasty stuff.  Please, what is a better way of doing all this?

I cannot offer you a better way of doing this: if I thought there were
a better way, then that's what I would have implemented.  But I can list
some alternative ways, which I think are inferior, but you might prefer.

Though before that, I'd better remind us of the reasons for making any
change at all: support MAX_LFS_FILESIZE; simpler and more maintainable
shmem.c; more scalable shmem_getpage(); less memory consumption.

Alternatives:

1. Stick with the status quo, whether that's defined as what's in
   2.6.39, or what's in 3.0-rc, or what's in 3.0-rc plus mmotm patches
   prior to tmpfs-clone-shmem_file_splice_read, or what's in 3.0-rc plus
   mmotm patches prior to radix_tree-exceptional-entries-and-indices.
   There is one scalability fix in there (tmpfs-no-need-to-use-i_lock),
   and some shmem_getpage simplification, though not as much as I'd like.
   Does nothing for MAX_LFS_FILESIZE, maintainability, memory consumption.

2. Same as 1, plus work to extend shmem's own radix tree (swap vector)
   to MAX_LFS_FILESIZE.  No doubt doable, but reduces maintainability,
   increases memory consumption slightly (by added code at least) -
   and FWIW not work that I'm at all attracted to doing!

3. Same as 1, plus work to change shmem over to a different radix tree
   to meet MAX_LFS_FILESIZE: let's use the kind served by radix-tree.c,
   and assume that the indices addition there is acceptable after all.
   Keep away from filemap.c changes by using two radix trees for each
   shmem inode, a pagecache tree and a swap entry tree.  Manage the
   two trees together in the same way as at present, shmem_getpage
   holding info->lock to prevent races between them at awkward moments.
   Improves maintainability (gets rid of the highmem swap vector), may
   reduce memory consumption somewhat (less code, less-than-pagesize
   radix tree nodes), but no scalability or shmem_getpage simplification,
   and less efficient checks on swap entries (more cacheline accesses
   than before) - lowering performance when initially populating pages.

4. Same as 3, but combine those two radix trees into one, since the
   empty slots in one are precisely the occupied slots in the other:
   so fit swap entries into pagecache radix tree in the same spirit
   as we have always fitted swap entries into the pagetables (the
   encoding is sketched just after this list).
   Keep away from filemap.c changes: avoid those unlikely path
   radix_tree_exceptional_entry tests in find_get_page, find_lock_page,
   find_get_pages, find_get_pages_contig by duplicating such code into
   static internal shmem_ variants of each.  May still need a hack in
   invalidate_mapping_pages or find_get_pages, I believe that's one way
   generic code can still arrive at shmem.  Consider tagging swap entries
   instead of marking them exceptional, but I think that will add overhead
   to shmem_getpage and shmem_truncate fast paths (need separate descent
   to tags to get type of each entry found, unless radix-tree.c extended
   to pass that info back from the same descent).  This approach achieves
   most of the goals, but duplicates code, increasing kernel size.

5. What's already in mmotm (plus further comments in filemap.c, and
   radix_tree_locate_item patch to increase speed of shmem swapoff).
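
The swap-entry encoding that 4 and 5 rely on is, in sketch, along these
lines (patch 2/12 adds helpers of this shape to include/linux/swapops.h;
consult the patch for the exact definitions):

#define RADIX_TREE_EXCEPTIONAL_SHIFT	2	/* entry value sits above the flag bits */

static inline void *swp_to_radix_entry(swp_entry_t entry)
{
	unsigned long value = entry.val << RADIX_TREE_EXCEPTIONAL_SHIFT;

	return (void *)(value | RADIX_TREE_EXCEPTIONAL_ENTRY);
}

static inline swp_entry_t radix_to_swp_entry(void *arg)
{
	swp_entry_t entry;

	entry.val = (unsigned long)arg >> RADIX_TREE_EXCEPTIONAL_SHIFT;
	return entry;
}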

Hugh

^ permalink raw reply	[flat|nested] 71+ messages in thread

* Re: [PATCH 1/12] radix_tree: exceptional entries and indices
@ 2011-07-13 22:27               ` Hugh Dickins
  0 siblings, 0 replies; 71+ messages in thread
From: Hugh Dickins @ 2011-07-13 22:27 UTC (permalink / raw)
  To: Andrew Morton; +Cc: linux-kernel, linux-mm

On Tue, 12 Jul 2011, Andrew Morton wrote:
> Hugh Dickins <hughd@google.com> wrote:
> > On Sat, 18 Jun 2011, Andrew Morton wrote:
> > > On Fri, 17 Jun 2011 17:13:38 -0700 (PDT) Hugh Dickins <hughd@google.com> wrote:
> > > > 
> > > > I couldn't see how to use tags without losing the "lockless" lookups:
> > > 
> > > So lockless pagecache broke the radix-tree tag-versus-item coherency as
> > > well as the address_space nrpages-vs-radix-tree coherency.
> > 
> > I don't think that remark is fair to lockless pagecache at all.  If we
> > want the scalability advantage of lockless lookup, yes, we don't have
> > strict coherency with tagging at that time.  But those places that need
> > to worry about that coherency, can lock to do so.
> 
> Nobody thought about these issues, afaik.  Things have broken and the
> code has become significantly more complex/fragile.
> 
> Does the locking in mapping_tagged() make any sense?

Not really, but it's reassuring for mapping_mapped(),
which doesn't know how radix_tree_tagged() is implemented nowadays :)

I think the "lockless pagecache" change there was a blanket replacement
of read_lock_irqsave() by rcu_read_lock().  The effective change came
two years earlier, when radix_tree_tagged() changed from looking at
an array in the rnode to looking at a bit in the root.

> 
> > > Isn't it fun learning these things.
> > > 
> > > > because the tag is a separate bit from the entry itself, unless you're
> > > > under tree_lock, there would be races when changing from page pointer
> > > > to swap entry or back, when slot was updated but tag not or vice versa.
> > > 
> > > So...  take tree_lock?
> > 
> > I wouldn't call that an improvement...
> 
> I wouldn't call the proposed changes to radix-tree.c an improvement,
> either.  It's an expedient, once-off, single-caller hack.

I do hope you're using "radix-tree.c" there as shorthand for the
scattered little additions to filemap.c and radix-tree.h.  (And
I just did a diffstat on the filemap.c changes, they stand at 44
insertions, 44 deletions - mapping_cap_swap_backed stuff goes away.)

The only change to radix-tree.c is adding an indices argument to
radix_tree_gang_lookup_slot() and __lookup(): because hitherto the
gang lookup code has implicitly assumed that it's being used on
struct page pointers, or something like, from which each index
can be deduced by the caller.

Whilst that addition offers nothing to existing users who can deduce
the index of each item found, I claim that it's a clear improvement
to the interface.

If that's unacceptable to you, then perhaps shmem.c needs its own
variant of the code in radix-tree.c, to fill in that lacuna.  I chose
that approach for some parts of the code (e.g. shmem_file_splice_read),
but I'd really prefer not to duplicate large parts of radix-tree.c.
No, it would be better to stick with shmem.c's own peculiar radix tree
in that case, though I don't relish extending it to larger filesizes.

(I should warn that I do have an expedient, once-off, single-caller
hack to come for radix-tree.c.  As was admitted in the changelog, the
loser in these shmem.c changes is swapoff, where shmem_unuse does get
slower.  I had a shock when I found it 20 times slower on this laptop:
though later most of that turned out to be an artifact of lockdep
and prove_rcu.  But even without the debug, I've found the gang lookup
method too slow, tried a number of different things including tagging,
but the only change which has given clear benefit is doing the lookup
directly on the rnodes, instead of gang lookup going back and forth.)

> 
> If the cost of adding locking is negligible then that is a superior fix.

Various people, first at SGI then more recently at Intel, have chipped
away at the locking in shmem_getpage(), citing this or that benchmark.
Locking here is not negligible for them, and I'm trying to extend their
work, not regress it.

> 
> > > What effect does that have?
> > 
> > ... but admit I have not measured: I rather assume that if we now change
> > tmpfs from lockless to locked lookup, someone else will soon come up with
> > the regression numbers.
> > 
> > > It'd better be
> > > "really bad", because this patchset does nothing at all to improve core
> > > MM maintainability :(
> > 
> > I was aiming to improve shmem.c maintainability; and you have good grounds
> > to accuse me of hurting shmem.c maintainability when I highmem-ized the
> > swap vector nine years ago.
> > 
> > I was not aiming to improve core MM maintainability, nor to harm it.
> > I am extending the use to which the radix-tree can be put, but is that
> > so bad?
> 
> I find it hard to believe that this wart added to the side of the
> radix-tree code will find any other users.  And the wart spreads
> contagion into core filemap pagecache lookup.
> 
> It's pretty nasty stuff.  Please, what is a better way of doing all this?

I cannot offer you a better way of doing this: if I thought there were
a better way, then that's what I would have implemented.  But I can list
some alternative ways, which I think are inferior, but you might prefer.

Though before that, I'd better remind us of the reasons for making any
change at all: support MAX_LFS_FILESIZE; simpler and more maintainable
shmem.c; more scalable shmem_getpage(); less memory consumption.

Alternatives:

1. Stick with the status quo, whether that's defined as what's in
   2.6.39, or what's in 3.0-rc, or what's in 3.0-rc plus mmotm patches
   prior to tmpfs-clone-shmem_file_splice_read, or what's in 3.0-rc plus
   mmotm patches prior to radix_tree-exceptional-entries-and-indices.
   There is one scalability fix in there (tmpfs-no-need-to-use-i_lock),
   and some shmem_getpage simplification, though not as much as I'd like.
   Does nothing for MAX_LFS_FILESIZE, maintainability, memory consumption.

2. Same as 1, plus work to extend shmem's own radix tree (swap vector)
   to MAX_LFS_FILESIZE.  No doubt doable, but reduces maintainability,
   increases memory consumption slightly (by added code at least) -
   and FWIW not work that I'm at all attracted to doing!

3. Same as 1, plus work to change shmem over to a different radix tree
   to meet MAX_LFS_FILESIZE: let's use the kind served by radix-tree.c,
   and assume that the indices addition there is acceptable after all.
   Keep away from filemap.c changes by using two radix trees for each
   shmem inode, a pagecache tree and a swap entry tree.  Manage the
   two trees together in the same way as at present, shmem_getpage
   holding info->lock to prevent races between them at awkward moments.
   Improves maintainability (gets rid of the highmem swap vector), may
   reduce memory consumption somewhat (less code, less-than-pagesize
   radix tree nodes), but no scalability or shmem_getpage simplification,
   and less efficient checks on swap entries (more cacheline accesses
   than before) - lowering performance when initially populating pages.

4. Same as 3, but combine those two radix trees into one, since the
   empty slots in one are precisely the occupied slots in the other:
   so fit swap entries into pagecache radix tree in the same spirit
   as we have always fitted swap entries into the pagetables.
   Keep away from filemap.c changes: avoid those unlikely path
   radix_tree_exceptional_entry tests in find_get_page, find_lock_page,
   find_get_pages, find_get_pages_contig by duplicating such code into
   static internal shmem_ variants of each.  May still need a hack in
   invalidate_mapping_pages or find_get_pages, I believe that's one way
   generic code can still arrive at shmem.  Consider tagging swap entries
   instead of marking them exceptional, but I think that will add overhead
   to shmem_getpage and shmem_truncate fast paths (need separate descent
   to tags to get type of each entry found, unless radix-tree.c extended
   to pass that info back from the same descent).  This approach achieves
   most of the goals, but duplicates code, increasing kernel size.

5. What's already in mmotm (plus further comments in filemap.c, and
   radix_tree_locate_item patch to increase speed of shmem swapoff).

Hugh

^ permalink raw reply	[flat|nested] 71+ messages in thread

* Re: [PATCH 2/12] mm: let swap use exceptional entries
  2011-07-12 22:08       ` Hugh Dickins
@ 2011-07-13 23:11         ` Andrew Morton
  -1 siblings, 0 replies; 71+ messages in thread
From: Andrew Morton @ 2011-07-13 23:11 UTC (permalink / raw)
  To: Hugh Dickins; +Cc: linux-kernel, linux-mm

On Tue, 12 Jul 2011 15:08:58 -0700 (PDT)
Hugh Dickins <hughd@google.com> wrote:

> > All the crap^Wnice changes made to filemap.c really need some comments,
> > please.  Particularly when they're keyed off the bland-sounding
> > "radix_tree_exception()".  Apparently they have something to do with
> > swap, but how is the poor reader to know this?
> 
> The naming was intentionally bland, because other filesystems might
> in future have other uses for such exceptional entries.
> 
> (I think the field size would generally defeat it, but you can,
> for example, imagine a small filesystem wanting to save sector number
> there when a page is evicted.)
> 
> But let's go bland when it's more familiar, and such uses materialize -
> particularly since I only placed those checks in places where they're
> needed now for shmem/tmpfs/swap.
> 
> I'll keep the bland naming, if that's okay, but send a patch adding
> a line of comment in such places.  Mentioning shmem, tmpfs, swap.

A better fix would be to create a nicely-documented filemap-specific
function with a non-bland name which simply wraps
radix_tree_exception().

^ permalink raw reply	[flat|nested] 71+ messages in thread

* Re: [PATCH 2/12] mm: let swap use exceptional entries
@ 2011-07-13 23:11         ` Andrew Morton
  0 siblings, 0 replies; 71+ messages in thread
From: Andrew Morton @ 2011-07-13 23:11 UTC (permalink / raw)
  To: Hugh Dickins; +Cc: linux-kernel, linux-mm

On Tue, 12 Jul 2011 15:08:58 -0700 (PDT)
Hugh Dickins <hughd@google.com> wrote:

> > All the crap^Wnice changes made to filemap.c really need some comments,
> > please.  Particularly when they're keyed off the bland-sounding
> > "radix_tree_exception()".  Apparently they have something to do with
> > swap, but how is the poor reader to know this?
> 
> The naming was intentionally bland, because other filesystems might
> in future have other uses for such exceptional entries.
> 
> (I think the field size would generally defeat it, but you can,
> for example, imagine a small filesystem wanting to save sector number
> there when a page is evicted.)
> 
> But let's go bland when it's more familiar, and such uses materialize -
> particularly since I only placed those checks in places where they're
> needed now for shmem/tmpfs/swap.
> 
> I'll keep the bland naming, if that's okay, but send a patch adding
> a line of comment in such places.  Mentioning shmem, tmpfs, swap.

A better fix would be to create a nicely-documented filemap-specific
function with a non-bland name which simply wraps
radix_tree_exception().

^ permalink raw reply	[flat|nested] 71+ messages in thread

* Re: [PATCH 1/12] radix_tree: exceptional entries and indices
  2011-06-18  1:52           ` Hugh Dickins
@ 2011-07-19 22:36             ` Hugh Dickins
  -1 siblings, 0 replies; 71+ messages in thread
From: Hugh Dickins @ 2011-07-19 22:36 UTC (permalink / raw)
  To: Randy Dunlap; +Cc: akpm, linux-kernel, linux-mm

On Fri, 17 Jun 2011, Hugh Dickins wrote:
> On Fri, 17 Jun 2011, Randy Dunlap wrote:
> 
> > > And one Andrew Morton has a userspace radix tree test harness at
> > > http://userweb.kernel.org/~akpm/stuff/rtth.tar.gz
> 
> This should still be as relevant as it was before, but I notice its
> radix_tree.c is almost identical to the source currently in the kernel
> tree, so I ought at the least to keep it in synch.

I was hoping to have dealt with this by now, Randy; but after downloading
an up-to-date urcu, I'm finding what's currently in rtth does not build
with it.  Unlikely to be hard to fix, but means I'll have to defer it a
little while longer.

Hugh

^ permalink raw reply	[flat|nested] 71+ messages in thread

* Re: [PATCH 1/12] radix_tree: exceptional entries and indices
@ 2011-07-19 22:36             ` Hugh Dickins
  0 siblings, 0 replies; 71+ messages in thread
From: Hugh Dickins @ 2011-07-19 22:36 UTC (permalink / raw)
  To: Randy Dunlap; +Cc: akpm, linux-kernel, linux-mm

On Fri, 17 Jun 2011, Hugh Dickins wrote:
> On Fri, 17 Jun 2011, Randy Dunlap wrote:
> 
> > > And one Andrew Morton has a userspace radix tree test harness at
> > > http://userweb.kernel.org/~akpm/stuff/rtth.tar.gz
> 
> This should still be as relevant as it was before, but I notice its
> radix_tree.c is almost identical to the source currently in the kernel
> tree, so I ought at the least to keep it in synch.

I was hoping to have dealt with this by now, Randy; but after downloading
an up-to-date urcu, I'm finding what's currently in rtth does not build
with it.  Unlikely to be hard to fix, but means I'll have to defer it a
little while longer.

Hugh

^ permalink raw reply	[flat|nested] 71+ messages in thread

* Re: [PATCH 2/12] mm: let swap use exceptional entries
  2011-07-13 23:11         ` Andrew Morton
@ 2011-07-19 22:46           ` Hugh Dickins
  -1 siblings, 0 replies; 71+ messages in thread
From: Hugh Dickins @ 2011-07-19 22:46 UTC (permalink / raw)
  To: Andrew Morton; +Cc: linux-kernel, linux-mm

On Wed, 13 Jul 2011, Andrew Morton wrote:
> On Tue, 12 Jul 2011 15:08:58 -0700 (PDT)
> Hugh Dickins <hughd@google.com> wrote:
> > 
> > I'll keep the bland naming, if that's okay, but send a patch adding
> > a line of comment in such places.  Mentioning shmem, tmpfs, swap.
> 
> A better fix would be to create a nicely-documented filemap-specific
> function with a non-bland name which simply wraps
> radix_tree_exception().

I did yesterday try out page_tree_entry_is_not_a_page() to wrap
radix_tree_exceptional_entry(); but (a) I'm wary of negative names,
(b) it was hard to explain why radix_tree_deref_retry() is not a
part of that case, and (c) does a further wrapper help or obscure?
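
(For the record, the discarded wrapper was essentially just this; the name
is the trial one mentioned above and purely illustrative:)

/* mm/filemap.c: trial wrapper, since dropped */
static inline int page_tree_entry_is_not_a_page(void *entry)
{
	/* shmem/tmpfs stores swap entries as exceptional entries */
	return radix_tree_exceptional_entry(entry);
}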

I've skirted the issue in the patch 3/3 I'm about to send you,
maybe you'll think it an improvement, maybe not: I'm neutral.

Hugh

^ permalink raw reply	[flat|nested] 71+ messages in thread

* Re: [PATCH 2/12] mm: let swap use exceptional entries
@ 2011-07-19 22:46           ` Hugh Dickins
  0 siblings, 0 replies; 71+ messages in thread
From: Hugh Dickins @ 2011-07-19 22:46 UTC (permalink / raw)
  To: Andrew Morton; +Cc: linux-kernel, linux-mm

On Wed, 13 Jul 2011, Andrew Morton wrote:
> On Tue, 12 Jul 2011 15:08:58 -0700 (PDT)
> Hugh Dickins <hughd@google.com> wrote:
> > 
> > I'll keep the bland naming, if that's okay, but send a patch adding
> > a line of comment in such places.  Mentioning shmem, tmpfs, swap.
> 
> A better fix would be to create a nicely-documented filemap-specific
> function with a non-bland name which simply wraps
> radix_tree_exception().

I did yesterday try out page_tree_entry_is_not_a_page() to wrap
radix_tree_exceptional_entry(); but (a) I'm wary of negative names,
(b) it was hard to explain why radix_tree_deref_retry() is not a
part of that case, and (c) does a further wrapper help or obscure?

I've skirted the issue in the patch 3/3 I'm about to send you,
maybe you'll think it an improvement, maybe not: I'm neutral.

Hugh

^ permalink raw reply	[flat|nested] 71+ messages in thread

* Re: [PATCH 1/12] radix_tree: exceptional entries and indices
  2011-07-19 22:36             ` Hugh Dickins
@ 2011-07-19 23:28               ` Randy Dunlap
  -1 siblings, 0 replies; 71+ messages in thread
From: Randy Dunlap @ 2011-07-19 23:28 UTC (permalink / raw)
  To: Hugh Dickins; +Cc: akpm, linux-kernel, linux-mm

On Tue, 19 Jul 2011 15:36:37 -0700 (PDT) Hugh Dickins wrote:

> On Fri, 17 Jun 2011, Hugh Dickins wrote:
> > On Fri, 17 Jun 2011, Randy Dunlap wrote:
> > 
> > > > And one Andrew Morton has a userspace radix tree test harness at
> > > > http://userweb.kernel.org/~akpm/stuff/rtth.tar.gz
> > 
> > This should still be as relevant as it was before, but I notice its
> > radix_tree.c is almost identical to the source currently in the kernel
> > tree, so I ought at the least to keep it in synch.
> 
> I was hoping to have dealt with this by now, Randy; but after downloading
> an up-to-date urcu, I'm finding what's currently in rtth does not build
> with it.  Unlikely to be hard to fix, but means I'll have to defer it a
> little while longer.

Sure, not a problem.  Thanks for not dropping it completely.

---
~Randy
*** Remember to use Documentation/SubmitChecklist when testing your code ***

^ permalink raw reply	[flat|nested] 71+ messages in thread

* Re: [PATCH 1/12] radix_tree: exceptional entries and indices
@ 2011-07-19 23:28               ` Randy Dunlap
  0 siblings, 0 replies; 71+ messages in thread
From: Randy Dunlap @ 2011-07-19 23:28 UTC (permalink / raw)
  To: Hugh Dickins; +Cc: akpm, linux-kernel, linux-mm

On Tue, 19 Jul 2011 15:36:37 -0700 (PDT) Hugh Dickins wrote:

> On Fri, 17 Jun 2011, Hugh Dickins wrote:
> > On Fri, 17 Jun 2011, Randy Dunlap wrote:
> > 
> > > > And one Andrew Morton has a userspace radix tree test harness at
> > > > http://userweb.kernel.org/~akpm/stuff/rtth.tar.gz
> > 
> > This should still be as relevant as it was before, but I notice its
> > radix_tree.c is almost identical to the source currently in the kernel
> > tree, so I ought at the least to keep it in synch.
> 
> I was hoping to have dealt with this by now, Randy; but after downloading
> an up-to-date urcu, I'm finding what's currently in rtth does not build
> with it.  Unlikely to be hard to fix, but means I'll have to defer it a
> little while longer.

Sure, not a problem.  Thanks for not dropping it completely.

---
~Randy
*** Remember to use Documentation/SubmitChecklist when testing your code ***

^ permalink raw reply	[flat|nested] 71+ messages in thread

end of thread, other threads:[~2011-07-19 23:29 UTC | newest]

Thread overview: 71+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2011-06-14 10:40 [PATCH 0/12] tmpfs: convert from old swap vector to radix tree Hugh Dickins
2011-06-14 10:40 ` Hugh Dickins
2011-06-14 10:42 ` [PATCH 1/12] radix_tree: exceptional entries and indices Hugh Dickins
2011-06-14 10:42   ` Hugh Dickins
2011-06-14 11:22   ` Pekka Enberg
2011-06-14 11:22     ` Pekka Enberg
2011-06-15  0:24     ` Hugh Dickins
2011-06-17 23:38   ` Andrew Morton
2011-06-17 23:38     ` Andrew Morton
2011-06-18  0:07     ` Randy Dunlap
2011-06-18  0:07       ` Randy Dunlap
2011-06-18  0:12       ` Randy Dunlap
2011-06-18  0:12         ` Randy Dunlap
2011-06-18  1:52         ` Hugh Dickins
2011-06-18  1:52           ` Hugh Dickins
2011-07-19 22:36           ` Hugh Dickins
2011-07-19 22:36             ` Hugh Dickins
2011-07-19 23:28             ` Randy Dunlap
2011-07-19 23:28               ` Randy Dunlap
2011-06-18  0:13     ` Hugh Dickins
2011-06-18  0:13       ` Hugh Dickins
2011-06-18 21:48       ` Andrew Morton
2011-06-18 21:48         ` Andrew Morton
2011-07-12 22:56         ` Hugh Dickins
2011-07-12 22:56           ` Hugh Dickins
2011-07-12 23:24           ` Andrew Morton
2011-07-12 23:24             ` Andrew Morton
2011-07-13 22:27             ` Hugh Dickins
2011-07-13 22:27               ` Hugh Dickins
2011-06-14 10:43 ` [PATCH 2/12] mm: let swap use exceptional entries Hugh Dickins
2011-06-14 10:43   ` Hugh Dickins
2011-06-18 21:52   ` Andrew Morton
2011-06-18 21:52     ` Andrew Morton
2011-07-12 22:08     ` Hugh Dickins
2011-07-12 22:08       ` Hugh Dickins
2011-07-13 23:11       ` Andrew Morton
2011-07-13 23:11         ` Andrew Morton
2011-07-19 22:46         ` Hugh Dickins
2011-07-19 22:46           ` Hugh Dickins
2011-06-18 21:55   ` Andrew Morton
2011-06-18 21:55     ` Andrew Morton
2011-07-12 22:35     ` Hugh Dickins
2011-07-12 22:35       ` Hugh Dickins
2011-06-14 10:45 ` [PATCH 3/12] tmpfs: demolish old swap vector support Hugh Dickins
2011-06-14 10:45   ` Hugh Dickins
2011-06-14 10:48 ` [PATCH 4/12] tmpfs: miscellaneous trivial cleanups Hugh Dickins
2011-06-14 10:48   ` Hugh Dickins
2011-06-14 10:49 ` [PATCH 5/12] tmpfs: copy truncate_inode_pages_range Hugh Dickins
2011-06-14 10:49   ` Hugh Dickins
2011-06-14 10:51 ` [PATCH 6/12] tmpfs: convert shmem_truncate_range to radix-swap Hugh Dickins
2011-06-14 10:51   ` Hugh Dickins
2011-06-14 10:52 ` [PATCH 7/12] tmpfs: convert shmem_unuse_inode " Hugh Dickins
2011-06-14 10:52   ` Hugh Dickins
2011-06-14 10:53 ` [PATCH 8/12] tmpfs: convert shmem_getpage_gfp " Hugh Dickins
2011-06-14 10:53   ` Hugh Dickins
2011-06-14 10:54 ` [PATCH 9/12] tmpfs: convert mem_cgroup shmem " Hugh Dickins
2011-06-14 10:54   ` Hugh Dickins
2011-06-14 10:56 ` [PATCH 10/12] tmpfs: convert shmem_writepage and enable swap Hugh Dickins
2011-06-14 10:56   ` Hugh Dickins
2011-06-14 10:57 ` [PATCH 11/12] tmpfs: use kmemdup for short symlinks Hugh Dickins
2011-06-14 10:57   ` Hugh Dickins
2011-06-14 11:16   ` Pekka Enberg
2011-06-14 11:16     ` Pekka Enberg
2011-06-14 10:59 ` [PATCH 12/12] mm: a few small updates for radix-swap Hugh Dickins
2011-06-14 10:59   ` Hugh Dickins
2011-06-15  0:49   ` [PATCH v2 " Hugh Dickins
2011-06-15  0:49     ` Hugh Dickins
2011-06-14 17:29 ` [PATCH 0/12] tmpfs: convert from old swap vector to radix tree Linus Torvalds
2011-06-14 17:29   ` Linus Torvalds
2011-06-14 18:20   ` Rik van Riel
2011-06-14 18:20     ` Rik van Riel
