* [PATCHv5 00/23] Transparent huge page cache: phase 1, everything but mmap()
@ 2013-08-04  2:17 ` Kirill A. Shutemov
  0 siblings, 0 replies; 116+ messages in thread
From: Kirill A. Shutemov @ 2013-08-04  2:17 UTC (permalink / raw)
  To: Andrea Arcangeli, Andrew Morton
  Cc: Al Viro, Hugh Dickins, Wu Fengguang, Jan Kara, Mel Gorman,
	linux-mm, Andi Kleen, Matthew Wilcox, Kirill A. Shutemov,
	Hillf Danton, Dave Hansen, Ning Qu, linux-fsdevel, linux-kernel,
	Kirill A. Shutemov

From: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>

This is the second part of my transparent huge page cache work.
It brings thp support to ramfs, but without mmap() -- mmap() support will
be posted separately.

Intro
-----

The goal of the project is to prepare kernel infrastructure to handle huge
pages in the page cache.

To prove that the proposed changes are functional, we enable the feature
for the simplest file system -- ramfs. ramfs is not that useful by itself,
but it's a good pilot project.

Design overview
---------------

Every huge page is represented in the page cache radix tree by HPAGE_PMD_NR
(512 on x86-64) entries: one entry for the head page and HPAGE_PMD_NR-1
entries for the tail pages.

Radix tree manipulations are implemented in a batched way: we add and remove
a whole huge page at once, under a single tree_lock. To make this possible,
we extended the radix-tree interface to pre-allocate enough memory to
insert a number of *contiguous* elements (kudos to Matthew Wilcox).
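
For illustration, a simplified sketch of the batched insert (error unwinding
and memcg charging omitted; radix_tree_maybe_preload_contig() is introduced
in patch 01 and the insert loop itself lands in patch 06):

 	/* reserve enough nodes for HPAGE_PMD_NR contiguous slots */
 	error = radix_tree_maybe_preload_contig(HPAGE_PMD_NR,
 						gfp_mask & ~__GFP_HIGHMEM);
 	if (error)
 		return error;

 	spin_lock_irq(&mapping->tree_lock);
 	for (i = 0; i < HPAGE_PMD_NR; i++) {
 		/* head page at 'offset', tail pages at the following indexes */
 		error = radix_tree_insert(&mapping->page_tree,
 					  offset + i, page + i);
 		if (error)
 			break;	/* the real code unwinds what was inserted */
 	}
 	radix_tree_preload_end();
 	spin_unlock_irq(&mapping->tree_lock);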

Huge pages can be added to the page cache in three ways:
 - write(2) to a file or page;
 - read(2) from a sparse file;
 - page fault in a sparse file.

Potentially, one more way is collapsing small pages into a huge page, but
that's outside the initial implementation.

For now we still write/read at most PAGE_CACHE_SIZE bytes at a time. There's
some room for speedup later.

Since mmap() isn't targeted by this patchset, we simply split the huge page
on page fault.

To minimize memory overhead for small files, we set up an fops->release
helper -- simple_thp_release() -- which splits the last page in the file
when the last writer goes away.

truncate_inode_pages_range() drops a whole huge page at once if it's fully
inside the range. If a huge page is only partly in the range, we zero out
that part, exactly as we do for partial small pages.

split_huge_page() for file pages works similarly to anon pages, but we
walk mapping->i_mmap rather than anon_vma->rb_root. At the end we call
truncate_inode_pages() to drop small pages beyond i_size, if any.

The locking model around split_huge_page() is rather complicated and I'm
still not confident about it. It looks like we need to serialize on i_mutex
in split_huge_page(), but that breaks the i_mutex -> mmap_sem lock ordering.
I don't see how it can be fixed easily. Any ideas are welcome.

Performance indicators will be posted separately.

Please review.

Kirill A. Shutemov (23):
  radix-tree: implement preload for multiple contiguous elements
  memcg, thp: charge huge cache pages
  thp: compile-time and sysfs knob for thp pagecache
  thp, mm: introduce mapping_can_have_hugepages() predicate
  thp: represent file thp pages in meminfo and friends
  thp, mm: rewrite add_to_page_cache_locked() to support huge pages
  mm: trace filemap: dump page order
  block: implement add_bdi_stat()
  thp, mm: rewrite delete_from_page_cache() to support huge pages
  thp, mm: warn if we try to use replace_page_cache_page() with THP
  thp, mm: handle tail pages in page_cache_get_speculative()
  thp, mm: add event counters for huge page alloc on file write or read
  thp, mm: allocate huge pages in grab_cache_page_write_begin()
  thp, mm: naive support of thp in generic_perform_write
  mm, fs: avoid page allocation beyond i_size on read
  thp, mm: handle transhuge pages in do_generic_file_read()
  thp, libfs: initial thp support
  thp: libfs: introduce simple_thp_release()
  truncate: support huge pages
  thp: handle file pages in split_huge_page()
  thp: wait_split_huge_page(): serialize over i_mmap_mutex too
  thp, mm: split huge page on mmap file page
  ramfs: enable transparent huge page cache

 Documentation/vm/transhuge.txt |  16 ++++
 drivers/base/node.c            |   4 +
 fs/libfs.c                     |  80 ++++++++++++++++++-
 fs/proc/meminfo.c              |   3 +
 fs/ramfs/file-mmu.c            |   3 +-
 fs/ramfs/inode.c               |   6 +-
 include/linux/backing-dev.h    |  10 +++
 include/linux/fs.h             |  10 +++
 include/linux/huge_mm.h        |  53 ++++++++++++-
 include/linux/mmzone.h         |   1 +
 include/linux/page-flags.h     |  33 ++++++++
 include/linux/pagemap.h        |  48 +++++++++++-
 include/linux/radix-tree.h     |  11 +++
 include/linux/vm_event_item.h  |   4 +
 include/trace/events/filemap.h |   7 +-
 lib/radix-tree.c               |  41 +++++++---
 mm/Kconfig                     |  12 +++
 mm/filemap.c                   | 171 +++++++++++++++++++++++++++++++++++------
 mm/huge_memory.c               | 116 ++++++++++++++++++++++++----
 mm/memcontrol.c                |   2 -
 mm/memory.c                    |   4 +-
 mm/truncate.c                  | 108 ++++++++++++++++++++------
 mm/vmstat.c                    |   5 ++
 23 files changed, 658 insertions(+), 90 deletions(-)

-- 
1.8.3.2


* [PATCH 01/23] radix-tree: implement preload for multiple contiguous elements
  2013-08-04  2:17 ` Kirill A. Shutemov
@ 2013-08-04  2:17   ` Kirill A. Shutemov
  -1 siblings, 0 replies; 116+ messages in thread
From: Kirill A. Shutemov @ 2013-08-04  2:17 UTC (permalink / raw)
  To: Andrea Arcangeli, Andrew Morton
  Cc: Al Viro, Hugh Dickins, Wu Fengguang, Jan Kara, Mel Gorman,
	linux-mm, Andi Kleen, Matthew Wilcox, Kirill A. Shutemov,
	Hillf Danton, Dave Hansen, Ning Qu, linux-fsdevel, linux-kernel,
	Kirill A. Shutemov

From: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>

The radix tree is variable-height, so an insert operation not only has
to build the branch to its corresponding item, it also has to build the
branch to existing items if the size has to be increased (by
radix_tree_extend).

The worst case is a zero height tree with just a single item at index 0,
and then inserting an item at index ULONG_MAX. This requires 2 new branches
of RADIX_TREE_MAX_PATH size to be created, with only the root node shared.

The radix tree is usually protected by a spin lock, which means we want to
pre-allocate the required memory before taking the lock.

Currently, radix_tree_preload() only guarantees enough nodes to insert
one element. It's a hard limit. For the transparent huge page cache we want
to insert HPAGE_PMD_NR (512 on x86-64) entries into an address_space at once.

This patch introduces radix_tree_maybe_preload_contig(). It allows
preallocating enough nodes to insert a number of *contiguous* elements.
The feature costs about 5KiB per CPU; details below.

The worst case for adding N contiguous items is adding entries at indexes
(ULONG_MAX - N) to ULONG_MAX. It requires the nodes for a single worst-case
item plus extra nodes if you cross the boundary from one node to the next.

Preload uses a per-CPU array to store nodes. The total cost of preload is
"array size" * sizeof(void*) * NR_CPUS. We want to increase the array size
to be able to handle 512 entries at once.

The size of the array depends on system bitness and on RADIX_TREE_MAP_SHIFT.

We have three possible values of RADIX_TREE_MAP_SHIFT:

 #ifdef __KERNEL__
 #define RADIX_TREE_MAP_SHIFT	(CONFIG_BASE_SMALL ? 4 : 6)
 #else
 #define RADIX_TREE_MAP_SHIFT	3	/* For more stressful testing */
 #endif

On 64-bit system:
For RADIX_TREE_MAP_SHIFT=3, old array size is 43, new is 107.
For RADIX_TREE_MAP_SHIFT=4, old array size is 31, new is 63.
For RADIX_TREE_MAP_SHIFT=6, old array size is 21, new is 30.

On 32-bit system:
For RADIX_TREE_MAP_SHIFT=3, old array size is 21, new is 84.
For RADIX_TREE_MAP_SHIFT=4, old array size is 15, new is 46.
For RADIX_TREE_MAP_SHIFT=6, old array size is 11, new is 19.

On most machines we will have RADIX_TREE_MAP_SHIFT=6. In this case,
on a 64-bit system the per-CPU feature overhead is
 for preload array:
   (30 - 21) * sizeof(void*) = 72 bytes
 plus, if the preload array is full
   (30 - 21) * sizeof(struct radix_tree_node) = 9 * 560 = 5040 bytes
 total: 5112 bytes

on a 32-bit system the per-CPU feature overhead is
 for preload array:
   (19 - 11) * sizeof(void*) = 32 bytes
 plus, if the preload array is full
   (19 - 11) * sizeof(struct radix_tree_node) = 8 * 296 = 2368 bytes
 total: 2400 bytes

Since only THP uses batched preload at the moment, we disable it (set the
max preload to 1) if !CONFIG_TRANSPARENT_HUGEPAGE_PAGECACHE. This can be
changed in the future.
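
A minimal usage sketch of the new interface (hypothetical caller; the real
page cache user comes later in the series, and insert errors are ignored
here for brevity). As with radix_tree_preload(), preemption is left disabled
on success and re-enabled by radix_tree_preload_end():

 	/* guarantee that up to 'nr' contiguous inserts cannot fail */
 	err = radix_tree_maybe_preload_contig(nr, GFP_KERNEL);
 	if (err)
 		return err;		/* preemption is not disabled on error */

 	spin_lock(&tree_lock);		/* whatever lock protects the tree */
 	for (i = 0; i < nr; i++)
 		radix_tree_insert(&root, first_index + i, items[i]);
 	radix_tree_preload_end();	/* re-enables preemption */
 	spin_unlock(&tree_lock);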

Signed-off-by: Matthew Wilcox <willy@linux.intel.com>
Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Acked-by: Dave Hansen <dave.hansen@linux.intel.com>
---
 include/linux/radix-tree.h | 11 +++++++++++
 lib/radix-tree.c           | 41 ++++++++++++++++++++++++++++++++---------
 2 files changed, 43 insertions(+), 9 deletions(-)

diff --git a/include/linux/radix-tree.h b/include/linux/radix-tree.h
index 4039407..3bf0b3e 100644
--- a/include/linux/radix-tree.h
+++ b/include/linux/radix-tree.h
@@ -83,6 +83,16 @@ do {									\
 	(root)->rnode = NULL;						\
 } while (0)
 
+#ifdef CONFIG_TRANSPARENT_HUGEPAGE_PAGECACHE
+/*
 + * At the moment only THP uses preload for more than one item for batched
+ * pagecache manipulations.
+ */
+#define RADIX_TREE_PRELOAD_NR	512
+#else
+#define RADIX_TREE_PRELOAD_NR	1
+#endif
+
 /**
  * Radix-tree synchronization
  *
@@ -232,6 +242,7 @@ unsigned long radix_tree_prev_hole(struct radix_tree_root *root,
 				unsigned long index, unsigned long max_scan);
 int radix_tree_preload(gfp_t gfp_mask);
 int radix_tree_maybe_preload(gfp_t gfp_mask);
+int radix_tree_maybe_preload_contig(unsigned size, gfp_t gfp_mask);
 void radix_tree_init(void);
 void *radix_tree_tag_set(struct radix_tree_root *root,
 			unsigned long index, unsigned int tag);
diff --git a/lib/radix-tree.c b/lib/radix-tree.c
index 7811ed3..99ab73c 100644
--- a/lib/radix-tree.c
+++ b/lib/radix-tree.c
@@ -82,16 +82,24 @@ static struct kmem_cache *radix_tree_node_cachep;
  * The worst case is a zero height tree with just a single item at index 0,
  * and then inserting an item at index ULONG_MAX. This requires 2 new branches
  * of RADIX_TREE_MAX_PATH size to be created, with only the root node shared.
+ *
+ * Worst case for adding N contiguous items is adding entries at indexes
+ * (ULONG_MAX - N) to ULONG_MAX. It requires nodes to insert single worst-case
+ * item plus extra nodes if you cross the boundary from one node to the next.
+ *
  * Hence:
  */
-#define RADIX_TREE_PRELOAD_SIZE (RADIX_TREE_MAX_PATH * 2 - 1)
+#define RADIX_TREE_PRELOAD_MIN (RADIX_TREE_MAX_PATH * 2 - 1)
+#define RADIX_TREE_PRELOAD_MAX \
+	(RADIX_TREE_PRELOAD_MIN + \
+	 DIV_ROUND_UP(RADIX_TREE_PRELOAD_NR - 1, RADIX_TREE_MAP_SIZE))
 
 /*
  * Per-cpu pool of preloaded nodes
  */
 struct radix_tree_preload {
 	int nr;
-	struct radix_tree_node *nodes[RADIX_TREE_PRELOAD_SIZE];
+	struct radix_tree_node *nodes[RADIX_TREE_PRELOAD_MAX];
 };
 static DEFINE_PER_CPU(struct radix_tree_preload, radix_tree_preloads) = { 0, };
 
@@ -263,29 +271,35 @@ radix_tree_node_free(struct radix_tree_node *node)
 
 /*
  * Load up this CPU's radix_tree_node buffer with sufficient objects to
- * ensure that the addition of a single element in the tree cannot fail.  On
- * success, return zero, with preemption disabled.  On error, return -ENOMEM
+ * ensure that the addition of *contiguous* elements in the tree cannot fail.
+ * On success, return zero, with preemption disabled.  On error, return -ENOMEM
  * with preemption not disabled.
  *
  * To make use of this facility, the radix tree must be initialised without
  * __GFP_WAIT being passed to INIT_RADIX_TREE().
  */
-static int __radix_tree_preload(gfp_t gfp_mask)
+static int __radix_tree_preload_contig(unsigned size, gfp_t gfp_mask)
 {
 	struct radix_tree_preload *rtp;
 	struct radix_tree_node *node;
 	int ret = -ENOMEM;
+	int preload_target = RADIX_TREE_PRELOAD_MIN +
+		DIV_ROUND_UP(size - 1, RADIX_TREE_MAP_SIZE);
+
+	if (WARN_ONCE(size > RADIX_TREE_PRELOAD_NR,
+				"too large preload requested"))
+		return -ENOMEM;
 
 	preempt_disable();
 	rtp = &__get_cpu_var(radix_tree_preloads);
-	while (rtp->nr < ARRAY_SIZE(rtp->nodes)) {
+	while (rtp->nr < preload_target) {
 		preempt_enable();
 		node = kmem_cache_alloc(radix_tree_node_cachep, gfp_mask);
 		if (node == NULL)
 			goto out;
 		preempt_disable();
 		rtp = &__get_cpu_var(radix_tree_preloads);
-		if (rtp->nr < ARRAY_SIZE(rtp->nodes))
+		if (rtp->nr < preload_target)
 			rtp->nodes[rtp->nr++] = node;
 		else
 			kmem_cache_free(radix_tree_node_cachep, node);
@@ -308,7 +322,7 @@ int radix_tree_preload(gfp_t gfp_mask)
 {
 	/* Warn on non-sensical use... */
 	WARN_ON_ONCE(!(gfp_mask & __GFP_WAIT));
-	return __radix_tree_preload(gfp_mask);
+	return __radix_tree_preload_contig(1, gfp_mask);
 }
 EXPORT_SYMBOL(radix_tree_preload);
 
@@ -320,13 +334,22 @@ EXPORT_SYMBOL(radix_tree_preload);
 int radix_tree_maybe_preload(gfp_t gfp_mask)
 {
 	if (gfp_mask & __GFP_WAIT)
-		return __radix_tree_preload(gfp_mask);
+		return __radix_tree_preload_contig(1, gfp_mask);
 	/* Preloading doesn't help anything with this gfp mask, skip it */
 	preempt_disable();
 	return 0;
 }
 EXPORT_SYMBOL(radix_tree_maybe_preload);
 
+int radix_tree_maybe_preload_contig(unsigned size, gfp_t gfp_mask)
+{
+	if (gfp_mask & __GFP_WAIT)
+		return __radix_tree_preload_contig(size, gfp_mask);
+	/* Preloading doesn't help anything with this gfp mask, skip it */
+	preempt_disable();
+	return 0;
+}
+
 /*
  *	Return the maximum key which can be store into a
  *	radix tree with height HEIGHT.
-- 
1.8.3.2


* [PATCH 02/23] memcg, thp: charge huge cache pages
  2013-08-04  2:17 ` Kirill A. Shutemov
@ 2013-08-04  2:17   ` Kirill A. Shutemov
  -1 siblings, 0 replies; 116+ messages in thread
From: Kirill A. Shutemov @ 2013-08-04  2:17 UTC (permalink / raw)
  To: Andrea Arcangeli, Andrew Morton
  Cc: Al Viro, Hugh Dickins, Wu Fengguang, Jan Kara, Mel Gorman,
	linux-mm, Andi Kleen, Matthew Wilcox, Kirill A. Shutemov,
	Hillf Danton, Dave Hansen, Ning Qu, linux-fsdevel, linux-kernel,
	Kirill A. Shutemov, Michal Hocko, KAMEZAWA Hiroyuki

From: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>

mem_cgroup_cache_charge() has a check for PageCompound(). The check
prevents charging huge cache pages.

I don't see a reason why the check is present. It looks like it's just
legacy (introduced in 52d4b9a "memcg: allocate all page_cgroup at boot").

Let's just drop it.

Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Cc: Michal Hocko <mhocko@suse.cz>
Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Acked-by: Dave Hansen <dave.hansen@linux.intel.com>
---
 mm/memcontrol.c | 2 --
 1 file changed, 2 deletions(-)

diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index b6cd870..dc50c1a 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -3921,8 +3921,6 @@ int mem_cgroup_cache_charge(struct page *page, struct mm_struct *mm,
 
 	if (mem_cgroup_disabled())
 		return 0;
-	if (PageCompound(page))
-		return 0;
 
 	if (!PageSwapCache(page))
 		ret = mem_cgroup_charge_common(page, mm, gfp_mask, type);
-- 
1.8.3.2


* [PATCH 03/23] thp: compile-time and sysfs knob for thp pagecache
  2013-08-04  2:17 ` Kirill A. Shutemov
@ 2013-08-04  2:17   ` Kirill A. Shutemov
  -1 siblings, 0 replies; 116+ messages in thread
From: Kirill A. Shutemov @ 2013-08-04  2:17 UTC (permalink / raw)
  To: Andrea Arcangeli, Andrew Morton
  Cc: Al Viro, Hugh Dickins, Wu Fengguang, Jan Kara, Mel Gorman,
	linux-mm, Andi Kleen, Matthew Wilcox, Kirill A. Shutemov,
	Hillf Danton, Dave Hansen, Ning Qu, linux-fsdevel, linux-kernel,
	Kirill A. Shutemov

From: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>

For now, TRANSPARENT_HUGEPAGE_PAGECACHE is only implemented for x86_64.

Radix tree preload overhead can be significant on BASE_SMALL systems, so
let's add a dependency on !BASE_SMALL.

/sys/kernel/mm/transparent_hugepage/page_cache is a runtime knob for the
feature. It's enabled by default if TRANSPARENT_HUGEPAGE_PAGECACHE is
enabled.
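
For reference, a hedged sketch of how code later in the series is expected
to consult the knob before attempting a huge allocation (the caller below
is illustrative, not part of this patch):

 	if (!transparent_hugepage_pagecache())
 		/* knob (or THP itself) is off: stick to small pages */
 		return alloc_page(gfp_mask);

 	/* try a huge page; callers fall back to small pages on failure */
 	return alloc_pages(gfp_mask | __GFP_COMP, HPAGE_PMD_ORDER);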

Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
---
 Documentation/vm/transhuge.txt |  9 +++++++++
 include/linux/huge_mm.h        |  9 +++++++++
 mm/Kconfig                     | 12 ++++++++++++
 mm/huge_memory.c               | 23 +++++++++++++++++++++++
 4 files changed, 53 insertions(+)

diff --git a/Documentation/vm/transhuge.txt b/Documentation/vm/transhuge.txt
index 4a63953..4cc15c4 100644
--- a/Documentation/vm/transhuge.txt
+++ b/Documentation/vm/transhuge.txt
@@ -103,6 +103,15 @@ echo always >/sys/kernel/mm/transparent_hugepage/enabled
 echo madvise >/sys/kernel/mm/transparent_hugepage/enabled
 echo never >/sys/kernel/mm/transparent_hugepage/enabled
 
 +If TRANSPARENT_HUGEPAGE_PAGECACHE is enabled, the kernel will use huge pages
 +in the page cache if possible. It can be disabled and re-enabled via sysfs:
+
+echo 0 >/sys/kernel/mm/transparent_hugepage/page_cache
+echo 1 >/sys/kernel/mm/transparent_hugepage/page_cache
+
 +If it's disabled, the kernel will not add new huge pages to the page cache
 +and will split them on mapping, but already mapped pages will stay intact.
+
 It's also possible to limit defrag efforts in the VM to generate
 hugepages in case they're not immediately free to madvise regions or
 to never try to defrag memory and simply fallback to regular pages
diff --git a/include/linux/huge_mm.h b/include/linux/huge_mm.h
index 3935428..1534e1e 100644
--- a/include/linux/huge_mm.h
+++ b/include/linux/huge_mm.h
@@ -40,6 +40,7 @@ enum transparent_hugepage_flag {
 	TRANSPARENT_HUGEPAGE_DEFRAG_FLAG,
 	TRANSPARENT_HUGEPAGE_DEFRAG_REQ_MADV_FLAG,
 	TRANSPARENT_HUGEPAGE_DEFRAG_KHUGEPAGED_FLAG,
+	TRANSPARENT_HUGEPAGE_PAGECACHE,
 	TRANSPARENT_HUGEPAGE_USE_ZERO_PAGE_FLAG,
 #ifdef CONFIG_DEBUG_VM
 	TRANSPARENT_HUGEPAGE_DEBUG_COW_FLAG,
@@ -229,4 +230,12 @@ static inline int do_huge_pmd_numa_page(struct mm_struct *mm, struct vm_area_str
 
 #endif /* CONFIG_TRANSPARENT_HUGEPAGE */
 
+static inline bool transparent_hugepage_pagecache(void)
+{
+	if (!IS_ENABLED(CONFIG_TRANSPARENT_HUGEPAGE_PAGECACHE))
+		return false;
+	if (!(transparent_hugepage_flags & (1<<TRANSPARENT_HUGEPAGE_FLAG)))
+		return false;
+	return transparent_hugepage_flags & (1<<TRANSPARENT_HUGEPAGE_PAGECACHE);
+}
 #endif /* _LINUX_HUGE_MM_H */
diff --git a/mm/Kconfig b/mm/Kconfig
index 256bfd0..1e30ee8 100644
--- a/mm/Kconfig
+++ b/mm/Kconfig
@@ -420,6 +420,18 @@ choice
 	  benefit.
 endchoice
 
+config TRANSPARENT_HUGEPAGE_PAGECACHE
+	bool "Transparent Hugepage Support for page cache"
+	depends on X86_64 && TRANSPARENT_HUGEPAGE
+	# avoid radix tree preload overhead
+	depends on !BASE_SMALL
+	default y
+	help
 +	  Enabling this option adds support for huge pages in file-backed
 +	  mappings. It requires transparent hugepage support on the
 +	  filesystem side. For now, the only filesystem which supports
 +	  huge pages is ramfs.
+
 config CROSS_MEMORY_ATTACH
 	bool "Cross Memory Support"
 	depends on MMU
diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index d96d921..523946c 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -42,6 +42,9 @@ unsigned long transparent_hugepage_flags __read_mostly =
 #endif
 	(1<<TRANSPARENT_HUGEPAGE_DEFRAG_FLAG)|
 	(1<<TRANSPARENT_HUGEPAGE_DEFRAG_KHUGEPAGED_FLAG)|
+#ifdef CONFIG_TRANSPARENT_HUGEPAGE_PAGECACHE
+	(1<<TRANSPARENT_HUGEPAGE_PAGECACHE)|
+#endif
 	(1<<TRANSPARENT_HUGEPAGE_USE_ZERO_PAGE_FLAG);
 
 /* default scan 8*512 pte (or vmas) every 30 second */
@@ -362,6 +365,23 @@ static ssize_t defrag_store(struct kobject *kobj,
 static struct kobj_attribute defrag_attr =
 	__ATTR(defrag, 0644, defrag_show, defrag_store);
 
+#ifdef CONFIG_TRANSPARENT_HUGEPAGE_PAGECACHE
+static ssize_t page_cache_show(struct kobject *kobj,
+		struct kobj_attribute *attr, char *buf)
+{
+	return single_flag_show(kobj, attr, buf,
+				TRANSPARENT_HUGEPAGE_PAGECACHE);
+}
+static ssize_t page_cache_store(struct kobject *kobj,
+		struct kobj_attribute *attr, const char *buf, size_t count)
+{
+	return single_flag_store(kobj, attr, buf, count,
+				 TRANSPARENT_HUGEPAGE_PAGECACHE);
+}
+static struct kobj_attribute page_cache_attr =
+	__ATTR(page_cache, 0644, page_cache_show, page_cache_store);
+#endif
+
 static ssize_t use_zero_page_show(struct kobject *kobj,
 		struct kobj_attribute *attr, char *buf)
 {
@@ -397,6 +417,9 @@ static struct kobj_attribute debug_cow_attr =
 static struct attribute *hugepage_attr[] = {
 	&enabled_attr.attr,
 	&defrag_attr.attr,
+#ifdef CONFIG_TRANSPARENT_HUGEPAGE_PAGECACHE
+	&page_cache_attr.attr,
+#endif
 	&use_zero_page_attr.attr,
 #ifdef CONFIG_DEBUG_VM
 	&debug_cow_attr.attr,
-- 
1.8.3.2


* [PATCH 04/23] thp, mm: introduce mapping_can_have_hugepages() predicate
  2013-08-04  2:17 ` Kirill A. Shutemov
@ 2013-08-04  2:17   ` Kirill A. Shutemov
  -1 siblings, 0 replies; 116+ messages in thread
From: Kirill A. Shutemov @ 2013-08-04  2:17 UTC (permalink / raw)
  To: Andrea Arcangeli, Andrew Morton
  Cc: Al Viro, Hugh Dickins, Wu Fengguang, Jan Kara, Mel Gorman,
	linux-mm, Andi Kleen, Matthew Wilcox, Kirill A. Shutemov,
	Hillf Danton, Dave Hansen, Ning Qu, linux-fsdevel, linux-kernel,
	Kirill A. Shutemov

From: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>

It returns true if the mapping can have huge pages. For now, just check for
__GFP_COMP in the mapping's gfp mask.
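
A sketch of the intended use, assuming a filesystem opts in by putting
__GFP_COMP (e.g. via GFP_TRANSHUGE) into the mapping's gfp mask; ramfs is
only wired up this way at the end of the series:

 	/* filesystem side, at inode setup: allow huge pages for this mapping */
 	mapping_set_gfp_mask(inode->i_mapping, GFP_TRANSHUGE);

 	/* generic code: check before trying to use a huge page */
 	if (mapping_can_have_hugepages(mapping))
 		/* allocate a THP and add it to the page cache */;
 	else
 		/* fall back to small pages */;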

Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
---
 include/linux/pagemap.h | 14 ++++++++++++++
 1 file changed, 14 insertions(+)

diff --git a/include/linux/pagemap.h b/include/linux/pagemap.h
index e8ca8cf..47b5082 100644
--- a/include/linux/pagemap.h
+++ b/include/linux/pagemap.h
@@ -84,6 +84,20 @@ static inline void mapping_set_gfp_mask(struct address_space *m, gfp_t mask)
 				(__force unsigned long)mask;
 }
 
+static inline bool mapping_can_have_hugepages(struct address_space *m)
+{
+	gfp_t gfp_mask = mapping_gfp_mask(m);
+
+	if (!transparent_hugepage_pagecache())
+		return false;
+
+	/*
+	 * It's up to filesystem what gfp mask to use.
+	 * The only part of GFP_TRANSHUGE which matters for us is __GFP_COMP.
+	 */
+	return !!(gfp_mask & __GFP_COMP);
+}
+
 /*
  * The page cache can done in larger chunks than
  * one page, because it allows for more efficient
-- 
1.8.3.2


* [PATCH 05/23] thp: represent file thp pages in meminfo and friends
  2013-08-04  2:17 ` Kirill A. Shutemov
@ 2013-08-04  2:17   ` Kirill A. Shutemov
  -1 siblings, 0 replies; 116+ messages in thread
From: Kirill A. Shutemov @ 2013-08-04  2:17 UTC (permalink / raw)
  To: Andrea Arcangeli, Andrew Morton
  Cc: Al Viro, Hugh Dickins, Wu Fengguang, Jan Kara, Mel Gorman,
	linux-mm, Andi Kleen, Matthew Wilcox, Kirill A. Shutemov,
	Hillf Danton, Dave Hansen, Ning Qu, linux-fsdevel, linux-kernel,
	Kirill A. Shutemov

From: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>

The patch adds a new zone stat to count file transparent huge pages and
adjusts the related places.

For now we don't count mapped or dirty file thp pages separately.

The patch depends on patch
 thp: account anon transparent huge pages into NR_ANON_PAGES

Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Acked-by: Dave Hansen <dave.hansen@linux.intel.com>
---
 drivers/base/node.c    | 4 ++++
 fs/proc/meminfo.c      | 3 +++
 include/linux/mmzone.h | 1 +
 mm/vmstat.c            | 1 +
 4 files changed, 9 insertions(+)

diff --git a/drivers/base/node.c b/drivers/base/node.c
index bc9f43b..de261f5 100644
--- a/drivers/base/node.c
+++ b/drivers/base/node.c
@@ -119,6 +119,7 @@ static ssize_t node_read_meminfo(struct device *dev,
 		       "Node %d SUnreclaim:     %8lu kB\n"
 #ifdef CONFIG_TRANSPARENT_HUGEPAGE
 		       "Node %d AnonHugePages:  %8lu kB\n"
+		       "Node %d FileHugePages:  %8lu kB\n"
 #endif
 			,
 		       nid, K(node_page_state(nid, NR_FILE_DIRTY)),
@@ -140,6 +141,9 @@ static ssize_t node_read_meminfo(struct device *dev,
 		       nid, K(node_page_state(nid, NR_SLAB_UNRECLAIMABLE))
 			, nid,
 			K(node_page_state(nid, NR_ANON_TRANSPARENT_HUGEPAGES) *
+			HPAGE_PMD_NR)
+			, nid,
+			K(node_page_state(nid, NR_FILE_TRANSPARENT_HUGEPAGES) *
 			HPAGE_PMD_NR));
 #else
 		       nid, K(node_page_state(nid, NR_SLAB_UNRECLAIMABLE)));
diff --git a/fs/proc/meminfo.c b/fs/proc/meminfo.c
index 59d85d6..a62952c 100644
--- a/fs/proc/meminfo.c
+++ b/fs/proc/meminfo.c
@@ -104,6 +104,7 @@ static int meminfo_proc_show(struct seq_file *m, void *v)
 #endif
 #ifdef CONFIG_TRANSPARENT_HUGEPAGE
 		"AnonHugePages:  %8lu kB\n"
+		"FileHugePages:  %8lu kB\n"
 #endif
 		,
 		K(i.totalram),
@@ -158,6 +159,8 @@ static int meminfo_proc_show(struct seq_file *m, void *v)
 #ifdef CONFIG_TRANSPARENT_HUGEPAGE
 		,K(global_page_state(NR_ANON_TRANSPARENT_HUGEPAGES) *
 		   HPAGE_PMD_NR)
+		,K(global_page_state(NR_FILE_TRANSPARENT_HUGEPAGES) *
+		   HPAGE_PMD_NR)
 #endif
 		);
 
diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h
index 0c41d59..ba81833 100644
--- a/include/linux/mmzone.h
+++ b/include/linux/mmzone.h
@@ -142,6 +142,7 @@ enum zone_stat_item {
 	NUMA_OTHER,		/* allocation from other node */
 #endif
 	NR_ANON_TRANSPARENT_HUGEPAGES,
+	NR_FILE_TRANSPARENT_HUGEPAGES,
 	NR_FREE_CMA_PAGES,
 	NR_VM_ZONE_STAT_ITEMS };
 
diff --git a/mm/vmstat.c b/mm/vmstat.c
index 87228c5..ffe3fbd 100644
--- a/mm/vmstat.c
+++ b/mm/vmstat.c
@@ -739,6 +739,7 @@ const char * const vmstat_text[] = {
 	"numa_other",
 #endif
 	"nr_anon_transparent_hugepages",
+	"nr_file_transparent_hugepages",
 	"nr_free_cma",
 	"nr_dirty_threshold",
 	"nr_dirty_background_threshold",
-- 
1.8.3.2


* [PATCH 06/23] thp, mm: rewrite add_to_page_cache_locked() to support huge pages
  2013-08-04  2:17 ` Kirill A. Shutemov
@ 2013-08-04  2:17   ` Kirill A. Shutemov
  -1 siblings, 0 replies; 116+ messages in thread
From: Kirill A. Shutemov @ 2013-08-04  2:17 UTC (permalink / raw)
  To: Andrea Arcangeli, Andrew Morton
  Cc: Al Viro, Hugh Dickins, Wu Fengguang, Jan Kara, Mel Gorman,
	linux-mm, Andi Kleen, Matthew Wilcox, Kirill A. Shutemov,
	Hillf Danton, Dave Hansen, Ning Qu, linux-fsdevel, linux-kernel,
	Kirill A. Shutemov

From: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>

For a huge page we add HPAGE_CACHE_NR pages to the radix tree at once: the
head page for the specified index and HPAGE_CACHE_NR-1 tail pages for the
following indexes.
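
For context, a hedged sketch of what a caller is expected to hand in: a
locked compound page and a huge-page-aligned index (the allocation side is
reworked by later patches; names not in this diff are illustrative):

 	/* align the index so head + HPAGE_CACHE_NR-1 tails get contiguous slots */
 	pgoff_t offset = index & ~HPAGE_CACHE_INDEX_MASK;
 	struct page *page;

 	page = alloc_pages(gfp_mask | __GFP_COMP, HPAGE_CACHE_ORDER);
 	if (!page)
 		return -ENOMEM;		/* caller falls back to a small page */

 	__set_page_locked(page);
 	err = add_to_page_cache_locked(page, mapping, offset, gfp_mask);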

Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Acked-by: Dave Hansen <dave.hansen@linux.intel.com>
---
 include/linux/huge_mm.h    | 24 ++++++++++++++++++++++
 include/linux/page-flags.h | 33 ++++++++++++++++++++++++++++++
 mm/filemap.c               | 50 +++++++++++++++++++++++++++++++++++-----------
 3 files changed, 95 insertions(+), 12 deletions(-)

diff --git a/include/linux/huge_mm.h b/include/linux/huge_mm.h
index 1534e1e..4dc66c9 100644
--- a/include/linux/huge_mm.h
+++ b/include/linux/huge_mm.h
@@ -230,6 +230,20 @@ static inline int do_huge_pmd_numa_page(struct mm_struct *mm, struct vm_area_str
 
 #endif /* CONFIG_TRANSPARENT_HUGEPAGE */
 
+#ifdef CONFIG_TRANSPARENT_HUGEPAGE_PAGECACHE
+
+#define HPAGE_CACHE_ORDER      (HPAGE_SHIFT - PAGE_CACHE_SHIFT)
+#define HPAGE_CACHE_NR         (1L << HPAGE_CACHE_ORDER)
+#define HPAGE_CACHE_INDEX_MASK (HPAGE_CACHE_NR - 1)
+
+#else
+
+#define HPAGE_CACHE_ORDER      ({ BUILD_BUG(); 0; })
+#define HPAGE_CACHE_NR         ({ BUILD_BUG(); 0; })
+#define HPAGE_CACHE_INDEX_MASK ({ BUILD_BUG(); 0; })
+
+#endif
+
 static inline bool transparent_hugepage_pagecache(void)
 {
 	if (!IS_ENABLED(CONFIG_TRANSPARENT_HUGEPAGE_PAGECACHE))
@@ -238,4 +252,14 @@ static inline bool transparent_hugepage_pagecache(void)
 		return false;
 	return transparent_hugepage_flags & (1<<TRANSPARENT_HUGEPAGE_PAGECACHE);
 }
+
+static inline int hpagecache_nr_pages(struct page *page)
+{
+	if (IS_ENABLED(CONFIG_TRANSPARENT_HUGEPAGE_PAGECACHE))
+		return hpage_nr_pages(page);
+
+	BUG_ON(PageTransHuge(page));
+	return 1;
+}
+
 #endif /* _LINUX_HUGE_MM_H */
diff --git a/include/linux/page-flags.h b/include/linux/page-flags.h
index f1a5b59..7657de0 100644
--- a/include/linux/page-flags.h
+++ b/include/linux/page-flags.h
@@ -452,6 +452,39 @@ static inline int PageTransTail(struct page *page)
 }
 #endif
 
+#ifdef CONFIG_TRANSPARENT_HUGEPAGE_PAGECACHE
+static inline int PageTransHugeCache(struct page *page)
+{
+	return PageTransHuge(page);
+}
+
+static inline int PageTransCompoundCache(struct page *page)
+{
+	return PageTransCompound(page);
+}
+
+static inline int PageTransTailCache(struct page *page)
+{
+	return PageTransTail(page);
+}
+#else
+
+static inline int PageTransHugeCache(struct page *page)
+{
+	return 0;
+}
+
+static inline int PageTransCompoundCache(struct page *page)
+{
+	return 0;
+}
+
+static inline int PageTransTailCache(struct page *page)
+{
+	return 0;
+}
+#endif
+
 /*
  * If network-based swap is enabled, sl*b must keep track of whether pages
  * were allocated from pfmemalloc reserves.
diff --git a/mm/filemap.c b/mm/filemap.c
index ae5cc01..619e6cb 100644
--- a/mm/filemap.c
+++ b/mm/filemap.c
@@ -460,38 +460,64 @@ int add_to_page_cache_locked(struct page *page, struct address_space *mapping,
 		pgoff_t offset, gfp_t gfp_mask)
 {
 	int error;
+	int i, nr;
 
 	VM_BUG_ON(!PageLocked(page));
 	VM_BUG_ON(PageSwapBacked(page));
 
+	/* memory cgroup controller handles thp pages on its side */
 	error = mem_cgroup_cache_charge(page, current->mm,
 					gfp_mask & GFP_RECLAIM_MASK);
 	if (error)
 		return error;
 
-	error = radix_tree_maybe_preload(gfp_mask & ~__GFP_HIGHMEM);
+	if (PageTransHugeCache(page))
+		BUILD_BUG_ON(HPAGE_CACHE_NR > RADIX_TREE_PRELOAD_NR);
+
+	nr = hpagecache_nr_pages(page);
+
+	error = radix_tree_maybe_preload_contig(nr, gfp_mask & ~__GFP_HIGHMEM);
 	if (error) {
 		mem_cgroup_uncharge_cache_page(page);
 		return error;
 	}
 
-	page_cache_get(page);
-	page->mapping = mapping;
-	page->index = offset;
-
 	spin_lock_irq(&mapping->tree_lock);
-	error = radix_tree_insert(&mapping->page_tree, offset, page);
+	page_cache_get(page);
+	for (i = 0; i < nr; i++) {
+		error = radix_tree_insert(&mapping->page_tree,
+				offset + i, page + i);
+		/*
+		 * In the middle of a THP we can collide with a small page which was
+		 * established before THP page cache was enabled, or by another VMA
+		 * with bad alignment (most likely MAP_FIXED).
+		 */
+		if (error)
+			goto err_insert;
+		page[i].index = offset + i;
+		page[i].mapping = mapping;
+	}
 	radix_tree_preload_end();
-	if (unlikely(error))
-		goto err_insert;
-	mapping->nrpages++;
-	__inc_zone_page_state(page, NR_FILE_PAGES);
+	mapping->nrpages += nr;
+	__mod_zone_page_state(page_zone(page), NR_FILE_PAGES, nr);
+	if (PageTransHuge(page))
+		__inc_zone_page_state(page, NR_FILE_TRANSPARENT_HUGEPAGES);
 	spin_unlock_irq(&mapping->tree_lock);
 	trace_mm_filemap_add_to_page_cache(page);
 	return 0;
 err_insert:
-	page->mapping = NULL;
-	/* Leave page->index set: truncation relies upon it */
+	radix_tree_preload_end();
+	if (i != 0)
+		error = -ENOSPC; /* no space for a huge page */
+
+	/* page[i] was not inserted to tree, skip it */
+	i--;
+
+	for (; i >= 0; i--) {
+		/* Leave page->index set: truncation relies upon it */
+		page[i].mapping = NULL;
+		radix_tree_delete(&mapping->page_tree, offset + i);
+	}
 	spin_unlock_irq(&mapping->tree_lock);
 	mem_cgroup_uncharge_cache_page(page);
 	page_cache_release(page);
-- 
1.8.3.2


^ permalink raw reply related	[flat|nested] 116+ messages in thread

* [PATCH 07/23] mm: trace filemap: dump page order
  2013-08-04  2:17 ` Kirill A. Shutemov
@ 2013-08-04  2:17   ` Kirill A. Shutemov
  -1 siblings, 0 replies; 116+ messages in thread
From: Kirill A. Shutemov @ 2013-08-04  2:17 UTC (permalink / raw)
  To: Andrea Arcangeli, Andrew Morton
  Cc: Al Viro, Hugh Dickins, Wu Fengguang, Jan Kara, Mel Gorman,
	linux-mm, Andi Kleen, Matthew Wilcox, Kirill A. Shutemov,
	Hillf Danton, Dave Hansen, Ning Qu, linux-fsdevel, linux-kernel,
	Kirill A. Shutemov

From: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>

Dump page order to trace to be able to distinguish between small page
and huge page in page cache.

Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Acked-by: Dave Hansen <dave.hansen@linux.intel.com>
---
 include/trace/events/filemap.h | 7 +++++--
 1 file changed, 5 insertions(+), 2 deletions(-)

diff --git a/include/trace/events/filemap.h b/include/trace/events/filemap.h
index 0421f49..7e14b13 100644
--- a/include/trace/events/filemap.h
+++ b/include/trace/events/filemap.h
@@ -21,6 +21,7 @@ DECLARE_EVENT_CLASS(mm_filemap_op_page_cache,
 		__field(struct page *, page)
 		__field(unsigned long, i_ino)
 		__field(unsigned long, index)
+		__field(int, order)
 		__field(dev_t, s_dev)
 	),
 
@@ -28,18 +29,20 @@ DECLARE_EVENT_CLASS(mm_filemap_op_page_cache,
 		__entry->page = page;
 		__entry->i_ino = page->mapping->host->i_ino;
 		__entry->index = page->index;
+		__entry->order = compound_order(page);
 		if (page->mapping->host->i_sb)
 			__entry->s_dev = page->mapping->host->i_sb->s_dev;
 		else
 			__entry->s_dev = page->mapping->host->i_rdev;
 	),
 
-	TP_printk("dev %d:%d ino %lx page=%p pfn=%lu ofs=%lu",
+	TP_printk("dev %d:%d ino %lx page=%p pfn=%lu ofs=%lu order=%d",
 		MAJOR(__entry->s_dev), MINOR(__entry->s_dev),
 		__entry->i_ino,
 		__entry->page,
 		page_to_pfn(__entry->page),
-		__entry->index << PAGE_SHIFT)
+		__entry->index << PAGE_SHIFT,
+		__entry->order)
 );
 
 DEFINE_EVENT(mm_filemap_op_page_cache, mm_filemap_delete_from_page_cache,
-- 
1.8.3.2


^ permalink raw reply related	[flat|nested] 116+ messages in thread

* [PATCH 08/23] block: implement add_bdi_stat()
  2013-08-04  2:17 ` Kirill A. Shutemov
@ 2013-08-04  2:17   ` Kirill A. Shutemov
  -1 siblings, 0 replies; 116+ messages in thread
From: Kirill A. Shutemov @ 2013-08-04  2:17 UTC (permalink / raw)
  To: Andrea Arcangeli, Andrew Morton
  Cc: Al Viro, Hugh Dickins, Wu Fengguang, Jan Kara, Mel Gorman,
	linux-mm, Andi Kleen, Matthew Wilcox, Kirill A. Shutemov,
	Hillf Danton, Dave Hansen, Ning Qu, linux-fsdevel, linux-kernel,
	Kirill A. Shutemov

From: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>

We're going to add/remove a number of page cache entries at once. This
patch implements add_bdi_stat(), which adjusts a bdi stat by an arbitrary
amount. It's required for batched page cache manipulations.

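A hedged sketch of the intended use, mirroring the batched removal path
later in the series (kernel context assumed; the helper name below is made
up for illustration):

/*
 * Sketch only: undo the reclaimable-page accounting for a whole THP in
 * one locked adjustment instead of HPAGE_CACHE_NR dec_bdi_stat() calls.
 */
static void thp_uncharge_reclaimable(struct address_space *mapping, int nr)
{
	add_bdi_stat(mapping->backing_dev_info, BDI_RECLAIMABLE, -nr);
}
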
Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
---
 include/linux/backing-dev.h | 10 ++++++++++
 1 file changed, 10 insertions(+)

diff --git a/include/linux/backing-dev.h b/include/linux/backing-dev.h
index c388155..7060180 100644
--- a/include/linux/backing-dev.h
+++ b/include/linux/backing-dev.h
@@ -166,6 +166,16 @@ static inline void __dec_bdi_stat(struct backing_dev_info *bdi,
 	__add_bdi_stat(bdi, item, -1);
 }
 
+static inline void add_bdi_stat(struct backing_dev_info *bdi,
+		enum bdi_stat_item item, s64 amount)
+{
+	unsigned long flags;
+
+	local_irq_save(flags);
+	__add_bdi_stat(bdi, item, amount);
+	local_irq_restore(flags);
+}
+
 static inline void dec_bdi_stat(struct backing_dev_info *bdi,
 		enum bdi_stat_item item)
 {
-- 
1.8.3.2


^ permalink raw reply related	[flat|nested] 116+ messages in thread

* [PATCH 09/23] thp, mm: rewrite delete_from_page_cache() to support huge pages
  2013-08-04  2:17 ` Kirill A. Shutemov
@ 2013-08-04  2:17   ` Kirill A. Shutemov
  -1 siblings, 0 replies; 116+ messages in thread
From: Kirill A. Shutemov @ 2013-08-04  2:17 UTC (permalink / raw)
  To: Andrea Arcangeli, Andrew Morton
  Cc: Al Viro, Hugh Dickins, Wu Fengguang, Jan Kara, Mel Gorman,
	linux-mm, Andi Kleen, Matthew Wilcox, Kirill A. Shutemov,
	Hillf Danton, Dave Hansen, Ning Qu, linux-fsdevel, linux-kernel,
	Kirill A. Shutemov

From: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>

As with add_to_page_cache_locked(), we handle HPAGE_CACHE_NR pages at a
time.

Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
---
 mm/filemap.c | 21 +++++++++++++++------
 1 file changed, 15 insertions(+), 6 deletions(-)

diff --git a/mm/filemap.c b/mm/filemap.c
index 619e6cb..b75bdf5 100644
--- a/mm/filemap.c
+++ b/mm/filemap.c
@@ -115,6 +115,7 @@
 void __delete_from_page_cache(struct page *page)
 {
 	struct address_space *mapping = page->mapping;
+	int i, nr;
 
 	trace_mm_filemap_delete_from_page_cache(page);
 	/*
@@ -127,13 +128,21 @@ void __delete_from_page_cache(struct page *page)
 	else
 		cleancache_invalidate_page(mapping, page);
 
-	radix_tree_delete(&mapping->page_tree, page->index);
+	nr = hpagecache_nr_pages(page);
+	for (i = 0; i < nr; i++) {
+		page[i].mapping = NULL;
+		radix_tree_delete(&mapping->page_tree, page->index + i);
+	}
+	/* thp */
+	if (nr > 1)
+		__dec_zone_page_state(page, NR_FILE_TRANSPARENT_HUGEPAGES);
+
 	page->mapping = NULL;
 	/* Leave page->index set: truncation lookup relies upon it */
-	mapping->nrpages--;
-	__dec_zone_page_state(page, NR_FILE_PAGES);
+	mapping->nrpages -= nr;
+	__mod_zone_page_state(page_zone(page), NR_FILE_PAGES, -nr);
 	if (PageSwapBacked(page))
-		__dec_zone_page_state(page, NR_SHMEM);
+		__mod_zone_page_state(page_zone(page), NR_SHMEM, -nr);
 	BUG_ON(page_mapped(page));
 
 	/*
@@ -144,8 +153,8 @@ void __delete_from_page_cache(struct page *page)
 	 * having removed the page entirely.
 	 */
 	if (PageDirty(page) && mapping_cap_account_dirty(mapping)) {
-		dec_zone_page_state(page, NR_FILE_DIRTY);
-		dec_bdi_stat(mapping->backing_dev_info, BDI_RECLAIMABLE);
+		mod_zone_page_state(page_zone(page), NR_FILE_DIRTY, -nr);
+		add_bdi_stat(mapping->backing_dev_info, BDI_RECLAIMABLE, -nr);
 	}
 }
 
-- 
1.8.3.2


^ permalink raw reply related	[flat|nested] 116+ messages in thread

* [PATCH 10/23] thp, mm: warn if we try to use replace_page_cache_page() with THP
  2013-08-04  2:17 ` Kirill A. Shutemov
@ 2013-08-04  2:17   ` Kirill A. Shutemov
  -1 siblings, 0 replies; 116+ messages in thread
From: Kirill A. Shutemov @ 2013-08-04  2:17 UTC (permalink / raw)
  To: Andrea Arcangeli, Andrew Morton
  Cc: Al Viro, Hugh Dickins, Wu Fengguang, Jan Kara, Mel Gorman,
	linux-mm, Andi Kleen, Matthew Wilcox, Kirill A. Shutemov,
	Hillf Danton, Dave Hansen, Ning Qu, linux-fsdevel, linux-kernel,
	Kirill A. Shutemov

From: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>

replace_page_cache_page() is only used by FUSE. It's unlikely that we
will support THP in the FUSE page cache any time soon.

Let's postpone the implementation of THP handling in
replace_page_cache_page() until somebody actually needs it. Return
-EINVAL and WARN_ONCE() for now.

Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
---
 mm/filemap.c | 4 ++++
 1 file changed, 4 insertions(+)

diff --git a/mm/filemap.c b/mm/filemap.c
index b75bdf5..28f4927 100644
--- a/mm/filemap.c
+++ b/mm/filemap.c
@@ -418,6 +418,10 @@ int replace_page_cache_page(struct page *old, struct page *new, gfp_t gfp_mask)
 {
 	int error;
 
+	if (WARN_ONCE(PageTransHuge(old) || PageTransHuge(new),
+		     "unexpected transhuge page\n"))
+		return -EINVAL;
+
 	VM_BUG_ON(!PageLocked(old));
 	VM_BUG_ON(!PageLocked(new));
 	VM_BUG_ON(new->mapping);
-- 
1.8.3.2


^ permalink raw reply related	[flat|nested] 116+ messages in thread

* [PATCH 11/23] thp, mm: handle tail pages in page_cache_get_speculative()
  2013-08-04  2:17 ` Kirill A. Shutemov
@ 2013-08-04  2:17   ` Kirill A. Shutemov
  -1 siblings, 0 replies; 116+ messages in thread
From: Kirill A. Shutemov @ 2013-08-04  2:17 UTC (permalink / raw)
  To: Andrea Arcangeli, Andrew Morton
  Cc: Al Viro, Hugh Dickins, Wu Fengguang, Jan Kara, Mel Gorman,
	linux-mm, Andi Kleen, Matthew Wilcox, Kirill A. Shutemov,
	Hillf Danton, Dave Hansen, Ning Qu, linux-fsdevel, linux-kernel,
	Kirill A. Shutemov

From: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>

For tail pages we need to take two reference counters:
 - ->_count of the head page;
 - ->_mapcount of the tail page.

To protect against splitting we take the compound lock and re-check that
we still have a tail page before taking the ->_mapcount reference.
If the page was split we drop the ->_count reference from the head page
and return 0 to indicate to the caller that it must retry.

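To make the retry contract concrete, here is a minimal sketch of a lockless
lookup built on the new behaviour. It is a simplification of what
find_get_page() does (the real code also re-checks the radix-tree slot after
taking the reference); the function name is made up:

/*
 * Sketch only: retry whenever page_cache_get_speculative() returns 0,
 * i.e. the page was freed or the THP was split under us.
 */
static struct page *thp_lookup_sketch(struct address_space *mapping,
				      pgoff_t index)
{
	struct page *page;

	rcu_read_lock();
repeat:
	page = radix_tree_lookup(&mapping->page_tree, index);
	if (page && !page_cache_get_speculative(page))
		goto repeat;
	rcu_read_unlock();
	return page;
}
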
Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
---
 include/linux/pagemap.h | 26 ++++++++++++++++++++++----
 1 file changed, 22 insertions(+), 4 deletions(-)

diff --git a/include/linux/pagemap.h b/include/linux/pagemap.h
index 47b5082..d459b38 100644
--- a/include/linux/pagemap.h
+++ b/include/linux/pagemap.h
@@ -161,6 +161,8 @@ void release_pages(struct page **pages, int nr, int cold);
  */
 static inline int page_cache_get_speculative(struct page *page)
 {
+	struct page *page_head = compound_trans_head(page);
+
 	VM_BUG_ON(in_interrupt());
 
 #ifdef CONFIG_TINY_RCU
@@ -176,11 +178,11 @@ static inline int page_cache_get_speculative(struct page *page)
 	 * disabling preempt, and hence no need for the "speculative get" that
 	 * SMP requires.
 	 */
-	VM_BUG_ON(page_count(page) == 0);
-	atomic_inc(&page->_count);
+	VM_BUG_ON(page_count(page_head) == 0);
+	atomic_inc(&page_head->_count);
 
 #else
-	if (unlikely(!get_page_unless_zero(page))) {
+	if (unlikely(!get_page_unless_zero(page_head))) {
 		/*
 		 * Either the page has been freed, or will be freed.
 		 * In either case, retry here and the caller should
@@ -189,7 +191,23 @@ static inline int page_cache_get_speculative(struct page *page)
 		return 0;
 	}
 #endif
-	VM_BUG_ON(PageTail(page));
+
+	if (unlikely(PageTransTail(page))) {
+		unsigned long flags;
+		int got = 0;
+
+		flags = compound_lock_irqsave(page_head);
+		if (likely(PageTransTail(page))) {
+			atomic_inc(&page->_mapcount);
+			got = 1;
+		}
+		compound_unlock_irqrestore(page_head, flags);
+
+		if (unlikely(!got))
+			atomic_dec(&page_head->_count);
+
+		return got;
+	}
 
 	return 1;
 }
-- 
1.8.3.2


^ permalink raw reply related	[flat|nested] 116+ messages in thread

* [PATCH 12/23] thp, mm: add event counters for huge page alloc on file write or read
  2013-08-04  2:17 ` Kirill A. Shutemov
@ 2013-08-04  2:17   ` Kirill A. Shutemov
  -1 siblings, 0 replies; 116+ messages in thread
From: Kirill A. Shutemov @ 2013-08-04  2:17 UTC (permalink / raw)
  To: Andrea Arcangeli, Andrew Morton
  Cc: Al Viro, Hugh Dickins, Wu Fengguang, Jan Kara, Mel Gorman,
	linux-mm, Andi Kleen, Matthew Wilcox, Kirill A. Shutemov,
	Hillf Danton, Dave Hansen, Ning Qu, linux-fsdevel, linux-kernel,
	Kirill A. Shutemov

From: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>

Existing stats specify the source of a thp page: fault or collapse.
We're going to allocate new huge pages on write(2) and read(2). That's
neither fault nor collapse.

Let's introduce new events for that.

Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
---
 Documentation/vm/transhuge.txt | 7 +++++++
 include/linux/huge_mm.h        | 5 +++++
 include/linux/vm_event_item.h  | 4 ++++
 mm/vmstat.c                    | 4 ++++
 4 files changed, 20 insertions(+)

diff --git a/Documentation/vm/transhuge.txt b/Documentation/vm/transhuge.txt
index 4cc15c4..a78f738 100644
--- a/Documentation/vm/transhuge.txt
+++ b/Documentation/vm/transhuge.txt
@@ -202,6 +202,10 @@ thp_collapse_alloc is incremented by khugepaged when it has found
 	a range of pages to collapse into one huge page and has
 	successfully allocated a new huge page to store the data.
 
+thp_write_alloc and thp_read_alloc are incremented every time a huge
+	page is successfully allocated to handle write(2) to a file or
+	read(2) from a file.
+
 thp_fault_fallback is incremented if a page fault fails to allocate
 	a huge page and instead falls back to using small pages.
 
@@ -209,6 +213,9 @@ thp_collapse_alloc_failed is incremented if khugepaged found a range
 	of pages that should be collapsed into one huge page but failed
 	the allocation.
 
+thp_write_alloc_failed and thp_read_alloc_failed are incremented if
+	huge page allocation failed when tried on write(2) or read(2).
+
 thp_split is incremented every time a huge page is split into base
 	pages. This can happen for a variety of reasons but a common
 	reason is that a huge page is old and is being reclaimed.
diff --git a/include/linux/huge_mm.h b/include/linux/huge_mm.h
index 4dc66c9..9a0a114 100644
--- a/include/linux/huge_mm.h
+++ b/include/linux/huge_mm.h
@@ -183,6 +183,11 @@ extern int do_huge_pmd_numa_page(struct mm_struct *mm, struct vm_area_struct *vm
 #define HPAGE_PMD_MASK ({ BUILD_BUG(); 0; })
 #define HPAGE_PMD_SIZE ({ BUILD_BUG(); 0; })
 
+#define THP_WRITE_ALLOC		({ BUILD_BUG(); 0; })
+#define THP_WRITE_ALLOC_FAILED	({ BUILD_BUG(); 0; })
+#define THP_READ_ALLOC		({ BUILD_BUG(); 0; })
+#define THP_READ_ALLOC_FAILED	({ BUILD_BUG(); 0; })
+
 #define hpage_nr_pages(x) 1
 
 #define transparent_hugepage_enabled(__vma) 0
diff --git a/include/linux/vm_event_item.h b/include/linux/vm_event_item.h
index 1855f0a..8e071bb 100644
--- a/include/linux/vm_event_item.h
+++ b/include/linux/vm_event_item.h
@@ -66,6 +66,10 @@ enum vm_event_item { PGPGIN, PGPGOUT, PSWPIN, PSWPOUT,
 		THP_FAULT_FALLBACK,
 		THP_COLLAPSE_ALLOC,
 		THP_COLLAPSE_ALLOC_FAILED,
+		THP_WRITE_ALLOC,
+		THP_WRITE_ALLOC_FAILED,
+		THP_READ_ALLOC,
+		THP_READ_ALLOC_FAILED,
 		THP_SPLIT,
 		THP_ZERO_PAGE_ALLOC,
 		THP_ZERO_PAGE_ALLOC_FAILED,
diff --git a/mm/vmstat.c b/mm/vmstat.c
index ffe3fbd..a80ea59 100644
--- a/mm/vmstat.c
+++ b/mm/vmstat.c
@@ -815,6 +815,10 @@ const char * const vmstat_text[] = {
 	"thp_fault_fallback",
 	"thp_collapse_alloc",
 	"thp_collapse_alloc_failed",
+	"thp_write_alloc",
+	"thp_write_alloc_failed",
+	"thp_read_alloc",
+	"thp_read_alloc_failed",
 	"thp_split",
 	"thp_zero_page_alloc",
 	"thp_zero_page_alloc_failed",
-- 
1.8.3.2


^ permalink raw reply related	[flat|nested] 116+ messages in thread

* [PATCH 13/23] thp, mm: allocate huge pages in grab_cache_page_write_begin()
  2013-08-04  2:17 ` Kirill A. Shutemov
@ 2013-08-04  2:17   ` Kirill A. Shutemov
  -1 siblings, 0 replies; 116+ messages in thread
From: Kirill A. Shutemov @ 2013-08-04  2:17 UTC (permalink / raw)
  To: Andrea Arcangeli, Andrew Morton
  Cc: Al Viro, Hugh Dickins, Wu Fengguang, Jan Kara, Mel Gorman,
	linux-mm, Andi Kleen, Matthew Wilcox, Kirill A. Shutemov,
	Hillf Danton, Dave Hansen, Ning Qu, linux-fsdevel, linux-kernel,
	Kirill A. Shutemov

From: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>

Try to allocate a huge page if flags has AOP_FLAG_TRANSHUGE set.

If, for some reason, it's not possible to allocate a huge page at this
position, the function returns NULL. The caller should take care of
falling back to small pages.

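A hedged sketch of the fallback contract from the caller's side; the helper
name is made up and the exact call sequence is an illustration (the real
caller shows up in the libfs patch later in the series):

/*
 * Sketch only: ask for a huge page at the aligned index first; on NULL,
 * fall back to a small page at the original index.
 */
static struct page *thp_or_small_page(struct address_space *mapping,
				      pgoff_t index, unsigned flags)
{
	struct page *page;

	page = grab_cache_page_write_begin(mapping,
			index & ~HPAGE_CACHE_INDEX_MASK,
			flags | AOP_FLAG_TRANSHUGE);
	if (!page)
		page = grab_cache_page_write_begin(mapping, index, flags);
	return page;
}
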
Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
---
 include/linux/fs.h |  1 +
 mm/filemap.c       | 24 ++++++++++++++++++++++--
 2 files changed, 23 insertions(+), 2 deletions(-)

diff --git a/include/linux/fs.h b/include/linux/fs.h
index b09ddc0..d5f58b3 100644
--- a/include/linux/fs.h
+++ b/include/linux/fs.h
@@ -282,6 +282,7 @@ enum positive_aop_returns {
 #define AOP_FLAG_NOFS			0x0004 /* used by filesystem to direct
 						* helper code (eg buffer layer)
 						* to clear GFP_FS from alloc */
+#define AOP_FLAG_TRANSHUGE		0x0008 /* allocate transhuge page */
 
 /*
  * oh the beauties of C type declarations.
diff --git a/mm/filemap.c b/mm/filemap.c
index 28f4927..b17ebb9 100644
--- a/mm/filemap.c
+++ b/mm/filemap.c
@@ -2313,18 +2313,38 @@ struct page *grab_cache_page_write_begin(struct address_space *mapping,
 	gfp_t gfp_mask;
 	struct page *page;
 	gfp_t gfp_notmask = 0;
+	bool must_use_thp = (flags & AOP_FLAG_TRANSHUGE) &&
+		IS_ENABLED(CONFIG_TRANSPARENT_HUGEPAGE_PAGECACHE);
+
 
 	gfp_mask = mapping_gfp_mask(mapping);
+	if (must_use_thp) {
+		BUG_ON(index & HPAGE_CACHE_INDEX_MASK);
+		BUG_ON(!(gfp_mask & __GFP_COMP));
+	}
 	if (mapping_cap_account_dirty(mapping))
 		gfp_mask |= __GFP_WRITE;
 	if (flags & AOP_FLAG_NOFS)
 		gfp_notmask = __GFP_FS;
 repeat:
 	page = find_lock_page(mapping, index);
-	if (page)
+	if (page) {
+		if (must_use_thp && !PageTransHuge(page)) {
+			unlock_page(page);
+			page_cache_release(page);
+			return NULL;
+		}
 		goto found;
+	}
 
-	page = __page_cache_alloc(gfp_mask & ~gfp_notmask);
+	if (must_use_thp) {
+		page = alloc_pages(gfp_mask & ~gfp_notmask, HPAGE_PMD_ORDER);
+		if (page)
+			count_vm_event(THP_WRITE_ALLOC);
+		else
+			count_vm_event(THP_WRITE_ALLOC_FAILED);
+	} else
+		page = __page_cache_alloc(gfp_mask & ~gfp_notmask);
 	if (!page)
 		return NULL;
 	status = add_to_page_cache_lru(page, mapping, index,
-- 
1.8.3.2


^ permalink raw reply related	[flat|nested] 116+ messages in thread

* [PATCH 14/23] thp, mm: naive support of thp in generic_perform_write
  2013-08-04  2:17 ` Kirill A. Shutemov
@ 2013-08-04  2:17   ` Kirill A. Shutemov
  -1 siblings, 0 replies; 116+ messages in thread
From: Kirill A. Shutemov @ 2013-08-04  2:17 UTC (permalink / raw)
  To: Andrea Arcangeli, Andrew Morton
  Cc: Al Viro, Hugh Dickins, Wu Fengguang, Jan Kara, Mel Gorman,
	linux-mm, Andi Kleen, Matthew Wilcox, Kirill A. Shutemov,
	Hillf Danton, Dave Hansen, Ning Qu, linux-fsdevel, linux-kernel,
	Kirill A. Shutemov

From: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>

For now we still write/read at most PAGE_CACHE_SIZE bytes at a time.

This implementation doesn't cover address spaces with backing storage.

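A small userspace sketch of the subpage arithmetic the hunk below performs:
within a huge page, the file position selects which 4K subpage the copy
targets, while 'offset' remains the offset inside that subpage. Constants
are x86-64 values, for illustration only:

#include <stdio.h>

#define PAGE_CACHE_SHIFT 12
#define PAGE_CACHE_SIZE  (1UL << PAGE_CACHE_SHIFT)
#define HPAGE_PMD_SHIFT  21
#define HPAGE_PMD_MASK   (~((1UL << HPAGE_PMD_SHIFT) - 1))

int main(void)
{
	unsigned long pos = (2UL << 20) + (5UL << 12) + 100;	/* arbitrary position */
	unsigned long offset = pos & (PAGE_CACHE_SIZE - 1);	/* offset in subpage */
	unsigned long huge_offset = pos & ~HPAGE_PMD_MASK;	/* offset in huge page */
	int subpage_nr = huge_offset >> PAGE_CACHE_SHIFT;

	/* prints: pos=2117732 -> subpage 5, offset 100 */
	printf("pos=%lu -> subpage %d, offset %lu\n", pos, subpage_nr, offset);
	return 0;
}
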
Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
---
 mm/filemap.c | 9 ++++++++-
 1 file changed, 8 insertions(+), 1 deletion(-)

diff --git a/mm/filemap.c b/mm/filemap.c
index b17ebb9..066bbff 100644
--- a/mm/filemap.c
+++ b/mm/filemap.c
@@ -2382,6 +2382,7 @@ static ssize_t generic_perform_write(struct file *file,
 		unsigned long bytes;	/* Bytes to write to page */
 		size_t copied;		/* Bytes copied from user */
 		void *fsdata;
+		int subpage_nr = 0;
 
 		offset = (pos & (PAGE_CACHE_SIZE - 1));
 		bytes = min_t(unsigned long, PAGE_CACHE_SIZE - offset,
@@ -2411,8 +2412,14 @@ again:
 		if (mapping_writably_mapped(mapping))
 			flush_dcache_page(page);
 
+		if (PageTransHuge(page)) {
+			off_t huge_offset = pos & ~HPAGE_PMD_MASK;
+			subpage_nr = huge_offset >> PAGE_CACHE_SHIFT;
+		}
+
 		pagefault_disable();
-		copied = iov_iter_copy_from_user_atomic(page, i, offset, bytes);
+		copied = iov_iter_copy_from_user_atomic(page + subpage_nr, i,
+				offset, bytes);
 		pagefault_enable();
 		flush_dcache_page(page);
 
-- 
1.8.3.2


^ permalink raw reply related	[flat|nested] 116+ messages in thread

* [PATCH 15/23] mm, fs: avoid page allocation beyond i_size on read
  2013-08-04  2:17 ` Kirill A. Shutemov
@ 2013-08-04  2:17   ` Kirill A. Shutemov
  -1 siblings, 0 replies; 116+ messages in thread
From: Kirill A. Shutemov @ 2013-08-04  2:17 UTC (permalink / raw)
  To: Andrea Arcangeli, Andrew Morton
  Cc: Al Viro, Hugh Dickins, Wu Fengguang, Jan Kara, Mel Gorman,
	linux-mm, Andi Kleen, Matthew Wilcox, Kirill A. Shutemov,
	Hillf Danton, Dave Hansen, Ning Qu, linux-fsdevel, linux-kernel,
	Kirill A. Shutemov, NeilBrown

From: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>

I've noticed that we allocate an unneeded page cache page on read beyond
i_size. Simple test case (I checked it on ramfs):

$ touch testfile
$ cat testfile

It triggers the 'no_cached_page' code path in do_generic_file_read().

It looks like a regression since commit a32ea1e. Let's fix it.

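A tiny userspace sketch of the bound being checked: for a file of size
isize, the last valid page cache index is (isize - 1) >> PAGE_CACHE_SHIFT,
and an empty file has no valid index at all (4K page size assumed for
illustration):

#include <stdio.h>

#define PAGE_CACHE_SHIFT 12

static int index_beyond_eof(unsigned long index, unsigned long isize)
{
	return !isize || index > (isize - 1) >> PAGE_CACHE_SHIFT;
}

int main(void)
{
	printf("%d\n", index_beyond_eof(0, 0));	/* 1: empty file, nothing to read */
	printf("%d\n", index_beyond_eof(0, 1));	/* 0: byte 0 lives in page 0 */
	printf("%d\n", index_beyond_eof(1, 4096));	/* 1: a 4096-byte file ends in page 0 */
	return 0;
}
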
Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Cc: NeilBrown <neilb@suse.de>
---
 mm/filemap.c | 4 ++++
 1 file changed, 4 insertions(+)

diff --git a/mm/filemap.c b/mm/filemap.c
index 066bbff..c31d296 100644
--- a/mm/filemap.c
+++ b/mm/filemap.c
@@ -1163,6 +1163,10 @@ static void do_generic_file_read(struct file *filp, loff_t *ppos,
 		loff_t isize;
 		unsigned long nr, ret;
 
+		isize = i_size_read(inode);
+		if (!isize || index > (isize - 1) >> PAGE_CACHE_SHIFT)
+			goto out;
+
 		cond_resched();
 find_page:
 		page = find_get_page(mapping, index);
-- 
1.8.3.2


^ permalink raw reply related	[flat|nested] 116+ messages in thread

* [PATCH 16/23] thp, mm: handle transhuge pages in do_generic_file_read()
  2013-08-04  2:17 ` Kirill A. Shutemov
@ 2013-08-04  2:17   ` Kirill A. Shutemov
  -1 siblings, 0 replies; 116+ messages in thread
From: Kirill A. Shutemov @ 2013-08-04  2:17 UTC (permalink / raw)
  To: Andrea Arcangeli, Andrew Morton
  Cc: Al Viro, Hugh Dickins, Wu Fengguang, Jan Kara, Mel Gorman,
	linux-mm, Andi Kleen, Matthew Wilcox, Kirill A. Shutemov,
	Hillf Danton, Dave Hansen, Ning Qu, linux-fsdevel, linux-kernel,
	Kirill A. Shutemov

From: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>

If a transhuge page is already in the page cache (up to date and not
under readahead) we go the usual path: read from the relevant subpage
(head or tail).

If the page is not cached (the sparse file case on ramfs) and the mapping
can have huge pages, we try to allocate a new huge page and read it.

If a page is not up to date or is under readahead, we have to move 'page'
to the head page of the compound page, since the head represents the
state of the whole transhuge page. We switch back to the relevant subpage
once the page is ready to be read (the 'page_ok' label).

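A hedged sketch of the head/tail bookkeeping described above, reduced to
the reference-moving steps (the real hunks interleave this with the
readahead and uptodate handling; the helper names are made up):

/* Sketch only: hold the head page while the compound page is made uptodate. */
static struct page *switch_to_head(struct page *page)
{
	struct page *head = compound_trans_head(page);

	if (page != head) {
		page_cache_get(head);
		page_cache_release(page);
	}
	return head;
}

/* Sketch only: go back to the subpage that 'index' actually refers to. */
static struct page *switch_to_subpage(struct page *head, pgoff_t index)
{
	int off = index & HPAGE_CACHE_INDEX_MASK;

	if (off) {
		page_cache_get(head + off);
		page_cache_release(head);
	}
	return head + off;
}
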
Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
---
 mm/filemap.c | 57 +++++++++++++++++++++++++++++++++++++++++++++++++++++++--
 1 file changed, 55 insertions(+), 2 deletions(-)

diff --git a/mm/filemap.c b/mm/filemap.c
index c31d296..ed65af5 100644
--- a/mm/filemap.c
+++ b/mm/filemap.c
@@ -1175,8 +1175,28 @@ find_page:
 					ra, filp,
 					index, last_index - index);
 			page = find_get_page(mapping, index);
-			if (unlikely(page == NULL))
-				goto no_cached_page;
+			if (unlikely(page == NULL)) {
+				if (mapping_can_have_hugepages(mapping))
+					goto no_cached_page_thp;
+				else
+					goto no_cached_page;
+			}
+		}
+		if (PageTransCompound(page)) {
+			struct page *head = compound_trans_head(page);
+
+			if (!PageReadahead(head) && PageUptodate(page))
+				goto page_ok;
+
+			/*
+			 * Switch 'page' to head page. That's needed to handle
+			 * readahead or make page uptodate.
+			 * It will be switched back to the right tail page at
+			 * the beginning of 'page_ok'.
+			 */
+			page_cache_get(head);
+			page_cache_release(page);
+			page = head;
 		}
 		if (PageReadahead(page)) {
 			page_cache_async_readahead(mapping,
@@ -1198,6 +1218,18 @@ find_page:
 			unlock_page(page);
 		}
 page_ok:
+		/* Switch back to relevant tail page, if needed */
+		if (PageTransCompoundCache(page) && !PageTransTail(page)) {
+			int off = index & HPAGE_CACHE_INDEX_MASK;
+			if (off) {
+				page_cache_get(page + off);
+				page_cache_release(page);
+				page += off;
+			}
+		}
+
+		VM_BUG_ON(page->index != index);
+
 		/*
 		 * i_size must be checked after we know the page is Uptodate.
 		 *
@@ -1329,6 +1361,27 @@ readpage_error:
 		page_cache_release(page);
 		goto out;
 
+no_cached_page_thp:
+		page = alloc_pages(mapping_gfp_mask(mapping) | __GFP_COLD,
+				HPAGE_PMD_ORDER);
+		if (!page) {
+			count_vm_event(THP_READ_ALLOC_FAILED);
+			goto no_cached_page;
+		}
+		count_vm_event(THP_READ_ALLOC);
+
+		error = add_to_page_cache_lru(page, mapping,
+				index & ~HPAGE_CACHE_INDEX_MASK, GFP_KERNEL);
+		if (!error)
+			goto readpage;
+
+		page_cache_release(page);
+		if (error != -EEXIST && error != -ENOSPC) {
+			desc->error = error;
+			goto out;
+		}
+
+		/* Fallback to small page */
 no_cached_page:
 		/*
 		 * Ok, it wasn't cached, so we need to create a new
-- 
1.8.3.2


^ permalink raw reply related	[flat|nested] 116+ messages in thread

* [PATCH 17/23] thp, libfs: initial thp support
  2013-08-04  2:17 ` Kirill A. Shutemov
@ 2013-08-04  2:17   ` Kirill A. Shutemov
  -1 siblings, 0 replies; 116+ messages in thread
From: Kirill A. Shutemov @ 2013-08-04  2:17 UTC (permalink / raw)
  To: Andrea Arcangeli, Andrew Morton
  Cc: Al Viro, Hugh Dickins, Wu Fengguang, Jan Kara, Mel Gorman,
	linux-mm, Andi Kleen, Matthew Wilcox, Kirill A. Shutemov,
	Hillf Danton, Dave Hansen, Ning Qu, linux-fsdevel, linux-kernel,
	Kirill A. Shutemov

From: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>

simple_readpage() and simple_write_end() are modified to handle huge
pages.

simple_thp_write_begin() is introduced to allocate huge pages on write.
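
As a minimal illustration of the index/offset arithmetic used below (a
userspace sketch, not kernel code; 4KiB base pages and 2MiB huge pages
are assumed, as on x86-64, and HPAGE_PMD_SIZE - 1 stands in for
~HPAGE_PMD_MASK):

#include <stdio.h>

#define PAGE_CACHE_SHIFT	12
#define PAGE_CACHE_SIZE		(1UL << PAGE_CACHE_SHIFT)
#define HPAGE_PMD_SIZE		(2UL << 20)				/* 2MiB */
#define HPAGE_CACHE_NR		(HPAGE_PMD_SIZE / PAGE_CACHE_SIZE)	/* 512 */
#define HPAGE_CACHE_INDEX_MASK	(HPAGE_CACHE_NR - 1)

int main(void)
{
	unsigned long pos = 3 * HPAGE_PMD_SIZE + 123456;	/* arbitrary file position */
	unsigned long index = pos >> PAGE_CACHE_SHIFT;		/* small page index */

	printf("small page index:            %lu\n", index);
	printf("huge page head index:        %lu\n", index & ~HPAGE_CACHE_INDEX_MASK);
	printf("offset inside the huge page: %lu\n", pos & (HPAGE_PMD_SIZE - 1));
	printf("offset inside a small page:  %lu\n", pos & (PAGE_CACHE_SIZE - 1));
	return 0;
}

grab_cache_page_write_begin() is called with the head index; when the
huge page cannot be allocated, len is clamped so the copy stays within a
single small page and the normal small-page path takes over.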

Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
---
 fs/libfs.c              | 53 +++++++++++++++++++++++++++++++++++++++++++++----
 include/linux/fs.h      |  7 +++++++
 include/linux/pagemap.h |  8 ++++++++
 3 files changed, 64 insertions(+), 4 deletions(-)

diff --git a/fs/libfs.c b/fs/libfs.c
index 3a3a9b5..934778b 100644
--- a/fs/libfs.c
+++ b/fs/libfs.c
@@ -364,7 +364,7 @@ EXPORT_SYMBOL(simple_setattr);
 
 int simple_readpage(struct file *file, struct page *page)
 {
-	clear_highpage(page);
+	clear_pagecache_page(page);
 	flush_dcache_page(page);
 	SetPageUptodate(page);
 	unlock_page(page);
@@ -424,9 +424,14 @@ int simple_write_end(struct file *file, struct address_space *mapping,
 
 	/* zero the stale part of the page if we did a short copy */
 	if (copied < len) {
-		unsigned from = pos & (PAGE_CACHE_SIZE - 1);
-
-		zero_user(page, from + copied, len - copied);
+		unsigned from;
+		if (PageTransHuge(page)) {
+			from = pos & ~HPAGE_PMD_MASK;
+			zero_huge_user(page, from + copied, len - copied);
+		} else {
+			from = pos & ~PAGE_CACHE_MASK;
+			zero_user(page, from + copied, len - copied);
+		}
 	}
 
 	if (!PageUptodate(page))
@@ -445,6 +450,46 @@ int simple_write_end(struct file *file, struct address_space *mapping,
 	return copied;
 }
 
+#ifdef CONFIG_TRANSPARENT_HUGEPAGE_PAGECACHE
+int simple_thp_write_begin(struct file *file, struct address_space *mapping,
+		loff_t pos, unsigned len, unsigned flags,
+		struct page **pagep, void **fsdata)
+{
+	struct page *page = NULL;
+	pgoff_t index;
+
+	index = pos >> PAGE_CACHE_SHIFT;
+
+	if (mapping_can_have_hugepages(mapping)) {
+		page = grab_cache_page_write_begin(mapping,
+				index & ~HPAGE_CACHE_INDEX_MASK,
+				flags | AOP_FLAG_TRANSHUGE);
+		/* fallback to small page */
+		if (!page) {
+			unsigned long offset;
+			offset = pos & ~PAGE_CACHE_MASK;
+			/* adjust the len to not cross small page boundary */
+			len = min_t(unsigned long,
+					len, PAGE_CACHE_SIZE - offset);
+		}
+		BUG_ON(page && !PageTransHuge(page));
+	}
+	if (!page)
+		return simple_write_begin(file, mapping, pos, len, flags,
+				pagep, fsdata);
+
+	*pagep = page;
+
+	if (!PageUptodate(page) && len != HPAGE_PMD_SIZE) {
+		unsigned from = pos & ~HPAGE_PMD_MASK;
+
+		zero_huge_user_segment(page, 0, from);
+		zero_huge_user_segment(page, from + len, HPAGE_PMD_SIZE);
+	}
+	return 0;
+}
+#endif
+
 /*
  * the inodes created here are not hashed. If you use iunique to generate
  * unique inode values later for this filesystem, then you must take care
diff --git a/include/linux/fs.h b/include/linux/fs.h
index d5f58b3..c1dbf43 100644
--- a/include/linux/fs.h
+++ b/include/linux/fs.h
@@ -2553,6 +2553,13 @@ extern int simple_write_begin(struct file *file, struct address_space *mapping,
 extern int simple_write_end(struct file *file, struct address_space *mapping,
 			loff_t pos, unsigned len, unsigned copied,
 			struct page *page, void *fsdata);
+#ifdef CONFIG_TRANSPARENT_HUGEPAGE_PAGECACHE
+extern int simple_thp_write_begin(struct file *file,
+		struct address_space *mapping, loff_t pos, unsigned len,
+		unsigned flags,	struct page **pagep, void **fsdata);
+#else
+#define simple_thp_write_begin simple_write_begin
+#endif
 
 extern struct dentry *simple_lookup(struct inode *, struct dentry *, unsigned int flags);
 extern ssize_t generic_read_dir(struct file *, char __user *, size_t, loff_t *);
diff --git a/include/linux/pagemap.h b/include/linux/pagemap.h
index d459b38..eb484f2 100644
--- a/include/linux/pagemap.h
+++ b/include/linux/pagemap.h
@@ -591,4 +591,12 @@ static inline int add_to_page_cache(struct page *page,
 	return error;
 }
 
+static inline void clear_pagecache_page(struct page *page)
+{
+	if (PageTransHuge(page))
+		zero_huge_user(page, 0, HPAGE_PMD_SIZE);
+	else
+		clear_highpage(page);
+}
+
 #endif /* _LINUX_PAGEMAP_H */
-- 
1.8.3.2


^ permalink raw reply related	[flat|nested] 116+ messages in thread

* [PATCH 18/23] thp: libfs: introduce simple_thp_release()
  2013-08-04  2:17 ` Kirill A. Shutemov
@ 2013-08-04  2:17   ` Kirill A. Shutemov
  -1 siblings, 0 replies; 116+ messages in thread
From: Kirill A. Shutemov @ 2013-08-04  2:17 UTC (permalink / raw)
  To: Andrea Arcangeli, Andrew Morton
  Cc: Al Viro, Hugh Dickins, Wu Fengguang, Jan Kara, Mel Gorman,
	linux-mm, Andi Kleen, Matthew Wilcox, Kirill A. Shutemov,
	Hillf Danton, Dave Hansen, Ning Qu, linux-fsdevel, linux-kernel,
	Kirill A. Shutemov

From: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>

simple_thp_release() is a dummy implementation of fops->release with
transparent huge page support. It's required to minimize memory overhead
of huge pages for small files.

It checks whether we should split the last page in the file to give
memory back to the system.

We split the page if it meets the following criteria (see the sketch
below):
 - nobody has the file open for writing;
 - splitting will actually free some memory (at least one small page);
 - it is, in fact, a huge page ;)
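
The second criterion is the interesting one. A throwaway userspace sketch
of that check (geometry assumed: 4KiB base pages and 2MiB huge pages, so
HPAGE_CACHE_NR is 512):

#include <stdio.h>

#define PAGE_CACHE_SHIFT	12
#define HPAGE_CACHE_NR		512
#define HPAGE_CACHE_INDEX_MASK	(HPAGE_CACHE_NR - 1)

static void check(unsigned long long i_size)
{
	unsigned long long last_index = i_size >> PAGE_CACHE_SHIFT;
	int keep = (last_index & HPAGE_CACHE_INDEX_MASK) + 1 == HPAGE_CACHE_NR;

	printf("i_size %8llu: %s\n", i_size, keep ?
	       "splitting would free nothing, keep the huge page" :
	       "worth trying to split the last huge page");
}

int main(void)
{
	check(5000);			/* tiny file: most subpages unused */
	check((2ULL << 20) - 1);	/* file fills its last huge page */
	return 0;
}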

Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
---
 fs/libfs.c         | 27 +++++++++++++++++++++++++++
 include/linux/fs.h |  2 ++
 2 files changed, 29 insertions(+)

diff --git a/fs/libfs.c b/fs/libfs.c
index 934778b..c43b055 100644
--- a/fs/libfs.c
+++ b/fs/libfs.c
@@ -488,6 +488,33 @@ int simple_thp_write_begin(struct file *file, struct address_space *mapping,
 	}
 	return 0;
 }
+
+int simple_thp_release(struct inode *inode, struct file *file)
+{
+	pgoff_t last_index;
+	struct page *page;
+
+	/* check if anybody still writes to file */
+	if (atomic_read(&inode->i_writecount) != !!(file->f_mode & FMODE_WRITE))
+		return 0;
+
+	last_index = i_size_read(inode) >> PAGE_CACHE_SHIFT;
+
+	/* check if splitting the page will free any memory */
+	if ((last_index & HPAGE_CACHE_INDEX_MASK) + 1 == HPAGE_CACHE_NR)
+		return 0;
+
+	page = find_get_page(file->f_mapping,
+			last_index & ~HPAGE_CACHE_INDEX_MASK);
+	if (!page)
+		return 0;
+
+	if (PageTransHuge(page))
+		split_huge_page(page);
+
+	page_cache_release(page);
+	return 0;
+}
 #endif
 
 /*
diff --git a/include/linux/fs.h b/include/linux/fs.h
index c1dbf43..b594f10 100644
--- a/include/linux/fs.h
+++ b/include/linux/fs.h
@@ -2557,8 +2557,10 @@ extern int simple_write_end(struct file *file, struct address_space *mapping,
 extern int simple_thp_write_begin(struct file *file,
 		struct address_space *mapping, loff_t pos, unsigned len,
 		unsigned flags,	struct page **pagep, void **fsdata);
+extern int simple_thp_release(struct inode *inode, struct file *file);
 #else
 #define simple_thp_write_begin simple_write_begin
+#define simple_thp_release NULL
 #endif
 
 extern struct dentry *simple_lookup(struct inode *, struct dentry *, unsigned int flags);
-- 
1.8.3.2


^ permalink raw reply related	[flat|nested] 116+ messages in thread

* [PATCH 19/23] truncate: support huge pages
  2013-08-04  2:17 ` Kirill A. Shutemov
@ 2013-08-04  2:17   ` Kirill A. Shutemov
  -1 siblings, 0 replies; 116+ messages in thread
From: Kirill A. Shutemov @ 2013-08-04  2:17 UTC (permalink / raw)
  To: Andrea Arcangeli, Andrew Morton
  Cc: Al Viro, Hugh Dickins, Wu Fengguang, Jan Kara, Mel Gorman,
	linux-mm, Andi Kleen, Matthew Wilcox, Kirill A. Shutemov,
	Hillf Danton, Dave Hansen, Ning Qu, linux-fsdevel, linux-kernel,
	Kirill A. Shutemov

From: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>

truncate_inode_pages_range() drops a whole huge page at once if it's
fully inside the range.

If a huge page is only partly in the range, we zero out that part,
exactly like we do for partial small pages.

invalidate_mapping_pages() just skips huge pages if they are not fully
in the range.
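
For the partial case the offsets handed to zero_huge_user_segment() are
computed relative to the huge page rather than the small page. A
userspace sketch of that arithmetic (2MiB huge pages assumed, not kernel
code):

#include <stdio.h>

#define HPAGE_PMD_SIZE	(2ULL << 20)	/* 2MiB */

int main(void)
{
	/* truncate a file down to 3MiB: everything from offset 3MiB on goes */
	unsigned long long lstart = 3ULL << 20;

	/* the huge page covering [2MiB, 4MiB) is only partly inside the range */
	unsigned long long hpage_start = lstart & ~(HPAGE_PMD_SIZE - 1);
	unsigned long long pstart = lstart & (HPAGE_PMD_SIZE - 1);

	printf("huge page starts at file offset %llu\n", hpage_start);
	printf("keep [0, %llu), zero [%llu, %llu) within that huge page\n",
	       pstart, pstart, HPAGE_PMD_SIZE);
	return 0;
}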

Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
---
 mm/truncate.c | 108 +++++++++++++++++++++++++++++++++++++++++++++-------------
 1 file changed, 84 insertions(+), 24 deletions(-)

diff --git a/mm/truncate.c b/mm/truncate.c
index 353b683..fcef7cb 100644
--- a/mm/truncate.c
+++ b/mm/truncate.c
@@ -205,8 +205,7 @@ void truncate_inode_pages_range(struct address_space *mapping,
 {
 	pgoff_t		start;		/* inclusive */
 	pgoff_t		end;		/* exclusive */
-	unsigned int	partial_start;	/* inclusive */
-	unsigned int	partial_end;	/* exclusive */
+	bool		partial_thp_start = false, partial_thp_end = false;
 	struct pagevec	pvec;
 	pgoff_t		index;
 	int		i;
@@ -215,15 +214,9 @@ void truncate_inode_pages_range(struct address_space *mapping,
 	if (mapping->nrpages == 0)
 		return;
 
-	/* Offsets within partial pages */
-	partial_start = lstart & (PAGE_CACHE_SIZE - 1);
-	partial_end = (lend + 1) & (PAGE_CACHE_SIZE - 1);
-
 	/*
 	 * 'start' and 'end' always covers the range of pages to be fully
-	 * truncated. Partial pages are covered with 'partial_start' at the
-	 * start of the range and 'partial_end' at the end of the range.
-	 * Note that 'end' is exclusive while 'lend' is inclusive.
+	 * truncated. Note that 'end' is exclusive while 'lend' is inclusive.
 	 */
 	start = (lstart + PAGE_CACHE_SIZE - 1) >> PAGE_CACHE_SHIFT;
 	if (lend == -1)
@@ -249,6 +242,23 @@ void truncate_inode_pages_range(struct address_space *mapping,
 			if (index >= end)
 				break;
 
+			if (PageTransTailCache(page)) {
+				/* part of already handled huge page */
+				if (!page->mapping)
+					continue;
+				/* the range starts in middle of huge page */
+				partial_thp_start = true;
+				start = index & ~HPAGE_CACHE_INDEX_MASK;
+				continue;
+			}
+			/* the range ends on huge page */
+			if (PageTransHugeCache(page) &&
+				index == (end & ~HPAGE_CACHE_INDEX_MASK)) {
+				partial_thp_end = true;
+				end = index;
+				break;
+			}
+
 			if (!trylock_page(page))
 				continue;
 			WARN_ON(page->index != index);
@@ -265,34 +275,74 @@ void truncate_inode_pages_range(struct address_space *mapping,
 		index++;
 	}
 
-	if (partial_start) {
-		struct page *page = find_lock_page(mapping, start - 1);
+	if (partial_thp_start || lstart & ~PAGE_CACHE_MASK) {
+		pgoff_t off;
+		struct page *page;
+		unsigned pstart, pend;
+		void (*zero_segment)(struct page *page,
+				unsigned start, unsigned len);
+retry_partial_start:
+		if (partial_thp_start) {
+			zero_segment = zero_huge_user_segment;
+			off = (start - 1) & ~HPAGE_CACHE_INDEX_MASK;
+			pstart = lstart & ~HPAGE_PMD_MASK;
+			if ((end & ~HPAGE_CACHE_INDEX_MASK) == off)
+				pend = (lend - 1) & ~HPAGE_PMD_MASK;
+			else
+				pend = HPAGE_PMD_SIZE;
+		} else {
+			zero_segment = zero_user_segment;
+			off = start - 1;
+			pstart = lstart & ~PAGE_CACHE_MASK;
+			if (start > end)
+				pend = (lend - 1) & ~PAGE_CACHE_MASK;
+			else
+				pend = PAGE_CACHE_SIZE;
+		}
+
+		page = find_get_page(mapping, off);
 		if (page) {
-			unsigned int top = PAGE_CACHE_SIZE;
-			if (start > end) {
-				/* Truncation within a single page */
-				top = partial_end;
-				partial_end = 0;
+			/* the last tail page */
+			if (PageTransTailCache(page)) {
+				partial_thp_start = true;
+				page_cache_release(page);
+				goto retry_partial_start;
 			}
+
+			lock_page(page);
 			wait_on_page_writeback(page);
-			zero_user_segment(page, partial_start, top);
+			zero_segment(page, pstart, pend);
 			cleancache_invalidate_page(mapping, page);
 			if (page_has_private(page))
-				do_invalidatepage(page, partial_start,
-						  top - partial_start);
+				do_invalidatepage(page, pstart,
+						pend - pstart);
 			unlock_page(page);
 			page_cache_release(page);
 		}
 	}
-	if (partial_end) {
-		struct page *page = find_lock_page(mapping, end);
+	if (partial_thp_end || (lend + 1) & ~PAGE_CACHE_MASK) {
+		pgoff_t off;
+		struct page *page;
+		unsigned pend;
+		void (*zero_segment)(struct page *page,
+				unsigned start, unsigned len);
+		if (partial_thp_end) {
+			zero_segment = zero_huge_user_segment;
+			off = end & ~HPAGE_CACHE_INDEX_MASK;
+			pend = (lend - 1) & ~HPAGE_PMD_MASK;
+		} else {
+			zero_segment = zero_user_segment;
+			off = end;
+			pend = (lend - 1) & ~PAGE_CACHE_MASK;
+		}
+
+		page = find_lock_page(mapping, end);
 		if (page) {
 			wait_on_page_writeback(page);
-			zero_user_segment(page, 0, partial_end);
+			zero_segment(page, 0, pend);
 			cleancache_invalidate_page(mapping, page);
 			if (page_has_private(page))
-				do_invalidatepage(page, 0,
-						  partial_end);
+				do_invalidatepage(page, 0, pend);
 			unlock_page(page);
 			page_cache_release(page);
 		}
@@ -327,6 +377,9 @@ void truncate_inode_pages_range(struct address_space *mapping,
 			if (index >= end)
 				break;
 
+			if (PageTransTailCache(page))
+				continue;
+
 			lock_page(page);
 			WARN_ON(page->index != index);
 			wait_on_page_writeback(page);
@@ -401,6 +454,13 @@ unsigned long invalidate_mapping_pages(struct address_space *mapping,
 			if (index > end)
 				break;
 
+			/* skip huge page if it's not fully in the range */
+			if (PageTransHugeCache(page) &&
+					index + HPAGE_CACHE_NR - 1 > end)
+				continue;
+			if (PageTransTailCache(page))
+				continue;
+
 			if (!trylock_page(page))
 				continue;
 			WARN_ON(page->index != index);
-- 
1.8.3.2


^ permalink raw reply related	[flat|nested] 116+ messages in thread

* [PATCH 20/23] thp: handle file pages in split_huge_page()
  2013-08-04  2:17 ` Kirill A. Shutemov
@ 2013-08-04  2:17   ` Kirill A. Shutemov
  -1 siblings, 0 replies; 116+ messages in thread
From: Kirill A. Shutemov @ 2013-08-04  2:17 UTC (permalink / raw)
  To: Andrea Arcangeli, Andrew Morton
  Cc: Al Viro, Hugh Dickins, Wu Fengguang, Jan Kara, Mel Gorman,
	linux-mm, Andi Kleen, Matthew Wilcox, Kirill A. Shutemov,
	Hillf Danton, Dave Hansen, Ning Qu, linux-fsdevel, linux-kernel,
	Kirill A. Shutemov

From: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>

The base scheme is the same as for anonymous pages, but we walk
mapping->i_mmap rather than anon_vma->rb_root.

When we add a huge page to the page cache we take a reference only on
the head page, but on split we need to take an additional reference on
all tail pages, since they are still in the page cache after splitting.
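
A worked example of that bookkeeping (a userspace sketch that simply
mirrors the atomic_add() in the hunk below; the mapcount values are
illustrative assumptions):

#include <stdio.h>

/* mirrors: atomic_add(initial + page_mapcount(head) + page_mapcount(tail) + 1, ...) */
static int tail_count_added(int file_backed, int head_mapcount, int tail_mapcount)
{
	int initial = file_backed ? 1 : 0;	/* page cache's reference on the subpage */

	return initial + head_mapcount + tail_mapcount + 1;
}

int main(void)
{
	printf("anon THP mapped once:       tail _count += %d\n",
	       tail_count_added(0, 1, 0));
	printf("unmapped file THP in cache: tail _count += %d\n",
	       tail_count_added(1, 0, 0));
	return 0;
}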

Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
---
 mm/huge_memory.c | 89 +++++++++++++++++++++++++++++++++++++++++++++++---------
 1 file changed, 76 insertions(+), 13 deletions(-)

diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index 523946c..d7c6830 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -1580,6 +1580,7 @@ static void __split_huge_page_refcount(struct page *page,
 	struct zone *zone = page_zone(page);
 	struct lruvec *lruvec;
 	int tail_count = 0;
+	int initial_tail_refcount;
 
 	/* prevent PageLRU to go away from under us, and freeze lru stats */
 	spin_lock_irq(&zone->lru_lock);
@@ -1589,6 +1590,13 @@ static void __split_huge_page_refcount(struct page *page,
 	/* complete memcg works before add pages to LRU */
 	mem_cgroup_split_huge_fixup(page);
 
+	/*
+	 * When we add a huge page to the page cache we take a reference only on
+	 * the head page, but on split we need to take an additional reference on
+	 * all tail pages, since they are still in the page cache after splitting.
+	 */
+	initial_tail_refcount = PageAnon(page) ? 0 : 1;
+
 	for (i = HPAGE_PMD_NR - 1; i >= 1; i--) {
 		struct page *page_tail = page + i;
 
@@ -1611,8 +1619,9 @@ static void __split_huge_page_refcount(struct page *page,
 		 * atomic_set() here would be safe on all archs (and
 		 * not only on x86), it's safer to use atomic_add().
 		 */
-		atomic_add(page_mapcount(page) + page_mapcount(page_tail) + 1,
-			   &page_tail->_count);
+		atomic_add(initial_tail_refcount + page_mapcount(page) +
+				page_mapcount(page_tail) + 1,
+				&page_tail->_count);
 
 		/* after clearing PageTail the gup refcount can be released */
 		smp_mb();
@@ -1651,23 +1660,23 @@ static void __split_huge_page_refcount(struct page *page,
 		*/
 		page_tail->_mapcount = page->_mapcount;
 
-		BUG_ON(page_tail->mapping);
 		page_tail->mapping = page->mapping;
 
 		page_tail->index = page->index + i;
 		page_nid_xchg_last(page_tail, page_nid_last(page));
 
-		BUG_ON(!PageAnon(page_tail));
 		BUG_ON(!PageUptodate(page_tail));
 		BUG_ON(!PageDirty(page_tail));
-		BUG_ON(!PageSwapBacked(page_tail));
 
 		lru_add_page_tail(page, page_tail, lruvec, list);
 	}
 	atomic_sub(tail_count, &page->_count);
 	BUG_ON(atomic_read(&page->_count) <= 0);
 
-	__mod_zone_page_state(zone, NR_ANON_TRANSPARENT_HUGEPAGES, -1);
+	if (PageAnon(page))
+		__mod_zone_page_state(zone, NR_ANON_TRANSPARENT_HUGEPAGES, -1);
+	else
+		__mod_zone_page_state(zone, NR_FILE_TRANSPARENT_HUGEPAGES, -1);
 
 	ClearPageCompound(page);
 	compound_unlock(page);
@@ -1767,7 +1776,7 @@ static int __split_huge_page_map(struct page *page,
 }
 
 /* must be called with anon_vma->root->rwsem held */
-static void __split_huge_page(struct page *page,
+static void __split_anon_huge_page(struct page *page,
 			      struct anon_vma *anon_vma,
 			      struct list_head *list)
 {
@@ -1791,7 +1800,7 @@ static void __split_huge_page(struct page *page,
 	 * and establishes a child pmd before
 	 * __split_huge_page_splitting() freezes the parent pmd (so if
 	 * we fail to prevent copy_huge_pmd() from running until the
-	 * whole __split_huge_page() is complete), we will still see
+	 * whole __split_anon_huge_page() is complete), we will still see
 	 * the newly established pmd of the child later during the
 	 * walk, to be able to set it as pmd_trans_splitting too.
 	 */
@@ -1822,14 +1831,11 @@ static void __split_huge_page(struct page *page,
  * from the hugepage.
  * Return 0 if the hugepage is split successfully otherwise return 1.
  */
-int split_huge_page_to_list(struct page *page, struct list_head *list)
+static int split_anon_huge_page(struct page *page, struct list_head *list)
 {
 	struct anon_vma *anon_vma;
 	int ret = 1;
 
-	BUG_ON(is_huge_zero_page(page));
-	BUG_ON(!PageAnon(page));
-
 	/*
 	 * The caller does not necessarily hold an mmap_sem that would prevent
 	 * the anon_vma disappearing so we first we take a reference to it
@@ -1847,7 +1853,7 @@ int split_huge_page_to_list(struct page *page, struct list_head *list)
 		goto out_unlock;
 
 	BUG_ON(!PageSwapBacked(page));
-	__split_huge_page(page, anon_vma, list);
+	__split_anon_huge_page(page, anon_vma, list);
 	count_vm_event(THP_SPLIT);
 
 	BUG_ON(PageCompound(page));
@@ -1858,6 +1864,63 @@ out:
 	return ret;
 }
 
+static int split_file_huge_page(struct page *page, struct list_head *list)
+{
+	struct address_space *mapping = page->mapping;
+	pgoff_t pgoff = page->index << (PAGE_CACHE_SHIFT - PAGE_SHIFT);
+	struct vm_area_struct *vma;
+	int mapcount, mapcount2;
+
+	BUG_ON(!PageHead(page));
+	BUG_ON(PageTail(page));
+
+	mutex_lock(&mapping->i_mmap_mutex);
+	mapcount = 0;
+	vma_interval_tree_foreach(vma, &mapping->i_mmap, pgoff, pgoff) {
+		unsigned long addr = vma_address(page, vma);
+		mapcount += __split_huge_page_splitting(page, vma, addr);
+	}
+
+	if (mapcount != page_mapcount(page))
+		printk(KERN_ERR "mapcount %d page_mapcount %d\n",
+		       mapcount, page_mapcount(page));
+	BUG_ON(mapcount != page_mapcount(page));
+
+	__split_huge_page_refcount(page, list);
+
+	mapcount2 = 0;
+	vma_interval_tree_foreach(vma, &mapping->i_mmap, pgoff, pgoff) {
+		unsigned long addr = vma_address(page, vma);
+		mapcount2 += __split_huge_page_map(page, vma, addr);
+	}
+
+	if (mapcount != mapcount2)
+		printk(KERN_ERR "mapcount %d mapcount2 %d page_mapcount %d\n",
+		       mapcount, mapcount2, page_mapcount(page));
+	BUG_ON(mapcount != mapcount2);
+	count_vm_event(THP_SPLIT);
+	mutex_unlock(&mapping->i_mmap_mutex);
+
+	/*
+	 * Drop small pages beyond i_size if any.
+	 *
+	 * XXX: do we need to serialize over i_mutex here?
+	 * If yes, how to get mmap_sem vs. i_mutex ordering fixed?
+	 */
+	truncate_inode_pages(mapping, i_size_read(mapping->host));
+	return 0;
+}
+
+int split_huge_page_to_list(struct page *page, struct list_head *list)
+{
+	BUG_ON(is_huge_zero_page(page));
+
+	if (PageAnon(page))
+		return split_anon_huge_page(page, list);
+	else
+		return split_file_huge_page(page, list);
+}
+
 #define VM_NO_THP (VM_SPECIAL|VM_MIXEDMAP|VM_HUGETLB|VM_SHARED|VM_MAYSHARE)
 
 int hugepage_madvise(struct vm_area_struct *vma,
-- 
1.8.3.2


^ permalink raw reply related	[flat|nested] 116+ messages in thread

* [PATCH 21/23] thp: wait_split_huge_page(): serialize over i_mmap_mutex too
  2013-08-04  2:17 ` Kirill A. Shutemov
@ 2013-08-04  2:17   ` Kirill A. Shutemov
  -1 siblings, 0 replies; 116+ messages in thread
From: Kirill A. Shutemov @ 2013-08-04  2:17 UTC (permalink / raw)
  To: Andrea Arcangeli, Andrew Morton
  Cc: Al Viro, Hugh Dickins, Wu Fengguang, Jan Kara, Mel Gorman,
	linux-mm, Andi Kleen, Matthew Wilcox, Kirill A. Shutemov,
	Hillf Danton, Dave Hansen, Ning Qu, linux-fsdevel, linux-kernel,
	Kirill A. Shutemov

From: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>

We're going to have huge pages backed by files, so we need to modify
wait_split_huge_page() to support that.

We have two options:
 - check whether the page is anon or not and serialize only over the
   required lock;
 - always serialize over both locks.

The current implementation, in fact, guarantees that *no* page on the
vma is being split, not only the page the pmd points to.

For now I prefer the second option since it's the safest: we provide the
same level of guarantees.
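
The lock/unlock pair acts as a barrier: the splitter holds the lock for
the whole duration of the split, so briefly taking and releasing the lock
is enough to wait out any split already in flight. A userspace analogy
with a plain pthread mutex (purely illustrative, not kernel code; build
with -lpthread):

#include <pthread.h>
#include <stdio.h>
#include <unistd.h>

static pthread_mutex_t split_lock = PTHREAD_MUTEX_INITIALIZER;

static void *splitter(void *arg)
{
	(void)arg;
	pthread_mutex_lock(&split_lock);
	puts("split: started");
	sleep(1);			/* pretend the split takes a while */
	puts("split: finished");
	pthread_mutex_unlock(&split_lock);
	return NULL;
}

int main(void)
{
	pthread_t t;

	pthread_create(&t, NULL, splitter, NULL);
	usleep(100 * 1000);		/* let the splitter grab the lock first */

	/* wait_split_huge_page() analogue: lock+unlock waits out the split */
	pthread_mutex_lock(&split_lock);
	pthread_mutex_unlock(&split_lock);
	puts("waiter: any split is now guaranteed to be complete");

	pthread_join(t, NULL);
	return 0;
}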

Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
---
 include/linux/huge_mm.h | 15 ++++++++++++---
 mm/huge_memory.c        |  4 ++--
 mm/memory.c             |  4 ++--
 3 files changed, 16 insertions(+), 7 deletions(-)

diff --git a/include/linux/huge_mm.h b/include/linux/huge_mm.h
index 9a0a114..186f4f2 100644
--- a/include/linux/huge_mm.h
+++ b/include/linux/huge_mm.h
@@ -111,11 +111,20 @@ extern void __split_huge_page_pmd(struct vm_area_struct *vma,
 			__split_huge_page_pmd(__vma, __address,		\
 					____pmd);			\
 	}  while (0)
-#define wait_split_huge_page(__anon_vma, __pmd)				\
+#define wait_split_huge_page(__vma, __pmd)				\
 	do {								\
 		pmd_t *____pmd = (__pmd);				\
-		anon_vma_lock_write(__anon_vma);			\
-		anon_vma_unlock_write(__anon_vma);			\
+		struct address_space *__mapping = (__vma)->vm_file ?	\
+				(__vma)->vm_file->f_mapping : NULL;	\
+		struct anon_vma *__anon_vma = (__vma)->anon_vma;	\
+		if (__mapping)						\
+			mutex_lock(&__mapping->i_mmap_mutex);		\
+		if (__anon_vma) {					\
+			anon_vma_lock_write(__anon_vma);		\
+			anon_vma_unlock_write(__anon_vma);		\
+		}							\
+		if (__mapping)						\
+			mutex_unlock(&__mapping->i_mmap_mutex);		\
 		BUG_ON(pmd_trans_splitting(*____pmd) ||			\
 		       pmd_trans_huge(*____pmd));			\
 	} while (0)
diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index d7c6830..9af643d 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -911,7 +911,7 @@ int copy_huge_pmd(struct mm_struct *dst_mm, struct mm_struct *src_mm,
 		spin_unlock(&dst_mm->page_table_lock);
 		pte_free(dst_mm, pgtable);
 
-		wait_split_huge_page(vma->anon_vma, src_pmd); /* src_vma */
+		wait_split_huge_page(vma, src_pmd); /* src_vma */
 		goto out;
 	}
 	src_page = pmd_page(pmd);
@@ -1493,7 +1493,7 @@ int __pmd_trans_huge_lock(pmd_t *pmd, struct vm_area_struct *vma)
 	if (likely(pmd_trans_huge(*pmd))) {
 		if (unlikely(pmd_trans_splitting(*pmd))) {
 			spin_unlock(&vma->vm_mm->page_table_lock);
-			wait_split_huge_page(vma->anon_vma, pmd);
+			wait_split_huge_page(vma, pmd);
 			return -1;
 		} else {
 			/* Thp mapped by 'pmd' is stable, so we can
diff --git a/mm/memory.c b/mm/memory.c
index 7d35f90..ea74ab1 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -609,7 +609,7 @@ int __pte_alloc(struct mm_struct *mm, struct vm_area_struct *vma,
 	if (new)
 		pte_free(mm, new);
 	if (wait_split_huge_page)
-		wait_split_huge_page(vma->anon_vma, pmd);
+		wait_split_huge_page(vma, pmd);
 	return 0;
 }
 
@@ -1522,7 +1522,7 @@ struct page *follow_page_mask(struct vm_area_struct *vma,
 		if (likely(pmd_trans_huge(*pmd))) {
 			if (unlikely(pmd_trans_splitting(*pmd))) {
 				spin_unlock(&mm->page_table_lock);
-				wait_split_huge_page(vma->anon_vma, pmd);
+				wait_split_huge_page(vma, pmd);
 			} else {
 				page = follow_trans_huge_pmd(vma, address,
 							     pmd, flags);
-- 
1.8.3.2


^ permalink raw reply related	[flat|nested] 116+ messages in thread

* [PATCH 22/23] thp, mm: split huge page on mmap file page
  2013-08-04  2:17 ` Kirill A. Shutemov
@ 2013-08-04  2:17   ` Kirill A. Shutemov
  -1 siblings, 0 replies; 116+ messages in thread
From: Kirill A. Shutemov @ 2013-08-04  2:17 UTC (permalink / raw)
  To: Andrea Arcangeli, Andrew Morton
  Cc: Al Viro, Hugh Dickins, Wu Fengguang, Jan Kara, Mel Gorman,
	linux-mm, Andi Kleen, Matthew Wilcox, Kirill A. Shutemov,
	Hillf Danton, Dave Hansen, Ning Qu, linux-fsdevel, linux-kernel,
	Kirill A. Shutemov

From: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>

We are not ready to mmap file-backed transparent huge pages yet. Let's
split them on a fault attempt.

Later we'll implement mmap() properly, and this code path will be used
for fallback cases.

Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
---
 mm/filemap.c | 2 ++
 1 file changed, 2 insertions(+)

diff --git a/mm/filemap.c b/mm/filemap.c
index ed65af5..f7857ef 100644
--- a/mm/filemap.c
+++ b/mm/filemap.c
@@ -1743,6 +1743,8 @@ retry_find:
 			goto no_cached_page;
 	}
 
+	if (PageTransCompound(page))
+		split_huge_page(compound_trans_head(page));
 	if (!lock_page_or_retry(page, vma->vm_mm, vmf->flags)) {
 		page_cache_release(page);
 		return ret | VM_FAULT_RETRY;
-- 
1.8.3.2


^ permalink raw reply related	[flat|nested] 116+ messages in thread

* [PATCH 23/23] ramfs: enable transparent huge page cache
  2013-08-04  2:17 ` Kirill A. Shutemov
@ 2013-08-04  2:17   ` Kirill A. Shutemov
  -1 siblings, 0 replies; 116+ messages in thread
From: Kirill A. Shutemov @ 2013-08-04  2:17 UTC (permalink / raw)
  To: Andrea Arcangeli, Andrew Morton
  Cc: Al Viro, Hugh Dickins, Wu Fengguang, Jan Kara, Mel Gorman,
	linux-mm, Andi Kleen, Matthew Wilcox, Kirill A. Shutemov,
	Hillf Danton, Dave Hansen, Ning Qu, linux-fsdevel, linux-kernel,
	Kirill A. Shutemov

From: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>

ramfs is the simplest fs from the page cache point of view. Let's start
enabling the transparent huge page cache here.

ramfs pages are not movable[1] and switching to transhuge pages doesn't
affect that. We need to fix this eventually.

[1] http://lkml.org/lkml/2013/4/2/720
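
One possible smoke test once this is applied (hypothetical: the mount
point and file name are made up; whether huge pages were actually used
would have to be confirmed via the event counters added earlier in the
series):

#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>

#define SZ	(4UL << 20)	/* 4MiB: should span two huge pages */

int main(void)
{
	/* assumes: mount -t ramfs none /mnt/ramfs */
	const char *path = "/mnt/ramfs/thp-cache-test";
	char *buf = malloc(SZ);
	ssize_t ret;
	int fd;

	if (!buf)
		return 1;
	memset(buf, 0xaa, SZ);

	fd = open(path, O_CREAT | O_RDWR | O_TRUNC, 0600);
	if (fd < 0) {
		perror("open");
		return 1;
	}
	ret = write(fd, buf, SZ);	/* should hit simple_thp_write_begin() */
	if (ret != (ssize_t)SZ)
		perror("write");
	ret = pread(fd, buf, SZ, 0);	/* read back through the huge pages */
	if (ret != (ssize_t)SZ)
		perror("pread");
	close(fd);
	free(buf);
	return 0;
}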

Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
---
 fs/ramfs/file-mmu.c | 3 ++-
 fs/ramfs/inode.c    | 6 +++++-
 2 files changed, 7 insertions(+), 2 deletions(-)

diff --git a/fs/ramfs/file-mmu.c b/fs/ramfs/file-mmu.c
index 4884ac5..3236e41 100644
--- a/fs/ramfs/file-mmu.c
+++ b/fs/ramfs/file-mmu.c
@@ -32,7 +32,7 @@
 
 const struct address_space_operations ramfs_aops = {
 	.readpage	= simple_readpage,
-	.write_begin	= simple_write_begin,
+	.write_begin	= simple_thp_write_begin,
 	.write_end	= simple_write_end,
 	.set_page_dirty = __set_page_dirty_no_writeback,
 };
@@ -47,6 +47,7 @@ const struct file_operations ramfs_file_operations = {
 	.splice_read	= generic_file_splice_read,
 	.splice_write	= generic_file_splice_write,
 	.llseek		= generic_file_llseek,
+	.release	= simple_thp_release,
 };
 
 const struct inode_operations ramfs_file_inode_operations = {
diff --git a/fs/ramfs/inode.c b/fs/ramfs/inode.c
index 39d1465..5dafdfc 100644
--- a/fs/ramfs/inode.c
+++ b/fs/ramfs/inode.c
@@ -61,7 +61,11 @@ struct inode *ramfs_get_inode(struct super_block *sb,
 		inode_init_owner(inode, dir, mode);
 		inode->i_mapping->a_ops = &ramfs_aops;
 		inode->i_mapping->backing_dev_info = &ramfs_backing_dev_info;
-		mapping_set_gfp_mask(inode->i_mapping, GFP_HIGHUSER);
+		/*
+		 * TODO: make ramfs pages movable
+		 */
+		mapping_set_gfp_mask(inode->i_mapping,
+				GFP_TRANSHUGE & ~__GFP_MOVABLE);
 		mapping_set_unevictable(inode->i_mapping);
 		inode->i_atime = inode->i_mtime = inode->i_ctime = CURRENT_TIME;
 		switch (mode & S_IFMT) {
-- 
1.8.3.2


^ permalink raw reply related	[flat|nested] 116+ messages in thread

* Re: [PATCH 02/23] memcg, thp: charge huge cache pages
  2013-08-04  2:17   ` Kirill A. Shutemov
@ 2013-08-04  8:25     ` Michal Hocko
  -1 siblings, 0 replies; 116+ messages in thread
From: Michal Hocko @ 2013-08-04  8:25 UTC (permalink / raw)
  To: Kirill A. Shutemov
  Cc: Andrea Arcangeli, Andrew Morton, Al Viro, Hugh Dickins,
	Wu Fengguang, Jan Kara, Mel Gorman, linux-mm, Andi Kleen,
	Matthew Wilcox, Kirill A. Shutemov, Hillf Danton, Dave Hansen,
	Ning Qu, linux-fsdevel, linux-kernel, KAMEZAWA Hiroyuki

On Sun 04-08-13 05:17:04, Kirill A. Shutemov wrote:
> From: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
> 
> mem_cgroup_cache_charge() has a check for PageCompound(). The check
> prevents charging huge cache pages.
> 
> I don't see a reason why the check is present. Looks like it's just
> legacy (introduced in 52d4b9a memcg: allocate all page_cgroup at boot).
> 
> Let's just drop it.

If the page cache charging path only sees THP as compound pages, then OK.
Can we keep at least VM_BUG_ON(PageCompound(page) && !PageTransHuge(page))?

Otherwise mem_cgroup_charge_common() would be confused and charge such a
page as order-0.
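
Something like the following is what I have in mind (the exact placement
below is illustrative only):

	if (mem_cgroup_disabled())
		return 0;
	/*
	 * THP cache pages should be the only compound pages seen here;
	 * anything else would silently be charged as an order-0 page by
	 * mem_cgroup_charge_common().
	 */
	VM_BUG_ON(PageCompound(page) && !PageTransHuge(page));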
 
> Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
> Cc: Michal Hocko <mhocko@suse.cz>
> Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
> Acked-by: Dave Hansen <dave.hansen@linux.intel.com>

Other than that, looks good to me.
Acked-by: Michal Hocko <mhocko@suse.cz>

> ---
>  mm/memcontrol.c | 2 --
>  1 file changed, 2 deletions(-)
> 
> diff --git a/mm/memcontrol.c b/mm/memcontrol.c
> index b6cd870..dc50c1a 100644
> --- a/mm/memcontrol.c
> +++ b/mm/memcontrol.c
> @@ -3921,8 +3921,6 @@ int mem_cgroup_cache_charge(struct page *page, struct mm_struct *mm,
>  
>  	if (mem_cgroup_disabled())
>  		return 0;
> -	if (PageCompound(page))
> -		return 0;
>  
>  	if (!PageSwapCache(page))
>  		ret = mem_cgroup_charge_common(page, mm, gfp_mask, type);
> -- 
> 1.8.3.2
> 

-- 
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 116+ messages in thread

* Re: [PATCH 15/23] mm, fs: avoid page allocation beyond i_size on read
  2013-08-04  2:17   ` Kirill A. Shutemov
@ 2013-08-05  0:29     ` NeilBrown
  -1 siblings, 0 replies; 116+ messages in thread
From: NeilBrown @ 2013-08-05  0:29 UTC (permalink / raw)
  To: Kirill A. Shutemov
  Cc: Andrea Arcangeli, Andrew Morton, Al Viro, Hugh Dickins,
	Wu Fengguang, Jan Kara, Mel Gorman, linux-mm, Andi Kleen,
	Matthew Wilcox, Kirill A. Shutemov, Hillf Danton, Dave Hansen,
	Ning Qu, linux-fsdevel, linux-kernel

On Sun,  4 Aug 2013 05:17:17 +0300 "Kirill A. Shutemov"
<kirill.shutemov@linux.intel.com> wrote:

> From: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
> 
> I've noticed that we allocate an unneeded page cache page on a read
> beyond i_size. Simple test case (I checked it on ramfs):
> 
> $ touch testfile
> $ cat testfile
> 
> It triggers 'no_cached_page' code path in do_generic_file_read().
> 
> Looks like it's a regression since commit a32ea1e. Let's fix it.
> 
> Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
> Cc: NeilBrown <neilb@suse.de>
> ---
>  mm/filemap.c | 4 ++++
>  1 file changed, 4 insertions(+)
> 
> diff --git a/mm/filemap.c b/mm/filemap.c
> index 066bbff..c31d296 100644
> --- a/mm/filemap.c
> +++ b/mm/filemap.c
> @@ -1163,6 +1163,10 @@ static void do_generic_file_read(struct file *filp, loff_t *ppos,
>  		loff_t isize;
>  		unsigned long nr, ret;
>  
> +		isize = i_size_read(inode);
> +		if (!isize || index > (isize - 1) >> PAGE_CACHE_SHIFT)
> +			goto out;
> +
>  		cond_resched();
>  find_page:
>  		page = find_get_page(mapping, index);
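
For reference, the effect of the new check with 4KiB pages
(PAGE_CACHE_SHIFT == 12); the numbers below are just an illustration:

	/*
	 *   isize == 0                -> !isize, bail out: no cache page is
	 *                                allocated for a read of an empty file.
	 *   isize == 3000, index == 0 -> (3000 - 1) >> 12 == 0, read proceeds.
	 *   isize == 3000, index == 1 -> 1 > 0, bail out: the read is entirely
	 *                                beyond EOF.
	 */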

Looks good to me.

Acked-by: NeilBrown <neilb@suse.de>

Thanks,
NeilBrown

^ permalink raw reply	[flat|nested] 116+ messages in thread

* Re: [PATCH 01/23] radix-tree: implement preload for multiple contiguous elements
  2013-08-04  2:17   ` Kirill A. Shutemov
@ 2013-08-05 11:17     ` Jan Kara
  -1 siblings, 0 replies; 116+ messages in thread
From: Jan Kara @ 2013-08-05 11:17 UTC (permalink / raw)
  To: Kirill A. Shutemov
  Cc: Andrea Arcangeli, Andrew Morton, Al Viro, Hugh Dickins,
	Wu Fengguang, Jan Kara, Mel Gorman, linux-mm, Andi Kleen,
	Matthew Wilcox, Kirill A. Shutemov, Hillf Danton, Dave Hansen,
	Ning Qu, linux-fsdevel, linux-kernel

On Sun 04-08-13 05:17:03, Kirill A. Shutemov wrote:
> From: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
> 
> The radix tree is variable-height, so an insert operation not only has
> to build the branch to its corresponding item, it also has to build the
> branch to existing items if the size has to be increased (by
> radix_tree_extend).
> 
> The worst case is a zero height tree with just a single item at index 0,
> and then inserting an item at index ULONG_MAX. This requires 2 new branches
> of RADIX_TREE_MAX_PATH size to be created, with only the root node shared.
> 
> The radix tree is usually protected by a spin lock, which means we want
> to pre-allocate the required memory before taking the lock.
> 
> Currently radix_tree_preload() only guarantees enough nodes to insert
> one element. It's a hard limit. For transparent huge page cache we want
> to insert HPAGE_PMD_NR (512 on x86-64) entries to address_space at once.
> 
> This patch introduces radix_tree_preload_count(). It allows preallocating
> enough nodes to insert a number of *contiguous* elements. The feature
> costs about 5KiB per CPU; details below.
> 
> Worst case for adding N contiguous items is adding entries at indexes
> (ULONG_MAX - N) to ULONG_MAX. It requires nodes to insert single worst-case
> item plus extra nodes if you cross the boundary from one node to the next.
> 
> Preload uses per-CPU array to store nodes. The total cost of preload is
> "array size" * sizeof(void*) * NR_CPUS. We want to increase array size
> to be able to handle 512 entries at once.
> 
> Size of array depends on system bitness and on RADIX_TREE_MAP_SHIFT.
> 
> We have three possible RADIX_TREE_MAP_SHIFT:
> 
>  #ifdef __KERNEL__
>  #define RADIX_TREE_MAP_SHIFT	(CONFIG_BASE_SMALL ? 4 : 6)
>  #else
>  #define RADIX_TREE_MAP_SHIFT	3	/* For more stressful testing */
>  #endif
> 
> On 64-bit system:
> For RADIX_TREE_MAP_SHIFT=3, old array size is 43, new is 107.
> For RADIX_TREE_MAP_SHIFT=4, old array size is 31, new is 63.
> For RADIX_TREE_MAP_SHIFT=6, old array size is 21, new is 30.
> 
> On 32-bit system:
> For RADIX_TREE_MAP_SHIFT=3, old array size is 21, new is 84.
> For RADIX_TREE_MAP_SHIFT=4, old array size is 15, new is 46.
> For RADIX_TREE_MAP_SHIFT=6, old array size is 11, new is 19.
> 
> On most machines we will have RADIX_TREE_MAP_SHIFT=6. In this case,
> on 64-bit system the per-CPU feature overhead is
>  for preload array:
>    (30 - 21) * sizeof(void*) = 72 bytes
>  plus, if the preload array is full
>    (30 - 21) * sizeof(struct radix_tree_node) = 9 * 560 = 5040 bytes
>  total: 5112 bytes
> 
> on 32-bit system the per-CPU feature overhead is
>  for preload array:
>    (19 - 11) * sizeof(void*) = 32 bytes
>  plus, if the preload array is full
>    (19 - 11) * sizeof(struct radix_tree_node) = 8 * 296 = 2368 bytes
>  total: 2400 bytes
> 
> Since only THP uses batched preload at the moment, we disable (set max
> preload to 1) it if !CONFIG_TRANSPARENT_HUGEPAGE_PAGECACHE. This can be
> changed in the future.
> 
> Signed-off-by: Matthew Wilcox <willy@linux.intel.com>
> Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
> Acked-by: Dave Hansen <dave.hansen@linux.intel.com>
> ---
>  include/linux/radix-tree.h | 11 +++++++++++
>  lib/radix-tree.c           | 41 ++++++++++++++++++++++++++++++++---------
>  2 files changed, 43 insertions(+), 9 deletions(-)
...
> diff --git a/lib/radix-tree.c b/lib/radix-tree.c
> index 7811ed3..99ab73c 100644
> --- a/lib/radix-tree.c
> +++ b/lib/radix-tree.c
> @@ -82,16 +82,24 @@ static struct kmem_cache *radix_tree_node_cachep;
>   * The worst case is a zero height tree with just a single item at index 0,
>   * and then inserting an item at index ULONG_MAX. This requires 2 new branches
>   * of RADIX_TREE_MAX_PATH size to be created, with only the root node shared.
> + *
> + * Worst case for adding N contiguous items is adding entries at indexes
> + * (ULONG_MAX - N) to ULONG_MAX. It requires nodes to insert single worst-case
> + * item plus extra nodes if you cross the boundary from one node to the next.
> + *
>   * Hence:
>   */
> -#define RADIX_TREE_PRELOAD_SIZE (RADIX_TREE_MAX_PATH * 2 - 1)
> +#define RADIX_TREE_PRELOAD_MIN (RADIX_TREE_MAX_PATH * 2 - 1)
> +#define RADIX_TREE_PRELOAD_MAX \
> +	(RADIX_TREE_PRELOAD_MIN + \
> +	 DIV_ROUND_UP(RADIX_TREE_PRELOAD_NR - 1, RADIX_TREE_MAP_SIZE))
  Umm, is this really correct? I see two problems:
1) You may need internal tree nodes at various levels, but you seem to
account only for level 1.
2) The rounding doesn't seem right, because RADIX_TREE_MAP_SIZE+2 entries
may require 3 nodes at level 1 if the indexes are laid out like:
i_0 | i_1 .. i_{RADIX_TREE_MAP_SIZE} | i_{RADIX_TREE_MAP_SIZE+1}
    ^                                ^
    node boundary                    node boundary

  Otherwise the patch looks good.

								Honza
-- 
Jan Kara <jack@suse.cz>
SUSE Labs, CR

^ permalink raw reply	[flat|nested] 116+ messages in thread

* Re: [PATCH 08/23] block: implement add_bdi_stat()
  2013-08-04  2:17   ` Kirill A. Shutemov
@ 2013-08-05 11:21     ` Jan Kara
  -1 siblings, 0 replies; 116+ messages in thread
From: Jan Kara @ 2013-08-05 11:21 UTC (permalink / raw)
  To: Kirill A. Shutemov
  Cc: Andrea Arcangeli, Andrew Morton, Al Viro, Hugh Dickins,
	Wu Fengguang, Jan Kara, Mel Gorman, linux-mm, Andi Kleen,
	Matthew Wilcox, Kirill A. Shutemov, Hillf Danton, Dave Hansen,
	Ning Qu, linux-fsdevel, linux-kernel

On Sun 04-08-13 05:17:10, Kirill A. Shutemov wrote:
> From: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
> 
> We're going to add/remove a number of page cache entries at once. This
> patch implements add_bdi_stat(), which adjusts bdi stats by an arbitrary
> amount. It's required for batched page cache manipulations.
> 
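For illustration, a batched caller could then do something like the
following (hypothetical call site and stat item, not taken from the
series itself):

	/*
	 * Drop the reclaimable accounting for a whole huge page worth of
	 * dirty cache pages in one step instead of HPAGE_PMD_NR separate
	 * dec_bdi_stat() calls.
	 */
	add_bdi_stat(mapping->backing_dev_info, BDI_RECLAIMABLE,
		     -(s64)HPAGE_PMD_NR);
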
> Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
  Looks good. You can add:
Reviewed-by: Jan Kara <jack@suse.cz>

								Honza

> ---
>  include/linux/backing-dev.h | 10 ++++++++++
>  1 file changed, 10 insertions(+)
> 
> diff --git a/include/linux/backing-dev.h b/include/linux/backing-dev.h
> index c388155..7060180 100644
> --- a/include/linux/backing-dev.h
> +++ b/include/linux/backing-dev.h
> @@ -166,6 +166,16 @@ static inline void __dec_bdi_stat(struct backing_dev_info *bdi,
>  	__add_bdi_stat(bdi, item, -1);
>  }
>  
> +static inline void add_bdi_stat(struct backing_dev_info *bdi,
> +		enum bdi_stat_item item, s64 amount)
> +{
> +	unsigned long flags;
> +
> +	local_irq_save(flags);
> +	__add_bdi_stat(bdi, item, amount);
> +	local_irq_restore(flags);
> +}
> +
>  static inline void dec_bdi_stat(struct backing_dev_info *bdi,
>  		enum bdi_stat_item item)
>  {
> -- 
> 1.8.3.2
> 
-- 
Jan Kara <jack@suse.cz>
SUSE Labs, CR

^ permalink raw reply	[flat|nested] 116+ messages in thread

* Re: [PATCH 19/23] truncate: support huge pages
  2013-08-04  2:17   ` Kirill A. Shutemov
@ 2013-08-05 13:29     ` Jan Kara
  -1 siblings, 0 replies; 116+ messages in thread
From: Jan Kara @ 2013-08-05 13:29 UTC (permalink / raw)
  To: Kirill A. Shutemov
  Cc: Andrea Arcangeli, Andrew Morton, Al Viro, Hugh Dickins,
	Wu Fengguang, Jan Kara, Mel Gorman, linux-mm, Andi Kleen,
	Matthew Wilcox, Kirill A. Shutemov, Hillf Danton, Dave Hansen,
	Ning Qu, linux-fsdevel, linux-kernel

On Sun 04-08-13 05:17:21, Kirill A. Shutemov wrote:
> From: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
> 
> truncate_inode_pages_range() drops whole huge page at once if it's fully
> inside the range.
> 
> If a huge page is only partly in the range we zero out the part,
> exactly like we do for partial small pages.
> 
> invalidate_mapping_pages() just skips huge pages if they are not fully
> in the range.
  Umm, this is not a new problem, but with THP pagecache it will become more
visible: when we punch holes within a file like <0..2MB>, <2MB..4MB>
(presuming 4 MB huge pages), we won't free the underlying huge page for the
range 0..4MB. Maybe it doesn't matter for the initial implementation, but we
should at least note it in truncate_inode_pages_range() so that people are
aware of this.
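
A sketch of the kind of note I mean (wording is illustrative only):

	/*
	 * XXX: a huge page that is zeroed out piecewise by several
	 * truncate/hole-punch calls covering it in parts is never freed;
	 * only a single call whose range covers the whole huge page
	 * releases it.
	 */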

Otherwise the patch looks OK to me. So you can add:
Reviewed-by: Jan Kara <jack@suse.cz>

								Honza

> Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
> ---
>  mm/truncate.c | 108 +++++++++++++++++++++++++++++++++++++++++++++-------------
>  1 file changed, 84 insertions(+), 24 deletions(-)
> 
> diff --git a/mm/truncate.c b/mm/truncate.c
> index 353b683..fcef7cb 100644
> --- a/mm/truncate.c
> +++ b/mm/truncate.c
> @@ -205,8 +205,7 @@ void truncate_inode_pages_range(struct address_space *mapping,
>  {
>  	pgoff_t		start;		/* inclusive */
>  	pgoff_t		end;		/* exclusive */
> -	unsigned int	partial_start;	/* inclusive */
> -	unsigned int	partial_end;	/* exclusive */
> +	bool		partial_thp_start = false, partial_thp_end = false;
>  	struct pagevec	pvec;
>  	pgoff_t		index;
>  	int		i;
> @@ -215,15 +214,9 @@ void truncate_inode_pages_range(struct address_space *mapping,
>  	if (mapping->nrpages == 0)
>  		return;
>  
> -	/* Offsets within partial pages */
> -	partial_start = lstart & (PAGE_CACHE_SIZE - 1);
> -	partial_end = (lend + 1) & (PAGE_CACHE_SIZE - 1);
> -
>  	/*
>  	 * 'start' and 'end' always covers the range of pages to be fully
> -	 * truncated. Partial pages are covered with 'partial_start' at the
> -	 * start of the range and 'partial_end' at the end of the range.
> -	 * Note that 'end' is exclusive while 'lend' is inclusive.
> +	 * truncated. Note that 'end' is exclusive while 'lend' is inclusive.
>  	 */
>  	start = (lstart + PAGE_CACHE_SIZE - 1) >> PAGE_CACHE_SHIFT;
>  	if (lend == -1)
> @@ -249,6 +242,23 @@ void truncate_inode_pages_range(struct address_space *mapping,
>  			if (index >= end)
>  				break;
>  
> +			if (PageTransTailCache(page)) {
> +				/* part of already handled huge page */
> +				if (!page->mapping)
> +					continue;
> +				/* the range starts in middle of huge page */
> +				partial_thp_start = true;
> +				start = index & ~HPAGE_CACHE_INDEX_MASK;
> +				continue;
> +			}
> +			/* the range ends on huge page */
> +			if (PageTransHugeCache(page) &&
> +				index == (end & ~HPAGE_CACHE_INDEX_MASK)) {
> +				partial_thp_end = true;
> +				end = index;
> +				break;
> +			}
> +
>  			if (!trylock_page(page))
>  				continue;
>  			WARN_ON(page->index != index);
> @@ -265,34 +275,74 @@ void truncate_inode_pages_range(struct address_space *mapping,
>  		index++;
>  	}
>  
> -	if (partial_start) {
> -		struct page *page = find_lock_page(mapping, start - 1);
> +	if (partial_thp_start || lstart & ~PAGE_CACHE_MASK) {
> +		pgoff_t off;
> +		struct page *page;
> +		unsigned pstart, pend;
> +		void (*zero_segment)(struct page *page,
> +				unsigned start, unsigned len);
> +retry_partial_start:
> +		if (partial_thp_start) {
> +			zero_segment = zero_huge_user_segment;
> +			off = (start - 1) & ~HPAGE_CACHE_INDEX_MASK;
> +			pstart = lstart & ~HPAGE_PMD_MASK;
> +			if ((end & ~HPAGE_CACHE_INDEX_MASK) == off)
> +				pend = (lend - 1) & ~HPAGE_PMD_MASK;
> +			else
> +				pend = HPAGE_PMD_SIZE;
> +		} else {
> +			zero_segment = zero_user_segment;
> +			off = start - 1;
> +			pstart = lstart & ~PAGE_CACHE_MASK;
> +			if (start > end)
> +				pend = (lend - 1) & ~PAGE_CACHE_MASK;
> +			else
> +				pend = PAGE_CACHE_SIZE;
> +		}
> +
> +		page = find_get_page(mapping, off);
>  		if (page) {
> -			unsigned int top = PAGE_CACHE_SIZE;
> -			if (start > end) {
> -				/* Truncation within a single page */
> -				top = partial_end;
> -				partial_end = 0;
> +			/* the last tail page*/
> +			if (PageTransTailCache(page)) {
> +				partial_thp_start = true;
> +				page_cache_release(page);
> +				goto retry_partial_start;
>  			}
> +
> +			lock_page(page);
>  			wait_on_page_writeback(page);
> -			zero_user_segment(page, partial_start, top);
> +			zero_segment(page, pstart, pend);
>  			cleancache_invalidate_page(mapping, page);
>  			if (page_has_private(page))
> -				do_invalidatepage(page, partial_start,
> -						  top - partial_start);
> +				do_invalidatepage(page, pstart,
> +						pend - pstart);
>  			unlock_page(page);
>  			page_cache_release(page);
>  		}
>  	}
> -	if (partial_end) {
> -		struct page *page = find_lock_page(mapping, end);
> +	if (partial_thp_end || (lend + 1) & ~PAGE_CACHE_MASK) {
> +		pgoff_t off;
> +		struct page *page;
> +		unsigned pend;
> +		void (*zero_segment)(struct page *page,
> +				unsigned start, unsigned len);
> +		if (partial_thp_end) {
> +			zero_segment = zero_huge_user_segment;
> +			off = end & ~HPAGE_CACHE_INDEX_MASK;
> +			pend = (lend - 1) & ~HPAGE_PMD_MASK;
> +		} else {
> +			zero_segment = zero_user_segment;
> +			off = end;
> +			pend = (lend - 1) & ~PAGE_CACHE_MASK;
> +		}
> +
> +		page = find_lock_page(mapping, end);
>  		if (page) {
>  			wait_on_page_writeback(page);
> -			zero_user_segment(page, 0, partial_end);
> +			zero_segment(page, 0, pend);
>  			cleancache_invalidate_page(mapping, page);
>  			if (page_has_private(page))
> -				do_invalidatepage(page, 0,
> -						  partial_end);
> +				do_invalidatepage(page, 0, pend);
>  			unlock_page(page);
>  			page_cache_release(page);
>  		}
> @@ -327,6 +377,9 @@ void truncate_inode_pages_range(struct address_space *mapping,
>  			if (index >= end)
>  				break;
>  
> +			if (PageTransTailCache(page))
> +				continue;
> +
>  			lock_page(page);
>  			WARN_ON(page->index != index);
>  			wait_on_page_writeback(page);
> @@ -401,6 +454,13 @@ unsigned long invalidate_mapping_pages(struct address_space *mapping,
>  			if (index > end)
>  				break;
>  
> +			/* skip huge page if it's not fully in the range */
> +			if (PageTransHugeCache(page) &&
> +					index + HPAGE_CACHE_NR - 1 > end)
> +				continue;
> +			if (PageTransTailCache(page))
> +				continue;
> +
>  			if (!trylock_page(page))
>  				continue;
>  			WARN_ON(page->index != index);
> -- 
> 1.8.3.2
> 
-- 
Jan Kara <jack@suse.cz>
SUSE Labs, CR

^ permalink raw reply	[flat|nested] 116+ messages in thread

* Re: [PATCH 01/23] radix-tree: implement preload for multiple contiguous elements
  2013-08-05 11:17     ` Jan Kara
@ 2013-08-06 16:34       ` Matthew Wilcox
  -1 siblings, 0 replies; 116+ messages in thread
From: Matthew Wilcox @ 2013-08-06 16:34 UTC (permalink / raw)
  To: Jan Kara
  Cc: Kirill A. Shutemov, Andrea Arcangeli, Andrew Morton, Al Viro,
	Hugh Dickins, Wu Fengguang, Mel Gorman, linux-mm, Andi Kleen,
	Kirill A. Shutemov, Hillf Danton, Dave Hansen, Ning Qu,
	linux-fsdevel, linux-kernel

On Mon, Aug 05, 2013 at 01:17:39PM +0200, Jan Kara wrote:
> On Sun 04-08-13 05:17:03, Kirill A. Shutemov wrote:
> > From: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
> > The radix tree is variable-height, so an insert operation not only has
> > to build the branch to its corresponding item, it also has to build the
> > branch to existing items if the size has to be increased (by
> > radix_tree_extend).
> > @@ -82,16 +82,24 @@ static struct kmem_cache *radix_tree_node_cachep;
> >   * The worst case is a zero height tree with just a single item at index 0,
> >   * and then inserting an item at index ULONG_MAX. This requires 2 new branches
> >   * of RADIX_TREE_MAX_PATH size to be created, with only the root node shared.
> > + *
> > + * Worst case for adding N contiguous items is adding entries at indexes
> > + * (ULONG_MAX - N) to ULONG_MAX. It requires nodes to insert single worst-case
> > + * item plus extra nodes if you cross the boundary from one node to the next.
> > + *
> >   * Hence:
> >   */
> > -#define RADIX_TREE_PRELOAD_SIZE (RADIX_TREE_MAX_PATH * 2 - 1)
> > +#define RADIX_TREE_PRELOAD_MIN (RADIX_TREE_MAX_PATH * 2 - 1)
> > +#define RADIX_TREE_PRELOAD_MAX \
> > +	(RADIX_TREE_PRELOAD_MIN + \
> > +	 DIV_ROUND_UP(RADIX_TREE_PRELOAD_NR - 1, RADIX_TREE_MAP_SIZE))
>
>   Umm, is this really correct? I see two problems:
> 1) You may need internal tree nodes at various levels but you seem to
> account only for the level 1.
> 2) The rounding doesn't seem right because RADIX_TREE_MAP_SIZE+2 nodes may
> require 3 nodes at level 1 if the indexes are like:
> i_0 | i_1 .. i_{RADIX_TREE_MAP_SIZE} | i_{RADIX_TREE_MAP_SIZE+1}
>     ^                                ^
>     node boundary                    node boundary
> 
>   Otherwise the patch looks good.

You are correct that in the fully general case, these things are needed,
and the patch undercounts the number of nodes needed.  However, in the
specific case of THP pagecache, insertions are naturally aligned, and
we end up needing very few internal nodes (so few that we've never hit
the end of this array in some fairly heavy testing).
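
For illustration, the arithmetic behind the natural-alignment argument
(my numbers, assuming RADIX_TREE_MAP_SHIFT == 6 and HPAGE_PMD_NR == 512):

	/*
	 * A THP insertion covers indexes [i, i + 511] with i aligned to 512.
	 * With RADIX_TREE_MAP_SIZE == 64 that range fills exactly
	 * 512 / 64 == 8 level-1 nodes, and since 512 divides 64 * 64 == 4096
	 * it never straddles a level-2 (or higher) node boundary, so only
	 * the shared path above those 8 nodes is needed in addition.
	 */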

There are two penalties for getting the general case correct.  One is
that the calculation becomes harder to understand, and the other is
that we consume more per-CPU memory.  I think we should document that
the current code requires "natural alignment", and include a note about
what things will need to change in order to support arbitrary alignment
in case anybody needs to do it in the future, but not include support
for arbitrary alignment in this patchset.

What do you think?

^ permalink raw reply	[flat|nested] 116+ messages in thread

* Re: [PATCH 20/23] thp: handle file pages in split_huge_page()
  2013-08-04  2:17   ` Kirill A. Shutemov
  (?)
@ 2013-08-06 19:09   ` Ning Qu
  2013-08-06 21:09     ` Ning Qu
  -1 siblings, 1 reply; 116+ messages in thread
From: Ning Qu @ 2013-08-06 19:09 UTC (permalink / raw)
  To: Kirill A. Shutemov
  Cc: Andrea Arcangeli, Andrew Morton, Al Viro, Hugh Dickins,
	Wu Fengguang, Jan Kara, Mel Gorman, linux-mm, Andi Kleen,
	Matthew Wilcox, Kirill A. Shutemov, Hillf Danton, Dave Hansen,
	linux-fsdevel, linux-kernel

I am probably running into a deadlock with this patch.

When splitting a file huge page, we take the i_mmap_mutex.

However, on the call path through vma_adjust() below, we already hold the
i_mmap_mutex before calling vma_adjust_trans_huge(), which eventually calls
split_huge_page() and then split_file_huge_page():

https://git.kernel.org/cgit/linux/kernel/git/kas/linux.git/tree/mm/mmap.c?h=thp/pagecache#n753
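
Roughly, the ordering as I read it (intermediate frames omitted):

  vma_adjust()
    mutex_lock(&mapping->i_mmap_mutex)
    vma_adjust_trans_huge()
      ...
        split_huge_page()
          split_file_huge_page()
            mutex_lock(&mapping->i_mmap_mutex)   <-- same mutex, deadlock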




Best wishes,
-- 
Ning Qu (曲宁) | Software Engineer | quning@google.com | +1-408-418-6066


On Sat, Aug 3, 2013 at 7:17 PM, Kirill A. Shutemov <
kirill.shutemov@linux.intel.com> wrote:

> From: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
>
> The base scheme is the same as for anonymous pages, but we walk by
> mapping->i_mmap rather then anon_vma->rb_root.
>
> When we add a huge page to page cache we take only reference to head
> page, but on split we need to take addition reference to all tail pages
> since they are still in page cache after splitting.
>
> Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
> ---
>  mm/huge_memory.c | 89
> +++++++++++++++++++++++++++++++++++++++++++++++---------
>  1 file changed, 76 insertions(+), 13 deletions(-)
>
> diff --git a/mm/huge_memory.c b/mm/huge_memory.c
> index 523946c..d7c6830 100644
> --- a/mm/huge_memory.c
> +++ b/mm/huge_memory.c
> @@ -1580,6 +1580,7 @@ static void __split_huge_page_refcount(struct page
> *page,
>         struct zone *zone = page_zone(page);
>         struct lruvec *lruvec;
>         int tail_count = 0;
> +       int initial_tail_refcount;
>
>         /* prevent PageLRU to go away from under us, and freeze lru stats
> */
>         spin_lock_irq(&zone->lru_lock);
> @@ -1589,6 +1590,13 @@ static void __split_huge_page_refcount(struct page
> *page,
>         /* complete memcg works before add pages to LRU */
>         mem_cgroup_split_huge_fixup(page);
>
> +       /*
> +        * When we add a huge page to page cache we take only reference to
> head
> +        * page, but on split we need to take addition reference to all
> tail
> +        * pages since they are still in page cache after splitting.
> +        */
> +       initial_tail_refcount = PageAnon(page) ? 0 : 1;
> +
>         for (i = HPAGE_PMD_NR - 1; i >= 1; i--) {
>                 struct page *page_tail = page + i;
>
> @@ -1611,8 +1619,9 @@ static void __split_huge_page_refcount(struct page
> *page,
>                  * atomic_set() here would be safe on all archs (and
>                  * not only on x86), it's safer to use atomic_add().
>                  */
> -               atomic_add(page_mapcount(page) + page_mapcount(page_tail)
> + 1,
> -                          &page_tail->_count);
> +               atomic_add(initial_tail_refcount + page_mapcount(page) +
> +                               page_mapcount(page_tail) + 1,
> +                               &page_tail->_count);
>
>                 /* after clearing PageTail the gup refcount can be
> released */
>                 smp_mb();
> @@ -1651,23 +1660,23 @@ static void __split_huge_page_refcount(struct page
> *page,
>                 */
>                 page_tail->_mapcount = page->_mapcount;
>
> -               BUG_ON(page_tail->mapping);
>                 page_tail->mapping = page->mapping;
>
>                 page_tail->index = page->index + i;
>                 page_nid_xchg_last(page_tail, page_nid_last(page));
>
> -               BUG_ON(!PageAnon(page_tail));
>                 BUG_ON(!PageUptodate(page_tail));
>                 BUG_ON(!PageDirty(page_tail));
> -               BUG_ON(!PageSwapBacked(page_tail));
>
>                 lru_add_page_tail(page, page_tail, lruvec, list);
>         }
>         atomic_sub(tail_count, &page->_count);
>         BUG_ON(atomic_read(&page->_count) <= 0);
>
> -       __mod_zone_page_state(zone, NR_ANON_TRANSPARENT_HUGEPAGES, -1);
> +       if (PageAnon(page))
> +               __mod_zone_page_state(zone, NR_ANON_TRANSPARENT_HUGEPAGES,
> -1);
> +       else
> +               __mod_zone_page_state(zone, NR_FILE_TRANSPARENT_HUGEPAGES,
> -1);
>
>         ClearPageCompound(page);
>         compound_unlock(page);
> @@ -1767,7 +1776,7 @@ static int __split_huge_page_map(struct page *page,
>  }
>
>  /* must be called with anon_vma->root->rwsem held */
> -static void __split_huge_page(struct page *page,
> +static void __split_anon_huge_page(struct page *page,
>                               struct anon_vma *anon_vma,
>                               struct list_head *list)
>  {
> @@ -1791,7 +1800,7 @@ static void __split_huge_page(struct page *page,
>          * and establishes a child pmd before
>          * __split_huge_page_splitting() freezes the parent pmd (so if
>          * we fail to prevent copy_huge_pmd() from running until the
> -        * whole __split_huge_page() is complete), we will still see
> +        * whole __split_anon_huge_page() is complete), we will still see
>          * the newly established pmd of the child later during the
>          * walk, to be able to set it as pmd_trans_splitting too.
>          */
> @@ -1822,14 +1831,11 @@ static void __split_huge_page(struct page *page,
>   * from the hugepage.
>   * Return 0 if the hugepage is split successfully otherwise return 1.
>   */
> -int split_huge_page_to_list(struct page *page, struct list_head *list)
> +static int split_anon_huge_page(struct page *page, struct list_head *list)
>  {
>         struct anon_vma *anon_vma;
>         int ret = 1;
>
> -       BUG_ON(is_huge_zero_page(page));
> -       BUG_ON(!PageAnon(page));
> -
>         /*
>          * The caller does not necessarily hold an mmap_sem that would
> prevent
>          * the anon_vma disappearing so we first we take a reference to it
> @@ -1847,7 +1853,7 @@ int split_huge_page_to_list(struct page *page,
> struct list_head *list)
>                 goto out_unlock;
>
>         BUG_ON(!PageSwapBacked(page));
> -       __split_huge_page(page, anon_vma, list);
> +       __split_anon_huge_page(page, anon_vma, list);
>         count_vm_event(THP_SPLIT);
>
>         BUG_ON(PageCompound(page));
> @@ -1858,6 +1864,63 @@ out:
>         return ret;
>  }
>
> +static int split_file_huge_page(struct page *page, struct list_head *list)
> +{
> +       struct address_space *mapping = page->mapping;
> +       pgoff_t pgoff = page->index << (PAGE_CACHE_SHIFT - PAGE_SHIFT);
> +       struct vm_area_struct *vma;
> +       int mapcount, mapcount2;
> +
> +       BUG_ON(!PageHead(page));
> +       BUG_ON(PageTail(page));
> +
> +       mutex_lock(&mapping->i_mmap_mutex);
> +       mapcount = 0;
> +       vma_interval_tree_foreach(vma, &mapping->i_mmap, pgoff, pgoff) {
> +               unsigned long addr = vma_address(page, vma);
> +               mapcount += __split_huge_page_splitting(page, vma, addr);
> +       }
> +
> +       if (mapcount != page_mapcount(page))
> +               printk(KERN_ERR "mapcount %d page_mapcount %d\n",
> +                      mapcount, page_mapcount(page));
> +       BUG_ON(mapcount != page_mapcount(page));
> +
> +       __split_huge_page_refcount(page, list);
> +
> +       mapcount2 = 0;
> +       vma_interval_tree_foreach(vma, &mapping->i_mmap, pgoff, pgoff) {
> +               unsigned long addr = vma_address(page, vma);
> +               mapcount2 += __split_huge_page_map(page, vma, addr);
> +       }
> +
> +       if (mapcount != mapcount2)
> +               printk(KERN_ERR "mapcount %d mapcount2 %d page_mapcount
> %d\n",
> +                      mapcount, mapcount2, page_mapcount(page));
> +       BUG_ON(mapcount != mapcount2);
> +       count_vm_event(THP_SPLIT);
> +       mutex_unlock(&mapping->i_mmap_mutex);
> +
> +       /*
> +        * Drop small pages beyond i_size if any.
> +        *
> +        * XXX: do we need to serialize over i_mutex here?
> +        * If yes, how to get mmap_sem vs. i_mutex ordering fixed?
> +        */
> +       truncate_inode_pages(mapping, i_size_read(mapping->host));
> +       return 0;
> +}
> +
> +int split_huge_page_to_list(struct page *page, struct list_head *list)
> +{
> +       BUG_ON(is_huge_zero_page(page));
> +
> +       if (PageAnon(page))
> +               return split_anon_huge_page(page, list);
> +       else
> +               return split_file_huge_page(page, list);
> +}
> +
>  #define VM_NO_THP
> (VM_SPECIAL|VM_MIXEDMAP|VM_HUGETLB|VM_SHARED|VM_MAYSHARE)
>
>  int hugepage_madvise(struct vm_area_struct *vma,
> --
> 1.8.3.2
>
>

^ permalink raw reply	[flat|nested] 116+ messages in thread

* Re: [PATCH 01/23] radix-tree: implement preload for multiple contiguous elements
  2013-08-06 16:34       ` Matthew Wilcox
@ 2013-08-06 20:17         ` Jan Kara
  -1 siblings, 0 replies; 116+ messages in thread
From: Jan Kara @ 2013-08-06 20:17 UTC (permalink / raw)
  To: Matthew Wilcox
  Cc: Jan Kara, Kirill A. Shutemov, Andrea Arcangeli, Andrew Morton,
	Al Viro, Hugh Dickins, Wu Fengguang, Mel Gorman, linux-mm,
	Andi Kleen, Kirill A. Shutemov, Hillf Danton, Dave Hansen,
	Ning Qu, linux-fsdevel, linux-kernel

On Tue 06-08-13 12:34:14, Matthew Wilcox wrote:
> On Mon, Aug 05, 2013 at 01:17:39PM +0200, Jan Kara wrote:
> > On Sun 04-08-13 05:17:03, Kirill A. Shutemov wrote:
> > > From: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
> > > The radix tree is variable-height, so an insert operation not only has
> > > to build the branch to its corresponding item, it also has to build the
> > > branch to existing items if the size has to be increased (by
> > > radix_tree_extend).
> > > @@ -82,16 +82,24 @@ static struct kmem_cache *radix_tree_node_cachep;
> > >   * The worst case is a zero height tree with just a single item at index 0,
> > >   * and then inserting an item at index ULONG_MAX. This requires 2 new branches
> > >   * of RADIX_TREE_MAX_PATH size to be created, with only the root node shared.
> > > + *
> > > + * Worst case for adding N contiguous items is adding entries at indexes
> > > + * (ULONG_MAX - N) to ULONG_MAX. It requires nodes to insert single worst-case
> > > + * item plus extra nodes if you cross the boundary from one node to the next.
> > > + *
> > >   * Hence:
> > >   */
> > > -#define RADIX_TREE_PRELOAD_SIZE (RADIX_TREE_MAX_PATH * 2 - 1)
> > > +#define RADIX_TREE_PRELOAD_MIN (RADIX_TREE_MAX_PATH * 2 - 1)
> > > +#define RADIX_TREE_PRELOAD_MAX \
> > > +	(RADIX_TREE_PRELOAD_MIN + \
> > > +	 DIV_ROUND_UP(RADIX_TREE_PRELOAD_NR - 1, RADIX_TREE_MAP_SIZE))
> >
> >   Umm, is this really correct? I see two problems:
> > 1) You may need internal tree nodes at various levels but you seem to
> > account only for the level 1.
> > 2) The rounding doesn't seem right because RADIX_TREE_MAP_SIZE+2 nodes may
> > require 3 nodes at level 1 if the indexes are like:
> > i_0 | i_1 .. i_{RADIX_TREE_MAP_SIZE} | i_{RADIX_TREE_MAP_SIZE+1}
> >     ^                                ^
> >     node boundary                    node boundary
> > 
> >   Otherwise the patch looks good.
> 
> You are correct that in the fully general case, these things are needed,
> and the patch undercounts the number of nodes needed.  However, in the
> specific case of THP pagecache, insertions are naturally aligned, and
> we end up needing very few internal nodes (so few that we've never hit
> the end of this array in some fairly heavy testing).
  Have you checked that you really didn't hit the end of the array? Because
if we run out of preloaded nodes, we just try an atomic kmalloc to get new
nodes, and that has a high chance of success...

> There are two penalties for getting the general case correct.  One is
> that the calculation becomes harder to understand, and the other is
> that we consume more per-CPU memory.  I think we should document that
> the current code requires "natural alignment", and include a note about
> what things will need to change in order to support arbitrary alignment
> in case anybody needs to do it in the future, but not include support
> for arbitrary alignment in this patchset.
> 
> What do you think?
  I'm OK with expecting things to be naturally aligned if that's documented
and there's a WARN_ON_ONCE which checks that someone isn't misusing the code.
It is true that you actually implemented only the 'maybe_preload' variant for
contiguous extent preloading, so leaving out higher-level nodes might be OK.
But it would be good to at least document this at the place where you
estimate how many nodes to preload.

								Honza
-- 
Jan Kara <jack@suse.cz>
SUSE Labs, CR
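
A minimal sketch of the check being discussed, assuming a batched-preload
entry point named radix_tree_preload_contig(); the name and signature are
made up here, and only radix_tree_preload() below is the existing API:

#include <linux/bug.h>
#include <linux/gfp.h>
#include <linux/radix-tree.h>

/*
 * Hypothetical helper, not from the series: document the "naturally
 * aligned" assumption behind the preload estimate and warn if a caller
 * violates it.  Assumes nr is a power of two (512 for THP page cache).
 */
static int radix_tree_preload_contig(unsigned long first_index,
				     unsigned long nr, gfp_t gfp_mask)
{
	/* The estimate only covers first_index aligned to a multiple of nr. */
	WARN_ON_ONCE(nr > 1 && (first_index & (nr - 1)));

	/*
	 * Placeholder: a real implementation would preload enough nodes for
	 * nr contiguous slots; the existing single-element preload is used
	 * here only to keep the sketch self-contained.
	 */
	return radix_tree_preload(gfp_mask);
}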

^ permalink raw reply	[flat|nested] 116+ messages in thread

* Re: [PATCH 19/23] truncate: support huge pages
  2013-08-04  2:17   ` Kirill A. Shutemov
@ 2013-08-06 20:23     ` Dave Hansen
  -1 siblings, 0 replies; 116+ messages in thread
From: Dave Hansen @ 2013-08-06 20:23 UTC (permalink / raw)
  To: Kirill A. Shutemov
  Cc: Andrea Arcangeli, Andrew Morton, Al Viro, Hugh Dickins,
	Wu Fengguang, Jan Kara, Mel Gorman, linux-mm, Andi Kleen,
	Matthew Wilcox, Kirill A. Shutemov, Hillf Danton, Ning Qu,
	linux-fsdevel, linux-kernel

On 08/03/2013 07:17 PM, Kirill A. Shutemov wrote:
> +			if (PageTransTailCache(page)) {
> +				/* part of already handled huge page */
> +				if (!page->mapping)
> +					continue;
> +				/* the range starts in middle of huge page */
> +				partial_thp_start = true;
> +				start = index & ~HPAGE_CACHE_INDEX_MASK;
> +				continue;
> +			}
> +			/* the range ends on huge page */
> +			if (PageTransHugeCache(page) &&
> +				index == (end & ~HPAGE_CACHE_INDEX_MASK)) {
> +				partial_thp_end = true;
> +				end = index;
> +				break;
> +			}

Still reading through the code, but that "index ==" line's indentation
is screwed up.  It makes it look like "index == " is a line of code
instead of part of the if() condition.

^ permalink raw reply	[flat|nested] 116+ messages in thread

* Re: [PATCH 19/23] truncate: support huge pages
  2013-08-06 20:23     ` Dave Hansen
@ 2013-08-06 20:57       ` Kirill A. Shutemov
  -1 siblings, 0 replies; 116+ messages in thread
From: Kirill A. Shutemov @ 2013-08-06 20:57 UTC (permalink / raw)
  To: Dave Hansen
  Cc: Kirill A. Shutemov, Andrea Arcangeli, Andrew Morton, Al Viro,
	Hugh Dickins, Wu Fengguang, Jan Kara, Mel Gorman, linux-mm,
	Andi Kleen, Matthew Wilcox, Hillf Danton, Ning Qu, linux-fsdevel,
	linux-kernel

On Tue, Aug 06, 2013 at 01:23:48PM -0700, Dave Hansen wrote:
> On 08/03/2013 07:17 PM, Kirill A. Shutemov wrote:
> > +			if (PageTransTailCache(page)) {
> > +				/* part of already handled huge page */
> > +				if (!page->mapping)
> > +					continue;
> > +				/* the range starts in middle of huge page */
> > +				partial_thp_start = true;
> > +				start = index & ~HPAGE_CACHE_INDEX_MASK;
> > +				continue;
> > +			}
> > +			/* the range ends on huge page */
> > +			if (PageTransHugeCache(page) &&
> > +				index == (end & ~HPAGE_CACHE_INDEX_MASK)) {
> > +				partial_thp_end = true;
> > +				end = index;
> > +				break;
> > +			}
> 
> Still reading through the code, but that "index ==" line's indentation
> is screwed up.  It makes it look like "index == " is a line of code
> instead of part of the if() condition.

I screwed it up myself. Otherwise the line is too long. :-/

Probably I should move the partial-page logic into a separate function.

-- 
 Kirill A. Shutemov
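
One possible shape for that helper, as a sketch only: the name is made up,
and PageTransHugeCache()/HPAGE_CACHE_INDEX_MASK are assumed to be the
definitions from this patchset.

#include <linux/types.h>
#include <linux/mm_types.h>

/*
 * Does the truncate range end inside this huge page?  Splitting the test
 * out keeps the caller's if() condition on a single line.
 */
static inline bool range_ends_in_huge_page(struct page *page, pgoff_t index,
					   pgoff_t end)
{
	return PageTransHugeCache(page) &&
		index == (end & ~HPAGE_CACHE_INDEX_MASK);
}

The caller then reads as a plain
"if (range_ends_in_huge_page(page, index, end))".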

^ permalink raw reply	[flat|nested] 116+ messages in thread

* Re: [PATCH 20/23] thp: handle file pages in split_huge_page()
  2013-08-06 19:09   ` Ning Qu
@ 2013-08-06 21:09     ` Ning Qu
  2013-08-06 21:47       ` Ning Qu
  0 siblings, 1 reply; 116+ messages in thread
From: Ning Qu @ 2013-08-06 21:09 UTC (permalink / raw)
  To: Kirill A. Shutemov
  Cc: Andrea Arcangeli, Andrew Morton, Al Viro, Hugh Dickins,
	Wu Fengguang, Jan Kara, Mel Gorman, linux-mm, Andi Kleen,
	Matthew Wilcox, Kirill A. Shutemov, Hillf Danton, Dave Hansen,
	linux-fsdevel, linux-kernel

Is it safe to move the vma_adjust_trans_huge call before line 772? It seems
that for anonymous memory we only take the lock after vma_adjust_trans_huge;
maybe we should do the same for file?

Best wishes,
-- 
Ning Qu (曲宁) | Software Engineer | quning@google.com | +1-408-418-6066


On Tue, Aug 6, 2013 at 12:09 PM, Ning Qu <quning@google.com> wrote:

> I am probably running into a deadlock case for this patch.
>
> When splitting the file huge page, we hold the i_mmap_mutex.
>
> However, when coming from the call path in vma_adjust as follows, we
> will already grab the i_mmap_mutex before doing vma_adjust_trans_huge,
> which will eventually call split_huge_page and then split_file_huge_page
> ....
>
>
> https://git.kernel.org/cgit/linux/kernel/git/kas/linux.git/tree/mm/mmap.c?h=thp/pagecache#n753
>
>
>
>
> Best wishes,
> --
> Ning Qu (曲宁) | Software Engineer | quning@google.com | +1-408-418-6066
>
>
> On Sat, Aug 3, 2013 at 7:17 PM, Kirill A. Shutemov <
> kirill.shutemov@linux.intel.com> wrote:
>
>> From: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
>>
>> The base scheme is the same as for anonymous pages, but we walk by
>> mapping->i_mmap rather then anon_vma->rb_root.
>>
>> When we add a huge page to page cache we take only reference to head
>> page, but on split we need to take addition reference to all tail pages
>> since they are still in page cache after splitting.
>>
>> Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
>> ---
>>  mm/huge_memory.c | 89 +++++++++++++++++++++++++++++++++++++++++++++++---------
>>  1 file changed, 76 insertions(+), 13 deletions(-)
>>
>> diff --git a/mm/huge_memory.c b/mm/huge_memory.c
>> index 523946c..d7c6830 100644
>> --- a/mm/huge_memory.c
>> +++ b/mm/huge_memory.c
>> @@ -1580,6 +1580,7 @@ static void __split_huge_page_refcount(struct page
>> *page,
>>         struct zone *zone = page_zone(page);
>>         struct lruvec *lruvec;
>>         int tail_count = 0;
>> +       int initial_tail_refcount;
>>
>>         /* prevent PageLRU to go away from under us, and freeze lru stats
>> */
>>         spin_lock_irq(&zone->lru_lock);
>> @@ -1589,6 +1590,13 @@ static void __split_huge_page_refcount(struct page
>> *page,
>>         /* complete memcg works before add pages to LRU */
>>         mem_cgroup_split_huge_fixup(page);
>>
>> +       /*
>> +        * When we add a huge page to page cache we take only reference
>> to head
>> +        * page, but on split we need to take addition reference to all
>> tail
>> +        * pages since they are still in page cache after splitting.
>> +        */
>> +       initial_tail_refcount = PageAnon(page) ? 0 : 1;
>> +
>>         for (i = HPAGE_PMD_NR - 1; i >= 1; i--) {
>>                 struct page *page_tail = page + i;
>>
>> @@ -1611,8 +1619,9 @@ static void __split_huge_page_refcount(struct page
>> *page,
>>                  * atomic_set() here would be safe on all archs (and
>>                  * not only on x86), it's safer to use atomic_add().
>>                  */
>> -               atomic_add(page_mapcount(page) + page_mapcount(page_tail) + 1,
>> -                          &page_tail->_count);
>> +               atomic_add(initial_tail_refcount + page_mapcount(page) +
>> +                               page_mapcount(page_tail) + 1,
>> +                               &page_tail->_count);
>>
>>                 /* after clearing PageTail the gup refcount can be
>> released */
>>                 smp_mb();
>> @@ -1651,23 +1660,23 @@ static void __split_huge_page_refcount(struct
>> page *page,
>>                 */
>>                 page_tail->_mapcount = page->_mapcount;
>>
>> -               BUG_ON(page_tail->mapping);
>>                 page_tail->mapping = page->mapping;
>>
>>                 page_tail->index = page->index + i;
>>                 page_nid_xchg_last(page_tail, page_nid_last(page));
>>
>> -               BUG_ON(!PageAnon(page_tail));
>>                 BUG_ON(!PageUptodate(page_tail));
>>                 BUG_ON(!PageDirty(page_tail));
>> -               BUG_ON(!PageSwapBacked(page_tail));
>>
>>                 lru_add_page_tail(page, page_tail, lruvec, list);
>>         }
>>         atomic_sub(tail_count, &page->_count);
>>         BUG_ON(atomic_read(&page->_count) <= 0);
>>
>> -       __mod_zone_page_state(zone, NR_ANON_TRANSPARENT_HUGEPAGES, -1);
>> +       if (PageAnon(page))
>> +               __mod_zone_page_state(zone, NR_ANON_TRANSPARENT_HUGEPAGES, -1);
>> +       else
>> +               __mod_zone_page_state(zone, NR_FILE_TRANSPARENT_HUGEPAGES, -1);
>>
>>         ClearPageCompound(page);
>>         compound_unlock(page);
>> @@ -1767,7 +1776,7 @@ static int __split_huge_page_map(struct page *page,
>>  }
>>
>>  /* must be called with anon_vma->root->rwsem held */
>> -static void __split_huge_page(struct page *page,
>> +static void __split_anon_huge_page(struct page *page,
>>                               struct anon_vma *anon_vma,
>>                               struct list_head *list)
>>  {
>> @@ -1791,7 +1800,7 @@ static void __split_huge_page(struct page *page,
>>          * and establishes a child pmd before
>>          * __split_huge_page_splitting() freezes the parent pmd (so if
>>          * we fail to prevent copy_huge_pmd() from running until the
>> -        * whole __split_huge_page() is complete), we will still see
>> +        * whole __split_anon_huge_page() is complete), we will still see
>>          * the newly established pmd of the child later during the
>>          * walk, to be able to set it as pmd_trans_splitting too.
>>          */
>> @@ -1822,14 +1831,11 @@ static void __split_huge_page(struct page *page,
>>   * from the hugepage.
>>   * Return 0 if the hugepage is split successfully otherwise return 1.
>>   */
>> -int split_huge_page_to_list(struct page *page, struct list_head *list)
>> +static int split_anon_huge_page(struct page *page, struct list_head
>> *list)
>>  {
>>         struct anon_vma *anon_vma;
>>         int ret = 1;
>>
>> -       BUG_ON(is_huge_zero_page(page));
>> -       BUG_ON(!PageAnon(page));
>> -
>>         /*
>>          * The caller does not necessarily hold an mmap_sem that would
>> prevent
>>          * the anon_vma disappearing so we first we take a reference to it
>> @@ -1847,7 +1853,7 @@ int split_huge_page_to_list(struct page *page,
>> struct list_head *list)
>>                 goto out_unlock;
>>
>>         BUG_ON(!PageSwapBacked(page));
>> -       __split_huge_page(page, anon_vma, list);
>> +       __split_anon_huge_page(page, anon_vma, list);
>>         count_vm_event(THP_SPLIT);
>>
>>         BUG_ON(PageCompound(page));
>> @@ -1858,6 +1864,63 @@ out:
>>         return ret;
>>  }
>>
>> +static int split_file_huge_page(struct page *page, struct list_head
>> *list)
>> +{
>> +       struct address_space *mapping = page->mapping;
>> +       pgoff_t pgoff = page->index << (PAGE_CACHE_SHIFT - PAGE_SHIFT);
>> +       struct vm_area_struct *vma;
>> +       int mapcount, mapcount2;
>> +
>> +       BUG_ON(!PageHead(page));
>> +       BUG_ON(PageTail(page));
>> +
>> +       mutex_lock(&mapping->i_mmap_mutex);
>> +       mapcount = 0;
>> +       vma_interval_tree_foreach(vma, &mapping->i_mmap, pgoff, pgoff) {
>> +               unsigned long addr = vma_address(page, vma);
>> +               mapcount += __split_huge_page_splitting(page, vma, addr);
>> +       }
>> +
>> +       if (mapcount != page_mapcount(page))
>> +               printk(KERN_ERR "mapcount %d page_mapcount %d\n",
>> +                      mapcount, page_mapcount(page));
>> +       BUG_ON(mapcount != page_mapcount(page));
>> +
>> +       __split_huge_page_refcount(page, list);
>> +
>> +       mapcount2 = 0;
>> +       vma_interval_tree_foreach(vma, &mapping->i_mmap, pgoff, pgoff) {
>> +               unsigned long addr = vma_address(page, vma);
>> +               mapcount2 += __split_huge_page_map(page, vma, addr);
>> +       }
>> +
>> +       if (mapcount != mapcount2)
>> +               printk(KERN_ERR "mapcount %d mapcount2 %d page_mapcount %d\n",
>> +                      mapcount, mapcount2, page_mapcount(page));
>> +       BUG_ON(mapcount != mapcount2);
>> +       count_vm_event(THP_SPLIT);
>> +       mutex_unlock(&mapping->i_mmap_mutex);
>> +
>> +       /*
>> +        * Drop small pages beyond i_size if any.
>> +        *
>> +        * XXX: do we need to serialize over i_mutex here?
>> +        * If yes, how to get mmap_sem vs. i_mutex ordering fixed?
>> +        */
>> +       truncate_inode_pages(mapping, i_size_read(mapping->host));
>> +       return 0;
>> +}
>> +
>> +int split_huge_page_to_list(struct page *page, struct list_head *list)
>> +{
>> +       BUG_ON(is_huge_zero_page(page));
>> +
>> +       if (PageAnon(page))
>> +               return split_anon_huge_page(page, list);
>> +       else
>> +               return split_file_huge_page(page, list);
>> +}
>> +
>>  #define VM_NO_THP (VM_SPECIAL|VM_MIXEDMAP|VM_HUGETLB|VM_SHARED|VM_MAYSHARE)
>>
>>  int hugepage_madvise(struct vm_area_struct *vma,
>> --
>> 1.8.3.2
>>
>>
>

^ permalink raw reply	[flat|nested] 116+ messages in thread

* Re: [PATCH 20/23] thp: handle file pages in split_huge_page()
  2013-08-06 21:09     ` Ning Qu
@ 2013-08-06 21:47       ` Ning Qu
  2013-08-09 14:46           ` Kirill A. Shutemov
  0 siblings, 1 reply; 116+ messages in thread
From: Ning Qu @ 2013-08-06 21:47 UTC (permalink / raw)
  To: Kirill A. Shutemov
  Cc: Andrea Arcangeli, Andrew Morton, Al Viro, Hugh Dickins,
	Wu Fengguang, Jan Kara, Mel Gorman, linux-mm, Andi Kleen,
	Matthew Wilcox, Kirill A. Shutemov, Hillf Danton, Dave Hansen,
	linux-fsdevel, linux-kernel

I just tried, and it seems to work fine now, without the deadlock. I can run
a big internal test with about 40GB of files in sysv shm. Just move the line
to before the locking happens in vma_adjust, something like below; the line
numbers are not accurate because my patch is based on another tree right
now.

--- a/mm/mmap.c
+++ b/mm/mmap.c
@@ -581,6 +581,8 @@ again:                      remove_next = 1 + (end > next->vm_end);
                }
        }

+       vma_adjust_trans_huge(vma, start, end, adjust_next);
+
        if (file) {
                mapping = file->f_mapping;
                if (!(vma->vm_flags & VM_NONLINEAR))
@@ -597,8 +599,6 @@ again:                      remove_next = 1 + (end > next->vm_end);
                }
        }

-       vma_adjust_trans_huge(vma, start, end, adjust_next);
-
        anon_vma = vma->anon_vma;
        if (!anon_vma && adjust_next)
                anon_vma = next->anon_vma;
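
For reference, the ordering problem this works around, as described earlier
in the thread:

/*
 * Call chain for a file-backed VMA before the change:
 *
 *   vma_adjust()
 *     mutex_lock(&mapping->i_mmap_mutex)
 *     vma_adjust_trans_huge()
 *       split_huge_page()
 *         split_file_huge_page()
 *           mutex_lock(&mapping->i_mmap_mutex)   <- taken again: deadlock
 *
 * Moving vma_adjust_trans_huge() above the i_mmap_mutex acquisition lets
 * the split finish before vma_adjust() takes the mutex.
 */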


Best wishes,
-- 
Ning Qu (曲宁) | Software Engineer | quning@google.com | +1-408-418-6066


On Tue, Aug 6, 2013 at 2:09 PM, Ning Qu <quning@google.com> wrote:

> Is it safe to move the vma_adjust_trans_huge call before line 772? It seems
> that for anonymous memory we only take the lock after vma_adjust_trans_huge;
> maybe we should do the same for file?
>
> Best wishes,
> --
> Ning Qu (曲宁) | Software Engineer | quning@google.com | +1-408-418-6066
>
>
> On Tue, Aug 6, 2013 at 12:09 PM, Ning Qu <quning@google.com> wrote:
>
>> I am probably running into a deadlock case for this patch.
>>
>> When splitting the file huge page, we hold the i_mmap_mutex.
>>
>> However, when coming from the call path in vma_adjust as follows, we
>> will already grab the i_mmap_mutex before doing vma_adjust_trans_huge,
>> which will eventually call split_huge_page and then split_file_huge_page
>> ....
>>
>>
>> https://git.kernel.org/cgit/linux/kernel/git/kas/linux.git/tree/mm/mmap.c?h=thp/pagecache#n753
>>
>>
>>
>>
>> Best wishes,
>> --
>> Ning Qu (曲宁) | Software Engineer | quning@google.com | +1-408-418-6066
>>
>>
>> On Sat, Aug 3, 2013 at 7:17 PM, Kirill A. Shutemov <
>> kirill.shutemov@linux.intel.com> wrote:
>>
>>> From: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
>>>
>>> The base scheme is the same as for anonymous pages, but we walk by
>>> mapping->i_mmap rather then anon_vma->rb_root.
>>>
>>> When we add a huge page to page cache we take only reference to head
>>> page, but on split we need to take addition reference to all tail pages
>>> since they are still in page cache after splitting.
>>>
>>> Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
>>> ---
>>>  mm/huge_memory.c | 89 +++++++++++++++++++++++++++++++++++++++++++++++---------
>>>  1 file changed, 76 insertions(+), 13 deletions(-)
>>>
>>> diff --git a/mm/huge_memory.c b/mm/huge_memory.c
>>> index 523946c..d7c6830 100644
>>> --- a/mm/huge_memory.c
>>> +++ b/mm/huge_memory.c
>>> @@ -1580,6 +1580,7 @@ static void __split_huge_page_refcount(struct page
>>> *page,
>>>         struct zone *zone = page_zone(page);
>>>         struct lruvec *lruvec;
>>>         int tail_count = 0;
>>> +       int initial_tail_refcount;
>>>
>>>         /* prevent PageLRU to go away from under us, and freeze lru
>>> stats */
>>>         spin_lock_irq(&zone->lru_lock);
>>> @@ -1589,6 +1590,13 @@ static void __split_huge_page_refcount(struct
>>> page *page,
>>>         /* complete memcg works before add pages to LRU */
>>>         mem_cgroup_split_huge_fixup(page);
>>>
>>> +       /*
>>> +        * When we add a huge page to page cache we take only reference
>>> to head
>>> +        * page, but on split we need to take addition reference to all
>>> tail
>>> +        * pages since they are still in page cache after splitting.
>>> +        */
>>> +       initial_tail_refcount = PageAnon(page) ? 0 : 1;
>>> +
>>>         for (i = HPAGE_PMD_NR - 1; i >= 1; i--) {
>>>                 struct page *page_tail = page + i;
>>>
>>> @@ -1611,8 +1619,9 @@ static void __split_huge_page_refcount(struct page
>>> *page,
>>>                  * atomic_set() here would be safe on all archs (and
>>>                  * not only on x86), it's safer to use atomic_add().
>>>                  */
>>> -               atomic_add(page_mapcount(page) + page_mapcount(page_tail) + 1,
>>> -                          &page_tail->_count);
>>> +               atomic_add(initial_tail_refcount + page_mapcount(page) +
>>> +                               page_mapcount(page_tail) + 1,
>>> +                               &page_tail->_count);
>>>
>>>                 /* after clearing PageTail the gup refcount can be
>>> released */
>>>                 smp_mb();
>>> @@ -1651,23 +1660,23 @@ static void __split_huge_page_refcount(struct
>>> page *page,
>>>                 */
>>>                 page_tail->_mapcount = page->_mapcount;
>>>
>>> -               BUG_ON(page_tail->mapping);
>>>                 page_tail->mapping = page->mapping;
>>>
>>>                 page_tail->index = page->index + i;
>>>                 page_nid_xchg_last(page_tail, page_nid_last(page));
>>>
>>> -               BUG_ON(!PageAnon(page_tail));
>>>                 BUG_ON(!PageUptodate(page_tail));
>>>                 BUG_ON(!PageDirty(page_tail));
>>> -               BUG_ON(!PageSwapBacked(page_tail));
>>>
>>>                 lru_add_page_tail(page, page_tail, lruvec, list);
>>>         }
>>>         atomic_sub(tail_count, &page->_count);
>>>         BUG_ON(atomic_read(&page->_count) <= 0);
>>>
>>> -       __mod_zone_page_state(zone, NR_ANON_TRANSPARENT_HUGEPAGES, -1);
>>> +       if (PageAnon(page))
>>> +               __mod_zone_page_state(zone, NR_ANON_TRANSPARENT_HUGEPAGES, -1);
>>> +       else
>>> +               __mod_zone_page_state(zone, NR_FILE_TRANSPARENT_HUGEPAGES, -1);
>>>
>>>         ClearPageCompound(page);
>>>         compound_unlock(page);
>>> @@ -1767,7 +1776,7 @@ static int __split_huge_page_map(struct page *page,
>>>  }
>>>
>>>  /* must be called with anon_vma->root->rwsem held */
>>> -static void __split_huge_page(struct page *page,
>>> +static void __split_anon_huge_page(struct page *page,
>>>                               struct anon_vma *anon_vma,
>>>                               struct list_head *list)
>>>  {
>>> @@ -1791,7 +1800,7 @@ static void __split_huge_page(struct page *page,
>>>          * and establishes a child pmd before
>>>          * __split_huge_page_splitting() freezes the parent pmd (so if
>>>          * we fail to prevent copy_huge_pmd() from running until the
>>> -        * whole __split_huge_page() is complete), we will still see
>>> +        * whole __split_anon_huge_page() is complete), we will still see
>>>          * the newly established pmd of the child later during the
>>>          * walk, to be able to set it as pmd_trans_splitting too.
>>>          */
>>> @@ -1822,14 +1831,11 @@ static void __split_huge_page(struct page *page,
>>>   * from the hugepage.
>>>   * Return 0 if the hugepage is split successfully otherwise return 1.
>>>   */
>>> -int split_huge_page_to_list(struct page *page, struct list_head *list)
>>> +static int split_anon_huge_page(struct page *page, struct list_head
>>> *list)
>>>  {
>>>         struct anon_vma *anon_vma;
>>>         int ret = 1;
>>>
>>> -       BUG_ON(is_huge_zero_page(page));
>>> -       BUG_ON(!PageAnon(page));
>>> -
>>>         /*
>>>          * The caller does not necessarily hold an mmap_sem that would
>>> prevent
>>>          * the anon_vma disappearing so we first we take a reference to
>>> it
>>> @@ -1847,7 +1853,7 @@ int split_huge_page_to_list(struct page *page,
>>> struct list_head *list)
>>>                 goto out_unlock;
>>>
>>>         BUG_ON(!PageSwapBacked(page));
>>> -       __split_huge_page(page, anon_vma, list);
>>> +       __split_anon_huge_page(page, anon_vma, list);
>>>         count_vm_event(THP_SPLIT);
>>>
>>>         BUG_ON(PageCompound(page));
>>> @@ -1858,6 +1864,63 @@ out:
>>>         return ret;
>>>  }
>>>
>>> +static int split_file_huge_page(struct page *page, struct list_head
>>> *list)
>>> +{
>>> +       struct address_space *mapping = page->mapping;
>>> +       pgoff_t pgoff = page->index << (PAGE_CACHE_SHIFT - PAGE_SHIFT);
>>> +       struct vm_area_struct *vma;
>>> +       int mapcount, mapcount2;
>>> +
>>> +       BUG_ON(!PageHead(page));
>>> +       BUG_ON(PageTail(page));
>>> +
>>> +       mutex_lock(&mapping->i_mmap_mutex);
>>> +       mapcount = 0;
>>> +       vma_interval_tree_foreach(vma, &mapping->i_mmap, pgoff, pgoff) {
>>> +               unsigned long addr = vma_address(page, vma);
>>> +               mapcount += __split_huge_page_splitting(page, vma, addr);
>>> +       }
>>> +
>>> +       if (mapcount != page_mapcount(page))
>>> +               printk(KERN_ERR "mapcount %d page_mapcount %d\n",
>>> +                      mapcount, page_mapcount(page));
>>> +       BUG_ON(mapcount != page_mapcount(page));
>>> +
>>> +       __split_huge_page_refcount(page, list);
>>> +
>>> +       mapcount2 = 0;
>>> +       vma_interval_tree_foreach(vma, &mapping->i_mmap, pgoff, pgoff) {
>>> +               unsigned long addr = vma_address(page, vma);
>>> +               mapcount2 += __split_huge_page_map(page, vma, addr);
>>> +       }
>>> +
>>> +       if (mapcount != mapcount2)
>>> +               printk(KERN_ERR "mapcount %d mapcount2 %d page_mapcount %d\n",
>>> +                      mapcount, mapcount2, page_mapcount(page));
>>> +       BUG_ON(mapcount != mapcount2);
>>> +       count_vm_event(THP_SPLIT);
>>> +       mutex_unlock(&mapping->i_mmap_mutex);
>>> +
>>> +       /*
>>> +        * Drop small pages beyond i_size if any.
>>> +        *
>>> +        * XXX: do we need to serialize over i_mutex here?
>>> +        * If yes, how to get mmap_sem vs. i_mutex ordering fixed?
>>> +        */
>>> +       truncate_inode_pages(mapping, i_size_read(mapping->host));
>>> +       return 0;
>>> +}
>>> +
>>> +int split_huge_page_to_list(struct page *page, struct list_head *list)
>>> +{
>>> +       BUG_ON(is_huge_zero_page(page));
>>> +
>>> +       if (PageAnon(page))
>>> +               return split_anon_huge_page(page, list);
>>> +       else
>>> +               return split_file_huge_page(page, list);
>>> +}
>>> +
>>>  #define VM_NO_THP (VM_SPECIAL|VM_MIXEDMAP|VM_HUGETLB|VM_SHARED|VM_MAYSHARE)
>>>
>>>  int hugepage_madvise(struct vm_area_struct *vma,
>>> --
>>> 1.8.3.2
>>>
>>>
>>
>

^ permalink raw reply	[flat|nested] 116+ messages in thread

* Re: [PATCH 19/23] truncate: support huge pages
  2013-08-04  2:17   ` Kirill A. Shutemov
@ 2013-08-06 21:55     ` Dave Hansen
  -1 siblings, 0 replies; 116+ messages in thread
From: Dave Hansen @ 2013-08-06 21:55 UTC (permalink / raw)
  To: Kirill A. Shutemov
  Cc: Andrea Arcangeli, Andrew Morton, Al Viro, Hugh Dickins,
	Wu Fengguang, Jan Kara, Mel Gorman, linux-mm, Andi Kleen,
	Matthew Wilcox, Kirill A. Shutemov, Hillf Danton, Ning Qu,
	linux-fsdevel, linux-kernel

On 08/03/2013 07:17 PM, Kirill A. Shutemov wrote:
> If a huge page is only partly in the range we zero out the part,
> exactly like we do for partial small pages.

What's the logic behind this behaviour?  Seems like the kind of place
that we would really want to be splitting pages.

> +	if (partial_thp_start || lstart & ~PAGE_CACHE_MASK) {
> +		pgoff_t off;
> +		struct page *page;
> +		unsigned pstart, pend;
> +		void (*zero_segment)(struct page *page,
> +				unsigned start, unsigned len);
> +retry_partial_start:
> +		if (partial_thp_start) {
> +			zero_segment = zero_huge_user_segment;

That's a pretty hackish way to conditionally call a function, especially
since it's done twice in one function. :)

I seem to recall zero_user_segment() vs. zero_huge_user_segment() being
something that caused some ugliness in the previous versions too.
What's the barrier to just having a smart zero_..._user_segment()
function that can conditionally perform huge or base page-zeroing?

> +		if (partial_thp_end) {
> +			zero_segment = zero_huge_user_segment;
> +			off = end & ~HPAGE_CACHE_INDEX_MASK;
> +			pend = (lend - 1) & ~HPAGE_PMD_MASK;
> +		} else {
> +			zero_segment = zero_user_segment;
> +			off = end;
> +			pend = (lend - 1) & ~PAGE_CACHE_MASK;
> +		}

We went through a similar exercise for the fault code (I think), but I
really think you need to refactor this.  Way too much of the code is in
the style:

	if (thp) {
		// new behavior
	} else {
		// old behavior
	}

To me, that's just a recipe that makes it hard to review, and I also bet
it'll make the thp code much more prone to bitrot.  Maybe something like this?

	size_t page_cache_mask = PAGE_CACHE_MASK;
	unsigned long end_mask = 0UL;
	
	if (partial_thp_end) {
		page_cache_mask = HPAGE_PMD_MASK;
		end_mask = HPAGE_CACHE_INDEX_MASK;
	}
	...
	magic_zero_user_segment(...);
	off = end & ~end_mask;
	pend = (lend - 1) & ~page_cache_mask;

Like I said before, I somehow like to rewrite your code. :)
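
A sketch of the unified helper being asked about: zero_pagecache_segment()
is a made-up name, and zero_huge_user_segment() is assumed to take the same
arguments as zero_user_segment(), matching the zero_segment function pointer
in the quoted code.

/*
 * Hypothetical wrapper, not from the series: pick huge vs. base page
 * zeroing internally so callers never need the function-pointer dance.
 */
static void zero_pagecache_segment(struct page *page,
				   unsigned start, unsigned end)
{
	if (PageTransHugeCache(page))
		zero_huge_user_segment(page, start, end);
	else
		zero_user_segment(page, start, end);
}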

^ permalink raw reply	[flat|nested] 116+ messages in thread

* Re: [PATCH 01/23] radix-tree: implement preload for multiple contiguous elements
  2013-08-05 11:17     ` Jan Kara
  (?)
@ 2013-08-07 16:32       ` Kirill A. Shutemov
  -1 siblings, 0 replies; 116+ messages in thread
From: Kirill A. Shutemov @ 2013-08-07 16:32 UTC (permalink / raw)
  To: Jan Kara
  Cc: Kirill A. Shutemov, Andrea Arcangeli, Andrew Morton, Al Viro,
	Hugh Dickins, Wu Fengguang, Jan Kara, Mel Gorman, linux-mm,
	Andi Kleen, Matthew Wilcox, Kirill A. Shutemov, Hillf Danton,
	Dave Hansen, Ning Qu, linux-fsdevel, linux-kernel

Jan Kara wrote:
> On Sun 04-08-13 05:17:03, Kirill A. Shutemov wrote:
> > From: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
> > 
> > The radix tree is variable-height, so an insert operation not only has
> > to build the branch to its corresponding item, it also has to build the
> > branch to existing items if the size has to be increased (by
> > radix_tree_extend).
> > 
> > The worst case is a zero height tree with just a single item at index 0,
> > and then inserting an item at index ULONG_MAX. This requires 2 new branches
> > of RADIX_TREE_MAX_PATH size to be created, with only the root node shared.
> > 
> > Radix tree is usually protected by spin lock. It means we want to
> > pre-allocate required memory before taking the lock.
> > 
> > Currently radix_tree_preload() only guarantees enough nodes to insert
> > one element. It's a hard limit. For transparent huge page cache we want
> > to insert HPAGE_PMD_NR (512 on x86-64) entries to address_space at once.
> > 
> > This patch introduces radix_tree_preload_count(). It allows to
> > preallocate nodes enough to insert a number of *contiguous* elements.
> > The feature costs about 5KiB per-CPU, details below.
> > 
> > Worst case for adding N contiguous items is adding entries at indexes
> > (ULONG_MAX - N) to ULONG_MAX. It requires nodes to insert single worst-case
> > item plus extra nodes if you cross the boundary from one node to the next.
> > 
> > Preload uses per-CPU array to store nodes. The total cost of preload is
> > "array size" * sizeof(void*) * NR_CPUS. We want to increase array size
> > to be able to handle 512 entries at once.
> > 
> > Size of array depends on system bitness and on RADIX_TREE_MAP_SHIFT.
> > 
> > We have three possible RADIX_TREE_MAP_SHIFT:
> > 
> >  #ifdef __KERNEL__
> >  #define RADIX_TREE_MAP_SHIFT	(CONFIG_BASE_SMALL ? 4 : 6)
> >  #else
> >  #define RADIX_TREE_MAP_SHIFT	3	/* For more stressful testing */
> >  #endif
> > 
> > On 64-bit system:
> > For RADIX_TREE_MAP_SHIFT=3, old array size is 43, new is 107.
> > For RADIX_TREE_MAP_SHIFT=4, old array size is 31, new is 63.
> > For RADIX_TREE_MAP_SHIFT=6, old array size is 21, new is 30.
> > 
> > On 32-bit system:
> > For RADIX_TREE_MAP_SHIFT=3, old array size is 21, new is 84.
> > For RADIX_TREE_MAP_SHIFT=4, old array size is 15, new is 46.
> > For RADIX_TREE_MAP_SHIFT=6, old array size is 11, new is 19.
> > 
> > On most machines we will have RADIX_TREE_MAP_SHIFT=6. In this case,
> > on 64-bit system the per-CPU feature overhead is
> >  for preload array:
> >    (30 - 21) * sizeof(void*) = 72 bytes
> >  plus, if the preload array is full
> >    (30 - 21) * sizeof(struct radix_tree_node) = 9 * 560 = 5040 bytes
> >  total: 5112 bytes
> > 
> > on 32-bit system the per-CPU feature overhead is
> >  for preload array:
> >    (19 - 11) * sizeof(void*) = 32 bytes
> >  plus, if the preload array is full
> >    (19 - 11) * sizeof(struct radix_tree_node) = 8 * 296 = 2368 bytes
> >  total: 2400 bytes
> > 
> > Since only THP uses batched preload at the moment, we disable (set max
> > preload to 1) it if !CONFIG_TRANSPARENT_HUGEPAGE_PAGECACHE. This can be
> > changed in the future.
> > 
> > Signed-off-by: Matthew Wilcox <willy@linux.intel.com>
> > Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
> > Acked-by: Dave Hansen <dave.hansen@linux.intel.com>
> > ---
> >  include/linux/radix-tree.h | 11 +++++++++++
> >  lib/radix-tree.c           | 41 ++++++++++++++++++++++++++++++++---------
> >  2 files changed, 43 insertions(+), 9 deletions(-)
> ...
> > diff --git a/lib/radix-tree.c b/lib/radix-tree.c
> > index 7811ed3..99ab73c 100644
> > --- a/lib/radix-tree.c
> > +++ b/lib/radix-tree.c
> > @@ -82,16 +82,24 @@ static struct kmem_cache *radix_tree_node_cachep;
> >   * The worst case is a zero height tree with just a single item at index 0,
> >   * and then inserting an item at index ULONG_MAX. This requires 2 new branches
> >   * of RADIX_TREE_MAX_PATH size to be created, with only the root node shared.
> > + *
> > + * Worst case for adding N contiguous items is adding entries at indexes
> > + * (ULONG_MAX - N) to ULONG_MAX. It requires nodes to insert single worst-case
> > + * item plus extra nodes if you cross the boundary from one node to the next.
> > + *
> >   * Hence:
> >   */
> > -#define RADIX_TREE_PRELOAD_SIZE (RADIX_TREE_MAX_PATH * 2 - 1)
> > +#define RADIX_TREE_PRELOAD_MIN (RADIX_TREE_MAX_PATH * 2 - 1)
> > +#define RADIX_TREE_PRELOAD_MAX \
> > +	(RADIX_TREE_PRELOAD_MIN + \
> > +	 DIV_ROUND_UP(RADIX_TREE_PRELOAD_NR - 1, RADIX_TREE_MAP_SIZE))
>   Umm, is this really correct? I see two problems:
> 1) You may need internal tree nodes at various levels but you seem to
> account only for the level 1.
> 2) The rounding doesn't seem right because RADIX_TREE_MAP_SIZE+2 nodes may
> require 3 nodes at level 1 if the indexes are like:
> i_0 | i_1 .. i_{RADIX_TREE_MAP_SIZE} | i_{RADIX_TREE_MAP_SIZE+1}
>     ^                                ^
>     node boundary                    node boundary

My bad. Let's try to calculate once again.

We want to insert N contiguous items without restriction on alignment.

Let's limit N <= 1UL << (2 * RADIX_TREE_MAP_SHIFT); without
CONFIG_BASE_SMALL that's 4096. It will simplify the calculation a bit.

The worst-case scenario I can imagine is a tree with only one element at
index 0, to which we add N items where at least one index requires the
maximum tree height and we cross a boundary between items in the root node.

Basically, at least one index is less than

1UL << ((RADIX_TREE_MAX_PATH - 1) * RADIX_TREE_MAP_SHIFT)

and one is equal or greater.

In this case we need:

- RADIX_TREE_MAX_PATH nodes to build a new path to the item with index 0;
- DIV_ROUND_UP(N, RADIX_TREE_MAP_SIZE) last-level nodes for the new items;
- 2 * (RADIX_TREE_MAX_PATH - 2) nodes to build the paths to the last-level
  nodes (-2 because the root node and the last-level nodes are already
  accounted for).

The macro:

#define RADIX_TREE_PRELOAD_MAX \
	( RADIX_TREE_MAX_PATH + \
	  DIV_ROUND_UP(RADIX_TREE_PRELOAD_NR, RADIX_TREE_MAP_SIZE) + \
	  2 * (RADIX_TREE_MAX_PATH - 2) )

For 64-bit system and N=512, RADIX_TREE_PRELOAD_MAX is 37.

Does it sound correct? Any thoughts?

-- 
 Kirill A. Shutemov
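
For what it's worth, plugging in the usual 64-bit defaults
(RADIX_TREE_MAP_SHIFT = 6, so RADIX_TREE_MAP_SIZE = 64 and
RADIX_TREE_MAX_PATH = DIV_ROUND_UP(64, 6) = 11) reproduces the 37 above:

/*
 * RADIX_TREE_PRELOAD_MAX for N = RADIX_TREE_PRELOAD_NR = 512 on 64-bit:
 *
 *   RADIX_TREE_MAX_PATH                    = 11
 *   DIV_ROUND_UP(512, 64)                  =  8
 *   2 * (RADIX_TREE_MAX_PATH - 2) = 2 * 9  = 18
 *                                    total = 37
 */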

^ permalink raw reply	[flat|nested] 116+ messages in thread

* Re: [PATCH 01/23] radix-tree: implement preload for multiple contiguous elements
@ 2013-08-07 16:32       ` Kirill A. Shutemov
  0 siblings, 0 replies; 116+ messages in thread
From: Kirill A. Shutemov @ 2013-08-07 16:32 UTC (permalink / raw)
  To: Jan Kara
  Cc: Kirill A. Shutemov, Andrea Arcangeli, Andrew Morton, Al Viro,
	Hugh Dickins, Wu Fengguang, Jan Kara, Mel Gorman, linux-mm,
	Andi Kleen, Matthew Wilcox, Kirill A. Shutemov, Hillf Danton,
	Dave Hansen, Ning Qu, linux-fsdevel, linux-kernel

Jan Kara wrote:
> On Sun 04-08-13 05:17:03, Kirill A. Shutemov wrote:
> > From: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
> > 
> > The radix tree is variable-height, so an insert operation not only has
> > to build the branch to its corresponding item, it also has to build the
> > branch to existing items if the size has to be increased (by
> > radix_tree_extend).
> > 
> > The worst case is a zero height tree with just a single item at index 0,
> > and then inserting an item at index ULONG_MAX. This requires 2 new branches
> > of RADIX_TREE_MAX_PATH size to be created, with only the root node shared.
> > 
> > Radix tree is usually protected by spin lock. It means we want to
> > pre-allocate required memory before taking the lock.
> > 
> > Currently radix_tree_preload() only guarantees enough nodes to insert
> > one element. It's a hard limit. For transparent huge page cache we want
> > to insert HPAGE_PMD_NR (512 on x86-64) entries to address_space at once.
> > 
> > This patch introduces radix_tree_preload_count(). It allows to
> > preallocate nodes enough to insert a number of *contiguous* elements.
> > The feature costs about 5KiB per-CPU, details below.
> > 
> > Worst case for adding N contiguous items is adding entries at indexes
> > (ULONG_MAX - N) to ULONG_MAX. It requires nodes to insert single worst-case
> > item plus extra nodes if you cross the boundary from one node to the next.
> > 
> > Preload uses per-CPU array to store nodes. The total cost of preload is
> > "array size" * sizeof(void*) * NR_CPUS. We want to increase array size
> > to be able to handle 512 entries at once.
> > 
> > Size of array depends on system bitness and on RADIX_TREE_MAP_SHIFT.
> > 
> > We have three possible RADIX_TREE_MAP_SHIFT:
> > 
> >  #ifdef __KERNEL__
> >  #define RADIX_TREE_MAP_SHIFT	(CONFIG_BASE_SMALL ? 4 : 6)
> >  #else
> >  #define RADIX_TREE_MAP_SHIFT	3	/* For more stressful testing */
> >  #endif
> > 
> > On 64-bit system:
> > For RADIX_TREE_MAP_SHIFT=3, old array size is 43, new is 107.
> > For RADIX_TREE_MAP_SHIFT=4, old array size is 31, new is 63.
> > For RADIX_TREE_MAP_SHIFT=6, old array size is 21, new is 30.
> > 
> > On 32-bit system:
> > For RADIX_TREE_MAP_SHIFT=3, old array size is 21, new is 84.
> > For RADIX_TREE_MAP_SHIFT=4, old array size is 15, new is 46.
> > For RADIX_TREE_MAP_SHIFT=6, old array size is 11, new is 19.
> > 
> > On most machines we will have RADIX_TREE_MAP_SHIFT=6. In this case,
> > on 64-bit system the per-CPU feature overhead is
> >  for preload array:
> >    (30 - 21) * sizeof(void*) = 72 bytes
> >  plus, if the preload array is full
> >    (30 - 21) * sizeof(struct radix_tree_node) = 9 * 560 = 5040 bytes
> >  total: 5112 bytes
> > 
> > on 32-bit system the per-CPU feature overhead is
> >  for preload array:
> >    (19 - 11) * sizeof(void*) = 32 bytes
> >  plus, if the preload array is full
> >    (19 - 11) * sizeof(struct radix_tree_node) = 8 * 296 = 2368 bytes
> >  total: 2400 bytes
> > 
> > Since only THP uses batched preload at the moment, we disable (set max
> > preload to 1) it if !CONFIG_TRANSPARENT_HUGEPAGE_PAGECACHE. This can be
> > changed in the future.
> > 
> > Signed-off-by: Matthew Wilcox <willy@linux.intel.com>
> > Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
> > Acked-by: Dave Hansen <dave.hansen@linux.intel.com>
> > ---
> >  include/linux/radix-tree.h | 11 +++++++++++
> >  lib/radix-tree.c           | 41 ++++++++++++++++++++++++++++++++---------
> >  2 files changed, 43 insertions(+), 9 deletions(-)
> ...
> > diff --git a/lib/radix-tree.c b/lib/radix-tree.c
> > index 7811ed3..99ab73c 100644
> > --- a/lib/radix-tree.c
> > +++ b/lib/radix-tree.c
> > @@ -82,16 +82,24 @@ static struct kmem_cache *radix_tree_node_cachep;
> >   * The worst case is a zero height tree with just a single item at index 0,
> >   * and then inserting an item at index ULONG_MAX. This requires 2 new branches
> >   * of RADIX_TREE_MAX_PATH size to be created, with only the root node shared.
> > + *
> > + * Worst case for adding N contiguous items is adding entries at indexes
> > + * (ULONG_MAX - N) to ULONG_MAX. It requires nodes to insert single worst-case
> > + * item plus extra nodes if you cross the boundary from one node to the next.
> > + *
> >   * Hence:
> >   */
> > -#define RADIX_TREE_PRELOAD_SIZE (RADIX_TREE_MAX_PATH * 2 - 1)
> > +#define RADIX_TREE_PRELOAD_MIN (RADIX_TREE_MAX_PATH * 2 - 1)
> > +#define RADIX_TREE_PRELOAD_MAX \
> > +	(RADIX_TREE_PRELOAD_MIN + \
> > +	 DIV_ROUND_UP(RADIX_TREE_PRELOAD_NR - 1, RADIX_TREE_MAP_SIZE))
>   Umm, is this really correct? I see two problems:
> 1) You may need internal tree nodes at various levels but you seem to
> account only for the level 1.
> 2) The rounding doesn't seem right because RADIX_TREE_MAP_SIZE+2 nodes may
> require 3 nodes at level 1 if the indexes are like:
> i_0 | i_1 .. i_{RADIX_TREE_MAP_SIZE} | i_{RADIX_TREE_MAP_SIZE+1}
>     ^                                ^
>     node boundary                    node boundary

My bad. Let's try to calculate once again.

We want to insert N contiguous items without restriction on alignment.

Let's limit N <= 1UL << (2 * RADIX_TREE_MAP_SHIFT), without
CONFIG_BASE_SMALL it's 4096. It will simplify calculation a bit.

Worst case scenario, I can imagine, is tree with only one element at index
0 and we add N items where at least one index requires max tree high and
we cross boundary between items in root node.

Basically, at least one index is less then

1UL << ((RADIX_TREE_MAX_PATH - 1) * RADIX_TREE_MAP_SHIFT)

and one equal or more.

In this case we need:

- RADIX_TREE_MAX_PATH nodes to build new path to item with index 0;
- DIV_ROUND_UP(N, RADIX_TREE_MAP_SIZE) nodes for last level nodes for new
  items;
- 2 * (RADIX_TREE_MAX_PATH - 2) nodes to build path to last level nodes.
  (-2, because root node and last level nodes are already accounted).

The macro:

#define RADIX_TREE_PRELOAD_MAX \
	( RADIX_TREE_MAX_PATH + \
	  DIV_ROUND_UP(RADIX_TREE_PRELOAD_NR, RADIX_TREE_MAP_SIZE) + \
	  2 * (RADIX_TREE_MAX_PATH - 2) )

For 64-bit system and N=512, RADIX_TREE_PRELOAD_MAX is 37.

Does it sound correct? Any thoughts?

-- 
 Kirill A. Shutemov

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 116+ messages in thread

* Re: [PATCH 01/23] radix-tree: implement preload for multiple contiguous elements
@ 2013-08-07 16:32       ` Kirill A. Shutemov
  0 siblings, 0 replies; 116+ messages in thread
From: Kirill A. Shutemov @ 2013-08-07 16:32 UTC (permalink / raw)
  To: Jan Kara
  Cc: Kirill A. Shutemov, Andrea Arcangeli, Andrew Morton, Al Viro,
	Hugh Dickins, Wu Fengguang, Mel Gorman, linux-mm, Andi Kleen,
	Matthew Wilcox, Kirill A. Shutemov, Hillf Danton, Dave Hansen,
	Ning Qu, linux-fsdevel, linux-kernel

Jan Kara wrote:
> On Sun 04-08-13 05:17:03, Kirill A. Shutemov wrote:
> > From: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
> > 
> > The radix tree is variable-height, so an insert operation not only has
> > to build the branch to its corresponding item, it also has to build the
> > branch to existing items if the size has to be increased (by
> > radix_tree_extend).
> > 
> > The worst case is a zero height tree with just a single item at index 0,
> > and then inserting an item at index ULONG_MAX. This requires 2 new branches
> > of RADIX_TREE_MAX_PATH size to be created, with only the root node shared.
> > 
> > Radix tree is usually protected by spin lock. It means we want to
> > pre-allocate required memory before taking the lock.
> > 
> > Currently radix_tree_preload() only guarantees enough nodes to insert
> > one element. It's a hard limit. For transparent huge page cache we want
> > to insert HPAGE_PMD_NR (512 on x86-64) entries to address_space at once.
> > 
> > This patch introduces radix_tree_preload_count(). It allows to
> > preallocate nodes enough to insert a number of *contiguous* elements.
> > The feature costs about 5KiB per-CPU, details below.
> > 
> > Worst case for adding N contiguous items is adding entries at indexes
> > (ULONG_MAX - N) to ULONG_MAX. It requires nodes to insert single worst-case
> > item plus extra nodes if you cross the boundary from one node to the next.
> > 
> > Preload uses per-CPU array to store nodes. The total cost of preload is
> > "array size" * sizeof(void*) * NR_CPUS. We want to increase array size
> > to be able to handle 512 entries at once.
> > 
> > Size of array depends on system bitness and on RADIX_TREE_MAP_SHIFT.
> > 
> > We have three possible RADIX_TREE_MAP_SHIFT:
> > 
> >  #ifdef __KERNEL__
> >  #define RADIX_TREE_MAP_SHIFT	(CONFIG_BASE_SMALL ? 4 : 6)
> >  #else
> >  #define RADIX_TREE_MAP_SHIFT	3	/* For more stressful testing */
> >  #endif
> > 
> > On 64-bit system:
> > For RADIX_TREE_MAP_SHIFT=3, old array size is 43, new is 107.
> > For RADIX_TREE_MAP_SHIFT=4, old array size is 31, new is 63.
> > For RADIX_TREE_MAP_SHIFT=6, old array size is 21, new is 30.
> > 
> > On 32-bit system:
> > For RADIX_TREE_MAP_SHIFT=3, old array size is 21, new is 84.
> > For RADIX_TREE_MAP_SHIFT=4, old array size is 15, new is 46.
> > For RADIX_TREE_MAP_SHIFT=6, old array size is 11, new is 19.
> > 
> > On most machines we will have RADIX_TREE_MAP_SHIFT=6. In this case,
> > on 64-bit system the per-CPU feature overhead is
> >  for preload array:
> >    (30 - 21) * sizeof(void*) = 72 bytes
> >  plus, if the preload array is full
> >    (30 - 21) * sizeof(struct radix_tree_node) = 9 * 560 = 5040 bytes
> >  total: 5112 bytes
> > 
> > on 32-bit system the per-CPU feature overhead is
> >  for preload array:
> >    (19 - 11) * sizeof(void*) = 32 bytes
> >  plus, if the preload array is full
> >    (19 - 11) * sizeof(struct radix_tree_node) = 8 * 296 = 2368 bytes
> >  total: 2400 bytes
> > 
> > Since only THP uses batched preload at the moment, we disable (set max
> > preload to 1) it if !CONFIG_TRANSPARENT_HUGEPAGE_PAGECACHE. This can be
> > changed in the future.
> > 
> > Signed-off-by: Matthew Wilcox <willy@linux.intel.com>
> > Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
> > Acked-by: Dave Hansen <dave.hansen@linux.intel.com>
> > ---
> >  include/linux/radix-tree.h | 11 +++++++++++
> >  lib/radix-tree.c           | 41 ++++++++++++++++++++++++++++++++---------
> >  2 files changed, 43 insertions(+), 9 deletions(-)
> ...
> > diff --git a/lib/radix-tree.c b/lib/radix-tree.c
> > index 7811ed3..99ab73c 100644
> > --- a/lib/radix-tree.c
> > +++ b/lib/radix-tree.c
> > @@ -82,16 +82,24 @@ static struct kmem_cache *radix_tree_node_cachep;
> >   * The worst case is a zero height tree with just a single item at index 0,
> >   * and then inserting an item at index ULONG_MAX. This requires 2 new branches
> >   * of RADIX_TREE_MAX_PATH size to be created, with only the root node shared.
> > + *
> > + * Worst case for adding N contiguous items is adding entries at indexes
> > + * (ULONG_MAX - N) to ULONG_MAX. It requires nodes to insert single worst-case
> > + * item plus extra nodes if you cross the boundary from one node to the next.
> > + *
> >   * Hence:
> >   */
> > -#define RADIX_TREE_PRELOAD_SIZE (RADIX_TREE_MAX_PATH * 2 - 1)
> > +#define RADIX_TREE_PRELOAD_MIN (RADIX_TREE_MAX_PATH * 2 - 1)
> > +#define RADIX_TREE_PRELOAD_MAX \
> > +	(RADIX_TREE_PRELOAD_MIN + \
> > +	 DIV_ROUND_UP(RADIX_TREE_PRELOAD_NR - 1, RADIX_TREE_MAP_SIZE))
>   Umm, is this really correct? I see two problems:
> 1) You may need internal tree nodes at various levels but you seem to
> account only for the level 1.
> 2) The rounding doesn't seem right because RADIX_TREE_MAP_SIZE+2 nodes may
> require 3 nodes at level 1 if the indexes are like:
> i_0 | i_1 .. i_{RADIX_TREE_MAP_SIZE} | i_{RADIX_TREE_MAP_SIZE+1}
>     ^                                ^
>     node boundary                    node boundary

My bad. Let's try to calculate once again.

We want to insert N contiguous items without any restriction on alignment.

Let's limit N <= 1UL << (2 * RADIX_TREE_MAP_SHIFT); without
CONFIG_BASE_SMALL that's 4096. It simplifies the calculation a bit.

The worst case scenario I can imagine is a tree with only one element, at
index 0, to which we add N items where at least one index requires the
maximum tree height and we cross a slot boundary in the root node.

Basically, at least one index is less than

1UL << ((RADIX_TREE_MAX_PATH - 1) * RADIX_TREE_MAP_SHIFT)

and at least one is equal or greater.

In this case we need:

- RADIX_TREE_MAX_PATH nodes to build the new path to the item at index 0;
- DIV_ROUND_UP(N, RADIX_TREE_MAP_SIZE) last-level nodes for the new items;
- 2 * (RADIX_TREE_MAX_PATH - 2) nodes to build the paths down to those
  last-level nodes (-2 because the root node and the last-level nodes are
  already accounted for).

The macro:

#define RADIX_TREE_PRELOAD_MAX \
	( RADIX_TREE_MAX_PATH + \
	  DIV_ROUND_UP(RADIX_TREE_PRELOAD_NR, RADIX_TREE_MAP_SIZE) + \
	  2 * (RADIX_TREE_MAX_PATH - 2) )

For a 64-bit system and N=512, RADIX_TREE_PRELOAD_MAX is 37.

Does it sound correct? Any thoughts?
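
For the record, a quick userspace sketch of the arithmetic (not a kernel
patch; the lower-case names are local stand-ins for the kernel macros, and
it assumes RADIX_TREE_MAX_PATH == DIV_ROUND_UP(BITS_PER_LONG,
RADIX_TREE_MAP_SHIFT)). On 64-bit with RADIX_TREE_MAP_SHIFT=6 and N=512 it
reproduces the 37:

#include <stdio.h>

#define DIV_ROUND_UP(n, d)	(((n) + (d) - 1) / (d))

int main(void)
{
	unsigned long map_shift = 6;			/* !CONFIG_BASE_SMALL */
	unsigned long map_size = 1UL << map_shift;	/* 64 slots per node */
	unsigned long max_path = DIV_ROUND_UP(64, map_shift);	/* 11 on 64-bit */
	unsigned long nr = 512;				/* HPAGE_PMD_NR */
	unsigned long preload_max;

	preload_max = max_path +			/* path to the item at index 0 */
		DIV_ROUND_UP(nr, map_size) +		/* last-level nodes for new items */
		2 * (max_path - 2);			/* paths down to those nodes */

	/* prints: max_path=11 preload_max=37 */
	printf("max_path=%lu preload_max=%lu\n", max_path, preload_max);
	return 0;
}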

-- 
 Kirill A. Shutemov


^ permalink raw reply	[flat|nested] 116+ messages in thread

* Re: [PATCH 01/23] radix-tree: implement preload for multiple contiguous elements
  2013-08-07 16:32       ` Kirill A. Shutemov
@ 2013-08-07 20:00         ` Jan Kara
  -1 siblings, 0 replies; 116+ messages in thread
From: Jan Kara @ 2013-08-07 20:00 UTC (permalink / raw)
  To: Kirill A. Shutemov
  Cc: Jan Kara, Andrea Arcangeli, Andrew Morton, Al Viro, Hugh Dickins,
	Wu Fengguang, Mel Gorman, linux-mm, Andi Kleen, Matthew Wilcox,
	Kirill A. Shutemov, Hillf Danton, Dave Hansen, Ning Qu,
	linux-fsdevel, linux-kernel

On Wed 07-08-13 19:32:36, Kirill A. Shutemov wrote:
> Jan Kara wrote:
> > On Sun 04-08-13 05:17:03, Kirill A. Shutemov wrote:
> > > From: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
> > > 
> > > The radix tree is variable-height, so an insert operation not only has
> > > to build the branch to its corresponding item, it also has to build the
> > > branch to existing items if the size has to be increased (by
> > > radix_tree_extend).
> > > 
> > > The worst case is a zero height tree with just a single item at index 0,
> > > and then inserting an item at index ULONG_MAX. This requires 2 new branches
> > > of RADIX_TREE_MAX_PATH size to be created, with only the root node shared.
> > > 
> > > Radix tree is usually protected by spin lock. It means we want to
> > > pre-allocate required memory before taking the lock.
> > > 
> > > Currently radix_tree_preload() only guarantees enough nodes to insert
> > > one element. It's a hard limit. For transparent huge page cache we want
> > > to insert HPAGE_PMD_NR (512 on x86-64) entries to address_space at once.
> > > 
> > > This patch introduces radix_tree_preload_count(). It allows to
> > > preallocate nodes enough to insert a number of *contiguous* elements.
> > > The feature costs about 5KiB per-CPU, details below.
> > > 
> > > Worst case for adding N contiguous items is adding entries at indexes
> > > (ULONG_MAX - N) to ULONG_MAX. It requires nodes to insert single worst-case
> > > item plus extra nodes if you cross the boundary from one node to the next.
> > > 
> > > Preload uses per-CPU array to store nodes. The total cost of preload is
> > > "array size" * sizeof(void*) * NR_CPUS. We want to increase array size
> > > to be able to handle 512 entries at once.
> > > 
> > > Size of array depends on system bitness and on RADIX_TREE_MAP_SHIFT.
> > > 
> > > We have three possible RADIX_TREE_MAP_SHIFT:
> > > 
> > >  #ifdef __KERNEL__
> > >  #define RADIX_TREE_MAP_SHIFT	(CONFIG_BASE_SMALL ? 4 : 6)
> > >  #else
> > >  #define RADIX_TREE_MAP_SHIFT	3	/* For more stressful testing */
> > >  #endif
> > > 
> > > On 64-bit system:
> > > For RADIX_TREE_MAP_SHIFT=3, old array size is 43, new is 107.
> > > For RADIX_TREE_MAP_SHIFT=4, old array size is 31, new is 63.
> > > For RADIX_TREE_MAP_SHIFT=6, old array size is 21, new is 30.
> > > 
> > > On 32-bit system:
> > > For RADIX_TREE_MAP_SHIFT=3, old array size is 21, new is 84.
> > > For RADIX_TREE_MAP_SHIFT=4, old array size is 15, new is 46.
> > > For RADIX_TREE_MAP_SHIFT=6, old array size is 11, new is 19.
> > > 
> > > On most machines we will have RADIX_TREE_MAP_SHIFT=6. In this case,
> > > on 64-bit system the per-CPU feature overhead is
> > >  for preload array:
> > >    (30 - 21) * sizeof(void*) = 72 bytes
> > >  plus, if the preload array is full
> > >    (30 - 21) * sizeof(struct radix_tree_node) = 9 * 560 = 5040 bytes
> > >  total: 5112 bytes
> > > 
> > > on 32-bit system the per-CPU feature overhead is
> > >  for preload array:
> > >    (19 - 11) * sizeof(void*) = 32 bytes
> > >  plus, if the preload array is full
> > >    (19 - 11) * sizeof(struct radix_tree_node) = 8 * 296 = 2368 bytes
> > >  total: 2400 bytes
> > > 
> > > Since only THP uses batched preload at the moment, we disable (set max
> > > preload to 1) it if !CONFIG_TRANSPARENT_HUGEPAGE_PAGECACHE. This can be
> > > changed in the future.
> > > 
> > > Signed-off-by: Matthew Wilcox <willy@linux.intel.com>
> > > Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
> > > Acked-by: Dave Hansen <dave.hansen@linux.intel.com>
> > > ---
> > >  include/linux/radix-tree.h | 11 +++++++++++
> > >  lib/radix-tree.c           | 41 ++++++++++++++++++++++++++++++++---------
> > >  2 files changed, 43 insertions(+), 9 deletions(-)
> > ...
> > > diff --git a/lib/radix-tree.c b/lib/radix-tree.c
> > > index 7811ed3..99ab73c 100644
> > > --- a/lib/radix-tree.c
> > > +++ b/lib/radix-tree.c
> > > @@ -82,16 +82,24 @@ static struct kmem_cache *radix_tree_node_cachep;
> > >   * The worst case is a zero height tree with just a single item at index 0,
> > >   * and then inserting an item at index ULONG_MAX. This requires 2 new branches
> > >   * of RADIX_TREE_MAX_PATH size to be created, with only the root node shared.
> > > + *
> > > + * Worst case for adding N contiguous items is adding entries at indexes
> > > + * (ULONG_MAX - N) to ULONG_MAX. It requires nodes to insert single worst-case
> > > + * item plus extra nodes if you cross the boundary from one node to the next.
> > > + *
> > >   * Hence:
> > >   */
> > > -#define RADIX_TREE_PRELOAD_SIZE (RADIX_TREE_MAX_PATH * 2 - 1)
> > > +#define RADIX_TREE_PRELOAD_MIN (RADIX_TREE_MAX_PATH * 2 - 1)
> > > +#define RADIX_TREE_PRELOAD_MAX \
> > > +	(RADIX_TREE_PRELOAD_MIN + \
> > > +	 DIV_ROUND_UP(RADIX_TREE_PRELOAD_NR - 1, RADIX_TREE_MAP_SIZE))
> >   Umm, is this really correct? I see two problems:
> > 1) You may need internal tree nodes at various levels but you seem to
> > account only for the level 1.
> > 2) The rounding doesn't seem right because RADIX_TREE_MAP_SIZE+2 nodes may
> > require 3 nodes at level 1 if the indexes are like:
> > i_0 | i_1 .. i_{RADIX_TREE_MAP_SIZE} | i_{RADIX_TREE_MAP_SIZE+1}
> >     ^                                ^
> >     node boundary                    node boundary
> 
> My bad. Let's try to calculate once again.
> 
> We want to insert N contiguous items without restriction on alignment.
> 
> Let's limit N <= 1UL << (2 * RADIX_TREE_MAP_SHIFT), without
> CONFIG_BASE_SMALL it's 4096. It will simplify calculation a bit.
> 
> Worst case scenario, I can imagine, is tree with only one element at index
> 0 and we add N items where at least one index requires max tree high and
> we cross boundary between items in root node.
> 
> Basically, at least one index is less then
> 
> 1UL << ((RADIX_TREE_MAX_PATH - 1) * RADIX_TREE_MAP_SHIFT)
> 
> and one equal or more.
> 
> In this case we need:
> 
> - RADIX_TREE_MAX_PATH nodes to build new path to item with index 0;
> - DIV_ROUND_UP(N, RADIX_TREE_MAP_SIZE) nodes for last level nodes for new
>   items;
  Here, I think you need to count with
DIV_ROUND_UP(N + RADIX_TREE_MAP_SIZE - 1, RADIX_TREE_MAP_SIZE) to properly
account for situation 2) I described (concrete numbers below).

> - 2 * (RADIX_TREE_MAX_PATH - 2) nodes to build path to last level nodes.
>   (-2, because root node and last level nodes are already accounted).
> 
> The macro:
> 
> #define RADIX_TREE_PRELOAD_MAX \
> 	( RADIX_TREE_MAX_PATH + \
> 	  DIV_ROUND_UP(RADIX_TREE_PRELOAD_NR, RADIX_TREE_MAP_SIZE) + \
> 	  2 * (RADIX_TREE_MAX_PATH - 2) )
> 
> For 64-bit system and N=512, RADIX_TREE_PRELOAD_MAX is 37.
> 
> Does it sound correct? Any thoughts?
  Otherwise I agree with your math (given the limitation on N). But please
add a comment explaining how we arrived at the math in
RADIX_TREE_PRELOAD_MAX. Thanks!
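
To make 2) concrete (just an illustration with small numbers): with
RADIX_TREE_MAP_SIZE = 64, N = 66 items at indexes 63..128 land in three
last-level nodes (covering 0..63, 64..127 and 128..191), while
DIV_ROUND_UP(66, 64) = 2 only accounts for two of them;
DIV_ROUND_UP(66 + 64 - 1, 64) = 3 does cover it.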

Another option I could live with is the one Matthew described - rely on the
inserted range being aligned, but then add a comment about it and an
assertion checking it.

								Honza
-- 
Jan Kara <jack@suse.cz>
SUSE Labs, CR

^ permalink raw reply	[flat|nested] 116+ messages in thread

* Re: [PATCH 01/23] radix-tree: implement preload for multiple contiguous elements
  2013-08-07 20:00         ` Jan Kara
@ 2013-08-07 20:24           ` Kirill A. Shutemov
  -1 siblings, 0 replies; 116+ messages in thread
From: Kirill A. Shutemov @ 2013-08-07 20:24 UTC (permalink / raw)
  To: Jan Kara
  Cc: Kirill A. Shutemov, Jan Kara, Andrea Arcangeli, Andrew Morton,
	Al Viro, Hugh Dickins, Wu Fengguang, Mel Gorman, linux-mm,
	Andi Kleen, Matthew Wilcox, Kirill A. Shutemov, Hillf Danton,
	Dave Hansen, Ning Qu, linux-fsdevel, linux-kernel

Jan Kara wrote:
> On Wed 07-08-13 19:32:36, Kirill A. Shutemov wrote:
> > Jan Kara wrote:
> > > On Sun 04-08-13 05:17:03, Kirill A. Shutemov wrote:
> > > > From: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
> > > > 
> > > > The radix tree is variable-height, so an insert operation not only has
> > > > to build the branch to its corresponding item, it also has to build the
> > > > branch to existing items if the size has to be increased (by
> > > > radix_tree_extend).
> > > > 
> > > > The worst case is a zero height tree with just a single item at index 0,
> > > > and then inserting an item at index ULONG_MAX. This requires 2 new branches
> > > > of RADIX_TREE_MAX_PATH size to be created, with only the root node shared.
> > > > 
> > > > Radix tree is usually protected by spin lock. It means we want to
> > > > pre-allocate required memory before taking the lock.
> > > > 
> > > > Currently radix_tree_preload() only guarantees enough nodes to insert
> > > > one element. It's a hard limit. For transparent huge page cache we want
> > > > to insert HPAGE_PMD_NR (512 on x86-64) entries to address_space at once.
> > > > 
> > > > This patch introduces radix_tree_preload_count(). It allows to
> > > > preallocate nodes enough to insert a number of *contiguous* elements.
> > > > The feature costs about 5KiB per-CPU, details below.
> > > > 
> > > > Worst case for adding N contiguous items is adding entries at indexes
> > > > (ULONG_MAX - N) to ULONG_MAX. It requires nodes to insert single worst-case
> > > > item plus extra nodes if you cross the boundary from one node to the next.
> > > > 
> > > > Preload uses per-CPU array to store nodes. The total cost of preload is
> > > > "array size" * sizeof(void*) * NR_CPUS. We want to increase array size
> > > > to be able to handle 512 entries at once.
> > > > 
> > > > Size of array depends on system bitness and on RADIX_TREE_MAP_SHIFT.
> > > > 
> > > > We have three possible RADIX_TREE_MAP_SHIFT:
> > > > 
> > > >  #ifdef __KERNEL__
> > > >  #define RADIX_TREE_MAP_SHIFT	(CONFIG_BASE_SMALL ? 4 : 6)
> > > >  #else
> > > >  #define RADIX_TREE_MAP_SHIFT	3	/* For more stressful testing */
> > > >  #endif
> > > > 
> > > > On 64-bit system:
> > > > For RADIX_TREE_MAP_SHIFT=3, old array size is 43, new is 107.
> > > > For RADIX_TREE_MAP_SHIFT=4, old array size is 31, new is 63.
> > > > For RADIX_TREE_MAP_SHIFT=6, old array size is 21, new is 30.
> > > > 
> > > > On 32-bit system:
> > > > For RADIX_TREE_MAP_SHIFT=3, old array size is 21, new is 84.
> > > > For RADIX_TREE_MAP_SHIFT=4, old array size is 15, new is 46.
> > > > For RADIX_TREE_MAP_SHIFT=6, old array size is 11, new is 19.
> > > > 
> > > > On most machines we will have RADIX_TREE_MAP_SHIFT=6. In this case,
> > > > on 64-bit system the per-CPU feature overhead is
> > > >  for preload array:
> > > >    (30 - 21) * sizeof(void*) = 72 bytes
> > > >  plus, if the preload array is full
> > > >    (30 - 21) * sizeof(struct radix_tree_node) = 9 * 560 = 5040 bytes
> > > >  total: 5112 bytes
> > > > 
> > > > on 32-bit system the per-CPU feature overhead is
> > > >  for preload array:
> > > >    (19 - 11) * sizeof(void*) = 32 bytes
> > > >  plus, if the preload array is full
> > > >    (19 - 11) * sizeof(struct radix_tree_node) = 8 * 296 = 2368 bytes
> > > >  total: 2400 bytes
> > > > 
> > > > Since only THP uses batched preload at the moment, we disable (set max
> > > > preload to 1) it if !CONFIG_TRANSPARENT_HUGEPAGE_PAGECACHE. This can be
> > > > changed in the future.
> > > > 
> > > > Signed-off-by: Matthew Wilcox <willy@linux.intel.com>
> > > > Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
> > > > Acked-by: Dave Hansen <dave.hansen@linux.intel.com>
> > > > ---
> > > >  include/linux/radix-tree.h | 11 +++++++++++
> > > >  lib/radix-tree.c           | 41 ++++++++++++++++++++++++++++++++---------
> > > >  2 files changed, 43 insertions(+), 9 deletions(-)
> > > ...
> > > > diff --git a/lib/radix-tree.c b/lib/radix-tree.c
> > > > index 7811ed3..99ab73c 100644
> > > > --- a/lib/radix-tree.c
> > > > +++ b/lib/radix-tree.c
> > > > @@ -82,16 +82,24 @@ static struct kmem_cache *radix_tree_node_cachep;
> > > >   * The worst case is a zero height tree with just a single item at index 0,
> > > >   * and then inserting an item at index ULONG_MAX. This requires 2 new branches
> > > >   * of RADIX_TREE_MAX_PATH size to be created, with only the root node shared.
> > > > + *
> > > > + * Worst case for adding N contiguous items is adding entries at indexes
> > > > + * (ULONG_MAX - N) to ULONG_MAX. It requires nodes to insert single worst-case
> > > > + * item plus extra nodes if you cross the boundary from one node to the next.
> > > > + *
> > > >   * Hence:
> > > >   */
> > > > -#define RADIX_TREE_PRELOAD_SIZE (RADIX_TREE_MAX_PATH * 2 - 1)
> > > > +#define RADIX_TREE_PRELOAD_MIN (RADIX_TREE_MAX_PATH * 2 - 1)
> > > > +#define RADIX_TREE_PRELOAD_MAX \
> > > > +	(RADIX_TREE_PRELOAD_MIN + \
> > > > +	 DIV_ROUND_UP(RADIX_TREE_PRELOAD_NR - 1, RADIX_TREE_MAP_SIZE))
> > >   Umm, is this really correct? I see two problems:
> > > 1) You may need internal tree nodes at various levels but you seem to
> > > account only for the level 1.
> > > 2) The rounding doesn't seem right because RADIX_TREE_MAP_SIZE+2 nodes may
> > > require 3 nodes at level 1 if the indexes are like:
> > > i_0 | i_1 .. i_{RADIX_TREE_MAP_SIZE} | i_{RADIX_TREE_MAP_SIZE+1}
> > >     ^                                ^
> > >     node boundary                    node boundary
> > 
> > My bad. Let's try to calculate once again.
> > 
> > We want to insert N contiguous items without restriction on alignment.
> > 
> > Let's limit N <= 1UL << (2 * RADIX_TREE_MAP_SHIFT), without
> > CONFIG_BASE_SMALL it's 4096. It will simplify calculation a bit.
> > 
> > Worst case scenario, I can imagine, is tree with only one element at index
> > 0 and we add N items where at least one index requires max tree high and
> > we cross boundary between items in root node.
> > 
> > Basically, at least one index is less then
> > 
> > 1UL << ((RADIX_TREE_MAX_PATH - 1) * RADIX_TREE_MAP_SHIFT)
> > 
> > and one equal or more.
> > 
> > In this case we need:
> > 
> > - RADIX_TREE_MAX_PATH nodes to build new path to item with index 0;
> > - DIV_ROUND_UP(N, RADIX_TREE_MAP_SIZE) nodes for last level nodes for new
> >   items;
>   Here, I think you need to count with
> DIV_ROUND_UP(N + RADIX_TREE_MAP_SIZE - 1, RADIX_TREE_MAP_SIZE) to properly
> account for situation 2) I described.

It's not obvious to me why that's needed. Since we limit N, we can cross a
second- (or higher-) level node boundary only once.

I've tried to confirm the math in my kvm (see the *ugly* patch below) and
I was not able to find a case that is not covered.

Could you demonstrate the case you are talking about?
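
To spell out the "only once" claim (just an illustration, with
RADIX_TREE_MAP_SHIFT = 6): a second-level node covers 64 * 64 = 4096
indexes, so every node boundary at level two or above sits at a multiple of
4096. A range of at most 4096 contiguous indexes therefore contains at most
one such boundary, i.e. we need at most two distinct upper-level paths down
to the last-level nodes, which is what the 2 * (RADIX_TREE_MAX_PATH - 2)
term is meant to cover.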

diff --git a/include/linux/radix-tree.h b/include/linux/radix-tree.h
index 3bf0b3e..67768b6 100644
--- a/include/linux/radix-tree.h
+++ b/include/linux/radix-tree.h
@@ -63,12 +63,14 @@ static inline int radix_tree_is_indirect_ptr(void *ptr)
 /* root tags are stored in gfp_mask, shifted by __GFP_BITS_SHIFT */
 struct radix_tree_root {
 	unsigned int		height;
+	unsigned int		nr_nodes;
 	gfp_t			gfp_mask;
 	struct radix_tree_node	__rcu *rnode;
 };
 
 #define RADIX_TREE_INIT(mask)	{					\
 	.height = 0,							\
+	.nr_nodes = 0,							\
 	.gfp_mask = (mask),						\
 	.rnode = NULL,							\
 }
@@ -79,6 +81,7 @@ struct radix_tree_root {
 #define INIT_RADIX_TREE(root, mask)					\
 do {									\
 	(root)->height = 0;						\
+	(root)->nr_nodes = 0;						\
 	(root)->gfp_mask = (mask);					\
 	(root)->rnode = NULL;						\
 } while (0)
diff --git a/lib/radix-tree.c b/lib/radix-tree.c
index 99ab73c..a4d267e 100644
--- a/lib/radix-tree.c
+++ b/lib/radix-tree.c
@@ -33,7 +33,7 @@
 #include <linux/bitops.h>
 #include <linux/rcupdate.h>
 #include <linux/hardirq.h>		/* in_interrupt() */
-
+#include <linux/reboot.h>
 
 #ifdef __KERNEL__
 #define RADIX_TREE_MAP_SHIFT	(CONFIG_BASE_SMALL ? 4 : 6)
@@ -44,6 +44,9 @@
 #define RADIX_TREE_MAP_SIZE	(1UL << RADIX_TREE_MAP_SHIFT)
 #define RADIX_TREE_MAP_MASK	(RADIX_TREE_MAP_SIZE-1)
 
+#undef RADIX_TREE_PRELOAD_NR
+#define RADIX_TREE_PRELOAD_NR	(1UL << 9)
+
 #define RADIX_TREE_TAG_LONGS	\
 	((RADIX_TREE_MAP_SIZE + BITS_PER_LONG - 1) / BITS_PER_LONG)
 
@@ -91,8 +94,9 @@ static struct kmem_cache *radix_tree_node_cachep;
  */
 #define RADIX_TREE_PRELOAD_MIN (RADIX_TREE_MAX_PATH * 2 - 1)
 #define RADIX_TREE_PRELOAD_MAX \
-	(RADIX_TREE_PRELOAD_MIN + \
-	 DIV_ROUND_UP(RADIX_TREE_PRELOAD_NR - 1, RADIX_TREE_MAP_SIZE))
+	( RADIX_TREE_MAX_PATH + \
+	  DIV_ROUND_UP(RADIX_TREE_PRELOAD_NR, RADIX_TREE_MAP_SIZE) + \
+	  2 * (RADIX_TREE_MAX_PATH - 2) )
 
 /*
  * Per-cpu pool of preloaded nodes
@@ -240,6 +244,9 @@ radix_tree_node_alloc(struct radix_tree_root *root)
 		ret = kmem_cache_alloc(radix_tree_node_cachep, gfp_mask);
 
 	BUG_ON(radix_tree_is_indirect_ptr(ret));
+	if (ret)
+		root->nr_nodes++;
+
 	return ret;
 }
 
@@ -264,8 +271,9 @@ static void radix_tree_node_rcu_free(struct rcu_head *head)
 }
 
 static inline void
-radix_tree_node_free(struct radix_tree_node *node)
+radix_tree_node_free(struct radix_tree_root *root, struct radix_tree_node *node)
 {
+	root->nr_nodes--;
 	call_rcu(&node->rcu_head, radix_tree_node_rcu_free);
 }
 
@@ -1353,7 +1361,7 @@ static inline void radix_tree_shrink(struct radix_tree_root *root)
 			*((unsigned long *)&to_free->slots[0]) |=
 						RADIX_TREE_INDIRECT_PTR;
 
-		radix_tree_node_free(to_free);
+		radix_tree_node_free(root, to_free);
 	}
 }
 
@@ -1420,7 +1428,7 @@ void *radix_tree_delete(struct radix_tree_root *root, unsigned long index)
 		 * last reference to it disappears (set NULL, above).
 		 */
 		if (to_free)
-			radix_tree_node_free(to_free);
+			radix_tree_node_free(root, to_free);
 
 		if (node->count) {
 			if (node == indirect_to_ptr(root->rnode))
@@ -1440,7 +1448,7 @@ void *radix_tree_delete(struct radix_tree_root *root, unsigned long index)
 	root->height = 0;
 	root->rnode = NULL;
 	if (to_free)
-		radix_tree_node_free(to_free);
+		radix_tree_node_free(root, to_free);
 
 out:
 	return slot;
@@ -1504,6 +1512,33 @@ static int radix_tree_callback(struct notifier_block *nfb,
        return NOTIFY_OK;
 }
 
+static void self_test(void)
+{
+	struct radix_tree_root test_tree = RADIX_TREE_INIT(GFP_ATOMIC);
+	int i;
+	unsigned long off = (1UL << ((RADIX_TREE_MAX_PATH - 1) * RADIX_TREE_MAP_SHIFT)) - 1;
+
+	printk("radix tree self-test\n");
+	printk("RADIX_TREE_PRELOAD_MAX: %lu\n", RADIX_TREE_PRELOAD_MAX);
+	printk("RADIX_TREE_PRELOAD_MIN: %lu\n", RADIX_TREE_PRELOAD_MIN);
+	printk("RADIX_TREE_PRELOAD_NR: %lu\n", RADIX_TREE_PRELOAD_NR);
+
+	radix_tree_insert(&test_tree, 0, (void*)0xdead0000);
+
+	if (test_tree.nr_nodes != 0) {
+		printk("\toff: %lu, nr_nodes: %d\n", off, test_tree.nr_nodes);
+		BUG();
+	}
+
+	for (i = 0; i < RADIX_TREE_PRELOAD_NR; i++) {
+		radix_tree_insert(&test_tree, off + i, (void*)0xdead0000);
+	}
+	printk("off: %lu, nr_nodes: %d, height: %d\n",
+			off, test_tree.nr_nodes, test_tree.height);
+
+	emergency_restart();
+}
+
 void __init radix_tree_init(void)
 {
 	radix_tree_node_cachep = kmem_cache_create("radix_tree_node",
@@ -1511,5 +1546,6 @@ void __init radix_tree_init(void)
 			SLAB_PANIC | SLAB_RECLAIM_ACCOUNT,
 			radix_tree_node_ctor);
 	radix_tree_init_maxindex();
+	self_test();
 	hotcpu_notifier(radix_tree_callback, 0);
 }
-- 
 Kirill A. Shutemov

^ permalink raw reply related	[flat|nested] 116+ messages in thread

* Re: [PATCH 01/23] radix-tree: implement preload for multiple contiguous elements
  2013-08-07 20:24           ` Kirill A. Shutemov
@ 2013-08-07 20:36             ` Jan Kara
  -1 siblings, 0 replies; 116+ messages in thread
From: Jan Kara @ 2013-08-07 20:36 UTC (permalink / raw)
  To: Kirill A. Shutemov
  Cc: Jan Kara, Andrea Arcangeli, Andrew Morton, Al Viro, Hugh Dickins,
	Wu Fengguang, Mel Gorman, linux-mm, Andi Kleen, Matthew Wilcox,
	Kirill A. Shutemov, Hillf Danton, Dave Hansen, Ning Qu,
	linux-fsdevel, linux-kernel

On Wed 07-08-13 23:24:03, Kirill A. Shutemov wrote:
> Jan Kara wrote:
> > On Wed 07-08-13 19:32:36, Kirill A. Shutemov wrote:
> > > Jan Kara wrote:
> > > > On Sun 04-08-13 05:17:03, Kirill A. Shutemov wrote:
> > > > > From: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
> > > > > 
> > > > > The radix tree is variable-height, so an insert operation not only has
> > > > > to build the branch to its corresponding item, it also has to build the
> > > > > branch to existing items if the size has to be increased (by
> > > > > radix_tree_extend).
> > > > > 
> > > > > The worst case is a zero height tree with just a single item at index 0,
> > > > > and then inserting an item at index ULONG_MAX. This requires 2 new branches
> > > > > of RADIX_TREE_MAX_PATH size to be created, with only the root node shared.
> > > > > 
> > > > > Radix tree is usually protected by spin lock. It means we want to
> > > > > pre-allocate required memory before taking the lock.
> > > > > 
> > > > > Currently radix_tree_preload() only guarantees enough nodes to insert
> > > > > one element. It's a hard limit. For transparent huge page cache we want
> > > > > to insert HPAGE_PMD_NR (512 on x86-64) entries to address_space at once.
> > > > > 
> > > > > This patch introduces radix_tree_preload_count(). It allows to
> > > > > preallocate nodes enough to insert a number of *contiguous* elements.
> > > > > The feature costs about 5KiB per-CPU, details below.
> > > > > 
> > > > > Worst case for adding N contiguous items is adding entries at indexes
> > > > > (ULONG_MAX - N) to ULONG_MAX. It requires nodes to insert single worst-case
> > > > > item plus extra nodes if you cross the boundary from one node to the next.
> > > > > 
> > > > > Preload uses per-CPU array to store nodes. The total cost of preload is
> > > > > "array size" * sizeof(void*) * NR_CPUS. We want to increase array size
> > > > > to be able to handle 512 entries at once.
> > > > > 
> > > > > Size of array depends on system bitness and on RADIX_TREE_MAP_SHIFT.
> > > > > 
> > > > > We have three possible RADIX_TREE_MAP_SHIFT:
> > > > > 
> > > > >  #ifdef __KERNEL__
> > > > >  #define RADIX_TREE_MAP_SHIFT	(CONFIG_BASE_SMALL ? 4 : 6)
> > > > >  #else
> > > > >  #define RADIX_TREE_MAP_SHIFT	3	/* For more stressful testing */
> > > > >  #endif
> > > > > 
> > > > > On 64-bit system:
> > > > > For RADIX_TREE_MAP_SHIFT=3, old array size is 43, new is 107.
> > > > > For RADIX_TREE_MAP_SHIFT=4, old array size is 31, new is 63.
> > > > > For RADIX_TREE_MAP_SHIFT=6, old array size is 21, new is 30.
> > > > > 
> > > > > On 32-bit system:
> > > > > For RADIX_TREE_MAP_SHIFT=3, old array size is 21, new is 84.
> > > > > For RADIX_TREE_MAP_SHIFT=4, old array size is 15, new is 46.
> > > > > For RADIX_TREE_MAP_SHIFT=6, old array size is 11, new is 19.
> > > > > 
> > > > > On most machines we will have RADIX_TREE_MAP_SHIFT=6. In this case,
> > > > > on 64-bit system the per-CPU feature overhead is
> > > > >  for preload array:
> > > > >    (30 - 21) * sizeof(void*) = 72 bytes
> > > > >  plus, if the preload array is full
> > > > >    (30 - 21) * sizeof(struct radix_tree_node) = 9 * 560 = 5040 bytes
> > > > >  total: 5112 bytes
> > > > > 
> > > > > on 32-bit system the per-CPU feature overhead is
> > > > >  for preload array:
> > > > >    (19 - 11) * sizeof(void*) = 32 bytes
> > > > >  plus, if the preload array is full
> > > > >    (19 - 11) * sizeof(struct radix_tree_node) = 8 * 296 = 2368 bytes
> > > > >  total: 2400 bytes
> > > > > 
> > > > > Since only THP uses batched preload at the moment, we disable (set max
> > > > > preload to 1) it if !CONFIG_TRANSPARENT_HUGEPAGE_PAGECACHE. This can be
> > > > > changed in the future.
> > > > > 
> > > > > Signed-off-by: Matthew Wilcox <willy@linux.intel.com>
> > > > > Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
> > > > > Acked-by: Dave Hansen <dave.hansen@linux.intel.com>
> > > > > ---
> > > > >  include/linux/radix-tree.h | 11 +++++++++++
> > > > >  lib/radix-tree.c           | 41 ++++++++++++++++++++++++++++++++---------
> > > > >  2 files changed, 43 insertions(+), 9 deletions(-)
> > > > ...
> > > > > diff --git a/lib/radix-tree.c b/lib/radix-tree.c
> > > > > index 7811ed3..99ab73c 100644
> > > > > --- a/lib/radix-tree.c
> > > > > +++ b/lib/radix-tree.c
> > > > > @@ -82,16 +82,24 @@ static struct kmem_cache *radix_tree_node_cachep;
> > > > >   * The worst case is a zero height tree with just a single item at index 0,
> > > > >   * and then inserting an item at index ULONG_MAX. This requires 2 new branches
> > > > >   * of RADIX_TREE_MAX_PATH size to be created, with only the root node shared.
> > > > > + *
> > > > > + * Worst case for adding N contiguous items is adding entries at indexes
> > > > > + * (ULONG_MAX - N) to ULONG_MAX. It requires nodes to insert single worst-case
> > > > > + * item plus extra nodes if you cross the boundary from one node to the next.
> > > > > + *
> > > > >   * Hence:
> > > > >   */
> > > > > -#define RADIX_TREE_PRELOAD_SIZE (RADIX_TREE_MAX_PATH * 2 - 1)
> > > > > +#define RADIX_TREE_PRELOAD_MIN (RADIX_TREE_MAX_PATH * 2 - 1)
> > > > > +#define RADIX_TREE_PRELOAD_MAX \
> > > > > +	(RADIX_TREE_PRELOAD_MIN + \
> > > > > +	 DIV_ROUND_UP(RADIX_TREE_PRELOAD_NR - 1, RADIX_TREE_MAP_SIZE))
> > > >   Umm, is this really correct? I see two problems:
> > > > 1) You may need internal tree nodes at various levels but you seem to
> > > > account only for the level 1.
> > > > 2) The rounding doesn't seem right because RADIX_TREE_MAP_SIZE+2 nodes may
> > > > require 3 nodes at level 1 if the indexes are like:
> > > > i_0 | i_1 .. i_{RADIX_TREE_MAP_SIZE} | i_{RADIX_TREE_MAP_SIZE+1}
> > > >     ^                                ^
> > > >     node boundary                    node boundary
> > > 
> > > My bad. Let's try to calculate once again.
> > > 
> > > We want to insert N contiguous items without restriction on alignment.
> > > 
> > > Let's limit N <= 1UL << (2 * RADIX_TREE_MAP_SHIFT), without
> > > CONFIG_BASE_SMALL it's 4096. It will simplify calculation a bit.
> > > 
> > > Worst case scenario, I can imagine, is tree with only one element at index
> > > 0 and we add N items where at least one index requires max tree high and
> > > we cross boundary between items in root node.
> > > 
> > > Basically, at least one index is less then
> > > 
> > > 1UL << ((RADIX_TREE_MAX_PATH - 1) * RADIX_TREE_MAP_SHIFT)
> > > 
> > > and one equal or more.
> > > 
> > > In this case we need:
> > > 
> > > - RADIX_TREE_MAX_PATH nodes to build new path to item with index 0;
> > > - DIV_ROUND_UP(N, RADIX_TREE_MAP_SIZE) nodes for last level nodes for new
> > >   items;
> >   Here, I think you need to count with
> > DIV_ROUND_UP(N + RADIX_TREE_MAP_SIZE - 1, RADIX_TREE_MAP_SIZE) to propely
> > account for the situation b) I described.
> 
> It's not obvious for me why it's needed. Since we limit N, we can cross
> second (or above) level node boundary only once.
> 
> I've tried to confirm the math with my kvm (see *ugly* patch below) and
> I was not able to find anything that is not covered.
> 
> Could you demonstrate the case you are talking about.
  Sure. So let RADIX_TREE_MAP_SHIFT be 6 (i.e. RADIX_TREE_MAP_SIZE == 64).
Let's have a radix tree with a single element at index 0. We insert 66 elements
starting from index 127. So at the last level we need a node for index 127, a
node for indexes 128 .. 191, and a node for index 192. That is three nodes in
total. But DIV_ROUND_UP(66, 64) = 2. The problem happens because the starting
index 127 isn't a multiple of RADIX_TREE_MAP_SIZE, so we can have partially
used nodes both at the beginning and at the end of the range.
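
A tiny stand-alone sketch (user-space only, not the kernel code; MAP_SIZE here
just stands in for RADIX_TREE_MAP_SIZE) that counts the distinct last-level
nodes touched by such a range:

#include <stdio.h>

#define MAP_SIZE 64UL	/* RADIX_TREE_MAP_SIZE for RADIX_TREE_MAP_SHIFT == 6 */

int main(void)
{
	unsigned long first = 127, n = 66, last = first + n - 1;
	/* one last-level node per distinct index / MAP_SIZE value */
	unsigned long needed = last / MAP_SIZE - first / MAP_SIZE + 1;
	/* the DIV_ROUND_UP(n, MAP_SIZE) estimate */
	unsigned long estimated = (n + MAP_SIZE - 1) / MAP_SIZE;

	printf("needed %lu, estimated %lu\n", needed, estimated);	/* 3 vs 2 */
	return 0;
}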

								Honza
-- 
Jan Kara <jack@suse.cz>
SUSE Labs, CR

^ permalink raw reply	[flat|nested] 116+ messages in thread

* Re: [PATCH 01/23] radix-tree: implement preload for multiple contiguous elements
  2013-08-07 20:36             ` Jan Kara
  (?)
@ 2013-08-07 21:37               ` Kirill A. Shutemov
  -1 siblings, 0 replies; 116+ messages in thread
From: Kirill A. Shutemov @ 2013-08-07 21:37 UTC (permalink / raw)
  To: Jan Kara
  Cc: Kirill A. Shutemov, Jan Kara, Andrea Arcangeli, Andrew Morton,
	Al Viro, Hugh Dickins, Wu Fengguang, Mel Gorman, linux-mm,
	Andi Kleen, Matthew Wilcox, Kirill A. Shutemov, Hillf Danton,
	Dave Hansen, Ning Qu, linux-fsdevel, linux-kernel

Jan Kara wrote:
> On Wed 07-08-13 23:24:03, Kirill A. Shutemov wrote:
> > Jan Kara wrote:
> > > On Wed 07-08-13 19:32:36, Kirill A. Shutemov wrote:
> > > > Jan Kara wrote:
> > > > > On Sun 04-08-13 05:17:03, Kirill A. Shutemov wrote:
> > > > > > From: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
> > > > > > 
> > > > > > The radix tree is variable-height, so an insert operation not only has
> > > > > > to build the branch to its corresponding item, it also has to build the
> > > > > > branch to existing items if the size has to be increased (by
> > > > > > radix_tree_extend).
> > > > > > 
> > > > > > The worst case is a zero height tree with just a single item at index 0,
> > > > > > and then inserting an item at index ULONG_MAX. This requires 2 new branches
> > > > > > of RADIX_TREE_MAX_PATH size to be created, with only the root node shared.
> > > > > > 
> > > > > > Radix tree is usually protected by spin lock. It means we want to
> > > > > > pre-allocate required memory before taking the lock.
> > > > > > 
> > > > > > Currently radix_tree_preload() only guarantees enough nodes to insert
> > > > > > one element. It's a hard limit. For transparent huge page cache we want
> > > > > > to insert HPAGE_PMD_NR (512 on x86-64) entries to address_space at once.
> > > > > > 
> > > > > > This patch introduces radix_tree_preload_count(). It allows to
> > > > > > preallocate nodes enough to insert a number of *contiguous* elements.
> > > > > > The feature costs about 5KiB per-CPU, details below.
> > > > > > 
> > > > > > Worst case for adding N contiguous items is adding entries at indexes
> > > > > > (ULONG_MAX - N) to ULONG_MAX. It requires nodes to insert single worst-case
> > > > > > item plus extra nodes if you cross the boundary from one node to the next.
> > > > > > 
> > > > > > Preload uses per-CPU array to store nodes. The total cost of preload is
> > > > > > "array size" * sizeof(void*) * NR_CPUS. We want to increase array size
> > > > > > to be able to handle 512 entries at once.
> > > > > > 
> > > > > > Size of array depends on system bitness and on RADIX_TREE_MAP_SHIFT.
> > > > > > 
> > > > > > We have three possible RADIX_TREE_MAP_SHIFT:
> > > > > > 
> > > > > >  #ifdef __KERNEL__
> > > > > >  #define RADIX_TREE_MAP_SHIFT	(CONFIG_BASE_SMALL ? 4 : 6)
> > > > > >  #else
> > > > > >  #define RADIX_TREE_MAP_SHIFT	3	/* For more stressful testing */
> > > > > >  #endif
> > > > > > 
> > > > > > On 64-bit system:
> > > > > > For RADIX_TREE_MAP_SHIFT=3, old array size is 43, new is 107.
> > > > > > For RADIX_TREE_MAP_SHIFT=4, old array size is 31, new is 63.
> > > > > > For RADIX_TREE_MAP_SHIFT=6, old array size is 21, new is 30.
> > > > > > 
> > > > > > On 32-bit system:
> > > > > > For RADIX_TREE_MAP_SHIFT=3, old array size is 21, new is 84.
> > > > > > For RADIX_TREE_MAP_SHIFT=4, old array size is 15, new is 46.
> > > > > > For RADIX_TREE_MAP_SHIFT=6, old array size is 11, new is 19.
> > > > > > 
> > > > > > On most machines we will have RADIX_TREE_MAP_SHIFT=6. In this case,
> > > > > > on 64-bit system the per-CPU feature overhead is
> > > > > >  for preload array:
> > > > > >    (30 - 21) * sizeof(void*) = 72 bytes
> > > > > >  plus, if the preload array is full
> > > > > >    (30 - 21) * sizeof(struct radix_tree_node) = 9 * 560 = 5040 bytes
> > > > > >  total: 5112 bytes
> > > > > > 
> > > > > > on 32-bit system the per-CPU feature overhead is
> > > > > >  for preload array:
> > > > > >    (19 - 11) * sizeof(void*) = 32 bytes
> > > > > >  plus, if the preload array is full
> > > > > >    (19 - 11) * sizeof(struct radix_tree_node) = 8 * 296 = 2368 bytes
> > > > > >  total: 2400 bytes
> > > > > > 
> > > > > > Since only THP uses batched preload at the moment, we disable (set max
> > > > > > preload to 1) it if !CONFIG_TRANSPARENT_HUGEPAGE_PAGECACHE. This can be
> > > > > > changed in the future.
> > > > > > 
> > > > > > Signed-off-by: Matthew Wilcox <willy@linux.intel.com>
> > > > > > Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
> > > > > > Acked-by: Dave Hansen <dave.hansen@linux.intel.com>
> > > > > > ---
> > > > > >  include/linux/radix-tree.h | 11 +++++++++++
> > > > > >  lib/radix-tree.c           | 41 ++++++++++++++++++++++++++++++++---------
> > > > > >  2 files changed, 43 insertions(+), 9 deletions(-)
> > > > > ...
> > > > > > diff --git a/lib/radix-tree.c b/lib/radix-tree.c
> > > > > > index 7811ed3..99ab73c 100644
> > > > > > --- a/lib/radix-tree.c
> > > > > > +++ b/lib/radix-tree.c
> > > > > > @@ -82,16 +82,24 @@ static struct kmem_cache *radix_tree_node_cachep;
> > > > > >   * The worst case is a zero height tree with just a single item at index 0,
> > > > > >   * and then inserting an item at index ULONG_MAX. This requires 2 new branches
> > > > > >   * of RADIX_TREE_MAX_PATH size to be created, with only the root node shared.
> > > > > > + *
> > > > > > + * Worst case for adding N contiguous items is adding entries at indexes
> > > > > > + * (ULONG_MAX - N) to ULONG_MAX. It requires nodes to insert single worst-case
> > > > > > + * item plus extra nodes if you cross the boundary from one node to the next.
> > > > > > + *
> > > > > >   * Hence:
> > > > > >   */
> > > > > > -#define RADIX_TREE_PRELOAD_SIZE (RADIX_TREE_MAX_PATH * 2 - 1)
> > > > > > +#define RADIX_TREE_PRELOAD_MIN (RADIX_TREE_MAX_PATH * 2 - 1)
> > > > > > +#define RADIX_TREE_PRELOAD_MAX \
> > > > > > +	(RADIX_TREE_PRELOAD_MIN + \
> > > > > > +	 DIV_ROUND_UP(RADIX_TREE_PRELOAD_NR - 1, RADIX_TREE_MAP_SIZE))
> > > > >   Umm, is this really correct? I see two problems:
> > > > > 1) You may need internal tree nodes at various levels but you seem to
> > > > > account only for the level 1.
> > > > > 2) The rounding doesn't seem right because RADIX_TREE_MAP_SIZE+2 nodes may
> > > > > require 3 nodes at level 1 if the indexes are like:
> > > > > i_0 | i_1 .. i_{RADIX_TREE_MAP_SIZE} | i_{RADIX_TREE_MAP_SIZE+1}
> > > > >     ^                                ^
> > > > >     node boundary                    node boundary
> > > > 
> > > > My bad. Let's try to calculate once again.
> > > > 
> > > > We want to insert N contiguous items without restriction on alignment.
> > > > 
> > > > Let's limit N <= 1UL << (2 * RADIX_TREE_MAP_SHIFT), without
> > > > CONFIG_BASE_SMALL it's 4096. It will simplify calculation a bit.
> > > > 
> > > > Worst case scenario, I can imagine, is tree with only one element at index
> > > > 0 and we add N items where at least one index requires max tree high and
> > > > we cross boundary between items in root node.
> > > > 
> > > > Basically, at least one index is less then
> > > > 
> > > > 1UL << ((RADIX_TREE_MAX_PATH - 1) * RADIX_TREE_MAP_SHIFT)
> > > > 
> > > > and one equal or more.
> > > > 
> > > > In this case we need:
> > > > 
> > > > - RADIX_TREE_MAX_PATH nodes to build new path to item with index 0;
> > > > - DIV_ROUND_UP(N, RADIX_TREE_MAP_SIZE) nodes for last level nodes for new
> > > >   items;
> > >   Here, I think you need to count with
> > > DIV_ROUND_UP(N + RADIX_TREE_MAP_SIZE - 1, RADIX_TREE_MAP_SIZE) to propely
> > > account for the situation b) I described.
> > 
> > It's not obvious for me why it's needed. Since we limit N, we can cross
> > second (or above) level node boundary only once.
> > 
> > I've tried to confirm the math with my kvm (see *ugly* patch below) and
> > I was not able to find anything that is not covered.
> > 
> > Could you demonstrate the case you are talking about.
>   Sure. So let RADIX_TREE_MAP_SHIFT be 6 (i.e. RADIX_TREE_MAP_SIZE == 64).
> Let's have a radix tree with a single element at index 0. We insert 66 elements
> starting from index 127. So at the last level we need a node for index 127, a
> node for indexes 128 .. 191, and a node for index 192. That is three nodes in
> total. But DIV_ROUND_UP(66, 64) = 2. The problem happens because the starting
> index 127 isn't a multiple of RADIX_TREE_MAP_SIZE, so we can have partially
> used nodes both at the beginning and at the end of the range.

I see. Looks like you are correct.

So with the patch from the previous mail, it should be possible to trigger this
case if we set RADIX_TREE_PRELOAD_NR to 514 and off to
(1UL << ((RADIX_TREE_MAX_PATH - 1) * RADIX_TREE_MAP_SHIFT)) - RADIX_TREE_MAP_SIZE - 1

In this case it should use 39 nodes, but it uses only 38. I can't understand why. :(
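
To make the numbers concrete, a minimal stand-alone sketch (user-space only,
assuming 64-bit with RADIX_TREE_MAP_SHIFT == 6 and RADIX_TREE_MAX_PATH == 11;
it does not model the tree itself) of where that range lands: the 514 indexes
straddle the 1UL << 60 boundary, i.e. a slot boundary in the root node at
maximum height.

#include <stdio.h>

int main(void)
{
	unsigned long shift = 6, map_size = 1UL << shift, max_path = 11;
	unsigned long boundary = 1UL << ((max_path - 1) * shift);
	unsigned long off = boundary - map_size - 1;
	unsigned long nr = 514;

	/* 65 indexes fall below the boundary, 449 at or above it */
	printf("off=%lu below=%lu at_or_above=%lu\n",
	       off, boundary - off, off + nr - boundary);
	return 0;
}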

-- 
 Kirill A. Shutemov

^ permalink raw reply	[flat|nested] 116+ messages in thread

* Re: [PATCH 01/23] radix-tree: implement preload for multiple contiguous elements
  2013-08-07 21:37               ` Kirill A. Shutemov
  (?)
@ 2013-08-08  8:45                 ` Kirill A. Shutemov
  -1 siblings, 0 replies; 116+ messages in thread
From: Kirill A. Shutemov @ 2013-08-08  8:45 UTC (permalink / raw)
  To: Jan Kara, Matthew Wilcox, Dave Hansen
  Cc: Jan Kara, Kirill A. Shutemov, Andrea Arcangeli, Andrew Morton,
	Al Viro, Hugh Dickins, Wu Fengguang, Mel Gorman, linux-mm,
	Andi Kleen, Kirill A. Shutemov, Hillf Danton, Ning Qu,
	linux-fsdevel, linux-kernel

Kirill A. Shutemov wrote:
> In this case it should use 39 nodes, but it uses only 38. I can't understand why. :(

Okay, I've got it. We share one 2nd level node.

Patch is below. Please review.

From 35ba5687ea7aea98645da34ddd0be01a9de8b32d Mon Sep 17 00:00:00 2001
From: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
Date: Wed, 9 Jan 2013 13:25:47 +0200
Subject: [PATCH] radix-tree: implement preload for multiple contiguous
 elements

The radix tree is variable-height, so an insert operation not only has
to build the branch to its corresponding item, it also has to build the
branch to existing items if the size has to be increased (by
radix_tree_extend).

The worst case is a zero height tree with just a single item at index 0,
and then inserting an item at index ULONG_MAX. This requires 2 new branches
of RADIX_TREE_MAX_PATH size to be created, with only the root node shared.

Radix tree is usually protected by spin lock. It means we want to
pre-allocate required memory before taking the lock.

Currently radix_tree_preload() only guarantees enough nodes to insert
one element. It's a hard limit. For transparent huge page cache we want
to insert HPAGE_PMD_NR (512 on x86-64) entries to address_space at once.

This patch introduces radix_tree_maybe_preload_contig(). It allows
preallocating enough nodes to insert a number of *contiguous* elements.
The feature costs about 9KiB per-CPU on x86_64, details below.

Preload uses per-CPU array to store nodes. The total cost of preload is
"array size" * sizeof(void*) * NR_CPUS. We want to increase array size
to be able to handle 512 entries at once.

Size of array depends on system bitness and on RADIX_TREE_MAP_SHIFT.

We have three possible RADIX_TREE_MAP_SHIFT:

 #ifdef __KERNEL__
 #define RADIX_TREE_MAP_SHIFT	(CONFIG_BASE_SMALL ? 4 : 6)
 #else
 #define RADIX_TREE_MAP_SHIFT	3	/* For more stressful testing */
 #endif

We are not going to use transparent huge page cache on small machines or
in userspace, so we are interested in RADIX_TREE_MAP_SHIFT=6.

On 64-bit system old array size is 21, new is 37. Per-CPU feature
overhead is
 for preload array:
   (37 - 21) * sizeof(void*) = 128 bytes
 plus, if the preload array is full
   (37 - 21) * sizeof(struct radix_tree_node) = 16 * 560 = 8960 bytes
 total: 9088 bytes

On 32-bit system old array size is 11, new is 22. Per-CPU feature
overhead is
 for preload array:
   (22 - 11) * sizeof(void*) = 44 bytes
 plus, if the preload array is full
   (22 - 11) * sizeof(struct radix_tree_node) = 11 * 296 = 3256 bytes
 total: 3300 bytes

Since only THP uses batched preload at the moment, we disable (set max
preload to 1) it if !CONFIG_TRANSPARENT_HUGEPAGE_PAGECACHE. This can be
changed in the future.

Signed-off-by: Matthew Wilcox <willy@linux.intel.com>
Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Acked-by: Dave Hansen <dave.hansen@linux.intel.com>
---
 include/linux/radix-tree.h | 11 ++++++
 lib/radix-tree.c           | 89 +++++++++++++++++++++++++++++++++++++++++-----
 2 files changed, 91 insertions(+), 9 deletions(-)

diff --git a/include/linux/radix-tree.h b/include/linux/radix-tree.h
index 4039407..3bf0b3e 100644
--- a/include/linux/radix-tree.h
+++ b/include/linux/radix-tree.h
@@ -83,6 +83,16 @@ do {									\
 	(root)->rnode = NULL;						\
 } while (0)
 
+#ifdef CONFIG_TRANSPARENT_HUGEPAGE_PAGECACHE
+/*
+ * At the moment only THP uses preload for more than one item for batched
+ * pagecache manipulations.
+ */
+#define RADIX_TREE_PRELOAD_NR	512
+#else
+#define RADIX_TREE_PRELOAD_NR	1
+#endif
+
 /**
  * Radix-tree synchronization
  *
@@ -232,6 +242,7 @@ unsigned long radix_tree_prev_hole(struct radix_tree_root *root,
 				unsigned long index, unsigned long max_scan);
 int radix_tree_preload(gfp_t gfp_mask);
 int radix_tree_maybe_preload(gfp_t gfp_mask);
+int radix_tree_maybe_preload_contig(unsigned size, gfp_t gfp_mask);
 void radix_tree_init(void);
 void *radix_tree_tag_set(struct radix_tree_root *root,
 			unsigned long index, unsigned int tag);
diff --git a/lib/radix-tree.c b/lib/radix-tree.c
index 7811ed3..980e4c4 100644
--- a/lib/radix-tree.c
+++ b/lib/radix-tree.c
@@ -84,14 +84,54 @@ static struct kmem_cache *radix_tree_node_cachep;
  * of RADIX_TREE_MAX_PATH size to be created, with only the root node shared.
  * Hence:
  */
-#define RADIX_TREE_PRELOAD_SIZE (RADIX_TREE_MAX_PATH * 2 - 1)
+#define RADIX_TREE_PRELOAD_MIN (RADIX_TREE_MAX_PATH * 2 - 1)
+
+/*
+ * Inserting N contiguous items is more complex. To simplify calculation, let's
+ * limit N (validated in radix_tree_init()):
+ *  - N is a multiple of RADIX_TREE_MAP_SIZE (or 1);
+ *  - N <= the number of items a 2-level tree can contain:
+ *    1UL << (2 * RADIX_TREE_MAP_SHIFT).
+ *
+ * No limitation on insert index alignment.
+ *
+ * Then the worst case is a tree with only one element at index 0 to which we
+ * add N items where at least one index requires the maximum tree height and
+ * the range crosses a slot boundary in the root node.
+ *
+ * Basically, at least one index is less than
+ *
+ * 1UL << ((RADIX_TREE_MAX_PATH - 1) * RADIX_TREE_MAP_SHIFT)
+ *
+ * and at least one is equal to or above it.
+ *
+ * In this case we need:
+ *
+ * - RADIX_TREE_MAX_PATH nodes to build new path to item with index 0;
+ * - N / RADIX_TREE_MAP_SIZE + 1 nodes for the last level nodes of new items:
+ *    - +1 is for the misaligned case;
+ * - 2 * (RADIX_TREE_MAX_PATH - 2) - 1 nodes to build paths to last level nodes:
+ *    - -2, because the root node and last level nodes are already accounted for;
+ *    - -1, because one second level node is shared between the item with index 0
+ *      and the items below the boundary;
+ *
+ * Hence:
+ */
+#if RADIX_TREE_PRELOAD_NR > 1
+#define RADIX_TREE_PRELOAD_MAX \
+	( RADIX_TREE_MAX_PATH + \
+	  RADIX_TREE_PRELOAD_NR / RADIX_TREE_MAP_SIZE + 1 + \
+	  2 * (RADIX_TREE_MAX_PATH - 2) - 1)
+#else
+#define RADIX_TREE_PRELOAD_MAX RADIX_TREE_PRELOAD_MIN
+#endif
 
 /*
  * Per-cpu pool of preloaded nodes
  */
 struct radix_tree_preload {
 	int nr;
-	struct radix_tree_node *nodes[RADIX_TREE_PRELOAD_SIZE];
+	struct radix_tree_node *nodes[RADIX_TREE_PRELOAD_MAX];
 };
 static DEFINE_PER_CPU(struct radix_tree_preload, radix_tree_preloads) = { 0, };
 
@@ -263,29 +303,35 @@ radix_tree_node_free(struct radix_tree_node *node)
 
 /*
  * Load up this CPU's radix_tree_node buffer with sufficient objects to
- * ensure that the addition of a single element in the tree cannot fail.  On
- * success, return zero, with preemption disabled.  On error, return -ENOMEM
+ * ensure that the addition of *contiguous* items in the tree cannot fail.
+ * On success, return zero, with preemption disabled.  On error, return -ENOMEM
  * with preemption not disabled.
  *
  * To make use of this facility, the radix tree must be initialised without
  * __GFP_WAIT being passed to INIT_RADIX_TREE().
  */
-static int __radix_tree_preload(gfp_t gfp_mask)
+static int __radix_tree_preload_contig(unsigned size, gfp_t gfp_mask)
 {
 	struct radix_tree_preload *rtp;
 	struct radix_tree_node *node;
 	int ret = -ENOMEM;
+	int preload_target = RADIX_TREE_PRELOAD_MIN +
+		DIV_ROUND_UP(size - 1, RADIX_TREE_MAP_SIZE);
+
+	if (WARN_ONCE(size > RADIX_TREE_PRELOAD_NR,
+				"too large preload requested"))
+		return -ENOMEM;
 
 	preempt_disable();
 	rtp = &__get_cpu_var(radix_tree_preloads);
-	while (rtp->nr < ARRAY_SIZE(rtp->nodes)) {
+	while (rtp->nr < preload_target) {
 		preempt_enable();
 		node = kmem_cache_alloc(radix_tree_node_cachep, gfp_mask);
 		if (node == NULL)
 			goto out;
 		preempt_disable();
 		rtp = &__get_cpu_var(radix_tree_preloads);
-		if (rtp->nr < ARRAY_SIZE(rtp->nodes))
+		if (rtp->nr < preload_target)
 			rtp->nodes[rtp->nr++] = node;
 		else
 			kmem_cache_free(radix_tree_node_cachep, node);
@@ -308,7 +354,7 @@ int radix_tree_preload(gfp_t gfp_mask)
 {
 	/* Warn on non-sensical use... */
 	WARN_ON_ONCE(!(gfp_mask & __GFP_WAIT));
-	return __radix_tree_preload(gfp_mask);
+	return __radix_tree_preload_contig(1, gfp_mask);
 }
 EXPORT_SYMBOL(radix_tree_preload);
 
@@ -320,13 +366,22 @@ EXPORT_SYMBOL(radix_tree_preload);
 int radix_tree_maybe_preload(gfp_t gfp_mask)
 {
 	if (gfp_mask & __GFP_WAIT)
-		return __radix_tree_preload(gfp_mask);
+		return __radix_tree_preload_contig(1, gfp_mask);
 	/* Preloading doesn't help anything with this gfp mask, skip it */
 	preempt_disable();
 	return 0;
 }
 EXPORT_SYMBOL(radix_tree_maybe_preload);
 
+int radix_tree_maybe_preload_contig(unsigned size, gfp_t gfp_mask)
+{
+	if (gfp_mask & __GFP_WAIT)
+		return __radix_tree_preload_contig(size, gfp_mask);
+	/* Preloading doesn't help anything with this gfp mask, skip it */
+	preempt_disable();
+	return 0;
+}
+
 /*
  *	Return the maximum key which can be store into a
  *	radix tree with height HEIGHT.
@@ -1483,6 +1538,22 @@ static int radix_tree_callback(struct notifier_block *nfb,
 
 void __init radix_tree_init(void)
 {
+	/*
+	 * Restrictions on RADIX_TREE_PRELOAD_NR simplify RADIX_TREE_PRELOAD_MAX
+	 * calculation, which is already complex enough:
+	 *  - it must be a multiple of RADIX_TREE_MAP_SIZE, otherwise we would
+	 *    have to round it up to the next RADIX_TREE_MAP_SIZE multiple and we
+	 *    don't win anything;
+	 *  - it must not exceed the number of items a 2-level tree can contain.
+	 *    It's easier to calculate the number of internal nodes required
+	 *    this way.
+	 */
+	if (RADIX_TREE_PRELOAD_NR != 1) {
+		BUILD_BUG_ON(RADIX_TREE_PRELOAD_NR % RADIX_TREE_MAP_SIZE != 0);
+		BUILD_BUG_ON(RADIX_TREE_PRELOAD_NR >
+				1UL << (2 * RADIX_TREE_MAP_SHIFT));
+	}
+
 	radix_tree_node_cachep = kmem_cache_create("radix_tree_node",
 			sizeof(struct radix_tree_node), 0,
 			SLAB_PANIC | SLAB_RECLAIM_ACCOUNT,
-- 
1.8.3.2
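
For reference, a stand-alone sketch (user-space only, not part of the patch;
it assumes RADIX_TREE_MAP_SHIFT == 6, RADIX_TREE_PRELOAD_NR == 512 and
RADIX_TREE_MAX_PATH == DIV_ROUND_UP(BITS_PER_LONG, RADIX_TREE_MAP_SHIFT)) that
reproduces the array sizes quoted in the changelog from the
RADIX_TREE_PRELOAD_MIN/MAX formulas above:

#include <stdio.h>

static void check(unsigned long index_bits)
{
	unsigned long shift = 6, map_size = 1UL << shift, nr = 512;
	unsigned long max_path = (index_bits + shift - 1) / shift;
	unsigned long min = max_path * 2 - 1;
	unsigned long max = max_path + nr / map_size + 1 +
			    2 * (max_path - 2) - 1;

	printf("%2lu-bit: PRELOAD_MIN=%lu PRELOAD_MAX=%lu\n",
	       index_bits, min, max);
}

int main(void)
{
	check(64);	/* 21 and 37, as in the changelog */
	check(32);	/* 11 and 22 */
	return 0;
}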

-- 
 Kirill A. Shutemov

^ permalink raw reply related	[flat|nested] 116+ messages in thread

* Re: [PATCH 01/23] radix-tree: implement preload for multiple contiguous elements
@ 2013-08-08  8:45                 ` Kirill A. Shutemov
  0 siblings, 0 replies; 116+ messages in thread
From: Kirill A. Shutemov @ 2013-08-08  8:45 UTC (permalink / raw)
  To: Jan Kara, Matthew Wilcox, Dave Hansen
  Cc: Jan Kara, Kirill A. Shutemov, Andrea Arcangeli, Andrew Morton,
	Al Viro, Hugh Dickins, Wu Fengguang, Mel Gorman, linux-mm,
	Andi Kleen, Kirill A. Shutemov, Hillf Danton, Ning Qu,
	linux-fsdevel, linux-kernel

Kirill A. Shutemov wrote:
> In this case it should use 39 nodes, but it uses only 38. I can't understand why. :(

Okay, I've got it. We share one 2nd level node.

Patch is below. Please review.

From 35ba5687ea7aea98645da34ddd0be01a9de8b32d Mon Sep 17 00:00:00 2001
From: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
Date: Wed, 9 Jan 2013 13:25:47 +0200
Subject: [PATCH] radix-tree: implement preload for multiple contiguous
 elements

The radix tree is variable-height, so an insert operation not only has
to build the branch to its corresponding item, it also has to build the
branch to existing items if the size has to be increased (by
radix_tree_extend).

The worst case is a zero height tree with just a single item at index 0,
and then inserting an item at index ULONG_MAX. This requires 2 new branches
of RADIX_TREE_MAX_PATH size to be created, with only the root node shared.

Radix tree is usually protected by spin lock. It means we want to
pre-allocate required memory before taking the lock.

Currently radix_tree_preload() only guarantees enough nodes to insert
one element. It's a hard limit. For transparent huge page cache we want
to insert HPAGE_PMD_NR (512 on x86-64) entries to address_space at once.

This patch introduces radix_tree_maybe_preload_contig(). It allows
preallocating enough nodes to insert a number of *contiguous* elements.
The feature costs about 9KiB per-CPU on x86_64, details below.

Preload uses per-CPU array to store nodes. The total cost of preload is
"array size" * sizeof(void*) * NR_CPUS. We want to increase array size
to be able to handle 512 entries at once.

Size of array depends on system bitness and on RADIX_TREE_MAP_SHIFT.

We have three possible RADIX_TREE_MAP_SHIFT:

 #ifdef __KERNEL__
 #define RADIX_TREE_MAP_SHIFT	(CONFIG_BASE_SMALL ? 4 : 6)
 #else
 #define RADIX_TREE_MAP_SHIFT	3	/* For more stressful testing */
 #endif

We are not going to use transparent huge page cache on small machines or
in userspace, so we are interested in RADIX_TREE_MAP_SHIFT=6.

On 64-bit system old array size is 21, new is 37. Per-CPU feature
overhead is
 for preload array:
   (37 - 21) * sizeof(void*) = 128 bytes
 plus, if the preload array is full
   (37 - 21) * sizeof(struct radix_tree_node) = 16 * 560 = 8960 bytes
 total: 9088 bytes

On 32-bit system old array size is 11, new is 22. Per-CPU feature
overhead is
 for preload array:
   (22 - 11) * sizeof(void*) = 44 bytes
 plus, if the preload array is full
   (22 - 11) * sizeof(struct radix_tree_node) = 11 * 296 = 3256 bytes
 total: 3300 bytes

Since only THP uses batched preload at the moment, we disable (set max
preload to 1) it if !CONFIG_TRANSPARENT_HUGEPAGE_PAGECACHE. This can be
changed in the future.

Signed-off-by: Matthew Wilcox <willy@linux.intel.com>
Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Acked-by: Dave Hansen <dave.hansen@linux.intel.com>
---
 include/linux/radix-tree.h | 11 ++++++
 lib/radix-tree.c           | 89 +++++++++++++++++++++++++++++++++++++++++-----
 2 files changed, 91 insertions(+), 9 deletions(-)

diff --git a/include/linux/radix-tree.h b/include/linux/radix-tree.h
index 4039407..3bf0b3e 100644
--- a/include/linux/radix-tree.h
+++ b/include/linux/radix-tree.h
@@ -83,6 +83,16 @@ do {									\
 	(root)->rnode = NULL;						\
 } while (0)
 
+#ifdef CONFIG_TRANSPARENT_HUGEPAGE_PAGECACHE
+/*
+ * At the moment only THP uses preload for more than one item for batched
+ * pagecache manipulations.
+ */
+#define RADIX_TREE_PRELOAD_NR	512
+#else
+#define RADIX_TREE_PRELOAD_NR	1
+#endif
+
 /**
  * Radix-tree synchronization
  *
@@ -232,6 +242,7 @@ unsigned long radix_tree_prev_hole(struct radix_tree_root *root,
 				unsigned long index, unsigned long max_scan);
 int radix_tree_preload(gfp_t gfp_mask);
 int radix_tree_maybe_preload(gfp_t gfp_mask);
+int radix_tree_maybe_preload_contig(unsigned size, gfp_t gfp_mask);
 void radix_tree_init(void);
 void *radix_tree_tag_set(struct radix_tree_root *root,
 			unsigned long index, unsigned int tag);
diff --git a/lib/radix-tree.c b/lib/radix-tree.c
index 7811ed3..980e4c4 100644
--- a/lib/radix-tree.c
+++ b/lib/radix-tree.c
@@ -84,14 +84,54 @@ static struct kmem_cache *radix_tree_node_cachep;
  * of RADIX_TREE_MAX_PATH size to be created, with only the root node shared.
  * Hence:
  */
-#define RADIX_TREE_PRELOAD_SIZE (RADIX_TREE_MAX_PATH * 2 - 1)
+#define RADIX_TREE_PRELOAD_MIN (RADIX_TREE_MAX_PATH * 2 - 1)
+
+/*
+ * Inserting N contiguous items is more complex. To simplify calculation, let's
+ * limit N (validated in radix_tree_init()):
+ *  - N is a multiple of RADIX_TREE_MAP_SIZE (or 1);
+ *  - N <= number of items 2-level tree can contain:
+ *    1UL << (2 * RADIX_TREE_MAP_SHIFT).
+ *
+ * No limitation on insert index alignment.
+ *
+ * Then the worst case is a tree with only one element at index 0, to which we
+ * add N items where at least one index requires the max tree height and the
+ * items cross a boundary between slots in the root node.
+ *
+ * Basically, at least one index is less than
+ *
+ * 1UL << ((RADIX_TREE_MAX_PATH - 1) * RADIX_TREE_MAP_SHIFT)
+ *
+ * and one is equal to it.
+ *
+ * In this case we need:
+ *
+ * - RADIX_TREE_MAX_PATH nodes to build new path to item with index 0;
+ * - N / RADIX_TREE_MAP_SIZE + 1 nodes for last level nodes for new items:
+ *    - +1 is for the misaligned case;
+ * - 2 * (RADIX_TREE_MAX_PATH - 2) - 1 nodes to build path to last level nodes:
+ *    - -2, because the root node and last level nodes are already accounted for;
+ *    - -1, because one second level node is shared between the item with index 0
+ *      and the item below the boundary;
+ *
+ * Hence:
+ */
+#if RADIX_TREE_PRELOAD_NR > 1
+#define RADIX_TREE_PRELOAD_MAX \
+	( RADIX_TREE_MAX_PATH + \
+	  RADIX_TREE_PRELOAD_NR / RADIX_TREE_MAP_SIZE + 1 + \
+	  2 * (RADIX_TREE_MAX_PATH - 2) - 1)
+#else
+#define RADIX_TREE_PRELOAD_MAX RADIX_TREE_PRELOAD_MIN
+#endif
 
 /*
  * Per-cpu pool of preloaded nodes
  */
 struct radix_tree_preload {
 	int nr;
-	struct radix_tree_node *nodes[RADIX_TREE_PRELOAD_SIZE];
+	struct radix_tree_node *nodes[RADIX_TREE_PRELOAD_MAX];
 };
 static DEFINE_PER_CPU(struct radix_tree_preload, radix_tree_preloads) = { 0, };
 
@@ -263,29 +303,35 @@ radix_tree_node_free(struct radix_tree_node *node)
 
 /*
  * Load up this CPU's radix_tree_node buffer with sufficient objects to
- * ensure that the addition of a single element in the tree cannot fail.  On
- * success, return zero, with preemption disabled.  On error, return -ENOMEM
+ * ensure that the addition of *contiguous* items in the tree cannot fail.
+ * On success, return zero, with preemption disabled.  On error, return -ENOMEM
  * with preemption not disabled.
  *
  * To make use of this facility, the radix tree must be initialised without
  * __GFP_WAIT being passed to INIT_RADIX_TREE().
  */
-static int __radix_tree_preload(gfp_t gfp_mask)
+static int __radix_tree_preload_contig(unsigned size, gfp_t gfp_mask)
 {
 	struct radix_tree_preload *rtp;
 	struct radix_tree_node *node;
 	int ret = -ENOMEM;
+	int preload_target = RADIX_TREE_PRELOAD_MIN +
+		DIV_ROUND_UP(size - 1, RADIX_TREE_MAP_SIZE);
+
+	if (WARN_ONCE(size > RADIX_TREE_PRELOAD_NR,
+				"too large preload requested"))
+		return -ENOMEM;
 
 	preempt_disable();
 	rtp = &__get_cpu_var(radix_tree_preloads);
-	while (rtp->nr < ARRAY_SIZE(rtp->nodes)) {
+	while (rtp->nr < preload_target) {
 		preempt_enable();
 		node = kmem_cache_alloc(radix_tree_node_cachep, gfp_mask);
 		if (node == NULL)
 			goto out;
 		preempt_disable();
 		rtp = &__get_cpu_var(radix_tree_preloads);
-		if (rtp->nr < ARRAY_SIZE(rtp->nodes))
+		if (rtp->nr < preload_target)
 			rtp->nodes[rtp->nr++] = node;
 		else
 			kmem_cache_free(radix_tree_node_cachep, node);
@@ -308,7 +354,7 @@ int radix_tree_preload(gfp_t gfp_mask)
 {
 	/* Warn on non-sensical use... */
 	WARN_ON_ONCE(!(gfp_mask & __GFP_WAIT));
-	return __radix_tree_preload(gfp_mask);
+	return __radix_tree_preload_contig(1, gfp_mask);
 }
 EXPORT_SYMBOL(radix_tree_preload);
 
@@ -320,13 +366,22 @@ EXPORT_SYMBOL(radix_tree_preload);
 int radix_tree_maybe_preload(gfp_t gfp_mask)
 {
 	if (gfp_mask & __GFP_WAIT)
-		return __radix_tree_preload(gfp_mask);
+		return __radix_tree_preload_contig(1, gfp_mask);
 	/* Preloading doesn't help anything with this gfp mask, skip it */
 	preempt_disable();
 	return 0;
 }
 EXPORT_SYMBOL(radix_tree_maybe_preload);
 
+int radix_tree_maybe_preload_contig(unsigned size, gfp_t gfp_mask)
+{
+	if (gfp_mask & __GFP_WAIT)
+		return __radix_tree_preload_contig(size, gfp_mask);
+	/* Preloading doesn't help anything with this gfp mask, skip it */
+	preempt_disable();
+	return 0;
+}
+
 /*
  *	Return the maximum key which can be store into a
  *	radix tree with height HEIGHT.
@@ -1483,6 +1538,22 @@ static int radix_tree_callback(struct notifier_block *nfb,
 
 void __init radix_tree_init(void)
 {
+	/*
+	 * Restrictions on RADIX_TREE_PRELOAD_NR simplify RADIX_TREE_PRELOAD_MAX
+	 * calculation, it's already complex enough:
+	 *  - it must be a multiple of RADIX_TREE_MAP_SIZE, otherwise we will
+	 *    have to round it up to the next RADIX_TREE_MAP_SIZE multiple and we
+	 *    don't win anything;
+	 *  - it must be less than the number of items a 2-level tree can contain.
+	 *    It's easier to calculate number of internal nodes required
+	 *    this way.
+	 */
+	if (RADIX_TREE_PRELOAD_NR != 1) {
+		BUILD_BUG_ON(RADIX_TREE_PRELOAD_NR % RADIX_TREE_MAP_SIZE != 0);
+		BUILD_BUG_ON(RADIX_TREE_PRELOAD_NR >
+				1UL << (2 * RADIX_TREE_MAP_SHIFT));
+	}
+
 	radix_tree_node_cachep = kmem_cache_create("radix_tree_node",
 			sizeof(struct radix_tree_node), 0,
 			SLAB_PANIC | SLAB_RECLAIM_ACCOUNT,
-- 
1.8.3.2
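
For context, a caller of the new interface is expected to look roughly like
this (an illustrative sketch, not code from this series; the batched page
cache insertion paths elsewhere in the series are the intended user, and
add_huge_page_sketch() is a made-up name):

 static int add_huge_page_sketch(struct address_space *mapping,
 				struct page *page, pgoff_t index, gfp_t gfp)
 {
 	int i, err;

 	/* Reserve nodes for HPAGE_PMD_NR contiguous slots before locking. */
 	err = radix_tree_maybe_preload_contig(HPAGE_PMD_NR,
 					      gfp & ~__GFP_HIGHMEM);
 	if (err)
 		return err;

 	spin_lock_irq(&mapping->tree_lock);
 	for (i = 0; i < HPAGE_PMD_NR; i++) {
 		/* With __GFP_WAIT in gfp, the preload guarantees no -ENOMEM here. */
 		err = radix_tree_insert(&mapping->page_tree, index + i,
 					page + i);
 		if (err)
 			break;
 	}
 	spin_unlock_irq(&mapping->tree_lock);
 	radix_tree_preload_end();	/* re-enables preemption */

 	return err;
 }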

-- 
 Kirill A. Shutemov

^ permalink raw reply related	[flat|nested] 116+ messages in thread

* Re: [PATCH 01/23] radix-tree: implement preload for multiple contiguous elements
@ 2013-08-08  8:45                 ` Kirill A. Shutemov
  0 siblings, 0 replies; 116+ messages in thread
From: Kirill A. Shutemov @ 2013-08-08  8:45 UTC (permalink / raw)
  To: Jan Kara, Matthew Wilcox, Dave Hansen
  Cc: Kirill A. Shutemov, Andrea Arcangeli, Andrew Morton, Al Viro,
	Hugh Dickins, Wu Fengguang, Mel Gorman, linux-mm, Andi Kleen,
	Kirill A. Shutemov, Hillf Danton, Ning Qu, linux-fsdevel,
	linux-kernel

Kirill A. Shutemov wrote:
> In this case it should use 39 nodes, but it uses only 38. I can't understand why. :(

Okay, I've got it. We share one 2nd level node.

Patch is below. Please review.

^ permalink raw reply	[flat|nested] 116+ messages in thread

* Re: [PATCH 01/23] radix-tree: implement preload for multiple contiguous elements
  2013-08-08  8:45                 ` Kirill A. Shutemov
@ 2013-08-08 10:04                   ` Jan Kara
  -1 siblings, 0 replies; 116+ messages in thread
From: Jan Kara @ 2013-08-08 10:04 UTC (permalink / raw)
  To: Kirill A. Shutemov
  Cc: Jan Kara, Matthew Wilcox, Dave Hansen, Andrea Arcangeli,
	Andrew Morton, Al Viro, Hugh Dickins, Wu Fengguang, Mel Gorman,
	linux-mm, Andi Kleen, Kirill A. Shutemov, Hillf Danton, Ning Qu,
	linux-fsdevel, linux-kernel

On Thu 08-08-13 11:45:05, Kirill A. Shutemov wrote:
> From 35ba5687ea7aea98645da34ddd0be01a9de8b32d Mon Sep 17 00:00:00 2001
> From: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
> Date: Wed, 9 Jan 2013 13:25:47 +0200
> Subject: [PATCH] radix-tree: implement preload for multiple contiguous
>  elements
> 
> The radix tree is variable-height, so an insert operation not only has
> to build the branch to its corresponding item, it also has to build the
> branch to existing items if the size has to be increased (by
> radix_tree_extend).
> 
> The worst case is a zero height tree with just a single item at index 0,
> and then inserting an item at index ULONG_MAX. This requires 2 new branches
> of RADIX_TREE_MAX_PATH size to be created, with only the root node shared.
> 
> Radix tree is usually protected by spin lock. It means we want to
> pre-allocate required memory before taking the lock.
> 
> Currently radix_tree_preload() only guarantees enough nodes to insert
> one element. It's a hard limit. For transparent huge page cache we want
> to insert HPAGE_PMD_NR (512 on x86-64) entries to address_space at once.
> 
> This patch introduces radix_tree_preload_count(). It allows to
> preallocate nodes enough to insert a number of *contiguous* elements.
> The feature costs about 9KiB per-CPU on x86_64, details below.
> 
> Preload uses per-CPU array to store nodes. The total cost of preload is
> "array size" * sizeof(void*) * NR_CPUS. We want to increase array size
> to be able to handle 512 entries at once.
> 
> Size of array depends on system bitness and on RADIX_TREE_MAP_SHIFT.
> 
> We have three possible RADIX_TREE_MAP_SHIFT:
> 
>  #ifdef __KERNEL__
>  #define RADIX_TREE_MAP_SHIFT	(CONFIG_BASE_SMALL ? 4 : 6)
>  #else
>  #define RADIX_TREE_MAP_SHIFT	3	/* For more stressful testing */
>  #endif
> 
> We are not going to use transparent huge page cache on small machines or
> in userspace, so we are interested in RADIX_TREE_MAP_SHIFT=6.
> 
> On 64-bit system old array size is 21, new is 37. Per-CPU feature
> overhead is
>  for preload array:
>    (37 - 21) * sizeof(void*) = 128 bytes
>  plus, if the preload array is full
>    (37 - 21) * sizeof(struct radix_tree_node) = 16 * 560 = 8960 bytes
>  total: 9088 bytes
> 
> On 32-bit system old array size is 11, new is 22. Per-CPU feature
> overhead is
>  for preload array:
>    (22 - 11) * sizeof(void*) = 44 bytes
>  plus, if the preload array is full
>    (22 - 11) * sizeof(struct radix_tree_node) = 11 * 296 = 3256 bytes
>  total: 3300 bytes
> 
> Since only THP uses batched preload at the moment, we disable (set max
> preload to 1) it if !CONFIG_TRANSPARENT_HUGEPAGE_PAGECACHE. This can be
> changed in the future.
> 
> Signed-off-by: Matthew Wilcox <willy@linux.intel.com>
> Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
> Acked-by: Dave Hansen <dave.hansen@linux.intel.com>
> ---
>  include/linux/radix-tree.h | 11 ++++++
>  lib/radix-tree.c           | 89 +++++++++++++++++++++++++++++++++++++++++-----
>  2 files changed, 91 insertions(+), 9 deletions(-)
> 
> diff --git a/include/linux/radix-tree.h b/include/linux/radix-tree.h
> index 4039407..3bf0b3e 100644
> --- a/include/linux/radix-tree.h
> +++ b/include/linux/radix-tree.h
> @@ -83,6 +83,16 @@ do {									\
>  	(root)->rnode = NULL;						\
>  } while (0)
>  
> +#ifdef CONFIG_TRANSPARENT_HUGEPAGE_PAGECACHE
> +/*
> + * At the moment only THP uses preload for more then on item for batched
> + * pagecache manipulations.
> + */
> +#define RADIX_TREE_PRELOAD_NR	512
> +#else
> +#define RADIX_TREE_PRELOAD_NR	1
> +#endif
> +
>  /**
>   * Radix-tree synchronization
>   *
> @@ -232,6 +242,7 @@ unsigned long radix_tree_prev_hole(struct radix_tree_root *root,
>  				unsigned long index, unsigned long max_scan);
>  int radix_tree_preload(gfp_t gfp_mask);
>  int radix_tree_maybe_preload(gfp_t gfp_mask);
> +int radix_tree_maybe_preload_contig(unsigned size, gfp_t gfp_mask);
>  void radix_tree_init(void);
>  void *radix_tree_tag_set(struct radix_tree_root *root,
>  			unsigned long index, unsigned int tag);
> diff --git a/lib/radix-tree.c b/lib/radix-tree.c
> index 7811ed3..980e4c4 100644
> --- a/lib/radix-tree.c
> +++ b/lib/radix-tree.c
> @@ -84,14 +84,54 @@ static struct kmem_cache *radix_tree_node_cachep;
>   * of RADIX_TREE_MAX_PATH size to be created, with only the root node shared.
>   * Hence:
>   */
> -#define RADIX_TREE_PRELOAD_SIZE (RADIX_TREE_MAX_PATH * 2 - 1)
> +#define RADIX_TREE_PRELOAD_MIN (RADIX_TREE_MAX_PATH * 2 - 1)
> +
> +/*
> + * Inserting N contiguous items is more complex. To simplify calculation, let's
> + * limit N (validated in radix_tree_init()):
> + *  - N is multiplier of RADIX_TREE_MAP_SIZE (or 1);
  Is this limitation really worth the one reserved item you save?


> + *  - N <= number of items 2-level tree can contain:
> + *    1UL << (2 * RADIX_TREE_MAP_SHIFT).
> + *
> + * No limitation on insert index alignment.
> + *
> + * Then the worst case is tree with only one element at index 0 and we add N
> + * items where at least one index requires max tree high and we cross boundary
> + * between items in root node.
> + *
> + * Basically, at least one index is less then
> + *
> + * 1UL << ((RADIX_TREE_MAX_PATH - 1) * RADIX_TREE_MAP_SHIFT)
> + *
> + * and one is equal to.
> + *
> + * In this case we need:
> + *
> + * - RADIX_TREE_MAX_PATH nodes to build new path to item with index 0;
> + * - N / RADIX_TREE_MAP_SIZE + 1 nodes for last level nodes for new items:
> + *    - +1 is for misalinged case;
> + * - 2 * (RADIX_TREE_MAX_PATH - 2) - 1 nodes to build path to last level nodes:
> + *    - -2, because root node and last level nodes are already accounted);
> + *    - -1, because one second level node is shared between item with index 0
> + *      and item below the boundary;
> + *
> + * Hence:
> + */
> +#if RADIX_TREE_PRELOAD_NR > 1
> +#define RADIX_TREE_PRELOAD_MAX \
> +	( RADIX_TREE_MAX_PATH + \
> +	  RADIX_TREE_PRELOAD_NR / RADIX_TREE_MAP_SIZE + 1 + \
> +	  2 * (RADIX_TREE_MAX_PATH - 2) - 1)
> +#else
> +#define RADIX_TREE_PRELOAD_MAX RADIX_TREE_PRELOAD_MIN
> +#endif
>  
>  /*
>   * Per-cpu pool of preloaded nodes
>   */
>  struct radix_tree_preload {
>  	int nr;
> -	struct radix_tree_node *nodes[RADIX_TREE_PRELOAD_SIZE];
> +	struct radix_tree_node *nodes[RADIX_TREE_PRELOAD_MAX];
>  };
>  static DEFINE_PER_CPU(struct radix_tree_preload, radix_tree_preloads) = { 0, };
>  
> @@ -263,29 +303,35 @@ radix_tree_node_free(struct radix_tree_node *node)
>  
>  /*
>   * Load up this CPU's radix_tree_node buffer with sufficient objects to
> - * ensure that the addition of a single element in the tree cannot fail.  On
> - * success, return zero, with preemption disabled.  On error, return -ENOMEM
> + * ensure that the addition of *contiguous* items in the tree cannot fail.
> + * On success, return zero, with preemption disabled.  On error, return -ENOMEM
>   * with preemption not disabled.
>   *
>   * To make use of this facility, the radix tree must be initialised without
>   * __GFP_WAIT being passed to INIT_RADIX_TREE().
>   */
> -static int __radix_tree_preload(gfp_t gfp_mask)
> +static int __radix_tree_preload_contig(unsigned size, gfp_t gfp_mask)
>  {
>  	struct radix_tree_preload *rtp;
>  	struct radix_tree_node *node;
>  	int ret = -ENOMEM;
> +	int preload_target = RADIX_TREE_PRELOAD_MIN +
> +		DIV_ROUND_UP(size - 1, RADIX_TREE_MAP_SIZE);
  So you should also warn if size % RADIX_TREE_MAP_SIZE != 0, right?
Otherwise the patch looks fine.

								Honza

-- 
Jan Kara <jack@suse.cz>
SUSE Labs, CR

^ permalink raw reply	[flat|nested] 116+ messages in thread

* Re: [PATCH 22/23] thp, mm: split huge page on mmap file page
  2013-08-04  2:17   ` Kirill A. Shutemov
@ 2013-08-08 20:49     ` Khalid Aziz
  -1 siblings, 0 replies; 116+ messages in thread
From: Khalid Aziz @ 2013-08-08 20:49 UTC (permalink / raw)
  To: Kirill A. Shutemov
  Cc: Andrea Arcangeli, Andrew Morton, Al Viro, Hugh Dickins,
	Wu Fengguang, Jan Kara, Mel Gorman, linux-mm, Andi Kleen,
	Matthew Wilcox, Kirill A. Shutemov, Hillf Danton, Dave Hansen,
	Ning Qu, linux-fsdevel, linux-kernel

On Sun, 2013-08-04 at 05:17 +0300, Kirill A. Shutemov wrote:
> From: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
> 
> We are not ready to mmap file-backed tranparent huge pages. Let's split
> them on fault attempt.
> 
> Later we'll implement mmap() properly and this code path be used for
> fallback cases.
> 
> Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
> ---
>  mm/filemap.c | 2 ++
>  1 file changed, 2 insertions(+)
> 
> diff --git a/mm/filemap.c b/mm/filemap.c
> index ed65af5..f7857ef 100644
> --- a/mm/filemap.c
> +++ b/mm/filemap.c
> @@ -1743,6 +1743,8 @@ retry_find:
>  			goto no_cached_page;
>  	}
>  
> +	if (PageTransCompound(page))
> +		split_huge_page(compound_trans_head(page));

Since PageTransCompound(page) returns true for transparent huge pages as
well as hugetlbfs pages, could this code end up splitting hugetlbfs pages
on an mmap() of a hugetlbfs file? hugetlbfs pages are not supposed to be
split, right?

>  	if (!lock_page_or_retry(page, vma->vm_mm, vmf->flags)) {
>  		page_cache_release(page);
>  		return ret | VM_FAULT_RETRY;

--
Khalid
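
For illustration, one way to express such a guard, if it turns out to be
needed on this path, would be along these lines (a sketch only, not code
from the series; PageHuge() is the hugetlbfs page test):

 	if (PageTransCompound(page) && !PageHuge(page))
 		split_huge_page(compound_trans_head(page));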


^ permalink raw reply	[flat|nested] 116+ messages in thread

* Re: [PATCH 01/23] radix-tree: implement preload for multiple contiguous elements
  2013-08-08 10:04                   ` Jan Kara
  (?)
@ 2013-08-09 11:13                     ` Kirill A. Shutemov
  -1 siblings, 0 replies; 116+ messages in thread
From: Kirill A. Shutemov @ 2013-08-09 11:13 UTC (permalink / raw)
  To: Jan Kara
  Cc: Kirill A. Shutemov, Jan Kara, Matthew Wilcox, Dave Hansen,
	Andrea Arcangeli, Andrew Morton, Al Viro, Hugh Dickins,
	Wu Fengguang, Mel Gorman, linux-mm, Andi Kleen,
	Kirill A. Shutemov, Hillf Danton, Ning Qu, linux-fsdevel,
	linux-kernel

Jan Kara wrote:
> On Thu 08-08-13 11:45:05, Kirill A. Shutemov wrote:
> > From 35ba5687ea7aea98645da34ddd0be01a9de8b32d Mon Sep 17 00:00:00 2001
> > From: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
> > Date: Wed, 9 Jan 2013 13:25:47 +0200
> > Subject: [PATCH] radix-tree: implement preload for multiple contiguous
> >  elements
> > 
> > The radix tree is variable-height, so an insert operation not only has
> > to build the branch to its corresponding item, it also has to build the
> > branch to existing items if the size has to be increased (by
> > radix_tree_extend).
> > 
> > The worst case is a zero height tree with just a single item at index 0,
> > and then inserting an item at index ULONG_MAX. This requires 2 new branches
> > of RADIX_TREE_MAX_PATH size to be created, with only the root node shared.
> > 
> > Radix tree is usually protected by spin lock. It means we want to
> > pre-allocate required memory before taking the lock.
> > 
> > Currently radix_tree_preload() only guarantees enough nodes to insert
> > one element. It's a hard limit. For transparent huge page cache we want
> > to insert HPAGE_PMD_NR (512 on x86-64) entries to address_space at once.
> > 
> > This patch introduces radix_tree_preload_count(). It allows to
> > preallocate nodes enough to insert a number of *contiguous* elements.
> > The feature costs about 9KiB per-CPU on x86_64, details below.
> > 
> > Preload uses per-CPU array to store nodes. The total cost of preload is
> > "array size" * sizeof(void*) * NR_CPUS. We want to increase array size
> > to be able to handle 512 entries at once.
> > 
> > Size of array depends on system bitness and on RADIX_TREE_MAP_SHIFT.
> > 
> > We have three possible RADIX_TREE_MAP_SHIFT:
> > 
> >  #ifdef __KERNEL__
> >  #define RADIX_TREE_MAP_SHIFT	(CONFIG_BASE_SMALL ? 4 : 6)
> >  #else
> >  #define RADIX_TREE_MAP_SHIFT	3	/* For more stressful testing */
> >  #endif
> > 
> > We are not going to use transparent huge page cache on small machines or
> > in userspace, so we are interested in RADIX_TREE_MAP_SHIFT=6.
> > 
> > On 64-bit system old array size is 21, new is 37. Per-CPU feature
> > overhead is
> >  for preload array:
> >    (37 - 21) * sizeof(void*) = 128 bytes
> >  plus, if the preload array is full
> >    (37 - 21) * sizeof(struct radix_tree_node) = 16 * 560 = 8960 bytes
> >  total: 9088 bytes
> > 
> > On 32-bit system old array size is 11, new is 22. Per-CPU feature
> > overhead is
> >  for preload array:
> >    (22 - 11) * sizeof(void*) = 44 bytes
> >  plus, if the preload array is full
> >    (22 - 11) * sizeof(struct radix_tree_node) = 11 * 296 = 3256 bytes
> >  total: 3300 bytes
> > 
> > Since only THP uses batched preload at the moment, we disable (set max
> > preload to 1) it if !CONFIG_TRANSPARENT_HUGEPAGE_PAGECACHE. This can be
> > changed in the future.
> > 
> > Signed-off-by: Matthew Wilcox <willy@linux.intel.com>
> > Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
> > Acked-by: Dave Hansen <dave.hansen@linux.intel.com>
> > ---
> >  include/linux/radix-tree.h | 11 ++++++
> >  lib/radix-tree.c           | 89 +++++++++++++++++++++++++++++++++++++++++-----
> >  2 files changed, 91 insertions(+), 9 deletions(-)
> > 
> > diff --git a/include/linux/radix-tree.h b/include/linux/radix-tree.h
> > index 4039407..3bf0b3e 100644
> > --- a/include/linux/radix-tree.h
> > +++ b/include/linux/radix-tree.h
> > @@ -83,6 +83,16 @@ do {									\
> >  	(root)->rnode = NULL;						\
> >  } while (0)
> >  
> > +#ifdef CONFIG_TRANSPARENT_HUGEPAGE_PAGECACHE
> > +/*
> > + * At the moment only THP uses preload for more then on item for batched
> > + * pagecache manipulations.
> > + */
> > +#define RADIX_TREE_PRELOAD_NR	512
> > +#else
> > +#define RADIX_TREE_PRELOAD_NR	1
> > +#endif
> > +
> >  /**
> >   * Radix-tree synchronization
> >   *
> > @@ -232,6 +242,7 @@ unsigned long radix_tree_prev_hole(struct radix_tree_root *root,
> >  				unsigned long index, unsigned long max_scan);
> >  int radix_tree_preload(gfp_t gfp_mask);
> >  int radix_tree_maybe_preload(gfp_t gfp_mask);
> > +int radix_tree_maybe_preload_contig(unsigned size, gfp_t gfp_mask);
> >  void radix_tree_init(void);
> >  void *radix_tree_tag_set(struct radix_tree_root *root,
> >  			unsigned long index, unsigned int tag);
> > diff --git a/lib/radix-tree.c b/lib/radix-tree.c
> > index 7811ed3..980e4c4 100644
> > --- a/lib/radix-tree.c
> > +++ b/lib/radix-tree.c
> > @@ -84,14 +84,54 @@ static struct kmem_cache *radix_tree_node_cachep;
> >   * of RADIX_TREE_MAX_PATH size to be created, with only the root node shared.
> >   * Hence:
> >   */
> > -#define RADIX_TREE_PRELOAD_SIZE (RADIX_TREE_MAX_PATH * 2 - 1)
> > +#define RADIX_TREE_PRELOAD_MIN (RADIX_TREE_MAX_PATH * 2 - 1)
> > +
> > +/*
> > + * Inserting N contiguous items is more complex. To simplify calculation, let's
> > + * limit N (validated in radix_tree_init()):
> > + *  - N is multiplier of RADIX_TREE_MAP_SIZE (or 1);
>   Is this limitation really worth the one reserved item you save?

It's not really a limitation. It just doesn't make any sense to lower
RADIX_TREE_PRELOAD_NR to anything that is not on a RADIX_TREE_MAP_SIZE
boundary, since we would have to round it up anyway.

I made one more mistake: I modeled a situation other than the one I
described, so the last -1 is wrong. It is fixed in the patch below.
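
Spelling the corrected math out for the 64-bit, RADIX_TREE_MAP_SHIFT=6 case
(illustrative numbers only, matching the 38/23 figures in the updated
commit message):

 RADIX_TREE_MAX_PATH     = DIV_ROUND_UP(64, 6)   = 11
 last-level nodes        = 512 / 64 + 1          =  9
 path to last level      = 2 * (11 - 2)          = 18
 RADIX_TREE_PRELOAD_MAX  = 11 + 9 + 18           = 38   (37 with the old -1)

On 32-bit (RADIX_TREE_MAX_PATH = 6) the same sum gives 6 + 9 + 8 = 23.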

> > -static int __radix_tree_preload(gfp_t gfp_mask)
> > +static int __radix_tree_preload_contig(unsigned size, gfp_t gfp_mask)
> >  {
> >  	struct radix_tree_preload *rtp;
> >  	struct radix_tree_node *node;
> >  	int ret = -ENOMEM;
> > +	int preload_target = RADIX_TREE_PRELOAD_MIN +
> > +		DIV_ROUND_UP(size - 1, RADIX_TREE_MAP_SIZE);
>   So you should also warn if size % RADIX_TREE_MAP_SIZE != 0, right?

I've reworked the logic here to match the RADIX_TREE_PRELOAD_MAX calculation.

> Otherwise the patch looks fine.

Ack?

From 0e7d481093a4ded0dfb58e87d4edda003f3b2fd6 Mon Sep 17 00:00:00 2001
From: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
Date: Wed, 9 Jan 2013 13:25:47 +0200
Subject: [PATCH] radix-tree: implement preload for multiple contiguous
 elements

The radix tree is variable-height, so an insert operation not only has
to build the branch to its corresponding item, it also has to build the
branch to existing items if the size has to be increased (by
radix_tree_extend).

The worst case is a zero height tree with just a single item at index 0,
and then inserting an item at index ULONG_MAX. This requires 2 new branches
of RADIX_TREE_MAX_PATH size to be created, with only the root node shared.

Radix tree is usually protected by spin lock. It means we want to
pre-allocate required memory before taking the lock.

Currently radix_tree_preload() only guarantees enough nodes to insert
one element. It's a hard limit. For transparent huge page cache we want
to insert HPAGE_PMD_NR (512 on x86-64) entries to address_space at once.

This patch introduces radix_tree_maybe_preload_contig(). It allows
preallocating enough nodes to insert a number of *contiguous* elements.
The feature costs about 9.5KiB per-CPU on x86_64, details below.

Preload uses per-CPU array to store nodes. The total cost of preload is
"array size" * sizeof(void*) * NR_CPUS. We want to increase array size
to be able to handle 512 entries at once.

Size of array depends on system bitness and on RADIX_TREE_MAP_SHIFT.

We have three possible RADIX_TREE_MAP_SHIFT:

 #ifdef __KERNEL__
 #define RADIX_TREE_MAP_SHIFT	(CONFIG_BASE_SMALL ? 4 : 6)
 #else
 #define RADIX_TREE_MAP_SHIFT	3	/* For more stressful testing */
 #endif

We are not going to use transparent huge page cache on small machines or
in userspace, so we are interested in RADIX_TREE_MAP_SHIFT=6.

On 64-bit system old array size is 21, new is 38. Per-CPU feature
overhead is
 for preload array:
   (38 - 21) * sizeof(void*) = 136 bytes
 plus, if the preload array is full
   (38 - 21) * sizeof(struct radix_tree_node) = 17 * 560 = 9520 bytes
 total: 9656 bytes

On 32-bit system old array size is 11, new is 23. Per-CPU feature
overhead is
 for preload array:
   (23 - 11) * sizeof(void*) = 48 bytes
 plus, if the preload array is full
   (23 - 11) * sizeof(struct radix_tree_node) = 12 * 296 = 3552 bytes
 total: 3600 bytes

Since only THP uses batched preload at the moment, we disable it (set the max
preload to 1) if !CONFIG_TRANSPARENT_HUGEPAGE_PAGECACHE. This can be
changed in the future.

Signed-off-by: Matthew Wilcox <willy@linux.intel.com>
Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Acked-by: Dave Hansen <dave.hansen@linux.intel.com>
---
 include/linux/radix-tree.h | 11 ++++++
 lib/radix-tree.c           | 95 +++++++++++++++++++++++++++++++++++++++++-----
 2 files changed, 97 insertions(+), 9 deletions(-)

diff --git a/include/linux/radix-tree.h b/include/linux/radix-tree.h
index 4039407..3bf0b3e 100644
--- a/include/linux/radix-tree.h
+++ b/include/linux/radix-tree.h
@@ -83,6 +83,16 @@ do {									\
 	(root)->rnode = NULL;						\
 } while (0)
 
+#ifdef CONFIG_TRANSPARENT_HUGEPAGE_PAGECACHE
+/*
+ * At the moment only THP uses preload for more than one item for batched
+ * pagecache manipulations.
+ */
+#define RADIX_TREE_PRELOAD_NR	512
+#else
+#define RADIX_TREE_PRELOAD_NR	1
+#endif
+
 /**
  * Radix-tree synchronization
  *
@@ -232,6 +242,7 @@ unsigned long radix_tree_prev_hole(struct radix_tree_root *root,
 				unsigned long index, unsigned long max_scan);
 int radix_tree_preload(gfp_t gfp_mask);
 int radix_tree_maybe_preload(gfp_t gfp_mask);
+int radix_tree_maybe_preload_contig(unsigned size, gfp_t gfp_mask);
 void radix_tree_init(void);
 void *radix_tree_tag_set(struct radix_tree_root *root,
 			unsigned long index, unsigned int tag);
diff --git a/lib/radix-tree.c b/lib/radix-tree.c
index 7811ed3..1c19595 100644
--- a/lib/radix-tree.c
+++ b/lib/radix-tree.c
@@ -84,14 +84,52 @@ static struct kmem_cache *radix_tree_node_cachep;
  * of RADIX_TREE_MAX_PATH size to be created, with only the root node shared.
  * Hence:
  */
-#define RADIX_TREE_PRELOAD_SIZE (RADIX_TREE_MAX_PATH * 2 - 1)
+#define RADIX_TREE_PRELOAD_MIN (RADIX_TREE_MAX_PATH * 2 - 1)
+
+/*
+ * Inserting N contiguous items is more complex. To simplify calculation, let's
+ * limit N (validated in radix_tree_init()):
+ *  - N is a multiple of RADIX_TREE_MAP_SIZE (or 1);
+ *  - N <= number of items 2-level tree can contain:
+ *    1UL << (2 * RADIX_TREE_MAP_SHIFT).
+ *
+ * No limitation on insert index alignment.
+ *
+ * Then the worst case is a tree with only one element at index 0, to which we
+ * add N items where at least one index requires the max tree height and the
+ * items cross a boundary between slots in the root node.
+ *
+ * Basically, at least one index is less than
+ *
+ * 1UL << ((RADIX_TREE_MAX_PATH - 1) * RADIX_TREE_MAP_SHIFT)
+ *
+ * and one is equal to it.
+ *
+ * In this case we need:
+ *
+ * - RADIX_TREE_MAX_PATH nodes to build new path to item with index 0;
+ * - N / RADIX_TREE_MAP_SIZE + 1 nodes for last level nodes for new items:
+ *    - +1 is for the misaligned case;
+ * - 2 * (RADIX_TREE_MAX_PATH - 2) - 1 nodes to build path to last level nodes:
+ *    - -2, because the root node and last level nodes are already accounted for.
+ *
+ * Hence:
+ */
+#if RADIX_TREE_PRELOAD_NR > 1
+#define RADIX_TREE_PRELOAD_MAX \
+	( RADIX_TREE_MAX_PATH + \
+	  RADIX_TREE_PRELOAD_NR / RADIX_TREE_MAP_SIZE + 1 + \
+	  2 * (RADIX_TREE_MAX_PATH - 2))
+#else
+#define RADIX_TREE_PRELOAD_MAX RADIX_TREE_PRELOAD_MIN
+#endif
 
 /*
  * Per-cpu pool of preloaded nodes
  */
 struct radix_tree_preload {
 	int nr;
-	struct radix_tree_node *nodes[RADIX_TREE_PRELOAD_SIZE];
+	struct radix_tree_node *nodes[RADIX_TREE_PRELOAD_MAX];
 };
 static DEFINE_PER_CPU(struct radix_tree_preload, radix_tree_preloads) = { 0, };
 
@@ -263,29 +301,43 @@ radix_tree_node_free(struct radix_tree_node *node)
 
 /*
  * Load up this CPU's radix_tree_node buffer with sufficient objects to
- * ensure that the addition of a single element in the tree cannot fail.  On
- * success, return zero, with preemption disabled.  On error, return -ENOMEM
+ * ensure that the addition of *contiguous* items in the tree cannot fail.
+ * On success, return zero, with preemption disabled.  On error, return -ENOMEM
  * with preemption not disabled.
  *
  * To make use of this facility, the radix tree must be initialised without
  * __GFP_WAIT being passed to INIT_RADIX_TREE().
  */
-static int __radix_tree_preload(gfp_t gfp_mask)
+static int __radix_tree_preload_contig(unsigned size, gfp_t gfp_mask)
 {
 	struct radix_tree_preload *rtp;
 	struct radix_tree_node *node;
 	int ret = -ENOMEM;
+	int preload_target = RADIX_TREE_PRELOAD_MIN;
 
+	if (size > 1) {
+		size = round_up(size, RADIX_TREE_MAP_SIZE);
+		if (WARN_ONCE(size > RADIX_TREE_PRELOAD_NR,
+					"too large preload requested"))
+			return -ENOMEM;
+
+		/* The same math as with RADIX_TREE_PRELOAD_MAX */
+		preload_target = RADIX_TREE_MAX_PATH +
+			size / RADIX_TREE_MAP_SIZE + 1 +
+			2 * (RADIX_TREE_MAX_PATH - 2);
+	}
+
+	BUG_ON(preload_target > RADIX_TREE_PRELOAD_MAX);
 	preempt_disable();
 	rtp = &__get_cpu_var(radix_tree_preloads);
-	while (rtp->nr < ARRAY_SIZE(rtp->nodes)) {
+	while (rtp->nr < preload_target) {
 		preempt_enable();
 		node = kmem_cache_alloc(radix_tree_node_cachep, gfp_mask);
 		if (node == NULL)
 			goto out;
 		preempt_disable();
 		rtp = &__get_cpu_var(radix_tree_preloads);
-		if (rtp->nr < ARRAY_SIZE(rtp->nodes))
+		if (rtp->nr < preload_target)
 			rtp->nodes[rtp->nr++] = node;
 		else
 			kmem_cache_free(radix_tree_node_cachep, node);
@@ -308,7 +360,7 @@ int radix_tree_preload(gfp_t gfp_mask)
 {
 	/* Warn on non-sensical use... */
 	WARN_ON_ONCE(!(gfp_mask & __GFP_WAIT));
-	return __radix_tree_preload(gfp_mask);
+	return __radix_tree_preload_contig(1, gfp_mask);
 }
 EXPORT_SYMBOL(radix_tree_preload);
 
@@ -320,13 +372,22 @@ EXPORT_SYMBOL(radix_tree_preload);
 int radix_tree_maybe_preload(gfp_t gfp_mask)
 {
 	if (gfp_mask & __GFP_WAIT)
-		return __radix_tree_preload(gfp_mask);
+		return __radix_tree_preload_contig(1, gfp_mask);
 	/* Preloading doesn't help anything with this gfp mask, skip it */
 	preempt_disable();
 	return 0;
 }
 EXPORT_SYMBOL(radix_tree_maybe_preload);
 
+int radix_tree_maybe_preload_contig(unsigned size, gfp_t gfp_mask)
+{
+	if (gfp_mask & __GFP_WAIT)
+		return __radix_tree_preload_contig(size, gfp_mask);
+	/* Preloading doesn't help anything with this gfp mask, skip it */
+	preempt_disable();
+	return 0;
+}
+
 /*
  *	Return the maximum key which can be store into a
  *	radix tree with height HEIGHT.
@@ -1483,6 +1544,22 @@ static int radix_tree_callback(struct notifier_block *nfb,
 
 void __init radix_tree_init(void)
 {
+	/*
+	 * Restrictions on RADIX_TREE_PRELOAD_NR simplify RADIX_TREE_PRELOAD_MAX
+	 * calculation, it's already complex enough:
+	 *  - it must be a multiple of RADIX_TREE_MAP_SIZE, otherwise we will
+	 *    have to round it up to the next RADIX_TREE_MAP_SIZE multiple and we
+	 *    don't win anything;
+	 *  - it must be less than the number of items a 2-level tree can contain.
+	 *    It's easier to calculate number of internal nodes required
+	 *    this way.
+	 */
+	if (RADIX_TREE_PRELOAD_NR != 1) {
+		BUILD_BUG_ON(RADIX_TREE_PRELOAD_NR % RADIX_TREE_MAP_SIZE != 0);
+		BUILD_BUG_ON(RADIX_TREE_PRELOAD_NR >
+				1UL << (2 * RADIX_TREE_MAP_SHIFT));
+	}
+
 	radix_tree_node_cachep = kmem_cache_create("radix_tree_node",
 			sizeof(struct radix_tree_node), 0,
 			SLAB_PANIC | SLAB_RECLAIM_ACCOUNT,
-- 
 Kirill A. Shutemov

^ permalink raw reply related	[flat|nested] 116+ messages in thread

* Re: [PATCH 01/23] radix-tree: implement preload for multiple contiguous elements
@ 2013-08-09 11:13                     ` Kirill A. Shutemov
  0 siblings, 0 replies; 116+ messages in thread
From: Kirill A. Shutemov @ 2013-08-09 11:13 UTC (permalink / raw)
  To: Jan Kara
  Cc: Kirill A. Shutemov, Jan Kara, Matthew Wilcox, Dave Hansen,
	Andrea Arcangeli, Andrew Morton, Al Viro, Hugh Dickins,
	Wu Fengguang, Mel Gorman, linux-mm, Andi Kleen,
	Kirill A. Shutemov, Hillf Danton, Ning Qu, linux-fsdevel,
	linux-kernel

Jan Kara wrote:
> On Thu 08-08-13 11:45:05, Kirill A. Shutemov wrote:
> > From 35ba5687ea7aea98645da34ddd0be01a9de8b32d Mon Sep 17 00:00:00 2001
> > From: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
> > Date: Wed, 9 Jan 2013 13:25:47 +0200
> > Subject: [PATCH] radix-tree: implement preload for multiple contiguous
> >  elements
> > 
> > The radix tree is variable-height, so an insert operation not only has
> > to build the branch to its corresponding item, it also has to build the
> > branch to existing items if the size has to be increased (by
> > radix_tree_extend).
> > 
> > The worst case is a zero height tree with just a single item at index 0,
> > and then inserting an item at index ULONG_MAX. This requires 2 new branches
> > of RADIX_TREE_MAX_PATH size to be created, with only the root node shared.
> > 
> > Radix tree is usually protected by spin lock. It means we want to
> > pre-allocate required memory before taking the lock.
> > 
> > Currently radix_tree_preload() only guarantees enough nodes to insert
> > one element. It's a hard limit. For transparent huge page cache we want
> > to insert HPAGE_PMD_NR (512 on x86-64) entries to address_space at once.
> > 
> > This patch introduces radix_tree_preload_count(). It allows to
> > preallocate nodes enough to insert a number of *contiguous* elements.
> > The feature costs about 9KiB per-CPU on x86_64, details below.
> > 
> > Preload uses per-CPU array to store nodes. The total cost of preload is
> > "array size" * sizeof(void*) * NR_CPUS. We want to increase array size
> > to be able to handle 512 entries at once.
> > 
> > Size of array depends on system bitness and on RADIX_TREE_MAP_SHIFT.
> > 
> > We have three possible RADIX_TREE_MAP_SHIFT:
> > 
> >  #ifdef __KERNEL__
> >  #define RADIX_TREE_MAP_SHIFT	(CONFIG_BASE_SMALL ? 4 : 6)
> >  #else
> >  #define RADIX_TREE_MAP_SHIFT	3	/* For more stressful testing */
> >  #endif
> > 
> > We are not going to use transparent huge page cache on small machines or
> > in userspace, so we are interested in RADIX_TREE_MAP_SHIFT=6.
> > 
> > On 64-bit system old array size is 21, new is 37. Per-CPU feature
> > overhead is
> >  for preload array:
> >    (37 - 21) * sizeof(void*) = 128 bytes
> >  plus, if the preload array is full
> >    (37 - 21) * sizeof(struct radix_tree_node) = 16 * 560 = 8960 bytes
> >  total: 9088 bytes
> > 
> > On 32-bit system old array size is 11, new is 22. Per-CPU feature
> > overhead is
> >  for preload array:
> >    (22 - 11) * sizeof(void*) = 44 bytes
> >  plus, if the preload array is full
> >    (22 - 11) * sizeof(struct radix_tree_node) = 11 * 296 = 3256 bytes
> >  total: 3300 bytes
> > 
> > Since only THP uses batched preload at the moment, we disable (set max
> > preload to 1) it if !CONFIG_TRANSPARENT_HUGEPAGE_PAGECACHE. This can be
> > changed in the future.
> > 
> > Signed-off-by: Matthew Wilcox <willy@linux.intel.com>
> > Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
> > Acked-by: Dave Hansen <dave.hansen@linux.intel.com>
> > ---
> >  include/linux/radix-tree.h | 11 ++++++
> >  lib/radix-tree.c           | 89 +++++++++++++++++++++++++++++++++++++++++-----
> >  2 files changed, 91 insertions(+), 9 deletions(-)
> > 
> > diff --git a/include/linux/radix-tree.h b/include/linux/radix-tree.h
> > index 4039407..3bf0b3e 100644
> > --- a/include/linux/radix-tree.h
> > +++ b/include/linux/radix-tree.h
> > @@ -83,6 +83,16 @@ do {									\
> >  	(root)->rnode = NULL;						\
> >  } while (0)
> >  
> > +#ifdef CONFIG_TRANSPARENT_HUGEPAGE_PAGECACHE
> > +/*
> > + * At the moment only THP uses preload for more then on item for batched
> > + * pagecache manipulations.
> > + */
> > +#define RADIX_TREE_PRELOAD_NR	512
> > +#else
> > +#define RADIX_TREE_PRELOAD_NR	1
> > +#endif
> > +
> >  /**
> >   * Radix-tree synchronization
> >   *
> > @@ -232,6 +242,7 @@ unsigned long radix_tree_prev_hole(struct radix_tree_root *root,
> >  				unsigned long index, unsigned long max_scan);
> >  int radix_tree_preload(gfp_t gfp_mask);
> >  int radix_tree_maybe_preload(gfp_t gfp_mask);
> > +int radix_tree_maybe_preload_contig(unsigned size, gfp_t gfp_mask);
> >  void radix_tree_init(void);
> >  void *radix_tree_tag_set(struct radix_tree_root *root,
> >  			unsigned long index, unsigned int tag);
> > diff --git a/lib/radix-tree.c b/lib/radix-tree.c
> > index 7811ed3..980e4c4 100644
> > --- a/lib/radix-tree.c
> > +++ b/lib/radix-tree.c
> > @@ -84,14 +84,54 @@ static struct kmem_cache *radix_tree_node_cachep;
> >   * of RADIX_TREE_MAX_PATH size to be created, with only the root node shared.
> >   * Hence:
> >   */
> > -#define RADIX_TREE_PRELOAD_SIZE (RADIX_TREE_MAX_PATH * 2 - 1)
> > +#define RADIX_TREE_PRELOAD_MIN (RADIX_TREE_MAX_PATH * 2 - 1)
> > +
> > +/*
> > + * Inserting N contiguous items is more complex. To simplify calculation, let's
> > + * limit N (validated in radix_tree_init()):
> > + *  - N is multiplier of RADIX_TREE_MAP_SIZE (or 1);
>   Is this limitation really worth the one reserved item you save?

It's not really limitation, It just doesn't make any sense to lower
RADIX_TREE_PRELOAD_NR not to RADIX_TREE_MAP_SIZE boundary, since we will
have to round it up anyway.

I made one more mistake: I've modeled not the situation I've describe, so
last -1 is wrong. Fixed in patch below.

> > -static int __radix_tree_preload(gfp_t gfp_mask)
> > +static int __radix_tree_preload_contig(unsigned size, gfp_t gfp_mask)
> >  {
> >  	struct radix_tree_preload *rtp;
> >  	struct radix_tree_node *node;
> >  	int ret = -ENOMEM;
> > +	int preload_target = RADIX_TREE_PRELOAD_MIN +
> > +		DIV_ROUND_UP(size - 1, RADIX_TREE_MAP_SIZE);
>   So you should also warn if size % RADIX_TREE_MAP_SIZE != 0, right?

I've reworked logic here to match RADIX_TREE_PRELOAD_MAX calculation math.

> Otherwise the patch looks fine.

Ack?

>From 0e7d481093a4ded0dfb58e87d4edda003f3b2fd6 Mon Sep 17 00:00:00 2001
From: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
Date: Wed, 9 Jan 2013 13:25:47 +0200
Subject: [PATCH] radix-tree: implement preload for multiple contiguous
 elements

The radix tree is variable-height, so an insert operation not only has
to build the branch to its corresponding item, it also has to build the
branch to existing items if the size has to be increased (by
radix_tree_extend).

The worst case is a zero height tree with just a single item at index 0,
and then inserting an item at index ULONG_MAX. This requires 2 new branches
of RADIX_TREE_MAX_PATH size to be created, with only the root node shared.

Radix tree is usually protected by spin lock. It means we want to
pre-allocate required memory before taking the lock.

Currently radix_tree_preload() only guarantees enough nodes to insert
one element. It's a hard limit. For transparent huge page cache we want
to insert HPAGE_PMD_NR (512 on x86-64) entries to address_space at once.

This patch introduces radix_tree_preload_count(). It allows to
preallocate nodes enough to insert a number of *contiguous* elements.
The feature costs about 9.5KiB per-CPU on x86_64, details below.

Preload uses per-CPU array to store nodes. The total cost of preload is
"array size" * sizeof(void*) * NR_CPUS. We want to increase array size
to be able to handle 512 entries at once.

Size of array depends on system bitness and on RADIX_TREE_MAP_SHIFT.

We have three possible RADIX_TREE_MAP_SHIFT:

 #ifdef __KERNEL__
 #define RADIX_TREE_MAP_SHIFT	(CONFIG_BASE_SMALL ? 4 : 6)
 #else
 #define RADIX_TREE_MAP_SHIFT	3	/* For more stressful testing */
 #endif

We are not going to use transparent huge page cache on small machines or
in userspace, so we are interested in RADIX_TREE_MAP_SHIFT=6.

On 64-bit system old array size is 21, new is 37. Per-CPU feature
overhead is
 for preload array:
   (38 - 21) * sizeof(void*) = 136 bytes
 plus, if the preload array is full
   (38 - 21) * sizeof(struct radix_tree_node) = 17 * 560 = 9520 bytes
 total: 9656 bytes

On 32-bit system old array size is 11, new is 22. Per-CPU feature
overhead is
 for preload array:
   (23 - 11) * sizeof(void*) = 48 bytes
 plus, if the preload array is full
   (23 - 11) * sizeof(struct radix_tree_node) = 12 * 296 = 3552 bytes
 total: 3600 bytes

Since only THP uses batched preload at the moment, we disable (set max
preload to 1) it if !CONFIG_TRANSPARENT_HUGEPAGE_PAGECACHE. This can be
changed in the future.

Signed-off-by: Matthew Wilcox <willy@linux.intel.com>
Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Acked-by: Dave Hansen <dave.hansen@linux.intel.com>
---
 include/linux/radix-tree.h | 11 ++++++
 lib/radix-tree.c           | 95 +++++++++++++++++++++++++++++++++++++++++-----
 2 files changed, 97 insertions(+), 9 deletions(-)

diff --git a/include/linux/radix-tree.h b/include/linux/radix-tree.h
index 4039407..3bf0b3e 100644
--- a/include/linux/radix-tree.h
+++ b/include/linux/radix-tree.h
@@ -83,6 +83,16 @@ do {									\
 	(root)->rnode = NULL;						\
 } while (0)
 
+#ifdef CONFIG_TRANSPARENT_HUGEPAGE_PAGECACHE
+/*
+ * At the moment only THP uses preload for more then on item for batched
+ * pagecache manipulations.
+ */
+#define RADIX_TREE_PRELOAD_NR	512
+#else
+#define RADIX_TREE_PRELOAD_NR	1
+#endif
+
 /**
  * Radix-tree synchronization
  *
@@ -232,6 +242,7 @@ unsigned long radix_tree_prev_hole(struct radix_tree_root *root,
 				unsigned long index, unsigned long max_scan);
 int radix_tree_preload(gfp_t gfp_mask);
 int radix_tree_maybe_preload(gfp_t gfp_mask);
+int radix_tree_maybe_preload_contig(unsigned size, gfp_t gfp_mask);
 void radix_tree_init(void);
 void *radix_tree_tag_set(struct radix_tree_root *root,
 			unsigned long index, unsigned int tag);
diff --git a/lib/radix-tree.c b/lib/radix-tree.c
index 7811ed3..1c19595 100644
--- a/lib/radix-tree.c
+++ b/lib/radix-tree.c
@@ -84,14 +84,52 @@ static struct kmem_cache *radix_tree_node_cachep;
  * of RADIX_TREE_MAX_PATH size to be created, with only the root node shared.
  * Hence:
  */
-#define RADIX_TREE_PRELOAD_SIZE (RADIX_TREE_MAX_PATH * 2 - 1)
+#define RADIX_TREE_PRELOAD_MIN (RADIX_TREE_MAX_PATH * 2 - 1)
+
+/*
+ * Inserting N contiguous items is more complex. To simplify calculation, let's
+ * limit N (validated in radix_tree_init()):
+ *  - N is multiplier of RADIX_TREE_MAP_SIZE (or 1);
+ *  - N <= number of items 2-level tree can contain:
+ *    1UL << (2 * RADIX_TREE_MAP_SHIFT).
+ *
+ * No limitation on insert index alignment.
+ *
+ * Then the worst case is tree with only one element at index 0 and we add N
+ * items where at least one index requires max tree high and we cross boundary
+ * between items in root node.
+ *
+ * Basically, at least one index is less then
+ *
+ * 1UL << ((RADIX_TREE_MAX_PATH - 1) * RADIX_TREE_MAP_SHIFT)
+ *
+ * and one is equal to.
+ *
+ * In this case we need:
+ *
+ * - RADIX_TREE_MAX_PATH nodes to build new path to item with index 0;
+ * - N / RADIX_TREE_MAP_SIZE + 1 nodes for last level nodes for new items:
+ *    - +1 is for misalinged case;
+ * - 2 * (RADIX_TREE_MAX_PATH - 2) - 1 nodes to build path to last level nodes:
+ *    - -2, because root node and last level nodes are already accounted).
+ *
+ * Hence:
+ */
+#if RADIX_TREE_PRELOAD_NR > 1
+#define RADIX_TREE_PRELOAD_MAX \
+	( RADIX_TREE_MAX_PATH + \
+	  RADIX_TREE_PRELOAD_NR / RADIX_TREE_MAP_SIZE + 1 + \
+	  2 * (RADIX_TREE_MAX_PATH - 2))
+#else
+#define RADIX_TREE_PRELOAD_MAX RADIX_TREE_PRELOAD_MIN
+#endif
 
 /*
  * Per-cpu pool of preloaded nodes
  */
 struct radix_tree_preload {
 	int nr;
-	struct radix_tree_node *nodes[RADIX_TREE_PRELOAD_SIZE];
+	struct radix_tree_node *nodes[RADIX_TREE_PRELOAD_MAX];
 };
 static DEFINE_PER_CPU(struct radix_tree_preload, radix_tree_preloads) = { 0, };
 
@@ -263,29 +301,43 @@ radix_tree_node_free(struct radix_tree_node *node)
 
 /*
  * Load up this CPU's radix_tree_node buffer with sufficient objects to
- * ensure that the addition of a single element in the tree cannot fail.  On
- * success, return zero, with preemption disabled.  On error, return -ENOMEM
+ * ensure that the addition of *contiguous* items in the tree cannot fail.
+ * On success, return zero, with preemption disabled.  On error, return -ENOMEM
  * with preemption not disabled.
  *
  * To make use of this facility, the radix tree must be initialised without
  * __GFP_WAIT being passed to INIT_RADIX_TREE().
  */
-static int __radix_tree_preload(gfp_t gfp_mask)
+static int __radix_tree_preload_contig(unsigned size, gfp_t gfp_mask)
 {
 	struct radix_tree_preload *rtp;
 	struct radix_tree_node *node;
 	int ret = -ENOMEM;
+	int preload_target = RADIX_TREE_PRELOAD_MIN;
 
+	if (size > 1) {
+		size = round_up(size, RADIX_TREE_MAP_SIZE);
+		if (WARN_ONCE(size > RADIX_TREE_PRELOAD_NR,
+					"too large preload requested"))
+			return -ENOMEM;
+
+		/* The same math as with RADIX_TREE_PRELOAD_MAX */
+		preload_target = RADIX_TREE_MAX_PATH +
+			size / RADIX_TREE_MAP_SIZE + 1 +
+			2 * (RADIX_TREE_MAX_PATH - 2);
+	}
+
+	BUG_ON(preload_target > RADIX_TREE_PRELOAD_MAX);
 	preempt_disable();
 	rtp = &__get_cpu_var(radix_tree_preloads);
-	while (rtp->nr < ARRAY_SIZE(rtp->nodes)) {
+	while (rtp->nr < preload_target) {
 		preempt_enable();
 		node = kmem_cache_alloc(radix_tree_node_cachep, gfp_mask);
 		if (node == NULL)
 			goto out;
 		preempt_disable();
 		rtp = &__get_cpu_var(radix_tree_preloads);
-		if (rtp->nr < ARRAY_SIZE(rtp->nodes))
+		if (rtp->nr < preload_target)
 			rtp->nodes[rtp->nr++] = node;
 		else
 			kmem_cache_free(radix_tree_node_cachep, node);
@@ -308,7 +360,7 @@ int radix_tree_preload(gfp_t gfp_mask)
 {
 	/* Warn on non-sensical use... */
 	WARN_ON_ONCE(!(gfp_mask & __GFP_WAIT));
-	return __radix_tree_preload(gfp_mask);
+	return __radix_tree_preload_contig(1, gfp_mask);
 }
 EXPORT_SYMBOL(radix_tree_preload);
 
@@ -320,13 +372,22 @@ EXPORT_SYMBOL(radix_tree_preload);
 int radix_tree_maybe_preload(gfp_t gfp_mask)
 {
 	if (gfp_mask & __GFP_WAIT)
-		return __radix_tree_preload(gfp_mask);
+		return __radix_tree_preload_contig(1, gfp_mask);
 	/* Preloading doesn't help anything with this gfp mask, skip it */
 	preempt_disable();
 	return 0;
 }
 EXPORT_SYMBOL(radix_tree_maybe_preload);
 
+int radix_tree_maybe_preload_contig(unsigned size, gfp_t gfp_mask)
+{
+	if (gfp_mask & __GFP_WAIT)
+		return __radix_tree_preload_contig(size, gfp_mask);
+	/* Preloading doesn't help anything with this gfp mask, skip it */
+	preempt_disable();
+	return 0;
+}
+
 /*
  *	Return the maximum key which can be store into a
  *	radix tree with height HEIGHT.
@@ -1483,6 +1544,22 @@ static int radix_tree_callback(struct notifier_block *nfb,
 
 void __init radix_tree_init(void)
 {
+	/*
+	 * Restrictions on RADIX_TREE_PRELOAD_NR simplify the RADIX_TREE_PRELOAD_MAX
+	 * calculation, which is already complex enough:
+	 *  - it must be a multiple of RADIX_TREE_MAP_SIZE, otherwise we would
+	 *    have to round it up to the next RADIX_TREE_MAP_SIZE multiple and
+	 *    we wouldn't win anything;
+	 *  - it must not exceed the number of items a 2-level tree can contain.
+	 *    It's easier to calculate the number of internal nodes required
+	 *    this way.
+	 */
+	if (RADIX_TREE_PRELOAD_NR != 1) {
+		BUILD_BUG_ON(RADIX_TREE_PRELOAD_NR % RADIX_TREE_MAP_SIZE != 0);
+		BUILD_BUG_ON(RADIX_TREE_PRELOAD_NR >
+				1UL << (2 * RADIX_TREE_MAP_SHIFT));
+	}
+
 	radix_tree_node_cachep = kmem_cache_create("radix_tree_node",
 			sizeof(struct radix_tree_node), 0,
 			SLAB_PANIC | SLAB_RECLAIM_ACCOUNT,
-- 
 Kirill A. Shutemov
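
For reference, the worst-case math above can be checked with a minimal
userspace sketch (not part of the patch). It assumes RADIX_TREE_MAP_SHIFT = 6
and mirrors the kernel's RADIX_TREE_MAX_PATH definition; it reproduces the
per-CPU array sizes discussed in this thread (21 -> 38 on 64-bit, 11 -> 23 on
32-bit).

#include <stdio.h>

#define BITS_PER_LONG		64	/* set to 32 for the 32-bit numbers */
#define RADIX_TREE_MAP_SHIFT	6
#define RADIX_TREE_MAP_SIZE	(1UL << RADIX_TREE_MAP_SHIFT)
#define DIV_ROUND_UP(n, d)	(((n) + (d) - 1) / (d))
#define RADIX_TREE_MAX_PATH	DIV_ROUND_UP(BITS_PER_LONG, RADIX_TREE_MAP_SHIFT)
#define RADIX_TREE_PRELOAD_NR	512UL

int main(void)
{
	/* old per-CPU array size: RADIX_TREE_PRELOAD_SIZE */
	unsigned long old_size = RADIX_TREE_MAX_PATH * 2 - 1;
	/* new per-CPU array size: RADIX_TREE_PRELOAD_MAX, as defined above */
	unsigned long new_size = RADIX_TREE_MAX_PATH +
		RADIX_TREE_PRELOAD_NR / RADIX_TREE_MAP_SIZE + 1 +
		2 * (RADIX_TREE_MAX_PATH - 2);

	/* prints 21 and 38 for 64-bit, 11 and 23 for 32-bit */
	printf("old = %lu, new = %lu\n", old_size, new_size);
	return 0;
}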

^ permalink raw reply related	[flat|nested] 116+ messages in thread

* Re: [PATCH 01/23] radix-tree: implement preload for multiple contiguous elements
@ 2013-08-09 11:13                     ` Kirill A. Shutemov
  0 siblings, 0 replies; 116+ messages in thread
From: Kirill A. Shutemov @ 2013-08-09 11:13 UTC (permalink / raw)
  To: Jan Kara
  Cc: Kirill A. Shutemov, Matthew Wilcox, Dave Hansen,
	Andrea Arcangeli, Andrew Morton, Al Viro, Hugh Dickins,
	Wu Fengguang, Mel Gorman, linux-mm, Andi Kleen,
	Kirill A. Shutemov, Hillf Danton, Ning Qu, linux-fsdevel,
	linux-kernel

Jan Kara wrote:
> On Thu 08-08-13 11:45:05, Kirill A. Shutemov wrote:
> > From 35ba5687ea7aea98645da34ddd0be01a9de8b32d Mon Sep 17 00:00:00 2001
> > From: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
> > Date: Wed, 9 Jan 2013 13:25:47 +0200
> > Subject: [PATCH] radix-tree: implement preload for multiple contiguous
> >  elements
> > 
> > The radix tree is variable-height, so an insert operation not only has
> > to build the branch to its corresponding item, it also has to build the
> > branch to existing items if the size has to be increased (by
> > radix_tree_extend).
> > 
> > The worst case is a zero height tree with just a single item at index 0,
> > and then inserting an item at index ULONG_MAX. This requires 2 new branches
> > of RADIX_TREE_MAX_PATH size to be created, with only the root node shared.
> > 
> > Radix tree is usually protected by spin lock. It means we want to
> > pre-allocate required memory before taking the lock.
> > 
> > Currently radix_tree_preload() only guarantees enough nodes to insert
> > one element. It's a hard limit. For transparent huge page cache we want
> > to insert HPAGE_PMD_NR (512 on x86-64) entries to address_space at once.
> > 
> > This patch introduces radix_tree_preload_count(). It allows to
> > preallocate nodes enough to insert a number of *contiguous* elements.
> > The feature costs about 9KiB per-CPU on x86_64, details below.
> > 
> > Preload uses per-CPU array to store nodes. The total cost of preload is
> > "array size" * sizeof(void*) * NR_CPUS. We want to increase array size
> > to be able to handle 512 entries at once.
> > 
> > Size of array depends on system bitness and on RADIX_TREE_MAP_SHIFT.
> > 
> > We have three possible RADIX_TREE_MAP_SHIFT:
> > 
> >  #ifdef __KERNEL__
> >  #define RADIX_TREE_MAP_SHIFT	(CONFIG_BASE_SMALL ? 4 : 6)
> >  #else
> >  #define RADIX_TREE_MAP_SHIFT	3	/* For more stressful testing */
> >  #endif
> > 
> > We are not going to use transparent huge page cache on small machines or
> > in userspace, so we are interested in RADIX_TREE_MAP_SHIFT=6.
> > 
> > On 64-bit system old array size is 21, new is 37. Per-CPU feature
> > overhead is
> >  for preload array:
> >    (37 - 21) * sizeof(void*) = 128 bytes
> >  plus, if the preload array is full
> >    (37 - 21) * sizeof(struct radix_tree_node) = 16 * 560 = 8960 bytes
> >  total: 9088 bytes
> > 
> > On 32-bit system old array size is 11, new is 22. Per-CPU feature
> > overhead is
> >  for preload array:
> >    (22 - 11) * sizeof(void*) = 44 bytes
> >  plus, if the preload array is full
> >    (22 - 11) * sizeof(struct radix_tree_node) = 11 * 296 = 3256 bytes
> >  total: 3300 bytes
> > 
> > Since only THP uses batched preload at the moment, we disable (set max
> > preload to 1) it if !CONFIG_TRANSPARENT_HUGEPAGE_PAGECACHE. This can be
> > changed in the future.
> > 
> > Signed-off-by: Matthew Wilcox <willy@linux.intel.com>
> > Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
> > Acked-by: Dave Hansen <dave.hansen@linux.intel.com>
> > ---
> >  include/linux/radix-tree.h | 11 ++++++
> >  lib/radix-tree.c           | 89 +++++++++++++++++++++++++++++++++++++++++-----
> >  2 files changed, 91 insertions(+), 9 deletions(-)
> > 
> > diff --git a/include/linux/radix-tree.h b/include/linux/radix-tree.h
> > index 4039407..3bf0b3e 100644
> > --- a/include/linux/radix-tree.h
> > +++ b/include/linux/radix-tree.h
> > @@ -83,6 +83,16 @@ do {									\
> >  	(root)->rnode = NULL;						\
> >  } while (0)
> >  
> > +#ifdef CONFIG_TRANSPARENT_HUGEPAGE_PAGECACHE
> > +/*
> > + * At the moment only THP uses preload for more then on item for batched
> > + * pagecache manipulations.
> > + */
> > +#define RADIX_TREE_PRELOAD_NR	512
> > +#else
> > +#define RADIX_TREE_PRELOAD_NR	1
> > +#endif
> > +
> >  /**
> >   * Radix-tree synchronization
> >   *
> > @@ -232,6 +242,7 @@ unsigned long radix_tree_prev_hole(struct radix_tree_root *root,
> >  				unsigned long index, unsigned long max_scan);
> >  int radix_tree_preload(gfp_t gfp_mask);
> >  int radix_tree_maybe_preload(gfp_t gfp_mask);
> > +int radix_tree_maybe_preload_contig(unsigned size, gfp_t gfp_mask);
> >  void radix_tree_init(void);
> >  void *radix_tree_tag_set(struct radix_tree_root *root,
> >  			unsigned long index, unsigned int tag);
> > diff --git a/lib/radix-tree.c b/lib/radix-tree.c
> > index 7811ed3..980e4c4 100644
> > --- a/lib/radix-tree.c
> > +++ b/lib/radix-tree.c
> > @@ -84,14 +84,54 @@ static struct kmem_cache *radix_tree_node_cachep;
> >   * of RADIX_TREE_MAX_PATH size to be created, with only the root node shared.
> >   * Hence:
> >   */
> > -#define RADIX_TREE_PRELOAD_SIZE (RADIX_TREE_MAX_PATH * 2 - 1)
> > +#define RADIX_TREE_PRELOAD_MIN (RADIX_TREE_MAX_PATH * 2 - 1)
> > +
> > +/*
> > + * Inserting N contiguous items is more complex. To simplify calculation, let's
> > + * limit N (validated in radix_tree_init()):
> > + *  - N is multiplier of RADIX_TREE_MAP_SIZE (or 1);
>   Is this limitation really worth the one reserved item you save?

It's not really a limitation. It just doesn't make any sense to set
RADIX_TREE_PRELOAD_NR to anything that is not a RADIX_TREE_MAP_SIZE multiple,
since we will have to round it up anyway.

I made one more mistake: I modeled a situation different from the one I
described, so the last -1 is wrong. It's fixed in the patch below.

> > -static int __radix_tree_preload(gfp_t gfp_mask)
> > +static int __radix_tree_preload_contig(unsigned size, gfp_t gfp_mask)
> >  {
> >  	struct radix_tree_preload *rtp;
> >  	struct radix_tree_node *node;
> >  	int ret = -ENOMEM;
> > +	int preload_target = RADIX_TREE_PRELOAD_MIN +
> > +		DIV_ROUND_UP(size - 1, RADIX_TREE_MAP_SIZE);
>   So you should also warn if size % RADIX_TREE_MAP_SIZE != 0, right?

I've reworked logic here to match RADIX_TREE_PRELOAD_MAX calculation math.

> Otherwise the patch looks fine.

Ack?

^ permalink raw reply	[flat|nested] 116+ messages in thread

* Re: [PATCH 01/23] radix-tree: implement preload for multiple contiguous elements
  2013-08-09 11:13                     ` Kirill A. Shutemov
@ 2013-08-09 11:36                       ` Jan Kara
  -1 siblings, 0 replies; 116+ messages in thread
From: Jan Kara @ 2013-08-09 11:36 UTC (permalink / raw)
  To: Kirill A. Shutemov
  Cc: Jan Kara, Matthew Wilcox, Dave Hansen, Andrea Arcangeli,
	Andrew Morton, Al Viro, Hugh Dickins, Wu Fengguang, Mel Gorman,
	linux-mm, Andi Kleen, Kirill A. Shutemov, Hillf Danton, Ning Qu,
	linux-fsdevel, linux-kernel

On Fri 09-08-13 14:13:51, Kirill A. Shutemov wrote:
> Jan Kara wrote:
> > On Thu 08-08-13 11:45:05, Kirill A. Shutemov wrote:
> > > From 35ba5687ea7aea98645da34ddd0be01a9de8b32d Mon Sep 17 00:00:00 2001
> > > From: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
> > > Date: Wed, 9 Jan 2013 13:25:47 +0200
> > > Subject: [PATCH] radix-tree: implement preload for multiple contiguous
> > >  elements
> > > 
> > > The radix tree is variable-height, so an insert operation not only has
> > > to build the branch to its corresponding item, it also has to build the
> > > branch to existing items if the size has to be increased (by
> > > radix_tree_extend).
> > > 
> > > The worst case is a zero height tree with just a single item at index 0,
> > > and then inserting an item at index ULONG_MAX. This requires 2 new branches
> > > of RADIX_TREE_MAX_PATH size to be created, with only the root node shared.
> > > 
> > > Radix tree is usually protected by spin lock. It means we want to
> > > pre-allocate required memory before taking the lock.
> > > 
> > > Currently radix_tree_preload() only guarantees enough nodes to insert
> > > one element. It's a hard limit. For transparent huge page cache we want
> > > to insert HPAGE_PMD_NR (512 on x86-64) entries to address_space at once.
> > > 
> > > This patch introduces radix_tree_preload_count(). It allows to
> > > preallocate nodes enough to insert a number of *contiguous* elements.
> > > The feature costs about 9KiB per-CPU on x86_64, details below.
> > > 
> > > Preload uses per-CPU array to store nodes. The total cost of preload is
> > > "array size" * sizeof(void*) * NR_CPUS. We want to increase array size
> > > to be able to handle 512 entries at once.
> > > 
> > > Size of array depends on system bitness and on RADIX_TREE_MAP_SHIFT.
> > > 
> > > We have three possible RADIX_TREE_MAP_SHIFT:
> > > 
> > >  #ifdef __KERNEL__
> > >  #define RADIX_TREE_MAP_SHIFT	(CONFIG_BASE_SMALL ? 4 : 6)
> > >  #else
> > >  #define RADIX_TREE_MAP_SHIFT	3	/* For more stressful testing */
> > >  #endif
> > > 
> > > We are not going to use transparent huge page cache on small machines or
> > > in userspace, so we are interested in RADIX_TREE_MAP_SHIFT=6.
> > > 
> > > On 64-bit system old array size is 21, new is 37. Per-CPU feature
> > > overhead is
> > >  for preload array:
> > >    (37 - 21) * sizeof(void*) = 128 bytes
> > >  plus, if the preload array is full
> > >    (37 - 21) * sizeof(struct radix_tree_node) = 16 * 560 = 8960 bytes
> > >  total: 9088 bytes
> > > 
> > > On 32-bit system old array size is 11, new is 22. Per-CPU feature
> > > overhead is
> > >  for preload array:
> > >    (22 - 11) * sizeof(void*) = 44 bytes
> > >  plus, if the preload array is full
> > >    (22 - 11) * sizeof(struct radix_tree_node) = 11 * 296 = 3256 bytes
> > >  total: 3300 bytes
> > > 
> > > Since only THP uses batched preload at the moment, we disable (set max
> > > preload to 1) it if !CONFIG_TRANSPARENT_HUGEPAGE_PAGECACHE. This can be
> > > changed in the future.
> > > 
> > > Signed-off-by: Matthew Wilcox <willy@linux.intel.com>
> > > Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
> > > Acked-by: Dave Hansen <dave.hansen@linux.intel.com>
> > > ---
> > >  include/linux/radix-tree.h | 11 ++++++
> > >  lib/radix-tree.c           | 89 +++++++++++++++++++++++++++++++++++++++++-----
> > >  2 files changed, 91 insertions(+), 9 deletions(-)
> > > 
> > > diff --git a/include/linux/radix-tree.h b/include/linux/radix-tree.h
> > > index 4039407..3bf0b3e 100644
> > > --- a/include/linux/radix-tree.h
> > > +++ b/include/linux/radix-tree.h
> > > @@ -83,6 +83,16 @@ do {									\
> > >  	(root)->rnode = NULL;						\
> > >  } while (0)
> > >  
> > > +#ifdef CONFIG_TRANSPARENT_HUGEPAGE_PAGECACHE
> > > +/*
> > > + * At the moment only THP uses preload for more then on item for batched
> > > + * pagecache manipulations.
> > > + */
> > > +#define RADIX_TREE_PRELOAD_NR	512
> > > +#else
> > > +#define RADIX_TREE_PRELOAD_NR	1
> > > +#endif
> > > +
> > >  /**
> > >   * Radix-tree synchronization
> > >   *
> > > @@ -232,6 +242,7 @@ unsigned long radix_tree_prev_hole(struct radix_tree_root *root,
> > >  				unsigned long index, unsigned long max_scan);
> > >  int radix_tree_preload(gfp_t gfp_mask);
> > >  int radix_tree_maybe_preload(gfp_t gfp_mask);
> > > +int radix_tree_maybe_preload_contig(unsigned size, gfp_t gfp_mask);
> > >  void radix_tree_init(void);
> > >  void *radix_tree_tag_set(struct radix_tree_root *root,
> > >  			unsigned long index, unsigned int tag);
> > > diff --git a/lib/radix-tree.c b/lib/radix-tree.c
> > > index 7811ed3..980e4c4 100644
> > > --- a/lib/radix-tree.c
> > > +++ b/lib/radix-tree.c
> > > @@ -84,14 +84,54 @@ static struct kmem_cache *radix_tree_node_cachep;
> > >   * of RADIX_TREE_MAX_PATH size to be created, with only the root node shared.
> > >   * Hence:
> > >   */
> > > -#define RADIX_TREE_PRELOAD_SIZE (RADIX_TREE_MAX_PATH * 2 - 1)
> > > +#define RADIX_TREE_PRELOAD_MIN (RADIX_TREE_MAX_PATH * 2 - 1)
> > > +
> > > +/*
> > > + * Inserting N contiguous items is more complex. To simplify calculation, let's
> > > + * limit N (validated in radix_tree_init()):
> > > + *  - N is multiplier of RADIX_TREE_MAP_SIZE (or 1);
> >   Is this limitation really worth the one reserved item you save?
> 
> It's not really limitation, It just doesn't make any sense to lower
> RADIX_TREE_PRELOAD_NR not to RADIX_TREE_MAP_SIZE boundary, since we will
> have to round it up anyway.
> 
> I made one more mistake: I've modeled not the situation I've describe, so
> last -1 is wrong. Fixed in patch below.
> 
> > > -static int __radix_tree_preload(gfp_t gfp_mask)
> > > +static int __radix_tree_preload_contig(unsigned size, gfp_t gfp_mask)
> > >  {
> > >  	struct radix_tree_preload *rtp;
> > >  	struct radix_tree_node *node;
> > >  	int ret = -ENOMEM;
> > > +	int preload_target = RADIX_TREE_PRELOAD_MIN +
> > > +		DIV_ROUND_UP(size - 1, RADIX_TREE_MAP_SIZE);
> >   So you should also warn if size % RADIX_TREE_MAP_SIZE != 0, right?
> 
> I've reworked logic here to match RADIX_TREE_PRELOAD_MAX calculation math.
> 
> > Otherwise the patch looks fine.
> 
> Ack?
  Yes. You can add:
Reviewed-by: Jan Kara <jack@suse.cz>

								Honza
> 
> From 0e7d481093a4ded0dfb58e87d4edda003f3b2fd6 Mon Sep 17 00:00:00 2001
> From: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
> Date: Wed, 9 Jan 2013 13:25:47 +0200
> Subject: [PATCH] radix-tree: implement preload for multiple contiguous
>  elements
> 
> The radix tree is variable-height, so an insert operation not only has
> to build the branch to its corresponding item, it also has to build the
> branch to existing items if the size has to be increased (by
> radix_tree_extend).
> 
> The worst case is a zero height tree with just a single item at index 0,
> and then inserting an item at index ULONG_MAX. This requires 2 new branches
> of RADIX_TREE_MAX_PATH size to be created, with only the root node shared.
> 
> Radix tree is usually protected by spin lock. It means we want to
> pre-allocate required memory before taking the lock.
> 
> Currently radix_tree_preload() only guarantees enough nodes to insert
> one element. It's a hard limit. For transparent huge page cache we want
> to insert HPAGE_PMD_NR (512 on x86-64) entries to address_space at once.
> 
> This patch introduces radix_tree_preload_count(). It allows to
> preallocate nodes enough to insert a number of *contiguous* elements.
> The feature costs about 9.5KiB per-CPU on x86_64, details below.
> 
> Preload uses per-CPU array to store nodes. The total cost of preload is
> "array size" * sizeof(void*) * NR_CPUS. We want to increase array size
> to be able to handle 512 entries at once.
> 
> Size of array depends on system bitness and on RADIX_TREE_MAP_SHIFT.
> 
> We have three possible RADIX_TREE_MAP_SHIFT:
> 
>  #ifdef __KERNEL__
>  #define RADIX_TREE_MAP_SHIFT	(CONFIG_BASE_SMALL ? 4 : 6)
>  #else
>  #define RADIX_TREE_MAP_SHIFT	3	/* For more stressful testing */
>  #endif
> 
> We are not going to use transparent huge page cache on small machines or
> in userspace, so we are interested in RADIX_TREE_MAP_SHIFT=6.
> 
> > On 64-bit system old array size is 21, new is 38. Per-CPU feature
> overhead is
>  for preload array:
>    (38 - 21) * sizeof(void*) = 136 bytes
>  plus, if the preload array is full
>    (38 - 21) * sizeof(struct radix_tree_node) = 17 * 560 = 9520 bytes
>  total: 9656 bytes
> 
> > On 32-bit system old array size is 11, new is 23. Per-CPU feature
> overhead is
>  for preload array:
>    (23 - 11) * sizeof(void*) = 48 bytes
>  plus, if the preload array is full
>    (23 - 11) * sizeof(struct radix_tree_node) = 12 * 296 = 3552 bytes
>  total: 3600 bytes
> 
> Since only THP uses batched preload at the moment, we disable (set max
> preload to 1) it if !CONFIG_TRANSPARENT_HUGEPAGE_PAGECACHE. This can be
> changed in the future.
> 
> Signed-off-by: Matthew Wilcox <willy@linux.intel.com>
> Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
> Acked-by: Dave Hansen <dave.hansen@linux.intel.com>
> ---
>  include/linux/radix-tree.h | 11 ++++++
>  lib/radix-tree.c           | 95 +++++++++++++++++++++++++++++++++++++++++-----
>  2 files changed, 97 insertions(+), 9 deletions(-)
> 
> diff --git a/include/linux/radix-tree.h b/include/linux/radix-tree.h
> index 4039407..3bf0b3e 100644
> --- a/include/linux/radix-tree.h
> +++ b/include/linux/radix-tree.h
> @@ -83,6 +83,16 @@ do {									\
>  	(root)->rnode = NULL;						\
>  } while (0)
>  
> +#ifdef CONFIG_TRANSPARENT_HUGEPAGE_PAGECACHE
> +/*
> + * At the moment only THP uses preload for more then on item for batched
> + * pagecache manipulations.
> + */
> +#define RADIX_TREE_PRELOAD_NR	512
> +#else
> +#define RADIX_TREE_PRELOAD_NR	1
> +#endif
> +
>  /**
>   * Radix-tree synchronization
>   *
> @@ -232,6 +242,7 @@ unsigned long radix_tree_prev_hole(struct radix_tree_root *root,
>  				unsigned long index, unsigned long max_scan);
>  int radix_tree_preload(gfp_t gfp_mask);
>  int radix_tree_maybe_preload(gfp_t gfp_mask);
> +int radix_tree_maybe_preload_contig(unsigned size, gfp_t gfp_mask);
>  void radix_tree_init(void);
>  void *radix_tree_tag_set(struct radix_tree_root *root,
>  			unsigned long index, unsigned int tag);
> diff --git a/lib/radix-tree.c b/lib/radix-tree.c
> index 7811ed3..1c19595 100644
> --- a/lib/radix-tree.c
> +++ b/lib/radix-tree.c
> @@ -84,14 +84,52 @@ static struct kmem_cache *radix_tree_node_cachep;
>   * of RADIX_TREE_MAX_PATH size to be created, with only the root node shared.
>   * Hence:
>   */
> -#define RADIX_TREE_PRELOAD_SIZE (RADIX_TREE_MAX_PATH * 2 - 1)
> +#define RADIX_TREE_PRELOAD_MIN (RADIX_TREE_MAX_PATH * 2 - 1)
> +
> +/*
> + * Inserting N contiguous items is more complex. To simplify calculation, let's
> + * limit N (validated in radix_tree_init()):
> + *  - N is multiplier of RADIX_TREE_MAP_SIZE (or 1);
> + *  - N <= number of items 2-level tree can contain:
> + *    1UL << (2 * RADIX_TREE_MAP_SHIFT).
> + *
> + * No limitation on insert index alignment.
> + *
> + * Then the worst case is tree with only one element at index 0 and we add N
> + * items where at least one index requires max tree high and we cross boundary
> + * between items in root node.
> + *
> + * Basically, at least one index is less then
> + *
> + * 1UL << ((RADIX_TREE_MAX_PATH - 1) * RADIX_TREE_MAP_SHIFT)
> + *
> + * and one is equal to.
> + *
> + * In this case we need:
> + *
> + * - RADIX_TREE_MAX_PATH nodes to build new path to item with index 0;
> + * - N / RADIX_TREE_MAP_SIZE + 1 nodes for last level nodes for new items:
> + *    - +1 is for misalinged case;
> + * - 2 * (RADIX_TREE_MAX_PATH - 2) - 1 nodes to build path to last level nodes:
> + *    - -2, because root node and last level nodes are already accounted).
> + *
> + * Hence:
> + */
> +#if RADIX_TREE_PRELOAD_NR > 1
> +#define RADIX_TREE_PRELOAD_MAX \
> +	( RADIX_TREE_MAX_PATH + \
> +	  RADIX_TREE_PRELOAD_NR / RADIX_TREE_MAP_SIZE + 1 + \
> +	  2 * (RADIX_TREE_MAX_PATH - 2))
> +#else
> +#define RADIX_TREE_PRELOAD_MAX RADIX_TREE_PRELOAD_MIN
> +#endif
>  
>  /*
>   * Per-cpu pool of preloaded nodes
>   */
>  struct radix_tree_preload {
>  	int nr;
> -	struct radix_tree_node *nodes[RADIX_TREE_PRELOAD_SIZE];
> +	struct radix_tree_node *nodes[RADIX_TREE_PRELOAD_MAX];
>  };
>  static DEFINE_PER_CPU(struct radix_tree_preload, radix_tree_preloads) = { 0, };
>  
> @@ -263,29 +301,43 @@ radix_tree_node_free(struct radix_tree_node *node)
>  
>  /*
>   * Load up this CPU's radix_tree_node buffer with sufficient objects to
> - * ensure that the addition of a single element in the tree cannot fail.  On
> - * success, return zero, with preemption disabled.  On error, return -ENOMEM
> + * ensure that the addition of *contiguous* items in the tree cannot fail.
> + * On success, return zero, with preemption disabled.  On error, return -ENOMEM
>   * with preemption not disabled.
>   *
>   * To make use of this facility, the radix tree must be initialised without
>   * __GFP_WAIT being passed to INIT_RADIX_TREE().
>   */
> -static int __radix_tree_preload(gfp_t gfp_mask)
> +static int __radix_tree_preload_contig(unsigned size, gfp_t gfp_mask)
>  {
>  	struct radix_tree_preload *rtp;
>  	struct radix_tree_node *node;
>  	int ret = -ENOMEM;
> +	int preload_target = RADIX_TREE_PRELOAD_MIN;
>  
> +	if (size > 1) {
> +		size = round_up(size, RADIX_TREE_MAP_SIZE);
> +		if (WARN_ONCE(size > RADIX_TREE_PRELOAD_NR,
> +					"too large preload requested"))
> +			return -ENOMEM;
> +
> +		/* The same math as with RADIX_TREE_PRELOAD_MAX */
> +		preload_target = RADIX_TREE_MAX_PATH +
> +			size / RADIX_TREE_MAP_SIZE + 1 +
> +			2 * (RADIX_TREE_MAX_PATH - 2);
> +	}
> +
> +	BUG_ON(preload_target > RADIX_TREE_PRELOAD_MAX);
>  	preempt_disable();
>  	rtp = &__get_cpu_var(radix_tree_preloads);
> -	while (rtp->nr < ARRAY_SIZE(rtp->nodes)) {
> +	while (rtp->nr < preload_target) {
>  		preempt_enable();
>  		node = kmem_cache_alloc(radix_tree_node_cachep, gfp_mask);
>  		if (node == NULL)
>  			goto out;
>  		preempt_disable();
>  		rtp = &__get_cpu_var(radix_tree_preloads);
> -		if (rtp->nr < ARRAY_SIZE(rtp->nodes))
> +		if (rtp->nr < preload_target)
>  			rtp->nodes[rtp->nr++] = node;
>  		else
>  			kmem_cache_free(radix_tree_node_cachep, node);
> @@ -308,7 +360,7 @@ int radix_tree_preload(gfp_t gfp_mask)
>  {
>  	/* Warn on non-sensical use... */
>  	WARN_ON_ONCE(!(gfp_mask & __GFP_WAIT));
> -	return __radix_tree_preload(gfp_mask);
> +	return __radix_tree_preload_contig(1, gfp_mask);
>  }
>  EXPORT_SYMBOL(radix_tree_preload);
>  
> @@ -320,13 +372,22 @@ EXPORT_SYMBOL(radix_tree_preload);
>  int radix_tree_maybe_preload(gfp_t gfp_mask)
>  {
>  	if (gfp_mask & __GFP_WAIT)
> -		return __radix_tree_preload(gfp_mask);
> +		return __radix_tree_preload_contig(1, gfp_mask);
>  	/* Preloading doesn't help anything with this gfp mask, skip it */
>  	preempt_disable();
>  	return 0;
>  }
>  EXPORT_SYMBOL(radix_tree_maybe_preload);
>  
> +int radix_tree_maybe_preload_contig(unsigned size, gfp_t gfp_mask)
> +{
> +	if (gfp_mask & __GFP_WAIT)
> +		return __radix_tree_preload_contig(size, gfp_mask);
> +	/* Preloading doesn't help anything with this gfp mask, skip it */
> +	preempt_disable();
> +	return 0;
> +}
> +
>  /*
>   *	Return the maximum key which can be store into a
>   *	radix tree with height HEIGHT.
> @@ -1483,6 +1544,22 @@ static int radix_tree_callback(struct notifier_block *nfb,
>  
>  void __init radix_tree_init(void)
>  {
> +	/*
> +	 * Restrictions on RADIX_TREE_PRELOAD_NR simplify RADIX_TREE_PRELOAD_MAX
> +	 * calculation, it's already complex enough:
> +	 *  - it must be multiplier of RADIX_TREE_MAP_SIZE, otherwise we will
> +	 *    have to round it up to next RADIX_TREE_MAP_SIZE multiplier and we
> +	 *    don't win anything;
> +	 *  - must be less then number of items 2-level tree can contain.
> +	 *    It's easier to calculate number of internal nodes required
> +	 *    this way.
> +	 */
> +	if (RADIX_TREE_PRELOAD_NR != 1) {
> +		BUILD_BUG_ON(RADIX_TREE_PRELOAD_NR % RADIX_TREE_MAP_SIZE != 0);
> +		BUILD_BUG_ON(RADIX_TREE_PRELOAD_NR >
> +				1UL << (2 * RADIX_TREE_MAP_SHIFT));
> +	}
> +
>  	radix_tree_node_cachep = kmem_cache_create("radix_tree_node",
>  			sizeof(struct radix_tree_node), 0,
>  			SLAB_PANIC | SLAB_RECLAIM_ACCOUNT,
> -- 
>  Kirill A. Shutemov
-- 
Jan Kara <jack@suse.cz>
SUSE Labs, CR
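
To make the intended calling convention concrete, here is a hedged sketch (not
taken from the series) of how a caller could batch HPAGE_CACHE_NR contiguous
page-cache insertions under a single tree_lock with the new helper.
HPAGE_CACHE_NR is the series' small-pages-per-huge-page constant, the function
name is made up, error handling is reduced to a WARN_ON, and it assumes the
mapping's radix tree was initialised without __GFP_WAIT so that
radix_tree_insert() can fall back to the preloaded per-CPU nodes.

#include <linux/mm.h>
#include <linux/pagemap.h>
#include <linux/radix-tree.h>

static int insert_huge_page_sketch(struct address_space *mapping,
				   struct page *page, pgoff_t index)
{
	int i, err;

	/* reserve enough nodes; disables preemption on success */
	err = radix_tree_maybe_preload_contig(HPAGE_CACHE_NR, GFP_KERNEL);
	if (err)
		return err;

	spin_lock_irq(&mapping->tree_lock);
	for (i = 0; i < HPAGE_CACHE_NR; i++) {
		/* nodes come from the preload pool, so failure is unexpected */
		err = radix_tree_insert(&mapping->page_tree,
					index + i, page + i);
		if (WARN_ON(err))
			break;
	}
	spin_unlock_irq(&mapping->tree_lock);

	radix_tree_preload_end();	/* re-enables preemption */
	return err;
}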

^ permalink raw reply	[flat|nested] 116+ messages in thread

* Re: [PATCH 19/23] truncate: support huge pages
  2013-08-06 21:55     ` Dave Hansen
@ 2013-08-09 14:39       ` Kirill A. Shutemov
  -1 siblings, 0 replies; 116+ messages in thread
From: Kirill A. Shutemov @ 2013-08-09 14:39 UTC (permalink / raw)
  To: Dave Hansen
  Cc: Kirill A. Shutemov, Andrea Arcangeli, Andrew Morton, Al Viro,
	Hugh Dickins, Wu Fengguang, Jan Kara, Mel Gorman, linux-mm,
	Andi Kleen, Matthew Wilcox, Kirill A. Shutemov, Hillf Danton,
	Ning Qu, linux-fsdevel, linux-kernel

Dave Hansen wrote:
> On 08/03/2013 07:17 PM, Kirill A. Shutemov wrote:
> > If a huge page is only partly in the range we zero out the part,
> > exactly like we do for partial small pages.
> 
> What's the logic behind this behaviour?  Seems like the kind of place
> that we would really want to be splitting pages.

split_huge_page() now truncates the file, so we need to break the
truncate<->split interdependency at some point.

> Like I said before, I somehow like to rewrite your code. :)

Makes sense. Please take a look at the patch below.

From 631ad747933acbaa3284fae6e24ff1ae870a8f8f Mon Sep 17 00:00:00 2001
From: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
Date: Fri, 2 Aug 2013 12:57:08 +0300
Subject: [PATCH] truncate: support huge pages

truncate_inode_pages_range() drops whole huge page at once if it's fully
inside the range.

If a huge page is only partly in the range we zero out the part,
exactly like we do for partial small pages.

In some cases it would be worth splitting the huge page instead, if we need
to truncate it partly and free some memory. But split_huge_page() now
truncates the file, so we need to break the truncate<->split interdependency
at some point.

invalidate_mapping_pages() just skips huge pages if they are not fully
in the range.

Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Reviewed-by: Jan Kara <jack@suse.cz>
---
 include/linux/pagemap.h |  9 +++++
 mm/truncate.c           | 98 +++++++++++++++++++++++++++++++++++++------------
 2 files changed, 83 insertions(+), 24 deletions(-)

diff --git a/include/linux/pagemap.h b/include/linux/pagemap.h
index eb484f2..418be14 100644
--- a/include/linux/pagemap.h
+++ b/include/linux/pagemap.h
@@ -599,4 +599,13 @@ static inline void clear_pagecache_page(struct page *page)
 		clear_highpage(page);
 }
 
+static inline void zero_pagecache_segment(struct page *page,
+		unsigned start, unsigned len)
+{
+	if (PageTransHugeCache(page))
+		zero_huge_user_segment(page, start, len);
+	else
+		zero_user_segment(page, start, len);
+}
+
 #endif /* _LINUX_PAGEMAP_H */
diff --git a/mm/truncate.c b/mm/truncate.c
index 353b683..bc4f8d6 100644
--- a/mm/truncate.c
+++ b/mm/truncate.c
@@ -205,8 +205,7 @@ void truncate_inode_pages_range(struct address_space *mapping,
 {
 	pgoff_t		start;		/* inclusive */
 	pgoff_t		end;		/* exclusive */
-	unsigned int	partial_start;	/* inclusive */
-	unsigned int	partial_end;	/* exclusive */
+	bool		partial_thp_start = false, partial_thp_end = false;
 	struct pagevec	pvec;
 	pgoff_t		index;
 	int		i;
@@ -215,15 +214,9 @@ void truncate_inode_pages_range(struct address_space *mapping,
 	if (mapping->nrpages == 0)
 		return;
 
-	/* Offsets within partial pages */
-	partial_start = lstart & (PAGE_CACHE_SIZE - 1);
-	partial_end = (lend + 1) & (PAGE_CACHE_SIZE - 1);
-
 	/*
 	 * 'start' and 'end' always covers the range of pages to be fully
-	 * truncated. Partial pages are covered with 'partial_start' at the
-	 * start of the range and 'partial_end' at the end of the range.
-	 * Note that 'end' is exclusive while 'lend' is inclusive.
+	 * truncated. Note that 'end' is exclusive while 'lend' is inclusive.
 	 */
 	start = (lstart + PAGE_CACHE_SIZE - 1) >> PAGE_CACHE_SHIFT;
 	if (lend == -1)
@@ -249,6 +242,23 @@ void truncate_inode_pages_range(struct address_space *mapping,
 			if (index >= end)
 				break;
 
+			if (PageTransTailCache(page)) {
+				/* part of already handled huge page */
+				if (!page->mapping)
+					continue;
+				/* the range starts in middle of huge page */
+				partial_thp_start = true;
+				start = index & ~HPAGE_CACHE_INDEX_MASK;
+				continue;
+			}
+			/* the range ends on huge page */
+			if (PageTransHugeCache(page) && index ==
+					(end & ~HPAGE_CACHE_INDEX_MASK)) {
+				partial_thp_end = true;
+				end = index;
+				break;
+			}
+
 			if (!trylock_page(page))
 				continue;
 			WARN_ON(page->index != index);
@@ -265,34 +275,64 @@ void truncate_inode_pages_range(struct address_space *mapping,
 		index++;
 	}
 
-	if (partial_start) {
-		struct page *page = find_lock_page(mapping, start - 1);
+	if (partial_thp_start || lstart & ~PAGE_CACHE_MASK) {
+		pgoff_t off;
+		struct page *page;
+		pgoff_t index_mask = 0UL;
+		loff_t page_cache_mask = PAGE_CACHE_MASK;
+retry_partial_start:
+		if (partial_thp_start) {
+			index_mask = HPAGE_CACHE_INDEX_MASK;
+			page_cache_mask = HPAGE_PMD_MASK;
+		}
+
+		off = (start - 1) & ~index_mask;
+		page = find_get_page(mapping, off);
 		if (page) {
-			unsigned int top = PAGE_CACHE_SIZE;
-			if (start > end) {
-				/* Truncation within a single page */
-				top = partial_end;
-				partial_end = 0;
+			unsigned pstart, pend;
+
+			/* the last tail page */
+			if (PageTransTailCache(page)) {
+				partial_thp_start = true;
+				page_cache_release(page);
+				goto retry_partial_start;
 			}
+
+			pstart = lstart & ~page_cache_mask;
+			if ((end & ~index_mask) == off)
+				pend = (lend - 1) & ~PAGE_CACHE_MASK;
+			else
+				pend = PAGE_CACHE_SIZE;
+
+			lock_page(page);
 			wait_on_page_writeback(page);
-			zero_user_segment(page, partial_start, top);
+			zero_pagecache_segment(page, pstart, pend);
 			cleancache_invalidate_page(mapping, page);
 			if (page_has_private(page))
-				do_invalidatepage(page, partial_start,
-						  top - partial_start);
+				do_invalidatepage(page, pstart,
+						pend - pstart);
 			unlock_page(page);
 			page_cache_release(page);
 		}
 	}
-	if (partial_end) {
-		struct page *page = find_lock_page(mapping, end);
+	if (partial_thp_end || (lend + 1) & ~PAGE_CACHE_MASK) {
+		struct page *page;
+		pgoff_t index_mask = 0UL;
+		loff_t page_cache_mask = PAGE_CACHE_MASK;
+
+		if (partial_thp_end) {
+			index_mask = HPAGE_CACHE_INDEX_MASK;
+			page_cache_mask = HPAGE_PMD_MASK;
+		}
+
+		page = find_lock_page(mapping, end & ~index_mask);
 		if (page) {
+			unsigned pend = (lend - 1) & ~page_cache_mask;
 			wait_on_page_writeback(page);
-			zero_user_segment(page, 0, partial_end);
+			zero_pagecache_segment(page, 0, pend);
 			cleancache_invalidate_page(mapping, page);
 			if (page_has_private(page))
-				do_invalidatepage(page, 0,
-						  partial_end);
+				do_invalidatepage(page, 0, pend);
 			unlock_page(page);
 			page_cache_release(page);
 		}
@@ -327,6 +367,9 @@ void truncate_inode_pages_range(struct address_space *mapping,
 			if (index >= end)
 				break;
 
+			if (PageTransTailCache(page))
+				continue;
+
 			lock_page(page);
 			WARN_ON(page->index != index);
 			wait_on_page_writeback(page);
@@ -401,6 +444,13 @@ unsigned long invalidate_mapping_pages(struct address_space *mapping,
 			if (index > end)
 				break;
 
+			/* skip huge page if it's not fully in the range */
+			if (PageTransHugeCache(page) &&
+					index + HPAGE_CACHE_NR - 1 > end)
+				continue;
+			if (PageTransTailCache(page))
+				continue;
+
 			if (!trylock_page(page))
 				continue;
 			WARN_ON(page->index != index);
-- 
 Kirill A. Shutemov
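
To illustrate the policy from the changelog, here is a small userspace sketch
(not from the patch) that classifies the huge pages touched by a truncate of
bytes [1MiB, 6MiB). It assumes 4KiB base pages and 2MiB huge pages; the byte
range and the number of huge pages scanned are arbitrary example values.

#include <stdio.h>

#define PAGE_CACHE_SIZE	(1ULL << 12)				/* 4KiB */
#define HPAGE_CACHE_NR	512ULL
#define HPAGE_SIZE	(HPAGE_CACHE_NR * PAGE_CACHE_SIZE)	/* 2MiB */

int main(void)
{
	unsigned long long lstart = 1ULL << 20;		/* inclusive */
	unsigned long long lend = (6ULL << 20) - 1;	/* inclusive */
	unsigned long long hp;

	for (hp = 0; hp < 4 * HPAGE_SIZE; hp += HPAGE_SIZE) {
		unsigned long long hp_end = hp + HPAGE_SIZE - 1;

		if (hp_end < lstart || hp > lend)
			printf("huge page at %lluMiB: untouched\n", hp >> 20);
		else if (hp >= lstart && hp_end <= lend)
			printf("huge page at %lluMiB: dropped whole\n", hp >> 20);
		else
			printf("huge page at %lluMiB: partially zeroed\n", hp >> 20);
	}
	return 0;
}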

^ permalink raw reply related	[flat|nested] 116+ messages in thread

* Re: [PATCH 19/23] truncate: support huge pages
@ 2013-08-09 14:39       ` Kirill A. Shutemov
  0 siblings, 0 replies; 116+ messages in thread
From: Kirill A. Shutemov @ 2013-08-09 14:39 UTC (permalink / raw)
  To: Dave Hansen
  Cc: Kirill A. Shutemov, Andrea Arcangeli, Andrew Morton, Al Viro,
	Hugh Dickins, Wu Fengguang, Jan Kara, Mel Gorman, linux-mm,
	Andi Kleen, Matthew Wilcox, Kirill A. Shutemov, Hillf Danton,
	Ning Qu, linux-fsdevel, linux-kernel

Dave Hansen wrote:
> On 08/03/2013 07:17 PM, Kirill A. Shutemov wrote:
> > If a huge page is only partly in the range we zero out the part,
> > exactly like we do for partial small pages.
> 
> What's the logic behind this behaviour?  Seems like the kind of place
> that we would really want to be splitting pages.

split_huge_page() now truncates the file, so we need to break
truncate<->split interdependency at some point.

> Like I said before, I somehow like to rewrite your code. :)

Makes sense. Please, take a look on patch below.

>From 631ad747933acbaa3284fae6e24ff1ae870a8f8f Mon Sep 17 00:00:00 2001
From: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
Date: Fri, 2 Aug 2013 12:57:08 +0300
Subject: [PATCH] truncate: support huge pages

truncate_inode_pages_range() drops whole huge page at once if it's fully
inside the range.

If a huge page is only partly in the range we zero out the part,
exactly like we do for partial small pages.

In some cases it worth to split the huge page instead, if we need to
truncate it partly and free some memory. But split_huge_page() now
truncates the file, so we need to break truncate<->split interdependency
at some point.

invalidate_mapping_pages() just skips huge pages if they are not fully
in the range.

Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Reviewed-by: Jan Kara <jack@suse.cz>
---
 include/linux/pagemap.h |  9 +++++
 mm/truncate.c           | 98 +++++++++++++++++++++++++++++++++++++------------
 2 files changed, 83 insertions(+), 24 deletions(-)

diff --git a/include/linux/pagemap.h b/include/linux/pagemap.h
index eb484f2..418be14 100644
--- a/include/linux/pagemap.h
+++ b/include/linux/pagemap.h
@@ -599,4 +599,13 @@ static inline void clear_pagecache_page(struct page *page)
 		clear_highpage(page);
 }
 
+static inline void zero_pagecache_segment(struct page *page,
+		unsigned start, unsigned len)
+{
+	if (PageTransHugeCache(page))
+		zero_huge_user_segment(page, start, len);
+	else
+		zero_user_segment(page, start, len);
+}
+
 #endif /* _LINUX_PAGEMAP_H */
diff --git a/mm/truncate.c b/mm/truncate.c
index 353b683..bc4f8d6 100644
--- a/mm/truncate.c
+++ b/mm/truncate.c
@@ -205,8 +205,7 @@ void truncate_inode_pages_range(struct address_space *mapping,
 {
 	pgoff_t		start;		/* inclusive */
 	pgoff_t		end;		/* exclusive */
-	unsigned int	partial_start;	/* inclusive */
-	unsigned int	partial_end;	/* exclusive */
+	bool		partial_thp_start = false, partial_thp_end = false;
 	struct pagevec	pvec;
 	pgoff_t		index;
 	int		i;
@@ -215,15 +214,9 @@ void truncate_inode_pages_range(struct address_space *mapping,
 	if (mapping->nrpages == 0)
 		return;
 
-	/* Offsets within partial pages */
-	partial_start = lstart & (PAGE_CACHE_SIZE - 1);
-	partial_end = (lend + 1) & (PAGE_CACHE_SIZE - 1);
-
 	/*
 	 * 'start' and 'end' always covers the range of pages to be fully
-	 * truncated. Partial pages are covered with 'partial_start' at the
-	 * start of the range and 'partial_end' at the end of the range.
-	 * Note that 'end' is exclusive while 'lend' is inclusive.
+	 * truncated. Note that 'end' is exclusive while 'lend' is inclusive.
 	 */
 	start = (lstart + PAGE_CACHE_SIZE - 1) >> PAGE_CACHE_SHIFT;
 	if (lend == -1)
@@ -249,6 +242,23 @@ void truncate_inode_pages_range(struct address_space *mapping,
 			if (index >= end)
 				break;
 
+			if (PageTransTailCache(page)) {
+				/* part of already handled huge page */
+				if (!page->mapping)
+					continue;
+				/* the range starts in middle of huge page */
+				partial_thp_start = true;
+				start = index & ~HPAGE_CACHE_INDEX_MASK;
+				continue;
+			}
+			/* the range ends on huge page */
+			if (PageTransHugeCache(page) && index ==
+					(end & ~HPAGE_CACHE_INDEX_MASK)) {
+				partial_thp_end = true;
+				end = index;
+				break;
+			}
+
 			if (!trylock_page(page))
 				continue;
 			WARN_ON(page->index != index);
@@ -265,34 +275,64 @@ void truncate_inode_pages_range(struct address_space *mapping,
 		index++;
 	}
 
-	if (partial_start) {
-		struct page *page = find_lock_page(mapping, start - 1);
+	if (partial_thp_start || lstart & ~PAGE_CACHE_MASK) {
+		pgoff_t off;
+		struct page *page;
+		pgoff_t index_mask = 0UL;
+		loff_t page_cache_mask = PAGE_CACHE_MASK;
+retry_partial_start:
+		if (partial_thp_start) {
+			index_mask = HPAGE_CACHE_INDEX_MASK;
+			page_cache_mask = HPAGE_PMD_MASK;
+		}
+
+		off = (start - 1) & ~index_mask;
+		page = find_get_page(mapping, off);
 		if (page) {
-			unsigned int top = PAGE_CACHE_SIZE;
-			if (start > end) {
-				/* Truncation within a single page */
-				top = partial_end;
-				partial_end = 0;
+			unsigned pstart, pend;
+
+			/* the last tail page */
+			if (PageTransTailCache(page)) {
+				partial_thp_start = true;
+				page_cache_release(page);
+				goto retry_partial_start;
 			}
+
+			pstart = lstart & ~page_cache_mask;
+			if ((end & ~index_mask) == off)
+				pend = (lend - 1) & ~PAGE_CACHE_MASK;
+			else
+				pend = PAGE_CACHE_SIZE;
+
+			lock_page(page);
 			wait_on_page_writeback(page);
-			zero_user_segment(page, partial_start, top);
+			zero_pagecache_segment(page, pstart, pend);
 			cleancache_invalidate_page(mapping, page);
 			if (page_has_private(page))
-				do_invalidatepage(page, partial_start,
-						  top - partial_start);
+				do_invalidatepage(page, pstart,
+						pend - pstart);
 			unlock_page(page);
 			page_cache_release(page);
 		}
 	}
-	if (partial_end) {
-		struct page *page = find_lock_page(mapping, end);
+	if (partial_thp_end || (lend + 1) & ~PAGE_CACHE_MASK) {
+		struct page *page;
+		pgoff_t index_mask = 0UL;
+		loff_t page_cache_mask = PAGE_CACHE_MASK;
+
+		if (partial_thp_end) {
+			index_mask = HPAGE_CACHE_INDEX_MASK;
+			page_cache_mask = HPAGE_PMD_MASK;
+		}
+
+		page = find_lock_page(mapping, end & ~index_mask);
 		if (page) {
+			unsigned pend = (lend - 1) & ~page_cache_mask;
 			wait_on_page_writeback(page);
-			zero_user_segment(page, 0, partial_end);
+			zero_pagecache_segment(page, 0, pend);
 			cleancache_invalidate_page(mapping, page);
 			if (page_has_private(page))
-				do_invalidatepage(page, 0,
-						  partial_end);
+				do_invalidatepage(page, 0, pend);
 			unlock_page(page);
 			page_cache_release(page);
 		}
@@ -327,6 +367,9 @@ void truncate_inode_pages_range(struct address_space *mapping,
 			if (index >= end)
 				break;
 
+			if (PageTransTailCache(page))
+				continue;
+
 			lock_page(page);
 			WARN_ON(page->index != index);
 			wait_on_page_writeback(page);
@@ -401,6 +444,13 @@ unsigned long invalidate_mapping_pages(struct address_space *mapping,
 			if (index > end)
 				break;
 
+			/* skip huge page if it's not fully in the range */
+			if (PageTransHugeCache(page) &&
+					index + HPAGE_CACHE_NR - 1 > end)
+				continue;
+			if (PageTransTailCache(page))
+				continue;
+
 			if (!trylock_page(page))
 				continue;
 			WARN_ON(page->index != index);
-- 
 Kirill A. Shutemov

^ permalink raw reply related	[flat|nested] 116+ messages in thread

* Re: [PATCH 19/23] truncate: support huge pages
@ 2013-08-09 14:39       ` Kirill A. Shutemov
  0 siblings, 0 replies; 116+ messages in thread
From: Kirill A. Shutemov @ 2013-08-09 14:39 UTC (permalink / raw)
  To: Dave Hansen
  Cc: Kirill A. Shutemov, Andrea Arcangeli, Andrew Morton, Al Viro,
	Hugh Dickins, Wu Fengguang, Jan Kara, Mel Gorman, linux-mm,
	Andi Kleen, Matthew Wilcox, Kirill A. Shutemov, Hillf Danton,
	Ning Qu, linux-fsdevel, linux-kernel

Dave Hansen wrote:
> On 08/03/2013 07:17 PM, Kirill A. Shutemov wrote:
> > If a huge page is only partly in the range we zero out the part,
> > exactly like we do for partial small pages.
> 
> What's the logic behind this behaviour?  Seems like the kind of place
> that we would really want to be splitting pages.

split_huge_page() now truncates the file itself, so we need to break the
truncate<->split interdependency at some point.

> Like I said before, I somehow like to rewrite your code. :)

Makes sense. Please take a look at the patch below.

^ permalink raw reply	[flat|nested] 116+ messages in thread

* Re: [PATCH 20/23] thp: handle file pages in split_huge_page()
  2013-08-06 21:47       ` Ning Qu
  2013-08-09 14:46           ` Kirill A. Shutemov
@ 2013-08-09 14:46           ` Kirill A. Shutemov
  0 siblings, 0 replies; 116+ messages in thread
From: Kirill A. Shutemov @ 2013-08-09 14:46 UTC (permalink / raw)
  To: Ning Qu
  Cc: Kirill A. Shutemov, Andrea Arcangeli, Andrew Morton, Al Viro,
	Hugh Dickins, Wu Fengguang, Jan Kara, Mel Gorman, linux-mm,
	Andi Kleen, Matthew Wilcox, Kirill A. Shutemov, Hillf Danton,
	Dave Hansen, linux-fsdevel, linux-kernel

Ning Qu wrote:
> I just tried, and it seems working fine now without the deadlock anymore. I
> can run some big internal test with about 40GB files in sysv shm. Just move
> the line before the locking happens in vma_adjust, something as below, the
> line number is not accurate because my patch is based on another tree right
> now.

Looks okay to me. Could you prepare a real patch (description, etc.)? I'll
add it to my patchset.

> 
> --- a/mm/mmap.c
> +++ b/mm/mmap.c
> @@ -581,6 +581,8 @@ again:                      remove_next = 1 + (end >
> next->vm_end);
>                 }
>         }
> 
> +       vma_adjust_trans_huge(vma, start, end, adjust_next);
> +
>         if (file) {
>                 mapping = file->f_mapping;
>                 if (!(vma->vm_flags & VM_NONLINEAR))
> @@ -597,8 +599,6 @@ again:                      remove_next = 1 + (end >
> next->vm_end);
>                 }
>         }
> 
> -       vma_adjust_trans_huge(vma, start, end, adjust_next);
> -
>         anon_vma = vma->anon_vma;
>         if (!anon_vma && adjust_next)
>                 anon_vma = next->anon_vma;
> 
> 
> Best wishes,
> -- 
> Ning Qu (曲宁) | Software Engineer | quning@google.com | +1-408-418-6066

-- 
 Kirill A. Shutemov

^ permalink raw reply	[flat|nested] 116+ messages in thread

* Re: [PATCH 20/23] thp: handle file pages in split_huge_page()
  2013-08-09 14:46           ` Kirill A. Shutemov
  (?)
  (?)
@ 2013-08-09 14:49           ` Ning Qu
  2013-08-09 21:24               ` Ning Qu
  -1 siblings, 1 reply; 116+ messages in thread
From: Ning Qu @ 2013-08-09 14:49 UTC (permalink / raw)
  To: Kirill A. Shutemov
  Cc: Matthew Wilcox, Kirill A. Shutemov, linux-fsdevel, Hugh Dickins,
	Mel Gorman, Al Viro, Andrew Morton, Andrea Arcangeli,
	linux-kernel, Andi Kleen, Wu Fengguang, Jan Kara, Dave Hansen,
	linux-mm, Hillf Danton

Sure!
On Aug 9, 2013 7:42 AM, "Kirill A. Shutemov" <
kirill.shutemov@linux.intel.com> wrote:

> Ning Qu wrote:
> > I just tried, and it seems working fine now without the deadlock
> anymore. I
> > can run some big internal test with about 40GB files in sysv shm. Just
> move
> > the line before the locking happens in vma_adjust, something as below,
> the
> > line number is not accurate because my patch is based on another tree
> right
> > now.
>
> Looks okay to me. Could you prepare real patch (description, etc.). I'll
> add it to my patchset.
>
> >
> > --- a/mm/mmap.c
> > +++ b/mm/mmap.c
> > @@ -581,6 +581,8 @@ again:                      remove_next = 1 + (end >
> > next->vm_end);
> >                 }
> >         }
> >
> > +       vma_adjust_trans_huge(vma, start, end, adjust_next);
> > +
> >         if (file) {
> >                 mapping = file->f_mapping;
> >                 if (!(vma->vm_flags & VM_NONLINEAR))
> > @@ -597,8 +599,6 @@ again:                      remove_next = 1 + (end >
> > next->vm_end);
> >                 }
> >         }
> >
> > -       vma_adjust_trans_huge(vma, start, end, adjust_next);
> > -
> >         anon_vma = vma->anon_vma;
> >         if (!anon_vma && adjust_next)
> >                 anon_vma = next->anon_vma;
> >
> >
> > Best wishes,
> > --
> > Ning Qu (曲宁) | Software Engineer | quning@google.com | +1-408-418-6066
>
> --
>  Kirill A. Shutemov
>

^ permalink raw reply	[flat|nested] 116+ messages in thread

* Re: [PATCH 22/23] thp, mm: split huge page on mmap file page
  2013-08-08 20:49     ` Khalid Aziz
@ 2013-08-09 14:50       ` Kirill A. Shutemov
  -1 siblings, 0 replies; 116+ messages in thread
From: Kirill A. Shutemov @ 2013-08-09 14:50 UTC (permalink / raw)
  To: Khalid Aziz
  Cc: Kirill A. Shutemov, Andrea Arcangeli, Andrew Morton, Al Viro,
	Hugh Dickins, Wu Fengguang, Jan Kara, Mel Gorman, linux-mm,
	Andi Kleen, Matthew Wilcox, Kirill A. Shutemov, Hillf Danton,
	Dave Hansen, Ning Qu, linux-fsdevel, linux-kernel

Khalid Aziz wrote:
> On Sun, 2013-08-04 at 05:17 +0300, Kirill A. Shutemov wrote:
> > From: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
> > 
> > We are not ready to mmap file-backed tranparent huge pages. Let's split
> > them on fault attempt.
> > 
> > Later we'll implement mmap() properly and this code path be used for
> > fallback cases.
> > 
> > Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
> > ---
> >  mm/filemap.c | 2 ++
> >  1 file changed, 2 insertions(+)
> > 
> > diff --git a/mm/filemap.c b/mm/filemap.c
> > index ed65af5..f7857ef 100644
> > --- a/mm/filemap.c
> > +++ b/mm/filemap.c
> > @@ -1743,6 +1743,8 @@ retry_find:
> >  			goto no_cached_page;
> >  	}
> >  
> > +	if (PageTransCompound(page))
> > +		split_huge_page(compound_trans_head(page));
> 
> Since PageTransCompound(page) returns true for transparent huge pages as
> well as hugetlbfs pages, could this code split hugetlbfs pages on an
> mmap() on to hugetlbfs pages? hugetlbfs pages are not supposed to be
> split, right?

A hugetlbfs page cannot (should not) end up here: hugetlbfs uses a
different vm_ops->fault handler.
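
For reference, a sketch of why hugetlbfs pages never reach this path
(based on mm/hugetlb.c of this era, shown for illustration only; not part
of the patch):

	/*
	 * hugetlbfs VMAs install their own vm_operations_struct, so a
	 * fault on them never goes through filemap_fault(), where the
	 * split above is added.  The .fault op is never even called for
	 * hugetlb VMAs: handle_mm_fault() dispatches to hugetlb_fault()
	 * before vm_ops->fault is consulted.
	 */
	const struct vm_operations_struct hugetlb_vm_ops = {
		.fault = hugetlb_vm_op_fault,	/* BUG()s if ever reached */
		.open = hugetlb_vm_op_open,
		.close = hugetlb_vm_op_close,
	};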

-- 
 Kirill A. Shutemov

^ permalink raw reply	[flat|nested] 116+ messages in thread

* Re: [PATCH 20/23] thp: handle file pages in split_huge_page()
  2013-08-09 14:49           ` Ning Qu
@ 2013-08-09 21:24               ` Ning Qu
  0 siblings, 0 replies; 116+ messages in thread
From: Ning Qu @ 2013-08-09 21:24 UTC (permalink / raw)
  To: Kirill A. Shutemov
  Cc: Matthew Wilcox, Kirill A. Shutemov, linux-fsdevel, Hugh Dickins,
	Mel Gorman, Al Viro, Andrew Morton, Andrea Arcangeli,
	linux-kernel, Andi Kleen, Wu Fengguang, Jan Kara, Dave Hansen,
	linux-mm, Hillf Danton

Does this look good to you? I want it to stay in this thread so I didn't
send it by git imap-send.


>From e637cfc915107f0b7df11ec58046c1f7fae87a10 Mon Sep 17 00:00:00 2001
From: Ning Qu <quning@google.com>
Date: Fri, 9 Aug 2013 13:54:24 -0700
Subject: [PATCH] thp: Fix deadlock situation in vma_adjust with huge page in
 page cache.

In vma_adjust, the current code grabs i_mmap_mutex before calling
vma_adjust_trans_huge. This used to be fine until huge pages in the page
cache came in. The problem is that the underlying function
split_file_huge_page also grabs i_mmap_mutex before splitting the huge
page in the page cache, which obviously leads to a deadlock.

The fix is to move vma_adjust_trans_huge to before grabbing the lock for
the file, the same as the function already does for anonymous memory.
Tested; everything works fine so far.

Signed-off-by: Ning Qu <quning@google.com>
---
 mm/mmap.c | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/mm/mmap.c b/mm/mmap.c
index 519ce78..accf1b3 100644
--- a/mm/mmap.c
+++ b/mm/mmap.c
@@ -750,6 +750,8 @@ again:                      remove_next = 1 + (end > next->vm_end);
                }
        }

+       vma_adjust_trans_huge(vma, start, end, adjust_next);
+
        if (file) {
                mapping = file->f_mapping;
                if (!(vma->vm_flags & VM_NONLINEAR)) {
@@ -773,8 +775,6 @@ again:                      remove_next = 1 + (end > next->vm_end);
                }
        }

-       vma_adjust_trans_huge(vma, start, end, adjust_next);
-
        anon_vma = vma->anon_vma;
        if (!anon_vma && adjust_next)
                anon_vma = next->anon_vma;
-- 
1.8.3
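
The lock recursion being avoided, schematically (function names as used
in the patchset; shown for illustration, not part of the patch):

	vma_adjust()
	  mutex_lock(&mapping->i_mmap_mutex)       /* taken for file-backed VMAs */
	    vma_adjust_trans_huge()
	      split_file_huge_page()
	        mutex_lock(&mapping->i_mmap_mutex)  /* second acquisition: deadlock */

Calling vma_adjust_trans_huge() before i_mmap_mutex is taken, as the
patch does, removes the nested acquisition.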


Best wishes,
-- 
Ning Qu (曲宁) | Software Engineer | quning@google.com | +1-408-418-6066


On Fri, Aug 9, 2013 at 7:49 AM, Ning Qu <quning@google.com> wrote:

> Sure!
> On Aug 9, 2013 7:42 AM, "Kirill A. Shutemov" <
> kirill.shutemov@linux.intel.com> wrote:
>
>> Ning Qu wrote:
>> > I just tried, and it seems working fine now without the deadlock
>> anymore. I
>> > can run some big internal test with about 40GB files in sysv shm. Just
>> move
>> > the line before the locking happens in vma_adjust, something as below,
>> the
>> > line number is not accurate because my patch is based on another tree
>> right
>> > now.
>>
>> Looks okay to me. Could you prepare real patch (description, etc.). I'll
>> add it to my patchset.
>>
>> >
>> > --- a/mm/mmap.c
>> > +++ b/mm/mmap.c
>> > @@ -581,6 +581,8 @@ again:                      remove_next = 1 + (end >
>> > next->vm_end);
>> >                 }
>> >         }
>> >
>> > +       vma_adjust_trans_huge(vma, start, end, adjust_next);
>> > +
>> >         if (file) {
>> >                 mapping = file->f_mapping;
>> >                 if (!(vma->vm_flags & VM_NONLINEAR))
>> > @@ -597,8 +599,6 @@ again:                      remove_next = 1 + (end >
>> > next->vm_end);
>> >                 }
>> >         }
>> >
>> > -       vma_adjust_trans_huge(vma, start, end, adjust_next);
>> > -
>> >         anon_vma = vma->anon_vma;
>> >         if (!anon_vma && adjust_next)
>> >                 anon_vma = next->anon_vma;
>> >
>> >
>> > Best wishes,
>> > --
>> > Ning Qu (曲宁) | Software Engineer | quning@google.com | +1-408-418-6066
>>
>> --
>>  Kirill A. Shutemov
>>
>

^ permalink raw reply related	[flat|nested] 116+ messages in thread

* Re: [PATCH 05/23] thp: represent file thp pages in meminfo and friends
  2013-08-04  2:17   ` Kirill A. Shutemov
  (?)
@ 2013-08-30 22:16   ` Ning Qu
  2013-09-02 11:36       ` Kirill A. Shutemov
  -1 siblings, 1 reply; 116+ messages in thread
From: Ning Qu @ 2013-08-30 22:16 UTC (permalink / raw)
  To: Kirill A. Shutemov
  Cc: Andrea Arcangeli, Andrew Morton, Al Viro, Hugh Dickins,
	Wu Fengguang, Jan Kara, Mel Gorman, linux-mm, Andi Kleen,
	Matthew Wilcox, Kirill A. Shutemov, Hillf Danton, Dave Hansen,
	linux-fsdevel, linux-kernel

Hi, Kirill

I believe there is a typo in one of your earlier commits, which is not
included in this patch series. Below is the link to the commit. I think
you intend to decrease NR_ANON_PAGES in page_remove_rmap, but the code
currently adds to it instead when calling __mod_zone_page_state. Let me
know whether you would like to fix it in your commit or you want another
patch from me. Thanks!

https://git.kernel.org/cgit/linux/kernel/git/kas/linux.git/commit/?h=thp/pagecache&id=90ca9354b08a7b26ba468c7d2ea1229e93d67b92

@@ -1151,11 +1151,11 @@ void page_remove_rmap(struct page *page)
 		goto out;
 	if (anon) {
 		mem_cgroup_uncharge_page(page);
-		if (!PageTransHuge(page))
-			__dec_zone_page_state(page, NR_ANON_PAGES);
-		else
+		if (PageTransHuge(page))
 			__dec_zone_page_state(page,
 					      NR_ANON_TRANSPARENT_HUGEPAGES);
+		__mod_zone_page_state(page_zone(page), NR_ANON_PAGES,
+				      hpage_nr_pages(page));
 	} else {
 		__dec_zone_page_state(page, NR_FILE_MAPPED);
 		mem_cgroup_dec_page_stat(page, MEMCG_NR_FILE_MAPPED);
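
For clarity, the intended decrement would flip the sign of the delta,
something like this (a sketch of the fix, not the actual commit):

	/* unmapping: subtract the pages from NR_ANON_PAGES */
	__mod_zone_page_state(page_zone(page), NR_ANON_PAGES,
			      -hpage_nr_pages(page));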


Best wishes,
-- 
Ning Qu (曲宁) | Software Engineer | quning@google.com | +1-408-418-6066


On Sat, Aug 3, 2013 at 7:17 PM, Kirill A. Shutemov <
kirill.shutemov@linux.intel.com> wrote:

> From: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
>
> The patch adds new zone stat to count file transparent huge pages and
> adjust related places.
>
> For now we don't count mapped or dirty file thp pages separately.
>
> The patch depends on patch
>  thp: account anon transparent huge pages into NR_ANON_PAGES
>
> Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
> Acked-by: Dave Hansen <dave.hansen@linux.intel.com>
> ---
>  drivers/base/node.c    | 4 ++++
>  fs/proc/meminfo.c      | 3 +++
>  include/linux/mmzone.h | 1 +
>  mm/vmstat.c            | 1 +
>  4 files changed, 9 insertions(+)
>
> diff --git a/drivers/base/node.c b/drivers/base/node.c
> index bc9f43b..de261f5 100644
> --- a/drivers/base/node.c
> +++ b/drivers/base/node.c
> @@ -119,6 +119,7 @@ static ssize_t node_read_meminfo(struct device *dev,
>                        "Node %d SUnreclaim:     %8lu kB\n"
>  #ifdef CONFIG_TRANSPARENT_HUGEPAGE
>                        "Node %d AnonHugePages:  %8lu kB\n"
> +                      "Node %d FileHugePages:  %8lu kB\n"
>  #endif
>                         ,
>                        nid, K(node_page_state(nid, NR_FILE_DIRTY)),
> @@ -140,6 +141,9 @@ static ssize_t node_read_meminfo(struct device *dev,
>                        nid, K(node_page_state(nid, NR_SLAB_UNRECLAIMABLE))
>                         , nid,
>                         K(node_page_state(nid,
> NR_ANON_TRANSPARENT_HUGEPAGES) *
> +                       HPAGE_PMD_NR)
> +                       , nid,
> +                       K(node_page_state(nid,
> NR_FILE_TRANSPARENT_HUGEPAGES) *
>                         HPAGE_PMD_NR));
>  #else
>                        nid, K(node_page_state(nid,
> NR_SLAB_UNRECLAIMABLE)));
> diff --git a/fs/proc/meminfo.c b/fs/proc/meminfo.c
> index 59d85d6..a62952c 100644
> --- a/fs/proc/meminfo.c
> +++ b/fs/proc/meminfo.c
> @@ -104,6 +104,7 @@ static int meminfo_proc_show(struct seq_file *m, void
> *v)
>  #endif
>  #ifdef CONFIG_TRANSPARENT_HUGEPAGE
>                 "AnonHugePages:  %8lu kB\n"
> +               "FileHugePages:  %8lu kB\n"
>  #endif
>                 ,
>                 K(i.totalram),
> @@ -158,6 +159,8 @@ static int meminfo_proc_show(struct seq_file *m, void
> *v)
>  #ifdef CONFIG_TRANSPARENT_HUGEPAGE
>                 ,K(global_page_state(NR_ANON_TRANSPARENT_HUGEPAGES) *
>                    HPAGE_PMD_NR)
> +               ,K(global_page_state(NR_FILE_TRANSPARENT_HUGEPAGES) *
> +                  HPAGE_PMD_NR)
>  #endif
>                 );
>
> diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h
> index 0c41d59..ba81833 100644
> --- a/include/linux/mmzone.h
> +++ b/include/linux/mmzone.h
> @@ -142,6 +142,7 @@ enum zone_stat_item {
>         NUMA_OTHER,             /* allocation from other node */
>  #endif
>         NR_ANON_TRANSPARENT_HUGEPAGES,
> +       NR_FILE_TRANSPARENT_HUGEPAGES,
>         NR_FREE_CMA_PAGES,
>         NR_VM_ZONE_STAT_ITEMS };
>
> diff --git a/mm/vmstat.c b/mm/vmstat.c
> index 87228c5..ffe3fbd 100644
> --- a/mm/vmstat.c
> +++ b/mm/vmstat.c
> @@ -739,6 +739,7 @@ const char * const vmstat_text[] = {
>         "numa_other",
>  #endif
>         "nr_anon_transparent_hugepages",
> +       "nr_file_transparent_hugepages",
>         "nr_free_cma",
>         "nr_dirty_threshold",
>         "nr_dirty_background_threshold",
> --
> 1.8.3.2
>
>

^ permalink raw reply	[flat|nested] 116+ messages in thread

* Re: [PATCH 05/23] thp: represent file thp pages in meminfo and friends
  2013-08-30 22:16   ` Ning Qu
@ 2013-09-02 11:36       ` Kirill A. Shutemov
  0 siblings, 0 replies; 116+ messages in thread
From: Kirill A. Shutemov @ 2013-09-02 11:36 UTC (permalink / raw)
  To: Ning Qu
  Cc: Kirill A. Shutemov, Andrea Arcangeli, Andrew Morton, Al Viro,
	Hugh Dickins, Wu Fengguang, Jan Kara, Mel Gorman, linux-mm,
	Andi Kleen, Matthew Wilcox, Kirill A. Shutemov, Hillf Danton,
	Dave Hansen, linux-fsdevel, linux-kernel

Ning Qu wrote:
> Hi, Kirill
> 
> I believe there is a typo in your previous commit, but you didn't include
> it in this series of patch set. Below is the link for the commit. I think
> you are trying to decrease the value NR_ANON_PAGES in page_remove_rmap, but
> it is currently adding the value instead when using __mod_zone_page_state.Let
> me know if you would like to fix it in your commit or you want another
> patch from me. Thanks!

The patch is already in Andrew's tree. I'll send a fix for that. Thanks.

-- 
 Kirill A. Shutemov

^ permalink raw reply	[flat|nested] 116+ messages in thread

* Re: [PATCH 05/23] thp: represent file thp pages in meminfo and friends
  2013-09-02 11:36       ` Kirill A. Shutemov
  (?)
@ 2013-09-02 20:05       ` Ning Qu
  -1 siblings, 0 replies; 116+ messages in thread
From: Ning Qu @ 2013-09-02 20:05 UTC (permalink / raw)
  To: Kirill A. Shutemov
  Cc: Andrea Arcangeli, Andrew Morton, Al Viro, Hugh Dickins,
	Wu Fengguang, Jan Kara, Mel Gorman, linux-mm, Andi Kleen,
	Matthew Wilcox, Kirill A. Shutemov, Hillf Danton, Dave Hansen,
	linux-fsdevel, linux-kernel

Oh, I can see it; those patches made their way into Andrew's tree a few
days ago.

That kind of makes it harder for me to maintain the backport from the
patchset.

Btw, I am comparing the differences between your most recent patchset and
v4. Are you planning to release a new patch set?

After consolidating my backport with your most recent patchset, I will
probably see if I can get my tmpfs patch adapted to 3.11. I will let you
know once I get there.

Thanks!

Best wishes,
-- 
Ning Qu (曲宁) | Software Engineer | quning@google.com | +1-408-418-6066


On Mon, Sep 2, 2013 at 4:36 AM, Kirill A. Shutemov <
kirill.shutemov@linux.intel.com> wrote:

> Ning Qu wrote:
> > Hi, Kirill
> >
> > I believe there is a typo in your previous commit, but you didn't include
> > it in this series of patch set. Below is the link for the commit. I think
> > you are trying to decrease the value NR_ANON_PAGES in page_remove_rmap,
> but
> > it is currently adding the value instead when using
> __mod_zone_page_state.Let
> > me know if you would like to fix it in your commit or you want another
> > patch from me. Thanks!
>
> The patch is already in Andrew's tree. I'll send a fix for that. Thanks.
>
> --
>  Kirill A. Shutemov
>

^ permalink raw reply	[flat|nested] 116+ messages in thread

* Re: [PATCH 03/23] thp: compile-time and sysfs knob for thp pagecache
  2013-08-04  2:17   ` Kirill A. Shutemov
  (?)
@ 2013-09-05 21:53   ` Ning Qu
  2013-09-06 11:33       ` Kirill A. Shutemov
  -1 siblings, 1 reply; 116+ messages in thread
From: Ning Qu @ 2013-09-05 21:53 UTC (permalink / raw)
  To: Kirill A. Shutemov
  Cc: Andrea Arcangeli, Andrew Morton, Al Viro, Hugh Dickins,
	Wu Fengguang, Jan Kara, Mel Gorman, linux-mm, Andi Kleen,
	Matthew Wilcox, Kirill A. Shutemov, Hillf Danton, Dave Hansen,
	linux-fsdevel, linux-kernel

One minor question inline.

Best wishes,
-- 
Ning Qu (曲宁) | Software Engineer | quning@google.com | +1-408-418-6066


On Sat, Aug 3, 2013 at 7:17 PM, Kirill A. Shutemov <
kirill.shutemov@linux.intel.com> wrote:

> From: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
>
> For now, TRANSPARENT_HUGEPAGE_PAGECACHE is only implemented for x86_64.
>
> Radix tree perload overhead can be significant on BASE_SMALL systems, so
> let's add dependency on !BASE_SMALL.
>
> /sys/kernel/mm/transparent_hugepage/page_cache is runtime knob for the
> feature. It's enabled by default if TRANSPARENT_HUGEPAGE_PAGECACHE is
> enabled.
>
> Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
> ---
>  Documentation/vm/transhuge.txt |  9 +++++++++
>  include/linux/huge_mm.h        |  9 +++++++++
>  mm/Kconfig                     | 12 ++++++++++++
>  mm/huge_memory.c               | 23 +++++++++++++++++++++++
>  4 files changed, 53 insertions(+)
>
> diff --git a/Documentation/vm/transhuge.txt
> b/Documentation/vm/transhuge.txt
> index 4a63953..4cc15c4 100644
> --- a/Documentation/vm/transhuge.txt
> +++ b/Documentation/vm/transhuge.txt
> @@ -103,6 +103,15 @@ echo always
> >/sys/kernel/mm/transparent_hugepage/enabled
>  echo madvise >/sys/kernel/mm/transparent_hugepage/enabled
>  echo never >/sys/kernel/mm/transparent_hugepage/enabled
>
> +If TRANSPARENT_HUGEPAGE_PAGECACHE is enabled kernel will use huge pages in
> +page cache if possible. It can be disable and re-enabled via sysfs:
> +
> +echo 0 >/sys/kernel/mm/transparent_hugepage/page_cache
> +echo 1 >/sys/kernel/mm/transparent_hugepage/page_cache
> +
> +If it's disabled kernel will not add new huge pages to page cache and
> +split them on mapping, but already mapped pages will stay intakt.
> +
>  It's also possible to limit defrag efforts in the VM to generate
>  hugepages in case they're not immediately free to madvise regions or
>  to never try to defrag memory and simply fallback to regular pages
> diff --git a/include/linux/huge_mm.h b/include/linux/huge_mm.h
> index 3935428..1534e1e 100644
> --- a/include/linux/huge_mm.h
> +++ b/include/linux/huge_mm.h
> @@ -40,6 +40,7 @@ enum transparent_hugepage_flag {
>         TRANSPARENT_HUGEPAGE_DEFRAG_FLAG,
>         TRANSPARENT_HUGEPAGE_DEFRAG_REQ_MADV_FLAG,
>         TRANSPARENT_HUGEPAGE_DEFRAG_KHUGEPAGED_FLAG,
> +       TRANSPARENT_HUGEPAGE_PAGECACHE,
>         TRANSPARENT_HUGEPAGE_USE_ZERO_PAGE_FLAG,
>  #ifdef CONFIG_DEBUG_VM
>         TRANSPARENT_HUGEPAGE_DEBUG_COW_FLAG,
> @@ -229,4 +230,12 @@ static inline int do_huge_pmd_numa_page(struct
> mm_struct *mm, struct vm_area_str
>
>  #endif /* CONFIG_TRANSPARENT_HUGEPAGE */
>
> +static inline bool transparent_hugepage_pagecache(void)
> +{
> +       if (!IS_ENABLED(CONFIG_TRANSPARENT_HUGEPAGE_PAGECACHE))
> +               return false;
> +       if (!(transparent_hugepage_flags & (1<<TRANSPARENT_HUGEPAGE_FLAG)))
>

Here, I suppose we should test TRANSPARENT_HUGEPAGE_REQ_MADV_FLAG as
well, e.g.:
        if (!(transparent_hugepage_flags &
              ((1<<TRANSPARENT_HUGEPAGE_FLAG) |
               (1<<TRANSPARENT_HUGEPAGE_REQ_MADV_FLAG))))

+               return false;
> +       return transparent_hugepage_flags &
> (1<<TRANSPARENT_HUGEPAGE_PAGECACHE);
> +}
>  #endif /* _LINUX_HUGE_MM_H */
> diff --git a/mm/Kconfig b/mm/Kconfig
> index 256bfd0..1e30ee8 100644
> --- a/mm/Kconfig
> +++ b/mm/Kconfig
> @@ -420,6 +420,18 @@ choice
>           benefit.
>  endchoice
>
> +config TRANSPARENT_HUGEPAGE_PAGECACHE
> +       bool "Transparent Hugepage Support for page cache"
> +       depends on X86_64 && TRANSPARENT_HUGEPAGE
> +       # avoid radix tree preload overhead
> +       depends on !BASE_SMALL
> +       default y
> +       help
> +         Enabling the option adds support hugepages for file-backed
> +         mappings. It requires transparent hugepage support from
> +         filesystem side. For now, the only filesystem which supports
> +         hugepages is ramfs.
> +
>  config CROSS_MEMORY_ATTACH
>         bool "Cross Memory Support"
>         depends on MMU
> diff --git a/mm/huge_memory.c b/mm/huge_memory.c
> index d96d921..523946c 100644
> --- a/mm/huge_memory.c
> +++ b/mm/huge_memory.c
> @@ -42,6 +42,9 @@ unsigned long transparent_hugepage_flags __read_mostly =
>  #endif
>         (1<<TRANSPARENT_HUGEPAGE_DEFRAG_FLAG)|
>         (1<<TRANSPARENT_HUGEPAGE_DEFRAG_KHUGEPAGED_FLAG)|
> +#ifdef CONFIG_TRANSPARENT_HUGEPAGE_PAGECACHE
> +       (1<<TRANSPARENT_HUGEPAGE_PAGECACHE)|
> +#endif
>         (1<<TRANSPARENT_HUGEPAGE_USE_ZERO_PAGE_FLAG);
>
>  /* default scan 8*512 pte (or vmas) every 30 second */
> @@ -362,6 +365,23 @@ static ssize_t defrag_store(struct kobject *kobj,
>  static struct kobj_attribute defrag_attr =
>         __ATTR(defrag, 0644, defrag_show, defrag_store);
>
> +#ifdef CONFIG_TRANSPARENT_HUGEPAGE_PAGECACHE
> +static ssize_t page_cache_show(struct kobject *kobj,
> +               struct kobj_attribute *attr, char *buf)
> +{
> +       return single_flag_show(kobj, attr, buf,
> +                               TRANSPARENT_HUGEPAGE_PAGECACHE);
> +}
> +static ssize_t page_cache_store(struct kobject *kobj,
> +               struct kobj_attribute *attr, const char *buf, size_t count)
> +{
> +       return single_flag_store(kobj, attr, buf, count,
> +                                TRANSPARENT_HUGEPAGE_PAGECACHE);
> +}
> +static struct kobj_attribute page_cache_attr =
> +       __ATTR(page_cache, 0644, page_cache_show, page_cache_store);
> +#endif
> +
>  static ssize_t use_zero_page_show(struct kobject *kobj,
>                 struct kobj_attribute *attr, char *buf)
>  {
> @@ -397,6 +417,9 @@ static struct kobj_attribute debug_cow_attr =
>  static struct attribute *hugepage_attr[] = {
>         &enabled_attr.attr,
>         &defrag_attr.attr,
> +#ifdef CONFIG_TRANSPARENT_HUGEPAGE_PAGECACHE
> +       &page_cache_attr.attr,
> +#endif
>         &use_zero_page_attr.attr,
>  #ifdef CONFIG_DEBUG_VM
>         &debug_cow_attr.attr,
> --
> 1.8.3.2
>
>

[-- Attachment #2: Type: text/html, Size: 8627 bytes --]

^ permalink raw reply	[flat|nested] 116+ messages in thread

* Re: [PATCH 03/23] thp: compile-time and sysfs knob for thp pagecache
  2013-09-05 21:53   ` Ning Qu
  2013-09-06 11:33       ` Kirill A. Shutemov
@ 2013-09-06 11:33       ` Kirill A. Shutemov
  0 siblings, 0 replies; 116+ messages in thread
From: Kirill A. Shutemov @ 2013-09-06 11:33 UTC (permalink / raw)
  To: Ning Qu
  Cc: Kirill A. Shutemov, Andrea Arcangeli, Andrew Morton, Al Viro,
	Hugh Dickins, Wu Fengguang, Jan Kara, Mel Gorman, linux-mm,
	Andi Kleen, Matthew Wilcox, Kirill A. Shutemov, Hillf Danton,
	Dave Hansen, linux-fsdevel, linux-kernel

Ning Qu wrote:
> One minor question inline.
> 
> Best wishes,
> -- 
> Ning Qu (曲宁) | Software Engineer | quning@google.com | +1-408-418-6066
> 
> 
> On Sat, Aug 3, 2013 at 7:17 PM, Kirill A. Shutemov <
> kirill.shutemov@linux.intel.com> wrote:
> 
> > From: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
> >
> > For now, TRANSPARENT_HUGEPAGE_PAGECACHE is only implemented for x86_64.
> >
> > Radix tree perload overhead can be significant on BASE_SMALL systems, so
> > let's add dependency on !BASE_SMALL.
> >
> > /sys/kernel/mm/transparent_hugepage/page_cache is runtime knob for the
> > feature. It's enabled by default if TRANSPARENT_HUGEPAGE_PAGECACHE is
> > enabled.
> >
> > Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
> > ---
> >  Documentation/vm/transhuge.txt |  9 +++++++++
> >  include/linux/huge_mm.h        |  9 +++++++++
> >  mm/Kconfig                     | 12 ++++++++++++
> >  mm/huge_memory.c               | 23 +++++++++++++++++++++++
> >  4 files changed, 53 insertions(+)
> >
> > diff --git a/Documentation/vm/transhuge.txt
> > b/Documentation/vm/transhuge.txt
> > index 4a63953..4cc15c4 100644
> > --- a/Documentation/vm/transhuge.txt
> > +++ b/Documentation/vm/transhuge.txt
> > @@ -103,6 +103,15 @@ echo always
> > >/sys/kernel/mm/transparent_hugepage/enabled
> >  echo madvise >/sys/kernel/mm/transparent_hugepage/enabled
> >  echo never >/sys/kernel/mm/transparent_hugepage/enabled
> >
> > +If TRANSPARENT_HUGEPAGE_PAGECACHE is enabled kernel will use huge pages in
> > +page cache if possible. It can be disable and re-enabled via sysfs:
> > +
> > +echo 0 >/sys/kernel/mm/transparent_hugepage/page_cache
> > +echo 1 >/sys/kernel/mm/transparent_hugepage/page_cache
> > +
> > +If it's disabled kernel will not add new huge pages to page cache and
> > +split them on mapping, but already mapped pages will stay intakt.
> > +
> >  It's also possible to limit defrag efforts in the VM to generate
> >  hugepages in case they're not immediately free to madvise regions or
> >  to never try to defrag memory and simply fallback to regular pages
> > diff --git a/include/linux/huge_mm.h b/include/linux/huge_mm.h
> > index 3935428..1534e1e 100644
> > --- a/include/linux/huge_mm.h
> > +++ b/include/linux/huge_mm.h
> > @@ -40,6 +40,7 @@ enum transparent_hugepage_flag {
> >         TRANSPARENT_HUGEPAGE_DEFRAG_FLAG,
> >         TRANSPARENT_HUGEPAGE_DEFRAG_REQ_MADV_FLAG,
> >         TRANSPARENT_HUGEPAGE_DEFRAG_KHUGEPAGED_FLAG,
> > +       TRANSPARENT_HUGEPAGE_PAGECACHE,
> >         TRANSPARENT_HUGEPAGE_USE_ZERO_PAGE_FLAG,
> >  #ifdef CONFIG_DEBUG_VM
> >         TRANSPARENT_HUGEPAGE_DEBUG_COW_FLAG,
> > @@ -229,4 +230,12 @@ static inline int do_huge_pmd_numa_page(struct
> > mm_struct *mm, struct vm_area_str
> >
> >  #endif /* CONFIG_TRANSPARENT_HUGEPAGE */
> >
> > +static inline bool transparent_hugepage_pagecache(void)
> > +{
> > +       if (!IS_ENABLED(CONFIG_TRANSPARENT_HUGEPAGE_PAGECACHE))
> > +               return false;
> > +       if (!(transparent_hugepage_flags & (1<<TRANSPARENT_HUGEPAGE_FLAG)))
> >
> 
> Here, I suppose we should test the  TRANSPARENT_HUGEPAGE_REQ_MADV_FLAG as
> well? E.g.
>         if (!(transparent_hugepage_flags &
>               ((1<<TRANSPARENT_HUGEPAGE_FLAG) |
>                (1<<TRANSPARENT_HUGEPAGE_REQ_MADV_FLAG))))
> 
> +               return false;

You're right. Fixed.
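
For reference, the corrected helper would presumably read something like
this (a sketch combining the original helper with the suggested check;
the exact fix may differ):

	static inline bool transparent_hugepage_pagecache(void)
	{
		if (!IS_ENABLED(CONFIG_TRANSPARENT_HUGEPAGE_PAGECACHE))
			return false;
		if (!(transparent_hugepage_flags &
		      ((1<<TRANSPARENT_HUGEPAGE_FLAG) |
		       (1<<TRANSPARENT_HUGEPAGE_REQ_MADV_FLAG))))
			return false;
		return transparent_hugepage_flags &
			(1<<TRANSPARENT_HUGEPAGE_PAGECACHE);
	}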

-- 
 Kirill A. Shutemov

^ permalink raw reply	[flat|nested] 116+ messages in thread

* Re: [PATCH 03/23] thp: compile-time and sysfs knob for thp pagecache
  2013-09-06 11:33       ` Kirill A. Shutemov
  (?)
  (?)
@ 2013-09-06 17:14       ` Ning Qu
  -1 siblings, 0 replies; 116+ messages in thread
From: Ning Qu @ 2013-09-06 17:14 UTC (permalink / raw)
  To: Kirill A. Shutemov
  Cc: Andrea Arcangeli, Andrew Morton, Al Viro, Hugh Dickins,
	Wu Fengguang, Jan Kara, Mel Gorman, linux-mm, Andi Kleen,
	Matthew Wilcox, Kirill A. Shutemov, Hillf Danton, Dave Hansen,
	linux-fsdevel, linux-kernel

Great!

Best wishes,
-- 
Ning Qu (曲宁) | Software Engineer | quning@google.com | +1-408-418-6066


On Fri, Sep 6, 2013 at 4:33 AM, Kirill A. Shutemov <
kirill.shutemov@linux.intel.com> wrote:

> Ning Qu wrote:
> > One minor question inline.
> >
> > Best wishes,
> > --
> > Ning Qu (曲宁) | Software Engineer | quning@google.com | +1-408-418-6066
> >
> >
> > On Sat, Aug 3, 2013 at 7:17 PM, Kirill A. Shutemov <
> > kirill.shutemov@linux.intel.com> wrote:
> >
> > > From: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
> > >
> > > For now, TRANSPARENT_HUGEPAGE_PAGECACHE is only implemented for x86_64.
> > >
> > > Radix tree perload overhead can be significant on BASE_SMALL systems,
> so
> > > let's add dependency on !BASE_SMALL.
> > >
> > > /sys/kernel/mm/transparent_hugepage/page_cache is runtime knob for the
> > > feature. It's enabled by default if TRANSPARENT_HUGEPAGE_PAGECACHE is
> > > enabled.
> > >
> > > Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
> > > ---
> > >  Documentation/vm/transhuge.txt |  9 +++++++++
> > >  include/linux/huge_mm.h        |  9 +++++++++
> > >  mm/Kconfig                     | 12 ++++++++++++
> > >  mm/huge_memory.c               | 23 +++++++++++++++++++++++
> > >  4 files changed, 53 insertions(+)
> > >
> > > diff --git a/Documentation/vm/transhuge.txt
> > > b/Documentation/vm/transhuge.txt
> > > index 4a63953..4cc15c4 100644
> > > --- a/Documentation/vm/transhuge.txt
> > > +++ b/Documentation/vm/transhuge.txt
> > > @@ -103,6 +103,15 @@ echo always
> > > >/sys/kernel/mm/transparent_hugepage/enabled
> > >  echo madvise >/sys/kernel/mm/transparent_hugepage/enabled
> > >  echo never >/sys/kernel/mm/transparent_hugepage/enabled
> > >
> > > +If TRANSPARENT_HUGEPAGE_PAGECACHE is enabled kernel will use huge
> pages in
> > > +page cache if possible. It can be disable and re-enabled via sysfs:
> > > +
> > > +echo 0 >/sys/kernel/mm/transparent_hugepage/page_cache
> > > +echo 1 >/sys/kernel/mm/transparent_hugepage/page_cache
> > > +
> > > +If it's disabled kernel will not add new huge pages to page cache and
> > > +split them on mapping, but already mapped pages will stay intakt.
> > > +
> > >  It's also possible to limit defrag efforts in the VM to generate
> > >  hugepages in case they're not immediately free to madvise regions or
> > >  to never try to defrag memory and simply fallback to regular pages
> > > diff --git a/include/linux/huge_mm.h b/include/linux/huge_mm.h
> > > index 3935428..1534e1e 100644
> > > --- a/include/linux/huge_mm.h
> > > +++ b/include/linux/huge_mm.h
> > > @@ -40,6 +40,7 @@ enum transparent_hugepage_flag {
> > >         TRANSPARENT_HUGEPAGE_DEFRAG_FLAG,
> > >         TRANSPARENT_HUGEPAGE_DEFRAG_REQ_MADV_FLAG,
> > >         TRANSPARENT_HUGEPAGE_DEFRAG_KHUGEPAGED_FLAG,
> > > +       TRANSPARENT_HUGEPAGE_PAGECACHE,
> > >         TRANSPARENT_HUGEPAGE_USE_ZERO_PAGE_FLAG,
> > >  #ifdef CONFIG_DEBUG_VM
> > >         TRANSPARENT_HUGEPAGE_DEBUG_COW_FLAG,
> > > @@ -229,4 +230,12 @@ static inline int do_huge_pmd_numa_page(struct
> > > mm_struct *mm, struct vm_area_str
> > >
> > >  #endif /* CONFIG_TRANSPARENT_HUGEPAGE */
> > >
> > > +static inline bool transparent_hugepage_pagecache(void)
> > > +{
> > > +       if (!IS_ENABLED(CONFIG_TRANSPARENT_HUGEPAGE_PAGECACHE))
> > > +               return false;
> > > +       if (!(transparent_hugepage_flags &
> (1<<TRANSPARENT_HUGEPAGE_FLAG)))
> > >
> >
> > Here, I suppose we should test the  TRANSPARENT_HUGEPAGE_REQ_MADV_FLAG as
> > well? E.g.
> >         if (!(transparent_hugepage_flags &
> >               ((1<<TRANSPARENT_HUGEPAGE_FLAG) |
> >                (1<<TRANSPARENT_HUGEPAGE_REQ_MADV_FLAG))))
> >
> > +               return false;
>
> You're right. Fixed.
>
> --
>  Kirill A. Shutemov
>

^ permalink raw reply	[flat|nested] 116+ messages in thread

end of thread, other threads:[~2013-09-06 17:14 UTC | newest]

Thread overview: 116+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2013-08-04  2:17 [PATCHv5 00/23] Transparent huge page cache: phase 1, everything but mmap() Kirill A. Shutemov
2013-08-04  2:17 ` Kirill A. Shutemov
2013-08-04  2:17 ` [PATCH 01/23] radix-tree: implement preload for multiple contiguous elements Kirill A. Shutemov
2013-08-04  2:17   ` Kirill A. Shutemov
2013-08-05 11:17   ` Jan Kara
2013-08-05 11:17     ` Jan Kara
2013-08-06 16:34     ` Matthew Wilcox
2013-08-06 16:34       ` Matthew Wilcox
2013-08-06 20:17       ` Jan Kara
2013-08-06 20:17         ` Jan Kara
2013-08-07 16:32     ` Kirill A. Shutemov
2013-08-07 16:32       ` Kirill A. Shutemov
2013-08-07 16:32       ` Kirill A. Shutemov
2013-08-07 20:00       ` Jan Kara
2013-08-07 20:00         ` Jan Kara
2013-08-07 20:24         ` Kirill A. Shutemov
2013-08-07 20:24           ` Kirill A. Shutemov
2013-08-07 20:24           ` Kirill A. Shutemov
2013-08-07 20:36           ` Jan Kara
2013-08-07 20:36             ` Jan Kara
2013-08-07 21:37             ` Kirill A. Shutemov
2013-08-07 21:37               ` Kirill A. Shutemov
2013-08-07 21:37               ` Kirill A. Shutemov
2013-08-08  8:45               ` Kirill A. Shutemov
2013-08-08  8:45                 ` Kirill A. Shutemov
2013-08-08  8:45                 ` Kirill A. Shutemov
2013-08-08 10:04                 ` Jan Kara
2013-08-08 10:04                   ` Jan Kara
2013-08-09 11:13                   ` Kirill A. Shutemov
2013-08-09 11:13                     ` Kirill A. Shutemov
2013-08-09 11:13                     ` Kirill A. Shutemov
2013-08-09 11:36                     ` Jan Kara
2013-08-09 11:36                       ` Jan Kara
2013-08-04  2:17 ` [PATCH 02/23] memcg, thp: charge huge cache pages Kirill A. Shutemov
2013-08-04  2:17   ` Kirill A. Shutemov
2013-08-04  8:25   ` Michal Hocko
2013-08-04  8:25     ` Michal Hocko
2013-08-04  2:17 ` [PATCH 03/23] thp: compile-time and sysfs knob for thp pagecache Kirill A. Shutemov
2013-08-04  2:17   ` Kirill A. Shutemov
2013-09-05 21:53   ` Ning Qu
2013-09-06 11:33     ` Kirill A. Shutemov
2013-09-06 11:33       ` Kirill A. Shutemov
2013-09-06 11:33       ` Kirill A. Shutemov
2013-09-06 17:14       ` Ning Qu
2013-08-04  2:17 ` [PATCH 04/23] thp, mm: introduce mapping_can_have_hugepages() predicate Kirill A. Shutemov
2013-08-04  2:17   ` Kirill A. Shutemov
2013-08-04  2:17 ` [PATCH 05/23] thp: represent file thp pages in meminfo and friends Kirill A. Shutemov
2013-08-04  2:17   ` Kirill A. Shutemov
2013-08-30 22:16   ` Ning Qu
2013-09-02 11:36     ` Kirill A. Shutemov
2013-09-02 11:36       ` Kirill A. Shutemov
2013-09-02 20:05       ` Ning Qu
2013-08-04  2:17 ` [PATCH 06/23] thp, mm: rewrite add_to_page_cache_locked() to support huge pages Kirill A. Shutemov
2013-08-04  2:17   ` Kirill A. Shutemov
2013-08-04  2:17 ` [PATCH 07/23] mm: trace filemap: dump page order Kirill A. Shutemov
2013-08-04  2:17   ` Kirill A. Shutemov
2013-08-04  2:17 ` [PATCH 08/23] block: implement add_bdi_stat() Kirill A. Shutemov
2013-08-04  2:17   ` Kirill A. Shutemov
2013-08-05 11:21   ` Jan Kara
2013-08-05 11:21     ` Jan Kara
2013-08-04  2:17 ` [PATCH 09/23] thp, mm: rewrite delete_from_page_cache() to support huge pages Kirill A. Shutemov
2013-08-04  2:17   ` Kirill A. Shutemov
2013-08-04  2:17 ` [PATCH 10/23] thp, mm: warn if we try to use replace_page_cache_page() with THP Kirill A. Shutemov
2013-08-04  2:17   ` Kirill A. Shutemov
2013-08-04  2:17 ` [PATCH 11/23] thp, mm: handle tail pages in page_cache_get_speculative() Kirill A. Shutemov
2013-08-04  2:17   ` Kirill A. Shutemov
2013-08-04  2:17 ` [PATCH 12/23] thp, mm: add event counters for huge page alloc on file write or read Kirill A. Shutemov
2013-08-04  2:17   ` Kirill A. Shutemov
2013-08-04  2:17 ` [PATCH 13/23] thp, mm: allocate huge pages in grab_cache_page_write_begin() Kirill A. Shutemov
2013-08-04  2:17   ` Kirill A. Shutemov
2013-08-04  2:17 ` [PATCH 14/23] thp, mm: naive support of thp in generic_perform_write Kirill A. Shutemov
2013-08-04  2:17   ` Kirill A. Shutemov
2013-08-04  2:17 ` [PATCH 15/23] mm, fs: avoid page allocation beyond i_size on read Kirill A. Shutemov
2013-08-04  2:17   ` Kirill A. Shutemov
2013-08-05  0:29   ` NeilBrown
2013-08-05  0:29     ` NeilBrown
2013-08-04  2:17 ` [PATCH 16/23] thp, mm: handle transhuge pages in do_generic_file_read() Kirill A. Shutemov
2013-08-04  2:17   ` Kirill A. Shutemov
2013-08-04  2:17 ` [PATCH 17/23] thp, libfs: initial thp support Kirill A. Shutemov
2013-08-04  2:17   ` Kirill A. Shutemov
2013-08-04  2:17 ` [PATCH 18/23] thp: libfs: introduce simple_thp_release() Kirill A. Shutemov
2013-08-04  2:17   ` Kirill A. Shutemov
2013-08-04  2:17 ` [PATCH 19/23] truncate: support huge pages Kirill A. Shutemov
2013-08-04  2:17   ` Kirill A. Shutemov
2013-08-05 13:29   ` Jan Kara
2013-08-05 13:29     ` Jan Kara
2013-08-06 20:23   ` Dave Hansen
2013-08-06 20:23     ` Dave Hansen
2013-08-06 20:57     ` Kirill A. Shutemov
2013-08-06 20:57       ` Kirill A. Shutemov
2013-08-06 21:55   ` Dave Hansen
2013-08-06 21:55     ` Dave Hansen
2013-08-09 14:39     ` Kirill A. Shutemov
2013-08-09 14:39       ` Kirill A. Shutemov
2013-08-09 14:39       ` Kirill A. Shutemov
2013-08-04  2:17 ` [PATCH 20/23] thp: handle file pages in split_huge_page() Kirill A. Shutemov
2013-08-04  2:17   ` Kirill A. Shutemov
2013-08-06 19:09   ` Ning Qu
2013-08-06 21:09     ` Ning Qu
2013-08-06 21:47       ` Ning Qu
2013-08-09 14:46         ` Kirill A. Shutemov
2013-08-09 14:46           ` Kirill A. Shutemov
2013-08-09 14:46           ` Kirill A. Shutemov
2013-08-09 14:49           ` Ning Qu
2013-08-09 21:24             ` Ning Qu
2013-08-09 21:24               ` Ning Qu
2013-08-04  2:17 ` [PATCH 21/23] thp: wait_split_huge_page(): serialize over i_mmap_mutex too Kirill A. Shutemov
2013-08-04  2:17   ` Kirill A. Shutemov
2013-08-04  2:17 ` [PATCH 22/23] thp, mm: split huge page on mmap file page Kirill A. Shutemov
2013-08-04  2:17   ` Kirill A. Shutemov
2013-08-08 20:49   ` Khalid Aziz
2013-08-08 20:49     ` Khalid Aziz
2013-08-09 14:50     ` Kirill A. Shutemov
2013-08-09 14:50       ` Kirill A. Shutemov
2013-08-04  2:17 ` [PATCH 23/23] ramfs: enable transparent huge page cache Kirill A. Shutemov
2013-08-04  2:17   ` Kirill A. Shutemov
