* [PATCHv3, RFC 00/34] Transparent huge page cache
From: Kirill A. Shutemov @ 2013-04-05 11:59 UTC (permalink / raw)
  To: Andrea Arcangeli, Andrew Morton
  Cc: Al Viro, Hugh Dickins, Wu Fengguang, Jan Kara, Mel Gorman,
	linux-mm, Andi Kleen, Matthew Wilcox, Kirill A. Shutemov,
	Hillf Danton, Dave Hansen, linux-fsdevel, linux-kernel,
	Kirill A. Shutemov

From: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>

Here's the third RFC. Thanks, everybody, for the feedback.

The patchset is pretty big already and I want to stop generating new
features to keep it reviewable. Next I'll concentrate on benchmarking
and tuning.

Therefore some features will be left outside the initial transparent huge
page cache implementation:
 - page collapsing;
 - migration;
 - tmpfs/shmem;

There are a few features which are not yet implemented and could
potentially block upstreaming:

1. Currently we allocate a 2M page even if we create only a 1-byte file
on ramfs. I don't think that's a problem by itself: with anon THP we also
try to allocate huge pages whenever possible.
The problem is that ramfs pages are unevictable and we can't just split
them and push them to swap as we do with anon THP. At some point we will
need a mechanism to split the last page of a file under memory pressure
to reclaim some memory.

2. We don't have knobs for disabling transparent huge page cache per-mount
or per-file. Should we have a mount option and fadvise flags as part of the
initial implementation?

Any thoughts?

The patchset is also on git:

git://git.kernel.org/pub/scm/linux/kernel/git/kas/linux.git thp/pagecache

v3:
 - set RADIX_TREE_PRELOAD_NR to 512 only if we build with THP;
 - rewrite lru_add_page_tail() to address a few bugs;
 - memcg accounting;
 - represent file thp pages in meminfo and friends;
 - dump page order in filemap trace;
 - add missing flush_dcache_page() in zero_huge_user_segment;
 - random cleanups based on feedback.
v2:
 - mmap();
 - fix add_to_page_cache_locked() and delete_from_page_cache();
 - introduce mapping_can_have_hugepages();
 - call split_huge_page() only for head page in filemap_fault();
 - wait_split_huge_page(): serialize over i_mmap_mutex too;
 - lru_add_page_tail: avoid PageUnevictable on active/inactive lru lists;
 - fix off-by-one in zero_huge_user_segment();
 - THP_WRITE_ALLOC/THP_WRITE_FAILED counters;

Kirill A. Shutemov (34):
  mm: drop actor argument of do_generic_file_read()
  block: implement add_bdi_stat()
  mm: implement zero_huge_user_segment and friends
  radix-tree: implement preload for multiple contiguous elements
  memcg, thp: charge huge cache pages
  thp, mm: avoid PageUnevictable on active/inactive lru lists
  thp, mm: basic defines for transparent huge page cache
  thp, mm: introduce mapping_can_have_hugepages() predicate
  thp: represent file thp pages in meminfo and friends
  thp, mm: rewrite add_to_page_cache_locked() to support huge pages
  mm: trace filemap: dump page order
  thp, mm: rewrite delete_from_page_cache() to support huge pages
  thp, mm: trigger bug in replace_page_cache_page() on THP
  thp, mm: locking tail page is a bug
  thp, mm: handle tail pages in page_cache_get_speculative()
  thp, mm: add event counters for huge page alloc on write to a file
  thp, mm: implement grab_thp_write_begin()
  thp, mm: naive support of thp in generic read/write routines
  thp, libfs: initial support of thp in
    simple_read/write_begin/write_end
  thp: handle file pages in split_huge_page()
  thp: wait_split_huge_page(): serialize over i_mmap_mutex too
  thp, mm: truncate support for transparent huge page cache
  thp, mm: split huge page on mmap file page
  ramfs: enable transparent huge page cache
  x86-64, mm: proper alignment mappings with hugepages
  mm: add huge_fault() callback to vm_operations_struct
  thp: prepare zap_huge_pmd() to uncharge file pages
  thp: move maybe_pmd_mkwrite() out of mk_huge_pmd()
  thp, mm: basic huge_fault implementation for generic_file_vm_ops
  thp: extract fallback path from do_huge_pmd_anonymous_page() to a
    function
  thp: initial implementation of do_huge_linear_fault()
  thp: handle write-protect exception to file-backed huge pages
  thp: call __vma_adjust_trans_huge() for file-backed VMA
  thp: map file-backed huge pages on fault

 arch/x86/kernel/sys_x86_64.c   |   12 +-
 drivers/base/node.c            |   10 +
 fs/libfs.c                     |   48 +++-
 fs/proc/meminfo.c              |    6 +
 fs/ramfs/inode.c               |    6 +-
 include/linux/backing-dev.h    |   10 +
 include/linux/huge_mm.h        |   36 ++-
 include/linux/mm.h             |    8 +
 include/linux/mmzone.h         |    1 +
 include/linux/pagemap.h        |   33 ++-
 include/linux/radix-tree.h     |   11 +
 include/linux/vm_event_item.h  |    2 +
 include/trace/events/filemap.h |    7 +-
 lib/radix-tree.c               |   33 ++-
 mm/filemap.c                   |  298 ++++++++++++++++++++-----
 mm/huge_memory.c               |  474 +++++++++++++++++++++++++++++++++-------
 mm/memcontrol.c                |    2 -
 mm/memory.c                    |   41 +++-
 mm/mmap.c                      |    3 +
 mm/page_alloc.c                |    7 +-
 mm/swap.c                      |   20 +-
 mm/truncate.c                  |   13 ++
 mm/vmstat.c                    |    2 +
 23 files changed, 902 insertions(+), 181 deletions(-)

-- 
1.7.10.4



* [PATCHv3, RFC 01/34] mm: drop actor argument of do_generic_file_read()
From: Kirill A. Shutemov @ 2013-04-05 11:59 UTC (permalink / raw)
  To: Andrea Arcangeli, Andrew Morton
  Cc: Al Viro, Hugh Dickins, Wu Fengguang, Jan Kara, Mel Gorman,
	linux-mm, Andi Kleen, Matthew Wilcox, Kirill A. Shutemov,
	Hillf Danton, Dave Hansen, linux-fsdevel, linux-kernel,
	Kirill A. Shutemov

From: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>

There's only one caller of do_generic_file_read() and the only actor is
file_read_actor(). No reason to have a callback parameter.

Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
---
 mm/filemap.c |   10 +++++-----
 1 file changed, 5 insertions(+), 5 deletions(-)

diff --git a/mm/filemap.c b/mm/filemap.c
index 4ebaf95..2d99191 100644
--- a/mm/filemap.c
+++ b/mm/filemap.c
@@ -1075,7 +1075,6 @@ static void shrink_readahead_size_eio(struct file *filp,
  * @filp:	the file to read
  * @ppos:	current file position
  * @desc:	read_descriptor
- * @actor:	read method
  *
  * This is a generic file read routine, and uses the
  * mapping->a_ops->readpage() function for the actual low-level stuff.
@@ -1084,7 +1083,7 @@ static void shrink_readahead_size_eio(struct file *filp,
  * of the logic when it comes to error handling etc.
  */
 static void do_generic_file_read(struct file *filp, loff_t *ppos,
-		read_descriptor_t *desc, read_actor_t actor)
+		read_descriptor_t *desc)
 {
 	struct address_space *mapping = filp->f_mapping;
 	struct inode *inode = mapping->host;
@@ -1185,13 +1184,14 @@ page_ok:
 		 * Ok, we have the page, and it's up-to-date, so
 		 * now we can copy it to user space...
 		 *
-		 * The actor routine returns how many bytes were actually used..
+		 * The file_read_actor routine returns how many bytes were
+		 * actually used..
 		 * NOTE! This may not be the same as how much of a user buffer
 		 * we filled up (we may be padding etc), so we can only update
 		 * "pos" here (the actor routine has to update the user buffer
 		 * pointers and the remaining count).
 		 */
-		ret = actor(desc, page, offset, nr);
+		ret = file_read_actor(desc, page, offset, nr);
 		offset += ret;
 		index += offset >> PAGE_CACHE_SHIFT;
 		offset &= ~PAGE_CACHE_MASK;
@@ -1464,7 +1464,7 @@ generic_file_aio_read(struct kiocb *iocb, const struct iovec *iov,
 		if (desc.count == 0)
 			continue;
 		desc.error = 0;
-		do_generic_file_read(filp, ppos, &desc, file_read_actor);
+		do_generic_file_read(filp, ppos, &desc);
 		retval += desc.written;
 		if (desc.error) {
 			retval = retval ?: desc.error;
-- 
1.7.10.4



* [PATCHv3, RFC 02/34] block: implement add_bdi_stat()
From: Kirill A. Shutemov @ 2013-04-05 11:59 UTC (permalink / raw)
  To: Andrea Arcangeli, Andrew Morton
  Cc: Al Viro, Hugh Dickins, Wu Fengguang, Jan Kara, Mel Gorman,
	linux-mm, Andi Kleen, Matthew Wilcox, Kirill A. Shutemov,
	Hillf Danton, Dave Hansen, linux-fsdevel, linux-kernel,
	Kirill A. Shutemov

From: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>

We're going to add/remove a number of page cache entries at once. This
patch implements add_bdi_stat(), which adjusts a bdi stat by an arbitrary
amount. It's required for batched page cache manipulations.
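
As a rough usage illustration (a hypothetical caller, not part of this
patch; the function name and the BDI_RECLAIMABLE choice are assumptions):

 /*
  * Hypothetical sketch: when all subpages of a huge page leave the page
  * cache at once, fold the accounting into one call instead of issuing
  * hundreds of dec_bdi_stat() calls.
  */
 static void account_thp_removed(struct address_space *mapping, int nr)
 {
 	add_bdi_stat(mapping->backing_dev_info, BDI_RECLAIMABLE, -(s64)nr);
 }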

Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
---
 include/linux/backing-dev.h |   10 ++++++++++
 1 file changed, 10 insertions(+)

diff --git a/include/linux/backing-dev.h b/include/linux/backing-dev.h
index 3504599..b05d961 100644
--- a/include/linux/backing-dev.h
+++ b/include/linux/backing-dev.h
@@ -167,6 +167,16 @@ static inline void __dec_bdi_stat(struct backing_dev_info *bdi,
 	__add_bdi_stat(bdi, item, -1);
 }
 
+static inline void add_bdi_stat(struct backing_dev_info *bdi,
+		enum bdi_stat_item item, s64 amount)
+{
+	unsigned long flags;
+
+	local_irq_save(flags);
+	__add_bdi_stat(bdi, item, amount);
+	local_irq_restore(flags);
+}
+
 static inline void dec_bdi_stat(struct backing_dev_info *bdi,
 		enum bdi_stat_item item)
 {
-- 
1.7.10.4



* [PATCHv3, RFC 03/34] mm: implement zero_huge_user_segment and friends
From: Kirill A. Shutemov @ 2013-04-05 11:59 UTC (permalink / raw)
  To: Andrea Arcangeli, Andrew Morton
  Cc: Al Viro, Hugh Dickins, Wu Fengguang, Jan Kara, Mel Gorman,
	linux-mm, Andi Kleen, Matthew Wilcox, Kirill A. Shutemov,
	Hillf Danton, Dave Hansen, linux-fsdevel, linux-kernel,
	Kirill A. Shutemov

From: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>

Let's add helpers to clear huge page segment(s). They provide the same
functionality as zero_user_segment and zero_user, but for huge pages.
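
For a sense of how they might be used (a hedged sketch, not code taken
from this series; the write_begin-style context with 'page', 'pos' and
'len' is assumed), a caller can zero the parts of a freshly allocated
huge page that a write will not cover, much like zero_user_segments()
does for small pages:

 	if (!PageUptodate(page) && len != HPAGE_PMD_SIZE) {
 		/* offset of the write within the huge page */
 		unsigned from = pos & ~HPAGE_PMD_MASK;

 		/* zero everything before and after the written region */
 		zero_huge_user_segment(page, 0, from);
 		zero_huge_user_segment(page, from + len, HPAGE_PMD_SIZE);
 	}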

Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
---
 include/linux/mm.h |    7 +++++++
 mm/memory.c        |   36 ++++++++++++++++++++++++++++++++++++
 2 files changed, 43 insertions(+)

diff --git a/include/linux/mm.h b/include/linux/mm.h
index 5b7fd4e..09530c7 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -1731,6 +1731,13 @@ extern void dump_page(struct page *page);
 extern void clear_huge_page(struct page *page,
 			    unsigned long addr,
 			    unsigned int pages_per_huge_page);
+extern void zero_huge_user_segment(struct page *page,
+		unsigned start, unsigned end);
+static inline void zero_huge_user(struct page *page,
+		unsigned start, unsigned len)
+{
+	zero_huge_user_segment(page, start, start + len);
+}
 extern void copy_user_huge_page(struct page *dst, struct page *src,
 				unsigned long addr, struct vm_area_struct *vma,
 				unsigned int pages_per_huge_page);
diff --git a/mm/memory.c b/mm/memory.c
index 494526a..9da540f 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -4213,6 +4213,42 @@ void clear_huge_page(struct page *page,
 	}
 }
 
+void zero_huge_user_segment(struct page *page, unsigned start, unsigned end)
+{
+	int i;
+	unsigned start_idx, end_idx;
+	unsigned start_off, end_off;
+
+	BUG_ON(end < start);
+
+	might_sleep();
+
+	if (start == end)
+		return;
+
+	start_idx = start >> PAGE_SHIFT;
+	start_off = start & ~PAGE_MASK;
+	end_idx = (end - 1) >> PAGE_SHIFT;
+	end_off = ((end - 1) & ~PAGE_MASK) + 1;
+
+	/*
+	 * if start and end are on the same small page we can call
+	 * zero_user_segment() once and save one kmap_atomic().
+	 */
+	if (start_idx == end_idx)
+		return zero_user_segment(page + start_idx, start_off, end_off);
+
+	/* zero the first (possibly partial) page */
+	zero_user_segment(page + start_idx, start_off, PAGE_SIZE);
+	for (i = start_idx + 1; i < end_idx; i++) {
+		cond_resched();
+		clear_highpage(page + i);
+		flush_dcache_page(page + i);
+	}
+	/* zero the last (possibly partial) page */
+	zero_user_segment(page + end_idx, 0, end_off);
+}
+
 static void copy_user_gigantic_page(struct page *dst, struct page *src,
 				    unsigned long addr,
 				    struct vm_area_struct *vma,
-- 
1.7.10.4



* [PATCHv3, RFC 04/34] radix-tree: implement preload for multiple contiguous elements
From: Kirill A. Shutemov @ 2013-04-05 11:59 UTC (permalink / raw)
  To: Andrea Arcangeli, Andrew Morton
  Cc: Al Viro, Hugh Dickins, Wu Fengguang, Jan Kara, Mel Gorman,
	linux-mm, Andi Kleen, Matthew Wilcox, Kirill A. Shutemov,
	Hillf Danton, Dave Hansen, linux-fsdevel, linux-kernel,
	Kirill A. Shutemov

From: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>

The radix tree is variable-height, so an insert operation not only has
to build the branch to its corresponding item, it also has to build the
branch to existing items if the size has to be increased (by
radix_tree_extend).

The worst case is a zero height tree with just a single item at index 0,
and then inserting an item at index ULONG_MAX. This requires 2 new branches
of RADIX_TREE_MAX_PATH size to be created, with only the root node shared.

The radix tree is usually protected by a spin lock, which means we want
to pre-allocate the required memory before taking the lock.

Currently radix_tree_preload() only guarantees enough nodes to insert
one element. It's a hard limit. For transparent huge page cache we want
to insert HPAGE_PMD_NR (512 on x86-64) entries into an address_space at
once.

This patch introduces radix_tree_preload_count(). It allows preallocating
enough nodes to insert a number of *contiguous* elements.
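
A hedged sketch of the intended calling pattern (the caller below is an
illustration, not code from this series; HPAGE_CACHE_NR is the 512-subpage
count introduced later in the patchset, and error handling inside the loop
is omitted for brevity):

 static int add_huge_page_sketch(struct address_space *mapping,
 		struct page *page, pgoff_t index)
 {
 	int i, error;

 	/* reserve enough nodes for 512 contiguous slots before locking */
 	error = radix_tree_preload_count(HPAGE_CACHE_NR, GFP_KERNEL);
 	if (error)
 		return error;

 	spin_lock_irq(&mapping->tree_lock);
 	for (i = 0; i < HPAGE_CACHE_NR; i++)
 		error = radix_tree_insert(&mapping->page_tree,
 				index + i, page + i);
 	spin_unlock_irq(&mapping->tree_lock);
 	radix_tree_preload_end();	/* re-enables preemption */
 	return error;
 }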

The worst case for adding N contiguous items is adding entries at indexes
(ULONG_MAX - N) to ULONG_MAX. It requires the nodes for a single worst-case
item plus extra nodes if you cross the boundary from one node to the next.

Preload uses a per-CPU array to store nodes. The total cost of preload is
"array size" * sizeof(void *) * NR_CPUS. We want to increase the array size
to be able to handle 512 entries at once.

The size of the array depends on system bitness and on RADIX_TREE_MAP_SHIFT.

We have three possible values of RADIX_TREE_MAP_SHIFT:

 #ifdef __KERNEL__
 #define RADIX_TREE_MAP_SHIFT	(CONFIG_BASE_SMALL ? 4 : 6)
 #else
 #define RADIX_TREE_MAP_SHIFT	3	/* For more stressful testing */
 #endif

On 64-bit system:
For RADIX_TREE_MAP_SHIFT=3, old array size is 43, new is 107.
For RADIX_TREE_MAP_SHIFT=4, old array size is 31, new is 63.
For RADIX_TREE_MAP_SHIFT=6, old array size is 21, new is 30.

On 32-bit system:
For RADIX_TREE_MAP_SHIFT=3, old array size is 21, new is 84.
For RADIX_TREE_MAP_SHIFT=4, old array size is 15, new is 46.
For RADIX_TREE_MAP_SHIFT=6, old array size is 11, new is 19.

On most machines we will have RADIX_TREE_MAP_SHIFT=6.
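
As a worked example of where these numbers come from (using the
RADIX_TREE_PRELOAD_MIN/MAX formulas added below), take the 64-bit case
with RADIX_TREE_MAP_SHIFT=4:

 RADIX_TREE_MAX_PATH    = DIV_ROUND_UP(64, 4)            = 16
 RADIX_TREE_PRELOAD_MIN = 2 * 16 - 1                     = 31  (old array size)
 RADIX_TREE_PRELOAD_MAX = 31 + DIV_ROUND_UP(512 - 1, 16) = 31 + 32 = 63  (new array size)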

Since only THP uses batched preload at the moment, we disable it (set the
max preload to 1) if !CONFIG_TRANSPARENT_HUGEPAGE. This can be changed in
the future.

Signed-off-by: Matthew Wilcox <matthew.r.wilcox@intel.com>
Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
---
 include/linux/radix-tree.h |   11 +++++++++++
 lib/radix-tree.c           |   33 ++++++++++++++++++++++++++-------
 2 files changed, 37 insertions(+), 7 deletions(-)

diff --git a/include/linux/radix-tree.h b/include/linux/radix-tree.h
index ffc444c..0d98fd6 100644
--- a/include/linux/radix-tree.h
+++ b/include/linux/radix-tree.h
@@ -83,6 +83,16 @@ do {									\
 	(root)->rnode = NULL;						\
 } while (0)
 
+#ifdef CONFIG_TRANSPARENT_HUGEPAGE
+/*
+ * At the moment only THP uses preload for more than one item for batched
+ * pagecache manipulations.
+ */
+#define RADIX_TREE_PRELOAD_NR	512
+#else
+#define RADIX_TREE_PRELOAD_NR	1
+#endif
+
 /**
  * Radix-tree synchronization
  *
@@ -231,6 +241,7 @@ unsigned long radix_tree_next_hole(struct radix_tree_root *root,
 unsigned long radix_tree_prev_hole(struct radix_tree_root *root,
 				unsigned long index, unsigned long max_scan);
 int radix_tree_preload(gfp_t gfp_mask);
+int radix_tree_preload_count(unsigned size, gfp_t gfp_mask);
 void radix_tree_init(void);
 void *radix_tree_tag_set(struct radix_tree_root *root,
 			unsigned long index, unsigned int tag);
diff --git a/lib/radix-tree.c b/lib/radix-tree.c
index e796429..1bc352f 100644
--- a/lib/radix-tree.c
+++ b/lib/radix-tree.c
@@ -81,16 +81,24 @@ static struct kmem_cache *radix_tree_node_cachep;
  * The worst case is a zero height tree with just a single item at index 0,
  * and then inserting an item at index ULONG_MAX. This requires 2 new branches
  * of RADIX_TREE_MAX_PATH size to be created, with only the root node shared.
+ *
+ * Worst case for adding N contiguous items is adding entries at indexes
+ * (ULONG_MAX - N) to ULONG_MAX. It requires nodes to insert single worst-case
+ * item plus extra nodes if you cross the boundary from one node to the next.
+ *
  * Hence:
  */
-#define RADIX_TREE_PRELOAD_SIZE (RADIX_TREE_MAX_PATH * 2 - 1)
+#define RADIX_TREE_PRELOAD_MIN (RADIX_TREE_MAX_PATH * 2 - 1)
+#define RADIX_TREE_PRELOAD_MAX \
+	(RADIX_TREE_PRELOAD_MIN + \
+	 DIV_ROUND_UP(RADIX_TREE_PRELOAD_NR - 1, RADIX_TREE_MAP_SIZE))
 
 /*
  * Per-cpu pool of preloaded nodes
  */
 struct radix_tree_preload {
 	int nr;
-	struct radix_tree_node *nodes[RADIX_TREE_PRELOAD_SIZE];
+	struct radix_tree_node *nodes[RADIX_TREE_PRELOAD_MAX];
 };
 static DEFINE_PER_CPU(struct radix_tree_preload, radix_tree_preloads) = { 0, };
 
@@ -257,29 +265,35 @@ radix_tree_node_free(struct radix_tree_node *node)
 
 /*
  * Load up this CPU's radix_tree_node buffer with sufficient objects to
- * ensure that the addition of a single element in the tree cannot fail.  On
- * success, return zero, with preemption disabled.  On error, return -ENOMEM
+ * ensure that the addition of *contiguous* elements in the tree cannot fail.
+ * On success, return zero, with preemption disabled.  On error, return -ENOMEM
  * with preemption not disabled.
  *
  * To make use of this facility, the radix tree must be initialised without
  * __GFP_WAIT being passed to INIT_RADIX_TREE().
  */
-int radix_tree_preload(gfp_t gfp_mask)
+int radix_tree_preload_count(unsigned size, gfp_t gfp_mask)
 {
 	struct radix_tree_preload *rtp;
 	struct radix_tree_node *node;
 	int ret = -ENOMEM;
+	int preload_target = RADIX_TREE_PRELOAD_MIN +
+		DIV_ROUND_UP(size - 1, RADIX_TREE_MAP_SIZE);
+
+	if (WARN_ONCE(size > RADIX_TREE_PRELOAD_NR,
+				"too large preload requested"))
+		return -ENOMEM;
 
 	preempt_disable();
 	rtp = &__get_cpu_var(radix_tree_preloads);
-	while (rtp->nr < ARRAY_SIZE(rtp->nodes)) {
+	while (rtp->nr < preload_target) {
 		preempt_enable();
 		node = kmem_cache_alloc(radix_tree_node_cachep, gfp_mask);
 		if (node == NULL)
 			goto out;
 		preempt_disable();
 		rtp = &__get_cpu_var(radix_tree_preloads);
-		if (rtp->nr < ARRAY_SIZE(rtp->nodes))
+		if (rtp->nr < preload_target)
 			rtp->nodes[rtp->nr++] = node;
 		else
 			kmem_cache_free(radix_tree_node_cachep, node);
@@ -288,6 +302,11 @@ int radix_tree_preload(gfp_t gfp_mask)
 out:
 	return ret;
 }
+
+int radix_tree_preload(gfp_t gfp_mask)
+{
+	return radix_tree_preload_count(1, gfp_mask);
+}
 EXPORT_SYMBOL(radix_tree_preload);
 
 /*
-- 
1.7.10.4



* [PATCHv3, RFC 05/34] memcg, thp: charge huge cache pages
From: Kirill A. Shutemov @ 2013-04-05 11:59 UTC (permalink / raw)
  To: Andrea Arcangeli, Andrew Morton
  Cc: Al Viro, Hugh Dickins, Wu Fengguang, Jan Kara, Mel Gorman,
	linux-mm, Andi Kleen, Matthew Wilcox, Kirill A. Shutemov,
	Hillf Danton, Dave Hansen, linux-fsdevel, linux-kernel,
	Kirill A. Shutemov

From: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>

mem_cgroup_cache_charge() has a check for PageCompound(). The check
prevents charging huge cache pages.

I don't see a reason why the check is present. Looks like it's just
legacy (introduced in 52d4b9a memcg: allocate all page_cgroup at boot).

Let's just drop it.

Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
---
 mm/memcontrol.c |    2 --
 1 file changed, 2 deletions(-)

diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index 690fa8c..0e7f7e6 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -3975,8 +3975,6 @@ int mem_cgroup_cache_charge(struct page *page, struct mm_struct *mm,
 
 	if (mem_cgroup_disabled())
 		return 0;
-	if (PageCompound(page))
-		return 0;
 
 	if (!PageSwapCache(page))
 		ret = mem_cgroup_charge_common(page, mm, gfp_mask, type);
-- 
1.7.10.4



* [PATCHv3, RFC 06/34] thp, mm: avoid PageUnevictable on active/inactive lru lists
From: Kirill A. Shutemov @ 2013-04-05 11:59 UTC (permalink / raw)
  To: Andrea Arcangeli, Andrew Morton
  Cc: Al Viro, Hugh Dickins, Wu Fengguang, Jan Kara, Mel Gorman,
	linux-mm, Andi Kleen, Matthew Wilcox, Kirill A. Shutemov,
	Hillf Danton, Dave Hansen, linux-fsdevel, linux-kernel,
	Kirill A. Shutemov

From: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>

active/inactive lru lists can contain unevictable pages (e.g. ramfs pages
that have been placed on the LRU lists when first allocated), but these
pages must not have PageUnevictable set - otherwise shrink_active_list
goes crazy:

kernel BUG at /home/space/kas/git/public/linux-next/mm/vmscan.c:1122!
invalid opcode: 0000 [#1] SMP
CPU 0
Pid: 293, comm: kswapd0 Not tainted 3.8.0-rc6-next-20130202+ #531
RIP: 0010:[<ffffffff81110478>]  [<ffffffff81110478>] isolate_lru_pages.isra.61+0x138/0x260
RSP: 0000:ffff8800796d9b28  EFLAGS: 00010082
RAX: 00000000ffffffea RBX: 0000000000000012 RCX: 0000000000000001
RDX: 0000000000000001 RSI: 0000000000000000 RDI: ffffea0001de8040
RBP: ffff8800796d9b88 R08: ffff8800796d9df0 R09: 0000000000000000
R10: 0000000000000000 R11: 0000000000000000 R12: 0000000000000012
R13: ffffea0001de8060 R14: ffffffff818818e8 R15: ffff8800796d9bf8
FS:  0000000000000000(0000) GS:ffff88007a200000(0000) knlGS:0000000000000000
CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 00007f1bfc108000 CR3: 000000000180b000 CR4: 00000000000406f0
DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
Process kswapd0 (pid: 293, threadinfo ffff8800796d8000, task ffff880079e0a6e0)
Stack:
 ffff8800796d9b48 ffffffff81881880 ffff8800796d9df0 ffff8800796d9be0
 0000000000000002 000000000000001f ffff8800796d9b88 ffffffff818818c8
 ffffffff81881480 ffff8800796d9dc0 0000000000000002 000000000000001f
Call Trace:
 [<ffffffff81111e98>] shrink_inactive_list+0x108/0x4a0
 [<ffffffff8109ce3d>] ? trace_hardirqs_off+0xd/0x10
 [<ffffffff8107b8bf>] ? local_clock+0x4f/0x60
 [<ffffffff8110ff5d>] ? shrink_slab+0x1fd/0x4c0
 [<ffffffff811125a1>] shrink_zone+0x371/0x610
 [<ffffffff8110ff75>] ? shrink_slab+0x215/0x4c0
 [<ffffffff81112dfc>] kswapd+0x5bc/0xb60
 [<ffffffff81112840>] ? shrink_zone+0x610/0x610
 [<ffffffff81066676>] kthread+0xd6/0xe0
 [<ffffffff810665a0>] ? __kthread_bind+0x40/0x40
 [<ffffffff814fed6c>] ret_from_fork+0x7c/0xb0
 [<ffffffff810665a0>] ? __kthread_bind+0x40/0x40
Code: 1f 40 00 49 8b 45 08 49 8b 75 00 48 89 46 08 48 89 30 49 8b 06 4c 89 68 08 49 89 45 00 4d 89 75 08 4d 89 2e eb 9c 0f 1f 44 00 00 <0f> 0b 66 0f 1f 44 00 00 31 db 45 31 e4 eb 9b 0f 0b 0f 0b 65 48
RIP  [<ffffffff81110478>] isolate_lru_pages.isra.61+0x138/0x260
 RSP <ffff8800796d9b28>

For lru_add_page_tail(), it means we should not set PageUnevictable()
on tail pages unless we're sure they will go to LRU_UNEVICTABLE.
Let's just copy PG_active and PG_unevictable from the head page in
__split_huge_page_refcount(); it also simplifies lru_add_page_tail().

This also fixes one more bug in lru_add_page_tail(): if
page_evictable(page_tail) is false and PageLRU(page) is true, page_tail
will go to the same lru as page, but nobody bothers to sync page_tail's
active/inactive state with page's. So we can end up with an inactive page
on an active lru. The patch fixes that as well, since we copy PG_active
from the head page.
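
For reference, the lru selection this now relies on is roughly the
following (a simplified paraphrase of page_lru()/page_lru_base_type()
from include/linux/mm_inline.h, not code added by this patch); with
PG_active and PG_unevictable copied from the head page it picks the
right list on its own:

 	if (PageUnevictable(page_tail))
 		lru = LRU_UNEVICTABLE;
 	else {
 		lru = page_is_file_cache(page_tail) ?
 			LRU_INACTIVE_FILE : LRU_INACTIVE_ANON;
 		if (PageActive(page_tail))
 			lru += LRU_ACTIVE;
 	}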

Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
---
 mm/huge_memory.c |    4 +++-
 mm/swap.c        |   20 ++------------------
 2 files changed, 5 insertions(+), 19 deletions(-)

diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index e2f7f5aa..46a44ac 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -1613,7 +1613,9 @@ static void __split_huge_page_refcount(struct page *page)
 				     ((1L << PG_referenced) |
 				      (1L << PG_swapbacked) |
 				      (1L << PG_mlocked) |
-				      (1L << PG_uptodate)));
+				      (1L << PG_uptodate) |
+				      (1L << PG_active) |
+				      (1L << PG_unevictable)));
 		page_tail->flags |= (1L << PG_dirty);
 
 		/* clear PageTail before overwriting first_page */
diff --git a/mm/swap.c b/mm/swap.c
index 92a9be5..20d78b6 100644
--- a/mm/swap.c
+++ b/mm/swap.c
@@ -740,8 +740,6 @@ EXPORT_SYMBOL(__pagevec_release);
 void lru_add_page_tail(struct page *page, struct page *page_tail,
 		       struct lruvec *lruvec)
 {
-	int uninitialized_var(active);
-	enum lru_list lru;
 	const int file = 0;
 
 	VM_BUG_ON(!PageHead(page));
@@ -752,20 +750,6 @@ void lru_add_page_tail(struct page *page, struct page *page_tail,
 
 	SetPageLRU(page_tail);
 
-	if (page_evictable(page_tail)) {
-		if (PageActive(page)) {
-			SetPageActive(page_tail);
-			active = 1;
-			lru = LRU_ACTIVE_ANON;
-		} else {
-			active = 0;
-			lru = LRU_INACTIVE_ANON;
-		}
-	} else {
-		SetPageUnevictable(page_tail);
-		lru = LRU_UNEVICTABLE;
-	}
-
 	if (likely(PageLRU(page)))
 		list_add_tail(&page_tail->lru, &page->lru);
 	else {
@@ -777,13 +761,13 @@ void lru_add_page_tail(struct page *page, struct page *page_tail,
 		 * Use the standard add function to put page_tail on the list,
 		 * but then correct its position so they all end up in order.
 		 */
-		add_page_to_lru_list(page_tail, lruvec, lru);
+		add_page_to_lru_list(page_tail, lruvec, page_lru(page_tail));
 		list_head = page_tail->lru.prev;
 		list_move_tail(&page_tail->lru, list_head);
 	}
 
 	if (!PageUnevictable(page))
-		update_page_reclaim_stat(lruvec, file, active);
+		update_page_reclaim_stat(lruvec, file, PageActive(page_tail));
 }
 #endif /* CONFIG_TRANSPARENT_HUGEPAGE */
 
-- 
1.7.10.4



* [PATCHv3, RFC 07/34] thp, mm: basic defines for transparent huge page cache
From: Kirill A. Shutemov @ 2013-04-05 11:59 UTC (permalink / raw)
  To: Andrea Arcangeli, Andrew Morton
  Cc: Al Viro, Hugh Dickins, Wu Fengguang, Jan Kara, Mel Gorman,
	linux-mm, Andi Kleen, Matthew Wilcox, Kirill A. Shutemov,
	Hillf Danton, Dave Hansen, linux-fsdevel, linux-kernel,
	Kirill A. Shutemov

From: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>

Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
---
 include/linux/huge_mm.h |    8 ++++++++
 1 file changed, 8 insertions(+)

diff --git a/include/linux/huge_mm.h b/include/linux/huge_mm.h
index ee1c244..a54939c 100644
--- a/include/linux/huge_mm.h
+++ b/include/linux/huge_mm.h
@@ -64,6 +64,10 @@ extern pmd_t *page_check_address_pmd(struct page *page,
 #define HPAGE_PMD_MASK HPAGE_MASK
 #define HPAGE_PMD_SIZE HPAGE_SIZE
 
+#define HPAGE_CACHE_ORDER      (HPAGE_SHIFT - PAGE_CACHE_SHIFT)
+#define HPAGE_CACHE_NR         (1L << HPAGE_CACHE_ORDER)
+#define HPAGE_CACHE_INDEX_MASK (HPAGE_CACHE_NR - 1)
+
 extern bool is_vma_temporary_stack(struct vm_area_struct *vma);
 
 #define transparent_hugepage_enabled(__vma)				\
@@ -181,6 +185,10 @@ extern int do_huge_pmd_numa_page(struct mm_struct *mm, struct vm_area_struct *vm
 #define HPAGE_PMD_MASK ({ BUILD_BUG(); 0; })
 #define HPAGE_PMD_SIZE ({ BUILD_BUG(); 0; })
 
+#define HPAGE_CACHE_ORDER      ({ BUILD_BUG(); 0; })
+#define HPAGE_CACHE_NR         ({ BUILD_BUG(); 0; })
+#define HPAGE_CACHE_INDEX_MASK ({ BUILD_BUG(); 0; })
+
 #define hpage_nr_pages(x) 1
 
 #define transparent_hugepage_enabled(__vma) 0
-- 
1.7.10.4


^ permalink raw reply related	[flat|nested] 110+ messages in thread

* [PATCHv3, RFC 08/34] thp, mm: introduce mapping_can_have_hugepages() predicate
  2013-04-05 11:59 ` Kirill A. Shutemov
@ 2013-04-05 11:59   ` Kirill A. Shutemov
  -1 siblings, 0 replies; 110+ messages in thread
From: Kirill A. Shutemov @ 2013-04-05 11:59 UTC (permalink / raw)
  To: Andrea Arcangeli, Andrew Morton
  Cc: Al Viro, Hugh Dickins, Wu Fengguang, Jan Kara, Mel Gorman,
	linux-mm, Andi Kleen, Matthew Wilcox, Kirill A. Shutemov,
	Hillf Danton, Dave Hansen, linux-fsdevel, linux-kernel,
	Kirill A. Shutemov

From: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>

Returns true if the mapping can have huge pages. For now, just check for
__GFP_COMP in the mapping's gfp mask.
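
A filesystem that wants huge pages in its page cache would opt in by
keeping __GFP_COMP in the mapping's gfp mask, e.g. via GFP_TRANSHUGE. A
minimal sketch (illustrative only, not part of this patch; the inode
setup helper is hypothetical):

	static struct inode *example_fs_get_inode(struct super_block *sb)
	{
		struct inode *inode = new_inode(sb);

		/*
		 * GFP_TRANSHUGE includes __GFP_COMP, so
		 * mapping_can_have_hugepages() returns true for this
		 * mapping on THP-enabled builds.
		 */
		if (inode)
			mapping_set_gfp_mask(inode->i_mapping, GFP_TRANSHUGE);
		return inode;
	}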

Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
---
 include/linux/pagemap.h |   11 +++++++++++
 1 file changed, 11 insertions(+)

diff --git a/include/linux/pagemap.h b/include/linux/pagemap.h
index e3dea75..56debde 100644
--- a/include/linux/pagemap.h
+++ b/include/linux/pagemap.h
@@ -84,6 +84,17 @@ static inline void mapping_set_gfp_mask(struct address_space *m, gfp_t mask)
 				(__force unsigned long)mask;
 }
 
+static inline bool mapping_can_have_hugepages(struct address_space *m)
+{
+	if (IS_ENABLED(CONFIG_TRANSPARENT_HUGEPAGE)) {
+		gfp_t gfp_mask = mapping_gfp_mask(m);
+		/* __GFP_COMP is key part of GFP_TRANSHUGE */
+		return !!(gfp_mask & __GFP_COMP);
+	}
+
+	return false;
+}
+
 /*
  * The page cache can done in larger chunks than
  * one page, because it allows for more efficient
-- 
1.7.10.4


^ permalink raw reply related	[flat|nested] 110+ messages in thread

* [PATCHv3, RFC 09/34] thp: represent file thp pages in meminfo and friends
  2013-04-05 11:59 ` Kirill A. Shutemov
@ 2013-04-05 11:59   ` Kirill A. Shutemov
  -1 siblings, 0 replies; 110+ messages in thread
From: Kirill A. Shutemov @ 2013-04-05 11:59 UTC (permalink / raw)
  To: Andrea Arcangeli, Andrew Morton
  Cc: Al Viro, Hugh Dickins, Wu Fengguang, Jan Kara, Mel Gorman,
	linux-mm, Andi Kleen, Matthew Wilcox, Kirill A. Shutemov,
	Hillf Danton, Dave Hansen, linux-fsdevel, linux-kernel,
	Kirill A. Shutemov

From: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>

The patch adds a new zone stat to count file transparent huge pages and
adjusts the related places.

For now we don't count mapped or dirty file thp pages separately.
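
As a worked example (assuming 2 MiB huge pages, i.e. HPAGE_PMD_NR == 512):
three file thp pages are reported as

	FileHugePages:      6144 kB

and contribute 3 * 512 = 1536 small-page equivalents to the Cached and
total pagecache numbers.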

Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
---
 drivers/base/node.c    |   10 ++++++++++
 fs/proc/meminfo.c      |    6 ++++++
 include/linux/mmzone.h |    1 +
 mm/mmap.c              |    3 +++
 mm/page_alloc.c        |    7 ++++++-
 5 files changed, 26 insertions(+), 1 deletion(-)

diff --git a/drivers/base/node.c b/drivers/base/node.c
index fac124a..eed3763 100644
--- a/drivers/base/node.c
+++ b/drivers/base/node.c
@@ -118,11 +118,18 @@ static ssize_t node_read_meminfo(struct device *dev,
 		       "Node %d SUnreclaim:     %8lu kB\n"
 #ifdef CONFIG_TRANSPARENT_HUGEPAGE
 		       "Node %d AnonHugePages:  %8lu kB\n"
+		       "Node %d FileHugePages:  %8lu kB\n"
 #endif
 			,
 		       nid, K(node_page_state(nid, NR_FILE_DIRTY)),
 		       nid, K(node_page_state(nid, NR_WRITEBACK)),
+#ifdef CONFIG_TRANSPARENT_HUGEPAGE
+		       nid, K(node_page_state(nid, NR_FILE_PAGES)
+			+ node_page_state(nid, NR_FILE_TRANSPARENT_HUGEPAGES) *
+			HPAGE_PMD_NR),
+#else
 		       nid, K(node_page_state(nid, NR_FILE_PAGES)),
+#endif
 		       nid, K(node_page_state(nid, NR_FILE_MAPPED)),
 #ifdef CONFIG_TRANSPARENT_HUGEPAGE
 		       nid, K(node_page_state(nid, NR_ANON_PAGES)
@@ -145,6 +152,9 @@ static ssize_t node_read_meminfo(struct device *dev,
 		       nid, K(node_page_state(nid, NR_SLAB_UNRECLAIMABLE))
 			, nid,
 			K(node_page_state(nid, NR_ANON_TRANSPARENT_HUGEPAGES) *
+			HPAGE_PMD_NR)
+			, nid,
+			K(node_page_state(nid, NR_FILE_TRANSPARENT_HUGEPAGES) *
 			HPAGE_PMD_NR));
 #else
 		       nid, K(node_page_state(nid, NR_SLAB_UNRECLAIMABLE)));
diff --git a/fs/proc/meminfo.c b/fs/proc/meminfo.c
index 1efaaa1..747ec70 100644
--- a/fs/proc/meminfo.c
+++ b/fs/proc/meminfo.c
@@ -41,6 +41,9 @@ static int meminfo_proc_show(struct seq_file *m, void *v)
 
 	cached = global_page_state(NR_FILE_PAGES) -
 			total_swapcache_pages() - i.bufferram;
+	if (IS_ENABLED(CONFIG_TRANSPARENT_HUGEPAGE))
+		cached += global_page_state(NR_FILE_TRANSPARENT_HUGEPAGES) *
+			HPAGE_PMD_NR;
 	if (cached < 0)
 		cached = 0;
 
@@ -103,6 +106,7 @@ static int meminfo_proc_show(struct seq_file *m, void *v)
 #endif
 #ifdef CONFIG_TRANSPARENT_HUGEPAGE
 		"AnonHugePages:  %8lu kB\n"
+		"FileHugePages:  %8lu kB\n"
 #endif
 		,
 		K(i.totalram),
@@ -163,6 +167,8 @@ static int meminfo_proc_show(struct seq_file *m, void *v)
 #ifdef CONFIG_TRANSPARENT_HUGEPAGE
 		,K(global_page_state(NR_ANON_TRANSPARENT_HUGEPAGES) *
 		   HPAGE_PMD_NR)
+		,K(global_page_state(NR_FILE_TRANSPARENT_HUGEPAGES) *
+		   HPAGE_PMD_NR)
 #endif
 		);
 
diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h
index ab20a60..91fadd6 100644
--- a/include/linux/mmzone.h
+++ b/include/linux/mmzone.h
@@ -142,6 +142,7 @@ enum zone_stat_item {
 	NUMA_OTHER,		/* allocation from other node */
 #endif
 	NR_ANON_TRANSPARENT_HUGEPAGES,
+	NR_FILE_TRANSPARENT_HUGEPAGES,
 	NR_FREE_CMA_PAGES,
 	NR_VM_ZONE_STAT_ITEMS };
 
diff --git a/mm/mmap.c b/mm/mmap.c
index 49dc7d5..afb9088 100644
--- a/mm/mmap.c
+++ b/mm/mmap.c
@@ -135,6 +135,9 @@ int __vm_enough_memory(struct mm_struct *mm, long pages, int cap_sys_admin)
 	if (sysctl_overcommit_memory == OVERCOMMIT_GUESS) {
 		free = global_page_state(NR_FREE_PAGES);
 		free += global_page_state(NR_FILE_PAGES);
+		if (IS_ENABLED(CONFIG_TRANSPARENT_HUGEPAGE))
+			free += global_page_state(NR_FILE_TRANSPARENT_HUGEPAGES)
+				* HPAGE_PMD_NR;
 
 		/*
 		 * shmem pages shouldn't be counted as free in this
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index ca7b01e..7a26038 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -2963,6 +2963,7 @@ void show_free_areas(unsigned int filter)
 {
 	int cpu;
 	struct zone *zone;
+	long cached;
 
 	for_each_populated_zone(zone) {
 		if (skip_free_areas_node(filter, zone_to_nid(zone)))
@@ -3112,7 +3113,11 @@ void show_free_areas(unsigned int filter)
 		printk("= %lukB\n", K(total));
 	}
 
-	printk("%ld total pagecache pages\n", global_page_state(NR_FILE_PAGES));
+	cached = global_page_state(NR_FILE_PAGES);
+	if (IS_ENABLED(CONFIG_TRANSPARENT_HUGEPAGE))
+		cached += global_page_state(NR_FILE_TRANSPARENT_HUGEPAGES) *
+			HPAGE_PMD_NR;
+	printk("%ld total pagecache pages\n", cached);
 
 	show_swap_cache_info();
 }
-- 
1.7.10.4


^ permalink raw reply related	[flat|nested] 110+ messages in thread

* [PATCHv3, RFC 10/34] thp, mm: rewrite add_to_page_cache_locked() to support huge pages
  2013-04-05 11:59 ` Kirill A. Shutemov
@ 2013-04-05 11:59   ` Kirill A. Shutemov
  -1 siblings, 0 replies; 110+ messages in thread
From: Kirill A. Shutemov @ 2013-04-05 11:59 UTC (permalink / raw)
  To: Andrea Arcangeli, Andrew Morton
  Cc: Al Viro, Hugh Dickins, Wu Fengguang, Jan Kara, Mel Gorman,
	linux-mm, Andi Kleen, Matthew Wilcox, Kirill A. Shutemov,
	Hillf Danton, Dave Hansen, linux-fsdevel, linux-kernel,
	Kirill A. Shutemov

From: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>

For a huge page we add HPAGE_CACHE_NR pages to the radix tree at once:
the head page at the specified index and HPAGE_CACHE_NR-1 tail pages at
the following indexes.
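
For illustration (assuming HPAGE_CACHE_NR == 512): adding a huge page at
index 512 populates radix tree slots 512..1023, with slot 512 pointing at
the head page and slots 513..1023 at the corresponding tail pages. If one
of the later slots turns out to be occupied, the already inserted entries
are rolled back and -ENOSPC is returned, so the caller can fall back to
small pages.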

Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
---
 mm/filemap.c |   71 ++++++++++++++++++++++++++++++++++++++--------------------
 1 file changed, 47 insertions(+), 24 deletions(-)

diff --git a/mm/filemap.c b/mm/filemap.c
index 2d99191..ce1ded8 100644
--- a/mm/filemap.c
+++ b/mm/filemap.c
@@ -447,39 +447,62 @@ int add_to_page_cache_locked(struct page *page, struct address_space *mapping,
 		pgoff_t offset, gfp_t gfp_mask)
 {
 	int error;
+	enum zone_stat_item item;
+	int i, nr;
 
 	VM_BUG_ON(!PageLocked(page));
 	VM_BUG_ON(PageSwapBacked(page));
 
+	/* memory cgroup controller handles thp pages on its side */
 	error = mem_cgroup_cache_charge(page, current->mm,
 					gfp_mask & GFP_RECLAIM_MASK);
 	if (error)
-		goto out;
-
-	error = radix_tree_preload(gfp_mask & ~__GFP_HIGHMEM);
-	if (error == 0) {
-		page_cache_get(page);
-		page->mapping = mapping;
-		page->index = offset;
+		return error;
 
-		spin_lock_irq(&mapping->tree_lock);
-		error = radix_tree_insert(&mapping->page_tree, offset, page);
-		if (likely(!error)) {
-			mapping->nrpages++;
-			__inc_zone_page_state(page, NR_FILE_PAGES);
-			spin_unlock_irq(&mapping->tree_lock);
-			trace_mm_filemap_add_to_page_cache(page);
-		} else {
-			page->mapping = NULL;
-			/* Leave page->index set: truncation relies upon it */
-			spin_unlock_irq(&mapping->tree_lock);
-			mem_cgroup_uncharge_cache_page(page);
-			page_cache_release(page);
-		}
-		radix_tree_preload_end();
-	} else
+	if (PageTransHuge(page)) {
+		BUILD_BUG_ON(HPAGE_CACHE_NR > RADIX_TREE_PRELOAD_NR);
+		nr = HPAGE_CACHE_NR;
+		item = NR_FILE_TRANSPARENT_HUGEPAGES;
+	} else {
+		nr = 1;
+		item = NR_FILE_PAGES;
+	}
+	error = radix_tree_preload_count(nr, gfp_mask & ~__GFP_HIGHMEM);
+	if (error) {
 		mem_cgroup_uncharge_cache_page(page);
-out:
+		return error;
+	}
+
+	spin_lock_irq(&mapping->tree_lock);
+	for (i = 0; i < nr; i++) {
+		page_cache_get(page + i);
+		page[i].index = offset + i;
+		page[i].mapping = mapping;
+		error = radix_tree_insert(&mapping->page_tree,
+				offset + i, page + i);
+		if (error)
+			goto err;
+	}
+	__inc_zone_page_state(page, item);
+	mapping->nrpages += nr;
+	spin_unlock_irq(&mapping->tree_lock);
+	radix_tree_preload_end();
+	trace_mm_filemap_add_to_page_cache(page);
+	return 0;
+err:
+	if (i != 0)
+		error = -ENOSPC; /* no space for a huge page */
+	page_cache_release(page + i);
+	page[i].mapping = NULL;
+	for (i--; i >= 0; i--) {
+		/* Leave page->index set: truncation relies upon it */
+		page[i].mapping = NULL;
+		radix_tree_delete(&mapping->page_tree, offset + i);
+		page_cache_release(page + i);
+	}
+	spin_unlock_irq(&mapping->tree_lock);
+	radix_tree_preload_end();
+	mem_cgroup_uncharge_cache_page(page);
 	return error;
 }
 EXPORT_SYMBOL(add_to_page_cache_locked);
-- 
1.7.10.4


^ permalink raw reply related	[flat|nested] 110+ messages in thread

* [PATCHv3, RFC 11/34] mm: trace filemap: dump page order
  2013-04-05 11:59 ` Kirill A. Shutemov
@ 2013-04-05 11:59   ` Kirill A. Shutemov
  -1 siblings, 0 replies; 110+ messages in thread
From: Kirill A. Shutemov @ 2013-04-05 11:59 UTC (permalink / raw)
  To: Andrea Arcangeli, Andrew Morton
  Cc: Al Viro, Hugh Dickins, Wu Fengguang, Jan Kara, Mel Gorman,
	linux-mm, Andi Kleen, Matthew Wilcox, Kirill A. Shutemov,
	Hillf Danton, Dave Hansen, linux-fsdevel, linux-kernel,
	Kirill A. Shutemov

From: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>

Dump the page order in the trace output, so that small and huge pages in
the page cache can be distinguished.
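
With the new field, a trace line for a huge page looks roughly like this
(the values are illustrative):

	mm_filemap_add_to_page_cache: dev 0:2 ino 1 page=ffffea0004000000 pfn=1048576 ofs=2097152 order=9

Small pages keep order=0.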

Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
---
 include/trace/events/filemap.h |    7 +++++--
 1 file changed, 5 insertions(+), 2 deletions(-)

diff --git a/include/trace/events/filemap.h b/include/trace/events/filemap.h
index 0421f49..7e14b13 100644
--- a/include/trace/events/filemap.h
+++ b/include/trace/events/filemap.h
@@ -21,6 +21,7 @@ DECLARE_EVENT_CLASS(mm_filemap_op_page_cache,
 		__field(struct page *, page)
 		__field(unsigned long, i_ino)
 		__field(unsigned long, index)
+		__field(int, order)
 		__field(dev_t, s_dev)
 	),
 
@@ -28,18 +29,20 @@ DECLARE_EVENT_CLASS(mm_filemap_op_page_cache,
 		__entry->page = page;
 		__entry->i_ino = page->mapping->host->i_ino;
 		__entry->index = page->index;
+		__entry->order = compound_order(page);
 		if (page->mapping->host->i_sb)
 			__entry->s_dev = page->mapping->host->i_sb->s_dev;
 		else
 			__entry->s_dev = page->mapping->host->i_rdev;
 	),
 
-	TP_printk("dev %d:%d ino %lx page=%p pfn=%lu ofs=%lu",
+	TP_printk("dev %d:%d ino %lx page=%p pfn=%lu ofs=%lu order=%d",
 		MAJOR(__entry->s_dev), MINOR(__entry->s_dev),
 		__entry->i_ino,
 		__entry->page,
 		page_to_pfn(__entry->page),
-		__entry->index << PAGE_SHIFT)
+		__entry->index << PAGE_SHIFT,
+		__entry->order)
 );
 
 DEFINE_EVENT(mm_filemap_op_page_cache, mm_filemap_delete_from_page_cache,
-- 
1.7.10.4


^ permalink raw reply related	[flat|nested] 110+ messages in thread

* [PATCHv3, RFC 12/34] thp, mm: rewrite delete_from_page_cache() to support huge pages
  2013-04-05 11:59 ` Kirill A. Shutemov
@ 2013-04-05 11:59   ` Kirill A. Shutemov
  -1 siblings, 0 replies; 110+ messages in thread
From: Kirill A. Shutemov @ 2013-04-05 11:59 UTC (permalink / raw)
  To: Andrea Arcangeli, Andrew Morton
  Cc: Al Viro, Hugh Dickins, Wu Fengguang, Jan Kara, Mel Gorman,
	linux-mm, Andi Kleen, Matthew Wilcox, Kirill A. Shutemov,
	Hillf Danton, Dave Hansen, linux-fsdevel, linux-kernel,
	Kirill A. Shutemov

From: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>

As with add_to_page_cache_locked(), we handle HPAGE_CACHE_NR pages at a
time.

Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
---
 mm/filemap.c |   28 ++++++++++++++++++++++------
 1 file changed, 22 insertions(+), 6 deletions(-)

diff --git a/mm/filemap.c b/mm/filemap.c
index ce1ded8..56a81e3 100644
--- a/mm/filemap.c
+++ b/mm/filemap.c
@@ -115,6 +115,7 @@
 void __delete_from_page_cache(struct page *page)
 {
 	struct address_space *mapping = page->mapping;
+	int nr;
 
 	trace_mm_filemap_delete_from_page_cache(page);
 	/*
@@ -127,13 +128,28 @@ void __delete_from_page_cache(struct page *page)
 	else
 		cleancache_invalidate_page(mapping, page);
 
-	radix_tree_delete(&mapping->page_tree, page->index);
+	if (PageTransHuge(page)) {
+		int i;
+
+		radix_tree_delete(&mapping->page_tree, page->index);
+		for (i = 1; i < HPAGE_CACHE_NR; i++) {
+			radix_tree_delete(&mapping->page_tree, page->index + i);
+			page[i].mapping = NULL;
+			page_cache_release(page + i);
+		}
+		__dec_zone_page_state(page, NR_FILE_TRANSPARENT_HUGEPAGES);
+		nr = HPAGE_CACHE_NR;
+	} else {
+		radix_tree_delete(&mapping->page_tree, page->index);
+		__dec_zone_page_state(page, NR_FILE_PAGES);
+		nr = 1;
+	}
+
 	page->mapping = NULL;
 	/* Leave page->index set: truncation lookup relies upon it */
-	mapping->nrpages--;
-	__dec_zone_page_state(page, NR_FILE_PAGES);
+	mapping->nrpages -= nr;
 	if (PageSwapBacked(page))
-		__dec_zone_page_state(page, NR_SHMEM);
+		__mod_zone_page_state(page_zone(page), NR_SHMEM, -nr);
 	BUG_ON(page_mapped(page));
 
 	/*
@@ -144,8 +160,8 @@ void __delete_from_page_cache(struct page *page)
 	 * having removed the page entirely.
 	 */
 	if (PageDirty(page) && mapping_cap_account_dirty(mapping)) {
-		dec_zone_page_state(page, NR_FILE_DIRTY);
-		dec_bdi_stat(mapping->backing_dev_info, BDI_RECLAIMABLE);
+		mod_zone_page_state(page_zone(page), NR_FILE_DIRTY, -nr);
+		add_bdi_stat(mapping->backing_dev_info, BDI_RECLAIMABLE, -nr);
 	}
 }
 
-- 
1.7.10.4


^ permalink raw reply related	[flat|nested] 110+ messages in thread

* [PATCHv3, RFC 13/34] thp, mm: trigger bug in replace_page_cache_page() on THP
  2013-04-05 11:59 ` Kirill A. Shutemov
@ 2013-04-05 11:59   ` Kirill A. Shutemov
  -1 siblings, 0 replies; 110+ messages in thread
From: Kirill A. Shutemov @ 2013-04-05 11:59 UTC (permalink / raw)
  To: Andrea Arcangeli, Andrew Morton
  Cc: Al Viro, Hugh Dickins, Wu Fengguang, Jan Kara, Mel Gorman,
	linux-mm, Andi Kleen, Matthew Wilcox, Kirill A. Shutemov,
	Hillf Danton, Dave Hansen, linux-fsdevel, linux-kernel,
	Kirill A. Shutemov

From: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>

replace_page_cache_page() is only used by FUSE. It's unlikely that we
will support THP in the FUSE page cache anytime soon.

Let's postpone implementing THP handling in replace_page_cache_page()
until somebody actually needs it.

Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
---
 mm/filemap.c |    2 ++
 1 file changed, 2 insertions(+)

diff --git a/mm/filemap.c b/mm/filemap.c
index 56a81e3..1defa83 100644
--- a/mm/filemap.c
+++ b/mm/filemap.c
@@ -412,6 +412,8 @@ int replace_page_cache_page(struct page *old, struct page *new, gfp_t gfp_mask)
 {
 	int error;
 
+	VM_BUG_ON(PageTransHuge(old));
+	VM_BUG_ON(PageTransHuge(new));
 	VM_BUG_ON(!PageLocked(old));
 	VM_BUG_ON(!PageLocked(new));
 	VM_BUG_ON(new->mapping);
-- 
1.7.10.4


^ permalink raw reply related	[flat|nested] 110+ messages in thread

* [PATCHv3, RFC 14/34] thp, mm: locking tail page is a bug
  2013-04-05 11:59 ` Kirill A. Shutemov
@ 2013-04-05 11:59   ` Kirill A. Shutemov
  -1 siblings, 0 replies; 110+ messages in thread
From: Kirill A. Shutemov @ 2013-04-05 11:59 UTC (permalink / raw)
  To: Andrea Arcangeli, Andrew Morton
  Cc: Al Viro, Hugh Dickins, Wu Fengguang, Jan Kara, Mel Gorman,
	linux-mm, Andi Kleen, Matthew Wilcox, Kirill A. Shutemov,
	Hillf Danton, Dave Hansen, linux-fsdevel, linux-kernel,
	Kirill A. Shutemov

From: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>

Locking the head page means locking the entire compound page.
If we ever try to lock a tail page, something has gone wrong.
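
A caller that might hold a reference to a tail page is expected to lock
the compound page through its head, along the lines of this sketch:

	struct page *head = compound_trans_head(page);

	lock_page(head);	/* locks the whole compound page */
	/* ... work on the data ... */
	unlock_page(head);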

Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
---
 mm/filemap.c |    2 ++
 1 file changed, 2 insertions(+)

diff --git a/mm/filemap.c b/mm/filemap.c
index 1defa83..7b4736c 100644
--- a/mm/filemap.c
+++ b/mm/filemap.c
@@ -665,6 +665,7 @@ void __lock_page(struct page *page)
 {
 	DEFINE_WAIT_BIT(wait, &page->flags, PG_locked);
 
+	VM_BUG_ON(PageTail(page));
 	__wait_on_bit_lock(page_waitqueue(page), &wait, sleep_on_page,
 							TASK_UNINTERRUPTIBLE);
 }
@@ -674,6 +675,7 @@ int __lock_page_killable(struct page *page)
 {
 	DEFINE_WAIT_BIT(wait, &page->flags, PG_locked);
 
+	VM_BUG_ON(PageTail(page));
 	return __wait_on_bit_lock(page_waitqueue(page), &wait,
 					sleep_on_page_killable, TASK_KILLABLE);
 }
-- 
1.7.10.4


^ permalink raw reply related	[flat|nested] 110+ messages in thread

* [PATCHv3, RFC 15/34] thp, mm: handle tail pages in page_cache_get_speculative()
  2013-04-05 11:59 ` Kirill A. Shutemov
@ 2013-04-05 11:59   ` Kirill A. Shutemov
  -1 siblings, 0 replies; 110+ messages in thread
From: Kirill A. Shutemov @ 2013-04-05 11:59 UTC (permalink / raw)
  To: Andrea Arcangeli, Andrew Morton
  Cc: Al Viro, Hugh Dickins, Wu Fengguang, Jan Kara, Mel Gorman,
	linux-mm, Andi Kleen, Matthew Wilcox, Kirill A. Shutemov,
	Hillf Danton, Dave Hansen, linux-fsdevel, linux-kernel,
	Kirill A. Shutemov

From: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>

For a tail page we call __get_page_tail(), which has the same semantics
but handles the compound page refcounting for the tail page.

Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
---
 include/linux/pagemap.h |    4 +++-
 1 file changed, 3 insertions(+), 1 deletion(-)

diff --git a/include/linux/pagemap.h b/include/linux/pagemap.h
index 56debde..bd07fc1 100644
--- a/include/linux/pagemap.h
+++ b/include/linux/pagemap.h
@@ -160,6 +160,9 @@ static inline int page_cache_get_speculative(struct page *page)
 {
 	VM_BUG_ON(in_interrupt());
 
+	if (unlikely(PageTail(page)))
+		return __get_page_tail(page);
+
 #ifdef CONFIG_TINY_RCU
 # ifdef CONFIG_PREEMPT_COUNT
 	VM_BUG_ON(!in_atomic());
@@ -186,7 +189,6 @@ static inline int page_cache_get_speculative(struct page *page)
 		return 0;
 	}
 #endif
-	VM_BUG_ON(PageTail(page));
 
 	return 1;
 }
-- 
1.7.10.4


^ permalink raw reply related	[flat|nested] 110+ messages in thread

* [PATCHv3, RFC 16/34] thp, mm: add event counters for huge page alloc on write to a file
  2013-04-05 11:59 ` Kirill A. Shutemov
@ 2013-04-05 11:59   ` Kirill A. Shutemov
  -1 siblings, 0 replies; 110+ messages in thread
From: Kirill A. Shutemov @ 2013-04-05 11:59 UTC (permalink / raw)
  To: Andrea Arcangeli, Andrew Morton
  Cc: Al Viro, Hugh Dickins, Wu Fengguang, Jan Kara, Mel Gorman,
	linux-mm, Andi Kleen, Matthew Wilcox, Kirill A. Shutemov,
	Hillf Danton, Dave Hansen, linux-fsdevel, linux-kernel,
	Kirill A. Shutemov

From: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>

Existing stats specify the source of a thp page: fault or collapse. We're
going to allocate a new huge page on write(2), which is neither a fault
nor a collapse.

Let's introduce new events for that.
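
The counters appear in /proc/vmstat next to the existing thp_* events
(the numbers below are illustrative):

	$ grep thp_write /proc/vmstat
	thp_write_alloc 128
	thp_write_alloc_failed 3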

Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
---
 include/linux/vm_event_item.h |    2 ++
 mm/vmstat.c                   |    2 ++
 2 files changed, 4 insertions(+)

diff --git a/include/linux/vm_event_item.h b/include/linux/vm_event_item.h
index d4b7a18..584c71c 100644
--- a/include/linux/vm_event_item.h
+++ b/include/linux/vm_event_item.h
@@ -71,6 +71,8 @@ enum vm_event_item { PGPGIN, PGPGOUT, PSWPIN, PSWPOUT,
 		THP_FAULT_FALLBACK,
 		THP_COLLAPSE_ALLOC,
 		THP_COLLAPSE_ALLOC_FAILED,
+		THP_WRITE_ALLOC,
+		THP_WRITE_ALLOC_FAILED,
 		THP_SPLIT,
 		THP_ZERO_PAGE_ALLOC,
 		THP_ZERO_PAGE_ALLOC_FAILED,
diff --git a/mm/vmstat.c b/mm/vmstat.c
index 292b1cf..dd8323a 100644
--- a/mm/vmstat.c
+++ b/mm/vmstat.c
@@ -818,6 +818,8 @@ const char * const vmstat_text[] = {
 	"thp_fault_fallback",
 	"thp_collapse_alloc",
 	"thp_collapse_alloc_failed",
+	"thp_write_alloc",
+	"thp_write_alloc_failed",
 	"thp_split",
 	"thp_zero_page_alloc",
 	"thp_zero_page_alloc_failed",
-- 
1.7.10.4


^ permalink raw reply related	[flat|nested] 110+ messages in thread

* [PATCHv3, RFC 17/34] thp, mm: implement grab_thp_write_begin()
  2013-04-05 11:59 ` Kirill A. Shutemov
@ 2013-04-05 11:59   ` Kirill A. Shutemov
  -1 siblings, 0 replies; 110+ messages in thread
From: Kirill A. Shutemov @ 2013-04-05 11:59 UTC (permalink / raw)
  To: Andrea Arcangeli, Andrew Morton
  Cc: Al Viro, Hugh Dickins, Wu Fengguang, Jan Kara, Mel Gorman,
	linux-mm, Andi Kleen, Matthew Wilcox, Kirill A. Shutemov,
	Hillf Danton, Dave Hansen, linux-fsdevel, linux-kernel,
	Kirill A. Shutemov

From: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>

The function is a twin of grab_cache_page_write_begin(), but it tries to
allocate a huge page at the given position, aligned to HPAGE_CACHE_NR.

If, for some reason, it's not possible to allocate a huge page at this
position, it returns NULL. The caller should take care of falling back to
small pages.
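
A ->write_begin implementation would try the huge page first and fall
back to small pages, roughly like this (a simplified sketch;
example_write_begin and its error handling are illustrative only):

	static int example_write_begin(struct file *file,
			struct address_space *mapping, loff_t pos,
			unsigned len, unsigned flags,
			struct page **pagep, void **fsdata)
	{
		pgoff_t index = pos >> PAGE_CACHE_SHIFT;
		struct page *page = NULL;

		if (mapping_can_have_hugepages(mapping))
			page = grab_thp_write_begin(mapping,
					index & ~HPAGE_CACHE_INDEX_MASK, flags);
		/* NULL from grab_thp_write_begin() means: use small pages */
		if (!page)
			page = grab_cache_page_write_begin(mapping, index, flags);
		if (!page)
			return -ENOMEM;

		*pagep = page;
		return 0;
	}

On success with a huge page, the locked head page is returned; the write
path then computes the offset within the compound page, as the generic
write code does in the next patch.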

Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
---
 include/linux/pagemap.h |   10 ++++++
 mm/filemap.c            |   89 +++++++++++++++++++++++++++++++++++++++--------
 2 files changed, 85 insertions(+), 14 deletions(-)

diff --git a/include/linux/pagemap.h b/include/linux/pagemap.h
index bd07fc1..5a7dda9 100644
--- a/include/linux/pagemap.h
+++ b/include/linux/pagemap.h
@@ -271,6 +271,16 @@ unsigned find_get_pages_tag(struct address_space *mapping, pgoff_t *index,
 
 struct page *grab_cache_page_write_begin(struct address_space *mapping,
 			pgoff_t index, unsigned flags);
+#ifdef CONFIG_TRANSPARENT_HUGEPAGE
+struct page *grab_thp_write_begin(struct address_space *mapping,
+		pgoff_t index, unsigned flags);
+#else
+static inline struct page *grab_thp_write_begin(struct address_space *mapping,
+		pgoff_t index, unsigned flags)
+{
+	return NULL;
+}
+#endif
 
 /*
  * Returns locked page at given index in given cache, creating it if needed.
diff --git a/mm/filemap.c b/mm/filemap.c
index 7b4736c..bcb679c 100644
--- a/mm/filemap.c
+++ b/mm/filemap.c
@@ -2290,16 +2290,17 @@ out:
 EXPORT_SYMBOL(generic_file_direct_write);
 
 /*
- * Find or create a page at the given pagecache position. Return the locked
- * page. This function is specifically for buffered writes.
+ * Returns true if the page was found in page cache and
+ * false if it had to allocate a new page.
  */
-struct page *grab_cache_page_write_begin(struct address_space *mapping,
-					pgoff_t index, unsigned flags)
+static bool __grab_cache_page_write_begin(struct address_space *mapping,
+		pgoff_t index, unsigned flags, unsigned int order,
+		struct page **page)
 {
 	int status;
 	gfp_t gfp_mask;
-	struct page *page;
 	gfp_t gfp_notmask = 0;
+	int found = true;
 
 	gfp_mask = mapping_gfp_mask(mapping);
 	if (mapping_cap_account_dirty(mapping))
@@ -2307,27 +2308,87 @@ struct page *grab_cache_page_write_begin(struct address_space *mapping,
 	if (flags & AOP_FLAG_NOFS)
 		gfp_notmask = __GFP_FS;
 repeat:
-	page = find_lock_page(mapping, index);
-	if (page)
+	*page = find_lock_page(mapping, index);
+	if (*page)
 		goto found;
 
-	page = __page_cache_alloc(gfp_mask & ~gfp_notmask);
-	if (!page)
-		return NULL;
-	status = add_to_page_cache_lru(page, mapping, index,
+	found = false;
+	if (order)
+		*page = alloc_pages(gfp_mask & ~gfp_notmask, order);
+	else
+		*page = __page_cache_alloc(gfp_mask & ~gfp_notmask);
+	if (!*page)
+		return false;
+	status = add_to_page_cache_lru(*page, mapping, index,
 						GFP_KERNEL & ~gfp_notmask);
 	if (unlikely(status)) {
-		page_cache_release(page);
+		page_cache_release(*page);
 		if (status == -EEXIST)
 			goto repeat;
-		return NULL;
+		*page = NULL;
+		return false;
 	}
 found:
-	wait_for_stable_page(page);
+	wait_for_stable_page(*page);
+	return found;
+}
+
+/*
+ * Find or create a page at the given pagecache position. Return the locked
+ * page. This function is specifically for buffered writes.
+ */
+struct page *grab_cache_page_write_begin(struct address_space *mapping,
+					pgoff_t index, unsigned flags)
+{
+	struct page *page;
+	__grab_cache_page_write_begin(mapping, index, flags, 0, &page);
 	return page;
 }
 EXPORT_SYMBOL(grab_cache_page_write_begin);
 
+#ifdef CONFIG_TRANSPARENT_HUGEPAGE
+/*
+ * Find or create a huge page at the given pagecache position, aligned to
+ * HPAGE_CACHE_NR. Return the locked huge page.
+ *
+ * If, for some reason, it's not possible allocate a huge page at this
+ * possition, it returns NULL. Caller should take care of fallback to small
+ * pages.
+ *
+ * This function is specifically for buffered writes.
+ */
+struct page *grab_thp_write_begin(struct address_space *mapping,
+		pgoff_t index, unsigned flags)
+{
+	gfp_t gfp_mask;
+	struct page *page;
+	bool found;
+
+	BUG_ON(index & HPAGE_CACHE_INDEX_MASK);
+	gfp_mask = mapping_gfp_mask(mapping);
+	BUG_ON(!(gfp_mask & __GFP_COMP));
+
+	found = __grab_cache_page_write_begin(mapping, index, flags,
+			HPAGE_PMD_ORDER, &page);
+	if (!page) {
+		if (!found)
+			count_vm_event(THP_WRITE_ALLOC_FAILED);
+		return NULL;
+	}
+
+	if (!found)
+		count_vm_event(THP_WRITE_ALLOC);
+
+	if (!PageTransHuge(page)) {
+		unlock_page(page);
+		page_cache_release(page);
+		return NULL;
+	}
+
+	return page;
+}
+#endif
+
 static ssize_t generic_perform_write(struct file *file,
 				struct iov_iter *i, loff_t pos)
 {
-- 
1.7.10.4


^ permalink raw reply related	[flat|nested] 110+ messages in thread

* [PATCHv3, RFC 17/34] thp, mm: implement grab_thp_write_begin()
@ 2013-04-05 11:59   ` Kirill A. Shutemov
  0 siblings, 0 replies; 110+ messages in thread
From: Kirill A. Shutemov @ 2013-04-05 11:59 UTC (permalink / raw)
  To: Andrea Arcangeli, Andrew Morton
  Cc: Al Viro, Hugh Dickins, Wu Fengguang, Jan Kara, Mel Gorman,
	linux-mm, Andi Kleen, Matthew Wilcox, Kirill A. Shutemov,
	Hillf Danton, Dave Hansen, linux-fsdevel, linux-kernel,
	Kirill A. Shutemov

From: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>

The function is grab_cache_page_write_begin() twin but it tries to
allocate huge page at given position aligned to HPAGE_CACHE_NR.

If, for some reason, it's not possible allocate a huge page at this
possition, it returns NULL. Caller should take care of fallback to
small pages.

Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
---
 include/linux/pagemap.h |   10 ++++++
 mm/filemap.c            |   89 +++++++++++++++++++++++++++++++++++++++--------
 2 files changed, 85 insertions(+), 14 deletions(-)

diff --git a/include/linux/pagemap.h b/include/linux/pagemap.h
index bd07fc1..5a7dda9 100644
--- a/include/linux/pagemap.h
+++ b/include/linux/pagemap.h
@@ -271,6 +271,16 @@ unsigned find_get_pages_tag(struct address_space *mapping, pgoff_t *index,
 
 struct page *grab_cache_page_write_begin(struct address_space *mapping,
 			pgoff_t index, unsigned flags);
+#ifdef CONFIG_TRANSPARENT_HUGEPAGE
+struct page *grab_thp_write_begin(struct address_space *mapping,
+		pgoff_t index, unsigned flags);
+#else
+static inline struct page *grab_thp_write_begin(struct address_space *mapping,
+		pgoff_t index, unsigned flags)
+{
+	return NULL;
+}
+#endif
 
 /*
  * Returns locked page at given index in given cache, creating it if needed.
diff --git a/mm/filemap.c b/mm/filemap.c
index 7b4736c..bcb679c 100644
--- a/mm/filemap.c
+++ b/mm/filemap.c
@@ -2290,16 +2290,17 @@ out:
 EXPORT_SYMBOL(generic_file_direct_write);
 
 /*
- * Find or create a page at the given pagecache position. Return the locked
- * page. This function is specifically for buffered writes.
+ * Returns true if the page was found in page cache and
+ * false if it had to allocate a new page.
  */
-struct page *grab_cache_page_write_begin(struct address_space *mapping,
-					pgoff_t index, unsigned flags)
+static bool __grab_cache_page_write_begin(struct address_space *mapping,
+		pgoff_t index, unsigned flags, unsigned int order,
+		struct page **page)
 {
 	int status;
 	gfp_t gfp_mask;
-	struct page *page;
 	gfp_t gfp_notmask = 0;
+	int found = true;
 
 	gfp_mask = mapping_gfp_mask(mapping);
 	if (mapping_cap_account_dirty(mapping))
@@ -2307,27 +2308,87 @@ struct page *grab_cache_page_write_begin(struct address_space *mapping,
 	if (flags & AOP_FLAG_NOFS)
 		gfp_notmask = __GFP_FS;
 repeat:
-	page = find_lock_page(mapping, index);
-	if (page)
+	*page = find_lock_page(mapping, index);
+	if (*page)
 		goto found;
 
-	page = __page_cache_alloc(gfp_mask & ~gfp_notmask);
-	if (!page)
-		return NULL;
-	status = add_to_page_cache_lru(page, mapping, index,
+	found = false;
+	if (order)
+		*page = alloc_pages(gfp_mask & ~gfp_notmask, order);
+	else
+		*page = __page_cache_alloc(gfp_mask & ~gfp_notmask);
+	if (!*page)
+		return false;
+	status = add_to_page_cache_lru(*page, mapping, index,
 						GFP_KERNEL & ~gfp_notmask);
 	if (unlikely(status)) {
-		page_cache_release(page);
+		page_cache_release(*page);
 		if (status == -EEXIST)
 			goto repeat;
-		return NULL;
+		*page = NULL;
+		return false;
 	}
 found:
-	wait_for_stable_page(page);
+	wait_for_stable_page(*page);
+	return found;
+}
+
+/*
+ * Find or create a page at the given pagecache position. Return the locked
+ * page. This function is specifically for buffered writes.
+ */
+struct page *grab_cache_page_write_begin(struct address_space *mapping,
+					pgoff_t index, unsigned flags)
+{
+	struct page *page;
+	__grab_cache_page_write_begin(mapping, index, flags, 0, &page);
 	return page;
 }
 EXPORT_SYMBOL(grab_cache_page_write_begin);
 
+#ifdef CONFIG_TRANSPARENT_HUGEPAGE
+/*
+ * Find or create a huge page at the given pagecache position, aligned to
+ * HPAGE_CACHE_NR. Return the locked huge page.
+ *
+ * If, for some reason, it's not possible allocate a huge page at this
+ * possition, it returns NULL. Caller should take care of fallback to small
+ * pages.
+ *
+ * This function is specifically for buffered writes.
+ */
+struct page *grab_thp_write_begin(struct address_space *mapping,
+		pgoff_t index, unsigned flags)
+{
+	gfp_t gfp_mask;
+	struct page *page;
+	bool found;
+
+	BUG_ON(index & HPAGE_CACHE_INDEX_MASK);
+	gfp_mask = mapping_gfp_mask(mapping);
+	BUG_ON(!(gfp_mask & __GFP_COMP));
+
+	found = __grab_cache_page_write_begin(mapping, index, flags,
+			HPAGE_PMD_ORDER, &page);
+	if (!page) {
+		if (!found)
+			count_vm_event(THP_WRITE_ALLOC_FAILED);
+		return NULL;
+	}
+
+	if (!found)
+		count_vm_event(THP_WRITE_ALLOC);
+
+	if (!PageTransHuge(page)) {
+		unlock_page(page);
+		page_cache_release(page);
+		return NULL;
+	}
+
+	return page;
+}
+#endif
+
 static ssize_t generic_perform_write(struct file *file,
 				struct iov_iter *i, loff_t pos)
 {
-- 
1.7.10.4
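
For reference, the calling convention looks roughly like this (a minimal
sketch; example_write_begin is a made-up caller, the real conversion of
simple_write_begin() follows later in the series):

static int example_write_begin(struct file *file,
		struct address_space *mapping, loff_t pos, unsigned len,
		unsigned flags, struct page **pagep, void **fsdata)
{
	pgoff_t index = pos >> PAGE_CACHE_SHIFT;
	struct page *page = NULL;

	if (mapping_can_have_hugepages(mapping))
		page = grab_thp_write_begin(mapping,
				index & ~HPAGE_CACHE_INDEX_MASK, flags);
	/* NULL from the huge path means: fall back to small pages */
	if (!page)
		page = grab_cache_page_write_begin(mapping, index, flags);
	if (!page)
		return -ENOMEM;
	*pagep = page;
	return 0;
}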


* [PATCHv3, RFC 18/34] thp, mm: naive support of thp in generic read/write routines
  2013-04-05 11:59 ` Kirill A. Shutemov
@ 2013-04-05 11:59   ` Kirill A. Shutemov
  -1 siblings, 0 replies; 110+ messages in thread
From: Kirill A. Shutemov @ 2013-04-05 11:59 UTC (permalink / raw)
  To: Andrea Arcangeli, Andrew Morton
  Cc: Al Viro, Hugh Dickins, Wu Fengguang, Jan Kara, Mel Gorman,
	linux-mm, Andi Kleen, Matthew Wilcox, Kirill A. Shutemov,
	Hillf Danton, Dave Hansen, linux-fsdevel, linux-kernel,
	Kirill A. Shutemov

From: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>

For now we still write/read at most PAGE_CACHE_SIZE bytes at a time.

This implementation doesn't cover address spaces with backing store.
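
The copy still targets a single 4k subpage of the compound page: pos is
first reduced to an offset within the huge page, and that offset is then
split into a subpage index and an offset inside that subpage. A
standalone sketch of the arithmetic (userspace, x86-64 4k/2M constants
assumed):

#include <stdio.h>

#define PAGE_CACHE_SHIFT	12
#define PAGE_CACHE_SIZE		(1UL << PAGE_CACHE_SHIFT)
#define PAGE_CACHE_MASK		(~(PAGE_CACHE_SIZE - 1))
#define HPAGE_PMD_SHIFT		21
#define HPAGE_PMD_SIZE		(1UL << HPAGE_PMD_SHIFT)
#define HPAGE_PMD_MASK		(~(HPAGE_PMD_SIZE - 1))

int main(void)
{
	/* byte 100 of the 5th subpage of a 2M page */
	unsigned long pos = (5UL << PAGE_CACHE_SHIFT) + 100;
	unsigned long offset = pos & ~HPAGE_PMD_MASK;

	printf("subpage index:     %lu\n", offset >> PAGE_CACHE_SHIFT);	/* 5 */
	printf("offset in subpage: %lu\n", offset & ~PAGE_CACHE_MASK);	/* 100 */
	return 0;
}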

Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
---
 mm/filemap.c |   18 +++++++++++++++++-
 1 file changed, 17 insertions(+), 1 deletion(-)

diff --git a/mm/filemap.c b/mm/filemap.c
index bcb679c..3296f5c 100644
--- a/mm/filemap.c
+++ b/mm/filemap.c
@@ -1161,6 +1161,16 @@ find_page:
 			if (unlikely(page == NULL))
 				goto no_cached_page;
 		}
+		if (PageTransCompound(page)) {
+			struct page *head = compound_trans_head(page);
+			/*
+			 * We don't yet support huge pages in page cache
+			 * for filesystems with backing device, so pages
+			 * should always be up-to-date.
+			 */
+			BUG_ON(PageReadahead(head) || !PageUptodate(head));
+			goto page_ok;
+		}
 		if (PageReadahead(page)) {
 			page_cache_async_readahead(mapping,
 					ra, filp, page,
@@ -2439,8 +2449,13 @@ again:
 		if (mapping_writably_mapped(mapping))
 			flush_dcache_page(page);
 
+		if (PageTransHuge(page))
+			offset = pos & ~HPAGE_PMD_MASK;
+
 		pagefault_disable();
-		copied = iov_iter_copy_from_user_atomic(page, i, offset, bytes);
+		copied = iov_iter_copy_from_user_atomic(
+				page + (offset >> PAGE_CACHE_SHIFT),
+				i, offset & ~PAGE_CACHE_MASK, bytes);
 		pagefault_enable();
 		flush_dcache_page(page);
 
@@ -2463,6 +2478,7 @@ again:
 			 * because not all segments in the iov can be copied at
 			 * once without a pagefault.
 			 */
+			offset = pos & ~PAGE_CACHE_MASK;
 			bytes = min_t(unsigned long, PAGE_CACHE_SIZE - offset,
 						iov_iter_single_seg_count(i));
 			goto again;
-- 
1.7.10.4


* [PATCHv3, RFC 19/34] thp, libfs: initial support of thp in simple_read/write_begin/write_end
  2013-04-05 11:59 ` Kirill A. Shutemov
@ 2013-04-05 11:59   ` Kirill A. Shutemov
  -1 siblings, 0 replies; 110+ messages in thread
From: Kirill A. Shutemov @ 2013-04-05 11:59 UTC (permalink / raw)
  To: Andrea Arcangeli, Andrew Morton
  Cc: Al Viro, Hugh Dickins, Wu Fengguang, Jan Kara, Mel Gorman,
	linux-mm, Andi Kleen, Matthew Wilcox, Kirill A. Shutemov,
	Hillf Danton, Dave Hansen, linux-fsdevel, linux-kernel,
	Kirill A. Shutemov

From: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>

For now we try to grab a huge cache page if gfp_mask has __GFP_COMP.
It's probably too weak a condition and needs to be reworked later.
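
When a huge page is grabbed but the write does not cover all of it,
simple_write_begin() zeroes everything around the written range, i.e.
[0, from) and [from + len, HPAGE_PMD_SIZE). A standalone sketch of those
bounds (userspace, 2M/4k constants assumed):

#include <stdio.h>

#define HPAGE_PMD_SHIFT	21
#define HPAGE_PMD_SIZE	(1UL << HPAGE_PMD_SHIFT)
#define HPAGE_PMD_MASK	(~(HPAGE_PMD_SIZE - 1))

int main(void)
{
	unsigned long pos = (3UL << 20) + 4096;	/* lands in the 2nd MB of the page */
	unsigned long len = 8192;		/* 8k write */
	unsigned long from = pos & ~HPAGE_PMD_MASK;

	/* the two ranges handed to zero_huge_user_segment() */
	printf("zero [0, %lu) and [%lu, %lu)\n",
	       from, from + len, HPAGE_PMD_SIZE);
	return 0;
}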

Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
---
 fs/libfs.c              |   48 ++++++++++++++++++++++++++++++++++++-----------
 include/linux/pagemap.h |    8 ++++++++
 2 files changed, 45 insertions(+), 11 deletions(-)

diff --git a/fs/libfs.c b/fs/libfs.c
index 916da8c..6e5286d 100644
--- a/fs/libfs.c
+++ b/fs/libfs.c
@@ -383,7 +383,7 @@ EXPORT_SYMBOL(simple_setattr);
 
 int simple_readpage(struct file *file, struct page *page)
 {
-	clear_highpage(page);
+	clear_pagecache_page(page);
 	flush_dcache_page(page);
 	SetPageUptodate(page);
 	unlock_page(page);
@@ -394,21 +394,42 @@ int simple_write_begin(struct file *file, struct address_space *mapping,
 			loff_t pos, unsigned len, unsigned flags,
 			struct page **pagep, void **fsdata)
 {
-	struct page *page;
+	struct page *page = NULL;
 	pgoff_t index;
 
 	index = pos >> PAGE_CACHE_SHIFT;
 
-	page = grab_cache_page_write_begin(mapping, index, flags);
+	/* XXX: too weak condition. Good enough for initial testing */
+	if (mapping_can_have_hugepages(mapping)) {
+		page = grab_thp_write_begin(mapping,
+				index & ~HPAGE_CACHE_INDEX_MASK, flags);
+		/* fallback to small page */
+		if (!page || !PageTransHuge(page)) {
+			unsigned long offset;
+			offset = pos & ~PAGE_CACHE_MASK;
+			len = min_t(unsigned long,
+					len, PAGE_CACHE_SIZE - offset);
+		}
+	}
+	if (!page)
+		page = grab_cache_page_write_begin(mapping, index, flags);
 	if (!page)
 		return -ENOMEM;
-
 	*pagep = page;
 
-	if (!PageUptodate(page) && (len != PAGE_CACHE_SIZE)) {
-		unsigned from = pos & (PAGE_CACHE_SIZE - 1);
-
-		zero_user_segments(page, 0, from, from + len, PAGE_CACHE_SIZE);
+	if (!PageUptodate(page)) {
+		unsigned from;
+
+		if (PageTransHuge(page) && len != HPAGE_PMD_SIZE) {
+			from = pos & ~HPAGE_PMD_MASK;
+			zero_huge_user_segment(page, 0, from);
+			zero_huge_user_segment(page,
+					from + len, HPAGE_PMD_SIZE);
+		} else if (len != PAGE_CACHE_SIZE) {
+			from = pos & ~PAGE_CACHE_MASK;
+			zero_user_segments(page, 0, from,
+					from + len, PAGE_CACHE_SIZE);
+		}
 	}
 	return 0;
 }
@@ -443,9 +464,14 @@ int simple_write_end(struct file *file, struct address_space *mapping,
 
 	/* zero the stale part of the page if we did a short copy */
 	if (copied < len) {
-		unsigned from = pos & (PAGE_CACHE_SIZE - 1);
-
-		zero_user(page, from + copied, len - copied);
+		unsigned from;
+		if (PageTransHuge(page)) {
+			from = pos & ~HPAGE_PMD_MASK;
+			zero_huge_user(page, from + copied, len - copied);
+		} else {
+			from = pos & ~PAGE_CACHE_MASK;
+			zero_user(page, from + copied, len - copied);
+		}
 	}
 
 	if (!PageUptodate(page))
diff --git a/include/linux/pagemap.h b/include/linux/pagemap.h
index 5a7dda9..c64d19c 100644
--- a/include/linux/pagemap.h
+++ b/include/linux/pagemap.h
@@ -581,4 +581,12 @@ static inline int add_to_page_cache(struct page *page,
 	return error;
 }
 
+static inline void clear_pagecache_page(struct page *page)
+{
+	if (PageTransHuge(page))
+		zero_huge_user(page, 0, HPAGE_PMD_SIZE);
+	else
+		clear_highpage(page);
+}
+
 #endif /* _LINUX_PAGEMAP_H */
-- 
1.7.10.4


* [PATCHv3, RFC 20/34] thp: handle file pages in split_huge_page()
  2013-04-05 11:59 ` Kirill A. Shutemov
@ 2013-04-05 11:59   ` Kirill A. Shutemov
  -1 siblings, 0 replies; 110+ messages in thread
From: Kirill A. Shutemov @ 2013-04-05 11:59 UTC (permalink / raw)
  To: Andrea Arcangeli, Andrew Morton
  Cc: Al Viro, Hugh Dickins, Wu Fengguang, Jan Kara, Mel Gorman,
	linux-mm, Andi Kleen, Matthew Wilcox, Kirill A. Shutemov,
	Hillf Danton, Dave Hansen, linux-fsdevel, linux-kernel,
	Kirill A. Shutemov

From: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>

The base scheme is the same as for anonymous pages, but we walk
mapping->i_mmap rather than anon_vma->rb_root.

__split_huge_page_refcount() has been tuned a bit: we need to transfer
PG_swapbacked to tail pages.

Splitting of mapped pages hasn't been tested at all, since we cannot
mmap() file-backed huge pages yet.
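
The walk itself, stripped of the split logic, is the usual file rmap
pattern over the mapping's interval tree (a sketch only;
example_count_vmas is not part of the patch):

static int example_count_vmas(struct page *page)
{
	struct address_space *mapping = page->mapping;
	pgoff_t pgoff = page->index << (PAGE_CACHE_SHIFT - PAGE_SHIFT);
	struct vm_area_struct *vma;
	int nr = 0;

	mutex_lock(&mapping->i_mmap_mutex);
	vma_interval_tree_foreach(vma, &mapping->i_mmap, pgoff, pgoff) {
		/* vma_address(page, vma) gives the user address of the
		 * huge page in this vma; the split code operates on the
		 * pmd that maps it there */
		nr++;
	}
	mutex_unlock(&mapping->i_mmap_mutex);
	return nr;
}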

Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
---
 mm/huge_memory.c |   71 +++++++++++++++++++++++++++++++++++++++++++++---------
 1 file changed, 59 insertions(+), 12 deletions(-)

diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index 46a44ac..ac0dc80 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -1637,24 +1637,25 @@ static void __split_huge_page_refcount(struct page *page)
 		*/
 		page_tail->_mapcount = page->_mapcount;
 
-		BUG_ON(page_tail->mapping);
 		page_tail->mapping = page->mapping;
-
 		page_tail->index = page->index + i;
 		page_nid_xchg_last(page_tail, page_nid_last(page));
 
-		BUG_ON(!PageAnon(page_tail));
 		BUG_ON(!PageUptodate(page_tail));
 		BUG_ON(!PageDirty(page_tail));
-		BUG_ON(!PageSwapBacked(page_tail));
 
 		lru_add_page_tail(page, page_tail, lruvec);
 	}
 	atomic_sub(tail_count, &page->_count);
 	BUG_ON(atomic_read(&page->_count) <= 0);
 
-	__mod_zone_page_state(zone, NR_ANON_TRANSPARENT_HUGEPAGES, -1);
-	__mod_zone_page_state(zone, NR_ANON_PAGES, HPAGE_PMD_NR);
+	if (PageAnon(page)) {
+		__mod_zone_page_state(zone, NR_ANON_TRANSPARENT_HUGEPAGES, -1);
+		__mod_zone_page_state(zone, NR_ANON_PAGES, HPAGE_PMD_NR);
+	} else {
+		__mod_zone_page_state(zone, NR_FILE_TRANSPARENT_HUGEPAGES, -1);
+		__mod_zone_page_state(zone, NR_FILE_PAGES, HPAGE_PMD_NR);
+	}
 
 	ClearPageCompound(page);
 	compound_unlock(page);
@@ -1754,7 +1755,7 @@ static int __split_huge_page_map(struct page *page,
 }
 
 /* must be called with anon_vma->root->rwsem held */
-static void __split_huge_page(struct page *page,
+static void __split_anon_huge_page(struct page *page,
 			      struct anon_vma *anon_vma)
 {
 	int mapcount, mapcount2;
@@ -1801,14 +1802,11 @@ static void __split_huge_page(struct page *page,
 	BUG_ON(mapcount != mapcount2);
 }
 
-int split_huge_page(struct page *page)
+static int split_anon_huge_page(struct page *page)
 {
 	struct anon_vma *anon_vma;
 	int ret = 1;
 
-	BUG_ON(is_huge_zero_pfn(page_to_pfn(page)));
-	BUG_ON(!PageAnon(page));
-
 	/*
 	 * The caller does not necessarily hold an mmap_sem that would prevent
 	 * the anon_vma disappearing so we first we take a reference to it
@@ -1826,7 +1824,7 @@ int split_huge_page(struct page *page)
 		goto out_unlock;
 
 	BUG_ON(!PageSwapBacked(page));
-	__split_huge_page(page, anon_vma);
+	__split_anon_huge_page(page, anon_vma);
 	count_vm_event(THP_SPLIT);
 
 	BUG_ON(PageCompound(page));
@@ -1837,6 +1835,55 @@ out:
 	return ret;
 }
 
+static int split_file_huge_page(struct page *page)
+{
+	struct address_space *mapping = page->mapping;
+	pgoff_t pgoff = page->index << (PAGE_CACHE_SHIFT - PAGE_SHIFT);
+	struct vm_area_struct *vma;
+	int mapcount, mapcount2;
+
+	BUG_ON(!PageHead(page));
+	BUG_ON(PageTail(page));
+
+	mutex_lock(&mapping->i_mmap_mutex);
+	mapcount = 0;
+	vma_interval_tree_foreach(vma, &mapping->i_mmap, pgoff, pgoff) {
+		unsigned long addr = vma_address(page, vma);
+		mapcount += __split_huge_page_splitting(page, vma, addr);
+	}
+
+	if (mapcount != page_mapcount(page))
+		printk(KERN_ERR "mapcount %d page_mapcount %d\n",
+		       mapcount, page_mapcount(page));
+	BUG_ON(mapcount != page_mapcount(page));
+
+	__split_huge_page_refcount(page);
+
+	mapcount2 = 0;
+	vma_interval_tree_foreach(vma, &mapping->i_mmap, pgoff, pgoff) {
+		unsigned long addr = vma_address(page, vma);
+		mapcount2 += __split_huge_page_map(page, vma, addr);
+	}
+
+	if (mapcount != mapcount2)
+		printk(KERN_ERR "mapcount %d mapcount2 %d page_mapcount %d\n",
+		       mapcount, mapcount2, page_mapcount(page));
+	BUG_ON(mapcount != mapcount2);
+	count_vm_event(THP_SPLIT);
+	mutex_unlock(&mapping->i_mmap_mutex);
+	return 0;
+}
+
+int split_huge_page(struct page *page)
+{
+	BUG_ON(is_huge_zero_pfn(page_to_pfn(page)));
+
+	if (PageAnon(page))
+		return split_anon_huge_page(page);
+	else
+		return split_file_huge_page(page);
+}
+
 #define VM_NO_THP (VM_SPECIAL|VM_MIXEDMAP|VM_HUGETLB|VM_SHARED|VM_MAYSHARE)
 
 int hugepage_madvise(struct vm_area_struct *vma,
-- 
1.7.10.4


* [PATCHv3, RFC 21/34] thp: wait_split_huge_page(): serialize over i_mmap_mutex too
  2013-04-05 11:59 ` Kirill A. Shutemov
@ 2013-04-05 11:59   ` Kirill A. Shutemov
  -1 siblings, 0 replies; 110+ messages in thread
From: Kirill A. Shutemov @ 2013-04-05 11:59 UTC (permalink / raw)
  To: Andrea Arcangeli, Andrew Morton
  Cc: Al Viro, Hugh Dickins, Wu Fengguang, Jan Kara, Mel Gorman,
	linux-mm, Andi Kleen, Matthew Wilcox, Kirill A. Shutemov,
	Hillf Danton, Dave Hansen, linux-fsdevel, linux-kernel,
	Kirill A. Shutemov

From: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>

Since we're going to have huge pages backed by files,
wait_split_huge_page() has to serialize not only over anon_vma_lock,
but also over i_mmap_mutex.
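
The lock/unlock pair with an empty critical section is just a wait:
whoever is splitting the page holds these locks for the whole operation,
so taking and immediately dropping them cannot complete before the split
does. The same idiom in plain pthreads, for illustration only:

#include <pthread.h>

static pthread_mutex_t split_lock = PTHREAD_MUTEX_INITIALIZER;

static void wait_for_split_to_finish(void)
{
	pthread_mutex_lock(&split_lock);	/* blocks while a "split" holds it */
	pthread_mutex_unlock(&split_lock);	/* we only needed to wait */
}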

Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
---
 include/linux/huge_mm.h |   15 ++++++++++++---
 mm/huge_memory.c        |    4 ++--
 mm/memory.c             |    4 ++--
 3 files changed, 16 insertions(+), 7 deletions(-)

diff --git a/include/linux/huge_mm.h b/include/linux/huge_mm.h
index a54939c..b53e295 100644
--- a/include/linux/huge_mm.h
+++ b/include/linux/huge_mm.h
@@ -113,11 +113,20 @@ extern void __split_huge_page_pmd(struct vm_area_struct *vma,
 			__split_huge_page_pmd(__vma, __address,		\
 					____pmd);			\
 	}  while (0)
-#define wait_split_huge_page(__anon_vma, __pmd)				\
+#define wait_split_huge_page(__vma, __pmd)				\
 	do {								\
 		pmd_t *____pmd = (__pmd);				\
-		anon_vma_lock_write(__anon_vma);			\
-		anon_vma_unlock_write(__anon_vma);			\
+		struct address_space *__mapping = (__vma)->vm_file ?	\
+				(__vma)->vm_file->f_mapping : NULL;	\
+		struct anon_vma *__anon_vma = (__vma)->anon_vma;	\
+		if (__mapping)						\
+			mutex_lock(&__mapping->i_mmap_mutex);		\
+		if (__anon_vma) {					\
+			anon_vma_lock_write(__anon_vma);		\
+			anon_vma_unlock_write(__anon_vma);		\
+		}							\
+		if (__mapping)						\
+			mutex_unlock(&__mapping->i_mmap_mutex);		\
 		BUG_ON(pmd_trans_splitting(*____pmd) ||			\
 		       pmd_trans_huge(*____pmd));			\
 	} while (0)
diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index ac0dc80..7c48f58 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -907,7 +907,7 @@ int copy_huge_pmd(struct mm_struct *dst_mm, struct mm_struct *src_mm,
 		spin_unlock(&dst_mm->page_table_lock);
 		pte_free(dst_mm, pgtable);
 
-		wait_split_huge_page(vma->anon_vma, src_pmd); /* src_vma */
+		wait_split_huge_page(vma, src_pmd); /* src_vma */
 		goto out;
 	}
 	src_page = pmd_page(pmd);
@@ -1480,7 +1480,7 @@ int __pmd_trans_huge_lock(pmd_t *pmd, struct vm_area_struct *vma)
 	if (likely(pmd_trans_huge(*pmd))) {
 		if (unlikely(pmd_trans_splitting(*pmd))) {
 			spin_unlock(&vma->vm_mm->page_table_lock);
-			wait_split_huge_page(vma->anon_vma, pmd);
+			wait_split_huge_page(vma, pmd);
 			return -1;
 		} else {
 			/* Thp mapped by 'pmd' is stable, so we can
diff --git a/mm/memory.c b/mm/memory.c
index 9da540f..2895f0e 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -619,7 +619,7 @@ int __pte_alloc(struct mm_struct *mm, struct vm_area_struct *vma,
 	if (new)
 		pte_free(mm, new);
 	if (wait_split_huge_page)
-		wait_split_huge_page(vma->anon_vma, pmd);
+		wait_split_huge_page(vma, pmd);
 	return 0;
 }
 
@@ -1529,7 +1529,7 @@ struct page *follow_page_mask(struct vm_area_struct *vma,
 		if (likely(pmd_trans_huge(*pmd))) {
 			if (unlikely(pmd_trans_splitting(*pmd))) {
 				spin_unlock(&mm->page_table_lock);
-				wait_split_huge_page(vma->anon_vma, pmd);
+				wait_split_huge_page(vma, pmd);
 			} else {
 				page = follow_trans_huge_pmd(vma, address,
 							     pmd, flags);
-- 
1.7.10.4


* [PATCHv3, RFC 22/34] thp, mm: truncate support for transparent huge page cache
  2013-04-05 11:59 ` Kirill A. Shutemov
@ 2013-04-05 11:59   ` Kirill A. Shutemov
  -1 siblings, 0 replies; 110+ messages in thread
From: Kirill A. Shutemov @ 2013-04-05 11:59 UTC (permalink / raw)
  To: Andrea Arcangeli, Andrew Morton
  Cc: Al Viro, Hugh Dickins, Wu Fengguang, Jan Kara, Mel Gorman,
	linux-mm, Andi Kleen, Matthew Wilcox, Kirill A. Shutemov,
	Hillf Danton, Dave Hansen, linux-fsdevel, linux-kernel,
	Kirill A. Shutemov

From: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>

If the starting position of the truncation is in a tail page, we have to
split the huge page first.

We also have to split if the end is within the huge page. Otherwise we
can truncate the whole huge page at once.
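
A standalone sketch of that decision (userspace; HPAGE_CACHE_NR assumed
to be 512, i.e. 4k pages and 2M huge pages):

#include <stdio.h>
#include <stdbool.h>

#define HPAGE_CACHE_NR		512UL
#define HPAGE_CACHE_INDEX_MASK	(HPAGE_CACHE_NR - 1)

static const char *action(unsigned long index, unsigned long end, bool tail)
{
	if (tail)
		return "split (truncation starts in a tail page)";
	if (index == (end & ~HPAGE_CACHE_INDEX_MASK))
		return "split (end falls inside this huge page)";
	return "truncate the whole huge page, skip its tail pages";
}

int main(void)
{
	/* a huge page covering indices [1024, 1536) */
	printf("%s\n", action(1024, 2048, false));	/* fully covered */
	printf("%s\n", action(1024, 1300, false));	/* end inside */
	printf("%s\n", action(1030, 2048, true));	/* start in a tail */
	return 0;
}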

Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
---
 mm/truncate.c |   13 +++++++++++++
 1 file changed, 13 insertions(+)

diff --git a/mm/truncate.c b/mm/truncate.c
index c75b736..0152feb 100644
--- a/mm/truncate.c
+++ b/mm/truncate.c
@@ -231,6 +231,17 @@ void truncate_inode_pages_range(struct address_space *mapping,
 			if (index > end)
 				break;
 
+			/* split page if we start from tail page */
+			if (PageTransTail(page))
+				split_huge_page(compound_trans_head(page));
+			if (PageTransHuge(page)) {
+				/* split if end is within huge page */
+				if (index == (end & ~HPAGE_CACHE_INDEX_MASK))
+					split_huge_page(page);
+				else
+					/* skip tail pages */
+					i += HPAGE_CACHE_NR - 1;
+			}
 			if (!trylock_page(page))
 				continue;
 			WARN_ON(page->index != index);
@@ -280,6 +291,8 @@ void truncate_inode_pages_range(struct address_space *mapping,
 			if (index > end)
 				break;
 
+			if (PageTransHuge(page))
+				split_huge_page(page);
 			lock_page(page);
 			WARN_ON(page->index != index);
 			wait_on_page_writeback(page);
-- 
1.7.10.4


* [PATCHv3, RFC 23/34] thp, mm: split huge page on mmap file page
  2013-04-05 11:59 ` Kirill A. Shutemov
@ 2013-04-05 11:59   ` Kirill A. Shutemov
  -1 siblings, 0 replies; 110+ messages in thread
From: Kirill A. Shutemov @ 2013-04-05 11:59 UTC (permalink / raw)
  To: Andrea Arcangeli, Andrew Morton
  Cc: Al Viro, Hugh Dickins, Wu Fengguang, Jan Kara, Mel Gorman,
	linux-mm, Andi Kleen, Matthew Wilcox, Kirill A. Shutemov,
	Hillf Danton, Dave Hansen, linux-fsdevel, linux-kernel,
	Kirill A. Shutemov

From: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>

We are not ready to mmap file-backed transparent huge pages. Let's split
them on fault attempt.

Later in the patchset we'll implement mmap() properly and this code path
will be used for fallback cases.

Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
---
 mm/filemap.c |    2 ++
 1 file changed, 2 insertions(+)

diff --git a/mm/filemap.c b/mm/filemap.c
index 3296f5c..6f0e3be 100644
--- a/mm/filemap.c
+++ b/mm/filemap.c
@@ -1683,6 +1683,8 @@ retry_find:
 			goto no_cached_page;
 	}
 
+	if (PageTransCompound(page))
+		split_huge_page(compound_trans_head(page));
 	if (!lock_page_or_retry(page, vma->vm_mm, vmf->flags)) {
 		page_cache_release(page);
 		return ret | VM_FAULT_RETRY;
-- 
1.7.10.4


* [PATCHv3, RFC 24/34] ramfs: enable transparent huge page cache
  2013-04-05 11:59 ` Kirill A. Shutemov
@ 2013-04-05 11:59   ` Kirill A. Shutemov
  -1 siblings, 0 replies; 110+ messages in thread
From: Kirill A. Shutemov @ 2013-04-05 11:59 UTC (permalink / raw)
  To: Andrea Arcangeli, Andrew Morton
  Cc: Al Viro, Hugh Dickins, Wu Fengguang, Jan Kara, Mel Gorman,
	linux-mm, Andi Kleen, Matthew Wilcox, Kirill A. Shutemov,
	Hillf Danton, Dave Hansen, linux-fsdevel, linux-kernel,
	Kirill A. Shutemov

From: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>

ramfs is the simplest fs from the page cache point of view. Let's start
transparent huge page cache enabling here.

For now we allocate only non-movable huge pages, since ramfs pages cannot
be moved yet.

Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
---
 fs/ramfs/inode.c |    6 +++++-
 1 file changed, 5 insertions(+), 1 deletion(-)

diff --git a/fs/ramfs/inode.c b/fs/ramfs/inode.c
index c24f1e1..54d69c7 100644
--- a/fs/ramfs/inode.c
+++ b/fs/ramfs/inode.c
@@ -61,7 +61,11 @@ struct inode *ramfs_get_inode(struct super_block *sb,
 		inode_init_owner(inode, dir, mode);
 		inode->i_mapping->a_ops = &ramfs_aops;
 		inode->i_mapping->backing_dev_info = &ramfs_backing_dev_info;
-		mapping_set_gfp_mask(inode->i_mapping, GFP_HIGHUSER);
+		/*
+		 * TODO: make ramfs pages movable
+		 */
+		mapping_set_gfp_mask(inode->i_mapping,
+				GFP_TRANSHUGE & ~__GFP_MOVABLE);
 		mapping_set_unevictable(inode->i_mapping);
 		inode->i_atime = inode->i_mtime = inode->i_ctime = CURRENT_TIME;
 		switch (mode & S_IFMT) {
-- 
1.7.10.4


* [PATCHv3, RFC 25/34] x86-64, mm: proper alignment mappings with hugepages
  2013-04-05 11:59 ` Kirill A. Shutemov
@ 2013-04-05 11:59   ` Kirill A. Shutemov
  -1 siblings, 0 replies; 110+ messages in thread
From: Kirill A. Shutemov @ 2013-04-05 11:59 UTC (permalink / raw)
  To: Andrea Arcangeli, Andrew Morton
  Cc: Al Viro, Hugh Dickins, Wu Fengguang, Jan Kara, Mel Gorman,
	linux-mm, Andi Kleen, Matthew Wilcox, Kirill A. Shutemov,
	Hillf Danton, Dave Hansen, linux-fsdevel, linux-kernel,
	Kirill A. Shutemov

From: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>

Make arch_get_unmapped_area() return an unmapped area aligned to the huge
page size if the file mapping can have huge pages.
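
vm_unmapped_area() keeps (addr & align_mask) equal to
(align_offset & align_mask), and align_offset is the file offset here,
so a huge-page-aligned file offset ends up at a huge-page-aligned
address. A quick standalone check of the mask value used below
(userspace, x86-64 constants assumed):

#include <stdio.h>

#define PAGE_SHIFT	12
#define PAGE_MASK	(~((1UL << PAGE_SHIFT) - 1))
#define HPAGE_SHIFT	21
#define HPAGE_MASK	(~((1UL << HPAGE_SHIFT) - 1))

int main(void)
{
	/* bits 12..20: the address may be anything page-aligned, but has
	 * to match the file offset up to the 2M boundary */
	printf("align_mask = %#lx\n", PAGE_MASK & ~HPAGE_MASK); /* 0x1ff000 */
	return 0;
}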

Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
---
 arch/x86/kernel/sys_x86_64.c |   12 ++++++++++--
 1 file changed, 10 insertions(+), 2 deletions(-)

diff --git a/arch/x86/kernel/sys_x86_64.c b/arch/x86/kernel/sys_x86_64.c
index dbded5a..d97ab40 100644
--- a/arch/x86/kernel/sys_x86_64.c
+++ b/arch/x86/kernel/sys_x86_64.c
@@ -15,6 +15,7 @@
 #include <linux/random.h>
 #include <linux/uaccess.h>
 #include <linux/elf.h>
+#include <linux/pagemap.h>
 
 #include <asm/ia32.h>
 #include <asm/syscalls.h>
@@ -34,6 +35,13 @@ static unsigned long get_align_mask(void)
 	return va_align.mask;
 }
 
+static inline unsigned long mapping_align_mask(struct address_space *mapping)
+{
+	if (mapping_can_have_hugepages(mapping))
+		return PAGE_MASK & ~HPAGE_MASK;
+	return get_align_mask();
+}
+
 unsigned long align_vdso_addr(unsigned long addr)
 {
 	unsigned long align_mask = get_align_mask();
@@ -135,7 +143,7 @@ arch_get_unmapped_area(struct file *filp, unsigned long addr,
 	info.length = len;
 	info.low_limit = begin;
 	info.high_limit = end;
-	info.align_mask = filp ? get_align_mask() : 0;
+	info.align_mask = filp ? mapping_align_mask(filp->f_mapping) : 0;
 	info.align_offset = pgoff << PAGE_SHIFT;
 	return vm_unmapped_area(&info);
 }
@@ -174,7 +182,7 @@ arch_get_unmapped_area_topdown(struct file *filp, const unsigned long addr0,
 	info.length = len;
 	info.low_limit = PAGE_SIZE;
 	info.high_limit = mm->mmap_base;
-	info.align_mask = filp ? get_align_mask() : 0;
+	info.align_mask = filp ? mapping_align_mask(filp->f_mapping) : 0;
 	info.align_offset = pgoff << PAGE_SHIFT;
 	addr = vm_unmapped_area(&info);
 	if (!(addr & ~PAGE_MASK))
-- 
1.7.10.4


* [PATCHv3, RFC 26/34] mm: add huge_fault() callback to vm_operations_struct
  2013-04-05 11:59 ` Kirill A. Shutemov
@ 2013-04-05 11:59   ` Kirill A. Shutemov
  -1 siblings, 0 replies; 110+ messages in thread
From: Kirill A. Shutemov @ 2013-04-05 11:59 UTC (permalink / raw)
  To: Andrea Arcangeli, Andrew Morton
  Cc: Al Viro, Hugh Dickins, Wu Fengguang, Jan Kara, Mel Gorman,
	linux-mm, Andi Kleen, Matthew Wilcox, Kirill A. Shutemov,
	Hillf Danton, Dave Hansen, linux-fsdevel, linux-kernel,
	Kirill A. Shutemov

From: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>

huge_fault() should try to set up a huge page for the pgoff, if possible.
A VM_FAULT_OOM return code means we need to fall back to small pages.
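
A user of the callback is expected to wire it next to the regular
->fault handler, roughly like this (a sketch only; example_fault and
example_huge_fault are hypothetical handlers, not part of this patch):

static int example_fault(struct vm_area_struct *vma, struct vm_fault *vmf);
static int example_huge_fault(struct vm_area_struct *vma, struct vm_fault *vmf);

static const struct vm_operations_struct example_vm_ops = {
	.fault		= example_fault,	/* regular 4k path */
	.huge_fault	= example_huge_fault,	/* tries to set up a huge page;
						 * VM_FAULT_OOM asks the caller
						 * to fall back to ->fault */
};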

Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
---
 include/linux/mm.h |    1 +
 1 file changed, 1 insertion(+)

diff --git a/include/linux/mm.h b/include/linux/mm.h
index 09530c7..d978de8 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -195,6 +195,7 @@ struct vm_operations_struct {
 	void (*open)(struct vm_area_struct * area);
 	void (*close)(struct vm_area_struct * area);
 	int (*fault)(struct vm_area_struct *vma, struct vm_fault *vmf);
+	int (*huge_fault)(struct vm_area_struct *vma, struct vm_fault *vmf);
 
 	/* notification that a previously read-only page is about to become
 	 * writable, if an error is returned it will cause a SIGBUS */
-- 
1.7.10.4


* [PATCHv3, RFC 27/34] thp: prepare zap_huge_pmd() to uncharge file pages
  2013-04-05 11:59 ` Kirill A. Shutemov
@ 2013-04-05 11:59   ` Kirill A. Shutemov
  -1 siblings, 0 replies; 110+ messages in thread
From: Kirill A. Shutemov @ 2013-04-05 11:59 UTC (permalink / raw)
  To: Andrea Arcangeli, Andrew Morton
  Cc: Al Viro, Hugh Dickins, Wu Fengguang, Jan Kara, Mel Gorman,
	linux-mm, Andi Kleen, Matthew Wilcox, Kirill A. Shutemov,
	Hillf Danton, Dave Hansen, linux-fsdevel, linux-kernel,
	Kirill A. Shutemov

From: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>

Uncharge pages from the correct counter.

Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
---
 mm/huge_memory.c |    4 +++-
 1 file changed, 3 insertions(+), 1 deletion(-)

diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index 7c48f58..4a1d8d7 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -1368,10 +1368,12 @@ int zap_huge_pmd(struct mmu_gather *tlb, struct vm_area_struct *vma,
 			spin_unlock(&tlb->mm->page_table_lock);
 			put_huge_zero_page();
 		} else {
+			int member;
 			page = pmd_page(orig_pmd);
 			page_remove_rmap(page);
 			VM_BUG_ON(page_mapcount(page) < 0);
-			add_mm_counter(tlb->mm, MM_ANONPAGES, -HPAGE_PMD_NR);
+			member = PageAnon(page) ? MM_ANONPAGES : MM_FILEPAGES;
+			add_mm_counter(tlb->mm, member, -HPAGE_PMD_NR);
 			VM_BUG_ON(!PageHead(page));
 			tlb->mm->nr_ptes--;
 			spin_unlock(&tlb->mm->page_table_lock);
-- 
1.7.10.4


* [PATCHv3, RFC 28/34] thp: move maybe_pmd_mkwrite() out of mk_huge_pmd()
  2013-04-05 11:59 ` Kirill A. Shutemov
@ 2013-04-05 11:59   ` Kirill A. Shutemov
  -1 siblings, 0 replies; 110+ messages in thread
From: Kirill A. Shutemov @ 2013-04-05 11:59 UTC (permalink / raw)
  To: Andrea Arcangeli, Andrew Morton
  Cc: Al Viro, Hugh Dickins, Wu Fengguang, Jan Kara, Mel Gorman,
	linux-mm, Andi Kleen, Matthew Wilcox, Kirill A. Shutemov,
	Hillf Danton, Dave Hansen, linux-fsdevel, linux-kernel,
	Kirill A. Shutemov

From: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>

It's confusing that mk_huge_pmd() has semantics different from mk_pte()
or mk_pmd().

Let's move maybe_pmd_mkwrite() out of mk_huge_pmd() and adjust its
prototype to match mk_pte().

Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
---
 mm/huge_memory.c |   14 ++++++++------
 1 file changed, 8 insertions(+), 6 deletions(-)

diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index 4a1d8d7..0cf2e79 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -691,11 +691,10 @@ pmd_t maybe_pmd_mkwrite(pmd_t pmd, struct vm_area_struct *vma)
 	return pmd;
 }
 
-static inline pmd_t mk_huge_pmd(struct page *page, struct vm_area_struct *vma)
+static inline pmd_t mk_huge_pmd(struct page *page, pgprot_t prot)
 {
 	pmd_t entry;
-	entry = mk_pmd(page, vma->vm_page_prot);
-	entry = maybe_pmd_mkwrite(pmd_mkdirty(entry), vma);
+	entry = mk_pmd(page, prot);
 	entry = pmd_mkhuge(entry);
 	return entry;
 }
@@ -723,7 +722,8 @@ static int __do_huge_pmd_anonymous_page(struct mm_struct *mm,
 		pte_free(mm, pgtable);
 	} else {
 		pmd_t entry;
-		entry = mk_huge_pmd(page, vma);
+		entry = mk_huge_pmd(page, vma->vm_page_prot);
+		entry = maybe_pmd_mkwrite(pmd_mkdirty(entry), vma);
 		/*
 		 * The spinlocking to take the lru_lock inside
 		 * page_add_new_anon_rmap() acts as a full memory
@@ -1212,7 +1212,8 @@ alloc:
 		goto out_mn;
 	} else {
 		pmd_t entry;
-		entry = mk_huge_pmd(new_page, vma);
+		entry = mk_huge_pmd(new_page, vma->vm_page_prot);
+		entry = maybe_pmd_mkwrite(pmd_mkdirty(entry), vma);
 		pmdp_clear_flush(vma, haddr, pmd);
 		page_add_new_anon_rmap(new_page, vma, haddr);
 		set_pmd_at(mm, haddr, pmd, entry);
@@ -2386,7 +2387,8 @@ static void collapse_huge_page(struct mm_struct *mm,
 	__SetPageUptodate(new_page);
 	pgtable = pmd_pgtable(_pmd);
 
-	_pmd = mk_huge_pmd(new_page, vma);
+	_pmd = mk_huge_pmd(new_page, vma->vm_page_prot);
+	_pmd = maybe_pmd_mkwrite(pmd_mkdirty(_pmd), vma);
 
 	/*
 	 * spin_lock() below is not the equivalent of smp_wmb(), so
-- 
1.7.10.4


* [PATCHv3, RFC 29/34] thp, mm: basic huge_fault implementation for generic_file_vm_ops
  2013-04-05 11:59 ` Kirill A. Shutemov
@ 2013-04-05 11:59   ` Kirill A. Shutemov
  -1 siblings, 0 replies; 110+ messages in thread
From: Kirill A. Shutemov @ 2013-04-05 11:59 UTC (permalink / raw)
  To: Andrea Arcangeli, Andrew Morton
  Cc: Al Viro, Hugh Dickins, Wu Fengguang, Jan Kara, Mel Gorman,
	linux-mm, Andi Kleen, Matthew Wilcox, Kirill A. Shutemov,
	Hillf Danton, Dave Hansen, linux-fsdevel, linux-kernel,
	Kirill A. Shutemov

From: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>

It provides enough functionality for simple cases like ramfs. It will
need to be extended later.
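
As an illustration (not part of this patch), a filesystem that maps files
through generic_file_mmap() picks this up automatically, since
generic_file_mmap() installs generic_file_vm_ops. A hypothetical ->mmap
method would look like:

	/* hypothetical ->mmap of a simple in-memory filesystem */
	static int example_file_mmap(struct file *file, struct vm_area_struct *vma)
	{
		/* installs generic_file_vm_ops, which now carries .huge_fault */
		return generic_file_mmap(file, vma);
	}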

Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
---
 mm/filemap.c |   76 ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
 1 file changed, 76 insertions(+)

diff --git a/mm/filemap.c b/mm/filemap.c
index 6f0e3be..a170a40 100644
--- a/mm/filemap.c
+++ b/mm/filemap.c
@@ -1768,6 +1768,81 @@ page_not_uptodate:
 }
 EXPORT_SYMBOL(filemap_fault);
 
+#ifdef CONFIG_TRANSPARENT_HUGEPAGE
+static int filemap_huge_fault(struct vm_area_struct *vma, struct vm_fault *vmf)
+{
+	struct file *file = vma->vm_file;
+	struct address_space *mapping = file->f_mapping;
+	struct inode *inode = mapping->host;
+	pgoff_t size, offset = vmf->pgoff;
+	unsigned long address = (unsigned long) vmf->virtual_address;
+	struct page *page;
+	int ret = 0;
+
+	BUG_ON(((address >> PAGE_SHIFT) & HPAGE_CACHE_INDEX_MASK) !=
+			(offset & HPAGE_CACHE_INDEX_MASK));
+
+retry:
+	page = find_get_page(mapping, offset);
+	if (!page) {
+		gfp_t gfp_mask = mapping_gfp_mask(mapping) | __GFP_COLD;
+		page = alloc_pages_vma(gfp_mask, HPAGE_PMD_ORDER,
+				vma, address, 0);
+		if (!page) {
+			count_vm_event(THP_FAULT_FALLBACK);
+			return VM_FAULT_OOM;
+		}
+		count_vm_event(THP_FAULT_ALLOC);
+		ret = add_to_page_cache_lru(page, mapping, offset, GFP_KERNEL);
+		if (ret == 0)
+			ret = mapping->a_ops->readpage(file, page);
+		else if (ret == -EEXIST)
+			ret = 0; /* losing race to add is OK */
+		page_cache_release(page);
+		if (!ret || ret == AOP_TRUNCATED_PAGE)
+			goto retry;
+		return ret;
+	}
+
+	if (!lock_page_or_retry(page, vma->vm_mm, vmf->flags)) {
+		page_cache_release(page);
+		return ret | VM_FAULT_RETRY;
+	}
+
+	/* Did it get truncated? */
+	if (unlikely(page->mapping != mapping)) {
+		unlock_page(page);
+		put_page(page);
+		goto retry;
+	}
+	VM_BUG_ON(page->index != offset);
+	VM_BUG_ON(!PageUptodate(page));
+
+	if (!PageTransHuge(page)) {
+		unlock_page(page);
+		put_page(page);
+		/* Ask fallback to small pages */
+		return VM_FAULT_OOM;
+	}
+
+	/*
+	 * Found the page and have a reference on it.
+	 * We must recheck i_size under page lock.
+	 */
+	size = (i_size_read(inode) + PAGE_CACHE_SIZE - 1) >> PAGE_CACHE_SHIFT;
+	if (unlikely(offset >= size)) {
+		unlock_page(page);
+		page_cache_release(page);
+		return VM_FAULT_SIGBUS;
+	}
+
+	vmf->page = page;
+	return ret | VM_FAULT_LOCKED;
+}
+#else
+#define filemap_huge_fault NULL
+#endif
+
 int filemap_page_mkwrite(struct vm_area_struct *vma, struct vm_fault *vmf)
 {
 	struct page *page = vmf->page;
@@ -1797,6 +1872,7 @@ EXPORT_SYMBOL(filemap_page_mkwrite);
 
 const struct vm_operations_struct generic_file_vm_ops = {
 	.fault		= filemap_fault,
+	.huge_fault	= filemap_huge_fault,
 	.page_mkwrite	= filemap_page_mkwrite,
 	.remap_pages	= generic_file_remap_pages,
 };
-- 
1.7.10.4


^ permalink raw reply related	[flat|nested] 110+ messages in thread

* [PATCHv3, RFC 30/34] thp: extract fallback path from do_huge_pmd_anonymous_page() to a function
  2013-04-05 11:59 ` Kirill A. Shutemov
@ 2013-04-05 11:59   ` Kirill A. Shutemov
  -1 siblings, 0 replies; 110+ messages in thread
From: Kirill A. Shutemov @ 2013-04-05 11:59 UTC (permalink / raw)
  To: Andrea Arcangeli, Andrew Morton
  Cc: Al Viro, Hugh Dickins, Wu Fengguang, Jan Kara, Mel Gorman,
	linux-mm, Andi Kleen, Matthew Wilcox, Kirill A. Shutemov,
	Hillf Danton, Dave Hansen, linux-fsdevel, linux-kernel,
	Kirill A. Shutemov

From: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>

The same fallback path will be reused by non-anonymous pages, so let's
extract it into a separate function.

Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
---
 mm/huge_memory.c |  112 ++++++++++++++++++++++++++++--------------------------
 1 file changed, 59 insertions(+), 53 deletions(-)

diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index 0cf2e79..c1d5f2b 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -779,64 +779,12 @@ static bool set_huge_zero_page(pgtable_t pgtable, struct mm_struct *mm,
 	return true;
 }
 
-int do_huge_pmd_anonymous_page(struct mm_struct *mm, struct vm_area_struct *vma,
+static int do_fallback(struct mm_struct *mm, struct vm_area_struct *vma,
 			       unsigned long address, pmd_t *pmd,
 			       unsigned int flags)
 {
-	struct page *page;
-	unsigned long haddr = address & HPAGE_PMD_MASK;
 	pte_t *pte;
 
-	if (haddr >= vma->vm_start && haddr + HPAGE_PMD_SIZE <= vma->vm_end) {
-		if (unlikely(anon_vma_prepare(vma)))
-			return VM_FAULT_OOM;
-		if (unlikely(khugepaged_enter(vma)))
-			return VM_FAULT_OOM;
-		if (!(flags & FAULT_FLAG_WRITE) &&
-				transparent_hugepage_use_zero_page()) {
-			pgtable_t pgtable;
-			unsigned long zero_pfn;
-			bool set;
-			pgtable = pte_alloc_one(mm, haddr);
-			if (unlikely(!pgtable))
-				return VM_FAULT_OOM;
-			zero_pfn = get_huge_zero_page();
-			if (unlikely(!zero_pfn)) {
-				pte_free(mm, pgtable);
-				count_vm_event(THP_FAULT_FALLBACK);
-				goto out;
-			}
-			spin_lock(&mm->page_table_lock);
-			set = set_huge_zero_page(pgtable, mm, vma, haddr, pmd,
-					zero_pfn);
-			spin_unlock(&mm->page_table_lock);
-			if (!set) {
-				pte_free(mm, pgtable);
-				put_huge_zero_page();
-			}
-			return 0;
-		}
-		page = alloc_hugepage_vma(transparent_hugepage_defrag(vma),
-					  vma, haddr, numa_node_id(), 0);
-		if (unlikely(!page)) {
-			count_vm_event(THP_FAULT_FALLBACK);
-			goto out;
-		}
-		count_vm_event(THP_FAULT_ALLOC);
-		if (unlikely(mem_cgroup_newpage_charge(page, mm, GFP_KERNEL))) {
-			put_page(page);
-			goto out;
-		}
-		if (unlikely(__do_huge_pmd_anonymous_page(mm, vma, haddr, pmd,
-							  page))) {
-			mem_cgroup_uncharge_page(page);
-			put_page(page);
-			goto out;
-		}
-
-		return 0;
-	}
-out:
 	/*
 	 * Use __pte_alloc instead of pte_alloc_map, because we can't
 	 * run pte_offset_map on the pmd, if an huge pmd could
@@ -858,6 +806,64 @@ out:
 	return handle_pte_fault(mm, vma, address, pte, pmd, flags);
 }
 
+int do_huge_pmd_anonymous_page(struct mm_struct *mm, struct vm_area_struct *vma,
+			       unsigned long address, pmd_t *pmd,
+			       unsigned int flags)
+{
+	struct page *page;
+	unsigned long haddr = address & HPAGE_PMD_MASK;
+
+	if (haddr < vma->vm_start || haddr + HPAGE_PMD_SIZE > vma->vm_end)
+		return do_fallback(mm, vma, address, pmd, flags);
+	if (unlikely(anon_vma_prepare(vma)))
+		return VM_FAULT_OOM;
+	if (unlikely(khugepaged_enter(vma)))
+		return VM_FAULT_OOM;
+	if (!(flags & FAULT_FLAG_WRITE) &&
+			transparent_hugepage_use_zero_page()) {
+		pgtable_t pgtable;
+		unsigned long zero_pfn;
+		bool set;
+		pgtable = pte_alloc_one(mm, haddr);
+		if (unlikely(!pgtable))
+			return VM_FAULT_OOM;
+		zero_pfn = get_huge_zero_page();
+		if (unlikely(!zero_pfn)) {
+			pte_free(mm, pgtable);
+			count_vm_event(THP_FAULT_FALLBACK);
+			return do_fallback(mm, vma, address, pmd, flags);
+		}
+		spin_lock(&mm->page_table_lock);
+		set = set_huge_zero_page(pgtable, mm, vma, haddr, pmd,
+				zero_pfn);
+		spin_unlock(&mm->page_table_lock);
+		if (!set) {
+			pte_free(mm, pgtable);
+			put_huge_zero_page();
+		}
+		return 0;
+	}
+	page = alloc_hugepage_vma(transparent_hugepage_defrag(vma),
+			vma, haddr, numa_node_id(), 0);
+	if (unlikely(!page)) {
+		count_vm_event(THP_FAULT_FALLBACK);
+		return do_fallback(mm, vma, address, pmd, flags);
+	}
+	count_vm_event(THP_FAULT_ALLOC);
+	if (unlikely(mem_cgroup_newpage_charge(page, mm, GFP_KERNEL))) {
+		put_page(page);
+		return do_fallback(mm, vma, address, pmd, flags);
+	}
+	if (unlikely(__do_huge_pmd_anonymous_page(mm, vma, haddr, pmd,
+					page))) {
+		mem_cgroup_uncharge_page(page);
+		put_page(page);
+		return do_fallback(mm, vma, address, pmd, flags);
+	}
+
+	return 0;
+}
+
 int copy_huge_pmd(struct mm_struct *dst_mm, struct mm_struct *src_mm,
 		  pmd_t *dst_pmd, pmd_t *src_pmd, unsigned long addr,
 		  struct vm_area_struct *vma)
-- 
1.7.10.4


^ permalink raw reply related	[flat|nested] 110+ messages in thread

* [PATCHv3, RFC 31/34] thp: initial implementation of do_huge_linear_fault()
  2013-04-05 11:59 ` Kirill A. Shutemov
@ 2013-04-05 11:59   ` Kirill A. Shutemov
  -1 siblings, 0 replies; 110+ messages in thread
From: Kirill A. Shutemov @ 2013-04-05 11:59 UTC (permalink / raw)
  To: Andrea Arcangeli, Andrew Morton
  Cc: Al Viro, Hugh Dickins, Wu Fengguang, Jan Kara, Mel Gorman,
	linux-mm, Andi Kleen, Matthew Wilcox, Kirill A. Shutemov,
	Hillf Danton, Dave Hansen, linux-fsdevel, linux-kernel,
	Kirill A. Shutemov

From: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>

The function tries to create a new page mapping using huge pages. It
is only called for pages that are not yet mapped.

As usual with THP, we fall back to small pages if we fail to allocate a
huge page.
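
To summarize the contract with ->huge_fault (condensed from the code
below): the faulting address and page offset are aligned to the huge page
boundary before the callback is invoked, and the returned vmf.page is
locked by the caller if the callback did not already lock it:

	unsigned long haddr = address & HPAGE_PMD_MASK;
	struct vm_fault vmf = {
		.virtual_address = (void __user *)haddr,
		.pgoff = ((haddr - vma->vm_start) >> PAGE_SHIFT) + vma->vm_pgoff,
		.flags = flags,
		.page = NULL,
	};

	ret = vma->vm_ops->huge_fault(vma, &vmf);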

Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
---
 include/linux/huge_mm.h |    3 +
 mm/huge_memory.c        |  196 +++++++++++++++++++++++++++++++++++++++++++++++
 2 files changed, 199 insertions(+)

diff --git a/include/linux/huge_mm.h b/include/linux/huge_mm.h
index b53e295..aa52c48 100644
--- a/include/linux/huge_mm.h
+++ b/include/linux/huge_mm.h
@@ -5,6 +5,9 @@ extern int do_huge_pmd_anonymous_page(struct mm_struct *mm,
 				      struct vm_area_struct *vma,
 				      unsigned long address, pmd_t *pmd,
 				      unsigned int flags);
+extern int do_huge_linear_fault(struct mm_struct *mm,
+		struct vm_area_struct *vma, unsigned long address, pmd_t *pmd,
+		unsigned int flags);
 extern int copy_huge_pmd(struct mm_struct *dst_mm, struct mm_struct *src_mm,
 			 pmd_t *dst_pmd, pmd_t *src_pmd, unsigned long addr,
 			 struct vm_area_struct *vma);
diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index c1d5f2b..ed4389b 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -21,6 +21,7 @@
 #include <linux/pagemap.h>
 #include <linux/migrate.h>
 #include <linux/hashtable.h>
+#include <linux/writeback.h>
 
 #include <asm/tlb.h>
 #include <asm/pgalloc.h>
@@ -864,6 +865,201 @@ int do_huge_pmd_anonymous_page(struct mm_struct *mm, struct vm_area_struct *vma,
 	return 0;
 }
 
+int do_huge_linear_fault(struct mm_struct *mm, struct vm_area_struct *vma,
+		unsigned long address, pmd_t *pmd, unsigned int flags)
+{
+	unsigned long haddr = address & HPAGE_PMD_MASK;
+	struct page *cow_page, *page, *dirty_page = NULL;
+	bool anon = false, fallback = false, page_mkwrite = false;
+	pgtable_t pgtable = NULL;
+	struct vm_fault vmf;
+	int ret;
+
+	/* Fallback if vm_pgoff and vm_start are not suitable */
+	if (((vma->vm_start >> PAGE_SHIFT) & HPAGE_CACHE_INDEX_MASK) !=
+			(vma->vm_pgoff & HPAGE_CACHE_INDEX_MASK))
+		return do_fallback(mm, vma, address, pmd, flags);
+
+	if (haddr < vma->vm_start || haddr + HPAGE_PMD_SIZE > vma->vm_end)
+		return do_fallback(mm, vma, address, pmd, flags);
+
+	if (unlikely(khugepaged_enter(vma)))
+		return VM_FAULT_OOM;
+
+	/*
+	 * If we do COW later, allocate page before taking lock_page()
+	 * on the file cache page. This will reduce lock holding time.
+	 */
+	if ((flags & FAULT_FLAG_WRITE) && !(vma->vm_flags & VM_SHARED)) {
+		if (unlikely(anon_vma_prepare(vma)))
+			return VM_FAULT_OOM;
+
+		cow_page = alloc_hugepage_vma(transparent_hugepage_defrag(vma),
+				vma, haddr, numa_node_id(), 0);
+		if (!cow_page) {
+			count_vm_event(THP_FAULT_FALLBACK);
+			return do_fallback(mm, vma, address, pmd, flags);
+		}
+		count_vm_event(THP_FAULT_ALLOC);
+		if (mem_cgroup_newpage_charge(cow_page, mm, GFP_KERNEL)) {
+			page_cache_release(cow_page);
+			return do_fallback(mm, vma, address, pmd, flags);
+		}
+	} else
+		cow_page = NULL;
+
+	pgtable = pte_alloc_one(mm, haddr);
+	if (unlikely(!pgtable)) {
+		ret = VM_FAULT_OOM;
+		goto uncharge_out;
+	}
+
+	vmf.virtual_address = (void __user *)haddr;
+	vmf.pgoff = ((haddr - vma->vm_start) >> PAGE_SHIFT) + vma->vm_pgoff;
+	vmf.flags = flags;
+	vmf.page = NULL;
+
+	ret = vma->vm_ops->huge_fault(vma, &vmf);
+	if (unlikely(ret & VM_FAULT_OOM))
+		goto uncharge_out_fallback;
+	if (unlikely(ret & (VM_FAULT_ERROR | VM_FAULT_NOPAGE | VM_FAULT_RETRY)))
+		goto uncharge_out;
+
+	if (unlikely(PageHWPoison(vmf.page))) {
+		if (ret & VM_FAULT_LOCKED)
+			unlock_page(vmf.page);
+		ret = VM_FAULT_HWPOISON;
+		goto uncharge_out;
+	}
+
+	/*
+	 * For consistency in subsequent calls, make the faulted page always
+	 * locked.
+	 */
+	if (unlikely(!(ret & VM_FAULT_LOCKED)))
+		lock_page(vmf.page);
+	else
+		VM_BUG_ON(!PageLocked(vmf.page));
+
+	/*
+	 * Should we do an early C-O-W break?
+	 */
+	page = vmf.page;
+	if (flags & FAULT_FLAG_WRITE) {
+		if (!(vma->vm_flags & VM_SHARED)) {
+			page = cow_page;
+			anon = true;
+			copy_user_huge_page(page, vmf.page, haddr, vma,
+					HPAGE_PMD_NR);
+			__SetPageUptodate(page);
+		} else if (vma->vm_ops->page_mkwrite) {
+			int tmp;
+
+			unlock_page(page);
+			vmf.flags = FAULT_FLAG_WRITE | FAULT_FLAG_MKWRITE;
+			tmp = vma->vm_ops->page_mkwrite(vma, &vmf);
+			if (unlikely(tmp &
+				  (VM_FAULT_ERROR | VM_FAULT_NOPAGE))) {
+				ret = tmp;
+				goto unwritable_page;
+			}
+			if (unlikely(!(tmp & VM_FAULT_LOCKED))) {
+				lock_page(page);
+				if (!page->mapping) {
+					ret = 0; /* retry the fault */
+					unlock_page(page);
+					goto unwritable_page;
+				}
+			} else
+				VM_BUG_ON(!PageLocked(page));
+			page_mkwrite = true;
+		}
+	}
+
+	VM_BUG_ON(!PageCompound(page));
+
+	spin_lock(&mm->page_table_lock);
+	if (likely(pmd_none(*pmd))) {
+		pmd_t entry;
+
+		flush_icache_page(vma, page);
+		entry = mk_huge_pmd(page, vma->vm_page_prot);
+		if (flags & FAULT_FLAG_WRITE)
+			entry = maybe_pmd_mkwrite(pmd_mkdirty(entry), vma);
+		if (anon) {
+			add_mm_counter(mm, MM_ANONPAGES, HPAGE_PMD_NR);
+			page_add_new_anon_rmap(page, vma, haddr);
+		} else {
+			add_mm_counter(mm, MM_FILEPAGES, HPAGE_PMD_NR);
+			page_add_file_rmap(page);
+			if (flags & FAULT_FLAG_WRITE) {
+				dirty_page = page;
+				get_page(dirty_page);
+			}
+		}
+		set_pmd_at(mm, haddr, pmd, entry);
+		pgtable_trans_huge_deposit(mm, pgtable);
+		mm->nr_ptes++;
+
+		/* no need to invalidate: a not-present page won't be cached */
+		update_mmu_cache_pmd(vma, address, pmd);
+	} else {
+		if (cow_page)
+			mem_cgroup_uncharge_page(cow_page);
+		if (anon)
+			page_cache_release(page);
+		else
+			anon = true; /* no anon but release faulted_page */
+	}
+	spin_unlock(&mm->page_table_lock);
+
+	if (dirty_page) {
+		struct address_space *mapping = page->mapping;
+		bool dirtied = false;
+
+		if (set_page_dirty(dirty_page))
+			dirtied = true;
+		unlock_page(dirty_page);
+		put_page(dirty_page);
+		if ((dirtied || page_mkwrite) && mapping) {
+			/*
+			 * Some device drivers do not set page.mapping but still
+			 * dirty their pages
+			 */
+			balance_dirty_pages_ratelimited(mapping);
+		}
+
+		/* file_update_time outside page_lock */
+		if (vma->vm_file && !page_mkwrite)
+			file_update_time(vma->vm_file);
+	} else {
+		unlock_page(vmf.page);
+		if (anon)
+			page_cache_release(vmf.page);
+	}
+
+	return ret;
+
+unwritable_page:
+	pte_free(mm, pgtable);
+	page_cache_release(page);
+	return ret;
+uncharge_out_fallback:
+	fallback = true;
+uncharge_out:
+	if (pgtable)
+		pte_free(mm, pgtable);
+	if (cow_page) {
+		mem_cgroup_uncharge_page(cow_page);
+		page_cache_release(cow_page);
+	}
+
+	if (fallback)
+		return do_fallback(mm, vma, address, pmd, flags);
+	else
+		return ret;
+}
+
 int copy_huge_pmd(struct mm_struct *dst_mm, struct mm_struct *src_mm,
 		  pmd_t *dst_pmd, pmd_t *src_pmd, unsigned long addr,
 		  struct vm_area_struct *vma)
-- 
1.7.10.4


^ permalink raw reply related	[flat|nested] 110+ messages in thread

* [PATCHv3, RFC 32/34] thp: handle write-protect exception to file-backed huge pages
  2013-04-05 11:59 ` Kirill A. Shutemov
@ 2013-04-05 11:59   ` Kirill A. Shutemov
  -1 siblings, 0 replies; 110+ messages in thread
From: Kirill A. Shutemov @ 2013-04-05 11:59 UTC (permalink / raw)
  To: Andrea Arcangeli, Andrew Morton
  Cc: Al Viro, Hugh Dickins, Wu Fengguang, Jan Kara, Mel Gorman,
	linux-mm, Andi Kleen, Matthew Wilcox, Kirill A. Shutemov,
	Hillf Danton, Dave Hansen, linux-fsdevel, linux-kernel,
	Kirill A. Shutemov

From: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>

Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
---
 mm/huge_memory.c |   69 ++++++++++++++++++++++++++++++++++++++++++++++++++++--
 1 file changed, 67 insertions(+), 2 deletions(-)

diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index ed4389b..6dde87f 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -1339,7 +1339,6 @@ int do_huge_pmd_wp_page(struct mm_struct *mm, struct vm_area_struct *vma,
 	unsigned long mmun_start;	/* For mmu_notifiers */
 	unsigned long mmun_end;		/* For mmu_notifiers */
 
-	VM_BUG_ON(!vma->anon_vma);
 	haddr = address & HPAGE_PMD_MASK;
 	if (is_huge_zero_pmd(orig_pmd))
 		goto alloc;
@@ -1349,7 +1348,7 @@ int do_huge_pmd_wp_page(struct mm_struct *mm, struct vm_area_struct *vma,
 
 	page = pmd_page(orig_pmd);
 	VM_BUG_ON(!PageCompound(page) || !PageHead(page));
-	if (page_mapcount(page) == 1) {
+	if (PageAnon(page) && page_mapcount(page) == 1) {
 		pmd_t entry;
 		entry = pmd_mkyoung(orig_pmd);
 		entry = maybe_pmd_mkwrite(pmd_mkdirty(entry), vma);
@@ -1357,10 +1356,72 @@ int do_huge_pmd_wp_page(struct mm_struct *mm, struct vm_area_struct *vma,
 			update_mmu_cache_pmd(vma, address, pmd);
 		ret |= VM_FAULT_WRITE;
 		goto out_unlock;
+	} else if ((vma->vm_flags & (VM_WRITE|VM_SHARED)) ==
+			(VM_WRITE|VM_SHARED)) {
+		struct vm_fault vmf;
+		pmd_t entry;
+		struct address_space *mapping;
+
+		/* not yet impemented */
+		VM_BUG_ON(!vma->vm_ops || !vma->vm_ops->page_mkwrite);
+
+		vmf.virtual_address = (void __user *)haddr;
+		vmf.pgoff = page->index;
+		vmf.flags = FAULT_FLAG_WRITE|FAULT_FLAG_MKWRITE;
+		vmf.page = page;
+
+		page_cache_get(page);
+		spin_unlock(&mm->page_table_lock);
+
+		ret = vma->vm_ops->page_mkwrite(vma, &vmf);
+		if (unlikely(ret & (VM_FAULT_ERROR | VM_FAULT_NOPAGE))) {
+			page_cache_release(page);
+			goto out;
+		}
+		if (unlikely(!(ret & VM_FAULT_LOCKED))) {
+			lock_page(page);
+			if (!page->mapping) {
+				ret = 0; /* retry */
+				unlock_page(page);
+				page_cache_release(page);
+				goto out;
+			}
+		} else
+			VM_BUG_ON(!PageLocked(page));
+		spin_lock(&mm->page_table_lock);
+		if (unlikely(!pmd_same(*pmd, orig_pmd))) {
+			unlock_page(page);
+			page_cache_release(page);
+			goto out_unlock;
+		}
+
+		flush_cache_page(vma, address, pmd_pfn(orig_pmd));
+		entry = pmd_mkyoung(orig_pmd);
+		entry = maybe_pmd_mkwrite(pmd_mkdirty(entry), vma);
+		if (pmdp_set_access_flags(vma, haddr, pmd, entry,  1))
+			update_mmu_cache_pmd(vma, address, pmd);
+		ret = VM_FAULT_WRITE;
+		spin_unlock(&mm->page_table_lock);
+
+		mapping = page->mapping;
+		set_page_dirty(page);
+		unlock_page(page);
+		page_cache_release(page);
+		if (mapping)	{
+			/*
+			 * Some device drivers do not set page.mapping
+			 * but still dirty their pages
+			 */
+			balance_dirty_pages_ratelimited(mapping);
+		}
+		return ret;
 	}
 	get_page(page);
 	spin_unlock(&mm->page_table_lock);
 alloc:
+	if (unlikely(anon_vma_prepare(vma)))
+		return VM_FAULT_OOM;
+
 	if (transparent_hugepage_enabled(vma) &&
 	    !transparent_hugepage_debug_cow())
 		new_page = alloc_hugepage_vma(transparent_hugepage_defrag(vma),
@@ -1424,6 +1485,10 @@ alloc:
 			add_mm_counter(mm, MM_ANONPAGES, HPAGE_PMD_NR);
 			put_huge_zero_page();
 		} else {
+			if (!PageAnon(page)) {
+				add_mm_counter(mm, MM_FILEPAGES, -HPAGE_PMD_NR);
+				add_mm_counter(mm, MM_ANONPAGES, HPAGE_PMD_NR);
+			}
 			VM_BUG_ON(!PageHead(page));
 			page_remove_rmap(page);
 			put_page(page);
-- 
1.7.10.4


^ permalink raw reply related	[flat|nested] 110+ messages in thread

* [PATCHv3, RFC 33/34] thp: call __vma_adjust_trans_huge() for file-backed VMA
  2013-04-05 11:59 ` Kirill A. Shutemov
@ 2013-04-05 11:59   ` Kirill A. Shutemov
  -1 siblings, 0 replies; 110+ messages in thread
From: Kirill A. Shutemov @ 2013-04-05 11:59 UTC (permalink / raw)
  To: Andrea Arcangeli, Andrew Morton
  Cc: Al Viro, Hugh Dickins, Wu Fengguang, Jan Kara, Mel Gorman,
	linux-mm, Andi Kleen, Matthew Wilcox, Kirill A. Shutemov,
	Hillf Danton, Dave Hansen, linux-fsdevel, linux-kernel,
	Kirill A. Shutemov

From: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>

Since we're going to have huge pages in the page cache, we need to call
__vma_adjust_trans_huge() for file-backed VMAs, which can potentially
contain huge pages.

For now we call it for all VMAs with vm_ops->huge_fault defined.

Probably later we will need to introduce a flag to indicate that the VMA
has huge pages.

Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
---
 include/linux/huge_mm.h |    6 +++---
 1 file changed, 3 insertions(+), 3 deletions(-)

diff --git a/include/linux/huge_mm.h b/include/linux/huge_mm.h
index aa52c48..c6e3aef 100644
--- a/include/linux/huge_mm.h
+++ b/include/linux/huge_mm.h
@@ -161,9 +161,9 @@ static inline void vma_adjust_trans_huge(struct vm_area_struct *vma,
 					 unsigned long end,
 					 long adjust_next)
 {
-	if (!vma->anon_vma || vma->vm_ops)
-		return;
-	__vma_adjust_trans_huge(vma, start, end, adjust_next);
+	if ((vma->anon_vma && !vma->vm_ops) ||
+			(vma->vm_ops && vma->vm_ops->huge_fault))
+		__vma_adjust_trans_huge(vma, start, end, adjust_next);
 }
 static inline int hpage_nr_pages(struct page *page)
 {
-- 
1.7.10.4


^ permalink raw reply related	[flat|nested] 110+ messages in thread

* [PATCHv3, RFC 34/34] thp: map file-backed huge pages on fault
  2013-04-05 11:59 ` Kirill A. Shutemov
@ 2013-04-05 11:59   ` Kirill A. Shutemov
  -1 siblings, 0 replies; 110+ messages in thread
From: Kirill A. Shutemov @ 2013-04-05 11:59 UTC (permalink / raw)
  To: Andrea Arcangeli, Andrew Morton
  Cc: Al Viro, Hugh Dickins, Wu Fengguang, Jan Kara, Mel Gorman,
	linux-mm, Andi Kleen, Matthew Wilcox, Kirill A. Shutemov,
	Hillf Danton, Dave Hansen, linux-fsdevel, linux-kernel,
	Kirill A. Shutemov

From: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>

Looks like all the pieces are in place: we can map file-backed huge
pages now.

Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
---
 include/linux/huge_mm.h |    4 +++-
 mm/memory.c             |    1 +
 2 files changed, 4 insertions(+), 1 deletion(-)

diff --git a/include/linux/huge_mm.h b/include/linux/huge_mm.h
index c6e3aef..c175c78 100644
--- a/include/linux/huge_mm.h
+++ b/include/linux/huge_mm.h
@@ -80,7 +80,9 @@ extern bool is_vma_temporary_stack(struct vm_area_struct *vma);
 	   (1<<TRANSPARENT_HUGEPAGE_REQ_MADV_FLAG) &&			\
 	   ((__vma)->vm_flags & VM_HUGEPAGE))) &&			\
 	 !((__vma)->vm_flags & VM_NOHUGEPAGE) &&			\
-	 !is_vma_temporary_stack(__vma))
+	 !is_vma_temporary_stack(__vma) &&				\
+	 (!(__vma)->vm_ops ||						\
+		  mapping_can_have_hugepages((__vma)->vm_file->f_mapping)))
 #define transparent_hugepage_defrag(__vma)				\
 	((transparent_hugepage_flags &					\
 	  (1<<TRANSPARENT_HUGEPAGE_DEFRAG_FLAG)) ||			\
diff --git a/mm/memory.c b/mm/memory.c
index 2895f0e..e40965f 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -3738,6 +3738,7 @@ retry:
 		if (!vma->vm_ops)
 			return do_huge_pmd_anonymous_page(mm, vma, address,
 							  pmd, flags);
+		return do_huge_linear_fault(mm, vma, address, pmd, flags);
 	} else {
 		pmd_t orig_pmd = *pmd;
 		int ret;
-- 
1.7.10.4


^ permalink raw reply related	[flat|nested] 110+ messages in thread

* Re: [PATCHv3, RFC 00/34] Transparent huge page cache
  2013-04-05 11:59 ` Kirill A. Shutemov
@ 2013-04-07  0:40   ` Ric Mason
  -1 siblings, 0 replies; 110+ messages in thread
From: Ric Mason @ 2013-04-07  0:40 UTC (permalink / raw)
  To: Kirill A. Shutemov
  Cc: Andrea Arcangeli, Andrew Morton, Al Viro, Hugh Dickins,
	Wu Fengguang, Jan Kara, Mel Gorman, linux-mm, Andi Kleen,
	Matthew Wilcox, Kirill A. Shutemov, Hillf Danton, Dave Hansen,
	linux-fsdevel, linux-kernel

Hi Kirill,
On 04/05/2013 07:59 PM, Kirill A. Shutemov wrote:
> From: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
>
> Here's third RFC. Thanks everybody for feedback.

Could you answer my questions from your version two?

>
> The patchset is pretty big already and I want to stop generate new
> features to keep it reviewable. Next I'll concentrate on benchmarking and
> tuning.
>
> Therefore some features will be outside initial transparent huge page
> cache implementation:
>   - page collapsing;
>   - migration;
>   - tmpfs/shmem;
>
> There are few features which are not implemented and potentially can block
> upstreaming:
>
> 1. Currently we allocate 2M page even if we create only 1 byte file on
> ramfs. I don't think it's a problem by itself. With anon thp pages we also
> try to allocate huge pages whenever possible.
> The problem is that ramfs pages are unevictable and we can't just split
> and pushed them in swap as with anon thp. We (at some point) have to have
> mechanism to split last page of the file under memory pressure to reclaim
> some memory.
>
> 2. We don't have knobs for disabling transparent huge page cache per-mount
> or per-file. Should we have mount option and fadivse flags as part of
> initial implementation?
>
> Any thoughts?
>
> The patchset is also on git:
>
> git://git.kernel.org/pub/scm/linux/kernel/git/kas/linux.git thp/pagecache
>
> v3:
>   - set RADIX_TREE_PRELOAD_NR to 512 only if we build with THP;
>   - rewrite lru_add_page_tail() to address few bags;
>   - memcg accounting;
>   - represent file thp pages in meminfo and friends;
>   - dump page order in filemap trace;
>   - add missed flush_dcache_page() in zero_huge_user_segment;
>   - random cleanups based on feedback.
> v2:
>   - mmap();
>   - fix add_to_page_cache_locked() and delete_from_page_cache();
>   - introduce mapping_can_have_hugepages();
>   - call split_huge_page() only for head page in filemap_fault();
>   - wait_split_huge_page(): serialize over i_mmap_mutex too;
>   - lru_add_page_tail: avoid PageUnevictable on active/inactive lru lists;
>   - fix off-by-one in zero_huge_user_segment();
>   - THP_WRITE_ALLOC/THP_WRITE_FAILED counters;
>
> Kirill A. Shutemov (34):
>    mm: drop actor argument of do_generic_file_read()
>    block: implement add_bdi_stat()
>    mm: implement zero_huge_user_segment and friends
>    radix-tree: implement preload for multiple contiguous elements
>    memcg, thp: charge huge cache pages
>    thp, mm: avoid PageUnevictable on active/inactive lru lists
>    thp, mm: basic defines for transparent huge page cache
>    thp, mm: introduce mapping_can_have_hugepages() predicate
>    thp: represent file thp pages in meminfo and friends
>    thp, mm: rewrite add_to_page_cache_locked() to support huge pages
>    mm: trace filemap: dump page order
>    thp, mm: rewrite delete_from_page_cache() to support huge pages
>    thp, mm: trigger bug in replace_page_cache_page() on THP
>    thp, mm: locking tail page is a bug
>    thp, mm: handle tail pages in page_cache_get_speculative()
>    thp, mm: add event counters for huge page alloc on write to a file
>    thp, mm: implement grab_thp_write_begin()
>    thp, mm: naive support of thp in generic read/write routines
>    thp, libfs: initial support of thp in
>      simple_read/write_begin/write_end
>    thp: handle file pages in split_huge_page()
>    thp: wait_split_huge_page(): serialize over i_mmap_mutex too
>    thp, mm: truncate support for transparent huge page cache
>    thp, mm: split huge page on mmap file page
>    ramfs: enable transparent huge page cache
>    x86-64, mm: proper alignment mappings with hugepages
>    mm: add huge_fault() callback to vm_operations_struct
>    thp: prepare zap_huge_pmd() to uncharge file pages
>    thp: move maybe_pmd_mkwrite() out of mk_huge_pmd()
>    thp, mm: basic huge_fault implementation for generic_file_vm_ops
>    thp: extract fallback path from do_huge_pmd_anonymous_page() to a
>      function
>    thp: initial implementation of do_huge_linear_fault()
>    thp: handle write-protect exception to file-backed huge pages
>    thp: call __vma_adjust_trans_huge() for file-backed VMA
>    thp: map file-backed huge pages on fault
>
>   arch/x86/kernel/sys_x86_64.c   |   12 +-
>   drivers/base/node.c            |   10 +
>   fs/libfs.c                     |   48 +++-
>   fs/proc/meminfo.c              |    6 +
>   fs/ramfs/inode.c               |    6 +-
>   include/linux/backing-dev.h    |   10 +
>   include/linux/huge_mm.h        |   36 ++-
>   include/linux/mm.h             |    8 +
>   include/linux/mmzone.h         |    1 +
>   include/linux/pagemap.h        |   33 ++-
>   include/linux/radix-tree.h     |   11 +
>   include/linux/vm_event_item.h  |    2 +
>   include/trace/events/filemap.h |    7 +-
>   lib/radix-tree.c               |   33 ++-
>   mm/filemap.c                   |  298 ++++++++++++++++++++-----
>   mm/huge_memory.c               |  474 +++++++++++++++++++++++++++++++++-------
>   mm/memcontrol.c                |    2 -
>   mm/memory.c                    |   41 +++-
>   mm/mmap.c                      |    3 +
>   mm/page_alloc.c                |    7 +-
>   mm/swap.c                      |   20 +-
>   mm/truncate.c                  |   13 ++
>   mm/vmstat.c                    |    2 +
>   23 files changed, 902 insertions(+), 181 deletions(-)
>


^ permalink raw reply	[flat|nested] 110+ messages in thread

* Re: [PATCHv3, RFC 00/34] Transparent huge page cache
@ 2013-04-07  0:40   ` Ric Mason
  0 siblings, 0 replies; 110+ messages in thread
From: Ric Mason @ 2013-04-07  0:40 UTC (permalink / raw)
  To: Kirill A. Shutemov
  Cc: Andrea Arcangeli, Andrew Morton, Al Viro, Hugh Dickins,
	Wu Fengguang, Jan Kara, Mel Gorman, linux-mm, Andi Kleen,
	Matthew Wilcox, Kirill A. Shutemov, Hillf Danton, Dave Hansen,
	linux-fsdevel, linux-kernel

Hi Kirill,
On 04/05/2013 07:59 PM, Kirill A. Shutemov wrote:
> From: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
>
> Here's third RFC. Thanks everybody for feedback.

Could you answer the questions I raised on your version two?

>
> The patchset is pretty big already and I want to stop generate new
> features to keep it reviewable. Next I'll concentrate on benchmarking and
> tuning.
>
> Therefore some features will be outside initial transparent huge page
> cache implementation:
>   - page collapsing;
>   - migration;
>   - tmpfs/shmem;
>
> There are few features which are not implemented and potentially can block
> upstreaming:
>
> 1. Currently we allocate 2M page even if we create only 1 byte file on
> ramfs. I don't think it's a problem by itself. With anon thp pages we also
> try to allocate huge pages whenever possible.
> The problem is that ramfs pages are unevictable and we can't just split
> and pushed them in swap as with anon thp. We (at some point) have to have
> mechanism to split last page of the file under memory pressure to reclaim
> some memory.
>
> 2. We don't have knobs for disabling transparent huge page cache per-mount
> or per-file. Should we have mount option and fadivse flags as part of
> initial implementation?
>
> Any thoughts?
>
> The patchset is also on git:
>
> git://git.kernel.org/pub/scm/linux/kernel/git/kas/linux.git thp/pagecache
>
> v3:
>   - set RADIX_TREE_PRELOAD_NR to 512 only if we build with THP;
>   - rewrite lru_add_page_tail() to address few bags;
>   - memcg accounting;
>   - represent file thp pages in meminfo and friends;
>   - dump page order in filemap trace;
>   - add missed flush_dcache_page() in zero_huge_user_segment;
>   - random cleanups based on feedback.
> v2:
>   - mmap();
>   - fix add_to_page_cache_locked() and delete_from_page_cache();
>   - introduce mapping_can_have_hugepages();
>   - call split_huge_page() only for head page in filemap_fault();
>   - wait_split_huge_page(): serialize over i_mmap_mutex too;
>   - lru_add_page_tail: avoid PageUnevictable on active/inactive lru lists;
>   - fix off-by-one in zero_huge_user_segment();
>   - THP_WRITE_ALLOC/THP_WRITE_FAILED counters;
>
> Kirill A. Shutemov (34):
>    mm: drop actor argument of do_generic_file_read()
>    block: implement add_bdi_stat()
>    mm: implement zero_huge_user_segment and friends
>    radix-tree: implement preload for multiple contiguous elements
>    memcg, thp: charge huge cache pages
>    thp, mm: avoid PageUnevictable on active/inactive lru lists
>    thp, mm: basic defines for transparent huge page cache
>    thp, mm: introduce mapping_can_have_hugepages() predicate
>    thp: represent file thp pages in meminfo and friends
>    thp, mm: rewrite add_to_page_cache_locked() to support huge pages
>    mm: trace filemap: dump page order
>    thp, mm: rewrite delete_from_page_cache() to support huge pages
>    thp, mm: trigger bug in replace_page_cache_page() on THP
>    thp, mm: locking tail page is a bug
>    thp, mm: handle tail pages in page_cache_get_speculative()
>    thp, mm: add event counters for huge page alloc on write to a file
>    thp, mm: implement grab_thp_write_begin()
>    thp, mm: naive support of thp in generic read/write routines
>    thp, libfs: initial support of thp in
>      simple_read/write_begin/write_end
>    thp: handle file pages in split_huge_page()
>    thp: wait_split_huge_page(): serialize over i_mmap_mutex too
>    thp, mm: truncate support for transparent huge page cache
>    thp, mm: split huge page on mmap file page
>    ramfs: enable transparent huge page cache
>    x86-64, mm: proper alignment mappings with hugepages
>    mm: add huge_fault() callback to vm_operations_struct
>    thp: prepare zap_huge_pmd() to uncharge file pages
>    thp: move maybe_pmd_mkwrite() out of mk_huge_pmd()
>    thp, mm: basic huge_fault implementation for generic_file_vm_ops
>    thp: extract fallback path from do_huge_pmd_anonymous_page() to a
>      function
>    thp: initial implementation of do_huge_linear_fault()
>    thp: handle write-protect exception to file-backed huge pages
>    thp: call __vma_adjust_trans_huge() for file-backed VMA
>    thp: map file-backed huge pages on fault
>
>   arch/x86/kernel/sys_x86_64.c   |   12 +-
>   drivers/base/node.c            |   10 +
>   fs/libfs.c                     |   48 +++-
>   fs/proc/meminfo.c              |    6 +
>   fs/ramfs/inode.c               |    6 +-
>   include/linux/backing-dev.h    |   10 +
>   include/linux/huge_mm.h        |   36 ++-
>   include/linux/mm.h             |    8 +
>   include/linux/mmzone.h         |    1 +
>   include/linux/pagemap.h        |   33 ++-
>   include/linux/radix-tree.h     |   11 +
>   include/linux/vm_event_item.h  |    2 +
>   include/trace/events/filemap.h |    7 +-
>   lib/radix-tree.c               |   33 ++-
>   mm/filemap.c                   |  298 ++++++++++++++++++++-----
>   mm/huge_memory.c               |  474 +++++++++++++++++++++++++++++++++-------
>   mm/memcontrol.c                |    2 -
>   mm/memory.c                    |   41 +++-
>   mm/mmap.c                      |    3 +
>   mm/page_alloc.c                |    7 +-
>   mm/swap.c                      |   20 +-
>   mm/truncate.c                  |   13 ++
>   mm/vmstat.c                    |    2 +
>   23 files changed, 902 insertions(+), 181 deletions(-)
>

^ permalink raw reply	[flat|nested] 110+ messages in thread

* Re: [PATCHv3, RFC 31/34] thp: initial implementation of do_huge_linear_fault()
  2013-04-05 11:59   ` Kirill A. Shutemov
@ 2013-04-08 18:46     ` Dave Hansen
  -1 siblings, 0 replies; 110+ messages in thread
From: Dave Hansen @ 2013-04-08 18:46 UTC (permalink / raw)
  To: Kirill A. Shutemov
  Cc: Andrea Arcangeli, Andrew Morton, Al Viro, Hugh Dickins,
	Wu Fengguang, Jan Kara, Mel Gorman, linux-mm, Andi Kleen,
	Matthew Wilcox, Kirill A. Shutemov, Hillf Danton, linux-fsdevel,
	linux-kernel

On 04/05/2013 04:59 AM, Kirill A. Shutemov wrote:
> +	if (unlikely(khugepaged_enter(vma)))
> +		return VM_FAULT_OOM;
...
> +	ret = vma->vm_ops->huge_fault(vma, &vmf);
> +	if (unlikely(ret & VM_FAULT_OOM))
> +		goto uncharge_out_fallback;
> +	if (unlikely(ret & (VM_FAULT_ERROR | VM_FAULT_NOPAGE | VM_FAULT_RETRY)))
> +		goto uncharge_out;
> +
> +	if (unlikely(PageHWPoison(vmf.page))) {
> +		if (ret & VM_FAULT_LOCKED)
> +			unlock_page(vmf.page);
> +		ret = VM_FAULT_HWPOISON;
> +		goto uncharge_out;
> +	}

One note on all these patches, but especially this one, is that I think
they're way too liberal with unlikely()s.  You really don't need to do
this for every single error case.  Please reserve them for places where
you _know_ there is a benefit, or where the compiler is doing things that
you _know_ are blatantly wrong.
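
As an illustration (a hypothetical sketch, not code from this patchset),
reserving the hint for a path that is known to be cold would look roughly
like this:

        /*
         * Hypothetical sketch: only the branch known to be cold gets the
         * annotation; ordinary, plausible error returns are left for the
         * compiler and branch predictor to sort out on their own.
         */
        page = find_get_page(mapping, index);
        if (!page)                              /* plausible at runtime: no hint */
                return -ENOENT;
        if (unlikely(PageHWPoison(page))) {     /* genuinely rare */
                page_cache_release(page);
                return -EHWPOISON;
        }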

^ permalink raw reply	[flat|nested] 110+ messages in thread

* Re: [PATCHv3, RFC 31/34] thp: initial implementation of do_huge_linear_fault()
  2013-04-05 11:59   ` Kirill A. Shutemov
@ 2013-04-08 18:52     ` Dave Hansen
  -1 siblings, 0 replies; 110+ messages in thread
From: Dave Hansen @ 2013-04-08 18:52 UTC (permalink / raw)
  To: Kirill A. Shutemov
  Cc: Andrea Arcangeli, Andrew Morton, Al Viro, Hugh Dickins,
	Wu Fengguang, Jan Kara, Mel Gorman, linux-mm, Andi Kleen,
	Matthew Wilcox, Kirill A. Shutemov, Hillf Danton, linux-fsdevel,
	linux-kernel

On 04/05/2013 04:59 AM, Kirill A. Shutemov wrote:
> +int do_huge_linear_fault(struct mm_struct *mm, struct vm_area_struct *vma,
> +		unsigned long address, pmd_t *pmd, unsigned int flags)
> +{
> +	unsigned long haddr = address & HPAGE_PMD_MASK;
> +	struct page *cow_page, *page, *dirty_page = NULL;
> +	bool anon = false, fallback = false, page_mkwrite = false;
> +	pgtable_t pgtable = NULL;
> +	struct vm_fault vmf;
> +	int ret;
> +
> +	/* Fallback if vm_pgoff and vm_start are not suitable */
> +	if (((vma->vm_start >> PAGE_SHIFT) & HPAGE_CACHE_INDEX_MASK) !=
> +			(vma->vm_pgoff & HPAGE_CACHE_INDEX_MASK))
> +		return do_fallback(mm, vma, address, pmd, flags);
> +
> +	if (haddr < vma->vm_start || haddr + HPAGE_PMD_SIZE > vma->vm_end)
> +		return do_fallback(mm, vma, address, pmd, flags);
> +
> +	if (unlikely(khugepaged_enter(vma)))
> +		return VM_FAULT_OOM;
> +
> +	/*
> +	 * If we do COW later, allocate page before taking lock_page()
> +	 * on the file cache page. This will reduce lock holding time.
> +	 */
> +	if ((flags & FAULT_FLAG_WRITE) && !(vma->vm_flags & VM_SHARED)) {
> +		if (unlikely(anon_vma_prepare(vma)))
> +			return VM_FAULT_OOM;
> +
> +		cow_page = alloc_hugepage_vma(transparent_hugepage_defrag(vma),
> +				vma, haddr, numa_node_id(), 0);
> +		if (!cow_page) {
> +			count_vm_event(THP_FAULT_FALLBACK);
> +			return do_fallback(mm, vma, address, pmd, flags);
> +		}
> +		count_vm_event(THP_FAULT_ALLOC);
> +		if (mem_cgroup_newpage_charge(cow_page, mm, GFP_KERNEL)) {
> +			page_cache_release(cow_page);
> +			return do_fallback(mm, vma, address, pmd, flags);
> +		}

Ugh.  This is essentially a copy-n-paste of code in __do_fault(),
including the comments.  Is there no way to consolidate the code so that
there's less duplication here?

Part of the reason we have so many bugs in hugetlbfs is that it's really
a forked set of code that does things its own way.  I really hope we're
not going down the road of creating another feature in the same way.

When you copy *this* much code (or any, really), you really need to talk
about it in the patch description.  I was looking at other COW code, and
just happened to stumble onto the __do_fault() code.
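
For what it's worth, here's a rough, hypothetical sketch (not code from the
patchset) of the kind of shared helper that could let __do_fault() and
do_huge_linear_fault() share the allocate-and-charge step for the COW page:

        /*
         * Hypothetical consolidation sketch: allocate and memcg-charge the
         * COW page once for both the 4k and the huge fault paths.  Returns
         * NULL when the caller should fall back or fail.
         */
        static struct page *alloc_cow_page(struct vm_area_struct *vma,
                                           unsigned long addr, bool huge)
        {
                struct page *cow_page;

                if (anon_vma_prepare(vma))
                        return NULL;
                if (huge)
                        cow_page = alloc_hugepage_vma(
                                        transparent_hugepage_defrag(vma),
                                        vma, addr, numa_node_id(), 0);
                else
                        cow_page = alloc_page_vma(GFP_HIGHUSER_MOVABLE,
                                        vma, addr);
                if (!cow_page)
                        return NULL;
                if (mem_cgroup_newpage_charge(cow_page, vma->vm_mm,
                                        GFP_KERNEL)) {
                        page_cache_release(cow_page);
                        return NULL;
                }
                return cow_page;
        }

A real helper would still have to let the callers distinguish OOM from
"fall back to small pages", so it's not entirely mechanical, but it would at
least keep the comments and the charging logic in one place.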

^ permalink raw reply	[flat|nested] 110+ messages in thread

* Re: [PATCHv3, RFC 32/34] thp: handle write-protect exception to file-backed huge pages
  2013-04-05 11:59   ` Kirill A. Shutemov
@ 2013-04-08 19:07     ` Dave Hansen
  -1 siblings, 0 replies; 110+ messages in thread
From: Dave Hansen @ 2013-04-08 19:07 UTC (permalink / raw)
  To: Kirill A. Shutemov
  Cc: Andrea Arcangeli, Andrew Morton, Al Viro, Hugh Dickins,
	Wu Fengguang, Jan Kara, Mel Gorman, linux-mm, Andi Kleen,
	Matthew Wilcox, Kirill A. Shutemov, Hillf Danton, linux-fsdevel,
	linux-kernel

For all of the do_huge_pmd_wp_page() changes, I think we need a better
description of where the code came from.  There are some more obviously
copy-n-pasted comments in there.

For the entire series, in the patch description, we need to know:
1. What was originally written and what was copied from elsewhere
2. For the stuff that was copied, was an attempt made to consolidate
   instead of copy?  Why was consolidation impossible or infeasible?

> +			if (!PageAnon(page)) {
> +				add_mm_counter(mm, MM_FILEPAGES, -HPAGE_PMD_NR);
> +				add_mm_counter(mm, MM_ANONPAGES, HPAGE_PMD_NR);
> +			}

This seems like a bit of a hack.  Shouldn't we have just been accounting
to MM_FILEPAGES in the first place?

^ permalink raw reply	[flat|nested] 110+ messages in thread

* Re: [PATCHv3, RFC 09/34] thp: represent file thp pages in meminfo and friends
  2013-04-05 11:59   ` Kirill A. Shutemov
@ 2013-04-08 19:38     ` Dave Hansen
  -1 siblings, 0 replies; 110+ messages in thread
From: Dave Hansen @ 2013-04-08 19:38 UTC (permalink / raw)
  To: Kirill A. Shutemov
  Cc: Andrea Arcangeli, Andrew Morton, Al Viro, Hugh Dickins,
	Wu Fengguang, Jan Kara, Mel Gorman, linux-mm, Andi Kleen,
	Matthew Wilcox, Kirill A. Shutemov, Hillf Danton, linux-fsdevel,
	linux-kernel

On 04/05/2013 04:59 AM, Kirill A. Shutemov wrote:
> The patch adds new zone stat to count file transparent huge pages and
> adjust related places.
> 
> For now we don't count mapped or dirty file thp pages separately.

I can understand tracking NR_FILE_TRANSPARENT_HUGEPAGES itself.  But,
why not also account for them in NR_FILE_PAGES?  That way, you don't
have to special-case each of the cases below:

> --- a/fs/proc/meminfo.c
> +++ b/fs/proc/meminfo.c
> @@ -41,6 +41,9 @@ static int meminfo_proc_show(struct seq_file *m, void *v)
>  
>  	cached = global_page_state(NR_FILE_PAGES) -
>  			total_swapcache_pages() - i.bufferram;
> +	if (IS_ENABLED(CONFIG_TRANSPARENT_HUGEPAGE))
> +		cached += global_page_state(NR_FILE_TRANSPARENT_HUGEPAGES) *
> +			HPAGE_PMD_NR;
>  	if (cached < 0)
>  		cached = 0;
....
> --- a/mm/mmap.c
> +++ b/mm/mmap.c
> @@ -135,6 +135,9 @@ int __vm_enough_memory(struct mm_struct *mm, long pages, int cap_sys_admin)
>  	if (sysctl_overcommit_memory == OVERCOMMIT_GUESS) {
>  		free = global_page_state(NR_FREE_PAGES);
>  		free += global_page_state(NR_FILE_PAGES);
> +		if (IS_ENABLED(CONFIG_TRANSPARENT_HUGEPAGE))
> +			free += global_page_state(NR_FILE_TRANSPARENT_HUGEPAGES)
> +				* HPAGE_PMD_NR;
...
> -	printk("%ld total pagecache pages\n", global_page_state(NR_FILE_PAGES));
> +	cached = global_page_state(NR_FILE_PAGES);
> +	if (IS_ENABLED(CONFIG_TRANSPARENT_HUGEPAGE))
> +		cached += global_page_state(NR_FILE_TRANSPARENT_HUGEPAGES) *
> +			HPAGE_PMD_NR;
> +	printk("%ld total pagecache pages\n", cached);


^ permalink raw reply	[flat|nested] 110+ messages in thread

* IOZone with transparent huge page cache
  2013-04-05 11:59 ` Kirill A. Shutemov
@ 2013-04-15 16:02   ` Kirill A. Shutemov
  -1 siblings, 0 replies; 110+ messages in thread
From: Kirill A. Shutemov @ 2013-04-15 16:02 UTC (permalink / raw)
  To: Andrea Arcangeli, Andrew Morton; +Cc: Kirill A. Shutemov

Kirill A. Shutemov wrote:
> From: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
> 
> Here's third RFC. Thanks everybody for feedback.
> 
> The patchset is pretty big already and I want to stop generate new
> features to keep it reviewable. Next I'll concentrate on benchmarking and
> tuning.

Okay, here are the first test results for the patchset (I made a few minor fixes).

I ran iozone using mmapped files (-B) with different numbers of threads.
The test machine is a 4-socket Westmere: 4x10 cores + HT.

** Initial writers **
threads:	        1        2        4        8       16       32       64      128      256
baseline:	  1103360   912585   500065   260503   128918    62039    34799    18718     9376
patched:	  2127476  2155029  2345079  1942158  1127109   571899   127090    52939    25950
speed-up(times):     1.93     2.36     4.69     7.46     8.74     9.22     3.65     2.83     2.77

** Rewriters **
threads:	        1        2        4        8       16       32       64      128      256
baseline:	  1391599  1167340  1040505   484431   225883   108426    53239    28133    15007
patched:	  2284535  1943774  2245681  1288542   438100   308559   115641    64990    30638
speed-up(times):     1.64     1.67     2.16     2.66     1.94     2.85     2.17     2.31     2.04

** Readers **
threads:	        1        2        4        8       16       32       64      128      256
baseline:	  1344180  1339641  1094513   604079   273020   129403    76666    41785    20111
patched:	  1790010  1807535  2039022  1884901  1470841   874517   429414   207033    99853
speed-up(times):     1.33     1.35     1.86     3.12     5.39     6.76     5.60     4.95     4.97

** Re-readers **
threads:	        1        2        4        8       16       32       64      128      256
baseline:	  1402398  1239448  1105293   653823   273997   130629    75943    40588    20456
patched:	  1928768  2076134  1791750  1907793  1494477   876014   432898   207279   102002
speed-up(times):     1.38     1.68     1.62     2.92     5.45     6.71     5.70     5.11     4.99

** Reverse readers **
threads:	        1        2        4        8       16       32       64      128      256
baseline:	  1545930  1443504  1175183   604320   277178   128694    76734    40956    20345
patched:	  1907933  1827041  1919202  1734568  1497661   862046   429960   208326    93213
speed-up(times):     1.23     1.27     1.63     2.87     5.40     6.70     5.60     5.09     4.58

** Random_readers **
threads:	        1        2        4        8       16       32       64      128      256
baseline:	  1069364   968029   887646   570211   257909   124713    74354    40663    20213
patched:	  1881762  2144045  1989631  2057963  1560892   867901   424109   205934    98021
speed-up(times):     1.76     2.21     2.24     3.61     6.05     6.96     5.70     5.06     4.85

** Random_writers **
threads:	        1        2        4        8       16       32       64      128      256
baseline:	  1236302  1033694   882697   475439   231973   113590    65675    35458    17890
patched:	  2778110  2484373  2454243  1329193   706394   353300   173871    86194    42815
speed-up(times):     2.25     2.40     2.78     2.80     3.05     3.11     2.65     2.43     2.39

The minimal speed-up is 23%, for 1-thread reverse readers.
The maximal is 9.2 times, for 32-thread initial writers.  That's probably due
to the batched radix tree insert: we insert 512 pages at a time, which reduces
mapping->tree_lock contention.
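
(Conceptual sketch of the batching, not the actual patch: all subpages of a
huge page go into the radix tree under a single tree_lock acquisition.)

        int i, error = 0;

        spin_lock_irq(&mapping->tree_lock);
        for (i = 0; i < HPAGE_PMD_NR; i++) {
                /* one lock round-trip for 512 pages, not 512 round-trips */
                error = radix_tree_insert(&mapping->page_tree,
                                          offset + i, page + i);
                if (error)
                        break;
        }
        spin_unlock_irq(&mapping->tree_lock);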

I wonder why rewriters are slower than initial writers.  Readers are also
slower than initial writers for low numbers of threads.  This requires
further investigation.

Overall, it looks pretty impressive to me. :)

-- 
 Kirill A. Shutemov

^ permalink raw reply	[flat|nested] 110+ messages in thread

* [RESEND] IOZone with transparent huge page cache
  2013-04-05 11:59 ` Kirill A. Shutemov
@ 2013-04-15 18:17   ` Kirill A. Shutemov
  -1 siblings, 0 replies; 110+ messages in thread
From: Kirill A. Shutemov @ 2013-04-15 18:17 UTC (permalink / raw)
  To: Andrea Arcangeli, Andrew Morton
  Cc: Kirill A. Shutemov, Al Viro, Hugh Dickins, Wu Fengguang,
	Jan Kara, Mel Gorman, linux-mm, Andi Kleen, Matthew Wilcox,
	Kirill A. Shutemov, Hillf Danton, Dave Hansen, linux-fsdevel,
	linux-kernel

[ resend with fixed mail headers, sorry ]

> From: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
> 
> Here's third RFC. Thanks everybody for feedback.
> 
> The patchset is pretty big already and I want to stop generate new
> features to keep it reviewable. Next I'll concentrate on benchmarking and
> tuning.

Okay, here are the first test results for the patchset (I made a few minor fixes).

I ran iozone using mmapped files (-B) with different numbers of threads.
The test machine is a 4-socket Westmere: 4x10 cores + HT.

** Initial writers **
threads:	        1        2        4        8       16       32       64      128      256
baseline:	  1103360   912585   500065   260503   128918    62039    34799    18718     9376
patched:	  2127476  2155029  2345079  1942158  1127109   571899   127090    52939    25950
speed-up(times):     1.93     2.36     4.69     7.46     8.74     9.22     3.65     2.83     2.77

** Rewriters **
threads:	        1        2        4        8       16       32       64      128      256
baseline:	  1391599  1167340  1040505   484431   225883   108426    53239    28133    15007
patched:	  2284535  1943774  2245681  1288542   438100   308559   115641    64990    30638
speed-up(times):     1.64     1.67     2.16     2.66     1.94     2.85     2.17     2.31     2.04

** Readers **
threads:	        1        2        4        8       16       32       64      128      256
baseline:	  1344180  1339641  1094513   604079   273020   129403    76666    41785    20111
patched:	  1790010  1807535  2039022  1884901  1470841   874517   429414   207033    99853
speed-up(times):     1.33     1.35     1.86     3.12     5.39     6.76     5.60     4.95     4.97

** Re-readers **
threads:	        1        2        4        8       16       32       64      128      256
baseline:	  1402398  1239448  1105293   653823   273997   130629    75943    40588    20456
patched:	  1928768  2076134  1791750  1907793  1494477   876014   432898   207279   102002
speed-up(times):     1.38     1.68     1.62     2.92     5.45     6.71     5.70     5.11     4.99

** Reverse readers **
threads:	        1        2        4        8       16       32       64      128      256
baseline:	  1545930  1443504  1175183   604320   277178   128694    76734    40956    20345
patched:	  1907933  1827041  1919202  1734568  1497661   862046   429960   208326    93213
speed-up(times):     1.23     1.27     1.63     2.87     5.40     6.70     5.60     5.09     4.58

** Random_readers **
threads:	        1        2        4        8       16       32       64      128      256
baseline:	  1069364   968029   887646   570211   257909   124713    74354    40663    20213
patched:	  1881762  2144045  1989631  2057963  1560892   867901   424109   205934    98021
speed-up(times):     1.76     2.21     2.24     3.61     6.05     6.96     5.70     5.06     4.85

** Random_writers **
threads:	        1        2        4        8       16       32       64      128      256
baseline:	  1236302  1033694   882697   475439   231973   113590    65675    35458    17890
patched:	  2778110  2484373  2454243  1329193   706394   353300   173871    86194    42815
speed-up(times):     2.25     2.40     2.78     2.80     3.05     3.11     2.65     2.43     2.39

The minimal speed-up is 23%, for 1-thread reverse readers.
The maximal is 9.2 times, for 32-thread initial writers.  That's probably due
to the batched radix tree insert: we insert 512 pages at a time, which reduces
mapping->tree_lock contention.

I wonder why rewriters are slower than initial writers.  Readers are also
slower than initial writers for low numbers of threads.  This requires
further investigation.

Overall, it looks pretty impressive to me. :)

-- 
 Kirill A. Shutemov

^ permalink raw reply	[flat|nested] 110+ messages in thread

* Re: [RESEND] IOZone with transparent huge page cache
  2013-04-15 18:17   ` Kirill A. Shutemov
@ 2013-04-15 23:19     ` Dave Hansen
  -1 siblings, 0 replies; 110+ messages in thread
From: Dave Hansen @ 2013-04-15 23:19 UTC (permalink / raw)
  To: Kirill A. Shutemov
  Cc: Andrea Arcangeli, Andrew Morton, Al Viro, Hugh Dickins,
	Wu Fengguang, Jan Kara, Mel Gorman, linux-mm, Andi Kleen,
	Matthew Wilcox, Kirill A. Shutemov, Hillf Danton, linux-fsdevel,
	linux-kernel

On 04/15/2013 11:17 AM, Kirill A. Shutemov wrote:
> I run iozone using mmap files (-B) with different number of threads.
> The test machine is 4s Westmere - 4x10 cores + HT.

How did you run this, exactly?  Which iozone arguments?  It was run on
ramfs, since that's the only thing that transparent huge page cache
supports right now?

> ** Initial writers **
> threads:	        1        2        4        8       16       32       64      128      256
> baseline:	  1103360   912585   500065   260503   128918    62039    34799    18718     9376
> patched:	  2127476  2155029  2345079  1942158  1127109   571899   127090    52939    25950
> speed-up(times):     1.93     2.36     4.69     7.46     8.74     9.22     3.65     2.83     2.77

I'm a _bit_ surprised that iozone scales _that_ badly, especially while
threads < nr_cpus.  Is this normal for iozone?  What are the units and the
metric there, btw?

> Minimal speed up is in 1-thread reverse readers - 23%.
> Maximal is 9.2 times in 32-thread initial writers. It's probably due
> batched radix tree insert - we insert 512 pages a time. It reduces
> mapping->tree_lock contention.

It might actually be interesting to see this at 10, 20, 40, 80, etc...
since that'll actually match iozone threads to CPU cores on your
particular system.

^ permalink raw reply	[flat|nested] 110+ messages in thread

* Re: [RESEND] IOZone with transparent huge page cache
  2013-04-15 23:19     ` Dave Hansen
@ 2013-04-16  5:57       ` Kirill A. Shutemov
  -1 siblings, 0 replies; 110+ messages in thread
From: Kirill A. Shutemov @ 2013-04-16  5:57 UTC (permalink / raw)
  To: Dave Hansen
  Cc: Kirill A. Shutemov, Andrea Arcangeli, Andrew Morton, Al Viro,
	Hugh Dickins, Wu Fengguang, Jan Kara, Mel Gorman, linux-mm,
	Andi Kleen, Matthew Wilcox, Kirill A. Shutemov, Hillf Danton,
	linux-fsdevel, linux-kernel

Dave Hansen wrote:
> On 04/15/2013 11:17 AM, Kirill A. Shutemov wrote:
> > I run iozone using mmap files (-B) with different number of threads.
> > The test machine is 4s Westmere - 4x10 cores + HT.
> 
> How did you run this, exactly?  Which iozone arguments?

iozone -B -s 21822226/$threads -t $threads -r 4 -i 0 -i 1 -i 2 -i 3

It's a slightly modified iozone test from mmtests.

> It was run on ramfs, since that's the only thing that transparent huge page
> cache supports right now?

Correct.

> > ** Initial writers **
> > threads:	        1        2        4        8       16       32       64      128      256
> > baseline:	  1103360   912585   500065   260503   128918    62039    34799    18718     9376
> > patched:	  2127476  2155029  2345079  1942158  1127109   571899   127090    52939    25950
> > speed-up(times):     1.93     2.36     4.69     7.46     8.74     9.22     3.65     2.83     2.77
> 
> I'm a _bit_ surprised that iozone scales _that_ badly especially while
> threads<nr_cpus.  Is this normal for iozone?  What are the units and
> metric there, btw?

The units are KB/sec per process (I used 'Avg throughput per process' from
the iozone report), so the scaling is not as bad as it looks: the 16-thread
baseline of 128918 KB/sec per process, for example, is roughly 2 GB/sec in
aggregate.
I will use the total children throughput next time to avoid confusion.

> > Minimal speed up is in 1-thread reverse readers - 23%.
> > Maximal is 9.2 times in 32-thread initial writers. It's probably due
> > batched radix tree insert - we insert 512 pages a time. It reduces
> > mapping->tree_lock contention.
> 
> It might actually be interesting to see this at 10, 20, 40, 80, etc...
> since that'll actually match iozone threads to CPU cores on your
> particular system.

Okay.

-- 
 Kirill A. Shutemov

^ permalink raw reply	[flat|nested] 110+ messages in thread

* Re: [RESEND] IOZone with transparent huge page cache
  2013-04-16  5:57       ` Kirill A. Shutemov
@ 2013-04-16  6:11         ` Dave Hansen
  -1 siblings, 0 replies; 110+ messages in thread
From: Dave Hansen @ 2013-04-16  6:11 UTC (permalink / raw)
  To: Kirill A. Shutemov
  Cc: Andrea Arcangeli, Andrew Morton, Al Viro, Hugh Dickins,
	Wu Fengguang, Jan Kara, Mel Gorman, linux-mm, Andi Kleen,
	Matthew Wilcox, Kirill A. Shutemov, Hillf Danton, linux-fsdevel,
	linux-kernel

On 04/15/2013 10:57 PM, Kirill A. Shutemov wrote:
>>> > > ** Initial writers **
>>> > > threads:	        1        2        4        8       16       32       64      128      256
>>> > > baseline:	  1103360   912585   500065   260503   128918    62039    34799    18718     9376
>>> > > patched:	  2127476  2155029  2345079  1942158  1127109   571899   127090    52939    25950
>>> > > speed-up(times):     1.93     2.36     4.69     7.46     8.74     9.22     3.65     2.83     2.77
>> > 
>> > I'm a _bit_ surprised that iozone scales _that_ badly especially while
>> > threads<nr_cpus.  Is this normal for iozone?  What are the units and
>> > metric there, btw?
> The units is KB/sec per process (I used 'Avg throughput per process' from
> iozone report). So it scales not that badly.
> I will use total children throughput next time to avoid confusion.

Wow.  Well, it's cool that your patches just fix it up inherently.  I'd
still really like to see some analysis of exactly where the benefit is
coming from, though.


^ permalink raw reply	[flat|nested] 110+ messages in thread

* Re: [PATCHv3, RFC 09/34] thp: represent file thp pages in meminfo and friends
  2013-04-08 19:38     ` Dave Hansen
@ 2013-04-16 14:49       ` Kirill A. Shutemov
  -1 siblings, 0 replies; 110+ messages in thread
From: Kirill A. Shutemov @ 2013-04-16 14:49 UTC (permalink / raw)
  To: Dave Hansen
  Cc: Kirill A. Shutemov, Andrea Arcangeli, Andrew Morton, Al Viro,
	Hugh Dickins, Wu Fengguang, Jan Kara, Mel Gorman, linux-mm,
	Andi Kleen, Matthew Wilcox, Kirill A. Shutemov, Hillf Danton,
	linux-fsdevel, linux-kernel

Dave Hansen wrote:
> On 04/05/2013 04:59 AM, Kirill A. Shutemov wrote:
> > The patch adds new zone stat to count file transparent huge pages and
> > adjust related places.
> > 
> > For now we don't count mapped or dirty file thp pages separately.
> 
> I can understand tracking NR_FILE_TRANSPARENT_HUGEPAGES itself.  But,
> why not also account for them in NR_FILE_PAGES?  That way, you don't
> have to special-case each of the cases below:

Good point.
To be consistent I'll also convert NR_ANON_TRANSPARENT_HUGEPAGES to be
accounted in NR_ANON_PAGES.
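
Something along these lines, presumably (hypothetical sketch, not the actual
patch):

        /* charge every subpage of a huge page to the regular counter */
        __mod_zone_page_state(page_zone(page), NR_FILE_PAGES,
                              PageTransHuge(page) ? HPAGE_PMD_NR : 1);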

-- 
 Kirill A. Shutemov

^ permalink raw reply	[flat|nested] 110+ messages in thread

* Re: [PATCHv3, RFC 09/34] thp: represent file thp pages in meminfo and friends
  2013-04-16 14:49       ` Kirill A. Shutemov
@ 2013-04-16 15:11         ` Dave Hansen
  -1 siblings, 0 replies; 110+ messages in thread
From: Dave Hansen @ 2013-04-16 15:11 UTC (permalink / raw)
  To: Kirill A. Shutemov
  Cc: Andrea Arcangeli, Andrew Morton, Al Viro, Hugh Dickins,
	Wu Fengguang, Jan Kara, Mel Gorman, linux-mm, Andi Kleen,
	Matthew Wilcox, Kirill A. Shutemov, Hillf Danton, linux-fsdevel,
	linux-kernel

On 04/16/2013 07:49 AM, Kirill A. Shutemov wrote:
> Dave Hansen wrote:
>> On 04/05/2013 04:59 AM, Kirill A. Shutemov wrote:
>>> The patch adds new zone stat to count file transparent huge pages and
>>> adjust related places.
>>>
>>> For now we don't count mapped or dirty file thp pages separately.
>>
>> I can understand tracking NR_FILE_TRANSPARENT_HUGEPAGES itself.  But,
>> why not also account for them in NR_FILE_PAGES?  That way, you don't
>> have to special-case each of the cases below:
> 
> Good point.
> To be consistent I'll also convert NR_ANON_TRANSPARENT_HUGEPAGES to be
> accounted in NR_ANON_PAGES.

Hmm...  I didn't realize we did that for the anonymous version.  But,
looking at the meminfo code:

> #ifdef CONFIG_TRANSPARENT_HUGEPAGE
>                 K(global_page_state(NR_ANON_PAGES)
>                   + global_page_state(NR_ANON_TRANSPARENT_HUGEPAGES) *
>                   HPAGE_PMD_NR),
> #else
>                 K(global_page_state(NR_ANON_PAGES)),
> #endif

That #ifdef and a couple others like it would just go away if we did
this.  It would be a nice cleanup.
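
i.e. with the huge pages folded into NR_ANON_PAGES at accounting time, that
hunk would presumably collapse to just:

                K(global_page_state(NR_ANON_PAGES)),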

^ permalink raw reply	[flat|nested] 110+ messages in thread

* Re: [PATCHv3, RFC 31/34] thp: initial implementation of do_huge_linear_fault()
  2013-04-08 18:52     ` Dave Hansen
@ 2013-04-17 14:38       ` Kirill A. Shutemov
  -1 siblings, 0 replies; 110+ messages in thread
From: Kirill A. Shutemov @ 2013-04-17 14:38 UTC (permalink / raw)
  To: Dave Hansen
  Cc: Kirill A. Shutemov, Andrea Arcangeli, Andrew Morton, Al Viro,
	Hugh Dickins, Wu Fengguang, Jan Kara, Mel Gorman, linux-mm,
	Andi Kleen, Matthew Wilcox, Kirill A. Shutemov, Hillf Danton,
	linux-fsdevel, linux-kernel

Dave Hansen wrote:
> On 04/05/2013 04:59 AM, Kirill A. Shutemov wrote:
> > +int do_huge_linear_fault(struct mm_struct *mm, struct vm_area_struct *vma,
> > +		unsigned long address, pmd_t *pmd, unsigned int flags)
> > +{
> > +	unsigned long haddr = address & HPAGE_PMD_MASK;
> > +	struct page *cow_page, *page, *dirty_page = NULL;
> > +	bool anon = false, fallback = false, page_mkwrite = false;
> > +	pgtable_t pgtable = NULL;
> > +	struct vm_fault vmf;
> > +	int ret;
> > +
> > +	/* Fallback if vm_pgoff and vm_start are not suitable */
> > +	if (((vma->vm_start >> PAGE_SHIFT) & HPAGE_CACHE_INDEX_MASK) !=
> > +			(vma->vm_pgoff & HPAGE_CACHE_INDEX_MASK))
> > +		return do_fallback(mm, vma, address, pmd, flags);
> > +
> > +	if (haddr < vma->vm_start || haddr + HPAGE_PMD_SIZE > vma->vm_end)
> > +		return do_fallback(mm, vma, address, pmd, flags);
> > +
> > +	if (unlikely(khugepaged_enter(vma)))
> > +		return VM_FAULT_OOM;
> > +
> > +	/*
> > +	 * If we do COW later, allocate page before taking lock_page()
> > +	 * on the file cache page. This will reduce lock holding time.
> > +	 */
> > +	if ((flags & FAULT_FLAG_WRITE) && !(vma->vm_flags & VM_SHARED)) {
> > +		if (unlikely(anon_vma_prepare(vma)))
> > +			return VM_FAULT_OOM;
> > +
> > +		cow_page = alloc_hugepage_vma(transparent_hugepage_defrag(vma),
> > +				vma, haddr, numa_node_id(), 0);
> > +		if (!cow_page) {
> > +			count_vm_event(THP_FAULT_FALLBACK);
> > +			return do_fallback(mm, vma, address, pmd, flags);
> > +		}
> > +		count_vm_event(THP_FAULT_ALLOC);
> > +		if (mem_cgroup_newpage_charge(cow_page, mm, GFP_KERNEL)) {
> > +			page_cache_release(cow_page);
> > +			return do_fallback(mm, vma, address, pmd, flags);
> > +		}
> 
> Ugh.  This is essentially a copy-n-paste of code in __do_fault(),
> including the comments.  Is there no way to consolidate the code so that
> there's less duplication here?

I've looked into it once again and it seems there's not much space for
consolidation. Code structure looks very similar, but there are many
special cases for thp: fallback path, pte vs. pmd, etc. I don't see how we
can consolidate them in a sane way.
I think copy is more maintainable :(
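
(For a concrete picture of the pte vs. pmd divergence meant here, a rough
sketch of the two install paths; this is not a quote from either function:)

	/* small page: install a pte under the pte lock */
	pte_t pte_entry = mk_pte(page, vma->vm_page_prot);
	set_pte_at(mm, address, page_table, pte_entry);
	update_mmu_cache(vma, address, page_table);

	/* huge page: install a pmd under mm->page_table_lock instead */
	pmd_t pmd_entry = mk_huge_pmd(page, vma->vm_page_prot);
	set_pmd_at(mm, haddr, pmd, pmd_entry);
	update_mmu_cache_pmd(vma, haddr, pmd);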

> Part of the reason we have so many bugs in hugetlbfs is that it's really
> a forked set of code that does things its own way.  I really hope we're
> not going down the road of creating another feature in the same way.
> 
> When you copy *this* much code (or any, really), you really need to talk
> about it in the patch description.  I was looking at other COW code, and
> just happened to stumble on to the __do_fault() code.

I will document it in commit message and in comments for both functions.

-- 
 Kirill A. Shutemov

^ permalink raw reply	[flat|nested] 110+ messages in thread

* Re: [PATCHv3, RFC 31/34] thp: initial implementation of do_huge_linear_fault()
  2013-04-17 14:38       ` Kirill A. Shutemov
  (?)
@ 2013-04-17 22:07       ` Dave Hansen
  2013-04-18 16:09           ` Kirill A. Shutemov
  -1 siblings, 1 reply; 110+ messages in thread
From: Dave Hansen @ 2013-04-17 22:07 UTC (permalink / raw)
  To: Kirill A. Shutemov
  Cc: Andrea Arcangeli, Andrew Morton, Al Viro, Hugh Dickins,
	Wu Fengguang, Jan Kara, Mel Gorman, linux-mm, Andi Kleen,
	Matthew Wilcox, Kirill A. Shutemov, Hillf Danton, linux-fsdevel,
	linux-kernel

[-- Attachment #1: Type: text/plain, Size: 1648 bytes --]

On 04/17/2013 07:38 AM, Kirill A. Shutemov wrote:
>> > Ugh.  This is essentially a copy-n-paste of code in __do_fault(),
>> > including the comments.  Is there no way to consolidate the code so that
>> > there's less duplication here?
> I've looked into it once again and it seems there's not much space for
> consolidation. Code structure looks very similar, but there are many
> special cases for thp: fallback path, pte vs. pmd, etc. I don't see how we
> can consolidate them in a sane way.
> I think copy is more maintainable :(

I took the two copies, put them each in a file, changed some of the
_very_ trivial stuff to match (foo=1 vs foo=true) and diffed them.
They're very similar in length (in lines):

 185  __do_fault
 197  do_huge_linear_fault

If you diff them:

 1 file changed, 68 insertions(+), 56 deletions(-)

That means that of 185 lines in __do_fault(), 129 (70%) of them were
copied *VERBATIM*.  Not similar in structure or appearance.  Bit-for-bit
the same.

I took a stab at consolidating them.  I think we could add a
VM_FAULT_FALLBACK flag to explicitly indicate that we need to do a
huge->small fallback, as well as a FAULT_FLAG_TRANSHUGE to indicate that
a given fault has not attempted to be handled by a huge page.  If we
call __do_fault() with FAULT_FLAG_TRANSHUGE and we get back
VM_FAULT_FALLBACK or VM_FAULT_OOM, then we clear FAULT_FLAG_TRANSHUGE
and retry in handle_mm_fault().
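
In outline, the retry could look roughly like this (a sketch only, shown
where pgoff and orig_pte are already at hand; FAULT_FLAG_TRANSHUGE and
VM_FAULT_FALLBACK are the proposed names, not something in mainline):

	int ret;

	/* first try: ask __do_fault() for a huge mapping */
	ret = __do_fault(mm, vma, address, pmd, pgoff,
			 flags | FAULT_FLAG_TRANSHUGE, orig_pte);
	/* huge attempt refused or failed: retry with small pages */
	if (ret & (VM_FAULT_OOM | VM_FAULT_FALLBACK))
		ret = __do_fault(mm, vma, address, pmd, pgoff,
				 flags, orig_pte);
	return ret;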

I only went about 1/4 of the way into __do_fault().  If I went and spent
another hour or two, I'm pretty convinced I could push this even further.

Are you still sure you can't do _any_ better than a verbatim copy of 129
lines?



[-- Attachment #2: extend-__do_fault.patch --]
[-- Type: text/x-patch, Size: 3942 bytes --]

diff --git a/include/linux/mm.h b/include/linux/mm.h
index 7acc9dc..d408b5b 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -879,6 +879,7 @@ static inline int page_mapped(struct page *page)
 #define VM_FAULT_NOPAGE	0x0100	/* ->fault installed the pte, not return page */
 #define VM_FAULT_LOCKED	0x0200	/* ->fault locked the returned page */
 #define VM_FAULT_RETRY	0x0400	/* ->fault blocked, must retry */
+#define VM_FAULT_FALLBACK 0x0800	/* large page fault failed, fall back to small */
 
 #define VM_FAULT_HWPOISON_LARGE_MASK 0xf000 /* encodes hpage index for large hwpoison */
 
diff --git a/mm/memory.c b/mm/memory.c
index 494526a..9aced3a 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -3229,6 +3229,40 @@ oom:
 	return VM_FAULT_OOM;
 }
 
+static inline bool transhuge_vma_suitable(struct vm_area_struct *vma, unsigned long addr)
+{
+	unsigned long haddr = addr & HPAGE_PMD_MASK;
+
+	if (((vma->vm_start >> PAGE_SHIFT) & HPAGE_CACHE_INDEX_MASK) !=
+	    (vma->vm_pgoff & HPAGE_CACHE_INDEX_MASK)) {
+		return false;
+	}
+	if (haddr < vma->vm_start || haddr + HPAGE_PMD_SIZE > vma->vm_end) {
+		return false;
+	}
+	return true;
+}
+
+static struct page *alloc_fault_page_vma(gfp_t gfp, struct vm_area_struct *vma,
+		unsigned long address, unsigned int flags)
+{
+	int try_huge_pages = flags & FAULT_FLAG_TRANSHUGE;
+	unsigned long haddr = address & HPAGE_PMD_MASK;
+
+	if (try_huge_pages) {
+		return alloc_hugepage_vma(transparent_hugepage_defrag(vma),
+				vma, haddr, numa_node_id(), 0);
+	}
+	return alloc_page_vma(gfp, vma, address);
+}
+
+static inline void __user *align_fault_address(unsigned long address, unsigned int flags)
+{
+	if (flags & FAULT_FLAG_TRANSHUGE)
+		return (void __user *)(address & HPAGE_PMD_MASK);
+	return (void __user *)(address & PAGE_MASK);
+}
+
 /*
  * __do_fault() tries to create a new page mapping. It aggressively
  * tries to share with existing pages, but makes a separate copy if
@@ -3256,17 +3290,21 @@ static int __do_fault(struct mm_struct *mm, struct vm_area_struct *vma,
 	struct vm_fault vmf;
 	int ret;
 	int page_mkwrite = 0;
+	int try_huge_pages = !!(flags & FAULT_FLAG_TRANSHUGE);
+
+	if (try_huge_pages && !transhuge_vma_suitable(vma, address)) {
+		return VM_FAULT_FALLBACK;
+	}
 
 	/*
 	 * If we do COW later, allocate page befor taking lock_page()
 	 * on the file cache page. This will reduce lock holding time.
 	 */
 	if ((flags & FAULT_FLAG_WRITE) && !(vma->vm_flags & VM_SHARED)) {
-
 		if (unlikely(anon_vma_prepare(vma)))
 			return VM_FAULT_OOM;
 
-		cow_page = alloc_page_vma(GFP_HIGHUSER_MOVABLE, vma, address);
+		cow_page = alloc_fault_page_vma(GFP_HIGHUSER_MOVABLE, vma,
+						address, flags);
 		if (!cow_page)
 			return VM_FAULT_OOM;
 
@@ -3277,7 +3315,7 @@ static int __do_fault(struct mm_struct *mm, struct vm_area_struct *vma,
 	} else
 		cow_page = NULL;
 
-	vmf.virtual_address = (void __user *)(address & PAGE_MASK);
+	vmf.virtual_address = align_fault_address(address, flags);
 	vmf.pgoff = pgoff;
 	vmf.flags = flags;
 	vmf.page = NULL;
@@ -3714,7 +3752,6 @@ int handle_mm_fault(struct mm_struct *mm, struct vm_area_struct *vma,
 	pud_t *pud;
 	pmd_t *pmd;
 	pte_t *pte;
-
 	__set_current_state(TASK_RUNNING);
 
 	count_vm_event(PGFAULT);
@@ -3726,6 +3763,9 @@ int handle_mm_fault(struct mm_struct *mm, struct vm_area_struct *vma,
 	if (unlikely(is_vm_hugetlb_page(vma)))
 		return hugetlb_fault(mm, vma, address, flags);
 
+	/* We will try a single shot (only if enabled and possible)
+	 * to use a transparent huge page */
+	flags |= FAULT_FLAG_TRANSHUGE;
 retry:
 	pgd = pgd_offset(mm, address);
 	pud = pud_alloc(mm, pgd, address);
@@ -3738,6 +3778,11 @@ retry:
 		if (!vma->vm_ops)
 			return do_huge_pmd_anonymous_page(mm, vma, address,
 							  pmd, flags);
+		ret = __do_fault(mm, vma, address, pmd, ...)
+		if (ret & (VM_FAULT_OOM | VM_FAULT_FALLBACK)) {
+			flags &= ~FAULT_FLAG_TRANSHUGE;
+			goto retry;
+		}
 	} else {
 		pmd_t orig_pmd = *pmd;
 		int ret;

^ permalink raw reply related	[flat|nested] 110+ messages in thread

* Re: [PATCHv3, RFC 31/34] thp: initial implementation of do_huge_linear_fault()
  2013-04-17 22:07       ` Dave Hansen
@ 2013-04-18 16:09           ` Kirill A. Shutemov
  0 siblings, 0 replies; 110+ messages in thread
From: Kirill A. Shutemov @ 2013-04-18 16:09 UTC (permalink / raw)
  To: Dave Hansen
  Cc: Kirill A. Shutemov, Andrea Arcangeli, Andrew Morton, Al Viro,
	Hugh Dickins, Wu Fengguang, Jan Kara, Mel Gorman, linux-mm,
	Andi Kleen, Matthew Wilcox, Kirill A. Shutemov, Hillf Danton,
	linux-fsdevel, linux-kernel

Dave Hansen wrote:
> On 04/17/2013 07:38 AM, Kirill A. Shutemov wrote:
> Are you still sure you can't do _any_ better than a verbatim copy of 129
> lines?

It seems I was too lazy. Shame on me. :(
Here's consolidated version. Only build tested. Does it look better?

diff --git a/include/linux/huge_mm.h b/include/linux/huge_mm.h
index 1c25b90..47651d4 100644
--- a/include/linux/huge_mm.h
+++ b/include/linux/huge_mm.h
@@ -186,6 +186,28 @@ static inline struct page *compound_trans_head(struct page *page)
 	return page;
 }
 
+static inline gfp_t alloc_hugepage_gfpmask(int defrag, gfp_t extra_gfp)
+{
+	return (GFP_TRANSHUGE & ~(defrag ? 0 : __GFP_WAIT)) | extra_gfp;
+}
+
+static inline struct page *alloc_hugepage_vma(int defrag,
+					      struct vm_area_struct *vma,
+					      unsigned long haddr, int nd,
+					      gfp_t extra_gfp)
+{
+	return alloc_pages_vma(alloc_hugepage_gfpmask(defrag, extra_gfp),
+			       HPAGE_PMD_ORDER, vma, haddr, nd);
+}
+
+static inline pmd_t mk_huge_pmd(struct page *page, pgprot_t prot)
+{
+	pmd_t entry;
+	entry = mk_pmd(page, prot);
+	entry = pmd_mkhuge(entry);
+	return entry;
+}
+
 extern int do_huge_pmd_numa_page(struct mm_struct *mm, struct vm_area_struct *vma,
 				unsigned long addr, pmd_t pmd, pmd_t *pmdp);
 
diff --git a/include/linux/mm.h b/include/linux/mm.h
index c8a8626..4669c19 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -165,6 +165,11 @@ extern pgprot_t protection_map[16];
 #define FAULT_FLAG_RETRY_NOWAIT	0x10	/* Don't drop mmap_sem and wait when retrying */
 #define FAULT_FLAG_KILLABLE	0x20	/* The fault task is in SIGKILL killable region */
 #define FAULT_FLAG_TRIED	0x40	/* second try */
+#ifdef CONFIG_CONFIG_TRANSPARENT_HUGEPAGE
+#define	FAULT_FLAG_TRANSHUGE	0x80	/* Try to allocate transhuge page */
+#else
+#define FAULT_FLAG_TRANSHUGE	0	/* Optimize out THP code if disabled*/
+#endif
 
 /*
  * vm_fault is filled by the the pagefault handler and passed to the vma's
@@ -880,6 +885,7 @@ static inline int page_mapped(struct page *page)
 #define VM_FAULT_NOPAGE	0x0100	/* ->fault installed the pte, not return page */
 #define VM_FAULT_LOCKED	0x0200	/* ->fault locked the returned page */
 #define VM_FAULT_RETRY	0x0400	/* ->fault blocked, must retry */
+#define VM_FAULT_FALLBACK 0x0800	/* large page fault failed, fall back to small */
 
 #define VM_FAULT_HWPOISON_LARGE_MASK 0xf000 /* encodes hpage index for large hwpoison */
 
diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index 73691a3..e14fa81 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -692,14 +692,6 @@ pmd_t maybe_pmd_mkwrite(pmd_t pmd, struct vm_area_struct *vma)
 	return pmd;
 }
 
-static inline pmd_t mk_huge_pmd(struct page *page, pgprot_t prot)
-{
-	pmd_t entry;
-	entry = mk_pmd(page, prot);
-	entry = pmd_mkhuge(entry);
-	return entry;
-}
-
 static int __do_huge_pmd_anonymous_page(struct mm_struct *mm,
 					struct vm_area_struct *vma,
 					unsigned long haddr, pmd_t *pmd,
@@ -742,20 +734,6 @@ static int __do_huge_pmd_anonymous_page(struct mm_struct *mm,
 	return 0;
 }
 
-static inline gfp_t alloc_hugepage_gfpmask(int defrag, gfp_t extra_gfp)
-{
-	return (GFP_TRANSHUGE & ~(defrag ? 0 : __GFP_WAIT)) | extra_gfp;
-}
-
-static inline struct page *alloc_hugepage_vma(int defrag,
-					      struct vm_area_struct *vma,
-					      unsigned long haddr, int nd,
-					      gfp_t extra_gfp)
-{
-	return alloc_pages_vma(alloc_hugepage_gfpmask(defrag, extra_gfp),
-			       HPAGE_PMD_ORDER, vma, haddr, nd);
-}
-
 #ifndef CONFIG_NUMA
 static inline struct page *alloc_hugepage(int defrag)
 {
diff --git a/mm/memory.c b/mm/memory.c
index 5f782d6..e6efd8c 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -59,6 +59,7 @@
 #include <linux/gfp.h>
 #include <linux/migrate.h>
 #include <linux/string.h>
+#include <linux/khugepaged.h>
 
 #include <asm/io.h>
 #include <asm/pgalloc.h>
@@ -3229,6 +3230,53 @@ oom:
 	return VM_FAULT_OOM;
 }
 
+static inline bool transhuge_vma_suitable(struct vm_area_struct *vma,
+		unsigned long addr)
+{
+	unsigned long haddr = addr & HPAGE_PMD_MASK;
+
+	if (((vma->vm_start >> PAGE_SHIFT) & HPAGE_CACHE_INDEX_MASK) !=
+	    (vma->vm_pgoff & HPAGE_CACHE_INDEX_MASK)) {
+		return false;
+	}
+	if (haddr < vma->vm_start || haddr + HPAGE_PMD_SIZE > vma->vm_end) {
+		return false;
+	}
+	return true;
+}
+
+static struct page *alloc_fault_page_vma(struct vm_area_struct *vma,
+		unsigned long addr, unsigned int flags)
+{
+
+	if (flags & FAULT_FLAG_TRANSHUGE) {
+		struct page *page;
+		unsigned long haddr = addr & HPAGE_PMD_MASK;
+
+		page = alloc_hugepage_vma(transparent_hugepage_defrag(vma),
+				vma, haddr, numa_node_id(), 0);
+		if (page)
+			count_vm_event(THP_FAULT_ALLOC);
+		else
+			count_vm_event(THP_FAULT_FALLBACK);
+		return page;
+	}
+	return alloc_page_vma(GFP_HIGHUSER_MOVABLE, vma, addr);
+}
+
+static inline bool ptl_lock_and_check_entry(struct mm_struct *mm, pmd_t *pmd,
+	       unsigned long address, spinlock_t **ptl, pte_t **page_table,
+	       pte_t orig_pte, unsigned int flags)
+{
+	if (flags & FAULT_FLAG_TRANSHUGE) {
+		spin_lock(&mm->page_table_lock);
+		return !pmd_none(*pmd);
+	} else {
+		*page_table  = pte_offset_map_lock(mm, pmd, address, ptl);
+		return !pte_same(**page_table, orig_pte);
+	}
+}
+
 /*
  * __do_fault() tries to create a new page mapping. It aggressively
  * tries to share with existing pages, but makes a separate copy if
@@ -3246,45 +3294,61 @@ static int __do_fault(struct mm_struct *mm, struct vm_area_struct *vma,
 		unsigned long address, pmd_t *pmd,
 		pgoff_t pgoff, unsigned int flags, pte_t orig_pte)
 {
+	unsigned long haddr = address & PAGE_MASK;
 	pte_t *page_table;
 	spinlock_t *ptl;
-	struct page *page;
-	struct page *cow_page;
-	pte_t entry;
-	int anon = 0;
-	struct page *dirty_page = NULL;
+	struct page *page, *cow_page, *dirty_page = NULL;
+	bool anon = false, page_mkwrite = false;
+	bool try_huge_pages = !!(flags & FAULT_FLAG_TRANSHUGE);
+	pgtable_t pgtable = NULL;
 	struct vm_fault vmf;
-	int ret;
-	int page_mkwrite = 0;
+	int nr = 1, ret;
+
+	if (try_huge_pages) {
+		if (!transhuge_vma_suitable(vma, haddr))
+			return VM_FAULT_FALLBACK;
+		if (unlikely(khugepaged_enter(vma)))
+			return VM_FAULT_OOM;
+		nr = HPAGE_PMD_NR;
+		haddr = address & HPAGE_PMD_MASK;
+		pgoff = linear_page_index(vma, haddr);
+	}
 
 	/*
 	 * If we do COW later, allocate page befor taking lock_page()
 	 * on the file cache page. This will reduce lock holding time.
 	 */
 	if ((flags & FAULT_FLAG_WRITE) && !(vma->vm_flags & VM_SHARED)) {
-
 		if (unlikely(anon_vma_prepare(vma)))
 			return VM_FAULT_OOM;
 
-		cow_page = alloc_page_vma(GFP_HIGHUSER_MOVABLE, vma, address);
+		cow_page = alloc_fault_page_vma(vma, address, flags);
 		if (!cow_page)
-			return VM_FAULT_OOM;
+			return VM_FAULT_OOM | VM_FAULT_FALLBACK;
 
 		if (mem_cgroup_newpage_charge(cow_page, mm, GFP_KERNEL)) {
 			page_cache_release(cow_page);
-			return VM_FAULT_OOM;
+			return VM_FAULT_OOM | VM_FAULT_FALLBACK;
 		}
 	} else
 		cow_page = NULL;
 
-	vmf.virtual_address = (void __user *)(address & PAGE_MASK);
+	vmf.virtual_address = (void __user *)haddr;
 	vmf.pgoff = pgoff;
 	vmf.flags = flags;
 	vmf.page = NULL;
 
-	ret = vma->vm_ops->fault(vma, &vmf);
+	if (try_huge_pages) {
+		pgtable = pte_alloc_one(mm, haddr);
+		if (unlikely(!pgtable)) {
+			ret = VM_FAULT_OOM;
+			goto uncharge_out;
+		}
+		ret = vma->vm_ops->huge_fault(vma, &vmf);
+	} else
+		ret = vma->vm_ops->fault(vma, &vmf);
 	if (unlikely(ret & (VM_FAULT_ERROR | VM_FAULT_NOPAGE |
-			    VM_FAULT_RETRY)))
+			    VM_FAULT_RETRY | VM_FAULT_FALLBACK)))
 		goto uncharge_out;
 
 	if (unlikely(PageHWPoison(vmf.page))) {
@@ -3310,42 +3374,69 @@ static int __do_fault(struct mm_struct *mm, struct vm_area_struct *vma,
 	if (flags & FAULT_FLAG_WRITE) {
 		if (!(vma->vm_flags & VM_SHARED)) {
 			page = cow_page;
-			anon = 1;
-			copy_user_highpage(page, vmf.page, address, vma);
+			anon = true;
+			if (try_huge_pages)
+				copy_user_huge_page(page, vmf.page, haddr, vma,
+						HPAGE_PMD_NR);
+			else
+				copy_user_highpage(page, vmf.page,
+						address, vma);
 			__SetPageUptodate(page);
-		} else {
+		} else if (vma->vm_ops->page_mkwrite) {
 			/*
 			 * If the page will be shareable, see if the backing
 			 * address space wants to know that the page is about
 			 * to become writable
 			 */
-			if (vma->vm_ops->page_mkwrite) {
-				int tmp;
-
-				unlock_page(page);
-				vmf.flags = FAULT_FLAG_WRITE|FAULT_FLAG_MKWRITE;
-				tmp = vma->vm_ops->page_mkwrite(vma, &vmf);
-				if (unlikely(tmp &
-					  (VM_FAULT_ERROR | VM_FAULT_NOPAGE))) {
-					ret = tmp;
+			int tmp;
+
+			unlock_page(page);
+			vmf.flags = FAULT_FLAG_WRITE | FAULT_FLAG_MKWRITE;
+			tmp = vma->vm_ops->page_mkwrite(vma, &vmf);
+			if (unlikely(tmp &
+					(VM_FAULT_ERROR | VM_FAULT_NOPAGE))) {
+				ret = tmp;
+				goto unwritable_page;
+			}
+			if (unlikely(!(tmp & VM_FAULT_LOCKED))) {
+				lock_page(page);
+				if (!page->mapping) {
+					ret = 0; /* retry the fault */
+					unlock_page(page);
 					goto unwritable_page;
 				}
-				if (unlikely(!(tmp & VM_FAULT_LOCKED))) {
-					lock_page(page);
-					if (!page->mapping) {
-						ret = 0; /* retry the fault */
-						unlock_page(page);
-						goto unwritable_page;
-					}
-				} else
-					VM_BUG_ON(!PageLocked(page));
-				page_mkwrite = 1;
-			}
+			} else
+				VM_BUG_ON(!PageLocked(page));
+			page_mkwrite = true;
 		}
+	}
 
+	if (unlikely(ptl_lock_and_check_entry(mm, pmd, address,
+					&ptl, &page_table, orig_pte, flags))) {
+		/* pte/pmd has changed. do not touch it */
+		if (pgtable)
+			pte_free(mm, pgtable);
+		if (cow_page)
+			mem_cgroup_uncharge_page(cow_page);
+		if (anon)
+			page_cache_release(page);
+		unlock_page(vmf.page);
+		page_cache_release(vmf.page);
+		return ret;
 	}
 
-	page_table = pte_offset_map_lock(mm, pmd, address, &ptl);
+	flush_icache_page(vma, page);
+	if (anon) {
+		add_mm_counter_fast(mm, MM_ANONPAGES, nr);
+		page_add_new_anon_rmap(page, vma, address);
+	} else {
+		add_mm_counter_fast(mm, MM_FILEPAGES, nr);
+		page_add_file_rmap(page);
+		if (flags & FAULT_FLAG_WRITE) {
+			dirty_page = page;
+			get_page(dirty_page);
+		}
+	}
 
 	/*
 	 * This silly early PAGE_DIRTY setting removes a race
@@ -3358,43 +3449,28 @@ static int __do_fault(struct mm_struct *mm, struct vm_area_struct *vma,
 	 * handle that later.
 	 */
 	/* Only go through if we didn't race with anybody else... */
-	if (likely(pte_same(*page_table, orig_pte))) {
-		flush_icache_page(vma, page);
-		entry = mk_pte(page, vma->vm_page_prot);
+	if (try_huge_pages) {
+		pmd_t entry = mk_huge_pmd(page, vma->vm_page_prot);
 		if (flags & FAULT_FLAG_WRITE)
-			entry = maybe_mkwrite(pte_mkdirty(entry), vma);
-		if (anon) {
-			inc_mm_counter_fast(mm, MM_ANONPAGES);
-			page_add_new_anon_rmap(page, vma, address);
-		} else {
-			inc_mm_counter_fast(mm, MM_FILEPAGES);
-			page_add_file_rmap(page);
-			if (flags & FAULT_FLAG_WRITE) {
-				dirty_page = page;
-				get_page(dirty_page);
-			}
-		}
-		set_pte_at(mm, address, page_table, entry);
-
-		/* no need to invalidate: a not-present page won't be cached */
+			entry = maybe_pmd_mkwrite(pmd_mkdirty(entry), vma);
+		set_pmd_at(mm, address, pmd, entry);
 		update_mmu_cache(vma, address, page_table);
+		spin_unlock(&mm->page_table_lock);
 	} else {
-		if (cow_page)
-			mem_cgroup_uncharge_page(cow_page);
-		if (anon)
-			page_cache_release(page);
-		else
-			anon = 1; /* no anon but release faulted_page */
+		pte_t entry = mk_pte(page, vma->vm_page_prot);
+		if (flags & FAULT_FLAG_WRITE)
+			entry = maybe_mkwrite(pte_mkdirty(entry), vma);
+		set_pte_at(mm, address, page_table, entry);
+		update_mmu_cache_pmd(vma, address, pmd);
+		pte_unmap_unlock(page_table, ptl);
 	}
 
-	pte_unmap_unlock(page_table, ptl);
-
 	if (dirty_page) {
 		struct address_space *mapping = page->mapping;
-		int dirtied = 0;
+		bool dirtied = false;
 
 		if (set_page_dirty(dirty_page))
-			dirtied = 1;
+			dirtied = true;
 		unlock_page(dirty_page);
 		put_page(dirty_page);
 		if ((dirtied || page_mkwrite) && mapping) {
@@ -3413,13 +3489,16 @@ static int __do_fault(struct mm_struct *mm, struct vm_area_struct *vma,
 		if (anon)
 			page_cache_release(vmf.page);
 	}
-
 	return ret;
 
 unwritable_page:
+	if (pgtable)
+		pte_free(mm, pgtable);
 	page_cache_release(page);
 	return ret;
 uncharge_out:
+	if (pgtable)
+		pte_free(mm, pgtable);
 	/* fs's fault handler get error */
 	if (cow_page) {
 		mem_cgroup_uncharge_page(cow_page);
-- 
 Kirill A. Shutemov

^ permalink raw reply related	[flat|nested] 110+ messages in thread

* Re: [PATCHv3, RFC 31/34] thp: initial implementation of do_huge_linear_fault()
  2013-04-18 16:09           ` Kirill A. Shutemov
  (?)
@ 2013-04-18 16:19             ` Kirill A. Shutemov
  -1 siblings, 0 replies; 110+ messages in thread
From: Kirill A. Shutemov @ 2013-04-18 16:19 UTC (permalink / raw)
  To: Kirill A. Shutemov
  Cc: Dave Hansen, Kirill A. Shutemov, Andrea Arcangeli, Andrew Morton,
	Al Viro, Hugh Dickins, Wu Fengguang, Jan Kara, Mel Gorman,
	linux-mm, Andi Kleen, Matthew Wilcox, Kirill A. Shutemov,
	Hillf Danton, linux-fsdevel, linux-kernel

Kirill A. Shutemov wrote:
> diff --git a/include/linux/mm.h b/include/linux/mm.h
> index c8a8626..4669c19 100644
> --- a/include/linux/mm.h
> +++ b/include/linux/mm.h
> @@ -165,6 +165,11 @@ extern pgprot_t protection_map[16];
>  #define FAULT_FLAG_RETRY_NOWAIT	0x10	/* Don't drop mmap_sem and wait when retrying */
>  #define FAULT_FLAG_KILLABLE	0x20	/* The fault task is in SIGKILL killable region */
>  #define FAULT_FLAG_TRIED	0x40	/* second try */
> +#ifdef CONFIG_CONFIG_TRANSPARENT_HUGEPAGE

Oops, s/CONFIG_CONFIG/CONFIG/.

-- 
 Kirill A. Shutemov

^ permalink raw reply	[flat|nested] 110+ messages in thread

* Re: [PATCHv3, RFC 31/34] thp: initial implementation of do_huge_linear_fault()
  2013-04-18 16:09           ` Kirill A. Shutemov
@ 2013-04-18 16:20             ` Dave Hansen
  -1 siblings, 0 replies; 110+ messages in thread
From: Dave Hansen @ 2013-04-18 16:20 UTC (permalink / raw)
  To: Kirill A. Shutemov
  Cc: Andrea Arcangeli, Andrew Morton, Al Viro, Hugh Dickins,
	Wu Fengguang, Jan Kara, Mel Gorman, linux-mm, Andi Kleen,
	Matthew Wilcox, Kirill A. Shutemov, Hillf Danton, linux-fsdevel,
	linux-kernel

On 04/18/2013 09:09 AM, Kirill A. Shutemov wrote:
> Dave Hansen wrote:
>> On 04/17/2013 07:38 AM, Kirill A. Shutemov wrote:
>> Are you still sure you can't do _any_ better than a verbatim copy of 129
>> lines?
> 
> It seems I was too lazy. Shame on me. :(
> Here's consolidated version. Only build tested. Does it look better?

Yeah, it's definitely a step in the right direction.  There are
definitely some bugs in there like:

+	unsigned long haddr = address & PAGE_MASK;

I do think some of this refactoring stuff

> -				unlock_page(page);
> -				vmf.flags = FAULT_FLAG_WRITE|FAULT_FLAG_MKWRITE;
> -				tmp = vma->vm_ops->page_mkwrite(vma, &vmf);
> -				if (unlikely(tmp &
> -					  (VM_FAULT_ERROR | VM_FAULT_NOPAGE))) {
> -					ret = tmp;
> +			unlock_page(page);
> +			vmf.flags = FAULT_FLAG_WRITE | FAULT_FLAG_MKWRITE;
> +			tmp = vma->vm_ops->page_mkwrite(vma, &vmf);
> +			if (unlikely(tmp &
> +					(VM_FAULT_ERROR | VM_FAULT_NOPAGE))) {
> +				ret = tmp;
> +				goto unwritable_page;
> +			}

could probably be a separate patch and would make what's going on
clearer, but it's passable the way it is.  When it is done this way it's
sometimes hard, reading the diff, to tell whether you are adding code or
just moving it around.

This stuff:

>  		if (set_page_dirty(dirty_page))
> -			dirtied = 1;
> +			dirtied = true;

needs to go in another patch for sure.

One thing I *REALLY* like about doing patches this way is that things
like this start to pop out:

> -	ret = vma->vm_ops->fault(vma, &vmf);
> +	if (try_huge_pages) {
> +		pgtable = pte_alloc_one(mm, haddr);
> +		if (unlikely(!pgtable)) {
> +			ret = VM_FAULT_OOM;
> +			goto uncharge_out;
> +		}
> +		ret = vma->vm_ops->huge_fault(vma, &vmf);
> +	} else
> +		ret = vma->vm_ops->fault(vma, &vmf);

The ->fault is (or can be) essentially per filesystem, and we're going
to be adding support per-filesystem.  Any reason we can't just handle
this inside the ->fault code and avoid adding huge_fault altogether?
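
One way that could look, as a sketch only: assume the core passes
FAULT_FLAG_TRANSHUGE down in vmf->flags and the filesystem opts in from its
regular ->fault; the fs-side helper below is made up for illustration.

	static int example_fault(struct vm_area_struct *vma, struct vm_fault *vmf)
	{
		if (vmf->flags & FAULT_FLAG_TRANSHUGE) {
			/* hypothetical fs-specific huge page lookup; on any
			 * trouble, tell the core to retry with small pages */
			if (!example_find_get_huge_page(vma->vm_file->f_mapping,
							vmf->pgoff, &vmf->page))
				return VM_FAULT_FALLBACK;
			return VM_FAULT_LOCKED;
		}
		/* existing small-page path */
		return filemap_fault(vma, vmf);
	}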

^ permalink raw reply	[flat|nested] 110+ messages in thread

* Re: [PATCHv3, RFC 31/34] thp: initial implementation of do_huge_linear_fault()
  2013-04-18 16:20             ` Dave Hansen
@ 2013-04-18 16:38               ` Kirill A. Shutemov
  -1 siblings, 0 replies; 110+ messages in thread
From: Kirill A. Shutemov @ 2013-04-18 16:38 UTC (permalink / raw)
  To: Dave Hansen
  Cc: Kirill A. Shutemov, Andrea Arcangeli, Andrew Morton, Al Viro,
	Hugh Dickins, Wu Fengguang, Jan Kara, Mel Gorman, linux-mm,
	Andi Kleen, Matthew Wilcox, Kirill A. Shutemov, Hillf Danton,
	linux-fsdevel, linux-kernel

Dave Hansen wrote:
> On 04/18/2013 09:09 AM, Kirill A. Shutemov wrote:
> > Dave Hansen wrote:
> >> On 04/17/2013 07:38 AM, Kirill A. Shutemov wrote:
> >> Are you still sure you can't do _any_ better than a verbatim copy of 129
> >> lines?
> > 
> > It seems I was too lazy. Shame on me. :(
> > Here's consolidated version. Only build tested. Does it look better?
> 
> Yeah, it's definitely a step in the right direction.  There are
> definitely some bugs in there like:
> 
> +	unsigned long haddr = address & PAGE_MASK;

It's not a bug, just a bad name for the variable.
See the first 'if (try_huge_pages)': I update it there for the huge page case.

Would addr_aligned be better?

> 
> I do think some of this refactoring stuff
> 
> > -				unlock_page(page);
> > -				vmf.flags = FAULT_FLAG_WRITE|FAULT_FLAG_MKWRITE;
> > -				tmp = vma->vm_ops->page_mkwrite(vma, &vmf);
> > -				if (unlikely(tmp &
> > -					  (VM_FAULT_ERROR | VM_FAULT_NOPAGE))) {
> > -					ret = tmp;
> > +			unlock_page(page);
> > +			vmf.flags = FAULT_FLAG_WRITE | FAULT_FLAG_MKWRITE;
> > +			tmp = vma->vm_ops->page_mkwrite(vma, &vmf);
> > +			if (unlikely(tmp &
> > +					(VM_FAULT_ERROR | VM_FAULT_NOPAGE))) {
> > +				ret = tmp;
> > +				goto unwritable_page;
> > +			}
> 
> could probably be a separate patch and would make what's going on more
> clear, but it's passable the way it is.  When it is done this way it's
> hard sometimes reading the diff to realize if you are adding code or
> just moving it around.

Will do.

> 
> This stuff:
> 
> >  		if (set_page_dirty(dirty_page))
> > -			dirtied = 1;
> > +			dirtied = true;
> 
> needs to go in another patch for sure.

Ditto.

> One thing I *REALLY* like about doing patches this way is that things
> like this start to pop out:
> 
> > -	ret = vma->vm_ops->fault(vma, &vmf);
> > +	if (try_huge_pages) {
> > +		pgtable = pte_alloc_one(mm, haddr);
> > +		if (unlikely(!pgtable)) {
> > +			ret = VM_FAULT_OOM;
> > +			goto uncharge_out;
> > +		}
> > +		ret = vma->vm_ops->huge_fault(vma, &vmf);
> > +	} else
> > +		ret = vma->vm_ops->fault(vma, &vmf);
> 
> The ->fault is (or can be) essentially per filesystem, and we're going
> to be adding support per-filesystem.  Any reason we can't just handle
> this inside the ->fault code and avoid adding huge_fault altogether?

will check. it's on my todo list already.

-- 
 Kirill A. Shutemov

^ permalink raw reply	[flat|nested] 110+ messages in thread

* Re: [PATCHv3, RFC 31/34] thp: initial implementation of do_huge_linear_fault()
  2013-04-18 16:38               ` Kirill A. Shutemov
@ 2013-04-18 16:42                 ` Dave Hansen
  -1 siblings, 0 replies; 110+ messages in thread
From: Dave Hansen @ 2013-04-18 16:42 UTC (permalink / raw)
  To: Kirill A. Shutemov
  Cc: Andrea Arcangeli, Andrew Morton, Al Viro, Hugh Dickins,
	Wu Fengguang, Jan Kara, Mel Gorman, linux-mm, Andi Kleen,
	Matthew Wilcox, Kirill A. Shutemov, Hillf Danton, linux-fsdevel,
	linux-kernel

On 04/18/2013 09:38 AM, Kirill A. Shutemov wrote:
> Dave Hansen wrote:
>> On 04/18/2013 09:09 AM, Kirill A. Shutemov wrote:
>>> Dave Hansen wrote:
>>>> On 04/17/2013 07:38 AM, Kirill A. Shutemov wrote:
>>>> Are you still sure you can't do _any_ better than a verbatim copy of 129
>>>> lines?
>>>
>>> It seems I was too lazy. Shame on me. :(
>>> Here's consolidated version. Only build tested. Does it look better?
>>
>> Yeah, it's definitely a step in the right direction.  There are
>> definitely some bugs in there like:
>>
>> +	unsigned long haddr = address & PAGE_MASK;
> 
> It's not a bug, just a bad name for the variable.
> See the first 'if (try_huge_pages)': I update it there for the huge page case.
> 
> Would addr_aligned be better?

That's a criminally bad name. :)

addr_aligned is better, and also please initialize the two cases
together.  It's mean to separate them.



^ permalink raw reply	[flat|nested] 110+ messages in thread

* Re: [PATCHv3, RFC 32/34] thp: handle write-protect exception to file-backed huge pages
  2013-04-08 19:07     ` Dave Hansen
@ 2013-04-26 15:31       ` Kirill A. Shutemov
  -1 siblings, 0 replies; 110+ messages in thread
From: Kirill A. Shutemov @ 2013-04-26 15:31 UTC (permalink / raw)
  To: Dave Hansen
  Cc: Kirill A. Shutemov, Andrea Arcangeli, Andrew Morton, Al Viro,
	Hugh Dickins, Wu Fengguang, Jan Kara, Mel Gorman, linux-mm,
	Andi Kleen, Matthew Wilcox, Kirill A. Shutemov, Hillf Danton,
	linux-fsdevel, linux-kernel

Dave Hansen wrote:
> > +			if (!PageAnon(page)) {
> > +				add_mm_counter(mm, MM_FILEPAGES, -HPAGE_PMD_NR);
> > +				add_mm_counter(mm, MM_ANONPAGES, HPAGE_PMD_NR);
> > +			}
> 
> This seems like a bit of a hack.  Shouldn't we have just been accounting
> to MM_FILEPAGES in the first place?

No, it's not a hack.

It handles MAP_PRIVATE file mappings: the page was first read and
accounted to MM_FILEPAGES, and it gets COW'ed into an anon page here, so we
have to adjust the counters. do_wp_page() has similar code.
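
For reference, the small-page analogue in do_wp_page() does roughly the
following (paraphrased, not an exact quote of mm/memory.c); the hunk above
is the same flip, scaled by HPAGE_PMD_NR for a huge page:

	/* A private file page that gets COW'ed becomes anon for this mm. */
	if (old_page && !PageAnon(old_page)) {
		dec_mm_counter_fast(mm, MM_FILEPAGES);
		inc_mm_counter_fast(mm, MM_ANONPAGES);
	}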

-- 
 Kirill A. Shutemov

^ permalink raw reply	[flat|nested] 110+ messages in thread


end of thread, other threads:[~2013-04-26 15:31 UTC | newest]

Thread overview: 110+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2013-04-05 11:59 [PATCHv3, RFC 00/34] Transparent huge page cache Kirill A. Shutemov
2013-04-05 11:59 ` Kirill A. Shutemov
2013-04-05 11:59 ` [PATCHv3, RFC 01/34] mm: drop actor argument of do_generic_file_read() Kirill A. Shutemov
2013-04-05 11:59   ` Kirill A. Shutemov
2013-04-05 11:59 ` [PATCHv3, RFC 02/34] block: implement add_bdi_stat() Kirill A. Shutemov
2013-04-05 11:59   ` Kirill A. Shutemov
2013-04-05 11:59 ` [PATCHv3, RFC 03/34] mm: implement zero_huge_user_segment and friends Kirill A. Shutemov
2013-04-05 11:59   ` Kirill A. Shutemov
2013-04-05 11:59 ` [PATCHv3, RFC 04/34] radix-tree: implement preload for multiple contiguous elements Kirill A. Shutemov
2013-04-05 11:59   ` Kirill A. Shutemov
2013-04-05 11:59 ` [PATCHv3, RFC 05/34] memcg, thp: charge huge cache pages Kirill A. Shutemov
2013-04-05 11:59   ` Kirill A. Shutemov
2013-04-05 11:59 ` [PATCHv3, RFC 06/34] thp, mm: avoid PageUnevictable on active/inactive lru lists Kirill A. Shutemov
2013-04-05 11:59   ` Kirill A. Shutemov
2013-04-05 11:59 ` [PATCHv3, RFC 07/34] thp, mm: basic defines for transparent huge page cache Kirill A. Shutemov
2013-04-05 11:59   ` Kirill A. Shutemov
2013-04-05 11:59 ` [PATCHv3, RFC 08/34] thp, mm: introduce mapping_can_have_hugepages() predicate Kirill A. Shutemov
2013-04-05 11:59   ` Kirill A. Shutemov
2013-04-05 11:59 ` [PATCHv3, RFC 09/34] thp: represent file thp pages in meminfo and friends Kirill A. Shutemov
2013-04-05 11:59   ` Kirill A. Shutemov
2013-04-08 19:38   ` Dave Hansen
2013-04-08 19:38     ` Dave Hansen
2013-04-16 14:49     ` Kirill A. Shutemov
2013-04-16 14:49       ` Kirill A. Shutemov
2013-04-16 15:11       ` Dave Hansen
2013-04-16 15:11         ` Dave Hansen
2013-04-05 11:59 ` [PATCHv3, RFC 10/34] thp, mm: rewrite add_to_page_cache_locked() to support huge pages Kirill A. Shutemov
2013-04-05 11:59   ` Kirill A. Shutemov
2013-04-05 11:59 ` [PATCHv3, RFC 11/34] mm: trace filemap: dump page order Kirill A. Shutemov
2013-04-05 11:59   ` Kirill A. Shutemov
2013-04-05 11:59 ` [PATCHv3, RFC 12/34] thp, mm: rewrite delete_from_page_cache() to support huge pages Kirill A. Shutemov
2013-04-05 11:59   ` Kirill A. Shutemov
2013-04-05 11:59 ` [PATCHv3, RFC 13/34] thp, mm: trigger bug in replace_page_cache_page() on THP Kirill A. Shutemov
2013-04-05 11:59   ` Kirill A. Shutemov
2013-04-05 11:59 ` [PATCHv3, RFC 14/34] thp, mm: locking tail page is a bug Kirill A. Shutemov
2013-04-05 11:59   ` Kirill A. Shutemov
2013-04-05 11:59 ` [PATCHv3, RFC 15/34] thp, mm: handle tail pages in page_cache_get_speculative() Kirill A. Shutemov
2013-04-05 11:59   ` Kirill A. Shutemov
2013-04-05 11:59 ` [PATCHv3, RFC 16/34] thp, mm: add event counters for huge page alloc on write to a file Kirill A. Shutemov
2013-04-05 11:59   ` Kirill A. Shutemov
2013-04-05 11:59 ` [PATCHv3, RFC 17/34] thp, mm: implement grab_thp_write_begin() Kirill A. Shutemov
2013-04-05 11:59   ` Kirill A. Shutemov
2013-04-05 11:59 ` [PATCHv3, RFC 18/34] thp, mm: naive support of thp in generic read/write routines Kirill A. Shutemov
2013-04-05 11:59   ` Kirill A. Shutemov
2013-04-05 11:59 ` [PATCHv3, RFC 19/34] thp, libfs: initial support of thp in simple_read/write_begin/write_end Kirill A. Shutemov
2013-04-05 11:59   ` Kirill A. Shutemov
2013-04-05 11:59 ` [PATCHv3, RFC 20/34] thp: handle file pages in split_huge_page() Kirill A. Shutemov
2013-04-05 11:59   ` Kirill A. Shutemov
2013-04-05 11:59 ` [PATCHv3, RFC 21/34] thp: wait_split_huge_page(): serialize over i_mmap_mutex too Kirill A. Shutemov
2013-04-05 11:59   ` Kirill A. Shutemov
2013-04-05 11:59 ` [PATCHv3, RFC 22/34] thp, mm: truncate support for transparent huge page cache Kirill A. Shutemov
2013-04-05 11:59   ` Kirill A. Shutemov
2013-04-05 11:59 ` [PATCHv3, RFC 23/34] thp, mm: split huge page on mmap file page Kirill A. Shutemov
2013-04-05 11:59   ` Kirill A. Shutemov
2013-04-05 11:59 ` [PATCHv3, RFC 24/34] ramfs: enable transparent huge page cache Kirill A. Shutemov
2013-04-05 11:59   ` Kirill A. Shutemov
2013-04-05 11:59 ` [PATCHv3, RFC 25/34] x86-64, mm: proper alignment mappings with hugepages Kirill A. Shutemov
2013-04-05 11:59   ` Kirill A. Shutemov
2013-04-05 11:59 ` [PATCHv3, RFC 26/34] mm: add huge_fault() callback to vm_operations_struct Kirill A. Shutemov
2013-04-05 11:59   ` Kirill A. Shutemov
2013-04-05 11:59 ` [PATCHv3, RFC 27/34] thp: prepare zap_huge_pmd() to uncharge file pages Kirill A. Shutemov
2013-04-05 11:59   ` Kirill A. Shutemov
2013-04-05 11:59 ` [PATCHv3, RFC 28/34] thp: move maybe_pmd_mkwrite() out of mk_huge_pmd() Kirill A. Shutemov
2013-04-05 11:59   ` Kirill A. Shutemov
2013-04-05 11:59 ` [PATCHv3, RFC 29/34] thp, mm: basic huge_fault implementation for generic_file_vm_ops Kirill A. Shutemov
2013-04-05 11:59   ` Kirill A. Shutemov
2013-04-05 11:59 ` [PATCHv3, RFC 30/34] thp: extract fallback path from do_huge_pmd_anonymous_page() to a function Kirill A. Shutemov
2013-04-05 11:59   ` Kirill A. Shutemov
2013-04-05 11:59 ` [PATCHv3, RFC 31/34] thp: initial implementation of do_huge_linear_fault() Kirill A. Shutemov
2013-04-05 11:59   ` Kirill A. Shutemov
2013-04-08 18:46   ` Dave Hansen
2013-04-08 18:46     ` Dave Hansen
2013-04-08 18:52   ` Dave Hansen
2013-04-08 18:52     ` Dave Hansen
2013-04-17 14:38     ` Kirill A. Shutemov
2013-04-17 14:38       ` Kirill A. Shutemov
2013-04-17 22:07       ` Dave Hansen
2013-04-18 16:09         ` Kirill A. Shutemov
2013-04-18 16:09           ` Kirill A. Shutemov
2013-04-18 16:19           ` Kirill A. Shutemov
2013-04-18 16:19             ` Kirill A. Shutemov
2013-04-18 16:19             ` Kirill A. Shutemov
2013-04-18 16:20           ` Dave Hansen
2013-04-18 16:20             ` Dave Hansen
2013-04-18 16:38             ` Kirill A. Shutemov
2013-04-18 16:38               ` Kirill A. Shutemov
2013-04-18 16:42               ` Dave Hansen
2013-04-18 16:42                 ` Dave Hansen
2013-04-05 11:59 ` [PATCHv3, RFC 32/34] thp: handle write-protect exception to file-backed huge pages Kirill A. Shutemov
2013-04-05 11:59   ` Kirill A. Shutemov
2013-04-08 19:07   ` Dave Hansen
2013-04-08 19:07     ` Dave Hansen
2013-04-26 15:31     ` Kirill A. Shutemov
2013-04-26 15:31       ` Kirill A. Shutemov
2013-04-05 11:59 ` [PATCHv3, RFC 33/34] thp: call __vma_adjust_trans_huge() for file-backed VMA Kirill A. Shutemov
2013-04-05 11:59   ` Kirill A. Shutemov
2013-04-05 11:59 ` [PATCHv3, RFC 34/34] thp: map file-backed huge pages on fault Kirill A. Shutemov
2013-04-05 11:59   ` Kirill A. Shutemov
2013-04-07  0:40 ` [PATCHv3, RFC 00/34] Transparent huge page cache Ric Mason
2013-04-07  0:40   ` Ric Mason
2013-04-15 16:02 ` IOZone with transparent " Kirill A. Shutemov
2013-04-15 16:02   ` Kirill A. Shutemov
2013-04-15 18:17 ` [RESEND] " Kirill A. Shutemov
2013-04-15 18:17   ` Kirill A. Shutemov
2013-04-15 23:19   ` Dave Hansen
2013-04-15 23:19     ` Dave Hansen
2013-04-16  5:57     ` Kirill A. Shutemov
2013-04-16  5:57       ` Kirill A. Shutemov
2013-04-16  6:11       ` Dave Hansen
2013-04-16  6:11         ` Dave Hansen
