* [PATCH, RFC 00/16] Transparent huge page cache
@ 2013-01-28  9:24 ` Kirill A. Shutemov
  0 siblings, 0 replies; 66+ messages in thread
From: Kirill A. Shutemov @ 2013-01-28  9:24 UTC (permalink / raw)
  To: Andrea Arcangeli, Andrew Morton, Al Viro
  Cc: Wu Fengguang, Jan Kara, Mel Gorman, linux-mm, Andi Kleen,
	Matthew Wilcox, Kirill A. Shutemov, linux-fsdevel, linux-kernel,
	Kirill A. Shutemov

From: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>

Here are the first steps towards huge pages in the page cache.

The intent of the work is to get the code ready to enable transparent huge
page cache for the simplest fs -- ramfs.

It's nowhere near feature-complete yet. It only provides basic infrastructure.
At the moment we can read, write and truncate files on ramfs with huge pages in
the page cache. The most interesting part, mmap(), is not there yet. For now
we split the huge page on any mmap() attempt.

I can't say that I see the whole picture. I'm not sure whether I understand
the locking model around split_huge_page(). Probably not.
Andrea, could you check whether it looks correct?

Next steps (not necessarily in this order):
 - mmap();
 - migration (?);
 - collapse;
 - stats, knobs, etc.;
 - tmpfs/shmem enabling;
 - ...

Kirill A. Shutemov (16):
  block: implement add_bdi_stat()
  mm: implement zero_huge_user_segment and friends
  mm: drop actor argument of do_generic_file_read()
  radix-tree: implement preload for multiple contiguous elements
  thp, mm: basic defines for transparent huge page cache
  thp, mm: rewrite add_to_page_cache_locked() to support huge pages
  thp, mm: rewrite delete_from_page_cache() to support huge pages
  thp, mm: locking tail page is a bug
  thp, mm: handle tail pages in page_cache_get_speculative()
  thp, mm: implement grab_cache_huge_page_write_begin()
  thp, mm: naive support of thp in generic read/write routines
  thp, libfs: initial support of thp in
    simple_read/write_begin/write_end
  thp: handle file pages in split_huge_page()
  thp, mm: truncate support for transparent huge page cache
  thp, mm: split huge page on mmap file page
  ramfs: enable transparent huge page cache

 fs/libfs.c                  |   54 +++++++++---
 fs/ramfs/inode.c            |    6 +-
 include/linux/backing-dev.h |   10 +++
 include/linux/huge_mm.h     |    8 ++
 include/linux/mm.h          |   15 ++++
 include/linux/pagemap.h     |   14 ++-
 include/linux/radix-tree.h  |    3 +
 lib/radix-tree.c            |   32 +++++--
 mm/filemap.c                |  204 +++++++++++++++++++++++++++++++++++--------
 mm/huge_memory.c            |   62 +++++++++++--
 mm/memory.c                 |   22 +++++
 mm/truncate.c               |   12 +++
 12 files changed, 375 insertions(+), 67 deletions(-)

-- 
1.7.10.4


^ permalink raw reply	[flat|nested] 66+ messages in thread

* [PATCH, RFC 01/16] block: implement add_bdi_stat()
  2013-01-28  9:24 ` Kirill A. Shutemov
@ 2013-01-28  9:24   ` Kirill A. Shutemov
  -1 siblings, 0 replies; 66+ messages in thread
From: Kirill A. Shutemov @ 2013-01-28  9:24 UTC (permalink / raw)
  To: Andrea Arcangeli, Andrew Morton, Al Viro
  Cc: Wu Fengguang, Jan Kara, Mel Gorman, linux-mm, Andi Kleen,
	Matthew Wilcox, Kirill A. Shutemov, linux-fsdevel, linux-kernel,
	Kirill A. Shutemov

From: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>

It's needed for batched stat updates.

Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
---
 include/linux/backing-dev.h |   10 ++++++++++
 1 file changed, 10 insertions(+)

diff --git a/include/linux/backing-dev.h b/include/linux/backing-dev.h
index 3504599..b05d961 100644
--- a/include/linux/backing-dev.h
+++ b/include/linux/backing-dev.h
@@ -167,6 +167,16 @@ static inline void __dec_bdi_stat(struct backing_dev_info *bdi,
 	__add_bdi_stat(bdi, item, -1);
 }
 
+static inline void add_bdi_stat(struct backing_dev_info *bdi,
+		enum bdi_stat_item item, s64 amount)
+{
+	unsigned long flags;
+
+	local_irq_save(flags);
+	__add_bdi_stat(bdi, item, amount);
+	local_irq_restore(flags);
+}
+
 static inline void dec_bdi_stat(struct backing_dev_info *bdi,
 		enum bdi_stat_item item)
 {
-- 
1.7.10.4
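The shape of the new helper can be modelled in userspace (a sketch only: the
irq primitives are stubbed out and `struct backing_dev_info` is a simplified
stand-in, not the kernel's definition). The point is that one
save/restore pair applies a whole batched delta, instead of N calls to the
inc/dec helpers:

```c
#include <assert.h>

/* Simplified stand-ins for the kernel types (illustration only). */
enum bdi_stat_item { BDI_RECLAIMABLE, BDI_WRITEBACK, NR_BDI_STAT_ITEMS };

struct backing_dev_info {
	long long bdi_stat[NR_BDI_STAT_ITEMS];
};

/* Stubs: in the kernel these disable/restore local interrupts. */
static unsigned long sim_irq_flags;
#define local_irq_save(flags)    ((flags) = sim_irq_flags)
#define local_irq_restore(flags) (sim_irq_flags = (flags))

/* Must be called with interrupts disabled (as in the kernel). */
static void __add_bdi_stat(struct backing_dev_info *bdi,
		enum bdi_stat_item item, long long amount)
{
	bdi->bdi_stat[item] += amount;
}

/* The new helper: one irq-save section around an arbitrary delta. */
static void add_bdi_stat(struct backing_dev_info *bdi,
		enum bdi_stat_item item, long long amount)
{
	unsigned long flags;

	local_irq_save(flags);
	__add_bdi_stat(bdi, item, amount);
	local_irq_restore(flags);
}
```

With huge pages, accounting changes by HPAGE_CACHE_NR pages at a time, so a
single `add_bdi_stat(bdi, BDI_WRITEBACK, 512)` replaces 512 increments.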


^ permalink raw reply related	[flat|nested] 66+ messages in thread

* [PATCH, RFC 02/16] mm: implement zero_huge_user_segment and friends
  2013-01-28  9:24 ` Kirill A. Shutemov
@ 2013-01-28  9:24   ` Kirill A. Shutemov
  -1 siblings, 0 replies; 66+ messages in thread
From: Kirill A. Shutemov @ 2013-01-28  9:24 UTC (permalink / raw)
  To: Andrea Arcangeli, Andrew Morton, Al Viro
  Cc: Wu Fengguang, Jan Kara, Mel Gorman, linux-mm, Andi Kleen,
	Matthew Wilcox, Kirill A. Shutemov, linux-fsdevel, linux-kernel,
	Kirill A. Shutemov

From: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>

Let's add helpers to clear huge page segment(s). They provide the same
functionality as zero_user_segment{,s} and zero_user, but for huge
pages.

Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
---
 include/linux/mm.h |   15 +++++++++++++++
 mm/memory.c        |   22 ++++++++++++++++++++++
 2 files changed, 37 insertions(+)

diff --git a/include/linux/mm.h b/include/linux/mm.h
index e4533a1..c011771 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -1728,6 +1728,21 @@ extern void dump_page(struct page *page);
 extern void clear_huge_page(struct page *page,
 			    unsigned long addr,
 			    unsigned int pages_per_huge_page);
+extern void zero_huge_user_segment(struct page *page,
+		unsigned start, unsigned end);
+static inline void zero_huge_user_segments(struct page *page,
+		unsigned start1, unsigned end1,
+		unsigned start2, unsigned end2)
+{
+	zero_huge_user_segment(page, start1, end1);
+	zero_huge_user_segment(page, start2, end2);
+}
+static inline void zero_huge_user(struct page *page,
+		unsigned start, unsigned len)
+{
+	zero_huge_user_segment(page, start, start+len);
+}
+
 extern void copy_user_huge_page(struct page *dst, struct page *src,
 				unsigned long addr, struct vm_area_struct *vma,
 				unsigned int pages_per_huge_page);
diff --git a/mm/memory.c b/mm/memory.c
index c04078b..200a74d 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -4185,6 +4185,28 @@ void clear_huge_page(struct page *page,
 	}
 }
 
+void zero_huge_user_segment(struct page *page, unsigned start, unsigned end)
+{
+	int i;
+
+	BUG_ON(end < start);
+
+	might_sleep();
+
+	/* start and end are on the same small page */
+	if ((start & PAGE_MASK) == (end & PAGE_MASK))
+		return zero_user_segment(page + (start >> PAGE_SHIFT),
+				start & ~PAGE_MASK, end & ~PAGE_MASK);
+
+	zero_user_segment(page + (start >> PAGE_SHIFT),
+			start & ~PAGE_MASK, PAGE_SIZE);
+	for (i = (start >> PAGE_SHIFT) + 1; i < (end >> PAGE_SHIFT) - 1; i++) {
+		cond_resched();
+		clear_highpage(page + i);
+	}
+	zero_user_segment(page + i, 0, end & ~PAGE_MASK);
+}
+
 static void copy_user_gigantic_page(struct page *dst, struct page *src,
 				    unsigned long addr,
 				    struct vm_area_struct *vma,
-- 
1.7.10.4
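The per-subpage split can be modelled in userspace. This is a sketch, not
the patch's code: the huge page is a flat buffer here, whereas in the kernel
each subpage must be cleared through its own mapping
(zero_user_segment()/clear_highpage()), which is why the real code works one
small page at a time:

```c
#include <assert.h>
#include <string.h>
#include <stdlib.h>

#define SIM_PAGE_SIZE 4096UL
#define SIM_PAGE_MASK (~(SIM_PAGE_SIZE - 1))

/* Zero bytes [start, end) of a simulated compound page, one small-page
 * segment at a time. Handles all the interesting cases: start and end in
 * the same subpage, partial first/last subpages, and a page-aligned end
 * (where the last touched subpage is cleared in full). */
static void sim_zero_huge_segment(unsigned char *huge,
		unsigned long start, unsigned long end)
{
	unsigned long pos = start;

	assert(end >= start);
	while (pos < end) {
		/* end of the small page containing pos, clamped to 'end' */
		unsigned long seg_end = (pos & SIM_PAGE_MASK) + SIM_PAGE_SIZE;

		if (seg_end > end)
			seg_end = end;
		memset(huge + pos, 0, seg_end - pos);	/* one subpage segment */
		pos = seg_end;
	}
}
```

For example, zeroing [100, 8192) on 4 KiB subpages clears bytes 100..4095 of
subpage 0 and all of subpage 1, leaving bytes 0..99 and everything from 8192
onwards untouched.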


^ permalink raw reply related	[flat|nested] 66+ messages in thread

* [PATCH, RFC 03/16] mm: drop actor argument of do_generic_file_read()
  2013-01-28  9:24 ` Kirill A. Shutemov
@ 2013-01-28  9:24   ` Kirill A. Shutemov
  -1 siblings, 0 replies; 66+ messages in thread
From: Kirill A. Shutemov @ 2013-01-28  9:24 UTC (permalink / raw)
  To: Andrea Arcangeli, Andrew Morton, Al Viro
  Cc: Wu Fengguang, Jan Kara, Mel Gorman, linux-mm, Andi Kleen,
	Matthew Wilcox, Kirill A. Shutemov, linux-fsdevel, linux-kernel,
	Kirill A. Shutemov

From: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>

There's only one caller of do_generic_file_read() and the only actor is
file_read_actor(). No reason to have a callback parameter.

Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
---
 mm/filemap.c |   10 +++++-----
 1 file changed, 5 insertions(+), 5 deletions(-)

diff --git a/mm/filemap.c b/mm/filemap.c
index c610076..b6a6d7e 100644
--- a/mm/filemap.c
+++ b/mm/filemap.c
@@ -1070,7 +1070,6 @@ static void shrink_readahead_size_eio(struct file *filp,
  * @filp:	the file to read
  * @ppos:	current file position
  * @desc:	read_descriptor
- * @actor:	read method
  *
  * This is a generic file read routine, and uses the
  * mapping->a_ops->readpage() function for the actual low-level stuff.
@@ -1079,7 +1078,7 @@ static void shrink_readahead_size_eio(struct file *filp,
  * of the logic when it comes to error handling etc.
  */
 static void do_generic_file_read(struct file *filp, loff_t *ppos,
-		read_descriptor_t *desc, read_actor_t actor)
+		read_descriptor_t *desc)
 {
 	struct address_space *mapping = filp->f_mapping;
 	struct inode *inode = mapping->host;
@@ -1180,13 +1179,14 @@ page_ok:
 		 * Ok, we have the page, and it's up-to-date, so
 		 * now we can copy it to user space...
 		 *
-		 * The actor routine returns how many bytes were actually used..
+		 * The file_read_actor routine returns how many bytes were
+		 * actually used..
 		 * NOTE! This may not be the same as how much of a user buffer
 		 * we filled up (we may be padding etc), so we can only update
 		 * "pos" here (the actor routine has to update the user buffer
 		 * pointers and the remaining count).
 		 */
-		ret = actor(desc, page, offset, nr);
+		ret = file_read_actor(desc, page, offset, nr);
 		offset += ret;
 		index += offset >> PAGE_CACHE_SHIFT;
 		offset &= ~PAGE_CACHE_MASK;
@@ -1459,7 +1459,7 @@ generic_file_aio_read(struct kiocb *iocb, const struct iovec *iov,
 		if (desc.count == 0)
 			continue;
 		desc.error = 0;
-		do_generic_file_read(filp, ppos, &desc, file_read_actor);
+		do_generic_file_read(filp, ppos, &desc);
 		retval += desc.written;
 		if (desc.error) {
 			retval = retval ?: desc.error;
-- 
1.7.10.4
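The refactor is the usual de-virtualization pattern: when a function-pointer
parameter has exactly one implementation, drop the indirection and call the
function directly. A minimal sketch (hypothetical simplified types, not the
kernel's):

```c
#include <assert.h>
#include <stddef.h>

/* Simplified stand-in for the kernel's read_descriptor_t. */
typedef struct {
	size_t count;	/* bytes remaining in the user buffer */
	size_t written;	/* bytes copied so far */
} read_descriptor_t;

/* The single remaining actor: consume up to nr bytes. */
static size_t file_read_actor(read_descriptor_t *desc, size_t nr)
{
	size_t n = nr < desc->count ? nr : desc->count;

	desc->written += n;
	desc->count -= n;
	return n;
}

/* Before: do_read(..., read_actor_t actor) and 'ret = actor(desc, nr)'.
 * After: the only known actor is called directly, as in this patch. */
static size_t do_read(read_descriptor_t *desc, size_t nr)
{
	return file_read_actor(desc, nr);
}
```

Besides removing dead flexibility, the direct call lets the compiler inline
the actor.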


^ permalink raw reply related	[flat|nested] 66+ messages in thread

* [PATCH, RFC 04/16] radix-tree: implement preload for multiple contiguous elements
  2013-01-28  9:24 ` Kirill A. Shutemov
@ 2013-01-28  9:24   ` Kirill A. Shutemov
  -1 siblings, 0 replies; 66+ messages in thread
From: Kirill A. Shutemov @ 2013-01-28  9:24 UTC (permalink / raw)
  To: Andrea Arcangeli, Andrew Morton, Al Viro
  Cc: Wu Fengguang, Jan Kara, Mel Gorman, linux-mm, Andi Kleen,
	Matthew Wilcox, Kirill A. Shutemov, linux-fsdevel, linux-kernel,
	Kirill A. Shutemov

From: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>

Currently radix_tree_preload() only guarantees enough nodes to insert
one element. It's a hard limit. You cannot batch a number of inserts
under one tree_lock.

This patch introduces radix_tree_preload_count(). It allows preallocating
enough nodes to insert a number of *contiguous* elements.

Signed-off-by: Matthew Wilcox <matthew.r.wilcox@intel.com>
Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
---
 include/linux/radix-tree.h |    3 +++
 lib/radix-tree.c           |   32 +++++++++++++++++++++++++-------
 2 files changed, 28 insertions(+), 7 deletions(-)

diff --git a/include/linux/radix-tree.h b/include/linux/radix-tree.h
index ffc444c..81318cb 100644
--- a/include/linux/radix-tree.h
+++ b/include/linux/radix-tree.h
@@ -83,6 +83,8 @@ do {									\
 	(root)->rnode = NULL;						\
 } while (0)
 
+#define RADIX_TREE_PRELOAD_NR		512 /* For THP's benefit */
+
 /**
  * Radix-tree synchronization
  *
@@ -231,6 +233,7 @@ unsigned long radix_tree_next_hole(struct radix_tree_root *root,
 unsigned long radix_tree_prev_hole(struct radix_tree_root *root,
 				unsigned long index, unsigned long max_scan);
 int radix_tree_preload(gfp_t gfp_mask);
+int radix_tree_preload_count(unsigned size, gfp_t gfp_mask);
 void radix_tree_init(void);
 void *radix_tree_tag_set(struct radix_tree_root *root,
 			unsigned long index, unsigned int tag);
diff --git a/lib/radix-tree.c b/lib/radix-tree.c
index e796429..9bef0ac 100644
--- a/lib/radix-tree.c
+++ b/lib/radix-tree.c
@@ -81,16 +81,24 @@ static struct kmem_cache *radix_tree_node_cachep;
  * The worst case is a zero height tree with just a single item at index 0,
  * and then inserting an item at index ULONG_MAX. This requires 2 new branches
  * of RADIX_TREE_MAX_PATH size to be created, with only the root node shared.
+ *
+ * Worst case for adding N contiguous items is adding entries at indexes
+ * (ULONG_MAX - N) to ULONG_MAX. It requires nodes to insert single worst-case
+ * item plus extra nodes if you cross the boundary from one node to the next.
+ *
  * Hence:
  */
-#define RADIX_TREE_PRELOAD_SIZE (RADIX_TREE_MAX_PATH * 2 - 1)
+#define RADIX_TREE_PRELOAD_MIN (RADIX_TREE_MAX_PATH * 2 - 1)
+#define RADIX_TREE_PRELOAD_MAX \
+	(RADIX_TREE_PRELOAD_MIN + \
+	 DIV_ROUND_UP(RADIX_TREE_PRELOAD_NR - 1, RADIX_TREE_MAP_SIZE))
 
 /*
  * Per-cpu pool of preloaded nodes
  */
 struct radix_tree_preload {
 	int nr;
-	struct radix_tree_node *nodes[RADIX_TREE_PRELOAD_SIZE];
+	struct radix_tree_node *nodes[RADIX_TREE_PRELOAD_MAX];
 };
 static DEFINE_PER_CPU(struct radix_tree_preload, radix_tree_preloads) = { 0, };
 
@@ -257,29 +265,34 @@ radix_tree_node_free(struct radix_tree_node *node)
 
 /*
  * Load up this CPU's radix_tree_node buffer with sufficient objects to
- * ensure that the addition of a single element in the tree cannot fail.  On
- * success, return zero, with preemption disabled.  On error, return -ENOMEM
+ * ensure that the addition of *contiguous* elements in the tree cannot fail.
+ * On success, return zero, with preemption disabled.  On error, return -ENOMEM
  * with preemption not disabled.
  *
  * To make use of this facility, the radix tree must be initialised without
  * __GFP_WAIT being passed to INIT_RADIX_TREE().
  */
-int radix_tree_preload(gfp_t gfp_mask)
+int radix_tree_preload_count(unsigned size, gfp_t gfp_mask)
 {
 	struct radix_tree_preload *rtp;
 	struct radix_tree_node *node;
 	int ret = -ENOMEM;
+	int alloc = RADIX_TREE_PRELOAD_MIN +
+		DIV_ROUND_UP(size - 1, RADIX_TREE_MAP_SIZE);
+
+	if (size > RADIX_TREE_PRELOAD_NR)
+		return -ENOMEM;
 
 	preempt_disable();
 	rtp = &__get_cpu_var(radix_tree_preloads);
-	while (rtp->nr < ARRAY_SIZE(rtp->nodes)) {
+	while (rtp->nr < alloc) {
 		preempt_enable();
 		node = kmem_cache_alloc(radix_tree_node_cachep, gfp_mask);
 		if (node == NULL)
 			goto out;
 		preempt_disable();
 		rtp = &__get_cpu_var(radix_tree_preloads);
-		if (rtp->nr < ARRAY_SIZE(rtp->nodes))
+		if (rtp->nr < alloc)
 			rtp->nodes[rtp->nr++] = node;
 		else
 			kmem_cache_free(radix_tree_node_cachep, node);
@@ -288,6 +301,11 @@ int radix_tree_preload(gfp_t gfp_mask)
 out:
 	return ret;
 }
+
+int radix_tree_preload(gfp_t gfp_mask)
+{
+	return radix_tree_preload_count(1, gfp_mask);
+}
 EXPORT_SYMBOL(radix_tree_preload);
 
 /*
-- 
1.7.10.4
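The sizing arithmetic can be checked with a worked example. The numbers
below assume the common 64-bit configuration (RADIX_TREE_MAP_SHIFT = 6, so
64 slots per node and RADIX_TREE_MAX_PATH = 11); both are config-dependent,
so the formula, not the constants, is the point:

```c
#include <assert.h>

#define DIV_ROUND_UP(n, d)	(((n) + (d) - 1) / (d))

#define MAP_SHIFT	6			/* assumed: 64-bit default */
#define MAP_SIZE	(1UL << MAP_SHIFT)	/* 64 slots per node */
#define INDEX_BITS	64
#define MAX_PATH	DIV_ROUND_UP(INDEX_BITS, MAP_SHIFT)	/* = 11 */

/* Single worst-case item: two branches of MAX_PATH nodes, root shared. */
#define PRELOAD_MIN	(MAX_PATH * 2 - 1)	/* = 21 */

/* Nodes to preload for 'size' (>= 1) contiguous insertions: the
 * single-item worst case plus one extra leaf node per node boundary the
 * contiguous run can cross. */
static unsigned long preload_count(unsigned long size)
{
	return PRELOAD_MIN + DIV_ROUND_UP(size - 1, MAP_SIZE);
}
```

So a single insert still needs 21 preloaded nodes, while the THP case of 512
contiguous indexes needs 21 + ceil(511/64) = 29, far less than 512 separate
preloads.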


^ permalink raw reply related	[flat|nested] 66+ messages in thread

* [PATCH, RFC 05/16] thp, mm: basic defines for transparent huge page cache
  2013-01-28  9:24 ` Kirill A. Shutemov
@ 2013-01-28  9:24   ` Kirill A. Shutemov
  -1 siblings, 0 replies; 66+ messages in thread
From: Kirill A. Shutemov @ 2013-01-28  9:24 UTC (permalink / raw)
  To: Andrea Arcangeli, Andrew Morton, Al Viro
  Cc: Wu Fengguang, Jan Kara, Mel Gorman, linux-mm, Andi Kleen,
	Matthew Wilcox, Kirill A. Shutemov, linux-fsdevel, linux-kernel,
	Kirill A. Shutemov

From: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>

Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
---
 include/linux/huge_mm.h |    8 ++++++++
 1 file changed, 8 insertions(+)

diff --git a/include/linux/huge_mm.h b/include/linux/huge_mm.h
index ee1c244..a54939c 100644
--- a/include/linux/huge_mm.h
+++ b/include/linux/huge_mm.h
@@ -64,6 +64,10 @@ extern pmd_t *page_check_address_pmd(struct page *page,
 #define HPAGE_PMD_MASK HPAGE_MASK
 #define HPAGE_PMD_SIZE HPAGE_SIZE
 
+#define HPAGE_CACHE_ORDER      (HPAGE_SHIFT - PAGE_CACHE_SHIFT)
+#define HPAGE_CACHE_NR         (1L << HPAGE_CACHE_ORDER)
+#define HPAGE_CACHE_INDEX_MASK (HPAGE_CACHE_NR - 1)
+
 extern bool is_vma_temporary_stack(struct vm_area_struct *vma);
 
 #define transparent_hugepage_enabled(__vma)				\
@@ -181,6 +185,10 @@ extern int do_huge_pmd_numa_page(struct mm_struct *mm, struct vm_area_struct *vm
 #define HPAGE_PMD_MASK ({ BUILD_BUG(); 0; })
 #define HPAGE_PMD_SIZE ({ BUILD_BUG(); 0; })
 
+#define HPAGE_CACHE_ORDER      ({ BUILD_BUG(); 0; })
+#define HPAGE_CACHE_NR         ({ BUILD_BUG(); 0; })
+#define HPAGE_CACHE_INDEX_MASK ({ BUILD_BUG(); 0; })
+
 #define hpage_nr_pages(x) 1
 
 #define transparent_hugepage_enabled(__vma) 0
-- 
1.7.10.4


^ permalink raw reply related	[flat|nested] 66+ messages in thread

* [PATCH, RFC 06/16] thp, mm: rewrite add_to_page_cache_locked() to support huge pages
  2013-01-28  9:24 ` Kirill A. Shutemov
@ 2013-01-28  9:24   ` Kirill A. Shutemov
  -1 siblings, 0 replies; 66+ messages in thread
From: Kirill A. Shutemov @ 2013-01-28  9:24 UTC (permalink / raw)
  To: Andrea Arcangeli, Andrew Morton, Al Viro
  Cc: Wu Fengguang, Jan Kara, Mel Gorman, linux-mm, Andi Kleen,
	Matthew Wilcox, Kirill A. Shutemov, linux-fsdevel, linux-kernel,
	Kirill A. Shutemov

From: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>

For a huge page we add HPAGE_CACHE_NR pages to the radix tree at once: the
head page at the specified index and HPAGE_CACHE_NR-1 tail pages at the
following indexes.

Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
---
 mm/filemap.c |   75 +++++++++++++++++++++++++++++++++++++++++-----------------
 1 file changed, 53 insertions(+), 22 deletions(-)

diff --git a/mm/filemap.c b/mm/filemap.c
index b6a6d7e..fa2fdab 100644
--- a/mm/filemap.c
+++ b/mm/filemap.c
@@ -443,6 +443,7 @@ int add_to_page_cache_locked(struct page *page, struct address_space *mapping,
 		pgoff_t offset, gfp_t gfp_mask)
 {
 	int error;
+	int nr = 1;
 
 	VM_BUG_ON(!PageLocked(page));
 	VM_BUG_ON(PageSwapBacked(page));
@@ -450,31 +451,61 @@ int add_to_page_cache_locked(struct page *page, struct address_space *mapping,
 	error = mem_cgroup_cache_charge(page, current->mm,
 					gfp_mask & GFP_RECLAIM_MASK);
 	if (error)
-		goto out;
+		return error;
 
-	error = radix_tree_preload(gfp_mask & ~__GFP_HIGHMEM);
-	if (error == 0) {
-		page_cache_get(page);
-		page->mapping = mapping;
-		page->index = offset;
+	if (PageTransHuge(page)) {
+		BUILD_BUG_ON(HPAGE_CACHE_NR > RADIX_TREE_PRELOAD_NR);
+		nr = HPAGE_CACHE_NR;
+	}
+	error = radix_tree_preload_count(nr, gfp_mask & ~__GFP_HIGHMEM);
+	if (error) {
+		mem_cgroup_uncharge_cache_page(page);
+		return error;
+	}
 
-		spin_lock_irq(&mapping->tree_lock);
-		error = radix_tree_insert(&mapping->page_tree, offset, page);
-		if (likely(!error)) {
-			mapping->nrpages++;
-			__inc_zone_page_state(page, NR_FILE_PAGES);
-			spin_unlock_irq(&mapping->tree_lock);
-		} else {
-			page->mapping = NULL;
-			/* Leave page->index set: truncation relies upon it */
-			spin_unlock_irq(&mapping->tree_lock);
-			mem_cgroup_uncharge_cache_page(page);
-			page_cache_release(page);
+	page_cache_get(page);
+	spin_lock_irq(&mapping->tree_lock);
+	page->mapping = mapping;
+	if (PageTransHuge(page)) {
+		int i;
+		for (i = 0; i < HPAGE_CACHE_NR; i++) {
+			page_cache_get(page + i);
+			page[i].index = offset + i;
+			error = radix_tree_insert(&mapping->page_tree,
+					offset + i, page + i);
+			if (error) {
+				page_cache_release(page + i);
+				break;
+			}
 		}
-		radix_tree_preload_end();
-	} else
-		mem_cgroup_uncharge_cache_page(page);
-out:
+		if (error) {
+			if (i > 0 && error == -EEXIST)
+				error = -ENOSPC; /* no space for a huge page */
+			for (i--; i >= 0; i--) {
+				page_cache_release(page + i);
+				radix_tree_delete(&mapping->page_tree,
+						offset + i);
+			}
+			goto err;
+		}
+	} else {
+		page->index = offset;
+		error = radix_tree_insert(&mapping->page_tree, offset, page);
+		if (unlikely(error))
+			goto err;
+	}
+	__mod_zone_page_state(page_zone(page), NR_FILE_PAGES, nr);
+	mapping->nrpages += nr;
+	spin_unlock_irq(&mapping->tree_lock);
+	radix_tree_preload_end();
+	return 0;
+err:
+	page->mapping = NULL;
+	/* Leave page->index set: truncation relies upon it */
+	spin_unlock_irq(&mapping->tree_lock);
+	radix_tree_preload_end();
+	mem_cgroup_uncharge_cache_page(page);
+	page_cache_release(page);
 	return error;
 }
 EXPORT_SYMBOL(add_to_page_cache_locked);
-- 
1.7.10.4


^ permalink raw reply related	[flat|nested] 66+ messages in thread

* [PATCH, RFC 07/16] thp, mm: rewrite delete_from_page_cache() to support huge pages
  2013-01-28  9:24 ` Kirill A. Shutemov
@ 2013-01-28  9:24   ` Kirill A. Shutemov
  -1 siblings, 0 replies; 66+ messages in thread
From: Kirill A. Shutemov @ 2013-01-28  9:24 UTC (permalink / raw)
  To: Andrea Arcangeli, Andrew Morton, Al Viro
  Cc: Wu Fengguang, Jan Kara, Mel Gorman, linux-mm, Andi Kleen,
	Matthew Wilcox, Kirill A. Shutemov, linux-fsdevel, linux-kernel,
	Kirill A. Shutemov

From: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>

As with add_to_page_cache_locked(), we handle HPAGE_CACHE_NR pages at a
time.

Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
---
 mm/filemap.c |   27 +++++++++++++++++++++------
 1 file changed, 21 insertions(+), 6 deletions(-)

diff --git a/mm/filemap.c b/mm/filemap.c
index fa2fdab..a4b4fd5 100644
--- a/mm/filemap.c
+++ b/mm/filemap.c
@@ -112,6 +112,7 @@
 void __delete_from_page_cache(struct page *page)
 {
 	struct address_space *mapping = page->mapping;
+	int nr = 1;
 
 	/*
 	 * if we're uptodate, flush out into the cleancache, otherwise
@@ -123,13 +124,23 @@ void __delete_from_page_cache(struct page *page)
 	else
 		cleancache_invalidate_page(mapping, page);
 
-	radix_tree_delete(&mapping->page_tree, page->index);
+	if (PageTransHuge(page)) {
+		int i;
+
+		for (i = 0; i < HPAGE_CACHE_NR; i++)
+			radix_tree_delete(&mapping->page_tree, page->index + i);
+		nr = HPAGE_CACHE_NR;
+	} else {
+		radix_tree_delete(&mapping->page_tree, page->index);
+	}
+
 	page->mapping = NULL;
 	/* Leave page->index set: truncation lookup relies upon it */
-	mapping->nrpages--;
-	__dec_zone_page_state(page, NR_FILE_PAGES);
+
+	mapping->nrpages -= nr;
+	__mod_zone_page_state(page_zone(page), NR_FILE_PAGES, -nr);
 	if (PageSwapBacked(page))
-		__dec_zone_page_state(page, NR_SHMEM);
+		__mod_zone_page_state(page_zone(page), NR_SHMEM, -nr);
 	BUG_ON(page_mapped(page));
 
 	/*
@@ -140,8 +151,8 @@ void __delete_from_page_cache(struct page *page)
 	 * having removed the page entirely.
 	 */
 	if (PageDirty(page) && mapping_cap_account_dirty(mapping)) {
-		dec_zone_page_state(page, NR_FILE_DIRTY);
-		dec_bdi_stat(mapping->backing_dev_info, BDI_RECLAIMABLE);
+		mod_zone_page_state(page_zone(page), NR_FILE_DIRTY, -nr);
+		add_bdi_stat(mapping->backing_dev_info, BDI_RECLAIMABLE, -nr);
 	}
 }
 
@@ -157,6 +168,7 @@ void delete_from_page_cache(struct page *page)
 {
 	struct address_space *mapping = page->mapping;
 	void (*freepage)(struct page *);
+	int i;
 
 	BUG_ON(!PageLocked(page));
 
@@ -168,6 +180,9 @@ void delete_from_page_cache(struct page *page)
 
 	if (freepage)
 		freepage(page);
+	if (PageTransHuge(page))
+		for (i = 1; i < HPAGE_CACHE_NR; i++)
+			page_cache_release(page);
 	page_cache_release(page);
 }
 EXPORT_SYMBOL(delete_from_page_cache);
-- 
1.7.10.4


^ permalink raw reply related	[flat|nested] 66+ messages in thread

* [PATCH, RFC 08/16] thp, mm: locking tail page is a bug
  2013-01-28  9:24 ` Kirill A. Shutemov
@ 2013-01-28  9:24   ` Kirill A. Shutemov
  -1 siblings, 0 replies; 66+ messages in thread
From: Kirill A. Shutemov @ 2013-01-28  9:24 UTC (permalink / raw)
  To: Andrea Arcangeli, Andrew Morton, Al Viro
  Cc: Wu Fengguang, Jan Kara, Mel Gorman, linux-mm, Andi Kleen,
	Matthew Wilcox, Kirill A. Shutemov, linux-fsdevel, linux-kernel,
	Kirill A. Shutemov

From: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>

Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
---
 mm/filemap.c |    2 ++
 1 file changed, 2 insertions(+)

diff --git a/mm/filemap.c b/mm/filemap.c
index a4b4fd5..f59eaa1 100644
--- a/mm/filemap.c
+++ b/mm/filemap.c
@@ -665,6 +665,7 @@ void __lock_page(struct page *page)
 {
 	DEFINE_WAIT_BIT(wait, &page->flags, PG_locked);
 
+	VM_BUG_ON(PageTail(page));
 	__wait_on_bit_lock(page_waitqueue(page), &wait, sleep_on_page,
 							TASK_UNINTERRUPTIBLE);
 }
@@ -674,6 +675,7 @@ int __lock_page_killable(struct page *page)
 {
 	DEFINE_WAIT_BIT(wait, &page->flags, PG_locked);
 
+	VM_BUG_ON(PageTail(page));
 	return __wait_on_bit_lock(page_waitqueue(page), &wait,
 					sleep_on_page_killable, TASK_KILLABLE);
 }
-- 
1.7.10.4


^ permalink raw reply related	[flat|nested] 66+ messages in thread

* [PATCH, RFC 09/16] thp, mm: handle tail pages in page_cache_get_speculative()
  2013-01-28  9:24 ` Kirill A. Shutemov
@ 2013-01-28  9:24   ` Kirill A. Shutemov
  -1 siblings, 0 replies; 66+ messages in thread
From: Kirill A. Shutemov @ 2013-01-28  9:24 UTC (permalink / raw)
  To: Andrea Arcangeli, Andrew Morton, Al Viro
  Cc: Wu Fengguang, Jan Kara, Mel Gorman, linux-mm, Andi Kleen,
	Matthew Wilcox, Kirill A. Shutemov, linux-fsdevel, linux-kernel,
	Kirill A. Shutemov

From: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>

For a tail page we call __get_page_tail(). It has the same semantics, but
operates on the tail page.

Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
---
 include/linux/pagemap.h |    4 +++-
 1 file changed, 3 insertions(+), 1 deletion(-)

diff --git a/include/linux/pagemap.h b/include/linux/pagemap.h
index 0e38e13..1da2043 100644
--- a/include/linux/pagemap.h
+++ b/include/linux/pagemap.h
@@ -149,6 +149,9 @@ static inline int page_cache_get_speculative(struct page *page)
 {
 	VM_BUG_ON(in_interrupt());
 
+	if (unlikely(PageTail(page)))
+		return __get_page_tail(page);
+
 #if !defined(CONFIG_SMP) && defined(CONFIG_TREE_RCU)
 # ifdef CONFIG_PREEMPT_COUNT
 	VM_BUG_ON(!in_atomic());
@@ -175,7 +178,6 @@ static inline int page_cache_get_speculative(struct page *page)
 		return 0;
 	}
 #endif
-	VM_BUG_ON(PageTail(page));
 
 	return 1;
 }
-- 
1.7.10.4


^ permalink raw reply related	[flat|nested] 66+ messages in thread

* [PATCH, RFC 10/16] thp, mm: implement grab_cache_huge_page_write_begin()
  2013-01-28  9:24 ` Kirill A. Shutemov
@ 2013-01-28  9:24   ` Kirill A. Shutemov
  -1 siblings, 0 replies; 66+ messages in thread
From: Kirill A. Shutemov @ 2013-01-28  9:24 UTC (permalink / raw)
  To: Andrea Arcangeli, Andrew Morton, Al Viro
  Cc: Wu Fengguang, Jan Kara, Mel Gorman, linux-mm, Andi Kleen,
	Matthew Wilcox, Kirill A. Shutemov, linux-fsdevel, linux-kernel,
	Kirill A. Shutemov

From: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>

This function is the twin of grab_cache_page_write_begin(), but it tries to
allocate a huge page at the given position, which must be aligned to
HPAGE_CACHE_NR.

If, for some reason, it's not possible to allocate a huge page at this
position, it returns NULL. The caller should take care of falling back to
small pages.

Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
---
 include/linux/pagemap.h |   10 +++++++++
 mm/filemap.c            |   55 +++++++++++++++++++++++++++++++++++++++++++++++
 2 files changed, 65 insertions(+)

diff --git a/include/linux/pagemap.h b/include/linux/pagemap.h
index 1da2043..5836d0d 100644
--- a/include/linux/pagemap.h
+++ b/include/linux/pagemap.h
@@ -260,6 +260,16 @@ unsigned find_get_pages_tag(struct address_space *mapping, pgoff_t *index,
 
 struct page *grab_cache_page_write_begin(struct address_space *mapping,
 			pgoff_t index, unsigned flags);
+#ifdef CONFIG_TRANSPARENT_HUGEPAGE
+struct page *grab_cache_huge_page_write_begin(struct address_space *mapping,
+			pgoff_t index, unsigned flags);
+#else
+static inline struct page *grab_cache_huge_page_write_begin(
+		struct address_space *mapping, pgoff_t index, unsigned flags)
+{
+	return NULL;
+}
+#endif
 
 /*
  * Returns locked page at given index in given cache, creating it if needed.
diff --git a/mm/filemap.c b/mm/filemap.c
index f59eaa1..68e47e4 100644
--- a/mm/filemap.c
+++ b/mm/filemap.c
@@ -2328,6 +2328,61 @@ found:
 }
 EXPORT_SYMBOL(grab_cache_page_write_begin);
 
+#ifdef CONFIG_TRANSPARENT_HUGEPAGE
+/*
+ * Find or create a huge page at the given pagecache position, aligned to
+ * HPAGE_CACHE_NR. Return the locked huge page.
+ *
+ * If, for some reason, it's not possible to allocate a huge page at this
+ * position, it returns NULL. The caller should take care of falling back
+ * to small pages.
+ *
+ * This function is specifically for buffered writes.
+ */
+struct page *grab_cache_huge_page_write_begin(struct address_space *mapping,
+		pgoff_t index, unsigned flags)
+{
+	int status;
+	gfp_t gfp_mask;
+	struct page *page;
+	gfp_t gfp_notmask = 0;
+
+	BUG_ON(index & HPAGE_CACHE_INDEX_MASK);
+	gfp_mask = mapping_gfp_mask(mapping);
+	BUG_ON(!(gfp_mask & __GFP_COMP));
+	if (mapping_cap_account_dirty(mapping))
+		gfp_mask |= __GFP_WRITE;
+	if (flags & AOP_FLAG_NOFS)
+		gfp_notmask = __GFP_FS;
+repeat:
+	page = find_lock_page(mapping, index);
+	if (page) {
+		if (!PageTransHuge(page)) {
+			unlock_page(page);
+			page_cache_release(page);
+			return NULL;
+		}
+		goto found;
+	}
+
+	page = alloc_pages(gfp_mask & ~gfp_notmask, HPAGE_PMD_ORDER);
+	if (!page)
+		return NULL;
+
+	status = add_to_page_cache_lru(page, mapping, index,
+			GFP_KERNEL & ~gfp_notmask);
+	if (unlikely(status)) {
+		page_cache_release(page);
+		if (status == -EEXIST)
+			goto repeat;
+		return NULL;
+	}
+found:
+	wait_on_page_writeback(page);
+	return page;
+}
+#endif
+
 static ssize_t generic_perform_write(struct file *file,
 				struct iov_iter *i, loff_t pos)
 {
-- 
1.7.10.4


^ permalink raw reply related	[flat|nested] 66+ messages in thread

* [PATCH, RFC 11/16] thp, mm: naive support of thp in generic read/write routines
  2013-01-28  9:24 ` Kirill A. Shutemov
@ 2013-01-28  9:24   ` Kirill A. Shutemov
  -1 siblings, 0 replies; 66+ messages in thread
From: Kirill A. Shutemov @ 2013-01-28  9:24 UTC (permalink / raw)
  To: Andrea Arcangeli, Andrew Morton, Al Viro
  Cc: Wu Fengguang, Jan Kara, Mel Gorman, linux-mm, Andi Kleen,
	Matthew Wilcox, Kirill A. Shutemov, linux-fsdevel, linux-kernel,
	Kirill A. Shutemov

From: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>

For now we still write/read at most PAGE_CACHE_SIZE bytes at a time.

This implementation doesn't cover address spaces with backing store.

Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
---
 mm/filemap.c |   35 ++++++++++++++++++++++++++++++-----
 1 file changed, 30 insertions(+), 5 deletions(-)

diff --git a/mm/filemap.c b/mm/filemap.c
index 68e47e4..a7331fb 100644
--- a/mm/filemap.c
+++ b/mm/filemap.c
@@ -1161,12 +1161,23 @@ find_page:
 			if (unlikely(page == NULL))
 				goto no_cached_page;
 		}
+		if (PageTransTail(page)) {
+			page_cache_release(page);
+			page = find_get_page(mapping,
+					index & ~HPAGE_CACHE_INDEX_MASK);
+			if (!PageTransHuge(page)) {
+				page_cache_release(page);
+				goto find_page;
+			}
+		}
 		if (PageReadahead(page)) {
+			BUG_ON(PageTransHuge(page));
 			page_cache_async_readahead(mapping,
 					ra, filp, page,
 					index, last_index - index);
 		}
 		if (!PageUptodate(page)) {
+			BUG_ON(PageTransHuge(page));
 			if (inode->i_blkbits == PAGE_CACHE_SHIFT ||
 					!mapping->a_ops->is_partially_uptodate)
 				goto page_not_up_to_date;
@@ -1208,18 +1219,25 @@ page_ok:
 		}
 		nr = nr - offset;
 
+		/* Recalculate offset in page if we've got a huge page */
+		if (PageTransHuge(page)) {
+			offset = (((loff_t)index << PAGE_CACHE_SHIFT) + offset);
+			offset &= ~HPAGE_PMD_MASK;
+		}
+
 		/* If users can be writing to this page using arbitrary
 		 * virtual addresses, take care about potential aliasing
 		 * before reading the page on the kernel side.
 		 */
 		if (mapping_writably_mapped(mapping))
-			flush_dcache_page(page);
+			flush_dcache_page(page + (offset >> PAGE_CACHE_SHIFT));
 
 		/*
 		 * When a sequential read accesses a page several times,
 		 * only mark it as accessed the first time.
 		 */
-		if (prev_index != index || offset != prev_offset)
+		if (prev_index != index ||
+				(offset & ~PAGE_CACHE_MASK) != prev_offset)
 			mark_page_accessed(page);
 		prev_index = index;
 
@@ -1234,8 +1252,9 @@ page_ok:
 		 * "pos" here (the actor routine has to update the user buffer
 		 * pointers and the remaining count).
 		 */
-		ret = file_read_actor(desc, page, offset, nr);
-		offset += ret;
+		ret = file_read_actor(desc, page + (offset >> PAGE_CACHE_SHIFT),
+				offset & ~PAGE_CACHE_MASK, nr);
+		offset =  (offset & ~PAGE_CACHE_MASK) + ret;
 		index += offset >> PAGE_CACHE_SHIFT;
 		offset &= ~PAGE_CACHE_MASK;
 		prev_offset = offset;
@@ -2433,8 +2452,13 @@ again:
 		if (mapping_writably_mapped(mapping))
 			flush_dcache_page(page);
 
+		if (PageTransHuge(page))
+			offset = pos & ~HPAGE_PMD_MASK;
+
 		pagefault_disable();
-		copied = iov_iter_copy_from_user_atomic(page, i, offset, bytes);
+		copied = iov_iter_copy_from_user_atomic(
+				page + (offset >> PAGE_CACHE_SHIFT),
+				i, offset & ~PAGE_CACHE_MASK, bytes);
 		pagefault_enable();
 		flush_dcache_page(page);
 
@@ -2457,6 +2481,7 @@ again:
 			 * because not all segments in the iov can be copied at
 			 * once without a pagefault.
 			 */
+			offset = pos & ~PAGE_CACHE_MASK;
 			bytes = min_t(unsigned long, PAGE_CACHE_SIZE - offset,
 						iov_iter_single_seg_count(i));
 			goto again;
-- 
1.7.10.4


^ permalink raw reply related	[flat|nested] 66+ messages in thread

* [PATCH, RFC 11/16] thp, mm: naive support of thp in generic read/write routines
@ 2013-01-28  9:24   ` Kirill A. Shutemov
  0 siblings, 0 replies; 66+ messages in thread
From: Kirill A. Shutemov @ 2013-01-28  9:24 UTC (permalink / raw)
  To: Andrea Arcangeli, Andrew Morton, Al Viro
  Cc: Wu Fengguang, Jan Kara, Mel Gorman, linux-mm, Andi Kleen,
	Matthew Wilcox, Kirill A. Shutemov, linux-fsdevel, linux-kernel,
	Kirill A. Shutemov

From: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>

For now we still read/write at most PAGE_CACHE_SIZE bytes at a time.

This implementation doesn't cover address spaces with backing store.

Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
---
 mm/filemap.c |   35 ++++++++++++++++++++++++++++++-----
 1 file changed, 30 insertions(+), 5 deletions(-)

diff --git a/mm/filemap.c b/mm/filemap.c
index 68e47e4..a7331fb 100644
--- a/mm/filemap.c
+++ b/mm/filemap.c
@@ -1161,12 +1161,23 @@ find_page:
 			if (unlikely(page == NULL))
 				goto no_cached_page;
 		}
+		if (PageTransTail(page)) {
+			page_cache_release(page);
+			page = find_get_page(mapping,
+					index & ~HPAGE_CACHE_INDEX_MASK);
+			if (!PageTransHuge(page)) {
+				page_cache_release(page);
+				goto find_page;
+			}
+		}
 		if (PageReadahead(page)) {
+			BUG_ON(PageTransHuge(page));
 			page_cache_async_readahead(mapping,
 					ra, filp, page,
 					index, last_index - index);
 		}
 		if (!PageUptodate(page)) {
+			BUG_ON(PageTransHuge(page));
 			if (inode->i_blkbits == PAGE_CACHE_SHIFT ||
 					!mapping->a_ops->is_partially_uptodate)
 				goto page_not_up_to_date;
@@ -1208,18 +1219,25 @@ page_ok:
 		}
 		nr = nr - offset;
 
+		/* Recalculate offset in page if we've got a huge page */
+		if (PageTransHuge(page)) {
+			offset = (((loff_t)index << PAGE_CACHE_SHIFT) + offset);
+			offset &= ~HPAGE_PMD_MASK;
+		}
+
 		/* If users can be writing to this page using arbitrary
 		 * virtual addresses, take care about potential aliasing
 		 * before reading the page on the kernel side.
 		 */
 		if (mapping_writably_mapped(mapping))
-			flush_dcache_page(page);
+			flush_dcache_page(page + (offset >> PAGE_CACHE_SHIFT));
 
 		/*
 		 * When a sequential read accesses a page several times,
 		 * only mark it as accessed the first time.
 		 */
-		if (prev_index != index || offset != prev_offset)
+		if (prev_index != index ||
+				(offset & ~PAGE_CACHE_MASK) != prev_offset)
 			mark_page_accessed(page);
 		prev_index = index;
 
@@ -1234,8 +1252,9 @@ page_ok:
 		 * "pos" here (the actor routine has to update the user buffer
 		 * pointers and the remaining count).
 		 */
-		ret = file_read_actor(desc, page, offset, nr);
-		offset += ret;
+		ret = file_read_actor(desc, page + (offset >> PAGE_CACHE_SHIFT),
+				offset & ~PAGE_CACHE_MASK, nr);
+		offset =  (offset & ~PAGE_CACHE_MASK) + ret;
 		index += offset >> PAGE_CACHE_SHIFT;
 		offset &= ~PAGE_CACHE_MASK;
 		prev_offset = offset;
@@ -2433,8 +2452,13 @@ again:
 		if (mapping_writably_mapped(mapping))
 			flush_dcache_page(page);
 
+		if (PageTransHuge(page))
+			offset = pos & ~HPAGE_PMD_MASK;
+
 		pagefault_disable();
-		copied = iov_iter_copy_from_user_atomic(page, i, offset, bytes);
+		copied = iov_iter_copy_from_user_atomic(
+				page + (offset >> PAGE_CACHE_SHIFT),
+				i, offset & ~PAGE_CACHE_MASK, bytes);
 		pagefault_enable();
 		flush_dcache_page(page);
 
@@ -2457,6 +2481,7 @@ again:
 			 * because not all segments in the iov can be copied at
 			 * once without a pagefault.
 			 */
+			offset = pos & ~PAGE_CACHE_MASK;
 			bytes = min_t(unsigned long, PAGE_CACHE_SIZE - offset,
 						iov_iter_single_seg_count(i));
 			goto again;
-- 
1.7.10.4

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: email@kvack.org

^ permalink raw reply related	[flat|nested] 66+ messages in thread

* [PATCH, RFC 12/16] thp, libfs: initial support of thp in simple_read/write_begin/write_end
  2013-01-28  9:24 ` Kirill A. Shutemov
@ 2013-01-28  9:24   ` Kirill A. Shutemov
  -1 siblings, 0 replies; 66+ messages in thread
From: Kirill A. Shutemov @ 2013-01-28  9:24 UTC (permalink / raw)
  To: Andrea Arcangeli, Andrew Morton, Al Viro
  Cc: Wu Fengguang, Jan Kara, Mel Gorman, linux-mm, Andi Kleen,
	Matthew Wilcox, Kirill A. Shutemov, linux-fsdevel, linux-kernel,
	Kirill A. Shutemov

From: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>

For now we try to grab a huge cache page if gfp_mask has __GFP_COMP set.
It's probably too weak a condition and needs to be reworked later.

Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
---
 fs/libfs.c |   54 ++++++++++++++++++++++++++++++++++++++++++------------
 1 file changed, 42 insertions(+), 12 deletions(-)

diff --git a/fs/libfs.c b/fs/libfs.c
index 916da8c..a4530d5 100644
--- a/fs/libfs.c
+++ b/fs/libfs.c
@@ -383,7 +383,10 @@ EXPORT_SYMBOL(simple_setattr);
 
 int simple_readpage(struct file *file, struct page *page)
 {
-	clear_highpage(page);
+	if (PageTransHuge(page))
+		zero_huge_user(page, 0, HPAGE_PMD_SIZE);
+	else
+		clear_highpage(page);
 	flush_dcache_page(page);
 	SetPageUptodate(page);
 	unlock_page(page);
@@ -394,21 +397,43 @@ int simple_write_begin(struct file *file, struct address_space *mapping,
 			loff_t pos, unsigned len, unsigned flags,
 			struct page **pagep, void **fsdata)
 {
-	struct page *page;
+	struct page *page = NULL;
 	pgoff_t index;
+	gfp_t gfp_mask;
 
 	index = pos >> PAGE_CACHE_SHIFT;
-
-	page = grab_cache_page_write_begin(mapping, index, flags);
+	gfp_mask = mapping_gfp_mask(mapping);
+
+	/* XXX: too weak condition. Good enough for initial testing */
+	if (gfp_mask & __GFP_COMP) {
+		page = grab_cache_huge_page_write_begin(mapping,
+				index & ~HPAGE_CACHE_INDEX_MASK, flags);
+		/* fallback to small page */
+		if (!page || !PageTransHuge(page)) {
+			unsigned long offset;
+			offset = pos & ~PAGE_CACHE_MASK;
+			len = min_t(unsigned long,
+					len, PAGE_CACHE_SIZE - offset);
+		}
+	}
+	if (!page)
+		page = grab_cache_page_write_begin(mapping, index, flags);
 	if (!page)
 		return -ENOMEM;
-
 	*pagep = page;
 
-	if (!PageUptodate(page) && (len != PAGE_CACHE_SIZE)) {
-		unsigned from = pos & (PAGE_CACHE_SIZE - 1);
-
-		zero_user_segments(page, 0, from, from + len, PAGE_CACHE_SIZE);
+	if (!PageUptodate(page)) {
+		unsigned from;
+
+		if (PageTransHuge(page) && len != HPAGE_PMD_SIZE) {
+			from = pos & ~HPAGE_PMD_MASK;
+			zero_huge_user_segments(page, 0, from,
+					from + len, HPAGE_PMD_SIZE);
+		} else if (len != PAGE_CACHE_SIZE) {
+			from = pos & ~PAGE_CACHE_MASK;
+			zero_user_segments(page, 0, from,
+					from + len, PAGE_CACHE_SIZE);
+		}
 	}
 	return 0;
 }
@@ -443,9 +468,14 @@ int simple_write_end(struct file *file, struct address_space *mapping,
 
 	/* zero the stale part of the page if we did a short copy */
 	if (copied < len) {
-		unsigned from = pos & (PAGE_CACHE_SIZE - 1);
-
-		zero_user(page, from + copied, len - copied);
+		unsigned from;
+		if (PageTransHuge(page)) {
+			from = pos & ~HPAGE_PMD_MASK;
+			zero_huge_user(page, from + copied, len - copied);
+		} else {
+			from = pos & ~PAGE_CACHE_MASK;
+			zero_user(page, from + copied, len - copied);
+		}
 	}
 
 	if (!PageUptodate(page))
-- 
1.7.10.4


^ permalink raw reply related	[flat|nested] 66+ messages in thread

* [PATCH, RFC 13/16] thp: handle file pages in split_huge_page()
  2013-01-28  9:24 ` Kirill A. Shutemov
@ 2013-01-28  9:24   ` Kirill A. Shutemov
  -1 siblings, 0 replies; 66+ messages in thread
From: Kirill A. Shutemov @ 2013-01-28  9:24 UTC (permalink / raw)
  To: Andrea Arcangeli, Andrew Morton, Al Viro
  Cc: Wu Fengguang, Jan Kara, Mel Gorman, linux-mm, Andi Kleen,
	Matthew Wilcox, Kirill A. Shutemov, linux-fsdevel, linux-kernel,
	Kirill A. Shutemov

From: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>

The base scheme is the same as for anonymous pages, but we walk
mapping->i_mmap rather than anon_vma->rb_root.

__split_huge_page_refcount() has been tuned a bit: we need to transfer
PG_swapbacked to tail pages.

Splitting of mapped pages hasn't been tested at all, since we cannot
mmap() file-backed huge pages yet.

Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
---
 mm/huge_memory.c |   62 ++++++++++++++++++++++++++++++++++++++++++++++--------
 1 file changed, 53 insertions(+), 9 deletions(-)

diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index c63a21d..008b2c9 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -1613,7 +1613,8 @@ static void __split_huge_page_refcount(struct page *page)
 				     ((1L << PG_referenced) |
 				      (1L << PG_swapbacked) |
 				      (1L << PG_mlocked) |
-				      (1L << PG_uptodate)));
+				      (1L << PG_uptodate) |
+				      (1L << PG_swapbacked)));
 		page_tail->flags |= (1L << PG_dirty);
 
 		/* clear PageTail before overwriting first_page */
@@ -1641,10 +1642,8 @@ static void __split_huge_page_refcount(struct page *page)
 		page_tail->index = page->index + i;
 		page_xchg_last_nid(page_tail, page_last_nid(page));
 
-		BUG_ON(!PageAnon(page_tail));
 		BUG_ON(!PageUptodate(page_tail));
 		BUG_ON(!PageDirty(page_tail));
-		BUG_ON(!PageSwapBacked(page_tail));
 
 		lru_add_page_tail(page, page_tail, lruvec);
 	}
@@ -1752,7 +1751,7 @@ static int __split_huge_page_map(struct page *page,
 }
 
 /* must be called with anon_vma->root->rwsem held */
-static void __split_huge_page(struct page *page,
+static void __split_anon_huge_page(struct page *page,
 			      struct anon_vma *anon_vma)
 {
 	int mapcount, mapcount2;
@@ -1799,14 +1798,11 @@ static void __split_huge_page(struct page *page,
 	BUG_ON(mapcount != mapcount2);
 }
 
-int split_huge_page(struct page *page)
+static int split_anon_huge_page(struct page *page)
 {
 	struct anon_vma *anon_vma;
 	int ret = 1;
 
-	BUG_ON(is_huge_zero_pfn(page_to_pfn(page)));
-	BUG_ON(!PageAnon(page));
-
 	/*
 	 * The caller does not necessarily hold an mmap_sem that would prevent
 	 * the anon_vma disappearing so we first we take a reference to it
@@ -1824,7 +1820,7 @@ int split_huge_page(struct page *page)
 		goto out_unlock;
 
 	BUG_ON(!PageSwapBacked(page));
-	__split_huge_page(page, anon_vma);
+	__split_anon_huge_page(page, anon_vma);
 	count_vm_event(THP_SPLIT);
 
 	BUG_ON(PageCompound(page));
@@ -1835,6 +1831,54 @@ out:
 	return ret;
 }
 
+static int split_file_huge_page(struct page *page)
+{
+	struct address_space *mapping = page->mapping;
+	pgoff_t pgoff = page->index << (PAGE_CACHE_SHIFT - PAGE_SHIFT);
+	struct vm_area_struct *vma;
+	int mapcount, mapcount2;
+
+	BUG_ON(!PageHead(page));
+	BUG_ON(PageTail(page));
+
+	mutex_lock(&mapping->i_mmap_mutex);
+	mapcount = 0;
+	vma_interval_tree_foreach(vma, &mapping->i_mmap, pgoff, pgoff) {
+		unsigned long addr = vma_address(page, vma);
+		mapcount += __split_huge_page_splitting(page, vma, addr);
+	}
+
+	if (mapcount != page_mapcount(page))
+		printk(KERN_ERR "mapcount %d page_mapcount %d\n",
+		       mapcount, page_mapcount(page));
+	BUG_ON(mapcount != page_mapcount(page));
+
+	__split_huge_page_refcount(page);
+
+	mapcount2 = 0;
+	vma_interval_tree_foreach(vma, &mapping->i_mmap, pgoff, pgoff) {
+		unsigned long addr = vma_address(page, vma);
+		mapcount2 += __split_huge_page_map(page, vma, addr);
+	}
+
+	if (mapcount != mapcount2)
+		printk(KERN_ERR "mapcount %d mapcount2 %d page_mapcount %d\n",
+		       mapcount, mapcount2, page_mapcount(page));
+	BUG_ON(mapcount != mapcount2);
+	mutex_unlock(&mapping->i_mmap_mutex);
+	return 0;
+}
+
+int split_huge_page(struct page *page)
+{
+	BUG_ON(is_huge_zero_pfn(page_to_pfn(page)));
+
+	if (PageAnon(page))
+		return split_anon_huge_page(page);
+	else
+		return split_file_huge_page(page);
+}
+
 #define VM_NO_THP (VM_SPECIAL|VM_MIXEDMAP|VM_HUGETLB|VM_SHARED|VM_MAYSHARE)
 
 int hugepage_madvise(struct vm_area_struct *vma,
-- 
1.7.10.4


^ permalink raw reply related	[flat|nested] 66+ messages in thread

* [PATCH, RFC 14/16] thp, mm: truncate support for transparent huge page cache
  2013-01-28  9:24 ` Kirill A. Shutemov
@ 2013-01-28  9:24   ` Kirill A. Shutemov
  -1 siblings, 0 replies; 66+ messages in thread
From: Kirill A. Shutemov @ 2013-01-28  9:24 UTC (permalink / raw)
  To: Andrea Arcangeli, Andrew Morton, Al Viro
  Cc: Wu Fengguang, Jan Kara, Mel Gorman, linux-mm, Andi Kleen,
	Matthew Wilcox, Kirill A. Shutemov, linux-fsdevel, linux-kernel,
	Kirill A. Shutemov

From: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>

If the starting position of the truncation is in a tail page, we have to
split the huge page first.

We also have to split if the end is within the huge page. Otherwise we
can truncate the whole huge page at once.

Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
---
 mm/truncate.c |   12 ++++++++++++
 1 file changed, 12 insertions(+)

diff --git a/mm/truncate.c b/mm/truncate.c
index c75b736..87c247d 100644
--- a/mm/truncate.c
+++ b/mm/truncate.c
@@ -231,6 +231,17 @@ void truncate_inode_pages_range(struct address_space *mapping,
 			if (index > end)
 				break;
 
+			/* split page if we start from tail page */
+			if (PageTransTail(page))
+				split_huge_page(compound_trans_head(page));
+			if (PageTransHuge(page)) {
+				/* split if end is within huge page */
+				if (index == (end & ~HPAGE_CACHE_INDEX_MASK))
+					split_huge_page(page);
+				else
+					/* skip tail pages */
+					i += HPAGE_CACHE_NR - 1;
+			}
 			if (!trylock_page(page))
 				continue;
 			WARN_ON(page->index != index);
@@ -280,6 +291,7 @@ void truncate_inode_pages_range(struct address_space *mapping,
 			if (index > end)
 				break;
 
+			VM_BUG_ON(PageTransHuge(page));
 			lock_page(page);
 			WARN_ON(page->index != index);
 			wait_on_page_writeback(page);
-- 
1.7.10.4


^ permalink raw reply related	[flat|nested] 66+ messages in thread

* [PATCH, RFC 15/16] thp, mm: split huge page on mmap file page
  2013-01-28  9:24 ` Kirill A. Shutemov
@ 2013-01-28  9:24   ` Kirill A. Shutemov
  -1 siblings, 0 replies; 66+ messages in thread
From: Kirill A. Shutemov @ 2013-01-28  9:24 UTC (permalink / raw)
  To: Andrea Arcangeli, Andrew Morton, Al Viro
  Cc: Wu Fengguang, Jan Kara, Mel Gorman, linux-mm, Andi Kleen,
	Matthew Wilcox, Kirill A. Shutemov, linux-fsdevel, linux-kernel,
	Kirill A. Shutemov

From: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>

We are not ready to mmap file-backed transparent huge pages. Let's split
them on any mmap() attempt.

Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
---
 mm/filemap.c |    2 ++
 1 file changed, 2 insertions(+)

diff --git a/mm/filemap.c b/mm/filemap.c
index a7331fb..2e08582 100644
--- a/mm/filemap.c
+++ b/mm/filemap.c
@@ -1692,6 +1692,8 @@ retry_find:
 			goto no_cached_page;
 	}
 
+	if (PageTransCompound(page))
+		split_huge_page(page);
 	if (!lock_page_or_retry(page, vma->vm_mm, vmf->flags)) {
 		page_cache_release(page);
 		return ret | VM_FAULT_RETRY;
-- 
1.7.10.4


^ permalink raw reply related	[flat|nested] 66+ messages in thread

* [PATCH, RFC 16/16] ramfs: enable transparent huge page cache
  2013-01-28  9:24 ` Kirill A. Shutemov
@ 2013-01-28  9:24   ` Kirill A. Shutemov
  -1 siblings, 0 replies; 66+ messages in thread
From: Kirill A. Shutemov @ 2013-01-28  9:24 UTC (permalink / raw)
  To: Andrea Arcangeli, Andrew Morton, Al Viro
  Cc: Wu Fengguang, Jan Kara, Mel Gorman, linux-mm, Andi Kleen,
	Matthew Wilcox, Kirill A. Shutemov, linux-fsdevel, linux-kernel,
	Kirill A. Shutemov

From: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>

ramfs is the simplest fs from the page cache point of view, so let's
start enabling transparent huge page cache there.

For now we allocate only non-movable huge pages. It's not yet clear
whether movable pages are safe here and what needs to be done to make
them safe.

Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
---
 fs/ramfs/inode.c |    6 +++++-
 1 file changed, 5 insertions(+), 1 deletion(-)

diff --git a/fs/ramfs/inode.c b/fs/ramfs/inode.c
index eab8c09..591457d 100644
--- a/fs/ramfs/inode.c
+++ b/fs/ramfs/inode.c
@@ -61,7 +61,11 @@ struct inode *ramfs_get_inode(struct super_block *sb,
 		inode_init_owner(inode, dir, mode);
 		inode->i_mapping->a_ops = &ramfs_aops;
 		inode->i_mapping->backing_dev_info = &ramfs_backing_dev_info;
-		mapping_set_gfp_mask(inode->i_mapping, GFP_HIGHUSER);
+		/*
+		 * TODO: what should be done to make movable safe?
+		 */
+		mapping_set_gfp_mask(inode->i_mapping,
+				GFP_TRANSHUGE & ~__GFP_MOVABLE);
 		mapping_set_unevictable(inode->i_mapping);
 		inode->i_atime = inode->i_mtime = inode->i_ctime = CURRENT_TIME;
 		switch (mode & S_IFMT) {
-- 
1.7.10.4


^ permalink raw reply related	[flat|nested] 66+ messages in thread

* Re: [PATCH, RFC 00/16] Transparent huge page cache
  2013-01-28  9:24 ` Kirill A. Shutemov
@ 2013-01-29  5:03   ` Hugh Dickins
  -1 siblings, 0 replies; 66+ messages in thread
From: Hugh Dickins @ 2013-01-29  5:03 UTC (permalink / raw)
  To: Kirill A. Shutemov
  Cc: Andrea Arcangeli, Andrew Morton, Al Viro, Wu Fengguang, Jan Kara,
	Mel Gorman, linux-mm, Andi Kleen, Matthew Wilcox,
	Kirill A. Shutemov, linux-fsdevel, linux-kernel

On Mon, 28 Jan 2013, Kirill A. Shutemov wrote:
> From: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
> 
> Here's first steps towards huge pages in page cache.
> 
> The intend of the work is get code ready to enable transparent huge page
> cache for the most simple fs -- ramfs.
> 
> It's not yet near feature-complete. It only provides basic infrastructure.
> At the moment we can read, write and truncate file on ramfs with huge pages in
> page cache. The most interesting part, mmap(), is not yet there. For now
> we split huge page on mmap() attempt.
> 
> I can't say that I see whole picture. I'm not sure if I understand locking
> model around split_huge_page(). Probably, not.
> Andrea, could you check if it looks correct?
> 
> Next steps (not necessary in this order):
>  - mmap();
>  - migration (?);
>  - collapse;
>  - stats, knobs, etc.;
>  - tmpfs/shmem enabling;
>  - ...
> 
> Kirill A. Shutemov (16):
>   block: implement add_bdi_stat()
>   mm: implement zero_huge_user_segment and friends
>   mm: drop actor argument of do_generic_file_read()
>   radix-tree: implement preload for multiple contiguous elements
>   thp, mm: basic defines for transparent huge page cache
>   thp, mm: rewrite add_to_page_cache_locked() to support huge pages
>   thp, mm: rewrite delete_from_page_cache() to support huge pages
>   thp, mm: locking tail page is a bug
>   thp, mm: handle tail pages in page_cache_get_speculative()
>   thp, mm: implement grab_cache_huge_page_write_begin()
>   thp, mm: naive support of thp in generic read/write routines
>   thp, libfs: initial support of thp in
>     simple_read/write_begin/write_end
>   thp: handle file pages in split_huge_page()
>   thp, mm: truncate support for transparent huge page cache
>   thp, mm: split huge page on mmap file page
>   ramfs: enable transparent huge page cache
> 
>  fs/libfs.c                  |   54 +++++++++---
>  fs/ramfs/inode.c            |    6 +-
>  include/linux/backing-dev.h |   10 +++
>  include/linux/huge_mm.h     |    8 ++
>  include/linux/mm.h          |   15 ++++
>  include/linux/pagemap.h     |   14 ++-
>  include/linux/radix-tree.h  |    3 +
>  lib/radix-tree.c            |   32 +++++--
>  mm/filemap.c                |  204 +++++++++++++++++++++++++++++++++++--------
>  mm/huge_memory.c            |   62 +++++++++++--
>  mm/memory.c                 |   22 +++++
>  mm/truncate.c               |   12 +++
>  12 files changed, 375 insertions(+), 67 deletions(-)

Interesting.

I was starting to think about Transparent Huge Pagecache a few
months ago, but then got washed away by incoming waves as usual.

Certainly I don't have a line of code to show for it; but my first
impression of your patches is that we have very different ideas of
where to start.

Perhaps that's good complementarity, or perhaps I'll disagree with
your approach.  I'll be taking a look at yours in the coming days,
and trying to summon back up my own ideas to summarize them for you.

Perhaps I was naive to imagine it, but I did intend to start out
generically, independent of filesystem; but content to narrow down
on tmpfs alone where it gets hard to support the others (writeback
springs to mind).  khugepaged would be migrating little pages into
huge pages, where it saw that the mmaps of the file would benefit
(and for testing I would hack mmap alignment choice to favour it).

I had arrived at a conviction that the first thing to change was
the way that tail pages of a THP are refcounted, that it had been a
mistake to use the compound page method of holding the THP together.
But I'll have to enter a trance now to recall the arguments ;)

Hugh


* Re: [PATCH, RFC 06/16] thp, mm: rewrite add_to_page_cache_locked() to support huge pages
  2013-01-28  9:24   ` Kirill A. Shutemov
@ 2013-01-29 12:11     ` Hillf Danton
  -1 siblings, 0 replies; 66+ messages in thread
From: Hillf Danton @ 2013-01-29 12:11 UTC (permalink / raw)
  To: Kirill A. Shutemov
  Cc: Andrea Arcangeli, Andrew Morton, Al Viro, Wu Fengguang, Jan Kara,
	Mel Gorman, linux-mm, Andi Kleen, Matthew Wilcox,
	Kirill A. Shutemov, linux-fsdevel, LKML

On Mon, Jan 28, 2013 at 5:24 PM, Kirill A. Shutemov
<kirill.shutemov@linux.intel.com> wrote:
> @@ -443,6 +443,7 @@ int add_to_page_cache_locked(struct page *page, struct address_space *mapping,
>                 pgoff_t offset, gfp_t gfp_mask)
>  {
>         int error;
> +       int nr = 1;
>
>         VM_BUG_ON(!PageLocked(page));
>         VM_BUG_ON(PageSwapBacked(page));
> @@ -450,31 +451,61 @@ int add_to_page_cache_locked(struct page *page, struct address_space *mapping,
>         error = mem_cgroup_cache_charge(page, current->mm,
>                                         gfp_mask & GFP_RECLAIM_MASK);
>         if (error)
> -               goto out;
> +               return error;

Due to the PageCompound check, a thp could not be charged effectively.
Is there any change added for charging it?


* Re: [PATCH, RFC 06/16] thp, mm: rewrite add_to_page_cache_locked() to support huge pages
  2013-01-28  9:24   ` Kirill A. Shutemov
@ 2013-01-29 12:14     ` Hillf Danton
  -1 siblings, 0 replies; 66+ messages in thread
From: Hillf Danton @ 2013-01-29 12:14 UTC (permalink / raw)
  To: Kirill A. Shutemov
  Cc: Andrea Arcangeli, Andrew Morton, Al Viro, Wu Fengguang, Jan Kara,
	Mel Gorman, linux-mm, Andi Kleen, Matthew Wilcox,
	Kirill A. Shutemov, linux-fsdevel, LKML

On Mon, Jan 28, 2013 at 5:24 PM, Kirill A. Shutemov
<kirill.shutemov@linux.intel.com> wrote:
> +       page_cache_get(page);
> +       spin_lock_irq(&mapping->tree_lock);
> +       page->mapping = mapping;
> +       if (PageTransHuge(page)) {
> +               int i;
> +               for (i = 0; i < HPAGE_CACHE_NR; i++) {
> +                       page_cache_get(page + i);

The page count is raised twice for the head page; why?

> +                       page[i].index = offset + i;
> +                       error = radix_tree_insert(&mapping->page_tree,
> +                                       offset + i, page + i);
> +                       if (error) {
> +                               page_cache_release(page + i);
> +                               break;
> +                       }


* Re: [PATCH, RFC 06/16] thp, mm: rewrite add_to_page_cache_locked() to support huge pages
  2013-01-28  9:24   ` Kirill A. Shutemov
@ 2013-01-29 12:26     ` Hillf Danton
  -1 siblings, 0 replies; 66+ messages in thread
From: Hillf Danton @ 2013-01-29 12:26 UTC (permalink / raw)
  To: Kirill A. Shutemov
  Cc: Andrea Arcangeli, Andrew Morton, Al Viro, Wu Fengguang, Jan Kara,
	Mel Gorman, linux-mm, Andi Kleen, Matthew Wilcox,
	Kirill A. Shutemov, linux-fsdevel, LKML

On Mon, Jan 28, 2013 at 5:24 PM, Kirill A. Shutemov
<kirill.shutemov@linux.intel.com> wrote:
> +       page_cache_get(page);
> +       spin_lock_irq(&mapping->tree_lock);
> +       page->mapping = mapping;
> +       if (PageTransHuge(page)) {
> +               int i;
> +               for (i = 0; i < HPAGE_CACHE_NR; i++) {
> +                       page_cache_get(page + i);
> +                       page[i].index = offset + i;
> +                       error = radix_tree_insert(&mapping->page_tree,
> +                                       offset + i, page + i);
> +                       if (error) {
> +                               page_cache_release(page + i);
> +                               break;
> +                       }

Is the page count balanced by the following?


@@ -168,6 +180,9 @@ void delete_from_page_cache(struct page *page)

        if (freepage)
                freepage(page);
+       if (PageTransHuge(page))
+               for (i = 1; i < HPAGE_CACHE_NR; i++)
+                       page_cache_release(page);
        page_cache_release(page);


* Re: [PATCH, RFC 06/16] thp, mm: rewrite add_to_page_cache_locked() to support huge pages
  2013-01-29 12:26     ` Hillf Danton
@ 2013-01-29 12:48       ` Kirill A. Shutemov
  -1 siblings, 0 replies; 66+ messages in thread
From: Kirill A. Shutemov @ 2013-01-29 12:48 UTC (permalink / raw)
  To: Hillf Danton, Kirill A. Shutemov
  Cc: Andrea Arcangeli, Andrew Morton, Al Viro, Wu Fengguang, Jan Kara,
	Mel Gorman, linux-mm, Andi Kleen, Matthew Wilcox,
	Kirill A. Shutemov, linux-fsdevel, LKML

Hillf Danton wrote:
> On Mon, Jan 28, 2013 at 5:24 PM, Kirill A. Shutemov
> <kirill.shutemov@linux.intel.com> wrote:
> > +       page_cache_get(page);
> > +       spin_lock_irq(&mapping->tree_lock);
> > +       page->mapping = mapping;
> > +       if (PageTransHuge(page)) {
> > +               int i;
> > +               for (i = 0; i < HPAGE_CACHE_NR; i++) {
> > +                       page_cache_get(page + i);
> > +                       page[i].index = offset + i;
> > +                       error = radix_tree_insert(&mapping->page_tree,
> > +                                       offset + i, page + i);
> > +                       if (error) {
> > +                               page_cache_release(page + i);
> > +                               break;
> > +                       }
> 
> Is page count balanced with the following?

It's broken. Last minute changes are evil :(

Thanks for catching it. I'll fix it in next revision.

> @@ -168,6 +180,9 @@ void delete_from_page_cache(struct page *page)
> 
>         if (freepage)
>                 freepage(page);
> +       if (PageTransHuge(page))
> +               for (i = 1; i < HPAGE_CACHE_NR; i++)
> +                       page_cache_release(page);
>         page_cache_release(page);

-- 
 Kirill A. Shutemov


* Re: [PATCH, RFC 06/16] thp, mm: rewrite add_to_page_cache_locked() to support huge pages
  2013-01-29 12:11     ` Hillf Danton
@ 2013-01-29 13:01       ` Kirill A. Shutemov
  -1 siblings, 0 replies; 66+ messages in thread
From: Kirill A. Shutemov @ 2013-01-29 13:01 UTC (permalink / raw)
  To: Hillf Danton, Kirill A. Shutemov
  Cc: Andrea Arcangeli, Andrew Morton, Al Viro, Wu Fengguang, Jan Kara,
	Mel Gorman, linux-mm, Andi Kleen, Matthew Wilcox,
	Kirill A. Shutemov, linux-fsdevel, LKML

Hillf Danton wrote:
> On Mon, Jan 28, 2013 at 5:24 PM, Kirill A. Shutemov
> <kirill.shutemov@linux.intel.com> wrote:
> > @@ -443,6 +443,7 @@ int add_to_page_cache_locked(struct page *page, struct address_space *mapping,
> >                 pgoff_t offset, gfp_t gfp_mask)
> >  {
> >         int error;
> > +       int nr = 1;
> >
> >         VM_BUG_ON(!PageLocked(page));
> >         VM_BUG_ON(PageSwapBacked(page));
> > @@ -450,31 +451,61 @@ int add_to_page_cache_locked(struct page *page, struct address_space *mapping,
> >         error = mem_cgroup_cache_charge(page, current->mm,
> >                                         gfp_mask & GFP_RECLAIM_MASK);
> >         if (error)
> > -               goto out;
> > +               return error;
> 
> Due to PageCompound check, thp could not be charged effectively.
> Any change added for charging it?

I've missed this. Will fix.

-- 
 Kirill A. Shutemov


* Re: [PATCH, RFC 00/16] Transparent huge page cache
  2013-01-29  5:03   ` Hugh Dickins
@ 2013-01-29 13:14     ` Kirill A. Shutemov
  -1 siblings, 0 replies; 66+ messages in thread
From: Kirill A. Shutemov @ 2013-01-29 13:14 UTC (permalink / raw)
  To: Hugh Dickins, Kirill A. Shutemov
  Cc: Andrea Arcangeli, Andrew Morton, Al Viro, Wu Fengguang, Jan Kara,
	Mel Gorman, linux-mm, Andi Kleen, Matthew Wilcox,
	Kirill A. Shutemov, linux-fsdevel, linux-kernel

Hugh Dickins wrote:
> On Mon, 28 Jan 2013, Kirill A. Shutemov wrote:
> > From: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
> > 
> > Here's first steps towards huge pages in page cache.
> > 
> > The intend of the work is get code ready to enable transparent huge page
> > cache for the most simple fs -- ramfs.
> > 
> > It's not yet near feature-complete. It only provides basic infrastructure.
> > At the moment we can read, write and truncate file on ramfs with huge pages in
> > page cache. The most interesting part, mmap(), is not yet there. For now
> > we split huge page on mmap() attempt.
> > 
> > I can't say that I see whole picture. I'm not sure if I understand locking
> > model around split_huge_page(). Probably, not.
> > Andrea, could you check if it looks correct?
> > 
> > Next steps (not necessary in this order):
> >  - mmap();
> >  - migration (?);
> >  - collapse;
> >  - stats, knobs, etc.;
> >  - tmpfs/shmem enabling;
> >  - ...
> > 
> > Kirill A. Shutemov (16):
> >   block: implement add_bdi_stat()
> >   mm: implement zero_huge_user_segment and friends
> >   mm: drop actor argument of do_generic_file_read()
> >   radix-tree: implement preload for multiple contiguous elements
> >   thp, mm: basic defines for transparent huge page cache
> >   thp, mm: rewrite add_to_page_cache_locked() to support huge pages
> >   thp, mm: rewrite delete_from_page_cache() to support huge pages
> >   thp, mm: locking tail page is a bug
> >   thp, mm: handle tail pages in page_cache_get_speculative()
> >   thp, mm: implement grab_cache_huge_page_write_begin()
> >   thp, mm: naive support of thp in generic read/write routines
> >   thp, libfs: initial support of thp in
> >     simple_read/write_begin/write_end
> >   thp: handle file pages in split_huge_page()
> >   thp, mm: truncate support for transparent huge page cache
> >   thp, mm: split huge page on mmap file page
> >   ramfs: enable transparent huge page cache
> > 
> >  fs/libfs.c                  |   54 +++++++++---
> >  fs/ramfs/inode.c            |    6 +-
> >  include/linux/backing-dev.h |   10 +++
> >  include/linux/huge_mm.h     |    8 ++
> >  include/linux/mm.h          |   15 ++++
> >  include/linux/pagemap.h     |   14 ++-
> >  include/linux/radix-tree.h  |    3 +
> >  lib/radix-tree.c            |   32 +++++--
> >  mm/filemap.c                |  204 +++++++++++++++++++++++++++++++++++--------
> >  mm/huge_memory.c            |   62 +++++++++++--
> >  mm/memory.c                 |   22 +++++
> >  mm/truncate.c               |   12 +++
> >  12 files changed, 375 insertions(+), 67 deletions(-)
> 
> Interesting.
> 
> I was starting to think about Transparent Huge Pagecache a few
> months ago, but then got washed away by incoming waves as usual.
> 
> Certainly I don't have a line of code to show for it; but my first
> impression of your patches is that we have very different ideas of
> where to start.
> 
> Perhaps that's good complementarity, or perhaps I'll disagree with
> your approach.  I'll be taking a look at yours in the coming days,
> and trying to summon back up my own ideas to summarize them for you.

Yeah, it would be nice to see alternative design ideas. Looking forward.

> Perhaps I was naive to imagine it, but I did intend to start out
> generically, independent of filesystem; but content to narrow down
> on tmpfs alone where it gets hard to support the others (writeback
> springs to mind).  khugepaged would be migrating little pages into
> huge pages, where it saw that the mmaps of the file would benefit
> (and for testing I would hack mmap alignment choice to favour it).

I don't think all filesystems at once would fly, but it would be wonderful
if I'm wrong :)

> I had arrived at a conviction that the first thing to change was
> the way that tail pages of a THP are refcounted, that it had been a
> mistake to use the compound page method of holding the THP together.
> But I'll have to enter a trance now to recall the arguments ;)

THP refcounting looks reasonable to me, if you take split_huge_page()
into account.

-- 
 Kirill A. Shutemov


* Re: [PATCH, RFC 00/16] Transparent huge page cache
@ 2013-01-29 13:14     ` Kirill A. Shutemov
  0 siblings, 0 replies; 66+ messages in thread
From: Kirill A. Shutemov @ 2013-01-29 13:14 UTC (permalink / raw)
  To: Hugh Dickins, Kirill A. Shutemov
  Cc: Andrea Arcangeli, Andrew Morton, Al Viro, Wu Fengguang, Jan Kara,
	Mel Gorman, linux-mm, Andi Kleen, Matthew Wilcox,
	Kirill A. Shutemov, linux-fsdevel, linux-kernel

Hugh Dickins wrote:
> On Mon, 28 Jan 2013, Kirill A. Shutemov wrote:
> > From: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
> > 
> > Here's first steps towards huge pages in page cache.
> > 
> > The intend of the work is get code ready to enable transparent huge page
> > cache for the most simple fs -- ramfs.
> > 
> > It's not yet near feature-complete. It only provides basic infrastructure.
> > At the moment we can read, write and truncate file on ramfs with huge pages in
> > page cache. The most interesting part, mmap(), is not yet there. For now
> > we split huge page on mmap() attempt.
> > 
> > I can't say that I see whole picture. I'm not sure if I understand locking
> > model around split_huge_page(). Probably, not.
> > Andrea, could you check if it looks correct?
> > 
> > Next steps (not necessary in this order):
> >  - mmap();
> >  - migration (?);
> >  - collapse;
> >  - stats, knobs, etc.;
> >  - tmpfs/shmem enabling;
> >  - ...
> > 
> > Kirill A. Shutemov (16):
> >   block: implement add_bdi_stat()
> >   mm: implement zero_huge_user_segment and friends
> >   mm: drop actor argument of do_generic_file_read()
> >   radix-tree: implement preload for multiple contiguous elements
> >   thp, mm: basic defines for transparent huge page cache
> >   thp, mm: rewrite add_to_page_cache_locked() to support huge pages
> >   thp, mm: rewrite delete_from_page_cache() to support huge pages
> >   thp, mm: locking tail page is a bug
> >   thp, mm: handle tail pages in page_cache_get_speculative()
> >   thp, mm: implement grab_cache_huge_page_write_begin()
> >   thp, mm: naive support of thp in generic read/write routines
> >   thp, libfs: initial support of thp in
> >     simple_read/write_begin/write_end
> >   thp: handle file pages in split_huge_page()
> >   thp, mm: truncate support for transparent huge page cache
> >   thp, mm: split huge page on mmap file page
> >   ramfs: enable transparent huge page cache
> > 
> >  fs/libfs.c                  |   54 +++++++++---
> >  fs/ramfs/inode.c            |    6 +-
> >  include/linux/backing-dev.h |   10 +++
> >  include/linux/huge_mm.h     |    8 ++
> >  include/linux/mm.h          |   15 ++++
> >  include/linux/pagemap.h     |   14 ++-
> >  include/linux/radix-tree.h  |    3 +
> >  lib/radix-tree.c            |   32 +++++--
> >  mm/filemap.c                |  204 +++++++++++++++++++++++++++++++++++--------
> >  mm/huge_memory.c            |   62 +++++++++++--
> >  mm/memory.c                 |   22 +++++
> >  mm/truncate.c               |   12 +++
> >  12 files changed, 375 insertions(+), 67 deletions(-)
> 
> Interesting.
> 
> I was starting to think about Transparent Huge Pagecache a few
> months ago, but then got washed away by incoming waves as usual.
> 
> Certainly I don't have a line of code to show for it; but my first
> impression of your patches is that we have very different ideas of
> where to start.
> 
> Perhaps that's good complementarity, or perhaps I'll disagree with
> your approach.  I'll be taking a look at yours in the coming days,
> and trying to summon back up my own ideas to summarize them for you.

Yeah, it would be nice to see alternative design ideas. Looking forward.

> Perhaps I was naive to imagine it, but I did intend to start out
> generically, independent of filesystem; but content to narrow down
> on tmpfs alone where it gets hard to support the others (writeback
> springs to mind).  khugepaged would be migrating little pages into
> huge pages, where it saw that the mmaps of the file would benefit
> (and for testing I would hack mmap alignment choice to favour it).

I don't think supporting all filesystems at once would fly, but it would
be wonderful if I'm wrong :)

> I had arrived at a conviction that the first thing to change was
> the way that tail pages of a THP are refcounted, that it had been a
> mistake to use the compound page method of holding the THP together.
> But I'll have to enter a trance now to recall the arguments ;)

THP refcounting looks reasonable to me, if you take split_huge_page()
into account.

-- 
 Kirill A. Shutemov

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <dont@kvack.org>

^ permalink raw reply	[flat|nested] 66+ messages in thread

* Re: [PATCH, RFC 00/16] Transparent huge page cache
  2013-01-29 13:14     ` Kirill A. Shutemov
@ 2013-01-31  2:12       ` Hugh Dickins
  -1 siblings, 0 replies; 66+ messages in thread
From: Hugh Dickins @ 2013-01-31  2:12 UTC (permalink / raw)
  To: Kirill A. Shutemov
  Cc: Andrea Arcangeli, Andrew Morton, Al Viro, Wu Fengguang, Jan Kara,
	Mel Gorman, linux-mm, Andi Kleen, Matthew Wilcox,
	Kirill A. Shutemov, linux-fsdevel, linux-kernel

On Tue, 29 Jan 2013, Kirill A. Shutemov wrote:
> Hugh Dickins wrote:
> > On Mon, 28 Jan 2013, Kirill A. Shutemov wrote:
> > > From: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
> > > 
> > > Here are the first steps towards huge pages in the page cache.
> > > 
> > > The intent of this work is to get the code ready to enable transparent
> > > huge page cache for the simplest fs -- ramfs.
> > > 
> > > It's not yet near feature-complete. It only provides basic infrastructure.
> > > At the moment we can read, write and truncate files on ramfs with huge
> > > pages in the page cache. The most interesting part, mmap(), is not yet
> > > there. For now we split the huge page on any mmap() attempt.
> > > 
> > > I can't say that I see the whole picture. I'm not sure that I understand
> > > the locking model around split_huge_page(). Probably not.
> > > Andrea, could you check whether it looks correct?
> > > 
> > > Next steps (not necessary in this order):
> > >  - mmap();
> > >  - migration (?);
> > >  - collapse;
> > >  - stats, knobs, etc.;
> > >  - tmpfs/shmem enabling;
> > >  - ...
> > > 
> > > Kirill A. Shutemov (16):
> > >   block: implement add_bdi_stat()
> > >   mm: implement zero_huge_user_segment and friends
> > >   mm: drop actor argument of do_generic_file_read()
> > >   radix-tree: implement preload for multiple contiguous elements
> > >   thp, mm: basic defines for transparent huge page cache
> > >   thp, mm: rewrite add_to_page_cache_locked() to support huge pages
> > >   thp, mm: rewrite delete_from_page_cache() to support huge pages
> > >   thp, mm: locking tail page is a bug
> > >   thp, mm: handle tail pages in page_cache_get_speculative()
> > >   thp, mm: implement grab_cache_huge_page_write_begin()
> > >   thp, mm: naive support of thp in generic read/write routines
> > >   thp, libfs: initial support of thp in
> > >     simple_read/write_begin/write_end
> > >   thp: handle file pages in split_huge_page()
> > >   thp, mm: truncate support for transparent huge page cache
> > >   thp, mm: split huge page on mmap file page
> > >   ramfs: enable transparent huge page cache
> > > 
> > >  fs/libfs.c                  |   54 +++++++++---
> > >  fs/ramfs/inode.c            |    6 +-
> > >  include/linux/backing-dev.h |   10 +++
> > >  include/linux/huge_mm.h     |    8 ++
> > >  include/linux/mm.h          |   15 ++++
> > >  include/linux/pagemap.h     |   14 ++-
> > >  include/linux/radix-tree.h  |    3 +
> > >  lib/radix-tree.c            |   32 +++++--
> > >  mm/filemap.c                |  204 +++++++++++++++++++++++++++++++++++--------
> > >  mm/huge_memory.c            |   62 +++++++++++--
> > >  mm/memory.c                 |   22 +++++
> > >  mm/truncate.c               |   12 +++
> > >  12 files changed, 375 insertions(+), 67 deletions(-)
> > 
> > Interesting.
> > 
> > I was starting to think about Transparent Huge Pagecache a few
> > months ago, but then got washed away by incoming waves as usual.
> > 
> > Certainly I don't have a line of code to show for it; but my first
> > impression of your patches is that we have very different ideas of
> > where to start.

A second impression confirms that we have very different ideas of
where to start.  I don't want to be dismissive, and please don't let
me discourage you, but I just don't find what you have very interesting.

I'm sure you'll agree that the interesting part, and the difficult part,
comes with mmap(); and there's no point whatever to THPages without mmap()
(of course, I'm including exec and brk and shm when I say mmap there).

(There may be performance benefits in working with larger page cache
size, which Christoph Lameter explored a few years back, but that's a
different topic: I think 2MB - if I may be x86_64-centric - would not be
the unit of choice for that, unless SSD erase block were to dominate.)

I'm interested to get to the point of prototyping something that does
support mmap() of THPageCache: I'm pretty sure that I'd then soon learn
a lot about my misconceptions, and have to rework for a while (or give
up!); but I don't see much point in posting anything without that.
I don't know if we have 5 or 50 places which "know" that a THPage
must be Anon: some I'll spot in advance, some I sadly won't.

It's not clear to me that the infrastructural changes you make in this
series will be needed or not, if I pursue my approach: some perhaps as
optimizations on top of the poorly performing base that may emerge from
going about it my way.  But for me it's too soon to think about those.

Something I notice that we do agree upon: the radix_tree holding the
4k subpages, at least for now.  When I first started thinking towards
THPageCache, I was fascinated by how we could manage the hugepages in
the radix_tree, cutting out unnecessary levels etc; but after a while
I realized that although there's probably nice scope for cleverness
there (significantly constrained by RCU expectations), it would only
be about optimization.  Let's be simple and stupid about radix_tree
for now, the problems that need to be worked out lie elsewhere.

> > 
> > Perhaps that's good complementarity, or perhaps I'll disagree with
> > your approach.  I'll be taking a look at yours in the coming days,
> > and trying to summon back up my own ideas to summarize them for you.
> 
> Yeah, it would be nice to see alternative design ideas. Looking forward.
> 
> > Perhaps I was naive to imagine it, but I did intend to start out
> > generically, independent of filesystem; but content to narrow down
> > on tmpfs alone where it gets hard to support the others (writeback
> > springs to mind).  khugepaged would be migrating little pages into
> > huge pages, where it saw that the mmaps of the file would benefit
> > (and for testing I would hack mmap alignment choice to favour it).
> 
> I don't think supporting all filesystems at once would fly, but it would
> be wonderful if I'm wrong :)

You are imagining the filesystem putting huge pages into its cache.
Whereas I'm imagining khugepaged looking around at mmaped file areas,
seeing which would benefit from huge pagecache (let's assume offset 0
belongs on hugepage boundary - maybe one day someone will want to tune
some files or parts differently, but that's low priority), migrating 4k
pages over to 2MB page (wouldn't have to be done all in one pass), then
finally slotting in the pmds for that.

But going this way, I expect we'd have to split at page_mkwrite():
we probably don't want a single touch to dirty 2MB at a time,
unless tmpfs or ramfs.

> 
> > I had arrived at a conviction that the first thing to change was
> > the way that tail pages of a THP are refcounted, that it had been a
> > mistake to use the compound page method of holding the THP together.
> > But I'll have to enter a trance now to recall the arguments ;)
> 
> THP refcounting looks reasonable to me, if you take split_huge_page()
> into account.

I'm not claiming that the THP refcounting is wrong in what it's doing
at present; but that I suspect we'll want to rework it for THPageCache.

Something I take for granted, I think you do too but I'm not certain:
a file with transparent huge pages in its page cache can also have small
pages in other extents of its page cache; and can be mapped hugely (2MB
extents) into one address space at the same time as individual 4k pages
from those extents are mapped into another (or the same) address space.

One can certainly imagine sacrificing that principle, splitting whenever
there's such a "conflict"; but it then becomes uninteresting to me, too
much like hugetlbfs.  Splitting an anonymous hugepage in all address
spaces that hold it when one of them needs it split, that has been a
pragmatic strategy: it's not a common case for forks to diverge like
that; but files are expected to be more widely shared.

At present THP is using compound pages, with mapcount of tail pages
reused to track their contribution to head page count; but I think we
shall want to be able to use the mapcount, and the count, of TH tail
pages for their original purpose if huge mappings can coexist with tiny.
Not fully thought out, but that's my feeling.

The use of compound pages, in particular the redirection of tail page
count to head page count, was important in hugetlbfs: a get_user_pages
reference on a subpage must prevent the containing hugepage from being
freed, because hugetlbfs has its own separate pool of hugepages to
which freeing returns them.

But for transparent huge pages?  It should not matter so much if the
subpages are freed independently.  So I'd like to devise another glue
to hold them together more loosely (for prototyping I can certainly
pretend we have infinite pageflag and pagefield space if that helps):
I may find in practice that they're forever falling apart, and I run
crying back to compound pages; but at present I'm hoping not.

This mail might suggest that I'm about to start coding: I wish that
were true, but in reality there's always a lot of unrelated things
I have to look at, which dilute my focus.  So if I've said anything
that sparks ideas for you, go with them.

Hugh

^ permalink raw reply	[flat|nested] 66+ messages in thread

* Re: [PATCH, RFC 00/16] Transparent huge page cache
  2013-01-31  2:12       ` Hugh Dickins
@ 2013-02-02 15:13         ` Kirill A. Shutemov
  -1 siblings, 0 replies; 66+ messages in thread
From: Kirill A. Shutemov @ 2013-02-02 15:13 UTC (permalink / raw)
  To: Hugh Dickins, Andrea Arcangeli
  Cc: Kirill A. Shutemov, Andrew Morton, Al Viro, Wu Fengguang,
	Jan Kara, Mel Gorman, linux-mm, Andi Kleen, Matthew Wilcox,
	Kirill A. Shutemov, linux-fsdevel, linux-kernel


Hugh Dickins wrote:
> On Tue, 29 Jan 2013, Kirill A. Shutemov wrote:
> > Hugh Dickins wrote:
> > > 
> > > Interesting.
> > > 
> > > I was starting to think about Transparent Huge Pagecache a few
> > > months ago, but then got washed away by incoming waves as usual.
> > > 
> > > Certainly I don't have a line of code to show for it; but my first
> > > impression of your patches is that we have very different ideas of
> > > where to start.
> 
> A second impression confirms that we have very different ideas of
> where to start.  I don't want to be dismissive, and please don't let
> me discourage you, but I just don't find what you have very interesting.

The main reason for publishing the patchset in its current
(not-really-useful) state was to start the discussion early.
Looks like it works :)

> I'm sure you'll agree that the interesting part, and the difficult part,
> comes with mmap(); and there's no point whatever to THPages without mmap()
> (of course, I'm including exec and brk and shm when I say mmap there).
> 
> (There may be performance benefits in working with larger page cache
> size, which Christoph Lameter explored a few years back, but that's a
> different topic: I think 2MB - if I may be x86_64-centric - would not be
> the unit of choice for that, unless SSD erase block were to dominate.)
> 
> I'm interested to get to the point of prototyping something that does
> support mmap() of THPageCache: I'm pretty sure that I'd then soon learn
> a lot about my misconceptions, and have to rework for a while (or give
> up!); but I don't see much point in posting anything without that.
> I don't know if we have 5 or 50 places which "know" that a THPage
> must be Anon: some I'll spot in advance, some I sadly won't.
> 
> It's not clear to me that the infrastructural changes you make in this
> series will be needed or not, if I pursue my approach: some perhaps as
> optimizations on top of the poorly performing base that may emerge from
> going about it my way.  But for me it's too soon to think about those.
> 
> Something I notice that we do agree upon: the radix_tree holding the
> 4k subpages, at least for now.  When I first started thinking towards
> THPageCache, I was fascinated by how we could manage the hugepages in
> the radix_tree, cutting out unnecessary levels etc; but after a while
> I realized that although there's probably nice scope for cleverness
> there (significantly constrained by RCU expectations), it would only
> be about optimization.

One more point: you still have to reserve memory for these levels anyway,
since split_huge_page() must never fail.

> Let's be simple and stupid about radix_tree
> for now, the problems that need to be worked out lie elsewhere.
> 
> > > 
> > > Perhaps that's good complementarity, or perhaps I'll disagree with
> > > your approach.  I'll be taking a look at yours in the coming days,
> > > and trying to summon back up my own ideas to summarize them for you.
> > 
> > Yeah, it would be nice to see alternative design ideas. Looking forward.
> > 
> > > Perhaps I was naive to imagine it, but I did intend to start out
> > > generically, independent of filesystem; but content to narrow down
> > > on tmpfs alone where it gets hard to support the others (writeback
> > > springs to mind).  khugepaged would be migrating little pages into
> > > huge pages, where it saw that the mmaps of the file would benefit
> > > (and for testing I would hack mmap alignment choice to favour it).
> > 
> > I don't think supporting all filesystems at once would fly, but it would
> > be wonderful if I'm wrong :)
> 
> You are imagining the filesystem putting huge pages into its cache.
> Whereas I'm imagining khugepaged looking around at mmaped file areas,
> seeing which would benefit from huge pagecache (let's assume offset 0
> belongs on hugepage boundary - maybe one day someone will want to tune
> some files or parts differently, but that's low priority), migrating 4k
> pages over to 2MB page (wouldn't have to be done all in one pass), then
> finally slotting in the pmds for that.

I had file huge page consolidation on my todo list, but for much later.
I feel that our approaches are complementary.

> But going this way, I expect we'd have to split at page_mkwrite():
> we probably don't want a single touch to dirty 2MB at a time,
> unless tmpfs or ramfs.

Hm, splitting is rather expensive. I think it makes sense, for filesystems
with a backing device, to consolidate only pages which are mapped without
PROT_WRITE. That way we can avoid consolidate-split loops.

> > > I had arrived at a conviction that the first thing to change was
> > > the way that tail pages of a THP are refcounted, that it had been a
> > > mistake to use the compound page method of holding the THP together.
> > > But I'll have to enter a trance now to recall the arguments ;)
> > 
> > THP refcounting looks reasonable to me, if you take split_huge_page()
> > into account.
> 
> I'm not claiming that the THP refcounting is wrong in what it's doing
> at present; but that I suspect we'll want to rework it for THPageCache.
> 
> Something I take for granted, I think you do too but I'm not certain:
> a file with transparent huge pages in its page cache can also have small
> pages in other extents of its page cache; and can be mapped hugely (2MB
> extents) into one address space at the same time as individual 4k pages
> from those extents are mapped into another (or the same) address space.
> 
> One can certainly imagine sacrificing that principle, splitting whenever
> there's such a "conflict"; but it then becomes uninteresting to me, too
> much like hugetlbfs.  Splitting an anonymous hugepage in all address
> spaces that hold it when one of them needs it split, that has been a
> pragmatic strategy: it's not a common case for forks to diverge like
> that; but files are expected to be more widely shared.
> 
> At present THP is using compound pages, with mapcount of tail pages
> reused to track their contribution to head page count; but I think we
> shall want to be able to use the mapcount, and the count, of TH tail
> pages for their original purpose if huge mappings can coexist with tiny.
> Not fully thought out, but that's my feeling.
> 
> The use of compound pages, in particular the redirection of tail page
> count to head page count, was important in hugetlbfs: a get_user_pages
> reference on a subpage must prevent the containing hugepage from being
> freed, because hugetlbfs has its own separate pool of hugepages to
> which freeing returns them.
> 
> But for transparent huge pages?  It should not matter so much if the
> subpages are freed independently.  So I'd like to devise another glue
> to hold them together more loosely (for prototyping I can certainly
> pretend we have infinite pageflag and pagefield space if that helps):
> I may find in practice that they're forever falling apart, and I run
> crying back to compound pages; but at present I'm hoping not.

Looks interesting, but I'm not sure whether it will work. It would be
nice to summon Andrea into this thread.
 
> This mail might suggest that I'm about to start coding: I wish that
> were true, but in reality there's always a lot of unrelated things
> I have to look at, which dilute my focus.  So if I've said anything
> that sparks ideas for you, go with them.

I want to get my current approach working first. We'll see.

-- 
 Kirill A. Shutemov

^ permalink raw reply	[flat|nested] 66+ messages in thread

* Re: [PATCH, RFC 00/16] Transparent huge page cache
@ 2013-02-02 15:13         ` Kirill A. Shutemov
  0 siblings, 0 replies; 66+ messages in thread
From: Kirill A. Shutemov @ 2013-02-02 15:13 UTC (permalink / raw)
  To: Hugh Dickins, Andrea Arcangeli
  Cc: Kirill A. Shutemov, Andrew Morton, Al Viro, Wu Fengguang,
	Jan Kara, Mel Gorman, linux-mm, Andi Kleen, Matthew Wilcox,
	Kirill A. Shutemov, linux-fsdevel, linux-kernel


Hugh Dickins wrote:
> On Tue, 29 Jan 2013, Kirill A. Shutemov wrote:
> > Hugh Dickins wrote:
> > > 
> > > Interesting.
> > > 
> > > I was starting to think about Transparent Huge Pagecache a few
> > > months ago, but then got washed away by incoming waves as usual.
> > > 
> > > Certainly I don't have a line of code to show for it; but my first
> > > impression of your patches is that we have very different ideas of
> > > where to start.
> 
> A second impression confirms that we have very different ideas of
> where to start.  I don't want to be dismissive, and please don't let
> me discourage you, but I just don't find what you have very interesting.

The main reason for publishing the patchset in current
(not-really-useful) state is to start discussion early.
Looks like it works :)

> I'm sure you'll agree that the interesting part, and the difficult part,
> comes with mmap(); and there's no point whatever to THPages without mmap()
> (of course, I'm including exec and brk and shm when I say mmap there).
> 
> (There may be performance benefits in working with larger page cache
> size, which Christoph Lameter explored a few years back, but that's a
> different topic: I think 2MB - if I may be x86_64-centric - would not be
> the unit of choice for that, unless SSD erase block were to dominate.)
> 
> I'm interested to get to the point of prototyping something that does
> support mmap() of THPageCache: I'm pretty sure that I'd then soon learn
> a lot about my misconceptions, and have to rework for a while (or give
> up!); but I don't see much point in posting anything without that.
> I don't know if we have 5 or 50 places which "know" that a THPage
> must be Anon: some I'll spot in advance, some I sadly won't.
> 
> It's not clear to me that the infrastructural changes you make in this
> series will be needed or not, if I pursue my approach: some perhaps as
> optimizations on top of the poorly performing base that may emerge from
> going about it my way.  But for me it's too soon to think about those.
> 
> Something I notice that we do agree upon: the radix_tree holding the
> 4k subpages, at least for now.  When I first started thinking towards
> THPageCache, I was fascinated by how we could manage the hugepages in
> the radix_tree, cutting out unnecessary levels etc; but after a while
> I realized that although there's probably nice scope for cleverness
> there (significantly constrained by RCU expectations), it would only
> be about optimization.

One more point: you have still preserve memory for these levels anyway,
since we must have never-fail split_huge_page().

> Let's be simple and stupid about radix_tree
> for now, the problems that need to be worked out lie elsewhere.
> 
> > > 
> > > Perhaps that's good complementarity, or perhaps I'll disagree with
> > > your approach.  I'll be taking a look at yours in the coming days,
> > > and trying to summon back up my own ideas to summarize them for you.
> > 
> > Yeah, it would be nice to see alternative design ideas. Looking forward.
> > 
> > > Perhaps I was naive to imagine it, but I did intend to start out
> > > generically, independent of filesystem; but content to narrow down
> > > on tmpfs alone where it gets hard to support the others (writeback
> > > springs to mind).  khugepaged would be migrating little pages into
> > > huge pages, where it saw that the mmaps of the file would benefit
> > > (and for testing I would hack mmap alignment choice to favour it).
> > 
> > I don't think all fs at once would fly, but it's wonderful, if I'm
> > wrong :)
> 
> You are imagining the filesystem putting huge pages into its cache.
> Whereas I'm imagining khugepaged looking around at mmaped file areas,
> seeing which would benefit from huge pagecache (let's assume offset 0
> belongs on hugepage boundary - maybe one day someone will want to tune
> some files or parts differently, but that's low priority), migrating 4k
> pages over to 2MB page (wouldn't have to be done all in one pass), then
> finally slotting in the pmds for that.

I had file huge page consolidation on my todo list, but much later. I feel
that our approaches are complementary.

> But going this way, I expect we'd have to split at page_mkwrite():
> we probably don't want a single touch to dirty 2MB at a time,
> unless tmpfs or ramfs.

Hm.. Splitting is rather expensive. I think it makes sense, for filesystems
with a backing device, to consolidate only pages which are mapped without
PROT_WRITE. This way we can avoid consolidate-split loops.

> > > I had arrived at a conviction that the first thing to change was
> > > the way that tail pages of a THP are refcounted, that it had been a
> > > mistake to use the compound page method of holding the THP together.
> > > But I'll have to enter a trance now to recall the arguments ;)
> > 
> > THP refcounting looks reasonable for me, if take split_huge_page() in
> > account.
> 
> I'm not claiming that the THP refcounting is wrong in what it's doing
> at present; but that I suspect we'll want to rework it for THPageCache.
> 
> Something I take for granted, I think you do too but I'm not certain:
> a file with transparent huge pages in its page cache can also have small
> pages in other extents of its page cache; and can be mapped hugely (2MB
> extents) into one address space at the same time as individual 4k pages
> from those extents are mapped into another (or the same) address space.
> 
> One can certainly imagine sacrificing that principle, splitting whenever
> there's such a "conflict"; but it then becomes uninteresting to me, too
> much like hugetlbfs.  Splitting an anonymous hugepage in all address
> spaces that hold it when one of them needs it split, that has been a
> pragmatic strategy: it's not a common case for forks to diverge like
> that; but files are expected to be more widely shared.
> 
> At present THP is using compound pages, with mapcount of tail pages
> reused to track their contribution to head page count; but I think we
> shall want to be able to use the mapcount, and the count, of TH tail
> pages for their original purpose if huge mappings can coexist with tiny.
> Not fully thought out, but that's my feeling.
> 
> The use of compound pages, in particular the redirection of tail page
> count to head page count, was important in hugetlbfs: a get_user_pages
> reference on a subpage must prevent the containing hugepage from being
> freed, because hugetlbfs has its own separate pool of hugepages to
> which freeing returns them.
> 
> But for transparent huge pages?  It should not matter so much if the
> subpages are freed independently.  So I'd like to devise another glue
> to hold them together more loosely (for prototyping I can certainly
> pretend we have infinite pageflag and pagefield space if that helps):
> I may find in practice that they're forever falling apart, and I run
> crying back to compound pages; but at present I'm hoping not.

Looks interesting. But I'm not sure whether it will work. It would be nice
to summon Andrea to the thread.
 
> This mail might suggest that I'm about to start coding: I wish that
> were true, but in reality there's always a lot of unrelated things
> I have to look at, which dilute my focus.  So if I've said anything
> that sparks ideas for you, go with them.

I want to get my current approach working first. We'll see.

-- 
 Kirill A. Shutemov

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 66+ messages in thread

* Re: [PATCH, RFC 00/16] Transparent huge page cache
  2013-01-28  9:24 ` Kirill A. Shutemov
@ 2013-03-18  9:36   ` Simon Jeons
  -1 siblings, 0 replies; 66+ messages in thread
From: Simon Jeons @ 2013-03-18  9:36 UTC (permalink / raw)
  To: Kirill A. Shutemov
  Cc: Andrea Arcangeli, Andrew Morton, Al Viro, Wu Fengguang, Jan Kara,
	Mel Gorman, linux-mm, Andi Kleen, Matthew Wilcox,
	Kirill A. Shutemov, linux-fsdevel, linux-kernel

On 01/28/2013 05:24 PM, Kirill A. Shutemov wrote:
> From: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
>
> Here's first steps towards huge pages in page cache.
>
> The intend of the work is get code ready to enable transparent huge page
> cache for the most simple fs -- ramfs.
>
> It's not yet near feature-complete. It only provides basic infrastructure.
> At the moment we can read, write and truncate file on ramfs with huge pages in
> page cache. The most interesting part, mmap(), is not yet there. For now
> we split huge page on mmap() attempt.
>
> I can't say that I see whole picture. I'm not sure if I understand locking
> model around split_huge_page(). Probably, not.
> Andrea, could you check if it looks correct?

Another offline question:
Why doesn't __split_huge_page_refcount() clear the PG_tail flag of the
tail pages?

>
> Next steps (not necessary in this order):
>   - mmap();
>   - migration (?);
>   - collapse;
>   - stats, knobs, etc.;
>   - tmpfs/shmem enabling;
>   - ...
>
> Kirill A. Shutemov (16):
>    block: implement add_bdi_stat()
>    mm: implement zero_huge_user_segment and friends
>    mm: drop actor argument of do_generic_file_read()
>    radix-tree: implement preload for multiple contiguous elements
>    thp, mm: basic defines for transparent huge page cache
>    thp, mm: rewrite add_to_page_cache_locked() to support huge pages
>    thp, mm: rewrite delete_from_page_cache() to support huge pages
>    thp, mm: locking tail page is a bug
>    thp, mm: handle tail pages in page_cache_get_speculative()
>    thp, mm: implement grab_cache_huge_page_write_begin()
>    thp, mm: naive support of thp in generic read/write routines
>    thp, libfs: initial support of thp in
>      simple_read/write_begin/write_end
>    thp: handle file pages in split_huge_page()
>    thp, mm: truncate support for transparent huge page cache
>    thp, mm: split huge page on mmap file page
>    ramfs: enable transparent huge page cache
>
>   fs/libfs.c                  |   54 +++++++++---
>   fs/ramfs/inode.c            |    6 +-
>   include/linux/backing-dev.h |   10 +++
>   include/linux/huge_mm.h     |    8 ++
>   include/linux/mm.h          |   15 ++++
>   include/linux/pagemap.h     |   14 ++-
>   include/linux/radix-tree.h  |    3 +
>   lib/radix-tree.c            |   32 +++++--
>   mm/filemap.c                |  204 +++++++++++++++++++++++++++++++++++--------
>   mm/huge_memory.c            |   62 +++++++++++--
>   mm/memory.c                 |   22 +++++
>   mm/truncate.c               |   12 +++
>   12 files changed, 375 insertions(+), 67 deletions(-)
>


^ permalink raw reply	[flat|nested] 66+ messages in thread

* Re: [PATCH, RFC 00/16] Transparent huge page cache
  2013-01-28  9:24 ` Kirill A. Shutemov
@ 2013-03-21  8:00   ` Simon Jeons
  -1 siblings, 0 replies; 66+ messages in thread
From: Simon Jeons @ 2013-03-21  8:00 UTC (permalink / raw)
  To: Kirill A. Shutemov
  Cc: Andrea Arcangeli, Andrew Morton, Al Viro, Wu Fengguang, Jan Kara,
	Mel Gorman, linux-mm, Andi Kleen, Matthew Wilcox,
	Kirill A. Shutemov, linux-fsdevel, linux-kernel

On 01/28/2013 05:24 PM, Kirill A. Shutemov wrote:
> From: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
>
> Here's first steps towards huge pages in page cache.
>
> The intend of the work is get code ready to enable transparent huge page
> cache for the most simple fs -- ramfs.
>
> It's not yet near feature-complete. It only provides basic infrastructure.
> At the moment we can read, write and truncate file on ramfs with huge pages in
> page cache. The most interesting part, mmap(), is not yet there. For now
> we split huge page on mmap() attempt.
>
> I can't say that I see whole picture. I'm not sure if I understand locking
> model around split_huge_page(). Probably, not.
> Andrea, could you check if it looks correct?

Is there any THP performance benchmark, for either anonymous pages or
file pages?

>
> Next steps (not necessary in this order):
>   - mmap();
>   - migration (?);
>   - collapse;
>   - stats, knobs, etc.;
>   - tmpfs/shmem enabling;
>   - ...
>
> Kirill A. Shutemov (16):
>    block: implement add_bdi_stat()
>    mm: implement zero_huge_user_segment and friends
>    mm: drop actor argument of do_generic_file_read()
>    radix-tree: implement preload for multiple contiguous elements
>    thp, mm: basic defines for transparent huge page cache
>    thp, mm: rewrite add_to_page_cache_locked() to support huge pages
>    thp, mm: rewrite delete_from_page_cache() to support huge pages
>    thp, mm: locking tail page is a bug
>    thp, mm: handle tail pages in page_cache_get_speculative()
>    thp, mm: implement grab_cache_huge_page_write_begin()
>    thp, mm: naive support of thp in generic read/write routines
>    thp, libfs: initial support of thp in
>      simple_read/write_begin/write_end
>    thp: handle file pages in split_huge_page()
>    thp, mm: truncate support for transparent huge page cache
>    thp, mm: split huge page on mmap file page
>    ramfs: enable transparent huge page cache
>
>   fs/libfs.c                  |   54 +++++++++---
>   fs/ramfs/inode.c            |    6 +-
>   include/linux/backing-dev.h |   10 +++
>   include/linux/huge_mm.h     |    8 ++
>   include/linux/mm.h          |   15 ++++
>   include/linux/pagemap.h     |   14 ++-
>   include/linux/radix-tree.h  |    3 +
>   lib/radix-tree.c            |   32 +++++--
>   mm/filemap.c                |  204 +++++++++++++++++++++++++++++++++++--------
>   mm/huge_memory.c            |   62 +++++++++++--
>   mm/memory.c                 |   22 +++++
>   mm/truncate.c               |   12 +++
>   12 files changed, 375 insertions(+), 67 deletions(-)
>


^ permalink raw reply	[flat|nested] 66+ messages in thread

* Re: [PATCH, RFC 00/16] Transparent huge page cache
  2013-01-31  2:12       ` Hugh Dickins
@ 2013-04-05  0:26         ` Simon Jeons
  -1 siblings, 0 replies; 66+ messages in thread
From: Simon Jeons @ 2013-04-05  0:26 UTC (permalink / raw)
  To: Hugh Dickins
  Cc: Kirill A. Shutemov, Andrea Arcangeli, Andrew Morton, Al Viro,
	Wu Fengguang, Jan Kara, Mel Gorman, linux-mm, Andi Kleen,
	Matthew Wilcox, Kirill A. Shutemov, linux-fsdevel, linux-kernel

Hi Hugh,
On 01/31/2013 10:12 AM, Hugh Dickins wrote:
> On Tue, 29 Jan 2013, Kirill A. Shutemov wrote:
>> Hugh Dickins wrote:
>>> On Mon, 28 Jan 2013, Kirill A. Shutemov wrote:
>>>> From: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
>>>>
>>>> Here's first steps towards huge pages in page cache.
>>>>
>>>> The intend of the work is get code ready to enable transparent huge page
>>>> cache for the most simple fs -- ramfs.
>>>>
>>>> It's not yet near feature-complete. It only provides basic infrastructure.
>>>> At the moment we can read, write and truncate file on ramfs with huge pages in
>>>> page cache. The most interesting part, mmap(), is not yet there. For now
>>>> we split huge page on mmap() attempt.
>>>>
>>>> I can't say that I see whole picture. I'm not sure if I understand locking
>>>> model around split_huge_page(). Probably, not.
>>>> Andrea, could you check if it looks correct?
>>>>
>>>> Next steps (not necessary in this order):
>>>>   - mmap();
>>>>   - migration (?);
>>>>   - collapse;
>>>>   - stats, knobs, etc.;
>>>>   - tmpfs/shmem enabling;
>>>>   - ...
>>>>
>>>> Kirill A. Shutemov (16):
>>>>    block: implement add_bdi_stat()
>>>>    mm: implement zero_huge_user_segment and friends
>>>>    mm: drop actor argument of do_generic_file_read()
>>>>    radix-tree: implement preload for multiple contiguous elements
>>>>    thp, mm: basic defines for transparent huge page cache
>>>>    thp, mm: rewrite add_to_page_cache_locked() to support huge pages
>>>>    thp, mm: rewrite delete_from_page_cache() to support huge pages
>>>>    thp, mm: locking tail page is a bug
>>>>    thp, mm: handle tail pages in page_cache_get_speculative()
>>>>    thp, mm: implement grab_cache_huge_page_write_begin()
>>>>    thp, mm: naive support of thp in generic read/write routines
>>>>    thp, libfs: initial support of thp in
>>>>      simple_read/write_begin/write_end
>>>>    thp: handle file pages in split_huge_page()
>>>>    thp, mm: truncate support for transparent huge page cache
>>>>    thp, mm: split huge page on mmap file page
>>>>    ramfs: enable transparent huge page cache
>>>>
>>>>   fs/libfs.c                  |   54 +++++++++---
>>>>   fs/ramfs/inode.c            |    6 +-
>>>>   include/linux/backing-dev.h |   10 +++
>>>>   include/linux/huge_mm.h     |    8 ++
>>>>   include/linux/mm.h          |   15 ++++
>>>>   include/linux/pagemap.h     |   14 ++-
>>>>   include/linux/radix-tree.h  |    3 +
>>>>   lib/radix-tree.c            |   32 +++++--
>>>>   mm/filemap.c                |  204 +++++++++++++++++++++++++++++++++++--------
>>>>   mm/huge_memory.c            |   62 +++++++++++--
>>>>   mm/memory.c                 |   22 +++++
>>>>   mm/truncate.c               |   12 +++
>>>>   12 files changed, 375 insertions(+), 67 deletions(-)
>>> Interesting.
>>>
>>> I was starting to think about Transparent Huge Pagecache a few
>>> months ago, but then got washed away by incoming waves as usual.
>>>
>>> Certainly I don't have a line of code to show for it; but my first
>>> impression of your patches is that we have very different ideas of
>>> where to start.
> A second impression confirms that we have very different ideas of
> where to start.  I don't want to be dismissive, and please don't let
> me discourage you, but I just don't find what you have very interesting.
>
> I'm sure you'll agree that the interesting part, and the difficult part,
> comes with mmap(); and there's no point whatever to THPages without mmap()
> (of course, I'm including exec and brk and shm when I say mmap there).
>
> (There may be performance benefits in working with larger page cache
> size, which Christoph Lameter explored a few years back, but that's a
> different topic: I think 2MB - if I may be x86_64-centric - would not be
> the unit of choice for that, unless SSD erase block were to dominate.)
>
> I'm interested to get to the point of prototyping something that does
> support mmap() of THPageCache: I'm pretty sure that I'd then soon learn
> a lot about my misconceptions, and have to rework for a while (or give
> up!); but I don't see much point in posting anything without that.
> I don't know if we have 5 or 50 places which "know" that a THPage
> must be Anon: some I'll spot in advance, some I sadly won't.
>
> It's not clear to me that the infrastructural changes you make in this
> series will be needed or not, if I pursue my approach: some perhaps as
> optimizations on top of the poorly performing base that may emerge from
> going about it my way.  But for me it's too soon to think about those.
>
> Something I notice that we do agree upon: the radix_tree holding the
> 4k subpages, at least for now.  When I first started thinking towards
> THPageCache, I was fascinated by how we could manage the hugepages in
> the radix_tree, cutting out unnecessary levels etc; but after a while
> I realized that although there's probably nice scope for cleverness
> there (significantly constrained by RCU expectations), it would only
> be about optimization.  Let's be simple and stupid about radix_tree
> for now, the problems that need to be worked out lie elsewhere.
>
>>> Perhaps that's good complementarity, or perhaps I'll disagree with
>>> your approach.  I'll be taking a look at yours in the coming days,
>>> and trying to summon back up my own ideas to summarize them for you.
>> Yeah, it would be nice to see alternative design ideas. Looking forward.
>>
>>> Perhaps I was naive to imagine it, but I did intend to start out
>>> generically, independent of filesystem; but content to narrow down
>>> on tmpfs alone where it gets hard to support the others (writeback
>>> springs to mind).  khugepaged would be migrating little pages into
>>> huge pages, where it saw that the mmaps of the file would benefit
>>> (and for testing I would hack mmap alignment choice to favour it).
>> I don't think all fs at once would fly, but it's wonderful, if I'm
>> wrong :)
> You are imagining the filesystem putting huge pages into its cache.
> Whereas I'm imagining khugepaged looking around at mmaped file areas,
> seeing which would benefit from huge pagecache (let's assume offset 0
> belongs on hugepage boundary - maybe one day someone will want to tune
> some files or parts differently, but that's low priority), migrating 4k
> pages over to 2MB page (wouldn't have to be done all in one pass), then

There are isolation and migration steps during collapse. But why aren't
migration entries used during the migration step?

> finally slotting in the pmds for that.
>
> But going this way, I expect we'd have to split at page_mkwrite():
> we probably don't want a single touch to dirty 2MB at a time,
> unless tmpfs or ramfs.
>
>>> I had arrived at a conviction that the first thing to change was
>>> the way that tail pages of a THP are refcounted, that it had been a
>>> mistake to use the compound page method of holding the THP together.
>>> But I'll have to enter a trance now to recall the arguments ;)
>> THP refcounting looks reasonable for me, if take split_huge_page() in
>> account.
> I'm not claiming that the THP refcounting is wrong in what it's doing
> at present; but that I suspect we'll want to rework it for THPageCache.
>
> Something I take for granted, I think you do too but I'm not certain:
> a file with transparent huge pages in its page cache can also have small
> pages in other extents of its page cache; and can be mapped hugely (2MB
> extents) into one address space at the same time as individual 4k pages
> from those extents are mapped into another (or the same) address space.
>
> One can certainly imagine sacrificing that principle, splitting whenever
> there's such a "conflict"; but it then becomes uninteresting to me, too
> much like hugetlbfs.  Splitting an anonymous hugepage in all address
> spaces that hold it when one of them needs it split, that has been a
> pragmatic strategy: it's not a common case for forks to diverge like
> that; but files are expected to be more widely shared.
>
> At present THP is using compound pages, with mapcount of tail pages
> reused to track their contribution to head page count; but I think we
> shall want to be able to use the mapcount, and the count, of TH tail
> pages for their original purpose if huge mappings can coexist with tiny.
> Not fully thought out, but that's my feeling.
>
> The use of compound pages, in particular the redirection of tail page
> count to head page count, was important in hugetlbfs: a get_user_pages
> reference on a subpage must prevent the containing hugepage from being
> freed, because hugetlbfs has its own separate pool of hugepages to
> which freeing returns them.
>
> But for transparent huge pages?  It should not matter so much if the
> subpages are freed independently.  So I'd like to devise another glue
> to hold them together more loosely (for prototyping I can certainly
> pretend we have infinite pageflag and pagefield space if that helps):
> I may find in practice that they're forever falling apart, and I run
> crying back to compound pages; but at present I'm hoping not.
>
> This mail might suggest that I'm about to start coding: I wish that
> were true, but in reality there's always a lot of unrelated things
> I have to look at, which dilute my focus.  So if I've said anything
> that sparks ideas for you, go with them.
>
> Hugh
>


^ permalink raw reply	[flat|nested] 66+ messages in thread

* Re: [PATCH, RFC 00/16] Transparent huge page cache
@ 2013-04-05  0:26         ` Simon Jeons
  0 siblings, 0 replies; 66+ messages in thread
From: Simon Jeons @ 2013-04-05  0:26 UTC (permalink / raw)
  To: Hugh Dickins
  Cc: Kirill A. Shutemov, Andrea Arcangeli, Andrew Morton, Al Viro,
	Wu Fengguang, Jan Kara, Mel Gorman, linux-mm, Andi Kleen,
	Matthew Wilcox, Kirill A. Shutemov, linux-fsdevel, linux-kernel

Hi Hugh,
On 01/31/2013 10:12 AM, Hugh Dickins wrote:
> On Tue, 29 Jan 2013, Kirill A. Shutemov wrote:
>> Hugh Dickins wrote:
>>> On Mon, 28 Jan 2013, Kirill A. Shutemov wrote:
>>>> From: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
>>>>
>>>> Here's first steps towards huge pages in page cache.
>>>>
>>>> The intend of the work is get code ready to enable transparent huge page
>>>> cache for the most simple fs -- ramfs.
>>>>
>>>> It's not yet near feature-complete. It only provides basic infrastructure.
>>>> At the moment we can read, write and truncate file on ramfs with huge pages in
>>>> page cache. The most interesting part, mmap(), is not yet there. For now
>>>> we split huge page on mmap() attempt.
>>>>
>>>> I can't say that I see whole picture. I'm not sure if I understand locking
>>>> model around split_huge_page(). Probably, not.
>>>> Andrea, could you check if it looks correct?
>>>>
>>>> Next steps (not necessary in this order):
>>>>   - mmap();
>>>>   - migration (?);
>>>>   - collapse;
>>>>   - stats, knobs, etc.;
>>>>   - tmpfs/shmem enabling;
>>>>   - ...
>>>>
>>>> Kirill A. Shutemov (16):
>>>>    block: implement add_bdi_stat()
>>>>    mm: implement zero_huge_user_segment and friends
>>>>    mm: drop actor argument of do_generic_file_read()
>>>>    radix-tree: implement preload for multiple contiguous elements
>>>>    thp, mm: basic defines for transparent huge page cache
>>>>    thp, mm: rewrite add_to_page_cache_locked() to support huge pages
>>>>    thp, mm: rewrite delete_from_page_cache() to support huge pages
>>>>    thp, mm: locking tail page is a bug
>>>>    thp, mm: handle tail pages in page_cache_get_speculative()
>>>>    thp, mm: implement grab_cache_huge_page_write_begin()
>>>>    thp, mm: naive support of thp in generic read/write routines
>>>>    thp, libfs: initial support of thp in
>>>>      simple_read/write_begin/write_end
>>>>    thp: handle file pages in split_huge_page()
>>>>    thp, mm: truncate support for transparent huge page cache
>>>>    thp, mm: split huge page on mmap file page
>>>>    ramfs: enable transparent huge page cache
>>>>
>>>>   fs/libfs.c                  |   54 +++++++++---
>>>>   fs/ramfs/inode.c            |    6 +-
>>>>   include/linux/backing-dev.h |   10 +++
>>>>   include/linux/huge_mm.h     |    8 ++
>>>>   include/linux/mm.h          |   15 ++++
>>>>   include/linux/pagemap.h     |   14 ++-
>>>>   include/linux/radix-tree.h  |    3 +
>>>>   lib/radix-tree.c            |   32 +++++--
>>>>   mm/filemap.c                |  204 +++++++++++++++++++++++++++++++++++--------
>>>>   mm/huge_memory.c            |   62 +++++++++++--
>>>>   mm/memory.c                 |   22 +++++
>>>>   mm/truncate.c               |   12 +++
>>>>   12 files changed, 375 insertions(+), 67 deletions(-)
>>> Interesting.
>>>
>>> I was starting to think about Transparent Huge Pagecache a few
>>> months ago, but then got washed away by incoming waves as usual.
>>>
>>> Certainly I don't have a line of code to show for it; but my first
>>> impression of your patches is that we have very different ideas of
>>> where to start.
> A second impression confirms that we have very different ideas of
> where to start.  I don't want to be dismissive, and please don't let
> me discourage you, but I just don't find what you have very interesting.
>
> I'm sure you'll agree that the interesting part, and the difficult part,
> comes with mmap(); and there's no point whatever to THPages without mmap()
> (of course, I'm including exec and brk and shm when I say mmap there).
>
> (There may be performance benefits in working with larger page cache
> size, which Christoph Lameter explored a few years back, but that's a
> different topic: I think 2MB - if I may be x86_64-centric - would not be
> the unit of choice for that, unless SSD erase block were to dominate.)
>
> I'm interested to get to the point of prototyping something that does
> support mmap() of THPageCache: I'm pretty sure that I'd then soon learn
> a lot about my misconceptions, and have to rework for a while (or give
> up!); but I don't see much point in posting anything without that.
> I don't know if we have 5 or 50 places which "know" that a THPage
> must be Anon: some I'll spot in advance, some I sadly won't.
>
> It's not clear to me that the infrastructural changes you make in this
> series will be needed or not, if I pursue my approach: some perhaps as
> optimizations on top of the poorly performing base that may emerge from
> going about it my way.  But for me it's too soon to think about those.
>
> Something I notice that we do agree upon: the radix_tree holding the
> 4k subpages, at least for now.  When I first started thinking towards
> THPageCache, I was fascinated by how we could manage the hugepages in
> the radix_tree, cutting out unnecessary levels etc; but after a while
> I realized that although there's probably nice scope for cleverness
> there (significantly constrained by RCU expectations), it would only
> be about optimization.  Let's be simple and stupid about radix_tree
> for now, the problems that need to be worked out lie elsewhere.
>
>>> Perhaps that's good complementarity, or perhaps I'll disagree with
>>> your approach.  I'll be taking a look at yours in the coming days,
>>> and trying to summon back up my own ideas to summarize them for you.
>> Yeah, it would be nice to see alternative design ideas. Looking forward.
>>
>>> Perhaps I was naive to imagine it, but I did intend to start out
>>> generically, independent of filesystem; but content to narrow down
>>> on tmpfs alone where it gets hard to support the others (writeback
>>> springs to mind).  khugepaged would be migrating little pages into
>>> huge pages, where it saw that the mmaps of the file would benefit
>>> (and for testing I would hack mmap alignment choice to favour it).
>> I don't think all fs at once would fly, but it would be wonderful if I'm
>> wrong :)
> You are imagining the filesystem putting huge pages into its cache.
> Whereas I'm imagining khugepaged looking around at mmaped file areas,
> seeing which would benefit from huge pagecache (let's assume offset 0
> belongs on hugepage boundary - maybe one day someone will want to tune
> some files or parts differently, but that's low priority), migrating 4k
> pages over to 2MB page (wouldn't have to be done all in one pass), then

There are isolation and migration steps during collapse. But why not use 
migration entries for the migration step?

> finally slotting in the pmds for that.
>
> But going this way, I expect we'd have to split at page_mkwrite():
> we probably don't want a single touch to dirty 2MB at a time,
> unless tmpfs or ramfs.
>
>>> I had arrived at a conviction that the first thing to change was
>>> the way that tail pages of a THP are refcounted, that it had been a
>>> mistake to use the compound page method of holding the THP together.
>>> But I'll have to enter a trance now to recall the arguments ;)
>> THP refcounting looks reasonable to me, if you take split_huge_page()
>> into account.
> I'm not claiming that the THP refcounting is wrong in what it's doing
> at present; but that I suspect we'll want to rework it for THPageCache.
>
> Something I take for granted, I think you do too but I'm not certain:
> a file with transparent huge pages in its page cache can also have small
> pages in other extents of its page cache; and can be mapped hugely (2MB
> extents) into one address space at the same time as individual 4k pages
> from those extents are mapped into another (or the same) address space.
>
> One can certainly imagine sacrificing that principle, splitting whenever
> there's such a "conflict"; but it then becomes uninteresting to me, too
> much like hugetlbfs.  Splitting an anonymous hugepage in all address
> spaces that hold it when one of them needs it split, that has been a
> pragmatic strategy: it's not a common case for forks to diverge like
> that; but files are expected to be more widely shared.
>
> At present THP is using compound pages, with mapcount of tail pages
> reused to track their contribution to head page count; but I think we
> shall want to be able to use the mapcount, and the count, of TH tail
> pages for their original purpose if huge mappings can coexist with tiny.
> Not fully thought out, but that's my feeling.
>
> The use of compound pages, in particular the redirection of tail page
> count to head page count, was important in hugetlbfs: a get_user_pages
> reference on a subpage must prevent the containing hugepage from being
> freed, because hugetlbfs has its own separate pool of hugepages to
> which freeing returns them.
>
> But for transparent huge pages?  It should not matter so much if the
> subpages are freed independently.  So I'd like to devise another glue
> to hold them together more loosely (for prototyping I can certainly
> pretend we have infinite pageflag and pagefield space if that helps):
> I may find in practice that they're forever falling apart, and I run
> crying back to compound pages; but at present I'm hoping not.
>
> This mail might suggest that I'm about to start coding: I wish that
> were true, but in reality there's always a lot of unrelated things
> I have to look at, which dilute my focus.  So if I've said anything
> that sparks ideas for you, go with them.
>
> Hugh
>
> --
> To unsubscribe, send a message with 'unsubscribe linux-mm' in
> the body to majordomo@kvack.org.  For more info on Linux MM,
> see: http://www.linux-mm.org/ .
> Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>


^ permalink raw reply	[flat|nested] 66+ messages in thread

* Re: [PATCH, RFC 00/16] Transparent huge page cache
  2013-01-31  2:12       ` Hugh Dickins
@ 2013-04-05  1:03         ` Simon Jeons
  -1 siblings, 0 replies; 66+ messages in thread
From: Simon Jeons @ 2013-04-05  1:03 UTC (permalink / raw)
  To: Hugh Dickins
  Cc: Kirill A. Shutemov, Andrea Arcangeli, Andrew Morton, Al Viro,
	Wu Fengguang, Jan Kara, Mel Gorman, linux-mm, Andi Kleen,
	Matthew Wilcox, Kirill A. Shutemov, linux-fsdevel, linux-kernel

Hi Hugh,
On 01/31/2013 10:12 AM, Hugh Dickins wrote:
> On Tue, 29 Jan 2013, Kirill A. Shutemov wrote:
>> Hugh Dickins wrote:
>>> On Mon, 28 Jan 2013, Kirill A. Shutemov wrote:
>>>> From: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
>>>>
>>>> Here's first steps towards huge pages in page cache.
>>>>
>>>> The intent of the work is to get the code ready to enable transparent huge
>>>> page cache for the simplest fs -- ramfs.
>>>>
>>>> It's not yet near feature-complete. It only provides basic infrastructure.
>>>> At the moment we can read, write and truncate a file on ramfs with huge pages
>>>> in page cache. The most interesting part, mmap(), is not yet there. For now
>>>> we split the huge page on an mmap() attempt.
>>>>
>>>> I can't say that I see the whole picture. I'm not sure I understand the
>>>> locking model around split_huge_page(). Probably not.
>>>> Andrea, could you check if it looks correct?
>>>>
>>>> Next steps (not necessarily in this order):
>>>>   - mmap();
>>>>   - migration (?);
>>>>   - collapse;
>>>>   - stats, knobs, etc.;
>>>>   - tmpfs/shmem enabling;
>>>>   - ...
>>>>
>>>> Kirill A. Shutemov (16):
>>>>    block: implement add_bdi_stat()
>>>>    mm: implement zero_huge_user_segment and friends
>>>>    mm: drop actor argument of do_generic_file_read()
>>>>    radix-tree: implement preload for multiple contiguous elements
>>>>    thp, mm: basic defines for transparent huge page cache
>>>>    thp, mm: rewrite add_to_page_cache_locked() to support huge pages
>>>>    thp, mm: rewrite delete_from_page_cache() to support huge pages
>>>>    thp, mm: locking tail page is a bug
>>>>    thp, mm: handle tail pages in page_cache_get_speculative()
>>>>    thp, mm: implement grab_cache_huge_page_write_begin()
>>>>    thp, mm: naive support of thp in generic read/write routines
>>>>    thp, libfs: initial support of thp in
>>>>      simple_read/write_begin/write_end
>>>>    thp: handle file pages in split_huge_page()
>>>>    thp, mm: truncate support for transparent huge page cache
>>>>    thp, mm: split huge page on mmap file page
>>>>    ramfs: enable transparent huge page cache
>>>>
>>>>   fs/libfs.c                  |   54 +++++++++---
>>>>   fs/ramfs/inode.c            |    6 +-
>>>>   include/linux/backing-dev.h |   10 +++
>>>>   include/linux/huge_mm.h     |    8 ++
>>>>   include/linux/mm.h          |   15 ++++
>>>>   include/linux/pagemap.h     |   14 ++-
>>>>   include/linux/radix-tree.h  |    3 +
>>>>   lib/radix-tree.c            |   32 +++++--
>>>>   mm/filemap.c                |  204 +++++++++++++++++++++++++++++++++++--------
>>>>   mm/huge_memory.c            |   62 +++++++++++--
>>>>   mm/memory.c                 |   22 +++++
>>>>   mm/truncate.c               |   12 +++
>>>>   12 files changed, 375 insertions(+), 67 deletions(-)
>>> Interesting.
>>>
>>> I was starting to think about Transparent Huge Pagecache a few
>>> months ago, but then got washed away by incoming waves as usual.
>>>
>>> Certainly I don't have a line of code to show for it; but my first
>>> impression of your patches is that we have very different ideas of
>>> where to start.
> A second impression confirms that we have very different ideas of
> where to start.  I don't want to be dismissive, and please don't let
> me discourage you, but I just don't find what you have very interesting.
>
> I'm sure you'll agree that the interesting part, and the difficult part,
> comes with mmap(); and there's no point whatever to THPages without mmap()
> (of course, I'm including exec and brk and shm when I say mmap there).
>
> (There may be performance benefits in working with larger page cache
> size, which Christoph Lameter explored a few years back, but that's a
> different topic: I think 2MB - if I may be x86_64-centric - would not be
> the unit of choice for that, unless SSD erase block were to dominate.)
>
> I'm interested to get to the point of prototyping something that does
> support mmap() of THPageCache: I'm pretty sure that I'd then soon learn
> a lot about my misconceptions, and have to rework for a while (or give
> up!); but I don't see much point in posting anything without that.
> I don't know if we have 5 or 50 places which "know" that a THPage
> must be Anon: some I'll spot in advance, some I sadly won't.
>
> It's not clear to me whether the infrastructural changes you make in this
> series will be needed, if I pursue my approach: some perhaps as
> optimizations on top of the poorly performing base that may emerge from
> going about it my way.  But for me it's too soon to think about those.
>
> Something I notice that we do agree upon: the radix_tree holding the
> 4k subpages, at least for now.  When I first started thinking towards
> THPageCache, I was fascinated by how we could manage the hugepages in
> the radix_tree, cutting out unnecessary levels etc; but after a while
> I realized that although there's probably nice scope for cleverness
> there (significantly constrained by RCU expectations), it would only
> be about optimization.  Let's be simple and stupid about radix_tree
> for now, the problems that need to be worked out lie elsewhere.
>
>>> Perhaps that's good complementarity, or perhaps I'll disagree with
>>> your approach.  I'll be taking a look at yours in the coming days,
>>> and trying to summon back up my own ideas to summarize them for you.
>> Yeah, it would be nice to see alternative design ideas. Looking forward.
>>
>>> Perhaps I was naive to imagine it, but I did intend to start out
>>> generically, independent of filesystem; but content to narrow down
>>> on tmpfs alone where it gets hard to support the others (writeback
>>> springs to mind).  khugepaged would be migrating little pages into
>>> huge pages, where it saw that the mmaps of the file would benefit

Would it make sense to add a heuristic to adjust khugepaged_max_ptes_none? 
Reduce its value when memory pressure is high and increase it when memory 
pressure is low.

>>> (and for testing I would hack mmap alignment choice to favour it).
>> I don't think all fs at once would fly, but it would be wonderful if I'm
>> wrong :)
> You are imagining the filesystem putting huge pages into its cache.
> Whereas I'm imagining khugepaged looking around at mmaped file areas,
> seeing which would benefit from huge pagecache (let's assume offset 0
> belongs on hugepage boundary - maybe one day someone will want to tune
> some files or parts differently, but that's low priority), migrating 4k
> pages over to 2MB page (wouldn't have to be done all in one pass), then
> finally slotting in the pmds for that.
>
> But going this way, I expect we'd have to split at page_mkwrite():
> we probably don't want a single touch to dirty 2MB at a time,
> unless tmpfs or ramfs.
>
>>> I had arrived at a conviction that the first thing to change was
>>> the way that tail pages of a THP are refcounted, that it had been a
>>> mistake to use the compound page method of holding the THP together.
>>> But I'll have to enter a trance now to recall the arguments ;)
>> THP refcounting looks reasonable to me, if you take split_huge_page()
>> into account.
> I'm not claiming that the THP refcounting is wrong in what it's doing
> at present; but that I suspect we'll want to rework it for THPageCache.
>
> Something I take for granted, I think you do too but I'm not certain:
> a file with transparent huge pages in its page cache can also have small
> pages in other extents of its page cache; and can be mapped hugely (2MB
> extents) into one address space at the same time as individual 4k pages
> from those extents are mapped into another (or the same) address space.
>
> One can certainly imagine sacrificing that principle, splitting whenever
> there's such a "conflict"; but it then becomes uninteresting to me, too
> much like hugetlbfs.  Splitting an anonymous hugepage in all address
> spaces that hold it when one of them needs it split, that has been a
> pragmatic strategy: it's not a common case for forks to diverge like
> that; but files are expected to be more widely shared.
>
> At present THP is using compound pages, with mapcount of tail pages
> reused to track their contribution to head page count; but I think we
> shall want to be able to use the mapcount, and the count, of TH tail
> pages for their original purpose if huge mappings can coexist with tiny.
> Not fully thought out, but that's my feeling.
>
> The use of compound pages, in particular the redirection of tail page
> count to head page count, was important in hugetlbfs: a get_user_pages
> reference on a subpage must prevent the containing hugepage from being
> freed, because hugetlbfs has its own separate pool of hugepages to
> which freeing returns them.
>
> But for transparent huge pages?  It should not matter so much if the
> subpages are freed independently.  So I'd like to devise another glue
> to hold them together more loosely (for prototyping I can certainly
> pretend we have infinite pageflag and pagefield space if that helps):
> I may find in practice that they're forever falling apart, and I run
> crying back to compound pages; but at present I'm hoping not.
>
> This mail might suggest that I'm about to start coding: I wish that
> were true, but in reality there's always a lot of unrelated things
> I have to look at, which dilute my focus.  So if I've said anything
> that sparks ideas for you, go with them.
>
> Hugh
>


^ permalink raw reply	[flat|nested] 66+ messages in thread


* Re: [PATCH, RFC 00/16] Transparent huge page cache
  2013-01-29  5:03   ` Hugh Dickins
@ 2013-04-05  1:24     ` Ric Mason
  -1 siblings, 0 replies; 66+ messages in thread
From: Ric Mason @ 2013-04-05  1:24 UTC (permalink / raw)
  To: Hugh Dickins
  Cc: Kirill A. Shutemov, Andrea Arcangeli, Andrew Morton, Al Viro,
	Wu Fengguang, Jan Kara, Mel Gorman, linux-mm, Andi Kleen,
	Matthew Wilcox, Kirill A. Shutemov, linux-fsdevel, linux-kernel

Hi Hugh,
On 01/29/2013 01:03 PM, Hugh Dickins wrote:
> On Mon, 28 Jan 2013, Kirill A. Shutemov wrote:
>> From: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
>>
>> Here's first steps towards huge pages in page cache.
>>
>> The intent of the work is to get the code ready to enable transparent huge
>> page cache for the simplest fs -- ramfs.
>>
>> It's not yet near feature-complete. It only provides basic infrastructure.
>> At the moment we can read, write and truncate a file on ramfs with huge pages
>> in page cache. The most interesting part, mmap(), is not yet there. For now
>> we split the huge page on an mmap() attempt.
>>
>> I can't say that I see the whole picture. I'm not sure I understand the
>> locking model around split_huge_page(). Probably not.
>> Andrea, could you check if it looks correct?
>>
>> Next steps (not necessarily in this order):
>>   - mmap();
>>   - migration (?);
>>   - collapse;
>>   - stats, knobs, etc.;
>>   - tmpfs/shmem enabling;
>>   - ...
>>
>> Kirill A. Shutemov (16):
>>    block: implement add_bdi_stat()
>>    mm: implement zero_huge_user_segment and friends
>>    mm: drop actor argument of do_generic_file_read()
>>    radix-tree: implement preload for multiple contiguous elements
>>    thp, mm: basic defines for transparent huge page cache
>>    thp, mm: rewrite add_to_page_cache_locked() to support huge pages
>>    thp, mm: rewrite delete_from_page_cache() to support huge pages
>>    thp, mm: locking tail page is a bug
>>    thp, mm: handle tail pages in page_cache_get_speculative()
>>    thp, mm: implement grab_cache_huge_page_write_begin()
>>    thp, mm: naive support of thp in generic read/write routines
>>    thp, libfs: initial support of thp in
>>      simple_read/write_begin/write_end
>>    thp: handle file pages in split_huge_page()
>>    thp, mm: truncate support for transparent huge page cache
>>    thp, mm: split huge page on mmap file page
>>    ramfs: enable transparent huge page cache
>>
>>   fs/libfs.c                  |   54 +++++++++---
>>   fs/ramfs/inode.c            |    6 +-
>>   include/linux/backing-dev.h |   10 +++
>>   include/linux/huge_mm.h     |    8 ++
>>   include/linux/mm.h          |   15 ++++
>>   include/linux/pagemap.h     |   14 ++-
>>   include/linux/radix-tree.h  |    3 +
>>   lib/radix-tree.c            |   32 +++++--
>>   mm/filemap.c                |  204 +++++++++++++++++++++++++++++++++++--------
>>   mm/huge_memory.c            |   62 +++++++++++--
>>   mm/memory.c                 |   22 +++++
>>   mm/truncate.c               |   12 +++
>>   12 files changed, 375 insertions(+), 67 deletions(-)
> Interesting.
>
> I was starting to think about Transparent Huge Pagecache a few
> months ago, but then got washed away by incoming waves as usual.
>
> Certainly I don't have a line of code to show for it; but my first
> impression of your patches is that we have very different ideas of
> where to start.
>
> Perhaps that's good complementarity, or perhaps I'll disagree with
> your approach.  I'll be taking a look at yours in the coming days,
> and trying to summon back up my own ideas to summarize them for you.
>
> Perhaps I was naive to imagine it, but I did intend to start out
> generically, independent of filesystem; but content to narrow down
> on tmpfs alone where it gets hard to support the others (writeback
> springs to mind).  khugepaged would be migrating little pages into
> huge pages, where it saw that the mmaps of the file would benefit
> (and for testing I would hack mmap alignment choice to favour it).
>
> I had arrived at a conviction that the first thing to change was
> the way that tail pages of a THP are refcounted, that it had been a
> mistake to use the compound page method of holding the THP together.
> But I'll have to enter a trance now to recall the arguments ;)

One offline question: do you have any idea how hugetlbfs pages could support swapping?

>
> Hugh
>


^ permalink raw reply	[flat|nested] 66+ messages in thread

> on tmpfs alone where it gets hard to support the others (writeback
> springs to mind).  khugepaged would be migrating little pages into
> huge pages, where it saw that the mmaps of the file would benefit
> (and for testing I would hack mmap alignment choice to favour it).
>
> I had arrived at a conviction that the first thing to change was
> the way that tail pages of a THP are refcounted, that it had been a
> mistake to use the compound page method of holding the THP together.
> But I'll have to enter a trance now to recall the arguments ;)

One offline question: do you have any idea whether hugetlbfs pages could support swapping?

>
> Hugh
>
> --
> To unsubscribe, send a message with 'unsubscribe linux-mm' in
> the body to majordomo@kvack.org.  For more info on Linux MM,
> see: http://www.linux-mm.org/ .
> Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 66+ messages in thread

* Re: [PATCH, RFC 00/16] Transparent huge page cache
  2013-01-31  2:12       ` Hugh Dickins
                         ` (4 preceding siblings ...)
  (?)
@ 2013-04-05  1:42       ` Wanpeng Li
  -1 siblings, 0 replies; 66+ messages in thread
From: Wanpeng Li @ 2013-04-05  1:42 UTC (permalink / raw)
  To: Hugh Dickins
  Cc: Kirill A. Shutemov, Andrea Arcangeli, Andrew Morton, Al Viro,
	Wu Fengguang, Jan Kara, Mel Gorman, linux-mm, Andi Kleen,
	Matthew Wilcox, Kirill A. Shutemov, linux-fsdevel, linux-kernel

On Wed, Jan 30, 2013 at 06:12:05PM -0800, Hugh Dickins wrote:
>On Tue, 29 Jan 2013, Kirill A. Shutemov wrote:
>> Hugh Dickins wrote:
>> > On Mon, 28 Jan 2013, Kirill A. Shutemov wrote:
>> > > From: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
>> > > 
>> > > Here's first steps towards huge pages in page cache.
>> > > 
>> > > The intend of the work is get code ready to enable transparent huge page
>> > > cache for the most simple fs -- ramfs.
>> > > 
>> > > It's not yet near feature-complete. It only provides basic infrastructure.
>> > > At the moment we can read, write and truncate file on ramfs with huge pages in
>> > > page cache. The most interesting part, mmap(), is not yet there. For now
>> > > we split huge page on mmap() attempt.
>> > > 
>> > > I can't say that I see whole picture. I'm not sure if I understand locking
>> > > model around split_huge_page(). Probably, not.
>> > > Andrea, could you check if it looks correct?
>> > > 
>> > > Next steps (not necessary in this order):
>> > >  - mmap();
>> > >  - migration (?);
>> > >  - collapse;
>> > >  - stats, knobs, etc.;
>> > >  - tmpfs/shmem enabling;
>> > >  - ...
>> > > 
>> > > Kirill A. Shutemov (16):
>> > >   block: implement add_bdi_stat()
>> > >   mm: implement zero_huge_user_segment and friends
>> > >   mm: drop actor argument of do_generic_file_read()
>> > >   radix-tree: implement preload for multiple contiguous elements
>> > >   thp, mm: basic defines for transparent huge page cache
>> > >   thp, mm: rewrite add_to_page_cache_locked() to support huge pages
>> > >   thp, mm: rewrite delete_from_page_cache() to support huge pages
>> > >   thp, mm: locking tail page is a bug
>> > >   thp, mm: handle tail pages in page_cache_get_speculative()
>> > >   thp, mm: implement grab_cache_huge_page_write_begin()
>> > >   thp, mm: naive support of thp in generic read/write routines
>> > >   thp, libfs: initial support of thp in
>> > >     simple_read/write_begin/write_end
>> > >   thp: handle file pages in split_huge_page()
>> > >   thp, mm: truncate support for transparent huge page cache
>> > >   thp, mm: split huge page on mmap file page
>> > >   ramfs: enable transparent huge page cache
>> > > 
>> > >  fs/libfs.c                  |   54 +++++++++---
>> > >  fs/ramfs/inode.c            |    6 +-
>> > >  include/linux/backing-dev.h |   10 +++
>> > >  include/linux/huge_mm.h     |    8 ++
>> > >  include/linux/mm.h          |   15 ++++
>> > >  include/linux/pagemap.h     |   14 ++-
>> > >  include/linux/radix-tree.h  |    3 +
>> > >  lib/radix-tree.c            |   32 +++++--
>> > >  mm/filemap.c                |  204 +++++++++++++++++++++++++++++++++++--------
>> > >  mm/huge_memory.c            |   62 +++++++++++--
>> > >  mm/memory.c                 |   22 +++++
>> > >  mm/truncate.c               |   12 +++
>> > >  12 files changed, 375 insertions(+), 67 deletions(-)
>> > 
>> > Interesting.
>> > 
>> > I was starting to think about Transparent Huge Pagecache a few
>> > months ago, but then got washed away by incoming waves as usual.
>> > 
>> > Certainly I don't have a line of code to show for it; but my first
>> > impression of your patches is that we have very different ideas of
>> > where to start.
>
>A second impression confirms that we have very different ideas of
>where to start.  I don't want to be dismissive, and please don't let
>me discourage you, but I just don't find what you have very interesting.
>
>I'm sure you'll agree that the interesting part, and the difficult part,
>comes with mmap(); and there's no point whatever to THPages without mmap()
>(of course, I'm including exec and brk and shm when I say mmap there).
>
>(There may be performance benefits in working with larger page cache
>size, which Christoph Lameter explored a few years back, but that's a
>different topic: I think 2MB - if I may be x86_64-centric - would not be
>the unit of choice for that, unless SSD erase block were to dominate.)
>
>I'm interested to get to the point of prototyping something that does
>support mmap() of THPageCache: I'm pretty sure that I'd then soon learn
>a lot about my misconceptions, and have to rework for a while (or give
>up!); but I don't see much point in posting anything without that.
>I don't know if we have 5 or 50 places which "know" that a THPage
>must be Anon: some I'll spot in advance, some I sadly won't.
>
>It's not clear to me that the infrastructural changes you make in this
>series will be needed or not, if I pursue my approach: some perhaps as
>optimizations on top of the poorly performing base that may emerge from
>going about it my way.  But for me it's too soon to think about those.
>
>Something I notice that we do agree upon: the radix_tree holding the
>4k subpages, at least for now.  When I first started thinking towards
>THPageCache, I was fascinated by how we could manage the hugepages in
>the radix_tree, cutting out unnecessary levels etc; but after a while
>I realized that although there's probably nice scope for cleverness
>there (significantly constrained by RCU expectations), it would only
>be about optimization.  Let's be simple and stupid about radix_tree
>for now, the problems that need to be worked out lie elsewhere.
>
>> > 
>> > Perhaps that's good complementarity, or perhaps I'll disagree with
>> > your approach.  I'll be taking a look at yours in the coming days,
>> > and trying to summon back up my own ideas to summarize them for you.
>> 
>> Yeah, it would be nice to see alternative design ideas. Looking forward.
>> 
>> > Perhaps I was naive to imagine it, but I did intend to start out
>> > generically, independent of filesystem; but content to narrow down
>> > on tmpfs alone where it gets hard to support the others (writeback
>> > springs to mind).  khugepaged would be migrating little pages into
>> > huge pages, where it saw that the mmaps of the file would benefit
>> > (and for testing I would hack mmap alignment choice to favour it).
>> 
>> I don't think all fs at once would fly, but it's wonderful, if I'm
>> wrong :)
>
>You are imagining the filesystem putting huge pages into its cache.
>Whereas I'm imagining khugepaged looking around at mmaped file areas,
>seeing which would benefit from huge pagecache (let's assume offset 0
>belongs on hugepage boundary - maybe one day someone will want to tune
>some files or parts differently, but that's low priority), migrating 4k
>pages over to 2MB page (wouldn't have to be done all in one pass), then
>finally slotting in the pmds for that.
>
>But going this way, I expect we'd have to split at page_mkwrite():
>we probably don't want a single touch to dirty 2MB at a time,
>unless tmpfs or ramfs.
>
>> 
>> > I had arrived at a conviction that the first thing to change was
>> > the way that tail pages of a THP are refcounted, that it had been a
>> > mistake to use the compound page method of holding the THP together.
>> > But I'll have to enter a trance now to recall the arguments ;)
>> 
>> THP refcounting looks reasonable for me, if take split_huge_page() in
>> account.
>
>I'm not claiming that the THP refcounting is wrong in what it's doing
>at present; but that I suspect we'll want to rework it for THPageCache.
>
>Something I take for granted, I think you do too but I'm not certain:
>a file with transparent huge pages in its page cache can also have small
>pages in other extents of its page cache; and can be mapped hugely (2MB
>extents) into one address space at the same time as individual 4k pages
>from those extents are mapped into another (or the same) address space.
>
>One can certainly imagine sacrificing that principle, splitting whenever
>there's such a "conflict"; but it then becomes uninteresting to me, too
>much like hugetlbfs.  Splitting an anonymous hugepage in all address
>spaces that hold it when one of them needs it split, that has been a
>pragmatic strategy: it's not a common case for forks to diverge like
>that; but files are expected to be more widely shared.
>
>At present THP is using compound pages, with mapcount of tail pages
>reused to track their contribution to head page count; but I think we
>shall want to be able to use the mapcount, and the count, of TH tail
>pages for their original purpose if huge mappings can coexist with tiny.
>Not fully thought out, but that's my feeling.
>
>The use of compound pages, in particular the redirection of tail page
>count to head page count, was important in hugetlbfs: a get_user_pages
>reference on a subpage must prevent the containing hugepage from being
>freed, because hugetlbfs has its own separate pool of hugepages to
>which freeing returns them.
>
>But for transparent huge pages?  It should not matter so much if the
>subpages are freed independently.  So I'd like to devise another glue
>to hold them together more loosely (for prototyping I can certainly
>pretend we have infinite pageflag and pagefield space if that helps):
>I may find in practice that they're forever falling apart, and I run
>crying back to compound pages; but at present I'm hoping not.
>
>This mail might suggest that I'm about to start coding: I wish that
>were true, but in reality there's always a lot of unrelated things
>I have to look at, which dilute my focus.  So if I've said anything
>that sparks ideas for you, go with them.

This seems like a good idea, Hugh. I will start coding on it. ;-)

Regards,
Wanpeng Li 

>
>Hugh
>
>--
>To unsubscribe, send a message with 'unsubscribe linux-mm' in
>the body to majordomo@kvack.org.  For more info on Linux MM,
>see: http://www.linux-mm.org/ .
>Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 66+ messages in thread

* Re: [PATCH, RFC 00/16] Transparent huge page cache
  2013-04-05  1:42       ` Wanpeng Li
  2013-04-07  0:26         ` Wanpeng Li
@ 2013-04-07  0:26         ` Wanpeng Li
  1 sibling, 0 replies; 66+ messages in thread
From: Wanpeng Li @ 2013-04-07  0:26 UTC (permalink / raw)
  To: Hugh Dickins
  Cc: Kirill A. Shutemov, Andrea Arcangeli, Andrew Morton, Al Viro,
	Wu Fengguang, Jan Kara, Mel Gorman, linux-mm, Andi Kleen,
	Matthew Wilcox, Kirill A. Shutemov, linux-fsdevel, linux-kernel

On Fri, Apr 05, 2013 at 09:42:08AM +0800, Wanpeng Li wrote:
>On Wed, Jan 30, 2013 at 06:12:05PM -0800, Hugh Dickins wrote:
>>On Tue, 29 Jan 2013, Kirill A. Shutemov wrote:
>>> Hugh Dickins wrote:
>>> > On Mon, 28 Jan 2013, Kirill A. Shutemov wrote:
>>> > > From: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
>>> > > 
>>> > > Here's first steps towards huge pages in page cache.
>>> > > 
>>> > > The intend of the work is get code ready to enable transparent huge page
>>> > > cache for the most simple fs -- ramfs.
>>> > > 
>>> > > It's not yet near feature-complete. It only provides basic infrastructure.
>>> > > At the moment we can read, write and truncate file on ramfs with huge pages in
>>> > > page cache. The most interesting part, mmap(), is not yet there. For now
>>> > > we split huge page on mmap() attempt.
>>> > > 
>>> > > I can't say that I see whole picture. I'm not sure if I understand locking
>>> > > model around split_huge_page(). Probably, not.
>>> > > Andrea, could you check if it looks correct?
>>> > > 
>>> > > Next steps (not necessary in this order):
>>> > >  - mmap();
>>> > >  - migration (?);
>>> > >  - collapse;
>>> > >  - stats, knobs, etc.;
>>> > >  - tmpfs/shmem enabling;
>>> > >  - ...
>>> > > 
>>> > > Kirill A. Shutemov (16):
>>> > >   block: implement add_bdi_stat()
>>> > >   mm: implement zero_huge_user_segment and friends
>>> > >   mm: drop actor argument of do_generic_file_read()
>>> > >   radix-tree: implement preload for multiple contiguous elements
>>> > >   thp, mm: basic defines for transparent huge page cache
>>> > >   thp, mm: rewrite add_to_page_cache_locked() to support huge pages
>>> > >   thp, mm: rewrite delete_from_page_cache() to support huge pages
>>> > >   thp, mm: locking tail page is a bug
>>> > >   thp, mm: handle tail pages in page_cache_get_speculative()
>>> > >   thp, mm: implement grab_cache_huge_page_write_begin()
>>> > >   thp, mm: naive support of thp in generic read/write routines
>>> > >   thp, libfs: initial support of thp in
>>> > >     simple_read/write_begin/write_end
>>> > >   thp: handle file pages in split_huge_page()
>>> > >   thp, mm: truncate support for transparent huge page cache
>>> > >   thp, mm: split huge page on mmap file page
>>> > >   ramfs: enable transparent huge page cache
>>> > > 
>>> > >  fs/libfs.c                  |   54 +++++++++---
>>> > >  fs/ramfs/inode.c            |    6 +-
>>> > >  include/linux/backing-dev.h |   10 +++
>>> > >  include/linux/huge_mm.h     |    8 ++
>>> > >  include/linux/mm.h          |   15 ++++
>>> > >  include/linux/pagemap.h     |   14 ++-
>>> > >  include/linux/radix-tree.h  |    3 +
>>> > >  lib/radix-tree.c            |   32 +++++--
>>> > >  mm/filemap.c                |  204 +++++++++++++++++++++++++++++++++++--------
>>> > >  mm/huge_memory.c            |   62 +++++++++++--
>>> > >  mm/memory.c                 |   22 +++++
>>> > >  mm/truncate.c               |   12 +++
>>> > >  12 files changed, 375 insertions(+), 67 deletions(-)
>>> > 
>>> > Interesting.
>>> > 
>>> > I was starting to think about Transparent Huge Pagecache a few
>>> > months ago, but then got washed away by incoming waves as usual.
>>> > 
>>> > Certainly I don't have a line of code to show for it; but my first
>>> > impression of your patches is that we have very different ideas of
>>> > where to start.
>>
>>A second impression confirms that we have very different ideas of
>>where to start.  I don't want to be dismissive, and please don't let
>>me discourage you, but I just don't find what you have very interesting.
>>
>>I'm sure you'll agree that the interesting part, and the difficult part,
>>comes with mmap(); and there's no point whatever to THPages without mmap()
>>(of course, I'm including exec and brk and shm when I say mmap there).
>>
>>(There may be performance benefits in working with larger page cache
>>size, which Christoph Lameter explored a few years back, but that's a
>>different topic: I think 2MB - if I may be x86_64-centric - would not be
>>the unit of choice for that, unless SSD erase block were to dominate.)
>>
>>I'm interested to get to the point of prototyping something that does
>>support mmap() of THPageCache: I'm pretty sure that I'd then soon learn
>>a lot about my misconceptions, and have to rework for a while (or give
>>up!); but I don't see much point in posting anything without that.
>>I don't know if we have 5 or 50 places which "know" that a THPage
>>must be Anon: some I'll spot in advance, some I sadly won't.
>>
>>It's not clear to me that the infrastructural changes you make in this
>>series will be needed or not, if I pursue my approach: some perhaps as
>>optimizations on top of the poorly performing base that may emerge from
>>going about it my way.  But for me it's too soon to think about those.
>>
>>Something I notice that we do agree upon: the radix_tree holding the
>>4k subpages, at least for now.  When I first started thinking towards
>>THPageCache, I was fascinated by how we could manage the hugepages in
>>the radix_tree, cutting out unnecessary levels etc; but after a while
>>I realized that although there's probably nice scope for cleverness
>>there (significantly constrained by RCU expectations), it would only
>>be about optimization.  Let's be simple and stupid about radix_tree
>>for now, the problems that need to be worked out lie elsewhere.
>>
>>> > 
>>> > Perhaps that's good complementarity, or perhaps I'll disagree with
>>> > your approach.  I'll be taking a look at yours in the coming days,
>>> > and trying to summon back up my own ideas to summarize them for you.
>>> 
>>> Yeah, it would be nice to see alternative design ideas. Looking forward.
>>> 
>>> > Perhaps I was naive to imagine it, but I did intend to start out
>>> > generically, independent of filesystem; but content to narrow down
>>> > on tmpfs alone where it gets hard to support the others (writeback
>>> > springs to mind).  khugepaged would be migrating little pages into
>>> > huge pages, where it saw that the mmaps of the file would benefit
>>> > (and for testing I would hack mmap alignment choice to favour it).
>>> 
>>> I don't think all fs at once would fly, but it's wonderful, if I'm
>>> wrong :)
>>
>>You are imagining the filesystem putting huge pages into its cache.
>>Whereas I'm imagining khugepaged looking around at mmaped file areas,
>>seeing which would benefit from huge pagecache (let's assume offset 0
>>belongs on hugepage boundary - maybe one day someone will want to tune
>>some files or parts differently, but that's low priority), migrating 4k
>>pages over to 2MB page (wouldn't have to be done all in one pass), then
>>finally slotting in the pmds for that.
>>
>>But going this way, I expect we'd have to split at page_mkwrite():
>>we probably don't want a single touch to dirty 2MB at a time,
>>unless tmpfs or ramfs.
>>
>>> 
>>> > I had arrived at a conviction that the first thing to change was
>>> > the way that tail pages of a THP are refcounted, that it had been a
>>> > mistake to use the compound page method of holding the THP together.
>>> > But I'll have to enter a trance now to recall the arguments ;)
>>> 
>>> THP refcounting looks reasonable for me, if take split_huge_page() in
>>> account.
>>
>>I'm not claiming that the THP refcounting is wrong in what it's doing
>>at present; but that I suspect we'll want to rework it for THPageCache.
>>
>>Something I take for granted, I think you do too but I'm not certain:
>>a file with transparent huge pages in its page cache can also have small
>>pages in other extents of its page cache; and can be mapped hugely (2MB
>>extents) into one address space at the same time as individual 4k pages
>>from those extents are mapped into another (or the same) address space.
>>
>>One can certainly imagine sacrificing that principle, splitting whenever
>>there's such a "conflict"; but it then becomes uninteresting to me, too
>>much like hugetlbfs.  Splitting an anonymous hugepage in all address
>>spaces that hold it when one of them needs it split, that has been a
>>pragmatic strategy: it's not a common case for forks to diverge like
>>that; but files are expected to be more widely shared.
>>
>>At present THP is using compound pages, with mapcount of tail pages
>>reused to track their contribution to head page count; but I think we
>>shall want to be able to use the mapcount, and the count, of TH tail
>>pages for their original purpose if huge mappings can coexist with tiny.
>>Not fully thought out, but that's my feeling.
>>
>>The use of compound pages, in particular the redirection of tail page
>>count to head page count, was important in hugetlbfs: a get_user_pages
>>reference on a subpage must prevent the containing hugepage from being
>>freed, because hugetlbfs has its own separate pool of hugepages to
>>which freeing returns them.
>>
>>But for transparent huge pages?  It should not matter so much if the
>>subpages are freed independently.  So I'd like to devise another glue
>>to hold them together more loosely (for prototyping I can certainly
>>pretend we have infinite pageflag and pagefield space if that helps):
>>I may find in practice that they're forever falling apart, and I run
>>crying back to compound pages; but at present I'm hoping not.
>>
>>This mail might suggest that I'm about to start coding: I wish that
>>were true, but in reality there's always a lot of unrelated things
>>I have to look at, which dilute my focus.  So if I've said anything
>>that sparks ideas for you, go with them.

Hi Hugh,

commit 70b50f94f16 ("mm: thp: tail page refcounting fix") tells us that
accounting tail page references in tail_page->_count wasn't safe.

Regards,
Wanpeng Li 

>
>It seems that it's a good idea, Hugh. I will start coding this. ;-)
>
>Regards,
>Wanpeng Li 
>
>>
>>Hugh
>>
>>--
>>To unsubscribe, send a message with 'unsubscribe linux-mm' in
>>the body to majordomo@kvack.org.  For more info on Linux MM,
>>see: http://www.linux-mm.org/ .
>>Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>
>


^ permalink raw reply	[flat|nested] 66+ messages in thread

* Re: [PATCH, RFC 00/16] Transparent huge page cache
  2013-04-05  1:42       ` Wanpeng Li
@ 2013-04-07  0:26         ` Wanpeng Li
  2013-04-07  0:26         ` Wanpeng Li
  1 sibling, 0 replies; 66+ messages in thread
From: Wanpeng Li @ 2013-04-07  0:26 UTC (permalink / raw)
  To: Hugh Dickins
  Cc: Kirill A. Shutemov, Andrea Arcangeli, Andrew Morton, Al Viro,
	Wu Fengguang, Jan Kara, Mel Gorman, linux-mm, Andi Kleen,
	Matthew Wilcox, Kirill A. Shutemov, linux-fsdevel, linux-kernel

On Fri, Apr 05, 2013 at 09:42:08AM +0800, Wanpeng Li wrote:
>On Wed, Jan 30, 2013 at 06:12:05PM -0800, Hugh Dickins wrote:
>>On Tue, 29 Jan 2013, Kirill A. Shutemov wrote:
>>> Hugh Dickins wrote:
>>> > On Mon, 28 Jan 2013, Kirill A. Shutemov wrote:
>>> > > From: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
>>> > > 
>>> > > Here are the first steps towards huge pages in the page cache.
>>> > > 
>>> > > The intent of the work is to get the code ready to enable the transparent
>>> > > huge page cache for the simplest fs -- ramfs.
>>> > > 
>>> > > It's not yet near feature-complete. It only provides basic infrastructure.
>>> > > At the moment we can read, write and truncate a file on ramfs with huge
>>> > > pages in the page cache. The most interesting part, mmap(), is not yet
>>> > > there. For now we split the huge page on an mmap() attempt.
>>> > > 
>>> > > I can't say that I see the whole picture. I'm not sure if I understand the
>>> > > locking model around split_huge_page(). Probably not.
>>> > > Andrea, could you check if it looks correct?
>>> > > 
>>> > > Next steps (not necessarily in this order):
>>> > >  - mmap();
>>> > >  - migration (?);
>>> > >  - collapse;
>>> > >  - stats, knobs, etc.;
>>> > >  - tmpfs/shmem enabling;
>>> > >  - ...
>>> > > 
>>> > > Kirill A. Shutemov (16):
>>> > >   block: implement add_bdi_stat()
>>> > >   mm: implement zero_huge_user_segment and friends
>>> > >   mm: drop actor argument of do_generic_file_read()
>>> > >   radix-tree: implement preload for multiple contiguous elements
>>> > >   thp, mm: basic defines for transparent huge page cache
>>> > >   thp, mm: rewrite add_to_page_cache_locked() to support huge pages
>>> > >   thp, mm: rewrite delete_from_page_cache() to support huge pages
>>> > >   thp, mm: locking tail page is a bug
>>> > >   thp, mm: handle tail pages in page_cache_get_speculative()
>>> > >   thp, mm: implement grab_cache_huge_page_write_begin()
>>> > >   thp, mm: naive support of thp in generic read/write routines
>>> > >   thp, libfs: initial support of thp in
>>> > >     simple_read/write_begin/write_end
>>> > >   thp: handle file pages in split_huge_page()
>>> > >   thp, mm: truncate support for transparent huge page cache
>>> > >   thp, mm: split huge page on mmap file page
>>> > >   ramfs: enable transparent huge page cache
>>> > > 
>>> > >  fs/libfs.c                  |   54 +++++++++---
>>> > >  fs/ramfs/inode.c            |    6 +-
>>> > >  include/linux/backing-dev.h |   10 +++
>>> > >  include/linux/huge_mm.h     |    8 ++
>>> > >  include/linux/mm.h          |   15 ++++
>>> > >  include/linux/pagemap.h     |   14 ++-
>>> > >  include/linux/radix-tree.h  |    3 +
>>> > >  lib/radix-tree.c            |   32 +++++--
>>> > >  mm/filemap.c                |  204 +++++++++++++++++++++++++++++++++++--------
>>> > >  mm/huge_memory.c            |   62 +++++++++++--
>>> > >  mm/memory.c                 |   22 +++++
>>> > >  mm/truncate.c               |   12 +++
>>> > >  12 files changed, 375 insertions(+), 67 deletions(-)
>>> > 
>>> > Interesting.
>>> > 
>>> > I was starting to think about Transparent Huge Pagecache a few
>>> > months ago, but then got washed away by incoming waves as usual.
>>> > 
>>> > Certainly I don't have a line of code to show for it; but my first
>>> > impression of your patches is that we have very different ideas of
>>> > where to start.
>>
>>A second impression confirms that we have very different ideas of
>>where to start.  I don't want to be dismissive, and please don't let
>>me discourage you, but I just don't find what you have very interesting.
>>
>>I'm sure you'll agree that the interesting part, and the difficult part,
>>comes with mmap(); and there's no point whatever to THPages without mmap()
>>(of course, I'm including exec and brk and shm when I say mmap there).
>>
>>(There may be performance benefits in working with larger page cache
>>size, which Christoph Lameter explored a few years back, but that's a
>>different topic: I think 2MB - if I may be x86_64-centric - would not be
>>the unit of choice for that, unless SSD erase block were to dominate.)
>>
>>I'm interested to get to the point of prototyping something that does
>>support mmap() of THPageCache: I'm pretty sure that I'd then soon learn
>>a lot about my misconceptions, and have to rework for a while (or give
>>up!); but I don't see much point in posting anything without that.
>>I don't know if we have 5 or 50 places which "know" that a THPage
>>must be Anon: some I'll spot in advance, some I sadly won't.
>>
>>It's not clear to me that the infrastructural changes you make in this
>>series will be needed or not, if I pursue my approach: some perhaps as
>>optimizations on top of the poorly performing base that may emerge from
>>going about it my way.  But for me it's too soon to think about those.
>>
>>Something I notice that we do agree upon: the radix_tree holding the
>>4k subpages, at least for now.  When I first started thinking towards
>>THPageCache, I was fascinated by how we could manage the hugepages in
>>the radix_tree, cutting out unnecessary levels etc; but after a while
>>I realized that although there's probably nice scope for cleverness
>>there (significantly constrained by RCU expectations), it would only
>>be about optimization.  Let's be simple and stupid about radix_tree
>>for now, the problems that need to be worked out lie elsewhere.
>>
>>> > 
>>> > Perhaps that's good complementarity, or perhaps I'll disagree with
>>> > your approach.  I'll be taking a look at yours in the coming days,
>>> > and trying to summon back up my own ideas to summarize them for you.
>>> 
>>> Yeah, it would be nice to see alternative design ideas. Looking forward.
>>> 
>>> > Perhaps I was naive to imagine it, but I did intend to start out
>>> > generically, independent of filesystem; but content to narrow down
>>> > on tmpfs alone where it gets hard to support the others (writeback
>>> > springs to mind).  khugepaged would be migrating little pages into
>>> > huge pages, where it saw that the mmaps of the file would benefit
>>> > (and for testing I would hack mmap alignment choice to favour it).
>>> 
>>> I don't think all fs at once would fly, but it would be wonderful if
>>> I'm wrong :)
>>
>>You are imagining the filesystem putting huge pages into its cache.
>>Whereas I'm imagining khugepaged looking around at mmaped file areas,
>>seeing which would benefit from huge pagecache (let's assume offset 0
>>belongs on hugepage boundary - maybe one day someone will want to tune
>>some files or parts differently, but that's low priority), migrating 4k
>>pages over to 2MB page (wouldn't have to be done all in one pass), then
>>finally slotting in the pmds for that.

>>
>>But going this way, I expect we'd have to split at page_mkwrite():
>>we probably don't want a single touch to dirty 2MB at a time,
>>unless tmpfs or ramfs.
>>
>>> 
>>> > I had arrived at a conviction that the first thing to change was
>>> > the way that tail pages of a THP are refcounted, that it had been a
>>> > mistake to use the compound page method of holding the THP together.
>>> > But I'll have to enter a trance now to recall the arguments ;)
>>> 
>>> THP refcounting looks reasonable to me, if you take split_huge_page()
>>> into account.
>>
>>I'm not claiming that the THP refcounting is wrong in what it's doing
>>at present; but that I suspect we'll want to rework it for THPageCache.
>>
>>Something I take for granted, I think you do too but I'm not certain:
>>a file with transparent huge pages in its page cache can also have small
>>pages in other extents of its page cache; and can be mapped hugely (2MB
>>extents) into one address space at the same time as individual 4k pages
>>from those extents are mapped into another (or the same) address space.
>>
>>One can certainly imagine sacrificing that principle, splitting whenever
>>there's such a "conflict"; but it then becomes uninteresting to me, too
>>much like hugetlbfs.  Splitting an anonymous hugepage in all address
>>spaces that hold it when one of them needs it split, that has been a
>>pragmatic strategy: it's not a common case for forks to diverge like
>>that; but files are expected to be more widely shared.
>>
>>At present THP is using compound pages, with mapcount of tail pages
>>reused to track their contribution to head page count; but I think we
>>shall want to be able to use the mapcount, and the count, of TH tail
>>pages for their original purpose if huge mappings can coexist with tiny.
>>Not fully thought out, but that's my feeling.
>>
>>The use of compound pages, in particular the redirection of tail page
>>count to head page count, was important in hugetlbfs: a get_user_pages
>>reference on a subpage must prevent the containing hugepage from being
>>freed, because hugetlbfs has its own separate pool of hugepages to
>>which freeing returns them.
>>
>>But for transparent huge pages?  It should not matter so much if the
>>subpages are freed independently.  So I'd like to devise another glue
>>to hold them together more loosely (for prototyping I can certainly
>>pretend we have infinite pageflag and pagefield space if that helps):
>>I may find in practice that they're forever falling apart, and I run
>>crying back to compound pages; but at present I'm hoping not.
>>
>>This mail might suggest that I'm about to start coding: I wish that
>>were true, but in reality there's always a lot of unrelated things
>>I have to look at, which dilute my focus.  So if I've said anything
>>that sparks ideas for you, go with them.

Hi Hugh,

commit 70b50f94f16 ("mm: thp: tail page refcounting fix") tells us that
accounting tail page references in tail_page->_count wasn't safe.

Regards,
Wanpeng Li 

>
>It seems that it's a good idea, Hugh. I will start coding this. ;-)
>
>Regards,
>Wanpeng Li 
>
>>
>>Hugh
>>
>


^ permalink raw reply	[flat|nested] 66+ messages in thread

end of thread, other threads:[~2013-04-07  0:26 UTC | newest]

Thread overview: 66+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2013-01-28  9:24 [PATCH, RFC 00/16] Transparent huge page cache Kirill A. Shutemov
2013-01-28  9:24 ` Kirill A. Shutemov
2013-01-28  9:24 ` [PATCH, RFC 01/16] block: implement add_bdi_stat() Kirill A. Shutemov
2013-01-28  9:24   ` Kirill A. Shutemov
2013-01-28  9:24 ` [PATCH, RFC 02/16] mm: implement zero_huge_user_segment and friends Kirill A. Shutemov
2013-01-28  9:24   ` Kirill A. Shutemov
2013-01-28  9:24 ` [PATCH, RFC 03/16] mm: drop actor argument of do_generic_file_read() Kirill A. Shutemov
2013-01-28  9:24   ` Kirill A. Shutemov
2013-01-28  9:24 ` [PATCH, RFC 04/16] radix-tree: implement preload for multiple contiguous elements Kirill A. Shutemov
2013-01-28  9:24   ` Kirill A. Shutemov
2013-01-28  9:24 ` [PATCH, RFC 05/16] thp, mm: basic defines for transparent huge page cache Kirill A. Shutemov
2013-01-28  9:24   ` Kirill A. Shutemov
2013-01-28  9:24 ` [PATCH, RFC 06/16] thp, mm: rewrite add_to_page_cache_locked() to support huge pages Kirill A. Shutemov
2013-01-28  9:24   ` Kirill A. Shutemov
2013-01-29 12:11   ` Hillf Danton
2013-01-29 12:11     ` Hillf Danton
2013-01-29 13:01     ` Kirill A. Shutemov
2013-01-29 13:01       ` Kirill A. Shutemov
2013-01-29 12:14   ` Hillf Danton
2013-01-29 12:14     ` Hillf Danton
2013-01-29 12:26   ` Hillf Danton
2013-01-29 12:26     ` Hillf Danton
2013-01-29 12:48     ` Kirill A. Shutemov
2013-01-29 12:48       ` Kirill A. Shutemov
2013-01-28  9:24 ` [PATCH, RFC 07/16] thp, mm: rewrite delete_from_page_cache() " Kirill A. Shutemov
2013-01-28  9:24   ` Kirill A. Shutemov
2013-01-28  9:24 ` [PATCH, RFC 08/16] thp, mm: locking tail page is a bug Kirill A. Shutemov
2013-01-28  9:24   ` Kirill A. Shutemov
2013-01-28  9:24 ` [PATCH, RFC 09/16] thp, mm: handle tail pages in page_cache_get_speculative() Kirill A. Shutemov
2013-01-28  9:24   ` Kirill A. Shutemov
2013-01-28  9:24 ` [PATCH, RFC 10/16] thp, mm: implement grab_cache_huge_page_write_begin() Kirill A. Shutemov
2013-01-28  9:24   ` Kirill A. Shutemov
2013-01-28  9:24 ` [PATCH, RFC 11/16] thp, mm: naive support of thp in generic read/write routines Kirill A. Shutemov
2013-01-28  9:24   ` Kirill A. Shutemov
2013-01-28  9:24 ` [PATCH, RFC 12/16] thp, libfs: initial support of thp in simple_read/write_begin/write_end Kirill A. Shutemov
2013-01-28  9:24   ` Kirill A. Shutemov
2013-01-28  9:24 ` [PATCH, RFC 13/16] thp: handle file pages in split_huge_page() Kirill A. Shutemov
2013-01-28  9:24   ` Kirill A. Shutemov
2013-01-28  9:24 ` [PATCH, RFC 14/16] thp, mm: truncate support for transparent huge page cache Kirill A. Shutemov
2013-01-28  9:24   ` Kirill A. Shutemov
2013-01-28  9:24 ` [PATCH, RFC 15/16] thp, mm: split huge page on mmap file page Kirill A. Shutemov
2013-01-28  9:24   ` Kirill A. Shutemov
2013-01-28  9:24 ` [PATCH, RFC 16/16] ramfs: enable transparent huge page cache Kirill A. Shutemov
2013-01-28  9:24   ` Kirill A. Shutemov
2013-01-29  5:03 ` [PATCH, RFC 00/16] Transparent " Hugh Dickins
2013-01-29  5:03   ` Hugh Dickins
2013-01-29 13:14   ` Kirill A. Shutemov
2013-01-29 13:14     ` Kirill A. Shutemov
2013-01-31  2:12     ` Hugh Dickins
2013-01-31  2:12       ` Hugh Dickins
2013-02-02 15:13       ` Kirill A. Shutemov
2013-02-02 15:13         ` Kirill A. Shutemov
2013-04-05  0:26       ` Simon Jeons
2013-04-05  0:26         ` Simon Jeons
2013-04-05  1:03       ` Simon Jeons
2013-04-05  1:03         ` Simon Jeons
2013-04-05  1:42       ` Wanpeng Li
2013-04-07  0:26         ` Wanpeng Li
2013-04-07  0:26         ` Wanpeng Li
2013-04-05  1:42       ` Wanpeng Li
2013-04-05  1:24   ` Ric Mason
2013-04-05  1:24     ` Ric Mason
2013-03-18  9:36 ` Simon Jeons
2013-03-18  9:36   ` Simon Jeons
2013-03-21  8:00 ` Simon Jeons
2013-03-21  8:00   ` Simon Jeons
