* [PATCHv2, RFC 00/30] Transparent huge page cache
From: Kirill A. Shutemov @ 2013-03-14 17:50 UTC (permalink / raw)
  To: Andrea Arcangeli, Andrew Morton, Al Viro, Hugh Dickins
  Cc: Wu Fengguang, Jan Kara, Mel Gorman, linux-mm, Andi Kleen,
	Matthew Wilcox, Kirill A. Shutemov, Hillf Danton, linux-fsdevel,
	linux-kernel, Kirill A. Shutemov

From: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>

Here's the second version of the patchset.

The intent of this work is to get the code ready to enable transparent huge
page cache for the simplest filesystem -- ramfs.

We have read()/write()/mmap() functionality now. There is still plenty of work ahead.

Any feedback is welcome.

Changes since v1:
 - mmap();
 - fix add_to_page_cache_locked() and delete_from_page_cache();
 - introduce mapping_can_have_hugepages();
 - call split_huge_page() only for head page in filemap_fault();
 - wait_split_huge_page(): serialize over i_mmap_mutex too;
 - lru_add_page_tail: avoid PageUnevictable on active/inactive lru lists;
 - fix off-by-one in zero_huge_user_segment();
 - THP_WRITE_ALLOC/THP_WRITE_FAILED counters;

TODO:
 - memcg accounting has not yet been evaluated;
 - collapse;
 - migration (?);
 - stats, knobs, etc.;
 - tmpfs/shmem enabling;


Kirill A. Shutemov (30):
  block: implement add_bdi_stat()
  mm: implement zero_huge_user_segment and friends
  mm: drop actor argument of do_generic_file_read()
  radix-tree: implement preload for multiple contiguous elements
  thp, mm: avoid PageUnevictable on active/inactive lru lists
  thp, mm: basic defines for transparent huge page cache
  thp, mm: introduce mapping_can_have_hugepages() predicate
  thp, mm: rewrite add_to_page_cache_locked() to support huge pages
  thp, mm: rewrite delete_from_page_cache() to support huge pages
  thp, mm: locking tail page is a bug
  thp, mm: handle tail pages in page_cache_get_speculative()
  thp, mm: add event counters for huge page alloc on write to a file
  thp, mm: implement grab_cache_huge_page_write_begin()
  thp, mm: naive support of thp in generic read/write routines
  thp, libfs: initial support of thp in
    simple_read/write_begin/write_end
  thp: handle file pages in split_huge_page()
  thp: wait_split_huge_page(): serialize over i_mmap_mutex too
  thp, mm: truncate support for transparent huge page cache
  thp, mm: split huge page on mmap file page
  ramfs: enable transparent huge page cache
  x86-64, mm: proper alignment mappings with hugepages
  mm: add huge_fault() callback to vm_operations_struct
  thp: prepare zap_huge_pmd() to uncharge file pages
  thp: move maybe_pmd_mkwrite() out of mk_huge_pmd()
  thp, mm: basic huge_fault implementation for generic_file_vm_ops
  thp: extract fallback path from do_huge_pmd_anonymous_page() to a
    function
  thp: initial implementation of do_huge_linear_fault()
  thp: handle write-protect exception to file-backed huge pages
  thp: call __vma_adjust_trans_huge() for file-backed VMA
  thp: map file-backed huge pages on fault

 arch/x86/kernel/sys_x86_64.c  |   13 +-
 fs/libfs.c                    |   50 ++++-
 fs/ramfs/inode.c              |    6 +-
 include/linux/backing-dev.h   |   10 +
 include/linux/huge_mm.h       |   36 +++-
 include/linux/mm.h            |   16 ++
 include/linux/pagemap.h       |   24 ++-
 include/linux/radix-tree.h    |    3 +
 include/linux/vm_event_item.h |    2 +
 lib/radix-tree.c              |   32 ++-
 mm/filemap.c                  |  283 +++++++++++++++++++++----
 mm/huge_memory.c              |  462 ++++++++++++++++++++++++++++++++++-------
 mm/memory.c                   |   31 ++-
 mm/swap.c                     |    3 +-
 mm/truncate.c                 |   12 ++
 mm/vmstat.c                   |    2 +
 16 files changed, 842 insertions(+), 143 deletions(-)

-- 
1.7.10.4


* [PATCHv2, RFC 01/30] block: implement add_bdi_stat()
From: Kirill A. Shutemov @ 2013-03-14 17:50 UTC (permalink / raw)
  To: Andrea Arcangeli, Andrew Morton, Al Viro, Hugh Dickins
  Cc: Wu Fengguang, Jan Kara, Mel Gorman, linux-mm, Andi Kleen,
	Matthew Wilcox, Kirill A. Shutemov, Hillf Danton, linux-fsdevel,
	linux-kernel, Kirill A. Shutemov

From: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>

It's required for batched stats updates.

Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
---
 include/linux/backing-dev.h |   10 ++++++++++
 1 file changed, 10 insertions(+)

diff --git a/include/linux/backing-dev.h b/include/linux/backing-dev.h
index 3504599..b05d961 100644
--- a/include/linux/backing-dev.h
+++ b/include/linux/backing-dev.h
@@ -167,6 +167,16 @@ static inline void __dec_bdi_stat(struct backing_dev_info *bdi,
 	__add_bdi_stat(bdi, item, -1);
 }
 
+static inline void add_bdi_stat(struct backing_dev_info *bdi,
+		enum bdi_stat_item item, s64 amount)
+{
+	unsigned long flags;
+
+	local_irq_save(flags);
+	__add_bdi_stat(bdi, item, amount);
+	local_irq_restore(flags);
+}
+
 static inline void dec_bdi_stat(struct backing_dev_info *bdi,
 		enum bdi_stat_item item)
 {
-- 
1.7.10.4
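
For illustration, a minimal sketch of how a caller might batch an update with
the new helper instead of calling dec_bdi_stat() once per page (the caller and
the BDI_RECLAIMABLE accounting site are assumptions here, not part of the patch):

#include <linux/fs.h>
#include <linux/backing-dev.h>

/* Hypothetical caller: uncharge nr reclaimable pages in one go.
 * add_bdi_stat() disables interrupts around __add_bdi_stat(), so it is
 * safe to call from process context. */
static void uncharge_reclaimable_pages(struct address_space *mapping, int nr)
{
	add_bdi_stat(mapping->backing_dev_info, BDI_RECLAIMABLE, -nr);
}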


* [PATCHv2, RFC 02/30] mm: implement zero_huge_user_segment and friends
From: Kirill A. Shutemov @ 2013-03-14 17:50 UTC (permalink / raw)
  To: Andrea Arcangeli, Andrew Morton, Al Viro, Hugh Dickins
  Cc: Wu Fengguang, Jan Kara, Mel Gorman, linux-mm, Andi Kleen,
	Matthew Wilcox, Kirill A. Shutemov, Hillf Danton, linux-fsdevel,
	linux-kernel, Kirill A. Shutemov

From: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>

Let's add helpers to clear huge page segment(s). They provide the same
functionality as zero_user_segment{,s} and zero_user, but for huge
pages.

Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
---
 include/linux/mm.h |   15 +++++++++++++++
 mm/memory.c        |   26 ++++++++++++++++++++++++++
 2 files changed, 41 insertions(+)

diff --git a/include/linux/mm.h b/include/linux/mm.h
index 5b7fd4e..df83ab9 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -1731,6 +1731,21 @@ extern void dump_page(struct page *page);
 extern void clear_huge_page(struct page *page,
 			    unsigned long addr,
 			    unsigned int pages_per_huge_page);
+extern void zero_huge_user_segment(struct page *page,
+		unsigned start, unsigned end);
+static inline void zero_huge_user_segments(struct page *page,
+		unsigned start1, unsigned end1,
+		unsigned start2, unsigned end2)
+{
+	zero_huge_user_segment(page, start1, end1);
+	zero_huge_user_segment(page, start2, end2);
+}
+static inline void zero_huge_user(struct page *page,
+		unsigned start, unsigned len)
+{
+	zero_huge_user_segment(page, start, start+len);
+}
+
 extern void copy_user_huge_page(struct page *dst, struct page *src,
 				unsigned long addr, struct vm_area_struct *vma,
 				unsigned int pages_per_huge_page);
diff --git a/mm/memory.c b/mm/memory.c
index 494526a..98c25dd 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -4213,6 +4213,32 @@ void clear_huge_page(struct page *page,
 	}
 }
 
+void zero_huge_user_segment(struct page *page, unsigned start, unsigned end)
+{
+	int i;
+
+	BUG_ON(end < start);
+
+	might_sleep();
+
+	if (start == end)
+		return;
+
+	/* start and end are on the same small page */
+	if ((start & PAGE_MASK) == ((end - 1) & PAGE_MASK))
+		return zero_user_segment(page + (start >> PAGE_SHIFT),
+				start & ~PAGE_MASK,
+				((end - 1) & ~PAGE_MASK) + 1);
+
+	zero_user_segment(page + (start >> PAGE_SHIFT),
+			start & ~PAGE_MASK, PAGE_SIZE);
+	for (i = (start >> PAGE_SHIFT) + 1; i < (end >> PAGE_SHIFT) - 1; i++) {
+		cond_resched();
+		clear_highpage(page + i);
+	}
+	zero_user_segment(page + i, 0, ((end - 1) & ~PAGE_MASK) + 1);
+}
+
 static void copy_user_gigantic_page(struct page *dst, struct page *src,
 				    unsigned long addr,
 				    struct vm_area_struct *vma,
-- 
1.7.10.4
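
As a usage sketch (illustrative only, not a call site added by this patch):
zeroing everything in a huge page outside a written range, analogous to how
zero_user_segments() is used for small pages:

#include <linux/mm.h>
#include <linux/huge_mm.h>

/* Zero the parts of a huge page outside [from, to); offsets are relative to
 * the start of the huge page. Assumes 0 <= from <= to <= HPAGE_PMD_SIZE. */
static void zero_huge_page_outside(struct page *page, unsigned from, unsigned to)
{
	zero_huge_user_segments(page, 0, from, to, HPAGE_PMD_SIZE);
}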


* [PATCHv2, RFC 03/30] mm: drop actor argument of do_generic_file_read()
From: Kirill A. Shutemov @ 2013-03-14 17:50 UTC (permalink / raw)
  To: Andrea Arcangeli, Andrew Morton, Al Viro, Hugh Dickins
  Cc: Wu Fengguang, Jan Kara, Mel Gorman, linux-mm, Andi Kleen,
	Matthew Wilcox, Kirill A. Shutemov, Hillf Danton, linux-fsdevel,
	linux-kernel, Kirill A. Shutemov

From: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>

There's only one caller of do_generic_file_read() and the only actor is
file_read_actor(). No reason to have a callback parameter.

Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
---
 mm/filemap.c |   10 +++++-----
 1 file changed, 5 insertions(+), 5 deletions(-)

diff --git a/mm/filemap.c b/mm/filemap.c
index 4ebaf95..2d99191 100644
--- a/mm/filemap.c
+++ b/mm/filemap.c
@@ -1075,7 +1075,6 @@ static void shrink_readahead_size_eio(struct file *filp,
  * @filp:	the file to read
  * @ppos:	current file position
  * @desc:	read_descriptor
- * @actor:	read method
  *
  * This is a generic file read routine, and uses the
  * mapping->a_ops->readpage() function for the actual low-level stuff.
@@ -1084,7 +1083,7 @@ static void shrink_readahead_size_eio(struct file *filp,
  * of the logic when it comes to error handling etc.
  */
 static void do_generic_file_read(struct file *filp, loff_t *ppos,
-		read_descriptor_t *desc, read_actor_t actor)
+		read_descriptor_t *desc)
 {
 	struct address_space *mapping = filp->f_mapping;
 	struct inode *inode = mapping->host;
@@ -1185,13 +1184,14 @@ page_ok:
 		 * Ok, we have the page, and it's up-to-date, so
 		 * now we can copy it to user space...
 		 *
-		 * The actor routine returns how many bytes were actually used..
+		 * The file_read_actor routine returns how many bytes were
+		 * actually used..
 		 * NOTE! This may not be the same as how much of a user buffer
 		 * we filled up (we may be padding etc), so we can only update
 		 * "pos" here (the actor routine has to update the user buffer
 		 * pointers and the remaining count).
 		 */
-		ret = actor(desc, page, offset, nr);
+		ret = file_read_actor(desc, page, offset, nr);
 		offset += ret;
 		index += offset >> PAGE_CACHE_SHIFT;
 		offset &= ~PAGE_CACHE_MASK;
@@ -1464,7 +1464,7 @@ generic_file_aio_read(struct kiocb *iocb, const struct iovec *iov,
 		if (desc.count == 0)
 			continue;
 		desc.error = 0;
-		do_generic_file_read(filp, ppos, &desc, file_read_actor);
+		do_generic_file_read(filp, ppos, &desc);
 		retval += desc.written;
 		if (desc.error) {
 			retval = retval ?: desc.error;
-- 
1.7.10.4


* [PATCHv2, RFC 04/30] radix-tree: implement preload for multiple contiguous elements
From: Kirill A. Shutemov @ 2013-03-14 17:50 UTC (permalink / raw)
  To: Andrea Arcangeli, Andrew Morton, Al Viro, Hugh Dickins
  Cc: Wu Fengguang, Jan Kara, Mel Gorman, linux-mm, Andi Kleen,
	Matthew Wilcox, Kirill A. Shutemov, Hillf Danton, linux-fsdevel,
	linux-kernel, Kirill A. Shutemov

From: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>

Currently radix_tree_preload() only guarantees enough nodes to insert
one element. It's a hard limit. You cannot batch a number of inserts
under one tree_lock.

This patch introduces radix_tree_preload_count(). It allows preallocating
enough nodes to insert a number of *contiguous* elements.

Signed-off-by: Matthew Wilcox <matthew.r.wilcox@intel.com>
Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
---
 include/linux/radix-tree.h |    3 +++
 lib/radix-tree.c           |   32 +++++++++++++++++++++++++-------
 2 files changed, 28 insertions(+), 7 deletions(-)

diff --git a/include/linux/radix-tree.h b/include/linux/radix-tree.h
index ffc444c..81318cb 100644
--- a/include/linux/radix-tree.h
+++ b/include/linux/radix-tree.h
@@ -83,6 +83,8 @@ do {									\
 	(root)->rnode = NULL;						\
 } while (0)
 
+#define RADIX_TREE_PRELOAD_NR		512 /* For THP's benefit */
+
 /**
  * Radix-tree synchronization
  *
@@ -231,6 +233,7 @@ unsigned long radix_tree_next_hole(struct radix_tree_root *root,
 unsigned long radix_tree_prev_hole(struct radix_tree_root *root,
 				unsigned long index, unsigned long max_scan);
 int radix_tree_preload(gfp_t gfp_mask);
+int radix_tree_preload_count(unsigned size, gfp_t gfp_mask);
 void radix_tree_init(void);
 void *radix_tree_tag_set(struct radix_tree_root *root,
 			unsigned long index, unsigned int tag);
diff --git a/lib/radix-tree.c b/lib/radix-tree.c
index e796429..9bef0ac 100644
--- a/lib/radix-tree.c
+++ b/lib/radix-tree.c
@@ -81,16 +81,24 @@ static struct kmem_cache *radix_tree_node_cachep;
  * The worst case is a zero height tree with just a single item at index 0,
  * and then inserting an item at index ULONG_MAX. This requires 2 new branches
  * of RADIX_TREE_MAX_PATH size to be created, with only the root node shared.
+ *
+ * Worst case for adding N contiguous items is adding entries at indexes
+ * (ULONG_MAX - N) to ULONG_MAX. It requires nodes to insert single worst-case
+ * item plus extra nodes if you cross the boundary from one node to the next.
+ *
  * Hence:
  */
-#define RADIX_TREE_PRELOAD_SIZE (RADIX_TREE_MAX_PATH * 2 - 1)
+#define RADIX_TREE_PRELOAD_MIN (RADIX_TREE_MAX_PATH * 2 - 1)
+#define RADIX_TREE_PRELOAD_MAX \
+	(RADIX_TREE_PRELOAD_MIN + \
+	 DIV_ROUND_UP(RADIX_TREE_PRELOAD_NR - 1, RADIX_TREE_MAP_SIZE))
 
 /*
  * Per-cpu pool of preloaded nodes
  */
 struct radix_tree_preload {
 	int nr;
-	struct radix_tree_node *nodes[RADIX_TREE_PRELOAD_SIZE];
+	struct radix_tree_node *nodes[RADIX_TREE_PRELOAD_MAX];
 };
 static DEFINE_PER_CPU(struct radix_tree_preload, radix_tree_preloads) = { 0, };
 
@@ -257,29 +265,34 @@ radix_tree_node_free(struct radix_tree_node *node)
 
 /*
  * Load up this CPU's radix_tree_node buffer with sufficient objects to
- * ensure that the addition of a single element in the tree cannot fail.  On
- * success, return zero, with preemption disabled.  On error, return -ENOMEM
+ * ensure that the addition of *contiguous* elements in the tree cannot fail.
+ * On success, return zero, with preemption disabled.  On error, return -ENOMEM
  * with preemption not disabled.
  *
  * To make use of this facility, the radix tree must be initialised without
  * __GFP_WAIT being passed to INIT_RADIX_TREE().
  */
-int radix_tree_preload(gfp_t gfp_mask)
+int radix_tree_preload_count(unsigned size, gfp_t gfp_mask)
 {
 	struct radix_tree_preload *rtp;
 	struct radix_tree_node *node;
 	int ret = -ENOMEM;
+	int alloc = RADIX_TREE_PRELOAD_MIN +
+		DIV_ROUND_UP(size - 1, RADIX_TREE_MAP_SIZE);
+
+	if (size > RADIX_TREE_PRELOAD_NR)
+		return -ENOMEM;
 
 	preempt_disable();
 	rtp = &__get_cpu_var(radix_tree_preloads);
-	while (rtp->nr < ARRAY_SIZE(rtp->nodes)) {
+	while (rtp->nr < alloc) {
 		preempt_enable();
 		node = kmem_cache_alloc(radix_tree_node_cachep, gfp_mask);
 		if (node == NULL)
 			goto out;
 		preempt_disable();
 		rtp = &__get_cpu_var(radix_tree_preloads);
-		if (rtp->nr < ARRAY_SIZE(rtp->nodes))
+		if (rtp->nr < alloc)
 			rtp->nodes[rtp->nr++] = node;
 		else
 			kmem_cache_free(radix_tree_node_cachep, node);
@@ -288,6 +301,11 @@ int radix_tree_preload(gfp_t gfp_mask)
 out:
 	return ret;
 }
+
+int radix_tree_preload(gfp_t gfp_mask)
+{
+	return radix_tree_preload_count(1, gfp_mask);
+}
 EXPORT_SYMBOL(radix_tree_preload);
 
 /*
-- 
1.7.10.4
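
To illustrate the intended calling convention (a sketch; the real user is the
add_to_page_cache_locked() rewrite later in this series, and serializing
against concurrent tree users, e.g. via mapping->tree_lock, remains the
caller's responsibility):

#include <linux/radix-tree.h>

/* Insert nr contiguous entries starting at 'first'. With the preload in
 * place, radix_tree_insert() cannot fail for lack of nodes as long as
 * nr <= RADIX_TREE_PRELOAD_NR and the indexes are contiguous. */
static int insert_contig(struct radix_tree_root *root, unsigned long first,
			 void **items, unsigned nr, gfp_t gfp_mask)
{
	unsigned i;
	int err;

	err = radix_tree_preload_count(nr, gfp_mask);
	if (err)
		return err;
	for (i = 0; i < nr; i++) {
		err = radix_tree_insert(root, first + i, items[i]);
		if (err)
			break;		/* e.g. -EEXIST; caller must unwind */
	}
	radix_tree_preload_end();
	return err;
}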


* [PATCHv2, RFC 05/30] thp, mm: avoid PageUnevictable on active/inactive lru lists
From: Kirill A. Shutemov @ 2013-03-14 17:50 UTC (permalink / raw)
  To: Andrea Arcangeli, Andrew Morton, Al Viro, Hugh Dickins
  Cc: Wu Fengguang, Jan Kara, Mel Gorman, linux-mm, Andi Kleen,
	Matthew Wilcox, Kirill A. Shutemov, Hillf Danton, linux-fsdevel,
	linux-kernel, Kirill A. Shutemov

From: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>

active/inactive lru lists can contain unevictable pages (i.e. ramfs pages
that have been placed on the LRU lists when first allocated), but these
pages must not have PageUnevictable set - otherwise shrink_active_list
goes crazy:

kernel BUG at /home/space/kas/git/public/linux-next/mm/vmscan.c:1122!
invalid opcode: 0000 [#1] SMP
CPU 0
Pid: 293, comm: kswapd0 Not tainted 3.8.0-rc6-next-20130202+ #531
RIP: 0010:[<ffffffff81110478>]  [<ffffffff81110478>] isolate_lru_pages.isra.61+0x138/0x260
RSP: 0000:ffff8800796d9b28  EFLAGS: 00010082
RAX: 00000000ffffffea RBX: 0000000000000012 RCX: 0000000000000001
RDX: 0000000000000001 RSI: 0000000000000000 RDI: ffffea0001de8040
RBP: ffff8800796d9b88 R08: ffff8800796d9df0 R09: 0000000000000000
R10: 0000000000000000 R11: 0000000000000000 R12: 0000000000000012
R13: ffffea0001de8060 R14: ffffffff818818e8 R15: ffff8800796d9bf8
FS:  0000000000000000(0000) GS:ffff88007a200000(0000) knlGS:0000000000000000
CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 00007f1bfc108000 CR3: 000000000180b000 CR4: 00000000000406f0
DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
Process kswapd0 (pid: 293, threadinfo ffff8800796d8000, task ffff880079e0a6e0)
Stack:
 ffff8800796d9b48 ffffffff81881880 ffff8800796d9df0 ffff8800796d9be0
 0000000000000002 000000000000001f ffff8800796d9b88 ffffffff818818c8
 ffffffff81881480 ffff8800796d9dc0 0000000000000002 000000000000001f
Call Trace:
 [<ffffffff81111e98>] shrink_inactive_list+0x108/0x4a0
 [<ffffffff8109ce3d>] ? trace_hardirqs_off+0xd/0x10
 [<ffffffff8107b8bf>] ? local_clock+0x4f/0x60
 [<ffffffff8110ff5d>] ? shrink_slab+0x1fd/0x4c0
 [<ffffffff811125a1>] shrink_zone+0x371/0x610
 [<ffffffff8110ff75>] ? shrink_slab+0x215/0x4c0
 [<ffffffff81112dfc>] kswapd+0x5bc/0xb60
 [<ffffffff81112840>] ? shrink_zone+0x610/0x610
 [<ffffffff81066676>] kthread+0xd6/0xe0
 [<ffffffff810665a0>] ? __kthread_bind+0x40/0x40
 [<ffffffff814fed6c>] ret_from_fork+0x7c/0xb0
 [<ffffffff810665a0>] ? __kthread_bind+0x40/0x40
Code: 1f 40 00 49 8b 45 08 49 8b 75 00 48 89 46 08 48 89 30 49 8b 06 4c 89 68 08 49 89 45 00 4d 89 75 08 4d 89 2e eb 9c 0f 1f 44 00 00 <0f> 0b 66 0f 1f 44 00 00 31 db 45 31 e4 eb 9b 0f 0b 0f 0b 65 48
RIP  [<ffffffff81110478>] isolate_lru_pages.isra.61+0x138/0x260
 RSP <ffff8800796d9b28>

For lru_add_page_tail(), it means we should not set PageUnevictable()
for tail pages unless we're sure that they will go to LRU_UNEVICTABLE.
A tail page will go to LRU_UNEVICTABLE if the head page is not on the LRU
or if it is marked PageUnevictable() too.

Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
---
 mm/swap.c |    3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)

diff --git a/mm/swap.c b/mm/swap.c
index 92a9be5..31584d0 100644
--- a/mm/swap.c
+++ b/mm/swap.c
@@ -762,7 +762,8 @@ void lru_add_page_tail(struct page *page, struct page *page_tail,
 			lru = LRU_INACTIVE_ANON;
 		}
 	} else {
-		SetPageUnevictable(page_tail);
+		if (!PageLRU(page) || PageUnevictable(page))
+			SetPageUnevictable(page_tail);
 		lru = LRU_UNEVICTABLE;
 	}
 
-- 
1.7.10.4


* [PATCHv2, RFC 06/30] thp, mm: basic defines for transparent huge page cache
From: Kirill A. Shutemov @ 2013-03-14 17:50 UTC (permalink / raw)
  To: Andrea Arcangeli, Andrew Morton, Al Viro, Hugh Dickins
  Cc: Wu Fengguang, Jan Kara, Mel Gorman, linux-mm, Andi Kleen,
	Matthew Wilcox, Kirill A. Shutemov, Hillf Danton, linux-fsdevel,
	linux-kernel, Kirill A. Shutemov

From: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>

Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
---
 include/linux/huge_mm.h |    8 ++++++++
 1 file changed, 8 insertions(+)

diff --git a/include/linux/huge_mm.h b/include/linux/huge_mm.h
index ee1c244..a54939c 100644
--- a/include/linux/huge_mm.h
+++ b/include/linux/huge_mm.h
@@ -64,6 +64,10 @@ extern pmd_t *page_check_address_pmd(struct page *page,
 #define HPAGE_PMD_MASK HPAGE_MASK
 #define HPAGE_PMD_SIZE HPAGE_SIZE
 
+#define HPAGE_CACHE_ORDER      (HPAGE_SHIFT - PAGE_CACHE_SHIFT)
+#define HPAGE_CACHE_NR         (1L << HPAGE_CACHE_ORDER)
+#define HPAGE_CACHE_INDEX_MASK (HPAGE_CACHE_NR - 1)
+
 extern bool is_vma_temporary_stack(struct vm_area_struct *vma);
 
 #define transparent_hugepage_enabled(__vma)				\
@@ -181,6 +185,10 @@ extern int do_huge_pmd_numa_page(struct mm_struct *mm, struct vm_area_struct *vm
 #define HPAGE_PMD_MASK ({ BUILD_BUG(); 0; })
 #define HPAGE_PMD_SIZE ({ BUILD_BUG(); 0; })
 
+#define HPAGE_CACHE_ORDER      ({ BUILD_BUG(); 0; })
+#define HPAGE_CACHE_NR         ({ BUILD_BUG(); 0; })
+#define HPAGE_CACHE_INDEX_MASK ({ BUILD_BUG(); 0; })
+
 #define hpage_nr_pages(x) 1
 
 #define transparent_hugepage_enabled(__vma) 0
-- 
1.7.10.4
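
For concreteness, with the usual x86-64 configuration (4KiB base pages and
2MiB transparent huge pages, so HPAGE_SHIFT = 21 and PAGE_CACHE_SHIFT = 12)
these defines evaluate to:

/*
 *   HPAGE_CACHE_ORDER      = 21 - 12 = 9
 *   HPAGE_CACHE_NR         = 1L << 9 = 512  (small pages per huge page)
 *   HPAGE_CACHE_INDEX_MASK = 511  (offset of a page-cache index within a huge page)
 */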


* [PATCHv2, RFC 07/30] thp, mm: introduce mapping_can_have_hugepages() predicate
From: Kirill A. Shutemov @ 2013-03-14 17:50 UTC (permalink / raw)
  To: Andrea Arcangeli, Andrew Morton, Al Viro, Hugh Dickins
  Cc: Wu Fengguang, Jan Kara, Mel Gorman, linux-mm, Andi Kleen,
	Matthew Wilcox, Kirill A. Shutemov, Hillf Danton, linux-fsdevel,
	linux-kernel, Kirill A. Shutemov

From: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>

Returns true if the mapping can have huge pages. For now, just check for
__GFP_COMP in the mapping's gfp mask.

Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
---
 include/linux/pagemap.h |   10 ++++++++++
 1 file changed, 10 insertions(+)

diff --git a/include/linux/pagemap.h b/include/linux/pagemap.h
index e3dea75..3521b0d 100644
--- a/include/linux/pagemap.h
+++ b/include/linux/pagemap.h
@@ -84,6 +84,16 @@ static inline void mapping_set_gfp_mask(struct address_space *m, gfp_t mask)
 				(__force unsigned long)mask;
 }
 
+static inline bool mapping_can_have_hugepages(struct address_space *m)
+{
+	if (IS_ENABLED(CONFIG_TRANSPARENT_HUGEPAGE)) {
+		gfp_t gfp_mask = mapping_gfp_mask(m);
+		return !!(gfp_mask & __GFP_COMP);
+	}
+
+	return false;
+}
+
 /*
  * The page cache can done in larger chunks than
  * one page, because it allows for more efficient
-- 
1.7.10.4
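
Usage sketch (illustrative; the actual ramfs opt-in and allocation sites come
later in the series): a filesystem opts in by setting __GFP_COMP in the
mapping's gfp mask (via mapping_set_gfp_mask()), and generic code tests the
predicate before trying a huge page allocation:

#include <linux/pagemap.h>
#include <linux/huge_mm.h>

/* Hypothetical helper: return a huge page for the mapping, or NULL so the
 * caller falls back to a small page. */
static struct page *try_alloc_huge_cache_page(struct address_space *mapping)
{
	if (!mapping_can_have_hugepages(mapping))
		return NULL;
	return alloc_pages(mapping_gfp_mask(mapping), HPAGE_PMD_ORDER);
}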


* [PATCHv2, RFC 08/30] thp, mm: rewrite add_to_page_cache_locked() to support huge pages
From: Kirill A. Shutemov @ 2013-03-14 17:50 UTC (permalink / raw)
  To: Andrea Arcangeli, Andrew Morton, Al Viro, Hugh Dickins
  Cc: Wu Fengguang, Jan Kara, Mel Gorman, linux-mm, Andi Kleen,
	Matthew Wilcox, Kirill A. Shutemov, Hillf Danton, linux-fsdevel,
	linux-kernel, Kirill A. Shutemov

From: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>

For a huge page we add HPAGE_CACHE_NR pages to the radix tree at once: the
head page at the specified index and HPAGE_CACHE_NR-1 tail pages at the
following indexes.

Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
---
 mm/filemap.c |   76 ++++++++++++++++++++++++++++++++++++++++------------------
 1 file changed, 53 insertions(+), 23 deletions(-)

diff --git a/mm/filemap.c b/mm/filemap.c
index 2d99191..6bac9e2 100644
--- a/mm/filemap.c
+++ b/mm/filemap.c
@@ -447,6 +447,7 @@ int add_to_page_cache_locked(struct page *page, struct address_space *mapping,
 		pgoff_t offset, gfp_t gfp_mask)
 {
 	int error;
+	int nr = 1;
 
 	VM_BUG_ON(!PageLocked(page));
 	VM_BUG_ON(PageSwapBacked(page));
@@ -454,32 +455,61 @@ int add_to_page_cache_locked(struct page *page, struct address_space *mapping,
 	error = mem_cgroup_cache_charge(page, current->mm,
 					gfp_mask & GFP_RECLAIM_MASK);
 	if (error)
-		goto out;
+		return error;
 
-	error = radix_tree_preload(gfp_mask & ~__GFP_HIGHMEM);
-	if (error == 0) {
-		page_cache_get(page);
-		page->mapping = mapping;
-		page->index = offset;
+	if (PageTransHuge(page)) {
+		BUILD_BUG_ON(HPAGE_CACHE_NR > RADIX_TREE_PRELOAD_NR);
+		nr = HPAGE_CACHE_NR;
+	}
+	error = radix_tree_preload_count(nr, gfp_mask & ~__GFP_HIGHMEM);
+	if (error) {
+		mem_cgroup_uncharge_cache_page(page);
+		return error;
+	}
 
-		spin_lock_irq(&mapping->tree_lock);
-		error = radix_tree_insert(&mapping->page_tree, offset, page);
-		if (likely(!error)) {
-			mapping->nrpages++;
-			__inc_zone_page_state(page, NR_FILE_PAGES);
-			spin_unlock_irq(&mapping->tree_lock);
-			trace_mm_filemap_add_to_page_cache(page);
-		} else {
-			page->mapping = NULL;
-			/* Leave page->index set: truncation relies upon it */
-			spin_unlock_irq(&mapping->tree_lock);
-			mem_cgroup_uncharge_cache_page(page);
-			page_cache_release(page);
+	page_cache_get(page);
+	spin_lock_irq(&mapping->tree_lock);
+	page->mapping = mapping;
+	page->index = offset;
+	error = radix_tree_insert(&mapping->page_tree, offset, page);
+	if (unlikely(error))
+		goto err;
+	if (PageTransHuge(page)) {
+		int i;
+		for (i = 1; i < HPAGE_CACHE_NR; i++) {
+			page_cache_get(page + i);
+			page[i].index = offset + i;
+			error = radix_tree_insert(&mapping->page_tree,
+					offset + i, page + i);
+			if (error) {
+				page_cache_release(page + i);
+				break;
+			}
 		}
-		radix_tree_preload_end();
-	} else
-		mem_cgroup_uncharge_cache_page(page);
-out:
+		if (error) {
+			error = ENOSPC; /* no space for a huge page */
+			for (i--; i > 0; i--) {
+				radix_tree_delete(&mapping->page_tree,
+						offset + i);
+				page_cache_release(page + i);
+			}
+			radix_tree_delete(&mapping->page_tree, offset);
+			goto err;
+		}
+	}
+	__mod_zone_page_state(page_zone(page), NR_FILE_PAGES, nr);
+	mapping->nrpages += nr;
+	spin_unlock_irq(&mapping->tree_lock);
+	trace_mm_filemap_add_to_page_cache(page);
+	radix_tree_preload_end();
+	return 0;
+err:
+	page->mapping = NULL;
+	/* Leave page->index set: truncation relies upon it */
+	spin_unlock_irq(&mapping->tree_lock);
+	radix_tree_preload_end();
+	mem_cgroup_uncharge_cache_page(page);
+	page_cache_release(page);
 	return error;
 }
 EXPORT_SYMBOL(add_to_page_cache_locked);
-- 
1.7.10.4


* [PATCHv2, RFC 09/30] thp, mm: rewrite delete_from_page_cache() to support huge pages
From: Kirill A. Shutemov @ 2013-03-14 17:50 UTC (permalink / raw)
  To: Andrea Arcangeli, Andrew Morton, Al Viro, Hugh Dickins
  Cc: Wu Fengguang, Jan Kara, Mel Gorman, linux-mm, Andi Kleen,
	Matthew Wilcox, Kirill A. Shutemov, Hillf Danton, linux-fsdevel,
	linux-kernel, Kirill A. Shutemov

From: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>

As with add_to_page_cache_locked(), we handle HPAGE_CACHE_NR pages at a
time.

Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
---
 mm/filemap.c |   27 +++++++++++++++++++++------
 1 file changed, 21 insertions(+), 6 deletions(-)

diff --git a/mm/filemap.c b/mm/filemap.c
index 6bac9e2..0ff3403 100644
--- a/mm/filemap.c
+++ b/mm/filemap.c
@@ -115,6 +115,7 @@
 void __delete_from_page_cache(struct page *page)
 {
 	struct address_space *mapping = page->mapping;
+	int nr = 1;
 
 	trace_mm_filemap_delete_from_page_cache(page);
 	/*
@@ -127,13 +128,23 @@ void __delete_from_page_cache(struct page *page)
 	else
 		cleancache_invalidate_page(mapping, page);
 
-	radix_tree_delete(&mapping->page_tree, page->index);
+	if (PageTransHuge(page)) {
+		int i;
+
+		for (i = 0; i < HPAGE_CACHE_NR; i++)
+			radix_tree_delete(&mapping->page_tree, page->index + i);
+		nr = HPAGE_CACHE_NR;
+	} else {
+		radix_tree_delete(&mapping->page_tree, page->index);
+	}
+
 	page->mapping = NULL;
 	/* Leave page->index set: truncation lookup relies upon it */
-	mapping->nrpages--;
-	__dec_zone_page_state(page, NR_FILE_PAGES);
+
+	mapping->nrpages -= nr;
+	__mod_zone_page_state(page_zone(page), NR_FILE_PAGES, -nr);
 	if (PageSwapBacked(page))
-		__dec_zone_page_state(page, NR_SHMEM);
+		__mod_zone_page_state(page_zone(page), NR_SHMEM, -nr);
 	BUG_ON(page_mapped(page));
 
 	/*
@@ -144,8 +155,8 @@ void __delete_from_page_cache(struct page *page)
 	 * having removed the page entirely.
 	 */
 	if (PageDirty(page) && mapping_cap_account_dirty(mapping)) {
-		dec_zone_page_state(page, NR_FILE_DIRTY);
-		dec_bdi_stat(mapping->backing_dev_info, BDI_RECLAIMABLE);
+		mod_zone_page_state(page_zone(page), NR_FILE_DIRTY, -nr);
+		add_bdi_stat(mapping->backing_dev_info, BDI_RECLAIMABLE, -nr);
 	}
 }
 
@@ -161,6 +172,7 @@ void delete_from_page_cache(struct page *page)
 {
 	struct address_space *mapping = page->mapping;
 	void (*freepage)(struct page *);
+	int i;
 
 	BUG_ON(!PageLocked(page));
 
@@ -172,6 +184,9 @@ void delete_from_page_cache(struct page *page)
 
 	if (freepage)
 		freepage(page);
+	if (PageTransHuge(page))
+		for (i = 1; i < HPAGE_CACHE_NR; i++)
+			page_cache_release(page + i);
 	page_cache_release(page);
 }
 EXPORT_SYMBOL(delete_from_page_cache);
-- 
1.7.10.4


^ permalink raw reply related	[flat|nested] 243+ messages in thread

* [PATCHv2, RFC 10/30] thp, mm: locking tail page is a bug
  2013-03-14 17:50 ` Kirill A. Shutemov
@ 2013-03-14 17:50   ` Kirill A. Shutemov
  -1 siblings, 0 replies; 243+ messages in thread
From: Kirill A. Shutemov @ 2013-03-14 17:50 UTC (permalink / raw)
  To: Andrea Arcangeli, Andrew Morton, Al Viro, Hugh Dickins
  Cc: Wu Fengguang, Jan Kara, Mel Gorman, linux-mm, Andi Kleen,
	Matthew Wilcox, Kirill A. Shutemov, Hillf Danton, linux-fsdevel,
	linux-kernel, Kirill A. Shutemov

From: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>

Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
---
 mm/filemap.c |    2 ++
 1 file changed, 2 insertions(+)

diff --git a/mm/filemap.c b/mm/filemap.c
index 0ff3403..38fdc92 100644
--- a/mm/filemap.c
+++ b/mm/filemap.c
@@ -669,6 +669,7 @@ void __lock_page(struct page *page)
 {
 	DEFINE_WAIT_BIT(wait, &page->flags, PG_locked);
 
+	VM_BUG_ON(PageTail(page));
 	__wait_on_bit_lock(page_waitqueue(page), &wait, sleep_on_page,
 							TASK_UNINTERRUPTIBLE);
 }
@@ -678,6 +679,7 @@ int __lock_page_killable(struct page *page)
 {
 	DEFINE_WAIT_BIT(wait, &page->flags, PG_locked);
 
+	VM_BUG_ON(PageTail(page));
 	return __wait_on_bit_lock(page_waitqueue(page), &wait,
 					sleep_on_page_killable, TASK_KILLABLE);
 }
-- 
1.7.10.4


^ permalink raw reply related	[flat|nested] 243+ messages in thread

* [PATCHv2, RFC 11/30] thp, mm: handle tail pages in page_cache_get_speculative()
  2013-03-14 17:50 ` Kirill A. Shutemov
@ 2013-03-14 17:50   ` Kirill A. Shutemov
  -1 siblings, 0 replies; 243+ messages in thread
From: Kirill A. Shutemov @ 2013-03-14 17:50 UTC (permalink / raw)
  To: Andrea Arcangeli, Andrew Morton, Al Viro, Hugh Dickins
  Cc: Wu Fengguang, Jan Kara, Mel Gorman, linux-mm, Andi Kleen,
	Matthew Wilcox, Kirill A. Shutemov, Hillf Danton, linux-fsdevel,
	linux-kernel, Kirill A. Shutemov

From: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>

For tail pages we call __get_page_tail(), which has the same semantics
(try to take a reference, report whether it succeeded) but handles
compound tail pages correctly.
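
To illustrate the intended use (a sketch, not part of the patch): a
lockless lookup can now take the speculative reference without caring
whether the slot points at a head or a tail page, roughly:

	rcu_read_lock();
	page = radix_tree_lookup(&mapping->page_tree, index);
	if (page && !page_cache_get_speculative(page)) {
		/* Lost a race (or the compound page is being split/freed);
		 * the caller retries the lookup. */
		page = NULL;
	}
	rcu_read_unlock();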

Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
---
 include/linux/pagemap.h |    4 +++-
 1 file changed, 3 insertions(+), 1 deletion(-)

diff --git a/include/linux/pagemap.h b/include/linux/pagemap.h
index 3521b0d..408c4e3 100644
--- a/include/linux/pagemap.h
+++ b/include/linux/pagemap.h
@@ -159,6 +159,9 @@ static inline int page_cache_get_speculative(struct page *page)
 {
 	VM_BUG_ON(in_interrupt());
 
+	if (unlikely(PageTail(page)))
+		return __get_page_tail(page);
+
 #ifdef CONFIG_TINY_RCU
 # ifdef CONFIG_PREEMPT_COUNT
 	VM_BUG_ON(!in_atomic());
@@ -185,7 +188,6 @@ static inline int page_cache_get_speculative(struct page *page)
 		return 0;
 	}
 #endif
-	VM_BUG_ON(PageTail(page));
 
 	return 1;
 }
-- 
1.7.10.4


^ permalink raw reply related	[flat|nested] 243+ messages in thread

* [PATCHv2, RFC 12/30] thp, mm: add event counters for huge page alloc on write to a file
  2013-03-14 17:50 ` Kirill A. Shutemov
@ 2013-03-14 17:50   ` Kirill A. Shutemov
  -1 siblings, 0 replies; 243+ messages in thread
From: Kirill A. Shutemov @ 2013-03-14 17:50 UTC (permalink / raw)
  To: Andrea Arcangeli, Andrew Morton, Al Viro, Hugh Dickins
  Cc: Wu Fengguang, Jan Kara, Mel Gorman, linux-mm, Andi Kleen,
	Matthew Wilcox, Kirill A. Shutemov, Hillf Danton, linux-fsdevel,
	linux-kernel, Kirill A. Shutemov

From: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>

Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
---
 include/linux/vm_event_item.h |    2 ++
 mm/vmstat.c                   |    2 ++
 2 files changed, 4 insertions(+)

diff --git a/include/linux/vm_event_item.h b/include/linux/vm_event_item.h
index d4b7a18..04587c4 100644
--- a/include/linux/vm_event_item.h
+++ b/include/linux/vm_event_item.h
@@ -71,6 +71,8 @@ enum vm_event_item { PGPGIN, PGPGOUT, PSWPIN, PSWPOUT,
 		THP_FAULT_FALLBACK,
 		THP_COLLAPSE_ALLOC,
 		THP_COLLAPSE_ALLOC_FAILED,
+		THP_WRITE_ALLOC,
+		THP_WRITE_FAILED,
 		THP_SPLIT,
 		THP_ZERO_PAGE_ALLOC,
 		THP_ZERO_PAGE_ALLOC_FAILED,
diff --git a/mm/vmstat.c b/mm/vmstat.c
index 292b1cf..bd59782 100644
--- a/mm/vmstat.c
+++ b/mm/vmstat.c
@@ -818,6 +818,8 @@ const char * const vmstat_text[] = {
 	"thp_fault_fallback",
 	"thp_collapse_alloc",
 	"thp_collapse_alloc_failed",
+	"thp_write_alloc",
+	"thp_write_failed",
 	"thp_split",
 	"thp_zero_page_alloc",
 	"thp_zero_page_alloc_failed",
-- 
1.7.10.4


^ permalink raw reply related	[flat|nested] 243+ messages in thread

* [PATCHv2, RFC 13/30] thp, mm: implement grab_cache_huge_page_write_begin()
  2013-03-14 17:50 ` Kirill A. Shutemov
@ 2013-03-14 17:50   ` Kirill A. Shutemov
  -1 siblings, 0 replies; 243+ messages in thread
From: Kirill A. Shutemov @ 2013-03-14 17:50 UTC (permalink / raw)
  To: Andrea Arcangeli, Andrew Morton, Al Viro, Hugh Dickins
  Cc: Wu Fengguang, Jan Kara, Mel Gorman, linux-mm, Andi Kleen,
	Matthew Wilcox, Kirill A. Shutemov, Hillf Danton, linux-fsdevel,
	linux-kernel, Kirill A. Shutemov

From: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>

The function is a twin of grab_cache_page_write_begin(), but it tries to
allocate a huge page at the given position, aligned to HPAGE_CACHE_NR.

If, for some reason, it's not possible to allocate a huge page at this
position, it returns NULL. The caller should take care of falling back to
small pages.
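
As a rough sketch of the intended calling convention (not part of this
patch; the libfs patch later in the series does essentially this), a
->write_begin() implementation would try the huge variant first and fall
back to the small-page path:

	struct page *page = NULL;

	if (mapping_can_have_hugepages(mapping))
		page = grab_cache_huge_page_write_begin(mapping,
				index & ~HPAGE_CACHE_INDEX_MASK, flags);
	if (!page)	/* no huge page available: fall back */
		page = grab_cache_page_write_begin(mapping, index, flags);
	if (!page)
		return -ENOMEM;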

Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
---
 include/linux/pagemap.h |   10 ++++++++
 mm/filemap.c            |   58 +++++++++++++++++++++++++++++++++++++++++++++++
 2 files changed, 68 insertions(+)

diff --git a/include/linux/pagemap.h b/include/linux/pagemap.h
index 408c4e3..c87ed7c 100644
--- a/include/linux/pagemap.h
+++ b/include/linux/pagemap.h
@@ -270,6 +270,16 @@ unsigned find_get_pages_tag(struct address_space *mapping, pgoff_t *index,
 
 struct page *grab_cache_page_write_begin(struct address_space *mapping,
 			pgoff_t index, unsigned flags);
+#ifdef CONFIG_TRANSPARENT_HUGEPAGE
+struct page *grab_cache_huge_page_write_begin(struct address_space *mapping,
+			pgoff_t index, unsigned flags);
+#else
+static inline struct page *grab_cache_huge_page_write_begin(
+		struct address_space *mapping, pgoff_t index, unsigned flags)
+{
+	return NULL;
+}
+#endif
 
 /*
  * Returns locked page at given index in given cache, creating it if needed.
diff --git a/mm/filemap.c b/mm/filemap.c
index 38fdc92..bdedb1b 100644
--- a/mm/filemap.c
+++ b/mm/filemap.c
@@ -2332,6 +2332,64 @@ found:
 }
 EXPORT_SYMBOL(grab_cache_page_write_begin);
 
+#ifdef CONFIG_TRANSPARENT_HUGEPAGE
+/*
+ * Find or create a huge page at the given pagecache position, aligned to
+ * HPAGE_CACHE_NR. Return the locked huge page.
+ *
+ * If, for some reason, it's not possible to allocate a huge page at this
+ * position, it returns NULL. The caller should take care of falling back
+ * to small pages.
+ *
+ * This function is specifically for buffered writes.
+ */
+struct page *grab_cache_huge_page_write_begin(struct address_space *mapping,
+		pgoff_t index, unsigned flags)
+{
+	int status;
+	gfp_t gfp_mask;
+	struct page *page;
+	gfp_t gfp_notmask = 0;
+
+	BUG_ON(index & HPAGE_CACHE_INDEX_MASK);
+	gfp_mask = mapping_gfp_mask(mapping);
+	BUG_ON(!(gfp_mask & __GFP_COMP));
+	if (mapping_cap_account_dirty(mapping))
+		gfp_mask |= __GFP_WRITE;
+	if (flags & AOP_FLAG_NOFS)
+		gfp_notmask = __GFP_FS;
+repeat:
+	page = find_lock_page(mapping, index);
+	if (page) {
+		if (!PageTransHuge(page)) {
+			unlock_page(page);
+			page_cache_release(page);
+			return NULL;
+		}
+		goto found;
+	}
+
+	page = alloc_pages(gfp_mask & ~gfp_notmask, HPAGE_PMD_ORDER);
+	if (!page) {
+		count_vm_event(THP_WRITE_FAILED);
+		return NULL;
+	}
+
+	count_vm_event(THP_WRITE_ALLOC);
+	status = add_to_page_cache_lru(page, mapping, index,
+			GFP_KERNEL & ~gfp_notmask);
+	if (unlikely(status)) {
+		page_cache_release(page);
+		if (status == -EEXIST)
+			goto repeat;
+		return NULL;
+	}
+found:
+	wait_on_page_writeback(page);
+	return page;
+}
+#endif
+
 static ssize_t generic_perform_write(struct file *file,
 				struct iov_iter *i, loff_t pos)
 {
-- 
1.7.10.4


^ permalink raw reply related	[flat|nested] 243+ messages in thread

* [PATCHv2, RFC 14/30] thp, mm: naive support of thp in generic read/write routines
  2013-03-14 17:50 ` Kirill A. Shutemov
@ 2013-03-14 17:50   ` Kirill A. Shutemov
  -1 siblings, 0 replies; 243+ messages in thread
From: Kirill A. Shutemov @ 2013-03-14 17:50 UTC (permalink / raw)
  To: Andrea Arcangeli, Andrew Morton, Al Viro, Hugh Dickins
  Cc: Wu Fengguang, Jan Kara, Mel Gorman, linux-mm, Andi Kleen,
	Matthew Wilcox, Kirill A. Shutemov, Hillf Danton, linux-fsdevel,
	linux-kernel, Kirill A. Shutemov

From: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>

For now we still write/read at most PAGE_CACHE_SIZE bytes at a time.

This implementation doesn't cover address spaces with backing store.
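
To make the offset arithmetic in the hunks below easier to follow, here
it is in isolation (illustrative values only; assumes 4 KiB base pages
and 2 MiB huge pages):

	/* pos = 0x201234 falls into the huge page covering 0x200000..0x3fffff */
	offset  = pos & ~HPAGE_PMD_MASK;		/* 0x1234: offset within the huge page */
	subpage = page + (offset >> PAGE_CACHE_SHIFT);	/* head page + 1: the second 4 KiB subpage */
	suboff  = offset & ~PAGE_CACHE_MASK;		/* 0x234: offset within that subpage */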

Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
---
 mm/filemap.c |   35 ++++++++++++++++++++++++++++++-----
 1 file changed, 30 insertions(+), 5 deletions(-)

diff --git a/mm/filemap.c b/mm/filemap.c
index bdedb1b..79ba9cd 100644
--- a/mm/filemap.c
+++ b/mm/filemap.c
@@ -1165,12 +1165,23 @@ find_page:
 			if (unlikely(page == NULL))
 				goto no_cached_page;
 		}
+		if (PageTransTail(page)) {
+			page_cache_release(page);
+			page = find_get_page(mapping,
+					index & ~HPAGE_CACHE_INDEX_MASK);
+			if (!PageTransHuge(page)) {
+				page_cache_release(page);
+				goto find_page;
+			}
+		}
 		if (PageReadahead(page)) {
+			BUG_ON(PageTransHuge(page));
 			page_cache_async_readahead(mapping,
 					ra, filp, page,
 					index, last_index - index);
 		}
 		if (!PageUptodate(page)) {
+			BUG_ON(PageTransHuge(page));
 			if (inode->i_blkbits == PAGE_CACHE_SHIFT ||
 					!mapping->a_ops->is_partially_uptodate)
 				goto page_not_up_to_date;
@@ -1212,18 +1223,25 @@ page_ok:
 		}
 		nr = nr - offset;
 
+		/* Recalculate offset in page if we've got a huge page */
+		if (PageTransHuge(page)) {
+			offset = (((loff_t)index << PAGE_CACHE_SHIFT) + offset);
+			offset &= ~HPAGE_PMD_MASK;
+		}
+
 		/* If users can be writing to this page using arbitrary
 		 * virtual addresses, take care about potential aliasing
 		 * before reading the page on the kernel side.
 		 */
 		if (mapping_writably_mapped(mapping))
-			flush_dcache_page(page);
+			flush_dcache_page(page + (offset >> PAGE_CACHE_SHIFT));
 
 		/*
 		 * When a sequential read accesses a page several times,
 		 * only mark it as accessed the first time.
 		 */
-		if (prev_index != index || offset != prev_offset)
+		if (prev_index != index ||
+				(offset & ~PAGE_CACHE_MASK) != prev_offset)
 			mark_page_accessed(page);
 		prev_index = index;
 
@@ -1238,8 +1256,9 @@ page_ok:
 		 * "pos" here (the actor routine has to update the user buffer
 		 * pointers and the remaining count).
 		 */
-		ret = file_read_actor(desc, page, offset, nr);
-		offset += ret;
+		ret = file_read_actor(desc, page + (offset >> PAGE_CACHE_SHIFT),
+				offset & ~PAGE_CACHE_MASK, nr);
+		offset =  (offset & ~PAGE_CACHE_MASK) + ret;
 		index += offset >> PAGE_CACHE_SHIFT;
 		offset &= ~PAGE_CACHE_MASK;
 		prev_offset = offset;
@@ -2440,8 +2459,13 @@ again:
 		if (mapping_writably_mapped(mapping))
 			flush_dcache_page(page);
 
+		if (PageTransHuge(page))
+			offset = pos & ~HPAGE_PMD_MASK;
+
 		pagefault_disable();
-		copied = iov_iter_copy_from_user_atomic(page, i, offset, bytes);
+		copied = iov_iter_copy_from_user_atomic(
+				page + (offset >> PAGE_CACHE_SHIFT),
+				i, offset & ~PAGE_CACHE_MASK, bytes);
 		pagefault_enable();
 		flush_dcache_page(page);
 
@@ -2464,6 +2488,7 @@ again:
 			 * because not all segments in the iov can be copied at
 			 * once without a pagefault.
 			 */
+			offset = pos & ~PAGE_CACHE_MASK;
 			bytes = min_t(unsigned long, PAGE_CACHE_SIZE - offset,
 						iov_iter_single_seg_count(i));
 			goto again;
-- 
1.7.10.4


^ permalink raw reply related	[flat|nested] 243+ messages in thread

* [PATCHv2, RFC 15/30] thp, libfs: initial support of thp in simple_read/write_begin/write_end
  2013-03-14 17:50 ` Kirill A. Shutemov
@ 2013-03-14 17:50   ` Kirill A. Shutemov
  -1 siblings, 0 replies; 243+ messages in thread
From: Kirill A. Shutemov @ 2013-03-14 17:50 UTC (permalink / raw)
  To: Andrea Arcangeli, Andrew Morton, Al Viro, Hugh Dickins
  Cc: Wu Fengguang, Jan Kara, Mel Gorman, linux-mm, Andi Kleen,
	Matthew Wilcox, Kirill A. Shutemov, Hillf Danton, linux-fsdevel,
	linux-kernel, Kirill A. Shutemov

From: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>

For now we try to grab a huge cache page if gfp_mask has __GFP_COMP set.
It's probably too weak a condition and needs to be reworked later.

Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
---
 fs/libfs.c |   50 +++++++++++++++++++++++++++++++++++++++-----------
 1 file changed, 39 insertions(+), 11 deletions(-)

diff --git a/fs/libfs.c b/fs/libfs.c
index 916da8c..6edfa9d 100644
--- a/fs/libfs.c
+++ b/fs/libfs.c
@@ -383,7 +383,10 @@ EXPORT_SYMBOL(simple_setattr);
 
 int simple_readpage(struct file *file, struct page *page)
 {
-	clear_highpage(page);
+	if (PageTransHuge(page))
+		zero_huge_user(page, 0, HPAGE_PMD_SIZE);
+	else
+		clear_highpage(page);
 	flush_dcache_page(page);
 	SetPageUptodate(page);
 	unlock_page(page);
@@ -394,21 +397,41 @@ int simple_write_begin(struct file *file, struct address_space *mapping,
 			loff_t pos, unsigned len, unsigned flags,
 			struct page **pagep, void **fsdata)
 {
-	struct page *page;
+	struct page *page = NULL;
 	pgoff_t index;
 
 	index = pos >> PAGE_CACHE_SHIFT;
 
-	page = grab_cache_page_write_begin(mapping, index, flags);
+	/* XXX: too weak condition. Good enough for initial testing */
+	if (mapping_can_have_hugepages(mapping)) {
+		page = grab_cache_huge_page_write_begin(mapping,
+				index & ~HPAGE_CACHE_INDEX_MASK, flags);
+		/* fallback to small page */
+		if (!page || !PageTransHuge(page)) {
+			unsigned long offset;
+			offset = pos & ~PAGE_CACHE_MASK;
+			len = min_t(unsigned long,
+					len, PAGE_CACHE_SIZE - offset);
+		}
+	}
+	if (!page)
+		page = grab_cache_page_write_begin(mapping, index, flags);
 	if (!page)
 		return -ENOMEM;
-
 	*pagep = page;
 
-	if (!PageUptodate(page) && (len != PAGE_CACHE_SIZE)) {
-		unsigned from = pos & (PAGE_CACHE_SIZE - 1);
-
-		zero_user_segments(page, 0, from, from + len, PAGE_CACHE_SIZE);
+	if (!PageUptodate(page)) {
+		unsigned from;
+
+		if (PageTransHuge(page) && len != HPAGE_PMD_SIZE) {
+			from = pos & ~HPAGE_PMD_MASK;
+			zero_huge_user_segments(page, 0, from,
+					from + len, HPAGE_PMD_SIZE);
+		} else if (len != PAGE_CACHE_SIZE) {
+			from = pos & ~PAGE_CACHE_MASK;
+			zero_user_segments(page, 0, from,
+					from + len, PAGE_CACHE_SIZE);
+		}
 	}
 	return 0;
 }
@@ -443,9 +466,14 @@ int simple_write_end(struct file *file, struct address_space *mapping,
 
 	/* zero the stale part of the page if we did a short copy */
 	if (copied < len) {
-		unsigned from = pos & (PAGE_CACHE_SIZE - 1);
-
-		zero_user(page, from + copied, len - copied);
+		unsigned from;
+		if (PageTransHuge(page)) {
+			from = pos & ~HPAGE_PMD_MASK;
+			zero_huge_user(page, from + copied, len - copied);
+		} else {
+			from = pos & ~PAGE_CACHE_MASK;
+			zero_user(page, from + copied, len - copied);
+		}
 	}
 
 	if (!PageUptodate(page))
-- 
1.7.10.4


^ permalink raw reply related	[flat|nested] 243+ messages in thread

* [PATCHv2, RFC 16/30] thp: handle file pages in split_huge_page()
  2013-03-14 17:50 ` Kirill A. Shutemov
@ 2013-03-14 17:50   ` Kirill A. Shutemov
  -1 siblings, 0 replies; 243+ messages in thread
From: Kirill A. Shutemov @ 2013-03-14 17:50 UTC (permalink / raw)
  To: Andrea Arcangeli, Andrew Morton, Al Viro, Hugh Dickins
  Cc: Wu Fengguang, Jan Kara, Mel Gorman, linux-mm, Andi Kleen,
	Matthew Wilcox, Kirill A. Shutemov, Hillf Danton, linux-fsdevel,
	linux-kernel, Kirill A. Shutemov

From: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>

The base scheme is the same as for anonymous pages, but we walk
mapping->i_mmap rather than anon_vma->rb_root.

__split_huge_page_refcount() has been tuned a bit: we need to transfer
PG_swapbacked to tail pages.

Splitting of mapped pages hasn't been tested at all, since we cannot
mmap() file-backed huge pages yet.

Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
---
 mm/huge_memory.c |   63 ++++++++++++++++++++++++++++++++++++++++++++++--------
 1 file changed, 54 insertions(+), 9 deletions(-)

diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index e2f7f5aa..eb777d3 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -1613,7 +1613,8 @@ static void __split_huge_page_refcount(struct page *page)
 				     ((1L << PG_referenced) |
 				      (1L << PG_swapbacked) |
 				      (1L << PG_mlocked) |
-				      (1L << PG_uptodate)));
+				      (1L << PG_uptodate) |
+				      (1L << PG_swapbacked)));
 		page_tail->flags |= (1L << PG_dirty);
 
 		/* clear PageTail before overwriting first_page */
@@ -1641,10 +1642,8 @@ static void __split_huge_page_refcount(struct page *page)
 		page_tail->index = page->index + i;
 		page_nid_xchg_last(page_tail, page_nid_last(page));
 
-		BUG_ON(!PageAnon(page_tail));
 		BUG_ON(!PageUptodate(page_tail));
 		BUG_ON(!PageDirty(page_tail));
-		BUG_ON(!PageSwapBacked(page_tail));
 
 		lru_add_page_tail(page, page_tail, lruvec);
 	}
@@ -1752,7 +1751,7 @@ static int __split_huge_page_map(struct page *page,
 }
 
 /* must be called with anon_vma->root->rwsem held */
-static void __split_huge_page(struct page *page,
+static void __split_anon_huge_page(struct page *page,
 			      struct anon_vma *anon_vma)
 {
 	int mapcount, mapcount2;
@@ -1799,14 +1798,11 @@ static void __split_huge_page(struct page *page,
 	BUG_ON(mapcount != mapcount2);
 }
 
-int split_huge_page(struct page *page)
+static int split_anon_huge_page(struct page *page)
 {
 	struct anon_vma *anon_vma;
 	int ret = 1;
 
-	BUG_ON(is_huge_zero_pfn(page_to_pfn(page)));
-	BUG_ON(!PageAnon(page));
-
 	/*
 	 * The caller does not necessarily hold an mmap_sem that would prevent
 	 * the anon_vma disappearing so we first we take a reference to it
@@ -1824,7 +1820,7 @@ int split_huge_page(struct page *page)
 		goto out_unlock;
 
 	BUG_ON(!PageSwapBacked(page));
-	__split_huge_page(page, anon_vma);
+	__split_anon_huge_page(page, anon_vma);
 	count_vm_event(THP_SPLIT);
 
 	BUG_ON(PageCompound(page));
@@ -1835,6 +1831,55 @@ out:
 	return ret;
 }
 
+static int split_file_huge_page(struct page *page)
+{
+	struct address_space *mapping = page->mapping;
+	pgoff_t pgoff = page->index << (PAGE_CACHE_SHIFT - PAGE_SHIFT);
+	struct vm_area_struct *vma;
+	int mapcount, mapcount2;
+
+	BUG_ON(!PageHead(page));
+	BUG_ON(PageTail(page));
+
+	mutex_lock(&mapping->i_mmap_mutex);
+	mapcount = 0;
+	vma_interval_tree_foreach(vma, &mapping->i_mmap, pgoff, pgoff) {
+		unsigned long addr = vma_address(page, vma);
+		mapcount += __split_huge_page_splitting(page, vma, addr);
+	}
+
+	if (mapcount != page_mapcount(page))
+		printk(KERN_ERR "mapcount %d page_mapcount %d\n",
+		       mapcount, page_mapcount(page));
+	BUG_ON(mapcount != page_mapcount(page));
+
+	__split_huge_page_refcount(page);
+
+	mapcount2 = 0;
+	vma_interval_tree_foreach(vma, &mapping->i_mmap, pgoff, pgoff) {
+		unsigned long addr = vma_address(page, vma);
+		mapcount2 += __split_huge_page_map(page, vma, addr);
+	}
+
+	if (mapcount != mapcount2)
+		printk(KERN_ERR "mapcount %d mapcount2 %d page_mapcount %d\n",
+		       mapcount, mapcount2, page_mapcount(page));
+	BUG_ON(mapcount != mapcount2);
+	count_vm_event(THP_SPLIT);
+	mutex_unlock(&mapping->i_mmap_mutex);
+	return 0;
+}
+
+int split_huge_page(struct page *page)
+{
+	BUG_ON(is_huge_zero_pfn(page_to_pfn(page)));
+
+	if (PageAnon(page))
+		return split_anon_huge_page(page);
+	else
+		return split_file_huge_page(page);
+}
+
 #define VM_NO_THP (VM_SPECIAL|VM_MIXEDMAP|VM_HUGETLB|VM_SHARED|VM_MAYSHARE)
 
 int hugepage_madvise(struct vm_area_struct *vma,
-- 
1.7.10.4


^ permalink raw reply related	[flat|nested] 243+ messages in thread

* [PATCHv2, RFC 17/30] thp: wait_split_huge_page(): serialize over i_mmap_mutex too
  2013-03-14 17:50 ` Kirill A. Shutemov
@ 2013-03-14 17:50   ` Kirill A. Shutemov
  -1 siblings, 0 replies; 243+ messages in thread
From: Kirill A. Shutemov @ 2013-03-14 17:50 UTC (permalink / raw)
  To: Andrea Arcangeli, Andrew Morton, Al Viro, Hugh Dickins
  Cc: Wu Fengguang, Jan Kara, Mel Gorman, linux-mm, Andi Kleen,
	Matthew Wilcox, Kirill A. Shutemov, Hillf Danton, linux-fsdevel,
	linux-kernel, Kirill A. Shutemov

From: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>

Since we're going to have huge pages backed by files,
wait_split_huge_page() has to serialize not only over anon_vma_lock,
but over i_mmap_mutex too.
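
A minimal usage sketch (this mirrors the callers updated below): drop the
page table lock, wait for the splitter to finish, then retry:

	if (unlikely(pmd_trans_splitting(*pmd))) {
		spin_unlock(&vma->vm_mm->page_table_lock);
		/* blocks on i_mmap_mutex and/or the anon_vma lock until
		 * the split of this pmd has completed */
		wait_split_huge_page(vma, pmd);
		/* the pmd is neither huge nor splitting when we get here */
	}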

Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
---
 include/linux/huge_mm.h |   15 ++++++++++++---
 mm/huge_memory.c        |    4 ++--
 mm/memory.c             |    4 ++--
 3 files changed, 16 insertions(+), 7 deletions(-)

diff --git a/include/linux/huge_mm.h b/include/linux/huge_mm.h
index a54939c..b53e295 100644
--- a/include/linux/huge_mm.h
+++ b/include/linux/huge_mm.h
@@ -113,11 +113,20 @@ extern void __split_huge_page_pmd(struct vm_area_struct *vma,
 			__split_huge_page_pmd(__vma, __address,		\
 					____pmd);			\
 	}  while (0)
-#define wait_split_huge_page(__anon_vma, __pmd)				\
+#define wait_split_huge_page(__vma, __pmd)				\
 	do {								\
 		pmd_t *____pmd = (__pmd);				\
-		anon_vma_lock_write(__anon_vma);			\
-		anon_vma_unlock_write(__anon_vma);			\
+		struct address_space *__mapping =			\
+					(__vma)->vm_file->f_mapping;	\
+		struct anon_vma *__anon_vma = (__vma)->anon_vma;	\
+		if (__mapping)						\
+			mutex_lock(&__mapping->i_mmap_mutex);		\
+		if (__anon_vma) {					\
+			anon_vma_lock_write(__anon_vma);		\
+			anon_vma_unlock_write(__anon_vma);		\
+		}							\
+		if (__mapping)						\
+			mutex_unlock(&__mapping->i_mmap_mutex);		\
 		BUG_ON(pmd_trans_splitting(*____pmd) ||			\
 		       pmd_trans_huge(*____pmd));			\
 	} while (0)
diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index eb777d3..a23da8b 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -907,7 +907,7 @@ int copy_huge_pmd(struct mm_struct *dst_mm, struct mm_struct *src_mm,
 		spin_unlock(&dst_mm->page_table_lock);
 		pte_free(dst_mm, pgtable);
 
-		wait_split_huge_page(vma->anon_vma, src_pmd); /* src_vma */
+		wait_split_huge_page(vma, src_pmd); /* src_vma */
 		goto out;
 	}
 	src_page = pmd_page(pmd);
@@ -1480,7 +1480,7 @@ int __pmd_trans_huge_lock(pmd_t *pmd, struct vm_area_struct *vma)
 	if (likely(pmd_trans_huge(*pmd))) {
 		if (unlikely(pmd_trans_splitting(*pmd))) {
 			spin_unlock(&vma->vm_mm->page_table_lock);
-			wait_split_huge_page(vma->anon_vma, pmd);
+			wait_split_huge_page(vma, pmd);
 			return -1;
 		} else {
 			/* Thp mapped by 'pmd' is stable, so we can
diff --git a/mm/memory.c b/mm/memory.c
index 98c25dd..52bd6cf 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -619,7 +619,7 @@ int __pte_alloc(struct mm_struct *mm, struct vm_area_struct *vma,
 	if (new)
 		pte_free(mm, new);
 	if (wait_split_huge_page)
-		wait_split_huge_page(vma->anon_vma, pmd);
+		wait_split_huge_page(vma, pmd);
 	return 0;
 }
 
@@ -1529,7 +1529,7 @@ struct page *follow_page_mask(struct vm_area_struct *vma,
 		if (likely(pmd_trans_huge(*pmd))) {
 			if (unlikely(pmd_trans_splitting(*pmd))) {
 				spin_unlock(&mm->page_table_lock);
-				wait_split_huge_page(vma->anon_vma, pmd);
+				wait_split_huge_page(vma, pmd);
 			} else {
 				page = follow_trans_huge_pmd(vma, address,
 							     pmd, flags);
-- 
1.7.10.4


^ permalink raw reply related	[flat|nested] 243+ messages in thread

* [PATCHv2, RFC 18/30] thp, mm: truncate support for transparent huge page cache
  2013-03-14 17:50 ` Kirill A. Shutemov
@ 2013-03-14 17:50   ` Kirill A. Shutemov
  -1 siblings, 0 replies; 243+ messages in thread
From: Kirill A. Shutemov @ 2013-03-14 17:50 UTC (permalink / raw)
  To: Andrea Arcangeli, Andrew Morton, Al Viro, Hugh Dickins
  Cc: Wu Fengguang, Jan Kara, Mel Gorman, linux-mm, Andi Kleen,
	Matthew Wilcox, Kirill A. Shutemov, Hillf Danton, linux-fsdevel,
	linux-kernel, Kirill A. Shutemov

From: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>

If the starting position of the truncation is within a tail page, we have
to split the huge page first.

We also have to split if the end is within the huge page. Otherwise we can
truncate the whole huge page at once.
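
A worked example, assuming HPAGE_CACHE_NR == 512 and a huge page occupying
page-cache indices 512..1023: truncating from index 520 starts inside a
tail page (520 & ~HPAGE_CACHE_INDEX_MASK == 512 != 520), so the huge page
is split first; a range that ends at index 700 ends inside the huge page
(700 & ~HPAGE_CACHE_INDEX_MASK == 512, the head's index), so it is split
as well; if the whole huge page lies inside the truncated range, the loop
simply skips the 511 tail indices and the huge page is dropped in one go.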

Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
---
 mm/truncate.c |   12 ++++++++++++
 1 file changed, 12 insertions(+)

diff --git a/mm/truncate.c b/mm/truncate.c
index c75b736..87c247d 100644
--- a/mm/truncate.c
+++ b/mm/truncate.c
@@ -231,6 +231,17 @@ void truncate_inode_pages_range(struct address_space *mapping,
 			if (index > end)
 				break;
 
+			/* split page if we start from tail page */
+			if (PageTransTail(page))
+				split_huge_page(compound_trans_head(page));
+			if (PageTransHuge(page)) {
+				/* split if end is within huge page */
+				if (index == (end & ~HPAGE_CACHE_INDEX_MASK))
+					split_huge_page(page);
+				else
+					/* skip tail pages */
+					i += HPAGE_CACHE_NR - 1;
+			}
 			if (!trylock_page(page))
 				continue;
 			WARN_ON(page->index != index);
@@ -280,6 +291,7 @@ void truncate_inode_pages_range(struct address_space *mapping,
 			if (index > end)
 				break;
 
+			VM_BUG_ON(PageTransHuge(page));
 			lock_page(page);
 			WARN_ON(page->index != index);
 			wait_on_page_writeback(page);
-- 
1.7.10.4


^ permalink raw reply related	[flat|nested] 243+ messages in thread

* [PATCHv2, RFC 19/30] thp, mm: split huge page on mmap file page
  2013-03-14 17:50 ` Kirill A. Shutemov
@ 2013-03-14 17:50   ` Kirill A. Shutemov
  -1 siblings, 0 replies; 243+ messages in thread
From: Kirill A. Shutemov @ 2013-03-14 17:50 UTC (permalink / raw)
  To: Andrea Arcangeli, Andrew Morton, Al Viro, Hugh Dickins
  Cc: Wu Fengguang, Jan Kara, Mel Gorman, linux-mm, Andi Kleen,
	Matthew Wilcox, Kirill A. Shutemov, Hillf Danton, linux-fsdevel,
	linux-kernel, Kirill A. Shutemov

From: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>

We are not ready to mmap file-backed transparent huge pages yet. Let's
split them on any mmap() attempt.

Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
---
 mm/filemap.c |    2 ++
 1 file changed, 2 insertions(+)

diff --git a/mm/filemap.c b/mm/filemap.c
index 79ba9cd..57611be 100644
--- a/mm/filemap.c
+++ b/mm/filemap.c
@@ -1696,6 +1696,8 @@ retry_find:
 			goto no_cached_page;
 	}
 
+	if (PageTransCompound(page))
+		split_huge_page(compound_trans_head(page));
 	if (!lock_page_or_retry(page, vma->vm_mm, vmf->flags)) {
 		page_cache_release(page);
 		return ret | VM_FAULT_RETRY;
-- 
1.7.10.4


^ permalink raw reply related	[flat|nested] 243+ messages in thread

* [PATCHv2, RFC 20/30] ramfs: enable transparent huge page cache
  2013-03-14 17:50 ` Kirill A. Shutemov
@ 2013-03-14 17:50   ` Kirill A. Shutemov
  -1 siblings, 0 replies; 243+ messages in thread
From: Kirill A. Shutemov @ 2013-03-14 17:50 UTC (permalink / raw)
  To: Andrea Arcangeli, Andrew Morton, Al Viro, Hugh Dickins
  Cc: Wu Fengguang, Jan Kara, Mel Gorman, linux-mm, Andi Kleen,
	Matthew Wilcox, Kirill A. Shutemov, Hillf Danton, linux-fsdevel,
	linux-kernel, Kirill A. Shutemov

From: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>

ramfs is the simplest filesystem from the page cache point of view, so
let's start enabling transparent huge page cache here.

For now we allocate only non-movable huge pages. It is not yet clear
whether movable pages are safe here and what needs to be done to make
them safe.
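
For reference, a rough expansion of the new mask (a sketch; the exact
definition of GFP_TRANSHUGE in the base kernel may differ slightly):

	/*
	 * GFP_TRANSHUGE is GFP_HIGHUSER_MOVABLE plus __GFP_COMP and the THP
	 * allocation-policy bits (no retry, no kswapd wakeup, ...).  Clearing
	 * __GFP_MOVABLE keeps that behaviour but steers the huge pages away
	 * from ZONE_MOVABLE and movable pageblocks, so they never need to be
	 * migrated.
	 */
	gfp_t huge_gfp = GFP_TRANSHUGE & ~__GFP_MOVABLE;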

Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
---
 fs/ramfs/inode.c |    6 +++++-
 1 file changed, 5 insertions(+), 1 deletion(-)

diff --git a/fs/ramfs/inode.c b/fs/ramfs/inode.c
index c24f1e1..da30b4f 100644
--- a/fs/ramfs/inode.c
+++ b/fs/ramfs/inode.c
@@ -61,7 +61,11 @@ struct inode *ramfs_get_inode(struct super_block *sb,
 		inode_init_owner(inode, dir, mode);
 		inode->i_mapping->a_ops = &ramfs_aops;
 		inode->i_mapping->backing_dev_info = &ramfs_backing_dev_info;
-		mapping_set_gfp_mask(inode->i_mapping, GFP_HIGHUSER);
+		/*
+		 * TODO: what should be done to make movable safe?
+		 */
+		mapping_set_gfp_mask(inode->i_mapping,
+				GFP_TRANSHUGE & ~__GFP_MOVABLE);
 		mapping_set_unevictable(inode->i_mapping);
 		inode->i_atime = inode->i_mtime = inode->i_ctime = CURRENT_TIME;
 		switch (mode & S_IFMT) {
-- 
1.7.10.4


^ permalink raw reply related	[flat|nested] 243+ messages in thread

* [PATCHv2, RFC 21/30] x86-64, mm: proper alignment mappings with hugepages
  2013-03-14 17:50 ` Kirill A. Shutemov
@ 2013-03-14 17:50   ` Kirill A. Shutemov
  -1 siblings, 0 replies; 243+ messages in thread
From: Kirill A. Shutemov @ 2013-03-14 17:50 UTC (permalink / raw)
  To: Andrea Arcangeli, Andrew Morton, Al Viro, Hugh Dickins
  Cc: Wu Fengguang, Jan Kara, Mel Gorman, linux-mm, Andi Kleen,
	Matthew Wilcox, Kirill A. Shutemov, Hillf Danton, linux-fsdevel,
	linux-kernel, Kirill A. Shutemov

From: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>

Make arch_get_unmapped_area() return an unmapped area aligned on a huge
page boundary if the file mapping can have huge pages.
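
For illustration, on x86-64 with 4k base pages and 2M huge pages the new
align_mask works out as follows (a user-space sketch, not part of the
patch):

	#include <stdio.h>

	int main(void)
	{
		unsigned long page_mask  = ~((1UL << 12) - 1);	/* PAGE_MASK,  4k */
		unsigned long hpage_mask = ~((1UL << 21) - 1);	/* HPAGE_MASK, 2M */

		/*
		 * Bits 12..20: the offset within a huge page, in units of base
		 * pages.  vm_unmapped_area() picks an address whose bits in
		 * align_mask match align_offset (pgoff << PAGE_SHIFT), so a
		 * huge-page-aligned file offset ends up at a huge-page-aligned
		 * virtual address.
		 */
		printf("align_mask = %#lx\n", page_mask & ~hpage_mask); /* 0x1ff000 */
		return 0;
	}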

Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
---
 arch/x86/kernel/sys_x86_64.c |   13 +++++++++++--
 1 file changed, 11 insertions(+), 2 deletions(-)

diff --git a/arch/x86/kernel/sys_x86_64.c b/arch/x86/kernel/sys_x86_64.c
index dbded5a..4e20cd5 100644
--- a/arch/x86/kernel/sys_x86_64.c
+++ b/arch/x86/kernel/sys_x86_64.c
@@ -15,6 +15,7 @@
 #include <linux/random.h>
 #include <linux/uaccess.h>
 #include <linux/elf.h>
+#include <linux/pagemap.h>
 
 #include <asm/ia32.h>
 #include <asm/syscalls.h>
@@ -135,7 +136,11 @@ arch_get_unmapped_area(struct file *filp, unsigned long addr,
 	info.length = len;
 	info.low_limit = begin;
 	info.high_limit = end;
-	info.align_mask = filp ? get_align_mask() : 0;
+	if (filp)
+		info.align_mask = mapping_can_have_hugepages(filp->f_mapping) ?
+			PAGE_MASK & ~HPAGE_MASK : get_align_mask();
+	else
+		info.align_mask = 0;
 	info.align_offset = pgoff << PAGE_SHIFT;
 	return vm_unmapped_area(&info);
 }
@@ -174,7 +179,11 @@ arch_get_unmapped_area_topdown(struct file *filp, const unsigned long addr0,
 	info.length = len;
 	info.low_limit = PAGE_SIZE;
 	info.high_limit = mm->mmap_base;
-	info.align_mask = filp ? get_align_mask() : 0;
+	if (filp)
+		info.align_mask = mapping_can_have_hugepages(filp->f_mapping) ?
+			PAGE_MASK & ~HPAGE_MASK : get_align_mask();
+	else
+		info.align_mask = 0;
 	info.align_offset = pgoff << PAGE_SHIFT;
 	addr = vm_unmapped_area(&info);
 	if (!(addr & ~PAGE_MASK))
-- 
1.7.10.4


^ permalink raw reply related	[flat|nested] 243+ messages in thread

* [PATCHv2, RFC 22/30] mm: add huge_fault() callback to vm_operations_struct
  2013-03-14 17:50 ` Kirill A. Shutemov
@ 2013-03-14 17:50   ` Kirill A. Shutemov
  -1 siblings, 0 replies; 243+ messages in thread
From: Kirill A. Shutemov @ 2013-03-14 17:50 UTC (permalink / raw)
  To: Andrea Arcangeli, Andrew Morton, Al Viro, Hugh Dickins
  Cc: Wu Fengguang, Jan Kara, Mel Gorman, linux-mm, Andi Kleen,
	Matthew Wilcox, Kirill A. Shutemov, Hillf Danton, linux-fsdevel,
	linux-kernel, Kirill A. Shutemov

From: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>

huge_fault() should try to set up a huge page for the given pgoff, if
possible. A VM_FAULT_OOM return code means the caller needs to fall
back to small pages.
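
A sketch of how a filesystem's vm_operations are expected to wire this up
(the actual hookup for generic_file_vm_ops comes later in the series; the
huge_fault implementation name here is only illustrative):

	static const struct vm_operations_struct example_file_vm_ops = {
		.fault		= filemap_fault,
		/* return VM_FAULT_OOM from huge_fault to fall back to .fault */
		.huge_fault	= example_filemap_huge_fault,
		.page_mkwrite	= filemap_page_mkwrite,
	};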

Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
---
 include/linux/mm.h |    1 +
 1 file changed, 1 insertion(+)

diff --git a/include/linux/mm.h b/include/linux/mm.h
index df83ab9..5456294 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -195,6 +195,7 @@ struct vm_operations_struct {
 	void (*open)(struct vm_area_struct * area);
 	void (*close)(struct vm_area_struct * area);
 	int (*fault)(struct vm_area_struct *vma, struct vm_fault *vmf);
+	int (*huge_fault)(struct vm_area_struct *vma, struct vm_fault *vmf);
 
 	/* notification that a previously read-only page is about to become
 	 * writable, if an error is returned it will cause a SIGBUS */
-- 
1.7.10.4


^ permalink raw reply related	[flat|nested] 243+ messages in thread

* [PATCHv2, RFC 23/30] thp: prepare zap_huge_pmd() to uncharge file pages
  2013-03-14 17:50 ` Kirill A. Shutemov
@ 2013-03-14 17:50   ` Kirill A. Shutemov
  -1 siblings, 0 replies; 243+ messages in thread
From: Kirill A. Shutemov @ 2013-03-14 17:50 UTC (permalink / raw)
  To: Andrea Arcangeli, Andrew Morton, Al Viro, Hugh Dickins
  Cc: Wu Fengguang, Jan Kara, Mel Gorman, linux-mm, Andi Kleen,
	Matthew Wilcox, Kirill A. Shutemov, Hillf Danton, linux-fsdevel,
	linux-kernel, Kirill A. Shutemov

From: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>

Uncharge pages from the correct counter: MM_ANONPAGES for anonymous
pages, MM_FILEPAGES for file-backed ones.

Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
---
 mm/huge_memory.c |    4 +++-
 1 file changed, 3 insertions(+), 1 deletion(-)

diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index a23da8b..34e0385 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -1368,10 +1368,12 @@ int zap_huge_pmd(struct mmu_gather *tlb, struct vm_area_struct *vma,
 			spin_unlock(&tlb->mm->page_table_lock);
 			put_huge_zero_page();
 		} else {
+			int counter;
 			page = pmd_page(orig_pmd);
 			page_remove_rmap(page);
 			VM_BUG_ON(page_mapcount(page) < 0);
-			add_mm_counter(tlb->mm, MM_ANONPAGES, -HPAGE_PMD_NR);
+			counter = PageAnon(page) ? MM_ANONPAGES : MM_FILEPAGES;
+			add_mm_counter(tlb->mm, counter, -HPAGE_PMD_NR);
 			VM_BUG_ON(!PageHead(page));
 			tlb->mm->nr_ptes--;
 			spin_unlock(&tlb->mm->page_table_lock);
-- 
1.7.10.4


^ permalink raw reply related	[flat|nested] 243+ messages in thread

* [PATCHv2, RFC 24/30] thp: move maybe_pmd_mkwrite() out of mk_huge_pmd()
  2013-03-14 17:50 ` Kirill A. Shutemov
@ 2013-03-14 17:50   ` Kirill A. Shutemov
  -1 siblings, 0 replies; 243+ messages in thread
From: Kirill A. Shutemov @ 2013-03-14 17:50 UTC (permalink / raw)
  To: Andrea Arcangeli, Andrew Morton, Al Viro, Hugh Dickins
  Cc: Wu Fengguang, Jan Kara, Mel Gorman, linux-mm, Andi Kleen,
	Matthew Wilcox, Kirill A. Shutemov, Hillf Danton, linux-fsdevel,
	linux-kernel, Kirill A. Shutemov

From: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>

It's confusing that mk_huge_pmd() has semantics different from mk_pte()
or mk_pmd().

Let's move maybe_pmd_mkwrite() out of mk_huge_pmd() and adjust its
prototype to match mk_pte().

Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
---
 mm/huge_memory.c |   14 ++++++++------
 1 file changed, 8 insertions(+), 6 deletions(-)

diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index 34e0385..be7b7e1 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -691,11 +691,10 @@ pmd_t maybe_pmd_mkwrite(pmd_t pmd, struct vm_area_struct *vma)
 	return pmd;
 }
 
-static inline pmd_t mk_huge_pmd(struct page *page, struct vm_area_struct *vma)
+static inline pmd_t mk_huge_pmd(struct page *page, pgprot_t prot)
 {
 	pmd_t entry;
-	entry = mk_pmd(page, vma->vm_page_prot);
-	entry = maybe_pmd_mkwrite(pmd_mkdirty(entry), vma);
+	entry = mk_pmd(page, prot);
 	entry = pmd_mkhuge(entry);
 	return entry;
 }
@@ -723,7 +722,8 @@ static int __do_huge_pmd_anonymous_page(struct mm_struct *mm,
 		pte_free(mm, pgtable);
 	} else {
 		pmd_t entry;
-		entry = mk_huge_pmd(page, vma);
+		entry = mk_huge_pmd(page, vma->vm_page_prot);
+		entry = maybe_pmd_mkwrite(pmd_mkdirty(entry), vma);
 		/*
 		 * The spinlocking to take the lru_lock inside
 		 * page_add_new_anon_rmap() acts as a full memory
@@ -1212,7 +1212,8 @@ alloc:
 		goto out_mn;
 	} else {
 		pmd_t entry;
-		entry = mk_huge_pmd(new_page, vma);
+		entry = mk_huge_pmd(new_page, vma->vm_page_prot);
+		entry = maybe_pmd_mkwrite(pmd_mkdirty(entry), vma);
 		pmdp_clear_flush(vma, haddr, pmd);
 		page_add_new_anon_rmap(new_page, vma, haddr);
 		set_pmd_at(mm, haddr, pmd, entry);
@@ -2382,7 +2383,8 @@ static void collapse_huge_page(struct mm_struct *mm,
 	__SetPageUptodate(new_page);
 	pgtable = pmd_pgtable(_pmd);
 
-	_pmd = mk_huge_pmd(new_page, vma);
+	_pmd = mk_huge_pmd(new_page, vma->vm_page_prot);
+	_pmd = maybe_pmd_mkwrite(pmd_mkdirty(_pmd), vma);
 
 	/*
 	 * spin_lock() below is not the equivalent of smp_wmb(), so
-- 
1.7.10.4


^ permalink raw reply related	[flat|nested] 243+ messages in thread

* [PATCHv2, RFC 25/30] thp, mm: basic huge_fault implementation for generic_file_vm_ops
  2013-03-14 17:50 ` Kirill A. Shutemov
@ 2013-03-14 17:50   ` Kirill A. Shutemov
  -1 siblings, 0 replies; 243+ messages in thread
From: Kirill A. Shutemov @ 2013-03-14 17:50 UTC (permalink / raw)
  To: Andrea Arcangeli, Andrew Morton, Al Viro, Hugh Dickins
  Cc: Wu Fengguang, Jan Kara, Mel Gorman, linux-mm, Andi Kleen,
	Matthew Wilcox, Kirill A. Shutemov, Hillf Danton, linux-fsdevel,
	linux-kernel, Kirill A. Shutemov

From: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>

It provides enough functionality for simple cases like ramfs. It will
need to be extended later.

Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
---
 mm/filemap.c |   75 ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
 1 file changed, 75 insertions(+)

diff --git a/mm/filemap.c b/mm/filemap.c
index 57611be..032fec39 100644
--- a/mm/filemap.c
+++ b/mm/filemap.c
@@ -1781,6 +1781,80 @@ page_not_uptodate:
 }
 EXPORT_SYMBOL(filemap_fault);
 
+#ifdef CONFIG_TRANSPARENT_HUGEPAGE
+static int filemap_huge_fault(struct vm_area_struct *vma, struct vm_fault *vmf)
+{
+	struct file *file = vma->vm_file;
+	struct address_space *mapping = file->f_mapping;
+	struct inode *inode = mapping->host;
+	pgoff_t size, offset = vmf->pgoff;
+	unsigned long address = (unsigned long) vmf->virtual_address;
+	struct page *page;
+	int ret = 0;
+
+	BUG_ON(((address >> PAGE_SHIFT) & HPAGE_CACHE_INDEX_MASK) !=
+			(offset & HPAGE_CACHE_INDEX_MASK));
+
+retry:
+	page = find_get_page(mapping, offset);
+	if (!page) {
+		gfp_t gfp_mask = mapping_gfp_mask(mapping) | __GFP_COLD;
+		page = alloc_pages(gfp_mask, HPAGE_PMD_ORDER);
+		if (!page) {
+			count_vm_event(THP_FAULT_FALLBACK);
+			return VM_FAULT_OOM;
+		}
+		count_vm_event(THP_FAULT_ALLOC);
+		ret = add_to_page_cache_lru(page, mapping, offset, GFP_KERNEL);
+		if (ret == 0)
+			ret = mapping->a_ops->readpage(file, page);
+		else if (ret == -EEXIST)
+			ret = 0; /* losing race to add is OK */
+		page_cache_release(page);
+		if (!ret || ret == AOP_TRUNCATED_PAGE)
+			goto retry;
+		return ret;
+	}
+
+	if (!lock_page_or_retry(page, vma->vm_mm, vmf->flags)) {
+		page_cache_release(page);
+		return ret | VM_FAULT_RETRY;
+	}
+
+	/* Did it get truncated? */
+	if (unlikely(page->mapping != mapping)) {
+		unlock_page(page);
+		put_page(page);
+		goto retry;
+	}
+	VM_BUG_ON(page->index != offset);
+	VM_BUG_ON(!PageUptodate(page));
+
+	if (!PageTransHuge(page)) {
+		unlock_page(page);
+		put_page(page);
+		/* Ask fallback to small pages */
+		return VM_FAULT_OOM;
+	}
+
+	/*
+	 * Found the page and have a reference on it.
+	 * We must recheck i_size under page lock.
+	 */
+	size = (i_size_read(inode) + PAGE_CACHE_SIZE - 1) >> PAGE_CACHE_SHIFT;
+	if (unlikely(offset >= size)) {
+		unlock_page(page);
+		page_cache_release(page);
+		return VM_FAULT_SIGBUS;
+	}
+
+	vmf->page = page;
+	return ret | VM_FAULT_LOCKED;
+}
+#else
+#define filemap_huge_fault NULL
+#endif
+
 int filemap_page_mkwrite(struct vm_area_struct *vma, struct vm_fault *vmf)
 {
 	struct page *page = vmf->page;
@@ -1810,6 +1884,7 @@ EXPORT_SYMBOL(filemap_page_mkwrite);
 
 const struct vm_operations_struct generic_file_vm_ops = {
 	.fault		= filemap_fault,
+	.huge_fault	= filemap_huge_fault,
 	.page_mkwrite	= filemap_page_mkwrite,
 	.remap_pages	= generic_file_remap_pages,
 };
-- 
1.7.10.4


^ permalink raw reply related	[flat|nested] 243+ messages in thread

* [PATCHv2, RFC 26/30] thp: extract fallback path from do_huge_pmd_anonymous_page() to a function
  2013-03-14 17:50 ` Kirill A. Shutemov
@ 2013-03-14 17:50   ` Kirill A. Shutemov
  -1 siblings, 0 replies; 243+ messages in thread
From: Kirill A. Shutemov @ 2013-03-14 17:50 UTC (permalink / raw)
  To: Andrea Arcangeli, Andrew Morton, Al Viro, Hugh Dickins
  Cc: Wu Fengguang, Jan Kara, Mel Gorman, linux-mm, Andi Kleen,
	Matthew Wilcox, Kirill A. Shutemov, Hillf Danton, linux-fsdevel,
	linux-kernel, Kirill A. Shutemov

From: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>

The same fallback path will be reused by non-anonymous pages, so let's
extract it into a separate function.

Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
---
 mm/huge_memory.c |  112 ++++++++++++++++++++++++++++--------------------------
 1 file changed, 59 insertions(+), 53 deletions(-)

diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index be7b7e1..0df1309 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -779,64 +779,12 @@ static bool set_huge_zero_page(pgtable_t pgtable, struct mm_struct *mm,
 	return true;
 }
 
-int do_huge_pmd_anonymous_page(struct mm_struct *mm, struct vm_area_struct *vma,
+static int do_fallback(struct mm_struct *mm, struct vm_area_struct *vma,
 			       unsigned long address, pmd_t *pmd,
 			       unsigned int flags)
 {
-	struct page *page;
-	unsigned long haddr = address & HPAGE_PMD_MASK;
 	pte_t *pte;
 
-	if (haddr >= vma->vm_start && haddr + HPAGE_PMD_SIZE <= vma->vm_end) {
-		if (unlikely(anon_vma_prepare(vma)))
-			return VM_FAULT_OOM;
-		if (unlikely(khugepaged_enter(vma)))
-			return VM_FAULT_OOM;
-		if (!(flags & FAULT_FLAG_WRITE) &&
-				transparent_hugepage_use_zero_page()) {
-			pgtable_t pgtable;
-			unsigned long zero_pfn;
-			bool set;
-			pgtable = pte_alloc_one(mm, haddr);
-			if (unlikely(!pgtable))
-				return VM_FAULT_OOM;
-			zero_pfn = get_huge_zero_page();
-			if (unlikely(!zero_pfn)) {
-				pte_free(mm, pgtable);
-				count_vm_event(THP_FAULT_FALLBACK);
-				goto out;
-			}
-			spin_lock(&mm->page_table_lock);
-			set = set_huge_zero_page(pgtable, mm, vma, haddr, pmd,
-					zero_pfn);
-			spin_unlock(&mm->page_table_lock);
-			if (!set) {
-				pte_free(mm, pgtable);
-				put_huge_zero_page();
-			}
-			return 0;
-		}
-		page = alloc_hugepage_vma(transparent_hugepage_defrag(vma),
-					  vma, haddr, numa_node_id(), 0);
-		if (unlikely(!page)) {
-			count_vm_event(THP_FAULT_FALLBACK);
-			goto out;
-		}
-		count_vm_event(THP_FAULT_ALLOC);
-		if (unlikely(mem_cgroup_newpage_charge(page, mm, GFP_KERNEL))) {
-			put_page(page);
-			goto out;
-		}
-		if (unlikely(__do_huge_pmd_anonymous_page(mm, vma, haddr, pmd,
-							  page))) {
-			mem_cgroup_uncharge_page(page);
-			put_page(page);
-			goto out;
-		}
-
-		return 0;
-	}
-out:
 	/*
 	 * Use __pte_alloc instead of pte_alloc_map, because we can't
 	 * run pte_offset_map on the pmd, if an huge pmd could
@@ -858,6 +806,64 @@ out:
 	return handle_pte_fault(mm, vma, address, pte, pmd, flags);
 }
 
+int do_huge_pmd_anonymous_page(struct mm_struct *mm, struct vm_area_struct *vma,
+			       unsigned long address, pmd_t *pmd,
+			       unsigned int flags)
+{
+	struct page *page;
+	unsigned long haddr = address & HPAGE_PMD_MASK;
+
+	if (haddr < vma->vm_start || haddr + HPAGE_PMD_SIZE > vma->vm_end)
+		return do_fallback(mm, vma, address, pmd, flags);
+	if (unlikely(anon_vma_prepare(vma)))
+		return VM_FAULT_OOM;
+	if (unlikely(khugepaged_enter(vma)))
+		return VM_FAULT_OOM;
+	if (!(flags & FAULT_FLAG_WRITE) &&
+			transparent_hugepage_use_zero_page()) {
+		pgtable_t pgtable;
+		unsigned long zero_pfn;
+		bool set;
+		pgtable = pte_alloc_one(mm, haddr);
+		if (unlikely(!pgtable))
+			return VM_FAULT_OOM;
+		zero_pfn = get_huge_zero_page();
+		if (unlikely(!zero_pfn)) {
+			pte_free(mm, pgtable);
+			count_vm_event(THP_FAULT_FALLBACK);
+			return do_fallback(mm, vma, address, pmd, flags);
+		}
+		spin_lock(&mm->page_table_lock);
+		set = set_huge_zero_page(pgtable, mm, vma, haddr, pmd,
+				zero_pfn);
+		spin_unlock(&mm->page_table_lock);
+		if (!set) {
+			pte_free(mm, pgtable);
+			put_huge_zero_page();
+		}
+		return 0;
+	}
+	page = alloc_hugepage_vma(transparent_hugepage_defrag(vma),
+			vma, haddr, numa_node_id(), 0);
+	if (unlikely(!page)) {
+		count_vm_event(THP_FAULT_FALLBACK);
+		return do_fallback(mm, vma, address, pmd, flags);
+	}
+	count_vm_event(THP_FAULT_ALLOC);
+	if (unlikely(mem_cgroup_newpage_charge(page, mm, GFP_KERNEL))) {
+		put_page(page);
+		return do_fallback(mm, vma, address, pmd, flags);
+	}
+	if (unlikely(__do_huge_pmd_anonymous_page(mm, vma, haddr, pmd,
+					page))) {
+		mem_cgroup_uncharge_page(page);
+		put_page(page);
+		return do_fallback(mm, vma, address, pmd, flags);
+	}
+
+	return 0;
+}
+
 int copy_huge_pmd(struct mm_struct *dst_mm, struct mm_struct *src_mm,
 		  pmd_t *dst_pmd, pmd_t *src_pmd, unsigned long addr,
 		  struct vm_area_struct *vma)
-- 
1.7.10.4


^ permalink raw reply related	[flat|nested] 243+ messages in thread

* [PATCHv2, RFC 27/30] thp: initial implementation of do_huge_linear_fault()
  2013-03-14 17:50 ` Kirill A. Shutemov
@ 2013-03-14 17:50   ` Kirill A. Shutemov
  -1 siblings, 0 replies; 243+ messages in thread
From: Kirill A. Shutemov @ 2013-03-14 17:50 UTC (permalink / raw)
  To: Andrea Arcangeli, Andrew Morton, Al Viro, Hugh Dickins
  Cc: Wu Fengguang, Jan Kara, Mel Gorman, linux-mm, Andi Kleen,
	Matthew Wilcox, Kirill A. Shutemov, Hillf Danton, linux-fsdevel,
	linux-kernel, Kirill A. Shutemov

From: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>

The function tries to create a new page mapping using huge pages. It is
only called for pages that are not yet mapped.

As usual with THP, we fall back to small pages if we fail to allocate a
huge page.
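
Besides the usual VMA size check, the VMA must also be "huge page
aligned": a huge-page-aligned virtual address has to correspond to a
huge-page-aligned file offset. An illustrative restatement of the check
done at the top of the function:

	/* illustrative only; mirrors the vm_start/vm_pgoff check in the patch */
	static inline bool vma_hugepage_aligned(struct vm_area_struct *vma)
	{
		return ((vma->vm_start >> PAGE_SHIFT) & HPAGE_CACHE_INDEX_MASK) ==
			(vma->vm_pgoff & HPAGE_CACHE_INDEX_MASK);
	}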

Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
---
 include/linux/huge_mm.h |    3 +
 mm/huge_memory.c        |  196 +++++++++++++++++++++++++++++++++++++++++++++++
 2 files changed, 199 insertions(+)

diff --git a/include/linux/huge_mm.h b/include/linux/huge_mm.h
index b53e295..aa52c48 100644
--- a/include/linux/huge_mm.h
+++ b/include/linux/huge_mm.h
@@ -5,6 +5,9 @@ extern int do_huge_pmd_anonymous_page(struct mm_struct *mm,
 				      struct vm_area_struct *vma,
 				      unsigned long address, pmd_t *pmd,
 				      unsigned int flags);
+extern int do_huge_linear_fault(struct mm_struct *mm,
+		struct vm_area_struct *vma, unsigned long address, pmd_t *pmd,
+		unsigned int flags);
 extern int copy_huge_pmd(struct mm_struct *dst_mm, struct mm_struct *src_mm,
 			 pmd_t *dst_pmd, pmd_t *src_pmd, unsigned long addr,
 			 struct vm_area_struct *vma);
diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index 0df1309..d1adaea 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -21,6 +21,7 @@
 #include <linux/pagemap.h>
 #include <linux/migrate.h>
 #include <linux/hashtable.h>
+#include <linux/writeback.h>
 
 #include <asm/tlb.h>
 #include <asm/pgalloc.h>
@@ -864,6 +865,201 @@ int do_huge_pmd_anonymous_page(struct mm_struct *mm, struct vm_area_struct *vma,
 	return 0;
 }
 
+int do_huge_linear_fault(struct mm_struct *mm, struct vm_area_struct *vma,
+		unsigned long address, pmd_t *pmd, unsigned int flags)
+{
+	unsigned long haddr = address & HPAGE_PMD_MASK;
+	struct page *cow_page, *page, *dirty_page = NULL;
+	bool anon = false, fallback = false, page_mkwrite = false;
+	pgtable_t pgtable = NULL;
+	struct vm_fault vmf;
+	int ret;
+
+	/* Fallback if vm_pgoff and vm_start are not suitable */
+	if (((vma->vm_start >> PAGE_SHIFT) & HPAGE_CACHE_INDEX_MASK) !=
+			(vma->vm_pgoff & HPAGE_CACHE_INDEX_MASK))
+		return do_fallback(mm, vma, address, pmd, flags);
+
+	if (haddr < vma->vm_start || haddr + HPAGE_PMD_SIZE > vma->vm_end)
+		return do_fallback(mm, vma, address, pmd, flags);
+
+	if (unlikely(khugepaged_enter(vma)))
+		return VM_FAULT_OOM;
+
+	/*
+	 * If we do COW later, allocate page before taking lock_page()
+	 * on the file cache page. This will reduce lock holding time.
+	 */
+	if ((flags & FAULT_FLAG_WRITE) && !(vma->vm_flags & VM_SHARED)) {
+		if (unlikely(anon_vma_prepare(vma)))
+			return VM_FAULT_OOM;
+
+		cow_page = alloc_hugepage_vma(transparent_hugepage_defrag(vma),
+				vma, haddr, numa_node_id(), 0);
+		if (!cow_page) {
+			count_vm_event(THP_FAULT_FALLBACK);
+			return do_fallback(mm, vma, address, pmd, flags);
+		}
+		count_vm_event(THP_FAULT_ALLOC);
+		if (mem_cgroup_newpage_charge(cow_page, mm, GFP_KERNEL)) {
+			page_cache_release(cow_page);
+			return do_fallback(mm, vma, address, pmd, flags);
+		}
+	} else
+		cow_page = NULL;
+
+	pgtable = pte_alloc_one(mm, haddr);
+	if (unlikely(!pgtable)) {
+		ret = VM_FAULT_OOM;
+		goto uncharge_out;
+	}
+
+	vmf.virtual_address = (void __user *)haddr;
+	vmf.pgoff = ((haddr - vma->vm_start) >> PAGE_SHIFT) + vma->vm_pgoff;
+	vmf.flags = flags;
+	vmf.page = NULL;
+
+	ret = vma->vm_ops->huge_fault(vma, &vmf);
+	if (unlikely(ret & VM_FAULT_OOM))
+		goto uncharge_out_fallback;
+	if (unlikely(ret & (VM_FAULT_ERROR | VM_FAULT_NOPAGE | VM_FAULT_RETRY)))
+		goto uncharge_out;
+
+	if (unlikely(PageHWPoison(vmf.page))) {
+		if (ret & VM_FAULT_LOCKED)
+			unlock_page(vmf.page);
+		ret = VM_FAULT_HWPOISON;
+		goto uncharge_out;
+	}
+
+	/*
+	 * For consistency in subsequent calls, make the faulted page always
+	 * locked.
+	 */
+	if (unlikely(!(ret & VM_FAULT_LOCKED)))
+		lock_page(vmf.page);
+	else
+		VM_BUG_ON(!PageLocked(vmf.page));
+
+	/*
+	 * Should we do an early C-O-W break?
+	 */
+	page = vmf.page;
+	if (flags & FAULT_FLAG_WRITE) {
+		if (!(vma->vm_flags & VM_SHARED)) {
+			page = cow_page;
+			anon = true;
+			copy_user_huge_page(page, vmf.page, haddr, vma,
+					HPAGE_PMD_NR);
+			__SetPageUptodate(page);
+		} else if (vma->vm_ops->page_mkwrite) {
+			int tmp;
+
+			unlock_page(page);
+			vmf.flags = FAULT_FLAG_WRITE | FAULT_FLAG_MKWRITE;
+			tmp = vma->vm_ops->page_mkwrite(vma, &vmf);
+			if (unlikely(tmp &
+				  (VM_FAULT_ERROR | VM_FAULT_NOPAGE))) {
+				ret = tmp;
+				goto unwritable_page;
+			}
+			if (unlikely(!(tmp & VM_FAULT_LOCKED))) {
+				lock_page(page);
+				if (!page->mapping) {
+					ret = 0; /* retry the fault */
+					unlock_page(page);
+					goto unwritable_page;
+				}
+			} else
+				VM_BUG_ON(!PageLocked(page));
+			page_mkwrite = true;
+		}
+	}
+
+	VM_BUG_ON(!PageCompound(page));
+
+	spin_lock(&mm->page_table_lock);
+	if (likely(pmd_none(*pmd))) {
+		pmd_t entry;
+
+		flush_icache_page(vma, page);
+		entry = mk_huge_pmd(page, vma->vm_page_prot);
+		if (flags & FAULT_FLAG_WRITE)
+			entry = maybe_pmd_mkwrite(pmd_mkdirty(entry), vma);
+		if (anon) {
+			add_mm_counter(mm, MM_ANONPAGES, HPAGE_PMD_NR);
+			page_add_new_anon_rmap(page, vma, haddr);
+		} else {
+			add_mm_counter(mm, MM_FILEPAGES, HPAGE_PMD_NR);
+			page_add_file_rmap(page);
+			if (flags & FAULT_FLAG_WRITE) {
+				dirty_page = page;
+				get_page(dirty_page);
+			}
+		}
+		set_pmd_at(mm, haddr, pmd, entry);
+		pgtable_trans_huge_deposit(mm, pgtable);
+		mm->nr_ptes++;
+
+		/* no need to invalidate: a not-present page won't be cached */
+		update_mmu_cache_pmd(vma, address, pmd);
+	} else {
+		if (cow_page)
+			mem_cgroup_uncharge_page(cow_page);
+		if (anon)
+			page_cache_release(page);
+		else
+			anon = true; /* no anon but release faulted_page */
+	}
+	spin_unlock(&mm->page_table_lock);
+
+	if (dirty_page) {
+		struct address_space *mapping = page->mapping;
+		bool dirtied = false;
+
+		if (set_page_dirty(dirty_page))
+			dirtied = true;
+		unlock_page(dirty_page);
+		put_page(dirty_page);
+		if ((dirtied || page_mkwrite) && mapping) {
+			/*
+			 * Some device drivers do not set page.mapping but still
+			 * dirty their pages
+			 */
+			balance_dirty_pages_ratelimited(mapping);
+		}
+
+		/* file_update_time outside page_lock */
+		if (vma->vm_file && !page_mkwrite)
+			file_update_time(vma->vm_file);
+	} else {
+		unlock_page(vmf.page);
+		if (anon)
+			page_cache_release(vmf.page);
+	}
+
+	return ret;
+
+unwritable_page:
+	pte_free(mm, pgtable);
+	page_cache_release(page);
+	return ret;
+uncharge_out_fallback:
+	fallback = true;
+uncharge_out:
+	if (pgtable)
+		pte_free(mm, pgtable);
+	if (cow_page) {
+		mem_cgroup_uncharge_page(cow_page);
+		page_cache_release(cow_page);
+	}
+
+	if (fallback)
+		return do_fallback(mm, vma, address, pmd, flags);
+	else
+		return ret;
+}
+
 int copy_huge_pmd(struct mm_struct *dst_mm, struct mm_struct *src_mm,
 		  pmd_t *dst_pmd, pmd_t *src_pmd, unsigned long addr,
 		  struct vm_area_struct *vma)
-- 
1.7.10.4


^ permalink raw reply related	[flat|nested] 243+ messages in thread

* [PATCHv2, RFC 28/30] thp: handle write-protect exception to file-backed huge pages
  2013-03-14 17:50 ` Kirill A. Shutemov
@ 2013-03-14 17:50   ` Kirill A. Shutemov
  -1 siblings, 0 replies; 243+ messages in thread
From: Kirill A. Shutemov @ 2013-03-14 17:50 UTC (permalink / raw)
  To: Andrea Arcangeli, Andrew Morton, Al Viro, Hugh Dickins
  Cc: Wu Fengguang, Jan Kara, Mel Gorman, linux-mm, Andi Kleen,
	Matthew Wilcox, Kirill A. Shutemov, Hillf Danton, linux-fsdevel,
	linux-kernel, Kirill A. Shutemov

From: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>

Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
---
 mm/huge_memory.c |   69 ++++++++++++++++++++++++++++++++++++++++++++++++++++--
 1 file changed, 67 insertions(+), 2 deletions(-)

diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index d1adaea..a416a77 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -1339,7 +1339,6 @@ int do_huge_pmd_wp_page(struct mm_struct *mm, struct vm_area_struct *vma,
 	unsigned long mmun_start;	/* For mmu_notifiers */
 	unsigned long mmun_end;		/* For mmu_notifiers */
 
-	VM_BUG_ON(!vma->anon_vma);
 	haddr = address & HPAGE_PMD_MASK;
 	if (is_huge_zero_pmd(orig_pmd))
 		goto alloc;
@@ -1349,7 +1348,7 @@ int do_huge_pmd_wp_page(struct mm_struct *mm, struct vm_area_struct *vma,
 
 	page = pmd_page(orig_pmd);
 	VM_BUG_ON(!PageCompound(page) || !PageHead(page));
-	if (page_mapcount(page) == 1) {
+	if (PageAnon(page) && page_mapcount(page) == 1) {
 		pmd_t entry;
 		entry = pmd_mkyoung(orig_pmd);
 		entry = maybe_pmd_mkwrite(pmd_mkdirty(entry), vma);
@@ -1357,10 +1356,72 @@ int do_huge_pmd_wp_page(struct mm_struct *mm, struct vm_area_struct *vma,
 			update_mmu_cache_pmd(vma, address, pmd);
 		ret |= VM_FAULT_WRITE;
 		goto out_unlock;
+	} else if ((vma->vm_flags & (VM_WRITE|VM_SHARED)) ==
+			(VM_WRITE|VM_SHARED)) {
+		struct vm_fault vmf;
+		pmd_t entry;
+		struct address_space *mapping;
+
+		/* not yet implemented */
+		VM_BUG_ON(!vma->vm_ops || !vma->vm_ops->page_mkwrite);
+
+		vmf.virtual_address = (void __user *)haddr;
+		vmf.pgoff = page->index;
+		vmf.flags = FAULT_FLAG_WRITE|FAULT_FLAG_MKWRITE;
+		vmf.page = page;
+
+		page_cache_get(page);
+		spin_unlock(&mm->page_table_lock);
+
+		ret = vma->vm_ops->page_mkwrite(vma, &vmf);
+		if (unlikely(ret & (VM_FAULT_ERROR | VM_FAULT_NOPAGE))) {
+			page_cache_release(page);
+			goto out;
+		}
+		if (unlikely(!(ret & VM_FAULT_LOCKED))) {
+			lock_page(page);
+			if (!page->mapping) {
+				ret = 0; /* retry */
+				unlock_page(page);
+				page_cache_release(page);
+				goto out;
+			}
+		} else
+			VM_BUG_ON(!PageLocked(page));
+		spin_lock(&mm->page_table_lock);
+		if (unlikely(!pmd_same(*pmd, orig_pmd))) {
+			unlock_page(page);
+			page_cache_release(page);
+			goto out_unlock;
+		}
+
+		flush_cache_page(vma, address, pmd_pfn(orig_pmd));
+		entry = pmd_mkyoung(orig_pmd);
+		entry = maybe_pmd_mkwrite(pmd_mkdirty(entry), vma);
+		if (pmdp_set_access_flags(vma, haddr, pmd, entry,  1))
+			update_mmu_cache_pmd(vma, address, pmd);
+		ret = VM_FAULT_WRITE;
+		spin_unlock(&mm->page_table_lock);
+
+		mapping = page->mapping;
+		set_page_dirty(page);
+		unlock_page(page);
+		page_cache_release(page);
+		if (mapping)	{
+			/*
+			 * Some device drivers do not set page.mapping
+			 * but still dirty their pages
+			 */
+			balance_dirty_pages_ratelimited(mapping);
+		}
+		return ret;
 	}
 	get_page(page);
 	spin_unlock(&mm->page_table_lock);
 alloc:
+	if (unlikely(anon_vma_prepare(vma)))
+		return VM_FAULT_OOM;
+
 	if (transparent_hugepage_enabled(vma) &&
 	    !transparent_hugepage_debug_cow())
 		new_page = alloc_hugepage_vma(transparent_hugepage_defrag(vma),
@@ -1424,6 +1485,10 @@ alloc:
 			add_mm_counter(mm, MM_ANONPAGES, HPAGE_PMD_NR);
 			put_huge_zero_page();
 		} else {
+			if (!PageAnon(page)) {
+				add_mm_counter(mm, MM_FILEPAGES, -HPAGE_PMD_NR);
+				add_mm_counter(mm, MM_ANONPAGES, HPAGE_PMD_NR);
+			}
 			VM_BUG_ON(!PageHead(page));
 			page_remove_rmap(page);
 			put_page(page);
-- 
1.7.10.4


^ permalink raw reply related	[flat|nested] 243+ messages in thread

* [PATCHv2, RFC 29/30] thp: call __vma_adjust_trans_huge() for file-backed VMA
  2013-03-14 17:50 ` Kirill A. Shutemov
@ 2013-03-14 17:50   ` Kirill A. Shutemov
  -1 siblings, 0 replies; 243+ messages in thread
From: Kirill A. Shutemov @ 2013-03-14 17:50 UTC (permalink / raw)
  To: Andrea Arcangeli, Andrew Morton, Al Viro, Hugh Dickins
  Cc: Wu Fengguang, Jan Kara, Mel Gorman, linux-mm, Andi Kleen,
	Matthew Wilcox, Kirill A. Shutemov, Hillf Danton, linux-fsdevel,
	linux-kernel, Kirill A. Shutemov

From: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>

Since we're going to have huge pages in the page cache, we need to call
__vma_adjust_trans_huge() for file-backed VMAs as well, because they can
now contain huge pages.

For now we call it for all VMAs with vm_ops->huge_fault defined.

Later we will probably need to introduce a flag to indicate that the VMA
actually has huge pages.

Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
---
 include/linux/huge_mm.h |    6 +++---
 1 file changed, 3 insertions(+), 3 deletions(-)

diff --git a/include/linux/huge_mm.h b/include/linux/huge_mm.h
index aa52c48..c6e3aef 100644
--- a/include/linux/huge_mm.h
+++ b/include/linux/huge_mm.h
@@ -161,9 +161,9 @@ static inline void vma_adjust_trans_huge(struct vm_area_struct *vma,
 					 unsigned long end,
 					 long adjust_next)
 {
-	if (!vma->anon_vma || vma->vm_ops)
-		return;
-	__vma_adjust_trans_huge(vma, start, end, adjust_next);
+	if ((vma->anon_vma && !vma->vm_ops) ||
+			(vma->vm_ops && vma->vm_ops->huge_fault))
+		__vma_adjust_trans_huge(vma, start, end, adjust_next);
 }
 static inline int hpage_nr_pages(struct page *page)
 {
-- 
1.7.10.4


^ permalink raw reply related	[flat|nested] 243+ messages in thread
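
For reference, the file-backed case this change covers is a VMA whose vm_ops
provide a huge_fault handler; in this series that is generic_file_vm_ops once
patch 25 wires in filemap_huge_fault. A minimal sketch of such a vm_ops
(other callbacks omitted; the exact initializer is an assumption, not taken
from the posted patches):

	static const struct vm_operations_struct generic_file_vm_ops = {
		.fault		= filemap_fault,
		.huge_fault	= filemap_huge_fault,	/* added by patch 25 */
		.page_mkwrite	= filemap_page_mkwrite,
	};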

* [PATCHv2, RFC 30/30] thp: map file-backed huge pages on fault
  2013-03-14 17:50 ` Kirill A. Shutemov
@ 2013-03-14 17:50   ` Kirill A. Shutemov
  -1 siblings, 0 replies; 243+ messages in thread
From: Kirill A. Shutemov @ 2013-03-14 17:50 UTC (permalink / raw)
  To: Andrea Arcangeli, Andrew Morton, Al Viro, Hugh Dickins
  Cc: Wu Fengguang, Jan Kara, Mel Gorman, linux-mm, Andi Kleen,
	Matthew Wilcox, Kirill A. Shutemov, Hillf Danton, linux-fsdevel,
	linux-kernel, Kirill A. Shutemov

From: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>

Looks like all the pieces are in place: we can map file-backed huge pages now.

Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
---
 include/linux/huge_mm.h |    4 +++-
 mm/memory.c             |    1 +
 2 files changed, 4 insertions(+), 1 deletion(-)

diff --git a/include/linux/huge_mm.h b/include/linux/huge_mm.h
index c6e3aef..c175c78 100644
--- a/include/linux/huge_mm.h
+++ b/include/linux/huge_mm.h
@@ -80,7 +80,9 @@ extern bool is_vma_temporary_stack(struct vm_area_struct *vma);
 	   (1<<TRANSPARENT_HUGEPAGE_REQ_MADV_FLAG) &&			\
 	   ((__vma)->vm_flags & VM_HUGEPAGE))) &&			\
 	 !((__vma)->vm_flags & VM_NOHUGEPAGE) &&			\
-	 !is_vma_temporary_stack(__vma))
+	 !is_vma_temporary_stack(__vma) &&				\
+	 (!(__vma)->vm_ops ||						\
+		  mapping_can_have_hugepages((__vma)->vm_file->f_mapping)))
 #define transparent_hugepage_defrag(__vma)				\
 	((transparent_hugepage_flags &					\
 	  (1<<TRANSPARENT_HUGEPAGE_DEFRAG_FLAG)) ||			\
diff --git a/mm/memory.c b/mm/memory.c
index 52bd6cf..f3b5a11 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -3738,6 +3738,7 @@ retry:
 		if (!vma->vm_ops)
 			return do_huge_pmd_anonymous_page(mm, vma, address,
 							  pmd, flags);
+		return do_huge_linear_fault(mm, vma, address, pmd, flags);
 	} else {
 		pmd_t orig_pmd = *pmd;
 		int ret;
-- 
1.7.10.4


^ permalink raw reply related	[flat|nested] 243+ messages in thread
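
With this hunk applied, the huge-pmd path in handle_mm_fault() reads roughly
as below; the enclosing pmd_none()/transparent_hugepage_enabled() checks are
reconstructed from context rather than quoted from the patch:

	if (pmd_none(*pmd) && transparent_hugepage_enabled(vma)) {
		if (!vma->vm_ops)
			return do_huge_pmd_anonymous_page(mm, vma, address,
							  pmd, flags);
		return do_huge_linear_fault(mm, vma, address, pmd, flags);
	}

Anonymous VMAs keep the existing path, while file-backed VMAs whose vm_ops
provide huge_fault now go through do_huge_linear_fault() from patch 27.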

* Re: [PATCHv2, RFC 03/30] mm: drop actor argument of do_generic_file_read()
  2013-03-14 17:50   ` Kirill A. Shutemov
@ 2013-03-15  0:21     ` Hillf Danton
  -1 siblings, 0 replies; 243+ messages in thread
From: Hillf Danton @ 2013-03-15  0:21 UTC (permalink / raw)
  To: Kirill A. Shutemov
  Cc: Andrea Arcangeli, Andrew Morton, Al Viro, Hugh Dickins,
	Wu Fengguang, Jan Kara, Mel Gorman, linux-mm, Andi Kleen,
	Matthew Wilcox, Kirill A. Shutemov, linux-fsdevel, linux-kernel

On Fri, Mar 15, 2013 at 1:50 AM, Kirill A. Shutemov
<kirill.shutemov@linux.intel.com> wrote:
> From: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
>
> There's only one caller of do_generic_file_read() and the only actor is
> file_read_actor(). No reason to have a callback parameter.
>
This cleanup is not urgent if it nukes no barrier for THP cache.

Hillf

^ permalink raw reply	[flat|nested] 243+ messages in thread

* Re: [PATCHv2, RFC 03/30] mm: drop actor argument of do_generic_file_read()
  2013-03-14 17:50   ` Kirill A. Shutemov
@ 2013-03-15  0:27     ` Hillf Danton
  -1 siblings, 0 replies; 243+ messages in thread
From: Hillf Danton @ 2013-03-15  0:27 UTC (permalink / raw)
  To: Kirill A. Shutemov
  Cc: Andrea Arcangeli, Andrew Morton, Al Viro, Hugh Dickins,
	Wu Fengguang, Jan Kara, Mel Gorman, linux-mm, Andi Kleen,
	Matthew Wilcox, Kirill A. Shutemov, linux-fsdevel, linux-kernel

On Fri, Mar 15, 2013 at 1:50 AM, Kirill A. Shutemov
<kirill.shutemov@linux.intel.com> wrote:
>
> There's only one caller of do_generic_file_read() and the only actor is
> file_read_actor(). No reason to have a callback parameter.
>
This cleanup is not urgent if it nukes no barrier for THP cache.

Hillf

^ permalink raw reply	[flat|nested] 243+ messages in thread

* Re: [PATCHv2, RFC 00/30] Transparent huge page cache
  2013-03-14 17:50 ` Kirill A. Shutemov
@ 2013-03-15  0:33   ` Hillf Danton
  -1 siblings, 0 replies; 243+ messages in thread
From: Hillf Danton @ 2013-03-15  0:33 UTC (permalink / raw)
  To: Kirill A. Shutemov
  Cc: Andrea Arcangeli, Andrew Morton, Al Viro, Hugh Dickins,
	Wu Fengguang, Jan Kara, Mel Gorman, linux-mm, Andi Kleen,
	Matthew Wilcox, Kirill A. Shutemov, linux-fsdevel, linux-kernel

On Fri, Mar 15, 2013 at 1:50 AM, Kirill A. Shutemov
<kirill.shutemov@linux.intel.com> wrote:
>
> Here's the second version of the patchset.
>
> The intend of the work is get code ready to enable transparent huge page
> cache for the most simple fs -- ramfs.
>
Where is your git tree including THP cache?

Hillf

^ permalink raw reply	[flat|nested] 243+ messages in thread

* Re: [PATCHv2, RFC 08/30] thp, mm: rewrite add_to_page_cache_locked() to support huge pages
  2013-03-14 17:50   ` Kirill A. Shutemov
@ 2013-03-15  1:30     ` Hillf Danton
  -1 siblings, 0 replies; 243+ messages in thread
From: Hillf Danton @ 2013-03-15  1:30 UTC (permalink / raw)
  To: Kirill A. Shutemov
  Cc: Andrea Arcangeli, Andrew Morton, Al Viro, Hugh Dickins,
	Wu Fengguang, Jan Kara, Mel Gorman, linux-mm, Andi Kleen,
	Matthew Wilcox, Kirill A. Shutemov, linux-fsdevel, linux-kernel

On Fri, Mar 15, 2013 at 1:50 AM, Kirill A. Shutemov
<kirill.shutemov@linux.intel.com> wrote:
> +       page_cache_get(page);
> +       spin_lock_irq(&mapping->tree_lock);
> +       page->mapping = mapping;
> +       page->index = offset;
> +       error = radix_tree_insert(&mapping->page_tree, offset, page);
> +       if (unlikely(error))
> +               goto err;
> +       if (PageTransHuge(page)) {
> +               int i;
> +               for (i = 1; i < HPAGE_CACHE_NR; i++) {
			struct page *tail = page + i; to easy reader

> +                       page_cache_get(page + i);
s/page_cache_get/get_page_foll/ ?

> +                       page[i].index = offset + i;
> +                       error = radix_tree_insert(&mapping->page_tree,
> +                                       offset + i, page + i);
> +                       if (error) {
> +                               page_cache_release(page + i);
> +                               break;
> +                       }
>                 }
> -               radix_tree_preload_end();
> -       } else
> -               mem_cgroup_uncharge_cache_page(page);
> -out:
> +               if (error) {
> +                       error = ENOSPC; /* no space for a huge page */
s/E/-E/

> +                       for (i--; i > 0; i--) {
> +                               radix_tree_delete(&mapping->page_tree,
> +                                               offset + i);
> +                               page_cache_release(page + i);
> +                       }
> +                       radix_tree_delete(&mapping->page_tree, offset);
> +                       goto err;
> +               }
> +       }

^ permalink raw reply	[flat|nested] 243+ messages in thread
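
For reference, with the suggestions above folded in (the tail local and the
-ENOSPC sign are the reviewer's proposals, not code from the posted patch),
the tail-page insertion path would look roughly like this:

	if (PageTransHuge(page)) {
		int i;

		for (i = 1; i < HPAGE_CACHE_NR; i++) {
			struct page *tail = page + i;

			page_cache_get(tail);
			tail->index = offset + i;
			error = radix_tree_insert(&mapping->page_tree,
					offset + i, tail);
			if (error) {
				page_cache_release(tail);
				break;
			}
		}
		if (error) {
			error = -ENOSPC; /* no space for a huge page */
			for (i--; i > 0; i--) {
				radix_tree_delete(&mapping->page_tree,
						offset + i);
				page_cache_release(page + i);
			}
			radix_tree_delete(&mapping->page_tree, offset);
			goto err;
		}
	}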

* Re: [PATCHv2, RFC 09/30] thp, mm: rewrite delete_from_page_cache() to support huge pages
  2013-03-14 17:50   ` Kirill A. Shutemov
@ 2013-03-15  2:25     ` Hillf Danton
  -1 siblings, 0 replies; 243+ messages in thread
From: Hillf Danton @ 2013-03-15  2:25 UTC (permalink / raw)
  To: Kirill A. Shutemov
  Cc: Andrea Arcangeli, Andrew Morton, Al Viro, Hugh Dickins,
	Wu Fengguang, Jan Kara, Mel Gorman, linux-mm, Andi Kleen,
	Matthew Wilcox, Kirill A. Shutemov, linux-fsdevel, linux-kernel

On Fri, Mar 15, 2013 at 1:50 AM, Kirill A. Shutemov
<kirill.shutemov@linux.intel.com> wrote:
> +       if (PageTransHuge(page)) {
> +               int i;
> +
> +               for (i = 0; i < HPAGE_CACHE_NR; i++)
> +                       radix_tree_delete(&mapping->page_tree, page->index + i);

Move the below page_cache_release for tail page here, please.

> +               nr = HPAGE_CACHE_NR;
[...]
> +       if (PageTransHuge(page))
> +               for (i = 1; i < HPAGE_CACHE_NR; i++)
> +                       page_cache_release(page + i);
>         page_cache_release(page);

^ permalink raw reply	[flat|nested] 243+ messages in thread
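
The restructuring asked for here (and accepted in Kirill's reply further down
the thread) would release each tail page in the same loop that removes it
from the radix tree. A sketch, not the posted code:

	if (PageTransHuge(page)) {
		int i;

		for (i = 0; i < HPAGE_CACHE_NR; i++) {
			radix_tree_delete(&mapping->page_tree, page->index + i);
			/* the head page is released later, as for small pages */
			if (i)
				page_cache_release(page + i);
		}
		nr = HPAGE_CACHE_NR;
	}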

* Re: [PATCHv2, RFC 13/30] thp, mm: implement grab_cache_huge_page_write_begin()
  2013-03-14 17:50   ` Kirill A. Shutemov
@ 2013-03-15  2:34     ` Hillf Danton
  -1 siblings, 0 replies; 243+ messages in thread
From: Hillf Danton @ 2013-03-15  2:34 UTC (permalink / raw)
  To: Kirill A. Shutemov
  Cc: Andrea Arcangeli, Andrew Morton, Al Viro, Hugh Dickins,
	Wu Fengguang, Jan Kara, Mel Gorman, linux-mm, Andi Kleen,
	Matthew Wilcox, Kirill A. Shutemov, linux-fsdevel, linux-kernel

On Fri, Mar 15, 2013 at 1:50 AM, Kirill A. Shutemov
<kirill.shutemov@linux.intel.com> wrote:
> +#ifdef CONFIG_TRANSPARENT_HUGEPAGE
> +struct page *grab_cache_huge_page_write_begin(struct address_space *mapping,
> +                       pgoff_t index, unsigned flags);
> +#else
> +static inline struct page *grab_cache_huge_page_write_begin(
> +               struct address_space *mapping, pgoff_t index, unsigned flags)
> +{
build bug?

> +       return NULL;
> +}
> +#endif
>
btw, how about grab_thp_write_begin?

^ permalink raw reply	[flat|nested] 243+ messages in thread
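
The 'build bug?' remark presumably suggests making the
!CONFIG_TRANSPARENT_HUGEPAGE stub fail the build if it is ever referenced,
instead of quietly returning NULL. Something along these lines (a sketch of
the reviewer's idea, not code from the patch):

	#else
	static inline struct page *grab_cache_huge_page_write_begin(
			struct address_space *mapping, pgoff_t index, unsigned flags)
	{
		BUILD_BUG();
		return NULL;
	}
	#endif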

* Re: [PATCHv2, RFC 14/30] thp, mm: naive support of thp in generic read/write routines
  2013-03-14 17:50   ` Kirill A. Shutemov
@ 2013-03-15  3:11     ` Hillf Danton
  -1 siblings, 0 replies; 243+ messages in thread
From: Hillf Danton @ 2013-03-15  3:11 UTC (permalink / raw)
  To: Kirill A. Shutemov
  Cc: Andrea Arcangeli, Andrew Morton, Al Viro, Hugh Dickins,
	Wu Fengguang, Jan Kara, Mel Gorman, linux-mm, Andi Kleen,
	Matthew Wilcox, Kirill A. Shutemov, linux-fsdevel, linux-kernel

On Fri, Mar 15, 2013 at 1:50 AM, Kirill A. Shutemov
<kirill.shutemov@linux.intel.com> wrote:
> +               if (PageTransTail(page)) {
> +                       page_cache_release(page);
> +                       page = find_get_page(mapping,
> +                                       index & ~HPAGE_CACHE_INDEX_MASK);
check page is valid, please.

> +                       if (!PageTransHuge(page)) {
> +                               page_cache_release(page);
> +                               goto find_page;
> +                       }
> +               }

^ permalink raw reply	[flat|nested] 243+ messages in thread
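
The missing check would guard against the compound page having been freed or
split between dropping the tail reference and re-looking up the head page.
Roughly (a sketch, not the posted code):

		if (PageTransTail(page)) {
			page_cache_release(page);
			page = find_get_page(mapping,
					index & ~HPAGE_CACHE_INDEX_MASK);
			if (!page)
				goto find_page;	/* the huge page is gone, retry */
			if (!PageTransHuge(page)) {
				page_cache_release(page);
				goto find_page;
			}
		}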

* Re: [PATCHv2, RFC 16/30] thp: handle file pages in split_huge_page()
  2013-03-14 17:50   ` Kirill A. Shutemov
@ 2013-03-15  6:15     ` Hillf Danton
  -1 siblings, 0 replies; 243+ messages in thread
From: Hillf Danton @ 2013-03-15  6:15 UTC (permalink / raw)
  To: Kirill A. Shutemov
  Cc: Andrea Arcangeli, Andrew Morton, Al Viro, Hugh Dickins,
	Wu Fengguang, Jan Kara, Mel Gorman, linux-mm, Andi Kleen,
	Matthew Wilcox, Kirill A. Shutemov, linux-fsdevel, linux-kernel

On Fri, Mar 15, 2013 at 1:50 AM, Kirill A. Shutemov
<kirill.shutemov@linux.intel.com> wrote:
> -int split_huge_page(struct page *page)
> +static int split_anon_huge_page(struct page *page)
>  {
>         struct anon_vma *anon_vma;
>         int ret = 1;
>
> -       BUG_ON(is_huge_zero_pfn(page_to_pfn(page)));
> -       BUG_ON(!PageAnon(page));
> -
deleted, why?

^ permalink raw reply	[flat|nested] 243+ messages in thread

* Re: [PATCHv2, RFC 19/30] thp, mm: split huge page on mmap file page
  2013-03-14 17:50   ` Kirill A. Shutemov
@ 2013-03-15  6:58     ` Hillf Danton
  -1 siblings, 0 replies; 243+ messages in thread
From: Hillf Danton @ 2013-03-15  6:58 UTC (permalink / raw)
  To: Kirill A. Shutemov
  Cc: Andrea Arcangeli, Andrew Morton, Al Viro, Hugh Dickins,
	Wu Fengguang, Jan Kara, Mel Gorman, linux-mm, Andi Kleen,
	Matthew Wilcox, Kirill A. Shutemov, linux-fsdevel, linux-kernel

On Fri, Mar 15, 2013 at 1:50 AM, Kirill A. Shutemov
<kirill.shutemov@linux.intel.com> wrote:
>
> We are not ready to mmap file-backed tranparent huge pages.
>
It is not on the todo list either.

^ permalink raw reply	[flat|nested] 243+ messages in thread

* Re: [PATCHv2, RFC 23/30] thp: prepare zap_huge_pmd() to uncharge file pages
  2013-03-14 17:50   ` Kirill A. Shutemov
@ 2013-03-15  7:09     ` Hillf Danton
  -1 siblings, 0 replies; 243+ messages in thread
From: Hillf Danton @ 2013-03-15  7:09 UTC (permalink / raw)
  To: Kirill A. Shutemov
  Cc: Andrea Arcangeli, Andrew Morton, Al Viro, Hugh Dickins,
	Wu Fengguang, Jan Kara, Mel Gorman, linux-mm, Andi Kleen,
	Matthew Wilcox, Kirill A. Shutemov, linux-fsdevel, linux-kernel

On Fri, Mar 15, 2013 at 1:50 AM, Kirill A. Shutemov
<kirill.shutemov@linux.intel.com> wrote:
> From: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
>
> Uncharge pages from correct counter.
>
> Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
> ---
>  mm/huge_memory.c |    4 +++-
>  1 file changed, 3 insertions(+), 1 deletion(-)
>
> diff --git a/mm/huge_memory.c b/mm/huge_memory.c
> index a23da8b..34e0385 100644
> --- a/mm/huge_memory.c
> +++ b/mm/huge_memory.c
> @@ -1368,10 +1368,12 @@ int zap_huge_pmd(struct mmu_gather *tlb, struct vm_area_struct *vma,
>                         spin_unlock(&tlb->mm->page_table_lock);
>                         put_huge_zero_page();
>                 } else {
> +                       int counter;
s/counter/item/ ?

>                         page = pmd_page(orig_pmd);
>                         page_remove_rmap(page);
>                         VM_BUG_ON(page_mapcount(page) < 0);
> -                       add_mm_counter(tlb->mm, MM_ANONPAGES, -HPAGE_PMD_NR);
> +                       counter = PageAnon(page) ? MM_ANONPAGES : MM_FILEPAGES;
> +                       add_mm_counter(tlb->mm, counter, -HPAGE_PMD_NR);
>                         VM_BUG_ON(!PageHead(page));
>                         tlb->mm->nr_ptes--;
>                         spin_unlock(&tlb->mm->page_table_lock);
> --
> 1.7.10.4
>

^ permalink raw reply	[flat|nested] 243+ messages in thread

* Re: [PATCHv2, RFC 24/30] thp: move maybe_pmd_mkwrite() out of mk_huge_pmd()
  2013-03-14 17:50   ` Kirill A. Shutemov
@ 2013-03-15  7:31     ` Hillf Danton
  -1 siblings, 0 replies; 243+ messages in thread
From: Hillf Danton @ 2013-03-15  7:31 UTC (permalink / raw)
  To: Kirill A. Shutemov
  Cc: Andrea Arcangeli, Andrew Morton, Al Viro, Hugh Dickins,
	Wu Fengguang, Jan Kara, Mel Gorman, linux-mm, Andi Kleen,
	Matthew Wilcox, Kirill A. Shutemov, linux-fsdevel, linux-kernel

On Fri, Mar 15, 2013 at 1:50 AM, Kirill A. Shutemov
<kirill.shutemov@linux.intel.com> wrote:
> From: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
>
> It's confusing that mk_huge_pmd() has sematics different from mk_pte()
> or mk_pmd().
>
> Let's move maybe_pmd_mkwrite() out of mk_huge_pmd() and adjust
> prototype to match mk_pte().
>
Not urgent if it is not used by subsequent patches.

> Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
> ---
>  mm/huge_memory.c |   14 ++++++++------
>  1 file changed, 8 insertions(+), 6 deletions(-)
>
> diff --git a/mm/huge_memory.c b/mm/huge_memory.c
> index 34e0385..be7b7e1 100644
> --- a/mm/huge_memory.c
> +++ b/mm/huge_memory.c
> @@ -691,11 +691,10 @@ pmd_t maybe_pmd_mkwrite(pmd_t pmd, struct vm_area_struct *vma)
>         return pmd;
>  }
>
> -static inline pmd_t mk_huge_pmd(struct page *page, struct vm_area_struct *vma)
> +static inline pmd_t mk_huge_pmd(struct page *page, pgprot_t prot)
>  {
>         pmd_t entry;
> -       entry = mk_pmd(page, vma->vm_page_prot);
> -       entry = maybe_pmd_mkwrite(pmd_mkdirty(entry), vma);
> +       entry = mk_pmd(page, prot);
>         entry = pmd_mkhuge(entry);
>         return entry;
>  }
> @@ -723,7 +722,8 @@ static int __do_huge_pmd_anonymous_page(struct mm_struct *mm,
>                 pte_free(mm, pgtable);
>         } else {
>                 pmd_t entry;
> -               entry = mk_huge_pmd(page, vma);
> +               entry = mk_huge_pmd(page, vma->vm_page_prot);
> +               entry = maybe_pmd_mkwrite(pmd_mkdirty(entry), vma);
>                 /*
>                  * The spinlocking to take the lru_lock inside
>                  * page_add_new_anon_rmap() acts as a full memory
> @@ -1212,7 +1212,8 @@ alloc:
>                 goto out_mn;
>         } else {
>                 pmd_t entry;
> -               entry = mk_huge_pmd(new_page, vma);
> +               entry = mk_huge_pmd(new_page, vma->vm_page_prot);
> +               entry = maybe_pmd_mkwrite(pmd_mkdirty(entry), vma);
>                 pmdp_clear_flush(vma, haddr, pmd);
>                 page_add_new_anon_rmap(new_page, vma, haddr);
>                 set_pmd_at(mm, haddr, pmd, entry);
> @@ -2382,7 +2383,8 @@ static void collapse_huge_page(struct mm_struct *mm,
>         __SetPageUptodate(new_page);
>         pgtable = pmd_pgtable(_pmd);
>
> -       _pmd = mk_huge_pmd(new_page, vma);
> +       _pmd = mk_huge_pmd(new_page, vma->vm_page_prot);
> +       _pmd = maybe_pmd_mkwrite(pmd_mkdirty(_pmd), vma);
>
>         /*
>          * spin_lock() below is not the equivalent of smp_wmb(), so
> --
> 1.7.10.4
>

^ permalink raw reply	[flat|nested] 243+ messages in thread

* Re: [PATCHv2, RFC 25/30] thp, mm: basic huge_fault implementation for generic_file_vm_ops
  2013-03-14 17:50   ` Kirill A. Shutemov
@ 2013-03-15  7:44     ` Hillf Danton
  -1 siblings, 0 replies; 243+ messages in thread
From: Hillf Danton @ 2013-03-15  7:44 UTC (permalink / raw)
  To: Kirill A. Shutemov
  Cc: Andrea Arcangeli, Andrew Morton, Al Viro, Hugh Dickins,
	Wu Fengguang, Jan Kara, Mel Gorman, linux-mm, Andi Kleen,
	Matthew Wilcox, Kirill A. Shutemov, linux-fsdevel, linux-kernel

On Fri, Mar 15, 2013 at 1:50 AM, Kirill A. Shutemov
<kirill.shutemov@linux.intel.com> wrote:
>
> +#ifdef CONFIG_TRANSPARENT_HUGEPAGE
> +static int filemap_huge_fault(struct vm_area_struct *vma, struct vm_fault *vmf)
> +{
> +       struct file *file = vma->vm_file;
> +       struct address_space *mapping = file->f_mapping;
> +       struct inode *inode = mapping->host;
> +       pgoff_t size, offset = vmf->pgoff;
> +       unsigned long address = (unsigned long) vmf->virtual_address;
> +       struct page *page;
> +       int ret = 0;
> +
> +       BUG_ON(((address >> PAGE_SHIFT) & HPAGE_CACHE_INDEX_MASK) !=
> +                       (offset & HPAGE_CACHE_INDEX_MASK));
> +
> +retry:
> +       page = find_get_page(mapping, offset);
> +       if (!page) {
> +               gfp_t gfp_mask = mapping_gfp_mask(mapping) | __GFP_COLD;
> +               page = alloc_pages(gfp_mask, HPAGE_PMD_ORDER);
s/pages/pages_vma/ ?

> +               if (!page) {
> +                       count_vm_event(THP_FAULT_FALLBACK);
> +                       return VM_FAULT_OOM;
> +               }

^ permalink raw reply	[flat|nested] 243+ messages in thread
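
The s/pages/pages_vma/ suggestion presumably means allocating with the
VMA-aware helper so the task's NUMA mempolicy is honoured. Something like
this (the exact helper and argument list are an assumption, not code from the
patch):

		gfp_t gfp_mask = mapping_gfp_mask(mapping) | __GFP_COLD;
		page = alloc_pages_vma(gfp_mask, HPAGE_PMD_ORDER,
				       vma, address, numa_node_id());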

* Re: [PATCHv2, RFC 03/30] mm: drop actor argument of do_generic_file_read()
  2013-03-15  0:27     ` Hillf Danton
@ 2013-03-15 13:22       ` Kirill A. Shutemov
  -1 siblings, 0 replies; 243+ messages in thread
From: Kirill A. Shutemov @ 2013-03-15 13:22 UTC (permalink / raw)
  To: Hillf Danton
  Cc: Kirill A. Shutemov, Andrea Arcangeli, Andrew Morton, Al Viro,
	Hugh Dickins, Wu Fengguang, Jan Kara, Mel Gorman, linux-mm,
	Andi Kleen, Matthew Wilcox, Kirill A. Shutemov, linux-fsdevel,
	linux-kernel

Hillf Danton wrote:
> On Fri, Mar 15, 2013 at 1:50 AM, Kirill A. Shutemov
> <kirill.shutemov@linux.intel.com> wrote:
> >
> > There's only one caller of do_generic_file_read() and the only actor is
> > file_read_actor(). No reason to have a callback parameter.
> >
> This cleanup is not urgent if it nukes no barrier for THP cache.

Yes, it's not urgent. On the other hand, it can be applied upstream right now ;)

-- 
 Kirill A. Shutemov

^ permalink raw reply	[flat|nested] 243+ messages in thread

* Re: [PATCHv2, RFC 08/30] thp, mm: rewrite add_to_page_cache_locked() to support huge pages
  2013-03-15  1:30     ` Hillf Danton
@ 2013-03-15 13:23       ` Kirill A. Shutemov
  -1 siblings, 0 replies; 243+ messages in thread
From: Kirill A. Shutemov @ 2013-03-15 13:23 UTC (permalink / raw)
  To: Hillf Danton
  Cc: Kirill A. Shutemov, Andrea Arcangeli, Andrew Morton, Al Viro,
	Hugh Dickins, Wu Fengguang, Jan Kara, Mel Gorman, linux-mm,
	Andi Kleen, Matthew Wilcox, Kirill A. Shutemov, linux-fsdevel,
	linux-kernel

Hillf Danton wrote:
> On Fri, Mar 15, 2013 at 1:50 AM, Kirill A. Shutemov
> <kirill.shutemov@linux.intel.com> wrote:
> > +       page_cache_get(page);
> > +       spin_lock_irq(&mapping->tree_lock);
> > +       page->mapping = mapping;
> > +       page->index = offset;
> > +       error = radix_tree_insert(&mapping->page_tree, offset, page);
> > +       if (unlikely(error))
> > +               goto err;
> > +       if (PageTransHuge(page)) {
> > +               int i;
> > +               for (i = 1; i < HPAGE_CACHE_NR; i++) {
> 			struct page *tail = page + i; to easy reader
> 
> > +                       page_cache_get(page + i);
> s/page_cache_get/get_page_foll/ ?

Why?

> > +                       page[i].index = offset + i;
> > +                       error = radix_tree_insert(&mapping->page_tree,
> > +                                       offset + i, page + i);
> > +                       if (error) {
> > +                               page_cache_release(page + i);
> > +                               break;
> > +                       }
> >                 }
> > -               radix_tree_preload_end();
> > -       } else
> > -               mem_cgroup_uncharge_cache_page(page);
> > -out:
> > +               if (error) {
> > +                       error = ENOSPC; /* no space for a huge page */
> s/E/-E/

Good catch! Thanks.

> > +                       for (i--; i > 0; i--) {
> > +                               radix_tree_delete(&mapping->page_tree,
> > +                                               offset + i);
> > +                               page_cache_release(page + i);
> > +                       }
> > +                       radix_tree_delete(&mapping->page_tree, offset);
> > +                       goto err;
> > +               }
> > +       }

-- 
 Kirill A. Shutemov

^ permalink raw reply	[flat|nested] 243+ messages in thread

* Re: [PATCHv2, RFC 09/30] thp, mm: rewrite delete_from_page_cache() to support huge pages
  2013-03-15  2:25     ` Hillf Danton
@ 2013-03-15 13:23       ` Kirill A. Shutemov
  -1 siblings, 0 replies; 243+ messages in thread
From: Kirill A. Shutemov @ 2013-03-15 13:23 UTC (permalink / raw)
  To: Hillf Danton
  Cc: Kirill A. Shutemov, Andrea Arcangeli, Andrew Morton, Al Viro,
	Hugh Dickins, Wu Fengguang, Jan Kara, Mel Gorman, linux-mm,
	Andi Kleen, Matthew Wilcox, Kirill A. Shutemov, linux-fsdevel,
	linux-kernel

Hillf Danton wrote:
> On Fri, Mar 15, 2013 at 1:50 AM, Kirill A. Shutemov
> <kirill.shutemov@linux.intel.com> wrote:
> > +       if (PageTransHuge(page)) {
> > +               int i;
> > +
> > +               for (i = 0; i < HPAGE_CACHE_NR; i++)
> > +                       radix_tree_delete(&mapping->page_tree, page->index + i);
> 
> Move the below page_cache_release for tail page here, please.

Okay. Thanks.

> > +               nr = HPAGE_CACHE_NR;
> [...]
> > +       if (PageTransHuge(page))
> > +               for (i = 1; i < HPAGE_CACHE_NR; i++)
> > +                       page_cache_release(page + i);
> >         page_cache_release(page);
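
For illustration, the reworked hunk with the tail-page release moved next to
the radix tree delete might look like this (a sketch only, not the actual
follow-up patch):

	if (PageTransHuge(page)) {
		int i;

		for (i = 0; i < HPAGE_CACHE_NR; i++) {
			radix_tree_delete(&mapping->page_tree,
					page->index + i);
			/* the head page is still released at the end, as before */
			if (i)
				page_cache_release(page + i);
		}
		nr = HPAGE_CACHE_NR;
	}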

-- 
 Kirill A. Shutemov

^ permalink raw reply	[flat|nested] 243+ messages in thread

* Re: [PATCHv2, RFC 13/30] thp, mm: implement grab_cache_huge_page_write_begin()
  2013-03-15  2:34     ` Hillf Danton
@ 2013-03-15 13:24       ` Kirill A. Shutemov
  -1 siblings, 0 replies; 243+ messages in thread
From: Kirill A. Shutemov @ 2013-03-15 13:24 UTC (permalink / raw)
  To: Hillf Danton
  Cc: Kirill A. Shutemov, Andrea Arcangeli, Andrew Morton, Al Viro,
	Hugh Dickins, Wu Fengguang, Jan Kara, Mel Gorman, linux-mm,
	Andi Kleen, Matthew Wilcox, Kirill A. Shutemov, linux-fsdevel,
	linux-kernel

Hillf Danton wrote:
> On Fri, Mar 15, 2013 at 1:50 AM, Kirill A. Shutemov
> <kirill.shutemov@linux.intel.com> wrote:
> > +#ifdef CONFIG_TRANSPARENT_HUGEPAGE
> > +struct page *grab_cache_huge_page_write_begin(struct address_space *mapping,
> > +                       pgoff_t index, unsigned flags);
> > +#else
> > +static inline struct page *grab_cache_huge_page_write_begin(
> > +               struct address_space *mapping, pgoff_t index, unsigned flags)
> > +{
> build bug?

Hm?. No. Why?

> > +       return NULL;
> > +}
> > +#endif
> >
> btw, how about grab_thp_write_begin?

Sounds better, thanks.

-- 
 Kirill A. Shutemov

^ permalink raw reply	[flat|nested] 243+ messages in thread

* Re: [PATCHv2, RFC 08/30] thp, mm: rewrite add_to_page_cache_locked() to support huge pages
  2013-03-15 13:23       ` Kirill A. Shutemov
@ 2013-03-15 13:25         ` Hillf Danton
  -1 siblings, 0 replies; 243+ messages in thread
From: Hillf Danton @ 2013-03-15 13:25 UTC (permalink / raw)
  To: Kirill A. Shutemov
  Cc: Andrea Arcangeli, Andrew Morton, Al Viro, Hugh Dickins,
	Wu Fengguang, Jan Kara, Mel Gorman, linux-mm, Andi Kleen,
	Matthew Wilcox, Kirill A. Shutemov, linux-fsdevel, linux-kernel

On Fri, Mar 15, 2013 at 9:23 PM, Kirill A. Shutemov
<kirill.shutemov@linux.intel.com> wrote:
> Hillf Danton wrote:
>> On Fri, Mar 15, 2013 at 1:50 AM, Kirill A. Shutemov
>> <kirill.shutemov@linux.intel.com> wrote:
>> > +       page_cache_get(page);
>> > +       spin_lock_irq(&mapping->tree_lock);
>> > +       page->mapping = mapping;
>> > +       page->index = offset;
>> > +       error = radix_tree_insert(&mapping->page_tree, offset, page);
>> > +       if (unlikely(error))
>> > +               goto err;
>> > +       if (PageTransHuge(page)) {
>> > +               int i;
>> > +               for (i = 1; i < HPAGE_CACHE_NR; i++) {
>>                       struct page *tail = page + i; to easy reader
>>
>> > +                       page_cache_get(page + i);
>> s/page_cache_get/get_page_foll/ ?
>
> Why?
>
see follow_trans_huge_pmd() please.

^ permalink raw reply	[flat|nested] 243+ messages in thread

* Re: [PATCHv2, RFC 16/30] thp: handle file pages in split_huge_page()
  2013-03-15  6:15     ` Hillf Danton
@ 2013-03-15 13:26       ` Kirill A. Shutemov
  -1 siblings, 0 replies; 243+ messages in thread
From: Kirill A. Shutemov @ 2013-03-15 13:26 UTC (permalink / raw)
  To: Hillf Danton
  Cc: Kirill A. Shutemov, Andrea Arcangeli, Andrew Morton, Al Viro,
	Hugh Dickins, Wu Fengguang, Jan Kara, Mel Gorman, linux-mm,
	Andi Kleen, Matthew Wilcox, Kirill A. Shutemov, linux-fsdevel,
	linux-kernel

Hillf Danton wrote:
> On Fri, Mar 15, 2013 at 1:50 AM, Kirill A. Shutemov
> <kirill.shutemov@linux.intel.com> wrote:
> > -int split_huge_page(struct page *page)
> > +static int split_anon_huge_page(struct page *page)
> >  {
> >         struct anon_vma *anon_vma;
> >         int ret = 1;
> >
> > -       BUG_ON(is_huge_zero_pfn(page_to_pfn(page)));
> > -       BUG_ON(!PageAnon(page));
> > -
> deleted, why?

split_anon_huge_page() should only be called from split_huge_page().
Probably I could bring it back, but it's kinda redundant.
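
For context, a possible shape of the dispatch, with the dropped checks kept
in the entry point (split_file_huge_page() is a hypothetical name used here
only for illustration; the actual patch may differ):

	int split_huge_page(struct page *page)
	{
		BUG_ON(is_huge_zero_pfn(page_to_pfn(page)));
		if (PageAnon(page))
			return split_anon_huge_page(page);
		return split_file_huge_page(page);
	}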

-- 
 Kirill A. Shutemov

^ permalink raw reply	[flat|nested] 243+ messages in thread

* Re: [PATCHv2, RFC 14/30] thp, mm: naive support of thp in generic read/write routines
  2013-03-15  3:11     ` Hillf Danton
@ 2013-03-15 13:27       ` Kirill A. Shutemov
  -1 siblings, 0 replies; 243+ messages in thread
From: Kirill A. Shutemov @ 2013-03-15 13:27 UTC (permalink / raw)
  To: Hillf Danton
  Cc: Kirill A. Shutemov, Andrea Arcangeli, Andrew Morton, Al Viro,
	Hugh Dickins, Wu Fengguang, Jan Kara, Mel Gorman, linux-mm,
	Andi Kleen, Matthew Wilcox, Kirill A. Shutemov, linux-fsdevel,
	linux-kernel

Hillf Danton wrote:
> On Fri, Mar 15, 2013 at 1:50 AM, Kirill A. Shutemov
> <kirill.shutemov@linux.intel.com> wrote:
> > +               if (PageTransTail(page)) {
> > +                       page_cache_release(page);
> > +                       page = find_get_page(mapping,
> > +                                       index & ~HPAGE_CACHE_INDEX_MASK);
> check page is valid, please.

Good catch. Fixed.

> > +                       if (!PageTransHuge(page)) {
> > +                               page_cache_release(page);
> > +                               goto find_page;
> > +                       }
> > +               }
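
One possible shape of the fix, adding the missing validity check after the
second lookup (a sketch, not the actual follow-up):

	if (PageTransTail(page)) {
		page_cache_release(page);
		page = find_get_page(mapping,
				index & ~HPAGE_CACHE_INDEX_MASK);
		if (!page) {
			/* the head page went away under us, start over */
			goto find_page;
		}
		if (!PageTransHuge(page)) {
			page_cache_release(page);
			goto find_page;
		}
	}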

-- 
 Kirill A. Shutemov

^ permalink raw reply	[flat|nested] 243+ messages in thread

* Re: [PATCHv2, RFC 19/30] thp, mm: split huge page on mmap file page
  2013-03-15  6:58     ` Hillf Danton
@ 2013-03-15 13:29       ` Kirill A. Shutemov
  -1 siblings, 0 replies; 243+ messages in thread
From: Kirill A. Shutemov @ 2013-03-15 13:29 UTC (permalink / raw)
  To: Hillf Danton
  Cc: Kirill A. Shutemov, Andrea Arcangeli, Andrew Morton, Al Viro,
	Hugh Dickins, Wu Fengguang, Jan Kara, Mel Gorman, linux-mm,
	Andi Kleen, Matthew Wilcox, Kirill A. Shutemov, linux-fsdevel,
	linux-kernel

Hillf Danton wrote:
> On Fri, Mar 15, 2013 at 1:50 AM, Kirill A. Shutemov
> <kirill.shutemov@linux.intel.com> wrote:
> >
> > We are not ready to mmap file-backed transparent huge pages.
> >
> It is not on todo list either.

Actually, the following patches implement mmap() for file-backed THP, and this
split_huge_page() will only catch fallback cases.

-- 
 Kirill A. Shutemov

^ permalink raw reply	[flat|nested] 243+ messages in thread

* Re: [PATCHv2, RFC 23/30] thp: prepare zap_huge_pmd() to uncharge file pages
  2013-03-15  7:09     ` Hillf Danton
@ 2013-03-15 13:30       ` Kirill A. Shutemov
  -1 siblings, 0 replies; 243+ messages in thread
From: Kirill A. Shutemov @ 2013-03-15 13:30 UTC (permalink / raw)
  To: Hillf Danton
  Cc: Kirill A. Shutemov, Andrea Arcangeli, Andrew Morton, Al Viro,
	Hugh Dickins, Wu Fengguang, Jan Kara, Mel Gorman, linux-mm,
	Andi Kleen, Matthew Wilcox, Kirill A. Shutemov, linux-fsdevel,
	linux-kernel

Hillf Danton wrote:
> On Fri, Mar 15, 2013 at 1:50 AM, Kirill A. Shutemov
> <kirill.shutemov@linux.intel.com> wrote:
> > From: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
> >
> > Uncharge pages from correct counter.
> >
> > Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
> > ---
> >  mm/huge_memory.c |    4 +++-
> >  1 file changed, 3 insertions(+), 1 deletion(-)
> >
> > diff --git a/mm/huge_memory.c b/mm/huge_memory.c
> > index a23da8b..34e0385 100644
> > --- a/mm/huge_memory.c
> > +++ b/mm/huge_memory.c
> > @@ -1368,10 +1368,12 @@ int zap_huge_pmd(struct mmu_gather *tlb, struct vm_area_struct *vma,
> >                         spin_unlock(&tlb->mm->page_table_lock);
> >                         put_huge_zero_page();
> >                 } else {
> > +                       int counter;
> s/counter/item/ ?

I have seen 'member' used elsewhere, so I'll rename it to 'member'.

> >                         page = pmd_page(orig_pmd);
> >                         page_remove_rmap(page);
> >                         VM_BUG_ON(page_mapcount(page) < 0);
> > -                       add_mm_counter(tlb->mm, MM_ANONPAGES, -HPAGE_PMD_NR);
> > +                       counter = PageAnon(page) ? MM_ANONPAGES : MM_FILEPAGES;
> > +                       add_mm_counter(tlb->mm, counter, -HPAGE_PMD_NR);
> >                         VM_BUG_ON(!PageHead(page));
> >                         tlb->mm->nr_ptes--;
> >                         spin_unlock(&tlb->mm->page_table_lock);
> > --
> > 1.7.10.4
> >

-- 
 Kirill A. Shutemov

^ permalink raw reply	[flat|nested] 243+ messages in thread

* Re: [PATCHv2, RFC 13/30] thp, mm: implement grab_cache_huge_page_write_begin()
  2013-03-15 13:24       ` Kirill A. Shutemov
@ 2013-03-15 13:30         ` Hillf Danton
  -1 siblings, 0 replies; 243+ messages in thread
From: Hillf Danton @ 2013-03-15 13:30 UTC (permalink / raw)
  To: Kirill A. Shutemov
  Cc: Andrea Arcangeli, Andrew Morton, Al Viro, Hugh Dickins,
	Wu Fengguang, Jan Kara, Mel Gorman, linux-mm, Andi Kleen,
	Matthew Wilcox, Kirill A. Shutemov, linux-fsdevel, linux-kernel

On Fri, Mar 15, 2013 at 9:24 PM, Kirill A. Shutemov
<kirill.shutemov@linux.intel.com> wrote:
> Hillf Danton wrote:
>> On Fri, Mar 15, 2013 at 1:50 AM, Kirill A. Shutemov
>> <kirill.shutemov@linux.intel.com> wrote:
>> > +#ifdef CONFIG_TRANSPARENT_HUGEPAGE
>> > +struct page *grab_cache_huge_page_write_begin(struct address_space *mapping,
>> > +                       pgoff_t index, unsigned flags);
>> > +#else
>> > +static inline struct page *grab_cache_huge_page_write_begin(
>> > +               struct address_space *mapping, pgoff_t index, unsigned flags)
>> > +{
>> build bug?
>
> Hm?. No. Why?
>
Stop build if THP not configured?

^ permalink raw reply	[flat|nested] 243+ messages in thread

* Re: [PATCHv2, RFC 25/30] thp, mm: basic huge_fault implementation for generic_file_vm_ops
  2013-03-15  7:44     ` Hillf Danton
@ 2013-03-15 13:30       ` Kirill A. Shutemov
  -1 siblings, 0 replies; 243+ messages in thread
From: Kirill A. Shutemov @ 2013-03-15 13:30 UTC (permalink / raw)
  To: Hillf Danton
  Cc: Kirill A. Shutemov, Andrea Arcangeli, Andrew Morton, Al Viro,
	Hugh Dickins, Wu Fengguang, Jan Kara, Mel Gorman, linux-mm,
	Andi Kleen, Matthew Wilcox, Kirill A. Shutemov, linux-fsdevel,
	linux-kernel

Hillf Danton wrote:
> On Fri, Mar 15, 2013 at 1:50 AM, Kirill A. Shutemov
> <kirill.shutemov@linux.intel.com> wrote:
> >
> > +#ifdef CONFIG_TRANSPARENT_HUGEPAGE
> > +static int filemap_huge_fault(struct vm_area_struct *vma, struct vm_fault *vmf)
> > +{
> > +       struct file *file = vma->vm_file;
> > +       struct address_space *mapping = file->f_mapping;
> > +       struct inode *inode = mapping->host;
> > +       pgoff_t size, offset = vmf->pgoff;
> > +       unsigned long address = (unsigned long) vmf->virtual_address;
> > +       struct page *page;
> > +       int ret = 0;
> > +
> > +       BUG_ON(((address >> PAGE_SHIFT) & HPAGE_CACHE_INDEX_MASK) !=
> > +                       (offset & HPAGE_CACHE_INDEX_MASK));
> > +
> > +retry:
> > +       page = find_get_page(mapping, offset);
> > +       if (!page) {
> > +               gfp_t gfp_mask = mapping_gfp_mask(mapping) | __GFP_COLD;
> > +               page = alloc_pages(gfp_mask, HPAGE_PMD_ORDER);
> s/pages/pages_vma/ ?

Fixed. Thanks.

> > +               if (!page) {
> > +                       count_vm_event(THP_FAULT_FALLBACK);
> > +                       return VM_FAULT_OOM;
> > +               }
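
With the suggested change, the allocation would take the VMA and fault address
into account so the task's NUMA policy is honoured; roughly like this (sketch
only, the exact arguments are illustrative and not taken from the fixed patch):

	/* the PMD-aligned fault address is assumed here */
	page = alloc_pages_vma(gfp_mask, HPAGE_PMD_ORDER, vma,
			address & HPAGE_PMD_MASK, numa_node_id());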

-- 
 Kirill A. Shutemov

^ permalink raw reply	[flat|nested] 243+ messages in thread

* Re: [PATCHv2, RFC 16/30] thp: handle file pages in split_huge_page()
  2013-03-15 13:26       ` Kirill A. Shutemov
@ 2013-03-15 13:33         ` Hillf Danton
  -1 siblings, 0 replies; 243+ messages in thread
From: Hillf Danton @ 2013-03-15 13:33 UTC (permalink / raw)
  To: Kirill A. Shutemov
  Cc: Andrea Arcangeli, Andrew Morton, Al Viro, Hugh Dickins,
	Wu Fengguang, Jan Kara, Mel Gorman, linux-mm, Andi Kleen,
	Matthew Wilcox, Kirill A. Shutemov, linux-fsdevel, linux-kernel

On Fri, Mar 15, 2013 at 9:26 PM, Kirill A. Shutemov
<kirill.shutemov@linux.intel.com> wrote:
> Hillf Danton wrote:
>> On Fri, Mar 15, 2013 at 1:50 AM, Kirill A. Shutemov
>> <kirill.shutemov@linux.intel.com> wrote:
>> > -int split_huge_page(struct page *page)
>> > +static int split_anon_huge_page(struct page *page)
>> >  {
>> >         struct anon_vma *anon_vma;
>> >         int ret = 1;
>> >
>> > -       BUG_ON(is_huge_zero_pfn(page_to_pfn(page)));
>> > -       BUG_ON(!PageAnon(page));
>> > -
>> deleted, why?
>
> split_anon_huge_page() should only be called from split_huge_page().
> Probably I could bring it back, but it's kinda redundant.
>
Ok, no more question.

^ permalink raw reply	[flat|nested] 243+ messages in thread

* Re: [PATCHv2, RFC 00/30] Transparent huge page cache
  2013-03-15  0:33   ` Hillf Danton
@ 2013-03-15 13:33     ` Kirill A. Shutemov
  -1 siblings, 0 replies; 243+ messages in thread
From: Kirill A. Shutemov @ 2013-03-15 13:33 UTC (permalink / raw)
  To: Hillf Danton
  Cc: Kirill A. Shutemov, Andrea Arcangeli, Andrew Morton, Al Viro,
	Hugh Dickins, Wu Fengguang, Jan Kara, Mel Gorman, linux-mm,
	Andi Kleen, Matthew Wilcox, Kirill A. Shutemov, linux-fsdevel,
	linux-kernel

Hillf Danton wrote:
> On Fri, Mar 15, 2013 at 1:50 AM, Kirill A. Shutemov
> <kirill.shutemov@linux.intel.com> wrote:
> >
> > Here's the second version of the patchset.
> >
> > The intend of the work is get code ready to enable transparent huge page
> > cache for the most simple fs -- ramfs.
> >
> Where is your git tree including THP cache?

Here: git://git.kernel.org/pub/scm/linux/kernel/git/kas/linux.git

The tag thp/pagecache/v2 represents the patchset.
The thp/pagecache branch has fixes applied according to your feedback.

-- 
 Kirill A. Shutemov

^ permalink raw reply	[flat|nested] 243+ messages in thread

* Re: [PATCHv2, RFC 13/30] thp, mm: implement grab_cache_huge_page_write_begin()
  2013-03-15 13:30         ` Hillf Danton
@ 2013-03-15 13:35           ` Kirill A. Shutemov
  -1 siblings, 0 replies; 243+ messages in thread
From: Kirill A. Shutemov @ 2013-03-15 13:35 UTC (permalink / raw)
  To: Hillf Danton
  Cc: Kirill A. Shutemov, Andrea Arcangeli, Andrew Morton, Al Viro,
	Hugh Dickins, Wu Fengguang, Jan Kara, Mel Gorman, linux-mm,
	Andi Kleen, Matthew Wilcox, Kirill A. Shutemov, linux-fsdevel,
	linux-kernel

Hillf Danton wrote:
> On Fri, Mar 15, 2013 at 9:24 PM, Kirill A. Shutemov
> <kirill.shutemov@linux.intel.com> wrote:
> > Hillf Danton wrote:
> >> On Fri, Mar 15, 2013 at 1:50 AM, Kirill A. Shutemov
> >> <kirill.shutemov@linux.intel.com> wrote:
> >> > +#ifdef CONFIG_TRANSPARENT_HUGEPAGE
> >> > +struct page *grab_cache_huge_page_write_begin(struct address_space *mapping,
> >> > +                       pgoff_t index, unsigned flags);
> >> > +#else
> >> > +static inline struct page *grab_cache_huge_page_write_begin(
> >> > +               struct address_space *mapping, pgoff_t index, unsigned flags)
> >> > +{
> >> build bug?
> >
> > Hm?. No. Why?
> >
> Stop build if THP not configured?

No. I've tested it without CONFIG_TRANSPARENT_HUGEPAGE.

-- 
 Kirill A. Shutemov

^ permalink raw reply	[flat|nested] 243+ messages in thread

* Re: [PATCHv2, RFC 19/30] thp, mm: split huge page on mmap file page
  2013-03-15 13:29       ` Kirill A. Shutemov
@ 2013-03-15 13:35         ` Hillf Danton
  -1 siblings, 0 replies; 243+ messages in thread
From: Hillf Danton @ 2013-03-15 13:35 UTC (permalink / raw)
  To: Kirill A. Shutemov
  Cc: Andrea Arcangeli, Andrew Morton, Al Viro, Hugh Dickins,
	Wu Fengguang, Jan Kara, Mel Gorman, linux-mm, Andi Kleen,
	Matthew Wilcox, Kirill A. Shutemov, linux-fsdevel, linux-kernel

On Fri, Mar 15, 2013 at 9:29 PM, Kirill A. Shutemov
<kirill.shutemov@linux.intel.com> wrote:
> Hillf Danton wrote:
>> On Fri, Mar 15, 2013 at 1:50 AM, Kirill A. Shutemov
>> <kirill.shutemov@linux.intel.com> wrote:
>> >
>> > We are not ready to mmap file-backed transparent huge pages.
>> >
>> It is not on todo list either.
>
> Actually, following patches implement mmap for file-backed thp and this
> split_huge_page() will catch only fallback cases.
>
I wonder whether the effort we put into the THP cache is nuked by mmap().

^ permalink raw reply	[flat|nested] 243+ messages in thread

* Re: [PATCHv2, RFC 13/30] thp, mm: implement grab_cache_huge_page_write_begin()
  2013-03-15 13:35           ` Kirill A. Shutemov
@ 2013-03-15 13:37             ` Hillf Danton
  -1 siblings, 0 replies; 243+ messages in thread
From: Hillf Danton @ 2013-03-15 13:37 UTC (permalink / raw)
  To: Kirill A. Shutemov
  Cc: Andrea Arcangeli, Andrew Morton, Al Viro, Hugh Dickins,
	Wu Fengguang, Jan Kara, Mel Gorman, linux-mm, Andi Kleen,
	Matthew Wilcox, Kirill A. Shutemov, linux-fsdevel, linux-kernel

On Fri, Mar 15, 2013 at 9:35 PM, Kirill A. Shutemov
<kirill.shutemov@linux.intel.com> wrote:
> Hillf Danton wrote:
>> On Fri, Mar 15, 2013 at 9:24 PM, Kirill A. Shutemov
>> <kirill.shutemov@linux.intel.com> wrote:
>> > Hillf Danton wrote:
>> >> On Fri, Mar 15, 2013 at 1:50 AM, Kirill A. Shutemov
>> >> <kirill.shutemov@linux.intel.com> wrote:
>> >> > +#ifdef CONFIG_TRANSPARENT_HUGEPAGE
>> >> > +struct page *grab_cache_huge_page_write_begin(struct address_space *mapping,
>> >> > +                       pgoff_t index, unsigned flags);
>> >> > +#else
>> >> > +static inline struct page *grab_cache_huge_page_write_begin(
>> >> > +               struct address_space *mapping, pgoff_t index, unsigned flags)
>> >> > +{
>> >> build bug?
>> >
>> > Hm?. No. Why?
>> >
>> Stop build if THP not configured?
>
> No. I've tested it without CONFIG_TRANSPARENT_HUGEPAGE.
>
OK, I see.

thanks
Hillf

^ permalink raw reply	[flat|nested] 243+ messages in thread

* Re: [PATCHv2, RFC 19/30] thp, mm: split huge page on mmap file page
  2013-03-15 13:35         ` Hillf Danton
@ 2013-03-15 13:45           ` Kirill A. Shutemov
  -1 siblings, 0 replies; 243+ messages in thread
From: Kirill A. Shutemov @ 2013-03-15 13:45 UTC (permalink / raw)
  To: Hillf Danton
  Cc: Kirill A. Shutemov, Andrea Arcangeli, Andrew Morton, Al Viro,
	Hugh Dickins, Wu Fengguang, Jan Kara, Mel Gorman, linux-mm,
	Andi Kleen, Matthew Wilcox, Kirill A. Shutemov, linux-fsdevel,
	linux-kernel

Hillf Danton wrote:
> On Fri, Mar 15, 2013 at 9:29 PM, Kirill A. Shutemov
> <kirill.shutemov@linux.intel.com> wrote:
> > Hillf Danton wrote:
> >> On Fri, Mar 15, 2013 at 1:50 AM, Kirill A. Shutemov
> >> <kirill.shutemov@linux.intel.com> wrote:
> >> >
> >> > We are not ready to mmap file-backed transparent huge pages.
> >> >
> >> It is not on todo list either.
> >
> > Actually, following patches implement mmap for file-backed thp and this
> > split_huge_page() will catch only fallback cases.
> >
> I wonder if the effort we pay for THP cache is nuked by mmap.

It will not be nuked after patch 30/30.

Actually, it will be split only if a process tries to map a hugepage with
unsuitable alignment.
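
For reference, "suitable alignment" here means the virtual address and the
file offset agree modulo the huge page size, matching the BUG_ON() in
filemap_huge_fault(); a hypothetical helper expressing that check (the name
is made up for illustration):

	static inline bool thp_file_mapping_aligned(unsigned long address,
						    pgoff_t pgoff)
	{
		return ((address >> PAGE_SHIFT) & HPAGE_CACHE_INDEX_MASK) ==
			(pgoff & HPAGE_CACHE_INDEX_MASK);
	}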

-- 
 Kirill A. Shutemov

^ permalink raw reply	[flat|nested] 243+ messages in thread

* Re: [PATCHv2, RFC 08/30] thp, mm: rewrite add_to_page_cache_locked() to support huge pages
  2013-03-15 13:25         ` Hillf Danton
@ 2013-03-15 13:50           ` Kirill A. Shutemov
  -1 siblings, 0 replies; 243+ messages in thread
From: Kirill A. Shutemov @ 2013-03-15 13:50 UTC (permalink / raw)
  To: Hillf Danton
  Cc: Kirill A. Shutemov, Andrea Arcangeli, Andrew Morton, Al Viro,
	Hugh Dickins, Wu Fengguang, Jan Kara, Mel Gorman, linux-mm,
	Andi Kleen, Matthew Wilcox, Kirill A. Shutemov, linux-fsdevel,
	linux-kernel

Hillf Danton wrote:
> On Fri, Mar 15, 2013 at 9:23 PM, Kirill A. Shutemov
> <kirill.shutemov@linux.intel.com> wrote:
> > Hillf Danton wrote:
> >> On Fri, Mar 15, 2013 at 1:50 AM, Kirill A. Shutemov
> >> <kirill.shutemov@linux.intel.com> wrote:
> >> > +       page_cache_get(page);
> >> > +       spin_lock_irq(&mapping->tree_lock);
> >> > +       page->mapping = mapping;
> >> > +       page->index = offset;
> >> > +       error = radix_tree_insert(&mapping->page_tree, offset, page);
> >> > +       if (unlikely(error))
> >> > +               goto err;
> >> > +       if (PageTransHuge(page)) {
> >> > +               int i;
> >> > +               for (i = 1; i < HPAGE_CACHE_NR; i++) {
> >>                       struct page *tail = page + i; to easy reader
> >>
> >> > +                       page_cache_get(page + i);
> >> s/page_cache_get/get_page_foll/ ?
> >
> > Why?
> >
> see follow_trans_huge_pmd() please.

Sorry, I fail to see how follow_trans_huge_pmd() is relevant here.
Could you elaborate?

-- 
 Kirill A. Shutemov

^ permalink raw reply	[flat|nested] 243+ messages in thread

* Re: [PATCHv2, RFC 08/30] thp, mm: rewrite add_to_page_cache_locked() to support huge pages
  2013-03-15 13:50           ` Kirill A. Shutemov
@ 2013-03-15 13:55             ` Hillf Danton
  -1 siblings, 0 replies; 243+ messages in thread
From: Hillf Danton @ 2013-03-15 13:55 UTC (permalink / raw)
  To: Kirill A. Shutemov
  Cc: Andrea Arcangeli, Andrew Morton, Al Viro, Hugh Dickins,
	Wu Fengguang, Jan Kara, Mel Gorman, linux-mm, Andi Kleen,
	Matthew Wilcox, Kirill A. Shutemov, linux-fsdevel, linux-kernel

On Fri, Mar 15, 2013 at 9:50 PM, Kirill A. Shutemov
<kirill.shutemov@linux.intel.com> wrote:
> Hillf Danton wrote:
>> On Fri, Mar 15, 2013 at 9:23 PM, Kirill A. Shutemov
>> <kirill.shutemov@linux.intel.com> wrote:
>> > Hillf Danton wrote:
>> >> On Fri, Mar 15, 2013 at 1:50 AM, Kirill A. Shutemov
>> >> <kirill.shutemov@linux.intel.com> wrote:
>> >> > +       page_cache_get(page);
>> >> > +       spin_lock_irq(&mapping->tree_lock);
>> >> > +       page->mapping = mapping;
>> >> > +       page->index = offset;
>> >> > +       error = radix_tree_insert(&mapping->page_tree, offset, page);
>> >> > +       if (unlikely(error))
>> >> > +               goto err;
>> >> > +       if (PageTransHuge(page)) {
>> >> > +               int i;
>> >> > +               for (i = 1; i < HPAGE_CACHE_NR; i++) {
>> >>                       struct page *tail = page + i; to easy reader
>> >>
>> >> > +                       page_cache_get(page + i);
>> >> s/page_cache_get/get_page_foll/ ?
>> >
>> > Why?
>> >
>> see follow_trans_huge_pmd() please.
>
> Sorry, I fail to see how follow_trans_huge_pmd() is relevant here.
> Could you elaborate?
>
Lets see the code

	page += (addr & ~HPAGE_PMD_MASK) >> PAGE_SHIFT;
//page is tail now
	VM_BUG_ON(!PageCompound(page));
	if (flags & FOLL_GET)
		get_page_foll(page);
//raise page count with the foll function

thus I raised the question.

^ permalink raw reply	[flat|nested] 243+ messages in thread

* Re: [PATCHv2, RFC 08/30] thp, mm: rewrite add_to_page_cache_locked() to support huge pages
  2013-03-15 13:55             ` Hillf Danton
@ 2013-03-15 15:05               ` Kirill A. Shutemov
  -1 siblings, 0 replies; 243+ messages in thread
From: Kirill A. Shutemov @ 2013-03-15 15:05 UTC (permalink / raw)
  To: Hillf Danton
  Cc: Kirill A. Shutemov, Andrea Arcangeli, Andrew Morton, Al Viro,
	Hugh Dickins, Wu Fengguang, Jan Kara, Mel Gorman, linux-mm,
	Andi Kleen, Matthew Wilcox, Kirill A. Shutemov, linux-fsdevel,
	linux-kernel

Hillf Danton wrote:
> On Fri, Mar 15, 2013 at 9:50 PM, Kirill A. Shutemov
> <kirill.shutemov@linux.intel.com> wrote:
> > Hillf Danton wrote:
> >> On Fri, Mar 15, 2013 at 9:23 PM, Kirill A. Shutemov
> >> <kirill.shutemov@linux.intel.com> wrote:
> >> > Hillf Danton wrote:
> >> >> On Fri, Mar 15, 2013 at 1:50 AM, Kirill A. Shutemov
> >> >> <kirill.shutemov@linux.intel.com> wrote:
> >> >> > +       page_cache_get(page);
> >> >> > +       spin_lock_irq(&mapping->tree_lock);
> >> >> > +       page->mapping = mapping;
> >> >> > +       page->index = offset;
> >> >> > +       error = radix_tree_insert(&mapping->page_tree, offset, page);
> >> >> > +       if (unlikely(error))
> >> >> > +               goto err;
> >> >> > +       if (PageTransHuge(page)) {
> >> >> > +               int i;
> >> >> > +               for (i = 1; i < HPAGE_CACHE_NR; i++) {
> >> >>                       struct page *tail = page + i; to easy reader
> >> >>
> >> >> > +                       page_cache_get(page + i);
> >> >> s/page_cache_get/get_page_foll/ ?
> >> >
> >> > Why?
> >> >
> >> see follow_trans_huge_pmd() please.
> >
> > Sorry, I fail to see how follow_trans_huge_pmd() is relevant here.
> > Could you elaborate?
> >
> Lets see the code
> 
> 	page += (addr & ~HPAGE_PMD_MASK) >> PAGE_SHIFT;
> //page is tail now
> 	VM_BUG_ON(!PageCompound(page));
> 	if (flags & FOLL_GET)
> 		get_page_foll(page);
> //raise page count with the foll function
> 
> thus I raised question.

get_page_foll() is designed to be part of the follow_page*() call chain.
get_page() can handle compound pages properly.
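
For reference, mainline get_page() of that era special-cases tail pages along
these lines (paraphrased from include/linux/mm.h, shown only to illustrate why
it is safe on compound pages):

	static inline void get_page(struct page *page)
	{
		if (unlikely(PageTail(page)))
			if (likely(__get_page_tail(page)))
				return;
		/* a normal page, or the head of a compound page */
		VM_BUG_ON(atomic_read(&page->_count) <= 0);
		atomic_inc(&page->_count);
	}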

-- 
 Kirill A. Shutemov

^ permalink raw reply	[flat|nested] 243+ messages in thread

* Re: [PATCHv2, RFC 00/30] Transparent huge page cache
  2013-03-14 17:50 ` Kirill A. Shutemov
@ 2013-03-18  4:03   ` Simon Jeons
  -1 siblings, 0 replies; 243+ messages in thread
From: Simon Jeons @ 2013-03-18  4:03 UTC (permalink / raw)
  To: Kirill A. Shutemov
  Cc: Andrea Arcangeli, Andrew Morton, Al Viro, Hugh Dickins,
	Wu Fengguang, Jan Kara, Mel Gorman, linux-mm, Andi Kleen,
	Matthew Wilcox, Kirill A. Shutemov, Hillf Danton, linux-fsdevel,
	linux-kernel

Hi Kirill,
On 03/15/2013 01:50 AM, Kirill A. Shutemov wrote:
> From: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
>
> Here's the second version of the patchset.
>
> The intend of the work is get code ready to enable transparent huge page
> cache for the most simple fs -- ramfs.
>
> We have read()/write()/mmap() functionality now. Still plenty work ahead.

One offline question.

Why is PG_mlocked set on page_tail when it is split out in
__split_huge_page_refcount()?

>
> Any feedback is welcome.
>
> Changes since v1:
>   - mmap();
>   - fix add_to_page_cache_locked() and delete_from_page_cache();
>   - introduce mapping_can_have_hugepages();
>   - call split_huge_page() only for head page in filemap_fault();
>   - wait_split_huge_page(): serialize over i_mmap_mutex too;
>   - lru_add_page_tail: avoid PageUnevictable on active/inactive lru lists;
>   - fix off-by-one in zero_huge_user_segment();
>   - THP_WRITE_ALLOC/THP_WRITE_FAILED counters;
>
> TODO:
>   - memcg accounting has not yet evaluated;
>   - collapse;
>   - migration (?);
>   - stats, knobs, etc.;
>   - tmpfs/shmem enabling;
>
>
> Kirill A. Shutemov (30):
>    block: implement add_bdi_stat()
>    mm: implement zero_huge_user_segment and friends
>    mm: drop actor argument of do_generic_file_read()
>    radix-tree: implement preload for multiple contiguous elements
>    thp, mm: avoid PageUnevictable on active/inactive lru lists
>    thp, mm: basic defines for transparent huge page cache
>    thp, mm: introduce mapping_can_have_hugepages() predicate
>    thp, mm: rewrite add_to_page_cache_locked() to support huge pages
>    thp, mm: rewrite delete_from_page_cache() to support huge pages
>    thp, mm: locking tail page is a bug
>    thp, mm: handle tail pages in page_cache_get_speculative()
>    thp, mm: add event counters for huge page alloc on write to a file
>    thp, mm: implement grab_cache_huge_page_write_begin()
>    thp, mm: naive support of thp in generic read/write routines
>    thp, libfs: initial support of thp in
>      simple_read/write_begin/write_end
>    thp: handle file pages in split_huge_page()
>    thp: wait_split_huge_page(): serialize over i_mmap_mutex too
>    thp, mm: truncate support for transparent huge page cache
>    thp, mm: split huge page on mmap file page
>    ramfs: enable transparent huge page cache
>    x86-64, mm: proper alignment mappings with hugepages
>    mm: add huge_fault() callback to vm_operations_struct
>    thp: prepare zap_huge_pmd() to uncharge file pages
>    thp: move maybe_pmd_mkwrite() out of mk_huge_pmd()
>    thp, mm: basic huge_fault implementation for generic_file_vm_ops
>    thp: extract fallback path from do_huge_pmd_anonymous_page() to a
>      function
>    thp: initial implementation of do_huge_linear_fault()
>    thp: handle write-protect exception to file-backed huge pages
>    thp: call __vma_adjust_trans_huge() for file-backed VMA
>    thp: map file-backed huge pages on fault
>
>   arch/x86/kernel/sys_x86_64.c  |   13 +-
>   fs/libfs.c                    |   50 ++++-
>   fs/ramfs/inode.c              |    6 +-
>   include/linux/backing-dev.h   |   10 +
>   include/linux/huge_mm.h       |   36 +++-
>   include/linux/mm.h            |   16 ++
>   include/linux/pagemap.h       |   24 ++-
>   include/linux/radix-tree.h    |    3 +
>   include/linux/vm_event_item.h |    2 +
>   lib/radix-tree.c              |   32 ++-
>   mm/filemap.c                  |  283 +++++++++++++++++++++----
>   mm/huge_memory.c              |  462 ++++++++++++++++++++++++++++++++++-------
>   mm/memory.c                   |   31 ++-
>   mm/swap.c                     |    3 +-
>   mm/truncate.c                 |   12 ++
>   mm/vmstat.c                   |    2 +
>   16 files changed, 842 insertions(+), 143 deletions(-)
>


^ permalink raw reply	[flat|nested] 243+ messages in thread


* Re: [PATCHv2, RFC 00/30] Transparent huge page cache
  2013-03-18  4:03   ` Simon Jeons
  (?)
@ 2013-03-18  5:23   ` Simon Jeons
  2013-03-18 11:19       ` Kirill A. Shutemov
  -1 siblings, 1 reply; 243+ messages in thread
From: Simon Jeons @ 2013-03-18  5:23 UTC (permalink / raw)
  To: Kirill A. Shutemov
  Cc: Andrea Arcangeli, Andrew Morton, Al Viro, Hugh Dickins,
	Wu Fengguang, Jan Kara, Mel Gorman, linux-mm, Andi Kleen,
	Matthew Wilcox, Kirill A. Shutemov, Hillf Danton, linux-fsdevel,
	linux-kernel


On 03/18/2013 12:03 PM, Simon Jeons wrote:
> Hi Kirill,
> On 03/15/2013 01:50 AM, Kirill A. Shutemov wrote:
>> From: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
>>
>> Here's the second version of the patchset.
>>
>> The intend of the work is get code ready to enable transparent huge page
>> cache for the most simple fs -- ramfs.
>>
>> We have read()/write()/mmap() functionality now. Still plenty work 
>> ahead.
>
> One offline question.
>
> Why is PG_mlocked set on page_tail, which is split out in 
> __split_huge_page_refcount?
>

Also, I can't find where the _PAGE_SPLITTING and _PAGE_PSE flags are 
cleared in the split_huge_page() path. Where does that happen?

>>
>> Any feedback is welcome.
>>
>> Changes since v1:
>>   - mmap();
>>   - fix add_to_page_cache_locked() and delete_from_page_cache();
>>   - introduce mapping_can_have_hugepages();
>>   - call split_huge_page() only for head page in filemap_fault();
>>   - wait_split_huge_page(): serialize over i_mmap_mutex too;
>>   - lru_add_page_tail: avoid PageUnevictable on active/inactive lru 
>> lists;
>>   - fix off-by-one in zero_huge_user_segment();
>>   - THP_WRITE_ALLOC/THP_WRITE_FAILED counters;
>>
>> TODO:
>>   - memcg accounting has not yet evaluated;
>>   - collapse;
>>   - migration (?);
>>   - stats, knobs, etc.;
>>   - tmpfs/shmem enabling;
>>
>>
>> Kirill A. Shutemov (30):
>>    block: implement add_bdi_stat()
>>    mm: implement zero_huge_user_segment and friends
>>    mm: drop actor argument of do_generic_file_read()
>>    radix-tree: implement preload for multiple contiguous elements
>>    thp, mm: avoid PageUnevictable on active/inactive lru lists
>>    thp, mm: basic defines for transparent huge page cache
>>    thp, mm: introduce mapping_can_have_hugepages() predicate
>>    thp, mm: rewrite add_to_page_cache_locked() to support huge pages
>>    thp, mm: rewrite delete_from_page_cache() to support huge pages
>>    thp, mm: locking tail page is a bug
>>    thp, mm: handle tail pages in page_cache_get_speculative()
>>    thp, mm: add event counters for huge page alloc on write to a file
>>    thp, mm: implement grab_cache_huge_page_write_begin()
>>    thp, mm: naive support of thp in generic read/write routines
>>    thp, libfs: initial support of thp in
>>      simple_read/write_begin/write_end
>>    thp: handle file pages in split_huge_page()
>>    thp: wait_split_huge_page(): serialize over i_mmap_mutex too
>>    thp, mm: truncate support for transparent huge page cache
>>    thp, mm: split huge page on mmap file page
>>    ramfs: enable transparent huge page cache
>>    x86-64, mm: proper alignment mappings with hugepages
>>    mm: add huge_fault() callback to vm_operations_struct
>>    thp: prepare zap_huge_pmd() to uncharge file pages
>>    thp: move maybe_pmd_mkwrite() out of mk_huge_pmd()
>>    thp, mm: basic huge_fault implementation for generic_file_vm_ops
>>    thp: extract fallback path from do_huge_pmd_anonymous_page() to a
>>      function
>>    thp: initial implementation of do_huge_linear_fault()
>>    thp: handle write-protect exception to file-backed huge pages
>>    thp: call __vma_adjust_trans_huge() for file-backed VMA
>>    thp: map file-backed huge pages on fault
>>
>>   arch/x86/kernel/sys_x86_64.c  |   13 +-
>>   fs/libfs.c                    |   50 ++++-
>>   fs/ramfs/inode.c              |    6 +-
>>   include/linux/backing-dev.h   |   10 +
>>   include/linux/huge_mm.h       |   36 +++-
>>   include/linux/mm.h            |   16 ++
>>   include/linux/pagemap.h       |   24 ++-
>>   include/linux/radix-tree.h    |    3 +
>>   include/linux/vm_event_item.h |    2 +
>>   lib/radix-tree.c              |   32 ++-
>>   mm/filemap.c                  |  283 +++++++++++++++++++++----
>>   mm/huge_memory.c              |  462 
>> ++++++++++++++++++++++++++++++++++-------
>>   mm/memory.c                   |   31 ++-
>>   mm/swap.c                     |    3 +-
>>   mm/truncate.c                 |   12 ++
>>   mm/vmstat.c                   |    2 +
>>   16 files changed, 842 insertions(+), 143 deletions(-)
>>
>



^ permalink raw reply	[flat|nested] 243+ messages in thread

* Re: [PATCHv2, RFC 00/30] Transparent huge page cache
  2013-03-18  5:23   ` Simon Jeons
@ 2013-03-18 11:19       ` Kirill A. Shutemov
  0 siblings, 0 replies; 243+ messages in thread
From: Kirill A. Shutemov @ 2013-03-18 11:19 UTC (permalink / raw)
  To: Simon Jeons
  Cc: Kirill A. Shutemov, Andrea Arcangeli, Andrew Morton, Al Viro,
	Hugh Dickins, Wu Fengguang, Jan Kara, Mel Gorman, linux-mm,
	Andi Kleen, Matthew Wilcox, Kirill A. Shutemov, Hillf Danton,
	linux-fsdevel, linux-kernel

Simon Jeons wrote:
> On 03/18/2013 12:03 PM, Simon Jeons wrote:
> > Hi Kirill,
> > On 03/15/2013 01:50 AM, Kirill A. Shutemov wrote:
> >> From: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
> >>
> >> Here's the second version of the patchset.
> >>
> >> The intend of the work is get code ready to enable transparent huge page
> >> cache for the most simple fs -- ramfs.
> >>
> >> We have read()/write()/mmap() functionality now. Still plenty work 
> >> ahead.
> >
> > One offline question.
> >
> > Why is PG_mlocked set on page_tail, which is split out in 
> > __split_huge_page_refcount?

It's not set, but copied from the head page. The head page represents the
up-to-date state of the compound page, and we need to copy it to all tail
pages on split.
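
The relevant hunk of __split_huge_page_refcount() looks roughly like this
(quoting from memory, the exact flag list may differ):

	page_tail->flags &= ~PAGE_FLAGS_CHECK_AT_PREP;
	page_tail->flags |= (page->flags &
			     ((1L << PG_referenced) |
			      (1L << PG_swapbacked) |
			      (1L << PG_mlocked) |
			      (1L << PG_uptodate)));

i.e. PG_mlocked ends up set on a tail page only if it was set on the head.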
 
> Also, I can't find where the _PAGE_SPLITTING and _PAGE_PSE flags are
> cleared in the split_huge_page() path. Where does that happen?

The pmd is invalidated and replaced with a reference to the page table at
the end of __split_huge_page_map().
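
I.e. the end of __split_huge_page_map() does, roughly:

	pmdp_invalidate(vma, address, pmd);
	pmd_populate(mm, pmd, pgtable);

so the huge pmd (and with it the PSE and splitting bits) is simply
replaced, not cleared bit by bit.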
 
> Another offline question:
> Why isn't the tail page's PG_tail flag cleared in
> __split_huge_page_refcount?

We do:

 page_tail->flags &= ~PAGE_FLAGS_CHECK_AT_PREP | __PG_HWPOISON;

-- 
 Kirill A. Shutemov

^ permalink raw reply	[flat|nested] 243+ messages in thread


* Re: [PATCHv2, RFC 00/30] Transparent huge page cache
  2013-03-18 11:19       ` Kirill A. Shutemov
@ 2013-03-18 11:29         ` Simon Jeons
  -1 siblings, 0 replies; 243+ messages in thread
From: Simon Jeons @ 2013-03-18 11:29 UTC (permalink / raw)
  To: Kirill A. Shutemov
  Cc: Andrea Arcangeli, Andrew Morton, Al Viro, Hugh Dickins,
	Wu Fengguang, Jan Kara, Mel Gorman, linux-mm, Andi Kleen,
	Matthew Wilcox, Kirill A. Shutemov, Hillf Danton, linux-fsdevel,
	linux-kernel

Hi Kirill,
On 03/18/2013 07:19 PM, Kirill A. Shutemov wrote:
> Simon Jeons wrote:
>> On 03/18/2013 12:03 PM, Simon Jeons wrote:
>>> Hi Kirill,
>>> On 03/15/2013 01:50 AM, Kirill A. Shutemov wrote:
>>>> From: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
>>>>
>>>> Here's the second version of the patchset.
>>>>
>>>> The intend of the work is get code ready to enable transparent huge page
>>>> cache for the most simple fs -- ramfs.
>>>>
>>>> We have read()/write()/mmap() functionality now. Still plenty work
>>>> ahead.
>>> One offline question.
>>>
>>> Why is PG_mlocked set on page_tail, which is split out in
>>> __split_huge_page_refcount?
> It's not set, but copied from the head page. The head page represents the
> up-to-date state of the compound page, and we need to copy it to all tail
> pages on split.

I keep seeing "up-to-date state"; could you clarify which state of a page 
can be treated as up-to-date? :-)

>   
>> Also, I can't find where the _PAGE_SPLITTING and _PAGE_PSE flags are
>> cleared in the split_huge_page() path. Where does that happen?
>
> The pmd is invalidated and replaced with a reference to the page table at
> the end of __split_huge_page_map().

Since the pmd is repopulated from the page table with new flags anyway, why 
does it need to be invalidated (present flag cleared) first?

>   
>> Another offline question:
>> Why isn't the tail page's PG_tail flag cleared in
>> __split_huge_page_refcount?
> We do:
>
>   page_tail->flags &= ~PAGE_FLAGS_CHECK_AT_PREP | __PG_HWPOISON;
>

Got it.

^ permalink raw reply	[flat|nested] 243+ messages in thread


* Re: [PATCHv2, RFC 00/30] Transparent huge page cache
  2013-03-18 11:29         ` Simon Jeons
@ 2013-03-18 11:42           ` Kirill A. Shutemov
  -1 siblings, 0 replies; 243+ messages in thread
From: Kirill A. Shutemov @ 2013-03-18 11:42 UTC (permalink / raw)
  To: Simon Jeons
  Cc: Kirill A. Shutemov, Andrea Arcangeli, Andrew Morton, Al Viro,
	Hugh Dickins, Wu Fengguang, Jan Kara, Mel Gorman, linux-mm,
	Andi Kleen, Matthew Wilcox, Kirill A. Shutemov, Hillf Danton,
	linux-fsdevel, linux-kernel

Simon Jeons wrote:
> Hi Kirill,
> On 03/18/2013 07:19 PM, Kirill A. Shutemov wrote:
> > Simon Jeons wrote:
> >> On 03/18/2013 12:03 PM, Simon Jeons wrote:
> >>> Hi Kirill,
> >>> On 03/15/2013 01:50 AM, Kirill A. Shutemov wrote:
> >>>> From: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
> >>>>
> >>>> Here's the second version of the patchset.
> >>>>
> >>>> The intend of the work is get code ready to enable transparent huge page
> >>>> cache for the most simple fs -- ramfs.
> >>>>
> >>>> We have read()/write()/mmap() functionality now. Still plenty work
> >>>> ahead.
> >>> One offline question.
> >>>
> >>> Why is PG_mlocked set on page_tail, which is split out in
> >>> __split_huge_page_refcount?
> > It's not set, but copied from the head page. The head page represents the
> > up-to-date state of the compound page, and we need to copy it to all tail
> > pages on split.
> 
> I keep seeing "up-to-date state"; could you clarify which state of a page 
> can be treated as up-to-date? :-)

While we work with a huge page we only alter the flags (like mlocked or
uptodate) of the head page, not the tails, so we have to copy the flags to
all tail pages on split. We also need to distribute _count and _mapcount
properly. Just read the code.

> 
> >   
> >> Also, I can't find where the _PAGE_SPLITTING and _PAGE_PSE flags are
> >> cleared in the split_huge_page() path. Where does that happen?
> >
> > The pmd is invalidated and replaced with a reference to the page table at
> > the end of __split_huge_page_map().
> 
> Since the pmd is repopulated from the page table with new flags anyway, why 
> does it need to be invalidated (present flag cleared) first?

The comment just before pmdp_invalidate() in __split_huge_page_map() is
fairly informative.

-- 
 Kirill A. Shutemov

^ permalink raw reply	[flat|nested] 243+ messages in thread


* Re: [PATCHv2, RFC 00/30] Transparent huge page cache
  2013-03-18 11:42           ` Kirill A. Shutemov
@ 2013-03-18 11:42             ` Ric Mason
  -1 siblings, 0 replies; 243+ messages in thread
From: Ric Mason @ 2013-03-18 11:42 UTC (permalink / raw)
  To: Kirill A. Shutemov
  Cc: Simon Jeons, Andrea Arcangeli, Andrew Morton, Al Viro,
	Hugh Dickins, Wu Fengguang, Jan Kara, Mel Gorman, linux-mm,
	Andi Kleen, Matthew Wilcox, Kirill A. Shutemov, Hillf Danton,
	linux-fsdevel, linux-kernel

On 03/18/2013 07:42 PM, Kirill A. Shutemov wrote:
> Simon Jeons wrote:
>> Hi Kirill,
>> On 03/18/2013 07:19 PM, Kirill A. Shutemov wrote:
>>> Simon Jeons wrote:
>>>> On 03/18/2013 12:03 PM, Simon Jeons wrote:
>>>>> Hi Kirill,
>>>>> On 03/15/2013 01:50 AM, Kirill A. Shutemov wrote:
>>>>>> From: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
>>>>>>
>>>>>> Here's the second version of the patchset.
>>>>>>
>>>>>> The intend of the work is get code ready to enable transparent huge page
>>>>>> cache for the most simple fs -- ramfs.
>>>>>>
>>>>>> We have read()/write()/mmap() functionality now. Still plenty work
>>>>>> ahead.
>>>>> One offline question.
>>>>>
>>>>> Why is PG_mlocked set on page_tail, which is split out in
>>>>> __split_huge_page_refcount?
>>> It's not set, but copied from the head page. The head page represents the
>>> up-to-date state of the compound page, and we need to copy it to all tail
>>> pages on split.
>> I keep seeing "up-to-date state"; could you clarify which state of a page
>> can be treated as up-to-date? :-)
> While we work with a huge page we only alter the flags (like mlocked or
> uptodate) of the head page, not the tails, so we have to copy the flags to
> all tail pages on split. We also need to distribute _count and _mapcount
> properly. Just read the code.

Sorry, you can treat this as an offline question, unrelated to THP.
Which state of a page can be treated as up-to-date?

>
>>>    
>>>> Also, I can't find where the _PAGE_SPLITTING and _PAGE_PSE flags are
>>>> cleared in the split_huge_page() path. Where does that happen?
>>>
>>> The pmd is invalidated and replaced with a reference to the page table at
>>> the end of __split_huge_page_map().
>> Since the pmd is repopulated from the page table with new flags anyway, why
>> does it need to be invalidated (present flag cleared) first?
> The comment just before pmdp_invalidate() in __split_huge_page_map() is
> fairly informative.
>


^ permalink raw reply	[flat|nested] 243+ messages in thread


* Re: [PATCHv2, RFC 00/30] Transparent huge page cache
  2013-03-14 17:50 ` Kirill A. Shutemov
@ 2013-03-20  1:09   ` Simon Jeons
  -1 siblings, 0 replies; 243+ messages in thread
From: Simon Jeons @ 2013-03-20  1:09 UTC (permalink / raw)
  To: Kirill A. Shutemov
  Cc: Andrea Arcangeli, Andrew Morton, Al Viro, Hugh Dickins,
	Wu Fengguang, Jan Kara, Mel Gorman, linux-mm, Andi Kleen,
	Matthew Wilcox, Kirill A. Shutemov, Hillf Danton, linux-fsdevel,
	linux-kernel

Hi Kirill,
On 03/15/2013 01:50 AM, Kirill A. Shutemov wrote:
> From: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
>
> Here's the second version of the patchset.
>
> The intend of the work is get code ready to enable transparent huge page
> cache for the most simple fs -- ramfs.
>
> We have read()/write()/mmap() functionality now. Still plenty work ahead.

It's great work. But could you give some more explanation of how you 
designed THP support for the page cache, so that it's easier for more 
developers to review your big patchset?

>
> Any feedback is welcome.
>
> Changes since v1:
>   - mmap();
>   - fix add_to_page_cache_locked() and delete_from_page_cache();
>   - introduce mapping_can_have_hugepages();
>   - call split_huge_page() only for head page in filemap_fault();
>   - wait_split_huge_page(): serialize over i_mmap_mutex too;
>   - lru_add_page_tail: avoid PageUnevictable on active/inactive lru lists;
>   - fix off-by-one in zero_huge_user_segment();
>   - THP_WRITE_ALLOC/THP_WRITE_FAILED counters;
>
> TODO:
>   - memcg accounting has not yet evaluated;
>   - collapse;
>   - migration (?);
>   - stats, knobs, etc.;
>   - tmpfs/shmem enabling;
>
>
> Kirill A. Shutemov (30):
>    block: implement add_bdi_stat()
>    mm: implement zero_huge_user_segment and friends
>    mm: drop actor argument of do_generic_file_read()
>    radix-tree: implement preload for multiple contiguous elements
>    thp, mm: avoid PageUnevictable on active/inactive lru lists
>    thp, mm: basic defines for transparent huge page cache
>    thp, mm: introduce mapping_can_have_hugepages() predicate
>    thp, mm: rewrite add_to_page_cache_locked() to support huge pages
>    thp, mm: rewrite delete_from_page_cache() to support huge pages
>    thp, mm: locking tail page is a bug
>    thp, mm: handle tail pages in page_cache_get_speculative()
>    thp, mm: add event counters for huge page alloc on write to a file
>    thp, mm: implement grab_cache_huge_page_write_begin()
>    thp, mm: naive support of thp in generic read/write routines
>    thp, libfs: initial support of thp in
>      simple_read/write_begin/write_end
>    thp: handle file pages in split_huge_page()
>    thp: wait_split_huge_page(): serialize over i_mmap_mutex too
>    thp, mm: truncate support for transparent huge page cache
>    thp, mm: split huge page on mmap file page
>    ramfs: enable transparent huge page cache
>    x86-64, mm: proper alignment mappings with hugepages
>    mm: add huge_fault() callback to vm_operations_struct
>    thp: prepare zap_huge_pmd() to uncharge file pages
>    thp: move maybe_pmd_mkwrite() out of mk_huge_pmd()
>    thp, mm: basic huge_fault implementation for generic_file_vm_ops
>    thp: extract fallback path from do_huge_pmd_anonymous_page() to a
>      function
>    thp: initial implementation of do_huge_linear_fault()
>    thp: handle write-protect exception to file-backed huge pages
>    thp: call __vma_adjust_trans_huge() for file-backed VMA
>    thp: map file-backed huge pages on fault
>
>   arch/x86/kernel/sys_x86_64.c  |   13 +-
>   fs/libfs.c                    |   50 ++++-
>   fs/ramfs/inode.c              |    6 +-
>   include/linux/backing-dev.h   |   10 +
>   include/linux/huge_mm.h       |   36 +++-
>   include/linux/mm.h            |   16 ++
>   include/linux/pagemap.h       |   24 ++-
>   include/linux/radix-tree.h    |    3 +
>   include/linux/vm_event_item.h |    2 +
>   lib/radix-tree.c              |   32 ++-
>   mm/filemap.c                  |  283 +++++++++++++++++++++----
>   mm/huge_memory.c              |  462 ++++++++++++++++++++++++++++++++++-------
>   mm/memory.c                   |   31 ++-
>   mm/swap.c                     |    3 +-
>   mm/truncate.c                 |   12 ++
>   mm/vmstat.c                   |    2 +
>   16 files changed, 842 insertions(+), 143 deletions(-)
>


^ permalink raw reply	[flat|nested] 243+ messages in thread


* Re: [PATCHv2, RFC 01/30] block: implement add_bdi_stat()
  2013-03-14 17:50   ` Kirill A. Shutemov
@ 2013-03-21 14:46     ` Dave Hansen
  -1 siblings, 0 replies; 243+ messages in thread
From: Dave Hansen @ 2013-03-21 14:46 UTC (permalink / raw)
  To: Kirill A. Shutemov
  Cc: Andrea Arcangeli, Andrew Morton, Al Viro, Hugh Dickins,
	Wu Fengguang, Jan Kara, Mel Gorman, linux-mm, Andi Kleen,
	Matthew Wilcox, Kirill A. Shutemov, Hillf Danton, linux-fsdevel,
	linux-kernel

On 03/14/2013 10:50 AM, Kirill A. Shutemov wrote:
> From: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
> 
> It's required for batched stats update.

The description here is a little terse.  Could you at least also
describe how and where it's going to get used in later patches?  Just a
sentence or two more would be really helpful.

You might also mention that it's essentially a clone of dec_bdi_stat().
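
For reference, IIRC backing-dev.h already has __add_bdi_stat(), so I'd
expect the new helper to be just the irq-safe wrapper, mirroring
dec_bdi_stat() (a guess at the shape, not quoted from the patch):

	static inline void add_bdi_stat(struct backing_dev_info *bdi,
			enum bdi_stat_item item, s64 amount)
	{
		unsigned long flags;

		local_irq_save(flags);
		__add_bdi_stat(bdi, item, amount);
		local_irq_restore(flags);
	}

A sentence to that effect in the changelog would be enough.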



^ permalink raw reply	[flat|nested] 243+ messages in thread


* Re: [PATCHv2, RFC 02/30] mm: implement zero_huge_user_segment and friends
  2013-03-14 17:50   ` Kirill A. Shutemov
@ 2013-03-21 15:23     ` Dave Hansen
  -1 siblings, 0 replies; 243+ messages in thread
From: Dave Hansen @ 2013-03-21 15:23 UTC (permalink / raw)
  To: Kirill A. Shutemov
  Cc: Andrea Arcangeli, Andrew Morton, Al Viro, Hugh Dickins,
	Wu Fengguang, Jan Kara, Mel Gorman, linux-mm, Andi Kleen,
	Matthew Wilcox, Kirill A. Shutemov, Hillf Danton, linux-fsdevel,
	linux-kernel

On 03/14/2013 10:50 AM, Kirill A. Shutemov wrote:
> Let's add helpers to clear huge page segment(s). They provide the same
> functionality as zero_user_segment{,s} and zero_user, but for huge
> pages
...
> +static inline void zero_huge_user_segments(struct page *page,
> +		unsigned start1, unsigned end1,
> +		unsigned start2, unsigned end2)
> +{
> +	zero_huge_user_segment(page, start1, end1);
> +	zero_huge_user_segment(page, start2, end2);
> +}

I'm not sure that this helper saves very much code.  The one call later
in these patches:

+                       zero_huge_user_segments(page, 0, from,
+                                       from + len, HPAGE_PMD_SIZE);

really only saves one line over this:

			zero_huge_user_segment(page, 0, from);
			zero_huge_user_segment(page, from + len,
					       HPAGE_PMD_SIZE);

and I think the second one is much more clear to read.

I do see that there's a small-page variant of this, but I think that one
was done to save doing two kmap_atomic() operations when you wanted to
zero two separate segments.  This variant doesn't have that kind of
optimization, so it makes much less sense.

> +void zero_huge_user_segment(struct page *page, unsigned start, unsigned end)
> +{
> +	int i;
> +	
> +	BUG_ON(end < start);
> +
> +	might_sleep();
> +
> +	if (start == end)
> +		return;

I've really got to wonder how much of an optimization this is in
practice.  Was there a specific reason this was added?

> +	/* start and end are on the same small page */
> +	if ((start & PAGE_MASK) == ((end - 1) & PAGE_MASK))
> +		return zero_user_segment(page + (start >> PAGE_SHIFT),
> +				start & ~PAGE_MASK,
> +				((end - 1) & ~PAGE_MASK) + 1);

It wasn't immediately obvious to me why we need to optimize the "on the
same page" case.  I _think_ it's because using zero_user_segments()
saves us a kmap_atomic() over the code below.  Is that right?  It might
be worth a comment.

> +	zero_user_segment(page + (start >> PAGE_SHIFT),
> +			start & ~PAGE_MASK, PAGE_SIZE);
> +	for (i = (start >> PAGE_SHIFT) + 1; i < (end >> PAGE_SHIFT) - 1; i++) {
> +		cond_resched();
> +		clear_highpage(page + i);

zero_user_segments() does a flush_dcache_page(), which wouldn't get done
on these middle pages.  Is that a problem?

> +	}
> +	zero_user_segment(page + i, 0, ((end - 1) & ~PAGE_MASK) + 1);
> +}

This code is dying for some local variables.  It could really use a
'start_pfn_offset' and 'end_pfn_offset' or something similar.  All of
the shifting and masking is a bit hard to read and it would be nice to
think of some real names for what it is doing.

It also desperately needs some comments about how it works.  Some
one-liners like:

	/* zero the first (possibly partial) page */
	for()..
		/* zero the full pages in the middle */
	/* zero the last (possibly partial) page */

would be pretty sweet.
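
Something along these lines, perhaps (untested, just to illustrate the
naming and comments I have in mind; the boundary arithmetic would need
double-checking against your off-by-one fix):

	void zero_huge_user_segment(struct page *page, unsigned start, unsigned end)
	{
		unsigned first_page = start >> PAGE_SHIFT;
		unsigned last_page = (end - 1) >> PAGE_SHIFT;
		unsigned start_off = start & ~PAGE_MASK;
		unsigned end_off = ((end - 1) & ~PAGE_MASK) + 1;
		int i;

		BUG_ON(end < start);
		might_sleep();

		if (start == end)
			return;

		/* start and end are within the same small page: one kmap_atomic() */
		if (first_page == last_page)
			return zero_user_segment(page + first_page, start_off, end_off);

		/* zero the first (possibly partial) page */
		zero_user_segment(page + first_page, start_off, PAGE_SIZE);

		/* zero the full pages in the middle */
		for (i = first_page + 1; i < last_page; i++) {
			cond_resched();
			clear_highpage(page + i);
		}

		/* zero the last (possibly partial) page */
		zero_user_segment(page + last_page, 0, end_off);
	}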


^ permalink raw reply	[flat|nested] 243+ messages in thread


* Re: [PATCHv2, RFC 03/30] mm: drop actor argument of do_generic_file_read()
  2013-03-15 13:22       ` Kirill A. Shutemov
@ 2013-03-21 15:26         ` Dave Hansen
  -1 siblings, 0 replies; 243+ messages in thread
From: Dave Hansen @ 2013-03-21 15:26 UTC (permalink / raw)
  To: Kirill A. Shutemov
  Cc: Hillf Danton, Andrea Arcangeli, Andrew Morton, Al Viro,
	Hugh Dickins, Wu Fengguang, Jan Kara, Mel Gorman, linux-mm,
	Andi Kleen, Matthew Wilcox, Kirill A. Shutemov, linux-fsdevel,
	linux-kernel

On 03/15/2013 06:22 AM, Kirill A. Shutemov wrote:
> Hillf Danton wrote:
>> On Fri, Mar 15, 2013 at 1:50 AM, Kirill A. Shutemov
>> <kirill.shutemov@linux.intel.com> wrote:
>>>
>>> There's only one caller of do_generic_file_read() and the only actor is
>>> file_read_actor(). No reason to have a callback parameter.
>>>
>> This cleanup is not urgent if it doesn't remove any blocker for THP cache.
> 
> Yes, it's not urgent. On the other hand, it can be applied upstream right now ;)

If someone might pick this up and merge it right away, it's probably
best to put it very first in the series and call it out as such to the
person you expect to pick it up, or even send it up to them separately.

^ permalink raw reply	[flat|nested] 243+ messages in thread


* Re: [PATCHv2, RFC 04/30] radix-tree: implement preload for multiple contiguous elements
  2013-03-14 17:50   ` Kirill A. Shutemov
@ 2013-03-21 15:56     ` Dave Hansen
  -1 siblings, 0 replies; 243+ messages in thread
From: Dave Hansen @ 2013-03-21 15:56 UTC (permalink / raw)
  To: Kirill A. Shutemov
  Cc: Andrea Arcangeli, Andrew Morton, Al Viro, Hugh Dickins,
	Wu Fengguang, Jan Kara, Mel Gorman, linux-mm, Andi Kleen,
	Matthew Wilcox, Kirill A. Shutemov, Hillf Danton, linux-fsdevel,
	linux-kernel

On 03/14/2013 10:50 AM, Kirill A. Shutemov wrote:
> From: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
> 
> Currently radix_tree_preload() only guarantees enough nodes to insert
> one element. It's a hard limit. You cannot batch a number insert under
> one tree_lock.
> 
> This patch introduces radix_tree_preload_count(). It allows preallocating
> enough nodes to insert a number of *contiguous* elements.

You don't need to write a paper on how radix trees work, but it might be
nice to include a wee bit of text in here about how the existing preload
works, and how this new guarantee works.

> diff --git a/include/linux/radix-tree.h b/include/linux/radix-tree.h
> index ffc444c..81318cb 100644
> --- a/include/linux/radix-tree.h
> +++ b/include/linux/radix-tree.h
> @@ -83,6 +83,8 @@ do {									\
>  	(root)->rnode = NULL;						\
>  } while (0)
>  
> +#define RADIX_TREE_PRELOAD_NR		512 /* For THP's benefit */

This eventually boils down to making the radix_tree_preload array
larger.  Do we really want to do this unconditionally if it's only for
THP's benefit?

>  /**
>   * Radix-tree synchronization
>   *
> @@ -231,6 +233,7 @@ unsigned long radix_tree_next_hole(struct radix_tree_root *root,
>  unsigned long radix_tree_prev_hole(struct radix_tree_root *root,
>  				unsigned long index, unsigned long max_scan);
>  int radix_tree_preload(gfp_t gfp_mask);
> +int radix_tree_preload_count(unsigned size, gfp_t gfp_mask);
>  void radix_tree_init(void);
>  void *radix_tree_tag_set(struct radix_tree_root *root,
>  			unsigned long index, unsigned int tag);
> diff --git a/lib/radix-tree.c b/lib/radix-tree.c
> index e796429..9bef0ac 100644
> --- a/lib/radix-tree.c
> +++ b/lib/radix-tree.c
> @@ -81,16 +81,24 @@ static struct kmem_cache *radix_tree_node_cachep;
>   * The worst case is a zero height tree with just a single item at index 0,
>   * and then inserting an item at index ULONG_MAX. This requires 2 new branches
>   * of RADIX_TREE_MAX_PATH size to be created, with only the root node shared.
> + *
> + * Worst case for adding N contiguous items is adding entries at indexes
> + * (ULONG_MAX - N) to ULONG_MAX. It requires nodes to insert single worst-case
> + * item plus extra nodes if you cross the boundary from one node to the next.
> + *
>   * Hence:
>   */
> -#define RADIX_TREE_PRELOAD_SIZE (RADIX_TREE_MAX_PATH * 2 - 1)
> +#define RADIX_TREE_PRELOAD_MIN (RADIX_TREE_MAX_PATH * 2 - 1)
> +#define RADIX_TREE_PRELOAD_MAX \
> +	(RADIX_TREE_PRELOAD_MIN + \
> +	 DIV_ROUND_UP(RADIX_TREE_PRELOAD_NR - 1, RADIX_TREE_MAP_SIZE))
>  
>  /*
>   * Per-cpu pool of preloaded nodes
>   */
>  struct radix_tree_preload {
>  	int nr;
> -	struct radix_tree_node *nodes[RADIX_TREE_PRELOAD_SIZE];
> +	struct radix_tree_node *nodes[RADIX_TREE_PRELOAD_MAX];
>  };

For those of us too lazy to go compile a kernel and figure this out in
practice, how much bigger does this make the nodes[] array?
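
(Back-of-the-envelope, assuming the usual RADIX_TREE_MAP_SHIFT of 6 on
64-bit: RADIX_TREE_MAX_PATH = DIV_ROUND_UP(64, 6) = 11, so
RADIX_TREE_PRELOAD_MIN = 2 * 11 - 1 = 21 entries, the old
RADIX_TREE_PRELOAD_SIZE.  With RADIX_TREE_PRELOAD_NR = 512 that adds
DIV_ROUND_UP(511, 64) = 8 slots, so nodes[] grows from 21 to 29 pointers,
i.e. 64 extra bytes per CPU.  Worth double-checking, but that's the
ballpark.)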

>  static DEFINE_PER_CPU(struct radix_tree_preload, radix_tree_preloads) = { 0, };
>  
> @@ -257,29 +265,34 @@ radix_tree_node_free(struct radix_tree_node *node)
>  
>  /*
>   * Load up this CPU's radix_tree_node buffer with sufficient objects to
> - * ensure that the addition of a single element in the tree cannot fail.  On
> - * success, return zero, with preemption disabled.  On error, return -ENOMEM
> + * ensure that the addition of *contiguous* elements in the tree cannot fail.
> + * On success, return zero, with preemption disabled.  On error, return -ENOMEM
>   * with preemption not disabled.
>   *
>   * To make use of this facility, the radix tree must be initialised without
>   * __GFP_WAIT being passed to INIT_RADIX_TREE().
>   */
> -int radix_tree_preload(gfp_t gfp_mask)
> +int radix_tree_preload_count(unsigned size, gfp_t gfp_mask)
>  {
>  	struct radix_tree_preload *rtp;
>  	struct radix_tree_node *node;
>  	int ret = -ENOMEM;
> +	int alloc = RADIX_TREE_PRELOAD_MIN +
> +		DIV_ROUND_UP(size - 1, RADIX_TREE_MAP_SIZE);

Any chance I could talk you into giving 'alloc' a better name?  Maybe
"preload_target" or "preload_fill_to".

> +	if (size > RADIX_TREE_PRELOAD_NR)
> +		return -ENOMEM;

I always wonder if these deep, logical -ENOMEMs deserve a WARN_ONCE().
We really don't expect to hit this unless something really funky is
going on, right?
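
Something like this is what I have in mind (just a sketch):

	if (WARN_ON_ONCE(size > RADIX_TREE_PRELOAD_NR))
		return -ENOMEM;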

>  	preempt_disable();
>  	rtp = &__get_cpu_var(radix_tree_preloads);
> -	while (rtp->nr < ARRAY_SIZE(rtp->nodes)) {
> +	while (rtp->nr < alloc) {
>  		preempt_enable();
>  		node = kmem_cache_alloc(radix_tree_node_cachep, gfp_mask);
>  		if (node == NULL)
>  			goto out;
>  		preempt_disable();
>  		rtp = &__get_cpu_var(radix_tree_preloads);
> -		if (rtp->nr < ARRAY_SIZE(rtp->nodes))
> +		if (rtp->nr < alloc)
>  			rtp->nodes[rtp->nr++] = node;
>  		else
>  			kmem_cache_free(radix_tree_node_cachep, node);
> @@ -288,6 +301,11 @@ int radix_tree_preload(gfp_t gfp_mask)
>  out:
>  	return ret;
>  }


^ permalink raw reply	[flat|nested] 243+ messages in thread

* Re: [PATCHv2, RFC 04/30] radix-tree: implement preload for multiple contiguous elements
@ 2013-03-21 15:56     ` Dave Hansen
  0 siblings, 0 replies; 243+ messages in thread
From: Dave Hansen @ 2013-03-21 15:56 UTC (permalink / raw)
  To: Kirill A. Shutemov
  Cc: Andrea Arcangeli, Andrew Morton, Al Viro, Hugh Dickins,
	Wu Fengguang, Jan Kara, Mel Gorman, linux-mm, Andi Kleen,
	Matthew Wilcox, Kirill A. Shutemov, Hillf Danton, linux-fsdevel,
	linux-kernel

On 03/14/2013 10:50 AM, Kirill A. Shutemov wrote:
> From: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
> 
> Currently radix_tree_preload() only guarantees enough nodes to insert
> one element. It's a hard limit. You cannot batch a number of inserts under
> one tree_lock.
> 
> This patch introduces radix_tree_preload_count(). It allows preallocating
> enough nodes to insert a number of *contiguous* elements.

You don't need to write a paper on how radix trees work, but it might be
nice to include a wee bit of text in here about how the existing preload
works, and how this new guarantee works.

> diff --git a/include/linux/radix-tree.h b/include/linux/radix-tree.h
> index ffc444c..81318cb 100644
> --- a/include/linux/radix-tree.h
> +++ b/include/linux/radix-tree.h
> @@ -83,6 +83,8 @@ do {									\
>  	(root)->rnode = NULL;						\
>  } while (0)
>  
> +#define RADIX_TREE_PRELOAD_NR		512 /* For THP's benefit */

This eventually boils down to making the radix_tree_preload array
larger.  Do we really want to do this unconditionally if it's only for
THP's benefit?

>  /**
>   * Radix-tree synchronization
>   *
> @@ -231,6 +233,7 @@ unsigned long radix_tree_next_hole(struct radix_tree_root *root,
>  unsigned long radix_tree_prev_hole(struct radix_tree_root *root,
>  				unsigned long index, unsigned long max_scan);
>  int radix_tree_preload(gfp_t gfp_mask);
> +int radix_tree_preload_count(unsigned size, gfp_t gfp_mask);
>  void radix_tree_init(void);
>  void *radix_tree_tag_set(struct radix_tree_root *root,
>  			unsigned long index, unsigned int tag);
> diff --git a/lib/radix-tree.c b/lib/radix-tree.c
> index e796429..9bef0ac 100644
> --- a/lib/radix-tree.c
> +++ b/lib/radix-tree.c
> @@ -81,16 +81,24 @@ static struct kmem_cache *radix_tree_node_cachep;
>   * The worst case is a zero height tree with just a single item at index 0,
>   * and then inserting an item at index ULONG_MAX. This requires 2 new branches
>   * of RADIX_TREE_MAX_PATH size to be created, with only the root node shared.
> + *
> + * Worst case for adding N contiguous items is adding entries at indexes
> + * (ULONG_MAX - N) to ULONG_MAX. It requires nodes to insert single worst-case
> + * item plus extra nodes if you cross the boundary from one node to the next.
> + *
>   * Hence:
>   */
> -#define RADIX_TREE_PRELOAD_SIZE (RADIX_TREE_MAX_PATH * 2 - 1)
> +#define RADIX_TREE_PRELOAD_MIN (RADIX_TREE_MAX_PATH * 2 - 1)
> +#define RADIX_TREE_PRELOAD_MAX \
> +	(RADIX_TREE_PRELOAD_MIN + \
> +	 DIV_ROUND_UP(RADIX_TREE_PRELOAD_NR - 1, RADIX_TREE_MAP_SIZE))
>  
>  /*
>   * Per-cpu pool of preloaded nodes
>   */
>  struct radix_tree_preload {
>  	int nr;
> -	struct radix_tree_node *nodes[RADIX_TREE_PRELOAD_SIZE];
> +	struct radix_tree_node *nodes[RADIX_TREE_PRELOAD_MAX];
>  };

For those of us too lazy to go compile a kernel and figure this out in
practice, how much bigger does this make the nodes[] array?

>  static DEFINE_PER_CPU(struct radix_tree_preload, radix_tree_preloads) = { 0, };
>  
> @@ -257,29 +265,34 @@ radix_tree_node_free(struct radix_tree_node *node)
>  
>  /*
>   * Load up this CPU's radix_tree_node buffer with sufficient objects to
> - * ensure that the addition of a single element in the tree cannot fail.  On
> - * success, return zero, with preemption disabled.  On error, return -ENOMEM
> + * ensure that the addition of *contiguous* elements in the tree cannot fail.
> + * On success, return zero, with preemption disabled.  On error, return -ENOMEM
>   * with preemption not disabled.
>   *
>   * To make use of this facility, the radix tree must be initialised without
>   * __GFP_WAIT being passed to INIT_RADIX_TREE().
>   */
> -int radix_tree_preload(gfp_t gfp_mask)
> +int radix_tree_preload_count(unsigned size, gfp_t gfp_mask)
>  {
>  	struct radix_tree_preload *rtp;
>  	struct radix_tree_node *node;
>  	int ret = -ENOMEM;
> +	int alloc = RADIX_TREE_PRELOAD_MIN +
> +		DIV_ROUND_UP(size - 1, RADIX_TREE_MAP_SIZE);

Any chance I could talk you in to giving 'alloc' a better name?  Maybe
"preload_target" or "preload_fill_to".

> +	if (size > RADIX_TREE_PRELOAD_NR)
> +		return -ENOMEM;

I always wonder if these deep, logical -ENOMEMs deserve a WARN_ONCE().
We really don't expect to hit this unless something really funky is
going on, right?

>  	preempt_disable();
>  	rtp = &__get_cpu_var(radix_tree_preloads);
> -	while (rtp->nr < ARRAY_SIZE(rtp->nodes)) {
> +	while (rtp->nr < alloc) {
>  		preempt_enable();
>  		node = kmem_cache_alloc(radix_tree_node_cachep, gfp_mask);
>  		if (node == NULL)
>  			goto out;
>  		preempt_disable();
>  		rtp = &__get_cpu_var(radix_tree_preloads);
> -		if (rtp->nr < ARRAY_SIZE(rtp->nodes))
> +		if (rtp->nr < alloc)
>  			rtp->nodes[rtp->nr++] = node;
>  		else
>  			kmem_cache_free(radix_tree_node_cachep, node);
> @@ -288,6 +301,11 @@ int radix_tree_preload(gfp_t gfp_mask)
>  out:
>  	return ret;
>  }

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 243+ messages in thread

* Re: [PATCHv2, RFC 05/30] thp, mm: avoid PageUnevictable on active/inactive lru lists
  2013-03-14 17:50   ` Kirill A. Shutemov
@ 2013-03-21 16:15     ` Dave Hansen
  -1 siblings, 0 replies; 243+ messages in thread
From: Dave Hansen @ 2013-03-21 16:15 UTC (permalink / raw)
  To: Kirill A. Shutemov
  Cc: Andrea Arcangeli, Andrew Morton, Al Viro, Hugh Dickins,
	Wu Fengguang, Jan Kara, Mel Gorman, linux-mm, Andi Kleen,
	Matthew Wilcox, Kirill A. Shutemov, Hillf Danton, linux-fsdevel,
	linux-kernel

On 03/14/2013 10:50 AM, Kirill A. Shutemov wrote:
> active/inactive lru lists can contain unevicable pages (i.e. ramfs pages
> that have been placed on the LRU lists when first allocated), but these
> pages must not have PageUnevictable set - otherwise shrink_active_list
> goes crazy:
...
> For lru_add_page_tail(), it means we should not set PageUnevictable()
> for tail pages unless we're sure that it will go to LRU_UNEVICTABLE.
> The tail page will go LRU_UNEVICTABLE if head page is not on LRU or if
> it's marked PageUnevictable() too.

This is only an issue once you're using lru_add_page_tail() for
non-anonymous pages, right?

> diff --git a/mm/swap.c b/mm/swap.c
> index 92a9be5..31584d0 100644
> --- a/mm/swap.c
> +++ b/mm/swap.c
> @@ -762,7 +762,8 @@ void lru_add_page_tail(struct page *page, struct page *page_tail,
>  			lru = LRU_INACTIVE_ANON;
>  		}
>  	} else {
> -		SetPageUnevictable(page_tail);
> +		if (!PageLRU(page) || PageUnevictable(page))
> +			SetPageUnevictable(page_tail);
>  		lru = LRU_UNEVICTABLE;
>  	}

You were saying above that ramfs pages can get on the normal
active/inactive lists.  But, this will end up getting them on the
unevictable list, right?  So, we have normal ramfs pages on the
active/inactive lists, but ramfs pages after a huge-page-split on the
unevictable list.  That seems a bit inconsistent.

^ permalink raw reply	[flat|nested] 243+ messages in thread

* Re: [PATCHv2, RFC 07/30] thp, mm: introduce mapping_can_have_hugepages() predicate
  2013-03-14 17:50   ` Kirill A. Shutemov
@ 2013-03-21 16:21     ` Dave Hansen
  -1 siblings, 0 replies; 243+ messages in thread
From: Dave Hansen @ 2013-03-21 16:21 UTC (permalink / raw)
  To: Kirill A. Shutemov
  Cc: Andrea Arcangeli, Andrew Morton, Al Viro, Hugh Dickins,
	Wu Fengguang, Jan Kara, Mel Gorman, linux-mm, Andi Kleen,
	Matthew Wilcox, Kirill A. Shutemov, Hillf Danton, linux-fsdevel,
	linux-kernel

On 03/14/2013 10:50 AM, Kirill A. Shutemov wrote:
> +static inline bool mapping_can_have_hugepages(struct address_space *m)
> +{
> +	if (IS_ENABLED(CONFIG_TRANSPARENT_HUGEPAGE)) {
> +		gfp_t gfp_mask = mapping_gfp_mask(m);
> +		return !!(gfp_mask & __GFP_COMP);
> +	}
> +
> +	return false;
> +}

I did a quick search in all your patches and don't see __GFP_COMP
getting _set_ anywhere.  Am I missing something?
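
For reference, a mapping would normally advertise this by having the
filesystem set __GFP_COMP in its gfp mask at inode-setup time.  A
hypothetical sketch (not taken from this series; 'inode' is whatever inode
the fs is initialising):

	mapping_set_gfp_mask(inode->i_mapping,
			mapping_gfp_mask(inode->i_mapping) | __GFP_COMP);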

^ permalink raw reply	[flat|nested] 243+ messages in thread

* Re: [PATCHv2, RFC 08/30] thp, mm: rewrite add_to_page_cache_locked() to support huge pages
  2013-03-14 17:50   ` Kirill A. Shutemov
@ 2013-03-21 17:11     ` Dave Hansen
  -1 siblings, 0 replies; 243+ messages in thread
From: Dave Hansen @ 2013-03-21 17:11 UTC (permalink / raw)
  To: Kirill A. Shutemov
  Cc: Andrea Arcangeli, Andrew Morton, Al Viro, Hugh Dickins,
	Wu Fengguang, Jan Kara, Mel Gorman, linux-mm, Andi Kleen,
	Matthew Wilcox, Kirill A. Shutemov, Hillf Danton, linux-fsdevel,
	linux-kernel

On 03/14/2013 10:50 AM, Kirill A. Shutemov wrote:
> From: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
> 
> For huge page we add to radix tree HPAGE_CACHE_NR pages at once: head
> page for the specified index and HPAGE_CACHE_NR-1 tail pages for
> following indexes.
> 
> Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
> ---
>  mm/filemap.c |   76 ++++++++++++++++++++++++++++++++++++++++------------------
>  1 file changed, 53 insertions(+), 23 deletions(-)
> 
> diff --git a/mm/filemap.c b/mm/filemap.c
> index 2d99191..6bac9e2 100644
> --- a/mm/filemap.c
> +++ b/mm/filemap.c
> @@ -447,6 +447,7 @@ int add_to_page_cache_locked(struct page *page, struct address_space *mapping,
>  		pgoff_t offset, gfp_t gfp_mask)
>  {
>  	int error;
> +	int nr = 1;
>  
>  	VM_BUG_ON(!PageLocked(page));
>  	VM_BUG_ON(PageSwapBacked(page));
> @@ -454,32 +455,61 @@ int add_to_page_cache_locked(struct page *page, struct address_space *mapping,
>  	error = mem_cgroup_cache_charge(page, current->mm,
>  					gfp_mask & GFP_RECLAIM_MASK);
>  	if (error)
> -		goto out;
> +		return error;
>  
> -	error = radix_tree_preload(gfp_mask & ~__GFP_HIGHMEM);
> -	if (error == 0) {
> -		page_cache_get(page);
> -		page->mapping = mapping;
> -		page->index = offset;
> +	if (PageTransHuge(page)) {
> +		BUILD_BUG_ON(HPAGE_CACHE_NR > RADIX_TREE_PRELOAD_NR);
> +		nr = HPAGE_CACHE_NR;
> +	}

That seems like a slightly odd place to put a BUILD_BUG_ON().  I guess
it doesn't matter to some degree, but does putting it inside the if()
imply anything?
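
Since BUILD_BUG_ON() is evaluated at compile time either way, one option
(just a sketch) would be to hoist it out of the branch:

	BUILD_BUG_ON(HPAGE_CACHE_NR > RADIX_TREE_PRELOAD_NR);
	if (PageTransHuge(page))
		nr = HPAGE_CACHE_NR;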

> +	error = radix_tree_preload_count(nr, gfp_mask & ~__GFP_HIGHMEM);
> +	if (error) {
> +		mem_cgroup_uncharge_cache_page(page);
> +		return error;
> +	}
>  
> -		spin_lock_irq(&mapping->tree_lock);
> -		error = radix_tree_insert(&mapping->page_tree, offset, page);
> -		if (likely(!error)) {
> -			mapping->nrpages++;
> -			__inc_zone_page_state(page, NR_FILE_PAGES);
> -			spin_unlock_irq(&mapping->tree_lock);
> -			trace_mm_filemap_add_to_page_cache(page);
> -		} else {
> -			page->mapping = NULL;
> -			/* Leave page->index set: truncation relies upon it */
> -			spin_unlock_irq(&mapping->tree_lock);
> -			mem_cgroup_uncharge_cache_page(page);
> -			page_cache_release(page);

I do really like how this rewrite de-indents this code. :)

> +	page_cache_get(page);
> +	spin_lock_irq(&mapping->tree_lock);
> +	page->mapping = mapping;
> +	page->index = offset;
> +	error = radix_tree_insert(&mapping->page_tree, offset, page);
> +	if (unlikely(error))
> +		goto err;
> +	if (PageTransHuge(page)) {
> +		int i;
> +		for (i = 1; i < HPAGE_CACHE_NR; i++) {
> +			page_cache_get(page + i);
> +			page[i].index = offset + i;

Is it OK to leave page->mapping unset for these?

> +			error = radix_tree_insert(&mapping->page_tree,
> +					offset + i, page + i);
> +			if (error) {
> +				page_cache_release(page + i);
> +				break;
> +			}
>  		}

Throughout all this new code, I'd really challenge you to try as much as
possible to minimize the code stuck under "if (PageTransHuge(page))".

For instance, could you change the for() loop a bit and have it shared
between both cases, like:

> +	for (i = 0; i < nr; i++) {
> +		page_cache_get(page + i);
> +		page[i].index = offset + i;
> +		error = radix_tree_insert(&mapping->page_tree,
> +				offset + i, page + i);
> +		if (error) {
> +			page_cache_release(page + i);
> +			break;
> +		}
>  	}

> -		radix_tree_preload_end();
> -	} else
> -		mem_cgroup_uncharge_cache_page(page);
> -out:
> +		if (error) {
> +			error = ENOSPC; /* no space for a huge page */
> +			for (i--; i > 0; i--) {
> +				radix_tree_delete(&mapping->page_tree,
> +						offset + i);
> +				page_cache_release(page + i);
> +			}
> +			radix_tree_delete(&mapping->page_tree, offset);

I wonder if this would look any nicer if you just did all the
page_cache_get()s for the entire huge page along with the head page, and
then released them all in one place.  I think it might shrink the error
handling paths here.
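
Something along these lines, perhaps (a rough, untested sketch reusing the
locals from the quoted code):

	page->mapping = mapping;
	for (i = 0; i < nr; i++) {
		page_cache_get(page + i);
		page[i].index = offset + i;
	}

	spin_lock_irq(&mapping->tree_lock);
	for (i = 0; i < nr; i++) {
		error = radix_tree_insert(&mapping->page_tree,
				offset + i, page + i);
		if (error)
			break;
	}
	if (error) {
		/* unwind the slots inserted so far, then drop all refs */
		while (i--)
			radix_tree_delete(&mapping->page_tree, offset + i);
		page->mapping = NULL;
		spin_unlock_irq(&mapping->tree_lock);
		for (i = 0; i < nr; i++)
			page_cache_release(page + i);
	}

with the memcg uncharge and the error return kept as in the patch.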

> +			goto err;
> +		}
> +	}
> +	__mod_zone_page_state(page_zone(page), NR_FILE_PAGES, nr);
> +	mapping->nrpages += nr;
> +	spin_unlock_irq(&mapping->tree_lock);
> +	trace_mm_filemap_add_to_page_cache(page);

Do we need to change the tracing to make sure it notes that these were
or weren't huge pages?

> +	radix_tree_preload_end();
> +	return 0;
> +err:
> +	page->mapping = NULL;
> +	/* Leave page->index set: truncation relies upon it */
> +	spin_unlock_irq(&mapping->tree_lock);
> +	radix_tree_preload_end();
> +	mem_cgroup_uncharge_cache_page(page);
> +	page_cache_release(page);
>  	return error;
>  }
>  EXPORT_SYMBOL(add_to_page_cache_locked);

Does the cgroup code know how to handle these large pages internally
somehow?  It looks like the charge/uncharge is only being done for the
head page.

^ permalink raw reply	[flat|nested] 243+ messages in thread

* Re: [PATCHv2, RFC 01/30] block: implement add_bdi_stat()
  2013-03-21 14:46     ` Dave Hansen
@ 2013-03-21 17:19       ` Kirill A. Shutemov
  -1 siblings, 0 replies; 243+ messages in thread
From: Kirill A. Shutemov @ 2013-03-21 17:19 UTC (permalink / raw)
  To: Dave Hansen
  Cc: Kirill A. Shutemov, Andrea Arcangeli, Andrew Morton, Al Viro,
	Hugh Dickins, Wu Fengguang, Jan Kara, Mel Gorman, linux-mm,
	Andi Kleen, Matthew Wilcox, Kirill A. Shutemov, Hillf Danton,
	linux-fsdevel, linux-kernel

Dave Hansen wrote:
> On 03/14/2013 10:50 AM, Kirill A. Shutemov wrote:
> > From: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
> > 
> > It's required for batched stats update.
> 
> The description here is a little terse.  Could you at least also
> describe how and where it's going to get used in later patches?  Just a
> sentence or two more would be really helpful.
> 
> You might also mention that it's essentially a clone of dec_bdi_stat().

Okay. Will update the description. Thanks.
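
For reference, the new helper is essentially the add counterpart of
dec_bdi_stat(), built on the existing __add_bdi_stat(); roughly (a sketch,
not necessarily the exact patch):

	static inline void add_bdi_stat(struct backing_dev_info *bdi,
			enum bdi_stat_item item, s64 amount)
	{
		unsigned long flags;

		local_irq_save(flags);
		__add_bdi_stat(bdi, item, amount);
		local_irq_restore(flags);
	}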

-- 
 Kirill A. Shutemov

^ permalink raw reply	[flat|nested] 243+ messages in thread

* Re: [PATCHv2, RFC 10/30] thp, mm: locking tail page is a bug
  2013-03-14 17:50   ` Kirill A. Shutemov
@ 2013-03-21 17:20     ` Dave Hansen
  -1 siblings, 0 replies; 243+ messages in thread
From: Dave Hansen @ 2013-03-21 17:20 UTC (permalink / raw)
  To: Kirill A. Shutemov
  Cc: Andrea Arcangeli, Andrew Morton, Al Viro, Hugh Dickins,
	Wu Fengguang, Jan Kara, Mel Gorman, linux-mm, Andi Kleen,
	Matthew Wilcox, Kirill A. Shutemov, Hillf Danton, linux-fsdevel,
	linux-kernel

On 03/14/2013 10:50 AM, Kirill A. Shutemov wrote:
> index 0ff3403..38fdc92 100644
> --- a/mm/filemap.c
> +++ b/mm/filemap.c
> @@ -669,6 +669,7 @@ void __lock_page(struct page *page)
>  {
>  	DEFINE_WAIT_BIT(wait, &page->flags, PG_locked);
>  
> +	VM_BUG_ON(PageTail(page));
>  	__wait_on_bit_lock(page_waitqueue(page), &wait, sleep_on_page,
>  							TASK_UNINTERRUPTIBLE);
>  }

Could we get a bit more of a description here in a comment or in the
patch summary about this?  It makes some sense to me that code shouldn't
be mucking with the tail pages, but I'm curious what your logic is here,
too.  I'm sure you've thought about it a lot more than I have.

^ permalink raw reply	[flat|nested] 243+ messages in thread

* Re: [PATCHv2, RFC 12/30] thp, mm: add event counters for huge page alloc on write to a file
  2013-03-14 17:50   ` Kirill A. Shutemov
@ 2013-03-21 17:59     ` Dave Hansen
  -1 siblings, 0 replies; 243+ messages in thread
From: Dave Hansen @ 2013-03-21 17:59 UTC (permalink / raw)
  To: Kirill A. Shutemov
  Cc: Andrea Arcangeli, Andrew Morton, Al Viro, Hugh Dickins,
	Wu Fengguang, Jan Kara, Mel Gorman, linux-mm, Andi Kleen,
	Matthew Wilcox, Kirill A. Shutemov, Hillf Danton, linux-fsdevel,
	linux-kernel

On 03/14/2013 10:50 AM, Kirill A. Shutemov wrote:
> --- a/include/linux/vm_event_item.h
> +++ b/include/linux/vm_event_item.h
> @@ -71,6 +71,8 @@ enum vm_event_item { PGPGIN, PGPGOUT, PSWPIN, PSWPOUT,
>  		THP_FAULT_FALLBACK,
>  		THP_COLLAPSE_ALLOC,
>  		THP_COLLAPSE_ALLOC_FAILED,
> +		THP_WRITE_ALLOC,
> +		THP_WRITE_FAILED,
>  		THP_SPLIT,
>  		THP_ZERO_PAGE_ALLOC,
>  		THP_ZERO_PAGE_ALLOC_FAILED,

I think these names are a bit terse.  It's certainly not _writes_ that
are failing and "THP_WRITE_FAILED" makes it sound that way.  Also, why
do we need to differentiate these from the existing anon-hugepage vm
stats?  The alloc_pages() call seems to be doing the exact same thing in
the end.  Is one more likely to succeed than the other?

^ permalink raw reply	[flat|nested] 243+ messages in thread

* Re: [PATCHv2, RFC 13/30] thp, mm: implement grab_cache_huge_page_write_begin()
  2013-03-14 17:50   ` Kirill A. Shutemov
@ 2013-03-21 18:15     ` Dave Hansen
  -1 siblings, 0 replies; 243+ messages in thread
From: Dave Hansen @ 2013-03-21 18:15 UTC (permalink / raw)
  To: Kirill A. Shutemov
  Cc: Andrea Arcangeli, Andrew Morton, Al Viro, Hugh Dickins,
	Wu Fengguang, Jan Kara, Mel Gorman, linux-mm, Andi Kleen,
	Matthew Wilcox, Kirill A. Shutemov, Hillf Danton, linux-fsdevel,
	linux-kernel

On 03/14/2013 10:50 AM, Kirill A. Shutemov wrote:
> From: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
> 
> The function is grab_cache_page_write_begin() twin but it tries to
> allocate huge page at given position aligned to HPAGE_CACHE_NR.

The obvious question, then, is whether we should just replace
grab_cache_page_write_begin() with this code and pass in HPAGE_CACHE_NR
or 1 based on whether we're doing a huge or normal page.
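
Purely as an illustration of that shape (the names below are made up, not
from this series), it could be a single helper parameterised by the number
of base pages, with the current name kept as a thin wrapper:

	struct page *grab_cache_page_write_begin_nr(struct address_space *mapping,
			pgoff_t index, unsigned flags, unsigned nr_pages);

	static inline struct page *
	grab_cache_page_write_begin(struct address_space *mapping,
			pgoff_t index, unsigned flags)
	{
		return grab_cache_page_write_begin_nr(mapping, index, flags, 1);
	}

and the write path would pass HPAGE_CACHE_NR when it wants a huge page.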

> diff --git a/mm/filemap.c b/mm/filemap.c
> index 38fdc92..bdedb1b 100644
> --- a/mm/filemap.c
> +++ b/mm/filemap.c
> @@ -2332,6 +2332,64 @@ found:
>  }
>  EXPORT_SYMBOL(grab_cache_page_write_begin);
>  
> +#ifdef CONFIG_TRANSPARENT_HUGEPAGE
> +/*
> + * Find or create a huge page at the given pagecache position, aligned to
> + * HPAGE_CACHE_NR. Return the locked huge page.
> + *
> + * If, for some reason, it's not possible allocate a huge page at this
> + * possition, it returns NULL. Caller should take care of fallback to small
> + * pages.
> + *
> + * This function is specifically for buffered writes.
> + */
> +struct page *grab_cache_huge_page_write_begin(struct address_space *mapping,
> +		pgoff_t index, unsigned flags)
> +{
> +	int status;
> +	gfp_t gfp_mask;
> +	struct page *page;
> +	gfp_t gfp_notmask = 0;
> +
> +	BUG_ON(index & HPAGE_CACHE_INDEX_MASK);

--
> +	gfp_mask = mapping_gfp_mask(mapping);
> +	BUG_ON(!(gfp_mask & __GFP_COMP));
> +	if (mapping_cap_account_dirty(mapping))
> +		gfp_mask |= __GFP_WRITE;
> +	if (flags & AOP_FLAG_NOFS)
> +		gfp_notmask = __GFP_FS;

This whole hunk is both non-obvious and copy-n-pasted from
grab_cache_page_write_begin().  That makes me worry that bugs/features
will get added/removed in one and not the other.  I really think they
need to get consolidated somehow.

> +repeat:
> +	page = find_lock_page(mapping, index);
> +	if (page) {
> +		if (!PageTransHuge(page)) {
> +			unlock_page(page);
> +			page_cache_release(page);
> +			return NULL;
> +		}
> +		goto found;
> +	}
> +
> +	page = alloc_pages(gfp_mask & ~gfp_notmask, HPAGE_PMD_ORDER);

I alluded to this a second ago, but what's wrong with alloc_hugepage()?

> +	if (!page) {
> +		count_vm_event(THP_WRITE_FAILED);
> +		return NULL;
> +	}
> +
> +	count_vm_event(THP_WRITE_ALLOC);
> +	status = add_to_page_cache_lru(page, mapping, index,
> +			GFP_KERNEL & ~gfp_notmask);
> +	if (unlikely(status)) {
> +		page_cache_release(page);
> +		if (status == -EEXIST)
> +			goto repeat;
> +		return NULL;
> +	}

I'm rather un-fond of sprinkling likely/unlikelies around.  But, I guess
this is really just copied from the existing one.  <sigh>

> +found:
> +	wait_on_page_writeback(page);
> +	return page;
> +}
> +#endif

So, I diffed:

-struct page *grab_cache_page_write_begin(struct address_space
vs.
+struct page *grab_cache_huge_page_write_begin(struct address_space

They're just to similar to ignore.  Please consolidate them somehow.


^ permalink raw reply	[flat|nested] 243+ messages in thread

* Re: [PATCHv2, RFC 13/30] thp, mm: implement grab_cache_huge_page_write_begin()
  2013-03-14 17:50   ` Kirill A. Shutemov
@ 2013-03-21 18:16     ` Dave Hansen
  -1 siblings, 0 replies; 243+ messages in thread
From: Dave Hansen @ 2013-03-21 18:16 UTC (permalink / raw)
  To: Kirill A. Shutemov
  Cc: Andrea Arcangeli, Andrew Morton, Al Viro, Hugh Dickins,
	Wu Fengguang, Jan Kara, Mel Gorman, linux-mm, Andi Kleen,
	Matthew Wilcox, Kirill A. Shutemov, Hillf Danton, linux-fsdevel,
	linux-kernel

On 03/14/2013 10:50 AM, Kirill A. Shutemov wrote:
> +found:
> +	wait_on_page_writeback(page);
> +	return page;
> +}
> +#endif

In grab_cache_page_write_begin(), this "wait" is:

        wait_for_stable_page(page);

Why is it different here?

^ permalink raw reply	[flat|nested] 243+ messages in thread

* Re: [PATCHv2, RFC 02/30] mm: implement zero_huge_user_segment and friends
  2013-03-21 15:23     ` Dave Hansen
@ 2013-03-22  9:21       ` Kirill A. Shutemov
  -1 siblings, 0 replies; 243+ messages in thread
From: Kirill A. Shutemov @ 2013-03-22  9:21 UTC (permalink / raw)
  To: Dave Hansen
  Cc: Kirill A. Shutemov, Andrea Arcangeli, Andrew Morton, Al Viro,
	Hugh Dickins, Wu Fengguang, Jan Kara, Mel Gorman, linux-mm,
	Andi Kleen, Matthew Wilcox, Kirill A. Shutemov, Hillf Danton,
	linux-fsdevel, linux-kernel

Dave Hansen wrote:
> On 03/14/2013 10:50 AM, Kirill A. Shutemov wrote:
> > Let's add helpers to clear huge page segment(s). They provide the same
> > functionallity as zero_user_segment{,s} and zero_user, but for huge
> > pages
> ...
> > +static inline void zero_huge_user_segments(struct page *page,
> > +		unsigned start1, unsigned end1,
> > +		unsigned start2, unsigned end2)
> > +{
> > +	zero_huge_user_segment(page, start1, end1);
> > +	zero_huge_user_segment(page, start2, end2);
> > +}
> 
> I'm not sure that this helper saves very much code.  The one call later
> in these patches:
> 
> +                       zero_huge_user_segments(page, 0, from,
> +                                       from + len, HPAGE_PMD_SIZE);
> 
> really only saves one line over this:
> 
> 			zero_huge_user_segment(page, 0, from);
> 			zero_huge_user_segment(page, from + len,
> 					       HPAGE_PMD_SIZE);
> 
> and I think the second one is much more clear to read.

I've tried to mimic non-huge zero_user*, but, yeah, this is silly.
Will drop.

> I do see that there's a small-page variant of this, but I think that one
> was done to save doing two kmap_atomic() operations when you wanted to
> zero two separate operations.  This variant doesn't have that kind of
> optimization, so it makes much less sense.
> 
> > +void zero_huge_user_segment(struct page *page, unsigned start, unsigned end)
> > +{
> > +	int i;
> > +	
> > +	BUG_ON(end < start);
> > +
> > +	might_sleep();
> > +
> > +	if (start == end)
> > +		return;
> 
> I've really got to wonder how much of an optimization this is in
> practice.  Was there a specific reason this was added?

simple_write_begin() is likely to call zero[_huge]_user_segments() with one
of the two segments having start == end.

But, honestly, it was just easier to cut the corner case out first and not
bother about it in the following code. ;)

> > +	/* start and end are on the same small page */
> > +	if ((start & PAGE_MASK) == ((end - 1) & PAGE_MASK))
> > +		return zero_user_segment(page + (start >> PAGE_SHIFT),
> > +				start & ~PAGE_MASK,
> > +				((end - 1) & ~PAGE_MASK) + 1);
> 
> It wasn't immediately obvious to me why we need to optimize the "on the
> same page" case.  I _think_ it's because using zero_user_segments()
> saves us a kmap_atomic() over the code below.  Is that right?  It might
> be worth a comment.

The code below would call zero_user_segment() twice for the same small
page, but here we can get away with a single call.

I'll document it.

> > +	zero_user_segment(page + (start >> PAGE_SHIFT),
> > +			start & ~PAGE_MASK, PAGE_SIZE);
> > +	for (i = (start >> PAGE_SHIFT) + 1; i < (end >> PAGE_SHIFT) - 1; i++) {
> > +		cond_resched();
> > +		clear_highpage(page + i);
> 
> zero_user_segments() does a flush_dcache_page(), which wouldn't get done
> on these middle pages.  Is that a problem?

I think it is. Will fix.

> > +	}
> > +	zero_user_segment(page + i, 0, ((end - 1) & ~PAGE_MASK) + 1);
> > +}
> 
> This code is dying for some local variables.  It could really use a
> 'start_pfn_offset' and 'end_pfn_offset' or something similar.  All of
> the shifting and masking is a bit hard to read and it would be nice to
> think of some real names for what it is doing.
> 
> It also desperately needs some comments about how it works.  Some
> one-liners like:
> 
> 	/* zero the first (possibly partial) page */
> 	for()..
> 		/* zero the full pages in the middle */
> 	/* zero the last (possibly partial) page */
> 
> would be pretty sweet.

Okay, will rework it.
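
Something like this, perhaps (an untested sketch, folding in the
flush_dcache_page() point above and the suggested comments):

	void zero_huge_user_segment(struct page *page, unsigned start, unsigned end)
	{
		unsigned first_subpage = start >> PAGE_SHIFT;
		unsigned last_subpage = (end - 1) >> PAGE_SHIFT;
		unsigned i;

		BUG_ON(end < start);
		might_sleep();

		if (start == end)
			return;

		/* start and end fall within the same small page */
		if (first_subpage == last_subpage) {
			zero_user_segment(page + first_subpage,
					start & ~PAGE_MASK,
					((end - 1) & ~PAGE_MASK) + 1);
			return;
		}

		/* zero the first (possibly partial) small page */
		zero_user_segment(page + first_subpage,
				start & ~PAGE_MASK, PAGE_SIZE);

		/* zero the full small pages in the middle */
		for (i = first_subpage + 1; i < last_subpage; i++) {
			cond_resched();
			clear_highpage(page + i);
			flush_dcache_page(page + i);
		}

		/* zero the last (possibly partial) small page */
		zero_user_segment(page + last_subpage, 0,
				((end - 1) & ~PAGE_MASK) + 1);
	}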

-- 
 Kirill A. Shutemov

^ permalink raw reply	[flat|nested] 243+ messages in thread

* Re: [PATCHv2, RFC 04/30] radix-tree: implement preload for multiple contiguous elements
  2013-03-21 15:56     ` Dave Hansen
@ 2013-03-22  9:47       ` Kirill A. Shutemov
  -1 siblings, 0 replies; 243+ messages in thread
From: Kirill A. Shutemov @ 2013-03-22  9:47 UTC (permalink / raw)
  To: Dave Hansen
  Cc: Kirill A. Shutemov, Andrea Arcangeli, Andrew Morton, Al Viro,
	Hugh Dickins, Wu Fengguang, Jan Kara, Mel Gorman, linux-mm,
	Andi Kleen, Matthew Wilcox, Kirill A. Shutemov, Hillf Danton,
	linux-fsdevel, linux-kernel

Dave Hansen wrote:
> On 03/14/2013 10:50 AM, Kirill A. Shutemov wrote:
> > From: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
> > 
> > Currently radix_tree_preload() only guarantees enough nodes to insert
> > one element. It's a hard limit. You cannot batch a number insert under
> > one tree_lock.
> > 
> > This patch introduces radix_tree_preload_count(). It allows to
> > preallocate nodes enough to insert a number of *contiguous* elements.
> 
> You don't need to write a paper on how radix trees work, but it might be
> nice to include a wee bit of text in here about how the existing preload
> works, and how this new guarantee works.

Reasonable, will do.

> > diff --git a/include/linux/radix-tree.h b/include/linux/radix-tree.h
> > index ffc444c..81318cb 100644
> > --- a/include/linux/radix-tree.h
> > +++ b/include/linux/radix-tree.h
> > @@ -83,6 +83,8 @@ do {									\
> >  	(root)->rnode = NULL;						\
> >  } while (0)
> >  
> > +#define RADIX_TREE_PRELOAD_NR		512 /* For THP's benefit */
> 
> This eventually boils down to making the radix_tree_preload array
> larger.  Do we really want to do this unconditionally if it's only for
> THP's benefit?

It will be useful not only for THP. Being able to batch insertions under a
single tree_lock acquisition can also help with scalability elsewhere.

> >  /**
> >   * Radix-tree synchronization
> >   *
> > @@ -231,6 +233,7 @@ unsigned long radix_tree_next_hole(struct radix_tree_root *root,
> >  unsigned long radix_tree_prev_hole(struct radix_tree_root *root,
> >  				unsigned long index, unsigned long max_scan);
> >  int radix_tree_preload(gfp_t gfp_mask);
> > +int radix_tree_preload_count(unsigned size, gfp_t gfp_mask);
> >  void radix_tree_init(void);
> >  void *radix_tree_tag_set(struct radix_tree_root *root,
> >  			unsigned long index, unsigned int tag);
> > diff --git a/lib/radix-tree.c b/lib/radix-tree.c
> > index e796429..9bef0ac 100644
> > --- a/lib/radix-tree.c
> > +++ b/lib/radix-tree.c
> > @@ -81,16 +81,24 @@ static struct kmem_cache *radix_tree_node_cachep;
> >   * The worst case is a zero height tree with just a single item at index 0,
> >   * and then inserting an item at index ULONG_MAX. This requires 2 new branches
> >   * of RADIX_TREE_MAX_PATH size to be created, with only the root node shared.
> > + *
> > + * Worst case for adding N contiguous items is adding entries at indexes
> > + * (ULONG_MAX - N) to ULONG_MAX. It requires nodes to insert single worst-case
> > + * item plus extra nodes if you cross the boundary from one node to the next.
> > + *
> >   * Hence:
> >   */
> > -#define RADIX_TREE_PRELOAD_SIZE (RADIX_TREE_MAX_PATH * 2 - 1)
> > +#define RADIX_TREE_PRELOAD_MIN (RADIX_TREE_MAX_PATH * 2 - 1)
> > +#define RADIX_TREE_PRELOAD_MAX \
> > +	(RADIX_TREE_PRELOAD_MIN + \
> > +	 DIV_ROUND_UP(RADIX_TREE_PRELOAD_NR - 1, RADIX_TREE_MAP_SIZE))
> >  
> >  /*
> >   * Per-cpu pool of preloaded nodes
> >   */
> >  struct radix_tree_preload {
> >  	int nr;
> > -	struct radix_tree_node *nodes[RADIX_TREE_PRELOAD_SIZE];
> > +	struct radix_tree_node *nodes[RADIX_TREE_PRELOAD_MAX];
> >  };
> 
> For those of us too lazy to go compile a kernel and figure this out in
> practice, how much bigger does this make the nodes[] array?

We have three possible values of RADIX_TREE_MAP_SHIFT:

#ifdef __KERNEL__
#define RADIX_TREE_MAP_SHIFT	(CONFIG_BASE_SMALL ? 4 : 6)
#else
#define RADIX_TREE_MAP_SHIFT	3	/* For more stressful testing */
#endif

On 64-bit system:
For RADIX_TREE_MAP_SHIFT=3, old array size is 43, new is 107.
For RADIX_TREE_MAP_SHIFT=4, old array size is 31, new is 63.
For RADIX_TREE_MAP_SHIFT=6, old array size is 21, new is 30.

On 32-bit system:
For RADIX_TREE_MAP_SHIFT=3, old array size is 21, new is 84.
For RADIX_TREE_MAP_SHIFT=4, old array size is 15, new is 46.
For RADIX_TREE_MAP_SHIFT=6, old array size is 11, new is 19.

On most machines we will have RADIX_TREE_MAP_SHIFT=6.

> 
> >  static DEFINE_PER_CPU(struct radix_tree_preload, radix_tree_preloads) = { 0, };
> >  
> > @@ -257,29 +265,34 @@ radix_tree_node_free(struct radix_tree_node *node)
> >  
> >  /*
> >   * Load up this CPU's radix_tree_node buffer with sufficient objects to
> > - * ensure that the addition of a single element in the tree cannot fail.  On
> > - * success, return zero, with preemption disabled.  On error, return -ENOMEM
> > + * ensure that the addition of *contiguous* elements in the tree cannot fail.
> > + * On success, return zero, with preemption disabled.  On error, return -ENOMEM
> >   * with preemption not disabled.
> >   *
> >   * To make use of this facility, the radix tree must be initialised without
> >   * __GFP_WAIT being passed to INIT_RADIX_TREE().
> >   */
> > -int radix_tree_preload(gfp_t gfp_mask)
> > +int radix_tree_preload_count(unsigned size, gfp_t gfp_mask)
> >  {
> >  	struct radix_tree_preload *rtp;
> >  	struct radix_tree_node *node;
> >  	int ret = -ENOMEM;
> > +	int alloc = RADIX_TREE_PRELOAD_MIN +
> > +		DIV_ROUND_UP(size - 1, RADIX_TREE_MAP_SIZE);
> 
> Any chance I could talk you in to giving 'alloc' a better name?  Maybe
> "preload_target" or "preload_fill_to".

Ok.

> > +	if (size > RADIX_TREE_PRELOAD_NR)
> > +		return -ENOMEM;
> 
> I always wonder if these deep, logical -ENOMEMs deserve a WARN_ONCE().
> We really don't expect to hit this unless something really funky is
> going on, right?

Correct. Will add WARN.
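
E.g. something like (sketch):

	if (WARN_ON_ONCE(size > RADIX_TREE_PRELOAD_NR))
		return -ENOMEM;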

> >  	preempt_disable();
> >  	rtp = &__get_cpu_var(radix_tree_preloads);
> > -	while (rtp->nr < ARRAY_SIZE(rtp->nodes)) {
> > +	while (rtp->nr < alloc) {
> >  		preempt_enable();
> >  		node = kmem_cache_alloc(radix_tree_node_cachep, gfp_mask);
> >  		if (node == NULL)
> >  			goto out;
> >  		preempt_disable();
> >  		rtp = &__get_cpu_var(radix_tree_preloads);
> > -		if (rtp->nr < ARRAY_SIZE(rtp->nodes))
> > +		if (rtp->nr < alloc)
> >  			rtp->nodes[rtp->nr++] = node;
> >  		else
> >  			kmem_cache_free(radix_tree_node_cachep, node);
> > @@ -288,6 +301,11 @@ int radix_tree_preload(gfp_t gfp_mask)
> >  out:
> >  	return ret;
> >  }

-- 
 Kirill A. Shutemov

^ permalink raw reply	[flat|nested] 243+ messages in thread

* Re: [PATCHv2, RFC 04/30] radix-tree: implement preload for multiple contiguous elements
@ 2013-03-22  9:47       ` Kirill A. Shutemov
  0 siblings, 0 replies; 243+ messages in thread
From: Kirill A. Shutemov @ 2013-03-22  9:47 UTC (permalink / raw)
  To: Dave Hansen
  Cc: Kirill A. Shutemov, Andrea Arcangeli, Andrew Morton, Al Viro,
	Hugh Dickins, Wu Fengguang, Jan Kara, Mel Gorman, linux-mm,
	Andi Kleen, Matthew Wilcox, Kirill A. Shutemov, Hillf Danton,
	linux-fsdevel, linux-kernel

Dave Hansen wrote:
> On 03/14/2013 10:50 AM, Kirill A. Shutemov wrote:
> > From: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
> > 
> > Currently radix_tree_preload() only guarantees enough nodes to insert
> > one element. It's a hard limit. You cannot batch a number insert under
> > one tree_lock.
> > 
> > This patch introduces radix_tree_preload_count(). It allows to
> > preallocate nodes enough to insert a number of *contiguous* elements.
> 
> You don't need to write a paper on how radix trees work, but it might be
> nice to include a wee bit of text in here about how the existing preload
> works, and how this new guarantee works.

Reasonable, will do.

> > diff --git a/include/linux/radix-tree.h b/include/linux/radix-tree.h
> > index ffc444c..81318cb 100644
> > --- a/include/linux/radix-tree.h
> > +++ b/include/linux/radix-tree.h
> > @@ -83,6 +83,8 @@ do {									\
> >  	(root)->rnode = NULL;						\
> >  } while (0)
> >  
> > +#define RADIX_TREE_PRELOAD_NR		512 /* For THP's benefit */
> 
> This eventually boils down to making the radix_tree_preload array
> larger.  Do we really want to do this unconditionally if it's only for
> THP's benefit?

It will be useful not only for THP. Batching can be useful to solve
scalability issues.

> >  /**
> >   * Radix-tree synchronization
> >   *
> > @@ -231,6 +233,7 @@ unsigned long radix_tree_next_hole(struct radix_tree_root *root,
> >  unsigned long radix_tree_prev_hole(struct radix_tree_root *root,
> >  				unsigned long index, unsigned long max_scan);
> >  int radix_tree_preload(gfp_t gfp_mask);
> > +int radix_tree_preload_count(unsigned size, gfp_t gfp_mask);
> >  void radix_tree_init(void);
> >  void *radix_tree_tag_set(struct radix_tree_root *root,
> >  			unsigned long index, unsigned int tag);
> > diff --git a/lib/radix-tree.c b/lib/radix-tree.c
> > index e796429..9bef0ac 100644
> > --- a/lib/radix-tree.c
> > +++ b/lib/radix-tree.c
> > @@ -81,16 +81,24 @@ static struct kmem_cache *radix_tree_node_cachep;
> >   * The worst case is a zero height tree with just a single item at index 0,
> >   * and then inserting an item at index ULONG_MAX. This requires 2 new branches
> >   * of RADIX_TREE_MAX_PATH size to be created, with only the root node shared.
> > + *
> > + * Worst case for adding N contiguous items is adding entries at indexes
> > + * (ULONG_MAX - N) to ULONG_MAX. It requires nodes to insert single worst-case
> > + * item plus extra nodes if you cross the boundary from one node to the next.
> > + *
> >   * Hence:
> >   */
> > -#define RADIX_TREE_PRELOAD_SIZE (RADIX_TREE_MAX_PATH * 2 - 1)
> > +#define RADIX_TREE_PRELOAD_MIN (RADIX_TREE_MAX_PATH * 2 - 1)
> > +#define RADIX_TREE_PRELOAD_MAX \
> > +	(RADIX_TREE_PRELOAD_MIN + \
> > +	 DIV_ROUND_UP(RADIX_TREE_PRELOAD_NR - 1, RADIX_TREE_MAP_SIZE))
> >  
> >  /*
> >   * Per-cpu pool of preloaded nodes
> >   */
> >  struct radix_tree_preload {
> >  	int nr;
> > -	struct radix_tree_node *nodes[RADIX_TREE_PRELOAD_SIZE];
> > +	struct radix_tree_node *nodes[RADIX_TREE_PRELOAD_MAX];
> >  };
> 
> For those of us too lazy to go compile a kernel and figure this out in
> practice, how much bigger does this make the nodes[] array?

We have three possible RADIX_TREE_MAP_SHIFT:

#ifdef __KERNEL__
#define RADIX_TREE_MAP_SHIFT	(CONFIG_BASE_SMALL ? 4 : 6)
#else
#define RADIX_TREE_MAP_SHIFT	3	/* For more stressful testing */
#endif

On 64-bit system:
For RADIX_TREE_MAP_SHIFT=3, old array size is 43, new is 107.
For RADIX_TREE_MAP_SHIFT=4, old array size is 31, new is 63.
For RADIX_TREE_MAP_SHIFT=6, old array size is 21, new is 30.

On 32-bit system:
For RADIX_TREE_MAP_SHIFT=3, old array size is 21, new is 84.
For RADIX_TREE_MAP_SHIFT=4, old array size is 15, new is 46.
For RADIX_TREE_MAP_SHIFT=6, old array size is 11, new is 19.

On most machines we will have RADIX_TREE_MAP_SHIFT=6.

> 
> >  static DEFINE_PER_CPU(struct radix_tree_preload, radix_tree_preloads) = { 0, };
> >  
> > @@ -257,29 +265,34 @@ radix_tree_node_free(struct radix_tree_node *node)
> >  
> >  /*
> >   * Load up this CPU's radix_tree_node buffer with sufficient objects to
> > - * ensure that the addition of a single element in the tree cannot fail.  On
> > - * success, return zero, with preemption disabled.  On error, return -ENOMEM
> > + * ensure that the addition of *contiguous* elements in the tree cannot fail.
> > + * On success, return zero, with preemption disabled.  On error, return -ENOMEM
> >   * with preemption not disabled.
> >   *
> >   * To make use of this facility, the radix tree must be initialised without
> >   * __GFP_WAIT being passed to INIT_RADIX_TREE().
> >   */
> > -int radix_tree_preload(gfp_t gfp_mask)
> > +int radix_tree_preload_count(unsigned size, gfp_t gfp_mask)
> >  {
> >  	struct radix_tree_preload *rtp;
> >  	struct radix_tree_node *node;
> >  	int ret = -ENOMEM;
> > +	int alloc = RADIX_TREE_PRELOAD_MIN +
> > +		DIV_ROUND_UP(size - 1, RADIX_TREE_MAP_SIZE);
> 
> Any chance I could talk you in to giving 'alloc' a better name?  Maybe
> "preload_target" or "preload_fill_to".

Ok.

> > +	if (size > RADIX_TREE_PRELOAD_NR)
> > +		return -ENOMEM;
> 
> I always wonder if these deep, logical -ENOMEMs deserve a WARN_ONCE().
> We really don't expect to hit this unless something really funky is
> going on, right?

Correct. Will add WARN.
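
Probably something like this (untested):

	if (WARN_ON_ONCE(size > RADIX_TREE_PRELOAD_NR))
		return -ENOMEM;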

> >  	preempt_disable();
> >  	rtp = &__get_cpu_var(radix_tree_preloads);
> > -	while (rtp->nr < ARRAY_SIZE(rtp->nodes)) {
> > +	while (rtp->nr < alloc) {
> >  		preempt_enable();
> >  		node = kmem_cache_alloc(radix_tree_node_cachep, gfp_mask);
> >  		if (node == NULL)
> >  			goto out;
> >  		preempt_disable();
> >  		rtp = &__get_cpu_var(radix_tree_preloads);
> > -		if (rtp->nr < ARRAY_SIZE(rtp->nodes))
> > +		if (rtp->nr < alloc)
> >  			rtp->nodes[rtp->nr++] = node;
> >  		else
> >  			kmem_cache_free(radix_tree_node_cachep, node);
> > @@ -288,6 +301,11 @@ int radix_tree_preload(gfp_t gfp_mask)
> >  out:
> >  	return ret;
> >  }

-- 
 Kirill A. Shutemov

^ permalink raw reply	[flat|nested] 243+ messages in thread

* Re: [PATCHv2, RFC 05/30] thp, mm: avoid PageUnevictable on active/inactive lru lists
  2013-03-21 16:15     ` Dave Hansen
@ 2013-03-22 10:11       ` Kirill A. Shutemov
  -1 siblings, 0 replies; 243+ messages in thread
From: Kirill A. Shutemov @ 2013-03-22 10:11 UTC (permalink / raw)
  To: Dave Hansen
  Cc: Kirill A. Shutemov, Andrea Arcangeli, Andrew Morton, Al Viro,
	Hugh Dickins, Wu Fengguang, Jan Kara, Mel Gorman, linux-mm,
	Andi Kleen, Matthew Wilcox, Kirill A. Shutemov, Hillf Danton,
	linux-fsdevel, linux-kernel

Dave Hansen wrote:
> On 03/14/2013 10:50 AM, Kirill A. Shutemov wrote:
> > active/inactive lru lists can contain unevicable pages (i.e. ramfs pages
> > that have been placed on the LRU lists when first allocated), but these
> > pages must not have PageUnevictable set - otherwise shrink_active_list
> > goes crazy:
> ...
> > For lru_add_page_tail(), it means we should not set PageUnevictable()
> > for tail pages unless we're sure that it will go to LRU_UNEVICTABLE.
> > The tail page will go LRU_UNEVICTABLE if head page is not on LRU or if
> > it's marked PageUnevictable() too.
> 
> This is only an issue once you're using lru_add_page_tail() for
> non-anonymous pages, right?

I'm not sure about that. Documentation/vm/unevictable-lru.txt:

Some examples of these unevictable pages on the LRU lists are:

 (1) ramfs pages that have been placed on the LRU lists when first allocated.

 (2) SHM_LOCK'd shared memory pages.  shmctl(SHM_LOCK) does not attempt to
     allocate or fault in the pages in the shared memory region.  This happens
     when an application accesses the page the first time after SHM_LOCK'ing
     the segment.

 (3) mlocked pages that could not be isolated from the LRU and moved to the
     unevictable list in mlock_vma_page().

 (4) Pages mapped into multiple VM_LOCKED VMAs, but try_to_munlock() couldn't
     acquire the VMA's mmap semaphore to test the flags and set PageMlocked.
     munlock_vma_page() was forced to let the page back on to the normal LRU
     list for vmscan to handle.

> > diff --git a/mm/swap.c b/mm/swap.c
> > index 92a9be5..31584d0 100644
> > --- a/mm/swap.c
> > +++ b/mm/swap.c
> > @@ -762,7 +762,8 @@ void lru_add_page_tail(struct page *page, struct page *page_tail,
> >  			lru = LRU_INACTIVE_ANON;
> >  		}
> >  	} else {
> > -		SetPageUnevictable(page_tail);
> > +		if (!PageLRU(page) || PageUnevictable(page))
> > +			SetPageUnevictable(page_tail);
> >  		lru = LRU_UNEVICTABLE;
> >  	}
> 
> You were saying above that ramfs pages can get on the normal
> active/inactive lists.  But, this will end up getting them on the
> unevictable list, right?  So, we have normal ramfs pages on the
> active/inactive lists, but ramfs pages after a huge-page-split on the
> unevictable list.  That seems a bit inconsistent.

Yeah, it's confusing.

I was able to trigger another bug in this code:
if page_evictable(page_tail) is false and PageLRU(page) is true, page_tail
will go to the same lru as page, but nothing syncs page_tail's
active/inactive state with page. So we can end up with an inactive page on
an active lru...

I've updated the patch for the next iteration. You can check it in git.
It should be cleaner. The description needs to be updated.

-- 
 Kirill A. Shutemov

^ permalink raw reply	[flat|nested] 243+ messages in thread

* Re: [PATCHv2, RFC 07/30] thp, mm: introduce mapping_can_have_hugepages() predicate
  2013-03-21 16:21     ` Dave Hansen
@ 2013-03-22 10:12       ` Kirill A. Shutemov
  -1 siblings, 0 replies; 243+ messages in thread
From: Kirill A. Shutemov @ 2013-03-22 10:12 UTC (permalink / raw)
  To: Dave Hansen
  Cc: Kirill A. Shutemov, Andrea Arcangeli, Andrew Morton, Al Viro,
	Hugh Dickins, Wu Fengguang, Jan Kara, Mel Gorman, linux-mm,
	Andi Kleen, Matthew Wilcox, Kirill A. Shutemov, Hillf Danton,
	linux-fsdevel, linux-kernel

Dave Hansen wrote:
> On 03/14/2013 10:50 AM, Kirill A. Shutemov wrote:
> > +static inline bool mapping_can_have_hugepages(struct address_space *m)
> > +{
> > +	if (IS_ENABLED(CONFIG_TRANSPARENT_HUGEPAGE)) {
> > +		gfp_t gfp_mask = mapping_gfp_mask(m);
> > +		return !!(gfp_mask & __GFP_COMP);
> > +	}
> > +
> > +	return false;
> > +}
> 
> I did a quick search in all your patches and don't see __GFP_COMP
> getting _set_ anywhere.  Am I missing something?

__GFP_COMP is part of GFP_TRANSHUGE. We set it for ramfs in patch 20/30.

-- 
 Kirill A. Shutemov

^ permalink raw reply	[flat|nested] 243+ messages in thread

* Re: [PATCHv2, RFC 08/30] thp, mm: rewrite add_to_page_cache_locked() to support huge pages
  2013-03-21 17:11     ` Dave Hansen
@ 2013-03-22 10:34       ` Kirill A. Shutemov
  -1 siblings, 0 replies; 243+ messages in thread
From: Kirill A. Shutemov @ 2013-03-22 10:34 UTC (permalink / raw)
  To: Dave Hansen
  Cc: Kirill A. Shutemov, Andrea Arcangeli, Andrew Morton, Al Viro,
	Hugh Dickins, Wu Fengguang, Jan Kara, Mel Gorman, linux-mm,
	Andi Kleen, Matthew Wilcox, Kirill A. Shutemov, Hillf Danton,
	linux-fsdevel, linux-kernel

Dave Hansen wrote:
> On 03/14/2013 10:50 AM, Kirill A. Shutemov wrote:
> > From: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
> > 
> > For huge page we add to radix tree HPAGE_CACHE_NR pages at once: head
> > page for the specified index and HPAGE_CACHE_NR-1 tail pages for
> > following indexes.
> > 
> > Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
> > ---
> >  mm/filemap.c |   76 ++++++++++++++++++++++++++++++++++++++++------------------
> >  1 file changed, 53 insertions(+), 23 deletions(-)
> > 
> > diff --git a/mm/filemap.c b/mm/filemap.c
> > index 2d99191..6bac9e2 100644
> > --- a/mm/filemap.c
> > +++ b/mm/filemap.c
> > @@ -447,6 +447,7 @@ int add_to_page_cache_locked(struct page *page, struct address_space *mapping,
> >  		pgoff_t offset, gfp_t gfp_mask)
> >  {
> >  	int error;
> > +	int nr = 1;
> >  
> >  	VM_BUG_ON(!PageLocked(page));
> >  	VM_BUG_ON(PageSwapBacked(page));
> > @@ -454,32 +455,61 @@ int add_to_page_cache_locked(struct page *page, struct address_space *mapping,
> >  	error = mem_cgroup_cache_charge(page, current->mm,
> >  					gfp_mask & GFP_RECLAIM_MASK);
> >  	if (error)
> > -		goto out;
> > +		return error;
> >  
> > -	error = radix_tree_preload(gfp_mask & ~__GFP_HIGHMEM);
> > -	if (error == 0) {
> > -		page_cache_get(page);
> > -		page->mapping = mapping;
> > -		page->index = offset;
> > +	if (PageTransHuge(page)) {
> > +		BUILD_BUG_ON(HPAGE_CACHE_NR > RADIX_TREE_PRELOAD_NR);
> > +		nr = HPAGE_CACHE_NR;
> > +	}
> 
> That seems like a slightly odd place to put a BUILD_BUG_ON().  I guess
> it doesn't matter to some degree, but does putting it inside the if()
> imply anything?

It actually matters.

HPAGE_CACHE_NR is BUILD_BUG() if !CONFIG_TRANSPARENT_HUGEPAGE, so we need
to hide it inside 'if (PageTransHuge(page))'. PageTransHuge(page) is 0 at
compile time if !CONFIG_TRANSPARENT_HUGEPAGE, so the compiler can be smart
and optimize out the check.

> > +	error = radix_tree_preload_count(nr, gfp_mask & ~__GFP_HIGHMEM);
> > +	if (error) {
> > +		mem_cgroup_uncharge_cache_page(page);
> > +		return error;
> > +	}
> >  
> > -		spin_lock_irq(&mapping->tree_lock);
> > -		error = radix_tree_insert(&mapping->page_tree, offset, page);
> > -		if (likely(!error)) {
> > -			mapping->nrpages++;
> > -			__inc_zone_page_state(page, NR_FILE_PAGES);
> > -			spin_unlock_irq(&mapping->tree_lock);
> > -			trace_mm_filemap_add_to_page_cache(page);
> > -		} else {
> > -			page->mapping = NULL;
> > -			/* Leave page->index set: truncation relies upon it */
> > -			spin_unlock_irq(&mapping->tree_lock);
> > -			mem_cgroup_uncharge_cache_page(page);
> > -			page_cache_release(page);
> 
> I do really like how this rewrite de-indents this code. :)

:)

> > +	page_cache_get(page);
> > +	spin_lock_irq(&mapping->tree_lock);
> > +	page->mapping = mapping;
> > +	page->index = offset;
> > +	error = radix_tree_insert(&mapping->page_tree, offset, page);
> > +	if (unlikely(error))
> > +		goto err;
> > +	if (PageTransHuge(page)) {
> > +		int i;
> > +		for (i = 1; i < HPAGE_CACHE_NR; i++) {
> > +			page_cache_get(page + i);
> > +			page[i].index = offset + i;
> 
> Is it OK to leave page->mapping unset for these?

Good catch, thanks.
It seems nobody really uses it, since I haven't hit any oops, but we need
to set it anyway.
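
I.e. something like this in the tail-page loop (untested sketch):

	for (i = 1; i < HPAGE_CACHE_NR; i++) {
		page_cache_get(page + i);
		page[i].index = offset + i;
		page[i].mapping = mapping;
		...
	}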

> > +			error = radix_tree_insert(&mapping->page_tree,
> > +					offset + i, page + i);
> > +			if (error) {
> > +				page_cache_release(page + i);
> > +				break;
> > +			}
> >  		}
> 
> Throughout all this new code, I'd really challenge you to try as much as
> possible to minimize the code stuck under "if (PageTransHuge(page))".

I put thp-related code under the 'if' intentionally to be able to optimize
it out if !CONFIG_TRANSPARENT_HUGEPAGE. The config option is disabled by
default.

> For instance, could you change the for() loop a bit and have it shared
> between both cases, like:
> 
> > +	for (i = 0; i < nr; i++) {
> > +		page_cache_get(page + i);
> > +		page[i].index = offset + i;
> > +		error = radix_tree_insert(&mapping->page_tree,
> > +				offset + i, page + i);
> > +		if (error) {
> > +			page_cache_release(page + i);
> > +			break;
> > +		}
> >  	}
> 
> > -		radix_tree_preload_end();
> > -	} else
> > -		mem_cgroup_uncharge_cache_page(page);
> > -out:
> > +		if (error) {
> > +			error = ENOSPC; /* no space for a huge page */
> > +			for (i--; i > 0; i--) {
> > +				radix_tree_delete(&mapping->page_tree,
> > +						offset + i);
> > +				page_cache_release(page + i);
> > +			}
> > +			radix_tree_delete(&mapping->page_tree, offset);
> 
> I wonder if this would look any nicer if you just did all the
> page_cache_get()s for the entire huge page along with the head page, and
> then released them all in one place.  I think it might shrink the error
> handling paths here.
> 
> > +			goto err;
> > +		}
> > +	}
> > +	__mod_zone_page_state(page_zone(page), NR_FILE_PAGES, nr);
> > +	mapping->nrpages += nr;
> > +	spin_unlock_irq(&mapping->tree_lock);
> > +	trace_mm_filemap_add_to_page_cache(page);
> 
> Do we need to change the tracing to make sure it notes that these were
> or weren't huge pages?

Hm.. I guess we just need to add page order to the trace.

> > +	radix_tree_preload_end();
> > +	return 0;
> > +err:
> > +	page->mapping = NULL;
> > +	/* Leave page->index set: truncation relies upon it */
> > +	spin_unlock_irq(&mapping->tree_lock);
> > +	radix_tree_preload_end();
> > +	mem_cgroup_uncharge_cache_page(page);
> > +	page_cache_release(page);
> >  	return error;
> >  }
> >  EXPORT_SYMBOL(add_to_page_cache_locked);
> 
> Does the cgroup code know how to handle these large pages internally
> somehow?  It looks like the charge/uncharge is only being done for the
> head page.

It can. We only need to remove PageCompound() check there. Patch is in
git.

-- 
 Kirill A. Shutemov

^ permalink raw reply	[flat|nested] 243+ messages in thread

* Re: [PATCHv2, RFC 04/30] radix-tree: implement preload for multiple contiguous elements
  2013-03-22  9:47       ` Kirill A. Shutemov
@ 2013-03-22 14:38         ` Dave Hansen
  -1 siblings, 0 replies; 243+ messages in thread
From: Dave Hansen @ 2013-03-22 14:38 UTC (permalink / raw)
  To: Kirill A. Shutemov
  Cc: Andrea Arcangeli, Andrew Morton, Al Viro, Hugh Dickins,
	Wu Fengguang, Jan Kara, Mel Gorman, linux-mm, Andi Kleen,
	Matthew Wilcox, Kirill A. Shutemov, Hillf Danton, linux-fsdevel,
	linux-kernel

On 03/22/2013 02:47 AM, Kirill A. Shutemov wrote:
> Dave Hansen wrote:
>> On 03/14/2013 10:50 AM, Kirill A. Shutemov wrote:
>>> +#define RADIX_TREE_PRELOAD_NR		512 /* For THP's benefit */
>>
>> This eventually boils down to making the radix_tree_preload array
>> larger.  Do we really want to do this unconditionally if it's only for
>> THP's benefit?
> 
> It will be useful not only for THP. Batching can be useful to solve
> scalability issues.

Still, it seems like something that little machines with no THP support
probably don't want to pay the cost for.  Perhaps you could enable it
for THP||NR_CPUS>$FOO.
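
Something along these lines, maybe (the threshold below is just a made-up
placeholder):

#if defined(CONFIG_TRANSPARENT_HUGEPAGE) || NR_CPUS > 4 /* placeholder */
#define RADIX_TREE_PRELOAD_NR	512
#else
#define RADIX_TREE_PRELOAD_NR	1
#endif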

>> For those of us too lazy to go compile a kernel and figure this out in
>> practice, how much bigger does this make the nodes[] array?
> 
> We have three possible RADIX_TREE_MAP_SHIFT:
> 
> #ifdef __KERNEL__
> #define RADIX_TREE_MAP_SHIFT	(CONFIG_BASE_SMALL ? 4 : 6)
> #else
> #define RADIX_TREE_MAP_SHIFT	3	/* For more stressful testing */
> #endif
> 
> On 64-bit system:
> For RADIX_TREE_MAP_SHIFT=3, old array size is 43, new is 107.
> For RADIX_TREE_MAP_SHIFT=4, old array size is 31, new is 63.
> For RADIX_TREE_MAP_SHIFT=6, old array size is 21, new is 30.
> 
> On 32-bit system:
> For RADIX_TREE_MAP_SHIFT=3, old array size is 21, new is 84.
> For RADIX_TREE_MAP_SHIFT=4, old array size is 15, new is 46.
> For RADIX_TREE_MAP_SHIFT=6, old array size is 11, new is 19.
> 
> On most machines we will have RADIX_TREE_MAP_SHIFT=6.

Could you stick that in your patch description?  The total cost is
"array size" * sizeof(void*) * NR_CPUS, right?

-- Dave Hansen, Intel OTC Scalability Team

^ permalink raw reply	[flat|nested] 243+ messages in thread

* Re: [PATCHv2, RFC 07/30] thp, mm: introduce mapping_can_have_hugepages() predicate
  2013-03-22 10:12       ` Kirill A. Shutemov
@ 2013-03-22 14:44         ` Dave Hansen
  -1 siblings, 0 replies; 243+ messages in thread
From: Dave Hansen @ 2013-03-22 14:44 UTC (permalink / raw)
  To: Kirill A. Shutemov
  Cc: Andrea Arcangeli, Andrew Morton, Al Viro, Hugh Dickins,
	Wu Fengguang, Jan Kara, Mel Gorman, linux-mm, Andi Kleen,
	Matthew Wilcox, Kirill A. Shutemov, Hillf Danton, linux-fsdevel,
	linux-kernel

On 03/22/2013 03:12 AM, Kirill A. Shutemov wrote:
> Dave Hansen wrote:
>> On 03/14/2013 10:50 AM, Kirill A. Shutemov wrote:
>>> +static inline bool mapping_can_have_hugepages(struct address_space *m)
>>> +{
>>> +	if (IS_ENABLED(CONFIG_TRANSPARENT_HUGEPAGE)) {
>>> +		gfp_t gfp_mask = mapping_gfp_mask(m);
>>> +		return !!(gfp_mask & __GFP_COMP);
>>> +	}
>>> +
>>> +	return false;
>>> +}
>>
>> I did a quick search in all your patches and don't see __GFP_COMP
>> getting _set_ anywhere.  Am I missing something?
> 
> __GFP_COMP is part of GFP_TRANSHUGE. We set it for ramfs in patch 20/30.

That's a bit non-obvious.  For a casual observer, it _seems_ like you
should just be setting and checking GFP_TRANSHUGE directly.  It looks
like you were having some problems with __GFP_MOVABLE and masked it out
of GFP_TRANSHUGE and that has cascaded over to _this_ check.

I _think_ the right thing to do is to make the check in
mapping_can_have_hugepages() use (GFP_TRANSHUGE & ~__GFP_MOVABLE) with a
comment explaining why, and to add a TODO in the code and patch comments
to clean it up once ramfs_get_inode() gets fixed up too.
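
I.e. roughly (untested sketch):

static inline bool mapping_can_have_hugepages(struct address_space *m)
{
	if (IS_ENABLED(CONFIG_TRANSPARENT_HUGEPAGE)) {
		gfp_t gfp_mask = mapping_gfp_mask(m);
		/*
		 * ramfs sets GFP_TRANSHUGE minus __GFP_MOVABLE on its
		 * mappings for now.
		 * TODO: check GFP_TRANSHUGE directly once
		 * ramfs_get_inode() is fixed up.
		 */
		return (gfp_mask & (GFP_TRANSHUGE & ~__GFP_MOVABLE)) ==
				(GFP_TRANSHUGE & ~__GFP_MOVABLE);
	}

	return false;
}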

^ permalink raw reply	[flat|nested] 243+ messages in thread

* Re: [PATCHv2, RFC 08/30] thp, mm: rewrite add_to_page_cache_locked() to support huge pages
  2013-03-22 10:34       ` Kirill A. Shutemov
@ 2013-03-22 14:51         ` Dave Hansen
  -1 siblings, 0 replies; 243+ messages in thread
From: Dave Hansen @ 2013-03-22 14:51 UTC (permalink / raw)
  To: Kirill A. Shutemov
  Cc: Andrea Arcangeli, Andrew Morton, Al Viro, Hugh Dickins,
	Wu Fengguang, Jan Kara, Mel Gorman, linux-mm, Andi Kleen,
	Matthew Wilcox, Kirill A. Shutemov, Hillf Danton, linux-fsdevel,
	linux-kernel

On 03/22/2013 03:34 AM, Kirill A. Shutemov wrote:
> Dave Hansen wrote:
>> On 03/14/2013 10:50 AM, Kirill A. Shutemov wrote:
>>> +			error = radix_tree_insert(&mapping->page_tree,
>>> +					offset + i, page + i);
>>> +			if (error) {
>>> +				page_cache_release(page + i);
>>> +				break;
>>> +			}
>>>  		}
>>
>> Throughout all this new code, I'd really challenge you to try as much as
>> possible to minimize the code stuck under "if (PageTransHuge(page))".
> 
> I put thp-related code under the 'if' intentionally to be able to optimize
> it out if !CONFIG_TRANSPARENT_HUGEPAGE. The config option is disabled by
> default.

You've gotta give the compiler some credit! :)  In this function's case,
the compiler should be able to realize that nr=1 is constant if
!TRANSPARENT_HUGEPAGE.  It'll realize that all your for loops are:

	for (i = 0; i < 1; i++) {

and will end up generating _very_ similar code to what you get with the
explicit #ifdefs.  You already _have_ #ifdefs, but they're just up in
the headers around PageTransHuge()'s definition.

The advantages for you are *huge* since it means that folks will have a
harder time inadvertently breaking your CONFIG_TRANSPARENT_HUGEPAGE code.

>> Does the cgroup code know how to handle these large pages internally
>> somehow?  It looks like the charge/uncharge is only being done for the
>> head page.
> 
> It can. We only need to remove PageCompound() check there. Patch is in
> git.

OK, cool.  This _might_ deserve a comment or something here.  Again, it
looks asymmetric and fishy to someone just reading the code.


^ permalink raw reply	[flat|nested] 243+ messages in thread

* Re: [PATCHv2, RFC 14/30] thp, mm: naive support of thp in generic read/write routines
  2013-03-14 17:50   ` Kirill A. Shutemov
@ 2013-03-22 15:22     ` Dave Hansen
  -1 siblings, 0 replies; 243+ messages in thread
From: Dave Hansen @ 2013-03-22 15:22 UTC (permalink / raw)
  To: Kirill A. Shutemov
  Cc: Andrea Arcangeli, Andrew Morton, Al Viro, Hugh Dickins,
	Wu Fengguang, Jan Kara, Mel Gorman, linux-mm, Andi Kleen,
	Matthew Wilcox, Kirill A. Shutemov, Hillf Danton, linux-fsdevel,
	linux-kernel

On 03/14/2013 10:50 AM, Kirill A. Shutemov wrote:
> From: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
> 
> For now we still write/read at most PAGE_CACHE_SIZE bytes a time.
> 
> This implementation doesn't cover address spaces with backing store.
...
> --- a/mm/filemap.c
> +++ b/mm/filemap.c
> @@ -1165,12 +1165,23 @@ find_page:
>  			if (unlikely(page == NULL))
>  				goto no_cached_page;
>  		}
> +		if (PageTransTail(page)) {
> +			page_cache_release(page);
> +			page = find_get_page(mapping,
> +					index & ~HPAGE_CACHE_INDEX_MASK);
> +			if (!PageTransHuge(page)) {
> +				page_cache_release(page);
> +				goto find_page;
> +			}
> +		}

So, we're going to do a read of a file, and we pulled a tail page out of
the page cache.  Why can't we just deal with the tail page directly?
What prevents this?

Is there something special about THP pages that keeps the head page in
the page cache after the tail has been released?  I'd normally be
worried that the find_get_page() here might fail.

It's probably also worth a quick comment like:

	/* can't deal with tail pages directly, move to head page */

otherwise the reassignment of "page" starts to seem a bit odd.

>  		if (PageReadahead(page)) {
> +			BUG_ON(PageTransHuge(page));
>  			page_cache_async_readahead(mapping,
>  					ra, filp, page,
>  					index, last_index - index);
>  		}

Is this because we only do readahead for fs's with backing stores?
Could we have a comment to this effect?

>  		if (!PageUptodate(page)) {
> +			BUG_ON(PageTransHuge(page));
>  			if (inode->i_blkbits == PAGE_CACHE_SHIFT ||
>  					!mapping->a_ops->is_partially_uptodate)
>  				goto page_not_up_to_date;

Same question. :)

Since your two-line description covers two topics, it's not immediately
obvious which one this BUG_ON() applies to.

> @@ -1212,18 +1223,25 @@ page_ok:
>  		}
>  		nr = nr - offset;
>  
> +		/* Recalculate offset in page if we've got a huge page */
> +		if (PageTransHuge(page)) {
> +			offset = (((loff_t)index << PAGE_CACHE_SHIFT) + offset);
> +			offset &= ~HPAGE_PMD_MASK;
> +		}

Does this need to be done in cases other than the path that goes through
"if(PageTransTail(page))" above?  If not, I'd probably stick this code
up with the other part.

>  		/* If users can be writing to this page using arbitrary
>  		 * virtual addresses, take care about potential aliasing
>  		 * before reading the page on the kernel side.
>  		 */
>  		if (mapping_writably_mapped(mapping))
> -			flush_dcache_page(page);
> +			flush_dcache_page(page + (offset >> PAGE_CACHE_SHIFT));

This is another case where I think adding another local variable would
essentially help the code self-document.  The way it stands, it's fairly
subtle how (offset>>PAGE_CACHE_SHIFT) works and that it's conditional on
THP being enabled.

		int tail_page_index = (offset >> PAGE_CACHE_SHIFT);
...
> +			flush_dcache_page(page + tail_page_index);

This makes it obvious that we're indexing off something, *and* that it's
only going to be relevant when dealing with tail pages.

>  		/*
>  		 * When a sequential read accesses a page several times,
>  		 * only mark it as accessed the first time.
>  		 */
> -		if (prev_index != index || offset != prev_offset)
> +		if (prev_index != index ||
> +				(offset & ~PAGE_CACHE_MASK) != prev_offset)
>  			mark_page_accessed(page);
>  		prev_index = index;
>  
> @@ -1238,8 +1256,9 @@ page_ok:
>  		 * "pos" here (the actor routine has to update the user buffer
>  		 * pointers and the remaining count).
>  		 */
> -		ret = file_read_actor(desc, page, offset, nr);
> -		offset += ret;
> +		ret = file_read_actor(desc, page + (offset >> PAGE_CACHE_SHIFT),
> +				offset & ~PAGE_CACHE_MASK, nr);
> +		offset =  (offset & ~PAGE_CACHE_MASK) + ret;

^^ There's an extra space in that last line.

>  		index += offset >> PAGE_CACHE_SHIFT;
>  		offset &= ~PAGE_CACHE_MASK;
>  		prev_offset = offset;
> @@ -2440,8 +2459,13 @@ again:
>  		if (mapping_writably_mapped(mapping))
>  			flush_dcache_page(page);
>  
> +		if (PageTransHuge(page))
> +			offset = pos & ~HPAGE_PMD_MASK;
> +
>  		pagefault_disable();
> -		copied = iov_iter_copy_from_user_atomic(page, i, offset, bytes);
> +		copied = iov_iter_copy_from_user_atomic(
> +				page + (offset >> PAGE_CACHE_SHIFT),
> +				i, offset & ~PAGE_CACHE_MASK, bytes);
>  		pagefault_enable();
>  		flush_dcache_page(page);
>  
> @@ -2464,6 +2488,7 @@ again:
>  			 * because not all segments in the iov can be copied at
>  			 * once without a pagefault.
>  			 */
> +			offset = pos & ~PAGE_CACHE_MASK;
>  			bytes = min_t(unsigned long, PAGE_CACHE_SIZE - offset,
>  						iov_iter_single_seg_count(i));
>  			goto again;
> 

I think the difficulty in this function is that you're now dealing with
two 'struct page's, two offsets, and two indexes.  It isn't blindingly
obvious which one should be used in a given situation.

The way you've done it here might just be the best way.  I'd *really*
encourage you to make sure that this is tested exhaustively, and make
sure you hit all the different paths in that function.  I'd suspect
there is still a bug or two in there outside the diff context.

Would it be sane to have a set of variables like:

    struct page *thp_tail_page = page + (offset >> PAGE_CACHE_SHIFT);

instead of just open-coding the masks and shifts every time?

^ permalink raw reply	[flat|nested] 243+ messages in thread

* Re: [PATCHv2, RFC 15/30] thp, libfs: initial support of thp in simple_read/write_begin/write_end
  2013-03-14 17:50   ` Kirill A. Shutemov
@ 2013-03-22 18:01     ` Dave
  -1 siblings, 0 replies; 243+ messages in thread
From: Dave @ 2013-03-22 18:01 UTC (permalink / raw)
  To: Kirill A. Shutemov
  Cc: Andrea Arcangeli, Andrew Morton, Al Viro, Hugh Dickins,
	Wu Fengguang, Jan Kara, Mel Gorman, linux-mm, Andi Kleen,
	Matthew Wilcox, Kirill A. Shutemov, Hillf Danton, linux-fsdevel,
	linux-kernel

On 03/14/2013 10:50 AM, Kirill A. Shutemov wrote:
> @@ -383,7 +383,10 @@ EXPORT_SYMBOL(simple_setattr);
>  
>  int simple_readpage(struct file *file, struct page *page)
>  {
> -	clear_highpage(page);
> +	if (PageTransHuge(page))
> +		zero_huge_user(page, 0, HPAGE_PMD_SIZE);
> +	else
> +		clear_highpage(page);

This makes me really wonder at which level we want to be hooking in this
code.  The fact that we're doing it in simple_readpage() seems to mean
that we'll have to explicitly and widely modify every fs that wants
to support this.

It seems to me that we either want to hide this behind clear_highpage()
itself, _or_ make a clear_pagecache_page() function that does it.
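
Something like this, perhaps (untested; clear_pagecache_page() is just the
name suggested above):

static inline void clear_pagecache_page(struct page *page)
{
	if (PageTransHuge(page))
		zero_huge_user(page, 0, HPAGE_PMD_SIZE);
	else
		clear_highpage(page);
}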

BTW, didn't you have a HUGE_PAGE_CACHE_SIZE macro somewhere?  Shouldn't
that get used here?

>  	flush_dcache_page(page);
>  	SetPageUptodate(page);
>  	unlock_page(page);
> @@ -394,21 +397,41 @@ int simple_write_begin(struct file *file, struct address_space *mapping,
>  			loff_t pos, unsigned len, unsigned flags,
>  			struct page **pagep, void **fsdata)
>  {
> -	struct page *page;
> +	struct page *page = NULL;
>  	pgoff_t index;
>  
>  	index = pos >> PAGE_CACHE_SHIFT;
>  
> -	page = grab_cache_page_write_begin(mapping, index, flags);
> +	/* XXX: too weak condition. Good enough for initial testing */
> +	if (mapping_can_have_hugepages(mapping)) {
> +		page = grab_cache_huge_page_write_begin(mapping,
> +				index & ~HPAGE_CACHE_INDEX_MASK, flags);
> +		/* fallback to small page */
> +		if (!page || !PageTransHuge(page)) {
> +			unsigned long offset;
> +			offset = pos & ~PAGE_CACHE_MASK;
> +			len = min_t(unsigned long,
> +					len, PAGE_CACHE_SIZE - offset);
> +		}
> +	}
> +	if (!page)
> +		page = grab_cache_page_write_begin(mapping, index, flags);

Same thing goes here.  Can/should we hide the
grab_cache_huge_page_write_begin() call inside
grab_cache_page_write_begin()?

>  	if (!page)
>  		return -ENOMEM;
> -
>  	*pagep = page;
>  
> -	if (!PageUptodate(page) && (len != PAGE_CACHE_SIZE)) {
> -		unsigned from = pos & (PAGE_CACHE_SIZE - 1);
> -
> -		zero_user_segments(page, 0, from, from + len, PAGE_CACHE_SIZE);
> +	if (!PageUptodate(page)) {
> +		unsigned from;
> +
> +		if (PageTransHuge(page) && len != HPAGE_PMD_SIZE) {
> +			from = pos & ~HPAGE_PMD_MASK;
> +			zero_huge_user_segments(page, 0, from,
> +					from + len, HPAGE_PMD_SIZE);
> +		} else if (len != PAGE_CACHE_SIZE) {
> +			from = pos & ~PAGE_CACHE_MASK;
> +			zero_user_segments(page, 0, from,
> +					from + len, PAGE_CACHE_SIZE);
> +		}
>  	}

Let's say you introduced two new functions:  page_cache_size(page) and
page_cache_mask(page), and hid the zero_huge_user_segments() inside
zero_user_segments().  This code would end up looking like this:

	if (len != page_cache_size(page)) {
		from = pos & ~page_cache_mask(page);
		zero_user_segments(page, 0, from,
				from + len, page_cache_size(page));
	}

It would also compile down to exactly what was there before without
having to _explicitly_ put a case in for THP.
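
The helpers would be something like (sketch, names as proposed above):

static inline unsigned long page_cache_size(struct page *page)
{
	/* compiles down to PAGE_CACHE_SIZE if THP is off */
	return PageTransHuge(page) ? HPAGE_PMD_SIZE : PAGE_CACHE_SIZE;
}

static inline unsigned long page_cache_mask(struct page *page)
{
	return PageTransHuge(page) ? HPAGE_PMD_MASK : PAGE_CACHE_MASK;
}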


^ permalink raw reply	[flat|nested] 243+ messages in thread

* Re: [PATCHv2, RFC 16/30] thp: handle file pages in split_huge_page()
  2013-03-14 17:50   ` Kirill A. Shutemov
@ 2013-03-22 18:18     ` Dave
  -1 siblings, 0 replies; 243+ messages in thread
From: Dave @ 2013-03-22 18:18 UTC (permalink / raw)
  To: Kirill A. Shutemov
  Cc: Andrea Arcangeli, Andrew Morton, Al Viro, Hugh Dickins,
	Wu Fengguang, Jan Kara, Mel Gorman, linux-mm, Andi Kleen,
	Matthew Wilcox, Kirill A. Shutemov, Hillf Danton, linux-fsdevel,
	linux-kernel

On 03/14/2013 10:50 AM, Kirill A. Shutemov wrote:
> +static int split_anon_huge_page(struct page *page)
>  {
>  	struct anon_vma *anon_vma;
>  	int ret = 1;
>  
> -	BUG_ON(is_huge_zero_pfn(page_to_pfn(page)));
> -	BUG_ON(!PageAnon(page));

Did you really mean to kill these BUG_ON()s?  They still look relevant
to me.



^ permalink raw reply	[flat|nested] 243+ messages in thread

* Re: [PATCHv2, RFC 17/30] thp: wait_split_huge_page(): serialize over i_mmap_mutex too
  2013-03-14 17:50   ` Kirill A. Shutemov
@ 2013-03-22 18:22     ` Dave
  -1 siblings, 0 replies; 243+ messages in thread
From: Dave @ 2013-03-22 18:22 UTC (permalink / raw)
  To: Kirill A. Shutemov
  Cc: Andrea Arcangeli, Andrew Morton, Al Viro, Hugh Dickins,
	Wu Fengguang, Jan Kara, Mel Gorman, linux-mm, Andi Kleen,
	Matthew Wilcox, Kirill A. Shutemov, Hillf Danton, linux-fsdevel,
	linux-kernel

On 03/14/2013 10:50 AM, Kirill A. Shutemov wrote:
> --- a/include/linux/huge_mm.h
> +++ b/include/linux/huge_mm.h
> @@ -113,11 +113,20 @@ extern void __split_huge_page_pmd(struct vm_area_struct *vma,
>  			__split_huge_page_pmd(__vma, __address,		\
>  					____pmd);			\
>  	}  while (0)
> -#define wait_split_huge_page(__anon_vma, __pmd)				\
> +#define wait_split_huge_page(__vma, __pmd)				\
>  	do {								\
>  		pmd_t *____pmd = (__pmd);				\
> -		anon_vma_lock_write(__anon_vma);			\
> -		anon_vma_unlock_write(__anon_vma);			\
> +		struct address_space *__mapping =			\
> +					vma->vm_file->f_mapping;	\
> +		struct anon_vma *__anon_vma = (__vma)->anon_vma;	\
> +		if (__mapping)						\
> +			mutex_lock(&__mapping->i_mmap_mutex);		\
> +		if (__anon_vma) {					\
> +			anon_vma_lock_write(__anon_vma);		\
> +			anon_vma_unlock_write(__anon_vma);		\
> +		}							\
> +		if (__mapping)						\
> +			mutex_unlock(&__mapping->i_mmap_mutex);		\
>  		BUG_ON(pmd_trans_splitting(*____pmd) ||			\
>  		       pmd_trans_huge(*____pmd));			\
>  	} while (0)

That thing was pretty ugly _before_. :)  Any chance this can get turned
in to a function?

What's the deal with the i_mmap_mutex operation getting split up?  I'm
blanking on what kind of pages would have both anon_vmas and a valid
mapping.

^ permalink raw reply	[flat|nested] 243+ messages in thread

* Re: [PATCHv2, RFC 18/30] thp, mm: truncate support for transparent huge page cache
  2013-03-14 17:50   ` Kirill A. Shutemov
@ 2013-03-22 18:29     ` Dave
  -1 siblings, 0 replies; 243+ messages in thread
From: Dave @ 2013-03-22 18:29 UTC (permalink / raw)
  To: Kirill A. Shutemov
  Cc: Andrea Arcangeli, Andrew Morton, Al Viro, Hugh Dickins,
	Wu Fengguang, Jan Kara, Mel Gorman, linux-mm, Andi Kleen,
	Matthew Wilcox, Kirill A. Shutemov, Hillf Danton, linux-fsdevel,
	linux-kernel

On 03/14/2013 10:50 AM, Kirill A. Shutemov wrote:
> @@ -280,6 +291,7 @@ void truncate_inode_pages_range(struct address_space *mapping,
>  			if (index > end)
>  				break;
>  
> +			VM_BUG_ON(PageTransHuge(page));
>  			lock_page(page);
>  			WARN_ON(page->index != index);
>  			wait_on_page_writeback(page);

This looks to be during the second truncate pass where things are
allowed to block.  What's the logic behind it not being possible to
encounter PageTransHuge() pages here?

^ permalink raw reply	[flat|nested] 243+ messages in thread

* Re: [PATCHv2, RFC 21/30] x86-64, mm: proper alignment mappings with hugepages
  2013-03-14 17:50   ` Kirill A. Shutemov
@ 2013-03-22 18:37     ` Dave
  -1 siblings, 0 replies; 243+ messages in thread
From: Dave @ 2013-03-22 18:37 UTC (permalink / raw)
  To: Kirill A. Shutemov
  Cc: Andrea Arcangeli, Andrew Morton, Al Viro, Hugh Dickins,
	Wu Fengguang, Jan Kara, Mel Gorman, linux-mm, Andi Kleen,
	Matthew Wilcox, Kirill A. Shutemov, Hillf Danton, linux-fsdevel,
	linux-kernel

On 03/14/2013 10:50 AM, Kirill A. Shutemov wrote:
> +	if (filp)
> +		info.align_mask = mapping_can_have_hugepages(filp->f_mapping) ?
> +			PAGE_MASK & ~HPAGE_MASK : get_align_mask();
> +	else
> +		info.align_mask = 0;
>  	info.align_offset = pgoff << PAGE_SHIFT;
>  	return vm_unmapped_area(&info);
>  }
> @@ -174,7 +179,11 @@ arch_get_unmapped_area_topdown(struct file *filp, const unsigned long addr0,
>  	info.length = len;
>  	info.low_limit = PAGE_SIZE;
>  	info.high_limit = mm->mmap_base;
> -	info.align_mask = filp ? get_align_mask() : 0;
> +	if (filp)
> +		info.align_mask = mapping_can_have_hugepages(filp->f_mapping) ?
> +			PAGE_MASK & ~HPAGE_MASK : get_align_mask();
> +	else
> +		info.align_mask = 0;
>  	info.align_offset = pgoff << PAGE_SHIFT;
>  	addr = vm_unmapped_area(&info);
>  	if (!(addr & ~PAGE_MASK))

how about

static inline
unsigned long mapping_align_mask(struct address_space *mapping)
{
	if (mapping_can_have_hugepages(mapping))
		return PAGE_MASK & ~HPAGE_MASK;
	return get_align_mask();
}

to replace these two open-coded versions?
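
Both call sites would then collapse to something like:

	info.align_mask = filp ? mapping_align_mask(filp->f_mapping) : 0;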

^ permalink raw reply	[flat|nested] 243+ messages in thread

* Re: [PATCHv2, RFC 04/30] radix-tree: implement preload for multiple contiguous elements
  2013-03-22 14:38         ` Dave Hansen
@ 2013-03-25 13:03           ` Kirill A. Shutemov
  -1 siblings, 0 replies; 243+ messages in thread
From: Kirill A. Shutemov @ 2013-03-25 13:03 UTC (permalink / raw)
  To: Dave Hansen
  Cc: Kirill A. Shutemov, Andrea Arcangeli, Andrew Morton, Al Viro,
	Hugh Dickins, Wu Fengguang, Jan Kara, Mel Gorman, linux-mm,
	Andi Kleen, Matthew Wilcox, Kirill A. Shutemov, Hillf Danton,
	linux-fsdevel, linux-kernel

Dave Hansen wrote:
> On 03/22/2013 02:47 AM, Kirill A. Shutemov wrote:
> > Dave Hansen wrote:
> >> On 03/14/2013 10:50 AM, Kirill A. Shutemov wrote:
> >>> +#define RADIX_TREE_PRELOAD_NR		512 /* For THP's benefit */
> >>
> >> This eventually boils down to making the radix_tree_preload array
> >> larger.  Do we really want to do this unconditionally if it's only for
> >> THP's benefit?
> > 
> > It will be useful not only for THP. Batching can be useful to solve
> > scalability issues.
> 
> Still, it seems like something that little machines with no THP support
> probably don't want to pay the cost for.  Perhaps you could enable it
> for THP||NR_CPUS>$FOO.

Okay, I'll disable it for !THP. We can always change it if we find a good
candidate for batching.
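
Roughly like this (sketch; the !THP value is just the existing
single-element behaviour):

#ifdef CONFIG_TRANSPARENT_HUGEPAGE
#define RADIX_TREE_PRELOAD_NR	512	/* For THP's benefit */
#else
#define RADIX_TREE_PRELOAD_NR	1
#endif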

> >> For those of us too lazy to go compile a kernel and figure this out in
> >> practice, how much bigger does this make the nodes[] array?
> > 
> > We have three possible RADIX_TREE_MAP_SHIFT:
> > 
> > #ifdef __KERNEL__
> > #define RADIX_TREE_MAP_SHIFT	(CONFIG_BASE_SMALL ? 4 : 6)
> > #else
> > #define RADIX_TREE_MAP_SHIFT	3	/* For more stressful testing */
> > #endif
> > 
> > On 64-bit system:
> > For RADIX_TREE_MAP_SHIFT=3, old array size is 43, new is 107.
> > For RADIX_TREE_MAP_SHIFT=4, old array size is 31, new is 63.
> > For RADIX_TREE_MAP_SHIFT=6, old array size is 21, new is 30.
> > 
> > On 32-bit system:
> > For RADIX_TREE_MAP_SHIFT=3, old array size is 21, new is 84.
> > For RADIX_TREE_MAP_SHIFT=4, old array size is 15, new is 46.
> > For RADIX_TREE_MAP_SHIFT=6, old array size is 11, new is 19.
> > 
> > On most machines we will have RADIX_TREE_MAP_SHIFT=6.
> 
> Could you stick that in your patch description?

Will do.

> The total cost is "array size" * sizeof(void*) * NR_CPUS, right?

Correct.
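
So for the common RADIX_TREE_MAP_SHIFT=6 case on 64-bit that is about
30 * 8 = 240 bytes per CPU, versus 21 * 8 = 168 bytes before, i.e. an
extra 72 bytes per CPU.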

-- 
 Kirill A. Shutemov

^ permalink raw reply	[flat|nested] 243+ messages in thread

* Re: [PATCHv2, RFC 12/30] thp, mm: add event counters for huge page alloc on write to a file
  2013-03-21 17:59     ` Dave Hansen
@ 2013-03-26  8:40       ` Kirill A. Shutemov
  -1 siblings, 0 replies; 243+ messages in thread
From: Kirill A. Shutemov @ 2013-03-26  8:40 UTC (permalink / raw)
  To: Dave Hansen
  Cc: Kirill A. Shutemov, Andrea Arcangeli, Andrew Morton, Al Viro,
	Hugh Dickins, Wu Fengguang, Jan Kara, Mel Gorman, linux-mm,
	Andi Kleen, Matthew Wilcox, Kirill A. Shutemov, Hillf Danton,
	linux-fsdevel, linux-kernel

Dave Hansen wrote:
> On 03/14/2013 10:50 AM, Kirill A. Shutemov wrote:
> > --- a/include/linux/vm_event_item.h
> > +++ b/include/linux/vm_event_item.h
> > @@ -71,6 +71,8 @@ enum vm_event_item { PGPGIN, PGPGOUT, PSWPIN, PSWPOUT,
> >  		THP_FAULT_FALLBACK,
> >  		THP_COLLAPSE_ALLOC,
> >  		THP_COLLAPSE_ALLOC_FAILED,
> > +		THP_WRITE_ALLOC,
> > +		THP_WRITE_FAILED,
> >  		THP_SPLIT,
> >  		THP_ZERO_PAGE_ALLOC,
> >  		THP_ZERO_PAGE_ALLOC_FAILED,
> 
> I think these names are a bit terse.  It's certainly not _writes_ that
> are failing and "THP_WRITE_FAILED" makes it sound that way.

Right. s/THP_WRITE_FAILED/THP_WRITE_ALLOC_FAILED/

> Also, why do we need to differentiate these from the existing anon-hugepage
> vm stats?  The alloc_pages() call seems to be doing the exact same thing in
> the end.  Is one more likely to succeed than the other?

The existing stats specify the source of a thp page: fault or collapse. When
we allocate a new huge page on write(2) it's neither a fault nor a collapse,
so I think it's reasonable to introduce a new type of event for that.
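
For reference, the counters just get bumped around the huge page allocation
in the write path, roughly like this (sketch using the renamed counter, not
the exact hunk):

	page = alloc_pages(gfp_mask & ~gfp_notmask, HPAGE_PMD_ORDER);
	if (page) {
		count_vm_event(THP_WRITE_ALLOC);
	} else {
		count_vm_event(THP_WRITE_ALLOC_FAILED);
		/* fall back to allocating a small page */
	}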

-- 
 Kirill A. Shutemov

^ permalink raw reply	[flat|nested] 243+ messages in thread

* Re: [PATCHv2, RFC 13/30] thp, mm: implement grab_cache_huge_page_write_begin()
  2013-03-21 18:15     ` Dave Hansen
@ 2013-03-26 10:48       ` Kirill A. Shutemov
  -1 siblings, 0 replies; 243+ messages in thread
From: Kirill A. Shutemov @ 2013-03-26 10:48 UTC (permalink / raw)
  To: Dave Hansen
  Cc: Kirill A. Shutemov, Andrea Arcangeli, Andrew Morton, Al Viro,
	Hugh Dickins, Wu Fengguang, Jan Kara, Mel Gorman, linux-mm,
	Andi Kleen, Matthew Wilcox, Kirill A. Shutemov, Hillf Danton,
	linux-fsdevel, linux-kernel

Dave Hansen wrote:
> > +repeat:
> > +	page = find_lock_page(mapping, index);
> > +	if (page) {
> > +		if (!PageTransHuge(page)) {
> > +			unlock_page(page);
> > +			page_cache_release(page);
> > +			return NULL;
> > +		}
> > +		goto found;
> > +	}
> > +
> > +	page = alloc_pages(gfp_mask & ~gfp_notmask, HPAGE_PMD_ORDER);
> 
> I alluded to this a second ago, but what's wrong with alloc_hugepage()?

It's defined only for !CONFIG_NUMA and only inside mm/huge_memory.c.
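
For reference, it's just a thin wrapper, something like:

static inline struct page *alloc_hugepage(int defrag)
{
	return alloc_pages(alloc_hugepage_gfpmask(defrag, 0),
			   HPAGE_PMD_ORDER);
}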

> > +found:
> > +	wait_on_page_writeback(page);
> > +	return page;
> > +}
> > +#endif
> 
> So, I diffed :
> 
> -struct page *grab_cache_page_write_begin(struct address_space
> vs.
> +struct page *grab_cache_huge_page_write_begin(struct address_space
> 
> They're just too similar to ignore.  Please consolidate them somehow.

Will do.

> > +found:
> > +	wait_on_page_writeback(page);
> > +	return page;
> > +}
> > +#endif
> 
> In grab_cache_page_write_begin(), this "wait" is:
> 
>         wait_for_stable_page(page);
> 
> Why is it different here?

It was wait_on_page_writeback() in grab_cache_page_write_begin() when I forked
it :(

See 1d1d1a7 mm: only enforce stable page writes if the backing device requires it

Consolidation will fix this.

-- 
 Kirill A. Shutemov

^ permalink raw reply	[flat|nested] 243+ messages in thread

* Re: [PATCHv2, RFC 13/30] thp, mm: implement grab_cache_huge_page_write_begin()
  2013-03-26 10:48       ` Kirill A. Shutemov
@ 2013-03-26 15:40         ` Dave
  -1 siblings, 0 replies; 243+ messages in thread
From: Dave @ 2013-03-26 15:40 UTC (permalink / raw)
  To: Kirill A. Shutemov
  Cc: Andrea Arcangeli, Andrew Morton, Al Viro, Hugh Dickins,
	Wu Fengguang, Jan Kara, Mel Gorman, linux-mm, Andi Kleen,
	Matthew Wilcox, Kirill A. Shutemov, Hillf Danton, linux-fsdevel,
	linux-kernel

On 03/26/2013 03:48 AM, Kirill A. Shutemov wrote:
> Dave Hansen wrote:
>>> +repeat:
>>> +	page = find_lock_page(mapping, index);
>>> +	if (page) {
>>> +		if (!PageTransHuge(page)) {
>>> +			unlock_page(page);
>>> +			page_cache_release(page);
>>> +			return NULL;
>>> +		}
>>> +		goto found;
>>> +	}
>>> +
>>> +	page = alloc_pages(gfp_mask & ~gfp_notmask, HPAGE_PMD_ORDER);
>>
>> I alluded to this a second ago, but what's wrong with alloc_hugepage()?
> 
> It's defined only for !CONFIG_NUMA and only inside mm/huge_memory.c.

It's a short function, but you could easily pull it out from under the
#ifdef and export it.  I kinda like the idea of these things being
allocated in as few code paths as possible.  But it's not a big deal.

>>> +found:
>>> +	wait_on_page_writeback(page);
>>> +	return page;
>>> +}
>>> +#endif
>>
>> So, I diffed :
>>
>> -struct page *grab_cache_page_write_begin(struct address_space
>> vs.
>> +struct page *grab_cache_huge_page_write_begin(struct address_space
>>
>> They're just too similar to ignore.  Please consolidate them somehow.
> 
> Will do.
> 
>>> +found:
>>> +	wait_on_page_writeback(page);
>>> +	return page;
>>> +}
>>> +#endif
>>
>> In grab_cache_page_write_begin(), this "wait" is:
>>
>>         wait_for_stable_page(page);
>>
>> Why is it different here?
> 
> It was wait_on_page_writeback() in grab_cache_page_write_begin() when I forked
> it :(
> 
> See 1d1d1a7 mm: only enforce stable page writes if the backing device requires it
> 
> Consolidation will fix this.

Excellent.


^ permalink raw reply	[flat|nested] 243+ messages in thread

* Re: [PATCHv2, RFC 14/30] thp, mm: naive support of thp in generic read/write routines
  2013-03-22 15:22     ` Dave Hansen
@ 2013-03-28 12:25       ` Kirill A. Shutemov
  -1 siblings, 0 replies; 243+ messages in thread
From: Kirill A. Shutemov @ 2013-03-28 12:25 UTC (permalink / raw)
  To: Dave Hansen
  Cc: Kirill A. Shutemov, Andrea Arcangeli, Andrew Morton, Al Viro,
	Hugh Dickins, Wu Fengguang, Jan Kara, Mel Gorman, linux-mm,
	Andi Kleen, Matthew Wilcox, Kirill A. Shutemov, Hillf Danton,
	linux-fsdevel, linux-kernel

Dave Hansen wrote:
> On 03/14/2013 10:50 AM, Kirill A. Shutemov wrote:
> > From: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
> > 
> > For now we still write/read at most PAGE_CACHE_SIZE bytes a time.
> > 
> > This implementation doesn't cover address spaces with backing store.
> ...
> > --- a/mm/filemap.c
> > +++ b/mm/filemap.c
> > @@ -1165,12 +1165,23 @@ find_page:
> >  			if (unlikely(page == NULL))
> >  				goto no_cached_page;
> >  		}
> > +		if (PageTransTail(page)) {
> > +			page_cache_release(page);
> > +			page = find_get_page(mapping,
> > +					index & ~HPAGE_CACHE_INDEX_MASK);
> > +			if (!PageTransHuge(page)) {
> > +				page_cache_release(page);
> > +				goto find_page;
> > +			}
> > +		}
> 
> So, we're going to do a read of a file, and we pulled a tail page out of
> the page cache.  Why can't we just deal with the tail page directly?

Good point. I'll redo it once again.

At first I wanted to make it possible to read/write more than PAGE_SIZE at
once. If we leave that option aside for now, the code can be made much
cleaner.

-- 
 Kirill A. Shutemov

^ permalink raw reply	[flat|nested] 243+ messages in thread

* Re: [PATCHv2, RFC 15/30] thp, libfs: initial support of thp in simple_read/write_begin/write_end
  2013-03-22 18:01     ` Dave
@ 2013-03-28 14:29       ` Kirill A. Shutemov
  -1 siblings, 0 replies; 243+ messages in thread
From: Kirill A. Shutemov @ 2013-03-28 14:29 UTC (permalink / raw)
  To: Dave
  Cc: Kirill A. Shutemov, Andrea Arcangeli, Andrew Morton, Al Viro,
	Hugh Dickins, Wu Fengguang, Jan Kara, Mel Gorman, linux-mm,
	Andi Kleen, Matthew Wilcox, Kirill A. Shutemov, Hillf Danton,
	linux-fsdevel, linux-kernel

Dave wrote:
> On 03/14/2013 10:50 AM, Kirill A. Shutemov wrote:
> > @@ -383,7 +383,10 @@ EXPORT_SYMBOL(simple_setattr);
> >  
> >  int simple_readpage(struct file *file, struct page *page)
> >  {
> > -	clear_highpage(page);
> > +	if (PageTransHuge(page))
> > +		zero_huge_user(page, 0, HPAGE_PMD_SIZE);
> > +	else
> > +		clear_highpage(page);
> 
> This makes me really wonder on which level we want to be hooking in this
> code.  The fact that we're doing it in simple_readpage() seems to mean
> that we'll have to go explicitly and widely modify every fs that wants
> to support this.
> 
> It seems to me that we either want to hide this behind clear_highpage()
> itself, _or_ make a clear_pagecache_page() function that does it.

clear_pagecache_page() is a good idea.
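
Something like this, based on the hunk above (sketch):

static inline void clear_pagecache_page(struct page *page)
{
	if (PageTransHuge(page))
		zero_huge_user(page, 0, HPAGE_PMD_SIZE);
	else
		clear_highpage(page);
}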

> BTW, didn't you have a HUGE_PAGE_CACHE_SIZE macro somewhere?  Shouldn't
> that get used here?

No. If we have a huge page in the page cache it's always HPAGE_PMD_SIZE.

All these PAGE_CACHE_* macros are really annoying. The page cache page size
is always equal to the small page size, so the macros only confuse things,
especially on the border between fs/page cache and the rest of mm.

I want to get rid of them eventually.

> 
> >  	flush_dcache_page(page);
> >  	SetPageUptodate(page);
> >  	unlock_page(page);
> > @@ -394,21 +397,41 @@ int simple_write_begin(struct file *file, struct address_space *mapping,
> >  			loff_t pos, unsigned len, unsigned flags,
> >  			struct page **pagep, void **fsdata)
> >  {
> > -	struct page *page;
> > +	struct page *page = NULL;
> >  	pgoff_t index;
> >  
> >  	index = pos >> PAGE_CACHE_SHIFT;
> >  
> > -	page = grab_cache_page_write_begin(mapping, index, flags);
> > +	/* XXX: too weak condition. Good enough for initial testing */
> > +	if (mapping_can_have_hugepages(mapping)) {
> > +		page = grab_cache_huge_page_write_begin(mapping,
> > +				index & ~HPAGE_CACHE_INDEX_MASK, flags);
> > +		/* fallback to small page */
> > +		if (!page || !PageTransHuge(page)) {
> > +			unsigned long offset;
> > +			offset = pos & ~PAGE_CACHE_MASK;
> > +			len = min_t(unsigned long,
> > +					len, PAGE_CACHE_SIZE - offset);
> > +		}
> > +	}
> > +	if (!page)
> > +		page = grab_cache_page_write_begin(mapping, index, flags);
> 
> Same thing goes here.  Can/should we hide the
> grab_cache_huge_page_write_begin() call inside
> grab_cache_page_write_begin()?

No. I want to keep it open-coded. The fs, not the page cache, should decide
whether it wants a huge page or not.

> >  	if (!page)
> >  		return -ENOMEM;
> > -
> >  	*pagep = page;
> >  
> > -	if (!PageUptodate(page) && (len != PAGE_CACHE_SIZE)) {
> > -		unsigned from = pos & (PAGE_CACHE_SIZE - 1);
> > -
> > -		zero_user_segments(page, 0, from, from + len, PAGE_CACHE_SIZE);
> > +	if (!PageUptodate(page)) {
> > +		unsigned from;
> > +
> > +		if (PageTransHuge(page) && len != HPAGE_PMD_SIZE) {
> > +			from = pos & ~HPAGE_PMD_MASK;
> > +			zero_huge_user_segments(page, 0, from,
> > +					from + len, HPAGE_PMD_SIZE);
> > +		} else if (len != PAGE_CACHE_SIZE) {
> > +			from = pos & ~PAGE_CACHE_MASK;
> > +			zero_user_segments(page, 0, from,
> > +					from + len, PAGE_CACHE_SIZE);
> > +		}
> >  	}
> 
> Let's say you introduced two new functions:  page_cache_size(page) and
> page_cache_mask(page), and hid the zero_huge_user_segments() inside
> zero_user_segments().  This code would end up looking like this:
> 
> 	if (len != page_cache_size(page)) {
> 		from = pos & ~page_cache_mask(page)
> 		zero_user_segments(page, 0, from,
> 				from + len, page_cache_size(page));
> 	}
> 
> It would also compile down to exactly what was there before without
> having to _explicitly_ put a case in for THP.

I would keep it as is for now. There are not that many places where we have
to check for THP. That can change when (if) we implement this for other
filesystems. We can generalize it later if needed.

-- 
 Kirill A. Shutemov

^ permalink raw reply	[flat|nested] 243+ messages in thread

* Re: [PATCHv2, RFC 16/30] thp: handle file pages in split_huge_page()
  2013-03-22 18:18     ` Dave
@ 2013-03-28 14:32       ` Kirill A. Shutemov
  -1 siblings, 0 replies; 243+ messages in thread
From: Kirill A. Shutemov @ 2013-03-28 14:32 UTC (permalink / raw)
  To: Dave
  Cc: Kirill A. Shutemov, Andrea Arcangeli, Andrew Morton, Al Viro,
	Hugh Dickins, Wu Fengguang, Jan Kara, Mel Gorman, linux-mm,
	Andi Kleen, Matthew Wilcox, Kirill A. Shutemov, Hillf Danton,
	linux-fsdevel, linux-kernel

Dave wrote:
> On 03/14/2013 10:50 AM, Kirill A. Shutemov wrote:
> > +static int split_anon_huge_page(struct page *page)
> >  {
> >  	struct anon_vma *anon_vma;
> >  	int ret = 1;
> >  
> > -	BUG_ON(is_huge_zero_pfn(page_to_pfn(page)));
> > -	BUG_ON(!PageAnon(page));
> 
> Did you really mean to kill these BUG_ON()s?  They still look relevant
> to me.

The zero page BUG_ON() has moved to the new split_huge_page().
!PageAnon(page) is something we can now handle.

Note: nobody should call split_anon_huge_page() directly, only
split_huge_page().

-- 
 Kirill A. Shutemov

^ permalink raw reply	[flat|nested] 243+ messages in thread

* Re: [PATCHv2, RFC 17/30] thp: wait_split_huge_page(): serialize over i_mmap_mutex too
  2013-03-22 18:22     ` Dave
@ 2013-03-28 15:08       ` Kirill A. Shutemov
  -1 siblings, 0 replies; 243+ messages in thread
From: Kirill A. Shutemov @ 2013-03-28 15:08 UTC (permalink / raw)
  To: Dave
  Cc: Kirill A. Shutemov, Andrea Arcangeli, Andrew Morton, Al Viro,
	Hugh Dickins, Wu Fengguang, Jan Kara, Mel Gorman, linux-mm,
	Andi Kleen, Matthew Wilcox, Kirill A. Shutemov, Hillf Danton,
	linux-fsdevel, linux-kernel

Dave wrote:
> On 03/14/2013 10:50 AM, Kirill A. Shutemov wrote:
> > --- a/include/linux/huge_mm.h
> > +++ b/include/linux/huge_mm.h
> > @@ -113,11 +113,20 @@ extern void __split_huge_page_pmd(struct vm_area_struct *vma,
> >  			__split_huge_page_pmd(__vma, __address,		\
> >  					____pmd);			\
> >  	}  while (0)
> > -#define wait_split_huge_page(__anon_vma, __pmd)				\
> > +#define wait_split_huge_page(__vma, __pmd)				\
> >  	do {								\
> >  		pmd_t *____pmd = (__pmd);				\
> > -		anon_vma_lock_write(__anon_vma);			\
> > -		anon_vma_unlock_write(__anon_vma);			\
> > +		struct address_space *__mapping =			\
> > +					vma->vm_file->f_mapping;	\
> > +		struct anon_vma *__anon_vma = (__vma)->anon_vma;	\
> > +		if (__mapping)						\
> > +			mutex_lock(&__mapping->i_mmap_mutex);		\
> > +		if (__anon_vma) {					\
> > +			anon_vma_lock_write(__anon_vma);		\
> > +			anon_vma_unlock_write(__anon_vma);		\
> > +		}							\
> > +		if (__mapping)						\
> > +			mutex_unlock(&__mapping->i_mmap_mutex);		\
> >  		BUG_ON(pmd_trans_splitting(*____pmd) ||			\
> >  		       pmd_trans_huge(*____pmd));			\
> >  	} while (0)
> 
> That thing was pretty ugly _before_. :)  Any chance this can get turned
> in to a function?

Cyclic headers dependencies... :(

> What's the deal with the i_mmap_mutex operation getting split up?  I'm
> blanking on what kind of pages would have both anon_vmas and a valid
> mapping.

anon_vma_lock protects all anon pages on the vma against splitting.
i_mmap_mutex protects shared pages. No page can be on both sides, but a
MAP_PRIVATE file vma can have both anon and shared pages.

As an option we could look up the struct page with pmd_page(), check
PageAnon() and serialize only against the relevant lock, but...

The original macro, in fact, guarantees that *all* pages on the vma are not
splitting, not only the pages the pmd is pointing to. A PageAnon() check
would change the semantics a bit. It shouldn't be a problem, but who knows.

Do you want me to add the check?
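
For illustration, the body of the macro with the check would look roughly
like this (untested sketch, backslash continuations omitted):

	pmd_t *____pmd = (__pmd);
	struct page *____page = pmd_page(*____pmd);
	if (PageAnon(____page)) {
		anon_vma_lock_write((__vma)->anon_vma);
		anon_vma_unlock_write((__vma)->anon_vma);
	} else {
		struct address_space *____mapping =
				(__vma)->vm_file->f_mapping;
		mutex_lock(&____mapping->i_mmap_mutex);
		mutex_unlock(&____mapping->i_mmap_mutex);
	}
	BUG_ON(pmd_trans_splitting(*____pmd) ||
	       pmd_trans_huge(*____pmd));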

-- 
 Kirill A. Shutemov

^ permalink raw reply	[flat|nested] 243+ messages in thread

* Re: [PATCHv2, RFC 18/30] thp, mm: truncate support for transparent huge page cache
  2013-03-22 18:29     ` Dave
@ 2013-03-28 15:31       ` Kirill A. Shutemov
  -1 siblings, 0 replies; 243+ messages in thread
From: Kirill A. Shutemov @ 2013-03-28 15:31 UTC (permalink / raw)
  To: Dave
  Cc: Kirill A. Shutemov, Andrea Arcangeli, Andrew Morton, Al Viro,
	Hugh Dickins, Wu Fengguang, Jan Kara, Mel Gorman, linux-mm,
	Andi Kleen, Matthew Wilcox, Kirill A. Shutemov, Hillf Danton,
	linux-fsdevel, linux-kernel

Dave wrote:
> On 03/14/2013 10:50 AM, Kirill A. Shutemov wrote:
> > @@ -280,6 +291,7 @@ void truncate_inode_pages_range(struct address_space *mapping,
> >  			if (index > end)
> >  				break;
> >  
> > +			VM_BUG_ON(PageTransHuge(page));
> >  			lock_page(page);
> >  			WARN_ON(page->index != index);
> >  			wait_on_page_writeback(page);
> 
> This looks to be during the second truncate pass where things are
> allowed to block.  What's the logic behind it not being possible to
> encounter TransHugePage()s here?

Good question.

The only way a huge page could appear here from under us is collapsing, but
that's not implemented for file pages and I'm not sure yet how to implement
it...

Probably, I'll replace the BUG with

if (PageTransHuge(page))
	split_huge_page(page);

It should be good enough.

-- 
 Kirill A. Shutemov

^ permalink raw reply	[flat|nested] 243+ messages in thread

* Re: [PATCHv2, RFC 07/30] thp, mm: introduce mapping_can_have_hugepages() predicate
  2013-03-22 14:44         ` Dave Hansen
@ 2013-04-02 14:46           ` Kirill A. Shutemov
  -1 siblings, 0 replies; 243+ messages in thread
From: Kirill A. Shutemov @ 2013-04-02 14:46 UTC (permalink / raw)
  To: Dave Hansen
  Cc: Kirill A. Shutemov, Andrea Arcangeli, Andrew Morton, Al Viro,
	Hugh Dickins, Wu Fengguang, Jan Kara, Mel Gorman, linux-mm,
	Andi Kleen, Matthew Wilcox, Kirill A. Shutemov, Hillf Danton,
	linux-fsdevel, linux-kernel

Dave Hansen wrote:
> On 03/22/2013 03:12 AM, Kirill A. Shutemov wrote:
> > Dave Hansen wrote:
> >> On 03/14/2013 10:50 AM, Kirill A. Shutemov wrote:
> >>> +static inline bool mapping_can_have_hugepages(struct address_space *m)
> >>> +{
> >>> +	if (IS_ENABLED(CONFIG_TRANSPARENT_HUGEPAGE)) {
> >>> +		gfp_t gfp_mask = mapping_gfp_mask(m);
> >>> +		return !!(gfp_mask & __GFP_COMP);
> >>> +	}
> >>> +
> >>> +	return false;
> >>> +}
> >>
> >> I did a quick search in all your patches and don't see __GFP_COMP
> >> getting _set_ anywhere.  Am I missing something?
> > 
> > __GFP_COMP is part of GFP_TRANSHUGE. We set it for ramfs in patch 20/30.
> 
> That's a bit non-obvious.  For a casual observer, it _seems_ like you
> should just be setting and checking GFP_TRANSHUGE directly.  It looks
> like you were having some problems with __GFP_MOVABLE and masked it out
> of GFP_TRANSHUGE and that has cascaded over to _this_ check.

Checking GFP_TRANSHUGE directly is not the right way. File systems can clear
GFP bits or set additional ones for their own reasons. We should not limit
file systems here.

So the only robust way is to check __GFP_COMP. I'll add a comment.
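
Something like this (sketch of the predicate with the comment added):

static inline bool mapping_can_have_hugepages(struct address_space *m)
{
	if (IS_ENABLED(CONFIG_TRANSPARENT_HUGEPAGE)) {
		gfp_t gfp_mask = mapping_gfp_mask(m);
		/*
		 * __GFP_COMP is the one bit from GFP_TRANSHUGE that a
		 * filesystem is not expected to clear or set for its own
		 * reasons, so use it to indicate that huge pages are
		 * allowed for this mapping.
		 */
		return !!(gfp_mask & __GFP_COMP);
	}

	return false;
}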

-- 
 Kirill A. Shutemov

^ permalink raw reply	[flat|nested] 243+ messages in thread

* RE: [PATCHv2, RFC 20/30] ramfs: enable transparent huge page cache
  2013-03-14 17:50   ` Kirill A. Shutemov
@ 2013-04-02 16:28     ` Kirill A. Shutemov
  -1 siblings, 0 replies; 243+ messages in thread
From: Kirill A. Shutemov @ 2013-04-02 16:28 UTC (permalink / raw)
  To: Hugh Dickins
  Cc: Andrea Arcangeli, Andrew Morton, Al Viro, Wu Fengguang, Jan Kara,
	Mel Gorman, linux-mm, Andi Kleen, Matthew Wilcox,
	Kirill A. Shutemov, Hillf Danton, linux-fsdevel, linux-kernel,
	Kirill A. Shutemov

Kirill A. Shutemov wrote:
> From: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
> 
> ramfs is the most simple fs from page cache point of view. Let's start
> transparent huge page cache enabling here.
> 
> For now we allocate only non-movable huge page. It's not yet clear if
> movable page is safe here and what need to be done to make it safe.
> 
> Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
> ---
>  fs/ramfs/inode.c |    6 +++++-
>  1 file changed, 5 insertions(+), 1 deletion(-)
> 
> diff --git a/fs/ramfs/inode.c b/fs/ramfs/inode.c
> index c24f1e1..da30b4f 100644
> --- a/fs/ramfs/inode.c
> +++ b/fs/ramfs/inode.c
> @@ -61,7 +61,11 @@ struct inode *ramfs_get_inode(struct super_block *sb,
>  		inode_init_owner(inode, dir, mode);
>  		inode->i_mapping->a_ops = &ramfs_aops;
>  		inode->i_mapping->backing_dev_info = &ramfs_backing_dev_info;
> -		mapping_set_gfp_mask(inode->i_mapping, GFP_HIGHUSER);
> +		/*
> +		 * TODO: what should be done to make movable safe?
> +		 */
> +		mapping_set_gfp_mask(inode->i_mapping,
> +				GFP_TRANSHUGE & ~__GFP_MOVABLE);

Hugh, I've found an old thread with the reason why we have GFP_HIGHUSER here,
not GFP_HIGHUSER_MOVABLE:

http://lkml.org/lkml/2006/11/27/156

It seems the original reason is no longer valid, correct?

-- 
 Kirill A. Shutemov

^ permalink raw reply	[flat|nested] 243+ messages in thread

* RE: [PATCHv2, RFC 20/30] ramfs: enable transparent huge page cache
@ 2013-04-02 16:28     ` Kirill A. Shutemov
  0 siblings, 0 replies; 243+ messages in thread
From: Kirill A. Shutemov @ 2013-04-02 16:28 UTC (permalink / raw)
  To: Hugh Dickins
  Cc: Andrea Arcangeli, Andrew Morton, Al Viro, Wu Fengguang, Jan Kara,
	Mel Gorman, linux-mm, Andi Kleen, Matthew Wilcox,
	Kirill A. Shutemov, Hillf Danton, linux-fsdevel, linux-kernel,
	Kirill A. Shutemov

Kirill A. Shutemov wrote:
> From: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
> 
> ramfs is the simplest fs from the page cache point of view. Let's start
> transparent huge page cache enabling here.
> 
> For now we allocate only non-movable huge pages. It's not yet clear if
> movable pages are safe here and what needs to be done to make them safe.
> 
> Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
> ---
>  fs/ramfs/inode.c |    6 +++++-
>  1 file changed, 5 insertions(+), 1 deletion(-)
> 
> diff --git a/fs/ramfs/inode.c b/fs/ramfs/inode.c
> index c24f1e1..da30b4f 100644
> --- a/fs/ramfs/inode.c
> +++ b/fs/ramfs/inode.c
> @@ -61,7 +61,11 @@ struct inode *ramfs_get_inode(struct super_block *sb,
>  		inode_init_owner(inode, dir, mode);
>  		inode->i_mapping->a_ops = &ramfs_aops;
>  		inode->i_mapping->backing_dev_info = &ramfs_backing_dev_info;
> -		mapping_set_gfp_mask(inode->i_mapping, GFP_HIGHUSER);
> +		/*
> +		 * TODO: what should be done to make movable safe?
> +		 */
> +		mapping_set_gfp_mask(inode->i_mapping,
> +				GFP_TRANSHUGE & ~__GFP_MOVABLE);

Hugh, I've found an old thread with the reason why we have GFP_HIGHUSER here, not
GFP_HIGHUSER_MOVABLE:

http://lkml.org/lkml/2006/11/27/156

It seems the original reason is no longer valid, correct?

-- 
 Kirill A. Shutemov

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 243+ messages in thread

* RE: [PATCHv2, RFC 20/30] ramfs: enable transparent huge page cache
  2013-04-02 16:28     ` Kirill A. Shutemov
@ 2013-04-02 22:15       ` Hugh Dickins
  -1 siblings, 0 replies; 243+ messages in thread
From: Hugh Dickins @ 2013-04-02 22:15 UTC (permalink / raw)
  To: Kirill A. Shutemov
  Cc: Andrea Arcangeli, Andrew Morton, Al Viro, Wu Fengguang, Jan Kara,
	Mel Gorman, linux-mm, Andi Kleen, Matthew Wilcox,
	Kirill A. Shutemov, Hillf Danton, Ying Han, Minchan Kim,
	Christoph Lameter, linux-fsdevel, linux-kernel

On Tue, 2 Apr 2013, Kirill A. Shutemov wrote:
> Kirill A. Shutemov wrote:
> > From: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
> > 
> > ramfs is the simplest fs from the page cache point of view. Let's start
> > transparent huge page cache enabling here.
> > 
> > For now we allocate only non-movable huge pages. It's not yet clear if
> > movable pages are safe here and what needs to be done to make them safe.
> > 
> > Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
> > ---
> >  fs/ramfs/inode.c |    6 +++++-
> >  1 file changed, 5 insertions(+), 1 deletion(-)
> > 
> > diff --git a/fs/ramfs/inode.c b/fs/ramfs/inode.c
> > index c24f1e1..da30b4f 100644
> > --- a/fs/ramfs/inode.c
> > +++ b/fs/ramfs/inode.c
> > @@ -61,7 +61,11 @@ struct inode *ramfs_get_inode(struct super_block *sb,
> >  		inode_init_owner(inode, dir, mode);
> >  		inode->i_mapping->a_ops = &ramfs_aops;
> >  		inode->i_mapping->backing_dev_info = &ramfs_backing_dev_info;
> > -		mapping_set_gfp_mask(inode->i_mapping, GFP_HIGHUSER);
> > +		/*
> > +		 * TODO: what should be done to make movable safe?
> > +		 */
> > +		mapping_set_gfp_mask(inode->i_mapping,
> > +				GFP_TRANSHUGE & ~__GFP_MOVABLE);
> 
> Hugh, I've found an old thread with the reason why we have GFP_HIGHUSER here, not
> GFP_HIGHUSER_MOVABLE:
> 
> http://lkml.org/lkml/2006/11/27/156
> 
> It seems the original reason is no longer valid, correct?

Incorrect, I believe: so far as I know, the original reason remains
valid - though it would only require a couple of good small changes
to reverse that - or perhaps you have already made these changes?

The original reason is that ramfs pages are not migratable,
therefore they should be allocated from an unmovable area.

As I understand it (and I would have preferred to run a test to check
my understanding before replying, but don't have time for that), ramfs
pages cannot be migrated for two reasons, neither of them a good reason.

One reason (okay, it wouldn't have been quite this way in 2006) is that
ramfs (rightly) calls mapping_set_unevictable(), so its pages will fail
the page_evictable() test, so they will be marked PageUnevictable, so
__isolate_lru_page() will refuse to isolate them for migration (except
for CMA).
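
(For reference, this is roughly how the check being described reads in
mm/vmscan.c; simplified, and the exact form may differ between trees.
The ISOLATE_UNEVICTABLE mode bit is what lets CMA through.)

int __isolate_lru_page(struct page *page, isolate_mode_t mode)
{
	/* Only take pages that are on the LRU. */
	if (!PageLRU(page))
		return -EINVAL;

	/*
	 * Unevictable pages (ramfs, SHM_LOCK, mlocked) are skipped unless
	 * the caller explicitly asked for them, which only CMA does.
	 */
	if (PageUnevictable(page) && !(mode & ISOLATE_UNEVICTABLE))
		return -EINVAL;

	/* ... the rest of the isolation logic ... */
	return 0;
}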

I am strongly in favour of removing that limitation from
__isolate_lru_page() (and the thread you pointed - thank you - shows Mel
and Christoph were both in favour too); and note that there is no such
restriction in the confusingly similar but different isolate_lru_page().

Some people do worry that migrating Mlocked pages would introduce the
occasional possibility of a minor fault (with migration_entry_wait())
on an Mlocked region which never faulted before.  I tend to dismiss
that worry, but maybe I'm wrong to do so: maybe there should be a
tunable for realtimey people to set, to prohibit page migration from
mlocked areas; but the default should be to allow it.

(Of course, we could separate ramfs's mapping_unevictable case from
the Mlocked case; but I'd prefer to continue to treat them the same.)

The other reason it looks as if ramfs pages cannot be migrated, is
that it does not set a suitable ->migratepage method, so would be
handled by fallback_migrate_page(), whose PageDirty test will end
up failing the migration with -EBUSY or -EINVAL - if I read it
correctly.
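
(The ->migratepage half, at least, looks like one of those small changes:
something along the lines of the sketch below, reusing the generic
migrate_page() helper, which just moves the radix-tree slot and copies
the flags and contents.  Untested, the existing entries are reproduced
from fs/ramfs/inode.c from memory, and whether plain migrate_page() is
really sufficient for ramfs is exactly the open question.)

const struct address_space_operations ramfs_aops = {
	.readpage	= simple_readpage,
	.write_begin	= simple_write_begin,
	.write_end	= simple_write_end,
	.set_page_dirty	= __set_page_dirty_no_writeback,
	.migratepage	= migrate_page,		/* hypothetical addition */
};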

Perhaps other such reasons would surface once those are fixed.
But until ramfs pages can be migrated, they should not be allocated
with __GFP_MOVABLE.  (I've been writing about the migratability of
small pages: I expect you have the migratability of THPages in flux.)

Hugh

^ permalink raw reply	[flat|nested] 243+ messages in thread

* RE: [PATCHv2, RFC 20/30] ramfs: enable transparent huge page cache
@ 2013-04-02 22:15       ` Hugh Dickins
  0 siblings, 0 replies; 243+ messages in thread
From: Hugh Dickins @ 2013-04-02 22:15 UTC (permalink / raw)
  To: Kirill A. Shutemov
  Cc: Andrea Arcangeli, Andrew Morton, Al Viro, Wu Fengguang, Jan Kara,
	Mel Gorman, linux-mm, Andi Kleen, Matthew Wilcox,
	Kirill A. Shutemov, Hillf Danton, Ying Han, Minchan Kim,
	Christoph Lameter, linux-fsdevel, linux-kernel

On Tue, 2 Apr 2013, Kirill A. Shutemov wrote:
> Kirill A. Shutemov wrote:
> > From: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
> > 
> > ramfs is the simplest fs from the page cache point of view. Let's start
> > transparent huge page cache enabling here.
> > 
> > For now we allocate only non-movable huge pages. It's not yet clear if
> > movable pages are safe here and what needs to be done to make them safe.
> > 
> > Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
> > ---
> >  fs/ramfs/inode.c |    6 +++++-
> >  1 file changed, 5 insertions(+), 1 deletion(-)
> > 
> > diff --git a/fs/ramfs/inode.c b/fs/ramfs/inode.c
> > index c24f1e1..da30b4f 100644
> > --- a/fs/ramfs/inode.c
> > +++ b/fs/ramfs/inode.c
> > @@ -61,7 +61,11 @@ struct inode *ramfs_get_inode(struct super_block *sb,
> >  		inode_init_owner(inode, dir, mode);
> >  		inode->i_mapping->a_ops = &ramfs_aops;
> >  		inode->i_mapping->backing_dev_info = &ramfs_backing_dev_info;
> > -		mapping_set_gfp_mask(inode->i_mapping, GFP_HIGHUSER);
> > +		/*
> > +		 * TODO: what should be done to make movable safe?
> > +		 */
> > +		mapping_set_gfp_mask(inode->i_mapping,
> > +				GFP_TRANSHUGE & ~__GFP_MOVABLE);
> 
> Hugh, I've found an old thread with the reason why we have GFP_HIGHUSER here, not
> GFP_HIGHUSER_MOVABLE:
> 
> http://lkml.org/lkml/2006/11/27/156
> 
> It seems the original reason is no longer valid, correct?

Incorrect, I believe: so far as I know, the original reason remains
valid - though it would only require a couple of good small changes
to reverse that - or perhaps you have already made these changes?

The original reason is that ramfs pages are not migratable,
therefore they should be allocated from an unmovable area.

As I understand it (and I would have preferred to run a test to check
my understanding before replying, but don't have time for that), ramfs
pages cannot be migrated for two reasons, neither of them a good reason.

One reason (okay, it wouldn't have been quite this way in 2006) is that
ramfs (rightly) calls mapping_set_unevictable(), so its pages will fail
the page_evictable() test, so they will be marked PageUnevictable, so
__isolate_lru_page() will refuse to isolate them for migration (except
for CMA).

I am strongly in favour of removing that limitation from
__isolate_lru_page() (and the thread you pointed - thank you - shows Mel
and Christoph were both in favour too); and note that there is no such
restriction in the confusingly similar but different isolate_lru_page().

Some people do worry that migrating Mlocked pages would introduce the
occasional possibility of a minor fault (with migration_entry_wait())
on an Mlocked region which never faulted before.  I tend to dismiss
that worry, but maybe I'm wrong to do so: maybe there should be a
tunable for realtimey people to set, to prohibit page migration from
mlocked areas; but the default should be to allow it.

(Of course, we could separate ramfs's mapping_unevictable case from
the Mlocked case; but I'd prefer to continue to treat them the same.)

The other reason it looks as if ramfs pages cannot be migrated, is
that it does not set a suitable ->migratepage method, so would be
handled by fallback_migrate_page(), whose PageDirty test will end
up failing the migration with -EBUSY or -EINVAL - if I read it
correctly.

Perhaps other such reasons would surface once those are fixed.
But until ramfs pages can be migrated, they should not be allocated
with __GFP_MOVABLE.  (I've been writing about the migratability of
small pages: I expect you have the migratability of THPages in flux.)

Hugh

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 243+ messages in thread

* Re: [PATCHv2, RFC 20/30] ramfs: enable transparent huge page cache
  2013-04-02 22:15       ` Hugh Dickins
@ 2013-04-03  1:11         ` Minchan Kim
  -1 siblings, 0 replies; 243+ messages in thread
From: Minchan Kim @ 2013-04-03  1:11 UTC (permalink / raw)
  To: Hugh Dickins
  Cc: Kirill A. Shutemov, Andrea Arcangeli, Andrew Morton, Al Viro,
	Wu Fengguang, Jan Kara, Mel Gorman, linux-mm, Andi Kleen,
	Matthew Wilcox, Kirill A. Shutemov, Hillf Danton, Ying Han,
	Christoph Lameter, linux-fsdevel, linux-kernel

On Tue, Apr 02, 2013 at 03:15:23PM -0700, Hugh Dickins wrote:
> On Tue, 2 Apr 2013, Kirill A. Shutemov wrote:
> > Kirill A. Shutemov wrote:
> > > From: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
> > > 
> > > ramfs is the simplest fs from the page cache point of view. Let's start
> > > transparent huge page cache enabling here.
> > > 
> > > For now we allocate only non-movable huge pages. It's not yet clear if
> > > movable pages are safe here and what needs to be done to make them safe.
> > > 
> > > Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
> > > ---
> > >  fs/ramfs/inode.c |    6 +++++-
> > >  1 file changed, 5 insertions(+), 1 deletion(-)
> > > 
> > > diff --git a/fs/ramfs/inode.c b/fs/ramfs/inode.c
> > > index c24f1e1..da30b4f 100644
> > > --- a/fs/ramfs/inode.c
> > > +++ b/fs/ramfs/inode.c
> > > @@ -61,7 +61,11 @@ struct inode *ramfs_get_inode(struct super_block *sb,
> > >  		inode_init_owner(inode, dir, mode);
> > >  		inode->i_mapping->a_ops = &ramfs_aops;
> > >  		inode->i_mapping->backing_dev_info = &ramfs_backing_dev_info;
> > > -		mapping_set_gfp_mask(inode->i_mapping, GFP_HIGHUSER);
> > > +		/*
> > > +		 * TODO: what should be done to make movable safe?
> > > +		 */
> > > +		mapping_set_gfp_mask(inode->i_mapping,
> > > +				GFP_TRANSHUGE & ~__GFP_MOVABLE);
> > 
> > Hugh, I've found an old thread with the reason why we have GFP_HIGHUSER here, not
> > GFP_HIGHUSER_MOVABLE:
> > 
> > http://lkml.org/lkml/2006/11/27/156
> > 
> > It seems the original reason is no longer valid, correct?
> 
> Incorrect, I believe: so far as I know, the original reason remains
> valid - though it would only require a couple of good small changes
> to reverse that - or perhaps you have already made these changes?
> 
> The original reason is that ramfs pages are not migratable,
> therefore they should be allocated from an unmovable area.
> 
> As I understand it (and I would have preferred to run a test to check
> my understanding before replying, but don't have time for that), ramfs
> pages cannot be migrated for two reasons, neither of them a good reason.
> 
> One reason (okay, it wouldn't have been quite this way in 2006) is that
> ramfs (rightly) calls mapping_set_unevictable(), so its pages will fail
> the page_evictable() test, so they will be marked PageUnevictable, so
> __isolate_lru_page() will refuse to isolate them for migration (except
> for CMA).

True.

> 
> I am strongly in favour of removing that limitation from
> __isolate_lru_page() (and the thread you pointed - thank you - shows Mel
> and Christoph were both in favour too); and note that there is no such
> restriction in the confusingly similar but different isolate_lru_page().
> 
> Some people do worry that migrating Mlocked pages would introduce the
> occasional possibility of a minor fault (with migration_entry_wait())
> on an Mlocked region which never faulted before.  I tend to dismiss
> that worry, but maybe I'm wrong to do so: maybe there should be a
> tunable for realtimey people to set, to prohibit page migration from
> mlocked areas; but the default should be to allow it.

I agree.
Just FYI on mlocked page migration:

I tried migration of mlocked pages, and Johannes and Mel had a concern
about that.
http://lkml.indiana.edu/hypermail/linux/kernel/1109.0/00175.html

But later Peter acked it, and from reading the thread I gather that Hugh
was in favour when page migration was first merged.

http://marc.info/?l=linux-mm&m=133697873414205&w=2
http://marc.info/?l=linux-mm&m=133700341823358&w=2

Many people said mlock means memory-resident, NOT pinning, so it could
allow a minor fault, while Mel still had a concern outside of the CMA
case.
http://marc.info/?l=linux-mm&m=133674219714419&w=2

> 
> (Of course, we could separate ramfs's mapping_unevictable case from
> the Mlocked case; but I'd prefer to continue to treat them the same.)

Fair enough.

> 
> The other reason it looks as if ramfs pages cannot be migrated, is
> that it does not set a suitable ->migratepage method, so would be
> handled by fallback_migrate_page(), whose PageDirty test will end
> up failing the migration with -EBUSY or -EINVAL - if I read it
> correctly.

True.

> 
> Perhaps other such reasons would surface once those are fixed.
> But until ramfs pages can be migrated, they should not be allocated
> with __GFP_MOVABLE.  (I've been writing about the migratability of
> small pages: I expect you have the migratability of THPages in flux.)

Agreed.

> 
> Hugh
> 
> --
> To unsubscribe, send a message with 'unsubscribe linux-mm' in
> the body to majordomo@kvack.org.  For more info on Linux MM,
> see: http://www.linux-mm.org/ .
> Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

-- 
Kind regards,
Minchan Kim

^ permalink raw reply	[flat|nested] 243+ messages in thread

* Re: [PATCHv2, RFC 20/30] ramfs: enable transparent huge page cache
@ 2013-04-03  1:11         ` Minchan Kim
  0 siblings, 0 replies; 243+ messages in thread
From: Minchan Kim @ 2013-04-03  1:11 UTC (permalink / raw)
  To: Hugh Dickins
  Cc: Kirill A. Shutemov, Andrea Arcangeli, Andrew Morton, Al Viro,
	Wu Fengguang, Jan Kara, Mel Gorman, linux-mm, Andi Kleen,
	Matthew Wilcox, Kirill A. Shutemov, Hillf Danton, Ying Han,
	Christoph Lameter, linux-fsdevel, linux-kernel

On Tue, Apr 02, 2013 at 03:15:23PM -0700, Hugh Dickins wrote:
> On Tue, 2 Apr 2013, Kirill A. Shutemov wrote:
> > Kirill A. Shutemov wrote:
> > > From: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
> > > 
> > > ramfs is the simplest fs from the page cache point of view. Let's start
> > > transparent huge page cache enabling here.
> > > 
> > > For now we allocate only non-movable huge pages. It's not yet clear if
> > > movable pages are safe here and what needs to be done to make them safe.
> > > 
> > > Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
> > > ---
> > >  fs/ramfs/inode.c |    6 +++++-
> > >  1 file changed, 5 insertions(+), 1 deletion(-)
> > > 
> > > diff --git a/fs/ramfs/inode.c b/fs/ramfs/inode.c
> > > index c24f1e1..da30b4f 100644
> > > --- a/fs/ramfs/inode.c
> > > +++ b/fs/ramfs/inode.c
> > > @@ -61,7 +61,11 @@ struct inode *ramfs_get_inode(struct super_block *sb,
> > >  		inode_init_owner(inode, dir, mode);
> > >  		inode->i_mapping->a_ops = &ramfs_aops;
> > >  		inode->i_mapping->backing_dev_info = &ramfs_backing_dev_info;
> > > -		mapping_set_gfp_mask(inode->i_mapping, GFP_HIGHUSER);
> > > +		/*
> > > +		 * TODO: what should be done to make movable safe?
> > > +		 */
> > > +		mapping_set_gfp_mask(inode->i_mapping,
> > > +				GFP_TRANSHUGE & ~__GFP_MOVABLE);
> > 
> > Hugh, I've found an old thread with the reason why we have GFP_HIGHUSER here, not
> > GFP_HIGHUSER_MOVABLE:
> > 
> > http://lkml.org/lkml/2006/11/27/156
> > 
> > It seems the original reason is no longer valid, correct?
> 
> Incorrect, I believe: so far as I know, the original reason remains
> valid - though it would only require a couple of good small changes
> to reverse that - or perhaps you have already made these changes?
> 
> The original reason is that ramfs pages are not migratable,
> therefore they should be allocated from an unmovable area.
> 
> As I understand it (and I would have preferred to run a test to check
> my understanding before replying, but don't have time for that), ramfs
> pages cannot be migrated for two reasons, neither of them a good reason.
> 
> One reason (okay, it wouldn't have been quite this way in 2006) is that
> ramfs (rightly) calls mapping_set_unevictable(), so its pages will fail
> the page_evictable() test, so they will be marked PageUnevictable, so
> __isolate_lru_page() will refuse to isolate them for migration (except
> for CMA).

True.

> 
> I am strongly in favour of removing that limitation from
> __isolate_lru_page() (and the thread you pointed - thank you - shows Mel
> and Christoph were both in favour too); and note that there is no such
> restriction in the confusingly similar but different isolate_lru_page().
> 
> Some people do worry that migrating Mlocked pages would introduce the
> occasional possibility of a minor fault (with migration_entry_wait())
> on an Mlocked region which never faulted before.  I tend to dismiss
> that worry, but maybe I'm wrong to do so: maybe there should be a
> tunable for realtimey people to set, to prohibit page migration from
> mlocked areas; but the default should be to allow it.

I agree.
Just FYI on mlocked page migration:

I tried migration of mlocked pages, and Johannes and Mel had a concern
about that.
http://lkml.indiana.edu/hypermail/linux/kernel/1109.0/00175.html

But later Peter acked it, and from reading the thread I gather that Hugh
was in favour when page migration was first merged.

http://marc.info/?l=linux-mm&m=133697873414205&w=2
http://marc.info/?l=linux-mm&m=133700341823358&w=2

Many people said mlock means memory-resident, NOT pinning, so it could
allow a minor fault, while Mel still had a concern outside of the CMA
case.
http://marc.info/?l=linux-mm&m=133674219714419&w=2

> 
> (Of course, we could separate ramfs's mapping_unevictable case from
> the Mlocked case; but I'd prefer to continue to treat them the same.)

Fair enough.

> 
> The other reason it looks as if ramfs pages cannot be migrated, is
> that it does not set a suitable ->migratepage method, so would be
> handled by fallback_migrate_page(), whose PageDirty test will end
> up failing the migration with -EBUSY or -EINVAL - if I read it
> correctly.

True.

> 
> Perhaps other such reasons would surface once those are fixed.
> But until ramfs pages can be migrated, they should not be allocated
> with __GFP_MOVABLE.  (I've been writing about the migratability of
> small pages: I expect you have the migratability of THPages in flux.)

Agreed.

> 
> Hugh
> 
> --
> To unsubscribe, send a message with 'unsubscribe linux-mm' in
> the body to majordomo@kvack.org.  For more info on Linux MM,
> see: http://www.linux-mm.org/ .
> Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

-- 
Kind regards,
Minchan Kim

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 243+ messages in thread

* RE: [PATCHv2, RFC 20/30] ramfs: enable transparent huge page cache
  2013-04-02 22:15       ` Hugh Dickins
@ 2013-04-03 13:53         ` Christoph Lameter
  -1 siblings, 0 replies; 243+ messages in thread
From: Christoph Lameter @ 2013-04-03 13:53 UTC (permalink / raw)
  To: Hugh Dickins
  Cc: Kirill A. Shutemov, Andrea Arcangeli, Andrew Morton, Al Viro,
	Wu Fengguang, Jan Kara, Mel Gorman, linux-mm, Andi Kleen,
	Matthew Wilcox, Kirill A. Shutemov, Hillf Danton, Ying Han,
	Minchan Kim, linux-fsdevel, linux-kernel

On Tue, 2 Apr 2013, Hugh Dickins wrote:

> I am strongly in favour of removing that limitation from
> __isolate_lru_page() (and the thread you pointed - thank you - shows Mel
> and Christoph were both in favour too); and note that there is no such
> restriction in the confusingly similar but different isolate_lru_page().

Well the naming could be cleaned up. The fundamental issue with migrating
pages is that all references have to be tracked and updated in a way that
no references can be followed to invalid or stale page contents. If ramfs
does not maintain separate pointers but only relies on pointers already
handled by the migration logic then migration is fine.

> Some people do worry that migrating Mlocked pages would introduce the
> occasional possibility of a minor fault (with migration_entry_wait())
> on an Mlocked region which never faulted before.  I tend to dismiss
> that worry, but maybe I'm wrong to do so: maybe there should be a
> tunable for realtimey people to set, to prohibit page migration from
> mlocked areas; but the default should be to allow it.

Could we have a different way of marking pages "pinned"? This is useful
for various subsystems (like RDMA and various graphics drivers etc) which
need to ensure that virtual address to physical address mappings stay the
same for a prolonged period of time. I think this use case is becoming
more frequent given that offload techniques have to be used these days to
overcome the limits on processor performance.

> The other reason it looks as if ramfs pages cannot be migrated, is
> that it does not set a suitable ->migratepage method, so would be
> handled by fallback_migrate_page(), whose PageDirty test will end
> up failing the migration with -EBUSY or -EINVAL - if I read it
> correctly.

These could be handled the same way that anonymous pages are handled.

> But until ramfs pages can be migrated, they should not be allocated
> with __GFP_MOVABLE.  (I've been writing about the migratability of
> small pages: I expect you have the migratability of THPages in flux.)

I agree.


^ permalink raw reply	[flat|nested] 243+ messages in thread

* RE: [PATCHv2, RFC 20/30] ramfs: enable transparent huge page cache
@ 2013-04-03 13:53         ` Christoph Lameter
  0 siblings, 0 replies; 243+ messages in thread
From: Christoph Lameter @ 2013-04-03 13:53 UTC (permalink / raw)
  To: Hugh Dickins
  Cc: Kirill A. Shutemov, Andrea Arcangeli, Andrew Morton, Al Viro,
	Wu Fengguang, Jan Kara, Mel Gorman, linux-mm, Andi Kleen,
	Matthew Wilcox, Kirill A. Shutemov, Hillf Danton, Ying Han,
	Minchan Kim, linux-fsdevel, linux-kernel

On Tue, 2 Apr 2013, Hugh Dickins wrote:

> I am strongly in favour of removing that limitation from
> __isolate_lru_page() (and the thread you pointed - thank you - shows Mel
> and Christoph were both in favour too); and note that there is no such
> restriction in the confusingly similar but different isolate_lru_page().

Well the naming could be cleaned up. The fundamental issue with migrating
pages is that all references have to be tracked and updated in a way that
no references can be followed to invalid or stale page contents. If ramfs
does not maintain separate pointers but only relies on pointers already
handled by the migration logic then migration is fine.

> Some people do worry that migrating Mlocked pages would introduce the
> occasional possibility of a minor fault (with migration_entry_wait())
> on an Mlocked region which never faulted before.  I tend to dismiss
> that worry, but maybe I'm wrong to do so: maybe there should be a
> tunable for realtimey people to set, to prohibit page migration from
> mlocked areas; but the default should be to allow it.

Could we have a different way of marking pages "pinned"? This is useful
for various subsystems (like RDMA and various graphics drivers etc) which
need to ensure that virtual address to physical address mappings stay the
same for a prolonged period of time. I think this use case is becoming
more frequent given that offload techniques have to be used these days to
overcome the limits on processor performance.

> The other reason it looks as if ramfs pages cannot be migrated, is
> that it does not set a suitable ->migratepage method, so would be
> handled by fallback_migrate_page(), whose PageDirty test will end
> up failing the migration with -EBUSY or -EINVAL - if I read it
> correctly.

These could be handled the same way that anonymous pages are handled.

> But until ramfs pages can be migrated, they should not be allocated
> with __GFP_MOVABLE.  (I've been writing about the migratability of
> small pages: I expect you have the migratability of THPages in flux.)

I agree.

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 243+ messages in thread

* Re: [PATCHv2, RFC 04/30] radix-tree: implement preload for multiple contiguous elements
  2013-03-14 17:50   ` Kirill A. Shutemov
@ 2013-04-05  3:37     ` Ric Mason
  -1 siblings, 0 replies; 243+ messages in thread
From: Ric Mason @ 2013-04-05  3:37 UTC (permalink / raw)
  To: Kirill A. Shutemov
  Cc: Andrea Arcangeli, Andrew Morton, Al Viro, Hugh Dickins,
	Wu Fengguang, Jan Kara, Mel Gorman, linux-mm, Andi Kleen,
	Matthew Wilcox, Kirill A. Shutemov, Hillf Danton, linux-fsdevel,
	linux-kernel

Hi Kirill,
On 03/15/2013 01:50 AM, Kirill A. Shutemov wrote:
> From: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
>
> Currently radix_tree_preload() only guarantees enough nodes to insert
> one element. It's a hard limit. You cannot batch a number of inserts
> under one tree_lock.
>
> This patch introduces radix_tree_preload_count(). It allows preallocating
> enough nodes to insert a number of *contiguous* elements.
>
> Signed-off-by: Matthew Wilcox <matthew.r.wilcox@intel.com>
> Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
> ---
>   include/linux/radix-tree.h |    3 +++
>   lib/radix-tree.c           |   32 +++++++++++++++++++++++++-------
>   2 files changed, 28 insertions(+), 7 deletions(-)
>
> diff --git a/include/linux/radix-tree.h b/include/linux/radix-tree.h
> index ffc444c..81318cb 100644
> --- a/include/linux/radix-tree.h
> +++ b/include/linux/radix-tree.h
> @@ -83,6 +83,8 @@ do {									\
>   	(root)->rnode = NULL;						\
>   } while (0)
>   
> +#define RADIX_TREE_PRELOAD_NR		512 /* For THP's benefit */
> +
>   /**
>    * Radix-tree synchronization
>    *
> @@ -231,6 +233,7 @@ unsigned long radix_tree_next_hole(struct radix_tree_root *root,
>   unsigned long radix_tree_prev_hole(struct radix_tree_root *root,
>   				unsigned long index, unsigned long max_scan);
>   int radix_tree_preload(gfp_t gfp_mask);
> +int radix_tree_preload_count(unsigned size, gfp_t gfp_mask);
>   void radix_tree_init(void);
>   void *radix_tree_tag_set(struct radix_tree_root *root,
>   			unsigned long index, unsigned int tag);
> diff --git a/lib/radix-tree.c b/lib/radix-tree.c
> index e796429..9bef0ac 100644
> --- a/lib/radix-tree.c
> +++ b/lib/radix-tree.c
> @@ -81,16 +81,24 @@ static struct kmem_cache *radix_tree_node_cachep;
>    * The worst case is a zero height tree with just a single item at index 0,
>    * and then inserting an item at index ULONG_MAX. This requires 2 new branches
>    * of RADIX_TREE_MAX_PATH size to be created, with only the root node shared.
> + *
> + * Worst case for adding N contiguous items is adding entries at indexes
> + * (ULONG_MAX - N) to ULONG_MAX. It requires nodes to insert single worst-case
> + * item plus extra nodes if you cross the boundary from one node to the next.
> + *

What's the meaning of this comment? Could you explain in detail? I also
don't understand #define RADIX_TREE_PRELOAD_SIZE (RADIX_TREE_MAX_PATH * 2 - 1):
why RADIX_TREE_MAX_PATH * 2 - 1? I fail to understand the comment above it.
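
To make the question concrete, here is my attempt at the arithmetic for a
64-bit build (assuming RADIX_TREE_MAP_SHIFT == 6 and RADIX_TREE_MAP_SIZE
== 64; other configs will differ), please correct me where I go wrong:

  RADIX_TREE_MAX_PATH    = DIV_ROUND_UP(64, 6)         = 11
  RADIX_TREE_PRELOAD_MIN = RADIX_TREE_MAX_PATH * 2 - 1 = 21
  extra for 512 items    = DIV_ROUND_UP(512 - 1, 64)   =  8
  RADIX_TREE_PRELOAD_MAX = 21 + 8                      = 29

My understanding of the "* 2 - 1": the single-insert worst case is an
existing item at index 0 plus a new item at index ULONG_MAX, which needs
two full root-to-leaf paths of RADIX_TREE_MAX_PATH nodes sharing only the
root node.  The extra DIV_ROUND_UP() term is one more leaf node for every
additional RADIX_TREE_MAP_SIZE indexes the contiguous range can cross
into.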

>    * Hence:
>    */
> -#define RADIX_TREE_PRELOAD_SIZE (RADIX_TREE_MAX_PATH * 2 - 1)
> +#define RADIX_TREE_PRELOAD_MIN (RADIX_TREE_MAX_PATH * 2 - 1)
> +#define RADIX_TREE_PRELOAD_MAX \
> +	(RADIX_TREE_PRELOAD_MIN + \
> +	 DIV_ROUND_UP(RADIX_TREE_PRELOAD_NR - 1, RADIX_TREE_MAP_SIZE))
>   
>   /*
>    * Per-cpu pool of preloaded nodes
>    */
>   struct radix_tree_preload {
>   	int nr;
> -	struct radix_tree_node *nodes[RADIX_TREE_PRELOAD_SIZE];
> +	struct radix_tree_node *nodes[RADIX_TREE_PRELOAD_MAX];
>   };
>   static DEFINE_PER_CPU(struct radix_tree_preload, radix_tree_preloads) = { 0, };
>   
> @@ -257,29 +265,34 @@ radix_tree_node_free(struct radix_tree_node *node)
>   
>   /*
>    * Load up this CPU's radix_tree_node buffer with sufficient objects to
> - * ensure that the addition of a single element in the tree cannot fail.  On
> - * success, return zero, with preemption disabled.  On error, return -ENOMEM
> + * ensure that the addition of *contiguous* elements in the tree cannot fail.
> + * On success, return zero, with preemption disabled.  On error, return -ENOMEM
>    * with preemption not disabled.
>    *
>    * To make use of this facility, the radix tree must be initialised without
>    * __GFP_WAIT being passed to INIT_RADIX_TREE().
>    */
> -int radix_tree_preload(gfp_t gfp_mask)
> +int radix_tree_preload_count(unsigned size, gfp_t gfp_mask)
>   {
>   	struct radix_tree_preload *rtp;
>   	struct radix_tree_node *node;
>   	int ret = -ENOMEM;
> +	int alloc = RADIX_TREE_PRELOAD_MIN +
> +		DIV_ROUND_UP(size - 1, RADIX_TREE_MAP_SIZE);
> +
> +	if (size > RADIX_TREE_PRELOAD_NR)
> +		return -ENOMEM;
>   
>   	preempt_disable();
>   	rtp = &__get_cpu_var(radix_tree_preloads);
> -	while (rtp->nr < ARRAY_SIZE(rtp->nodes)) {
> +	while (rtp->nr < alloc) {
>   		preempt_enable();
>   		node = kmem_cache_alloc(radix_tree_node_cachep, gfp_mask);
>   		if (node == NULL)
>   			goto out;
>   		preempt_disable();
>   		rtp = &__get_cpu_var(radix_tree_preloads);
> -		if (rtp->nr < ARRAY_SIZE(rtp->nodes))
> +		if (rtp->nr < alloc)
>   			rtp->nodes[rtp->nr++] = node;
>   		else
>   			kmem_cache_free(radix_tree_node_cachep, node);
> @@ -288,6 +301,11 @@ int radix_tree_preload(gfp_t gfp_mask)
>   out:
>   	return ret;
>   }
> +
> +int radix_tree_preload(gfp_t gfp_mask)
> +{
> +	return radix_tree_preload_count(1, gfp_mask);
> +}
>   EXPORT_SYMBOL(radix_tree_preload);
>   
>   /*


^ permalink raw reply	[flat|nested] 243+ messages in thread

* Re: [PATCHv2, RFC 04/30] radix-tree: implement preload for multiple contiguous elements
@ 2013-04-05  3:37     ` Ric Mason
  0 siblings, 0 replies; 243+ messages in thread
From: Ric Mason @ 2013-04-05  3:37 UTC (permalink / raw)
  To: Kirill A. Shutemov
  Cc: Andrea Arcangeli, Andrew Morton, Al Viro, Hugh Dickins,
	Wu Fengguang, Jan Kara, Mel Gorman, linux-mm, Andi Kleen,
	Matthew Wilcox, Kirill A. Shutemov, Hillf Danton, linux-fsdevel,
	linux-kernel

Hi Kirill,
On 03/15/2013 01:50 AM, Kirill A. Shutemov wrote:
> From: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
>
> Currently radix_tree_preload() only guarantees enough nodes to insert
> one element. It's a hard limit. You cannot batch a number of inserts
> under one tree_lock.
>
> This patch introduces radix_tree_preload_count(). It allows preallocating
> enough nodes to insert a number of *contiguous* elements.
>
> Signed-off-by: Matthew Wilcox <matthew.r.wilcox@intel.com>
> Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
> ---
>   include/linux/radix-tree.h |    3 +++
>   lib/radix-tree.c           |   32 +++++++++++++++++++++++++-------
>   2 files changed, 28 insertions(+), 7 deletions(-)
>
> diff --git a/include/linux/radix-tree.h b/include/linux/radix-tree.h
> index ffc444c..81318cb 100644
> --- a/include/linux/radix-tree.h
> +++ b/include/linux/radix-tree.h
> @@ -83,6 +83,8 @@ do {									\
>   	(root)->rnode = NULL;						\
>   } while (0)
>   
> +#define RADIX_TREE_PRELOAD_NR		512 /* For THP's benefit */
> +
>   /**
>    * Radix-tree synchronization
>    *
> @@ -231,6 +233,7 @@ unsigned long radix_tree_next_hole(struct radix_tree_root *root,
>   unsigned long radix_tree_prev_hole(struct radix_tree_root *root,
>   				unsigned long index, unsigned long max_scan);
>   int radix_tree_preload(gfp_t gfp_mask);
> +int radix_tree_preload_count(unsigned size, gfp_t gfp_mask);
>   void radix_tree_init(void);
>   void *radix_tree_tag_set(struct radix_tree_root *root,
>   			unsigned long index, unsigned int tag);
> diff --git a/lib/radix-tree.c b/lib/radix-tree.c
> index e796429..9bef0ac 100644
> --- a/lib/radix-tree.c
> +++ b/lib/radix-tree.c
> @@ -81,16 +81,24 @@ static struct kmem_cache *radix_tree_node_cachep;
>    * The worst case is a zero height tree with just a single item at index 0,
>    * and then inserting an item at index ULONG_MAX. This requires 2 new branches
>    * of RADIX_TREE_MAX_PATH size to be created, with only the root node shared.
> + *
> + * Worst case for adding N contiguous items is adding entries at indexes
> + * (ULONG_MAX - N) to ULONG_MAX. It requires nodes to insert single worst-case
> + * item plus extra nodes if you cross the boundary from one node to the next.
> + *

What's the meaning of this comment? Could you explain in detail? I also
don't understand #define RADIX_TREE_PRELOAD_SIZE (RADIX_TREE_MAX_PATH * 2 - 1):
why RADIX_TREE_MAX_PATH * 2 - 1? I fail to understand the comment above it.

>    * Hence:
>    */
> -#define RADIX_TREE_PRELOAD_SIZE (RADIX_TREE_MAX_PATH * 2 - 1)
> +#define RADIX_TREE_PRELOAD_MIN (RADIX_TREE_MAX_PATH * 2 - 1)
> +#define RADIX_TREE_PRELOAD_MAX \
> +	(RADIX_TREE_PRELOAD_MIN + \
> +	 DIV_ROUND_UP(RADIX_TREE_PRELOAD_NR - 1, RADIX_TREE_MAP_SIZE))
>   
>   /*
>    * Per-cpu pool of preloaded nodes
>    */
>   struct radix_tree_preload {
>   	int nr;
> -	struct radix_tree_node *nodes[RADIX_TREE_PRELOAD_SIZE];
> +	struct radix_tree_node *nodes[RADIX_TREE_PRELOAD_MAX];
>   };
>   static DEFINE_PER_CPU(struct radix_tree_preload, radix_tree_preloads) = { 0, };
>   
> @@ -257,29 +265,34 @@ radix_tree_node_free(struct radix_tree_node *node)
>   
>   /*
>    * Load up this CPU's radix_tree_node buffer with sufficient objects to
> - * ensure that the addition of a single element in the tree cannot fail.  On
> - * success, return zero, with preemption disabled.  On error, return -ENOMEM
> + * ensure that the addition of *contiguous* elements in the tree cannot fail.
> + * On success, return zero, with preemption disabled.  On error, return -ENOMEM
>    * with preemption not disabled.
>    *
>    * To make use of this facility, the radix tree must be initialised without
>    * __GFP_WAIT being passed to INIT_RADIX_TREE().
>    */
> -int radix_tree_preload(gfp_t gfp_mask)
> +int radix_tree_preload_count(unsigned size, gfp_t gfp_mask)
>   {
>   	struct radix_tree_preload *rtp;
>   	struct radix_tree_node *node;
>   	int ret = -ENOMEM;
> +	int alloc = RADIX_TREE_PRELOAD_MIN +
> +		DIV_ROUND_UP(size - 1, RADIX_TREE_MAP_SIZE);
> +
> +	if (size > RADIX_TREE_PRELOAD_NR)
> +		return -ENOMEM;
>   
>   	preempt_disable();
>   	rtp = &__get_cpu_var(radix_tree_preloads);
> -	while (rtp->nr < ARRAY_SIZE(rtp->nodes)) {
> +	while (rtp->nr < alloc) {
>   		preempt_enable();
>   		node = kmem_cache_alloc(radix_tree_node_cachep, gfp_mask);
>   		if (node == NULL)
>   			goto out;
>   		preempt_disable();
>   		rtp = &__get_cpu_var(radix_tree_preloads);
> -		if (rtp->nr < ARRAY_SIZE(rtp->nodes))
> +		if (rtp->nr < alloc)
>   			rtp->nodes[rtp->nr++] = node;
>   		else
>   			kmem_cache_free(radix_tree_node_cachep, node);
> @@ -288,6 +301,11 @@ int radix_tree_preload(gfp_t gfp_mask)
>   out:
>   	return ret;
>   }
> +
> +int radix_tree_preload(gfp_t gfp_mask)
> +{
> +	return radix_tree_preload_count(1, gfp_mask);
> +}
>   EXPORT_SYMBOL(radix_tree_preload);
>   
>   /*

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 243+ messages in thread

* Re: [PATCHv2, RFC 05/30] thp, mm: avoid PageUnevictable on active/inactive lru lists
  2013-03-22 10:11       ` Kirill A. Shutemov
@ 2013-04-05  3:42         ` Ric Mason
  -1 siblings, 0 replies; 243+ messages in thread
From: Ric Mason @ 2013-04-05  3:42 UTC (permalink / raw)
  To: Kirill A. Shutemov
  Cc: Dave Hansen, Andrea Arcangeli, Andrew Morton, Al Viro,
	Hugh Dickins, Wu Fengguang, Jan Kara, Mel Gorman, linux-mm,
	Andi Kleen, Matthew Wilcox, Kirill A. Shutemov, Hillf Danton,
	linux-fsdevel, linux-kernel

Hi Kirill,
On 03/22/2013 06:11 PM, Kirill A. Shutemov wrote:
> Dave Hansen wrote:
>> On 03/14/2013 10:50 AM, Kirill A. Shutemov wrote:
>>> active/inactive lru lists can contain unevictable pages (i.e. ramfs pages
>>> that have been placed on the LRU lists when first allocated), but these
>>> pages must not have PageUnevictable set - otherwise shrink_active_list
>>> goes crazy:
>> ...
>>> For lru_add_page_tail(), it means we should not set PageUnevictable()
>>> for tail pages unless we're sure that it will go to LRU_UNEVICTABLE.
>>> The tail page will go LRU_UNEVICTABLE if head page is not on LRU or if
>>> it's marked PageUnevictable() too.
>> This is only an issue once you're using lru_add_page_tail() for
>> non-anonymous pages, right?
> I'm not sure about that. Documentation/vm/unevictable-lru.txt:
>
> Some examples of these unevictable pages on the LRU lists are:
>
>   (1) ramfs pages that have been placed on the LRU lists when first allocated.
>
>   (2) SHM_LOCK'd shared memory pages.  shmctl(SHM_LOCK) does not attempt to
>       allocate or fault in the pages in the shared memory region.  This happens
>       when an application accesses the page the first time after SHM_LOCK'ing
>       the segment.
>
>   (3) mlocked pages that could not be isolated from the LRU and moved to the
>       unevictable list in mlock_vma_page().
>
>   (4) Pages mapped into multiple VM_LOCKED VMAs, but try_to_munlock() couldn't
>       acquire the VMA's mmap semaphore to test the flags and set PageMlocked.
>       munlock_vma_page() was forced to let the page back on to the normal LRU
>       list for vmscan to handle.
>
>>> diff --git a/mm/swap.c b/mm/swap.c
>>> index 92a9be5..31584d0 100644
>>> --- a/mm/swap.c
>>> +++ b/mm/swap.c
>>> @@ -762,7 +762,8 @@ void lru_add_page_tail(struct page *page, struct page *page_tail,
>>>   			lru = LRU_INACTIVE_ANON;
>>>   		}
>>>   	} else {
>>> -		SetPageUnevictable(page_tail);
>>> +		if (!PageLRU(page) || PageUnevictable(page))
>>> +			SetPageUnevictable(page_tail);
>>>   		lru = LRU_UNEVICTABLE;
>>>   	}
>> You were saying above that ramfs pages can get on the normal
>> active/inactive lists.  But, this will end up getting them on the
>> unevictable list, right?  So, we have normal ramfs pages on the
>> active/inactive lists, but ramfs pages after a huge-page-split on the
>> unevictable list.  That seems a bit inconsistent.
> Yeah, it's confusing.
>
> I was able to trigger another bug on this code:
> if page_evictable(page_tail) is false and PageLRU(page) is true, page_tail
> will go to the same lru as page, but nobody cares to sync page_tail
> active/inactive state with page. So we can end up with inactive page on
> active lru...
>
> I've updated the patch for the next iteration. You can check it in git.
> It should be cleaner. The description needs to be updated.

Hope you can send it out soon. ;-)

>


^ permalink raw reply	[flat|nested] 243+ messages in thread

* Re: [PATCHv2, RFC 05/30] thp, mm: avoid PageUnevictable on active/inactive lru lists
@ 2013-04-05  3:42         ` Ric Mason
  0 siblings, 0 replies; 243+ messages in thread
From: Ric Mason @ 2013-04-05  3:42 UTC (permalink / raw)
  To: Kirill A. Shutemov
  Cc: Dave Hansen, Andrea Arcangeli, Andrew Morton, Al Viro,
	Hugh Dickins, Wu Fengguang, Jan Kara, Mel Gorman, linux-mm,
	Andi Kleen, Matthew Wilcox, Kirill A. Shutemov, Hillf Danton,
	linux-fsdevel, linux-kernel

Hi Kirill,
On 03/22/2013 06:11 PM, Kirill A. Shutemov wrote:
> Dave Hansen wrote:
>> On 03/14/2013 10:50 AM, Kirill A. Shutemov wrote:
>>> active/inactive lru lists can contain unevictable pages (i.e. ramfs pages
>>> that have been placed on the LRU lists when first allocated), but these
>>> pages must not have PageUnevictable set - otherwise shrink_active_list
>>> goes crazy:
>> ...
>>> For lru_add_page_tail(), it means we should not set PageUnevictable()
>>> for tail pages unless we're sure that it will go to LRU_UNEVICTABLE.
>>> The tail page will go LRU_UNEVICTABLE if head page is not on LRU or if
>>> it's marked PageUnevictable() too.
>> This is only an issue once you're using lru_add_page_tail() for
>> non-anonymous pages, right?
> I'm not sure about that. Documentation/vm/unevictable-lru.txt:
>
> Some examples of these unevictable pages on the LRU lists are:
>
>   (1) ramfs pages that have been placed on the LRU lists when first allocated.
>
>   (2) SHM_LOCK'd shared memory pages.  shmctl(SHM_LOCK) does not attempt to
>       allocate or fault in the pages in the shared memory region.  This happens
>       when an application accesses the page the first time after SHM_LOCK'ing
>       the segment.
>
>   (3) mlocked pages that could not be isolated from the LRU and moved to the
>       unevictable list in mlock_vma_page().
>
>   (4) Pages mapped into multiple VM_LOCKED VMAs, but try_to_munlock() couldn't
>       acquire the VMA's mmap semaphore to test the flags and set PageMlocked.
>       munlock_vma_page() was forced to let the page back on to the normal LRU
>       list for vmscan to handle.
>
>>> diff --git a/mm/swap.c b/mm/swap.c
>>> index 92a9be5..31584d0 100644
>>> --- a/mm/swap.c
>>> +++ b/mm/swap.c
>>> @@ -762,7 +762,8 @@ void lru_add_page_tail(struct page *page, struct page *page_tail,
>>>   			lru = LRU_INACTIVE_ANON;
>>>   		}
>>>   	} else {
>>> -		SetPageUnevictable(page_tail);
>>> +		if (!PageLRU(page) || PageUnevictable(page))
>>> +			SetPageUnevictable(page_tail);
>>>   		lru = LRU_UNEVICTABLE;
>>>   	}
>> You were saying above that ramfs pages can get on the normal
>> active/inactive lists.  But, this will end up getting them on the
>> unevictable list, right?  So, we have normal ramfs pages on the
>> active/inactive lists, but ramfs pages after a huge-page-split on the
>> unevictable list.  That seems a bit inconsistent.
> Yeah, it's confusing.
>
> I was able to trigger another bug on this code:
> if page_evictable(page_tail) is false and PageLRU(page) is true, page_tail
> will go to the same lru as page, but nobody cares to sync page_tail
> active/inactive state with page. So we can end up with inactive page on
> active lru...
>
> I've updated the patch for the next iteration. You can check it in git.
> It should be cleaner. The description needs to be updated.

Hope you can send it out soon. ;-)

>

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 243+ messages in thread

* Re: [PATCHv2, RFC 07/30] thp, mm: introduce mapping_can_have_hugepages() predicate
  2013-03-14 17:50   ` Kirill A. Shutemov
@ 2013-04-05  3:45     ` Ric Mason
  -1 siblings, 0 replies; 243+ messages in thread
From: Ric Mason @ 2013-04-05  3:45 UTC (permalink / raw)
  To: Kirill A. Shutemov
  Cc: Andrea Arcangeli, Andrew Morton, Al Viro, Hugh Dickins,
	Wu Fengguang, Jan Kara, Mel Gorman, linux-mm, Andi Kleen,
	Matthew Wilcox, Kirill A. Shutemov, Hillf Danton, linux-fsdevel,
	linux-kernel

Hi Kirill,
On 03/15/2013 01:50 AM, Kirill A. Shutemov wrote:
> From: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
>
> Returns true if mapping can have huge pages. Just check for __GFP_COMP
> in gfp mask of the mapping for now.
>
> Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
> ---
>   include/linux/pagemap.h |   10 ++++++++++
>   1 file changed, 10 insertions(+)
>
> diff --git a/include/linux/pagemap.h b/include/linux/pagemap.h
> index e3dea75..3521b0d 100644
> --- a/include/linux/pagemap.h
> +++ b/include/linux/pagemap.h
> @@ -84,6 +84,16 @@ static inline void mapping_set_gfp_mask(struct address_space *m, gfp_t mask)
>   				(__force unsigned long)mask;
>   }
>   
> +static inline bool mapping_can_have_hugepages(struct address_space *m)
> +{
> +	if (IS_ENABLED(CONFIG_TRANSPARENT_HUGEPAGE)) {
> +		gfp_t gfp_mask = mapping_gfp_mask(m);
> +		return !!(gfp_mask & __GFP_COMP);

I always see !! in the kernel, but why check directly instead of having the !! prefix?
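
A tiny illustration of what the double negation does (illustrative only,
with gfp_mask and __GFP_COMP as in the hunk above):

	int raw  =   (gfp_mask & __GFP_COMP);	/* zero or the raw bit value */
	int norm = !!(gfp_mask & __GFP_COMP);	/* always exactly 0 or 1     */

Since the function returns bool, the conversion to 0/1 would happen
implicitly anyway, so I guess the !! mainly documents the intent?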

> +	}
> +
> +	return false;
> +}
> +
>   /*
>    * The page cache can done in larger chunks than
>    * one page, because it allows for more efficient


^ permalink raw reply	[flat|nested] 243+ messages in thread

* Re: [PATCHv2, RFC 07/30] thp, mm: introduce mapping_can_have_hugepages() predicate
@ 2013-04-05  3:45     ` Ric Mason
  0 siblings, 0 replies; 243+ messages in thread
From: Ric Mason @ 2013-04-05  3:45 UTC (permalink / raw)
  To: Kirill A. Shutemov
  Cc: Andrea Arcangeli, Andrew Morton, Al Viro, Hugh Dickins,
	Wu Fengguang, Jan Kara, Mel Gorman, linux-mm, Andi Kleen,
	Matthew Wilcox, Kirill A. Shutemov, Hillf Danton, linux-fsdevel,
	linux-kernel

Hi Kirill,
On 03/15/2013 01:50 AM, Kirill A. Shutemov wrote:
> From: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
>
> Returns true if mapping can have huge pages. Just check for __GFP_COMP
> in gfp mask of the mapping for now.
>
> Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
> ---
>   include/linux/pagemap.h |   10 ++++++++++
>   1 file changed, 10 insertions(+)
>
> diff --git a/include/linux/pagemap.h b/include/linux/pagemap.h
> index e3dea75..3521b0d 100644
> --- a/include/linux/pagemap.h
> +++ b/include/linux/pagemap.h
> @@ -84,6 +84,16 @@ static inline void mapping_set_gfp_mask(struct address_space *m, gfp_t mask)
>   				(__force unsigned long)mask;
>   }
>   
> +static inline bool mapping_can_have_hugepages(struct address_space *m)
> +{
> +	if (IS_ENABLED(CONFIG_TRANSPARENT_HUGEPAGE)) {
> +		gfp_t gfp_mask = mapping_gfp_mask(m);
> +		return !!(gfp_mask & __GFP_COMP);

I always see !! in the kernel, but why check directly instead of having the !! prefix?

> +	}
> +
> +	return false;
> +}
> +
>   /*
>    * The page cache can done in larger chunks than
>    * one page, because it allows for more efficient

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 243+ messages in thread

* Re: [PATCHv2, RFC 07/30] thp, mm: introduce mapping_can_have_hugepages() predicate
  2013-04-05  3:45     ` Ric Mason
@ 2013-04-05  3:48       ` Ric Mason
  -1 siblings, 0 replies; 243+ messages in thread
From: Ric Mason @ 2013-04-05  3:48 UTC (permalink / raw)
  To: Kirill A. Shutemov
  Cc: Andrea Arcangeli, Andrew Morton, Al Viro, Hugh Dickins,
	Wu Fengguang, Jan Kara, Mel Gorman, linux-mm, Andi Kleen,
	Matthew Wilcox, Kirill A. Shutemov, Hillf Danton, linux-fsdevel,
	linux-kernel

On 04/05/2013 11:45 AM, Ric Mason wrote:
> Hi Kirill,
> On 03/15/2013 01:50 AM, Kirill A. Shutemov wrote:
>> From: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
>>
>> Returns true if mapping can have huge pages. Just check for __GFP_COMP
>> in gfp mask of the mapping for now.
>>
>> Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
>> ---
>>   include/linux/pagemap.h |   10 ++++++++++
>>   1 file changed, 10 insertions(+)
>>
>> diff --git a/include/linux/pagemap.h b/include/linux/pagemap.h
>> index e3dea75..3521b0d 100644
>> --- a/include/linux/pagemap.h
>> +++ b/include/linux/pagemap.h
>> @@ -84,6 +84,16 @@ static inline void mapping_set_gfp_mask(struct 
>> address_space *m, gfp_t mask)
>>                   (__force unsigned long)mask;
>>   }
>>   +static inline bool mapping_can_have_hugepages(struct address_space 
>> *m)
>> +{
>> +    if (IS_ENABLED(CONFIG_TRANSPARENT_HUGEPAGE)) {
>> +        gfp_t gfp_mask = mapping_gfp_mask(m);
>> +        return !!(gfp_mask & __GFP_COMP);
>
> I always see !! in the kernel, but why check directly instead of having
> the !! prefix?

s/why/why not

>
>> +    }
>> +
>> +    return false;
>> +}
>> +
>>   /*
>>    * The page cache can done in larger chunks than
>>    * one page, because it allows for more efficient
>


^ permalink raw reply	[flat|nested] 243+ messages in thread

* Re: [PATCHv2, RFC 07/30] thp, mm: introduce mapping_can_have_hugepages() predicate
@ 2013-04-05  3:48       ` Ric Mason
  0 siblings, 0 replies; 243+ messages in thread
From: Ric Mason @ 2013-04-05  3:48 UTC (permalink / raw)
  To: Kirill A. Shutemov
  Cc: Andrea Arcangeli, Andrew Morton, Al Viro, Hugh Dickins,
	Wu Fengguang, Jan Kara, Mel Gorman, linux-mm, Andi Kleen,
	Matthew Wilcox, Kirill A. Shutemov, Hillf Danton, linux-fsdevel,
	linux-kernel

On 04/05/2013 11:45 AM, Ric Mason wrote:
> Hi Kirill,
> On 03/15/2013 01:50 AM, Kirill A. Shutemov wrote:
>> From: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
>>
>> Returns true if mapping can have huge pages. Just check for __GFP_COMP
>> in gfp mask of the mapping for now.
>>
>> Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
>> ---
>>   include/linux/pagemap.h |   10 ++++++++++
>>   1 file changed, 10 insertions(+)
>>
>> diff --git a/include/linux/pagemap.h b/include/linux/pagemap.h
>> index e3dea75..3521b0d 100644
>> --- a/include/linux/pagemap.h
>> +++ b/include/linux/pagemap.h
>> @@ -84,6 +84,16 @@ static inline void mapping_set_gfp_mask(struct 
>> address_space *m, gfp_t mask)
>>                   (__force unsigned long)mask;
>>   }
>>   +static inline bool mapping_can_have_hugepages(struct address_space 
>> *m)
>> +{
>> +    if (IS_ENABLED(CONFIG_TRANSPARENT_HUGEPAGE)) {
>> +        gfp_t gfp_mask = mapping_gfp_mask(m);
>> +        return !!(gfp_mask & __GFP_COMP);
>
> I always see !! in the kernel, but why check directly instead of having
> the !! prefix?

s/why/why not

>
>> +    }
>> +
>> +    return false;
>> +}
>> +
>>   /*
>>    * The page cache can done in larger chunks than
>>    * one page, because it allows for more efficient
>

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 243+ messages in thread

* Re: [PATCHv2, RFC 11/30] thp, mm: handle tail pages in page_cache_get_speculative()
  2013-03-14 17:50   ` Kirill A. Shutemov
@ 2013-04-05  4:03     ` Ric Mason
  -1 siblings, 0 replies; 243+ messages in thread
From: Ric Mason @ 2013-04-05  4:03 UTC (permalink / raw)
  To: Kirill A. Shutemov
  Cc: Andrea Arcangeli, Andrew Morton, Al Viro, Hugh Dickins,
	Wu Fengguang, Jan Kara, Mel Gorman, linux-mm, Andi Kleen,
	Matthew Wilcox, Kirill A. Shutemov, Hillf Danton, linux-fsdevel,
	linux-kernel

Hi Kirill,
On 03/15/2013 01:50 AM, Kirill A. Shutemov wrote:
> From: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
>
> For tail page we call __get_page_tail(). It has the same semantics, but
> for tail page.
>
> Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
> ---
>   include/linux/pagemap.h |    4 +++-
>   1 file changed, 3 insertions(+), 1 deletion(-)
>
> diff --git a/include/linux/pagemap.h b/include/linux/pagemap.h
> index 3521b0d..408c4e3 100644
> --- a/include/linux/pagemap.h
> +++ b/include/linux/pagemap.h
> @@ -159,6 +159,9 @@ static inline int page_cache_get_speculative(struct page *page)

What's the difference between page_cache_get_speculative() and
page_cache_get()? (my understanding is summarised after the quoted hunk below)

>   {
>   	VM_BUG_ON(in_interrupt());
>   
> +	if (unlikely(PageTail(page)))
> +		return __get_page_tail(page);
> +
>   #ifdef CONFIG_TINY_RCU
>   # ifdef CONFIG_PREEMPT_COUNT
>   	VM_BUG_ON(!in_atomic());
> @@ -185,7 +188,6 @@ static inline int page_cache_get_speculative(struct page *page)
>   		return 0;
>   	}
>   #endif
> -	VM_BUG_ON(PageTail(page));
>   
>   	return 1;
>   }
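
For context, my current understanding of the difference (please correct
me if it is wrong): page_cache_get() is simply

	#define page_cache_get(page)		get_page(page)

in pagemap.h, i.e. a plain reference bump for a page the caller already
knows is pinned (the count is guaranteed non-zero).  By contrast,
page_cache_get_speculative() is meant for lockless/RCU page cache
lookups, where the page may be freed and reused underneath us: it can
fail (return 0) via get_page_unless_zero(), and on success the caller
still has to re-check that the page is the one it found in the mapping.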


^ permalink raw reply	[flat|nested] 243+ messages in thread

* Re: [PATCHv2, RFC 11/30] thp, mm: handle tail pages in page_cache_get_speculative()
@ 2013-04-05  4:03     ` Ric Mason
  0 siblings, 0 replies; 243+ messages in thread
From: Ric Mason @ 2013-04-05  4:03 UTC (permalink / raw)
  To: Kirill A. Shutemov
  Cc: Andrea Arcangeli, Andrew Morton, Al Viro, Hugh Dickins,
	Wu Fengguang, Jan Kara, Mel Gorman, linux-mm, Andi Kleen,
	Matthew Wilcox, Kirill A. Shutemov, Hillf Danton, linux-fsdevel,
	linux-kernel

Hi Kirill,
On 03/15/2013 01:50 AM, Kirill A. Shutemov wrote:
> From: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
>
> For tail page we call __get_page_tail(). It has the same semantics, but
> for tail page.
>
> Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
> ---
>   include/linux/pagemap.h |    4 +++-
>   1 file changed, 3 insertions(+), 1 deletion(-)
>
> diff --git a/include/linux/pagemap.h b/include/linux/pagemap.h
> index 3521b0d..408c4e3 100644
> --- a/include/linux/pagemap.h
> +++ b/include/linux/pagemap.h
> @@ -159,6 +159,9 @@ static inline int page_cache_get_speculative(struct page *page)

What's the difference between page_cache_get_speculative() and page_cache_get()?

>   {
>   	VM_BUG_ON(in_interrupt());
>   
> +	if (unlikely(PageTail(page)))
> +		return __get_page_tail(page);
> +
>   #ifdef CONFIG_TINY_RCU
>   # ifdef CONFIG_PREEMPT_COUNT
>   	VM_BUG_ON(!in_atomic());
> @@ -185,7 +188,6 @@ static inline int page_cache_get_speculative(struct page *page)
>   		return 0;
>   	}
>   #endif
> -	VM_BUG_ON(PageTail(page));
>   
>   	return 1;
>   }

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 243+ messages in thread

* Re: [PATCHv2, RFC 12/30] thp, mm: add event counters for huge page alloc on write to a file
  2013-03-26  8:40       ` Kirill A. Shutemov
@ 2013-04-05  4:05         ` Ric Mason
  -1 siblings, 0 replies; 243+ messages in thread
From: Ric Mason @ 2013-04-05  4:05 UTC (permalink / raw)
  To: Kirill A. Shutemov
  Cc: Dave Hansen, Andrea Arcangeli, Andrew Morton, Al Viro,
	Hugh Dickins, Wu Fengguang, Jan Kara, Mel Gorman, linux-mm,
	Andi Kleen, Matthew Wilcox, Kirill A. Shutemov, Hillf Danton,
	linux-fsdevel, linux-kernel

Hi Kirill,
On 03/26/2013 04:40 PM, Kirill A. Shutemov wrote:
> Dave Hansen wrote:
>> On 03/14/2013 10:50 AM, Kirill A. Shutemov wrote:
>>> --- a/include/linux/vm_event_item.h
>>> +++ b/include/linux/vm_event_item.h
>>> @@ -71,6 +71,8 @@ enum vm_event_item { PGPGIN, PGPGOUT, PSWPIN, PSWPOUT,
>>>   		THP_FAULT_FALLBACK,
>>>   		THP_COLLAPSE_ALLOC,
>>>   		THP_COLLAPSE_ALLOC_FAILED,
>>> +		THP_WRITE_ALLOC,
>>> +		THP_WRITE_FAILED,
>>>   		THP_SPLIT,
>>>   		THP_ZERO_PAGE_ALLOC,
>>>   		THP_ZERO_PAGE_ALLOC_FAILED,
>> I think these names are a bit terse.  It's certainly not _writes_ that
>> are failing and "THP_WRITE_FAILED" makes it sound that way.
> Right. s/THP_WRITE_FAILED/THP_WRITE_ALLOC_FAILED/
>
>> Also, why do we need to differentiate these from the existing anon-hugepage
>> vm stats?  The alloc_pages() call seems to be doing the exact same thing in
>> the end.  Is one more likely to succeed than the other?
> Existing stats specify source of thp page: fault or collapse. When we
> allocate a new huge page with write(2) it's nither fault nor collapse. I
> think it's reasonable to introduce new type of event for that.

Why is allocating a new huge page with write(2) not a write fault?

>
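For context, vm_event counters such as the two added in the hunk above are bumped with count_vm_event() at the point where the huge page is (or fails to be) allocated. A minimal illustrative sketch follows; the surrounding function is hypothetical and not taken from the patch, and the event names are those from the hunk (the reply above renames THP_WRITE_FAILED to THP_WRITE_ALLOC_FAILED):

/*
 * Hypothetical helper, for illustration only: shows where the new events
 * from the hunk above would typically be accounted.
 */
static struct page *alloc_huge_page_for_write(gfp_t gfp)
{
        struct page *page;

        page = alloc_pages(gfp | __GFP_COMP, HPAGE_PMD_ORDER);
        if (page)
                count_vm_event(THP_WRITE_ALLOC);
        else
                count_vm_event(THP_WRITE_FAILED);
        return page;
}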


^ permalink raw reply	[flat|nested] 243+ messages in thread

* Re: [PATCHv2, RFC 20/30] ramfs: enable transparent huge page cache
  2013-04-03  1:11         ` Minchan Kim
@ 2013-04-05  6:47           ` Simon Jeons
  -1 siblings, 0 replies; 243+ messages in thread
From: Simon Jeons @ 2013-04-05  6:47 UTC (permalink / raw)
  To: Minchan Kim
  Cc: Hugh Dickins, Kirill A. Shutemov, Andrea Arcangeli,
	Andrew Morton, Al Viro, Wu Fengguang, Jan Kara, Mel Gorman,
	linux-mm, Andi Kleen, Matthew Wilcox, Kirill A. Shutemov,
	Hillf Danton, Ying Han, Christoph Lameter, linux-fsdevel,
	linux-kernel

Hi Minchan,
On 04/03/2013 09:11 AM, Minchan Kim wrote:
> On Tue, Apr 02, 2013 at 03:15:23PM -0700, Hugh Dickins wrote:
>> On Tue, 2 Apr 2013, Kirill A. Shutemov wrote:
>>> Kirill A. Shutemov wrote:
>>>> From: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
>>>>
>>>> ramfs is the most simple fs from page cache point of view. Let's start
>>>> transparent huge page cache enabling here.
>>>>
>>>> For now we allocate only non-movable huge page. It's not yet clear if
>>>> movable page is safe here and what need to be done to make it safe.
>>>>
>>>> Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
>>>> ---
>>>>   fs/ramfs/inode.c |    6 +++++-
>>>>   1 file changed, 5 insertions(+), 1 deletion(-)
>>>>
>>>> diff --git a/fs/ramfs/inode.c b/fs/ramfs/inode.c
>>>> index c24f1e1..da30b4f 100644
>>>> --- a/fs/ramfs/inode.c
>>>> +++ b/fs/ramfs/inode.c
>>>> @@ -61,7 +61,11 @@ struct inode *ramfs_get_inode(struct super_block *sb,
>>>>   		inode_init_owner(inode, dir, mode);
>>>>   		inode->i_mapping->a_ops = &ramfs_aops;
>>>>   		inode->i_mapping->backing_dev_info = &ramfs_backing_dev_info;
>>>> -		mapping_set_gfp_mask(inode->i_mapping, GFP_HIGHUSER);
>>>> +		/*
>>>> +		 * TODO: what should be done to make movable safe?
>>>> +		 */
>>>> +		mapping_set_gfp_mask(inode->i_mapping,
>>>> +				GFP_TRANSHUGE & ~__GFP_MOVABLE);
>>> Hugh, I've found old thread with the reason why we have GFP_HIGHUSER here, not
>>> GFP_HIGHUSER_MOVABLE:
>>>
>>> http://lkml.org/lkml/2006/11/27/156
>>>
>>> It seems the origin reason is not longer valid, correct?
>> Incorrect, I believe: so far as I know, the original reason remains
>> valid - though it would only require a couple of good small changes
>> to reverse that - or perhaps you have already made these changes?
>>
>> The original reason is that ramfs pages are not migratable,
>> therefore they should be allocated from an unmovable area.
>>
>> As I understand it (and I would have preferred to run a test to check
>> my understanding before replying, but don't have time for that), ramfs
>> pages cannot be migrated for two reasons, neither of them a good reason.
>>
>> One reason (okay, it wouldn't have been quite this way in 2006) is that
>> ramfs (rightly) calls mapping_set_unevictable(), so its pages will fail
>> the page_evictable() test, so they will be marked PageUnevictable, so
>> __isolate_lru_page() will refuse to isolate them for migration (except
>> for CMA).
> True.
>
>> I am strongly in favour of removing that limitation from
>> __isolate_lru_page() (and the thread you pointed - thank you - shows Mel
>> and Christoph were both in favour too); and note that there is no such
>> restriction in the confusingly similar but different isolate_lru_page().
>>
>> Some people do worry that migrating Mlocked pages would introduce the
>> occasional possibility of a minor fault (with migration_entry_wait())
>> on an Mlocked region which never faulted before.  I tend to dismiss
>> that worry, but maybe I'm wrong to do so: maybe there should be a
>> tunable for realtimey people to set, to prohibit page migration from
>> mlocked areas; but the default should be to allow it.
> I agree.
> Just FYI for mlocked page migration
>
> I tried migratioin of mlocked page and Johannes and Mel had a concern
> about that.
> http://lkml.indiana.edu/hypermail/linux/kernel/1109.0/00175.html
>
> But later, Peter already acked it and I guess by reading the thread that
> Hugh was in favour when page migration was merged first time.
>
> http://marc.info/?l=linux-mm&m=133697873414205&w=2
> http://marc.info/?l=linux-mm&m=133700341823358&w=2
>
> Many people said mlock means memory-resident, NOT pinning so it could
> allow minor fault while Mel still had a concern except CMA.
> http://marc.info/?l=linux-mm&m=133674219714419&w=2

How about adding a knob?

>> (Of course, we could separate ramfs's mapping_unevictable case from
>> the Mlocked case; but I'd prefer to continue to treat them the same.)
> Fair enough.
>
>> The other reason it looks as if ramfs pages cannot be migrated, is
>> that it does not set a suitable ->migratepage method, so would be
>> handled by fallback_migrate_page(), whose PageDirty test will end
>> up failing the migration with -EBUSY or -EINVAL - if I read it
>> correctly.
> True.
>
>> Perhaps other such reasons would surface once those are fixed.
>> But until ramfs pages can be migrated, they should not be allocated
>> with __GFP_MOVABLE.  (I've been writing about the migratability of
>> small pages: I expect you have the migratability of THPages in flux.)
> Agreed.
>
>> Hugh
>>
>> --
>> To unsubscribe, send a message with 'unsubscribe linux-mm' in
>> the body to majordomo@kvack.org.  For more info on Linux MM,
>> see: http://www.linux-mm.org/ .
>> Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>
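For readers following Hugh's first reason above, the check in __isolate_lru_page() looks roughly like the simplified sketch below; only callers that pass ISOLATE_UNEVICTABLE (as CMA does) get past it for unevictable pages. Hugh's second reason would be addressed by giving ramfs a suitable a_ops->migratepage instead of letting it fall back to fallback_migrate_page().

/*
 * Simplified sketch of the relevant check in __isolate_lru_page(); not the
 * verbatim kernel code.  ramfs pages are PageUnevictable, so ordinary
 * compaction/migration callers skip them here.
 */
int __isolate_lru_page(struct page *page, isolate_mode_t mode)
{
        if (!PageLRU(page))
                return -EINVAL;

        if (PageUnevictable(page) && !(mode & ISOLATE_UNEVICTABLE))
                return -EINVAL;

        /* ... further checks and the actual isolation ... */
        return 0;
}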


^ permalink raw reply	[flat|nested] 243+ messages in thread

* Re: [PATCHv2, RFC 20/30] ramfs: enable transparent huge page cache
  2013-04-05  6:47           ` Simon Jeons
@ 2013-04-05  8:01             ` Minchan Kim
  -1 siblings, 0 replies; 243+ messages in thread
From: Minchan Kim @ 2013-04-05  8:01 UTC (permalink / raw)
  To: Simon Jeons
  Cc: Hugh Dickins, Kirill A. Shutemov, Andrea Arcangeli,
	Andrew Morton, Al Viro, Wu Fengguang, Jan Kara, Mel Gorman,
	linux-mm, Andi Kleen, Matthew Wilcox, Kirill A. Shutemov,
	Hillf Danton, Ying Han, Christoph Lameter, linux-fsdevel,
	linux-kernel

On Fri, Apr 05, 2013 at 02:47:25PM +0800, Simon Jeons wrote:
> Hi Minchan,
> On 04/03/2013 09:11 AM, Minchan Kim wrote:
> >On Tue, Apr 02, 2013 at 03:15:23PM -0700, Hugh Dickins wrote:
> >>On Tue, 2 Apr 2013, Kirill A. Shutemov wrote:
> >>>Kirill A. Shutemov wrote:
> >>>>From: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
> >>>>
> >>>>ramfs is the most simple fs from page cache point of view. Let's start
> >>>>transparent huge page cache enabling here.
> >>>>
> >>>>For now we allocate only non-movable huge page. It's not yet clear if
> >>>>movable page is safe here and what need to be done to make it safe.
> >>>>
> >>>>Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
> >>>>---
> >>>>  fs/ramfs/inode.c |    6 +++++-
> >>>>  1 file changed, 5 insertions(+), 1 deletion(-)
> >>>>
> >>>>diff --git a/fs/ramfs/inode.c b/fs/ramfs/inode.c
> >>>>index c24f1e1..da30b4f 100644
> >>>>--- a/fs/ramfs/inode.c
> >>>>+++ b/fs/ramfs/inode.c
> >>>>@@ -61,7 +61,11 @@ struct inode *ramfs_get_inode(struct super_block *sb,
> >>>>  		inode_init_owner(inode, dir, mode);
> >>>>  		inode->i_mapping->a_ops = &ramfs_aops;
> >>>>  		inode->i_mapping->backing_dev_info = &ramfs_backing_dev_info;
> >>>>-		mapping_set_gfp_mask(inode->i_mapping, GFP_HIGHUSER);
> >>>>+		/*
> >>>>+		 * TODO: what should be done to make movable safe?
> >>>>+		 */
> >>>>+		mapping_set_gfp_mask(inode->i_mapping,
> >>>>+				GFP_TRANSHUGE & ~__GFP_MOVABLE);
> >>>Hugh, I've found old thread with the reason why we have GFP_HIGHUSER here, not
> >>>GFP_HIGHUSER_MOVABLE:
> >>>
> >>>http://lkml.org/lkml/2006/11/27/156
> >>>
> >>>It seems the origin reason is not longer valid, correct?
> >>Incorrect, I believe: so far as I know, the original reason remains
> >>valid - though it would only require a couple of good small changes
> >>to reverse that - or perhaps you have already made these changes?
> >>
> >>The original reason is that ramfs pages are not migratable,
> >>therefore they should be allocated from an unmovable area.
> >>
> >>As I understand it (and I would have preferred to run a test to check
> >>my understanding before replying, but don't have time for that), ramfs
> >>pages cannot be migrated for two reasons, neither of them a good reason.
> >>
> >>One reason (okay, it wouldn't have been quite this way in 2006) is that
> >>ramfs (rightly) calls mapping_set_unevictable(), so its pages will fail
> >>the page_evictable() test, so they will be marked PageUnevictable, so
> >>__isolate_lru_page() will refuse to isolate them for migration (except
> >>for CMA).
> >True.
> >
> >>I am strongly in favour of removing that limitation from
> >>__isolate_lru_page() (and the thread you pointed - thank you - shows Mel
> >>and Christoph were both in favour too); and note that there is no such
> >>restriction in the confusingly similar but different isolate_lru_page().
> >>
> >>Some people do worry that migrating Mlocked pages would introduce the
> >>occasional possibility of a minor fault (with migration_entry_wait())
> >>on an Mlocked region which never faulted before.  I tend to dismiss
> >>that worry, but maybe I'm wrong to do so: maybe there should be a
> >>tunable for realtimey people to set, to prohibit page migration from
> >>mlocked areas; but the default should be to allow it.
> >I agree.
> >Just FYI for mlocked page migration
> >
> >I tried migratioin of mlocked page and Johannes and Mel had a concern
> >about that.
> >http://lkml.indiana.edu/hypermail/linux/kernel/1109.0/00175.html
> >
> >But later, Peter already acked it and I guess by reading the thread that
> >Hugh was in favour when page migration was merged first time.
> >
> >http://marc.info/?l=linux-mm&m=133697873414205&w=2
> >http://marc.info/?l=linux-mm&m=133700341823358&w=2
> >
> >Many people said mlock means memory-resident, NOT pinning so it could
> >allow minor fault while Mel still had a concern except CMA.
> >http://marc.info/?l=linux-mm&m=133674219714419&w=2
> 
> How about add a knob?

Maybe, volunteering?

-- 
Kind regards,
Minchan Kim

^ permalink raw reply	[flat|nested] 243+ messages in thread

* Re: [PATCHv2, RFC 20/30] ramfs: enable transparent huge page cache
  2013-04-05  8:01             ` Minchan Kim
  (?)
  (?)
@ 2013-04-05  8:22             ` Wanpeng Li
  -1 siblings, 0 replies; 243+ messages in thread
From: Wanpeng Li @ 2013-04-05  8:22 UTC (permalink / raw)
  To: Minchan Kim
  Cc: Simon Jeons, Hugh Dickins, Kirill A. Shutemov, Andrea Arcangeli,
	Andrew Morton, Al Viro, Wu Fengguang, Jan Kara, Mel Gorman,
	linux-mm, Andi Kleen, Matthew Wilcox, Kirill A. Shutemov,
	Hillf Danton, Ying Han, Christoph Lameter, linux-fsdevel,
	linux-kernel

On Fri, Apr 05, 2013 at 05:01:06PM +0900, Minchan Kim wrote:
>On Fri, Apr 05, 2013 at 02:47:25PM +0800, Simon Jeons wrote:
>> Hi Minchan,
>> On 04/03/2013 09:11 AM, Minchan Kim wrote:
>> >On Tue, Apr 02, 2013 at 03:15:23PM -0700, Hugh Dickins wrote:
>> >>On Tue, 2 Apr 2013, Kirill A. Shutemov wrote:
>> >>>Kirill A. Shutemov wrote:
>> >>>>From: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
>> >>>>
>> >>>>ramfs is the most simple fs from page cache point of view. Let's start
>> >>>>transparent huge page cache enabling here.
>> >>>>
>> >>>>For now we allocate only non-movable huge page. It's not yet clear if
>> >>>>movable page is safe here and what need to be done to make it safe.
>> >>>>
>> >>>>Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
>> >>>>---
>> >>>>  fs/ramfs/inode.c |    6 +++++-
>> >>>>  1 file changed, 5 insertions(+), 1 deletion(-)
>> >>>>
>> >>>>diff --git a/fs/ramfs/inode.c b/fs/ramfs/inode.c
>> >>>>index c24f1e1..da30b4f 100644
>> >>>>--- a/fs/ramfs/inode.c
>> >>>>+++ b/fs/ramfs/inode.c
>> >>>>@@ -61,7 +61,11 @@ struct inode *ramfs_get_inode(struct super_block *sb,
>> >>>>  		inode_init_owner(inode, dir, mode);
>> >>>>  		inode->i_mapping->a_ops = &ramfs_aops;
>> >>>>  		inode->i_mapping->backing_dev_info = &ramfs_backing_dev_info;
>> >>>>-		mapping_set_gfp_mask(inode->i_mapping, GFP_HIGHUSER);
>> >>>>+		/*
>> >>>>+		 * TODO: what should be done to make movable safe?
>> >>>>+		 */
>> >>>>+		mapping_set_gfp_mask(inode->i_mapping,
>> >>>>+				GFP_TRANSHUGE & ~__GFP_MOVABLE);
>> >>>Hugh, I've found old thread with the reason why we have GFP_HIGHUSER here, not
>> >>>GFP_HIGHUSER_MOVABLE:
>> >>>
>> >>>http://lkml.org/lkml/2006/11/27/156
>> >>>
>> >>>It seems the origin reason is not longer valid, correct?
>> >>Incorrect, I believe: so far as I know, the original reason remains
>> >>valid - though it would only require a couple of good small changes
>> >>to reverse that - or perhaps you have already made these changes?
>> >>
>> >>The original reason is that ramfs pages are not migratable,
>> >>therefore they should be allocated from an unmovable area.
>> >>
>> >>As I understand it (and I would have preferred to run a test to check
>> >>my understanding before replying, but don't have time for that), ramfs
>> >>pages cannot be migrated for two reasons, neither of them a good reason.
>> >>
>> >>One reason (okay, it wouldn't have been quite this way in 2006) is that
>> >>ramfs (rightly) calls mapping_set_unevictable(), so its pages will fail
>> >>the page_evictable() test, so they will be marked PageUnevictable, so
>> >>__isolate_lru_page() will refuse to isolate them for migration (except
>> >>for CMA).
>> >True.
>> >
>> >>I am strongly in favour of removing that limitation from
>> >>__isolate_lru_page() (and the thread you pointed - thank you - shows Mel
>> >>and Christoph were both in favour too); and note that there is no such
>> >>restriction in the confusingly similar but different isolate_lru_page().
>> >>
>> >>Some people do worry that migrating Mlocked pages would introduce the
>> >>occasional possibility of a minor fault (with migration_entry_wait())
>> >>on an Mlocked region which never faulted before.  I tend to dismiss
>> >>that worry, but maybe I'm wrong to do so: maybe there should be a
>> >>tunable for realtimey people to set, to prohibit page migration from
>> >>mlocked areas; but the default should be to allow it.
>> >I agree.
>> >Just FYI for mlocked page migration
>> >
>> >I tried migratioin of mlocked page and Johannes and Mel had a concern
>> >about that.
>> >http://lkml.indiana.edu/hypermail/linux/kernel/1109.0/00175.html
>> >
>> >But later, Peter already acked it and I guess by reading the thread that
>> >Hugh was in favour when page migration was merged first time.
>> >
>> >http://marc.info/?l=linux-mm&m=133697873414205&w=2
>> >http://marc.info/?l=linux-mm&m=133700341823358&w=2
>> >
>> >Many people said mlock means memory-resident, NOT pinning so it could
>> >allow minor fault while Mel still had a concern except CMA.
>> >http://marc.info/?l=linux-mm&m=133674219714419&w=2
>> 
>> How about add a knob?
>
>Maybe, volunteering?

Hi Minchan,

I can be the volunteer; what I care about is whether adding a knob makes sense.

Regards,
Wanpeng Li 

>
>-- 
>Kind regards,
>Minchan Kim
>
>--
>To unsubscribe, send a message with 'unsubscribe linux-mm' in
>the body to majordomo@kvack.org.  For more info on Linux MM,
>see: http://www.linux-mm.org/ .
>Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>


^ permalink raw reply	[flat|nested] 243+ messages in thread

* Re: [PATCHv2, RFC 20/30] ramfs: enable transparent huge page cache
       [not found]             ` <515e89d2.e725320a.3a74.7fe7SMTPIN_ADDED_BROKEN@mx.google.com>
@ 2013-04-05  8:31                 ` Minchan Kim
  0 siblings, 0 replies; 243+ messages in thread
From: Minchan Kim @ 2013-04-05  8:31 UTC (permalink / raw)
  To: Wanpeng Li
  Cc: Simon Jeons, Hugh Dickins, Kirill A. Shutemov, Andrea Arcangeli,
	Andrew Morton, Al Viro, Wu Fengguang, Jan Kara, Mel Gorman,
	linux-mm, Andi Kleen, Matthew Wilcox, Kirill A. Shutemov,
	Hillf Danton, Ying Han, Christoph Lameter, linux-fsdevel,
	linux-kernel

Hi Wanpeng,

On Fri, Apr 05, 2013 at 04:22:17PM +0800, Wanpeng Li wrote:
> On Fri, Apr 05, 2013 at 05:01:06PM +0900, Minchan Kim wrote:
> >On Fri, Apr 05, 2013 at 02:47:25PM +0800, Simon Jeons wrote:
> >> Hi Minchan,
> >> On 04/03/2013 09:11 AM, Minchan Kim wrote:
> >> >On Tue, Apr 02, 2013 at 03:15:23PM -0700, Hugh Dickins wrote:
> >> >>On Tue, 2 Apr 2013, Kirill A. Shutemov wrote:
> >> >>>Kirill A. Shutemov wrote:
> >> >>>>From: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
> >> >>>>
> >> >>>>ramfs is the most simple fs from page cache point of view. Let's start
> >> >>>>transparent huge page cache enabling here.
> >> >>>>
> >> >>>>For now we allocate only non-movable huge page. It's not yet clear if
> >> >>>>movable page is safe here and what need to be done to make it safe.
> >> >>>>
> >> >>>>Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
> >> >>>>---
> >> >>>>  fs/ramfs/inode.c |    6 +++++-
> >> >>>>  1 file changed, 5 insertions(+), 1 deletion(-)
> >> >>>>
> >> >>>>diff --git a/fs/ramfs/inode.c b/fs/ramfs/inode.c
> >> >>>>index c24f1e1..da30b4f 100644
> >> >>>>--- a/fs/ramfs/inode.c
> >> >>>>+++ b/fs/ramfs/inode.c
> >> >>>>@@ -61,7 +61,11 @@ struct inode *ramfs_get_inode(struct super_block *sb,
> >> >>>>  		inode_init_owner(inode, dir, mode);
> >> >>>>  		inode->i_mapping->a_ops = &ramfs_aops;
> >> >>>>  		inode->i_mapping->backing_dev_info = &ramfs_backing_dev_info;
> >> >>>>-		mapping_set_gfp_mask(inode->i_mapping, GFP_HIGHUSER);
> >> >>>>+		/*
> >> >>>>+		 * TODO: what should be done to make movable safe?
> >> >>>>+		 */
> >> >>>>+		mapping_set_gfp_mask(inode->i_mapping,
> >> >>>>+				GFP_TRANSHUGE & ~__GFP_MOVABLE);
> >> >>>Hugh, I've found old thread with the reason why we have GFP_HIGHUSER here, not
> >> >>>GFP_HIGHUSER_MOVABLE:
> >> >>>
> >> >>>http://lkml.org/lkml/2006/11/27/156
> >> >>>
> >> >>>It seems the origin reason is not longer valid, correct?
> >> >>Incorrect, I believe: so far as I know, the original reason remains
> >> >>valid - though it would only require a couple of good small changes
> >> >>to reverse that - or perhaps you have already made these changes?
> >> >>
> >> >>The original reason is that ramfs pages are not migratable,
> >> >>therefore they should be allocated from an unmovable area.
> >> >>
> >> >>As I understand it (and I would have preferred to run a test to check
> >> >>my understanding before replying, but don't have time for that), ramfs
> >> >>pages cannot be migrated for two reasons, neither of them a good reason.
> >> >>
> >> >>One reason (okay, it wouldn't have been quite this way in 2006) is that
> >> >>ramfs (rightly) calls mapping_set_unevictable(), so its pages will fail
> >> >>the page_evictable() test, so they will be marked PageUnevictable, so
> >> >>__isolate_lru_page() will refuse to isolate them for migration (except
> >> >>for CMA).
> >> >True.
> >> >
> >> >>I am strongly in favour of removing that limitation from
> >> >>__isolate_lru_page() (and the thread you pointed - thank you - shows Mel
> >> >>and Christoph were both in favour too); and note that there is no such
> >> >>restriction in the confusingly similar but different isolate_lru_page().
> >> >>
> >> >>Some people do worry that migrating Mlocked pages would introduce the
> >> >>occasional possibility of a minor fault (with migration_entry_wait())
> >> >>on an Mlocked region which never faulted before.  I tend to dismiss
> >> >>that worry, but maybe I'm wrong to do so: maybe there should be a
> >> >>tunable for realtimey people to set, to prohibit page migration from
> >> >>mlocked areas; but the default should be to allow it.
> >> >I agree.
> >> >Just FYI for mlocked page migration
> >> >
> >> >I tried migratioin of mlocked page and Johannes and Mel had a concern
> >> >about that.
> >> >http://lkml.indiana.edu/hypermail/linux/kernel/1109.0/00175.html
> >> >
> >> >But later, Peter already acked it and I guess by reading the thread that
> >> >Hugh was in favour when page migration was merged first time.
> >> >
> >> >http://marc.info/?l=linux-mm&m=133697873414205&w=2
> >> >http://marc.info/?l=linux-mm&m=133700341823358&w=2
> >> >
> >> >Many people said mlock means memory-resident, NOT pinning so it could
> >> >allow minor fault while Mel still had a concern except CMA.
> >> >http://marc.info/?l=linux-mm&m=133674219714419&w=2
> >> 
> >> How about add a knob?
> >
> >Maybe, volunteering?
> 
> Hi Minchan,
> 
> I can be the volunteer, what I care is if add a knob make sense?

Frankly speaking, I'd like to avoid a new knob, but there might be
some workloads that suffer from mlocked page migration, so we couldn't
dismiss it. In that case, introducing the knob, enabled by default,
would be a solution. If we don't get any reports for a long time,
we can remove the knob someday, IMHO.
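If it does come to a knob, one minimal shape would be a vm sysctl that is enabled by default, so migration of mlocked pages stays allowed unless someone explicitly turns it off. The sketch below is purely illustrative; the name, placement and wiring are invented here, not taken from any patch:

/* Purely illustrative: the knob's name and placement are invented. */
static int sysctl_migrate_mlocked_pages __read_mostly = 1;
static int zero;
static int one = 1;

static struct ctl_table migrate_mlocked_table[] = {
        {
                .procname       = "migrate_mlocked_pages",
                .data           = &sysctl_migrate_mlocked_pages,
                .maxlen         = sizeof(int),
                .mode           = 0644,
                .proc_handler   = proc_dointvec_minmax,
                .extra1         = &zero,
                .extra2         = &one,
        },
        { }
};

/*
 * Registered under /proc/sys/vm/ (e.g. via register_sysctl_table()) and
 * checked by the isolation path before it agrees to migrate an mlocked page.
 */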

Thanks.

-- 
Kind regards,
Minchan Kim

^ permalink raw reply	[flat|nested] 243+ messages in thread

* Re: [PATCHv2, RFC 20/30] ramfs: enable transparent huge page cache
  2013-04-05  8:31                 ` Minchan Kim
  (?)
@ 2013-04-05  8:35                 ` Wanpeng Li
  -1 siblings, 0 replies; 243+ messages in thread
From: Wanpeng Li @ 2013-04-05  8:35 UTC (permalink / raw)
  To: Minchan Kim
  Cc: Simon Jeons, Hugh Dickins, Kirill A. Shutemov, Andrea Arcangeli,
	Andrew Morton, Al Viro, Wu Fengguang, Jan Kara, Mel Gorman,
	linux-mm, Andi Kleen, Matthew Wilcox, Kirill A. Shutemov,
	Hillf Danton, Ying Han, Christoph Lameter, linux-fsdevel,
	linux-kernel

On Fri, Apr 05, 2013 at 05:31:12PM +0900, Minchan Kim wrote:
>Hi Wanpeng,
>
>On Fri, Apr 05, 2013 at 04:22:17PM +0800, Wanpeng Li wrote:
>> On Fri, Apr 05, 2013 at 05:01:06PM +0900, Minchan Kim wrote:
>> >On Fri, Apr 05, 2013 at 02:47:25PM +0800, Simon Jeons wrote:
>> >> Hi Minchan,
>> >> On 04/03/2013 09:11 AM, Minchan Kim wrote:
>> >> >On Tue, Apr 02, 2013 at 03:15:23PM -0700, Hugh Dickins wrote:
>> >> >>On Tue, 2 Apr 2013, Kirill A. Shutemov wrote:
>> >> >>>Kirill A. Shutemov wrote:
>> >> >>>>From: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
>> >> >>>>
>> >> >>>>ramfs is the most simple fs from page cache point of view. Let's start
>> >> >>>>transparent huge page cache enabling here.
>> >> >>>>
>> >> >>>>For now we allocate only non-movable huge page. It's not yet clear if
>> >> >>>>movable page is safe here and what need to be done to make it safe.
>> >> >>>>
>> >> >>>>Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
>> >> >>>>---
>> >> >>>>  fs/ramfs/inode.c |    6 +++++-
>> >> >>>>  1 file changed, 5 insertions(+), 1 deletion(-)
>> >> >>>>
>> >> >>>>diff --git a/fs/ramfs/inode.c b/fs/ramfs/inode.c
>> >> >>>>index c24f1e1..da30b4f 100644
>> >> >>>>--- a/fs/ramfs/inode.c
>> >> >>>>+++ b/fs/ramfs/inode.c
>> >> >>>>@@ -61,7 +61,11 @@ struct inode *ramfs_get_inode(struct super_block *sb,
>> >> >>>>  		inode_init_owner(inode, dir, mode);
>> >> >>>>  		inode->i_mapping->a_ops = &ramfs_aops;
>> >> >>>>  		inode->i_mapping->backing_dev_info = &ramfs_backing_dev_info;
>> >> >>>>-		mapping_set_gfp_mask(inode->i_mapping, GFP_HIGHUSER);
>> >> >>>>+		/*
>> >> >>>>+		 * TODO: what should be done to make movable safe?
>> >> >>>>+		 */
>> >> >>>>+		mapping_set_gfp_mask(inode->i_mapping,
>> >> >>>>+				GFP_TRANSHUGE & ~__GFP_MOVABLE);
>> >> >>>Hugh, I've found old thread with the reason why we have GFP_HIGHUSER here, not
>> >> >>>GFP_HIGHUSER_MOVABLE:
>> >> >>>
>> >> >>>http://lkml.org/lkml/2006/11/27/156
>> >> >>>
>> >> >>>It seems the origin reason is not longer valid, correct?
>> >> >>Incorrect, I believe: so far as I know, the original reason remains
>> >> >>valid - though it would only require a couple of good small changes
>> >> >>to reverse that - or perhaps you have already made these changes?
>> >> >>
>> >> >>The original reason is that ramfs pages are not migratable,
>> >> >>therefore they should be allocated from an unmovable area.
>> >> >>
>> >> >>As I understand it (and I would have preferred to run a test to check
>> >> >>my understanding before replying, but don't have time for that), ramfs
>> >> >>pages cannot be migrated for two reasons, neither of them a good reason.
>> >> >>
>> >> >>One reason (okay, it wouldn't have been quite this way in 2006) is that
>> >> >>ramfs (rightly) calls mapping_set_unevictable(), so its pages will fail
>> >> >>the page_evictable() test, so they will be marked PageUnevictable, so
>> >> >>__isolate_lru_page() will refuse to isolate them for migration (except
>> >> >>for CMA).
>> >> >True.
>> >> >
>> >> >>I am strongly in favour of removing that limitation from
>> >> >>__isolate_lru_page() (and the thread you pointed - thank you - shows Mel
>> >> >>and Christoph were both in favour too); and note that there is no such
>> >> >>restriction in the confusingly similar but different isolate_lru_page().
>> >> >>
>> >> >>Some people do worry that migrating Mlocked pages would introduce the
>> >> >>occasional possibility of a minor fault (with migration_entry_wait())
>> >> >>on an Mlocked region which never faulted before.  I tend to dismiss
>> >> >>that worry, but maybe I'm wrong to do so: maybe there should be a
>> >> >>tunable for realtimey people to set, to prohibit page migration from
>> >> >>mlocked areas; but the default should be to allow it.
>> >> >I agree.
>> >> >Just FYI for mlocked page migration
>> >> >
>> >> >I tried migratioin of mlocked page and Johannes and Mel had a concern
>> >> >about that.
>> >> >http://lkml.indiana.edu/hypermail/linux/kernel/1109.0/00175.html
>> >> >
>> >> >But later, Peter already acked it and I guess by reading the thread that
>> >> >Hugh was in favour when page migration was merged first time.
>> >> >
>> >> >http://marc.info/?l=linux-mm&m=133697873414205&w=2
>> >> >http://marc.info/?l=linux-mm&m=133700341823358&w=2
>> >> >
>> >> >Many people said mlock means memory-resident, NOT pinning so it could
>> >> >allow minor fault while Mel still had a concern except CMA.
>> >> >http://marc.info/?l=linux-mm&m=133674219714419&w=2
>> >> 
>> >> How about add a knob?
>> >
>> >Maybe, volunteering?
>> 
>> Hi Minchan,
>> 
>> I can be the volunteer, what I care is if add a knob make sense?
>
>Frankly sepaking, I'd like to avoid new knob but there might be
>some workloads suffered from mlocked page migration so we coudn't
>dismiss it. In such case, introducing the knob would be a solution
>with default enabling. If we don't have any report for a long time,
>we can remove the knob someday, IMHO.

Ok, I will start coding next week. ;-)

Regards,
Wanpeng Li 

>
>Thanks.
>
>-- 
>Kind regards,
>Minchan Kim


^ permalink raw reply	[flat|nested] 243+ messages in thread

* Re: [PATCHv2, RFC 20/30] ramfs: enable transparent huge page cache
  2013-04-05  8:31                 ` Minchan Kim
@ 2013-04-05 13:46                   ` Christoph Lameter
  -1 siblings, 0 replies; 243+ messages in thread
From: Christoph Lameter @ 2013-04-05 13:46 UTC (permalink / raw)
  To: Minchan Kim
  Cc: Wanpeng Li, Simon Jeons, Hugh Dickins, Kirill A. Shutemov,
	Andrea Arcangeli, Andrew Morton, Al Viro, Wu Fengguang, Jan Kara,
	Mel Gorman, linux-mm, Andi Kleen, Matthew Wilcox,
	Kirill A. Shutemov, Hillf Danton, Ying Han, linux-fsdevel,
	linux-kernel

On Fri, 5 Apr 2013, Minchan Kim wrote:

> > >> How about adding a knob?
> > >
> > >Maybe, volunteering?
> >
> > Hi Minchan,
> >
> > I can be the volunteer; what I care about is whether adding a knob makes sense.
>
> Frankly speaking, I'd like to avoid a new knob, but there might be
> some workloads that suffer from mlocked page migration, so we couldn't
> dismiss it. In that case, introducing the knob, enabled by default,
> would be a solution. If we don't get any reports for a long time,
> we can remove the knob someday, IMHO.

No knob, please. What we need is a new implementation of page pinning that
avoids the mlock crap.

1. It should be available for device drivers to pin their memory (they
currently elevate the ref counter, which means page migration has to check
whether it can account for all references before giving up, and it gives up
quite frequently; see the sketch after this list). So there needs to be an
in-kernel API, a syscall API, as well as a command-line one. Preferably as
similar to each other as possible.

2. A sane API for marking pages as mlocked. Maybe part of mmap()? I hate the
command-line tools and the APIs for doing that right now.

3. The reservation scheme for mlock via ulimit is broken. We seem to have
per-process constraints only. If you start enough processes you can
still make the kernel go OOM.

4. mlock semantics are prescribed by POSIX, which states that the page
stays in memory. I think we should stick with that narrow definition for
mlock.

5. Pinning could also mean that page faults on the page are to be avoided:
COW could occur at fork and page table entries could be instantiated at
mmap/fork time. Pinning could mean that minor/major faults will not occur
on a page.
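
To illustrate point 1: when migration tries to move a page, it only
proceeds if every remaining reference can be accounted for, so a driver
holding an extra get_page() reference makes it bail out. A simplified
sketch, roughly paraphrasing the ~3.9-era migrate_page_move_mapping() in
mm/migrate.c (not the exact code):

	static int migrate_page_move_mapping(struct address_space *mapping,
			struct page *newpage, struct page *page)
	{
		int expected_count;

		spin_lock_irq(&mapping->tree_lock);

		/* References migration itself can account for: the radix
		 * tree slot, the isolation reference, and any buffers. */
		expected_count = 2 + page_has_private(page);
		if (page_count(page) != expected_count) {
			/* Extra references, e.g. a driver pin: give up. */
			spin_unlock_irq(&mapping->tree_lock);
			return -EAGAIN;
		}

		/* ... replace the radix tree slot, transfer state ... */
		return MIGRATEPAGE_SUCCESS;
	}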


^ permalink raw reply	[flat|nested] 243+ messages in thread


end of thread, other threads:[~2013-04-05 13:52 UTC | newest]

Thread overview: 243+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2013-03-14 17:50 [PATCHv2, RFC 00/30] Transparent huge page cache Kirill A. Shutemov
2013-03-14 17:50 ` Kirill A. Shutemov
2013-03-14 17:50 ` [PATCHv2, RFC 01/30] block: implement add_bdi_stat() Kirill A. Shutemov
2013-03-14 17:50   ` Kirill A. Shutemov
2013-03-21 14:46   ` Dave Hansen
2013-03-21 14:46     ` Dave Hansen
2013-03-21 17:19     ` Kirill A. Shutemov
2013-03-21 17:19       ` Kirill A. Shutemov
2013-03-14 17:50 ` [PATCHv2, RFC 02/30] mm: implement zero_huge_user_segment and friends Kirill A. Shutemov
2013-03-14 17:50   ` Kirill A. Shutemov
2013-03-21 15:23   ` Dave Hansen
2013-03-21 15:23     ` Dave Hansen
2013-03-22  9:21     ` Kirill A. Shutemov
2013-03-22  9:21       ` Kirill A. Shutemov
2013-03-14 17:50 ` [PATCHv2, RFC 03/30] mm: drop actor argument of do_generic_file_read() Kirill A. Shutemov
2013-03-14 17:50   ` Kirill A. Shutemov
2013-03-15  0:21   ` Hillf Danton
2013-03-15  0:21     ` Hillf Danton
2013-03-15  0:27   ` Hillf Danton
2013-03-15  0:27     ` Hillf Danton
2013-03-15 13:22     ` Kirill A. Shutemov
2013-03-15 13:22       ` Kirill A. Shutemov
2013-03-21 15:26       ` Dave Hansen
2013-03-21 15:26         ` Dave Hansen
2013-03-14 17:50 ` [PATCHv2, RFC 04/30] radix-tree: implement preload for multiple contiguous elements Kirill A. Shutemov
2013-03-14 17:50   ` Kirill A. Shutemov
2013-03-21 15:56   ` Dave Hansen
2013-03-21 15:56     ` Dave Hansen
2013-03-22  9:47     ` Kirill A. Shutemov
2013-03-22  9:47       ` Kirill A. Shutemov
2013-03-22 14:38       ` Dave Hansen
2013-03-22 14:38         ` Dave Hansen
2013-03-25 13:03         ` Kirill A. Shutemov
2013-03-25 13:03           ` Kirill A. Shutemov
2013-04-05  3:37   ` Ric Mason
2013-04-05  3:37     ` Ric Mason
2013-03-14 17:50 ` [PATCHv2, RFC 05/30] thp, mm: avoid PageUnevictable on active/inactive lru lists Kirill A. Shutemov
2013-03-14 17:50   ` Kirill A. Shutemov
2013-03-21 16:15   ` Dave Hansen
2013-03-21 16:15     ` Dave Hansen
2013-03-22 10:11     ` Kirill A. Shutemov
2013-03-22 10:11       ` Kirill A. Shutemov
2013-04-05  3:42       ` Ric Mason
2013-04-05  3:42         ` Ric Mason
2013-03-14 17:50 ` [PATCHv2, RFC 06/30] thp, mm: basic defines for transparent huge page cache Kirill A. Shutemov
2013-03-14 17:50   ` Kirill A. Shutemov
2013-03-14 17:50 ` [PATCHv2, RFC 07/30] thp, mm: introduce mapping_can_have_hugepages() predicate Kirill A. Shutemov
2013-03-14 17:50   ` Kirill A. Shutemov
2013-03-21 16:21   ` Dave Hansen
2013-03-21 16:21     ` Dave Hansen
2013-03-22 10:12     ` Kirill A. Shutemov
2013-03-22 10:12       ` Kirill A. Shutemov
2013-03-22 14:44       ` Dave Hansen
2013-03-22 14:44         ` Dave Hansen
2013-04-02 14:46         ` Kirill A. Shutemov
2013-04-02 14:46           ` Kirill A. Shutemov
2013-04-05  3:45   ` Ric Mason
2013-04-05  3:45     ` Ric Mason
2013-04-05  3:48     ` Ric Mason
2013-04-05  3:48       ` Ric Mason
2013-03-14 17:50 ` [PATCHv2, RFC 08/30] thp, mm: rewrite add_to_page_cache_locked() to support huge pages Kirill A. Shutemov
2013-03-14 17:50   ` Kirill A. Shutemov
2013-03-15  1:30   ` Hillf Danton
2013-03-15  1:30     ` Hillf Danton
2013-03-15 13:23     ` Kirill A. Shutemov
2013-03-15 13:23       ` Kirill A. Shutemov
2013-03-15 13:25       ` Hillf Danton
2013-03-15 13:25         ` Hillf Danton
2013-03-15 13:50         ` Kirill A. Shutemov
2013-03-15 13:50           ` Kirill A. Shutemov
2013-03-15 13:55           ` Hillf Danton
2013-03-15 13:55             ` Hillf Danton
2013-03-15 15:05             ` Kirill A. Shutemov
2013-03-15 15:05               ` Kirill A. Shutemov
2013-03-21 17:11   ` Dave Hansen
2013-03-21 17:11     ` Dave Hansen
2013-03-22 10:34     ` Kirill A. Shutemov
2013-03-22 10:34       ` Kirill A. Shutemov
2013-03-22 14:51       ` Dave Hansen
2013-03-22 14:51         ` Dave Hansen
2013-03-14 17:50 ` [PATCHv2, RFC 09/30] thp, mm: rewrite delete_from_page_cache() " Kirill A. Shutemov
2013-03-14 17:50   ` Kirill A. Shutemov
2013-03-15  2:25   ` Hillf Danton
2013-03-15  2:25     ` Hillf Danton
2013-03-15 13:23     ` Kirill A. Shutemov
2013-03-15 13:23       ` Kirill A. Shutemov
2013-03-14 17:50 ` [PATCHv2, RFC 10/30] thp, mm: locking tail page is a bug Kirill A. Shutemov
2013-03-14 17:50   ` Kirill A. Shutemov
2013-03-21 17:20   ` Dave Hansen
2013-03-21 17:20     ` Dave Hansen
2013-03-14 17:50 ` [PATCHv2, RFC 11/30] thp, mm: handle tail pages in page_cache_get_speculative() Kirill A. Shutemov
2013-03-14 17:50   ` Kirill A. Shutemov
2013-04-05  4:03   ` Ric Mason
2013-04-05  4:03     ` Ric Mason
2013-03-14 17:50 ` [PATCHv2, RFC 12/30] thp, mm: add event counters for huge page alloc on write to a file Kirill A. Shutemov
2013-03-14 17:50   ` Kirill A. Shutemov
2013-03-21 17:59   ` Dave Hansen
2013-03-21 17:59     ` Dave Hansen
2013-03-26  8:40     ` Kirill A. Shutemov
2013-03-26  8:40       ` Kirill A. Shutemov
2013-04-05  4:05       ` Ric Mason
2013-04-05  4:05         ` Ric Mason
2013-03-14 17:50 ` [PATCHv2, RFC 13/30] thp, mm: implement grab_cache_huge_page_write_begin() Kirill A. Shutemov
2013-03-14 17:50   ` Kirill A. Shutemov
2013-03-15  2:34   ` Hillf Danton
2013-03-15  2:34     ` Hillf Danton
2013-03-15 13:24     ` Kirill A. Shutemov
2013-03-15 13:24       ` Kirill A. Shutemov
2013-03-15 13:30       ` Hillf Danton
2013-03-15 13:30         ` Hillf Danton
2013-03-15 13:35         ` Kirill A. Shutemov
2013-03-15 13:35           ` Kirill A. Shutemov
2013-03-15 13:37           ` Hillf Danton
2013-03-15 13:37             ` Hillf Danton
2013-03-21 18:15   ` Dave Hansen
2013-03-21 18:15     ` Dave Hansen
2013-03-26 10:48     ` Kirill A. Shutemov
2013-03-26 10:48       ` Kirill A. Shutemov
2013-03-26 15:40       ` Dave
2013-03-26 15:40         ` Dave
2013-03-21 18:16   ` Dave Hansen
2013-03-21 18:16     ` Dave Hansen
2013-03-14 17:50 ` [PATCHv2, RFC 14/30] thp, mm: naive support of thp in generic read/write routines Kirill A. Shutemov
2013-03-14 17:50   ` Kirill A. Shutemov
2013-03-15  3:11   ` Hillf Danton
2013-03-15  3:11     ` Hillf Danton
2013-03-15 13:27     ` Kirill A. Shutemov
2013-03-15 13:27       ` Kirill A. Shutemov
2013-03-22 15:22   ` Dave Hansen
2013-03-22 15:22     ` Dave Hansen
2013-03-28 12:25     ` Kirill A. Shutemov
2013-03-28 12:25       ` Kirill A. Shutemov
2013-03-14 17:50 ` [PATCHv2, RFC 15/30] thp, libfs: initial support of thp in simple_read/write_begin/write_end Kirill A. Shutemov
2013-03-14 17:50   ` Kirill A. Shutemov
2013-03-22 18:01   ` Dave
2013-03-22 18:01     ` Dave
2013-03-28 14:29     ` Kirill A. Shutemov
2013-03-28 14:29       ` Kirill A. Shutemov
2013-03-14 17:50 ` [PATCHv2, RFC 16/30] thp: handle file pages in split_huge_page() Kirill A. Shutemov
2013-03-14 17:50   ` Kirill A. Shutemov
2013-03-15  6:15   ` Hillf Danton
2013-03-15  6:15     ` Hillf Danton
2013-03-15 13:26     ` Kirill A. Shutemov
2013-03-15 13:26       ` Kirill A. Shutemov
2013-03-15 13:33       ` Hillf Danton
2013-03-15 13:33         ` Hillf Danton
2013-03-22 18:18   ` Dave
2013-03-22 18:18     ` Dave
2013-03-28 14:32     ` Kirill A. Shutemov
2013-03-28 14:32       ` Kirill A. Shutemov
2013-03-14 17:50 ` [PATCHv2, RFC 17/30] thp: wait_split_huge_page(): serialize over i_mmap_mutex too Kirill A. Shutemov
2013-03-14 17:50   ` Kirill A. Shutemov
2013-03-22 18:22   ` Dave
2013-03-22 18:22     ` Dave
2013-03-28 15:08     ` Kirill A. Shutemov
2013-03-28 15:08       ` Kirill A. Shutemov
2013-03-14 17:50 ` [PATCHv2, RFC 18/30] thp, mm: truncate support for transparent huge page cache Kirill A. Shutemov
2013-03-14 17:50   ` Kirill A. Shutemov
2013-03-22 18:29   ` Dave
2013-03-22 18:29     ` Dave
2013-03-28 15:31     ` Kirill A. Shutemov
2013-03-28 15:31       ` Kirill A. Shutemov
2013-03-14 17:50 ` [PATCHv2, RFC 19/30] thp, mm: split huge page on mmap file page Kirill A. Shutemov
2013-03-14 17:50   ` Kirill A. Shutemov
2013-03-15  6:58   ` Hillf Danton
2013-03-15  6:58     ` Hillf Danton
2013-03-15 13:29     ` Kirill A. Shutemov
2013-03-15 13:29       ` Kirill A. Shutemov
2013-03-15 13:35       ` Hillf Danton
2013-03-15 13:35         ` Hillf Danton
2013-03-15 13:45         ` Kirill A. Shutemov
2013-03-15 13:45           ` Kirill A. Shutemov
2013-03-14 17:50 ` [PATCHv2, RFC 20/30] ramfs: enable transparent huge page cache Kirill A. Shutemov
2013-03-14 17:50   ` Kirill A. Shutemov
2013-04-02 16:28   ` Kirill A. Shutemov
2013-04-02 16:28     ` Kirill A. Shutemov
2013-04-02 22:15     ` Hugh Dickins
2013-04-02 22:15       ` Hugh Dickins
2013-04-03  1:11       ` Minchan Kim
2013-04-03  1:11         ` Minchan Kim
2013-04-05  6:47         ` Simon Jeons
2013-04-05  6:47           ` Simon Jeons
2013-04-05  8:01           ` Minchan Kim
2013-04-05  8:01             ` Minchan Kim
2013-04-05  8:22             ` Wanpeng Li
2013-04-05  8:22             ` Wanpeng Li
     [not found]             ` <515e89d2.e725320a.3a74.7fe7SMTPIN_ADDED_BROKEN@mx.google.com>
2013-04-05  8:31               ` Minchan Kim
2013-04-05  8:31                 ` Minchan Kim
2013-04-05  8:35                 ` Wanpeng Li
2013-04-05  8:35                 ` Wanpeng Li
2013-04-05 13:46                 ` Christoph Lameter
2013-04-05 13:46                   ` Christoph Lameter
2013-04-03 13:53       ` Christoph Lameter
2013-04-03 13:53         ` Christoph Lameter
2013-03-14 17:50 ` [PATCHv2, RFC 21/30] x86-64, mm: proper alignment mappings with hugepages Kirill A. Shutemov
2013-03-14 17:50   ` Kirill A. Shutemov
2013-03-22 18:37   ` Dave
2013-03-22 18:37     ` Dave
2013-03-14 17:50 ` [PATCHv2, RFC 22/30] mm: add huge_fault() callback to vm_operations_struct Kirill A. Shutemov
2013-03-14 17:50   ` Kirill A. Shutemov
2013-03-14 17:50 ` [PATCHv2, RFC 23/30] thp: prepare zap_huge_pmd() to uncharge file pages Kirill A. Shutemov
2013-03-14 17:50   ` Kirill A. Shutemov
2013-03-15  7:09   ` Hillf Danton
2013-03-15  7:09     ` Hillf Danton
2013-03-15 13:30     ` Kirill A. Shutemov
2013-03-15 13:30       ` Kirill A. Shutemov
2013-03-14 17:50 ` [PATCHv2, RFC 24/30] thp: move maybe_pmd_mkwrite() out of mk_huge_pmd() Kirill A. Shutemov
2013-03-14 17:50   ` Kirill A. Shutemov
2013-03-15  7:31   ` Hillf Danton
2013-03-15  7:31     ` Hillf Danton
2013-03-14 17:50 ` [PATCHv2, RFC 25/30] thp, mm: basic huge_fault implementation for generic_file_vm_ops Kirill A. Shutemov
2013-03-14 17:50   ` Kirill A. Shutemov
2013-03-15  7:44   ` Hillf Danton
2013-03-15  7:44     ` Hillf Danton
2013-03-15 13:30     ` Kirill A. Shutemov
2013-03-15 13:30       ` Kirill A. Shutemov
2013-03-14 17:50 ` [PATCHv2, RFC 26/30] thp: extract fallback path from do_huge_pmd_anonymous_page() to a function Kirill A. Shutemov
2013-03-14 17:50   ` Kirill A. Shutemov
2013-03-14 17:50 ` [PATCHv2, RFC 27/30] thp: initial implementation of do_huge_linear_fault() Kirill A. Shutemov
2013-03-14 17:50   ` Kirill A. Shutemov
2013-03-14 17:50 ` [PATCHv2, RFC 28/30] thp: handle write-protect exception to file-backed huge pages Kirill A. Shutemov
2013-03-14 17:50   ` Kirill A. Shutemov
2013-03-14 17:50 ` [PATCHv2, RFC 29/30] thp: call __vma_adjust_trans_huge() for file-backed VMA Kirill A. Shutemov
2013-03-14 17:50   ` Kirill A. Shutemov
2013-03-14 17:50 ` [PATCHv2, RFC 30/30] thp: map file-backed huge pages on fault Kirill A. Shutemov
2013-03-14 17:50   ` Kirill A. Shutemov
2013-03-15  0:33 ` [PATCHv2, RFC 00/30] Transparent huge page cache Hillf Danton
2013-03-15  0:33   ` Hillf Danton
2013-03-15 13:33   ` Kirill A. Shutemov
2013-03-15 13:33     ` Kirill A. Shutemov
2013-03-18  4:03 ` Simon Jeons
2013-03-18  4:03   ` Simon Jeons
2013-03-18  5:23   ` Simon Jeons
2013-03-18 11:19     ` Kirill A. Shutemov
2013-03-18 11:19       ` Kirill A. Shutemov
2013-03-18 11:29       ` Simon Jeons
2013-03-18 11:29         ` Simon Jeons
2013-03-18 11:42         ` Kirill A. Shutemov
2013-03-18 11:42           ` Kirill A. Shutemov
2013-03-18 11:42           ` Ric Mason
2013-03-18 11:42             ` Ric Mason
2013-03-20  1:09 ` Simon Jeons
2013-03-20  1:09   ` Simon Jeons
