linux-kernel.vger.kernel.org archive mirror
* [PATCHv6 00/22] Transparent huge page cache: phase 1, everything but mmap()
@ 2013-09-23 12:05 Kirill A. Shutemov
  2013-09-23 12:05 ` [PATCHv6 01/22] mm: implement zero_huge_user_segment and friends Kirill A. Shutemov
                   ` (24 more replies)
  0 siblings, 25 replies; 48+ messages in thread
From: Kirill A. Shutemov @ 2013-09-23 12:05 UTC (permalink / raw)
  To: Andrea Arcangeli, Andrew Morton
  Cc: Al Viro, Hugh Dickins, Wu Fengguang, Jan Kara, Mel Gorman,
	linux-mm, Andi Kleen, Matthew Wilcox, Kirill A. Shutemov,
	Hillf Danton, Dave Hansen, Ning Qu, Alexander Shishkin,
	linux-fsdevel, linux-kernel, Kirill A. Shutemov

It brings thp support to ramfs, but without mmap() -- mmap() support will
be posted separately.

Please review and consider applying.

Intro
-----

The goal of the project is to prepare the kernel infrastructure for handling
huge pages in the page cache.

To prove that the proposed changes are functional, we enable the feature
for the simplest file system -- ramfs. ramfs is not that useful by itself,
but it's a good pilot project.

Design overview
---------------

Every huge page is represented in the page cache radix-tree by HPAGE_PMD_NR
(512 on x86-64) entries. All entries point to the head page -- refcounting
tail pages would be pretty expensive.
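
For illustration, a minimal sketch of how a lookup resolves to the right
subpage under this scheme (the helper is hypothetical, HPAGE_CACHE_INDEX_MASK
comes from a later patch in the series, and locking/RCU is left out):

 /* Illustrative only: every slot of a huge page holds the head page. */
 static struct page *find_subpage_sketch(struct address_space *mapping,
					 pgoff_t index)
 {
	struct page *head;

	/* caller must hold mapping->tree_lock or rcu_read_lock() */
	head = radix_tree_lookup(&mapping->page_tree, index);
	if (!head || !PageTransHuge(head))
		return head;
	/* assuming huge pages sit at naturally aligned indexes */
	return head + (index & HPAGE_CACHE_INDEX_MASK);
 }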

Radix tree manipulations are implemented in a batched way: we add and remove
a whole huge page at once, under one tree_lock. To make this possible, we
extended the radix-tree interface to pre-allocate enough memory to insert
a number of *contiguous* elements (kudos to Matthew Wilcox).
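
The intended caller pattern looks roughly like this (condensed from
add_to_page_cache_locked() later in the series; error unwinding omitted):

 error = radix_tree_maybe_preload_contig(HPAGE_CACHE_NR,
					 gfp_mask & ~__GFP_HIGHMEM);
 if (error)
	return error;

 spin_lock_irq(&mapping->tree_lock);
 for (i = 0; i < HPAGE_CACHE_NR; i++) {
	/* every slot gets a pointer to the head page */
	error = radix_tree_insert(&mapping->page_tree, offset + i, page);
	if (error)
		break;		/* the real patch unwinds the partial insert */
 }
 radix_tree_preload_end();
 spin_unlock_irq(&mapping->tree_lock);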

Huge pages can be added to the page cache in three ways:
 - write(2) to a file or page;
 - read(2) from a sparse file;
 - fault on a sparse file.

Potentially, one more way is collapsing small pages, but that's outside the
initial implementation.

For now we still write/read at most PAGE_CACHE_SIZE bytes at a time. There's
some room for speed-up later.

Since mmap() isn't targeted by this patchset, we just split the huge page on
page fault.

To minimize memory overhead for small files we avoid write-allocation in the
first huge page area (2M on x86-64) of the file.
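
A minimal sketch of that heuristic (the helper name is illustrative only, not
something the series adds):

 /*
  * Sketch: never write-allocate a huge page for offsets inside the first
  * huge page area of a file, so small files keep paying only the usual
  * small-page overhead.
  */
 static bool may_write_alloc_hugepage(pgoff_t index)
 {
	return index >= HPAGE_CACHE_NR;
 }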

truncate_inode_pages_range() drops the whole huge page at once if it's fully
inside the range. If a huge page is only partly in the range, we zero out that
part, exactly like we do for partial small pages.

split_huge_page() for file pages works similarly to anon pages, but we
walk mapping->i_mmap rather than anon_vma->rb_root. At the end we call
truncate_inode_pages() to drop small pages beyond i_size, if any.

inode->i_split_sem taken for read protects hugepages in the inode's page
cache against splitting. We take it for write during splitting.
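
Schematically, the protocol looks like this (a sketch assuming i_split_sem is
an rw_semaphore used with down_read()/down_write()):

 /* anyone relying on a huge page staying huge in the page cache: */
 down_read(&inode->i_split_sem);
 /* ... operate on huge pages in the page cache ... */
 up_read(&inode->i_split_sem);

 /* split_huge_page() on a file page: */
 down_write(&inode->i_split_sem);
 /* ... split the page and fix up the radix tree ... */
 up_write(&inode->i_split_sem);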

Changes since v5
----------------
 - change how a hugepage is stored in the pagecache: the head page is used
   for all relevant indexes;
 - introduce i_split_sem;
 - do not create huge pages on write(2) into the first hugepage area;
 - compile-disabled by default;
 - fix transparent_hugepage_pagecache();

Benchmarks
----------

Since the patchset doesn't include mmap() support, we shouldn't expect much
change in performance. We just need to check that we don't introduce any
major regression.

On average, read/write on ramfs with thp is a bit slower, but I don't think
it's a stopper -- ramfs is a toy anyway; on real-world filesystems I expect
the difference to be smaller.

postmark
========

workload1:
chmod +x postmark
mount -t ramfs none /mnt
cat >/root/workload1 <<EOF
set transactions 250000
set size 5120 524288
set number 500
run
quit
EOF

workload2:
set transactions 10000
set size 2097152 10485760
set number 100
run
quit

throughput (transactions/sec)
                workload1       workload2
baseline        8333            416
patched         8333            454

FS-Mark
=======

throughput (files/sec)

                2000 files by 1M        200 files by 10M
baseline        5326.1                  548.1
patched         5192.8                  528.4

tiobench
========

baseline:
Tiotest results for 16 concurrent io threads:
,----------------------------------------------------------------------.
| Item                  | Time     | Rate         | Usr CPU  | Sys CPU |
+-----------------------+----------+--------------+----------+---------+
| Write        2048 MBs |    0.2 s | 8667.792 MB/s | 445.2 %  | 5535.9 % |
| Random Write   62 MBs |    0.0 s | 8341.118 MB/s |   0.0 %  | 2615.8 % |
| Read         2048 MBs |    0.2 s | 11680.431 MB/s | 339.9 %  | 5470.6 % |
| Random Read    62 MBs |    0.0 s | 9451.081 MB/s | 786.3 %  | 1451.7 % |
`----------------------------------------------------------------------'
Tiotest latency results:
,-------------------------------------------------------------------------.
| Item         | Average latency | Maximum latency | % >2 sec | % >10 sec |
+--------------+-----------------+-----------------+----------+-----------+
| Write        |        0.006 ms |       28.019 ms |  0.00000 |   0.00000 |
| Random Write |        0.002 ms |        5.574 ms |  0.00000 |   0.00000 |
| Read         |        0.005 ms |       28.018 ms |  0.00000 |   0.00000 |
| Random Read  |        0.002 ms |        4.852 ms |  0.00000 |   0.00000 |
|--------------+-----------------+-----------------+----------+-----------|
| Total        |        0.005 ms |       28.019 ms |  0.00000 |   0.00000 |
`--------------+-----------------+-----------------+----------+-----------'

patched:
Tiotest results for 16 concurrent io threads:
,----------------------------------------------------------------------.
| Item                  | Time     | Rate         | Usr CPU  | Sys CPU |
+-----------------------+----------+--------------+----------+---------+
| Write        2048 MBs |    0.3 s | 7942.818 MB/s | 442.1 %  | 5533.6 % |
| Random Write   62 MBs |    0.0 s | 9425.426 MB/s | 723.9 %  | 965.2 % |
| Read         2048 MBs |    0.2 s | 11998.008 MB/s | 374.9 %  | 5485.8 % |
| Random Read    62 MBs |    0.0 s | 9823.955 MB/s | 251.5 %  | 2011.9 % |
`----------------------------------------------------------------------'
Tiotest latency results:
,-------------------------------------------------------------------------.
| Item         | Average latency | Maximum latency | % >2 sec | % >10 sec |
+--------------+-----------------+-----------------+----------+-----------+
| Write        |        0.007 ms |       28.020 ms |  0.00000 |   0.00000 |
| Random Write |        0.001 ms |        0.022 ms |  0.00000 |   0.00000 |
| Read         |        0.004 ms |       24.011 ms |  0.00000 |   0.00000 |
| Random Read  |        0.001 ms |        0.019 ms |  0.00000 |   0.00000 |
|--------------+-----------------+-----------------+----------+-----------|
| Total        |        0.005 ms |       28.020 ms |  0.00000 |   0.00000 |
`--------------+-----------------+-----------------+----------+-----------'

IOZone
======

Syscalls, not mmap.

** Initial writers **
threads:	          1          2          4          8         10         20         30         40         50         60         70         80
baseline:	    4741691    7986408    9149064    9898695    9868597    9629383    9469202   11605064    9507802   10641869   11360701   11040376
patched:	    4682864    7275535    8691034    8872887    8712492    8771912    8397216    7701346    7366853    8839736    8299893   10788439
speed-up(times):       0.99       0.91       0.95       0.90       0.88       0.91       0.89       0.66       0.77       0.83       0.73       0.98

** Rewriters **
threads:	          1          2          4          8         10         20         30         40         50         60         70         80
baseline:	    5807891    9554869   12101083   13113533   12989751   14359910   16998236   16833861   24735659   17502634   17396706   20448655
patched:	    6161690    9981294   12285789   13428846   13610058   13669153   20060182   17328347   24109999   19247934   24225103   34686574
speed-up(times):       1.06       1.04       1.02       1.02       1.05       0.95       1.18       1.03       0.97       1.10       1.39       1.70

** Readers **
threads:	          1          2          4          8         10         20         30         40         50         60         70         80
baseline:	    7978066   11825735   13808941   14049598   14765175   14422642   17322681   23209831   21386483   20060744   22032935   31166663
patched:	    7723293   11481500   13796383   14363808   14353966   14979865   17648225   18701258   29192810   23973723   22163317   23104638
speed-up(times):       0.97       0.97       1.00       1.02       0.97       1.04       1.02       0.81       1.37       1.20       1.01       0.74

** Re-readers **
threads:	          1          2          4          8         10         20         30         40         50         60         70         80
baseline:	    7966269   11878323   14000782   14678206   14154235   14271991   15170829   20924052   27393344   19114990   12509316   18495597
patched:	    7719350   11410937   13710233   13232756   14040928   15895021   16279330   17256068   26023572   18364678   27834483   23288680
speed-up(times):       0.97       0.96       0.98       0.90       0.99       1.11       1.07       0.82       0.95       0.96       2.23       1.26

** Reverse readers **
threads:	          1          2          4          8         10         20         30         40         50         60         70         80
baseline:	    6630795   10331013   12839501   13157433   12783323   13580283   15753068   15434572   21928982   17636994   14737489   19470679
patched:	    6502341    9887711   12639278   12979232   13212825   12928255   13961195   14695786   21370667   19873807   20902582   21892899
speed-up(times):       0.98       0.96       0.98       0.99       1.03       0.95       0.89       0.95       0.97       1.13       1.42       1.12

** Random_readers **
threads:	          1          2          4          8         10         20         30         40         50         60         70         80
baseline:	    5152935    9043813   11752615   11996078   12283579   12484039   14588004   15781507   23847538   15748906   13698335   27195847
patched:	    5009089    8438137   11266015   11631218   12093650   12779308   17768691   13640378   30468890   19269033   23444358   22775908
speed-up(times):       0.97       0.93       0.96       0.97       0.98       1.02       1.22       0.86       1.28       1.22       1.71       0.84

** Random_writers **
threads:	          1          2          4          8         10         20         30         40         50         60         70         80
baseline:	    3886268    7405345   10531192   10858984   10994693   12758450   10729531    9656825   10370144   13139452    4528331   12615812
patched:	    4335323    7916132   10978892   11423247   11790932   11424525   11798171   11413452   12230616   13075887   11165314   16925679
speed-up(times):       1.12       1.07       1.04       1.05       1.07       0.90       1.10       1.18       1.18       1.00       2.47       1.34

Kirill A. Shutemov (22):
  mm: implement zero_huge_user_segment and friends
  radix-tree: implement preload for multiple contiguous elements
  memcg, thp: charge huge cache pages
  thp: compile-time and sysfs knob for thp pagecache
  thp, mm: introduce mapping_can_have_hugepages() predicate
  thp: represent file thp pages in meminfo and friends
  thp, mm: rewrite add_to_page_cache_locked() to support huge pages
  mm: trace filemap: dump page order
  block: implement add_bdi_stat()
  thp, mm: rewrite delete_from_page_cache() to support huge pages
  thp, mm: warn if we try to use replace_page_cache_page() with THP
  thp, mm: add event counters for huge page alloc on file write or read
  mm, vfs: introduce i_split_sem
  thp, mm: allocate huge pages in grab_cache_page_write_begin()
  thp, mm: naive support of thp in generic_perform_write
  thp, mm: handle transhuge pages in do_generic_file_read()
  thp, libfs: initial thp support
  truncate: support huge pages
  thp: handle file pages in split_huge_page()
  thp: wait_split_huge_page(): serialize over i_mmap_mutex too
  thp, mm: split huge page on mmap file page
  ramfs: enable transparent huge page cache

 Documentation/vm/transhuge.txt |  16 ++++
 drivers/base/node.c            |   4 +
 fs/inode.c                     |   3 +
 fs/libfs.c                     |  58 +++++++++++-
 fs/proc/meminfo.c              |   3 +
 fs/ramfs/file-mmu.c            |   2 +-
 fs/ramfs/inode.c               |   6 +-
 include/linux/backing-dev.h    |  10 +++
 include/linux/fs.h             |  11 +++
 include/linux/huge_mm.h        |  68 +++++++++++++-
 include/linux/mm.h             |  18 ++++
 include/linux/mmzone.h         |   1 +
 include/linux/page-flags.h     |  13 +++
 include/linux/pagemap.h        |  31 +++++++
 include/linux/radix-tree.h     |  11 +++
 include/linux/vm_event_item.h  |   4 +
 include/trace/events/filemap.h |   7 +-
 lib/radix-tree.c               |  94 ++++++++++++++++++--
 mm/Kconfig                     |  11 +++
 mm/filemap.c                   | 196 ++++++++++++++++++++++++++++++++---------
 mm/huge_memory.c               | 147 +++++++++++++++++++++++++++----
 mm/memcontrol.c                |   3 +-
 mm/memory.c                    |  40 ++++++++-
 mm/truncate.c                  | 125 ++++++++++++++++++++------
 mm/vmstat.c                    |   5 ++
 25 files changed, 779 insertions(+), 108 deletions(-)

-- 
1.8.4.rc3



* [PATCHv6 01/22] mm: implement zero_huge_user_segment and friends
  2013-09-23 12:05 [PATCHv6 00/22] Transparent huge page cache: phase 1, everything but mmap() Kirill A. Shutemov
@ 2013-09-23 12:05 ` Kirill A. Shutemov
  2013-09-23 12:05 ` [PATCHv6 02/22] radix-tree: implement preload for multiple contiguous elements Kirill A. Shutemov
                   ` (23 subsequent siblings)
  24 siblings, 0 replies; 48+ messages in thread
From: Kirill A. Shutemov @ 2013-09-23 12:05 UTC (permalink / raw)
  To: Andrea Arcangeli, Andrew Morton
  Cc: Al Viro, Hugh Dickins, Wu Fengguang, Jan Kara, Mel Gorman,
	linux-mm, Andi Kleen, Matthew Wilcox, Kirill A. Shutemov,
	Hillf Danton, Dave Hansen, Ning Qu, Alexander Shishkin,
	linux-fsdevel, linux-kernel, Kirill A. Shutemov

Let's add helpers to clear huge page segment(s). They provide the same
functionality as zero_user_segment and zero_user, but for huge pages.
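
For example, a truncate-style caller could zero everything from a new EOF to
the end of a huge page roughly like this (a sketch, not code from this patch):

	/* offset of the new EOF inside the huge page */
	unsigned offset = newsize & ~HPAGE_PMD_MASK;

	zero_huge_user(page, offset, HPAGE_PMD_SIZE - offset);
	/* same as zero_huge_user_segment(page, offset, HPAGE_PMD_SIZE) */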

Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
---
 include/linux/mm.h | 18 ++++++++++++++++++
 mm/memory.c        | 36 ++++++++++++++++++++++++++++++++++++
 2 files changed, 54 insertions(+)

diff --git a/include/linux/mm.h b/include/linux/mm.h
index 8b6e55ee88..a7b7e62930 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -1809,9 +1809,27 @@ extern void dump_page(struct page *page);
 extern void clear_huge_page(struct page *page,
 			    unsigned long addr,
 			    unsigned int pages_per_huge_page);
+extern void zero_huge_user_segment(struct page *page,
+		unsigned start, unsigned end);
+static inline void zero_huge_user(struct page *page,
+		unsigned start, unsigned len)
+{
+	zero_huge_user_segment(page, start, start + len);
+}
 extern void copy_user_huge_page(struct page *dst, struct page *src,
 				unsigned long addr, struct vm_area_struct *vma,
 				unsigned int pages_per_huge_page);
+#else
+static inline void zero_huge_user_segment(struct page *page,
+		unsigned start, unsigned end)
+{
+	BUILD_BUG();
+}
+static inline void zero_huge_user(struct page *page,
+		unsigned start, unsigned len)
+{
+	BUILD_BUG();
+}
 #endif /* CONFIG_TRANSPARENT_HUGEPAGE || CONFIG_HUGETLBFS */
 
 #ifdef CONFIG_DEBUG_PAGEALLOC
diff --git a/mm/memory.c b/mm/memory.c
index ca00039471..e5f74cd634 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -4291,6 +4291,42 @@ void clear_huge_page(struct page *page,
 	}
 }
 
+void zero_huge_user_segment(struct page *page, unsigned start, unsigned end)
+{
+	int i;
+	unsigned start_idx, end_idx;
+	unsigned start_off, end_off;
+
+	BUG_ON(end < start);
+
+	might_sleep();
+
+	if (start == end)
+		return;
+
+	start_idx = start >> PAGE_SHIFT;
+	start_off = start & ~PAGE_MASK;
+	end_idx = (end - 1) >> PAGE_SHIFT;
+	end_off = ((end - 1) & ~PAGE_MASK) + 1;
+
+	/*
+	 * if start and end are on the same small page we can call
+	 * zero_user_segment() once and save one kmap_atomic().
+	 */
+	if (start_idx == end_idx)
+		return zero_user_segment(page + start_idx, start_off, end_off);
+
+	/* zero the first (possibly partial) page */
+	zero_user_segment(page + start_idx, start_off, PAGE_SIZE);
+	for (i = start_idx + 1; i < end_idx; i++) {
+		cond_resched();
+		clear_highpage(page + i);
+		flush_dcache_page(page + i);
+	}
+	/* zero the last (possibly partial) page */
+	zero_user_segment(page + end_idx, 0, end_off);
+}
+
 static void copy_user_gigantic_page(struct page *dst, struct page *src,
 				    unsigned long addr,
 				    struct vm_area_struct *vma,
-- 
1.8.4.rc3



* [PATCHv6 02/22] radix-tree: implement preload for multiple contiguous elements
  2013-09-23 12:05 [PATCHv6 00/22] Transparent huge page cache: phase 1, everything but mmap() Kirill A. Shutemov
  2013-09-23 12:05 ` [PATCHv6 01/22] mm: implement zero_huge_user_segment and friends Kirill A. Shutemov
@ 2013-09-23 12:05 ` Kirill A. Shutemov
  2013-09-23 12:05 ` [PATCHv6 03/22] memcg, thp: charge huge cache pages Kirill A. Shutemov
                   ` (22 subsequent siblings)
  24 siblings, 0 replies; 48+ messages in thread
From: Kirill A. Shutemov @ 2013-09-23 12:05 UTC (permalink / raw)
  To: Andrea Arcangeli, Andrew Morton
  Cc: Al Viro, Hugh Dickins, Wu Fengguang, Jan Kara, Mel Gorman,
	linux-mm, Andi Kleen, Matthew Wilcox, Kirill A. Shutemov,
	Hillf Danton, Dave Hansen, Ning Qu, Alexander Shishkin,
	linux-fsdevel, linux-kernel, Kirill A. Shutemov

The radix tree is variable-height, so an insert operation not only has
to build the branch to its corresponding item, it also has to build the
branch to existing items if the size has to be increased (by
radix_tree_extend).

The worst case is a zero height tree with just a single item at index 0,
and then inserting an item at index ULONG_MAX. This requires 2 new branches
of RADIX_TREE_MAX_PATH size to be created, with only the root node shared.

The radix tree is usually protected by a spin lock, which means we want to
pre-allocate the required memory before taking the lock.

Currently radix_tree_preload() only guarantees enough nodes to insert
one element. It's a hard limit. For transparent huge page cache we want
to insert HPAGE_PMD_NR (512 on x86-64) entries to address_space at once.

This patch introduces radix_tree_maybe_preload_contig(). It allows
preallocating enough nodes to insert a number of *contiguous* elements.
The feature costs about 9.5KiB per-CPU on x86_64, details below.

Preload uses a per-CPU array to store nodes. The total cost of preload is
"array size" * sizeof(void*) * NR_CPUS. We want to increase the array size
to be able to handle 512 entries at once.

The size of the array depends on system bitness and on RADIX_TREE_MAP_SHIFT.

We have three possible RADIX_TREE_MAP_SHIFT:

 #ifdef __KERNEL__
 #define RADIX_TREE_MAP_SHIFT	(CONFIG_BASE_SMALL ? 4 : 6)
 #else
 #define RADIX_TREE_MAP_SHIFT	3	/* For more stressful testing */
 #endif

We are not going to use transparent huge page cache on small machines or
in userspace, so we are interested in RADIX_TREE_MAP_SHIFT=6.

On a 64-bit system the old array size is 21, the new one is 38. The per-CPU
feature overhead is
 for preload array:
   (38 - 21) * sizeof(void*) = 136 bytes
 plus, if the preload array is full
   (38 - 21) * sizeof(struct radix_tree_node) = 17 * 560 = 9520 bytes
 total: 9656 bytes

On a 32-bit system the old array size is 11, the new one is 23. The per-CPU
feature overhead is
 for preload array:
   (23 - 11) * sizeof(void*) = 48 bytes
 plus, if the preload array is full
   (23 - 11) * sizeof(struct radix_tree_node) = 12 * 296 = 3552 bytes
 total: 3600 bytes

Since only THP uses batched preload at the moment, we disable it (set the
max preload to 1) if !CONFIG_TRANSPARENT_HUGEPAGE_PAGECACHE. This can be
changed in the future.

Signed-off-by: Matthew Wilcox <willy@linux.intel.com>
Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Acked-by: Dave Hansen <dave.hansen@linux.intel.com>
Reviewed-by: Jan Kara <jack@suse.cz>
---
 include/linux/radix-tree.h | 11 ++++++
 lib/radix-tree.c           | 94 +++++++++++++++++++++++++++++++++++++++++-----
 2 files changed, 96 insertions(+), 9 deletions(-)

diff --git a/include/linux/radix-tree.h b/include/linux/radix-tree.h
index 403940787b..3bf0b3e594 100644
--- a/include/linux/radix-tree.h
+++ b/include/linux/radix-tree.h
@@ -83,6 +83,16 @@ do {									\
 	(root)->rnode = NULL;						\
 } while (0)
 
+#ifdef CONFIG_TRANSPARENT_HUGEPAGE_PAGECACHE
+/*
+ * At the moment only THP uses preload for more than one item for batched
+ * pagecache manipulations.
+ */
+#define RADIX_TREE_PRELOAD_NR	512
+#else
+#define RADIX_TREE_PRELOAD_NR	1
+#endif
+
 /**
  * Radix-tree synchronization
  *
@@ -232,6 +242,7 @@ unsigned long radix_tree_prev_hole(struct radix_tree_root *root,
 				unsigned long index, unsigned long max_scan);
 int radix_tree_preload(gfp_t gfp_mask);
 int radix_tree_maybe_preload(gfp_t gfp_mask);
+int radix_tree_maybe_preload_contig(unsigned size, gfp_t gfp_mask);
 void radix_tree_init(void);
 void *radix_tree_tag_set(struct radix_tree_root *root,
 			unsigned long index, unsigned int tag);
diff --git a/lib/radix-tree.c b/lib/radix-tree.c
index 7811ed3b4e..544a00a93b 100644
--- a/lib/radix-tree.c
+++ b/lib/radix-tree.c
@@ -84,14 +84,51 @@ static struct kmem_cache *radix_tree_node_cachep;
  * of RADIX_TREE_MAX_PATH size to be created, with only the root node shared.
  * Hence:
  */
-#define RADIX_TREE_PRELOAD_SIZE (RADIX_TREE_MAX_PATH * 2 - 1)
+#define RADIX_TREE_PRELOAD_MIN (RADIX_TREE_MAX_PATH * 2 - 1)
+
+/*
+ * Inserting N contiguous items is more complex. To simplify calculation, let's
+ * limit N (validated in radix_tree_init()):
+ *  - N is a multiple of RADIX_TREE_MAP_SIZE (or 1);
+ *  - N <= number of items 2-level tree can contain:
+ *    1UL << (2 * RADIX_TREE_MAP_SHIFT).
+ *
+ * No limitation on insert index alignment.
+ *
+ * Then the worst case is tree with only one element at index 0 and we add N
+ * items which cross boundary between items in root node.
+ *
+ * Basically, at least one index is less than
+ *
+ * 1UL << ((RADIX_TREE_MAX_PATH - 1) * RADIX_TREE_MAP_SHIFT + 1)
+ *
+ * and one is equal to or above it.
+ *
+ * In this case we need:
+ *
+ * - RADIX_TREE_MAX_PATH nodes to build new path to item with index 0;
+ * - N / RADIX_TREE_MAP_SIZE + 1 nodes for last level nodes for new items:
+ *    - +1 is for the misaligned case;
+ * - 2 * (RADIX_TREE_MAX_PATH - 2) - 1 nodes to build path to last level nodes:
+ *    - -2, because the root node and last level nodes are already accounted for.
+ *
+ * Hence:
+ */
+#if RADIX_TREE_PRELOAD_NR > 1
+#define RADIX_TREE_PRELOAD_MAX \
+	( RADIX_TREE_MAX_PATH + \
+	  RADIX_TREE_PRELOAD_NR / RADIX_TREE_MAP_SIZE + 1 + \
+	  2 * (RADIX_TREE_MAX_PATH - 2))
+#else
+#define RADIX_TREE_PRELOAD_MAX RADIX_TREE_PRELOAD_MIN
+#endif
 
 /*
  * Per-cpu pool of preloaded nodes
  */
 struct radix_tree_preload {
 	int nr;
-	struct radix_tree_node *nodes[RADIX_TREE_PRELOAD_SIZE];
+	struct radix_tree_node *nodes[RADIX_TREE_PRELOAD_MAX];
 };
 static DEFINE_PER_CPU(struct radix_tree_preload, radix_tree_preloads) = { 0, };
 
@@ -263,29 +300,43 @@ radix_tree_node_free(struct radix_tree_node *node)
 
 /*
  * Load up this CPU's radix_tree_node buffer with sufficient objects to
- * ensure that the addition of a single element in the tree cannot fail.  On
- * success, return zero, with preemption disabled.  On error, return -ENOMEM
+ * ensure that the addition of *contiguous* items in the tree cannot fail.
+ * On success, return zero, with preemption disabled.  On error, return -ENOMEM
  * with preemption not disabled.
  *
  * To make use of this facility, the radix tree must be initialised without
  * __GFP_WAIT being passed to INIT_RADIX_TREE().
  */
-static int __radix_tree_preload(gfp_t gfp_mask)
+static int __radix_tree_preload_contig(unsigned size, gfp_t gfp_mask)
 {
 	struct radix_tree_preload *rtp;
 	struct radix_tree_node *node;
 	int ret = -ENOMEM;
+	int preload_target = RADIX_TREE_PRELOAD_MIN;
 
+	if (size > 1) {
+		size = round_up(size, RADIX_TREE_MAP_SIZE);
+		if (WARN_ONCE(size > RADIX_TREE_PRELOAD_NR,
+					"too large preload requested"))
+			return -ENOMEM;
+
+		/* The same math as with RADIX_TREE_PRELOAD_MAX */
+		preload_target = RADIX_TREE_MAX_PATH +
+			size / RADIX_TREE_MAP_SIZE + 1 +
+			2 * (RADIX_TREE_MAX_PATH - 2);
+	}
+
+	BUG_ON(preload_target > RADIX_TREE_PRELOAD_MAX);
 	preempt_disable();
 	rtp = &__get_cpu_var(radix_tree_preloads);
-	while (rtp->nr < ARRAY_SIZE(rtp->nodes)) {
+	while (rtp->nr < preload_target) {
 		preempt_enable();
 		node = kmem_cache_alloc(radix_tree_node_cachep, gfp_mask);
 		if (node == NULL)
 			goto out;
 		preempt_disable();
 		rtp = &__get_cpu_var(radix_tree_preloads);
-		if (rtp->nr < ARRAY_SIZE(rtp->nodes))
+		if (rtp->nr < preload_target)
 			rtp->nodes[rtp->nr++] = node;
 		else
 			kmem_cache_free(radix_tree_node_cachep, node);
@@ -308,7 +359,7 @@ int radix_tree_preload(gfp_t gfp_mask)
 {
 	/* Warn on non-sensical use... */
 	WARN_ON_ONCE(!(gfp_mask & __GFP_WAIT));
-	return __radix_tree_preload(gfp_mask);
+	return __radix_tree_preload_contig(1, gfp_mask);
 }
 EXPORT_SYMBOL(radix_tree_preload);
 
@@ -320,13 +371,22 @@ EXPORT_SYMBOL(radix_tree_preload);
 int radix_tree_maybe_preload(gfp_t gfp_mask)
 {
 	if (gfp_mask & __GFP_WAIT)
-		return __radix_tree_preload(gfp_mask);
+		return __radix_tree_preload_contig(1, gfp_mask);
 	/* Preloading doesn't help anything with this gfp mask, skip it */
 	preempt_disable();
 	return 0;
 }
 EXPORT_SYMBOL(radix_tree_maybe_preload);
 
+int radix_tree_maybe_preload_contig(unsigned size, gfp_t gfp_mask)
+{
+	if (gfp_mask & __GFP_WAIT)
+		return __radix_tree_preload_contig(size, gfp_mask);
+	/* Preloading doesn't help anything with this gfp mask, skip it */
+	preempt_disable();
+	return 0;
+}
+
 /*
  *	Return the maximum key which can be store into a
  *	radix tree with height HEIGHT.
@@ -1483,6 +1543,22 @@ static int radix_tree_callback(struct notifier_block *nfb,
 
 void __init radix_tree_init(void)
 {
+	/*
+	 * Restrictions on RADIX_TREE_PRELOAD_NR simplify RADIX_TREE_PRELOAD_MAX
+	 * calculation, it's already complex enough:
+	 *  - it must be a multiple of RADIX_TREE_MAP_SIZE, otherwise we will
+	 *    have to round it up to the next RADIX_TREE_MAP_SIZE multiple and we
+	 *    don't win anything;
+	 *  - must be less than the number of items a 2-level tree can contain.
+	 *    It's easier to calculate number of internal nodes required
+	 *    this way.
+	 */
+	if (RADIX_TREE_PRELOAD_NR != 1) {
+		BUILD_BUG_ON(RADIX_TREE_PRELOAD_NR % RADIX_TREE_MAP_SIZE != 0);
+		BUILD_BUG_ON(RADIX_TREE_PRELOAD_NR >
+				1UL << (2 * RADIX_TREE_MAP_SHIFT));
+	}
+
 	radix_tree_node_cachep = kmem_cache_create("radix_tree_node",
 			sizeof(struct radix_tree_node), 0,
 			SLAB_PANIC | SLAB_RECLAIM_ACCOUNT,
-- 
1.8.4.rc3



* [PATCHv6 03/22] memcg, thp: charge huge cache pages
  2013-09-23 12:05 [PATCHv6 00/22] Transparent huge page cache: phase 1, everything but mmap() Kirill A. Shutemov
  2013-09-23 12:05 ` [PATCHv6 01/22] mm: implement zero_huge_user_segment and friends Kirill A. Shutemov
  2013-09-23 12:05 ` [PATCHv6 02/22] radix-tree: implement preload for multiple contiguous elements Kirill A. Shutemov
@ 2013-09-23 12:05 ` Kirill A. Shutemov
  2013-09-23 12:05 ` [PATCHv6 04/22] thp: compile-time and sysfs knob for thp pagecache Kirill A. Shutemov
                   ` (21 subsequent siblings)
  24 siblings, 0 replies; 48+ messages in thread
From: Kirill A. Shutemov @ 2013-09-23 12:05 UTC (permalink / raw)
  To: Andrea Arcangeli, Andrew Morton
  Cc: Al Viro, Hugh Dickins, Wu Fengguang, Jan Kara, Mel Gorman,
	linux-mm, Andi Kleen, Matthew Wilcox, Kirill A. Shutemov,
	Hillf Danton, Dave Hansen, Ning Qu, Alexander Shishkin,
	linux-fsdevel, linux-kernel, Kirill A. Shutemov,
	KAMEZAWA Hiroyuki

mem_cgroup_cache_charge() has a check for PageCompound(). The check
prevents charging huge cache pages.

I don't see a reason why the check is present. It looks like it's just
legacy (introduced in 52d4b9a "memcg: allocate all page_cgroup at boot").

Let's just drop it.

Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Acked-by: Dave Hansen <dave.hansen@linux.intel.com>
Acked-by: Michal Hocko <mhocko@suse.cz>
---
 mm/memcontrol.c | 3 +--
 1 file changed, 1 insertion(+), 2 deletions(-)

diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index d5ff3ce130..0b87a1bd25 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -3963,8 +3963,7 @@ int mem_cgroup_cache_charge(struct page *page, struct mm_struct *mm,
 
 	if (mem_cgroup_disabled())
 		return 0;
-	if (PageCompound(page))
-		return 0;
+	VM_BUG_ON(PageCompound(page) && !PageTransHuge(page));
 
 	if (!PageSwapCache(page))
 		ret = mem_cgroup_charge_common(page, mm, gfp_mask, type);
-- 
1.8.4.rc3



* [PATCHv6 04/22] thp: compile-time and sysfs knob for thp pagecache
  2013-09-23 12:05 [PATCHv6 00/22] Transparent huge page cache: phase 1, everything but mmap() Kirill A. Shutemov
                   ` (2 preceding siblings ...)
  2013-09-23 12:05 ` [PATCHv6 03/22] memcg, thp: charge huge cache pages Kirill A. Shutemov
@ 2013-09-23 12:05 ` Kirill A. Shutemov
  2013-09-23 12:05 ` [PATCHv6 05/22] thp, mm: introduce mapping_can_have_hugepages() predicate Kirill A. Shutemov
                   ` (20 subsequent siblings)
  24 siblings, 0 replies; 48+ messages in thread
From: Kirill A. Shutemov @ 2013-09-23 12:05 UTC (permalink / raw)
  To: Andrea Arcangeli, Andrew Morton
  Cc: Al Viro, Hugh Dickins, Wu Fengguang, Jan Kara, Mel Gorman,
	linux-mm, Andi Kleen, Matthew Wilcox, Kirill A. Shutemov,
	Hillf Danton, Dave Hansen, Ning Qu, Alexander Shishkin,
	linux-fsdevel, linux-kernel, Kirill A. Shutemov

For now, TRANSPARENT_HUGEPAGE_PAGECACHE is only implemented for x86_64.
It's disabled by default.

Radix tree preload overhead can be significant on !BASE_FULL systems, so
let's add a dependency on BASE_FULL.

/sys/kernel/mm/transparent_hugepage/page_cache is a runtime knob for the
feature.

Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
---
 Documentation/vm/transhuge.txt |  9 +++++++++
 include/linux/huge_mm.h        | 14 ++++++++++++++
 mm/Kconfig                     | 11 +++++++++++
 mm/huge_memory.c               | 23 +++++++++++++++++++++++
 4 files changed, 57 insertions(+)

diff --git a/Documentation/vm/transhuge.txt b/Documentation/vm/transhuge.txt
index 4a63953a41..4cc15c40f4 100644
--- a/Documentation/vm/transhuge.txt
+++ b/Documentation/vm/transhuge.txt
@@ -103,6 +103,15 @@ echo always >/sys/kernel/mm/transparent_hugepage/enabled
 echo madvise >/sys/kernel/mm/transparent_hugepage/enabled
 echo never >/sys/kernel/mm/transparent_hugepage/enabled
 
+If TRANSPARENT_HUGEPAGE_PAGECACHE is enabled, the kernel will use huge pages
+in the page cache if possible. It can be disabled and re-enabled via sysfs:
+
+echo 0 >/sys/kernel/mm/transparent_hugepage/page_cache
+echo 1 >/sys/kernel/mm/transparent_hugepage/page_cache
+
+If it's disabled, the kernel will not add new huge pages to the page cache
+and will split them on mapping, but already mapped pages will stay intact.
+
 It's also possible to limit defrag efforts in the VM to generate
 hugepages in case they're not immediately free to madvise regions or
 to never try to defrag memory and simply fallback to regular pages
diff --git a/include/linux/huge_mm.h b/include/linux/huge_mm.h
index 3935428c57..fb0847572c 100644
--- a/include/linux/huge_mm.h
+++ b/include/linux/huge_mm.h
@@ -40,6 +40,7 @@ enum transparent_hugepage_flag {
 	TRANSPARENT_HUGEPAGE_DEFRAG_FLAG,
 	TRANSPARENT_HUGEPAGE_DEFRAG_REQ_MADV_FLAG,
 	TRANSPARENT_HUGEPAGE_DEFRAG_KHUGEPAGED_FLAG,
+	TRANSPARENT_HUGEPAGE_PAGECACHE,
 	TRANSPARENT_HUGEPAGE_USE_ZERO_PAGE_FLAG,
 #ifdef CONFIG_DEBUG_VM
 	TRANSPARENT_HUGEPAGE_DEBUG_COW_FLAG,
@@ -229,4 +230,17 @@ static inline int do_huge_pmd_numa_page(struct mm_struct *mm, struct vm_area_str
 
 #endif /* CONFIG_TRANSPARENT_HUGEPAGE */
 
+static inline bool transparent_hugepage_pagecache(void)
+{
+	if (!IS_ENABLED(CONFIG_TRANSPARENT_HUGEPAGE_PAGECACHE))
+		return false;
+	if (!(transparent_hugepage_flags & (1<<TRANSPARENT_HUGEPAGE_FLAG)))
+		return false;
+
+	if (!(transparent_hugepage_flags &
+				((1<<TRANSPARENT_HUGEPAGE_FLAG) |
+				 (1<<TRANSPARENT_HUGEPAGE_REQ_MADV_FLAG))))
+                 return false;
+	return transparent_hugepage_flags & (1<<TRANSPARENT_HUGEPAGE_PAGECACHE);
+}
 #endif /* _LINUX_HUGE_MM_H */
diff --git a/mm/Kconfig b/mm/Kconfig
index 026771a9b0..562f12fd89 100644
--- a/mm/Kconfig
+++ b/mm/Kconfig
@@ -420,6 +420,17 @@ choice
 	  benefit.
 endchoice
 
+config TRANSPARENT_HUGEPAGE_PAGECACHE
+	bool "Transparent Hugepage Support for page cache"
+	depends on X86_64 && TRANSPARENT_HUGEPAGE
+	# avoid radix tree preload overhead
+	depends on BASE_FULL
+	help
+	  Enabling the option adds support for hugepages in file-backed
+	  mappings. It requires transparent hugepage support from the
+	  filesystem side. For now, the only filesystem which supports
+	  hugepages is ramfs.
+
 config CROSS_MEMORY_ATTACH
 	bool "Cross Memory Support"
 	depends on MMU
diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index 7489884682..59f099b93f 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -42,6 +42,9 @@ unsigned long transparent_hugepage_flags __read_mostly =
 #endif
 	(1<<TRANSPARENT_HUGEPAGE_DEFRAG_FLAG)|
 	(1<<TRANSPARENT_HUGEPAGE_DEFRAG_KHUGEPAGED_FLAG)|
+#ifdef CONFIG_TRANSPARENT_HUGEPAGE_PAGECACHE
+	(1<<TRANSPARENT_HUGEPAGE_PAGECACHE)|
+#endif
 	(1<<TRANSPARENT_HUGEPAGE_USE_ZERO_PAGE_FLAG);
 
 /* default scan 8*512 pte (or vmas) every 30 second */
@@ -362,6 +365,23 @@ static ssize_t defrag_store(struct kobject *kobj,
 static struct kobj_attribute defrag_attr =
 	__ATTR(defrag, 0644, defrag_show, defrag_store);
 
+#ifdef CONFIG_TRANSPARENT_HUGEPAGE_PAGECACHE
+static ssize_t page_cache_show(struct kobject *kobj,
+		struct kobj_attribute *attr, char *buf)
+{
+	return single_flag_show(kobj, attr, buf,
+				TRANSPARENT_HUGEPAGE_PAGECACHE);
+}
+static ssize_t page_cache_store(struct kobject *kobj,
+		struct kobj_attribute *attr, const char *buf, size_t count)
+{
+	return single_flag_store(kobj, attr, buf, count,
+				 TRANSPARENT_HUGEPAGE_PAGECACHE);
+}
+static struct kobj_attribute page_cache_attr =
+	__ATTR(page_cache, 0644, page_cache_show, page_cache_store);
+#endif
+
 static ssize_t use_zero_page_show(struct kobject *kobj,
 		struct kobj_attribute *attr, char *buf)
 {
@@ -397,6 +417,9 @@ static struct kobj_attribute debug_cow_attr =
 static struct attribute *hugepage_attr[] = {
 	&enabled_attr.attr,
 	&defrag_attr.attr,
+#ifdef CONFIG_TRANSPARENT_HUGEPAGE_PAGECACHE
+	&page_cache_attr.attr,
+#endif
 	&use_zero_page_attr.attr,
 #ifdef CONFIG_DEBUG_VM
 	&debug_cow_attr.attr,
-- 
1.8.4.rc3



* [PATCHv6 05/22] thp, mm: introduce mapping_can_have_hugepages() predicate
  2013-09-23 12:05 [PATCHv6 00/22] Transparent huge page cache: phase 1, everything but mmap() Kirill A. Shutemov
                   ` (3 preceding siblings ...)
  2013-09-23 12:05 ` [PATCHv6 04/22] thp: compile-time and sysfs knob for thp pagecache Kirill A. Shutemov
@ 2013-09-23 12:05 ` Kirill A. Shutemov
  2013-09-23 12:05 ` [PATCHv6 06/22] thp: represent file thp pages in meminfo and friends Kirill A. Shutemov
                   ` (19 subsequent siblings)
  24 siblings, 0 replies; 48+ messages in thread
From: Kirill A. Shutemov @ 2013-09-23 12:05 UTC (permalink / raw)
  To: Andrea Arcangeli, Andrew Morton
  Cc: Al Viro, Hugh Dickins, Wu Fengguang, Jan Kara, Mel Gorman,
	linux-mm, Andi Kleen, Matthew Wilcox, Kirill A. Shutemov,
	Hillf Danton, Dave Hansen, Ning Qu, Alexander Shishkin,
	linux-fsdevel, linux-kernel, Kirill A. Shutemov

mapping_can_have_hugepages() returns true if the mapping can have huge
pages. For now, just check for __GFP_COMP in the mapping's gfp mask.
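
A filesystem opts in by setting __GFP_COMP in the mapping's gfp mask, along
these lines (a sketch; the ramfs patch at the end of the series enables it by
choosing a gfp mask that includes __GFP_COMP):

	/* in the filesystem's inode setup path */
	mapping_set_gfp_mask(inode->i_mapping,
			     mapping_gfp_mask(inode->i_mapping) | __GFP_COMP);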

Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
---
 include/linux/pagemap.h | 14 ++++++++++++++
 1 file changed, 14 insertions(+)

diff --git a/include/linux/pagemap.h b/include/linux/pagemap.h
index e3dea75a07..ad60dcc50e 100644
--- a/include/linux/pagemap.h
+++ b/include/linux/pagemap.h
@@ -84,6 +84,20 @@ static inline void mapping_set_gfp_mask(struct address_space *m, gfp_t mask)
 				(__force unsigned long)mask;
 }
 
+static inline bool mapping_can_have_hugepages(struct address_space *m)
+{
+	gfp_t gfp_mask = mapping_gfp_mask(m);
+
+	if (!transparent_hugepage_pagecache())
+		return false;
+
+	/*
+	 * It's up to filesystem what gfp mask to use.
+	 * The only part of GFP_TRANSHUGE which matters for us is __GFP_COMP.
+	 */
+	return !!(gfp_mask & __GFP_COMP);
+}
+
 /*
  * The page cache can done in larger chunks than
  * one page, because it allows for more efficient
-- 
1.8.4.rc3



* [PATCHv6 06/22] thp: represent file thp pages in meminfo and friends
  2013-09-23 12:05 [PATCHv6 00/22] Transparent huge page cache: phase 1, everything but mmap() Kirill A. Shutemov
                   ` (4 preceding siblings ...)
  2013-09-23 12:05 ` [PATCHv6 05/22] thp, mm: introduce mapping_can_have_hugepages() predicate Kirill A. Shutemov
@ 2013-09-23 12:05 ` Kirill A. Shutemov
  2013-09-23 12:05 ` [PATCHv6 07/22] thp, mm: rewrite add_to_page_cache_locked() to support huge pages Kirill A. Shutemov
                   ` (18 subsequent siblings)
  24 siblings, 0 replies; 48+ messages in thread
From: Kirill A. Shutemov @ 2013-09-23 12:05 UTC (permalink / raw)
  To: Andrea Arcangeli, Andrew Morton
  Cc: Al Viro, Hugh Dickins, Wu Fengguang, Jan Kara, Mel Gorman,
	linux-mm, Andi Kleen, Matthew Wilcox, Kirill A. Shutemov,
	Hillf Danton, Dave Hansen, Ning Qu, Alexander Shishkin,
	linux-fsdevel, linux-kernel, Kirill A. Shutemov

The patch adds a new zone stat to count file transparent huge pages and
adjusts related places.

For now we don't count mapped or dirty file thp pages separately.

The patch depends on patch
 thp: account anon transparent huge pages into NR_ANON_PAGES

Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Acked-by: Dave Hansen <dave.hansen@linux.intel.com>
---
 drivers/base/node.c    | 4 ++++
 fs/proc/meminfo.c      | 3 +++
 include/linux/mmzone.h | 1 +
 mm/vmstat.c            | 1 +
 4 files changed, 9 insertions(+)

diff --git a/drivers/base/node.c b/drivers/base/node.c
index bc9f43bf7e..de261f5722 100644
--- a/drivers/base/node.c
+++ b/drivers/base/node.c
@@ -119,6 +119,7 @@ static ssize_t node_read_meminfo(struct device *dev,
 		       "Node %d SUnreclaim:     %8lu kB\n"
 #ifdef CONFIG_TRANSPARENT_HUGEPAGE
 		       "Node %d AnonHugePages:  %8lu kB\n"
+		       "Node %d FileHugePages:  %8lu kB\n"
 #endif
 			,
 		       nid, K(node_page_state(nid, NR_FILE_DIRTY)),
@@ -140,6 +141,9 @@ static ssize_t node_read_meminfo(struct device *dev,
 		       nid, K(node_page_state(nid, NR_SLAB_UNRECLAIMABLE))
 			, nid,
 			K(node_page_state(nid, NR_ANON_TRANSPARENT_HUGEPAGES) *
+			HPAGE_PMD_NR)
+			, nid,
+			K(node_page_state(nid, NR_FILE_TRANSPARENT_HUGEPAGES) *
 			HPAGE_PMD_NR));
 #else
 		       nid, K(node_page_state(nid, NR_SLAB_UNRECLAIMABLE)));
diff --git a/fs/proc/meminfo.c b/fs/proc/meminfo.c
index 59d85d6088..a62952cd4f 100644
--- a/fs/proc/meminfo.c
+++ b/fs/proc/meminfo.c
@@ -104,6 +104,7 @@ static int meminfo_proc_show(struct seq_file *m, void *v)
 #endif
 #ifdef CONFIG_TRANSPARENT_HUGEPAGE
 		"AnonHugePages:  %8lu kB\n"
+		"FileHugePages:  %8lu kB\n"
 #endif
 		,
 		K(i.totalram),
@@ -158,6 +159,8 @@ static int meminfo_proc_show(struct seq_file *m, void *v)
 #ifdef CONFIG_TRANSPARENT_HUGEPAGE
 		,K(global_page_state(NR_ANON_TRANSPARENT_HUGEPAGES) *
 		   HPAGE_PMD_NR)
+		,K(global_page_state(NR_FILE_TRANSPARENT_HUGEPAGES) *
+		   HPAGE_PMD_NR)
 #endif
 		);
 
diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h
index bd791e452a..8b4525bd4f 100644
--- a/include/linux/mmzone.h
+++ b/include/linux/mmzone.h
@@ -143,6 +143,7 @@ enum zone_stat_item {
 	NUMA_OTHER,		/* allocation from other node */
 #endif
 	NR_ANON_TRANSPARENT_HUGEPAGES,
+	NR_FILE_TRANSPARENT_HUGEPAGES,
 	NR_FREE_CMA_PAGES,
 	NR_VM_ZONE_STAT_ITEMS };
 
diff --git a/mm/vmstat.c b/mm/vmstat.c
index 9bb3145779..9af0d8536b 100644
--- a/mm/vmstat.c
+++ b/mm/vmstat.c
@@ -771,6 +771,7 @@ const char * const vmstat_text[] = {
 	"numa_other",
 #endif
 	"nr_anon_transparent_hugepages",
+	"nr_file_transparent_hugepages",
 	"nr_free_cma",
 	"nr_dirty_threshold",
 	"nr_dirty_background_threshold",
-- 
1.8.4.rc3



* [PATCHv6 07/22] thp, mm: rewrite add_to_page_cache_locked() to support huge pages
  2013-09-23 12:05 [PATCHv6 00/22] Transparent huge page cache: phase 1, everything but mmap() Kirill A. Shutemov
                   ` (5 preceding siblings ...)
  2013-09-23 12:05 ` [PATCHv6 06/22] thp: represent file thp pages in meminfo and friends Kirill A. Shutemov
@ 2013-09-23 12:05 ` Kirill A. Shutemov
  2013-09-23 12:05 ` [PATCHv6 08/22] mm: trace filemap: dump page order Kirill A. Shutemov
                   ` (17 subsequent siblings)
  24 siblings, 0 replies; 48+ messages in thread
From: Kirill A. Shutemov @ 2013-09-23 12:05 UTC (permalink / raw)
  To: Andrea Arcangeli, Andrew Morton
  Cc: Al Viro, Hugh Dickins, Wu Fengguang, Jan Kara, Mel Gorman,
	linux-mm, Andi Kleen, Matthew Wilcox, Kirill A. Shutemov,
	Hillf Danton, Dave Hansen, Ning Qu, Alexander Shishkin,
	linux-fsdevel, linux-kernel, Kirill A. Shutemov

For a huge page we add HPAGE_CACHE_NR entries to the radix tree at once:
one for the specified index and HPAGE_CACHE_NR-1 for the following indexes.
All of them point to the head page.

Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Acked-by: Dave Hansen <dave.hansen@linux.intel.com>
---
 include/linux/huge_mm.h    | 24 ++++++++++++++++++++++++
 include/linux/page-flags.h | 13 +++++++++++++
 mm/filemap.c               | 45 +++++++++++++++++++++++++++++++++++----------
 3 files changed, 72 insertions(+), 10 deletions(-)

diff --git a/include/linux/huge_mm.h b/include/linux/huge_mm.h
index fb0847572c..9747af1117 100644
--- a/include/linux/huge_mm.h
+++ b/include/linux/huge_mm.h
@@ -230,6 +230,20 @@ static inline int do_huge_pmd_numa_page(struct mm_struct *mm, struct vm_area_str
 
 #endif /* CONFIG_TRANSPARENT_HUGEPAGE */
 
+#ifdef CONFIG_TRANSPARENT_HUGEPAGE_PAGECACHE
+
+#define HPAGE_CACHE_ORDER      (HPAGE_SHIFT - PAGE_CACHE_SHIFT)
+#define HPAGE_CACHE_NR         (1L << HPAGE_CACHE_ORDER)
+#define HPAGE_CACHE_INDEX_MASK (HPAGE_CACHE_NR - 1)
+
+#else
+
+#define HPAGE_CACHE_ORDER      ({ BUILD_BUG(); 0; })
+#define HPAGE_CACHE_NR         ({ BUILD_BUG(); 0; })
+#define HPAGE_CACHE_INDEX_MASK ({ BUILD_BUG(); 0; })
+
+#endif
+
 static inline bool transparent_hugepage_pagecache(void)
 {
 	if (!IS_ENABLED(CONFIG_TRANSPARENT_HUGEPAGE_PAGECACHE))
@@ -243,4 +257,14 @@ static inline bool transparent_hugepage_pagecache(void)
                  return false;
 	return transparent_hugepage_flags & (1<<TRANSPARENT_HUGEPAGE_PAGECACHE);
 }
+
+static inline int hpagecache_nr_pages(struct page *page)
+{
+	if (IS_ENABLED(CONFIG_TRANSPARENT_HUGEPAGE_PAGECACHE))
+		return hpage_nr_pages(page);
+
+	BUG_ON(PageTransHuge(page));
+	return 1;
+}
+
 #endif /* _LINUX_HUGE_MM_H */
diff --git a/include/linux/page-flags.h b/include/linux/page-flags.h
index 6d53675c2b..6d2d7ce3e1 100644
--- a/include/linux/page-flags.h
+++ b/include/linux/page-flags.h
@@ -452,6 +452,19 @@ static inline int PageTransTail(struct page *page)
 }
 #endif
 
+#ifdef CONFIG_TRANSPARENT_HUGEPAGE_PAGECACHE
+static inline int PageTransHugeCache(struct page *page)
+{
+	return PageTransHuge(page);
+}
+#else
+
+static inline int PageTransHugeCache(struct page *page)
+{
+	return 0;
+}
+#endif
+
 /*
  * If network-based swap is enabled, sl*b must keep track of whether pages
  * were allocated from pfmemalloc reserves.
diff --git a/mm/filemap.c b/mm/filemap.c
index c7e42aee5c..d2d6c0ebe9 100644
--- a/mm/filemap.c
+++ b/mm/filemap.c
@@ -460,38 +460,63 @@ int add_to_page_cache_locked(struct page *page, struct address_space *mapping,
 		pgoff_t offset, gfp_t gfp_mask)
 {
 	int error;
+	int i, nr;
 
 	VM_BUG_ON(!PageLocked(page));
 	VM_BUG_ON(PageSwapBacked(page));
 
+	/* memory cgroup controller handles thp pages on its side */
 	error = mem_cgroup_cache_charge(page, current->mm,
 					gfp_mask & GFP_RECLAIM_MASK);
 	if (error)
 		return error;
 
-	error = radix_tree_maybe_preload(gfp_mask & ~__GFP_HIGHMEM);
+	if (PageTransHugeCache(page))
+		BUILD_BUG_ON(HPAGE_CACHE_NR > RADIX_TREE_PRELOAD_NR);
+
+	nr = hpagecache_nr_pages(page);
+
+	error = radix_tree_maybe_preload_contig(nr, gfp_mask & ~__GFP_HIGHMEM);
 	if (error) {
 		mem_cgroup_uncharge_cache_page(page);
 		return error;
 	}
 
+	spin_lock_irq(&mapping->tree_lock);
 	page_cache_get(page);
-	page->mapping = mapping;
 	page->index = offset;
-
-	spin_lock_irq(&mapping->tree_lock);
-	error = radix_tree_insert(&mapping->page_tree, offset, page);
+	page->mapping = mapping;
+	for (i = 0; i < nr; i++) {
+		error = radix_tree_insert(&mapping->page_tree,
+				offset + i, page);
+		/*
+		 * In the middle of a THP we can collide with a small page which was
+		 * established before THP page cache was enabled, or by another VMA
+		 * with bad alignment (most likely MAP_FIXED).
+		 */
+		if (error) {
+			i--; /* failed to insert anything at offset + i */
+			goto err_insert;
+		}
+	}
 	radix_tree_preload_end();
-	if (unlikely(error))
-		goto err_insert;
-	mapping->nrpages++;
-	__inc_zone_page_state(page, NR_FILE_PAGES);
+	mapping->nrpages += nr;
+	__mod_zone_page_state(page_zone(page), NR_FILE_PAGES, nr);
+	if (PageTransHuge(page))
+		__inc_zone_page_state(page, NR_FILE_TRANSPARENT_HUGEPAGES);
 	spin_unlock_irq(&mapping->tree_lock);
 	trace_mm_filemap_add_to_page_cache(page);
 	return 0;
 err_insert:
-	page->mapping = NULL;
+	radix_tree_preload_end();
+	if (i != 0)
+		error = -ENOSPC; /* no space for a huge page */
+
 	/* Leave page->index set: truncation relies upon it */
+	page->mapping = NULL;
+	for (; i >= 0; i--)
+		radix_tree_delete(&mapping->page_tree, offset + i);
+
 	spin_unlock_irq(&mapping->tree_lock);
 	mem_cgroup_uncharge_cache_page(page);
 	page_cache_release(page);
-- 
1.8.4.rc3



* [PATCHv6 08/22] mm: trace filemap: dump page order
  2013-09-23 12:05 [PATCHv6 00/22] Transparent huge page cache: phase 1, everything but mmap() Kirill A. Shutemov
                   ` (6 preceding siblings ...)
  2013-09-23 12:05 ` [PATCHv6 07/22] thp, mm: rewrite add_to_page_cache_locked() to support huge pages Kirill A. Shutemov
@ 2013-09-23 12:05 ` Kirill A. Shutemov
  2013-09-23 12:05 ` [PATCHv6 09/22] block: implement add_bdi_stat() Kirill A. Shutemov
                   ` (16 subsequent siblings)
  24 siblings, 0 replies; 48+ messages in thread
From: Kirill A. Shutemov @ 2013-09-23 12:05 UTC (permalink / raw)
  To: Andrea Arcangeli, Andrew Morton
  Cc: Al Viro, Hugh Dickins, Wu Fengguang, Jan Kara, Mel Gorman,
	linux-mm, Andi Kleen, Matthew Wilcox, Kirill A. Shutemov,
	Hillf Danton, Dave Hansen, Ning Qu, Alexander Shishkin,
	linux-fsdevel, linux-kernel, Kirill A. Shutemov

Dump the page order to the trace output to be able to distinguish between
small and huge pages in the page cache.

Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Acked-by: Dave Hansen <dave.hansen@linux.intel.com>
---
 include/trace/events/filemap.h | 7 +++++--
 1 file changed, 5 insertions(+), 2 deletions(-)

diff --git a/include/trace/events/filemap.h b/include/trace/events/filemap.h
index 0421f49a20..7e14b13470 100644
--- a/include/trace/events/filemap.h
+++ b/include/trace/events/filemap.h
@@ -21,6 +21,7 @@ DECLARE_EVENT_CLASS(mm_filemap_op_page_cache,
 		__field(struct page *, page)
 		__field(unsigned long, i_ino)
 		__field(unsigned long, index)
+		__field(int, order)
 		__field(dev_t, s_dev)
 	),
 
@@ -28,18 +29,20 @@ DECLARE_EVENT_CLASS(mm_filemap_op_page_cache,
 		__entry->page = page;
 		__entry->i_ino = page->mapping->host->i_ino;
 		__entry->index = page->index;
+		__entry->order = compound_order(page);
 		if (page->mapping->host->i_sb)
 			__entry->s_dev = page->mapping->host->i_sb->s_dev;
 		else
 			__entry->s_dev = page->mapping->host->i_rdev;
 	),
 
-	TP_printk("dev %d:%d ino %lx page=%p pfn=%lu ofs=%lu",
+	TP_printk("dev %d:%d ino %lx page=%p pfn=%lu ofs=%lu order=%d",
 		MAJOR(__entry->s_dev), MINOR(__entry->s_dev),
 		__entry->i_ino,
 		__entry->page,
 		page_to_pfn(__entry->page),
-		__entry->index << PAGE_SHIFT)
+		__entry->index << PAGE_SHIFT,
+		__entry->order)
 );
 
 DEFINE_EVENT(mm_filemap_op_page_cache, mm_filemap_delete_from_page_cache,
-- 
1.8.4.rc3



* [PATCHv6 09/22] block: implement add_bdi_stat()
  2013-09-23 12:05 [PATCHv6 00/22] Transparent huge page cache: phase 1, everything but mmap() Kirill A. Shutemov
                   ` (7 preceding siblings ...)
  2013-09-23 12:05 ` [PATCHv6 08/22] mm: trace filemap: dump page order Kirill A. Shutemov
@ 2013-09-23 12:05 ` Kirill A. Shutemov
  2013-09-23 12:05 ` [PATCHv6 10/22] thp, mm: rewrite delete_from_page_cache() to support huge pages Kirill A. Shutemov
                   ` (15 subsequent siblings)
  24 siblings, 0 replies; 48+ messages in thread
From: Kirill A. Shutemov @ 2013-09-23 12:05 UTC (permalink / raw)
  To: Andrea Arcangeli, Andrew Morton
  Cc: Al Viro, Hugh Dickins, Wu Fengguang, Jan Kara, Mel Gorman,
	linux-mm, Andi Kleen, Matthew Wilcox, Kirill A. Shutemov,
	Hillf Danton, Dave Hansen, Ning Qu, Alexander Shishkin,
	linux-fsdevel, linux-kernel, Kirill A. Shutemov

We're going to add/remove a number of page cache entries at once. This
patch implements add_bdi_stat() which adjusts bdi stats by arbitrary
amount. It's required for batched page cache manipulations.
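
For example, the batched delete path later in the series uses it to drop a
whole huge page's worth of dirty accounting in one go:

	if (PageDirty(page) && mapping_cap_account_dirty(mapping)) {
		mod_zone_page_state(page_zone(page), NR_FILE_DIRTY, -nr);
		add_bdi_stat(mapping->backing_dev_info, BDI_RECLAIMABLE, -nr);
	}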

Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Reviewed-by: Jan Kara <jack@suse.cz>
---
 include/linux/backing-dev.h | 10 ++++++++++
 1 file changed, 10 insertions(+)

diff --git a/include/linux/backing-dev.h b/include/linux/backing-dev.h
index 5f66d519a7..39acfa974b 100644
--- a/include/linux/backing-dev.h
+++ b/include/linux/backing-dev.h
@@ -166,6 +166,16 @@ static inline void __dec_bdi_stat(struct backing_dev_info *bdi,
 	__add_bdi_stat(bdi, item, -1);
 }
 
+static inline void add_bdi_stat(struct backing_dev_info *bdi,
+		enum bdi_stat_item item, s64 amount)
+{
+	unsigned long flags;
+
+	local_irq_save(flags);
+	__add_bdi_stat(bdi, item, amount);
+	local_irq_restore(flags);
+}
+
 static inline void dec_bdi_stat(struct backing_dev_info *bdi,
 		enum bdi_stat_item item)
 {
-- 
1.8.4.rc3



* [PATCHv6 10/22] thp, mm: rewrite delete_from_page_cache() to support huge pages
  2013-09-23 12:05 [PATCHv6 00/22] Transparent huge page cache: phase 1, everything but mmap() Kirill A. Shutemov
                   ` (8 preceding siblings ...)
  2013-09-23 12:05 ` [PATCHv6 09/22] block: implement add_bdi_stat() Kirill A. Shutemov
@ 2013-09-23 12:05 ` Kirill A. Shutemov
  2013-09-25 20:02   ` Ning Qu
  2013-09-23 12:05 ` [PATCHv6 11/22] thp, mm: warn if we try to use replace_page_cache_page() with THP Kirill A. Shutemov
                   ` (14 subsequent siblings)
  24 siblings, 1 reply; 48+ messages in thread
From: Kirill A. Shutemov @ 2013-09-23 12:05 UTC (permalink / raw)
  To: Andrea Arcangeli, Andrew Morton
  Cc: Al Viro, Hugh Dickins, Wu Fengguang, Jan Kara, Mel Gorman,
	linux-mm, Andi Kleen, Matthew Wilcox, Kirill A. Shutemov,
	Hillf Danton, Dave Hansen, Ning Qu, Alexander Shishkin,
	linux-fsdevel, linux-kernel, Kirill A. Shutemov

As with add_to_page_cache_locked(), we handle HPAGE_CACHE_NR pages at a
time.

Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
---
 mm/filemap.c | 20 ++++++++++++++------
 1 file changed, 14 insertions(+), 6 deletions(-)

diff --git a/mm/filemap.c b/mm/filemap.c
index d2d6c0ebe9..60478ebeda 100644
--- a/mm/filemap.c
+++ b/mm/filemap.c
@@ -115,6 +115,7 @@
 void __delete_from_page_cache(struct page *page)
 {
 	struct address_space *mapping = page->mapping;
+	int i, nr;
 
 	trace_mm_filemap_delete_from_page_cache(page);
 	/*
@@ -127,13 +128,20 @@ void __delete_from_page_cache(struct page *page)
 	else
 		cleancache_invalidate_page(mapping, page);
 
-	radix_tree_delete(&mapping->page_tree, page->index);
+	page->mapping = NULL;
+	nr = hpagecache_nr_pages(page);
+	for (i = 0; i < nr; i++)
+		radix_tree_delete(&mapping->page_tree, page->index + i);
+	/* thp */
+	if (nr > 1)
+		__dec_zone_page_state(page, NR_FILE_TRANSPARENT_HUGEPAGES);
+
 	page->mapping = NULL;
 	/* Leave page->index set: truncation lookup relies upon it */
-	mapping->nrpages--;
-	__dec_zone_page_state(page, NR_FILE_PAGES);
+	mapping->nrpages -= nr;
+	__mod_zone_page_state(page_zone(page), NR_FILE_PAGES, -nr);
 	if (PageSwapBacked(page))
-		__dec_zone_page_state(page, NR_SHMEM);
+		__mod_zone_page_state(page_zone(page), NR_SHMEM, -nr);
 	BUG_ON(page_mapped(page));
 
 	/*
@@ -144,8 +152,8 @@ void __delete_from_page_cache(struct page *page)
 	 * having removed the page entirely.
 	 */
 	if (PageDirty(page) && mapping_cap_account_dirty(mapping)) {
-		dec_zone_page_state(page, NR_FILE_DIRTY);
-		dec_bdi_stat(mapping->backing_dev_info, BDI_RECLAIMABLE);
+		mod_zone_page_state(page_zone(page), NR_FILE_DIRTY, -nr);
+		add_bdi_stat(mapping->backing_dev_info, BDI_RECLAIMABLE, -nr);
 	}
 }
 
-- 
1.8.4.rc3



* [PATCHv6 11/22] thp, mm: warn if we try to use replace_page_cache_page() with THP
  2013-09-23 12:05 [PATCHv6 00/22] Transparent huge page cache: phase 1, everything but mmap() Kirill A. Shutemov
                   ` (9 preceding siblings ...)
  2013-09-23 12:05 ` [PATCHv6 10/22] thp, mm: rewrite delete_from_page_cache() to support huge pages Kirill A. Shutemov
@ 2013-09-23 12:05 ` Kirill A. Shutemov
  2013-09-23 12:05 ` [PATCHv6 12/22] thp, mm: add event counters for huge page alloc on file write or read Kirill A. Shutemov
                   ` (13 subsequent siblings)
  24 siblings, 0 replies; 48+ messages in thread
From: Kirill A. Shutemov @ 2013-09-23 12:05 UTC (permalink / raw)
  To: Andrea Arcangeli, Andrew Morton
  Cc: Al Viro, Hugh Dickins, Wu Fengguang, Jan Kara, Mel Gorman,
	linux-mm, Andi Kleen, Matthew Wilcox, Kirill A. Shutemov,
	Hillf Danton, Dave Hansen, Ning Qu, Alexander Shishkin,
	linux-fsdevel, linux-kernel, Kirill A. Shutemov

replace_page_cache_page() is only used by FUSE. It's unlikely that we
will support THP in the FUSE page cache any time soon.

Let's postpone the implementation of THP handling in replace_page_cache_page()
until anyone needs it. Return -EINVAL and WARN_ONCE() for now.

Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
---
 mm/filemap.c | 4 ++++
 1 file changed, 4 insertions(+)

diff --git a/mm/filemap.c b/mm/filemap.c
index 60478ebeda..3421bcaed4 100644
--- a/mm/filemap.c
+++ b/mm/filemap.c
@@ -417,6 +417,10 @@ int replace_page_cache_page(struct page *old, struct page *new, gfp_t gfp_mask)
 {
 	int error;
 
+	if (WARN_ONCE(PageTransHuge(old) || PageTransHuge(new),
+		     "unexpected transhuge page\n"))
+		return -EINVAL;
+
 	VM_BUG_ON(!PageLocked(old));
 	VM_BUG_ON(!PageLocked(new));
 	VM_BUG_ON(new->mapping);
-- 
1.8.4.rc3



* [PATCHv6 12/22] thp, mm: add event counters for huge page alloc on file write or read
  2013-09-23 12:05 [PATCHv6 00/22] Transparent huge page cache: phase 1, everything but mmap() Kirill A. Shutemov
                   ` (10 preceding siblings ...)
  2013-09-23 12:05 ` [PATCHv6 11/22] thp, mm: warn if we try to use replace_page_cache_page() with THP Kirill A. Shutemov
@ 2013-09-23 12:05 ` Kirill A. Shutemov
  2013-09-23 12:05 ` [PATCHv6 13/22] mm, vfs: introduce i_split_sem Kirill A. Shutemov
                   ` (12 subsequent siblings)
  24 siblings, 0 replies; 48+ messages in thread
From: Kirill A. Shutemov @ 2013-09-23 12:05 UTC (permalink / raw)
  To: Andrea Arcangeli, Andrew Morton
  Cc: Al Viro, Hugh Dickins, Wu Fengguang, Jan Kara, Mel Gorman,
	linux-mm, Andi Kleen, Matthew Wilcox, Kirill A. Shutemov,
	Hillf Danton, Dave Hansen, Ning Qu, Alexander Shishkin,
	linux-fsdevel, linux-kernel, Kirill A. Shutemov

Existing stats specify the source of a thp page: fault or collapse. We're
going to allocate new huge pages on write(2) and read(2). That's neither
a fault nor a collapse.

Let's introduce new events for that.
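
For illustration, a minimal sketch of how an allocation path is expected to
bump the new counters (a later patch in the series does essentially this on
the write(2) path; 'page' and 'gfp_mask' are placeholder locals):

	/* sketch: account a huge page allocation attempt on the write path */
	page = alloc_pages(gfp_mask, HPAGE_PMD_ORDER);
	if (page)
		count_vm_event(THP_WRITE_ALLOC);
	else
		count_vm_event(THP_WRITE_ALLOC_FAILED);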

Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
---
 Documentation/vm/transhuge.txt | 7 +++++++
 include/linux/huge_mm.h        | 5 +++++
 include/linux/vm_event_item.h  | 4 ++++
 mm/vmstat.c                    | 4 ++++
 4 files changed, 20 insertions(+)

diff --git a/Documentation/vm/transhuge.txt b/Documentation/vm/transhuge.txt
index 4cc15c40f4..a78f738403 100644
--- a/Documentation/vm/transhuge.txt
+++ b/Documentation/vm/transhuge.txt
@@ -202,6 +202,10 @@ thp_collapse_alloc is incremented by khugepaged when it has found
 	a range of pages to collapse into one huge page and has
 	successfully allocated a new huge page to store the data.
 
+thp_write_alloc and thp_read_alloc are incremented every time a huge
+	page is	successfully allocated to handle write(2) to a file or
+	read(2) from file.
+
 thp_fault_fallback is incremented if a page fault fails to allocate
 	a huge page and instead falls back to using small pages.
 
@@ -209,6 +213,9 @@ thp_collapse_alloc_failed is incremented if khugepaged found a range
 	of pages that should be collapsed into one huge page but failed
 	the allocation.
 
+thp_write_alloc_failed and thp_read_alloc_failed are incremented if
+	huge page allocation failed when tried on write(2) or read(2).
+
 thp_split is incremented every time a huge page is split into base
 	pages. This can happen for a variety of reasons but a common
 	reason is that a huge page is old and is being reclaimed.
diff --git a/include/linux/huge_mm.h b/include/linux/huge_mm.h
index 9747af1117..3700ada4d2 100644
--- a/include/linux/huge_mm.h
+++ b/include/linux/huge_mm.h
@@ -183,6 +183,11 @@ extern int do_huge_pmd_numa_page(struct mm_struct *mm, struct vm_area_struct *vm
 #define HPAGE_PMD_MASK ({ BUILD_BUG(); 0; })
 #define HPAGE_PMD_SIZE ({ BUILD_BUG(); 0; })
 
+#define THP_WRITE_ALLOC		({ BUILD_BUG(); 0; })
+#define THP_WRITE_ALLOC_FAILED	({ BUILD_BUG(); 0; })
+#define THP_READ_ALLOC		({ BUILD_BUG(); 0; })
+#define THP_READ_ALLOC_FAILED	({ BUILD_BUG(); 0; })
+
 #define hpage_nr_pages(x) 1
 
 #define transparent_hugepage_enabled(__vma) 0
diff --git a/include/linux/vm_event_item.h b/include/linux/vm_event_item.h
index 1855f0a22a..8e071bbaa0 100644
--- a/include/linux/vm_event_item.h
+++ b/include/linux/vm_event_item.h
@@ -66,6 +66,10 @@ enum vm_event_item { PGPGIN, PGPGOUT, PSWPIN, PSWPOUT,
 		THP_FAULT_FALLBACK,
 		THP_COLLAPSE_ALLOC,
 		THP_COLLAPSE_ALLOC_FAILED,
+		THP_WRITE_ALLOC,
+		THP_WRITE_ALLOC_FAILED,
+		THP_READ_ALLOC,
+		THP_READ_ALLOC_FAILED,
 		THP_SPLIT,
 		THP_ZERO_PAGE_ALLOC,
 		THP_ZERO_PAGE_ALLOC_FAILED,
diff --git a/mm/vmstat.c b/mm/vmstat.c
index 9af0d8536b..5d1eb7dbf1 100644
--- a/mm/vmstat.c
+++ b/mm/vmstat.c
@@ -847,6 +847,10 @@ const char * const vmstat_text[] = {
 	"thp_fault_fallback",
 	"thp_collapse_alloc",
 	"thp_collapse_alloc_failed",
+	"thp_write_alloc",
+	"thp_write_alloc_failed",
+	"thp_read_alloc",
+	"thp_read_alloc_failed",
 	"thp_split",
 	"thp_zero_page_alloc",
 	"thp_zero_page_alloc_failed",
-- 
1.8.4.rc3


^ permalink raw reply related	[flat|nested] 48+ messages in thread

* [PATCHv6 13/22] mm, vfs: introduce i_split_sem
  2013-09-23 12:05 [PATCHv6 00/22] Transparent huge page cache: phase 1, everything but mmap() Kirill A. Shutemov
                   ` (11 preceding siblings ...)
  2013-09-23 12:05 ` [PATCHv6 12/22] thp, mm: add event counters for huge page alloc on file write or read Kirill A. Shutemov
@ 2013-09-23 12:05 ` Kirill A. Shutemov
  2013-09-23 12:05 ` [PATCHv6 14/22] thp, mm: allocate huge pages in grab_cache_page_write_begin() Kirill A. Shutemov
                   ` (11 subsequent siblings)
  24 siblings, 0 replies; 48+ messages in thread
From: Kirill A. Shutemov @ 2013-09-23 12:05 UTC (permalink / raw)
  To: Andrea Arcangeli, Andrew Morton
  Cc: Al Viro, Hugh Dickins, Wu Fengguang, Jan Kara, Mel Gorman,
	linux-mm, Andi Kleen, Matthew Wilcox, Kirill A. Shutemov,
	Hillf Danton, Dave Hansen, Ning Qu, Alexander Shishkin,
	linux-fsdevel, linux-kernel, Kirill A. Shutemov

i_split_sem taken for read protects huge pages in the inode's page cache
against splitting.

i_split_sem is taken for write during splitting.
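
For illustration, a minimal sketch of the intended locking (hypothetical
callers; the real users are added by later patches in the series):

	/* reader side: keep huge pages in this inode's page cache stable */
	i_split_down_read(inode);
	/* ... work with (possibly huge) pages in the page cache ... */
	i_split_up_read(inode);

	/* splitter side: exclude readers while a huge page is being split */
	down_write(&inode->i_split_sem);
	/* ... split the huge page ... */
	up_write(&inode->i_split_sem);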

Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
---
 fs/inode.c              |  3 +++
 include/linux/fs.h      |  3 +++
 include/linux/huge_mm.h | 10 ++++++++++
 3 files changed, 16 insertions(+)

diff --git a/fs/inode.c b/fs/inode.c
index b33ba8e021..ea06e378c6 100644
--- a/fs/inode.c
+++ b/fs/inode.c
@@ -162,6 +162,9 @@ int inode_init_always(struct super_block *sb, struct inode *inode)
 
 	atomic_set(&inode->i_dio_count, 0);
 
+#ifdef CONFIG_TRANSPARENT_HUGEPAGE_PAGECACHE
+	init_rwsem(&inode->i_split_sem);
+#endif
 	mapping->a_ops = &empty_aops;
 	mapping->host = inode;
 	mapping->flags = 0;
diff --git a/include/linux/fs.h b/include/linux/fs.h
index 3f40547ba1..26801f0bb1 100644
--- a/include/linux/fs.h
+++ b/include/linux/fs.h
@@ -610,6 +610,9 @@ struct inode {
 	atomic_t		i_readcount; /* struct files open RO */
 #endif
 	void			*i_private; /* fs or device private pointer */
+#ifdef CONFIG_TRANSPARENT_HUGEPAGE_PAGECACHE
+	struct rw_semaphore	i_split_sem;
+#endif
 };
 
 static inline int inode_unhashed(struct inode *inode)
diff --git a/include/linux/huge_mm.h b/include/linux/huge_mm.h
index 3700ada4d2..ce9fcae8ef 100644
--- a/include/linux/huge_mm.h
+++ b/include/linux/huge_mm.h
@@ -241,12 +241,22 @@ static inline int do_huge_pmd_numa_page(struct mm_struct *mm, struct vm_area_str
 #define HPAGE_CACHE_NR         (1L << HPAGE_CACHE_ORDER)
 #define HPAGE_CACHE_INDEX_MASK (HPAGE_CACHE_NR - 1)
 
+#define i_split_down_read(inode) down_read(&inode->i_split_sem)
+#define i_split_up_read(inode) up_read(&inode->i_split_sem)
+
 #else
 
 #define HPAGE_CACHE_ORDER      ({ BUILD_BUG(); 0; })
 #define HPAGE_CACHE_NR         ({ BUILD_BUG(); 0; })
 #define HPAGE_CACHE_INDEX_MASK ({ BUILD_BUG(); 0; })
 
+static inline void i_split_down_read(struct inode *inode)
+{
+}
+
+static inline void i_split_up_read(struct inode *inode)
+{
+}
 #endif
 
 static inline bool transparent_hugepage_pagecache(void)
-- 
1.8.4.rc3


^ permalink raw reply related	[flat|nested] 48+ messages in thread

* [PATCHv6 14/22] thp, mm: allocate huge pages in grab_cache_page_write_begin()
  2013-09-23 12:05 [PATCHv6 00/22] Transparent huge page cache: phase 1, everything but mmap() Kirill A. Shutemov
                   ` (12 preceding siblings ...)
  2013-09-23 12:05 ` [PATCHv6 13/22] mm, vfs: introduce i_split_sem Kirill A. Shutemov
@ 2013-09-23 12:05 ` Kirill A. Shutemov
  2013-09-23 12:05 ` [PATCHv6 15/22] thp, mm: naive support of thp in generic_perform_write Kirill A. Shutemov
                   ` (10 subsequent siblings)
  24 siblings, 0 replies; 48+ messages in thread
From: Kirill A. Shutemov @ 2013-09-23 12:05 UTC (permalink / raw)
  To: Andrea Arcangeli, Andrew Morton
  Cc: Al Viro, Hugh Dickins, Wu Fengguang, Jan Kara, Mel Gorman,
	linux-mm, Andi Kleen, Matthew Wilcox, Kirill A. Shutemov,
	Hillf Danton, Dave Hansen, Ning Qu, Alexander Shishkin,
	linux-fsdevel, linux-kernel, Kirill A. Shutemov

Try to allocate a huge page if AOP_FLAG_TRANSHUGE is set in flags.

If, for some reason, it's not possible to allocate a huge page at this
position, return NULL. The caller should take care of falling back to
small pages.
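
For illustration, a sketch of the expected calling convention (hypothetical
caller; the libfs patch later in the series does something very similar):

	page = grab_cache_page_write_begin(mapping,
			index & ~HPAGE_CACHE_INDEX_MASK,
			flags | AOP_FLAG_TRANSHUGE);
	if (!page) {
		/* no huge page here: fall back to a small page */
		page = grab_cache_page_write_begin(mapping, index, flags);
	}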

Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
---
 include/linux/fs.h |  1 +
 mm/filemap.c       | 23 +++++++++++++++++++++--
 2 files changed, 22 insertions(+), 2 deletions(-)

diff --git a/include/linux/fs.h b/include/linux/fs.h
index 26801f0bb1..42ccdeddd9 100644
--- a/include/linux/fs.h
+++ b/include/linux/fs.h
@@ -282,6 +282,7 @@ enum positive_aop_returns {
 #define AOP_FLAG_NOFS			0x0004 /* used by filesystem to direct
 						* helper code (eg buffer layer)
 						* to clear GFP_FS from alloc */
+#define AOP_FLAG_TRANSHUGE		0x0008 /* allocate transhuge page */
 
 /*
  * oh the beauties of C type declarations.
diff --git a/mm/filemap.c b/mm/filemap.c
index 3421bcaed4..410879a801 100644
--- a/mm/filemap.c
+++ b/mm/filemap.c
@@ -2322,18 +2322,37 @@ struct page *grab_cache_page_write_begin(struct address_space *mapping,
 	gfp_t gfp_mask;
 	struct page *page;
 	gfp_t gfp_notmask = 0;
+	bool must_use_thp = (flags & AOP_FLAG_TRANSHUGE) &&
+		IS_ENABLED(CONFIG_TRANSPARENT_HUGEPAGE_PAGECACHE);
 
 	gfp_mask = mapping_gfp_mask(mapping);
+	if (must_use_thp) {
+		BUG_ON(index & HPAGE_CACHE_INDEX_MASK);
+		BUG_ON(!(gfp_mask & __GFP_COMP));
+	}
 	if (mapping_cap_account_dirty(mapping))
 		gfp_mask |= __GFP_WRITE;
 	if (flags & AOP_FLAG_NOFS)
 		gfp_notmask = __GFP_FS;
 repeat:
 	page = find_lock_page(mapping, index);
-	if (page)
+	if (page) {
+		if (must_use_thp && !PageTransHuge(page)) {
+			unlock_page(page);
+			page_cache_release(page);
+			return NULL;
+		}
 		goto found;
+	}
 
-	page = __page_cache_alloc(gfp_mask & ~gfp_notmask);
+	if (must_use_thp) {
+		page = alloc_pages(gfp_mask & ~gfp_notmask, HPAGE_PMD_ORDER);
+		if (page)
+			count_vm_event(THP_WRITE_ALLOC);
+		else
+			count_vm_event(THP_WRITE_ALLOC_FAILED);
+	} else
+		page = __page_cache_alloc(gfp_mask & ~gfp_notmask);
 	if (!page)
 		return NULL;
 	status = add_to_page_cache_lru(page, mapping, index,
-- 
1.8.4.rc3


^ permalink raw reply related	[flat|nested] 48+ messages in thread

* [PATCHv6 15/22] thp, mm: naive support of thp in generic_perform_write
  2013-09-23 12:05 [PATCHv6 00/22] Transparent huge page cache: phase 1, everything but mmap() Kirill A. Shutemov
                   ` (13 preceding siblings ...)
  2013-09-23 12:05 ` [PATCHv6 14/22] thp, mm: allocate huge pages in grab_cache_page_write_begin() Kirill A. Shutemov
@ 2013-09-23 12:05 ` Kirill A. Shutemov
  2013-09-23 12:05 ` [PATCHv6 16/22] thp, mm: handle transhuge pages in do_generic_file_read() Kirill A. Shutemov
                   ` (9 subsequent siblings)
  24 siblings, 0 replies; 48+ messages in thread
From: Kirill A. Shutemov @ 2013-09-23 12:05 UTC (permalink / raw)
  To: Andrea Arcangeli, Andrew Morton
  Cc: Al Viro, Hugh Dickins, Wu Fengguang, Jan Kara, Mel Gorman,
	linux-mm, Andi Kleen, Matthew Wilcox, Kirill A. Shutemov,
	Hillf Danton, Dave Hansen, Ning Qu, Alexander Shishkin,
	linux-fsdevel, linux-kernel, Kirill A. Shutemov

For now we still write/read at most PAGE_CACHE_SIZE bytes at a time.

This implementation doesn't cover address spaces with backing storage.
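
For illustration, a worked example of the subpage arithmetic below
(assuming x86-64 with 4k pages and 2M huge pages):

	/*
	 * pos = start of the huge page + 5 * 4096 + 100
	 * offset     = pos & (PAGE_CACHE_SIZE - 1)                 = 100
	 * subpage_nr = (pos & ~HPAGE_PMD_MASK) >> PAGE_CACHE_SHIFT = 5
	 *
	 * so the copy goes into page + 5 at offset 100, still at most
	 * PAGE_CACHE_SIZE - offset bytes at a time.
	 */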

Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
---
 mm/filemap.c | 11 ++++++++++-
 1 file changed, 10 insertions(+), 1 deletion(-)

diff --git a/mm/filemap.c b/mm/filemap.c
index 410879a801..38d6856737 100644
--- a/mm/filemap.c
+++ b/mm/filemap.c
@@ -2384,12 +2384,14 @@ static ssize_t generic_perform_write(struct file *file,
 	if (segment_eq(get_fs(), KERNEL_DS))
 		flags |= AOP_FLAG_UNINTERRUPTIBLE;
 
+	i_split_down_read(mapping->host);
 	do {
 		struct page *page;
 		unsigned long offset;	/* Offset into pagecache page */
 		unsigned long bytes;	/* Bytes to write to page */
 		size_t copied;		/* Bytes copied from user */
 		void *fsdata;
+		int subpage_nr = 0;
 
 		offset = (pos & (PAGE_CACHE_SIZE - 1));
 		bytes = min_t(unsigned long, PAGE_CACHE_SIZE - offset,
@@ -2419,8 +2421,14 @@ again:
 		if (mapping_writably_mapped(mapping))
 			flush_dcache_page(page);
 
+		if (PageTransHuge(page)) {
+			off_t huge_offset = pos & ~HPAGE_PMD_MASK;
+			subpage_nr = huge_offset >> PAGE_CACHE_SHIFT;
+		}
+
 		pagefault_disable();
-		copied = iov_iter_copy_from_user_atomic(page, i, offset, bytes);
+		copied = iov_iter_copy_from_user_atomic(page + subpage_nr, i,
+				offset, bytes);
 		pagefault_enable();
 		flush_dcache_page(page);
 
@@ -2457,6 +2465,7 @@ again:
 		}
 	} while (iov_iter_count(i));
 
+	i_split_up_read(mapping->host);
 	return written ? written : status;
 }
 
-- 
1.8.4.rc3


^ permalink raw reply related	[flat|nested] 48+ messages in thread

* [PATCHv6 16/22] thp, mm: handle transhuge pages in do_generic_file_read()
  2013-09-23 12:05 [PATCHv6 00/22] Transparent huge page cache: phase 1, everything but mmap() Kirill A. Shutemov
                   ` (14 preceding siblings ...)
  2013-09-23 12:05 ` [PATCHv6 15/22] thp, mm: naive support of thp in generic_perform_write Kirill A. Shutemov
@ 2013-09-23 12:05 ` Kirill A. Shutemov
  2013-09-23 12:05 ` [PATCHv6 17/22] thp, libfs: initial thp support Kirill A. Shutemov
                   ` (8 subsequent siblings)
  24 siblings, 0 replies; 48+ messages in thread
From: Kirill A. Shutemov @ 2013-09-23 12:05 UTC (permalink / raw)
  To: Andrea Arcangeli, Andrew Morton
  Cc: Al Viro, Hugh Dickins, Wu Fengguang, Jan Kara, Mel Gorman,
	linux-mm, Andi Kleen, Matthew Wilcox, Kirill A. Shutemov,
	Hillf Danton, Dave Hansen, Ning Qu, Alexander Shishkin,
	linux-fsdevel, linux-kernel, Kirill A. Shutemov

If a transhuge page is already in the page cache (up to date and not under
readahead) we take the usual path: read from the relevant subpage (head or
tail).

If the page is not cached (a sparse file in the ramfs case) and the mapping
can have huge pages, we try to allocate a new huge page and read it.

If a page is not up to date or is under readahead, we have to move 'page' to
the head page of the compound page, since it represents the state of the
whole transhuge page. We switch back to the relevant subpage once the page is
ready to be read (the 'page_ok' label).
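
For illustration, a worked example of the new helpers (assuming x86-64 with
4k pages and 2M huge pages), with *ppos = 3M + 4k = 0x301000:

	/*
	 * small page in cache:              transhuge page in cache:
	 *   pos_to_index(page, pos) = 0x301   pos_to_index(page, pos) = 1
	 *   pos_to_off(page, pos)   = 0       pos_to_off(page, pos)   = 0x101000
	 *
	 * i.e. for a transhuge page we read from the head page at an offset
	 * within the whole 2M area rather than within a single 4k page.
	 */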

Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
---
 mm/filemap.c | 91 +++++++++++++++++++++++++++++++++++++++++++-----------------
 1 file changed, 66 insertions(+), 25 deletions(-)

diff --git a/mm/filemap.c b/mm/filemap.c
index 38d6856737..9bbc024e4c 100644
--- a/mm/filemap.c
+++ b/mm/filemap.c
@@ -1122,6 +1122,27 @@ static void shrink_readahead_size_eio(struct file *filp,
 	ra->ra_pages /= 4;
 }
 
+static unsigned long page_cache_mask(struct page *page)
+{
+	if (PageTransHugeCache(page))
+		return HPAGE_PMD_MASK;
+	else
+		return PAGE_CACHE_MASK;
+}
+
+static unsigned long pos_to_off(struct page *page, loff_t pos)
+{
+	return pos & ~page_cache_mask(page);
+}
+
+static unsigned long pos_to_index(struct page *page, loff_t pos)
+{
+	if (PageTransHugeCache(page))
+		return pos >> HPAGE_PMD_SHIFT;
+	else
+		return pos >> PAGE_CACHE_SHIFT;
+}
+
 /**
  * do_generic_file_read - generic file read routine
  * @filp:	the file to read
@@ -1143,17 +1164,12 @@ static void do_generic_file_read(struct file *filp, loff_t *ppos,
 	struct file_ra_state *ra = &filp->f_ra;
 	pgoff_t index;
 	pgoff_t last_index;
-	pgoff_t prev_index;
-	unsigned long offset;      /* offset into pagecache page */
-	unsigned int prev_offset;
 	int error;
 
 	index = *ppos >> PAGE_CACHE_SHIFT;
-	prev_index = ra->prev_pos >> PAGE_CACHE_SHIFT;
-	prev_offset = ra->prev_pos & (PAGE_CACHE_SIZE-1);
 	last_index = (*ppos + desc->count + PAGE_CACHE_SIZE-1) >> PAGE_CACHE_SHIFT;
-	offset = *ppos & ~PAGE_CACHE_MASK;
 
+	i_split_down_read(inode);
 	for (;;) {
 		struct page *page;
 		pgoff_t end_index;
@@ -1172,8 +1188,12 @@ find_page:
 					ra, filp,
 					index, last_index - index);
 			page = find_get_page(mapping, index);
-			if (unlikely(page == NULL))
-				goto no_cached_page;
+			if (unlikely(page == NULL)) {
+				if (mapping_can_have_hugepages(mapping))
+					goto no_cached_page_thp;
+				else
+					goto no_cached_page;
+			}
 		}
 		if (PageReadahead(page)) {
 			page_cache_async_readahead(mapping,
@@ -1190,7 +1210,7 @@ find_page:
 			if (!page->mapping)
 				goto page_not_up_to_date_locked;
 			if (!mapping->a_ops->is_partially_uptodate(page,
-								desc, offset))
+						desc, pos_to_off(page, *ppos)))
 				goto page_not_up_to_date_locked;
 			unlock_page(page);
 		}
@@ -1206,21 +1226,25 @@ page_ok:
 
 		isize = i_size_read(inode);
 		end_index = (isize - 1) >> PAGE_CACHE_SHIFT;
+		if (PageTransHugeCache(page)) {
+			index &= ~HPAGE_CACHE_INDEX_MASK;
+			end_index &= ~HPAGE_CACHE_INDEX_MASK;
+		}
 		if (unlikely(!isize || index > end_index)) {
 			page_cache_release(page);
 			goto out;
 		}
 
 		/* nr is the maximum number of bytes to copy from this page */
-		nr = PAGE_CACHE_SIZE;
+		nr = PAGE_CACHE_SIZE << compound_order(page);
 		if (index == end_index) {
-			nr = ((isize - 1) & ~PAGE_CACHE_MASK) + 1;
-			if (nr <= offset) {
+			nr = ((isize - 1) & ~page_cache_mask(page)) + 1;
+			if (nr <= pos_to_off(page, *ppos)) {
 				page_cache_release(page);
 				goto out;
 			}
 		}
-		nr = nr - offset;
+		nr = nr - pos_to_off(page, *ppos);
 
 		/* If users can be writing to this page using arbitrary
 		 * virtual addresses, take care about potential aliasing
@@ -1233,9 +1257,10 @@ page_ok:
 		 * When a sequential read accesses a page several times,
 		 * only mark it as accessed the first time.
 		 */
-		if (prev_index != index || offset != prev_offset)
+		if (pos_to_index(page, ra->prev_pos) != index ||
+				pos_to_off(page, *ppos) !=
+				pos_to_off(page, ra->prev_pos))
 			mark_page_accessed(page);
-		prev_index = index;
 
 		/*
 		 * Ok, we have the page, and it's up-to-date, so
@@ -1247,11 +1272,10 @@ page_ok:
 		 * "pos" here (the actor routine has to update the user buffer
 		 * pointers and the remaining count).
 		 */
-		ret = actor(desc, page, offset, nr);
-		offset += ret;
-		index += offset >> PAGE_CACHE_SHIFT;
-		offset &= ~PAGE_CACHE_MASK;
-		prev_offset = offset;
+		ret = actor(desc, page, pos_to_off(page, *ppos), nr);
+		ra->prev_pos = *ppos;
+		*ppos += ret;
+		index = *ppos >> PAGE_CACHE_SHIFT;
 
 		page_cache_release(page);
 		if (ret == nr && desc->count)
@@ -1325,6 +1349,27 @@ readpage_error:
 		page_cache_release(page);
 		goto out;
 
+no_cached_page_thp:
+		page = alloc_pages(mapping_gfp_mask(mapping) | __GFP_COLD,
+				HPAGE_PMD_ORDER);
+		if (!page) {
+			count_vm_event(THP_READ_ALLOC_FAILED);
+			goto no_cached_page;
+		}
+		count_vm_event(THP_READ_ALLOC);
+
+		error = add_to_page_cache_lru(page, mapping,
+				pos_to_index(page, *ppos), GFP_KERNEL);
+		if (!error)
+			goto readpage;
+
+		page_cache_release(page);
+		if (error != -EEXIST && error != -ENOSPC) {
+			desc->error = error;
+			goto out;
+		}
+
+		/* Fallback to small page */
 no_cached_page:
 		/*
 		 * Ok, it wasn't cached, so we need to create a new
@@ -1348,11 +1393,7 @@ no_cached_page:
 	}
 
 out:
-	ra->prev_pos = prev_index;
-	ra->prev_pos <<= PAGE_CACHE_SHIFT;
-	ra->prev_pos |= prev_offset;
-
-	*ppos = ((loff_t)index << PAGE_CACHE_SHIFT) + offset;
+	i_split_up_read(inode);
 	file_accessed(filp);
 }
 
-- 
1.8.4.rc3


^ permalink raw reply related	[flat|nested] 48+ messages in thread

* [PATCHv6 17/22] thp, libfs: initial thp support
  2013-09-23 12:05 [PATCHv6 00/22] Transparent huge page cache: phase 1, everything but mmap() Kirill A. Shutemov
                   ` (15 preceding siblings ...)
  2013-09-23 12:05 ` [PATCHv6 16/22] thp, mm: handle transhuge pages in do_generic_file_read() Kirill A. Shutemov
@ 2013-09-23 12:05 ` Kirill A. Shutemov
  2013-09-23 12:05 ` [PATCHv6 18/22] truncate: support huge pages Kirill A. Shutemov
                   ` (7 subsequent siblings)
  24 siblings, 0 replies; 48+ messages in thread
From: Kirill A. Shutemov @ 2013-09-23 12:05 UTC (permalink / raw)
  To: Andrea Arcangeli, Andrew Morton
  Cc: Al Viro, Hugh Dickins, Wu Fengguang, Jan Kara, Mel Gorman,
	linux-mm, Andi Kleen, Matthew Wilcox, Kirill A. Shutemov,
	Hillf Danton, Dave Hansen, Ning Qu, Alexander Shishkin,
	linux-fsdevel, linux-kernel, Kirill A. Shutemov

simple_readpage() and simple_write_end() are modified to handle huge
pages.

simple_thp_write_begin() is introduced to allocate huge pages on write.
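
For illustration, a sketch of how a filesystem is expected to wire this in
(the name my_thp_aops is made up; the ramfs patch at the end of the series
does exactly this):

	static const struct address_space_operations my_thp_aops = {
		.readpage	= simple_readpage,
		.write_begin	= simple_thp_write_begin,
		.write_end	= simple_write_end,
	};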

Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
---
 fs/libfs.c              | 58 +++++++++++++++++++++++++++++++++++++++++++++----
 include/linux/fs.h      |  7 ++++++
 include/linux/pagemap.h |  8 +++++++
 3 files changed, 69 insertions(+), 4 deletions(-)

diff --git a/fs/libfs.c b/fs/libfs.c
index 3a3a9b53bf..807f66098e 100644
--- a/fs/libfs.c
+++ b/fs/libfs.c
@@ -364,7 +364,7 @@ EXPORT_SYMBOL(simple_setattr);
 
 int simple_readpage(struct file *file, struct page *page)
 {
-	clear_highpage(page);
+	clear_pagecache_page(page);
 	flush_dcache_page(page);
 	SetPageUptodate(page);
 	unlock_page(page);
@@ -424,9 +424,14 @@ int simple_write_end(struct file *file, struct address_space *mapping,
 
 	/* zero the stale part of the page if we did a short copy */
 	if (copied < len) {
-		unsigned from = pos & (PAGE_CACHE_SIZE - 1);
-
-		zero_user(page, from + copied, len - copied);
+		unsigned from;
+		if (PageTransHugeCache(page)) {
+			from = pos & ~HPAGE_PMD_MASK;
+			zero_huge_user(page, from + copied, len - copied);
+		} else {
+			from = pos & ~PAGE_CACHE_MASK;
+			zero_user(page, from + copied, len - copied);
+		}
 	}
 
 	if (!PageUptodate(page))
@@ -445,6 +450,51 @@ int simple_write_end(struct file *file, struct address_space *mapping,
 	return copied;
 }
 
+#ifdef CONFIG_TRANSPARENT_HUGEPAGE_PAGECACHE
+int simple_thp_write_begin(struct file *file, struct address_space *mapping,
+		loff_t pos, unsigned len, unsigned flags,
+		struct page **pagep, void **fsdata)
+{
+	struct page *page = NULL;
+	pgoff_t index;
+
+	index = pos >> PAGE_CACHE_SHIFT;
+
+	/*
+	 * Do not allocate a huge page in the first huge page range in page
+	 * cache. This way we can avoid most small files overhead.
+	 */
+	if (mapping_can_have_hugepages(mapping) &&
+			 pos >= HPAGE_PMD_SIZE) {
+		page = grab_cache_page_write_begin(mapping,
+				index & ~HPAGE_CACHE_INDEX_MASK,
+				flags | AOP_FLAG_TRANSHUGE);
+		/* fallback to small page */
+		if (!page) {
+			unsigned long offset;
+			offset = pos & ~PAGE_CACHE_MASK;
+			/* adjust the len to not cross small page boundary */
+			len = min_t(unsigned long,
+					len, PAGE_CACHE_SIZE - offset);
+		}
+		BUG_ON(page && !PageTransHuge(page));
+	}
+	if (!page)
+		return simple_write_begin(file, mapping, pos, len, flags,
+				pagep, fsdata);
+
+	*pagep = page;
+
+	if (!PageUptodate(page) && len != HPAGE_PMD_SIZE) {
+		unsigned from = pos & ~HPAGE_PMD_MASK;
+
+		zero_huge_user_segment(page, 0, from);
+		zero_huge_user_segment(page, from + len, HPAGE_PMD_SIZE);
+	}
+	return 0;
+}
+#endif
+
 /*
  * the inodes created here are not hashed. If you use iunique to generate
  * unique inode values later for this filesystem, then you must take care
diff --git a/include/linux/fs.h b/include/linux/fs.h
index 42ccdeddd9..71a5ce4472 100644
--- a/include/linux/fs.h
+++ b/include/linux/fs.h
@@ -2566,6 +2566,13 @@ extern int simple_write_begin(struct file *file, struct address_space *mapping,
 extern int simple_write_end(struct file *file, struct address_space *mapping,
 			loff_t pos, unsigned len, unsigned copied,
 			struct page *page, void *fsdata);
+#ifdef CONFIG_TRANSPARENT_HUGEPAGE_PAGECACHE
+extern int simple_thp_write_begin(struct file *file,
+		struct address_space *mapping, loff_t pos, unsigned len,
+		unsigned flags,	struct page **pagep, void **fsdata);
+#else
+#define simple_thp_write_begin simple_write_begin
+#endif
 
 extern struct dentry *simple_lookup(struct inode *, struct dentry *, unsigned int flags);
 extern ssize_t generic_read_dir(struct file *, char __user *, size_t, loff_t *);
diff --git a/include/linux/pagemap.h b/include/linux/pagemap.h
index ad60dcc50e..967aadbc5e 100644
--- a/include/linux/pagemap.h
+++ b/include/linux/pagemap.h
@@ -572,4 +572,12 @@ static inline int add_to_page_cache(struct page *page,
 	return error;
 }
 
+static inline void clear_pagecache_page(struct page *page)
+{
+	if (PageTransHuge(page))
+		zero_huge_user(page, 0, HPAGE_PMD_SIZE);
+	else
+		clear_highpage(page);
+}
+
 #endif /* _LINUX_PAGEMAP_H */
-- 
1.8.4.rc3


^ permalink raw reply related	[flat|nested] 48+ messages in thread

* [PATCHv6 18/22] truncate: support huge pages
  2013-09-23 12:05 [PATCHv6 00/22] Transparent huge page cache: phase 1, everything but mmap() Kirill A. Shutemov
                   ` (16 preceding siblings ...)
  2013-09-23 12:05 ` [PATCHv6 17/22] thp, libfs: initial thp support Kirill A. Shutemov
@ 2013-09-23 12:05 ` Kirill A. Shutemov
  2013-09-23 12:05 ` [PATCHv6 19/22] thp: handle file pages in split_huge_page() Kirill A. Shutemov
                   ` (6 subsequent siblings)
  24 siblings, 0 replies; 48+ messages in thread
From: Kirill A. Shutemov @ 2013-09-23 12:05 UTC (permalink / raw)
  To: Andrea Arcangeli, Andrew Morton
  Cc: Al Viro, Hugh Dickins, Wu Fengguang, Jan Kara, Mel Gorman,
	linux-mm, Andi Kleen, Matthew Wilcox, Kirill A. Shutemov,
	Hillf Danton, Dave Hansen, Ning Qu, Alexander Shishkin,
	linux-fsdevel, linux-kernel, Kirill A. Shutemov

truncate_inode_pages_range() drops a whole huge page at once if it's fully
inside the range.

If a huge page is only partly in the range we zero out that part,
exactly like we do for partial small pages.

In some cases it would be worth splitting the huge page instead, if we need to
truncate it partly and free some memory. But split_huge_page() currently
truncates the file itself, so we need to break the truncate<->split
interdependency at some point.

invalidate_mapping_pages() just skips huge pages if they are not fully
in the range.
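
For illustration, a worked example (assuming a 2M huge page caching file
offsets [2M, 4M)):

	/*
	 * truncate_inode_pages_range(mapping, 3M, -1):
	 *	the huge page is only partly inside the range, so it stays in
	 *	the page cache and we zero file range [3M, 4M), i.e. bytes
	 *	[1M, 2M) of the page.
	 *
	 * truncate_inode_pages_range(mapping, 2M, -1):
	 *	the huge page is fully inside the range and is dropped at once.
	 */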

Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Reviewed-by: Jan Kara <jack@suse.cz>
---
 include/linux/pagemap.h |   9 ++++
 mm/truncate.c           | 125 ++++++++++++++++++++++++++++++++++++++----------
 2 files changed, 109 insertions(+), 25 deletions(-)

diff --git a/include/linux/pagemap.h b/include/linux/pagemap.h
index 967aadbc5e..8ce130fe56 100644
--- a/include/linux/pagemap.h
+++ b/include/linux/pagemap.h
@@ -580,4 +580,13 @@ static inline void clear_pagecache_page(struct page *page)
 		clear_highpage(page);
 }
 
+static inline void zero_pagecache_segment(struct page *page,
+		unsigned start, unsigned len)
+{
+	if (PageTransHugeCache(page))
+		zero_huge_user_segment(page, start, len);
+	else
+		zero_user_segment(page, start, len);
+}
+
 #endif /* _LINUX_PAGEMAP_H */
diff --git a/mm/truncate.c b/mm/truncate.c
index 353b683afd..ba62ab2168 100644
--- a/mm/truncate.c
+++ b/mm/truncate.c
@@ -203,10 +203,10 @@ int invalidate_inode_page(struct page *page)
 void truncate_inode_pages_range(struct address_space *mapping,
 				loff_t lstart, loff_t lend)
 {
+	struct inode	*inode = mapping->host;
 	pgoff_t		start;		/* inclusive */
 	pgoff_t		end;		/* exclusive */
-	unsigned int	partial_start;	/* inclusive */
-	unsigned int	partial_end;	/* exclusive */
+	bool		partial_start, partial_end;
 	struct pagevec	pvec;
 	pgoff_t		index;
 	int		i;
@@ -215,15 +215,13 @@ void truncate_inode_pages_range(struct address_space *mapping,
 	if (mapping->nrpages == 0)
 		return;
 
-	/* Offsets within partial pages */
+	/* Whether we have to do partial truncate */
 	partial_start = lstart & (PAGE_CACHE_SIZE - 1);
 	partial_end = (lend + 1) & (PAGE_CACHE_SIZE - 1);
 
 	/*
 	 * 'start' and 'end' always covers the range of pages to be fully
-	 * truncated. Partial pages are covered with 'partial_start' at the
-	 * start of the range and 'partial_end' at the end of the range.
-	 * Note that 'end' is exclusive while 'lend' is inclusive.
+	 * truncated. Note that 'end' is exclusive while 'lend' is inclusive.
 	 */
 	start = (lstart + PAGE_CACHE_SIZE - 1) >> PAGE_CACHE_SHIFT;
 	if (lend == -1)
@@ -236,10 +234,12 @@ void truncate_inode_pages_range(struct address_space *mapping,
 	else
 		end = (lend + 1) >> PAGE_CACHE_SHIFT;
 
+	i_split_down_read(inode);
 	pagevec_init(&pvec, 0);
 	index = start;
 	while (index < end && pagevec_lookup(&pvec, mapping, index,
 			min(end - index, (pgoff_t)PAGEVEC_SIZE))) {
+		bool thp = false;
 		mem_cgroup_uncharge_start();
 		for (i = 0; i < pagevec_count(&pvec); i++) {
 			struct page *page = pvec.pages[i];
@@ -249,6 +249,23 @@ void truncate_inode_pages_range(struct address_space *mapping,
 			if (index >= end)
 				break;
 
+			thp = PageTransHugeCache(page);
+			if (thp) {
+				/* the range starts in middle of huge page */
+			       if (index < start) {
+				       partial_start = true;
+				       start = index + HPAGE_CACHE_NR;
+				       break;
+			       }
+
+			       /* the range ends on huge page */
+			       if (index == (end & ~HPAGE_CACHE_INDEX_MASK)) {
+				       partial_end = true;
+				       end = index;
+				       break;
+			       }
+			}
+
 			if (!trylock_page(page))
 				continue;
 			WARN_ON(page->index != index);
@@ -258,54 +275,88 @@ void truncate_inode_pages_range(struct address_space *mapping,
 			}
 			truncate_inode_page(mapping, page);
 			unlock_page(page);
+			if (thp)
+				break;
 		}
 		pagevec_release(&pvec);
 		mem_cgroup_uncharge_end();
 		cond_resched();
-		index++;
+		if (thp)
+			index += HPAGE_CACHE_NR;
+		else
+			index++;
 	}
 
 	if (partial_start) {
-		struct page *page = find_lock_page(mapping, start - 1);
+		struct page *page;
+
+		page = find_get_page(mapping, start - 1);
 		if (page) {
-			unsigned int top = PAGE_CACHE_SIZE;
-			if (start > end) {
-				/* Truncation within a single page */
-				top = partial_end;
-				partial_end = 0;
+			pgoff_t index_mask;
+			loff_t page_cache_mask;
+			unsigned pstart, pend;
+
+			if (PageTransHugeCache(page)) {
+				index_mask = HPAGE_CACHE_INDEX_MASK;
+				page_cache_mask = HPAGE_PMD_MASK;
+			} else {
+				index_mask = 0UL;
+				page_cache_mask = PAGE_CACHE_MASK;
 			}
+
+			pstart = lstart & ~page_cache_mask;
+			if ((end & ~index_mask) == page->index) {
+				pend = (lend + 1) & ~page_cache_mask;
+				end = page->index;
+				partial_end = false; /* handled here */
+			} else
+				pend = PAGE_CACHE_SIZE << compound_order(page);
+
+			lock_page(page);
 			wait_on_page_writeback(page);
-			zero_user_segment(page, partial_start, top);
+			zero_pagecache_segment(page, pstart, pend);
 			cleancache_invalidate_page(mapping, page);
 			if (page_has_private(page))
-				do_invalidatepage(page, partial_start,
-						  top - partial_start);
+				do_invalidatepage(page, pstart,
+						pend - pstart);
 			unlock_page(page);
 			page_cache_release(page);
 		}
 	}
 	if (partial_end) {
-		struct page *page = find_lock_page(mapping, end);
+		struct page *page;
+
+		page = find_lock_page(mapping, end);
 		if (page) {
+			loff_t page_cache_mask;
+			unsigned pend;
+
+			if (PageTransHugeCache(page))
+				page_cache_mask = HPAGE_PMD_MASK;
+			else
+				page_cache_mask = PAGE_CACHE_MASK;
+			pend = (lend + 1) & ~page_cache_mask;
+			end = page->index;
 			wait_on_page_writeback(page);
-			zero_user_segment(page, 0, partial_end);
+			zero_pagecache_segment(page, 0, pend);
 			cleancache_invalidate_page(mapping, page);
 			if (page_has_private(page))
-				do_invalidatepage(page, 0,
-						  partial_end);
+				do_invalidatepage(page, 0, pend);
 			unlock_page(page);
 			page_cache_release(page);
 		}
 	}
 	/*
-	 * If the truncation happened within a single page no pages
-	 * will be released, just zeroed, so we can bail out now.
+	 * If the truncation happened within a single page no
+	 * pages will be released, just zeroed, so we can bail
+	 * out now.
 	 */
 	if (start >= end)
-		return;
+		goto out;
 
 	index = start;
 	for ( ; ; ) {
+		bool thp = false;
 		cond_resched();
 		if (!pagevec_lookup(&pvec, mapping, index,
 			min(end - index, (pgoff_t)PAGEVEC_SIZE))) {
@@ -327,16 +378,24 @@ void truncate_inode_pages_range(struct address_space *mapping,
 			if (index >= end)
 				break;
 
+			thp = PageTransHugeCache(page);
 			lock_page(page);
 			WARN_ON(page->index != index);
 			wait_on_page_writeback(page);
 			truncate_inode_page(mapping, page);
 			unlock_page(page);
+			if (thp)
+				break;
 		}
 		pagevec_release(&pvec);
 		mem_cgroup_uncharge_end();
-		index++;
+		if (thp)
+			index += HPAGE_CACHE_NR;
+		else
+			index++;
 	}
+out:
+	i_split_up_read(inode);
 	cleancache_invalidate_inode(mapping);
 }
 EXPORT_SYMBOL(truncate_inode_pages_range);
@@ -375,6 +434,7 @@ EXPORT_SYMBOL(truncate_inode_pages);
 unsigned long invalidate_mapping_pages(struct address_space *mapping,
 		pgoff_t start, pgoff_t end)
 {
+	struct inode *inode = mapping->host;
 	struct pagevec pvec;
 	pgoff_t index = start;
 	unsigned long ret;
@@ -389,9 +449,11 @@ unsigned long invalidate_mapping_pages(struct address_space *mapping,
 	 * (most pages are dirty), and already skips over any difficulties.
 	 */
 
+	i_split_down_read(inode);
 	pagevec_init(&pvec, 0);
 	while (index <= end && pagevec_lookup(&pvec, mapping, index,
 			min(end - index, (pgoff_t)PAGEVEC_SIZE - 1) + 1)) {
+		bool thp = false;
 		mem_cgroup_uncharge_start();
 		for (i = 0; i < pagevec_count(&pvec); i++) {
 			struct page *page = pvec.pages[i];
@@ -401,6 +463,15 @@ unsigned long invalidate_mapping_pages(struct address_space *mapping,
 			if (index > end)
 				break;
 
+			/* skip huge page if it's not fully in the range */
+			thp = PageTransHugeCache(page);
+			if (thp) {
+			       if (index < start)
+				       break;
+			       if (index == (end & ~HPAGE_CACHE_INDEX_MASK))
+				       break;
+			}
+
 			if (!trylock_page(page))
 				continue;
 			WARN_ON(page->index != index);
@@ -417,8 +488,12 @@ unsigned long invalidate_mapping_pages(struct address_space *mapping,
 		pagevec_release(&pvec);
 		mem_cgroup_uncharge_end();
 		cond_resched();
-		index++;
+		if (thp)
+			index += HPAGE_CACHE_NR;
+		else
+			index++;
 	}
+	i_split_up_read(inode);
 	return count;
 }
 EXPORT_SYMBOL(invalidate_mapping_pages);
-- 
1.8.4.rc3


^ permalink raw reply related	[flat|nested] 48+ messages in thread

* [PATCHv6 19/22] thp: handle file pages in split_huge_page()
  2013-09-23 12:05 [PATCHv6 00/22] Transparent huge page cache: phase 1, everything but mmap() Kirill A. Shutemov
                   ` (17 preceding siblings ...)
  2013-09-23 12:05 ` [PATCHv6 18/22] truncate: support huge pages Kirill A. Shutemov
@ 2013-09-23 12:05 ` Kirill A. Shutemov
  2013-09-23 12:05 ` [PATCHv6 20/22] thp: wait_split_huge_page(): serialize over i_mmap_mutex too Kirill A. Shutemov
                   ` (5 subsequent siblings)
  24 siblings, 0 replies; 48+ messages in thread
From: Kirill A. Shutemov @ 2013-09-23 12:05 UTC (permalink / raw)
  To: Andrea Arcangeli, Andrew Morton
  Cc: Al Viro, Hugh Dickins, Wu Fengguang, Jan Kara, Mel Gorman,
	linux-mm, Andi Kleen, Matthew Wilcox, Kirill A. Shutemov,
	Hillf Danton, Dave Hansen, Ning Qu, Alexander Shishkin,
	linux-fsdevel, linux-kernel, Kirill A. Shutemov

The base scheme is the same as for anonymous pages, but we walk
mapping->i_mmap rather than anon_vma->rb_root.

When we add a huge page to the page cache we take a reference only on the
head page, but on split we need to take an additional reference on all tail
pages since they are still in the page cache after splitting.
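
For illustration, the radix tree state around the split (assuming a huge page
at page cache index N, with HPAGE_CACHE_NR == 512):

	/*
	 * before split: slots N, N+1, ..., N+511 all point to the head page
	 * after split:  slot N points to the head page,
	 *               slot N+i (i > 0) points to page + i, a former tail
	 *
	 * __split_file_mapping() rewrites slots N+1 .. N+511 under
	 * mapping->tree_lock; the extra reference taken on each tail page in
	 * __split_huge_page_refcount() is what keeps the tails pinned by the
	 * page cache afterwards.
	 */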

Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
---
 mm/huge_memory.c | 120 +++++++++++++++++++++++++++++++++++++++++++++++++------
 1 file changed, 107 insertions(+), 13 deletions(-)

diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index 59f099b93f..3c45c62cde 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -1584,6 +1584,7 @@ static void __split_huge_page_refcount(struct page *page,
 	struct zone *zone = page_zone(page);
 	struct lruvec *lruvec;
 	int tail_count = 0;
+	int initial_tail_refcount;
 
 	/* prevent PageLRU to go away from under us, and freeze lru stats */
 	spin_lock_irq(&zone->lru_lock);
@@ -1593,6 +1594,13 @@ static void __split_huge_page_refcount(struct page *page,
 	/* complete memcg works before add pages to LRU */
 	mem_cgroup_split_huge_fixup(page);
 
+	/*
+	 * When we add a huge page to page cache we take only reference to head
+	 * page, but on split we need to take addition reference to all tail
+	 * pages since they are still in page cache after splitting.
+	 */
+	initial_tail_refcount = PageAnon(page) ? 0 : 1;
+
 	for (i = HPAGE_PMD_NR - 1; i >= 1; i--) {
 		struct page *page_tail = page + i;
 
@@ -1615,8 +1623,9 @@ static void __split_huge_page_refcount(struct page *page,
 		 * atomic_set() here would be safe on all archs (and
 		 * not only on x86), it's safer to use atomic_add().
 		 */
-		atomic_add(page_mapcount(page) + page_mapcount(page_tail) + 1,
-			   &page_tail->_count);
+		atomic_add(initial_tail_refcount + page_mapcount(page) +
+				page_mapcount(page_tail) + 1,
+				&page_tail->_count);
 
 		/* after clearing PageTail the gup refcount can be released */
 		smp_mb();
@@ -1655,23 +1664,23 @@ static void __split_huge_page_refcount(struct page *page,
 		*/
 		page_tail->_mapcount = page->_mapcount;
 
-		BUG_ON(page_tail->mapping);
 		page_tail->mapping = page->mapping;
 
 		page_tail->index = page->index + i;
 		page_nid_xchg_last(page_tail, page_nid_last(page));
 
-		BUG_ON(!PageAnon(page_tail));
 		BUG_ON(!PageUptodate(page_tail));
 		BUG_ON(!PageDirty(page_tail));
-		BUG_ON(!PageSwapBacked(page_tail));
 
 		lru_add_page_tail(page, page_tail, lruvec, list);
 	}
 	atomic_sub(tail_count, &page->_count);
 	BUG_ON(atomic_read(&page->_count) <= 0);
 
-	__mod_zone_page_state(zone, NR_ANON_TRANSPARENT_HUGEPAGES, -1);
+	if (PageAnon(page))
+		__mod_zone_page_state(zone, NR_ANON_TRANSPARENT_HUGEPAGES, -1);
+	else
+		__mod_zone_page_state(zone, NR_FILE_TRANSPARENT_HUGEPAGES, -1);
 
 	ClearPageCompound(page);
 	compound_unlock(page);
@@ -1771,7 +1780,7 @@ static int __split_huge_page_map(struct page *page,
 }
 
 /* must be called with anon_vma->root->rwsem held */
-static void __split_huge_page(struct page *page,
+static void __split_anon_huge_page(struct page *page,
 			      struct anon_vma *anon_vma,
 			      struct list_head *list)
 {
@@ -1795,7 +1804,7 @@ static void __split_huge_page(struct page *page,
 	 * and establishes a child pmd before
 	 * __split_huge_page_splitting() freezes the parent pmd (so if
 	 * we fail to prevent copy_huge_pmd() from running until the
-	 * whole __split_huge_page() is complete), we will still see
+	 * whole __split_anon_huge_page() is complete), we will still see
 	 * the newly established pmd of the child later during the
 	 * walk, to be able to set it as pmd_trans_splitting too.
 	 */
@@ -1826,14 +1835,11 @@ static void __split_huge_page(struct page *page,
  * from the hugepage.
  * Return 0 if the hugepage is split successfully otherwise return 1.
  */
-int split_huge_page_to_list(struct page *page, struct list_head *list)
+static int split_anon_huge_page(struct page *page, struct list_head *list)
 {
 	struct anon_vma *anon_vma;
 	int ret = 1;
 
-	BUG_ON(is_huge_zero_page(page));
-	BUG_ON(!PageAnon(page));
-
 	/*
 	 * The caller does not necessarily hold an mmap_sem that would prevent
 	 * the anon_vma disappearing so we first we take a reference to it
@@ -1851,7 +1857,7 @@ int split_huge_page_to_list(struct page *page, struct list_head *list)
 		goto out_unlock;
 
 	BUG_ON(!PageSwapBacked(page));
-	__split_huge_page(page, anon_vma, list);
+	__split_anon_huge_page(page, anon_vma, list);
 	count_vm_event(THP_SPLIT);
 
 	BUG_ON(PageCompound(page));
@@ -1862,6 +1868,94 @@ out:
 	return ret;
 }
 
+#ifdef CONFIG_TRANSPARENT_HUGEPAGE_PAGECACHE
+static void __split_file_mapping(struct page *page)
+{
+	struct address_space *mapping = page->mapping;
+	struct radix_tree_iter iter;
+	void **slot;
+	int count = 1;
+
+	spin_lock(&mapping->tree_lock);
+	radix_tree_for_each_slot(slot, &mapping->page_tree,
+			&iter, page->index + 1) {
+		struct page *slot_page;
+
+		slot_page = radix_tree_deref_slot_protected(slot,
+				&mapping->tree_lock);
+		BUG_ON(slot_page != page);
+		radix_tree_replace_slot(slot, page + count);
+		if (++count == HPAGE_CACHE_NR)
+			break;
+	}
+	BUG_ON(count != HPAGE_CACHE_NR);
+	spin_unlock(&mapping->tree_lock);
+}
+
+static int split_file_huge_page(struct page *page, struct list_head *list)
+{
+	struct address_space *mapping = page->mapping;
+	struct inode *inode = mapping->host;
+	pgoff_t pgoff = page->index << (PAGE_CACHE_SHIFT - PAGE_SHIFT);
+	struct vm_area_struct *vma;
+	int mapcount, mapcount2;
+
+	BUG_ON(!PageHead(page));
+	BUG_ON(PageTail(page));
+
+	down_write(&inode->i_split_sem);
+	mutex_lock(&mapping->i_mmap_mutex);
+	mapcount = 0;
+	vma_interval_tree_foreach(vma, &mapping->i_mmap, pgoff, pgoff) {
+		unsigned long addr = vma_address(page, vma);
+		mapcount += __split_huge_page_splitting(page, vma, addr);
+	}
+
+	if (mapcount != page_mapcount(page))
+		printk(KERN_ERR "mapcount %d page_mapcount %d\n",
+		       mapcount, page_mapcount(page));
+	BUG_ON(mapcount != page_mapcount(page));
+
+	__split_huge_page_refcount(page, list);
+	__split_file_mapping(page);
+
+	mapcount2 = 0;
+	vma_interval_tree_foreach(vma, &mapping->i_mmap, pgoff, pgoff) {
+		unsigned long addr = vma_address(page, vma);
+		mapcount2 += __split_huge_page_map(page, vma, addr);
+	}
+
+	if (mapcount != mapcount2)
+		printk(KERN_ERR "mapcount %d mapcount2 %d page_mapcount %d\n",
+		       mapcount, mapcount2, page_mapcount(page));
+	BUG_ON(mapcount != mapcount2);
+	count_vm_event(THP_SPLIT);
+	mutex_unlock(&mapping->i_mmap_mutex);
+	up_write(&inode->i_split_sem);
+
+	/*
+	 * Drop small pages beyond i_size if any.
+	 */
+	truncate_inode_pages(mapping, i_size_read(inode));
+	return 0;
+}
+#else
+static int split_file_huge_page(struct page *page, struct list_head *list)
+{
+	BUG();
+}
+#endif
+
+int split_huge_page_to_list(struct page *page, struct list_head *list)
+{
+	BUG_ON(is_huge_zero_page(page));
+
+	if (PageAnon(page))
+		return split_anon_huge_page(page, list);
+	else
+		return split_file_huge_page(page, list);
+}
+
 #define VM_NO_THP (VM_SPECIAL|VM_MIXEDMAP|VM_HUGETLB|VM_SHARED|VM_MAYSHARE)
 
 int hugepage_madvise(struct vm_area_struct *vma,
-- 
1.8.4.rc3


^ permalink raw reply related	[flat|nested] 48+ messages in thread

* [PATCHv6 20/22] thp: wait_split_huge_page(): serialize over i_mmap_mutex too
  2013-09-23 12:05 [PATCHv6 00/22] Transparent huge page cache: phase 1, everything but mmap() Kirill A. Shutemov
                   ` (18 preceding siblings ...)
  2013-09-23 12:05 ` [PATCHv6 19/22] thp: handle file pages in split_huge_page() Kirill A. Shutemov
@ 2013-09-23 12:05 ` Kirill A. Shutemov
  2013-09-23 12:05 ` [PATCHv6 21/22] thp, mm: split huge page on mmap file page Kirill A. Shutemov
                   ` (4 subsequent siblings)
  24 siblings, 0 replies; 48+ messages in thread
From: Kirill A. Shutemov @ 2013-09-23 12:05 UTC (permalink / raw)
  To: Andrea Arcangeli, Andrew Morton
  Cc: Al Viro, Hugh Dickins, Wu Fengguang, Jan Kara, Mel Gorman,
	linux-mm, Andi Kleen, Matthew Wilcox, Kirill A. Shutemov,
	Hillf Danton, Dave Hansen, Ning Qu, Alexander Shishkin,
	linux-fsdevel, linux-kernel, Kirill A. Shutemov

We're going to have huge pages backed by files, so we need to modify
wait_split_huge_page() to support that.

We have two options:
 - check whether the page is anonymous or not and serialize only over the
   required lock;
 - always serialize over both locks.

The current implementation, in fact, guarantees that *all* pages on the vma
are not splitting, not only the page the pmd is pointing to.

For now I prefer the second option since it's the safest: we provide the
same level of guarantees.

Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
---
 include/linux/huge_mm.h | 15 ++++++++++++---
 mm/huge_memory.c        |  4 ++--
 mm/memory.c             |  4 ++--
 3 files changed, 16 insertions(+), 7 deletions(-)

diff --git a/include/linux/huge_mm.h b/include/linux/huge_mm.h
index ce9fcae8ef..9bc9937498 100644
--- a/include/linux/huge_mm.h
+++ b/include/linux/huge_mm.h
@@ -111,11 +111,20 @@ extern void __split_huge_page_pmd(struct vm_area_struct *vma,
 			__split_huge_page_pmd(__vma, __address,		\
 					____pmd);			\
 	}  while (0)
-#define wait_split_huge_page(__anon_vma, __pmd)				\
+#define wait_split_huge_page(__vma, __pmd)				\
 	do {								\
 		pmd_t *____pmd = (__pmd);				\
-		anon_vma_lock_write(__anon_vma);			\
-		anon_vma_unlock_write(__anon_vma);			\
+		struct address_space *__mapping = (__vma)->vm_file ?	\
+				(__vma)->vm_file->f_mapping : NULL;	\
+		struct anon_vma *__anon_vma = (__vma)->anon_vma;	\
+		if (__mapping)						\
+			mutex_lock(&__mapping->i_mmap_mutex);		\
+		if (__anon_vma) {					\
+			anon_vma_lock_write(__anon_vma);		\
+			anon_vma_unlock_write(__anon_vma);		\
+		}							\
+		if (__mapping)						\
+			mutex_unlock(&__mapping->i_mmap_mutex);		\
 		BUG_ON(pmd_trans_splitting(*____pmd) ||			\
 		       pmd_trans_huge(*____pmd));			\
 	} while (0)
diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index 3c45c62cde..d0798e5122 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -913,7 +913,7 @@ int copy_huge_pmd(struct mm_struct *dst_mm, struct mm_struct *src_mm,
 		spin_unlock(&dst_mm->page_table_lock);
 		pte_free(dst_mm, pgtable);
 
-		wait_split_huge_page(vma->anon_vma, src_pmd); /* src_vma */
+		wait_split_huge_page(vma, src_pmd); /* src_vma */
 		goto out;
 	}
 	src_page = pmd_page(pmd);
@@ -1497,7 +1497,7 @@ int __pmd_trans_huge_lock(pmd_t *pmd, struct vm_area_struct *vma)
 	if (likely(pmd_trans_huge(*pmd))) {
 		if (unlikely(pmd_trans_splitting(*pmd))) {
 			spin_unlock(&vma->vm_mm->page_table_lock);
-			wait_split_huge_page(vma->anon_vma, pmd);
+			wait_split_huge_page(vma, pmd);
 			return -1;
 		} else {
 			/* Thp mapped by 'pmd' is stable, so we can
diff --git a/mm/memory.c b/mm/memory.c
index e5f74cd634..dc5a56cab7 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -584,7 +584,7 @@ int __pte_alloc(struct mm_struct *mm, struct vm_area_struct *vma,
 	if (new)
 		pte_free(mm, new);
 	if (wait_split_huge_page)
-		wait_split_huge_page(vma->anon_vma, pmd);
+		wait_split_huge_page(vma, pmd);
 	return 0;
 }
 
@@ -1520,7 +1520,7 @@ struct page *follow_page_mask(struct vm_area_struct *vma,
 		if (likely(pmd_trans_huge(*pmd))) {
 			if (unlikely(pmd_trans_splitting(*pmd))) {
 				spin_unlock(&mm->page_table_lock);
-				wait_split_huge_page(vma->anon_vma, pmd);
+				wait_split_huge_page(vma, pmd);
 			} else {
 				page = follow_trans_huge_pmd(vma, address,
 							     pmd, flags);
-- 
1.8.4.rc3


^ permalink raw reply related	[flat|nested] 48+ messages in thread

* [PATCHv6 21/22] thp, mm: split huge page on mmap file page
  2013-09-23 12:05 [PATCHv6 00/22] Transparent huge page cache: phase 1, everything but mmap() Kirill A. Shutemov
                   ` (19 preceding siblings ...)
  2013-09-23 12:05 ` [PATCHv6 20/22] thp: wait_split_huge_page(): serialize over i_mmap_mutex too Kirill A. Shutemov
@ 2013-09-23 12:05 ` Kirill A. Shutemov
  2013-09-23 12:05 ` [PATCHv6 22/22] ramfs: enable transparent huge page cache Kirill A. Shutemov
                   ` (3 subsequent siblings)
  24 siblings, 0 replies; 48+ messages in thread
From: Kirill A. Shutemov @ 2013-09-23 12:05 UTC (permalink / raw)
  To: Andrea Arcangeli, Andrew Morton
  Cc: Al Viro, Hugh Dickins, Wu Fengguang, Jan Kara, Mel Gorman,
	linux-mm, Andi Kleen, Matthew Wilcox, Kirill A. Shutemov,
	Hillf Danton, Dave Hansen, Ning Qu, Alexander Shishkin,
	linux-fsdevel, linux-kernel, Kirill A. Shutemov

We are not ready to mmap file-backed transparent huge pages. Let's split
them on a fault attempt.

Later we'll implement mmap() properly and this code path will be used for
fallback cases.

Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
---
 mm/filemap.c | 2 ++
 1 file changed, 2 insertions(+)

diff --git a/mm/filemap.c b/mm/filemap.c
index 9bbc024e4c..01a8f9945a 100644
--- a/mm/filemap.c
+++ b/mm/filemap.c
@@ -1736,6 +1736,8 @@ retry_find:
 			goto no_cached_page;
 	}
 
+	if (PageTransCompound(page))
+		split_huge_page(compound_trans_head(page));
 	if (!lock_page_or_retry(page, vma->vm_mm, vmf->flags)) {
 		page_cache_release(page);
 		return ret | VM_FAULT_RETRY;
-- 
1.8.4.rc3


^ permalink raw reply related	[flat|nested] 48+ messages in thread

* [PATCHv6 22/22] ramfs: enable transparent huge page cache
  2013-09-23 12:05 [PATCHv6 00/22] Transparent huge page cache: phase 1, everything but mmap() Kirill A. Shutemov
                   ` (20 preceding siblings ...)
  2013-09-23 12:05 ` [PATCHv6 21/22] thp, mm: split huge page on mmap file page Kirill A. Shutemov
@ 2013-09-23 12:05 ` Kirill A. Shutemov
  2013-09-24 23:37 ` [PATCHv6 00/22] Transparent huge page cache: phase 1, everything but mmap() Andrew Morton
                   ` (2 subsequent siblings)
  24 siblings, 0 replies; 48+ messages in thread
From: Kirill A. Shutemov @ 2013-09-23 12:05 UTC (permalink / raw)
  To: Andrea Arcangeli, Andrew Morton
  Cc: Al Viro, Hugh Dickins, Wu Fengguang, Jan Kara, Mel Gorman,
	linux-mm, Andi Kleen, Matthew Wilcox, Kirill A. Shutemov,
	Hillf Danton, Dave Hansen, Ning Qu, Alexander Shishkin,
	linux-fsdevel, linux-kernel, Kirill A. Shutemov

ramfs is the simplest fs from the page cache point of view. Let's start
enabling the transparent huge page cache here.

ramfs pages are not movable[1] and switching to transhuge pages doesn't
affect that. We need to fix this eventually.

[1] http://lkml.org/lkml/2013/4/2/720

Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
---
 fs/ramfs/file-mmu.c | 2 +-
 fs/ramfs/inode.c    | 6 +++++-
 2 files changed, 6 insertions(+), 2 deletions(-)

diff --git a/fs/ramfs/file-mmu.c b/fs/ramfs/file-mmu.c
index 4884ac5ae9..ae787bf9ba 100644
--- a/fs/ramfs/file-mmu.c
+++ b/fs/ramfs/file-mmu.c
@@ -32,7 +32,7 @@
 
 const struct address_space_operations ramfs_aops = {
 	.readpage	= simple_readpage,
-	.write_begin	= simple_write_begin,
+	.write_begin	= simple_thp_write_begin,
 	.write_end	= simple_write_end,
 	.set_page_dirty = __set_page_dirty_no_writeback,
 };
diff --git a/fs/ramfs/inode.c b/fs/ramfs/inode.c
index 39d14659a8..5dafdfcd86 100644
--- a/fs/ramfs/inode.c
+++ b/fs/ramfs/inode.c
@@ -61,7 +61,11 @@ struct inode *ramfs_get_inode(struct super_block *sb,
 		inode_init_owner(inode, dir, mode);
 		inode->i_mapping->a_ops = &ramfs_aops;
 		inode->i_mapping->backing_dev_info = &ramfs_backing_dev_info;
-		mapping_set_gfp_mask(inode->i_mapping, GFP_HIGHUSER);
+		/*
+		 * TODO: make ramfs pages movable
+		 */
+		mapping_set_gfp_mask(inode->i_mapping,
+				GFP_TRANSHUGE & ~__GFP_MOVABLE);
 		mapping_set_unevictable(inode->i_mapping);
 		inode->i_atime = inode->i_mtime = inode->i_ctime = CURRENT_TIME;
 		switch (mode & S_IFMT) {
-- 
1.8.4.rc3


^ permalink raw reply related	[flat|nested] 48+ messages in thread

* Re: [PATCHv6 00/22] Transparent huge page cache: phase 1, everything but mmap()
  2013-09-23 12:05 [PATCHv6 00/22] Transparent huge page cache: phase 1, everything but mmap() Kirill A. Shutemov
                   ` (21 preceding siblings ...)
  2013-09-23 12:05 ` [PATCHv6 22/22] ramfs: enable transparent huge page cache Kirill A. Shutemov
@ 2013-09-24 23:37 ` Andrew Morton
  2013-09-24 23:49   ` Andi Kleen
                     ` (2 more replies)
       [not found] ` <CACz4_2drFs5LsM8mTFNOWGHAs0QbsNfHAhiBXJ7jM3qkGerd5w@mail.gmail.com>
  2013-09-26 21:13 ` Dave Hansen
  24 siblings, 3 replies; 48+ messages in thread
From: Andrew Morton @ 2013-09-24 23:37 UTC (permalink / raw)
  To: Kirill A. Shutemov
  Cc: Andrea Arcangeli, Al Viro, Hugh Dickins, Wu Fengguang, Jan Kara,
	Mel Gorman, linux-mm, Andi Kleen, Matthew Wilcox,
	Kirill A. Shutemov, Hillf Danton, Dave Hansen, Ning Qu,
	Alexander Shishkin, linux-fsdevel, linux-kernel

On Mon, 23 Sep 2013 15:05:28 +0300 "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com> wrote:

> It brings thp support for ramfs, but without mmap() -- it will be posted
> separately.

We were never going to do this :(

Has anyone reviewed these patches much yet?

> Please review and consider applying.

It appears rather too immature at this stage.

> Intro
> -----
> 
> The goal of the project is preparing kernel infrastructure to handle huge
> pages in page cache.
> 
> To proof that the proposed changes are functional we enable the feature
> for the most simple file system -- ramfs. ramfs is not that useful by
> itself, but it's good pilot project.

At the very least we should get this done for a real filesystem to see
how intrusive the changes are and to evaluate the performance changes.


Sigh.  A pox on whoever thought up huge pages.  Words cannot express
how much of a godawful mess they have made of Linux MM.  And it hasn't
ended yet :( My take is that we'd need to see some very attractive and
convincing real-world performance numbers before even thinking of
taking this on.




^ permalink raw reply	[flat|nested] 48+ messages in thread

* Re: [PATCHv6 00/22] Transparent huge page cache: phase 1, everything but mmap()
  2013-09-24 23:37 ` [PATCHv6 00/22] Transparent huge page cache: phase 1, everything but mmap() Andrew Morton
@ 2013-09-24 23:49   ` Andi Kleen
  2013-09-24 23:58     ` Andrew Morton
                       ` (2 more replies)
  2013-09-25  9:51   ` Kirill A. Shutemov
  2013-09-30 10:02   ` Mel Gorman
  2 siblings, 3 replies; 48+ messages in thread
From: Andi Kleen @ 2013-09-24 23:49 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Kirill A. Shutemov, Andrea Arcangeli, Al Viro, Hugh Dickins,
	Wu Fengguang, Jan Kara, Mel Gorman, linux-mm, Matthew Wilcox,
	Kirill A. Shutemov, Hillf Danton, Dave Hansen, Ning Qu,
	Alexander Shishkin, linux-fsdevel, linux-kernel

On Tue, Sep 24, 2013 at 04:37:40PM -0700, Andrew Morton wrote:
> On Mon, 23 Sep 2013 15:05:28 +0300 "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com> wrote:
> 
> > It brings thp support for ramfs, but without mmap() -- it will be posted
> > separately.
> 
> We were never going to do this :(
> 
> Has anyone reviewed these patches much yet?

There already was a lot of review by various people.

This is not the first post, just the latest refactoring.

> > Intro
> > -----
> > 
> > The goal of the project is preparing kernel infrastructure to handle huge
> > pages in page cache.
> > 
> > To proof that the proposed changes are functional we enable the feature
> > for the most simple file system -- ramfs. ramfs is not that useful by
> > itself, but it's good pilot project.
> 
> At the very least we should get this done for a real filesystem to see
> how intrusive the changes are and to evaluate the performance changes.

That would give even larger patches, and people already complain
the patchkit is too large.

The only good way to handle this is baby steps, and you 
have to start somewhere.

> Sigh.  A pox on whoever thought up huge pages. 

managing 1TB+ of memory in 4K chunks is just insane.
The question of larger pages is not "if", but only "when".

-Andi

-- 
ak@linux.intel.com -- Speaking for myself only

^ permalink raw reply	[flat|nested] 48+ messages in thread

* Re: [PATCHv6 00/22] Transparent huge page cache: phase 1, everything but mmap()
  2013-09-24 23:49   ` Andi Kleen
@ 2013-09-24 23:58     ` Andrew Morton
  2013-09-25 11:15       ` Kirill A. Shutemov
  2013-09-26 18:30     ` Zach Brown
  2013-09-30 10:13     ` Mel Gorman
  2 siblings, 1 reply; 48+ messages in thread
From: Andrew Morton @ 2013-09-24 23:58 UTC (permalink / raw)
  To: Andi Kleen
  Cc: Kirill A. Shutemov, Andrea Arcangeli, Al Viro, Hugh Dickins,
	Wu Fengguang, Jan Kara, Mel Gorman, linux-mm, Matthew Wilcox,
	Kirill A. Shutemov, Hillf Danton, Dave Hansen, Ning Qu,
	Alexander Shishkin, linux-fsdevel, linux-kernel

On Tue, 24 Sep 2013 16:49:50 -0700 Andi Kleen <ak@linux.intel.com> wrote:

> > At the very least we should get this done for a real filesystem to see
> > how intrusive the changes are and to evaluate the performance changes.
> 
> That would give even larger patches, and people already complain
> the patchkit is too large.

The thing is that merging an implementation for ramfs commits us to
doing it for the major real filesystems.  Before making that commitment
we should at least have a pretty good understanding of what those
changes will look like.

Plus I don't see how we can realistically performance-test it without
having real physical backing store in the picture?


^ permalink raw reply	[flat|nested] 48+ messages in thread

* Re: [PATCHv6 00/22] Transparent huge page cache: phase 1, everything but mmap()
       [not found] ` <CACz4_2drFs5LsM8mTFNOWGHAs0QbsNfHAhiBXJ7jM3qkGerd5w@mail.gmail.com>
@ 2013-09-25  9:23   ` Kirill A. Shutemov
  0 siblings, 0 replies; 48+ messages in thread
From: Kirill A. Shutemov @ 2013-09-25  9:23 UTC (permalink / raw)
  To: Ning Qu
  Cc: Kirill A. Shutemov, Andrea Arcangeli, Andrew Morton, Al Viro,
	Hugh Dickins, Wu Fengguang, Jan Kara, Mel Gorman, linux-mm,
	Andi Kleen, Matthew Wilcox, Kirill A. Shutemov, Hillf Danton,
	Dave Hansen, Alexander Shishkin, linux-fsdevel, linux-kernel

Ning Qu wrote:
> Hi, Kirill,
> 
> Seems you dropped one patch in v5, is that intentional? Just wondering ...
> 
>   thp, mm: handle tail pages in page_cache_get_speculative()

It's not needed anymore, since we don't have tail pages in radix tree.

-- 
 Kirill A. Shutemov

^ permalink raw reply	[flat|nested] 48+ messages in thread

* Re: [PATCHv6 00/22] Transparent huge page cache: phase 1, everything but mmap()
  2013-09-24 23:37 ` [PATCHv6 00/22] Transparent huge page cache: phase 1, everything but mmap() Andrew Morton
  2013-09-24 23:49   ` Andi Kleen
@ 2013-09-25  9:51   ` Kirill A. Shutemov
  2013-09-25 23:29     ` Dave Chinner
  2013-09-30 10:02   ` Mel Gorman
  2 siblings, 1 reply; 48+ messages in thread
From: Kirill A. Shutemov @ 2013-09-25  9:51 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Kirill A. Shutemov, Andrea Arcangeli, Al Viro, Hugh Dickins,
	Wu Fengguang, Jan Kara, Mel Gorman, linux-mm, Andi Kleen,
	Matthew Wilcox, Kirill A. Shutemov, Hillf Danton, Dave Hansen,
	Ning Qu, Alexander Shishkin, linux-fsdevel, linux-kernel

Andrew Morton wrote:
> On Mon, 23 Sep 2013 15:05:28 +0300 "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com> wrote:
> 
> > It brings thp support for ramfs, but without mmap() -- it will be posted
> > separately.
> 
> We were never going to do this :(
> 
> Has anyone reviewed these patches much yet?

Dave did a very good review. A few other people looked at individual
patches; see the Reviewed-by/Acked-by tags in the patches.

It looks like most mm experts are busy with NUMA balancing nowadays, so
it's hard to get more review.

The patchset was mostly ignored for a few rounds, and Dave suggested
splitting it to get a less scary patch count.

> > Please review and consider applying.
> 
> It appears rather too immature at this stage.

More review is always welcome, and I'm committed to addressing any
issues.

-- 
 Kirill A. Shutemov

^ permalink raw reply	[flat|nested] 48+ messages in thread

* Re: [PATCHv6 00/22] Transparent huge page cache: phase 1, everything but mmap()
  2013-09-24 23:58     ` Andrew Morton
@ 2013-09-25 11:15       ` Kirill A. Shutemov
  2013-09-25 15:05         ` Andi Kleen
  0 siblings, 1 reply; 48+ messages in thread
From: Kirill A. Shutemov @ 2013-09-25 11:15 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Andi Kleen, Kirill A. Shutemov, Andrea Arcangeli, Al Viro,
	Hugh Dickins, Wu Fengguang, Jan Kara, Mel Gorman, linux-mm,
	Matthew Wilcox, Kirill A. Shutemov, Hillf Danton, Dave Hansen,
	Ning Qu, Alexander Shishkin, linux-fsdevel, linux-kernel

Andrew Morton wrote:
> On Tue, 24 Sep 2013 16:49:50 -0700 Andi Kleen <ak@linux.intel.com> wrote:
> 
> > > At the very least we should get this done for a real filesystem to see
> > > how intrusive the changes are and to evaluate the performance changes.
> > 
> > That would give even larger patches, and people already complain
> > the patchkit is too large.
> 
> The thing is that merging an implementation for ramfs commits us to
> doing it for the major real filesystems.  Before making that commitment
> we should at least have a pretty good understanding of what those
> changes will look like.
> 
> Plus I don't see how we can realistically performance-test it without
> having real physical backing store in the picture?

My plan for real filesystems is to make it beneficial for read-mostly
files first:
 - allocate huge pages on read (or collapse small pages) only if nobody
   has the inode opened for write;
 - split the huge page on write, to avoid dealing with the writeback path
   at first and to dirty only 4k pages.

This will get most ELF executables and libraries mapped with huge pages
(it may require a dynamic linker change to align the length to a huge
page boundary), which is not bad for a start.
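
Roughly, the read-side gate could look like the sketch below. It's only
an illustration of the plan, not code in this series:
mapping_can_have_hugepages() is the predicate from patch 05 (I'm assuming
it takes the address_space), and the i_writecount test is my shorthand
for "nobody has the inode opened for write".

#include <linux/fs.h>
#include <linux/pagemap.h>

/*
 * Sketch only: may a read (fault or readahead) allocate a huge page for
 * this inode?  Read-mostly heuristic: the mapping must be able to hold
 * huge pages at all, and nobody may have the file open for write, so we
 * never have to dirty or write back a 2M page.
 */
static bool may_alloc_huge_page_for_read(struct inode *inode)
{
	if (!mapping_can_have_hugepages(inode->i_mapping))
		return false;

	/* i_writecount > 0 means somebody has the file open for write */
	return atomic_read(&inode->i_writecount) == 0;
}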

-- 
 Kirill A. Shutemov

^ permalink raw reply	[flat|nested] 48+ messages in thread

* Re: [PATCHv6 00/22] Transparent huge page cache: phase 1, everything but mmap()
  2013-09-25 11:15       ` Kirill A. Shutemov
@ 2013-09-25 15:05         ` Andi Kleen
  0 siblings, 0 replies; 48+ messages in thread
From: Andi Kleen @ 2013-09-25 15:05 UTC (permalink / raw)
  To: Kirill A. Shutemov
  Cc: Andrew Morton, Andrea Arcangeli, Al Viro, Hugh Dickins,
	Wu Fengguang, Jan Kara, Mel Gorman, linux-mm, Matthew Wilcox,
	Kirill A. Shutemov, Hillf Danton, Dave Hansen, Ning Qu,
	Alexander Shishkin, linux-fsdevel, linux-kernel

> (it may require dynamic linker change to align length to huge page
> boundary) 

x86-64 binaries should already be padded for this.

-Andi

^ permalink raw reply	[flat|nested] 48+ messages in thread

* Re: [PATCHv6 10/22] thp, mm: rewrite delete_from_page_cache() to support huge pages
  2013-09-23 12:05 ` [PATCHv6 10/22] thp, mm: rewrite delete_from_page_cache() to support huge pages Kirill A. Shutemov
@ 2013-09-25 20:02   ` Ning Qu
  0 siblings, 0 replies; 48+ messages in thread
From: Ning Qu @ 2013-09-25 20:02 UTC (permalink / raw)
  To: Kirill A. Shutemov
  Cc: Andrea Arcangeli, Andrew Morton, Al Viro, Hugh Dickins,
	Wu Fengguang, Jan Kara, Mel Gorman, linux-mm, Andi Kleen,
	Matthew Wilcox, Kirill A. Shutemov, Hillf Danton, Dave Hansen,
	Alexander Shishkin, linux-fsdevel, linux-kernel

Best wishes,
-- 
Ning Qu (曲宁) | Software Engineer | quning@google.com | +1-408-418-6066


On Mon, Sep 23, 2013 at 5:05 AM, Kirill A. Shutemov
<kirill.shutemov@linux.intel.com> wrote:
>
> As with add_to_page_cache_locked() we handle HPAGE_CACHE_NR pages a
> time.
>
> Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
> ---
>  mm/filemap.c | 20 ++++++++++++++------
>  1 file changed, 14 insertions(+), 6 deletions(-)
>
> diff --git a/mm/filemap.c b/mm/filemap.c
> index d2d6c0ebe9..60478ebeda 100644
> --- a/mm/filemap.c
> +++ b/mm/filemap.c
> @@ -115,6 +115,7 @@
>  void __delete_from_page_cache(struct page *page)
>  {
>         struct address_space *mapping = page->mapping;
> +       int i, nr;
>
>         trace_mm_filemap_delete_from_page_cache(page);
>         /*
> @@ -127,13 +128,20 @@ void __delete_from_page_cache(struct page *page)
>         else
>                 cleancache_invalidate_page(mapping, page);
>
> -       radix_tree_delete(&mapping->page_tree, page->index);
> +       page->mapping = NULL;
It seems that with this line added, we clear page->mapping twice: once
here and again after radix_tree_delete(). Is this line necessary here?

>
> +       nr = hpagecache_nr_pages(page);
> +       for (i = 0; i < nr; i++)
> +               radix_tree_delete(&mapping->page_tree, page->index + i);
> +       /* thp */
> +       if (nr > 1)
> +               __dec_zone_page_state(page, NR_FILE_TRANSPARENT_HUGEPAGES);
> +
>         page->mapping = NULL;
>         /* Leave page->index set: truncation lookup relies upon it */
> -       mapping->nrpages--;
> -       __dec_zone_page_state(page, NR_FILE_PAGES);
> +       mapping->nrpages -= nr;
> +       __mod_zone_page_state(page_zone(page), NR_FILE_PAGES, -nr);
>         if (PageSwapBacked(page))
> -               __dec_zone_page_state(page, NR_SHMEM);
> +               __mod_zone_page_state(page_zone(page), NR_SHMEM, -nr);
>         BUG_ON(page_mapped(page));
>
>         /*
> @@ -144,8 +152,8 @@ void __delete_from_page_cache(struct page *page)
>          * having removed the page entirely.
>          */
>         if (PageDirty(page) && mapping_cap_account_dirty(mapping)) {
> -               dec_zone_page_state(page, NR_FILE_DIRTY);
> -               dec_bdi_stat(mapping->backing_dev_info, BDI_RECLAIMABLE);
> +               mod_zone_page_state(page_zone(page), NR_FILE_DIRTY, -nr);
> +               add_bdi_stat(mapping->backing_dev_info, BDI_RECLAIMABLE, -nr);
>         }
>  }
>
> --
> 1.8.4.rc3
>

^ permalink raw reply	[flat|nested] 48+ messages in thread

* Re: [PATCHv6 00/22] Transparent huge page cache: phase 1, everything but mmap()
  2013-09-25  9:51   ` Kirill A. Shutemov
@ 2013-09-25 23:29     ` Dave Chinner
  2013-10-14 13:56       ` Kirill A. Shutemov
  0 siblings, 1 reply; 48+ messages in thread
From: Dave Chinner @ 2013-09-25 23:29 UTC (permalink / raw)
  To: Kirill A. Shutemov
  Cc: Andrew Morton, Andrea Arcangeli, Al Viro, Hugh Dickins,
	Wu Fengguang, Jan Kara, Mel Gorman, linux-mm, Andi Kleen,
	Matthew Wilcox, Kirill A. Shutemov, Hillf Danton, Dave Hansen,
	Ning Qu, Alexander Shishkin, linux-fsdevel, linux-kernel

On Wed, Sep 25, 2013 at 12:51:04PM +0300, Kirill A. Shutemov wrote:
> Andrew Morton wrote:
> > On Mon, 23 Sep 2013 15:05:28 +0300 "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com> wrote:
> > 
> > > It brings thp support for ramfs, but without mmap() -- it will be posted
> > > separately.
> > 
> > We were never going to do this :(
> > 
> > Has anyone reviewed these patches much yet?
> 
> Dave did very good review. Few other people looked to separate patches.
> See Reviewed-by/Acked-by tags in patches.
> 
> It looks like most mm experts are busy with numa balancing nowadays, so
> it's hard to get more review.

Nobody has reviewed it from the filesystem side, though.

The changes that require special code paths for huge pages in the
write_begin/write_end paths are nasty. You're adding conditional
code that depends on the page size and then having to add checks to
ensure that large page operations don't step over small page
boundaries and other such corner cases. It's an extremely fragile
design, IMO.

In general, I don't like all the if (thp) {} else {}; code that this
series introduces - they are code paths that simply won't get tested
with any sort of regularity and make the code more complex for those
that aren't using THP to understand and debug...
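
For illustration, the kind of conditional I mean has roughly this shape --
not code from the series, just the pattern (the helper name is made up):

#include <linux/huge_mm.h>
#include <linux/kernel.h>
#include <linux/mm.h>
#include <linux/pagemap.h>

/*
 * Sketch of the pattern: every write-path helper ends up branching on
 * the page size and clamping the copy so that a huge page operation
 * never steps over the boundaries the small-page code expects.
 */
static size_t write_chunk_bytes(struct page *page, loff_t pos, size_t bytes)
{
	if (PageTransHuge(page)) {
		size_t offset = pos & ~HPAGE_PMD_MASK;	/* within 2M page */

		return min_t(size_t, bytes, HPAGE_PMD_SIZE - offset);
	} else {
		size_t offset = pos & ~PAGE_CACHE_MASK;	/* within 4K page */

		return min_t(size_t, bytes, PAGE_CACHE_SIZE - offset);
	}
}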

Then there is a new per-inode lock that is used in
generic_perform_write() which is held across page faults and calls
to filesystem block mapping callbacks. This inserts into the middle
of an existing locking chain that needs to be strictly ordered, and
as such will lead to the same type of lock inversion problems that
the mmap_sem had.  We do not want to introduce a new lock that has
this same problem just as we are getting rid of that long standing
nastiness from the page fault path...
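
To spell out the inversion I'm worried about (sketched from my reading of
the series; the call sites are approximate and I'm assuming i_split_sem
is an rwsem):

/*
 * Task A, write(2):
 *	down_read(&inode->i_split_sem)		in generic_perform_write()
 *	  -> page fault on the user buffer
 *	    -> down_read(&mm->mmap_sem)
 *
 * Task B, fault on an mmap() of the same file:
 *	down_read(&mm->mmap_sem)
 *	  -> split_huge_page() on the file page
 *	    -> down_write(&inode->i_split_sem)
 *
 * Each task can hold one lock while waiting for the other: a classic
 * ABBA ordering problem, the same class of trouble mmap_sem already
 * gives us in the write path.
 */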

I also note that you didn't convert invalidate_inode_pages2_range()
to support huge pages which is needed by real filesystems that
support direct IO. There are other truncate/invalidate interfaces
that you didn't convert, either, and some of them will present you
with interesting locking challenges as a result of adding that new
lock...

> The patchset was mostly ignored for few rounds and Dave suggested to split
> to have less scary patch number.

It's still being ignored by filesystem people because you haven't
actually tried to implement support into a real filesystem.....

> > > Please review and consider applying.
> > 
> > It appears rather too immature at this stage.
> 
> More review is always welcome and I'm committed to address issues.

IMO, supporting a real block based filesystem like ext4 or XFS and
demonstrating that everything works is necessary before we go any
further...

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com

^ permalink raw reply	[flat|nested] 48+ messages in thread

* Re: [PATCHv6 00/22] Transparent huge page cache: phase 1, everything but mmap()
  2013-09-24 23:49   ` Andi Kleen
  2013-09-24 23:58     ` Andrew Morton
@ 2013-09-26 18:30     ` Zach Brown
  2013-09-26 19:05       ` Andi Kleen
  2013-09-30 10:13     ` Mel Gorman
  2 siblings, 1 reply; 48+ messages in thread
From: Zach Brown @ 2013-09-26 18:30 UTC (permalink / raw)
  To: Andi Kleen
  Cc: Andrew Morton, Kirill A. Shutemov, Andrea Arcangeli, Al Viro,
	Hugh Dickins, Wu Fengguang, Jan Kara, Mel Gorman, linux-mm,
	Matthew Wilcox, Kirill A. Shutemov, Hillf Danton, Dave Hansen,
	Ning Qu, Alexander Shishkin, linux-fsdevel, linux-kernel

> > Sigh.  A pox on whoever thought up huge pages. 
> 
> managing 1TB+ of memory in 4K chunks is just insane.
> The question of larger pages is not "if", but only "when".

And "how"!

Sprinkling a bunch of magical if (thp) {} else {} throughout the code
looks like a stunningly bad idea to me.  It'd take real work to
restructure the code such that the current paths are a degenerate case
of the larger thp page case, but that's the work that needs doing, in my
estimation.

- z

^ permalink raw reply	[flat|nested] 48+ messages in thread

* Re: [PATCHv6 00/22] Transparent huge page cache: phase 1, everything but mmap()
  2013-09-26 18:30     ` Zach Brown
@ 2013-09-26 19:05       ` Andi Kleen
  0 siblings, 0 replies; 48+ messages in thread
From: Andi Kleen @ 2013-09-26 19:05 UTC (permalink / raw)
  To: Zach Brown
  Cc: Andrew Morton, Kirill A. Shutemov, Andrea Arcangeli, Al Viro,
	Hugh Dickins, Wu Fengguang, Jan Kara, Mel Gorman, linux-mm,
	Matthew Wilcox, Kirill A. Shutemov, Hillf Danton, Dave Hansen,
	Ning Qu, Alexander Shishkin, linux-fsdevel, linux-kernel

On Thu, Sep 26, 2013 at 11:30:22AM -0700, Zach Brown wrote:
> > > Sigh.  A pox on whoever thought up huge pages. 
> > 
> > managing 1TB+ of memory in 4K chunks is just insane.
> > The question of larger pages is not "if", but only "when".
> 
> And "how"!
> 
> Sprinking a bunch of magical if (thp) {} else {} throughtout the code
> looks like a stunningly bad idea to me.  It'd take real work to
> restructure the code such that the current paths are a degenerate case
> of the larger thp page case, but that's the work that needs doing in my
> estimation.

Sorry, but that is how all of the large page support in the Linux VM
works (both THP and hugetlbfs).

Yes, it would be nice if small pages and large pages all ran
in a unified VM. But that's not how Linux is designed today.

Yes, having a pony would be nice too.

Back when huge pages were originally proposed, Linus came
up with the "separate hugetlbfs VM" design, and that is what we're
stuck with today.

Asking for a wholesale VM redesign is just not realistic.

The VM always changes in baby steps. And the only
known way to do that is to have if (thp) and if (hugetlbfs).

-Andi 

-- 
ak@linux.intel.com -- Speaking for myself only

^ permalink raw reply	[flat|nested] 48+ messages in thread

* Re: [PATCHv6 00/22] Transparent huge page cache: phase 1, everything but mmap()
  2013-09-23 12:05 [PATCHv6 00/22] Transparent huge page cache: phase 1, everything but mmap() Kirill A. Shutemov
                   ` (23 preceding siblings ...)
       [not found] ` <CACz4_2drFs5LsM8mTFNOWGHAs0QbsNfHAhiBXJ7jM3qkGerd5w@mail.gmail.com>
@ 2013-09-26 21:13 ` Dave Hansen
  24 siblings, 0 replies; 48+ messages in thread
From: Dave Hansen @ 2013-09-26 21:13 UTC (permalink / raw)
  To: Kirill A. Shutemov, Andrea Arcangeli, Andrew Morton
  Cc: Al Viro, Hugh Dickins, Wu Fengguang, Jan Kara, Mel Gorman,
	linux-mm, Andi Kleen, Matthew Wilcox, Kirill A. Shutemov,
	Hillf Danton, Ning Qu, Alexander Shishkin, linux-fsdevel,
	linux-kernel, Luck, Tony, Andi Kleen

On 09/23/2013 05:05 AM, Kirill A. Shutemov wrote:
> To proof that the proposed changes are functional we enable the feature
> for the most simple file system -- ramfs. ramfs is not that useful by
> itself, but it's good pilot project.

This does, at the least, give us a shared memory mechanism that can move
between large and small pages.  We don't have anything which can do that
today.

Tony Luck was just mentioning that if we have a small (say 1-bit) memory
failure in a hugetlbfs page, then we end up tossing out the entire 2MB.
 The app gets a chance to recover the contents, but it has to do it for
the entire 2MB.  Ideally, we'd like to break the 2M down into 4k pages,
which lets us continue using the remaining 2M-4k, and leaves the app to
rebuild 4k of its data instead of 2M.

If you look at the diffstat, it's also pretty obvious that virtually
none of this code is actually specific to ramfs.  It'll all get used as
the foundation for the "real" filesystems too.  I'm very interested in
how those end up looking, too, but I think Kirill is selling his patches
a bit short calling this a toy.

^ permalink raw reply	[flat|nested] 48+ messages in thread

* Re: [PATCHv6 00/22] Transparent huge page cache: phase 1, everything but mmap()
  2013-09-24 23:37 ` [PATCHv6 00/22] Transparent huge page cache: phase 1, everything but mmap() Andrew Morton
  2013-09-24 23:49   ` Andi Kleen
  2013-09-25  9:51   ` Kirill A. Shutemov
@ 2013-09-30 10:02   ` Mel Gorman
  2013-09-30 10:10     ` Mel Gorman
  2013-09-30 15:27     ` Dave Hansen
  2 siblings, 2 replies; 48+ messages in thread
From: Mel Gorman @ 2013-09-30 10:02 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Kirill A. Shutemov, Andrea Arcangeli, Al Viro, Hugh Dickins,
	Wu Fengguang, Jan Kara, linux-mm, Andi Kleen, Matthew Wilcox,
	Kirill A. Shutemov, Hillf Danton, Dave Hansen, Ning Qu,
	Alexander Shishkin, linux-fsdevel, linux-kernel

On Tue, Sep 24, 2013 at 04:37:40PM -0700, Andrew Morton wrote:
> On Mon, 23 Sep 2013 15:05:28 +0300 "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com> wrote:
> 
> > It brings thp support for ramfs, but without mmap() -- it will be posted
> > separately.
> 
> We were never going to do this :(
> 
> Has anyone reviewed these patches much yet?
> 

I am afraid I never looked too closely once I learned that the primary
motivation for this was relieving iTLB pressure in a very specific
case. AFAIK, this is not a problem on the vast majority of modern CPUs,
and I found it very hard to be motivated to review the series as a result.
I suspected that in many cases the cost of IO would continue to dominate
performance rather than TLB pressure. I also found it unlikely that there
was a tmpfs-based workload that used enough memory to be hurt by TLB
pressure. My feedback was that a much more compelling case for the series
was needed, but this discussion all happened on IRC, unfortunately.

-- 
Mel Gorman
SUSE Labs

^ permalink raw reply	[flat|nested] 48+ messages in thread

* Re: [PATCHv6 00/22] Transparent huge page cache: phase 1, everything but mmap()
  2013-09-30 10:02   ` Mel Gorman
@ 2013-09-30 10:10     ` Mel Gorman
  2013-09-30 18:07       ` Ning Qu
  2013-09-30 18:51       ` Andi Kleen
  2013-09-30 15:27     ` Dave Hansen
  1 sibling, 2 replies; 48+ messages in thread
From: Mel Gorman @ 2013-09-30 10:10 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Kirill A. Shutemov, Andrea Arcangeli, Al Viro, Hugh Dickins,
	Wu Fengguang, Jan Kara, linux-mm, Andi Kleen, Matthew Wilcox,
	Kirill A. Shutemov, Hillf Danton, Dave Hansen, Ning Qu,
	Alexander Shishkin, linux-fsdevel, linux-kernel

On Mon, Sep 30, 2013 at 11:02:49AM +0100, Mel Gorman wrote:
> On Tue, Sep 24, 2013 at 04:37:40PM -0700, Andrew Morton wrote:
> > On Mon, 23 Sep 2013 15:05:28 +0300 "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com> wrote:
> > 
> > > It brings thp support for ramfs, but without mmap() -- it will be posted
> > > separately.
> > 
> > We were never going to do this :(
> > 
> > Has anyone reviewed these patches much yet?
> > 
> 
> I am afraid I never looked too closely once I learned that the primary
> motivation for this was relieving iTLB pressure in a very specific
> case. AFAIK, this is not a problem in the vast majority of modern CPUs
> and I found it very hard to be motivated to review the series as a result.
> I suspected that in many cases that the cost of IO would continue to dominate
> performance instead of TLB pressure. I also found it unlikely that there
> was a workload that was tmpfs based that used enough memory to be hurt
> by TLB pressure. My feedback was that a much more compelling case for the
> series was needed but this discussion all happened on IRC unfortunately.
> 

Oh, one last thing I forgot. While tmpfs-based workloads are not likely
to benefit, I would expect that sysV shared memory workloads would
potentially benefit from this.  hugetlbfs is still required for shared
memory areas, but that is not a problem addressed by this series.

-- 
Mel Gorman
SUSE Labs

^ permalink raw reply	[flat|nested] 48+ messages in thread

* Re: [PATCHv6 00/22] Transparent huge page cache: phase 1, everything but mmap()
  2013-09-24 23:49   ` Andi Kleen
  2013-09-24 23:58     ` Andrew Morton
  2013-09-26 18:30     ` Zach Brown
@ 2013-09-30 10:13     ` Mel Gorman
  2013-09-30 16:05       ` Andi Kleen
  2 siblings, 1 reply; 48+ messages in thread
From: Mel Gorman @ 2013-09-30 10:13 UTC (permalink / raw)
  To: Andi Kleen
  Cc: Andrew Morton, Kirill A. Shutemov, Andrea Arcangeli, Al Viro,
	Hugh Dickins, Wu Fengguang, Jan Kara, linux-mm, Matthew Wilcox,
	Kirill A. Shutemov, Hillf Danton, Dave Hansen, Ning Qu,
	Alexander Shishkin, linux-fsdevel, linux-kernel

On Tue, Sep 24, 2013 at 04:49:50PM -0700, Andi Kleen wrote:
> > Sigh.  A pox on whoever thought up huge pages. 
> 
> managing 1TB+ of memory in 4K chunks is just insane.
> The question of larger pages is not "if", but only "when".
> 

Remember that there are at least two separate issues there. One is
handling data in larger granularities than a 4K page, and the second is
the TLB, page table, etc. handling. They are not necessarily the same
problem.

-- 
Mel Gorman
SUSE Labs

^ permalink raw reply	[flat|nested] 48+ messages in thread

* Re: [PATCHv6 00/22] Transparent huge page cache: phase 1, everything but mmap()
  2013-09-30 10:02   ` Mel Gorman
  2013-09-30 10:10     ` Mel Gorman
@ 2013-09-30 15:27     ` Dave Hansen
  2013-09-30 18:05       ` Ning Qu
  1 sibling, 1 reply; 48+ messages in thread
From: Dave Hansen @ 2013-09-30 15:27 UTC (permalink / raw)
  To: Mel Gorman, Andrew Morton
  Cc: Kirill A. Shutemov, Andrea Arcangeli, Al Viro, Hugh Dickins,
	Wu Fengguang, Jan Kara, linux-mm, Andi Kleen, Matthew Wilcox,
	Kirill A. Shutemov, Hillf Danton, Ning Qu, Alexander Shishkin,
	linux-fsdevel, linux-kernel

On 09/30/2013 03:02 AM, Mel Gorman wrote:
> I am afraid I never looked too closely once I learned that the primary
> motivation for this was relieving iTLB pressure in a very specific
> case. AFAIK, this is not a problem in the vast majority of modern CPUs
> and I found it very hard to be motivated to review the series as a result.
> I suspected that in many cases that the cost of IO would continue to dominate
> performance instead of TLB pressure. I also found it unlikely that there
> was a workload that was tmpfs based that used enough memory to be hurt
> by TLB pressure. My feedback was that a much more compelling case for the
> series was needed but this discussion all happened on IRC unfortunately.

FWIW, I'm mostly intrigued by the possibilities of how this can speed up
_software_, and I'm rather uninterested in what it can do for the TLB.
Page cache is particularly painful today, precisely because hugetlbfs
and anonymous-thp aren't available there.  If you have an app with
hundreds of GB of files that it wants to mmap(), even if it's in the
page cache, it takes _minutes_ to just fault in.  One example:

	https://lkml.org/lkml/2013/6/27/698

^ permalink raw reply	[flat|nested] 48+ messages in thread

* Re: [PATCHv6 00/22] Transparent huge page cache: phase 1, everything but mmap()
  2013-09-30 10:13     ` Mel Gorman
@ 2013-09-30 16:05       ` Andi Kleen
  0 siblings, 0 replies; 48+ messages in thread
From: Andi Kleen @ 2013-09-30 16:05 UTC (permalink / raw)
  To: Mel Gorman
  Cc: Andrew Morton, Kirill A. Shutemov, Andrea Arcangeli, Al Viro,
	Hugh Dickins, Wu Fengguang, Jan Kara, linux-mm, Matthew Wilcox,
	Kirill A. Shutemov, Hillf Danton, Dave Hansen, Ning Qu,
	Alexander Shishkin, linux-fsdevel, linux-kernel

On Mon, Sep 30, 2013 at 11:13:00AM +0100, Mel Gorman wrote:
> On Tue, Sep 24, 2013 at 04:49:50PM -0700, Andi Kleen wrote:
> > > Sigh.  A pox on whoever thought up huge pages. 
> > 
> > managing 1TB+ of memory in 4K chunks is just insane.
> > The question of larger pages is not "if", but only "when".
> > 
> 
> Remember that there are at least two separate issues there. One is the
> handling data in larger granularities than a 4K page and the second is
> the TLB, pagetable etc handling. They are not necessarily the same problem.

It's the same problem in the end.

The hardware is struggling with 4K pages too (both i and d).

I expect longer-term TLB/page optimization to be far more important
than all this NUMA placement work that people spend so much
time on.


-Andi
-- 
ak@linux.intel.com -- Speaking for myself only

^ permalink raw reply	[flat|nested] 48+ messages in thread

* Re: [PATCHv6 00/22] Transparent huge page cache: phase 1, everything but mmap()
  2013-09-30 15:27     ` Dave Hansen
@ 2013-09-30 18:05       ` Ning Qu
  0 siblings, 0 replies; 48+ messages in thread
From: Ning Qu @ 2013-09-30 18:05 UTC (permalink / raw)
  To: Dave Hansen
  Cc: Mel Gorman, Andrew Morton, Kirill A. Shutemov, Andrea Arcangeli,
	Al Viro, Hugh Dickins, Wu Fengguang, Jan Kara, linux-mm,
	Andi Kleen, Matthew Wilcox, Kirill A. Shutemov, Hillf Danton,
	Alexander Shishkin, linux-fsdevel, linux-kernel

Yes, I agree. In our case, we have files that are tens of GB, and thp
in the page cache does improve the numbers as expected.

And compared to hugetlbfs (static huge pages), it's more flexible and
beneficial system-wide ....


Best wishes,
-- 
Ning Qu (曲宁) | Software Engineer | quning@google.com | +1-408-418-6066


On Mon, Sep 30, 2013 at 8:27 AM, Dave Hansen <dave@sr71.net> wrote:
> On 09/30/2013 03:02 AM, Mel Gorman wrote:
>> I am afraid I never looked too closely once I learned that the primary
>> motivation for this was relieving iTLB pressure in a very specific
>> case. AFAIK, this is not a problem in the vast majority of modern CPUs
>> and I found it very hard to be motivated to review the series as a result.
>> I suspected that in many cases that the cost of IO would continue to dominate
>> performance instead of TLB pressure. I also found it unlikely that there
>> was a workload that was tmpfs based that used enough memory to be hurt
>> by TLB pressure. My feedback was that a much more compelling case for the
>> series was needed but this discussion all happened on IRC unfortunately.
>
> FWIW, I'm mostly intrigued by the possibilities of how this can speed up
> _software_, and I'm rather uninterested in what it can do for the TLB.
> Page cache is particularly painful today, precisely because hugetlbfs
> and anonymous-thp aren't available there.  If you have an app with
> hundreds of GB of files that it wants to mmap(), even if it's in the
> page cache, it takes _minutes_ to just fault in.  One example:
>
>         https://lkml.org/lkml/2013/6/27/698

^ permalink raw reply	[flat|nested] 48+ messages in thread

* Re: [PATCHv6 00/22] Transparent huge page cache: phase 1, everything but mmap()
  2013-09-30 10:10     ` Mel Gorman
@ 2013-09-30 18:07       ` Ning Qu
  2013-09-30 18:51       ` Andi Kleen
  1 sibling, 0 replies; 48+ messages in thread
From: Ning Qu @ 2013-09-30 18:07 UTC (permalink / raw)
  To: Mel Gorman
  Cc: Andrew Morton, Kirill A. Shutemov, Andrea Arcangeli, Al Viro,
	Hugh Dickins, Wu Fengguang, Jan Kara, linux-mm, Andi Kleen,
	Matthew Wilcox, Kirill A. Shutemov, Hillf Danton, Dave Hansen,
	Alexander Shishkin, linux-fsdevel, linux-kernel

I suppose sysv shm and tmpfs share the same code base now, so both of
them will benefit from the thp page cache?

And Kirill's previous patchset (up to v4) contained mmap support as
well. I suppose the patchset was split into smaller groups so it's
easier to review ....

Best wishes,
-- 
Ning Qu (曲宁) | Software Engineer | quning@google.com | +1-408-418-6066


On Mon, Sep 30, 2013 at 3:10 AM, Mel Gorman <mgorman@suse.de> wrote:
> On Mon, Sep 30, 2013 at 11:02:49AM +0100, Mel Gorman wrote:
>> On Tue, Sep 24, 2013 at 04:37:40PM -0700, Andrew Morton wrote:
>> > On Mon, 23 Sep 2013 15:05:28 +0300 "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com> wrote:
>> >
>> > > It brings thp support for ramfs, but without mmap() -- it will be posted
>> > > separately.
>> >
>> > We were never going to do this :(
>> >
>> > Has anyone reviewed these patches much yet?
>> >
>>
>> I am afraid I never looked too closely once I learned that the primary
>> motivation for this was relieving iTLB pressure in a very specific
>> case. AFAIK, this is not a problem in the vast majority of modern CPUs
>> and I found it very hard to be motivated to review the series as a result.
>> I suspected that in many cases that the cost of IO would continue to dominate
>> performance instead of TLB pressure. I also found it unlikely that there
>> was a workload that was tmpfs based that used enough memory to be hurt
>> by TLB pressure. My feedback was that a much more compelling case for the
>> series was needed but this discussion all happened on IRC unfortunately.
>>
>
> Oh, one last thing I forgot. While tmpfs-based workloads were not likely to
> benefit I would expect that sysV shared memory workloads would potentially
> benefit from this.  hugetlbfs is still required for shared memory areas
> but it is not a problem that is addressed by this series.
>
> --
> Mel Gorman
> SUSE Labs

^ permalink raw reply	[flat|nested] 48+ messages in thread

* Re: [PATCHv6 00/22] Transparent huge page cache: phase 1, everything but mmap()
  2013-09-30 10:10     ` Mel Gorman
  2013-09-30 18:07       ` Ning Qu
@ 2013-09-30 18:51       ` Andi Kleen
  2013-10-01  8:38         ` Mel Gorman
  1 sibling, 1 reply; 48+ messages in thread
From: Andi Kleen @ 2013-09-30 18:51 UTC (permalink / raw)
  To: Mel Gorman
  Cc: Andrew Morton, Kirill A. Shutemov, Andrea Arcangeli, Al Viro,
	Hugh Dickins, Wu Fengguang, Jan Kara, linux-mm, Matthew Wilcox,
	Kirill A. Shutemov, Hillf Danton, Dave Hansen, Ning Qu,
	Alexander Shishkin, linux-fsdevel, linux-kernel

> AFAIK, this is not a problem in the vast majority of modern CPUs

Let's do some simple math: e.g. a Sandy Bridge system has 512 4K iTLB L2 entries.
That's around 2MB. There's more and more code whose footprint exceeds
that.
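
(For completeness, the arithmetic behind that figure: 512 entries * 4KB
per entry = 2048KB = 2MB of iTLB reach, so any hot code footprint beyond
2MB starts taking iTLB misses with 4K pages.)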

Besides, the iTLB is not the only target; large pages are also useful
for data, of course.

> > and I found it very hard to be motivated to review the series as a result.
> > I suspected that in many cases that the cost of IO would continue to dominate
> > performance instead of TLB pressure

The trend is to larger and larger memories, keeping things in memory.

In fact there's a good argument that memory sizes are growing faster
than TLB capacities. And without large TLBs we're even further off
the curve.

> Oh, one last thing I forgot. While tmpfs-based workloads were not likely to
> benefit I would expect that sysV shared memory workloads would potentially
> benefit from this.  hugetlbfs is still required for shared memory areas
> but it is not a problem that is addressed by this series.

Of course it's only the first step. But if no one takes the baby steps,
then the other usages will never materialize.

I expect that once ramfs works, extending it to tmpfs etc. should be
straightforward.

-Andi

-- 
ak@linux.intel.com -- Speaking for myself only

^ permalink raw reply	[flat|nested] 48+ messages in thread

* Re: [PATCHv6 00/22] Transparent huge page cache: phase 1, everything but mmap()
  2013-09-30 18:51       ` Andi Kleen
@ 2013-10-01  8:38         ` Mel Gorman
  2013-10-01 17:11           ` Ning Qu
  2013-10-14 14:27           ` Kirill A. Shutemov
  0 siblings, 2 replies; 48+ messages in thread
From: Mel Gorman @ 2013-10-01  8:38 UTC (permalink / raw)
  To: Andi Kleen
  Cc: Andrew Morton, Kirill A. Shutemov, Andrea Arcangeli, Al Viro,
	Hugh Dickins, Wu Fengguang, Jan Kara, linux-mm, Matthew Wilcox,
	Kirill A. Shutemov, Hillf Danton, Dave Hansen, Ning Qu,
	Alexander Shishkin, linux-fsdevel, linux-kernel

On Mon, Sep 30, 2013 at 11:51:06AM -0700, Andi Kleen wrote:
> > AFAIK, this is not a problem in the vast majority of modern CPUs
> 
> Let's do some simple math: e.g. a Sandy Bridge system has 512 4K iTLB L2 entries.
> That's around 2MB. There's more and more code whose footprint exceeds
> that.
> 

With the expectation that it is read-mostly data, replicated between the
caches accessing it, and that TLB refills take very little time. This is
not universally true and there are exceptions, but even recent papers on
TLB behaviour have tended to dismiss the iTLB refill overhead as a
negligible portion of the overall workload of interest.

> Besides iTLB is not the only target. It is also useful for 
> data of course.
> 

True, but how useful? I have not seen an example of a workload showing that
dTLB pressure on file-backed data was a major component of the workload. I
would expect that sysV shared memory is an exception but does that require
generic support for all filesystems or can tmpfs be special cased when
it's used for shared memory?

For normal data, if it's read-only then there would be some benefit to
using huge pages once the data is in the page cache. How common are
workloads that mmap() large amounts of read-only data? Possibly some
databases, depending on the workload, although there I would expect that
the data is placed in shared memory.

If the mmap()'d data is being written, then the cost of IO is likely to
dominate, not TLB pressure. For write-mostly workloads there are greater
concerns, because dirty tracking can only be done at the huge page
boundary, potentially leading to greater amounts of IO and degraded
performance overall.
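
(To put a rough number on it: if dirtiness can only be tracked per huge
page, dirtying a single 4K block can force writeback of the whole 2MB,
i.e. up to 2MB / 4KB = 512 times more IO than necessary in the worst
case.)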

I could be completely wrong here but these were the concerns I had when
I first glanced through the patches. The changelogs had no information
to convince me otherwise so I never dedicated the time to reviewing the
patches in detail. I raised my concerns and then dropped it.

> > > and I found it very hard to be motivated to review the series as a result.
> > > I suspected that in many cases that the cost of IO would continue to dominate
> > > performance instead of TLB pressure
> 
> The trend is to larger and larger memories, keeping things in memory.
> 

Yes, but using huge pages is not *necessarily* the answer. For fault
scalability it would probably be a lot easier to batch-handle faults if
readahead indicates accesses are sequential. Background zeroing of pages
could be revisited for fault-intensive workloads. A potential alternative
is to allocate a contiguous block, zero it as one lump, split it into
small pages and put them onto a local per-task list, although the details
get messy. Reclaim scanning could be heavily modified to use collections
of pages instead of single pages (although I'm not aware of the proper
design for such a thing).
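
As a very rough sketch of that alternative (hypothetical helper, nothing
like it exists today; locking, accounting and per-task plumbing are all
ignored):

#include <linux/errno.h>
#include <linux/gfp.h>
#include <linux/huge_mm.h>
#include <linux/list.h>
#include <linux/mm.h>

/*
 * Sketch: allocate an order-9 block, let the allocator zero it in one
 * pass, split it into base pages and park them on a per-task list so
 * that later faults can be satisfied without zeroing each 4K page.
 */
static int refill_task_page_pool(struct list_head *pool)
{
	struct page *block;
	int i;

	block = alloc_pages(GFP_KERNEL | __GFP_ZERO | __GFP_NOWARN,
			    HPAGE_PMD_ORDER);
	if (!block)
		return -ENOMEM;	/* caller falls back to ordinary 4K pages */

	split_page(block, HPAGE_PMD_ORDER);
	for (i = 0; i < HPAGE_PMD_NR; i++)
		list_add(&(block + i)->lru, pool);

	return 0;
}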

Again, this could be completely off the mark, but if it were me working
on this problem, I would have some profile data from some workloads to
make sure the part I'm optimising was a noticeable percentage of the
workload, and I would include that in the patch leader. I would hope the
data was compelling enough to convince reviewers to pay close attention
to the series, as the complexity would then be justified. Based on how
complex THP was for anonymous pages, I would be tempted to treat THP for
file-backed data as a last resort.

> In fact there's a good argument that memory sizes are growing faster
> than TLB capacities. And without large TLBs we're even further off
> the curve.
> 

I'll admit this is also true. It was considered to be true in the 90's
when huge pages were first being thrown around as a possible solution to
the problem. One paper recently suggested using segmentation for large
memory segments but the workloads they examined looked like they would
be dominated by anonymous access, not file-backed data with one exception
where the workload frequently accessed compile-time constants.

-- 
Mel Gorman
SUSE Labs

^ permalink raw reply	[flat|nested] 48+ messages in thread

* Re: [PATCHv6 00/22] Transparent huge page cache: phase 1, everything but mmap()
  2013-10-01  8:38         ` Mel Gorman
@ 2013-10-01 17:11           ` Ning Qu
  2013-10-14 14:27           ` Kirill A. Shutemov
  1 sibling, 0 replies; 48+ messages in thread
From: Ning Qu @ 2013-10-01 17:11 UTC (permalink / raw)
  To: Mel Gorman
  Cc: Andi Kleen, Andrew Morton, Kirill A. Shutemov, Andrea Arcangeli,
	Al Viro, Hugh Dickins, Wu Fengguang, Jan Kara, linux-mm,
	Matthew Wilcox, Kirill A. Shutemov, Hillf Danton, Dave Hansen,
	Alexander Shishkin, linux-fsdevel, linux-kernel

I can throw in some numbers for one of the test cases I am working on.

One of the workloads uses sysv shm to load GB-scale files into memory,
which is shared with other worker processes for the long term. We load
as many files as fit in the physical memory available. The heap is also
pretty big (GB-scale as well) to handle that data.

For the workload I just mentioned, thp gives us about an 8% performance
improvement: 5% from thp anonymous memory and 3% from the thp page
cache. It might not look like much, but it's pretty good without
changing a single line of application code, which is the beauty of thp.

Before that, we had been using hugetlbfs, where we have to reserve a
huge amount of memory at boot time whether or not that memory will be
used. It works, but no other major services can ever share the server's
resources.
Best wishes,
-- 
Ning Qu (曲宁) | Software Engineer | quning@google.com | +1-408-418-6066


On Tue, Oct 1, 2013 at 1:38 AM, Mel Gorman <mgorman@suse.de> wrote:
> On Mon, Sep 30, 2013 at 11:51:06AM -0700, Andi Kleen wrote:
>> > AFAIK, this is not a problem in the vast majority of modern CPUs
>>
>> Let's do some simple math: e.g. a Sandy Bridge system has 512 4K iTLB L2 entries.
>> That's around 2MB. There's more and more code whose footprint exceeds
>> that.
>>
>
> With an expectation that it is read-mostly data, replicated between the
> caches accessing it and TLB refills taking very little time. This is not
> universally true and there are exceptions but even recent papers on TLB
> behaviour have tended to dismiss the iTLB refill overhead as a negligible
> portion of the overall workload of interest.
>
>> Besides iTLB is not the only target. It is also useful for
>> data of course.
>>
>
> True, but how useful? I have not seen an example of a workload showing that
> dTLB pressure on file-backed data was a major component of the workload. I
> would expect that sysV shared memory is an exception but does that require
> generic support for all filesystems or can tmpfs be special cased when
> it's used for shared memory?
>
> For normal data, if it's read-only data then there would be some benefit to
> using huge pages once the data is in page cache. How common are workloads
> that mmap() large amounts of read-only data? Possibly some databases
> depending on the workload although there I would expect that the data is
> placed in shared memory.
>
> If the mmap()s data is being written then the cost of IO is likely to
> dominate, not TLB pressure. For write-mostly workloads there are greater
> concerns because dirty tracking can only be done at the huge page boundary
> potentially leading to greater amounts of IO and degraded performance
> overall.
>
> I could be completely wrong here but these were the concerns I had when
> I first glanced through the patches. The changelogs had no information
> to convince me otherwise so I never dedicated the time to reviewing the
> patches in detail. I raised my concerns and then dropped it.
>
>> > > and I found it very hard to be motivated to review the series as a result.
>> > > I suspected that in many cases that the cost of IO would continue to dominate
>> > > performance instead of TLB pressure
>>
>> The trend is to larger and larger memories, keeping things in memory.
>>
>
> Yes, but using huge pages is not *necessarily* the answer. For fault
> scalability it probably would be a lot easier to batch handle faults if
> readahead indicates accesses are sequential. Background zeroing of pages
> could be revisited for fault intensive workloads. A potential alternative
> is that a contiguous page is allocated, zerod as one lump, split the pages
> and put onto a local per-task list although the details get messy. Reclaim
> scanning could be heavily modified to use collections of pages instead of
> single pages (although I'm not aware of the proper design of such a thing).
>
> Again, this could be completely off the mark but if it was me that was
> working on this problem, I would have some profile data from some workloads
> to make sure the part I'm optimising was a noticable percentage of the
> workload and included that in the patch leader. I would hope that the data
> was compelling enough to convince reviewers to pay close attention to the
> series as the complexity would then be justified. Based on how complex THP
> was for anonymous pages, I would be tempted to treat THP for file-backed
> data as a last resort.
>
>> In fact there's a good argument that memory sizes are growing faster
>> than TLB capacities. And without large TLBs we're even further off
>> the curve.
>>
>
> I'll admit this is also true. It was considered to be true in the 90's
> when huge pages were first being thrown around as a possible solution to
> the problem. One paper recently suggested using segmentation for large
> memory segments but the workloads they examined looked like they would
> be dominated by anonymous access, not file-backed data with one exception
> where the workload frequently accessed compile-time constants.
>
> --
> Mel Gorman
> SUSE Labs

^ permalink raw reply	[flat|nested] 48+ messages in thread

* Re: [PATCHv6 00/22] Transparent huge page cache: phase 1, everything but mmap()
  2013-09-25 23:29     ` Dave Chinner
@ 2013-10-14 13:56       ` Kirill A. Shutemov
  0 siblings, 0 replies; 48+ messages in thread
From: Kirill A. Shutemov @ 2013-10-14 13:56 UTC (permalink / raw)
  To: Dave Chinner
  Cc: Kirill A. Shutemov, Andrew Morton, Andrea Arcangeli, Al Viro,
	Hugh Dickins, Wu Fengguang, Jan Kara, Mel Gorman, linux-mm,
	Andi Kleen, Matthew Wilcox, Kirill A. Shutemov, Hillf Danton,
	Dave Hansen, Ning Qu, Alexander Shishkin, linux-fsdevel,
	linux-kernel

Dave Chinner wrote:
> On Wed, Sep 25, 2013 at 12:51:04PM +0300, Kirill A. Shutemov wrote:
> > Andrew Morton wrote:
> > > On Mon, 23 Sep 2013 15:05:28 +0300 "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com> wrote:
> > > 
> > > > It brings thp support for ramfs, but without mmap() -- it will be posted
> > > > separately.
> > > 
> > > We were never going to do this :(
> > > 
> > > Has anyone reviewed these patches much yet?
> > 
> > Dave did very good review. Few other people looked to separate patches.
> > See Reviewed-by/Acked-by tags in patches.
> > 
> > It looks like most mm experts are busy with numa balancing nowadays, so
> > it's hard to get more review.
> 
> Nobody has reviewed it from the filesystem side, though.
> 
> The changes that require special code paths for huge pages in the
> write_begin/write_end paths are nasty. You're adding conditional
> code that depends on the page size and then having to add checks to
> ensure that large page operations don't step over small page
> boundaries and other such corner cases. It's an extremely fragile
> design, IMO.
> 
> In general, I don't like all the if (thp) {} else {}; code that this
> series introduces - they are code paths that simply won't get tested
> with any sort of regularity and make the code more complex for those
> that aren't using THP to understand and debug...

Okay, I'll try to get rid of the special cases where possible.

> Then there is a new per-inode lock that is used in
> generic_perform_write() which is held across page faults and calls
> to filesystem block mapping callbacks. This inserts into the middle
> of an existing locking chain that needs to be strictly ordered, and
> as such will lead to the same type of lock inversion problems that
> the mmap_sem had.  We do not want to introduce a new lock that has
> this same problem just as we are getting rid of that long standing
> nastiness from the page fault path...

I don't see how we can protect against splitting with the existing
locks, but I'll try to find a way.

> I also note that you didn't convert invalidate_inode_pages2_range()
> to support huge pages which is needed by real filesystems that
> support direct IO. There are other truncate/invalidate interfaces
> that you didn't convert, either, and some of them will present you
> with interesting locking challenges as a result of adding that new
> lock...

Thanks. I'll take a look at these code paths.

> > The patchset was mostly ignored for few rounds and Dave suggested to split
> > to have less scary patch number.
> 
> It's still being ignored by filesystem people because you haven't
> actually tried to implement support into a real filesystem.....

If it supported a real filesystem, wouldn't it be ignored due to the
patch count? ;)

> > > > Please review and consider applying.
> > > 
> > > It appears rather too immature at this stage.
> > 
> > More review is always welcome and I'm committed to address issues.
> 
> IMO, supporting a real block based filesystem like ext4 or XFS and
> demonstrating that everything works is necessary before we go any
> further...

I'll see what numbers I can bring in the next iterations.

Thanks for your feedback. And sorry for the late answer.

-- 
 Kirill A. Shutemov

^ permalink raw reply	[flat|nested] 48+ messages in thread

* Re: [PATCHv6 00/22] Transparent huge page cache: phase 1, everything but mmap()
  2013-10-01  8:38         ` Mel Gorman
  2013-10-01 17:11           ` Ning Qu
@ 2013-10-14 14:27           ` Kirill A. Shutemov
  1 sibling, 0 replies; 48+ messages in thread
From: Kirill A. Shutemov @ 2013-10-14 14:27 UTC (permalink / raw)
  To: Mel Gorman
  Cc: Andi Kleen, Andrew Morton, Kirill A. Shutemov, Andrea Arcangeli,
	Al Viro, Hugh Dickins, Wu Fengguang, Jan Kara, linux-mm,
	Matthew Wilcox, Kirill A. Shutemov, Hillf Danton, Dave Hansen,
	Ning Qu, Alexander Shishkin, linux-fsdevel, linux-kernel

Mel Gorman wrote:
> I could be completely wrong here but these were the concerns I had when
> I first glanced through the patches. The changelogs had no information
> to convince me otherwise so I never dedicated the time to reviewing the
> patches in detail. I raised my concerns and then dropped it.

Okay. I got your point: more data from real-world workloads. I'll try to
bring some in the next iteration.

-- 
 Kirill A. Shutemov

^ permalink raw reply	[flat|nested] 48+ messages in thread

* Re: [PATCHv6 00/22] Transparent huge page cache: phase 1, everything but mmap()
@ 2013-09-25 18:11 Ning Qu
  0 siblings, 0 replies; 48+ messages in thread
From: Ning Qu @ 2013-09-25 18:11 UTC (permalink / raw)
  To: Kirill A. Shutemov
  Cc: Andrea Arcangeli, Andrew Morton, Al Viro, Hugh Dickins,
	Wu Fengguang, Jan Kara, Mel Gorman, linux-mm, Andi Kleen,
	Matthew Wilcox, Kirill A. Shutemov, Hillf Danton, Dave Hansen,
	Alexander Shishkin, linux-fsdevel, linux-kernel

Got you. Thanks!

Best wishes,
-- 
Ning Qu (曲宁) | Software Engineer | quning@google.com | +1-408-418-6066


On Wed, Sep 25, 2013 at 2:23 AM, Kirill A. Shutemov
<kirill.shutemov@linux.intel.com> wrote:
> Ning Qu wrote:
>> Hi, Kirill,
>>
>> Seems you dropped one patch in v5, is that intentional? Just wondering ...
>>
>>   thp, mm: handle tail pages in page_cache_get_speculative()
>
> It's not needed anymore, since we don't have tail pages in radix tree.
>
> --
>  Kirill A. Shutemov
>
> --
> To unsubscribe, send a message with 'unsubscribe linux-mm' in
> the body to majordomo@kvack.org.  For more info on Linux MM,
> see: http://www.linux-mm.org/ .
> Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 48+ messages in thread

end of thread, other threads:[~2013-10-14 14:27 UTC | newest]

Thread overview: 48+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2013-09-23 12:05 [PATCHv6 00/22] Transparent huge page cache: phase 1, everything but mmap() Kirill A. Shutemov
2013-09-23 12:05 ` [PATCHv6 01/22] mm: implement zero_huge_user_segment and friends Kirill A. Shutemov
2013-09-23 12:05 ` [PATCHv6 02/22] radix-tree: implement preload for multiple contiguous elements Kirill A. Shutemov
2013-09-23 12:05 ` [PATCHv6 03/22] memcg, thp: charge huge cache pages Kirill A. Shutemov
2013-09-23 12:05 ` [PATCHv6 04/22] thp: compile-time and sysfs knob for thp pagecache Kirill A. Shutemov
2013-09-23 12:05 ` [PATCHv6 05/22] thp, mm: introduce mapping_can_have_hugepages() predicate Kirill A. Shutemov
2013-09-23 12:05 ` [PATCHv6 06/22] thp: represent file thp pages in meminfo and friends Kirill A. Shutemov
2013-09-23 12:05 ` [PATCHv6 07/22] thp, mm: rewrite add_to_page_cache_locked() to support huge pages Kirill A. Shutemov
2013-09-23 12:05 ` [PATCHv6 08/22] mm: trace filemap: dump page order Kirill A. Shutemov
2013-09-23 12:05 ` [PATCHv6 09/22] block: implement add_bdi_stat() Kirill A. Shutemov
2013-09-23 12:05 ` [PATCHv6 10/22] thp, mm: rewrite delete_from_page_cache() to support huge pages Kirill A. Shutemov
2013-09-25 20:02   ` Ning Qu
2013-09-23 12:05 ` [PATCHv6 11/22] thp, mm: warn if we try to use replace_page_cache_page() with THP Kirill A. Shutemov
2013-09-23 12:05 ` [PATCHv6 12/22] thp, mm: add event counters for huge page alloc on file write or read Kirill A. Shutemov
2013-09-23 12:05 ` [PATCHv6 13/22] mm, vfs: introduce i_split_sem Kirill A. Shutemov
2013-09-23 12:05 ` [PATCHv6 14/22] thp, mm: allocate huge pages in grab_cache_page_write_begin() Kirill A. Shutemov
2013-09-23 12:05 ` [PATCHv6 15/22] thp, mm: naive support of thp in generic_perform_write Kirill A. Shutemov
2013-09-23 12:05 ` [PATCHv6 16/22] thp, mm: handle transhuge pages in do_generic_file_read() Kirill A. Shutemov
2013-09-23 12:05 ` [PATCHv6 17/22] thp, libfs: initial thp support Kirill A. Shutemov
2013-09-23 12:05 ` [PATCHv6 18/22] truncate: support huge pages Kirill A. Shutemov
2013-09-23 12:05 ` [PATCHv6 19/22] thp: handle file pages in split_huge_page() Kirill A. Shutemov
2013-09-23 12:05 ` [PATCHv6 20/22] thp: wait_split_huge_page(): serialize over i_mmap_mutex too Kirill A. Shutemov
2013-09-23 12:05 ` [PATCHv6 21/22] thp, mm: split huge page on mmap file page Kirill A. Shutemov
2013-09-23 12:05 ` [PATCHv6 22/22] ramfs: enable transparent huge page cache Kirill A. Shutemov
2013-09-24 23:37 ` [PATCHv6 00/22] Transparent huge page cache: phase 1, everything but mmap() Andrew Morton
2013-09-24 23:49   ` Andi Kleen
2013-09-24 23:58     ` Andrew Morton
2013-09-25 11:15       ` Kirill A. Shutemov
2013-09-25 15:05         ` Andi Kleen
2013-09-26 18:30     ` Zach Brown
2013-09-26 19:05       ` Andi Kleen
2013-09-30 10:13     ` Mel Gorman
2013-09-30 16:05       ` Andi Kleen
2013-09-25  9:51   ` Kirill A. Shutemov
2013-09-25 23:29     ` Dave Chinner
2013-10-14 13:56       ` Kirill A. Shutemov
2013-09-30 10:02   ` Mel Gorman
2013-09-30 10:10     ` Mel Gorman
2013-09-30 18:07       ` Ning Qu
2013-09-30 18:51       ` Andi Kleen
2013-10-01  8:38         ` Mel Gorman
2013-10-01 17:11           ` Ning Qu
2013-10-14 14:27           ` Kirill A. Shutemov
2013-09-30 15:27     ` Dave Hansen
2013-09-30 18:05       ` Ning Qu
     [not found] ` <CACz4_2drFs5LsM8mTFNOWGHAs0QbsNfHAhiBXJ7jM3qkGerd5w@mail.gmail.com>
2013-09-25  9:23   ` Kirill A. Shutemov
2013-09-26 21:13 ` Dave Hansen
2013-09-25 18:11 Ning Qu
