[PATCHv6 00/22] Transparent huge page cache: phase 1, everything but mmap()

* [PATCHv6 00/22] Transparent huge page cache: phase 1, everything but mmap()
@ 2013-09-23 12:05 Kirill A. Shutemov
  2013-09-23 12:05 ` [PATCHv6 01/22] mm: implement zero_huge_user_segment and friends Kirill A. Shutemov
                   ` (24 more replies)
  0 siblings, 25 replies; 48+ messages in thread
From: Kirill A. Shutemov @ 2013-09-23 12:05 UTC (permalink / raw)
  To: Andrea Arcangeli, Andrew Morton
  Cc: Al Viro, Hugh Dickins, Wu Fengguang, Jan Kara, Mel Gorman,
	linux-mm, Andi Kleen, Matthew Wilcox, Kirill A. Shutemov,
	Hillf Danton, Dave Hansen, Ning Qu, Alexander Shishkin,
	linux-fsdevel, linux-kernel, Kirill A. Shutemov

It brings thp support for ramfs, but without mmap() -- it will be posted
separately.

Please review and consider applying.

Intro
-----

The goal of the project is preparing kernel infrastructure to handle huge
pages in page cache.

To proof that the proposed changes are functional we enable the feature
for the most simple file system -- ramfs. ramfs is not that useful by
itself, but it's good pilot project.

Design overview
---------------

Every huge page is represented in page cache radix-tree by HPAGE_PMD_NR
(512 on x86-64) entries. All entries points to head page -- refcounting for
tail pages is pretty expensive.

Radix tree manipulations are implemented in batched way: we add and remove
whole huge page at once, under one tree_lock. To make it possible, we
extended radix-tree interface to be able to pre-allocate memory enough to
insert a number of *contiguous* elements (kudos to Matthew Wilcox).

Huge pages can be added to page cache three ways:
 - write(2) to file or page;
 - read(2) from sparse file;
 - fault sparse file.

Potentially, one more way is collapsing small page, but it's outside initial
implementation.

For now we still write/read at most PAGE_CACHE_SIZE bytes a time. There's
some room for speed up later.

Since mmap() isn't targeted for this patchset, we just split huge page on
page fault.

To minimize memory overhead for small files we aviod write-allocation in
first huge page area (2M on x86-64) of the file.

truncate_inode_pages_range() drops whole huge page at once if it's fully
inside the range. If a huge page is only partly in the range we zero out
the part, exactly like we do for partial small pages.

split_huge_page() for file pages works similar to anon pages, but we
walk by mapping->i_mmap rather then anon_vma->rb_root. At the end we call
truncate_inode_pages() to drop small pages beyond i_size, if any.

inode->i_split_sem taken on read will protect hugepages in inode's pagecache
against splitting. We take it on write during splitting.

Changes since v5
----------------
 - change how hugepage stored in pagecache: head page for all relevant
   indexes;
 - introduce i_split_sem;
 - do not create huge pages on write(2) into first hugepage area;
 - compile-disabled by default;
 - fix transparent_hugepage_pagecache();

Benchmarks
----------

Since the patchset doesn't include mmap() support, we should expect much
change in performance. We just need to check that we don't introduce any
major regression.

On average read/write on ramfs with thp is a bit slower, but I don't think
it's a stopper -- ramfs is a toy anyway, on real world filesystems I
expect difference to be smaller.

postmark
========

workload1:
chmod +x postmark
mount -t ramfs none /mnt
cat >/root/workload1 <<EOF
set transactions 250000
set size 5120 524288
set number 500
run
quit

workload2:
set transactions 10000
set size 2097152 10485760
set number 100
run
quit

throughput (transactions/sec)
                workload1       workload2
baseline        8333            416
patched         8333            454

FS-Mark
=======

throughput (files/sec)

                2000 files by 1M        200 files by 10M
baseline        5326.1                  548.1
patched         5192.8                  528.4

tiobench
========

baseline:
Tiotest results for 16 concurrent io threads:
,----------------------------------------------------------------------.
| Item                  | Time     | Rate         | Usr CPU  | Sys CPU |
+-----------------------+----------+--------------+----------+---------+
| Write        2048 MBs |    0.2 s | 8667.792 MB/s | 445.2 %  | 5535.9 % |
| Random Write   62 MBs |    0.0 s | 8341.118 MB/s |   0.0 %  | 2615.8 % |
| Read         2048 MBs |    0.2 s | 11680.431 MB/s | 339.9 %  | 5470.6 % |
| Random Read    62 MBs |    0.0 s | 9451.081 MB/s | 786.3 %  | 1451.7 % |
`----------------------------------------------------------------------'
Tiotest latency results:
,-------------------------------------------------------------------------.
| Item         | Average latency | Maximum latency | % >2 sec | % >10 sec |
+--------------+-----------------+-----------------+----------+-----------+
| Write        |        0.006 ms |       28.019 ms |  0.00000 |   0.00000 |
| Random Write |        0.002 ms |        5.574 ms |  0.00000 |   0.00000 |
| Read         |        0.005 ms |       28.018 ms |  0.00000 |   0.00000 |
| Random Read  |        0.002 ms |        4.852 ms |  0.00000 |   0.00000 |
|--------------+-----------------+-----------------+----------+-----------|
| Total        |        0.005 ms |       28.019 ms |  0.00000 |   0.00000 |
`--------------+-----------------+-----------------+----------+-----------'

patched:
Tiotest results for 16 concurrent io threads:
,----------------------------------------------------------------------.
| Item                  | Time     | Rate         | Usr CPU  | Sys CPU |
+-----------------------+----------+--------------+----------+---------+
| Write        2048 MBs |    0.3 s | 7942.818 MB/s | 442.1 %  | 5533.6 % |
| Random Write   62 MBs |    0.0 s | 9425.426 MB/s | 723.9 %  | 965.2 % |
| Read         2048 MBs |    0.2 s | 11998.008 MB/s | 374.9 %  | 5485.8 % |
| Random Read    62 MBs |    0.0 s | 9823.955 MB/s | 251.5 %  | 2011.9 % |
`----------------------------------------------------------------------'
Tiotest latency results:
,-------------------------------------------------------------------------.
| Item         | Average latency | Maximum latency | % >2 sec | % >10 sec |
+--------------+-----------------+-----------------+----------+-----------+
| Write        |        0.007 ms |       28.020 ms |  0.00000 |   0.00000 |
| Random Write |        0.001 ms |        0.022 ms |  0.00000 |   0.00000 |
| Read         |        0.004 ms |       24.011 ms |  0.00000 |   0.00000 |
| Random Read  |        0.001 ms |        0.019 ms |  0.00000 |   0.00000 |
|--------------+-----------------+-----------------+----------+-----------|
| Total        |        0.005 ms |       28.020 ms |  0.00000 |   0.00000 |
`--------------+-----------------+-----------------+----------+-----------'

IOZone
======

Syscalls, not mmap.

** Initial writers **
threads:	          1          2          4          8         10         20         30         40         50         60         70         80
baseline:	    4741691    7986408    9149064    9898695    9868597    9629383    9469202   11605064    9507802   10641869   11360701   11040376
patched:	    4682864    7275535    8691034    8872887    8712492    8771912    8397216    7701346    7366853    8839736    8299893   10788439
speed-up(times):       0.99       0.91       0.95       0.90       0.88       0.91       0.89       0.66       0.77       0.83       0.73       0.98

** Rewriters **
threads:	          1          2          4          8         10         20         30         40         50         60         70         80
baseline:	    5807891    9554869   12101083   13113533   12989751   14359910   16998236   16833861   24735659   17502634   17396706   20448655
patched:	    6161690    9981294   12285789   13428846   13610058   13669153   20060182   17328347   24109999   19247934   24225103   34686574
speed-up(times):       1.06       1.04       1.02       1.02       1.05       0.95       1.18       1.03       0.97       1.10       1.39       1.70

** Readers **
threads:	          1          2          4          8         10         20         30         40         50         60         70         80
baseline:	    7978066   11825735   13808941   14049598   14765175   14422642   17322681   23209831   21386483   20060744   22032935   31166663
patched:	    7723293   11481500   13796383   14363808   14353966   14979865   17648225   18701258   29192810   23973723   22163317   23104638
speed-up(times):       0.97       0.97       1.00       1.02       0.97       1.04       1.02       0.81       1.37       1.20       1.01       0.74

** Re-readers **
threads:	          1          2          4          8         10         20         30         40         50         60         70         80
baseline:	    7966269   11878323   14000782   14678206   14154235   14271991   15170829   20924052   27393344   19114990   12509316   18495597
patched:	    7719350   11410937   13710233   13232756   14040928   15895021   16279330   17256068   26023572   18364678   27834483   23288680
speed-up(times):       0.97       0.96       0.98       0.90       0.99       1.11       1.07       0.82       0.95       0.96       2.23       1.26

** Reverse readers **
threads:	          1          2          4          8         10         20         30         40         50         60         70         80
baseline:	    6630795   10331013   12839501   13157433   12783323   13580283   15753068   15434572   21928982   17636994   14737489   19470679
patched:	    6502341    9887711   12639278   12979232   13212825   12928255   13961195   14695786   21370667   19873807   20902582   21892899
speed-up(times):       0.98       0.96       0.98       0.99       1.03       0.95       0.89       0.95       0.97       1.13       1.42       1.12

** Random_readers **
threads:	          1          2          4          8         10         20         30         40         50         60         70         80
baseline:	    5152935    9043813   11752615   11996078   12283579   12484039   14588004   15781507   23847538   15748906   13698335   27195847
patched:	    5009089    8438137   11266015   11631218   12093650   12779308   17768691   13640378   30468890   19269033   23444358   22775908
speed-up(times):       0.97       0.93       0.96       0.97       0.98       1.02       1.22       0.86       1.28       1.22       1.71       0.84

** Random_writers **
threads:	          1          2          4          8         10         20         30         40         50         60         70         80
baseline:	    3886268    7405345   10531192   10858984   10994693   12758450   10729531    9656825   10370144   13139452    4528331   12615812
patched:	    4335323    7916132   10978892   11423247   11790932   11424525   11798171   11413452   12230616   13075887   11165314   16925679
speed-up(times):       1.12       1.07       1.04       1.05       1.07       0.90       1.10       1.18       1.18       1.00       2.47       1.34

Kirill A. Shutemov (22):
  mm: implement zero_huge_user_segment and friends
  radix-tree: implement preload for multiple contiguous elements
  memcg, thp: charge huge cache pages
  thp: compile-time and sysfs knob for thp pagecache
  thp, mm: introduce mapping_can_have_hugepages() predicate
  thp: represent file thp pages in meminfo and friends
  thp, mm: rewrite add_to_page_cache_locked() to support huge pages
  mm: trace filemap: dump page order
  block: implement add_bdi_stat()
  thp, mm: rewrite delete_from_page_cache() to support huge pages
  thp, mm: warn if we try to use replace_page_cache_page() with THP
  thp, mm: add event counters for huge page alloc on file write or read
  mm, vfs: introduce i_split_sem
  thp, mm: allocate huge pages in grab_cache_page_write_begin()
  thp, mm: naive support of thp in generic_perform_write
  thp, mm: handle transhuge pages in do_generic_file_read()
  thp, libfs: initial thp support
  truncate: support huge pages
  thp: handle file pages in split_huge_page()
  thp: wait_split_huge_page(): serialize over i_mmap_mutex too
  thp, mm: split huge page on mmap file page
  ramfs: enable transparent huge page cache

 Documentation/vm/transhuge.txt |  16 ++++
 drivers/base/node.c            |   4 +
 fs/inode.c                     |   3 +
 fs/libfs.c                     |  58 +++++++++++-
 fs/proc/meminfo.c              |   3 +
 fs/ramfs/file-mmu.c            |   2 +-
 fs/ramfs/inode.c               |   6 +-
 include/linux/backing-dev.h    |  10 +++
 include/linux/fs.h             |  11 +++
 include/linux/huge_mm.h        |  68 +++++++++++++-
 include/linux/mm.h             |  18 ++++
 include/linux/mmzone.h         |   1 +
 include/linux/page-flags.h     |  13 +++
 include/linux/pagemap.h        |  31 +++++++
 include/linux/radix-tree.h     |  11 +++
 include/linux/vm_event_item.h  |   4 +
 include/trace/events/filemap.h |   7 +-
 lib/radix-tree.c               |  94 ++++++++++++++++++--
 mm/Kconfig                     |  11 +++
 mm/filemap.c                   | 196 ++++++++++++++++++++++++++++++++---------
 mm/huge_memory.c               | 147 +++++++++++++++++++++++++++----
 mm/memcontrol.c                |   3 +-
 mm/memory.c                    |  40 ++++++++-
 mm/truncate.c                  | 125 ++++++++++++++++++++------
 mm/vmstat.c                    |   5 ++
 25 files changed, 779 insertions(+), 108 deletions(-)

-- 
1.8.4.rc3

^ permalink raw reply	[flat|nested] 48+ messages in thread