From: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
To: Andrea Arcangeli <aarcange@redhat.com>,
	Andrew Morton <akpm@linux-foundation.org>
Cc: Al Viro <viro@zeniv.linux.org.uk>,
	Hugh Dickins <hughd@google.com>,
	Wu Fengguang <fengguang.wu@intel.com>, Jan Kara <jack@suse.cz>,
	Mel Gorman <mgorman@suse.de>,
	linux-mm@kvack.org, Andi Kleen <ak@linux.intel.com>,
	Matthew Wilcox <willy@linux.intel.com>,
	"Kirill A. Shutemov" <kirill@shutemov.name>,
	Hillf Danton <dhillf@gmail.com>, Dave Hansen <dave@sr71.net>,
	Ning Qu <quning@google.com>,
	linux-fsdevel@vger.kernel.org, linux-kernel@vger.kernel.org,
	"Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
Subject: [PATCHv5 00/23] Transparent huge page cache: phase 1, everything but mmap()
Date: Sun,  4 Aug 2013 05:17:02 +0300
Message-ID: <1375582645-29274-1-git-send-email-kirill.shutemov@linux.intel.com>

From: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>

This is the second part of my transparent huge page cache work.
It brings thp support to ramfs, but without mmap() -- mmap() support will
be posted separately.

Intro
-----

The goal of the project is to prepare kernel infrastructure to handle huge
pages in the page cache.

To prove that the proposed changes are functional we enable the feature
for the simplest file system -- ramfs. ramfs is not that useful by
itself, but it's a good pilot project.

Design overview
---------------

Every huge page is represented in the page cache radix tree by HPAGE_PMD_NR
(512 on x86-64) entries: one entry for the head page and HPAGE_PMD_NR-1
entries for tail pages.

Radix tree manipulations are implemented in a batched way: we add and remove
a whole huge page at once, under one tree_lock. To make this possible, we
extended the radix-tree interface to be able to pre-allocate enough memory to
insert a number of *contiguous* elements (kudos to Matthew Wilcox).
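
A minimal userspace sketch of the batching idea, not the kernel code. A
flat array stands in for the radix tree, and all toy_* names and TOY_SLOTS
are hypothetical. The real preload step pre-allocates radix-tree nodes; the
toy only checks that the whole contiguous run is free before touching it,
so the insert is all-or-nothing:

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define HPAGE_PMD_ORDER 9                         /* x86-64: 2 MiB / 4 KiB */
#define HPAGE_PMD_NR    (1UL << HPAGE_PMD_ORDER)  /* 512 small pages */
#define TOY_SLOTS       (4 * HPAGE_PMD_NR)        /* toy cache size (assumption) */

struct toy_page {
	unsigned long index;                      /* offset in the file, in pages */
};

struct toy_cache {
	struct toy_page *slot[TOY_SLOTS];         /* stand-in for the radix tree */
	unsigned long nrpages;
};

/* Insert a huge page: head[0] is the head page, head[1..] are the tails. */
static bool toy_add_huge(struct toy_cache *c, struct toy_page *head,
			 unsigned long index)
{
	unsigned long i;

	if (index % HPAGE_PMD_NR || index + HPAGE_PMD_NR > TOY_SLOTS)
		return false;

	/* Stand-in for preload: verify the whole run can be inserted. */
	for (i = 0; i < HPAGE_PMD_NR; i++)
		if (c->slot[index + i])
			return false;

	/* In the kernel this loop would run under a single tree_lock. */
	for (i = 0; i < HPAGE_PMD_NR; i++) {
		head[i].index = index + i;
		c->slot[index + i] = &head[i];
	}
	c->nrpages += HPAGE_PMD_NR;
	return true;
}

/* Remove all HPAGE_PMD_NR entries of the huge page that starts at index. */
static void toy_delete_huge(struct toy_cache *c, unsigned long index)
{
	unsigned long i;

	assert(index < TOY_SLOTS && index % HPAGE_PMD_NR == 0 && c->slot[index]);
	for (i = 0; i < HPAGE_PMD_NR; i++)
		c->slot[index + i] = NULL;
	c->nrpages -= HPAGE_PMD_NR;
}
```

Checking the whole run before filling any slot mirrors why the preload has
to cover a number of contiguous elements: a failure halfway through would
leave a partially inserted huge page in the tree.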

Huge pages can be added to the page cache in three ways:
 - write(2) to file or page;
 - read(2) from sparse file;
 - fault sparse file.

Potentially, one more way is collapsing small pages, but that's outside the
initial implementation.

For now we still write/read at most PAGE_CACHE_SIZE bytes at a time. There's
some room for speedup later.
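
To illustrate the per-step limit, here is a small userspace sketch. It is
an assumption about the shape of the loop, not the patch itself: the toy_*
helpers are hypothetical, and PAGE_CACHE_SHIFT is taken to be 12 (4 KiB
pages on x86-64). Each step copies at most up to the end of the current
small page, even when that small page is part of a huge page:

```c
#include <assert.h>
#include <stddef.h>

#define PAGE_CACHE_SHIFT 12                        /* assumption: 4 KiB pages */
#define PAGE_CACHE_SIZE  (1UL << PAGE_CACHE_SHIFT)
#define HPAGE_PMD_NR     512UL                     /* x86-64 */

/* Which small page inside its huge page does byte offset pos fall into? */
static unsigned long toy_subpage_index(unsigned long long pos)
{
	return (unsigned long)(pos >> PAGE_CACHE_SHIFT) & (HPAGE_PMD_NR - 1);
}

/* Count the copy steps needed to write len bytes starting at pos. */
static unsigned long toy_copy_steps(unsigned long long pos, size_t len)
{
	unsigned long steps = 0;

	while (len) {
		/* Offset inside the current small page. */
		unsigned long offset = pos & (PAGE_CACHE_SIZE - 1);
		/* Never cross a small-page boundary in one step. */
		size_t bytes = PAGE_CACHE_SIZE - offset;

		if (bytes > len)
			bytes = len;
		assert(bytes > 0);                 /* forward progress */

		pos += bytes;
		len -= bytes;
		steps++;
	}
	return steps;
}
```

With these assumptions, filling one whole 2 MiB huge page takes 512 steps,
which is where the room for a later speedup comes from.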

Since mmap() isn't targeted by this patchset, we just split the huge page on
page fault.

To minimize memory overhead for small files we set up an fops->release
helper -- simple_thp_release() -- which splits the last page in the file
when the last writer goes away.

truncate_inode_pages_range() drops a whole huge page at once if it's fully
inside the range. If a huge page is only partly in the range we zero out
that part, exactly like we do for partial small pages.
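
The decision described above can be sketched in userspace as follows. This
is only an illustration under stated assumptions: the toy_* names are
hypothetical, HPAGE_PMD_SIZE is taken as 2 MiB (x86-64), and the range is
half-open [start, end) in bytes for simplicity, which is not necessarily
the convention truncate_inode_pages_range() itself uses:

```c
#include <assert.h>

#define HPAGE_PMD_SIZE (2ULL << 20)     /* assumption: 2 MiB huge pages */

enum toy_action { TOY_SKIP, TOY_DROP_WHOLE, TOY_ZERO_PART };

struct toy_result {
	enum toy_action action;
	unsigned long long zero_from;   /* byte offset inside the huge page */
	unsigned long long zero_len;    /* number of bytes to zero */
};

/* Decide what to do with the huge page covering
 * [page_start, page_start + HPAGE_PMD_SIZE) for range [start, end). */
static struct toy_result toy_truncate_huge(unsigned long long page_start,
					   unsigned long long start,
					   unsigned long long end)
{
	unsigned long long page_end = page_start + HPAGE_PMD_SIZE;
	struct toy_result r = { TOY_SKIP, 0, 0 };
	unsigned long long from, to;

	assert(start < end && page_start % HPAGE_PMD_SIZE == 0);

	/* No overlap: nothing to do for this page. */
	if (end <= page_start || start >= page_end)
		return r;

	/* Fully inside the range: drop the whole huge page at once. */
	if (start <= page_start && end >= page_end) {
		r.action = TOY_DROP_WHOLE;
		return r;
	}

	/* Partial overlap: zero only the overlapping bytes. */
	from = start > page_start ? start : page_start;
	to = end < page_end ? end : page_end;
	r.action = TOY_ZERO_PART;
	r.zero_from = from - page_start;
	r.zero_len = to - from;
	return r;
}
```

For example, with a huge page at [2 MiB, 4 MiB) and a range of
[3 MiB, 8 MiB), the sketch zeroes the second half of the page and keeps
the page itself in the cache.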

split_huge_page() for file pages works similarly to anon pages, but we
walk mapping->i_mmap rather than anon_vma->rb_root. At the end we call
truncate_inode_pages() to drop small pages beyond i_size, if any.

The locking model around split_huge_page() is rather complicated and I
still don't feel confident enough about it. It looks like we need to
serialize over i_mutex in split_huge_page(), but that breaks the
i_mutex->mmap_sem lock ordering. I don't see how it can be fixed easily.
Any ideas are welcome.

Performance indicators will be posted separately.

Please review.

Kirill A. Shutemov (23):
  radix-tree: implement preload for multiple contiguous elements
  memcg, thp: charge huge cache pages
  thp: compile-time and sysfs knob for thp pagecache
  thp, mm: introduce mapping_can_have_hugepages() predicate
  thp: represent file thp pages in meminfo and friends
  thp, mm: rewrite add_to_page_cache_locked() to support huge pages
  mm: trace filemap: dump page order
  block: implement add_bdi_stat()
  thp, mm: rewrite delete_from_page_cache() to support huge pages
  thp, mm: warn if we try to use replace_page_cache_page() with THP
  thp, mm: handle tail pages in page_cache_get_speculative()
  thp, mm: add event counters for huge page alloc on file write or read
  thp, mm: allocate huge pages in grab_cache_page_write_begin()
  thp, mm: naive support of thp in generic_perform_write
  mm, fs: avoid page allocation beyond i_size on read
  thp, mm: handle transhuge pages in do_generic_file_read()
  thp, libfs: initial thp support
  thp: libfs: introduce simple_thp_release()
  truncate: support huge pages
  thp: handle file pages in split_huge_page()
  thp: wait_split_huge_page(): serialize over i_mmap_mutex too
  thp, mm: split huge page on mmap file page
  ramfs: enable transparent huge page cache

 Documentation/vm/transhuge.txt |  16 ++++
 drivers/base/node.c            |   4 +
 fs/libfs.c                     |  80 ++++++++++++++++++-
 fs/proc/meminfo.c              |   3 +
 fs/ramfs/file-mmu.c            |   3 +-
 fs/ramfs/inode.c               |   6 +-
 include/linux/backing-dev.h    |  10 +++
 include/linux/fs.h             |  10 +++
 include/linux/huge_mm.h        |  53 ++++++++++++-
 include/linux/mmzone.h         |   1 +
 include/linux/page-flags.h     |  33 ++++++++
 include/linux/pagemap.h        |  48 +++++++++++-
 include/linux/radix-tree.h     |  11 +++
 include/linux/vm_event_item.h  |   4 +
 include/trace/events/filemap.h |   7 +-
 lib/radix-tree.c               |  41 +++++++---
 mm/Kconfig                     |  12 +++
 mm/filemap.c                   | 171 +++++++++++++++++++++++++++++++++++------
 mm/huge_memory.c               | 116 ++++++++++++++++++++++++----
 mm/memcontrol.c                |   2 -
 mm/memory.c                    |   4 +-
 mm/truncate.c                  | 108 ++++++++++++++++++++------
 mm/vmstat.c                    |   5 ++
 23 files changed, 658 insertions(+), 90 deletions(-)

-- 
1.8.3.2

