linux-kernel.vger.kernel.org archive mirror
* [patch 0/9] mm: thrash detection-based file cache sizing v3
@ 2013-08-06 22:44 Johannes Weiner
  2013-08-06 22:44 ` [patch 1/9] lib: radix-tree: radix_tree_delete_item() Johannes Weiner
                   ` (9 more replies)
  0 siblings, 10 replies; 19+ messages in thread
From: Johannes Weiner @ 2013-08-06 22:44 UTC
  To: linux-mm
  Cc: Andi Kleen, Andrea Arcangeli, Andrew Morton, Greg Thelen,
	Christoph Hellwig, Hugh Dickins, Jan Kara, KOSAKI Motohiro,
	Mel Gorman, Minchan Kim, Peter Zijlstra, Rik van Riel,
	Michel Lespinasse, Seth Jennings, Roman Gushchin, Ozgun Erdogan,
	Metin Doslu, linux-fsdevel, linux-kernel

[ My apologies for the double send, I screwed up one of the recipient
  addresses the first time around and it got dropped by some MTAs. ]

Changes in version 3:

o Lazily remove inodes without shadow entries from the global list to
  reduce modifications of said list to an absolute minimum.  Global
  list operations are now reduced to when an inode has its first cache
  page reclaimed (rare) and when a linked inode is destroyed (rare) or
  when the inode's shadows are shrunk to zero (rare).  These
  events should be even rarer than the per-sb inode list operations,
  which take a global lock.  Based on feedback from Peter Zijlstra.

o Drop global working set time, store zone ID in addition to
  zone-specific timestamp in radix tree instead.  Balance zones based
  on their own refaults only.  This allows the refault detecting side
  to be much sleeker too and removes a lot of changes to the page
  allocator interface.  Based on feedback from Peter Zijlstra.

o Document all interfaces properly

o Split out fair allocator patches (in -mmotm)
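
For illustration only, the "zone ID in addition to zone-specific
timestamp" item above can be pictured as packing both values into one
word.  This is a minimal sketch, not the code from the series: the
names shadow_pack() and shadow_unpack() and the 2-bit zone field are
assumptions made for the example, and the real radix tree entry
encoding is not shown here.

```c
#include <assert.h>

/* Hypothetical layout: the low bits carry the zone ID, the remaining
 * bits carry a truncated zone-specific eviction timestamp. */
#define ZONE_ID_BITS 2u
#define ZONE_ID_MASK ((1ul << ZONE_ID_BITS) - 1)
#define STAMP_MASK   (~0ul >> ZONE_ID_BITS)

/* Combine a zone ID and a timestamp into a single word. */
static unsigned long shadow_pack(unsigned int zone_id, unsigned long stamp)
{
	assert(zone_id <= ZONE_ID_MASK);
	/* High timestamp bits that do not fit are dropped. */
	return ((stamp & STAMP_MASK) << ZONE_ID_BITS) | zone_id;
}

/* Split a packed word back into zone ID and timestamp. */
static void shadow_unpack(unsigned long shadow, unsigned int *zone_id,
			  unsigned long *stamp)
{
	*zone_id = (unsigned int)(shadow & ZONE_ID_MASK);
	*stamp = shadow >> ZONE_ID_BITS;
}
```

Because the zone travels with the timestamp, the refault side can
compare against that zone's own clock without consulting any global
state.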

---

The VM maintains cached filesystem pages on two types of lists.  One
list holds the pages recently faulted into the cache, the other list
holds pages that have been referenced repeatedly on that first list.
The idea is to prefer reclaiming young pages over those that have
been shown to benefit from caching in the past.  We call the recently
used list the "inactive list" and the frequently used list the "active
list".

The tricky part of this model is finding the right balance between
them.  A big inactive list may not leave enough room for the active
list to protect all the frequently used pages.  A big active list may
not leave enough room for the inactive list for a new set of
frequently used pages, "working set", to establish itself because the
young pages get pushed out of memory before having a chance to get
promoted.

Historically, every reclaim scan of the inactive list also took a
smaller number of pages from the tail of the active list and moved
them to the head of the inactive list.  This model gave established
working sets more grace time in the face of temporary use-once streams,
but was not satisfactory when use-once streaming persisted over longer
periods of time and the established working set was temporarily
suspended, like a nightly backup evicting all the interactive user
program data.
    
Subsequently, the rules were changed to only age active pages when
they exceeded the amount of inactive pages, i.e. leave the working set
alone as long as the other half of memory is easy-to-reclaim use-once
pages.  This works well until working set transitions exceed the size
of half of memory and the average access distance between the pages of
the new working set is bigger than the inactive list.  The VM will
mistake the thrashing new working set for use-once streaming, while
the unused old working set pages are stuck on the active list.
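
As a rough model of the rule just described and of how it fails, here
is a minimal sketch.  The struct and function names are hypothetical
and the model ignores everything but list sizes; it is not the
kernel's reclaim code.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical snapshot of the two file LRU list sizes, in pages. */
struct lru_sizes {
	unsigned long nr_active;
	unsigned long nr_inactive;
};

/* The rule described above: only age active pages when they
 * outnumber the inactive pages. */
static bool should_age_active(const struct lru_sizes *lru)
{
	assert(lru != NULL);
	return lru->nr_active > lru->nr_inactive;
}

/* Simplified assumption: a page is re-referenced before eviction only
 * if its reuse distance fits within the inactive list. */
static bool survives_inactive(const struct lru_sizes *lru,
			      unsigned long reuse_distance)
{
	assert(lru != NULL);
	return reuse_distance <= lru->nr_inactive;
}
```

With 50 active and 50 inactive pages, a new working set whose pages
are reused every 70 pages is never promoted (survives_inactive() is
false), yet the active list is not aged either (should_age_active()
is false), which is the stuck state the paragraph above describes.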

This happens on file servers and media streaming servers, where the
popular set of files changes over time.  Even though the individual
files might be smaller than half of memory, concurrent access to many
of them may still result in their inter-reference distance being
greater than half of memory.  It's also been reported as a problem on
database workloads that switch back and forth between tables that are
bigger than half of memory.  In these cases the VM never recognizes
the new working set and will for the remainder of the workload thrash
disk data which could easily live in memory.

This series solves the problem by maintaining a history of pages
evicted from the inactive list, enabling the VM to tell streaming IO
from thrashing and rebalance the page cache lists when appropriate.
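
To make the idea concrete, here is a minimal sketch of how an eviction
history could separate the two cases.  It assumes a per-zone counter
that advances on every eviction and a comparison of the resulting
refault distance against the active list size; the names zone_model,
record_eviction() and refault_should_activate() are invented for the
example and the actual policy lives in the patches themselves.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical per-zone state: a clock that ticks once per eviction
 * and the current size of the active list, in pages. */
struct zone_model {
	unsigned long workingset_time;
	unsigned long nr_active;
};

/* On eviction, advance the clock and return the stamp that would be
 * left behind in place of the page. */
static unsigned long record_eviction(struct zone_model *z)
{
	assert(z != NULL);
	return ++z->workingset_time;
}

/* On refault, the distance is how many evictions happened since the
 * page left.  If the page would have stayed resident given the memory
 * currently held by the active list, treat the fault as thrashing and
 * ask for activation.  Unsigned subtraction stays correct across
 * counter wraparound. */
static bool refault_should_activate(const struct zone_model *z,
				    unsigned long eviction_stamp)
{
	assert(z != NULL);
	unsigned long distance = z->workingset_time - eviction_stamp;

	return distance <= z->nr_active;
}
```

Pages from a pure use-once stream are never faulted back in, so they
never reach refault_should_activate() at all; only pages that return
after a short distance compete with the active list.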

 drivers/staging/lustre/lustre/llite/dir.c |   2 +-
 fs/block_dev.c                            |   2 +-
 fs/btrfs/compression.c                    |   4 +-
 fs/cachefiles/rdwr.c                      |  13 +-
 fs/ceph/xattr.c                           |   2 +-
 fs/inode.c                                |   6 +-
 fs/logfs/readwrite.c                      |   6 +-
 fs/nfs/blocklayout/blocklayout.c          |   2 +-
 fs/nilfs2/inode.c                         |   4 +-
 fs/ntfs/file.c                            |   7 +-
 fs/splice.c                               |   6 +-
 include/linux/fs.h                        |   3 +
 include/linux/mm.h                        |   8 +
 include/linux/mmzone.h                    |   7 +
 include/linux/pagemap.h                   |  55 +++-
 include/linux/pagevec.h                   |   3 +
 include/linux/radix-tree.h                |   5 +-
 include/linux/shmem_fs.h                  |   1 +
 include/linux/swap.h                      |  10 +
 include/linux/writeback.h                 |   1 +
 lib/radix-tree.c                          | 105 ++-----
 mm/Makefile                               |   2 +-
 mm/filemap.c                              | 265 +++++++++++++---
 mm/mincore.c                              |  20 +-
 mm/page-writeback.c                       |   2 +-
 mm/readahead.c                            |   8 +-
 mm/shmem.c                                | 122 ++------
 mm/swap.c                                 |  22 ++
 mm/truncate.c                             |  78 ++++-
 mm/vmscan.c                               |  62 +++-
 mm/vmstat.c                               |   4 +
 mm/workingset.c                           | 461 ++++++++++++++++++++++++++++
 net/ceph/pagelist.c                       |   4 +-
 net/ceph/pagevec.c                        |   2 +-
 34 files changed, 1005 insertions(+), 299 deletions(-)

Based on the latest -mmotm, which includes the required page allocator
fairness patches.  All that: http://git.cmpxchg.org/cgit/linux-jw.git/

Thanks!


* [patch 9/9] mm: thrash detection-based file cache sizing v4
@ 2013-08-17 19:31 Johannes Weiner
  2013-08-17 19:31 ` [patch 3/9] mm: filemap: move radix tree hole searching here Johannes Weiner
  0 siblings, 1 reply; 19+ messages in thread
From: Johannes Weiner @ 2013-08-17 19:31 UTC
  To: Andrew Morton
  Cc: Andi Kleen, Andrea Arcangeli, Greg Thelen, Christoph Hellwig,
	Hugh Dickins, Jan Kara, KOSAKI Motohiro, Mel Gorman, Minchan Kim,
	Peter Zijlstra, Rik van Riel, Michel Lespinasse, Seth Jennings,
	Roman Gushchin, Ozgun Erdogan, Metin Doslu, Vlastimil Babka,
	linux-mm, linux-fsdevel, linux-kernel

Changes in version 4:

o Rework shrinker and shadow planting throttle.  The per-file
  throttling created problems in production tests.  And the shrinker
  code changed so much over the development of the series that the
  throttling policy is no longer applicable, so just remove it, and
  with it the extra unsigned long to track refault ratios in struct
  inode (yay!).

o Remove the 'enough free pages' filter from refault detection.  This
  never was just right for all types of zone sizes (varying watermarks
  and lowmem reserves) and filtered too many valid refault hits.  It
  was put in place to detect when reclaim already freed enough pages,
  to stop deactivating more than necessary.  But reclaim advances the
  working set time, so progress is reflected in the refault distances
  that we check either way.  Just remove the redundant test.

o Update changelog in terms of what the refault distance means and how
  the code protects against spurious refaults that happen out of
  order.  Suggested by Vlastimil Babka.

Changes in version 3:

o Drop global working set time, store zone ID in addition to
  zone-specific timestamp in radix tree instead.  Balance zones based
  on their own refaults only.  Based on feedback from Peter Zijlstra.

o Lazily remove inodes without shadow entries from the global list to
  reduce modifications of said list to an absolute minimum.  Global
  list operations are now reduced to when an inode has its first cache
  page reclaimed (rare) and when a linked inode is destroyed (rare) or
  when the inode's shadows are shrunk to zero (rare).  Based on
  feedback from Peter Zijlstra.

o Document all interfaces properly

o Split out fair allocator patches (in -mmotm)

---

The VM maintains cached filesystem pages on two types of lists.  One
list holds the pages recently faulted into the cache, the other list
holds pages that have been referenced repeatedly on that first list.
The idea is to prefer reclaiming young pages over those that have
been shown to benefit from caching in the past.  We call the recently
used list the "inactive list" and the frequently used list the "active
list".

The tricky part of this model is finding the right balance between
them.  A big inactive list may not leave enough room for the active
list to protect all the frequently used pages.  A big active list may
not leave enough room for the inactive list for a new set of
frequently used pages, "working set", to establish itself because the
young pages get pushed out of memory before having a chance to get
promoted.

Historically, every reclaim scan of the inactive list also took a
smaller number of pages from the tail of the active list and moved
them to the head of the inactive list.  This model gave established
working sets more grace time in the face of temporary use-once streams,
but was not satisfactory when use-once streaming persisted over longer
periods of time and the established working set was temporarily
suspended, like a nightly backup evicting all the interactive user
program data.
    
Subsequently, the rules were changed to only age active pages when
they exceeded the amount of inactive pages, i.e. leave the working set
alone as long as the other half of memory is easy-to-reclaim use-once
pages.  This works well until working set transitions exceed the size
of half of memory and the average access distance between the pages of
the new working set is bigger than the inactive list.  The VM will
mistake the thrashing new working set for use-once streaming, while
the unused old working set pages are stuck on the active list.

This happens on file servers and media streaming servers, where the
popular set of files changes over time.  Even though the individual
files might be smaller than half of memory, concurrent access to many
of them may still result in their inter-reference distance being
greater than half of memory.  It's also been reported as a problem on
database workloads that switch back and forth between tables that are
bigger than half of memory.  In these cases the VM never recognizes
the new working set and will for the remainder of the workload thrash
disk data which could easily live in memory.

This series solves the problem by maintaining a history of pages
evicted from the inactive list, enabling the VM to tell streaming IO
from thrashing and rebalance the page cache lists when appropriate.

 drivers/staging/lustre/lustre/llite/dir.c |   2 +-
 fs/block_dev.c                            |   2 +-
 fs/btrfs/compression.c                    |   4 +-
 fs/cachefiles/rdwr.c                      |  13 +-
 fs/ceph/xattr.c                           |   2 +-
 fs/inode.c                                |   6 +-
 fs/logfs/readwrite.c                      |   6 +-
 fs/nfs/blocklayout/blocklayout.c          |   2 +-
 fs/nilfs2/inode.c                         |   4 +-
 fs/ntfs/file.c                            |   7 +-
 fs/splice.c                               |   6 +-
 include/linux/fs.h                        |   2 +
 include/linux/mm.h                        |   8 +
 include/linux/mmzone.h                    |   8 +
 include/linux/pagemap.h                   |  55 ++--
 include/linux/pagevec.h                   |   3 +
 include/linux/radix-tree.h                |   5 +-
 include/linux/shmem_fs.h                  |   1 +
 include/linux/swap.h                      |   9 +
 include/linux/writeback.h                 |   1 +
 lib/radix-tree.c                          | 105 ++------
 mm/Makefile                               |   2 +-
 mm/filemap.c                              | 264 ++++++++++++++++---
 mm/mincore.c                              |  20 +-
 mm/page-writeback.c                       |   2 +-
 mm/readahead.c                            |   8 +-
 mm/shmem.c                                | 122 +++------
 mm/swap.c                                 |  22 ++
 mm/truncate.c                             |  78 ++++--
 mm/vmscan.c                               |  62 ++++-
 mm/vmstat.c                               |   5 +
 mm/workingset.c                           | 396 ++++++++++++++++++++++++++++
 net/ceph/pagelist.c                       |   4 +-
 net/ceph/pagevec.c                        |   2 +-
 34 files changed, 939 insertions(+), 299 deletions(-)

Based on -mmotm, which includes the required page allocator fairness
patches.  All that: http://git.cmpxchg.org/cgit/linux-jw.git/

Thanks!



end of thread, other threads:[~2013-08-17 19:32 UTC | newest]

Thread overview: 19+ messages
2013-08-06 22:44 [patch 0/9] mm: thrash detection-based file cache sizing v3 Johannes Weiner
2013-08-06 22:44 ` [patch 1/9] lib: radix-tree: radix_tree_delete_item() Johannes Weiner
2013-08-06 22:44 ` [patch 2/9] mm: shmem: save one radix tree lookup when truncating swapped pages Johannes Weiner
2013-08-06 22:44 ` [patch 3/9] mm: filemap: move radix tree hole searching here Johannes Weiner
2013-08-06 22:44 ` [patch 4/9] mm + fs: prepare for non-page entries in page cache radix trees Johannes Weiner
2013-08-06 22:44 ` [patch 5/9] mm + fs: store shadow entries in page cache Johannes Weiner
2013-08-06 22:44 ` [patch 6/9] mm + fs: provide shadow pages to page cache allocations Johannes Weiner
2013-08-06 22:44 ` [patch 7/9] mm: make global_dirtyable_memory() available to other mm code Johannes Weiner
2013-08-06 22:44 ` [patch 8/9] mm: thrash detection-based file cache sizing Johannes Weiner
2013-08-09 22:49   ` Andrew Morton
2013-08-12 16:00     ` Johannes Weiner
2013-08-11 21:57   ` Vlastimil Babka
2013-08-12 16:27     ` Johannes Weiner
2013-08-06 22:44 ` [patch 9/9] mm: workingset: keep shadow entries in check Johannes Weiner
2013-08-11 23:56   ` Andi Kleen
2013-08-14 14:41     ` Johannes Weiner
2013-08-09 22:53 ` [patch 0/9] mm: thrash detection-based file cache sizing v3 Andrew Morton
2013-08-12 22:15   ` Johannes Weiner
2013-08-17 19:31 [patch 9/9] mm: thrash detection-based file cache sizing v4 Johannes Weiner
2013-08-17 19:31 ` [patch 3/9] mm: filemap: move radix tree hole searching here Johannes Weiner
