From: Rik van Riel <riel@redhat.com>
To: balbir@linux.vnet.ibm.com
Cc: linux-mm@kvack.org, linux-kernel@vger.kernel.org,
	lee.schermerhorn@hp.com
Subject: Re: [patch 00/20] VM pageout scalability improvements
Date: Sat, 22 Dec 2007 19:21:19 -0500
Message-ID: <20071222192119.030f32d5@bree.surriel.com>
In-Reply-To: <476D7334.4010301@linux.vnet.ibm.com>

On Sun, 23 Dec 2007 01:57:32 +0530
Balbir Singh <balbir@linux.vnet.ibm.com> wrote:
> Rik van Riel wrote:
> > On large memory systems, the VM can spend way too much time scanning
> > through pages that it cannot (or should not) evict from memory. Not
> > only does it use up CPU time, but it also provokes lock contention
> > and can leave large systems under memory pressure in a catatonic state.
> 
> I remember you mentioning that by large memory systems you mean systems
> with at least 128GB. Does this definition still hold?

It depends on the workload.  Certain test cases can wedge the
VM with as little as 16GB of RAM.  Other workloads cause trouble
at 32 or 64GB, with the system sometimes hanging for several
minutes, all the CPUs in the pageout code and no actual swap IO.

On systems of 128GB and more, we have seen systems hang in the
pageout code overnight, without deciding what to swap out.
 
> > This patch series improves VM scalability by:
> > 
> > 1) making the locking a little more scalable
> > 
> > 2) putting filesystem backed, swap backed and non-reclaimable pages
> >    onto their own LRUs, so the system only scans the pages that it
> >    can/should evict from memory
> > 
> > 3) switching to SEQ replacement for the anonymous LRUs, so the
> >    number of pages that need to be scanned when the system
> >    starts swapping is bound to a reasonable number
> > 
> > The noreclaim patches come verbatim from Lee Schermerhorn and
> > Nick Piggin.  I have not taken a detailed look at them yet and
> > all I have done is fix the rejects against the latest -mm kernel.
> 
> Is there a consolidated patch available? It would make testing easier.

I will make a big patch available with the next version.  I have
to upgrade my patch set to newer noreclaim patches from Lee and
add a few small cleanups elsewhere.

> > I am posting this series now because I would like to get more
> > feedback, while I am studying and improving the noreclaim patches
> > myself.
> 
> What kind of tests show the problem? I'll try and review and test the code.

The easiest test possible simply allocates a ton of memory and
then touches it all.  Enough memory that the system needs to go
into swap.
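
A minimal sketch of such a test, assuming a userspace C helper; the
name alloc_and_touch() and its interface are made up for illustration
and are not part of the patch series. It writes one byte per page so
that every page is really faulted in, not just reserved:

```c
#include <stddef.h>
#include <stdlib.h>
#include <unistd.h>

/*
 * Hypothetical helper: allocate `bytes` of memory and write one byte
 * to every page so the kernel has to back each page with real memory.
 * On success, stores the buffer in *out and returns the number of
 * pages touched.  Returns 0 (and sets *out to NULL) on failure.
 * The caller keeps the buffer so the memory stays in use.
 */
size_t alloc_and_touch(size_t bytes, char **out)
{
    long page = sysconf(_SC_PAGESIZE);
    size_t touched = 0;

    *out = NULL;
    if (page <= 0 || bytes == 0)
        return 0;

    char *buf = malloc(bytes);
    if (buf == NULL)
        return 0;

    /* volatile keeps the compiler from dropping the stores */
    volatile char *p = buf;
    for (size_t off = 0; off < bytes; off += (size_t)page) {
        p[off] = 1;
        touched++;
    }

    *out = buf;
    return touched;
}
```

To reproduce the behaviour, call it from a small main() with a size
larger than the machine's RAM and keep the buffer around while
watching the system, for example with vmstat.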

Once memory is full, you will see the VM scan like mad, with a
big CPU spike (clearing the referenced bits off all pages) before
it starts swapping out anything.  That big CPU spike should be
gone or greatly reduced with my patches.

On really huge systems, that big CPU spike can be enough for one
CPU to spend so much time in the VM that all the other CPUs join
it, and the system goes under in a big lock contention fest.

Besides, even clearing the referenced bits on 1TB worth of memory
from a single thread can't result in acceptable latencies :)
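
For a sense of scale, here is a back-of-the-envelope sketch. The 4KB
page size and the per-page cost are assumptions chosen for
illustration, not measurements, and both function names are
hypothetical:

```c
#include <stdint.h>

/* Number of pages needed to cover `bytes` of memory. */
uint64_t pages_in(uint64_t bytes, uint64_t page_size)
{
    return bytes / page_size;
}

/*
 * Seconds one CPU would need to visit every page once, given an
 * assumed (not measured) cost per page in nanoseconds.
 */
double scan_seconds(uint64_t pages, double ns_per_page)
{
    return (double)pages * ns_per_page / 1e9;
}
```

With 1TB taken as 2^40 bytes and 4KB pages, pages_in() gives
268435456 pages.  At an assumed 100ns per page, a single pass over
all of them would take roughly 27 seconds on one CPU.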

In the real world, users with large JVMs on their servers, which
sometimes go a little into swap, can trigger this problem.  All of
the CPUs end up scanning the active list, and all pages have the
referenced bit set.  Even if the system eventually recovers, it
might as well have been dead.

Going into swap a little should only take a little bit of time.
