linux-kernel.vger.kernel.org archive mirror
 help / color / mirror / Atom feed
From: mel@skynet.ie (Mel Gorman)
To: Dave Hansen <hansendc@us.ibm.com>
Cc: Andrew Morton <akpm@linux-foundation.org>,
	Kirill Korotaev <dev@sw.ru>,
	containers@lists.osdl.org, linux-kernel@vger.kernel.org,
	Mel Gorman <MELGOR@ie.ibm.com>,
	Andy Wihitcroft <apw@shadowen.org>
Subject: Re: [RFC][PATCH 2/7] RSS controller core
Date: Wed, 14 Mar 2007 15:38:24 +0000	[thread overview]
Message-ID: <20070314153824.GA6607@skynet.ie> (raw)
In-Reply-To: <1173805534.6680.26.camel@localhost.localdomain>

On (13/03/07 10:05), Dave Hansen didst pronounce:
> On Tue, 2007-03-13 at 03:48 -0800, Andrew Morton wrote: 
> > If we use a physical zone-based containment scheme: fake-numa,
> > variable-sized zones, etc then it all becomes moot.  You set up a container
> > which has 1.5GB of physial memory then toss processes into it.  As that
> > process set increases in size it will toss out stray pages which shouldn't
> > be there, then it will start reclaiming and swapping out its own pages and
> > eventually it'll get an oom-killing.
> 
> I was just reading through the (comprehensive) thread about this from
> last week, so forgive me if I missed some of it.  The idea is really
> tempting, precisely because I don't think anyone really wants to have to
> screw with the reclaim logic.  
> 
> I'm just brain-dumping here, hoping that somebody has already thought
> through some of this stuff.  It's not a bitch-fest, I promise. :)
> 
> How do we determine what is shared, and goes into the shared zones?

Assuming we had a means of creating a zone that was assigned to a container,
a second zone for shared data between a set of containers.  For shared data,
the time the pages are being allocated is at page fault time. At that point,
the faulting VMA is known and you also know if it's MAP_SHARED or not.

The caller allocating the page would select (or create) a zonelist that
is appropriate for the container. For shared mappings, it would be one
zone - the shared zone for the set. For private mappings, it would be
one zone - the shared zone for the set.

For overcommit, the allowable zones for overcommit could be included.
Allowing overcommit opens the possibility for containers to interfere with
each other but I'm guessing that if overcommit is enabled, the administrator
is willing to live with that interference.

This has the awkward possibility of having two "shared" zones for two container
sets and one file that needs sharing. Similarly, there is a possibility for
having a container that has no shared zone and faulted in shared data. In
that case, the page ends up in the first faulting container set and it's
too bad it got "charged" for the page use on behalf of other containers. I'm
not sure there is a sane way of accounting this situation fairly.

I think that it's important to note that once data is shared between containers
at all that they have the potential to interfere with each other (by reclaiming
within the shared zone for example).

> Once we've allocated a page, it's too late because we already picked.

We'd choose the appropriate zonelist before faulting. Once allocated,
the page stays there.

> Do we just assume all page cache is shared?  Base it on filesystem,
> mount, ...?  Mount seems the most logical to me, that a sysadmin would
> have to set up a container's fs, anyway, and will likely be doing
> special things to shared data, anyway (r/o bind mounts :).
> 

I have no strong feelings here. To me, it's "who do I assign this fake
zone to?" I guess you would have at least one zone per container mount
for private data.

> There's a conflict between the resize granularity of the zones, and the
> storage space their lookup consumes.  We'd want a container to have a
> limited ability to fill up memory with stuff like the dcache, so we'd
> appear to need to put the dentries inside the software zone.  But, that
> gets us to our inability to evict arbitrary dentries. 

Stuff like shrinking dentry caches is already pretty course-grained.
Last I looked, we couldn't even shrink within a specific node, let alone
a zone or a specific dentry. This is a separate problem.

> After a while,
> would containers tend to pin an otherwise empty zone into place?  We
> could resize it, but what is the cost of keeping zones that can be
> resized down to a small enough size that we don't mind keeping it there?
> We could merge those "orphaned" zones back into the shared zone.

Merging "orphaned" zones back into the "main" zone would seem a sensible
choice.

> Were there any requirements about physical contiguity? 

For the lookup to software zone to be efficient, it would be easiest to have
them as MAX_ORDER_NR_PAGES contiguous. This would avoid having to break the
existing assumptions in the buddy allocator about MAX_ORDER_NR_PAGES
always being in the same zone.

> What about minimum
> zone sizes?
> 

MAX_ORDER_NR_PAGES would be the minimum zone size.

> If we really do bind a set of processes strongly to a set of memory on a
> set of nodes, then those really do become its home NUMA nodes.  If the
> CPUs there get overloaded, running it elsewhere will continue to grab
> pages from the home.  Would this basically keep us from ever being able
> to move tasks around a NUMA system?
> 

Moving the tasks around would not be easy. It would require a new zone
to be created based on the new NUMA node and all the data migrated. hmm

-- 
Mel Gorman
Part-time Phd Student                          Linux Technology Center
University of Limerick                         IBM Dublin Software Lab

  reply	other threads:[~2007-03-14 15:38 UTC|newest]

Thread overview: 129+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2007-03-06 14:42 [RFC][PATCH 0/7] Resource controllers based on process containers Pavel Emelianov
2007-03-06 14:49 ` [RFC][PATCH 1/7] Resource counters Pavel Emelianov
2007-03-07  4:03   ` Balbir Singh
2007-03-07  7:19     ` Pavel Emelianov
2007-03-09 16:37       ` Herbert Poetzl
2007-03-11  9:01         ` Pavel Emelianov
2007-03-11 19:00         ` Eric W. Biederman
2007-03-12  1:16           ` Herbert Poetzl
2007-03-13  9:09             ` Eric W. Biederman
2007-03-13  9:27               ` Pavel Emelianov
2007-03-13  9:49               ` [Devel] " Kirill Korotaev
2007-03-13 15:21               ` Herbert Poetzl
2007-03-13 15:41                 ` Pavel Emelianov
2007-03-13 16:07                   ` Srivatsa Vaddagiri
2007-03-14  7:12                     ` Pavel Emelianov
2007-03-15 16:51                       ` Eric W. Biederman
2007-03-13 16:32                   ` Herbert Poetzl
2007-03-06 14:55 ` [RFC][PATCH 2/7] RSS controller core Pavel Emelianov
2007-03-06 22:00   ` Andrew Morton
2007-03-09 16:48     ` Herbert Poetzl
2007-03-11  9:08       ` Pavel Emelianov
2007-03-11 14:32         ` Herbert Poetzl
2007-03-11 15:04           ` Pavel Emelianov
2007-03-12  0:41             ` Herbert Poetzl
2007-03-12  8:31               ` Pavel Emelianov
2007-03-12  9:55       ` Balbir Singh
2007-03-12 23:43         ` Herbert Poetzl
2007-03-13  1:57           ` Balbir Singh
2007-03-13  2:24             ` Srivatsa Vaddagiri
2007-03-13 16:06             ` Herbert Poetzl
2007-03-11 12:26     ` Kirill Korotaev
2007-03-11 12:51       ` Andrew Morton
2007-03-11 15:51         ` Balbir Singh
2007-03-11 19:34         ` Eric W. Biederman
2007-03-12  9:23           ` [Devel] " Kirill Korotaev
2007-03-13  9:26             ` Eric W. Biederman
2007-03-13 15:43               ` Kirill Korotaev
2007-03-12  1:00         ` Herbert Poetzl
2007-03-12  9:02           ` Pavel Emelianov
2007-03-12 21:11             ` Herbert Poetzl
2007-03-13  7:17               ` Pavel Emelianov
2007-03-13 15:05                 ` Herbert Poetzl
2007-03-13 15:32                   ` Pavel Emelianov
2007-03-13 15:10               ` Kirill Korotaev
2007-03-13 15:11                 ` Herbert Poetzl
2007-03-13 15:54                   ` Kirill Korotaev
2007-03-12 18:42           ` Dave Hansen
2007-03-12 22:41             ` Herbert Poetzl
2007-03-12 23:02               ` Dave Hansen
2007-03-18 16:58                 ` Eric W. Biederman
2007-03-13  6:04               ` Andrew Morton
2007-03-13 10:19                 ` [Devel] " Kirill Korotaev
2007-03-13 11:48                   ` Andrew Morton
2007-03-13 14:59                     ` Herbert Poetzl
2007-03-13 17:05                     ` Dave Hansen
2007-03-14 15:38                       ` Mel Gorman [this message]
2007-03-14 20:42                         ` Dave Hansen
2007-03-20 18:57                           ` Mel Gorman
2007-03-18 22:44                       ` [Devel] " Paul Menage
2007-03-19 17:41                         ` Eric W. Biederman
2007-03-13 17:26                 ` Dave Hansen
2007-03-13 19:09                   ` Alan Cox
2007-03-13 20:28                     ` Dave Hansen
2007-03-16  0:55                     ` Eric W. Biederman
2007-03-16 16:31                       ` Dave Hansen
2007-03-16 18:54                         ` Eric W. Biederman
2007-03-16 19:46                           ` Dave Hansen
2007-03-18 17:42                             ` Eric W. Biederman
2007-03-19 15:48                               ` Herbert Poetzl
2007-03-20 16:15                               ` controlling mmap()'d vs read/write() pages Dave Hansen
2007-03-20 21:19                                 ` Eric W. Biederman
2007-03-23  0:51                                   ` Herbert Poetzl
2007-03-23  5:57                                   ` Nick Piggin
2007-03-23 10:12                                     ` Eric W. Biederman
2007-03-23 10:47                                       ` Nick Piggin
2007-03-23 12:21                                         ` Eric W. Biederman
2007-03-28  7:33                                           ` Nick Piggin
2007-03-23 16:41                                       ` Dave Hansen
2007-03-23 18:16                                         ` Herbert Poetzl
2007-03-28  9:18                                           ` Balbir Singh
2007-03-14 16:47                   ` [RFC][PATCH 2/7] RSS controller core Mel Gorman
2007-03-07  5:37   ` Balbir Singh
2007-03-07  7:27     ` Pavel Emelianov
2007-03-06 14:58 ` [RFC][PATCH 3/7] Data structures changes for RSS accounting Pavel Emelianov
2007-03-11 19:13   ` Eric W. Biederman
2007-03-12 16:16     ` Kirill Korotaev
2007-03-12 16:48       ` Dave Hansen
2007-03-12 17:19         ` Pavel Emelianov
2007-03-12 17:27           ` Dave Hansen
2007-03-13  7:10             ` Pavel Emelianov
2007-03-12 17:21         ` Balbir Singh
2007-03-06 15:00 ` [RFC][PATCH 4/7] RSS accounting hooks over the code Pavel Emelianov
2007-03-11 19:14   ` Eric W. Biederman
2007-03-12 16:23     ` Kirill Korotaev
2007-03-12 16:50       ` Dave Hansen
2007-03-12 17:07         ` Kirill Korotaev
2007-03-12 17:33           ` Dave Hansen
2007-03-13  9:43             ` Eric W. Biederman
2007-03-12 23:54         ` Herbert Poetzl
2007-03-13  9:58           ` Eric W. Biederman
2007-03-13 10:25             ` Nick Piggin
2007-03-13 16:01               ` Eric W. Biederman
2007-03-14  3:51                 ` Nick Piggin
2007-03-14  6:42                   ` Balbir Singh
2007-03-14  6:57                     ` Nick Piggin
2007-03-14  7:48                       ` Balbir Singh
2007-03-14 13:25                         ` Vaidyanathan Srinivasan
2007-03-14 13:49                           ` Nick Piggin
2007-03-14 14:43                             ` Vaidyanathan Srinivasan
2007-03-14 16:16                             ` Kirill Korotaev
2007-03-15  5:01                               ` Nick Piggin
2007-03-15  5:44                                 ` Balbir Singh
2007-03-28 20:15               ` Ethan Solomita
2007-03-14 15:37   ` Cedric Le Goater
2007-03-14 15:45     ` Pavel Emelianov
2007-03-06 15:03 ` [RFC][PATCH 5/7] Per-container OOM killer and page reclamation Pavel Emelianov
2007-03-09 21:21   ` Balbir Singh
2007-03-11  8:41     ` Pavel Emelianov
2007-03-06 15:04 ` [RFC][PATCH 6/7] Account for the number of tasks within container Pavel Emelianov
2007-03-07  2:00   ` Paul Menage
2007-03-07  7:13     ` Pavel Emelianov
2007-03-08 13:49       ` Paul Menage
2007-03-11  8:36         ` Pavel Emelianov
2007-03-06 15:07 ` [RFC][PATCH 7/7] Account for the number of files opened " Pavel Emelianov
2007-03-07  2:02 ` [RFC][PATCH 0/7] Resource controllers based on process containers Paul Menage
2007-03-07  7:30   ` Pavel Emelianov
2007-03-07  6:52 ` Balbir Singh
2007-03-07  7:32   ` Pavel Emelianov
2007-03-07  9:43     ` Kirill Korotaev

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=20070314153824.GA6607@skynet.ie \
    --to=mel@skynet.ie \
    --cc=MELGOR@ie.ibm.com \
    --cc=akpm@linux-foundation.org \
    --cc=apw@shadowen.org \
    --cc=containers@lists.osdl.org \
    --cc=dev@sw.ru \
    --cc=hansendc@us.ibm.com \
    --cc=linux-kernel@vger.kernel.org \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).