linux-kernel.vger.kernel.org archive mirror
* [LSF/MM TOPIC] VM containers
@ 2016-01-22 15:56 Rik van Riel
  2016-01-22 16:05 ` [Lsf-pc] " James Bottomley
                   ` (3 more replies)
  0 siblings, 4 replies; 10+ messages in thread
From: Rik van Riel @ 2016-01-22 15:56 UTC (permalink / raw)
  To: lsf-pc; +Cc: Linux Memory Management List, Linux kernel Mailing List, KVM list

Hi,

I am trying to gauge interest in discussing VM containers at the LSF/MM
summit this year. Projects like ClearLinux, Qubes, and others are all
trying to use virtual machines as better isolated containers.

That changes some of the goals the memory management subsystem has,
from "use all the resources effectively" to "use as few resources as
necessary, in case the host needs the memory for something else".

These VMs could be as small as running just one application, so this
goes a little further than simply trying to squeeze more virtual
machines into a system with frontswap and cleancache.

Single-application VM sandboxes could also get their data differently,
using (partial) host filesystem passthrough, instead of a virtual
block device. This may change the relative utility of caching data
inside the guest page cache, versus freeing up that memory and
allowing the host to use it to cache things.

Are people interested in discussing this at LSF/MM, or is it better
saved for a different forum?

-- 
All rights reversed


* Re: [Lsf-pc] [LSF/MM TOPIC] VM containers
  2016-01-22 15:56 [LSF/MM TOPIC] VM containers Rik van Riel
@ 2016-01-22 16:05 ` James Bottomley
  2016-01-22 17:11 ` Johannes Weiner
                   ` (2 subsequent siblings)
  3 siblings, 0 replies; 10+ messages in thread
From: James Bottomley @ 2016-01-22 16:05 UTC (permalink / raw)
  To: Rik van Riel, lsf-pc
  Cc: Linux Memory Management List, Linux kernel Mailing List, KVM list

On Fri, 2016-01-22 at 10:56 -0500, Rik van Riel wrote:
> Hi,
> 
> I am trying to gauge interest in discussing VM containers at the
> LSF/MM
> summit this year. Projects like ClearLinux, Qubes, and others are all
> trying to use virtual machines as better isolated containers.
> 
> That changes some of the goals the memory management subsystem has,
> from "use all the resources effectively" to "use as few resources as
> necessary, in case the host needs the memory for something else".
> 
> These VMs could be as small as running just one application, so this
> goes a little further than simply trying to squeeze more virtual
> machines into a system with frontswap and cleancache.
> 
> Single-application VM sandboxes could also get their data
> differently,
> using (partial) host filesystem passthrough, instead of a virtual
> block device. This may change the relative utility of caching data
> inside the guest page cache, versus freeing up that memory and
> allowing the host to use it to cache things.
> 
> Are people interested in discussing this at LSF/MM, or is it better
> saved for a different forum?

Actually, I don't really think this is a container technology topic,
but I'm only objecting to the title, not the content.  I don't know
Qubes, but I do know ClearLinux ... it's VM based.  I think the
question that really needs answering is whether we can improve the
paravirt interfaces for memory control in VMs.  The biggest advantage
containers have over hypervisors is that the former know exactly what's
going on with memory in the guests because of the shared kernel, while
the latter have no real clue: the separate guest kernel only
communicates with the host via hardware interfaces, which leads to all
sorts of bad scheduling decisions.

If I look at the current state of play, it looks like hypervisors can
get an easy handle on file-backed memory using the persistent memory
interfaces; that's how ClearLinux achieves its speedup today.
However, controlling guests under memory pressure requires us to have
a handle on the anonymous memory as well.  I think a topic exploring
paravirt interfaces for anonymous memory would be really useful.

James


* Re: [LSF/MM TOPIC] VM containers
  2016-01-22 15:56 [LSF/MM TOPIC] VM containers Rik van Riel
  2016-01-22 16:05 ` [Lsf-pc] " James Bottomley
@ 2016-01-22 17:11 ` Johannes Weiner
  2016-01-27 15:48   ` Vladimir Davydov
  2016-01-23 23:41 ` Nakajima, Jun
  2016-01-28 15:18 ` Aneesh Kumar K.V
  3 siblings, 1 reply; 10+ messages in thread
From: Johannes Weiner @ 2016-01-22 17:11 UTC (permalink / raw)
  To: Rik van Riel
  Cc: lsf-pc, Linux Memory Management List, Linux kernel Mailing List,
	KVM list

Hi,

On Fri, Jan 22, 2016 at 10:56:15AM -0500, Rik van Riel wrote:
> I am trying to gauge interest in discussing VM containers at the LSF/MM
> summit this year. Projects like ClearLinux, Qubes, and others are all
> trying to use virtual machines as better isolated containers.
> 
> That changes some of the goals the memory management subsystem has,
> from "use all the resources effectively" to "use as few resources as
> necessary, in case the host needs the memory for something else".

I would be very interested in discussing this topic, because I think
the issue is more generic than these VM applications. We are facing
the same issues with regular containers, where aggressive caching is
counteracting the desire to cut down workloads to their bare minimum
in order to pack them as tightly as possible.

With per-cgroup LRUs and thrash detection, we have infrastructure in
place that could allow us to accomplish this. Right now we only enter
reclaim once memory runs out, but we could add an allocation mode that
would prefer to always reclaim from the local LRU before increasing
the memory footprint, and only expand once we detect thrashing in the
page cache. That would keep the workloads neatly trimmed at all times.
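
As a rough illustration of what I mean - mem_cgroup_thrashing() and
memcg->footprint_target are hypothetical, nothing like this exists
today:

/*
 * "Trim before expand": while we are above the current footprint
 * target, reclaim from the local LRUs instead of growing, and only
 * raise the target once the workingset code reports refaults.
 */
static void charge_trimmed(struct mem_cgroup *memcg, unsigned int nr_pages)
{
	while (page_counter_read(&memcg->memory) + nr_pages >
	       memcg->footprint_target) {
		if (mem_cgroup_thrashing(memcg)) {
			/* the cache is too small, let it grow */
			memcg->footprint_target += nr_pages;
			break;
		}
		if (!try_to_free_mem_cgroup_pages(memcg, nr_pages,
						  GFP_KERNEL, true))
			break;	/* nothing left to trim */
	}
	/* ... then fall through to the normal charge path */
}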

For virtualized environments, the thrashing information would be
communicated slightly differently to the page allocator and/or the
host, but otherwise the fundamental principles should be the same.

We'd have to figure out how to balance the aggressiveness there and
how to describe this to the user, as I can imagine that users would
want to tune this based on a tolerance for the degree of thrashing: if
pages are used every M ms, keep them cached; if pages are used every N
ms, freeing up the memory and refetching them from disk is better etc.
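
A strawman of the knob, just to make the tradeoff concrete (the name
and the comparison are made up):

static bool keep_cached(unsigned long reuse_interval_ms,
			unsigned long tolerance_ms)
{
	/* reuse every M ms with M <= tolerance: keep it cached;
	 * reuse every N ms with N > tolerance: refetch from disk */
	return reuse_interval_ms <= tolerance_ms;
}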

And we don't have thrash detection in secondary slab caches (yet).

> Are people interested in discussing this at LSF/MM, or is it better
> saved for a different forum?

If more people are interested, I think that could be a great topic.


* Re: [LSF/MM TOPIC] VM containers
  2016-01-22 15:56 [LSF/MM TOPIC] VM containers Rik van Riel
  2016-01-22 16:05 ` [Lsf-pc] " James Bottomley
  2016-01-22 17:11 ` Johannes Weiner
@ 2016-01-23 23:41 ` Nakajima, Jun
  2016-01-24 17:06   ` One Thousand Gnomes
  2016-01-28 15:18 ` Aneesh Kumar K.V
  3 siblings, 1 reply; 10+ messages in thread
From: Nakajima, Jun @ 2016-01-23 23:41 UTC (permalink / raw)
  To: Rik van Riel
  Cc: lsf-pc, Linux Memory Management List, Linux kernel Mailing List,
	KVM list


> On Jan 22, 2016, at 7:56 AM, Rik van Riel <riel@redhat.com> wrote:
> 
> Hi,
> 
> I am trying to gauge interest in discussing VM containers at the LSF/MM
> summit this year. Projects like ClearLinux, Qubes, and others are all
> trying to use virtual machines as better isolated containers.
> 
> That changes some of the goals the memory management subsystem has,
> from "use all the resources effectively" to "use as few resources as
> necessary, in case the host needs the memory for something else".
> 
> These VMs could be as small as running just one application, so this
> goes a little further than simply trying to squeeze more virtual
> machines into a system with frontswap and cleancache.

I would be very interested in discussing this topic, and I agree that "a topic exploring paravirt interfaces for anonymous memory would be really useful" (as James pointed out).

Beyond memory consumption, I would be interested in whether we can harden the kernel via paravirt interfaces for memory protection in VMs (if any exist). For example, the hypervisor could write-protect parts of the guest's page tables or kernel data structures; would that help?
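
On the host side there is at least one existing building block for
this: KVM can already map a guest-physical range read-only, so a guest
write traps out to the VMM instead of succeeding. A rough userspace
sketch (error handling omitted; slot number and addresses are
illustrative, and the guest-cooperative part - the guest asking for its
own structures to be protected - would still need a new paravirt
interface):

#include <string.h>
#include <sys/ioctl.h>
#include <linux/kvm.h>

static int map_region_readonly(int vm_fd, void *host_buf,
			       __u64 gpa, __u64 size, __u32 slot)
{
	struct kvm_userspace_memory_region r;

	memset(&r, 0, sizeof(r));
	r.slot = slot;
	r.flags = KVM_MEM_READONLY;	/* guest writes exit as KVM_EXIT_MMIO */
	r.guest_phys_addr = gpa;
	r.memory_size = size;
	r.userspace_addr = (unsigned long)host_buf;

	return ioctl(vm_fd, KVM_SET_USER_MEMORY_REGION, &r);
}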

> 
> Single-application VM sandboxes could also get their data differently,
> using (partial) host filesystem passthrough, instead of a virtual
> block device. This may change the relative utility of caching data
> inside the guest page cache, versus freeing up that memory and
> allowing the host to use it to cache things.
> 
> Are people interested in discussing this at LSF/MM, or is it better
> saved for a different forum?

In my view, it's worth discussing the details focusing on memory and storage, and leaving other areas, such as CPU scheduling and networking, to a different forum. For example, the cost of context switching becomes higher as applications run in more (small) VMs, because that tends to incur more VM exits.

---
Jun
Intel Open Source Technology Center


* Re: [LSF/MM TOPIC] VM containers
  2016-01-23 23:41 ` Nakajima, Jun
@ 2016-01-24 17:06   ` One Thousand Gnomes
  2016-01-25 17:25     ` Rik van Riel
  0 siblings, 1 reply; 10+ messages in thread
From: One Thousand Gnomes @ 2016-01-24 17:06 UTC (permalink / raw)
  To: Nakajima, Jun
  Cc: Rik van Riel, lsf-pc, Linux Memory Management List,
	Linux kernel Mailing List, KVM list

> > That changes some of the goals the memory management subsystem has,
> > from "use all the resources effectively" to "use as few resources as
> > necessary, in case the host needs the memory for something else".

Also "and take guidance/provide telemetry" - because you want to tune the
VM behaviours based upon policy and to learn from them for when you re-run
that container.

> Beyond memory consumption, I would be interested whether we can harden the kernel by the paravirt interfaces for memory protection in VMs (if any). For example, the hypervisor could write-protect part of the page tables or kernel data structures in VMs, and does it help?

There are four behaviours I can think of, some of which you see in
various hypervisors and security hardening systems:

- die on write (a write here causes a security trap and termination after
  the guest has marked the page range die-on-write, and it cannot be
  unmarked). The guest OS at boot can for example mark all its code as
  die-on-write.
- irrevocably read only (the VM never allows the page to be rewritten by
  the guest after the guest marks the page range irrevocably r/o)
- asynchronous faulting (pages the guest thinks are in its memory but
  are in fact in the host's swap cause a subscribable fault in the guest
  so that it can (where possible) be context switched)
- free if needed - marking pages as freed up, and either you get the page
  back as it was, or a fault and a zeroed page
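
What a guest-side call for the first two could look like, purely as a
sketch - the hypercall number, names and flags are invented, nothing
like this exists today:

#define PV_MEMPROTECT_RO	0x1	/* irrevocably read-only */
#define PV_MEMPROTECT_DIE	0x2	/* terminate the guest on write */

static long pv_memprotect(unsigned long gfn_start, unsigned long nr_pages,
			  unsigned long flags)
{
	/* kvm_hypercall3() is real; hypercall nr 42 is made up */
	return kvm_hypercall3(42, gfn_start, nr_pages, flags);
}

/* e.g. late in boot, after write-protecting it in the guest page tables:
 *	pv_memprotect(PFN_DOWN(__pa(_stext)),
 *		      (_etext - _stext) >> PAGE_SHIFT, PV_MEMPROTECT_DIE);
 */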

Alan


* Re: [LSF/MM TOPIC] VM containers
  2016-01-24 17:06   ` One Thousand Gnomes
@ 2016-01-25 17:25     ` Rik van Riel
  0 siblings, 0 replies; 10+ messages in thread
From: Rik van Riel @ 2016-01-25 17:25 UTC (permalink / raw)
  To: One Thousand Gnomes, Nakajima, Jun
  Cc: lsf-pc, Linux Memory Management List, Linux kernel Mailing List,
	KVM list

On 01/24/2016 12:06 PM, One Thousand Gnomes wrote:
>>> That changes some of the goals the memory management subsystem has,
>>> from "use all the resources effectively" to "use as few resources as
>>> necessary, in case the host needs the memory for something else".
> 
> Also "and take guidance/provide telemetry" - because you want to tune the
> VM behaviours based upon policy and to learn from them for when you re-run
> that container.
> 
>> Beyond memory consumption, I would be interested whether we can harden the kernel by the paravirt interfaces for memory protection in VMs (if any). For example, the hypervisor could write-protect part of the page tables or kernel data structures in VMs, and does it help?
> 
> There are four behaviours I can think of, some of which you see in
> various hypervisors and security hardening systems
> 
> - die on write (a write here causes a security trap and termination after
>   the guest has marked the page range die on write, and it cannot be
>   unmarked). The guest OS at boot can for example mark all it's code as
>   die-on-write.
> - irrevocably read only (VM never allows page to be rewritten by guest
>   after the guest marks the page range irrevocably r/o)

For these we get the question "how do we make it harder for the
guest to remap the page tables to point at read/write memory,
and modify that instead of the read-only memory?"

On "smaller" guests (less than 1TB in size), it may be enough to
ensure that the kernel PUD pointer points to the (read-only) kernel
PUD at context switch time, placing the main kernel page tables,
kernel text, and some other things in read-only memory.
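
Roughly, at context-switch time (sketch only - master_kernel_pgd and
KERNEL_PGD_START are made up, this is not existing code):

static inline void repin_kernel_pgd(pgd_t *pgd)
{
	/*
	 * One x86-64 PGD entry covers 512GB, so for a sub-1TB guest the
	 * direct map and kernel text sit in just a few kernel PGD slots;
	 * re-point them at the read-only master PUDs on every switch.
	 */
	memcpy(pgd + KERNEL_PGD_START,
	       master_kernel_pgd + KERNEL_PGD_START,
	       (PTRS_PER_PGD - KERNEL_PGD_START) * sizeof(pgd_t));
}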

> - asynchronous faulting (pages the guest thinks are in it's memory but
>   are in fact on the hosts swap cause a subscribable fault in the guest
>   so that it can (where possible) be context switched

KVM (and s390) already do the asynchronous page fault trick.

> - free if needed - marking pages as freed up and either you get a page
>   back as it was or a fault and a zeroed page

People have worked on this for KVM. I do not remember what
happened to the code.

-- 
All rights reversed


* Re: [LSF/MM TOPIC] VM containers
  2016-01-22 17:11 ` Johannes Weiner
@ 2016-01-27 15:48   ` Vladimir Davydov
  2016-01-27 18:36     ` Johannes Weiner
  0 siblings, 1 reply; 10+ messages in thread
From: Vladimir Davydov @ 2016-01-27 15:48 UTC (permalink / raw)
  To: Johannes Weiner
  Cc: Rik van Riel, lsf-pc, Linux Memory Management List,
	Linux kernel Mailing List, KVM list

On Fri, Jan 22, 2016 at 12:11:21PM -0500, Johannes Weiner wrote:
> Hi,
> 
> On Fri, Jan 22, 2016 at 10:56:15AM -0500, Rik van Riel wrote:
> > I am trying to gauge interest in discussing VM containers at the LSF/MM
> > summit this year. Projects like ClearLinux, Qubes, and others are all
> > trying to use virtual machines as better isolated containers.
> > 
> > That changes some of the goals the memory management subsystem has,
> > from "use all the resources effectively" to "use as few resources as
> > necessary, in case the host needs the memory for something else".
> 
> I would be very interested in discussing this topic, because I think
> the issue is more generic than these VM applications. We are facing
> the same issues with regular containers, where aggressive caching is
> counteracting the desire to cut down workloads to their bare minimum
> in order to pack them as tightly as possible.
> 
> With per-cgroup LRUs and thrash detection, we have infrastructure in

By thrash detection, do you mean vmpressure?

> place that could allow us to accomplish this. Right now we only enter
> reclaim once memory runs out, but we could add an allocation mode that
> would prefer to always reclaim from the local LRU before increasing
> the memory footprint, and only expand once we detect thrashing in the
> page cache. That would keep the workloads neatly trimmed at all times.

I don't get it. Do you mean a sort of special GFP flag that would force
the caller to reclaim before actual charging/allocation? Or is it
supposed to be automatic, based on how the memcg is behaving? If the
latter, I suppose it could be already done by a userspace daemon by
adjusting memory.high as needed, although it's unclear how to do it
optimally.

> 
> For virtualized environments, the thrashing information would be
> communicated slightly differently to the page allocator and/or the
> host, but otherwise the fundamental principles should be the same.
> 
> We'd have to figure out how to balance the aggressiveness there and
> how to describe this to the user, as I can imagine that users would
> want to tune this based on a tolerance for the degree of thrashing: if
> pages are used every M ms, keep them cached; if pages are used every N
> ms, freeing up the memory and refetching them from disk is better etc.

Sounds reasonable. What about adding a parameter to memcg that would
define the working set access time? It would act just like memory.low,
but in terms of lruvec age instead of lruvec size. I mean, we keep
track of lruvec ages and scan those lruvecs whose age is > the ws
access time before others. That would protect workloads that access
their working set regularly, but not very often, from streaming
workloads, which can generate a lot of useless pressure.

Thanks,
Vladimir

> 
> And we don't have thrash detection in secondary slab caches (yet).
> 
> > Are people interested in discussing this at LSF/MM, or is it better
> > saved for a different forum?
> 
> If more people are interested, I think that could be a great topic.


* Re: [LSF/MM TOPIC] VM containers
  2016-01-27 15:48   ` Vladimir Davydov
@ 2016-01-27 18:36     ` Johannes Weiner
  2016-01-28 17:12       ` Vladimir Davydov
  0 siblings, 1 reply; 10+ messages in thread
From: Johannes Weiner @ 2016-01-27 18:36 UTC (permalink / raw)
  To: Vladimir Davydov
  Cc: Rik van Riel, lsf-pc, Linux Memory Management List,
	Linux kernel Mailing List, KVM list

On Wed, Jan 27, 2016 at 06:48:31PM +0300, Vladimir Davydov wrote:
> On Fri, Jan 22, 2016 at 12:11:21PM -0500, Johannes Weiner wrote:
> > Hi,
> > 
> > On Fri, Jan 22, 2016 at 10:56:15AM -0500, Rik van Riel wrote:
> > > I am trying to gauge interest in discussing VM containers at the LSF/MM
> > > summit this year. Projects like ClearLinux, Qubes, and others are all
> > > trying to use virtual machines as better isolated containers.
> > > 
> > > That changes some of the goals the memory management subsystem has,
> > > from "use all the resources effectively" to "use as few resources as
> > > necessary, in case the host needs the memory for something else".
> > 
> > I would be very interested in discussing this topic, because I think
> > the issue is more generic than these VM applications. We are facing
> > the same issues with regular containers, where aggressive caching is
> > counteracting the desire to cut down workloads to their bare minimum
> > in order to pack them as tightly as possible.
> > 
> > With per-cgroup LRUs and thrash detection, we have infrastructure in
> 
> By thrash detection, do you mean vmpressure?

I mean mm/workingset.c, we'd have to look at actual refaults.

Reclaim efficiency is not a meaningful measure of memory pressure. You
could be reclaiming happily and successfully every single cache page
on the LRU, only to have userspace fault them in again right after.
No memory pressure would be detected, even though a ton of IO is
caused by a lack of memory. [ For this reason, I think we should phase
out reclaim efficiency as a metric throughout the VM - vmpressure, LRU
type balancing, OOM invocation etc. - and base it all on thrashing. ]

> > place that could allow us to accomplish this. Right now we only enter
> > reclaim once memory runs out, but we could add an allocation mode that
> > would prefer to always reclaim from the local LRU before increasing
> > the memory footprint, and only expand once we detect thrashing in the
> > page cache. That would keep the workloads neatly trimmed at all times.
> 
> I don't get it. Do you mean a sort of special GFP flag that would force
> the caller to reclaim before actual charging/allocation? Or is it
> supposed to be automatic, basing on how memcg is behaving? If the
> latter, I suppose it could be already done by a userspace daemon by
> adjusting memory.high as needed, although it's unclear how to do it
> optimally.

Yes, essentially we'd have a target footprint that we increase only
when cache refaults (or swapins) are detected.

This could be memory.high and a userspace daemon.

We could also put it in the kernel so it's useful out of the box.

It could be a watermark for the page allocator to work in virtualized
environments.
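
For the daemon variant, it doesn't need to be much more than this toy
loop (the cgroup path, the workingset_refault key in memory.stat and
the step sizes are assumptions, not a recommendation):

#include <stdio.h>
#include <string.h>
#include <unistd.h>

#define CG "/sys/fs/cgroup/vmcontainer"

static unsigned long long read_refaults(void)
{
	char key[64];
	unsigned long long val = 0, ret = 0;
	FILE *f = fopen(CG "/memory.stat", "r");

	if (!f)
		return 0;
	while (fscanf(f, "%63s %llu", key, &val) == 2)
		if (!strcmp(key, "workingset_refault"))
			ret = val;
	fclose(f);
	return ret;
}

static unsigned long long read_current(void)
{
	unsigned long long val = 0;
	FILE *f = fopen(CG "/memory.current", "r");

	if (f) {
		fscanf(f, "%llu", &val);
		fclose(f);
	}
	return val;
}

static void write_high(unsigned long long bytes)
{
	FILE *f = fopen(CG "/memory.high", "w");

	if (f) {
		fprintf(f, "%llu\n", bytes);
		fclose(f);
	}
}

int main(void)
{
	unsigned long long last = read_refaults();

	for (;;) {
		unsigned long long cur = read_current();
		unsigned long long refaults = read_refaults();

		if (refaults > last)		/* thrashing: back off */
			write_high(cur + cur / 8);
		else				/* quiet: keep trimming */
			write_high(cur - cur / 128);
		last = refaults;
		sleep(1);
	}
	return 0;
}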

> > For virtualized environments, the thrashing information would be
> > communicated slightly differently to the page allocator and/or the
> > host, but otherwise the fundamental principles should be the same.
> > 
> > We'd have to figure out how to balance the aggressiveness there and
> > how to describe this to the user, as I can imagine that users would
> > want to tune this based on a tolerance for the degree of thrashing: if
> > pages are used every M ms, keep them cached; if pages are used every N
> > ms, freeing up the memory and refetching them from disk is better etc.
> 
> Sounds reasonable. What about adding a parameter to memcg that would
> define ws access time? So that it would act just like memory.low, but in
> terms of lruvec age instead of lruvec size. I mean, we keep track of
> lruvec ages and scan those lruvecs whose age is > ws access time before
> others. That would protect those workloads that access their ws quite,
> but not very often from streaming workloads which can generate a lot of
> useless pressure.

I'm not following here. Which lruvec age?


* Re: [LSF/MM TOPIC] VM containers
  2016-01-22 15:56 [LSF/MM TOPIC] VM containers Rik van Riel
                   ` (2 preceding siblings ...)
  2016-01-23 23:41 ` Nakajima, Jun
@ 2016-01-28 15:18 ` Aneesh Kumar K.V
  3 siblings, 0 replies; 10+ messages in thread
From: Aneesh Kumar K.V @ 2016-01-28 15:18 UTC (permalink / raw)
  To: Rik van Riel, lsf-pc
  Cc: Linux Memory Management List, Linux kernel Mailing List, KVM list

Rik van Riel <riel@redhat.com> writes:

> Hi,
>
> I am trying to gauge interest in discussing VM containers at the LSF/MM
> summit this year. Projects like ClearLinux, Qubes, and others are all
> trying to use virtual machines as better isolated containers.
>
> That changes some of the goals the memory management subsystem has,
> from "use all the resources effectively" to "use as few resources as
> necessary, in case the host needs the memory for something else".
>
> These VMs could be as small as running just one application, so this
> goes a little further than simply trying to squeeze more virtual
> machines into a system with frontswap and cleancache.
>
> Single-application VM sandboxes could also get their data differently,
> using (partial) host filesystem passthrough, instead of a virtual
> block device. This may change the relative utility of caching data
> inside the guest page cache, versus freeing up that memory and
> allowing the host to use it to cache things.
>
> Are people interested in discussing this at LSF/MM, or is it better
> saved for a different forum?
>

I am interested in the topic. We did look at doing something similar on
ppc64, and most of our focus was on reducing boot time by cutting out
the overhead of the guest firmware (SLOF) and the block layer (by using
9pfs).  I would like to understand the MM challenges you have
identified.

-aneesh


* Re: [LSF/MM TOPIC] VM containers
  2016-01-27 18:36     ` Johannes Weiner
@ 2016-01-28 17:12       ` Vladimir Davydov
  0 siblings, 0 replies; 10+ messages in thread
From: Vladimir Davydov @ 2016-01-28 17:12 UTC (permalink / raw)
  To: Johannes Weiner
  Cc: Rik van Riel, lsf-pc, Linux Memory Management List,
	Linux kernel Mailing List, KVM list

On Wed, Jan 27, 2016 at 01:36:51PM -0500, Johannes Weiner wrote:
> On Wed, Jan 27, 2016 at 06:48:31PM +0300, Vladimir Davydov wrote:
> > On Fri, Jan 22, 2016 at 12:11:21PM -0500, Johannes Weiner wrote:
> > > Hi,
> > > 
> > > On Fri, Jan 22, 2016 at 10:56:15AM -0500, Rik van Riel wrote:
> > > > I am trying to gauge interest in discussing VM containers at the LSF/MM
> > > > summit this year. Projects like ClearLinux, Qubes, and others are all
> > > > trying to use virtual machines as better isolated containers.
> > > > 
> > > > That changes some of the goals the memory management subsystem has,
> > > > from "use all the resources effectively" to "use as few resources as
> > > > necessary, in case the host needs the memory for something else".
> > > 
> > > I would be very interested in discussing this topic, because I think
> > > the issue is more generic than these VM applications. We are facing
> > > the same issues with regular containers, where aggressive caching is
> > > counteracting the desire to cut down workloads to their bare minimum
> > > in order to pack them as tightly as possible.
> > > 
> > > With per-cgroup LRUs and thrash detection, we have infrastructure in
> > 
> > By thrash detection, do you mean vmpressure?
> 
> I mean mm/workingset.c, we'd have to look at actual refaults.
> 
> Reclaim efficiency is not a meaningful measure of memory pressure. You
> could be reclaiming happily and successfully every single cache page
> on the LRU, only to have userspace fault them in again right after.
> No memory pressure would be detected, even though a ton of IO is

But if they are part of the working set, mm/workingset should activate
them, so that they'd be given one more round on the LRU and therefore
contribute to vmpressure. So vmpressure should be a good enough
indicator of thrashing, provided mm/workingset works fine.

> caused by a lack of memory. [ For this reason, I think we should phase
> out reclaim effifiency as a metric throughout the VM - vmpressure, LRU
> type balancing, OOM invocation etc. - and base it all on thrashing. ]
> 
> > > place that could allow us to accomplish this. Right now we only enter
> > > reclaim once memory runs out, but we could add an allocation mode that
> > > would prefer to always reclaim from the local LRU before increasing
> > > the memory footprint, and only expand once we detect thrashing in the
> > > page cache. That would keep the workloads neatly trimmed at all times.
> > 
> > I don't get it. Do you mean a sort of special GFP flag that would force
> > the caller to reclaim before actual charging/allocation? Or is it
> > supposed to be automatic, basing on how memcg is behaving? If the
> > latter, I suppose it could be already done by a userspace daemon by
> > adjusting memory.high as needed, although it's unclear how to do it
> > optimally.
> 
> Yes, essentially we'd have a target footprint that we increase only
> when cache refaults (or swapins) are detected.
> 
> This could be memory.high and a userspace daemon.
> 
> We could also put it in the kernel so it's useful out of the box.

Yeah, it'd be great to have the perfect reclaimer out of the box. I'm
not sure it's feasible though, because there are so many ways it could
be implemented. I mean, it's more-or-less clear when we should increase
a container's allocation - when it starts thrashing. But when should we
decrease it? Possible answers are: when other containers are thrashing;
when we detect that a container has stopped using its memory (say, by
tracking access bits); or when it hasn't been thrashing for some time.
That said, there are a lot of options with their own pros and cons, and
I don't think there is a single right answer that could be baked into
the kernel. Maybe I'm wrong.

> 
> It could be a watermark for the page allocator to work in virtualized
> environments.
> 
> > > For virtualized environments, the thrashing information would be
> > > communicated slightly differently to the page allocator and/or the
> > > host, but otherwise the fundamental principles should be the same.
> > > 
> > > We'd have to figure out how to balance the aggressiveness there and
> > > how to describe this to the user, as I can imagine that users would
> > > want to tune this based on a tolerance for the degree of thrashing: if
> > > pages are used every M ms, keep them cached; if pages are used every N
> > > ms, freeing up the memory and refetching them from disk is better etc.
> > 
> > Sounds reasonable. What about adding a parameter to memcg that would
> > define ws access time? So that it would act just like memory.low, but in
> > terms of lruvec age instead of lruvec size. I mean, we keep track of
> > lruvec ages and scan those lruvecs whose age is > ws access time before
> > others. That would protect those workloads that access their ws quite,
> > but not very often from streaming workloads which can generate a lot of
> > useless pressure.
> 
> I'm not following here. Which lruvec age?

Well, there's no such thing as lruvec age currently, but I think we
could add one. By lru list age I mean the real time that has passed
since the current tail page was added to the list or rotated. A
straightforward way to track it would be attaching a timestamp to each
page and updating it when the page is added to the LRU or rotated, but
I believe it is possible to get a good approximation w/o adding new
page fields. By biasing vmscan to those lruvecs whose 'age' is greater
than N ms, we would give containers N ms to set reference bits on their
working set pages so that the next time they are scanned they all get
rotated.
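
For illustration, the bookkeeping could be as small as this (the field
and helper names are invented):

#include <linux/jiffies.h>

struct lruvec_age {
	unsigned long tail_stamp;	/* jiffies of last tail add/rotate */
};

static inline void lruvec_note_tail_move(struct lruvec_age *a)
{
	a->tail_stamp = jiffies;
}

static inline unsigned int lruvec_age_ms(struct lruvec_age *a)
{
	return jiffies_to_msecs(jiffies - a->tail_stamp);
}

/*
 * vmscan bias: prefer lruvecs that are older than the per-memcg knob
 * proposed above (ws_access_time_ms, hypothetical).
 */
static inline bool lruvec_past_ws_time(struct lruvec_age *a,
				       unsigned int ws_access_time_ms)
{
	return lruvec_age_ms(a) > ws_access_time_ms;
}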

Thanks,
Vladimir

