* [RFC] memory cgroup: my thoughts on memsw
@ 2014-09-04 14:30 Vladimir Davydov
  2014-09-04 22:03 ` Kamezawa Hiroyuki
  2014-09-15 19:14 ` Johannes Weiner
  0 siblings, 2 replies; 19+ messages in thread
From: Vladimir Davydov @ 2014-09-04 14:30 UTC (permalink / raw)
  To: Johannes Weiner, Michal Hocko
  Cc: Greg Thelen, Hugh Dickins, Kamezawa Hiroyuki, Motohiro Kosaki,
	Glauber Costa, Tejun Heo, Andrew Morton, Pavel Emelianov,
	Konstantin Khorenko, LKML-MM, LKML-cgroups, LKML

Hi,

Over its long history the memory cgroup has been developed rapidly, but
rather in a disordered manner. As a result, today we have a bunch of
features that are practically unusable and want a redesign (soft limits)
or even don't work at all (kmem accounting), not to mention the messy
user interface we have (the _in_bytes suffix is driving me mad :-).

Fortunately, thanks to Tejun's unified cgroup hierarchy, we have a great
chance to drop or redesign some of the old features and their
interfaces. We should use this opportunity to examine every aspect of
the memory cgroup design, because we will probably not be granted such a
present in future.

That's why I'm starting a series of RFC's with *my thoughts* not only on
kmem accounting, which I've been trying to fix for a while, but also on
other parts of the memory cgroup. I'll be happy if anybody reads this to
the end, but please don't kick me too hard if something looks stupid
to you :-)


Today's topic is (surprisingly!) the memsw resource counter and where it
fails to satisfy user requests.

Let's start from the very beginning. The memory cgroup has basically two
resource counters (not counting kmem, which is unusable anyway):
mem_cgroup->res (configured by memory.limit), which counts the total
amount of user pages charged to the cgroup, and mem_cgroup->memsw
(memory.memsw.limit), which is basically res + the cgroup's swap usage.
Obviously, memsw always has both its value and its limit greater than or
equal to those of res. That gives us three options:

 - memory.limit=inf, memory.memsw.limit=inf
   No limits, only accounting.

 - memory.limit=L<inf, memory.memsw.limit=inf
   Not allowed to use more than L bytes of user pages, but use as much
   swap as you want.

 - memory.limit=L<inf, memory.memsw.limit=S<inf, L<=S
   Not allowed to use more than L bytes of user memory. Swap *plus*
   memory usage is limited by S.
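
For illustration, this is roughly how the three setups look through the
cgroup v1 filesystem interface (a sketch: the mount point and the cgroup
name "ct1" are assumptions, and the memory.memsw.* files are only
present when swap accounting is enabled, e.g. with the swapaccount=1
boot option):

  # 1) no limits, accounting only ("-1" means unlimited)
  echo -1 > /sys/fs/cgroup/memory/ct1/memory.limit_in_bytes
  echo -1 > /sys/fs/cgroup/memory/ct1/memory.memsw.limit_in_bytes

  # 2) L = 512M, unlimited swap
  echo 512M > /sys/fs/cgroup/memory/ct1/memory.limit_in_bytes
  echo -1   > /sys/fs/cgroup/memory/ct1/memory.memsw.limit_in_bytes

  # 3) L = 512M, S = 768M: memory <= 512M, memory+swap <= 768M
  echo 512M > /sys/fs/cgroup/memory/ct1/memory.limit_in_bytes
  echo 768M > /sys/fs/cgroup/memory/ct1/memory.memsw.limit_in_bytes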

When it comes to *hard* limits everything looks fine, but hard limits
are not effective for partitioning a large system among lots of
containers, because it's hard to predict the right value for the limit;
besides, many workloads will do better when they are granted more file
caches. There we need a kind of soft limit that is only used on global
memory pressure to shrink containers exceeding it.
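
For reference, in cgroup v1 this kind of limit is exposed as
memory.soft_limit_in_bytes; a minimal sketch, with an assumed mount
point and cgroup name:

  # let ct1 grow freely, but reclaim it back towards 256M
  # when the whole system is under memory pressure
  echo 256M > /sys/fs/cgroup/memory/ct1/memory.soft_limit_in_bytes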


Obviously the soft limit must be less than memory.limit and therefore
memory.memsw.limit. And here comes a problem. Suppose an admin sets a
relatively high memsw.limit (say half of RAM) and a low soft limit for a
container, hoping it will use the extra room for file caches when there's
free memory, but when hard times come it will be shrunk back to the soft
limit quickly. Suppose the container, instead of using the granted
memory for caches, creates a lot of anonymous data, filling up to its
memsw limit (i.e. half of RAM). Then, when the admin starts other
containers, he might find out that they can effectively use only half
of RAM. Why can this happen? See below.

For example, if there's no swap, or only a little. It's pretty common for
customers not to bother creating TBs of swap to back the TBs of RAM
they have. One might propose to issue OOM if we can't reclaim anything
from a container exceeding its soft limit. OK, let it be so, although
it's still not agreed upon AFAIK.

Another case. There's plenty of swap space out there, so we can swap
out the guilty container completely. However, it will take us a
considerable amount of time, especially if the container isn't standing
still, but keeps touching its data. If other containers are mostly using
file caches, they will experience heavy pressure for a long time, not to
mention the slowdown caused by high disk usage. Unfair. One might
object that we can set a limit on IO operations for the culprit (more
limits and dependencies among them, I doubt admins will be happy!). This
will slow it down and guarantee it won't be swapping back in pages that
are being swapped out due to high memory pressure. However, disks have
limited speed. That means it doesn't solve the problem of unfairly
slowing down other containers. What is worse, if we impose an IO limit we
will slow down swap out ourselves! Because we shouldn't ignore the IO
limit for swap out, otherwise the system will be prone to DoS attacks
targeting the disk from inside containers, which is what the IO limit (as
well as any other limit) is there to protect against.

Or perhaps, I'm missing something and malicious behaviour isn't
considered when developing cgroups?!


To sum it up, the current mem + memsw configuration scheme doesn't allow
us to limit swap usage if we want to partition the system dynamically
using soft limits. Actually, it also looks rather confusing to me. We
have a mem limit and a mem+swap limit. I bet that at first glance, an
average admin will think it's possible to limit swap usage by setting
the limits so that the difference between memory.memsw.limit and
memory.limit equals the maximal swap usage, but (surprise!) it isn't
really so. It holds if there's no global memory pressure, but otherwise
swap usage is only limited by memory.memsw.limit! IMHO, it isn't
something obvious.


Finally, my understanding (maybe crazy!) of how things should be
configured. Just like now, there should be mem_cgroup->res, accounting
and limiting total user memory (cache+anon) usage for processes inside
cgroups. Nothing needs to change here. However, mem_cgroup->memsw
should be reworked to account *only* memory that may be swapped out plus
memory that has been swapped out (i.e. swap usage).

This way, by setting memsw.limit (or whatever it should be called) less
than the memory soft limit, we would solve the problem I described above.
The container would then be allowed to use only file caches above its
memsw.limit, which are usually easily shrinkable, and would get
OOM-killed when trying to eat too much swappable memory.

The configuration would also be less confusing then, IMO:

 - memory.limit - container can't use memory above this
 - memory.memsw.limit - container can't use swappable memory above this

From this it clearly follows that maximal swap usage is limited by
memory.memsw.limit.
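
To make the proposed semantics concrete, a purely hypothetical sketch
(this is *not* what the kernel does today; the file names are reused
from the current interface only for illustration):

  # proposed meaning, not current behavior:
  echo 1G   > /sys/fs/cgroup/memory/ct1/memory.limit_in_bytes        # anon+cache <= 1G
  echo 512M > /sys/fs/cgroup/memory/ct1/memory.memsw.limit_in_bytes  # anon+swap <= 512M
  # file caches may fill the rest of the 1G, while swap usage can never
  # exceed 512M no matter how hard the global memory pressure is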

One more thought. Anon memory and file caches are different and should
be handled differently, so mixing them both under the same counter looks
strange to me. Moreover, they are *already* handled differently
throughout the kernel - just look at mm/vmscan.c. Here are the
differences between them I see:

 - Anon memory is handled by the user application, while file caches are
   all on the kernel. That means the application will *definitely* die
   w/o anon memory. W/o file caches it usually can survive, but the more
   caches it has the better it feels.

 - Anon memory is not that easy to reclaim. Swap out is a really slow
   process, because data are usually read/written w/o any specific
   order. Dropping file caches is much easier. Typically we have lots of
   clean pages there.

 - Swap space is limited. And today, it's OK to have TBs of RAM and only
   several GBs of swap. Customers simply don't want to waste their disk
   space on that.

IMO, these lead us to the need for limiting swap/swappable memory usage,
but not swap+mem usage.


Now, a bad thing about such a change (if it were ever considered).
There's no way to convert old settings to new ones, i.e. if we currently have

  mem <= L,
  mem + swap <= S,
  L <= S,

we can set

  mem <= L1,
  swappable_mem <= S1,

where either 

L1 = L, S1 = S

or

L1 = L, S1 = S - L,

but neither configuration will be exactly the same. In the first case
memory+swap usage will be limited by L+S, not by S. In the second case,
although memory+swap <= S, the container won't be able to use more than
S-L of anonymous memory. This is the price we would have to pay if we decided
to go with this change...
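
For instance, with L = 1G and S = 1.5G today, the first option gives
L1 = 1G, S1 = 1.5G, so memory+swap may grow up to L1+S1 = 2.5G instead
of 1.5G; the second gives L1 = 1G, S1 = 0.5G, which keeps
memory+swap <= 1.5G but caps anonymous memory at 0.5G even when nothing
has been swapped out at all.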


Questions, comments, complaints, threats?

Thanks,
Vladimir

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [RFC] memory cgroup: my thoughts on memsw
  2014-09-04 14:30 [RFC] memory cgroup: my thoughts on memsw Vladimir Davydov
@ 2014-09-04 22:03 ` Kamezawa Hiroyuki
  2014-09-05  8:28   ` Vladimir Davydov
  2014-09-15 19:14 ` Johannes Weiner
  1 sibling, 1 reply; 19+ messages in thread
From: Kamezawa Hiroyuki @ 2014-09-04 22:03 UTC (permalink / raw)
  To: Vladimir Davydov, Johannes Weiner, Michal Hocko
  Cc: Greg Thelen, Hugh Dickins, Motohiro Kosaki, Glauber Costa,
	Tejun Heo, Andrew Morton, Pavel Emelianov, Konstantin Khorenko,
	LKML-MM, LKML-cgroups, LKML

(2014/09/04 23:30), Vladimir Davydov wrote:
> Hi,
>
> Over its long history the memory cgroup has been developed rapidly, but
> rather in a disordered manner. As a result, today we have a bunch of
> features that are practically unusable and want a redesign (soft limits)
> or even don't work at all (kmem accounting), not to mention the messy
> user interface we have (the _in_bytes suffix is driving me mad :-).
>
> Fortunately, thanks to Tejun's unified cgroup hierarchy, we have a great
> chance to drop or redesign some of the old features and their
> interfaces. We should use this opportunity to examine every aspect of
> the memory cgroup design, because we will probably not be granted such a
> present in future.
>
> That's why I'm starting a series of RFC's with *my thoughts* not only on
> kmem accounting, which I've been trying to fix for a while, but also on
> other parts of the memory cgroup. I'll be happy if anybody reads this to
> the end, but please don't kick me too hard if something looks stupid
> to you :-)
>
>
> Today's topic is (surprisingly!) the memsw resource counter and where it
> fails to satisfy user requests.
>
> Let's start from the very beginning. The memory cgroup has basically two
> resource counters (not counting kmem, which is unusable anyway):
> mem_cgroup->res (configured by memory.limit), which counts the total
> amount of user pages charged to the cgroup, and mem_cgroup->memsw
> (memory.memsw.limit), which is basically res + the cgroup's swap usage.
> Obviously, memsw always has both its value and its limit greater than or
> equal to those of res. That gives us three options:
>
>   - memory.limit=inf, memory.memsw.limit=inf
>     No limits, only accounting.
>
>   - memory.limit=L<inf, memory.memsw.limit=inf
>     Not allowed to use more than L bytes of user pages, but use as much
>     swap as you want.
>
>   - memory.limit=L<inf, memory.memsw.limit=S<inf, L<=S
>     Not allowed to use more than L bytes of user memory. Swap *plus*
>     memory usage is limited by S.
>
> When it comes to *hard* limits everything looks fine, but hard limits
> are not effective for partitioning a large system among lots of
> containers, because it's hard to predict the right value for the limit;
> besides, many workloads will do better when they are granted more file
> caches. There we need a kind of soft limit that is only used on global
> memory pressure to shrink containers exceeding it.
>
>
> Obviously the soft limit must be less than memory.limit and therefore
> memory.memsw.limit. And here comes a problem. Suppose an admin sets a
> relatively high memsw.limit (say half of RAM) and a low soft limit for a
> container, hoping it will use the extra room for file caches when there's
> free memory, but when hard times come it will be shrunk back to the soft
> limit quickly. Suppose the container, instead of using the granted
> memory for caches, creates a lot of anonymous data, filling up to its
> memsw limit (i.e. half of RAM). Then, when the admin starts other
> containers, he might find out that they can effectively use only half
> of RAM. Why can this happen? See below.
>
> For example, if there's no swap, or only a little. It's pretty common for
> customers not to bother creating TBs of swap to back the TBs of RAM
> they have. One might propose to issue OOM if we can't reclaim anything
> from a container exceeding its soft limit. OK, let it be so, although
> it's still not agreed upon AFAIK.
>
> Another case. There's plenty of swap space out there, so we can swap
> out the guilty container completely. However, it will take us a
> considerable amount of time, especially if the container isn't standing
> still, but keeps touching its data. If other containers are mostly using
> file caches, they will experience heavy pressure for a long time, not to
> mention the slowdown caused by high disk usage. Unfair. One might
> object that we can set a limit on IO operations for the culprit (more
> limits and dependencies among them, I doubt admins will be happy!). This
> will slow it down and guarantee it won't be swapping back in pages that
> are being swapped out due to high memory pressure. However, disks have
> limited speed. That means it doesn't solve the problem of unfairly
> slowing down other containers. What is worse, if we impose an IO limit we
> will slow down swap out ourselves! Because we shouldn't ignore the IO
> limit for swap out, otherwise the system will be prone to DoS attacks
> targeting the disk from inside containers, which is what the IO limit (as
> well as any other limit) is there to protect against.
>
> Or perhaps, I'm missing something and malicious behaviour isn't
> considered when developing cgroups?!
>
>
> To sum it up, the current mem + memsw configuration scheme doesn't allow
> us to limit swap usage if we want to partition the system dynamically
> using soft limits. Actually, it also looks rather confusing to me. We
> have a mem limit and a mem+swap limit. I bet that at first glance, an
> average admin will think it's possible to limit swap usage by setting
> the limits so that the difference between memory.memsw.limit and
> memory.limit equals the maximal swap usage, but (surprise!) it isn't
> really so. It holds if there's no global memory pressure, but otherwise
> swap usage is only limited by memory.memsw.limit! IMHO, it isn't
> something obvious.
>
>
> Finally, my understanding (maybe crazy!) of how things should be
> configured. Just like now, there should be mem_cgroup->res, accounting
> and limiting total user memory (cache+anon) usage for processes inside
> cgroups. Nothing needs to change here. However, mem_cgroup->memsw
> should be reworked to account *only* memory that may be swapped out plus
> memory that has been swapped out (i.e. swap usage).
>
> This way, by setting memsw.limit (or whatever it should be called) less
> than the memory soft limit, we would solve the problem I described above.
> The container would then be allowed to use only file caches above its
> memsw.limit, which are usually easily shrinkable, and would get
> OOM-killed when trying to eat too much swappable memory.
>
> The configuration would also be less confusing then, IMO:
>
>   - memory.limit - container can't use memory above this
>   - memory.memsw.limit - container can't use swappable memory above this
>
> From this it clearly follows that maximal swap usage is limited by
> memory.memsw.limit.
>
> One more thought. Anon memory and file caches are different and should
> be handled differently, so mixing them both under the same counter looks
> strange to me. Moreover, they are *already* handled differently
> throughout the kernel - just look at mm/vmscan.c. Here are the
> differences between them I see:
>
>   - Anon memory is handled by the user application, while file caches are
>     all on the kernel. That means the application will *definitely* die
>     w/o anon memory. W/o file caches it usually can survive, but the more
>     caches it has the better it feels.
>
>   - Anon memory is not that easy to reclaim. Swap out is a really slow
>     process, because data are usually read/written w/o any specific
>     order. Dropping file caches is much easier. Typically we have lots of
>     clean pages there.
>
>   - Swap space is limited. And today, it's OK to have TBs of RAM and only
>     several GBs of swap. Customers simply don't want to waste their disk
>     space on that.
>
> IMO, these lead us to the need for limiting swap/swappable memory usage,
> but not swap+mem usage.
>
>
> Now, a bad thing about such a change (if it were ever considered).
> There's no way to convert old settings to new ones, i.e. if we currently have
>
>    mem <= L,
>    mem + swap <= S,
>    L <= S,
>
> we can set
>
>    mem <= L1,
>    swappable_mem <= S1,
>
> where either
>
> L1 = L, S1 = S
>
> or
>
> L1 = L, S1 = S - L,
>
> but neither configuration will be exactly the same. In the first case
> memory+swap usage will be limited by L+S, not by S. In the second case,
> although memory+swap <= S, the container won't be able to use more than
> S-L of anonymous memory. This is the price we would have to pay if we decided
> to go with this change...
>
>
> Questions, comments, complaints, threats?
>

If one hits the anon+swap limit, it just means OOM. Hitting the limit
means a process's death.
Is it useful?

Thanks,
-Kame






^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [RFC] memory cgroup: my thoughts on memsw
  2014-09-04 22:03 ` Kamezawa Hiroyuki
@ 2014-09-05  8:28   ` Vladimir Davydov
  2014-09-05 14:20     ` Kamezawa Hiroyuki
  0 siblings, 1 reply; 19+ messages in thread
From: Vladimir Davydov @ 2014-09-05  8:28 UTC (permalink / raw)
  To: Kamezawa Hiroyuki
  Cc: Johannes Weiner, Michal Hocko, Greg Thelen, Hugh Dickins,
	Motohiro Kosaki, Glauber Costa, Tejun Heo, Andrew Morton,
	Pavel Emelianov, Konstantin Khorenko, LKML-MM, LKML-cgroups,
	LKML

Hi Kamezawa,

Thanks for reading this :-)

On Fri, Sep 05, 2014 at 07:03:57AM +0900, Kamezawa Hiroyuki wrote:
> (2014/09/04 23:30), Vladimir Davydov wrote:
> >  - memory.limit - container can't use memory above this
> >  - memory.memsw.limit - container can't use swappable memory above this
> 
> If one hits the anon+swap limit, it just means OOM. Hitting the limit
> means a process's death.

Basically yes. Hitting the memory.limit will result in swap out + cache
reclaim no matter if it's an anon charge or a page cache one. Hitting
the swappable memory limit (anon+swap) can only occur on an anon charge and
if it happens we have no choice other than invoking OOM.

Frankly, I don't see anything wrong in such a behavior. Why is it worse
than the current behavior where we also kill processes if a cgroup
reaches memsw.limit and we can't reclaim page caches?

I admit I may be missing something. So I'd appreciate if you could
provide me with a use case where we want *only* the current behavior and
my proposal is a no-go.

> Is it useful?

I think so, at least, if we want to use soft limits. The point is we
will have to kill a process if it eats too much anon memory *anyway*
when it comes to global memory pressure, but before finishing it we'll
be torturing the culprit as well as *innocent* processes by issuing
massive reclaim, as I tried to point out in the example above. IMO, this
is no good.

Besides, I believe such a distinction between swappable memory and
caches would look more natural to users. Everyone got used to it
actually. For example, when an admin or user or any userspace utility
looks at the output of free(1), it primarily pays attention to free
memory "-/+ buffers/caches", because almost all memory is usually full
of file caches. And they know that caches easy come, easy go. IMO, for
them it'd be more useful to limit this to avoid nasty surprises in the
future, and only set some hints for page cache reclaim.
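
For example, the classic free(1) output looks like this (illustrative
numbers, in kilobytes):

               total       used       free     shared    buffers     cached
  Mem:       8176964    7924568     252396          0     418448    5575496
  -/+ buffers/cache:    1930624    6246340
  Swap:      2097148          0    2097148

The "-/+ buffers/cache" line is what people actually look at: only about
1.9G is "really" used once the easily droppable caches are subtracted.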

The only exception is strict sand-boxing, but AFAIU we can sand-box apps
perfectly well with this too, because we would still have a strict
memory limit and a limit on maximal swap usage.

Please forgive me if the idea looks totally stupid to you (maybe it is!),
but let's just try to consider every possibility we have in mind.

Thanks,
Vladimir

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [RFC] memory cgroup: my thoughts on memsw
  2014-09-05  8:28   ` Vladimir Davydov
@ 2014-09-05 14:20     ` Kamezawa Hiroyuki
  2014-09-05 16:00       ` Vladimir Davydov
  0 siblings, 1 reply; 19+ messages in thread
From: Kamezawa Hiroyuki @ 2014-09-05 14:20 UTC (permalink / raw)
  To: Vladimir Davydov
  Cc: Johannes Weiner, Michal Hocko, Greg Thelen, Hugh Dickins,
	Motohiro Kosaki, Glauber Costa, Tejun Heo, Andrew Morton,
	Pavel Emelianov, Konstantin Khorenko, LKML-MM, LKML-cgroups,
	LKML

(2014/09/05 17:28), Vladimir Davydov wrote:
> Hi Kamezawa,
>
> Thanks for reading this :-)
>
> On Fri, Sep 05, 2014 at 07:03:57AM +0900, Kamezawa Hiroyuki wrote:
>> (2014/09/04 23:30), Vladimir Davydov wrote:
>>>   - memory.limit - container can't use memory above this
>>>   - memory.memsw.limit - container can't use swappable memory above this
>>
>> If one hits the anon+swap limit, it just means OOM. Hitting the limit
>> means a process's death.
>
> Basically yes. Hitting the memory.limit will result in swap out + cache
> reclaim no matter if it's an anon charge or a page cache one. Hitting
> the swappable memory limit (anon+swap) can only occur on an anon charge and
> if it happens we have no choice other than invoking OOM.
>
> Frankly, I don't see anything wrong in such a behavior. Why is it worse
> than the current behavior where we also kill processes if a cgroup
> reaches memsw.limit and we can't reclaim page caches?
>

IIUC, it's the same behavior as a system without cgroups.

> I admit I may be missing something. So I'd appreciate if you could
> provide me with a use case where we want *only* the current behavior and
> my proposal is a no-go.
>

Basically, I don't like OOM Kill. Nobody likes it, I think.

In recent container use, applications may be built as "stateless" and
kill-and-respawn may not be problematic, but I think killing "a" process
by oom-kill is too naive.

If your proposal is to trigger a notification to user space on hitting
the anon+swap limit, it may be useful.
...Some container-cluster management software can handle it.
For example, the container may be restarted.

Memcg has a threshold notifier and a vmpressure notifier.
I think you can enhance them.
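
For reference, this is how the existing threshold notifier can be driven
from userland, assuming the sample listener built from
tools/cgroup/cgroup_event_listener.c in the kernel tree (the cgroup path
is an assumption):

  # register an eventfd-based threshold at 512M on the cgroup's usage;
  # under the hood this writes "<event_fd> <usage_fd> 512M" to
  # cgroup.event_control and then blocks until the threshold is crossed
  cgroup_event_listener /sys/fs/cgroup/memory/ct1/memory.usage_in_bytes 512M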


>> Is it useful?
>
> I think so, at least, if we want to use soft limits. The point is we
> will have to kill a process if it eats too much anon memory *anyway*
> when it comes to global memory pressure, but before finishing it we'll
> be torturing the culprit as well as *innocent* processes by issuing
> massive reclaim, as I tried to point out in the example above. IMO, this
> is no good.
>

My point is that "killing a process" tends not to be able to fix the situation.
For example, a fork-bomb by "make -j" cannot be handled by it.

So, I don't want to think about enhancing OOM-Kill. Please think of a better
way to survive. With the help of container-management software, I think
we can have several choices.

Restarting the container (killall) may be the best if the container app is
stateless. Or container-management can provide some failover.

> Besides, I believe such a distinction between swappable memory and
> caches would look more natural to users. Everyone got used to it
> actually. For example, when an admin or user or any userspace utility
> looks at the output of free(1), it primarily pays attention to free
> memory "-/+ buffers/caches", because almost all memory is usually full
> of file caches. And they know that caches easy come, easy go. IMO, for
> them it'd be more useful to limit this to avoid nasty surprises in the
> future, and only set some hints for page cache reclaim.
>
> The only exception is strict sand-boxing, but AFAIU we can sand-box apps
> perfectly well with this too, because we would still have a strict
> memory limit and a limit on maximal swap usage.
>
> Please forgive me if the idea looks totally stupid to you (maybe it is!),
> but let's just try to consider every possibility we have in mind.
>

The 1st reason we added memsw.limit was to avoid the whole swap being
used up by a cgroup where a memory leak or a fork bomb is running, not for
some intelligent controls.

From your opinion, I feel what you want is to avoid charging page caches.
But thinking of Docker et al., page cache is not shared between containers any more.
I think "including cache" makes sense.

Thanks,
-Kame


^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [RFC] memory cgroup: my thoughts on memsw
  2014-09-05 14:20     ` Kamezawa Hiroyuki
@ 2014-09-05 16:00       ` Vladimir Davydov
  2014-09-05 23:15         ` Kamezawa Hiroyuki
  0 siblings, 1 reply; 19+ messages in thread
From: Vladimir Davydov @ 2014-09-05 16:00 UTC (permalink / raw)
  To: Kamezawa Hiroyuki
  Cc: Johannes Weiner, Michal Hocko, Greg Thelen, Hugh Dickins,
	Motohiro Kosaki, Glauber Costa, Tejun Heo, Andrew Morton,
	Pavel Emelianov, Konstantin Khorenko, LKML-MM, LKML-cgroups,
	LKML

On Fri, Sep 05, 2014 at 11:20:43PM +0900, Kamezawa Hiroyuki wrote:
> Basically, I don't like OOM Kill. Nobody likes it, I think.
> 
> In recent container use, applications may be built as "stateless" and
> kill-and-respawn may not be problematic, but I think killing "a" process
> by oom-kill is too naive.
> 
> If your proposal is to trigger a notification to user space on hitting
> the anon+swap limit, it may be useful.
> ...Some container-cluster management software can handle it.
> For example, the container may be restarted.
> 
> Memcg has a threshold notifier and a vmpressure notifier.
> I think you can enhance them.
[...]
> My point is that "killing a process" tends not to be able to fix the situation.
> For example, a fork-bomb by "make -j" cannot be handled by it.
> 
> So, I don't want to think about enhancing OOM-Kill. Please think of a better
> way to survive. With the help of container-management software, I think
> we can have several choices.
> 
> Restarting the container (killall) may be the best if the container app is
> stateless. Or container-management can provide some failover.

The problem I'm trying to set out is not about OOM actually (sorry if
the way I explain is confusing). We could probably configure OOM to kill
a whole cgroup (not just a process) and/or improve user-notification so
that the userspace could react somehow. I'm sure it must and will be
discussed one day.

The problem is that *before* invoking OOM on *global* pressure we're
trying to reclaim containers' memory and if there's progress we won't
invoke OOM. This can result in a huge slowdown of the whole system (due
to swap out).

And if we want to fully make use of soft limits, we currently have no
means to limit anon memory at all. It's just impossible, because
memsw.limit must be > soft limit, otherwise it makes no sense. So we
will be trying to swap out under global pressure until we finally
realize there's no point in it and call OOM. If we don't, we'll be
suffering until the load goes away by itself.

> The 1st reason we added memsw.limit was to avoid the whole swap being
> used up by a cgroup where a memory leak or a fork bomb is running, not for
> some intelligent controls.
> 
> From your opinion, I feel what you want is to avoid charging page caches.
> But thinking of Docker et al., page cache is not shared between containers any more.
> I think "including cache" makes sense.

Not exactly. It's not about sharing caches among containers. The point
is (1) it's difficult to estimate the size of file caches that will max
out the performance of a container, and (2) a typical workload will
perform better and put less pressure on disk if it has more caches.

Now imagine a big host running a small number of containers and
therefore having a lot of free memory most of time, but still
experiencing load spikes once an hour/day/whatever when memory usage
rises drastically. It'd be unwise to set hard limits for those
containers that are running regularly, because they'd probably perform
much better if they had more file caches. So the admin decides to use
soft limits instead. He is forced to use memsw.limit > the soft limit,
but this is unsafe, because the container may eat anon memory up to
memsw.limit then, and anon memory isn't easy to get rid of when it comes
to the global pressure. If the admin had a means to limit swappable
memory, he could avoid it. This is what I was trying to illustrate by
the example in the first e-mail of this thread.

Note if there were no soft limits, the current setup would be just fine,
otherwise it fails. And soft limits have proved to be useful AFAIK.

Thanks,
Vladimir

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [RFC] memory cgroup: my thoughts on memsw
  2014-09-05 16:00       ` Vladimir Davydov
@ 2014-09-05 23:15         ` Kamezawa Hiroyuki
  2014-09-08 11:01           ` Vladimir Davydov
  0 siblings, 1 reply; 19+ messages in thread
From: Kamezawa Hiroyuki @ 2014-09-05 23:15 UTC (permalink / raw)
  To: Vladimir Davydov
  Cc: Johannes Weiner, Michal Hocko, Greg Thelen, Hugh Dickins,
	Motohiro Kosaki, Glauber Costa, Tejun Heo, Andrew Morton,
	Pavel Emelianov, Konstantin Khorenko, LKML-MM, LKML-cgroups,
	LKML

(2014/09/06 1:00), Vladimir Davydov wrote:
> On Fri, Sep 05, 2014 at 11:20:43PM +0900, Kamezawa Hiroyuki wrote:
>> Basically, I don't like OOM Kill. Nobody likes it, I think.
>>
>> In recent container use, applications may be built as "stateless" and
>> kill-and-respawn may not be problematic, but I think killing "a" process
>> by oom-kill is too naive.
>>
>> If your proposal is to trigger a notification to user space on hitting
>> the anon+swap limit, it may be useful.
>> ...Some container-cluster management software can handle it.
>> For example, the container may be restarted.
>>
>> Memcg has a threshold notifier and a vmpressure notifier.
>> I think you can enhance them.
> [...]
>> My point is that "killing a process" tends not to be able to fix the situation.
>> For example, a fork-bomb by "make -j" cannot be handled by it.
>>
>> So, I don't want to think about enhancing OOM-Kill. Please think of a better
>> way to survive. With the help of container-management software, I think
>> we can have several choices.
>>
>> Restarting the container (killall) may be the best if the container app is
>> stateless. Or container-management can provide some failover.
>
> The problem I'm trying to set out is not about OOM actually (sorry if
> the way I explain is confusing). We could probably configure OOM to kill
> a whole cgroup (not just a process) and/or improve user-notification so
> that the userspace could react somehow. I'm sure it must and will be
> discussed one day.
>
> The problem is that *before* invoking OOM on *global* pressure we're
> trying to reclaim containers' memory and if there's progress we won't
> invoke OOM. This can result in a huge slowdown of the whole system (due
> to swap out).
>
Use an SSD or zram as the swap device.


>> The 1st reason we added memsw.limit was to avoid the whole swap being
>> used up by a cgroup where a memory leak or a fork bomb is running, not for
>> some intelligent controls.
>>
>> From your opinion, I feel what you want is to avoid charging page caches.
>> But thinking of Docker et al., page cache is not shared between containers any more.
>> I think "including cache" makes sense.
>
> Not exactly. It's not about sharing caches among containers. The point
> is (1) it's difficult to estimate the size of file caches that will max
> out the performance of a container, and (2) a typical workload will
> perform better and put less pressure on disk if it has more caches.
>
> Now imagine a big host running a small number of containers and
> therefore having a lot of free memory most of time, but still
> experiencing load spikes once an hour/day/whatever when memory usage
> rises drastically. It'd be unwise to set hard limits for those
> containers that are running regularly, because they'd probably perform
> much better if they had more file caches. So the admin decides to use
> soft limits instead. He is forced to use memsw.limit > the soft limit,
> but this is unsafe, because the container may eat anon memory up to
> memsw.limit then, and anon memory isn't easy to get rid of when it comes
> to the global pressure. If the admin had a means to limit swappable
> memory, he could avoid it. This is what I was trying to illustrate by
> the example in the first e-mail of this thread.
>
> Note if there were no soft limits, the current setup would be just fine,
> otherwise it fails. And soft limits have proved to be useful AFAIK.
>  

As you noticed, hitting anon+swap limit just means oom-kill.
My point is that using oom-killer for "server management" just seems crazy.

Let me clarify things. Your proposal was:
  1. soft-limit will be a main feature for server management.
  2. Because of soft-limit, global memory reclaim runs.
  3. Using swap at global memory reclaim can cause poor performance.
  4. So, making use of OOM-Killer for avoiding swap.

I can't agree with "4". I think

  - don't configure swap.
  - use zram
  - use SSD for swap
Or
  - provide a way to notify usage of "anon+swap" to container management software.

    Now we have "vmpressure". Container management software can kill or respawn a container
    using a user-defined policy for avoiding swap.

    If you don't want to run kswapd at all, threshold notifier enhancement may be required.

/proc/meminfo provides the total number of ANON/CACHE pages.
Many things can be done in userland.
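
For instance, something as simple as:

  # system-wide anon/cache/swap figures, no cgroups involved
  grep -E '^(AnonPages|Cached|SwapCached|SwapTotal|SwapFree):' /proc/meminfo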

And your idea can't help with swap-out caused by memory pressure coming from "zones".
I guess vmpressure will be a total win. The kernel may need some enhancement
but I don't like making use of the oom-killer as part of a feature for avoiding swap.

Thanks,
-Kame








^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [RFC] memory cgroup: my thoughts on memsw
  2014-09-05 23:15         ` Kamezawa Hiroyuki
@ 2014-09-08 11:01           ` Vladimir Davydov
  2014-09-08 13:53             ` Kamezawa Hiroyuki
  0 siblings, 1 reply; 19+ messages in thread
From: Vladimir Davydov @ 2014-09-08 11:01 UTC (permalink / raw)
  To: Kamezawa Hiroyuki
  Cc: Johannes Weiner, Michal Hocko, Greg Thelen, Hugh Dickins,
	Motohiro Kosaki, Glauber Costa, Tejun Heo, Andrew Morton,
	Pavel Emelianov, Konstantin Khorenko, LKML-MM, LKML-cgroups,
	LKML

On Sat, Sep 06, 2014 at 08:15:44AM +0900, Kamezawa Hiroyuki wrote:
> As you noticed, hitting anon+swap limit just means oom-kill.
> My point is that using oom-killer for "server management" just seems crazy.
> 
> Let me clarify things. Your proposal was:
>  1. soft-limit will be a main feature for server management.
>  2. Because of soft-limit, global memory reclaim runs.
>  3. Using swap at global memory reclaim can cause poor performance.
>  4. So, making use of OOM-Killer for avoiding swap.
> 
> I can't agree with "4". I think
> 
>  - don't configure swap.

Suppose there are two containers, each having soft limit set to 50% of
total system RAM. One of the containers eats 90% of the system RAM by
allocating anonymous pages. Another starts using file caches and wants
more than 10% of RAM to work w/o issuing disk reads. So what should we
do then? We won't be able to shrink the first container to its soft
limit, because there's no swap. Leaving it as is would be unfair from
the second container's point of view. Kill it? But the whole system is
going OK, because the working set of the second container is easily
shrinkable. Besides there may be some progress in shrinking file caches
from the first container.

>  - use zram

In fact this isn't different from the previous proposal (working w/o
swap). ZRAM only compresses data while still storing them in RAM so we
eventually may get into a situation where almost all RAM is full of
compressed anon pages.

>  - use SSD for swap

Such a requirement might be OK in enterprise, but forcing SMB to update
their hardware to run a piece of software is a no go. And again, SSD
isn't infinite, we may use it up.

> Or
>  - provide a way to notify usage of "anon+swap" to container management software.
> 
>    Now we have "vmpressure". Container management software can kill or respawn a container
>    using a user-defined policy for avoiding swap.
> 
>    If you don't want to run kswapd at all, threshold notifier enhancement may be required.
> 
> /proc/meminfo provides the total number of ANON/CACHE pages.
> Many things can be done in userland.

AFAIK OOM-in-userspace-handling has been discussed many times, but
there's still no agreement upon it. Basically it isn't reliable, because
it can lead to a deadlock if the userspace handler won't be able to
allocate memory to proceed or will get stuck in some other way. IMO
there must be in-kernel OOM-handling as a last resort anyway. And
actually we already have one - we may kill processes when they hit the
memsw limit.

But OK, you don't like OOM on hitting anon+swap limit and propose to
introduce a kind of userspace notification instead, but the problem
actually isn't *WHAT* we should do on hitting anon+swap limit, but *HOW*
we should implement it (or should we implement it at all). No matter
which way we go, in-kernel OOM or userland notifications, we have to
*INTRODUCE ANON+SWAP ACCOUNTING* to achieve that so that on breaching a
predefined threshold we could invoke OOM or issue a userland
notification or both. And here goes the problem: there's anon+file and
anon+file+swap resource counters, but no anon+swap counter. To react on
anon+swap limit breaching, we must introduce one. I propose to *REUSE*
memsw instead by slightly modifying its meaning.

What we would get then is the ability to react on potentially
unreclaimable memory growth inside a container. What we would loose is
the current implementation of memory+swap limit, *BUT* we would still be
able to limit memory+swap usage by imposing limits on total memory and
anon+swap usage.

> And your idea can't help with swap-out caused by memory pressure coming from "zones".

It would help limit swap-out to a sane value.


I'm sorry if I'm not clear or don't understand something that looks
trivial to you.

Thanks,
Vladimir

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [RFC] memory cgroup: my thoughts on memsw
  2014-09-08 11:01           ` Vladimir Davydov
@ 2014-09-08 13:53             ` Kamezawa Hiroyuki
  2014-09-09 10:39               ` Vladimir Davydov
  2014-09-10 12:01               ` Vladimir Davydov
  0 siblings, 2 replies; 19+ messages in thread
From: Kamezawa Hiroyuki @ 2014-09-08 13:53 UTC (permalink / raw)
  To: Vladimir Davydov
  Cc: Johannes Weiner, Michal Hocko, Greg Thelen, Hugh Dickins,
	Motohiro Kosaki, Glauber Costa, Tejun Heo, Andrew Morton,
	Pavel Emelianov, Konstantin Khorenko, LKML-MM, LKML-cgroups,
	LKML

(2014/09/08 20:01), Vladimir Davydov wrote:
> On Sat, Sep 06, 2014 at 08:15:44AM +0900, Kamezawa Hiroyuki wrote:
>> As you noticed, hitting anon+swap limit just means oom-kill.
>> My point is that using oom-killer for "server management" just seems crazy.
>>
>> Let me clarify things. Your proposal was:
>>   1. soft-limit will be a main feature for server management.
>>   2. Because of soft-limit, global memory reclaim runs.
>>   3. Using swap at global memory reclaim can cause poor performance.
>>   4. So, making use of OOM-Killer for avoiding swap.
>>
>> I can't agree with "4". I think
>>
>>   - don't configure swap.
>
> Suppose there are two containers, each having soft limit set to 50% of
> total system RAM. One of the containers eats 90% of the system RAM by
> allocating anonymous pages. Another starts using file caches and wants
> more than 10% of RAM to work w/o issuing disk reads. So what should we
> do then?
> We won't be able to shrink the first container to its soft
> limit, because there's no swap. Leaving it as is would be unfair from
> the second container's point of view. Kill it? But the whole system is
> going OK, because the working set of the second container is easily
> shrinkable. Besides there may be some progress in shrinking file caches
> from the first container.
>
>>   - use zram
>
> In fact this isn't different from the previous proposal (working w/o
> swap). ZRAM only compresses data while still storing them in RAM so we
> eventually may get into a situation where almost all RAM is full of
> compressed anon pages.
>

In the above 2 cases, "vmpressure" works fine.

>   - use SSD for swap
>
> Such a requirement might be OK in enterprise, but forcing SMB to update
> their hardware to run a piece of software is a no go. And again, SSD
> isn't infinite, we may use it up.
>
ditto.

>> Or
>>   - provide a way to notify usage of "anon+swap" to container management software.
>>
>>    Now we have "vmpressure". Container management software can kill or respawn a container
>>    using a user-defined policy for avoiding swap.
>>
>>     If you don't want to run kswapd at all, threshold notifier enhancement may be required.
>>
>> /proc/meminfo provides the total number of ANON/CACHE pages.
>> Many things can be done in userland.
>
> AFAIK OOM-in-userspace-handling has been discussed many times, but
> there's still no agreement upon it. Basically it isn't reliable, because
> it can lead to a deadlock if the userspace handler won't be able to
> allocate memory to proceed or will get stuck in some other way. IMO
> there must be in-kernel OOM-handling as a last resort anyway. And
> actually we already have one - we may kill processes when they hit the
> memsw limit.
>
> But OK, you don't like OOM on hitting anon+swap limit and propose to
> introduce a kind of userspace notification instead, but the problem
> actually isn't *WHAT* we should do on hitting anon+swap limit, but *HOW*
> we should implement it (or should we implement it at all).


I'm not sure whether you're aware of it or not, but the "hardlimit" counter
is too expensive for your purpose.

If I were you, I'd use some lightweight counter like percpu_counter() or
memcg's event handling system.
Did you see how the threshold notifier or vmpressure works? It's very lightweight.


> No matter which way we go, in-kernel OOM or userland notifications, we have to
> *INTRODUCE ANON+SWAP ACCOUNTING* to achieve that so that on breaching a
> predefined threshold we could invoke OOM or issue a userland
> notification or both. And here goes the problem: there's anon+file and
> anon+file+swap resource counters, but no anon+swap counter. To react on
> anon+swap limit breaching, we must introduce one. I propose to *REUSE*
> memsw instead by slightly modifying its meaning.
>
you can see "anon+swap" via memcg's accounting.

  
> What we would get then is the ability to react on potentially
> unreclaimable memory growth inside a container. What we would loose is
> the current implementation of memory+swap limit, *BUT* we would still be
> able to limit memory+swap usage by imposing limits on total memory and
> anon+swap usage.
>

I repeatedly say anon+swap "hardlimit" just means OOM. I don't buy that.


>> And your idea can't help with swap-out caused by memory pressure coming from "zones".
>
> It would help limit swap-out to a sane value.
>
>
> I'm sorry if I'm not clear or don't understand something that looks
> trivial to you.
>

It seems your purpose is to avoid a system-wide OOM situation. Right?

Implementing system-wide-oom-kill-avoidance logic in memcg doesn't
sound good to me. It should work under system-wide memory management logic.
If memcg can be a help for it, it will be good.


For your purpose, you need to implement your method in a system-wide way.
It seems crazy to set a per-cgroup anon limit for avoiding a system-wide OOM.
You'll need the help of system-wide cgroup-configuration middleware even if
you have a method in a cgroup. If you say the logic should be in the OS kernel,
please implement it as system-wide logic rather than in a cgroup.

I think it's okay to add a help functionality in memcg if there is a
system-wide-oom-avoidance logic.

Thanks,
-Kame


^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [RFC] memory cgroup: my thoughts on memsw
  2014-09-08 13:53             ` Kamezawa Hiroyuki
@ 2014-09-09 10:39               ` Vladimir Davydov
  2014-09-11  2:04                 ` Kamezawa Hiroyuki
  2014-09-10 12:01               ` Vladimir Davydov
  1 sibling, 1 reply; 19+ messages in thread
From: Vladimir Davydov @ 2014-09-09 10:39 UTC (permalink / raw)
  To: Kamezawa Hiroyuki
  Cc: Johannes Weiner, Michal Hocko, Greg Thelen, Hugh Dickins,
	Motohiro Kosaki, Glauber Costa, Tejun Heo, Andrew Morton,
	Pavel Emelianov, Konstantin Khorenko, LKML-MM, LKML-cgroups,
	LKML

On Mon, Sep 08, 2014 at 10:53:48PM +0900, Kamezawa Hiroyuki wrote:
> (2014/09/08 20:01), Vladimir Davydov wrote:
> >But OK, you don't like OOM on hitting anon+swap limit and propose to
> >introduce a kind of userspace notification instead, but the problem
> >actually isn't *WHAT* we should do on hitting anon+swap limit, but *HOW*
> >we should implement it (or should we implement it at all).
> 
> 
> I'm not sure whether you're aware of it or not, but the "hardlimit" counter
> is too expensive for your purpose.
> 
> If I were you, I'd use some lightweight counter like percpu_counter() or
> memcg's event handling system.
> Did you see how the threshold notifier or vmpressure works? It's very lightweight.

OK, after looking through the memory thresholds code and pondering the
problem a bit I tend to agree with you. We can tweak the notifiers to
trigger on anon+swap thresholds, handle them in userspace and do
whatever we like. At least for now, I don't see anything why this could
be worse than hard anon+swap limit except it requires more steps to
configure. Thank you for your patience while explaining this to me :-)

However, there's one thing which made me start this discussion, and it
still bothers me. It's about the memsw.limit_in_bytes knob itself.

First, its value must be greater than or equal to memory.limit_in_bytes.
IMO, such a dependency in the user interface isn't great, but it isn't
the worst thing. What is worse, one has to set it to infinity if one
wants to fully make use of soft limits, as I pointed out earlier.

So, we have a userspace knob that only suits strict sand-boxing, when
one wants to hard-limit the amount of memory and swap an app can use.
When it comes to soft limits, you have to set it to infinity, and it'll
still be accounted at the cost of performance, but without any purpose.
It just seems meaningless to me.

Not counting that the knob itself is kind of confusing, IMO. memsw
means memory+swap, so one would mistakenly think memsw.limit-mem.limit
is the limit on swap usage, but that's wrong.

My point is that anon+swap accounting instead of the current
anon+file+swap memsw implementation would be more flexible. We could
still sandbox apps by setting hard anon+swap and memory limits, but it
would also be possible to make use of it in "soft" environments. It
wouldn't be mandatory though. If one doesn't like OOM, he can use
threshold notifications to restart the container when it starts to
behave badly. But if the user just doesn't want to bother about
configuration or is OK with OOM-killer, he could set hard anon+swap
limit. Besides, it would untie mem.limit knob from memsw.limit, which
would make the user interface simpler and cleaner.

So, I think anon+swap limit would be more flexible than file+anon+swap
limit we have now. Is there any use case where anon+swap and anon+file
accounting couldn't satisfy the user requirements while the
anon+file+swap and anon+file pair could?

> >No matter which way we go, in-kernel OOM or userland notifications, we have to
> >*INTRODUCE ANON+SWAP ACCOUNTING* to achieve that so that on breaching a
> >predefined threshold we could invoke OOM or issue a userland
> >notification or both. And here goes the problem: there's anon+file and
> >anon+file+swap resource counters, but no anon+swap counter. To react on
> >anon+swap limit breaching, we must introduce one. I propose to *REUSE*
> >memsw instead by slightly modifying its meaning.
> >
> you can see "anon+swap"  via memcg's accounting.
> 
> >What we would get then is the ability to react on potentially
> >unreclaimable memory growth inside a container. What we would lose is
> >the current implementation of memory+swap limit, *BUT* we would still be
> >able to limit memory+swap usage by imposing limits on total memory and
> >anon+swap usage.
> >
> 
> I repeatedly say anon+swap "hardlimit" just means OOM. I don't buy that.

anon+file+swap hardlimit eventually means OOM too :-/

> >>And your idea can't help with swap-out caused by memory pressure coming from "zones".
> >
> >It would help limit swap-out to a sane value.
> >
> >
> >I'm sorry if I'm not clear or don't understand something that looks
> >trivial to you.
> >
> 
> It seems your purpose is to avoid a system-wide OOM situation. Right?

This is the purpose of any hard memory limit, including the current
implementation - avoiding global memory pressure in general and
system-wide OOM in particular.

> Implementing system-wide-oom-kill-avoidance logic in memcg doesn't
> sound good to me. It should work under system-wide memory management logic.
> If memcg can be a help for it, it will be good.
> 
> 
> For your purpose, you need to implement your method in a system-wide way.
> It seems crazy to set a per-cgroup anon limit for avoiding a system-wide OOM.
> You'll need the help of system-wide cgroup-configuration middleware even if
> you have a method in a cgroup. If you say the logic should be in the OS kernel,
> please implement it as system-wide logic rather than in a cgroup.

What if on global pressure a memory cgroup exceeding its soft limit is
being reclaimed, but not fast enough, because it has a lot of anon
memory? The global OOM won't be triggered then, because there's still
progress, but the system will experience hard pressure due to the
reclaimer runs. How can we detect if we should kill the container or
not? It smells like one more heuristic in vmscan, IMO.

Thanks,
Vladimir

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [RFC] memory cgroup: my thoughts on memsw
  2014-09-08 13:53             ` Kamezawa Hiroyuki
  2014-09-09 10:39               ` Vladimir Davydov
@ 2014-09-10 12:01               ` Vladimir Davydov
  2014-09-11  1:22                 ` Kamezawa Hiroyuki
  1 sibling, 1 reply; 19+ messages in thread
From: Vladimir Davydov @ 2014-09-10 12:01 UTC (permalink / raw)
  To: Kamezawa Hiroyuki
  Cc: Johannes Weiner, Michal Hocko, Greg Thelen, Hugh Dickins,
	Motohiro Kosaki, Glauber Costa, Tejun Heo, Andrew Morton,
	Pavel Emelianov, Konstantin Khorenko, LKML-MM, LKML-cgroups,
	LKML

On Mon, Sep 08, 2014 at 10:53:48PM +0900, Kamezawa Hiroyuki wrote:
> (2014/09/08 20:01), Vladimir Davydov wrote:
> >On Sat, Sep 06, 2014 at 08:15:44AM +0900, Kamezawa Hiroyuki wrote:
> >>As you noticed, hitting anon+swap limit just means oom-kill.
> >>My point is that using oom-killer for "server management" just seems crazy.
> >>
> >>Let me clarify things. Your proposal was:
> >>  1. soft-limit will be a main feature for server management.
> >>  2. Because of soft-limit, global memory reclaim runs.
> >>  3. Using swap at global memory reclaim can cause poor performance.
> >>  4. So, making use of OOM-Killer for avoiding swap.
> >>
> >>I can't agree with "4". I think
> >>
> >>  - don't configure swap.
> >
> >Suppose there are two containers, each having soft limit set to 50% of
> >total system RAM. One of the containers eats 90% of the system RAM by
> >allocating anonymous pages. Another starts using file caches and wants
> >more than 10% of RAM to work w/o issuing disk reads. So what should we
> >do then?
> >We won't be able to shrink the first container to its soft
> >limit, because there's no swap. Leaving it as is would be unfair from
> >the second container's point of view. Kill it? But the whole system is
> >going OK, because the working set of the second container is easily
> >shrinkable. Besides there may be some progress in shrinking file caches
> >from the first container.
> >
> >>  - use zram
> >
> >In fact this isn't different from the previous proposal (working w/o
> >swap). ZRAM only compresses data while still storing them in RAM so we
> >eventually may get into a situation where almost all RAM is full of
> >compressed anon pages.
> >
> 
> In the above 2 cases, "vmpressure" works fine.

What if a container allocates memory so fast that the userspace thread
handling its threshold notifications won't have time to react before it
eats all memory?

Thanks,
Vladimir

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [RFC] memory cgroup: my thoughts on memsw
  2014-09-10 12:01               ` Vladimir Davydov
@ 2014-09-11  1:22                 ` Kamezawa Hiroyuki
  2014-09-11  7:03                   ` Vladimir Davydov
  0 siblings, 1 reply; 19+ messages in thread
From: Kamezawa Hiroyuki @ 2014-09-11  1:22 UTC (permalink / raw)
  To: Vladimir Davydov
  Cc: Johannes Weiner, Michal Hocko, Greg Thelen, Hugh Dickins,
	Motohiro Kosaki, Glauber Costa, Tejun Heo, Andrew Morton,
	Pavel Emelianov, Konstantin Khorenko, LKML-MM, LKML-cgroups,
	LKML

(2014/09/10 21:01), Vladimir Davydov wrote:
> On Mon, Sep 08, 2014 at 10:53:48PM +0900, Kamezawa Hiroyuki wrote:
>> (2014/09/08 20:01), Vladimir Davydov wrote:
>>> On Sat, Sep 06, 2014 at 08:15:44AM +0900, Kamezawa Hiroyuki wrote:
>>>> As you noticed, hitting anon+swap limit just means oom-kill.
>>>> My point is that using oom-killer for "server management" just seems crazy.
>>>>
>>>> Let me clarify things. Your proposal was:
>>>>   1. soft-limit will be a main feature for server management.
>>>>   2. Because of soft-limit, global memory reclaim runs.
>>>>   3. Using swap at global memory reclaim can cause poor performance.
>>>>   4. So, making use of OOM-Killer for avoiding swap.
>>>>
>>>> I can't agree with "4". I think
>>>>
>>>>   - don't configure swap.
>>>
>>> Suppose there are two containers, each having soft limit set to 50% of
>>> total system RAM. One of the containers eats 90% of the system RAM by
>>> allocating anonymous pages. Another starts using file caches and wants
>>> more than 10% of RAM to work w/o issuing disk reads. So what should we
>>> do then?
>>> We won't be able to shrink the first container to its soft
>>> limit, because there's no swap. Leaving it as is would be unfair from
>>> the second container's point of view. Kill it? But the whole system is
>>> going OK, because the working set of the second container is easily
>>> shrinkable. Besides there may be some progress in shrinking file caches
>>> from the first container.
>>>
>>>>   - use zram
>>>
>>> In fact this isn't different from the previous proposal (working w/o
>>> swap). ZRAM only compresses data while still storing them in RAM so we
>>> eventually may get into a situation where almost all RAM is full of
>>> compressed anon pages.
>>>
>>
>> In the above 2 cases, "vmpressure" works fine.
>
> What if a container allocates memory so fast that the userspace thread
> handling its threshold notifications won't have time to react before it
> eats all memory?
>

Softlimit is for avoiding such unfair memory scheduling, isn't it?

Thanks,
-Kame






^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [RFC] memory cgroup: my thoughts on memsw
  2014-09-09 10:39               ` Vladimir Davydov
@ 2014-09-11  2:04                 ` Kamezawa Hiroyuki
  2014-09-11  8:23                   ` Vladimir Davydov
  0 siblings, 1 reply; 19+ messages in thread
From: Kamezawa Hiroyuki @ 2014-09-11  2:04 UTC (permalink / raw)
  To: Vladimir Davydov
  Cc: Johannes Weiner, Michal Hocko, Greg Thelen, Hugh Dickins,
	Motohiro Kosaki, Glauber Costa, Tejun Heo, Andrew Morton,
	Pavel Emelianov, Konstantin Khorenko, LKML-MM, LKML-cgroups,
	LKML

(2014/09/09 19:39), Vladimir Davydov wrote:

>> For your purpose, you need to implement your method in a system-wide way.
>> It seems crazy to set a per-cgroup anon limit for avoiding a system-wide OOM.
>> You'll need the help of system-wide cgroup-configuration middleware even if
>> you have a method in a cgroup. If you say the logic should be in the OS kernel,
>> please implement it as system-wide logic rather than in a cgroup.
>
> What if on global pressure a memory cgroup exceeding its soft limit is
> being reclaimed, but not fast enough, because it has a lot of anon
> memory? The global OOM won't be triggered then, because there's still
> progress, but the system will experience hard pressure due to the
> reclaimer runs. How can we detect if we should kill the container or
> not? It smells like one more heuristic in vmscan, IMO.


That's what you are trying to implement with the per-cgroup anon+swap limit;
the difference is heuristics by the system designer at container creation or
heuristics by the kernel in a dynamic way.

I said it should be done by a system/cloud container scheduler based on notifications.

But okay, let me think of kernel help in global reclaim.

  - Assume "priority" is a value calculated from "usage - soft limit".

  - weighted kswapd/direct reclaim
    => Based on the priority of each thread/cgroup, increase the "wait" in
       direct reclaim if memory is contended.
       A low-priority container will sleep longer until the contention is fixed.

  - weighted anon allocation
    Similar to the above: if memory is contended, page fault speed should be
    weighted based on priority (soft limit).

  - off-cpu direct reclaim
    Run direct reclaim in a workqueue with a cpu mask. The cpu mask is a global
    setting per NUMA node, which determines the cpus available for reclaiming
    memory. "How to wait" may affect system performance, but this allows the
    masked cpus to be used for more important jobs.
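
A very rough sketch of the off-cpu direct reclaim idea, only for
illustration. reclaim_wq, struct reclaim_work and the way
try_to_free_pages() is called here are assumptions for discussion,
not existing kernel code:

/*
 * Hypothetical: push direct reclaim to an unbound workqueue so that
 * it runs only on an admin-chosen set of cpus.
 */
struct reclaim_work {
	struct work_struct work;
	gfp_t gfp_mask;
	int order;
	unsigned long reclaimed;
};

/* WQ_UNBOUND workqueue whose cpumask the admin restricts per NUMA node */
static struct workqueue_struct *reclaim_wq;

static void reclaim_workfn(struct work_struct *work)
{
	struct reclaim_work *rw = container_of(work, struct reclaim_work, work);

	rw->reclaimed = try_to_free_pages(node_zonelist(numa_node_id(),
							rw->gfp_mask),
					  rw->order, rw->gfp_mask, NULL);
}

/* Called from the allocator slow path instead of reclaiming directly. */
static unsigned long offcpu_direct_reclaim(gfp_t gfp_mask, int order)
{
	struct reclaim_work rw = { .gfp_mask = gfp_mask, .order = order };

	INIT_WORK_ONSTACK(&rw.work, reclaim_workfn);
	queue_work(reclaim_wq, &rw.work);
	flush_work(&rw.work);	/* "how to wait": the allocating task blocks here */
	destroy_work_on_stack(&rw.work);
	return rw.reclaimed;
}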

All of them will give a container manager time to consider the next action.

Anyway, if swap is slow but necessary, you can use a faster swap device now.
It's a good age.

Thanks,
-Kame



^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [RFC] memory cgroup: my thoughts on memsw
  2014-09-11  1:22                 ` Kamezawa Hiroyuki
@ 2014-09-11  7:03                   ` Vladimir Davydov
  0 siblings, 0 replies; 19+ messages in thread
From: Vladimir Davydov @ 2014-09-11  7:03 UTC (permalink / raw)
  To: Kamezawa Hiroyuki
  Cc: Johannes Weiner, Michal Hocko, Greg Thelen, Hugh Dickins,
	Motohiro Kosaki, Glauber Costa, Tejun Heo, Andrew Morton,
	Pavel Emelianov, Konstantin Khorenko, LKML-MM, LKML-cgroups,
	LKML

On Thu, Sep 11, 2014 at 10:22:51AM +0900, Kamezawa Hiroyuki wrote:
> (2014/09/10 21:01), Vladimir Davydov wrote:
> >On Mon, Sep 08, 2014 at 10:53:48PM +0900, Kamezawa Hiroyuki wrote:
> >>(2014/09/08 20:01), Vladimir Davydov wrote:
> >>>On Sat, Sep 06, 2014 at 08:15:44AM +0900, Kamezawa Hiroyuki wrote:
> >>>>As you noticed, hitting anon+swap limit just means oom-kill.
> >>>>My point is that using oom-killer for "server management" just seems crazy.
> >>>>
> >>>>Let me clarify things. Your proposal was:
> >>>>  1. soft-limit will be a main feature for server management.
> >>>>  2. Because of soft-limit, global memory reclaim runs.
> >>>>  3. Using swap at global memory reclaim can cause poor performance.
> >>>>  4. So, making use of OOM-Killer for avoiding swap.
> >>>>
> >>>>I can't agree with "4". I think
> >>>>
> >>>>  - don't configure swap.
> >>>
> >>>Suppose there are two containers, each having soft limit set to 50% of
> >>>total system RAM. One of the containers eats 90% of the system RAM by
> >>>allocating anonymous pages. Another starts using file caches and wants
> >>>more than 10% of RAM to work w/o issuing disk reads. So what should we
> >>>do then?
> >>>We won't be able to shrink the first container to its soft
> >>>limit, because there's no swap. Leaving it as is would be unfair from
> >>>the second container's point of view. Kill it? But the whole system is
> >>>going OK, because the working set of the second container is easily
> >>>shrinkable. Besides there may be some progress in shrinking file caches
> >>>from the first container.
> >>>
> >>>>  - use zram
> >>>
> >>>In fact this isn't different from the previous proposal (working w/o
> >>>swap). ZRAM only compresses data while still storing them in RAM so we
> >>>eventually may get into a situation where almost all RAM is full of
> >>>compressed anon pages.
> >>>
> >>
> >>In the above 2 cases, "vmpressure" works fine.
> >
> >What if a container allocates memory so fast that the userspace thread
> >handling its threshold notifications won't have time to react before it
> >eats all memory?
> >
> 
> The soft limit is for avoiding such unfair memory scheduling, isn't it?

Yeah, and we're back at the very beginning. Anonymous memory reclaim
triggered by the soft limit may be impossible due to lack of swap space,
or really sluggish. The whole system will be dragging its feet until it
finally realizes the container must be killed. It's a kind of DOS
attack...
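
For reference, the notification path I mean is the memcg threshold API.
A minimal watcher looks roughly like this (error handling omitted, the
"ct1" cgroup path is made up); note the window between the event firing
and whatever action the manager takes:

/* Minimal sketch of a memcg threshold watcher (cgroup v1 API). */
#include <stdio.h>
#include <stdint.h>
#include <fcntl.h>
#include <unistd.h>
#include <sys/eventfd.h>

int main(void)
{
	int efd = eventfd(0, 0);
	int ufd = open("/sys/fs/cgroup/memory/ct1/memory.usage_in_bytes",
		       O_RDONLY);
	int cfd = open("/sys/fs/cgroup/memory/ct1/cgroup.event_control",
		       O_WRONLY);
	char buf[64];
	uint64_t hits;

	/* register: "<event_fd> <usage_fd> <threshold in bytes>" */
	int len = snprintf(buf, sizeof(buf), "%d %d %llu",
			   efd, ufd, 1ULL << 30);
	write(cfd, buf, len);

	read(efd, &hits, sizeof(hits));	/* blocks until usage crosses 1G */
	/* by now a fast allocator may already be far past the threshold */
	printf("threshold crossed, shrink or restart the container\n");
	return 0;
}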

Thanks,
Vladimir

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [RFC] memory cgroup: my thoughts on memsw
  2014-09-11  2:04                 ` Kamezawa Hiroyuki
@ 2014-09-11  8:23                   ` Vladimir Davydov
  2014-09-11  8:53                     ` Kamezawa Hiroyuki
  0 siblings, 1 reply; 19+ messages in thread
From: Vladimir Davydov @ 2014-09-11  8:23 UTC (permalink / raw)
  To: Kamezawa Hiroyuki
  Cc: Johannes Weiner, Michal Hocko, Greg Thelen, Hugh Dickins,
	Motohiro Kosaki, Glauber Costa, Tejun Heo, Andrew Morton,
	Pavel Emelianov, Konstantin Khorenko, LKML-MM, LKML-cgroups,
	LKML

On Thu, Sep 11, 2014 at 11:04:41AM +0900, Kamezawa Hiroyuki wrote:
> (2014/09/09 19:39), Vladimir Davydov wrote:
> 
> >>For your purpose, you need to implement your method in a system-wide way.
> >>It seems crazy to set a per-cgroup anon limit for avoiding system-wide OOM.
> >>You'll need the help of system-wide cgroup-configuration middleware even if
> >>you have a method in a cgroup. If you say the logic should be in the OS kernel,
> >>please implement it as system-wide logic rather than in a cgroup.
> >
> >What if, on global pressure, a memory cgroup exceeding its soft limit is
> >being reclaimed, but not fast enough, because it has a lot of anon
> >memory? The global OOM won't be triggered then, because there's still
> >progress, but the system will experience heavy pressure due to the
> >reclaimer runs. How can we detect whether we should kill the container
> >or not? It smells like one more heuristic added to vmscan, IMO.
> 
> 
> That's what you are trying to implement with the per-cgroup anon+swap limit;
> the difference is heuristics set by the system designer at container creation
> versus heuristics applied by the kernel dynamically.

anon+swap limit isn't a heuristic, it's a configuration!

The difference is that the user usually knows the *minimal* requirements
of the app he's going to run in a container/VM. Based on them, he buys a
container/VM with some predefined amount of RAM. From the whole-system
POV it's suboptimal to derive the container's hard limit from the user's
configuration, because there might be free memory, which could be used
for file caches and hence lower disk load. If we had an anon+swap hard
limit, we could use it in conjunction with the soft limit instead of the
hard limit. That would be more efficient than VM-like sand-boxing, while
still being safe.
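
To make it concrete, here is a minimal sketch of how such a counter might
be charged, assuming a hypothetical "anonswap" res_counter in struct
mem_cgroup; neither the field nor the helper below exists today:

/*
 * Hypothetical anon+swap counter: charged when an anon page is faulted
 * in, the charge follows the page to swap on swapout, and it is
 * uncharged when the swap slot is finally freed.
 */
static int mem_cgroup_charge_anonswap(struct mem_cgroup *memcg,
				      unsigned int nr_pages)
{
	struct res_counter *fail_res;

	/* hitting the limit here means reclaim and then per-memcg OOM */
	return res_counter_charge(&memcg->anonswap,
				  nr_pages * PAGE_SIZE, &fail_res);
}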

When I'm talking about in-kernel heuristics, I mean a pile of
hard-to-read functions with a bunch of obscure constants. This is much
worse than providing the user with a convenient and flexible interface.

> I said it should be done by a system/cloud container scheduler based on notifications.

Basically, it's unsafe to hand this out to userspace completely. The
system would be prone to DOS attacks from inside containers then.

> But okay, let me think of kernel help in global reclaim.
> 
>  - Assume "priority" is a value calculated from "usage - soft limit".
> 
>  - weighted kswapd/direct reclaim
>    => Based on the priority of each thread/cgroup, increase the "wait" in
>       direct reclaim if memory is contended.
>       A low-priority container will sleep longer until the contention is fixed.
> 
>  - weighted anon allocation
>    Similar to the above: if memory is contended, page fault speed should be
>    weighted based on priority (soft limit).
> 
>  - off-cpu direct reclaim
>    Run direct reclaim in a workqueue with a cpu mask. The cpu mask is a global
>    setting per NUMA node, which determines the cpus available for reclaiming
>    memory. "How to wait" may affect system performance, but this allows the
>    masked cpus to be used for more important jobs.

That's what I call a bunch of heuristics. And actually I don't see how
it'd help us against latency spikes caused by reclaimer runs; it seems
the set is still incomplete :-/

For example, there are two cgroups, one having a huge soft limit excess
and full of anon memory and another not exceeding its soft limit but
using primarily clean file caches. This prioritizing/weighting stuff
would result in shrinking the first group first on global pressure,
though it's way slower than shrinking the second one. That means a
latency spike in other containers. The heuristics you proposed above
will only make it non-critical - the system will recover sooner or
later. However, it's still a kind of DOS, which an anon+swap hard limit
would prevent.

Sorry, but I simply don't understand what would go wrong if we
substituted the current memsw (anon+file+swap) with an anon+swap limit.
As I stated before, it would be more flexible and logical:

On Tue, Sep 09, 2014 at 02:39:43PM +0400, Vladimir Davydov wrote:
> However, there's one thing, which made me start this discussion, and it
> still bothers me. It's about memsw.limit_in_bytes knob itself.
> 
> First, its value must be greater than or equal to memory.limit_in_bytes.
> IMO, such a dependency in the user interface isn't great, but it isn't
> the worst thing. What is worse, it only makes sense to set it to
> infinity if one wants to fully make use of soft limits, as I pointed out
> earlier.
> 
> So, we have a userspace knob that only suits strict sand-boxing, when
> one wants to hard-limit the amount of memory and swap an app can use.
> When it comes to soft limits, you have to set it to infinity, and it'll
> still be accounted at the cost of performance, but without any purpose.
> It just seems meaningless to me.
> 
> Not to mention that the knob itself is kind of confusing, IMO. memsw
> means memory+swap, so one would mistakenly think memsw.limit - mem.limit
> is the limit on swap usage, but that's wrong.
> 
> My point is that anon+swap accounting instead of the current
> anon+file+swap memsw implementation would be more flexible. We could
> still sandbox apps by setting hard anon+swap and memory limits, but it
> would also be possible to make use of it in "soft" environments. It
> wouldn't be mandatory though. If one doesn't like OOM, he can use
> threshold notifications to restart the container when it starts to
> behave badly. But if the user just doesn't want to bother about
> configuration or is OK with OOM-killer, he could set hard anon+swap
> limit. Besides, it would untie mem.limit knob from memsw.limit, which
> would make the user interface simpler and cleaner.
> 
> So, I think anon+swap limit would be more flexible than file+anon+swap
> limit we have now. Is there any use case where anon+swap and anon+file
> accounting couldn't satisfy the user requirements while the
> anon+file+swap and anon+file pair could?

I would appreciate it if anybody could answer this.

Thanks,
Vladimir

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [RFC] memory cgroup: my thoughts on memsw
  2014-09-11  8:23                   ` Vladimir Davydov
@ 2014-09-11  8:53                     ` Kamezawa Hiroyuki
  2014-09-11  9:50                       ` Vladimir Davydov
  0 siblings, 1 reply; 19+ messages in thread
From: Kamezawa Hiroyuki @ 2014-09-11  8:53 UTC (permalink / raw)
  To: Vladimir Davydov
  Cc: Johannes Weiner, Michal Hocko, Greg Thelen, Hugh Dickins,
	Motohiro Kosaki, Glauber Costa, Tejun Heo, Andrew Morton,
	Pavel Emelianov, Konstantin Khorenko, LKML-MM, LKML-cgroups,
	LKML

(2014/09/11 17:23), Vladimir Davydov wrote:
> On Thu, Sep 11, 2014 at 11:04:41AM +0900, Kamezawa Hiroyuki wrote:
>> (2014/09/09 19:39), Vladimir Davydov wrote:
>>
>>>> For your purpose, you need to implement your method in a system-wide way.
>>>> It seems crazy to set a per-cgroup anon limit for avoiding system-wide OOM.
>>>> You'll need the help of system-wide cgroup-configuration middleware even if
>>>> you have a method in a cgroup. If you say the logic should be in the OS kernel,
>>>> please implement it as system-wide logic rather than in a cgroup.
>>>
>>> What if, on global pressure, a memory cgroup exceeding its soft limit is
>>> being reclaimed, but not fast enough, because it has a lot of anon
>>> memory? The global OOM won't be triggered then, because there's still
>>> progress, but the system will experience heavy pressure due to the
>>> reclaimer runs. How can we detect whether we should kill the container
>>> or not? It smells like one more heuristic added to vmscan, IMO.
>>
>>
>> That's what you are trying to implement with the per-cgroup anon+swap limit;
>> the difference is heuristics set by the system designer at container creation
>> versus heuristics applied by the kernel dynamically.
>
> anon+swap limit isn't a heuristic, it's a configuration!
>
> The difference is that the user usually knows the *minimal* requirements
> of the app he's going to run in a container/VM. Based on them, he buys a
> container/VM with some predefined amount of RAM. From the whole-system
> POV it's suboptimal to derive the container's hard limit from the user's
> configuration, because there might be free memory, which could be used
> for file caches and hence lower disk load. If we had an anon+swap hard
> limit, we could use it in conjunction with the soft limit instead of the
> hard limit. That would be more efficient than VM-like sand-boxing, while
> still being safe.
>
> When I'm talking about in-kernel heuristics, I mean a pile of
> hard-to-read functions with a bunch of obscure constants. This is much
> worse than providing the user with a convenient and flexible interface.
>
>> I said it should be done by a system/cloud container scheduler based on notifications.
>
> Basically, it's unsafe to hand this out to userspace completely. The
> system would be prone to DOS attacks from inside containers then.
>
>> But okay, let me think of kernel help in global reclaim.
>>
>>   - Assume "priority" is a value calculated from "usage - soft limit".
>> 
>>   - weighted kswapd/direct reclaim
>>     => Based on the priority of each thread/cgroup, increase the "wait" in
>>        direct reclaim if memory is contended.
>>        A low-priority container will sleep longer until the contention is fixed.
>> 
>>   - weighted anon allocation
>>     Similar to the above: if memory is contended, page fault speed should be
>>     weighted based on priority (soft limit).
>> 
>>   - off-cpu direct reclaim
>>     Run direct reclaim in a workqueue with a cpu mask. The cpu mask is a global
>>     setting per NUMA node, which determines the cpus available for reclaiming
>>     memory. "How to wait" may affect system performance, but this allows the
>>     masked cpus to be used for more important jobs.
>
> That's what I call a bunch of heuristics. And actually I don't see how
> it'd help us against latency spikes caused by reclaimer runs; it seems
> the set is still incomplete :-/
>
> For example, there are two cgroups, one having a huge soft limit excess
> and full of anon memory and another not exceeding its soft limit but
> using primarily clean file caches. This prioritizing/weighting stuff
> would result in shrinking the first group first on global pressure,
> though it's way slower than shrinking the second one.

The current implementation just round-robins all memcgs under the tree.
With a re-designed soft limit, things will change; you can change it.


> That means a latency spike in other containers.

Why? You said the other container just contains file caches.
A latency spike just because file caches drop?
If the service is that naive, please use the hard limit.

Hmm.
How about raising kswapd's scheduling threshold in some situations?
A per-memcg kswapd for helping the soft limit may work.

> The heuristics you proposed above
> will only make it non-critical - the system will recover sooner or
> later.

My idea is always based on there being a container manager on the system,
which can make sufficiently clever decisions based on an admin-specified
policy. IIUC, reducing the cpu hog caused by memory pressure is always helpful.

> However, it's still a kind of DOS, which an anon+swap hard limit would prevent.

By the oom-killer.


> On Tue, Sep 09, 2014 at 02:39:43PM +0400, Vladimir Davydov wrote:
>> However, there's one thing, which made me start this discussion, and it
>> still bothers me. It's about memsw.limit_in_bytes knob itself.
>>
>> First, its value must be greater than or equal to memory.limit_in_bytes.
>> IMO, such a dependency in the user interface isn't great, but it isn't
>> the worst thing. What is worse, it only makes sense to set it to
>> infinity if one wants to fully make use of soft limits, as I pointed out
>> earlier.
>>
>> So, we have a userspace knob that only suits strict sand-boxing, when
>> one wants to hard-limit the amount of memory and swap an app can use.
>> When it comes to soft limits, you have to set it to infinity, and it'll
>> still be accounted at the cost of performance, but without any purpose.
>> It just seems meaningless to me.
>>
>> Not to mention that the knob itself is kind of confusing, IMO. memsw
>> means memory+swap, so one would mistakenly think memsw.limit - mem.limit
>> is the limit on swap usage, but that's wrong.
>>
>> My point is that anon+swap accounting instead of the current
>> anon+file+swap memsw implementation would be more flexible. We could
>> still sandbox apps by setting hard anon+swap and memory limits, but it
>> would also be possible to make use of it in "soft" environments. It
>> wouldn't be mandatory though. If one doesn't like OOM, he can use
>> threshold notifications to restart the container when it starts to
>> behave badly. But if the user just doesn't want to bother about
>> configuration or is OK with OOM-killer, he could set hard anon+swap
>> limit. Besides, it would untie mem.limit knob from memsw.limit, which
>> would make the user interface simpler and cleaner.
>>
>> So, I think anon+swap limit would be more flexible than file+anon+swap
>> limit we have now. Is there any use case where anon+swap and anon+file
>> accounting couldn't satisfy the user requirements while the
>> anon+file+swap and anon+file pair could?
>
> I would appreciate it if anybody could answer this.
>

I can't understand why you want to use the OOM killer for resource control.

Thanks,
-Kame



^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [RFC] memory cgroup: my thoughts on memsw
  2014-09-11  8:53                     ` Kamezawa Hiroyuki
@ 2014-09-11  9:50                       ` Vladimir Davydov
  0 siblings, 0 replies; 19+ messages in thread
From: Vladimir Davydov @ 2014-09-11  9:50 UTC (permalink / raw)
  To: Kamezawa Hiroyuki
  Cc: Johannes Weiner, Michal Hocko, Greg Thelen, Hugh Dickins,
	Motohiro Kosaki, Glauber Costa, Tejun Heo, Andrew Morton,
	Pavel Emelianov, Konstantin Khorenko, LKML-MM, LKML-cgroups,
	LKML

On Thu, Sep 11, 2014 at 05:53:56PM +0900, Kamezawa Hiroyuki wrote:
> (2014/09/11 17:23), Vladimir Davydov wrote:
> >For example, there are two cgroups, one having a huge soft limit excess
> >and full of anon memory and another not exceeding its soft limit but
> >using primarily clean file caches. This prioritizing/weighting stuff
> >would result in shrinking the first group first on global pressure,
> >though it's way slower than shrinking the second one.
> 
> The current implementation just round-robins all memcgs under the tree.
> With a re-designed soft limit, things will change; you can change it.
> 
> 
> >That means a latency spike in other containers.
> 
> Why? You said the other container just contains file caches.

A container wants some mem (anon, file, whatever) under pressure. If the
pressure is high, it falls into direct reclaim and starts shrinking the
container with a lot of anon memory, which is going to be slow - and
here comes the latency spike.

> A latency spike just because file caches drop?
> If the service is that naive, please use the hard limit.

File caches are evicted much more easily than anon memory, simply because
the latter is (almost) always dirty. However, file caches can still be a
vital part of the working set. It all depends on the load. What's wrong
with a web server that most of the time sends the same set of web pages
to clients? The data it needs are stored on the disk and mostly clean,
but it's still its working set. Evicting it will lower the server's
responsiveness, which will result in clients getting upset and no longer
visiting the web site. Or do you suppose the web server must cache disk
data in anon memory on its own? Why do we keep clean caches at all then?

> Hmm.
> How about raising kswapd's scheduling threshold in some situations?
> A per-memcg kswapd for helping the soft limit may work.

Instead of preventing the worst case, you propose preparing the
after-treatment...

> >The heuristics you proposed above
> >will only make it non-critical - the system will recover sooner or
> >later.
> 
> My idea is always based on there being a container manager on the system,
> which can make sufficiently clever decisions based on an admin-specified
> policy. IIUC, reducing the cpu hog caused by memory pressure is always helpful.
> 
> >However, it's still a kind of DOS, which an anon+swap hard limit would prevent.
> 
> By the oom-killer.

A *local* oom-killer inside the container behaving badly. This is way
better than waiting until it puts the whole system under heavy pressure.

> >On Tue, Sep 09, 2014 at 02:39:43PM +0400, Vladimir Davydov wrote:
> >>However, there's one thing, which made me start this discussion, and it
> >>still bothers me. It's about memsw.limit_in_bytes knob itself.
> >>
> >>First, its value must be greater than or equal to memory.limit_in_bytes.
> >>IMO, such a dependency in the user interface isn't great, but it isn't
> >>the worst thing. What is worse, it only makes sense to set it to
> >>infinity if one wants to fully make use of soft limits, as I pointed out
> >>earlier.
> >>
> >>So, we have a userspace knob that only suits strict sand-boxing, when
> >>one wants to hard-limit the amount of memory and swap an app can use.
> >>When it comes to soft limits, you have to set it to infinity, and it'll
> >>still be accounted at the cost of performance, but without any purpose.
> >>It just seems meaningless to me.
> >>
> >>Not to mention that the knob itself is kind of confusing, IMO. memsw
> >>means memory+swap, so one would mistakenly think memsw.limit - mem.limit
> >>is the limit on swap usage, but that's wrong.
> >>
> >>My point is that anon+swap accounting instead of the current
> >>anon+file+swap memsw implementation would be more flexible. We could
> >>still sandbox apps by setting hard anon+swap and memory limits, but it
> >>would also be possible to make use of it in "soft" environments. It
> >>wouldn't be mandatory though. If one doesn't like OOM, he can use
> >>threshold notifications to restart the container when it starts to
> >>behave badly. But if the user just doesn't want to bother about
> >>configuration or is OK with OOM-killer, he could set hard anon+swap
> >>limit. Besides, it would untie mem.limit knob from memsw.limit, which
> >>would make the user interface simpler and cleaner.
> >>
> >>So, I think anon+swap limit would be more flexible than file+anon+swap
> >>limit we have now. Is there any use case where anon+swap and anon+file
> >>accounting couldn't satisfy the user requirements while the
> >>anon+file+swap and anon+file pair could?
> >
> >I would appreciate it if anybody could answer this.
> >
> 
> I can't understand why you want to use the OOM killer for resource control.

Because there are situations when an app inside a container goes mad.
There must be a reliable way to stop it. It's all about the compromise
between safety (sand-boxing) and efficiency (soft limits). Currently we
can't mix them. Soft limits are intrinsically unsafe but efficient,
while hard limits guarantee safety at the cost of performance. An
anon+swap limit would allow us to combine them to yield an efficient yet
safe setup.

Besides, the memsw limit eventually means OOM too, so why is it better?

What I propose is to give the admin a choice. If he thinks the app is
100% safe, let him rely on userspace handling and in-kernel after-care.
But if there's a possibility of a malicious and/or badly designed app,
let him configure in-kernel OOM per container to prevent a disaster for
sure. The latter is usually the case when you sell containers to
third-party users.

Thanks,
Vladimir

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [RFC] memory cgroup: my thoughts on memsw
  2014-09-04 14:30 [RFC] memory cgroup: my thoughts on memsw Vladimir Davydov
  2014-09-04 22:03 ` Kamezawa Hiroyuki
@ 2014-09-15 19:14 ` Johannes Weiner
  2014-09-16  1:34   ` Kamezawa Hiroyuki
  2014-09-17 15:59   ` Vladimir Davydov
  1 sibling, 2 replies; 19+ messages in thread
From: Johannes Weiner @ 2014-09-15 19:14 UTC (permalink / raw)
  To: Vladimir Davydov
  Cc: Michal Hocko, Greg Thelen, Hugh Dickins, Kamezawa Hiroyuki,
	Motohiro Kosaki, Glauber Costa, Tejun Heo, Andrew Morton,
	Pavel Emelianov, Konstantin Khorenko, LKML-MM, LKML-cgroups,
	LKML

Hi Vladimir,

On Thu, Sep 04, 2014 at 06:30:55PM +0400, Vladimir Davydov wrote:
> To sum it up, the current mem + memsw configuration scheme doesn't allow
> us to limit swap usage if we want to partition the system dynamically
> using soft limits. Actually, it also looks rather confusing to me. We
> have a mem limit and a mem+swap limit. I bet that at first glance, an
> average admin will think it's possible to limit swap usage by setting
> the limits so that the difference between memory.memsw.limit and
> memory.limit equals the maximal swap usage, but (surprise!) it isn't
> really so. It holds if there's no global memory pressure, but otherwise
> swap usage is only limited by memory.memsw.limit! IMHO, it isn't
> something obvious.

Agreed, memory+swap accounting & limiting is broken.

>  - Anon memory is handled by the user application, while file caches are
>    all on the kernel. That means the application will *definitely* die
>    w/o anon memory. W/o file caches it usually can survive, but the more
>    caches it has the better it feels.
> 
>  - Anon memory is not that easy to reclaim. Swap out is a really slow
>    process, because data are usually read/written w/o any specific
>    order. Dropping file caches is much easier. Typically we have lots of
>    clean pages there.
> 
>  - Swap space is limited. And today, it's OK to have TBs of RAM and only
>    several GBs of swap. Customers simply don't want to waste their disk
>    space on that.

> Finally, my understanding (may be crazy!) of how things should be
> configured. Just like now, there should be mem_cgroup->res accounting
> and limiting total user memory (cache+anon) usage for processes inside
> cgroups. This is where there's nothing to do. However, mem_cgroup->memsw
> should be reworked to account *only* memory that may be swapped out plus
> memory that has been swapped out (i.e. swap usage).

But anon pages are not a resource, they are a swap space liability.
Think of virtual memory vs. physical pages - the use of one does not
necessarily result in the use of the other.  Without memory pressure,
anonymous pages do not consume swap space.

What we *should* be accounting and limiting here is the actual finite
resource: swap space.  Whenever we try to swap a page, its owner
should be charged for the swap space - or the swapout be rejected.

For hard limit reclaim, the semantics of a swap space limit would be
fairly obvious, because it's clear who the offender is.

However, in an overcommitted machine, the amount of swap space used by
a particular group depends just as much on the behavior of the other
groups in the system, so the per-group swap limit should be enforced
even during global reclaim to feed back pressure on whoever is causing
the swapout.  If reclaim fails, the global OOM killer triggers, which
should then kill the group with the biggest soft limit excess.

As far as implementation goes, it should be doable to try-charge from
add_to_swap() and keep the uncharging in swap_entry_free().
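
A minimal sketch of the idea, just for illustration;
mem_cgroup_try_charge_swap() is an assumed helper here, not an
existing function:

/*
 * Charge swap space at swapout time; reject the swapout if the
 * group's swap limit is hit.
 */
int add_to_swap(struct page *page, struct list_head *list)
{
	swp_entry_t entry;

	entry = get_swap_page();
	if (!entry.val)
		return 0;

	if (mem_cgroup_try_charge_swap(page, entry)) {
		/* swap limit hit: give the slot back, keep the page */
		swapcache_free(entry);
		return 0;
	}

	/*
	 * ... continue with the existing path: add the page to the
	 * swap cache and mark it dirty; swap_entry_free() does the
	 * matching uncharge when the slot is finally released.
	 */
	return 1;
}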

We'll also have to extend the global OOM killer to be memcg-aware, but
we've been meaning to do that anyway.

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [RFC] memory cgroup: my thoughts on memsw
  2014-09-15 19:14 ` Johannes Weiner
@ 2014-09-16  1:34   ` Kamezawa Hiroyuki
  2014-09-17 15:59   ` Vladimir Davydov
  1 sibling, 0 replies; 19+ messages in thread
From: Kamezawa Hiroyuki @ 2014-09-16  1:34 UTC (permalink / raw)
  To: Johannes Weiner, Vladimir Davydov
  Cc: Michal Hocko, Greg Thelen, Hugh Dickins, Motohiro Kosaki,
	Glauber Costa, Tejun Heo, Andrew Morton, Pavel Emelianov,
	Konstantin Khorenko, LKML-MM, LKML-cgroups, LKML

(2014/09/16 4:14), Johannes Weiner wrote:
> Hi Vladimir,
>
> On Thu, Sep 04, 2014 at 06:30:55PM +0400, Vladimir Davydov wrote:
>> To sum it up, the current mem + memsw configuration scheme doesn't allow
>> us to limit swap usage if we want to partition the system dynamically
>> using soft limits. Actually, it also looks rather confusing to me. We
>> have a mem limit and a mem+swap limit. I bet that at first glance, an
>> average admin will think it's possible to limit swap usage by setting
>> the limits so that the difference between memory.memsw.limit and
>> memory.limit equals the maximal swap usage, but (surprise!) it isn't
>> really so. It holds if there's no global memory pressure, but otherwise
>> swap usage is only limited by memory.memsw.limit! IMHO, it isn't
>> something obvious.
>
> Agreed, memory+swap accounting & limiting is broken.
>
>>   - Anon memory is handled by the user application, while file caches are
>>     all on the kernel. That means the application will *definitely* die
>>     w/o anon memory. W/o file caches it usually can survive, but the more
>>     caches it has the better it feels.
>>
>>   - Anon memory is not that easy to reclaim. Swap out is a really slow
>>     process, because data are usually read/written w/o any specific
>>     order. Dropping file caches is much easier. Typically we have lots of
>>     clean pages there.
>>
>>   - Swap space is limited. And today, it's OK to have TBs of RAM and only
>>     several GBs of swap. Customers simply don't want to waste their disk
>>     space on that.
>
>> Finally, my understanding (may be crazy!) of how things should be
>> configured. Just like now, there should be mem_cgroup->res accounting
>> and limiting total user memory (cache+anon) usage for processes inside
>> cgroups. This is where there's nothing to do. However, mem_cgroup->memsw
>> should be reworked to account *only* memory that may be swapped out plus
>> memory that has been swapped out (i.e. swap usage).
>
> But anon pages are not a resource, they are a swap space liability.
> Think of virtual memory vs. physical pages - the use of one does not
> necessarily result in the use of the other.  Without memory pressure,
> anonymous pages do not consume swap space.
>
> What we *should* be accounting and limiting here is the actual finite
> resource: swap space.  Whenever we try to swap a page, its owner
> should be charged for the swap space - or the swapout be rejected.
>
> For hard limit reclaim, the semantics of a swap space limit would be
> fairly obvious, because it's clear who the offender is.
>
> However, in an overcommitted machine, the amount of swap space used by
> a particular group depends just as much on the behavior of the other
> groups in the system, so the per-group swap limit should be enforced
> even during global reclaim to feed back pressure on whoever is causing
> the swapout.  If reclaim fails, the global OOM killer triggers, which
> should then kill the group with the biggest soft limit excess.
>
> As far as implementation goes, it should be doable to try-charge from
> add_to_swap() and keep the uncharging in swap_entry_free().
>
> We'll also have to extend the global OOM killer to be memcg-aware, but
> we've been meaning to do that anyway.
>

When we introduced the memsw limitation, we tried to avoid affecting global memory
reclaim. That's why we did memory+swap limitation.

Now, global memory reclaim is memcg-aware. So, I think a swap limitation rather than
anon+swap may be a choice. The change will reduce res_counter accesses. Hmm, it would
be desirable to move anon pages to the Unevictable list if the memcg has no swap
quota left.
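
A sketch of that last point, with assumed names; there is no "swapres"
counter in struct mem_cgroup today:

/*
 * Hypothetical: once a memcg has no swap quota left, its anon pages
 * cannot go anywhere, so treat them as unevictable.
 */
static bool memcg_anon_evictable(struct mem_cgroup *memcg)
{
	/* any room left under the swap limit? */
	return res_counter_margin(&memcg->swapres) > 0;
}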

Anyway, I think the soft limit should be re-implemented first. That will be the starting point.

Thanks,
-Kame







^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [RFC] memory cgroup: my thoughts on memsw
  2014-09-15 19:14 ` Johannes Weiner
  2014-09-16  1:34   ` Kamezawa Hiroyuki
@ 2014-09-17 15:59   ` Vladimir Davydov
  1 sibling, 0 replies; 19+ messages in thread
From: Vladimir Davydov @ 2014-09-17 15:59 UTC (permalink / raw)
  To: Johannes Weiner
  Cc: Michal Hocko, Greg Thelen, Hugh Dickins, Kamezawa Hiroyuki,
	Motohiro Kosaki, Glauber Costa, Tejun Heo, Andrew Morton,
	Pavel Emelianov, Konstantin Khorenko, LKML-MM, LKML-cgroups,
	LKML

Hi Johannes,

On Mon, Sep 15, 2014 at 03:14:35PM -0400, Johannes Weiner wrote:
> > Finally, my understanding (may be crazy!) of how things should be
> > configured. Just like now, there should be mem_cgroup->res accounting
> > and limiting total user memory (cache+anon) usage for processes inside
> > cgroups. This is where there's nothing to do. However, mem_cgroup->memsw
> > should be reworked to account *only* memory that may be swapped out plus
> > memory that has been swapped out (i.e. swap usage).
> 
> But anon pages are not a resource, they are a swap space liability.
> Think of virtual memory vs. physical pages - the use of one does not
> necessarily result in the use of the other.  Without memory pressure,
> anonymous pages do not consume swap space.
> 
> What we *should* be accounting and limiting here is the actual finite
> resource: swap space.  Whenever we try to swap a page, its owner
> should be charged for the swap space - or the swapout be rejected.

I've been thinking quite a bit about the problem, and finally I believe
you're right: a separate swap limit would be better than anon+swap.

Provided we make the OOM-killer kill cgroups that exceed their soft
limit and can't be reclaimed, it will solve the problem with soft limits
I described above.

Besides, compared to anon+swap, a swap limit would be more efficient (we
only need to charge one res counter, not two) and more understandable to
users (it's simple to set up limits for both kinds of resources then,
because they never mix).

Finally, we could transfer user configuration from cgroup v1 to v2
easily: just set swap.limit equal to memsw.limit - mem.limit (e.g. a 2G
memsw limit with a 1.5G mem limit becomes a 512M swap limit); it won't
be exactly the same, but I bet nobody will notice any difference.

So, at least for now, I vote for moving from mem+swap to swap
accounting.

Thanks,
Vladimir

^ permalink raw reply	[flat|nested] 19+ messages in thread

end of thread, other threads:[~2014-09-17 15:59 UTC | newest]

Thread overview: 19+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2014-09-04 14:30 [RFC] memory cgroup: my thoughts on memsw Vladimir Davydov
2014-09-04 22:03 ` Kamezawa Hiroyuki
2014-09-05  8:28   ` Vladimir Davydov
2014-09-05 14:20     ` Kamezawa Hiroyuki
2014-09-05 16:00       ` Vladimir Davydov
2014-09-05 23:15         ` Kamezawa Hiroyuki
2014-09-08 11:01           ` Vladimir Davydov
2014-09-08 13:53             ` Kamezawa Hiroyuki
2014-09-09 10:39               ` Vladimir Davydov
2014-09-11  2:04                 ` Kamezawa Hiroyuki
2014-09-11  8:23                   ` Vladimir Davydov
2014-09-11  8:53                     ` Kamezawa Hiroyuki
2014-09-11  9:50                       ` Vladimir Davydov
2014-09-10 12:01               ` Vladimir Davydov
2014-09-11  1:22                 ` Kamezawa Hiroyuki
2014-09-11  7:03                   ` Vladimir Davydov
2014-09-15 19:14 ` Johannes Weiner
2014-09-16  1:34   ` Kamezawa Hiroyuki
2014-09-17 15:59   ` Vladimir Davydov

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).