linux-kernel.vger.kernel.org archive mirror
* [PATCH] fs, mm: account filp and names caches to kmemcg
@ 2017-10-05 22:21 Shakeel Butt
  2017-10-06  7:59 ` Michal Hocko
  2017-10-10 23:32 ` Al Viro
  0 siblings, 2 replies; 52+ messages in thread
From: Shakeel Butt @ 2017-10-05 22:21 UTC (permalink / raw)
  To: Alexander Viro, Vladimir Davydov, Michal Hocko, Greg Thelen
  Cc: Andrew Morton, linux-mm, linux-fsdevel, linux-kernel, Shakeel Butt

The allocations from the filp and names kmem caches can be directly
triggered by user space applications. A buggy application can
consume a significant amount of unaccounted system memory. Though
we have not noticed such buggy applications in our production,
upon close inspection we found that a lot of machines spend a
very significant amount of memory on these caches. So, these
caches should be accounted to kmemcg.

Signed-off-by: Shakeel Butt <shakeelb@google.com>
---
 fs/dcache.c     | 2 +-
 fs/file_table.c | 2 +-
 2 files changed, 2 insertions(+), 2 deletions(-)

diff --git a/fs/dcache.c b/fs/dcache.c
index f90141387f01..fb3449161063 100644
--- a/fs/dcache.c
+++ b/fs/dcache.c
@@ -3642,7 +3642,7 @@ void __init vfs_caches_init_early(void)
 void __init vfs_caches_init(void)
 {
 	names_cachep = kmem_cache_create("names_cache", PATH_MAX, 0,
-			SLAB_HWCACHE_ALIGN|SLAB_PANIC, NULL);
+			SLAB_HWCACHE_ALIGN|SLAB_PANIC|SLAB_ACCOUNT, NULL);
 
 	dcache_init();
 	inode_init();
diff --git a/fs/file_table.c b/fs/file_table.c
index 61517f57f8ef..567888cdf7d3 100644
--- a/fs/file_table.c
+++ b/fs/file_table.c
@@ -312,7 +312,7 @@ void put_filp(struct file *file)
 void __init files_init(void)
 {
 	filp_cachep = kmem_cache_create("filp", sizeof(struct file), 0,
-			SLAB_HWCACHE_ALIGN | SLAB_PANIC, NULL);
+			SLAB_HWCACHE_ALIGN | SLAB_PANIC | SLAB_ACCOUNT, NULL);
 	percpu_counter_init(&nr_files, 0, GFP_KERNEL);
 }
 
-- 
2.14.2.920.gcf0c67979c-goog

^ permalink raw reply related	[flat|nested] 52+ messages in thread

* Re: [PATCH] fs, mm: account filp and names caches to kmemcg
  2017-10-05 22:21 [PATCH] fs, mm: account filp and names caches to kmemcg Shakeel Butt
@ 2017-10-06  7:59 ` Michal Hocko
  2017-10-06 19:33   ` Shakeel Butt
  2017-10-10 23:32 ` Al Viro
  1 sibling, 1 reply; 52+ messages in thread
From: Michal Hocko @ 2017-10-06  7:59 UTC (permalink / raw)
  To: Shakeel Butt
  Cc: Alexander Viro, Vladimir Davydov, Greg Thelen, Andrew Morton,
	linux-mm, linux-fsdevel, linux-kernel

On Thu 05-10-17 15:21:44, Shakeel Butt wrote:
> The allocations from filp and names kmem caches can be directly
> triggered by user space applications. A buggy application can
> consume a significant amount of unaccounted system memory. Though
> we have not noticed such buggy applications in our production
> but upon close inspection, we found that a lot of machines spend
> very significant amount of memory on these caches. So, these
> caches should be accounted to kmemcg.
> 
> Signed-off-by: Shakeel Butt <shakeelb@google.com>
> ---
>  fs/dcache.c     | 2 +-
>  fs/file_table.c | 2 +-
>  2 files changed, 2 insertions(+), 2 deletions(-)
> 
> diff --git a/fs/dcache.c b/fs/dcache.c
> index f90141387f01..fb3449161063 100644
> --- a/fs/dcache.c
> +++ b/fs/dcache.c
> @@ -3642,7 +3642,7 @@ void __init vfs_caches_init_early(void)
>  void __init vfs_caches_init(void)
>  {
>  	names_cachep = kmem_cache_create("names_cache", PATH_MAX, 0,
> -			SLAB_HWCACHE_ALIGN|SLAB_PANIC, NULL);
> +			SLAB_HWCACHE_ALIGN|SLAB_PANIC|SLAB_ACCOUNT, NULL);

I might be wrong but isn't name cache only holding temporary objects
used for path resolution which are not stored anywhere?

>  
>  	dcache_init();
>  	inode_init();
> diff --git a/fs/file_table.c b/fs/file_table.c
> index 61517f57f8ef..567888cdf7d3 100644
> --- a/fs/file_table.c
> +++ b/fs/file_table.c
> @@ -312,7 +312,7 @@ void put_filp(struct file *file)
>  void __init files_init(void)
>  {
>  	filp_cachep = kmem_cache_create("filp", sizeof(struct file), 0,
> -			SLAB_HWCACHE_ALIGN | SLAB_PANIC, NULL);
> +			SLAB_HWCACHE_ALIGN | SLAB_PANIC | SLAB_ACCOUNT, NULL);
>  	percpu_counter_init(&nr_files, 0, GFP_KERNEL);
>  }

Don't we have a limit for the maximum number of open files?

-- 
Michal Hocko
SUSE Labs


* Re: [PATCH] fs, mm: account filp and names caches to kmemcg
  2017-10-06  7:59 ` Michal Hocko
@ 2017-10-06 19:33   ` Shakeel Butt
  2017-10-09  6:24     ` Michal Hocko
  0 siblings, 1 reply; 52+ messages in thread
From: Shakeel Butt @ 2017-10-06 19:33 UTC (permalink / raw)
  To: Michal Hocko
  Cc: Alexander Viro, Vladimir Davydov, Greg Thelen, Andrew Morton,
	Linux MM, linux-fsdevel, LKML

>>       names_cachep = kmem_cache_create("names_cache", PATH_MAX, 0,
>> -                     SLAB_HWCACHE_ALIGN|SLAB_PANIC, NULL);
>> +                     SLAB_HWCACHE_ALIGN|SLAB_PANIC|SLAB_ACCOUNT, NULL);
>
> I might be wrong but isn't name cache only holding temporary objects
> used for path resolution which are not stored anywhere?
>

Even though they're temporary, many containers can together use a
significant amount of transient uncharged memory. We've seen machines
with 100s of MiBs in names_cache.

>>       filp_cachep = kmem_cache_create("filp", sizeof(struct file), 0,
>> -                     SLAB_HWCACHE_ALIGN | SLAB_PANIC, NULL);
>> +                     SLAB_HWCACHE_ALIGN | SLAB_PANIC | SLAB_ACCOUNT, NULL);
>>       percpu_counter_init(&nr_files, 0, GFP_KERNEL);
>>  }
>
> Don't we have a limit for the maximum number of open files?
>

Yes, there is a system-wide limit on the maximum number of open files.
However, this limit is shared between different users on the system, and
one user can hog this resource. To handle that, we set the maximum limit
very high and let the memory limit of each user bound the number of
files they can open.


* Re: [PATCH] fs, mm: account filp and names caches to kmemcg
  2017-10-06 19:33   ` Shakeel Butt
@ 2017-10-09  6:24     ` Michal Hocko
  2017-10-09 17:52       ` Greg Thelen
  0 siblings, 1 reply; 52+ messages in thread
From: Michal Hocko @ 2017-10-09  6:24 UTC (permalink / raw)
  To: Shakeel Butt
  Cc: Alexander Viro, Vladimir Davydov, Greg Thelen, Andrew Morton,
	Linux MM, linux-fsdevel, LKML

On Fri 06-10-17 12:33:03, Shakeel Butt wrote:
> >>       names_cachep = kmem_cache_create("names_cache", PATH_MAX, 0,
> >> -                     SLAB_HWCACHE_ALIGN|SLAB_PANIC, NULL);
> >> +                     SLAB_HWCACHE_ALIGN|SLAB_PANIC|SLAB_ACCOUNT, NULL);
> >
> > I might be wrong but isn't name cache only holding temporary objects
> > used for path resolution which are not stored anywhere?
> >
> 
> Even though they're temporary, many containers can together use a
> significant amount of transient uncharged memory. We've seen machines
> with 100s of MiBs in names_cache.

Yes that might be possible but are we prepared for random ENOMEM from
vfs calls which need to allocate a temporary name?

> 
> >>       filp_cachep = kmem_cache_create("filp", sizeof(struct file), 0,
> >> -                     SLAB_HWCACHE_ALIGN | SLAB_PANIC, NULL);
> >> +                     SLAB_HWCACHE_ALIGN | SLAB_PANIC | SLAB_ACCOUNT, NULL);
> >>       percpu_counter_init(&nr_files, 0, GFP_KERNEL);
> >>  }
> >
> > Don't we have a limit for the maximum number of open files?
> >
> 
> Yes, there is a system limit of maximum number of open files. However
> this limit is shared between different users on the system and one
> user can hog this resource. To cater that, we set the maximum limit
> very high and let the memory limit of each user limit the number of
> files they can open.

Similarly here. Are all syscalls allocating a fd prepared to return
ENOMEM?

-- 
Michal Hocko
SUSE Labs


* Re: [PATCH] fs, mm: account filp and names caches to kmemcg
  2017-10-09  6:24     ` Michal Hocko
@ 2017-10-09 17:52       ` Greg Thelen
  2017-10-09 18:04         ` Michal Hocko
  2017-10-09 20:26         ` Johannes Weiner
  0 siblings, 2 replies; 52+ messages in thread
From: Greg Thelen @ 2017-10-09 17:52 UTC (permalink / raw)
  To: Michal Hocko, Shakeel Butt
  Cc: Alexander Viro, Vladimir Davydov, Andrew Morton, Linux MM,
	linux-fsdevel, LKML

Michal Hocko <mhocko@kernel.org> wrote:

> On Fri 06-10-17 12:33:03, Shakeel Butt wrote:
>> >>       names_cachep = kmem_cache_create("names_cache", PATH_MAX, 0,
>> >> -                     SLAB_HWCACHE_ALIGN|SLAB_PANIC, NULL);
>> >> +                     SLAB_HWCACHE_ALIGN|SLAB_PANIC|SLAB_ACCOUNT, NULL);
>> >
>> > I might be wrong but isn't name cache only holding temporary objects
>> > used for path resolution which are not stored anywhere?
>> >
>> 
>> Even though they're temporary, many containers can together use a
>> significant amount of transient uncharged memory. We've seen machines
>> with 100s of MiBs in names_cache.
>
> Yes that might be possible but are we prepared for random ENOMEM from
> vfs calls which need to allocate a temporary name?
>
>> 
>> >>       filp_cachep = kmem_cache_create("filp", sizeof(struct file), 0,
>> >> -                     SLAB_HWCACHE_ALIGN | SLAB_PANIC, NULL);
>> >> +                     SLAB_HWCACHE_ALIGN | SLAB_PANIC | SLAB_ACCOUNT, NULL);
>> >>       percpu_counter_init(&nr_files, 0, GFP_KERNEL);
>> >>  }
>> >
>> > Don't we have a limit for the maximum number of open files?
>> >
>> 
>> Yes, there is a system limit of maximum number of open files. However
>> this limit is shared between different users on the system and one
>> user can hog this resource. To cater that, we set the maximum limit
>> very high and let the memory limit of each user limit the number of
>> files they can open.
>
> Similarly here. Are all syscalls allocating a fd prepared to return
> ENOMEM?
>
> -- 
> Michal Hocko
> SUSE Labs

Even before this patch I find memcg oom handling inconsistent.  Page
cache pages trigger oom killer and may allow caller to succeed once the
kernel retries.  But kmem allocations don't call oom killer.  They
surface errors to user space.  This makes memcg hard to use for memory
overcommit because it's desirable for a high priority task to
transparently kill a lower priority task using the memcg oom killer.

A few ideas on how to make it more flexible:

a) Go back to memcg oom killing within memcg charging.  This runs risk
   of oom killing while caller holds locks which oom victim selection or
   oom victim termination may need.  Google's been running this way for
   a while.

b) Have every syscall return path do something similar to the page fault
   handler: kmem allocations in an oom memcg mark the current task as
   needing an oom check and return NULL.  If marked oom, syscall exit
   would use
   mem_cgroup_oom_synchronize() before retrying the syscall.  Seems
   risky.  I doubt every syscall is compatible with such a restart.

c) Overcharge kmem to oom memcg and queue an async memcg limit checker,
   which will oom kill if needed.

Comments?

Demo program which eventually gets ENOSPC from mkdir.

$ cat /tmp/t
while umount /tmp/mnt; do true; done
mkdir -p /tmp/mnt
mount -t tmpfs nodev /tmp/mnt
cd /dev/cgroup/memory
rmdir t
mkdir t
echo 32M > t/memory.limit_in_bytes
(echo $BASHPID > t/cgroup.procs && cd /tmp/mnt && exec /tmp/mkdirs)

$ cat /tmp/mkdirs.c
#include <err.h>
#include <stdio.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <sys/types.h>

int main()
{
        int i;
        char name[32];

        if (mlockall(MCL_CURRENT|MCL_FUTURE))
                err(1, "mlockall");
        for (i = 0; i < (1<<20); i++) {
                sprintf(name, "%d", i);
                if (mkdir(name, 0700))
                        err(1, "mkdir");
        }
        printf("done\n");
        return 0;
}


* Re: [PATCH] fs, mm: account filp and names caches to kmemcg
  2017-10-09 17:52       ` Greg Thelen
@ 2017-10-09 18:04         ` Michal Hocko
  2017-10-09 18:17           ` Michal Hocko
  2017-10-09 20:26         ` Johannes Weiner
  1 sibling, 1 reply; 52+ messages in thread
From: Michal Hocko @ 2017-10-09 18:04 UTC (permalink / raw)
  To: Greg Thelen
  Cc: Shakeel Butt, Alexander Viro, Vladimir Davydov, Andrew Morton,
	Linux MM, linux-fsdevel, LKML, Johannes Weiner

[CC Johannes - the thread starts
http://lkml.kernel.org/r/20171005222144.123797-1-shakeelb@google.com]

On Mon 09-10-17 10:52:44, Greg Thelen wrote:
> Michal Hocko <mhocko@kernel.org> wrote:
> 
> > On Fri 06-10-17 12:33:03, Shakeel Butt wrote:
> >> >>       names_cachep = kmem_cache_create("names_cache", PATH_MAX, 0,
> >> >> -                     SLAB_HWCACHE_ALIGN|SLAB_PANIC, NULL);
> >> >> +                     SLAB_HWCACHE_ALIGN|SLAB_PANIC|SLAB_ACCOUNT, NULL);
> >> >
> >> > I might be wrong but isn't name cache only holding temporary objects
> >> > used for path resolution which are not stored anywhere?
> >> >
> >> 
> >> Even though they're temporary, many containers can together use a
> >> significant amount of transient uncharged memory. We've seen machines
> >> with 100s of MiBs in names_cache.
> >
> > Yes that might be possible but are we prepared for random ENOMEM from
> > vfs calls which need to allocate a temporary name?
> >
> >> 
> >> >>       filp_cachep = kmem_cache_create("filp", sizeof(struct file), 0,
> >> >> -                     SLAB_HWCACHE_ALIGN | SLAB_PANIC, NULL);
> >> >> +                     SLAB_HWCACHE_ALIGN | SLAB_PANIC | SLAB_ACCOUNT, NULL);
> >> >>       percpu_counter_init(&nr_files, 0, GFP_KERNEL);
> >> >>  }
> >> >
> >> > Don't we have a limit for the maximum number of open files?
> >> >
> >> 
> >> Yes, there is a system limit of maximum number of open files. However
> >> this limit is shared between different users on the system and one
> >> user can hog this resource. To cater that, we set the maximum limit
> >> very high and let the memory limit of each user limit the number of
> >> files they can open.
> >
> > Similarly here. Are all syscalls allocating a fd prepared to return
> > ENOMEM?
> >
> > -- 
> > Michal Hocko
> > SUSE Labs
> 
> Even before this patch I find memcg oom handling inconsistent.  Page
> cache pages trigger oom killer and may allow caller to succeed once the
> kernel retries.  But kmem allocations don't call oom killer.  They
> surface errors to user space.  This makes memcg hard to use for memory
> overcommit because it's desirable for a high priority task to
> transparently kill a lower priority task using the memcg oom killer.
> 
> A few ideas on how to make it more flexible:
> 
> a) Go back to memcg oom killing within memcg charging.  This runs risk
>    of oom killing while caller holds locks which oom victim selection or
>    oom victim termination may need.  Google's been running this way for
>    a while.
> 
> b) Have every syscall return do something similar to page fault handler:
>    kmem allocations in oom memcg mark the current task as needing an oom
>    check return NULL.  If marked oom, syscall exit would use
>    mem_cgroup_oom_synchronize() before retrying the syscall.  Seems
>    risky.  I doubt every syscall is compatible with such a restart.
> 
> c) Overcharge kmem to oom memcg and queue an async memcg limit checker,
>    which will oom kill if needed.
> 
> Comments?
> 
> Demo program which eventually gets ENOSPC from mkdir.
> 
> $ cat /tmp/t
> while umount /tmp/mnt; do true; done
> mkdir -p /tmp/mnt
> mount -t tmpfs nodev /tmp/mnt
> cd /dev/cgroup/memory
> rmdir t
> mkdir t
> echo 32M > t/memory.limit_in_bytes
> (echo $BASHPID > t/cgroup.procs && cd /tmp/mnt && exec /tmp/mkdirs)
> 
> $ cat /tmp/mkdirs.c
> #include <err.h>
> #include <stdio.h>
> #include <sys/mman.h>
> #include <sys/stat.h>
> #include <sys/types.h>
> 
> int main()
> {
>         int i;
>         char name[32];
> 
>         if (mlockall(MCL_CURRENT|MCL_FUTURE))
>                 err(1, "mlockall");
>         for (i = 0; i < (1<<20); i++) {
>                 sprintf(name, "%d", i);
>                 if (mkdir(name, 0700))
>                         err(1, "mkdir");
>         }
>         printf("done\n");
>         return 0;
> }

-- 
Michal Hocko
SUSE Labs


* Re: [PATCH] fs, mm: account filp and names caches to kmemcg
  2017-10-09 18:04         ` Michal Hocko
@ 2017-10-09 18:17           ` Michal Hocko
  2017-10-10  9:10             ` Michal Hocko
  0 siblings, 1 reply; 52+ messages in thread
From: Michal Hocko @ 2017-10-09 18:17 UTC (permalink / raw)
  To: Greg Thelen
  Cc: Shakeel Butt, Alexander Viro, Vladimir Davydov, Andrew Morton,
	Linux MM, linux-fsdevel, LKML, Johannes Weiner

On Mon 09-10-17 20:04:09, Michal Hocko wrote:
> [CC Johannes - the thread starts
> http://lkml.kernel.org/r/20171005222144.123797-1-shakeelb@google.com]
> 
> On Mon 09-10-17 10:52:44, Greg Thelen wrote:
[...]
> > A few ideas on how to make it more flexible:
> > 
> > a) Go back to memcg oom killing within memcg charging.  This runs risk
> >    of oom killing while caller holds locks which oom victim selection or
> >    oom victim termination may need.  Google's been running this way for
> >    a while.

We can actually reopen this discussion now that the oom handling is
async due to the oom_reaper. At least for the v2 interface. I would have
to think about it much more but the primary concern for this patch was
whether we really need/want to charge short-term objects which do not
outlive a single syscall.
 
> > b) Have every syscall return do something similar to page fault handler:
> >    kmem allocations in oom memcg mark the current task as needing an oom
> >    check return NULL.  If marked oom, syscall exit would use
> >    mem_cgroup_oom_synchronize() before retrying the syscall.  Seems
> >    risky.  I doubt every syscall is compatible with such a restart.

yes, this is simply a no go

> > c) Overcharge kmem to oom memcg and queue an async memcg limit checker,
> >    which will oom kill if needed.

This is what we have max limit for.
-- 
Michal Hocko
SUSE Labs


* Re: [PATCH] fs, mm: account filp and names caches to kmemcg
  2017-10-09 17:52       ` Greg Thelen
  2017-10-09 18:04         ` Michal Hocko
@ 2017-10-09 20:26         ` Johannes Weiner
  2017-10-10  9:14           ` Michal Hocko
  1 sibling, 1 reply; 52+ messages in thread
From: Johannes Weiner @ 2017-10-09 20:26 UTC (permalink / raw)
  To: Greg Thelen
  Cc: Michal Hocko, Shakeel Butt, Alexander Viro, Vladimir Davydov,
	Andrew Morton, Linux MM, linux-fsdevel, LKML

On Mon, Oct 09, 2017 at 10:52:44AM -0700, Greg Thelen wrote:
> Michal Hocko <mhocko@kernel.org> wrote:
> 
> > On Fri 06-10-17 12:33:03, Shakeel Butt wrote:
> >> >>       names_cachep = kmem_cache_create("names_cache", PATH_MAX, 0,
> >> >> -                     SLAB_HWCACHE_ALIGN|SLAB_PANIC, NULL);
> >> >> +                     SLAB_HWCACHE_ALIGN|SLAB_PANIC|SLAB_ACCOUNT, NULL);
> >> >
> >> > I might be wrong but isn't name cache only holding temporary objects
> >> > used for path resolution which are not stored anywhere?
> >> >
> >> 
> >> Even though they're temporary, many containers can together use a
> >> significant amount of transient uncharged memory. We've seen machines
> >> with 100s of MiBs in names_cache.
> >
> > Yes that might be possible but are we prepared for random ENOMEM from
> > vfs calls which need to allocate a temporary name?
> >
> >> 
> >> >>       filp_cachep = kmem_cache_create("filp", sizeof(struct file), 0,
> >> >> -                     SLAB_HWCACHE_ALIGN | SLAB_PANIC, NULL);
> >> >> +                     SLAB_HWCACHE_ALIGN | SLAB_PANIC | SLAB_ACCOUNT, NULL);
> >> >>       percpu_counter_init(&nr_files, 0, GFP_KERNEL);
> >> >>  }
> >> >
> >> > Don't we have a limit for the maximum number of open files?
> >> >
> >> 
> >> Yes, there is a system limit of maximum number of open files. However
> >> this limit is shared between different users on the system and one
> >> user can hog this resource. To cater that, we set the maximum limit
> >> very high and let the memory limit of each user limit the number of
> >> files they can open.
> >
> > Similarly here. Are all syscalls allocating a fd prepared to return
> > ENOMEM?
> >
> > -- 
> > Michal Hocko
> > SUSE Labs
> 
> Even before this patch I find memcg oom handling inconsistent.  Page
> cache pages trigger oom killer and may allow caller to succeed once the
> kernel retries.  But kmem allocations don't call oom killer.

It's consistent in the sense that only page faults enable the memcg
OOM killer. It's not the type of memory that decides, it's whether the
allocation context has a channel to communicate an error to userspace.

Whether userspace is able to handle -ENOMEM from syscalls was a voiced
concern at the time this patch was merged, although there haven't been
any reports so far, and it seemed like the lesser evil between that
and deadlocking the kernel.

If we could find a way to invoke the OOM killer safely, I would
welcome such patches.

> They surface errors to user space.  This makes memcg hard to use for
> memory overcommit because it's desirable for a high priority task to
> transparently kill a lower priority task using the memcg oom killer.
> 
> A few ideas on how to make it more flexible:
> 
> a) Go back to memcg oom killing within memcg charging.  This runs risk
>    of oom killing while caller holds locks which oom victim selection or
>    oom victim termination may need.  Google's been running this way for
>    a while.

We've had real-life reports of this breaking, so even if it works for
some people, I'd rather not revert to that way of doing things.

> b) Have every syscall return do something similar to page fault handler:
>    kmem allocations in oom memcg mark the current task as needing an oom
>    check return NULL.  If marked oom, syscall exit would use
>    mem_cgroup_oom_synchronize() before retrying the syscall.  Seems
>    risky.  I doubt every syscall is compatible with such a restart.

That sounds like a lateral move.

> c) Overcharge kmem to oom memcg and queue an async memcg limit checker,
>    which will oom kill if needed.

This makes the most sense to me. Architecturally, I imagine this would
look like b), with an OOM handler at the point of return to userspace,
except that we'd overcharge instead of retrying the syscall.
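
A rough pseudocode sketch of that shape (every name below -- the thread
flag and the helpers -- is invented for illustration; none of this is
existing kernel API):

```c
/* Pseudocode sketch of option (c): overcharge the memcg at allocation
 * time, remember that an OOM check is owed, and run it on the way back
 * to user space, where killing tasks is safe.  TIF_MEMCG_OOM_CHECK and
 * the memcg helpers are made up for this sketch. */
static void *kmem_alloc_charged(struct kmem_cache *s, gfp_t gfp)
{
	void *obj = kmem_cache_alloc(s, gfp);

	if (obj && memcg_charge_over_limit(current, s->object_size)) {
		/* Charge went over the limit: defer the OOM decision. */
		set_thread_flag(TIF_MEMCG_OOM_CHECK);
	}
	return obj;
}

/* Called on the syscall exit path, after the allocation succeeded. */
static void memcg_check_oom_on_return(void)
{
	if (test_and_clear_thread_flag(TIF_MEMCG_OOM_CHECK))
		mem_cgroup_oom_check(current); /* may invoke the OOM killer */
}
```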


* Re: [PATCH] fs, mm: account filp and names caches to kmemcg
  2017-10-09 18:17           ` Michal Hocko
@ 2017-10-10  9:10             ` Michal Hocko
  2017-10-10 22:21               ` Shakeel Butt
  0 siblings, 1 reply; 52+ messages in thread
From: Michal Hocko @ 2017-10-10  9:10 UTC (permalink / raw)
  To: Greg Thelen
  Cc: Shakeel Butt, Alexander Viro, Vladimir Davydov, Andrew Morton,
	Linux MM, linux-fsdevel, LKML, Johannes Weiner

On Mon 09-10-17 20:17:54, Michal Hocko wrote:
> the primary concern for this patch was whether we really need/want to
> charge short therm objects which do not outlive a single syscall.

Let me expand on this some more. What is the benefit of kmem accounting
for such an object? It cannot stop any runaway, since syscall-lifetime
allocations are bounded by the number of processes, which we already
contain by other means. If we do account them, we apply memory pressure
for something that cannot be reclaimed by any means. Even the memcg OOM
killer would simply kill a single task while others might keep consuming
the same type of memory.

So what is the actual point in accounting these? Does it help to contain
any workload better? What kind of workload?

Or am I completely wrong and name objects can outlive a syscall
considerably?

-- 
Michal Hocko
SUSE Labs


* Re: [PATCH] fs, mm: account filp and names caches to kmemcg
  2017-10-09 20:26         ` Johannes Weiner
@ 2017-10-10  9:14           ` Michal Hocko
  2017-10-10 14:17             ` Johannes Weiner
  0 siblings, 1 reply; 52+ messages in thread
From: Michal Hocko @ 2017-10-10  9:14 UTC (permalink / raw)
  To: Johannes Weiner
  Cc: Greg Thelen, Shakeel Butt, Alexander Viro, Vladimir Davydov,
	Andrew Morton, Linux MM, linux-fsdevel, LKML

On Mon 09-10-17 16:26:13, Johannes Weiner wrote:
> On Mon, Oct 09, 2017 at 10:52:44AM -0700, Greg Thelen wrote:
> > Michal Hocko <mhocko@kernel.org> wrote:
> > 
> > > On Fri 06-10-17 12:33:03, Shakeel Butt wrote:
> > >> >>       names_cachep = kmem_cache_create("names_cache", PATH_MAX, 0,
> > >> >> -                     SLAB_HWCACHE_ALIGN|SLAB_PANIC, NULL);
> > >> >> +                     SLAB_HWCACHE_ALIGN|SLAB_PANIC|SLAB_ACCOUNT, NULL);
> > >> >
> > >> > I might be wrong but isn't name cache only holding temporary objects
> > >> > used for path resolution which are not stored anywhere?
> > >> >
> > >> 
> > >> Even though they're temporary, many containers can together use a
> > >> significant amount of transient uncharged memory. We've seen machines
> > >> with 100s of MiBs in names_cache.
> > >
> > > Yes that might be possible but are we prepared for random ENOMEM from
> > > vfs calls which need to allocate a temporary name?
> > >
> > >> 
> > >> >>       filp_cachep = kmem_cache_create("filp", sizeof(struct file), 0,
> > >> >> -                     SLAB_HWCACHE_ALIGN | SLAB_PANIC, NULL);
> > >> >> +                     SLAB_HWCACHE_ALIGN | SLAB_PANIC | SLAB_ACCOUNT, NULL);
> > >> >>       percpu_counter_init(&nr_files, 0, GFP_KERNEL);
> > >> >>  }
> > >> >
> > >> > Don't we have a limit for the maximum number of open files?
> > >> >
> > >> 
> > >> Yes, there is a system limit of maximum number of open files. However
> > >> this limit is shared between different users on the system and one
> > >> user can hog this resource. To cater that, we set the maximum limit
> > >> very high and let the memory limit of each user limit the number of
> > >> files they can open.
> > >
> > > Similarly here. Are all syscalls allocating a fd prepared to return
> > > ENOMEM?
> > >
> > > -- 
> > > Michal Hocko
> > > SUSE Labs
> > 
> > Even before this patch I find memcg oom handling inconsistent.  Page
> > cache pages trigger oom killer and may allow caller to succeed once the
> > kernel retries.  But kmem allocations don't call oom killer.
> 
> It's consistent in the sense that only page faults enable the memcg
> OOM killer. It's not the type of memory that decides, it's whether the
> allocation context has a channel to communicate an error to userspace.
> 
> Whether userspace is able to handle -ENOMEM from syscalls was a voiced
> concern at the time this patch was merged, although there haven't been
> any reports so far,

Well, I remember reports about MAP_POPULATE breaking or at least having
an unexpected behavior.

> and it seemed like the lesser evil between that
> and deadlocking the kernel.

agreed on this part though

> If we could find a way to invoke the OOM killer safely, I would
> welcome such patches.

Well, we should be able to do that with the oom_reaper. At least for v2
which doesn't have synchronous userspace oom killing.

[...]

> > c) Overcharge kmem to oom memcg and queue an async memcg limit checker,
> >    which will oom kill if needed.
> 
> This makes the most sense to me. Architecturally, I imagine this would
> look like b), with an OOM handler at the point of return to userspace,
> except that we'd overcharge instead of retrying the syscall.

I do not think we should break the hard limit semantic if possible. We
currently allow that for allocations which are very short term (oom
victims) or too important to fail, but allowing it for kmem charges in
general makes it too easy for a workload to run away.

-- 
Michal Hocko
SUSE Labs


* Re: [PATCH] fs, mm: account filp and names caches to kmemcg
  2017-10-10  9:14           ` Michal Hocko
@ 2017-10-10 14:17             ` Johannes Weiner
  2017-10-10 14:24               ` Michal Hocko
  0 siblings, 1 reply; 52+ messages in thread
From: Johannes Weiner @ 2017-10-10 14:17 UTC (permalink / raw)
  To: Michal Hocko
  Cc: Greg Thelen, Shakeel Butt, Alexander Viro, Vladimir Davydov,
	Andrew Morton, Linux MM, linux-fsdevel, LKML

On Tue, Oct 10, 2017 at 11:14:30AM +0200, Michal Hocko wrote:
> On Mon 09-10-17 16:26:13, Johannes Weiner wrote:
> > It's consistent in the sense that only page faults enable the memcg
> > OOM killer. It's not the type of memory that decides, it's whether the
> > allocation context has a channel to communicate an error to userspace.
> > 
> > Whether userspace is able to handle -ENOMEM from syscalls was a voiced
> > concern at the time this patch was merged, although there haven't been
> > any reports so far,
> 
> Well, I remember reports about MAP_POPULATE breaking or at least having
> an unexpected behavior.

Hm, that slipped past me. Did we do something about these? Or did they
fix userspace?

> Well, we should be able to do that with the oom_reaper. At least for v2
> which doesn't have synchronous userspace oom killing.

I don't see how the OOM reaper is a guarantee as long as we have this:

	if (!down_read_trylock(&mm->mmap_sem)) {
		ret = false;
		trace_skip_task_reaping(tsk->pid);
		goto unlock_oom;
	}

What do you mean by 'v2'?

> > > c) Overcharge kmem to oom memcg and queue an async memcg limit checker,
> > >    which will oom kill if needed.
> > 
> > This makes the most sense to me. Architecturally, I imagine this would
> > look like b), with an OOM handler at the point of return to userspace,
> > except that we'd overcharge instead of retrying the syscall.
> 
> I do not think we should break the hard limit semantic if possible. We
> can currently allow that for allocations which are very short term (oom
> victims) or too important to fail but allowing that for kmem charges in
> general sounds like too easy to runaway.

I'm not sure there is a convenient way out of this.

If we want to respect the hard limit AND guarantee allocation success,
the OOM killer has to free memory reliably - which it doesn't. But if
it did, we could also break the limit temporarily and have the OOM
killer replenish the pool before that userspace app can continue. The
allocation wouldn't have to be short-lived, since memory is fungible.

Until the OOM killer is 100% reliable, we have the choice between
sometimes deadlocking the cgroup tasks and everything that interacts
with them, returning -ENOMEM for syscalls, or breaking the hard limit
guarantee during memcg OOM.

It seems breaking the limit temporarily in order to reclaim memory is
the best option. There is kernel memory we don't account to the memcg
already because we think it's probably not going to be significant, so
the isolation isn't 100% watertight in the first place. And I'd rather
have the worst-case effect of a cgroup OOMing be spilling over its
hard limit than deadlocking things inside and outside the cgroup.


* Re: [PATCH] fs, mm: account filp and names caches to kmemcg
  2017-10-10 14:17             ` Johannes Weiner
@ 2017-10-10 14:24               ` Michal Hocko
  2017-10-12 19:03                 ` Johannes Weiner
  0 siblings, 1 reply; 52+ messages in thread
From: Michal Hocko @ 2017-10-10 14:24 UTC (permalink / raw)
  To: Johannes Weiner
  Cc: Greg Thelen, Shakeel Butt, Alexander Viro, Vladimir Davydov,
	Andrew Morton, Linux MM, linux-fsdevel, LKML

On Tue 10-10-17 10:17:33, Johannes Weiner wrote:
> On Tue, Oct 10, 2017 at 11:14:30AM +0200, Michal Hocko wrote:
> > On Mon 09-10-17 16:26:13, Johannes Weiner wrote:
> > > It's consistent in the sense that only page faults enable the memcg
> > > OOM killer. It's not the type of memory that decides, it's whether the
> > > allocation context has a channel to communicate an error to userspace.
> > > 
> > > Whether userspace is able to handle -ENOMEM from syscalls was a voiced
> > > concern at the time this patch was merged, although there haven't been
> > > any reports so far,
> > 
> > Well, I remember reports about MAP_POPULATE breaking or at least having
> > an unexpected behavior.
> 
> Hm, that slipped past me. Did we do something about these? Or did they
> fix userspace?

Well, it was mostly LTP complaining. I tried to fix that, but Linus
was against it, so we just documented that this is possible and that
MAP_POPULATE is not a guarantee.

> > Well, we should be able to do that with the oom_reaper. At least for v2
> > which doesn't have synchronous userspace oom killing.
> 
> I don't see how the OOM reaper is a guarantee as long as we have this:
> 
> 	if (!down_read_trylock(&mm->mmap_sem)) {
> 		ret = false;
> 		trace_skip_task_reaping(tsk->pid);
> 		goto unlock_oom;
> 	}

And we will simply mark the victim MMF_OOM_SKIP and hide it from the oom
killer if we fail to get the mmap_sem after several attempts. This will
allow the oom killer to find a new victim, so we shouldn't deadlock.

> What do you mean by 'v2'?

cgroup v2, because legacy memcg allowed a sync wait for the oom killer
and that would be a bigger problem from deep callchains, for obvious
reasons.

> > > > c) Overcharge kmem to oom memcg and queue an async memcg limit checker,
> > > >    which will oom kill if needed.
> > > 
> > > This makes the most sense to me. Architecturally, I imagine this would
> > > look like b), with an OOM handler at the point of return to userspace,
> > > except that we'd overcharge instead of retrying the syscall.
> > 
> > I do not think we should break the hard limit semantic if possible. We
> > can currently allow that for allocations which are very short term (oom
> > victims) or too important to fail, but allowing that for kmem charges in
> > general sounds too easy to run away with.
> 
> I'm not sure there is a convenient way out of this.
> 
> If we want to respect the hard limit AND guarantee allocation success,
> the OOM killer has to free memory reliably - which it doesn't. But if
> it did, we could also break the limit temporarily and have the OOM
> killer replenish the pool before that userspace app can continue. The
> allocation wouldn't have to be short-lived, since memory is fungible.

If we can guarantee the oom killer is started, then we can allow temporary
access to reserves, which is already implemented even for memcg. The
thing is, we do not invoke the oom killer...
-- 
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: [PATCH] fs, mm: account filp and names caches to kmemcg
  2017-10-10  9:10             ` Michal Hocko
@ 2017-10-10 22:21               ` Shakeel Butt
  2017-10-11  9:09                 ` Michal Hocko
  0 siblings, 1 reply; 52+ messages in thread
From: Shakeel Butt @ 2017-10-10 22:21 UTC (permalink / raw)
  To: Michal Hocko
  Cc: Greg Thelen, Alexander Viro, Vladimir Davydov, Andrew Morton,
	Linux MM, linux-fsdevel, LKML, Johannes Weiner

On Sun, Oct 8, 2017 at 11:24 PM, Michal Hocko <mhocko@kernel.org> wrote:
> On Fri 06-10-17 12:33:03, Shakeel Butt wrote:
>> >>       names_cachep = kmem_cache_create("names_cache", PATH_MAX, 0,
>> >> -                     SLAB_HWCACHE_ALIGN|SLAB_PANIC, NULL);
>> >> +                     SLAB_HWCACHE_ALIGN|SLAB_PANIC|SLAB_ACCOUNT, NULL);
>> >
>> > I might be wrong but isn't name cache only holding temporary objects
>> > used for path resolution which are not stored anywhere?
>> >
>>
>> Even though they're temporary, many containers can together use a
>> significant amount of transient uncharged memory. We've seen machines
>> with 100s of MiBs in names_cache.
>
> Yes that might be possible but are we prepared for random ENOMEM from
> vfs calls which need to allocate a temporary name?
>

I looked at all the syscalls which invoke allocations from
'names_cache' and tried to narrow down which ones have man pages that
do not mention they can return ENOMEM. I found a couple of syscalls,
like truncate(), readdir() & getdents(), whose man pages do not mention
ENOMEM, but this patch will make them able to return it.

>>
>> >>       filp_cachep = kmem_cache_create("filp", sizeof(struct file), 0,
>> >> -                     SLAB_HWCACHE_ALIGN | SLAB_PANIC, NULL);
>> >> +                     SLAB_HWCACHE_ALIGN | SLAB_PANIC | SLAB_ACCOUNT, NULL);
>> >>       percpu_counter_init(&nr_files, 0, GFP_KERNEL);
>> >>  }
>> >
>> > Don't we have a limit for the maximum number of open files?
>> >
>>
Yes, there is a system-wide limit on the maximum number of open files.
However, this limit is shared between different users on the system,
and one user can hog the resource. To cater for that, we set the
maximum limit very high and let the memory limit of each user bound the
number of files they can open.
>
> Similarly here. Are all syscalls allocating a fd prepared to return
> ENOMEM?

For filp, I found _sysctl(). However, the man page says not to use it.

On Tue, Oct 10, 2017 at 2:10 AM, Michal Hocko <mhocko@kernel.org> wrote:
> On Mon 09-10-17 20:17:54, Michal Hocko wrote:
>> the primary concern for this patch was whether we really need/want to
>> charge short term objects which do not outlive a single syscall.
>
> Let me expand on this some more. What is the benefit of kmem accounting
> of such an object? It cannot stop any runaway, as syscall-lifetime
> allocations are bounded by the number of processes, which we kind of
> contain by other means.

We could contain this by limiting the number of processes or threads,
but for us applications having thousands of threads are very common.
So, limiting the number of threads/processes will not work.

> If we do account then we put memory pressure due to
> something that cannot be reclaimed by any means. Even the memcg OOM
> killer would simply kick a single path while there might be others
> to consume the same type of memory.
>
> So what is the actual point in accounting these? Does it help to contain
> any workload better? What kind of workload?
>

I think the benefits will be isolation and more accurate billing. As I
have said before, we have observed 100s of MiBs in names_cache on many
machines, and the cumulative amount is not something we can ignore as
just memory overhead.

> Or am I completely wrong and name objects can outlive a syscall
> considerably?
>

No, I didn't find any instance of the name objects outliving the syscall.

Anyways, we can discuss more on names_cache, do you have any objection
regarding charging filp?

^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: [PATCH] fs, mm: account filp and names caches to kmemcg
  2017-10-05 22:21 [PATCH] fs, mm: account filp and names caches to kmemcg Shakeel Butt
  2017-10-06  7:59 ` Michal Hocko
@ 2017-10-10 23:32 ` Al Viro
  1 sibling, 0 replies; 52+ messages in thread
From: Al Viro @ 2017-10-10 23:32 UTC (permalink / raw)
  To: Shakeel Butt
  Cc: Vladimir Davydov, Michal Hocko, Greg Thelen, Andrew Morton,
	linux-mm, linux-fsdevel, linux-kernel

On Thu, Oct 05, 2017 at 03:21:44PM -0700, Shakeel Butt wrote:
> The allocations from filp and names kmem caches can be directly
> triggered by user space applications. A buggy application can
> consume a significant amount of unaccounted system memory. Though
> we have not noticed such buggy applications in our production
> but upon close inspection, we found that a lot of machines spend
> very significant amount of memory on these caches. So, these
> caches should be accounted to kmemcg.

IDGI...  Surely, it's not hard to come up with a syscall that can
allocate a page for the duration of the syscall?  Just to pick a random
example: reading from /proc/self/cmdline does that.  So does
readlink of /proc/self/cwd, etc.

What does accounting for such temporary allocations (with fixed
limit per syscall, always freed by the end of syscall) buy you,
why is it needed and what makes it not needed for the examples
above (and a slew of similar ones)?

While we are at it, how much overhead does it add on syscall-heavy
loads?  As in, a whole lot of threads is calling something like
stat("/", &stbuf); in parallel?  Because outside of that kind of
loads it's completely pointless...

^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: [PATCH] fs, mm: account filp and names caches to kmemcg
  2017-10-10 22:21               ` Shakeel Butt
@ 2017-10-11  9:09                 ` Michal Hocko
  0 siblings, 0 replies; 52+ messages in thread
From: Michal Hocko @ 2017-10-11  9:09 UTC (permalink / raw)
  To: Shakeel Butt
  Cc: Greg Thelen, Alexander Viro, Vladimir Davydov, Andrew Morton,
	Linux MM, linux-fsdevel, LKML, Johannes Weiner

On Tue 10-10-17 15:21:53, Shakeel Butt wrote:
[...]
> On Tue, Oct 10, 2017 at 2:10 AM, Michal Hocko <mhocko@kernel.org> wrote:
> > On Mon 09-10-17 20:17:54, Michal Hocko wrote:
> >> the primary concern for this patch was whether we really need/want to
> >> charge short term objects which do not outlive a single syscall.
> >
> > Let me expand on this some more. What is the benefit of kmem accounting
> > of such an object? It cannot stop any runaway, as syscall-lifetime
> > allocations are bounded by the number of processes, which we kind of
> > contain by other means.
> 
> We could contain this by limiting the number of processes or threads,
> but for us applications having thousands of threads are very common.
> So, limiting the number of threads/processes will not work.

Well, the number of tasks is already contained in a way because we do
account each task (kernel) stack AFAIR.

> > If we do account then we put memory pressure due to
> > something that cannot be reclaimed by any means. Even the memcg OOM
> > killer would simply kick a single path while there might be others
> > to consume the same type of memory.
> >
> > So what is the actual point in accounting these? Does it help to contain
> > any workload better? What kind of workload?
> >
> 
> I think the benefits will be isolation and more accurate billing. As I
> have said before, we have observed 100s of MiBs in names_cache on many
> machines, and the cumulative amount is not something we can ignore as
> just memory overhead.

I do agree with Al that this is rather dubious and can add overhead
without a good reason.

> > Or am I completely wrong and name objects can outlive a syscall
> > considerably?
> >
> 
> No, I didn't find any instance of the name objects outliving the syscall.
> 
> Anyways, we can discuss more on names_cache, do you have any objection
> regarding charging filp?

I think filp makes more sense. But let's drop the names part for now. I
am not really convinced it is a good idea.

-- 
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: [PATCH] fs, mm: account filp and names caches to kmemcg
  2017-10-10 14:24               ` Michal Hocko
@ 2017-10-12 19:03                 ` Johannes Weiner
  2017-10-12 23:57                   ` Greg Thelen
  2017-10-13  6:35                   ` Michal Hocko
  0 siblings, 2 replies; 52+ messages in thread
From: Johannes Weiner @ 2017-10-12 19:03 UTC (permalink / raw)
  To: Michal Hocko
  Cc: Greg Thelen, Shakeel Butt, Alexander Viro, Vladimir Davydov,
	Andrew Morton, Linux MM, linux-fsdevel, LKML

On Tue, Oct 10, 2017 at 04:24:34PM +0200, Michal Hocko wrote:
> On Tue 10-10-17 10:17:33, Johannes Weiner wrote:
> > On Tue, Oct 10, 2017 at 11:14:30AM +0200, Michal Hocko wrote:
> > > On Mon 09-10-17 16:26:13, Johannes Weiner wrote:
> > > > It's consistent in the sense that only page faults enable the memcg
> > > > OOM killer. It's not the type of memory that decides, it's whether the
> > > > allocation context has a channel to communicate an error to userspace.
> > > > 
> > > > Whether userspace is able to handle -ENOMEM from syscalls was a voiced
> > > > concern at the time this patch was merged, although there haven't been
> > > > any reports so far,
> > > 
> > > Well, I remember reports about MAP_POPULATE breaking or at least having
> > > an unexpected behavior.
> > 
> > Hm, that slipped past me. Did we do something about these? Or did they
> > fix userspace?
> 
> Well, it was mostly LTP complaining. I tried to fix that, but Linus
> was against it, so we just documented that this is possible and that
> MAP_POPULATE is not a guarantee.

Okay, makes sense. I wouldn't really count that as a regression.

> > > Well, we should be able to do that with the oom_reaper. At least for v2
> > > which doesn't have synchronous userspace oom killing.
> > 
> > I don't see how the OOM reaper is a guarantee as long as we have this:
> > 
> > 	if (!down_read_trylock(&mm->mmap_sem)) {
> > 		ret = false;
> > 		trace_skip_task_reaping(tsk->pid);
> > 		goto unlock_oom;
> > 	}
> 
> And we will simply mark the victim MMF_OOM_SKIP and hide it from the oom
> killer if we fail to get the mmap_sem after several attempts. This will
> allow the oom killer to find a new victim, so we shouldn't deadlock.

It's less likely to deadlock, but not exactly deadlock-free. There
might not BE any other mm's holding significant amounts of memory.

> > What do you mean by 'v2'?
> 
> cgroup v2, because legacy memcg allowed a sync wait for the oom killer
> and that would be a bigger problem from deep callchains, for obvious
> reasons.

Actually, the async oom killing code isn't dependent on cgroup
version. cgroup1 doesn't wait inside the charge context, either.

> > > > > c) Overcharge kmem to oom memcg and queue an async memcg limit checker,
> > > > >    which will oom kill if needed.
> > > > 
> > > > This makes the most sense to me. Architecturally, I imagine this would
> > > > look like b), with an OOM handler at the point of return to userspace,
> > > > except that we'd overcharge instead of retrying the syscall.
> > > 
> > > I do not think we should break the hard limit semantic if possible. We
> > > can currently allow that for allocations which are very short term (oom
> > > victims) or too important to fail, but allowing that for kmem charges in
> > > general sounds too easy to run away with.
> > 
> > I'm not sure there is a convenient way out of this.
> > 
> > If we want to respect the hard limit AND guarantee allocation success,
> > the OOM killer has to free memory reliably - which it doesn't. But if
> > it did, we could also break the limit temporarily and have the OOM
> > killer replenish the pool before that userspace app can continue. The
> > allocation wouldn't have to be short-lived, since memory is fungible.
> 
> If we can guarantee the oom killer is started, then we can allow temporary
> access to reserves, which is already implemented even for memcg. The
> thing is, we do not invoke the oom killer...

You lost me here. Which reserves?

All I'm saying is that, when the syscall-context fails to charge, we
should do mem_cgroup_oom() to set up the async OOM killer, let the
charge succeed over the hard limit - since the OOM killer will most
likely get us back below the limit - then mem_cgroup_oom_synchronize()
before the syscall returns to userspace.

That would avoid returning -ENOMEM from syscalls without the risk of
the hard limit deadlocking - at the risk of sometimes overrunning the
hard limit, but that seems like the least problematic behavior out of
the three.

^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: [PATCH] fs, mm: account filp and names caches to kmemcg
  2017-10-12 19:03                 ` Johannes Weiner
@ 2017-10-12 23:57                   ` Greg Thelen
  2017-10-13  6:51                     ` Michal Hocko
  2017-10-13  6:35                   ` Michal Hocko
  1 sibling, 1 reply; 52+ messages in thread
From: Greg Thelen @ 2017-10-12 23:57 UTC (permalink / raw)
  To: Johannes Weiner, Michal Hocko
  Cc: Shakeel Butt, Alexander Viro, Vladimir Davydov, Andrew Morton,
	Linux MM, linux-fsdevel, LKML

Johannes Weiner <hannes@cmpxchg.org> wrote:

> On Tue, Oct 10, 2017 at 04:24:34PM +0200, Michal Hocko wrote:
>> On Tue 10-10-17 10:17:33, Johannes Weiner wrote:
>> > On Tue, Oct 10, 2017 at 11:14:30AM +0200, Michal Hocko wrote:
>> > > On Mon 09-10-17 16:26:13, Johannes Weiner wrote:
>> > > > It's consistent in the sense that only page faults enable the memcg
>> > > > OOM killer. It's not the type of memory that decides, it's whether the
>> > > > allocation context has a channel to communicate an error to userspace.
>> > > > 
>> > > > Whether userspace is able to handle -ENOMEM from syscalls was a voiced
>> > > > concern at the time this patch was merged, although there haven't been
>> > > > any reports so far,
>> > > 
>> > > Well, I remember reports about MAP_POPULATE breaking or at least having
>> > > an unexpected behavior.
>> > 
>> > Hm, that slipped past me. Did we do something about these? Or did they
>> > fix userspace?
>> 
>> Well, it was mostly LTP complaining. I tried to fix that, but Linus
>> was against it, so we just documented that this is possible and that
>> MAP_POPULATE is not a guarantee.
>
> Okay, makes sense. I wouldn't really count that as a regression.
>
>> > > Well, we should be able to do that with the oom_reaper. At least for v2
>> > > which doesn't have synchronous userspace oom killing.
>> > 
>> > I don't see how the OOM reaper is a guarantee as long as we have this:
>> > 
>> > 	if (!down_read_trylock(&mm->mmap_sem)) {
>> > 		ret = false;
>> > 		trace_skip_task_reaping(tsk->pid);
>> > 		goto unlock_oom;
>> > 	}
>> 
>> And we will simply mark the victim MMF_OOM_SKIP and hide it from the oom
>> killer if we fail to get the mmap_sem after several attempts. This will
>> allow the oom killer to find a new victim, so we shouldn't deadlock.
>
> It's less likely to deadlock, but not exactly deadlock-free. There
> might not BE any other mm's holding significant amounts of memory.
>
>> > What do you mean by 'v2'?
>> 
>> cgroup v2, because legacy memcg allowed a sync wait for the oom killer
>> and that would be a bigger problem from deep callchains, for obvious
>> reasons.
>
> Actually, the async oom killing code isn't dependent on cgroup
> version. cgroup1 doesn't wait inside the charge context, either.
>
>> > > > > c) Overcharge kmem to oom memcg and queue an async memcg limit checker,
>> > > > >    which will oom kill if needed.
>> > > > 
>> > > > This makes the most sense to me. Architecturally, I imagine this would
>> > > > look like b), with an OOM handler at the point of return to userspace,
>> > > > except that we'd overcharge instead of retrying the syscall.
>> > > 
>> > > I do not think we should break the hard limit semantic if possible. We
>> > > can currently allow that for allocations which are very short term (oom
>> > > victims) or too important to fail, but allowing that for kmem charges in
>> > > general sounds too easy to run away with.
>> > 
>> > I'm not sure there is a convenient way out of this.
>> > 
>> > If we want to respect the hard limit AND guarantee allocation success,
>> > the OOM killer has to free memory reliably - which it doesn't. But if
>> > it did, we could also break the limit temporarily and have the OOM
>> > killer replenish the pool before that userspace app can continue. The
>> > allocation wouldn't have to be short-lived, since memory is fungible.
>> 
>> If we can guarantee the oom killer is started, then we can allow temporary
>> access to reserves, which is already implemented even for memcg. The
>> thing is, we do not invoke the oom killer...
>
> You lost me here. Which reserves?
>
> All I'm saying is that, when the syscall-context fails to charge, we
> should do mem_cgroup_oom() to set up the async OOM killer, let the
> charge succeed over the hard limit - since the OOM killer will most
> likely get us back below the limit - then mem_cgroup_oom_synchronize()
> before the syscall returns to userspace.
>
> That would avoid returning -ENOMEM from syscalls without the risk of
> the hard limit deadlocking - at the risk of sometimes overrunning the
> hard limit, but that seems like the least problematic behavior out of
> the three.

Overcharging kmem with deferred reconciliation sounds good to me.

A few comments (not reasons to avoid this):

1) If a task is moved between memcg it seems possible to overcharge
   multiple oom memcg for different kmem/user allocations.
   mem_cgroup_oom_synchronize() would see at most one oom memcg in
   current->memcg_in_oom.  Thus it'd only reconcile a single memcg.  But
   that seems pretty rare and the next charge to any of the other memcg
   would reconcile them.

2) if a kernel thread charges kmem on behalf of a client mm then there
   is no good place to call mem_cgroup_oom_synchronize(), short of
   launching a work item in mem_cgroup_oom().  I don't think we have
   anything like that yet.  So nothing to worry about.

3) it's debatable if mem_cgroup_oom_synchronize() should first attempt
   reclaim before killing.  But that's a whole 'nother thread.

4) overcharging with deferred reconciliation could also be used for user
   pages.  But I haven't looked at the code long enough to know if this
   would be a net win.

^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: [PATCH] fs, mm: account filp and names caches to kmemcg
  2017-10-12 19:03                 ` Johannes Weiner
  2017-10-12 23:57                   ` Greg Thelen
@ 2017-10-13  6:35                   ` Michal Hocko
  2017-10-13  7:00                     ` Michal Hocko
  2017-10-24 15:45                     ` Johannes Weiner
  1 sibling, 2 replies; 52+ messages in thread
From: Michal Hocko @ 2017-10-13  6:35 UTC (permalink / raw)
  To: Johannes Weiner
  Cc: Greg Thelen, Shakeel Butt, Alexander Viro, Vladimir Davydov,
	Andrew Morton, Linux MM, linux-fsdevel, LKML

On Thu 12-10-17 15:03:12, Johannes Weiner wrote:
> On Tue, Oct 10, 2017 at 04:24:34PM +0200, Michal Hocko wrote:
[...]
> > And we will simply mark the victim MMF_OOM_SKIP and hide it from the oom
> > killer if we fail to get the mmap_sem after several attempts. This will
> > allow the oom killer to find a new victim, so we shouldn't deadlock.
> 
> It's less likely to deadlock, but not exactly deadlock-free. There
> might not BE any other mm's holding significant amounts of memory.

True, try_charge would have to return failure when out_of_memory
returns false, of course.

> > > What do you mean by 'v2'?
> > 
> > cgroup v2 because the legacy memcg allowed sync wait for the oom killer
> > and that would be a bigger problem from a deep callchains for obevious
> > reasons.
> 
> Actually, the async oom killing code isn't dependent on cgroup
> version. cgroup1 doesn't wait inside the charge context, either.

Sorry, I was just not clear. What I meant to say is that we couldn't
make v1 wait inside the try_charge path, because async oom killing
wouldn't help for the oom-disabled case (aka user space oom handling).

> > > > > > c) Overcharge kmem to oom memcg and queue an async memcg limit checker,
> > > > > >    which will oom kill if needed.
> > > > > 
> > > > > This makes the most sense to me. Architecturally, I imagine this would
> > > > > look like b), with an OOM handler at the point of return to userspace,
> > > > > except that we'd overcharge instead of retrying the syscall.
> > > > 
> > > > I do not think we should break the hard limit semantic if possible. We
> > > > can currently allow that for allocations which are very short term (oom
> > > > victims) or too important to fail, but allowing that for kmem charges in
> > > > general sounds too easy to run away with.
> > > 
> > > I'm not sure there is a convenient way out of this.
> > > 
> > > If we want to respect the hard limit AND guarantee allocation success,
> > > the OOM killer has to free memory reliably - which it doesn't. But if
> > > it did, we could also break the limit temporarily and have the OOM
> > > killer replenish the pool before that userspace app can continue. The
> > > allocation wouldn't have to be short-lived, since memory is fungible.
> > 
> > If we can guarantee the oom killer is started, then we can allow temporary
> > access to reserves, which is already implemented even for memcg. The
> > thing is, we do not invoke the oom killer...
> 
> You lost me here. Which reserves?
> 
> All I'm saying is that, when the syscall-context fails to charge, we
> should do mem_cgroup_oom() to set up the async OOM killer, let the
> charge succeed over the hard limit - since the OOM killer will most
> likely get us back below the limit - then mem_cgroup_oom_synchronize()
> before the syscall returns to userspace.

OK, then we are on the same page now. Your initial wording didn't
mention the async OOM killer; this makes more sense. Although I would
argue that we can retry the charge as long as out_of_memory finds a
victim. This would return ENOMEM only in the pathological cases where
no victim can be found.

-- 
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: [PATCH] fs, mm: account filp and names caches to kmemcg
  2017-10-12 23:57                   ` Greg Thelen
@ 2017-10-13  6:51                     ` Michal Hocko
  0 siblings, 0 replies; 52+ messages in thread
From: Michal Hocko @ 2017-10-13  6:51 UTC (permalink / raw)
  To: Greg Thelen
  Cc: Johannes Weiner, Shakeel Butt, Alexander Viro, Vladimir Davydov,
	Andrew Morton, Linux MM, linux-fsdevel, LKML

On Thu 12-10-17 16:57:22, Greg Thelen wrote:
[...]
> Overcharging kmem with deferred reconciliation sounds good to me.
> 
> A few comments (not reasons to avoid this):
> 
> 1) If a task is moved between memcg it seems possible to overcharge
>    multiple oom memcg for different kmem/user allocations.
>    mem_cgroup_oom_synchronize() would see at most one oom memcg in
>    current->memcg_in_oom.  Thus it'd only reconcile a single memcg.  But
>    that seems pretty rare and the next charge to any of the other memcg
>    would reconcile them.

This is a general problem for the cgroup v2 memcg oom handling. 

> 2) if a kernel thread charges kmem on behalf of a client mm then there
>    is no good place to call mem_cgroup_oom_synchronize(), short of
>    launching a work item in mem_cgroup_oom().  I don't think we have
>    anything like that yet.  So nothing to worry about.

If we do invoke the OOM killer from the charge path, it shouldn't be a
problem.

> 3) it's debatable if mem_cgroup_oom_synchronize() should first attempt
>    reclaim before killing.  But that's a whole 'nother thread.

Again, this shouldn't be an issue if we invoke the oom killer from the
charge path.

> 4) overcharging with deferred reconciliation could also be used for user
>    pages.  But I haven't looked at the code long enough to know if this
>    would be a net win.

It would solve gup (get_user_pages) callers failing with ENOMEM
unexpectedly just because of a memcg charge failure.
-- 
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: [PATCH] fs, mm: account filp and names caches to kmemcg
  2017-10-13  6:35                   ` Michal Hocko
@ 2017-10-13  7:00                     ` Michal Hocko
  2017-10-13 15:24                       ` Michal Hocko
  2017-10-24 15:45                     ` Johannes Weiner
  1 sibling, 1 reply; 52+ messages in thread
From: Michal Hocko @ 2017-10-13  7:00 UTC (permalink / raw)
  To: Johannes Weiner
  Cc: Greg Thelen, Shakeel Butt, Alexander Viro, Vladimir Davydov,
	Andrew Morton, Linux MM, linux-fsdevel, LKML

Just to be explicit about what I have in mind. This hasn't even been
compile tested, but it should provide at least an idea of where I am
trying to go...
---
diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index d5f3a62887cf..91fa05372114 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -1528,26 +1528,36 @@ static void memcg_oom_recover(struct mem_cgroup *memcg)
 
 static void mem_cgroup_oom(struct mem_cgroup *memcg, gfp_t mask, int order)
 {
-	if (!current->memcg_may_oom)
-		return;
 	/*
 	 * We are in the middle of the charge context here, so we
 	 * don't want to block when potentially sitting on a callstack
 	 * that holds all kinds of filesystem and mm locks.
 	 *
-	 * Also, the caller may handle a failed allocation gracefully
-	 * (like optional page cache readahead) and so an OOM killer
-	 * invocation might not even be necessary.
+	 * cgroup v1 allows sync user space handling so we cannot afford
+	 * to get stuck here for that configuration. That's why we don't do
+	 * anything here except remember the OOM context and then deal with
+	 * it at the end of the page fault when the stack is unwound, the
+	 * locks are released, and when we know whether the fault was overall
+	 * successful.
 	 *
-	 * That's why we don't do anything here except remember the
-	 * OOM context and then deal with it at the end of the page
-	 * fault when the stack is unwound, the locks are released,
-	 * and when we know whether the fault was overall successful.
+	 * On the other hand, in-kernel OOM killer allows for an async victim
+	 * memory reclaim (oom_reaper) and that means that we are not solely
+	 * relying on the oom victim to make a forward progress so we can stay
+	 * in the try_charge context and keep retrying as long as there
+	 * are oom victims to select.
 	 */
-	css_get(&memcg->css);
-	current->memcg_in_oom = memcg;
-	current->memcg_oom_gfp_mask = mask;
-	current->memcg_oom_order = order;
+	if (memcg->oom_kill_disable) {
+		if (!current->memcg_may_oom)
+			return false;
+		css_get(&memcg->css);
+		current->memcg_in_oom = memcg;
+		current->memcg_oom_gfp_mask = mask;
+		current->memcg_oom_order = order;
+
+		return false;
+	}
+
+	return mem_cgroup_out_of_memory(memcg, mask, order);
 }
 
 /**
@@ -2007,8 +2017,11 @@ static int try_charge(struct mem_cgroup *memcg, gfp_t gfp_mask,
 
 	mem_cgroup_event(mem_over_limit, MEMCG_OOM);
 
-	mem_cgroup_oom(mem_over_limit, gfp_mask,
-		       get_order(nr_pages * PAGE_SIZE));
+	if (mem_cgroup_oom(mem_over_limit, gfp_mask,
+		       get_order(nr_pages * PAGE_SIZE))) {
+		nr_retries = MEM_CGROUP_RECLAIM_RETRIES;
+		goto retry;
+	}
 nomem:
 	if (!(gfp_mask & __GFP_NOFAIL))
 		return -ENOMEM;
-- 
Michal Hocko
SUSE Labs

^ permalink raw reply related	[flat|nested] 52+ messages in thread

* Re: [PATCH] fs, mm: account filp and names caches to kmemcg
  2017-10-13  7:00                     ` Michal Hocko
@ 2017-10-13 15:24                       ` Michal Hocko
  2017-10-24 12:18                         ` Michal Hocko
  2017-10-24 16:06                         ` Johannes Weiner
  0 siblings, 2 replies; 52+ messages in thread
From: Michal Hocko @ 2017-10-13 15:24 UTC (permalink / raw)
  To: Johannes Weiner
  Cc: Greg Thelen, Shakeel Butt, Alexander Viro, Vladimir Davydov,
	Andrew Morton, Linux MM, linux-fsdevel, LKML

Well, it actually occurred to me that this would trigger the global oom
killer in case no memcg-specific victim can be found, which is
definitely not something we would like to do. This should work better.
I am not sure we can trigger this corner case, but we should cover it,
and it actually doesn't make the code much worse.
---
diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index d5f3a62887cf..7b370f070b82 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -1528,26 +1528,40 @@ static void memcg_oom_recover(struct mem_cgroup *memcg)
 
 static void mem_cgroup_oom(struct mem_cgroup *memcg, gfp_t mask, int order)
 {
-	if (!current->memcg_may_oom)
-		return;
 	/*
 	 * We are in the middle of the charge context here, so we
 	 * don't want to block when potentially sitting on a callstack
 	 * that holds all kinds of filesystem and mm locks.
 	 *
-	 * Also, the caller may handle a failed allocation gracefully
-	 * (like optional page cache readahead) and so an OOM killer
-	 * invocation might not even be necessary.
+	 * cgroup v1 allows sync user space handling so we cannot afford
+	 * to get stuck here for that configuration. That's why we don't do
+	 * anything here except remember the OOM context and then deal with
+	 * it at the end of the page fault when the stack is unwound, the
+	 * locks are released, and when we know whether the fault was overall
+	 * successful.
+	 *
+	 * On the other hand, the in-kernel OOM killer allows for async victim
+	 * memory reclaim (oom_reaper) and that means that we are not solely
+	 * relying on the oom victim to make forward progress so we can stay
+	 * in the try_charge context and keep retrying as long as there
+	 * are oom victims to select.
 	 *
-	 * That's why we don't do anything here except remember the
-	 * OOM context and then deal with it at the end of the page
-	 * fault when the stack is unwound, the locks are released,
-	 * and when we know whether the fault was overall successful.
+	 * Please note that mem_cgroup_out_of_memory might fail to find a victim
+	 * and then we have to rely on mem_cgroup_oom_synchronize; otherwise we
+	 * would fall back to the global oom killer in pagefault_out_of_memory.
 	 */
+	if (!memcg->oom_kill_disable &&
+			mem_cgroup_out_of_memory(memcg, mask, order))
+		return true;
+
+	if (!current->memcg_may_oom)
+		return false;
 	css_get(&memcg->css);
 	current->memcg_in_oom = memcg;
 	current->memcg_oom_gfp_mask = mask;
 	current->memcg_oom_order = order;
+
+	return false;
 }
 
 /**
@@ -2007,8 +2021,11 @@ static int try_charge(struct mem_cgroup *memcg, gfp_t gfp_mask,
 
 	mem_cgroup_event(mem_over_limit, MEMCG_OOM);
 
-	mem_cgroup_oom(mem_over_limit, gfp_mask,
-		       get_order(nr_pages * PAGE_SIZE));
+	if (mem_cgroup_oom(mem_over_limit, gfp_mask,
+		       get_order(nr_pages * PAGE_SIZE))) {
+		nr_retries = MEM_CGROUP_RECLAIM_RETRIES;
+		goto retry;
+	}
 nomem:
 	if (!(gfp_mask & __GFP_NOFAIL))
 		return -ENOMEM;
-- 
Michal Hocko
SUSE Labs

^ permalink raw reply related	[flat|nested] 52+ messages in thread

* Re: [PATCH] fs, mm: account filp and names caches to kmemcg
  2017-10-13 15:24                       ` Michal Hocko
@ 2017-10-24 12:18                         ` Michal Hocko
  2017-10-24 17:54                           ` Johannes Weiner
  2017-10-24 16:06                         ` Johannes Weiner
  1 sibling, 1 reply; 52+ messages in thread
From: Michal Hocko @ 2017-10-24 12:18 UTC (permalink / raw)
  To: Greg Thelen, Johannes Weiner
  Cc: Shakeel Butt, Alexander Viro, Vladimir Davydov, Andrew Morton,
	Linux MM, linux-fsdevel, LKML

Does this sound like something you would be interested in? I can spend
some more time on it if it is worthwhile.

On Fri 13-10-17 17:24:21, Michal Hocko wrote:
> Well, it actually occurred to me that this would trigger the global oom
> killer in case no memcg specific victim can be found which is definitely
> not something we would like to do. This should work better. I am not
> sure we can trigger this corner case but we should cover it and it
> actually doesn't make the code much worse.
> ---
> diff --git a/mm/memcontrol.c b/mm/memcontrol.c
> index d5f3a62887cf..7b370f070b82 100644
> --- a/mm/memcontrol.c
> +++ b/mm/memcontrol.c
> @@ -1528,26 +1528,40 @@ static void memcg_oom_recover(struct mem_cgroup *memcg)
>  
> -static void mem_cgroup_oom(struct mem_cgroup *memcg, gfp_t mask, int order)
> +static bool mem_cgroup_oom(struct mem_cgroup *memcg, gfp_t mask, int order)
>  {
> -	if (!current->memcg_may_oom)
> -		return;
>  	/*
>  	 * We are in the middle of the charge context here, so we
>  	 * don't want to block when potentially sitting on a callstack
>  	 * that holds all kinds of filesystem and mm locks.
>  	 *
> -	 * Also, the caller may handle a failed allocation gracefully
> -	 * (like optional page cache readahead) and so an OOM killer
> -	 * invocation might not even be necessary.
> +	 * cgroup v1 allows sync user space handling so we cannot afford
> +	 * to get stuck here for that configuration. That's why we don't do
> +	 * anything here except remember the OOM context and then deal with
> +	 * it at the end of the page fault when the stack is unwound, the
> +	 * locks are released, and when we know whether the fault was overall
> +	 * successful.
> +	 *
> +	 * On the other hand, the in-kernel OOM killer allows for async victim
> +	 * memory reclaim (oom_reaper) and that means that we are not solely
> +	 * relying on the oom victim to make forward progress so we can stay
> +	 * in the try_charge context and keep retrying as long as there
> +	 * are oom victims to select.
>  	 *
> -	 * That's why we don't do anything here except remember the
> -	 * OOM context and then deal with it at the end of the page
> -	 * fault when the stack is unwound, the locks are released,
> -	 * and when we know whether the fault was overall successful.
> +	 * Please note that mem_cgroup_out_of_memory might fail to find a victim
> +	 * and then we have to rely on mem_cgroup_oom_synchronize; otherwise we
> +	 * would fall back to the global oom killer in pagefault_out_of_memory.
>  	 */
> +	if (!memcg->oom_kill_disable &&
> +			mem_cgroup_out_of_memory(memcg, mask, order))
> +		return true;
> +
> +	if (!current->memcg_may_oom)
> +		return false;
>  	css_get(&memcg->css);
>  	current->memcg_in_oom = memcg;
>  	current->memcg_oom_gfp_mask = mask;
>  	current->memcg_oom_order = order;
> +
> +	return false;
>  }
>  
>  /**
> @@ -2007,8 +2021,11 @@ static int try_charge(struct mem_cgroup *memcg, gfp_t gfp_mask,
>  
>  	mem_cgroup_event(mem_over_limit, MEMCG_OOM);
>  
> -	mem_cgroup_oom(mem_over_limit, gfp_mask,
> -		       get_order(nr_pages * PAGE_SIZE));
> +	if (mem_cgroup_oom(mem_over_limit, gfp_mask,
> +		       get_order(nr_pages * PAGE_SIZE))) {
> +		nr_retries = MEM_CGROUP_RECLAIM_RETRIES;
> +		goto retry;
> +	}
>  nomem:
>  	if (!(gfp_mask & __GFP_NOFAIL))
>  		return -ENOMEM;
> -- 
> Michal Hocko
> SUSE Labs

-- 
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: [PATCH] fs, mm: account filp and names caches to kmemcg
  2017-10-13  6:35                   ` Michal Hocko
  2017-10-13  7:00                     ` Michal Hocko
@ 2017-10-24 15:45                     ` Johannes Weiner
  2017-10-24 16:30                       ` Michal Hocko
  1 sibling, 1 reply; 52+ messages in thread
From: Johannes Weiner @ 2017-10-24 15:45 UTC (permalink / raw)
  To: Michal Hocko
  Cc: Greg Thelen, Shakeel Butt, Alexander Viro, Vladimir Davydov,
	Andrew Morton, Linux MM, linux-fsdevel, LKML

On Fri, Oct 13, 2017 at 08:35:55AM +0200, Michal Hocko wrote:
> On Thu 12-10-17 15:03:12, Johannes Weiner wrote:
> > All I'm saying is that, when the syscall-context fails to charge, we
> > should do mem_cgroup_oom() to set up the async OOM killer, let the
> > charge succeed over the hard limit - since the OOM killer will most
> > likely get us back below the limit - then mem_cgroup_oom_synchronize()
> > before the syscall returns to userspace.
> 
> OK, then we are on the same page now. Your initial wording didn't
> mention async OOM killer. This makes more sense. Although I would argue
> that we can retry the charge as long as out_of_memory finds a victim.
> This would return ENOMEM only in the pathological cases where no victims
> could be found.

I think that's much worse because it's even harder to test and verify
your applications against.

If syscalls can return -ENOMEM on OOM, they should do so reliably.

^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: [PATCH] fs, mm: account filp and names caches to kmemcg
  2017-10-13 15:24                       ` Michal Hocko
  2017-10-24 12:18                         ` Michal Hocko
@ 2017-10-24 16:06                         ` Johannes Weiner
  2017-10-24 16:22                           ` Michal Hocko
  1 sibling, 1 reply; 52+ messages in thread
From: Johannes Weiner @ 2017-10-24 16:06 UTC (permalink / raw)
  To: Michal Hocko
  Cc: Greg Thelen, Shakeel Butt, Alexander Viro, Vladimir Davydov,
	Andrew Morton, Linux MM, linux-fsdevel, LKML

On Fri, Oct 13, 2017 at 05:24:21PM +0200, Michal Hocko wrote:
> Well, it actually occurred to me that this would trigger the global oom
> killer in case no memcg specific victim can be found which is definitely
> not something we would like to do. This should work better. I am not
> sure we can trigger this corner case but we should cover it and it
> actually doesn't make the code much worse.
> ---
> diff --git a/mm/memcontrol.c b/mm/memcontrol.c
> index d5f3a62887cf..7b370f070b82 100644
> --- a/mm/memcontrol.c
> +++ b/mm/memcontrol.c
> @@ -1528,26 +1528,40 @@ static void memcg_oom_recover(struct mem_cgroup *memcg)
>  
> -static void mem_cgroup_oom(struct mem_cgroup *memcg, gfp_t mask, int order)
> +static bool mem_cgroup_oom(struct mem_cgroup *memcg, gfp_t mask, int order)
>  {
> -	if (!current->memcg_may_oom)
> -		return;
>  	/*
>  	 * We are in the middle of the charge context here, so we
>  	 * don't want to block when potentially sitting on a callstack
>  	 * that holds all kinds of filesystem and mm locks.
>  	 *
> -	 * Also, the caller may handle a failed allocation gracefully
> -	 * (like optional page cache readahead) and so an OOM killer
> -	 * invocation might not even be necessary.
> +	 * cgroup v1 allows sync user space handling so we cannot afford
> +	 * to get stuck here for that configuration. That's why we don't do
> +	 * anything here except remember the OOM context and then deal with
> +	 * it at the end of the page fault when the stack is unwound, the
> +	 * locks are released, and when we know whether the fault was overall
> +	 * successful.

How about

"cgroup1 allows disabling the OOM killer and waiting for outside
handling until the charge can succeed; remember the context and put
the task to sleep at the end of the page fault when all locks are
released."

and then follow it directly with the branch that handles this:

	if (memcg->oom_kill_disable) {
		css_get(&memcg->css);
		current->memcg_in_oom = memcg;
		...
		return false;
	}

	return mem_cgroup_out_of_memory(memcg, mask, order);	

> +	 * On the other hand, the in-kernel OOM killer allows for async victim
> +	 * memory reclaim (oom_reaper) and that means that we are not solely
> +	 * relying on the oom victim to make forward progress so we can stay
> +	 * in the try_charge context and keep retrying as long as there
> +	 * are oom victims to select.

I would put that part into try_charge, where that decision is made.

>  	 *
> -	 * That's why we don't do anything here except remember the
> -	 * OOM context and then deal with it at the end of the page
> -	 * fault when the stack is unwound, the locks are released,
> -	 * and when we know whether the fault was overall successful.
> +	 * Please note that mem_cgroup_out_of_memory might fail to find a victim
> +	 * and then we have to rely on mem_cgroup_oom_synchronize; otherwise we
> +	 * would fall back to the global oom killer in pagefault_out_of_memory.

Ah, that's why... Ugh, that's really duct-tapey.

>  	 */
> +	if (!memcg->oom_kill_disable &&
> +			mem_cgroup_out_of_memory(memcg, mask, order))
> +		return true;
> +
> +	if (!current->memcg_may_oom)
> +		return false;
>  	css_get(&memcg->css);
>  	current->memcg_in_oom = memcg;
>  	current->memcg_oom_gfp_mask = mask;
>  	current->memcg_oom_order = order;
> +
> +	return false;
>  }
>  
>  /**
> @@ -2007,8 +2021,11 @@ static int try_charge(struct mem_cgroup *memcg, gfp_t gfp_mask,
>  
>  	mem_cgroup_event(mem_over_limit, MEMCG_OOM);
>  
> -	mem_cgroup_oom(mem_over_limit, gfp_mask,
> -		       get_order(nr_pages * PAGE_SIZE));
> +	if (mem_cgroup_oom(mem_over_limit, gfp_mask,
> +		       get_order(nr_pages * PAGE_SIZE))) {
> +		nr_retries = MEM_CGROUP_RECLAIM_RETRIES;
> +		goto retry;
> +	}

As per the previous email, this has to goto force, otherwise we return
-ENOMEM from syscalls once in a blue moon, which makes verification an
absolute nightmare. The behavior should be reliable, without weird p99
corner cases.

I think what we should be doing here is: if a charge fails, set up an
oom context and force the charge; add mem_cgroup_oom_synchronize() to
the end of syscalls and kernel-context faults.

^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: [PATCH] fs, mm: account filp and names caches to kmemcg
  2017-10-24 16:06                         ` Johannes Weiner
@ 2017-10-24 16:22                           ` Michal Hocko
  2017-10-24 17:23                             ` Johannes Weiner
  0 siblings, 1 reply; 52+ messages in thread
From: Michal Hocko @ 2017-10-24 16:22 UTC (permalink / raw)
  To: Johannes Weiner
  Cc: Greg Thelen, Shakeel Butt, Alexander Viro, Vladimir Davydov,
	Andrew Morton, Linux MM, linux-fsdevel, LKML

On Tue 24-10-17 12:06:37, Johannes Weiner wrote:
> On Fri, Oct 13, 2017 at 05:24:21PM +0200, Michal Hocko wrote:
> > Well, it actually occurred to me that this would trigger the global oom
> > killer in case no memcg specific victim can be found which is definitely
> > not something we would like to do. This should work better. I am not
> > sure we can trigger this corner case but we should cover it and it
> > actually doesn't make the code much worse.
> > ---
> > diff --git a/mm/memcontrol.c b/mm/memcontrol.c
> > index d5f3a62887cf..7b370f070b82 100644
> > --- a/mm/memcontrol.c
> > +++ b/mm/memcontrol.c
> > @@ -1528,26 +1528,40 @@ static void memcg_oom_recover(struct mem_cgroup *memcg)
> >  
> > -static void mem_cgroup_oom(struct mem_cgroup *memcg, gfp_t mask, int order)
> > +static bool mem_cgroup_oom(struct mem_cgroup *memcg, gfp_t mask, int order)
> >  {
> > -	if (!current->memcg_may_oom)
> > -		return;
> >  	/*
> >  	 * We are in the middle of the charge context here, so we
> >  	 * don't want to block when potentially sitting on a callstack
> >  	 * that holds all kinds of filesystem and mm locks.
> >  	 *
> > -	 * Also, the caller may handle a failed allocation gracefully
> > -	 * (like optional page cache readahead) and so an OOM killer
> > -	 * invocation might not even be necessary.
> > +	 * cgroup v1 allows sync user space handling so we cannot afford
> > +	 * to get stuck here for that configuration. That's why we don't do
> > +	 * anything here except remember the OOM context and then deal with
> > +	 * it at the end of the page fault when the stack is unwound, the
> > +	 * locks are released, and when we know whether the fault was overall
> > +	 * successful.
> 
> How about
> 
> "cgroup1 allows disabling the OOM killer and waiting for outside
> handling until the charge can succeed; remember the context and put
> the task to sleep at the end of the page fault when all locks are
> released."

OK

> and then follow it directly with the branch that handles this:
> 
> 	if (memcg->oom_kill_disable) {
> 		css_get(&memcg->css);
> 		current->memcg_in_oom = memcg;
> 		...
> 		return false;
> 	}
> 
> 	return mem_cgroup_out_of_memory(memcg, mask, order);	
> 
> > +	 * On the other hand, the in-kernel OOM killer allows for async victim
> > +	 * memory reclaim (oom_reaper) and that means that we are not solely
> > +	 * relying on the oom victim to make forward progress so we can stay
> > +	 * in the try_charge context and keep retrying as long as there
> > +	 * are oom victims to select.
> 
> I would put that part into try_charge, where that decision is made.

OK

> >  	 *
> > -	 * That's why we don't do anything here except remember the
> > -	 * OOM context and then deal with it at the end of the page
> > -	 * fault when the stack is unwound, the locks are released,
> > -	 * and when we know whether the fault was overall successful.
> > +	 * Please note that mem_cgroup_out_of_memory might fail to find a victim
> > +	 * and then we have to rely on mem_cgroup_oom_synchronize; otherwise we
> > +	 * would fall back to the global oom killer in pagefault_out_of_memory.
> 
> Ah, that's why... Ugh, that's really duct-tapey.

As you know, I really hate the #PF OOM path. We should get rid of it.
 
> >  	 */
> > +	if (!memcg->oom_kill_disable &&
> > +			mem_cgroup_out_of_memory(memcg, mask, order))
> > +		return true;
> > +
> > +	if (!current->memcg_may_oom)
> > +		return false;
> >  	css_get(&memcg->css);
> >  	current->memcg_in_oom = memcg;
> >  	current->memcg_oom_gfp_mask = mask;
> >  	current->memcg_oom_order = order;
> > +
> > +	return false;
> >  }
> >  
> >  /**
> > @@ -2007,8 +2021,11 @@ static int try_charge(struct mem_cgroup *memcg, gfp_t gfp_mask,
> >  
> >  	mem_cgroup_event(mem_over_limit, MEMCG_OOM);
> >  
> > -	mem_cgroup_oom(mem_over_limit, gfp_mask,
> > -		       get_order(nr_pages * PAGE_SIZE));
> > +	if (mem_cgroup_oom(mem_over_limit, gfp_mask,
> > +		       get_order(nr_pages * PAGE_SIZE))) {
> > +		nr_retries = MEM_CGROUP_RECLAIM_RETRIES;
> > +		goto retry;
> > +	}
> 
> As per the previous email, this has to goto force, otherwise we return
> -ENOMEM from syscalls once in a blue moon, which makes verification an
> absolute nightmare. The behavior should be reliable, without weird p99
> corner cases.
>
> I think what we should be doing here is: if a charge fails, set up an
> oom context and force the charge; add mem_cgroup_oom_synchronize() to
> the end of syscalls and kernel-context faults.

What would prevent a runaway in case the only process in the memcg is
oom unkillable then?
-- 
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: [PATCH] fs, mm: account filp and names caches to kmemcg
  2017-10-24 15:45                     ` Johannes Weiner
@ 2017-10-24 16:30                       ` Michal Hocko
  0 siblings, 0 replies; 52+ messages in thread
From: Michal Hocko @ 2017-10-24 16:30 UTC (permalink / raw)
  To: Johannes Weiner
  Cc: Greg Thelen, Shakeel Butt, Alexander Viro, Vladimir Davydov,
	Andrew Morton, Linux MM, linux-fsdevel, LKML

On Tue 24-10-17 11:45:11, Johannes Weiner wrote:
> On Fri, Oct 13, 2017 at 08:35:55AM +0200, Michal Hocko wrote:
> > On Thu 12-10-17 15:03:12, Johannes Weiner wrote:
> > > All I'm saying is that, when the syscall-context fails to charge, we
> > > should do mem_cgroup_oom() to set up the async OOM killer, let the
> > > charge succeed over the hard limit - since the OOM killer will most
> > > likely get us back below the limit - then mem_cgroup_oom_synchronize()
> > > before the syscall returns to userspace.
> > 
> > OK, then we are on the same page now. Your initial wording didn't
> > mention async OOM killer. This makes more sense. Although I would argue
> > that we can retry the charge as long as out_of_memory finds a victim.
> > This would return ENOMEM only in the pathological cases where no victims
> > could be found.
> 
> I think that's much worse because it's even harder to test and verify
> your applications against.

Well, the main distinction from the global OOM killer is that it panics
when there is no eligible oom victim, which we cannot do in the memcg
context. So we have to bail out somehow, and I would be really careful
about allowing a runaway past the hard limit just because we are out of
eligible tasks. Returning ENOMEM sounds like a safer option to me.

> If syscalls can return -ENOMEM on OOM, they should do so reliably.

The main problem is that we do not know which syscalls can return ENOMEM.

-- 
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: [PATCH] fs, mm: account filp and names caches to kmemcg
  2017-10-24 16:22                           ` Michal Hocko
@ 2017-10-24 17:23                             ` Johannes Weiner
  2017-10-24 17:55                               ` Michal Hocko
  0 siblings, 1 reply; 52+ messages in thread
From: Johannes Weiner @ 2017-10-24 17:23 UTC (permalink / raw)
  To: Michal Hocko
  Cc: Greg Thelen, Shakeel Butt, Alexander Viro, Vladimir Davydov,
	Andrew Morton, Linux MM, linux-fsdevel, LKML

On Tue, Oct 24, 2017 at 06:22:13PM +0200, Michal Hocko wrote:
> On Tue 24-10-17 12:06:37, Johannes Weiner wrote:
> > >  	 *
> > > -	 * That's why we don't do anything here except remember the
> > > -	 * OOM context and then deal with it at the end of the page
> > > -	 * fault when the stack is unwound, the locks are released,
> > > -	 * and when we know whether the fault was overall successful.
> > > +	 * Please note that mem_cgroup_out_of_memory might fail to find a victim
> > > +	 * and then we have to rely on mem_cgroup_oom_synchronize; otherwise we
> > > +	 * would fall back to the global oom killer in pagefault_out_of_memory.
> > 
> > Ah, that's why... Ugh, that's really duct-tapey.
> 
> As you know, I really hate the #PF OOM path. We should get rid of it.

I agree, but this isn't getting rid of it; it just adds more layers.

> > > @@ -2007,8 +2021,11 @@ static int try_charge(struct mem_cgroup *memcg, gfp_t gfp_mask,
> > >  
> > >  	mem_cgroup_event(mem_over_limit, MEMCG_OOM);
> > >  
> > > -	mem_cgroup_oom(mem_over_limit, gfp_mask,
> > > -		       get_order(nr_pages * PAGE_SIZE));
> > > +	if (mem_cgroup_oom(mem_over_limit, gfp_mask,
> > > +		       get_order(nr_pages * PAGE_SIZE))) {
> > > +		nr_retries = MEM_CGROUP_RECLAIM_RETRIES;
> > > +		goto retry;
> > > +	}
> > 
> > As per the previous email, this has to goto force, otherwise we return
> > -ENOMEM from syscalls once in a blue moon, which makes verification an
> > absolute nightmare. The behavior should be reliable, without weird p99
> > corner cases.
> >
> > I think what we should be doing here is: if a charge fails, set up an
> > oom context and force the charge; add mem_cgroup_oom_synchronize() to
> > the end of syscalls and kernel-context faults.
> 
> What would prevent a runaway in case the only process in the memcg is
> oom unkillable then?

In such a scenario, the page fault handler would busy-loop right now.

Disabling oom kills is a privileged operation with dire consequences
if used incorrectly. You can panic the kernel with it. Why should the
cgroup OOM killer implement protective semantics around this setting?
Breaching the limit in such a setup is entirely acceptable.

Really, I think it's an enormous mistake to start modeling semantics
based on the most contrived and non-sensical edge case configurations.
Start the discussion with what is sane and what most users should
optimally experience, and keep the cornercases simple.

^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: [PATCH] fs, mm: account filp and names caches to kmemcg
  2017-10-24 12:18                         ` Michal Hocko
@ 2017-10-24 17:54                           ` Johannes Weiner
  0 siblings, 0 replies; 52+ messages in thread
From: Johannes Weiner @ 2017-10-24 17:54 UTC (permalink / raw)
  To: Michal Hocko
  Cc: Greg Thelen, Shakeel Butt, Alexander Viro, Vladimir Davydov,
	Andrew Morton, Linux MM, linux-fsdevel, LKML

On Tue, Oct 24, 2017 at 02:18:59PM +0200, Michal Hocko wrote:
> Does this sound like something you would be interested in? I can spend
> some more time on it if it is worthwhile.

Before you invest too much time in this, I think the rationale for
changing the current behavior so far is very weak. The ideas that have
been floated around in this thread barely cross into nice-to-have
territory, and as a result the acceptable additional complexity to
implement them is very low as well.

Making the OOM behavior less consistent, or introducing very rare
problem behavior (e.g. merely reducing the probability of syscalls
returning -ENOMEM instead of fully eliminating it, re-adding avenues
for deadlocks, no matter how rare, etc.) is a non-starter.

^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: [PATCH] fs, mm: account filp and names caches to kmemcg
  2017-10-24 17:23                             ` Johannes Weiner
@ 2017-10-24 17:55                               ` Michal Hocko
  2017-10-24 18:58                                 ` Johannes Weiner
  0 siblings, 1 reply; 52+ messages in thread
From: Michal Hocko @ 2017-10-24 17:55 UTC (permalink / raw)
  To: Johannes Weiner
  Cc: Greg Thelen, Shakeel Butt, Alexander Viro, Vladimir Davydov,
	Andrew Morton, Linux MM, linux-fsdevel, LKML

On Tue 24-10-17 13:23:30, Johannes Weiner wrote:
> On Tue, Oct 24, 2017 at 06:22:13PM +0200, Michal Hocko wrote:
[...]
> > What would prevent a runaway in case the only process in the memcg is
> > oom unkillable then?
> 
> In such a scenario, the page fault handler would busy-loop right now.
> 
> Disabling oom kills is a privileged operation with dire consequences
> if used incorrectly. You can panic the kernel with it. Why should the
> cgroup OOM killer implement protective semantics around this setting?
> Breaching the limit in such a setup is entirely acceptable.
> 
> Really, I think it's an enormous mistake to start modeling semantics
> based on the most contrived and non-sensical edge case configurations.
> Start the discussion with what is sane and what most users should
> optimally experience, and keep the cornercases simple.

I am not really seeing your concern about the semantics. The most
important property of the hard limit is to protect from runaways and
stop them if they happen. Users can use the softer variant (the high
limit) if they are not afraid of those scenarios. It is not so insane to
imagine that a master task (which I can easily imagine would be oom
disabled) has a leak and runs away as a result. We are not talking only
about the page fault path. There are other allocation paths that can
consume a lot of memory, spill over, and break the isolation
restriction. So it makes much more sense to me to fail the allocation in
such a situation rather than allow the runaway to continue. Just consider that such a
situation shouldn't happen in the first place because there should
always be an eligible task to kill - who would own all the memory
otherwise?

-- 
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: [PATCH] fs, mm: account filp and names caches to kmemcg
  2017-10-24 17:55                               ` Michal Hocko
@ 2017-10-24 18:58                                 ` Johannes Weiner
  2017-10-24 20:15                                   ` Michal Hocko
  0 siblings, 1 reply; 52+ messages in thread
From: Johannes Weiner @ 2017-10-24 18:58 UTC (permalink / raw)
  To: Michal Hocko
  Cc: Greg Thelen, Shakeel Butt, Alexander Viro, Vladimir Davydov,
	Andrew Morton, Linux MM, linux-fsdevel, LKML

On Tue, Oct 24, 2017 at 07:55:58PM +0200, Michal Hocko wrote:
> On Tue 24-10-17 13:23:30, Johannes Weiner wrote:
> > On Tue, Oct 24, 2017 at 06:22:13PM +0200, Michal Hocko wrote:
> [...]
> > > What would prevent a runaway in case the only process in the memcg is
> > > oom unkillable then?
> > 
> > In such a scenario, the page fault handler would busy-loop right now.
> > 
> > Disabling oom kills is a privileged operation with dire consequences
> > if used incorrectly. You can panic the kernel with it. Why should the
> > cgroup OOM killer implement protective semantics around this setting?
> > Breaching the limit in such a setup is entirely acceptable.
> > 
> > Really, I think it's an enormous mistake to start modeling semantics
> > based on the most contrived and non-sensical edge case configurations.
> > Start the discussion with what is sane and what most users should
> > optimally experience, and keep the cornercases simple.
> 
> I am not really seeing your concern about the semantics. The most
> important property of the hard limit is to protect from runaways and
> stop them if they happen. Users can use the softer variant (the high
> limit) if they are not afraid of those scenarios. It is not so insane to
> imagine that a master task (which I can easily imagine would be oom
> disabled) has a leak and runs away as a result.

Then you're screwed either way. Where do you return -ENOMEM in a page
fault path that cannot OOM kill anything? Your choice is between
maintaining the hard limit semantics or going into an infinite loop.

I fail to see how this setup has any impact on the semantics we pick
here. And even if it were real, it's really not what most users do.

> We are not talking only about the page fault path. There are other
> allocation paths to consume a lot of memory and spill over and break
> the isolation restriction. So it makes much more sense to me to fail
> the allocation in such a situation rather than allow the runaway to
> continue. Just consider that such a situation shouldn't happen in
> the first place because there should always be an eligible task to
> kill - who would own all the memory otherwise?

Okay, then let's just stick to the current behavior.

^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: [PATCH] fs, mm: account filp and names caches to kmemcg
  2017-10-24 18:58                                 ` Johannes Weiner
@ 2017-10-24 20:15                                   ` Michal Hocko
  2017-10-25  6:51                                     ` Greg Thelen
  0 siblings, 1 reply; 52+ messages in thread
From: Michal Hocko @ 2017-10-24 20:15 UTC (permalink / raw)
  To: Johannes Weiner
  Cc: Greg Thelen, Shakeel Butt, Alexander Viro, Vladimir Davydov,
	Andrew Morton, Linux MM, linux-fsdevel, LKML

On Tue 24-10-17 14:58:54, Johannes Weiner wrote:
> On Tue, Oct 24, 2017 at 07:55:58PM +0200, Michal Hocko wrote:
> > On Tue 24-10-17 13:23:30, Johannes Weiner wrote:
> > > On Tue, Oct 24, 2017 at 06:22:13PM +0200, Michal Hocko wrote:
> > [...]
> > > > What would prevent a runaway in case the only process in the memcg is
> > > > oom unkillable then?
> > > 
> > > In such a scenario, the page fault handler would busy-loop right now.
> > > 
> > > Disabling oom kills is a privileged operation with dire consequences
> > > if used incorrectly. You can panic the kernel with it. Why should the
> > > cgroup OOM killer implement protective semantics around this setting?
> > > Breaching the limit in such a setup is entirely acceptable.
> > > 
> > > Really, I think it's an enormous mistake to start modeling semantics
> > > based on the most contrived and non-sensical edge case configurations.
> > > Start the discussion with what is sane and what most users should
> > > optimally experience, and keep the cornercases simple.
> > 
> > I am not really seeing your concern about the semantics. The most
> > important property of the hard limit is to protect from runaways and
> > stop them if they happen. Users can use the softer variant (the high
> > limit) if they are not afraid of those scenarios. It is not so insane to
> > imagine that a master task (which I can easily imagine would be oom
> > disabled) has a leak and runs away as a result.
> 
> Then you're screwed either way. Where do you return -ENOMEM in a page
> fault path that cannot OOM kill anything? Your choice is between
> maintaining the hard limit semantics or going into an infinite loop.

In the PF path, yes. And I would argue that this is a reasonable
compromise to provide the guarantee the hard limit is giving us (and
the resulting isolation, which is the whole point). Btw. we already
have that behavior. All we are talking about is the non-PF path, which
ENOMEMs right now; the meta-patch tried to handle it more gracefully
and to return ENOMEM only when there is no other option.

> I fail to see how this setup has any impact on the semantics we pick
> here. And even if it were real, it's really not what most users do.

Sure, such a scenario is really on the edge, but my main point was that
the hard limit is an enforcement of an isolation guarantee (as much as
possible, of course).

> > We are not talking only about the page fault path. There are other
> > allocation paths to consume a lot of memory and spill over and break
> > the isolation restriction. So it makes much more sense to me to fail
> > the allocation in such a situation rather than allow the runaway to
> > continue. Just consider that such a situation shouldn't happen in
> > the first place because there should always be an eligible task to
> > kill - who would own all the memory otherwise?
> 
> Okay, then let's just stick to the current behavior.

I am definitely not pushing that thing right now. It is good to discuss
it, though. The more kernel allocations we track, the more careful we
will have to be. So maybe we will have to reconsider the current
approach. I am not sure we need it _right now_ but I feel we will
eventually have to reconsider it.
-- 
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: [PATCH] fs, mm: account filp and names caches to kmemcg
  2017-10-24 20:15                                   ` Michal Hocko
@ 2017-10-25  6:51                                     ` Greg Thelen
  2017-10-25  7:15                                       ` Michal Hocko
  0 siblings, 1 reply; 52+ messages in thread
From: Greg Thelen @ 2017-10-25  6:51 UTC (permalink / raw)
  To: Michal Hocko, Johannes Weiner
  Cc: Shakeel Butt, Alexander Viro, Vladimir Davydov, Andrew Morton,
	Linux MM, linux-fsdevel, LKML

Michal Hocko <mhocko@kernel.org> wrote:

> On Tue 24-10-17 14:58:54, Johannes Weiner wrote:
>> On Tue, Oct 24, 2017 at 07:55:58PM +0200, Michal Hocko wrote:
>> > On Tue 24-10-17 13:23:30, Johannes Weiner wrote:
>> > > On Tue, Oct 24, 2017 at 06:22:13PM +0200, Michal Hocko wrote:
>> > [...]
>> > > > What would prevent a runaway in case the only process in the memcg is
>> > > > oom unkillable then?
>> > > 
>> > > In such a scenario, the page fault handler would busy-loop right now.
>> > > 
>> > > Disabling oom kills is a privileged operation with dire consequences
>> > > if used incorrectly. You can panic the kernel with it. Why should the
>> > > cgroup OOM killer implement protective semantics around this setting?
>> > > Breaching the limit in such a setup is entirely acceptable.
>> > > 
>> > > Really, I think it's an enormous mistake to start modeling semantics
>> > > based on the most contrived and non-sensical edge case configurations.
>> > > Start the discussion with what is sane and what most users should
>> > > optimally experience, and keep the cornercases simple.
>> > 
>> > I am not really seeing your concern about the semantic. The most
>> > important property of the hard limit is to protect from runaways and
>> > stop them if they happen. Users can use the softer variant (high limit)
>> > if they are not afraid of those scenarios. It is not so insane to
>> > imagine that a master task (which I can easily imagine would be oom
>> > disabled) has a leak and runaway as a result.
>> 
>> Then you're screwed either way. Where do you return -ENOMEM in a page
>> fault path that cannot OOM kill anything? Your choice is between
>> maintaining the hard limit semantics or going into an infinite loop.
>
> in the PF path, yes. And I would argue that this is a reasonable
> compromise to provide the guarantee the hard limit is giving us (and
> the resulting isolation, which is the whole point). Btw. we already
> have that behavior. All we are talking about is the non-PF path, which
> ENOMEMs right now and which the meta-patch tried to handle more
> gracefully, returning ENOMEM only when there is no other option.
>
>> I fail to see how this setup has any impact on the semantics we pick
>> here. And even if it were real, it's really not what most users do.
>
> sure, such a scenario is really on the edge but my main point was that
> the hard limit is an enforcement of an isolation guarantee (as much as
> possible of course).
>
>> > We are not talking only about the page fault path. There are other
>> > allocation paths to consume a lot of memory and spill over and break
>> > the isolation restriction. So it makes much more sense to me to fail
>> > the allocation in such a situation rather than allow the runaway to
>> > continue. Just consider that such a situation shouldn't happen in
>> > the first place because there should always be an eligible task to
>> > kill - who would own all the memory otherwise?
>> 
>> Okay, then let's just stick to the current behavior.
>
> I am definitely not pushing that thing right now. It is good to discuss
> it, though. The more kernel allocations we track, the more careful we
> will have to be. So maybe we will have to reconsider the current
> approach. I am not sure we need it _right now_, but I feel we will
> eventually have to reconsider it.

The kernel already attempts to charge radix_tree_nodes.  If that fails,
we fall back to unaccounted memory.  So the memcg limit already
isn't an airtight constraint.

I agree that unchecked overcharging could be bad, but I wonder if we
could overcharge kmem as long as there is a pending oom kill victim.  If
current is the victim, or there is no victim, then fail allocations (as
is currently done).  The current thread can loop in syscall exit until
usage is reconciled (either via reclaim or kill).  This seems consistent
with pagefault oom handling and compatible with the overcommit use case.
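
A toy userspace model of that policy (all names and types here are
mine, purely illustrative - this is not the kernel's charge path):

```c
#include <stdbool.h>

/*
 * Toy model of the proposal above: a kmem charge may overrun the
 * limit only while an OOM kill is pending against some task other
 * than current; otherwise it fails, as allocations do today.
 */
enum charge_result { CHARGE_OK, CHARGE_FORCED, CHARGE_ENOMEM };

struct memcg_model {
	long usage;
	long limit;
	bool oom_victim_pending;	/* a victim was chosen but has not exited */
	bool current_is_victim;		/* the calling task is that victim */
};

static enum charge_result try_charge(struct memcg_model *m, long nr)
{
	if (m->usage + nr <= m->limit) {
		m->usage += nr;
		return CHARGE_OK;
	}
	/* Over limit: overcharge only while someone else is being killed. */
	if (m->oom_victim_pending && !m->current_is_victim) {
		m->usage += nr;		/* reconciled later, at syscall exit */
		return CHARGE_FORCED;
	}
	/* Current is the victim, or there is no victim: fail the charge. */
	return CHARGE_ENOMEM;
}
```

The "loop in syscall exit" part is then a matter of retrying reclaim or
waiting for the pending kill until usage drops back under the limit.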

Here's an example of an overcommit case we've found quite useful.  Memcg A has
memory which is shared between children B and C.  B is more important than C.
B and C are unprivileged; neither has the authority to kill the other.

    /A(limit=100MB) - B(limit=80MB,prio=high)
                     \ C(limit=80MB,prio=low)

If a memcg charge drives B.usage+C.usage >= A.limit, then C should be killed
due to its low priority.  A pagefault in B can kill, but if a syscall returns
ENOMEM then B can't do anything useful with it.

I know there are related oom killer victim selection discussions afoot.
Even with classic oom_score_adj killing, it's possible to heavily bias
the oom killer to select C over B.


* Re: [PATCH] fs, mm: account filp and names caches to kmemcg
  2017-10-25  6:51                                     ` Greg Thelen
@ 2017-10-25  7:15                                       ` Michal Hocko
  2017-10-25 13:11                                         ` Johannes Weiner
  0 siblings, 1 reply; 52+ messages in thread
From: Michal Hocko @ 2017-10-25  7:15 UTC (permalink / raw)
  To: Greg Thelen
  Cc: Johannes Weiner, Shakeel Butt, Alexander Viro, Vladimir Davydov,
	Andrew Morton, Linux MM, linux-fsdevel, LKML

On Tue 24-10-17 23:51:30, Greg Thelen wrote:
> Michal Hocko <mhocko@kernel.org> wrote:
[...]
> > I am definitely not pushing that thing right now. It is good to discuss
> > it, though. The more kernel allocations we track, the more careful we
> > will have to be. So maybe we will have to reconsider the current
> > approach. I am not sure we need it _right now_, but I feel we will
> > eventually have to reconsider it.
> 
> The kernel already attempts to charge radix_tree_nodes.  If that fails,
> we fall back to unaccounted memory.

I am not sure which code path you have in mind. All I can see is that we
drop __GFP_ACCOUNT when preloading radix tree nodes. Anyway...

> So the memcg limit already
> isn't an airtight constraint.

... we shouldn't make it more loose though.
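
The preload behavior referenced above (dropping __GFP_ACCOUNT) can be
illustrated roughly as follows (flag values are placeholders, not the
kernel's actual constants, and the function name is mine):

```c
/*
 * Simplified illustration: the radix tree preload path clears
 * __GFP_ACCOUNT from the allocation mask, so preloaded nodes bypass
 * memcg accounting.  Flag values here are placeholders.
 */
#define GFP_KERNEL	0x001u
#define __GFP_ACCOUNT	0x400u

static unsigned int radix_tree_preload_gfp(unsigned int gfp_mask)
{
	/* Preloaded nodes may be shared; do not charge them to a memcg. */
	return gfp_mask & ~__GFP_ACCOUNT;
}
```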

> I agree that unchecked overcharging could be bad, but I wonder if we
> could overcharge kmem as long as there is a pending oom kill victim.

Why is this any better than simply trying to charge as long as the oom
killer makes progress?

> If
> current is the victim or no victim, then fail allocations (as is
> currently done).

We actually force the charge in that case, so we will proceed.

> The current thread can loop in syscall exit until
> usage is reconciled (either via reclaim or kill).  This seems consistent
> with pagefault oom handling and compatible with overcommit use case.

But we do not really want to make the syscall exit path any more complex
or more expensive than it is. The point is that we shouldn't be afraid
of triggering the oom killer from the charge path, because we do have an
async OOM killer. This is the very same as the standard allocator path. So
why should memcg be any different?

> Here's an example of an overcommit case we've found quite useful.  Memcg A has
> memory which is shared between children B and C.  B is more important than C.
> B and C are unprivileged; neither has the authority to kill the other.
> 
>     /A(limit=100MB) - B(limit=80MB,prio=high)
>                      \ C(limit=80MB,prio=low)
> 
> If a memcg charge drives B.usage+C.usage >= A.limit, then C should be killed
> due to its low priority.  A pagefault in B can kill, but if a syscall returns
> ENOMEM then B can't do anything useful with it.

Well, my proposal was to not return ENOMEM but rather loop in the charge
path and wait for the oom killer to free up some charges. Who gets
killed is really out of scope of this discussion.
-- 
Michal Hocko
SUSE Labs


* Re: [PATCH] fs, mm: account filp and names caches to kmemcg
  2017-10-25  7:15                                       ` Michal Hocko
@ 2017-10-25 13:11                                         ` Johannes Weiner
  2017-10-25 14:12                                           ` Michal Hocko
  0 siblings, 1 reply; 52+ messages in thread
From: Johannes Weiner @ 2017-10-25 13:11 UTC (permalink / raw)
  To: Michal Hocko
  Cc: Greg Thelen, Shakeel Butt, Alexander Viro, Vladimir Davydov,
	Andrew Morton, Linux MM, linux-fsdevel, LKML

On Wed, Oct 25, 2017 at 09:15:22AM +0200, Michal Hocko wrote:
> On Tue 24-10-17 23:51:30, Greg Thelen wrote:
> > Michal Hocko <mhocko@kernel.org> wrote:
> [...]
> > > I am definitely not pushing that thing right now. It is good to discuss
> > > it, though. The more kernel allocations we track, the more careful we
> > > will have to be. So maybe we will have to reconsider the current
> > > approach. I am not sure we need it _right now_, but I feel we will
> > > eventually have to reconsider it.
> > 
> > The kernel already attempts to charge radix_tree_nodes.  If that fails,
> > we fall back to unaccounted memory.
> 
> I am not sure which code path you have in mind. All I can see is that we
> drop __GFP_ACCOUNT when preloading radix tree nodes. Anyway...
> 
> > So the memcg limit already
> > isn't an airtight constraint.

I fully agree with this. Socket buffers overcharge too. There are
plenty of memory allocations that aren't even tracked.

The point is, it's a hard limit in the sense that breaching it will
trigger the OOM killer. It's not a hard limit in the sense that the
kernel will deadlock to avoid crossing it.

> ... we shouldn't make it more loose though.

Then we can end this discussion right now. I pointed out right from
the start that the only way to replace -ENOMEM with OOM killing in the
syscall is to force charges. If we don't, we either deadlock or still
return -ENOMEM occasionally. Nobody has refuted that this is the case.

> > The current thread can loop in syscall exit until
> > usage is reconciled (either via reclaim or kill).  This seems consistent
> > with pagefault oom handling and compatible with overcommit use case.
> 
> But we do not really want to make the syscall exit path any more complex
> or more expensive than it is. The point is that we shouldn't be afraid
> of triggering the oom killer from the charge path, because we do have an
> async OOM killer. This is the very same as the standard allocator path. So
> why should memcg be any different?

I have nothing against triggering the OOM killer from the allocation
path. I am dead-set against making the -ENOMEM return from syscalls
rare and unpredictable. They're a challenge as it is.

The only sane options are to stick with the status quo, or make sure
the task never returns before the allocation succeeds. Making things
in this path more speculative is a downgrade, not an improvement.


* Re: [PATCH] fs, mm: account filp and names caches to kmemcg
  2017-10-25 13:11                                         ` Johannes Weiner
@ 2017-10-25 14:12                                           ` Michal Hocko
  2017-10-25 16:44                                             ` Johannes Weiner
  0 siblings, 1 reply; 52+ messages in thread
From: Michal Hocko @ 2017-10-25 14:12 UTC (permalink / raw)
  To: Johannes Weiner
  Cc: Greg Thelen, Shakeel Butt, Alexander Viro, Vladimir Davydov,
	Andrew Morton, Linux MM, linux-fsdevel, LKML

On Wed 25-10-17 09:11:51, Johannes Weiner wrote:
> On Wed, Oct 25, 2017 at 09:15:22AM +0200, Michal Hocko wrote:
[...]
> > ... we shouldn't make it more loose though.
> 
> Then we can end this discussion right now. I pointed out right from
> the start that the only way to replace -ENOMEM with OOM killing in the
> syscall is to force charges. If we don't, we either deadlock or still
> return -ENOMEM occasionally. Nobody has refuted that this is the case.

Yes, this is true. I guess we are back to the non-failing allocations
discussion...  Currently we are too ENOMEM-happy for memcg !PF paths, which
can lead to the weird issues Greg pointed out earlier. Going in the opposite
direction to basically never ENOMEM and rather pretend success (which
allows runaways for extreme setups with no oom-eligible tasks) sounds
like going from one extreme to another. It basically means that those
charges will effectively be GFP_NOFAIL. Too much to guarantee, IMHO.

> > > The current thread can loop in syscall exit until
> > > usage is reconciled (either via reclaim or kill).  This seems consistent
> > > with pagefault oom handling and compatible with overcommit use case.
> > 
> > But we do not really want to make the syscall exit path any more complex
> > or more expensive than it is. The point is that we shouldn't be afraid
> > of triggering the oom killer from the charge path, because we do have an
> > async OOM killer. This is the very same as the standard allocator path. So
> > why should memcg be any different?
> 
> I have nothing against triggering the OOM killer from the allocation
> path. I am dead-set against making the -ENOMEM return from syscalls
> rare and unpredictable.

Isn't that the case when we put memcg out of the picture already? More
on that below.

> They're a challenge as it is.  The only sane options are to stick with
> the status quo,

One thing that really worries me about the current status quo is that
the behavior depends on whether you run under memcg or not. The global
policy is "almost never fail unless something horrible is going on".
But we _do not_ guarantee that ENOMEM stays inside the kernel.

So if we need to do something about that, I would think we need a
universal solution rather than something memcg specific. Sure, global
ENOMEMs are so rare that nobody will probably trigger them, but that is
just wishful thinking...

So how about we start with a BIG FAT WARNING for the failure case?
Something resembling warn_alloc for the failure case.
---
diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index 5d9323028870..3ba62c73eee5 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -1547,9 +1547,14 @@ static bool mem_cgroup_oom(struct mem_cgroup *memcg, gfp_t mask, int order)
 	 * victim and then we have rely on mem_cgroup_oom_synchronize otherwise
 	 * we would fall back to the global oom killer in pagefault_out_of_memory
 	 */
-	if (!memcg->oom_kill_disable &&
-			mem_cgroup_out_of_memory(memcg, mask, order))
-		return true;
+	if (!memcg->oom_kill_disable) {
+		if (mem_cgroup_out_of_memory(memcg, mask, order))
+			return true;
+
+		WARN(!current->memcg_may_oom,
+				"Memory cgroup charge failed because of no reclaimable memory! "
+				"This looks like a misconfiguration or a kernel bug.");
+	}
 
 	if (!current->memcg_may_oom)
 		return false;
-- 
Michal Hocko
SUSE Labs


* Re: [PATCH] fs, mm: account filp and names caches to kmemcg
  2017-10-25 14:12                                           ` Michal Hocko
@ 2017-10-25 16:44                                             ` Johannes Weiner
  2017-10-25 17:29                                               ` Michal Hocko
  2017-10-27 20:50                                               ` Shakeel Butt
  0 siblings, 2 replies; 52+ messages in thread
From: Johannes Weiner @ 2017-10-25 16:44 UTC (permalink / raw)
  To: Michal Hocko
  Cc: Greg Thelen, Shakeel Butt, Alexander Viro, Vladimir Davydov,
	Andrew Morton, Linux MM, linux-fsdevel, LKML

On Wed, Oct 25, 2017 at 04:12:21PM +0200, Michal Hocko wrote:
> On Wed 25-10-17 09:11:51, Johannes Weiner wrote:
> > On Wed, Oct 25, 2017 at 09:15:22AM +0200, Michal Hocko wrote:
> [...]
> > > ... we shouldn't make it more loose though.
> > 
> > Then we can end this discussion right now. I pointed out right from
> > the start that the only way to replace -ENOMEM with OOM killing in the
> > syscall is to force charges. If we don't, we either deadlock or still
> > return -ENOMEM occasionally. Nobody has refuted that this is the case.
> 
> Yes, this is true. I guess we are back to the non-failing allocations
> discussion...  Currently we are too ENOMEM-happy for memcg !PF paths, which
> can lead to the weird issues Greg pointed out earlier. Going in the opposite
> direction to basically never ENOMEM and rather pretend success (which
> allows runaways for extreme setups with no oom-eligible tasks) sounds
> like going from one extreme to another. It basically means that those
> charges will effectively be GFP_NOFAIL. Too much to guarantee, IMHO.

We're talking in circles.

Overrunning the hard limit in the syscall path isn't worse than an
infinite loop in the page fault path. Once you make tasks ineligible
for OOM, that's what you have to expect. It's the OOM disabling that
results in extreme consequences across the board.

It's silly to keep bringing this up as an example.

> > They're a challenge as it is.  The only sane options are to stick with
> > the status quo,
> 
> One thing that really worries me about the current status quo is that
> the behavior depends on whether you run under memcg or not. The global
> policy is "almost never fail unless something horrible is going on".
> But we _do not_ guarantee that ENOMEM stays inside the kernel.
> 
> So if we need to do something about that, I would think we need a
> universal solution rather than something memcg specific. Sure, global
> ENOMEMs are so rare that nobody will probably trigger them, but that is
> just wishful thinking...

The difference in current behavior comes from the fact that real-life
workloads could trigger a deadlock with memcg looping indefinitely in
the charge path, but not with the allocator looping indefinitely in
the allocation path.

That also means that if we change it like you proposed, it'll be much
more likely to trigger -ENOMEM from syscalls inside memcg than it is
outside.

The margins are simply tighter inside cgroups. You often have only a
few closely interacting tasks in there, and the probability for
something to go wrong is much higher than on the system scope.

But also, the system is constrained by physical limits. It's a hard
wall; once you hit it, the kernel is dead, so we don't have a lot of
options. Memcg on the other hand is a completely artificial limit. It
doesn't make sense to me to pretend it's a hard wall to the point
where we actively *create* for ourselves the downsides of a physical
limitation.

> So how about we start with a BIG FAT WARNING for the failure case?
> Something resembling warn_alloc for the failure case.
>
> ---
> diff --git a/mm/memcontrol.c b/mm/memcontrol.c
> index 5d9323028870..3ba62c73eee5 100644
> --- a/mm/memcontrol.c
> +++ b/mm/memcontrol.c
> @@ -1547,9 +1547,14 @@ static bool mem_cgroup_oom(struct mem_cgroup *memcg, gfp_t mask, int order)
>  	 * victim and then we have rely on mem_cgroup_oom_synchronize otherwise
>  	 * we would fall back to the global oom killer in pagefault_out_of_memory
>  	 */
> -	if (!memcg->oom_kill_disable &&
> -			mem_cgroup_out_of_memory(memcg, mask, order))
> -		return true;
> +	if (!memcg->oom_kill_disable) {
> +		if (mem_cgroup_out_of_memory(memcg, mask, order))
> +			return true;
> +
> +		WARN(!current->memcg_may_oom,
> +				"Memory cgroup charge failed because of no reclaimable memory! "
> +				"This looks like a misconfiguration or a kernel bug.");
> +	}

That's crazy!

We shouldn't create interfaces that make it possible to accidentally
livelock the kernel. Then warn about it and let it crash. That is a
DOS-level lack of OS abstraction.

In such a situation, we should ignore oom_score_adj or ignore the hard
limit. Even panic() would be better from a machine management point of
view than leaving random tasks inside infinite loops.

Why is OOM-disabling a thing? Why isn't this simply a "kill everything
else before you kill me"? It's crashing the kernel in trying to
protect a userspace application. How is that not insane?


* Re: [PATCH] fs, mm: account filp and names caches to kmemcg
  2017-10-25 16:44                                             ` Johannes Weiner
@ 2017-10-25 17:29                                               ` Michal Hocko
  2017-10-25 18:11                                                 ` Johannes Weiner
  2017-10-27 20:50                                               ` Shakeel Butt
  1 sibling, 1 reply; 52+ messages in thread
From: Michal Hocko @ 2017-10-25 17:29 UTC (permalink / raw)
  To: Johannes Weiner
  Cc: Greg Thelen, Shakeel Butt, Alexander Viro, Vladimir Davydov,
	Andrew Morton, Linux MM, linux-fsdevel, LKML

On Wed 25-10-17 12:44:02, Johannes Weiner wrote:
> On Wed, Oct 25, 2017 at 04:12:21PM +0200, Michal Hocko wrote:
[...]

I have yet to digest the first part of the email, but the rest just
sounds like we are not on the same page.

> > So how about we start with a BIG FAT WARNING for the failure case?
> > Something resembling warn_alloc for the failure case.
> >
> > ---
> > diff --git a/mm/memcontrol.c b/mm/memcontrol.c
> > index 5d9323028870..3ba62c73eee5 100644
> > --- a/mm/memcontrol.c
> > +++ b/mm/memcontrol.c
> > @@ -1547,9 +1547,14 @@ static bool mem_cgroup_oom(struct mem_cgroup *memcg, gfp_t mask, int order)
> >  	 * victim and then we have rely on mem_cgroup_oom_synchronize otherwise
> >  	 * we would fall back to the global oom killer in pagefault_out_of_memory
> >  	 */
> > -	if (!memcg->oom_kill_disable &&
> > -			mem_cgroup_out_of_memory(memcg, mask, order))
> > -		return true;
> > +	if (!memcg->oom_kill_disable) {
> > +		if (mem_cgroup_out_of_memory(memcg, mask, order))
> > +			return true;
> > +
> > +		WARN(!current->memcg_may_oom,
> > +				"Memory cgroup charge failed because of no reclaimable memory! "
> > +				"This looks like a misconfiguration or a kernel bug.");
> > +	}
> 
> That's crazy!
> 
> We shouldn't create interfaces that make it possible to accidentally
> livelock the kernel. Then warn about it and let it crash. That is a
> DOS-level lack of OS abstraction.
> 
> In such a situation, we should ignore oom_score_adj or ignore the hard
> limit. Even panic() would be better from a machine management point of
> view than leaving random tasks inside infinite loops.
> 
> Why is OOM-disabling a thing? Why isn't this simply a "kill everything
> else before you kill me"? It's crashing the kernel in trying to
> protect a userspace application. How is that not insane?

I really do not follow. What kind of livelock or crash are you talking
about? All this code does is make the charge request (which is not
explicitly GFP_NOFAIL) fail with ENOMEM if the oom killer is not able
to make forward progress. That sounds like a safer option than failing
with ENOMEM unconditionally, which is what we do currently. So the only
change I am really proposing is to keep retrying as long as the oom
killer makes forward progress, and to ENOMEM otherwise.
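
The retry semantics proposed here could be sketched like so (a toy
model with invented names, not the real charge path): keep looping
while the OOM killer frees something, and return ENOMEM only once it
cannot.

```c
#include <errno.h>

/* Toy model: state of a memcg plus what the OOM killer can still free. */
struct cg_model {
	long usage, limit;
	int eligible_victims;	/* tasks the OOM killer may still kill */
	long freed_per_kill;	/* memory released by each kill */
};

/* One OOM-killer invocation; returns nonzero if it made progress. */
static int oom_kill_makes_progress(struct cg_model *cg)
{
	if (cg->eligible_victims == 0)
		return 0;
	cg->eligible_victims--;
	cg->usage -= cg->freed_per_kill;
	if (cg->usage < 0)
		cg->usage = 0;
	return 1;
}

/* Retry the charge as long as the OOM killer keeps freeing memory. */
static int charge_with_retry(struct cg_model *cg, long nr)
{
	for (;;) {
		if (cg->usage + nr <= cg->limit) {
			cg->usage += nr;
			return 0;
		}
		if (!oom_kill_makes_progress(cg))
			return -ENOMEM;	/* no eligible victim is left */
	}
}
```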

I am also not trying to protect a userspace application. Quite the
contrary: I would like the application to get ENOMEM when it should run
away from the constraint it runs within. I am actually protecting
everything outside of the hard-limited memcg.

So what is that I am missing?
-- 
Michal Hocko
SUSE Labs


* Re: [PATCH] fs, mm: account filp and names caches to kmemcg
  2017-10-25 17:29                                               ` Michal Hocko
@ 2017-10-25 18:11                                                 ` Johannes Weiner
  2017-10-25 19:00                                                   ` Michal Hocko
  0 siblings, 1 reply; 52+ messages in thread
From: Johannes Weiner @ 2017-10-25 18:11 UTC (permalink / raw)
  To: Michal Hocko
  Cc: Greg Thelen, Shakeel Butt, Alexander Viro, Vladimir Davydov,
	Andrew Morton, Linux MM, linux-fsdevel, LKML

On Wed, Oct 25, 2017 at 07:29:24PM +0200, Michal Hocko wrote:
> On Wed 25-10-17 12:44:02, Johannes Weiner wrote:
> > On Wed, Oct 25, 2017 at 04:12:21PM +0200, Michal Hocko wrote:
> > > So how about we start with a BIG FAT WARNING for the failure case?
> > > Something resembling warn_alloc for the failure case.
> > >
> > > ---
> > > diff --git a/mm/memcontrol.c b/mm/memcontrol.c
> > > index 5d9323028870..3ba62c73eee5 100644
> > > --- a/mm/memcontrol.c
> > > +++ b/mm/memcontrol.c
> > > @@ -1547,9 +1547,14 @@ static bool mem_cgroup_oom(struct mem_cgroup *memcg, gfp_t mask, int order)
> > >  	 * victim and then we have rely on mem_cgroup_oom_synchronize otherwise
> > >  	 * we would fall back to the global oom killer in pagefault_out_of_memory
> > >  	 */
> > > -	if (!memcg->oom_kill_disable &&
> > > -			mem_cgroup_out_of_memory(memcg, mask, order))
> > > -		return true;
> > > +	if (!memcg->oom_kill_disable) {
> > > +		if (mem_cgroup_out_of_memory(memcg, mask, order))
> > > +			return true;
> > > +
> > > +		WARN(!current->memcg_may_oom,
> > > +				"Memory cgroup charge failed because of no reclaimable memory! "
> > > +				"This looks like a misconfiguration or a kernel bug.");
> > > +	}
> > 
> > That's crazy!
> > 
> > We shouldn't create interfaces that make it possible to accidentally
> > livelock the kernel. Then warn about it and let it crash. That is a
> > DOS-level lack of OS abstraction.
> > 
> > In such a situation, we should ignore oom_score_adj or ignore the hard
> > limit. Even panic() would be better from a machine management point of
> > view than leaving random tasks inside infinite loops.
> > 
> > Why is OOM-disabling a thing? Why isn't this simply a "kill everything
> > else before you kill me"? It's crashing the kernel in trying to
> > protect a userspace application. How is that not insane?
> 
> I really do not follow. What kind of livelock or crash are you talking
> about? All this code does is make the charge request (which is not
> explicitly GFP_NOFAIL) fail with ENOMEM if the oom killer is not able
> to make forward progress. That sounds like a safer option than failing
> with ENOMEM unconditionally, which is what we do currently.

I pointed out multiple times now that consistent -ENOMEM is better
than a rare one because it's easier to verify application behavior
with test runs. And I don't understand what your counter-argument is.

"Safe" is a vague term, and it doesn't make much sense to me in this
situation. The OOM behavior should be predictable and consistent.

Yes, global might in the rarest cases also return -ENOMEM. Maybe. We
don't have to do that in memcg because we're not physically limited.

> So the only change I am really proposing is to keep retrying as long
> as the oom killer makes forward progress and ENOMEM otherwise.

That's the behavior change I'm against.

> I am also not trying to protect a userspace application. Quite the
> contrary: I would like the application to get ENOMEM when it should run
> away from the constraint it runs within. I am actually protecting
> everything outside of the hard-limited memcg.
> 
> So what is that I am missing?

The page fault path.

Maybe that's not what you set out to fix, but it will now not only
enter an infinite loop, it will also WARN() on each iteration.

It would make sense to step back and think about the comprehensive
user-visible behavior of an out-of-memory situation, and not just the
one syscall aspect.


* Re: [PATCH] fs, mm: account filp and names caches to kmemcg
  2017-10-25 18:11                                                 ` Johannes Weiner
@ 2017-10-25 19:00                                                   ` Michal Hocko
  2017-10-25 21:13                                                     ` Johannes Weiner
  0 siblings, 1 reply; 52+ messages in thread
From: Michal Hocko @ 2017-10-25 19:00 UTC (permalink / raw)
  To: Johannes Weiner
  Cc: Greg Thelen, Shakeel Butt, Alexander Viro, Vladimir Davydov,
	Andrew Morton, Linux MM, linux-fsdevel, LKML

On Wed 25-10-17 14:11:06, Johannes Weiner wrote:
> On Wed, Oct 25, 2017 at 07:29:24PM +0200, Michal Hocko wrote:
> > On Wed 25-10-17 12:44:02, Johannes Weiner wrote:
> > > On Wed, Oct 25, 2017 at 04:12:21PM +0200, Michal Hocko wrote:
> > > > So how about we start with a BIG FAT WARNING for the failure case?
> > > > Something resembling warn_alloc for the failure case.
> > > >
> > > > ---
> > > > diff --git a/mm/memcontrol.c b/mm/memcontrol.c
> > > > index 5d9323028870..3ba62c73eee5 100644
> > > > --- a/mm/memcontrol.c
> > > > +++ b/mm/memcontrol.c
> > > > @@ -1547,9 +1547,14 @@ static bool mem_cgroup_oom(struct mem_cgroup *memcg, gfp_t mask, int order)
> > > >  	 * victim and then we have rely on mem_cgroup_oom_synchronize otherwise
> > > >  	 * we would fall back to the global oom killer in pagefault_out_of_memory
> > > >  	 */
> > > > -	if (!memcg->oom_kill_disable &&
> > > > -			mem_cgroup_out_of_memory(memcg, mask, order))
> > > > -		return true;
> > > > +	if (!memcg->oom_kill_disable) {
> > > > +		if (mem_cgroup_out_of_memory(memcg, mask, order))
> > > > +			return true;
> > > > +
> > > > +		WARN(!current->memcg_may_oom,
> > > > +				"Memory cgroup charge failed because of no reclaimable memory! "
> > > > +				"This looks like a misconfiguration or a kernel bug.");
> > > > +	}
> > > 
> > > That's crazy!
> > > 
> > > We shouldn't create interfaces that make it possible to accidentally
> > > livelock the kernel. Then warn about it and let it crash. That is a
> > > DOS-level lack of OS abstraction.
> > > 
> > > In such a situation, we should ignore oom_score_adj or ignore the hard
> > > limit. Even panic() would be better from a machine management point of
> > > view than leaving random tasks inside infinite loops.
> > > 
> > > Why is OOM-disabling a thing? Why isn't this simply a "kill everything
> > > else before you kill me"? It's crashing the kernel in trying to
> > > protect a userspace application. How is that not insane?
> > 
> > I really do not follow. What kind of livelock or crash are you talking
> > about? All this code does is make the charge request (which is not
> > explicitly GFP_NOFAIL) fail with ENOMEM if the oom killer is not able
> > to make forward progress. That sounds like a safer option than failing
> > with ENOMEM unconditionally, which is what we do currently.
> 
> I pointed out multiple times now that consistent -ENOMEM is better
> than a rare one because it's easier to verify application behavior
> with test runs. And I don't understand what your counter-argument is.

My counter-argument is that running inside a memcg shouldn't be too
different from running outside. And that means that if we try to
reduce the chances of ENOMEM in the global case, then we should do the
same in the memcg case as well. Failing overly eagerly is more harmful
than what the determinism gives you for testing. You have other means to
test failure paths, like fault injection.
 
> "Safe" is a vague term, and it doesn't make much sense to me in this
> situation. The OOM behavior should be predictable and consistent.
> 
> Yes, global might in the rarest cases also return -ENOMEM. Maybe. We
> don't have to do that in memcg because we're not physically limited.

OK, so here seems to be the biggest disconnect. Being physically or
artificially constrained shouldn't make much difference IMHO. In both
cases the resource is simply limited for the consumer. And once all the
attempts to fit within the limit fail then the request for the resource
has to fail.
 
> > So the only change I am really proposing is to keep retrying as long
> > as the oom killer makes forward progress and ENOMEM otherwise.
> 
> That's the behavior change I'm against.

So just to make it clear: you would be OK with retrying on a successful
OOM killer invocation and force-charging on OOM failure, right?

> > I am also not trying to protect a userspace application. Quite the
> > contrary: I would like the application to get ENOMEM when it should run
> > away from the constraint it runs within. I am actually protecting
> > everything outside of the hard-limited memcg.
> > 
> > So what is that I am missing?
> 
> The page fault path.
> 
> Maybe that's not what you set out to fix, but it will now not only
> enter an infinite loop, it will also WARN() on each iteration.

It will not warn! The WARN is explicitly for non-PF paths, unless I
have screwed something up there - admittedly, I didn't even try to
compile that code; it was merely an illustration. Please note that
the diff is on top of the previous one.

And we already have an endless loop if the memcg PF oom path doesn't
make forward progress. So there shouldn't be any difference in that
regard.

> It would make sense to step back and think about the comprehensive
> user-visible behavior of an out-of-memory situation, and not just the
> one syscall aspect.

-- 
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: [PATCH] fs, mm: account filp and names caches to kmemcg
  2017-10-25 19:00                                                   ` Michal Hocko
@ 2017-10-25 21:13                                                     ` Johannes Weiner
  2017-10-25 22:49                                                       ` Greg Thelen
  0 siblings, 1 reply; 52+ messages in thread
From: Johannes Weiner @ 2017-10-25 21:13 UTC (permalink / raw)
  To: Michal Hocko
  Cc: Greg Thelen, Shakeel Butt, Alexander Viro, Vladimir Davydov,
	Andrew Morton, Linux MM, linux-fsdevel, LKML

On Wed, Oct 25, 2017 at 09:00:57PM +0200, Michal Hocko wrote:
> On Wed 25-10-17 14:11:06, Johannes Weiner wrote:
> > "Safe" is a vague term, and it doesn't make much sense to me in this
> > situation. The OOM behavior should be predictable and consistent.
> > 
> > Yes, global might in the rarest cases also return -ENOMEM. Maybe. We
> > don't have to do that in memcg because we're not physically limited.
> 
> OK, so here seems to be the biggest disconnect. Being physically or
> artificially constrained shouldn't make much difference IMHO. In both
> cases the resource is simply limited for the consumer. And once all the
> attempts to fit within the limit fail then the request for the resource
> has to fail.

It's a huge difference. In the global case, we have to make trade-offs
to not deadlock the kernel. In the memcg case, we have to make a
trade-off between desirable OOM behavior and desirable meaning of memory.max.

If we can borrow a resource temporarily from the ether to resolve the
OOM situation, I don't see why we shouldn't. We're only briefly
ignoring the limit to make sure the allocating task isn't preventing
the OOM victim from exiting or the OOM reaper from reaping. It's more
of an implementation detail than interface.

The only scenario you brought up where this might be the permanent
overrun is the single, oom-disabled task. And I explained why that is
a silly argument, why that's the least problematic consequence of
oom-disabling, and why it probably shouldn't even be configurable.

The idea that memory.max must never be breached is an extreme and
narrow view. As Greg points out, there are allocations we do not even
track. There are other scenarios that force allocations. They may
violate the limit on paper, but they're not notably weakening the goal
of memory.max - isolating workloads from each other.

Let's look at it this way.

There are two deadlock relationships the OOM killer needs to solve
between the triggerer and the potential OOM victim:

	#1 Memory. The triggerer needs memory that the victim has,
	    but the victim needs some additional memory to release it.

	#2 Locks. The triggerer needs memory that the victim has, but
	    the victim needs a lock the triggerer holds to release it.

We have no qualms letting the victim temporarily (until the victim's
exit) ignore memory.max to resolve the memory deadlock #1.

I don't understand why it's such a stretch to let the triggerer
temporarily (until the victim's exit) ignore memory.max to resolve the
locks deadlock #2. [1]

We need both for the OOM killer to function correctly.

We've solved #1 both for memcg and globally. But we haven't solved #2.
Global can still deadlock, and memcg copped out and returns -ENOMEM.

Adding speculative OOM killing before the -ENOMEM makes things more
muddy and unpredictable. It doesn't actually solve deadlock #2.

[1] And arguably that's what we should be doing in the global case
    too: give the triggerer access to reserves. If you recall this
    thread here: https://patchwork.kernel.org/patch/6088511/

> > > So the only change I am really proposing is to keep retrying as long
> > > as the oom killer makes a forward progress and ENOMEM otherwise.
> > 
> > That's the behavior change I'm against.
> 
> So just to make it clear you would be OK with the retry on successful
> OOM killer invocation and force charge on oom failure, right?

Yeah, that sounds reasonable to me.

^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: [PATCH] fs, mm: account filp and names caches to kmemcg
  2017-10-25 21:13                                                     ` Johannes Weiner
@ 2017-10-25 22:49                                                       ` Greg Thelen
  2017-10-26  7:49                                                         ` Michal Hocko
  2017-10-26 14:31                                                         ` Johannes Weiner
  0 siblings, 2 replies; 52+ messages in thread
From: Greg Thelen @ 2017-10-25 22:49 UTC (permalink / raw)
  To: Johannes Weiner, Michal Hocko
  Cc: Shakeel Butt, Alexander Viro, Vladimir Davydov, Andrew Morton,
	Linux MM, linux-fsdevel, LKML

Johannes Weiner <hannes@cmpxchg.org> wrote:

> On Wed, Oct 25, 2017 at 09:00:57PM +0200, Michal Hocko wrote:
>> On Wed 25-10-17 14:11:06, Johannes Weiner wrote:
>> > "Safe" is a vague term, and it doesn't make much sense to me in this
>> > situation. The OOM behavior should be predictable and consistent.
>> > 
>> > Yes, global might in the rarest cases also return -ENOMEM. Maybe. We
>> > don't have to do that in memcg because we're not physically limited.
>> 
>> OK, so here seems to be the biggest disconnect. Being physically or
>> artificially constrained shouldn't make much difference IMHO. In both
>> cases the resource is simply limited for the consumer. And once all the
>> attempts to fit within the limit fail then the request for the resource
>> has to fail.
>
> It's a huge difference. In the global case, we have to make trade-offs
> to not deadlock the kernel. In the memcg case, we have to make a trade
> off between desirable OOM behavior and desirable meaning of memory.max.
>
> If we can borrow a resource temporarily from the ether to resolve the
> OOM situation, I don't see why we shouldn't. We're only briefly
> ignoring the limit to make sure the allocating task isn't preventing
> the OOM victim from exiting or the OOM reaper from reaping. It's more
> of an implementation detail than interface.
>
> The only scenario you brought up where this might be the permanent
> overrun is the single, oom-disabled task. And I explained why that is
> a silly argument, why that's the least problematic consequence of
> oom-disabling, and why it probably shouldn't even be configurable.
>
> The idea that memory.max must never be breached is an extreme and
> narrow view. As Greg points out, there are allocations we do not even
> track. There are other scenarios that force allocations. They may
> violate the limit on paper, but they're not notably weakening the goal
> of memory.max - isolating workloads from each other.
>
> Let's look at it this way.
>
> There are two deadlock relationships the OOM killer needs to solve
> between the triggerer and the potential OOM victim:
>
> 	#1 Memory. The triggerer needs memory that the victim has,
> 	    but the victim needs some additional memory to release it.
>
> 	#2 Locks. The triggerer needs memory that the victim has, but
> 	    the victim needs a lock the triggerer holds to release it.
>
> We have no qualms letting the victim temporarily (until the victim's
> exit) ignore memory.max to resolve the memory deadlock #1.
>
> I don't understand why it's such a stretch to let the triggerer
> temporarily (until the victim's exit) ignore memory.max to resolve the
> locks deadlock #2. [1]
>
> We need both for the OOM killer to function correctly.
>
> We've solved #1 both for memcg and globally. But we haven't solved #2.
> Global can still deadlock, and memcg copped out and returns -ENOMEM.
>
> Adding speculative OOM killing before the -ENOMEM makes things more
> muddy and unpredictable. It doesn't actually solve deadlock #2.
>
> [1] And arguably that's what we should be doing in the global case
>     too: give the triggerer access to reserves. If you recall this
>     thread here: https://patchwork.kernel.org/patch/6088511/
>
>> > > So the only change I am really proposing is to keep retrying as long
>> > > as the oom killer makes a forward progress and ENOMEM otherwise.
>> > 
>> > That's the behavior change I'm against.
>> 
>> So just to make it clear you would be OK with the retry on successful
>> OOM killer invocation and force charge on oom failure, right?
>
> Yeah, that sounds reasonable to me.

Assuming we're talking about retrying within try_charge(), then there's
a detail to iron out...

If there is a pending oom victim blocked on a lock held by the try_charge()
caller (the "#2 Locks" case), then I think repeated calls to out_of_memory()
will return true until the victim either gets MMF_OOM_SKIP or disappears.  So a
force charge fallback might be needed even with successful oom killer
invocations.  Or we'll need to teach out_of_memory() to return three values
(e.g. NO_VICTIM, NEW_VICTIM, PENDING_VICTIM) so that try_charge() can loop on
NEW_VICTIM.

^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: [PATCH] fs, mm: account filp and names caches to kmemcg
  2017-10-25 22:49                                                       ` Greg Thelen
@ 2017-10-26  7:49                                                         ` Michal Hocko
  2017-10-26 12:45                                                           ` Tetsuo Handa
  2017-10-26 14:31                                                         ` Johannes Weiner
  1 sibling, 1 reply; 52+ messages in thread
From: Michal Hocko @ 2017-10-26  7:49 UTC (permalink / raw)
  To: Greg Thelen
  Cc: Johannes Weiner, Shakeel Butt, Alexander Viro, Vladimir Davydov,
	Andrew Morton, Linux MM, linux-fsdevel, LKML

On Wed 25-10-17 15:49:21, Greg Thelen wrote:
> Johannes Weiner <hannes@cmpxchg.org> wrote:
> 
> > On Wed, Oct 25, 2017 at 09:00:57PM +0200, Michal Hocko wrote:
[...]
> >> So just to make it clear you would be OK with the retry on successful
> >> OOM killer invocation and force charge on oom failure, right?
> >
> > Yeah, that sounds reasonable to me.
> 
> Assuming we're talking about retrying within try_charge(), then there's
> a detail to iron out...
> 
> If there is a pending oom victim blocked on a lock held by try_charge() caller
> (the "#2 Locks" case), then I think repeated calls to out_of_memory() will
> return true until the victim either gets MMF_OOM_SKIP or disappears.

true. And oom_reaper guarantees that MMF_OOM_SKIP gets set in a finite
amount of time.

> So a force
> charge fallback might be needed even with successful oom killer invocations.
> Or we'll need to teach out_of_memory() to return three values (e.g. NO_VICTIM,
> NEW_VICTIM, PENDING_VICTIM) and try_charge() can loop on NEW_VICTIM.

No, we really want to wait for the oom victim to do its job. The only
thing we should be worried about is when out_of_memory doesn't invoke
the reaper. There is only one case like that AFAIK - a GFP_NOFS request. I
have to think about this case some more. We currently fail the request
in that case.

-- 
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: [PATCH] fs, mm: account filp and names caches to kmemcg
  2017-10-26  7:49                                                         ` Michal Hocko
@ 2017-10-26 12:45                                                           ` Tetsuo Handa
  0 siblings, 0 replies; 52+ messages in thread
From: Tetsuo Handa @ 2017-10-26 12:45 UTC (permalink / raw)
  To: Michal Hocko, Greg Thelen
  Cc: Johannes Weiner, Shakeel Butt, Alexander Viro, Vladimir Davydov,
	Andrew Morton, Linux MM, linux-fsdevel, LKML

On 2017/10/26 16:49, Michal Hocko wrote:
> On Wed 25-10-17 15:49:21, Greg Thelen wrote:
>> Johannes Weiner <hannes@cmpxchg.org> wrote:
>>
>>> On Wed, Oct 25, 2017 at 09:00:57PM +0200, Michal Hocko wrote:
> [...]
>>>> So just to make it clear you would be OK with the retry on successful
>>>> OOM killer invocation and force charge on oom failure, right?
>>>
>>> Yeah, that sounds reasonable to me.
>>
>> Assuming we're talking about retrying within try_charge(), then there's
>> a detail to iron out...
>>
>> If there is a pending oom victim blocked on a lock held by try_charge() caller
>> (the "#2 Locks" case), then I think repeated calls to out_of_memory() will
>> return true until the victim either gets MMF_OOM_SKIP or disappears.
> 
> true. And oom_reaper guarantees that MMF_OOM_SKIP gets set in a finite
> amount of time.

Just a confirmation. You are talking about kmemcg, aren't you? And kmemcg
depends on CONFIG_MMU=y, doesn't it? If not, there is no such guarantee.

> 
>> So a force
>> charge fallback might be needed even with successful oom killer invocations.
>> Or we'll need to teach out_of_memory() to return three values (e.g. NO_VICTIM,
>> NEW_VICTIM, PENDING_VICTIM) and try_charge() can loop on NEW_VICTIM.
> 
> No, we really want to wait for the oom victim to do its job. The only
> thing we should be worried about is when out_of_memory doesn't invoke
> the reaper. There is only one case like that AFAIK - GFP_NOFS request. I
> have to think about this case some more. We currently fail the request
> in that case.
> 

Do we really need to apply

	/*
	 * The OOM killer does not compensate for IO-less reclaim.
	 * pagefault_out_of_memory lost its gfp context so we have to
	 * make sure exclude 0 mask - all other users should have at least
	 * ___GFP_DIRECT_RECLAIM to get here.
	 */
	if (oc->gfp_mask && !(oc->gfp_mask & __GFP_FS))
		return true;

unconditionally?

We can encourage !__GFP_FS allocations to use __GFP_NORETRY or
__GFP_RETRY_MAYFAIL if their allocations are not important.
Then, only important !__GFP_FS allocations will be checked here.
I think that we can allow such important allocations to invoke the OOM
killer (i.e. remove this check) because the situation is already hopeless
if important !__GFP_FS allocations cannot make progress.

^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: [PATCH] fs, mm: account filp and names caches to kmemcg
  2017-10-25 22:49                                                       ` Greg Thelen
  2017-10-26  7:49                                                         ` Michal Hocko
@ 2017-10-26 14:31                                                         ` Johannes Weiner
  2017-10-26 19:56                                                           ` Greg Thelen
  1 sibling, 1 reply; 52+ messages in thread
From: Johannes Weiner @ 2017-10-26 14:31 UTC (permalink / raw)
  To: Greg Thelen
  Cc: Michal Hocko, Shakeel Butt, Alexander Viro, Vladimir Davydov,
	Andrew Morton, Linux MM, linux-fsdevel, LKML

On Wed, Oct 25, 2017 at 03:49:21PM -0700, Greg Thelen wrote:
> Johannes Weiner <hannes@cmpxchg.org> wrote:
> 
> > On Wed, Oct 25, 2017 at 09:00:57PM +0200, Michal Hocko wrote:
> >> On Wed 25-10-17 14:11:06, Johannes Weiner wrote:
> >> > "Safe" is a vague term, and it doesn't make much sense to me in this
> >> > situation. The OOM behavior should be predictable and consistent.
> >> > 
> >> > Yes, global might in the rarest cases also return -ENOMEM. Maybe. We
> >> > don't have to do that in memcg because we're not physically limited.
> >> 
> >> OK, so here seems to be the biggest disconnect. Being physically or
> >> artificially constrained shouldn't make much difference IMHO. In both
> >> cases the resource is simply limited for the consumer. And once all the
> >> attempts to fit within the limit fail then the request for the resource
> >> has to fail.
> >
> > It's a huge difference. In the global case, we have to make trade-offs
> > to not deadlock the kernel. In the memcg case, we have to make a trade
> > off between desirable OOM behavior and desirable meaning of memory.max.
> >
> > If we can borrow a resource temporarily from the ether to resolve the
> > OOM situation, I don't see why we shouldn't. We're only briefly
> > ignoring the limit to make sure the allocating task isn't preventing
> > the OOM victim from exiting or the OOM reaper from reaping. It's more
> > of an implementation detail than interface.
> >
> > The only scenario you brought up where this might be the permanent
> > overrun is the single, oom-disabled task. And I explained why that is
> > a silly argument, why that's the least problematic consequence of
> > oom-disabling, and why it probably shouldn't even be configurable.
> >
> > The idea that memory.max must never be breached is an extreme and
> > narrow view. As Greg points out, there are allocations we do not even
> > track. There are other scenarios that force allocations. They may
> > violate the limit on paper, but they're not notably weakening the goal
> > of memory.max - isolating workloads from each other.
> >
> > Let's look at it this way.
> >
> > There are two deadlock relationships the OOM killer needs to solve
> > between the triggerer and the potential OOM victim:
> >
> > 	#1 Memory. The triggerer needs memory that the victim has,
> > 	    but the victim needs some additional memory to release it.
> >
> > 	#2 Locks. The triggerer needs memory that the victim has, but
> > 	    the victim needs a lock the triggerer holds to release it.
> >
> > We have no qualms letting the victim temporarily (until the victim's
> > exit) ignore memory.max to resolve the memory deadlock #1.
> >
> > I don't understand why it's such a stretch to let the triggerer
> > temporarily (until the victim's exit) ignore memory.max to resolve the
> > locks deadlock #2. [1]
> >
> > We need both for the OOM killer to function correctly.
> >
> > We've solved #1 both for memcg and globally. But we haven't solved #2.
> > Global can still deadlock, and memcg copped out and returns -ENOMEM.
> >
> > Adding speculative OOM killing before the -ENOMEM makes things more
> > muddy and unpredictable. It doesn't actually solve deadlock #2.
> >
> > [1] And arguably that's what we should be doing in the global case
> >     too: give the triggerer access to reserves. If you recall this
> >     thread here: https://patchwork.kernel.org/patch/6088511/
> >
> >> > > So the only change I am really proposing is to keep retrying as long
> >> > > as the oom killer makes a forward progress and ENOMEM otherwise.
> >> > 
> >> > That's the behavior change I'm against.
> >> 
> >> So just to make it clear you would be OK with the retry on successful
> >> OOM killer invocation and force charge on oom failure, right?
> >
> > Yeah, that sounds reasonable to me.
> 
> Assuming we're talking about retrying within try_charge(), then there's
> a detail to iron out...
> 
> If there is a pending oom victim blocked on a lock held by try_charge() caller
> (the "#2 Locks" case), then I think repeated calls to out_of_memory() will
> return true until the victim either gets MMF_OOM_SKIP or disappears.  So a force
> charge fallback might be needed even with successful oom killer invocations.
> Or we'll need to teach out_of_memory() to return three values (e.g. NO_VICTIM,
> NEW_VICTIM, PENDING_VICTIM) and try_charge() can loop on NEW_VICTIM.

True. I was assuming we'd retry MEM_CGROUP_RECLAIM_RETRIES times at a
maximum, even if the OOM killer indicates a kill has been issued. What
you propose makes sense too.

^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: [PATCH] fs, mm: account filp and names caches to kmemcg
  2017-10-26 14:31                                                         ` Johannes Weiner
@ 2017-10-26 19:56                                                           ` Greg Thelen
  2017-10-27  8:20                                                             ` Michal Hocko
  0 siblings, 1 reply; 52+ messages in thread
From: Greg Thelen @ 2017-10-26 19:56 UTC (permalink / raw)
  To: Johannes Weiner
  Cc: Michal Hocko, Shakeel Butt, Alexander Viro, Vladimir Davydov,
	Andrew Morton, Linux MM, linux-fsdevel, LKML

Michal Hocko wrote:
> Greg Thelen wrote:
> > So a force charge fallback might be needed even with successful oom killer
> > invocations.  Or we'll need to teach out_of_memory() to return three values
> > (e.g. NO_VICTIM, NEW_VICTIM, PENDING_VICTIM) and try_charge() can loop on
> > NEW_VICTIM.
> 
> No, we really want to wait for the oom victim to do its job. The only thing we
> should be worried about is when out_of_memory doesn't invoke the reaper. There
> is only one case like that AFAIK - GFP_NOFS request. I have to think about
> this case some more. We currently fail the request in that case.

Nod, but I think we should only wait a short time (more below).  The
schedule_timeout_killable(1) in out_of_memory() seems ok to me.  I don't
think there's a problem with overcharging a little bit to expedite oom
killing.

Johannes Weiner wrote:
> True. I was assuming we'd retry MEM_CGROUP_RECLAIM_RETRIES times at a maximum,
> even if the OOM killer indicates a kill has been issued. What you propose
> makes sense too.

Sounds good.

It looks like the oom reaper will wait 1 second
(MAX_OOM_REAP_RETRIES*HZ/10) before giving up and setting MMF_OOM_SKIP,
which enables the oom killer to select another victim.  Repeated
try_charge() => out_of_memory() calls will return true while there's a
pending victim.  After the first call, out_of_memory() doesn't appear to
sleep.  So I assume try_charge() would quickly burn through
MEM_CGROUP_RECLAIM_RETRIES (5) attempts before resorting to
overcharging.  IMO, this is fine because:
1) it's possible the victim wants locks held by the try_charge() caller.  So
   waiting for the oom reaper to time out and out_of_memory to select
   additional victims would kill more than required.
2) waiting 1 sec to detect a livelock between try_charge() and pending
   oom victim seems unfortunate.

^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: [PATCH] fs, mm: account filp and names caches to kmemcg
  2017-10-26 19:56                                                           ` Greg Thelen
@ 2017-10-27  8:20                                                             ` Michal Hocko
  0 siblings, 0 replies; 52+ messages in thread
From: Michal Hocko @ 2017-10-27  8:20 UTC (permalink / raw)
  To: Greg Thelen
  Cc: Johannes Weiner, Shakeel Butt, Alexander Viro, Vladimir Davydov,
	Andrew Morton, Linux MM, linux-fsdevel, LKML

On Thu 26-10-17 12:56:48, Greg Thelen wrote:
> Michal Hocko wrote:
> > Greg Thelen wrote:
> > > So a force charge fallback might be needed even with successful oom killer
> > > invocations.  Or we'll need to teach out_of_memory() to return three values
> > > (e.g. NO_VICTIM, NEW_VICTIM, PENDING_VICTIM) and try_charge() can loop on
> > > NEW_VICTIM.
> > 
> > No, we really want to wait for the oom victim to do its job. The only thing we
> > should be worried about is when out_of_memory doesn't invoke the reaper. There
> > is only one case like that AFAIK - GFP_NOFS request. I have to think about
> > this case some more. We currently fail the request in that case.
> 
> Nod, but I think only wait a short time (more below).  The
> schedule_timeout_killable(1) in out_of_memory() seems ok to me.  I don't
> think there's a problem overcharging a little bit to expedite oom
> killing.

This is not about time. This is about the feedback mechanism the oom_reaper
provides. We shouldn't take any action until we get that feedback.

> Johannes Weiner wrote:
> > True. I was assuming we'd retry MEM_CGROUP_RECLAIM_RETRIES times at a maximum,
> > even if the OOM killer indicates a kill has been issued. What you propose
> > makes sense too.
> 
> Sounds good.
> 
> It looks like the oom reaper will wait 1 second
> (MAX_OOM_REAP_RETRIES*HZ/10) before giving up and setting MMF_OOM_SKIP,
> which enables the oom killer to select another victim.  Repeated
> try_charge() => out_of_memory() calls will return true while there's a
> pending victim.  After the first call, out_of_memory() doesn't appear to
> sleep.  So I assume try_charge() would quickly burn through
> MEM_CGROUP_RECLAIM_RETRIES (5) attempts before resorting to
> overcharging.  IMO, this is fine because:
> 1) it's possible the victim wants locks held by try_charge caller.  So
>    waiting for the oom reaper to timeout and out_of_memory to select
>    additional victims would kill more than required.
> 2) waiting 1 sec to detect a livelock between try_charge() and pending
>    oom victim seems unfortunate.

I am not yet sure whether overcharging or ENOMEM is the right way to go
(I have to think that through much more) but one thing is clear, I guess.
The charge path shouldn't make any decision as long as it gets positive
feedback from the oom killer path. And that pretty much depends on what
the oom reaper is able to do. Implementation details about how long the
reaper waits if it is not able to reclaim any memory are not all that
important.
-- 
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: [PATCH] fs, mm: account filp and names caches to kmemcg
  2017-10-25 16:44                                             ` Johannes Weiner
  2017-10-25 17:29                                               ` Michal Hocko
@ 2017-10-27 20:50                                               ` Shakeel Butt
  2017-10-30  8:29                                                 ` Michal Hocko
  1 sibling, 1 reply; 52+ messages in thread
From: Shakeel Butt @ 2017-10-27 20:50 UTC (permalink / raw)
  To: Johannes Weiner
  Cc: Michal Hocko, Greg Thelen, Alexander Viro, Vladimir Davydov,
	Andrew Morton, Linux MM, linux-fsdevel, LKML

> Why is OOM-disabling a thing? Why isn't this simply a "kill everything
> else before you kill me"? It's crashing the kernel in trying to
> protect a userspace application. How is that not insane?

In parallel to other discussion, I think we should definitely move
from "completely oom-disabled" semantics to something similar to "kill
me last" semantics. Is there any objection to this idea?

^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: [PATCH] fs, mm: account filp and names caches to kmemcg
  2017-10-27 20:50                                               ` Shakeel Butt
@ 2017-10-30  8:29                                                 ` Michal Hocko
  2017-10-30 19:28                                                   ` Shakeel Butt
  0 siblings, 1 reply; 52+ messages in thread
From: Michal Hocko @ 2017-10-30  8:29 UTC (permalink / raw)
  To: Shakeel Butt
  Cc: Johannes Weiner, Greg Thelen, Alexander Viro, Vladimir Davydov,
	Andrew Morton, Linux MM, linux-fsdevel, LKML

On Fri 27-10-17 13:50:47, Shakeel Butt wrote:
> > Why is OOM-disabling a thing? Why isn't this simply a "kill everything
> > else before you kill me"? It's crashing the kernel in trying to
> > protect a userspace application. How is that not insane?
> 
> In parallel to other discussion, I think we should definitely move
> from "completely oom-disabled" semantics to something similar to "kill
> me last" semantics. Is there any objection to this idea?

Could you be more specific what you mean?

-- 
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: [PATCH] fs, mm: account filp and names caches to kmemcg
  2017-10-30  8:29                                                 ` Michal Hocko
@ 2017-10-30 19:28                                                   ` Shakeel Butt
  2017-10-31  8:00                                                     ` Michal Hocko
  0 siblings, 1 reply; 52+ messages in thread
From: Shakeel Butt @ 2017-10-30 19:28 UTC (permalink / raw)
  To: Michal Hocko
  Cc: Johannes Weiner, Greg Thelen, Alexander Viro, Vladimir Davydov,
	Andrew Morton, Linux MM, linux-fsdevel, LKML

On Mon, Oct 30, 2017 at 1:29 AM, Michal Hocko <mhocko@kernel.org> wrote:
> On Fri 27-10-17 13:50:47, Shakeel Butt wrote:
>> > Why is OOM-disabling a thing? Why isn't this simply a "kill everything
>> > else before you kill me"? It's crashing the kernel in trying to
>> > protect a userspace application. How is that not insane?
>>
>> In parallel to other discussion, I think we should definitely move
>> from "completely oom-disabled" semantics to something similar to "kill
>> me last" semantics. Is there any objection to this idea?
>
> Could you be more specific what you mean?
>

I get the impression that the main reason behind the complexity of
the oom-killer is allowing processes to be protected from the oom-killer,
i.e. disabling oom-killing of a process by setting
/proc/[pid]/oom_score_adj to -1000. So, instead of oom-disabling, add
an interface which will let users/admins set a process to be
oom-killed only as a last resort.

^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: [PATCH] fs, mm: account filp and names caches to kmemcg
  2017-10-30 19:28                                                   ` Shakeel Butt
@ 2017-10-31  8:00                                                     ` Michal Hocko
  2017-10-31 16:49                                                       ` Johannes Weiner
  0 siblings, 1 reply; 52+ messages in thread
From: Michal Hocko @ 2017-10-31  8:00 UTC (permalink / raw)
  To: Shakeel Butt
  Cc: Johannes Weiner, Greg Thelen, Alexander Viro, Vladimir Davydov,
	Andrew Morton, Linux MM, linux-fsdevel, LKML

On Mon 30-10-17 12:28:13, Shakeel Butt wrote:
> On Mon, Oct 30, 2017 at 1:29 AM, Michal Hocko <mhocko@kernel.org> wrote:
> > On Fri 27-10-17 13:50:47, Shakeel Butt wrote:
> >> > Why is OOM-disabling a thing? Why isn't this simply a "kill everything
> >> > else before you kill me"? It's crashing the kernel in trying to
> >> > protect a userspace application. How is that not insane?
> >>
> >> In parallel to other discussion, I think we should definitely move
> >> from "completely oom-disabled" semantics to something similar to "kill
> >> me last" semantics. Is there any objection to this idea?
> >
> > Could you be more specific what you mean?
> >
> 
> I get the impression that the main reason behind the complexity of
> oom-killer is allowing processes to be protected from the oom-killer
> i.e. disabling oom-killing a process by setting
> /proc/[pid]/oom_score_adj to -1000. So, instead of oom-disabling, add
> an interface which will let users/admins set a process to be
> oom-killed as a last resort.

If a process opts in to be oom-disabled, it needs CAP_SYS_RESOURCE and
probably has a strong reason to do that, e.g. to avoid an unexpected
SIGKILL which could leave inconsistent data behind. We cannot simply
break that contract. Yes, it is a PITA configuration to support, but it
has its reasons to exist. We do not guarantee it 100%, though; e.g. the
global case just panics if there is no eligible task. It is the
responsibility of userspace to make sure that the protected task doesn't
blow up completely; otherwise they are on their own. We should do
something similar for the memcg case: protect those tasks as long as we
are able to make forward progress, and then either give them ENOMEM or
overcharge. Which one to go with requires more discussion, but I do not
think that an unexpected SIGKILL is the way to go. We just want to give
those tasks a chance to do a graceful shutdown.
-- 
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: [PATCH] fs, mm: account filp and names caches to kmemcg
  2017-10-31  8:00                                                     ` Michal Hocko
@ 2017-10-31 16:49                                                       ` Johannes Weiner
  2017-10-31 18:50                                                         ` Michal Hocko
  0 siblings, 1 reply; 52+ messages in thread
From: Johannes Weiner @ 2017-10-31 16:49 UTC (permalink / raw)
  To: Michal Hocko
  Cc: Shakeel Butt, Greg Thelen, Alexander Viro, Vladimir Davydov,
	Andrew Morton, Linux MM, linux-fsdevel, LKML

On Tue, Oct 31, 2017 at 09:00:48AM +0100, Michal Hocko wrote:
> On Mon 30-10-17 12:28:13, Shakeel Butt wrote:
> > On Mon, Oct 30, 2017 at 1:29 AM, Michal Hocko <mhocko@kernel.org> wrote:
> > > On Fri 27-10-17 13:50:47, Shakeel Butt wrote:
> > >> > Why is OOM-disabling a thing? Why isn't this simply a "kill everything
> > >> > else before you kill me"? It's crashing the kernel in trying to
> > >> > protect a userspace application. How is that not insane?
> > >>
> > >> In parallel to other discussion, I think we should definitely move
> > >> from "completely oom-disabled" semantics to something similar to "kill
> > >> me last" semantics. Is there any objection to this idea?
> > >
> > > Could you be more specific what you mean?
> > 
> > I get the impression that the main reason behind the complexity of
> > oom-killer is allowing processes to be protected from the oom-killer
> > i.e. disabling oom-killing a process by setting
> > /proc/[pid]/oom_score_adj to -1000. So, instead of oom-disabling, add
> > an interface which will let users/admins set a process to be
> > oom-killed as a last resort.
> 
> If a process opts in to be oom disabled it needs CAP_SYS_RESOURCE and it
> probably has a strong reason to do that. E.g. no unexpected SIGKILL
> which could leave inconsistent data behind. We cannot simply break that
> contract. Yes, it is a PITA configuration to support but it has its
> reasons to exist.

I don't think that's true. The most prominent users are things like X
and sshd, and all they wanted to say was "kill me last."

If sshd were to have a bug and swell up, currently the system would
kill everything and then panic. It'd be much better to kill sshd at
the end and let the init system restart it.

Can you describe a scenario in which the NEVERKILL semantics actually
make sense? You're still OOM-killing the task anyway; it's not like it
can run without the kernel. So why kill the kernel?
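(For what it's worth, this "kill me last" / restart-by-init leaning is
something admins can already approximate today, e.g. via systemd's
OOMScoreAdjust= setting. An illustrative drop-in, not taken from any
distribution default:)

```ini
# /etc/systemd/system/sshd.service.d/oom.conf (illustrative)
[Service]
# Negative values shield the service; -1000 would oom-disable it,
# which is exactly the NEVERKILL semantics questioned above.
OOMScoreAdjust=-500
# Let the init system restart sshd if it does get killed.
Restart=on-failure
```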

^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: [PATCH] fs, mm: account filp and names caches to kmemcg
  2017-10-31 16:49                                                       ` Johannes Weiner
@ 2017-10-31 18:50                                                         ` Michal Hocko
  0 siblings, 0 replies; 52+ messages in thread
From: Michal Hocko @ 2017-10-31 18:50 UTC (permalink / raw)
  To: Johannes Weiner
  Cc: Shakeel Butt, Greg Thelen, Alexander Viro, Vladimir Davydov,
	Andrew Morton, Linux MM, linux-fsdevel, LKML

On Tue 31-10-17 12:49:59, Johannes Weiner wrote:
> On Tue, Oct 31, 2017 at 09:00:48AM +0100, Michal Hocko wrote:
> > On Mon 30-10-17 12:28:13, Shakeel Butt wrote:
> > > On Mon, Oct 30, 2017 at 1:29 AM, Michal Hocko <mhocko@kernel.org> wrote:
> > > > On Fri 27-10-17 13:50:47, Shakeel Butt wrote:
> > > >> > Why is OOM-disabling a thing? Why isn't this simply a "kill everything
> > > >> > else before you kill me"? It's crashing the kernel in trying to
> > > >> > protect a userspace application. How is that not insane?
> > > >>
> > > >> In parallel to other discussion, I think we should definitely move
> > > >> from "completely oom-disabled" semantics to something similar to "kill
> > > >> me last" semantics. Is there any objection to this idea?
> > > >
> > > > Could you be more specific what you mean?
> > > 
> > > I get the impression that the main reason behind the complexity of
> > > oom-killer is allowing processes to be protected from the oom-killer
> > > i.e. disabling oom-killing a process by setting
> > > /proc/[pid]/oom_score_adj to -1000. So, instead of oom-disabling, add
> > > an interface which will let users/admins set a process to be
> > > oom-killed as a last resort.
> > 
> > If a process opts in to be oom disabled it needs CAP_SYS_RESOURCE and it
> > probably has a strong reason to do that. E.g. no unexpected SIGKILL
> > which could leave inconsistent data behind. We cannot simply break that
> > contract. Yes, it is a PITA configuration to support but it has its
> > reasons to exist.
> 
> I don't think that's true. The most prominent users are things like X
> and sshd, and all they wanted to say was "kill me last."

This might be the case for desktop environments, and I would tend to
agree that those can handle a restart easily. I was considering
applications which need an explicit shutdown and require manual
intervention when that doesn't happen. Think of a database or similar.

> If sshd were to have a bug and swell up, currently the system would
> kill everything and then panic. It'd be much better to kill sshd at
> the end and let the init system restart it.
> 
> Can you describe a scenario in which the NEVERKILL semantics actually
> make sense? You're still OOM-killing the task anyway; it's not like it
> can run without the kernel. So why kill the kernel?

Yes, but you start with a clean state after reboot, which is rather a
different thing than restarting from an inconsistent state.

In any case I am not trying to defend this configuration! I really
dislike it and it shouldn't ever have been introduced. But it has been
the established behavior for many years and I am not really willing to
break it without a _really strong_ reason.
-- 
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 52+ messages in thread

end of thread, other threads:[~2017-10-31 18:50 UTC | newest]

Thread overview: 52+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2017-10-05 22:21 [PATCH] fs, mm: account filp and names caches to kmemcg Shakeel Butt
2017-10-06  7:59 ` Michal Hocko
2017-10-06 19:33   ` Shakeel Butt
2017-10-09  6:24     ` Michal Hocko
2017-10-09 17:52       ` Greg Thelen
2017-10-09 18:04         ` Michal Hocko
2017-10-09 18:17           ` Michal Hocko
2017-10-10  9:10             ` Michal Hocko
2017-10-10 22:21               ` Shakeel Butt
2017-10-11  9:09                 ` Michal Hocko
2017-10-09 20:26         ` Johannes Weiner
2017-10-10  9:14           ` Michal Hocko
2017-10-10 14:17             ` Johannes Weiner
2017-10-10 14:24               ` Michal Hocko
2017-10-12 19:03                 ` Johannes Weiner
2017-10-12 23:57                   ` Greg Thelen
2017-10-13  6:51                     ` Michal Hocko
2017-10-13  6:35                   ` Michal Hocko
2017-10-13  7:00                     ` Michal Hocko
2017-10-13 15:24                       ` Michal Hocko
2017-10-24 12:18                         ` Michal Hocko
2017-10-24 17:54                           ` Johannes Weiner
2017-10-24 16:06                         ` Johannes Weiner
2017-10-24 16:22                           ` Michal Hocko
2017-10-24 17:23                             ` Johannes Weiner
2017-10-24 17:55                               ` Michal Hocko
2017-10-24 18:58                                 ` Johannes Weiner
2017-10-24 20:15                                   ` Michal Hocko
2017-10-25  6:51                                     ` Greg Thelen
2017-10-25  7:15                                       ` Michal Hocko
2017-10-25 13:11                                         ` Johannes Weiner
2017-10-25 14:12                                           ` Michal Hocko
2017-10-25 16:44                                             ` Johannes Weiner
2017-10-25 17:29                                               ` Michal Hocko
2017-10-25 18:11                                                 ` Johannes Weiner
2017-10-25 19:00                                                   ` Michal Hocko
2017-10-25 21:13                                                     ` Johannes Weiner
2017-10-25 22:49                                                       ` Greg Thelen
2017-10-26  7:49                                                         ` Michal Hocko
2017-10-26 12:45                                                           ` Tetsuo Handa
2017-10-26 14:31                                                         ` Johannes Weiner
2017-10-26 19:56                                                           ` Greg Thelen
2017-10-27  8:20                                                             ` Michal Hocko
2017-10-27 20:50                                               ` Shakeel Butt
2017-10-30  8:29                                                 ` Michal Hocko
2017-10-30 19:28                                                   ` Shakeel Butt
2017-10-31  8:00                                                     ` Michal Hocko
2017-10-31 16:49                                                       ` Johannes Weiner
2017-10-31 18:50                                                         ` Michal Hocko
2017-10-24 15:45                     ` Johannes Weiner
2017-10-24 16:30                       ` Michal Hocko
2017-10-10 23:32 ` Al Viro
