* [PATCH] fs, mm: account filp and names caches to kmemcg
@ 2017-10-05 22:21 Shakeel Butt
  2017-10-06  7:59 ` Michal Hocko
  2017-10-10 23:32 ` Al Viro
  0 siblings, 2 replies; 52+ messages in thread
From: Shakeel Butt @ 2017-10-05 22:21 UTC (permalink / raw)
  To: Alexander Viro, Vladimir Davydov, Michal Hocko, Greg Thelen
  Cc: Andrew Morton, linux-mm, linux-fsdevel, linux-kernel, Shakeel Butt

The allocations from the filp and names kmem caches can be directly
triggered by user space applications. A buggy application can consume a
significant amount of unaccounted system memory. Though we have not
noticed such buggy applications in production, upon close inspection we
found that a lot of machines spend a very significant amount of memory
on these caches. So, these caches should be accounted to kmemcg.

Signed-off-by: Shakeel Butt <shakeelb@google.com>
---
 fs/dcache.c     | 2 +-
 fs/file_table.c | 2 +-
 2 files changed, 2 insertions(+), 2 deletions(-)

diff --git a/fs/dcache.c b/fs/dcache.c
index f90141387f01..fb3449161063 100644
--- a/fs/dcache.c
+++ b/fs/dcache.c
@@ -3642,7 +3642,7 @@ void __init vfs_caches_init_early(void)
 void __init vfs_caches_init(void)
 {
 	names_cachep = kmem_cache_create("names_cache", PATH_MAX, 0,
-			SLAB_HWCACHE_ALIGN|SLAB_PANIC, NULL);
+			SLAB_HWCACHE_ALIGN|SLAB_PANIC|SLAB_ACCOUNT, NULL);
 
 	dcache_init();
 	inode_init();
diff --git a/fs/file_table.c b/fs/file_table.c
index 61517f57f8ef..567888cdf7d3 100644
--- a/fs/file_table.c
+++ b/fs/file_table.c
@@ -312,7 +312,7 @@ void put_filp(struct file *file)
 void __init files_init(void)
 {
 	filp_cachep = kmem_cache_create("filp", sizeof(struct file), 0,
-			SLAB_HWCACHE_ALIGN | SLAB_PANIC, NULL);
+			SLAB_HWCACHE_ALIGN | SLAB_PANIC | SLAB_ACCOUNT, NULL);
 	percpu_counter_init(&nr_files, 0, GFP_KERNEL);
 }
-- 
2.14.2.920.gcf0c67979c-goog

^ permalink raw reply related	[flat|nested] 52+ messages in thread
* Re: [PATCH] fs, mm: account filp and names caches to kmemcg
  2017-10-05 22:21 [PATCH] fs, mm: account filp and names caches to kmemcg Shakeel Butt
@ 2017-10-06  7:59 ` Michal Hocko
  2017-10-06 19:33   ` Shakeel Butt
  1 sibling, 1 reply; 52+ messages in thread
From: Michal Hocko @ 2017-10-06  7:59 UTC (permalink / raw)
  To: Shakeel Butt
  Cc: Alexander Viro, Vladimir Davydov, Greg Thelen, Andrew Morton,
	linux-mm, linux-fsdevel, linux-kernel

On Thu 05-10-17 15:21:44, Shakeel Butt wrote:
> The allocations from the filp and names kmem caches can be directly
> triggered by user space applications. A buggy application can consume
> a significant amount of unaccounted system memory. Though we have not
> noticed such buggy applications in production, upon close inspection
> we found that a lot of machines spend a very significant amount of
> memory on these caches. So, these caches should be accounted to
> kmemcg.
>
> Signed-off-by: Shakeel Butt <shakeelb@google.com>
> ---
>  fs/dcache.c     | 2 +-
>  fs/file_table.c | 2 +-
>  2 files changed, 2 insertions(+), 2 deletions(-)
>
> diff --git a/fs/dcache.c b/fs/dcache.c
> index f90141387f01..fb3449161063 100644
> --- a/fs/dcache.c
> +++ b/fs/dcache.c
> @@ -3642,7 +3642,7 @@ void __init vfs_caches_init_early(void)
>  void __init vfs_caches_init(void)
>  {
>  	names_cachep = kmem_cache_create("names_cache", PATH_MAX, 0,
> -			SLAB_HWCACHE_ALIGN|SLAB_PANIC, NULL);
> +			SLAB_HWCACHE_ALIGN|SLAB_PANIC|SLAB_ACCOUNT, NULL);

I might be wrong, but isn't the names cache only holding temporary
objects used for path resolution, which are not stored anywhere?

>
>  	dcache_init();
>  	inode_init();
> diff --git a/fs/file_table.c b/fs/file_table.c
> index 61517f57f8ef..567888cdf7d3 100644
> --- a/fs/file_table.c
> +++ b/fs/file_table.c
> @@ -312,7 +312,7 @@ void put_filp(struct file *file)
>  void __init files_init(void)
>  {
>  	filp_cachep = kmem_cache_create("filp", sizeof(struct file), 0,
> -			SLAB_HWCACHE_ALIGN | SLAB_PANIC, NULL);
> +			SLAB_HWCACHE_ALIGN | SLAB_PANIC | SLAB_ACCOUNT, NULL);
>  	percpu_counter_init(&nr_files, 0, GFP_KERNEL);
>  }

Don't we have a limit for the maximum number of open files?
-- 
Michal Hocko
SUSE Labs
* Re: [PATCH] fs, mm: account filp and names caches to kmemcg
  2017-10-06  7:59 ` Michal Hocko
@ 2017-10-06 19:33   ` Shakeel Butt
  2017-10-09  6:24     ` Michal Hocko
  0 siblings, 1 reply; 52+ messages in thread
From: Shakeel Butt @ 2017-10-06 19:33 UTC (permalink / raw)
  To: Michal Hocko
  Cc: Alexander Viro, Vladimir Davydov, Greg Thelen, Andrew Morton,
	Linux MM, linux-fsdevel, LKML

>>  	names_cachep = kmem_cache_create("names_cache", PATH_MAX, 0,
>> -			SLAB_HWCACHE_ALIGN|SLAB_PANIC, NULL);
>> +			SLAB_HWCACHE_ALIGN|SLAB_PANIC|SLAB_ACCOUNT, NULL);
>
> I might be wrong, but isn't the names cache only holding temporary
> objects used for path resolution, which are not stored anywhere?
>

Even though they're temporary, many containers can together use a
significant amount of transient uncharged memory. We've seen machines
with 100s of MiBs in names_cache.

>>  	filp_cachep = kmem_cache_create("filp", sizeof(struct file), 0,
>> -			SLAB_HWCACHE_ALIGN | SLAB_PANIC, NULL);
>> +			SLAB_HWCACHE_ALIGN | SLAB_PANIC | SLAB_ACCOUNT, NULL);
>>  	percpu_counter_init(&nr_files, 0, GFP_KERNEL);
>>  }
>
> Don't we have a limit for the maximum number of open files?
>

Yes, there is a system limit on the maximum number of open files.
However, this limit is shared between the different users on the
system, and one user can hog this resource. To cater for that, we set
the maximum limit very high and let the memory limit of each user
limit the number of files they can open.
* Re: [PATCH] fs, mm: account filp and names caches to kmemcg
  2017-10-06 19:33   ` Shakeel Butt
@ 2017-10-09  6:24     ` Michal Hocko
  2017-10-09 17:52       ` Greg Thelen
  0 siblings, 1 reply; 52+ messages in thread
From: Michal Hocko @ 2017-10-09  6:24 UTC (permalink / raw)
  To: Shakeel Butt
  Cc: Alexander Viro, Vladimir Davydov, Greg Thelen, Andrew Morton,
	Linux MM, linux-fsdevel, LKML

On Fri 06-10-17 12:33:03, Shakeel Butt wrote:
> >>  	names_cachep = kmem_cache_create("names_cache", PATH_MAX, 0,
> >> -			SLAB_HWCACHE_ALIGN|SLAB_PANIC, NULL);
> >> +			SLAB_HWCACHE_ALIGN|SLAB_PANIC|SLAB_ACCOUNT, NULL);
> >
> > I might be wrong, but isn't the names cache only holding temporary
> > objects used for path resolution, which are not stored anywhere?
>
> Even though they're temporary, many containers can together use a
> significant amount of transient uncharged memory. We've seen machines
> with 100s of MiBs in names_cache.

Yes, that might be possible, but are we prepared for random ENOMEM
from vfs calls which need to allocate a temporary name?

> >>  	filp_cachep = kmem_cache_create("filp", sizeof(struct file), 0,
> >> -			SLAB_HWCACHE_ALIGN | SLAB_PANIC, NULL);
> >> +			SLAB_HWCACHE_ALIGN | SLAB_PANIC | SLAB_ACCOUNT, NULL);
> >>  	percpu_counter_init(&nr_files, 0, GFP_KERNEL);
> >>  }
> >
> > Don't we have a limit for the maximum number of open files?
>
> Yes, there is a system limit on the maximum number of open files.
> However, this limit is shared between the different users on the
> system, and one user can hog this resource. To cater for that, we set
> the maximum limit very high and let the memory limit of each user
> limit the number of files they can open.

Similarly here. Are all syscalls allocating a fd prepared to return
ENOMEM?
-- 
Michal Hocko
SUSE Labs
* Re: [PATCH] fs, mm: account filp and names caches to kmemcg
  2017-10-09  6:24     ` Michal Hocko
@ 2017-10-09 17:52       ` Greg Thelen
  2017-10-09 18:04         ` Michal Hocko
  2017-10-09 20:26         ` Johannes Weiner
  0 siblings, 2 replies; 52+ messages in thread
From: Greg Thelen @ 2017-10-09 17:52 UTC (permalink / raw)
  To: Michal Hocko, Shakeel Butt
  Cc: Alexander Viro, Vladimir Davydov, Andrew Morton, Linux MM,
	linux-fsdevel, LKML

Michal Hocko <mhocko@kernel.org> wrote:
[...]
> Similarly here. Are all syscalls allocating a fd prepared to return
> ENOMEM?

Even before this patch I find memcg oom handling inconsistent. Page
cache pages trigger the oom killer and may allow the caller to succeed
once the kernel retries. But kmem allocations don't call the oom
killer; they surface errors to user space. This makes memcg hard to
use for memory overcommit, because it's desirable for a high priority
task to transparently kill a lower priority task using the memcg oom
killer.

A few ideas on how to make it more flexible:

a) Go back to memcg oom killing within memcg charging. This runs the
   risk of oom killing while the caller holds locks which oom victim
   selection or oom victim termination may need. Google's been running
   this way for a while.

b) Have every syscall return do something similar to the page fault
   handler: kmem allocations in an oom memcg mark the current task as
   needing an oom check and return NULL. If marked oom, syscall exit
   would use mem_cgroup_oom_synchronize() before retrying the syscall.
   Seems risky. I doubt every syscall is compatible with such a
   restart.

c) Overcharge kmem to the oom memcg and queue an async memcg limit
   checker, which will oom kill if needed.

Comments?

Demo program which eventually gets ENOSPC from mkdir.

$ cat /tmp/t
while umount /tmp/mnt; do true; done
mkdir -p /tmp/mnt
mount -t tmpfs nodev /tmp/mnt
cd /dev/cgroup/memory
rmdir t
mkdir t
echo 32M > t/memory.limit_in_bytes
(echo $BASHPID > t/cgroup.procs && cd /tmp/mnt && exec /tmp/mkdirs)

$ cat /tmp/mkdirs.c
#include <err.h>
#include <stdio.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <sys/types.h>

int main()
{
	int i;
	char name[32];

	if (mlockall(MCL_CURRENT|MCL_FUTURE))
		err(1, "mlockall");
	for (i = 0; i < (1<<20); i++) {
		sprintf(name, "%d", i);
		if (mkdir(name, 0700))
			err(1, "mkdir");
	}
	printf("done\n");
	return 0;
}
* Re: [PATCH] fs, mm: account filp and names caches to kmemcg
  2017-10-09 17:52       ` Greg Thelen
@ 2017-10-09 18:04         ` Michal Hocko
  2017-10-09 18:17           ` Michal Hocko
  0 siblings, 1 reply; 52+ messages in thread
From: Michal Hocko @ 2017-10-09 18:04 UTC (permalink / raw)
  To: Greg Thelen
  Cc: Shakeel Butt, Alexander Viro, Vladimir Davydov, Andrew Morton,
	Linux MM, linux-fsdevel, LKML, Johannes Weiner

[CC Johannes - the thread starts
http://lkml.kernel.org/r/20171005222144.123797-1-shakeelb@google.com]

On Mon 09-10-17 10:52:44, Greg Thelen wrote:
> Even before this patch I find memcg oom handling inconsistent. Page
> cache pages trigger the oom killer and may allow the caller to
> succeed once the kernel retries. But kmem allocations don't call the
> oom killer; they surface errors to user space. [...]
>
> A few ideas on how to make it more flexible:
>
> a) Go back to memcg oom killing within memcg charging. [...]
>
> b) Have every syscall return do something similar to the page fault
>    handler [...]
>
> c) Overcharge kmem to the oom memcg and queue an async memcg limit
>    checker, which will oom kill if needed.
>
> Comments?
>
> Demo program which eventually gets ENOSPC from mkdir. [...]
-- 
Michal Hocko
SUSE Labs
* Re: [PATCH] fs, mm: account filp and names caches to kmemcg
  2017-10-09 18:04         ` Michal Hocko
@ 2017-10-09 18:17           ` Michal Hocko
  2017-10-10  9:10             ` Michal Hocko
  0 siblings, 1 reply; 52+ messages in thread
From: Michal Hocko @ 2017-10-09 18:17 UTC (permalink / raw)
  To: Greg Thelen
  Cc: Shakeel Butt, Alexander Viro, Vladimir Davydov, Andrew Morton,
	Linux MM, linux-fsdevel, LKML, Johannes Weiner

On Mon 09-10-17 20:04:09, Michal Hocko wrote:
> [CC Johannes - the thread starts
> http://lkml.kernel.org/r/20171005222144.123797-1-shakeelb@google.com]
>
> On Mon 09-10-17 10:52:44, Greg Thelen wrote:
[...]
> > A few ideas on how to make it more flexible:
> >
> > a) Go back to memcg oom killing within memcg charging. This runs
> >    the risk of oom killing while the caller holds locks which oom
> >    victim selection or oom victim termination may need. Google's
> >    been running this way for a while.

We can actually reopen this discussion now that the oom handling is
async due to the oom_reaper. At least for the v2 interface. I would
have to think about it much more, but the primary concern for this
patch was whether we really need/want to charge short term objects
which do not outlive a single syscall.

> > b) Have every syscall return do something similar to the page fault
> >    handler: kmem allocations in an oom memcg mark the current task
> >    as needing an oom check and return NULL. If marked oom, syscall
> >    exit would use mem_cgroup_oom_synchronize() before retrying the
> >    syscall. Seems risky. I doubt every syscall is compatible with
> >    such a restart.

yes, this is simply a no go

> > c) Overcharge kmem to the oom memcg and queue an async memcg limit
> >    checker, which will oom kill if needed.

This is what we have the max limit for.
-- 
Michal Hocko
SUSE Labs
* Re: [PATCH] fs, mm: account filp and names caches to kmemcg
  2017-10-09 18:17           ` Michal Hocko
@ 2017-10-10  9:10             ` Michal Hocko
  2017-10-10 22:21               ` Shakeel Butt
  0 siblings, 1 reply; 52+ messages in thread
From: Michal Hocko @ 2017-10-10  9:10 UTC (permalink / raw)
  To: Greg Thelen
  Cc: Shakeel Butt, Alexander Viro, Vladimir Davydov, Andrew Morton,
	Linux MM, linux-fsdevel, LKML, Johannes Weiner

On Mon 09-10-17 20:17:54, Michal Hocko wrote:
> the primary concern for this patch was whether we really need/want to
> charge short term objects which do not outlive a single syscall.

Let me expand on this some more. What is the benefit of kmem accounting
of such an object? It cannot stop any runaway, as syscall-lifetime
allocations are bound to the number of processes, which we kind of
contain by other means. If we do account them, then we put memory
pressure on the memcg due to something that cannot be reclaimed by any
means. Even the memcg OOM killer would simply kill a single task while
there might be others consuming the same type of memory.

So what is the actual point in accounting these? Does it help to
contain any workload better? What kind of workload?

Or am I completely wrong and name objects can outlive a syscall
considerably?
-- 
Michal Hocko
SUSE Labs
* Re: [PATCH] fs, mm: account filp and names caches to kmemcg
  2017-10-10  9:10             ` Michal Hocko
@ 2017-10-10 22:21               ` Shakeel Butt
  2017-10-11  9:09                 ` Michal Hocko
  0 siblings, 1 reply; 52+ messages in thread
From: Shakeel Butt @ 2017-10-10 22:21 UTC (permalink / raw)
  To: Michal Hocko
  Cc: Greg Thelen, Alexander Viro, Vladimir Davydov, Andrew Morton,
	Linux MM, linux-fsdevel, LKML, Johannes Weiner

On Sun, Oct 8, 2017 at 11:24 PM, Michal Hocko <mhocko@kernel.org> wrote:
> On Fri 06-10-17 12:33:03, Shakeel Butt wrote:
>> Even though they're temporary, many containers can together use a
>> significant amount of transient uncharged memory. We've seen
>> machines with 100s of MiBs in names_cache.
>
> Yes, that might be possible, but are we prepared for random ENOMEM
> from vfs calls which need to allocate a temporary name?

I looked at all the syscalls which invoke allocations from
'names_cache' and tried to narrow down those whose man page does not
mention that they can return ENOMEM. I found a couple of syscalls,
like truncate(), readdir() & getdents(), which do not mention that
they can return ENOMEM, but this patch will make them return ENOMEM.

> Similarly here. Are all syscalls allocating a fd prepared to return
> ENOMEM?

For filp, I found _sysctl(). However, the man page says not to use it.

On Tue, Oct 10, 2017 at 2:10 AM, Michal Hocko <mhocko@kernel.org> wrote:
> On Mon 09-10-17 20:17:54, Michal Hocko wrote:
>> the primary concern for this patch was whether we really need/want
>> to charge short term objects which do not outlive a single syscall.
>
> Let me expand on this some more. What is the benefit of kmem
> accounting of such an object? It cannot stop any runaway, as
> syscall-lifetime allocations are bound to the number of processes,
> which we kind of contain by other means.

We can contain them by limiting the number of processes or threads,
but for us applications having thousands of threads is very common.
So, limiting the number of threads/processes will not work.

> If we do account them, then we put memory pressure on the memcg due
> to something that cannot be reclaimed by any means. Even the memcg
> OOM killer would simply kill a single task while there might be
> others consuming the same type of memory.
>
> So what is the actual point in accounting these? Does it help to
> contain any workload better? What kind of workload?

I think the benefits will be isolation and more accurate billing. As I
have said before, we have observed 100s of MiBs in names_cache on many
machines, and the cumulative amount is not something we can ignore as
just memory overhead.

> Or am I completely wrong and name objects can outlive a syscall
> considerably?

No, I didn't find any instance of the name objects outliving the
syscall.

Anyways, we can discuss names_cache more; do you have any objection to
charging filp?
* Re: [PATCH] fs, mm: account filp and names caches to kmemcg
  2017-10-10 22:21               ` Shakeel Butt
@ 2017-10-11  9:09                 ` Michal Hocko
  0 siblings, 0 replies; 52+ messages in thread
From: Michal Hocko @ 2017-10-11  9:09 UTC (permalink / raw)
  To: Shakeel Butt
  Cc: Greg Thelen, Alexander Viro, Vladimir Davydov, Andrew Morton,
	Linux MM, linux-fsdevel, LKML, Johannes Weiner

On Tue 10-10-17 15:21:53, Shakeel Butt wrote:
[...]
> We can contain them by limiting the number of processes or threads,
> but for us applications having thousands of threads is very common.
> So, limiting the number of threads/processes will not work.

Well, the number of tasks is already contained in a way, because we do
account each task's (kernel) stack AFAIR.

> I think the benefits will be isolation and more accurate billing. As
> I have said before, we have observed 100s of MiBs in names_cache on
> many machines, and the cumulative amount is not something we can
> ignore as just memory overhead.

I do agree with Al arguing that this is rather dubious and can add
overhead without a good reason.

> No, I didn't find any instance of the name objects outliving the
> syscall.
>
> Anyways, we can discuss names_cache more; do you have any objection
> to charging filp?

I think filp makes more sense. But let's drop the names for now. I am
not really convinced it is a good idea.
-- 
Michal Hocko
SUSE Labs
* Re: [PATCH] fs, mm: account filp and names caches to kmemcg
  2017-10-09 17:52       ` Greg Thelen
@ 2017-10-09 20:26         ` Johannes Weiner
  2017-10-10  9:14           ` Michal Hocko
  1 sibling, 1 reply; 52+ messages in thread
From: Johannes Weiner @ 2017-10-09 20:26 UTC (permalink / raw)
  To: Greg Thelen
  Cc: Michal Hocko, Shakeel Butt, Alexander Viro, Vladimir Davydov,
	Andrew Morton, Linux MM, linux-fsdevel, LKML

On Mon, Oct 09, 2017 at 10:52:44AM -0700, Greg Thelen wrote:
[...]
> Even before this patch I find memcg oom handling inconsistent. Page
> cache pages trigger the oom killer and may allow the caller to
> succeed once the kernel retries. But kmem allocations don't call the
> oom killer.

It's consistent in the sense that only page faults enable the memcg
OOM killer. It's not the type of memory that decides, it's whether the
allocation context has a channel to communicate an error to userspace.

Whether userspace is able to handle -ENOMEM from syscalls was a voiced
concern at the time this patch was merged, although there haven't been
any reports so far, and it seemed like the lesser evil between that
and deadlocking the kernel.

If we could find a way to invoke the OOM killer safely, I would
welcome such patches.

> They surface errors to user space. This makes memcg hard to use for
> memory overcommit, because it's desirable for a high priority task
> to transparently kill a lower priority task using the memcg oom
> killer.
>
> A few ideas on how to make it more flexible:
>
> a) Go back to memcg oom killing within memcg charging. This runs the
>    risk of oom killing while the caller holds locks which oom victim
>    selection or oom victim termination may need. Google's been
>    running this way for a while.

We've had real-life reports of this breaking, so even if it works for
some people, I'd rather not revert to that way of doing things.

> b) Have every syscall return do something similar to the page fault
>    handler: kmem allocations in an oom memcg mark the current task
>    as needing an oom check and return NULL. If marked oom, syscall
>    exit would use mem_cgroup_oom_synchronize() before retrying the
>    syscall. Seems risky. I doubt every syscall is compatible with
>    such a restart.

That sounds like a lateral move.

> c) Overcharge kmem to the oom memcg and queue an async memcg limit
>    checker, which will oom kill if needed.

This makes the most sense to me. Architecturally, I imagine this would
look like b), with an OOM handler at the point of return to userspace,
except that we'd overcharge instead of retrying the syscall.
* Re: [PATCH] fs, mm: account filp and names caches to kmemcg
  2017-10-09 20:26         ` Johannes Weiner
@ 2017-10-10  9:14           ` Michal Hocko
  2017-10-10 14:17             ` Johannes Weiner
  0 siblings, 1 reply; 52+ messages in thread
From: Michal Hocko @ 2017-10-10  9:14 UTC (permalink / raw)
  To: Johannes Weiner
  Cc: Greg Thelen, Shakeel Butt, Alexander Viro, Vladimir Davydov,
	Andrew Morton, Linux MM, linux-fsdevel, LKML

On Mon 09-10-17 16:26:13, Johannes Weiner wrote:
[...]
> It's consistent in the sense that only page faults enable the memcg
> OOM killer. It's not the type of memory that decides, it's whether
> the allocation context has a channel to communicate an error to
> userspace.
>
> Whether userspace is able to handle -ENOMEM from syscalls was a
> voiced concern at the time this patch was merged, although there
> haven't been any reports so far,

Well, I remember reports about MAP_POPULATE breaking, or at least
having unexpected behavior.

> and it seemed like the lesser evil between that and deadlocking the
> kernel.

agreed on this part though

> If we could find a way to invoke the OOM killer safely, I would
> welcome such patches.

Well, we should be able to do that with the oom_reaper. At least for
v2, which doesn't have synchronous userspace oom killing.

[...]
> > c) Overcharge kmem to the oom memcg and queue an async memcg limit
> >    checker, which will oom kill if needed.
>
> This makes the most sense to me. Architecturally, I imagine this
> would look like b), with an OOM handler at the point of return to
> userspace, except that we'd overcharge instead of retrying the
> syscall.

I do not think we should break the hard limit semantic if possible. We
can currently allow that for allocations which are very short term
(oom victims) or too important to fail, but allowing it for kmem
charges in general sounds like it would be too easy to run away.
-- 
Michal Hocko
SUSE Labs
* Re: [PATCH] fs, mm: account filp and names caches to kmemcg 2017-10-10 9:14 ` Michal Hocko @ 2017-10-10 14:17 ` Johannes Weiner 2017-10-10 14:24 ` Michal Hocko 0 siblings, 1 reply; 52+ messages in thread From: Johannes Weiner @ 2017-10-10 14:17 UTC (permalink / raw) To: Michal Hocko Cc: Greg Thelen, Shakeel Butt, Alexander Viro, Vladimir Davydov, Andrew Morton, Linux MM, linux-fsdevel, LKML On Tue, Oct 10, 2017 at 11:14:30AM +0200, Michal Hocko wrote: > On Mon 09-10-17 16:26:13, Johannes Weiner wrote: > > It's consistent in the sense that only page faults enable the memcg > > OOM killer. It's not the type of memory that decides, it's whether the > > allocation context has a channel to communicate an error to userspace. > > > > Whether userspace is able to handle -ENOMEM from syscalls was a voiced > > concern at the time this patch was merged, although there haven't been > > any reports so far, > > Well, I remember reports about MAP_POPULATE breaking or at least having > an unexpected behavior. Hm, that slipped past me. Did we do something about these? Or did they fix userspace? > Well, we should be able to do that with the oom_reaper. At least for v2 > which doesn't have synchronous userspace oom killing. I don't see how the OOM reaper is a guarantee as long as we have this: if (!down_read_trylock(&mm->mmap_sem)) { ret = false; trace_skip_task_reaping(tsk->pid); goto unlock_oom; } What do you mean by 'v2'? > > > c) Overcharge kmem to oom memcg and queue an async memcg limit checker, > > > which will oom kill if needed. > > > > This makes the most sense to me. Architecturally, I imagine this would > > look like b), with an OOM handler at the point of return to userspace, > > except that we'd overcharge instead of retrying the syscall. > > I do not think we should break the hard limit semantic if possible. 
We > can currently allow that for allocations which are very short term (oom > victims) or too important to fail but allowing that for kmem charges in > general sounds like too easy to runaway. I'm not sure there is a convenient way out of this. If we want to respect the hard limit AND guarantee allocation success, the OOM killer has to free memory reliably - which it doesn't. But if it did, we could also break the limit temporarily and have the OOM killer replenish the pool before that userspace app can continue. The allocation wouldn't have to be short-lived, since memory is fungible. Until the OOM killer is 100% reliable, we have the choice between sometimes deadlocking the cgroup tasks and everything that interacts with them, returning -ENOMEM for syscalls, or breaking the hard limit guarantee during memcg OOM. It seems breaking the limit temporarily in order to reclaim memory is the best option. There is kernel memory we don't account to the memcg already because we think it's probably not going to be significant, so the isolation isn't 100% watertight in the first place. And I'd rather have the worst-case effect of a cgroup OOMing be spilling over its hard limit than deadlocking things inside and outside the cgroup. ^ permalink raw reply [flat|nested] 52+ messages in thread
* Re: [PATCH] fs, mm: account filp and names caches to kmemcg 2017-10-10 14:17 ` Johannes Weiner @ 2017-10-10 14:24 ` Michal Hocko 2017-10-12 19:03 ` Johannes Weiner 0 siblings, 1 reply; 52+ messages in thread From: Michal Hocko @ 2017-10-10 14:24 UTC (permalink / raw) To: Johannes Weiner Cc: Greg Thelen, Shakeel Butt, Alexander Viro, Vladimir Davydov, Andrew Morton, Linux MM, linux-fsdevel, LKML On Tue 10-10-17 10:17:33, Johannes Weiner wrote: > On Tue, Oct 10, 2017 at 11:14:30AM +0200, Michal Hocko wrote: > > On Mon 09-10-17 16:26:13, Johannes Weiner wrote: > > > It's consistent in the sense that only page faults enable the memcg > > > OOM killer. It's not the type of memory that decides, it's whether the > > > allocation context has a channel to communicate an error to userspace. > > > > > > Whether userspace is able to handle -ENOMEM from syscalls was a voiced > > > concern at the time this patch was merged, although there haven't been > > > any reports so far, > > > > Well, I remember reports about MAP_POPULATE breaking or at least having > > an unexpected behavior. > > Hm, that slipped past me. Did we do something about these? Or did they > fix userspace? Well it was mostly LTP complaining. I have tried to fix that but Linus was against so we just documented that this is possible and MAP_POPULATE is not a guarantee. > > Well, we should be able to do that with the oom_reaper. At least for v2 > > which doesn't have synchronous userspace oom killing. > > I don't see how the OOM reaper is a guarantee as long as we have this: > > if (!down_read_trylock(&mm->mmap_sem)) { > ret = false; > trace_skip_task_reaping(tsk->pid); > goto unlock_oom; > } And we will simply mark the victim MMF_OOM_SKIP and hide it from the oom killer if we fail to get the mmap_sem after several attempts. This will allow to find a new victim. So we shouldn't deadlock. > What do you mean by 'v2'? 
cgroup v2 because the legacy memcg allowed sync wait for the oom killer and that would be a bigger problem from deep callchains for obvious reasons. > > > > c) Overcharge kmem to oom memcg and queue an async memcg limit checker, > > > > which will oom kill if needed. > > > > > > This makes the most sense to me. Architecturally, I imagine this would > > > look like b), with an OOM handler at the point of return to userspace, > > > except that we'd overcharge instead of retrying the syscall. > > > > I do not think we should break the hard limit semantic if possible. We > > can currently allow that for allocations which are very short term (oom > > victims) or too important to fail but allowing that for kmem charges in > > general sounds like too easy to runaway. > > I'm not sure there is a convenient way out of this. > > If we want to respect the hard limit AND guarantee allocation success, > the OOM killer has to free memory reliably - which it doesn't. But if > it did, we could also break the limit temporarily and have the OOM > killer replenish the pool before that userspace app can continue. The > allocation wouldn't have to be short-lived, since memory is fungible. If we can guarantee the oom killer is started then we can allow temporal access to reserves which is already implemented even for memcg. The thing is we do not invoke the oom killer... -- Michal Hocko SUSE Labs ^ permalink raw reply [flat|nested] 52+ messages in thread
* Re: [PATCH] fs, mm: account filp and names caches to kmemcg 2017-10-10 14:24 ` Michal Hocko @ 2017-10-12 19:03 ` Johannes Weiner 2017-10-12 23:57 ` Greg Thelen 2017-10-13 6:35 ` Michal Hocko 0 siblings, 2 replies; 52+ messages in thread From: Johannes Weiner @ 2017-10-12 19:03 UTC (permalink / raw) To: Michal Hocko Cc: Greg Thelen, Shakeel Butt, Alexander Viro, Vladimir Davydov, Andrew Morton, Linux MM, linux-fsdevel, LKML On Tue, Oct 10, 2017 at 04:24:34PM +0200, Michal Hocko wrote: > On Tue 10-10-17 10:17:33, Johannes Weiner wrote: > > On Tue, Oct 10, 2017 at 11:14:30AM +0200, Michal Hocko wrote: > > > On Mon 09-10-17 16:26:13, Johannes Weiner wrote: > > > > It's consistent in the sense that only page faults enable the memcg > > > > OOM killer. It's not the type of memory that decides, it's whether the > > > > allocation context has a channel to communicate an error to userspace. > > > > > > > > Whether userspace is able to handle -ENOMEM from syscalls was a voiced > > > > concern at the time this patch was merged, although there haven't been > > > > any reports so far, > > > > > > Well, I remember reports about MAP_POPULATE breaking or at least having > > > an unexpected behavior. > > > > Hm, that slipped past me. Did we do something about these? Or did they > > fix userspace? > > Well it was mostly LTP complaining. I have tried to fix that but Linus > was against so we just documented that this is possible and MAP_POPULATE > is not a guarantee. Okay, makes sense. I wouldn't really count that as a regression. > > > Well, we should be able to do that with the oom_reaper. At least for v2 > > > which doesn't have synchronous userspace oom killing. 
> > > > I don't see how the OOM reaper is a guarantee as long as we have this: > > > > if (!down_read_trylock(&mm->mmap_sem)) { > > ret = false; > > trace_skip_task_reaping(tsk->pid); > > goto unlock_oom; > > } > > And we will simply mark the victim MMF_OOM_SKIP and hide it from the oom > killer if we fail to get the mmap_sem after several attempts. This will > allow to find a new victim. So we shouldn't deadlock. It's less likely to deadlock, but not exactly deadlock-free. There might not BE any other mm's holding significant amounts of memory. > > What do you mean by 'v2'? > > cgroup v2 because the legacy memcg allowed sync wait for the oom killer > and that would be a bigger problem from a deep callchains for obevious > reasons. Actually, the async oom killing code isn't dependent on cgroup version. cgroup1 doesn't wait inside the charge context, either. > > > > > c) Overcharge kmem to oom memcg and queue an async memcg limit checker, > > > > > which will oom kill if needed. > > > > > > > > This makes the most sense to me. Architecturally, I imagine this would > > > > look like b), with an OOM handler at the point of return to userspace, > > > > except that we'd overcharge instead of retrying the syscall. > > > > > > I do not think we should break the hard limit semantic if possible. We > > > can currently allow that for allocations which are very short term (oom > > > victims) or too important to fail but allowing that for kmem charges in > > > general sounds like too easy to runaway. > > > > I'm not sure there is a convenient way out of this. > > > > If we want to respect the hard limit AND guarantee allocation success, > > the OOM killer has to free memory reliably - which it doesn't. But if > > it did, we could also break the limit temporarily and have the OOM > > killer replenish the pool before that userspace app can continue. The > > allocation wouldn't have to be short-lived, since memory is fungible. 
> > If we can guarantee the oom killer is started then we can allow temporal > access to reserves which is already implemented even for memcg. The > thing is we do not invoke the oom killer... You lost me here. Which reserves? All I'm saying is that, when the syscall-context fails to charge, we should do mem_cgroup_oom() to set up the async OOM killer, let the charge succeed over the hard limit - since the OOM killer will most likely get us back below the limit - then mem_cgroup_oom_synchronize() before the syscall returns to userspace. That would avoid returning -ENOMEM from syscalls without the risk of the hard limit deadlocking - at the risk of sometimes overrunning the hard limit, but that seems like the least problematic behavior out of the three. ^ permalink raw reply [flat|nested] 52+ messages in thread
* Re: [PATCH] fs, mm: account filp and names caches to kmemcg 2017-10-12 19:03 ` Johannes Weiner @ 2017-10-12 23:57 ` Greg Thelen 2017-10-13 6:51 ` Michal Hocko 2017-10-13 6:35 ` Michal Hocko 1 sibling, 1 reply; 52+ messages in thread From: Greg Thelen @ 2017-10-12 23:57 UTC (permalink / raw) To: Johannes Weiner, Michal Hocko Cc: Shakeel Butt, Alexander Viro, Vladimir Davydov, Andrew Morton, Linux MM, linux-fsdevel, LKML Johannes Weiner <hannes@cmpxchg.org> wrote: > On Tue, Oct 10, 2017 at 04:24:34PM +0200, Michal Hocko wrote: >> On Tue 10-10-17 10:17:33, Johannes Weiner wrote: >> > On Tue, Oct 10, 2017 at 11:14:30AM +0200, Michal Hocko wrote: >> > > On Mon 09-10-17 16:26:13, Johannes Weiner wrote: >> > > > It's consistent in the sense that only page faults enable the memcg >> > > > OOM killer. It's not the type of memory that decides, it's whether the >> > > > allocation context has a channel to communicate an error to userspace. >> > > > >> > > > Whether userspace is able to handle -ENOMEM from syscalls was a voiced >> > > > concern at the time this patch was merged, although there haven't been >> > > > any reports so far, >> > > >> > > Well, I remember reports about MAP_POPULATE breaking or at least having >> > > an unexpected behavior. >> > >> > Hm, that slipped past me. Did we do something about these? Or did they >> > fix userspace? >> >> Well it was mostly LTP complaining. I have tried to fix that but Linus >> was against so we just documented that this is possible and MAP_POPULATE >> is not a guarantee. > > Okay, makes sense. I wouldn't really count that as a regression. > >> > > Well, we should be able to do that with the oom_reaper. At least for v2 >> > > which doesn't have synchronous userspace oom killing. 
>> > >> > I don't see how the OOM reaper is a guarantee as long as we have this: >> > >> > if (!down_read_trylock(&mm->mmap_sem)) { >> > ret = false; >> > trace_skip_task_reaping(tsk->pid); >> > goto unlock_oom; >> > } >> >> And we will simply mark the victim MMF_OOM_SKIP and hide it from the oom >> killer if we fail to get the mmap_sem after several attempts. This will >> allow to find a new victim. So we shouldn't deadlock. > > It's less likely to deadlock, but not exactly deadlock-free. There > might not BE any other mm's holding significant amounts of memory. > >> > What do you mean by 'v2'? >> >> cgroup v2 because the legacy memcg allowed sync wait for the oom killer >> and that would be a bigger problem from a deep callchains for obevious >> reasons. > > Actually, the async oom killing code isn't dependent on cgroup > version. cgroup1 doesn't wait inside the charge context, either. > >> > > > > c) Overcharge kmem to oom memcg and queue an async memcg limit checker, >> > > > > which will oom kill if needed. >> > > > >> > > > This makes the most sense to me. Architecturally, I imagine this would >> > > > look like b), with an OOM handler at the point of return to userspace, >> > > > except that we'd overcharge instead of retrying the syscall. >> > > >> > > I do not think we should break the hard limit semantic if possible. We >> > > can currently allow that for allocations which are very short term (oom >> > > victims) or too important to fail but allowing that for kmem charges in >> > > general sounds like too easy to runaway. >> > >> > I'm not sure there is a convenient way out of this. >> > >> > If we want to respect the hard limit AND guarantee allocation success, >> > the OOM killer has to free memory reliably - which it doesn't. But if >> > it did, we could also break the limit temporarily and have the OOM >> > killer replenish the pool before that userspace app can continue. The >> > allocation wouldn't have to be short-lived, since memory is fungible. 
>> >> If we can guarantee the oom killer is started then we can allow temporal >> access to reserves which is already implemented even for memcg. The >> thing is we do not invoke the oom killer... > > You lost me here. Which reserves? > > All I'm saying is that, when the syscall-context fails to charge, we > should do mem_cgroup_oom() to set up the async OOM killer, let the > charge succeed over the hard limit - since the OOM killer will most > likely get us back below the limit - then mem_cgroup_oom_synchronize() > before the syscall returns to userspace. > > That would avoid returning -ENOMEM from syscalls without the risk of > the hard limit deadlocking - at the risk of sometimes overrunning the > hard limit, but that seems like the least problematic behavior out of > the three. Overcharging kmem with deferred reconciliation sounds good to me. A few comments (not reasons to avoid this): 1) If a task is moved between memcg it seems possible to overcharge multiple oom memcg for different kmem/user allocations. mem_cgroup_oom_synchronize() would see at most one oom memcg in current->memcg_in_oom. Thus it'd only reconcile a single memcg. But that seems pretty rare and the next charge to any of the other memcg would reconcile them. 2) if a kernel thread charges kmem on behalf of a client mm then there is no good place to call mem_cgroup_oom_synchronize(), short of launching a work item in mem_cgroup_oom(). I don't think we have anything like that yet. So nothing to worry about. 3) it's debatable if mem_cgroup_oom_synchronize() should first attempt reclaim before killing. But that's a whole 'nother thread. 4) overcharging with deferred reconciliation could also be used for user pages. But I haven't looked at the code long enough to know if this would be a net win. ^ permalink raw reply [flat|nested] 52+ messages in thread
* Re: [PATCH] fs, mm: account filp and names caches to kmemcg 2017-10-12 23:57 ` Greg Thelen @ 2017-10-13 6:51 ` Michal Hocko 0 siblings, 0 replies; 52+ messages in thread From: Michal Hocko @ 2017-10-13 6:51 UTC (permalink / raw) To: Greg Thelen Cc: Johannes Weiner, Shakeel Butt, Alexander Viro, Vladimir Davydov, Andrew Morton, Linux MM, linux-fsdevel, LKML On Thu 12-10-17 16:57:22, Greg Thelen wrote: [...] > Overcharging kmem with deferred reconciliation sounds good to me. > > A few comments (not reasons to avoid this): > > 1) If a task is moved between memcg it seems possible to overcharge > multiple oom memcg for different kmem/user allocations. > mem_cgroup_oom_synchronize() would see at most one oom memcg in > current->memcg_in_oom. Thus it'd only reconcile a single memcg. But > that seems pretty rare and the next charge to any of the other memcg > would reconcile them. This is a general problem for the cgroup v2 memcg oom handling. > 2) if a kernel thread charges kmem on behalf of a client mm then there > is no good place to call mem_cgroup_oom_synchronize(), short of > launching a work item in mem_cgroup_oom(). I don't think we have anything > like that yet. So nothing to worry about. If we do invoke the OOM killer from the charge path, it shouldn't be a problem. > 3) it's debatable if mem_cgroup_oom_synchronize() should first attempt > reclaim before killing. But that's a whole 'nother thread. Again, this shouldn't be an issue if we invoke the oom killer from the charge path. > 4) overcharging with deferred reconciliation could also be used for user > pages. But I haven't looked at the code long enough to know if this > would be a net win. It would solve g-u-p issues failing with ENOMEM unexpectedly just because of memcg charge failure. -- Michal Hocko SUSE Labs ^ permalink raw reply [flat|nested] 52+ messages in thread
* Re: [PATCH] fs, mm: account filp and names caches to kmemcg 2017-10-12 19:03 ` Johannes Weiner 2017-10-12 23:57 ` Greg Thelen @ 2017-10-13 6:35 ` Michal Hocko 2017-10-13 7:00 ` Michal Hocko 2017-10-24 15:45 ` Johannes Weiner 1 sibling, 2 replies; 52+ messages in thread From: Michal Hocko @ 2017-10-13 6:35 UTC (permalink / raw) To: Johannes Weiner Cc: Greg Thelen, Shakeel Butt, Alexander Viro, Vladimir Davydov, Andrew Morton, Linux MM, linux-fsdevel, LKML On Thu 12-10-17 15:03:12, Johannes Weiner wrote: > On Tue, Oct 10, 2017 at 04:24:34PM +0200, Michal Hocko wrote: [...] > > And we will simply mark the victim MMF_OOM_SKIP and hide it from the oom > > killer if we fail to get the mmap_sem after several attempts. This will > > allow to find a new victim. So we shouldn't deadlock. > > It's less likely to deadlock, but not exactly deadlock-free. There > might not BE any other mm's holding significant amounts of memory. True, try_charge would have to return with failure when out_of_memory returns false, of course. > > > What do you mean by 'v2'? > > > > cgroup v2 because the legacy memcg allowed sync wait for the oom killer > > and that would be a bigger problem from deep callchains for obvious > > reasons. > > Actually, the async oom killing code isn't dependent on cgroup > version. cgroup1 doesn't wait inside the charge context, either. Sorry, I was just not clear. What I meant to say is that we couldn't make v1 wait inside the try_charge path because async oom killing wouldn't help for the oom disabled case (aka user space oom handling). > > > > > > c) Overcharge kmem to oom memcg and queue an async memcg limit checker, > > > > > > which will oom kill if needed. > > > > > > > > > > This makes the most sense to me. Architecturally, I imagine this would > > > > > look like b), with an OOM handler at the point of return to userspace, > > > > > except that we'd overcharge instead of retrying the syscall.
> > > > > > > > I do not think we should break the hard limit semantic if possible. We > > > > can currently allow that for allocations which are very short term (oom > > > > victims) or too important to fail but allowing that for kmem charges in > > > > general sounds like too easy to runaway. > > > > > > I'm not sure there is a convenient way out of this. > > > > > > If we want to respect the hard limit AND guarantee allocation success, > > > the OOM killer has to free memory reliably - which it doesn't. But if > > > it did, we could also break the limit temporarily and have the OOM > > > killer replenish the pool before that userspace app can continue. The > > > allocation wouldn't have to be short-lived, since memory is fungible. > > > > If we can guarantee the oom killer is started then we can allow temporal > > access to reserves which is already implemented even for memcg. The > > thing is we do not invoke the oom killer... > > You lost me here. Which reserves? > > All I'm saying is that, when the syscall-context fails to charge, we > should do mem_cgroup_oom() to set up the async OOM killer, let the > charge succeed over the hard limit - since the OOM killer will most > likely get us back below the limit - then mem_cgroup_oom_synchronize() > before the syscall returns to userspace. OK, then we are on the same page now. Your initial wording didn't mention async OOM killer. This makes more sense. Although I would argue that we can retry the charge as long as out_of_memory finds a victim. This would return ENOMEM to the pathological cases where no victims could be found. -- Michal Hocko SUSE Labs ^ permalink raw reply [flat|nested] 52+ messages in thread
* Re: [PATCH] fs, mm: account filp and names caches to kmemcg 2017-10-13 6:35 ` Michal Hocko @ 2017-10-13 7:00 ` Michal Hocko 2017-10-13 15:24 ` Michal Hocko 0 siblings, 1 reply; 52+ messages in thread From: Michal Hocko @ 2017-10-13 7:00 UTC (permalink / raw) To: Johannes Weiner Cc: Greg Thelen, Shakeel Butt, Alexander Viro, Vladimir Davydov, Andrew Morton, Linux MM, linux-fsdevel, LKML Just to be explicit about what I've had in mind. This hasn't even been compile tested, but it should provide at least an idea of where I am trying to go... --- diff --git a/mm/memcontrol.c b/mm/memcontrol.c index d5f3a62887cf..91fa05372114 100644 --- a/mm/memcontrol.c +++ b/mm/memcontrol.c @@ -1528,26 +1528,36 @@ static void memcg_oom_recover(struct mem_cgroup *memcg) static void mem_cgroup_oom(struct mem_cgroup *memcg, gfp_t mask, int order) { - if (!current->memcg_may_oom) - return; /* * We are in the middle of the charge context here, so we * don't want to block when potentially sitting on a callstack * that holds all kinds of filesystem and mm locks. * - * Also, the caller may handle a failed allocation gracefully - * (like optional page cache readahead) and so an OOM killer - * invocation might not even be necessary. + * cgroup v1 allows sync user space handling so we cannot afford + * to get stuck here for that configuration. That's why we don't do + * anything here except remember the OOM context and then deal with + * it at the end of the page fault when the stack is unwound, the + * locks are released, and when we know whether the fault was overall + * successful. - * That's why we don't do anything here except remember the - * OOM context and then deal with it at the end of the page - * fault when the stack is unwound, the locks are released, - * and when we know whether the fault was overall successful.
+ * On the other hand, in-kernel OOM killer allows for an async victim + * memory reclaim (oom_reaper) and that means that we are not solely + * relying on the oom victim to make forward progress so we can stay + * in the try_charge context and keep retrying as long as there + * are oom victims to select. */ - css_get(&memcg->css); - current->memcg_in_oom = memcg; - current->memcg_oom_gfp_mask = mask; - current->memcg_oom_order = order; + if (memcg->oom_kill_disable) { + if (!current->memcg_may_oom) + return false; + css_get(&memcg->css); + current->memcg_in_oom = memcg; + current->memcg_oom_gfp_mask = mask; + current->memcg_oom_order = order; + + return false; + } + + return mem_cgroup_out_of_memory(memcg, mask, order); } /** @@ -2007,8 +2017,11 @@ static int try_charge(struct mem_cgroup *memcg, gfp_t gfp_mask, mem_cgroup_event(mem_over_limit, MEMCG_OOM); - mem_cgroup_oom(mem_over_limit, gfp_mask, - get_order(nr_pages * PAGE_SIZE)); + if (mem_cgroup_oom(mem_over_limit, gfp_mask, + get_order(nr_pages * PAGE_SIZE))) { + nr_retries = MEM_CGROUP_RECLAIM_RETRIES; + goto retry; + } nomem: if (!(gfp_mask & __GFP_NOFAIL)) return -ENOMEM; -- Michal Hocko SUSE Labs ^ permalink raw reply related [flat|nested] 52+ messages in thread
* Re: [PATCH] fs, mm: account filp and names caches to kmemcg 2017-10-13 7:00 ` Michal Hocko @ 2017-10-13 15:24 ` Michal Hocko 2017-10-24 12:18 ` Michal Hocko 2017-10-24 16:06 ` Johannes Weiner 0 siblings, 2 replies; 52+ messages in thread From: Michal Hocko @ 2017-10-13 15:24 UTC (permalink / raw) To: Johannes Weiner Cc: Greg Thelen, Shakeel Butt, Alexander Viro, Vladimir Davydov, Andrew Morton, Linux MM, linux-fsdevel, LKML Well, it actually occurred to me that this would trigger the global oom killer in case no memcg-specific victim can be found which is definitely not something we would like to do. This should work better. I am not sure we can trigger this corner case but we should cover it and it actually doesn't make the code much worse. --- diff --git a/mm/memcontrol.c b/mm/memcontrol.c index d5f3a62887cf..7b370f070b82 100644 --- a/mm/memcontrol.c +++ b/mm/memcontrol.c @@ -1528,26 +1528,40 @@ static void memcg_oom_recover(struct mem_cgroup *memcg) static void mem_cgroup_oom(struct mem_cgroup *memcg, gfp_t mask, int order) { - if (!current->memcg_may_oom) - return; /* * We are in the middle of the charge context here, so we * don't want to block when potentially sitting on a callstack * that holds all kinds of filesystem and mm locks. * - * Also, the caller may handle a failed allocation gracefully - * (like optional page cache readahead) and so an OOM killer - * invocation might not even be necessary. + * cgroup v1 allows sync user space handling so we cannot afford + * to get stuck here for that configuration. That's why we don't do + * anything here except remember the OOM context and then deal with + * it at the end of the page fault when the stack is unwound, the + * locks are released, and when we know whether the fault was overall + * successful.
+ * + * On the other hand, in-kernel OOM killer allows for an async victim + * memory reclaim (oom_reaper) and that means that we are not solely + * relying on the oom victim to make forward progress so we can stay + * in the try_charge context and keep retrying as long as there + * are oom victims to select. * - * That's why we don't do anything here except remember the - * OOM context and then deal with it at the end of the page - * fault when the stack is unwound, the locks are released, - * and when we know whether the fault was overall successful. + * Please note that mem_cgroup_out_of_memory might fail to find a + * victim and then we have to rely on mem_cgroup_oom_synchronize otherwise + * we would fall back to the global oom killer in pagefault_out_of_memory */ + if (!memcg->oom_kill_disable && + mem_cgroup_out_of_memory(memcg, mask, order)) + return true; + + if (!current->memcg_may_oom) + return false; css_get(&memcg->css); current->memcg_in_oom = memcg; current->memcg_oom_gfp_mask = mask; current->memcg_oom_order = order; + + return false; } /** @@ -2007,8 +2021,11 @@ static int try_charge(struct mem_cgroup *memcg, gfp_t gfp_mask, mem_cgroup_event(mem_over_limit, MEMCG_OOM); - mem_cgroup_oom(mem_over_limit, gfp_mask, - get_order(nr_pages * PAGE_SIZE)); + if (mem_cgroup_oom(mem_over_limit, gfp_mask, + get_order(nr_pages * PAGE_SIZE))) { + nr_retries = MEM_CGROUP_RECLAIM_RETRIES; + goto retry; + } nomem: if (!(gfp_mask & __GFP_NOFAIL)) return -ENOMEM; -- Michal Hocko SUSE Labs ^ permalink raw reply related [flat|nested] 52+ messages in thread
* Re: [PATCH] fs, mm: account filp and names caches to kmemcg 2017-10-13 15:24 ` Michal Hocko @ 2017-10-24 12:18 ` Michal Hocko 2017-10-24 17:54 ` Johannes Weiner 0 siblings, 1 reply; 52+ messages in thread From: Michal Hocko @ 2017-10-24 12:18 UTC (permalink / raw) To: Greg Thelen, Johannes Weiner Cc: Shakeel Butt, Alexander Viro, Vladimir Davydov, Andrew Morton, Linux MM, linux-fsdevel, LKML Does this sound like something you would be interested in? I can spend some more time on it if it is worthwhile. On Fri 13-10-17 17:24:21, Michal Hocko wrote: > Well, it actually occurred to me that this would trigger the global oom > killer in case no memcg-specific victim can be found which is definitely > not something we would like to do. This should work better. I am not > sure we can trigger this corner case but we should cover it and it > actually doesn't make the code much worse. > --- > diff --git a/mm/memcontrol.c b/mm/memcontrol.c > index d5f3a62887cf..7b370f070b82 100644 > --- a/mm/memcontrol.c > +++ b/mm/memcontrol.c > @@ -1528,26 +1528,40 @@ static void memcg_oom_recover(struct mem_cgroup *memcg) > > static void mem_cgroup_oom(struct mem_cgroup *memcg, gfp_t mask, int order) > { > - if (!current->memcg_may_oom) > - return; > /* > * We are in the middle of the charge context here, so we > * don't want to block when potentially sitting on a callstack > * that holds all kinds of filesystem and mm locks. > * > - * Also, the caller may handle a failed allocation gracefully > - * (like optional page cache readahead) and so an OOM killer > - * invocation might not even be necessary. > + * cgroup v1 allows sync user space handling so we cannot afford > + * to get stuck here for that configuration.
That's why we don't do > + * anything here except remember the OOM context and then deal with > + * it at the end of the page fault when the stack is unwound, the > + * locks are released, and when we know whether the fault was overall > + * successful. > + * > + * On the other hand, in-kernel OOM killer allows for an async victim > + * memory reclaim (oom_reaper) and that means that we are not solely > + * relying on the oom victim to make forward progress so we can stay > + * in the try_charge context and keep retrying as long as there > + * are oom victims to select. > * > - * That's why we don't do anything here except remember the > - * OOM context and then deal with it at the end of the page > - * fault when the stack is unwound, the locks are released, > - * and when we know whether the fault was overall successful. > + * Please note that mem_cgroup_out_of_memory might fail to find a > + * victim and then we have to rely on mem_cgroup_oom_synchronize otherwise > + * we would fall back to the global oom killer in pagefault_out_of_memory > */ > + if (!memcg->oom_kill_disable && > + mem_cgroup_out_of_memory(memcg, mask, order)) > + return true; > + > + if (!current->memcg_may_oom) > + return false; > css_get(&memcg->css); > current->memcg_in_oom = memcg; > current->memcg_oom_gfp_mask = mask; > current->memcg_oom_order = order; > + > + return false; > } > > /** > @@ -2007,8 +2021,11 @@ static int try_charge(struct mem_cgroup *memcg, gfp_t gfp_mask, > > mem_cgroup_event(mem_over_limit, MEMCG_OOM); > > - mem_cgroup_oom(mem_over_limit, gfp_mask, > - get_order(nr_pages * PAGE_SIZE)); > + if (mem_cgroup_oom(mem_over_limit, gfp_mask, > + get_order(nr_pages * PAGE_SIZE))) { > + nr_retries = MEM_CGROUP_RECLAIM_RETRIES; > + goto retry; > + } > nomem: > if (!(gfp_mask & __GFP_NOFAIL)) > return -ENOMEM; > -- > Michal Hocko > SUSE Labs -- Michal Hocko SUSE Labs ^ permalink raw reply [flat|nested] 52+ messages in thread
* Re: [PATCH] fs, mm: account filp and names caches to kmemcg 2017-10-24 12:18 ` Michal Hocko @ 2017-10-24 17:54 ` Johannes Weiner 0 siblings, 0 replies; 52+ messages in thread From: Johannes Weiner @ 2017-10-24 17:54 UTC (permalink / raw) To: Michal Hocko Cc: Greg Thelen, Shakeel Butt, Alexander Viro, Vladimir Davydov, Andrew Morton, Linux MM, linux-fsdevel, LKML On Tue, Oct 24, 2017 at 02:18:59PM +0200, Michal Hocko wrote: > Does this sound something that you would be interested in? I can spend > som more time on it if it is worthwhile. Before you invest too much time in this, I think the rationale for changing the current behavior so far is very weak. The ideas that have been floated around in this thread barely cross into nice-to-have territory, and as a result the acceptable additional complexity to implement them is very low as well. Making the OOM behavior less consistent, or introducing very rare problem behavior (e.g. merely reducing the probability of syscalls returning -ENOMEM instead of fully eliminating it, re-adding avenues for deadlocks, no matter how rare, etc.) is a non-starter. ^ permalink raw reply [flat|nested] 52+ messages in thread
* Re: [PATCH] fs, mm: account filp and names caches to kmemcg 2017-10-13 15:24 ` Michal Hocko 2017-10-24 12:18 ` Michal Hocko @ 2017-10-24 16:06 ` Johannes Weiner 2017-10-24 16:22 ` Michal Hocko 1 sibling, 1 reply; 52+ messages in thread From: Johannes Weiner @ 2017-10-24 16:06 UTC (permalink / raw) To: Michal Hocko Cc: Greg Thelen, Shakeel Butt, Alexander Viro, Vladimir Davydov, Andrew Morton, Linux MM, linux-fsdevel, LKML On Fri, Oct 13, 2017 at 05:24:21PM +0200, Michal Hocko wrote: > Well, it actually occured to me that this would trigger the global oom > killer in case no memcg specific victim can be found which is definitely > not something we would like to do. This should work better. I am not > sure we can trigger this corner case but we should cover it and it > actually doesn't make the code much worse. > --- > diff --git a/mm/memcontrol.c b/mm/memcontrol.c > index d5f3a62887cf..7b370f070b82 100644 > --- a/mm/memcontrol.c > +++ b/mm/memcontrol.c > @@ -1528,26 +1528,40 @@ static void memcg_oom_recover(struct mem_cgroup *memcg) > > static void mem_cgroup_oom(struct mem_cgroup *memcg, gfp_t mask, int order) > { > - if (!current->memcg_may_oom) > - return; > /* > * We are in the middle of the charge context here, so we > * don't want to block when potentially sitting on a callstack > * that holds all kinds of filesystem and mm locks. > * > - * Also, the caller may handle a failed allocation gracefully > - * (like optional page cache readahead) and so an OOM killer > - * invocation might not even be necessary. > + * cgroup v1 allowes sync users space handling so we cannot afford > + * to get stuck here for that configuration. That's why we don't do > + * anything here except remember the OOM context and then deal with > + * it at the end of the page fault when the stack is unwound, the > + * locks are released, and when we know whether the fault was overall > + * successful. 
How about "cgroup1 allows disabling the OOM killer and waiting for outside handling until the charge can succeed; remember the context and put the task to sleep at the end of the page fault when all locks are released." and then follow it directly with the branch that handles this: if (memcg->oom_kill_disable) { css_get(&memcg->css); current->memcg_in_oom = memcg; ... return false; } return mem_cgroup_out_of_memory(memcg, mask, order); > + * On the other hand, in-kernel OOM killer allows for an async victim > + * memory reclaim (oom_reaper) and that means that we are not solely > + * relying on the oom victim to make a forward progress so we can stay > + * in the the try_charge context and keep retrying as long as there > + * are oom victims to select. I would put that part into try_charge, where that decision is made. > * > - * That's why we don't do anything here except remember the > - * OOM context and then deal with it at the end of the page > - * fault when the stack is unwound, the locks are released, > - * and when we know whether the fault was overall successful. > + * Please note that mem_cgroup_oom_synchronize might fail to find a > + * victim and then we have rely on mem_cgroup_oom_synchronize otherwise > + * we would fall back to the global oom killer in pagefault_out_of_memory Ah, that's why... Ugh, that's really duct-tapey. 
> */ > + if (!memcg->oom_kill_disable && > + mem_cgroup_out_of_memory(memcg, mask, order)) > + return true; > + > + if (!current->memcg_may_oom) > + return false; > css_get(&memcg->css); > current->memcg_in_oom = memcg; > current->memcg_oom_gfp_mask = mask; > current->memcg_oom_order = order; > + > + return false; > } > > /** > @@ -2007,8 +2021,11 @@ static int try_charge(struct mem_cgroup *memcg, gfp_t gfp_mask, > > mem_cgroup_event(mem_over_limit, MEMCG_OOM); > > - mem_cgroup_oom(mem_over_limit, gfp_mask, > - get_order(nr_pages * PAGE_SIZE)); > + if (mem_cgroup_oom(mem_over_limit, gfp_mask, > + get_order(nr_pages * PAGE_SIZE))) { > + nr_retries = MEM_CGROUP_RECLAIM_RETRIES; > + goto retry; > + } As per the previous email, this has to goto force, otherwise we return -ENOMEM from syscalls once in a blue moon, which makes verification an absolute nightmare. The behavior should be reliable, without weird p99 corner cases. I think what we should be doing here is: if a charge fails, set up an oom context and force the charge; add mem_cgroup_oom_synchronize() to the end of syscalls and kernel-context faults. ^ permalink raw reply [flat|nested] 52+ messages in thread
* Re: [PATCH] fs, mm: account filp and names caches to kmemcg 2017-10-24 16:06 ` Johannes Weiner @ 2017-10-24 16:22 ` Michal Hocko 2017-10-24 17:23 ` Johannes Weiner 0 siblings, 1 reply; 52+ messages in thread From: Michal Hocko @ 2017-10-24 16:22 UTC (permalink / raw) To: Johannes Weiner Cc: Greg Thelen, Shakeel Butt, Alexander Viro, Vladimir Davydov, Andrew Morton, Linux MM, linux-fsdevel, LKML On Tue 24-10-17 12:06:37, Johannes Weiner wrote: > On Fri, Oct 13, 2017 at 05:24:21PM +0200, Michal Hocko wrote: > > Well, it actually occured to me that this would trigger the global oom > > killer in case no memcg specific victim can be found which is definitely > > not something we would like to do. This should work better. I am not > > sure we can trigger this corner case but we should cover it and it > > actually doesn't make the code much worse. > > --- > > diff --git a/mm/memcontrol.c b/mm/memcontrol.c > > index d5f3a62887cf..7b370f070b82 100644 > > --- a/mm/memcontrol.c > > +++ b/mm/memcontrol.c > > @@ -1528,26 +1528,40 @@ static void memcg_oom_recover(struct mem_cgroup *memcg) > > > > static void mem_cgroup_oom(struct mem_cgroup *memcg, gfp_t mask, int order) > > { > > - if (!current->memcg_may_oom) > > - return; > > /* > > * We are in the middle of the charge context here, so we > > * don't want to block when potentially sitting on a callstack > > * that holds all kinds of filesystem and mm locks. > > * > > - * Also, the caller may handle a failed allocation gracefully > > - * (like optional page cache readahead) and so an OOM killer > > - * invocation might not even be necessary. > > + * cgroup v1 allowes sync users space handling so we cannot afford > > + * to get stuck here for that configuration. That's why we don't do > > + * anything here except remember the OOM context and then deal with > > + * it at the end of the page fault when the stack is unwound, the > > + * locks are released, and when we know whether the fault was overall > > + * successful. 
> > How about > > "cgroup1 allows disabling the OOM killer and waiting for outside > handling until the charge can succeed; remember the context and put > the task to sleep at the end of the page fault when all locks are > released." OK > and then follow it directly with the branch that handles this: > > if (memcg->oom_kill_disable) { > css_get(&memcg->css); > current->memcg_in_oom = memcg; > ... > return false; > } > > return mem_cgroup_out_of_memory(memcg, mask, order); > > > + * On the other hand, in-kernel OOM killer allows for an async victim > > + * memory reclaim (oom_reaper) and that means that we are not solely > > + * relying on the oom victim to make a forward progress so we can stay > > + * in the the try_charge context and keep retrying as long as there > > + * are oom victims to select. > > I would put that part into try_charge, where that decision is made. OK > > * > > - * That's why we don't do anything here except remember the > > - * OOM context and then deal with it at the end of the page > > - * fault when the stack is unwound, the locks are released, > > - * and when we know whether the fault was overall successful. > > + * Please note that mem_cgroup_oom_synchronize might fail to find a > > + * victim and then we have rely on mem_cgroup_oom_synchronize otherwise > > + * we would fall back to the global oom killer in pagefault_out_of_memory > > Ah, that's why... Ugh, that's really duct-tapey. As you know, I really hate the #PF OOM path. We should get rid of it. 
> > */ > > + if (!memcg->oom_kill_disable && > > + mem_cgroup_out_of_memory(memcg, mask, order)) > > + return true; > > + > > + if (!current->memcg_may_oom) > > + return false; > > css_get(&memcg->css); > > current->memcg_in_oom = memcg; > > current->memcg_oom_gfp_mask = mask; > > current->memcg_oom_order = order; > > + > > + return false; > > } > > > > /** > > @@ -2007,8 +2021,11 @@ static int try_charge(struct mem_cgroup *memcg, gfp_t gfp_mask, > > > > mem_cgroup_event(mem_over_limit, MEMCG_OOM); > > > > - mem_cgroup_oom(mem_over_limit, gfp_mask, > > - get_order(nr_pages * PAGE_SIZE)); > > + if (mem_cgroup_oom(mem_over_limit, gfp_mask, > > + get_order(nr_pages * PAGE_SIZE))) { > > + nr_retries = MEM_CGROUP_RECLAIM_RETRIES; > > + goto retry; > > + } > > As per the previous email, this has to goto force, otherwise we return > -ENOMEM from syscalls once in a blue moon, which makes verification an > absolute nightmare. The behavior should be reliable, without weird p99 > corner cases. > > I think what we should be doing here is: if a charge fails, set up an > oom context and force the charge; add mem_cgroup_oom_synchronize() to > the end of syscalls and kernel-context faults. What would prevent a runaway in case the only process in the memcg is oom unkillable then? -- Michal Hocko SUSE Labs ^ permalink raw reply [flat|nested] 52+ messages in thread
* Re: [PATCH] fs, mm: account filp and names caches to kmemcg 2017-10-24 16:22 ` Michal Hocko @ 2017-10-24 17:23 ` Johannes Weiner 2017-10-24 17:55 ` Michal Hocko 0 siblings, 1 reply; 52+ messages in thread From: Johannes Weiner @ 2017-10-24 17:23 UTC (permalink / raw) To: Michal Hocko Cc: Greg Thelen, Shakeel Butt, Alexander Viro, Vladimir Davydov, Andrew Morton, Linux MM, linux-fsdevel, LKML On Tue, Oct 24, 2017 at 06:22:13PM +0200, Michal Hocko wrote: > On Tue 24-10-17 12:06:37, Johannes Weiner wrote: > > > * > > > - * That's why we don't do anything here except remember the > > > - * OOM context and then deal with it at the end of the page > > > - * fault when the stack is unwound, the locks are released, > > > - * and when we know whether the fault was overall successful. > > > + * Please note that mem_cgroup_oom_synchronize might fail to find a > > > + * victim and then we have rely on mem_cgroup_oom_synchronize otherwise > > > + * we would fall back to the global oom killer in pagefault_out_of_memory > > > > Ah, that's why... Ugh, that's really duct-tapey. > > As you know, I really hate the #PF OOM path. We should get rid of it. I agree, but this isn't getting rid of it, it just adds more layers. > > > @@ -2007,8 +2021,11 @@ static int try_charge(struct mem_cgroup *memcg, gfp_t gfp_mask, > > > > > > mem_cgroup_event(mem_over_limit, MEMCG_OOM); > > > > > > - mem_cgroup_oom(mem_over_limit, gfp_mask, > > > - get_order(nr_pages * PAGE_SIZE)); > > > + if (mem_cgroup_oom(mem_over_limit, gfp_mask, > > > + get_order(nr_pages * PAGE_SIZE))) { > > > + nr_retries = MEM_CGROUP_RECLAIM_RETRIES; > > > + goto retry; > > > + } > > > > As per the previous email, this has to goto force, otherwise we return > > -ENOMEM from syscalls once in a blue moon, which makes verification an > > absolute nightmare. The behavior should be reliable, without weird p99 > > corner cases. 
> > > > I think what we should be doing here is: if a charge fails, set up an > > oom context and force the charge; add mem_cgroup_oom_synchronize() to > > the end of syscalls and kernel-context faults. > > What would prevent a runaway in case the only process in the memcg is > oom unkillable then? In such a scenario, the page fault handler would busy-loop right now. Disabling oom kills is a privileged operation with dire consequences if used incorrectly. You can panic the kernel with it. Why should the cgroup OOM killer implement protective semantics around this setting? Breaching the limit in such a setup is entirely acceptable. Really, I think it's an enormous mistake to start modeling semantics based on the most contrived and non-sensical edge case configurations. Start the discussion with what is sane and what most users should optimally experience, and keep the cornercases simple. ^ permalink raw reply [flat|nested] 52+ messages in thread
* Re: [PATCH] fs, mm: account filp and names caches to kmemcg 2017-10-24 17:23 ` Johannes Weiner @ 2017-10-24 17:55 ` Michal Hocko 2017-10-24 18:58 ` Johannes Weiner 0 siblings, 1 reply; 52+ messages in thread From: Michal Hocko @ 2017-10-24 17:55 UTC (permalink / raw) To: Johannes Weiner Cc: Greg Thelen, Shakeel Butt, Alexander Viro, Vladimir Davydov, Andrew Morton, Linux MM, linux-fsdevel, LKML On Tue 24-10-17 13:23:30, Johannes Weiner wrote: > On Tue, Oct 24, 2017 at 06:22:13PM +0200, Michal Hocko wrote: > [...] > > What would prevent a runaway in case the only process in the memcg is > > oom unkillable then? > > In such a scenario, the page fault handler would busy-loop right now. > > Disabling oom kills is a privileged operation with dire consequences > if used incorrectly. You can panic the kernel with it. Why should the > cgroup OOM killer implement protective semantics around this setting? > Breaching the limit in such a setup is entirely acceptable. > > Really, I think it's an enormous mistake to start modeling semantics > based on the most contrived and non-sensical edge case configurations. > Start the discussion with what is sane and what most users should > optimally experience, and keep the cornercases simple. I am not really seeing your concern about the semantics. The most important property of the hard limit is to protect from runaways and stop them if they happen. Users can use the softer variant (high limit) if they are not afraid of those scenarios. It is not so insane to imagine that a master task (which I can easily imagine would be oom disabled) has a leak and runs away as a result. We are not talking only about the page fault path. There are other allocation paths to consume a lot of memory and spill over and break the isolation restriction. So it makes much more sense to me to fail the allocation in such a situation rather than allow the runaway to continue. 
Just consider that such a situation shouldn't happen in the first place because there should always be an eligible task to kill - who would own all the memory otherwise? -- Michal Hocko SUSE Labs ^ permalink raw reply [flat|nested] 52+ messages in thread
* Re: [PATCH] fs, mm: account filp and names caches to kmemcg 2017-10-24 17:55 ` Michal Hocko @ 2017-10-24 18:58 ` Johannes Weiner 2017-10-24 20:15 ` Michal Hocko 0 siblings, 1 reply; 52+ messages in thread From: Johannes Weiner @ 2017-10-24 18:58 UTC (permalink / raw) To: Michal Hocko Cc: Greg Thelen, Shakeel Butt, Alexander Viro, Vladimir Davydov, Andrew Morton, Linux MM, linux-fsdevel, LKML On Tue, Oct 24, 2017 at 07:55:58PM +0200, Michal Hocko wrote: > On Tue 24-10-17 13:23:30, Johannes Weiner wrote: > > On Tue, Oct 24, 2017 at 06:22:13PM +0200, Michal Hocko wrote: > [...] > > > What would prevent a runaway in case the only process in the memcg is > > > oom unkillable then? > > > > In such a scenario, the page fault handler would busy-loop right now. > > > > Disabling oom kills is a privileged operation with dire consequences > > if used incorrectly. You can panic the kernel with it. Why should the > > cgroup OOM killer implement protective semantics around this setting? > > Breaching the limit in such a setup is entirely acceptable. > > > > Really, I think it's an enormous mistake to start modeling semantics > > based on the most contrived and non-sensical edge case configurations. > > Start the discussion with what is sane and what most users should > > optimally experience, and keep the cornercases simple. > > I am not really seeing your concern about the semantic. The most > important property of the hard limit is to protect from runaways and > stop them if they happen. Users can use the softer variant (high limit) > if they are not afraid of those scenarios. It is not so insane to > imagine that a master task (which I can easily imagine would be oom > disabled) has a leak and runaway as a result. Then you're screwed either way. Where do you return -ENOMEM in a page fault path that cannot OOM kill anything? Your choice is between maintaining the hard limit semantics or going into an infinite loop. 
I fail to see how this setup has any impact on the semantics we pick here. And even if it were real, it's really not what most users do. > We are not talking only about the page fault path. There are other > allocation paths to consume a lot of memory and spill over and break > the isolation restriction. So it makes much more sense to me to fail > the allocation in such a situation rather than allow the runaway to > continue. Just consider that such a situation shouldn't happen in > the first place because there should always be an eligible task to > kill - who would own all the memory otherwise? Okay, then let's just stick to the current behavior. ^ permalink raw reply [flat|nested] 52+ messages in thread
* Re: [PATCH] fs, mm: account filp and names caches to kmemcg 2017-10-24 18:58 ` Johannes Weiner @ 2017-10-24 20:15 ` Michal Hocko 2017-10-25 6:51 ` Greg Thelen 0 siblings, 1 reply; 52+ messages in thread From: Michal Hocko @ 2017-10-24 20:15 UTC (permalink / raw) To: Johannes Weiner Cc: Greg Thelen, Shakeel Butt, Alexander Viro, Vladimir Davydov, Andrew Morton, Linux MM, linux-fsdevel, LKML On Tue 24-10-17 14:58:54, Johannes Weiner wrote: > On Tue, Oct 24, 2017 at 07:55:58PM +0200, Michal Hocko wrote: > > On Tue 24-10-17 13:23:30, Johannes Weiner wrote: > > > On Tue, Oct 24, 2017 at 06:22:13PM +0200, Michal Hocko wrote: > > [...] > > > > What would prevent a runaway in case the only process in the memcg is > > > > oom unkillable then? > > > > > > In such a scenario, the page fault handler would busy-loop right now. > > > > > > Disabling oom kills is a privileged operation with dire consequences > > > if used incorrectly. You can panic the kernel with it. Why should the > > > cgroup OOM killer implement protective semantics around this setting? > > > Breaching the limit in such a setup is entirely acceptable. > > > > > > Really, I think it's an enormous mistake to start modeling semantics > > > based on the most contrived and non-sensical edge case configurations. > > > Start the discussion with what is sane and what most users should > > > optimally experience, and keep the cornercases simple. > > > > I am not really seeing your concern about the semantic. The most > > important property of the hard limit is to protect from runaways and > > stop them if they happen. Users can use the softer variant (high limit) > > if they are not afraid of those scenarios. It is not so insane to > > imagine that a master task (which I can easily imagine would be oom > > disabled) has a leak and runaway as a result. > > Then you're screwed either way. Where do you return -ENOMEM in a page > fault path that cannot OOM kill anything? 
Your choice is between > maintaining the hard limit semantics or going into an infinite loop. in the PF path yes. And I would argue that this is a reasonable compromise to provide the guarantee the hard limit is giving us (and the resulting isolation which is the whole point). Btw. we are already having that behavior. All we are talking about is the non-PF path which ENOMEMs right now and the meta-patch tried to handle it more gracefully and only ENOMEM when there is no other option. > I fail to see how this setup has any impact on the semantics we pick > here. And even if it were real, it's really not what most users do. sure, such a scenario is really on the edge but my main point was that the hard limit is an enforcement of an isolation guarantee (as much as possible of course). > > We are not talking only about the page fault path. There are other > > allocation paths to consume a lot of memory and spill over and break > > the isolation restriction. So it makes much more sense to me to fail > > the allocation in such a situation rather than allow the runaway to > > continue. Just consider that such a situation shouldn't happen in > > the first place because there should always be an eligible task to > > kill - who would own all the memory otherwise? > > Okay, then let's just stick to the current behavior. I am definitely not pushing that thing right now. It is good to discuss it, though. The more kernel allocations we will track the more careful we will have to be. So maybe we will have to reconsider the current approach. I am not sure we need it _right now_ but I feel we will eventually have to reconsider it. -- Michal Hocko SUSE Labs ^ permalink raw reply [flat|nested] 52+ messages in thread
* Re: [PATCH] fs, mm: account filp and names caches to kmemcg 2017-10-24 20:15 ` Michal Hocko @ 2017-10-25 6:51 ` Greg Thelen 2017-10-25 7:15 ` Michal Hocko 0 siblings, 1 reply; 52+ messages in thread From: Greg Thelen @ 2017-10-25 6:51 UTC (permalink / raw) To: Michal Hocko, Johannes Weiner Cc: Shakeel Butt, Alexander Viro, Vladimir Davydov, Andrew Morton, Linux MM, linux-fsdevel, LKML Michal Hocko <mhocko@kernel.org> wrote: > On Tue 24-10-17 14:58:54, Johannes Weiner wrote: >> On Tue, Oct 24, 2017 at 07:55:58PM +0200, Michal Hocko wrote: >> > On Tue 24-10-17 13:23:30, Johannes Weiner wrote: >> > > On Tue, Oct 24, 2017 at 06:22:13PM +0200, Michal Hocko wrote: >> > [...] >> > > > What would prevent a runaway in case the only process in the memcg is >> > > > oom unkillable then? >> > > >> > > In such a scenario, the page fault handler would busy-loop right now. >> > > >> > > Disabling oom kills is a privileged operation with dire consequences >> > > if used incorrectly. You can panic the kernel with it. Why should the >> > > cgroup OOM killer implement protective semantics around this setting? >> > > Breaching the limit in such a setup is entirely acceptable. >> > > >> > > Really, I think it's an enormous mistake to start modeling semantics >> > > based on the most contrived and non-sensical edge case configurations. >> > > Start the discussion with what is sane and what most users should >> > > optimally experience, and keep the cornercases simple. >> > >> > I am not really seeing your concern about the semantic. The most >> > important property of the hard limit is to protect from runaways and >> > stop them if they happen. Users can use the softer variant (high limit) >> > if they are not afraid of those scenarios. It is not so insane to >> > imagine that a master task (which I can easily imagine would be oom >> > disabled) has a leak and runaway as a result. >> >> Then you're screwed either way. 
Where do you return -ENOMEM in a page >> fault path that cannot OOM kill anything? Your choice is between >> maintaining the hard limit semantics or going into an infinite loop. > > in the PF path yes. And I would argue that this is a reasonable > compromise to provide the gurantee the hard limit is giving us (and > the resulting isolation which is the whole point). Btw. we are already > having that behavior. All we are talking about is the non-PF path which > ENOMEMs right now and the meta-patch tried to handle it more gracefully > and only ENOMEM when there is no other option. > >> I fail to see how this setup has any impact on the semantics we pick >> here. And even if it were real, it's really not what most users do. > > sure, such a scenario is really on the edge but my main point was that > the hard limit is an enforcement of an isolation guarantee (as much as > possible of course). > >> > We are not talking only about the page fault path. There are other >> > allocation paths to consume a lot of memory and spill over and break >> > the isolation restriction. So it makes much more sense to me to fail >> > the allocation in such a situation rather than allow the runaway to >> > continue. Just consider that such a situation shouldn't happen in >> > the first place because there should always be an eligible task to >> > kill - who would own all the memory otherwise? >> >> Okay, then let's just stick to the current behavior. > > I am definitely not pushing that thing right now. It is good to discuss > it, though. The more kernel allocations we will track the more careful we > will have to be. So maybe we will have to reconsider the current > approach. I am not sure we need it _right now_ but I feel we will > eventually have to reconsider it. The kernel already attempts to charge radix_tree_nodes. If they fail then we fall back to unaccounted memory. So the memcg limit already isn't an air tight constraint. 
I agree that unchecked overcharging could be bad, but wonder if we could overcharge kmem so long as there is a pending oom kill victim. If current is the victim or no victim, then fail allocations (as is currently done). The current thread can loop in syscall exit until usage is reconciled (either via reclaim or kill). This seems consistent with pagefault oom handling and compatible with overcommit use case. Here's an example of an overcommit case we've found quite useful. Memcg A has memory which is shared between children B and C. B is more important than C. B and C are unprivileged, neither has the authority to kill the other. /A(limit=100MB) - B(limit=80MB,prio=high) \ C(limit=80MB,prio=low) If memcg charge drives B.usage+C.usage>=A.limit, then C should be killed due to its low priority. B pagefault can kill, but if a syscall returns ENOMEM then B can't do anything useful with it. I know there are related oom killer victim selection discussions afoot. Even with classic oom_score_adj killing it's possible to heavily bias the oom killer to select C over B. ^ permalink raw reply [flat|nested] 52+ messages in thread
* Re: [PATCH] fs, mm: account filp and names caches to kmemcg 2017-10-25 6:51 ` Greg Thelen @ 2017-10-25 7:15 ` Michal Hocko 2017-10-25 13:11 ` Johannes Weiner 0 siblings, 1 reply; 52+ messages in thread From: Michal Hocko @ 2017-10-25 7:15 UTC (permalink / raw) To: Greg Thelen Cc: Johannes Weiner, Shakeel Butt, Alexander Viro, Vladimir Davydov, Andrew Morton, Linux MM, linux-fsdevel, LKML On Tue 24-10-17 23:51:30, Greg Thelen wrote: > Michal Hocko <mhocko@kernel.org> wrote: [...] > > I am definitely not pushing that thing right now. It is good to discuss > > it, though. The more kernel allocations we will track the more careful we > > will have to be. So maybe we will have to reconsider the current > > approach. I am not sure we need it _right now_ but I feel we will > > eventually have to reconsider it. > > The kernel already attempts to charge radix_tree_nodes. If they fail > then we fallback to unaccounted memory. I am not sure which code path you have in mind. All I can see is that we drop __GFP_ACCOUNT when preloading radix tree nodes. Anyway... > So the memcg limit already > isn't an air tight constraint. ... we shouldn't make it more loose though. > I agree that unchecked overcharging could be bad, but wonder if we could > overcharge kmem so long as there is a pending oom kill victim. Why is this any better than simply trying to charge as long as the oom killer makes progress? > If > current is the victim or no victim, then fail allocations (as is > currently done). we actually force the charge in that case so we will proceed. > The current thread can loop in syscall exit until > usage is reconciled (either via reclaim or kill). This seems consistent > with pagefault oom handling and compatible with overcommit use case. But we do not really want to make the syscall exit path any more complex or more expensive than it is. The point is that we shouldn't be afraid about triggering the oom killer from the charge path because we do have async OOM killer. 
This is the very same as with the standard allocator path. So why should memcg be any different? > Here's an example of an overcommit case we've found quite useful. Memcg A has > memory which is shared between children B and C. B is more important the C. > B and C are unprivileged, neither has the authority to kill the other. > > /A(limit=100MB) - B(limit=80MB,prio=high) > \ C(limit=80MB,prio=low) > > If memcg charge drives B.usage+C.usage>=A.limit, then C should be killed due to > its low priority. B pagefault can kill, but if a syscall returns ENOMEM then B > can't do anything useful with it. well, my proposal was to not return ENOMEM and rather loop in the charge path and wait for the oom killer to free up some charges. Who gets killed is really out of scope of this discussion. -- Michal Hocko SUSE Labs ^ permalink raw reply [flat|nested] 52+ messages in thread
* Re: [PATCH] fs, mm: account filp and names caches to kmemcg 2017-10-25 7:15 ` Michal Hocko @ 2017-10-25 13:11 ` Johannes Weiner 2017-10-25 14:12 ` Michal Hocko 0 siblings, 1 reply; 52+ messages in thread From: Johannes Weiner @ 2017-10-25 13:11 UTC (permalink / raw) To: Michal Hocko Cc: Greg Thelen, Shakeel Butt, Alexander Viro, Vladimir Davydov, Andrew Morton, Linux MM, linux-fsdevel, LKML On Wed, Oct 25, 2017 at 09:15:22AM +0200, Michal Hocko wrote: > On Tue 24-10-17 23:51:30, Greg Thelen wrote: > > Michal Hocko <mhocko@kernel.org> wrote: > [...] > > > I am definitely not pushing that thing right now. It is good to discuss > > > it, though. The more kernel allocations we will track the more careful we > > > will have to be. So maybe we will have to reconsider the current > > > approach. I am not sure we need it _right now_ but I feel we will > > > eventually have to reconsider it. > > > > The kernel already attempts to charge radix_tree_nodes. If they fail > > then we fallback to unaccounted memory. > > I am not sure which code path you have in mind. All I can see is that we > drop __GFP_ACCOUNT when preloading radix tree nodes. Anyway... > > > So the memcg limit already > > isn't an air tight constraint. I fully agree with this. Socket buffers overcharge too. There are plenty of memory allocations that aren't even tracked. The point is, it's a hard limit in the sense that breaching it will trigger the OOM killer. It's not a hard limit in the sense that the kernel will deadlock to avoid crossing it. > ... we shouldn't make it more loose though. Then we can end this discussion right now. I pointed out right from the start that the only way to replace -ENOMEM with OOM killing in the syscall is to force charges. If we don't, we either deadlock or still return -ENOMEM occasionally. Nobody has refuted that this is the case. > > The current thread can loop in syscall exit until > > usage is reconciled (either via reclaim or kill). 
This seems consistent > > with pagefault oom handling and compatible with overcommit use case. > > But we do not really want to make the syscall exit path any more complex > or more expensive than it is. The point is that we shouldn't be afraid > about triggering the oom killer from the charge patch because we do have > async OOM killer. This is very same with the standard allocator path. So > why should be memcg any different? I have nothing against triggering the OOM killer from the allocation path. I am dead-set against making the -ENOMEM return from syscalls rare and unpredictable. They're a challenge as it is. The only sane options are to stick with the status quo, or make sure the task never returns before the allocation succeeds. Making things in this path more speculative is a downgrade, not an improvement. ^ permalink raw reply [flat|nested] 52+ messages in thread
* Re: [PATCH] fs, mm: account filp and names caches to kmemcg 2017-10-25 13:11 ` Johannes Weiner @ 2017-10-25 14:12 ` Michal Hocko 2017-10-25 16:44 ` Johannes Weiner 0 siblings, 1 reply; 52+ messages in thread From: Michal Hocko @ 2017-10-25 14:12 UTC (permalink / raw) To: Johannes Weiner Cc: Greg Thelen, Shakeel Butt, Alexander Viro, Vladimir Davydov, Andrew Morton, Linux MM, linux-fsdevel, LKML On Wed 25-10-17 09:11:51, Johannes Weiner wrote: > On Wed, Oct 25, 2017 at 09:15:22AM +0200, Michal Hocko wrote: [...] > > ... we shouldn't make it more loose though. > > Then we can end this discussion right now. I pointed out right from > the start that the only way to replace -ENOMEM with OOM killing in the > syscall is to force charges. If we don't, we either deadlock or still > return -ENOMEM occasionally. Nobody has refuted that this is the case. Yes, this is true. I guess we are back to the non-failing allocations discussion... Currently we are too ENOMEM-happy for memcg !PF paths, which can lead to the weird issues Greg pointed out earlier. Going in the opposite direction, to basically never return ENOMEM and instead pretend success (which allows runaways for extreme setups with no oom-eligible tasks), sounds like going from one extreme to another. It basically means that those charges would effectively be GFP_NOFAIL. Too much to guarantee, IMHO. > > > The current thread can loop in syscall exit until > > > usage is reconciled (either via reclaim or kill). This seems consistent > > > with pagefault oom handling and compatible with overcommit use case. > > > > But we do not really want to make the syscall exit path any more complex > > or more expensive than it is. The point is that we shouldn't be afraid > > about triggering the oom killer from the charge patch because we do have > > async OOM killer. This is very same with the standard allocator path. So > > why should be memcg any different? > > I have nothing against triggering the OOM killer from the allocation > path.
> I am dead-set against making the -ENOMEM return from syscalls > rare and unpredictable. Isn't that the case when we put memcg out of the picture already? More on that below. > They're a challenge as it is. The only sane options are to stick with > the status quo, One thing that really worries me about the current status quo is that the behavior depends on whether you run under memcg or not. The global policy is "almost never fail unless something horrible is going on". But we _do not_ guarantee that ENOMEM stays inside the kernel. So if we need to do something about that I would think we need a universal solution rather than something memcg specific. Sure, global ENOMEMs are so rare that nobody will probably trigger those, but that is just wishful thinking... So how about we start with a BIG FAT WARNING for the failure case? Something resembling warn_alloc for the failure case. --- diff --git a/mm/memcontrol.c b/mm/memcontrol.c index 5d9323028870..3ba62c73eee5 100644 --- a/mm/memcontrol.c +++ b/mm/memcontrol.c @@ -1547,9 +1547,14 @@ static bool mem_cgroup_oom(struct mem_cgroup *memcg, gfp_t mask, int order) * victim and then we have rely on mem_cgroup_oom_synchronize otherwise * we would fall back to the global oom killer in pagefault_out_of_memory */ - if (!memcg->oom_kill_disable && - mem_cgroup_out_of_memory(memcg, mask, order)) - return true; + if (!memcg->oom_kill_disable) { + if (mem_cgroup_out_of_memory(memcg, mask, order)) + return true; + + WARN(!current->memcg_may_oom, + "Memory cgroup charge failed because of no reclaimable memory! " + "This looks like a misconfiguration or a kernel bug."); + } if (!current->memcg_may_oom) return false; -- Michal Hocko SUSE Labs ^ permalink raw reply related [flat|nested] 52+ messages in thread
* Re: [PATCH] fs, mm: account filp and names caches to kmemcg 2017-10-25 14:12 ` Michal Hocko @ 2017-10-25 16:44 ` Johannes Weiner 2017-10-25 17:29 ` Michal Hocko 2017-10-27 20:50 ` Shakeel Butt 0 siblings, 2 replies; 52+ messages in thread From: Johannes Weiner @ 2017-10-25 16:44 UTC (permalink / raw) To: Michal Hocko Cc: Greg Thelen, Shakeel Butt, Alexander Viro, Vladimir Davydov, Andrew Morton, Linux MM, linux-fsdevel, LKML On Wed, Oct 25, 2017 at 04:12:21PM +0200, Michal Hocko wrote: > On Wed 25-10-17 09:11:51, Johannes Weiner wrote: > > On Wed, Oct 25, 2017 at 09:15:22AM +0200, Michal Hocko wrote: > [...] > > > ... we shouldn't make it more loose though. > > > > Then we can end this discussion right now. I pointed out right from > > the start that the only way to replace -ENOMEM with OOM killing in the > > syscall is to force charges. If we don't, we either deadlock or still > > return -ENOMEM occasionally. Nobody has refuted that this is the case. > > Yes this is true. I guess we are back to the non-failing allocations > discussion... Currently we are too ENOMEM happy for memcg !PF paths which > can lead to weird issues Greg has pointed out earlier. Going to opposite > direction to basically never ENOMEM and rather pretend a success (which > allows runaways for extreme setups with no oom eligible tasks) sounds > like going from one extreme to another. This basically means that those > charges will effectively GFP_NOFAIL. Too much to guarantee IMHO. We're talking in circles. Overrunning the hard limit in the syscall path isn't worse than an infinite loop in the page fault path. Once you make tasks ineligible for OOM, that's what you have to expect. It's the OOM disabling that results in extreme consequences across the board. It's silly to keep bringing this up as an example. > > They're a challenge as it is. 
The only sane options are to stick with > > the status quo, > > One thing that really worries me about the current status quo is that > the behavior depends on whether you run under memcg or not. The global > policy is "almost never fail unless something horrible is going on". > But we _do not_ guarantee that ENOMEM stays inside the kernel. > > So if we need to do something about that I would think we need an > universal solution rather than something memcg specific. Sure global > ENOMEMs are so rare that nobody will probably trigger those but that is > just a wishful thinking... The difference in current behavior comes from the fact that real-life workloads could trigger a deadlock with memcg looping indefinitely in the charge path, but not with the allocator looping indefinitely in the allocation path. That also means that if we change it like you proposed, it'll be much more likely to trigger -ENOMEM from syscalls inside memcg than it is outside. The margins are simply tighter inside cgroups. You often have only a few closely interacting tasks in there, and the probability for something to go wrong is much higher than on the system scope. But also, the system is constrained by physical limits. It's a hard wall; once you hit it, the kernel is dead, so we don't have a lot of options. Memcg on the other hand is a completely artificial limit. It doesn't make sense to me to pretend it's a hard wall to the point where we actively *create* for ourselves the downsides of a physical limitation. > So how about we start with a BIG FAT WARNING for the failure case? > Something resembling warn_alloc for the failure case. 
> > --- > diff --git a/mm/memcontrol.c b/mm/memcontrol.c > index 5d9323028870..3ba62c73eee5 100644 > --- a/mm/memcontrol.c > +++ b/mm/memcontrol.c > @@ -1547,9 +1547,14 @@ static bool mem_cgroup_oom(struct mem_cgroup *memcg, gfp_t mask, int order) > * victim and then we have rely on mem_cgroup_oom_synchronize otherwise > * we would fall back to the global oom killer in pagefault_out_of_memory > */ > - if (!memcg->oom_kill_disable && > - mem_cgroup_out_of_memory(memcg, mask, order)) > - return true; > + if (!memcg->oom_kill_disable) { > + if (mem_cgroup_out_of_memory(memcg, mask, order)) > + return true; > + > + WARN(!current->memcg_may_oom, > + "Memory cgroup charge failed because of no reclaimable memory! " > + "This looks like a misconfiguration or a kernel bug."); > + } That's crazy! We shouldn't create interfaces that make it possible to accidentally livelock the kernel. Then warn about it and let it crash. That is a DOS-level lack of OS abstraction. In such a situation, we should ignore oom_score_adj or ignore the hard limit. Even panic() would be better from a machine management point of view than leaving random tasks inside infinite loops. Why is OOM-disabling a thing? Why isn't this simply a "kill everything else before you kill me"? It's crashing the kernel in trying to protect a userspace application. How is that not insane? ^ permalink raw reply [flat|nested] 52+ messages in thread
* Re: [PATCH] fs, mm: account filp and names caches to kmemcg 2017-10-25 16:44 ` Johannes Weiner @ 2017-10-25 17:29 ` Michal Hocko 2017-10-25 18:11 ` Johannes Weiner 2017-10-27 20:50 ` Shakeel Butt 1 sibling, 1 reply; 52+ messages in thread From: Michal Hocko @ 2017-10-25 17:29 UTC (permalink / raw) To: Johannes Weiner Cc: Greg Thelen, Shakeel Butt, Alexander Viro, Vladimir Davydov, Andrew Morton, Linux MM, linux-fsdevel, LKML On Wed 25-10-17 12:44:02, Johannes Weiner wrote: > On Wed, Oct 25, 2017 at 04:12:21PM +0200, Michal Hocko wrote: [...] I have yet to digest the first part of the email, but the rest sounds like we are not on the same page. > > So how about we start with a BIG FAT WARNING for the failure case? > > Something resembling warn_alloc for the failure case. > > > > --- > > diff --git a/mm/memcontrol.c b/mm/memcontrol.c > > index 5d9323028870..3ba62c73eee5 100644 > > --- a/mm/memcontrol.c > > +++ b/mm/memcontrol.c > > @@ -1547,9 +1547,14 @@ static bool mem_cgroup_oom(struct mem_cgroup *memcg, gfp_t mask, int order) > > * victim and then we have rely on mem_cgroup_oom_synchronize otherwise > > * we would fall back to the global oom killer in pagefault_out_of_memory > > */ > > - if (!memcg->oom_kill_disable && > > - mem_cgroup_out_of_memory(memcg, mask, order)) > > - return true; > > + if (!memcg->oom_kill_disable) { > > + if (mem_cgroup_out_of_memory(memcg, mask, order)) > > + return true; > > + > > + WARN(!current->memcg_may_oom, > > + "Memory cgroup charge failed because of no reclaimable memory! " > > + "This looks like a misconfiguration or a kernel bug."); > > + } > > That's crazy! > > We shouldn't create interfaces that make it possible to accidentally > livelock the kernel. Then warn about it and let it crash. That is a > DOS-level lack of OS abstraction. > > In such a situation, we should ignore oom_score_adj or ignore the hard > limit.
Even panic() would be better from a machine management point of > view than leaving random tasks inside infinite loops. > > Why is OOM-disabling a thing? Why isn't this simply a "kill everything > else before you kill me"? It's crashing the kernel in trying to > protect a userspace application. How is that not insane? I really do not follow. What kind of livelock or crash are you talking about? All this code does is make the charge request (which is not explicitly GFP_NOFAIL) fail with ENOMEM if the oom killer is not able to make forward progress. That sounds like a safer option than failing with ENOMEM unconditionally, which is what we do currently. So the only change I am really proposing is to keep retrying as long as the oom killer makes forward progress, and return ENOMEM otherwise. I am also not trying to protect a userspace application. Quite the contrary, I would like the application to get ENOMEM when it should run away from the constraint it runs within. I am protecting everything outside of the hard-limited memcg, actually. So what is it that I am missing? -- Michal Hocko SUSE Labs ^ permalink raw reply [flat|nested] 52+ messages in thread
* Re: [PATCH] fs, mm: account filp and names caches to kmemcg 2017-10-25 17:29 ` Michal Hocko @ 2017-10-25 18:11 ` Johannes Weiner 2017-10-25 19:00 ` Michal Hocko 0 siblings, 1 reply; 52+ messages in thread From: Johannes Weiner @ 2017-10-25 18:11 UTC (permalink / raw) To: Michal Hocko Cc: Greg Thelen, Shakeel Butt, Alexander Viro, Vladimir Davydov, Andrew Morton, Linux MM, linux-fsdevel, LKML On Wed, Oct 25, 2017 at 07:29:24PM +0200, Michal Hocko wrote: > On Wed 25-10-17 12:44:02, Johannes Weiner wrote: > > On Wed, Oct 25, 2017 at 04:12:21PM +0200, Michal Hocko wrote: > > > So how about we start with a BIG FAT WARNING for the failure case? > > > Something resembling warn_alloc for the failure case. > > > > > > --- > > > diff --git a/mm/memcontrol.c b/mm/memcontrol.c > > > index 5d9323028870..3ba62c73eee5 100644 > > > --- a/mm/memcontrol.c > > > +++ b/mm/memcontrol.c > > > @@ -1547,9 +1547,14 @@ static bool mem_cgroup_oom(struct mem_cgroup *memcg, gfp_t mask, int order) > > > * victim and then we have rely on mem_cgroup_oom_synchronize otherwise > > > * we would fall back to the global oom killer in pagefault_out_of_memory > > > */ > > > - if (!memcg->oom_kill_disable && > > > - mem_cgroup_out_of_memory(memcg, mask, order)) > > > - return true; > > > + if (!memcg->oom_kill_disable) { > > > + if (mem_cgroup_out_of_memory(memcg, mask, order)) > > > + return true; > > > + > > > + WARN(!current->memcg_may_oom, > > > + "Memory cgroup charge failed because of no reclaimable memory! " > > > + "This looks like a misconfiguration or a kernel bug."); > > > + } > > > > That's crazy! > > > > We shouldn't create interfaces that make it possible to accidentally > > livelock the kernel. Then warn about it and let it crash. That is a > > DOS-level lack of OS abstraction. > > > > In such a situation, we should ignore oom_score_adj or ignore the hard > > limit. Even panic() would be better from a machine management point of > > view than leaving random tasks inside infinite loops. 
> > > > Why is OOM-disabling a thing? Why isn't this simply a "kill everything > > else before you kill me"? It's crashing the kernel in trying to > > protect a userspace application. How is that not insane? > > I really do not follow. What kind of livelock or crash are you talking > about. All this code does is that the charge request (which is not > explicitly GFP_NOFAIL) fails with ENOMEM if the oom killer is not able > to make a forward progress. That sounds like a safer option than failing > with ENOMEM unconditionally which is what we do currently. I pointed out multiple times now that consistent -ENOMEM is better than a rare one because it's easier to verify application behavior with test runs. And I don't understand what your counter-argument is. "Safe" is a vague term, and it doesn't make much sense to me in this situation. The OOM behavior should be predictable and consistent. Yes, global might in the rarest cases also return -ENOMEM. Maybe. We don't have to do that in memcg because we're not physically limited. > So the only change I am really proposing is to keep retrying as long > as the oom killer makes a forward progress and ENOMEM otherwise. That's the behavior change I'm against. > I am also not trying to protect an userspace application. Quite > contrary, I would like the application gets ENOMEM when it should run > away from the constraint it runs within. I am protecting everything > outside of the hard limited memcg actually. > > So what is that I am missing? The page fault path. Maybe that's not what you set out to fix, but it will now not only enter an infinite loop, it will also WARN() on each iteration. It would make sense to step back and think about the comprehensive user-visible behavior of an out-of-memory situation, and not just the one syscall aspect. ^ permalink raw reply [flat|nested] 52+ messages in thread
* Re: [PATCH] fs, mm: account filp and names caches to kmemcg 2017-10-25 18:11 ` Johannes Weiner @ 2017-10-25 19:00 ` Michal Hocko 2017-10-25 21:13 ` Johannes Weiner 0 siblings, 1 reply; 52+ messages in thread From: Michal Hocko @ 2017-10-25 19:00 UTC (permalink / raw) To: Johannes Weiner Cc: Greg Thelen, Shakeel Butt, Alexander Viro, Vladimir Davydov, Andrew Morton, Linux MM, linux-fsdevel, LKML On Wed 25-10-17 14:11:06, Johannes Weiner wrote: > On Wed, Oct 25, 2017 at 07:29:24PM +0200, Michal Hocko wrote: > > On Wed 25-10-17 12:44:02, Johannes Weiner wrote: > > > On Wed, Oct 25, 2017 at 04:12:21PM +0200, Michal Hocko wrote: > > > > So how about we start with a BIG FAT WARNING for the failure case? > > > > Something resembling warn_alloc for the failure case. > > > > > > > > --- > > > > diff --git a/mm/memcontrol.c b/mm/memcontrol.c > > > > index 5d9323028870..3ba62c73eee5 100644 > > > > --- a/mm/memcontrol.c > > > > +++ b/mm/memcontrol.c > > > > @@ -1547,9 +1547,14 @@ static bool mem_cgroup_oom(struct mem_cgroup *memcg, gfp_t mask, int order) > > > > * victim and then we have rely on mem_cgroup_oom_synchronize otherwise > > > > * we would fall back to the global oom killer in pagefault_out_of_memory > > > > */ > > > > - if (!memcg->oom_kill_disable && > > > > - mem_cgroup_out_of_memory(memcg, mask, order)) > > > > - return true; > > > > + if (!memcg->oom_kill_disable) { > > > > + if (mem_cgroup_out_of_memory(memcg, mask, order)) > > > > + return true; > > > > + > > > > + WARN(!current->memcg_may_oom, > > > > + "Memory cgroup charge failed because of no reclaimable memory! " > > > > + "This looks like a misconfiguration or a kernel bug."); > > > > + } > > > > > > That's crazy! > > > > > > We shouldn't create interfaces that make it possible to accidentally > > > livelock the kernel. Then warn about it and let it crash. That is a > > > DOS-level lack of OS abstraction. > > > > > > In such a situation, we should ignore oom_score_adj or ignore the hard > > > limit. 
Even panic() would be better from a machine management point of > > > view than leaving random tasks inside infinite loops. > > > > > > Why is OOM-disabling a thing? Why isn't this simply a "kill everything > > > else before you kill me"? It's crashing the kernel in trying to > > > protect a userspace application. How is that not insane? > > > > I really do not follow. What kind of livelock or crash are you talking > > about. All this code does is that the charge request (which is not > > explicitly GFP_NOFAIL) fails with ENOMEM if the oom killer is not able > > to make a forward progress. That sounds like a safer option than failing > > with ENOMEM unconditionally which is what we do currently. > > I pointed out multiple times now that consistent -ENOMEM is better > than a rare one because it's easier to verify application behavior > with test runs. And I don't understand what your counter-argument is. My counter-argument is that running inside a memcg shouldn't be too different from running outside. And that means that if we try to reduce the chances of ENOMEM in the global case, then we should do the same in the memcg case as well. Failing overly eagerly is more harmful than whatever determinism it gives you for testing. You have other means to test failure paths, like fault injection. > "Safe" is a vague term, and it doesn't make much sense to me in this > situation. The OOM behavior should be predictable and consistent. > > Yes, global might in the rarest cases also return -ENOMEM. Maybe. We > don't have to do that in memcg because we're not physically limited. OK, so here seems to be the biggest disconnect. Being physically or artificially constrained shouldn't make much difference IMHO. In both cases the resource is simply limited for the consumer. And once all the attempts to fit within the limit fail, then the request for the resource has to fail.
> > So the only change I am really proposing is to keep retrying as long > > as the oom killer makes a forward progress and ENOMEM otherwise. > > That's the behavior change I'm against. So just to make it clear, you would be OK with the retry on successful OOM killer invocation and force charge on oom failure, right? > > I am also not trying to protect an userspace application. Quite > > contrary, I would like the application gets ENOMEM when it should run > > away from the constraint it runs within. I am protecting everything > > outside of the hard limited memcg actually. > > > > So what is that I am missing? > > The page fault path. > > Maybe that's not what you set out to fix, but it will now not only > enter an infinite loop, it will also WARN() on each iteration. It will not warn! The WARN is explicitly for non-PF paths, unless I have screwed something up there; admittedly I didn't even try to compile that code, it was merely an illustration. Please note that the diff is on top of the previous one. And we already do have an endless loop if the memcg PF oom path doesn't make forward progress. So there shouldn't be any difference in that regard. > It would make sense to step back and think about the comprehensive > user-visible behavior of an out-of-memory situation, and not just the > one syscall aspect. -- Michal Hocko SUSE Labs ^ permalink raw reply [flat|nested] 52+ messages in thread
* Re: [PATCH] fs, mm: account filp and names caches to kmemcg 2017-10-25 19:00 ` Michal Hocko @ 2017-10-25 21:13 ` Johannes Weiner 2017-10-25 22:49 ` Greg Thelen 0 siblings, 1 reply; 52+ messages in thread From: Johannes Weiner @ 2017-10-25 21:13 UTC (permalink / raw) To: Michal Hocko Cc: Greg Thelen, Shakeel Butt, Alexander Viro, Vladimir Davydov, Andrew Morton, Linux MM, linux-fsdevel, LKML On Wed, Oct 25, 2017 at 09:00:57PM +0200, Michal Hocko wrote: > On Wed 25-10-17 14:11:06, Johannes Weiner wrote: > > "Safe" is a vague term, and it doesn't make much sense to me in this > > situation. The OOM behavior should be predictable and consistent. > > > > Yes, global might in the rarest cases also return -ENOMEM. Maybe. We > > don't have to do that in memcg because we're not physically limited. > > OK, so here seems to be the biggest disconnect. Being physically or > artificially constrained shouldn't make much difference IMHO. In both > cases the resource is simply limited for the consumer. And once all the > attempts to fit within the limit fail then the request for the resource > has to fail. It's a huge difference. In the global case, we have to make trade-offs to not deadlock the kernel. In the memcg case, we have to make a trade off between desirable OOM behavior and desirable meaning of memory.max. If we can borrow a resource temporarily from the ether to resolve the OOM situation, I don't see why we shouldn't. We're only briefly ignoring the limit to make sure the allocating task isn't preventing the OOM victim from exiting or the OOM reaper from reaping. It's more of an implementation detail than interface. The only scenario you brought up where this might be the permanent overrun is the single, oom-disabled task. And I explained why that is a silly argument, why that's the least problematic consequence of oom-disabling, and why it probably shouldn't even be configurable. The idea that memory.max must never be breached is an extreme and narrow view. 
As Greg points out, there are allocations we do not even track. There are other scenarios that force allocations. They may violate the limit on paper, but they're not notably weakening the goal of memory.max - isolating workloads from each other. Let's look at it this way. There are two deadlock relationships the OOM killer needs to solve between the triggerer and the potential OOM victim: #1 Memory. The triggerer needs memory that the victim has, but the victim needs some additional memory to release it. #2 Locks. The triggerer needs memory that the victim has, but the victim needs a lock the triggerer holds to release it. We have no qualms letting the victim temporarily (until the victim's exit) ignore memory.max to resolve the memory deadlock #1. I don't understand why it's such a stretch to let the triggerer temporarily (until the victim's exit) ignore memory.max to resolve the locks deadlock #2. [1] We need both for the OOM killer to function correctly. We've solved #1 both for memcg and globally. But we haven't solved #2. Global can still deadlock, and memcg copped out and returns -ENOMEM. Adding speculative OOM killing before the -ENOMEM makes things more muddy and unpredictable. It doesn't actually solve deadlock #2. [1] And arguably that's what we should be doing in the global case too: give the triggerer access to reserves. If you recall this thread here: https://patchwork.kernel.org/patch/6088511/ > > > So the only change I am really proposing is to keep retrying as long > > > as the oom killer makes a forward progress and ENOMEM otherwise. > > > > That's the behavior change I'm against. > > So just to make it clear you would be OK with the retry on successful > OOM killer invocation and force charge on oom failure, right? Yeah, that sounds reasonable to me. ^ permalink raw reply [flat|nested] 52+ messages in thread
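[Editorial note: the semantics the two converge on above, retry the charge as long as the OOM killer makes forward progress and force the charge once it cannot, can be sketched as a user-space toy model. Everything below, from the struct to the helper names, is invented for illustration and is far simpler than the kernel's real try_charge().]

```c
#include <assert.h>
#include <stdbool.h>

/* Toy model of a memory cgroup: usage against an artificial limit,
 * plus the amount the OOM killer could still free by killing a victim.
 * All names here are illustrative, not the kernel's. */
struct memcg {
	long usage;
	long limit;
	long reclaimable;	/* memory an OOM victim would release */
};

/* Stand-in for mem_cgroup_out_of_memory(): kills a victim if one is
 * eligible and reports whether any forward progress was made. */
static bool oom_kill(struct memcg *mc)
{
	if (mc->reclaimable <= 0)
		return false;	/* no eligible victim: no progress */
	mc->usage -= mc->reclaimable;
	mc->reclaimable = 0;
	return true;
}

/* The proposed policy: retry the charge while the OOM killer makes
 * progress; once it cannot, force the charge past the limit instead
 * of returning -ENOMEM. Always returns 0 in this model. */
static int try_charge(struct memcg *mc, long nr_pages)
{
	for (;;) {
		if (mc->usage + nr_pages <= mc->limit) {
			mc->usage += nr_pages;
			return 0;	/* fits under the limit */
		}
		if (!oom_kill(mc))
			break;		/* no forward progress possible */
		/* progress was made: retry the charge */
	}
	mc->usage += nr_pages;		/* force the charge */
	return 0;
}
```

In this model a charge never fails outright; whether that force-charge fallback is acceptable, and what it means for memory.max as an isolation boundary, is exactly what the rest of the thread debates.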
* Re: [PATCH] fs, mm: account filp and names caches to kmemcg 2017-10-25 21:13 ` Johannes Weiner @ 2017-10-25 22:49 ` Greg Thelen 2017-10-26 7:49 ` Michal Hocko 2017-10-26 14:31 ` Johannes Weiner 0 siblings, 2 replies; 52+ messages in thread From: Greg Thelen @ 2017-10-25 22:49 UTC (permalink / raw) To: Johannes Weiner, Michal Hocko Cc: Shakeel Butt, Alexander Viro, Vladimir Davydov, Andrew Morton, Linux MM, linux-fsdevel, LKML Johannes Weiner <hannes@cmpxchg.org> wrote: > On Wed, Oct 25, 2017 at 09:00:57PM +0200, Michal Hocko wrote: >> On Wed 25-10-17 14:11:06, Johannes Weiner wrote: >> > "Safe" is a vague term, and it doesn't make much sense to me in this >> > situation. The OOM behavior should be predictable and consistent. >> > >> > Yes, global might in the rarest cases also return -ENOMEM. Maybe. We >> > don't have to do that in memcg because we're not physically limited. >> >> OK, so here seems to be the biggest disconnect. Being physically or >> artificially constrained shouldn't make much difference IMHO. In both >> cases the resource is simply limited for the consumer. And once all the >> attempts to fit within the limit fail then the request for the resource >> has to fail. > > It's a huge difference. In the global case, we have to make trade-offs > to not deadlock the kernel. In the memcg case, we have to make a trade > off between desirable OOM behavior and desirable meaning of memory.max. > > If we can borrow a resource temporarily from the ether to resolve the > OOM situation, I don't see why we shouldn't. We're only briefly > ignoring the limit to make sure the allocating task isn't preventing > the OOM victim from exiting or the OOM reaper from reaping. It's more > of an implementation detail than interface. > > The only scenario you brought up where this might be the permanent > overrun is the single, oom-disabled task. 
And I explained why that is > a silly argument, why that's the least problematic consequence of > oom-disabling, and why it probably shouldn't even be configurable. > > The idea that memory.max must never be breached is an extreme and > narrow view. As Greg points out, there are allocations we do not even > track. There are other scenarios that force allocations. They may > violate the limit on paper, but they're not notably weakening the goal > of memory.max - isolating workloads from each other. > > Let's look at it this way. > > There are two deadlock relationships the OOM killer needs to solve > between the triggerer and the potential OOM victim: > > #1 Memory. The triggerer needs memory that the victim has, > but the victim needs some additional memory to release it. > > #2 Locks. The triggerer needs memory that the victim has, but > the victim needs a lock the triggerer holds to release it. > > We have no qualms letting the victim temporarily (until the victim's > exit) ignore memory.max to resolve the memory deadlock #1. > > I don't understand why it's such a stretch to let the triggerer > temporarily (until the victim's exit) ignore memory.max to resolve the > locks deadlock #2. [1] > > We need both for the OOM killer to function correctly. > > We've solved #1 both for memcg and globally. But we haven't solved #2. > Global can still deadlock, and memcg copped out and returns -ENOMEM. > > Adding speculative OOM killing before the -ENOMEM makes things more > muddy and unpredictable. It doesn't actually solve deadlock #2. > > [1] And arguably that's what we should be doing in the global case > too: give the triggerer access to reserves. If you recall this > thread here: https://patchwork.kernel.org/patch/6088511/ > >> > > So the only change I am really proposing is to keep retrying as long >> > > as the oom killer makes a forward progress and ENOMEM otherwise. >> > >> > That's the behavior change I'm against. 
>> >> So just to make it clear you would be OK with the retry on successful >> OOM killer invocation and force charge on oom failure, right? > > Yeah, that sounds reasonable to me. Assuming we're talking about retrying within try_charge(), then there's a detail to iron out... If there is a pending oom victim blocked on a lock held by the try_charge() caller (the "#2 Locks" case), then I think repeated calls to out_of_memory() will return true until the victim either gets MMF_OOM_SKIP or disappears. So a force-charge fallback might be needed even with successful oom killer invocations. Or we'll need to teach out_of_memory() to return three values (e.g. NO_VICTIM, NEW_VICTIM, PENDING_VICTIM) and try_charge() can loop on NEW_VICTIM. ^ permalink raw reply [flat|nested] 52+ messages in thread
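[Editorial note: Greg's tri-state idea could look roughly like the toy model below. The enum values come from his message; the surrounding state machine and names are hypothetical, and the real interplay with MMF_OOM_SKIP and the oom_reaper is far more involved.]

```c
#include <assert.h>

/* Hypothetical three-value result for out_of_memory(), as suggested
 * above. Names match the message; the model is invented. */
enum oom_result {
	NO_VICTIM,	/* nothing left to kill: give up */
	NEW_VICTIM,	/* a victim was just killed: retry the charge */
	PENDING_VICTIM,	/* an earlier victim has not exited yet; it may
			 * be blocked on a lock the caller holds */
};

struct oom_state {
	int eligible;	/* tasks that can still be killed */
	int pending;	/* a killed victim that has not exited yet */
};

static enum oom_result out_of_memory(struct oom_state *st)
{
	if (st->pending)
		return PENDING_VICTIM;
	if (st->eligible == 0)
		return NO_VICTIM;
	st->eligible--;
	st->pending = 1;
	return NEW_VICTIM;
}

/* The caller loops only on NEW_VICTIM; for PENDING_VICTIM or NO_VICTIM
 * it stops retrying (e.g. falls back to a force charge) rather than
 * spinning on a victim that may never exit. Returns the number of
 * fresh victims seen before stopping. */
static int charge_retry_loop(struct oom_state *st)
{
	int new_victims = 0;

	for (;;) {
		if (out_of_memory(st) != NEW_VICTIM)
			return new_victims;
		new_victims++;
		/* in the real kernel the charge would be retried here */
	}
}
```

Whether the PENDING_VICTIM case should force the charge immediately or wait for the victim is precisely the detail the following messages discuss.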
* Re: [PATCH] fs, mm: account filp and names caches to kmemcg 2017-10-25 22:49 ` Greg Thelen @ 2017-10-26 7:49 ` Michal Hocko 2017-10-26 12:45 ` Tetsuo Handa 0 siblings, 1 reply; 52+ messages in thread From: Michal Hocko @ 2017-10-26 7:49 UTC (permalink / raw) To: Greg Thelen Cc: Johannes Weiner, Shakeel Butt, Alexander Viro, Vladimir Davydov, Andrew Morton, Linux MM, linux-fsdevel, LKML On Wed 25-10-17 15:49:21, Greg Thelen wrote: > Johannes Weiner <hannes@cmpxchg.org> wrote: > > > On Wed, Oct 25, 2017 at 09:00:57PM +0200, Michal Hocko wrote: [...] > >> So just to make it clear you would be OK with the retry on successful > >> OOM killer invocation and force charge on oom failure, right? > > > > Yeah, that sounds reasonable to me. > > Assuming we're talking about retrying within try_charge(), then there's > a detail to iron out... > > If there is a pending oom victim blocked on a lock held by try_charge() caller > (the "#2 Locks" case), then I think repeated calls to out_of_memory() will > return true until the victim either gets MMF_OOM_SKIP or disappears. True. And oom_reaper guarantees that MMF_OOM_SKIP gets set in a finite amount of time. > So a force > charge fallback might be a needed even with oom killer successful invocations. > Or we'll need to teach out_of_memory() to return three values (e.g. NO_VICTIM, > NEW_VICTIM, PENDING_VICTIM) and try_charge() can loop on NEW_VICTIM. No, we really want to wait for the oom victim to do its job. The only thing we should be worried about is when out_of_memory doesn't invoke the reaper. There is only one case like that AFAIK: a GFP_NOFS request. I have to think about this case some more. We currently fail the request in that case. -- Michal Hocko SUSE Labs ^ permalink raw reply [flat|nested] 52+ messages in thread
* Re: [PATCH] fs, mm: account filp and names caches to kmemcg 2017-10-26 7:49 ` Michal Hocko @ 2017-10-26 12:45 ` Tetsuo Handa 0 siblings, 0 replies; 52+ messages in thread From: Tetsuo Handa @ 2017-10-26 12:45 UTC (permalink / raw) To: Michal Hocko, Greg Thelen Cc: Johannes Weiner, Shakeel Butt, Alexander Viro, Vladimir Davydov, Andrew Morton, Linux MM, linux-fsdevel, LKML On 2017/10/26 16:49, Michal Hocko wrote: > On Wed 25-10-17 15:49:21, Greg Thelen wrote: >> Johannes Weiner <hannes@cmpxchg.org> wrote: >> >>> On Wed, Oct 25, 2017 at 09:00:57PM +0200, Michal Hocko wrote: > [...] >>>> So just to make it clear you would be OK with the retry on successful >>>> OOM killer invocation and force charge on oom failure, right? >>> >>> Yeah, that sounds reasonable to me. >> >> Assuming we're talking about retrying within try_charge(), then there's >> a detail to iron out... >> >> If there is a pending oom victim blocked on a lock held by try_charge() caller >> (the "#2 Locks" case), then I think repeated calls to out_of_memory() will >> return true until the victim either gets MMF_OOM_SKIP or disappears. > > true. And oom_reaper guarantees that MMF_OOM_SKIP gets set in the finit > amount of time. Just a confirmation. You are talking about kmemcg, aren't you? And kmemcg depends on CONFIG_MMU=y, doesn't it? If not, there is no such guarantee. > >> So a force >> charge fallback might be a needed even with oom killer successful invocations. >> Or we'll need to teach out_of_memory() to return three values (e.g. NO_VICTIM, >> NEW_VICTIM, PENDING_VICTIM) and try_charge() can loop on NEW_VICTIM. > > No we, really want to wait for the oom victim to do its job. The only > thing we should be worried about is when out_of_memory doesn't invoke > the reaper. There is only one case like that AFAIK - GFP_NOFS request. I > have to think about this case some more. We currently fail in that case > the request.
Do we really need to apply

	/*
	 * The OOM killer does not compensate for IO-less reclaim.
	 * pagefault_out_of_memory lost its gfp context so we have to
	 * make sure exclude 0 mask - all other users should have at least
	 * ___GFP_DIRECT_RECLAIM to get here.
	 */
	if (oc->gfp_mask && !(oc->gfp_mask & __GFP_FS))
		return true;

unconditionally? We can encourage !__GFP_FS allocations to use __GFP_NORETRY or __GFP_RETRY_MAYFAIL if their allocations are not important. Then, only important !__GFP_FS allocations will be checked here. I think that we can allow such important allocations to invoke the OOM killer (i.e. remove this check) because the situation is already hopeless if important !__GFP_FS allocations cannot make progress. ^ permalink raw reply [flat|nested] 52+ messages in thread
* Re: [PATCH] fs, mm: account filp and names caches to kmemcg 2017-10-25 22:49 ` Greg Thelen 2017-10-26 7:49 ` Michal Hocko @ 2017-10-26 14:31 ` Johannes Weiner 2017-10-26 19:56 ` Greg Thelen 1 sibling, 1 reply; 52+ messages in thread From: Johannes Weiner @ 2017-10-26 14:31 UTC (permalink / raw) To: Greg Thelen Cc: Michal Hocko, Shakeel Butt, Alexander Viro, Vladimir Davydov, Andrew Morton, Linux MM, linux-fsdevel, LKML On Wed, Oct 25, 2017 at 03:49:21PM -0700, Greg Thelen wrote: > Johannes Weiner <hannes@cmpxchg.org> wrote: > > > On Wed, Oct 25, 2017 at 09:00:57PM +0200, Michal Hocko wrote: > >> On Wed 25-10-17 14:11:06, Johannes Weiner wrote: > >> > "Safe" is a vague term, and it doesn't make much sense to me in this > >> > situation. The OOM behavior should be predictable and consistent. > >> > > >> > Yes, global might in the rarest cases also return -ENOMEM. Maybe. We > >> > don't have to do that in memcg because we're not physically limited. > >> > >> OK, so here seems to be the biggest disconnect. Being physically or > >> artificially constrained shouldn't make much difference IMHO. In both > >> cases the resource is simply limited for the consumer. And once all the > >> attempts to fit within the limit fail then the request for the resource > >> has to fail. > > > > It's a huge difference. In the global case, we have to make trade-offs > > to not deadlock the kernel. In the memcg case, we have to make a trade > > off between desirable OOM behavior and desirable meaning of memory.max. > > > > If we can borrow a resource temporarily from the ether to resolve the > > OOM situation, I don't see why we shouldn't. We're only briefly > > ignoring the limit to make sure the allocating task isn't preventing > > the OOM victim from exiting or the OOM reaper from reaping. It's more > > of an implementation detail than interface. > > > > The only scenario you brought up where this might be the permanent > > overrun is the single, oom-disabled task. 
And I explained why that is > > a silly argument, why that's the least problematic consequence of > > oom-disabling, and why it probably shouldn't even be configurable. > > > > The idea that memory.max must never be breached is an extreme and > > narrow view. As Greg points out, there are allocations we do not even > > track. There are other scenarios that force allocations. They may > > violate the limit on paper, but they're not notably weakening the goal > > of memory.max - isolating workloads from each other. > > > > Let's look at it this way. > > > > There are two deadlock relationships the OOM killer needs to solve > > between the triggerer and the potential OOM victim: > > > > #1 Memory. The triggerer needs memory that the victim has, > > but the victim needs some additional memory to release it. > > > > #2 Locks. The triggerer needs memory that the victim has, but > > the victim needs a lock the triggerer holds to release it. > > > > We have no qualms letting the victim temporarily (until the victim's > > exit) ignore memory.max to resolve the memory deadlock #1. > > > > I don't understand why it's such a stretch to let the triggerer > > temporarily (until the victim's exit) ignore memory.max to resolve the > > locks deadlock #2. [1] > > > > We need both for the OOM killer to function correctly. > > > > We've solved #1 both for memcg and globally. But we haven't solved #2. > > Global can still deadlock, and memcg copped out and returns -ENOMEM. > > > > Adding speculative OOM killing before the -ENOMEM makes things more > > muddy and unpredictable. It doesn't actually solve deadlock #2. > > > > [1] And arguably that's what we should be doing in the global case > > too: give the triggerer access to reserves. If you recall this > > thread here: https://patchwork.kernel.org/patch/6088511/ > > > >> > > So the only change I am really proposing is to keep retrying as long > >> > > as the oom killer makes a forward progress and ENOMEM otherwise. 
> >> > > >> > That's the behavior change I'm against. > >> > >> So just to make it clear you would be OK with the retry on successful > >> OOM killer invocation and force charge on oom failure, right? > > > > Yeah, that sounds reasonable to me. > > Assuming we're talking about retrying within try_charge(), then there's > a detail to iron out... > > If there is a pending oom victim blocked on a lock held by try_charge() caller > (the "#2 Locks" case), then I think repeated calls to out_of_memory() will > return true until the victim either gets MMF_OOM_SKIP or disappears. So a force > charge fallback might be a needed even with oom killer successful invocations. > Or we'll need to teach out_of_memory() to return three values (e.g. NO_VICTIM, > NEW_VICTIM, PENDING_VICTIM) and try_charge() can loop on NEW_VICTIM. True. I was assuming we'd retry MEM_CGROUP_RECLAIM_RETRIES times at a maximum, even if the OOM killer indicates a kill has been issued. What you propose makes sense too. ^ permalink raw reply [flat|nested] 52+ messages in thread
* Re: [PATCH] fs, mm: account filp and names caches to kmemcg 2017-10-26 14:31 ` Johannes Weiner @ 2017-10-26 19:56 ` Greg Thelen 2017-10-27 8:20 ` Michal Hocko 0 siblings, 1 reply; 52+ messages in thread From: Greg Thelen @ 2017-10-26 19:56 UTC (permalink / raw) To: Johannes Weiner Cc: Michal Hocko, Shakeel Butt, Alexander Viro, Vladimir Davydov, Andrew Morton, Linux MM, linux-fsdevel, LKML Michal Hocko wrote: > Greg Thelen wrote: > > So a force charge fallback might be a needed even with oom killer successful > > invocations. Or we'll need to teach out_of_memory() to return three values > > (e.g. NO_VICTIM, NEW_VICTIM, PENDING_VICTIM) and try_charge() can loop on > > NEW_VICTIM. > > No we, really want to wait for the oom victim to do its job. The only thing we > should be worried about is when out_of_memory doesn't invoke the reaper. There > is only one case like that AFAIK - GFP_NOFS request. I have to think about > this case some more. We currently fail in that case the request. Nod, but I think we should only wait a short time (more below). The schedule_timeout_killable(1) in out_of_memory() seems ok to me. I don't think there's a problem overcharging a little bit to expedite oom killing. Johannes Weiner wrote: > True. I was assuming we'd retry MEM_CGROUP_RECLAIM_RETRIES times at a maximum, > even if the OOM killer indicates a kill has been issued. What you propose > makes sense too. Sounds good. It looks like the oom reaper will wait 1 second (MAX_OOM_REAP_RETRIES*HZ/10) before giving up and setting MMF_OOM_SKIP, which enables the oom killer to select another victim. Repeated try_charge() => out_of_memory() calls will return true while there's a pending victim. After the first call, out_of_memory() doesn't appear to sleep. So I assume try_charge() would quickly burn through MEM_CGROUP_RECLAIM_RETRIES (5) attempts before resorting to overcharging. IMO, this is fine because: 1) it's possible the victim wants locks held by try_charge caller.
So waiting for the oom reaper to time out and out_of_memory to select additional victims would kill more than required. 2) waiting 1 sec to detect a livelock between try_charge() and a pending oom victim seems unfortunate. ^ permalink raw reply [flat|nested] 52+ messages in thread
* Re: [PATCH] fs, mm: account filp and names caches to kmemcg 2017-10-26 19:56 ` Greg Thelen @ 2017-10-27 8:20 ` Michal Hocko 0 siblings, 0 replies; 52+ messages in thread From: Michal Hocko @ 2017-10-27 8:20 UTC (permalink / raw) To: Greg Thelen Cc: Johannes Weiner, Shakeel Butt, Alexander Viro, Vladimir Davydov, Andrew Morton, Linux MM, linux-fsdevel, LKML On Thu 26-10-17 12:56:48, Greg Thelen wrote: > Michal Hocko wrote: > > Greg Thelen wrote: > > > So a force charge fallback might be a needed even with oom killer successful > > > invocations. Or we'll need to teach out_of_memory() to return three values > > > (e.g. NO_VICTIM, NEW_VICTIM, PENDING_VICTIM) and try_charge() can loop on > > > NEW_VICTIM. > > > > No we, really want to wait for the oom victim to do its job. The only thing we > > should be worried about is when out_of_memory doesn't invoke the reaper. There > > is only one case like that AFAIK - GFP_NOFS request. I have to think about > > this case some more. We currently fail in that case the request. > > Nod, but I think only wait a short time (more below). The > schedule_timeout_killable(1) in out_of_memory() seems ok to me. I don't > think there's a problem overcharging a little bit to expedite oom > killing. This is not about time. This is about the feedback mechanism oom_reaper provides. We shouldn't take any further action until we get that feedback. > Johannes Weiner wrote: > > True. I was assuming we'd retry MEM_CGROUP_RECLAIM_RETRIES times at a maximum, > > even if the OOM killer indicates a kill has been issued. What you propose > > makes sense too. > > Sounds good. > > It looks like the oom reaper will wait 1 second > (MAX_OOM_REAP_RETRIES*HZ/10) before giving up and setting MMF_OOM_SKIP, > which enables the oom killer to select another victim. Repeated > try_charge() => out_of_memory() calls will return true while there's a > pending victim. After the first call, out_of_memory() doesn't appear to > sleep.
> So I assume try_charge() would quickly burn through > MEM_CGROUP_RECLAIM_RETRIES (5) attempts before resorting to > overcharging. IMO, this is fine because: > 1) it's possible the victim wants locks held by try_charge caller. So > waiting for the oom reaper to timeout and out_of_memory to select > additional victims would kill more than required. > 2) waiting 1 sec to detect a livelock between try_charge() and pending > oom victim seems unfortunate. I am not yet sure whether overcharging or ENOMEM is the right way to go (have to think through that much more) but one thing is clear I guess. The charge path shouldn't make any decision as long as it gets positive feedback from the oom killer path. And that pretty much depends on what the oom reaper is able to do. Implementation details about how long the reaper waits if it is not able to reclaim any memory are not all that important. -- Michal Hocko SUSE Labs ^ permalink raw reply [flat|nested] 52+ messages in thread
* Re: [PATCH] fs, mm: account filp and names caches to kmemcg 2017-10-25 16:44 ` Johannes Weiner 2017-10-25 17:29 ` Michal Hocko @ 2017-10-27 20:50 ` Shakeel Butt 2017-10-30 8:29 ` Michal Hocko 1 sibling, 1 reply; 52+ messages in thread From: Shakeel Butt @ 2017-10-27 20:50 UTC (permalink / raw) To: Johannes Weiner Cc: Michal Hocko, Greg Thelen, Alexander Viro, Vladimir Davydov, Andrew Morton, Linux MM, linux-fsdevel, LKML > Why is OOM-disabling a thing? Why isn't this simply a "kill everything > else before you kill me"? It's crashing the kernel in trying to > protect a userspace application. How is that not insane? In parallel to other discussion, I think we should definitely move from "completely oom-disabled" semantics to something similar to "kill me last" semantics. Is there any objection to this idea? ^ permalink raw reply [flat|nested] 52+ messages in thread
* Re: [PATCH] fs, mm: account filp and names caches to kmemcg 2017-10-27 20:50 ` Shakeel Butt @ 2017-10-30 8:29 ` Michal Hocko 2017-10-30 19:28 ` Shakeel Butt 0 siblings, 1 reply; 52+ messages in thread From: Michal Hocko @ 2017-10-30 8:29 UTC (permalink / raw) To: Shakeel Butt Cc: Johannes Weiner, Greg Thelen, Alexander Viro, Vladimir Davydov, Andrew Morton, Linux MM, linux-fsdevel, LKML On Fri 27-10-17 13:50:47, Shakeel Butt wrote: > > Why is OOM-disabling a thing? Why isn't this simply a "kill everything > > else before you kill me"? It's crashing the kernel in trying to > > protect a userspace application. How is that not insane? > > In parallel to other discussion, I think we should definitely move > from "completely oom-disabled" semantics to something similar to "kill > me last" semantics. Is there any objection to this idea? Could you be more specific what you mean? -- Michal Hocko SUSE Labs ^ permalink raw reply [flat|nested] 52+ messages in thread
* Re: [PATCH] fs, mm: account filp and names caches to kmemcg 2017-10-30 8:29 ` Michal Hocko @ 2017-10-30 19:28 ` Shakeel Butt 2017-10-31 8:00 ` Michal Hocko 0 siblings, 1 reply; 52+ messages in thread From: Shakeel Butt @ 2017-10-30 19:28 UTC (permalink / raw) To: Michal Hocko Cc: Johannes Weiner, Greg Thelen, Alexander Viro, Vladimir Davydov, Andrew Morton, Linux MM, linux-fsdevel, LKML On Mon, Oct 30, 2017 at 1:29 AM, Michal Hocko <mhocko@kernel.org> wrote: > On Fri 27-10-17 13:50:47, Shakeel Butt wrote: >> > Why is OOM-disabling a thing? Why isn't this simply a "kill everything >> > else before you kill me"? It's crashing the kernel in trying to >> > protect a userspace application. How is that not insane? >> >> In parallel to other discussion, I think we should definitely move >> from "completely oom-disabled" semantics to something similar to "kill >> me last" semantics. Is there any objection to this idea? > > Could you be more specific what you mean? > I get the impression that the main reason behind the complexity of oom-killer is allowing processes to be protected from the oom-killer, i.e. disabling oom-killing a process by setting /proc/[pid]/oom_score_adj to -1000. So, instead of oom-disabling, add an interface which will let users/admins set a process to be oom-killed as a last resort. ^ permalink raw reply [flat|nested] 52+ messages in thread
* Re: [PATCH] fs, mm: account filp and names caches to kmemcg 2017-10-30 19:28 ` Shakeel Butt @ 2017-10-31 8:00 ` Michal Hocko 2017-10-31 16:49 ` Johannes Weiner 0 siblings, 1 reply; 52+ messages in thread From: Michal Hocko @ 2017-10-31 8:00 UTC (permalink / raw) To: Shakeel Butt Cc: Johannes Weiner, Greg Thelen, Alexander Viro, Vladimir Davydov, Andrew Morton, Linux MM, linux-fsdevel, LKML On Mon 30-10-17 12:28:13, Shakeel Butt wrote: > On Mon, Oct 30, 2017 at 1:29 AM, Michal Hocko <mhocko@kernel.org> wrote: > > On Fri 27-10-17 13:50:47, Shakeel Butt wrote: > >> > Why is OOM-disabling a thing? Why isn't this simply a "kill everything > >> > else before you kill me"? It's crashing the kernel in trying to > >> > protect a userspace application. How is that not insane? > >> > >> In parallel to other discussion, I think we should definitely move > >> from "completely oom-disabled" semantics to something similar to "kill > >> me last" semantics. Is there any objection to this idea? > > > > Could you be more specific what you mean? > > > > I get the impression that the main reason behind the complexity of > oom-killer is allowing processes to be protected from the oom-killer > i.e. disabling oom-killing a process by setting > /proc/[pid]/oom_score_adj to -1000. So, instead of oom-disabling, add > an interface which will let users/admins to set a process to be > oom-killed as a last resort. If a process opts in to be oom disabled it needs CAP_SYS_RESOURCE and it probably has a strong reason to do that. E.g. no unexpected SIGKILL which could leave inconsistent data behind. We cannot simply break that contract. Yes, it is a PITA configuration to support but it has its reasons to exist. We are not guaranteeing it 100% though, e.g. the global case just panics if there is no eligible task. It is the responsibility of userspace to make sure that the protected task doesn't blow up completely; otherwise they are on their own. We should do something similar for the memcg case.
Protect those tasks as long as we are able to make forward progress and then either give them ENOMEM or overcharge. Which way to go requires more discussion, but I do not think that an unexpected SIGKILL is the way to go. We just want to give those tasks a chance to do a graceful shutdown. -- Michal Hocko SUSE Labs ^ permalink raw reply [flat|nested] 52+ messages in thread
* Re: [PATCH] fs, mm: account filp and names caches to kmemcg 2017-10-31 8:00 ` Michal Hocko @ 2017-10-31 16:49 ` Johannes Weiner 2017-10-31 18:50 ` Michal Hocko 0 siblings, 1 reply; 52+ messages in thread From: Johannes Weiner @ 2017-10-31 16:49 UTC (permalink / raw) To: Michal Hocko Cc: Shakeel Butt, Greg Thelen, Alexander Viro, Vladimir Davydov, Andrew Morton, Linux MM, linux-fsdevel, LKML On Tue, Oct 31, 2017 at 09:00:48AM +0100, Michal Hocko wrote: > On Mon 30-10-17 12:28:13, Shakeel Butt wrote: > > On Mon, Oct 30, 2017 at 1:29 AM, Michal Hocko <mhocko@kernel.org> wrote: > > > On Fri 27-10-17 13:50:47, Shakeel Butt wrote: > > >> > Why is OOM-disabling a thing? Why isn't this simply a "kill everything > > >> > else before you kill me"? It's crashing the kernel in trying to > > >> > protect a userspace application. How is that not insane? > > >> > > >> In parallel to other discussion, I think we should definitely move > > >> from "completely oom-disabled" semantics to something similar to "kill > > >> me last" semantics. Is there any objection to this idea? > > > > > > Could you be more specific what you mean? > > > > I get the impression that the main reason behind the complexity of > > oom-killer is allowing processes to be protected from the oom-killer > > i.e. disabling oom-killing a process by setting > > /proc/[pid]/oom_score_adj to -1000. So, instead of oom-disabling, add > > an interface which will let users/admins to set a process to be > > oom-killed as a last resort. > > If a process opts in to be oom disabled it needs CAP_SYS_RESOURCE and it > probably has a strong reason to do that. E.g. no unexpected SIGKILL > which could leave inconsistent data behind. We cannot simply break that > contract. Yes, it is a PITA configuration to support but it has its > reasons to exit. I don't think that's true. The most prominent users are things like X and sshd, and all they wanted to say was "kill me last." 
If sshd were to have a bug and swell up, currently the system would kill everything and then panic. It'd be much better to kill sshd at the end and let the init system restart it. Can you describe a scenario in which the NEVERKILL semantics actually make sense? You're still OOM-killing the task anyway, it's not like it can run without the kernel. So why kill the kernel? ^ permalink raw reply [flat|nested] 52+ messages in thread
* Re: [PATCH] fs, mm: account filp and names caches to kmemcg 2017-10-31 16:49 ` Johannes Weiner @ 2017-10-31 18:50 ` Michal Hocko 0 siblings, 0 replies; 52+ messages in thread From: Michal Hocko @ 2017-10-31 18:50 UTC (permalink / raw) To: Johannes Weiner Cc: Shakeel Butt, Greg Thelen, Alexander Viro, Vladimir Davydov, Andrew Morton, Linux MM, linux-fsdevel, LKML On Tue 31-10-17 12:49:59, Johannes Weiner wrote: > On Tue, Oct 31, 2017 at 09:00:48AM +0100, Michal Hocko wrote: > > On Mon 30-10-17 12:28:13, Shakeel Butt wrote: > > > On Mon, Oct 30, 2017 at 1:29 AM, Michal Hocko <mhocko@kernel.org> wrote: > > > > On Fri 27-10-17 13:50:47, Shakeel Butt wrote: > > > >> > Why is OOM-disabling a thing? Why isn't this simply a "kill everything > > > >> > else before you kill me"? It's crashing the kernel in trying to > > > >> > protect a userspace application. How is that not insane? > > > >> > > > >> In parallel to other discussion, I think we should definitely move > > > >> from "completely oom-disabled" semantics to something similar to "kill > > > >> me last" semantics. Is there any objection to this idea? > > > > > > > > Could you be more specific what you mean? > > > > > > I get the impression that the main reason behind the complexity of > > > oom-killer is allowing processes to be protected from the oom-killer > > > i.e. disabling oom-killing a process by setting > > > /proc/[pid]/oom_score_adj to -1000. So, instead of oom-disabling, add > > > an interface which will let users/admins to set a process to be > > > oom-killed as a last resort. > > > > If a process opts in to be oom disabled it needs CAP_SYS_RESOURCE and it > > probably has a strong reason to do that. E.g. no unexpected SIGKILL > > which could leave inconsistent data behind. We cannot simply break that > > contract. Yes, it is a PITA configuration to support but it has its > > reasons to exit. > > I don't think that's true. 
> The most prominent users are things like X > and sshd, and all they wanted to say was "kill me last." This might be the case for the desktop environment and I would tend to agree that those can handle restart easily. I was considering applications which need an explicit shutdown and manual intervention when that does not happen. Think of a database or similar. > If sshd were to have a bug and swell up, currently the system would > kill everything and then panic. It'd be much better to kill sshd at > the end and let the init system restart it. > > Can you describe a scenario in which the NEVERKILL semantics actually > make sense? You're still OOM-killing the task anyway, it's not like it > can run without the kernel. So why kill the kernel? Yes, but you start with a clean state after reboot, which is rather a different thing than restarting from an inconsistent state. In any case I am not trying to defend this configuration! I really dislike it and it shouldn't have ever been introduced. But it is an established behavior for many years and I am not really willing to break it without having a _really strong_ reason. -- Michal Hocko SUSE Labs ^ permalink raw reply [flat|nested] 52+ messages in thread
* Re: [PATCH] fs, mm: account filp and names caches to kmemcg 2017-10-13 6:35 ` Michal Hocko 2017-10-13 7:00 ` Michal Hocko @ 2017-10-24 15:45 ` Johannes Weiner 2017-10-24 16:30 ` Michal Hocko 1 sibling, 1 reply; 52+ messages in thread From: Johannes Weiner @ 2017-10-24 15:45 UTC (permalink / raw) To: Michal Hocko Cc: Greg Thelen, Shakeel Butt, Alexander Viro, Vladimir Davydov, Andrew Morton, Linux MM, linux-fsdevel, LKML On Fri, Oct 13, 2017 at 08:35:55AM +0200, Michal Hocko wrote: > On Thu 12-10-17 15:03:12, Johannes Weiner wrote: > > All I'm saying is that, when the syscall-context fails to charge, we > > should do mem_cgroup_oom() to set up the async OOM killer, let the > > charge succeed over the hard limit - since the OOM killer will most > > likely get us back below the limit - then mem_cgroup_oom_synchronize() > > before the syscall returns to userspace. > > OK, then we are on the same page now. Your initial wording didn't > mention async OOM killer. This makes more sense. Although I would argue > that we can retry the charge as long as out_of_memory finds a victim. > This would return ENOMEM to the pathological cases where no victims > could be found. I think that's much worse because it's even harder to test and verify your applications against. If syscalls can return -ENOMEM on OOM, they should do so reliably. ^ permalink raw reply [flat|nested] 52+ messages in thread
* Re: [PATCH] fs, mm: account filp and names caches to kmemcg 2017-10-24 15:45 ` Johannes Weiner @ 2017-10-24 16:30 ` Michal Hocko 0 siblings, 0 replies; 52+ messages in thread From: Michal Hocko @ 2017-10-24 16:30 UTC (permalink / raw) To: Johannes Weiner Cc: Greg Thelen, Shakeel Butt, Alexander Viro, Vladimir Davydov, Andrew Morton, Linux MM, linux-fsdevel, LKML On Tue 24-10-17 11:45:11, Johannes Weiner wrote: > On Fri, Oct 13, 2017 at 08:35:55AM +0200, Michal Hocko wrote: > > On Thu 12-10-17 15:03:12, Johannes Weiner wrote: > > > All I'm saying is that, when the syscall-context fails to charge, we > > > should do mem_cgroup_oom() to set up the async OOM killer, let the > > > charge succeed over the hard limit - since the OOM killer will most > > > likely get us back below the limit - then mem_cgroup_oom_synchronize() > > > before the syscall returns to userspace. > > > > OK, then we are on the same page now. Your initial wording didn't > > mention async OOM killer. This makes more sense. Although I would argue > > that we can retry the charge as long as out_of_memory finds a victim. > > This would return ENOMEM to the pathological cases where no victims > > could be found. > > I think that's much worse because it's even harder to test and verify > your applications against. Well, the main distinction from the global OOM killer is that we panic when there is no eligible oom victim, which we cannot do in the memcg context. So we have to bail somehow and I would be really careful about allowing a runaway from the hard limit just because we are out of all eligible tasks. Returning ENOMEM sounds like a safer option to me. > If syscalls can return -ENOMEM on OOM, they should do so reliably. The main problem is that we do not know which syscalls can return ENOMEM. -- Michal Hocko SUSE Labs ^ permalink raw reply [flat|nested] 52+ messages in thread
* Re: [PATCH] fs, mm: account filp and names caches to kmemcg 2017-10-05 22:21 [PATCH] fs, mm: account filp and names caches to kmemcg Shakeel Butt 2017-10-06 7:59 ` Michal Hocko @ 2017-10-10 23:32 ` Al Viro 1 sibling, 0 replies; 52+ messages in thread From: Al Viro @ 2017-10-10 23:32 UTC (permalink / raw) To: Shakeel Butt Cc: Vladimir Davydov, Michal Hocko, Greg Thelen, Andrew Morton, linux-mm, linux-fsdevel, linux-kernel On Thu, Oct 05, 2017 at 03:21:44PM -0700, Shakeel Butt wrote: > The allocations from filp and names kmem caches can be directly > triggered by user space applications. A buggy application can > consume a significant amount of unaccounted system memory. Though > we have not noticed such buggy applications in our production > but upon close inspection, we found that a lot of machines spend > very significant amount of memory on these caches. So, these > caches should be accounted to kmemcg. IDGI... Surely, it's not hard to come up with a syscall that can allocate a page for the duration of a syscall? Just to pick a random example: reading from /proc/self/cmdline does that. So does readlink of /proc/self/cwd, etc. What does accounting for such temporary allocations (with fixed limit per syscall, always freed by the end of syscall) buy you, why is it needed and what makes it not needed for the examples above (and a slew of similar ones)? While we are at it, how much overhead does it add on syscall-heavy loads? As in, a whole lot of threads are calling something like stat("/", &stbuf); in parallel? Because outside of that kind of load it's completely pointless... ^ permalink raw reply [flat|nested] 52+ messages in thread