* [PATCH v2] fs: fsnotify: account fsnotify metadata to kmemcg
@ 2017-10-27 18:22 ` Yang Shi
  0 siblings, 0 replies; 75+ messages in thread
From: Yang Shi @ 2017-10-27 18:22 UTC (permalink / raw)
  To: jack, amir73il; +Cc: Yang Shi, linux-fsdevel, linux-mm, linux-kernel

If some process generates events into a huge or unlimited event queue, but no
listener reads them, they may silently consume a significant amount of memory
until an oom happens or some memory pressure issue is raised.
It'd be better to account those slab caches to memcg so that we can get a
heads-up before the problematic process consumes too much memory silently.

But the accounting might be heuristic if the producer is in a different
memcg from the listener and the listener doesn't read the events: due to the
current design of kmemcg, whoever does the allocation gets the accounting.

Signed-off-by: Yang Shi <yang.s@alibaba-inc.com>
---
v1 --> v2:
* Updated commit log per Amir's suggestion

 fs/notify/dnotify/dnotify.c        | 4 ++--
 fs/notify/fanotify/fanotify_user.c | 6 +++---
 fs/notify/fsnotify.c               | 2 +-
 fs/notify/inotify/inotify_user.c   | 2 +-
 4 files changed, 7 insertions(+), 7 deletions(-)

diff --git a/fs/notify/dnotify/dnotify.c b/fs/notify/dnotify/dnotify.c
index cba3283..3ec6233 100644
--- a/fs/notify/dnotify/dnotify.c
+++ b/fs/notify/dnotify/dnotify.c
@@ -379,8 +379,8 @@ int fcntl_dirnotify(int fd, struct file *filp, unsigned long arg)
 
 static int __init dnotify_init(void)
 {
-	dnotify_struct_cache = KMEM_CACHE(dnotify_struct, SLAB_PANIC);
-	dnotify_mark_cache = KMEM_CACHE(dnotify_mark, SLAB_PANIC);
+	dnotify_struct_cache = KMEM_CACHE(dnotify_struct, SLAB_PANIC|SLAB_ACCOUNT);
+	dnotify_mark_cache = KMEM_CACHE(dnotify_mark, SLAB_PANIC|SLAB_ACCOUNT);
 
 	dnotify_group = fsnotify_alloc_group(&dnotify_fsnotify_ops);
 	if (IS_ERR(dnotify_group))
diff --git a/fs/notify/fanotify/fanotify_user.c b/fs/notify/fanotify/fanotify_user.c
index 907a481..7d62dee 100644
--- a/fs/notify/fanotify/fanotify_user.c
+++ b/fs/notify/fanotify/fanotify_user.c
@@ -947,11 +947,11 @@ static int fanotify_add_inode_mark(struct fsnotify_group *group,
  */
 static int __init fanotify_user_setup(void)
 {
-	fanotify_mark_cache = KMEM_CACHE(fsnotify_mark, SLAB_PANIC);
-	fanotify_event_cachep = KMEM_CACHE(fanotify_event_info, SLAB_PANIC);
+	fanotify_mark_cache = KMEM_CACHE(fsnotify_mark, SLAB_PANIC|SLAB_ACCOUNT);
+	fanotify_event_cachep = KMEM_CACHE(fanotify_event_info, SLAB_PANIC|SLAB_ACCOUNT);
 #ifdef CONFIG_FANOTIFY_ACCESS_PERMISSIONS
 	fanotify_perm_event_cachep = KMEM_CACHE(fanotify_perm_event_info,
-						SLAB_PANIC);
+						SLAB_PANIC|SLAB_ACCOUNT);
 #endif
 
 	return 0;
diff --git a/fs/notify/fsnotify.c b/fs/notify/fsnotify.c
index 0c4583b..82620ac 100644
--- a/fs/notify/fsnotify.c
+++ b/fs/notify/fsnotify.c
@@ -386,7 +386,7 @@ static __init int fsnotify_init(void)
 		panic("initializing fsnotify_mark_srcu");
 
 	fsnotify_mark_connector_cachep = KMEM_CACHE(fsnotify_mark_connector,
-						    SLAB_PANIC);
+						    SLAB_PANIC|SLAB_ACCOUNT);
 
 	return 0;
 }
diff --git a/fs/notify/inotify/inotify_user.c b/fs/notify/inotify/inotify_user.c
index 7cc7d3f..57b32ff 100644
--- a/fs/notify/inotify/inotify_user.c
+++ b/fs/notify/inotify/inotify_user.c
@@ -785,7 +785,7 @@ static int __init inotify_user_setup(void)
 
 	BUG_ON(hweight32(ALL_INOTIFY_BITS) != 21);
 
-	inotify_inode_mark_cachep = KMEM_CACHE(inotify_inode_mark, SLAB_PANIC);
+	inotify_inode_mark_cachep = KMEM_CACHE(inotify_inode_mark, SLAB_PANIC|SLAB_ACCOUNT);
 
 	inotify_max_queued_events = 16384;
 	init_user_ns.ucount_max[UCOUNT_INOTIFY_INSTANCES] = 128;
-- 
1.8.3.1
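
For readers less familiar with the flag this patch adds, here is a minimal
sketch of what SLAB_ACCOUNT changes for a cache like the ones above. The
example_* names are hypothetical and not code from the patch; the point is
that every object allocated from an accounted cache is charged to the memcg
of the task doing the allocation.

#include <linux/init.h>
#include <linux/slab.h>

/* Hypothetical payload struct, standing in for e.g. fanotify_event_info. */
struct example_event {
	int	mask;
	void	*data;
};

static struct kmem_cache *example_event_cachep;

static int __init example_setup(void)
{
	/*
	 * SLAB_ACCOUNT makes objects allocated from this cache count
	 * against the kmem usage of the allocating task's memcg instead
	 * of being accounted only globally.
	 */
	example_event_cachep = KMEM_CACHE(example_event,
					  SLAB_PANIC|SLAB_ACCOUNT);
	return 0;
}

static struct example_event *example_alloc_event(void)
{
	/* Charged to the memcg of the task running this allocation. */
	return kmem_cache_alloc(example_event_cachep, GFP_KERNEL);
}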

* Re: [PATCH v2] fs: fsnotify: account fsnotify metadata to kmemcg
  2017-10-27 18:22 ` Yang Shi
@ 2017-10-28 14:19   ` Amir Goldstein
  -1 siblings, 0 replies; 75+ messages in thread
From: Amir Goldstein @ 2017-10-28 14:19 UTC (permalink / raw)
  To: Yang Shi; +Cc: Jan Kara, linux-fsdevel, linux-mm, linux-kernel

On Fri, Oct 27, 2017 at 9:22 PM, Yang Shi <yang.s@alibaba-inc.com> wrote:
> If some process generates events into a huge or unlimit event queue, but no
> listener read them, they may consume significant amount of memory silently
> until oom happens or some memory pressure issue is raised.
> It'd better to account those slab caches in memcg so that we can get heads
> up before the problematic process consume too much memory silently.
>
> But, the accounting might be heuristic if the producer is in the different
> memcg from listener if the listener doesn't read the events. Due to the
> current design of kmemcg, who does the allocation, who gets the accounting.

<suggest rephrase>
Due to the current design of kmemcg, the memcg of the process who does the
allocation gets the accounting, so event allocations get accounted for the
memcg of the event producer process, even though the misbehaving process is
the listener. The event allocations won't be freed if the producer exits,
only if the listener exists. Nevertheless, it is still better to account
event allocations to the memcg of the producer process and not to the root
memcg, because heuristically the producer is many times in the same memcg
as the listener. For example, this is the case with listeners inside
containers that listen on events for files or mounts that are private to
the container.
</suggest rephrase>

And the same comment should be placed above the creation of the event kmem
caches, so we know this is the lesser evil and not the perfect solution.
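
As a concrete illustration (hypothetical wording; no such comment exists in
the patch as posted), the note Amir asks for above the event cache creation
could read something like:

	/*
	 * Event objects are accounted to the memcg of the allocating
	 * (producer) task.  This is a heuristic: the producer is often,
	 * but not always, in the same memcg as the listener.  It is the
	 * lesser evil, not a perfect solution.
	 */
	fanotify_event_cachep = KMEM_CACHE(fanotify_event_info,
					   SLAB_PANIC|SLAB_ACCOUNT);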

>
> Signed-off-by: Yang Shi <yang.s@alibaba-inc.com>
> ---
> v1 --> v2:
> * Updated commit log per Amir's suggestion
>
>  fs/notify/dnotify/dnotify.c        | 4 ++--
>  fs/notify/fanotify/fanotify_user.c | 6 +++---
>  fs/notify/fsnotify.c               | 2 +-
>  fs/notify/inotify/inotify_user.c   | 2 +-
>  4 files changed, 7 insertions(+), 7 deletions(-)
>
> diff --git a/fs/notify/dnotify/dnotify.c b/fs/notify/dnotify/dnotify.c
> index cba3283..3ec6233 100644
> --- a/fs/notify/dnotify/dnotify.c
> +++ b/fs/notify/dnotify/dnotify.c
> @@ -379,8 +379,8 @@ int fcntl_dirnotify(int fd, struct file *filp, unsigned long arg)
>
>  static int __init dnotify_init(void)
>  {
> -       dnotify_struct_cache = KMEM_CACHE(dnotify_struct, SLAB_PANIC);
> -       dnotify_mark_cache = KMEM_CACHE(dnotify_mark, SLAB_PANIC);
> +       dnotify_struct_cache = KMEM_CACHE(dnotify_struct, SLAB_PANIC|SLAB_ACCOUNT);
> +       dnotify_mark_cache = KMEM_CACHE(dnotify_mark, SLAB_PANIC|SLAB_ACCOUNT);
>
>         dnotify_group = fsnotify_alloc_group(&dnotify_fsnotify_ops);
>         if (IS_ERR(dnotify_group))
> diff --git a/fs/notify/fanotify/fanotify_user.c b/fs/notify/fanotify/fanotify_user.c
> index 907a481..7d62dee 100644
> --- a/fs/notify/fanotify/fanotify_user.c
> +++ b/fs/notify/fanotify/fanotify_user.c
> @@ -947,11 +947,11 @@ static int fanotify_add_inode_mark(struct fsnotify_group *group,
>   */
>  static int __init fanotify_user_setup(void)
>  {
> -       fanotify_mark_cache = KMEM_CACHE(fsnotify_mark, SLAB_PANIC);
> -       fanotify_event_cachep = KMEM_CACHE(fanotify_event_info, SLAB_PANIC);
> +       fanotify_mark_cache = KMEM_CACHE(fsnotify_mark, SLAB_PANIC|SLAB_ACCOUNT);
> +       fanotify_event_cachep = KMEM_CACHE(fanotify_event_info, SLAB_PANIC|SLAB_ACCOUNT);
>  #ifdef CONFIG_FANOTIFY_ACCESS_PERMISSIONS
>         fanotify_perm_event_cachep = KMEM_CACHE(fanotify_perm_event_info,
> -                                               SLAB_PANIC);
> +                                               SLAB_PANIC|SLAB_ACCOUNT);
>  #endif
>
>         return 0;
> diff --git a/fs/notify/fsnotify.c b/fs/notify/fsnotify.c
> index 0c4583b..82620ac 100644
> --- a/fs/notify/fsnotify.c
> +++ b/fs/notify/fsnotify.c
> @@ -386,7 +386,7 @@ static __init int fsnotify_init(void)
>                 panic("initializing fsnotify_mark_srcu");
>
>         fsnotify_mark_connector_cachep = KMEM_CACHE(fsnotify_mark_connector,
> -                                                   SLAB_PANIC);
> +                                                   SLAB_PANIC|SLAB_ACCOUNT);
>
>         return 0;
>  }
> diff --git a/fs/notify/inotify/inotify_user.c b/fs/notify/inotify/inotify_user.c
> index 7cc7d3f..57b32ff 100644
> --- a/fs/notify/inotify/inotify_user.c
> +++ b/fs/notify/inotify/inotify_user.c
> @@ -785,7 +785,7 @@ static int __init inotify_user_setup(void)
>
>         BUG_ON(hweight32(ALL_INOTIFY_BITS) != 21);
>
> -       inotify_inode_mark_cachep = KMEM_CACHE(inotify_inode_mark, SLAB_PANIC);
> +       inotify_inode_mark_cachep = KMEM_CACHE(inotify_inode_mark, SLAB_PANIC|SLAB_ACCOUNT);
>
>         inotify_max_queued_events = 16384;
>         init_user_ns.ucount_max[UCOUNT_INOTIFY_INSTANCES] = 128;
> --
> 1.8.3.1
>

* Re: [PATCH v2] fs: fsnotify: account fsnotify metadata to kmemcg
  2017-10-28 14:19   ` Amir Goldstein
@ 2017-10-29  2:39     ` Matthew Wilcox
  -1 siblings, 0 replies; 75+ messages in thread
From: Matthew Wilcox @ 2017-10-29  2:39 UTC (permalink / raw)
  To: Amir Goldstein; +Cc: Yang Shi, Jan Kara, linux-fsdevel, linux-mm, linux-kernel

On Sat, Oct 28, 2017 at 05:19:36PM +0300, Amir Goldstein wrote:
> <suggest rephrase>
> Due to the current design of kmemcg, the memcg of the process who does the
> allocation gets the accounting, so event allocations get accounted for
> the memcg of
> the event producer process, even though the misbehaving process is the listener.
> The event allocations won't be freed if the producer exits, only if
> the listener exists.
> Nevertheless, it is still better to account event allocations to memcg
> of producer
> process and not to root memcg, because heuristically producer is many
> time in the
> same memcg as the listener. For example, this is the case with listeners inside
> containers that listen on events for files or mounts that are private
> to the container.
> <\suggest rephrase>

Well, if we're nitpicking ...

Due to the current design of kmemcg, the event allocation is accounted to
the memcg of the process producing the event, even though the misbehaving
process is the listener.  The event allocations won't be freed if the
producer exits, only if the listener exits.  Nevertheless, it is still
better to account event allocations to the producer's memcg than the
root memcg, because the producer is frequently in the same memcg as
the listener.  For example, this is the case with listeners inside
containers that listen to events for files or mounts that are private
to the container.

* Re: [PATCH v2] fs: fsnotify: account fsnotify metadata to kmemcg
  2017-10-27 18:22 ` Yang Shi
@ 2017-10-30 12:43   ` Jan Kara
  -1 siblings, 0 replies; 75+ messages in thread
From: Jan Kara @ 2017-10-30 12:43 UTC (permalink / raw)
  To: Yang Shi; +Cc: jack, amir73il, linux-fsdevel, linux-mm, linux-kernel

On Sat 28-10-17 02:22:18, Yang Shi wrote:
> If some process generates events into a huge or unlimit event queue, but no
> listener read them, they may consume significant amount of memory silently
> until oom happens or some memory pressure issue is raised.
> It'd better to account those slab caches in memcg so that we can get heads
> up before the problematic process consume too much memory silently.
> 
> But, the accounting might be heuristic if the producer is in the different
> memcg from listener if the listener doesn't read the events. Due to the
> current design of kmemcg, who does the allocation, who gets the accounting.
> 
> Signed-off-by: Yang Shi <yang.s@alibaba-inc.com>
> ---
> v1 --> v2:
> * Updated commit log per Amir's suggestion

I'm sorry but I don't think this solution is acceptable. I understand that
in some cases (and you likely run one of these) the result may *happen* to
be the desired one but in other cases, you might be charging wrong memcg
and so misbehaving process in memcg A can effectively cause a DoS attack on
a process in memcg B.

If you have a setup in which notification events can consume considerable
amount of resources, you are doing something wrong I think. Standard event
queue length is limited, overall events are bounded to consume less than 1
MB. If you have unbounded queue, the process has to be CAP_SYS_ADMIN and
presumably it has good reasons for requesting unbounded queue and it should
know what it is doing.

So maybe we could come up with some better way to control amount of
resources consumed by notification events but for that we lack more
information about your use case. And I maintain that the solution should
account events to the consumer, not the producer...

								Honza
-- 
Jan Kara <jack@suse.com>
SUSE Labs, CR
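
Jan's distinction above between the default bounded queue and an unbounded
one maps onto the fanotify API roughly as follows. This is a minimal
userspace sketch, assuming the standard fanotify_init(2) interface; the
default queue bound is 16384 events, and FAN_UNLIMITED_QUEUE is what
requires CAP_SYS_ADMIN.

#include <fcntl.h>
#include <stdio.h>
#include <sys/fanotify.h>
#include <unistd.h>

int main(void)
{
	/* Default case: the event queue is bounded (16384 events). */
	int bounded = fanotify_init(FAN_CLASS_NOTIF, O_RDONLY);

	/*
	 * FAN_UNLIMITED_QUEUE removes that bound; the kernel only allows
	 * it for CAP_SYS_ADMIN, which is the case Jan refers to above.
	 */
	int unbounded = fanotify_init(FAN_CLASS_NOTIF | FAN_UNLIMITED_QUEUE,
				      O_RDONLY);
	if (unbounded < 0)
		perror("fanotify_init(FAN_UNLIMITED_QUEUE)");

	if (bounded >= 0)
		close(bounded);
	if (unbounded >= 0)
		close(unbounded);
	return 0;
}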

* Re: [PATCH v2] fs: fsnotify: account fsnotify metadata to kmemcg
  2017-10-30 12:43   ` Jan Kara
@ 2017-10-30 16:39     ` Yang Shi
  -1 siblings, 0 replies; 75+ messages in thread
From: Yang Shi @ 2017-10-30 16:39 UTC (permalink / raw)
  To: Jan Kara; +Cc: amir73il, linux-fsdevel, linux-mm, linux-kernel



On 10/30/17 5:43 AM, Jan Kara wrote:
> On Sat 28-10-17 02:22:18, Yang Shi wrote:
>> If some process generates events into a huge or unlimit event queue, but no
>> listener read them, they may consume significant amount of memory silently
>> until oom happens or some memory pressure issue is raised.
>> It'd better to account those slab caches in memcg so that we can get heads
>> up before the problematic process consume too much memory silently.
>>
>> But, the accounting might be heuristic if the producer is in the different
>> memcg from listener if the listener doesn't read the events. Due to the
>> current design of kmemcg, who does the allocation, who gets the accounting.
>>
>> Signed-off-by: Yang Shi <yang.s@alibaba-inc.com>
>> ---
>> v1 --> v2:
>> * Updated commit log per Amir's suggestion
> 
> I'm sorry but I don't think this solution is acceptable. I understand that
> in some cases (and you likely run one of these) the result may *happen* to
> be the desired one but in other cases, you might be charging wrong memcg
> and so misbehaving process in memcg A can effectively cause a DoS attack on
> a process in memcg B.

Yes, as I discussed with Amir in the earlier review, the current memcg
design just accounts memory to the allocating process; it has no idea
which process is the consumer.

Although it is not desirable to DoS a memcg, it still sounds better than
DoSing the whole machine due to a potential oom. This patch is aimed at
avoiding such a case.

> 
> If you have a setup in which notification events can consume considerable
> amount of resources, you are doing something wrong I think. Standard event
> queue length is limited, overall events are bounded to consume less than 1
> MB. If you have unbounded queue, the process has to be CAP_SYS_ADMIN and
> presumably it has good reasons for requesting unbounded queue and it should
> know what it is doing.

Yes, I agree it does mean something is going wrong. So, it'd be better to
account the memory in order to get a heads-up early, before things get
really bad. The limit will not be set too high, since fsnotify metadata
will not consume too much memory in the *normal* case.

I agree we should trust the admin user, but the kernel should be responsible
for the last line of defense when something is really going wrong. And we
can't guarantee an admin process will not do something wrong: the code might
not be reviewed thoroughly, and the tests might not cover some extreme cases.

> 
> So maybe we could come up with some better way to control amount of
> resources consumed by notification events but for that we lack more
> information about your use case. And I maintain that the solution should
> account events to the consumer, not the producer...

I do agree it is neither fair nor neat to account to the producer rather
than to the misbehaving consumer, but the current memcg design doesn't seem
to support such a use case. And the other question is: do we know who the
listener is if it doesn't read the events?

Thanks,
Yang

> 
> 								Honza
> 

* Re: [PATCH v2] fs: fsnotify: account fsnotify metadata to kmemcg
  2017-10-30 16:39     ` Yang Shi
@ 2017-10-31 10:12       ` Jan Kara
  -1 siblings, 0 replies; 75+ messages in thread
From: Jan Kara @ 2017-10-31 10:12 UTC (permalink / raw)
  To: Yang Shi
  Cc: Jan Kara, amir73il, linux-fsdevel, linux-mm, linux-kernel, mhocko

On Tue 31-10-17 00:39:58, Yang Shi wrote:
> On 10/30/17 5:43 AM, Jan Kara wrote:
> >On Sat 28-10-17 02:22:18, Yang Shi wrote:
> >>If some process generates events into a huge or unlimit event queue, but no
> >>listener read them, they may consume significant amount of memory silently
> >>until oom happens or some memory pressure issue is raised.
> >>It'd better to account those slab caches in memcg so that we can get heads
> >>up before the problematic process consume too much memory silently.
> >>
> >>But, the accounting might be heuristic if the producer is in the different
> >>memcg from listener if the listener doesn't read the events. Due to the
> >>current design of kmemcg, who does the allocation, who gets the accounting.
> >>
> >>Signed-off-by: Yang Shi <yang.s@alibaba-inc.com>
> >>---
> >>v1 --> v2:
> >>* Updated commit log per Amir's suggestion
> >
> >I'm sorry but I don't think this solution is acceptable. I understand that
> >in some cases (and you likely run one of these) the result may *happen* to
> >be the desired one but in other cases, you might be charging wrong memcg
> >and so misbehaving process in memcg A can effectively cause a DoS attack on
> >a process in memcg B.
> 
> Yes, as what I discussed with Amir in earlier review, current memcg design
> just accounts memory to the allocation process, but has no idea who is
> consumer process.
> 
> Although it is not desirable to DoS a memcg, it still sounds better than DoS
> the whole machine due to potential oom. This patch is aimed to avoid such
> case.

Thinking about this even more, your solution may have an even worse impact:
due to allocations failing, some applications may avoid generating fs
notification events for actions they do. And that may be a security issue in
case there are other applications using fanotify for security enforcement,
virus scanning, or whatever... In such cases it is better to take the
whole machine down than to let it run.

> >If you have a setup in which notification events can consume considerable
> >amount of resources, you are doing something wrong I think. Standard event
> >queue length is limited, overall events are bounded to consume less than 1
> >MB. If you have unbounded queue, the process has to be CAP_SYS_ADMIN and
> >presumably it has good reasons for requesting unbounded queue and it should
> >know what it is doing.
> 
> Yes, I agree it does mean something is going wrong. So, it'd better to be
> accounted in order to get some heads up early before something is going
> really bad. The limit will not be set too high since fsnotify metadata will
> not consume too much memory in *normal* case.
> 
> I agree we should trust admin user, but kernel should be responsible for the
> last defense when something is really going wrong. And, we can't guarantee
> admin process will not do something wrong, the code might be not reviewed
> thoroughly, the test might not cover some extreme cases.
> 
> >
> >So maybe we could come up with some better way to control amount of
> >resources consumed by notification events but for that we lack more
> >information about your use case. And I maintain that the solution should
> >account events to the consumer, not the producer...
> 
> I do agree it is not fair and not neat to account to producer rather than
> misbehaving consumer, but current memcg design looks not support such use
> case. And, the other question is do we know who is the listener if it
> doesn't read the events?

So you never know who will read from the notification file descriptor, but
you can simply account that to the process that created the notification
group, and that is IMO the right process to account to.

I agree that the current SLAB memcg accounting does not allow accounting to
a different memcg than that of the running process. However, I *think* it
should be possible to add such an interface. Michal?

								Honza
-- 
Jan Kara <jack@suse.com>
SUSE Labs, CR
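
Jan's question above, about charging the event to the memcg of the process
that created the notification group rather than to the producer's, is the
kind of interface a later kernel could add. A rough sketch follows; the
helper names are modeled on what later kernels eventually provided
(memalloc_use_memcg()/memalloc_unuse_memcg()) and should be treated as
assumptions here, not as an API that existed at the time of this thread.

#include <linux/memcontrol.h>
#include <linux/sched.h>
#include <linux/sched/mm.h>
#include <linux/slab.h>

struct example_group {
	struct mem_cgroup *memcg;	/* memcg of the group creator */
};

static void example_group_created(struct example_group *group)
{
	/* Remember the creator's (listener's) memcg at group setup time. */
	group->memcg = get_mem_cgroup_from_mm(current->mm);
}

static void *example_alloc_event(struct example_group *group,
				 struct kmem_cache *cachep)
{
	void *event;

	/* Charge the listener's memcg, not the allocating producer's. */
	memalloc_use_memcg(group->memcg);
	event = kmem_cache_alloc(cachep, GFP_KERNEL);
	memalloc_unuse_memcg();

	return event;
}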

* Re: [PATCH v2] fs: fsnotify: account fsnotify metadata to kmemcg
  2017-10-31 10:12       ` Jan Kara
@ 2017-10-31 16:44         ` Yang Shi
  -1 siblings, 0 replies; 75+ messages in thread
From: Yang Shi @ 2017-10-31 16:44 UTC (permalink / raw)
  To: Jan Kara; +Cc: amir73il, linux-fsdevel, linux-mm, linux-kernel, mhocko



On 10/31/17 3:12 AM, Jan Kara wrote:
> On Tue 31-10-17 00:39:58, Yang Shi wrote:
>> On 10/30/17 5:43 AM, Jan Kara wrote:
>>> On Sat 28-10-17 02:22:18, Yang Shi wrote:
>>>> If some process generates events into a huge or unlimit event queue, but no
>>>> listener read them, they may consume significant amount of memory silently
>>>> until oom happens or some memory pressure issue is raised.
>>>> It'd better to account those slab caches in memcg so that we can get heads
>>>> up before the problematic process consume too much memory silently.
>>>>
>>>> But, the accounting might be heuristic if the producer is in the different
>>>> memcg from listener if the listener doesn't read the events. Due to the
>>>> current design of kmemcg, who does the allocation, who gets the accounting.
>>>>
>>>> Signed-off-by: Yang Shi <yang.s@alibaba-inc.com>
>>>> ---
>>>> v1 --> v2:
>>>> * Updated commit log per Amir's suggestion
>>>
>>> I'm sorry but I don't think this solution is acceptable. I understand that
>>> in some cases (and you likely run one of these) the result may *happen* to
>>> be the desired one but in other cases, you might be charging wrong memcg
>>> and so misbehaving process in memcg A can effectively cause a DoS attack on
>>> a process in memcg B.
>>
>> Yes, as what I discussed with Amir in earlier review, current memcg design
>> just accounts memory to the allocation process, but has no idea who is
>> consumer process.
>>
>> Although it is not desirable to DoS a memcg, it still sounds better than DoS
>> the whole machine due to potential oom. This patch is aimed to avoid such
>> case.
> 
> Thinking about this even more, your solution may have even worse impact -
> due to allocations failing, some applications may avoid generation of fs
> notification events for actions they do. And that maybe a security issue in
> case there are other applications using fanotify for security enforcement,
> virus scanning, or whatever... In such cases it is better to take the
> whole machine down than to let it run.

I guess (just a guess) this might be solvable by Amir's patch, right? An
overflow or error event would be queued, and then the consumer
applications could do nicer error handling or a softer exit.

Actually, the event is dropped on -ENOMEM regardless of my patch. As Amir
said, this patch may just amplify this problem, if my understanding is
right.

Thanks,
Yang

> 
>>> If you have a setup in which notification events can consume considerable
>>> amount of resources, you are doing something wrong I think. Standard event
>>> queue length is limited, overall events are bounded to consume less than 1
>>> MB. If you have unbounded queue, the process has to be CAP_SYS_ADMIN and
>>> presumably it has good reasons for requesting unbounded queue and it should
>>> know what it is doing.
>>
>> Yes, I agree it does mean something is going wrong. So, it'd better to be
>> accounted in order to get some heads up early before something is going
>> really bad. The limit will not be set too high since fsnotify metadata will
>> not consume too much memory in *normal* case.
>>
>> I agree we should trust admin user, but kernel should be responsible for the
>> last defense when something is really going wrong. And, we can't guarantee
>> admin process will not do something wrong, the code might be not reviewed
>> thoroughly, the test might not cover some extreme cases.
>>
>>>
>>> So maybe we could come up with some better way to control amount of
>>> resources consumed by notification events but for that we lack more
>>> information about your use case. And I maintain that the solution should
>>> account events to the consumer, not the producer...
>>
>> I do agree it is not fair and not neat to account to producer rather than
>> misbehaving consumer, but current memcg design looks not support such use
>> case. And, the other question is do we know who is the listener if it
>> doesn't read the events?
> 
> So you never know who will read from the notification file descriptor but
> you can simply account that to the process that created the notification
> group and that is IMO the right process to account to.
> 
> I agree that current SLAB memcg accounting does not allow to account to a
> different memcg than the one of the running process. However I *think* it
> should be possible to add such interface. Michal?
> 
> 								Honza
> 

* Re: [PATCH v2] fs: fsnotify: account fsnotify metadata to kmemcg
  2017-10-31 16:44         ` Yang Shi
@ 2017-11-01 15:15           ` Jan Kara
  -1 siblings, 0 replies; 75+ messages in thread
From: Jan Kara @ 2017-11-01 15:15 UTC (permalink / raw)
  To: Yang Shi
  Cc: Jan Kara, amir73il, linux-fsdevel, linux-mm, linux-kernel, mhocko

On Wed 01-11-17 00:44:18, Yang Shi wrote:
> On 10/31/17 3:12 AM, Jan Kara wrote:
> >On Tue 31-10-17 00:39:58, Yang Shi wrote:
> >>On 10/30/17 5:43 AM, Jan Kara wrote:
> >>>On Sat 28-10-17 02:22:18, Yang Shi wrote:
> >>>>If some process generates events into a huge or unlimit event queue, but no
> >>>>listener read them, they may consume significant amount of memory silently
> >>>>until oom happens or some memory pressure issue is raised.
> >>>>It'd better to account those slab caches in memcg so that we can get heads
> >>>>up before the problematic process consume too much memory silently.
> >>>>
> >>>>But, the accounting might be heuristic if the producer is in the different
> >>>>memcg from listener if the listener doesn't read the events. Due to the
> >>>>current design of kmemcg, who does the allocation, who gets the accounting.
> >>>>
> >>>>Signed-off-by: Yang Shi <yang.s@alibaba-inc.com>
> >>>>---
> >>>>v1 --> v2:
> >>>>* Updated commit log per Amir's suggestion
> >>>
> >>>I'm sorry but I don't think this solution is acceptable. I understand that
> >>>in some cases (and you likely run one of these) the result may *happen* to
> >>>be the desired one but in other cases, you might be charging wrong memcg
> >>>and so misbehaving process in memcg A can effectively cause a DoS attack on
> >>>a process in memcg B.
> >>
> >>Yes, as what I discussed with Amir in earlier review, current memcg design
> >>just accounts memory to the allocation process, but has no idea who is
> >>consumer process.
> >>
> >>Although it is not desirable to DoS a memcg, it still sounds better than DoS
> >>the whole machine due to potential oom. This patch is aimed to avoid such
> >>case.
> >
> >Thinking about this even more, your solution may have even worse impact -
> >due to allocations failing, some applications may avoid generation of fs
> >notification events for actions they do. And that maybe a security issue in
> >case there are other applications using fanotify for security enforcement,
> >virus scanning, or whatever... In such cases it is better to take the
> >whole machine down than to let it run.
> 
> I guess (just guess) this might be able to be solved by Amir's patch, right?
> An overflow or error event will be queued, then the consumer applications
> could do nicer error handling/softer exit.

Well, Amir's patch solves the problem of visibility - it makes it known
that something bad (a lost event) happened. But it does not address the
fundamental issue that you account the event to the wrong memcg and thus
fail the allocation at the wrong times.

> Actually, the event is dropped when -ENOMEM regardless of my patch. As Amir
> said this patch may just amplify this problem if my understanding is right.

So currently, -ENOMEM cannot normally happen for such a small allocation.
The kernel will rather go OOM and kill some process to free memory. So
putting memcgs into the picture changes the behavior.

								Honza
-- 
Jan Kara <jack@suse.com>
SUSE Labs, CR

^ permalink raw reply	[flat|nested] 75+ messages in thread

* Re: [PATCH v2] fs: fsnotify: account fsnotify metadata to kmemcg
  2017-10-31 10:12       ` Jan Kara
@ 2017-11-09 13:54         ` Michal Hocko
  -1 siblings, 0 replies; 75+ messages in thread
From: Michal Hocko @ 2017-11-09 13:54 UTC (permalink / raw)
  To: Jan Kara; +Cc: Yang Shi, amir73il, linux-fsdevel, linux-mm, linux-kernel

[Sorry for the late reply]

On Tue 31-10-17 11:12:38, Jan Kara wrote:
> On Tue 31-10-17 00:39:58, Yang Shi wrote:
[...]
> > I do agree it is not fair and not neat to account to producer rather than
> > misbehaving consumer, but current memcg design looks not support such use
> > case. And, the other question is do we know who is the listener if it
> > doesn't read the events?
> 
> So you never know who will read from the notification file descriptor but
> you can simply account that to the process that created the notification
> group and that is IMO the right process to account to.

Yes, if the creator is de-facto owner which defines the lifetime of
those objects then this should be a target of the charge.

> I agree that current SLAB memcg accounting does not allow to account to a
> different memcg than the one of the running process. However I *think* it
> should be possible to add such interface. Michal?

We do have memcg_kmem_charge_memcg but that would require some plumbing
to hook it into the specific allocation path. I suspect it uses kmalloc,
right?
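
(For reference, the interface looks roughly like this - the exact
signature may differ slightly:

int memcg_kmem_charge_memcg(struct page *page, gfp_t gfp, int order,
			    struct mem_cgroup *memcg);

i.e. the caller has to pass in both the page being charged and the
target memcg.)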
-- 
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 75+ messages in thread

* Re: [PATCH v2] fs: fsnotify: account fsnotify metadata to kmemcg
  2017-11-09 13:54         ` Michal Hocko
@ 2017-11-13 19:10           ` Yang Shi
  -1 siblings, 0 replies; 75+ messages in thread
From: Yang Shi @ 2017-11-13 19:10 UTC (permalink / raw)
  To: Michal Hocko, Jan Kara; +Cc: amir73il, linux-fsdevel, linux-mm, linux-kernel



On 11/9/17 5:54 AM, Michal Hocko wrote:
> [Sorry for the late reply]
> 
> On Tue 31-10-17 11:12:38, Jan Kara wrote:
>> On Tue 31-10-17 00:39:58, Yang Shi wrote:
> [...]
>>> I do agree it is not fair and not neat to account to producer rather than
>>> misbehaving consumer, but current memcg design looks not support such use
>>> case. And, the other question is do we know who is the listener if it
>>> doesn't read the events?
>>
>> So you never know who will read from the notification file descriptor but
>> you can simply account that to the process that created the notification
>> group and that is IMO the right process to account to.
> 
> Yes, if the creator is de-facto owner which defines the lifetime of
> those objects then this should be a target of the charge.
> 
>> I agree that current SLAB memcg accounting does not allow to account to a
>> different memcg than the one of the running process. However I *think* it
>> should be possible to add such interface. Michal?
> 
> We do have memcg_kmem_charge_memcg but that would require some plumbing
> to hook it into the specific allocation path. I suspect it uses kmalloc,
> right?

Yes.

I took a look at the implementation and the callsites of 
memcg_kmem_charge_memcg(). It looks it is called by:

* charge kmem to memcg, but it is charged to the allocator's memcg
* allocate new slab page, charge to memcg_params.memcg

I think this is the plumbing you mentioned, right?
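
For reference, those two call sites look roughly like below (heavily
trimmed and paraphrased, so the refcounting and various bypass checks
are dropped):

/* 1) charge a kmem page to the memcg of the *allocating* task */
int memcg_kmem_charge(struct page *page, gfp_t gfp, int order)
{
	struct mem_cgroup *memcg = get_mem_cgroup_from_mm(current->mm);

	return memcg_kmem_charge_memcg(page, gfp, order, memcg);
}

/* 2) charge a new slab page to the cache's own memcg */
static int memcg_charge_slab(struct page *page, gfp_t gfp, int order,
			     struct kmem_cache *s)
{
	return memcg_kmem_charge_memcg(page, gfp, order,
				       s->memcg_params.memcg);
}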

Thanks,
Yang

> 

^ permalink raw reply	[flat|nested] 75+ messages in thread

* Re: [PATCH v2] fs: fsnotify: account fsnotify metadata to kmemcg
  2017-11-13 19:10           ` Yang Shi
@ 2017-11-14  9:39             ` Michal Hocko
  -1 siblings, 0 replies; 75+ messages in thread
From: Michal Hocko @ 2017-11-14  9:39 UTC (permalink / raw)
  To: Yang Shi; +Cc: Jan Kara, amir73il, linux-fsdevel, linux-mm, linux-kernel

On Tue 14-11-17 03:10:22, Yang Shi wrote:
> 
> 
> On 11/9/17 5:54 AM, Michal Hocko wrote:
> > [Sorry for the late reply]
> > 
> > On Tue 31-10-17 11:12:38, Jan Kara wrote:
> > > On Tue 31-10-17 00:39:58, Yang Shi wrote:
> > [...]
> > > > I do agree it is not fair and not neat to account to producer rather than
> > > > misbehaving consumer, but current memcg design looks not support such use
> > > > case. And, the other question is do we know who is the listener if it
> > > > doesn't read the events?
> > > 
> > > So you never know who will read from the notification file descriptor but
> > > you can simply account that to the process that created the notification
> > > group and that is IMO the right process to account to.
> > 
> > Yes, if the creator is de-facto owner which defines the lifetime of
> > those objects then this should be a target of the charge.
> > 
> > > I agree that current SLAB memcg accounting does not allow to account to a
> > > different memcg than the one of the running process. However I *think* it
> > > should be possible to add such interface. Michal?
> > 
> > We do have memcg_kmem_charge_memcg but that would require some plumbing
> > to hook it into the specific allocation path. I suspect it uses kmalloc,
> > right?
> 
> Yes.
> 
> I took a look at the implementation and the callsites of
> memcg_kmem_charge_memcg(). It looks it is called by:
> 
> * charge kmem to memcg, but it is charged to the allocator's memcg
> * allocate new slab page, charge to memcg_params.memcg
> 
> I think this is the plumbing you mentioned, right?

Maybe I have misunderstood, but you are using slab allocator. So you
would need to force it to use a different charging context than current.
I haven't checked deeply but this doesn't look trivial to me.
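
To illustrate what I mean by a different charging context, you would
need something like a scoped override (purely a sketch, nothing like
this exists today and all the names below are made up):

/* hypothetical scope markers telling kmem accounting whom to charge */
static inline void memalloc_use_memcg(struct mem_cgroup *memcg)
{
	current->active_memcg = memcg;	/* no such field today */
}

static inline void memalloc_unuse_memcg(void)
{
	current->active_memcg = NULL;
}

/* the fanotify event allocation could then do: */
	memalloc_use_memcg(group->memcg);	/* group->memcg is hypothetical too */
	event = kmem_cache_alloc(fanotify_event_cachep, GFP_KERNEL);
	memalloc_unuse_memcg();

and the kmem charging path would have to prefer that override over
get_mem_cgroup_from_mm(current->mm).
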
-- 
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 75+ messages in thread

* Re: [PATCH v2] fs: fsnotify: account fsnotify metadata to kmemcg
  2017-11-14  9:39             ` Michal Hocko
@ 2017-11-14 17:32               ` Yang Shi
  -1 siblings, 0 replies; 75+ messages in thread
From: Yang Shi @ 2017-11-14 17:32 UTC (permalink / raw)
  To: Michal Hocko; +Cc: Jan Kara, amir73il, linux-fsdevel, linux-mm, linux-kernel



On 11/14/17 1:39 AM, Michal Hocko wrote:
> On Tue 14-11-17 03:10:22, Yang Shi wrote:
>>
>>
>> On 11/9/17 5:54 AM, Michal Hocko wrote:
>>> [Sorry for the late reply]
>>>
>>> On Tue 31-10-17 11:12:38, Jan Kara wrote:
>>>> On Tue 31-10-17 00:39:58, Yang Shi wrote:
>>> [...]
>>>>> I do agree it is not fair and not neat to account to producer rather than
>>>>> misbehaving consumer, but current memcg design looks not support such use
>>>>> case. And, the other question is do we know who is the listener if it
>>>>> doesn't read the events?
>>>>
>>>> So you never know who will read from the notification file descriptor but
>>>> you can simply account that to the process that created the notification
>>>> group and that is IMO the right process to account to.
>>>
>>> Yes, if the creator is de-facto owner which defines the lifetime of
>>> those objects then this should be a target of the charge.
>>>
>>>> I agree that current SLAB memcg accounting does not allow to account to a
>>>> different memcg than the one of the running process. However I *think* it
>>>> should be possible to add such interface. Michal?
>>>
>>> We do have memcg_kmem_charge_memcg but that would require some plumbing
>>> to hook it into the specific allocation path. I suspect it uses kmalloc,
>>> right?
>>
>> Yes.
>>
>> I took a look at the implementation and the callsites of
>> memcg_kmem_charge_memcg(). It looks it is called by:
>>
>> * charge kmem to memcg, but it is charged to the allocator's memcg
>> * allocate new slab page, charge to memcg_params.memcg
>>
>> I think this is the plumbing you mentioned, right?
> 
> Maybe I have misunderstood, but you are using slab allocator. So you
> would need to force it to use a different charging context than current.

Yes.

> I haven't checked deeply but this doesn't look trivial to me.

I agree. This is also what I explained to Jan and Amir in earlier 
discussion.

Yang

> 

^ permalink raw reply	[flat|nested] 75+ messages in thread

* Re: [PATCH v2] fs: fsnotify: account fsnotify metadata to kmemcg
  2017-11-14 17:32               ` Yang Shi
@ 2017-11-15  9:31                 ` Jan Kara
  -1 siblings, 0 replies; 75+ messages in thread
From: Jan Kara @ 2017-11-15  9:31 UTC (permalink / raw)
  To: Yang Shi
  Cc: Michal Hocko, Jan Kara, amir73il, linux-fsdevel, linux-mm, linux-kernel

On Wed 15-11-17 01:32:16, Yang Shi wrote:
> 
> 
> On 11/14/17 1:39 AM, Michal Hocko wrote:
> >On Tue 14-11-17 03:10:22, Yang Shi wrote:
> >>
> >>
> >>On 11/9/17 5:54 AM, Michal Hocko wrote:
> >>>[Sorry for the late reply]
> >>>
> >>>On Tue 31-10-17 11:12:38, Jan Kara wrote:
> >>>>On Tue 31-10-17 00:39:58, Yang Shi wrote:
> >>>[...]
> >>>>>I do agree it is not fair and not neat to account to producer rather than
> >>>>>misbehaving consumer, but current memcg design looks not support such use
> >>>>>case. And, the other question is do we know who is the listener if it
> >>>>>doesn't read the events?
> >>>>
> >>>>So you never know who will read from the notification file descriptor but
> >>>>you can simply account that to the process that created the notification
> >>>>group and that is IMO the right process to account to.
> >>>
> >>>Yes, if the creator is de-facto owner which defines the lifetime of
> >>>those objects then this should be a target of the charge.
> >>>
> >>>>I agree that current SLAB memcg accounting does not allow to account to a
> >>>>different memcg than the one of the running process. However I *think* it
> >>>>should be possible to add such interface. Michal?
> >>>
> >>>We do have memcg_kmem_charge_memcg but that would require some plumbing
> >>>to hook it into the specific allocation path. I suspect it uses kmalloc,
> >>>right?
> >>
> >>Yes.
> >>
> >>I took a look at the implementation and the callsites of
> >>memcg_kmem_charge_memcg(). It looks it is called by:
> >>
> >>* charge kmem to memcg, but it is charged to the allocator's memcg
> >>* allocate new slab page, charge to memcg_params.memcg
> >>
> >>I think this is the plumbing you mentioned, right?
> >
> >Maybe I have misunderstood, but you are using slab allocator. So you
> >would need to force it to use a different charging context than current.
> 
> Yes.
> 
> >I haven't checked deeply but this doesn't look trivial to me.
> 
> I agree. This is also what I explained to Jan and Amir in earlier
> discussion.

And I also agree. But the fact that it is not trivial does not mean that it
should not be done...

								Honza
-- 
Jan Kara <jack@suse.com>
SUSE Labs, CR

^ permalink raw reply	[flat|nested] 75+ messages in thread

* Re: [PATCH v2] fs: fsnotify: account fsnotify metadata to kmemcg
  2017-11-15  9:31                 ` Jan Kara
@ 2018-01-19 15:02                   ` Shakeel Butt
  -1 siblings, 0 replies; 75+ messages in thread
From: Shakeel Butt @ 2018-01-19 15:02 UTC (permalink / raw)
  To: Jan Kara; +Cc: Yang Shi, Michal Hocko, amir73il, linux-fsdevel, Linux MM, LKML

On Wed, Nov 15, 2017 at 1:31 AM, Jan Kara <jack@suse.cz> wrote:
> On Wed 15-11-17 01:32:16, Yang Shi wrote:
>>
>>
>> On 11/14/17 1:39 AM, Michal Hocko wrote:
>> >On Tue 14-11-17 03:10:22, Yang Shi wrote:
>> >>
>> >>
>> >>On 11/9/17 5:54 AM, Michal Hocko wrote:
>> >>>[Sorry for the late reply]
>> >>>
>> >>>On Tue 31-10-17 11:12:38, Jan Kara wrote:
>> >>>>On Tue 31-10-17 00:39:58, Yang Shi wrote:
>> >>>[...]
>> >>>>>I do agree it is not fair and not neat to account to producer rather than
>> >>>>>misbehaving consumer, but current memcg design looks not support such use
>> >>>>>case. And, the other question is do we know who is the listener if it
>> >>>>>doesn't read the events?
>> >>>>
>> >>>>So you never know who will read from the notification file descriptor but
>> >>>>you can simply account that to the process that created the notification
>> >>>>group and that is IMO the right process to account to.
>> >>>
>> >>>Yes, if the creator is de-facto owner which defines the lifetime of
>> >>>those objects then this should be a target of the charge.
>> >>>
>> >>>>I agree that current SLAB memcg accounting does not allow to account to a
>> >>>>different memcg than the one of the running process. However I *think* it
>> >>>>should be possible to add such interface. Michal?
>> >>>
>> >>>We do have memcg_kmem_charge_memcg but that would require some plumbing
>> >>>to hook it into the specific allocation path. I suspect it uses kmalloc,
>> >>>right?
>> >>
>> >>Yes.
>> >>
>> >>I took a look at the implementation and the callsites of
>> >>memcg_kmem_charge_memcg(). It looks it is called by:
>> >>
>> >>* charge kmem to memcg, but it is charged to the allocator's memcg
>> >>* allocate new slab page, charge to memcg_params.memcg
>> >>
>> >>I think this is the plumbing you mentioned, right?
>> >
>> >Maybe I have misunderstood, but you are using slab allocator. So you
>> >would need to force it to use a different charging context than current.
>>
>> Yes.
>>
>> >I haven't checked deeply but this doesn't look trivial to me.
>>
>> I agree. This is also what I explained to Jan and Amir in earlier
>> discussion.
>
> And I also agree. But the fact that it is not trivial does not mean that it
> should not be done...
>

I am currently working on directed or remote memcg charging for a
different usecase and I think that would be helpful here as well.

I have two questions though:

1) Is fsnotify_group the right structure to hold the reference to
target mem_cgroup for charging?
2) Remote charging can trigger an OOM in the target memcg. In this
usecase, I think, there should be security concerns if the events
producer can trigger OOM in the memcg of the monitor. We can either
change these allocations to use __GFP_NORETRY or some new gfp flag to
not trigger oom-killer. So, is this valid concern or am I
over-thinking?
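
To make (1) concrete, this is the kind of thing I have in mind (just a
sketch; the field is new and I am assuming a helper like
get_mem_cgroup_from_mm() can be exposed for this):

/* new field in struct fsnotify_group: */
	struct mem_cgroup *memcg;	/* memcg to charge event allocations to */

/* at group creation time, pin the creator's memcg: */
	group->memcg = get_mem_cgroup_from_mm(current->mm);

/* at event allocation time, charge group->memcg through whatever
 * remote-charging interface we end up with and, per (2), possibly
 * avoid the OOM killer: */
	event = kmem_cache_alloc(fanotify_event_cachep,
				 GFP_KERNEL | __GFP_NORETRY);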

thanks,
Shakeel

^ permalink raw reply	[flat|nested] 75+ messages in thread

* Re: [PATCH v2] fs: fsnotify: account fsnotify metadata to kmemcg
  2018-01-19 15:02                   ` Shakeel Butt
@ 2018-01-22 20:31                     ` Amir Goldstein
  -1 siblings, 0 replies; 75+ messages in thread
From: Amir Goldstein @ 2018-01-22 20:31 UTC (permalink / raw)
  To: Shakeel Butt
  Cc: Jan Kara, Yang Shi, Michal Hocko, linux-fsdevel, Linux MM, LKML

On Fri, Jan 19, 2018 at 5:02 PM, Shakeel Butt <shakeelb@google.com> wrote:
> On Wed, Nov 15, 2017 at 1:31 AM, Jan Kara <jack@suse.cz> wrote:
>> On Wed 15-11-17 01:32:16, Yang Shi wrote:
>>>
>>>
>>> On 11/14/17 1:39 AM, Michal Hocko wrote:
>>> >On Tue 14-11-17 03:10:22, Yang Shi wrote:
>>> >>
>>> >>
>>> >>On 11/9/17 5:54 AM, Michal Hocko wrote:
>>> >>>[Sorry for the late reply]
>>> >>>
>>> >>>On Tue 31-10-17 11:12:38, Jan Kara wrote:
>>> >>>>On Tue 31-10-17 00:39:58, Yang Shi wrote:
>>> >>>[...]
>>> >>>>>I do agree it is not fair and not neat to account to producer rather than
>>> >>>>>misbehaving consumer, but current memcg design looks not support such use
>>> >>>>>case. And, the other question is do we know who is the listener if it
>>> >>>>>doesn't read the events?
>>> >>>>
>>> >>>>So you never know who will read from the notification file descriptor but
>>> >>>>you can simply account that to the process that created the notification
>>> >>>>group and that is IMO the right process to account to.
>>> >>>
>>> >>>Yes, if the creator is de-facto owner which defines the lifetime of
>>> >>>those objects then this should be a target of the charge.
>>> >>>
>>> >>>>I agree that current SLAB memcg accounting does not allow to account to a
>>> >>>>different memcg than the one of the running process. However I *think* it
>>> >>>>should be possible to add such interface. Michal?
>>> >>>
>>> >>>We do have memcg_kmem_charge_memcg but that would require some plumbing
>>> >>>to hook it into the specific allocation path. I suspect it uses kmalloc,
>>> >>>right?
>>> >>
>>> >>Yes.
>>> >>
>>> >>I took a look at the implementation and the callsites of
>>> >>memcg_kmem_charge_memcg(). It looks it is called by:
>>> >>
>>> >>* charge kmem to memcg, but it is charged to the allocator's memcg
>>> >>* allocate new slab page, charge to memcg_params.memcg
>>> >>
>>> >>I think this is the plumbing you mentioned, right?
>>> >
>>> >Maybe I have misunderstood, but you are using slab allocator. So you
>>> >would need to force it to use a different charging context than current.
>>>
>>> Yes.
>>>
>>> >I haven't checked deeply but this doesn't look trivial to me.
>>>
>>> I agree. This is also what I explained to Jan and Amir in earlier
>>> discussion.
>>
>> And I also agree. But the fact that it is not trivial does not mean that it
>> should not be done...
>>
>
> I am currently working on directed or remote memcg charging for a
> different usecase and I think that would be helpful here as well.
>
> I have two questions though:
>
> 1) Is fsnotify_group the right structure to hold the reference to
> target mem_cgroup for charging?

I think it is. The process who set up the group and determined the unlimited
events queue size and did not consume the events from the queue in a timely
manner is the process to blame for the OOM situation.

> 2) Remote charging can trigger an OOM in the target memcg. In this
> usecase, I think, there should be security concerns if the events
> producer can trigger OOM in the memcg of the monitor. We can either
> change these allocations to use __GFP_NORETRY or some new gfp flag to
> not trigger oom-killer. So, is this valid concern or am I
> over-thinking?
>

I think that is a very valid concern, but not sure about the solution.
For an inotify listener and fanotify listener of class FAN_CLASS_NOTIF
(group->priority == 0), I think it makes sense to let oom-killer kill
the listener which is not keeping up with consuming events.

For fanotify listener of class FAN_CLASS_{,PRE_}CONTENT
(group->priority > 0) allowing an adversary to trigger oom-killer
and kill the listener could bypass permission event checks.

So we could use different allocation flags for permission events
or for groups with high priority or for groups that have permission
events in their mask, so an adversary could not use non-permission
events to oom-kill a listener that is also monitoring permission events.
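
Something along these lines, perhaps (sketch only, the helper name is
made up):

static inline gfp_t fanotify_event_gfp(struct fsnotify_group *group)
{
	/*
	 * Groups that may receive permission events (priority > 0) should
	 * not let an event producer trigger the OOM killer in the
	 * listener's memcg; plain notification groups keep the default.
	 */
	if (group->priority > 0)
		return GFP_KERNEL | __GFP_NORETRY;

	return GFP_KERNEL;
}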

Generally speaking, permission event monitors should not be
setting  FAN_UNLIMITED_QUEUE and should not be causing oom
(because permission events are not queued they are blocking), but
there is nothing preventing a process to setup a FAN_CLASS_CONTENT
group with FAN_UNLIMITED_QUEUE and also monitor non-permission
events.
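
For example, nothing stops a listener from doing (userspace sketch,
error handling omitted):

#include <fcntl.h>
#include <sys/fanotify.h>

static int setup_listener(void)
{
	int fd = fanotify_init(FAN_CLASS_CONTENT | FAN_UNLIMITED_QUEUE,
			       O_RDONLY);

	/* permission and non-permission events on one unlimited queue */
	fanotify_mark(fd, FAN_MARK_ADD | FAN_MARK_MOUNT,
		      FAN_OPEN_PERM | FAN_MODIFY, AT_FDCWD, "/");
	return fd;
}

and then consuming events too slowly, growing the queue without bound.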

There is also nothing preventing a process from setting up one
FAN_CLASS_CONTENT listener for permission events and another
FAN_CLASS_NOTIF listener for non-permission event.
Maybe it is not wise, but we don't know that there are no such processes
out there.

I know I am throwing a lot of confusing information at you, but here is
an argument that might be simple enough:

If a process has set up a group with a permission event mask,
then the administrator has granted this process the permission to
bring the system to a halt even if the process misbehaves
(because it is better to halt the system than to let it be compromised).
If that interpretation is acceptable, then oom-killing a process
which has set up a group with a permission event mask is not
acceptable.

Thanks,
Amir.

^ permalink raw reply	[flat|nested] 75+ messages in thread

* Re: [PATCH v2] fs: fsnotify: account fsnotify metadata to kmemcg
  2018-01-22 20:31                     ` Amir Goldstein
@ 2018-01-24 10:34                       ` Jan Kara
  -1 siblings, 0 replies; 75+ messages in thread
From: Jan Kara @ 2018-01-24 10:34 UTC (permalink / raw)
  To: Amir Goldstein
  Cc: Shakeel Butt, Jan Kara, Yang Shi, Michal Hocko, linux-fsdevel,
	Linux MM, LKML

On Mon 22-01-18 22:31:20, Amir Goldstein wrote:
> On Fri, Jan 19, 2018 at 5:02 PM, Shakeel Butt <shakeelb@google.com> wrote:
> > On Wed, Nov 15, 2017 at 1:31 AM, Jan Kara <jack@suse.cz> wrote:
> >> On Wed 15-11-17 01:32:16, Yang Shi wrote:
> >>> On 11/14/17 1:39 AM, Michal Hocko wrote:
> >>> >On Tue 14-11-17 03:10:22, Yang Shi wrote:
> >>> >>
> >>> >>
> >>> >>On 11/9/17 5:54 AM, Michal Hocko wrote:
> >>> >>>[Sorry for the late reply]
> >>> >>>
> >>> >>>On Tue 31-10-17 11:12:38, Jan Kara wrote:
> >>> >>>>On Tue 31-10-17 00:39:58, Yang Shi wrote:
> >>> >>>[...]
> >>> >>>>>I do agree it is not fair and not neat to account to producer rather than
> >>> >>>>>misbehaving consumer, but current memcg design looks not support such use
> >>> >>>>>case. And, the other question is do we know who is the listener if it
> >>> >>>>>doesn't read the events?
> >>> >>>>
> >>> >>>>So you never know who will read from the notification file descriptor but
> >>> >>>>you can simply account that to the process that created the notification
> >>> >>>>group and that is IMO the right process to account to.
> >>> >>>
> >>> >>>Yes, if the creator is de-facto owner which defines the lifetime of
> >>> >>>those objects then this should be a target of the charge.
> >>> >>>
> >>> >>>>I agree that current SLAB memcg accounting does not allow to account to a
> >>> >>>>different memcg than the one of the running process. However I *think* it
> >>> >>>>should be possible to add such interface. Michal?
> >>> >>>
> >>> >>>We do have memcg_kmem_charge_memcg but that would require some plumbing
> >>> >>>to hook it into the specific allocation path. I suspect it uses kmalloc,
> >>> >>>right?
> >>> >>
> >>> >>Yes.
> >>> >>
> >>> >>I took a look at the implementation and the callsites of
> >>> >>memcg_kmem_charge_memcg(). It looks it is called by:
> >>> >>
> >>> >>* charge kmem to memcg, but it is charged to the allocator's memcg
> >>> >>* allocate new slab page, charge to memcg_params.memcg
> >>> >>
> >>> >>I think this is the plumbing you mentioned, right?
> >>> >
> >>> >Maybe I have misunderstood, but you are using slab allocator. So you
> >>> >would need to force it to use a different charging context than current.
> >>>
> >>> Yes.
> >>>
> >>> >I haven't checked deeply but this doesn't look trivial to me.
> >>>
> >>> I agree. This is also what I explained to Jan and Amir in earlier
> >>> discussion.
> >>
> >> And I also agree. But the fact that it is not trivial does not mean that it
> >> should not be done...
> >>
> >
> > I am currently working on directed or remote memcg charging for a
> > different usecase and I think that would be helpful here as well.
> >
> > I have two questions though:
> >
> > 1) Is fsnotify_group the right structure to hold the reference to
> > target mem_cgroup for charging?
> 
> I think it is. The process who set up the group and determined the unlimited
> events queue size and did not consume the events from the queue in a timely
> manner is the process to blame for the OOM situation.

Agreed here.

> > 2) Remote charging can trigger an OOM in the target memcg. In this
> > usecase, I think, there should be security concerns if the events
> > producer can trigger OOM in the memcg of the monitor. We can either
> > change these allocations to use __GFP_NORETRY or some new gfp flag to
> > not trigger oom-killer. So, is this valid concern or am I
> > over-thinking?
> >
> 
> I think that is a very valid concern, but not sure about the solution.
> For an inotify listener and fanotify listener of class FAN_CLASS_NOTIF
> (group->priority == 0), I think it makes sense to let oom-killer kill
> the listener which is not keeping up with consuming events.

Agreed.

> For fanotify listener of class FAN_CLASS_{,PRE_}CONTENT
> (group->priority > 0) allowing an adversary to trigger oom-killer
> and kill the listener could bypass permission event checks.
> 
> So we could use different allocation flags for permission events
> or for groups with high priority or for groups that have permission
> events in their mask, so an adversary could not use non-permission
> events to oom-kill a listener that is also monitoring permission events.
> 
> Generally speaking, permission event monitors should not be
> setting  FAN_UNLIMITED_QUEUE and should not be causing oom
> (because permission events are not queued they are blocking), but
> there is nothing preventing a process to setup a FAN_CLASS_CONTENT
> group with FAN_UNLIMITED_QUEUE and also monitor non-permission
> events.
> 
> There is also nothing preventing a process from setting up one
> FAN_CLASS_CONTENT listener for permission events and another
> FAN_CLASS_NOTIF listener for non-permission event.
> Maybe it is not wise, but we don't know that there are no such processes
> out there.

So IMHO there are different ways to set up memcgs and processes in them
and you cannot just guess what the desired outcome is. The facts are:

1) Process has setup queue with unlimited length.
2) Admin has put the process into memcg with limited memory.
3) The process cannot keep up with event producer.

These three facts create conflicting demands and it depends on the
eye of the beholder what is correct. E.g. you may not want to take the
whole hosting machine down because something bad happened in one container.
OTOH you might want that to happen if that particular container is
responsible for maintaining security - but then IMHO it is not a clever
setup to constrain the memory of the security-sensitive application.

So my stance on this is that we should define easy-to-understand semantics
and let the admins deal with it. IMO those semantics should follow how we
currently behave on system-wide OOM - i.e., simply trigger the OOM killer
when a cgroup goes over its limits.

If we wanted to create a safer API for fanotify from this standpoint, we
could allow a new type of fanotify group where the queue would be limited
in length but tasks adding events to the queue would get throttled as the
queue fills up. Something like dirty page cache throttling. But I'd go for
this complexity only once we have clear indications from practice that the
current scheme is not enough.
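
For concreteness, the throttling I have in mind is roughly (only a
sketch, nothing like this exists and the condition and locking would
need more thought):

/* producer side: block while a throttled group's queue is full */
static void fsnotify_throttle_producer(struct fsnotify_group *group)
{
	wait_event_interruptible(group->notification_waitq,
				 group->q_len < group->max_events);
}

/* consumer side: after dequeueing an event, wake up throttled producers */
	wake_up(&group->notification_waitq);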

								Honza
-- 
Jan Kara <jack@suse.com>
SUSE Labs, CR

^ permalink raw reply	[flat|nested] 75+ messages in thread

* Re: [PATCH v2] fs: fsnotify: account fsnotify metadata to kmemcg
@ 2018-01-24 10:34                       ` Jan Kara
  0 siblings, 0 replies; 75+ messages in thread
From: Jan Kara @ 2018-01-24 10:34 UTC (permalink / raw)
  To: Amir Goldstein
  Cc: Shakeel Butt, Jan Kara, Yang Shi, Michal Hocko, linux-fsdevel,
	Linux MM, LKML

On Mon 22-01-18 22:31:20, Amir Goldstein wrote:
> On Fri, Jan 19, 2018 at 5:02 PM, Shakeel Butt <shakeelb@google.com> wrote:
> > On Wed, Nov 15, 2017 at 1:31 AM, Jan Kara <jack@suse.cz> wrote:
> >> On Wed 15-11-17 01:32:16, Yang Shi wrote:
> >>> On 11/14/17 1:39 AM, Michal Hocko wrote:
> >>> >On Tue 14-11-17 03:10:22, Yang Shi wrote:
> >>> >>
> >>> >>
> >>> >>On 11/9/17 5:54 AM, Michal Hocko wrote:
> >>> >>>[Sorry for the late reply]
> >>> >>>
> >>> >>>On Tue 31-10-17 11:12:38, Jan Kara wrote:
> >>> >>>>On Tue 31-10-17 00:39:58, Yang Shi wrote:
> >>> >>>[...]
> >>> >>>>>I do agree it is not fair and not neat to account to producer rather than
> >>> >>>>>misbehaving consumer, but current memcg design looks not support such use
> >>> >>>>>case. And, the other question is do we know who is the listener if it
> >>> >>>>>doesn't read the events?
> >>> >>>>
> >>> >>>>So you never know who will read from the notification file descriptor but
> >>> >>>>you can simply account that to the process that created the notification
> >>> >>>>group and that is IMO the right process to account to.
> >>> >>>
> >>> >>>Yes, if the creator is de-facto owner which defines the lifetime of
> >>> >>>those objects then this should be a target of the charge.
> >>> >>>
> >>> >>>>I agree that current SLAB memcg accounting does not allow to account to a
> >>> >>>>different memcg than the one of the running process. However I *think* it
> >>> >>>>should be possible to add such interface. Michal?
> >>> >>>
> >>> >>>We do have memcg_kmem_charge_memcg but that would require some plumbing
> >>> >>>to hook it into the specific allocation path. I suspect it uses kmalloc,
> >>> >>>right?
> >>> >>
> >>> >>Yes.
> >>> >>
> >>> >>I took a look at the implementation and the callsites of
> >>> >>memcg_kmem_charge_memcg(). It looks it is called by:
> >>> >>
> >>> >>* charge kmem to memcg, but it is charged to the allocator's memcg
> >>> >>* allocate new slab page, charge to memcg_params.memcg
> >>> >>
> >>> >>I think this is the plumbing you mentioned, right?
> >>> >
> >>> >Maybe I have misunderstood, but you are using slab allocator. So you
> >>> >would need to force it to use a different charging context than current.
> >>>
> >>> Yes.
> >>>
> >>> >I haven't checked deeply but this doesn't look trivial to me.
> >>>
> >>> I agree. This is also what I explained to Jan and Amir in earlier
> >>> discussion.
> >>
> >> And I also agree. But the fact that it is not trivial does not mean that it
> >> should not be done...
> >>
> >
> > I am currently working on directed or remote memcg charging for a
> > different usecase and I think that would be helpful here as well.
> >
> > I have two questions though:
> >
> > 1) Is fsnotify_group the right structure to hold the reference to
> > target mem_cgroup for charging?
> 
> I think it is. The process who set up the group and determined the unlimited
> events queue size and did not consume the events from the queue in a timely
> manner is the process to blame for the OOM situation.

Agreed here.

> > 2) Remote charging can trigger an OOM in the target memcg. In this
> > usecase, I think, there should be security concerns if the events
> > producer can trigger OOM in the memcg of the monitor. We can either
> > change these allocations to use __GFP_NORETRY or some new gfp flag to
> > not trigger oom-killer. So, is this valid concern or am I
> > over-thinking?
> >
> 
> I think that is a very valid concern, but not sure about the solution.
> For an inotify listener and fanotify listener of class FAN_CLASS_NOTIF
> (group->priority == 0), I think it makes sense to let oom-killer kill
> the listener which is not keeping up with consuming events.

Agreed.

> For fanotify listener of class FAN_CLASS_{,PRE_}CONTENT
> (group->priority > 0) allowing an adversary to trigger oom-killer
> and kill the listener could bypass permission event checks.
> 
> So we could use different allocation flags for permission events
> or for groups with high priority or for groups that have permission
> events in their mask, so an adversary could not use non-permission
> events to oom-kill a listener that is also monitoring permission events.
> 
> Generally speaking, permission event monitors should not be
> setting  FAN_UNLIMITED_QUEUE and should not be causing oom
> (because permission events are not queued they are blocking), but
> there is nothing preventing a process to setup a FAN_CLASS_CONTENT
> group with FAN_UNLIMITED_QUEUE and also monitor non-permission
> events.
> 
> There is also nothing preventing a process from setting up one
> FAN_CLASS_CONTENT listener for permission events and another
> FAN_CLASS_NOTIF listener for non-permission event.
> Maybe it is not wise, but we don't know that there are no such processes
> out there.

So IMHO there are different ways to set up memcgs and processes in them and
you cannot just guess what the desired outcome is. The facts are:

1) Process has setup queue with unlimited length.
2) Admin has put the process into memcg with limited memory.
3) The process cannot keep up with event producer.

These three facts create conflicting demands and it depends on the eye of
the beholder what is correct. E.g. you may not want to take the whole
hosting machine down because something bad happened in one container.
OTOH you might want that to happen if that particular container is
responsible for maintaining security - but then IMHO it is not a clever
setup to constrain the memory of a security-sensitive application.

So my stance on this is that we should define easy-to-understand semantics
and let the admins deal with it. IMO those semantics should follow how we
currently behave on system-wide OOM - i.e., simply trigger the OOM killer
when a cgroup goes over its limits.

If we wanted to create a safer API for fanotify from this standpoint, we
could allow a new type of fanotify group whose queue would be limited in
length but where tasks adding events to the queue would get throttled as
the queue fills up. Something like dirty page cache throttling. But I'd go
for this complexity only once we have clear indications from practice that
the current scheme is not enough.
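
(Purely to illustrate the throttling idea, not something in this patch:
a rough sketch assuming the group grows a producer wait queue and a hard
queue limit - both names below are invented.)

/*
 * Hypothetical sketch of producer-side throttling for a length-limited
 * group, in the spirit of dirty page cache throttling: the producer
 * sleeps until the listener drains the queue below half the limit.
 * producer_waitq and max_events_hard are invented names.
 */
static void fsnotify_throttle_producer(struct fsnotify_group *group)
{
	if (group->q_len < group->max_events_hard)
		return;

	wait_event_killable(group->producer_waitq,
			    group->q_len < group->max_events_hard / 2);
}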

								Honza
-- 
Jan Kara <jack@suse.com>
SUSE Labs, CR

^ permalink raw reply	[flat|nested] 75+ messages in thread

* Re: [PATCH v2] fs: fsnotify: account fsnotify metadata to kmemcg
  2018-01-24 10:34                       ` Jan Kara
@ 2018-01-24 11:12                         ` Amir Goldstein
  -1 siblings, 0 replies; 75+ messages in thread
From: Amir Goldstein @ 2018-01-24 11:12 UTC (permalink / raw)
  To: Jan Kara
  Cc: Shakeel Butt, Yang Shi, Michal Hocko, linux-fsdevel, Linux MM, LKML

On Wed, Jan 24, 2018 at 12:34 PM, Jan Kara <jack@suse.cz> wrote:
> On Mon 22-01-18 22:31:20, Amir Goldstein wrote:
>> On Fri, Jan 19, 2018 at 5:02 PM, Shakeel Butt <shakeelb@google.com> wrote:
>> > On Wed, Nov 15, 2017 at 1:31 AM, Jan Kara <jack@suse.cz> wrote:
>> >> On Wed 15-11-17 01:32:16, Yang Shi wrote:
>> >>> On 11/14/17 1:39 AM, Michal Hocko wrote:
>> >>> >On Tue 14-11-17 03:10:22, Yang Shi wrote:
>> >>> >>
>> >>> >>
>> >>> >>On 11/9/17 5:54 AM, Michal Hocko wrote:
>> >>> >>>[Sorry for the late reply]
>> >>> >>>
>> >>> >>>On Tue 31-10-17 11:12:38, Jan Kara wrote:
>> >>> >>>>On Tue 31-10-17 00:39:58, Yang Shi wrote:
>> >>> >>>[...]
>> >>> >>>>>I do agree it is not fair and not neat to account to producer rather than
>> >>> >>>>>misbehaving consumer, but current memcg design looks not support such use
>> >>> >>>>>case. And, the other question is do we know who is the listener if it
>> >>> >>>>>doesn't read the events?
>> >>> >>>>
>> >>> >>>>So you never know who will read from the notification file descriptor but
>> >>> >>>>you can simply account that to the process that created the notification
>> >>> >>>>group and that is IMO the right process to account to.
>> >>> >>>
>> >>> >>>Yes, if the creator is de-facto owner which defines the lifetime of
>> >>> >>>those objects then this should be a target of the charge.
>> >>> >>>
>> >>> >>>>I agree that current SLAB memcg accounting does not allow to account to a
>> >>> >>>>different memcg than the one of the running process. However I *think* it
>> >>> >>>>should be possible to add such interface. Michal?
>> >>> >>>
>> >>> >>>We do have memcg_kmem_charge_memcg but that would require some plumbing
>> >>> >>>to hook it into the specific allocation path. I suspect it uses kmalloc,
>> >>> >>>right?
>> >>> >>
>> >>> >>Yes.
>> >>> >>
>> >>> >>I took a look at the implementation and the callsites of
>> >>> >>memcg_kmem_charge_memcg(). It looks it is called by:
>> >>> >>
>> >>> >>* charge kmem to memcg, but it is charged to the allocator's memcg
>> >>> >>* allocate new slab page, charge to memcg_params.memcg
>> >>> >>
>> >>> >>I think this is the plumbing you mentioned, right?
>> >>> >
>> >>> >Maybe I have misunderstood, but you are using slab allocator. So you
>> >>> >would need to force it to use a different charging context than current.
>> >>>
>> >>> Yes.
>> >>>
>> >>> >I haven't checked deeply but this doesn't look trivial to me.
>> >>>
>> >>> I agree. This is also what I explained to Jan and Amir in earlier
>> >>> discussion.
>> >>
>> >> And I also agree. But the fact that it is not trivial does not mean that it
>> >> should not be done...
>> >>
>> >
>> > I am currently working on directed or remote memcg charging for a
>> > different usecase and I think that would be helpful here as well.
>> >
>> > I have two questions though:
>> >
>> > 1) Is fsnotify_group the right structure to hold the reference to
>> > target mem_cgroup for charging?
>>
>> I think it is. The process who set up the group and determined the unlimited
>> events queue size and did not consume the events from the queue in a timely
>> manner is the process to blame for the OOM situation.
>
> Agreed here.
>
>> > 2) Remote charging can trigger an OOM in the target memcg. In this
>> > usecase, I think, there should be security concerns if the events
>> > producer can trigger OOM in the memcg of the monitor. We can either
>> > change these allocations to use __GFP_NORETRY or some new gfp flag to
>> > not trigger oom-killer. So, is this valid concern or am I
>> > over-thinking?
>> >
>>
>> I think that is a very valid concern, but not sure about the solution.
>> For an inotify listener and fanotify listener of class FAN_CLASS_NOTIF
>> (group->priority == 0), I think it makes sense to let oom-killer kill
>> the listener which is not keeping up with consuming events.
>
> Agreed.
>
>> For fanotify listener of class FAN_CLASS_{,PRE_}CONTENT
>> (group->priority > 0) allowing an adversary to trigger oom-killer
>> and kill the listener could bypass permission event checks.
>>
>> So we could use different allocation flags for permission events
>> or for groups with high priority or for groups that have permission
>> events in their mask, so an adversary could not use non-permission
>> events to oom-kill a listener that is also monitoring permission events.
>>
>> Generally speaking, permission event monitors should not be
>> setting  FAN_UNLIMITED_QUEUE and should not be causing oom
>> (because permission events are not queued they are blocking), but
>> there is nothing preventing a process to setup a FAN_CLASS_CONTENT
>> group with FAN_UNLIMITED_QUEUE and also monitor non-permission
>> events.
>>
>> There is also nothing preventing a process from setting up one
>> FAN_CLASS_CONTENT listener for permission events and another
>> FAN_CLASS_NOTIF listener for non-permission event.
>> Maybe it is not wise, but we don't know that there are no such processes
>> out there.
>
> So IMHO there are different ways to setup memcgs and processes in them and
> you cannot just guess what desired outcome is. The facts are:
>
> 1) Process has setup queue with unlimited length.
> 2) Admin has put the process into memcg with limited memory.
> 3) The process cannot keep up with event producer.
>
> These three facts are creating conflicting demands and it depends on the
> eye of the beholder what is correct. E.g. you may not want to take the
> whole hosting machine down because something bad happened in one container.
> OTOH you might what that to happen if that particular container is
> responsible for maintaining security - but then IMHO it is not a clever
> setup to constrain memory of the security sensitive application.
>
> So my stance on this is that we should define easy to understand semantics
> and let the admins deal with it. IMO that semantics should follow how we
> currently behave on system-wide OOM - i.e., simply trigger OOM killer when
> cgroup is going over limits.
>
> If we wanted to create safer API for fanotify from this standpoint, we
> could allow new type of fanotify groups where queue would be limited in
> length but tasks adding events to queue would get throttled as the queue
> fills up. Something like dirty page cache throttling. But I'd go for this
> complexity only once we have clear indications from practice that current
> scheme is not enough.
>

What you are saying makes sense for a good design, but IIUC, what follows
is that you think we should change behavior and start accounting event
allocations to the listener without any opt-in from the admin?

That could make existing systems break and become vulnerable to killing
the AV daemon by an unlimited-queue overflow attack.
My claim was that we cannot change behavior to charge a misconfigured
permission event fanotify listener with allocations, because of that risk.
Perhaps I did not read your response correctly to understand how you intend
for us to mitigate this potential breakage.

Thanks,
Amir.

^ permalink raw reply	[flat|nested] 75+ messages in thread

* Re: [PATCH v2] fs: fsnotify: account fsnotify metadata to kmemcg
  2018-01-24 11:12                         ` Amir Goldstein
@ 2018-01-25  1:08                           ` Shakeel Butt
  -1 siblings, 0 replies; 75+ messages in thread
From: Shakeel Butt @ 2018-01-25  1:08 UTC (permalink / raw)
  To: Amir Goldstein
  Cc: Jan Kara, Yang Shi, Michal Hocko, linux-fsdevel, Linux MM, LKML

On Wed, Jan 24, 2018 at 3:12 AM, Amir Goldstein <amir73il@gmail.com> wrote:
> On Wed, Jan 24, 2018 at 12:34 PM, Jan Kara <jack@suse.cz> wrote:
>> On Mon 22-01-18 22:31:20, Amir Goldstein wrote:
>>> On Fri, Jan 19, 2018 at 5:02 PM, Shakeel Butt <shakeelb@google.com> wrote:
>>> > On Wed, Nov 15, 2017 at 1:31 AM, Jan Kara <jack@suse.cz> wrote:
>>> >> On Wed 15-11-17 01:32:16, Yang Shi wrote:
>>> >>> On 11/14/17 1:39 AM, Michal Hocko wrote:
>>> >>> >On Tue 14-11-17 03:10:22, Yang Shi wrote:
>>> >>> >>
>>> >>> >>
>>> >>> >>On 11/9/17 5:54 AM, Michal Hocko wrote:
>>> >>> >>>[Sorry for the late reply]
>>> >>> >>>
>>> >>> >>>On Tue 31-10-17 11:12:38, Jan Kara wrote:
>>> >>> >>>>On Tue 31-10-17 00:39:58, Yang Shi wrote:
>>> >>> >>>[...]
>>> >>> >>>>>I do agree it is not fair and not neat to account to producer rather than
>>> >>> >>>>>misbehaving consumer, but current memcg design looks not support such use
>>> >>> >>>>>case. And, the other question is do we know who is the listener if it
>>> >>> >>>>>doesn't read the events?
>>> >>> >>>>
>>> >>> >>>>So you never know who will read from the notification file descriptor but
>>> >>> >>>>you can simply account that to the process that created the notification
>>> >>> >>>>group and that is IMO the right process to account to.
>>> >>> >>>
>>> >>> >>>Yes, if the creator is de-facto owner which defines the lifetime of
>>> >>> >>>those objects then this should be a target of the charge.
>>> >>> >>>
>>> >>> >>>>I agree that current SLAB memcg accounting does not allow to account to a
>>> >>> >>>>different memcg than the one of the running process. However I *think* it
>>> >>> >>>>should be possible to add such interface. Michal?
>>> >>> >>>
>>> >>> >>>We do have memcg_kmem_charge_memcg but that would require some plumbing
>>> >>> >>>to hook it into the specific allocation path. I suspect it uses kmalloc,
>>> >>> >>>right?
>>> >>> >>
>>> >>> >>Yes.
>>> >>> >>
>>> >>> >>I took a look at the implementation and the callsites of
>>> >>> >>memcg_kmem_charge_memcg(). It looks it is called by:
>>> >>> >>
>>> >>> >>* charge kmem to memcg, but it is charged to the allocator's memcg
>>> >>> >>* allocate new slab page, charge to memcg_params.memcg
>>> >>> >>
>>> >>> >>I think this is the plumbing you mentioned, right?
>>> >>> >
>>> >>> >Maybe I have misunderstood, but you are using slab allocator. So you
>>> >>> >would need to force it to use a different charging context than current.
>>> >>>
>>> >>> Yes.
>>> >>>
>>> >>> >I haven't checked deeply but this doesn't look trivial to me.
>>> >>>
>>> >>> I agree. This is also what I explained to Jan and Amir in earlier
>>> >>> discussion.
>>> >>
>>> >> And I also agree. But the fact that it is not trivial does not mean that it
>>> >> should not be done...
>>> >>
>>> >
>>> > I am currently working on directed or remote memcg charging for a
>>> > different usecase and I think that would be helpful here as well.
>>> >
>>> > I have two questions though:
>>> >
>>> > 1) Is fsnotify_group the right structure to hold the reference to
>>> > target mem_cgroup for charging?
>>>
>>> I think it is. The process who set up the group and determined the unlimited
>>> events queue size and did not consume the events from the queue in a timely
>>> manner is the process to blame for the OOM situation.
>>
>> Agreed here.

Please note that for fcntl(F_NOTIFY), a global group, dnotify_group,
is used. The allocations from dnotify_struct_cache &
dnotify_mark_cache happen in fcntl(F_NOTIFY) itself, so I take it that
the memcg of the current process should be charged.
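
(To make that concrete - a simplified sketch of the allocation in the
fcntl(F_NOTIFY) path; with SLAB_ACCOUNT on the cache, the charge naturally
lands on the caller, i.e. the listener itself.)

	/* Simplified from fcntl_dirnotify(): this runs in the context of
	 * the process that issued fcntl(fd, F_NOTIFY, ...), so with
	 * SLAB_ACCOUNT on dnotify_struct_cache the allocation is charged
	 * to that process's own memcg - no remote charging needed here. */
	struct dnotify_struct *dn;

	dn = kmem_cache_alloc(dnotify_struct_cache, GFP_KERNEL);
	if (!dn)
		return -ENOMEM;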

>>
>>> > 2) Remote charging can trigger an OOM in the target memcg. In this
>>> > usecase, I think, there should be security concerns if the events
>>> > producer can trigger OOM in the memcg of the monitor. We can either
>>> > change these allocations to use __GFP_NORETRY or some new gfp flag to
>>> > not trigger oom-killer. So, is this valid concern or am I
>>> > over-thinking?

First, let me apologize: I think I might have led the discussion in the
wrong direction by giving one piece of wrong information. The current
upstream kernel, from the syscall context, does not invoke the oom-killer
when a memcg hits its limit and fails to reclaim memory; instead ENOMEM is
returned. The memcg oom-killer is only invoked on page faults. However,
in a separate effort I do plan to converge the behavior; long
discussion at <https://patchwork.kernel.org/patch/9988063/>.

>>> >
>>>
>>> I think that is a very valid concern, but not sure about the solution.
>>> For an inotify listener and fanotify listener of class FAN_CLASS_NOTIF
>>> (group->priority == 0), I think it makes sense to let oom-killer kill
>>> the listener which is not keeping up with consuming events.
>>
>> Agreed.
>>

Is ENOMEM as acceptable as the oom-killer killing the listener for
FAN_CLASS_NOTIF? The current kernel will return ENOMEM, but a future
kernel may trigger the oom-killer.

>>> For fanotify listener of class FAN_CLASS_{,PRE_}CONTENT
>>> (group->priority > 0) allowing an adversary to trigger oom-killer
>>> and kill the listener could bypass permission event checks.
>>>
>>> So we could use different allocation flags for permission events
>>> or for groups with high priority or for groups that have permission
>>> events in their mask, so an adversary could not use non-permission
>>> events to oom-kill a listener that is also monitoring permission events.
>>>
>>> Generally speaking, permission event monitors should not be
>>> setting  FAN_UNLIMITED_QUEUE and should not be causing oom
>>> (because permission events are not queued they are blocking), but
>>> there is nothing preventing a process to setup a FAN_CLASS_CONTENT
>>> group with FAN_UNLIMITED_QUEUE and also monitor non-permission
>>> events.
>>>
>>> There is also nothing preventing a process from setting up one
>>> FAN_CLASS_CONTENT listener for permission events and another
>>> FAN_CLASS_NOTIF listener for non-permission event.
>>> Maybe it is not wise, but we don't know that there are no such processes
>>> out there.
>>
>> So IMHO there are different ways to setup memcgs and processes in them and
>> you cannot just guess what desired outcome is. The facts are:
>>
>> 1) Process has setup queue with unlimited length.
>> 2) Admin has put the process into memcg with limited memory.
>> 3) The process cannot keep up with event producer.
>>
>> These three facts are creating conflicting demands and it depends on the
>> eye of the beholder what is correct. E.g. you may not want to take the
>> whole hosting machine down because something bad happened in one container.
>> OTOH you might what that to happen if that particular container is
>> responsible for maintaining security - but then IMHO it is not a clever
>> setup to constrain memory of the security sensitive application.
>>
>> So my stance on this is that we should define easy to understand semantics
>> and let the admins deal with it. IMO that semantics should follow how we
>> currently behave on system-wide OOM - i.e., simply trigger OOM killer when
>> cgroup is going over limits.
>>
>> If we wanted to create safer API for fanotify from this standpoint, we
>> could allow new type of fanotify groups where queue would be limited in
>> length but tasks adding events to queue would get throttled as the queue
>> fills up. Something like dirty page cache throttling. But I'd go for this
>> complexity only once we have clear indications from practice that current
>> scheme is not enough.
>>
>
> What you are saying makes sense for a good design, but IIUC, what follows
> is that you think we should change behavior and start accounting event
> allocation to listener without any opt-in from admin?
>
> That could make existing systems break and become vulnerable to killing
> the AV daemon by unlimited queue overflow attack.
> My claim was that we cannot change behavior to charge a misconfigured
> permission event fanotify listener with allocations, because of that risk.
> Perhaps I did not read your response correctly to understand how you intend
> we mitigate this potential breakage.
>

I am new to the fsnotify APIs, so please point out if I have missed or
misunderstood something. From what I understand after looking at the
code, the only interesting allocations are from
fanotify_perm_event_cachep and fanotify_event_cachep when
fanotify_alloc_event() is called from fanotify_handle_event(). For all
the other kmem caches, the allocations happen from syscalls done by
the listener and thus can directly be charged to the memcg of the
current process; if charging is unsuccessful, ENOMEM will be returned by
the syscall to the listener (an oom-kill in a probable future kernel
should be fine too).

For the interesting case, we can get the memcg from the fsnotify_group
object. Following scenarios can happen:

1. ENOMEM returned for an allocation from fanotify_perm_event_cachep
2. ENOMEM returned for an allocation from fanotify_event_cachep
3. (Maybe) oom-kill triggered by an allocation from
fanotify_perm_event_cachep (in a future kernel)
4. (Maybe) oom-kill triggered by an allocation from fanotify_event_cachep
(in a future kernel)

Which of the above scenarios are acceptable?
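
(For reference, a rough sketch of what the charging side of the
interesting case could look like once remote charging exists -
group->memcg and the charge use/unuse helpers below are hypothetical
placeholders for whatever interface we end up with.)

	/*
	 * Sketch only: in fanotify_alloc_event(), charge the event to the
	 * memcg remembered in the group (the listener's) instead of the
	 * producer's.  The helpers below do not exist yet; they stand in
	 * for the remote charging interface under discussion.
	 */
	struct fanotify_event_info *event;

	memcg_charge_use(group->memcg);		/* hypothetical */
	event = kmem_cache_alloc(fanotify_event_cachep, GFP_KERNEL);
	memcg_charge_unuse();			/* hypothetical */
	if (!event)
		return NULL;	/* scenarios 1/2: the producer sees the failure */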

Shakeel

^ permalink raw reply	[flat|nested] 75+ messages in thread

* Re: [PATCH v2] fs: fsnotify: account fsnotify metadata to kmemcg
  2018-01-25  1:08                           ` Shakeel Butt
@ 2018-01-25  1:54                             ` Al Viro
  -1 siblings, 0 replies; 75+ messages in thread
From: Al Viro @ 2018-01-25  1:54 UTC (permalink / raw)
  To: Shakeel Butt
  Cc: Amir Goldstein, Jan Kara, Yang Shi, Michal Hocko, linux-fsdevel,
	Linux MM, LKML

On Wed, Jan 24, 2018 at 05:08:27PM -0800, Shakeel Butt wrote:
> First, let me apologize, I think I might have led the discussion in
> wrong direction by giving one wrong information. The current upstream
> kernel, from the syscall context, does not invoke oom-killer when a
> memcg hits its limit and fails to reclaim memory, instead ENOMEM is
> returned. The memcg oom-killer is only invoked on page faults. However
> in a separate effort I do plan to converge the behavior, long
> discussion at <https://patchwork.kernel.org/patch/9988063/>.

Correct me if I'm misinterpreting you, but your rationale in there
appears to be along the lines of "userland applications might not
be ready to handle -ENOMEM gracefully, so let's hit them with
kill -9 instead - that will be handled properly, 'cuz M4G1C!!1!!!!"

I must admit that I like the general feel of that idea; may I suggest,
as a modest improvement, appending "/usr/local/bin/self-LART\n"
to the end of $HOME/.bashrc as well?  Killing luser's processes is
a nice start, but you have to allow for local policies...

-- 
WWSimonDo?

^ permalink raw reply	[flat|nested] 75+ messages in thread

* Re: [PATCH v2] fs: fsnotify: account fsnotify metadata to kmemcg
  2018-01-25  1:54                             ` Al Viro
@ 2018-01-25  2:15                               ` Shakeel Butt
  -1 siblings, 0 replies; 75+ messages in thread
From: Shakeel Butt @ 2018-01-25  2:15 UTC (permalink / raw)
  To: Al Viro
  Cc: Amir Goldstein, Jan Kara, Yang Shi, Michal Hocko, linux-fsdevel,
	Linux MM, LKML

On Wed, Jan 24, 2018 at 5:54 PM, Al Viro <viro@zeniv.linux.org.uk> wrote:
> On Wed, Jan 24, 2018 at 05:08:27PM -0800, Shakeel Butt wrote:
>> First, let me apologize, I think I might have led the discussion in
>> wrong direction by giving one wrong information. The current upstream
>> kernel, from the syscall context, does not invoke oom-killer when a
>> memcg hits its limit and fails to reclaim memory, instead ENOMEM is
>> returned. The memcg oom-killer is only invoked on page faults. However
>> in a separate effort I do plan to converge the behavior, long
>> discussion at <https://patchwork.kernel.org/patch/9988063/>.
>
> Correct me if I'm misinterpreting you, but your rationale in there
> appears to be along the lines of "userland applications might not
> be ready to handle -ENOMEM gracefully, so let's hit them with
> kill -9 instead - that will be handled properly, 'cuz M4G1C!!1!!!!"
>

Nah, the motivation is something like: in a memory-overcommitted
system (or memcg) where jobs of different priorities are running, it
is preferable to kill a low-priority job than to return ENOMEM to a
high-priority job.

^ permalink raw reply	[flat|nested] 75+ messages in thread

* Re: [PATCH v2] fs: fsnotify: account fsnotify metadata to kmemcg
  2018-01-25  1:08                           ` Shakeel Butt
@ 2018-01-25  7:51                             ` Amir Goldstein
  -1 siblings, 0 replies; 75+ messages in thread
From: Amir Goldstein @ 2018-01-25  7:51 UTC (permalink / raw)
  To: Shakeel Butt
  Cc: Jan Kara, Yang Shi, Michal Hocko, linux-fsdevel, Linux MM, LKML

On Thu, Jan 25, 2018 at 3:08 AM, Shakeel Butt <shakeelb@google.com> wrote:
> On Wed, Jan 24, 2018 at 3:12 AM, Amir Goldstein <amir73il@gmail.com> wrote:
>> On Wed, Jan 24, 2018 at 12:34 PM, Jan Kara <jack@suse.cz> wrote:
>>> On Mon 22-01-18 22:31:20, Amir Goldstein wrote:
>>>> On Fri, Jan 19, 2018 at 5:02 PM, Shakeel Butt <shakeelb@google.com> wrote:
>>>> > On Wed, Nov 15, 2017 at 1:31 AM, Jan Kara <jack@suse.cz> wrote:
>>>> >> On Wed 15-11-17 01:32:16, Yang Shi wrote:
>>>> >>> On 11/14/17 1:39 AM, Michal Hocko wrote:
>>>> >>> >On Tue 14-11-17 03:10:22, Yang Shi wrote:
>>>> >>> >>
>>>> >>> >>
>>>> >>> >>On 11/9/17 5:54 AM, Michal Hocko wrote:
>>>> >>> >>>[Sorry for the late reply]
>>>> >>> >>>
>>>> >>> >>>On Tue 31-10-17 11:12:38, Jan Kara wrote:
>>>> >>> >>>>On Tue 31-10-17 00:39:58, Yang Shi wrote:
>>>> >>> >>>[...]
>>>> >>> >>>>>I do agree it is not fair and not neat to account to producer rather than
>>>> >>> >>>>>misbehaving consumer, but current memcg design looks not support such use
>>>> >>> >>>>>case. And, the other question is do we know who is the listener if it
>>>> >>> >>>>>doesn't read the events?
>>>> >>> >>>>
>>>> >>> >>>>So you never know who will read from the notification file descriptor but
>>>> >>> >>>>you can simply account that to the process that created the notification
>>>> >>> >>>>group and that is IMO the right process to account to.
>>>> >>> >>>
>>>> >>> >>>Yes, if the creator is de-facto owner which defines the lifetime of
>>>> >>> >>>those objects then this should be a target of the charge.
>>>> >>> >>>
>>>> >>> >>>>I agree that current SLAB memcg accounting does not allow to account to a
>>>> >>> >>>>different memcg than the one of the running process. However I *think* it
>>>> >>> >>>>should be possible to add such interface. Michal?
>>>> >>> >>>
>>>> >>> >>>We do have memcg_kmem_charge_memcg but that would require some plumbing
>>>> >>> >>>to hook it into the specific allocation path. I suspect it uses kmalloc,
>>>> >>> >>>right?
>>>> >>> >>
>>>> >>> >>Yes.
>>>> >>> >>
>>>> >>> >>I took a look at the implementation and the callsites of
>>>> >>> >>memcg_kmem_charge_memcg(). It looks it is called by:
>>>> >>> >>
>>>> >>> >>* charge kmem to memcg, but it is charged to the allocator's memcg
>>>> >>> >>* allocate new slab page, charge to memcg_params.memcg
>>>> >>> >>
>>>> >>> >>I think this is the plumbing you mentioned, right?
>>>> >>> >
>>>> >>> >Maybe I have misunderstood, but you are using slab allocator. So you
>>>> >>> >would need to force it to use a different charging context than current.
>>>> >>>
>>>> >>> Yes.
>>>> >>>
>>>> >>> >I haven't checked deeply but this doesn't look trivial to me.
>>>> >>>
>>>> >>> I agree. This is also what I explained to Jan and Amir in earlier
>>>> >>> discussion.
>>>> >>
>>>> >> And I also agree. But the fact that it is not trivial does not mean that it
>>>> >> should not be done...
>>>> >>
>>>> >
>>>> > I am currently working on directed or remote memcg charging for a
>>>> > different usecase and I think that would be helpful here as well.
>>>> >
>>>> > I have two questions though:
>>>> >
>>>> > 1) Is fsnotify_group the right structure to hold the reference to
>>>> > target mem_cgroup for charging?
>>>>
>>>> I think it is. The process who set up the group and determined the unlimited
>>>> events queue size and did not consume the events from the queue in a timely
>>>> manner is the process to blame for the OOM situation.
>>>
>>> Agreed here.
>
> Please note that for fcntl(F_NOTIFY), a global group, dnotify_group,
> is used. The allocations from dnotify_struct_cache &
> dnotify_mark_cache happens in the fcntl(F_NOTIFY), so , I take that
> the memcg of the current process should be charged.

Correct. Note that dnotify_struct_cache is allocated when setting up the
watch. That is always done by the listener, which is the correct process to
charge. Handling the event does not allocate an event struct, so there is
no issue here.

>
>>>
>>>> > 2) Remote charging can trigger an OOM in the target memcg. In this
>>>> > usecase, I think, there should be security concerns if the events
>>>> > producer can trigger OOM in the memcg of the monitor. We can either
>>>> > change these allocations to use __GFP_NORETRY or some new gfp flag to
>>>> > not trigger oom-killer. So, is this valid concern or am I
>>>> > over-thinking?
>
> First, let me apologize, I think I might have led the discussion in
> wrong direction by giving one wrong information. The current upstream
> kernel, from the syscall context, does not invoke oom-killer when a
> memcg hits its limit and fails to reclaim memory, instead ENOMEM is
> returned. The memcg oom-killer is only invoked on page faults. However
> in a separate effort I do plan to converge the behavior, long
> discussion at <https://patchwork.kernel.org/patch/9988063/>.
>

So I guess it would be better if we limit the discussion in this
thread to which memcg should be charged for the purpose of
ENOMEM. The oom-killer is a whole different flammable discussion.

>>>> >
>>>>
>>>> I think that is a very valid concern, but not sure about the solution.
>>>> For an inotify listener and fanotify listener of class FAN_CLASS_NOTIF
>>>> (group->priority == 0), I think it makes sense to let oom-killer kill
>>>> the listener which is not keeping up with consuming events.
>>>
>>> Agreed.
>>>
>
> Is ENOMEM as acceptable as oom-killer killing the listener for
> FAN_CLASS_NOTIF? The current kernel will return ENOMEM but the future
> may trigger the oom-killer.
>

I can't say that the behavior change would be welcomed by all current
applications. But the current behavior allows a fanotify listener to dip
into the pools of other memcgs, so I don't know what the right answer is
here.

>>>> For fanotify listener of class FAN_CLASS_{,PRE_}CONTENT
>>>> (group->priority > 0) allowing an adversary to trigger oom-killer
>>>> and kill the listener could bypass permission event checks.
>>>>
>>>> So we could use different allocation flags for permission events
>>>> or for groups with high priority or for groups that have permission
>>>> events in their mask, so an adversary could not use non-permission
>>>> events to oom-kill a listener that is also monitoring permission events.
>>>>
>>>> Generally speaking, permission event monitors should not be
>>>> setting  FAN_UNLIMITED_QUEUE and should not be causing oom
>>>> (because permission events are not queued they are blocking), but
>>>> there is nothing preventing a process to setup a FAN_CLASS_CONTENT
>>>> group with FAN_UNLIMITED_QUEUE and also monitor non-permission
>>>> events.
>>>>
>>>> There is also nothing preventing a process from setting up one
>>>> FAN_CLASS_CONTENT listener for permission events and another
>>>> FAN_CLASS_NOTIF listener for non-permission event.
>>>> Maybe it is not wise, but we don't know that there are no such processes
>>>> out there.
>>>
>>> So IMHO there are different ways to setup memcgs and processes in them and
>>> you cannot just guess what desired outcome is. The facts are:
>>>
>>> 1) Process has setup queue with unlimited length.
>>> 2) Admin has put the process into memcg with limited memory.
>>> 3) The process cannot keep up with event producer.
>>>
>>> These three facts are creating conflicting demands and it depends on the
>>> eye of the beholder what is correct. E.g. you may not want to take the
>>> whole hosting machine down because something bad happened in one container.
>>> OTOH you might what that to happen if that particular container is
>>> responsible for maintaining security - but then IMHO it is not a clever
>>> setup to constrain memory of the security sensitive application.
>>>
>>> So my stance on this is that we should define easy to understand semantics
>>> and let the admins deal with it. IMO that semantics should follow how we
>>> currently behave on system-wide OOM - i.e., simply trigger OOM killer when
>>> cgroup is going over limits.
>>>
>>> If we wanted to create safer API for fanotify from this standpoint, we
>>> could allow new type of fanotify groups where queue would be limited in
>>> length but tasks adding events to queue would get throttled as the queue
>>> fills up. Something like dirty page cache throttling. But I'd go for this
>>> complexity only once we have clear indications from practice that current
>>> scheme is not enough.
>>>
>>
>> What you are saying makes sense for a good design, but IIUC, what follows
>> is that you think we should change behavior and start accounting event
>> allocation to listener without any opt-in from admin?
>>
>> That could make existing systems break and become vulnerable to killing
>> the AV daemon by unlimited queue overflow attack.
>> My claim was that we cannot change behavior to charge a misconfigured
>> permission event fanotify listener with allocations, because of that risk.
>> Perhaps I did not read your response correctly to understand how you intend
>> we mitigate this potential breakage.
>>
>
> I am new to fsnotify APIs, so please point out if I have missed or
> misunderstood something. From what I understand after looking at the
> code, the only allocations which are interesting are from
> fanotify_perm_event_cachep and fanotify_event_cachep when
> fanotify_alloc_event() is called from fanotify_handle_event(). For all

There is also the variable-length event kmalloc allocation in
inotify_handle_event(). Not sure what we can do about that?

> the other kmem caches, the allocations happen from syscalls done by
> the listener and thus can directly be charged to the memcg of the
> current and if charging is unsuccessful the ENOMEM will be returned by
> the syscall to the listener (oom-kill in the future probable kernel
> should be fine too).

ENOMEM would be acceptable in that case (failure to set up or modify a watch).

>
> For the interesting case, we can get the memcg from the fsnotify_group
> object. Following scenarios can happen:
>
> 1. ENOMEM returned by fanotify_perm_event_cachep

That is almost certainly correct.
Only the admin can set a watch for permission events.
If the admin sets a listener for permission events inside a constrained
memcg, any user in any memcg could get ENOMEM when accessing a file,
because that is what the admin cooked up.

> 2. ENOMEM returned by fanotify_event_cachep

There is a nicer alternative: instead of failing the file access,
an overflow event can be queued. I sent a patch for that and Jan
agreed with the concept, but thought we should let the user opt in to
this change:
https://marc.info/?l=linux-fsdevel&m=150944704716447&w=2

So IMO, if the user opts in to OVERFLOW instead of ENOMEM,
charging the listener's memcg would be uncontroversial.
Otherwise, I cannot say that starting to charge the listener's memcg
for events won't break any application.
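
(A sketch of how that opt-in could look at the allocation site - the
opt-in check is a hypothetical helper and the call site is simplified,
just to show the shape of the idea.)

	/*
	 * Sketch: if allocating (and charging) a non-permission event
	 * fails and the listener opted in, queue the group's preallocated
	 * overflow event instead of failing the producer.
	 * fanotify_enomem_overflow() is a made-up predicate.
	 */
	event = fanotify_alloc_event(inode, mask, data);
	if (!event) {
		if (fanotify_enomem_overflow(group)) {	/* hypothetical opt-in */
			fsnotify_add_event(group, group->overflow_event, NULL);
			return 0;	/* producer does not see ENOMEM */
		}
		return -ENOMEM;		/* current behaviour */
	}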

> 3. (Maybe) oom-kill triggered by fanotify_perm_event_cachep (in the
> future kernel)

Certainly not acceptable.

> 4. (Maybe) oom-kill triggered by fanotify_event_cachep (in the future kernel)
>

Too hard to answer.
In favor of oom-kill - when a listener cannot keep up with reading events,
it is bound to get OVERFLOW at some point, which for many change-tracking
applications has the same consequences as restarting the app.
Against oom-kill - there is a much more subtle way to handle a full event
queue: empty the queue and queue an overflow event, but that would require
fanotify/inotify to register a shrinker. Not sure how easy that solution
is.
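
(A very rough sketch of the shrinker variant, just to show the shape -
the queue-flushing and counting helpers and the per-group hookup are
hypothetical; only the struct shrinker / register_shrinker() interface
is the stock kernel one.)

/*
 * Sketch: under memory pressure, drop the group's queued events and
 * leave a single overflow event behind.  fanotify_nr_queued() and
 * fanotify_flush_to_overflow() are invented helpers.
 */
static unsigned long fanotify_shrink_count(struct shrinker *shrink,
					   struct shrink_control *sc)
{
	return fanotify_nr_queued();			/* hypothetical */
}

static unsigned long fanotify_shrink_scan(struct shrinker *shrink,
					  struct shrink_control *sc)
{
	/* drop queued events, leave one overflow event in their place */
	return fanotify_flush_to_overflow(sc->nr_to_scan);	/* hypothetical */
}

static struct shrinker fanotify_shrinker = {
	.count_objects	= fanotify_shrink_count,
	.scan_objects	= fanotify_shrink_scan,
	.seeks		= DEFAULT_SEEKS,
};

/* ... and somewhere in group setup: register_shrinker(&fanotify_shrinker); */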


Thanks,
Amir.

^ permalink raw reply	[flat|nested] 75+ messages in thread

* Re: [PATCH v2] fs: fsnotify: account fsnotify metadata to kmemcg
  2018-01-25  7:51                             ` Amir Goldstein
@ 2018-01-25 20:20                               ` Shakeel Butt
  -1 siblings, 0 replies; 75+ messages in thread
From: Shakeel Butt @ 2018-01-25 20:20 UTC (permalink / raw)
  To: Amir Goldstein
  Cc: Jan Kara, Yang Shi, Michal Hocko, linux-fsdevel, Linux MM, LKML

On Wed, Jan 24, 2018 at 11:51 PM, Amir Goldstein <amir73il@gmail.com> wrote:
>
> There is a nicer alternative, instead of failing the file access,
> an overflow event can be queued. I sent a patch for that and Jan
> agreed to the concept, but thought we should let user opt-in for this
> change:
> https://marc.info/?l=linux-fsdevel&m=150944704716447&w=2
>
> So IMO, if user opts-in for OVERFLOW instead of ENOMEM,
> charging the listener memcg would be non controversial.
> Otherwise, I cannot say that starting to charge the listener memgc
> for events won't break any application.
>

Thanks Amir, I will send out patches soon for directed charging for
fsnotify. Also, are you planning to work on the opt-in overflow for the
above case? Should I wait for your patch?

thanks,
Shakeel

^ permalink raw reply	[flat|nested] 75+ messages in thread

* Re: [PATCH v2] fs: fsnotify: account fsnotify metadata to kmemcg
  2018-01-25 20:20                               ` Shakeel Butt
@ 2018-01-25 20:36                                 ` Amir Goldstein
  -1 siblings, 0 replies; 75+ messages in thread
From: Amir Goldstein @ 2018-01-25 20:36 UTC (permalink / raw)
  To: Shakeel Butt
  Cc: Jan Kara, Yang Shi, Michal Hocko, linux-fsdevel, Linux MM, LKML

On Thu, Jan 25, 2018 at 10:20 PM, Shakeel Butt <shakeelb@google.com> wrote:
> On Wed, Jan 24, 2018 at 11:51 PM, Amir Goldstein <amir73il@gmail.com> wrote:
>>
>> There is a nicer alternative, instead of failing the file access,
>> an overflow event can be queued. I sent a patch for that and Jan
>> agreed to the concept, but thought we should let user opt-in for this
>> change:
>> https://marc.info/?l=linux-fsdevel&m=150944704716447&w=2
>>
>> So IMO, if user opts-in for OVERFLOW instead of ENOMEM,
>> charging the listener memcg would be non controversial.
>> Otherwise, I cannot say that starting to charge the listener memgc
>> for events won't break any application.
>>
>
> Thanks Amir, I will send out patches soon for directed charging for
> fsnotify. Also are you planning to work on the opt-in overflow for the
> above case? Should I wait for your patch?
>

Don't wait for me. You can pick up my simple patch if you like;
implementing "opt-in for charging the listener memcg" would make
sense together with that change.

Thanks,
Amir.

^ permalink raw reply	[flat|nested] 75+ messages in thread

* Re: [PATCH v2] fs: fsnotify: account fsnotify metadata to kmemcg
  2018-01-25 20:36                                 ` Amir Goldstein
  (?)
@ 2018-02-13  6:30                                   ` Amir Goldstein
  -1 siblings, 0 replies; 75+ messages in thread
From: Amir Goldstein @ 2018-02-13  6:30 UTC (permalink / raw)
  To: Shakeel Butt
  Cc: Jan Kara, Yang Shi, Michal Hocko, linux-fsdevel, Linux MM, LKML,
	linux-api

On Thu, Jan 25, 2018 at 10:36 PM, Amir Goldstein <amir73il@gmail.com> wrote:
> On Thu, Jan 25, 2018 at 10:20 PM, Shakeel Butt <shakeelb@google.com> wrote:
>> On Wed, Jan 24, 2018 at 11:51 PM, Amir Goldstein <amir73il@gmail.com> wrote:
>>>
>>> There is a nicer alternative, instead of failing the file access,
>>> an overflow event can be queued. I sent a patch for that and Jan
>>> agreed to the concept, but thought we should let user opt-in for this
>>> change:
>>> https://marc.info/?l=linux-fsdevel&m=150944704716447&w=2
>>>
>>> So IMO, if user opts-in for OVERFLOW instead of ENOMEM,
>>> charging the listener memcg would be non controversial.
>>> Otherwise, I cannot say that starting to charge the listener memgc
>>> for events won't break any application.
>>>
>>

Shakeel, Jan,

Reviving this thread and adding linux-api, because I think it is important to
agree on the API before patches.

The last message on the thread you referenced suggests an API change
for opting in for Q_OVERFLOW on ENOMEM:
https://marc.info/?l=linux-api&m=150946878623441&w=2

However, the suggested API change is in the fanotify_mark() syscall, and
that is not the time when the fsnotify_group is initialized.
I believe that for opting in to accounting events to the listener, you
will need to add an opt-in flag to the fanotify_init() syscall.

Something like FAN_GROUP_QUEUE (a better name is welcome),
which is mutually exclusive (?) with FAN_UNLIMITED_QUEUE.

The question is, do we need the user to also explicitly opt in for
Q_OVERFLOW on ENOMEM with a FAN_Q_ERR mark mask?
Should these 2 new APIs be coupled or independent?
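
To make that concrete, a hypothetical usage sketch - FAN_GROUP_QUEUE and
FAN_Q_ERR are just the names floated here, neither flag exists today:

	#include <fcntl.h>
	#include <sys/fanotify.h>

	/* opt in at init time to charging the event queue to this group's memcg */
	int fan_fd = fanotify_init(FAN_CLASS_NOTIF | FAN_GROUP_QUEUE, O_RDONLY);

	/* opt in at mark time to reading errors, e.g. Q_OVERFLOW on ENOMEM */
	fanotify_mark(fan_fd, FAN_MARK_ADD | FAN_MARK_MOUNT,
		      FAN_MODIFY | FAN_CLOSE_WRITE | FAN_Q_ERR,
		      AT_FDCWD, "/home");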

Another question is whether FAN_GROUP_QUEUE could require
less than CAP_SYS_ADMIN. Of course, for now this is only a
semantic change, because fanotify_init() requires CAP_SYS_ADMIN,
but as the documentation suggests, this may be relaxed in the future.

Thoughts?

Amir.

^ permalink raw reply	[flat|nested] 75+ messages in thread

* Re: [PATCH v2] fs: fsnotify: account fsnotify metadata to kmemcg
  2018-02-13  6:30                                   ` Amir Goldstein
  (?)
@ 2018-02-13 21:10                                     ` Shakeel Butt
  -1 siblings, 0 replies; 75+ messages in thread
From: Shakeel Butt @ 2018-02-13 21:10 UTC (permalink / raw)
  To: Amir Goldstein
  Cc: Jan Kara, Yang Shi, Michal Hocko, linux-fsdevel, Linux MM, LKML,
	linux-api

On Mon, Feb 12, 2018 at 10:30 PM, Amir Goldstein <amir73il@gmail.com> wrote:
> On Thu, Jan 25, 2018 at 10:36 PM, Amir Goldstein <amir73il@gmail.com> wrote:
>> On Thu, Jan 25, 2018 at 10:20 PM, Shakeel Butt <shakeelb@google.com> wrote:
>>> On Wed, Jan 24, 2018 at 11:51 PM, Amir Goldstein <amir73il@gmail.com> wrote:
>>>>
>>>> There is a nicer alternative, instead of failing the file access,
>>>> an overflow event can be queued. I sent a patch for that and Jan
>>>> agreed to the concept, but thought we should let user opt-in for this
>>>> change:
>>>> https://marc.info/?l=linux-fsdevel&m=150944704716447&w=2
>>>>
>>>> So IMO, if user opts-in for OVERFLOW instead of ENOMEM,
>>>> charging the listener memcg would be non controversial.
>>>> Otherwise, I cannot say that starting to charge the listener memgc
>>>> for events won't break any application.
>>>>
>>>
>
> Shakeel, Jan,
>
> Reviving this thread and adding linux-api, because I think it is important to
> agree on the API before patches.
>
> The last message on the thread you referenced suggest an API change
> for opting in for Q_OVERFLOW on ENOMEM:
> https://marc.info/?l=linux-api&m=150946878623441&w=2
>
> However, the suggested API change in in fanotify_mark() syscall and
> this is not the time when fsnotify_group is initialized.
> I believe for opting-in to accounting events for listener, you
> will need to add an opt-in flag for the fanotify_init() syscall.
>

I thought the reason to opt in to "charge memory to listener" was the
risk of oom-killing the listener, but it is now clear that there will
be no oom-kills on a memcg hitting its limit (no risk of oom-killing
the listener). In my (not so strong) opinion we should only require
opt-in for receiving the {FAN|IN}_Q_OVERFLOW event on ENOMEM, but always
charge the memory for events to the listener's memcg if kmem accounting
is enabled.

> Something like FAN_GROUP_QUEUE  (better name is welcome)
> which is mutually exclusive (?) with FAN_UNLIMITED_QUEUE.
>

There is no need to make them mutually exclusive. One should be able
to request an unlimited queue limited by the available memory on the system
(with no kmem charging) or limited by the limit of the listener's memcg
(with kmem charging).
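
In other words, a rough sketch using the hypothetical flag name from above:

	/* unlimited queue, bounded only by system memory, no kmem charging */
	fd1 = fanotify_init(FAN_CLASS_NOTIF | FAN_UNLIMITED_QUEUE, O_RDONLY);

	/* unlimited queue, bounded by the limit of the listener's memcg */
	fd2 = fanotify_init(FAN_CLASS_NOTIF | FAN_UNLIMITED_QUEUE |
			    FAN_GROUP_QUEUE, O_RDONLY);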

> The question is, do we need the user to also explicitly opt-in for
> Q_OVERFLOW on ENOMEM with FAN_Q_ERR mark mask?
> Should these 2 new APIs be coupled or independent?
>

Are there any errors which are not related to queue overflows? I see
the mention of ENODEV and EOVERFLOW in the discussion. If there are
such errors and they might be interesting to the listener, then we should
have 2 independent APIs.

> Another question is whether FAN_GROUP_QUEUE may require
> less than CAP_SYS_ADMIN? Of course for now, this is only a
> semantic change, because fanotify_init() requires CAP_SYS_ADMIN
> but as the documentation suggests, this may be relaxed in the future.
>

I think there is no need to impose CAP_SYS_ADMIN for requesting to
charge oneself for the event memory.

thanks,
Shakeel

^ permalink raw reply	[flat|nested] 75+ messages in thread

* Re: [PATCH v2] fs: fsnotify: account fsnotify metadata to kmemcg
  2018-02-13 21:10                                     ` Shakeel Butt
@ 2018-02-13 21:54                                       ` Amir Goldstein
  -1 siblings, 0 replies; 75+ messages in thread
From: Amir Goldstein @ 2018-02-13 21:54 UTC (permalink / raw)
  To: Shakeel Butt
  Cc: Jan Kara, Yang Shi, Michal Hocko, linux-fsdevel, Linux MM, LKML,
	linux-api

On Tue, Feb 13, 2018 at 11:10 PM, Shakeel Butt <shakeelb@google.com> wrote:
> On Mon, Feb 12, 2018 at 10:30 PM, Amir Goldstein <amir73il@gmail.com> wrote:
>> On Thu, Jan 25, 2018 at 10:36 PM, Amir Goldstein <amir73il@gmail.com> wrote:
>>> On Thu, Jan 25, 2018 at 10:20 PM, Shakeel Butt <shakeelb@google.com> wrote:
>>>> On Wed, Jan 24, 2018 at 11:51 PM, Amir Goldstein <amir73il@gmail.com> wrote:
>>>>>
>>>>> There is a nicer alternative, instead of failing the file access,
>>>>> an overflow event can be queued. I sent a patch for that and Jan
>>>>> agreed to the concept, but thought we should let user opt-in for this
>>>>> change:
>>>>> https://marc.info/?l=linux-fsdevel&m=150944704716447&w=2
>>>>>
>>>>> So IMO, if user opts-in for OVERFLOW instead of ENOMEM,
>>>>> charging the listener memcg would be non controversial.
>>>>> Otherwise, I cannot say that starting to charge the listener memgc
>>>>> for events won't break any application.
>>>>>
>>>>
>>
>> Shakeel, Jan,
>>
>> Reviving this thread and adding linux-api, because I think it is important to
>> agree on the API before patches.
>>
>> The last message on the thread you referenced suggest an API change
>> for opting in for Q_OVERFLOW on ENOMEM:
>> https://marc.info/?l=linux-api&m=150946878623441&w=2
>>
>> However, the suggested API change in in fanotify_mark() syscall and
>> this is not the time when fsnotify_group is initialized.
>> I believe for opting-in to accounting events for listener, you
>> will need to add an opt-in flag for the fanotify_init() syscall.
>>
>
> I thought the reason to opt-in "charge memory to listener" was the
> risk of oom-killing the listener but it is now clear that there will
> be no oom-kills on memcg hitting its limit (no oom-killing listener
> risk). In my (not so strong) opinion we should only opt-in for
> receiving the {FAN|IN}_Q_OVERFLOW event on ENOMEM but always charge
> the memory for events to the listener's memcg if kmem accounting is
> enabled.
>

I agree that charging the listener's memcg is preferred, but it is still a
change of behavior, because if an attacker can allocate memory from the
listener's memcg, then the attacker can force an overflow and hide the
traces of its own filesystem operations.

>> Something like FAN_GROUP_QUEUE  (better name is welcome)
>> which is mutually exclusive (?) with FAN_UNLIMITED_QUEUE.
>>
>
> There is no need to make them mutually exclusive. One should be able
> to request an unlimited queue limited by available memory on system
> (with no kmem charging) or limited by limit of the listener's memcg
> (with kmem charging).

OK.

>
>> The question is, do we need the user to also explicitly opt-in for
>> Q_OVERFLOW on ENOMEM with FAN_Q_ERR mark mask?
>> Should these 2 new APIs be coupled or independent?
>>
>
> Are there any error which are not related to queue overflows? I see
> the mention of ENODEV and EOVERFLOW in the discussion. If there are
> such errors and might be interesting to the listener then we should
> have 2 independent APIs.
>

These are indeed 2 different use cases.
A Q_OVERFLOW event is only expected to carry one of ENOMEM or
EOVERFLOW in event->fd, but other events (like an open of a special device
file) can have ENODEV in event->fd.

But I am not convinced that those require 2 independent APIs.
Specifying FAN_Q_ERR means that the user expects to read errors
from event->fd.
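
In other words, with the proposed (non-existing) FAN_Q_ERR opt-in, the
listener side could look roughly like this - the exact encoding of the
error in event->fd is still up for discussion:

	if (meta->fd < 0 && meta->fd != FAN_NOFD) {
		int err = -meta->fd;	/* hypothetical: fd carries -errno */

		if ((meta->mask & FAN_Q_OVERFLOW) && err == ENOMEM)
			; /* overflow caused by a failed event allocation */
		else if (err == ENODEV)
			; /* e.g. open of a special device file, no fd to give */
	}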

>> Another question is whether FAN_GROUP_QUEUE may require
>> less than CAP_SYS_ADMIN? Of course for now, this is only a
>> semantic change, because fanotify_init() requires CAP_SYS_ADMIN
>> but as the documentation suggests, this may be relaxed in the future.
>>
>
> I think there is no need for imposing CAP_SYS_ADMIN for requesting to
> charge self for the event memory.
>

Certainly. The question is whether the flag combination
FAN_GROUP_QUEUE|FAN_UNLIMITED_QUEUE could relax the
CAP_SYS_ADMIN requirement that is imposed by FAN_UNLIMITED_QUEUE
by itself.

Note that FAN_UNLIMITED_MARKS cannot relax CAP_SYS_ADMIN
even though marks are already accounted to the listener memcg. This is because
most of the memory consumption in this case comes from the marks pinning the
watched inodes in the cache and not from the marks themselves.
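
Just to spell out the question, the check in fanotify_init() could be
relaxed along these lines (hypothetical sketch only; today the syscall
requires CAP_SYS_ADMIN unconditionally, and FAN_GROUP_QUEUE does not
exist):

	if (!capable(CAP_SYS_ADMIN)) {
		/* unlimited queue without memcg charging stays privileged */
		if ((flags & FAN_UNLIMITED_QUEUE) && !(flags & FAN_GROUP_QUEUE))
			return -EPERM;
		/* unlimited marks stay privileged, they pin inodes in cache */
		if (flags & FAN_UNLIMITED_MARKS)
			return -EPERM;
	}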

Thanks,
Amir.

^ permalink raw reply	[flat|nested] 75+ messages in thread

* Re: [PATCH v2] fs: fsnotify: account fsnotify metadata to kmemcg
  2018-02-13 21:54                                       ` Amir Goldstein
@ 2018-02-13 22:20                                         ` Shakeel Butt
  -1 siblings, 0 replies; 75+ messages in thread
From: Shakeel Butt @ 2018-02-13 22:20 UTC (permalink / raw)
  To: Amir Goldstein
  Cc: Jan Kara, Yang Shi, Michal Hocko, linux-fsdevel, Linux MM, LKML,
	linux-api

On Tue, Feb 13, 2018 at 1:54 PM, Amir Goldstein <amir73il@gmail.com> wrote:
> On Tue, Feb 13, 2018 at 11:10 PM, Shakeel Butt <shakeelb@google.com> wrote:
>> On Mon, Feb 12, 2018 at 10:30 PM, Amir Goldstein <amir73il@gmail.com> wrote:
>>> On Thu, Jan 25, 2018 at 10:36 PM, Amir Goldstein <amir73il@gmail.com> wrote:
>>>> On Thu, Jan 25, 2018 at 10:20 PM, Shakeel Butt <shakeelb@google.com> wrote:
>>>>> On Wed, Jan 24, 2018 at 11:51 PM, Amir Goldstein <amir73il@gmail.com> wrote:
>>>>>>
>>>>>> There is a nicer alternative, instead of failing the file access,
>>>>>> an overflow event can be queued. I sent a patch for that and Jan
>>>>>> agreed to the concept, but thought we should let user opt-in for this
>>>>>> change:
>>>>>> https://marc.info/?l=linux-fsdevel&m=150944704716447&w=2
>>>>>>
>>>>>> So IMO, if user opts-in for OVERFLOW instead of ENOMEM,
>>>>>> charging the listener memcg would be non controversial.
>>>>>> Otherwise, I cannot say that starting to charge the listener memgc
>>>>>> for events won't break any application.
>>>>>>
>>>>>
>>>
>>> Shakeel, Jan,
>>>
>>> Reviving this thread and adding linux-api, because I think it is important to
>>> agree on the API before patches.
>>>
>>> The last message on the thread you referenced suggest an API change
>>> for opting in for Q_OVERFLOW on ENOMEM:
>>> https://marc.info/?l=linux-api&m=150946878623441&w=2
>>>
>>> However, the suggested API change in in fanotify_mark() syscall and
>>> this is not the time when fsnotify_group is initialized.
>>> I believe for opting-in to accounting events for listener, you
>>> will need to add an opt-in flag for the fanotify_init() syscall.
>>>
>>
>> I thought the reason to opt-in "charge memory to listener" was the
>> risk of oom-killing the listener but it is now clear that there will
>> be no oom-kills on memcg hitting its limit (no oom-killing listener
>> risk). In my (not so strong) opinion we should only opt-in for
>> receiving the {FAN|IN}_Q_OVERFLOW event on ENOMEM but always charge
>> the memory for events to the listener's memcg if kmem accounting is
>> enabled.
>>
>
> I agree that charging listener's memcg is preferred, but it is still a change
> of behavior, because if attacker can allocate memory from listener's memcg,
> then attacker can force overflow and hide the traces of its own filesystem
> operations.
>

ACK.

>>> Something like FAN_GROUP_QUEUE  (better name is welcome)
>>> which is mutually exclusive (?) with FAN_UNLIMITED_QUEUE.
>>>

How about FAN_CHARGE_MEMCG?

>>
>> There is no need to make them mutually exclusive. One should be able
>> to request an unlimited queue limited by available memory on system
>> (with no kmem charging) or limited by limit of the listener's memcg
>> (with kmem charging).
>
> OK.
>
>>
>>> The question is, do we need the user to also explicitly opt-in for
>>> Q_OVERFLOW on ENOMEM with FAN_Q_ERR mark mask?
>>> Should these 2 new APIs be coupled or independent?
>>>
>>
>> Are there any error which are not related to queue overflows? I see
>> the mention of ENODEV and EOVERFLOW in the discussion. If there are
>> such errors and might be interesting to the listener then we should
>> have 2 independent APIs.
>>
>
> These are indeed 2 different use cases.
> A Q_OVERFLOW event is only expected one of ENOMEM or
> EOVERFLOW in event->fd, but other events (like open of special device
> file) can have ENODEV in event->fd.
>
> But I am not convinced that those require 2 independent APIs.
> Specifying FAN_Q_ERR means that the user expects to reads errors
> from event->fd.
>

Can you please explain what you mean by 2 independent APIs? I thought
"no independent APIs" means FAN_Q_ERR can only be used with
FAN_Q_OVERFLOW, and without FAN_Q_OVERFLOW, FAN_Q_ERR is ignored. Is
that right, or have I misunderstood?

>>> Another question is whether FAN_GROUP_QUEUE may require
>>> less than CAP_SYS_ADMIN? Of course for now, this is only a
>>> semantic change, because fanotify_init() requires CAP_SYS_ADMIN
>>> but as the documentation suggests, this may be relaxed in the future.
>>>
>>
>> I think there is no need for imposing CAP_SYS_ADMIN for requesting to
>> charge self for the event memory.
>>
>
> Certainly. The question is whether the flag combination
> FAN_GROUP_QUEUE|FAN_UNLIMITED_QUEUE could relax the
> CAP_SYS_ADMIN requirement that is imposed by FAN_UNLIMITED_QUEUE
> by itself.
>

Oh, I agree with relaxing the CAP_SYS_ADMIN requirement if both flags are given.

> Note that FAN_UNLIMITED_MARKS cannot relax CAP_SYS_ADMIN
> even though marks are already accounted to listener memcg. This is because
> most of the memory consumption in this case comes from marks pinning the
> watched inodes to cache and not from the marks themselves.
>

thanks,
Shakeel

^ permalink raw reply	[flat|nested] 75+ messages in thread

* Re: [PATCH v2] fs: fsnotify: account fsnotify metadata to kmemcg
  2018-02-13 22:20                                         ` Shakeel Butt
  (?)
@ 2018-02-14  1:59                                           ` Shakeel Butt
  -1 siblings, 0 replies; 75+ messages in thread
From: Shakeel Butt @ 2018-02-14  1:59 UTC (permalink / raw)
  To: Amir Goldstein
  Cc: Jan Kara, Yang Shi, Michal Hocko, linux-fsdevel, Linux MM, LKML,
	linux-api

On Tue, Feb 13, 2018 at 2:20 PM, Shakeel Butt <shakeelb@google.com> wrote:
> On Tue, Feb 13, 2018 at 1:54 PM, Amir Goldstein <amir73il@gmail.com> wrote:
>> On Tue, Feb 13, 2018 at 11:10 PM, Shakeel Butt <shakeelb@google.com> wrote:
>>> On Mon, Feb 12, 2018 at 10:30 PM, Amir Goldstein <amir73il@gmail.com> wrote:
>>>> On Thu, Jan 25, 2018 at 10:36 PM, Amir Goldstein <amir73il@gmail.com> wrote:
>>>>> On Thu, Jan 25, 2018 at 10:20 PM, Shakeel Butt <shakeelb@google.com> wrote:
>>>>>> On Wed, Jan 24, 2018 at 11:51 PM, Amir Goldstein <amir73il@gmail.com> wrote:
>>>>>>>
>>>>>>> There is a nicer alternative, instead of failing the file access,
>>>>>>> an overflow event can be queued. I sent a patch for that and Jan
>>>>>>> agreed to the concept, but thought we should let user opt-in for this
>>>>>>> change:
>>>>>>> https://marc.info/?l=linux-fsdevel&m=150944704716447&w=2
>>>>>>>
>>>>>>> So IMO, if user opts-in for OVERFLOW instead of ENOMEM,
>>>>>>> charging the listener memcg would be non controversial.
>>>>>>> Otherwise, I cannot say that starting to charge the listener memgc
>>>>>>> for events won't break any application.
>>>>>>>
>>>>>>
>>>>
>>>> Shakeel, Jan,
>>>>
>>>> Reviving this thread and adding linux-api, because I think it is important to
>>>> agree on the API before patches.
>>>>
>>>> The last message on the thread you referenced suggest an API change
>>>> for opting in for Q_OVERFLOW on ENOMEM:
>>>> https://marc.info/?l=linux-api&m=150946878623441&w=2
>>>>
>>>> However, the suggested API change in in fanotify_mark() syscall and
>>>> this is not the time when fsnotify_group is initialized.
>>>> I believe for opting-in to accounting events for listener, you
>>>> will need to add an opt-in flag for the fanotify_init() syscall.
>>>>
>>>
>>> I thought the reason to opt-in "charge memory to listener" was the
>>> risk of oom-killing the listener but it is now clear that there will
>>> be no oom-kills on memcg hitting its limit (no oom-killing listener
>>> risk). In my (not so strong) opinion we should only opt-in for
>>> receiving the {FAN|IN}_Q_OVERFLOW event on ENOMEM but always charge
>>> the memory for events to the listener's memcg if kmem accounting is
>>> enabled.
>>>
>>
>> I agree that charging listener's memcg is preferred, but it is still a change
>> of behavior, because if attacker can allocate memory from listener's memcg,
>> then attacker can force overflow and hide the traces of its own filesystem
>> operations.
>>
>
> ACK.
>
>>>> Something like FAN_GROUP_QUEUE  (better name is welcome)
>>>> which is mutually exclusive (?) with FAN_UNLIMITED_QUEUE.
>>>>
>
> How about FAN_CHARGE_MEMCG?
>

Also, should there be a similar flag for inotify_init1() as well?
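
Something along these lines, with a made-up flag name (IN_CHARGE_MEMCG
does not exist, it is only the inotify analogue of the flag discussed
above):

	#include <sys/inotify.h>

	/* hypothetical inotify counterpart of the fanotify opt-in flag */
	int ifd = inotify_init1(IN_NONBLOCK | IN_CLOEXEC | IN_CHARGE_MEMCG);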

>>>
>>> There is no need to make them mutually exclusive. One should be able
>>> to request an unlimited queue limited by the available memory on the system
>>> (with no kmem charging) or limited by the limit of the listener's memcg
>>> (with kmem charging).
>>
>> OK.
>>
>>>
>>>> The question is, do we need the user to also explicitly opt-in for
>>>> Q_OVERFLOW on ENOMEM with FAN_Q_ERR mark mask?
>>>> Should these 2 new APIs be coupled or independent?
>>>>
>>>
>>> Are there any errors which are not related to queue overflows? I see
>>> the mention of ENODEV and EOVERFLOW in the discussion. If there are
>>> such errors and they might be interesting to the listener, then we should
>>> have 2 independent APIs.
>>>
>>
>> These are indeed 2 different use cases.
>> A Q_OVERFLOW event is only expected to carry one of ENOMEM or
>> EOVERFLOW in event->fd, but other events (like the open of a special device
>> file) can have ENODEV in event->fd.
>>
>> But I am not convinced that those require 2 independent APIs.
>> Specifying FAN_Q_ERR means that the user expects to read errors
>> from event->fd.
>>
>
> Can you please explain what you mean by 2 independent APIs? I thought
> "no independent APIs" means FAN_Q_ERR can only be used with
> FAN_Q_OVERFLOW, and without FAN_Q_OVERFLOW, FAN_Q_ERR is ignored. Is
> that right, or did I misunderstand?
>
>>>> Another question is whether FAN_GROUP_QUEUE may require
>>>> less than CAP_SYS_ADMIN? Of course for now, this is only a
>>>> semantic change, because fanotify_init() requires CAP_SYS_ADMIN
>>>> but as the documentation suggests, this may be relaxed in the future.
>>>>
>>>
>>> I think there is no need for imposing CAP_SYS_ADMIN for requesting to
>>> charge self for the event memory.
>>>
>>
>> Certainly. The question is whether the flag combination
>> FAN_GROUP_QUEUE|FAN_UNLIMITED_QUEUE could relax the
>> CAP_SYS_ADMIN requirement that is imposed by FAN_UNLIMITED_QUEUE
>> by itself.
>>
>
> Oh, I agree with relaxing CAP_SYS_ADMIN requirement if both flags are given.
>
>> Note that FAN_UNLIMITED_MARKS cannot relax CAP_SYS_ADMIN
>> even though marks are already accounted to listener memcg. This is because
>> most of the memory consumption in this case comes from marks pinning the
>> watched inodes to cache and not from the marks themselves.
>>
>
> thanks,
> Shakeel

^ permalink raw reply	[flat|nested] 75+ messages in thread

* Re: [PATCH v2] fs: fsnotify: account fsnotify metadata to kmemcg
  2018-02-14  1:59                                           ` Shakeel Butt
@ 2018-02-14  8:38                                             ` Amir Goldstein
  -1 siblings, 0 replies; 75+ messages in thread
From: Amir Goldstein @ 2018-02-14  8:38 UTC (permalink / raw)
  To: Shakeel Butt
  Cc: Jan Kara, Yang Shi, Michal Hocko, linux-fsdevel, Linux MM, LKML,
	linux-api

On Wed, Feb 14, 2018 at 3:59 AM, Shakeel Butt <shakeelb@google.com> wrote:
> On Tue, Feb 13, 2018 at 2:20 PM, Shakeel Butt <shakeelb@google.com> wrote:
[...]
>>>>> Something like FAN_GROUP_QUEUE  (better name is welcome)
>>>>> which is mutually exclusive (?) with FAN_UNLIMITED_QUEUE.
>>>>>
>>
>> How about FAN_CHARGE_MEMCG?
>>

I am not crazy about this name.
Imagine a user that writes a file system listener that is going to run
inside a container.
The user doesn't need to know about the container, or what memcg
and memcg charging are.
IMO, we need to hide those implementation details from the user and
yet encourage the user to opt in for memcg charging... or do we?

>
> Also should there be a similar flag for inotify_init1() as well?
>

This question changed my perspective on the fanotify_init() flag.
Unlike with fanotify, for inotify it is the sysadmin that determines
the size of the queue of future listeners by setting
/proc/sys/fs/inotify/max_queued_events

IMO, there is little justification for a program to opt out of memcg
charging if the sysadmin opts in to memcg charging.
Does anyone disagree with that claim?

So how about /proc/sys/fs/inotify/charge_memcg
which defaults to CONFIG_INOTIFY_CHARGE_MEMCG
which defaults to N.

Then the sysadmin can opt in/out of the new behavior, a distro can
opt in to the new behavior by default, and programs don't need to
be recompiled.

I think that should be enough to address the concern of changing
existing behavior.
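
A minimal sketch of what such a knob could look like, next to the existing
inotify sysctls in fs/notify/inotify/inotify_user.c (hypothetical only:
CONFIG_INOTIFY_CHARGE_MEMCG, inotify_charge_memcg and the "charge_memcg"
entry are invented names taken from the proposal above, not existing code):

static int zero;
static int one = 1;
static int inotify_charge_memcg = IS_ENABLED(CONFIG_INOTIFY_CHARGE_MEMCG);

struct ctl_table inotify_table[] = {
	/* ... existing max_user_instances / max_user_watches /
	 *     max_queued_events entries ... */
	{
		.procname	= "charge_memcg",
		.data		= &inotify_charge_memcg,
		.maxlen		= sizeof(int),
		.mode		= 0644,
		.proc_handler	= proc_dointvec_minmax,
		.extra1		= &zero,	/* restrict the value to 0 or 1 */
		.extra2		= &one,
	},
	{ }
};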

The same logic could also apply to fanotify, except we may want to
use sysfs instead of procfs.
The question is: do we need separate knobs for charging memcg
for inotify and fanotify, or can the same knob control both?

I can't think of a reason why we really need 2 knobs, but maybe
it's just nicer to have the inotify knob next to the existing
max_queued_events knob.

Thanks,
Amir.

^ permalink raw reply	[flat|nested] 75+ messages in thread

* Re: [PATCH v2] fs: fsnotify: account fsnotify metadata to kmemcg
  2018-02-13 22:20                                         ` Shakeel Butt
  (?)
@ 2018-02-14  9:00                                           ` Amir Goldstein
  -1 siblings, 0 replies; 75+ messages in thread
From: Amir Goldstein @ 2018-02-14  9:00 UTC (permalink / raw)
  To: Shakeel Butt
  Cc: Jan Kara, Yang Shi, Michal Hocko, linux-fsdevel, Linux MM, LKML,
	linux-api

On Wed, Feb 14, 2018 at 12:20 AM, Shakeel Butt <shakeelb@google.com> wrote:
> On Tue, Feb 13, 2018 at 1:54 PM, Amir Goldstein <amir73il@gmail.com> wrote:
>> On Tue, Feb 13, 2018 at 11:10 PM, Shakeel Butt <shakeelb@google.com> wrote:
[...]
>>>> The question is, do we need the user to also explicitly opt-in for
>>>> Q_OVERFLOW on ENOMEM with FAN_Q_ERR mark mask?
>>>> Should these 2 new APIs be coupled or independent?
>>>>
>>>
>>> Are there any errors which are not related to queue overflows? I see
>>> the mention of ENODEV and EOVERFLOW in the discussion. If there are
>>> such errors and they might be interesting to the listener, then we should
>>> have 2 independent APIs.
>>>
>>
>> These are indeed 2 different use cases.
>> A Q_OVERFLOW event is only expected to carry one of ENOMEM or
>> EOVERFLOW in event->fd, but other events (like the open of a special device
>> file) can have ENODEV in event->fd.
>>
>> But I am not convinced that those require 2 independent APIs.
>> Specifying FAN_Q_ERR means that the user expects to read errors
>> from event->fd.
>>
>
> Can you please explain what you mean by 2 independent APIs? I thought
> "no independent APIs" means FAN_Q_ERR can only be used with
> FAN_Q_OVERFLOW, and without FAN_Q_OVERFLOW, FAN_Q_ERR is ignored. Is
> that right, or did I misunderstand?
>

What I initially meant to say was, we actually consider several
behavior changes:
1. Charge event allocations to the memcg of the listener
2. Queue a Q_OVERFLOW event on ENOMEM during event allocation
3. Report the error to the user on metadata->fd (instead of FAN_NOFD)
4. Allow a non-Q_OVERFLOW event to have a negative metadata->fd.

#3 is applicable both to the Q_OVERFLOW event and to other events that
can't provide an open file descriptor for some reason (e.g. ENODEV).

#1 and #2 could be independent, but they both make sense together.
When enabling #1, the user increases the chance of ENOMEM, and therefore
#2 is desired. So if we are going to let the distro/admin/programmer
opt in to what we believe to be a "change of behavior for the best", then
we could consider that opting in to #1 will also imply opting in to #2
and #3 (as the means to report Q_OVERFLOW due to ENOMEM).

I guess we will need to allow the user to opt in to #4 and #3 via the FAN_Q_ERR
mask flag, to cover the ENODEV case independently from opting in to
memcg charging.
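
To make #2 and #3 concrete, a rough sketch of the idea (illustration only, not
an actual patch) in fanotify_handle_event(), reusing the group's pre-allocated
overflow event; permission events would still need separate handling because
the access itself must get an answer:

	event = fanotify_alloc_event(inode, mask, data);
	if (unlikely(!event)) {
		/*
		 * Event allocation failed, e.g. because the listener's memcg
		 * is at its limit: report a queue overflow to the listener
		 * instead of silently dropping the event.
		 */
		fsnotify_add_event(group, group->overflow_event, NULL);
		return 0;
	}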

How did I do in the balance of adding clarity vs. adding confusion?

Thanks,
Amir.

^ permalink raw reply	[flat|nested] 75+ messages in thread

* Re: [PATCH v2] fs: fsnotify: account fsnotify metadata to kmemcg
  2018-02-14  8:38                                             ` Amir Goldstein
@ 2018-02-19 13:50                                               ` Jan Kara
  -1 siblings, 0 replies; 75+ messages in thread
From: Jan Kara @ 2018-02-19 13:50 UTC (permalink / raw)
  To: Amir Goldstein
  Cc: Shakeel Butt, Jan Kara, Yang Shi, Michal Hocko, linux-fsdevel,
	Linux MM, LKML, linux-api

On Wed 14-02-18 10:38:09, Amir Goldstein wrote:
> On Wed, Feb 14, 2018 at 3:59 AM, Shakeel Butt <shakeelb@google.com> wrote:
> > On Tue, Feb 13, 2018 at 2:20 PM, Shakeel Butt <shakeelb@google.com> wrote:
> [...]
> >>>>> Something like FAN_GROUP_QUEUE  (better name is welcome)
> >>>>> which is mutually exclusive (?) with FAN_UNLIMITED_QUEUE.
> >>>>>
> >>
> >> How about FAN_CHARGE_MEMCG?
> >>
> 
> I am not crazy about this name.
> Imagine a user that writes a file system listener that is going to run
> inside a container.
> The user doesn't need to know about the container, or what memcg
> and memcg charging are.
> IMO, we need to hide those implementation details from the user and
> yet encourage the user to opt in for memcg charging... or do we?

Well, there's also a different question: Why should an application writer
*want* to opt for such a flag? Even if he wants his application to be well
behaved, he most likely just won't think about memcg charging, and thus why
should he set the flag? IMHO users of such an API would be very limited...

> > Also should there be a similar flag for inotify_init1() as well?
> >
> 
> This question changed my perspective on the fanotify_init() flag.
> Unlike with fanotify, for inotify it is the sysadmin that determines
> the size of the queue of future listeners by setting
> /proc/sys/fs/inotify/max_queued_events
> 
> IMO, there is little justification for a program to opt out of memcg
> charging if the sysadmin opts in to memcg charging.
> Does anyone disagree with that claim?

Frankly, I believe there's little point in memcg charging for inotify.
Everything is limited anyway and the amount of consumed memory is small (a
few megabytes at most). That being said I do think that for consistency
it should be implemented. Just the practical impact is going to be small.

> So how about /proc/sys/fs/inotify/charge_memcg
> which defaults to CONFIG_INOTIFY_CHARGE_MEMCG
> which defaults to N.
> 
> Then the sysadmin can opt in/out of the new behavior, a distro can
> opt in to the new behavior by default, and programs don't need to
> be recompiled.
> 
> I think that should be enough to address the concern of changing
> existing behavior.

For inotify my concerns about broken apps are even lower than for fanotify
- if the sysadmin sets up memcgs, he very likely prefers a broken inotify app
to a container consuming too much memory (generally breakage is assumed when a
container runs out of memory, since most apps just crash in such a case
anyway), and an app should generally be prepared to handle queue overflow, so
there are reasonable chances things actually work out fine. So I don't see
a good reason for adding inotify tunables for memcg charging. We don't have
tunables for inode memcg charging either, and those are more likely to break
apps than similar inotify changes, after all.

> The same logic could also apply to fanotify, except we may want to
> use sysfs instead of procfs.
> The question is: do we need separate knobs for charging memcg
> for inotify and fanotify, or can the same knob control both?
> 
> I can't think of a reason why we really need 2 knobs, but maybe
> it's just nicer to have the inotify knob next to the existing
> max_queued_events knob.

For fanotify without FAN_UNLIMITED_QUEUE the situation is similar to that for
inotify - IMO low practical impact, apps should generally handle queue
overflow so I don't see a need for any opt in (more accurate memcg charging
takes precedence over possibly broken apps).

For fanotify with FAN_UNLIMITED_QUEUE the situation is somewhat different -
firstly there is a practical impact (memory consumption is not limited by
anything else) and secondly there are higher chances of the application
breaking (no queue overflow expected) and also that this breakage won't be
completely harmless (e.g., the application participates in securing the
system). I've been thinking about this "conflict of interests" for some
time and currently I think that the best handling of this is that by
default events for FAN_UNLIMITED_QUEUE groups will get allocated with
GFP_NOFAIL - such groups can be created only by global CAP_SYS_ADMIN anyway
so it is reasonably safe against misuse (and since the allocations are
small it is in fact equivalent to current status quo, just more explicit).
That way application won't see unexpected queue overflow. The process
generating event may be looping in the allocator but that is the case
currently as well. Also the memcg with the consumer of events will have
higher chances of triggering oom-kill if events consume too much memory but
I don't see how this is not a good thing by default - and if such reaction
is not desirable, there's memcg's oom_control to tune the OOM behavior
which has capabilities far beyond of what we could invent for fanotify...
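
In code terms, the proposal amounts to something like the following when the
event is allocated (sketch only, not a merged patch; fanotify_init() records
FAN_UNLIMITED_QUEUE by setting group->max_events to UINT_MAX):

	gfp_t gfp = GFP_KERNEL;

	if (group->max_events == UINT_MAX)	/* FAN_UNLIMITED_QUEUE */
		gfp |= __GFP_NOFAIL;

	event = kmem_cache_alloc(fanotify_event_cachep, gfp);
	if (!event)	/* only reachable in the non-__GFP_NOFAIL case */
		return NULL;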

What do you think Amir?

								Honza

-- 
Jan Kara <jack@suse.com>
SUSE Labs, CR

^ permalink raw reply	[flat|nested] 75+ messages in thread

* Re: [PATCH v2] fs: fsnotify: account fsnotify metadata to kmemcg
  2018-02-19 13:50                                               ` Jan Kara
@ 2018-02-19 19:07                                                 ` Amir Goldstein
  -1 siblings, 0 replies; 75+ messages in thread
From: Amir Goldstein @ 2018-02-19 19:07 UTC (permalink / raw)
  To: Jan Kara
  Cc: Shakeel Butt, Yang Shi, Michal Hocko, linux-fsdevel, Linux MM,
	LKML, linux-api

On Mon, Feb 19, 2018 at 3:50 PM, Jan Kara <jack@suse.cz> wrote:
[...]
> For fanotify without FAN_UNLIMITED_QUEUE the situation is similar to that for
> inotify - IMO low practical impact, apps should generally handle queue
> overflow so I don't see a need for any opt in (more accurate memcg charging
> takes precedence over possibly broken apps).
>
> For fanotify with FAN_UNLIMITED_QUEUE the situation is somewhat different -
> firstly there is a practical impact (memory consumption is not limited by
> anything else) and secondly there are higher chances of the application
> breaking (no queue overflow expected) and also that this breakage won't be
> completely harmless (e.g., the application participates in securing the
> system). I've been thinking about this "conflict of interests" for some
> time and currently I think that the best handling of this is that by
> default events for FAN_UNLIMITED_QUEUE groups will get allocated with
> GFP_NOFAIL - such groups can be created only by global CAP_SYS_ADMIN anyway
> so it is reasonably safe against misuse (and since the allocations are
> small it is in fact equivalent to current status quo, just more explicit).
> That way application won't see unexpected queue overflow. The process
> generating event may be looping in the allocator but that is the case
> currently as well. Also the memcg with the consumer of events will have
> higher chances of triggering oom-kill if events consume too much memory but
> I don't see how this is not a good thing by default - and if such reaction
> is not desirable, there's memcg's oom_control to tune the OOM behavior
> which has capabilities far beyond of what we could invent for fanotify...
>
> What do you think Amir?
>

If I followed all your reasoning correctly, you propose to change behavior to
always account events to group memcg and never fail event allocation,
without any change of API and without opting-in for new behavior?
I think it makes sense. I can't point at any expected breakage,
so overall, this would be a good change.

I just feel sorry about passing up an opportunity to improve functionality.
The fact that fanotify does not have a way of defining the event queue
size is a deficiency IMO, one which I had to work around in the past.
I find that assigning the group to a memcg, configuring the memcg to the desired
memory limit, and getting Q_OVERFLOW on failure to allocate an event
is going to be a proper way of addressing this deficiency.

But if you don't think we should bind these 2 things together,
I'll let Shakeel decide if he wants to pursue the Q_OVERFLOW change
or not.

Thanks,
Amir.

^ permalink raw reply	[flat|nested] 75+ messages in thread

* Re: [PATCH v2] fs: fsnotify: account fsnotify metadata to kmemcg
  2018-02-19 19:07                                                 ` Amir Goldstein
@ 2018-02-20 12:43                                                   ` Jan Kara
  -1 siblings, 0 replies; 75+ messages in thread
From: Jan Kara @ 2018-02-20 12:43 UTC (permalink / raw)
  To: Amir Goldstein
  Cc: Jan Kara, Shakeel Butt, Yang Shi, Michal Hocko, linux-fsdevel,
	Linux MM, LKML, linux-api

On Mon 19-02-18 21:07:28, Amir Goldstein wrote:
> On Mon, Feb 19, 2018 at 3:50 PM, Jan Kara <jack@suse.cz> wrote:
> [...]
> > For fanotify without FAN_UNLIMITED_QUEUE the situation is similar to that for
> > inotify - IMO low practical impact, apps should generally handle queue
> > overflow so I don't see a need for any opt in (more accurate memcg charging
> > takes precedence over possibly broken apps).
> >
> > For fanotify with FAN_UNLIMITED_QUEUE the situation is somewhat different -
> > firstly there is a practical impact (memory consumption is not limited by
> > anything else) and secondly there are higher chances of the application
> > breaking (no queue overflow expected) and also that this breakage won't be
> > completely harmless (e.g., the application participates in securing the
> > system). I've been thinking about this "conflict of interests" for some
> > time and currently I think that the best handling of this is that by
> > default events for FAN_UNLIMITED_QUEUE groups will get allocated with
> > GFP_NOFAIL - such groups can be created only by global CAP_SYS_ADMIN anyway
> > so it is reasonably safe against misuse (and since the allocations are
> > small it is in fact equivalent to current status quo, just more explicit).
> > That way application won't see unexpected queue overflow. The process
> > generating event may be looping in the allocator but that is the case
> > currently as well. Also the memcg with the consumer of events will have
> > higher chances of triggering oom-kill if events consume too much memory but
> > I don't see how this is not a good thing by default - and if such reaction
> > is not desirable, there's memcg's oom_control to tune the OOM behavior
> > which has capabilities far beyond of what we could invent for fanotify...
> >
> > What do you think Amir?
> >
> 
> If I followed all your reasoning correctly, you propose to change behavior to
> always account events to group memcg and never fail event allocation,
> without any change of API and without opting-in for new behavior?
> I think it makes sense. I can't point at any expected breakage,
> so overall, this would be a good change.
> 
> I just feel sorry about passing up an opportunity to improve functionality.
> The fact that fanotify does not have a way of defining the event queue
> size is a deficiency IMO, one which I had to work around in the past.
> I find that assigning the group to a memcg, configuring the memcg to the desired
> memory limit, and getting Q_OVERFLOW on failure to allocate an event
> is going to be a proper way of addressing this deficiency.

So if you don't pass FAN_UNLIMITED_QUEUE, you will get a queue with a fixed
size and will get Q_OVERFLOW if that is exceeded. So is your concern that you'd
like some other fixed limit? A larger one or a smaller one, and for what
reason?

> But if you don't think we should bind these 2 things together,
> I'll let Shakeel decide if he wants to pursue the Q_OVERFLOW change
> or not.

So if there is still some uncovered use case for finer tuning of the event
queue length than setting or not setting FAN_UNLIMITED_QUEUE (+ possibly
putting the task into a memcg to limit memory usage), we can talk about how to
address that, but at this point I don't see a strong reason to bind this to
whether / how events are accounted to memcg...

And we still need to make sure we properly do the ENOMEM -> Q_OVERFLOW
translation and use GFP_NOFAIL for FAN_UNLIMITED_QUEUE groups before merging
Shakeel's memcg accounting patches. But Shakeel does not have to be the one
implementing that (although if you want to, you are welcome Shakeel :) -
otherwise I hope I'll get to it reasonably soon).
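
For context, the listener side of that contract is unchanged: an application
that already handles FAN_Q_OVERFLOW keeps working. A minimal userspace sketch
(error handling elided; drain() is just an illustrative helper name):

#include <sys/fanotify.h>
#include <unistd.h>
#include <stdio.h>

static void drain(int fan_fd)
{
	/* aligned buffer as required for fanotify_event_metadata records */
	char buf[4096] __attribute__((aligned(8)));
	ssize_t len = read(fan_fd, buf, sizeof(buf));
	struct fanotify_event_metadata *md = (struct fanotify_event_metadata *)buf;

	while (FAN_EVENT_OK(md, len)) {
		if (md->mask & FAN_Q_OVERFLOW) {
			/* events were lost (queue full, or ENOMEM with the
			 * proposed translation): fall back to a rescan */
			fprintf(stderr, "queue overflow, rescanning\n");
		} else if (md->fd >= 0) {
			/* handle the event, then drop the fd */
			close(md->fd);
		}
		md = FAN_EVENT_NEXT(md, len);
	}
}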

								Honza
-- 
Jan Kara <jack@suse.com>
SUSE Labs, CR

^ permalink raw reply	[flat|nested] 75+ messages in thread

* Re: [PATCH v2] fs: fsnotify: account fsnotify metadata to kmemcg
  2018-02-20 12:43                                                   ` Jan Kara
@ 2018-02-20 19:20                                                     ` Shakeel Butt
  -1 siblings, 0 replies; 75+ messages in thread
From: Shakeel Butt @ 2018-02-20 19:20 UTC (permalink / raw)
  To: Jan Kara
  Cc: Amir Goldstein, Yang Shi, Michal Hocko, linux-fsdevel, Linux MM,
	LKML, linux-api, Andrew Morton

On Tue, Feb 20, 2018 at 4:43 AM, Jan Kara <jack@suse.cz> wrote:
> On Mon 19-02-18 21:07:28, Amir Goldstein wrote:
>> On Mon, Feb 19, 2018 at 3:50 PM, Jan Kara <jack@suse.cz> wrote:
>> [...]
>> > For fanotify without FAN_UNLIMITED_QUEUE the situation is similar to that for
>> > inotify - IMO low practical impact, apps should generally handle queue
>> > overflow so I don't see a need for any opt in (more accurate memcg charging
>> > takes precedence over possibly broken apps).
>> >
>> > For fanotify with FAN_UNLIMITED_QUEUE the situation is somewhat different -
>> > firstly there is a practical impact (memory consumption is not limited by
>> > anything else) and secondly there are higher chances of the application
>> > breaking (no queue overflow expected) and also that this breakage won't be
>> > completely harmless (e.g., the application participates in securing the
>> > system). I've been thinking about this "conflict of interests" for some
>> > time and currently I think that the best handling of this is that by
>> > default events for FAN_UNLIMITED_QUEUE groups will get allocated with
>> > GFP_NOFAIL - such groups can be created only by global CAP_SYS_ADMIN anyway
>> > so it is reasonably safe against misuse (and since the allocations are
>> > small it is in fact equivalent to current status quo, just more explicit).
>> > That way application won't see unexpected queue overflow. The process
>> > generating event may be looping in the allocator but that is the case
>> > currently as well. Also the memcg with the consumer of events will have
>> > higher chances of triggering oom-kill if events consume too much memory but
>> > I don't see how this is not a good thing by default - and if such reaction
>> > is not desirable, there's memcg's oom_control to tune the OOM behavior
>> > which has capabilities far beyond of what we could invent for fanotify...
>> >
>> > What do you think Amir?
>> >
>>
>> If I followed all your reasoning correctly, you propose to change behavior to
>> always account events to group memcg and never fail event allocation,
>> without any change of API and without opting-in for new behavior?
>> I think it makes sense. I can't point at any expected breakage,
>> so overall, this would be a good change.
>>
>> I just feel sorry about passing up an opportunity to improve functionality.
>> The fact that fanotify does not have a way of defining the event queue
>> size is a deficiency IMO, one which I had to work around in the past.
>> I find that assigning the group to a memcg, configuring the memcg to the desired
>> memory limit, and getting Q_OVERFLOW on failure to allocate an event
>> is going to be a proper way of addressing this deficiency.
>
> So if you don't pass FAN_UNLIMITED_QUEUE, you will get a queue with a fixed
> size and will get Q_OVERFLOW if that is exceeded. So is your concern that you'd
> like some other fixed limit? A larger one or a smaller one, and for what
> reason?
>
>> But if you don't think we should bind these 2 things together,
>> I'll let Shakeel decide if he wants to pursue the Q_OVERFLOW change
>> or not.
>
> So if there is still some uncovered use case for finer tuning of the event
> queue length than setting or not setting FAN_UNLIMITED_QUEUE (+ possibly
> putting the task into a memcg to limit memory usage), we can talk about how to
> address that, but at this point I don't see a strong reason to bind this to
> whether / how events are accounted to memcg...
>
> And we still need to make sure we properly do ENOMEM -> Q_OVERFLOW
> translation and use GFP_NOFAIL for FAN_Q_UNLIMITED groups before merging
> Shakeel's memcg accounting patches. But Shakeel does not have to be the one
> implementing that (although if you want to, you are welcome Shakeel :) -
> otherwise I hope I'll get to it reasonably soon).
>

Thanks Jan & Amir for the help and explanation. Jan, I think you can
implement the "ENOMEM -> Q_OVERFLOW" and GFP_NOFAIL changes better
than I can. I will send out my patches with minor changes based on the
feedback, but I will ask Andrew to keep my patches in the mm tree and
not send them upstream yet. Once Jan has added his patches, I will
let Andrew know to go forward with my patches.

thanks,
Shakeel

^ permalink raw reply	[flat|nested] 75+ messages in thread
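
To make the approach discussed above concrete, here is a minimal sketch (not the patch under discussion) of how the fanotify allocation path could choose its gfp flags and let the caller fall back to a queue-overflow event. group->max_events, kmem_cache_alloc(), fanotify_event_cachep and FS_Q_OVERFLOW are existing kernel interfaces; the two function names and the fsnotify_queue_overflow() helper are hypothetical placeholders for illustration.

#include <linux/slab.h>
#include <linux/fsnotify_backend.h>

extern struct kmem_cache *fanotify_event_cachep;	/* defined in fanotify_user.c */

/*
 * Sketch only: unlimited-queue groups (FAN_UNLIMITED_QUEUE sets
 * group->max_events to UINT_MAX and requires CAP_SYS_ADMIN) get
 * __GFP_NOFAIL, so the small event allocation never fails for them;
 * everyone else may see the allocation fail under a memcg limit.
 */
static struct fanotify_event_info *fanotify_alloc_event_sketch(
					struct fsnotify_group *group)
{
	gfp_t gfp = GFP_KERNEL;

	if (group->max_events == UINT_MAX)
		gfp |= __GFP_NOFAIL;

	return kmem_cache_alloc(fanotify_event_cachep, gfp);
}

/* Caller side: translate an allocation failure into FS_Q_OVERFLOW. */
static int fanotify_handle_event_sketch(struct fsnotify_group *group)
{
	struct fanotify_event_info *event;

	event = fanotify_alloc_event_sketch(group);
	if (!event) {
		/*
		 * Hypothetical helper: queue the group's pre-allocated
		 * overflow event so the listener sees FS_Q_OVERFLOW
		 * instead of silently losing events.
		 */
		fsnotify_queue_overflow(group);
		return 0;
	}
	/* ... fill in and queue the event as usual ... */
	return 0;
}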

* Re: [PATCH v2] fs: fsnotify: account fsnotify metadata to kmemcg
  2018-02-20 12:43                                                   ` Jan Kara
@ 2018-02-20 20:30                                                     ` Amir Goldstein
  0 siblings, 0 replies; 75+ messages in thread
From: Amir Goldstein @ 2018-02-20 20:30 UTC (permalink / raw)
  To: Jan Kara
  Cc: Shakeel Butt, Yang Shi, Michal Hocko, linux-fsdevel, Linux MM,
	LKML, linux-api

On Tue, Feb 20, 2018 at 2:43 PM, Jan Kara <jack@suse.cz> wrote:
> On Mon 19-02-18 21:07:28, Amir Goldstein wrote:
[...]
>>
>> I just feel sorry about passing up an opportunity to improve functionality.
>> The fact that fanotify does not have a way of defining the event queue
>> size is a deficiency IMO, one which I had to work around in the past.
>> I find that assigning the group to a memcg, configuring the memcg to the
>> desired memory limit, and getting Q_OVERFLOW on failure to allocate an event
>> is going to be a proper way of addressing this deficiency.
>
> So if you don't pass FAN_Q_UNLIMITED, you will get a queue with a fixed size
> and will get Q_OVERFLOW if that is exceeded. So is your concern that you'd
> like some other fixed limit? A larger one or a smaller one, and for what
> reason?
>

My use case was that with the default queue size, I would get Q_OVERFLOW
on bursty fs workloads, but using FAN_Q_UNLIMITED and allowing events to
consume the entire system memory was not a desirable alternative.
The actual queue size was not important, only allowing the admin to tune the
system to bursty workloads without overflowing the event queue.

Something like FAN_Q_BESTEFFORT (i.e. Q_OVERFLOW on ENOMEM),
plus allowing event allocation to be restricted to a memcg, would allow the
admin to tune the system to bursty workloads.

>> But if you don't think we should bind these 2 things together,
>> I'll let Shakeel decide if he wants to pursue the Q_OVERFLOW change
>> or not.
>
> So if there is still some uncovered use case for finer tuning of the event
> queue length than setting or not setting FAN_Q_UNLIMITED (+ possibly
> putting the task into a memcg to limit memory usage), we can talk about how to
> address that but at this point I don't see a strong reason to bind this to
> whether / how events are accounted to memcg...

Agreed.

>
> And we still need to make sure we properly do ENOMEM -> Q_OVERFLOW
> translation and use GFP_NOFAIL for FAN_Q_UNLIMITED groups before merging

Good. It wasn't clear to me from your summary whether we were going to require
ENOMEM -> Q_OVERFLOW before merging this work. If you put it this way,
I think it makes sense to let the user choose between the GFP_NOFAIL and
Q_OVERFLOW behavior when the queue is not limited, for example by using a new
fanotify_init flag FAN_Q_BESTEFFORT (or a better name), but I have no problem
with postponing that for later.

Thanks,
Amir.

^ permalink raw reply	[flat|nested] 75+ messages in thread
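
As a rough illustration of the setup described above: a listener that first moves itself into a memcg and then opens an unlimited fanotify queue, treating FAN_Q_OVERFLOW as a hint to rescan. This is a userspace sketch under stated assumptions: the cgroup path is an example only, FAN_Q_BESTEFFORT is merely a proposal and is not used, and with current kernels an unlimited queue does not actually overflow - the overflow branch only matters with the behaviour discussed in this thread.

#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>
#include <sys/fanotify.h>

int main(void)
{
	/* Assumption: the admin prepared this cgroup with a suitable memory limit. */
	FILE *procs = fopen("/sys/fs/cgroup/fanotify-listener/cgroup.procs", "w");
	if (procs) {
		fprintf(procs, "%d\n", getpid());	/* move this process into the memcg */
		fclose(procs);
	}

	/* FAN_UNLIMITED_QUEUE requires CAP_SYS_ADMIN. */
	int fd = fanotify_init(FAN_CLASS_NOTIF | FAN_UNLIMITED_QUEUE, O_RDONLY);
	if (fd < 0) {
		perror("fanotify_init");
		return 1;
	}
	if (fanotify_mark(fd, FAN_MARK_ADD | FAN_MARK_MOUNT,
			  FAN_CLOSE_WRITE, AT_FDCWD, "/") < 0) {
		perror("fanotify_mark");
		return 1;
	}

	char buf[4096];
	for (;;) {
		ssize_t len = read(fd, buf, sizeof(buf));
		if (len <= 0)
			break;
		struct fanotify_event_metadata *md = (void *)buf;
		for (; FAN_EVENT_OK(md, len); md = FAN_EVENT_NEXT(md, len)) {
			if (md->mask & FAN_Q_OVERFLOW) {
				/* Events were dropped; rescan instead of trusting the stream. */
				fprintf(stderr, "queue overflow, resync needed\n");
				continue;
			}
			if (md->fd >= 0)
				close(md->fd);	/* consume and drop the event's fd */
		}
	}
	close(fd);
	return 0;
}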

end of thread

Thread overview: 75+ messages
2017-10-27 18:22 [PATCH v2] fs: fsnotify: account fsnotify metadata to kmemcg Yang Shi
2017-10-28 14:19 ` Amir Goldstein
2017-10-29  2:39   ` Matthew Wilcox
2017-10-30 12:43 ` Jan Kara
2017-10-30 16:39   ` Yang Shi
2017-10-31 10:12     ` Jan Kara
2017-10-31 16:44       ` Yang Shi
2017-11-01 15:15         ` Jan Kara
2017-11-09 13:54       ` Michal Hocko
2017-11-13 19:10         ` Yang Shi
2017-11-14  9:39           ` Michal Hocko
2017-11-14 17:32             ` Yang Shi
2017-11-15  9:31               ` Jan Kara
2018-01-19 15:02                 ` Shakeel Butt
2018-01-22 20:31                   ` Amir Goldstein
2018-01-24 10:34                     ` Jan Kara
2018-01-24 11:12                       ` Amir Goldstein
2018-01-25  1:08                         ` Shakeel Butt
2018-01-25  1:54                           ` Al Viro
2018-01-25  2:15                             ` Shakeel Butt
2018-01-25  7:51                           ` Amir Goldstein
2018-01-25 20:20                             ` Shakeel Butt
2018-01-25 20:36                               ` Amir Goldstein
2018-02-13  6:30                                 ` Amir Goldstein
2018-02-13 21:10                                   ` Shakeel Butt
2018-02-13 21:54                                     ` Amir Goldstein
2018-02-13 22:20                                       ` Shakeel Butt
2018-02-14  1:59                                         ` Shakeel Butt
2018-02-14  8:38                                           ` Amir Goldstein
2018-02-19 13:50                                             ` Jan Kara
2018-02-19 19:07                                               ` Amir Goldstein
2018-02-20 12:43                                                 ` Jan Kara
2018-02-20 19:20                                                   ` Shakeel Butt
2018-02-20 20:30                                                   ` Amir Goldstein
2018-02-14  9:00                                         ` Amir Goldstein
