linux-kernel.vger.kernel.org archive mirror
 help / color / mirror / Atom feed
* [RFC PATCH] fs: fsnotify: account fsnotify metadata to kmemcg
@ 2017-10-19 21:20 Yang Shi
  2017-10-20  3:14 ` Amir Goldstein
  0 siblings, 1 reply; 11+ messages in thread
From: Yang Shi @ 2017-10-19 21:20 UTC (permalink / raw)
  To: jack, amir73il; +Cc: Yang Shi, linux-fsdevel, linux-mm, linux-kernel

We observed that some misbehaving user applications may silently consume
a significant amount of fsnotify slab memory. It would be better to
account those slabs to kmemcg so that we can get a heads-up before a
misbehaving application silently uses too much memory.

Signed-off-by: Yang Shi <yang.s@alibaba-inc.com>
---
 fs/notify/dnotify/dnotify.c        | 4 ++--
 fs/notify/fanotify/fanotify_user.c | 6 +++---
 fs/notify/fsnotify.c               | 2 +-
 fs/notify/inotify/inotify_user.c   | 2 +-
 4 files changed, 7 insertions(+), 7 deletions(-)

diff --git a/fs/notify/dnotify/dnotify.c b/fs/notify/dnotify/dnotify.c
index cba3283..3ec6233 100644
--- a/fs/notify/dnotify/dnotify.c
+++ b/fs/notify/dnotify/dnotify.c
@@ -379,8 +379,8 @@ int fcntl_dirnotify(int fd, struct file *filp, unsigned long arg)
 
 static int __init dnotify_init(void)
 {
-	dnotify_struct_cache = KMEM_CACHE(dnotify_struct, SLAB_PANIC);
-	dnotify_mark_cache = KMEM_CACHE(dnotify_mark, SLAB_PANIC);
+	dnotify_struct_cache = KMEM_CACHE(dnotify_struct, SLAB_PANIC|SLAB_ACCOUNT);
+	dnotify_mark_cache = KMEM_CACHE(dnotify_mark, SLAB_PANIC|SLAB_ACCOUNT);
 
 	dnotify_group = fsnotify_alloc_group(&dnotify_fsnotify_ops);
 	if (IS_ERR(dnotify_group))
diff --git a/fs/notify/fanotify/fanotify_user.c b/fs/notify/fanotify/fanotify_user.c
index 907a481..7d62dee 100644
--- a/fs/notify/fanotify/fanotify_user.c
+++ b/fs/notify/fanotify/fanotify_user.c
@@ -947,11 +947,11 @@ static int fanotify_add_inode_mark(struct fsnotify_group *group,
  */
 static int __init fanotify_user_setup(void)
 {
-	fanotify_mark_cache = KMEM_CACHE(fsnotify_mark, SLAB_PANIC);
-	fanotify_event_cachep = KMEM_CACHE(fanotify_event_info, SLAB_PANIC);
+	fanotify_mark_cache = KMEM_CACHE(fsnotify_mark, SLAB_PANIC|SLAB_ACCOUNT);
+	fanotify_event_cachep = KMEM_CACHE(fanotify_event_info, SLAB_PANIC|SLAB_ACCOUNT);
 #ifdef CONFIG_FANOTIFY_ACCESS_PERMISSIONS
 	fanotify_perm_event_cachep = KMEM_CACHE(fanotify_perm_event_info,
-						SLAB_PANIC);
+						SLAB_PANIC|SLAB_ACCOUNT);
 #endif
 
 	return 0;
diff --git a/fs/notify/fsnotify.c b/fs/notify/fsnotify.c
index 0c4583b..82620ac 100644
--- a/fs/notify/fsnotify.c
+++ b/fs/notify/fsnotify.c
@@ -386,7 +386,7 @@ static __init int fsnotify_init(void)
 		panic("initializing fsnotify_mark_srcu");
 
 	fsnotify_mark_connector_cachep = KMEM_CACHE(fsnotify_mark_connector,
-						    SLAB_PANIC);
+						    SLAB_PANIC|SLAB_ACCOUNT);
 
 	return 0;
 }
diff --git a/fs/notify/inotify/inotify_user.c b/fs/notify/inotify/inotify_user.c
index 7cc7d3f..57b32ff 100644
--- a/fs/notify/inotify/inotify_user.c
+++ b/fs/notify/inotify/inotify_user.c
@@ -785,7 +785,7 @@ static int __init inotify_user_setup(void)
 
 	BUG_ON(hweight32(ALL_INOTIFY_BITS) != 21);
 
-	inotify_inode_mark_cachep = KMEM_CACHE(inotify_inode_mark, SLAB_PANIC);
+	inotify_inode_mark_cachep = KMEM_CACHE(inotify_inode_mark, SLAB_PANIC|SLAB_ACCOUNT);
 
 	inotify_max_queued_events = 16384;
 	init_user_ns.ucount_max[UCOUNT_INOTIFY_INSTANCES] = 128;
-- 
1.8.3.1

^ permalink raw reply related	[flat|nested] 11+ messages in thread

* Re: [RFC PATCH] fs: fsnotify: account fsnotify metadata to kmemcg
  2017-10-19 21:20 [RFC PATCH] fs: fsnotify: account fsnotify metadata to kmemcg Yang Shi
@ 2017-10-20  3:14 ` Amir Goldstein
  2017-10-20 21:07   ` Yang Shi
  0 siblings, 1 reply; 11+ messages in thread
From: Amir Goldstein @ 2017-10-20  3:14 UTC (permalink / raw)
  To: Yang Shi; +Cc: Jan Kara, linux-fsdevel, linux-mm, linux-kernel

On Fri, Oct 20, 2017 at 12:20 AM, Yang Shi <yang.s@alibaba-inc.com> wrote:
> We observed some misbehaved user applications might consume significant
> amount of fsnotify slabs silently. It'd better to account those slabs in
> kmemcg so that we can get heads up before misbehaved applications use too
> much memory silently.

In what way do they misbehave? Create a lot of marks? Create a lot of events?
Not read events in their queue?
The latter case is more interesting:

Process A is the one that asked to get the events.
Process B is the one that is generating the events and queuing them on
the queue that is owned by process A, who is also to blame if the queue
is not being read.

So why should process B be held accountable for memory pressure
caused by, say, an FAN_UNLIMITED_QUEUE that process A created and
doesn't read from?
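A minimal userspace sketch of this producer/listener split, assuming
Linux with glibc (plain inotify via ctypes; the watched directory and
file name are made up for illustration). The kernel allocates the event
in the producer's context even though the queue belongs to the listener:

```python
import ctypes, os, struct, tempfile

libc = ctypes.CDLL("libc.so.6", use_errno=True)
IN_CREATE = 0x00000100  # from <sys/inotify.h>

watched = tempfile.mkdtemp()
fd = libc.inotify_init1(0)  # "process A": owns the event queue
assert fd >= 0, os.strerror(ctypes.get_errno())
assert libc.inotify_add_watch(fd, watched.encode(), IN_CREATE) >= 0

pid = os.fork()
if pid == 0:
    # "process B": generates the event; the kernel allocates the event
    # struct while running in this task's context
    open(os.path.join(watched, "newfile"), "w").close()
    os._exit(0)
os.waitpid(pid, 0)

# A drains its queue; struct inotify_event starts (wd, mask, cookie, len)
buf = os.read(fd, 4096)
wd, mask, cookie, name_len = struct.unpack_from("iIII", buf, 0)
created = bool(mask & IN_CREATE)
print("IN_CREATE delivered to listener:", created)
```

With kmem accounting charged to the allocating task, the allocation in
the forked child would be charged to the child's memcg, not the
listener's.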

Is it possible to get an explicit reference to the memcg's events cache
at fsnotify_group creation time, store it in the group struct, and then
allocate events from the event cache associated with the group (the
listener) rather than the cache associated with the task generating the
event?

Amir.

>
> Signed-off-by: Yang Shi <yang.s@alibaba-inc.com>
> ---
>  fs/notify/dnotify/dnotify.c        | 4 ++--
>  fs/notify/fanotify/fanotify_user.c | 6 +++---
>  fs/notify/fsnotify.c               | 2 +-
>  fs/notify/inotify/inotify_user.c   | 2 +-
>  4 files changed, 7 insertions(+), 7 deletions(-)
>
> diff --git a/fs/notify/dnotify/dnotify.c b/fs/notify/dnotify/dnotify.c
> index cba3283..3ec6233 100644
> --- a/fs/notify/dnotify/dnotify.c
> +++ b/fs/notify/dnotify/dnotify.c
> @@ -379,8 +379,8 @@ int fcntl_dirnotify(int fd, struct file *filp, unsigned long arg)
>
>  static int __init dnotify_init(void)
>  {
> -       dnotify_struct_cache = KMEM_CACHE(dnotify_struct, SLAB_PANIC);
> -       dnotify_mark_cache = KMEM_CACHE(dnotify_mark, SLAB_PANIC);
> +       dnotify_struct_cache = KMEM_CACHE(dnotify_struct, SLAB_PANIC|SLAB_ACCOUNT);
> +       dnotify_mark_cache = KMEM_CACHE(dnotify_mark, SLAB_PANIC|SLAB_ACCOUNT);
>
>         dnotify_group = fsnotify_alloc_group(&dnotify_fsnotify_ops);
>         if (IS_ERR(dnotify_group))
> diff --git a/fs/notify/fanotify/fanotify_user.c b/fs/notify/fanotify/fanotify_user.c
> index 907a481..7d62dee 100644
> --- a/fs/notify/fanotify/fanotify_user.c
> +++ b/fs/notify/fanotify/fanotify_user.c
> @@ -947,11 +947,11 @@ static int fanotify_add_inode_mark(struct fsnotify_group *group,
>   */
>  static int __init fanotify_user_setup(void)
>  {
> -       fanotify_mark_cache = KMEM_CACHE(fsnotify_mark, SLAB_PANIC);
> -       fanotify_event_cachep = KMEM_CACHE(fanotify_event_info, SLAB_PANIC);
> +       fanotify_mark_cache = KMEM_CACHE(fsnotify_mark, SLAB_PANIC|SLAB_ACCOUNT);
> +       fanotify_event_cachep = KMEM_CACHE(fanotify_event_info, SLAB_PANIC|SLAB_ACCOUNT);
>  #ifdef CONFIG_FANOTIFY_ACCESS_PERMISSIONS
>         fanotify_perm_event_cachep = KMEM_CACHE(fanotify_perm_event_info,
> -                                               SLAB_PANIC);
> +                                               SLAB_PANIC|SLAB_ACCOUNT);
>  #endif
>
>         return 0;
> diff --git a/fs/notify/fsnotify.c b/fs/notify/fsnotify.c
> index 0c4583b..82620ac 100644
> --- a/fs/notify/fsnotify.c
> +++ b/fs/notify/fsnotify.c
> @@ -386,7 +386,7 @@ static __init int fsnotify_init(void)
>                 panic("initializing fsnotify_mark_srcu");
>
>         fsnotify_mark_connector_cachep = KMEM_CACHE(fsnotify_mark_connector,
> -                                                   SLAB_PANIC);
> +                                                   SLAB_PANIC|SLAB_ACCOUNT);
>
>         return 0;
>  }
> diff --git a/fs/notify/inotify/inotify_user.c b/fs/notify/inotify/inotify_user.c
> index 7cc7d3f..57b32ff 100644
> --- a/fs/notify/inotify/inotify_user.c
> +++ b/fs/notify/inotify/inotify_user.c
> @@ -785,7 +785,7 @@ static int __init inotify_user_setup(void)
>
>         BUG_ON(hweight32(ALL_INOTIFY_BITS) != 21);
>
> -       inotify_inode_mark_cachep = KMEM_CACHE(inotify_inode_mark, SLAB_PANIC);
> +       inotify_inode_mark_cachep = KMEM_CACHE(inotify_inode_mark, SLAB_PANIC|SLAB_ACCOUNT);
>
>         inotify_max_queued_events = 16384;
>         init_user_ns.ucount_max[UCOUNT_INOTIFY_INSTANCES] = 128;
> --
> 1.8.3.1
>

^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: [RFC PATCH] fs: fsnotify: account fsnotify metadata to kmemcg
  2017-10-20  3:14 ` Amir Goldstein
@ 2017-10-20 21:07   ` Yang Shi
  2017-10-22  8:24     ` Amir Goldstein
  0 siblings, 1 reply; 11+ messages in thread
From: Yang Shi @ 2017-10-20 21:07 UTC (permalink / raw)
  To: Amir Goldstein; +Cc: Jan Kara, linux-fsdevel, linux-mm, linux-kernel



On 10/19/17 8:14 PM, Amir Goldstein wrote:
> On Fri, Oct 20, 2017 at 12:20 AM, Yang Shi <yang.s@alibaba-inc.com> wrote:
>> We observed some misbehaved user applications might consume significant
>> amount of fsnotify slabs silently. It'd better to account those slabs in
>> kmemcg so that we can get heads up before misbehaved applications use too
>> much memory silently.
> 
> In what way do they misbehave? create a lot of marks? create a lot of events?
> Not reading events in their queue?

It looks like both a lot of marks and a lot of events. I'm not sure 
whether it is the latter case. If I knew more about the details of the 
behavior, I would have elaborated more in the commit log.

> The latter case is more interesting:
> 
> Process A is the one that asked to get the events.
> Process B is the one that is generating the events and queuing them on
> the queue that is owned by process A, who is also to blame if the queue
> is not being read.

I agree it is not fair to account the memory to the generator. But, 
afaik, accounting to a non-current memcg is not how memcg is designed 
to work. Please see below for some details.

> 
> So why should process B be held accountable for memory pressure
> caused by, say, an FAN_UNLIMITED_QUEUE that process A created and
> doesn't read from.
> 
> Is it possible to get an explicit reference to the memcg's  events cache
> at fsnotify_group creation time, store it in the group struct and then allocate
> events from the event cache associated with the group (the listener) rather
> than the cache associated with the task generating the event?

I don't think the current memcg design can do this. Kmem accounting 
happens at allocation time (when kmem_cache_alloc is called) and gets 
the associated memcg from the current task, so whoever does the 
allocation gets it accounted. If the producer is in a different memcg 
from the consumer, the memory is simply accounted to the producer's 
memcg, even though the problem might be caused by the consumer.

However, afaik, producer and consumer are typically in the same memcg, 
so this might not be a big issue. But I do admit such unfair accounting 
may happen.

Thanks,
Yang

> 
> Amir.
> 
>>
>> Signed-off-by: Yang Shi <yang.s@alibaba-inc.com>
>> ---
>>   fs/notify/dnotify/dnotify.c        | 4 ++--
>>   fs/notify/fanotify/fanotify_user.c | 6 +++---
>>   fs/notify/fsnotify.c               | 2 +-
>>   fs/notify/inotify/inotify_user.c   | 2 +-
>>   4 files changed, 7 insertions(+), 7 deletions(-)
>>
>> diff --git a/fs/notify/dnotify/dnotify.c b/fs/notify/dnotify/dnotify.c
>> index cba3283..3ec6233 100644
>> --- a/fs/notify/dnotify/dnotify.c
>> +++ b/fs/notify/dnotify/dnotify.c
>> @@ -379,8 +379,8 @@ int fcntl_dirnotify(int fd, struct file *filp, unsigned long arg)
>>
>>   static int __init dnotify_init(void)
>>   {
>> -       dnotify_struct_cache = KMEM_CACHE(dnotify_struct, SLAB_PANIC);
>> -       dnotify_mark_cache = KMEM_CACHE(dnotify_mark, SLAB_PANIC);
>> +       dnotify_struct_cache = KMEM_CACHE(dnotify_struct, SLAB_PANIC|SLAB_ACCOUNT);
>> +       dnotify_mark_cache = KMEM_CACHE(dnotify_mark, SLAB_PANIC|SLAB_ACCOUNT);
>>
>>          dnotify_group = fsnotify_alloc_group(&dnotify_fsnotify_ops);
>>          if (IS_ERR(dnotify_group))
>> diff --git a/fs/notify/fanotify/fanotify_user.c b/fs/notify/fanotify/fanotify_user.c
>> index 907a481..7d62dee 100644
>> --- a/fs/notify/fanotify/fanotify_user.c
>> +++ b/fs/notify/fanotify/fanotify_user.c
>> @@ -947,11 +947,11 @@ static int fanotify_add_inode_mark(struct fsnotify_group *group,
>>    */
>>   static int __init fanotify_user_setup(void)
>>   {
>> -       fanotify_mark_cache = KMEM_CACHE(fsnotify_mark, SLAB_PANIC);
>> -       fanotify_event_cachep = KMEM_CACHE(fanotify_event_info, SLAB_PANIC);
>> +       fanotify_mark_cache = KMEM_CACHE(fsnotify_mark, SLAB_PANIC|SLAB_ACCOUNT);
>> +       fanotify_event_cachep = KMEM_CACHE(fanotify_event_info, SLAB_PANIC|SLAB_ACCOUNT);
>>   #ifdef CONFIG_FANOTIFY_ACCESS_PERMISSIONS
>>          fanotify_perm_event_cachep = KMEM_CACHE(fanotify_perm_event_info,
>> -                                               SLAB_PANIC);
>> +                                               SLAB_PANIC|SLAB_ACCOUNT);
>>   #endif
>>
>>          return 0;
>> diff --git a/fs/notify/fsnotify.c b/fs/notify/fsnotify.c
>> index 0c4583b..82620ac 100644
>> --- a/fs/notify/fsnotify.c
>> +++ b/fs/notify/fsnotify.c
>> @@ -386,7 +386,7 @@ static __init int fsnotify_init(void)
>>                  panic("initializing fsnotify_mark_srcu");
>>
>>          fsnotify_mark_connector_cachep = KMEM_CACHE(fsnotify_mark_connector,
>> -                                                   SLAB_PANIC);
>> +                                                   SLAB_PANIC|SLAB_ACCOUNT);
>>
>>          return 0;
>>   }
>> diff --git a/fs/notify/inotify/inotify_user.c b/fs/notify/inotify/inotify_user.c
>> index 7cc7d3f..57b32ff 100644
>> --- a/fs/notify/inotify/inotify_user.c
>> +++ b/fs/notify/inotify/inotify_user.c
>> @@ -785,7 +785,7 @@ static int __init inotify_user_setup(void)
>>
>>          BUG_ON(hweight32(ALL_INOTIFY_BITS) != 21);
>>
>> -       inotify_inode_mark_cachep = KMEM_CACHE(inotify_inode_mark, SLAB_PANIC);
>> +       inotify_inode_mark_cachep = KMEM_CACHE(inotify_inode_mark, SLAB_PANIC|SLAB_ACCOUNT);
>>
>>          inotify_max_queued_events = 16384;
>>          init_user_ns.ucount_max[UCOUNT_INOTIFY_INSTANCES] = 128;
>> --
>> 1.8.3.1
>>

^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: [RFC PATCH] fs: fsnotify: account fsnotify metadata to kmemcg
  2017-10-20 21:07   ` Yang Shi
@ 2017-10-22  8:24     ` Amir Goldstein
  2017-10-24  4:12       ` Yang Shi
  2017-10-31 10:50       ` Jan Kara
  0 siblings, 2 replies; 11+ messages in thread
From: Amir Goldstein @ 2017-10-22  8:24 UTC (permalink / raw)
  To: Yang Shi; +Cc: Jan Kara, linux-fsdevel, linux-mm, linux-kernel, linux-api

[-- Attachment #1: Type: text/plain, Size: 4288 bytes --]

On Sat, Oct 21, 2017 at 12:07 AM, Yang Shi <yang.s@alibaba-inc.com> wrote:
>
>
> On 10/19/17 8:14 PM, Amir Goldstein wrote:
>>
>> On Fri, Oct 20, 2017 at 12:20 AM, Yang Shi <yang.s@alibaba-inc.com> wrote:
>>>
>>> We observed some misbehaved user applications might consume significant
>>> amount of fsnotify slabs silently. It'd better to account those slabs in
>>> kmemcg so that we can get heads up before misbehaved applications use too
>>> much memory silently.
>>
>>
>> In what way do they misbehave? create a lot of marks? create a lot of
>> events?
>> Not reading events in their queue?
>
>
> It looks both a lot marks and events. I'm not sure if it is the latter case.
> If I knew more about the details of the behavior, I would elaborated more in
> the commit log.

If you are not sure, do not refer to the user applications as "misbehaved".
Is updatedb(8) a misbehaved application because it produces a lot of access
events?
It would be better to provide the dry facts of your setup and the slab
counters, and say that you are missing the information needed to analyse the
distribution of slab usage because of the missing kmemcg accounting.


>
>> The latter case is more interesting:
>>
>> Process A is the one that asked to get the events.
>> Process B is the one that is generating the events and queuing them on
>> the queue that is owned by process A, who is also to blame if the queue
>> is not being read.
>
>
> I agree it is not fair to account the memory to the generator. But, afaik,
> accounting non-current memcg is not how memcg is designed and works. Please
> see the below for some details.
>
>>
>> So why should process B be held accountable for memory pressure
>> caused by, say, an FAN_UNLIMITED_QUEUE that process A created and
>> doesn't read from.
>>
>> Is it possible to get an explicit reference to the memcg's  events cache
>> at fsnotify_group creation time, store it in the group struct and then
>> allocate
>> events from the event cache associated with the group (the listener)
>> rather
>> than the cache associated with the task generating the event?
>
>
> I don't think current memcg design can do this. Because kmem accounting
> happens at allocation (when calling kmem_cache_alloc) stage, and get the
> associated memcg from current task, so basically who does the allocation who
> get it accounted. If the producer is in the different memcg of consumer, it
> should be just accounted to the producer memcg, although the problem might
> be caused by the producer.
>
> However, afaik, both producer and consumer are typically in the same memcg.
> So, this might be not a big issue. But, I do admit such unfair accounting
> may happen.
>

That is a reasonable argument, but please comment on that fact in the
commit message and above the creation of the events cache, so that it is
clear that event slab accounting is mostly heuristic.

But I think there is another problem, not introduced by your change, but
possibly amplified by it - when a non-permission event allocation fails,
the event is silently dropped, AFAICT, with no indication to the listener.
That seems like a bug to me, because there is a perfectly safe way to deal
with event allocation failure - queue the overflow event.

I am not going to be the one to determine if fixing this alleged bug is a
prerequisite for merging your patch, but I think enforcing memory limits on
event allocation could amplify that bug, so it should be fixed.

The upside is that with both your accounting fix and the ENOMEM = overflow
fix, it is going to be easy to write a test that verifies both of them:
- Run a listener in a memcg with limited kmem and an unlimited (or very
large) event queue
- Produce events inside the memcg without the listener reading them
- Read an event and expect an OVERFLOW event

This is a simple variant of LTP tests inotify05 and fanotify05.
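The queue-overflow half of such a test can be sketched without memcg at
all, by overfilling a regular inotify queue past
/proc/sys/fs/inotify/max_queued_events and looking for IN_Q_OVERFLOW (a
hedged sketch using raw glibc calls, assuming Linux, not the LTP
harness):

```python
import ctypes, os, struct, tempfile

libc = ctypes.CDLL("libc.so.6", use_errno=True)
IN_CREATE = 0x00000100
IN_Q_OVERFLOW = 0x00004000

with open("/proc/sys/fs/inotify/max_queued_events") as f:
    max_events = int(f.read())

fd = libc.inotify_init1(os.O_NONBLOCK)
assert fd >= 0, os.strerror(ctypes.get_errno())
watched = tempfile.mkdtemp()
assert libc.inotify_add_watch(fd, watched.encode(), IN_CREATE) >= 0

# Produce more events than the queue can hold, without reading any;
# distinct file names keep the kernel from merging identical events.
for i in range(max_events + 10):
    open(os.path.join(watched, str(i)), "w").close()

# Drain the queue; the final queued event should be IN_Q_OVERFLOW.
overflow = False
while True:
    try:
        buf = os.read(fd, 1 << 16)
    except BlockingIOError:
        break
    off = 0
    while off < len(buf):
        wd, mask, cookie, name_len = struct.unpack_from("iIII", buf, off)
        if mask & IN_Q_OVERFLOW:
            overflow = True
        off += 16 + name_len  # header plus (padded) name
print("saw IN_Q_OVERFLOW:", overflow)
```

The memcg variant would replace the queue limit with a kmem limit on the
producer's cgroup, which needs root and a cgroup setup, so it is omitted
here.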

I realize that this is a user-visible application behavior change and that
the documentation implies that an OVERFLOW event is not expected when using
FAN_UNLIMITED_QUEUE, but IMO no one will come shouting if we stop silently
dropping events, so it is better to fix this and update the documentation.

Attached is a compile-tested patch that implements overflow on ENOMEM.
Hope this helps you test your patch; then we can merge both, accompanied
by LTP tests for inotify and fanotify.

Amir.

[-- Attachment #2: 0001-fsnotify-queue-an-overflow-event-on-failure-to-alloc.patch --]
[-- Type: text/x-patch, Size: 2870 bytes --]

From 112ecd54045f14aff2c42622fabb4ffab9f0d8ff Mon Sep 17 00:00:00 2001
From: Amir Goldstein <amir73il@gmail.com>
Date: Sun, 22 Oct 2017 11:13:10 +0300
Subject: [PATCH] fsnotify: queue an overflow event on failure to allocate
 event

In low memory situations, non-permission events are silently dropped.
It is better to queue an OVERFLOW event in that case to let the listener
know about the lost event.

With this change, an application can now get a FAN_Q_OVERFLOW event,
even if it used the flag FAN_UNLIMITED_QUEUE on fanotify_init().

Signed-off-by: Amir Goldstein <amir73il@gmail.com>
---
 fs/notify/fanotify/fanotify.c        | 10 ++++++++--
 fs/notify/inotify/inotify_fsnotify.c |  8 ++++++--
 fs/notify/notification.c             |  3 ++-
 3 files changed, 16 insertions(+), 5 deletions(-)

diff --git a/fs/notify/fanotify/fanotify.c b/fs/notify/fanotify/fanotify.c
index 2fa99aeaa095..412a32838f58 100644
--- a/fs/notify/fanotify/fanotify.c
+++ b/fs/notify/fanotify/fanotify.c
@@ -212,8 +212,14 @@ static int fanotify_handle_event(struct fsnotify_group *group,
 		 mask);
 
 	event = fanotify_alloc_event(inode, mask, data);
-	if (unlikely(!event))
-		return -ENOMEM;
+	if (unlikely(!event)) {
+		if (mask & FAN_ALL_PERM_EVENTS)
+			return -ENOMEM;
+
+		/* Queue an overflow event on failure to allocate event */
+		fsnotify_add_event(group, group->overflow_event, NULL);
+		return 0;
+	}
 
 	fsn_event = &event->fse;
 	ret = fsnotify_add_event(group, fsn_event, fanotify_merge);
diff --git a/fs/notify/inotify/inotify_fsnotify.c b/fs/notify/inotify/inotify_fsnotify.c
index 8b73332735ba..d1837da2ef15 100644
--- a/fs/notify/inotify/inotify_fsnotify.c
+++ b/fs/notify/inotify/inotify_fsnotify.c
@@ -99,8 +99,11 @@ int inotify_handle_event(struct fsnotify_group *group,
 			      fsn_mark);
 
 	event = kmalloc(alloc_len, GFP_KERNEL);
-	if (unlikely(!event))
-		return -ENOMEM;
+	if (unlikely(!event)) {
+		/* Queue an overflow event on failure to allocate event */
+		fsnotify_add_event(group, group->overflow_event, NULL);
+		goto oneshot;
+	}
 
 	fsn_event = &event->fse;
 	fsnotify_init_event(fsn_event, inode, mask);
@@ -116,6 +119,7 @@ int inotify_handle_event(struct fsnotify_group *group,
 		fsnotify_destroy_event(group, fsn_event);
 	}
 
+oneshot:
 	if (inode_mark->mask & IN_ONESHOT)
 		fsnotify_destroy_mark(inode_mark, group);
 
diff --git a/fs/notify/notification.c b/fs/notify/notification.c
index 66f85c651c52..5abd69976a47 100644
--- a/fs/notify/notification.c
+++ b/fs/notify/notification.c
@@ -111,7 +111,8 @@ int fsnotify_add_event(struct fsnotify_group *group,
 		return 2;
 	}
 
-	if (group->q_len >= group->max_events) {
+	if (group->q_len >= group->max_events ||
+	    event == group->overflow_event) {
 		ret = 2;
 		/* Queue overflow event only if it isn't already queued */
 		if (!list_empty(&group->overflow_event->list)) {
-- 
2.7.4


^ permalink raw reply related	[flat|nested] 11+ messages in thread

* Re: [RFC PATCH] fs: fsnotify: account fsnotify metadata to kmemcg
  2017-10-22  8:24     ` Amir Goldstein
@ 2017-10-24  4:12       ` Yang Shi
  2017-10-24  5:42         ` Amir Goldstein
  2017-10-31 10:50       ` Jan Kara
  1 sibling, 1 reply; 11+ messages in thread
From: Yang Shi @ 2017-10-24  4:12 UTC (permalink / raw)
  To: Amir Goldstein; +Cc: Jan Kara, linux-fsdevel, linux-mm, linux-kernel, linux-api



On 10/22/17 1:24 AM, Amir Goldstein wrote:
> On Sat, Oct 21, 2017 at 12:07 AM, Yang Shi <yang.s@alibaba-inc.com> wrote:
>>
>>
>> On 10/19/17 8:14 PM, Amir Goldstein wrote:
>>>
>>> On Fri, Oct 20, 2017 at 12:20 AM, Yang Shi <yang.s@alibaba-inc.com> wrote:
>>>>
>>>> We observed some misbehaved user applications might consume significant
>>>> amount of fsnotify slabs silently. It'd better to account those slabs in
>>>> kmemcg so that we can get heads up before misbehaved applications use too
>>>> much memory silently.
>>>
>>>
>>> In what way do they misbehave? create a lot of marks? create a lot of
>>> events?
>>> Not reading events in their queue?
>>
>>
>> It looks both a lot marks and events. I'm not sure if it is the latter case.
>> If I knew more about the details of the behavior, I would elaborated more in
>> the commit log.
> 
> If you are not sure, do not refer to user application as "misbehaved".
> Is updatedb(8) a misbehaved application because it produces a lot of access
> events?

It should not be. It sounds like one of our in-house applications, but 
it is a sort of black box to me.

> It would be better if you provide the dry facts of your setup and slab counters
> and say that you are missing information to analyse the distribution of slab
> usage because of missing kmemcg accounting.

Yes, sure. Will add such information in the commit log for the new version.

> 
> 
>>
>>> The latter case is more interesting:
>>>
>>> Process A is the one that asked to get the events.
>>> Process B is the one that is generating the events and queuing them on
>>> the queue that is owned by process A, who is also to blame if the queue
>>> is not being read.
>>
>>
>> I agree it is not fair to account the memory to the generator. But, afaik,
>> accounting non-current memcg is not how memcg is designed and works. Please
>> see the below for some details.
>>
>>>
>>> So why should process B be held accountable for memory pressure
>>> caused by, say, an FAN_UNLIMITED_QUEUE that process A created and
>>> doesn't read from.
>>>
>>> Is it possible to get an explicit reference to the memcg's  events cache
>>> at fsnotify_group creation time, store it in the group struct and then
>>> allocate
>>> events from the event cache associated with the group (the listener)
>>> rather
>>> than the cache associated with the task generating the event?
>>
>>
>> I don't think current memcg design can do this. Because kmem accounting
>> happens at allocation (when calling kmem_cache_alloc) stage, and get the
>> associated memcg from current task, so basically who does the allocation who
>> get it accounted. If the producer is in the different memcg of consumer, it
>> should be just accounted to the producer memcg, although the problem might
>> be caused by the producer.
>>
>> However, afaik, both producer and consumer are typically in the same memcg.
>> So, this might be not a big issue. But, I do admit such unfair accounting
>> may happen.
>>
> 
> That is a reasonable argument, but please make a comment on that fact in
> commit message and above creation of events cache, so that it is clear that
> event slab accounting is mostly heuristic.

Yes, will add such information in the new version.

> 
> But I think there is another problem, not introduced by your change, but could
> be amplified because of it - when a non-permission event allocation fails, the
> event is silently dropped, AFAICT, with no indication to listener.
> That seems like a bug to me, because there is a perfectly safe way to deal with
> event allocation failure - queue the overflow event.

I'm not sure such an issue could be amplified by the accounting, since 
once the usage exceeds the limit any subsequent kmem allocation would 
fail. It might fail at fsnotify event allocation, or at other places, 
e.g. in the fork or open syscalls. So, in most cases the generator 
can't even generate a new event any more.

The typical output from my LTP test is a filesystem dcache allocation 
error or a fork error because the kmem limit has been reached.

> I am not going to be the one to determine if fixing this alleged bug is a
> prerequisite for merging your patch, but I think enforcing memory limits on
> event allocation could amplify that bug, so it should be fixed.
> 
> The upside is that with both your accounting fix and ENOMEM = overlflow
> fix, it going to be easy to write a test that verifies both of them:
> - Run a listener in memcg with limited kmem and unlimited (or very
> large) event queue
> - Produce events inside memcg without listener reading them
> - Read an event and expect an OVERFLOW event
> 
> This is a simple variant of LTP tests inotify05 and fanotify05.

I tried to test your patch with LTP, but it is not that easy to set up a 
scenario where the fsnotify event allocation is exactly what hits the 
kmem limit, since the limit may be hit before a new event is allocated; 
for example, allocating the dentry cache in the open syscall may hit 
the limit first.

So, it sounds like the overflow event might not be generated by the 
producer in most cases.

Thanks,
Yang

> 
> I realize that is user application behavior change and that documentation
> implies that an OVERFLOW event is not expected when using
> FAN_UNLIMITED_QUEUE, but IMO no one will come shouting
> if we stop silently dropping events, so it is better to fix this and update
> documentation.
> 
> Attached a compile-tested patch to implement overflow on ENOMEM
> Hope this helps to test your patch and then we can merge both, accompanied
> with LTP tests for inotify and fanotify.
> 
> Amir.
> 

^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: [RFC PATCH] fs: fsnotify: account fsnotify metadata to kmemcg
  2017-10-24  4:12       ` Yang Shi
@ 2017-10-24  5:42         ` Amir Goldstein
  2017-10-25  0:34           ` Yang Shi
  0 siblings, 1 reply; 11+ messages in thread
From: Amir Goldstein @ 2017-10-24  5:42 UTC (permalink / raw)
  To: Yang Shi; +Cc: Jan Kara, linux-fsdevel, linux-mm, linux-kernel, linux-api

On Tue, Oct 24, 2017 at 7:12 AM, Yang Shi <yang.s@alibaba-inc.com> wrote:
>
>
> On 10/22/17 1:24 AM, Amir Goldstein wrote:
>>
>> On Sat, Oct 21, 2017 at 12:07 AM, Yang Shi <yang.s@alibaba-inc.com> wrote:
>>>
>>>
>>>
>>> On 10/19/17 8:14 PM, Amir Goldstein wrote:
>>>>
>>>>
>>>> On Fri, Oct 20, 2017 at 12:20 AM, Yang Shi <yang.s@alibaba-inc.com>
>>>> wrote:
>>>>>
>>>>>
>>>>> We observed some misbehaved user applications might consume significant
>>>>> amount of fsnotify slabs silently. It'd better to account those slabs
>>>>> in
>>>>> kmemcg so that we can get heads up before misbehaved applications use
>>>>> too
>>>>> much memory silently.
>>>>
>>>>
>>>>
>>>> In what way do they misbehave? create a lot of marks? create a lot of
>>>> events?
>>>> Not reading events in their queue?
>>>
>>>
>>>
>>> It looks both a lot marks and events. I'm not sure if it is the latter
>>> case.
>>> If I knew more about the details of the behavior, I would elaborated more
>>> in
>>> the commit log.
>>
>>
>> If you are not sure, do not refer to user application as "misbehaved".
>> Is updatedb(8) a misbehaved application because it produces a lot of
>> access
>> events?
>
>
> Should be not. It sounds like our in-house applications. But, it is a sort
> of blackbox to me.
>

If you know which process is "misbehaving" you can look at
ls -l /proc/<pid>/fd | grep notify
and see the anonymous inotify/fanotify file descriptors.

Then you can look at the /proc/<pid>/fdinfo/<fd> files of those
file descriptors to learn more about the fanotify flags etc.
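From within a process, the same inspection can be scripted as below (a
sketch assuming Linux/glibc; the watch on /tmp is just an example):

```python
import ctypes, os

libc = ctypes.CDLL("libc.so.6", use_errno=True)
IN_MODIFY = 0x00000002

fd = libc.inotify_init1(os.O_CLOEXEC)  # IN_CLOEXEC == O_CLOEXEC
assert fd >= 0, os.strerror(ctypes.get_errno())
assert libc.inotify_add_watch(fd, b"/tmp", IN_MODIFY) >= 0

# /proc/<pid>/fdinfo/<fd> of an inotify/fanotify fd lists its marks:
# one "inotify wd:... ino:... mask:..." line per watch
with open(f"/proc/self/fdinfo/{fd}") as f:
    fdinfo = f.read()
print(fdinfo)
```

For fanotify descriptors, the fdinfo file additionally shows the
fanotify_init() flags and event mask.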

...

>
>>
>> But I think there is another problem, not introduced by your change, but
>> could
>> be amplified because of it - when a non-permission event allocation fails,
>> the
>> event is silently dropped, AFAICT, with no indication to listener.
>> That seems like a bug to me, because there is a perfectly safe way to deal
>> with
>> event allocation failure - queue the overflow event.
>
>
> I'm not sure if such issue could be amplified by the accounting since once
> the usage exceeds the limit any following kmem allocation would fail. So, it
> might fail at fsnotify event allocation, or other places, i.e. fork, open
> syscall, etc. So, in most cases the generator even can't generate new event
> any more.
>

To be clear, I did not mean that the kmem limit would cause a storm of
dropped events. I meant that if you have a listener outside the memcg
watching a single file for access/modifications, and you have many
containers, each with its own limited memcg, then the probability of
dropped events keeps growing as you run more of those kmem-limited
containers with event producers.

> The typical output from my LTP test is filesystem dcache allocation error or
> fork error due to kmem limit is reached.

And that should be considered a successful result of the test.
The only failure case is when the producer touches the file and neither
the event nor an overflow event is delivered.
You can probably reduce the allocation failures for fork and dentry by:
1. pinning the dentry cache of the subject file at test init by opening the file
2. setting the low kmem limit only after forking

Then you should loop the test enough times: some of the times the
producer may fail to access the file, in others it will succeed and
produce events properly, and some of the times the producer will access
the file and the event will be dropped, so the event count ends up lower
than the access count.



>
>> I am not going to be the one to determine if fixing this alleged bug is a
>> prerequisite for merging your patch, but I think enforcing memory limits
>> on
>> event allocation could amplify that bug, so it should be fixed.
>>
>> The upside is that with both your accounting fix and the ENOMEM = overflow
>> fix, it is going to be easy to write a test that verifies both of them:
>> - Run a listener in a memcg with limited kmem and an unlimited (or very
>> large) event queue
>> - Produce events inside the memcg without the listener reading them
>> - Read events and expect an OVERFLOW event
>>
>> This is a simple variant of LTP tests inotify05 and fanotify05.
>
>
> I tried to test your patch with LTP, but it does not seem that easy to set up
> a scenario where the fsnotify event allocation is exactly what hits the kmem
> limit, since the limit may be hit before a new event is allocated; for
> example, allocating the dentry cache in the open syscall may hit the limit.
>
> So, it sounds like the overflow event might not be generated by the producer
> in most cases.
>

Right, not as simple, but maybe still possible, as I described above.
Assuming that my patch is not buggy...

Thanks,
Amir.

^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: [RFC PATCH] fs: fsnotify: account fsnotify metadata to kmemcg
  2017-10-24  5:42         ` Amir Goldstein
@ 2017-10-25  0:34           ` Yang Shi
  0 siblings, 0 replies; 11+ messages in thread
From: Yang Shi @ 2017-10-25  0:34 UTC (permalink / raw)
  To: Amir Goldstein; +Cc: Jan Kara, linux-fsdevel, linux-mm, linux-kernel, linux-api



On 10/23/17 10:42 PM, Amir Goldstein wrote:
> On Tue, Oct 24, 2017 at 7:12 AM, Yang Shi <yang.s@alibaba-inc.com> wrote:
>>
>>
>> On 10/22/17 1:24 AM, Amir Goldstein wrote:
>>>
>>> On Sat, Oct 21, 2017 at 12:07 AM, Yang Shi <yang.s@alibaba-inc.com> wrote:
>>>>
>>>>
>>>>
>>>> On 10/19/17 8:14 PM, Amir Goldstein wrote:
>>>>>
>>>>>
>>>>> On Fri, Oct 20, 2017 at 12:20 AM, Yang Shi <yang.s@alibaba-inc.com>
>>>>> wrote:
>>>>>>
>>>>>>
>>>>>> We observed that some misbehaving user applications can silently consume
>>>>>> a significant amount of fsnotify slabs. It is better to account those
>>>>>> slabs to kmemcg so that we can get a heads-up before misbehaving
>>>>>> applications use too much memory.
>>>>>
>>>>>
>>>>>
>>>>> In what way do they misbehave? create a lot of marks? create a lot of
>>>>> events?
>>>>> Not reading events in their queue?
>>>>
>>>>
>>>>
>>>> It looks like both: a lot of marks and a lot of events. I'm not sure about
>>>> the latter case. If I knew more about the details of the behavior, I would
>>>> have elaborated more in the commit log.
>>>
>>>
>>> If you are not sure, do not refer to the user application as "misbehaved".
>>> Is updatedb(8) a misbehaving application because it produces a lot of
>>> access events?
>>
>>
>> It should not be. It sounds like our in-house applications, but they are a
>> sort of black box to me.
>>
> 
> If you know which process is "misbehaving", you can look at
> ls -l /proc/<pid>/fd | grep notify
> and see the anonymous inotify/fanotify file descriptors.
> 
> Then you can look at the /proc/<pid>/fdinfo/<fd> file of those
> file descriptors to learn more about the fanotify flags etc.

Thanks for the hints.

> 
> ...
> 
>>
>>>
>>> But I think there is another problem, not introduced by your change, but
>>> could
>>> be amplified because of it - when a non-permission event allocation fails,
>>> the
>>> event is silently dropped, AFAICT, with no indication to listener.
>>> That seems like a bug to me, because there is a perfectly safe way to deal
>>> with
>>> event allocation failure - queue the overflow event.
>>
>>
>> I'm not sure such an issue could be amplified by the accounting, since once
>> the usage exceeds the limit any following kmem allocation would fail. It
>> might fail at the fsnotify event allocation, or at other places, e.g. fork,
>> the open syscall, etc. So, in most cases the generator can't even generate
>> new events any more.
>>
> 
> To be clear, I did not mean that the kmem limit would cause a storm of dropped
> events. I meant that if you have a listener outside the memcg watching a single
> file for access/modifications, and you have many containers each with its own
> limited memcg, then the probability of dropped events keeps growing as you run
> more of those kmem-limited containers with event producers.
> 
>> The typical output from my LTP test is a filesystem dcache allocation error
>> or a fork error because the kmem limit is reached.
> 
> And that should be considered a successful result of the test.
> The only failure case is when the producer touches the file and neither
> the event nor an overflow event is delivered.
> You can probably reduce the allocation failures for fork and dentry by:
> 1. pinning the dentry cache of the subject file on test init by opening the file
> 2. setting the low kmem limit only after forking
> 
> Then you should loop the test enough times:
> in some runs the producer may fail to access the file,
> in others it will succeed and produce events properly,
> and in some runs the producer will access the file but the event
> will be dropped, so the event count is lower than the access count.

I still don't have a direct test result showing that the patch works as
expected. But I did get some evidence from running the fanotify test
*without* your overflow event patch.

Running fanotify without your overflow patch, I sometimes see:

[  122.222455] SLUB: Unable to allocate memory on node -1, gfp=0x14000c0(GFP_KERNEL)
[  122.224198]   cache: fanotify_event_info(110:fsnotify), object size: 56, buffer size: 88, default order: 0, min order: 0
[  122.226578]   node 0: slabs: 11, objs: 506, free: 0
[  122.227655]   node 1: slabs: 0, objs: 0, free: 0
[  122.229251] SLUB: Unable to allocate memory on node -1, gfp=0x14000c0(GFP_KERNEL)
[  122.230912]   cache: fanotify_event_info(110:fsnotify), object size: 56, buffer size: 88, default order: 0, min order: 0
[  122.233266]   node 0: slabs: 11, objs: 506, free: 0
[  122.234337]   node 1: slabs: 0, objs: 0, free: 0

The SLUB OOM information is printed a couple of times; it should mean that
neither the event nor an overflow event was delivered.

With your overflow patch, I've never seen such a message.

Regards,
Yang

> 
> 
> 
>>
>>> I am not going to be the one to determine if fixing this alleged bug is a
>>> prerequisite for merging your patch, but I think enforcing memory limits
>>> on
>>> event allocation could amplify that bug, so it should be fixed.
>>>
>>> The upside is that with both your accounting fix and the ENOMEM = overflow
>>> fix, it is going to be easy to write a test that verifies both of them:
>>> - Run a listener in a memcg with limited kmem and an unlimited (or very
>>> large) event queue
>>> - Produce events inside the memcg without the listener reading them
>>> - Read events and expect an OVERFLOW event
>>>
>>> This is a simple variant of LTP tests inotify05 and fanotify05.
>>
>>
>> I tried to test your patch with LTP, but it does not seem that easy to set up
>> a scenario where the fsnotify event allocation is exactly what hits the kmem
>> limit, since the limit may be hit before a new event is allocated; for
>> example, allocating the dentry cache in the open syscall may hit the limit.
>>
>> So, it sounds like the overflow event might not be generated by the producer
>> in most cases.
>>
> 
> Right, not as simple, but maybe still possible, as I described above.
> Assuming that my patch is not buggy...
> 
> Thanks,
> Amir.
> 


* Re: [RFC PATCH] fs: fsnotify: account fsnotify metadata to kmemcg
  2017-10-22  8:24     ` Amir Goldstein
  2017-10-24  4:12       ` Yang Shi
@ 2017-10-31 10:50       ` Jan Kara
  2017-10-31 11:51         ` Amir Goldstein
  1 sibling, 1 reply; 11+ messages in thread
From: Jan Kara @ 2017-10-31 10:50 UTC (permalink / raw)
  To: Amir Goldstein
  Cc: Yang Shi, Jan Kara, linux-fsdevel, linux-mm, linux-kernel, linux-api

On Sun 22-10-17 11:24:17, Amir Goldstein wrote:
> But I think there is another problem, not introduced by your change, but could
> be amplified because of it - when a non-permission event allocation fails, the
> event is silently dropped, AFAICT, with no indication to listener.
> That seems like a bug to me, because there is a perfectly safe way to deal with
> event allocation failure - queue the overflow event.
> 
> I am not going to be the one to determine if fixing this alleged bug is a
> prerequisite for merging your patch, but I think enforcing memory limits on
> event allocation could amplify that bug, so it should be fixed.
> 
> The upside is that with both your accounting fix and the ENOMEM = overflow
> fix, it is going to be easy to write a test that verifies both of them:
> - Run a listener in memcg with limited kmem and unlimited (or very
> large) event queue
> - Produce events inside memcg without listener reading them
> - Read event and expect an OVERFLOW event
> 
> This is a simple variant of LTP tests inotify05 and fanotify05.
> 
> I realize that this is a user-visible application behavior change and that
> the documentation implies that an OVERFLOW event is not expected when using
> FAN_UNLIMITED_QUEUE, but IMO no one will come shouting
> if we stop silently dropping events, so it is better to fix this and update
> the documentation.
> 
> Attached a compile-tested patch to implement overflow on ENOMEM
> Hope this helps to test your patch and then we can merge both, accompanied
> with LTP tests for inotify and fanotify.
> 
> Amir.

> From 112ecd54045f14aff2c42622fabb4ffab9f0d8ff Mon Sep 17 00:00:00 2001
> From: Amir Goldstein <amir73il@gmail.com>
> Date: Sun, 22 Oct 2017 11:13:10 +0300
> Subject: [PATCH] fsnotify: queue an overflow event on failure to allocate
>  event
> 
> In low memory situations, non-permission events are silently dropped.
> It is better to queue an OVERFLOW event in that case to let the listener
> know about the lost event.
> 
> With this change, an application can now get a FAN_Q_OVERFLOW event,
> even if it used the flag FAN_UNLIMITED_QUEUE on fanotify_init().
> 
> Signed-off-by: Amir Goldstein <amir73il@gmail.com>

So I agree something like this is desirable, but I'm uneasy about using
{IN|FAN}_Q_OVERFLOW for this. Firstly, it is a userspace-visible change for
FAN_UNLIMITED_QUEUE queues which could confuse applications, as you properly
note. Secondly, the event is similar to a queue overflow but not quite the
same (it is not that the application is too slow in processing
events, it is just that the system is in a problematic state overall). What
are your thoughts on adding a new event flag like FAN_Q_LOSTEVENT or
something like that? Probably the biggest downside I see there is that apps
would have to learn to use it...

								Honza

> ---
>  fs/notify/fanotify/fanotify.c        | 10 ++++++++--
>  fs/notify/inotify/inotify_fsnotify.c |  8 ++++++--
>  fs/notify/notification.c             |  3 ++-
>  3 files changed, 16 insertions(+), 5 deletions(-)
> 
> diff --git a/fs/notify/fanotify/fanotify.c b/fs/notify/fanotify/fanotify.c
> index 2fa99aeaa095..412a32838f58 100644
> --- a/fs/notify/fanotify/fanotify.c
> +++ b/fs/notify/fanotify/fanotify.c
> @@ -212,8 +212,14 @@ static int fanotify_handle_event(struct fsnotify_group *group,
>  		 mask);
>  
>  	event = fanotify_alloc_event(inode, mask, data);
> -	if (unlikely(!event))
> -		return -ENOMEM;
> +	if (unlikely(!event)) {
> +		if (mask & FAN_ALL_PERM_EVENTS)
> +			return -ENOMEM;
> +
> +		/* Queue an overflow event on failure to allocate event */
> +		fsnotify_add_event(group, group->overflow_event, NULL);
> +		return 0;
> +	}
>  
>  	fsn_event = &event->fse;
>  	ret = fsnotify_add_event(group, fsn_event, fanotify_merge);
> diff --git a/fs/notify/inotify/inotify_fsnotify.c b/fs/notify/inotify/inotify_fsnotify.c
> index 8b73332735ba..d1837da2ef15 100644
> --- a/fs/notify/inotify/inotify_fsnotify.c
> +++ b/fs/notify/inotify/inotify_fsnotify.c
> @@ -99,8 +99,11 @@ int inotify_handle_event(struct fsnotify_group *group,
>  			      fsn_mark);
>  
>  	event = kmalloc(alloc_len, GFP_KERNEL);
> -	if (unlikely(!event))
> -		return -ENOMEM;
> +	if (unlikely(!event)) {
> +		/* Queue an overflow event on failure to allocate event */
> +		fsnotify_add_event(group, group->overflow_event, NULL);
> +		goto oneshot;
> +	}
>  
>  	fsn_event = &event->fse;
>  	fsnotify_init_event(fsn_event, inode, mask);
> @@ -116,6 +119,7 @@ int inotify_handle_event(struct fsnotify_group *group,
>  		fsnotify_destroy_event(group, fsn_event);
>  	}
>  
> +oneshot:
>  	if (inode_mark->mask & IN_ONESHOT)
>  		fsnotify_destroy_mark(inode_mark, group);
>  
> diff --git a/fs/notify/notification.c b/fs/notify/notification.c
> index 66f85c651c52..5abd69976a47 100644
> --- a/fs/notify/notification.c
> +++ b/fs/notify/notification.c
> @@ -111,7 +111,8 @@ int fsnotify_add_event(struct fsnotify_group *group,
>  		return 2;
>  	}
>  
> -	if (group->q_len >= group->max_events) {
> +	if (group->q_len >= group->max_events ||
> +	    event == group->overflow_event) {
>  		ret = 2;
>  		/* Queue overflow event only if it isn't already queued */
>  		if (!list_empty(&group->overflow_event->list)) {
> -- 
> 2.7.4
> 

-- 
Jan Kara <jack@suse.com>
SUSE Labs, CR


* Re: [RFC PATCH] fs: fsnotify: account fsnotify metadata to kmemcg
  2017-10-31 10:50       ` Jan Kara
@ 2017-10-31 11:51         ` Amir Goldstein
  2017-10-31 16:52           ` Jan Kara
  0 siblings, 1 reply; 11+ messages in thread
From: Amir Goldstein @ 2017-10-31 11:51 UTC (permalink / raw)
  To: Jan Kara; +Cc: Yang Shi, linux-fsdevel, linux-mm, linux-kernel, linux-api

On Tue, Oct 31, 2017 at 12:50 PM, Jan Kara <jack@suse.cz> wrote:
> On Sun 22-10-17 11:24:17, Amir Goldstein wrote:
>> But I think there is another problem, not introduced by your change, but could
>> be amplified because of it - when a non-permission event allocation fails, the
>> event is silently dropped, AFAICT, with no indication to listener.
>> That seems like a bug to me, because there is a perfectly safe way to deal with
>> event allocation failure - queue the overflow event.
>>
>> I am not going to be the one to determine if fixing this alleged bug is a
>> prerequisite for merging your patch, but I think enforcing memory limits on
>> event allocation could amplify that bug, so it should be fixed.
>>
>> The upside is that with both your accounting fix and the ENOMEM = overflow
>> fix, it is going to be easy to write a test that verifies both of them:
>> - Run a listener in memcg with limited kmem and unlimited (or very
>> large) event queue
>> - Produce events inside memcg without listener reading them
>> - Read event and expect an OVERFLOW event
>>
>> This is a simple variant of LTP tests inotify05 and fanotify05.
>>
>> I realize that this is a user-visible application behavior change and that
>> the documentation implies that an OVERFLOW event is not expected when using
>> FAN_UNLIMITED_QUEUE, but IMO no one will come shouting
>> if we stop silently dropping events, so it is better to fix this and update
>> the documentation.
>>
>> Attached a compile-tested patch to implement overflow on ENOMEM
>> Hope this helps to test your patch and then we can merge both, accompanied
>> with LTP tests for inotify and fanotify.
>>
>> Amir.
>
>> From 112ecd54045f14aff2c42622fabb4ffab9f0d8ff Mon Sep 17 00:00:00 2001
>> From: Amir Goldstein <amir73il@gmail.com>
>> Date: Sun, 22 Oct 2017 11:13:10 +0300
>> Subject: [PATCH] fsnotify: queue an overflow event on failure to allocate
>>  event
>>
>> In low memory situations, non-permission events are silently dropped.
>> It is better to queue an OVERFLOW event in that case to let the listener
>> know about the lost event.
>>
>> With this change, an application can now get an FAN_Q_OVERFLOW event,
>> even if it used flag FAN_UNLIMITED_QUEUE on fanotify_init().
>>
>> Signed-off-by: Amir Goldstein <amir73il@gmail.com>
>
> So I agree something like this is desirable, but I'm uneasy about using
> {IN|FAN}_Q_OVERFLOW for this. Firstly, it is a userspace-visible change for
> FAN_UNLIMITED_QUEUE queues which could confuse applications, as you properly
> note. Secondly, the event is similar to a queue overflow but not quite the
> same (it is not that the application is too slow in processing
> events, it is just that the system is in a problematic state overall). What
> are your thoughts on adding a new event flag like FAN_Q_LOSTEVENT or
> something like that? Probably the biggest downside I see there is that apps
> would have to learn to use it...
>

Well, I can't say I like FAN_Q_LOSTEVENT, but I can't really think of
a better option. I guess apps that want to provide better protection
against losing events will have to opt in with a new fanotify_init() flag.
OTOH, if an app opts in to this feature, we can also report Q_OVERFLOW
and document that it *is* expected in OOM situations.

If we have FAN_Q_LOSTEVENT, we can use it to handle both the case of an
error queueing the event (-ENOMEM) and the case of an error copying the
event to the user (e.g. -ENODEV), which is another case where we silently
drop events (when the buffer already contains good events).
In the latter case, the error would be reported to the user in event->fd.
In the former case, event->fd will also hold the error, as long as -ENOMEM
is the only error we can report this way, because, like the overflow event,
there should probably be only one event of that sort in the queue.

Another option for the API name is {IN|FAN}_Q_ERR, which implies that
event->fd carries the error. And of course the user can get an event with
mask FAN_Q_OVERFLOW|FAN_Q_ERR, where event->fd is -ENOMEM or
-EOVERFLOW, and then there is no ambiguity between the different kinds of
queue overflow.

Amir.


* Re: [RFC PATCH] fs: fsnotify: account fsnotify metadata to kmemcg
  2017-10-31 11:51         ` Amir Goldstein
@ 2017-10-31 16:52           ` Jan Kara
  2017-10-31 17:01             ` Amir Goldstein
  0 siblings, 1 reply; 11+ messages in thread
From: Jan Kara @ 2017-10-31 16:52 UTC (permalink / raw)
  To: Amir Goldstein
  Cc: Jan Kara, Yang Shi, linux-fsdevel, linux-mm, linux-kernel, linux-api

On Tue 31-10-17 13:51:40, Amir Goldstein wrote:
> On Tue, Oct 31, 2017 at 12:50 PM, Jan Kara <jack@suse.cz> wrote:
> > On Sun 22-10-17 11:24:17, Amir Goldstein wrote:
> >> But I think there is another problem, not introduced by your change, but could
> >> be amplified because of it - when a non-permission event allocation fails, the
> >> event is silently dropped, AFAICT, with no indication to listener.
> >> That seems like a bug to me, because there is a perfectly safe way to deal with
> >> event allocation failure - queue the overflow event.
> >>
> >> I am not going to be the one to determine if fixing this alleged bug is a
> >> prerequisite for merging your patch, but I think enforcing memory limits on
> >> event allocation could amplify that bug, so it should be fixed.
> >>
> >> The upside is that with both your accounting fix and the ENOMEM = overflow
> >> fix, it is going to be easy to write a test that verifies both of them:
> >> - Run a listener in memcg with limited kmem and unlimited (or very
> >> large) event queue
> >> - Produce events inside memcg without listener reading them
> >> - Read event and expect an OVERFLOW event
> >>
> >> This is a simple variant of LTP tests inotify05 and fanotify05.
> >>
> >> I realize that this is a user-visible application behavior change and that
> >> the documentation implies that an OVERFLOW event is not expected when using
> >> FAN_UNLIMITED_QUEUE, but IMO no one will come shouting
> >> if we stop silently dropping events, so it is better to fix this and update
> >> the documentation.
> >>
> >> Attached a compile-tested patch to implement overflow on ENOMEM
> >> Hope this helps to test your patch and then we can merge both, accompanied
> >> with LTP tests for inotify and fanotify.
> >>
> >> Amir.
> >
> >> From 112ecd54045f14aff2c42622fabb4ffab9f0d8ff Mon Sep 17 00:00:00 2001
> >> From: Amir Goldstein <amir73il@gmail.com>
> >> Date: Sun, 22 Oct 2017 11:13:10 +0300
> >> Subject: [PATCH] fsnotify: queue an overflow event on failure to allocate
> >>  event
> >>
> >> In low memory situations, non-permission events are silently dropped.
> >> It is better to queue an OVERFLOW event in that case to let the listener
> >> know about the lost event.
> >>
> >> With this change, an application can now get an FAN_Q_OVERFLOW event,
> >> even if it used flag FAN_UNLIMITED_QUEUE on fanotify_init().
> >>
> >> Signed-off-by: Amir Goldstein <amir73il@gmail.com>
> >
> > So I agree something like this is desirable but I'm uneasy about using
> > {IN|FAN}_Q_OVERFLOW for this. Firstly, it is userspace visible change for
> > FAN_UNLIMITED_QUEUE queues which could confuse applications as you properly
> > note. Secondly, the event is similar to queue overflow but not quite the
> > same (it is not that the application would be too slow in processing
> > events, it is just that the system is in a problematic state overall). What
> > are your thoughts on adding a new event flags like FAN_Q_LOSTEVENT or
> > something like that? Probably the biggest downside there I see is that apps
> > would have to learn to use it...
> >
> 
> Well, I can't say I like FAN_Q_LOSTEVENT, but I can't really think of
> a better option. I guess apps that want to provide better protection
> against losing events will have to opt in with a new fanotify_init() flag.
> OTOH, if apps opts-in for this feature, we can also report Q_OVERFLOW
> and document that it *is* expected in OOM situation.
> 
> If we have FAN_Q_LOSTEVENT, we can use it to handle both the case of
> error to queue event (-ENOMEM) and the case of error on copy event to user
> (e.g. -ENODEV), which is another case where we silently drop events
> (in case buffer already contains good events).
> In latter case, the error would be reported to user on event->fd.
> In the former case, event->fd will also hold the error, as long as we can only
> report -ENOMEM from this sort of error, because like overflow event, there
> should probably be only one event of that sort in the queue.
> 
> Another option for API name is {IN|FAN}_Q_ERR, which implies that event->fd
> carries the error. And of course user can get an event with mask
> FAN_Q_OVERFLOW|FAN_Q_ERR, where event->fd is -ENOMEM or
> -EOVERFLOW, and then there is no ambiguity between the different kinds of
> queue overflow.

I like this last option. I.e., userspace can opt in to get more detailed
error notifications. In that case we can report the error (I think we can
just reuse {IN|FAN}_Q_OVERFLOW for that) and store a more detailed error
description in wd/fd. Will you have time to implement something like that,
or should I put it on my todo list?

								Honza
-- 
Jan Kara <jack@suse.com>
SUSE Labs, CR


* Re: [RFC PATCH] fs: fsnotify: account fsnotify metadata to kmemcg
  2017-10-31 16:52           ` Jan Kara
@ 2017-10-31 17:01             ` Amir Goldstein
  0 siblings, 0 replies; 11+ messages in thread
From: Amir Goldstein @ 2017-10-31 17:01 UTC (permalink / raw)
  To: Jan Kara; +Cc: Yang Shi, linux-fsdevel, linux-mm, linux-kernel, linux-api

On Tue, Oct 31, 2017 at 6:52 PM, Jan Kara <jack@suse.cz> wrote:
> On Tue 31-10-17 13:51:40, Amir Goldstein wrote:
>> On Tue, Oct 31, 2017 at 12:50 PM, Jan Kara <jack@suse.cz> wrote:
>> > On Sun 22-10-17 11:24:17, Amir Goldstein wrote:
>> >> But I think there is another problem, not introduced by your change, but could
>> >> be amplified because of it - when a non-permission event allocation fails, the
>> >> event is silently dropped, AFAICT, with no indication to listener.
>> >> That seems like a bug to me, because there is a perfectly safe way to deal with
>> >> event allocation failure - queue the overflow event.
>> >>
>> >> I am not going to be the one to determine if fixing this alleged bug is a
>> >> prerequisite for merging your patch, but I think enforcing memory limits on
>> >> event allocation could amplify that bug, so it should be fixed.
>> >>
>> >> The upside is that with both your accounting fix and the ENOMEM = overflow
>> >> fix, it is going to be easy to write a test that verifies both of them:
>> >> - Run a listener in memcg with limited kmem and unlimited (or very
>> >> large) event queue
>> >> - Produce events inside memcg without listener reading them
>> >> - Read event and expect an OVERFLOW event
>> >>
>> >> This is a simple variant of LTP tests inotify05 and fanotify05.
>> >>
>> >> I realize that this is a user-visible application behavior change and that
>> >> the documentation implies that an OVERFLOW event is not expected when using
>> >> FAN_UNLIMITED_QUEUE, but IMO no one will come shouting
>> >> if we stop silently dropping events, so it is better to fix this and update
>> >> the documentation.
>> >>
>> >> Attached a compile-tested patch to implement overflow on ENOMEM
>> >> Hope this helps to test your patch and then we can merge both, accompanied
>> >> with LTP tests for inotify and fanotify.
>> >>
>> >> Amir.
>> >
>> >> From 112ecd54045f14aff2c42622fabb4ffab9f0d8ff Mon Sep 17 00:00:00 2001
>> >> From: Amir Goldstein <amir73il@gmail.com>
>> >> Date: Sun, 22 Oct 2017 11:13:10 +0300
>> >> Subject: [PATCH] fsnotify: queue an overflow event on failure to allocate
>> >>  event
>> >>
>> >> In low memory situations, non-permission events are silently dropped.
>> >> It is better to queue an OVERFLOW event in that case to let the listener
>> >> know about the lost event.
>> >>
>> >> With this change, an application can now get an FAN_Q_OVERFLOW event,
>> >> even if it used flag FAN_UNLIMITED_QUEUE on fanotify_init().
>> >>
>> >> Signed-off-by: Amir Goldstein <amir73il@gmail.com>
>> >
>> > So I agree something like this is desirable but I'm uneasy about using
>> > {IN|FAN}_Q_OVERFLOW for this. Firstly, it is userspace visible change for
>> > FAN_UNLIMITED_QUEUE queues which could confuse applications as you properly
>> > note. Secondly, the event is similar to queue overflow but not quite the
>> > same (it is not that the application would be too slow in processing
>> > events, it is just that the system is in a problematic state overall). What
>> > are your thoughts on adding a new event flags like FAN_Q_LOSTEVENT or
>> > something like that? Probably the biggest downside there I see is that apps
>> > would have to learn to use it...
>> >
>>
>> Well, I can't say I like FAN_Q_LOSTEVENT, but I can't really think of
>> a better option. I guess apps that want to provide better protection
>> against losing events will have to opt in with a new fanotify_init() flag.
>> OTOH, if apps opts-in for this feature, we can also report Q_OVERFLOW
>> and document that it *is* expected in OOM situation.
>>
>> If we have FAN_Q_LOSTEVENT, we can use it to handle both the case of
>> error to queue event (-ENOMEM) and the case of error on copy event to user
>> (e.g. -ENODEV), which is another case where we silently drop events
>> (in case buffer already contains good events).
>> In latter case, the error would be reported to user on event->fd.
>> In the former case, event->fd will also hold the error, as long as we can only
>> report -ENOMEM from this sort of error, because like overflow event, there
>> should probably be only one event of that sort in the queue.
>>
>> Another option for API name is {IN|FAN}_Q_ERR, which implies that event->fd
>> carries the error. And of course user can get an event with mask
>> FAN_Q_OVERFLOW|FAN_Q_ERR, where event->fd is -ENOMEM or
>> -EOVERFLOW and then there is no ambiguity between different kind of
>> queue overflows.
>
> I like this last option. I.e., userspace can opt in to get more detailed
> error notification. In that case we can report error (I think we can just
> reuse {IN|FAN}_Q_OVERFLOW for that) and store more detailed error
> description in wd/fd. Will you have time to implement something like that
> or should I put it to my todo list?
>

I won't be able to get to that for a while, so better add it to the todo list.

Thanks,
Amir.


end of thread, other threads:[~2017-10-31 17:01 UTC | newest]

Thread overview: 11+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2017-10-19 21:20 [RFC PATCH] fs: fsnotify: account fsnotify metadata to kmemcg Yang Shi
2017-10-20  3:14 ` Amir Goldstein
2017-10-20 21:07   ` Yang Shi
2017-10-22  8:24     ` Amir Goldstein
2017-10-24  4:12       ` Yang Shi
2017-10-24  5:42         ` Amir Goldstein
2017-10-25  0:34           ` Yang Shi
2017-10-31 10:50       ` Jan Kara
2017-10-31 11:51         ` Amir Goldstein
2017-10-31 16:52           ` Jan Kara
2017-10-31 17:01             ` Amir Goldstein
