linux-kernel.vger.kernel.org archive mirror
 help / color / mirror / Atom feed
* [PATCH v2] vmpressure: implement strict mode
@ 2013-06-27  3:17 Luiz Capitulino
  2013-06-27  9:26 ` Michal Hocko
                   ` (2 more replies)
  0 siblings, 3 replies; 34+ messages in thread
From: Luiz Capitulino @ 2013-06-27  3:17 UTC (permalink / raw)
  To: linux-mm; +Cc: linux-kernel, mhocko, minchan, anton, akpm, kmpark, hyunhee.kim

Currently, an eventfd is notified for the level it's registered for
_plus_ higher levels.

This is a problem if an application wants to implement different
actions for different levels. For example, an application might want
to release 10% of its cache on level low, 50% on medium and 100% on
critical. To do this, an application has to register a different
eventfd for each pressure level. However, fd low is always going to
be notified and and all fds are going to be notified on level critical.

Strict mode solves this problem by strictly notifiying an eventfd
for the pressure level it registered for. This new mode is optional,
by default we still notify eventfds on higher levels too.

Signed-off-by: Luiz Capitulino <lcapitulino@redhat.com>
---

o v2

 - Improve documentation
 - Use a bit to store mode instead of a bool
 - Minor changelog changes

 Documentation/cgroups/memory.txt | 26 ++++++++++++++++++++++----
 mm/vmpressure.c                  | 26 ++++++++++++++++++++++++--
 2 files changed, 46 insertions(+), 6 deletions(-)

diff --git a/Documentation/cgroups/memory.txt b/Documentation/cgroups/memory.txt
index ddf4f93..412872b 100644
--- a/Documentation/cgroups/memory.txt
+++ b/Documentation/cgroups/memory.txt
@@ -791,6 +791,22 @@ way to trigger. Applications should do whatever they can to help the
 system. It might be too late to consult with vmstat or any other
 statistics, so it's advisable to take an immediate action.
 
+Applications can also choose between two notification modes when
+registering an eventfd for memory pressure events:
+
+When in "non-strict" mode, an eventfd is notified for the specific level
+it's registered for and higher levels. For example, an eventfd registered
+for low level is also going to be notified on medium and critical levels.
+This mode makes sense for applications interested on monitoring reclaim
+activity or implementing simple load-balacing logic. The non-strict mode
+is the default notification mode.
+
+When in "strict" mode, an eventfd is strictly notified for the pressure
+level it's registered for. For example, an eventfd registered for the low
+level event is not going to be notified when memory pressure gets into
+medium or critical levels. This allows for more complex logic based on
+the actual pressure level the system is experiencing.
+
 The events are propagated upward until the event is handled, i.e. the
 events are not pass-through. Here is what this means: for example you have
 three cgroups: A->B->C. Now you set up an event listener on cgroups A, B
@@ -807,12 +823,14 @@ register a notification, an application must:
 
 - create an eventfd using eventfd(2);
 - open memory.pressure_level;
-- write string like "<event_fd> <fd of memory.pressure_level> <level>"
+- write string like "<event_fd> <fd of memory.pressure_level> <level> [strict]"
   to cgroup.event_control.
 
-Application will be notified through eventfd when memory pressure is at
-the specific level (or higher). Read/write operations to
-memory.pressure_level are no implemented.
+Applications will be notified through eventfd when memory pressure is at
+the specific level or higher. If strict is passed, then applications
+will only be notified when memory pressure reaches the specified level.
+
+Read/write operations to memory.pressure_level are no implemented.
 
 Test:
 
diff --git a/mm/vmpressure.c b/mm/vmpressure.c
index 736a601..ba5c17e 100644
--- a/mm/vmpressure.c
+++ b/mm/vmpressure.c
@@ -138,8 +138,16 @@ struct vmpressure_event {
 	struct eventfd_ctx *efd;
 	enum vmpressure_levels level;
 	struct list_head node;
+	unsigned int mode;
 };
 
+#define VMPRESSURE_MODE_STRICT 1
+
+static inline bool vmpressure_mode_is_strict(const struct vmpressure_event *ev)
+{
+	return ev->mode & VMPRESSURE_MODE_STRICT;
+}
+
 static bool vmpressure_event(struct vmpressure *vmpr,
 			     unsigned long scanned, unsigned long reclaimed)
 {
@@ -153,6 +161,9 @@ static bool vmpressure_event(struct vmpressure *vmpr,
 
 	list_for_each_entry(ev, &vmpr->events, node) {
 		if (level >= ev->level) {
+			/* strict mode ensures level == ev->level */
+			if (vmpressure_mode_is_strict(ev) && level != ev->level)
+				continue;
 			eventfd_signal(ev->efd, 1);
 			signalled = true;
 		}
@@ -292,7 +303,7 @@ void vmpressure_prio(gfp_t gfp, struct mem_cgroup *memcg, int prio)
  * infrastructure, so that the notifications will be delivered to the
  * @eventfd. The @args parameter is a string that denotes pressure level
  * threshold (one of vmpressure_str_levels, i.e. "low", "medium", or
- * "critical").
+ * "critical") and optionally a different operating mode (i.e. "strict")
  *
  * This function should not be used directly, just pass it to (struct
  * cftype).register_event, and then cgroup core will handle everything by
@@ -303,22 +314,33 @@ int vmpressure_register_event(struct cgroup *cg, struct cftype *cft,
 {
 	struct vmpressure *vmpr = cg_to_vmpressure(cg);
 	struct vmpressure_event *ev;
+	unsigned int mode = 0;
+	const char *p;
 	int level;
 
 	for (level = 0; level < VMPRESSURE_NUM_LEVELS; level++) {
-		if (!strcmp(vmpressure_str_levels[level], args))
+		p = vmpressure_str_levels[level];
+		if (!strncmp(p, args, strlen(p)))
 			break;
 	}
 
 	if (level >= VMPRESSURE_NUM_LEVELS)
 		return -EINVAL;
 
+	p = strchr(args, ' ');
+	if (p) {
+		if (strncmp(++p, "strict", 6))
+			return -EINVAL;
+		mode |= VMPRESSURE_MODE_STRICT;
+	}
+
 	ev = kzalloc(sizeof(*ev), GFP_KERNEL);
 	if (!ev)
 		return -ENOMEM;
 
 	ev->efd = eventfd;
 	ev->level = level;
+	ev->mode = mode;
 
 	mutex_lock(&vmpr->events_lock);
 	list_add(&ev->node, &vmpr->events);
-- 
1.8.1.4


^ permalink raw reply related	[flat|nested] 34+ messages in thread

* Re: [PATCH v2] vmpressure: implement strict mode
  2013-06-27  3:17 [PATCH v2] vmpressure: implement strict mode Luiz Capitulino
@ 2013-06-27  9:26 ` Michal Hocko
  2013-06-27 13:34   ` Luiz Capitulino
  2013-06-27 15:53   ` Minchan Kim
  2013-06-27 15:44 ` Minchan Kim
  2013-06-27 22:02 ` Andrew Morton
  2 siblings, 2 replies; 34+ messages in thread
From: Michal Hocko @ 2013-06-27  9:26 UTC (permalink / raw)
  To: Luiz Capitulino
  Cc: linux-mm, linux-kernel, minchan, anton, akpm, kmpark, hyunhee.kim

On Wed 26-06-13 23:17:12, Luiz Capitulino wrote:
> Currently, an eventfd is notified for the level it's registered for
> _plus_ higher levels.
> 
> This is a problem if an application wants to implement different
> actions for different levels. For example, an application might want
> to release 10% of its cache on level low, 50% on medium and 100% on
> critical. To do this, an application has to register a different
> eventfd for each pressure level. However, fd low is always going to
> be notified and and all fds are going to be notified on level critical.
> 
> Strict mode solves this problem by strictly notifiying an eventfd
> for the pressure level it registered for. This new mode is optional,
> by default we still notify eventfds on higher levels too.
> 
> Signed-off-by: Luiz Capitulino <lcapitulino@redhat.com>

Two nits bellow but it looks good in general to me. You can add my
Reviewed-by: Michal Hocko <mhocko@suse.cz>

I still think that edge triggering makes some sense but that one might
be rebased on top of this patch. We should still figure out whether the
edge triggering is the right approach for the use case Hyunhee Kim wants
it for so the strict mode should go first IMO.

> ---
> 
> o v2
> 
>  - Improve documentation
>  - Use a bit to store mode instead of a bool
>  - Minor changelog changes
> 
>  Documentation/cgroups/memory.txt | 26 ++++++++++++++++++++++----
>  mm/vmpressure.c                  | 26 ++++++++++++++++++++++++--
>  2 files changed, 46 insertions(+), 6 deletions(-)
> 
> diff --git a/Documentation/cgroups/memory.txt b/Documentation/cgroups/memory.txt
> index ddf4f93..412872b 100644
> --- a/Documentation/cgroups/memory.txt
> +++ b/Documentation/cgroups/memory.txt
> @@ -791,6 +791,22 @@ way to trigger. Applications should do whatever they can to help the
>  system. It might be too late to consult with vmstat or any other
>  statistics, so it's advisable to take an immediate action.
>  
> +Applications can also choose between two notification modes when
> +registering an eventfd for memory pressure events:
> +
> +When in "non-strict" mode, an eventfd is notified for the specific level
> +it's registered for and higher levels. For example, an eventfd registered
> +for low level is also going to be notified on medium and critical levels.
> +This mode makes sense for applications interested on monitoring reclaim
> +activity or implementing simple load-balacing logic. The non-strict mode
> +is the default notification mode.
> +
> +When in "strict" mode, an eventfd is strictly notified for the pressure
> +level it's registered for. For example, an eventfd registered for the low
> +level event is not going to be notified when memory pressure gets into
> +medium or critical levels. This allows for more complex logic based on
> +the actual pressure level the system is experiencing.

It would be also fair to mention that there is no guarantee that lower
levels are signaled before higher so nobody should rely on seeing LOW
before MEDIUM or CRITICAL.

> +
>  The events are propagated upward until the event is handled, i.e. the
>  events are not pass-through. Here is what this means: for example you have
>  three cgroups: A->B->C. Now you set up an event listener on cgroups A, B
> @@ -807,12 +823,14 @@ register a notification, an application must:
>  
>  - create an eventfd using eventfd(2);
>  - open memory.pressure_level;
> -- write string like "<event_fd> <fd of memory.pressure_level> <level>"
> +- write string like "<event_fd> <fd of memory.pressure_level> <level> [strict]"
>    to cgroup.event_control.
>  
> -Application will be notified through eventfd when memory pressure is at
> -the specific level (or higher). Read/write operations to
> -memory.pressure_level are no implemented.
> +Applications will be notified through eventfd when memory pressure is at
> +the specific level or higher. If strict is passed, then applications
> +will only be notified when memory pressure reaches the specified level.
> +
> +Read/write operations to memory.pressure_level are no implemented.
>  
>  Test:
>  
> diff --git a/mm/vmpressure.c b/mm/vmpressure.c
> index 736a601..ba5c17e 100644
> --- a/mm/vmpressure.c
> +++ b/mm/vmpressure.c
> @@ -138,8 +138,16 @@ struct vmpressure_event {
>  	struct eventfd_ctx *efd;
>  	enum vmpressure_levels level;
>  	struct list_head node;
> +	unsigned int mode;

You would fill up a hole between level and node if you move it up on
64b. Doesn't matter much but why not do it...

-- 
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [PATCH v2] vmpressure: implement strict mode
  2013-06-27  9:26 ` Michal Hocko
@ 2013-06-27 13:34   ` Luiz Capitulino
  2013-06-27 14:59     ` Michal Hocko
  2013-06-27 15:53   ` Minchan Kim
  1 sibling, 1 reply; 34+ messages in thread
From: Luiz Capitulino @ 2013-06-27 13:34 UTC (permalink / raw)
  To: Michal Hocko
  Cc: linux-mm, linux-kernel, minchan, anton, akpm, kmpark, hyunhee.kim

On Thu, 27 Jun 2013 11:26:16 +0200
Michal Hocko <mhocko@suse.cz> wrote:

> On Wed 26-06-13 23:17:12, Luiz Capitulino wrote:
> > Currently, an eventfd is notified for the level it's registered for
> > _plus_ higher levels.
> > 
> > This is a problem if an application wants to implement different
> > actions for different levels. For example, an application might want
> > to release 10% of its cache on level low, 50% on medium and 100% on
> > critical. To do this, an application has to register a different
> > eventfd for each pressure level. However, fd low is always going to
> > be notified and and all fds are going to be notified on level critical.
> > 
> > Strict mode solves this problem by strictly notifiying an eventfd
> > for the pressure level it registered for. This new mode is optional,
> > by default we still notify eventfds on higher levels too.
> > 
> > Signed-off-by: Luiz Capitulino <lcapitulino@redhat.com>
> 
> Two nits bellow but it looks good in general to me. You can add my
> Reviewed-by: Michal Hocko <mhocko@suse.cz>

Thanks.

> I still think that edge triggering makes some sense but that one might
> be rebased on top of this patch. We should still figure out whether the
> edge triggering is the right approach for the use case Hyunhee Kim wants
> it for so the strict mode should go first IMO.

I also think that edge triggering makes sense.

> > ---
> > 
> > o v2
> > 
> >  - Improve documentation
> >  - Use a bit to store mode instead of a bool
> >  - Minor changelog changes
> > 
> >  Documentation/cgroups/memory.txt | 26 ++++++++++++++++++++++----
> >  mm/vmpressure.c                  | 26 ++++++++++++++++++++++++--
> >  2 files changed, 46 insertions(+), 6 deletions(-)
> > 
> > diff --git a/Documentation/cgroups/memory.txt b/Documentation/cgroups/memory.txt
> > index ddf4f93..412872b 100644
> > --- a/Documentation/cgroups/memory.txt
> > +++ b/Documentation/cgroups/memory.txt
> > @@ -791,6 +791,22 @@ way to trigger. Applications should do whatever they can to help the
> >  system. It might be too late to consult with vmstat or any other
> >  statistics, so it's advisable to take an immediate action.
> >  
> > +Applications can also choose between two notification modes when
> > +registering an eventfd for memory pressure events:
> > +
> > +When in "non-strict" mode, an eventfd is notified for the specific level
> > +it's registered for and higher levels. For example, an eventfd registered
> > +for low level is also going to be notified on medium and critical levels.
> > +This mode makes sense for applications interested on monitoring reclaim
> > +activity or implementing simple load-balacing logic. The non-strict mode
> > +is the default notification mode.
> > +
> > +When in "strict" mode, an eventfd is strictly notified for the pressure
> > +level it's registered for. For example, an eventfd registered for the low
> > +level event is not going to be notified when memory pressure gets into
> > +medium or critical levels. This allows for more complex logic based on
> > +the actual pressure level the system is experiencing.
> 
> It would be also fair to mention that there is no guarantee that lower
> levels are signaled before higher so nobody should rely on seeing LOW
> before MEDIUM or CRITICAL.

I think this is implied. Actually, as an user of this interface I didn't
expect this to happen until I read the code.

> 
> > +
> >  The events are propagated upward until the event is handled, i.e. the
> >  events are not pass-through. Here is what this means: for example you have
> >  three cgroups: A->B->C. Now you set up an event listener on cgroups A, B
> > @@ -807,12 +823,14 @@ register a notification, an application must:
> >  
> >  - create an eventfd using eventfd(2);
> >  - open memory.pressure_level;
> > -- write string like "<event_fd> <fd of memory.pressure_level> <level>"
> > +- write string like "<event_fd> <fd of memory.pressure_level> <level> [strict]"
> >    to cgroup.event_control.
> >  
> > -Application will be notified through eventfd when memory pressure is at
> > -the specific level (or higher). Read/write operations to
> > -memory.pressure_level are no implemented.
> > +Applications will be notified through eventfd when memory pressure is at
> > +the specific level or higher. If strict is passed, then applications
> > +will only be notified when memory pressure reaches the specified level.
> > +
> > +Read/write operations to memory.pressure_level are no implemented.
> >  
> >  Test:
> >  
> > diff --git a/mm/vmpressure.c b/mm/vmpressure.c
> > index 736a601..ba5c17e 100644
> > --- a/mm/vmpressure.c
> > +++ b/mm/vmpressure.c
> > @@ -138,8 +138,16 @@ struct vmpressure_event {
> >  	struct eventfd_ctx *efd;
> >  	enum vmpressure_levels level;
> >  	struct list_head node;
> > +	unsigned int mode;
> 
> You would fill up a hole between level and node if you move it up on
> 64b. Doesn't matter much but why not do it...

You want me to respin?

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [PATCH v2] vmpressure: implement strict mode
  2013-06-27 13:34   ` Luiz Capitulino
@ 2013-06-27 14:59     ` Michal Hocko
  0 siblings, 0 replies; 34+ messages in thread
From: Michal Hocko @ 2013-06-27 14:59 UTC (permalink / raw)
  To: Luiz Capitulino
  Cc: linux-mm, linux-kernel, minchan, anton, akpm, kmpark, hyunhee.kim

On Thu 27-06-13 09:34:07, Luiz Capitulino wrote:
> On Thu, 27 Jun 2013 11:26:16 +0200
> Michal Hocko <mhocko@suse.cz> wrote:
> > On Wed 26-06-13 23:17:12, Luiz Capitulino wrote:
[...]
> > > +Applications can also choose between two notification modes when
> > > +registering an eventfd for memory pressure events:
> > > +
> > > +When in "non-strict" mode, an eventfd is notified for the specific level
> > > +it's registered for and higher levels. For example, an eventfd registered
> > > +for low level is also going to be notified on medium and critical levels.
> > > +This mode makes sense for applications interested on monitoring reclaim
> > > +activity or implementing simple load-balacing logic. The non-strict mode
> > > +is the default notification mode.
> > > +
> > > +When in "strict" mode, an eventfd is strictly notified for the pressure
> > > +level it's registered for. For example, an eventfd registered for the low
> > > +level event is not going to be notified when memory pressure gets into
> > > +medium or critical levels. This allows for more complex logic based on
> > > +the actual pressure level the system is experiencing.
> > 
> > It would be also fair to mention that there is no guarantee that lower
> > levels are signaled before higher so nobody should rely on seeing LOW
> > before MEDIUM or CRITICAL.
> 
> I think this is implied. Actually, as an user of this interface I didn't
> expect this to happen until I read the code.

That is a difference between the two modes, so let's better be _explicit_
about that. There is some confusion about that, considering the last
discussions. And I am not that surprised because critical memory
pressure should have passed low and medium levels first is not entirely
unreasonable expectation. But this is _not_ how this interface works or
will guarantee to work in future.
 
[...]
> > > diff --git a/mm/vmpressure.c b/mm/vmpressure.c
> > > index 736a601..ba5c17e 100644
> > > --- a/mm/vmpressure.c
> > > +++ b/mm/vmpressure.c
> > > @@ -138,8 +138,16 @@ struct vmpressure_event {
> > >  	struct eventfd_ctx *efd;
> > >  	enum vmpressure_levels level;
> > >  	struct list_head node;
> > > +	unsigned int mode;
> > 
> > You would fill up a hole between level and node if you move it up on
> > 64b. Doesn't matter much but why not do it...
> 
> You want me to respin?

I will leave it to you. This is not anything earth shattering as we
won't have zillions of this structures so saving few bytes doesn't help
us that much and 32b doesn't have to hole.
-- 
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [PATCH v2] vmpressure: implement strict mode
  2013-06-27  3:17 [PATCH v2] vmpressure: implement strict mode Luiz Capitulino
  2013-06-27  9:26 ` Michal Hocko
@ 2013-06-27 15:44 ` Minchan Kim
  2013-06-27 22:02 ` Andrew Morton
  2 siblings, 0 replies; 34+ messages in thread
From: Minchan Kim @ 2013-06-27 15:44 UTC (permalink / raw)
  To: Luiz Capitulino
  Cc: linux-mm, linux-kernel, mhocko, anton, akpm, kmpark, hyunhee.kim

On Wed, Jun 26, 2013 at 11:17:12PM -0400, Luiz Capitulino wrote:
> Currently, an eventfd is notified for the level it's registered for
> _plus_ higher levels.
> 
> This is a problem if an application wants to implement different
> actions for different levels. For example, an application might want
> to release 10% of its cache on level low, 50% on medium and 100% on
> critical. To do this, an application has to register a different
> eventfd for each pressure level. However, fd low is always going to
> be notified and and all fds are going to be notified on level critical.
> 
> Strict mode solves this problem by strictly notifiying an eventfd
> for the pressure level it registered for. This new mode is optional,
> by default we still notify eventfds on higher levels too.
> 
> Signed-off-by: Luiz Capitulino <lcapitulino@redhat.com>
Acked-by: Minchan Kim <minchan@kernel.org>

If you will update this patch without major big change, you can keep
earlier Acked-by and Reviewe-by.

Thanks!


^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [PATCH v2] vmpressure: implement strict mode
  2013-06-27  9:26 ` Michal Hocko
  2013-06-27 13:34   ` Luiz Capitulino
@ 2013-06-27 15:53   ` Minchan Kim
  2013-06-27 17:42     ` Michal Hocko
  1 sibling, 1 reply; 34+ messages in thread
From: Minchan Kim @ 2013-06-27 15:53 UTC (permalink / raw)
  To: Michal Hocko
  Cc: Luiz Capitulino, linux-mm, linux-kernel, anton, akpm, kmpark,
	hyunhee.kim

On Thu, Jun 27, 2013 at 11:26:16AM +0200, Michal Hocko wrote:
> On Wed 26-06-13 23:17:12, Luiz Capitulino wrote:
> > Currently, an eventfd is notified for the level it's registered for
> > _plus_ higher levels.
> > 
> > This is a problem if an application wants to implement different
> > actions for different levels. For example, an application might want
> > to release 10% of its cache on level low, 50% on medium and 100% on
> > critical. To do this, an application has to register a different
> > eventfd for each pressure level. However, fd low is always going to
> > be notified and and all fds are going to be notified on level critical.
> > 
> > Strict mode solves this problem by strictly notifiying an eventfd
> > for the pressure level it registered for. This new mode is optional,
> > by default we still notify eventfds on higher levels too.
> > 
> > Signed-off-by: Luiz Capitulino <lcapitulino@redhat.com>
> 
> Two nits bellow but it looks good in general to me. You can add my
> Reviewed-by: Michal Hocko <mhocko@suse.cz>
> 
> I still think that edge triggering makes some sense but that one might
> be rebased on top of this patch. We should still figure out whether the
> edge triggering is the right approach for the use case Hyunhee Kim wants
> it for so the strict mode should go first IMO.

For me, edge trigger as avoiding excessive event sending doesn't make sense
at all so I'd like to merge strict mode firstly, too.

> 
> > ---
> > 
> > o v2
> > 
> >  - Improve documentation
> >  - Use a bit to store mode instead of a bool
> >  - Minor changelog changes
> > 
> >  Documentation/cgroups/memory.txt | 26 ++++++++++++++++++++++----
> >  mm/vmpressure.c                  | 26 ++++++++++++++++++++++++--
> >  2 files changed, 46 insertions(+), 6 deletions(-)
> > 
> > diff --git a/Documentation/cgroups/memory.txt b/Documentation/cgroups/memory.txt
> > index ddf4f93..412872b 100644
> > --- a/Documentation/cgroups/memory.txt
> > +++ b/Documentation/cgroups/memory.txt
> > @@ -791,6 +791,22 @@ way to trigger. Applications should do whatever they can to help the
> >  system. It might be too late to consult with vmstat or any other
> >  statistics, so it's advisable to take an immediate action.
> >  
> > +Applications can also choose between two notification modes when
> > +registering an eventfd for memory pressure events:
> > +
> > +When in "non-strict" mode, an eventfd is notified for the specific level
> > +it's registered for and higher levels. For example, an eventfd registered
> > +for low level is also going to be notified on medium and critical levels.
> > +This mode makes sense for applications interested on monitoring reclaim
> > +activity or implementing simple load-balacing logic. The non-strict mode
> > +is the default notification mode.
> > +
> > +When in "strict" mode, an eventfd is strictly notified for the pressure
> > +level it's registered for. For example, an eventfd registered for the low
> > +level event is not going to be notified when memory pressure gets into
> > +medium or critical levels. This allows for more complex logic based on
> > +the actual pressure level the system is experiencing.
> 
> It would be also fair to mention that there is no guarantee that lower
> levels are signaled before higher so nobody should rely on seeing LOW
> before MEDIUM or CRITICAL.

You mean "non-strict" mode guarantees it?

> 
> > +
> >  The events are propagated upward until the event is handled, i.e. the
> >  events are not pass-through. Here is what this means: for example you have
> >  three cgroups: A->B->C. Now you set up an event listener on cgroups A, B
> > @@ -807,12 +823,14 @@ register a notification, an application must:
> >  
> >  - create an eventfd using eventfd(2);
> >  - open memory.pressure_level;
> > -- write string like "<event_fd> <fd of memory.pressure_level> <level>"
> > +- write string like "<event_fd> <fd of memory.pressure_level> <level> [strict]"
> >    to cgroup.event_control.
> >  
> > -Application will be notified through eventfd when memory pressure is at
> > -the specific level (or higher). Read/write operations to
> > -memory.pressure_level are no implemented.
> > +Applications will be notified through eventfd when memory pressure is at
> > +the specific level or higher. If strict is passed, then applications
> > +will only be notified when memory pressure reaches the specified level.
> > +
> > +Read/write operations to memory.pressure_level are no implemented.
> >  
> >  Test:
> >  
> > diff --git a/mm/vmpressure.c b/mm/vmpressure.c
> > index 736a601..ba5c17e 100644
> > --- a/mm/vmpressure.c
> > +++ b/mm/vmpressure.c
> > @@ -138,8 +138,16 @@ struct vmpressure_event {
> >  	struct eventfd_ctx *efd;
> >  	enum vmpressure_levels level;
> >  	struct list_head node;
> > +	unsigned int mode;
> 
> You would fill up a hole between level and node if you move it up on
> 64b. Doesn't matter much but why not do it...
> 
> -- 
> Michal Hocko
> SUSE Labs
> 
> --
> To unsubscribe, send a message with 'unsubscribe linux-mm' in
> the body to majordomo@kvack.org.  For more info on Linux MM,
> see: http://www.linux-mm.org/ .
> Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [PATCH v2] vmpressure: implement strict mode
  2013-06-27 15:53   ` Minchan Kim
@ 2013-06-27 17:42     ` Michal Hocko
  0 siblings, 0 replies; 34+ messages in thread
From: Michal Hocko @ 2013-06-27 17:42 UTC (permalink / raw)
  To: Minchan Kim
  Cc: Luiz Capitulino, linux-mm, linux-kernel, anton, akpm, kmpark,
	hyunhee.kim

On Fri 28-06-13 00:53:57, Minchan Kim wrote:
> On Thu, Jun 27, 2013 at 11:26:16AM +0200, Michal Hocko wrote:
[...]
> > I still think that edge triggering makes some sense but that one might
> > be rebased on top of this patch. We should still figure out whether the
> > edge triggering is the right approach for the use case Hyunhee Kim wants
> > it for so the strict mode should go first IMO.
> 
> For me, edge trigger as avoiding excessive event sending doesn't make sense
> at all so I'd like to merge strict mode firstly, too.

Yes, it makes sense for once-per-level actions, not as a workaround.
[...]
-- 
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [PATCH v2] vmpressure: implement strict mode
  2013-06-27  3:17 [PATCH v2] vmpressure: implement strict mode Luiz Capitulino
  2013-06-27  9:26 ` Michal Hocko
  2013-06-27 15:44 ` Minchan Kim
@ 2013-06-27 22:02 ` Andrew Morton
  2013-06-28  0:02   ` Minchan Kim
  2 siblings, 1 reply; 34+ messages in thread
From: Andrew Morton @ 2013-06-27 22:02 UTC (permalink / raw)
  To: Luiz Capitulino
  Cc: linux-mm, linux-kernel, mhocko, minchan, anton, kmpark, hyunhee.kim

On Wed, 26 Jun 2013 23:17:12 -0400 Luiz Capitulino <lcapitulino@redhat.com> wrote:

> Currently, an eventfd is notified for the level it's registered for
> _plus_ higher levels.
> 
> This is a problem if an application wants to implement different
> actions for different levels. For example, an application might want
> to release 10% of its cache on level low, 50% on medium and 100% on
> critical. To do this, an application has to register a different
> eventfd for each pressure level. However, fd low is always going to
> be notified and and all fds are going to be notified on level critical.
> 
> Strict mode solves this problem by strictly notifiying an eventfd
> for the pressure level it registered for. This new mode is optional,
> by default we still notify eventfds on higher levels too.
> 

It didn't take long for this simple interface to start getting ugly :(
And having the fd operate in different modes is ugly.

Can we instead pass the level in the event payload?


^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [PATCH v2] vmpressure: implement strict mode
  2013-06-27 22:02 ` Andrew Morton
@ 2013-06-28  0:02   ` Minchan Kim
  2013-06-28  0:34     ` Andrew Morton
  0 siblings, 1 reply; 34+ messages in thread
From: Minchan Kim @ 2013-06-28  0:02 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Luiz Capitulino, linux-mm, linux-kernel, mhocko, anton, kmpark,
	hyunhee.kim

Hi Andrew,

On Thu, Jun 27, 2013 at 03:02:31PM -0700, Andrew Morton wrote:
> On Wed, 26 Jun 2013 23:17:12 -0400 Luiz Capitulino <lcapitulino@redhat.com> wrote:
> 
> > Currently, an eventfd is notified for the level it's registered for
> > _plus_ higher levels.
> > 
> > This is a problem if an application wants to implement different
> > actions for different levels. For example, an application might want
> > to release 10% of its cache on level low, 50% on medium and 100% on
> > critical. To do this, an application has to register a different
> > eventfd for each pressure level. However, fd low is always going to
> > be notified and and all fds are going to be notified on level critical.
> > 
> > Strict mode solves this problem by strictly notifiying an eventfd
> > for the pressure level it registered for. This new mode is optional,
> > by default we still notify eventfds on higher levels too.
> > 
> 
> It didn't take long for this simple interface to start getting ugly :(
> And having the fd operate in different modes is ugly.
> 
> Can we instead pass the level in the event payload?

You mean userland have to look the result of read(2) to confirm what
current level is and if it's no interest for us, we don't do any reaction.
If so, userland daemon would receive lots of events which are no interest.

> 
> --
> To unsubscribe, send a message with 'unsubscribe linux-mm' in
> the body to majordomo@kvack.org.  For more info on Linux MM,
> see: http://www.linux-mm.org/ .
> Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

-- 
Kind regards,
Minchan Kim

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [PATCH v2] vmpressure: implement strict mode
  2013-06-28  0:02   ` Minchan Kim
@ 2013-06-28  0:34     ` Andrew Morton
  2013-06-28  0:58       ` Anton Vorontsov
  0 siblings, 1 reply; 34+ messages in thread
From: Andrew Morton @ 2013-06-28  0:34 UTC (permalink / raw)
  To: Minchan Kim
  Cc: Luiz Capitulino, linux-mm, linux-kernel, mhocko, anton, kmpark,
	hyunhee.kim

On Fri, 28 Jun 2013 09:02:01 +0900 Minchan Kim <minchan@kernel.org> wrote:

> Hi Andrew,
> 
> On Thu, Jun 27, 2013 at 03:02:31PM -0700, Andrew Morton wrote:
> > On Wed, 26 Jun 2013 23:17:12 -0400 Luiz Capitulino <lcapitulino@redhat.com> wrote:
> > 
> > > Currently, an eventfd is notified for the level it's registered for
> > > _plus_ higher levels.
> > > 
> > > This is a problem if an application wants to implement different
> > > actions for different levels. For example, an application might want
> > > to release 10% of its cache on level low, 50% on medium and 100% on
> > > critical. To do this, an application has to register a different
> > > eventfd for each pressure level. However, fd low is always going to
> > > be notified and and all fds are going to be notified on level critical.
> > > 
> > > Strict mode solves this problem by strictly notifiying an eventfd
> > > for the pressure level it registered for. This new mode is optional,
> > > by default we still notify eventfds on higher levels too.
> > > 
> > 
> > It didn't take long for this simple interface to start getting ugly :(
> > And having the fd operate in different modes is ugly.
> > 
> > Can we instead pass the level in the event payload?
> 
> You mean userland have to look the result of read(2) to confirm what
> current level is and if it's no interest for us, we don't do any reaction.

Something like that.  It's flexible, simple, keeps policy in userspace.

> If so, userland daemon would receive lots of events which are no interest.

"lots"?  If vmpressure is generating events at such a high frequency that
this matters then it's already busted?


^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [PATCH v2] vmpressure: implement strict mode
  2013-06-28  0:34     ` Andrew Morton
@ 2013-06-28  0:58       ` Anton Vorontsov
  2013-06-28  1:13         ` Andrew Morton
  0 siblings, 1 reply; 34+ messages in thread
From: Anton Vorontsov @ 2013-06-28  0:58 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Minchan Kim, Luiz Capitulino, linux-mm, linux-kernel, mhocko,
	kmpark, hyunhee.kim

On Thu, Jun 27, 2013 at 05:34:33PM -0700, Andrew Morton wrote:
> > If so, userland daemon would receive lots of events which are no interest.
> 
> "lots"?  If vmpressure is generating events at such a high frequency that
> this matters then it's already busted?

Current frequency is 1/(2MB). Suppose we ended up scanning the whole
memory on a 2GB host, this will give us 1024 hits. Doesn't feel too much*
to me... But for what it worth, I am against adding read() to the
interface -- just because we can avoid the unnecessary switch into the
kernel.

* For bigger hosts we should increase the window, as we do for the vmstat. 

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [PATCH v2] vmpressure: implement strict mode
  2013-06-28  0:58       ` Anton Vorontsov
@ 2013-06-28  1:13         ` Andrew Morton
  2013-06-28  4:34           ` Anton Vorontsov
  2013-06-28  9:04           ` Minchan Kim
  0 siblings, 2 replies; 34+ messages in thread
From: Andrew Morton @ 2013-06-28  1:13 UTC (permalink / raw)
  To: Anton Vorontsov
  Cc: Minchan Kim, Luiz Capitulino, linux-mm, linux-kernel, mhocko,
	kmpark, hyunhee.kim

On Thu, 27 Jun 2013 17:58:53 -0700 Anton Vorontsov <anton@enomsg.org> wrote:

> On Thu, Jun 27, 2013 at 05:34:33PM -0700, Andrew Morton wrote:
> > > If so, userland daemon would receive lots of events which are no interest.
> > 
> > "lots"?  If vmpressure is generating events at such a high frequency that
> > this matters then it's already busted?
> 
> Current frequency is 1/(2MB). Suppose we ended up scanning the whole
> memory on a 2GB host, this will give us 1024 hits. Doesn't feel too much*
> to me... But for what it worth, I am against adding read() to the
> interface -- just because we can avoid the unnecessary switch into the
> kernel.

What was it they said about premature optimization?

I think I'd rather do nothing than add a mode hack (already!).

The information Luiz wants is already available with the existing
interface, so why not just use it until there is a real demonstrated
problem?

But all this does point at the fact that the chosen interface was not a
good one.  And it's happening so soon :( A far better interface would
be to do away with this level filtering stuff in the kernel altogether.
Register for events and you get all the events, simple.  Or require that
userspace register a separate time for each level, or whatever.

Something clean and simple which leaves the policy in userspace,
please.  Not this.

(Why didn't vmpressure use netlink, btw?  Then we'd have decent payload
delivery)

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [PATCH v2] vmpressure: implement strict mode
  2013-06-28  1:13         ` Andrew Morton
@ 2013-06-28  4:34           ` Anton Vorontsov
  2013-06-28  5:07             ` Anton Vorontsov
                               ` (2 more replies)
  2013-06-28  9:04           ` Minchan Kim
  1 sibling, 3 replies; 34+ messages in thread
From: Anton Vorontsov @ 2013-06-28  4:34 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Minchan Kim, Luiz Capitulino, linux-mm, linux-kernel, mhocko,
	kmpark, hyunhee.kim

On Thu, Jun 27, 2013 at 06:13:53PM -0700, Andrew Morton wrote:
> On Thu, 27 Jun 2013 17:58:53 -0700 Anton Vorontsov <anton@enomsg.org> wrote:
> > Current frequency is 1/(2MB). Suppose we ended up scanning the whole
> > memory on a 2GB host, this will give us 1024 hits. Doesn't feel too much*
> > to me... But for what it worth, I am against adding read() to the
> > interface -- just because we can avoid the unnecessary switch into the
> > kernel.
> 
> What was it they said about premature optimization?
> 
> I think I'd rather do nothing than add a mode hack (already!).
> 
> The information Luiz wants is already available with the existing
> interface, so why not just use it until there is a real demonstrated
> problem?
> 
> But all this does point at the fact that the chosen interface was not a
> good one.  And it's happening so soon :( A far better interface would
> be to do away with this level filtering stuff in the kernel altogether.

OK, I am convinced that modes might be not necessary, but I see no big
problem in current situation, we can add the strict mode and deprecate the
"filtering" -- basically we'll implement the idea of requiring that
userspace registers a separate fd for each level.

As one of the ways to change the interface, we can do the strict mode by
writing levels in uppercase, and warn_once on lowercase levels, describing
that the old behaviour will go away. Once (if ever) we remove the old
behaviour, the apps trying the old-style lowercase levels will fail
gracefully with EINVAL.

Or we can be honest and admit that we can't be perfect and just add an
explicit versioning to the interface. :)

It might be unfortunate that we did not foresee this and have to change
things that soon, but we did change interfaces in the past for a lot of
sysfs and proc knobs, so it is not something new. Once the vmpressure
feature will get even wider usage exposure, we might realize that we need
to make even more changes...

> (Why didn't vmpressure use netlink, btw?  Then we'd have decent payload
> delivery)

Because we decided to be a part of memcg and thus we used standard cgroups
notifications. And we didn't need the payload (and still don't need it if
we go with the multiple fds interface).

Anton

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [PATCH v2] vmpressure: implement strict mode
  2013-06-28  4:34           ` Anton Vorontsov
@ 2013-06-28  5:07             ` Anton Vorontsov
  2013-06-28 14:00               ` Luiz Capitulino
  2013-06-28  5:24             ` Andrew Morton
  2013-06-28 13:43             ` Luiz Capitulino
  2 siblings, 1 reply; 34+ messages in thread
From: Anton Vorontsov @ 2013-06-28  5:07 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Minchan Kim, Luiz Capitulino, linux-mm, linux-kernel, mhocko,
	kmpark, hyunhee.kim

On Thu, Jun 27, 2013 at 09:34:11PM -0700, Anton Vorontsov wrote:
> ... we can add the strict mode and deprecate the
> "filtering" -- basically we'll implement the idea of requiring that
> userspace registers a separate fd for each level.

Btw, assuming that more levels can be added, there will be a problem:
imagine that an app hooked up onto low, med, crit levels in "strict"
mode... then once we add a new level, the app will start missing the new
level events.

In the old scheme it is not a problem because of the >= condition.

With a proper versioning this won't be a problem for a new scheme too.

Anton

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [PATCH v2] vmpressure: implement strict mode
  2013-06-28  4:34           ` Anton Vorontsov
  2013-06-28  5:07             ` Anton Vorontsov
@ 2013-06-28  5:24             ` Andrew Morton
  2013-06-28 13:43             ` Luiz Capitulino
  2 siblings, 0 replies; 34+ messages in thread
From: Andrew Morton @ 2013-06-28  5:24 UTC (permalink / raw)
  To: Anton Vorontsov
  Cc: Minchan Kim, Luiz Capitulino, linux-mm, linux-kernel, mhocko,
	kmpark, hyunhee.kim

On Thu, 27 Jun 2013 21:34:11 -0700 Anton Vorontsov <anton@enomsg.org> wrote:

> On Thu, Jun 27, 2013 at 06:13:53PM -0700, Andrew Morton wrote:
> > On Thu, 27 Jun 2013 17:58:53 -0700 Anton Vorontsov <anton@enomsg.org> wrote:
> > > Current frequency is 1/(2MB). Suppose we ended up scanning the whole
> > > memory on a 2GB host, this will give us 1024 hits. Doesn't feel too much*
> > > to me... But for what it worth, I am against adding read() to the
> > > interface -- just because we can avoid the unnecessary switch into the
> > > kernel.
> > 
> > What was it they said about premature optimization?
> > 
> > I think I'd rather do nothing than add a mode hack (already!).
> > 
> > The information Luiz wants is already available with the existing
> > interface, so why not just use it until there is a real demonstrated
> > problem?
> > 
> > But all this does point at the fact that the chosen interface was not a
> > good one.  And it's happening so soon :( A far better interface would
> > be to do away with this level filtering stuff in the kernel altogether.
> 
> OK, I am convinced that modes might be not necessary, but I see no big
> problem in current situation, we can add the strict mode and deprecate the
> "filtering" -- basically we'll implement the idea of requiring that
> userspace registers a separate fd for each level.
> 
> As one of the ways to change the interface, we can do the strict mode by
> writing levels in uppercase, and warn_once on lowercase levels, describing
> that the old behaviour will go away.

I do think the feature is too young to be bothered about
back-compatibility things.  We could put a little patch into 3.10
tomorrow which disables the vmpressure feature (just putting a few
"return 0"s in there would suffice), then turn the feature back on in
3.11-rc1.

Another option is to change the interface in 3.11 and say "sorry" if
that causes anyone trouble.  But that's obviously less desirable.


> Once (if ever) we remove the old
> behaviour, the apps trying the old-style lowercase levels will fail
> gracefully with EINVAL.
> 
> Or we can be honest and admit that we can't be perfect and just add an
> explicit versioning to the interface. :)
> 
> It might be unfortunate that we did not foresee this and have to change
> things that soon, but we did change interfaces in the past for a lot of
> sysfs and proc knobs, so it is not something new. Once the vmpressure
> feature will get even wider usage exposure, we might realize that we need
> to make even more changes...

Hopefully not ;) But the interface should be designed with that
possibility in mind, of course.



^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [PATCH v2] vmpressure: implement strict mode
  2013-06-28  1:13         ` Andrew Morton
  2013-06-28  4:34           ` Anton Vorontsov
@ 2013-06-28  9:04           ` Minchan Kim
  1 sibling, 0 replies; 34+ messages in thread
From: Minchan Kim @ 2013-06-28  9:04 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Anton Vorontsov, Luiz Capitulino, linux-mm, linux-kernel, mhocko,
	kmpark, hyunhee.kim

On Thu, Jun 27, 2013 at 06:13:53PM -0700, Andrew Morton wrote:
> On Thu, 27 Jun 2013 17:58:53 -0700 Anton Vorontsov <anton@enomsg.org> wrote:
> 
> > On Thu, Jun 27, 2013 at 05:34:33PM -0700, Andrew Morton wrote:
> > > > If so, userland daemon would receive lots of events which are no interest.
> > > 
> > > "lots"?  If vmpressure is generating events at such a high frequency that
> > > this matters then it's already busted?
> > 
> > Current frequency is 1/(2MB). Suppose we ended up scanning the whole
> > memory on a 2GB host, this will give us 1024 hits. Doesn't feel too much*
> > to me... But for what it worth, I am against adding read() to the
> > interface -- just because we can avoid the unnecessary switch into the
> > kernel.
> 
> What was it they said about premature optimization?
> 
> I think I'd rather do nothing than add a mode hack (already!).
> 
> The information Luiz wants is already available with the existing
> interface, so why not just use it until there is a real demonstrated
> problem?
> 
> But all this does point at the fact that the chosen interface was not a
> good one.  And it's happening so soon :( A far better interface would
> be to do away with this level filtering stuff in the kernel altogether.
> Register for events and you get all the events, simple.  Or require that
> userspace register a separate time for each level, or whatever.
> 
> Something clean and simple which leaves the policy in userspace,
> please.  Not this.

Anton, Michal,

Tend to agree. I have been thought that current vmpressure heuristic
could be much fluctuated with various workloads so that how we could make
it stable with another parameters in future so everyone has satisfactory
result with just common general value of low/medium/critical but
different window size. It's not easy for kernel to handle it, IMO.
So as Andrew says, how about leaving the policy in userspace?

For example, we kernel just can expose linear index(ex, 0~100) using
some algorithm(ex, current vmpressure_win/reclaimed/scan) and userland
could poll it with his time granularity and handle the situation and
reset it.

It's a just simple example without enough considering.
Anyway the point is that isn't it worth to think over userspace policy
and what's indicator kernel could expose to user space?


> 
> (Why didn't vmpressure use netlink, btw?  Then we'd have decent payload
> delivery)
> 
> --
> To unsubscribe, send a message with 'unsubscribe linux-mm' in
> the body to majordomo@kvack.org.  For more info on Linux MM,
> see: http://www.linux-mm.org/ .
> Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

-- 
Kind regards,
Minchan Kim

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [PATCH v2] vmpressure: implement strict mode
  2013-06-28  4:34           ` Anton Vorontsov
  2013-06-28  5:07             ` Anton Vorontsov
  2013-06-28  5:24             ` Andrew Morton
@ 2013-06-28 13:43             ` Luiz Capitulino
  2 siblings, 0 replies; 34+ messages in thread
From: Luiz Capitulino @ 2013-06-28 13:43 UTC (permalink / raw)
  To: Anton Vorontsov
  Cc: Andrew Morton, Minchan Kim, linux-mm, linux-kernel, mhocko,
	kmpark, hyunhee.kim

On Thu, 27 Jun 2013 21:34:11 -0700
Anton Vorontsov <anton@enomsg.org> wrote:

> On Thu, Jun 27, 2013 at 06:13:53PM -0700, Andrew Morton wrote:
> > On Thu, 27 Jun 2013 17:58:53 -0700 Anton Vorontsov <anton@enomsg.org> wrote:
> > > Current frequency is 1/(2MB). Suppose we ended up scanning the whole
> > > memory on a 2GB host, this will give us 1024 hits. Doesn't feel too much*
> > > to me... But for what it worth, I am against adding read() to the
> > > interface -- just because we can avoid the unnecessary switch into the
> > > kernel.
> > 
> > What was it they said about premature optimization?
> > 
> > I think I'd rather do nothing than add a mode hack (already!).
> > 
> > The information Luiz wants is already available with the existing
> > interface, so why not just use it until there is a real demonstrated
> > problem?
> > 
> > But all this does point at the fact that the chosen interface was not a
> > good one.  And it's happening so soon :( A far better interface would
> > be to do away with this level filtering stuff in the kernel altogether.
> 
> OK, I am convinced that modes might be not necessary, but I see no big
> problem in current situation, we can add the strict mode and deprecate the
> "filtering" -- basically we'll implement the idea of requiring that
> userspace registers a separate fd for each level.

Agreed this is a good solution.

> As one of the ways to change the interface, we can do the strict mode by
> writing levels in uppercase, and warn_once on lowercase levels, describing
> that the old behaviour will go away. Once (if ever) we remove the old
> behaviour, the apps trying the old-style lowercase levels will fail
> gracefully with EINVAL.

Why don't we just break it? There's no non-development kernel released
with this interface yet.

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [PATCH v2] vmpressure: implement strict mode
  2013-06-28  5:07             ` Anton Vorontsov
@ 2013-06-28 14:00               ` Luiz Capitulino
  2013-06-28 16:57                 ` Anton Vorontsov
  0 siblings, 1 reply; 34+ messages in thread
From: Luiz Capitulino @ 2013-06-28 14:00 UTC (permalink / raw)
  To: Anton Vorontsov
  Cc: Andrew Morton, Minchan Kim, linux-mm, linux-kernel, mhocko,
	kmpark, hyunhee.kim

On Thu, 27 Jun 2013 22:07:12 -0700
Anton Vorontsov <anton@enomsg.org> wrote:

> On Thu, Jun 27, 2013 at 09:34:11PM -0700, Anton Vorontsov wrote:
> > ... we can add the strict mode and deprecate the
> > "filtering" -- basically we'll implement the idea of requiring that
> > userspace registers a separate fd for each level.
> 
> Btw, assuming that more levels can be added, there will be a problem:
> imagine that an app hooked up onto low, med, crit levels in "strict"
> mode... then once we add a new level, the app will start missing the new
> level events.

That's how it's expected to work, because on strict mode you're notified
for the level you registered for. So apps registering for critical, will
still be notified on critical just like before.

> In the old scheme it is not a problem because of the >= condition.

I think the problem actually lies with the current interface, because
if an app registers for critical and we add a new level after critical
then this app will now be notified on critical *and* the new level. The
app's algorithm might not be prepared to deal with that.

> With a proper versioning this won't be a problem for a new scheme too.

I don't think there's a problem to be solved here. Strict mode does
allow forward compatibility. For good backward compatibility we can make
memory.pressure_level return supported levels on read. This way
applications can check if the level they are interested in exist and
then fallback or exit with a good error message if they don't.

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [PATCH v2] vmpressure: implement strict mode
  2013-06-28 14:00               ` Luiz Capitulino
@ 2013-06-28 16:57                 ` Anton Vorontsov
  2013-06-28 17:09                   ` Anton Vorontsov
  0 siblings, 1 reply; 34+ messages in thread
From: Anton Vorontsov @ 2013-06-28 16:57 UTC (permalink / raw)
  To: Luiz Capitulino
  Cc: Andrew Morton, Minchan Kim, linux-mm, linux-kernel, mhocko,
	kmpark, hyunhee.kim

On Fri, Jun 28, 2013 at 10:00:27AM -0400, Luiz Capitulino wrote:
> On Thu, 27 Jun 2013 22:07:12 -0700
> Anton Vorontsov <anton@enomsg.org> wrote:
> 
> > On Thu, Jun 27, 2013 at 09:34:11PM -0700, Anton Vorontsov wrote:
> > > ... we can add the strict mode and deprecate the
> > > "filtering" -- basically we'll implement the idea of requiring that
> > > userspace registers a separate fd for each level.
> > 
> > Btw, assuming that more levels can be added, there will be a problem:
> > imagine that an app hooked up onto low, med, crit levels in "strict"
> > mode... then once we add a new level, the app will start missing the new
> > level events.
> 
> That's how it's expected to work, because on strict mode you're notified
> for the level you registered for. So apps registering for critical, will
> still be notified on critical just like before.

Suppose you introduce a new level, and the system hits this level. Before,
the app would receive at least some notification for the given memory load
(i.e. one of the old levels), with the new level introduced in the kernel,
the app will receive no events at all. This makes a serious behavioural
change in the app (read: it'll break it).

Anton

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [PATCH v2] vmpressure: implement strict mode
  2013-06-28 16:57                 ` Anton Vorontsov
@ 2013-06-28 17:09                   ` Anton Vorontsov
  2013-06-28 18:25                     ` Luiz Capitulino
  2013-06-28 18:45                     ` Luiz Capitulino
  0 siblings, 2 replies; 34+ messages in thread
From: Anton Vorontsov @ 2013-06-28 17:09 UTC (permalink / raw)
  To: Luiz Capitulino
  Cc: Andrew Morton, Minchan Kim, linux-mm, linux-kernel, mhocko,
	kmpark, hyunhee.kim

On Fri, Jun 28, 2013 at 09:57:22AM -0700, Anton Vorontsov wrote:
> On Fri, Jun 28, 2013 at 10:00:27AM -0400, Luiz Capitulino wrote:
> > On Thu, 27 Jun 2013 22:07:12 -0700
> > Anton Vorontsov <anton@enomsg.org> wrote:
> > 
> > > On Thu, Jun 27, 2013 at 09:34:11PM -0700, Anton Vorontsov wrote:
> > > > ... we can add the strict mode and deprecate the
> > > > "filtering" -- basically we'll implement the idea of requiring that
> > > > userspace registers a separate fd for each level.
> > > 
> > > Btw, assuming that more levels can be added, there will be a problem:
> > > imagine that an app hooked up onto low, med, crit levels in "strict"
> > > mode... then once we add a new level, the app will start missing the new
> > > level events.
> > 
> > That's how it's expected to work, because on strict mode you're notified
> > for the level you registered for. So apps registering for critical, will
> > still be notified on critical just like before.
> 
> Suppose you introduce a new level, and the system hits this level. Before,
> the app would receive at least some notification for the given memory load
> (i.e. one of the old levels), with the new level introduced in the kernel,
> the app will receive no events at all. This makes a serious behavioural
> change in the app (read: it'll break it).

Btw, why exactly you need the strict mode? Why 'medium' won't work for the
load-balancing? If you want some special logic for the critical level in
addition to the load-balancing, then you can hook into medium and critical
in the current scheme, and it will be an equivalent, except that you will
still be receiving 'medium' events, which you can filter out in userland.

You don't even need additional syscalls -- you don't have to read from the
medium-level eventfd if poll() returned events from the critical-level
eventfd.

So, I would now argue that the current scheme is perfectly OK and can do
everything you can do with the "strict" one, except that the old one is
better, as it handles backwards-compatibility in a nice way. :)

Anton

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [PATCH v2] vmpressure: implement strict mode
  2013-06-28 17:09                   ` Anton Vorontsov
@ 2013-06-28 18:25                     ` Luiz Capitulino
  2013-06-28 18:45                       ` Anton Vorontsov
  2013-06-28 18:45                     ` Luiz Capitulino
  1 sibling, 1 reply; 34+ messages in thread
From: Luiz Capitulino @ 2013-06-28 18:25 UTC (permalink / raw)
  To: Anton Vorontsov
  Cc: Andrew Morton, Minchan Kim, linux-mm, linux-kernel, mhocko,
	kmpark, hyunhee.kim

On Fri, 28 Jun 2013 10:09:17 -0700
Anton Vorontsov <anton@enomsg.org> wrote:

> On Fri, Jun 28, 2013 at 09:57:22AM -0700, Anton Vorontsov wrote:
> > On Fri, Jun 28, 2013 at 10:00:27AM -0400, Luiz Capitulino wrote:
> > > On Thu, 27 Jun 2013 22:07:12 -0700
> > > Anton Vorontsov <anton@enomsg.org> wrote:
> > > 
> > > > On Thu, Jun 27, 2013 at 09:34:11PM -0700, Anton Vorontsov wrote:
> > > > > ... we can add the strict mode and deprecate the
> > > > > "filtering" -- basically we'll implement the idea of requiring that
> > > > > userspace registers a separate fd for each level.
> > > > 
> > > > Btw, assuming that more levels can be added, there will be a problem:
> > > > imagine that an app hooked up onto low, med, crit levels in "strict"
> > > > mode... then once we add a new level, the app will start missing the new
> > > > level events.
> > > 
> > > That's how it's expected to work, because on strict mode you're notified
> > > for the level you registered for. So apps registering for critical, will
> > > still be notified on critical just like before.
> > 
> > Suppose you introduce a new level, and the system hits this level. Before,
> > the app would receive at least some notification for the given memory load
> > (i.e. one of the old levels), with the new level introduced in the kernel,
> > the app will receive no events at all.

That's not true. If an app registered for critical it will still get
critical notification when the system is at the critical level. Just as it
always did. No new events will change this.

With today's semantics though, new events will change when current events
are triggered. So each new extension will cause applications to have
different behaviors, in different kernel versions. This looks quite
undesirable to me.

> > This makes a serious behavioural
> > change in the app (read: it'll break it).

How? Strict mode semantics is simple: you get what you registered for.
No matter the kernel version, no matter how many events we add on top
of existing ones. How can this brake applications?

> Btw, why exactly you need the strict mode?

I'm implementing automatic guest-resize for KVM guests. What I'd like to
do is to inflate the balloon by different values depending on the
host pressure. Say, 1MB on low; 16MB on medium and 128MB on critical.

The actual values are meaningless btw, as this is going to be set by
the user. So, saying that 1MB on low is too little to be concerned about
is not an valid argument, as the user can set this to 1GB.

> Why 'medium' won't work for the
> load-balancing?

I need precision. If the system is at 'medium' level, then I'll do
medium stuff. If it gets into critical, then I'll do critical stuff.

I don't want to do critical stuff on all levels, or critical on
medium. This is really messy. Just give me what I asked for, please.

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [PATCH v2] vmpressure: implement strict mode
  2013-06-28 17:09                   ` Anton Vorontsov
  2013-06-28 18:25                     ` Luiz Capitulino
@ 2013-06-28 18:45                     ` Luiz Capitulino
  2013-06-28 18:55                       ` Anton Vorontsov
  1 sibling, 1 reply; 34+ messages in thread
From: Luiz Capitulino @ 2013-06-28 18:45 UTC (permalink / raw)
  To: Anton Vorontsov
  Cc: Andrew Morton, Minchan Kim, linux-mm, linux-kernel, mhocko,
	kmpark, hyunhee.kim

On Fri, 28 Jun 2013 10:09:17 -0700
Anton Vorontsov <anton@enomsg.org> wrote:

> So, I would now argue that the current scheme is perfectly OK and can do
> everything you can do with the "strict" one,

I forgot commenting this bit. This is not true, because I don't want a
low fd to be notified on critical level. The current interface just
can't do that.

However, it *is* possible to make non-strict work on strict if we make
strict default _and_ make reads on memory.pressure_level return
available events. Just do this on app initialization:

for each event in memory.pressure_level; do
	/* register eventfd to be notified on "event" */
done

Then eventfd will always be notified, no matter the event.

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [PATCH v2] vmpressure: implement strict mode
  2013-06-28 18:25                     ` Luiz Capitulino
@ 2013-06-28 18:45                       ` Anton Vorontsov
  2013-06-28 18:58                         ` Luiz Capitulino
  0 siblings, 1 reply; 34+ messages in thread
From: Anton Vorontsov @ 2013-06-28 18:45 UTC (permalink / raw)
  To: Luiz Capitulino
  Cc: Andrew Morton, Minchan Kim, linux-mm, linux-kernel, mhocko,
	kmpark, hyunhee.kim

On Fri, Jun 28, 2013 at 02:25:58PM -0400, Luiz Capitulino wrote:
> > > > That's how it's expected to work, because on strict mode you're notified
> > > > for the level you registered for. So apps registering for critical, will
> > > > still be notified on critical just like before.
> > > 
> > > Suppose you introduce a new level, and the system hits this level. Before,
> > > the app would receive at least some notification for the given memory load
> > > (i.e. one of the old levels), with the new level introduced in the kernel,
> > > the app will receive no events at all.
> 
> That's not true. If an app registered for critical it will still get
> critical notification when the system is at the critical level. Just as it
> always did. No new events will change this.
> 
> With today's semantics though, new events will change when current events
> are triggered. So each new extension will cause applications to have
> different behaviors, in different kernel versions. This looks quite
> undesirable to me.

I'll try to explain it again.

Old behaviour:

low -> event
  x <- but the system is at this unnamed level, between low and med
med
crit


We add a level:

low
low-med <- system at this state, we send an event, but the old app does
           not know about it, so it won't receive *any* notifications. (In
	   older kernels it would receive low level notification
med
crit

You really don't see a problem here?

I see the problem, and I see these solutions:

1. Use current scheme.
2. Add versioning.
3. Never change the levels (how can we know?)

Anton

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [PATCH v2] vmpressure: implement strict mode
  2013-06-28 18:45                     ` Luiz Capitulino
@ 2013-06-28 18:55                       ` Anton Vorontsov
  2013-06-28 19:44                         ` Luiz Capitulino
  0 siblings, 1 reply; 34+ messages in thread
From: Anton Vorontsov @ 2013-06-28 18:55 UTC (permalink / raw)
  To: Luiz Capitulino
  Cc: Andrew Morton, Minchan Kim, linux-mm, linux-kernel, mhocko,
	kmpark, hyunhee.kim

On Fri, Jun 28, 2013 at 02:45:07PM -0400, Luiz Capitulino wrote:
> On Fri, 28 Jun 2013 10:09:17 -0700
> Anton Vorontsov <anton@enomsg.org> wrote:
> 
> > So, I would now argue that the current scheme is perfectly OK and can do
> > everything you can do with the "strict" one,
> 
> I forgot commenting this bit. This is not true, because I don't want a
> low fd to be notified on critical level. The current interface just
> can't do that.

Why can't you use poll() and demultiplex the events? Check if there is an
event in the crit fd, and if there is, then just ignore all the rest.

> However, it *is* possible to make non-strict work on strict if we make
> strict default _and_ make reads on memory.pressure_level return
> available events. Just do this on app initialization:
> 
> for each event in memory.pressure_level; do
> 	/* register eventfd to be notified on "event" */
> done

This scheme registers "all" events. Here is more complicated case:

Old kernels, pressure_level reads:

  low, med, crit

The app just wants to listen for med level.

New kernels, pressure_level reads:

  low, FOO, med, BAR, crit

How would application decide which of FOO and BAR are ex-med levels?

Anton

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [PATCH v2] vmpressure: implement strict mode
  2013-06-28 18:45                       ` Anton Vorontsov
@ 2013-06-28 18:58                         ` Luiz Capitulino
  0 siblings, 0 replies; 34+ messages in thread
From: Luiz Capitulino @ 2013-06-28 18:58 UTC (permalink / raw)
  To: Anton Vorontsov
  Cc: Andrew Morton, Minchan Kim, linux-mm, linux-kernel, mhocko,
	kmpark, hyunhee.kim

On Fri, 28 Jun 2013 11:45:47 -0700
Anton Vorontsov <anton@enomsg.org> wrote:

> On Fri, Jun 28, 2013 at 02:25:58PM -0400, Luiz Capitulino wrote:
> > > > > That's how it's expected to work, because on strict mode you're notified
> > > > > for the level you registered for. So apps registering for critical, will
> > > > > still be notified on critical just like before.
> > > > 
> > > > Suppose you introduce a new level, and the system hits this level. Before,
> > > > the app would receive at least some notification for the given memory load
> > > > (i.e. one of the old levels), with the new level introduced in the kernel,
> > > > the app will receive no events at all.
> > 
> > That's not true. If an app registered for critical it will still get
> > critical notification when the system is at the critical level. Just as it
> > always did. No new events will change this.
> > 
> > With today's semantics though, new events will change when current events
> > are triggered. So each new extension will cause applications to have
> > different behaviors, in different kernel versions. This looks quite
> > undesirable to me.
> 
> I'll try to explain it again.
> 
> Old behaviour:
> 
> low -> event
>   x <- but the system is at this unnamed level, between low and med
> med
> crit
> 
> 
> We add a level:
> 
> low
> low-med <- system at this state, we send an event, but the old app does
>            not know about it, so it won't receive *any* notifications. (In
> 	   older kernels it would receive low level notification
> med
> crit
> 
> You really don't see a problem here?

I do get what you're saying. We disagree it's a problem. In my mind the
best API is to get what you registered for. Nothing more, nothing less.

Now, there might be ways around it (being it a problem or not). I was
also considering this:

> 3. Never change the levels (how can we know?)

If we fail at determining levels (I honestly think current levels are
all we need), we can add a new interface later.

Also, what I said in the last email should work, which is to make
memory.pressure_level return supported levels, so an application can
register for all available levels. This way it will never miss a level.

I also think this matches having the mechanism in the kernel and
policy in user-space.

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [PATCH v2] vmpressure: implement strict mode
  2013-06-28 18:55                       ` Anton Vorontsov
@ 2013-06-28 19:44                         ` Luiz Capitulino
  2013-06-29  0:56                           ` Anton Vorontsov
  0 siblings, 1 reply; 34+ messages in thread
From: Luiz Capitulino @ 2013-06-28 19:44 UTC (permalink / raw)
  To: Anton Vorontsov
  Cc: Andrew Morton, Minchan Kim, linux-mm, linux-kernel, mhocko,
	kmpark, hyunhee.kim

On Fri, 28 Jun 2013 11:55:47 -0700
Anton Vorontsov <anton@enomsg.org> wrote:

> On Fri, Jun 28, 2013 at 02:45:07PM -0400, Luiz Capitulino wrote:
> > On Fri, 28 Jun 2013 10:09:17 -0700
> > Anton Vorontsov <anton@enomsg.org> wrote:
> > 
> > > So, I would now argue that the current scheme is perfectly OK and can do
> > > everything you can do with the "strict" one,
> > 
> > I forgot commenting this bit. This is not true, because I don't want a
> > low fd to be notified on critical level. The current interface just
> > can't do that.
> 
> Why can't you use poll() and demultiplex the events? Check if there is an
> event in the crit fd, and if there is, then just ignore all the rest.

This may be a valid workaround for current kernels, but application
behavior will be different among kernels with a different number of
events.

Say, we events on top of critical. Then crit fd will now be
notified for cases where it didn't use to on older kernels.

> > However, it *is* possible to make non-strict work on strict if we make
> > strict default _and_ make reads on memory.pressure_level return
> > available events. Just do this on app initialization:
> > 
> > for each event in memory.pressure_level; do
> > 	/* register eventfd to be notified on "event" */
> > done
> 
> This scheme registers "all" events.

Yes, because I thought that's the user-case that matters for activity
manager :)

> Here is more complicated case:
> 
> Old kernels, pressure_level reads:
> 
>   low, med, crit
> 
> The app just wants to listen for med level.
> 
> New kernels, pressure_level reads:
> 
>   low, FOO, med, BAR, crit
> 
> How would application decide which of FOO and BAR are ex-med levels?

What you meant by ex-med?

Let's not over-design. I agree that allowing the API to be extended
is a good thing, but we shouldn't give complex meaning to events,
otherwise we're better with a numeric scale instead.

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [PATCH v2] vmpressure: implement strict mode
  2013-06-28 19:44                         ` Luiz Capitulino
@ 2013-06-29  0:56                           ` Anton Vorontsov
  2013-07-01  8:22                             ` Hyunhee Kim
                                               ` (2 more replies)
  0 siblings, 3 replies; 34+ messages in thread
From: Anton Vorontsov @ 2013-06-29  0:56 UTC (permalink / raw)
  To: Luiz Capitulino
  Cc: Andrew Morton, Minchan Kim, linux-mm, linux-kernel, mhocko,
	kmpark, hyunhee.kim

On Fri, Jun 28, 2013 at 03:44:02PM -0400, Luiz Capitulino wrote:
> > Why can't you use poll() and demultiplex the events? Check if there is an
> > event in the crit fd, and if there is, then just ignore all the rest.
> 
> This may be a valid workaround for current kernels, but application
> behavior will be different among kernels with a different number of
> events.

This is not a workaround, this is how poll works, and this is kinda
expected... But not that I had this plan in mind when I was designing the
current scheme... :)

> Say, we events on top of critical. Then crit fd will now be
> notified for cases where it didn't use to on older kernels.

I'm not sure I am following here... but thinking about it more, I guess
the extra read() will be needed anyway (to reset the counter).

> > > However, it *is* possible to make non-strict work on strict if we make
> > > strict default _and_ make reads on memory.pressure_level return
> > > available events. Just do this on app initialization:
> > > 
> > > for each event in memory.pressure_level; do
> > > 	/* register eventfd to be notified on "event" */
> > > done
> > 
> > This scheme registers "all" events.
> 
> Yes, because I thought that's the user-case that matters for activity
> manager :)

Some activity managers use only low levels (Android), some might use only
medium levels (simple load-balancing).

Being able to register only "all" does not make sense to me.

> > Here is more complicated case:
> > 
> > Old kernels, pressure_level reads:
> > 
> >   low, med, crit
> > 
> > The app just wants to listen for med level.
> > 
> > New kernels, pressure_level reads:
> > 
> >   low, FOO, med, BAR, crit
> > 
> > How would application decide which of FOO and BAR are ex-med levels?
> 
> What you meant by ex-med?

The scale is continuous and non-overlapping. If you add some other level,
you effectively "shrinking" other levels, so the ex-med in the list above
might correspond to "FOO, med" or "med, BAR" or "FOO, med, BAR", and that
is exactly the problem.

> Let's not over-design. I agree that allowing the API to be extended
> is a good thing, but we shouldn't give complex meaning to events,
> otherwise we're better with a numeric scale instead.

I am not over-desiging at all. Again, you did not provide any solution for
the case if we are going to add a new level. Thing is, I don't know if we
are going to add more levels, but the three-levels scheme is not something
scientifically proven, it is just an arbitrary thing we made up. We may
end up with four, or five... or not.

Thanks,

Anton

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [PATCH v2] vmpressure: implement strict mode
  2013-06-29  0:56                           ` Anton Vorontsov
@ 2013-07-01  8:22                             ` Hyunhee Kim
  2013-07-02  4:32                               ` Anton Vorontsov
  2013-07-02 13:29                             ` Michal Hocko
  2013-07-02 14:59                             ` Luiz Capitulino
  2 siblings, 1 reply; 34+ messages in thread
From: Hyunhee Kim @ 2013-07-01  8:22 UTC (permalink / raw)
  To: Anton Vorontsov
  Cc: Luiz Capitulino, Andrew Morton, Minchan Kim, linux-mm,
	linux-kernel, mhocko, kmpark

2013/6/29 Anton Vorontsov <anton@enomsg.org>:
> On Fri, Jun 28, 2013 at 03:44:02PM -0400, Luiz Capitulino wrote:
>> > Why can't you use poll() and demultiplex the events? Check if there is an
>> > event in the crit fd, and if there is, then just ignore all the rest.
>>
>> This may be a valid workaround for current kernels, but application
>> behavior will be different among kernels with a different number of
>> events.
>
> This is not a workaround, this is how poll works, and this is kinda
> expected... But not that I had this plan in mind when I was designing the
> current scheme... :)
>
>> Say, we events on top of critical. Then crit fd will now be
>> notified for cases where it didn't use to on older kernels.
>
> I'm not sure I am following here... but thinking about it more, I guess
> the extra read() will be needed anyway (to reset the counter).
>
>> > > However, it *is* possible to make non-strict work on strict if we make
>> > > strict default _and_ make reads on memory.pressure_level return
>> > > available events. Just do this on app initialization:
>> > >
>> > > for each event in memory.pressure_level; do
>> > >   /* register eventfd to be notified on "event" */
>> > > done
>> >
>> > This scheme registers "all" events.
>>
>> Yes, because I thought that's the user-case that matters for activity
>> manager :)
>
> Some activity managers use only low levels (Android), some might use only
> medium levels (simple load-balancing).

When the platform like Android uses only "low" level, is it the
process you intended when designing vmpressure?

1. activity manager receives "low" level events
2. it reads and checks the current memory (e.g. available memory) using vmstat
3. if the available memory is not under the threshold (defined e.g. by
activity manager), activity manager does nothing
4. if the available memory is under the threshold, activity manager
handles it by e.g. reclaiming or killing processes?

At first time when I saw this vmpressure, I thought that I should
register all events ("low", "medium", and "critical
") and use different handler for each event. However, without the mode
like strict mode, I should see too many events. So, now, I think that
it is better to use only one level and run each handler after checking
available memory as you mentioned.

But,

1. Isn't it overhead to read event and check memory state every time
we receive events?
    - sometimes, even when there are lots of available memory, low
level event could occur if most of them is reclaimable memory not free
pages.
    - Don't most of platforms use available memory to judge their
current memory state? Is there any reason vmpressure use reclaim rate?
IMO, activity manager doesn't have to check available memory if it
could receive signal based on the available memory.

2. If we use only "medium" to avoid the overheads occurred when using
"low" level, isn't it possible to miss sending events when there is a
little available memory but reclaim ratio is high?

IMHO, we cannot consider and cover all the use cases, but considering
some use cases and giving some guides and directions to use this
vmpressure will be helpful to make many platform accept this for their
low memory manager.

Thanks,
Hyunhee Kim.

>
> Being able to register only "all" does not make sense to me.
>
>> > Here is more complicated case:
>> >
>> > Old kernels, pressure_level reads:
>> >
>> >   low, med, crit
>> >
>> > The app just wants to listen for med level.
>> >
>> > New kernels, pressure_level reads:
>> >
>> >   low, FOO, med, BAR, crit
>> >
>> > How would application decide which of FOO and BAR are ex-med levels?
>>
>> What you meant by ex-med?
>
> The scale is continuous and non-overlapping. If you add some other level,
> you effectively "shrinking" other levels, so the ex-med in the list above
> might correspond to "FOO, med" or "med, BAR" or "FOO, med, BAR", and that
> is exactly the problem.
>
>> Let's not over-design. I agree that allowing the API to be extended
>> is a good thing, but we shouldn't give complex meaning to events,
>> otherwise we're better with a numeric scale instead.
>
> I am not over-desiging at all. Again, you did not provide any solution for
> the case if we are going to add a new level. Thing is, I don't know if we
> are going to add more levels, but the three-levels scheme is not something
> scientifically proven, it is just an arbitrary thing we made up. We may
> end up with four, or five... or not.
>
> Thanks,
>
> Anton
>
> --
> To unsubscribe, send a message with 'unsubscribe linux-mm' in
> the body to majordomo@kvack.org.  For more info on Linux MM,
> see: http://www.linux-mm.org/ .
> Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [PATCH v2] vmpressure: implement strict mode
  2013-07-01  8:22                             ` Hyunhee Kim
@ 2013-07-02  4:32                               ` Anton Vorontsov
  2013-07-02  8:29                                 ` Hyunhee Kim
  0 siblings, 1 reply; 34+ messages in thread
From: Anton Vorontsov @ 2013-07-02  4:32 UTC (permalink / raw)
  To: Hyunhee Kim
  Cc: Luiz Capitulino, Andrew Morton, Minchan Kim, linux-mm,
	linux-kernel, mhocko, kmpark

On Mon, Jul 01, 2013 at 05:22:36PM +0900, Hyunhee Kim wrote:
> >> > > for each event in memory.pressure_level; do
> >> > >   /* register eventfd to be notified on "event" */
> >> > > done
> >> >
> >> > This scheme registers "all" events.
> >>
> >> Yes, because I thought that's the user-case that matters for activity
> >> manager :)
> >
> > Some activity managers use only low levels (Android), some might use only
> > medium levels (simple load-balancing).
> 
> When the platform like Android uses only "low" level, is it the
> process you intended when designing vmpressure?
> 
> 1. activity manager receives "low" level events
> 2. it reads and checks the current memory (e.g. available memory) using vmstat
> 3. if the available memory is not under the threshold (defined e.g. by
> activity manager), activity manager does nothing
> 4. if the available memory is under the threshold, activity manager
> handles it by e.g. reclaiming or killing processes?

Yup, exactly.

> At first time when I saw this vmpressure, I thought that I should
> register all events ("low", "medium", and "critical
> ") and use different handler for each event. However, without the mode
> like strict mode, I should see too many events. So, now, I think that
> it is better to use only one level and run each handler after checking
> available memory as you mentioned.

Yup, this should work ideally.

> But,
> 
> 1. Isn't it overhead to read event and check memory state every time
> we receive events?

Even if it is an overhead, is it measurable? Plus, vmstat/memcg stats are
the only source of information that Activity Manager can use to make a
decision, so there is no point in duplicating the information in the
notifications.

>     - sometimes, even when there are lots of available memory, low
> level event could occur if most of them is reclaimable memory not free
> pages.

The point of low level is to signal [any] reclaiming activity. So, yes, 

>     - Don't most of platforms use available memory to judge their
> current memory state?

No, because you hardly want to monitor available memory only. You want to
take into account the level of the page caches, etc.

> Is there any reason vmpressure use reclaim rate?

Yes, you can refer to this email:

  http://lkml.org/lkml/2012/10/4/145

And here is about the levels thing:

  http://lkml.org/lkml/2012/10/22/177

> IMO, activity manager doesn't have to check available memory if it
> could receive signal based on the available memory.

But userspace can define its own policy of managing the tasks/resouces
based on different factors, other than just available memory. And that is
exactly why we don't filter the events in the kernel anymore. The only
filtering that we make is the levels, which, as it appears, can work for
many use-cases. 

> 2. If we use only "medium" to avoid the overheads occurred when using
> "low" level, isn't it possible to miss sending events when there is a
> little available memory but reclaim ratio is high?

If your app don't "trust" reclaim ratio idicator, then the application can
use its own heuristics, using low level just to monitor reclaiming
activity. More than that, you can change vmpressure itself to use
different heuristics for low/med/crit levels: the point of introducing
levels was also to hide the implementation and memory management details,
so if you can come up with a better approach for vmpressure "internals"
you are more than welcome to do so. :)

> IMHO, we cannot consider and cover all the use cases, but considering
> some use cases and giving some guides and directions to use this
> vmpressure will be helpful to make many platform accept this for their
> low memory manager.

Can't argue with that. :) I guess I will need to better document current
behavior of the levels and when exactly the events trigger.

Thanks!

Anton

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [PATCH v2] vmpressure: implement strict mode
  2013-07-02  4:32                               ` Anton Vorontsov
@ 2013-07-02  8:29                                 ` Hyunhee Kim
  0 siblings, 0 replies; 34+ messages in thread
From: Hyunhee Kim @ 2013-07-02  8:29 UTC (permalink / raw)
  To: Anton Vorontsov
  Cc: Luiz Capitulino, Andrew Morton, Minchan Kim, linux-mm,
	linux-kernel, mhocko, kmpark

2013/7/2 Anton Vorontsov <anton@enomsg.org>:
> On Mon, Jul 01, 2013 at 05:22:36PM +0900, Hyunhee Kim wrote:
>> >> > > for each event in memory.pressure_level; do
>> >> > >   /* register eventfd to be notified on "event" */
>> >> > > done
>> >> >
>> >> > This scheme registers "all" events.
>> >>
>> >> Yes, because I thought that's the user-case that matters for activity
>> >> manager :)
>> >
>> > Some activity managers use only low levels (Android), some might use only
>> > medium levels (simple load-balancing).
>>
>> When the platform like Android uses only "low" level, is it the
>> process you intended when designing vmpressure?
>>
>> 1. activity manager receives "low" level events
>> 2. it reads and checks the current memory (e.g. available memory) using vmstat
>> 3. if the available memory is not under the threshold (defined e.g. by
>> activity manager), activity manager does nothing
>> 4. if the available memory is under the threshold, activity manager
>> handles it by e.g. reclaiming or killing processes?
>
> Yup, exactly.
>
>> At first time when I saw this vmpressure, I thought that I should
>> register all events ("low", "medium", and "critical
>> ") and use different handler for each event. However, without the mode
>> like strict mode, I should see too many events. So, now, I think that
>> it is better to use only one level and run each handler after checking
>> available memory as you mentioned.
>
> Yup, this should work ideally.

Thanks for your reply.
I think that, as you mentioned, using one level event could work well
when activity manager checks available memory in user space.

I also think that Luiz's use case and the use case I thought at first
(registering "low", "medium", and "critical", and run each handler)
are another examples of use cases that could be widely used.
For example, let's think about the case when userland wants to know
"pressure level" (reclaim ratio calculated by vmpressure 0~60~100).
In this case, if we do not register all of these events and register
only "low" level, we cannot distinguish "low", "medium", and
"critical" pressure level in userland.
This reclaim ratio cannot be identified in userland.

Thanks.
Hyunhee Kim.

>
>> But,
>>
>> 1. Isn't it overhead to read event and check memory state every time
>> we receive events?
>
> Even if it is an overhead, is it measurable? Plus, vmstat/memcg stats are
> the only source of information that Activity Manager can use to make a
> decision, so there is no point in duplicating the information in the
> notifications.
>
>>     - sometimes, even when there are lots of available memory, low
>> level event could occur if most of them is reclaimable memory not free
>> pages.
>
> The point of low level is to signal [any] reclaiming activity. So, yes,
>
>>     - Don't most of platforms use available memory to judge their
>> current memory state?
>
> No, because you hardly want to monitor available memory only. You want to
> take into account the level of the page caches, etc.
>
>> Is there any reason vmpressure use reclaim rate?
>
> Yes, you can refer to this email:
>
>   http://lkml.org/lkml/2012/10/4/145
>
> And here is about the levels thing:
>
>   http://lkml.org/lkml/2012/10/22/177
>
>> IMO, activity manager doesn't have to check available memory if it
>> could receive signal based on the available memory.
>
> But userspace can define its own policy of managing the tasks/resouces
> based on different factors, other than just available memory. And that is
> exactly why we don't filter the events in the kernel anymore. The only
> filtering that we make is the levels, which, as it appears, can work for
> many use-cases.
>
>> 2. If we use only "medium" to avoid the overheads occurred when using
>> "low" level, isn't it possible to miss sending events when there is a
>> little available memory but reclaim ratio is high?
>
> If your app don't "trust" reclaim ratio idicator, then the application can
> use its own heuristics, using low level just to monitor reclaiming
> activity. More than that, you can change vmpressure itself to use
> different heuristics for low/med/crit levels: the point of introducing
> levels was also to hide the implementation and memory management details,
> so if you can come up with a better approach for vmpressure "internals"
> you are more than welcome to do so. :)
>
>> IMHO, we cannot consider and cover all the use cases, but considering
>> some use cases and giving some guides and directions to use this
>> vmpressure will be helpful to make many platform accept this for their
>> low memory manager.
>
> Can't argue with that. :) I guess I will need to better document current
> behavior of the levels and when exactly the events trigger.
>
> Thanks!
>
> Anton
>
> --
> To unsubscribe, send a message with 'unsubscribe linux-mm' in
> the body to majordomo@kvack.org.  For more info on Linux MM,
> see: http://www.linux-mm.org/ .
> Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [PATCH v2] vmpressure: implement strict mode
  2013-06-29  0:56                           ` Anton Vorontsov
  2013-07-01  8:22                             ` Hyunhee Kim
@ 2013-07-02 13:29                             ` Michal Hocko
  2013-07-02 14:59                             ` Luiz Capitulino
  2 siblings, 0 replies; 34+ messages in thread
From: Michal Hocko @ 2013-07-02 13:29 UTC (permalink / raw)
  To: Anton Vorontsov
  Cc: Luiz Capitulino, Andrew Morton, Minchan Kim, linux-mm,
	linux-kernel, kmpark, hyunhee.kim

On Fri 28-06-13 17:56:37, Anton Vorontsov wrote:
> On Fri, Jun 28, 2013 at 03:44:02PM -0400, Luiz Capitulino wrote:
> > > Why can't you use poll() and demultiplex the events? Check if there is an
> > > event in the crit fd, and if there is, then just ignore all the rest.
> > 
> > This may be a valid workaround for current kernels, but application
> > behavior will be different among kernels with a different number of
> > events.
> 
> This is not a workaround, this is how poll works, and this is kinda
> expected... But not that I had this plan in mind when I was designing the
> current scheme... :)

One thing I found strict mode useful is that a poll based implementation
would be PITA without kernel help. First the timing is nothing you can
rely on. There might be arbitrary timeout between two eventfd_signal
calls. So you would see Medium while critical is waiting for being
scheduled.
Kernel might help here though and signal from the highest event down to
lower ones. The core issue would stay though. What is a tolerable period
when an event is considered separate one?

That being said I think both modes make sense and they cover different
usecases.
-- 
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [PATCH v2] vmpressure: implement strict mode
  2013-06-29  0:56                           ` Anton Vorontsov
  2013-07-01  8:22                             ` Hyunhee Kim
  2013-07-02 13:29                             ` Michal Hocko
@ 2013-07-02 14:59                             ` Luiz Capitulino
  2013-07-02 17:24                               ` Anton Vorontsov
  2 siblings, 1 reply; 34+ messages in thread
From: Luiz Capitulino @ 2013-07-02 14:59 UTC (permalink / raw)
  To: Anton Vorontsov
  Cc: Andrew Morton, Minchan Kim, linux-mm, linux-kernel, mhocko,
	kmpark, hyunhee.kim

On Fri, 28 Jun 2013 17:56:37 -0700
Anton Vorontsov <anton@enomsg.org> wrote:

> On Fri, Jun 28, 2013 at 03:44:02PM -0400, Luiz Capitulino wrote:
> > > Why can't you use poll() and demultiplex the events? Check if there is an
> > > event in the crit fd, and if there is, then just ignore all the rest.
> > 
> > This may be a valid workaround for current kernels, but application
> > behavior will be different among kernels with a different number of
> > events.
> 
> This is not a workaround, this is how poll works, and this is kinda
> expected...

I think this is a workaround because it's tailored to my specific
use-case and to a specific kernel version, as:

 1. Applications registering for lower levels (eg. low) are still
    unable to tell which level actually caused them to be notified,
    as lower levels are triggered along with higher levels

 2. Considering the interface can be extended, how can new applications
    work on backwards mode? Say, we add ultra-critical on 3.12 and
    I update my application to work on it, how will my application
    work on 3.11?

    Hint: Try and error is an horribly bad approach.

 3. I also don't believe we have good forward compatibility with
    the current API, as adding new events will cause existing ones
    to be triggered more often, so I'd expect app behavior to vary
    among kernels with a different number of events.

Honestly, what Andrew suggested is the best design for me: apps
are notified on all events but the event name is sent to the application.

This is pretty simple and solves all the problems we've discussed
so far.

Why can't we just do it?

> But not that I had this plan in mind when I was designing the
> current scheme... :)

Hehe :)

> > Say, we events on top of critical. Then crit fd will now be
> > notified for cases where it didn't use to on older kernels.
> 
> I'm not sure I am following here... but thinking about it more, I guess
> the extra read() will be needed anyway (to reset the counter).

I hope I have explained this more clearly above.

> > > > However, it *is* possible to make non-strict work on strict if we make
> > > > strict default _and_ make reads on memory.pressure_level return
> > > > available events. Just do this on app initialization:
> > > > 
> > > > for each event in memory.pressure_level; do
> > > > 	/* register eventfd to be notified on "event" */
> > > > done
> > > 
> > > This scheme registers "all" events.
> > 
> > Yes, because I thought that's the user-case that matters for activity
> > manager :)
> 
> Some activity managers use only low levels (Android), some might use only
> medium levels (simple load-balancing).
> 
> Being able to register only "all" does not make sense to me.

Well, you can skip events if you want.

> > > Here is more complicated case:
> > > 
> > > Old kernels, pressure_level reads:
> > > 
> > >   low, med, crit
> > > 
> > > The app just wants to listen for med level.
> > > 
> > > New kernels, pressure_level reads:
> > > 
> > >   low, FOO, med, BAR, crit
> > > 
> > > How would application decide which of FOO and BAR are ex-med levels?
> > 
> > What you meant by ex-med?
> 
> The scale is continuous and non-overlapping. If you add some other level,
> you effectively "shrinking" other levels, so the ex-med in the list above
> might correspond to "FOO, med" or "med, BAR" or "FOO, med, BAR", and that
> is exactly the problem.

Just return the events in order?

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [PATCH v2] vmpressure: implement strict mode
  2013-07-02 14:59                             ` Luiz Capitulino
@ 2013-07-02 17:24                               ` Anton Vorontsov
  2013-07-02 18:38                                 ` Luiz Capitulino
  0 siblings, 1 reply; 34+ messages in thread
From: Anton Vorontsov @ 2013-07-02 17:24 UTC (permalink / raw)
  To: Luiz Capitulino
  Cc: Andrew Morton, Minchan Kim, linux-mm, linux-kernel, mhocko,
	kmpark, hyunhee.kim

On Tue, Jul 02, 2013 at 10:59:11AM -0400, Luiz Capitulino wrote:
>  2. Considering the interface can be extended, how can new applications
>     work on backwards mode? Say, we add ultra-critical on 3.12 and
>     I update my application to work on it, how will my application
>     work on 3.11?

It will refuse to run, as expected. We are returning -EINVAL on unknown
levels. The same way you try to run e.g. systemd on linux-2.4. Systemd
requires new features of the kernel, if there are no features present the
kernel returns an error and then app gracefully fails.

>     Hint: Try and error is an horribly bad approach.
> 
>  3. I also don't believe we have good forward compatibility with
>     the current API, as adding new events will cause existing ones
>     to be triggered more often,

If they don't register for the new events (the old apps don't know about
them, so they won't), there will be absolutely no difference in the
behaviour, and that is what is most important.

There is a scalability problem I can see because of the need of the read()
call on each fd, but the "scalability" problem will actually arise if we
have insane number of levels.

> Honestly, what Andrew suggested is the best design for me: apps
> are notified on all events but the event name is sent to the application.

I am fine with this approach (or any other, I'm really indifferent to the
API itself -- read/netlink/notification per file/whatever for the
payload), except that you still have the similar problem:

  read() old    read() new
  --------------------------
       "low"           "low"
       "low"           "foo" -- the app does not know what does this mean
       "med"           "bar" -- ditto
       "med"           "med"

> This is pretty simple and solves all the problems we've discussed
> so far.
> 
> Why can't we just do it?

Because of the problems described above. Again, add versioning and there
will be no problem (but just the fact that we need for versioning for that
kind of interface might raise questions).

> > > > Here is more complicated case:
> > > > 
> > > > Old kernels, pressure_level reads:
> > > > 
> > > >   low, med, crit
> > > > 
> > > > The app just wants to listen for med level.
> > > > 
> > > > New kernels, pressure_level reads:
> > > > 
> > > >   low, FOO, med, BAR, crit
> > > > 
> > > > How would application decide which of FOO and BAR are ex-med levels?
> > > 
> > > What you meant by ex-med?
> > 
> > The scale is continuous and non-overlapping. If you add some other level,
> > you effectively "shrinking" other levels, so the ex-med in the list above
> > might correspond to "FOO, med" or "med, BAR" or "FOO, med, BAR", and that
> > is exactly the problem.
> 
> Just return the events in order?

The order is not a problem, the meaning is. The old app does not know the
meaning of FOO or BAR levels, for it is is literally "some foo" and "some
bar" -- it can't make any decision.

Anton

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [PATCH v2] vmpressure: implement strict mode
  2013-07-02 17:24                               ` Anton Vorontsov
@ 2013-07-02 18:38                                 ` Luiz Capitulino
  0 siblings, 0 replies; 34+ messages in thread
From: Luiz Capitulino @ 2013-07-02 18:38 UTC (permalink / raw)
  To: Anton Vorontsov
  Cc: Andrew Morton, Minchan Kim, linux-mm, linux-kernel, mhocko,
	kmpark, hyunhee.kim

On Tue, 2 Jul 2013 10:24:09 -0700
Anton Vorontsov <anton@enomsg.org> wrote:

> > Honestly, what Andrew suggested is the best design for me: apps
> > are notified on all events but the event name is sent to the application.
> 
> I am fine with this approach (or any other, I'm really indifferent to the
> API itself -- read/netlink/notification per file/whatever for the
> payload),

That's a very good thing because we've managed to agree on something :)

I'm also indifferent to the API, as long as we have 100% of the policy
in user-space. To me this means we do absolutely no filtering in the
kernel, which in turn means user-space gets all the events. Of course,
we need the event name as a payload.

Do we agree this solves all use-cases we have discussed so far?

> except that you still have the similar problem:
> 
>   read() old    read() new
>   --------------------------
>        "low"           "low"
>        "low"           "foo" -- the app does not know what does this mean
>        "med"           "bar" -- ditto

It can just ignore it, have a special handling, log it, fail or whatever.
That's the good of having the policy in user-space.

^ permalink raw reply	[flat|nested] 34+ messages in thread

end of thread, other threads:[~2013-07-02 18:39 UTC | newest]

Thread overview: 34+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2013-06-27  3:17 [PATCH v2] vmpressure: implement strict mode Luiz Capitulino
2013-06-27  9:26 ` Michal Hocko
2013-06-27 13:34   ` Luiz Capitulino
2013-06-27 14:59     ` Michal Hocko
2013-06-27 15:53   ` Minchan Kim
2013-06-27 17:42     ` Michal Hocko
2013-06-27 15:44 ` Minchan Kim
2013-06-27 22:02 ` Andrew Morton
2013-06-28  0:02   ` Minchan Kim
2013-06-28  0:34     ` Andrew Morton
2013-06-28  0:58       ` Anton Vorontsov
2013-06-28  1:13         ` Andrew Morton
2013-06-28  4:34           ` Anton Vorontsov
2013-06-28  5:07             ` Anton Vorontsov
2013-06-28 14:00               ` Luiz Capitulino
2013-06-28 16:57                 ` Anton Vorontsov
2013-06-28 17:09                   ` Anton Vorontsov
2013-06-28 18:25                     ` Luiz Capitulino
2013-06-28 18:45                       ` Anton Vorontsov
2013-06-28 18:58                         ` Luiz Capitulino
2013-06-28 18:45                     ` Luiz Capitulino
2013-06-28 18:55                       ` Anton Vorontsov
2013-06-28 19:44                         ` Luiz Capitulino
2013-06-29  0:56                           ` Anton Vorontsov
2013-07-01  8:22                             ` Hyunhee Kim
2013-07-02  4:32                               ` Anton Vorontsov
2013-07-02  8:29                                 ` Hyunhee Kim
2013-07-02 13:29                             ` Michal Hocko
2013-07-02 14:59                             ` Luiz Capitulino
2013-07-02 17:24                               ` Anton Vorontsov
2013-07-02 18:38                                 ` Luiz Capitulino
2013-06-28  5:24             ` Andrew Morton
2013-06-28 13:43             ` Luiz Capitulino
2013-06-28  9:04           ` Minchan Kim

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).