All of lore.kernel.org
 help / color / mirror / Atom feed
* [PATCH v3] memcg: event control at vmpressure.
@ 2013-06-17 11:30 Hyunhee Kim
  2013-06-17 13:15 ` Michal Hocko
  0 siblings, 1 reply; 50+ messages in thread
From: Hyunhee Kim @ 2013-06-17 11:30 UTC (permalink / raw)
  To: 'Anton Vorontsov', linux-mm, 'Michal Hocko',
	akpm, rob, kamezawa.hiroyu, hannes, rientjes, kirill
  Cc: 'Kyungmin Park'

In the original vmpressure, all levels of events less than or equal to
the current pressure level are signaled whenever there is a reclaim
activity. This becomes overheads to user space module and also increases
power consumption if there is somebody to listen to it. This patch provides
options to trigger only the current level of the event when the pressure
level changes. This trigger option can be set when registering each event by writing
a trigger option, "edge" or "always", next to the string of levels.
"edge" means that the event of the current pressure level is signaled only when
the pressure level is changed. "always" means that events are triggered
whenever there is a reclaim process. To keep backward compatibility,
"always" is set by default if nothing is input as an option.
Each event can have different option. For example, "low" level uses "always"
trigger option to see reclaim activity at user space while "medium"/"critical"
uses "edge" to do an important job like killing tasks only once.

Signed-off-by: Hyunhee Kim <hyunhee.kim@samsung.com>
Signed-off-by: Kyungmin Park <kyungmin.park@samsung.com>
---
 Documentation/cgroups/memory.txt |   13 +++++++++++--
 include/linux/vmpressure.h       |    2 ++
 mm/vmpressure.c                  |   25 ++++++++++++++++++++-----
 3 files changed, 33 insertions(+), 7 deletions(-)

diff --git a/Documentation/cgroups/memory.txt b/Documentation/cgroups/memory.txt
index ddf4f93..f5f91fb 100644
--- a/Documentation/cgroups/memory.txt
+++ b/Documentation/cgroups/memory.txt
@@ -791,6 +791,14 @@ way to trigger. Applications should do whatever they can to help the
 system. It might be too late to consult with vmstat or any other
 statistics, so it's advisable to take an immediate action.
 
+All events (<= current pressure level) can be triggered whenever there is a
+reclaim activity or only current pressure level can be triggered when
+the pressure level changes. Trigger option is decided by writing it
+next to level. The event whose trigger option is "always" is triggered
+whenever there is a reclaim process. If "edge" is set, an event
+corresponding to the current pressure level is triggered only when the level
+is changed. If the trigger option is not set, "always" is set by default.
+
 The events are propagated upward until the event is handled, i.e. the
 events are not pass-through. Here is what this means: for example you have
 three cgroups: A->B->C. Now you set up an event listener on cgroups A, B
@@ -807,7 +815,8 @@ register a notification, an application must:
 
 - create an eventfd using eventfd(2);
 - open memory.pressure_level;
-- write string like "<event_fd> <fd of memory.pressure_level> <level>"
+- write string like
+	"<event_fd> <fd of memory.pressure_level> <level> <trigger_option>"
   to cgroup.event_control.
 
 Application will be notified through eventfd when memory pressure is at
@@ -823,7 +832,7 @@ Test:
    # cd /sys/fs/cgroup/memory/
    # mkdir foo
    # cd foo
-   # cgroup_event_listener memory.pressure_level low &
+   # cgroup_event_listener memory.pressure_level low "edge" &
    # echo 8000000 > memory.limit_in_bytes
    # echo 8000000 > memory.memsw.limit_in_bytes
    # echo $$ > tasks
diff --git a/include/linux/vmpressure.h b/include/linux/vmpressure.h
index 76be077..51aed1f 100644
--- a/include/linux/vmpressure.h
+++ b/include/linux/vmpressure.h
@@ -20,6 +20,8 @@ struct vmpressure {
 	struct mutex events_lock;
 
 	struct work_struct work;
+
+	int last_level;
 };
 
 struct mem_cgroup;
diff --git a/mm/vmpressure.c b/mm/vmpressure.c
index 736a601..a18fdb3 100644
--- a/mm/vmpressure.c
+++ b/mm/vmpressure.c
@@ -137,6 +137,7 @@ static enum vmpressure_levels vmpressure_calc_level(unsigned long scanned,
 struct vmpressure_event {
 	struct eventfd_ctx *efd;
 	enum vmpressure_levels level;
+	bool edge_trigger;
 	struct list_head node;
 };
 
@@ -150,14 +151,16 @@ static bool vmpressure_event(struct vmpressure *vmpr,
 	level = vmpressure_calc_level(scanned, reclaimed);
 
 	mutex_lock(&vmpr->events_lock);
-
 	list_for_each_entry(ev, &vmpr->events, node) {
 		if (level >= ev->level) {
+			if (ev->edge_trigger && (level == vmpr->last_level
+				|| level != ev->level))
+				continue;
 			eventfd_signal(ev->efd, 1);
 			signalled = true;
 		}
 	}
-
+	vmpr->last_level = level;
 	mutex_unlock(&vmpr->events_lock);
 
 	return signalled;
@@ -290,9 +293,11 @@ void vmpressure_prio(gfp_t gfp, struct mem_cgroup *memcg, int prio)
  *
  * This function associates eventfd context with the vmpressure
  * infrastructure, so that the notifications will be delivered to the
- * @eventfd. The @args parameter is a string that denotes pressure level
+ * @eventfd. The @args parameters are a string that denotes pressure level
  * threshold (one of vmpressure_str_levels, i.e. "low", "medium", or
- * "critical").
+ * "critical") and a trigger option that decides whether events are triggered
+ * continuously or only on edge ("always" or "edge" if "edge", only the current
+ * pressure level is triggered when the pressure level changes.
  *
  * This function should not be used directly, just pass it to (struct
  * cftype).register_event, and then cgroup core will handle everything by
@@ -303,10 +308,14 @@ int vmpressure_register_event(struct cgroup *cg, struct cftype *cft,
 {
 	struct vmpressure *vmpr = cg_to_vmpressure(cg);
 	struct vmpressure_event *ev;
+	char strlevel[32], strtrigger[32] = "always";
 	int level;
 
+	if ((sscanf(args, "%s %s\n", strlevel, strtrigger) > 2))
+		return -EINVAL;
+
 	for (level = 0; level < VMPRESSURE_NUM_LEVELS; level++) {
-		if (!strcmp(vmpressure_str_levels[level], args))
+		if (!strcmp(vmpressure_str_levels[level], strlevel))
 			break;
 	}
 
@@ -320,6 +329,11 @@ int vmpressure_register_event(struct cgroup *cg, struct cftype *cft,
 	ev->efd = eventfd;
 	ev->level = level;
 
+	if (!strcmp(strtrigger, "edge"))
+		ev->edge_trigger = true;
+	else
+		ev->edge_trigger = false;
+
 	mutex_lock(&vmpr->events_lock);
 	list_add(&ev->node, &vmpr->events);
 	mutex_unlock(&vmpr->events_lock);
@@ -371,4 +385,5 @@ void vmpressure_init(struct vmpressure *vmpr)
 	mutex_init(&vmpr->events_lock);
 	INIT_LIST_HEAD(&vmpr->events);
 	INIT_WORK(&vmpr->work, vmpressure_work_fn);
+	vmpr->last_level = -1;
 }
-- 
1.7.9.5


--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply related	[flat|nested] 50+ messages in thread

* Re: [PATCH v3] memcg: event control at vmpressure.
  2013-06-17 11:30 [PATCH v3] memcg: event control at vmpressure Hyunhee Kim
@ 2013-06-17 13:15 ` Michal Hocko
  2013-06-18  6:10   ` Hyunhee Kim
  0 siblings, 1 reply; 50+ messages in thread
From: Michal Hocko @ 2013-06-17 13:15 UTC (permalink / raw)
  To: Hyunhee Kim
  Cc: 'Anton Vorontsov',
	linux-mm, akpm, rob, kamezawa.hiroyu, hannes, rientjes, kirill,
	'Kyungmin Park'

On Mon 17-06-13 20:30:11, Hyunhee Kim wrote:
[...]
> diff --git a/mm/vmpressure.c b/mm/vmpressure.c
> index 736a601..a18fdb3 100644
> --- a/mm/vmpressure.c
> +++ b/mm/vmpressure.c
[...]
> @@ -150,14 +151,16 @@ static bool vmpressure_event(struct vmpressure *vmpr,
>  	level = vmpressure_calc_level(scanned, reclaimed);
>  
>  	mutex_lock(&vmpr->events_lock);
> -
>  	list_for_each_entry(ev, &vmpr->events, node) {
>  		if (level >= ev->level) {
> +			if (ev->edge_trigger && (level == vmpr->last_level

> +				|| level != ev->level))

Hmm, why this differs from the "always" semantic? You do not want to see
lower events? Why?

> +				continue;
>  			eventfd_signal(ev->efd, 1);
>  			signalled = true;
>  		}
>  	}
> -
> +	vmpr->last_level = level;
>  	mutex_unlock(&vmpr->events_lock);

I have already asked in the previous version but there was no answer for
it. So let's try again.

What is the expected semantic when an event is triggered but there is
nobody to consume it?
I am not sure that the current implementation is practical. Say that
last_level is LOW and you just registered your event. Should you see the
first event or not?

I think that last_level should be per-vmpressure_event and the edge
would be defined as the even seen for the first time since registration.

>  	return signalled;
> @@ -290,9 +293,11 @@ void vmpressure_prio(gfp_t gfp, struct mem_cgroup *memcg, int prio)
>   *
>   * This function associates eventfd context with the vmpressure
>   * infrastructure, so that the notifications will be delivered to the
> - * @eventfd. The @args parameter is a string that denotes pressure level
> + * @eventfd. The @args parameters are a string that denotes pressure level
>   * threshold (one of vmpressure_str_levels, i.e. "low", "medium", or
> - * "critical").
> + * "critical") and a trigger option that decides whether events are triggered
> + * continuously or only on edge ("always" or "edge" if "edge", only the current
> + * pressure level is triggered when the pressure level changes.
>   *
>   * This function should not be used directly, just pass it to (struct
>   * cftype).register_event, and then cgroup core will handle everything by
> @@ -303,10 +308,14 @@ int vmpressure_register_event(struct cgroup *cg, struct cftype *cft,
>  {
>  	struct vmpressure *vmpr = cg_to_vmpressure(cg);
>  	struct vmpressure_event *ev;
> +	char strlevel[32], strtrigger[32] = "always";
>  	int level;
>  
> +	if ((sscanf(args, "%s %s\n", strlevel, strtrigger) > 2))
> +		return -EINVAL;

Ouch! You should rather not let your users write over your stack, should
you?

> +
>  	for (level = 0; level < VMPRESSURE_NUM_LEVELS; level++) {
> -		if (!strcmp(vmpressure_str_levels[level], args))
> +		if (!strcmp(vmpressure_str_levels[level], strlevel))
>  			break;
>  	}
>  
[...]
-- 
Michal Hocko
SUSE Labs

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 50+ messages in thread

* Re: [PATCH v3] memcg: event control at vmpressure.
  2013-06-17 13:15 ` Michal Hocko
@ 2013-06-18  6:10   ` Hyunhee Kim
  2013-06-18  8:00     ` Hyunhee Kim
  0 siblings, 1 reply; 50+ messages in thread
From: Hyunhee Kim @ 2013-06-18  6:10 UTC (permalink / raw)
  To: Michal Hocko
  Cc: Anton Vorontsov, linux-mm, akpm, rob, kamezawa.hiroyu, hannes,
	rientjes, kirill, Kyungmin Park

2013/6/17 Michal Hocko <mhocko@suse.cz>:
> On Mon 17-06-13 20:30:11, Hyunhee Kim wrote:
> [...]
>> diff --git a/mm/vmpressure.c b/mm/vmpressure.c
>> index 736a601..a18fdb3 100644
>> --- a/mm/vmpressure.c
>> +++ b/mm/vmpressure.c
> [...]
>> @@ -150,14 +151,16 @@ static bool vmpressure_event(struct vmpressure *vmpr,
>>       level = vmpressure_calc_level(scanned, reclaimed);
>>
>>       mutex_lock(&vmpr->events_lock);
>> -
>>       list_for_each_entry(ev, &vmpr->events, node) {
>>               if (level >= ev->level) {
>> +                     if (ev->edge_trigger && (level == vmpr->last_level
>
>> +                             || level != ev->level))
>
> Hmm, why this differs from the "always" semantic? You do not want to see
> lower events? Why?

Yes, I didn't want to see every lower level events whenever the higher
level event occurs because the higher event signal implies that the
lower memory situation also occurs. For example, if CRITICAL event
occurs, it means that LOW and MEDIUM also occur. So, I think that
triggering these lower level events are redundant. And, some users
don't want to see this every event. But, I think that if I don't want
to see lower events, I should add another option. Currently, as you
mentioned, for edge_trigger option, I'll remove "level != ev->level"
part.

>
>> +                             continue;
>>                       eventfd_signal(ev->efd, 1);
>>                       signalled = true;
>>               }
>>       }
>> -
>> +     vmpr->last_level = level;
>>       mutex_unlock(&vmpr->events_lock);
>
> I have already asked in the previous version but there was no answer for
> it. So let's try again.
>
> What is the expected semantic when an event is triggered but there is
> nobody to consume it?
> I am not sure that the current implementation is practical. Say that
> last_level is LOW and you just registered your event. Should you see the
> first event or not?
>
> I think that last_level should be per-vmpressure_event and the edge
> would be defined as the even seen for the first time since registration.

Right. The current implementation could not cover those situations. As
you mentioned, I think that this could be solved by having last_level
per vmpressure_event (after removing "level != ev->level"). If
last_level of each event is set to valid level only after the first
event is signaled, we cannot miss the first signal even when an event
is registered in the middle of runtime.

>
>>       return signalled;
>> @@ -290,9 +293,11 @@ void vmpressure_prio(gfp_t gfp, struct mem_cgroup *memcg, int prio)
>>   *
>>   * This function associates eventfd context with the vmpressure
>>   * infrastructure, so that the notifications will be delivered to the
>> - * @eventfd. The @args parameter is a string that denotes pressure level
>> + * @eventfd. The @args parameters are a string that denotes pressure level
>>   * threshold (one of vmpressure_str_levels, i.e. "low", "medium", or
>> - * "critical").
>> + * "critical") and a trigger option that decides whether events are triggered
>> + * continuously or only on edge ("always" or "edge" if "edge", only the current
>> + * pressure level is triggered when the pressure level changes.
>>   *
>>   * This function should not be used directly, just pass it to (struct
>>   * cftype).register_event, and then cgroup core will handle everything by
>> @@ -303,10 +308,14 @@ int vmpressure_register_event(struct cgroup *cg, struct cftype *cft,
>>  {
>>       struct vmpressure *vmpr = cg_to_vmpressure(cg);
>>       struct vmpressure_event *ev;
>> +     char strlevel[32], strtrigger[32] = "always";
>>       int level;
>>
>> +     if ((sscanf(args, "%s %s\n", strlevel, strtrigger) > 2))
>> +             return -EINVAL;
>
> Ouch! You should rather not let your users write over your stack, should
> you?
>

Sorry, this should be fixed in the next patch.

>> +
>>       for (level = 0; level < VMPRESSURE_NUM_LEVELS; level++) {
>> -             if (!strcmp(vmpressure_str_levels[level], args))
>> +             if (!strcmp(vmpressure_str_levels[level], strlevel))
>>                       break;
>>       }
>>
> [...]
> --
> Michal Hocko
> SUSE Labs
>
> --
> To unsubscribe, send a message with 'unsubscribe linux-mm' in
> the body to majordomo@kvack.org.  For more info on Linux MM,
> see: http://www.linux-mm.org/ .
> Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

Thanks,
Hyunhee Kim.

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 50+ messages in thread

* Re: [PATCH v3] memcg: event control at vmpressure.
  2013-06-18  6:10   ` Hyunhee Kim
@ 2013-06-18  8:00     ` Hyunhee Kim
  2013-06-18 11:01       ` Michal Hocko
  0 siblings, 1 reply; 50+ messages in thread
From: Hyunhee Kim @ 2013-06-18  8:00 UTC (permalink / raw)
  To: Michal Hocko
  Cc: Anton Vorontsov, linux-mm, akpm, rob, kamezawa.hiroyu, hannes,
	rientjes, kirill, Kyungmin Park

2013/6/18 Hyunhee Kim <hyunhee.kim@samsung.com>:
> 2013/6/17 Michal Hocko <mhocko@suse.cz>:
>> On Mon 17-06-13 20:30:11, Hyunhee Kim wrote:
>> [...]
>>> diff --git a/mm/vmpressure.c b/mm/vmpressure.c
>>> index 736a601..a18fdb3 100644
>>> --- a/mm/vmpressure.c
>>> +++ b/mm/vmpressure.c
>> [...]
>>> @@ -150,14 +151,16 @@ static bool vmpressure_event(struct vmpressure *vmpr,
>>>       level = vmpressure_calc_level(scanned, reclaimed);
>>>
>>>       mutex_lock(&vmpr->events_lock);
>>> -
>>>       list_for_each_entry(ev, &vmpr->events, node) {
>>>               if (level >= ev->level) {
>>> +                     if (ev->edge_trigger && (level == vmpr->last_level
>>
>>> +                             || level != ev->level))
>>
>> Hmm, why this differs from the "always" semantic? You do not want to see
>> lower events? Why?
>
> Yes, I didn't want to see every lower level events whenever the higher
> level event occurs because the higher event signal implies that the
> lower memory situation also occurs. For example, if CRITICAL event
> occurs, it means that LOW and MEDIUM also occur. So, I think that
> triggering these lower level events are redundant. And, some users
> don't want to see this every event. But, I think that if I don't want
> to see lower events, I should add another option. Currently, as you
> mentioned, for edge_trigger option, I'll remove "level != ev->level"
> part.
>
>>
>>> +                             continue;
>>>                       eventfd_signal(ev->efd, 1);
>>>                       signalled = true;
>>>               }
>>>       }
>>> -
>>> +     vmpr->last_level = level;
>>>       mutex_unlock(&vmpr->events_lock);
>>
>> I have already asked in the previous version but there was no answer for
>> it. So let's try again.
>>
>> What is the expected semantic when an event is triggered but there is
>> nobody to consume it?
>> I am not sure that the current implementation is practical. Say that
>> last_level is LOW and you just registered your event. Should you see the
>> first event or not?
>>
>> I think that last_level should be per-vmpressure_event and the edge
>> would be defined as the even seen for the first time since registration.
>
> Right. The current implementation could not cover those situations. As
> you mentioned, I think that this could be solved by having last_level
> per vmpressure_event (after removing "level != ev->level"). If
> last_level of each event is set to valid level only after the first
> event is signaled, we cannot miss the first signal even when an event
> is registered in the middle of runtime.
>

How about initializing vmpr->last_level = -1 everytime new event is
registered? (having last_level per vmpr). I think that if we have
last_level for each event, only new event could be triggered when the
current level is same as the last level. And I think that this is a
little awkward. But, if we set vmpr->last_level = -1 when new event is
registered, we can see all events with new event even though the level
is not changed.

>>
>>>       return signalled;
>>> @@ -290,9 +293,11 @@ void vmpressure_prio(gfp_t gfp, struct mem_cgroup *memcg, int prio)
>>>   *
>>>   * This function associates eventfd context with the vmpressure
>>>   * infrastructure, so that the notifications will be delivered to the
>>> - * @eventfd. The @args parameter is a string that denotes pressure level
>>> + * @eventfd. The @args parameters are a string that denotes pressure level
>>>   * threshold (one of vmpressure_str_levels, i.e. "low", "medium", or
>>> - * "critical").
>>> + * "critical") and a trigger option that decides whether events are triggered
>>> + * continuously or only on edge ("always" or "edge" if "edge", only the current
>>> + * pressure level is triggered when the pressure level changes.
>>>   *
>>>   * This function should not be used directly, just pass it to (struct
>>>   * cftype).register_event, and then cgroup core will handle everything by
>>> @@ -303,10 +308,14 @@ int vmpressure_register_event(struct cgroup *cg, struct cftype *cft,
>>>  {
>>>       struct vmpressure *vmpr = cg_to_vmpressure(cg);
>>>       struct vmpressure_event *ev;
>>> +     char strlevel[32], strtrigger[32] = "always";
>>>       int level;
>>>
>>> +     if ((sscanf(args, "%s %s\n", strlevel, strtrigger) > 2))
>>> +             return -EINVAL;
>>
>> Ouch! You should rather not let your users write over your stack, should
>> you?
>>
>
> Sorry, this should be fixed in the next patch.
>
>>> +
>>>       for (level = 0; level < VMPRESSURE_NUM_LEVELS; level++) {
>>> -             if (!strcmp(vmpressure_str_levels[level], args))
>>> +             if (!strcmp(vmpressure_str_levels[level], strlevel))
>>>                       break;
>>>       }
>>>
>> [...]
>> --
>> Michal Hocko
>> SUSE Labs
>>
>> --
>> To unsubscribe, send a message with 'unsubscribe linux-mm' in
>> the body to majordomo@kvack.org.  For more info on Linux MM,
>> see: http://www.linux-mm.org/ .
>> Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>
>
> Thanks,
> Hyunhee Kim.

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 50+ messages in thread

* Re: [PATCH v3] memcg: event control at vmpressure.
  2013-06-18  8:00     ` Hyunhee Kim
@ 2013-06-18 11:01       ` Michal Hocko
  2013-06-19 11:25         ` Hyunhee Kim
  2013-06-19 11:31         ` [PATCH v4] " Hyunhee Kim
  0 siblings, 2 replies; 50+ messages in thread
From: Michal Hocko @ 2013-06-18 11:01 UTC (permalink / raw)
  To: Hyunhee Kim
  Cc: Anton Vorontsov, linux-mm, akpm, rob, kamezawa.hiroyu, hannes,
	rientjes, kirill, Kyungmin Park

On Tue 18-06-13 17:00:06, Hyunhee Kim wrote:
> 2013/6/18 Hyunhee Kim <hyunhee.kim@samsung.com>:
> > 2013/6/17 Michal Hocko <mhocko@suse.cz>:
> >> On Mon 17-06-13 20:30:11, Hyunhee Kim wrote:
> >> [...]
> >>> diff --git a/mm/vmpressure.c b/mm/vmpressure.c
> >>> index 736a601..a18fdb3 100644
> >>> --- a/mm/vmpressure.c
> >>> +++ b/mm/vmpressure.c
> >> [...]
> >>> @@ -150,14 +151,16 @@ static bool vmpressure_event(struct vmpressure *vmpr,
> >>>       level = vmpressure_calc_level(scanned, reclaimed);
> >>>
> >>>       mutex_lock(&vmpr->events_lock);
> >>> -
> >>>       list_for_each_entry(ev, &vmpr->events, node) {
> >>>               if (level >= ev->level) {
> >>> +                     if (ev->edge_trigger && (level == vmpr->last_level
> >>
> >>> +                             || level != ev->level))
> >>
> >> Hmm, why this differs from the "always" semantic? You do not want to see
> >> lower events? Why?
> >
> > Yes, I didn't want to see every lower level events whenever the higher
> > level event occurs because the higher event signal implies that the
> > lower memory situation also occurs. 

Is there any guarantee that such a condition would be also signalled?

> > For example, if CRITICAL event
> > occurs, it means that LOW and MEDIUM also occur. So, I think that
> > triggering these lower level events are redundant. And, some users
> > don't want to see this every event. But, I think that if I don't want
> > to see lower events, I should add another option.

I think the interface should be consistent with `always' unless there is
very good reason to do otherwise.

> > Currently, as you mentioned, for edge_trigger option, I'll remove
> > "level != ev->level" part.
> >
> >>
> >>> +                             continue;
> >>>                       eventfd_signal(ev->efd, 1);
> >>>                       signalled = true;
> >>>               }
> >>>       }
> >>> -
> >>> +     vmpr->last_level = level;
> >>>       mutex_unlock(&vmpr->events_lock);
> >>
> >> I have already asked in the previous version but there was no answer for
> >> it. So let's try again.
> >>
> >> What is the expected semantic when an event is triggered but there is
> >> nobody to consume it?
> >> I am not sure that the current implementation is practical. Say that
> >> last_level is LOW and you just registered your event. Should you see the
> >> first event or not?
> >>
> >> I think that last_level should be per-vmpressure_event and the edge
> >> would be defined as the even seen for the first time since registration.
> >
> > Right. The current implementation could not cover those situations. As
> > you mentioned, I think that this could be solved by having last_level
> > per vmpressure_event (after removing "level != ev->level"). If
> > last_level of each event is set to valid level only after the first
> > event is signaled, we cannot miss the first signal even when an event
> > is registered in the middle of runtime.
> >
> 
> How about initializing vmpr->last_level = -1 everytime new event is
> registered? (having last_level per vmpr). 

So all those consumers that have seen an event already would be
surprised that they get the very same event again without transition to
other level (so it won't be edge triggered anymore). No this doesn't
make any sense to me.

Please try to think about the interface, what it is supposed to do and
how it is supposed to behave. The current implementation seems hackish
to me and it is an example of a single-use-case-designed interface which
tend to be hard to maintain and a bad idea in long term.

> I think that if we have last_level for each event, only new event
> could be triggered when the current level is same as the last
> level. And I think that this is a little awkward.

Why? I might be wrong here but when I register an event I would like to
get a notification when the event is triggered for the first time from
my POV. I have no way to find out that such an event has been already
triggered for somebody else.

> But, if we set vmpr->last_level = -1 when new event is registered,
> we can see all events with new event even though the level is not
> changed.

Which basically ruins the idea of the edge triggered event.

[...]
-- 
Michal Hocko
SUSE Labs

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 50+ messages in thread

* Re: [PATCH v3] memcg: event control at vmpressure.
  2013-06-18 11:01       ` Michal Hocko
@ 2013-06-19 11:25         ` Hyunhee Kim
  2013-06-19 11:59           ` Michal Hocko
  2013-06-19 11:31         ` [PATCH v4] " Hyunhee Kim
  1 sibling, 1 reply; 50+ messages in thread
From: Hyunhee Kim @ 2013-06-19 11:25 UTC (permalink / raw)
  To: Michal Hocko
  Cc: Anton Vorontsov, linux-mm, akpm, rob, kamezawa.hiroyu, hannes,
	rientjes, kirill, Kyungmin Park

2013/6/18 Michal Hocko <mhocko@suse.cz>:
> On Tue 18-06-13 17:00:06, Hyunhee Kim wrote:
>> 2013/6/18 Hyunhee Kim <hyunhee.kim@samsung.com>:
>> > 2013/6/17 Michal Hocko <mhocko@suse.cz>:
>> >> On Mon 17-06-13 20:30:11, Hyunhee Kim wrote:
>> >> [...]
>> >>> diff --git a/mm/vmpressure.c b/mm/vmpressure.c
>> >>> index 736a601..a18fdb3 100644
>> >>> --- a/mm/vmpressure.c
>> >>> +++ b/mm/vmpressure.c
>> >> [...]
>> >>> @@ -150,14 +151,16 @@ static bool vmpressure_event(struct vmpressure *vmpr,
>> >>>       level = vmpressure_calc_level(scanned, reclaimed);
>> >>>
>> >>>       mutex_lock(&vmpr->events_lock);
>> >>> -
>> >>>       list_for_each_entry(ev, &vmpr->events, node) {
>> >>>               if (level >= ev->level) {
>> >>> +                     if (ev->edge_trigger && (level == vmpr->last_level
>> >>
>> >>> +                             || level != ev->level))
>> >>
>> >> Hmm, why this differs from the "always" semantic? You do not want to see
>> >> lower events? Why?
>> >
>> > Yes, I didn't want to see every lower level events whenever the higher
>> > level event occurs because the higher event signal implies that the
>> > lower memory situation also occurs.
>
> Is there any guarantee that such a condition would be also signalled?

I think so. In the original implementation, ev is signaled if (level
>= ev->level).
This means that on "level == CRITICAL", LOW and MEDIUM are always
signaled if these are registered and somebody listen to them.
What I wanted to do can be seperated two parts: (1) don't send signals
if the current level is same as the last level. (2) only send the
current level not every lower level.
But, I think that (1) is more close to "edge trigger" and I'll
implement "edge trigger".

>
>> > For example, if CRITICAL event
>> > occurs, it means that LOW and MEDIUM also occur. So, I think that
>> > triggering these lower level events are redundant. And, some users
>> > don't want to see this every event. But, I think that if I don't want
>> > to see lower events, I should add another option.
>
> I think the interface should be consistent with `always' unless there is
> very good reason to do otherwise.
>
>> > Currently, as you mentioned, for edge_trigger option, I'll remove
>> > "level != ev->level" part.
>> >
>> >>
>> >>> +                             continue;
>> >>>                       eventfd_signal(ev->efd, 1);
>> >>>                       signalled = true;
>> >>>               }
>> >>>       }
>> >>> -
>> >>> +     vmpr->last_level = level;
>> >>>       mutex_unlock(&vmpr->events_lock);
>> >>
>> >> I have already asked in the previous version but there was no answer for
>> >> it. So let's try again.
>> >>
>> >> What is the expected semantic when an event is triggered but there is
>> >> nobody to consume it?
>> >> I am not sure that the current implementation is practical. Say that
>> >> last_level is LOW and you just registered your event. Should you see the
>> >> first event or not?
>> >>
>> >> I think that last_level should be per-vmpressure_event and the edge
>> >> would be defined as the even seen for the first time since registration.
>> >
>> > Right. The current implementation could not cover those situations. As
>> > you mentioned, I think that this could be solved by having last_level
>> > per vmpressure_event (after removing "level != ev->level"). If
>> > last_level of each event is set to valid level only after the first
>> > event is signaled, we cannot miss the first signal even when an event
>> > is registered in the middle of runtime.
>> >
>>
>> How about initializing vmpr->last_level = -1 everytime new event is
>> registered? (having last_level per vmpr).
>
> So all those consumers that have seen an event already would be
> surprised that they get the very same event again without transition to
> other level (so it won't be edge triggered anymore). No this doesn't
> make any sense to me.

Right. As you mentioned, if edge trigger option is enabled but same
events are triggered,
it does not mean edge triggered any more. I'll modify it to keep
last_level per vmpressure event.

>
> Please try to think about the interface, what it is supposed to do and
> how it is supposed to behave. The current implementation seems hackish
> to me and it is an example of a single-use-case-designed interface which
> tend to be hard to maintain and a bad idea in long term.
>
>> I think that if we have last_level for each event, only new event
>> could be triggered when the current level is same as the last
>> level. And I think that this is a little awkward.
>
> Why? I might be wrong here but when I register an event I would like to
> get a notification when the event is triggered for the first time from
> my POV. I have no way to find out that such an event has been already
> triggered for somebody else.

Okay, I think that I should keep the meaning of "edge trigger".

>
>> But, if we set vmpr->last_level = -1 when new event is registered,
>> we can see all events with new event even though the level is not
>> changed.
>
> Which basically ruins the idea of the edge triggered event.

Agreed.

Thanks for the comments.
I'll update the patch soon.

Hyunhee Kim.

>
> [...]
> --
> Michal Hocko
> SUSE Labs
>
> --
> To unsubscribe, send a message with 'unsubscribe linux-mm' in
> the body to majordomo@kvack.org.  For more info on Linux MM,
> see: http://www.linux-mm.org/ .
> Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 50+ messages in thread

* [PATCH v4] memcg: event control at vmpressure.
  2013-06-18 11:01       ` Michal Hocko
  2013-06-19 11:25         ` Hyunhee Kim
@ 2013-06-19 11:31         ` Hyunhee Kim
  2013-06-19 12:53           ` Michal Hocko
  1 sibling, 1 reply; 50+ messages in thread
From: Hyunhee Kim @ 2013-06-19 11:31 UTC (permalink / raw)
  To: 'Michal Hocko', 'Anton Vorontsov',
	linux-mm, akpm, rob, kamezawa.hiroyu, hannes, rientjes, kirill
  Cc: 'Kyungmin Park'

In the original vmpressure, events are triggered whenever there is a reclaim
activity. This becomes overheads to user space module and also increases
power consumption if there is somebody to listen to it. This patch provides
options to trigger events only when the pressure level changes.
This trigger option can be set when registering each event by writing
a trigger option, "edge" or "always", next to the string of levels.
"edge" means that the event is triggered only when the pressure level is changed.
"always" means that events are triggered whenever there is a reclaim process.
To keep backward compatibility, "always" is set by default if nothing is input
as an option. Each event can have different option. For example,
"low" level uses "always" trigger option to see reclaim activity at user space
while "medium"/"critical" uses "edge" to do an important job
like killing tasks only once.

Signed-off-by: Hyunhee Kim <hyunhee.kim@samsung.com>
Signed-off-by: Kyungmin Park <kyungmin.park@samsung.com>
---
 Documentation/cgroups/memory.txt |   12 ++++++++++--
 mm/vmpressure.c                  |   32 ++++++++++++++++++++++++++++----
 2 files changed, 38 insertions(+), 6 deletions(-)

diff --git a/Documentation/cgroups/memory.txt b/Documentation/cgroups/memory.txt
index ddf4f93..181a11f 100644
--- a/Documentation/cgroups/memory.txt
+++ b/Documentation/cgroups/memory.txt
@@ -791,6 +791,13 @@ way to trigger. Applications should do whatever they can to help the
 system. It might be too late to consult with vmstat or any other
 statistics, so it's advisable to take an immediate action.
 
+Events can be triggered whenever there is a reclaim activity or
+only when the pressure level changes. Trigger option is decided
+by writing it next to level. The event whose trigger option is "always"
+is triggered whenever there is a reclaim process. If "edge" is set,
+the event is triggered only when the level is changed.
+If the trigger option is not set, "always" is set by default.
+
 The events are propagated upward until the event is handled, i.e. the
 events are not pass-through. Here is what this means: for example you have
 three cgroups: A->B->C. Now you set up an event listener on cgroups A, B
@@ -807,7 +814,8 @@ register a notification, an application must:
 
 - create an eventfd using eventfd(2);
 - open memory.pressure_level;
-- write string like "<event_fd> <fd of memory.pressure_level> <level>"
+- write string like
+	"<event_fd> <fd of memory.pressure_level> <level> <trigger_option>"
   to cgroup.event_control.
 
 Application will be notified through eventfd when memory pressure is at
@@ -823,7 +831,7 @@ Test:
    # cd /sys/fs/cgroup/memory/
    # mkdir foo
    # cd foo
-   # cgroup_event_listener memory.pressure_level low &
+   # cgroup_event_listener memory.pressure_level low edge &
    # echo 8000000 > memory.limit_in_bytes
    # echo 8000000 > memory.memsw.limit_in_bytes
    # echo $$ > tasks
diff --git a/mm/vmpressure.c b/mm/vmpressure.c
index 736a601..4f676b8 100644
--- a/mm/vmpressure.c
+++ b/mm/vmpressure.c
@@ -137,6 +137,8 @@ static enum vmpressure_levels vmpressure_calc_level(unsigned long scanned,
 struct vmpressure_event {
 	struct eventfd_ctx *efd;
 	enum vmpressure_levels level;
+	int last_level;
+	bool edge_trigger;
 	struct list_head node;
 };
 
@@ -153,11 +155,14 @@ static bool vmpressure_event(struct vmpressure *vmpr,
 
 	list_for_each_entry(ev, &vmpr->events, node) {
 		if (level >= ev->level) {
+			if (ev->edge_trigger && level == ev->last_level)
+				continue;
+
 			eventfd_signal(ev->efd, 1);
 			signalled = true;
 		}
+		ev->last_level = level;
 	}
-
 	mutex_unlock(&vmpr->events_lock);
 
 	return signalled;
@@ -290,9 +295,11 @@ void vmpressure_prio(gfp_t gfp, struct mem_cgroup *memcg, int prio)
  *
  * This function associates eventfd context with the vmpressure
  * infrastructure, so that the notifications will be delivered to the
- * @eventfd. The @args parameter is a string that denotes pressure level
+ * @eventfd. The @args parameters are a string that denotes pressure level
  * threshold (one of vmpressure_str_levels, i.e. "low", "medium", or
- * "critical").
+ * "critical") and a trigger option that decides whether events are triggered
+ * continuously or only on edge ("always" or "edge" if "edge", events
+ * are triggered when the pressure level changes.
  *
  * This function should not be used directly, just pass it to (struct
  * cftype).register_event, and then cgroup core will handle everything by
@@ -303,10 +310,21 @@ int vmpressure_register_event(struct cgroup *cg, struct cftype *cft,
 {
 	struct vmpressure *vmpr = cg_to_vmpressure(cg);
 	struct vmpressure_event *ev;
+	char *strlevel = NULL, *strtrigger = NULL, *p = NULL;
 	int level;
 
+	p = strchr(args, ' ');
+
+	if (p) {
+		strtrigger = p + 1;
+		*p = '\0';
+		strlevel = (char *)args;
+	} else {
+		strlevel = (char *)args;
+	}
+
 	for (level = 0; level < VMPRESSURE_NUM_LEVELS; level++) {
-		if (!strcmp(vmpressure_str_levels[level], args))
+		if (!strcmp(vmpressure_str_levels[level], strlevel))
 			break;
 	}
 
@@ -319,6 +337,12 @@ int vmpressure_register_event(struct cgroup *cg, struct cftype *cft,
 
 	ev->efd = eventfd;
 	ev->level = level;
+	ev->last_level = -1;
+
+	if (strtrigger && !strcmp(strtrigger, "edge"))
+		ev->edge_trigger = true;
+	else
+		ev->edge_trigger = false;
 
 	mutex_lock(&vmpr->events_lock);
 	list_add(&ev->node, &vmpr->events);
-- 
1.7.9.5


--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply related	[flat|nested] 50+ messages in thread

* Re: [PATCH v3] memcg: event control at vmpressure.
  2013-06-19 11:25         ` Hyunhee Kim
@ 2013-06-19 11:59           ` Michal Hocko
  0 siblings, 0 replies; 50+ messages in thread
From: Michal Hocko @ 2013-06-19 11:59 UTC (permalink / raw)
  To: Hyunhee Kim
  Cc: Anton Vorontsov, linux-mm, akpm, rob, kamezawa.hiroyu, hannes,
	rientjes, kirill, Kyungmin Park

On Wed 19-06-13 20:25:03, Hyunhee Kim wrote:
> 2013/6/18 Michal Hocko <mhocko@suse.cz>:
> > On Tue 18-06-13 17:00:06, Hyunhee Kim wrote:
> >> 2013/6/18 Hyunhee Kim <hyunhee.kim@samsung.com>:
> >> > 2013/6/17 Michal Hocko <mhocko@suse.cz>:
> >> >> On Mon 17-06-13 20:30:11, Hyunhee Kim wrote:
> >> >> [...]
> >> >>> diff --git a/mm/vmpressure.c b/mm/vmpressure.c
> >> >>> index 736a601..a18fdb3 100644
> >> >>> --- a/mm/vmpressure.c
> >> >>> +++ b/mm/vmpressure.c
> >> >> [...]
> >> >>> @@ -150,14 +151,16 @@ static bool vmpressure_event(struct vmpressure *vmpr,
> >> >>>       level = vmpressure_calc_level(scanned, reclaimed);
> >> >>>
> >> >>>       mutex_lock(&vmpr->events_lock);
> >> >>> -
> >> >>>       list_for_each_entry(ev, &vmpr->events, node) {
> >> >>>               if (level >= ev->level) {
> >> >>> +                     if (ev->edge_trigger && (level == vmpr->last_level
> >> >>
> >> >>> +                             || level != ev->level))
> >> >>
> >> >> Hmm, why this differs from the "always" semantic? You do not want to see
> >> >> lower events? Why?
> >> >
> >> > Yes, I didn't want to see every lower level events whenever the higher
> >> > level event occurs because the higher event signal implies that the
> >> > lower memory situation also occurs.
> >
> > Is there any guarantee that such a condition would be also signalled?
> 
> I think so. In the original implementation, ev is signaled if (level
> >= ev->level).
> This means that on "level == CRITICAL", LOW and MEDIUM are always
> signaled if these are registered and somebody listen to them.

But there is no guarantee that LOW and/or MEDIUM events are triggered
actually because the vmpressure calculation can jump directly to
CRITICAL. Just imagine a case when it is really hard to reclaim anything
at all (pages are dirty and need to be written back first etc.).

> What I wanted to do can be seperated two parts: (1) don't send signals
> if the current level is same as the last level. (2) only send the
> current level not every lower level.
>
> But, I think that (1) is more close to "edge trigger" and I'll
> implement "edge trigger".

OK
-- 
Michal Hocko
SUSE Labs

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 50+ messages in thread

* Re: [PATCH v4] memcg: event control at vmpressure.
  2013-06-19 11:31         ` [PATCH v4] " Hyunhee Kim
@ 2013-06-19 12:53           ` Michal Hocko
  2013-06-20  2:13             ` Hyunhee Kim
  2013-06-20  2:17             ` [PATCH v5] " Hyunhee Kim
  0 siblings, 2 replies; 50+ messages in thread
From: Michal Hocko @ 2013-06-19 12:53 UTC (permalink / raw)
  To: Hyunhee Kim
  Cc: 'Anton Vorontsov',
	linux-mm, akpm, rob, kamezawa.hiroyu, hannes, rientjes, kirill,
	'Kyungmin Park'

OK, this looks much better. Few nitpicks bellow.

On Wed 19-06-13 20:31:16, Hyunhee Kim wrote:
> In the original vmpressure, events are triggered whenever there is a reclaim
> activity. This becomes overheads to user space module and also increases
> power consumption if there is somebody to listen to it. This patch provides
> options to trigger events only when the pressure level changes.
> This trigger option can be set when registering each event by writing
> a trigger option, "edge" or "always", next to the string of levels.
> "edge" means that the event is triggered only when the pressure level is changed.
> "always" means that events are triggered whenever there is a reclaim process.
> To keep backward compatibility, "always" is set by default if nothing is input
> as an option. Each event can have different option. For example,
> "low" level uses "always" trigger option to see reclaim activity at user space
> while "medium"/"critical" uses "edge" to do an important job
> like killing tasks only once.
> 
> Signed-off-by: Hyunhee Kim <hyunhee.kim@samsung.com>
> Signed-off-by: Kyungmin Park <kyungmin.park@samsung.com>
> ---
>  Documentation/cgroups/memory.txt |   12 ++++++++++--
>  mm/vmpressure.c                  |   32 ++++++++++++++++++++++++++++----
>  2 files changed, 38 insertions(+), 6 deletions(-)
> 
> diff --git a/Documentation/cgroups/memory.txt b/Documentation/cgroups/memory.txt
> index ddf4f93..181a11f 100644
> --- a/Documentation/cgroups/memory.txt
> +++ b/Documentation/cgroups/memory.txt
> @@ -791,6 +791,13 @@ way to trigger. Applications should do whatever they can to help the
>  system. It might be too late to consult with vmstat or any other
>  statistics, so it's advisable to take an immediate action.
>  
> +Events can be triggered whenever there is a reclaim activity or
> +only when the pressure level changes. Trigger option is decided
> +by writing it next to level. The event whose trigger option is "always"
> +is triggered whenever there is a reclaim process. If "edge" is set,
> +the event is triggered only when the level is changed.
> +If the trigger option is not set, "always" is set by default.
> +

I would move this information bellow where the usage is described.

>  The events are propagated upward until the event is handled, i.e. the
>  events are not pass-through. Here is what this means: for example you have
>  three cgroups: A->B->C. Now you set up an event listener on cgroups A, B
[...]
> diff --git a/mm/vmpressure.c b/mm/vmpressure.c
> index 736a601..4f676b8 100644
> --- a/mm/vmpressure.c
> +++ b/mm/vmpressure.c
[...]
> @@ -303,10 +310,21 @@ int vmpressure_register_event(struct cgroup *cg, struct cftype *cft,
>  {
>  	struct vmpressure *vmpr = cg_to_vmpressure(cg);
>  	struct vmpressure_event *ev;
> +	char *strlevel = NULL, *strtrigger = NULL, *p = NULL;
>  	int level;
>  
> +	p = strchr(args, ' ');
> +
> +	if (p) {
> +		strtrigger = p + 1;
> +		*p = '\0';
> +		strlevel = (char *)args;
> +	} else {
> +		strlevel = (char *)args;
> +	}
> +

This is a total nit but this can be further simplified.
	strlevel = args;
	strtrigger = strchr(args, ' ');
	if (strtrigger) {
		*strtrigger = '\0';
		strtrigger++;
	}
I would still rather see using sscanf but that is just a matter of taste
I guess.

>  	for (level = 0; level < VMPRESSURE_NUM_LEVELS; level++) {
> -		if (!strcmp(vmpressure_str_levels[level], args))
> +		if (!strcmp(vmpressure_str_levels[level], strlevel))
>  			break;
>  	}
>  
> @@ -319,6 +337,12 @@ int vmpressure_register_event(struct cgroup *cg, struct cftype *cft,
>  
>  	ev->efd = eventfd;
>  	ev->level = level;
> +	ev->last_level = -1;
> +
> +	if (strtrigger && !strcmp(strtrigger, "edge"))
> +		ev->edge_trigger = true;
> +	else
> +		ev->edge_trigger = false;

I guess it would be more appropriate to return EINVAL if the trigger is
neither always nor edge because we might end up with abuses where
somebody start relying on "foo" implying "always".

The history tells that user interface should be really careful about not
allowing "undocumented but happen to work" behavior.

>  	mutex_lock(&vmpr->events_lock);
>  	list_add(&ev->node, &vmpr->events);
-- 
Michal Hocko
SUSE Labs

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 50+ messages in thread

* Re: [PATCH v4] memcg: event control at vmpressure.
  2013-06-19 12:53           ` Michal Hocko
@ 2013-06-20  2:13             ` Hyunhee Kim
  2013-06-20  2:17             ` [PATCH v5] " Hyunhee Kim
  1 sibling, 0 replies; 50+ messages in thread
From: Hyunhee Kim @ 2013-06-20  2:13 UTC (permalink / raw)
  To: Michal Hocko
  Cc: Anton Vorontsov, linux-mm, akpm, rob, kamezawa.hiroyu, hannes,
	rientjes, kirill, Kyungmin Park

2013/6/19 Michal Hocko <mhocko@suse.cz>:
> OK, this looks much better. Few nitpicks bellow.
>
> On Wed 19-06-13 20:31:16, Hyunhee Kim wrote:
>> In the original vmpressure, events are triggered whenever there is a reclaim
>> activity. This becomes overheads to user space module and also increases
>> power consumption if there is somebody to listen to it. This patch provides
>> options to trigger events only when the pressure level changes.
>> This trigger option can be set when registering each event by writing
>> a trigger option, "edge" or "always", next to the string of levels.
>> "edge" means that the event is triggered only when the pressure level is changed.
>> "always" means that events are triggered whenever there is a reclaim process.
>> To keep backward compatibility, "always" is set by default if nothing is input
>> as an option. Each event can have different option. For example,
>> "low" level uses "always" trigger option to see reclaim activity at user space
>> while "medium"/"critical" uses "edge" to do an important job
>> like killing tasks only once.
>>
>> Signed-off-by: Hyunhee Kim <hyunhee.kim@samsung.com>
>> Signed-off-by: Kyungmin Park <kyungmin.park@samsung.com>
>> ---
>>  Documentation/cgroups/memory.txt |   12 ++++++++++--
>>  mm/vmpressure.c                  |   32 ++++++++++++++++++++++++++++----
>>  2 files changed, 38 insertions(+), 6 deletions(-)
>>
>> diff --git a/Documentation/cgroups/memory.txt b/Documentation/cgroups/memory.txt
>> index ddf4f93..181a11f 100644
>> --- a/Documentation/cgroups/memory.txt
>> +++ b/Documentation/cgroups/memory.txt
>> @@ -791,6 +791,13 @@ way to trigger. Applications should do whatever they can to help the
>>  system. It might be too late to consult with vmstat or any other
>>  statistics, so it's advisable to take an immediate action.
>>
>> +Events can be triggered whenever there is a reclaim activity or
>> +only when the pressure level changes. Trigger option is decided
>> +by writing it next to level. The event whose trigger option is "always"
>> +is triggered whenever there is a reclaim process. If "edge" is set,
>> +the event is triggered only when the level is changed.
>> +If the trigger option is not set, "always" is set by default.
>> +
>
> I would move this information bellow where the usage is described.

Ok, I'll move this bellow the usage description.

>
>>  The events are propagated upward until the event is handled, i.e. the
>>  events are not pass-through. Here is what this means: for example you have
>>  three cgroups: A->B->C. Now you set up an event listener on cgroups A, B
> [...]
>> diff --git a/mm/vmpressure.c b/mm/vmpressure.c
>> index 736a601..4f676b8 100644
>> --- a/mm/vmpressure.c
>> +++ b/mm/vmpressure.c
> [...]
>> @@ -303,10 +310,21 @@ int vmpressure_register_event(struct cgroup *cg, struct cftype *cft,
>>  {
>>       struct vmpressure *vmpr = cg_to_vmpressure(cg);
>>       struct vmpressure_event *ev;
>> +     char *strlevel = NULL, *strtrigger = NULL, *p = NULL;
>>       int level;
>>
>> +     p = strchr(args, ' ');
>> +
>> +     if (p) {
>> +             strtrigger = p + 1;
>> +             *p = '\0';
>> +             strlevel = (char *)args;
>> +     } else {
>> +             strlevel = (char *)args;
>> +     }
>> +
>
> This is a total nit but this can be further simplified.
>         strlevel = args;
>         strtrigger = strchr(args, ' ');
>         if (strtrigger) {
>                 *strtrigger = '\0';
>                 strtrigger++;
>         }
> I would still rather see using sscanf but that is just a matter of taste
> I guess.

Thanks for the comments. This looks better and more simple.

>
>>       for (level = 0; level < VMPRESSURE_NUM_LEVELS; level++) {
>> -             if (!strcmp(vmpressure_str_levels[level], args))
>> +             if (!strcmp(vmpressure_str_levels[level], strlevel))
>>                       break;
>>       }
>>
>> @@ -319,6 +337,12 @@ int vmpressure_register_event(struct cgroup *cg, struct cftype *cft,
>>
>>       ev->efd = eventfd;
>>       ev->level = level;
>> +     ev->last_level = -1;
>> +
>> +     if (strtrigger && !strcmp(strtrigger, "edge"))
>> +             ev->edge_trigger = true;
>> +     else
>> +             ev->edge_trigger = false;
>
> I guess it would be more appropriate to return EINVAL if the trigger is
> neither always nor edge because we might end up with abuses where
> somebody start relying on "foo" implying "always".
>
> The history tells that user interface should be really careful about not
> allowing "undocumented but happen to work" behavior.

Yes, right. I overlooked the case when somebody inputs wrong option string.
I will add "return -EINVAL" when undefined string is input.

>
>>       mutex_lock(&vmpr->events_lock);
>>       list_add(&ev->node, &vmpr->events);
> --
> Michal Hocko
> SUSE Labs


Thanks for you comments.
Hyunhee Kim.


>
> --
> To unsubscribe, send a message with 'unsubscribe linux-mm' in
> the body to majordomo@kvack.org.  For more info on Linux MM,
> see: http://www.linux-mm.org/ .
> Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 50+ messages in thread

* [PATCH v5] memcg: event control at vmpressure.
  2013-06-19 12:53           ` Michal Hocko
  2013-06-20  2:13             ` Hyunhee Kim
@ 2013-06-20  2:17             ` Hyunhee Kim
  2013-06-20 12:16               ` Michal Hocko
  1 sibling, 1 reply; 50+ messages in thread
From: Hyunhee Kim @ 2013-06-20  2:17 UTC (permalink / raw)
  To: 'Michal Hocko', 'Anton Vorontsov',
	linux-mm, akpm, rob, kamezawa.hiroyu, hannes, rientjes, kirill
  Cc: 'Kyungmin Park'

In the original vmpressure, events are triggered whenever there is a reclaim
activity. This becomes overheads to user space module and also increases
power consumption if there is somebody to listen to it. This patch provides
options to trigger events only when the pressure level changes.
This trigger option can be set when registering each event by writing
a trigger option, "edge" or "always", next to the string of levels.
"edge" means that the event is triggered only when the pressure level is changed.
"always" means that events are triggered whenever there is a reclaim process.
To keep backward compatibility, "always" is set by default if nothing is input
as an option. Each event can have different option. For example,
"low" level uses "always" trigger option to see reclaim activity at user space
while "medium"/"critical" uses "edge" to do an important job
like killing tasks only once.

Signed-off-by: Hyunhee Kim <hyunhee.kim@samsung.com>
Signed-off-by: Kyungmin Park <kyungmin.park@samsung.com>
---
 Documentation/cgroups/memory.txt |   12 ++++++++++--
 mm/vmpressure.c                  |   34 ++++++++++++++++++++++++++++++----
 2 files changed, 40 insertions(+), 6 deletions(-)

diff --git a/Documentation/cgroups/memory.txt b/Documentation/cgroups/memory.txt
index ddf4f93..185870f 100644
--- a/Documentation/cgroups/memory.txt
+++ b/Documentation/cgroups/memory.txt
@@ -807,13 +807,21 @@ register a notification, an application must:
 
 - create an eventfd using eventfd(2);
 - open memory.pressure_level;
-- write string like "<event_fd> <fd of memory.pressure_level> <level>"
+- write string like
+	"<event_fd> <fd of memory.pressure_level> <level> <trigger_option>"
   to cgroup.event_control.
 
 Application will be notified through eventfd when memory pressure is at
 the specific level (or higher). Read/write operations to
 memory.pressure_level are no implemented.
 
+Events can be triggered whenever there is a reclaim activity or
+only when the pressure level changes. Trigger option is decided
+by writing it next to the level. The event whose trigger option is
+"always" is triggered whenever there is a reclaim process.
+If "edge" is set, the event is triggered only when the level is changed.
+If the trigger option is not set, "always" is set by default.
+
 Test:
 
    Here is a small script example that makes a new cgroup, sets up a
@@ -823,7 +831,7 @@ Test:
    # cd /sys/fs/cgroup/memory/
    # mkdir foo
    # cd foo
-   # cgroup_event_listener memory.pressure_level low &
+   # cgroup_event_listener memory.pressure_level low edge &
    # echo 8000000 > memory.limit_in_bytes
    # echo 8000000 > memory.memsw.limit_in_bytes
    # echo $$ > tasks
diff --git a/mm/vmpressure.c b/mm/vmpressure.c
index 736a601..3c37b12 100644
--- a/mm/vmpressure.c
+++ b/mm/vmpressure.c
@@ -137,6 +137,8 @@ static enum vmpressure_levels vmpressure_calc_level(unsigned long scanned,
 struct vmpressure_event {
 	struct eventfd_ctx *efd;
 	enum vmpressure_levels level;
+	int last_level;
+	bool edge_trigger;
 	struct list_head node;
 };
 
@@ -153,11 +155,14 @@ static bool vmpressure_event(struct vmpressure *vmpr,
 
 	list_for_each_entry(ev, &vmpr->events, node) {
 		if (level >= ev->level) {
+			if (ev->edge_trigger && level == ev->last_level)
+				continue;
+
 			eventfd_signal(ev->efd, 1);
 			signalled = true;
 		}
+		ev->last_level = level;
 	}
-
 	mutex_unlock(&vmpr->events_lock);
 
 	return signalled;
@@ -290,9 +295,11 @@ void vmpressure_prio(gfp_t gfp, struct mem_cgroup *memcg, int prio)
  *
  * This function associates eventfd context with the vmpressure
  * infrastructure, so that the notifications will be delivered to the
- * @eventfd. The @args parameter is a string that denotes pressure level
+ * @eventfd. The @args parameters are a string that denotes pressure level
  * threshold (one of vmpressure_str_levels, i.e. "low", "medium", or
- * "critical").
+ * "critical") and a trigger option that decides whether events are triggered
+ * continuously or only on edge ("always" or "edge" if "edge", events
+ * are triggered when the pressure level changes.
  *
  * This function should not be used directly, just pass it to (struct
  * cftype).register_event, and then cgroup core will handle everything by
@@ -303,10 +310,19 @@ int vmpressure_register_event(struct cgroup *cg, struct cftype *cft,
 {
 	struct vmpressure *vmpr = cg_to_vmpressure(cg);
 	struct vmpressure_event *ev;
+	char *strlevel = NULL, *strtrigger = NULL;
 	int level;
 
+	strlevel = args;
+	strtrigger = strchr(args, ' ');
+
+	if (strtrigger) {
+		*strtrigger = '\0';
+		strtrigger++;
+	}
+
 	for (level = 0; level < VMPRESSURE_NUM_LEVELS; level++) {
-		if (!strcmp(vmpressure_str_levels[level], args))
+		if (!strcmp(vmpressure_str_levels[level], strlevel))
 			break;
 	}
 
@@ -319,6 +335,16 @@ int vmpressure_register_event(struct cgroup *cg, struct cftype *cft,
 
 	ev->efd = eventfd;
 	ev->level = level;
+	ev->last_level = -1;
+
+	if (strtrigger == NULL)
+		ev->edge_trigger = false;
+	else if (!strcmp(strtrigger, "always"))
+		ev->edge_trigger = false;
+	else if (!strcmp(strtrigger, "edge"))
+		ev->edge_trigger = true;
+	else
+		return -EINVAL;
 
 	mutex_lock(&vmpr->events_lock);
 	list_add(&ev->node, &vmpr->events);
-- 
1.7.9.5


--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply related	[flat|nested] 50+ messages in thread

* Re: [PATCH v5] memcg: event control at vmpressure.
  2013-06-20  2:17             ` [PATCH v5] " Hyunhee Kim
@ 2013-06-20 12:16               ` Michal Hocko
  2013-06-21  0:21                 ` [PATCH v6] " Hyunhee Kim
  0 siblings, 1 reply; 50+ messages in thread
From: Michal Hocko @ 2013-06-20 12:16 UTC (permalink / raw)
  To: Hyunhee Kim
  Cc: 'Anton Vorontsov',
	linux-mm, akpm, rob, kamezawa.hiroyu, hannes, rientjes, kirill,
	'Kyungmin Park'

On Thu 20-06-13 11:17:39, Hyunhee Kim wrote:
[...]
> diff --git a/mm/vmpressure.c b/mm/vmpressure.c
> index 736a601..3c37b12 100644
> --- a/mm/vmpressure.c
> +++ b/mm/vmpressure.c
[...]
> @@ -303,10 +310,19 @@ int vmpressure_register_event(struct cgroup *cg, struct cftype *cft,
>  {
>  	struct vmpressure *vmpr = cg_to_vmpressure(cg);
>  	struct vmpressure_event *ev;
> +	char *strlevel = NULL, *strtrigger = NULL;

No need for initialization to NULL when both would be initialized below.

>  	int level;
>  
> +	strlevel = args;
> +	strtrigger = strchr(args, ' ');
> +
> +	if (strtrigger) {
> +		*strtrigger = '\0';
> +		strtrigger++;
> +	}
> +
>  	for (level = 0; level < VMPRESSURE_NUM_LEVELS; level++) {
> -		if (!strcmp(vmpressure_str_levels[level], args))
> +		if (!strcmp(vmpressure_str_levels[level], strlevel))
>  			break;
>  	}
>  
> @@ -319,6 +335,16 @@ int vmpressure_register_event(struct cgroup *cg, struct cftype *cft,
>  
>  	ev->efd = eventfd;
>  	ev->level = level;
> +	ev->last_level = -1;
> +
> +	if (strtrigger == NULL)
> +		ev->edge_trigger = false;
> +	else if (!strcmp(strtrigger, "always"))
> +		ev->edge_trigger = false;
> +	else if (!strcmp(strtrigger, "edge"))
> +		ev->edge_trigger = true;
> +	else
> +		return -EINVAL;

I have missed this before but this will cause a memory leak and worse it
is user controlled mem leak. Move this up after the level check.

>  	mutex_lock(&vmpr->events_lock);
>  	list_add(&ev->node, &vmpr->events);
> -- 
> 1.7.9.5
> 
> 
> --
> To unsubscribe, send a message with 'unsubscribe linux-mm' in
> the body to majordomo@kvack.org.  For more info on Linux MM,
> see: http://www.linux-mm.org/ .
> Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

-- 
Michal Hocko
SUSE Labs

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 50+ messages in thread

* [PATCH v6] memcg: event control at vmpressure.
  2013-06-20 12:16               ` Michal Hocko
@ 2013-06-21  0:21                 ` Hyunhee Kim
  2013-06-21  0:24                   ` Hyunhee Kim
  0 siblings, 1 reply; 50+ messages in thread
From: Hyunhee Kim @ 2013-06-21  0:21 UTC (permalink / raw)
  To: 'Michal Hocko', 'Anton Vorontsov',
	linux-mm, akpm, rob, kamezawa.hiroyu, hannes, rientjes, kirill
  Cc: 'Kyungmin Park'



^ permalink raw reply	[flat|nested] 50+ messages in thread

* [PATCH v6] memcg: event control at vmpressure.
  2013-06-21  0:21                 ` [PATCH v6] " Hyunhee Kim
@ 2013-06-21  0:24                   ` Hyunhee Kim
  2013-06-21  1:22                     ` Minchan Kim
  0 siblings, 1 reply; 50+ messages in thread
From: Hyunhee Kim @ 2013-06-21  0:24 UTC (permalink / raw)
  To: 'Hyunhee Kim', 'Michal Hocko',
	'Anton Vorontsov',
	linux-mm, akpm, rob, kamezawa.hiroyu, hannes, rientjes, kirill
  Cc: 'Kyungmin Park'

In the original vmpressure, events are triggered whenever there is a reclaim
activity. This becomes overheads to user space module and also increases
power consumption if there is somebody to listen to it. This patch provides
options to trigger events only when the pressure level changes.
This trigger option can be set when registering each event by writing
a trigger option, "edge" or "always", next to the string of levels.
"edge" means that the event is triggered only when the pressure level is changed.
"always" means that events are triggered whenever there is a reclaim process.
To keep backward compatibility, "always" is set by default if nothing is input
as an option. Each event can have different option. For example,
"low" level uses "always" trigger option to see reclaim activity at user space
while "medium"/"critical" uses "edge" to do an important job
like killing tasks only once.

Signed-off-by: Hyunhee Kim <hyunhee.kim@samsung.com>
Signed-off-by: Kyungmin Park <kyungmin.park@samsung.com>
---
 Documentation/cgroups/memory.txt |   12 ++++++++++--
 mm/vmpressure.c                  |   36 ++++++++++++++++++++++++++++++++----
 2 files changed, 42 insertions(+), 6 deletions(-)

diff --git a/Documentation/cgroups/memory.txt b/Documentation/cgroups/memory.txt
index ddf4f93..185870f 100644
--- a/Documentation/cgroups/memory.txt
+++ b/Documentation/cgroups/memory.txt
@@ -807,13 +807,21 @@ register a notification, an application must:
 
 - create an eventfd using eventfd(2);
 - open memory.pressure_level;
-- write string like "<event_fd> <fd of memory.pressure_level> <level>"
+- write string like
+	"<event_fd> <fd of memory.pressure_level> <level> <trigger_option>"
   to cgroup.event_control.
 
 Application will be notified through eventfd when memory pressure is at
 the specific level (or higher). Read/write operations to
 memory.pressure_level are no implemented.
 
+Events can be triggered whenever there is a reclaim activity or
+only when the pressure level changes. Trigger option is decided
+by writing it next to the level. The event whose trigger option is
+"always" is triggered whenever there is a reclaim process.
+If "edge" is set, the event is triggered only when the level is changed.
+If the trigger option is not set, "always" is set by default.
+
 Test:
 
    Here is a small script example that makes a new cgroup, sets up a
@@ -823,7 +831,7 @@ Test:
    # cd /sys/fs/cgroup/memory/
    # mkdir foo
    # cd foo
-   # cgroup_event_listener memory.pressure_level low &
+   # cgroup_event_listener memory.pressure_level low edge &
    # echo 8000000 > memory.limit_in_bytes
    # echo 8000000 > memory.memsw.limit_in_bytes
    # echo $$ > tasks
diff --git a/mm/vmpressure.c b/mm/vmpressure.c
index 736a601..a08252e 100644
--- a/mm/vmpressure.c
+++ b/mm/vmpressure.c
@@ -137,6 +137,8 @@ static enum vmpressure_levels vmpressure_calc_level(unsigned long scanned,
 struct vmpressure_event {
 	struct eventfd_ctx *efd;
 	enum vmpressure_levels level;
+	int last_level;
+	bool edge_trigger;
 	struct list_head node;
 };
 
@@ -153,11 +155,14 @@ static bool vmpressure_event(struct vmpressure *vmpr,
 
 	list_for_each_entry(ev, &vmpr->events, node) {
 		if (level >= ev->level) {
+			if (ev->edge_trigger && level == ev->last_level)
+				continue;
+
 			eventfd_signal(ev->efd, 1);
 			signalled = true;
 		}
+		ev->last_level = level;
 	}
-
 	mutex_unlock(&vmpr->events_lock);
 
 	return signalled;
@@ -290,9 +295,11 @@ void vmpressure_prio(gfp_t gfp, struct mem_cgroup *memcg, int prio)
  *
  * This function associates eventfd context with the vmpressure
  * infrastructure, so that the notifications will be delivered to the
- * @eventfd. The @args parameter is a string that denotes pressure level
+ * @eventfd. The @args parameters are a string that denotes pressure level
  * threshold (one of vmpressure_str_levels, i.e. "low", "medium", or
- * "critical").
+ * "critical") and a trigger option that decides whether events are triggered
+ * continuously or only on edge ("always" or "edge" if "edge", events
+ * are triggered when the pressure level changes.
  *
  * This function should not be used directly, just pass it to (struct
  * cftype).register_event, and then cgroup core will handle everything by
@@ -303,22 +310,43 @@ int vmpressure_register_event(struct cgroup *cg, struct cftype *cft,
 {
 	struct vmpressure *vmpr = cg_to_vmpressure(cg);
 	struct vmpressure_event *ev;
+	char *strlevel, *strtrigger;
 	int level;
+	bool trigger;
+
+	strlevel = args;
+	strtrigger = strchr(args, ' ');
+
+	if (strtrigger) {
+		*strtrigger = '\0';
+		strtrigger++;
+	}
 
 	for (level = 0; level < VMPRESSURE_NUM_LEVELS; level++) {
-		if (!strcmp(vmpressure_str_levels[level], args))
+		if (!strcmp(vmpressure_str_levels[level], strlevel))
 			break;
 	}
 
 	if (level >= VMPRESSURE_NUM_LEVELS)
 		return -EINVAL;
 
+	if (strtrigger == NULL)
+		trigger = false;
+	else if (!strcmp(strtrigger, "always"))
+		trigger = false;
+	else if (!strcmp(strtrigger, "edge"))
+		trigger = true;
+	else
+		return -EINVAL;
+
 	ev = kzalloc(sizeof(*ev), GFP_KERNEL);
 	if (!ev)
 		return -ENOMEM;
 
 	ev->efd = eventfd;
 	ev->level = level;
+	ev->last_level = -1;
+	ev->edge_trigger = trigger;
 
 	mutex_lock(&vmpr->events_lock);
 	list_add(&ev->node, &vmpr->events);
-- 
1.7.9.5


--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply related	[flat|nested] 50+ messages in thread

* Re: [PATCH v6] memcg: event control at vmpressure.
  2013-06-21  0:24                   ` Hyunhee Kim
@ 2013-06-21  1:22                     ` Minchan Kim
  2013-06-21  9:19                       ` Michal Hocko
  0 siblings, 1 reply; 50+ messages in thread
From: Minchan Kim @ 2013-06-21  1:22 UTC (permalink / raw)
  To: Hyunhee Kim
  Cc: 'Michal Hocko', 'Anton Vorontsov',
	linux-mm, akpm, rob, kamezawa.hiroyu, hannes, rientjes, kirill,
	'Kyungmin Park'

On Fri, Jun 21, 2013 at 09:24:38AM +0900, Hyunhee Kim wrote:
> In the original vmpressure, events are triggered whenever there is a reclaim
> activity. This becomes overheads to user space module and also increases

Not true.
We have lots of filter to not trigger event even if reclaim is going on.
Your statement would make confuse.

> power consumption if there is somebody to listen to it. This patch provides
> options to trigger events only when the pressure level changes.
> This trigger option can be set when registering each event by writing
> a trigger option, "edge" or "always", next to the string of levels.
> "edge" means that the event is triggered only when the pressure level is changed.
> "always" means that events are triggered whenever there is a reclaim process.
                                           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
                                                  Not true, either.

> To keep backward compatibility, "always" is set by default if nothing is input
> as an option. Each event can have different option. For example,
> "low" level uses "always" trigger option to see reclaim activity at user space
> while "medium"/"critical" uses "edge" to do an important job
> like killing tasks only once.

Question.

1. user: set critical edge
2. kernel: memory is tight and trigger event with critical
3. user: kill a program when he receives a event
4. kernel: memory is very tight again and want to trigger a event
   with critical but fail because last_level was critical and it was edge.

Right?

> 
> Signed-off-by: Hyunhee Kim <hyunhee.kim@samsung.com>
> Signed-off-by: Kyungmin Park <kyungmin.park@samsung.com>
> ---
>  Documentation/cgroups/memory.txt |   12 ++++++++++--
>  mm/vmpressure.c                  |   36 ++++++++++++++++++++++++++++++++----
>  2 files changed, 42 insertions(+), 6 deletions(-)
> 
> diff --git a/Documentation/cgroups/memory.txt b/Documentation/cgroups/memory.txt
> index ddf4f93..185870f 100644
> --- a/Documentation/cgroups/memory.txt
> +++ b/Documentation/cgroups/memory.txt
> @@ -807,13 +807,21 @@ register a notification, an application must:
>  
>  - create an eventfd using eventfd(2);
>  - open memory.pressure_level;
> -- write string like "<event_fd> <fd of memory.pressure_level> <level>"
> +- write string like
> +	"<event_fd> <fd of memory.pressure_level> <level> <trigger_option>"
>    to cgroup.event_control.
>  
>  Application will be notified through eventfd when memory pressure is at
>  the specific level (or higher). Read/write operations to
>  memory.pressure_level are no implemented.
>  
> +Events can be triggered whenever there is a reclaim activity or

Not true.

> +only when the pressure level changes. Trigger option is decided
> +by writing it next to the level. The event whose trigger option is
> +"always" is triggered whenever there is a reclaim process.

Not true.

> +If "edge" is set, the event is triggered only when the level is changed.
> +If the trigger option is not set, "always" is set by default.
> +
>  Test:
>  
>     Here is a small script example that makes a new cgroup, sets up a
> @@ -823,7 +831,7 @@ Test:
>     # cd /sys/fs/cgroup/memory/
>     # mkdir foo
>     # cd foo
> -   # cgroup_event_listener memory.pressure_level low &
> +   # cgroup_event_listener memory.pressure_level low edge &
>     # echo 8000000 > memory.limit_in_bytes
>     # echo 8000000 > memory.memsw.limit_in_bytes
>     # echo $$ > tasks
> diff --git a/mm/vmpressure.c b/mm/vmpressure.c
> index 736a601..a08252e 100644
> --- a/mm/vmpressure.c
> +++ b/mm/vmpressure.c
> @@ -137,6 +137,8 @@ static enum vmpressure_levels vmpressure_calc_level(unsigned long scanned,
>  struct vmpressure_event {
>  	struct eventfd_ctx *efd;
>  	enum vmpressure_levels level;
> +	int last_level;

int? but level is enum vmpressure_levels?

> +	bool edge_trigger;
>  	struct list_head node;
>  };
>  
> @@ -153,11 +155,14 @@ static bool vmpressure_event(struct vmpressure *vmpr,
>  
>  	list_for_each_entry(ev, &vmpr->events, node) {
>  		if (level >= ev->level) {
> +			if (ev->edge_trigger && level == ev->last_level)
> +				continue;
> +
>  			eventfd_signal(ev->efd, 1);
>  			signalled = true;
>  		}
> +		ev->last_level = level;
>  	}
> -

Unnecessary change.

>  	mutex_unlock(&vmpr->events_lock);
>  
>  	return signalled;
> @@ -290,9 +295,11 @@ void vmpressure_prio(gfp_t gfp, struct mem_cgroup *memcg, int prio)
>   *
>   * This function associates eventfd context with the vmpressure
>   * infrastructure, so that the notifications will be delivered to the
> - * @eventfd. The @args parameter is a string that denotes pressure level
> + * @eventfd. The @args parameters are a string that denotes pressure level
>   * threshold (one of vmpressure_str_levels, i.e. "low", "medium", or
> - * "critical").
> + * "critical") and a trigger option that decides whether events are triggered
> + * continuously or only on edge ("always" or "edge" if "edge", events
> + * are triggered when the pressure level changes.
>   *
>   * This function should not be used directly, just pass it to (struct
>   * cftype).register_event, and then cgroup core will handle everything by
> @@ -303,22 +310,43 @@ int vmpressure_register_event(struct cgroup *cg, struct cftype *cft,
>  {
>  	struct vmpressure *vmpr = cg_to_vmpressure(cg);
>  	struct vmpressure_event *ev;
> +	char *strlevel, *strtrigger;
>  	int level;
> +	bool trigger;

What trigger?
Would be better to use "bool egde" instead?

> +
> +	strlevel = args;
> +	strtrigger = strchr(args, ' ');
> +
> +	if (strtrigger) {
> +		*strtrigger = '\0';
> +		strtrigger++;
> +	}
>  
>  	for (level = 0; level < VMPRESSURE_NUM_LEVELS; level++) {
> -		if (!strcmp(vmpressure_str_levels[level], args))
> +		if (!strcmp(vmpressure_str_levels[level], strlevel))
>  			break;
>  	}
>  
>  	if (level >= VMPRESSURE_NUM_LEVELS)
>  		return -EINVAL;
>  
> +	if (strtrigger == NULL)
> +		trigger = false;
> +	else if (!strcmp(strtrigger, "always"))
> +		trigger = false;
> +	else if (!strcmp(strtrigger, "edge"))
> +		trigger = true;
> +	else
> +		return -EINVAL;
> +
>  	ev = kzalloc(sizeof(*ev), GFP_KERNEL);
>  	if (!ev)
>  		return -ENOMEM;
>  
>  	ev->efd = eventfd;
>  	ev->level = level;
> +	ev->last_level = -1;

VMPRESSURE_NONE is better?


> +	ev->edge_trigger = trigger;
>  
>  	mutex_lock(&vmpr->events_lock);
>  	list_add(&ev->node, &vmpr->events);
> -- 
> 1.7.9.5
> 
> 
> --
> To unsubscribe, send a message with 'unsubscribe linux-mm' in
> the body to majordomo@kvack.org.  For more info on Linux MM,
> see: http://www.linux-mm.org/ .
> Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>
> 
> --
> To unsubscribe, send a message with 'unsubscribe linux-mm' in
> the body to majordomo@kvack.org.  For more info on Linux MM,
> see: http://www.linux-mm.org/ .
> Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

-- 
Kind regards,
Minchan Kim

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 50+ messages in thread

* Re: [PATCH v6] memcg: event control at vmpressure.
  2013-06-21  1:22                     ` Minchan Kim
@ 2013-06-21  9:19                       ` Michal Hocko
  2013-06-21 11:02                         ` Hyunhee Kim
  2013-06-21 16:27                         ` [PATCH v6] " Minchan Kim
  0 siblings, 2 replies; 50+ messages in thread
From: Michal Hocko @ 2013-06-21  9:19 UTC (permalink / raw)
  To: Minchan Kim
  Cc: Hyunhee Kim, 'Anton Vorontsov',
	linux-mm, akpm, rob, kamezawa.hiroyu, hannes, rientjes, kirill,
	'Kyungmin Park'

On Fri 21-06-13 10:22:34, Minchan Kim wrote:
> On Fri, Jun 21, 2013 at 09:24:38AM +0900, Hyunhee Kim wrote:
> > In the original vmpressure, events are triggered whenever there is a reclaim
> > activity. This becomes overheads to user space module and also increases
> 
> Not true.
> We have lots of filter to not trigger event even if reclaim is going on.
> Your statement would make confuse.

Where is the filter implemented? In the kernel? I do not see any
throttling in the current mm tree.

> > power consumption if there is somebody to listen to it. This patch provides
> > options to trigger events only when the pressure level changes.
> > This trigger option can be set when registering each event by writing
> > a trigger option, "edge" or "always", next to the string of levels.
> > "edge" means that the event is triggered only when the pressure level is changed.
> > "always" means that events are triggered whenever there is a reclaim process.
>                                            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
>                                                   Not true, either.

Is this about vmpressure_win? But I agree that this could be more
specific. Something like "`Always' trigger option will signal all events
while `edge' option will trigger only events when the level changes."

> > To keep backward compatibility, "always" is set by default if nothing is input
> > as an option. Each event can have different option. For example,
> > "low" level uses "always" trigger option to see reclaim activity at user space
> > while "medium"/"critical" uses "edge" to do an important job
> > like killing tasks only once.
> 
> Question.
> 
> 1. user: set critical edge
> 2. kernel: memory is tight and trigger event with critical
> 3. user: kill a program when he receives a event
> 4. kernel: memory is very tight again and want to trigger a event
>    with critical but fail because last_level was critical and it was edge.
> 
> Right?

yes, this is the risk of the edge triggering and the user has to be
prepared for that. I still think that it makes some sense to have the
two modes.
 
> > @@ -823,7 +831,7 @@ Test:
> >     # cd /sys/fs/cgroup/memory/
> >     # mkdir foo
> >     # cd foo
> > -   # cgroup_event_listener memory.pressure_level low &
> > +   # cgroup_event_listener memory.pressure_level low edge &
> >     # echo 8000000 > memory.limit_in_bytes
> >     # echo 8000000 > memory.memsw.limit_in_bytes
> >     # echo $$ > tasks
> > diff --git a/mm/vmpressure.c b/mm/vmpressure.c
> > index 736a601..a08252e 100644
> > --- a/mm/vmpressure.c
> > +++ b/mm/vmpressure.c
> > @@ -137,6 +137,8 @@ static enum vmpressure_levels vmpressure_calc_level(unsigned long scanned,
> >  struct vmpressure_event {
> >  	struct eventfd_ctx *efd;
> >  	enum vmpressure_levels level;
> > +	int last_level;
> 
> int? but level is enum vmpressure_levels?

good catch
 
> > +	bool edge_trigger;
> >  	struct list_head node;
> >  };
> >  
> > @@ -153,11 +155,14 @@ static bool vmpressure_event(struct vmpressure *vmpr,
> >  
> >  	list_for_each_entry(ev, &vmpr->events, node) {
> >  		if (level >= ev->level) {
> > +			if (ev->edge_trigger && level == ev->last_level)
> > +				continue;
> > +
> >  			eventfd_signal(ev->efd, 1);
> >  			signalled = true;
> >  		}
> > +		ev->last_level = level;
> >  	}
> > -
> 
> Unnecessary change.
> 
> >  	mutex_unlock(&vmpr->events_lock);
> >  
> >  	return signalled;
> > @@ -290,9 +295,11 @@ void vmpressure_prio(gfp_t gfp, struct mem_cgroup *memcg, int prio)
> >   *
> >   * This function associates eventfd context with the vmpressure
> >   * infrastructure, so that the notifications will be delivered to the
> > - * @eventfd. The @args parameter is a string that denotes pressure level
> > + * @eventfd. The @args parameters are a string that denotes pressure level
> >   * threshold (one of vmpressure_str_levels, i.e. "low", "medium", or
> > - * "critical").
> > + * "critical") and a trigger option that decides whether events are triggered
> > + * continuously or only on edge ("always" or "edge" if "edge", events
> > + * are triggered when the pressure level changes.
> >   *
> >   * This function should not be used directly, just pass it to (struct
> >   * cftype).register_event, and then cgroup core will handle everything by
> > @@ -303,22 +310,43 @@ int vmpressure_register_event(struct cgroup *cg, struct cftype *cft,
> >  {
> >  	struct vmpressure *vmpr = cg_to_vmpressure(cg);
> >  	struct vmpressure_event *ev;
> > +	char *strlevel, *strtrigger;
> >  	int level;
> > +	bool trigger;
> 
> What trigger?
> Would be better to use "bool egde" instead?

yes

> > +
> > +	strlevel = args;
> > +	strtrigger = strchr(args, ' ');
> > +
> > +	if (strtrigger) {
> > +		*strtrigger = '\0';
> > +		strtrigger++;
> > +	}
> >  
> >  	for (level = 0; level < VMPRESSURE_NUM_LEVELS; level++) {
> > -		if (!strcmp(vmpressure_str_levels[level], args))
> > +		if (!strcmp(vmpressure_str_levels[level], strlevel))
> >  			break;
> >  	}
> >  
> >  	if (level >= VMPRESSURE_NUM_LEVELS)
> >  		return -EINVAL;
> >  
> > +	if (strtrigger == NULL)
> > +		trigger = false;
> > +	else if (!strcmp(strtrigger, "always"))
> > +		trigger = false;
> > +	else if (!strcmp(strtrigger, "edge"))
> > +		trigger = true;
> > +	else
> > +		return -EINVAL;
> > +
> >  	ev = kzalloc(sizeof(*ev), GFP_KERNEL);
> >  	if (!ev)
> >  		return -ENOMEM;
> >  
> >  	ev->efd = eventfd;
> >  	ev->level = level;
> > +	ev->last_level = -1;
> 
> VMPRESSURE_NONE is better?
 
Yes
-- 
Michal Hocko
SUSE Labs

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 50+ messages in thread

* Re: [PATCH v6] memcg: event control at vmpressure.
  2013-06-21  9:19                       ` Michal Hocko
@ 2013-06-21 11:02                         ` Hyunhee Kim
  2013-06-21 11:54                           ` Hyunhee Kim
  2013-06-21 16:27                         ` [PATCH v6] " Minchan Kim
  1 sibling, 1 reply; 50+ messages in thread
From: Hyunhee Kim @ 2013-06-21 11:02 UTC (permalink / raw)
  To: Michal Hocko, Minchan Kim
  Cc: Anton Vorontsov, linux-mm, akpm, rob, kamezawa.hiroyu, hannes,
	rientjes, kirill, Kyungmin Park

2013/6/21 Michal Hocko <mhocko@suse.cz>:
> On Fri 21-06-13 10:22:34, Minchan Kim wrote:
>> On Fri, Jun 21, 2013 at 09:24:38AM +0900, Hyunhee Kim wrote:
>> > In the original vmpressure, events are triggered whenever there is a reclaim
>> > activity. This becomes overheads to user space module and also increases
>>
>> Not true.
>> We have lots of filter to not trigger event even if reclaim is going on.
>> Your statement would make confuse.
>
> Where is the filter implemented? In the kernel? I do not see any
> throttling in the current mm tree.

Thanks for your comments.
As Minchan said, vmpressure_win can filter some of event sinals. But,
when I tested, lots of events are still signaled if a task consume a
lot of memory.
I'll change the expression "whenever there is a reclaim activity".

>
>> > power consumption if there is somebody to listen to it. This patch provides
>> > options to trigger events only when the pressure level changes.
>> > This trigger option can be set when registering each event by writing
>> > a trigger option, "edge" or "always", next to the string of levels.
>> > "edge" means that the event is triggered only when the pressure level is changed.
>> > "always" means that events are triggered whenever there is a reclaim process.
>>                                            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
>>                                                   Not true, either.
>
> Is this about vmpressure_win? But I agree that this could be more
> specific. Something like "`Always' trigger option will signal all events
> while `edge' option will trigger only events when the level changes."

I also agree. I'll improve it.

>
>> > To keep backward compatibility, "always" is set by default if nothing is input
>> > as an option. Each event can have different option. For example,
>> > "low" level uses "always" trigger option to see reclaim activity at user space
>> > while "medium"/"critical" uses "edge" to do an important job
>> > like killing tasks only once.
>>
>> Question.
>>
>> 1. user: set critical edge
>> 2. kernel: memory is tight and trigger event with critical
>> 3. user: kill a program when he receives a event
>> 4. kernel: memory is very tight again and want to trigger a event
>>    with critical but fail because last_level was critical and it was edge.
>>
>> Right?
>
> yes, this is the risk of the edge triggering and the user has to be
> prepared for that. I still think that it makes some sense to have the
> two modes.

Right. the above scenario is possible to happen.
And, as Michal said, I also think that a user should handle this
situation. This could be the users' choice to handle continuous events
or handle the above situation by having two modes.

>
>> > @@ -823,7 +831,7 @@ Test:
>> >     # cd /sys/fs/cgroup/memory/
>> >     # mkdir foo
>> >     # cd foo
>> > -   # cgroup_event_listener memory.pressure_level low &
>> > +   # cgroup_event_listener memory.pressure_level low edge &
>> >     # echo 8000000 > memory.limit_in_bytes
>> >     # echo 8000000 > memory.memsw.limit_in_bytes
>> >     # echo $$ > tasks
>> > diff --git a/mm/vmpressure.c b/mm/vmpressure.c
>> > index 736a601..a08252e 100644
>> > --- a/mm/vmpressure.c
>> > +++ b/mm/vmpressure.c
>> > @@ -137,6 +137,8 @@ static enum vmpressure_levels vmpressure_calc_level(unsigned long scanned,
>> >  struct vmpressure_event {
>> >     struct eventfd_ctx *efd;
>> >     enum vmpressure_levels level;
>> > +   int last_level;
>>
>> int? but level is enum vmpressure_levels?
>
> good catch

Ok, I'll fix it.

>
>> > +   bool edge_trigger;
>> >     struct list_head node;
>> >  };
>> >
>> > @@ -153,11 +155,14 @@ static bool vmpressure_event(struct vmpressure *vmpr,
>> >
>> >     list_for_each_entry(ev, &vmpr->events, node) {
>> >             if (level >= ev->level) {
>> > +                   if (ev->edge_trigger && level == ev->last_level)
>> > +                           continue;
>> > +
>> >                     eventfd_signal(ev->efd, 1);
>> >                     signalled = true;
>> >             }
>> > +           ev->last_level = level;
>> >     }
>> > -
>>
>> Unnecessary change.

Ok.

>>
>> >     mutex_unlock(&vmpr->events_lock);
>> >
>> >     return signalled;
>> > @@ -290,9 +295,11 @@ void vmpressure_prio(gfp_t gfp, struct mem_cgroup *memcg, int prio)
>> >   *
>> >   * This function associates eventfd context with the vmpressure
>> >   * infrastructure, so that the notifications will be delivered to the
>> > - * @eventfd. The @args parameter is a string that denotes pressure level
>> > + * @eventfd. The @args parameters are a string that denotes pressure level
>> >   * threshold (one of vmpressure_str_levels, i.e. "low", "medium", or
>> > - * "critical").
>> > + * "critical") and a trigger option that decides whether events are triggered
>> > + * continuously or only on edge ("always" or "edge" if "edge", events
>> > + * are triggered when the pressure level changes.
>> >   *
>> >   * This function should not be used directly, just pass it to (struct
>> >   * cftype).register_event, and then cgroup core will handle everything by
>> > @@ -303,22 +310,43 @@ int vmpressure_register_event(struct cgroup *cg, struct cftype *cft,
>> >  {
>> >     struct vmpressure *vmpr = cg_to_vmpressure(cg);
>> >     struct vmpressure_event *ev;
>> > +   char *strlevel, *strtrigger;
>> >     int level;
>> > +   bool trigger;
>>
>> What trigger?
>> Would be better to use "bool egde" instead?
>
> yes

Ok. edge is better.

>
>> > +
>> > +   strlevel = args;
>> > +   strtrigger = strchr(args, ' ');
>> > +
>> > +   if (strtrigger) {
>> > +           *strtrigger = '\0';
>> > +           strtrigger++;
>> > +   }
>> >
>> >     for (level = 0; level < VMPRESSURE_NUM_LEVELS; level++) {
>> > -           if (!strcmp(vmpressure_str_levels[level], args))
>> > +           if (!strcmp(vmpressure_str_levels[level], strlevel))
>> >                     break;
>> >     }
>> >
>> >     if (level >= VMPRESSURE_NUM_LEVELS)
>> >             return -EINVAL;
>> >
>> > +   if (strtrigger == NULL)
>> > +           trigger = false;
>> > +   else if (!strcmp(strtrigger, "always"))
>> > +           trigger = false;
>> > +   else if (!strcmp(strtrigger, "edge"))
>> > +           trigger = true;
>> > +   else
>> > +           return -EINVAL;
>> > +
>> >     ev = kzalloc(sizeof(*ev), GFP_KERNEL);
>> >     if (!ev)
>> >             return -ENOMEM;
>> >
>> >     ev->efd = eventfd;
>> >     ev->level = level;
>> > +   ev->last_level = -1;
>>
>> VMPRESSURE_NONE is better?
>
> Yes

Ok, I'll add VMPRESSURE_NONE. It will be better.

Thanks for review.
Hyunhee Kim.

> --
> Michal Hocko
> SUSE Labs
>
> --
> To unsubscribe, send a message with 'unsubscribe linux-mm' in
> the body to majordomo@kvack.org.  For more info on Linux MM,
> see: http://www.linux-mm.org/ .
> Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 50+ messages in thread

* RE: [PATCH v6] memcg: event control at vmpressure.
  2013-06-21 11:02                         ` Hyunhee Kim
@ 2013-06-21 11:54                           ` Hyunhee Kim
  2013-06-21 12:40                             ` [PATCH v7] " Hyunhee Kim
  0 siblings, 1 reply; 50+ messages in thread
From: Hyunhee Kim @ 2013-06-21 11:54 UTC (permalink / raw)
  To: 'Minchan Kim'
  Cc: 'Anton Vorontsov',
	linux-mm, akpm, rob, kamezawa.hiroyu, hannes, rientjes, kirill,
	'Kyungmin Park', 'Hyunhee Kim',
	'Michal Hocko'

In addition, I have more questions about vmpressure.

1. Why are vmpressure_level_med and vmpressure_level_critical fixed in kernel?
    Isn't it worth to allow users to specify them when registering events?

2. Sometimes, there are frequent fluctuation like entering low, and middle, and low, and middle...
    How about using marginal value to leave each level? Isn't it unnecessary?

3. Is there any reason to use reclaim rate to decide levels? 
    Is it bad option to use available memory like android low memory killer?

Thanks in advance.

Hyunhee Kim.

-----Original Message-----
From: owner-linux-mm@kvack.org [mailto:owner-linux-mm@kvack.org] On Behalf Of Hyunhee Kim
Sent: Friday, June 21, 2013 8:03 PM
To: Michal Hocko; Minchan Kim
Cc: Anton Vorontsov; linux-mm@kvack.org; akpm@linux-foundation.org; rob@landley.net; kamezawa.hiroyu@jp.fujitsu.com;
hannes@cmpxchg.org; rientjes@google.com; kirill@shutemov.name; Kyungmin Park
Subject: Re: [PATCH v6] memcg: event control at vmpressure.

2013/6/21 Michal Hocko <mhocko@suse.cz>:
> On Fri 21-06-13 10:22:34, Minchan Kim wrote:
>> On Fri, Jun 21, 2013 at 09:24:38AM +0900, Hyunhee Kim wrote:
>> > In the original vmpressure, events are triggered whenever there is a reclaim
>> > activity. This becomes overheads to user space module and also increases
>>
>> Not true.
>> We have lots of filter to not trigger event even if reclaim is going on.
>> Your statement would make confuse.
>
> Where is the filter implemented? In the kernel? I do not see any
> throttling in the current mm tree.

Thanks for your comments.
As Minchan said, vmpressure_win can filter some of event sinals. But,
when I tested, lots of events are still signaled if a task consume a
lot of memory.
I'll change the expression "whenever there is a reclaim activity".

>
>> > power consumption if there is somebody to listen to it. This patch provides
>> > options to trigger events only when the pressure level changes.
>> > This trigger option can be set when registering each event by writing
>> > a trigger option, "edge" or "always", next to the string of levels.
>> > "edge" means that the event is triggered only when the pressure level is changed.
>> > "always" means that events are triggered whenever there is a reclaim process.
>>                                            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
>>                                                   Not true, either.
>
> Is this about vmpressure_win? But I agree that this could be more
> specific. Something like "`Always' trigger option will signal all events
> while `edge' option will trigger only events when the level changes."

I also agree. I'll improve it.

>
>> > To keep backward compatibility, "always" is set by default if nothing is input
>> > as an option. Each event can have different option. For example,
>> > "low" level uses "always" trigger option to see reclaim activity at user space
>> > while "medium"/"critical" uses "edge" to do an important job
>> > like killing tasks only once.
>>
>> Question.
>>
>> 1. user: set critical edge
>> 2. kernel: memory is tight and trigger event with critical
>> 3. user: kill a program when he receives a event
>> 4. kernel: memory is very tight again and want to trigger a event
>>    with critical but fail because last_level was critical and it was edge.
>>
>> Right?
>
> yes, this is the risk of the edge triggering and the user has to be
> prepared for that. I still think that it makes some sense to have the
> two modes.

Right. the above scenario is possible to happen.
And, as Michal said, I also think that a user should handle this
situation. This could be the users' choice to handle continuous events
or handle the above situation by having two modes.

>
>> > @@ -823,7 +831,7 @@ Test:
>> >     # cd /sys/fs/cgroup/memory/
>> >     # mkdir foo
>> >     # cd foo
>> > -   # cgroup_event_listener memory.pressure_level low &
>> > +   # cgroup_event_listener memory.pressure_level low edge &
>> >     # echo 8000000 > memory.limit_in_bytes
>> >     # echo 8000000 > memory.memsw.limit_in_bytes
>> >     # echo $$ > tasks
>> > diff --git a/mm/vmpressure.c b/mm/vmpressure.c
>> > index 736a601..a08252e 100644
>> > --- a/mm/vmpressure.c
>> > +++ b/mm/vmpressure.c
>> > @@ -137,6 +137,8 @@ static enum vmpressure_levels vmpressure_calc_level(unsigned long scanned,
>> >  struct vmpressure_event {
>> >     struct eventfd_ctx *efd;
>> >     enum vmpressure_levels level;
>> > +   int last_level;
>>
>> int? but level is enum vmpressure_levels?
>
> good catch

Ok, I'll fix it.

>
>> > +   bool edge_trigger;
>> >     struct list_head node;
>> >  };
>> >
>> > @@ -153,11 +155,14 @@ static bool vmpressure_event(struct vmpressure *vmpr,
>> >
>> >     list_for_each_entry(ev, &vmpr->events, node) {
>> >             if (level >= ev->level) {
>> > +                   if (ev->edge_trigger && level == ev->last_level)
>> > +                           continue;
>> > +
>> >                     eventfd_signal(ev->efd, 1);
>> >                     signalled = true;
>> >             }
>> > +           ev->last_level = level;
>> >     }
>> > -
>>
>> Unnecessary change.

Ok.

>>
>> >     mutex_unlock(&vmpr->events_lock);
>> >
>> >     return signalled;
>> > @@ -290,9 +295,11 @@ void vmpressure_prio(gfp_t gfp, struct mem_cgroup *memcg, int prio)
>> >   *
>> >   * This function associates eventfd context with the vmpressure
>> >   * infrastructure, so that the notifications will be delivered to the
>> > - * @eventfd. The @args parameter is a string that denotes pressure level
>> > + * @eventfd. The @args parameters are a string that denotes pressure level
>> >   * threshold (one of vmpressure_str_levels, i.e. "low", "medium", or
>> > - * "critical").
>> > + * "critical") and a trigger option that decides whether events are triggered
>> > + * continuously or only on edge ("always" or "edge" if "edge", events
>> > + * are triggered when the pressure level changes.
>> >   *
>> >   * This function should not be used directly, just pass it to (struct
>> >   * cftype).register_event, and then cgroup core will handle everything by
>> > @@ -303,22 +310,43 @@ int vmpressure_register_event(struct cgroup *cg, struct cftype *cft,
>> >  {
>> >     struct vmpressure *vmpr = cg_to_vmpressure(cg);
>> >     struct vmpressure_event *ev;
>> > +   char *strlevel, *strtrigger;
>> >     int level;
>> > +   bool trigger;
>>
>> What trigger?
>> Would be better to use "bool egde" instead?
>
> yes

Ok. edge is better.

>
>> > +
>> > +   strlevel = args;
>> > +   strtrigger = strchr(args, ' ');
>> > +
>> > +   if (strtrigger) {
>> > +           *strtrigger = '\0';
>> > +           strtrigger++;
>> > +   }
>> >
>> >     for (level = 0; level < VMPRESSURE_NUM_LEVELS; level++) {
>> > -           if (!strcmp(vmpressure_str_levels[level], args))
>> > +           if (!strcmp(vmpressure_str_levels[level], strlevel))
>> >                     break;
>> >     }
>> >
>> >     if (level >= VMPRESSURE_NUM_LEVELS)
>> >             return -EINVAL;
>> >
>> > +   if (strtrigger == NULL)
>> > +           trigger = false;
>> > +   else if (!strcmp(strtrigger, "always"))
>> > +           trigger = false;
>> > +   else if (!strcmp(strtrigger, "edge"))
>> > +           trigger = true;
>> > +   else
>> > +           return -EINVAL;
>> > +
>> >     ev = kzalloc(sizeof(*ev), GFP_KERNEL);
>> >     if (!ev)
>> >             return -ENOMEM;
>> >
>> >     ev->efd = eventfd;
>> >     ev->level = level;
>> > +   ev->last_level = -1;
>>
>> VMPRESSURE_NONE is better?
>
> Yes

Ok, I'll add VMPRESSURE_NONE. It will be better.

Thanks for review.
Hyunhee Kim.

> --
> Michal Hocko
> SUSE Labs
>
> --
> To unsubscribe, send a message with 'unsubscribe linux-mm' in
> the body to majordomo@kvack.org.  For more info on Linux MM,
> see: http://www.linux-mm.org/ .
> Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 50+ messages in thread

* [PATCH v7] memcg: event control at vmpressure.
  2013-06-21 11:54                           ` Hyunhee Kim
@ 2013-06-21 12:40                             ` Hyunhee Kim
  0 siblings, 0 replies; 50+ messages in thread
From: Hyunhee Kim @ 2013-06-21 12:40 UTC (permalink / raw)
  To: 'Minchan Kim', 'Anton Vorontsov',
	linux-mm, akpm, rob, kamezawa.hiroyu, hannes, rientjes, kirill,
	'Kyungmin Park', 'Michal Hocko'

In the original vmpressure, same events could be signaled to user space until
the pressure level changes. However, for some users, these same event signals
are unnecessary, and handling them becomes overheads. This patch provides
triggering options that can decide when events are signaled: (1) signal all
matched events or (2) signal an event only when the pressure level changes.
This trigger option can be set when each event is registered by writing
a trigger option, "always" or "edge", next to the string of levels.
"always" means that all matched events are signaled while "edge" means that
an event is signaled only when the pressure level is changed.
To keep backward compatibility, "always" is set by default if nothing is input
as an option. Each event can have different option. For example,
"low" level uses "always" trigger option to see reclaim activity at user space
while "medium"/"critical" uses "edge" to do an important job
like killing tasks only once.

Signed-off-by: Hyunhee Kim <hyunhee.kim@samsung.com>
Signed-off-by: Kyungmin Park <kyungmin.park@samsung.com>
---
 Documentation/cgroups/memory.txt |   11 +++++++++--
 mm/vmpressure.c                  |   38 ++++++++++++++++++++++++++++++++++----
 2 files changed, 43 insertions(+), 6 deletions(-)

diff --git a/Documentation/cgroups/memory.txt b/Documentation/cgroups/memory.txt
index ddf4f93..a6bb589 100644
--- a/Documentation/cgroups/memory.txt
+++ b/Documentation/cgroups/memory.txt
@@ -807,13 +807,20 @@ register a notification, an application must:
 
 - create an eventfd using eventfd(2);
 - open memory.pressure_level;
-- write string like "<event_fd> <fd of memory.pressure_level> <level>"
+- write string like
+	"<event_fd> <fd of memory.pressure_level> <level> <trigger_option>"
   to cgroup.event_control.
 
 Application will be notified through eventfd when memory pressure is at
 the specific level (or higher). Read/write operations to
 memory.pressure_level are no implemented.
 
+All matched events can always be signaled or an event can be signaled only when
+the pressure level changes. This trigger option is decided by writing it next
+to the level. "Always" trigger option will signal all matched events
+while "edge" option will signal the matched event only when the level changes.
+If the trigger option is not set, "always" is set by default.
+
 Test:
 
    Here is a small script example that makes a new cgroup, sets up a
@@ -823,7 +830,7 @@ Test:
    # cd /sys/fs/cgroup/memory/
    # mkdir foo
    # cd foo
-   # cgroup_event_listener memory.pressure_level low &
+   # cgroup_event_listener memory.pressure_level low edge &
    # echo 8000000 > memory.limit_in_bytes
    # echo 8000000 > memory.memsw.limit_in_bytes
    # echo $$ > tasks
diff --git a/mm/vmpressure.c b/mm/vmpressure.c
index 736a601..7dcfb58 100644
--- a/mm/vmpressure.c
+++ b/mm/vmpressure.c
@@ -91,7 +91,8 @@ static struct vmpressure *vmpressure_parent(struct vmpressure *vmpr)
 }
 
 enum vmpressure_levels {
-	VMPRESSURE_LOW = 0,
+	VMPRESSURE_NONE = -1,
+	VMPRESSURE_LOW,
 	VMPRESSURE_MEDIUM,
 	VMPRESSURE_CRITICAL,
 	VMPRESSURE_NUM_LEVELS,
@@ -137,6 +138,8 @@ static enum vmpressure_levels vmpressure_calc_level(unsigned long scanned,
 struct vmpressure_event {
 	struct eventfd_ctx *efd;
 	enum vmpressure_levels level;
+	enum vmpressure_levels last_level;
+	bool edge_trigger;
 	struct list_head node;
 };
 
@@ -153,9 +156,13 @@ static bool vmpressure_event(struct vmpressure *vmpr,
 
 	list_for_each_entry(ev, &vmpr->events, node) {
 		if (level >= ev->level) {
+			if (ev->edge_trigger && level == ev->last_level)
+				continue;
+
 			eventfd_signal(ev->efd, 1);
 			signalled = true;
 		}
+		ev->last_level = level;
 	}
 
 	mutex_unlock(&vmpr->events_lock);
@@ -290,9 +297,11 @@ void vmpressure_prio(gfp_t gfp, struct mem_cgroup *memcg, int prio)
  *
  * This function associates eventfd context with the vmpressure
  * infrastructure, so that the notifications will be delivered to the
- * @eventfd. The @args parameter is a string that denotes pressure level
+ * @eventfd. The @args parameters are a string that denotes pressure level
  * threshold (one of vmpressure_str_levels, i.e. "low", "medium", or
- * "critical").
+ * "critical") and a trigger option that decides whether events are triggered
+ * continuously or only on edge ("always" or "edge" if "edge", events
+ * are triggered when the pressure level changes.
  *
  * This function should not be used directly, just pass it to (struct
  * cftype).register_event, and then cgroup core will handle everything by
@@ -303,22 +312,43 @@ int vmpressure_register_event(struct cgroup *cg, struct cftype *cft,
 {
 	struct vmpressure *vmpr = cg_to_vmpressure(cg);
 	struct vmpressure_event *ev;
+	char *strlevel, *strtrigger;
 	int level;
+	bool edge;
+
+	strlevel = args;
+	strtrigger = strchr(args, ' ');
+
+	if (strtrigger) {
+		*strtrigger = '\0';
+		strtrigger++;
+	}
 
 	for (level = 0; level < VMPRESSURE_NUM_LEVELS; level++) {
-		if (!strcmp(vmpressure_str_levels[level], args))
+		if (!strcmp(vmpressure_str_levels[level], strlevel))
 			break;
 	}
 
 	if (level >= VMPRESSURE_NUM_LEVELS)
 		return -EINVAL;
 
+	if (strtrigger == NULL)
+		edge = false;
+	else if (!strcmp(strtrigger, "always"))
+		edge = false;
+	else if (!strcmp(strtrigger, "edge"))
+		edge = true;
+	else
+		return -EINVAL;
+
 	ev = kzalloc(sizeof(*ev), GFP_KERNEL);
 	if (!ev)
 		return -ENOMEM;
 
 	ev->efd = eventfd;
 	ev->level = level;
+	ev->last_level = VMPRESSURE_NONE;
+	ev->edge_trigger = edge;
 
 	mutex_lock(&vmpr->events_lock);
 	list_add(&ev->node, &vmpr->events);
-- 
1.7.9.5


--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply related	[flat|nested] 50+ messages in thread

* Re: [PATCH v6] memcg: event control at vmpressure.
  2013-06-21  9:19                       ` Michal Hocko
  2013-06-21 11:02                         ` Hyunhee Kim
@ 2013-06-21 16:27                         ` Minchan Kim
  2013-06-21 16:44                           ` Minchan Kim
                                             ` (3 more replies)
  1 sibling, 4 replies; 50+ messages in thread
From: Minchan Kim @ 2013-06-21 16:27 UTC (permalink / raw)
  To: Michal Hocko
  Cc: Hyunhee Kim, 'Anton Vorontsov',
	linux-mm, akpm, rob, kamezawa.hiroyu, hannes, rientjes, kirill,
	'Kyungmin Park'

Hello Michal,

On Fri, Jun 21, 2013 at 11:19:44AM +0200, Michal Hocko wrote:
> On Fri 21-06-13 10:22:34, Minchan Kim wrote:
> > On Fri, Jun 21, 2013 at 09:24:38AM +0900, Hyunhee Kim wrote:
> > > In the original vmpressure, events are triggered whenever there is a reclaim
> > > activity. This becomes overheads to user space module and also increases
> > 
> > Not true.
> > We have lots of filter to not trigger event even if reclaim is going on.
> > Your statement would make confuse.
> 
> Where is the filter implemented? In the kernel? I do not see any
> throttling in the current mm tree.

1. mem_cgroup_soft_limit_reclaim
2. reclaim caused by DMA zone
3. vmpressure_win

> 
> > > power consumption if there is somebody to listen to it. This patch provides
> > > options to trigger events only when the pressure level changes.
> > > This trigger option can be set when registering each event by writing
> > > a trigger option, "edge" or "always", next to the string of levels.
> > > "edge" means that the event is triggered only when the pressure level is changed.
> > > "always" means that events are triggered whenever there is a reclaim process.
> >                                            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
> >                                                   Not true, either.
> 
> Is this about vmpressure_win? But I agree that this could be more
> specific. Something like "`Always' trigger option will signal all events
> while `edge' option will trigger only events when the level changes."
> 
> > > To keep backward compatibility, "always" is set by default if nothing is input
> > > as an option. Each event can have different option. For example,
> > > "low" level uses "always" trigger option to see reclaim activity at user space
> > > while "medium"/"critical" uses "edge" to do an important job
> > > like killing tasks only once.
> > 
> > Question.
> > 
> > 1. user: set critical edge
> > 2. kernel: memory is tight and trigger event with critical
> > 3. user: kill a program when he receives a event
> > 4. kernel: memory is very tight again and want to trigger a event
> >    with critical but fail because last_level was critical and it was edge.
> > 
> > Right?
> 
> yes, this is the risk of the edge triggering and the user has to be
> prepared for that. I still think that it makes some sense to have the
> two modes.

I'm not sure it's good idea.
How could user overcome above problem?
The problem is "critical" means that the memory is really tight so
that user should do ememgency plan to overcome the situation like
kill someone. Even, kill could be a problem because the one from userspace
couldn't use reserved memory pool so that killing could be stalled for a
long time and then, we could end up encountering OOM. :(

So, the description should include how to overcome above situation in
userspace efficiently even though memory is really tight, which we don't
have extra memory to read vmstat.

We reviewers should review that it does make sense. If so, we need to
write down it in documentation, otherwise, we should fix it from kernel
side.

Another problem from this patch is that it couldn't detect same event
contiguously so you are saying userspace have to handle it.
It doesn't make sense to me. Why should user handle such false positive?
I don't mean false positive signal is bad because low memory notification
has inherent such a problem but we should try to avoid such frequent triggering
if possible.
IOW, It reveals current vmpressure's problem.
Fix it without band-aiding of userspace if it's really problem.

>  
> > > @@ -823,7 +831,7 @@ Test:
> > >     # cd /sys/fs/cgroup/memory/
> > >     # mkdir foo
> > >     # cd foo
> > > -   # cgroup_event_listener memory.pressure_level low &
> > > +   # cgroup_event_listener memory.pressure_level low edge &
> > >     # echo 8000000 > memory.limit_in_bytes
> > >     # echo 8000000 > memory.memsw.limit_in_bytes
> > >     # echo $$ > tasks
> > > diff --git a/mm/vmpressure.c b/mm/vmpressure.c
> > > index 736a601..a08252e 100644
> > > --- a/mm/vmpressure.c
> > > +++ b/mm/vmpressure.c
> > > @@ -137,6 +137,8 @@ static enum vmpressure_levels vmpressure_calc_level(unsigned long scanned,
> > >  struct vmpressure_event {
> > >  	struct eventfd_ctx *efd;
> > >  	enum vmpressure_levels level;
> > > +	int last_level;
> > 
> > int? but level is enum vmpressure_levels?
> 
> good catch
>  
> > > +	bool edge_trigger;
> > >  	struct list_head node;
> > >  };
> > >  
> > > @@ -153,11 +155,14 @@ static bool vmpressure_event(struct vmpressure *vmpr,
> > >  
> > >  	list_for_each_entry(ev, &vmpr->events, node) {
> > >  		if (level >= ev->level) {
> > > +			if (ev->edge_trigger && level == ev->last_level)
> > > +				continue;
> > > +
> > >  			eventfd_signal(ev->efd, 1);
> > >  			signalled = true;
> > >  		}
> > > +		ev->last_level = level;
> > >  	}
> > > -
> > 
> > Unnecessary change.
> > 
> > >  	mutex_unlock(&vmpr->events_lock);
> > >  
> > >  	return signalled;
> > > @@ -290,9 +295,11 @@ void vmpressure_prio(gfp_t gfp, struct mem_cgroup *memcg, int prio)
> > >   *
> > >   * This function associates eventfd context with the vmpressure
> > >   * infrastructure, so that the notifications will be delivered to the
> > > - * @eventfd. The @args parameter is a string that denotes pressure level
> > > + * @eventfd. The @args parameters are a string that denotes pressure level
> > >   * threshold (one of vmpressure_str_levels, i.e. "low", "medium", or
> > > - * "critical").
> > > + * "critical") and a trigger option that decides whether events are triggered
> > > + * continuously or only on edge ("always" or "edge" if "edge", events
> > > + * are triggered when the pressure level changes.
> > >   *
> > >   * This function should not be used directly, just pass it to (struct
> > >   * cftype).register_event, and then cgroup core will handle everything by
> > > @@ -303,22 +310,43 @@ int vmpressure_register_event(struct cgroup *cg, struct cftype *cft,
> > >  {
> > >  	struct vmpressure *vmpr = cg_to_vmpressure(cg);
> > >  	struct vmpressure_event *ev;
> > > +	char *strlevel, *strtrigger;
> > >  	int level;
> > > +	bool trigger;
> > 
> > What trigger?
> > Would be better to use "bool egde" instead?
> 
> yes
> 
> > > +
> > > +	strlevel = args;
> > > +	strtrigger = strchr(args, ' ');
> > > +
> > > +	if (strtrigger) {
> > > +		*strtrigger = '\0';
> > > +		strtrigger++;
> > > +	}
> > >  
> > >  	for (level = 0; level < VMPRESSURE_NUM_LEVELS; level++) {
> > > -		if (!strcmp(vmpressure_str_levels[level], args))
> > > +		if (!strcmp(vmpressure_str_levels[level], strlevel))
> > >  			break;
> > >  	}
> > >  
> > >  	if (level >= VMPRESSURE_NUM_LEVELS)
> > >  		return -EINVAL;
> > >  
> > > +	if (strtrigger == NULL)
> > > +		trigger = false;
> > > +	else if (!strcmp(strtrigger, "always"))
> > > +		trigger = false;
> > > +	else if (!strcmp(strtrigger, "edge"))
> > > +		trigger = true;
> > > +	else
> > > +		return -EINVAL;
> > > +
> > >  	ev = kzalloc(sizeof(*ev), GFP_KERNEL);
> > >  	if (!ev)
> > >  		return -ENOMEM;
> > >  
> > >  	ev->efd = eventfd;
> > >  	ev->level = level;
> > > +	ev->last_level = -1;
> > 
> > VMPRESSURE_NONE is better?
>  
> Yes
> -- 
> Michal Hocko
> SUSE Labs

-- 
Kind regards,
Minchan Kim

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 50+ messages in thread

* Re: [PATCH v6] memcg: event control at vmpressure.
  2013-06-21 16:27                         ` [PATCH v6] " Minchan Kim
@ 2013-06-21 16:44                           ` Minchan Kim
  2013-06-22  0:27                             ` Anton Vorontsov
  2013-06-21 22:35                           ` Anton Vorontsov
                                             ` (2 subsequent siblings)
  3 siblings, 1 reply; 50+ messages in thread
From: Minchan Kim @ 2013-06-21 16:44 UTC (permalink / raw)
  To: Michal Hocko
  Cc: Hyunhee Kim, 'Anton Vorontsov',
	linux-mm, akpm, rob, kamezawa.hiroyu, hannes, rientjes, kirill,
	'Kyungmin Park'

On Sat, Jun 22, 2013 at 01:27:43AM +0900, Minchan Kim wrote:
> Hello Michal,
> 
> On Fri, Jun 21, 2013 at 11:19:44AM +0200, Michal Hocko wrote:
> > On Fri 21-06-13 10:22:34, Minchan Kim wrote:
> > > On Fri, Jun 21, 2013 at 09:24:38AM +0900, Hyunhee Kim wrote:
> > > > In the original vmpressure, events are triggered whenever there is a reclaim
> > > > activity. This becomes overheads to user space module and also increases
> > > 
> > > Not true.
> > > We have lots of filter to not trigger event even if reclaim is going on.
> > > Your statement would make confuse.
> > 
> > Where is the filter implemented? In the kernel? I do not see any
> > throttling in the current mm tree.
> 
> 1. mem_cgroup_soft_limit_reclaim
> 2. reclaim caused by DMA zone
> 3. vmpressure_win
> 
> > 
> > > > power consumption if there is somebody to listen to it. This patch provides
> > > > options to trigger events only when the pressure level changes.
> > > > This trigger option can be set when registering each event by writing
> > > > a trigger option, "edge" or "always", next to the string of levels.
> > > > "edge" means that the event is triggered only when the pressure level is changed.
> > > > "always" means that events are triggered whenever there is a reclaim process.
> > >                                            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
> > >                                                   Not true, either.
> > 
> > Is this about vmpressure_win? But I agree that this could be more
> > specific. Something like "`Always' trigger option will signal all events
> > while `edge' option will trigger only events when the level changes."
> > 
> > > > To keep backward compatibility, "always" is set by default if nothing is input
> > > > as an option. Each event can have different option. For example,
> > > > "low" level uses "always" trigger option to see reclaim activity at user space
> > > > while "medium"/"critical" uses "edge" to do an important job
> > > > like killing tasks only once.
> > > 
> > > Question.
> > > 
> > > 1. user: set critical edge
> > > 2. kernel: memory is tight and trigger event with critical
> > > 3. user: kill a program when he receives a event
> > > 4. kernel: memory is very tight again and want to trigger a event
> > >    with critical but fail because last_level was critical and it was edge.
> > > 
> > > Right?
> > 
> > yes, this is the risk of the edge triggering and the user has to be
> > prepared for that. I still think that it makes some sense to have the
> > two modes.
> 
> I'm not sure it's good idea.
> How could user overcome above problem?
> The problem is "critical" means that the memory is really tight so
> that user should do ememgency plan to overcome the situation like
> kill someone. Even, kill could be a problem because the one from userspace
> couldn't use reserved memory pool so that killing could be stalled for a
> long time and then, we could end up encountering OOM. :(
> 
> So, the description should include how to overcome above situation in
> userspace efficiently even though memory is really tight, which we don't
> have extra memory to read vmstat.
> 
> We reviewers should review that it does make sense. If so, we need to
> write down it in documentation, otherwise, we should fix it from kernel
> side.
> 
> Another problem from this patch is that it couldn't detect same event
> contiguously so you are saying userspace have to handle it.
> It doesn't make sense to me. Why should user handle such false positive?
> I don't mean false positive signal is bad because low memory notification
> has inherent such a problem but we should try to avoid such frequent triggering
> if possible.
> IOW, It reveals current vmpressure's problem.
> Fix it without band-aiding of userspace if it's really problem.
> 

1. One of the design problem is that why vmpressure trigger by per-zone
   while userspace isn't aware of zone. It would trigger event
   several time during walking several zones in reclaim path.

2. Why vmpressure skip if another work is pending?
   More urgent work should preempt exisitng pending work.

3. The reclaimed could be greater than scanned in vmpressure_evnet
   by several reasons. Totally, It could trigger wrong event.

2 and 3 is another story but 1 would be related to this problem.
Anyway, I suggest let's fix kernel first without userside's
awkward bandaid.

-- 
Kind regards,
Minchan Kim

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 50+ messages in thread

* Re: [PATCH v6] memcg: event control at vmpressure.
  2013-06-21 16:27                         ` [PATCH v6] " Minchan Kim
  2013-06-21 16:44                           ` Minchan Kim
@ 2013-06-21 22:35                           ` Anton Vorontsov
  2013-06-22  4:36                           ` Hyunhee Kim
  2013-06-25 16:07                           ` Michal Hocko
  3 siblings, 0 replies; 50+ messages in thread
From: Anton Vorontsov @ 2013-06-21 22:35 UTC (permalink / raw)
  To: Minchan Kim
  Cc: Michal Hocko, Hyunhee Kim, linux-mm, akpm, rob, kamezawa.hiroyu,
	hannes, rientjes, kirill, 'Kyungmin Park'

Hi Minchan,

Thanks for your thoughtful reviews!

On Sat, Jun 22, 2013 at 01:27:43AM +0900, Minchan Kim wrote:
> On Fri, Jun 21, 2013 at 11:19:44AM +0200, Michal Hocko wrote:
> > On Fri 21-06-13 10:22:34, Minchan Kim wrote:
> > > On Fri, Jun 21, 2013 at 09:24:38AM +0900, Hyunhee Kim wrote:
> > > > In the original vmpressure, events are triggered whenever there is a reclaim
> > > > activity. This becomes overheads to user space module and also increases
> > > 
> > > Not true.
> > > We have lots of filter to not trigger event even if reclaim is going on.
> > > Your statement would make confuse.
> > 
> > Where is the filter implemented? In the kernel? I do not see any
> > throttling in the current mm tree.
> 
> 1. mem_cgroup_soft_limit_reclaim
> 2. reclaim caused by DMA zone
> 3. vmpressure_win

(Which is going to be machine-size dependant. Plus, it is also an option
to add more filters into the kernel, userland does not need to know all
these "details".)

> > yes, this is the risk of the edge triggering and the user has to be
> > prepared for that. I still think that it makes some sense to have the
> > two modes.
> 
> I'm not sure it's good idea.

After your explanations, it sounds like a bad idea to me. :)

> How could user overcome above problem?
[...]
> So, the description should include how to overcome above situation in
> userspace efficiently even though memory is really tight, which we don't
> have extra memory to read vmstat.

Exactly. What userland would do is it will try to play games with vmstat,
e.g. polling it until it makes sure (by some heuristics) that it is
[again] safe to rely on the notifications. This makes the edge-triggered
interface simply unpredictible.

Until we find any better scheme for filtering, I'd suggest abandon the
idea of the edge tirggered stuff. Instead, I see a possible solution in
defining a new pressure level, or even a new scale (alongside with the
current "low/mid/critical" one), which will have a well-defined behaviour
and that will suit Samsung's needs.

(And for the time being the userland can perfectly fine play the vmstat
games after closing or not reading eventfd, that way you won't receive the
notifications and thus will still able to implement kind of "filtering" in
userland. Surely it is a hack, but much better than the unpredictible
behaviour on the kernel side.)

Thanks,

Anton

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 50+ messages in thread

* Re: [PATCH v6] memcg: event control at vmpressure.
  2013-06-21 16:44                           ` Minchan Kim
@ 2013-06-22  0:27                             ` Anton Vorontsov
  2013-06-22  1:28                               ` Hyunhee Kim
  2013-06-26  7:47                               ` Minchan Kim
  0 siblings, 2 replies; 50+ messages in thread
From: Anton Vorontsov @ 2013-06-22  0:27 UTC (permalink / raw)
  To: Minchan Kim
  Cc: Michal Hocko, Hyunhee Kim, linux-mm, akpm, rob, kamezawa.hiroyu,
	hannes, rientjes, kirill, 'Kyungmin Park'

On Sat, Jun 22, 2013 at 01:44:14AM +0900, Minchan Kim wrote:
[...]
> 3. The reclaimed could be greater than scanned in vmpressure_evnet
>    by several reasons. Totally, It could trigger wrong event.

Yup, and in that case the best we can do is just ignore the event (i.e.
not pass it to the userland): thing is, based on the fact that
'reclaimed > scanned' we can't actually conclude anything about the
pressure: it might be still high, or we actually freed enough.

Thanks,

Anton

p.s. I was somewhat sure that someone sent a patch to ignore 'reclaimed >
scanned' situation, but I cannot find it in my mailbox. Maybe I was
dreaming about it? :)

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 50+ messages in thread

* Re: [PATCH v6] memcg: event control at vmpressure.
  2013-06-22  0:27                             ` Anton Vorontsov
@ 2013-06-22  1:28                               ` Hyunhee Kim
  2013-06-26  7:47                               ` Minchan Kim
  1 sibling, 0 replies; 50+ messages in thread
From: Hyunhee Kim @ 2013-06-22  1:28 UTC (permalink / raw)
  To: Anton Vorontsov
  Cc: Michal Hocko, Kyungmin Park, hannes, rob, linux-mm, rientjes,
	Minchan Kim, kirill, akpm, kamezawa.hiroyu

2013. 6. 22. 오전 9:27에 "Anton Vorontsov" <anton@enomsg.org>님이 작성:
>
> On Sat, Jun 22, 2013 at 01:44:14AM +0900, Minchan Kim wrote:
> [...]
> > 3. The reclaimed could be greater than scanned in vmpressure_evnet
> >    by several reasons. Totally, It could trigger wrong event.
>
> Yup, and in that case the best we can do is just ignore the event (i.e.
> not pass it to the userland): thing is, based on the fact that
> 'reclaimed > scanned' we can't actually conclude anything about the
> pressure: it might be still high, or we actually freed enough.
>
> Thanks,
>
> Anton
>
> p.s. I was somewhat sure that someone sent a patch to ignore 'reclaimed >
> scanned' situation, but I cannot find it in my mailbox. Maybe I was
> dreaming about it? :)

I have suggested it as follows and Minchan reviewed it.
I'll send it again after applying Minchan's opinion.
Thanks,

Hyunhee Kim.
=========================================

On Wed, Jun 05, 2013 at 05:31:30PM +0900, Hyunhee Kim wrote:
 > Hi, Anton,
 >

Sorry, I'm not Anton but I involved a little bit when this feature was
 developed so may I answer your qeustion?


> When calculating pressure level in vmpressure_calc_level, I observed that "reclaimed" becomes larger than "scanned".
 > In this case, since these values are "unsigned long", pressure
returns wrong value and critical event is triggered even on low state.
 > Do you think that it is possible?


True, we have a few reasons.

Culprits I can think easily are THP page reclaiming or bails out reclaiming
 by fatal signal in shrink_inactive_list.
 I guess you don't enable THP so I think culprit is latter.


> If so, in this case, should we make "reclaimed" equal to "scanned"?
 > When I tested as below, it could trigger reasonable events.
 >
 > =============================
 > +static enum vmpressure_levels vmpressure_calc_level(unsigned long scanned,
 > +                                                 unsigned long reclaimed)
 > +{
 > +     unsigned long scale = scanned + reclaimed;
 > +     unsigned long pressure;
 > +     if (reclaimed > scanned)
 > +             reclaimed = scanned;


Could we simply return VMPRESSURE_LOW?
 ========================================

> --
> To unsubscribe, send a message with 'unsubscribe linux-mm' in
> the body to majordomo@kvack.org.  For more info on Linux MM,
> see: http://www.linux-mm.org/ .
> Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 50+ messages in thread

* Re: [PATCH v6] memcg: event control at vmpressure.
  2013-06-21 16:27                         ` [PATCH v6] " Minchan Kim
  2013-06-21 16:44                           ` Minchan Kim
  2013-06-21 22:35                           ` Anton Vorontsov
@ 2013-06-22  4:36                           ` Hyunhee Kim
  2013-06-22  4:51                             ` Hyunhee Kim
  2013-06-26  7:31                             ` Minchan Kim
  2013-06-25 16:07                           ` Michal Hocko
  3 siblings, 2 replies; 50+ messages in thread
From: Hyunhee Kim @ 2013-06-22  4:36 UTC (permalink / raw)
  To: Minchan Kim
  Cc: Michal Hocko, Anton Vorontsov, linux-mm, akpm, rob,
	kamezawa.hiroyu, hannes, rientjes, kirill, Kyungmin Park

2013/6/22 Minchan Kim <minchan@kernel.org>:
> Hello Michal,
>
> On Fri, Jun 21, 2013 at 11:19:44AM +0200, Michal Hocko wrote:
>> On Fri 21-06-13 10:22:34, Minchan Kim wrote:
>> > On Fri, Jun 21, 2013 at 09:24:38AM +0900, Hyunhee Kim wrote:
>> > > In the original vmpressure, events are triggered whenever there is a reclaim
>> > > activity. This becomes overheads to user space module and also increases
>> >
>> > Not true.
>> > We have lots of filter to not trigger event even if reclaim is going on.
>> > Your statement would make confuse.
>>
>> Where is the filter implemented? In the kernel? I do not see any
>> throttling in the current mm tree.
>
> 1. mem_cgroup_soft_limit_reclaim
> 2. reclaim caused by DMA zone
> 3. vmpressure_win
>
>>
>> > > power consumption if there is somebody to listen to it. This patch provides
>> > > options to trigger events only when the pressure level changes.
>> > > This trigger option can be set when registering each event by writing
>> > > a trigger option, "edge" or "always", next to the string of levels.
>> > > "edge" means that the event is triggered only when the pressure level is changed.
>> > > "always" means that events are triggered whenever there is a reclaim process.
>> >                                            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
>> >                                                   Not true, either.
>>
>> Is this about vmpressure_win? But I agree that this could be more
>> specific. Something like "`Always' trigger option will signal all events
>> while `edge' option will trigger only events when the level changes."
>>
>> > > To keep backward compatibility, "always" is set by default if nothing is input
>> > > as an option. Each event can have different option. For example,
>> > > "low" level uses "always" trigger option to see reclaim activity at user space
>> > > while "medium"/"critical" uses "edge" to do an important job
>> > > like killing tasks only once.
>> >
>> > Question.
>> >
>> > 1. user: set critical edge
>> > 2. kernel: memory is tight and trigger event with critical
>> > 3. user: kill a program when he receives a event
>> > 4. kernel: memory is very tight again and want to trigger a event
>> >    with critical but fail because last_level was critical and it was edge.
>> >
>> > Right?
>>
>> yes, this is the risk of the edge triggering and the user has to be
>> prepared for that. I still think that it makes some sense to have the
>> two modes.
>
> I'm not sure it's good idea.
> How could user overcome above problem?
> The problem is "critical" means that the memory is really tight so
> that user should do ememgency plan to overcome the situation like
> kill someone. Even, kill could be a problem because the one from userspace
> couldn't use reserved memory pool so that killing could be stalled for a
> long time and then, we could end up encountering OOM. :(
>
> So, the description should include how to overcome above situation in
> userspace efficiently even though memory is really tight, which we don't
> have extra memory to read vmstat.

Thanks for your review.
With the current vmpressure, if we registered all of the levels "low",
"medium", and "critical", and the current level is critical,
there are so many signals. Whenever the critical signal occurs
continuously, "low" and "medium" are also signaled.
And, I still think that in the critical situation, handling these
signals becomes overheads.

How about handling low memory situation, e.g., reclaiming, swapping,
killing some processes, until the lower level event is signaled?
For example,
1. register "medium" and "critical" with edge trigger option
2. When we reach "critical" level, start to kill processes until we
receive "medium"
3. If the memory state become critical again, we can get signal again.

However, in the current vmpressure, vmpressure_level_med and
vmpressure_level_critical are fixed to 60 and 95.
If there is an interface to change these thresholds when registering
each event, I think that we can use "medium" level (this is an
example. we can also make new level) with new threshold (if 60 is too
small) as the level we can finish the low memory handler.

What's your opinion?

Thanks,
Hyunhee Kim.

>
> We reviewers should review that it does make sense. If so, we need to
> write down it in documentation, otherwise, we should fix it from kernel
> side.
>
> Another problem from this patch is that it couldn't detect same event
> contiguously so you are saying userspace have to handle it.
> It doesn't make sense to me. Why should user handle such false positive?
> I don't mean false positive signal is bad because low memory notification
> has inherent such a problem but we should try to avoid such frequent triggering
> if possible.
> IOW, It reveals current vmpressure's problem.
> Fix it without band-aiding of userspace if it's really problem.
>
>>
>> > > @@ -823,7 +831,7 @@ Test:
>> > >     # cd /sys/fs/cgroup/memory/
>> > >     # mkdir foo
>> > >     # cd foo
>> > > -   # cgroup_event_listener memory.pressure_level low &
>> > > +   # cgroup_event_listener memory.pressure_level low edge &
>> > >     # echo 8000000 > memory.limit_in_bytes
>> > >     # echo 8000000 > memory.memsw.limit_in_bytes
>> > >     # echo $$ > tasks
>> > > diff --git a/mm/vmpressure.c b/mm/vmpressure.c
>> > > index 736a601..a08252e 100644
>> > > --- a/mm/vmpressure.c
>> > > +++ b/mm/vmpressure.c
>> > > @@ -137,6 +137,8 @@ static enum vmpressure_levels vmpressure_calc_level(unsigned long scanned,
>> > >  struct vmpressure_event {
>> > >   struct eventfd_ctx *efd;
>> > >   enum vmpressure_levels level;
>> > > + int last_level;
>> >
>> > int? but level is enum vmpressure_levels?
>>
>> good catch
>>
>> > > + bool edge_trigger;
>> > >   struct list_head node;
>> > >  };
>> > >
>> > > @@ -153,11 +155,14 @@ static bool vmpressure_event(struct vmpressure *vmpr,
>> > >
>> > >   list_for_each_entry(ev, &vmpr->events, node) {
>> > >           if (level >= ev->level) {
>> > > +                 if (ev->edge_trigger && level == ev->last_level)
>> > > +                         continue;
>> > > +
>> > >                   eventfd_signal(ev->efd, 1);
>> > >                   signalled = true;
>> > >           }
>> > > +         ev->last_level = level;
>> > >   }
>> > > -
>> >
>> > Unnecessary change.
>> >
>> > >   mutex_unlock(&vmpr->events_lock);
>> > >
>> > >   return signalled;
>> > > @@ -290,9 +295,11 @@ void vmpressure_prio(gfp_t gfp, struct mem_cgroup *memcg, int prio)
>> > >   *
>> > >   * This function associates eventfd context with the vmpressure
>> > >   * infrastructure, so that the notifications will be delivered to the
>> > > - * @eventfd. The @args parameter is a string that denotes pressure level
>> > > + * @eventfd. The @args parameters are a string that denotes pressure level
>> > >   * threshold (one of vmpressure_str_levels, i.e. "low", "medium", or
>> > > - * "critical").
>> > > + * "critical") and a trigger option that decides whether events are triggered
>> > > + * continuously or only on edge ("always" or "edge" if "edge", events
>> > > + * are triggered when the pressure level changes.
>> > >   *
>> > >   * This function should not be used directly, just pass it to (struct
>> > >   * cftype).register_event, and then cgroup core will handle everything by
>> > > @@ -303,22 +310,43 @@ int vmpressure_register_event(struct cgroup *cg, struct cftype *cft,
>> > >  {
>> > >   struct vmpressure *vmpr = cg_to_vmpressure(cg);
>> > >   struct vmpressure_event *ev;
>> > > + char *strlevel, *strtrigger;
>> > >   int level;
>> > > + bool trigger;
>> >
>> > What trigger?
>> > Would be better to use "bool egde" instead?
>>
>> yes
>>
>> > > +
>> > > + strlevel = args;
>> > > + strtrigger = strchr(args, ' ');
>> > > +
>> > > + if (strtrigger) {
>> > > +         *strtrigger = '\0';
>> > > +         strtrigger++;
>> > > + }
>> > >
>> > >   for (level = 0; level < VMPRESSURE_NUM_LEVELS; level++) {
>> > > -         if (!strcmp(vmpressure_str_levels[level], args))
>> > > +         if (!strcmp(vmpressure_str_levels[level], strlevel))
>> > >                   break;
>> > >   }
>> > >
>> > >   if (level >= VMPRESSURE_NUM_LEVELS)
>> > >           return -EINVAL;
>> > >
>> > > + if (strtrigger == NULL)
>> > > +         trigger = false;
>> > > + else if (!strcmp(strtrigger, "always"))
>> > > +         trigger = false;
>> > > + else if (!strcmp(strtrigger, "edge"))
>> > > +         trigger = true;
>> > > + else
>> > > +         return -EINVAL;
>> > > +
>> > >   ev = kzalloc(sizeof(*ev), GFP_KERNEL);
>> > >   if (!ev)
>> > >           return -ENOMEM;
>> > >
>> > >   ev->efd = eventfd;
>> > >   ev->level = level;
>> > > + ev->last_level = -1;
>> >
>> > VMPRESSURE_NONE is better?
>>
>> Yes
>> --
>> Michal Hocko
>> SUSE Labs
>
> --
> Kind regards,
> Minchan Kim
>
> --
> To unsubscribe, send a message with 'unsubscribe linux-mm' in
> the body to majordomo@kvack.org.  For more info on Linux MM,
> see: http://www.linux-mm.org/ .
> Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 50+ messages in thread

* Re: [PATCH v6] memcg: event control at vmpressure.
  2013-06-22  4:36                           ` Hyunhee Kim
@ 2013-06-22  4:51                             ` Hyunhee Kim
  2013-06-22  5:50                               ` [PATCH] memcg: consider "scanned < reclaimed" case when calculating Hyunhee Kim
  2013-06-26  7:34                               ` [PATCH v6] memcg: event control at vmpressure Minchan Kim
  2013-06-26  7:31                             ` Minchan Kim
  1 sibling, 2 replies; 50+ messages in thread
From: Hyunhee Kim @ 2013-06-22  4:51 UTC (permalink / raw)
  To: Minchan Kim
  Cc: Michal Hocko, Anton Vorontsov, linux-mm, akpm, rob,
	kamezawa.hiroyu, hannes, rientjes, kirill, Kyungmin Park

And one more thing.

vmpressure sends all of the lower events than the level (current).

155         list_for_each_entry(ev, &vmpr->events, node) {
156                 if (level >= ev->level) {
157                         eventfd_signal(ev->efd, 1);
158                         signalled = true;
159                 }
160         }

Isn't it wasteful?
When we registered "low" and "critical", and the current level is
critical, why should we signal "low" level together?
If we receive "critical" only, can't we just assume "low" level at the
user space?

2013/6/22 Hyunhee Kim <hyunhee.kim@samsung.com>:
> 2013/6/22 Minchan Kim <minchan@kernel.org>:
>> Hello Michal,
>>
>> On Fri, Jun 21, 2013 at 11:19:44AM +0200, Michal Hocko wrote:
>>> On Fri 21-06-13 10:22:34, Minchan Kim wrote:
>>> > On Fri, Jun 21, 2013 at 09:24:38AM +0900, Hyunhee Kim wrote:
>>> > > In the original vmpressure, events are triggered whenever there is a reclaim
>>> > > activity. This becomes overheads to user space module and also increases
>>> >
>>> > Not true.
>>> > We have lots of filter to not trigger event even if reclaim is going on.
>>> > Your statement would make confuse.
>>>
>>> Where is the filter implemented? In the kernel? I do not see any
>>> throttling in the current mm tree.
>>
>> 1. mem_cgroup_soft_limit_reclaim
>> 2. reclaim caused by DMA zone
>> 3. vmpressure_win
>>
>>>
>>> > > power consumption if there is somebody to listen to it. This patch provides
>>> > > options to trigger events only when the pressure level changes.
>>> > > This trigger option can be set when registering each event by writing
>>> > > a trigger option, "edge" or "always", next to the string of levels.
>>> > > "edge" means that the event is triggered only when the pressure level is changed.
>>> > > "always" means that events are triggered whenever there is a reclaim process.
>>> >                                            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
>>> >                                                   Not true, either.
>>>
>>> Is this about vmpressure_win? But I agree that this could be more
>>> specific. Something like "`Always' trigger option will signal all events
>>> while `edge' option will trigger only events when the level changes."
>>>
>>> > > To keep backward compatibility, "always" is set by default if nothing is input
>>> > > as an option. Each event can have different option. For example,
>>> > > "low" level uses "always" trigger option to see reclaim activity at user space
>>> > > while "medium"/"critical" uses "edge" to do an important job
>>> > > like killing tasks only once.
>>> >
>>> > Question.
>>> >
>>> > 1. user: set critical edge
>>> > 2. kernel: memory is tight and trigger event with critical
>>> > 3. user: kill a program when he receives a event
>>> > 4. kernel: memory is very tight again and want to trigger a event
>>> >    with critical but fail because last_level was critical and it was edge.
>>> >
>>> > Right?
>>>
>>> yes, this is the risk of the edge triggering and the user has to be
>>> prepared for that. I still think that it makes some sense to have the
>>> two modes.
>>
>> I'm not sure it's good idea.
>> How could user overcome above problem?
>> The problem is "critical" means that the memory is really tight so
>> that user should do ememgency plan to overcome the situation like
>> kill someone. Even, kill could be a problem because the one from userspace
>> couldn't use reserved memory pool so that killing could be stalled for a
>> long time and then, we could end up encountering OOM. :(
>>
>> So, the description should include how to overcome above situation in
>> userspace efficiently even though memory is really tight, which we don't
>> have extra memory to read vmstat.
>
> Thanks for your review.
> With the current vmpressure, if we registered all of the levels "low",
> "medium", and "critical", and the current level is critical,
> there are so many signals. Whenever the critical signal occurs
> continuously, "low" and "medium" are also signaled.
> And, I still think that in the critical situation, handling these
> signals becomes overheads.
>
> How about handling low memory situation, e.g., reclaiming, swapping,
> killing some processes, until the lower level event is signaled?
> For example,
> 1. register "medium" and "critical" with edge trigger option
> 2. When we reach "critical" level, start to kill processes until we
> receive "medium"
> 3. If the memory state become critical again, we can get signal again.
>
> However, in the current vmpressure, vmpressure_level_med and
> vmpressure_level_critical are fixed to 60 and 95.
> If there is an interface to change these thresholds when registering
> each event, I think that we can use "medium" level (this is an
> example. we can also make new level) with new threshold (if 60 is too
> small) as the level we can finish the low memory handler.
>
> What's your opinion?
>
> Thanks,
> Hyunhee Kim.
>
>>
>> We reviewers should review that it does make sense. If so, we need to
>> write down it in documentation, otherwise, we should fix it from kernel
>> side.
>>
>> Another problem from this patch is that it couldn't detect same event
>> contiguously so you are saying userspace have to handle it.
>> It doesn't make sense to me. Why should user handle such false positive?
>> I don't mean false positive signal is bad because low memory notification
>> has inherent such a problem but we should try to avoid such frequent triggering
>> if possible.
>> IOW, It reveals current vmpressure's problem.
>> Fix it without band-aiding of userspace if it's really problem.
>>
>>>
>>> > > @@ -823,7 +831,7 @@ Test:
>>> > >     # cd /sys/fs/cgroup/memory/
>>> > >     # mkdir foo
>>> > >     # cd foo
>>> > > -   # cgroup_event_listener memory.pressure_level low &
>>> > > +   # cgroup_event_listener memory.pressure_level low edge &
>>> > >     # echo 8000000 > memory.limit_in_bytes
>>> > >     # echo 8000000 > memory.memsw.limit_in_bytes
>>> > >     # echo $$ > tasks
>>> > > diff --git a/mm/vmpressure.c b/mm/vmpressure.c
>>> > > index 736a601..a08252e 100644
>>> > > --- a/mm/vmpressure.c
>>> > > +++ b/mm/vmpressure.c
>>> > > @@ -137,6 +137,8 @@ static enum vmpressure_levels vmpressure_calc_level(unsigned long scanned,
>>> > >  struct vmpressure_event {
>>> > >   struct eventfd_ctx *efd;
>>> > >   enum vmpressure_levels level;
>>> > > + int last_level;
>>> >
>>> > int? but level is enum vmpressure_levels?
>>>
>>> good catch
>>>
>>> > > + bool edge_trigger;
>>> > >   struct list_head node;
>>> > >  };
>>> > >
>>> > > @@ -153,11 +155,14 @@ static bool vmpressure_event(struct vmpressure *vmpr,
>>> > >
>>> > >   list_for_each_entry(ev, &vmpr->events, node) {
>>> > >           if (level >= ev->level) {
>>> > > +                 if (ev->edge_trigger && level == ev->last_level)
>>> > > +                         continue;
>>> > > +
>>> > >                   eventfd_signal(ev->efd, 1);
>>> > >                   signalled = true;
>>> > >           }
>>> > > +         ev->last_level = level;
>>> > >   }
>>> > > -
>>> >
>>> > Unnecessary change.
>>> >
>>> > >   mutex_unlock(&vmpr->events_lock);
>>> > >
>>> > >   return signalled;
>>> > > @@ -290,9 +295,11 @@ void vmpressure_prio(gfp_t gfp, struct mem_cgroup *memcg, int prio)
>>> > >   *
>>> > >   * This function associates eventfd context with the vmpressure
>>> > >   * infrastructure, so that the notifications will be delivered to the
>>> > > - * @eventfd. The @args parameter is a string that denotes pressure level
>>> > > + * @eventfd. The @args parameters are a string that denotes pressure level
>>> > >   * threshold (one of vmpressure_str_levels, i.e. "low", "medium", or
>>> > > - * "critical").
>>> > > + * "critical") and a trigger option that decides whether events are triggered
>>> > > + * continuously or only on edge ("always" or "edge" if "edge", events
>>> > > + * are triggered when the pressure level changes.
>>> > >   *
>>> > >   * This function should not be used directly, just pass it to (struct
>>> > >   * cftype).register_event, and then cgroup core will handle everything by
>>> > > @@ -303,22 +310,43 @@ int vmpressure_register_event(struct cgroup *cg, struct cftype *cft,
>>> > >  {
>>> > >   struct vmpressure *vmpr = cg_to_vmpressure(cg);
>>> > >   struct vmpressure_event *ev;
>>> > > + char *strlevel, *strtrigger;
>>> > >   int level;
>>> > > + bool trigger;
>>> >
>>> > What trigger?
>>> > Would be better to use "bool egde" instead?
>>>
>>> yes
>>>
>>> > > +
>>> > > + strlevel = args;
>>> > > + strtrigger = strchr(args, ' ');
>>> > > +
>>> > > + if (strtrigger) {
>>> > > +         *strtrigger = '\0';
>>> > > +         strtrigger++;
>>> > > + }
>>> > >
>>> > >   for (level = 0; level < VMPRESSURE_NUM_LEVELS; level++) {
>>> > > -         if (!strcmp(vmpressure_str_levels[level], args))
>>> > > +         if (!strcmp(vmpressure_str_levels[level], strlevel))
>>> > >                   break;
>>> > >   }
>>> > >
>>> > >   if (level >= VMPRESSURE_NUM_LEVELS)
>>> > >           return -EINVAL;
>>> > >
>>> > > + if (strtrigger == NULL)
>>> > > +         trigger = false;
>>> > > + else if (!strcmp(strtrigger, "always"))
>>> > > +         trigger = false;
>>> > > + else if (!strcmp(strtrigger, "edge"))
>>> > > +         trigger = true;
>>> > > + else
>>> > > +         return -EINVAL;
>>> > > +
>>> > >   ev = kzalloc(sizeof(*ev), GFP_KERNEL);
>>> > >   if (!ev)
>>> > >           return -ENOMEM;
>>> > >
>>> > >   ev->efd = eventfd;
>>> > >   ev->level = level;
>>> > > + ev->last_level = -1;
>>> >
>>> > VMPRESSURE_NONE is better?
>>>
>>> Yes
>>> --
>>> Michal Hocko
>>> SUSE Labs
>>
>> --
>> Kind regards,
>> Minchan Kim
>>
>> --
>> To unsubscribe, send a message with 'unsubscribe linux-mm' in
>> the body to majordomo@kvack.org.  For more info on Linux MM,
>> see: http://www.linux-mm.org/ .
>> Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 50+ messages in thread

* [PATCH] memcg: consider "scanned < reclaimed" case when calculating
  2013-06-22  4:51                             ` Hyunhee Kim
@ 2013-06-22  5:50                               ` Hyunhee Kim
  2013-06-22  7:34                                 ` [PATCH] memcg: add interface to specify thresholds of vmpressure Hyunhee Kim
  2013-06-26  7:35                                 ` [PATCH] memcg: consider "scanned < reclaimed" case when calculating Minchan Kim
  2013-06-26  7:34                               ` [PATCH v6] memcg: event control at vmpressure Minchan Kim
  1 sibling, 2 replies; 50+ messages in thread
From: Hyunhee Kim @ 2013-06-22  5:50 UTC (permalink / raw)
  To: 'Hyunhee Kim', 'Minchan Kim',
	'Michal Hocko', 'Anton Vorontsov',
	linux-mm, akpm, rob, kamezawa.hiroyu, hannes, rientjes, kirill,
	'Kyungmin Park'

In vmpressure, the pressure level is calculated based on the ratio
of how many pages were scanned vs. reclaimed in a given time window.
However, there is a possibility that "scanned < reclaimed" in such
a case, THP page is reclaimed or reclaiming is abandoned by fatal
signal in shrink_inactive_list, etc. So, with this patch, we just
return "low" level when "scanned < reclaimed" by assuming that
there are enough reclaimed pages.

Signed-off-by: Hyunhee Kim <hyunhee.kim@samsung.com>
Signed-off-by: Kyungmin Park <kyungmin.park@samsung.com>
---
 mm/vmpressure.c |    3 +++
 1 file changed, 3 insertions(+)

diff --git a/mm/vmpressure.c b/mm/vmpressure.c
index 736a601..c6560f3 100644
--- a/mm/vmpressure.c
+++ b/mm/vmpressure.c
@@ -118,6 +118,9 @@ static enum vmpressure_levels vmpressure_calc_level(unsigned long scanned,
 	unsigned long scale = scanned + reclaimed;
 	unsigned long pressure;
 
+	if (reclaimed > scanned)
+		return VMPRESSURE_LOW;
+
 	/*
 	 * We calculate the ratio (in percents) of how many pages were
 	 * scanned vs. reclaimed in a given time frame (window). Note that
-- 
1.7.9.5


--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply related	[flat|nested] 50+ messages in thread

* [PATCH] memcg: add interface to specify thresholds of vmpressure
  2013-06-22  5:50                               ` [PATCH] memcg: consider "scanned < reclaimed" case when calculating Hyunhee Kim
@ 2013-06-22  7:34                                 ` Hyunhee Kim
  2013-06-25 20:46                                   ` Michal Hocko
  2013-06-26  7:39                                   ` Minchan Kim
  2013-06-26  7:35                                 ` [PATCH] memcg: consider "scanned < reclaimed" case when calculating Minchan Kim
  1 sibling, 2 replies; 50+ messages in thread
From: Hyunhee Kim @ 2013-06-22  7:34 UTC (permalink / raw)
  To: 'Hyunhee Kim', 'Minchan Kim',
	'Michal Hocko', 'Anton Vorontsov',
	linux-mm, akpm, rob, kamezawa.hiroyu, hannes, rientjes, kirill,
	'Kyungmin Park'

Memory pressure is calculated based on scanned/reclaimed ratio. The higher
the value, the more number unsuccessful reclaims there were. These thresholds
can be specified when each event is registered by writing it next to the
string of level. Default value is 60 for "medium" and 95 for "critical"

Signed-off-by: Hyunhee Kim <hyunhee.kim@samsung.com>
Signed-off-by: Kyungmin Park <kyungmin.park@samsung.com>
---
 Documentation/cgroups/memory.txt |    8 +++++-
 mm/vmpressure.c                  |   54 +++++++++++++++++++++++++++-----------
 2 files changed, 45 insertions(+), 17 deletions(-)

diff --git a/Documentation/cgroups/memory.txt b/Documentation/cgroups/memory.txt
index ddf4f93..bd9cf46 100644
--- a/Documentation/cgroups/memory.txt
+++ b/Documentation/cgroups/memory.txt
@@ -807,13 +807,19 @@ register a notification, an application must:
 
 - create an eventfd using eventfd(2);
 - open memory.pressure_level;
-- write string like "<event_fd> <fd of memory.pressure_level> <level>"
+- write string like "<event_fd> <fd of memory.pressure_level> <level> <threshold>"
   to cgroup.event_control.
 
 Application will be notified through eventfd when memory pressure is at
 the specific level (or higher). Read/write operations to
 memory.pressure_level are no implemented.
 
+We account memory pressure based on scanned/reclaimed ratio. The higher
+the value, the more number unsuccessful reclaims there were. These thresholds
+can be specified when each event is registered by writing it next to the
+string of level. Default value is 60 for "medium" and 95 for "critical".
+If nothing is input as threshold, default values are used.
+
 Test:
 
    Here is a small script example that makes a new cgroup, sets up a
diff --git a/mm/vmpressure.c b/mm/vmpressure.c
index 736a601..52b266c 100644
--- a/mm/vmpressure.c
+++ b/mm/vmpressure.c
@@ -40,15 +40,6 @@
 static const unsigned long vmpressure_win = SWAP_CLUSTER_MAX * 16;
 
 /*
- * These thresholds are used when we account memory pressure through
- * scanned/reclaimed ratio. The current values were chosen empirically. In
- * essence, they are percents: the higher the value, the more number
- * unsuccessful reclaims there were.
- */
-static const unsigned int vmpressure_level_med = 60;
-static const unsigned int vmpressure_level_critical = 95;
-
-/*
  * When there are too little pages left to scan, vmpressure() may miss the
  * critical pressure as number of pages will be less than "window size".
  * However, in that case the vmscan priority will raise fast as the
@@ -97,6 +88,19 @@ enum vmpressure_levels {
 	VMPRESSURE_NUM_LEVELS,
 };
 
+/*
+ * These thresholds are used when we account memory pressure through
+ * scanned/reclaimed ratio. In essence, they are percents: the higher
+ * the value, the more number unsuccessful reclaims there were.
+ * These thresholds can be specified when each event is registered.
+ */
+
+static unsigned int vmpressure_threshold_levels[] = {
+	[VMPRESSURE_LOW] = 0,
+	[VMPRESSURE_MEDIUM] = 60,
+	[VMPRESSURE_CRITICAL] = 95,
+};
+
 static const char * const vmpressure_str_levels[] = {
 	[VMPRESSURE_LOW] = "low",
 	[VMPRESSURE_MEDIUM] = "medium",
@@ -105,11 +109,14 @@ static const char * const vmpressure_str_levels[] = {
 
 static enum vmpressure_levels vmpressure_level(unsigned long pressure)
 {
-	if (pressure >= vmpressure_level_critical)
-		return VMPRESSURE_CRITICAL;
-	else if (pressure >= vmpressure_level_med)
-		return VMPRESSURE_MEDIUM;
-	return VMPRESSURE_LOW;
+	int level;
+
+	for (level = VMPRESSURE_NUM_LEVELS - 1; level >= 0; level--) {
+		if (pressure >= vmpressure_threshold_levels[level])
+			break;
+	}
+
+	return level;
 }
 
 static enum vmpressure_levels vmpressure_calc_level(unsigned long scanned,
@@ -303,10 +310,21 @@ int vmpressure_register_event(struct cgroup *cg, struct cftype *cft,
 {
 	struct vmpressure *vmpr = cg_to_vmpressure(cg);
 	struct vmpressure_event *ev;
-	int level;
+	char *strlevel, *strthres;
+	int level, thres = -1;
+
+	strlevel = args;
+	strthres = strchr(args, ' ');
+
+	if (strthres) {
+		*strthres = '\0';
+		strthres++;
+		if(kstrtoint(strthres, 10, &thres))
+			return -EINVAL;
+	}
 
 	for (level = 0; level < VMPRESSURE_NUM_LEVELS; level++) {
-		if (!strcmp(vmpressure_str_levels[level], args))
+		if (!strcmp(vmpressure_str_levels[level], strlevel))
 			break;
 	}
 
@@ -320,6 +338,10 @@ int vmpressure_register_event(struct cgroup *cg, struct cftype *cft,
 	ev->efd = eventfd;
 	ev->level = level;
 
+	/* If user input threshold is not valid value, use default value */
+	if (thres <= 100 && thres >= 0)
+		vmpressure_threshold_levels[level] = thres;
+
 	mutex_lock(&vmpr->events_lock);
 	list_add(&ev->node, &vmpr->events);
 	mutex_unlock(&vmpr->events_lock);
-- 
1.7.9.5


--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply related	[flat|nested] 50+ messages in thread

* Re: [PATCH v6] memcg: event control at vmpressure.
  2013-06-21 16:27                         ` [PATCH v6] " Minchan Kim
                                             ` (2 preceding siblings ...)
  2013-06-22  4:36                           ` Hyunhee Kim
@ 2013-06-25 16:07                           ` Michal Hocko
  3 siblings, 0 replies; 50+ messages in thread
From: Michal Hocko @ 2013-06-25 16:07 UTC (permalink / raw)
  To: Minchan Kim
  Cc: Hyunhee Kim, 'Anton Vorontsov',
	linux-mm, akpm, rob, kamezawa.hiroyu, hannes, rientjes, kirill,
	'Kyungmin Park'

Hi,

On Sat 22-06-13 01:27:43, Minchan Kim wrote:
> > > Question.
> > > 
> > > 1. user: set critical edge
> > > 2. kernel: memory is tight and trigger event with critical
> > > 3. user: kill a program when he receives a event
> > > 4. kernel: memory is very tight again and want to trigger a event
> > >    with critical but fail because last_level was critical and it was edge.
> > > 
> > > Right?
> > 
> > yes, this is the risk of the edge triggering and the user has to be
> > prepared for that. I still think that it makes some sense to have the
> > two modes.
> 
> I'm not sure it's good idea.
> How could user overcome above problem?

Well, if the interface is proposed to workaround issues of the current
implementation then it is surely not a good path. 

I was referring to a concept of edge triggering which makes some sense
to me.  Consider one shot actions that should start/stop a certain
action when the level changes (e.g. very popular notification to admin).

[...]
-- 
Michal Hocko
SUSE Labs

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 50+ messages in thread

* Re: [PATCH] memcg: add interface to specify thresholds of vmpressure
  2013-06-22  7:34                                 ` [PATCH] memcg: add interface to specify thresholds of vmpressure Hyunhee Kim
@ 2013-06-25 20:46                                   ` Michal Hocko
  2013-06-26  7:39                                   ` Minchan Kim
  1 sibling, 0 replies; 50+ messages in thread
From: Michal Hocko @ 2013-06-25 20:46 UTC (permalink / raw)
  To: Hyunhee Kim
  Cc: 'Minchan Kim', 'Anton Vorontsov',
	linux-mm, akpm, rob, kamezawa.hiroyu, hannes, rientjes, kirill,
	'Kyungmin Park'

On Sat 22-06-13 16:34:34, Hyunhee Kim wrote:
> Memory pressure is calculated based on scanned/reclaimed ratio.

It is done that way _now_ and there is no guarantee it will do that in
future. There was a reason why the interface is so mean on any details.

I am sorry to repeat myself but this is a user interface and we will have
to maintain it for _ever_. We cannot export random knobs that work just
now. Future implementation of the reclaim might change considerably and
scaned vs. reclaimed might no longer mean the same thing.

So no, again, please do not try to push random things to handle you
current and very specific use case.

Nack to this patch.

> The higher
> the value, the more number unsuccessful reclaims there were. These thresholds
> can be specified when each event is registered by writing it next to the
> string of level. Default value is 60 for "medium" and 95 for "critical"
> 
> Signed-off-by: Hyunhee Kim <hyunhee.kim@samsung.com>
> Signed-off-by: Kyungmin Park <kyungmin.park@samsung.com>
> ---
>  Documentation/cgroups/memory.txt |    8 +++++-
>  mm/vmpressure.c                  |   54 +++++++++++++++++++++++++++-----------
>  2 files changed, 45 insertions(+), 17 deletions(-)
> 
> diff --git a/Documentation/cgroups/memory.txt b/Documentation/cgroups/memory.txt
> index ddf4f93..bd9cf46 100644
> --- a/Documentation/cgroups/memory.txt
> +++ b/Documentation/cgroups/memory.txt
> @@ -807,13 +807,19 @@ register a notification, an application must:
>  
>  - create an eventfd using eventfd(2);
>  - open memory.pressure_level;
> -- write string like "<event_fd> <fd of memory.pressure_level> <level>"
> +- write string like "<event_fd> <fd of memory.pressure_level> <level> <threshold>"
>    to cgroup.event_control.
>  
>  Application will be notified through eventfd when memory pressure is at
>  the specific level (or higher). Read/write operations to
>  memory.pressure_level are no implemented.
>  
> +We account memory pressure based on scanned/reclaimed ratio. The higher
> +the value, the more number unsuccessful reclaims there were. These thresholds
> +can be specified when each event is registered by writing it next to the
> +string of level. Default value is 60 for "medium" and 95 for "critical".
> +If nothing is input as threshold, default values are used.
> +
>  Test:
>  
>     Here is a small script example that makes a new cgroup, sets up a
> diff --git a/mm/vmpressure.c b/mm/vmpressure.c
> index 736a601..52b266c 100644
> --- a/mm/vmpressure.c
> +++ b/mm/vmpressure.c
> @@ -40,15 +40,6 @@
>  static const unsigned long vmpressure_win = SWAP_CLUSTER_MAX * 16;
>  
>  /*
> - * These thresholds are used when we account memory pressure through
> - * scanned/reclaimed ratio. The current values were chosen empirically. In
> - * essence, they are percents: the higher the value, the more number
> - * unsuccessful reclaims there were.
> - */
> -static const unsigned int vmpressure_level_med = 60;
> -static const unsigned int vmpressure_level_critical = 95;
> -
> -/*
>   * When there are too little pages left to scan, vmpressure() may miss the
>   * critical pressure as number of pages will be less than "window size".
>   * However, in that case the vmscan priority will raise fast as the
> @@ -97,6 +88,19 @@ enum vmpressure_levels {
>  	VMPRESSURE_NUM_LEVELS,
>  };
>  
> +/*
> + * These thresholds are used when we account memory pressure through
> + * scanned/reclaimed ratio. In essence, they are percents: the higher
> + * the value, the more number unsuccessful reclaims there were.
> + * These thresholds can be specified when each event is registered.
> + */
> +
> +static unsigned int vmpressure_threshold_levels[] = {
> +	[VMPRESSURE_LOW] = 0,
> +	[VMPRESSURE_MEDIUM] = 60,
> +	[VMPRESSURE_CRITICAL] = 95,
> +};
> +
>  static const char * const vmpressure_str_levels[] = {
>  	[VMPRESSURE_LOW] = "low",
>  	[VMPRESSURE_MEDIUM] = "medium",
> @@ -105,11 +109,14 @@ static const char * const vmpressure_str_levels[] = {
>  
>  static enum vmpressure_levels vmpressure_level(unsigned long pressure)
>  {
> -	if (pressure >= vmpressure_level_critical)
> -		return VMPRESSURE_CRITICAL;
> -	else if (pressure >= vmpressure_level_med)
> -		return VMPRESSURE_MEDIUM;
> -	return VMPRESSURE_LOW;
> +	int level;
> +
> +	for (level = VMPRESSURE_NUM_LEVELS - 1; level >= 0; level--) {
> +		if (pressure >= vmpressure_threshold_levels[level])
> +			break;
> +	}
> +
> +	return level;
>  }
>  
>  static enum vmpressure_levels vmpressure_calc_level(unsigned long scanned,
> @@ -303,10 +310,21 @@ int vmpressure_register_event(struct cgroup *cg, struct cftype *cft,
>  {
>  	struct vmpressure *vmpr = cg_to_vmpressure(cg);
>  	struct vmpressure_event *ev;
> -	int level;
> +	char *strlevel, *strthres;
> +	int level, thres = -1;
> +
> +	strlevel = args;
> +	strthres = strchr(args, ' ');
> +
> +	if (strthres) {
> +		*strthres = '\0';
> +		strthres++;
> +		if(kstrtoint(strthres, 10, &thres))
> +			return -EINVAL;
> +	}
>  
>  	for (level = 0; level < VMPRESSURE_NUM_LEVELS; level++) {
> -		if (!strcmp(vmpressure_str_levels[level], args))
> +		if (!strcmp(vmpressure_str_levels[level], strlevel))
>  			break;
>  	}
>  
> @@ -320,6 +338,10 @@ int vmpressure_register_event(struct cgroup *cg, struct cftype *cft,
>  	ev->efd = eventfd;
>  	ev->level = level;
>  
> +	/* If user input threshold is not valid value, use default value */
> +	if (thres <= 100 && thres >= 0)
> +		vmpressure_threshold_levels[level] = thres;
> +
>  	mutex_lock(&vmpr->events_lock);
>  	list_add(&ev->node, &vmpr->events);
>  	mutex_unlock(&vmpr->events_lock);
> -- 
> 1.7.9.5
> 
> 
> --
> To unsubscribe, send a message with 'unsubscribe linux-mm' in
> the body to majordomo@kvack.org.  For more info on Linux MM,
> see: http://www.linux-mm.org/ .
> Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

-- 
Michal Hocko
SUSE Labs

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 50+ messages in thread

* Re: [PATCH v6] memcg: event control at vmpressure.
  2013-06-22  4:36                           ` Hyunhee Kim
  2013-06-22  4:51                             ` Hyunhee Kim
@ 2013-06-26  7:31                             ` Minchan Kim
  1 sibling, 0 replies; 50+ messages in thread
From: Minchan Kim @ 2013-06-26  7:31 UTC (permalink / raw)
  To: Hyunhee Kim
  Cc: Michal Hocko, Anton Vorontsov, linux-mm, akpm, rob,
	kamezawa.hiroyu, hannes, rientjes, kirill, Kyungmin Park

On Sat, Jun 22, 2013 at 01:36:56PM +0900, Hyunhee Kim wrote:
> 2013/6/22 Minchan Kim <minchan@kernel.org>:
> > Hello Michal,
> >
> > On Fri, Jun 21, 2013 at 11:19:44AM +0200, Michal Hocko wrote:
> >> On Fri 21-06-13 10:22:34, Minchan Kim wrote:
> >> > On Fri, Jun 21, 2013 at 09:24:38AM +0900, Hyunhee Kim wrote:
> >> > > In the original vmpressure, events are triggered whenever there is a reclaim
> >> > > activity. This becomes overheads to user space module and also increases
> >> >
> >> > Not true.
> >> > We have lots of filter to not trigger event even if reclaim is going on.
> >> > Your statement would make confuse.
> >>
> >> Where is the filter implemented? In the kernel? I do not see any
> >> throttling in the current mm tree.
> >
> > 1. mem_cgroup_soft_limit_reclaim
> > 2. reclaim caused by DMA zone
> > 3. vmpressure_win
> >
> >>
> >> > > power consumption if there is somebody to listen to it. This patch provides
> >> > > options to trigger events only when the pressure level changes.
> >> > > This trigger option can be set when registering each event by writing
> >> > > a trigger option, "edge" or "always", next to the string of levels.
> >> > > "edge" means that the event is triggered only when the pressure level is changed.
> >> > > "always" means that events are triggered whenever there is a reclaim process.
> >> >                                            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
> >> >                                                   Not true, either.
> >>
> >> Is this about vmpressure_win? But I agree that this could be more
> >> specific. Something like "`Always' trigger option will signal all events
> >> while `edge' option will trigger only events when the level changes."
> >>
> >> > > To keep backward compatibility, "always" is set by default if nothing is input
> >> > > as an option. Each event can have different option. For example,
> >> > > "low" level uses "always" trigger option to see reclaim activity at user space
> >> > > while "medium"/"critical" uses "edge" to do an important job
> >> > > like killing tasks only once.
> >> >
> >> > Question.
> >> >
> >> > 1. user: set critical edge
> >> > 2. kernel: memory is tight and trigger event with critical
> >> > 3. user: kill a program when he receives a event
> >> > 4. kernel: memory is very tight again and want to trigger a event
> >> >    with critical but fail because last_level was critical and it was edge.
> >> >
> >> > Right?
> >>
> >> yes, this is the risk of the edge triggering and the user has to be
> >> prepared for that. I still think that it makes some sense to have the
> >> two modes.
> >
> > I'm not sure it's good idea.
> > How could user overcome above problem?
> > The problem is "critical" means that the memory is really tight so
> > that user should do ememgency plan to overcome the situation like
> > kill someone. Even, kill could be a problem because the one from userspace
> > couldn't use reserved memory pool so that killing could be stalled for a
> > long time and then, we could end up encountering OOM. :(
> >
> > So, the description should include how to overcome above situation in
> > userspace efficiently even though memory is really tight, which we don't
> > have extra memory to read vmstat.
> 
> Thanks for your review.
> With the current vmpressure, if we registered all of the levels "low",
> "medium", and "critical", and the current level is critical,
> there are so many signals. Whenever the critical signal occurs
> continuously, "low" and "medium" are also signaled.
> And, I still think that in the critical situation, handling these
> signals becomes overheads.

Agree.

> 
> How about handling low memory situation, e.g., reclaiming, swapping,
> killing some processes, until the lower level event is signaled?
> For example,
> 1. register "medium" and "critical" with edge trigger option
> 2. When we reach "critical" level, start to kill processes until we
> receive "medium"
> 3. If the memory state become critical again, we can get signal again.
> 

How could you make previous vmpressure event become medium?

1)
Critical event -> kill process -> enough memory ->
dummy memory hogger to make pressure so that it kicks in reclaimer
to make new vmpressure -> unfortunately, critical event should happen
because LRU tail has a ton of unreclaimable pages -> but failed
because critical event was edge.

2)
Critical event -> kill process -> enough memory ->
dummy memory hogger to make pressure so that it kicks in reclaimer
to make new vmpressure -> fortunately, medium event happens because
LRU has proper pages for vmpressure to trigger medium event

It means we are making Luckinux.

> However, in the current vmpressure, vmpressure_level_med and
> vmpressure_level_critical are fixed to 60 and 95.
> If there is an interface to change these thresholds when registering
> each event, I think that we can use "medium" level (this is an
> example. we can also make new level) with new threshold (if 60 is too
> small) as the level we can finish the low memory handler.
> 
> What's your opinion?

I don't know why vmpressure_level_[med|critical] have such vaule
but if we need some knob to tune system, vmpressrure_win would be more
prorper rather than allowing to control each level value by user space.

Having said that, we might need fix default value but it doesn't mean
we should expose it to userspace. If you have a problem on default values,
please fix and send a mail with numbers.

Thanks.

> 
> Thanks,
> Hyunhee Kim.
> 
> >
> > We reviewers should review that it does make sense. If so, we need to
> > write down it in documentation, otherwise, we should fix it from kernel
> > side.
> >
> > Another problem from this patch is that it couldn't detect same event
> > contiguously so you are saying userspace have to handle it.
> > It doesn't make sense to me. Why should user handle such false positive?
> > I don't mean false positive signal is bad because low memory notification
> > has inherent such a problem but we should try to avoid such frequent triggering
> > if possible.
> > IOW, It reveals current vmpressure's problem.
> > Fix it without band-aiding of userspace if it's really problem.
> >
> >>
> >> > > @@ -823,7 +831,7 @@ Test:
> >> > >     # cd /sys/fs/cgroup/memory/
> >> > >     # mkdir foo
> >> > >     # cd foo
> >> > > -   # cgroup_event_listener memory.pressure_level low &
> >> > > +   # cgroup_event_listener memory.pressure_level low edge &
> >> > >     # echo 8000000 > memory.limit_in_bytes
> >> > >     # echo 8000000 > memory.memsw.limit_in_bytes
> >> > >     # echo $$ > tasks
> >> > > diff --git a/mm/vmpressure.c b/mm/vmpressure.c
> >> > > index 736a601..a08252e 100644
> >> > > --- a/mm/vmpressure.c
> >> > > +++ b/mm/vmpressure.c
> >> > > @@ -137,6 +137,8 @@ static enum vmpressure_levels vmpressure_calc_level(unsigned long scanned,
> >> > >  struct vmpressure_event {
> >> > >   struct eventfd_ctx *efd;
> >> > >   enum vmpressure_levels level;
> >> > > + int last_level;
> >> >
> >> > int? but level is enum vmpressure_levels?
> >>
> >> good catch
> >>
> >> > > + bool edge_trigger;
> >> > >   struct list_head node;
> >> > >  };
> >> > >
> >> > > @@ -153,11 +155,14 @@ static bool vmpressure_event(struct vmpressure *vmpr,
> >> > >
> >> > >   list_for_each_entry(ev, &vmpr->events, node) {
> >> > >           if (level >= ev->level) {
> >> > > +                 if (ev->edge_trigger && level == ev->last_level)
> >> > > +                         continue;
> >> > > +
> >> > >                   eventfd_signal(ev->efd, 1);
> >> > >                   signalled = true;
> >> > >           }
> >> > > +         ev->last_level = level;
> >> > >   }
> >> > > -
> >> >
> >> > Unnecessary change.
> >> >
> >> > >   mutex_unlock(&vmpr->events_lock);
> >> > >
> >> > >   return signalled;
> >> > > @@ -290,9 +295,11 @@ void vmpressure_prio(gfp_t gfp, struct mem_cgroup *memcg, int prio)
> >> > >   *
> >> > >   * This function associates eventfd context with the vmpressure
> >> > >   * infrastructure, so that the notifications will be delivered to the
> >> > > - * @eventfd. The @args parameter is a string that denotes pressure level
> >> > > + * @eventfd. The @args parameters are a string that denotes pressure level
> >> > >   * threshold (one of vmpressure_str_levels, i.e. "low", "medium", or
> >> > > - * "critical").
> >> > > + * "critical") and a trigger option that decides whether events are triggered
> >> > > + * continuously or only on edge ("always" or "edge" if "edge", events
> >> > > + * are triggered when the pressure level changes.
> >> > >   *
> >> > >   * This function should not be used directly, just pass it to (struct
> >> > >   * cftype).register_event, and then cgroup core will handle everything by
> >> > > @@ -303,22 +310,43 @@ int vmpressure_register_event(struct cgroup *cg, struct cftype *cft,
> >> > >  {
> >> > >   struct vmpressure *vmpr = cg_to_vmpressure(cg);
> >> > >   struct vmpressure_event *ev;
> >> > > + char *strlevel, *strtrigger;
> >> > >   int level;
> >> > > + bool trigger;
> >> >
> >> > What trigger?
> >> > Would be better to use "bool egde" instead?
> >>
> >> yes
> >>
> >> > > +
> >> > > + strlevel = args;
> >> > > + strtrigger = strchr(args, ' ');
> >> > > +
> >> > > + if (strtrigger) {
> >> > > +         *strtrigger = '\0';
> >> > > +         strtrigger++;
> >> > > + }
> >> > >
> >> > >   for (level = 0; level < VMPRESSURE_NUM_LEVELS; level++) {
> >> > > -         if (!strcmp(vmpressure_str_levels[level], args))
> >> > > +         if (!strcmp(vmpressure_str_levels[level], strlevel))
> >> > >                   break;
> >> > >   }
> >> > >
> >> > >   if (level >= VMPRESSURE_NUM_LEVELS)
> >> > >           return -EINVAL;
> >> > >
> >> > > + if (strtrigger == NULL)
> >> > > +         trigger = false;
> >> > > + else if (!strcmp(strtrigger, "always"))
> >> > > +         trigger = false;
> >> > > + else if (!strcmp(strtrigger, "edge"))
> >> > > +         trigger = true;
> >> > > + else
> >> > > +         return -EINVAL;
> >> > > +
> >> > >   ev = kzalloc(sizeof(*ev), GFP_KERNEL);
> >> > >   if (!ev)
> >> > >           return -ENOMEM;
> >> > >
> >> > >   ev->efd = eventfd;
> >> > >   ev->level = level;
> >> > > + ev->last_level = -1;
> >> >
> >> > VMPRESSURE_NONE is better?
> >>
> >> Yes
> >> --
> >> Michal Hocko
> >> SUSE Labs
> >
> > --
> > Kind regards,
> > Minchan Kim
> >
> > --
> > To unsubscribe, send a message with 'unsubscribe linux-mm' in
> > the body to majordomo@kvack.org.  For more info on Linux MM,
> > see: http://www.linux-mm.org/ .
> > Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>
> 
> --
> To unsubscribe, send a message with 'unsubscribe linux-mm' in
> the body to majordomo@kvack.org.  For more info on Linux MM,
> see: http://www.linux-mm.org/ .
> Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

-- 
Kind regards,
Minchan Kim

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 50+ messages in thread

* Re: [PATCH v6] memcg: event control at vmpressure.
  2013-06-22  4:51                             ` Hyunhee Kim
  2013-06-22  5:50                               ` [PATCH] memcg: consider "scanned < reclaimed" case when calculating Hyunhee Kim
@ 2013-06-26  7:34                               ` Minchan Kim
  1 sibling, 0 replies; 50+ messages in thread
From: Minchan Kim @ 2013-06-26  7:34 UTC (permalink / raw)
  To: Hyunhee Kim
  Cc: Michal Hocko, Anton Vorontsov, linux-mm, akpm, rob,
	kamezawa.hiroyu, hannes, rientjes, kirill, Kyungmin Park

On Sat, Jun 22, 2013 at 01:51:17PM +0900, Hyunhee Kim wrote:
> And one more thing.
> 
> vmpressure sends all of the lower events than the level (current).
> 
> 155         list_for_each_entry(ev, &vmpr->events, node) {
> 156                 if (level >= ev->level) {
> 157                         eventfd_signal(ev->efd, 1);
> 158                         signalled = true;
> 159                 }
> 160         }
> 
> Isn't it wasteful?
> When we registered "low" and "critical", and the current level is
> critical, why should we signal "low" level together?
> If we receive "critical" only, can't we just assume "low" level at the
> user space?

Agree.

-- 
Kind regards,
Minchan Kim

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 50+ messages in thread

* Re: [PATCH] memcg: consider "scanned < reclaimed" case when calculating
  2013-06-22  5:50                               ` [PATCH] memcg: consider "scanned < reclaimed" case when calculating Hyunhee Kim
  2013-06-22  7:34                                 ` [PATCH] memcg: add interface to specify thresholds of vmpressure Hyunhee Kim
@ 2013-06-26  7:35                                 ` Minchan Kim
  2013-06-27  6:12                                   ` [PATCH v2] vmpressure: consider "scanned < reclaimed" case when calculating a pressure level Hyunhee Kim
  1 sibling, 1 reply; 50+ messages in thread
From: Minchan Kim @ 2013-06-26  7:35 UTC (permalink / raw)
  To: Hyunhee Kim
  Cc: 'Michal Hocko', 'Anton Vorontsov',
	linux-mm, akpm, rob, kamezawa.hiroyu, hannes, rientjes, kirill,
	'Kyungmin Park'

On Sat, Jun 22, 2013 at 02:50:06PM +0900, Hyunhee Kim wrote:
> In vmpressure, the pressure level is calculated based on the ratio
> of how many pages were scanned vs. reclaimed in a given time window.
> However, there is a possibility that "scanned < reclaimed" in such
> a case, THP page is reclaimed or reclaiming is abandoned by fatal
> signal in shrink_inactive_list, etc. So, with this patch, we just
> return "low" level when "scanned < reclaimed" by assuming that
> there are enough reclaimed pages.

I agree send lowevent in this case but you should write down why
lowevent send is better than ignoring in description. 

> 
> Signed-off-by: Hyunhee Kim <hyunhee.kim@samsung.com>
> Signed-off-by: Kyungmin Park <kyungmin.park@samsung.com>
> ---
>  mm/vmpressure.c |    3 +++
>  1 file changed, 3 insertions(+)
> 
> diff --git a/mm/vmpressure.c b/mm/vmpressure.c
> index 736a601..c6560f3 100644
> --- a/mm/vmpressure.c
> +++ b/mm/vmpressure.c
> @@ -118,6 +118,9 @@ static enum vmpressure_levels vmpressure_calc_level(unsigned long scanned,
>  	unsigned long scale = scanned + reclaimed;
>  	unsigned long pressure;
>  

Please write when we encounter this case.

> +	if (reclaimed > scanned)
> +		return VMPRESSURE_LOW;
> +
>  	/*
>  	 * We calculate the ratio (in percents) of how many pages were
>  	 * scanned vs. reclaimed in a given time frame (window). Note that
> -- 
> 1.7.9.5
> 
> 
> --
> To unsubscribe, send a message with 'unsubscribe linux-mm' in
> the body to majordomo@kvack.org.  For more info on Linux MM,
> see: http://www.linux-mm.org/ .
> Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

-- 
Kind regards,
Minchan Kim

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 50+ messages in thread

* Re: [PATCH] memcg: add interface to specify thresholds of vmpressure
  2013-06-22  7:34                                 ` [PATCH] memcg: add interface to specify thresholds of vmpressure Hyunhee Kim
  2013-06-25 20:46                                   ` Michal Hocko
@ 2013-06-26  7:39                                   ` Minchan Kim
  2013-06-26  7:50                                     ` Kyungmin Park
  1 sibling, 1 reply; 50+ messages in thread
From: Minchan Kim @ 2013-06-26  7:39 UTC (permalink / raw)
  To: Hyunhee Kim
  Cc: 'Michal Hocko', 'Anton Vorontsov',
	linux-mm, akpm, rob, kamezawa.hiroyu, hannes, rientjes, kirill,
	'Kyungmin Park'

On Sat, Jun 22, 2013 at 04:34:34PM +0900, Hyunhee Kim wrote:
> Memory pressure is calculated based on scanned/reclaimed ratio. The higher
> the value, the more number unsuccessful reclaims there were. These thresholds
> can be specified when each event is registered by writing it next to the
> string of level. Default value is 60 for "medium" and 95 for "critical"
> 
> Signed-off-by: Hyunhee Kim <hyunhee.kim@samsung.com>
> Signed-off-by: Kyungmin Park <kyungmin.park@samsung.com>

As I mentioned eariler thread, it's not a good idea to expose each level's
raw value to user space. If it's a problem, please fix default vaule and
send a patch with number to convince us although I'm not sure we can get
a stable number.

-- 
Kind regards,
Minchan Kim

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 50+ messages in thread

* Re: [PATCH v6] memcg: event control at vmpressure.
  2013-06-22  0:27                             ` Anton Vorontsov
  2013-06-22  1:28                               ` Hyunhee Kim
@ 2013-06-26  7:47                               ` Minchan Kim
  1 sibling, 0 replies; 50+ messages in thread
From: Minchan Kim @ 2013-06-26  7:47 UTC (permalink / raw)
  To: Anton Vorontsov
  Cc: Michal Hocko, Hyunhee Kim, linux-mm, akpm, rob, kamezawa.hiroyu,
	hannes, rientjes, kirill, 'Kyungmin Park'

Hey Anton,

On Fri, Jun 21, 2013 at 05:27:44PM -0700, Anton Vorontsov wrote:
> On Sat, Jun 22, 2013 at 01:44:14AM +0900, Minchan Kim wrote:
> [...]
> > 3. The reclaimed could be greater than scanned in vmpressure_evnet
> >    by several reasons. Totally, It could trigger wrong event.
> 
> Yup, and in that case the best we can do is just ignore the event (i.e.
> not pass it to the userland): thing is, based on the fact that
> 'reclaimed > scanned' we can't actually conclude anything about the
> pressure: it might be still high, or we actually freed enough.

I'd like to make a safe guard with low notifier rather than ignoring.

> 
> Thanks,
> 
> Anton
> 
> p.s. I was somewhat sure that someone sent a patch to ignore 'reclaimed >
> scanned' situation, but I cannot find it in my mailbox. Maybe I was
> dreaming about it? :)

You seem to work while you're sleeping.
Dear Penguin,

please, take a rest. :)

> 
> --
> To unsubscribe, send a message with 'unsubscribe linux-mm' in
> the body to majordomo@kvack.org.  For more info on Linux MM,
> see: http://www.linux-mm.org/ .
> Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

-- 
Kind regards,
Minchan Kim

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 50+ messages in thread

* Re: [PATCH] memcg: add interface to specify thresholds of vmpressure
  2013-06-26  7:39                                   ` Minchan Kim
@ 2013-06-26  7:50                                     ` Kyungmin Park
  2013-06-26  8:03                                       ` Minchan Kim
  0 siblings, 1 reply; 50+ messages in thread
From: Kyungmin Park @ 2013-06-26  7:50 UTC (permalink / raw)
  To: Minchan Kim
  Cc: Hyunhee Kim, Michal Hocko, Anton Vorontsov, linux-mm, akpm, rob,
	kamezawa.hiroyu, hannes, rientjes, kirill

On Wed, Jun 26, 2013 at 4:39 PM, Minchan Kim <minchan@kernel.org> wrote:
> On Sat, Jun 22, 2013 at 04:34:34PM +0900, Hyunhee Kim wrote:
>> Memory pressure is calculated based on scanned/reclaimed ratio. The higher
>> the value, the more number unsuccessful reclaims there were. These thresholds
>> can be specified when each event is registered by writing it next to the
>> string of level. Default value is 60 for "medium" and 95 for "critical"
>>
>> Signed-off-by: Hyunhee Kim <hyunhee.kim@samsung.com>
>> Signed-off-by: Kyungmin Park <kyungmin.park@samsung.com>
>
> As I mentioned eariler thread, it's not a good idea to expose each level's
> raw value to user space. If it's a problem, please fix default vaule and
> send a patch with number to convince us although I'm not sure we can get
> a stable number.
that's reason to send this patch, can we make a reasonable value to
cover all cases?
which number are satified for all person. I really wonder it.

Thank you,
Kyungmin Park

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 50+ messages in thread

* Re: [PATCH] memcg: add interface to specify thresholds of vmpressure
  2013-06-26  7:50                                     ` Kyungmin Park
@ 2013-06-26  8:03                                       ` Minchan Kim
  0 siblings, 0 replies; 50+ messages in thread
From: Minchan Kim @ 2013-06-26  8:03 UTC (permalink / raw)
  To: Kyungmin Park
  Cc: Hyunhee Kim, Michal Hocko, Anton Vorontsov, linux-mm, akpm, rob,
	kamezawa.hiroyu, hannes, rientjes, kirill

On Wed, Jun 26, 2013 at 04:50:32PM +0900, Kyungmin Park wrote:
> On Wed, Jun 26, 2013 at 4:39 PM, Minchan Kim <minchan@kernel.org> wrote:
> > On Sat, Jun 22, 2013 at 04:34:34PM +0900, Hyunhee Kim wrote:
> >> Memory pressure is calculated based on scanned/reclaimed ratio. The higher
> >> the value, the more number unsuccessful reclaims there were. These thresholds
> >> can be specified when each event is registered by writing it next to the
> >> string of level. Default value is 60 for "medium" and 95 for "critical"
> >>
> >> Signed-off-by: Hyunhee Kim <hyunhee.kim@samsung.com>
> >> Signed-off-by: Kyungmin Park <kyungmin.park@samsung.com>
> >
> > As I mentioned eariler thread, it's not a good idea to expose each level's
> > raw value to user space. If it's a problem, please fix default vaule and
> > send a patch with number to convince us although I'm not sure we can get
> > a stable number.
> that's reason to send this patch, can we make a reasonable value to
> cover all cases?
> which number are satified for all person. I really wonder it.

Then, I really wonder how we decide values we have to pass.
IOW, how could admin tune the system with any values?


> 
> Thank you,
> Kyungmin Park
> 
> --
> To unsubscribe, send a message with 'unsubscribe linux-mm' in
> the body to majordomo@kvack.org.  For more info on Linux MM,
> see: http://www.linux-mm.org/ .
> Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

-- 
Kind regards,
Minchan Kim

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 50+ messages in thread

* [PATCH v2] vmpressure: consider "scanned < reclaimed" case when calculating  a pressure level.
  2013-06-26  7:35                                 ` [PATCH] memcg: consider "scanned < reclaimed" case when calculating Minchan Kim
@ 2013-06-27  6:12                                   ` Hyunhee Kim
  2013-06-27  9:37                                     ` Michal Hocko
  2013-06-27 18:33                                     ` Anton Vorontsov
  0 siblings, 2 replies; 50+ messages in thread
From: Hyunhee Kim @ 2013-06-27  6:12 UTC (permalink / raw)
  To: 'Michal Hocko', 'Anton Vorontsov',
	linux-mm, akpm, rob, kamezawa.hiroyu, hannes, rientjes, kirill,
	'Kyungmin Park', 'Minchan Kim'

In vmpressure, the pressure level is calculated based on the ratio
of how many pages were scanned vs. reclaimed in a given time window.
However, there is a possibility that "scanned < reclaimed" in such a
case, when reclaiming ends by fatal signal in shrink_inactive_list.
So, with this patch, we just return "low" level when "scanned < reclaimed"
happens not to have userland miss reclaim activity.

Signed-off-by: Hyunhee Kim <hyunhee.kim@samsung.com>
Signed-off-by: Kyungmin Park <kyungmin.park@samsung.com>
---
 mm/vmpressure.c |    8 ++++++++
 1 file changed, 8 insertions(+)

diff --git a/mm/vmpressure.c b/mm/vmpressure.c
index 736a601..8c60cad 100644
--- a/mm/vmpressure.c
+++ b/mm/vmpressure.c
@@ -119,6 +119,14 @@ static enum vmpressure_levels vmpressure_calc_level(unsigned long scanned,
 	unsigned long pressure;
 
 	/*
+	 * This could happen, in such a case, when reclaiming ends by fatal
+	 * signal in shrink_inactive_list(). In this case, return
+	 * VMPRESSURE_LOW not to have userland miss reclaim activity.
+	 */
+	if (reclaimed > scanned)
+		return VMPRESSURE_LOW;
+
+	/*
 	 * We calculate the ratio (in percents) of how many pages were
 	 * scanned vs. reclaimed in a given time frame (window). Note that
 	 * time is in VM reclaimer's "ticks", i.e. number of pages
-- 
1.7.9.5


--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply related	[flat|nested] 50+ messages in thread

* Re: [PATCH v2] vmpressure: consider "scanned < reclaimed" case when calculating  a pressure level.
  2013-06-27  6:12                                   ` [PATCH v2] vmpressure: consider "scanned < reclaimed" case when calculating a pressure level Hyunhee Kim
@ 2013-06-27  9:37                                     ` Michal Hocko
  2013-06-27 15:35                                       ` Minchan Kim
  2013-06-27 18:33                                     ` Anton Vorontsov
  1 sibling, 1 reply; 50+ messages in thread
From: Michal Hocko @ 2013-06-27  9:37 UTC (permalink / raw)
  To: Hyunhee Kim
  Cc: 'Anton Vorontsov',
	linux-mm, akpm, rob, kamezawa.hiroyu, hannes, rientjes, kirill,
	'Kyungmin Park', 'Minchan Kim'

On Thu 27-06-13 15:12:10, Hyunhee Kim wrote:
> In vmpressure, the pressure level is calculated based on the ratio
> of how many pages were scanned vs. reclaimed in a given time window.
> However, there is a possibility that "scanned < reclaimed" in such a
> case, when reclaiming ends by fatal signal in shrink_inactive_list.
> So, with this patch, we just return "low" level when "scanned < reclaimed"
> happens not to have userland miss reclaim activity.

Hmm, fatal signal pending on kswapd doesn't make sense to me so it has
to be a direct reclaim path. Does it really make sense to signal LOW
when there is probably a big memory pressure and somebody is killing the
current allocator?

The THP case made sense because nr_scanned is in LRU elements units
while nr_reclaimed is in page units which are different so nr_reclaim
might be higher than nr_scanned (so nr_taken would be more approapriate
for vmpressure).
 
> Signed-off-by: Hyunhee Kim <hyunhee.kim@samsung.com>
> Signed-off-by: Kyungmin Park <kyungmin.park@samsung.com>
> ---
>  mm/vmpressure.c |    8 ++++++++
>  1 file changed, 8 insertions(+)
> 
> diff --git a/mm/vmpressure.c b/mm/vmpressure.c
> index 736a601..8c60cad 100644
> --- a/mm/vmpressure.c
> +++ b/mm/vmpressure.c
> @@ -119,6 +119,14 @@ static enum vmpressure_levels vmpressure_calc_level(unsigned long scanned,
>  	unsigned long pressure;
>  
>  	/*
> +	 * This could happen, in such a case, when reclaiming ends by fatal
> +	 * signal in shrink_inactive_list(). In this case, return
> +	 * VMPRESSURE_LOW not to have userland miss reclaim activity.
> +	 */
> +	if (reclaimed > scanned)
> +		return VMPRESSURE_LOW;
> +
> +	/*
>  	 * We calculate the ratio (in percents) of how many pages were
>  	 * scanned vs. reclaimed in a given time frame (window). Note that
>  	 * time is in VM reclaimer's "ticks", i.e. number of pages
> -- 
> 1.7.9.5
> 
> 
> --
> To unsubscribe, send a message with 'unsubscribe linux-mm' in
> the body to majordomo@kvack.org.  For more info on Linux MM,
> see: http://www.linux-mm.org/ .
> Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

-- 
Michal Hocko
SUSE Labs

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 50+ messages in thread

* Re: [PATCH v2] vmpressure: consider "scanned < reclaimed" case when calculating  a pressure level.
  2013-06-27  9:37                                     ` Michal Hocko
@ 2013-06-27 15:35                                       ` Minchan Kim
  2013-06-27 16:11                                         ` Michal Hocko
  0 siblings, 1 reply; 50+ messages in thread
From: Minchan Kim @ 2013-06-27 15:35 UTC (permalink / raw)
  To: Michal Hocko
  Cc: Hyunhee Kim, 'Anton Vorontsov',
	linux-mm, akpm, rob, kamezawa.hiroyu, hannes, rientjes, kirill,
	'Kyungmin Park'

Hi Michal,

On Thu, Jun 27, 2013 at 11:37:21AM +0200, Michal Hocko wrote:
> On Thu 27-06-13 15:12:10, Hyunhee Kim wrote:
> > In vmpressure, the pressure level is calculated based on the ratio
> > of how many pages were scanned vs. reclaimed in a given time window.
> > However, there is a possibility that "scanned < reclaimed" in such a
> > case, when reclaiming ends by fatal signal in shrink_inactive_list.
> > So, with this patch, we just return "low" level when "scanned < reclaimed"
> > happens not to have userland miss reclaim activity.
> 
> Hmm, fatal signal pending on kswapd doesn't make sense to me so it has
> to be a direct reclaim path. Does it really make sense to signal LOW
> when there is probably a big memory pressure and somebody is killing the
> current allocator?

So, do you want to trigger critical instead of low?

Now, current is going to die so we can expect shortly we can get a amount
of memory, normally. but yeah, we cannot sure it happens within a bounded
time since it couldn't use reserved memory pool unlike process killed by OOM.

If we send critical but there isn't big memory pressure,
maybe critical handler would kill some process and the result is that killing
another process unnecessary. That's really thing we should avoid.

If we send low but there is a big memory pressure, at least, userland
could be notified and it has a chance to release small memory, which will
help to exit current process so that it could prevent OOM kill and killing
another process unnecessary.

If we send low but there isn't big memory pressure, totally, we will save
a process.

> 
> The THP case made sense because nr_scanned is in LRU elements units
> while nr_reclaimed is in page units which are different so nr_reclaim
> might be higher than nr_scanned (so nr_taken would be more approapriate
> for vmpressure).

In case of THP, 512 page is equal to vmpressure_win so if we change nr_scanned
with nr_taken, it could easily make vmpressure notifier level critical even if
VM encounter a recent referenced THP page from LRU tail so I'd like to ignore
THP page effect in vmpressure level calculation.

>  
> > Signed-off-by: Hyunhee Kim <hyunhee.kim@samsung.com>
> > Signed-off-by: Kyungmin Park <kyungmin.park@samsung.com>
> > ---
> >  mm/vmpressure.c |    8 ++++++++
> >  1 file changed, 8 insertions(+)
> > 
> > diff --git a/mm/vmpressure.c b/mm/vmpressure.c
> > index 736a601..8c60cad 100644
> > --- a/mm/vmpressure.c
> > +++ b/mm/vmpressure.c
> > @@ -119,6 +119,14 @@ static enum vmpressure_levels vmpressure_calc_level(unsigned long scanned,
> >  	unsigned long pressure;
> >  
> >  	/*
> > +	 * This could happen, in such a case, when reclaiming ends by fatal
> > +	 * signal in shrink_inactive_list(). In this case, return
> > +	 * VMPRESSURE_LOW not to have userland miss reclaim activity.
> > +	 */
> > +	if (reclaimed > scanned)
> > +		return VMPRESSURE_LOW;
> > +
> > +	/*
> >  	 * We calculate the ratio (in percents) of how many pages were
> >  	 * scanned vs. reclaimed in a given time frame (window). Note that
> >  	 * time is in VM reclaimer's "ticks", i.e. number of pages
> > -- 
> > 1.7.9.5
> > 
> > 
> > --
> > To unsubscribe, send a message with 'unsubscribe linux-mm' in
> > the body to majordomo@kvack.org.  For more info on Linux MM,
> > see: http://www.linux-mm.org/ .
> > Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>
> 
> -- 
> Michal Hocko
> SUSE Labs

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 50+ messages in thread

* Re: [PATCH v2] vmpressure: consider "scanned < reclaimed" case when calculating  a pressure level.
  2013-06-27 15:35                                       ` Minchan Kim
@ 2013-06-27 16:11                                         ` Michal Hocko
  2013-06-27 18:05                                           ` Anton Vorontsov
  2013-06-27 23:54                                           ` Minchan Kim
  0 siblings, 2 replies; 50+ messages in thread
From: Michal Hocko @ 2013-06-27 16:11 UTC (permalink / raw)
  To: Minchan Kim
  Cc: Hyunhee Kim, 'Anton Vorontsov',
	linux-mm, akpm, rob, kamezawa.hiroyu, hannes, rientjes, kirill,
	'Kyungmin Park'

On Fri 28-06-13 00:35:28, Minchan Kim wrote:
> Hi Michal,
> 
> On Thu, Jun 27, 2013 at 11:37:21AM +0200, Michal Hocko wrote:
> > On Thu 27-06-13 15:12:10, Hyunhee Kim wrote:
> > > In vmpressure, the pressure level is calculated based on the ratio
> > > of how many pages were scanned vs. reclaimed in a given time window.
> > > However, there is a possibility that "scanned < reclaimed" in such a
> > > case, when reclaiming ends by fatal signal in shrink_inactive_list.
> > > So, with this patch, we just return "low" level when "scanned < reclaimed"
> > > happens not to have userland miss reclaim activity.
> > 
> > Hmm, fatal signal pending on kswapd doesn't make sense to me so it has
> > to be a direct reclaim path. Does it really make sense to signal LOW
> > when there is probably a big memory pressure and somebody is killing the
> > current allocator?
> 
> So, do you want to trigger critical instead of low?
> 
> Now, current is going to die so we can expect shortly we can get a amount
> of memory, normally. 

And also consider that this is per-memcg interface. And so it is even
more complicated. If a task dies then there is _no_ guarantee that there
will be an uncharge in that group (task could have been migrated to that
group so the memory belongs to somebody else).

> but yeah, we cannot sure it happens within a bounded time since it
> couldn't use reserved memory pool unlike process killed by OOM.

The situation should be detected (I am not entirely sure how - e.g.
checking for fatal_signals in vmpressure directly) but we shouldn't
assume that scanned < reclaimed has any impact on the freed memory.

> If we send critical but there isn't big memory pressure, maybe
> critical handler would kill some process and the result is that
> killing another process unnecessary. That's really thing we should
> avoid.
> 
> If we send low but there is a big memory pressure, at least, userland
> could be notified and it has a chance to release small memory, which will
> help to exit current process so that it could prevent OOM kill and killing
> another process unnecessary.
> 
> If we send low but there isn't big memory pressure, totally, we will save
> a process.
> 
> > 
> > The THP case made sense because nr_scanned is in LRU elements units
> > while nr_reclaimed is in page units which are different so nr_reclaim
> > might be higher than nr_scanned (so nr_taken would be more approapriate
> > for vmpressure).
> 
> In case of THP, 512 page is equal to vmpressure_win so if we change
> nr_scanned with nr_taken, it could easily make vmpressure notifier

Wasn't 512 selected for vmpressure_win exactly for this reason?
Shouldn't we rather fix that assumption? Comparing scanned to reclaimed
when they operate on different units just sounds strange to me.

> level critical even if VM encounter a recent referenced THP page from
> LRU tail so I'd like to ignore THP page effect in vmpressure level
> calculation.

[...]
-- 
Michal Hocko
SUSE Labs

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 50+ messages in thread

* Re: [PATCH v2] vmpressure: consider "scanned < reclaimed" case when calculating  a pressure level.
  2013-06-27 16:11                                         ` Michal Hocko
@ 2013-06-27 18:05                                           ` Anton Vorontsov
  2013-06-28 12:17                                             ` Michal Hocko
  2013-06-27 23:54                                           ` Minchan Kim
  1 sibling, 1 reply; 50+ messages in thread
From: Anton Vorontsov @ 2013-06-27 18:05 UTC (permalink / raw)
  To: Michal Hocko
  Cc: Minchan Kim, Hyunhee Kim, linux-mm, akpm, rob, kamezawa.hiroyu,
	hannes, rientjes, kirill, 'Kyungmin Park'

On Thu, Jun 27, 2013 at 06:11:03PM +0200, Michal Hocko wrote:
> > If we send critical but there isn't big memory pressure, maybe
> > critical handler would kill some process and the result is that
> > killing another process unnecessary. That's really thing we should
> > avoid.

Yes, so that is why I actually want to ack the patch... It might be not an
ideal solution, but to me it seems like a good for the time being.

(Actually I should have done that check myself.)

> > > The THP case made sense because nr_scanned is in LRU elements units
> > > while nr_reclaimed is in page units which are different so nr_reclaim
> > > might be higher than nr_scanned (so nr_taken would be more approapriate
> > > for vmpressure).
> > 
> > In case of THP, 512 page is equal to vmpressure_win so if we change
> > nr_scanned with nr_taken, it could easily make vmpressure notifier
> 
> Wasn't 512 selected for vmpressure_win exactly for this reason?

Nope. The current vmpressure_win was selected kind of arbitrary, i.e. it
worked good for most of my test cases.

> Shouldn't we rather fix that assumption?

If there is any assumption (which I had not in my mind :), then we
definitely should do that, since vmpressure_win is going to be
machine-size dependant.

Thanks,

Anton

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 50+ messages in thread

* Re: [PATCH v2] vmpressure: consider "scanned < reclaimed" case when calculating  a pressure level.
  2013-06-27  6:12                                   ` [PATCH v2] vmpressure: consider "scanned < reclaimed" case when calculating a pressure level Hyunhee Kim
  2013-06-27  9:37                                     ` Michal Hocko
@ 2013-06-27 18:33                                     ` Anton Vorontsov
  1 sibling, 0 replies; 50+ messages in thread
From: Anton Vorontsov @ 2013-06-27 18:33 UTC (permalink / raw)
  To: Hyunhee Kim
  Cc: 'Michal Hocko',
	linux-mm, akpm, rob, kamezawa.hiroyu, hannes, rientjes, kirill,
	'Kyungmin Park', 'Minchan Kim'

On Thu, Jun 27, 2013 at 03:12:10PM +0900, Hyunhee Kim wrote:
> In vmpressure, the pressure level is calculated based on the ratio
> of how many pages were scanned vs. reclaimed in a given time window.
> However, there is a possibility that "scanned < reclaimed" in such a
> case, when reclaiming ends by fatal signal in shrink_inactive_list.
> So, with this patch, we just return "low" level when "scanned < reclaimed"
> happens not to have userland miss reclaim activity.
> 
> Signed-off-by: Hyunhee Kim <hyunhee.kim@samsung.com>
> Signed-off-by: Kyungmin Park <kyungmin.park@samsung.com>
> ---
>  mm/vmpressure.c |    8 ++++++++
>  1 file changed, 8 insertions(+)
> 
> diff --git a/mm/vmpressure.c b/mm/vmpressure.c
> index 736a601..8c60cad 100644
> --- a/mm/vmpressure.c
> +++ b/mm/vmpressure.c
> @@ -119,6 +119,14 @@ static enum vmpressure_levels vmpressure_calc_level(unsigned long scanned,
>  	unsigned long pressure;
>  
>  	/*
> +	 * This could happen, in such a case, when reclaiming ends by fatal
> +	 * signal in shrink_inactive_list(). In this case, return
> +	 * VMPRESSURE_LOW not to have userland miss reclaim activity.
> +	 */

Personally, I like to see this condition as a generic case: reclaimed >
scanned condition can't tell anything about the pressure itself, so the
best we can do is just report that there is some reclaiming happening
(i.e. signal _LOW level). It might be the fatal signal case, or THP, and I
so far don't see we we have to differentiate why exactly this condition
happened, at least from the vmpressure point of view...

It might be that in some cases we should not call vmpressure() callback
(i.e. when there is a fatal signals or something like that), but it is a
different matter.

There is another way to view the vmpressure subsystem. vmpressure() is
kind of a differential flow meter, which accounts the
reclaiming/allocation flow change. Ideally it should not matter why
exactly the reclaimed > scanned happened (unless we bogusly call
vmpressure() callback, in which case we should stop doing that).

So, I would add the comment about the generic case and why it makes sense
to return _LOW here, but other than that the patch looks good at least to
me.

Acked-by: Anton Vorontsov <anton@enomsg.org>

> +	if (reclaimed > scanned)
> +		return VMPRESSURE_LOW;
> +
> +	/*
>  	 * We calculate the ratio (in percents) of how many pages were
>  	 * scanned vs. reclaimed in a given time frame (window). Note that
>  	 * time is in VM reclaimer's "ticks", i.e. number of pages
> -- 
> 1.7.9.5

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 50+ messages in thread

* Re: [PATCH v2] vmpressure: consider "scanned < reclaimed" case when calculating  a pressure level.
  2013-06-27 16:11                                         ` Michal Hocko
  2013-06-27 18:05                                           ` Anton Vorontsov
@ 2013-06-27 23:54                                           ` Minchan Kim
  2013-06-28  7:43                                             ` [PATCH v3] " Hyunhee Kim
  2013-06-28 12:24                                             ` [PATCH v2] " Michal Hocko
  1 sibling, 2 replies; 50+ messages in thread
From: Minchan Kim @ 2013-06-27 23:54 UTC (permalink / raw)
  To: Michal Hocko
  Cc: Hyunhee Kim, 'Anton Vorontsov',
	linux-mm, akpm, rob, kamezawa.hiroyu, hannes, rientjes, kirill,
	'Kyungmin Park'

Hello Michal,

On Thu, Jun 27, 2013 at 06:11:03PM +0200, Michal Hocko wrote:
> On Fri 28-06-13 00:35:28, Minchan Kim wrote:
> > Hi Michal,
> > 
> > On Thu, Jun 27, 2013 at 11:37:21AM +0200, Michal Hocko wrote:
> > > On Thu 27-06-13 15:12:10, Hyunhee Kim wrote:
> > > > In vmpressure, the pressure level is calculated based on the ratio
> > > > of how many pages were scanned vs. reclaimed in a given time window.
> > > > However, there is a possibility that "scanned < reclaimed" in such a
> > > > case, when reclaiming ends by fatal signal in shrink_inactive_list.
> > > > So, with this patch, we just return "low" level when "scanned < reclaimed"
> > > > happens not to have userland miss reclaim activity.
> > > 
> > > Hmm, fatal signal pending on kswapd doesn't make sense to me so it has
> > > to be a direct reclaim path. Does it really make sense to signal LOW
> > > when there is probably a big memory pressure and somebody is killing the
> > > current allocator?
> > 
> > So, do you want to trigger critical instead of low?
> > 
> > Now, current is going to die so we can expect shortly we can get a amount
> > of memory, normally. 
> 
> And also consider that this is per-memcg interface. And so it is even
> more complicated. If a task dies then there is _no_ guarantee that there
> will be an uncharge in that group (task could have been migrated to that
> group so the memory belongs to somebody else).

Good point and that's one of the reason I hate memcg for just using vmpressure.
Let's think over it. One of the very avaialbe scenario which userland could do
when notified from vmpressure is that manager process sends signal for others to
release own cached memory.
If we use vmpressure without move_charge_at_immigrate in multiple memcg
group, it would be a disaster. But if we use move_charge_at_immigrate,
we will see long stall easily so it's not an option, either.

So, IMO, it's not a good idea to use vmpressure with no-root memcg so it could
raise the question again "why vmpressure is part of memcg".
I really didn't want it. :(

> 
> > but yeah, we cannot sure it happens within a bounded time since it
> > couldn't use reserved memory pool unlike process killed by OOM.
> 
> The situation should be detected (I am not entirely sure how - e.g.
> checking for fatal_signals in vmpressure directly) but we shouldn't
> assume that scanned < reclaimed has any impact on the freed memory.
> 
> > If we send critical but there isn't big memory pressure, maybe
> > critical handler would kill some process and the result is that
> > killing another process unnecessary. That's really thing we should
> > avoid.
> > 
> > If we send low but there is a big memory pressure, at least, userland
> > could be notified and it has a chance to release small memory, which will
> > help to exit current process so that it could prevent OOM kill and killing
> > another process unnecessary.
> > 
> > If we send low but there isn't big memory pressure, totally, we will save
> > a process.
> > 
> > > 
> > > The THP case made sense because nr_scanned is in LRU elements units
> > > while nr_reclaimed is in page units which are different so nr_reclaim
> > > might be higher than nr_scanned (so nr_taken would be more approapriate
> > > for vmpressure).
> > 
> > In case of THP, 512 page is equal to vmpressure_win so if we change
> > nr_scanned with nr_taken, it could easily make vmpressure notifier
> 
> Wasn't 512 selected for vmpressure_win exactly for this reason?
> Shouldn't we rather fix that assumption? Comparing scanned to reclaimed
> when they operate on different units just sounds strange to me.
> 
> > level critical even if VM encounter a recent referenced THP page from
> > LRU tail so I'd like to ignore THP page effect in vmpressure level
> > calculation.
> 
> [...]
> -- 
> Michal Hocko
> SUSE Labs
> 
> --
> To unsubscribe, send a message with 'unsubscribe linux-mm' in
> the body to majordomo@kvack.org.  For more info on Linux MM,
> see: http://www.linux-mm.org/ .
> Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

-- 
Kind regards,
Minchan Kim

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 50+ messages in thread

* [PATCH v3] vmpressure: consider "scanned < reclaimed" case when calculating  a pressure level.
  2013-06-27 23:54                                           ` Minchan Kim
@ 2013-06-28  7:43                                             ` Hyunhee Kim
  2013-06-28 12:26                                               ` Michal Hocko
  2013-06-28 12:24                                             ` [PATCH v2] " Michal Hocko
  1 sibling, 1 reply; 50+ messages in thread
From: Hyunhee Kim @ 2013-06-28  7:43 UTC (permalink / raw)
  To: 'Minchan Kim', 'Michal Hocko',
	'Anton Vorontsov',
	linux-mm, akpm, rob, kamezawa.hiroyu, hannes, rientjes, kirill,
	'Kyungmin Park'

In vmpressure, the pressure level is calculated based on the ratio
of how many pages were scanned vs. reclaimed in a given time window.
However, there is a possibility that "scanned < reclaimed" for some
reasons, e.g., when reclaiming ends by fatal signal in shrink_inactive_list
or THP reclaiming, etc. When this happens, we cannot tell anything about the
current pressure level. So, with this patch, we just return "low" level
when "scanned < reclaimed" happens to inform that there is reclaiming activity.
Userland can have a chance to free some memory.

Signed-off-by: Hyunhee Kim <hyunhee.kim@samsung.com>
Signed-off-by: Kyungmin Park <kyungmin.park@samsung.com>
---
 mm/vmpressure.c |   10 ++++++++++
 1 file changed, 10 insertions(+)

diff --git a/mm/vmpressure.c b/mm/vmpressure.c
index 736a601..915a608 100644
--- a/mm/vmpressure.c
+++ b/mm/vmpressure.c
@@ -119,6 +119,16 @@ static enum vmpressure_levels vmpressure_calc_level(unsigned long scanned,
 	unsigned long pressure;
 
 	/*
+	 * This could happen for some reasons. e.g., reclaiming ends by fatal
+	 * signal in shrink_inactive_list() or THP reclaiming, etc. In this case,
+	 * we cannot tell anything about the pressure level. So, the best way to
+	 * handle this is to notify LOW in order to inform that there is
+	 * reclaiming activity. This gives a chance to userland to free memory.
+	 */
+	if (reclaimed > scanned)
+		return VMPRESSURE_LOW;
+
+	/*
 	 * We calculate the ratio (in percents) of how many pages were
 	 * scanned vs. reclaimed in a given time frame (window). Note that
 	 * time is in VM reclaimer's "ticks", i.e. number of pages
-- 
1.7.9.5


--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply related	[flat|nested] 50+ messages in thread

* Re: [PATCH v2] vmpressure: consider "scanned < reclaimed" case when calculating  a pressure level.
  2013-06-27 18:05                                           ` Anton Vorontsov
@ 2013-06-28 12:17                                             ` Michal Hocko
  0 siblings, 0 replies; 50+ messages in thread
From: Michal Hocko @ 2013-06-28 12:17 UTC (permalink / raw)
  To: Anton Vorontsov
  Cc: Minchan Kim, Hyunhee Kim, linux-mm, akpm, rob, kamezawa.hiroyu,
	hannes, rientjes, kirill, 'Kyungmin Park'

On Thu 27-06-13 11:05:50, Anton Vorontsov wrote:
> On Thu, Jun 27, 2013 at 06:11:03PM +0200, Michal Hocko wrote:
> > > If we send critical but there isn't big memory pressure, maybe
> > > critical handler would kill some process and the result is that
> > > killing another process unnecessary. That's really thing we should
> > > avoid.
> 
> Yes, so that is why I actually want to ack the patch... It might be not an
> ideal solution, but to me it seems like a good for the time being.

I am still not sure why a) nr_taken shouldn't be used instead of
nr_scanned b) why there should be any signal if the current allocator
dies and the direct reclaim is terminated prematurely.

[...]
-- 
Michal Hocko
SUSE Labs

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 50+ messages in thread

* Re: [PATCH v2] vmpressure: consider "scanned < reclaimed" case when calculating  a pressure level.
  2013-06-27 23:54                                           ` Minchan Kim
  2013-06-28  7:43                                             ` [PATCH v3] " Hyunhee Kim
@ 2013-06-28 12:24                                             ` Michal Hocko
  2013-06-28 13:55                                               ` Minchan Kim
  1 sibling, 1 reply; 50+ messages in thread
From: Michal Hocko @ 2013-06-28 12:24 UTC (permalink / raw)
  To: Minchan Kim
  Cc: Hyunhee Kim, 'Anton Vorontsov',
	linux-mm, akpm, rob, kamezawa.hiroyu, hannes, rientjes, kirill,
	'Kyungmin Park'

On Fri 28-06-13 08:54:35, Minchan Kim wrote:
> Hello Michal,
> 
> On Thu, Jun 27, 2013 at 06:11:03PM +0200, Michal Hocko wrote:
> > On Fri 28-06-13 00:35:28, Minchan Kim wrote:
> > > Hi Michal,
> > > 
> > > On Thu, Jun 27, 2013 at 11:37:21AM +0200, Michal Hocko wrote:
> > > > On Thu 27-06-13 15:12:10, Hyunhee Kim wrote:
> > > > > In vmpressure, the pressure level is calculated based on the ratio
> > > > > of how many pages were scanned vs. reclaimed in a given time window.
> > > > > However, there is a possibility that "scanned < reclaimed" in such a
> > > > > case, when reclaiming ends by fatal signal in shrink_inactive_list.
> > > > > So, with this patch, we just return "low" level when "scanned < reclaimed"
> > > > > happens not to have userland miss reclaim activity.
> > > > 
> > > > Hmm, fatal signal pending on kswapd doesn't make sense to me so it has
> > > > to be a direct reclaim path. Does it really make sense to signal LOW
> > > > when there is probably a big memory pressure and somebody is killing the
> > > > current allocator?
> > > 
> > > So, do you want to trigger critical instead of low?
> > > 
> > > Now, current is going to die so we can expect shortly we can get a amount
> > > of memory, normally. 
> > 
> > And also consider that this is per-memcg interface. And so it is even
> > more complicated. If a task dies then there is _no_ guarantee that there
> > will be an uncharge in that group (task could have been migrated to that
> > group so the memory belongs to somebody else).
> 
> Good point and that's one of the reason I hate memcg for just using
> vmpressure. 

Well, the very same problem is present in the memcg OOM as well. oom
score calculation is not memcg aware wrt charges.

> Let's think over it. One of the very avaialbe scenario
> which userland could do when notified from vmpressure is that manager
> process sends signal for others to release own cached memory.

Assuming those processes are in the same memcg, right?

> If we use vmpressure without move_charge_at_immigrate in multiple memcg
> group, it would be a disaster. But if we use move_charge_at_immigrate,
> we will see long stall easily so it's not an option, either.

I am not sure I am following you here. Could you be more specific what
is the actual problem?
>From my POV, a manager can see a memory pressure, it notifies others in
the same memcg and they will release their caches. With
move_charge_at_immigrate == 0 some of those might release a memory in
other group but somebody must be using memory from the currently
signaled group, right?

> So, IMO, it's not a good idea to use vmpressure with no-root memcg so
> it could raise the question again "why vmpressure is part of memcg".

Maybe I do not see the problem correctly, but making vmpressure memcg
aware was a good idea. It is something like userspace pre-oom handling.

> I really didn't want it. :(
[...]
-- 
Michal Hocko
SUSE Labs

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 50+ messages in thread

* Re: [PATCH v3] vmpressure: consider "scanned < reclaimed" case when calculating  a pressure level.
  2013-06-28  7:43                                             ` [PATCH v3] " Hyunhee Kim
@ 2013-06-28 12:26                                               ` Michal Hocko
  0 siblings, 0 replies; 50+ messages in thread
From: Michal Hocko @ 2013-06-28 12:26 UTC (permalink / raw)
  To: Hyunhee Kim
  Cc: 'Minchan Kim', 'Anton Vorontsov',
	linux-mm, akpm, rob, kamezawa.hiroyu, hannes, rientjes, kirill,
	'Kyungmin Park'

On Fri 28-06-13 16:43:09, Hyunhee Kim wrote:
> When this happens, we cannot tell anything about the current pressure
> level.

So we tell a random one? No, this doesn't make any sense to me. THP
should be fixed (use nr_taken) and signal_pending should be treated at
the vmpressure level and I think no signal should be sent in that case
as "we cannot say what is the current level"
-- 
Michal Hocko
SUSE Labs

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 50+ messages in thread

* Re: [PATCH v2] vmpressure: consider "scanned < reclaimed" case when calculating  a pressure level.
  2013-06-28 12:24                                             ` [PATCH v2] " Michal Hocko
@ 2013-06-28 13:55                                               ` Minchan Kim
  2013-06-28 15:17                                                 ` Michal Hocko
  0 siblings, 1 reply; 50+ messages in thread
From: Minchan Kim @ 2013-06-28 13:55 UTC (permalink / raw)
  To: Michal Hocko
  Cc: Hyunhee Kim, 'Anton Vorontsov',
	linux-mm, akpm, rob, kamezawa.hiroyu, hannes, rientjes, kirill,
	'Kyungmin Park'

Hi Michal,

On Fri, Jun 28, 2013 at 02:24:12PM +0200, Michal Hocko wrote:
> On Fri 28-06-13 08:54:35, Minchan Kim wrote:
> > Hello Michal,
> > 
> > On Thu, Jun 27, 2013 at 06:11:03PM +0200, Michal Hocko wrote:
> > > On Fri 28-06-13 00:35:28, Minchan Kim wrote:
> > > > Hi Michal,
> > > > 
> > > > On Thu, Jun 27, 2013 at 11:37:21AM +0200, Michal Hocko wrote:
> > > > > On Thu 27-06-13 15:12:10, Hyunhee Kim wrote:
> > > > > > In vmpressure, the pressure level is calculated based on the ratio
> > > > > > of how many pages were scanned vs. reclaimed in a given time window.
> > > > > > However, there is a possibility that "scanned < reclaimed" in such a
> > > > > > case, when reclaiming ends by fatal signal in shrink_inactive_list.
> > > > > > So, with this patch, we just return "low" level when "scanned < reclaimed"
> > > > > > happens not to have userland miss reclaim activity.
> > > > > 
> > > > > Hmm, fatal signal pending on kswapd doesn't make sense to me so it has
> > > > > to be a direct reclaim path. Does it really make sense to signal LOW
> > > > > when there is probably a big memory pressure and somebody is killing the
> > > > > current allocator?
> > > > 
> > > > So, do you want to trigger critical instead of low?
> > > > 
> > > > Now, current is going to die so we can expect shortly we can get a amount
> > > > of memory, normally. 
> > > 
> > > And also consider that this is per-memcg interface. And so it is even
> > > more complicated. If a task dies then there is _no_ guarantee that there
> > > will be an uncharge in that group (task could have been migrated to that
> > > group so the memory belongs to somebody else).
> > 
> > Good point and that's one of the reason I hate memcg for just using
> > vmpressure. 
> 
> Well, the very same problem is present in the memcg OOM as well. oom
> score calculation is not memcg aware wrt charges.
> 
> > Let's think over it. One of the very avaialbe scenario
> > which userland could do when notified from vmpressure is that manager
> > process sends signal for others to release own cached memory.
> 
> Assuming those processes are in the same memcg, right?
> 
> > If we use vmpressure without move_charge_at_immigrate in multiple memcg
> > group, it would be a disaster. But if we use move_charge_at_immigrate,
> > we will see long stall easily so it's not an option, either.
> 
> I am not sure I am following you here. Could you be more specific what
> is the actual problem?
> From my POV, a manager can see a memory pressure, it notifies others in
> the same memcg and they will release their caches. With
> move_charge_at_immigrate == 0 some of those might release a memory in
> other group but somebody must be using memory from the currently
> signaled group, right?

My concern is that manager process can send a signal to a process A
in same group but unfortunately, process A would release a memory
in other group so manager process can send a signal to a process B
in same group but unfortunately, process B would release a memory
in other group so manger process can ...
...
...
...
in same group and at last, process Z would release a memory in same
group but we release all of cached from A-Y process. :(

> 
> > So, IMO, it's not a good idea to use vmpressure with no-root memcg so
> > it could raise the question again "why vmpressure is part of memcg".
> 
> Maybe I do not see the problem correctly, but making vmpressure memcg
> aware was a good idea. It is something like userspace pre-oom handling.

I don't say that memcg-aware is bad. Surely it's good thing but
it's not good that we must enable memcg for just using memory notifier
globally. Even above problem would make memcg-vmpressure complicated and
memory reclaim behavior change compared to long history well-made global
page reclaim.

I claim we should be able to use vmpressure without memcg as well as
memcg.

> 
> > I really didn't want it. :(
> [...]
> -- 
> Michal Hocko
> SUSE Labs

-- 
Kind regards,
Minchan Kim

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 50+ messages in thread

* Re: [PATCH v2] vmpressure: consider "scanned < reclaimed" case when calculating  a pressure level.
  2013-06-28 13:55                                               ` Minchan Kim
@ 2013-06-28 15:17                                                 ` Michal Hocko
  0 siblings, 0 replies; 50+ messages in thread
From: Michal Hocko @ 2013-06-28 15:17 UTC (permalink / raw)
  To: Minchan Kim
  Cc: Hyunhee Kim, 'Anton Vorontsov',
	linux-mm, akpm, rob, kamezawa.hiroyu, hannes, rientjes, kirill,
	'Kyungmin Park'

On Fri 28-06-13 22:55:17, Minchan Kim wrote:
> Hi Michal,
> 
> On Fri, Jun 28, 2013 at 02:24:12PM +0200, Michal Hocko wrote:
> > On Fri 28-06-13 08:54:35, Minchan Kim wrote:
> > > Hello Michal,
> > > 
> > > On Thu, Jun 27, 2013 at 06:11:03PM +0200, Michal Hocko wrote:
> > > > On Fri 28-06-13 00:35:28, Minchan Kim wrote:
> > > > > Hi Michal,
> > > > > 
> > > > > On Thu, Jun 27, 2013 at 11:37:21AM +0200, Michal Hocko wrote:
> > > > > > On Thu 27-06-13 15:12:10, Hyunhee Kim wrote:
> > > > > > > In vmpressure, the pressure level is calculated based on the ratio
> > > > > > > of how many pages were scanned vs. reclaimed in a given time window.
> > > > > > > However, there is a possibility that "scanned < reclaimed" in such a
> > > > > > > case, when reclaiming ends by fatal signal in shrink_inactive_list.
> > > > > > > So, with this patch, we just return "low" level when "scanned < reclaimed"
> > > > > > > happens not to have userland miss reclaim activity.
> > > > > > 
> > > > > > Hmm, fatal signal pending on kswapd doesn't make sense to me so it has
> > > > > > to be a direct reclaim path. Does it really make sense to signal LOW
> > > > > > when there is probably a big memory pressure and somebody is killing the
> > > > > > current allocator?
> > > > > 
> > > > > So, do you want to trigger critical instead of low?
> > > > > 
> > > > > Now, current is going to die so we can expect shortly we can get a amount
> > > > > of memory, normally. 
> > > > 
> > > > And also consider that this is per-memcg interface. And so it is even
> > > > more complicated. If a task dies then there is _no_ guarantee that there
> > > > will be an uncharge in that group (task could have been migrated to that
> > > > group so the memory belongs to somebody else).
> > > 
> > > Good point and that's one of the reason I hate memcg for just using
> > > vmpressure. 
> > 
> > Well, the very same problem is present in the memcg OOM as well. oom
> > score calculation is not memcg aware wrt charges.
> > 
> > > Let's think over it. One of the very avaialbe scenario
> > > which userland could do when notified from vmpressure is that manager
> > > process sends signal for others to release own cached memory.
> > 
> > Assuming those processes are in the same memcg, right?
> > 
> > > If we use vmpressure without move_charge_at_immigrate in multiple memcg
> > > group, it would be a disaster. But if we use move_charge_at_immigrate,
> > > we will see long stall easily so it's not an option, either.
> > 
> > I am not sure I am following you here. Could you be more specific what
> > is the actual problem?
> > From my POV, a manager can see a memory pressure, it notifies others in
> > the same memcg and they will release their caches. With
> > move_charge_at_immigrate == 0 some of those might release a memory in
> > other group but somebody must be using memory from the currently
> > signaled group, right?
> 
> My concern is that manager process can send a signal to a process A
> in same group but unfortunately, process A would release a memory
> in other group so manager process can send a signal to a process B
> in same group but unfortunately, process B would release a memory
> in other group so manger process can ...
> ...
> ...
> ...
> in same group and at last, process Z would release a memory in same
> group but we release all of cached from A-Y process. :(

I would say, do not move tasks without move_at_immigrate if you are
doing something like that.

> > > So, IMO, it's not a good idea to use vmpressure with no-root memcg so
> > > it could raise the question again "why vmpressure is part of memcg".
> > 
> > Maybe I do not see the problem correctly, but making vmpressure memcg
> > aware was a good idea. It is something like userspace pre-oom handling.
> 
> I don't say that memcg-aware is bad. Surely it's good thing but
> it's not good that we must enable memcg for just using memory notifier
> globally. Even above problem would make memcg-vmpressure complicated and
> memory reclaim behavior change compared to long history well-made global
> page reclaim.
> 
> I claim we should be able to use vmpressure without memcg as well as
> memcg.

This is not hard to do. Look at how we are handling lruvec for !memcg
configurations. The similar thing could be done for vmpressure as well.
You just need to find a way how to export eventfd which is not private
to memcg AFAIK.

-- 
Michal Hocko
SUSE Labs

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 50+ messages in thread

end of thread, other threads:[~2013-06-28 15:17 UTC | newest]

Thread overview: 50+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2013-06-17 11:30 [PATCH v3] memcg: event control at vmpressure Hyunhee Kim
2013-06-17 13:15 ` Michal Hocko
2013-06-18  6:10   ` Hyunhee Kim
2013-06-18  8:00     ` Hyunhee Kim
2013-06-18 11:01       ` Michal Hocko
2013-06-19 11:25         ` Hyunhee Kim
2013-06-19 11:59           ` Michal Hocko
2013-06-19 11:31         ` [PATCH v4] " Hyunhee Kim
2013-06-19 12:53           ` Michal Hocko
2013-06-20  2:13             ` Hyunhee Kim
2013-06-20  2:17             ` [PATCH v5] " Hyunhee Kim
2013-06-20 12:16               ` Michal Hocko
2013-06-21  0:21                 ` [PATCH v6] " Hyunhee Kim
2013-06-21  0:24                   ` Hyunhee Kim
2013-06-21  1:22                     ` Minchan Kim
2013-06-21  9:19                       ` Michal Hocko
2013-06-21 11:02                         ` Hyunhee Kim
2013-06-21 11:54                           ` Hyunhee Kim
2013-06-21 12:40                             ` [PATCH v7] " Hyunhee Kim
2013-06-21 16:27                         ` [PATCH v6] " Minchan Kim
2013-06-21 16:44                           ` Minchan Kim
2013-06-22  0:27                             ` Anton Vorontsov
2013-06-22  1:28                               ` Hyunhee Kim
2013-06-26  7:47                               ` Minchan Kim
2013-06-21 22:35                           ` Anton Vorontsov
2013-06-22  4:36                           ` Hyunhee Kim
2013-06-22  4:51                             ` Hyunhee Kim
2013-06-22  5:50                               ` [PATCH] memcg: consider "scanned < reclaimed" case when calculating Hyunhee Kim
2013-06-22  7:34                                 ` [PATCH] memcg: add interface to specify thresholds of vmpressure Hyunhee Kim
2013-06-25 20:46                                   ` Michal Hocko
2013-06-26  7:39                                   ` Minchan Kim
2013-06-26  7:50                                     ` Kyungmin Park
2013-06-26  8:03                                       ` Minchan Kim
2013-06-26  7:35                                 ` [PATCH] memcg: consider "scanned < reclaimed" case when calculating Minchan Kim
2013-06-27  6:12                                   ` [PATCH v2] vmpressure: consider "scanned < reclaimed" case when calculating a pressure level Hyunhee Kim
2013-06-27  9:37                                     ` Michal Hocko
2013-06-27 15:35                                       ` Minchan Kim
2013-06-27 16:11                                         ` Michal Hocko
2013-06-27 18:05                                           ` Anton Vorontsov
2013-06-28 12:17                                             ` Michal Hocko
2013-06-27 23:54                                           ` Minchan Kim
2013-06-28  7:43                                             ` [PATCH v3] " Hyunhee Kim
2013-06-28 12:26                                               ` Michal Hocko
2013-06-28 12:24                                             ` [PATCH v2] " Michal Hocko
2013-06-28 13:55                                               ` Minchan Kim
2013-06-28 15:17                                                 ` Michal Hocko
2013-06-27 18:33                                     ` Anton Vorontsov
2013-06-26  7:34                               ` [PATCH v6] memcg: event control at vmpressure Minchan Kim
2013-06-26  7:31                             ` Minchan Kim
2013-06-25 16:07                           ` Michal Hocko

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.