linux-kernel.vger.kernel.org archive mirror
* [PATCH] memcg: use do_div to divide s64 in 32 bit machine.
@ 2010-11-05 16:08 Minchan Kim
  2010-11-05 16:34 ` Greg Thelen
  2010-11-06  1:03 ` hannes
  0 siblings, 2 replies; 31+ messages in thread
From: Minchan Kim @ 2010-11-05 16:08 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Dave Young, Greg Thelen, Andrea Righi, KAMEZAWA Hiroyuki,
	Daisuke Nishimura, Balbir Singh, Wu Fengguang,
	Linux Kernel Mailing List, linux-mm, Minchan Kim

Use do_div() to divide an s64 value. Otherwise, the build fails on
32-bit machines, as Dave Young reported:

mm/built-in.o: In function `mem_cgroup_dirty_info':
/home/dave/vdb/build/mm/linux-2.6.36/mm/memcontrol.c:1251: undefined
reference to `__divdi3'
/home/dave/vdb/build/mm/linux-2.6.36/mm/memcontrol.c:1259: undefined
reference to `__divdi3'
make: *** [.tmp_vmlinux1] Error 1
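
For background, a minimal sketch of the do_div() idiom (not part of the
patch): on 32-bit targets, gcc turns a 64-bit division into a call to the
libgcc helper __divdi3, which the kernel does not provide, hence the link
error above.  do_div(n, base) divides the 64-bit variable n by a 32-bit
divisor in place and evaluates to the remainder, e.g. with the names used
in the patch below:

	u64 thresh;

	thresh = (u64)dirty_param.dirty_ratio * available_mem; /* pages */
	do_div(thresh, 100);	/* thresh /= 100, evaluates to thresh % 100 */

Note that do_div() expects an unsigned 64-bit dividend, which is part of
why the type discussion later in this thread matters.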

Tested-by: Dave Young <hidave.darkstar@gmail.com>
Signed-off-by: Minchan Kim <minchan.kim@gmail.com>
---
 mm/memcontrol.c |   16 +++++++++-------
 1 files changed, 9 insertions(+), 7 deletions(-)

diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index 76386f4..a15d95e 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -1247,18 +1247,20 @@ bool mem_cgroup_dirty_info(unsigned long sys_available_mem,
 	if (dirty_param.dirty_bytes)
 		info->dirty_thresh =
 			DIV_ROUND_UP(dirty_param.dirty_bytes, PAGE_SIZE);
-	else
-		info->dirty_thresh =
-			(dirty_param.dirty_ratio * available_mem) / 100;
+	else {
+		info->dirty_thresh = dirty_param.dirty_ratio * available_mem;
+		do_div(info->dirty_thresh, 100);
+	}
 
 	if (dirty_param.dirty_background_bytes)
 		info->background_thresh =
 			DIV_ROUND_UP(dirty_param.dirty_background_bytes,
 				     PAGE_SIZE);
-	else
-		info->background_thresh =
-			(dirty_param.dirty_background_ratio *
-			       available_mem) / 100;
+	else {
+		info->background_thresh = dirty_param.dirty_background_ratio *
+			available_mem;
+		do_div(info->background_thresh, 100);
+	}
 
 	info->nr_reclaimable =
 		mem_cgroup_page_stat(MEMCG_NR_RECLAIM_PAGES);
-- 
1.7.0.5



* Re: [PATCH] memcg: use do_div to divide s64 in 32 bit machine.
  2010-11-05 16:08 [PATCH] memcg: use do_div to divide s64 in 32 bit machine Minchan Kim
@ 2010-11-05 16:34 ` Greg Thelen
  2010-11-06  1:03 ` hannes
  1 sibling, 0 replies; 31+ messages in thread
From: Greg Thelen @ 2010-11-05 16:34 UTC (permalink / raw)
  To: Minchan Kim
  Cc: Andrew Morton, Dave Young, Andrea Righi, KAMEZAWA Hiroyuki,
	Daisuke Nishimura, Balbir Singh, Wu Fengguang,
	Linux Kernel Mailing List, linux-mm

Minchan Kim <minchan.kim@gmail.com> writes:

> Use do_div() to divide an s64 value. Otherwise, the build fails on
> 32-bit machines, as Dave Young reported:
>
> mm/built-in.o: In function `mem_cgroup_dirty_info':
> /home/dave/vdb/build/mm/linux-2.6.36/mm/memcontrol.c:1251: undefined
> reference to `__divdi3'
> /home/dave/vdb/build/mm/linux-2.6.36/mm/memcontrol.c:1259: undefined
> reference to `__divdi3'
> make: *** [.tmp_vmlinux1] Error 1
>
> Tested-by: Dave Young <hidave.darkstar@gmail.com>
> Signed-off-by: Minchan Kim <minchan.kim@gmail.com>
Tested-by: Greg Thelen <gthelen@google.com>

Thanks for the report and the patch.

> ---
>  mm/memcontrol.c |   16 +++++++++-------
>  1 files changed, 9 insertions(+), 7 deletions(-)
>
> diff --git a/mm/memcontrol.c b/mm/memcontrol.c
> index 76386f4..a15d95e 100644
> --- a/mm/memcontrol.c
> +++ b/mm/memcontrol.c
> @@ -1247,18 +1247,20 @@ bool mem_cgroup_dirty_info(unsigned long sys_available_mem,
>  	if (dirty_param.dirty_bytes)
>  		info->dirty_thresh =
>  			DIV_ROUND_UP(dirty_param.dirty_bytes, PAGE_SIZE);
> -	else
> -		info->dirty_thresh =
> -			(dirty_param.dirty_ratio * available_mem) / 100;
> +	else {
> +		info->dirty_thresh = dirty_param.dirty_ratio * available_mem;
> +		do_div(info->dirty_thresh, 100);
> +	}
>  
>  	if (dirty_param.dirty_background_bytes)
>  		info->background_thresh =
>  			DIV_ROUND_UP(dirty_param.dirty_background_bytes,
>  				     PAGE_SIZE);
> -	else
> -		info->background_thresh =
> -			(dirty_param.dirty_background_ratio *
> -			       available_mem) / 100;
> +	else {
> +		info->background_thresh = dirty_param.dirty_background_ratio *
> +			available_mem;
> +		do_div(info->background_thresh, 100);
> +	}
>  
>  	info->nr_reclaimable =
>  		mem_cgroup_page_stat(MEMCG_NR_RECLAIM_PAGES);


* Re: [PATCH] memcg: use do_div to divide s64 in 32 bit machine.
  2010-11-05 16:08 [PATCH] memcg: use do_div to divide s64 in 32 bit machine Minchan Kim
  2010-11-05 16:34 ` Greg Thelen
@ 2010-11-06  1:03 ` hannes
  2010-11-06 17:19   ` Greg Thelen
  1 sibling, 1 reply; 31+ messages in thread
From: hannes @ 2010-11-06  1:03 UTC (permalink / raw)
  To: Minchan Kim
  Cc: Andrew Morton, Dave Young, Greg Thelen, Andrea Righi,
	KAMEZAWA Hiroyuki, Daisuke Nishimura, Balbir Singh, Wu Fengguang,
	Linux Kernel Mailing List, linux-mm

On Sat, Nov 06, 2010 at 01:08:53AM +0900, Minchan Kim wrote:
> Use do_div() to divide an s64 value. Otherwise, the build fails on
> 32-bit machines, as Dave Young reported.

I thought about that too, but then I asked myself why you would want
to represent a number of pages as a signed 64-bit type, even on 32-bit?

Isn't the much better fix to get the types right instead?


* Re: [PATCH] memcg: use do_div to divide s64 in 32 bit machine.
  2010-11-06  1:03 ` hannes
@ 2010-11-06 17:19   ` Greg Thelen
  2010-11-06 17:31     ` Minchan Kim
                       ` (5 more replies)
  0 siblings, 6 replies; 31+ messages in thread
From: Greg Thelen @ 2010-11-06 17:19 UTC (permalink / raw)
  To: hannes
  Cc: Minchan Kim, Andrew Morton, Dave Young, Andrea Righi,
	KAMEZAWA Hiroyuki, Daisuke Nishimura, Balbir Singh, Wu Fengguang,
	Linux Kernel Mailing List, linux-mm

On Fri, Nov 5, 2010 at 6:03 PM,  <hannes@cmpxchg.org> wrote:
> On Sat, Nov 06, 2010 at 01:08:53AM +0900, Minchan Kim wrote:
>> Use do_div() to divide an s64 value. Otherwise, the build fails on
>> 32-bit machines, as Dave Young reported.
>
> I thought about that too, but then I asked myself why you would want
> to represent a number of pages as a signed 64-bit type, even on 32-bit?

I think the reason a 64-bit type is used for page counts in
memcontrol.c is that the low-level res_counter primitives operate
on 64-bit counters, even on 32-bit machines.

> Isn't the much better fix to get the types right instead?
>

I agree that consistent types between mem_cgroup_dirty_info() and
global_dirty_info() are important.  There seems to be a lot of usage of
s64 for page counts in memcontrol.c, which I think is due to the
res_counter types.  I think these s64 should be switched to unsigned
long to be consistent with the rest of the mm code.  It looks like this
will be a clean patch, except at the lowest level where
res_counter_read_u64() is used, where some casting may be needed.
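
For illustration, the conversion at that lowest level might look
something like this (a sketch; memcg_usage_pages() is a made-up helper,
res_counter_read_u64() is the existing interface):

	static unsigned long memcg_usage_pages(struct mem_cgroup *mem)
	{
		u64 bytes = res_counter_read_u64(&mem->res, RES_USAGE);

		/* page counts fit an unsigned long on both 32 and 64 bit */
		return (unsigned long)(bytes >> PAGE_SHIFT);
	}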

I'll post a patch for that change.


* Re: [PATCH] memcg: use do_div to divide s64 in 32 bit machine.
  2010-11-06 17:19   ` Greg Thelen
@ 2010-11-06 17:31     ` Minchan Kim
  2010-11-07 22:14     ` [patch 0/4] memcg: variable type fixes Johannes Weiner
                       ` (4 subsequent siblings)
  5 siblings, 0 replies; 31+ messages in thread
From: Minchan Kim @ 2010-11-06 17:31 UTC (permalink / raw)
  To: Greg Thelen
  Cc: hannes, Andrew Morton, Dave Young, Andrea Righi,
	KAMEZAWA Hiroyuki, Daisuke Nishimura, Balbir Singh, Wu Fengguang,
	Linux Kernel Mailing List, linux-mm

On Sun, Nov 7, 2010 at 2:19 AM, Greg Thelen <gthelen@google.com> wrote:
> On Fri, Nov 5, 2010 at 6:03 PM,  <hannes@cmpxchg.org> wrote:
>> On Sat, Nov 06, 2010 at 01:08:53AM +0900, Minchan Kim wrote:
>>> Use do_div() to divide an s64 value. Otherwise, the build fails on
>>> 32-bit machines, as Dave Young reported.
>>
>> I thought about that too, but then I asked myself why you would want
>> to represent a number of pages as a signed 64-bit type, even on 32-bit?
>
> I think the reason a 64-bit type is used for page counts in
> memcontrol.c is that the low-level res_counter primitives operate
> on 64-bit counters, even on 32-bit machines.
>
>> Isn't the much better fix to get the types right instead?
>>
>
> I agree that consistent types between mem_cgroup_dirty_info() and
> global_dirty_info() are important.  There seems to be a lot of usage of
> s64 for page counts in memcontrol.c, which I think is due to the
> res_counter types.  I think these s64 should be switched to unsigned
> long to be consistent with the rest of the mm code.  It looks like this
> will be a clean patch, except at the lowest level where
> res_counter_read_u64() is used, where some casting may be needed.
>
> I'll post a patch for that change.
>

Agreed. I don't mind it.
Thanks, Hannes and Greg.



-- 
Kind regards,
Minchan Kim


* [patch 0/4] memcg: variable type fixes
  2010-11-06 17:19   ` Greg Thelen
  2010-11-06 17:31     ` Minchan Kim
@ 2010-11-07 22:14     ` Johannes Weiner
  2010-11-07 22:14     ` [patch 1/4] memcg: use native word to represent dirtyable pages Johannes Weiner
                       ` (3 subsequent siblings)
  5 siblings, 0 replies; 31+ messages in thread
From: Johannes Weiner @ 2010-11-07 22:14 UTC (permalink / raw)
  To: Greg Thelen
  Cc: Minchan Kim, Andrew Morton, Dave Young, Andrea Righi,
	KAMEZAWA Hiroyuki, Daisuke Nishimura, Balbir Singh, Wu Fengguang,
	linux-mm, linux-kernel

Hi Greg,

It is not the res_counter primitives; these are our own counters.  We
have to keep signed types for most counters, as the per-cpu counter
folding can race and we end up with negative values.
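
A tiny illustration of such a race (made-up interleaving, not from the
series):

	/*
	 * A page is moved between groups: CPU0 does the uncharge,
	 * CPU1 the charge.  A reader folding the counters after CPU0
	 * has updated but before CPU1 has sees:
	 *
	 *	CPU0: count = -1
	 *	CPU1: count =  0
	 *	sum          -1		(transiently negative)
	 *
	 * so both the per-cpu slots and any local sum of them need
	 * signed types.
	 */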

The fix for the original issue is in patch 1.  There are no casts
needed; the range is checked to be sane and the value is then converted
to the unsigned type through assignment.

Patch 2, also a type fix, ensures we catch accounting races properly.
It is unrelated, but also important.

Patch 3 implements the idea that we only have to use signed types for
_some_ of the counters, but not for constant event counters where the
sign-bit would be a waste.

Patch 4 converts our fundamental page statistics counters to native
words as these should be wide enough for the expected values.

	Hannes



* [patch 1/4] memcg: use native word to represent dirtyable pages
  2010-11-06 17:19   ` Greg Thelen
  2010-11-06 17:31     ` Minchan Kim
  2010-11-07 22:14     ` [patch 0/4] memcg: variable type fixes Johannes Weiner
@ 2010-11-07 22:14     ` Johannes Weiner
  2010-11-07 22:56       ` Minchan Kim
  2010-11-16  3:37       ` KAMEZAWA Hiroyuki
  2010-11-07 22:14     ` [patch 2/4] memcg: catch negative per-cpu sums in dirty info Johannes Weiner
                       ` (2 subsequent siblings)
  5 siblings, 2 replies; 31+ messages in thread
From: Johannes Weiner @ 2010-11-07 22:14 UTC (permalink / raw)
  To: Greg Thelen
  Cc: Minchan Kim, Andrew Morton, Dave Young, Andrea Righi,
	KAMEZAWA Hiroyuki, Daisuke Nishimura, Balbir Singh, Wu Fengguang,
	linux-mm, linux-kernel

[-- Attachment #1: memcg-use-native-word-to-represent-dirtyable-pages.patch --]
[-- Type: text/plain, Size: 1575 bytes --]

The memory cgroup dirty info calculation currently uses a signed
64-bit type to represent the amount of dirtyable memory in pages.

This can instead be changed to an unsigned word, which will allow the
formula to function correctly with up to 160G of LRU pages on a 32-bit
system, assuming 4k pages.  That should be plenty even when taking
racy folding of the per-cpu counters into account.

This fixes a compilation error on 32-bit systems as this code tries to
do 64-bit division.

Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
Reported-by: Dave Young <hidave.darkstar@gmail.com>
---
 mm/memcontrol.c |    9 +++++----
 1 file changed, 5 insertions(+), 4 deletions(-)

--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -1222,9 +1222,10 @@ static void __mem_cgroup_dirty_param(str
 bool mem_cgroup_dirty_info(unsigned long sys_available_mem,
 			   struct dirty_info *info)
 {
-	s64 available_mem;
 	struct vm_dirty_param dirty_param;
+	unsigned long available_mem;
 	struct mem_cgroup *memcg;
+	s64 value;
 
 	if (mem_cgroup_disabled())
 		return false;
@@ -1238,11 +1239,11 @@ bool mem_cgroup_dirty_info(unsigned long
 	__mem_cgroup_dirty_param(&dirty_param, memcg);
 	rcu_read_unlock();
 
-	available_mem = mem_cgroup_page_stat(MEMCG_NR_DIRTYABLE_PAGES);
-	if (available_mem < 0)
+	value = mem_cgroup_page_stat(MEMCG_NR_DIRTYABLE_PAGES);
+	if (value < 0)
 		return false;
 
-	available_mem = min((unsigned long)available_mem, sys_available_mem);
+	available_mem = min((unsigned long)value, sys_available_mem);
 
 	if (dirty_param.dirty_bytes)
 		info->dirty_thresh =




* [patch 2/4] memcg: catch negative per-cpu sums in dirty info
  2010-11-06 17:19   ` Greg Thelen
                       ` (2 preceding siblings ...)
  2010-11-07 22:14     ` [patch 1/4] memcg: use native word to represent dirtyable pages Johannes Weiner
@ 2010-11-07 22:14     ` Johannes Weiner
  2010-11-07 23:26       ` Minchan Kim
  2010-11-16  3:39       ` KAMEZAWA Hiroyuki
  2010-11-07 22:14     ` [patch 3/4] memcg: break out event counters from other stats Johannes Weiner
  2010-11-07 22:14     ` [patch 4/4] memcg: use native word page statistics counters Johannes Weiner
  5 siblings, 2 replies; 31+ messages in thread
From: Johannes Weiner @ 2010-11-07 22:14 UTC (permalink / raw)
  To: Greg Thelen
  Cc: Minchan Kim, Andrew Morton, Dave Young, Andrea Righi,
	KAMEZAWA Hiroyuki, Daisuke Nishimura, Balbir Singh, Wu Fengguang,
	linux-mm, linux-kernel

[-- Attachment #1: memcg-catch-negative-per-cpu-sums-in-dirty-info.patch --]
[-- Type: text/plain, Size: 1251 bytes --]

Folding the per-cpu counters can yield a negative value in case of
accounting races between CPUs.

When collecting the dirty info, the code would read those sums into an
unsigned variable and then check for it being negative, which cannot
work.

Instead, fold the counters into a signed local variable, make the
check, and only then assign it.

This way, the function signals correctly when there are insane values
instead of leaking them out to the caller.

Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
---
 mm/memcontrol.c |   11 ++++++-----
 1 file changed, 6 insertions(+), 5 deletions(-)

--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -1261,14 +1261,15 @@ bool mem_cgroup_dirty_info(unsigned long
 			(dirty_param.dirty_background_ratio *
 			       available_mem) / 100;
 
-	info->nr_reclaimable =
-		mem_cgroup_page_stat(MEMCG_NR_RECLAIM_PAGES);
-	if (info->nr_reclaimable < 0)
+	value = mem_cgroup_page_stat(MEMCG_NR_RECLAIM_PAGES);
+	if (value < 0)
 		return false;
+	info->nr_reclaimable = value;
 
-	info->nr_writeback = mem_cgroup_page_stat(MEMCG_NR_WRITEBACK);
-	if (info->nr_writeback < 0)
+	value = mem_cgroup_page_stat(MEMCG_NR_WRITEBACK);
+	if (value < 0)
 		return false;
+	info->nr_writeback = value;
 
 	return true;
 }




* [patch 3/4] memcg: break out event counters from other stats
  2010-11-06 17:19   ` Greg Thelen
                       ` (3 preceding siblings ...)
  2010-11-07 22:14     ` [patch 2/4] memcg: catch negative per-cpu sums in dirty info Johannes Weiner
@ 2010-11-07 22:14     ` Johannes Weiner
  2010-11-07 23:52       ` Minchan Kim
  2010-11-16  3:41       ` KAMEZAWA Hiroyuki
  2010-11-07 22:14     ` [patch 4/4] memcg: use native word page statistics counters Johannes Weiner
  5 siblings, 2 replies; 31+ messages in thread
From: Johannes Weiner @ 2010-11-07 22:14 UTC (permalink / raw)
  To: Greg Thelen
  Cc: Minchan Kim, Andrew Morton, Dave Young, Andrea Righi,
	KAMEZAWA Hiroyuki, Daisuke Nishimura, Balbir Singh, Wu Fengguang,
	linux-mm, linux-kernel

[-- Attachment #1: memcg-break-out-event-counters-from-other-stats.patch --]
[-- Type: text/plain, Size: 4767 bytes --]

For increasing and decreasing per-cpu cgroup usage counters it makes
sense to use signed types, as single per-cpu values might go negative
during updates.  But this is not the case for only-ever-increasing
event counters.

All the counters have been signed 64-bit so far, which was enough to
count events even with the sign bit wasted.

The next patch narrows the usage counters type (on 32-bit CPUs, that
is), though, so break out the event counters and make them unsigned
words as they should have been from the start.

Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
---
 mm/memcontrol.c |   49 +++++++++++++++++++++++++++++++++++++------------
 1 file changed, 37 insertions(+), 12 deletions(-)

--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -85,21 +85,23 @@ enum mem_cgroup_stat_index {
 	 */
 	MEM_CGROUP_STAT_CACHE, 	   /* # of pages charged as cache */
 	MEM_CGROUP_STAT_RSS,	   /* # of pages charged as anon rss */
-	MEM_CGROUP_STAT_PGPGIN_COUNT,	/* # of pages paged in */
-	MEM_CGROUP_STAT_PGPGOUT_COUNT,	/* # of pages paged out */
 	MEM_CGROUP_STAT_SWAPOUT, /* # of pages, swapped out */
 	MEM_CGROUP_STAT_FILE_MAPPED,  /* # of pages charged as file rss */
 	MEM_CGROUP_STAT_FILE_DIRTY,	/* # of dirty pages in page cache */
 	MEM_CGROUP_STAT_FILE_WRITEBACK,		/* # of pages under writeback */
 	MEM_CGROUP_STAT_FILE_UNSTABLE_NFS,	/* # of NFS unstable pages */
 	MEM_CGROUP_STAT_DATA, /* end of data requires synchronization */
-	/* incremented at every  pagein/pageout */
-	MEM_CGROUP_EVENTS = MEM_CGROUP_STAT_DATA,
 	MEM_CGROUP_ON_MOVE,	/* someone is moving account between groups */
-
 	MEM_CGROUP_STAT_NSTATS,
 };
 
+enum mem_cgroup_events_index {
+	MEM_CGROUP_EVENTS_PGPGIN,	/* # of pages paged in */
+	MEM_CGROUP_EVENTS_PGPGOUT,	/* # of pages paged out */
+	MEM_CGROUP_EVENTS_COUNT,	/* # of pages paged in/out */
+	MEM_CGROUP_EVENTS_NSTATS,
+};
+
 enum {
 	MEM_CGROUP_DIRTY_RATIO,
 	MEM_CGROUP_DIRTY_LIMIT_IN_BYTES,
@@ -109,6 +111,7 @@ enum {
 
 struct mem_cgroup_stat_cpu {
 	s64 count[MEM_CGROUP_STAT_NSTATS];
+	unsigned long events[MEM_CGROUP_EVENTS_NSTATS];
 };
 
 /*
@@ -612,6 +615,22 @@ static void mem_cgroup_swap_statistics(s
 	this_cpu_add(mem->stat->count[MEM_CGROUP_STAT_SWAPOUT], val);
 }
 
+static unsigned long mem_cgroup_read_events(struct mem_cgroup *mem,
+					    enum mem_cgroup_events_index idx)
+{
+	unsigned long val = 0;
+	int cpu;
+
+	for_each_online_cpu(cpu)
+		val += per_cpu(mem->stat->events[idx], cpu);
+#ifdef CONFIG_HOTPLUG_CPU
+	spin_lock(&mem->pcp_counter_lock);
+	val += mem->nocpu_base.events[idx];
+	spin_unlock(&mem->pcp_counter_lock);
+#endif
+	return val;
+}
+
 static void mem_cgroup_charge_statistics(struct mem_cgroup *mem,
 					 struct page_cgroup *pc,
 					 bool charge)
@@ -626,10 +645,10 @@ static void mem_cgroup_charge_statistics
 		__this_cpu_add(mem->stat->count[MEM_CGROUP_STAT_RSS], val);
 
 	if (charge)
-		__this_cpu_inc(mem->stat->count[MEM_CGROUP_STAT_PGPGIN_COUNT]);
+		__this_cpu_inc(mem->stat->events[MEM_CGROUP_EVENTS_PGPGIN]);
 	else
-		__this_cpu_inc(mem->stat->count[MEM_CGROUP_STAT_PGPGOUT_COUNT]);
-	__this_cpu_inc(mem->stat->count[MEM_CGROUP_EVENTS]);
+		__this_cpu_inc(mem->stat->events[MEM_CGROUP_EVENTS_PGPGOUT]);
+	__this_cpu_inc(mem->stat->events[MEM_CGROUP_EVENTS_COUNT]);
 
 	preempt_enable();
 }
@@ -651,9 +670,9 @@ static unsigned long mem_cgroup_get_loca
 
 static bool __memcg_event_check(struct mem_cgroup *mem, int event_mask_shift)
 {
-	s64 val;
+	unsigned long val;
 
-	val = this_cpu_read(mem->stat->count[MEM_CGROUP_EVENTS]);
+	val = this_cpu_read(mem->stat->events[MEM_CGROUP_EVENTS_COUNT]);
 
 	return !(val & ((1 << event_mask_shift) - 1));
 }
@@ -2055,6 +2074,12 @@ static void mem_cgroup_drain_pcp_counter
 		per_cpu(mem->stat->count[i], cpu) = 0;
 		mem->nocpu_base.count[i] += x;
 	}
+	for (i = 0; i < MEM_CGROUP_EVENTS_NSTATS; i++) {
+		unsigned long x = per_cpu(mem->stat->events[i], cpu);
+
+		per_cpu(mem->stat->events[i], cpu) = 0;
+		mem->nocpu_base.events[i] += x;
+	}
 	/* need to clear ON_MOVE value, works as a kind of lock. */
 	per_cpu(mem->stat->count[MEM_CGROUP_ON_MOVE], cpu) = 0;
 	spin_unlock(&mem->pcp_counter_lock);
@@ -3892,9 +3917,9 @@ mem_cgroup_get_local_stat(struct mem_cgr
 	s->stat[MCS_RSS] += val * PAGE_SIZE;
 	val = mem_cgroup_read_stat(mem, MEM_CGROUP_STAT_FILE_MAPPED);
 	s->stat[MCS_FILE_MAPPED] += val * PAGE_SIZE;
-	val = mem_cgroup_read_stat(mem, MEM_CGROUP_STAT_PGPGIN_COUNT);
+	val = mem_cgroup_read_events(mem, MEM_CGROUP_EVENTS_PGPGIN);
 	s->stat[MCS_PGPGIN] += val;
-	val = mem_cgroup_read_stat(mem, MEM_CGROUP_STAT_PGPGOUT_COUNT);
+	val = mem_cgroup_read_events(mem, MEM_CGROUP_EVENTS_PGPGOUT);
 	s->stat[MCS_PGPGOUT] += val;
 	if (do_swap_account) {
 		val = mem_cgroup_read_stat(mem, MEM_CGROUP_STAT_SWAPOUT);




* [patch 4/4] memcg: use native word page statistics counters
  2010-11-06 17:19   ` Greg Thelen
                       ` (4 preceding siblings ...)
  2010-11-07 22:14     ` [patch 3/4] memcg: break out event counters from other stats Johannes Weiner
@ 2010-11-07 22:14     ` Johannes Weiner
  2010-11-08  0:01       ` Minchan Kim
                         ` (3 more replies)
  5 siblings, 4 replies; 31+ messages in thread
From: Johannes Weiner @ 2010-11-07 22:14 UTC (permalink / raw)
  To: Greg Thelen
  Cc: Minchan Kim, Andrew Morton, Dave Young, Andrea Righi,
	KAMEZAWA Hiroyuki, Daisuke Nishimura, Balbir Singh, Wu Fengguang,
	linux-mm, linux-kernel

[-- Attachment #1: memcg-use-native-word-page-statistics-counters.patch --]
[-- Type: text/plain, Size: 5024 bytes --]

The statistic counters are in units of pages, there is no reason to
make them 64-bit wide on 32-bit machines.

Make them native words.  Since they are signed, this leaves 31 bit on
32-bit machines, which can represent roughly 8TB assuming a page size
of 4k.
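
(Spelled out: 2^31 pages * 4KiB/page = 8TiB.)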

Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
---
 include/linux/memcontrol.h |    2 +-
 mm/memcontrol.c            |   43 +++++++++++++++++++++----------------------
 mm/page-writeback.c        |    4 ++--
 3 files changed, 24 insertions(+), 25 deletions(-)

--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -110,7 +110,7 @@ enum {
 };
 
 struct mem_cgroup_stat_cpu {
-	s64 count[MEM_CGROUP_STAT_NSTATS];
+	long count[MEM_CGROUP_STAT_NSTATS];
 	unsigned long events[MEM_CGROUP_EVENTS_NSTATS];
 };
 
@@ -583,11 +583,11 @@ mem_cgroup_largest_soft_limit_node(struc
  * common workload, threashold and synchonization as vmstat[] should be
  * implemented.
  */
-static s64 mem_cgroup_read_stat(struct mem_cgroup *mem,
-		enum mem_cgroup_stat_index idx)
+static long mem_cgroup_read_stat(struct mem_cgroup *mem,
+				 enum mem_cgroup_stat_index idx)
 {
+	long val = 0;
 	int cpu;
-	s64 val = 0;
 
 	for_each_online_cpu(cpu)
 		val += per_cpu(mem->stat->count[idx], cpu);
@@ -599,9 +599,9 @@ static s64 mem_cgroup_read_stat(struct m
 	return val;
 }
 
-static s64 mem_cgroup_local_usage(struct mem_cgroup *mem)
+static long mem_cgroup_local_usage(struct mem_cgroup *mem)
 {
-	s64 ret;
+	long ret;
 
 	ret = mem_cgroup_read_stat(mem, MEM_CGROUP_STAT_RSS);
 	ret += mem_cgroup_read_stat(mem, MEM_CGROUP_STAT_CACHE);
@@ -1244,7 +1244,7 @@ bool mem_cgroup_dirty_info(unsigned long
 	struct vm_dirty_param dirty_param;
 	unsigned long available_mem;
 	struct mem_cgroup *memcg;
-	s64 value;
+	long value;
 
 	if (mem_cgroup_disabled())
 		return false;
@@ -1301,10 +1301,10 @@ static inline bool mem_cgroup_can_swap(s
 		(res_counter_read_u64(&memcg->memsw, RES_LIMIT) > 0);
 }
 
-static s64 mem_cgroup_local_page_stat(struct mem_cgroup *mem,
-				      enum mem_cgroup_nr_pages_item item)
+static long mem_cgroup_local_page_stat(struct mem_cgroup *mem,
+				       enum mem_cgroup_nr_pages_item item)
 {
-	s64 ret;
+	long ret;
 
 	switch (item) {
 	case MEMCG_NR_DIRTYABLE_PAGES:
@@ -1365,11 +1365,11 @@ memcg_hierarchical_free_pages(struct mem
  * Return the accounted statistic value or negative value if current task is
  * root cgroup.
  */
-s64 mem_cgroup_page_stat(enum mem_cgroup_nr_pages_item item)
+long mem_cgroup_page_stat(enum mem_cgroup_nr_pages_item item)
 {
-	struct mem_cgroup *mem;
 	struct mem_cgroup *iter;
-	s64 value;
+	struct mem_cgroup *mem;
+	long value;
 
 	get_online_cpus();
 	rcu_read_lock();
@@ -2069,7 +2069,7 @@ static void mem_cgroup_drain_pcp_counter
 
 	spin_lock(&mem->pcp_counter_lock);
 	for (i = 0; i < MEM_CGROUP_STAT_DATA; i++) {
-		s64 x = per_cpu(mem->stat->count[i], cpu);
+		long x = per_cpu(mem->stat->count[i], cpu);
 
 		per_cpu(mem->stat->count[i], cpu) = 0;
 		mem->nocpu_base.count[i] += x;
@@ -3660,13 +3660,13 @@ static int mem_cgroup_hierarchy_write(st
 }
 
 
-static u64 mem_cgroup_get_recursive_idx_stat(struct mem_cgroup *mem,
-				enum mem_cgroup_stat_index idx)
+static unsigned long mem_cgroup_recursive_stat(struct mem_cgroup *mem,
+					       enum mem_cgroup_stat_index idx)
 {
 	struct mem_cgroup *iter;
-	s64 val = 0;
+	long val = 0;
 
-	/* each per cpu's value can be minus.Then, use s64 */
+	/* Per-cpu values can be negative, use a signed accumulator */
 	for_each_mem_cgroup_tree(iter, mem)
 		val += mem_cgroup_read_stat(iter, idx);
 
@@ -3686,12 +3686,11 @@ static inline u64 mem_cgroup_usage(struc
 			return res_counter_read_u64(&mem->memsw, RES_USAGE);
 	}
 
-	val = mem_cgroup_get_recursive_idx_stat(mem, MEM_CGROUP_STAT_CACHE);
-	val += mem_cgroup_get_recursive_idx_stat(mem, MEM_CGROUP_STAT_RSS);
+	val = mem_cgroup_recursive_stat(mem, MEM_CGROUP_STAT_CACHE);
+	val += mem_cgroup_recursive_stat(mem, MEM_CGROUP_STAT_RSS);
 
 	if (swap)
-		val += mem_cgroup_get_recursive_idx_stat(mem,
-				MEM_CGROUP_STAT_SWAPOUT);
+		val += mem_cgroup_recursive_stat(mem, MEM_CGROUP_STAT_SWAPOUT);
 
 	return val << PAGE_SHIFT;
 }
--- a/include/linux/memcontrol.h
+++ b/include/linux/memcontrol.h
@@ -157,7 +157,7 @@ static inline void mem_cgroup_dec_page_s
 bool mem_cgroup_has_dirty_limit(void);
 bool mem_cgroup_dirty_info(unsigned long sys_available_mem,
 			   struct dirty_info *info);
-s64 mem_cgroup_page_stat(enum mem_cgroup_nr_pages_item item);
+long mem_cgroup_page_stat(enum mem_cgroup_nr_pages_item item);
 
 unsigned long mem_cgroup_soft_limit_reclaim(struct zone *zone, int order,
 						gfp_t gfp_mask);
--- a/mm/page-writeback.c
+++ b/mm/page-writeback.c
@@ -133,10 +133,10 @@ static struct prop_descriptor vm_dirties
 
 static unsigned long dirty_writeback_pages(void)
 {
-	s64 ret;
+	unsigned long ret;
 
 	ret = mem_cgroup_page_stat(MEMCG_NR_DIRTY_WRITEBACK_PAGES);
-	if (ret < 0)
+	if ((long)ret < 0)
 		ret = global_page_state(NR_UNSTABLE_NFS) +
 			global_page_state(NR_WRITEBACK);
 




* Re: [patch 1/4] memcg: use native word to represent dirtyable pages
  2010-11-07 22:14     ` [patch 1/4] memcg: use native word to represent dirtyable pages Johannes Weiner
@ 2010-11-07 22:56       ` Minchan Kim
  2010-11-08 22:25         ` Greg Thelen
  2010-11-16  3:37       ` KAMEZAWA Hiroyuki
  1 sibling, 1 reply; 31+ messages in thread
From: Minchan Kim @ 2010-11-07 22:56 UTC (permalink / raw)
  To: Johannes Weiner
  Cc: Greg Thelen, Andrew Morton, Dave Young, Andrea Righi,
	KAMEZAWA Hiroyuki, Daisuke Nishimura, Balbir Singh, Wu Fengguang,
	linux-mm, linux-kernel

On Mon, Nov 8, 2010 at 7:14 AM, Johannes Weiner <hannes@cmpxchg.org> wrote:
> The memory cgroup dirty info calculation currently uses a signed
> 64-bit type to represent the amount of dirtyable memory in pages.
>
> This can instead be changed to an unsigned word, which will allow the
> formula to function correctly with up to 160G of LRU pages on a 32-bit
> system, assuming 4k pages.  That should be plenty even when taking
> racy folding of the per-cpu counters into account.
>
> This fixes a compilation error on 32-bit systems as this code tries to
> do 64-bit division.
>
> Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
> Reported-by: Dave Young <hidave.darkstar@gmail.com>
Reviewed-by: Minchan Kim <minchan.kim@gmail.com>


-- 
Kind regards,
Minchan Kim


* Re: [patch 2/4] memcg: catch negative per-cpu sums in dirty info
  2010-11-07 22:14     ` [patch 2/4] memcg: catch negative per-cpu sums in dirty info Johannes Weiner
@ 2010-11-07 23:26       ` Minchan Kim
  2010-11-08 22:28         ` Greg Thelen
  2010-11-16  3:39       ` KAMEZAWA Hiroyuki
  1 sibling, 1 reply; 31+ messages in thread
From: Minchan Kim @ 2010-11-07 23:26 UTC (permalink / raw)
  To: Johannes Weiner
  Cc: Greg Thelen, Andrew Morton, Dave Young, Andrea Righi,
	KAMEZAWA Hiroyuki, Daisuke Nishimura, Balbir Singh, Wu Fengguang,
	linux-mm, linux-kernel

On Mon, Nov 8, 2010 at 7:14 AM, Johannes Weiner <hannes@cmpxchg.org> wrote:
> Folding the per-cpu counters can yield a negative value in case of
> accounting races between CPUs.
>
> When collecting the dirty info, the code would read those sums into an
> unsigned variable and then check for it being negative, which cannot
> work.
>
> Instead, fold the counters into a signed local variable, make the
> check, and only then assign it.
>
> This way, the function signals correctly when there are insane values
> instead of leaking them out to the caller.
>
> Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
Reviewed-by: Minchan Kim <minchan.kim@gmail.com>


-- 
Kind regards,
Minchan Kim


* Re: [patch 3/4] memcg: break out event counters from other stats
  2010-11-07 22:14     ` [patch 3/4] memcg: break out event counters from other stats Johannes Weiner
@ 2010-11-07 23:52       ` Minchan Kim
  2010-11-08 23:20         ` Greg Thelen
  2010-11-16  3:41       ` KAMEZAWA Hiroyuki
  1 sibling, 1 reply; 31+ messages in thread
From: Minchan Kim @ 2010-11-07 23:52 UTC (permalink / raw)
  To: Johannes Weiner
  Cc: Greg Thelen, Andrew Morton, Dave Young, Andrea Righi,
	KAMEZAWA Hiroyuki, Daisuke Nishimura, Balbir Singh, Wu Fengguang,
	linux-mm, linux-kernel

On Mon, Nov 8, 2010 at 7:14 AM, Johannes Weiner <hannes@cmpxchg.org> wrote:
> For increasing and decreasing per-cpu cgroup usage counters it makes
> sense to use signed types, as single per-cpu values might go negative
> during updates.  But this is not the case for only-ever-increasing
> event counters.
>
> All the counters have been signed 64-bit so far, which was enough to
> count events even with the sign bit wasted.
>
> The next patch narrows the usage counters type (on 32-bit CPUs, that
> is), though, so break out the event counters and make them unsigned
> words as they should have been from the start.
>
> Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
Reviewed-by: Minchan Kim <minchan.kim@gmail.com>

Fair enough.
We already use unsigned long in vmstat.

-- 
Kind regards,
Minchan Kim


* Re: [patch 4/4] memcg: use native word page statistics counters
  2010-11-07 22:14     ` [patch 4/4] memcg: use native word page statistics counters Johannes Weiner
@ 2010-11-08  0:01       ` Minchan Kim
  2010-11-08  9:08         ` Johannes Weiner
  2010-11-08 22:51         ` Greg Thelen
  2010-11-08  0:07       ` Minchan Kim
                         ` (2 subsequent siblings)
  3 siblings, 2 replies; 31+ messages in thread
From: Minchan Kim @ 2010-11-08  0:01 UTC (permalink / raw)
  To: Johannes Weiner
  Cc: Greg Thelen, Andrew Morton, Dave Young, Andrea Righi,
	KAMEZAWA Hiroyuki, Daisuke Nishimura, Balbir Singh, Wu Fengguang,
	linux-mm, linux-kernel

On Mon, Nov 8, 2010 at 7:14 AM, Johannes Weiner <hannes@cmpxchg.org> wrote:
> The statistic counters are in units of pages, there is no reason to
> make them 64-bit wide on 32-bit machines.
>
> Make them native words.  Since they are signed, this leaves 31 bit on
> 32-bit machines, which can represent roughly 8TB assuming a page size
> of 4k.
>
> Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
Reviewed-by: Minchan Kim <minchan.kim@gmail.com>

This patch also renames mem_cgroup_get_recursive_idx_stat() to
mem_cgroup_recursive_stat().
As you know, it would be better as a separate patch, although it's
trivial. But I don't mind.
I like the name. :)

Thanks, Hannes.





-- 
Kind regards,
Minchan Kim


* Re: [patch 4/4] memcg: use native word page statistics counters
  2010-11-07 22:14     ` [patch 4/4] memcg: use native word page statistics counters Johannes Weiner
  2010-11-08  0:01       ` Minchan Kim
@ 2010-11-08  0:07       ` Minchan Kim
  2010-11-08  9:37         ` memcg writeout throttling, was: " Johannes Weiner
  2010-11-08 23:27       ` Greg Thelen
  2010-11-16  3:44       ` KAMEZAWA Hiroyuki
  3 siblings, 1 reply; 31+ messages in thread
From: Minchan Kim @ 2010-11-08  0:07 UTC (permalink / raw)
  To: Johannes Weiner
  Cc: Greg Thelen, Andrew Morton, Dave Young, Andrea Righi,
	KAMEZAWA Hiroyuki, Daisuke Nishimura, Balbir Singh, Wu Fengguang,
	linux-mm, linux-kernel

On Mon, Nov 8, 2010 at 7:14 AM, Johannes Weiner <hannes@cmpxchg.org> wrote:
> The statistic counters are in units of pages, there is no reason to
> make them 64-bit wide on 32-bit machines.
>
> Make them native words.  Since they are signed, this leaves 31 bit on
> 32-bit machines, which can represent roughly 8TB assuming a page size
> of 4k.
>
> Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
> ---
>  include/linux/memcontrol.h |    2 +-
>  mm/memcontrol.c            |   43 +++++++++++++++++++++----------------------
>  mm/page-writeback.c        |    4 ++--
>  3 files changed, 24 insertions(+), 25 deletions(-)
>

<snip>

>
>  static unsigned long dirty_writeback_pages(void)
>  {
> -       s64 ret;
> +       unsigned long ret;
>
>        ret = mem_cgroup_page_stat(MEMCG_NR_DIRTY_WRITEBACK_PAGES);
> -       if (ret < 0)
> +       if ((long)ret < 0)
>                ret = global_page_state(NR_UNSTABLE_NFS) +
>                        global_page_state(NR_WRITEBACK);

BTW, let me ask a question.
dirty_writeback_pages() seems to depend on mem_cgroup_page_stat()'s
result (i.e., negative) to separate global and memcg state.
But mem_cgroup_page_stat() could return a negative value because of
the per-cpu counters as well as for the root cgroup.
If I understand right, isn't that a problem?

-- 
Kind regards,
Minchan Kim


* Re: [patch 4/4] memcg: use native word page statistics counters
  2010-11-08  0:01       ` Minchan Kim
@ 2010-11-08  9:08         ` Johannes Weiner
  2010-11-08 22:51         ` Greg Thelen
  1 sibling, 0 replies; 31+ messages in thread
From: Johannes Weiner @ 2010-11-08  9:08 UTC (permalink / raw)
  To: Minchan Kim
  Cc: Greg Thelen, Andrew Morton, Dave Young, Andrea Righi,
	KAMEZAWA Hiroyuki, Daisuke Nishimura, Balbir Singh, Wu Fengguang,
	linux-mm, linux-kernel

On Mon, Nov 08, 2010 at 09:01:54AM +0900, Minchan Kim wrote:
> On Mon, Nov 8, 2010 at 7:14 AM, Johannes Weiner <hannes@cmpxchg.org> wrote:
> > The statistic counters are in units of pages, there is no reason to
> > make them 64-bit wide on 32-bit machines.
> >
> > Make them native words.  Since they are signed, this leaves 31 bit on
> > 32-bit machines, which can represent roughly 8TB assuming a page size
> > of 4k.
> >
> > Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
> Reviewed-by: Minchan Kim <minchan.kim@gmail.com>
> 
> This patch also renames mem_cgroup_get_recursive_idx_stat() to
> mem_cgroup_recursive_stat().
> As you know, it would be better as a separate patch, although it's
> trivial. But I don't mind.
> I like the name. :)

I also feel that it's not too nice to mix such cleanups with
functionality changes.  But I found the name too appalling to leave it
alone, and a separate patch not worth the trouble.

If somebody has strong feelings, I will happily split it up.

Thanks for your review, Minchan.

	Hannes


* Re: memcg writeout throttling, was: [patch 4/4] memcg: use native word page statistics counters
  2010-11-08  0:07       ` Minchan Kim
@ 2010-11-08  9:37         ` Johannes Weiner
  2010-11-08 15:45           ` Wu Fengguang
  0 siblings, 1 reply; 31+ messages in thread
From: Johannes Weiner @ 2010-11-08  9:37 UTC (permalink / raw)
  To: Minchan Kim
  Cc: Greg Thelen, Andrew Morton, Dave Young, Andrea Righi,
	KAMEZAWA Hiroyuki, Daisuke Nishimura, Balbir Singh, Wu Fengguang,
	linux-mm, linux-kernel

On Mon, Nov 08, 2010 at 09:07:35AM +0900, Minchan Kim wrote:
> BTW, let me ask a question.
> dirty_writeback_pages() seems to depend on mem_cgroup_page_stat()'s
> result (i.e., negative) to separate global and memcg state.
> But mem_cgroup_page_stat() could return a negative value because of
> the per-cpu counters as well as for the root cgroup.
> If I understand right, isn't that a problem?

Yes, the numbers are not reliable and may be off by some.  It appears
to me that the only sensible interpretation of a negative sum is to
assume zero, though.  So to be honest, I don't understand the fallback
to global state when the local state fluctuates around low values.

This function is also only used in throttle_vm_writeout(), where the
outcome is compared to the global dirty threshold.  So using the
number of writeback pages _from the current cgroup_ and falling back
to global writeback pages when this number is low makes no sense to me
at all.

It looks like it should rather compare the cgroup state with the cgroup
limit, and the global state with the global limit.
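
Something along these lines, schematically (a sketch only, built on the
dirty_info interface from this series; the helper name and the
global_thresh parameter are made up):

	static bool over_writeout_limit(unsigned long sys_available_mem,
					unsigned long global_thresh)
	{
		struct dirty_info info;

		/* cgroup state against the cgroup limit */
		if (mem_cgroup_has_dirty_limit() &&
		    mem_cgroup_dirty_info(sys_available_mem, &info))
			return info.nr_writeback > info.dirty_thresh;

		/* global state against the global limit */
		return global_page_state(NR_UNSTABLE_NFS) +
			global_page_state(NR_WRITEBACK) > global_thresh;
	}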

Can somebody explain the reasoning behind this?  And in case it makes
sense after all, put a comment into this function?

Thanks!


* Re: memcg writeout throttling, was: [patch 4/4] memcg: use native word page statistics counters
  2010-11-08  9:37         ` memcg writeout throttling, was: " Johannes Weiner
@ 2010-11-08 15:45           ` Wu Fengguang
  2010-11-08 19:00             ` Greg Thelen
  0 siblings, 1 reply; 31+ messages in thread
From: Wu Fengguang @ 2010-11-08 15:45 UTC (permalink / raw)
  To: Johannes Weiner
  Cc: Minchan Kim, Greg Thelen, Andrew Morton, Dave Young,
	Andrea Righi, KAMEZAWA Hiroyuki, Daisuke Nishimura, Balbir Singh,
	linux-mm, linux-kernel

On Mon, Nov 08, 2010 at 05:37:16PM +0800, Johannes Weiner wrote:
> On Mon, Nov 08, 2010 at 09:07:35AM +0900, Minchan Kim wrote:
> > BTW, let me ask a question.
> > dirty_writeback_pages() seems to depend on mem_cgroup_page_stat()'s
> > result (i.e., negative) to separate global and memcg state.
> > But mem_cgroup_page_stat() could return a negative value because of
> > the per-cpu counters as well as for the root cgroup.
> > If I understand right, isn't that a problem?
> 
> Yes, the numbers are not reliable and may be off by some.  It appears
> to me that the only sensible interpretation of a negative sum is to
> assume zero, though.  So to be honest, I don't understand the fallback
> to global state when the local state fluctuates around low values.

Agreed. It does not make sense to compare values from different domains.

The bdi stats use percpu_counter_sum_positive(), which never returns
negative values. It may be suitable for memcg page counts, too.
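
That is, clamp negative sums to zero at the source; a minimal sketch
(the _positive helper is hypothetical, mem_cgroup_page_stat() is from
this series):

	static unsigned long
	mem_cgroup_page_stat_positive(enum mem_cgroup_nr_pages_item item)
	{
		long value = mem_cgroup_page_stat(item);

		/* treat racy negative sums as zero, like the bdi stats do */
		return value < 0 ? 0 : value;
	}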

> This function is also only used in throttle_vm_writeout(), where the
> outcome is compared to the global dirty threshold.  So using the
> number of writeback pages _from the current cgroup_ and falling back
> to global writeback pages when this number is low makes no sense to me
> at all.
> 
> It looks like it should rather compare the cgroup state with the cgroup
> limit, and the global state with the global limit.

Right.

> Can somebody explain the reasoning behind this?  And in case it makes
> sense after all, put a comment into this function?

It seems a better match to test sc->mem_cgroup rather than
mem_cgroup_from_task(current). The latter can produce mismatches. When
someone is changing the memcg limits and hence triggers memcg
reclaims, the current task is actually the (unrelated) shell. It's
also possible for the memcg task to trigger _global_ direct reclaim.

Thanks,
Fengguang


* Re: memcg writeout throttling, was: [patch 4/4] memcg: use native word page statistics counters
  2010-11-08 15:45           ` Wu Fengguang
@ 2010-11-08 19:00             ` Greg Thelen
  0 siblings, 0 replies; 31+ messages in thread
From: Greg Thelen @ 2010-11-08 19:00 UTC (permalink / raw)
  To: Wu Fengguang
  Cc: Johannes Weiner, Minchan Kim, Andrew Morton, Dave Young,
	Andrea Righi, KAMEZAWA Hiroyuki, Daisuke Nishimura, Balbir Singh,
	linux-mm, linux-kernel

On Mon, Nov 8, 2010 at 7:45 AM, Wu Fengguang <fengguang.wu@intel.com> wrote:
> On Mon, Nov 08, 2010 at 05:37:16PM +0800, Johannes Weiner wrote:
>> On Mon, Nov 08, 2010 at 09:07:35AM +0900, Minchan Kim wrote:
>> > BTW, let me ask a question.
>> > dirty_writeback_pages() seems to depend on mem_cgroup_page_stat()'s
>> > result (i.e., negative) to separate global and memcg state.
>> > But mem_cgroup_page_stat() could return a negative value because of
>> > the per-cpu counters as well as for the root cgroup.
>> > If I understand right, isn't that a problem?
>>
>> Yes, the numbers are not reliable and may be off by some.  It appears
>> to me that the only sensible interpretation of a negative sum is to
>> assume zero, though.  So to be honest, I don't understand the fallback
>> to global state when the local state fluctuates around low values.
>
> Agreed. It does not make sense to compare values from different domains.
>
> The bdi stats use percpu_counter_sum_positive(), which never returns
> negative values. It may be suitable for memcg page counts, too.
>
>> This function is also only used in throttle_vm_writeout(), where the
>> outcome is compared to the global dirty threshold.  So using the
>> number of writeback pages _from the current cgroup_ and falling back
>> to global writeback pages when this number is low makes no sense to me
>> at all.
>>
>> It looks like it should rather compare the cgroup state with the cgroup
>> limit, and the global state with the global limit.
>
> Right.
>
>> Can somebody explain the reasoning behind this?  And in case it makes
>> sense after all, put a comment into this function?
>
> It seems a better match to test sc->mem_cgroup rather than
> mem_cgroup_from_task(current). The latter can produce mismatches. When
> someone is changing the memcg limits and hence triggers memcg
> reclaims, the current task is actually the (unrelated) shell. It's
> also possible for the memcg task to trigger _global_ direct reclaim.

Good point.  I am writing a patch that will pass mem_cgroup from
sc->mem_cgroup into mem_cgroup_page_stat() rather than using
mem_cgroup_from_task(current).  I will post this patch in a few hours.

I will also fix the negative value issue in mem_cgroup_page_stat().
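
Roughly this shape, for reference (a sketch of the interface change;
the final patch may differ):

	/* take the memcg from the reclaim context instead of current */
	long mem_cgroup_page_stat(struct mem_cgroup *mem,
				  enum mem_cgroup_nr_pages_item item);

	/* caller in the writeout throttling path */
	value = mem_cgroup_page_stat(sc->mem_cgroup,
				     MEMCG_NR_DIRTY_WRITEBACK_PAGES);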


* Re: [patch 1/4] memcg: use native word to represent dirtyable pages
  2010-11-07 22:56       ` Minchan Kim
@ 2010-11-08 22:25         ` Greg Thelen
  2010-11-08 22:38           ` Johannes Weiner
  0 siblings, 1 reply; 31+ messages in thread
From: Greg Thelen @ 2010-11-08 22:25 UTC (permalink / raw)
  To: Minchan Kim
  Cc: Johannes Weiner, Andrew Morton, Dave Young, Andrea Righi,
	KAMEZAWA Hiroyuki, Daisuke Nishimura, Balbir Singh, Wu Fengguang,
	linux-mm, linux-kernel

Minchan Kim <minchan.kim@gmail.com> writes:

> On Mon, Nov 8, 2010 at 7:14 AM, Johannes Weiner <hannes@cmpxchg.org> wrote:
>> The memory cgroup dirty info calculation currently uses a signed
>> 64-bit type to represent the amount of dirtyable memory in pages.
>>
>> This can instead be changed to an unsigned word, which will allow the
>> formula to function correctly with up to 160G of LRU pages on a 32-bit
Is it really 160G of LRU pages?  On a 32-bit machine we use a 32-bit
unsigned page number.  With a 4KiB page size, I think that maps 16TiB
(1<<(32+12)) bytes.  Or is there some other limit?
>> system, assuming 4k pages.  That should be plenty even when taking
>> racy folding of the per-cpu counters into account.
>>
>> This fixes a compilation error on 32-bit systems as this code tries to
>> do 64-bit division.
>>
>> Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
>> Reported-by: Dave Young <hidave.darkstar@gmail.com>
> Reviewed-by: Minchan Kim <minchan.kim@gmail.com>
Reviewed-by: Greg Thelen <gthelen@google.com>


* Re: [patch 2/4] memcg: catch negative per-cpu sums in dirty info
  2010-11-07 23:26       ` Minchan Kim
@ 2010-11-08 22:28         ` Greg Thelen
  0 siblings, 0 replies; 31+ messages in thread
From: Greg Thelen @ 2010-11-08 22:28 UTC (permalink / raw)
  To: Minchan Kim
  Cc: Johannes Weiner, Andrew Morton, Dave Young, Andrea Righi,
	KAMEZAWA Hiroyuki, Daisuke Nishimura, Balbir Singh, Wu Fengguang,
	linux-mm, linux-kernel

Minchan Kim <minchan.kim@gmail.com> writes:

> On Mon, Nov 8, 2010 at 7:14 AM, Johannes Weiner <hannes@cmpxchg.org> wrote:
>> Folding the per-cpu counters can yield a negative value in case of
>> accounting races between CPUs.
>>
>> When collecting the dirty info, the code would read those sums into an
>> unsigned variable and then check for it being negative, which cannot
>> work.
>>
>> Instead, fold the counters into a signed local variable, make the
>> check, and only then assign it.
>>
>> This way, the function signals correctly when there are insane values
>> instead of leaking them out to the caller.
>>
>> Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
> Reviewed-by: Minchan Kim <minchan.kim@gmail.com>
Reviewed-by: Greg Thelen <gthelen@google.com>


* Re: [patch 1/4] memcg: use native word to represent dirtyable pages
  2010-11-08 22:25         ` Greg Thelen
@ 2010-11-08 22:38           ` Johannes Weiner
  2010-11-08 22:43             ` Greg Thelen
  0 siblings, 1 reply; 31+ messages in thread
From: Johannes Weiner @ 2010-11-08 22:38 UTC (permalink / raw)
  To: Greg Thelen
  Cc: Minchan Kim, Andrew Morton, Dave Young, Andrea Righi,
	KAMEZAWA Hiroyuki, Daisuke Nishimura, Balbir Singh, Wu Fengguang,
	linux-mm, linux-kernel

On Mon, Nov 08, 2010 at 02:25:15PM -0800, Greg Thelen wrote:
> Minchan Kim <minchan.kim@gmail.com> writes:
> 
> > On Mon, Nov 8, 2010 at 7:14 AM, Johannes Weiner <hannes@cmpxchg.org> wrote:
> >> The memory cgroup dirty info calculation currently uses a signed
> >> 64-bit type to represent the amount of dirtyable memory in pages.
> >>
> >> This can instead be changed to an unsigned word, which will allow the
> >> formula to function correctly with up to 160G of LRU pages on a 32-bit
> Is it really 160G of LRU pages?  On a 32-bit machine we use a 32-bit
> unsigned page number.  With a 4KiB page size, I think that maps 16TiB
> (1<<(32+12)) bytes.  Or is there some other limit?

Yes, the dirty limit we calculate from it :)

We have to be able to multiply this number by up to 100 (maximum dirty
ratio value) without overflowing.
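
(Spelled out: ULONG_MAX / 100 is roughly 42.9 million pages on 32-bit,
and 42.9M pages * 4KiB/page is roughly 160G.)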


* Re: [patch 1/4] memcg: use native word to represent dirtyable pages
  2010-11-08 22:38           ` Johannes Weiner
@ 2010-11-08 22:43             ` Greg Thelen
  0 siblings, 0 replies; 31+ messages in thread
From: Greg Thelen @ 2010-11-08 22:43 UTC (permalink / raw)
  To: Johannes Weiner
  Cc: Minchan Kim, Andrew Morton, Dave Young, Andrea Righi,
	KAMEZAWA Hiroyuki, Daisuke Nishimura, Balbir Singh, Wu Fengguang,
	linux-mm, linux-kernel

On Mon, Nov 8, 2010 at 2:38 PM, Johannes Weiner <hannes@cmpxchg.org> wrote:
> On Mon, Nov 08, 2010 at 02:25:15PM -0800, Greg Thelen wrote:
>> Minchan Kim <minchan.kim@gmail.com> writes:
>>
>> > On Mon, Nov 8, 2010 at 7:14 AM, Johannes Weiner <hannes@cmpxchg.org> wrote:
>> >> The memory cgroup dirty info calculation currently uses a signed
>> >> 64-bit type to represent the amount of dirtyable memory in pages.
>> >>
>> >> This can instead be changed to an unsigned word, which will allow the
>> >> formula to function correctly with up to 160G of LRU pages on a 32-bit
>> Is it really 160G of LRU pages?  On a 32-bit machine we use a 32-bit
>> unsigned page number.  With a 4KiB page size, I think that maps 16TiB
>> (1<<(32+12)) bytes.  Or is there some other limit?
>
> Yes, the dirty limit we calculate from it :)
>
> We have to be able to multiply this number by up to 100 (maximum dirty
> ratio value) without overflowing.

Duh :)   thanks.


* Re: [patch 4/4] memcg: use native word page statistics counters
  2010-11-08  0:01       ` Minchan Kim
  2010-11-08  9:08         ` Johannes Weiner
@ 2010-11-08 22:51         ` Greg Thelen
  1 sibling, 0 replies; 31+ messages in thread
From: Greg Thelen @ 2010-11-08 22:51 UTC (permalink / raw)
  To: Minchan Kim
  Cc: Johannes Weiner, Andrew Morton, Dave Young, Andrea Righi,
	KAMEZAWA Hiroyuki, Daisuke Nishimura, Balbir Singh, Wu Fengguang,
	linux-mm, linux-kernel

Minchan Kim <minchan.kim@gmail.com> writes:

> On Mon, Nov 8, 2010 at 7:14 AM, Johannes Weiner <hannes@cmpxchg.org> wrote:
>> The statistic counters are in units of pages, there is no reason to
>> make them 64-bit wide on 32-bit machines.
>>
>> Make them native words.  Since they are signed, this leaves 31 bit on
>> 32-bit machines, which can represent roughly 8TB assuming a page size
>> of 4k.
>>
>> Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
> Reviewed-by: Minchan Kim <minchan.kim@gmail.com>
Reviewed-by: Greg Thelen <gthelen@google.com>


* Re: [patch 3/4] memcg: break out event counters from other stats
  2010-11-07 23:52       ` Minchan Kim
@ 2010-11-08 23:20         ` Greg Thelen
  0 siblings, 0 replies; 31+ messages in thread
From: Greg Thelen @ 2010-11-08 23:20 UTC (permalink / raw)
  To: Minchan Kim
  Cc: Johannes Weiner, Andrew Morton, Dave Young, Andrea Righi,
	KAMEZAWA Hiroyuki, Daisuke Nishimura, Balbir Singh, Wu Fengguang,
	linux-mm, linux-kernel

Minchan Kim <minchan.kim@gmail.com> writes:

> On Mon, Nov 8, 2010 at 7:14 AM, Johannes Weiner <hannes@cmpxchg.org> wrote:
>> For increasing and decreasing per-cpu cgroup usage counters it makes
>> sense to use signed types, as single per-cpu values might go negative
>> during updates.  But this is not the case for only-ever-increasing
>> event counters.
>>
>> All the counters have been signed 64-bit so far, which was enough to
>> count events even with the sign bit wasted.
>>
>> The next patch narrows the usage counters type (on 32-bit CPUs, that
>> is), though, so break out the event counters and make them unsigned
>> words as they should have been from the start.
>>
>> Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
> Reviewed-by: Minchan Kim <minchan.kim@gmail.com>
Reviewed-by: Greg Thelen <gthelen@google.com>


* Re: [patch 4/4] memcg: use native word page statistics counters
  2010-11-07 22:14     ` [patch 4/4] memcg: use native word page statistics counters Johannes Weiner
  2010-11-08  0:01       ` Minchan Kim
  2010-11-08  0:07       ` Minchan Kim
@ 2010-11-08 23:27       ` Greg Thelen
  2010-11-08 23:45         ` Johannes Weiner
  2010-11-16  3:44       ` KAMEZAWA Hiroyuki
  3 siblings, 1 reply; 31+ messages in thread
From: Greg Thelen @ 2010-11-08 23:27 UTC (permalink / raw)
  To: Johannes Weiner
  Cc: Minchan Kim, Andrew Morton, Dave Young, Andrea Righi,
	KAMEZAWA Hiroyuki, Daisuke Nishimura, Balbir Singh, Wu Fengguang,
	linux-mm, linux-kernel

Johannes Weiner <hannes@cmpxchg.org> writes:

> The statistic counters are in units of pages, there is no reason to
> make them 64-bit wide on 32-bit machines.
>
> Make them native words.  Since they are signed, this leaves 31 bit on
> 32-bit machines, which can represent roughly 8TB assuming a page size
> of 4k.
>
> Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
> ---
>  include/linux/memcontrol.h |    2 +-
>  mm/memcontrol.c            |   43 +++++++++++++++++++++----------------------
>  mm/page-writeback.c        |    4 ++--
>  3 files changed, 24 insertions(+), 25 deletions(-)
>
> --- a/mm/memcontrol.c
> +++ b/mm/memcontrol.c
> @@ -110,7 +110,7 @@ enum {
>  };
>  
>  struct mem_cgroup_stat_cpu {
> -	s64 count[MEM_CGROUP_STAT_NSTATS];
> +	long count[MEM_CGROUP_STAT_NSTATS];
>  	unsigned long events[MEM_CGROUP_EVENTS_NSTATS];
>  };
>  
> @@ -583,11 +583,11 @@ mem_cgroup_largest_soft_limit_node(struc
>   * common workload, threashold and synchonization as vmstat[] should be
>   * implemented.
>   */
> -static s64 mem_cgroup_read_stat(struct mem_cgroup *mem,
> -		enum mem_cgroup_stat_index idx)
> +static long mem_cgroup_read_stat(struct mem_cgroup *mem,
> +				 enum mem_cgroup_stat_index idx)
>  {
> +	long val = 0;
>  	int cpu;
> -	s64 val = 0;
>  
>  	for_each_online_cpu(cpu)
>  		val += per_cpu(mem->stat->count[idx], cpu);
> @@ -599,9 +599,9 @@ static s64 mem_cgroup_read_stat(struct m
>  	return val;
>  }
>  
> -static s64 mem_cgroup_local_usage(struct mem_cgroup *mem)
> +static long mem_cgroup_local_usage(struct mem_cgroup *mem)
>  {
> -	s64 ret;
> +	long ret;
>  
>  	ret = mem_cgroup_read_stat(mem, MEM_CGROUP_STAT_RSS);
>  	ret += mem_cgroup_read_stat(mem, MEM_CGROUP_STAT_CACHE);
> @@ -1244,7 +1244,7 @@ bool mem_cgroup_dirty_info(unsigned long
>  	struct vm_dirty_param dirty_param;
>  	unsigned long available_mem;
>  	struct mem_cgroup *memcg;
> -	s64 value;
> +	long value;
>  
>  	if (mem_cgroup_disabled())
>  		return false;
> @@ -1301,10 +1301,10 @@ static inline bool mem_cgroup_can_swap(s
>  		(res_counter_read_u64(&memcg->memsw, RES_LIMIT) > 0);
>  }
>  
> -static s64 mem_cgroup_local_page_stat(struct mem_cgroup *mem,
> -				      enum mem_cgroup_nr_pages_item item)
> +static long mem_cgroup_local_page_stat(struct mem_cgroup *mem,
> +				       enum mem_cgroup_nr_pages_item item)
>  {
> -	s64 ret;
> +	long ret;
>  
>  	switch (item) {
>  	case MEMCG_NR_DIRTYABLE_PAGES:
> @@ -1365,11 +1365,11 @@ memcg_hierarchical_free_pages(struct mem
>   * Return the accounted statistic value or negative value if current task is
>   * root cgroup.
>   */
> -s64 mem_cgroup_page_stat(enum mem_cgroup_nr_pages_item item)
> +long mem_cgroup_page_stat(enum mem_cgroup_nr_pages_item item)
>  {
> -	struct mem_cgroup *mem;
>  	struct mem_cgroup *iter;
> -	s64 value;
> +	struct mem_cgroup *mem;
> +	long value;
>  
>  	get_online_cpus();
>  	rcu_read_lock();
> @@ -2069,7 +2069,7 @@ static void mem_cgroup_drain_pcp_counter
>  
>  	spin_lock(&mem->pcp_counter_lock);
>  	for (i = 0; i < MEM_CGROUP_STAT_DATA; i++) {
> -		s64 x = per_cpu(mem->stat->count[i], cpu);
> +		long x = per_cpu(mem->stat->count[i], cpu);
>  
>  		per_cpu(mem->stat->count[i], cpu) = 0;
>  		mem->nocpu_base.count[i] += x;
> @@ -3660,13 +3660,13 @@ static int mem_cgroup_hierarchy_write(st
>  }
>  
>  
> -static u64 mem_cgroup_get_recursive_idx_stat(struct mem_cgroup *mem,
> -				enum mem_cgroup_stat_index idx)
> +static unsigned long mem_cgroup_recursive_stat(struct mem_cgroup *mem,
> +					       enum mem_cgroup_stat_index idx)
>  {
>  	struct mem_cgroup *iter;
> -	s64 val = 0;
> +	long val = 0;
>  
> -	/* each per cpu's value can be minus.Then, use s64 */
> +	/* Per-cpu values can be negative, use a signed accumulator */
>  	for_each_mem_cgroup_tree(iter, mem)
>  		val += mem_cgroup_read_stat(iter, idx);
>  
> @@ -3686,12 +3686,11 @@ static inline u64 mem_cgroup_usage(struc
>  			return res_counter_read_u64(&mem->memsw, RES_USAGE);
>  	}
>  
> -	val = mem_cgroup_get_recursive_idx_stat(mem, MEM_CGROUP_STAT_CACHE);
> -	val += mem_cgroup_get_recursive_idx_stat(mem, MEM_CGROUP_STAT_RSS);
> +	val = mem_cgroup_recursive_stat(mem, MEM_CGROUP_STAT_CACHE);
> +	val += mem_cgroup_recursive_stat(mem, MEM_CGROUP_STAT_RSS);
>  
>  	if (swap)
> -		val += mem_cgroup_get_recursive_idx_stat(mem,
> -				MEM_CGROUP_STAT_SWAPOUT);
> +		val += mem_cgroup_recursive_stat(mem, MEM_CGROUP_STAT_SWAPOUT);
>  
>  	return val << PAGE_SHIFT;
>  }
> --- a/include/linux/memcontrol.h
> +++ b/include/linux/memcontrol.h
> @@ -157,7 +157,7 @@ static inline void mem_cgroup_dec_page_s
>  bool mem_cgroup_has_dirty_limit(void);
>  bool mem_cgroup_dirty_info(unsigned long sys_available_mem,
>  			   struct dirty_info *info);
> -s64 mem_cgroup_page_stat(enum mem_cgroup_nr_pages_item item);
> +long mem_cgroup_page_stat(enum mem_cgroup_nr_pages_item item);

Ooops.  I missed something in my review.

mem_cgroup_page_stat() appears twice in memcontrol.h.  The return value
should match regardless of whether CONFIG_CGROUP_MEM_RES_CTLR is set.

I suggest integrating the following into your patch ([patch 4/4] memcg:
use native word page statistics counters):

---
 include/linux/memcontrol.h |    2 +-
 1 files changed, 1 insertions(+), 1 deletions(-)

diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
index 4e046d6..7a3d915 100644
--- a/include/linux/memcontrol.h
+++ b/include/linux/memcontrol.h
@@ -351,7 +351,7 @@ static inline bool mem_cgroup_dirty_info(unsigned long sys_available_mem,
        return false;
 }
 
-static inline s64 mem_cgroup_page_stat(enum mem_cgroup_nr_pages_item item)
+static inline long mem_cgroup_page_stat(enum mem_cgroup_nr_pages_item item)
 {
        return -ENOSYS;
 }

>  unsigned long mem_cgroup_soft_limit_reclaim(struct zone *zone, int order,
>  						gfp_t gfp_mask);
> --- a/mm/page-writeback.c
> +++ b/mm/page-writeback.c
> @@ -133,10 +133,10 @@ static struct prop_descriptor vm_dirties
>  
>  static unsigned long dirty_writeback_pages(void)
>  {
> -	s64 ret;
> +	unsigned long ret;
>  
>  	ret = mem_cgroup_page_stat(MEMCG_NR_DIRTY_WRITEBACK_PAGES);
> -	if (ret < 0)
> +	if ((long)ret < 0)
>  		ret = global_page_state(NR_UNSTABLE_NFS) +
>  			global_page_state(NR_WRITEBACK);
>  
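As a self-contained illustration of the error-signalling idiom in the hunk
above, here is a userspace sketch (the function names are hypothetical and
this is not the actual kernel code): the stat helper returns a signed long
where a negative errno means "no per-cgroup accounting, fall back to the
global counters", and the caller keeps the result in an unsigned long,
casting back to long only for the sign check, just as (long)ret < 0 does.

#include <errno.h>
#include <stdio.h>

static int cgroup_stats_enabled;	/* stand-in for CONFIG_CGROUP_MEM_RES_CTLR */

static long memcg_writeback_pages(void)	/* hypothetical stat helper */
{
	if (!cgroup_stats_enabled)
		return -ENOSYS;		/* like the !CONFIG static inline stub */
	return 1234;			/* pretend per-cgroup page count */
}

static unsigned long dirty_writeback_pages(void)
{
	unsigned long ret = memcg_writeback_pages();

	if ((long)ret < 0)		/* negative errno smuggled through unsigned */
		ret = 4096 + 512;	/* pretend global NFS + writeback counters */

	return ret;
}

int main(void)
{
	printf("stub:    %lu pages\n", dirty_writeback_pages());
	cgroup_stats_enabled = 1;
	printf("enabled: %lu pages\n", dirty_writeback_pages());
	return 0;
}

The round trip works because storing a negative long into an unsigned long
and casting back preserves the value on two's-complement machines, which is
what this idiom relies on.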

^ permalink raw reply related	[flat|nested] 31+ messages in thread

* Re: [patch 4/4] memcg: use native word page statistics counters
  2010-11-08 23:27       ` Greg Thelen
@ 2010-11-08 23:45         ` Johannes Weiner
  0 siblings, 0 replies; 31+ messages in thread
From: Johannes Weiner @ 2010-11-08 23:45 UTC (permalink / raw)
  To: Greg Thelen
  Cc: Minchan Kim, Andrew Morton, Dave Young, Andrea Righi,
	KAMEZAWA Hiroyuki, Daisuke Nishimura, Balbir Singh, Wu Fengguang,
	linux-mm, linux-kernel

On Mon, Nov 08, 2010 at 03:27:26PM -0800, Greg Thelen wrote:
> Johannes Weiner <hannes@cmpxchg.org> writes:
> > --- a/include/linux/memcontrol.h
> > +++ b/include/linux/memcontrol.h
> > @@ -157,7 +157,7 @@ static inline void mem_cgroup_dec_page_s
> >  bool mem_cgroup_has_dirty_limit(void);
> >  bool mem_cgroup_dirty_info(unsigned long sys_available_mem,
> >  			   struct dirty_info *info);
> > -s64 mem_cgroup_page_stat(enum mem_cgroup_nr_pages_item item);
> > +long mem_cgroup_page_stat(enum mem_cgroup_nr_pages_item item);
> 
> Oops.  I missed something in my review.
> 
> mem_cgroup_page_stat() appears twice in memcontrol.h.  The return value
> should match regardless of whether CONFIG_CGROUP_MEM_RES_CTLR is set.
> 
> I suggest integrating the following into your patch ([patch 4/4] memcg:
> use native word page statistics counters):

Right you are!  Thanks.

	Hannes

> ---
>  include/linux/memcontrol.h |    2 +-
>  1 files changed, 1 insertions(+), 1 deletions(-)
> 
> diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
> index 4e046d6..7a3d915 100644
> --- a/include/linux/memcontrol.h
> +++ b/include/linux/memcontrol.h
> @@ -351,7 +351,7 @@ static inline bool mem_cgroup_dirty_info(unsigned long sys_available_mem,
>         return false;
>  }
>  
> -static inline s64 mem_cgroup_page_stat(enum mem_cgroup_nr_pages_item item)
> +static inline long mem_cgroup_page_stat(enum mem_cgroup_nr_pages_item item)
>  {
>         return -ENOSYS;
>  }

^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [patch 1/4] memcg: use native word to represent dirtyable pages
  2010-11-07 22:14     ` [patch 1/4] memcg: use native word to represent dirtyable pages Johannes Weiner
  2010-11-07 22:56       ` Minchan Kim
@ 2010-11-16  3:37       ` KAMEZAWA Hiroyuki
  1 sibling, 0 replies; 31+ messages in thread
From: KAMEZAWA Hiroyuki @ 2010-11-16  3:37 UTC (permalink / raw)
  To: Johannes Weiner
  Cc: Greg Thelen, Minchan Kim, Andrew Morton, Dave Young,
	Andrea Righi, Daisuke Nishimura, Balbir Singh, Wu Fengguang,
	linux-mm, linux-kernel

On Sun,  7 Nov 2010 23:14:36 +0100
Johannes Weiner <hannes@cmpxchg.org> wrote:

> The memory cgroup dirty info calculation currently uses a signed
> 64-bit type to represent the amount of dirtyable memory in pages.
> 
> This can instead be changed to an unsigned word, which will allow the
> formula to function correctly with up to 160G of LRU pages on a 32-bit
> system, assuming 4k pages.  That should be plenty even when taking
> racy folding of the per-cpu counters into account.
> 
> This fixes a compilation error on 32-bit systems as this code tries to
> do 64-bit division.
> 
> Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
> Reported-by: Dave Young <hidave.darkstar@gmail.com>


Thank you.

Acked-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>

I couldn't read email because I was on vacation.
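For anyone sanity-checking the ~160G figure in the changelog above, a small
userspace sketch of the headroom arithmetic (it assumes 4k pages and a
32-bit unsigned long; it is not kernel code): the dirty threshold computes
ratio * pages with a ratio of up to 100, so the page count must stay below
2^32 / 100 for the multiplication not to wrap.

#include <stdint.h>
#include <stdio.h>

int main(void)
{
	/* largest page count that survives "ratio * pages" with ratio <= 100 */
	uint32_t max_pages = UINT32_MAX / 100;
	uint64_t bytes = (uint64_t)max_pages * 4096;	/* 4k pages */

	/* prints: 42949672 pages = 163 GiB, i.e. the ~160G of the changelog */
	printf("%lu pages = %llu GiB\n",
	       (unsigned long)max_pages,
	       (unsigned long long)(bytes >> 30));
	return 0;
}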


^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [patch 2/4] memcg: catch negative per-cpu sums in dirty info
  2010-11-07 22:14     ` [patch 2/4] memcg: catch negative per-cpu sums in dirty info Johannes Weiner
  2010-11-07 23:26       ` Minchan Kim
@ 2010-11-16  3:39       ` KAMEZAWA Hiroyuki
  1 sibling, 0 replies; 31+ messages in thread
From: KAMEZAWA Hiroyuki @ 2010-11-16  3:39 UTC (permalink / raw)
  To: Johannes Weiner
  Cc: Greg Thelen, Minchan Kim, Andrew Morton, Dave Young,
	Andrea Righi, Daisuke Nishimura, Balbir Singh, Wu Fengguang,
	linux-mm, linux-kernel

On Sun,  7 Nov 2010 23:14:37 +0100
Johannes Weiner <hannes@cmpxchg.org> wrote:

> Folding the per-cpu counters can yield a negative value in case of
> accounting races between CPUs.
> 
> When collecting the dirty info, the code would read those sums into an
> unsigned variable and then check whether it was negative, which cannot
> work.
> 
> Instead, fold the counters into a signed local variable, make the
> check, and only then assign it.
> 
> This way, the function signals correctly when there are insane values
> instead of leaking them out to the caller.
> 
> Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>

Acked-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
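To make the described pattern concrete, a minimal userspace sketch with
made-up per-cpu counters (not the kernel's code): fold the racy deltas into
a signed local, check it, and only then hand the value to the unsigned
destination.

#include <stdbool.h>
#include <stdio.h>

#define NR_CPUS 4

/* racy cross-CPU accounting can leave individual entries negative */
static long percpu_dirty[NR_CPUS] = { 10, 25, -40, 1 };

static bool fold_dirty_pages(unsigned long *out)
{
	long sum = 0;	/* signed accumulator, as in the patch */
	int cpu;

	for (cpu = 0; cpu < NR_CPUS; cpu++)
		sum += percpu_dirty[cpu];

	if (sum < 0)	/* insane transient value: signal it, don't leak it */
		return false;

	*out = sum;
	return true;
}

int main(void)
{
	unsigned long pages;

	if (fold_dirty_pages(&pages))
		printf("dirty: %lu pages\n", pages);
	else
		printf("negative sum caught, caller can fall back\n");
	return 0;
}

Here the sample deltas sum to -4, so the check fires instead of handing a
huge wrapped unsigned value to the caller.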


^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [patch 3/4] memcg: break out event counters from other stats
  2010-11-07 22:14     ` [patch 3/4] memcg: break out event counters from other stats Johannes Weiner
  2010-11-07 23:52       ` Minchan Kim
@ 2010-11-16  3:41       ` KAMEZAWA Hiroyuki
  1 sibling, 0 replies; 31+ messages in thread
From: KAMEZAWA Hiroyuki @ 2010-11-16  3:41 UTC (permalink / raw)
  To: Johannes Weiner
  Cc: Greg Thelen, Minchan Kim, Andrew Morton, Dave Young,
	Andrea Righi, Daisuke Nishimura, Balbir Singh, Wu Fengguang,
	linux-mm, linux-kernel

On Sun,  7 Nov 2010 23:14:38 +0100
Johannes Weiner <hannes@cmpxchg.org> wrote:

> For increasing and decreasing per-cpu cgroup usage counters it makes
> sense to use signed types, as single per-cpu values might go negative
> during updates.  But this is not the case for only-ever-increasing
> event counters.
> 
> All the counters have been signed 64-bit so far, which was enough to
> count events even with the sign bit wasted.
> 
> The next patch narrows the usage counters' type (on 32-bit CPUs, that
> is), though, so break out the event counters and make them unsigned
> words, as they should have been from the start.
> 
> Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>

Acked-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
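Schematically, the split the changelog describes looks like the sketch
below; the struct and field names are invented for illustration and are not
the kernel's.

#include <stdio.h>

struct percpu_stats {
	/* usage counters: inc/dec can race across CPUs, so one CPU's
	 * slot may transiently go negative; they must stay signed */
	long nr_dirty;
	long nr_writeback;
	/* event counters: only ever incremented, so an unsigned native
	 * word is safe and gains the sign bit as extra range */
	unsigned long pgpgin;
	unsigned long pgpgout;
};

int main(void)
{
	struct percpu_stats cpu0 = { 0 };

	cpu0.nr_dirty--;	/* uncharged here, charged on another CPU */
	cpu0.pgpgin++;		/* events only ever count up */

	printf("nr_dirty=%ld pgpgin=%lu\n", cpu0.nr_dirty, cpu0.pgpgin);
	return 0;
}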


^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [patch 4/4] memcg: use native word page statistics counters
  2010-11-07 22:14     ` [patch 4/4] memcg: use native word page statistics counters Johannes Weiner
                         ` (2 preceding siblings ...)
  2010-11-08 23:27       ` Greg Thelen
@ 2010-11-16  3:44       ` KAMEZAWA Hiroyuki
  3 siblings, 0 replies; 31+ messages in thread
From: KAMEZAWA Hiroyuki @ 2010-11-16  3:44 UTC (permalink / raw)
  To: Johannes Weiner
  Cc: Greg Thelen, Minchan Kim, Andrew Morton, Dave Young,
	Andrea Righi, Daisuke Nishimura, Balbir Singh, Wu Fengguang,
	linux-mm, linux-kernel

On Sun,  7 Nov 2010 23:14:39 +0100
Johannes Weiner <hannes@cmpxchg.org> wrote:

> The statistic counters are in units of pages; there is no reason to
> make them 64-bit wide on 32-bit machines.
> 
> Make them native words.  Since they are signed, this leaves 31 bits on
> 32-bit machines, which can represent roughly 8TB assuming a page size
> of 4k.
> 
> Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>

Acked-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
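A quick arithmetic check of the "roughly 8TB" figure, as a userspace sketch
(assuming 4k pages; not kernel code): a signed 32-bit word keeps 31 bits
for the page count, and 2^31 pages of 2^12 bytes each is 2^43 bytes.

#include <stdint.h>
#include <stdio.h>

int main(void)
{
	uint64_t max_pages = 1ULL << 31;	/* 31 value bits of a signed 32-bit word */
	uint64_t bytes = max_pages << 12;	/* 4k = 2^12 bytes per page */

	/* prints: 2147483648 pages = 8 TiB */
	printf("%llu pages = %llu TiB\n",
	       (unsigned long long)max_pages,
	       (unsigned long long)(bytes >> 40));
	return 0;
}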


^ permalink raw reply	[flat|nested] 31+ messages in thread

end of thread

Thread overview: 31+ messages
2010-11-05 16:08 [PATCH] memcg: use do_div to divide s64 in 32 bit machine Minchan Kim
2010-11-05 16:34 ` Greg Thelen
2010-11-06  1:03 ` hannes
2010-11-06 17:19   ` Greg Thelen
2010-11-06 17:31     ` Minchan Kim
2010-11-07 22:14     ` [patch 0/4] memcg: variable type fixes Johannes Weiner
2010-11-07 22:14     ` [patch 1/4] memcg: use native word to represent dirtyable pages Johannes Weiner
2010-11-07 22:56       ` Minchan Kim
2010-11-08 22:25         ` Greg Thelen
2010-11-08 22:38           ` Johannes Weiner
2010-11-08 22:43             ` Greg Thelen
2010-11-16  3:37       ` KAMEZAWA Hiroyuki
2010-11-07 22:14     ` [patch 2/4] memcg: catch negative per-cpu sums in dirty info Johannes Weiner
2010-11-07 23:26       ` Minchan Kim
2010-11-08 22:28         ` Greg Thelen
2010-11-16  3:39       ` KAMEZAWA Hiroyuki
2010-11-07 22:14     ` [patch 3/4] memcg: break out event counters from other stats Johannes Weiner
2010-11-07 23:52       ` Minchan Kim
2010-11-08 23:20         ` Greg Thelen
2010-11-16  3:41       ` KAMEZAWA Hiroyuki
2010-11-07 22:14     ` [patch 4/4] memcg: use native word page statistics counters Johannes Weiner
2010-11-08  0:01       ` Minchan Kim
2010-11-08  9:08         ` Johannes Weiner
2010-11-08 22:51         ` Greg Thelen
2010-11-08  0:07       ` Minchan Kim
2010-11-08  9:37         ` memcg writeout throttling, was: " Johannes Weiner
2010-11-08 15:45           ` Wu Fengguang
2010-11-08 19:00             ` Greg Thelen
2010-11-08 23:27       ` Greg Thelen
2010-11-08 23:45         ` Johannes Weiner
2010-11-16  3:44       ` KAMEZAWA Hiroyuki
