linux-kernel.vger.kernel.org archive mirror
* [RFC] Shared page accounting for memory cgroup
@ 2009-12-29 18:27 Balbir Singh
  2010-01-03 23:51 ` KAMEZAWA Hiroyuki
  0 siblings, 1 reply; 31+ messages in thread
From: Balbir Singh @ 2009-12-29 18:27 UTC (permalink / raw)
  To: linux-mm; +Cc: Andrew Morton, linux-kernel, KAMEZAWA Hiroyuki, nishimura

Hi, Everyone,

I've been working on heuristics for shared page accounting for the
memory cgroup. I've tested the patches by creating multiple cgroups
and running programs that share memory and observed the output.

Comments?


Add shared accounting to memcg

From: Balbir Singh <balbir@linux.vnet.ibm.com>

Currently there is no accurate way of estimating how many pages are
shared in a memory cgroup. Accurate accounting of shared memory
would require us to either

1. Follow every page's rmap and track the number of users, or
2. Iterate through the pages and use _mapcount.

We take an intermediate approach (suggested by Kamezawa): we sum up
the file and anon rss of the mm's belonging to the cgroup and then
subtract the memcg's own anon rss and file mapped counts. This should
give us a good estimate of the pages being shared.
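
As an illustration, with made-up numbers: if two processes in the
cgroup each map the same 10MB file region plus 5MB of private anon
memory, the per-mm sums give 2 * 15MB = 30MB, while the memcg
counters (which count each page once) read 10MB of anon rss plus
10MB of file mapped, i.e. 20MB. The difference, 10MB, estimates the
shared portion:

  shared ~= sum over mm's (anon_rss + file_rss)
            - (memcg anon rss + memcg file mapped)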

The shared statistic is called memory.shared_usage_in_bytes and
does not support hierarchical information, just the information
for the current cgroup.

Signed-off-by: Balbir Singh <balbir@linux.vnet.ibm.com>
---

 Documentation/cgroups/memory.txt |    6 +++++
 mm/memcontrol.c                  |   43 ++++++++++++++++++++++++++++++++++++++
 2 files changed, 49 insertions(+), 0 deletions(-)


diff --git a/Documentation/cgroups/memory.txt b/Documentation/cgroups/memory.txt
index b871f25..c2c70c9 100644
--- a/Documentation/cgroups/memory.txt
+++ b/Documentation/cgroups/memory.txt
@@ -341,6 +341,12 @@ Note:
   - a cgroup which uses hierarchy and it has child cgroup.
   - a cgroup which uses hierarchy and not the root of hierarchy.
 
+5.4 shared_usage_in_bytes
+  This file shows an estimate of shared memory, in bytes. The value
+  is an approximation based on the anon and file rss counts of all
+  the mm's belonging to the cgroup: the rss and file mapped counts
+  maintained within the memory cgroup statistics (see section 5.2)
+  are subtracted from that sum.
 
 6. Hierarchy support
 
diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index 488b644..8e296be 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -3052,6 +3052,45 @@ static int mem_cgroup_swappiness_write(struct cgroup *cgrp, struct cftype *cft,
 	return 0;
 }
 
+static u64 mem_cgroup_shared_read(struct cgroup *cgrp, struct cftype *cft)
+{
+	struct mem_cgroup *memcg = mem_cgroup_from_cont(cgrp);
+	struct cgroup_iter it;
+	struct task_struct *tsk;
+	u64 total_rss = 0, shared;
+	struct mm_struct *mm;
+	s64 val;
+
+	cgroup_iter_start(cgrp, &it);
+	val = mem_cgroup_read_stat(&memcg->stat, MEM_CGROUP_STAT_RSS);
+	val += mem_cgroup_read_stat(&memcg->stat, MEM_CGROUP_STAT_FILE_MAPPED);
+	while ((tsk = cgroup_iter_next(cgrp, &it))) {
+		if (!thread_group_leader(tsk))
+			continue;
+		mm = tsk->mm;
+		/*
+		 * We can't use get_task_mm(), since its counterpart mmput()
+		 * can sleep. We know that mm can't become invalid, since
+		 * we hold the css_set_lock (see cgroup_iter_start()).
+		 */
+		if (tsk->flags & PF_KTHREAD || !mm)
+			continue;
+		total_rss += get_mm_counter(mm, file_rss) +
+				get_mm_counter(mm, anon_rss);
+	}
+	cgroup_iter_end(cgrp, &it);
+
+	/*
+	 * We need to tolerate negative values, since total_rss and val
+	 * are sampled at different times. The shared value converges to
+	 * the correct value quickly, at a rate that depends on how the
+	 * workload in the memory cgroup changes its memory usage.
+	 */
+	shared = total_rss - val;
+	shared = max_t(s64, 0, shared);
+	shared <<= PAGE_SHIFT;
+	return shared;
+}
 
 static struct cftype mem_cgroup_files[] = {
 	{
@@ -3101,6 +3140,10 @@ static struct cftype mem_cgroup_files[] = {
 		.read_u64 = mem_cgroup_swappiness_read,
 		.write_u64 = mem_cgroup_swappiness_write,
 	},
+	{
+		.name = "shared_usage_in_bytes",
+		.read_u64 = mem_cgroup_shared_read,
+	},
 };
 
 #ifdef CONFIG_CGROUP_MEM_RES_CTLR_SWAP

-- 
	Balbir


* Re: [RFC] Shared page accounting for memory cgroup
  2009-12-29 18:27 [RFC] Shared page accounting for memory cgroup Balbir Singh
@ 2010-01-03 23:51 ` KAMEZAWA Hiroyuki
  2010-01-04  0:07   ` Balbir Singh
  0 siblings, 1 reply; 31+ messages in thread
From: KAMEZAWA Hiroyuki @ 2010-01-03 23:51 UTC (permalink / raw)
  To: balbir; +Cc: linux-mm, Andrew Morton, linux-kernel, nishimura

On Tue, 29 Dec 2009 23:57:43 +0530
Balbir Singh <balbir@linux.vnet.ibm.com> wrote:

> Hi, Everyone,
> 
> I've been working on heuristics for shared page accounting for the
> memory cgroup. I've tested the patches by creating multiple cgroups
> and running programs that share memory and observed the output.
> 
> Comments?

Hmm? Why we have to do this in the kernel ?

Thanks,
-Kame

> 
> 
> Add shared accounting to memcg
> 
> From: Balbir Singh <balbir@linux.vnet.ibm.com>
> 
> Currently there is no accurate way of estimating how many pages are
> shared in a memory cgroup. Accurate accounting of shared memory
> would require us to either
>
> 1. Follow every page's rmap and track the number of users, or
> 2. Iterate through the pages and use _mapcount.
>
> We take an intermediate approach (suggested by Kamezawa): we sum up
> the file and anon rss of the mm's belonging to the cgroup and then
> subtract the memcg's own anon rss and file mapped counts. This should
> give us a good estimate of the pages being shared.
> 
> The shared statistic is called memory.shared_usage_in_bytes and
> does not support hierarchical information, just the information
> for the current cgroup.
> 
> Signed-off-by: Balbir Singh <balbir@linux.vnet.ibm.com>
> ---
> 
>  Documentation/cgroups/memory.txt |    6 +++++
>  mm/memcontrol.c                  |   43 ++++++++++++++++++++++++++++++++++++++
>  2 files changed, 49 insertions(+), 0 deletions(-)
> 
> 
> diff --git a/Documentation/cgroups/memory.txt b/Documentation/cgroups/memory.txt
> index b871f25..c2c70c9 100644
> --- a/Documentation/cgroups/memory.txt
> +++ b/Documentation/cgroups/memory.txt
> @@ -341,6 +341,12 @@ Note:
>    - a cgroup which uses hierarchy and it has child cgroup.
>    - a cgroup which uses hierarchy and not the root of hierarchy.
>  
> +5.4 shared_usage_in_bytes
> +  This file shows an estimate of shared memory, in bytes. The value
> +  is an approximation based on the anon and file rss counts of all
> +  the mm's belonging to the cgroup: the rss and file mapped counts
> +  maintained within the memory cgroup statistics (see section 5.2)
> +  are subtracted from that sum.
>  
>  6. Hierarchy support
>  
> diff --git a/mm/memcontrol.c b/mm/memcontrol.c
> index 488b644..8e296be 100644
> --- a/mm/memcontrol.c
> +++ b/mm/memcontrol.c
> @@ -3052,6 +3052,45 @@ static int mem_cgroup_swappiness_write(struct cgroup *cgrp, struct cftype *cft,
>  	return 0;
>  }
>  
> +static u64 mem_cgroup_shared_read(struct cgroup *cgrp, struct cftype *cft)
> +{
> +	struct mem_cgroup *memcg = mem_cgroup_from_cont(cgrp);
> +	struct cgroup_iter it;
> +	struct task_struct *tsk;
> +	u64 total_rss = 0, shared;
> +	struct mm_struct *mm;
> +	s64 val;
> +
> +	cgroup_iter_start(cgrp, &it);
> +	val = mem_cgroup_read_stat(&memcg->stat, MEM_CGROUP_STAT_RSS);
> +	val += mem_cgroup_read_stat(&memcg->stat, MEM_CGROUP_STAT_FILE_MAPPED);
> +	while ((tsk = cgroup_iter_next(cgrp, &it))) {
> +		if (!thread_group_leader(tsk))
> +			continue;
> +		mm = tsk->mm;
> +		/*
> +		 * We can't use get_task_mm(), since its counterpart mmput()
> +		 * can sleep. We know that mm can't become invalid, since
> +		 * we hold the css_set_lock (see cgroup_iter_start()).
> +		 */
> +		if (tsk->flags & PF_KTHREAD || !mm)
> +			continue;
> +		total_rss += get_mm_counter(mm, file_rss) +
> +				get_mm_counter(mm, anon_rss);
> +	}
> +	cgroup_iter_end(cgrp, &it);
> +
> +	/*
> +	 * We need to tolerate negative values, since total_rss and val
> +	 * are sampled at different times. The shared value converges to
> +	 * the correct value quickly, at a rate that depends on how the
> +	 * workload in the memory cgroup changes its memory usage.
> +	 */
> +	shared = total_rss - val;
> +	shared = max_t(s64, 0, shared);
> +	shared <<= PAGE_SHIFT;
> +	return shared;
> +}
>  
>  static struct cftype mem_cgroup_files[] = {
>  	{
> @@ -3101,6 +3140,10 @@ static struct cftype mem_cgroup_files[] = {
>  		.read_u64 = mem_cgroup_swappiness_read,
>  		.write_u64 = mem_cgroup_swappiness_write,
>  	},
> +	{
> +		.name = "shared_usage_in_bytes",
> +		.read_u64 = mem_cgroup_shared_read,
> +	},
>  };
>  
>  #ifdef CONFIG_CGROUP_MEM_RES_CTLR_SWAP
> 
> -- 
> 	Balbir
> 



* Re: [RFC] Shared page accounting for memory cgroup
  2010-01-03 23:51 ` KAMEZAWA Hiroyuki
@ 2010-01-04  0:07   ` Balbir Singh
  2010-01-04  0:35     ` KAMEZAWA Hiroyuki
  0 siblings, 1 reply; 31+ messages in thread
From: Balbir Singh @ 2010-01-04  0:07 UTC (permalink / raw)
  To: KAMEZAWA Hiroyuki; +Cc: linux-mm, Andrew Morton, linux-kernel, nishimura

* KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> [2010-01-04 08:51:08]:

> On Tue, 29 Dec 2009 23:57:43 +0530
> Balbir Singh <balbir@linux.vnet.ibm.com> wrote:
> 
> > Hi, Everyone,
> > 
> > I've been working on heuristics for shared page accounting for the
> > memory cgroup. I've tested the patches by creating multiple cgroups
> > and running programs that share memory and observed the output.
> > 
> > Comments?
> 
> Hmm? Why we have to do this in the kernel ?
>

For several reasons that I can think of

1. With task migration changes coming in, getting consistent data free of races
is going to be hard.
2. The cost of doing it in the kernel is not high, it does not impact
the memcg runtime, it is a request-response sort of cost.
3. The cost in user space is going to be high and the implementation
cumbersome to get right.
 
-- 
	Balbir


* Re: [RFC] Shared page accounting for memory cgroup
  2010-01-04  0:07   ` Balbir Singh
@ 2010-01-04  0:35     ` KAMEZAWA Hiroyuki
  2010-01-04  0:50       ` Balbir Singh
  0 siblings, 1 reply; 31+ messages in thread
From: KAMEZAWA Hiroyuki @ 2010-01-04  0:35 UTC (permalink / raw)
  To: balbir; +Cc: linux-mm, Andrew Morton, linux-kernel, nishimura

On Mon, 4 Jan 2010 05:37:52 +0530
Balbir Singh <balbir@linux.vnet.ibm.com> wrote:

> * KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> [2010-01-04 08:51:08]:
> 
> > On Tue, 29 Dec 2009 23:57:43 +0530
> > Balbir Singh <balbir@linux.vnet.ibm.com> wrote:
> > 
> > > Hi, Everyone,
> > > 
> > > I've been working on heuristics for shared page accounting for the
> > > memory cgroup. I've tested the patches by creating multiple cgroups
> > > and running programs that share memory and observed the output.
> > > 
> > > Comments?
> > 
> > Hmm? Why we have to do this in the kernel ?
> >
> 
> For several reasons that I can think of
> 
> 1. With task migration changes coming in, getting consistent data free of races
> is going to be hard.

Hmm, look at the real world's "ps" or "top" commands. Even when there is no guarantee
on the error range of the data, it's still useful.

> 2. The cost of doing it in the kernel is not high, it does not impact
> the memcg runtime, it is a request-response sort of cost.
>
> 3. The cost in user space is going to be high and the implementation
> cumbersome to get right.
>  
I don't like moving a cost in the userland to the kernel. Considering 
real-time kernel or full-preemptive kernel, this very long read_lock() in the
kernel is not good, IMHO. (I think css_set_lock should be mutex/rw-sem...)
cgroup_iter_xxx can block cgroup_post_fork() and this may cause critical
system delay of milli-seconds.

BTW, if you really want to calculate something atomically, I think the following
interface may be welcomed for freezing.

  cgroup.lock
  # echo 1 > /...../cgroup.lock 
    All task move, mkdir, rmdir to this cgroup will be blocked by mutex.
    (But fork/exit will not be blocked.)

  # echo 0 > /...../cgroup.lock
    Unlock.

  # cat /...../cgroup.lock
    show lock status and lock history (for debug).

Maybe good for some kinds of middleware.
But this may be difficult if we have to consider hierarchy.

Thanks,
-Kame





* Re: [RFC] Shared page accounting for memory cgroup
  2010-01-04  0:35     ` KAMEZAWA Hiroyuki
@ 2010-01-04  0:50       ` Balbir Singh
  2010-01-06  4:02         ` KAMEZAWA Hiroyuki
  0 siblings, 1 reply; 31+ messages in thread
From: Balbir Singh @ 2010-01-04  0:50 UTC (permalink / raw)
  To: KAMEZAWA Hiroyuki; +Cc: linux-mm, Andrew Morton, linux-kernel, nishimura

* KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> [2010-01-04 09:35:28]:

> On Mon, 4 Jan 2010 05:37:52 +0530
> Balbir Singh <balbir@linux.vnet.ibm.com> wrote:
> 
> > * KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> [2010-01-04 08:51:08]:
> > 
> > > On Tue, 29 Dec 2009 23:57:43 +0530
> > > Balbir Singh <balbir@linux.vnet.ibm.com> wrote:
> > > 
> > > > Hi, Everyone,
> > > > 
> > > > I've been working on heuristics for shared page accounting for the
> > > > memory cgroup. I've tested the patches by creating multiple cgroups
> > > > and running programs that share memory and observed the output.
> > > > 
> > > > Comments?
> > > 
> > > Hmm? Why we have to do this in the kernel ?
> > >
> > 
> > For several reasons that I can think of
> > 
> > 1. With task migration changes coming in, getting consistent data free of races
> > is going to be hard.
> 
> Hmm, look at the real world's "ps" or "top" commands. Even when there is no guarantee
> on the error range of the data, it's still useful.

Yes, my concern is this

1. I iterate through tasks and calculate RSS
2. I look at memory.usage_in_bytes

If the time in user space between 1 and 2 is large I get very wrong
results, specifically if the workload is changing its memory usage
drastically.. no?

> 
> > 2. The cost of doing it in the kernel is not high, it does not impact
> > the memcg runtime, it is a request-response sort of cost.
> >
> > 3. The cost in user space is going to be high and the implementation
> > cumbersome to get right.
> >  
> I don't like moving a cost in the userland to the kernel.

Me neither, but I don't think it is a fixed overhead.

> Considering
> real-time kernel or full-preemptive kernel, this very long read_lock() in the
> kernel is not good, IMHO. (I think css_set_lock should be mutex/rw-sem...)

I agree, we should discuss converting the lock to a mutex or a
semaphore, but there might be a good reason for keeping it as a
spin_lock.

> cgroup_iter_xxx can block cgroup_post_fork() and this may cause critical
> system delay of milli-seconds.
> 

Agreed, but that can happen anyway: even while attaching a task, or while
reading the cgroup tasks file (list of tasks).

> BTW, if you really want to calculate something atomically, I think the following
> interface may be welcomed for freezing.
> 
>   cgroup.lock
>   # echo 1 > /...../cgroup.lock 
>     All task move, mkdir, rmdir to this cgroup will be blocked by mutex.
>     (But fork/exit will not be blocked.)
> 
>   # echo 0 > /...../cgroup.lock
>     Unlock.
> 
>   # cat /...../cgroup.lock
>     show lock status and lock history (for debug).
> 
> Maybe good for some kinds of middleware.
> But this may be difficult if we have to consider hierarchy.
>

I don't like the idea of providing an interface that can control
kernel locks from user space; user space can tangle things up and get it
wrong.

-- 
	Balbir


* Re: [RFC] Shared page accounting for memory cgroup
  2010-01-04  0:50       ` Balbir Singh
@ 2010-01-06  4:02         ` KAMEZAWA Hiroyuki
  2010-01-06  7:01           ` Balbir Singh
  0 siblings, 1 reply; 31+ messages in thread
From: KAMEZAWA Hiroyuki @ 2010-01-06  4:02 UTC (permalink / raw)
  To: balbir; +Cc: linux-mm, Andrew Morton, linux-kernel, nishimura

On Mon, 4 Jan 2010 06:20:31 +0530
Balbir Singh <balbir@linux.vnet.ibm.com> wrote:

> * KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> [2010-01-04 09:35:28]:
> 
> > On Mon, 4 Jan 2010 05:37:52 +0530
> > Balbir Singh <balbir@linux.vnet.ibm.com> wrote:
> > 
> > > * KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> [2010-01-04 08:51:08]:
> > > 
> > > > On Tue, 29 Dec 2009 23:57:43 +0530
> > > > Balbir Singh <balbir@linux.vnet.ibm.com> wrote:
> > > > 
> > > > > Hi, Everyone,
> > > > > 
> > > > > I've been working on heuristics for shared page accounting for the
> > > > > memory cgroup. I've tested the patches by creating multiple cgroups
> > > > > and running programs that share memory and observed the output.
> > > > > 
> > > > > Comments?
> > > > 
> > > > Hmm? Why we have to do this in the kernel ?
> > > >
> > > 
> > > For several reasons that I can think of
> > > 
> > > 1. With task migration changes coming in, getting consistent data free of races
> > > is going to be hard.
> > 
> > Hmm, look at the real world's "ps" or "top" commands. Even when there is no guarantee
> > on the error range of the data, it's still useful.
> 
> Yes, my concern is this
> 
> 1. I iterate through tasks and calculate RSS
> 2. I look at memory.usage_in_bytes
> 
> If the time in user space between 1 and 2 is large I get very wrong
> results, specifically if the workload is changing its memory usage
> drastically.. no?
> 
No. If it takes a long time, locking fork()/exit() for that long is the bigger
issue.
I recommend you add a memacct subsystem that sums up the RSS of all processes
under a cgroup. Although it may add huge costs to the page fault path, the
implementation will be very simple and will not hurt realtime ops.
There will be no terrible race, I guess.
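
To make the idea concrete, here is a rough userspace model of such a
counter; all names are hypothetical, and in the kernel the charge and
uncharge calls would sit in the page fault and unmap paths:
==
#include <stdatomic.h>
#include <stdio.h>

/* Hypothetical per-cgroup accumulator. Because it is maintained
 * incrementally at fault/unmap time, reading the total is O(1):
 * no task-list walk and no long-held lock. */
struct memacct {
	_Atomic long rss_pages;	/* RSS summed over all member mms */
};

static void memacct_charge(struct memacct *ma, long npages)
{
	atomic_fetch_add_explicit(&ma->rss_pages, npages,
				  memory_order_relaxed);
}

static void memacct_uncharge(struct memacct *ma, long npages)
{
	atomic_fetch_sub_explicit(&ma->rss_pages, npages,
				  memory_order_relaxed);
}

int main(void)
{
	struct memacct ma = { .rss_pages = 0 };

	memacct_charge(&ma, 3);		/* three pages faulted in */
	memacct_uncharge(&ma, 1);	/* one page unmapped */
	printf("rss = %ld pages\n", atomic_load(&ma.rss_pages));
	return 0;
}
==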

Thanks,
-Kame



* Re: [RFC] Shared page accounting for memory cgroup
  2010-01-06  4:02         ` KAMEZAWA Hiroyuki
@ 2010-01-06  7:01           ` Balbir Singh
  2010-01-06  7:12             ` KAMEZAWA Hiroyuki
  0 siblings, 1 reply; 31+ messages in thread
From: Balbir Singh @ 2010-01-06  7:01 UTC (permalink / raw)
  To: KAMEZAWA Hiroyuki; +Cc: linux-mm, Andrew Morton, linux-kernel, nishimura

* KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> [2010-01-06 13:02:58]:

> On Mon, 4 Jan 2010 06:20:31 +0530
> Balbir Singh <balbir@linux.vnet.ibm.com> wrote:
> 
> > * KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> [2010-01-04 09:35:28]:
> > 
> > > On Mon, 4 Jan 2010 05:37:52 +0530
> > > Balbir Singh <balbir@linux.vnet.ibm.com> wrote:
> > > 
> > > > * KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> [2010-01-04 08:51:08]:
> > > > 
> > > > > On Tue, 29 Dec 2009 23:57:43 +0530
> > > > > Balbir Singh <balbir@linux.vnet.ibm.com> wrote:
> > > > > 
> > > > > > Hi, Everyone,
> > > > > > 
> > > > > > I've been working on heuristics for shared page accounting for the
> > > > > > memory cgroup. I've tested the patches by creating multiple cgroups
> > > > > > and running programs that share memory and observed the output.
> > > > > > 
> > > > > > Comments?
> > > > > 
> > > > > Hmm? Why we have to do this in the kernel ?
> > > > >
> > > > 
> > > > For several reasons that I can think of
> > > > 
> > > > 1. With task migration changes coming in, getting consistent data free of races
> > > > is going to be hard.
> > > 
> > > Hmm, look at the real world's "ps" or "top" commands. Even when there is no guarantee
> > > on the error range of the data, it's still useful.
> > 
> > Yes, my concern is this
> > 
> > 1. I iterate through tasks and calculate RSS
> > 2. I look at memory.usage_in_bytes
> > 
> > If the time in user space between 1 and 2 is large I get very wrong
> > results, specifically if the workload is changing its memory usage
> > drastically.. no?
> > 
> No. If it takes a long time, locking fork()/exit() for that long is the bigger
> issue.
> I recommend you add a memacct subsystem that sums up the RSS of all processes
> under a cgroup. Although it may add huge costs to the page fault path, the
> implementation will be very simple and will not hurt realtime ops.
> There will be no terrible race, I guess.
>

But others hold that lock as well, for simple things like listing tasks,
moving tasks, etc. I expect the usage of shared to be in the same
range.

-- 
	Balbir


* Re: [RFC] Shared page accounting for memory cgroup
  2010-01-06  7:01           ` Balbir Singh
@ 2010-01-06  7:12             ` KAMEZAWA Hiroyuki
  2010-01-07  7:15               ` Balbir Singh
  0 siblings, 1 reply; 31+ messages in thread
From: KAMEZAWA Hiroyuki @ 2010-01-06  7:12 UTC (permalink / raw)
  To: balbir; +Cc: linux-mm, Andrew Morton, linux-kernel, nishimura

On Wed, 6 Jan 2010 12:31:50 +0530
Balbir Singh <balbir@linux.vnet.ibm.com> wrote:
> > No. If it takes a long time, locking fork()/exit() for that long is the bigger
> > issue.
> > I recommend you add a memacct subsystem that sums up the RSS of all processes
> > under a cgroup. Although it may add huge costs to the page fault path, the
> > implementation will be very simple and will not hurt realtime ops.
> > There will be no terrible race, I guess.
> >
> 
> But others hold that lock as well, for simple things like listing tasks,
> moving tasks, etc. I expect the usage of shared to be in the same
> range.
> 

And that piles up costs? I think cgroup guys should pay more attention to
fork/exit costs. Now, it gets slower and slower.
On that point, I never liked the migrate-at-task-move work in cpuset and memcg.

My 1st objection to this patch is that this "shared" doesn't mean "shared
between cgroups" but rather "shared between processes".
I think it's of no use and no help to users.

And implementation is 2nd thing.


Thanks,
-Kame



* Re: [RFC] Shared page accounting for memory cgroup
  2010-01-06  7:12             ` KAMEZAWA Hiroyuki
@ 2010-01-07  7:15               ` Balbir Singh
  2010-01-07  7:36                 ` KAMEZAWA Hiroyuki
  0 siblings, 1 reply; 31+ messages in thread
From: Balbir Singh @ 2010-01-07  7:15 UTC (permalink / raw)
  To: KAMEZAWA Hiroyuki; +Cc: linux-mm, Andrew Morton, linux-kernel, nishimura

* KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> [2010-01-06 16:12:11]:

> On Wed, 6 Jan 2010 12:31:50 +0530
> Balbir Singh <balbir@linux.vnet.ibm.com> wrote:
> > > No. If it takes a long time, locking fork()/exit() for that long is the bigger
> > > issue.
> > > I recommend you add a memacct subsystem that sums up the RSS of all processes
> > > under a cgroup. Although it may add huge costs to the page fault path, the
> > > implementation will be very simple and will not hurt realtime ops.
> > > There will be no terrible race, I guess.
> > >
> > 
> > But others hold that lock as well, for simple things like listing tasks,
> > moving tasks, etc. I expect the usage of shared to be in the same
> > range.
> > 
> 
> And that piles up costs? I think cgroup guys should pay more attention to
> fork/exit costs. Now, it gets slower and slower.
> On that point, I never liked the migrate-at-task-move work in cpuset and memcg.
>
> My 1st objection to this patch is that this "shared" doesn't mean "shared
> between cgroups" but rather "shared between processes".
> I think it's of no use and no help to users.
>

So what in your opinion would help end users? My concern is that as
we make progress with memcg, we account only for privately used pages
with no hint/data about the real usage (shared within or with other
cgroups). How do we decide if one cgroup is really heavy?
 
> And implementation is 2nd thing.
> 

More details on your concern, please!

-- 
	Balbir


* Re: [RFC] Shared page accounting for memory cgroup
  2010-01-07  7:15               ` Balbir Singh
@ 2010-01-07  7:36                 ` KAMEZAWA Hiroyuki
  2010-01-07  8:34                   ` Balbir Singh
  0 siblings, 1 reply; 31+ messages in thread
From: KAMEZAWA Hiroyuki @ 2010-01-07  7:36 UTC (permalink / raw)
  To: balbir; +Cc: linux-mm, Andrew Morton, linux-kernel, nishimura

On Thu, 7 Jan 2010 12:45:54 +0530
Balbir Singh <balbir@linux.vnet.ibm.com> wrote:

> * KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> [2010-01-06 16:12:11]:
> > And that piles up costs? I think cgroup guys should pay more attention to
> > fork/exit costs. Now, it gets slower and slower.
> > On that point, I never liked the migrate-at-task-move work in cpuset and memcg.
> >
> > My 1st objection to this patch is that this "shared" doesn't mean "shared
> > between cgroups" but rather "shared between processes".
> > I think it's of no use and no help to users.
> >
> 
> So what in your opinion would help end users? My concern is that as
> we make progress with memcg, we account only for privately used pages
> with no hint/data about the real usage (shared within or with other
> cgroups). 

The real usage is already shown as

  [root@bluextal ref-mmotm]# cat /cgroups/memory.stat
  cache 7706181632 
  rss 120905728
  mapped_file 32239616

This is real. And "sum of rss - rss+mapped" doesn't show anything.

> How do we decide if one cgroup is really heavy?
>  

What "heavy" means ? "Hard to page out ?"

Historically, it's caught by pagein/pageout _speed_.
"How heavy memory system is ?" can only be measured by "speed".
If you add latency-stat for memcg, I'm glad to use it.

Anyway, "How memory reclaim can go successfully" is generic problem rather
than memcg. Maybe no good answers from VM guys....
I think you should add codes to global VM rather than cgroup.

"How pages are shared" doesn't show good hints. I don't hear such parameter
is used in production's resource monitoring software.


> > And implementation is 2nd thing.
> > 
> 
> More details on your concern, please!
> 
I already wrote....why do you want to make fork()/exit() slow for a thing
which is not necessary to be done in atomic ?

There are many hosts which have thousands of processes, and a cgroup may
contain thousands of processes on a production server.
In that situation, how much can "make kernel" slow down with the following?
==
while true; do cat /cgroup/memory.shared > /dev/null; done
==

In a word, the implementation problem is:
 - An operation against a container can cause a generic system slowdown.
That is why I don't like heavy task moves under cgroup.


Yes, this can happen in other places (we have to do some improvements).
But this is not good for a concept of isolation by container, anyway.

Thanks,
-Kame




* Re: [RFC] Shared page accounting for memory cgroup
  2010-01-07  7:36                 ` KAMEZAWA Hiroyuki
@ 2010-01-07  8:34                   ` Balbir Singh
  2010-01-07  8:48                     ` KAMEZAWA Hiroyuki
  0 siblings, 1 reply; 31+ messages in thread
From: Balbir Singh @ 2010-01-07  8:34 UTC (permalink / raw)
  To: KAMEZAWA Hiroyuki; +Cc: linux-mm, Andrew Morton, linux-kernel, nishimura

* KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> [2010-01-07 16:36:10]:

> On Thu, 7 Jan 2010 12:45:54 +0530
> Balbir Singh <balbir@linux.vnet.ibm.com> wrote:
> 
> > * KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> [2010-01-06 16:12:11]:
> > > And that piles up costs? I think cgroup guys should pay more attention to
> > > fork/exit costs. Now, it gets slower and slower.
> > > On that point, I never liked the migrate-at-task-move work in cpuset and memcg.
> > >
> > > My 1st objection to this patch is that this "shared" doesn't mean "shared
> > > between cgroups" but rather "shared between processes".
> > > I think it's of no use and no help to users.
> > >
> > 
> > So what in your opinion would help end users? My concern is that as
> > we make progress with memcg, we account only for privately used pages
> > with no hint/data about the real usage (shared within or with other
> > cgroups). 
> 
> The real usage is already shown as
> 
>   [root@bluextal ref-mmotm]# cat /cgroups/memory.stat
>   cache 7706181632 
>   rss 120905728
>   mapped_file 32239616
> 
> This is real. And "sum of rss - rss+mapped" doesn't show anything.
> 
> > How do we decide if one cgroup is really heavy?
> >  
> 
> What "heavy" means ? "Hard to page out ?"
>

Heavy can also indicate: should we OOM kill within this cgroup or kill the
entire cgroup? Should we add or remove resources from this cgroup?
 
> Historically, it's caught by pagein/pageout _speed_.
> "How heavy memory system is ?" can only be measured by "speed".

Not really... A cgroup might be very large with a large number of its
pages shared and frequently used. How do we detect if this cgroup
needs its resources or it's taking too many of them?

> If you add latency-stat for memcg, I'm glad to use it.
> 
> Anyway, "How memory reclaim can go successfully" is generic problem rather
> than memcg. Maybe no good answers from VM guys....
> I think you should add codes to global VM rather than cgroup.
> 

No.. this is not for reclaim

> "How pages are shared" doesn't show good hints. I don't hear such parameter
> is used in production's resource monitoring software.
> 

You mean "How many pages are shared" are not good hints, please see my
justification above. With Virtualization (look at KSM for example),
shared pages are going to be increasingly important part of the
accounting.

> 
> > > And implementation is 2nd thing.
> > > 
> > 
> > More details on your concern, please!
> > 
> I already wrote....why do you want to make fork()/exit() slow for a thing
> which is not necessary to be done in atomic ?
> 

So your concern is about iterating through the tasks in the cgroup; I can
think of an alternative low-cost implementation if possible.

> There are many hosts which have thousands of processes, and a cgroup may
> contain thousands of processes on a production server.
> In that situation, how much can "make kernel" slow down with the following?
> ==
> while true; do cat /cgroup/memory.shared > /dev/null; done
> ==

This is the worst case usage scenario that would be affected even if
memory.shared were replaced by tasks.

> 
> In a word, the implementation problem is:
>  - An operation against a container can cause a generic system slowdown.
> That is why I don't like heavy task moves under cgroup.
> 
> 
> Yes, this can happen in other places (we have to do some improvements).
> But this is not good for a concept of isolation by container, anyway.

Thanks for the review!

-- 
	Balbir


* Re: [RFC] Shared page accounting for memory cgroup
  2010-01-07  8:34                   ` Balbir Singh
@ 2010-01-07  8:48                     ` KAMEZAWA Hiroyuki
  2010-01-07  9:08                       ` KAMEZAWA Hiroyuki
  0 siblings, 1 reply; 31+ messages in thread
From: KAMEZAWA Hiroyuki @ 2010-01-07  8:48 UTC (permalink / raw)
  To: balbir; +Cc: linux-mm, Andrew Morton, linux-kernel, nishimura

On Thu, 7 Jan 2010 14:04:40 +0530
Balbir Singh <balbir@linux.vnet.ibm.com> wrote:

> * KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> [2010-01-07 16:36:10]:
> 
> > On Thu, 7 Jan 2010 12:45:54 +0530
> > Balbir Singh <balbir@linux.vnet.ibm.com> wrote:
> > 
> > > * KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> [2010-01-06 16:12:11]:
> > > > And that piles up costs? I think cgroup guys should pay more attention to
> > > > fork/exit costs. Now, it gets slower and slower.
> > > > On that point, I never liked the migrate-at-task-move work in cpuset and memcg.
> > > >
> > > > My 1st objection to this patch is that this "shared" doesn't mean "shared
> > > > between cgroups" but rather "shared between processes".
> > > > I think it's of no use and no help to users.
> > > >
> > > 
> > > So what in your opinion would help end users? My concern is that as
> > > we make progress with memcg, we account only for privately used pages
> > > with no hint/data about the real usage (shared within or with other
> > > cgroups). 
> > 
> > The real usage is already shown as
> > 
> >   [root@bluextal ref-mmotm]# cat /cgroups/memory.stat
> >   cache 7706181632 
> >   rss 120905728
> >   mapped_file 32239616
> > 
> > This is real. And "sum of rss - rss+mapped" doesn't show anything.
> > 
> > > How do we decide if one cgroup is really heavy?
> > >  
> > 
> > What "heavy" means ? "Hard to page out ?"
> >
> 
> Heavy can also indicate: should we OOM kill within this cgroup or kill the
> entire cgroup? Should we add or remove resources from this cgroup?
> 
That can be shown by usage...

 
> > Historically, it's caught by pagein/pageout _speed_.
> > "How heavy memory system is ?" can only be measured by "speed".
> 
> Not really... A cgroup might be very large with a large number of its
> pages shared and frequently used. How do we detect if this cgroup
> needs its resources or it's taking too many of them?
> 
I don't know. If we have a good parameter in the kernel for knowing that a
resource is in short supply, please add it to the global VM before memcg,
as "/dev/mem_notify" was proposed in the past. memcg will use similar logic,
which is guaranteed by VM guys.


> > "How pages are shared" doesn't show good hints. I don't hear such parameter
> > is used in production's resource monitoring software.
> > 
> 
> You mean "How many pages are shared" are not good hints, please see my
> justification above. With Virtualization (look at KSM for example),
> shared pages are going to be increasingly important part of the
> accounting.
> 

Considering KSM, your counting style is too bad.

You should add 

 - MEM_CGROUP_STAT_SHARED_BY_KSM
 - MEM_CGROUP_STAT_FOR_TMPFS/SYSV_IPC_SHMEM

counters to memcg rather than scanning. I can help tests.

I have no objections to have above 2 counters. It's informative.

But, memory reclaim can page-out pages even if pages are shared.
So, "how heavy memcg is" is an independent problem from above coutners.

Thanks,
-Kame





* Re: [RFC] Shared page accounting for memory cgroup
  2010-01-07  8:48                     ` KAMEZAWA Hiroyuki
@ 2010-01-07  9:08                       ` KAMEZAWA Hiroyuki
  2010-01-07  9:27                         ` Balbir Singh
  0 siblings, 1 reply; 31+ messages in thread
From: KAMEZAWA Hiroyuki @ 2010-01-07  9:08 UTC (permalink / raw)
  To: KAMEZAWA Hiroyuki
  Cc: balbir, linux-mm, Andrew Morton, linux-kernel, nishimura

On Thu, 7 Jan 2010 17:48:14 +0900
KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> wrote:
> > > "How pages are shared" doesn't show good hints. I don't hear such parameter
> > > is used in production's resource monitoring software.
> > > 
> > 
> > You mean "How many pages are shared" are not good hints, please see my
> > justification above. With Virtualization (look at KSM for example),
> > shared pages are going to be increasingly important part of the
> > accounting.
> > 
> 
> Considering KSM, your counting style is too bad.
> 
> You should add 
> 
>  - MEM_CGROUP_STAT_SHARED_BY_KSM
>  - MEM_CGROUP_STAT_FOR_TMPFS/SYSV_IPC_SHMEM
> 
> counters to memcg rather than scanning. I can help tests.
> 
> I have no objections to have above 2 counters. It's informative.
> 
> But, memory reclaim can page-out pages even if pages are shared.
> So, "how heavy memcg is" is an independent problem from above coutners.
> 

In other words, the above counters can show
"what role the memcg plays in the system" to some extent.

But I wouldn't express it as "heavy"... "importance or influence of the cgroup"?

Thanks,
-Kame



* Re: [RFC] Shared page accounting for memory cgroup
  2010-01-07  9:08                       ` KAMEZAWA Hiroyuki
@ 2010-01-07  9:27                         ` Balbir Singh
  2010-01-07 23:47                           ` KAMEZAWA Hiroyuki
  0 siblings, 1 reply; 31+ messages in thread
From: Balbir Singh @ 2010-01-07  9:27 UTC (permalink / raw)
  To: KAMEZAWA Hiroyuki; +Cc: linux-mm, Andrew Morton, linux-kernel, nishimura

* KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> [2010-01-07 18:08:00]:

> On Thu, 7 Jan 2010 17:48:14 +0900
> KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> wrote:
> > > > "How pages are shared" doesn't show good hints. I don't hear such parameter
> > > > is used in production's resource monitoring software.
> > > > 
> > > 
> > > You mean "How many pages are shared" are not good hints, please see my
> > > justification above. With Virtualization (look at KSM for example),
> > > shared pages are going to be increasingly important part of the
> > > accounting.
> > > 
> > 
> > Considering KSM, your counting style is too bad.
> > 
> > You should add 
> > 
> >  - MEM_CGROUP_STAT_SHARED_BY_KSM
> >  - MEM_CGROUP_STAT_FOR_TMPFS/SYSV_IPC_SHMEM
> > 

No.. I am just talking about shared memory being important and shared
accounting being useful, no counters for KSM in particular (in the
memcg context).

> > counters to memcg rather than scanning. I can help tests.
> > 
> > I have no objections to have above 2 counters. It's informative.
> > 

Apart from those two, I want to provide what Pss provides today or an
approximation of it.
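
(For reference: Pss, as reported in /proc/<pid>/smaps, charges each
mapped page to a process in proportion to how widely the page is
shared, roughly

  Pss = sum over mapped pages of PAGE_SIZE / page_mapcount(page)

so a page mapped by four processes contributes PAGE_SIZE/4 to each
one's Pss.)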

> > But, memory reclaim can page-out pages even if pages are shared.
> > So, "how heavy memcg is" is an independent problem from above coutners.
> > 
> 
> In other words, the above counters can show
> "what role the memcg plays in the system" to some extent.
>
> But I wouldn't express it as "heavy"... "importance or influence of the cgroup"?
> 
> Thanks,
> -Kame
> 
> 

-- 
	Balbir


* Re: [RFC] Shared page accounting for memory cgroup
  2010-01-07  9:27                         ` Balbir Singh
@ 2010-01-07 23:47                           ` KAMEZAWA Hiroyuki
  2010-01-17 19:30                             ` Balbir Singh
  0 siblings, 1 reply; 31+ messages in thread
From: KAMEZAWA Hiroyuki @ 2010-01-07 23:47 UTC (permalink / raw)
  To: balbir; +Cc: linux-mm, Andrew Morton, linux-kernel, nishimura

On Thu, 7 Jan 2010 14:57:36 +0530
Balbir Singh <balbir@linux.vnet.ibm.com> wrote:

> * KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> [2010-01-07 18:08:00]:
> 
> > On Thu, 7 Jan 2010 17:48:14 +0900
> > KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> wrote:
> > > > > "How pages are shared" doesn't show good hints. I don't hear such parameter
> > > > > is used in production's resource monitoring software.
> > > > > 
> > > > 
> > > > You mean "How many pages are shared" are not good hints, please see my
> > > > justification above. With Virtualization (look at KSM for example),
> > > > shared pages are going to be increasingly important part of the
> > > > accounting.
> > > > 
> > > 
> > > Considering KSM, your counting style is too bad.
> > > 
> > > You should add 
> > > 
> > >  - MEM_CGROUP_STAT_SHARED_BY_KSM
> > >  - MEM_CGROUP_STAT_FOR_TMPFS/SYSV_IPC_SHMEM
> > > 
> 
> No.. I am just talking about shared memory being important and shared
> accounting being useful, no counters for KSM in particular (in the
> memcg context).
> 
Think so? The number of memcg-private pages is what's of interest, from my point of view.

Anyway, I don't change my opinion as "sum of rss" is not necessary to be calculated
in the kernel.
If you want to provide that in memcg, please add it to global VM as /proc/meminfo.

IIUC, KSM/SHMEM has some official method in global VM.

Thanks,
-Kame



* Re: [RFC] Shared page accounting for memory cgroup
  2010-01-07 23:47                           ` KAMEZAWA Hiroyuki
@ 2010-01-17 19:30                             ` Balbir Singh
  2010-01-18  0:05                               ` KAMEZAWA Hiroyuki
  2010-01-18  0:49                               ` Daisuke Nishimura
  0 siblings, 2 replies; 31+ messages in thread
From: Balbir Singh @ 2010-01-17 19:30 UTC (permalink / raw)
  To: KAMEZAWA Hiroyuki; +Cc: linux-mm, Andrew Morton, linux-kernel, nishimura

On Fri, Jan 8, 2010 at 5:17 AM, KAMEZAWA Hiroyuki
<kamezawa.hiroyu@jp.fujitsu.com> wrote:
> On Thu, 7 Jan 2010 14:57:36 +0530
> Balbir Singh <balbir@linux.vnet.ibm.com> wrote:
>
>> * KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> [2010-01-07 18:08:00]:
>>
>> > On Thu, 7 Jan 2010 17:48:14 +0900
>> > KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> wrote:
>> > > > > "How pages are shared" doesn't show good hints. I don't hear such parameter
>> > > > > is used in production's resource monitoring software.
>> > > > >
>> > > >
>> > > > You mean "How many pages are shared" are not good hints, please see my
>> > > > justification above. With Virtualization (look at KSM for example),
>> > > > shared pages are going to be increasingly important part of the
>> > > > accounting.
>> > > >
>> > >
>> > > Considering KSM, your counting style is too bad.
>> > >
>> > > You should add
>> > >
>> > >  - MEM_CGROUP_STAT_SHARED_BY_KSM
>> > >  - MEM_CGROUP_STAT_FOR_TMPFS/SYSV_IPC_SHMEM
>> > >
>>
>> No.. I am just talking about shared memory being important and shared
>> accounting being useful, no counters for KSM in particular (in the
>> memcg context).
>>
> Think so? The number of memcg-private pages is what's of interest, from my point of view.
>
> Anyway, I don't change my opinion as "sum of rss" is not necessary to be calculated
> in the kernel.
> If you want to provide that in memcg, please add it to global VM as /proc/meminfo.
>
> IIUC, KSM/SHMEM has some official method in global VM.
>

Kamezawa-San,

I implemented the same in user space and I get really bad results, here is why

1. I need to hold and walk the tasks list in cgroups and extract RSS
through /proc (results in worse hold times for the fork() scenario you
mentioned)
2. The data is highly inconsistent due to the higher margin of error
in accumulating data which is changing as we run. By the time we total
and look at the memcg data, the data is stale
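
(For reference, a minimal sketch of the userspace approach tried here;
the cgroup path and output format are illustrative only. Note the
window between summing the /proc RSS values and a later read of
memory.stat, which is exactly where the staleness creeps in:)
==
#include <stdio.h>
#include <unistd.h>

/* Sum RSS (in pages) of every task listed in a cgroup's tasks file,
 * by reading field 2 ("resident") of /proc/<pid>/statm. */
static long cgroup_rss_pages(const char *tasks_path)
{
	FILE *tasks = fopen(tasks_path, "r");
	long total = 0;
	int pid;

	if (!tasks)
		return -1;
	while (fscanf(tasks, "%d", &pid) == 1) {
		char path[64];
		long size, resident;
		FILE *statm;

		snprintf(path, sizeof(path), "/proc/%d/statm", pid);
		statm = fopen(path, "r");
		if (!statm)
			continue;	/* task exited: already stale */
		if (fscanf(statm, "%ld %ld", &size, &resident) == 2)
			total += resident;
		fclose(statm);
	}
	fclose(tasks);
	return total;
}

int main(void)
{
	long pages = cgroup_rss_pages("/cgroup/memory/0/tasks");

	if (pages < 0)
		return 1;
	/* Anything the workload does between the walk above and a
	 * subsequent read of memory.stat makes the two inconsistent. */
	printf("sum of task rss: %ld bytes\n",
	       pages * sysconf(_SC_PAGESIZE));
	return 0;
}
==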

Would you be OK with the patch, if I renamed "shared_usage_in_bytes"
to "non_private_usage_in_bytes"?

Given that the stat is user initiated, I don't see your concern w.r.t.
overhead. Many subsystems like KSM do pay the overhead cost if the
user really wants the feature or the data. I would be really
interested in other opinions as well (if people do feel strongly
against or for the feature)

Balbir Singh


* Re: [RFC] Shared page accounting for memory cgroup
  2010-01-17 19:30                             ` Balbir Singh
@ 2010-01-18  0:05                               ` KAMEZAWA Hiroyuki
  2010-01-18  0:22                                 ` KAMEZAWA Hiroyuki
  2010-01-18  0:49                               ` Daisuke Nishimura
  1 sibling, 1 reply; 31+ messages in thread
From: KAMEZAWA Hiroyuki @ 2010-01-18  0:05 UTC (permalink / raw)
  To: Balbir Singh; +Cc: linux-mm, Andrew Morton, linux-kernel, nishimura

On Mon, 18 Jan 2010 01:00:44 +0530
Balbir Singh <balbir@linux.vnet.ibm.com> wrote:

> On Fri, Jan 8, 2010 at 5:17 AM, KAMEZAWA Hiroyuki
> <kamezawa.hiroyu@jp.fujitsu.com> wrote:
> > On Thu, 7 Jan 2010 14:57:36 +0530
> > Balbir Singh <balbir@linux.vnet.ibm.com> wrote:
> >
> >> * KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> [2010-01-07 18:08:00]:
> >>
> >> > On Thu, 7 Jan 2010 17:48:14 +0900
> >> > KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> wrote:
> >> > > > > "How pages are shared" doesn't show good hints. I don't hear such parameter
> >> > > > > is used in production's resource monitoring software.
> >> > > > >
> >> > > >
> >> > > > You mean "How many pages are shared" are not good hints, please see my
> >> > > > justification above. With Virtualization (look at KSM for example),
> >> > > > shared pages are going to be increasingly important part of the
> >> > > > accounting.
> >> > > >
> >> > >
> >> > > Considering KSM, your counting style is too bad.
> >> > >
> >> > > You should add
> >> > >
> >> > >  - MEM_CGROUP_STAT_SHARED_BY_KSM
> >> > >  - MEM_CGROUP_STAT_FOR_TMPFS/SYSV_IPC_SHMEM
> >> > >
> >>
> >> No.. I am just talking about shared memory being important and shared
> >> accounting being useful, no counters for KSM in particular (in the
> >> memcg context).
> >>
> > Think so? The number of memcg-private pages is what's of interest, from my point of view.
> >
> > Anyway, I don't change my opinion as "sum of rss" is not necessary to be calculated
> > in the kernel.
> > If you want to provide that in memcg, please add it to global VM as /proc/meminfo.
> >
> > IIUC, KSM/SHMEM has some official method in global VM.
> >
> 
> Kamezawa-San,
> 
> I implemented the same in user space and I get really bad results, here is why
> 
> 1. I need to hold and walk the tasks list in cgroups and extract RSS
> through /proc (results in worse hold times for the fork() scenario you
> mentioned)
> 2. The data is highly inconsistent due to the higher margin of error
> in accumulating data which is changing as we run. By the time we total
> and look at the memcg data, the data is stale
> 
> Would you be OK with the patch, if I renamed "shared_usage_in_bytes"
> to "non_private_usage_in_bytes"?
> 
> Given that the stat is user initiated, I don't see your concern w.r.t.
> overhead. Many subsystems like KSM do pay the overhead cost if the
> user really wants the feature or the data. I would be really
> interested in other opinions as well (if people do feel strongly
> against or for the feature)
> 

Please add that feature to the global VM before memcg.
If VM guys admit it's good, I have no more objections.

Thanks,
-Kame






* Re: [RFC] Shared page accounting for memory cgroup
  2010-01-18  0:05                               ` KAMEZAWA Hiroyuki
@ 2010-01-18  0:22                                 ` KAMEZAWA Hiroyuki
  0 siblings, 0 replies; 31+ messages in thread
From: KAMEZAWA Hiroyuki @ 2010-01-18  0:22 UTC (permalink / raw)
  To: KAMEZAWA Hiroyuki
  Cc: Balbir Singh, linux-mm, Andrew Morton, linux-kernel, nishimura

On Mon, 18 Jan 2010 09:05:49 +0900
KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> wrote:

> > 
> > Kamezawa-San,
> > 
> > I implemented the same in user space and I get really bad results, here is why
> > 
> > 1. I need to hold and walk the tasks list in cgroups and extract RSS
> > through /proc (results in worse hold times for the fork() scenario you
> > mentioned)
> > 2. The data is highly inconsistent due to the higher margin of error
> > in accumulating data which is changing as we run. By the time we total
> > and look at the memcg data, the data is stale
> > 
> > Would you be OK with the patch, if I renamed "shared_usage_in_bytes"
> > to "non_private_usage_in_bytes"?
> > 
> > Given that the stat is user initiated, I don't see your concern w.r.t.
> > overhead. Many subsystems like KSM do pay the overhead cost if the
> > user really wants the feature or the data. I would be really
> > interested in other opinions as well (if people do feel strongly
> > against or for the feature)
> > 
> 
> Please add that feature to the global VM before memcg.
> If VM guys admit it's good, I have no more objections.
> 

I don't want to say any more but...one point.

If the status of memory changes so frequently that the userland check program
can't calculate stable data, what can the management daemon react against...
the stale data? So, I think it's nonsense anyway.

Thanks,
-Kame



* Re: [RFC] Shared page accounting for memory cgroup
  2010-01-17 19:30                             ` Balbir Singh
  2010-01-18  0:05                               ` KAMEZAWA Hiroyuki
@ 2010-01-18  0:49                               ` Daisuke Nishimura
  2010-01-18  8:26                                 ` Balbir Singh
  1 sibling, 1 reply; 31+ messages in thread
From: Daisuke Nishimura @ 2010-01-18  0:49 UTC (permalink / raw)
  To: Balbir Singh
  Cc: KAMEZAWA Hiroyuki, linux-mm, Andrew Morton, linux-kernel,
	Daisuke Nishimura

On Mon, 18 Jan 2010 01:00:44 +0530, Balbir Singh <balbir@linux.vnet.ibm.com> wrote:
> On Fri, Jan 8, 2010 at 5:17 AM, KAMEZAWA Hiroyuki
> <kamezawa.hiroyu@jp.fujitsu.com> wrote:
> > On Thu, 7 Jan 2010 14:57:36 +0530
> > Balbir Singh <balbir@linux.vnet.ibm.com> wrote:
> >
> >> * KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> [2010-01-07 18:08:00]:
> >>
> >> > On Thu, 7 Jan 2010 17:48:14 +0900
> >> > KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> wrote:
> >> > > > > "How pages are shared" doesn't show good hints. I don't hear such parameter
> >> > > > > is used in production's resource monitoring software.
> >> > > > >
> >> > > >
> >> > > > You mean "How many pages are shared" are not good hints, please see my
> >> > > > justification above. With Virtualization (look at KSM for example),
> >> > > > shared pages are going to be increasingly important part of the
> >> > > > accounting.
> >> > > >
> >> > >
> >> > > Considering KSM, your counting style is too bad.
> >> > >
> >> > > You should add
> >> > >
> >> > >  - MEM_CGROUP_STAT_SHARED_BY_KSM
> >> > >  - MEM_CGROUP_STAT_FOR_TMPFS/SYSV_IPC_SHMEM
> >> > >
> >>
> >> No.. I am just talking about shared memory being important and shared
> >> accounting being useful, no counters for KSM in particular (in the
> >> memcg context).
> >>
> > Think so? The number of memcg-private pages is what's of interest, from my point of view.
> >
> > Anyway, I don't change my opinion as "sum of rss" is not necessary to be calculated
> > in the kernel.
> > If you want to provide that in memcg, please add it to global VM as /proc/meminfo.
> >
> > IIUC, KSM/SHMEM has some official method in global VM.
> >
> 
> Kamezawa-San,
> 
> I implemented the same in user space and I get really bad results, here is why
> 
> 1. I need to hold and walk the tasks list in cgroups and extract RSS
> through /proc (results in worse hold times for the fork() scenario you
> mentioned)
> 2. The data is highly inconsistent due to the higher margin of error
> in accumulating data which is changing as we run. By the time we total
> and look at the memcg data, the data is stale
> 
> Would you be OK with the patch, if I renamed "shared_usage_in_bytes"
> to "non_private_usage_in_bytes"?
> 
I think the name is still ambiguous.

For example, if process A belongs to /cgroup/memory/01 and process B to /cgroup/memory/02,
both processes have 10MB of anonymous pages and 10MB of file caches of the same pages, and
all of the file caches are charged to 01.
In this case, the value in 01 is 0MB (= 20MB - 20MB) and in 02 it is 10MB (= 20MB - 10MB), right?

I don't think "non private usage" is appropriate to this value.
Why don't you just show "sum_of_each_process_rss" ? I think it would be easier
to understand for users.
But, hmm, I don't see any strong reason to do this in kernel, then :(


Thanks,
Daisuke Nishimura.

> Given that the stat is user initiated, I don't see your concern w.r.t.
> overhead. Many subsystems like KSM do pay the overhead cost if the
> user really wants the feature or the data. I would be really
> interested in other opinions as well (if people do feel strongly
> against or for the feature)
> 
> Balbir Singh


* Re: [RFC] Shared page accounting for memory cgroup
  2010-01-18  0:49                               ` Daisuke Nishimura
@ 2010-01-18  8:26                                 ` Balbir Singh
  2010-01-19  1:22                                   ` Daisuke Nishimura
  0 siblings, 1 reply; 31+ messages in thread
From: Balbir Singh @ 2010-01-18  8:26 UTC (permalink / raw)
  To: Daisuke Nishimura
  Cc: KAMEZAWA Hiroyuki, linux-mm, Andrew Morton, linux-kernel

On Monday 18 January 2010 06:19 AM, Daisuke Nishimura wrote:
> On Mon, 18 Jan 2010 01:00:44 +0530, Balbir Singh <balbir@linux.vnet.ibm.com> wrote:
>> On Fri, Jan 8, 2010 at 5:17 AM, KAMEZAWA Hiroyuki
>> <kamezawa.hiroyu@jp.fujitsu.com> wrote:
>>> On Thu, 7 Jan 2010 14:57:36 +0530
>>> Balbir Singh <balbir@linux.vnet.ibm.com> wrote:
>>>
>>>> * KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> [2010-01-07 18:08:00]:
>>>>
>>>>> On Thu, 7 Jan 2010 17:48:14 +0900
>>>>> KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> wrote:
>>>>>>>> "How pages are shared" doesn't show good hints. I don't hear such parameter
>>>>>>>> is used in production's resource monitoring software.
>>>>>>>>
>>>>>>>
>>>>>>> You mean "How many pages are shared" are not good hints, please see my
>>>>>>> justification above. With Virtualization (look at KSM for example),
>>>>>>> shared pages are going to be increasingly important part of the
>>>>>>> accounting.
>>>>>>>
>>>>>>
>>>>>> Considering KSM, your counting style is too bad.
>>>>>>
>>>>>> You should add
>>>>>>
>>>>>>  - MEM_CGROUP_STAT_SHARED_BY_KSM
>>>>>>  - MEM_CGROUP_STAT_FOR_TMPFS/SYSV_IPC_SHMEM
>>>>>>
>>>>
>>>> No.. I am just talking about shared memory being important and shared
>>>> accounting being useful, no counters for KSM in particular (in the
>>>> memcg context).
>>>>
>>> Think so? The number of memcg-private pages is what's of interest, from my point of view.
>>>
>>> Anyway, I don't change my opinion as "sum of rss" is not necessary to be calculated
>>> in the kernel.
>>> If you want to provide that in memcg, please add it to global VM as /proc/meminfo.
>>>
>>> IIUC, KSM/SHMEM has some official method in global VM.
>>>
>>
>> Kamezawa-San,
>>
>> I implemented the same in user space and I get really bad results, here is why
>>
>> 1. I need to hold and walk the tasks list in cgroups and extract RSS
>> through /proc (results in worse hold times for the fork() scenario you
>> mentioned)
>> 2. The data is highly inconsistent due to the higher margin of error
>> in accumulating data which is changing as we run. By the time we total
>> and look at the memcg data, the data is stale
>>
>> Would you be OK with the patch, if I renamed "shared_usage_in_bytes"
>> to "non_private_usage_in_bytes"?
>>
> I think the name is still ambiguous.
> 
> For example, if process A belongs to /cgroup/memory/01 and process B to /cgroup/memory/02,
> both processes have 10MB of anonymous pages and 10MB of file caches of the same pages, and
> all of the file caches are charged to 01.
> In this case, the value in 01 is 0MB (= 20MB - 20MB) and in 02 it is 10MB (= 20MB - 10MB), right?
> 

Correct, file cache is almost always considered shared, so it has

1. non-private or shared usage of 10MB
2. 10 MB of file cache

> I don't think "non private usage" is appropriate to this value.
> Why don't you just show "sum_of_each_process_rss" ? I think it would be easier
> to understand for users.

Here is my concern

1. The gap between looking at memcg stat and sum of all RSS is way
higher in user space
2. Summing up all rss without walking the tasks atomically can and
will lead to consistency issues. Data can be stale as long as it
represents a consistent snapshot of data

We need to differentiate between

1. Data snapshot (taken at a time, but valid at that point)
2. Data taken from different sources that does not form a uniform
snapshot, because the timestamping of each of the collected data
items is different


> But, hmm, I don't see any strong reason to do this in kernel, then :(

Please see my reason above for doing it in the kernel.

Balbir


* Re: [RFC] Shared page accounting for memory cgroup
  2010-01-18  8:26                                 ` Balbir Singh
@ 2010-01-19  1:22                                   ` Daisuke Nishimura
  2010-01-19  1:49                                     ` Balbir Singh
  0 siblings, 1 reply; 31+ messages in thread
From: Daisuke Nishimura @ 2010-01-19  1:22 UTC (permalink / raw)
  To: balbir
  Cc: KAMEZAWA Hiroyuki, linux-mm, Andrew Morton, linux-kernel,
	Daisuke Nishimura

On Mon, 18 Jan 2010 13:56:44 +0530, Balbir Singh <balbir@linux.vnet.ibm.com> wrote:
> On Monday 18 January 2010 06:19 AM, Daisuke Nishimura wrote:
> > On Mon, 18 Jan 2010 01:00:44 +0530, Balbir Singh <balbir@linux.vnet.ibm.com> wrote:
> >> On Fri, Jan 8, 2010 at 5:17 AM, KAMEZAWA Hiroyuki
> >> <kamezawa.hiroyu@jp.fujitsu.com> wrote:
> >>> On Thu, 7 Jan 2010 14:57:36 +0530
> >>> Balbir Singh <balbir@linux.vnet.ibm.com> wrote:
> >>>
> >>>> * KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> [2010-01-07 18:08:00]:
> >>>>
> >>>>> On Thu, 7 Jan 2010 17:48:14 +0900
> >>>>> KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> wrote:
> >>>>>>>> "How pages are shared" doesn't show good hints. I don't hear such parameter
> >>>>>>>> is used in production's resource monitoring software.
> >>>>>>>>
> >>>>>>>
> >>>>>>> You mean "How many pages are shared" are not good hints, please see my
> >>>>>>> justification above. With Virtualization (look at KSM for example),
> >>>>>>> shared pages are going to be increasingly important part of the
> >>>>>>> accounting.
> >>>>>>>
> >>>>>>
> >>>>>> Considering KSM, your cuounting style is tooo bad.
> >>>>>>
> >>>>>> You should add
> >>>>>>
> >>>>>>  - MEM_CGROUP_STAT_SHARED_BY_KSM
> >>>>>>  - MEM_CGROUP_STAT_FOR_TMPFS/SYSV_IPC_SHMEM
> >>>>>>
> >>>>
> >>>> No.. I am just talking about shared memory being important and shared
> >>>> accounting being useful, no counters for KSM in particular (in the
> >>>> memcg context).
> >>>>
> >>> Think so ? The number of memcg-private pages is in interest in my point of view.
> >>>
> >>> Anyway, I don't change my opinion as "sum of rss" is not necessary to be calculated
> >>> in the kernel.
> >>> If you want to provide that in memcg, please add it to global VM as /proc/meminfo.
> >>>
> >>> IIUC, KSM/SHMEM has some official method in global VM.
> >>>
> >>
> >> Kamezawa-San,
> >>
> >> I implemented the same in user space and I get really bad results, here is why
> >>
> >> 1. I need to hold and walk the tasks list in cgroups and extract RSS
> >> through /proc (results in worse hold times for the fork() scenario you
> >> menioned)
> >> 2. The data is highly inconsistent due to the higher margin of error
> >> in accumulating data which is changing as we run. By the time we total
> >> and look at the memcg data, the data is stale
> >>
> >> Would you be OK with the patch, if I renamed "shared_usage_in_bytes"
> >> to "non_private_usage_in_bytes"?
> >>
> > I think the name is still ambiguous.
> > 
> > For example, if process A belongs to /cgroup/memory/01 and process B to /cgroup/memory/02,
> > both process have 10MB anonymous pages and 10MB file caches of the same pages, and all of the
> > file caches are charged to 01.
> > In this case, the value in 01 is 0MB(=20MB - 20MB) and 10MB(20MB - 10MB), right?
> > 
> 
> Correct, file cache is almost always considered shared, so it has
> 
> 1. non-private or shared usage of 10MB
> 2. 10 MB of file cache
> 
> > I don't think "non private usage" is appropriate to this value.
> > Why don't you just show "sum_of_each_process_rss" ? I think it would be easier
> > to understand for users.
> 
> Here is my concern
> 
> 1. The gap between looking at memcg stat and sum of all RSS is way
> higher in user space
> 2. Summing up all rss without walking the tasks atomically can and
> will lead to consistency issues. Data can be stale as long as it
> represents a consistent snapshot of data
> 
> We need to differentiate between
> 
> 1. Data snapshot (taken at a time, but valid at that point)
> 2. Data taken from different sources that does not form a uniform
> snapshot, because the timestamping of the each of the collected data
> items is different
> 
Hmm, I'm sorry, but I can't understand why you need this "difference".
IOW, what can users or middleware learn from the value in the above case
(0MB in 01 and 10MB in 02)? I've read this thread, but I can't understand
this point... Why does this value mean that some of the groups are "heavy"?

> 
> > But, hmm, I don't see any strong reason to do this in kernel, then :(
> 
> Please see my reason above for doing it in the kernel.
> 
> Balbir

^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [RFC] Shared page accounting for memory cgroup
  2010-01-19  1:22                                   ` Daisuke Nishimura
@ 2010-01-19  1:49                                     ` Balbir Singh
  2010-01-19  2:34                                       ` Daisuke Nishimura
  0 siblings, 1 reply; 31+ messages in thread
From: Balbir Singh @ 2010-01-19  1:49 UTC (permalink / raw)
  To: Daisuke Nishimura
  Cc: KAMEZAWA Hiroyuki, linux-mm, Andrew Morton, linux-kernel

On Tue, Jan 19, 2010 at 6:52 AM, Daisuke Nishimura
<nishimura@mxp.nes.nec.co.jp> wrote:
[snip]
>> Correct, file cache is almost always considered shared, so it has
>>
>> 1. non-private or shared usage of 10MB
>> 2. 10 MB of file cache
>>
>> > I don't think "non private usage" is appropriate to this value.
>> > Why don't you just show "sum_of_each_process_rss" ? I think it would be easier
>> > to understand for users.
>>
>> Here is my concern
>>
>> 1. The gap between looking at memcg stat and sum of all RSS is way
>> higher in user space
>> 2. Summing up all rss without walking the tasks atomically can and
>> will lead to consistency issues. Data can be stale as long as it
>> represents a consistent snapshot of data
>>
>> We need to differentiate between
>>
>> 1. Data snapshot (taken at a time, but valid at that point)
>> 2. Data taken from different sources that does not form a uniform
>> snapshot, because the timestamping of the each of the collected data
>> items is different
>>
> Hmm, I'm sorry I can't understand why you need "difference".
> IOW, what can users or middlewares know by the value in the above case
> (0MB in 01 and 10MB in 02)? I've read this thread, but I can't understande about
> this point... Why can this value mean some of the groups are "heavy" ?
>

Consider a default cgroup that is not root, and assume all applications
are moved there initially. With a lot of shared memory, the default
cgroup will be the first one to page in much of that memory, so its
usage will be very high. Without some notion of how much is non-private,
how does one decide whether the default cgroup is really using a lot of
memory or merely sharing it? How do we decide on limits for a cgroup
without knowing its actual usage - the equivalent of PSS for a region of
memory of a task?
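
For reference, the per-task PSS mentioned above can already be computed from
/proc/<pid>/smaps in user space (assuming a kernel that exposes the Pss:
field there); a minimal sketch, not part of the proposed patch:

/* pss.c - hedged sketch: sum the Pss: fields of /proc/<pid>/smaps,
 * i.e. the task's RSS with every shared page divided by its mapcount.
 * This is the per-task analogue of the per-cgroup estimate discussed
 * here, and it shows why the kernel-side walk is expensive: it has to
 * visit every VMA of the task.
 */
#include <stdio.h>

int main(int argc, char **argv)
{
	char path[64], line[256];
	unsigned long kb, total_kb = 0;
	FILE *f;

	if (argc != 2) {
		fprintf(stderr, "usage: %s <pid>\n", argv[0]);
		return 1;
	}
	snprintf(path, sizeof(path), "/proc/%s/smaps", argv[1]);
	f = fopen(path, "r");
	if (!f) {
		perror("fopen smaps");
		return 1;
	}
	while (fgets(line, sizeof(line), f))
		if (sscanf(line, "Pss: %lu", &kb) == 1)
			total_kb += kb;
	fclose(f);
	printf("PSS: %lu kB\n", total_kb);
	return 0;
}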

Balbir

^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [RFC] Shared page accounting for memory cgroup
  2010-01-19  1:49                                     ` Balbir Singh
@ 2010-01-19  2:34                                       ` Daisuke Nishimura
  2010-01-19  3:52                                         ` Balbir Singh
  0 siblings, 1 reply; 31+ messages in thread
From: Daisuke Nishimura @ 2010-01-19  2:34 UTC (permalink / raw)
  To: Balbir Singh
  Cc: KAMEZAWA Hiroyuki, linux-mm, Andrew Morton, linux-kernel,
	Daisuke Nishimura

On Tue, 19 Jan 2010 07:19:42 +0530, Balbir Singh <balbir@linux.vnet.ibm.com> wrote:
> On Tue, Jan 19, 2010 at 6:52 AM, Daisuke Nishimura
> <nishimura@mxp.nes.nec.co.jp> wrote:
> [snip]
> >> Correct, file cache is almost always considered shared, so it has
> >>
> >> 1. non-private or shared usage of 10MB
> >> 2. 10 MB of file cache
> >>
> >> > I don't think "non private usage" is appropriate to this value.
> >> > Why don't you just show "sum_of_each_process_rss" ? I think it would be easier
> >> > to understand for users.
> >>
> >> Here is my concern
> >>
> >> 1. The gap between looking at memcg stat and sum of all RSS is way
> >> higher in user space
> >> 2. Summing up all rss without walking the tasks atomically can and
> >> will lead to consistency issues. Data can be stale as long as it
> >> represents a consistent snapshot of data
> >>
> >> We need to differentiate between
> >>
> >> 1. Data snapshot (taken at a time, but valid at that point)
> >> 2. Data taken from different sources that does not form a uniform
> >> snapshot, because the timestamping of the each of the collected data
> >> items is different
> >>
> > Hmm, I'm sorry I can't understand why you need "difference".
> > IOW, what can users or middlewares know by the value in the above case
> > (0MB in 01 and 10MB in 02)? I've read this thread, but I can't understande about
> > this point... Why can this value mean some of the groups are "heavy" ?
> >
> 
> Consider a default cgroup that is not root and assume all applications
> move there initially. Now with a lot of shared memory,
> the default cgroup will be the first one to page in a lot of the
> memory and its usage will be very high. Without the concept of
> showing how much is non-private, how does one decide if the default
> cgroup is using a lot of memory or sharing it? How
> do we decide on limits of a cgroup without knowing its actual usage -
> PSS equivalent for a region of memory for a task.
> 
As for the limit, I think we should decide it based on the actual usage,
because we account and limit the actual usage. Why should we take the sum of
rss into account?
I agree that we'd better not ignore the sum of rss completely, but could you
show me in detail how the values 0MB/10MB can be used to calculate the limits
for 01/02?
I wouldn't argue against you if I could understand that the value would be
useful, but I can't understand how it can be used, so I'm asking :)

Thanks
Daisuke Nishimura.

^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [RFC] Shared page accounting for memory cgroup
  2010-01-19  2:34                                       ` Daisuke Nishimura
@ 2010-01-19  3:52                                         ` Balbir Singh
  2010-01-20  4:09                                           ` Daisuke Nishimura
  0 siblings, 1 reply; 31+ messages in thread
From: Balbir Singh @ 2010-01-19  3:52 UTC (permalink / raw)
  To: Daisuke Nishimura
  Cc: KAMEZAWA Hiroyuki, linux-mm, Andrew Morton, linux-kernel

On Tuesday 19 January 2010 08:04 AM, Daisuke Nishimura wrote:
> On Tue, 19 Jan 2010 07:19:42 +0530, Balbir Singh <balbir@linux.vnet.ibm.com> wrote:
>> On Tue, Jan 19, 2010 at 6:52 AM, Daisuke Nishimura
>> <nishimura@mxp.nes.nec.co.jp> wrote:
>> [snip]
>>>> Correct, file cache is almost always considered shared, so it has
>>>>
>>>> 1. non-private or shared usage of 10MB
>>>> 2. 10 MB of file cache
>>>>
>>>>> I don't think "non private usage" is appropriate to this value.
>>>>> Why don't you just show "sum_of_each_process_rss" ? I think it would be easier
>>>>> to understand for users.
>>>>
>>>> Here is my concern
>>>>
>>>> 1. The gap between looking at memcg stat and sum of all RSS is way
>>>> higher in user space
>>>> 2. Summing up all rss without walking the tasks atomically can and
>>>> will lead to consistency issues. Data can be stale as long as it
>>>> represents a consistent snapshot of data
>>>>
>>>> We need to differentiate between
>>>>
>>>> 1. Data snapshot (taken at a time, but valid at that point)
>>>> 2. Data taken from different sources that does not form a uniform
>>>> snapshot, because the timestamping of the each of the collected data
>>>> items is different
>>>>
>>> Hmm, I'm sorry I can't understand why you need "difference".
>>> IOW, what can users or middlewares know by the value in the above case
>>> (0MB in 01 and 10MB in 02)? I've read this thread, but I can't understande about
>>> this point... Why can this value mean some of the groups are "heavy" ?
>>>
>>
>> Consider a default cgroup that is not root and assume all applications
>> move there initially. Now with a lot of shared memory,
>> the default cgroup will be the first one to page in a lot of the
>> memory and its usage will be very high. Without the concept of
>> showing how much is non-private, how does one decide if the default
>> cgroup is using a lot of memory or sharing it? How
>> do we decide on limits of a cgroup without knowing its actual usage -
>> PSS equivalent for a region of memory for a task.
>>
> As for limit, I think we should decide it based on the actual usage because
> we account and limit the accual usage. Why we should take account of the sum of rss ?

I am talking of non-private pages, or potentially shared pages, which are
derived as follows

shared = sum_of_all_rss - (rss + file_mapped)   (rss and file_mapped from the .stat file)

file cache is always considered to be shared
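
Plugging your earlier example into this formula, a small sketch (the numbers
are the ones from this thread, nothing new):

/* shared_estimate.c - hedged sketch: apply
 *   shared = sum_of_all_rss - (rss + file_mapped)
 * to the two-cgroup example discussed earlier in this thread.
 */
#include <stdio.h>

static long shared(long sum_of_all_rss, long rss, long file_mapped)
{
	return sum_of_all_rss - (rss + file_mapped);
}

int main(void)
{
	/* 01: process A has 10MB anon + 10MB file, all file cache charged to 01 */
	printf("01: %ldMB\n", shared(20, 10, 10));	/* -> 0MB  */
	/* 02: process B has 10MB anon + 10MB of the same file, charged to 01 */
	printf("02: %ldMB\n", shared(20, 10, 0));	/* -> 10MB */
	return 0;
}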


> I agree that we'd better not to ignore the sum of rss completely, but could you show me
> how the value 0MB/10MB can be used to caluculate the limit in 01/02 in detail ?

In your example, usage shows that the real usage of the cgroups is 20MB for
01 and 10MB for 02. Today, when summed, we show that we are using 40MB
instead of 30MB. If an administrator has to make a decision about, say,
adding more resources, the cgroup with 20MB would be the right place w.r.t.
memory.

> I wouldn't argue against you if I could understand the value would be useful,
> but I can't understand how the value can be used, so I'm asking :)

I understand. I am not completely closed to suggestions from you and
Kamezawa-San; I am just trying to find a way to get useful information about
shared memory usage back to user space. Remember that walking the LRU or even
the VMAs to find shared pages is expensive. We could do it lazily at rmap
time; that works well for charging, but not so well for uncharging, since we
would need to keep track of the mm's, so that pages charged by an mm can be
properly marked as private or shared in the correct memcg. It will require
more invasive work.

Balbir

^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [RFC] Shared page accounting for memory cgroup
  2010-01-19  3:52                                         ` Balbir Singh
@ 2010-01-20  4:09                                           ` Daisuke Nishimura
  2010-01-20  7:15                                             ` Daisuke Nishimura
  2010-01-20  8:17                                             ` Balbir Singh
  0 siblings, 2 replies; 31+ messages in thread
From: Daisuke Nishimura @ 2010-01-20  4:09 UTC (permalink / raw)
  To: balbir
  Cc: KAMEZAWA Hiroyuki, linux-mm, Andrew Morton, linux-kernel,
	Daisuke Nishimura

On Tue, 19 Jan 2010 09:22:41 +0530, Balbir Singh <balbir@linux.vnet.ibm.com> wrote:
> On Tuesday 19 January 2010 08:04 AM, Daisuke Nishimura wrote:
> > On Tue, 19 Jan 2010 07:19:42 +0530, Balbir Singh <balbir@linux.vnet.ibm.com> wrote:
> >> On Tue, Jan 19, 2010 at 6:52 AM, Daisuke Nishimura
> >> <nishimura@mxp.nes.nec.co.jp> wrote:
> >> [snip]
> >>>> Correct, file cache is almost always considered shared, so it has
> >>>>
> >>>> 1. non-private or shared usage of 10MB
> >>>> 2. 10 MB of file cache
> >>>>
> >>>>> I don't think "non private usage" is appropriate to this value.
> >>>>> Why don't you just show "sum_of_each_process_rss" ? I think it would be easier
> >>>>> to understand for users.
> >>>>
> >>>> Here is my concern
> >>>>
> >>>> 1. The gap between looking at memcg stat and sum of all RSS is way
> >>>> higher in user space
> >>>> 2. Summing up all rss without walking the tasks atomically can and
> >>>> will lead to consistency issues. Data can be stale as long as it
> >>>> represents a consistent snapshot of data
> >>>>
> >>>> We need to differentiate between
> >>>>
> >>>> 1. Data snapshot (taken at a time, but valid at that point)
> >>>> 2. Data taken from different sources that does not form a uniform
> >>>> snapshot, because the timestamping of the each of the collected data
> >>>> items is different
> >>>>
> >>> Hmm, I'm sorry I can't understand why you need "difference".
> >>> IOW, what can users or middlewares know by the value in the above case
> >>> (0MB in 01 and 10MB in 02)? I've read this thread, but I can't understande about
> >>> this point... Why can this value mean some of the groups are "heavy" ?
> >>>
> >>
> >> Consider a default cgroup that is not root and assume all applications
> >> move there initially. Now with a lot of shared memory,
> >> the default cgroup will be the first one to page in a lot of the
> >> memory and its usage will be very high. Without the concept of
> >> showing how much is non-private, how does one decide if the default
> >> cgroup is using a lot of memory or sharing it? How
> >> do we decide on limits of a cgroup without knowing its actual usage -
> >> PSS equivalent for a region of memory for a task.
> >>
> > As for limit, I think we should decide it based on the actual usage because
> > we account and limit the accual usage. Why we should take account of the sum of rss ?
> 
> I am talking of non-private pages or potentially shared pages - which is
> derived as follows
> 
> sum_of_all_rss - (rss + file_mapped) (from .stat file)
> 
> file cache is considered to be shared always
> 
> 
> > I agree that we'd better not to ignore the sum of rss completely, but could you show me
> > how the value 0MB/10MB can be used to caluculate the limit in 01/02 in detail ?
> 
> In your example, usage shows that the real usage of the cgroup is 20 MB
> for 01 and 10 MB for 02.
right.

> Today we show that we are using 40MB instead of
> 30MB (when summed).
Sorry, I can't understand this part.
If we sum usage_in_bytes over both groups, it would be 30MB.
If we sum the "actual rss (rss_file, rss_anon) via the stat file" over both
groups, it would be 30MB.
If we sum the "total rss (rss_file, rss_anon) of all processes via mm_counter"
over both groups, it would be 40MB.

> If an administrator has to make a decision to say
> add more resources, the one with 20MB would be the right place w.r.t.
> memory.
> 
You mean he would add the additional resources to 00, right? Then,
the smaller "shared_usage_in_bytes" is, the more likely it is that an
administrator should add additional resources to the group?

But when both /cgroup/memory/aa and /cgroup/memory/bb have 20MB of actual
usage, and aa has 10MB of "shared" usage (used by multiple processes *in aa*)
while bb has none, "shared_usage_in_bytes" is 10MB in aa and 0MB in bb
(please assume there is no "shared" usage between aa and bb).
Should an administrator consider bb heavier than aa? I don't think so.

IOW, "shared_usage_in_bytes" doesn't have any consistent meaning about which
group is unfairly "heavy".

The problem here is that "shared_usage_in_bytes" shows neither any one of,
nor the sum of, the following values (*IFF* we had only one cgroup,
"shared_usage_in_bytes" would mean a), but that has no use in the real case).

  a) memory usage shared by multiple processes inside this group.
  b) memory usage shared by processes inside this group and processes in
     another group.
  c) memory usage not used by any process inside this group, but used by
     processes in another group.

IMHO, we should take all of the above values into account to determine which
group is unfairly "heavy". I agree that the bigger the size of a) is, the
bigger the "shared_usage_in_bytes" of the group would be, but we cannot learn
anything about the size of b) from it, because those usages are included in
both the actual usage (rss via stat) and the sum of rss (via mm_counter).
To make matters worse, "shared_usage_in_bytes" has the opposite meaning with
respect to b), i.e., the more actual charges a process in some group (foo)
has in *another* group (baa), the bigger "shared_usage_in_bytes" in "foo"
would be (as with 00 and 01 in my example).

I would agree with you if you added interfaces that give users some hints
about the above values, but "shared_usage_in_bytes" doesn't meet that need
at all.

Thanks,
Daisuke Nishimura.

> > I wouldn't argue against you if I could understand the value would be useful,
> > but I can't understand how the value can be used, so I'm asking :)
> 
> I understand, I am not completely closed to suggestions from you and
> Kamezawa-San, just trying to find a way to get useful information about
> shared memory usage back to user space. Remember walking the LRU or even
> VMA's to find shared pages is expensive. We could do it lazily at rmap
> time, it works well for charging, but not too good for uncharging, since
> we'll need to keep track of the mm's, so that if the mm that charge can
> be properly marked as private or shared in the correct memcg. It will
> require more invasive work.
> 
> Balbir

^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [RFC] Shared page accounting for memory cgroup
  2010-01-20  4:09                                           ` Daisuke Nishimura
@ 2010-01-20  7:15                                             ` Daisuke Nishimura
  2010-01-20  7:43                                               ` KAMEZAWA Hiroyuki
  2010-01-20  8:18                                               ` Balbir Singh
  2010-01-20  8:17                                             ` Balbir Singh
  1 sibling, 2 replies; 31+ messages in thread
From: Daisuke Nishimura @ 2010-01-20  7:15 UTC (permalink / raw)
  To: balbir
  Cc: KAMEZAWA Hiroyuki, linux-mm, Andrew Morton, linux-kernel,
	Daisuke Nishimura

On Wed, 20 Jan 2010 13:09:02 +0900, Daisuke Nishimura <nishimura@mxp.nes.nec.co.jp> wrote:
> On Tue, 19 Jan 2010 09:22:41 +0530, Balbir Singh <balbir@linux.vnet.ibm.com> wrote:
> > On Tuesday 19 January 2010 08:04 AM, Daisuke Nishimura wrote:
> > > On Tue, 19 Jan 2010 07:19:42 +0530, Balbir Singh <balbir@linux.vnet.ibm.com> wrote:
> > >> On Tue, Jan 19, 2010 at 6:52 AM, Daisuke Nishimura
> > >> <nishimura@mxp.nes.nec.co.jp> wrote:
> > >> [snip]
> > >>>> Correct, file cache is almost always considered shared, so it has
> > >>>>
> > >>>> 1. non-private or shared usage of 10MB
> > >>>> 2. 10 MB of file cache
> > >>>>
> > >>>>> I don't think "non private usage" is appropriate to this value.
> > >>>>> Why don't you just show "sum_of_each_process_rss" ? I think it would be easier
> > >>>>> to understand for users.
> > >>>>
> > >>>> Here is my concern
> > >>>>
> > >>>> 1. The gap between looking at memcg stat and sum of all RSS is way
> > >>>> higher in user space
> > >>>> 2. Summing up all rss without walking the tasks atomically can and
> > >>>> will lead to consistency issues. Data can be stale as long as it
> > >>>> represents a consistent snapshot of data
> > >>>>
> > >>>> We need to differentiate between
> > >>>>
> > >>>> 1. Data snapshot (taken at a time, but valid at that point)
> > >>>> 2. Data taken from different sources that does not form a uniform
> > >>>> snapshot, because the timestamping of the each of the collected data
> > >>>> items is different
> > >>>>
> > >>> Hmm, I'm sorry I can't understand why you need "difference".
> > >>> IOW, what can users or middlewares know by the value in the above case
> > >>> (0MB in 01 and 10MB in 02)? I've read this thread, but I can't understande about
> > >>> this point... Why can this value mean some of the groups are "heavy" ?
> > >>>
> > >>
> > >> Consider a default cgroup that is not root and assume all applications
> > >> move there initially. Now with a lot of shared memory,
> > >> the default cgroup will be the first one to page in a lot of the
> > >> memory and its usage will be very high. Without the concept of
> > >> showing how much is non-private, how does one decide if the default
> > >> cgroup is using a lot of memory or sharing it? How
> > >> do we decide on limits of a cgroup without knowing its actual usage -
> > >> PSS equivalent for a region of memory for a task.
> > >>
> > > As for limit, I think we should decide it based on the actual usage because
> > > we account and limit the accual usage. Why we should take account of the sum of rss ?
> > 
> > I am talking of non-private pages or potentially shared pages - which is
> > derived as follows
> > 
> > sum_of_all_rss - (rss + file_mapped) (from .stat file)
> > 
> > file cache is considered to be shared always
> > 
> > 
> > > I agree that we'd better not to ignore the sum of rss completely, but could you show me
> > > how the value 0MB/10MB can be used to caluculate the limit in 01/02 in detail ?
> > 
> > In your example, usage shows that the real usage of the cgroup is 20 MB
> > for 01 and 10 MB for 02.
> right.
> 
> > Today we show that we are using 40MB instead of
> > 30MB (when summed).
> Sorry, I can't understand here.
> If we sum usage_in_bytes in both groups, it would be 30MB.
> If we sum "actual rss(rss_file, rss_anon) via stat file" in both groups, it would be 30M.
> If we sum "total rss(rss_file, rss_anon) of all process via mm_counter" in both groups,
> it would be 40MB.
> 
> > If an administrator has to make a decision to say
> > add more resources, the one with 20MB would be the right place w.r.t.
> > memory.
> > 
> You mean he would add the additional resource to 00, right? Then, 
> the smaller "shared_usage_in_bytes" is, the more likely an administrator should
> add additional resources to the group ?
> 
> But when both /cgroup/memory/aa and /cgroup/memory/bb has 20MB as acutual usage,
> and aa has 10MB "shared"(used by multiple processes *in aa*) usage while bb has none,
> "shared_usage_in_bytes" is 10MB in aa and 0MB in bb(please consider there is
> no "shared" usage between aa and bb).
> Should an administrator consider bb is heavier than aa ? I don't think so.
> 
> IOW, "shared_usage_in_bytes" doesn't have any consistent meaning about which
> group is unfairly "heavy".
> 
> The problem here is, "shared_usage_in_bytes" doesn't show neither one of nor the sum
> of the following value(*IFF* we have only one cgroup, "shared_usage_in_bytes" would
> mean a), but it has no use in real case).
> 
>   a) memory usage used by multiple processes inside this group.
>   b) memory usage used by both processes inside this and another group.
>   c) memory usage not used by any processes inside this group, but used by
>      that of in another group.
> 
> IMHO, we should take account of all the above values to determine which group
> is unfairly "heavy". I agree that the bigger the size of a) is, the bigger
> "shared_usage_in_bytes" of the group would be, but we cannot know any information about
> the size of b) by it, becase those usages are included in both actual usage(rss via stat)
> and sum of rss(via mm_counter). To make matters warse, "shared_usage_in_bytes" has
> the opposite meaning about b), i.e., the more a processe in some group(foo) has actual
> charges in *another* group(baa), the bigger "shared_usage_in_bytes" in "foo" would be
> (as 00 and 01 in my example).
> 
> I would agree with you if you add interfaces to show some hints to users about above values,
> but "shared_usage_in_bytes" doesn't meet it at all.
> 
This is just an idea (at the least, we would need interfaces to read and
reset these counters).

diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index 385e29b..bf601f2 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -83,6 +83,8 @@ enum mem_cgroup_stat_index {
 					used by soft limit implementation */
 	MEM_CGROUP_STAT_THRESHOLDS, /* decrements on each page in/out.
 					used by threshold implementation */
+	MEM_CGROUP_STAT_SHARED_IN_GROUP,
+	MEM_CGROUP_STAT_SHARED_FROM_OTHERS,
 
 	MEM_CGROUP_STAT_NSTATS,
 };
@@ -1707,8 +1709,25 @@ static void __mem_cgroup_commit_charge(struct mem_cgroup *mem,
 
 	lock_page_cgroup(pc);
 	if (unlikely(PageCgroupUsed(pc))) {
+		struct mem_cgroup *charged = pc->mem_cgroup;
+		struct mem_cgroup_stat *stat;
+		struct mem_cgroup_stat_cpu *cpustat;
+		int cpu;
+		int shared_type;
+
 		unlock_page_cgroup(pc);
 		mem_cgroup_cancel_charge(mem);
+
+		stat = &charged->stat;
+		cpu = get_cpu();
+		cpustat = &stat->cpustat[cpu];
+		if (charged == mem)
+			shared_type = MEM_CGROUP_STAT_SHARED_IN_GROUP;
+		else
+			shared_type = MEM_CGROUP_STAT_SHARED_FROM_OTHERS;
+		__mem_cgroup_stat_add_safe(cpustat, shared_type, 1);
+		put_cpu();
+
 		return;
 	}
 

^ permalink raw reply related	[flat|nested] 31+ messages in thread

* Re: [RFC] Shared page accounting for memory cgroup
  2010-01-20  7:15                                             ` Daisuke Nishimura
@ 2010-01-20  7:43                                               ` KAMEZAWA Hiroyuki
  2010-01-20  8:18                                               ` Balbir Singh
  1 sibling, 0 replies; 31+ messages in thread
From: KAMEZAWA Hiroyuki @ 2010-01-20  7:43 UTC (permalink / raw)
  To: Daisuke Nishimura; +Cc: balbir, linux-mm, Andrew Morton, linux-kernel

On Wed, 20 Jan 2010 16:15:33 +0900
Daisuke Nishimura <nishimura@mxp.nes.nec.co.jp> wrote:
> > I would agree with you if you add interfaces to show some hints to users about above values,
> > but "shared_usage_in_bytes" doesn't meet it at all.
> > 
> This is just an idea(At least, we need interfaces to read and reset them).
> 
This seems attractive, but there is no way to decrement these counters in a
_scalable_ way. We need some innovation to go down this path.

But I doubt how useful this would turn out to be.

In general, we can assume
   - file pages are shared (because of their nature).
   - rss is private (because of its nature).

Then, the problem is how rss (private anon) is shared.
Except for crazy programs such as AIM7, rss is private in most cases.
Even if it is highly shared, in most cases the shared rss can be estimated
from the size of the parent process's rss, and a process's parent-child
relationship is apparent, so measurement is easy. If COW is troublesome,
counting the number of COW events per process is a reasonable way (but you
have to fight the cost of adding that).

I tend not to disagree with adding a counter that shows "shared with other
cgroups", but I do disagree with "shared between processes".

Thanks,
-Kame


> diff --git a/mm/memcontrol.c b/mm/memcontrol.c
> index 385e29b..bf601f2 100644
> --- a/mm/memcontrol.c
> +++ b/mm/memcontrol.c
> @@ -83,6 +83,8 @@ enum mem_cgroup_stat_index {
>  					used by soft limit implementation */
>  	MEM_CGROUP_STAT_THRESHOLDS, /* decrements on each page in/out.
>  					used by threshold implementation */
> +	MEM_CGROUP_STAT_SHARED_IN_GROUP,
> +	MEM_CGROUP_STAT_SHARED_FROM_OTHERS,
>  
>  	MEM_CGROUP_STAT_NSTATS,
>  };
> @@ -1707,8 +1709,25 @@ static void __mem_cgroup_commit_charge(struct mem_cgroup *mem,
>  
>  	lock_page_cgroup(pc);
>  	if (unlikely(PageCgroupUsed(pc))) {
> +		struct mem_cgroup *charged = pc->mem_cgroup;
> +		struct mem_cgroup_stat *stat;
> +		struct mem_cgroup_stat_cpu *cpustat;
> +		int cpu;
> +		int shared_type;
> +
>  		unlock_page_cgroup(pc);
>  		mem_cgroup_cancel_charge(mem);
> +
> +		stat = &charged->stat;
> +		cpu = get_cpu();
> +		cpustat = &stat->cpustat[cpu];
> +		if (charged == mem)
> +			shared_type = MEM_CGROUP_STAT_SHARED_IN_GROUP;
> +		else
> +			shared_type = MEM_CGROUP_STAT_SHARED_FROM_OTHERS;
> +		__mem_cgroup_stat_add_safe(cpustat, shared_type, 1);
> +		put_cpu();
> +
>  		return;
>  	}
>  
> 


^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [RFC] Shared page accounting for memory cgroup
  2010-01-20  4:09                                           ` Daisuke Nishimura
  2010-01-20  7:15                                             ` Daisuke Nishimura
@ 2010-01-20  8:17                                             ` Balbir Singh
  2010-01-21  1:04                                               ` Daisuke Nishimura
  1 sibling, 1 reply; 31+ messages in thread
From: Balbir Singh @ 2010-01-20  8:17 UTC (permalink / raw)
  To: Daisuke Nishimura
  Cc: KAMEZAWA Hiroyuki, linux-mm, Andrew Morton, linux-kernel

On Wednesday 20 January 2010 09:39 AM, Daisuke Nishimura wrote:
> On Tue, 19 Jan 2010 09:22:41 +0530, Balbir Singh <balbir@linux.vnet.ibm.com> wrote:
>> On Tuesday 19 January 2010 08:04 AM, Daisuke Nishimura wrote:
>>> On Tue, 19 Jan 2010 07:19:42 +0530, Balbir Singh <balbir@linux.vnet.ibm.com> wrote:
>>>> On Tue, Jan 19, 2010 at 6:52 AM, Daisuke Nishimura
>>>> <nishimura@mxp.nes.nec.co.jp> wrote:
>>>> [snip]
>>>>>> Correct, file cache is almost always considered shared, so it has
>>>>>>
>>>>>> 1. non-private or shared usage of 10MB
>>>>>> 2. 10 MB of file cache
>>>>>>
>>>>>>> I don't think "non private usage" is appropriate to this value.
>>>>>>> Why don't you just show "sum_of_each_process_rss" ? I think it would be easier
>>>>>>> to understand for users.
>>>>>>
>>>>>> Here is my concern
>>>>>>
>>>>>> 1. The gap between looking at memcg stat and sum of all RSS is way
>>>>>> higher in user space
>>>>>> 2. Summing up all rss without walking the tasks atomically can and
>>>>>> will lead to consistency issues. Data can be stale as long as it
>>>>>> represents a consistent snapshot of data
>>>>>>
>>>>>> We need to differentiate between
>>>>>>
>>>>>> 1. Data snapshot (taken at a time, but valid at that point)
>>>>>> 2. Data taken from different sources that does not form a uniform
>>>>>> snapshot, because the timestamping of the each of the collected data
>>>>>> items is different
>>>>>>
>>>>> Hmm, I'm sorry I can't understand why you need "difference".
>>>>> IOW, what can users or middlewares know by the value in the above case
>>>>> (0MB in 01 and 10MB in 02)? I've read this thread, but I can't understande about
>>>>> this point... Why can this value mean some of the groups are "heavy" ?
>>>>>
>>>>
>>>> Consider a default cgroup that is not root and assume all applications
>>>> move there initially. Now with a lot of shared memory,
>>>> the default cgroup will be the first one to page in a lot of the
>>>> memory and its usage will be very high. Without the concept of
>>>> showing how much is non-private, how does one decide if the default
>>>> cgroup is using a lot of memory or sharing it? How
>>>> do we decide on limits of a cgroup without knowing its actual usage -
>>>> PSS equivalent for a region of memory for a task.
>>>>
>>> As for limit, I think we should decide it based on the actual usage because
>>> we account and limit the accual usage. Why we should take account of the sum of rss ?
>>
>> I am talking of non-private pages or potentially shared pages - which is
>> derived as follows
>>
>> sum_of_all_rss - (rss + file_mapped) (from .stat file)
>>
>> file cache is considered to be shared always
>>
>>
>>> I agree that we'd better not to ignore the sum of rss completely, but could you show me
>>> how the value 0MB/10MB can be used to caluculate the limit in 01/02 in detail ?
>>
>> In your example, usage shows that the real usage of the cgroup is 20 MB
>> for 01 and 10 MB for 02.
> right.
> 
>> Today we show that we are using 40MB instead of
>> 30MB (when summed).
> Sorry, I can't understand here.
> If we sum usage_in_bytes in both groups, it would be 30MB.

Right

> If we sum "actual rss(rss_file, rss_anon) via stat file" in both groups, it would be 30M.
> If we sum "total rss(rss_file, rss_anon) of all process via mm_counter" in both groups,
> it would be 40MB.
> 

The mm_counter sum would show 40MB and the memory cgroup would show 30MB,
you are right. But of the 30MB, do we say the group using 20MB is consuming
more resources?

>> If an administrator has to make a decision to say
>> add more resources, the one with 20MB would be the right place w.r.t.
>> memory.
>>
> You mean he would add the additional resource to 00, right? Then, 
> the smaller "shared_usage_in_bytes" is, the more likely an administrator should
> add additional resources to the group ?
> 
> But when both /cgroup/memory/aa and /cgroup/memory/bb has 20MB as acutual usage,
> and aa has 10MB "shared"(used by multiple processes *in aa*) usage while bb has none,
> "shared_usage_in_bytes" is 10MB in aa and 0MB in bb(please consider there is
> no "shared" usage between aa and bb).
> Should an administrator consider bb is heavier than aa ? I don't think so.
> 

No.. but before OOM killing, or before considering migrating the cgroup "aa"
in a virtualized environment, the real usage should be considered, or at
least the fact that moving "bb" would require 20MB.

> IOW, "shared_usage_in_bytes" doesn't have any consistent meaning about which
> group is unfairly "heavy".
> 

No, but it gives an idea of the sharing, which can be important for making
decisions and estimating the real usage. In the case of aa, one can estimate
the private usage to be 20MB - 10MB = 10MB, which is one correct way of
looking at the heaviness of the cgroup.

> The problem here is, "shared_usage_in_bytes" doesn't show neither one of nor the sum
> of the following value(*IFF* we have only one cgroup, "shared_usage_in_bytes" would
> mean a), but it has no use in real case).
> 
>   a) memory usage used by multiple processes inside this group.
>   b) memory usage used by both processes inside this and another group.
>   c) memory usage not used by any processes inside this group, but used by
>      that of in another group.
> 
> IMHO, we should take account of all the above values to determine which group
> is unfairly "heavy". I agree that the bigger the size of a) is, the bigger
> "shared_usage_in_bytes" of the group would be, but we cannot know any information about
> the size of b) by it, becase those usages are included in both actual usage(rss via stat)

(b) is IMHO a longer-term goal, and can be estimated from the PSS of the
processes within the cgroup

> and sum of rss(via mm_counter). To make matters warse, "shared_usage_in_bytes" has
> the opposite meaning about b), i.e., the more a processe in some group(foo) has actual
> charges in *another* group(baa), the bigger "shared_usage_in_bytes" in "foo" would be
> (as 00 and 01 in my example).
> 
> I would agree with you if you add interfaces to show some hints to users about above values,
> but "shared_usage_in_bytes" doesn't meet it at all.
> 

Not sure I follow your suggestion here.

Balbir

^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [RFC] Shared page accounting for memory cgroup
  2010-01-20  7:15                                             ` Daisuke Nishimura
  2010-01-20  7:43                                               ` KAMEZAWA Hiroyuki
@ 2010-01-20  8:18                                               ` Balbir Singh
  1 sibling, 0 replies; 31+ messages in thread
From: Balbir Singh @ 2010-01-20  8:18 UTC (permalink / raw)
  To: Daisuke Nishimura
  Cc: KAMEZAWA Hiroyuki, linux-mm, Andrew Morton, linux-kernel

On Wednesday 20 January 2010 12:45 PM, Daisuke Nishimura wrote:
> On Wed, 20 Jan 2010 13:09:02 +0900, Daisuke Nishimura <nishimura@mxp.nes.nec.co.jp> wrote:
>> On Tue, 19 Jan 2010 09:22:41 +0530, Balbir Singh <balbir@linux.vnet.ibm.com> wrote:
>>> On Tuesday 19 January 2010 08:04 AM, Daisuke Nishimura wrote:
>>>> On Tue, 19 Jan 2010 07:19:42 +0530, Balbir Singh <balbir@linux.vnet.ibm.com> wrote:
>>>>> On Tue, Jan 19, 2010 at 6:52 AM, Daisuke Nishimura
>>>>> <nishimura@mxp.nes.nec.co.jp> wrote:
>>>>> [snip]
>>>>>>> Correct, file cache is almost always considered shared, so it has
>>>>>>>
>>>>>>> 1. non-private or shared usage of 10MB
>>>>>>> 2. 10 MB of file cache
>>>>>>>
>>>>>>>> I don't think "non private usage" is appropriate to this value.
>>>>>>>> Why don't you just show "sum_of_each_process_rss" ? I think it would be easier
>>>>>>>> to understand for users.
>>>>>>>
>>>>>>> Here is my concern
>>>>>>>
>>>>>>> 1. The gap between looking at memcg stat and sum of all RSS is way
>>>>>>> higher in user space
>>>>>>> 2. Summing up all rss without walking the tasks atomically can and
>>>>>>> will lead to consistency issues. Data can be stale as long as it
>>>>>>> represents a consistent snapshot of data
>>>>>>>
>>>>>>> We need to differentiate between
>>>>>>>
>>>>>>> 1. Data snapshot (taken at a time, but valid at that point)
>>>>>>> 2. Data taken from different sources that does not form a uniform
>>>>>>> snapshot, because the timestamping of the each of the collected data
>>>>>>> items is different
>>>>>>>
>>>>>> Hmm, I'm sorry I can't understand why you need "difference".
>>>>>> IOW, what can users or middlewares know by the value in the above case
>>>>>> (0MB in 01 and 10MB in 02)? I've read this thread, but I can't understande about
>>>>>> this point... Why can this value mean some of the groups are "heavy" ?
>>>>>>
>>>>>
>>>>> Consider a default cgroup that is not root and assume all applications
>>>>> move there initially. Now with a lot of shared memory,
>>>>> the default cgroup will be the first one to page in a lot of the
>>>>> memory and its usage will be very high. Without the concept of
>>>>> showing how much is non-private, how does one decide if the default
>>>>> cgroup is using a lot of memory or sharing it? How
>>>>> do we decide on limits of a cgroup without knowing its actual usage -
>>>>> PSS equivalent for a region of memory for a task.
>>>>>
>>>> As for limit, I think we should decide it based on the actual usage because
>>>> we account and limit the accual usage. Why we should take account of the sum of rss ?
>>>
>>> I am talking of non-private pages or potentially shared pages - which is
>>> derived as follows
>>>
>>> sum_of_all_rss - (rss + file_mapped) (from .stat file)
>>>
>>> file cache is considered to be shared always
>>>
>>>
>>>> I agree that we'd better not to ignore the sum of rss completely, but could you show me
>>>> how the value 0MB/10MB can be used to caluculate the limit in 01/02 in detail ?
>>>
>>> In your example, usage shows that the real usage of the cgroup is 20 MB
>>> for 01 and 10 MB for 02.
>> right.
>>
>>> Today we show that we are using 40MB instead of
>>> 30MB (when summed).
>> Sorry, I can't understand here.
>> If we sum usage_in_bytes in both groups, it would be 30MB.
>> If we sum "actual rss(rss_file, rss_anon) via stat file" in both groups, it would be 30M.
>> If we sum "total rss(rss_file, rss_anon) of all process via mm_counter" in both groups,
>> it would be 40MB.
>>
>>> If an administrator has to make a decision to say
>>> add more resources, the one with 20MB would be the right place w.r.t.
>>> memory.
>>>
>> You mean he would add the additional resource to 00, right? Then, 
>> the smaller "shared_usage_in_bytes" is, the more likely an administrator should
>> add additional resources to the group ?
>>
>> But when both /cgroup/memory/aa and /cgroup/memory/bb has 20MB as acutual usage,
>> and aa has 10MB "shared"(used by multiple processes *in aa*) usage while bb has none,
>> "shared_usage_in_bytes" is 10MB in aa and 0MB in bb(please consider there is
>> no "shared" usage between aa and bb).
>> Should an administrator consider bb is heavier than aa ? I don't think so.
>>
>> IOW, "shared_usage_in_bytes" doesn't have any consistent meaning about which
>> group is unfairly "heavy".
>>
>> The problem here is, "shared_usage_in_bytes" doesn't show neither one of nor the sum
>> of the following value(*IFF* we have only one cgroup, "shared_usage_in_bytes" would
>> mean a), but it has no use in real case).
>>
>>   a) memory usage used by multiple processes inside this group.
>>   b) memory usage used by both processes inside this and another group.
>>   c) memory usage not used by any processes inside this group, but used by
>>      that of in another group.
>>
>> IMHO, we should take account of all the above values to determine which group
>> is unfairly "heavy". I agree that the bigger the size of a) is, the bigger
>> "shared_usage_in_bytes" of the group would be, but we cannot know any information about
>> the size of b) by it, becase those usages are included in both actual usage(rss via stat)
>> and sum of rss(via mm_counter). To make matters warse, "shared_usage_in_bytes" has
>> the opposite meaning about b), i.e., the more a processe in some group(foo) has actual
>> charges in *another* group(baa), the bigger "shared_usage_in_bytes" in "foo" would be
>> (as 00 and 01 in my example).
>>
>> I would agree with you if you add interfaces to show some hints to users about above values,
>> but "shared_usage_in_bytes" doesn't meet it at all.
>>
> This is just an idea(At least, we need interfaces to read and reset them).
> 
> diff --git a/mm/memcontrol.c b/mm/memcontrol.c
> index 385e29b..bf601f2 100644
> --- a/mm/memcontrol.c
> +++ b/mm/memcontrol.c
> @@ -83,6 +83,8 @@ enum mem_cgroup_stat_index {
>  					used by soft limit implementation */
>  	MEM_CGROUP_STAT_THRESHOLDS, /* decrements on each page in/out.
>  					used by threshold implementation */
> +	MEM_CGROUP_STAT_SHARED_IN_GROUP,
> +	MEM_CGROUP_STAT_SHARED_FROM_OTHERS,
> 
>  	MEM_CGROUP_STAT_NSTATS,
>  };
> @@ -1707,8 +1709,25 @@ static void __mem_cgroup_commit_charge(struct mem_cgroup *mem,
> 
>  	lock_page_cgroup(pc);
>  	if (unlikely(PageCgroupUsed(pc))) {
> +		struct mem_cgroup *charged = pc->mem_cgroup;
> +		struct mem_cgroup_stat *stat;
> +		struct mem_cgroup_stat_cpu *cpustat;
> +		int cpu;
> +		int shared_type;
> +
>  		unlock_page_cgroup(pc);
>  		mem_cgroup_cancel_charge(mem);
> +
> +		stat = &charged->stat;
> +		cpu = get_cpu();
> +		cpustat = &stat->cpustat[cpu];
> +		if (charged == mem)
> +			shared_type = MEM_CGROUP_STAT_SHARED_IN_GROUP;
> +		else
> +			shared_type = MEM_CGROUP_STAT_SHARED_FROM_OTHERS;
> +		__mem_cgroup_stat_add_safe(cpustat, shared_type, 1);
> +		put_cpu();
> +
>  		return;

How will this work during uncharge, if the original cgroup that owns the
pages has already unmapped them?

Balbir Singh.

^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [RFC] Shared page accounting for memory cgroup
  2010-01-20  8:17                                             ` Balbir Singh
@ 2010-01-21  1:04                                               ` Daisuke Nishimura
  2010-01-21  1:30                                                 ` KAMEZAWA Hiroyuki
  0 siblings, 1 reply; 31+ messages in thread
From: Daisuke Nishimura @ 2010-01-21  1:04 UTC (permalink / raw)
  To: balbir
  Cc: KAMEZAWA Hiroyuki, linux-mm, Andrew Morton, linux-kernel,
	Daisuke Nishimura

On Wed, 20 Jan 2010 13:47:13 +0530, Balbir Singh <balbir@linux.vnet.ibm.com> wrote:
> On Wednesday 20 January 2010 09:39 AM, Daisuke Nishimura wrote:
> > On Tue, 19 Jan 2010 09:22:41 +0530, Balbir Singh <balbir@linux.vnet.ibm.com> wrote:
> >> On Tuesday 19 January 2010 08:04 AM, Daisuke Nishimura wrote:
> >>> On Tue, 19 Jan 2010 07:19:42 +0530, Balbir Singh <balbir@linux.vnet.ibm.com> wrote:
> >>>> On Tue, Jan 19, 2010 at 6:52 AM, Daisuke Nishimura
> >>>> <nishimura@mxp.nes.nec.co.jp> wrote:
> >>>> [snip]
> >>>>>> Correct, file cache is almost always considered shared, so it has
> >>>>>>
> >>>>>> 1. non-private or shared usage of 10MB
> >>>>>> 2. 10 MB of file cache
> >>>>>>
> >>>>>>> I don't think "non private usage" is appropriate to this value.
> >>>>>>> Why don't you just show "sum_of_each_process_rss" ? I think it would be easier
> >>>>>>> to understand for users.
> >>>>>>
> >>>>>> Here is my concern
> >>>>>>
> >>>>>> 1. The gap between looking at memcg stat and sum of all RSS is way
> >>>>>> higher in user space
> >>>>>> 2. Summing up all rss without walking the tasks atomically can and
> >>>>>> will lead to consistency issues. Data can be stale as long as it
> >>>>>> represents a consistent snapshot of data
> >>>>>>
> >>>>>> We need to differentiate between
> >>>>>>
> >>>>>> 1. Data snapshot (taken at a time, but valid at that point)
> >>>>>> 2. Data taken from different sources that does not form a uniform
> >>>>>> snapshot, because the timestamping of the each of the collected data
> >>>>>> items is different
> >>>>>>
> >>>>> Hmm, I'm sorry I can't understand why you need "difference".
> >>>>> IOW, what can users or middlewares know by the value in the above case
> >>>>> (0MB in 01 and 10MB in 02)? I've read this thread, but I can't understande about
> >>>>> this point... Why can this value mean some of the groups are "heavy" ?
> >>>>>
> >>>>
> >>>> Consider a default cgroup that is not root and assume all applications
> >>>> move there initially. Now with a lot of shared memory,
> >>>> the default cgroup will be the first one to page in a lot of the
> >>>> memory and its usage will be very high. Without the concept of
> >>>> showing how much is non-private, how does one decide if the default
> >>>> cgroup is using a lot of memory or sharing it? How
> >>>> do we decide on limits of a cgroup without knowing its actual usage -
> >>>> PSS equivalent for a region of memory for a task.
> >>>>
> >>> As for limit, I think we should decide it based on the actual usage because
> >>> we account and limit the accual usage. Why we should take account of the sum of rss ?
> >>
> >> I am talking of non-private pages or potentially shared pages - which is
> >> derived as follows
> >>
> >> sum_of_all_rss - (rss + file_mapped) (from .stat file)
> >>
> >> file cache is considered to be shared always
> >>
> >>
> >>> I agree that we'd better not to ignore the sum of rss completely, but could you show me
> >>> how the value 0MB/10MB can be used to caluculate the limit in 01/02 in detail ?
> >>
> >> In your example, usage shows that the real usage of the cgroup is 20 MB
> >> for 01 and 10 MB for 02.
> > right.
> > 
> >> Today we show that we are using 40MB instead of
> >> 30MB (when summed).
> > Sorry, I can't understand here.
> > If we sum usage_in_bytes in both groups, it would be 30MB.
> 
> Right
> 
> > If we sum "actual rss(rss_file, rss_anon) via stat file" in both groups, it would be 30M.
> > If we sum "total rss(rss_file, rss_anon) of all process via mm_counter" in both groups,
> > it would be 40MB.
> > 
> 
> mm_counter would show 40GB, memcgroup would show 30MB you are right. But
> of the 30MB, do we say the one using 20MB is consuming more resources?
> 
> >> If an administrator has to make a decision to say
> >> add more resources, the one with 20MB would be the right place w.r.t.
> >> memory.
> >>
> > You mean he would add the additional resource to 00, right? Then, 
> > the smaller "shared_usage_in_bytes" is, the more likely an administrator should
> > add additional resources to the group ?
> > 
> > But when both /cgroup/memory/aa and /cgroup/memory/bb has 20MB as acutual usage,
> > and aa has 10MB "shared"(used by multiple processes *in aa*) usage while bb has none,
> > "shared_usage_in_bytes" is 10MB in aa and 0MB in bb(please consider there is
> > no "shared" usage between aa and bb).
> > Should an administrator consider bb is heavier than aa ? I don't think so.
> > 
> 
> No.. but before OOM killing or considering moving in a virtual
> environment the cgorup "aa", the real usage should be considered or
> at-least the fact that moving "bb" would require 20MB.
> 
> > IOW, "shared_usage_in_bytes" doesn't have any consistent meaning about which
> > group is unfairly "heavy".
> > 
> 
> No, but it gives an idea of the sharing, which can be important for
> making decisions and estimating the real usage. In the case of aa, one
> can estimate private usage to be 20MB - 10MB (10MB) which is one correct
> way of looking at the heaviness of the cgroup.
> 
This inconsistency is the problem I worry about the most. The bigger
"shared_usage_in_bytes" is, the more likely the group is "heavy", or is it
the opposite? It only confuses users.

The "shared_usage_in_bytes" of A can be used to roughly estimate the sum of

i) memory usage used by multiple processes in A.
ii) memory usage processes in A charge to OTHER GROUPS.
                                ^^^^^^^^^^^^^^^^^^^^^^
I would say "yes, it might be useful for deciding the weight of A" if it
could be used to estimate the sum of i) and

iii) memory usage processes in OTHER GROUPS charge to A.
                                            ^^^^^^^^^^^

Anyway, I won't say any more about the usefulness of "shared_usage_in_bytes".

But if you dare to add this interface to the kernel, please, please write
documentation saying that it can be used to roughly estimate the sum of i)
and ii), not the sum of i) and iii), and that it can be used to decide the
weight of a group only when few pages are shared between groups, so that
users don't misunderstand or misuse the interface.

And I think you should answer what Kamezawa-san pointed out at
http://lkml.org/lkml/2010/1/17/186.


Thanks,
Daisuke Nishimura.

> > The problem here is, "shared_usage_in_bytes" doesn't show neither one of nor the sum
> > of the following value(*IFF* we have only one cgroup, "shared_usage_in_bytes" would
> > mean a), but it has no use in real case).
> > 
> >   a) memory usage used by multiple processes inside this group.
> >   b) memory usage used by both processes inside this and another group.
> >   c) memory usage not used by any processes inside this group, but used by
> >      that of in another group.
> > 
> > IMHO, we should take account of all the above values to determine which group
> > is unfairly "heavy". I agree that the bigger the size of a) is, the bigger
> > "shared_usage_in_bytes" of the group would be, but we cannot know any information about
> > the size of b) by it, becase those usages are included in both actual usage(rss via stat)
> 
> (b) IMHO is a longer term goal and can be estimated from the PSS of the
> processes within the cgroup
> 
> > and sum of rss(via mm_counter). To make matters warse, "shared_usage_in_bytes" has
> > the opposite meaning about b), i.e., the more a processe in some group(foo) has actual
> > charges in *another* group(baa), the bigger "shared_usage_in_bytes" in "foo" would be
> > (as 00 and 01 in my example).
> > 
> > I would agree with you if you add interfaces to show some hints to users about above values,
> > but "shared_usage_in_bytes" doesn't meet it at all.
> > 
> 
> Not sure I follow your suggestion here.
> 
> Balbir

^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [RFC] Shared page accounting for memory cgroup
  2010-01-21  1:04                                               ` Daisuke Nishimura
@ 2010-01-21  1:30                                                 ` KAMEZAWA Hiroyuki
  0 siblings, 0 replies; 31+ messages in thread
From: KAMEZAWA Hiroyuki @ 2010-01-21  1:30 UTC (permalink / raw)
  To: Daisuke Nishimura; +Cc: balbir, linux-mm, Andrew Morton, linux-kernel

On Thu, 21 Jan 2010 10:04:16 +0900
Daisuke Nishimura <nishimura@mxp.nes.nec.co.jp> wrote:

> Anyway, I wouldn't say any more about the usefullness of "shared_usage_in_bytes".
> 
> But if you dare to add this interface to kernel, please and please write the documentation
> that it can be used to roughly estimate a sum of i) and ii), not sum of i) and iii), and
> can be used to decide the weight of the group only when few pages are shared between groups.
> So that users doesn't misunderstand nor misuse the interface.
> 
> And I think you should answer what Kamezawa-san pointed in http://lkml.org/lkml/2010/1/17/186.
> 
> 
I wouldn't like to say anything other than "please add the stat to the global
VM before memcg, if it's really important", because it seems I couldn't
persuade him, and he couldn't persuade me. I myself never thought the sum of
rss was important.

An additional claim I can easily think of is fork()->exit().
Assume there is a program with 1GB of RSS which invokes a helper program via
fork()->exec(). This is a common situation. Then the sum of RSS can easily
jump up or down by 1GB.
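
As an aside, this jump is easy to reproduce from user space; a hedged sketch,
scaled down to 256MB so it is convenient to run:

/* cow_jump.c - hedged sketch of the fork()-time RSS jump described above.
 * After fork(), parent and child each report the full RSS even though the
 * pages are still shared copy-on-write, so a "sum of RSS" counts them
 * twice until the child execs or exits.
 */
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>
#include <sys/wait.h>

#define SZ (256UL << 20)	/* 256MB instead of the 1GB in the example */

static unsigned long vmrss_kb(pid_t pid)
{
	char path[64], line[256];
	unsigned long kb = 0;
	FILE *f;

	snprintf(path, sizeof(path), "/proc/%d/status", (int)pid);
	f = fopen(path, "r");
	if (!f)
		return 0;
	while (fgets(line, sizeof(line), f))
		if (sscanf(line, "VmRSS: %lu", &kb) == 1)
			break;
	fclose(f);
	return kb;
}

int main(void)
{
	char *buf = malloc(SZ);
	pid_t child;

	if (!buf)
		return 1;
	memset(buf, 1, SZ);	/* fault in ~256MB of anonymous RSS */
	child = fork();
	if (child == 0) {
		sleep(2);	/* keep the COW sharing alive for a moment */
		_exit(0);
	}
	sleep(1);		/* let the child settle */
	printf("parent RSS: %lu kB, child RSS: %lu kB\n",
	       vmrss_kb(getpid()), vmrss_kb(child));
	free(buf);
	waitpid(child, NULL, 0);
	return 0;
}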

Even if the data is gathered in an atomic way, the data itself can be
perturbed very easily, and users would have to remove the noise by
themselves. So there is not much difference between calculating the RSS sum
in user land and in the kernel. Users have to measure the status and estimate
the stable value using statistical techniques.

Thanks,
-Kame


^ permalink raw reply	[flat|nested] 31+ messages in thread

end of thread

Thread overview: 31+ messages
2009-12-29 18:27 [RFC] Shared page accounting for memory cgroup Balbir Singh
2010-01-03 23:51 ` KAMEZAWA Hiroyuki
2010-01-04  0:07   ` Balbir Singh
2010-01-04  0:35     ` KAMEZAWA Hiroyuki
2010-01-04  0:50       ` Balbir Singh
2010-01-06  4:02         ` KAMEZAWA Hiroyuki
2010-01-06  7:01           ` Balbir Singh
2010-01-06  7:12             ` KAMEZAWA Hiroyuki
2010-01-07  7:15               ` Balbir Singh
2010-01-07  7:36                 ` KAMEZAWA Hiroyuki
2010-01-07  8:34                   ` Balbir Singh
2010-01-07  8:48                     ` KAMEZAWA Hiroyuki
2010-01-07  9:08                       ` KAMEZAWA Hiroyuki
2010-01-07  9:27                         ` Balbir Singh
2010-01-07 23:47                           ` KAMEZAWA Hiroyuki
2010-01-17 19:30                             ` Balbir Singh
2010-01-18  0:05                               ` KAMEZAWA Hiroyuki
2010-01-18  0:22                                 ` KAMEZAWA Hiroyuki
2010-01-18  0:49                               ` Daisuke Nishimura
2010-01-18  8:26                                 ` Balbir Singh
2010-01-19  1:22                                   ` Daisuke Nishimura
2010-01-19  1:49                                     ` Balbir Singh
2010-01-19  2:34                                       ` Daisuke Nishimura
2010-01-19  3:52                                         ` Balbir Singh
2010-01-20  4:09                                           ` Daisuke Nishimura
2010-01-20  7:15                                             ` Daisuke Nishimura
2010-01-20  7:43                                               ` KAMEZAWA Hiroyuki
2010-01-20  8:18                                               ` Balbir Singh
2010-01-20  8:17                                             ` Balbir Singh
2010-01-21  1:04                                               ` Daisuke Nishimura
2010-01-21  1:30                                                 ` KAMEZAWA Hiroyuki
