* [PATCH v9 00/13] memcg: per cgroup dirty page limiting
From: Greg Thelen @ 2011-08-17 16:14 UTC (permalink / raw)
  To: Andrew Morton
  Cc: linux-kernel, linux-mm, containers, linux-fsdevel,
	KAMEZAWA Hiroyuki, Balbir Singh, Daisuke Nishimura, Minchan Kim,
	Johannes Weiner, Wu Fengguang, Dave Chinner, Vivek Goyal,
	Andrea Righi, Ciju Rajan K, David Rientjes, Greg Thelen

This patch series gives each cgroup independent dirty page usage limits.
Limiting dirty memory caps the amount of dirty (hard to reclaim) page cache
used by a cgroup, which allows for better per cgroup memory isolation and
fewer memcg OOMs.

Three features are included in this patch series:
  1. memcg dirty page accounting
  2. memcg writeback
  3. memcg dirty page limiting


1. memcg dirty page accounting

Each memcg maintains a dirty page count and dirty page limit.  Previous
iterations of this patch series have refined this logic.  The interface is
similar to the procfs interface: /proc/sys/vm/dirty_*.  It is possible to
configure a limit to trigger throttling of a dirtier or queue background
writeback.  The root cgroup memory.dirty_* control files are read-only and match
the contents of the /proc/sys/vm/dirty_* files.
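
To make the accounting side concrete, below is a minimal userspace C model of
the bookkeeping described above.  It is only a sketch: the struct, the function
names, and the usage formula are assumptions for illustration, not the kernel
implementation added by the later patches in this series.

	/*
	 * Minimal userspace model (not kernel code) of the per-memcg dirty
	 * page bookkeeping.  Names are illustrative only; the real counters
	 * and the kernel hooks that drive them are added by later patches.
	 */
	#include <stdbool.h>
	#include <stdio.h>

	struct memcg_model {
		long nr_dirty;		/* pages waiting to be written back */
		long nr_writeback;	/* pages currently under writeback */
		long nr_unstable_nfs;	/* pages sent to NFS server, not committed */
		long dirty_limit;	/* per-cgroup dirty page limit */
	};

	/* A page charged to this cgroup became dirty. */
	static void model_account_dirtied(struct memcg_model *m)
	{
		m->nr_dirty++;
	}

	/* A dirty page was handed to the block layer for writeback. */
	static void model_clear_dirty_for_io(struct memcg_model *m)
	{
		m->nr_dirty--;
		m->nr_writeback++;
	}

	/* Assumption of this model: usage is dirty + writeback + unstable. */
	static bool model_over_limit(const struct memcg_model *m)
	{
		return m->nr_dirty + m->nr_writeback + m->nr_unstable_nfs >
		       m->dirty_limit;
	}

	int main(void)
	{
		struct memcg_model m = { .dirty_limit = 4 };

		for (int i = 0; i < 6; i++)
			model_account_dirtied(&m);
		model_clear_dirty_for_io(&m);
		printf("dirty=%ld writeback=%ld over_limit=%d\n",
		       m.nr_dirty, m.nr_writeback, model_over_limit(&m));
		return 0;
	}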


2. memcg writeback

Having per cgroup dirty memory limits is not very interesting unless writeback
is also cgroup aware.  There is not much isolation if cgroups have to write
back data from outside the affected cgroup to get below the cgroup dirty memory
threshold.

Per-memcg dirty limits are provided to support isolation, so cross cgroup inode
sharing is not a priority.  This allows the code to be simpler.

To add cgroup awareness to writeback, this series adds an i_memcg field to
struct address_space to allow writeback to isolate inodes for a particular
cgroup.  When an inode is marked dirty, i_memcg is set to the current cgroup.
When inode pages are marked dirty, the i_memcg field is compared against the
page's cgroup.  If they differ, then the inode is marked as shared by setting
i_memcg to a special shared value (zero).
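
As a rough illustration of that marking rule, here is a userspace sketch (the
type and helper names are illustrative; the real i_memcg field is added to
struct address_space by a later patch in this series):

	/*
	 * Userspace sketch of the i_memcg sharing rule.  I_MEMCG_SHARED
	 * (zero) and the css_id-style ids are modeled with plain integers;
	 * the real field lives in struct address_space.
	 */
	#include <stdio.h>

	#define I_MEMCG_SHARED	0	/* inode dirtied by more than one memcg */

	struct inode_model {
		unsigned short i_memcg;	/* css_id of the dirtying memcg, or shared */
	};

	/* Called when the inode itself is marked dirty. */
	static void mark_inode_dirty(struct inode_model *inode,
				     unsigned short memcg_id)
	{
		inode->i_memcg = memcg_id;
	}

	/* Called when one of the inode's pages is dirtied by some memcg. */
	static void mark_page_dirty(struct inode_model *inode,
				    unsigned short page_memcg_id)
	{
		if (inode->i_memcg != page_memcg_id)
			inode->i_memcg = I_MEMCG_SHARED;	/* mark inode shared */
	}

	int main(void)
	{
		struct inode_model ino;

		mark_inode_dirty(&ino, 3);	/* dirtied by memcg with id 3 */
		mark_page_dirty(&ino, 3);	/* same memcg: still owned by 3 */
		mark_page_dirty(&ino, 7);	/* different memcg: becomes shared */
		printf("i_memcg=%hu\n", ino.i_memcg);
		return 0;
	}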

When performing per-memcg writeback, move_expired_inodes() scans the per bdi
b_dirty list using each inode's i_memcg and the global over-limit memcg bitmap
to determine if the inode should be written.  This inode scan may involve
skipping many unrelated inodes from other cgroups.  To test the scanning
overhead, I created two cgroups (cgroup_A with 100,000 dirty inodes under A's
dirty limit, cgroup_B with 1 inode over B's dirty limit).  The writeback code
then had to skip 100,000 inodes when balancing cgroup_B to find the one inode
that needed writing.  This scanning took 58 msec to skip 100,000 foreign inodes.
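
A sketch of that filtering decision, as a userspace model with assumed names
(it mirrors only the skip/write decision described above, not the kernel's
actual list handling):

	/*
	 * Userspace sketch: an inode on the (modeled) b_dirty list is a
	 * writeback candidate if its i_memcg belongs to a memcg flagged in
	 * the over-limit bitmap.  Shared inodes (i_memcg == 0) are handled
	 * by the fallback pass described in section 3 below.
	 */
	#include <stdbool.h>
	#include <stdio.h>

	#define MAX_CSS_ID	256
	#define I_MEMCG_SHARED	0

	static unsigned char overlimit_bitmap[MAX_CSS_ID];	/* indexed by css_id */

	struct dirty_inode {
		unsigned short i_memcg;		/* owning memcg, or shared (0) */
	};

	static bool inode_needs_writeback(const struct dirty_inode *inode)
	{
		if (inode->i_memcg == I_MEMCG_SHARED)
			return false;	/* left for the shared fallback pass */
		return overlimit_bitmap[inode->i_memcg];
	}

	int main(void)
	{
		struct dirty_inode b_dirty[] = { {3}, {3}, {7}, {I_MEMCG_SHARED} };

		overlimit_bitmap[7] = 1;	/* pretend memcg 7 is over its limit */
		for (unsigned int i = 0; i < sizeof(b_dirty) / sizeof(b_dirty[0]); i++)
			printf("inode %u: %s\n", i,
			       inode_needs_writeback(&b_dirty[i]) ? "write" : "skip");
		return 0;
	}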


3. memcg dirty page limiting

balance_dirty_pages() calls mem_cgroup_balance_dirty_pages(), which checks the
dirty usage against the dirty thresholds for the current cgroup and its
parents.  When a cgroup exceeds its background limit, it is marked in a global
over-limit bitmap (indexed by cgroup id) and the bdi flusher is woken.  When a
cgroup hits its foreground limit, the task is throttled while performing
foreground writeback on inodes owned by the over-limit cgroup.  If
mem_cgroup_balance_dirty_pages() is unable to get below the dirty page
threshold by writing per-memcg inodes, it downshifts to also writing shared
inodes (i_memcg=0).
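
The control flow is roughly the following userspace sketch; the helper names,
bitmap size, and the wake/throttle printouts are placeholders, and only the
flow mirrors the description above:

	/*
	 * Userspace sketch of the mem_cgroup_balance_dirty_pages() flow:
	 * walk from the current memcg up through its parents, flag
	 * background over-limit memcgs for the flusher, and throttle on
	 * foreground over-limit memcgs.
	 */
	#include <stdio.h>

	struct memcg_node {
		struct memcg_node *parent;
		unsigned int id;
		long nr_dirty;
		long bg_limit;		/* background threshold */
		long fg_limit;		/* foreground (throttle) threshold */
	};

	static unsigned char overlimit_bitmap[256];	/* indexed by cgroup id */

	static void balance_dirty(struct memcg_node *memcg)
	{
		for (struct memcg_node *m = memcg; m; m = m->parent) {
			if (m->nr_dirty > m->bg_limit) {
				overlimit_bitmap[m->id] = 1;	/* flag for flusher */
				printf("wake bdi flusher for memcg %u\n", m->id);
			}
			if (m->nr_dirty > m->fg_limit) {
				/*
				 * The real code throttles the task here, writing
				 * per-memcg inodes first and falling back to shared
				 * (i_memcg == 0) inodes if that is not enough.
				 */
				printf("throttle: foreground writeback for memcg %u\n",
				       m->id);
			}
		}
	}

	int main(void)
	{
		struct memcg_node parent = { .id = 1, .nr_dirty = 10,
					     .bg_limit = 20, .fg_limit = 40 };
		struct memcg_node child = { .parent = &parent, .id = 2,
					    .nr_dirty = 10, .bg_limit = 4,
					    .fg_limit = 8 };

		balance_dirty(&child);
		return 0;
	}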

I know that there are significant IO-less balance_dirty_pages() changes in
flight.  I am not trying to derail that effort.  I have done moderate
functional testing of the newly proposed features.

The memcg aspects of this patch series are pretty mature.  The writeback aspects are
still fairly new and need feedback from the writeback community.  These features
are linked, so it's not clear which branch to send the changes to (the writeback
development branch or mmotm).

Here is an example of the memcg OOM that is avoided with this patch series:
	# mkdir /dev/cgroup/memory/x
	# echo 100M > /dev/cgroup/memory/x/memory.limit_in_bytes
	# echo $$ > /dev/cgroup/memory/x/tasks
	# dd if=/dev/zero of=/data/f1 bs=1k count=1M &
	# dd if=/dev/zero of=/data/f2 bs=1k count=1M &
	# wait
	[1]-  Killed                  dd if=/dev/zero of=/data/f1 bs=1k count=1M
	[2]+  Killed                  dd if=/dev/zero of=/data/f2 bs=1k count=1M

Changes since -v8:
- Reordered patches for better readability.

- No longer passing struct writeback_control into memcontrol functions.  Instead
  the needed attributes (memcg_id, etc.) are explicitly passed in.  Therefore no
  more field additions to struct writeback_control.

- Replaced 'Andrea Righi <arighi@develer.com>' with 
  'Andrea Righi <andrea@betterlinux.com>' in commit descriptions.

- Rebased to mmotm-2011-08-02-16-19

Greg Thelen (13):
  memcg: document cgroup dirty memory interfaces
  memcg: add page_cgroup flags for dirty page tracking
  memcg: add dirty page accounting infrastructure
  memcg: add kernel calls for memcg dirty page stats
  memcg: add mem_cgroup_mark_inode_dirty()
  memcg: add dirty limits to mem_cgroup
  memcg: add cgroupfs interface to memcg dirty limits
  memcg: dirty page accounting support routines
  memcg: create support routines for writeback
  writeback: pass wb_writeback_work into move_expired_inodes()
  writeback: make background writeback cgroup aware
  memcg: create support routines for page writeback
  memcg: check memcg dirty limits in page writeback

 Documentation/cgroups/memory.txt  |   70 ++++
 fs/buffer.c                       |    2 +-
 fs/fs-writeback.c                 |  113 ++++--
 fs/inode.c                        |    3 +
 fs/nfs/write.c                    |    4 +
 fs/sync.c                         |    2 +-
 include/linux/cgroup.h            |    1 +
 include/linux/fs.h                |    9 +
 include/linux/memcontrol.h        |   64 +++-
 include/linux/page_cgroup.h       |   23 ++
 include/linux/writeback.h         |    9 +-
 include/trace/events/memcontrol.h |  207 ++++++++++
 kernel/cgroup.c                   |    1 -
 mm/backing-dev.c                  |    3 +-
 mm/filemap.c                      |    1 +
 mm/memcontrol.c                   |  760 ++++++++++++++++++++++++++++++++++++-
 mm/page-writeback.c               |   44 ++-
 mm/truncate.c                     |    1 +
 mm/vmscan.c                       |    5 +-
 19 files changed, 1265 insertions(+), 57 deletions(-)
 create mode 100644 include/trace/events/memcontrol.h

-- 
1.7.3.1

* [PATCH v9 01/13] memcg: document cgroup dirty memory interfaces
From: Greg Thelen @ 2011-08-17 16:14 UTC (permalink / raw)
  To: Andrew Morton
  Cc: linux-kernel, linux-mm, containers, linux-fsdevel,
	KAMEZAWA Hiroyuki, Balbir Singh, Daisuke Nishimura, Minchan Kim,
	Johannes Weiner, Wu Fengguang, Dave Chinner, Vivek Goyal,
	Andrea Righi, Ciju Rajan K, David Rientjes, Greg Thelen

Document cgroup dirty memory interfaces and statistics.

The implementation of these new interface routines comes in the following
patches of this series.

Signed-off-by: Andrea Righi <andrea@betterlinux.com>
Signed-off-by: Greg Thelen <gthelen@google.com>
Acked-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Acked-by: Balbir Singh <balbir@linux.vnet.ibm.com>
Reviewed-by: Minchan Kim <minchan.kim@gmail.com>
---
 Documentation/cgroups/memory.txt |   70 ++++++++++++++++++++++++++++++++++++++
 1 files changed, 70 insertions(+), 0 deletions(-)

diff --git a/Documentation/cgroups/memory.txt b/Documentation/cgroups/memory.txt
index 6f3c598..5fd6ab8 100644
--- a/Documentation/cgroups/memory.txt
+++ b/Documentation/cgroups/memory.txt
@@ -389,6 +389,10 @@ mapped_file	- # of bytes of mapped file (includes tmpfs/shmem)
 pgpgin		- # of pages paged in (equivalent to # of charging events).
 pgpgout		- # of pages paged out (equivalent to # of uncharging events).
 swap		- # of bytes of swap usage
+dirty		- # of bytes that are waiting to get written back to the disk.
+writeback	- # of bytes that are actively being written back to the disk.
+nfs_unstable	- # of bytes sent to the NFS server, but not yet committed to
+		the actual storage.
 inactive_anon	- # of bytes of anonymous memory and swap cache memory on
 		LRU list.
 active_anon	- # of bytes of anonymous and swap cache memory on active
@@ -410,6 +414,9 @@ total_mapped_file	- sum of all children's "cache"
 total_pgpgin		- sum of all children's "pgpgin"
 total_pgpgout		- sum of all children's "pgpgout"
 total_swap		- sum of all children's "swap"
+total_dirty		- sum of all children's "dirty"
+total_writeback		- sum of all children's "writeback"
+total_nfs_unstable	- sum of all children's "nfs_unstable"
 total_inactive_anon	- sum of all children's "inactive_anon"
 total_active_anon	- sum of all children's "active_anon"
 total_inactive_file	- sum of all children's "inactive_file"
@@ -567,6 +574,69 @@ unevictable=<total anon pages> N0=<node 0 pages> N1=<node 1 pages> ...
 
 And we have total = file + anon + unevictable.
 
+5.7 dirty memory
+
+Control the maximum amount of dirty pages a cgroup can have at any given time.
+
+Limiting dirty memory is like fixing the max amount of dirty (hard to reclaim)
+page cache used by a cgroup.  So, in case of multiple cgroup writers, they will
+not be able to consume more than their designated share of dirty pages and will
+be throttled if they cross that limit.  System-wide dirty limits are also
+consulted.  Dirty memory consumption is checked against both system-wide and
+per-cgroup dirty limits.
+
+The interface is similar to the procfs interface: /proc/sys/vm/dirty_*.  It is
+possible to configure a limit to trigger throttling of a dirtier or queue
+background writeback.  The root cgroup memory.dirty_* control files are
+read-only and match the contents of the /proc/sys/vm/dirty_* files.
+
+Per-cgroup dirty limits can be set using the following files in the cgroupfs:
+
+- memory.dirty_ratio: the amount of dirty memory (expressed as a percentage of
+  cgroup memory) at which a process generating dirty pages will be throttled.
+  The default value is the system-wide dirty ratio, /proc/sys/vm/dirty_ratio.
+
+- memory.dirty_limit_in_bytes: the amount of dirty memory (expressed in bytes)
+  in the cgroup at which a process generating dirty pages will be throttled.
+  Suffix (k, K, m, M, g, or G) can be used to indicate that value is kilo, mega
+  or gigabytes.  The default value is the system-wide dirty limit,
+  /proc/sys/vm/dirty_bytes.
+
+  Note: memory.dirty_limit_in_bytes is the counterpart of memory.dirty_ratio.
+  Only one may be specified at a time.  When one is written it is immediately
+  taken into account to evaluate the dirty memory limits and the other appears
+  as 0 when read.
+
+- memory.dirty_background_ratio: the amount of dirty memory of the cgroup
+  (expressed as a percentage of cgroup memory) at which background writeback
+  kernel threads will start writing out dirty data.  The default value is the
+  system-wide background dirty ratio, /proc/sys/vm/dirty_background_ratio.
+
+- memory.dirty_background_limit_in_bytes: the amount of dirty memory (expressed
+  in bytes) in the cgroup at which background writeback kernel threads will
+  start writing out dirty data.  Suffix (k, K, m, M, g, or G) can be used to
+  indicate that value is kilo, mega or gigabytes.  The default value is the
+  system-wide dirty background limit, /proc/sys/vm/dirty_background_bytes.
+
+  Note: memory.dirty_background_limit_in_bytes is the counterpart of
+  memory.dirty_background_ratio.  Only one may be specified at a time.  When one
+  is written it is immediately taken into account to evaluate the dirty memory
+  limits and the other appears as 0 when read.
+
+A cgroup may contain more dirty memory than its dirty limit.  This is possible
+because of the principle that the first cgroup to touch a page is charged for
+it.  Subsequent page counting events (dirty, writeback, nfs_unstable) are also
+counted to the originally charged cgroup.  Example: If page is allocated by a
+cgroup A task, then the page is charged to cgroup A.  If the page is later
+dirtied by a task in cgroup B, then the cgroup A dirty count will be
+incremented.  If cgroup A is over its dirty limit but cgroup B is not, then
+dirtying a cgroup A page from a cgroup B task may push cgroup A over its dirty
+limit without throttling the dirtying cgroup B task.
+
+When use_hierarchy=0, each cgroup has independent dirty memory usage and limits.
+When use_hierarchy=1 the dirty limits of parent cgroups are also checked to
+ensure that no dirty limit is exceeded.
+
 6. Hierarchy support
 
 The memory controller supports a deep hierarchy and hierarchical accounting.
-- 
1.7.3.1

* [PATCH v9 02/13] memcg: add page_cgroup flags for dirty page tracking
From: Greg Thelen @ 2011-08-17 16:14 UTC (permalink / raw)
  To: Andrew Morton
  Cc: linux-kernel, linux-mm, containers, linux-fsdevel,
	KAMEZAWA Hiroyuki, Balbir Singh, Daisuke Nishimura, Minchan Kim,
	Johannes Weiner, Wu Fengguang, Dave Chinner, Vivek Goyal,
	Andrea Righi, Ciju Rajan K, David Rientjes, Greg Thelen

Add additional flags to page_cgroup to track dirty pages
within a mem_cgroup.

Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Signed-off-by: Andrea Righi <andrea@betterlinux.com>
Signed-off-by: Greg Thelen <gthelen@google.com>
Reviewed-by: Minchan Kim <minchan.kim@gmail.com>
---
 include/linux/page_cgroup.h |   23 +++++++++++++++++++++++
 1 files changed, 23 insertions(+), 0 deletions(-)

diff --git a/include/linux/page_cgroup.h b/include/linux/page_cgroup.h
index 961ecc7..66d3245 100644
--- a/include/linux/page_cgroup.h
+++ b/include/linux/page_cgroup.h
@@ -10,6 +10,9 @@ enum {
 	/* flags for mem_cgroup and file and I/O status */
 	PCG_MOVE_LOCK, /* For race between move_account v.s. following bits */
 	PCG_FILE_MAPPED, /* page is accounted as "mapped" */
+	PCG_FILE_DIRTY, /* page is dirty */
+	PCG_FILE_WRITEBACK, /* page is under writeback */
+	PCG_FILE_UNSTABLE_NFS, /* page is NFS unstable */
 	/* No lock in page_cgroup */
 	PCG_ACCT_LRU, /* page has been accounted for (under lru_lock) */
 	__NR_PCG_FLAGS,
@@ -67,6 +70,10 @@ static inline void ClearPageCgroup##uname(struct page_cgroup *pc)	\
 static inline int TestClearPageCgroup##uname(struct page_cgroup *pc)	\
 	{ return test_and_clear_bit(PCG_##lname, &pc->flags);  }
 
+#define TESTSETPCGFLAG(uname, lname)			\
+static inline int TestSetPageCgroup##uname(struct page_cgroup *pc)	\
+	{ return test_and_set_bit(PCG_##lname, &pc->flags); }
+
 /* Cache flag is set only once (at allocation) */
 TESTPCGFLAG(Cache, CACHE)
 CLEARPCGFLAG(Cache, CACHE)
@@ -86,6 +93,22 @@ SETPCGFLAG(FileMapped, FILE_MAPPED)
 CLEARPCGFLAG(FileMapped, FILE_MAPPED)
 TESTPCGFLAG(FileMapped, FILE_MAPPED)
 
+SETPCGFLAG(FileDirty, FILE_DIRTY)
+CLEARPCGFLAG(FileDirty, FILE_DIRTY)
+TESTPCGFLAG(FileDirty, FILE_DIRTY)
+TESTCLEARPCGFLAG(FileDirty, FILE_DIRTY)
+TESTSETPCGFLAG(FileDirty, FILE_DIRTY)
+
+SETPCGFLAG(FileWriteback, FILE_WRITEBACK)
+CLEARPCGFLAG(FileWriteback, FILE_WRITEBACK)
+TESTPCGFLAG(FileWriteback, FILE_WRITEBACK)
+
+SETPCGFLAG(FileUnstableNFS, FILE_UNSTABLE_NFS)
+CLEARPCGFLAG(FileUnstableNFS, FILE_UNSTABLE_NFS)
+TESTPCGFLAG(FileUnstableNFS, FILE_UNSTABLE_NFS)
+TESTCLEARPCGFLAG(FileUnstableNFS, FILE_UNSTABLE_NFS)
+TESTSETPCGFLAG(FileUnstableNFS, FILE_UNSTABLE_NFS)
+
 SETPCGFLAG(Migration, MIGRATION)
 CLEARPCGFLAG(Migration, MIGRATION)
 TESTPCGFLAG(Migration, MIGRATION)
-- 
1.7.3.1

* [PATCH v9 03/13] memcg: add dirty page accounting infrastructure
From: Greg Thelen @ 2011-08-17 16:14 UTC (permalink / raw)
  To: Andrew Morton
  Cc: linux-kernel, linux-mm, containers, linux-fsdevel,
	KAMEZAWA Hiroyuki, Balbir Singh, Daisuke Nishimura, Minchan Kim,
	Johannes Weiner, Wu Fengguang, Dave Chinner, Vivek Goyal,
	Andrea Righi, Ciju Rajan K, David Rientjes, Greg Thelen

Add memcg routines to count dirty, writeback, and unstable_NFS pages.
These routines are not yet used by the kernel to count such pages.  A
later change adds kernel calls to these new routines.

As inode pages are marked dirty, if the dirtied page's cgroup differs
from the inode's cgroup, then the inode is marked as shared across several
cgroups.

Signed-off-by: Greg Thelen <gthelen@google.com>
Signed-off-by: Andrea Righi <andrea@betterlinux.com>
---
Changelog since v8:
- In v8 this patch was applied after 'memcg: add mem_cgroup_mark_inode_dirty()'.
  In this version (v9), this patch comes first.  The result is that this patch
  does not contain code to mark inode with I_MEMCG_SHARED.  That logic is
  deferred until the later 'memcg: add mem_cgroup_mark_inode_dirty()' patch.

 include/linux/memcontrol.h |    8 ++++-
 mm/memcontrol.c            |   87 ++++++++++++++++++++++++++++++++++++++++----
 2 files changed, 86 insertions(+), 9 deletions(-)

diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
index 5633f51..e6af3a9 100644
--- a/include/linux/memcontrol.h
+++ b/include/linux/memcontrol.h
@@ -27,9 +27,15 @@ struct page_cgroup;
 struct page;
 struct mm_struct;
 
-/* Stats that can be updated by kernel. */
+/*
+ * Per mem_cgroup page counts tracked by kernel.  As pages enter and leave these
+ * states, the kernel notifies memcg using mem_cgroup_{inc,dec}_page_stat().
+ */
 enum mem_cgroup_page_stat_item {
 	MEMCG_NR_FILE_MAPPED, /* # of pages charged as file rss */
+	MEMCG_NR_FILE_DIRTY, /* # of dirty pages in page cache */
+	MEMCG_NR_FILE_WRITEBACK, /* # of pages under writeback */
+	MEMCG_NR_FILE_UNSTABLE_NFS, /* # of NFS unstable pages */
 };
 
 extern unsigned long mem_cgroup_isolate_pages(unsigned long nr_to_scan,
diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index c6faa32..723b8bf 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -84,8 +84,11 @@ enum mem_cgroup_stat_index {
 	 */
 	MEM_CGROUP_STAT_CACHE, 	   /* # of pages charged as cache */
 	MEM_CGROUP_STAT_RSS,	   /* # of pages charged as anon rss */
-	MEM_CGROUP_STAT_FILE_MAPPED,  /* # of pages charged as file rss */
 	MEM_CGROUP_STAT_SWAPOUT, /* # of pages, swapped out */
+	MEM_CGROUP_STAT_FILE_MAPPED,  /* # of pages charged as file rss */
+	MEM_CGROUP_STAT_FILE_DIRTY,	/* # of dirty pages in page cache */
+	MEM_CGROUP_STAT_FILE_WRITEBACK,		/* # of pages under writeback */
+	MEM_CGROUP_STAT_FILE_UNSTABLE_NFS,	/* # of NFS unstable pages */
 	MEM_CGROUP_STAT_DATA, /* end of data requires synchronization */
 	MEM_CGROUP_ON_MOVE,	/* someone is moving account between groups */
 	MEM_CGROUP_STAT_NSTATS,
@@ -2066,6 +2069,44 @@ void mem_cgroup_update_page_stat(struct page *page,
 			ClearPageCgroupFileMapped(pc);
 		idx = MEM_CGROUP_STAT_FILE_MAPPED;
 		break;
+
+	case MEMCG_NR_FILE_DIRTY:
+		/* Use Test{Set,Clear} to only un/charge the memcg once. */
+		if (val > 0) {
+			if (TestSetPageCgroupFileDirty(pc))
+				val = 0;
+		} else {
+			if (!TestClearPageCgroupFileDirty(pc))
+				val = 0;
+		}
+		idx = MEM_CGROUP_STAT_FILE_DIRTY;
+		break;
+
+	case MEMCG_NR_FILE_WRITEBACK:
+		/*
+		 * This counter is adjusted while holding the mapping's
+		 * tree_lock.  Therefore there is no race between settings and
+		 * clearing of this flag.
+		 */
+		if (val > 0)
+			SetPageCgroupFileWriteback(pc);
+		else
+			ClearPageCgroupFileWriteback(pc);
+		idx = MEM_CGROUP_STAT_FILE_WRITEBACK;
+		break;
+
+	case MEMCG_NR_FILE_UNSTABLE_NFS:
+		/* Use Test{Set,Clear} to only un/charge the memcg once. */
+		if (val > 0) {
+			if (TestSetPageCgroupFileUnstableNFS(pc))
+				val = 0;
+		} else {
+			if (!TestClearPageCgroupFileUnstableNFS(pc))
+				val = 0;
+		}
+		idx = MEM_CGROUP_STAT_FILE_UNSTABLE_NFS;
+		break;
+
 	default:
 		BUG();
 	}
@@ -2663,6 +2704,17 @@ void mem_cgroup_split_huge_fixup(struct page *head, struct page *tail)
 }
 #endif
 
+static inline
+void mem_cgroup_move_account_page_stat(struct mem_cgroup *from,
+				       struct mem_cgroup *to,
+				       enum mem_cgroup_stat_index idx)
+{
+	preempt_disable();
+	__this_cpu_dec(from->stat->count[idx]);
+	__this_cpu_inc(to->stat->count[idx]);
+	preempt_enable();
+}
+
 /**
  * mem_cgroup_move_account - move account of the page
  * @page: the page
@@ -2711,13 +2763,18 @@ static int mem_cgroup_move_account(struct page *page,
 
 	move_lock_page_cgroup(pc, &flags);
 
-	if (PageCgroupFileMapped(pc)) {
-		/* Update mapped_file data for mem_cgroup */
-		preempt_disable();
-		__this_cpu_dec(from->stat->count[MEM_CGROUP_STAT_FILE_MAPPED]);
-		__this_cpu_inc(to->stat->count[MEM_CGROUP_STAT_FILE_MAPPED]);
-		preempt_enable();
-	}
+	if (PageCgroupFileMapped(pc))
+		mem_cgroup_move_account_page_stat(from, to,
+					MEM_CGROUP_STAT_FILE_MAPPED);
+	if (PageCgroupFileDirty(pc))
+		mem_cgroup_move_account_page_stat(from, to,
+						  MEM_CGROUP_STAT_FILE_DIRTY);
+	if (PageCgroupFileWriteback(pc))
+		mem_cgroup_move_account_page_stat(from, to,
+					MEM_CGROUP_STAT_FILE_WRITEBACK);
+	if (PageCgroupFileUnstableNFS(pc))
+		mem_cgroup_move_account_page_stat(from, to,
+					MEM_CGROUP_STAT_FILE_UNSTABLE_NFS);
 	mem_cgroup_charge_statistics(from, PageCgroupCache(pc), -nr_pages);
 	if (uncharge)
 		/* This is not "cancel", but cancel_charge does all we need. */
@@ -4147,6 +4204,9 @@ enum {
 	MCS_SWAP,
 	MCS_PGFAULT,
 	MCS_PGMAJFAULT,
+	MCS_FILE_DIRTY,
+	MCS_WRITEBACK,
+	MCS_UNSTABLE_NFS,
 	MCS_INACTIVE_ANON,
 	MCS_ACTIVE_ANON,
 	MCS_INACTIVE_FILE,
@@ -4171,6 +4231,9 @@ struct {
 	{"swap", "total_swap"},
 	{"pgfault", "total_pgfault"},
 	{"pgmajfault", "total_pgmajfault"},
+	{"dirty", "total_dirty"},
+	{"writeback", "total_writeback"},
+	{"nfs_unstable", "total_nfs_unstable"},
 	{"inactive_anon", "total_inactive_anon"},
 	{"active_anon", "total_active_anon"},
 	{"inactive_file", "total_inactive_file"},
@@ -4204,6 +4267,14 @@ mem_cgroup_get_local_stat(struct mem_cgroup *mem, struct mcs_total_stat *s)
 	val = mem_cgroup_read_events(mem, MEM_CGROUP_EVENTS_PGMAJFAULT);
 	s->stat[MCS_PGMAJFAULT] += val;
 
+	/* dirty stat */
+	val = mem_cgroup_read_stat(mem, MEM_CGROUP_STAT_FILE_DIRTY);
+	s->stat[MCS_FILE_DIRTY] += val * PAGE_SIZE;
+	val = mem_cgroup_read_stat(mem, MEM_CGROUP_STAT_FILE_WRITEBACK);
+	s->stat[MCS_WRITEBACK] += val * PAGE_SIZE;
+	val = mem_cgroup_read_stat(mem, MEM_CGROUP_STAT_FILE_UNSTABLE_NFS);
+	s->stat[MCS_UNSTABLE_NFS] += val * PAGE_SIZE;
+
 	/* per zone stat */
 	val = mem_cgroup_nr_lru_pages(mem, BIT(LRU_INACTIVE_ANON));
 	s->stat[MCS_INACTIVE_ANON] += val * PAGE_SIZE;
-- 
1.7.3.1

* [PATCH v9 04/13] memcg: add kernel calls for memcg dirty page stats
From: Greg Thelen @ 2011-08-17 16:14 UTC (permalink / raw)
  To: Andrew Morton
  Cc: linux-kernel, linux-mm, containers, linux-fsdevel,
	KAMEZAWA Hiroyuki, Balbir Singh, Daisuke Nishimura, Minchan Kim,
	Johannes Weiner, Wu Fengguang, Dave Chinner, Vivek Goyal,
	Andrea Righi, Ciju Rajan K, David Rientjes, Greg Thelen

Add calls into memcg dirty page accounting.  Notify memcg when pages
transition between clean, file dirty, writeback, and unstable nfs.  This
allows the memory controller to maintain an accurate view of the amount
of its memory that is dirty.

Signed-off-by: Greg Thelen <gthelen@google.com>
Signed-off-by: Andrea Righi <andrea@betterlinux.com>
Acked-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Reviewed-by: Daisuke Nishimura <nishimura@mxp.nes.nec.co.jp>
Reviewed-by: Minchan Kim <minchan.kim@gmail.com>
---
 fs/nfs/write.c      |    4 ++++
 mm/filemap.c        |    1 +
 mm/page-writeback.c |    4 ++++
 mm/truncate.c       |    1 +
 4 files changed, 10 insertions(+), 0 deletions(-)

diff --git a/fs/nfs/write.c b/fs/nfs/write.c
index b39b37f8..f033983 100644
--- a/fs/nfs/write.c
+++ b/fs/nfs/write.c
@@ -449,6 +449,7 @@ nfs_mark_request_commit(struct nfs_page *req, struct pnfs_layout_segment *lseg)
 	nfsi->ncommit++;
 	spin_unlock(&inode->i_lock);
 	pnfs_mark_request_commit(req, lseg);
+	mem_cgroup_inc_page_stat(req->wb_page, MEMCG_NR_FILE_UNSTABLE_NFS);
 	inc_zone_page_state(req->wb_page, NR_UNSTABLE_NFS);
 	inc_bdi_stat(req->wb_page->mapping->backing_dev_info, BDI_RECLAIMABLE);
 	__mark_inode_dirty(inode, I_DIRTY_DATASYNC);
@@ -460,6 +461,7 @@ nfs_clear_request_commit(struct nfs_page *req)
 	struct page *page = req->wb_page;
 
 	if (test_and_clear_bit(PG_CLEAN, &(req)->wb_flags)) {
+		mem_cgroup_dec_page_stat(page, MEMCG_NR_FILE_UNSTABLE_NFS);
 		dec_zone_page_state(page, NR_UNSTABLE_NFS);
 		dec_bdi_stat(page->mapping->backing_dev_info, BDI_RECLAIMABLE);
 		return 1;
@@ -1408,6 +1410,8 @@ void nfs_retry_commit(struct list_head *page_list,
 		req = nfs_list_entry(page_list->next);
 		nfs_list_remove_request(req);
 		nfs_mark_request_commit(req, lseg);
+		mem_cgroup_dec_page_stat(req->wb_page,
+					 MEMCG_NR_FILE_UNSTABLE_NFS);
 		dec_zone_page_state(req->wb_page, NR_UNSTABLE_NFS);
 		dec_bdi_stat(req->wb_page->mapping->backing_dev_info,
 			     BDI_RECLAIMABLE);
diff --git a/mm/filemap.c b/mm/filemap.c
index 645a080..acf2382 100644
--- a/mm/filemap.c
+++ b/mm/filemap.c
@@ -142,6 +142,7 @@ void __delete_from_page_cache(struct page *page)
 	 * having removed the page entirely.
 	 */
 	if (PageDirty(page) && mapping_cap_account_dirty(mapping)) {
+		mem_cgroup_dec_page_stat(page, MEMCG_NR_FILE_DIRTY);
 		dec_zone_page_state(page, NR_FILE_DIRTY);
 		dec_bdi_stat(mapping->backing_dev_info, BDI_RECLAIMABLE);
 	}
diff --git a/mm/page-writeback.c b/mm/page-writeback.c
index 938d943..b1f2390 100644
--- a/mm/page-writeback.c
+++ b/mm/page-writeback.c
@@ -1328,6 +1328,7 @@ int __set_page_dirty_no_writeback(struct page *page)
 void account_page_dirtied(struct page *page, struct address_space *mapping)
 {
 	if (mapping_cap_account_dirty(mapping)) {
+		mem_cgroup_inc_page_stat(page, MEMCG_NR_FILE_DIRTY);
 		__inc_zone_page_state(page, NR_FILE_DIRTY);
 		__inc_zone_page_state(page, NR_DIRTIED);
 		__inc_bdi_stat(mapping->backing_dev_info, BDI_RECLAIMABLE);
@@ -1344,6 +1345,7 @@ EXPORT_SYMBOL(account_page_dirtied);
  */
 void account_page_writeback(struct page *page)
 {
+	mem_cgroup_inc_page_stat(page, MEMCG_NR_FILE_WRITEBACK);
 	inc_zone_page_state(page, NR_WRITEBACK);
 }
 EXPORT_SYMBOL(account_page_writeback);
@@ -1526,6 +1528,7 @@ int clear_page_dirty_for_io(struct page *page)
 		 * for more comments.
 		 */
 		if (TestClearPageDirty(page)) {
+			mem_cgroup_dec_page_stat(page, MEMCG_NR_FILE_DIRTY);
 			dec_zone_page_state(page, NR_FILE_DIRTY);
 			dec_bdi_stat(mapping->backing_dev_info,
 					BDI_RECLAIMABLE);
@@ -1562,6 +1565,7 @@ int test_clear_page_writeback(struct page *page)
 		ret = TestClearPageWriteback(page);
 	}
 	if (ret) {
+		mem_cgroup_dec_page_stat(page, MEMCG_NR_FILE_WRITEBACK);
 		dec_zone_page_state(page, NR_WRITEBACK);
 		inc_zone_page_state(page, NR_WRITTEN);
 	}
diff --git a/mm/truncate.c b/mm/truncate.c
index b40ac6d..bb85b76 100644
--- a/mm/truncate.c
+++ b/mm/truncate.c
@@ -76,6 +76,7 @@ void cancel_dirty_page(struct page *page, unsigned int account_size)
 	if (TestClearPageDirty(page)) {
 		struct address_space *mapping = page->mapping;
 		if (mapping && mapping_cap_account_dirty(mapping)) {
+			mem_cgroup_dec_page_stat(page, MEMCG_NR_FILE_DIRTY);
 			dec_zone_page_state(page, NR_FILE_DIRTY);
 			dec_bdi_stat(mapping->backing_dev_info,
 					BDI_RECLAIMABLE);
-- 
1.7.3.1

* [PATCH v9 05/13] memcg: add mem_cgroup_mark_inode_dirty()
  2011-08-17 16:14 ` Greg Thelen
@ 2011-08-17 16:14   ` Greg Thelen
  -1 siblings, 0 replies; 72+ messages in thread
From: Greg Thelen @ 2011-08-17 16:14 UTC (permalink / raw)
  To: Andrew Morton
  Cc: linux-kernel, linux-mm, containers, linux-fsdevel,
	KAMEZAWA Hiroyuki, Balbir Singh, Daisuke Nishimura, Minchan Kim,
	Johannes Weiner, Wu Fengguang, Dave Chinner, Vivek Goyal,
	Andrea Righi, Ciju Rajan K, David Rientjes, Greg Thelen

Create the mem_cgroup_mark_inode_dirty() routine, which is called when
an inode is marked dirty.  In kernels without memcg, this is an inline
no-op.

Add i_memcg field to struct address_space.  When an inode is marked
dirty with mem_cgroup_mark_inode_dirty(), the css_id of current memcg is
recorded in i_memcg.  Per-memcg writeback (introduced in a later
change) uses this field to isolate inodes associated with a particular
memcg.

The type of i_memcg is an 'unsigned short' because it stores the css_id
of the memcg.  Using a struct mem_cgroup pointer would be larger and
also create a reference on the memcg which would hang memcg rmdir
deletion.  Usage of a css_id is not a reference so cgroup deletion is
not affected.  The memcg can be deleted without cleaning up the i_memcg
field.  When a memcg is deleted its pages are recharged to the cgroup
parent, and the related inode(s) are marked as shared thus
disassociating the inodes from the deleted cgroup.

A mem_cgroup_mark_inode_dirty() tracepoint is also included to allow for
easier understanding of memcg writeback operation.
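
For illustration only, a minimal userspace sketch of the i_memcg rule
described above (the struct and helper names are invented for this
example; it models the logic and is not kernel code):

  #include <stdio.h>

  #define I_MEMCG_SHARED 0        /* wildcard css_id, as in the patch */

  /* Toy stand-in for the i_memcg field added to struct address_space. */
  struct mapping_model {
          unsigned short i_memcg;
  };

  /* Models mem_cgroup_mark_inode_dirty(): remember the dirtying memcg. */
  static void mark_inode_dirty_model(struct mapping_model *m,
                                     unsigned short css_id)
  {
          m->i_memcg = css_id;
  }

  /*
   * Models the MEMCG_NR_FILE_DIRTY case in mem_cgroup_update_page_stat():
   * a second, different dirtier demotes the inode to the shared wildcard.
   */
  static void account_page_dirtied_by(struct mapping_model *m,
                                      unsigned short css_id)
  {
          if (m->i_memcg != I_MEMCG_SHARED && m->i_memcg != css_id)
                  m->i_memcg = I_MEMCG_SHARED;
  }

  int main(void)
  {
          struct mapping_model m = { .i_memcg = I_MEMCG_SHARED };

          mark_inode_dirty_model(&m, 3);   /* css_id 3 dirties the inode  */
          account_page_dirtied_by(&m, 3);  /* same memcg: stays private   */
          printf("i_memcg=%d\n", m.i_memcg);      /* prints 3 */

          account_page_dirtied_by(&m, 5);  /* different memcg: now shared */
          printf("i_memcg=%d\n", m.i_memcg);      /* prints 0 */
          return 0;
  }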

Signed-off-by: Greg Thelen <gthelen@google.com>
---
Changelog since v8:
- Use I_MEMCG_SHARED when initializing i_memcg.

- Use 'memcg' rather than 'mem' for local variables.  This is consistent with
  other memory controller code.

- The logic in mem_cgroup_update_page_stat() and mem_cgroup_move_account() which
  marks inodes I_MEMCG_SHARED is now part of this patch.  This makes more sense
  because this is the patch that introduces the shared-inode concept.

 fs/fs-writeback.c                 |    2 +
 fs/inode.c                        |    3 ++
 include/linux/fs.h                |    9 +++++++
 include/linux/memcontrol.h        |    6 ++++
 include/trace/events/memcontrol.h |   32 +++++++++++++++++++++++++
 mm/memcontrol.c                   |   47 ++++++++++++++++++++++++++++++++++++-
 6 files changed, 98 insertions(+), 1 deletions(-)
 create mode 100644 include/trace/events/memcontrol.h

diff --git a/fs/fs-writeback.c b/fs/fs-writeback.c
index 04cf3b9..6bf4c49 100644
--- a/fs/fs-writeback.c
+++ b/fs/fs-writeback.c
@@ -19,6 +19,7 @@
 #include <linux/slab.h>
 #include <linux/sched.h>
 #include <linux/fs.h>
+#include <linux/memcontrol.h>
 #include <linux/mm.h>
 #include <linux/kthread.h>
 #include <linux/freezer.h>
@@ -1111,6 +1112,7 @@ void __mark_inode_dirty(struct inode *inode, int flags)
 			spin_lock(&bdi->wb.list_lock);
 			inode->dirtied_when = jiffies;
 			list_move(&inode->i_wb_list, &bdi->wb.b_dirty);
+			mem_cgroup_mark_inode_dirty(inode);
 			spin_unlock(&bdi->wb.list_lock);
 
 			if (wakeup_bdi)
diff --git a/fs/inode.c b/fs/inode.c
index 5aab80d..87f0fcd 100644
--- a/fs/inode.c
+++ b/fs/inode.c
@@ -176,6 +176,9 @@ int inode_init_always(struct super_block *sb, struct inode *inode)
 	mapping->assoc_mapping = NULL;
 	mapping->backing_dev_info = &default_backing_dev_info;
 	mapping->writeback_index = 0;
+#ifdef CONFIG_CGROUP_MEM_RES_CTLR
+	mapping->i_memcg = I_MEMCG_SHARED;
+#endif
 
 	/*
 	 * If the block_device provides a backing_dev_info for client
diff --git a/include/linux/fs.h b/include/linux/fs.h
index 0f496c2..417e9b93b 100644
--- a/include/linux/fs.h
+++ b/include/linux/fs.h
@@ -651,6 +651,9 @@ struct address_space {
 	spinlock_t		private_lock;	/* for use by the address_space */
 	struct list_head	private_list;	/* ditto */
 	struct address_space	*assoc_mapping;	/* ditto */
+#ifdef CONFIG_CGROUP_MEM_RES_CTLR
+	unsigned short		i_memcg;	/* css_id of memcg dirtier */
+#endif
 } __attribute__((aligned(sizeof(long))));
 	/*
 	 * On most architectures that alignment is already the case; but
@@ -658,6 +661,12 @@ struct address_space {
 	 * of struct page's "mapping" pointer be used for PAGE_MAPPING_ANON.
 	 */
 
+/*
+ * When an address_space is shared by multiple memcg dirtiers, then i_memcg is
+ * set to this special, wildcard, css_id value (zero).
+ */
+#define I_MEMCG_SHARED 0
+
 struct block_device {
 	dev_t			bd_dev;  /* not a kdev_t - it's a search key */
 	int			bd_openers;
diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
index e6af3a9..630d3fa 100644
--- a/include/linux/memcontrol.h
+++ b/include/linux/memcontrol.h
@@ -119,6 +119,8 @@ mem_cgroup_prepare_migration(struct page *page,
 extern void mem_cgroup_end_migration(struct mem_cgroup *mem,
 	struct page *oldpage, struct page *newpage, bool migration_ok);
 
+void mem_cgroup_mark_inode_dirty(struct inode *inode);
+
 /*
  * For memory reclaim.
  */
@@ -297,6 +299,10 @@ static inline void mem_cgroup_end_migration(struct mem_cgroup *mem,
 {
 }
 
+static inline void mem_cgroup_mark_inode_dirty(struct inode *inode)
+{
+}
+
 static inline int mem_cgroup_get_reclaim_priority(struct mem_cgroup *mem)
 {
 	return 0;
diff --git a/include/trace/events/memcontrol.h b/include/trace/events/memcontrol.h
new file mode 100644
index 0000000..781ef9fc
--- /dev/null
+++ b/include/trace/events/memcontrol.h
@@ -0,0 +1,32 @@
+#undef TRACE_SYSTEM
+#define TRACE_SYSTEM memcontrol
+
+#if !defined(_TRACE_MEMCONTROL_H) || defined(TRACE_HEADER_MULTI_READ)
+#define _TRACE_MEMCONTROL_H
+
+#include <linux/types.h>
+#include <linux/tracepoint.h>
+
+TRACE_EVENT(mem_cgroup_mark_inode_dirty,
+	TP_PROTO(struct inode *inode),
+
+	TP_ARGS(inode),
+
+	TP_STRUCT__entry(
+		__field(unsigned long, ino)
+		__field(unsigned short, css_id)
+		),
+
+	TP_fast_assign(
+		__entry->ino = inode->i_ino;
+		__entry->css_id =
+			inode->i_mapping ? inode->i_mapping->i_memcg : 0;
+		),
+
+	TP_printk("ino=%ld css_id=%d", __entry->ino, __entry->css_id)
+)
+
+#endif /* _TRACE_MEMCONTROL_H */
+
+/* This part must be outside protection */
+#include <trace/define_trace.h>
diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index 723b8bf..eda0d9a 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -55,6 +55,9 @@
 
 #include <trace/events/vmscan.h>
 
+#define CREATE_TRACE_POINTS
+#include <trace/events/memcontrol.h>
+
 struct cgroup_subsys mem_cgroup_subsys __read_mostly;
 #define MEM_CGROUP_RECLAIM_RETRIES	5
 struct mem_cgroup *root_mem_cgroup __read_mostly;
@@ -1174,6 +1177,27 @@ static int calc_inactive_ratio(struct mem_cgroup *memcg, unsigned long *present_
 	return inactive_ratio;
 }
 
+/*
+ * Mark the current task's memcg as the memcg associated with inode.  Note: the
+ * recorded cgroup css_id is not guaranteed to remain correct.  The current task
+ * may be moved to another cgroup.  The memcg may also be deleted before the
+ * caller has time to use the i_memcg.
+ */
+void mem_cgroup_mark_inode_dirty(struct inode *inode)
+{
+	struct mem_cgroup *memcg;
+	unsigned short id;
+
+	rcu_read_lock();
+	memcg = mem_cgroup_from_task(current);
+	id = memcg ? css_id(&memcg->css) : I_MEMCG_SHARED;
+	rcu_read_unlock();
+
+	inode->i_mapping->i_memcg = id;
+
+	trace_mem_cgroup_mark_inode_dirty(inode);
+}
+
 int mem_cgroup_inactive_anon_is_low(struct mem_cgroup *memcg)
 {
 	unsigned long active;
@@ -2041,6 +2065,7 @@ void mem_cgroup_update_page_stat(struct page *page,
 {
 	struct mem_cgroup *mem;
 	struct page_cgroup *pc = lookup_page_cgroup(page);
+	struct address_space *mapping;
 	bool need_unlock = false;
 	unsigned long uninitialized_var(flags);
 
@@ -2073,8 +2098,18 @@ void mem_cgroup_update_page_stat(struct page *page,
 	case MEMCG_NR_FILE_DIRTY:
 		/* Use Test{Set,Clear} to only un/charge the memcg once. */
 		if (val > 0) {
+			mapping = page_mapping(page);
 			if (TestSetPageCgroupFileDirty(pc))
 				val = 0;
+			else if (mapping &&
+				 (mapping->i_memcg != I_MEMCG_SHARED) &&
+				 (mapping->i_memcg != css_id(&mem->css)))
+				/*
+				 * If the inode is being dirtied by a memcg
+				 * other than the one that marked it dirty, then
+				 * mark the inode shared by multiple memcg.
+				 */
+				mapping->i_memcg = I_MEMCG_SHARED;
 		} else {
 			if (!TestClearPageCgroupFileDirty(pc))
 				val = 0;
@@ -2766,9 +2801,19 @@ static int mem_cgroup_move_account(struct page *page,
 	if (PageCgroupFileMapped(pc))
 		mem_cgroup_move_account_page_stat(from, to,
 					MEM_CGROUP_STAT_FILE_MAPPED);
-	if (PageCgroupFileDirty(pc))
+	if (PageCgroupFileDirty(pc)) {
 		mem_cgroup_move_account_page_stat(from, to,
 						  MEM_CGROUP_STAT_FILE_DIRTY);
+		/*
+		 * Moving a dirty file page between memcg makes the underlying
+		 * inode shared.  If the new (to) cgroup attempts writeback it
+		 * should consider this inode.  If the old (from) cgroup
+		 * attempts writeback it likely has other pages in the same
+		 * inode.  The inode is now shared by the to and from cgroups.
+		 * So mark the inode as shared.
+		 */
+		page_mapping(page)->i_memcg = I_MEMCG_SHARED;
+	}
 	if (PageCgroupFileWriteback(pc))
 		mem_cgroup_move_account_page_stat(from, to,
 					MEM_CGROUP_STAT_FILE_WRITEBACK);
-- 
1.7.3.1


^ permalink raw reply related	[flat|nested] 72+ messages in thread

* [PATCH v9 06/13] memcg: add dirty limits to mem_cgroup
  2011-08-17 16:14 ` Greg Thelen
@ 2011-08-17 16:14   ` Greg Thelen
  -1 siblings, 0 replies; 72+ messages in thread
From: Greg Thelen @ 2011-08-17 16:14 UTC (permalink / raw)
  To: Andrew Morton
  Cc: linux-kernel, linux-mm, containers, linux-fsdevel,
	KAMEZAWA Hiroyuki, Balbir Singh, Daisuke Nishimura, Minchan Kim,
	Johannes Weiner, Wu Fengguang, Dave Chinner, Vivek Goyal,
	Andrea Righi, Ciju Rajan K, David Rientjes, Greg Thelen

Extend mem_cgroup to contain dirty page limits.
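
For illustration only, a userspace sketch of the intended behaviour: a
memcg without local settings falls back to the system-wide vm.dirty_*
values (the names memcg_model and dirty_param_snapshot are invented for
this example):

  #include <stdbool.h>
  #include <stdio.h>

  /* Stand-ins for the global /proc/sys/vm/dirty_* knobs. */
  static int sys_dirty_ratio = 20;
  static unsigned long sys_dirty_bytes;
  static int sys_dirty_background_ratio = 10;
  static unsigned long sys_dirty_background_bytes;

  struct vm_dirty_param_model {
          int dirty_ratio;
          int dirty_background_ratio;
          unsigned long dirty_bytes;
          unsigned long dirty_background_bytes;
  };

  struct memcg_model {
          bool has_local_limits;  /* models mem_cgroup_has_dirty_limit() */
          struct vm_dirty_param_model dirty_param;
  };

  /* Snapshot the effective limits: per-memcg if set, else system-wide. */
  static void dirty_param_snapshot(struct vm_dirty_param_model *p,
                                   const struct memcg_model *memcg)
  {
          if (memcg && memcg->has_local_limits) {
                  *p = memcg->dirty_param;
          } else {
                  p->dirty_ratio = sys_dirty_ratio;
                  p->dirty_bytes = sys_dirty_bytes;
                  p->dirty_background_ratio = sys_dirty_background_ratio;
                  p->dirty_background_bytes = sys_dirty_background_bytes;
          }
  }

  int main(void)
  {
          struct memcg_model root = { .has_local_limits = false };
          struct vm_dirty_param_model p;

          dirty_param_snapshot(&p, &root);
          /* no local limits: prints the system defaults 20 and 10 */
          printf("ratio=%d bg_ratio=%d\n",
                 p.dirty_ratio, p.dirty_background_ratio);
          return 0;
  }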

Signed-off-by: Greg Thelen <gthelen@google.com>
Signed-off-by: Andrea Righi <andrea@betterlinux.com>
Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
---
Changelog since v8:
- Use 'memcg' rather than 'mem' for local variables and parameters.
  This is consistent with other memory controller code.

 mm/memcontrol.c |   50 +++++++++++++++++++++++++++++++++++++++++++++++++-
 1 files changed, 49 insertions(+), 1 deletions(-)

diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index eda0d9a..070c4ab 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -254,6 +254,13 @@ const char *scanstat_string[NR_SCANSTATS] = {
 #define SCANSTAT_WORD_SYSTEM	"_by_system"
 #define SCANSTAT_WORD_HIERARCHY	"_under_hierarchy"
 
+/* Dirty memory parameters */
+struct vm_dirty_param {
+	int dirty_ratio;
+	int dirty_background_ratio;
+	unsigned long dirty_bytes;
+	unsigned long dirty_background_bytes;
+};
 
 /*
  * The memory controller data structure. The memory controller controls both
@@ -303,6 +310,10 @@ struct mem_cgroup {
 	atomic_t	refcnt;
 
 	int	swappiness;
+
+	/* control memory cgroup dirty pages */
+	struct vm_dirty_param dirty_param;
+
 	/* OOM-Killer disable */
 	int		oom_kill_disable;
 
@@ -1348,6 +1359,36 @@ int mem_cgroup_swappiness(struct mem_cgroup *memcg)
 	return memcg->swappiness;
 }
 
+/*
+ * Return true if the current memory cgroup has local dirty memory settings.
+ * There is an allowed race between the current task migrating in-to/out-of the
+ * root cgroup while this routine runs.  So the return value may be incorrect if
+ * the current task is being simultaneously migrated.
+ */
+static bool mem_cgroup_has_dirty_limit(struct mem_cgroup *memcg)
+{
+	return memcg && !mem_cgroup_is_root(memcg);
+}
+
+/*
+ * Returns a snapshot of the current dirty limits which is not synchronized with
+ * the routines that change the dirty limits.  If this routine races with an
+ * update to the dirty bytes/ratio value, then the caller must handle the case
+ * where neither dirty_[background_]_ratio nor _bytes are set.
+ */
+static void mem_cgroup_dirty_param(struct vm_dirty_param *param,
+				   struct mem_cgroup *memcg)
+{
+	if (mem_cgroup_has_dirty_limit(memcg)) {
+		*param = memcg->dirty_param;
+	} else {
+		param->dirty_ratio = vm_dirty_ratio;
+		param->dirty_bytes = vm_dirty_bytes;
+		param->dirty_background_ratio = dirty_background_ratio;
+		param->dirty_background_bytes = dirty_background_bytes;
+	}
+}
+
 static void mem_cgroup_start_move(struct mem_cgroup *mem)
 {
 	int cpu;
@@ -5220,8 +5261,15 @@ mem_cgroup_create(struct cgroup_subsys *ss, struct cgroup *cont)
 	mem->last_scanned_node = MAX_NUMNODES;
 	INIT_LIST_HEAD(&mem->oom_notify);
 
-	if (parent)
+	if (parent) {
 		mem->swappiness = mem_cgroup_swappiness(parent);
+		mem_cgroup_dirty_param(&mem->dirty_param, parent);
+	} else {
+		/*
+		 * The root cgroup dirty_param field is not used, instead,
+		 * system-wide dirty limits are used.
+		 */
+	}
 	atomic_set(&mem->refcnt, 1);
 	mem->move_charge_at_immigrate = 0;
 	mutex_init(&mem->thresholds_lock);
-- 
1.7.3.1


^ permalink raw reply related	[flat|nested] 72+ messages in thread

* [PATCH v9 07/13] memcg: add cgroupfs interface to memcg dirty limits
  2011-08-17 16:14 ` Greg Thelen
@ 2011-08-17 16:14   ` Greg Thelen
  -1 siblings, 0 replies; 72+ messages in thread
From: Greg Thelen @ 2011-08-17 16:14 UTC (permalink / raw)
  To: Andrew Morton
  Cc: linux-kernel, linux-mm, containers, linux-fsdevel,
	KAMEZAWA Hiroyuki, Balbir Singh, Daisuke Nishimura, Minchan Kim,
	Johannes Weiner, Wu Fengguang, Dave Chinner, Vivek Goyal,
	Andrea Righi, Ciju Rajan K, David Rientjes, Greg Thelen,
	Balbir Singh

Add cgroupfs interface to memcg dirty page limits:
  Direct write-out is controlled with:
  - memory.dirty_ratio
  - memory.dirty_limit_in_bytes

  Background write-out is controlled with:
  - memory.dirty_background_ratio
  - memory.dirty_background_limit_in_bytes

Other memcg cgroupfs files support 'M', 'm', 'k', 'K', 'g'
and 'G' suffixes for byte counts.  This patch provides the
same functionality for memory.dirty_limit_in_bytes and
memory.dirty_background_limit_in_bytes.
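
For illustration only, a small C example of setting one of these limits
from userspace; the mount point and cgroup name below are hypothetical
and depend on where the memory controller is mounted:

  #include <fcntl.h>
  #include <stdio.h>
  #include <string.h>
  #include <unistd.h>

  int main(void)
  {
          /* Hypothetical path; adjust to the actual memcg mount point. */
          const char *path =
                  "/dev/cgroup/memory/grp0/memory.dirty_limit_in_bytes";
          const char *val = "256M\n";  /* suffixes parsed like other memcg files */
          int fd = open(path, O_WRONLY);

          if (fd < 0) {
                  perror("open");
                  return 1;
          }
          if (write(fd, val, strlen(val)) < 0)
                  perror("write");
          close(fd);
          return 0;
  }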

Signed-off-by: Andrea Righi <andrea@betterlinux.com>
Signed-off-by: Balbir Singh <balbir@linux.vnet.ibm.com>
Signed-off-by: Greg Thelen <gthelen@google.com>
---
Changelog since v8:
- Use 'memcg' rather than 'mem' for local variables and parameters.
  This is consistent with other memory controller code.

 mm/memcontrol.c |  115 +++++++++++++++++++++++++++++++++++++++++++++++++++++++
 1 files changed, 115 insertions(+), 0 deletions(-)

diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index 070c4ab..4e01699 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -121,6 +121,13 @@ enum mem_cgroup_events_target {
 #define SOFTLIMIT_EVENTS_TARGET (1024)
 #define NUMAINFO_EVENTS_TARGET	(1024)
 
+enum {
+	MEM_CGROUP_DIRTY_RATIO,
+	MEM_CGROUP_DIRTY_LIMIT_IN_BYTES,
+	MEM_CGROUP_DIRTY_BACKGROUND_RATIO,
+	MEM_CGROUP_DIRTY_BACKGROUND_LIMIT_IN_BYTES,
+};
+
 struct mem_cgroup_stat_cpu {
 	long count[MEM_CGROUP_STAT_NSTATS];
 	unsigned long events[MEM_CGROUP_EVENTS_NSTATS];
@@ -4927,6 +4934,90 @@ static int mem_cgroup_reset_vmscan_stat(struct cgroup *cgrp,
 	return 0;
 }
 
+static u64 mem_cgroup_dirty_read(struct cgroup *cgrp, struct cftype *cft)
+{
+	struct mem_cgroup *memcg = mem_cgroup_from_cont(cgrp);
+	bool use_sys = !mem_cgroup_has_dirty_limit(memcg);
+
+	switch (cft->private) {
+	case MEM_CGROUP_DIRTY_RATIO:
+		return use_sys ? vm_dirty_ratio :
+			memcg->dirty_param.dirty_ratio;
+	case MEM_CGROUP_DIRTY_LIMIT_IN_BYTES:
+		return use_sys ? vm_dirty_bytes :
+			memcg->dirty_param.dirty_bytes;
+	case MEM_CGROUP_DIRTY_BACKGROUND_RATIO:
+		return use_sys ? dirty_background_ratio :
+			memcg->dirty_param.dirty_background_ratio;
+	case MEM_CGROUP_DIRTY_BACKGROUND_LIMIT_IN_BYTES:
+		return use_sys ? dirty_background_bytes :
+			memcg->dirty_param.dirty_background_bytes;
+	default:
+		BUG();
+	}
+}
+
+static int
+mem_cgroup_dirty_write_string(struct cgroup *cgrp, struct cftype *cft,
+				const char *buffer)
+{
+	struct mem_cgroup *memcg = mem_cgroup_from_cont(cgrp);
+	int type = cft->private;
+	int ret = -EINVAL;
+	unsigned long long val;
+
+	if (!mem_cgroup_has_dirty_limit(memcg))
+		return ret;
+
+	switch (type) {
+	case MEM_CGROUP_DIRTY_LIMIT_IN_BYTES:
+		/* This function does all necessary parse...reuse it */
+		ret = res_counter_memparse_write_strategy(buffer, &val);
+		if (ret)
+			break;
+		memcg->dirty_param.dirty_bytes = val;
+		memcg->dirty_param.dirty_ratio  = 0;
+		break;
+	case MEM_CGROUP_DIRTY_BACKGROUND_LIMIT_IN_BYTES:
+		ret = res_counter_memparse_write_strategy(buffer, &val);
+		if (ret)
+			break;
+		memcg->dirty_param.dirty_background_bytes = val;
+		memcg->dirty_param.dirty_background_ratio = 0;
+		break;
+	default:
+		BUG();
+		break;
+	}
+	return ret;
+}
+
+static int
+mem_cgroup_dirty_write(struct cgroup *cgrp, struct cftype *cft, u64 val)
+{
+	struct mem_cgroup *memcg = mem_cgroup_from_cont(cgrp);
+	int type = cft->private;
+
+	if (!mem_cgroup_has_dirty_limit(memcg))
+		return -EINVAL;
+	if ((type == MEM_CGROUP_DIRTY_RATIO ||
+	     type == MEM_CGROUP_DIRTY_BACKGROUND_RATIO) && val > 100)
+		return -EINVAL;
+	switch (type) {
+	case MEM_CGROUP_DIRTY_RATIO:
+		memcg->dirty_param.dirty_ratio = val;
+		memcg->dirty_param.dirty_bytes = 0;
+		break;
+	case MEM_CGROUP_DIRTY_BACKGROUND_RATIO:
+		memcg->dirty_param.dirty_background_ratio = val;
+		memcg->dirty_param.dirty_background_bytes = 0;
+		break;
+	default:
+		BUG();
+		break;
+	}
+	return 0;
+}
 
 static struct cftype mem_cgroup_files[] = {
 	{
@@ -5003,6 +5094,30 @@ static struct cftype mem_cgroup_files[] = {
 		.read_map = mem_cgroup_vmscan_stat_read,
 		.trigger = mem_cgroup_reset_vmscan_stat,
 	},
+	{
+		.name = "dirty_ratio",
+		.read_u64 = mem_cgroup_dirty_read,
+		.write_u64 = mem_cgroup_dirty_write,
+		.private = MEM_CGROUP_DIRTY_RATIO,
+	},
+	{
+		.name = "dirty_limit_in_bytes",
+		.read_u64 = mem_cgroup_dirty_read,
+		.write_string = mem_cgroup_dirty_write_string,
+		.private = MEM_CGROUP_DIRTY_LIMIT_IN_BYTES,
+	},
+	{
+		.name = "dirty_background_ratio",
+		.read_u64 = mem_cgroup_dirty_read,
+		.write_u64 = mem_cgroup_dirty_write,
+		.private = MEM_CGROUP_DIRTY_BACKGROUND_RATIO,
+	},
+	{
+		.name = "dirty_background_limit_in_bytes",
+		.read_u64 = mem_cgroup_dirty_read,
+		.write_string = mem_cgroup_dirty_write_string,
+		.private = MEM_CGROUP_DIRTY_BACKGROUND_LIMIT_IN_BYTES,
+	},
 };
 
 #ifdef CONFIG_CGROUP_MEM_RES_CTLR_SWAP
-- 
1.7.3.1


^ permalink raw reply related	[flat|nested] 72+ messages in thread

* [PATCH v9 08/13] memcg: dirty page accounting support routines
  2011-08-17 16:14 ` Greg Thelen
@ 2011-08-17 16:15   ` Greg Thelen
  -1 siblings, 0 replies; 72+ messages in thread
From: Greg Thelen @ 2011-08-17 16:15 UTC (permalink / raw)
  To: Andrew Morton
  Cc: linux-kernel, linux-mm, containers, linux-fsdevel,
	KAMEZAWA Hiroyuki, Balbir Singh, Daisuke Nishimura, Minchan Kim,
	Johannes Weiner, Wu Fengguang, Dave Chinner, Vivek Goyal,
	Andrea Righi, Ciju Rajan K, David Rientjes, Greg Thelen

Added memcg dirty page accounting support routines.  These routines are
used by later changes to provide memcg aware writeback and dirty page
limiting.  A mem_cgroup_dirty_info() tracepoint is also included to
allow for easier understanding of memcg writeback operation.
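
For illustration only, a userspace sketch of the threshold arithmetic
used below by mem_cgroup_dirty_info(): an absolute *_bytes setting takes
precedence, otherwise the ratio is applied to the dirtyable memory (the
function name and numbers are invented for this example):

  #include <stdio.h>

  #define PAGE_SIZE_MODEL 4096UL
  #define DIV_ROUND_UP(n, d) (((n) + (d) - 1) / (d))

  /* Bytes override ratio, mirroring the vm.dirty_* semantics. */
  static unsigned long dirty_thresh_pages(unsigned long dirty_bytes,
                                          int dirty_ratio,
                                          unsigned long available_pages)
  {
          if (dirty_bytes)
                  return DIV_ROUND_UP(dirty_bytes, PAGE_SIZE_MODEL);
          return dirty_ratio * available_pages / 100;
  }

  int main(void)
  {
          /* 10% of 100000 dirtyable pages: 10000 pages */
          printf("%lu\n", dirty_thresh_pages(0, 10, 100000));
          /* an absolute 64MB limit wins over the ratio: 16384 pages */
          printf("%lu\n", dirty_thresh_pages(64UL << 20, 10, 100000));
          return 0;
  }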

Signed-off-by: Greg Thelen <gthelen@google.com>
---
Changelog since v8:
- Use 'memcg' rather than 'mem' for local variables and parameters.
  This is consistent with other memory controller code.

 include/linux/memcontrol.h        |    9 ++
 include/trace/events/memcontrol.h |   34 +++++++++
 mm/memcontrol.c                   |  147 +++++++++++++++++++++++++++++++++++++
 3 files changed, 190 insertions(+), 0 deletions(-)

diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
index 630d3fa..9cc8841 100644
--- a/include/linux/memcontrol.h
+++ b/include/linux/memcontrol.h
@@ -36,6 +36,15 @@ enum mem_cgroup_page_stat_item {
 	MEMCG_NR_FILE_DIRTY, /* # of dirty pages in page cache */
 	MEMCG_NR_FILE_WRITEBACK, /* # of pages under writeback */
 	MEMCG_NR_FILE_UNSTABLE_NFS, /* # of NFS unstable pages */
+	MEMCG_NR_DIRTYABLE_PAGES, /* # of pages that could be dirty */
+};
+
+struct dirty_info {
+	unsigned long dirty_thresh;
+	unsigned long background_thresh;
+	unsigned long nr_file_dirty;
+	unsigned long nr_writeback;
+	unsigned long nr_unstable_nfs;
 };
 
 extern unsigned long mem_cgroup_isolate_pages(unsigned long nr_to_scan,
diff --git a/include/trace/events/memcontrol.h b/include/trace/events/memcontrol.h
index 781ef9fc..abf1306 100644
--- a/include/trace/events/memcontrol.h
+++ b/include/trace/events/memcontrol.h
@@ -26,6 +26,40 @@ TRACE_EVENT(mem_cgroup_mark_inode_dirty,
 	TP_printk("ino=%ld css_id=%d", __entry->ino, __entry->css_id)
 )
 
+TRACE_EVENT(mem_cgroup_dirty_info,
+	TP_PROTO(unsigned short css_id,
+		 struct dirty_info *dirty_info),
+
+	TP_ARGS(css_id, dirty_info),
+
+	TP_STRUCT__entry(
+		__field(unsigned short, css_id)
+		__field(unsigned long, dirty_thresh)
+		__field(unsigned long, background_thresh)
+		__field(unsigned long, nr_file_dirty)
+		__field(unsigned long, nr_writeback)
+		__field(unsigned long, nr_unstable_nfs)
+		),
+
+	TP_fast_assign(
+		__entry->css_id = css_id;
+		__entry->dirty_thresh = dirty_info->dirty_thresh;
+		__entry->background_thresh = dirty_info->background_thresh;
+		__entry->nr_file_dirty = dirty_info->nr_file_dirty;
+		__entry->nr_writeback = dirty_info->nr_writeback;
+		__entry->nr_unstable_nfs = dirty_info->nr_unstable_nfs;
+		),
+
+	TP_printk("css_id=%d thresh=%ld bg_thresh=%ld dirty=%ld wb=%ld "
+		  "unstable_nfs=%ld",
+		  __entry->css_id,
+		  __entry->dirty_thresh,
+		  __entry->background_thresh,
+		  __entry->nr_file_dirty,
+		  __entry->nr_writeback,
+		  __entry->nr_unstable_nfs)
+)
+
 #endif /* _TRACE_MEMCONTROL_H */
 
 /* This part must be outside protection */
diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index 4e01699..d54adf4 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -1366,6 +1366,11 @@ int mem_cgroup_swappiness(struct mem_cgroup *memcg)
 	return memcg->swappiness;
 }
 
+static unsigned long dirty_info_reclaimable(struct dirty_info *info)
+{
+	return info->nr_file_dirty + info->nr_unstable_nfs;
+}
+
 /*
  * Return true if the current memory cgroup has local dirty memory settings.
  * There is an allowed race between the current task migrating in-to/out-of the
@@ -1396,6 +1401,148 @@ static void mem_cgroup_dirty_param(struct vm_dirty_param *param,
 	}
 }
 
+static inline bool mem_cgroup_can_swap(struct mem_cgroup *memcg)
+{
+	if (!do_swap_account)
+		return nr_swap_pages > 0;
+	return !memcg->memsw_is_minimum &&
+		(res_counter_read_u64(&memcg->memsw, RES_LIMIT) > 0);
+}
+
+static s64 mem_cgroup_local_page_stat(struct mem_cgroup *memcg,
+				      enum mem_cgroup_page_stat_item item)
+{
+	s64 ret;
+
+	switch (item) {
+	case MEMCG_NR_FILE_DIRTY:
+		ret = mem_cgroup_read_stat(memcg, MEM_CGROUP_STAT_FILE_DIRTY);
+		break;
+	case MEMCG_NR_FILE_WRITEBACK:
+		ret = mem_cgroup_read_stat(memcg,
+					   MEM_CGROUP_STAT_FILE_WRITEBACK);
+		break;
+	case MEMCG_NR_FILE_UNSTABLE_NFS:
+		ret = mem_cgroup_read_stat(memcg,
+					   MEM_CGROUP_STAT_FILE_UNSTABLE_NFS);
+		break;
+	case MEMCG_NR_DIRTYABLE_PAGES:
+		ret = mem_cgroup_read_stat(memcg, LRU_ACTIVE_FILE) +
+			mem_cgroup_read_stat(memcg, LRU_INACTIVE_FILE);
+		if (mem_cgroup_can_swap(memcg))
+			ret += mem_cgroup_read_stat(memcg, LRU_ACTIVE_ANON) +
+				mem_cgroup_read_stat(memcg, LRU_INACTIVE_ANON);
+		break;
+	default:
+		BUG();
+		break;
+	}
+	return ret;
+}
+
+/*
+ * Return the number of additional pages that the @memcg cgroup could allocate.
+ * If use_hierarchy is set, then this involves checking parent mem cgroups to
+ * find the cgroup with the smallest free space.
+ */
+static unsigned long
+mem_cgroup_hierarchical_free_pages(struct mem_cgroup *memcg)
+{
+	u64 free;
+	unsigned long min_free;
+
+	min_free = global_page_state(NR_FREE_PAGES);
+
+	while (memcg) {
+		free = (res_counter_read_u64(&memcg->res, RES_LIMIT) -
+			res_counter_read_u64(&memcg->res, RES_USAGE)) >>
+			PAGE_SHIFT;
+		min_free = min_t(u64, min_free, free);
+		memcg = parent_mem_cgroup(memcg);
+	}
+
+	return min_free;
+}
+
+/*
+ * mem_cgroup_page_stat() - get memory cgroup file cache statistics
+ * @memcg:     memory cgroup to query
+ * @item:      memory statistic item exported to the kernel
+ *
+ * Return the accounted statistic value.
+ */
+static unsigned long mem_cgroup_page_stat(struct mem_cgroup *memcg,
+					  enum mem_cgroup_page_stat_item item)
+{
+	struct mem_cgroup *iter;
+	s64 value;
+
+	/*
+	 * If we're looking for dirtyable pages we need to evaluate free pages
+	 * depending on the limit and usage of the parents first of all.
+	 */
+	if (item == MEMCG_NR_DIRTYABLE_PAGES)
+		value = mem_cgroup_hierarchical_free_pages(memcg);
+	else
+		value = 0;
+
+	/*
+	 * Recursively evaluate page statistics against all cgroup under
+	 * hierarchy tree
+	 */
+	for_each_mem_cgroup_tree(iter, memcg)
+		value += mem_cgroup_local_page_stat(iter, item);
+
+	/*
+	 * Summing of unlocked per-cpu counters is racy and may yield a slightly
+	 * negative value.  Zero is the only sensible value in such cases.
+	 */
+	if (unlikely(value < 0))
+		value = 0;
+
+	return value;
+}
+
+/* Return dirty thresholds and usage for @memcg. */
+static void mem_cgroup_dirty_info(unsigned long sys_available_mem,
+				  struct mem_cgroup *memcg,
+				  struct dirty_info *info)
+{
+	unsigned long uninitialized_var(available_mem);
+	struct vm_dirty_param dirty_param;
+
+	mem_cgroup_dirty_param(&dirty_param, memcg);
+
+	if (!dirty_param.dirty_bytes || !dirty_param.dirty_background_bytes)
+		available_mem = min(
+			sys_available_mem,
+			mem_cgroup_page_stat(memcg, MEMCG_NR_DIRTYABLE_PAGES));
+
+	if (dirty_param.dirty_bytes)
+		info->dirty_thresh =
+			DIV_ROUND_UP(dirty_param.dirty_bytes, PAGE_SIZE);
+	else
+		info->dirty_thresh =
+			(dirty_param.dirty_ratio * available_mem) / 100;
+
+	if (dirty_param.dirty_background_bytes)
+		info->background_thresh =
+			DIV_ROUND_UP(dirty_param.dirty_background_bytes,
+				     PAGE_SIZE);
+	else
+		info->background_thresh =
+			(dirty_param.dirty_background_ratio *
+			       available_mem) / 100;
+
+	info->nr_file_dirty = mem_cgroup_page_stat(memcg, MEMCG_NR_FILE_DIRTY);
+	info->nr_writeback =
+		mem_cgroup_page_stat(memcg, MEMCG_NR_FILE_WRITEBACK);
+	info->nr_unstable_nfs =
+		mem_cgroup_page_stat(memcg, MEMCG_NR_FILE_UNSTABLE_NFS);
+
+	trace_mem_cgroup_dirty_info(css_id(&memcg->css), info);
+}
+
 static void mem_cgroup_start_move(struct mem_cgroup *mem)
 {
 	int cpu;
-- 
1.7.3.1


^ permalink raw reply related	[flat|nested] 72+ messages in thread

* [PATCH v9 09/13] memcg: create support routines for writeback
  2011-08-17 16:14 ` Greg Thelen
@ 2011-08-17 16:15   ` Greg Thelen
  -1 siblings, 0 replies; 72+ messages in thread
From: Greg Thelen @ 2011-08-17 16:15 UTC (permalink / raw)
  To: Andrew Morton
  Cc: linux-kernel, linux-mm, containers, linux-fsdevel,
	KAMEZAWA Hiroyuki, Balbir Singh, Daisuke Nishimura, Minchan Kim,
	Johannes Weiner, Wu Fengguang, Dave Chinner, Vivek Goyal,
	Andrea Righi, Ciju Rajan K, David Rientjes, Greg Thelen

Introduce memcg routines to assist in per-memcg writeback:

- mem_cgroups_over_bground_dirty_thresh() determines if any cgroups need
  writeback because they are over their dirty memory threshold.

- should_writeback_mem_cgroup_inode() will be called by writeback to
  determine if a particular inode should be written back.  The answer
  depends on the writeback context (foreground, background,
  try_to_free_pages, etc.).

- mem_cgroup_writeback_done() is used periodically during writeback to
  update memcg writeback data.

These routines make use of a new over_bground_dirty_thresh bitmap that
indicates which mem_cgroups are over their respective dirty background
threshold.  As this bitmap is indexed by css_id, the largest possible
css_id value is needed to create the bitmap.  So move the definition of
CSS_ID_MAX from cgroup.c to cgroup.h.  This allows users of css_id() to
know the largest possible css_id value.  This knowledge can be used to
build such per-cgroup bitmaps.

Make determine_dirtyable_memory() non-static because it is needed by
mem_cgroup_writeback_done().
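
For illustration only, a userspace model of a bitmap indexed by css_id
and sized by CSS_ID_MAX, as used for the over-limit tracking described
above (the helper names are invented for this example):

  #include <limits.h>
  #include <stdbool.h>
  #include <stdio.h>

  #define CSS_ID_MAX_MODEL 65535  /* same bound the patch exposes in cgroup.h */
  #define BITS_PER_LONG_MODEL (sizeof(unsigned long) * CHAR_BIT)
  #define BITMAP_LONGS \
          ((CSS_ID_MAX_MODEL + 1 + BITS_PER_LONG_MODEL - 1) / BITS_PER_LONG_MODEL)

  /* Models a global over-background-threshold bitmap, indexed by css_id. */
  static unsigned long over_bground[BITMAP_LONGS];

  static void set_over_limit(unsigned short id)
  {
          over_bground[id / BITS_PER_LONG_MODEL] |=
                  1UL << (id % BITS_PER_LONG_MODEL);
  }

  static void clear_over_limit(unsigned short id)
  {
          over_bground[id / BITS_PER_LONG_MODEL] &=
                  ~(1UL << (id % BITS_PER_LONG_MODEL));
  }

  static bool is_over_limit(unsigned short id)
  {
          return over_bground[id / BITS_PER_LONG_MODEL] &
                  (1UL << (id % BITS_PER_LONG_MODEL));
  }

  int main(void)
  {
          set_over_limit(42);     /* memcg with css_id 42 over its threshold */
          printf("%d %d\n", is_over_limit(42), is_over_limit(7)); /* 1 0 */
          clear_over_limit(42);   /* writeback brought it back under */
          printf("%d\n", is_over_limit(42));                      /* 0 */
          return 0;
  }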

Signed-off-by: Greg Thelen <gthelen@google.com>
---
Changelog since v8:

- No longer passing struct writeback_control into memcontrol functions.
  Instead the needed attributes (memcg_id, etc.) are explicitly passed in.

- No more field additions to struct writeback_control.

- make determine_dirtyable_memory() non-static.

- rename 'over_limit' in should_writeback_mem_cgroup_inode() to 'wb' because
  should_writeback_mem_cgroup_inode() does not necessarily return just inodes
  that are in an over-limit memcg.  It returns inodes that need writeback based
  on input criteria.

- Added more comments to clarify should_writeback_mem_cgroup_inode().

- To handle foreground writeback and try_to_free_pages(),
  should_writeback_mem_cgroup_inode() can check for inodes belonging to a
  specific memory cgroup.

- Use 'memcg' rather than 'mem' for local variables and parameters.
  This is consistent with other memory controller code.

 include/linux/cgroup.h            |    1 +
 include/linux/memcontrol.h        |   23 ++++++
 include/linux/writeback.h         |    1 +
 include/trace/events/memcontrol.h |   53 +++++++++++++
 kernel/cgroup.c                   |    1 -
 mm/memcontrol.c                   |  153 +++++++++++++++++++++++++++++++++++++
 mm/page-writeback.c               |    2 +-
 7 files changed, 232 insertions(+), 2 deletions(-)

diff --git a/include/linux/cgroup.h b/include/linux/cgroup.h
index da7e4bc..9277c8a 100644
--- a/include/linux/cgroup.h
+++ b/include/linux/cgroup.h
@@ -623,6 +623,7 @@ bool css_is_ancestor(struct cgroup_subsys_state *cg,
 		     const struct cgroup_subsys_state *root);
 
 /* Get id and depth of css */
+#define CSS_ID_MAX	(65535)
 unsigned short css_id(struct cgroup_subsys_state *css);
 unsigned short css_depth(struct cgroup_subsys_state *css);
 struct cgroup_subsys_state *cgroup_css_from_dir(struct file *f, int id);
diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
index 9cc8841..103d297 100644
--- a/include/linux/memcontrol.h
+++ b/include/linux/memcontrol.h
@@ -181,6 +181,12 @@ static inline void mem_cgroup_dec_page_stat(struct page *page,
 	mem_cgroup_update_page_stat(page, idx, -1);
 }
 
+bool should_writeback_mem_cgroup_inode(struct inode *inode,
+				       unsigned short memcg_id,
+				       bool shared_inodes);
+bool mem_cgroups_over_bground_dirty_thresh(void);
+void mem_cgroup_writeback_done(void);
+
 unsigned long mem_cgroup_soft_limit_reclaim(struct zone *zone, int order,
 						gfp_t gfp_mask,
 						unsigned long *total_scanned);
@@ -379,6 +385,23 @@ static inline void mem_cgroup_dec_page_stat(struct page *page,
 {
 }
 
+static inline bool
+should_writeback_mem_cgroup_inode(struct inode *inode,
+				  unsigned short memcg_id,
+				  bool shared_inodes)
+{
+	return true;
+}
+
+static inline bool mem_cgroups_over_bground_dirty_thresh(void)
+{
+	return true;
+}
+
+static inline void mem_cgroup_writeback_done(void)
+{
+}
+
 static inline
 unsigned long mem_cgroup_soft_limit_reclaim(struct zone *zone, int order,
 					    gfp_t gfp_mask,
diff --git a/include/linux/writeback.h b/include/linux/writeback.h
index 5e8bd6c..d12d070 100644
--- a/include/linux/writeback.h
+++ b/include/linux/writeback.h
@@ -128,6 +128,7 @@ extern unsigned int dirty_expire_interval;
 extern int vm_highmem_is_dirtyable;
 extern int block_dump;
 extern int laptop_mode;
+extern unsigned long determine_dirtyable_memory(void);
 
 extern int dirty_background_ratio_handler(struct ctl_table *table, int write,
 		void __user *buffer, size_t *lenp,
diff --git a/include/trace/events/memcontrol.h b/include/trace/events/memcontrol.h
index abf1306..966aac0 100644
--- a/include/trace/events/memcontrol.h
+++ b/include/trace/events/memcontrol.h
@@ -60,6 +60,59 @@ TRACE_EVENT(mem_cgroup_dirty_info,
 		  __entry->nr_unstable_nfs)
 )
 
+TRACE_EVENT(should_writeback_mem_cgroup_inode,
+	TP_PROTO(struct inode *inode,
+		 unsigned short css_id,
+		 bool shared_inodes,
+		 bool wb),
+
+	TP_ARGS(inode, css_id, shared_inodes, wb),
+
+	TP_STRUCT__entry(
+		__field(unsigned long, ino)
+		__field(unsigned short, inode_css_id)
+		__field(unsigned short, css_id)
+		__field(bool, shared_inodes)
+		__field(bool, wb)
+	),
+
+	TP_fast_assign(
+		__entry->ino = inode->i_ino;
+		__entry->inode_css_id =
+			inode->i_mapping ? inode->i_mapping->i_memcg : 0;
+		__entry->css_id = css_id;
+		__entry->shared_inodes = shared_inodes;
+		__entry->wb = wb;
+	),
+
+	TP_printk("ino=%ld inode_css_id=%d css_id=%d shared_inodes=%d wb=%d",
+		  __entry->ino,
+		  __entry->inode_css_id,
+		  __entry->css_id,
+		  __entry->shared_inodes,
+		  __entry->wb)
+)
+
+TRACE_EVENT(mem_cgroups_over_bground_dirty_thresh,
+	TP_PROTO(bool over_limit,
+		 unsigned short first_id),
+
+	TP_ARGS(over_limit, first_id),
+
+	TP_STRUCT__entry(
+		__field(bool, over_limit)
+		__field(unsigned short, first_id)
+	),
+
+	TP_fast_assign(
+		__entry->over_limit = over_limit;
+		__entry->first_id = first_id;
+	),
+
+	TP_printk("over_limit=%d first_css_id=%d", __entry->over_limit,
+		  __entry->first_id)
+)
+
 #endif /* _TRACE_MEMCONTROL_H */
 
 /* This part must be outside protection */
diff --git a/kernel/cgroup.c b/kernel/cgroup.c
index 1d2b6ce..be862c0 100644
--- a/kernel/cgroup.c
+++ b/kernel/cgroup.c
@@ -131,7 +131,6 @@ static struct cgroupfs_root rootnode;
  * CSS ID -- ID per subsys's Cgroup Subsys State(CSS). used only when
  * cgroup_subsys->use_id != 0.
  */
-#define CSS_ID_MAX	(65535)
 struct css_id {
 	/*
 	 * The css to which this ID points. This pointer is set to valid value
diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index d54adf4..5092a68 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -432,10 +432,18 @@ enum charge_type {
 #define MEM_CGROUP_RECLAIM_SOFT_BIT	0x2
 #define MEM_CGROUP_RECLAIM_SOFT		(1 << MEM_CGROUP_RECLAIM_SOFT_BIT)
 
+/*
+ * A bitmap representing all possible memcg, indexed by css_id.  Each bit
+ * indicates if the respective memcg is over its background dirty memory
+ * limit.
+ */
+static DECLARE_BITMAP(over_bground_dirty_thresh, CSS_ID_MAX + 1);
+
 static void mem_cgroup_get(struct mem_cgroup *mem);
 static void mem_cgroup_put(struct mem_cgroup *mem);
 static struct mem_cgroup *parent_mem_cgroup(struct mem_cgroup *mem);
 static void drain_all_stock_async(struct mem_cgroup *mem);
+static struct mem_cgroup *mem_cgroup_lookup(unsigned short id);
 
 static struct mem_cgroup_per_zone *
 mem_cgroup_zoneinfo(struct mem_cgroup *mem, int nid, int zid)
@@ -1543,6 +1551,151 @@ static void mem_cgroup_dirty_info(unsigned long sys_available_mem,
 	trace_mem_cgroup_dirty_info(css_id(&memcg->css), info);
 }
 
+/* Are any memcg over their background dirty memory limit? */
+bool mem_cgroups_over_bground_dirty_thresh(void)
+{
+	bool over_thresh;
+
+	over_thresh = !bitmap_empty(over_bground_dirty_thresh, CSS_ID_MAX + 1);
+
+	trace_mem_cgroups_over_bground_dirty_thresh(
+		over_thresh,
+		over_thresh ? find_next_bit(over_bground_dirty_thresh,
+					    CSS_ID_MAX + 1, 0) : 0);
+
+	return over_thresh;
+}
+
+/*
+ * This routine is used by per-memcg writeback to determine if @inode should be
+ * written back.  The routine checks memcg attributes to determine if the inode
+ * should be written.  Note: non-memcg writeback code may choose to writeback
+ * this inode for non-memcg factors: dirtied_when time, etc.
+ *
+ * The optional @memcg_id parameter indicates the specific memcg being written
+ * back.  If set (non-zero), then only writeback inodes dirtied by @memcg_id.
+ * If unset (zero), then writeback inodes dirtied by memcg over background dirty
+ * page limit.
+ *
+ * If @shared_inodes is set, then also consider any inodes dirtied by multiple
+ * memcg.
+ *
+ * Returns true if the inode should be written back, false otherwise.
+ */
+bool should_writeback_mem_cgroup_inode(struct inode *inode,
+				       unsigned short memcg_id,
+				       bool shared_inodes)
+{
+	struct mem_cgroup *memcg;
+	struct mem_cgroup *inode_memcg;
+	unsigned short inode_id;
+	bool wb;
+
+	inode_id = inode->i_mapping->i_memcg;
+	VM_BUG_ON(inode_id >= CSS_ID_MAX + 1);
+
+	if (shared_inodes && inode_id == I_MEMCG_SHARED)
+		wb = true;
+	else if (memcg_id) {
+		if (memcg_id == inode_id)
+			wb = true;
+		else {
+			/*
+			 * Determine if inode is owned by a hierarchy child of
+			 * memcg_id.
+			 */
+			rcu_read_lock();
+			memcg = mem_cgroup_lookup(memcg_id);
+			inode_memcg = mem_cgroup_lookup(inode_id);
+			wb = memcg && inode_memcg &&
+				memcg->use_hierarchy &&
+				css_is_ancestor(&inode_memcg->css,
+						&memcg->css);
+			rcu_read_unlock();
+		}
+	} else
+		wb = test_bit(inode_id, over_bground_dirty_thresh);
+
+	trace_should_writeback_mem_cgroup_inode(inode, memcg_id, shared_inodes,
+						wb);
+	return wb;
+}
+
+/*
+ * Mark @memcg and all of its child cgroups as eligible for writeback because
+ * @memcg is over its background threshold.
+ */
+static void mem_cgroup_mark_over_bg_thresh(struct mem_cgroup *memcg)
+{
+	struct mem_cgroup *iter;
+
+	/* mark this and all child cgroups as candidates for writeback */
+	for_each_mem_cgroup_tree(iter, memcg)
+		set_bit(css_id(&iter->css), over_bground_dirty_thresh);
+}
+
+static void mem_cgroup_queue_bg_writeback(struct mem_cgroup *memcg,
+					  struct backing_dev_info *bdi)
+{
+	mem_cgroup_mark_over_bg_thresh(memcg);
+	bdi_start_background_writeback(bdi);
+}
+
+/*
+ * This routine is called as writeback writes inode pages.  The routine clears
+ * any over-background-limit bits for memcg that are no longer over their
+ * background dirty limit.
+ */
+void mem_cgroup_writeback_done(void)
+{
+	struct mem_cgroup *memcg;
+	struct mem_cgroup *ref_memcg;
+	struct dirty_info info;
+	unsigned long sys_available_mem;
+	int id;
+
+	sys_available_mem = 0;
+
+	/* for each previously over-bg-limit memcg... */
+	for (id = 0; (id = find_next_bit(over_bground_dirty_thresh,
+					 CSS_ID_MAX + 1, id)) < CSS_ID_MAX + 1;
+	     id++) {
+
+		/* reference the memcg */
+		rcu_read_lock();
+		memcg = mem_cgroup_lookup(id);
+		if (memcg && !css_tryget(&memcg->css))
+			memcg = NULL;
+		rcu_read_unlock();
+		if (!memcg) {
+			clear_bit(id, over_bground_dirty_thresh);
+			continue;
+		}
+		ref_memcg = memcg;
+
+		if (!sys_available_mem)
+			sys_available_mem = determine_dirtyable_memory();
+
+		/*
+		 * Walk this memcg's ancestry, clearing the over-limit
+		 * bits for any memcg under its dirty memory background
+		 * threshold.
+		 */
+		for (; mem_cgroup_has_dirty_limit(memcg);
+		     memcg = parent_mem_cgroup(memcg)) {
+			mem_cgroup_dirty_info(sys_available_mem, memcg, &info);
+			if (dirty_info_reclaimable(&info) >=
+			    info.background_thresh)
+				break;
+
+			clear_bit(css_id(&memcg->css),
+				  over_bground_dirty_thresh);
+		}
+
+		css_put(&ref_memcg->css);
+	}
+}
+
 static void mem_cgroup_start_move(struct mem_cgroup *mem)
 {
 	int cpu;
diff --git a/mm/page-writeback.c b/mm/page-writeback.c
index b1f2390..12b3900 100644
--- a/mm/page-writeback.c
+++ b/mm/page-writeback.c
@@ -190,7 +190,7 @@ static unsigned long highmem_dirtyable_memory(unsigned long total)
  * Returns the number of pages that can currently be freed and used
  * by the kernel for direct mappings.
  */
-static unsigned long determine_dirtyable_memory(void)
+unsigned long determine_dirtyable_memory(void)
 {
 	unsigned long x;
 
-- 
1.7.3.1


^ permalink raw reply related	[flat|nested] 72+ messages in thread

* [PATCH v9 10/13] writeback: pass wb_writeback_work into move_expired_inodes()
  2011-08-17 16:14 ` Greg Thelen
@ 2011-08-17 16:15   ` Greg Thelen
  -1 siblings, 0 replies; 72+ messages in thread
From: Greg Thelen @ 2011-08-17 16:15 UTC (permalink / raw)
  To: Andrew Morton
  Cc: linux-kernel, linux-mm, containers, linux-fsdevel,
	KAMEZAWA Hiroyuki, Balbir Singh, Daisuke Nishimura, Minchan Kim,
	Johannes Weiner, Wu Fengguang, Dave Chinner, Vivek Goyal,
	Andrea Righi, Ciju Rajan K, David Rientjes, Greg Thelen

A later change to move_expired_inodes() requires passing fields from
the writeback work descriptor into memcontrol code when determining if an
inode should be written back.

Signed-off-by: Greg Thelen <gthelen@google.com>
---
 fs/fs-writeback.c |   15 ++++++++-------
 1 files changed, 8 insertions(+), 7 deletions(-)

diff --git a/fs/fs-writeback.c b/fs/fs-writeback.c
index 6bf4c49..e91fb82 100644
--- a/fs/fs-writeback.c
+++ b/fs/fs-writeback.c
@@ -252,7 +252,7 @@ static bool inode_dirtied_after(struct inode *inode, unsigned long t)
  */
 static int move_expired_inodes(struct list_head *delaying_queue,
 			       struct list_head *dispatch_queue,
-			       unsigned long *older_than_this)
+			       struct wb_writeback_work *work)
 {
 	LIST_HEAD(tmp);
 	struct list_head *pos, *node;
@@ -263,8 +263,8 @@ static int move_expired_inodes(struct list_head *delaying_queue,
 
 	while (!list_empty(delaying_queue)) {
 		inode = wb_inode(delaying_queue->prev);
-		if (older_than_this &&
-		    inode_dirtied_after(inode, *older_than_this))
+		if (work->older_than_this &&
+		    inode_dirtied_after(inode, *work->older_than_this))
 			break;
 		if (sb && sb != inode->i_sb)
 			do_sb_sort = 1;
@@ -303,13 +303,14 @@ out:
  *                                           |
  *                                           +--> dequeue for IO
  */
-static void queue_io(struct bdi_writeback *wb, unsigned long *older_than_this)
+static void queue_io(struct bdi_writeback *wb, struct wb_writeback_work *work)
 {
 	int moved;
 	assert_spin_locked(&wb->list_lock);
 	list_splice_init(&wb->b_more_io, &wb->b_io);
-	moved = move_expired_inodes(&wb->b_dirty, &wb->b_io, older_than_this);
-	trace_writeback_queue_io(wb, older_than_this, moved);
+	moved = move_expired_inodes(&wb->b_dirty, &wb->b_io, work);
+	trace_writeback_queue_io(wb, work ? work->older_than_this : NULL,
+				 moved);
 }
 
 static int write_inode(struct inode *inode, struct writeback_control *wbc)
@@ -739,7 +740,7 @@ static long wb_writeback(struct bdi_writeback *wb,
 
 		trace_writeback_start(wb->bdi, work);
 		if (list_empty(&wb->b_io))
-			queue_io(wb, work->older_than_this);
+			queue_io(wb, work);
 		if (work->sb)
 			progress = writeback_sb_inodes(work->sb, wb, work);
 		else
-- 
1.7.3.1


^ permalink raw reply related	[flat|nested] 72+ messages in thread

* [PATCH v9 11/13] writeback: make background writeback cgroup aware
  2011-08-17 16:14 ` Greg Thelen
@ 2011-08-17 16:15   ` Greg Thelen
  -1 siblings, 0 replies; 72+ messages in thread
From: Greg Thelen @ 2011-08-17 16:15 UTC (permalink / raw)
  To: Andrew Morton
  Cc: linux-kernel, linux-mm, containers, linux-fsdevel,
	KAMEZAWA Hiroyuki, Balbir Singh, Daisuke Nishimura, Minchan Kim,
	Johannes Weiner, Wu Fengguang, Dave Chinner, Vivek Goyal,
	Andrea Righi, Ciju Rajan K, David Rientjes, Greg Thelen

When the system is below its background dirty memory threshold but some
cgroups are over their background dirty memory thresholds, write back only
the inodes associated with the over-limit cgroups.

In addition to checking if the system dirty memory usage is over the
system background threshold, over_bground_thresh() now checks if any
cgroups are over their respective background dirty memory thresholds.

If over-limit cgroups are found, then the new
wb_writeback_work.for_cgroup field is set to distinguish between system
and memcg overages.  The new wb_writeback_work.shared_inodes field is
also set.  Inodes written by multiple cgroups are marked as owned by
I_MEMCG_SHARED rather than a particular cgroup.  Such shared inodes
cannot easily be attributed to a cgroup, so per-cgroup writeback
(future versions of wakeup_flusher_threads and balance_dirty_pages)
performs suboptimally in the presence of shared inodes.  Therefore,
write shared inodes when performing cgroup background writeback.

If performing cgroup writeback, move_expired_inodes() skips inodes that
do not contribute dirty pages to the cgroup being written back.

After writing some pages, wb_writeback() will call
mem_cgroup_writeback_done() to update the set of over-bg-limits memcg.

This change also makes wakeup_flusher_threads() memcg aware so that
per-cgroup try_to_free_pages() is able to operate more efficiently
without having to write pages of foreign containers.  This change adds a
mem_cgroup parameter to wakeup_flusher_threads() to allow callers,
especially try_to_free_pages() and foreground writeback from
balance_dirty_pages(), to specify a particular cgroup to write inodes
from.
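
As a rough model of the threshold check described above (plain userspace C
with illustrative names and numbers, not the over_bground_thresh() code in
the diff below):

#include <stdbool.h>
#include <stdio.h>

struct work_model {
	bool for_cgroup;
	bool shared_inodes;
};

/*
 * System-wide dirty pages are checked first; only when the system is under
 * its background threshold do the per-memcg background limits matter, and
 * in that case shared inodes are written as well.
 */
static bool over_bground_thresh_model(unsigned long nr_dirty,
				      unsigned long bg_thresh,
				      bool any_memcg_over_bg,
				      struct work_model *work)
{
	if (nr_dirty > bg_thresh) {
		work->for_cgroup = false;	/* classic system-wide flush */
		return true;
	}
	if (any_memcg_over_bg) {
		work->for_cgroup = true;	/* only over-limit cgroups... */
		work->shared_inodes = true;	/* ...plus shared inodes */
		return true;
	}
	return false;
}

int main(void)
{
	struct work_model work = { false, false };

	if (over_bground_thresh_model(100, 1000, true, &work))
		printf("flush: for_cgroup=%d shared_inodes=%d\n",
		       work.for_cgroup, work.shared_inodes);
	return 0;
}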

Signed-off-by: Greg Thelen <gthelen@google.com>
---
Changelog since v8:

- Added optional memcg parameter to __bdi_start_writeback(),
  bdi_start_writeback(), wakeup_flusher_threads(), writeback_inodes_wb().

- move_expired_inodes() now uses the passed-in struct wb_writeback_work instead of
  struct writeback_control.

- Added comments to over_bground_thresh().

 fs/buffer.c               |    2 +-
 fs/fs-writeback.c         |   96 +++++++++++++++++++++++++++++++++-----------
 fs/sync.c                 |    2 +-
 include/linux/writeback.h |    6 ++-
 mm/backing-dev.c          |    3 +-
 mm/page-writeback.c       |    3 +-
 mm/vmscan.c               |    3 +-
 7 files changed, 84 insertions(+), 31 deletions(-)

diff --git a/fs/buffer.c b/fs/buffer.c
index dd0220b..da1fb23 100644
--- a/fs/buffer.c
+++ b/fs/buffer.c
@@ -293,7 +293,7 @@ static void free_more_memory(void)
 	struct zone *zone;
 	int nid;
 
-	wakeup_flusher_threads(1024);
+	wakeup_flusher_threads(1024, NULL);
 	yield();
 
 	for_each_online_node(nid) {
diff --git a/fs/fs-writeback.c b/fs/fs-writeback.c
index e91fb82..ba55336 100644
--- a/fs/fs-writeback.c
+++ b/fs/fs-writeback.c
@@ -38,10 +38,14 @@ struct wb_writeback_work {
 	struct super_block *sb;
 	unsigned long *older_than_this;
 	enum writeback_sync_modes sync_mode;
+	unsigned short memcg_id;	/* If non-zero, then writeback specified
+					 * cgroup. */
 	unsigned int tagged_writepages:1;
 	unsigned int for_kupdate:1;
 	unsigned int range_cyclic:1;
 	unsigned int for_background:1;
+	unsigned int for_cgroup:1;	/* cgroup writeback */
+	unsigned int shared_inodes:1;	/* write inodes spanning cgroups */
 
 	struct list_head list;		/* pending work list */
 	struct completion *done;	/* set if the caller waits */
@@ -114,9 +118,12 @@ static void bdi_queue_work(struct backing_dev_info *bdi,
 	spin_unlock_bh(&bdi->wb_lock);
 }
 
+/*
+ * @memcg is optional.  If set, then limit writeback to the specified cgroup.
+ */
 static void
 __bdi_start_writeback(struct backing_dev_info *bdi, long nr_pages,
-		      bool range_cyclic)
+		      bool range_cyclic, struct mem_cgroup *memcg)
 {
 	struct wb_writeback_work *work;
 
@@ -136,6 +143,8 @@ __bdi_start_writeback(struct backing_dev_info *bdi, long nr_pages,
 	work->sync_mode	= WB_SYNC_NONE;
 	work->nr_pages	= nr_pages;
 	work->range_cyclic = range_cyclic;
+	work->memcg_id = memcg ? css_id(mem_cgroup_css(memcg)) : 0;
+	work->for_cgroup = memcg != NULL;
 
 	bdi_queue_work(bdi, work);
 }
@@ -153,7 +162,7 @@ __bdi_start_writeback(struct backing_dev_info *bdi, long nr_pages,
  */
 void bdi_start_writeback(struct backing_dev_info *bdi, long nr_pages)
 {
-	__bdi_start_writeback(bdi, nr_pages, true);
+	__bdi_start_writeback(bdi, nr_pages, true, NULL);
 }
 
 /**
@@ -257,15 +266,20 @@ static int move_expired_inodes(struct list_head *delaying_queue,
 	LIST_HEAD(tmp);
 	struct list_head *pos, *node;
 	struct super_block *sb = NULL;
-	struct inode *inode;
+	struct inode *inode, *tmp_inode;
 	int do_sb_sort = 0;
 	int moved = 0;
 
-	while (!list_empty(delaying_queue)) {
-		inode = wb_inode(delaying_queue->prev);
+	list_for_each_entry_safe_reverse(inode, tmp_inode, delaying_queue,
+					 i_wb_list) {
 		if (work->older_than_this &&
 		    inode_dirtied_after(inode, *work->older_than_this))
 			break;
+		if (work->for_cgroup &&
+		    !should_writeback_mem_cgroup_inode(inode,
+						       work->memcg_id,
+						       work->shared_inodes))
+			continue;
 		if (sb && sb != inode->i_sb)
 			do_sb_sort = 1;
 		sb = inode->i_sb;
@@ -643,31 +657,63 @@ static long __writeback_inodes_wb(struct bdi_writeback *wb,
 	return wrote;
 }
 
-long writeback_inodes_wb(struct bdi_writeback *wb, long nr_pages)
+/*
+ * @memcg is optional.  If set, then limit writeback to the specified cgroup.
+ * If @shared_inodes is set then writeback inodes shared by several memcg.
+ */
+long writeback_inodes_wb(struct bdi_writeback *wb, long nr_pages,
+			 struct mem_cgroup *memcg, bool shared_inodes)
 {
 	struct wb_writeback_work work = {
 		.nr_pages	= nr_pages,
 		.sync_mode	= WB_SYNC_NONE,
+		.memcg_id	= memcg ? css_id(mem_cgroup_css(memcg)) : 0,
+		.for_cgroup	= (memcg != NULL) || shared_inodes,
+		.shared_inodes	= shared_inodes,
 		.range_cyclic	= 1,
 	};
 
 	spin_lock(&wb->list_lock);
 	if (list_empty(&wb->b_io))
-		queue_io(wb, NULL);
+		queue_io(wb, &work);
 	__writeback_inodes_wb(wb, &work);
 	spin_unlock(&wb->list_lock);
 
 	return nr_pages - work.nr_pages;
 }
 
-static inline bool over_bground_thresh(void)
+static inline bool over_bground_thresh(struct wb_writeback_work *work)
 {
 	unsigned long background_thresh, dirty_thresh;
 
 	global_dirty_limits(&background_thresh, &dirty_thresh);
 
-	return (global_page_state(NR_FILE_DIRTY) +
-		global_page_state(NR_UNSTABLE_NFS) > background_thresh);
+	if (global_page_state(NR_FILE_DIRTY) +
+	    global_page_state(NR_UNSTABLE_NFS) > background_thresh) {
+		work->for_cgroup = 0;
+		return true;
+	}
+
+	/*
+	 * System dirty memory is below system background limit.  Check if any
+	 * memcg are over memcg background limit.
+	 */
+	if (mem_cgroups_over_bground_dirty_thresh()) {
+		work->for_cgroup = 1;
+
+		/*
+		 * Set shared_inodes so that background flusher writes shared
+		 * inodes in addition to inodes in over-limit memcg.  Such
+		 * shared inodes should be rarer than inodes written by a single
+		 * memcg.  Shared inodes limit the ability to map from memcg to
+		 * inode in wakeup_flusher_threads() and writeback_inodes_wb().
+		 * So the quicker such shared inodes are written, the better.
+		 */
+		work->shared_inodes = 1;
+		return true;
+	}
+
+	return false;
 }
 
 /*
@@ -729,7 +775,7 @@ static long wb_writeback(struct bdi_writeback *wb,
 		 * For background writeout, stop when we are below the
 		 * background dirty threshold
 		 */
-		if (work->for_background && !over_bground_thresh())
+		if (work->for_background && !over_bground_thresh(work))
 			break;
 
 		if (work->for_kupdate) {
@@ -749,6 +795,9 @@ static long wb_writeback(struct bdi_writeback *wb,
 
 		wb_update_bandwidth(wb, wb_start);
 
+		if (progress)
+			mem_cgroup_writeback_done();
+
 		/*
 		 * Did we write something? Try for more
 		 *
@@ -813,17 +862,15 @@ static unsigned long get_nr_dirty_pages(void)
 
 static long wb_check_background_flush(struct bdi_writeback *wb)
 {
-	if (over_bground_thresh()) {
-
-		struct wb_writeback_work work = {
-			.nr_pages	= LONG_MAX,
-			.sync_mode	= WB_SYNC_NONE,
-			.for_background	= 1,
-			.range_cyclic	= 1,
-		};
+	struct wb_writeback_work work = {
+		.nr_pages	= LONG_MAX,
+		.sync_mode	= WB_SYNC_NONE,
+		.for_background	= 1,
+		.range_cyclic	= 1,
+	};
 
+	if (over_bground_thresh(&work))
 		return wb_writeback(wb, &work);
-	}
 
 	return 0;
 }
@@ -968,10 +1015,11 @@ int bdi_writeback_thread(void *data)
 
 
 /*
- * Start writeback of `nr_pages' pages.  If `nr_pages' is zero, write back
- * the whole world.
+ * Start writeback of `nr_pages' pages.  If `nr_pages' is zero, write back the
+ * whole world.  If 'memcg' is non-NULL, then limit attempt to only write pages
+ * from the specified cgroup.
  */
-void wakeup_flusher_threads(long nr_pages)
+void wakeup_flusher_threads(long nr_pages, struct mem_cgroup *memcg)
 {
 	struct backing_dev_info *bdi;
 
@@ -984,7 +1032,7 @@ void wakeup_flusher_threads(long nr_pages)
 	list_for_each_entry_rcu(bdi, &bdi_list, bdi_list) {
 		if (!bdi_has_dirty_io(bdi))
 			continue;
-		__bdi_start_writeback(bdi, nr_pages, false);
+		__bdi_start_writeback(bdi, nr_pages, false, memcg);
 	}
 	rcu_read_unlock();
 }
diff --git a/fs/sync.c b/fs/sync.c
index c98a747..7c1ba55 100644
--- a/fs/sync.c
+++ b/fs/sync.c
@@ -98,7 +98,7 @@ static void sync_filesystems(int wait)
  */
 SYSCALL_DEFINE0(sync)
 {
-	wakeup_flusher_threads(0);
+	wakeup_flusher_threads(0, NULL);
 	sync_filesystems(0);
 	sync_filesystems(1);
 	if (unlikely(laptop_mode))
diff --git a/include/linux/writeback.h b/include/linux/writeback.h
index d12d070..e6790e8 100644
--- a/include/linux/writeback.h
+++ b/include/linux/writeback.h
@@ -40,6 +40,7 @@
 #define MIN_WRITEBACK_PAGES	(4096UL >> (PAGE_CACHE_SHIFT - 10))
 
 struct backing_dev_info;
+struct mem_cgroup;
 
 /*
  * fs/fs-writeback.c
@@ -85,9 +86,10 @@ void writeback_inodes_sb_nr(struct super_block *, unsigned long nr);
 int writeback_inodes_sb_if_idle(struct super_block *);
 int writeback_inodes_sb_nr_if_idle(struct super_block *, unsigned long nr);
 void sync_inodes_sb(struct super_block *);
-long writeback_inodes_wb(struct bdi_writeback *wb, long nr_pages);
+long writeback_inodes_wb(struct bdi_writeback *wb, long nr_pages,
+			 struct mem_cgroup *memcg, bool shared_inodes);
 long wb_do_writeback(struct bdi_writeback *wb, int force_wait);
-void wakeup_flusher_threads(long nr_pages);
+void wakeup_flusher_threads(long nr_pages, struct mem_cgroup *memcg);
 
 /* writeback.h requires fs.h; it, too, is not included from here. */
 static inline void wait_on_inode(struct inode *inode)
diff --git a/mm/backing-dev.c b/mm/backing-dev.c
index d6edf8d..60d101d 100644
--- a/mm/backing-dev.c
+++ b/mm/backing-dev.c
@@ -456,7 +456,8 @@ static int bdi_forker_thread(void *ptr)
 				 * the bdi from the thread. Hopefully 1024 is
 				 * large enough for efficient IO.
 				 */
-				writeback_inodes_wb(&bdi->wb, 1024);
+				writeback_inodes_wb(&bdi->wb, 1024, NULL,
+						    false);
 			} else {
 				/*
 				 * The spinlock makes sure we do not lose
diff --git a/mm/page-writeback.c b/mm/page-writeback.c
index 12b3900..64de98c 100644
--- a/mm/page-writeback.c
+++ b/mm/page-writeback.c
@@ -736,7 +736,8 @@ static void balance_dirty_pages(struct address_space *mapping,
 		trace_balance_dirty_start(bdi);
 		if (bdi_nr_reclaimable > task_bdi_thresh) {
 			pages_written += writeback_inodes_wb(&bdi->wb,
-							     write_chunk);
+							     write_chunk,
+							     NULL, false);
 			trace_balance_dirty_written(bdi, pages_written);
 			if (pages_written >= write_chunk)
 				break;		/* We've done our duty */
diff --git a/mm/vmscan.c b/mm/vmscan.c
index 3153729..fb0ae99 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -2223,7 +2223,8 @@ static unsigned long do_try_to_free_pages(struct zonelist *zonelist,
 		 */
 		writeback_threshold = sc->nr_to_reclaim + sc->nr_to_reclaim / 2;
 		if (total_scanned > writeback_threshold) {
-			wakeup_flusher_threads(laptop_mode ? 0 : total_scanned);
+			wakeup_flusher_threads(laptop_mode ? 0 : total_scanned,
+					       sc->mem_cgroup);
 			sc->may_writepage = 1;
 		}
 
-- 
1.7.3.1


^ permalink raw reply related	[flat|nested] 72+ messages in thread

* [PATCH v9 12/13] memcg: create support routines for page writeback
  2011-08-17 16:14 ` Greg Thelen
@ 2011-08-17 16:15   ` Greg Thelen
  -1 siblings, 0 replies; 72+ messages in thread
From: Greg Thelen @ 2011-08-17 16:15 UTC (permalink / raw)
  To: Andrew Morton
  Cc: linux-kernel, linux-mm, containers, linux-fsdevel,
	KAMEZAWA Hiroyuki, Balbir Singh, Daisuke Nishimura, Minchan Kim,
	Johannes Weiner, Wu Fengguang, Dave Chinner, Vivek Goyal,
	Andrea Righi, Ciju Rajan K, David Rientjes, Greg Thelen

Introduce memcg routines to assist in per-memcg dirty page management:

- mem_cgroup_balance_dirty_pages() walks a memcg hierarchy comparing
  dirty memory usage against memcg foreground and background thresholds.
  If an over-background-threshold memcg is found, then per-memcg
  background writeback is queued.  Per-memcg writeback differs from
  classic, non-memcg, per-bdi writeback by setting the new
  wb_writeback_work.for_cgroup bit.

  If an over-foreground-threshold memcg is found, then foreground
  writeout occurs.  When performing foreground writeout, first consider
  inodes exclusive to the memcg.  If unable to make enough progress,
  then consider inodes shared between memcg.  Such cross-memcg inode
  sharing is likely to be rare in situations that use per-cgroup memory
  isolation, so the approach tries to handle the common case well
  without falling over in cases where such sharing exists.  This routine
  is used by balance_dirty_pages() in a later change (a simplified model
  of the balancing walk follows this list).

- mem_cgroup_hierarchical_dirty_info() returns the dirty memory usage
  and limits of the memcg closest to (or over) its dirty limit.  This
  will be used by throttle_vm_writeout() in a later change.
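
The interaction of the two foreground passes (exclusive inodes first, shared
inodes as a fallback) can be modelled outside the kernel.  The following is
only a simplified userspace sketch of that control flow; the struct fields and
helpers (dirty_own, dirty_shared, do_writeback(), balance()) are invented
stand-ins for the real dirty_info and writeback_inodes_wb() machinery, and
hierarchical charging is not modelled.

#include <stdbool.h>
#include <stdio.h>

struct cg {
	const char *name;
	struct cg *parent;
	unsigned long dirty_own;    /* dirty pages on inodes used only by this memcg */
	unsigned long dirty_shared; /* dirty pages on shared (i_memcg == 0) inodes */
	unsigned long fg_thresh;    /* foreground (throttling) dirty limit */
	unsigned long bg_thresh;    /* background writeback dirty limit */
};

static unsigned long total_dirty(const struct cg *cg)
{
	return cg->dirty_own + cg->dirty_shared;
}

/* Stand-in for writeback_inodes_wb(wb, chunk, memcg, shared_inodes). */
static unsigned long do_writeback(struct cg *cg, unsigned long chunk, bool shared)
{
	unsigned long *pool = shared ? &cg->dirty_shared : &cg->dirty_own;
	unsigned long done = (*pool < chunk) ? *pool : chunk;

	*pool -= done;
	return done;
}

static void balance(struct cg *cg, unsigned long chunk)
{
	for (; cg; cg = cg->parent) {
		bool shared = false;

		/* Throttle while this level is over its foreground limit. */
		while (total_dirty(cg) > cg->fg_thresh) {
			if (!do_writeback(cg, chunk, shared) && !shared) {
				/* No progress on exclusive inodes: widen scope. */
				shared = true;
			}
			/* (the real code also sleeps here with backoff) */
		}

		/* Over the background limit: the real code queues bg writeback. */
		if (total_dirty(cg) >= cg->bg_thresh)
			printf("%s: queue background writeback\n", cg->name);
	}
}

int main(void)
{
	struct cg a   = { "A",   NULL, 0,  0, 200, 100 };
	struct cg a1  = { "A1",  &a,   5,  5, 120,  60 };
	struct cg a11 = { "A11", &a1, 10, 50,  40,  20 };

	balance(&a11, 16);	/* walks A11 -> A1 -> A */
	return 0;
}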

Signed-off-by: Greg Thelen <gthelen@google.com>
---
Changelog since v8:

- Use 'memcg' rather than 'mem' for local variables and parameters.
  This is consistent with other memory controller code.

- No more field additions to struct writeback_control.

- Added more comments to mem_cgroup_balance_dirty_pages().

- Adapted to changes in writeback_inodes_wb().

- Improved mem_cgroup_hierarchical_dirty_info() comment.

 include/linux/memcontrol.h        |   18 ++++
 include/trace/events/memcontrol.h |   88 ++++++++++++++++++++
 mm/memcontrol.c                   |  165 +++++++++++++++++++++++++++++++++++++
 3 files changed, 271 insertions(+), 0 deletions(-)

diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
index 103d297..f49bd2d 100644
--- a/include/linux/memcontrol.h
+++ b/include/linux/memcontrol.h
@@ -186,6 +186,11 @@ bool should_writeback_mem_cgroup_inode(struct inode *inode,
 				       bool shared_inodes);
 bool mem_cgroups_over_bground_dirty_thresh(void);
 void mem_cgroup_writeback_done(void);
+bool mem_cgroup_hierarchical_dirty_info(unsigned long sys_available_mem,
+					struct mem_cgroup *memcg,
+					struct dirty_info *info);
+void mem_cgroup_balance_dirty_pages(struct address_space *mapping,
+				    unsigned long write_chunk);
 
 unsigned long mem_cgroup_soft_limit_reclaim(struct zone *zone, int order,
 						gfp_t gfp_mask,
@@ -402,6 +407,19 @@ static inline void mem_cgroup_writeback_done(void)
 {
 }
 
+static inline void mem_cgroup_balance_dirty_pages(struct address_space *mapping,
+						  unsigned long write_chunk)
+{
+}
+
+static inline bool
+mem_cgroup_hierarchical_dirty_info(unsigned long sys_available_mem,
+				   struct mem_cgroup *memcg,
+				   struct dirty_info *info)
+{
+	return false;
+}
+
 static inline
 unsigned long mem_cgroup_soft_limit_reclaim(struct zone *zone, int order,
 					    gfp_t gfp_mask,
diff --git a/include/trace/events/memcontrol.h b/include/trace/events/memcontrol.h
index 966aac0..20bbb85 100644
--- a/include/trace/events/memcontrol.h
+++ b/include/trace/events/memcontrol.h
@@ -113,6 +113,94 @@ TRACE_EVENT(mem_cgroups_over_bground_dirty_thresh,
 		  __entry->first_id)
 )
 
+DECLARE_EVENT_CLASS(mem_cgroup_consider_writeback,
+	TP_PROTO(unsigned short css_id,
+		 struct backing_dev_info *bdi,
+		 unsigned long nr_reclaimable,
+		 unsigned long thresh,
+		 bool over_limit),
+
+	TP_ARGS(css_id, bdi, nr_reclaimable, thresh, over_limit),
+
+	TP_STRUCT__entry(
+		__field(unsigned short, css_id)
+		__field(struct backing_dev_info *, bdi)
+		__field(unsigned long, nr_reclaimable)
+		__field(unsigned long, thresh)
+		__field(bool, over_limit)
+	),
+
+	TP_fast_assign(
+		__entry->css_id = css_id;
+		__entry->bdi = bdi;
+		__entry->nr_reclaimable = nr_reclaimable;
+		__entry->thresh = thresh;
+		__entry->over_limit = over_limit;
+	),
+
+	TP_printk("css_id=%d bdi=%p nr_reclaimable=%ld thresh=%ld "
+		  "over_limit=%d", __entry->css_id, __entry->bdi,
+		  __entry->nr_reclaimable, __entry->thresh, __entry->over_limit)
+)
+
+#define DEFINE_MEM_CGROUP_CONSIDER_WRITEBACK_EVENT(name) \
+DEFINE_EVENT(mem_cgroup_consider_writeback, name, \
+	TP_PROTO(unsigned short id, \
+		 struct backing_dev_info *bdi, \
+		 unsigned long nr_reclaimable, \
+		 unsigned long thresh, \
+		 bool over_limit), \
+	TP_ARGS(id, bdi, nr_reclaimable, thresh, over_limit) \
+)
+
+DEFINE_MEM_CGROUP_CONSIDER_WRITEBACK_EVENT(mem_cgroup_consider_bg_writeback);
+DEFINE_MEM_CGROUP_CONSIDER_WRITEBACK_EVENT(mem_cgroup_consider_fg_writeback);
+
+TRACE_EVENT(mem_cgroup_fg_writeback,
+	TP_PROTO(unsigned long write_chunk,
+		 long nr_written,
+		 unsigned short css_id,
+		 bool shared_inodes),
+
+	TP_ARGS(write_chunk, nr_written, css_id, shared_inodes),
+
+	TP_STRUCT__entry(
+		__field(unsigned long, write_chunk)
+		__field(long, nr_written)
+		__field(unsigned short, css_id)
+		__field(bool, shared_inodes)
+	),
+
+	TP_fast_assign(
+		__entry->write_chunk = write_chunk;
+		__entry->nr_written = nr_written;
+		__entry->css_id = css_id;
+		__entry->shared_inodes = shared_inodes;
+	),
+
+	TP_printk("css_id=%d write_chunk=%ld nr_written=%ld shared_inodes=%d",
+		  __entry->css_id,
+		  __entry->write_chunk,
+		  __entry->nr_written,
+		  __entry->shared_inodes)
+)
+
+TRACE_EVENT(mem_cgroup_enable_shared_writeback,
+	TP_PROTO(unsigned short css_id),
+
+	TP_ARGS(css_id),
+
+	TP_STRUCT__entry(
+		__field(unsigned short, css_id)
+		),
+
+	TP_fast_assign(
+		__entry->css_id = css_id;
+		),
+
+	TP_printk("enabling shared writeback for memcg %d", __entry->css_id)
+)
+
 #endif /* _TRACE_MEMCONTROL_H */
 
 /* This part must be outside protection */
diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index 5092a68..9d0b559 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -1696,6 +1696,171 @@ void mem_cgroup_writeback_done(void)
 	}
 }
 
+/*
+ * This routine must be called periodically by processes which generate dirty
+ * pages.  It considers the dirty page usage and thresholds of the current
+ * cgroup and (if hierarchical accounting is enabled) of its ancestral memcg.
+ * If any of the considered memcg are over their background dirty limit, then
+ * background writeback is queued.  If any are over the foreground dirty limit,
+ * then the dirtying task is throttled while writing dirty data.  The per-memcg
+ * dirty limits checked by this routine are distinct from the per-system,
+ * per-bdi, and per-task limits considered by balance_dirty_pages().
+ *
+ *   Example hierarchy:
+ *                 root
+ *            A            B
+ *        A1      A2         B1
+ *     A11 A12  A21 A22
+ *
+ * Assume that mem_cgroup_balance_dirty_pages() is called on A11.  This routine
+ * starts at A11 and walks upwards towards the root.  If A11 is over its dirty
+ * limit, then write back A11's inodes until it is under its limit.  Next check
+ * A1: if over its limit, then write A1, A11, and A12.  Then check A: if A is
+ * over its limit, then invoke writeback on A's subtree until A is under it.
+ */
+void mem_cgroup_balance_dirty_pages(struct address_space *mapping,
+				    unsigned long write_chunk)
+{
+	struct backing_dev_info *bdi = mapping->backing_dev_info;
+	struct mem_cgroup *memcg;
+	struct mem_cgroup *ref_memcg;
+	struct dirty_info info;
+	unsigned long nr_reclaimable;
+	unsigned long nr_written;
+	unsigned long sys_available_mem;
+	unsigned long pause = 1;
+	unsigned short id;
+	bool over;
+	bool shared_inodes;
+
+	if (mem_cgroup_disabled())
+		return;
+
+	sys_available_mem = determine_dirtyable_memory();
+
+	/* reference the memcg so it is not deleted during this routine */
+	rcu_read_lock();
+	memcg = mem_cgroup_from_task(current);
+	if (memcg && mem_cgroup_is_root(memcg))
+		memcg = NULL;
+	if (memcg)
+		css_get(&memcg->css);
+	rcu_read_unlock();
+	ref_memcg = memcg;
+
+	/* balance entire ancestry of current's memcg. */
+	for (; mem_cgroup_has_dirty_limit(memcg);
+	     memcg = parent_mem_cgroup(memcg)) {
+		id = css_id(&memcg->css);
+
+		/*
+		 * Keep throttling and writing inode data so long as the memcg
+		 * is over its dirty limit.  Inodes written by multiple memcg
+		 * (aka shared inodes) cannot easily be attributed to a
+		 * particular memcg.  Shared inodes are thought to be much
+		 * rarer than non-shared inodes.  First try to satisfy this
+		 * memcg's dirty limits using non-shared inodes.
+		 */
+		for (shared_inodes = false; ; ) {
+			/*
+			 * if memcg is under dirty limit, then break from
+			 * throttling loop.
+			 */
+			mem_cgroup_dirty_info(sys_available_mem, memcg, &info);
+			nr_reclaimable = dirty_info_reclaimable(&info);
+			over = nr_reclaimable > info.dirty_thresh;
+			trace_mem_cgroup_consider_fg_writeback(
+				id, bdi, nr_reclaimable, info.dirty_thresh,
+				over);
+			if (!over)
+				break;
+
+			nr_written = writeback_inodes_wb(&bdi->wb, write_chunk,
+							 memcg, shared_inodes);
+			trace_mem_cgroup_fg_writeback(write_chunk, nr_written,
+						      id, shared_inodes);
+			/* if no progress, then consider shared inodes */
+			if ((nr_written == 0) && !shared_inodes) {
+				trace_mem_cgroup_enable_shared_writeback(id);
+				shared_inodes = true;
+			}
+
+			__set_current_state(TASK_UNINTERRUPTIBLE);
+			io_schedule_timeout(pause);
+
+			/*
+			 * Increase the delay for each loop, up to our previous
+			 * default of taking a 100ms nap.
+			 */
+			pause <<= 1;
+			if (pause > HZ / 10)
+				pause = HZ / 10;
+		}
+
+		/* if memcg is over background limit, then queue bg writeback */
+		over = nr_reclaimable >= info.background_thresh;
+		trace_mem_cgroup_consider_bg_writeback(
+			id, bdi, nr_reclaimable, info.background_thresh,
+			over);
+		if (over)
+			mem_cgroup_queue_bg_writeback(memcg, bdi);
+	}
+
+	if (ref_memcg)
+		css_put(&ref_memcg->css);
+}
+
+/*
+ * Set @info to the dirty thresholds and usage of the memcg (within the
+ * ancestral chain of @memcg) closest to its dirty limit or the first memcg over
+ * its limit.
+ *
+ * The check is not stable because the usage and limits can change
+ * asynchronously to this routine.
+ *
+ * If @memcg has no per-cgroup dirty limits, then returns false.
+ * Otherwise @info is set and returns true.
+ */
+bool mem_cgroup_hierarchical_dirty_info(unsigned long sys_available_mem,
+					struct mem_cgroup *memcg,
+					struct dirty_info *info)
+{
+	unsigned long usage;
+	struct dirty_info uninitialized_var(cur_info);
+
+	if (mem_cgroup_disabled())
+		return false;
+
+	info->nr_writeback = ULONG_MAX;  /* invalid initial value */
+
+	/* walk up hierarchy enabled parents */
+	for (; mem_cgroup_has_dirty_limit(memcg);
+	     memcg = parent_mem_cgroup(memcg)) {
+		mem_cgroup_dirty_info(sys_available_mem, memcg, &cur_info);
+		usage = dirty_info_reclaimable(&cur_info) +
+			cur_info.nr_writeback;
+
+		/* if over limit, stop searching */
+		if (usage >= cur_info.dirty_thresh) {
+			*info = cur_info;
+			break;
+		}
+
+		/*
+		 * Save dirty usage of memcg closest to its limit if either:
+		 *     - memcg is the first memcg considered
+		 *     - memcg dirty margin is smaller than last recorded one
+		 */
+		if ((info->nr_writeback == ULONG_MAX) ||
+		    (cur_info.dirty_thresh - usage) <
+		    (info->dirty_thresh -
+		     (dirty_info_reclaimable(info) + info->nr_writeback)))
+			*info = cur_info;
+	}
+
+	return info->nr_writeback != ULONG_MAX;
+}
+
 static void mem_cgroup_start_move(struct mem_cgroup *mem)
 {
 	int cpu;
-- 
1.7.3.1


^ permalink raw reply related	[flat|nested] 72+ messages in thread
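
The throttling nap in mem_cgroup_balance_dirty_pages() above starts at one
jiffy, doubles on each loop, and is capped at HZ/10 (100ms).  A standalone
illustration of just that backoff arithmetic, assuming HZ=1000 purely for the
printout:

#include <stdio.h>

#define HZ 1000	/* assumed tick rate, for illustration only */

int main(void)
{
	unsigned long pause = 1;
	int iteration;

	for (iteration = 0; iteration < 10; iteration++) {
		printf("iteration %d: nap for %lu jiffies\n", iteration, pause);
		pause <<= 1;
		if (pause > HZ / 10)
			pause = HZ / 10;	/* never nap longer than 100ms */
	}
	return 0;
}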

* [PATCH v9 13/13] memcg: check memcg dirty limits in page writeback
  2011-08-17 16:14 ` Greg Thelen
@ 2011-08-17 16:15   ` Greg Thelen
  -1 siblings, 0 replies; 72+ messages in thread
From: Greg Thelen @ 2011-08-17 16:15 UTC (permalink / raw)
  To: Andrew Morton
  Cc: linux-kernel, linux-mm, containers, linux-fsdevel,
	KAMEZAWA Hiroyuki, Balbir Singh, Daisuke Nishimura, Minchan Kim,
	Johannes Weiner, Wu Fengguang, Dave Chinner, Vivek Goyal,
	Andrea Righi, Ciju Rajan K, David Rientjes, Greg Thelen

If the current process is in a non-root memcg, then
balance_dirty_pages() will consider the memcg dirty limits as well as
the system-wide limits.  This allows different cgroups to have distinct
dirty limits which trigger direct and background writeback at different
levels.

If called with a mem_cgroup, then throttle_vm_writeout() queries the
given cgroup for its dirty memory usage limits.
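
As a rough sketch of the combined check that the patched throttle_vm_writeout()
performs, the snippet below compares a global writeback+unstable count against
a threshold boosted by 10%, and, when a memcg snapshot is supplied, applies the
same boosted comparison to the memcg numbers.  The struct dirty_snapshot type
and the sample values are invented for this illustration only; the real code
reads global_page_state() counters and mem_cgroup_hierarchical_dirty_info().

#include <stdbool.h>
#include <stdio.h>

/* Invented snapshot of the numbers the real code reads from the global
 * counters and from mem_cgroup_hierarchical_dirty_info(). */
struct dirty_snapshot {
	unsigned long nr_writeback;
	unsigned long nr_unstable_nfs;
	unsigned long dirty_thresh;
};

/* true when the caller may proceed, false when it should wait and retry */
static bool under_limits(const struct dirty_snapshot *global,
			 const struct dirty_snapshot *memcg)
{
	/* boost thresholds by 10% so page allocators are not DoS'ed */
	unsigned long gthresh = global->dirty_thresh + global->dirty_thresh / 10;

	if (global->nr_unstable_nfs + global->nr_writeback > gthresh)
		return false;

	if (memcg) {
		unsigned long mthresh =
			memcg->dirty_thresh + memcg->dirty_thresh / 10;

		if (memcg->nr_unstable_nfs + memcg->nr_writeback > mthresh)
			return false;
	}
	return true;
}

int main(void)
{
	struct dirty_snapshot global = { 400, 20, 1000 };
	struct dirty_snapshot memcg  = { 90, 10, 80 };

	/* global usage is fine, but the memcg is over its boosted limit */
	printf("proceed? %s\n", under_limits(&global, &memcg) ? "yes" : "no");
	return 0;
}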

Signed-off-by: Andrea Righi <andrea@betterlinux.com>
Signed-off-by: Greg Thelen <gthelen@google.com>
---
Changelog since v8:

- Use 'memcg' rather than 'mem' for local variables and parameters.
  This is consistent with other memory controller code.

 include/linux/writeback.h |    2 +-
 mm/page-writeback.c       |   35 +++++++++++++++++++++++++++++------
 mm/vmscan.c               |    2 +-
 3 files changed, 31 insertions(+), 8 deletions(-)

diff --git a/include/linux/writeback.h b/include/linux/writeback.h
index e6790e8..0f809e3 100644
--- a/include/linux/writeback.h
+++ b/include/linux/writeback.h
@@ -116,7 +116,7 @@ void laptop_mode_timer_fn(unsigned long data);
 #else
 static inline void laptop_sync_completion(void) { }
 #endif
-void throttle_vm_writeout(gfp_t gfp_mask);
+void throttle_vm_writeout(gfp_t gfp_mask, struct mem_cgroup *memcg);
 
 extern unsigned long global_dirty_limit;
 
diff --git a/mm/page-writeback.c b/mm/page-writeback.c
index 64de98c..9ce199d 100644
--- a/mm/page-writeback.c
+++ b/mm/page-writeback.c
@@ -645,7 +645,8 @@ static void bdi_update_bandwidth(struct backing_dev_info *bdi,
  * data.  It looks at the number of dirty pages in the machine and will force
  * the caller to perform writeback if the system is over `vm_dirty_ratio'.
  * If we're over `background_thresh' then the writeback threads are woken to
- * perform some writeout.
+ * perform some writeout.  The current task may belong to a cgroup with
+ * dirty limits, which are also checked.
  */
 static void balance_dirty_pages(struct address_space *mapping,
 				unsigned long write_chunk)
@@ -665,6 +666,8 @@ static void balance_dirty_pages(struct address_space *mapping,
 	struct backing_dev_info *bdi = mapping->backing_dev_info;
 	unsigned long start_time = jiffies;
 
+	mem_cgroup_balance_dirty_pages(mapping, write_chunk);
+
 	for (;;) {
 		nr_reclaimable = global_page_state(NR_FILE_DIRTY) +
 					global_page_state(NR_UNSTABLE_NFS);
@@ -856,23 +859,43 @@ void balance_dirty_pages_ratelimited_nr(struct address_space *mapping,
 }
 EXPORT_SYMBOL(balance_dirty_pages_ratelimited_nr);
 
-void throttle_vm_writeout(gfp_t gfp_mask)
+/*
+ * Throttle the current task if it is near dirty memory usage limits.  Both
+ * global dirty memory limits and (if @memcg is given) per-cgroup dirty memory
+ * limits are checked.
+ *
+ * If near limits, then wait for usage to drop.  Dirty usage should drop because
+ * dirty producers should have used balance_dirty_pages(), which would have
+ * scheduled writeback.
+ */
+void throttle_vm_writeout(gfp_t gfp_mask, struct mem_cgroup *memcg)
 {
 	unsigned long background_thresh;
 	unsigned long dirty_thresh;
+	struct dirty_info memcg_info;
+	bool do_memcg;
 
         for ( ; ; ) {
 		global_dirty_limits(&background_thresh, &dirty_thresh);
+		do_memcg = memcg &&
+			mem_cgroup_hierarchical_dirty_info(
+				determine_dirtyable_memory(), memcg,
+				&memcg_info);
 
                 /*
                  * Boost the allowable dirty threshold a bit for page
                  * allocators so they don't get DoS'ed by heavy writers
                  */
                 dirty_thresh += dirty_thresh / 10;      /* wheeee... */
-
-                if (global_page_state(NR_UNSTABLE_NFS) +
-			global_page_state(NR_WRITEBACK) <= dirty_thresh)
-                        	break;
+		if (do_memcg)
+			memcg_info.dirty_thresh += memcg_info.dirty_thresh / 10;
+
+		if ((global_page_state(NR_UNSTABLE_NFS) +
+		     global_page_state(NR_WRITEBACK) <= dirty_thresh) &&
+		    (!do_memcg ||
+		     (memcg_info.nr_unstable_nfs +
+		      memcg_info.nr_writeback <= memcg_info.dirty_thresh)))
+			break;
                 congestion_wait(BLK_RW_ASYNC, HZ/10);
 
 		/*
diff --git a/mm/vmscan.c b/mm/vmscan.c
index fb0ae99..3c57788 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -2068,7 +2068,7 @@ restart:
 					sc->nr_scanned - nr_scanned, sc))
 		goto restart;
 
-	throttle_vm_writeout(sc->gfp_mask);
+	throttle_vm_writeout(sc->gfp_mask, sc->mem_cgroup);
 }
 
 /*
-- 
1.7.3.1


^ permalink raw reply related	[flat|nested] 72+ messages in thread

* Re: [PATCH v9 00/13] memcg: per cgroup dirty page limiting
  2011-08-17 16:14 ` Greg Thelen
@ 2011-08-18  0:35   ` KAMEZAWA Hiroyuki
  -1 siblings, 0 replies; 72+ messages in thread
From: KAMEZAWA Hiroyuki @ 2011-08-18  0:35 UTC (permalink / raw)
  To: Greg Thelen
  Cc: Andrew Morton, linux-kernel, linux-mm, containers, linux-fsdevel,
	Balbir Singh, Daisuke Nishimura, Minchan Kim, Johannes Weiner,
	Wu Fengguang, Dave Chinner, Vivek Goyal, Andrea Righi,
	Ciju Rajan K, David Rientjes

On Wed, 17 Aug 2011 09:14:52 -0700
Greg Thelen <gthelen@google.com> wrote:

> This patch series provides the ability for each cgroup to have independent dirty
> page usage limits.  Limiting dirty memory fixes the max amount of dirty (hard to
> reclaim) page cache used by a cgroup.  This allows for better per cgroup memory
> isolation and fewer memcg OOMs.
> 

Thank you for your patient work!  I really want this feature.
(Hopefully before we tune vmscan)

I hope this patch will not have heavy HUNKs..

Thanks,
-Kame


> Three features are included in this patch series:
>   1. memcg dirty page accounting
>   2. memcg writeback
>   3. memcg dirty page limiting
> 
> 
> 1. memcg dirty page accounting
> 
> Each memcg maintains a dirty page count and dirty page limit.  Previous
> iterations of this patch series have refined this logic.  The interface is
> similar to the procfs interface: /proc/sys/vm/dirty_*.  It is possible to
> configure a limit to trigger throttling of a dirtier or queue background
> writeback.  The root cgroup memory.dirty_* control files are read-only and match
> the contents of the /proc/sys/vm/dirty_* files.
> 
> 
> 2. memcg writeback
> 
> Having per cgroup dirty memory limits is not very interesting unless writeback
> is also cgroup aware.  There is not much isolation if cgroups have to writeback data
> from outside the affected cgroup to get below the cgroup dirty memory threshold.
> 
> Per-memcg dirty limits are provided to support isolation and thus cross cgroup
> inode sharing is not a priority.  This allows the code be simpler.
> 
> To add cgroup awareness to writeback, this series adds an i_memcg field to
> struct address_space to allow writeback to isolate inodes for a particular
> cgroup.  When an inode is marked dirty, i_memcg is set to the current cgroup.
> When inode pages are marked dirty the i_memcg field is compared against the
> page's cgroup.  If they differ, then the inode is marked as shared by setting
> i_memcg to a special shared value (zero).
> 
> When performing per-memcg writeback, move_expired_inodes() scans the per bdi
> b_dirty list using each inode's i_memcg and the global over-limit memcg bitmap
> to determine if the inode should be written.  This inode scan may involve
> skipping many unrelated inodes from other cgroup.  To test the scanning
> overhead, I created two cgroups (cgroup_A with 100,000 dirty inodes under A's
> dirty limit, cgroup_B with 1 inode over B's dirty limit).  The writeback code
> then had to skip 100,000 inodes when balancing cgroup_B to find the one inode
> that needed writing.  This scanning took 58 msec to skip 100,000 foreign inodes.
> 
> 
> 3. memcg dirty page limiting
> 
> balance_dirty_pages() calls mem_cgroup_balance_dirty_pages(), which checks the
> dirty usage vs dirty thresholds for the current cgroup and its parents.  As
> cgroups exceed their background limit, they are marked in a global over-limit
> bitmap (indexed by cgroup id) and the bdi flusher is awoke.  As a cgroup hits is
> foreground limit, the task is throttled while performing foreground writeback on
> inodes owned by the over-limit cgroup.  If mem_cgroup_balance_dirty_pages() is
> unable to get below the dirty page threshold writing per-memcg inodes, then
> downshifts to also writing shared inodes (i_memcg=0).
> 
> I know that there is some significant IO-less balance_dirty_pages() changes.  I
> am not trying to derail that effort.  I have done moderate functional testing of
> the newly proposed features.
> 
> The memcg aspects of this patch are pretty mature.  The writeback aspects are
> still fairly new and need feedback from the writeback community.  These features
> are linked, so it's not clear which branch to send the changes to (the writeback
> development branch or mmotm).
> 
> Here is an example of the memcg OOM that is avoided with this patch series:
> 	# mkdir /dev/cgroup/memory/x
> 	# echo 100M > /dev/cgroup/memory/x/memory.limit_in_bytes
> 	# echo $$ > /dev/cgroup/memory/x/tasks
> 	# dd if=/dev/zero of=/data/f1 bs=1k count=1M &
>         # dd if=/dev/zero of=/data/f2 bs=1k count=1M &
>         # wait
> 	[1]-  Killed                  dd if=/dev/zero of=/data/f1 bs=1M count=1k
> 	[2]+  Killed                  dd if=/dev/zero of=/data/f1 bs=1M count=1k
> 
> Changes since -v8:
> - Reordered patches for better more readability.
> 
> - No longer passing struct writeback_control into memcontrol functions.  Instead
>   the needed attributes (memcg_id, etc.) are explicitly passed in.  Therefore no
>   more field additions to struct writeback_control.
> 
> - Replaced 'Andrea Righi <arighi@develer.com>' with 
>   'Andrea Righi <andrea@betterlinux.com>' in commit descriptions.
> 
> - Rebased to mmotm-2011-08-02-16-19
> 
> Greg Thelen (13):
>   memcg: document cgroup dirty memory interfaces
>   memcg: add page_cgroup flags for dirty page tracking
>   memcg: add dirty page accounting infrastructure
>   memcg: add kernel calls for memcg dirty page stats
>   memcg: add mem_cgroup_mark_inode_dirty()
>   memcg: add dirty limits to mem_cgroup
>   memcg: add cgroupfs interface to memcg dirty limits
>   memcg: dirty page accounting support routines
>   memcg: create support routines for writeback
>   writeback: pass wb_writeback_work into move_expired_inodes()
>   writeback: make background writeback cgroup aware
>   memcg: create support routines for page writeback
>   memcg: check memcg dirty limits in page writeback
> 
>  Documentation/cgroups/memory.txt  |   70 ++++
>  fs/buffer.c                       |    2 +-
>  fs/fs-writeback.c                 |  113 ++++--
>  fs/inode.c                        |    3 +
>  fs/nfs/write.c                    |    4 +
>  fs/sync.c                         |    2 +-
>  include/linux/cgroup.h            |    1 +
>  include/linux/fs.h                |    9 +
>  include/linux/memcontrol.h        |   64 +++-
>  include/linux/page_cgroup.h       |   23 ++
>  include/linux/writeback.h         |    9 +-
>  include/trace/events/memcontrol.h |  207 ++++++++++
>  kernel/cgroup.c                   |    1 -
>  mm/backing-dev.c                  |    3 +-
>  mm/filemap.c                      |    1 +
>  mm/memcontrol.c                   |  760 ++++++++++++++++++++++++++++++++++++-
>  mm/page-writeback.c               |   44 ++-
>  mm/truncate.c                     |    1 +
>  mm/vmscan.c                       |    5 +-
>  19 files changed, 1265 insertions(+), 57 deletions(-)
>  create mode 100644 include/trace/events/memcontrol.h
> 
> -- 
> 1.7.3.1
> 
> 


^ permalink raw reply	[flat|nested] 72+ messages in thread

* Re: [PATCH v9 03/13] memcg: add dirty page accounting infrastructure
  2011-08-17 16:14   ` Greg Thelen
@ 2011-08-18  0:39     ` KAMEZAWA Hiroyuki
  -1 siblings, 0 replies; 72+ messages in thread
From: KAMEZAWA Hiroyuki @ 2011-08-18  0:39 UTC (permalink / raw)
  To: Greg Thelen
  Cc: Andrew Morton, linux-kernel, linux-mm, containers, linux-fsdevel,
	Balbir Singh, Daisuke Nishimura, Minchan Kim, Johannes Weiner,
	Wu Fengguang, Dave Chinner, Vivek Goyal, Andrea Righi,
	Ciju Rajan K, David Rientjes

On Wed, 17 Aug 2011 09:14:55 -0700
Greg Thelen <gthelen@google.com> wrote:

> Add memcg routines to count dirty, writeback, and unstable_NFS pages.
> These routines are not yet used by the kernel to count such pages.  A
> later change adds kernel calls to these new routines.
> 
> As inode pages are marked dirty, if the dirtied page's cgroup differs
> from the inode's cgroup, then mark the inode shared across several
> cgroup.
> 
> Signed-off-by: Greg Thelen <gthelen@google.com>
> Signed-off-by: Andrea Righi <andrea@betterlinux.com>

Acked-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>

A nitpick..



> +static inline
> +void mem_cgroup_move_account_page_stat(struct mem_cgroup *from,
> +				       struct mem_cgroup *to,
> +				       enum mem_cgroup_stat_index idx)
> +{
> +	preempt_disable();
> +	__this_cpu_dec(from->stat->count[idx]);
> +	__this_cpu_inc(to->stat->count[idx]);
> +	preempt_enable();
> +}
> +

this_cpu_dec()
this_cpu_inc()

without preempt_disable/enable will work.  A CPU change between the dec and
the inc will not be a problem.
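
A userspace analogue of this point (this is not the kernel this_cpu API): the
per-cpu statistics are only ever consumed as a sum over all CPUs, so it is
harmless if the decrement and the increment land in different CPUs' slots
after a migration.

#include <stdio.h>

#define NR_CPUS 4

/* mock per-cpu storage; merely mimics per-cpu counters for illustration */
static long from_stat[NR_CPUS];
static long to_stat[NR_CPUS];

static long total(const long *stat)
{
	long sum = 0;
	int cpu;

	for (cpu = 0; cpu < NR_CPUS; cpu++)
		sum += stat[cpu];
	return sum;
}

int main(void)
{
	from_stat[0] = 10;	/* pages originally accounted on CPU 0 */

	/* "migration" between the two updates: dec on CPU 0, inc on CPU 2 */
	from_stat[0]--;
	to_stat[2]++;

	/* the per-memcg totals remain consistent */
	printf("from=%ld to=%ld\n", total(from_stat), total(to_stat));
	return 0;
}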

Thanks,
-Kame




^ permalink raw reply	[flat|nested] 72+ messages in thread

* Re: [PATCH v9 05/13] memcg: add mem_cgroup_mark_inode_dirty()
  2011-08-17 16:14   ` Greg Thelen
@ 2011-08-18  0:51     ` KAMEZAWA Hiroyuki
  -1 siblings, 0 replies; 72+ messages in thread
From: KAMEZAWA Hiroyuki @ 2011-08-18  0:51 UTC (permalink / raw)
  To: Greg Thelen
  Cc: Andrew Morton, linux-kernel, linux-mm, containers, linux-fsdevel,
	Balbir Singh, Daisuke Nishimura, Minchan Kim, Johannes Weiner,
	Wu Fengguang, Dave Chinner, Vivek Goyal, Andrea Righi,
	Ciju Rajan K, David Rientjes

On Wed, 17 Aug 2011 09:14:57 -0700
Greg Thelen <gthelen@google.com> wrote:

> Create the mem_cgroup_mark_inode_dirty() routine, which is called when
> an inode is marked dirty.  In kernels without memcg, this is an inline
> no-op.
> 
> Add i_memcg field to struct address_space.  When an inode is marked
> dirty with mem_cgroup_mark_inode_dirty(), the css_id of current memcg is
> recorded in i_memcg.  Per-memcg writeback (introduced in a latter
> change) uses this field to isolate inodes associated with a particular
> memcg.
> 
> The type of i_memcg is an 'unsigned short' because it stores the css_id
> of the memcg.  Using a struct mem_cgroup pointer would be larger and
> also create a reference on the memcg which would hang memcg rmdir
> deletion.  Usage of a css_id is not a reference so cgroup deletion is
> not affected.  The memcg can be deleted without cleaning up the i_memcg
> field.  When a memcg is deleted its pages are recharged to the cgroup
> parent, and the related inode(s) are marked as shared thus
> disassociating the inodes from the deleted cgroup.
> 
> A mem_cgroup_mark_inode_dirty() tracepoint is also included to allow for
> easier understanding of memcg writeback operation.
> 
> Signed-off-by: Greg Thelen <gthelen@google.com>

Acked-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>


> ---
> Changelog since v8:
> - Use I_MEMCG_SHARED when initializing i_memcg.
> 
> - Use 'memcg' rather than 'mem' for local variables.  This is consistent with
>   other memory controller code.
> 
> - The logic in mem_cgroup_update_page_stat() and mem_cgroup_move_account() which
>   marks inodes I_MEMCG_SHARED is now part of this patch.  This makes more sense
>   because this is that patch that introduces that shared-inode concept.
> 
yes, this makes the patch clearer.

Thanks,
-Kame



^ permalink raw reply	[flat|nested] 72+ messages in thread

* Re: [PATCH v9 06/13] memcg: add dirty limits to mem_cgroup
  2011-08-17 16:14   ` Greg Thelen
@ 2011-08-18  0:53     ` KAMEZAWA Hiroyuki
  -1 siblings, 0 replies; 72+ messages in thread
From: KAMEZAWA Hiroyuki @ 2011-08-18  0:53 UTC (permalink / raw)
  To: Greg Thelen
  Cc: Andrew Morton, linux-kernel, linux-mm, containers, linux-fsdevel,
	Balbir Singh, Daisuke Nishimura, Minchan Kim, Johannes Weiner,
	Wu Fengguang, Dave Chinner, Vivek Goyal, Andrea Righi,
	Ciju Rajan K, David Rientjes

On Wed, 17 Aug 2011 09:14:58 -0700
Greg Thelen <gthelen@google.com> wrote:

> Extend mem_cgroup to contain dirty page limits.
> 
> Signed-off-by: Greg Thelen <gthelen@google.com>
> Signed-off-by: Andrea Righi <andrea@betterlinux.com>
> Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>

Acked-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>


^ permalink raw reply	[flat|nested] 72+ messages in thread

* Re: [PATCH v9 07/13] memcg: add cgroupfs interface to memcg dirty limits
  2011-08-17 16:14   ` Greg Thelen
@ 2011-08-18  0:55     ` KAMEZAWA Hiroyuki
  -1 siblings, 0 replies; 72+ messages in thread
From: KAMEZAWA Hiroyuki @ 2011-08-18  0:55 UTC (permalink / raw)
  To: Greg Thelen
  Cc: Andrew Morton, linux-kernel, linux-mm, containers, linux-fsdevel,
	Balbir Singh, Daisuke Nishimura, Minchan Kim, Johannes Weiner,
	Wu Fengguang, Dave Chinner, Vivek Goyal, Andrea Righi,
	Ciju Rajan K, David Rientjes, Balbir Singh

On Wed, 17 Aug 2011 09:14:59 -0700
Greg Thelen <gthelen@google.com> wrote:

> Add cgroupfs interface to memcg dirty page limits:
>   Direct write-out is controlled with:
>   - memory.dirty_ratio
>   - memory.dirty_limit_in_bytes
> 
>   Background write-out is controlled with:
>   - memory.dirty_background_ratio
>   - memory.dirty_background_limit_bytes
> 
> Other memcg cgroupfs files support 'M', 'm', 'k', 'K', 'g'
> and 'G' suffixes for byte counts.  This patch provides the
> same functionality for memory.dirty_limit_in_bytes and
> memory.dirty_background_limit_bytes.
> 
> Signed-off-by: Andrea Righi <andrea@betterlinux.com>
> Signed-off-by: Balbir Singh <balbir@linux.vnet.ibm.com>
> Signed-off-by: Greg Thelen <gthelen@google.com>

Acked-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>


^ permalink raw reply	[flat|nested] 72+ messages in thread

* Re: [PATCH v9 08/13] memcg: dirty page accounting support routines
  2011-08-17 16:15   ` Greg Thelen
@ 2011-08-18  1:05     ` KAMEZAWA Hiroyuki
  -1 siblings, 0 replies; 72+ messages in thread
From: KAMEZAWA Hiroyuki @ 2011-08-18  1:05 UTC (permalink / raw)
  To: Greg Thelen
  Cc: Andrew Morton, linux-kernel, linux-mm, containers, linux-fsdevel,
	Balbir Singh, Daisuke Nishimura, Minchan Kim, Johannes Weiner,
	Wu Fengguang, Dave Chinner, Vivek Goyal, Andrea Righi,
	Ciju Rajan K, David Rientjes

On Wed, 17 Aug 2011 09:15:00 -0700
Greg Thelen <gthelen@google.com> wrote:

> Added memcg dirty page accounting support routines.  These routines are
> used by later changes to provide memcg aware writeback and dirty page
> limiting.  A mem_cgroup_dirty_info() tracepoint is also included to
> allow for easier understanding of memcg writeback operation.
> 
> Signed-off-by: Greg Thelen <gthelen@google.com>

I have small comments.

> ---
> Changelog since v8:
> - Use 'memcg' rather than 'mem' for local variables and parameters.
>   This is consistent with other memory controller code.
> 
>  include/linux/memcontrol.h        |    9 ++
>  include/trace/events/memcontrol.h |   34 +++++++++
>  mm/memcontrol.c                   |  147 +++++++++++++++++++++++++++++++++++++
>  3 files changed, 190 insertions(+), 0 deletions(-)
> 
> diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
> index 630d3fa..9cc8841 100644
> --- a/include/linux/memcontrol.h
> +++ b/include/linux/memcontrol.h
> @@ -36,6 +36,15 @@ enum mem_cgroup_page_stat_item {
>  	MEMCG_NR_FILE_DIRTY, /* # of dirty pages in page cache */
>  	MEMCG_NR_FILE_WRITEBACK, /* # of pages under writeback */
>  	MEMCG_NR_FILE_UNSTABLE_NFS, /* # of NFS unstable pages */
> +	MEMCG_NR_DIRTYABLE_PAGES, /* # of pages that could be dirty */
> +};
> +
> +struct dirty_info {
> +	unsigned long dirty_thresh;
> +	unsigned long background_thresh;
> +	unsigned long nr_file_dirty;
> +	unsigned long nr_writeback;
> +	unsigned long nr_unstable_nfs;
>  };
>  
>  extern unsigned long mem_cgroup_isolate_pages(unsigned long nr_to_scan,
> diff --git a/include/trace/events/memcontrol.h b/include/trace/events/memcontrol.h
> index 781ef9fc..abf1306 100644
> --- a/include/trace/events/memcontrol.h
> +++ b/include/trace/events/memcontrol.h
> @@ -26,6 +26,40 @@ TRACE_EVENT(mem_cgroup_mark_inode_dirty,
>  	TP_printk("ino=%ld css_id=%d", __entry->ino, __entry->css_id)
>  )
>  
> +TRACE_EVENT(mem_cgroup_dirty_info,
> +	TP_PROTO(unsigned short css_id,
> +		 struct dirty_info *dirty_info),
> +
> +	TP_ARGS(css_id, dirty_info),
> +
> +	TP_STRUCT__entry(
> +		__field(unsigned short, css_id)
> +		__field(unsigned long, dirty_thresh)
> +		__field(unsigned long, background_thresh)
> +		__field(unsigned long, nr_file_dirty)
> +		__field(unsigned long, nr_writeback)
> +		__field(unsigned long, nr_unstable_nfs)
> +		),
> +
> +	TP_fast_assign(
> +		__entry->css_id = css_id;
> +		__entry->dirty_thresh = dirty_info->dirty_thresh;
> +		__entry->background_thresh = dirty_info->background_thresh;
> +		__entry->nr_file_dirty = dirty_info->nr_file_dirty;
> +		__entry->nr_writeback = dirty_info->nr_writeback;
> +		__entry->nr_unstable_nfs = dirty_info->nr_unstable_nfs;
> +		),
> +
> +	TP_printk("css_id=%d thresh=%ld bg_thresh=%ld dirty=%ld wb=%ld "
> +		  "unstable_nfs=%ld",
> +		  __entry->css_id,
> +		  __entry->dirty_thresh,
> +		  __entry->background_thresh,
> +		  __entry->nr_file_dirty,
> +		  __entry->nr_writeback,
> +		  __entry->nr_unstable_nfs)
> +)
> +
>  #endif /* _TRACE_MEMCONTROL_H */
>  
>  /* This part must be outside protection */
> diff --git a/mm/memcontrol.c b/mm/memcontrol.c
> index 4e01699..d54adf4 100644
> --- a/mm/memcontrol.c
> +++ b/mm/memcontrol.c
> @@ -1366,6 +1366,11 @@ int mem_cgroup_swappiness(struct mem_cgroup *memcg)
>  	return memcg->swappiness;
>  }
>  
> +static unsigned long dirty_info_reclaimable(struct dirty_info *info)
> +{
> +	return info->nr_file_dirty + info->nr_unstable_nfs;
> +}
> +
>  /*
>   * Return true if the current memory cgroup has local dirty memory settings.
>   * There is an allowed race between the current task migrating in-to/out-of the
> @@ -1396,6 +1401,148 @@ static void mem_cgroup_dirty_param(struct vm_dirty_param *param,
>  	}
>  }
>  
> +static inline bool mem_cgroup_can_swap(struct mem_cgroup *memcg)
> +{
> +	if (!do_swap_account)
> +		return nr_swap_pages > 0;
> +	return !memcg->memsw_is_minimum &&
> +		(res_counter_read_u64(&memcg->memsw, RES_LIMIT) > 0);
> +}

I think

	if (nr_swap_pages == 0)
		return false;
	if (!do_swap_account)
		return true;
	if (memcg->memsw_is_minimum)
		return false;
	if (res_counter_margin(&memcg->memsw) == 0)
		return false;
	return true;

is a correct check.

> +
> +static s64 mem_cgroup_local_page_stat(struct mem_cgroup *memcg,
> +				      enum mem_cgroup_page_stat_item item)
> +{
> +	s64 ret;
> +
> +	switch (item) {
> +	case MEMCG_NR_FILE_DIRTY:
> +		ret = mem_cgroup_read_stat(memcg, MEM_CGROUP_STAT_FILE_DIRTY);
> +		break;
> +	case MEMCG_NR_FILE_WRITEBACK:
> +		ret = mem_cgroup_read_stat(memcg,
> +					   MEM_CGROUP_STAT_FILE_WRITEBACK);
> +		break;
> +	case MEMCG_NR_FILE_UNSTABLE_NFS:
> +		ret = mem_cgroup_read_stat(memcg,
> +					   MEM_CGROUP_STAT_FILE_UNSTABLE_NFS);
> +		break;
> +	case MEMCG_NR_DIRTYABLE_PAGES:
> +		ret = mem_cgroup_read_stat(memcg, LRU_ACTIVE_FILE) +
> +			mem_cgroup_read_stat(memcg, LRU_INACTIVE_FILE);
> +		if (mem_cgroup_can_swap(memcg))
> +			ret += mem_cgroup_read_stat(memcg, LRU_ACTIVE_ANON) +
> +				mem_cgroup_read_stat(memcg, LRU_INACTIVE_ANON);
> +		break;
> +	default:
> +		BUG();
> +		break;
> +	}
> +	return ret;
> +}
> +
> +/*
> + * Return the number of additional pages that the @memcg cgroup could allocate.
> + * If use_hierarchy is set, then this involves checking parent mem cgroups to
> + * find the cgroup with the smallest free space.
> + */
> +static unsigned long
> +mem_cgroup_hierarchical_free_pages(struct mem_cgroup *memcg)
> +{
> +	u64 free;
> +	unsigned long min_free;
> +
> +	min_free = global_page_state(NR_FREE_PAGES);
> +
> +	while (memcg) {
> +		free = (res_counter_read_u64(&memcg->res, RES_LIMIT) -
> +			res_counter_read_u64(&memcg->res, RES_USAGE)) >>
> +			PAGE_SHIFT;

How about
		free = mem_cgroup_margin(&mem->res);
?

Thanks,
-Kame


^ permalink raw reply	[flat|nested] 72+ messages in thread
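
Kame's margin suggestion above, applied to the quoted walk, would look
roughly like the sketch below.  This is not the posted code: it assumes
res_counter_margin() returns limit minus usage in bytes (as in the
mem_cgroup_can_swap() suggestion earlier in this mail), reuses the
parent_mem_cgroup() walk that patch 12/13 also uses, and omits the
use_hierarchy check for brevity:

	static unsigned long
	mem_cgroup_hierarchical_free_pages(struct mem_cgroup *memcg)
	{
		unsigned long min_free = global_page_state(NR_FREE_PAGES);
		unsigned long free;

		while (memcg) {
			/* headroom of this level, converted to pages */
			free = res_counter_margin(&memcg->res) >> PAGE_SHIFT;
			min_free = min(min_free, free);
			memcg = parent_mem_cgroup(memcg);
		}
		return min_free;
	}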

* Re: [PATCH v9 09/13] memcg: create support routines for writeback
  2011-08-17 16:15   ` Greg Thelen
@ 2011-08-18  1:13     ` KAMEZAWA Hiroyuki
  -1 siblings, 0 replies; 72+ messages in thread
From: KAMEZAWA Hiroyuki @ 2011-08-18  1:13 UTC (permalink / raw)
  To: Greg Thelen
  Cc: Andrew Morton, linux-kernel, linux-mm, containers, linux-fsdevel,
	Balbir Singh, Daisuke Nishimura, Minchan Kim, Johannes Weiner,
	Wu Fengguang, Dave Chinner, Vivek Goyal, Andrea Righi,
	Ciju Rajan K, David Rientjes

On Wed, 17 Aug 2011 09:15:01 -0700
Greg Thelen <gthelen@google.com> wrote:

> Introduce memcg routines to assist in per-memcg writeback:
> 
> - mem_cgroups_over_bground_dirty_thresh() determines if any cgroups need
>   writeback because they are over their dirty memory threshold.
> 
> - should_writeback_mem_cgroup_inode() will be called by writeback to
>   determine if a particular inode should be written back.  The answer
>   depends on the writeback context (foreground, background,
>   try_to_free_pages, etc.).
> 
> - mem_cgroup_writeback_done() is used periodically during writeback to
>   update memcg writeback data.
> 
> These routines make use of a new over_bground_dirty_thresh bitmap that
> indicates which mem_cgroups are over their respective dirty background
> threshold.  As this bitmap is indexed by css_id, the largest possible
> css_id value is needed to create the bitmap.  So move the definition of
> CSS_ID_MAX from cgroup.c to cgroup.h.  This allows users of css_id() to
> know the largest possible css_id value.  This knowledge can be used to
> build such per-cgroup bitmaps.
> 
> Make determine_dirtyable_memory() non-static because it is needed by
> mem_cgroup_writeback_done().
> 
> Signed-off-by: Greg Thelen <gthelen@google.com>

Acked-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>





^ permalink raw reply	[flat|nested] 72+ messages in thread
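
The over_bground_dirty_thresh bitmap described in the changelog is, in
effect, one bit per css_id.  A minimal sketch of how such a bitmap can be
declared and used (not the posted code; the helper names are invented):

	/* one bit per cgroup, indexed by css_id; CSS_ID_MAX bounds the index */
	static DECLARE_BITMAP(over_bground_dirty_thresh, CSS_ID_MAX + 1);

	static void memcg_set_over_bground_thresh(unsigned short css_id)
	{
		set_bit(css_id, over_bground_dirty_thresh);
	}

	static bool memcg_over_bground_thresh(unsigned short css_id)
	{
		return test_bit(css_id, over_bground_dirty_thresh);
	}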

* Re: [PATCH v9 10/13] writeback: pass wb_writeback_work into move_expired_inodes()
  2011-08-17 16:15   ` Greg Thelen
@ 2011-08-18  1:15     ` KAMEZAWA Hiroyuki
  -1 siblings, 0 replies; 72+ messages in thread
From: KAMEZAWA Hiroyuki @ 2011-08-18  1:15 UTC (permalink / raw)
  To: Greg Thelen
  Cc: Andrew Morton, linux-kernel, linux-mm, containers, linux-fsdevel,
	Balbir Singh, Daisuke Nishimura, Minchan Kim, Johannes Weiner,
	Wu Fengguang, Dave Chinner, Vivek Goyal, Andrea Righi,
	Ciju Rajan K, David Rientjes

On Wed, 17 Aug 2011 09:15:02 -0700
Greg Thelen <gthelen@google.com> wrote:

> A later change to move_expired_inodes() requires passing fields from
> the writeback work descriptor into memcontrol code when determining if an
> inode should be considered for writeback.
> 
> Signed-off-by: Greg Thelen <gthelen@google.com>

Reviewed-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>


^ permalink raw reply	[flat|nested] 72+ messages in thread

* Re: [PATCH v9 11/13] writeback: make background writeback cgroup aware
  2011-08-17 16:15   ` Greg Thelen
@ 2011-08-18  1:23     ` KAMEZAWA Hiroyuki
  -1 siblings, 0 replies; 72+ messages in thread
From: KAMEZAWA Hiroyuki @ 2011-08-18  1:23 UTC (permalink / raw)
  To: Greg Thelen
  Cc: Andrew Morton, linux-kernel, linux-mm, containers, linux-fsdevel,
	Balbir Singh, Daisuke Nishimura, Minchan Kim, Johannes Weiner,
	Wu Fengguang, Dave Chinner, Vivek Goyal, Andrea Righi,
	Ciju Rajan K, David Rientjes

On Wed, 17 Aug 2011 09:15:03 -0700
Greg Thelen <gthelen@google.com> wrote:

> When the system is under the background dirty memory threshold but some
> cgroups are over their background dirty memory thresholds, write back
> only those inodes associated with the over-limit cgroups.
> 
> In addition to checking if the system dirty memory usage is over the
> system background threshold, over_bground_thresh() now checks if any
> cgroups are over their respective background dirty memory thresholds.
> 
> If over-limit cgroups are found, then the new
> wb_writeback_work.for_cgroup field is set to distinguish between system
> and memcg overages.  The new wb_writeback_work.shared_inodes field is
> also set.  Inodes written by multiple cgroup are marked owned by
> I_MEMCG_SHARED rather than a particular cgroup.  Such shared inodes
> cannot easily be attributed to a cgroup, so per-cgroup writeback
> (futures version of wakeup_flusher_threads and balance_dirty_pages)
> performs suboptimally in the presence of shared inodes.  Therefore,
> write shared inodes when performing cgroup background writeback.
> 
> If performing cgroup writeback, move_expired_inodes() skips inodes that
> do not contribute dirty pages to the cgroup being written back.
> 
> After writing some pages, wb_writeback() will call
> mem_cgroup_writeback_done() to update the set of over-bg-limits memcg.
> 
> This change also makes wakeup_flusher_threads() memcg aware so that
> per-cgroup try_to_free_pages() is able to operate more efficiently
> without having to write pages of foreign containers.  This change adds a
> mem_cgroup parameter to wakeup_flusher_threads() to allow callers,
> especially try_to_free_pages() and foreground writeback from
> balance_dirty_pages(), to specify a particular cgroup to write inodes
> from.
> 
> Signed-off-by: Greg Thelen <gthelen@google.com>
> ---
> Changelog since v8:
> 
> - Added optional memcg parameter to __bdi_start_writeback(),
>   bdi_start_writeback(), wakeup_flusher_threads(), writeback_inodes_wb().
> 
> - move_expired_inodes() now uses pass in struct wb_writeback_work instead of
>   struct writeback_control.
> 
> - Added comments to over_bground_thresh().
> 
>  fs/buffer.c               |    2 +-
>  fs/fs-writeback.c         |   96 +++++++++++++++++++++++++++++++++-----------
>  fs/sync.c                 |    2 +-
>  include/linux/writeback.h |    6 ++-
>  mm/backing-dev.c          |    3 +-
>  mm/page-writeback.c       |    3 +-
>  mm/vmscan.c               |    3 +-
>  7 files changed, 84 insertions(+), 31 deletions(-)
> 
> diff --git a/fs/buffer.c b/fs/buffer.c
> index dd0220b..da1fb23 100644
> --- a/fs/buffer.c
> +++ b/fs/buffer.c
> @@ -293,7 +293,7 @@ static void free_more_memory(void)
>  	struct zone *zone;
>  	int nid;
>  
> -	wakeup_flusher_threads(1024);
> +	wakeup_flusher_threads(1024, NULL);
>  	yield();
>  
>  	for_each_online_node(nid) {
> diff --git a/fs/fs-writeback.c b/fs/fs-writeback.c
> index e91fb82..ba55336 100644
> --- a/fs/fs-writeback.c
> +++ b/fs/fs-writeback.c
> @@ -38,10 +38,14 @@ struct wb_writeback_work {
>  	struct super_block *sb;
>  	unsigned long *older_than_this;
>  	enum writeback_sync_modes sync_mode;
> +	unsigned short memcg_id;	/* If non-zero, then writeback specified
> +					 * cgroup. */
>  	unsigned int tagged_writepages:1;
>  	unsigned int for_kupdate:1;
>  	unsigned int range_cyclic:1;
>  	unsigned int for_background:1;
> +	unsigned int for_cgroup:1;	/* cgroup writeback */
> +	unsigned int shared_inodes:1;	/* write inodes spanning cgroups */
>  
>  	struct list_head list;		/* pending work list */
>  	struct completion *done;	/* set if the caller waits */
> @@ -114,9 +118,12 @@ static void bdi_queue_work(struct backing_dev_info *bdi,
>  	spin_unlock_bh(&bdi->wb_lock);
>  }
>  
> +/*
> + * @memcg is optional.  If set, then limit writeback to the specified cgroup.
> + */
>  static void
>  __bdi_start_writeback(struct backing_dev_info *bdi, long nr_pages,
> -		      bool range_cyclic)
> +		      bool range_cyclic, struct mem_cgroup *memcg)
>  {
>  	struct wb_writeback_work *work;
>  
> @@ -136,6 +143,8 @@ __bdi_start_writeback(struct backing_dev_info *bdi, long nr_pages,
>  	work->sync_mode	= WB_SYNC_NONE;
>  	work->nr_pages	= nr_pages;
>  	work->range_cyclic = range_cyclic;
> +	work->memcg_id = memcg ? css_id(mem_cgroup_css(memcg)) : 0;
> +	work->for_cgroup = memcg != NULL;
>  


I couldn't find a patch for mem_cgroup_css(NULL). Is it in patch 1-10 ?
Other parts seems ok to me.


Thanks,
-Kame


^ permalink raw reply	[flat|nested] 72+ messages in thread
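
For the cgroup-aware inode selection this patch enables, the per-inode
decision roughly combines the new wb_writeback_work fields with the
i_memcg tag from patch 05/13.  A sketch of that check follows; it is not
the posted should_writeback_mem_cgroup_inode() logic (which also consults
the over-limit bitmap from patch 09/13), the helper name is invented, and
locking is ignored:

	static bool work_covers_inode(struct inode *inode,
				      struct wb_writeback_work *work)
	{
		unsigned short id = inode->i_mapping->i_memcg;

		if (!work->for_cgroup)
			return true;	/* plain bdi writeback: take everything */
		if (id == I_MEMCG_SHARED)
			return work->shared_inodes;
		return id == work->memcg_id;
	}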

* Re: [PATCH v9 12/13] memcg: create support routines for page writeback
  2011-08-17 16:15   ` Greg Thelen
@ 2011-08-18  1:38     ` KAMEZAWA Hiroyuki
  -1 siblings, 0 replies; 72+ messages in thread
From: KAMEZAWA Hiroyuki @ 2011-08-18  1:38 UTC (permalink / raw)
  To: Greg Thelen
  Cc: Andrew Morton, linux-kernel, linux-mm, containers, linux-fsdevel,
	Balbir Singh, Daisuke Nishimura, Minchan Kim, Johannes Weiner,
	Wu Fengguang, Dave Chinner, Vivek Goyal, Andrea Righi,
	Ciju Rajan K, David Rientjes

On Wed, 17 Aug 2011 09:15:04 -0700
Greg Thelen <gthelen@google.com> wrote:

> Introduce memcg routines to assist in per-memcg dirty page management:
> 
> - mem_cgroup_balance_dirty_pages() walks a memcg hierarchy comparing
>   dirty memory usage against memcg foreground and background thresholds.
>   If an over-background-threshold memcg is found, then per-memcg
>   background writeback is queued.  Per-memcg writeback differs from
>   classic, non-memcg, per bdi writeback by setting the new
>   writeback_control.for_cgroup bit.
> 
>   If an over-foreground-threshold memcg is found, then foreground
>   writeout occurs.  When performing foreground writeout, first consider
>   inodes exclusive to the memcg.  If unable to make enough progress,
>   then consider inodes shared between memcg.  Such cross-memcg inode
>   sharing is likely to be rare in situations that use per-cgroup memory
>   isolation.  So the approach tries to handle the common case well
>   without falling over in cases where such sharing exists.  This routine
>   is used by balance_dirty_pages() in a later change.
> 
> - mem_cgroup_hierarchical_dirty_info() returns the dirty memory usage
>   and limits of the memcg closest to (or over) its dirty limit.  This
>   will be used by throttle_vm_writeout() in a later change.
> 
> Signed-off-by: Greg Thelen <gthelen@google.com>

Reviewed-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>

Comparing with page-writeback.c, I have some questions.



> +/*
> + * This routine must be called periodically by processes which generate dirty
> + * pages.  It considers the dirty pages usage and thresholds of the current
> + * cgroup and (depending if hierarchical accounting is enabled) ancestral memcg.
> + * If any of the considered memcg are over their background dirty limit, then
> + * background writeback is queued.  If any are over the foreground dirty limit
> + * then the dirtying task is throttled while writing dirty data.  The per-memcg
> + * dirty limits checked by this routine are distinct from either the per-system,
> + * per-bdi, or per-task limits considered by balance_dirty_pages().
> + *
> + *   Example hierarchy:
> + *                 root
> + *            A            B
> + *        A1      A2         B1
> + *     A11 A12  A21 A22
> + *
> + * Assume that mem_cgroup_balance_dirty_pages() is called on A11.  This routine
> + * starts at A11 walking upwards towards the root.  If A11 is over dirty limit,
> + * then writeback A11 inodes until under limit.  Next check A1, if over limit
> + * then write A1,A11,A12.  Then check A.  If A is over A limit, then invoke
> + * writeback on A* until A is under A limit.
> + */
> +void mem_cgroup_balance_dirty_pages(struct address_space *mapping,
> +				    unsigned long write_chunk)
> +{
> +	struct backing_dev_info *bdi = mapping->backing_dev_info;
> +	struct mem_cgroup *memcg;
> +	struct mem_cgroup *ref_memcg;
> +	struct dirty_info info;
> +	unsigned long nr_reclaimable;
> +	unsigned long nr_written;
> +	unsigned long sys_available_mem;
> +	unsigned long pause = 1;
> +	unsigned short id;
> +	bool over;
> +	bool shared_inodes;
> +
> +	if (mem_cgroup_disabled())
> +		return;
> +
> +	sys_available_mem = determine_dirtyable_memory();
> +
> +	/* reference the memcg so it is not deleted during this routine */
> +	rcu_read_lock();
> +	memcg = mem_cgroup_from_task(current);
> +	if (memcg && mem_cgroup_is_root(memcg))
> +		memcg = NULL;
> +	if (memcg)
> +		css_get(&memcg->css);
> +	rcu_read_unlock();
> +	ref_memcg = memcg;
> +
> +	/* balance entire ancestry of current's memcg. */
> +	for (; mem_cgroup_has_dirty_limit(memcg);
> +	     memcg = parent_mem_cgroup(memcg)) {
> +		id = css_id(&memcg->css);
> +
> +		/*
> +		 * Keep throttling and writing inode data so long as memcg is
> +		 * over its dirty limit.  Inodes being written by multiple memcg
> +		 * (aka shared_inodes) cannot easily be attributed to a particular
> +		 * memcg.  Shared inodes are thought to be much rarer than
> +		 * non-shared inodes.  First try to satisfy this memcg's dirty
> +		 * limits using non-shared inodes.
> +		 */
> +		for (shared_inodes = false; ; ) {
> +			/*
> +			 * if memcg is under dirty limit, then break from
> +			 * throttling loop.
> +			 */
> +			mem_cgroup_dirty_info(sys_available_mem, memcg, &info);
> +			nr_reclaimable = dirty_info_reclaimable(&info);
> +			over = nr_reclaimable > info.dirty_thresh;
> +			trace_mem_cgroup_consider_fg_writeback(
> +				id, bdi, nr_reclaimable, info.dirty_thresh,
> +				over);
> +			if (!over)
> +				break;
> +
> +			nr_written = writeback_inodes_wb(&bdi->wb, write_chunk,
> +							 memcg, shared_inodes);
> +			trace_mem_cgroup_fg_writeback(write_chunk, nr_written,
> +						      id, shared_inodes);
> +			/* if no progress, then consider shared inodes */
> +			if ((nr_written == 0) && !shared_inodes) {
> +				trace_mem_cgroup_enable_shared_writeback(id);
> +				shared_inodes = true;
> +			}

in page-writeback.c

                    if (pages_written >= write_chunk)
                                break;          /* We've done our duty */

write_chunk (ratelimit) is used.  Can't we make use of this threshold?






> +
> +			__set_current_state(TASK_UNINTERRUPTIBLE);
> +			io_schedule_timeout(pause);
> +

What do you think about MAX_PAUSE/PASS_GOOD?
==
                /*
                 * max-pause area. If dirty exceeded but still within this
                 * area, no need to sleep for more than 200ms: (a) 8 pages per
                 * 200ms is typically more than enough to curb heavy dirtiers;
                 * (b) the pause time limit makes the dirtiers more responsive.
                 */
                if (nr_dirty < dirty_thresh +
                               dirty_thresh / DIRTY_MAXPAUSE_AREA &&
                    time_after(jiffies, start_time + MAX_PAUSE))
                        break;
                /*
                 * pass-good area. When some bdi gets blocked (eg. NFS server
                 * not responding), or write bandwidth dropped dramatically due
                 * to concurrent reads, or dirty threshold suddenly dropped and
                 * the dirty pages cannot be brought down anytime soon (eg. on
                 * slow USB stick), at least let go of the good bdi's.
                 */
                if (nr_dirty < dirty_thresh +
                               dirty_thresh / DIRTY_PASSGOOD_AREA &&
                    bdi_dirty < bdi_thresh)
                        break;
==

Thanks,
-Kame


^ permalink raw reply	[flat|nested] 72+ messages in thread
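
Kame's first question above points at the write_chunk cutoff in
balance_dirty_pages().  A rough sketch of how the quoted inner loop could
honour the same cutoff; this is not the posted code, nr_written_total is
an invented local, and the tracepoints and pause back-off are omitted for
brevity:

		unsigned long nr_written_total = 0;

		for (shared_inodes = false; ; ) {
			mem_cgroup_dirty_info(sys_available_mem, memcg, &info);
			if (dirty_info_reclaimable(&info) <= info.dirty_thresh)
				break;		/* back under the limit */

			nr_written = writeback_inodes_wb(&bdi->wb, write_chunk,
							 memcg, shared_inodes);
			nr_written_total += nr_written;
			if (nr_written_total >= write_chunk)
				break;		/* we've done our duty */

			/* no progress: widen the search to shared inodes */
			if (!nr_written && !shared_inodes)
				shared_inodes = true;

			__set_current_state(TASK_UNINTERRUPTIBLE);
			io_schedule_timeout(pause);
		}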

* Re: [PATCH v9 13/13] memcg: check memcg dirty limits in page writeback
  2011-08-17 16:15   ` Greg Thelen
@ 2011-08-18  1:40     ` KAMEZAWA Hiroyuki
  -1 siblings, 0 replies; 72+ messages in thread
From: KAMEZAWA Hiroyuki @ 2011-08-18  1:40 UTC (permalink / raw)
  To: Greg Thelen
  Cc: Andrew Morton, linux-kernel, linux-mm, containers, linux-fsdevel,
	Balbir Singh, Daisuke Nishimura, Minchan Kim, Johannes Weiner,
	Wu Fengguang, Dave Chinner, Vivek Goyal, Andrea Righi,
	Ciju Rajan K, David Rientjes

On Wed, 17 Aug 2011 09:15:05 -0700
Greg Thelen <gthelen@google.com> wrote:

> If the current process is in a non-root memcg, then
> balance_dirty_pages() will consider the memcg dirty limits as well as
> the system-wide limits.  This allows different cgroups to have distinct
> dirty limits which trigger direct and background writeback at different
> levels.
> 
> If called with a mem_cgroup, then throttle_vm_writeout() queries the
> given cgroup for its dirty memory usage limits.
> 
> Signed-off-by: Andrea Righi <andrea@betterlinux.com>
> Signed-off-by: Greg Thelen <gthelen@google.com>

Acked-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>



^ permalink raw reply	[flat|nested] 72+ messages in thread
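
A minimal sketch of the threshold selection the changelog describes for
throttle_vm_writeout().  This is not the posted diff: the helper name is
invented and the mem_cgroup_hierarchical_dirty_info() signature is assumed
from the 12/13 changelog, so it may differ from the real patch:

	static unsigned long writeout_dirty_thresh(struct mem_cgroup *memcg)
	{
		unsigned long background_thresh, dirty_thresh;
		struct dirty_info info;

		if (memcg && mem_cgroup_hierarchical_dirty_info(
				determine_dirtyable_memory(), memcg, &info))
			return info.dirty_thresh;	/* memcg limit */

		global_dirty_limits(&background_thresh, &dirty_thresh);
		return dirty_thresh;			/* system-wide limit */
	}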

* Re: [PATCH v9 12/13] memcg: create support routines for page writeback
  2011-08-18  1:38     ` KAMEZAWA Hiroyuki
@ 2011-08-18  2:36       ` Wu Fengguang
  -1 siblings, 0 replies; 72+ messages in thread
From: Wu Fengguang @ 2011-08-18  2:36 UTC (permalink / raw)
  To: KAMEZAWA Hiroyuki
  Cc: Jan Kara, Greg Thelen, Andrew Morton, linux-kernel, linux-mm,
	containers, linux-fsdevel, Balbir Singh, Daisuke Nishimura,
	Minchan Kim, Johannes Weiner, Dave Chinner, Vivek Goyal,
	Andrea Righi, Ciju Rajan K, David Rientjes, Li Shaohua, Shi,
	Alex, Chen, Tim C

> > +
> > +			__set_current_state(TASK_UNINTERRUPTIBLE);
> > +			io_schedule_timeout(pause);
> > +
> 
> What do you think about MAX_PAUSE/PASS_GOOD?
> ==
>                 /*
>                  * max-pause area. If dirty exceeded but still within this
>                  * area, no need to sleep for more than 200ms: (a) 8 pages per
>                  * 200ms is typically more than enough to curb heavy dirtiers;
>                  * (b) the pause time limit makes the dirtiers more responsive.
>                  */
>                 if (nr_dirty < dirty_thresh +
>                                dirty_thresh / DIRTY_MAXPAUSE_AREA &&
>                     time_after(jiffies, start_time + MAX_PAUSE))
>                         break;
>                 /*
>                  * pass-good area. When some bdi gets blocked (eg. NFS server
>                  * not responding), or write bandwidth dropped dramatically due
>                  * to concurrent reads, or dirty threshold suddenly dropped and
>                  * the dirty pages cannot be brought down anytime soon (eg. on
>                  * slow USB stick), at least let go of the good bdi's.
>                  */
>                 if (nr_dirty < dirty_thresh +
>                                dirty_thresh / DIRTY_PASSGOOD_AREA &&
>                     bdi_dirty < bdi_thresh)
>                         break;
> ==

Sorry that piece of code actually has some problems in JBOD setup.
I'm going to submit a patch for fixing it:

Subject: squeeze max-pause area and drop pass-good area
Date: Tue Aug 16 13:37:14 CST 2011

Remove the pass-good area introduced in ffd1f609ab10 ("writeback:
introduce max-pause and pass-good dirty limits") and make the
max-pause area smaller and safe.

This fixes ~30% performance regression in the ext3 data=writeback
fio_mmap_randwrite_64k/fio_mmap_randrw_64k test cases, where there are
12 JBOD disks, on each disk runs 8 concurrent tasks doing reads+writes.

Using deadline scheduler also has a regression, but not that big as
CFQ, so this suggests we have some write starvation.

The test logs show that

- the disks are sometimes under utilized

- global dirty pages sometimes rush high to the pass-good area for
  several hundred seconds, while in the mean time some bdi dirty pages
  drop to very low value (bdi_dirty << bdi_thresh).
  Then suddenly the global dirty pages dropped under global dirty
  threshold and bdi_dirty rush very high (for example, 2 times higher
  than bdi_thresh). During which time balance_dirty_pages() is not
  called at all.

So the problems are

1) The random writes progress so slow that they break the assumption of
the max-pause logic that "8 pages per 200ms is typically more than
enough to curb heavy dirtiers".

2) The max-pause logic ignored task_bdi_thresh and thus opens the
   possibility for some bdi's to over dirty pages, leading to
   (bdi_dirty >> bdi_thresh) and then (bdi_thresh >> bdi_dirty) for others.

3) The higher max-pause/pass-good thresholds somehow leads to some bad
   swing of dirty pages.

The fix is to allow the task to slightly dirty over task_bdi_thresh, but
no way to exceed bdi_dirty and/or global dirty_thresh.

Tests show that it fixed the JBOD regression completely (both behavior
and performance), while still being able to cut down large pause times
in balance_dirty_pages() for single-disk cases.

Reported-by: Li Shaohua <shaohua.li@intel.com>
Tested-by: Li Shaohua <shaohua.li@intel.com>
Signed-off-by: Wu Fengguang <fengguang.wu@intel.com>
---
 include/linux/writeback.h |   11 -----------
 mm/page-writeback.c       |   15 ++-------------
 2 files changed, 2 insertions(+), 24 deletions(-)

--- linux.orig/mm/page-writeback.c	2011-08-18 09:52:59.000000000 +0800
+++ linux/mm/page-writeback.c	2011-08-18 10:28:57.000000000 +0800
@@ -786,21 +786,10 @@ static void balance_dirty_pages(struct a
 		 * 200ms is typically more than enough to curb heavy dirtiers;
 		 * (b) the pause time limit makes the dirtiers more responsive.
 		 */
-		if (nr_dirty < dirty_thresh +
-			       dirty_thresh / DIRTY_MAXPAUSE_AREA &&
+		if (nr_dirty < dirty_thresh &&
+		    bdi_dirty < (task_bdi_thresh + bdi_thresh) / 2 &&
 		    time_after(jiffies, start_time + MAX_PAUSE))
 			break;
-		/*
-		 * pass-good area. When some bdi gets blocked (eg. NFS server
-		 * not responding), or write bandwidth dropped dramatically due
-		 * to concurrent reads, or dirty threshold suddenly dropped and
-		 * the dirty pages cannot be brought down anytime soon (eg. on
-		 * slow USB stick), at least let go of the good bdi's.
-		 */
-		if (nr_dirty < dirty_thresh +
-			       dirty_thresh / DIRTY_PASSGOOD_AREA &&
-		    bdi_dirty < bdi_thresh)
-			break;
 
 		/*
 		 * Increase the delay for each loop, up to our previous
--- linux.orig/include/linux/writeback.h	2011-08-16 23:34:27.000000000 +0800
+++ linux/include/linux/writeback.h	2011-08-18 09:53:03.000000000 +0800
@@ -12,15 +12,6 @@
  *
  *	(thresh - thresh/DIRTY_FULL_SCOPE, thresh)
  *
- * The 1/16 region above the global dirty limit will be put to maximum pauses:
- *
- *	(limit, limit + limit/DIRTY_MAXPAUSE_AREA)
- *
- * The 1/16 region above the max-pause region, dirty exceeded bdi's will be put
- * to loops:
- *
- *	(limit + limit/DIRTY_MAXPAUSE_AREA, limit + limit/DIRTY_PASSGOOD_AREA)
- *
  * Further beyond, all dirtier tasks will enter a loop waiting (possibly long
  * time) for the dirty pages to drop, unless written enough pages.
  *
@@ -31,8 +22,6 @@
  */
 #define DIRTY_SCOPE		8
 #define DIRTY_FULL_SCOPE	(DIRTY_SCOPE / 2)
-#define DIRTY_MAXPAUSE_AREA		16
-#define DIRTY_PASSGOOD_AREA		8
 
 /*
  * 4MB minimal write chunk size

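As an illustration only (not part of the patch): a minimal userspace sketch of
the new break test with made-up page counts.  max_pause_break() and every
number below are hypothetical, and the time_after(jiffies, start_time +
MAX_PAUSE) part of the real test is assumed to have already become true.

/* Toy model of the new max-pause break test; all values are hypothetical. */
#include <stdbool.h>
#include <stdio.h>

static bool max_pause_break(unsigned long nr_dirty, unsigned long dirty_thresh,
			    unsigned long bdi_dirty, unsigned long task_bdi_thresh,
			    unsigned long bdi_thresh)
{
	/* the time_after() check of the real code is assumed true here */
	return nr_dirty < dirty_thresh &&
	       bdi_dirty < (task_bdi_thresh + bdi_thresh) / 2;
}

int main(void)
{
	/* task_bdi_thresh = 180000, bdi_thresh = 200000  =>  cap = 190000 */
	printf("%d\n", max_pause_break(900000, 1000000, 185000, 180000, 200000)); /* 1: break out */
	printf("%d\n", max_pause_break(900000, 1000000, 195000, 180000, 200000)); /* 0: keep throttling */
	return 0;
}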

^ permalink raw reply	[flat|nested] 72+ messages in thread

* Re: [PATCH v9 12/13] memcg: create support routines for page writeback
@ 2011-08-18  2:36       ` Wu Fengguang
  0 siblings, 0 replies; 72+ messages in thread
From: Wu Fengguang @ 2011-08-18  2:36 UTC (permalink / raw)
  To: KAMEZAWA Hiroyuki
  Cc: Jan Kara, Greg Thelen, Andrew Morton, linux-kernel, linux-mm,
	containers, linux-fsdevel, Balbir Singh, Daisuke Nishimura,
	Minchan Kim, Johannes Weiner, Dave Chinner, Vivek Goyal,
	Andrea Righi, Ciju Rajan K, David Rientjes, Li Shaohua, Shi,
	Alex, Chen, Tim C

> > +
> > +			__set_current_state(TASK_UNINTERRUPTIBLE);
> > +			io_schedule_timeout(pause);
> > +
> 
> How do you think about MAX_PAUSE/PASS_GOOD ?
> ==
>                 /*
>                  * max-pause area. If dirty exceeded but still within this
>                  * area, no need to sleep for more than 200ms: (a) 8 pages per
>                  * 200ms is typically more than enough to curb heavy dirtiers;
>                  * (b) the pause time limit makes the dirtiers more responsive.
>                  */
>                 if (nr_dirty < dirty_thresh +
>                                dirty_thresh / DIRTY_MAXPAUSE_AREA &&
>                     time_after(jiffies, start_time + MAX_PAUSE))
>                         break;
>                 /*
>                  * pass-good area. When some bdi gets blocked (eg. NFS server
>                  * not responding), or write bandwidth dropped dramatically due
>                  * to concurrent reads, or dirty threshold suddenly dropped and
>                  * the dirty pages cannot be brought down anytime soon (eg. on
>                  * slow USB stick), at least let go of the good bdi's.
>                  */
>                 if (nr_dirty < dirty_thresh +
>                                dirty_thresh / DIRTY_PASSGOOD_AREA &&
>                     bdi_dirty < bdi_thresh)
>                         break;
> ==

Sorry, that piece of code actually has some problems in the JBOD setup.
I'm going to submit a patch fixing it:

Subject: squeeze max-pause area and drop pass-good area
Date: Tue Aug 16 13:37:14 CST 2011

Remove the pass-good area introduced in ffd1f609ab10 ("writeback:
introduce max-pause and pass-good dirty limits") and make the
max-pause area smaller and safe.

This fixes a ~30% performance regression in the ext3 data=writeback
fio_mmap_randwrite_64k/fio_mmap_randrw_64k test cases, where there are
12 JBOD disks and each disk runs 8 concurrent tasks doing reads+writes.

The deadline scheduler also shows a regression, though not as big as
CFQ's, which suggests there is some write starvation.

The test logs show that

- the disks are sometimes underutilized

- global dirty pages sometimes rush high into the pass-good area for
  several hundred seconds, while in the meantime some bdi dirty pages
  drop to very low values (bdi_dirty << bdi_thresh).
  Then the global dirty pages suddenly drop below the global dirty
  threshold and bdi_dirty rushes very high (for example, 2 times higher
  than bdi_thresh).  During that time balance_dirty_pages() is not
  called at all.

So the problems are

1) The random writes progress so slowly that they break the assumption of
   the max-pause logic that "8 pages per 200ms is typically more than
   enough to curb heavy dirtiers".

2) The max-pause logic ignores task_bdi_thresh and thus opens the
   possibility for some bdi's to dirty too many pages, leading to
   (bdi_dirty >> bdi_thresh) for them and then (bdi_thresh >> bdi_dirty) for others.

3) The higher max-pause/pass-good thresholds somehow lead to bad
   swings of dirty pages.

The fix is to allow the task to dirty slightly past task_bdi_thresh, but
never to exceed bdi_dirty and/or the global dirty_thresh.

Tests show that it fixed the JBOD regression completely (both behavior
and performance), while still being able to cut down large pause times
in balance_dirty_pages() for single-disk cases.

Reported-by: Li Shaohua <shaohua.li@intel.com>
Tested-by: Li Shaohua <shaohua.li@intel.com>
Signed-off-by: Wu Fengguang <fengguang.wu@intel.com>
---
 include/linux/writeback.h |   11 -----------
 mm/page-writeback.c       |   15 ++-------------
 2 files changed, 2 insertions(+), 24 deletions(-)

--- linux.orig/mm/page-writeback.c	2011-08-18 09:52:59.000000000 +0800
+++ linux/mm/page-writeback.c	2011-08-18 10:28:57.000000000 +0800
@@ -786,21 +786,10 @@ static void balance_dirty_pages(struct a
 		 * 200ms is typically more than enough to curb heavy dirtiers;
 		 * (b) the pause time limit makes the dirtiers more responsive.
 		 */
-		if (nr_dirty < dirty_thresh +
-			       dirty_thresh / DIRTY_MAXPAUSE_AREA &&
+		if (nr_dirty < dirty_thresh &&
+		    bdi_dirty < (task_bdi_thresh + bdi_thresh) / 2 &&
 		    time_after(jiffies, start_time + MAX_PAUSE))
 			break;
-		/*
-		 * pass-good area. When some bdi gets blocked (eg. NFS server
-		 * not responding), or write bandwidth dropped dramatically due
-		 * to concurrent reads, or dirty threshold suddenly dropped and
-		 * the dirty pages cannot be brought down anytime soon (eg. on
-		 * slow USB stick), at least let go of the good bdi's.
-		 */
-		if (nr_dirty < dirty_thresh +
-			       dirty_thresh / DIRTY_PASSGOOD_AREA &&
-		    bdi_dirty < bdi_thresh)
-			break;
 
 		/*
 		 * Increase the delay for each loop, up to our previous
--- linux.orig/include/linux/writeback.h	2011-08-16 23:34:27.000000000 +0800
+++ linux/include/linux/writeback.h	2011-08-18 09:53:03.000000000 +0800
@@ -12,15 +12,6 @@
  *
  *	(thresh - thresh/DIRTY_FULL_SCOPE, thresh)
  *
- * The 1/16 region above the global dirty limit will be put to maximum pauses:
- *
- *	(limit, limit + limit/DIRTY_MAXPAUSE_AREA)
- *
- * The 1/16 region above the max-pause region, dirty exceeded bdi's will be put
- * to loops:
- *
- *	(limit + limit/DIRTY_MAXPAUSE_AREA, limit + limit/DIRTY_PASSGOOD_AREA)
- *
  * Further beyond, all dirtier tasks will enter a loop waiting (possibly long
  * time) for the dirty pages to drop, unless written enough pages.
  *
@@ -31,8 +22,6 @@
  */
 #define DIRTY_SCOPE		8
 #define DIRTY_FULL_SCOPE	(DIRTY_SCOPE / 2)
-#define DIRTY_MAXPAUSE_AREA		16
-#define DIRTY_PASSGOOD_AREA		8
 
 /*
  * 4MB minimal write chunk size

^ permalink raw reply	[flat|nested] 72+ messages in thread

* Re: [PATCH v9 03/13] memcg: add dirty page accounting infrastructure
  2011-08-18  0:39     ` KAMEZAWA Hiroyuki
@ 2011-08-18  6:07       ` Greg Thelen
  -1 siblings, 0 replies; 72+ messages in thread
From: Greg Thelen @ 2011-08-18  6:07 UTC (permalink / raw)
  To: KAMEZAWA Hiroyuki
  Cc: Andrew Morton, linux-kernel, linux-mm, containers, linux-fsdevel,
	Balbir Singh, Daisuke Nishimura, Minchan Kim, Johannes Weiner,
	Wu Fengguang, Dave Chinner, Vivek Goyal, Andrea Righi,
	Ciju Rajan K, David Rientjes

KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> writes:

> On Wed, 17 Aug 2011 09:14:55 -0700
> Greg Thelen <gthelen@google.com> wrote:
>
>> Add memcg routines to count dirty, writeback, and unstable_NFS pages.
>> These routines are not yet used by the kernel to count such pages.  A
>> later change adds kernel calls to these new routines.
>> 
>> As inode pages are marked dirty, if the dirtied page's cgroup differs
>> from the inode's cgroup, then mark the inode shared across several
>> cgroups.
>> 
>> Signed-off-by: Greg Thelen <gthelen@google.com>
>> Signed-off-by: Andrea Righi <andrea@betterlinux.com>
>
> Acked-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
>
> A nitpick..
>
>
>
>> +static inline
>> +void mem_cgroup_move_account_page_stat(struct mem_cgroup *from,
>> +				       struct mem_cgroup *to,
>> +				       enum mem_cgroup_stat_index idx)
>> +{
>> +	preempt_disable();
>> +	__this_cpu_dec(from->stat->count[idx]);
>> +	__this_cpu_inc(to->stat->count[idx]);
>> +	preempt_enable();
>> +}
>> +
>
> this_cpu_dec()
> this_cpu_inc()
>
> without preempt_disable/enable will work. CPU change between dec/inc will
> not be problem.
>
> Thanks,
> -Kame

I agree, but this fix is general cleanup, which seems independent of
memcg dirty accounting.  This preemption disable/enable pattern exists
before this patch series in both mem_cgroup_charge_statistics() and
mem_cgroup_move_account().  For consistency we should change both.  To
keep the dirty page accounting series simple, I would like to make these
changes outside of this series.  On x86, usage of this_cpu_dec/inc looks
equivalent to __this_cpu_dec/inc, so I assume the only trade-off is that
preemptible non-x86 builds using the generic this_cpu_*() implementation
will internally disable/enable preemption in this_cpu_*() operations.

I'll submit a cleanup patch outside of the dirty limit patches for this.

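As a toy illustration of that point (plain userspace C, not kernel code; the
two-slot arrays merely stand in for per-cpu counters): the dec and the inc may
land on different CPUs and the summed per-memcg totals still come out right.

#include <stdio.h>

#define NR_CPUS 2

static long from_stat[NR_CPUS], to_stat[NR_CPUS];

static long stat_sum(const long *stat)
{
	long total = 0;
	int cpu;

	for (cpu = 0; cpu < NR_CPUS; cpu++)
		total += stat[cpu];
	return total;
}

int main(void)
{
	from_stat[0] = 10;	/* pretend 10 pages are charged to "from" */

	/* move one page of accounting: the dec runs on cpu 0 ... */
	from_stat[0]--;
	/* ... the task migrates, so the inc lands on cpu 1 */
	to_stat[1]++;

	/* the summed per-memcg totals are still consistent */
	printf("from=%ld to=%ld\n", stat_sum(from_stat), stat_sum(to_stat)); /* from=9 to=1 */
	return 0;
}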

^ permalink raw reply	[flat|nested] 72+ messages in thread

* Re: [PATCH v9 08/13] memcg: dirty page accounting support routines
  2011-08-18  1:05     ` KAMEZAWA Hiroyuki
@ 2011-08-18  7:04       ` Greg Thelen
  -1 siblings, 0 replies; 72+ messages in thread
From: Greg Thelen @ 2011-08-18  7:04 UTC (permalink / raw)
  To: KAMEZAWA Hiroyuki
  Cc: Andrew Morton, linux-kernel, linux-mm, containers, linux-fsdevel,
	Balbir Singh, Daisuke Nishimura, Minchan Kim, Johannes Weiner,
	Wu Fengguang, Dave Chinner, Vivek Goyal, Andrea Righi,
	Ciju Rajan K, David Rientjes

KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> writes:

> On Wed, 17 Aug 2011 09:15:00 -0700
> Greg Thelen <gthelen@google.com> wrote:
>
>> Added memcg dirty page accounting support routines.  These routines are
>> used by later changes to provide memcg aware writeback and dirty page
>> limiting.  A mem_cgroup_dirty_info() tracepoint is also included to
>> allow for easier understanding of memcg writeback operation.
>> 
>> Signed-off-by: Greg Thelen <gthelen@google.com>
>
> I have small comments.
>
>> ---
>> Changelog since v8:
>> - Use 'memcg' rather than 'mem' for local variables and parameters.
>>   This is consistent with other memory controller code.
>> 
>>  include/linux/memcontrol.h        |    9 ++
>>  include/trace/events/memcontrol.h |   34 +++++++++
>>  mm/memcontrol.c                   |  147 +++++++++++++++++++++++++++++++++++++
>>  3 files changed, 190 insertions(+), 0 deletions(-)
>> 
>> diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
>> index 630d3fa..9cc8841 100644
>> --- a/include/linux/memcontrol.h
>> +++ b/include/linux/memcontrol.h
>> @@ -36,6 +36,15 @@ enum mem_cgroup_page_stat_item {
>>  	MEMCG_NR_FILE_DIRTY, /* # of dirty pages in page cache */
>>  	MEMCG_NR_FILE_WRITEBACK, /* # of pages under writeback */
>>  	MEMCG_NR_FILE_UNSTABLE_NFS, /* # of NFS unstable pages */
>> +	MEMCG_NR_DIRTYABLE_PAGES, /* # of pages that could be dirty */
>> +};
>> +
>> +struct dirty_info {
>> +	unsigned long dirty_thresh;
>> +	unsigned long background_thresh;
>> +	unsigned long nr_file_dirty;
>> +	unsigned long nr_writeback;
>> +	unsigned long nr_unstable_nfs;
>>  };
>>  
>>  extern unsigned long mem_cgroup_isolate_pages(unsigned long nr_to_scan,
>> diff --git a/include/trace/events/memcontrol.h b/include/trace/events/memcontrol.h
>> index 781ef9fc..abf1306 100644
>> --- a/include/trace/events/memcontrol.h
>> +++ b/include/trace/events/memcontrol.h
>> @@ -26,6 +26,40 @@ TRACE_EVENT(mem_cgroup_mark_inode_dirty,
>>  	TP_printk("ino=%ld css_id=%d", __entry->ino, __entry->css_id)
>>  )
>>  
>> +TRACE_EVENT(mem_cgroup_dirty_info,
>> +	TP_PROTO(unsigned short css_id,
>> +		 struct dirty_info *dirty_info),
>> +
>> +	TP_ARGS(css_id, dirty_info),
>> +
>> +	TP_STRUCT__entry(
>> +		__field(unsigned short, css_id)
>> +		__field(unsigned long, dirty_thresh)
>> +		__field(unsigned long, background_thresh)
>> +		__field(unsigned long, nr_file_dirty)
>> +		__field(unsigned long, nr_writeback)
>> +		__field(unsigned long, nr_unstable_nfs)
>> +		),
>> +
>> +	TP_fast_assign(
>> +		__entry->css_id = css_id;
>> +		__entry->dirty_thresh = dirty_info->dirty_thresh;
>> +		__entry->background_thresh = dirty_info->background_thresh;
>> +		__entry->nr_file_dirty = dirty_info->nr_file_dirty;
>> +		__entry->nr_writeback = dirty_info->nr_writeback;
>> +		__entry->nr_unstable_nfs = dirty_info->nr_unstable_nfs;
>> +		),
>> +
>> +	TP_printk("css_id=%d thresh=%ld bg_thresh=%ld dirty=%ld wb=%ld "
>> +		  "unstable_nfs=%ld",
>> +		  __entry->css_id,
>> +		  __entry->dirty_thresh,
>> +		  __entry->background_thresh,
>> +		  __entry->nr_file_dirty,
>> +		  __entry->nr_writeback,
>> +		  __entry->nr_unstable_nfs)
>> +)
>> +
>>  #endif /* _TRACE_MEMCONTROL_H */
>>  
>>  /* This part must be outside protection */
>> diff --git a/mm/memcontrol.c b/mm/memcontrol.c
>> index 4e01699..d54adf4 100644
>> --- a/mm/memcontrol.c
>> +++ b/mm/memcontrol.c
>> @@ -1366,6 +1366,11 @@ int mem_cgroup_swappiness(struct mem_cgroup *memcg)
>>  	return memcg->swappiness;
>>  }
>>  
>> +static unsigned long dirty_info_reclaimable(struct dirty_info *info)
>> +{
>> +	return info->nr_file_dirty + info->nr_unstable_nfs;
>> +}
>> +
>>  /*
>>   * Return true if the current memory cgroup has local dirty memory settings.
>>   * There is an allowed race between the current task migrating in-to/out-of the
>> @@ -1396,6 +1401,148 @@ static void mem_cgroup_dirty_param(struct vm_dirty_param *param,
>>  	}
>>  }
>>  
>> +static inline bool mem_cgroup_can_swap(struct mem_cgroup *memcg)
>> +{
>> +	if (!do_swap_account)
>> +		return nr_swap_pages > 0;
>> +	return !memcg->memsw_is_minimum &&
>> +		(res_counter_read_u64(&memcg->memsw, RES_LIMIT) > 0);
>> +}
>
> I think
>
> 	if (nr_swap_pages == 0)
> 		return false;
> 	if (!do_swap_account)
> 		return true;
> 	if (memcg->memsw_is_minimum)
> 		return false;
>         if (res_counter_margin(&memcg->memsw) == 0)
> 		return false;
>
> is a correct check.

Ok.  I'll update to use your logic.

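For reference, a small userspace sketch of the suggested ordering with the
kernel state stubbed out as plain variables; the names and values below are
stand-ins, and the trailing "return true" is only implied by the quoted
pseudocode.

#include <stdbool.h>
#include <stdio.h>

/* stand-ins for the kernel state consulted by the real check */
static long nr_swap_pages = 1024;
static bool do_swap_account = true;
static bool memsw_is_minimum = false;
static unsigned long memsw_margin_pages = 512;

static bool memcg_can_swap(void)
{
	if (nr_swap_pages == 0)
		return false;
	if (!do_swap_account)
		return true;
	if (memsw_is_minimum)
		return false;
	if (memsw_margin_pages == 0)
		return false;
	return true;
}

int main(void)
{
	printf("%d\n", memcg_can_swap());	/* 1 with the stub values above */
	return 0;
}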
>> +
>> +static s64 mem_cgroup_local_page_stat(struct mem_cgroup *memcg,
>> +				      enum mem_cgroup_page_stat_item item)
>> +{
>> +	s64 ret;
>> +
>> +	switch (item) {
>> +	case MEMCG_NR_FILE_DIRTY:
>> +		ret = mem_cgroup_read_stat(memcg, MEM_CGROUP_STAT_FILE_DIRTY);
>> +		break;
>> +	case MEMCG_NR_FILE_WRITEBACK:
>> +		ret = mem_cgroup_read_stat(memcg,
>> +					   MEM_CGROUP_STAT_FILE_WRITEBACK);
>> +		break;
>> +	case MEMCG_NR_FILE_UNSTABLE_NFS:
>> +		ret = mem_cgroup_read_stat(memcg,
>> +					   MEM_CGROUP_STAT_FILE_UNSTABLE_NFS);
>> +		break;
>> +	case MEMCG_NR_DIRTYABLE_PAGES:
>> +		ret = mem_cgroup_read_stat(memcg, LRU_ACTIVE_FILE) +
>> +			mem_cgroup_read_stat(memcg, LRU_INACTIVE_FILE);
>> +		if (mem_cgroup_can_swap(memcg))
>> +			ret += mem_cgroup_read_stat(memcg, LRU_ACTIVE_ANON) +
>> +				mem_cgroup_read_stat(memcg, LRU_INACTIVE_ANON);
>> +		break;
>> +	default:
>> +		BUG();
>> +		break;
>> +	}
>> +	return ret;
>> +}
>> +
>> +/*
>> + * Return the number of additional pages that the @memcg cgroup could allocate.
>> + * If use_hierarchy is set, then this involves checking parent mem cgroups to
>> + * find the cgroup with the smallest free space.
>> + */
>> +static unsigned long
>> +mem_cgroup_hierarchical_free_pages(struct mem_cgroup *memcg)
>> +{
>> +	u64 free;
>> +	unsigned long min_free;
>> +
>> +	min_free = global_page_state(NR_FREE_PAGES);
>> +
>> +	while (memcg) {
>> +		free = (res_counter_read_u64(&memcg->res, RES_LIMIT) -
>> +			res_counter_read_u64(&memcg->res, RES_USAGE)) >>
>> +			PAGE_SHIFT;
>
> How about
> 		free = mem_cgroup_margin(&mem->res);
> ?
>
> Thanks,
> -Kame

Sounds good.  I'll update to:

        while (memcg) {
                free = res_counter_margin(&memcg->res) >> PAGE_SHIFT;
                min_free = min_t(u64, min_free, free);
                memcg = parent_mem_cgroup(memcg);
        }

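A rough userspace sketch of that walk, under stated assumptions: struct
fake_memcg and its margin_pages are stand-ins for the real memcg and
res_counter_margin(), and the numbers are invented.

#include <stdio.h>

/* stand-in for a memcg with a res_counter margin, in pages */
struct fake_memcg {
	struct fake_memcg *parent;
	unsigned long margin_pages;
};

/* take the smallest free-page margin along the path to the root */
static unsigned long hierarchical_free_pages(struct fake_memcg *memcg,
					     unsigned long global_free)
{
	unsigned long min_free = global_free;

	while (memcg) {
		if (memcg->margin_pages < min_free)
			min_free = memcg->margin_pages;
		memcg = memcg->parent;
	}
	return min_free;
}

int main(void)
{
	struct fake_memcg root = { .parent = NULL, .margin_pages = 50000 };
	struct fake_memcg child = { .parent = &root, .margin_pages = 2000 };

	/* the child's tight limit wins over both the root and the global free pages */
	printf("%lu\n", hierarchical_free_pages(&child, 100000));	/* 2000 */
	return 0;
}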

^ permalink raw reply	[flat|nested] 72+ messages in thread

* Re: [PATCH v9 11/13] writeback: make background writeback cgroup aware
  2011-08-18  1:23     ` KAMEZAWA Hiroyuki
@ 2011-08-18  7:10       ` Greg Thelen
  -1 siblings, 0 replies; 72+ messages in thread
From: Greg Thelen @ 2011-08-18  7:10 UTC (permalink / raw)
  To: KAMEZAWA Hiroyuki
  Cc: Andrew Morton, linux-kernel, linux-mm, containers, linux-fsdevel,
	Balbir Singh, Daisuke Nishimura, Minchan Kim, Johannes Weiner,
	Wu Fengguang, Dave Chinner, Vivek Goyal, Andrea Righi,
	Ciju Rajan K, David Rientjes

KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> writes:

> On Wed, 17 Aug 2011 09:15:03 -0700
> Greg Thelen <gthelen@google.com> wrote:
>
>> When the system is under background dirty memory threshold but some
>> cgroups are over their background dirty memory thresholds, then only
>> writeback inodes associated with the over-limit cgroups.
>> 
>> In addition to checking if the system dirty memory usage is over the
>> system background threshold, over_bground_thresh() now checks if any
>> cgroups are over their respective background dirty memory thresholds.
>> 
>> If over-limit cgroups are found, then the new
>> wb_writeback_work.for_cgroup field is set to distinguish between system
>> and memcg overages.  The new wb_writeback_work.shared_inodes field is
>> also set.  Inodes written by multiple cgroup are marked owned by
>> I_MEMCG_SHARED rather than a particular cgroup.  Such shared inodes
>> cannot easily be attributed to a cgroup, so per-cgroup writeback
>> (futures version of wakeup_flusher_threads and balance_dirty_pages)
>> performs suboptimally in the presence of shared inodes.  Therefore,
>> write shared inodes when performing cgroup background writeback.
>> 
>> If performing cgroup writeback, move_expired_inodes() skips inodes that
>> do not contribute dirty pages to the cgroup being written back.
>> 
>> After writing some pages, wb_writeback() will call
>> mem_cgroup_writeback_done() to update the set of over-bg-limits memcg.
>> 
>> This change also makes wakeup_flusher_threads() memcg aware so that
>> per-cgroup try_to_free_pages() is able to operate more efficiently
>> without having to write pages of foreign containers.  This change adds a
>> mem_cgroup parameter to wakeup_flusher_threads() to allow callers,
>> especially try_to_free_pages() and foreground writeback from
>> balance_dirty_pages(), to specify a particular cgroup to write inodes
>> from.
>> 
>> Signed-off-by: Greg Thelen <gthelen@google.com>
>> ---
>> Changelog since v8:
>> 
>> - Added optional memcg parameter to __bdi_start_writeback(),
>>   bdi_start_writeback(), wakeup_flusher_threads(), writeback_inodes_wb().
>> 
>> - move_expired_inodes() now uses pass in struct wb_writeback_work instead of
>>   struct writeback_control.
>> 
>> - Added comments to over_bground_thresh().
>> 
>>  fs/buffer.c               |    2 +-
>>  fs/fs-writeback.c         |   96 +++++++++++++++++++++++++++++++++-----------
>>  fs/sync.c                 |    2 +-
>>  include/linux/writeback.h |    6 ++-
>>  mm/backing-dev.c          |    3 +-
>>  mm/page-writeback.c       |    3 +-
>>  mm/vmscan.c               |    3 +-
>>  7 files changed, 84 insertions(+), 31 deletions(-)
>> 
>> diff --git a/fs/buffer.c b/fs/buffer.c
>> index dd0220b..da1fb23 100644
>> --- a/fs/buffer.c
>> +++ b/fs/buffer.c
>> @@ -293,7 +293,7 @@ static void free_more_memory(void)
>>  	struct zone *zone;
>>  	int nid;
>>  
>> -	wakeup_flusher_threads(1024);
>> +	wakeup_flusher_threads(1024, NULL);
>>  	yield();
>>  
>>  	for_each_online_node(nid) {
>> diff --git a/fs/fs-writeback.c b/fs/fs-writeback.c
>> index e91fb82..ba55336 100644
>> --- a/fs/fs-writeback.c
>> +++ b/fs/fs-writeback.c
>> @@ -38,10 +38,14 @@ struct wb_writeback_work {
>>  	struct super_block *sb;
>>  	unsigned long *older_than_this;
>>  	enum writeback_sync_modes sync_mode;
>> +	unsigned short memcg_id;	/* If non-zero, then writeback specified
>> +					 * cgroup. */
>>  	unsigned int tagged_writepages:1;
>>  	unsigned int for_kupdate:1;
>>  	unsigned int range_cyclic:1;
>>  	unsigned int for_background:1;
>> +	unsigned int for_cgroup:1;	/* cgroup writeback */
>> +	unsigned int shared_inodes:1;	/* write inodes spanning cgroups */
>>  
>>  	struct list_head list;		/* pending work list */
>>  	struct completion *done;	/* set if the caller waits */
>> @@ -114,9 +118,12 @@ static void bdi_queue_work(struct backing_dev_info *bdi,
>>  	spin_unlock_bh(&bdi->wb_lock);
>>  }
>>  
>> +/*
>> + * @memcg is optional.  If set, then limit writeback to the specified cgroup.
>> + */
>>  static void
>>  __bdi_start_writeback(struct backing_dev_info *bdi, long nr_pages,
>> -		      bool range_cyclic)
>> +		      bool range_cyclic, struct mem_cgroup *memcg)
>>  {
>>  	struct wb_writeback_work *work;
>>  
>> @@ -136,6 +143,8 @@ __bdi_start_writeback(struct backing_dev_info *bdi, long nr_pages,
>>  	work->sync_mode	= WB_SYNC_NONE;
>>  	work->nr_pages	= nr_pages;
>>  	work->range_cyclic = range_cyclic;
>> +	work->memcg_id = memcg ? css_id(mem_cgroup_css(memcg)) : 0;
>> +	work->for_cgroup = memcg != NULL;
>>  
>
>
> I couldn't find a patch for mem_cgroup_css(NULL). Is it in patch 1-10 ?
> Other parts seems ok to me.
>
>
> Thanks,
> -Kame

Mainline commit d324236b3333e87c8825b35f2104184734020d35 adds
mem_cgroup_css() to memcontrol.c.  The above code does not call
mem_cgroup_css() with a NULL parameter due to the 'memcg ? ...' check.
So I do not think any additional changes to mem_cgroup_css() are needed.
Am I missing your point?


^ permalink raw reply	[flat|nested] 72+ messages in thread

* Re: [PATCH v9 11/13] writeback: make background writeback cgroup aware
  2011-08-18  7:10       ` Greg Thelen
@ 2011-08-18  7:17         ` KAMEZAWA Hiroyuki
  -1 siblings, 0 replies; 72+ messages in thread
From: KAMEZAWA Hiroyuki @ 2011-08-18  7:17 UTC (permalink / raw)
  To: Greg Thelen
  Cc: Andrew Morton, linux-kernel, linux-mm, containers, linux-fsdevel,
	Balbir Singh, Daisuke Nishimura, Minchan Kim, Johannes Weiner,
	Wu Fengguang, Dave Chinner, Vivek Goyal, Andrea Righi,
	Ciju Rajan K, David Rientjes

On Thu, 18 Aug 2011 00:10:56 -0700
Greg Thelen <gthelen@google.com> wrote:

> KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> writes:
> 
> > On Wed, 17 Aug 2011 09:15:03 -0700
> > Greg Thelen <gthelen@google.com> wrote:
> >
> >> When the system is under background dirty memory threshold but some
> >> cgroups are over their background dirty memory thresholds, then only
> >> writeback inodes associated with the over-limit cgroups.
> >> 
> >> In addition to checking if the system dirty memory usage is over the
> >> system background threshold, over_bground_thresh() now checks if any
> >> cgroups are over their respective background dirty memory thresholds.
> >> 
> >> If over-limit cgroups are found, then the new
> >> wb_writeback_work.for_cgroup field is set to distinguish between system
> >> and memcg overages.  The new wb_writeback_work.shared_inodes field is
> >> also set.  Inodes written by multiple cgroup are marked owned by
> >> I_MEMCG_SHARED rather than a particular cgroup.  Such shared inodes
> >> cannot easily be attributed to a cgroup, so per-cgroup writeback
> >> (futures version of wakeup_flusher_threads and balance_dirty_pages)
> >> performs suboptimally in the presence of shared inodes.  Therefore,
> >> write shared inodes when performing cgroup background writeback.
> >> 
> >> If performing cgroup writeback, move_expired_inodes() skips inodes that
> >> do not contribute dirty pages to the cgroup being written back.
> >> 
> >> After writing some pages, wb_writeback() will call
> >> mem_cgroup_writeback_done() to update the set of over-bg-limits memcg.
> >> 
> >> This change also makes wakeup_flusher_threads() memcg aware so that
> >> per-cgroup try_to_free_pages() is able to operate more efficiently
> >> without having to write pages of foreign containers.  This change adds a
> >> mem_cgroup parameter to wakeup_flusher_threads() to allow callers,
> >> especially try_to_free_pages() and foreground writeback from
> >> balance_dirty_pages(), to specify a particular cgroup to write inodes
> >> from.
> >> 
> >> Signed-off-by: Greg Thelen <gthelen@google.com>
> >> ---
> >> Changelog since v8:
> >> 
> >> - Added optional memcg parameter to __bdi_start_writeback(),
> >>   bdi_start_writeback(), wakeup_flusher_threads(), writeback_inodes_wb().
> >> 
> >> - move_expired_inodes() now uses pass in struct wb_writeback_work instead of
> >>   struct writeback_control.
> >> 
> >> - Added comments to over_bground_thresh().
> >> 
> >>  fs/buffer.c               |    2 +-
> >>  fs/fs-writeback.c         |   96 +++++++++++++++++++++++++++++++++-----------
> >>  fs/sync.c                 |    2 +-
> >>  include/linux/writeback.h |    6 ++-
> >>  mm/backing-dev.c          |    3 +-
> >>  mm/page-writeback.c       |    3 +-
> >>  mm/vmscan.c               |    3 +-
> >>  7 files changed, 84 insertions(+), 31 deletions(-)
> >> 
> >> diff --git a/fs/buffer.c b/fs/buffer.c
> >> index dd0220b..da1fb23 100644
> >> --- a/fs/buffer.c
> >> +++ b/fs/buffer.c
> >> @@ -293,7 +293,7 @@ static void free_more_memory(void)
> >>  	struct zone *zone;
> >>  	int nid;
> >>  
> >> -	wakeup_flusher_threads(1024);
> >> +	wakeup_flusher_threads(1024, NULL);
> >>  	yield();
> >>  
> >>  	for_each_online_node(nid) {
> >> diff --git a/fs/fs-writeback.c b/fs/fs-writeback.c
> >> index e91fb82..ba55336 100644
> >> --- a/fs/fs-writeback.c
> >> +++ b/fs/fs-writeback.c
> >> @@ -38,10 +38,14 @@ struct wb_writeback_work {
> >>  	struct super_block *sb;
> >>  	unsigned long *older_than_this;
> >>  	enum writeback_sync_modes sync_mode;
> >> +	unsigned short memcg_id;	/* If non-zero, then writeback specified
> >> +					 * cgroup. */
> >>  	unsigned int tagged_writepages:1;
> >>  	unsigned int for_kupdate:1;
> >>  	unsigned int range_cyclic:1;
> >>  	unsigned int for_background:1;
> >> +	unsigned int for_cgroup:1;	/* cgroup writeback */
> >> +	unsigned int shared_inodes:1;	/* write inodes spanning cgroups */
> >>  
> >>  	struct list_head list;		/* pending work list */
> >>  	struct completion *done;	/* set if the caller waits */
> >> @@ -114,9 +118,12 @@ static void bdi_queue_work(struct backing_dev_info *bdi,
> >>  	spin_unlock_bh(&bdi->wb_lock);
> >>  }
> >>  
> >> +/*
> >> + * @memcg is optional.  If set, then limit writeback to the specified cgroup.
> >> + */
> >>  static void
> >>  __bdi_start_writeback(struct backing_dev_info *bdi, long nr_pages,
> >> -		      bool range_cyclic)
> >> +		      bool range_cyclic, struct mem_cgroup *memcg)
> >>  {
> >>  	struct wb_writeback_work *work;
> >>  
> >> @@ -136,6 +143,8 @@ __bdi_start_writeback(struct backing_dev_info *bdi, long nr_pages,
> >>  	work->sync_mode	= WB_SYNC_NONE;
> >>  	work->nr_pages	= nr_pages;
> >>  	work->range_cyclic = range_cyclic;
> >> +	work->memcg_id = memcg ? css_id(mem_cgroup_css(memcg)) : 0;
> >> +	work->for_cgroup = memcg != NULL;
> >>  
> >
> >
> > I couldn't find a patch for mem_cgroup_css(NULL). Is it in patch 1-10 ?
> > Other parts seems ok to me.
> >
> >
> > Thanks,
> > -Kame
> 
> Mainline commit d324236b3333e87c8825b35f2104184734020d35 adds
> mem_cgroup_css() to memcontrol.c.  The above code does not call
> mem_cgroup_css() with a NULL parameter due to the 'memcg ? ...' check.
> So I do not think any additional changes to mem_cgroup_css() are needed.
> Am I missing your point?
> 

I thought you would need
==
struct cgroup_subsys_state *mem_cgroup_css(struct mem_cgroup *mem)
{
+	if (!mem)
+		return NULL;
       return &mem->css;
}
==
And
==
unsigned short css_id(struct cgroup_subsys_state *css)
{
        struct css_id *cssid;

+	if (!css)
+		return 0;
}
==

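For comparison, a minimal userspace sketch of the two approaches: NULL-tolerant
helpers as sketched above versus the call-site check used in the patch.
fake_memcg, fake_css and the id values are stand-ins, not the kernel types.

#include <stdio.h>

struct fake_css { unsigned short id; };
struct fake_memcg { struct fake_css css; };

/* style 1: helpers tolerate a NULL memcg/css themselves */
static struct fake_css *memcg_css(struct fake_memcg *memcg)
{
	return memcg ? &memcg->css : NULL;
}

static unsigned short css_id_of(struct fake_css *css)
{
	return css ? css->id : 0;
}

int main(void)
{
	struct fake_memcg m = { .css = { .id = 7 } };
	struct fake_memcg *some = &m, *none = NULL;

	/* style 2: the call-site ternary keeps NULL out of the helpers entirely */
	unsigned short id_a = some ? css_id_of(memcg_css(some)) : 0;
	unsigned short id_b = none ? css_id_of(memcg_css(none)) : 0;

	printf("%d %d\n", id_a, id_b);	/* prints: 7 0 */
	return 0;
}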


Thanks,
-Kame





^ permalink raw reply	[flat|nested] 72+ messages in thread

* Re: [PATCH v9 11/13] writeback: make background writeback cgroup aware
  2011-08-18  7:38           ` Greg Thelen
@ 2011-08-18  7:35             ` KAMEZAWA Hiroyuki
  -1 siblings, 0 replies; 72+ messages in thread
From: KAMEZAWA Hiroyuki @ 2011-08-18  7:35 UTC (permalink / raw)
  To: Greg Thelen
  Cc: Andrew Morton, linux-kernel, linux-mm, containers, linux-fsdevel,
	Balbir Singh, Daisuke Nishimura, Minchan Kim, Johannes Weiner,
	Wu Fengguang, Dave Chinner, Vivek Goyal, Andrea Righi,
	Ciju Rajan K, David Rientjes

On Thu, 18 Aug 2011 00:38:49 -0700
Greg Thelen <gthelen@google.com> wrote:

> KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> writes:
> 
> > On Thu, 18 Aug 2011 00:10:56 -0700
> > Greg Thelen <gthelen@google.com> wrote:
> >
> >> KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> writes:
> >> 
> >> > On Wed, 17 Aug 2011 09:15:03 -0700
> >> > Greg Thelen <gthelen@google.com> wrote:
> >> >
> >> >> When the system is under background dirty memory threshold but some
> >> >> cgroups are over their background dirty memory thresholds, then only
> >> >> writeback inodes associated with the over-limit cgroups.
> >> >> 
> >> >> In addition to checking if the system dirty memory usage is over the
> >> >> system background threshold, over_bground_thresh() now checks if any
> >> >> cgroups are over their respective background dirty memory thresholds.
> >> >> 
> >> >> If over-limit cgroups are found, then the new
> >> >> wb_writeback_work.for_cgroup field is set to distinguish between system
> >> >> and memcg overages.  The new wb_writeback_work.shared_inodes field is
> >> >> also set.  Inodes written by multiple cgroup are marked owned by
> >> >> I_MEMCG_SHARED rather than a particular cgroup.  Such shared inodes
> >> >> cannot easily be attributed to a cgroup, so per-cgroup writeback
> >> >> (futures version of wakeup_flusher_threads and balance_dirty_pages)
> >> >> performs suboptimally in the presence of shared inodes.  Therefore,
> >> >> write shared inodes when performing cgroup background writeback.
> >> >> 
> >> >> If performing cgroup writeback, move_expired_inodes() skips inodes that
> >> >> do not contribute dirty pages to the cgroup being written back.
> >> >> 
> >> >> After writing some pages, wb_writeback() will call
> >> >> mem_cgroup_writeback_done() to update the set of over-bg-limits memcg.
> >> >> 
> >> >> This change also makes wakeup_flusher_threads() memcg aware so that
> >> >> per-cgroup try_to_free_pages() is able to operate more efficiently
> >> >> without having to write pages of foreign containers.  This change adds a
> >> >> mem_cgroup parameter to wakeup_flusher_threads() to allow callers,
> >> >> especially try_to_free_pages() and foreground writeback from
> >> >> balance_dirty_pages(), to specify a particular cgroup to write inodes
> >> >> from.
> >> >> 
> >> >> Signed-off-by: Greg Thelen <gthelen@google.com>
> >> >> ---
> >> >> Changelog since v8:
> >> >> 
> >> >> - Added optional memcg parameter to __bdi_start_writeback(),
> >> >>   bdi_start_writeback(), wakeup_flusher_threads(), writeback_inodes_wb().
> >> >> 
> >> >> - move_expired_inodes() now uses pass in struct wb_writeback_work instead of
> >> >>   struct writeback_control.
> >> >> 
> >> >> - Added comments to over_bground_thresh().
> >> >> 
> >> >>  fs/buffer.c               |    2 +-
> >> >>  fs/fs-writeback.c         |   96 +++++++++++++++++++++++++++++++++-----------
> >> >>  fs/sync.c                 |    2 +-
> >> >>  include/linux/writeback.h |    6 ++-
> >> >>  mm/backing-dev.c          |    3 +-
> >> >>  mm/page-writeback.c       |    3 +-
> >> >>  mm/vmscan.c               |    3 +-
> >> >>  7 files changed, 84 insertions(+), 31 deletions(-)
> >> >> 
> >> >> diff --git a/fs/buffer.c b/fs/buffer.c
> >> >> index dd0220b..da1fb23 100644
> >> >> --- a/fs/buffer.c
> >> >> +++ b/fs/buffer.c
> >> >> @@ -293,7 +293,7 @@ static void free_more_memory(void)
> >> >>  	struct zone *zone;
> >> >>  	int nid;
> >> >>  
> >> >> -	wakeup_flusher_threads(1024);
> >> >> +	wakeup_flusher_threads(1024, NULL);
> >> >>  	yield();
> >> >>  
> >> >>  	for_each_online_node(nid) {
> >> >> diff --git a/fs/fs-writeback.c b/fs/fs-writeback.c
> >> >> index e91fb82..ba55336 100644
> >> >> --- a/fs/fs-writeback.c
> >> >> +++ b/fs/fs-writeback.c
> >> >> @@ -38,10 +38,14 @@ struct wb_writeback_work {
> >> >>  	struct super_block *sb;
> >> >>  	unsigned long *older_than_this;
> >> >>  	enum writeback_sync_modes sync_mode;
> >> >> +	unsigned short memcg_id;	/* If non-zero, then writeback specified
> >> >> +					 * cgroup. */
> >> >>  	unsigned int tagged_writepages:1;
> >> >>  	unsigned int for_kupdate:1;
> >> >>  	unsigned int range_cyclic:1;
> >> >>  	unsigned int for_background:1;
> >> >> +	unsigned int for_cgroup:1;	/* cgroup writeback */
> >> >> +	unsigned int shared_inodes:1;	/* write inodes spanning cgroups */
> >> >>  
> >> >>  	struct list_head list;		/* pending work list */
> >> >>  	struct completion *done;	/* set if the caller waits */
> >> >> @@ -114,9 +118,12 @@ static void bdi_queue_work(struct backing_dev_info *bdi,
> >> >>  	spin_unlock_bh(&bdi->wb_lock);
> >> >>  }
> >> >>  
> >> >> +/*
> >> >> + * @memcg is optional.  If set, then limit writeback to the specified cgroup.
> >> >> + */
> >> >>  static void
> >> >>  __bdi_start_writeback(struct backing_dev_info *bdi, long nr_pages,
> >> >> -		      bool range_cyclic)
> >> >> +		      bool range_cyclic, struct mem_cgroup *memcg)
> >> >>  {
> >> >>  	struct wb_writeback_work *work;
> >> >>  
> >> >> @@ -136,6 +143,8 @@ __bdi_start_writeback(struct backing_dev_info *bdi, long nr_pages,
> >> >>  	work->sync_mode	= WB_SYNC_NONE;
> >> >>  	work->nr_pages	= nr_pages;
> >> >>  	work->range_cyclic = range_cyclic;
> >> >> +	work->memcg_id = memcg ? css_id(mem_cgroup_css(memcg)) : 0;
> >> >> +	work->for_cgroup = memcg != NULL;
> >> >>  
> >> >
> >> >
> >> > I couldn't find a patch for mem_cgroup_css(NULL). Is it in patch 1-10 ?
> >> > Other parts seems ok to me.
> >> >
> >> >
> >> > Thanks,
> >> > -Kame
> >> 
> >> Mainline commit d324236b3333e87c8825b35f2104184734020d35 adds
> >> mem_cgroup_css() to memcontrol.c.  The above code does not call
> >> mem_cgroup_css() with a NULL parameter due to the 'memcg ? ...' check.
> >> So I do not think any additional changes to mem_cgroup_css() are needed.
> >> Am I missing your point?
> >> 
> >
> > I thought you need
> > ==
> > struct cgroup_subsys_state *mem_cgroup_css(struct mem_cgroup *mem)
> > {
> > +	if (!mem)
> > +		return NULL;
> >        return &mem->css;
> > }
> > ==
> > And
> > ==
> > unsigned short css_id(struct cgroup_subsys_state *css)
> > {
> >         struct css_id *cssid;
> >
> > +	if (!css)
> > 		return 0;
> > }
> > ==
> >
> > Thanks,
> > -Kame
> 
> I think that your changes to mem_cgroup_css() and css_id() are
> unnecessary for my patches because my patches do not call
> mem_cgroup_css(NULL).  The "?" check below prevents NULL from being
> passed into mem_cgroup_css():
> 
> +	work->memcg_id = memcg ? css_id(mem_cgroup_css(memcg)) : 0;
> 

Ah, I see. Thank you for the clarification.


Thanks,
-Kame

^ permalink raw reply	[flat|nested] 72+ messages in thread

* Re: [PATCH v9 11/13] writeback: make background writeback cgroup aware
  2011-08-18  7:17         ` KAMEZAWA Hiroyuki
@ 2011-08-18  7:38           ` Greg Thelen
  -1 siblings, 0 replies; 72+ messages in thread
From: Greg Thelen @ 2011-08-18  7:38 UTC (permalink / raw)
  To: KAMEZAWA Hiroyuki
  Cc: Andrew Morton, linux-kernel, linux-mm, containers, linux-fsdevel,
	Balbir Singh, Daisuke Nishimura, Minchan Kim, Johannes Weiner,
	Wu Fengguang, Dave Chinner, Vivek Goyal, Andrea Righi,
	Ciju Rajan K, David Rientjes

KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> writes:

> On Thu, 18 Aug 2011 00:10:56 -0700
> Greg Thelen <gthelen@google.com> wrote:
>
>> KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> writes:
>> 
>> > On Wed, 17 Aug 2011 09:15:03 -0700
>> > Greg Thelen <gthelen@google.com> wrote:
>> >
>> >> When the system is under background dirty memory threshold but some
>> >> cgroups are over their background dirty memory thresholds, then only
>> >> writeback inodes associated with the over-limit cgroups.
>> >> 
>> >> In addition to checking if the system dirty memory usage is over the
>> >> system background threshold, over_bground_thresh() now checks if any
>> >> cgroups are over their respective background dirty memory thresholds.
>> >> 
>> >> If over-limit cgroups are found, then the new
>> >> wb_writeback_work.for_cgroup field is set to distinguish between system
>> >> and memcg overages.  The new wb_writeback_work.shared_inodes field is
>> >> also set.  Inodes written by multiple cgroups are marked owned by
>> >> I_MEMCG_SHARED rather than a particular cgroup.  Such shared inodes
>> >> cannot easily be attributed to a cgroup, so per-cgroup writeback
>> >> (future versions of wakeup_flusher_threads and balance_dirty_pages)
>> >> performs suboptimally in the presence of shared inodes.  Therefore,
>> >> write shared inodes when performing cgroup background writeback.
>> >> 
>> >> If performing cgroup writeback, move_expired_inodes() skips inodes that
>> >> do not contribute dirty pages to the cgroup being written back.
>> >> 
>> >> After writing some pages, wb_writeback() will call
>> >> mem_cgroup_writeback_done() to update the set of over-bg-limits memcg.
>> >> 
>> >> This change also makes wakeup_flusher_threads() memcg aware so that
>> >> per-cgroup try_to_free_pages() is able to operate more efficiently
>> >> without having to write pages of foreign containers.  This change adds a
>> >> mem_cgroup parameter to wakeup_flusher_threads() to allow callers,
>> >> especially try_to_free_pages() and foreground writeback from
>> >> balance_dirty_pages(), to specify a particular cgroup to write inodes
>> >> from.
>> >> 
>> >> Signed-off-by: Greg Thelen <gthelen@google.com>
>> >> ---
>> >> Changelog since v8:
>> >> 
>> >> - Added optional memcg parameter to __bdi_start_writeback(),
>> >>   bdi_start_writeback(), wakeup_flusher_threads(), writeback_inodes_wb().
>> >> 
>> >> - move_expired_inodes() now takes a struct wb_writeback_work instead of a
>> >>   struct writeback_control.
>> >> 
>> >> - Added comments to over_bground_thresh().
>> >> 
>> >>  fs/buffer.c               |    2 +-
>> >>  fs/fs-writeback.c         |   96 +++++++++++++++++++++++++++++++++-----------
>> >>  fs/sync.c                 |    2 +-
>> >>  include/linux/writeback.h |    6 ++-
>> >>  mm/backing-dev.c          |    3 +-
>> >>  mm/page-writeback.c       |    3 +-
>> >>  mm/vmscan.c               |    3 +-
>> >>  7 files changed, 84 insertions(+), 31 deletions(-)
>> >> 
>> >> diff --git a/fs/buffer.c b/fs/buffer.c
>> >> index dd0220b..da1fb23 100644
>> >> --- a/fs/buffer.c
>> >> +++ b/fs/buffer.c
>> >> @@ -293,7 +293,7 @@ static void free_more_memory(void)
>> >>  	struct zone *zone;
>> >>  	int nid;
>> >>  
>> >> -	wakeup_flusher_threads(1024);
>> >> +	wakeup_flusher_threads(1024, NULL);
>> >>  	yield();
>> >>  
>> >>  	for_each_online_node(nid) {
>> >> diff --git a/fs/fs-writeback.c b/fs/fs-writeback.c
>> >> index e91fb82..ba55336 100644
>> >> --- a/fs/fs-writeback.c
>> >> +++ b/fs/fs-writeback.c
>> >> @@ -38,10 +38,14 @@ struct wb_writeback_work {
>> >>  	struct super_block *sb;
>> >>  	unsigned long *older_than_this;
>> >>  	enum writeback_sync_modes sync_mode;
>> >> +	unsigned short memcg_id;	/* If non-zero, then writeback specified
>> >> +					 * cgroup. */
>> >>  	unsigned int tagged_writepages:1;
>> >>  	unsigned int for_kupdate:1;
>> >>  	unsigned int range_cyclic:1;
>> >>  	unsigned int for_background:1;
>> >> +	unsigned int for_cgroup:1;	/* cgroup writeback */
>> >> +	unsigned int shared_inodes:1;	/* write inodes spanning cgroups */
>> >>  
>> >>  	struct list_head list;		/* pending work list */
>> >>  	struct completion *done;	/* set if the caller waits */
>> >> @@ -114,9 +118,12 @@ static void bdi_queue_work(struct backing_dev_info *bdi,
>> >>  	spin_unlock_bh(&bdi->wb_lock);
>> >>  }
>> >>  
>> >> +/*
>> >> + * @memcg is optional.  If set, then limit writeback to the specified cgroup.
>> >> + */
>> >>  static void
>> >>  __bdi_start_writeback(struct backing_dev_info *bdi, long nr_pages,
>> >> -		      bool range_cyclic)
>> >> +		      bool range_cyclic, struct mem_cgroup *memcg)
>> >>  {
>> >>  	struct wb_writeback_work *work;
>> >>  
>> >> @@ -136,6 +143,8 @@ __bdi_start_writeback(struct backing_dev_info *bdi, long nr_pages,
>> >>  	work->sync_mode	= WB_SYNC_NONE;
>> >>  	work->nr_pages	= nr_pages;
>> >>  	work->range_cyclic = range_cyclic;
>> >> +	work->memcg_id = memcg ? css_id(mem_cgroup_css(memcg)) : 0;
>> >> +	work->for_cgroup = memcg != NULL;
>> >>  
>> >
>> >
>> > I couldn't find a patch for mem_cgroup_css(NULL). Is it in patch 1-10 ?
>> > Other parts seems ok to me.
>> >
>> >
>> > Thanks,
>> > -Kame
>> 
>> Mainline commit d324236b3333e87c8825b35f2104184734020d35 adds
>> mem_cgroup_css() to memcontrol.c.  The above code does not call
>> mem_cgroup_css() with a NULL parameter due to the 'memcg ? ...' check.
>> So I do not think any additional changes to mem_cgroup_css() are needed.
>> Am I missing your point?
>> 
>
> I thought you need
> ==
> struct cgroup_subsys_state *mem_cgroup_css(struct mem_cgroup *mem)
> {
> +	if (!mem)
> +		return NULL;
>        return &mem->css;
> }
> ==
> And
> ==
> unsigned short css_id(struct cgroup_subsys_state *css)
> {
>         struct css_id *cssid;
>
> +	if (!css)
> 		return 0;
> }
> ==
>
> Thanks,
> -Kame

I think that your changes to mem_cgroup_css() and css_id() are
unnecessary for my patches because my patches do not call
mem_cgroup_css(NULL).  The "?" check below prevents NULL from being
passed into mem_cgroup_css():

+	work->memcg_id = memcg ? css_id(mem_cgroup_css(memcg)) : 0;
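
For completeness, here is a rough sketch of what the NULL-tolerant helpers
suggested above could look like if one did want to push the check into the
callees (the css_id() lookup body is elided just as in the quote above; the
series itself does not need this, since callers guard with the '?' check):
==
struct cgroup_subsys_state *mem_cgroup_css(struct mem_cgroup *mem)
{
	if (!mem)
		return NULL;
	return &mem->css;
}

unsigned short css_id(struct cgroup_subsys_state *css)
{
	struct css_id *cssid;

	if (!css)
		return 0;
	/* ... existing cssid lookup continues unchanged ... */
}
==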

^ permalink raw reply	[flat|nested] 72+ messages in thread

* Re: [PATCH v9 12/13] memcg: create support routines for page writeback
  2011-08-18  2:36       ` Wu Fengguang
@ 2011-08-18 10:12         ` Jan Kara
  -1 siblings, 0 replies; 72+ messages in thread
From: Jan Kara @ 2011-08-18 10:12 UTC (permalink / raw)
  To: Wu Fengguang
  Cc: KAMEZAWA Hiroyuki, Jan Kara, Greg Thelen, Andrew Morton,
	linux-kernel, linux-mm, containers, linux-fsdevel, Balbir Singh,
	Daisuke Nishimura, Minchan Kim, Johannes Weiner, Dave Chinner,
	Vivek Goyal, Andrea Righi, Ciju Rajan K, David Rientjes,
	Li Shaohua, Shi, Alex, Chen, Tim C

On Thu 18-08-11 10:36:10, Wu Fengguang wrote:
> Subject: squeeze max-pause area and drop pass-good area
> Date: Tue Aug 16 13:37:14 CST 2011
> 
> Remove the pass-good area introduced in ffd1f609ab10 ("writeback:
> introduce max-pause and pass-good dirty limits") and make the
> max-pause area smaller and safe.
> 
> This fixes ~30% performance regression in the ext3 data=writeback
> fio_mmap_randwrite_64k/fio_mmap_randrw_64k test cases, where there are
> 12 JBOD disks, on each disk runs 8 concurrent tasks doing reads+writes.
> 
> Using deadline scheduler also has a regression, but not that big as
> CFQ, so this suggests we have some write starvation.
> 
> The test logs show that
> 
> - the disks are sometimes under utilized
> 
> - global dirty pages sometimes rush high to the pass-good area for
>   several hundred seconds, while in the mean time some bdi dirty pages
>   drop to very low value (bdi_dirty << bdi_thresh).
>   Then suddenly the global dirty pages dropped under global dirty
>   threshold and bdi_dirty rush very high (for example, 2 times higher
>   than bdi_thresh). During which time balance_dirty_pages() is not
>   called at all.
> 
> So the problems are
> 
> 1) The random writes progress so slow that they break the assumption of
> the max-pause logic that "8 pages per 200ms is typically more than
> enough to curb heavy dirtiers".
> 
> 2) The max-pause logic ignored task_bdi_thresh and thus opens the
>    possibility for some bdi's to over dirty pages, leading to
>    (bdi_dirty >> bdi_thresh) and then (bdi_thresh >> bdi_dirty) for others.
> 
> 3) The higher max-pause/pass-good thresholds somehow leads to some bad
>    swing of dirty pages.
> 
> The fix is to allow the task to slightly dirty over task_bdi_thresh, but
> no way to exceed bdi_dirty and/or global dirty_thresh.
> 
> Tests show that it fixed the JBOD regression completely (both behavior
> and performance), while still being able to cut down large pause times
> in balance_dirty_pages() for single-disk cases.
> 
> Reported-by: Li Shaohua <shaohua.li@intel.com>
> Tested-by: Li Shaohua <shaohua.li@intel.com>
> Signed-off-by: Wu Fengguang <fengguang.wu@intel.com>
> ---
>  include/linux/writeback.h |   11 -----------
>  mm/page-writeback.c       |   15 ++-------------
>  2 files changed, 2 insertions(+), 24 deletions(-)
> 
> --- linux.orig/mm/page-writeback.c	2011-08-18 09:52:59.000000000 +0800
> +++ linux/mm/page-writeback.c	2011-08-18 10:28:57.000000000 +0800
> @@ -786,21 +786,10 @@ static void balance_dirty_pages(struct a
>  		 * 200ms is typically more than enough to curb heavy dirtiers;
>  		 * (b) the pause time limit makes the dirtiers more responsive.
>  		 */
> -		if (nr_dirty < dirty_thresh +
> -			       dirty_thresh / DIRTY_MAXPAUSE_AREA &&
> +		if (nr_dirty < dirty_thresh &&
> +		    bdi_dirty < (task_bdi_thresh + bdi_thresh) / 2 &&
>  		    time_after(jiffies, start_time + MAX_PAUSE))
>  			break;
  This definitely looks much safer than the original patch since we now
always observe the global dirty limit. I just wonder: We have throttled the
task because bdi_nr_reclaimable > task_bdi_thresh. Now in practice there
should be some pages under writeback and this task should have submitted
even more just a while ago. So the condition
  bdi_dirty < (task_bdi_thresh + bdi_thresh) / 2
looks still relatively weak. Shouldn't there be
  bdi_nr_reclaimable < (task_bdi_thresh + bdi_thresh) / 2?
Since bdi_nr_reclaimable is really the number we want to limit...
Alternatively, I could also see a reason for
  bdi_dirty < task_bdi_thresh
which leaves the task pages under writeback as the pausing area. But since
these are not really well limited, I'd prefer my first suggestion.
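
Concretely, the break test would then become something like the following
sketch (same shape as the hunk above, only with bdi_nr_reclaimable
substituted for bdi_dirty; all names are taken from the quoted patch):

		if (nr_dirty < dirty_thresh &&
		    bdi_nr_reclaimable < (task_bdi_thresh + bdi_thresh) / 2 &&
		    time_after(jiffies, start_time + MAX_PAUSE))
			break;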

								Honza
> -		/*
> -		 * pass-good area. When some bdi gets blocked (eg. NFS server
> -		 * not responding), or write bandwidth dropped dramatically due
> -		 * to concurrent reads, or dirty threshold suddenly dropped and
> -		 * the dirty pages cannot be brought down anytime soon (eg. on
> -		 * slow USB stick), at least let go of the good bdi's.
> -		 */
> -		if (nr_dirty < dirty_thresh +
> -			       dirty_thresh / DIRTY_PASSGOOD_AREA &&
> -		    bdi_dirty < bdi_thresh)
> -			break;
>  
>  		/*
>  		 * Increase the delay for each loop, up to our previous
> --- linux.orig/include/linux/writeback.h	2011-08-16 23:34:27.000000000 +0800
> +++ linux/include/linux/writeback.h	2011-08-18 09:53:03.000000000 +0800
> @@ -12,15 +12,6 @@
>   *
>   *	(thresh - thresh/DIRTY_FULL_SCOPE, thresh)
>   *
> - * The 1/16 region above the global dirty limit will be put to maximum pauses:
> - *
> - *	(limit, limit + limit/DIRTY_MAXPAUSE_AREA)
> - *
> - * The 1/16 region above the max-pause region, dirty exceeded bdi's will be put
> - * to loops:
> - *
> - *	(limit + limit/DIRTY_MAXPAUSE_AREA, limit + limit/DIRTY_PASSGOOD_AREA)
> - *
>   * Further beyond, all dirtier tasks will enter a loop waiting (possibly long
>   * time) for the dirty pages to drop, unless written enough pages.
>   *
> @@ -31,8 +22,6 @@
>   */
>  #define DIRTY_SCOPE		8
>  #define DIRTY_FULL_SCOPE	(DIRTY_SCOPE / 2)
> -#define DIRTY_MAXPAUSE_AREA		16
> -#define DIRTY_PASSGOOD_AREA		8
>  
>  /*
>   * 4MB minimal write chunk size
-- 
Jan Kara <jack@suse.cz>
SUSE Labs, CR

^ permalink raw reply	[flat|nested] 72+ messages in thread

* Re: [PATCH v9 12/13] memcg: create support routines for page writeback
  2011-08-18 10:12         ` Jan Kara
@ 2011-08-18 12:17           ` Wu Fengguang
  -1 siblings, 0 replies; 72+ messages in thread
From: Wu Fengguang @ 2011-08-18 12:17 UTC (permalink / raw)
  To: Jan Kara
  Cc: KAMEZAWA Hiroyuki, Greg Thelen, Andrew Morton, linux-kernel,
	linux-mm, containers, linux-fsdevel, Balbir Singh,
	Daisuke Nishimura, Minchan Kim, Johannes Weiner, Dave Chinner,
	Vivek Goyal, Andrea Righi, Ciju Rajan K, David Rientjes, Li,
	Shaohua, Shi, Alex, Chen, Tim C

On Thu, Aug 18, 2011 at 06:12:48PM +0800, Jan Kara wrote:
> On Thu 18-08-11 10:36:10, Wu Fengguang wrote:
> > Subject: squeeze max-pause area and drop pass-good area
> > Date: Tue Aug 16 13:37:14 CST 2011
> > 
> > Remove the pass-good area introduced in ffd1f609ab10 ("writeback:
> > introduce max-pause and pass-good dirty limits") and make the
> > max-pause area smaller and safe.
> > 
> > This fixes ~30% performance regression in the ext3 data=writeback
> > fio_mmap_randwrite_64k/fio_mmap_randrw_64k test cases, where there are
> > 12 JBOD disks, on each disk runs 8 concurrent tasks doing reads+writes.
> > 
> > Using deadline scheduler also has a regression, but not that big as
> > CFQ, so this suggests we have some write starvation.
> > 
> > The test logs show that
> > 
> > - the disks are sometimes under utilized
> > 
> > - global dirty pages sometimes rush high to the pass-good area for
> >   several hundred seconds, while in the mean time some bdi dirty pages
> >   drop to very low value (bdi_dirty << bdi_thresh).
> >   Then suddenly the global dirty pages dropped under global dirty
> >   threshold and bdi_dirty rush very high (for example, 2 times higher
> >   than bdi_thresh). During which time balance_dirty_pages() is not
> >   called at all.
> > 
> > So the problems are
> > 
> > 1) The random writes progress so slow that they break the assumption of
> > the max-pause logic that "8 pages per 200ms is typically more than
> > enough to curb heavy dirtiers".
> > 
> > 2) The max-pause logic ignored task_bdi_thresh and thus opens the
> >    possibility for some bdi's to over dirty pages, leading to
> >    (bdi_dirty >> bdi_thresh) and then (bdi_thresh >> bdi_dirty) for others.
> > 
> > 3) The higher max-pause/pass-good thresholds somehow leads to some bad
> >    swing of dirty pages.
> > 
> > The fix is to allow the task to slightly dirty over task_bdi_thresh, but
> > no way to exceed bdi_dirty and/or global dirty_thresh.
> > 
> > Tests show that it fixed the JBOD regression completely (both behavior
> > and performance), while still being able to cut down large pause times
> > in balance_dirty_pages() for single-disk cases.
> > 
> > Reported-by: Li Shaohua <shaohua.li@intel.com>
> > Tested-by: Li Shaohua <shaohua.li@intel.com>
> > Signed-off-by: Wu Fengguang <fengguang.wu@intel.com>
> > ---
> >  include/linux/writeback.h |   11 -----------
> >  mm/page-writeback.c       |   15 ++-------------
> >  2 files changed, 2 insertions(+), 24 deletions(-)
> > 
> > --- linux.orig/mm/page-writeback.c	2011-08-18 09:52:59.000000000 +0800
> > +++ linux/mm/page-writeback.c	2011-08-18 10:28:57.000000000 +0800
> > @@ -786,21 +786,10 @@ static void balance_dirty_pages(struct a
> >  		 * 200ms is typically more than enough to curb heavy dirtiers;
> >  		 * (b) the pause time limit makes the dirtiers more responsive.
> >  		 */
> > -		if (nr_dirty < dirty_thresh +
> > -			       dirty_thresh / DIRTY_MAXPAUSE_AREA &&
> > +		if (nr_dirty < dirty_thresh &&
> > +		    bdi_dirty < (task_bdi_thresh + bdi_thresh) / 2 &&
> >  		    time_after(jiffies, start_time + MAX_PAUSE))
> >  			break;
>   This definitely looks much safer than the original patch since we now
> always observe the global dirty limit.

Yeah.

> I just wonder: We have throttled the
> task because bdi_nr_reclaimable > task_bdi_thresh.

Not necessarily. It's possible (bdi_nr_reclaimable < task_bdi_thresh)
for the whole loop. And the 200ms pause that triggers the above test
may come entirely from the io_schedule_timeout() calls.

> Now in practice there
> should be some pages under writeback and this task should have submitted
> even more just a while ago. So the condition
>   bdi_dirty < (task_bdi_thresh + bdi_thresh) / 2

I guess the writeback_inodes_wb() call is irrelevant for the above
test, because writeback_inodes_wb() transfers reclaimable pages to
writeback pages, with the total bdi_dirty value staying the same.
Not to mention the fact that both the bdi_dirty and bdi_nr_reclaimable
variables have not been updated between writeback_inodes_wb() and the
max-pause test.
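
For reference, the relation relied on here is roughly the following sketch of
how balance_dirty_pages() derives the two counters in this tree (omitting the
*_sum variants used when close to the per-bdi error margin):

	bdi_nr_reclaimable = bdi_stat(bdi, BDI_RECLAIMABLE);  /* dirty (+ unstable) */
	bdi_dirty = bdi_nr_reclaimable + bdi_stat(bdi, BDI_WRITEBACK);

so moving pages from BDI_RECLAIMABLE to BDI_WRITEBACK leaves bdi_dirty itself
unchanged.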

> looks still relatively weak. Shouldn't there be
>   bdi_nr_reclaimable < (task_bdi_thresh + bdi_thresh) / 2?

That's a much easier condition to satisfy.

> Since bdi_nr_reclaimable is really the number we want to limit...
> Alternatively, I could also see a reason for
>   bdi_dirty < task_bdi_thresh
> which leaves the task pages under writeback as the pausing area. But since
> these are not really well limited, I'd prefer my first suggestion.

Thanks,
Fengguang

> > -		/*
> > -		 * pass-good area. When some bdi gets blocked (eg. NFS server
> > -		 * not responding), or write bandwidth dropped dramatically due
> > -		 * to concurrent reads, or dirty threshold suddenly dropped and
> > -		 * the dirty pages cannot be brought down anytime soon (eg. on
> > -		 * slow USB stick), at least let go of the good bdi's.
> > -		 */
> > -		if (nr_dirty < dirty_thresh +
> > -			       dirty_thresh / DIRTY_PASSGOOD_AREA &&
> > -		    bdi_dirty < bdi_thresh)
> > -			break;
> >  
> >  		/*
> >  		 * Increase the delay for each loop, up to our previous
> > --- linux.orig/include/linux/writeback.h	2011-08-16 23:34:27.000000000 +0800
> > +++ linux/include/linux/writeback.h	2011-08-18 09:53:03.000000000 +0800
> > @@ -12,15 +12,6 @@
> >   *
> >   *	(thresh - thresh/DIRTY_FULL_SCOPE, thresh)
> >   *
> > - * The 1/16 region above the global dirty limit will be put to maximum pauses:
> > - *
> > - *	(limit, limit + limit/DIRTY_MAXPAUSE_AREA)
> > - *
> > - * The 1/16 region above the max-pause region, dirty exceeded bdi's will be put
> > - * to loops:
> > - *
> > - *	(limit + limit/DIRTY_MAXPAUSE_AREA, limit + limit/DIRTY_PASSGOOD_AREA)
> > - *
> >   * Further beyond, all dirtier tasks will enter a loop waiting (possibly long
> >   * time) for the dirty pages to drop, unless written enough pages.
> >   *
> > @@ -31,8 +22,6 @@
> >   */
> >  #define DIRTY_SCOPE		8
> >  #define DIRTY_FULL_SCOPE	(DIRTY_SCOPE / 2)
> > -#define DIRTY_MAXPAUSE_AREA		16
> > -#define DIRTY_PASSGOOD_AREA		8
> >  
> >  /*
> >   * 4MB minimal write chunk size
> -- 
> Jan Kara <jack@suse.cz>
> SUSE Labs, CR

^ permalink raw reply	[flat|nested] 72+ messages in thread

* Re: [PATCH v9 12/13] memcg: create support routines for page writeback
  2011-08-18 12:17           ` Wu Fengguang
@ 2011-08-18 20:08             ` Jan Kara
  -1 siblings, 0 replies; 72+ messages in thread
From: Jan Kara @ 2011-08-18 20:08 UTC (permalink / raw)
  To: Wu Fengguang
  Cc: Jan Kara, KAMEZAWA Hiroyuki, Greg Thelen, Andrew Morton,
	linux-kernel, linux-mm, containers, linux-fsdevel, Balbir Singh,
	Daisuke Nishimura, Minchan Kim, Johannes Weiner, Dave Chinner,
	Vivek Goyal, Andrea Righi, Ciju Rajan K, David Rientjes, Li,
	Shaohua, Shi, Alex, Chen, Tim C

On Thu 18-08-11 20:17:14, Wu Fengguang wrote:
> On Thu, Aug 18, 2011 at 06:12:48PM +0800, Jan Kara wrote:
> > On Thu 18-08-11 10:36:10, Wu Fengguang wrote:
> > > Subject: squeeze max-pause area and drop pass-good area
> > > Date: Tue Aug 16 13:37:14 CST 2011
> > > 
> > > Remove the pass-good area introduced in ffd1f609ab10 ("writeback:
> > > introduce max-pause and pass-good dirty limits") and make the
> > > max-pause area smaller and safe.
> > > 
> > > This fixes ~30% performance regression in the ext3 data=writeback
> > > fio_mmap_randwrite_64k/fio_mmap_randrw_64k test cases, where there are
> > > 12 JBOD disks, on each disk runs 8 concurrent tasks doing reads+writes.
> > > 
> > > Using deadline scheduler also has a regression, but not that big as
> > > CFQ, so this suggests we have some write starvation.
> > > 
> > > The test logs show that
> > > 
> > > - the disks are sometimes under utilized
> > > 
> > > - global dirty pages sometimes rush high to the pass-good area for
> > >   several hundred seconds, while in the mean time some bdi dirty pages
> > >   drop to very low value (bdi_dirty << bdi_thresh).
> > >   Then suddenly the global dirty pages dropped under global dirty
> > >   threshold and bdi_dirty rush very high (for example, 2 times higher
> > >   than bdi_thresh). During which time balance_dirty_pages() is not
> > >   called at all.
> > > 
> > > So the problems are
> > > 
> > > 1) The random writes progress so slow that they break the assumption of
> > > the max-pause logic that "8 pages per 200ms is typically more than
> > > enough to curb heavy dirtiers".
> > > 
> > > 2) The max-pause logic ignored task_bdi_thresh and thus opens the
> > >    possibility for some bdi's to over dirty pages, leading to
> > >    (bdi_dirty >> bdi_thresh) and then (bdi_thresh >> bdi_dirty) for others.
> > > 
> > > 3) The higher max-pause/pass-good thresholds somehow leads to some bad
> > >    swing of dirty pages.
> > > 
> > > The fix is to allow the task to slightly dirty over task_bdi_thresh, but
> > > no way to exceed bdi_dirty and/or global dirty_thresh.
> > > 
> > > Tests show that it fixed the JBOD regression completely (both behavior
> > > and performance), while still being able to cut down large pause times
> > > in balance_dirty_pages() for single-disk cases.
> > > 
> > > Reported-by: Li Shaohua <shaohua.li@intel.com>
> > > Tested-by: Li Shaohua <shaohua.li@intel.com>
> > > Signed-off-by: Wu Fengguang <fengguang.wu@intel.com>
> > > ---
> > >  include/linux/writeback.h |   11 -----------
> > >  mm/page-writeback.c       |   15 ++-------------
> > >  2 files changed, 2 insertions(+), 24 deletions(-)
> > > 
> > > --- linux.orig/mm/page-writeback.c	2011-08-18 09:52:59.000000000 +0800
> > > +++ linux/mm/page-writeback.c	2011-08-18 10:28:57.000000000 +0800
> > > @@ -786,21 +786,10 @@ static void balance_dirty_pages(struct a
> > >  		 * 200ms is typically more than enough to curb heavy dirtiers;
> > >  		 * (b) the pause time limit makes the dirtiers more responsive.
> > >  		 */
> > > -		if (nr_dirty < dirty_thresh +
> > > -			       dirty_thresh / DIRTY_MAXPAUSE_AREA &&
> > > +		if (nr_dirty < dirty_thresh &&
> > > +		    bdi_dirty < (task_bdi_thresh + bdi_thresh) / 2 &&
> > >  		    time_after(jiffies, start_time + MAX_PAUSE))
> > >  			break;
> >   This definitely looks much safer than the original patch since we now
> > always observe the global dirty limit.
> 
> Yeah.
> 
> > I just wonder: We have throttled the
> > task because bdi_nr_reclaimable > task_bdi_thresh.
> 
> Not necessarily. It's possible (bdi_nr_reclaimable < task_bdi_thresh)
> for the whole loop. And the 200ms pause that triggers the above test
> may come entirely from the io_schedule_timeout() calls.
> 
> > Now in practice there
> > should be some pages under writeback and this task should have submitted
> > even more just a while ago. So the condition
> >   bdi_dirty < (task_bdi_thresh + bdi_thresh) / 2
> 
> I guess the writeback_inodes_wb() call is irrelevant for the above
> test, because writeback_inodes_wb() transfers reclaimable pages to
> writeback pages, with the total bdi_dirty value staying the same.
> Not to mention the fact that both the bdi_dirty and bdi_nr_reclaimable
> variables have not been updated between writeback_inodes_wb() and the
> max-pause test.
  Right, that comment was a bit off.

> > looks still relatively weak. Shouldn't there be
> >   bdi_nr_reclaimable < (task_bdi_thresh + bdi_thresh) / 2?
> 
> That's a much easier condition to satisfy.
  Argh, sorry. I was misled by the name of the variable - I thought it
contained only dirty pages on the bdi, but it also contains pages under
writeback; bdi_nr_reclaimable is the one that contains only dirty pages.
So your patch does exactly what I had in mind. You can add:
  Acked-by: Jan Kara <jack@suse.cz>

								Honza
-- 
Jan Kara <jack@suse.cz>
SUSE Labs, CR

^ permalink raw reply	[flat|nested] 72+ messages in thread

* Re: [PATCH v9 12/13] memcg: create support routines for page writeback
@ 2011-08-18 20:08             ` Jan Kara
  0 siblings, 0 replies; 72+ messages in thread
From: Jan Kara @ 2011-08-18 20:08 UTC (permalink / raw)
  To: Wu Fengguang
  Cc: Jan Kara, KAMEZAWA Hiroyuki, Greg Thelen, Andrew Morton,
	linux-kernel, linux-mm, containers, linux-fsdevel, Balbir Singh,
	Daisuke Nishimura, Minchan Kim, Johannes Weiner, Dave Chinner,
	Vivek Goyal, Andrea Righi, Ciju Rajan K, David Rientjes, Li,
	Shaohua, Shi, Alex, Chen, Tim C

On Thu 18-08-11 20:17:14, Wu Fengguang wrote:
> On Thu, Aug 18, 2011 at 06:12:48PM +0800, Jan Kara wrote:
> > On Thu 18-08-11 10:36:10, Wu Fengguang wrote:
> > > Subject: squeeze max-pause area and drop pass-good area
> > > Date: Tue Aug 16 13:37:14 CST 2011
> > > 
> > > Remove the pass-good area introduced in ffd1f609ab10 ("writeback:
> > > introduce max-pause and pass-good dirty limits") and make the
> > > max-pause area smaller and safe.
> > > 
> > > This fixes ~30% performance regression in the ext3 data=writeback
> > > fio_mmap_randwrite_64k/fio_mmap_randrw_64k test cases, where there are
> > > 12 JBOD disks, on each disk runs 8 concurrent tasks doing reads+writes.
> > > 
> > > Using deadline scheduler also has a regression, but not that big as
> > > CFQ, so this suggests we have some write starvation.
> > > 
> > > The test logs show that
> > > 
> > > - the disks are sometimes under utilized
> > > 
> > > - global dirty pages sometimes rush high to the pass-good area for
> > >   several hundred seconds, while in the mean time some bdi dirty pages
> > >   drop to very low value (bdi_dirty << bdi_thresh).
> > >   Then suddenly the global dirty pages dropped under global dirty
> > >   threshold and bdi_dirty rush very high (for example, 2 times higher
> > >   than bdi_thresh). During which time balance_dirty_pages() is not
> > >   called at all.
> > > 
> > > So the problems are
> > > 
> > > 1) The random writes progress so slow that they break the assumption of
> > > the max-pause logic that "8 pages per 200ms is typically more than
> > > enough to curb heavy dirtiers".
> > > 
> > > 2) The max-pause logic ignored task_bdi_thresh and thus opens the
> > >    possibility for some bdi's to over dirty pages, leading to
> > >    (bdi_dirty >> bdi_thresh) and then (bdi_thresh >> bdi_dirty) for others.
> > > 
> > > 3) The higher max-pause/pass-good thresholds somehow leads to some bad
> > >    swing of dirty pages.
> > > 
> > > The fix is to allow the task to slightly dirty over task_bdi_thresh, but
> > > no way to exceed bdi_dirty and/or global dirty_thresh.
> > > 
> > > Tests show that it fixed the JBOD regression completely (both behavior
> > > and performance), while still being able to cut down large pause times
> > > in balance_dirty_pages() for single-disk cases.
> > > 
> > > Reported-by: Li Shaohua <shaohua.li@intel.com>
> > > Tested-by: Li Shaohua <shaohua.li@intel.com>
> > > Signed-off-by: Wu Fengguang <fengguang.wu@intel.com>
> > > ---
> > >  include/linux/writeback.h |   11 -----------
> > >  mm/page-writeback.c       |   15 ++-------------
> > >  2 files changed, 2 insertions(+), 24 deletions(-)
> > > 
> > > --- linux.orig/mm/page-writeback.c	2011-08-18 09:52:59.000000000 +0800
> > > +++ linux/mm/page-writeback.c	2011-08-18 10:28:57.000000000 +0800
> > > @@ -786,21 +786,10 @@ static void balance_dirty_pages(struct a
> > >  		 * 200ms is typically more than enough to curb heavy dirtiers;
> > >  		 * (b) the pause time limit makes the dirtiers more responsive.
> > >  		 */
> > > -		if (nr_dirty < dirty_thresh +
> > > -			       dirty_thresh / DIRTY_MAXPAUSE_AREA &&
> > > +		if (nr_dirty < dirty_thresh &&
> > > +		    bdi_dirty < (task_bdi_thresh + bdi_thresh) / 2 &&
> > >  		    time_after(jiffies, start_time + MAX_PAUSE))
> > >  			break;
> >   This looks definitely much safer than the original patch since we now
> > always observe global dirty limit.
> 
> Yeah.
> 
> > I just wonder: We have throttled the
> > task because bdi_nr_reclaimable > task_bdi_thresh.
> 
> Not necessarily. It's possible that (bdi_nr_reclaimable < task_bdi_thresh)
> holds for the whole loop, and the 200ms pause that triggers the above test
> may come entirely from the io_schedule_timeout() calls.
> 
> > Now in practice there
> > should be some pages under writeback and this task should have submitted
> > even more just a while ago. So the condition
> >   bdi_dirty < (task_bdi_thresh + bdi_thresh) / 2
> 
> I guess the writeback_inodes_wb() call is irrelevant for the above
> test, because writeback_inodes_wb() converts reclaimable pages into
> writeback pages, with the total bdi_dirty value staying the same.
> Not to mention that neither bdi_dirty nor bdi_nr_reclaimable has been
> updated between the writeback_inodes_wb() call and the max-pause test.
  Right, that comment was a bit off.

> > still looks relatively weak. Shouldn't there be
> >   bdi_nr_reclaimable < (task_bdi_thresh + bdi_thresh) / 2?
> 
> That's a much easier condition to satisfy...
  Argh, sorry. I was misled by the name of the variable - I thought it
contained only dirty pages on the bdi, but it also contains pages under
writeback; bdi_nr_reclaimable is the one that contains only dirty pages.
So your patch does exactly what I had in mind. You can add:
  Acked-by: Jan Kara <jack@suse.cz>

								Honza
-- 
Jan Kara <jack@suse.cz>
SUSE Labs, CR
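
To make the variable-name discussion above concrete, here is a stand-alone toy
model (not kernel code; the struct, the numbers and the helper are invented for
illustration) of how bdi_dirty relates to bdi_nr_reclaimable, and of why a
writeback_inodes_wb() call leaves the max-pause test's inputs unchanged:

#include <stdbool.h>
#include <stdio.h>

/* Toy per-bdi counters standing in for the reclaimable/writeback bdi stats. */
struct toy_bdi {
	unsigned long reclaimable;	/* dirty pages not yet under writeback */
	unsigned long writeback;	/* pages already queued for IO */
};

/* Stand-in for writeback_inodes_wb(): it only converts dirty pages into
 * writeback pages, so the sum (what the loop calls bdi_dirty) is unchanged. */
static void toy_writeback_inodes(struct toy_bdi *bdi, unsigned long nr)
{
	if (nr > bdi->reclaimable)
		nr = bdi->reclaimable;
	bdi->reclaimable -= nr;
	bdi->writeback += nr;
}

int main(void)
{
	struct toy_bdi bdi = { .reclaimable = 2000, .writeback = 500 };
	unsigned long task_bdi_thresh = 1000, bdi_thresh = 1200;

	for (int loop = 0; loop < 2; loop++) {
		/* Snapshots taken once per loop, as in balance_dirty_pages(). */
		unsigned long bdi_nr_reclaimable = bdi.reclaimable;
		unsigned long bdi_dirty = bdi.reclaimable + bdi.writeback;

		toy_writeback_inodes(&bdi, 256);	/* foreground writeback */
		/* ... the io_schedule_timeout() pause would happen here ... */

		/* A max-pause style check still uses the stale snapshots:
		 * pages moved from reclaimable to writeback, but bdi_dirty
		 * is the same total and bdi_nr_reclaimable is not re-read. */
		bool would_break =
			bdi_dirty < (task_bdi_thresh + bdi_thresh) / 2;
		printf("loop %d: bdi_dirty=%lu reclaimable=%lu break=%d\n",
		       loop, bdi_dirty, bdi_nr_reclaimable, would_break);
	}
	return 0;
}

Since bdi_dirty is always at least bdi_nr_reclaimable, testing bdi_dirty
against (task_bdi_thresh + bdi_thresh) / 2 is the stricter of the two checks,
which is why the bdi_nr_reclaimable variant is "much easier to satisfy".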


^ permalink raw reply	[flat|nested] 72+ messages in thread

* Re: [PATCH v9 12/13] memcg: create support routines for page writeback
  2011-08-18 20:08             ` Jan Kara
@ 2011-08-19  1:36               ` Wu Fengguang
  -1 siblings, 0 replies; 72+ messages in thread
From: Wu Fengguang @ 2011-08-19  1:36 UTC (permalink / raw)
  To: Jan Kara
  Cc: KAMEZAWA Hiroyuki, Greg Thelen, Andrew Morton, linux-kernel,
	linux-mm, containers, linux-fsdevel, Balbir Singh,
	Daisuke Nishimura, Minchan Kim, Johannes Weiner, Dave Chinner,
	Vivek Goyal, Andrea Righi, Ciju Rajan K, David Rientjes, Li,
	Shaohua, Shi, Alex, Chen, Tim C

On Fri, Aug 19, 2011 at 04:08:56AM +0800, Jan Kara wrote:
> On Thu 18-08-11 20:17:14, Wu Fengguang wrote:
> > On Thu, Aug 18, 2011 at 06:12:48PM +0800, Jan Kara wrote:
> > > On Thu 18-08-11 10:36:10, Wu Fengguang wrote:
> > > > Subject: squeeze max-pause area and drop pass-good area
> > > > Date: Tue Aug 16 13:37:14 CST 2011
> > > > 
> > > > Remove the pass-good area introduced in ffd1f609ab10 ("writeback:
> > > > introduce max-pause and pass-good dirty limits") and make the
> > > > max-pause area smaller and safer.
> > > > 
> > > > This fixes a ~30% performance regression in the ext3 data=writeback
> > > > fio_mmap_randwrite_64k/fio_mmap_randrw_64k test cases, where there are
> > > > 12 JBOD disks and each disk runs 8 concurrent tasks doing reads+writes.
> > > > 
> > > > Using the deadline scheduler also shows a regression, though not as big
> > > > as with CFQ, which suggests some write starvation.
> > > > 
> > > > The test logs show that
> > > > 
> > > > - the disks are sometimes under-utilized
> > > > 
> > > > - global dirty pages sometimes rush up into the pass-good area for
> > > >   several hundred seconds, while in the meantime some bdi dirty pages
> > > >   drop to very low values (bdi_dirty << bdi_thresh).
> > > >   Then the global dirty pages suddenly drop under the global dirty
> > > >   threshold and bdi_dirty rushes very high (for example, 2 times higher
> > > >   than bdi_thresh), during which time balance_dirty_pages() is not
> > > >   called at all.
> > > > 
> > > > So the problems are
> > > > 
> > > > 1) The random writes progress so slowly that they break the assumption
> > > >    of the max-pause logic that "8 pages per 200ms is typically more
> > > >    than enough to curb heavy dirtiers".
> > > > 
> > > > 2) The max-pause logic ignores task_bdi_thresh and thus opens the
> > > >    possibility for some bdis to over-dirty pages, leading to
> > > >    (bdi_dirty >> bdi_thresh) there and then (bdi_thresh >> bdi_dirty) for others.
> > > > 
> > > > 3) The higher max-pause/pass-good thresholds somehow lead to bad
> > > >    swings of dirty pages.
> > > > 
> > > > The fix is to allow the task to dirty slightly over task_bdi_thresh,
> > > > but with no way to exceed bdi_dirty and/or the global dirty_thresh.
> > > > 
> > > > Tests show that it fixes the JBOD regression completely (both behavior
> > > > and performance), while still being able to cut down the large pause
> > > > times in balance_dirty_pages() for single-disk cases.
> > > > 
> > > > Reported-by: Li Shaohua <shaohua.li@intel.com>
> > > > Tested-by: Li Shaohua <shaohua.li@intel.com>
> > > > Signed-off-by: Wu Fengguang <fengguang.wu@intel.com>
> > > > ---
> > > >  include/linux/writeback.h |   11 -----------
> > > >  mm/page-writeback.c       |   15 ++-------------
> > > >  2 files changed, 2 insertions(+), 24 deletions(-)
> > > > 
> > > > --- linux.orig/mm/page-writeback.c	2011-08-18 09:52:59.000000000 +0800
> > > > +++ linux/mm/page-writeback.c	2011-08-18 10:28:57.000000000 +0800
> > > > @@ -786,21 +786,10 @@ static void balance_dirty_pages(struct a
> > > >  		 * 200ms is typically more than enough to curb heavy dirtiers;
> > > >  		 * (b) the pause time limit makes the dirtiers more responsive.
> > > >  		 */
> > > > -		if (nr_dirty < dirty_thresh +
> > > > -			       dirty_thresh / DIRTY_MAXPAUSE_AREA &&
> > > > +		if (nr_dirty < dirty_thresh &&
> > > > +		    bdi_dirty < (task_bdi_thresh + bdi_thresh) / 2 &&
> > > >  		    time_after(jiffies, start_time + MAX_PAUSE))
> > > >  			break;
> > >   This definitely looks much safer than the original patch since we now
> > > always observe the global dirty limit.
> > 
> > Yeah.
> > 
> > > I just wonder: We have throttled the
> > > task because bdi_nr_reclaimable > task_bdi_thresh.
> > 
> > Not necessarily. It's possible that (bdi_nr_reclaimable < task_bdi_thresh)
> > holds for the whole loop, and the 200ms pause that triggers the above test
> > may come entirely from the io_schedule_timeout() calls.
> > 
> > > Now in practice there
> > > should be some pages under writeback and this task should have submitted
> > > even more just a while ago. So the condition
> > >   bdi_dirty < (task_bdi_thresh + bdi_thresh) / 2
> > 
> > I guess the writeback_inodes_wb() call is irrelevant for the above
> > test, because writeback_inodes_wb() converts reclaimable pages into
> > writeback pages, with the total bdi_dirty value staying the same.
> > Not to mention that neither bdi_dirty nor bdi_nr_reclaimable has been
> > updated between the writeback_inodes_wb() call and the max-pause test.
>   Right, that comment was a bit off.
> 
> > > still looks relatively weak. Shouldn't there be
> > >   bdi_nr_reclaimable < (task_bdi_thresh + bdi_thresh) / 2?
> > 
> > That's a much easier condition to satisfy...
>   Argh, sorry. I was misled by the name of the variable - I thought it
> contained only dirty pages on the bdi, but it also contains pages under
> writeback; bdi_nr_reclaimable is the one that contains only dirty pages.

Yeah, the name may be a bit confusing... but we'll soon get rid of bdi_nr_reclaimable :)

> So your patch does exactly what I had in mind. You can add:
>   Acked-by: Jan Kara <jack@suse.cz>

Thanks! I'll test it in linux-next for a week and then send it to Linus.

Thanks,
Fengguang


^ permalink raw reply	[flat|nested] 72+ messages in thread

end of thread, other threads:[~2011-08-19  1:36 UTC | newest]

Thread overview: 72+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2011-08-17 16:14 [PATCH v9 00/13] memcg: per cgroup dirty page limiting Greg Thelen
2011-08-17 16:14 ` [PATCH v9 01/13] memcg: document cgroup dirty memory interfaces Greg Thelen
2011-08-17 16:14 ` [PATCH v9 02/13] memcg: add page_cgroup flags for dirty page tracking Greg Thelen
2011-08-17 16:14 ` [PATCH v9 03/13] memcg: add dirty page accounting infrastructure Greg Thelen
2011-08-18  0:39   ` KAMEZAWA Hiroyuki
2011-08-18  6:07     ` Greg Thelen
2011-08-17 16:14 ` [PATCH v9 04/13] memcg: add kernel calls for memcg dirty page stats Greg Thelen
2011-08-17 16:14 ` [PATCH v9 05/13] memcg: add mem_cgroup_mark_inode_dirty() Greg Thelen
2011-08-18  0:51   ` KAMEZAWA Hiroyuki
2011-08-17 16:14 ` [PATCH v9 06/13] memcg: add dirty limits to mem_cgroup Greg Thelen
2011-08-18  0:53   ` KAMEZAWA Hiroyuki
2011-08-17 16:14 ` [PATCH v9 07/13] memcg: add cgroupfs interface to memcg dirty limits Greg Thelen
2011-08-18  0:55   ` KAMEZAWA Hiroyuki
2011-08-17 16:15 ` [PATCH v9 08/13] memcg: dirty page accounting support routines Greg Thelen
2011-08-18  1:05   ` KAMEZAWA Hiroyuki
2011-08-18  7:04     ` Greg Thelen
2011-08-17 16:15 ` [PATCH v9 09/13] memcg: create support routines for writeback Greg Thelen
2011-08-18  1:13   ` KAMEZAWA Hiroyuki
2011-08-17 16:15 ` [PATCH v9 10/13] writeback: pass wb_writeback_work into move_expired_inodes() Greg Thelen
2011-08-18  1:15   ` KAMEZAWA Hiroyuki
2011-08-17 16:15 ` [PATCH v9 11/13] writeback: make background writeback cgroup aware Greg Thelen
2011-08-18  1:23   ` KAMEZAWA Hiroyuki
2011-08-18  7:10     ` Greg Thelen
2011-08-18  7:17       ` KAMEZAWA Hiroyuki
2011-08-18  7:38         ` Greg Thelen
2011-08-18  7:35           ` KAMEZAWA Hiroyuki
2011-08-17 16:15 ` [PATCH v9 12/13] memcg: create support routines for page writeback Greg Thelen
2011-08-18  1:38   ` KAMEZAWA Hiroyuki
2011-08-18  2:36     ` Wu Fengguang
2011-08-18 10:12       ` Jan Kara
2011-08-18 12:17         ` Wu Fengguang
2011-08-18 20:08           ` Jan Kara
2011-08-19  1:36             ` Wu Fengguang
2011-08-17 16:15 ` [PATCH v9 13/13] memcg: check memcg dirty limits in " Greg Thelen
2011-08-18  1:40   ` KAMEZAWA Hiroyuki
2011-08-18  0:35 ` [PATCH v9 00/13] memcg: per cgroup dirty page limiting KAMEZAWA Hiroyuki
