* [PATCH 0/10] mm/memcg: per-memcg per-zone lru locking
@ 2012-02-20 23:26 ` Hugh Dickins
  0 siblings, 0 replies; 72+ messages in thread
From: Hugh Dickins @ 2012-02-20 23:26 UTC
  To: Andrew Morton
  Cc: Konstantin Khlebnikov, KAMEZAWA Hiroyuki, Johannes Weiner,
	Ying Han, linux-mm, linux-kernel

Here is my per-memcg per-zone LRU locking series, as promised last year.

zone->lru_lock is a heavily contended lock, and we expect that splitting
it across memcgs will show benefit on systems with many cpus.  Sorry, no
performance numbers included yet (I did try yesterday, but my own machines
are too small to show any advantage - it'll be a shame if the same proves
so for large ones!); but otherwise tested and ready.
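
For orientation, a rough sketch of where the series ends up - not the
literal code: lists[] is already in the lruvec, reclaim_stat moves in
with 2/10, the zone pointer arrives with 3/10, and the lock placement
is assumed from the titles of 9/10 and 10/10:

	struct lruvec {
		struct zone *zone;			/* added in 3/10 */
		struct list_head lists[NR_LRU_LISTS];
		struct zone_reclaim_stat reclaim_stat;	/* moved here in 2/10 */
		spinlock_t lru_lock;	/* assumed end state: per-memcg per-zone
					   lock replacing zone->lru_lock (9/10, 10/10) */
	};

With the lock held per lruvec, LRU add/del and reclaim in different
memcgs (or different zones) no longer serialize on one zone->lru_lock.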

Konstantin Khlebnikov posted an RFC for a competing series a few days ago:
[PATCH RFC 00/15] mm: memory book keeping and lru_lock splitting
https://lkml.org/lkml/2012/2/15/445
and then today
[PATCH v2 00/22] mm: lru_lock splitting
https://lkml.org/lkml/2012/2/20/252

I haven't glanced at v2 yet, but judging by a quick look at the RFC:
the two series have lots of overlap and much in common, so I'd better
post this now before the numbers, to help us exchange ideas.  If you
choose to use either series, we shall probably want to add in pieces
from the other.

There should be a further patch, to update references to zone->lru_lock
in comments and Documentation; but that's just a distraction at the
moment, better held over until our final direction is decided.

These patches are based upon what I expect in the next linux-next with
an update from akpm: perhaps 3.3.0-rc4-next-20120222, or maybe later.
They were prepared on 3.3.0-rc3-next-20120217 plus recent mm-commits:

memcg-remove-export_symbolmem_cgroup_update_page_stat.patch
memcg-simplify-move_account-check.patch
memcg-simplify-move_account-check-fix.patch
memcg-remove-pcg_move_lock-flag-from-page_cgroup.patch
memcg-use-new-logic-for-page-stat-accounting.patch
memcg-use-new-logic-for-page-stat-accounting-fix.patch
memcg-remove-pcg_file_mapped.patch
memcg-fix-performance-of-mem_cgroup_begin_update_page_stat.patch
memcg-fix-performance-of-mem_cgroup_begin_update_page_stat-fix.patch
mm-memcontrolc-s-stealed-stolen.patch

mm-vmscan-handle-isolated-pages-with-lru-lock-released.patch
mm-vmscan-forcibly-scan-highmem-if-there-are-too-many-buffer_heads-pinning-highmem-fix.patch
mm-vmscan-forcibly-scan-highmem-if-there-are-too-many-buffer_heads-pinning-highmem-fix-fix.patch

But it looks like there are no clashes with the first ten of those;
the last three little rearrangements in vmscan.c should be enough.
I see Konstantin has based his v2 off 3.3.0-rc3-next-20120210: that
should be good for mine too, if you add the last three commits on first.

Per-memcg per-zone LRU locking series:

 1/10 mm/memcg: scanning_global_lru means mem_cgroup_disabled
 2/10 mm/memcg: move reclaim_stat into lruvec
 3/10 mm/memcg: add zone pointer into lruvec
 4/10 mm/memcg: apply add/del_page to lruvec
 5/10 mm/memcg: introduce page_relock_lruvec
 6/10 mm/memcg: take care over pc->mem_cgroup
 7/10 mm/memcg: remove mem_cgroup_reset_owner
 8/10 mm/memcg: nest lru_lock inside page_cgroup lock
 9/10 mm/memcg: move lru_lock into lruvec
10/10 mm/memcg: per-memcg per-zone lru locking

 include/linux/memcontrol.h |   67 +----
 include/linux/mm_inline.h  |   20 -
 include/linux/mmzone.h     |   33 +-
 include/linux/swap.h       |   68 +++++
 mm/compaction.c            |   64 +++--
 mm/huge_memory.c           |   13 -
 mm/ksm.c                   |   11 
 mm/memcontrol.c            |  402 +++++++++++++++++------------------
 mm/migrate.c               |    2 
 mm/page_alloc.c            |   11 
 mm/swap.c                  |  138 ++++--------
 mm/swap_state.c            |   10 
 mm/vmscan.c                |  396 +++++++++++++++++-----------------
 13 files changed, 605 insertions(+), 630 deletions(-)

Next step: I shall be looking at and trying Konstantin's,
and I hope he can look at and try mine.

Hugh

* [PATCH 1/10] mm/memcg: scanning_global_lru means mem_cgroup_disabled
  2012-02-20 23:26 ` Hugh Dickins
@ 2012-02-20 23:28   ` Hugh Dickins
  -1 siblings, 0 replies; 72+ messages in thread
From: Hugh Dickins @ 2012-02-20 23:28 UTC
  To: Andrew Morton
  Cc: Konstantin Khlebnikov, KAMEZAWA Hiroyuki, Johannes Weiner,
	Ying Han, linux-mm, linux-kernel

Although one has to admire the skill with which it has been concealed,
scanning_global_lru(mz) is actually just an interesting way to test
mem_cgroup_disabled().  Too many developer hours have been wasted on
confusing it with global_reclaim(): just use mem_cgroup_disabled().

Signed-off-by: Hugh Dickins <hughd@google.com>
---
 mm/vmscan.c |   18 ++++--------------
 1 file changed, 4 insertions(+), 14 deletions(-)

--- mmotm.orig/mm/vmscan.c	2012-02-18 11:56:23.815522718 -0800
+++ mmotm/mm/vmscan.c	2012-02-18 11:56:33.395522945 -0800
@@ -164,26 +164,16 @@ static bool global_reclaim(struct scan_c
 {
 	return !sc->target_mem_cgroup;
 }
-
-static bool scanning_global_lru(struct mem_cgroup_zone *mz)
-{
-	return !mz->mem_cgroup;
-}
 #else
 static bool global_reclaim(struct scan_control *sc)
 {
 	return true;
 }
-
-static bool scanning_global_lru(struct mem_cgroup_zone *mz)
-{
-	return true;
-}
 #endif
 
 static struct zone_reclaim_stat *get_reclaim_stat(struct mem_cgroup_zone *mz)
 {
-	if (!scanning_global_lru(mz))
+	if (!mem_cgroup_disabled())
 		return mem_cgroup_get_reclaim_stat(mz->mem_cgroup, mz->zone);
 
 	return &mz->zone->reclaim_stat;
@@ -192,7 +182,7 @@ static struct zone_reclaim_stat *get_rec
 static unsigned long zone_nr_lru_pages(struct mem_cgroup_zone *mz,
 				       enum lru_list lru)
 {
-	if (!scanning_global_lru(mz))
+	if (!mem_cgroup_disabled())
 		return mem_cgroup_zone_nr_lru_pages(mz->mem_cgroup,
 						    zone_to_nid(mz->zone),
 						    zone_idx(mz->zone),
@@ -1804,7 +1794,7 @@ static int inactive_anon_is_low(struct m
 	if (!total_swap_pages)
 		return 0;
 
-	if (!scanning_global_lru(mz))
+	if (!mem_cgroup_disabled())
 		return mem_cgroup_inactive_anon_is_low(mz->mem_cgroup,
 						       mz->zone);
 
@@ -1843,7 +1833,7 @@ static int inactive_file_is_low_global(s
  */
 static int inactive_file_is_low(struct mem_cgroup_zone *mz)
 {
-	if (!scanning_global_lru(mz))
+	if (!mem_cgroup_disabled())
 		return mem_cgroup_inactive_file_is_low(mz->mem_cgroup,
 						       mz->zone);
 

* [PATCH 2/10] mm/memcg: move reclaim_stat into lruvec
  2012-02-20 23:26 ` Hugh Dickins
@ 2012-02-20 23:29   ` Hugh Dickins
  -1 siblings, 0 replies; 72+ messages in thread
From: Hugh Dickins @ 2012-02-20 23:29 UTC
  To: Andrew Morton
  Cc: Konstantin Khlebnikov, KAMEZAWA Hiroyuki, Johannes Weiner,
	Ying Han, linux-mm, linux-kernel

With mem_cgroup_disabled() now explicit, it becomes clear that the
zone_reclaim_stat structure actually belongs in lruvec, per-zone
when memcg is disabled but per-memcg per-zone when it's enabled.

We can delete mem_cgroup_get_reclaim_stat(), and change
update_page_reclaim_stat() to update just the one set of stats,
the one which get_scan_count() will actually use.
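
To picture where that leaves the stats, a rough sketch (other members
omitted; not the exact declarations):

	struct zone {
		struct lruvec lruvec;	/* global lists: used when memcg is disabled */
	};

	struct mem_cgroup_per_zone {
		struct lruvec lruvec;	/* one per memcg per zone: used when enabled */
	};

update_page_reclaim_stat() and get_reclaim_stat() below just pick
whichever lruvec the page is accounted to, so only one set of counters
is ever updated.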

Signed-off-by: Hugh Dickins <hughd@google.com>
---
 include/linux/memcontrol.h |    9 ---------
 include/linux/mmzone.h     |   29 ++++++++++++++---------------
 mm/memcontrol.c            |   27 +++++++--------------------
 mm/page_alloc.c            |    8 ++++----
 mm/swap.c                  |   14 ++++----------
 mm/vmscan.c                |    5 +----
 6 files changed, 30 insertions(+), 62 deletions(-)

--- mmotm.orig/include/linux/memcontrol.h	2012-02-18 11:56:52.015523388 -0800
+++ mmotm/include/linux/memcontrol.h	2012-02-18 11:57:20.391524062 -0800
@@ -120,8 +120,6 @@ int mem_cgroup_inactive_file_is_low(stru
 int mem_cgroup_select_victim_node(struct mem_cgroup *memcg);
 unsigned long mem_cgroup_zone_nr_lru_pages(struct mem_cgroup *memcg,
 					int nid, int zid, unsigned int lrumask);
-struct zone_reclaim_stat *mem_cgroup_get_reclaim_stat(struct mem_cgroup *memcg,
-						      struct zone *zone);
 struct zone_reclaim_stat*
 mem_cgroup_get_reclaim_stat_from_page(struct page *page);
 extern void mem_cgroup_print_oom_info(struct mem_cgroup *memcg,
@@ -351,13 +349,6 @@ mem_cgroup_zone_nr_lru_pages(struct mem_
 	return 0;
 }
 
-
-static inline struct zone_reclaim_stat*
-mem_cgroup_get_reclaim_stat(struct mem_cgroup *memcg, struct zone *zone)
-{
-	return NULL;
-}
-
 static inline struct zone_reclaim_stat*
 mem_cgroup_get_reclaim_stat_from_page(struct page *page)
 {
--- mmotm.orig/include/linux/mmzone.h	2012-02-18 11:56:52.015523388 -0800
+++ mmotm/include/linux/mmzone.h	2012-02-18 11:57:20.391524062 -0800
@@ -159,8 +159,22 @@ static inline int is_unevictable_lru(enu
 	return (lru == LRU_UNEVICTABLE);
 }
 
+struct zone_reclaim_stat {
+	/*
+	 * The pageout code in vmscan.c keeps track of how many of the
+	 * mem/swap backed and file backed pages are refeferenced.
+	 * The higher the rotated/scanned ratio, the more valuable
+	 * that cache is.
+	 *
+	 * The anon LRU stats live in [0], file LRU stats in [1]
+	 */
+	unsigned long		recent_rotated[2];
+	unsigned long		recent_scanned[2];
+};
+
 struct lruvec {
 	struct list_head lists[NR_LRU_LISTS];
+	struct zone_reclaim_stat reclaim_stat;
 };
 
 /* Mask used at gathering information at once (see memcontrol.c) */
@@ -287,19 +301,6 @@ enum zone_type {
 #error ZONES_SHIFT -- too many zones configured adjust calculation
 #endif
 
-struct zone_reclaim_stat {
-	/*
-	 * The pageout code in vmscan.c keeps track of how many of the
-	 * mem/swap backed and file backed pages are refeferenced.
-	 * The higher the rotated/scanned ratio, the more valuable
-	 * that cache is.
-	 *
-	 * The anon LRU stats live in [0], file LRU stats in [1]
-	 */
-	unsigned long		recent_rotated[2];
-	unsigned long		recent_scanned[2];
-};
-
 struct zone {
 	/* Fields commonly accessed by the page allocator */
 
@@ -374,8 +375,6 @@ struct zone {
 	spinlock_t		lru_lock;
 	struct lruvec		lruvec;
 
-	struct zone_reclaim_stat reclaim_stat;
-
 	unsigned long		pages_scanned;	   /* since last reclaim */
 	unsigned long		flags;		   /* zone flags, see below */
 
--- mmotm.orig/mm/memcontrol.c	2012-02-18 11:56:52.015523388 -0800
+++ mmotm/mm/memcontrol.c	2012-02-18 11:57:20.391524062 -0800
@@ -138,7 +138,6 @@ struct mem_cgroup_per_zone {
 
 	struct mem_cgroup_reclaim_iter reclaim_iter[DEF_PRIORITY + 1];
 
-	struct zone_reclaim_stat reclaim_stat;
 	struct rb_node		tree_node;	/* RB tree node */
 	unsigned long long	usage_in_excess;/* Set to the value by which */
 						/* the soft limit is exceeded*/
@@ -1200,16 +1199,6 @@ int mem_cgroup_inactive_file_is_low(stru
 	return (active > inactive);
 }
 
-struct zone_reclaim_stat *mem_cgroup_get_reclaim_stat(struct mem_cgroup *memcg,
-						      struct zone *zone)
-{
-	int nid = zone_to_nid(zone);
-	int zid = zone_idx(zone);
-	struct mem_cgroup_per_zone *mz = mem_cgroup_zoneinfo(memcg, nid, zid);
-
-	return &mz->reclaim_stat;
-}
-
 struct zone_reclaim_stat *
 mem_cgroup_get_reclaim_stat_from_page(struct page *page)
 {
@@ -1225,7 +1214,7 @@ mem_cgroup_get_reclaim_stat_from_page(st
 	/* Ensure pc->mem_cgroup is visible after reading PCG_USED. */
 	smp_rmb();
 	mz = page_cgroup_zoneinfo(pc->mem_cgroup, page);
-	return &mz->reclaim_stat;
+	return &mz->lruvec.reclaim_stat;
 }
 
 #define mem_cgroup_from_res_counter(counter, member)	\
@@ -4193,21 +4182,19 @@ static int mem_control_stat_show(struct
 	{
 		int nid, zid;
 		struct mem_cgroup_per_zone *mz;
+		struct zone_reclaim_stat *rstat;
 		unsigned long recent_rotated[2] = {0, 0};
 		unsigned long recent_scanned[2] = {0, 0};
 
 		for_each_online_node(nid)
 			for (zid = 0; zid < MAX_NR_ZONES; zid++) {
 				mz = mem_cgroup_zoneinfo(memcg, nid, zid);
+				rstat = &mz->lruvec.reclaim_stat;
 
-				recent_rotated[0] +=
-					mz->reclaim_stat.recent_rotated[0];
-				recent_rotated[1] +=
-					mz->reclaim_stat.recent_rotated[1];
-				recent_scanned[0] +=
-					mz->reclaim_stat.recent_scanned[0];
-				recent_scanned[1] +=
-					mz->reclaim_stat.recent_scanned[1];
+				recent_rotated[0] += rstat->recent_rotated[0];
+				recent_rotated[1] += rstat->recent_rotated[1];
+				recent_scanned[0] += rstat->recent_scanned[0];
+				recent_scanned[1] += rstat->recent_scanned[1];
 			}
 		cb->fill(cb, "recent_rotated_anon", recent_rotated[0]);
 		cb->fill(cb, "recent_rotated_file", recent_rotated[1]);
--- mmotm.orig/mm/page_alloc.c	2012-02-18 11:56:52.015523388 -0800
+++ mmotm/mm/page_alloc.c	2012-02-18 11:57:20.395524062 -0800
@@ -4367,10 +4367,10 @@ static void __paginginit free_area_init_
 		zone_pcp_init(zone);
 		for_each_lru(lru)
 			INIT_LIST_HEAD(&zone->lruvec.lists[lru]);
-		zone->reclaim_stat.recent_rotated[0] = 0;
-		zone->reclaim_stat.recent_rotated[1] = 0;
-		zone->reclaim_stat.recent_scanned[0] = 0;
-		zone->reclaim_stat.recent_scanned[1] = 0;
+		zone->lruvec.reclaim_stat.recent_rotated[0] = 0;
+		zone->lruvec.reclaim_stat.recent_rotated[1] = 0;
+		zone->lruvec.reclaim_stat.recent_scanned[0] = 0;
+		zone->lruvec.reclaim_stat.recent_scanned[1] = 0;
 		zap_zone_vm_stats(zone);
 		zone->flags = 0;
 		if (!size)
--- mmotm.orig/mm/swap.c	2012-02-18 11:56:52.015523388 -0800
+++ mmotm/mm/swap.c	2012-02-18 11:57:20.395524062 -0800
@@ -279,21 +279,15 @@ void rotate_reclaimable_page(struct page
 static void update_page_reclaim_stat(struct zone *zone, struct page *page,
 				     int file, int rotated)
 {
-	struct zone_reclaim_stat *reclaim_stat = &zone->reclaim_stat;
-	struct zone_reclaim_stat *memcg_reclaim_stat;
+	struct zone_reclaim_stat *reclaim_stat;
 
-	memcg_reclaim_stat = mem_cgroup_get_reclaim_stat_from_page(page);
+	reclaim_stat = mem_cgroup_get_reclaim_stat_from_page(page);
+	if (!reclaim_stat)
+		reclaim_stat = &zone->lruvec.reclaim_stat;
 
 	reclaim_stat->recent_scanned[file]++;
 	if (rotated)
 		reclaim_stat->recent_rotated[file]++;
-
-	if (!memcg_reclaim_stat)
-		return;
-
-	memcg_reclaim_stat->recent_scanned[file]++;
-	if (rotated)
-		memcg_reclaim_stat->recent_rotated[file]++;
 }
 
 static void __activate_page(struct page *page, void *arg)
--- mmotm.orig/mm/vmscan.c	2012-02-18 11:57:09.719523809 -0800
+++ mmotm/mm/vmscan.c	2012-02-18 11:57:20.395524062 -0800
@@ -173,10 +173,7 @@ static bool global_reclaim(struct scan_c
 
 static struct zone_reclaim_stat *get_reclaim_stat(struct mem_cgroup_zone *mz)
 {
-	if (!mem_cgroup_disabled())
-		return mem_cgroup_get_reclaim_stat(mz->mem_cgroup, mz->zone);
-
-	return &mz->zone->reclaim_stat;
+	return &mem_cgroup_zone_lruvec(mz->zone, mz->mem_cgroup)->reclaim_stat;
 }
 
 static unsigned long zone_nr_lru_pages(struct mem_cgroup_zone *mz,

* [PATCH 3/10] mm/memcg: add zone pointer into lruvec
  2012-02-20 23:26 ` Hugh Dickins
@ 2012-02-20 23:30   ` Hugh Dickins
  -1 siblings, 0 replies; 72+ messages in thread
From: Hugh Dickins @ 2012-02-20 23:30 UTC
  To: Andrew Morton
  Cc: Konstantin Khlebnikov, KAMEZAWA Hiroyuki, Johannes Weiner,
	Ying Han, linux-mm, linux-kernel

The lruvec is looking rather useful: if we just add a zone pointer
into the lruvec, then we can pass the lruvec pointer around and save
some superfluous arguments and recomputations in various places.

Just occasionally we do want mem_cgroup_from_lruvec() to get back from
lruvec to memcg; but with that we can remove all uses of vmscan.c's
private struct mem_cgroup_zone *mz, passing the lruvec pointer instead.

And while we're there, get_scan_count() can call vmscan_swappiness()
once, instead of twice in a row.

Signed-off-by: Hugh Dickins <hughd@google.com>
---
 include/linux/memcontrol.h |   23 ++-
 include/linux/mmzone.h     |    1 
 mm/memcontrol.c            |   47 ++++----
 mm/page_alloc.c            |    1 
 mm/vmscan.c                |  203 +++++++++++++++--------------------
 5 files changed, 128 insertions(+), 147 deletions(-)

--- mmotm.orig/include/linux/memcontrol.h	2012-02-18 11:57:20.391524062 -0800
+++ mmotm/include/linux/memcontrol.h	2012-02-18 11:57:28.371524252 -0800
@@ -63,6 +63,7 @@ extern int mem_cgroup_cache_charge(struc
 					gfp_t gfp_mask);
 
 struct lruvec *mem_cgroup_zone_lruvec(struct zone *, struct mem_cgroup *);
+extern struct mem_cgroup *mem_cgroup_from_lruvec(struct lruvec *lruvec);
 struct lruvec *mem_cgroup_lru_add_list(struct zone *, struct page *,
 				       enum lru_list);
 void mem_cgroup_lru_del_list(struct page *, enum lru_list);
@@ -113,13 +114,11 @@ void mem_cgroup_iter_break(struct mem_cg
 /*
  * For memory reclaim.
  */
-int mem_cgroup_inactive_anon_is_low(struct mem_cgroup *memcg,
-				    struct zone *zone);
-int mem_cgroup_inactive_file_is_low(struct mem_cgroup *memcg,
-				    struct zone *zone);
+int mem_cgroup_inactive_anon_is_low(struct lruvec *lruvec);
+int mem_cgroup_inactive_file_is_low(struct lruvec *lruvec);
 int mem_cgroup_select_victim_node(struct mem_cgroup *memcg);
-unsigned long mem_cgroup_zone_nr_lru_pages(struct mem_cgroup *memcg,
-					int nid, int zid, unsigned int lrumask);
+unsigned long mem_cgroup_zone_nr_lru_pages(struct lruvec *lruvec,
+					   unsigned int lrumask);
 struct zone_reclaim_stat*
 mem_cgroup_get_reclaim_stat_from_page(struct page *page);
 extern void mem_cgroup_print_oom_info(struct mem_cgroup *memcg,
@@ -249,6 +248,11 @@ static inline struct lruvec *mem_cgroup_
 	return &zone->lruvec;
 }
 
+static inline struct mem_cgroup *mem_cgroup_from_lruvec(struct lruvec *lruvec)
+{
+	return NULL;
+}
+
 static inline struct lruvec *mem_cgroup_lru_add_list(struct zone *zone,
 						     struct page *page,
 						     enum lru_list lru)
@@ -331,20 +335,19 @@ static inline bool mem_cgroup_disabled(v
 }
 
 static inline int
-mem_cgroup_inactive_anon_is_low(struct mem_cgroup *memcg, struct zone *zone)
+mem_cgroup_inactive_anon_is_low(struct lruvec *lruvec)
 {
 	return 1;
 }
 
 static inline int
-mem_cgroup_inactive_file_is_low(struct mem_cgroup *memcg, struct zone *zone)
+mem_cgroup_inactive_file_is_low(struct lruvec *lruvec)
 {
 	return 1;
 }
 
 static inline unsigned long
-mem_cgroup_zone_nr_lru_pages(struct mem_cgroup *memcg, int nid, int zid,
-				unsigned int lru_mask)
+mem_cgroup_zone_nr_lru_pages(struct lruvec *lruvec, unsigned int lru_mask)
 {
 	return 0;
 }
--- mmotm.orig/include/linux/mmzone.h	2012-02-18 11:57:20.391524062 -0800
+++ mmotm/include/linux/mmzone.h	2012-02-18 11:57:28.371524252 -0800
@@ -173,6 +173,7 @@ struct zone_reclaim_stat {
 };
 
 struct lruvec {
+	struct zone *zone;
 	struct list_head lists[NR_LRU_LISTS];
 	struct zone_reclaim_stat reclaim_stat;
 };
--- mmotm.orig/mm/memcontrol.c	2012-02-18 11:57:20.391524062 -0800
+++ mmotm/mm/memcontrol.c	2012-02-18 11:57:28.371524252 -0800
@@ -703,14 +703,13 @@ static void mem_cgroup_charge_statistics
 }
 
 unsigned long
-mem_cgroup_zone_nr_lru_pages(struct mem_cgroup *memcg, int nid, int zid,
-			unsigned int lru_mask)
+mem_cgroup_zone_nr_lru_pages(struct lruvec *lruvec, unsigned int lru_mask)
 {
 	struct mem_cgroup_per_zone *mz;
 	enum lru_list lru;
 	unsigned long ret = 0;
 
-	mz = mem_cgroup_zoneinfo(memcg, nid, zid);
+	mz = container_of(lruvec, struct mem_cgroup_per_zone, lruvec);
 
 	for_each_lru(lru) {
 		if (BIT(lru) & lru_mask)
@@ -723,12 +722,14 @@ static unsigned long
 mem_cgroup_node_nr_lru_pages(struct mem_cgroup *memcg,
 			int nid, unsigned int lru_mask)
 {
+	struct mem_cgroup_per_zone *mz;
 	u64 total = 0;
 	int zid;
 
-	for (zid = 0; zid < MAX_NR_ZONES; zid++)
-		total += mem_cgroup_zone_nr_lru_pages(memcg,
-						nid, zid, lru_mask);
+	for (zid = 0; zid < MAX_NR_ZONES; zid++) {
+		mz = mem_cgroup_zoneinfo(memcg, nid, zid);
+		total += mem_cgroup_zone_nr_lru_pages(&mz->lruvec, lru_mask);
+	}
 
 	return total;
 }
@@ -1003,13 +1004,24 @@ struct lruvec *mem_cgroup_zone_lruvec(st
 {
 	struct mem_cgroup_per_zone *mz;
 
-	if (mem_cgroup_disabled())
+	if (!memcg || mem_cgroup_disabled())
 		return &zone->lruvec;
 
 	mz = mem_cgroup_zoneinfo(memcg, zone_to_nid(zone), zone_idx(zone));
 	return &mz->lruvec;
 }
 
+struct mem_cgroup *mem_cgroup_from_lruvec(struct lruvec *lruvec)
+{
+	struct mem_cgroup_per_zone *mz;
+
+	if (mem_cgroup_disabled())
+		return NULL;
+
+	mz = container_of(lruvec, struct mem_cgroup_per_zone, lruvec);
+	return mz->memcg;
+}
+
 /*
  * Following LRU functions are allowed to be used without PCG_LOCK.
  * Operations are called by routine of global LRU independently from memcg.
@@ -1161,19 +1173,15 @@ int task_in_mem_cgroup(struct task_struc
 	return ret;
 }
 
-int mem_cgroup_inactive_anon_is_low(struct mem_cgroup *memcg, struct zone *zone)
+int mem_cgroup_inactive_anon_is_low(struct lruvec *lruvec)
 {
 	unsigned long inactive_ratio;
-	int nid = zone_to_nid(zone);
-	int zid = zone_idx(zone);
 	unsigned long inactive;
 	unsigned long active;
 	unsigned long gb;
 
-	inactive = mem_cgroup_zone_nr_lru_pages(memcg, nid, zid,
-						BIT(LRU_INACTIVE_ANON));
-	active = mem_cgroup_zone_nr_lru_pages(memcg, nid, zid,
-					      BIT(LRU_ACTIVE_ANON));
+	inactive = mem_cgroup_zone_nr_lru_pages(lruvec, BIT(LRU_INACTIVE_ANON));
+	active = mem_cgroup_zone_nr_lru_pages(lruvec, BIT(LRU_ACTIVE_ANON));
 
 	gb = (inactive + active) >> (30 - PAGE_SHIFT);
 	if (gb)
@@ -1184,17 +1192,13 @@ int mem_cgroup_inactive_anon_is_low(stru
 	return inactive * inactive_ratio < active;
 }
 
-int mem_cgroup_inactive_file_is_low(struct mem_cgroup *memcg, struct zone *zone)
+int mem_cgroup_inactive_file_is_low(struct lruvec *lruvec)
 {
 	unsigned long active;
 	unsigned long inactive;
-	int zid = zone_idx(zone);
-	int nid = zone_to_nid(zone);
 
-	inactive = mem_cgroup_zone_nr_lru_pages(memcg, nid, zid,
-						BIT(LRU_INACTIVE_FILE));
-	active = mem_cgroup_zone_nr_lru_pages(memcg, nid, zid,
-					      BIT(LRU_ACTIVE_FILE));
+	inactive = mem_cgroup_zone_nr_lru_pages(lruvec, BIT(LRU_INACTIVE_FILE));
+	active = mem_cgroup_zone_nr_lru_pages(lruvec, BIT(LRU_ACTIVE_FILE));
 
 	return (active > inactive);
 }
@@ -4755,6 +4759,7 @@ static int alloc_mem_cgroup_per_zone_inf
 
 	for (zone = 0; zone < MAX_NR_ZONES; zone++) {
 		mz = &pn->zoneinfo[zone];
+		mz->lruvec.zone = &NODE_DATA(node)->node_zones[zone];
 		for_each_lru(lru)
 			INIT_LIST_HEAD(&mz->lruvec.lists[lru]);
 		mz->usage_in_excess = 0;
--- mmotm.orig/mm/page_alloc.c	2012-02-18 11:57:20.395524062 -0800
+++ mmotm/mm/page_alloc.c	2012-02-18 11:57:28.375524252 -0800
@@ -4365,6 +4365,7 @@ static void __paginginit free_area_init_
 		zone->zone_pgdat = pgdat;
 
 		zone_pcp_init(zone);
+		zone->lruvec.zone = zone;
 		for_each_lru(lru)
 			INIT_LIST_HEAD(&zone->lruvec.lists[lru]);
 		zone->lruvec.reclaim_stat.recent_rotated[0] = 0;
--- mmotm.orig/mm/vmscan.c	2012-02-18 11:57:20.395524062 -0800
+++ mmotm/mm/vmscan.c	2012-02-18 11:57:28.375524252 -0800
@@ -115,11 +115,6 @@ struct scan_control {
 	nodemask_t	*nodemask;
 };
 
-struct mem_cgroup_zone {
-	struct mem_cgroup *mem_cgroup;
-	struct zone *zone;
-};
-
 #define lru_to_page(_head) (list_entry((_head)->prev, struct page, lru))
 
 #ifdef ARCH_HAS_PREFETCH
@@ -171,21 +166,12 @@ static bool global_reclaim(struct scan_c
 }
 #endif
 
-static struct zone_reclaim_stat *get_reclaim_stat(struct mem_cgroup_zone *mz)
-{
-	return &mem_cgroup_zone_lruvec(mz->zone, mz->mem_cgroup)->reclaim_stat;
-}
-
-static unsigned long zone_nr_lru_pages(struct mem_cgroup_zone *mz,
-				       enum lru_list lru)
+static unsigned long zone_nr_lru_pages(struct lruvec *lruvec, enum lru_list lru)
 {
 	if (!mem_cgroup_disabled())
-		return mem_cgroup_zone_nr_lru_pages(mz->mem_cgroup,
-						    zone_to_nid(mz->zone),
-						    zone_idx(mz->zone),
-						    BIT(lru));
+		return mem_cgroup_zone_nr_lru_pages(lruvec, BIT(lru));
 
-	return zone_page_state(mz->zone, NR_LRU_BASE + lru);
+	return zone_page_state(lruvec->zone, NR_LRU_BASE + lru);
 }
 
 
@@ -688,13 +674,13 @@ enum page_references {
 };
 
 static enum page_references page_check_references(struct page *page,
-						  struct mem_cgroup_zone *mz,
+						  struct mem_cgroup *memcg,
 						  struct scan_control *sc)
 {
 	int referenced_ptes, referenced_page;
 	unsigned long vm_flags;
 
-	referenced_ptes = page_referenced(page, 1, mz->mem_cgroup, &vm_flags);
+	referenced_ptes = page_referenced(page, 1, memcg, &vm_flags);
 	referenced_page = TestClearPageReferenced(page);
 
 	/* Lumpy reclaim - ignore references */
@@ -750,12 +736,13 @@ static enum page_references page_check_r
  * shrink_page_list() returns the number of reclaimed pages
  */
 static unsigned long shrink_page_list(struct list_head *page_list,
-				      struct mem_cgroup_zone *mz,
+				      struct lruvec *lruvec,
 				      struct scan_control *sc,
 				      int priority,
 				      unsigned long *ret_nr_dirty,
 				      unsigned long *ret_nr_writeback)
 {
+	struct mem_cgroup *memcg = mem_cgroup_from_lruvec(lruvec);
 	LIST_HEAD(ret_pages);
 	LIST_HEAD(free_pages);
 	int pgactivate = 0;
@@ -781,7 +768,7 @@ static unsigned long shrink_page_list(st
 			goto keep;
 
 		VM_BUG_ON(PageActive(page));
-		VM_BUG_ON(page_zone(page) != mz->zone);
+		VM_BUG_ON(page_zone(page) != lruvec->zone);
 
 		sc->nr_scanned++;
 
@@ -815,7 +802,7 @@ static unsigned long shrink_page_list(st
 			}
 		}
 
-		references = page_check_references(page, mz, sc);
+		references = page_check_references(page, memcg, sc);
 		switch (references) {
 		case PAGEREF_ACTIVATE:
 			goto activate_locked;
@@ -1007,7 +994,7 @@ keep_lumpy:
 	 * will encounter the same problem
 	 */
 	if (nr_dirty && nr_dirty == nr_congested && global_reclaim(sc))
-		zone_set_flag(mz->zone, ZONE_CONGESTED);
+		zone_set_flag(lruvec->zone, ZONE_CONGESTED);
 
 	free_hot_cold_page_list(&free_pages, 1);
 
@@ -1122,7 +1109,7 @@ int __isolate_lru_page(struct page *page
  * Appropriate locks must be held before calling this function.
  *
  * @nr_to_scan:	The number of pages to look through on the list.
- * @mz:		The mem_cgroup_zone to pull pages from.
+ * @lruvec:	The mem_cgroup/zone lruvec to pull pages from.
  * @dst:	The temp list to put pages on to.
  * @nr_scanned:	The number of pages that were scanned.
  * @sc:		The scan_control struct for this reclaim session
@@ -1133,11 +1120,10 @@ int __isolate_lru_page(struct page *page
  * returns how many pages were moved onto *@dst.
  */
 static unsigned long isolate_lru_pages(unsigned long nr_to_scan,
-		struct mem_cgroup_zone *mz, struct list_head *dst,
+		struct lruvec *lruvec, struct list_head *dst,
 		unsigned long *nr_scanned, struct scan_control *sc,
 		isolate_mode_t mode, int active, int file)
 {
-	struct lruvec *lruvec;
 	struct list_head *src;
 	unsigned long nr_taken = 0;
 	unsigned long nr_lumpy_taken = 0;
@@ -1146,7 +1132,6 @@ static unsigned long isolate_lru_pages(u
 	unsigned long scan;
 	int lru = LRU_BASE;
 
-	lruvec = mem_cgroup_zone_lruvec(mz->zone, mz->mem_cgroup);
 	if (active)
 		lru += LRU_ACTIVE;
 	if (file)
@@ -1344,11 +1329,10 @@ static int too_many_isolated(struct zone
 }
 
 static noinline_for_stack void
-putback_inactive_pages(struct mem_cgroup_zone *mz,
-		       struct list_head *page_list)
+putback_inactive_pages(struct lruvec *lruvec, struct list_head *page_list)
 {
-	struct zone_reclaim_stat *reclaim_stat = get_reclaim_stat(mz);
-	struct zone *zone = mz->zone;
+	struct zone_reclaim_stat *reclaim_stat = &lruvec->reclaim_stat;
+	struct zone *zone = lruvec->zone;
 	LIST_HEAD(pages_to_free);
 
 	/*
@@ -1395,12 +1379,9 @@ putback_inactive_pages(struct mem_cgroup
 }
 
 static noinline_for_stack void
-update_isolated_counts(struct mem_cgroup_zone *mz,
-		       struct list_head *page_list,
-		       unsigned long *nr_anon,
-		       unsigned long *nr_file)
+update_isolated_counts(struct zone *zone, struct list_head *page_list,
+		       unsigned long *nr_anon, unsigned long *nr_file)
 {
-	struct zone *zone = mz->zone;
 	unsigned int count[NR_LRU_LISTS] = { 0, };
 	unsigned long nr_active = 0;
 	struct page *page;
@@ -1486,9 +1467,11 @@ static inline bool should_reclaim_stall(
  * of reclaimed pages
  */
 static noinline_for_stack unsigned long
-shrink_inactive_list(unsigned long nr_to_scan, struct mem_cgroup_zone *mz,
+shrink_inactive_list(unsigned long nr_to_scan, struct lruvec *lruvec,
 		     struct scan_control *sc, int priority, int file)
 {
+	struct zone_reclaim_stat *reclaim_stat = &lruvec->reclaim_stat;
+	struct zone *zone = lruvec->zone;
 	LIST_HEAD(page_list);
 	unsigned long nr_scanned;
 	unsigned long nr_reclaimed = 0;
@@ -1498,8 +1481,6 @@ shrink_inactive_list(unsigned long nr_to
 	unsigned long nr_dirty = 0;
 	unsigned long nr_writeback = 0;
 	isolate_mode_t isolate_mode = ISOLATE_INACTIVE;
-	struct zone *zone = mz->zone;
-	struct zone_reclaim_stat *reclaim_stat = get_reclaim_stat(mz);
 
 	while (unlikely(too_many_isolated(zone, file, sc))) {
 		congestion_wait(BLK_RW_ASYNC, HZ/10);
@@ -1522,31 +1503,29 @@ shrink_inactive_list(unsigned long nr_to
 
 	spin_lock_irq(&zone->lru_lock);
 
-	nr_taken = isolate_lru_pages(nr_to_scan, mz, &page_list, &nr_scanned,
-				     sc, isolate_mode, 0, file);
+	nr_taken = isolate_lru_pages(nr_to_scan, lruvec, &page_list,
+				     &nr_scanned, sc, isolate_mode, 0, file);
 	if (global_reclaim(sc)) {
 		zone->pages_scanned += nr_scanned;
 		if (current_is_kswapd())
-			__count_zone_vm_events(PGSCAN_KSWAPD, zone,
-					       nr_scanned);
+			__count_zone_vm_events(PGSCAN_KSWAPD, zone, nr_scanned);
 		else
-			__count_zone_vm_events(PGSCAN_DIRECT, zone,
-					       nr_scanned);
+			__count_zone_vm_events(PGSCAN_DIRECT, zone, nr_scanned);
 	}
 	spin_unlock_irq(&zone->lru_lock);
 
 	if (nr_taken == 0)
 		return 0;
 
-	update_isolated_counts(mz, &page_list, &nr_anon, &nr_file);
+	update_isolated_counts(zone, &page_list, &nr_anon, &nr_file);
 
-	nr_reclaimed = shrink_page_list(&page_list, mz, sc, priority,
+	nr_reclaimed = shrink_page_list(&page_list, lruvec, sc, priority,
 						&nr_dirty, &nr_writeback);
 
 	/* Check if we should syncronously wait for writeback */
 	if (should_reclaim_stall(nr_taken, nr_reclaimed, priority, sc)) {
 		set_reclaim_mode(priority, sc, true);
-		nr_reclaimed += shrink_page_list(&page_list, mz, sc,
+		nr_reclaimed += shrink_page_list(&page_list, lruvec, sc,
 					priority, &nr_dirty, &nr_writeback);
 	}
 
@@ -1559,7 +1538,7 @@ shrink_inactive_list(unsigned long nr_to
 		__count_vm_events(KSWAPD_STEAL, nr_reclaimed);
 	__count_zone_vm_events(PGSTEAL, zone, nr_reclaimed);
 
-	putback_inactive_pages(mz, &page_list);
+	putback_inactive_pages(lruvec, &page_list);
 
 	__mod_zone_page_state(zone, NR_ISOLATED_ANON, -nr_anon);
 	__mod_zone_page_state(zone, NR_ISOLATED_FILE, -nr_file);
@@ -1659,10 +1638,13 @@ static void move_active_pages_to_lru(str
 }
 
 static void shrink_active_list(unsigned long nr_to_scan,
-			       struct mem_cgroup_zone *mz,
+			       struct lruvec *lruvec,
 			       struct scan_control *sc,
 			       int priority, int file)
 {
+	struct mem_cgroup *memcg = mem_cgroup_from_lruvec(lruvec);
+	struct zone_reclaim_stat *reclaim_stat = &lruvec->reclaim_stat;
+	struct zone *zone = lruvec->zone;
 	unsigned long nr_taken;
 	unsigned long nr_scanned;
 	unsigned long vm_flags;
@@ -1670,10 +1652,8 @@ static void shrink_active_list(unsigned
 	LIST_HEAD(l_active);
 	LIST_HEAD(l_inactive);
 	struct page *page;
-	struct zone_reclaim_stat *reclaim_stat = get_reclaim_stat(mz);
 	unsigned long nr_rotated = 0;
 	isolate_mode_t isolate_mode = ISOLATE_ACTIVE;
-	struct zone *zone = mz->zone;
 
 	lru_add_drain();
 
@@ -1684,8 +1664,8 @@ static void shrink_active_list(unsigned
 
 	spin_lock_irq(&zone->lru_lock);
 
-	nr_taken = isolate_lru_pages(nr_to_scan, mz, &l_hold, &nr_scanned, sc,
-				     isolate_mode, 1, file);
+	nr_taken = isolate_lru_pages(nr_to_scan, lruvec, &l_hold,
+				     &nr_scanned, sc, isolate_mode, 1, file);
 	if (global_reclaim(sc))
 		zone->pages_scanned += nr_scanned;
 
@@ -1717,7 +1697,7 @@ static void shrink_active_list(unsigned
 			}
 		}
 
-		if (page_referenced(page, 0, mz->mem_cgroup, &vm_flags)) {
+		if (page_referenced(page, 0, memcg, &vm_flags)) {
 			nr_rotated += hpage_nr_pages(page);
 			/*
 			 * Identify referenced, file-backed active pages and
@@ -1776,13 +1756,12 @@ static int inactive_anon_is_low_global(s
 
 /**
  * inactive_anon_is_low - check if anonymous pages need to be deactivated
- * @zone: zone to check
- * @sc:   scan control of this context
+ * @lruvec: The mem_cgroup/zone lruvec to check
  *
  * Returns true if the zone does not have enough inactive anon pages,
  * meaning some active anon pages need to be deactivated.
  */
-static int inactive_anon_is_low(struct mem_cgroup_zone *mz)
+static int inactive_anon_is_low(struct lruvec *lruvec)
 {
 	/*
 	 * If we don't have swap space, anonymous page deactivation
@@ -1792,13 +1771,12 @@ static int inactive_anon_is_low(struct m
 		return 0;
 
 	if (!mem_cgroup_disabled())
-		return mem_cgroup_inactive_anon_is_low(mz->mem_cgroup,
-						       mz->zone);
+		return mem_cgroup_inactive_anon_is_low(lruvec);
 
-	return inactive_anon_is_low_global(mz->zone);
+	return inactive_anon_is_low_global(lruvec->zone);
 }
 #else
-static inline int inactive_anon_is_low(struct mem_cgroup_zone *mz)
+static inline int inactive_anon_is_low(struct lruvec *lruvec)
 {
 	return 0;
 }
@@ -1816,7 +1794,7 @@ static int inactive_file_is_low_global(s
 
 /**
  * inactive_file_is_low - check if file pages need to be deactivated
- * @mz: memory cgroup and zone to check
+ * @lruvec: The mem_cgroup/zone lruvec to check
  *
  * When the system is doing streaming IO, memory pressure here
  * ensures that active file pages get deactivated, until more
@@ -1828,44 +1806,44 @@ static int inactive_file_is_low_global(s
  * This uses a different ratio than the anonymous pages, because
  * the page cache uses a use-once replacement algorithm.
  */
-static int inactive_file_is_low(struct mem_cgroup_zone *mz)
+static int inactive_file_is_low(struct lruvec *lruvec)
 {
 	if (!mem_cgroup_disabled())
-		return mem_cgroup_inactive_file_is_low(mz->mem_cgroup,
-						       mz->zone);
+		return mem_cgroup_inactive_file_is_low(lruvec);
 
-	return inactive_file_is_low_global(mz->zone);
+	return inactive_file_is_low_global(lruvec->zone);
 }
 
-static int inactive_list_is_low(struct mem_cgroup_zone *mz, int file)
+static int inactive_list_is_low(struct lruvec *lruvec, int file)
 {
 	if (file)
-		return inactive_file_is_low(mz);
+		return inactive_file_is_low(lruvec);
 	else
-		return inactive_anon_is_low(mz);
+		return inactive_anon_is_low(lruvec);
 }
 
 static unsigned long shrink_list(enum lru_list lru, unsigned long nr_to_scan,
-				 struct mem_cgroup_zone *mz,
+				 struct lruvec *lruvec,
 				 struct scan_control *sc, int priority)
 {
 	int file = is_file_lru(lru);
 
 	if (is_active_lru(lru)) {
-		if (inactive_list_is_low(mz, file))
-			shrink_active_list(nr_to_scan, mz, sc, priority, file);
+		if (inactive_list_is_low(lruvec, file))
+			shrink_active_list(nr_to_scan, lruvec, sc, priority,
+									file);
 		return 0;
 	}
 
-	return shrink_inactive_list(nr_to_scan, mz, sc, priority, file);
+	return shrink_inactive_list(nr_to_scan, lruvec, sc, priority, file);
 }
 
-static int vmscan_swappiness(struct mem_cgroup_zone *mz,
+static int vmscan_swappiness(struct lruvec *lruvec,
 			     struct scan_control *sc)
 {
 	if (global_reclaim(sc))
 		return vm_swappiness;
-	return mem_cgroup_swappiness(mz->mem_cgroup);
+	return mem_cgroup_swappiness(mem_cgroup_from_lruvec(lruvec));
 }
 
 /*
@@ -1876,13 +1854,14 @@ static int vmscan_swappiness(struct mem_
  *
  * nr[0] = anon pages to scan; nr[1] = file pages to scan
  */
-static void get_scan_count(struct mem_cgroup_zone *mz, struct scan_control *sc,
+static void get_scan_count(struct lruvec *lruvec, struct scan_control *sc,
 			   unsigned long *nr, int priority)
 {
+	struct zone_reclaim_stat *reclaim_stat = &lruvec->reclaim_stat;
+	struct zone *zone = lruvec->zone;
 	unsigned long anon, file, free;
 	unsigned long anon_prio, file_prio;
 	unsigned long ap, fp;
-	struct zone_reclaim_stat *reclaim_stat = get_reclaim_stat(mz);
 	u64 fraction[2], denominator;
 	enum lru_list lru;
 	int noswap = 0;
@@ -1898,7 +1877,7 @@ static void get_scan_count(struct mem_cg
 	 * latencies, so it's better to scan a minimum amount there as
 	 * well.
 	 */
-	if (current_is_kswapd() && mz->zone->all_unreclaimable)
+	if (current_is_kswapd() && zone->all_unreclaimable)
 		force_scan = true;
 	if (!global_reclaim(sc))
 		force_scan = true;
@@ -1912,16 +1891,16 @@ static void get_scan_count(struct mem_cg
 		goto out;
 	}
 
-	anon  = zone_nr_lru_pages(mz, LRU_ACTIVE_ANON) +
-		zone_nr_lru_pages(mz, LRU_INACTIVE_ANON);
-	file  = zone_nr_lru_pages(mz, LRU_ACTIVE_FILE) +
-		zone_nr_lru_pages(mz, LRU_INACTIVE_FILE);
+	anon  = zone_nr_lru_pages(lruvec, LRU_ACTIVE_ANON) +
+		zone_nr_lru_pages(lruvec, LRU_INACTIVE_ANON);
+	file  = zone_nr_lru_pages(lruvec, LRU_ACTIVE_FILE) +
+		zone_nr_lru_pages(lruvec, LRU_INACTIVE_FILE);
 
 	if (global_reclaim(sc)) {
-		free  = zone_page_state(mz->zone, NR_FREE_PAGES);
+		free  = zone_page_state(zone, NR_FREE_PAGES);
 		/* If we have very few page cache pages,
 		   force-scan anon pages. */
-		if (unlikely(file + free <= high_wmark_pages(mz->zone))) {
+		if (unlikely(file + free <= high_wmark_pages(zone))) {
 			fraction[0] = 1;
 			fraction[1] = 0;
 			denominator = 1;
@@ -1933,8 +1912,8 @@ static void get_scan_count(struct mem_cg
 	 * With swappiness at 100, anonymous and file have the same priority.
 	 * This scanning priority is essentially the inverse of IO cost.
 	 */
-	anon_prio = vmscan_swappiness(mz, sc);
-	file_prio = 200 - vmscan_swappiness(mz, sc);
+	anon_prio = vmscan_swappiness(lruvec, sc);
+	file_prio = 200 - anon_prio;
 
 	/*
 	 * OK, so we have swap space and a fair amount of page cache
@@ -1947,7 +1926,7 @@ static void get_scan_count(struct mem_cg
 	 *
 	 * anon in [0], file in [1]
 	 */
-	spin_lock_irq(&mz->zone->lru_lock);
+	spin_lock_irq(&zone->lru_lock);
 	if (unlikely(reclaim_stat->recent_scanned[0] > anon / 4)) {
 		reclaim_stat->recent_scanned[0] /= 2;
 		reclaim_stat->recent_rotated[0] /= 2;
@@ -1968,7 +1947,7 @@ static void get_scan_count(struct mem_cg
 
 	fp = (file_prio + 1) * (reclaim_stat->recent_scanned[1] + 1);
 	fp /= reclaim_stat->recent_rotated[1] + 1;
-	spin_unlock_irq(&mz->zone->lru_lock);
+	spin_unlock_irq(&zone->lru_lock);
 
 	fraction[0] = ap;
 	fraction[1] = fp;
@@ -1978,7 +1957,7 @@ out:
 		int file = is_file_lru(lru);
 		unsigned long scan;
 
-		scan = zone_nr_lru_pages(mz, lru);
+		scan = zone_nr_lru_pages(lruvec, lru);
 		if (priority || noswap) {
 			scan >>= priority;
 			if (!scan && force_scan)
@@ -1996,7 +1975,7 @@ out:
  * back to the allocator and call try_to_compact_zone(), we ensure that
  * there are enough free pages for it to be likely successful
  */
-static inline bool should_continue_reclaim(struct mem_cgroup_zone *mz,
+static inline bool should_continue_reclaim(struct lruvec *lruvec,
 					unsigned long nr_reclaimed,
 					unsigned long nr_scanned,
 					struct scan_control *sc)
@@ -2036,15 +2015,16 @@ static inline bool should_continue_recla
 	 * inactive lists are large enough, continue reclaiming
 	 */
 	pages_for_compaction = (2UL << sc->order);
-	inactive_lru_pages = zone_nr_lru_pages(mz, LRU_INACTIVE_FILE);
+	inactive_lru_pages = zone_nr_lru_pages(lruvec, LRU_INACTIVE_FILE);
 	if (nr_swap_pages > 0)
-		inactive_lru_pages += zone_nr_lru_pages(mz, LRU_INACTIVE_ANON);
+		inactive_lru_pages += zone_nr_lru_pages(lruvec,
+							LRU_INACTIVE_ANON);
 	if (sc->nr_reclaimed < pages_for_compaction &&
 			inactive_lru_pages > pages_for_compaction)
 		return true;
 
 	/* If compaction would go ahead or the allocation would succeed, stop */
-	switch (compaction_suitable(mz->zone, sc->order)) {
+	switch (compaction_suitable(lruvec->zone, sc->order)) {
 	case COMPACT_PARTIAL:
 	case COMPACT_CONTINUE:
 		return false;
@@ -2056,7 +2036,7 @@ static inline bool should_continue_recla
 /*
  * This is a basic per-zone page freer.  Used by both kswapd and direct reclaim.
  */
-static void shrink_mem_cgroup_zone(int priority, struct mem_cgroup_zone *mz,
+static void shrink_mem_cgroup_zone(int priority, struct lruvec *lruvec,
 				   struct scan_control *sc)
 {
 	unsigned long nr[NR_LRU_LISTS];
@@ -2069,7 +2049,7 @@ static void shrink_mem_cgroup_zone(int p
 restart:
 	nr_reclaimed = 0;
 	nr_scanned = sc->nr_scanned;
-	get_scan_count(mz, sc, nr, priority);
+	get_scan_count(lruvec, sc, nr, priority);
 
 	blk_start_plug(&plug);
 	while (nr[LRU_INACTIVE_ANON] || nr[LRU_ACTIVE_FILE] ||
@@ -2081,7 +2061,7 @@ restart:
 				nr[lru] -= nr_to_scan;
 
 				nr_reclaimed += shrink_list(lru, nr_to_scan,
-							    mz, sc, priority);
+							lruvec, sc, priority);
 			}
 		}
 		/*
@@ -2107,11 +2087,11 @@ restart:
 	 * Even if we did not try to evict anon pages at all, we want to
 	 * rebalance the anon lru active/inactive ratio.
 	 */
-	if (inactive_anon_is_low(mz))
-		shrink_active_list(SWAP_CLUSTER_MAX, mz, sc, priority, 0);
+	if (inactive_anon_is_low(lruvec))
+		shrink_active_list(SWAP_CLUSTER_MAX, lruvec, sc, priority, 0);
 
 	/* reclaim/compaction might need reclaim to continue */
-	if (should_continue_reclaim(mz, nr_reclaimed,
+	if (should_continue_reclaim(lruvec, nr_reclaimed,
 					sc->nr_scanned - nr_scanned, sc))
 		goto restart;
 
@@ -2130,12 +2110,9 @@ static void shrink_zone(int priority, st
 
 	memcg = mem_cgroup_iter(root, NULL, &reclaim);
 	do {
-		struct mem_cgroup_zone mz = {
-			.mem_cgroup = memcg,
-			.zone = zone,
-		};
+		struct lruvec *lruvec = mem_cgroup_zone_lruvec(zone, memcg);
 
-		shrink_mem_cgroup_zone(priority, &mz, sc);
+		shrink_mem_cgroup_zone(priority, lruvec, sc);
 		/*
 		 * Limit reclaim has historically picked one memcg and
 		 * scanned it with decreasing priority levels until
@@ -2463,10 +2440,7 @@ unsigned long mem_cgroup_shrink_node_zon
 		.order = 0,
 		.target_mem_cgroup = memcg,
 	};
-	struct mem_cgroup_zone mz = {
-		.mem_cgroup = memcg,
-		.zone = zone,
-	};
+	struct lruvec *lruvec = mem_cgroup_zone_lruvec(zone, memcg);
 
 	sc.gfp_mask = (gfp_mask & GFP_RECLAIM_MASK) |
 			(GFP_HIGHUSER_MOVABLE & ~GFP_RECLAIM_MASK);
@@ -2482,7 +2456,7 @@ unsigned long mem_cgroup_shrink_node_zon
 	 * will pick up pages from other mem cgroup's as well. We hack
 	 * the priority and make it zero.
 	 */
-	shrink_mem_cgroup_zone(0, &mz, &sc);
+	shrink_mem_cgroup_zone(0, lruvec, &sc);
 
 	trace_mm_vmscan_memcg_softlimit_reclaim_end(sc.nr_reclaimed);
 
@@ -2543,13 +2517,10 @@ static void age_active_anon(struct zone
 
 	memcg = mem_cgroup_iter(NULL, NULL, NULL);
 	do {
-		struct mem_cgroup_zone mz = {
-			.mem_cgroup = memcg,
-			.zone = zone,
-		};
+		struct lruvec *lruvec = mem_cgroup_zone_lruvec(zone, memcg);
 
-		if (inactive_anon_is_low(&mz))
-			shrink_active_list(SWAP_CLUSTER_MAX, &mz,
+		if (inactive_anon_is_low(lruvec))
+			shrink_active_list(SWAP_CLUSTER_MAX, lruvec,
 					   sc, priority, 0);
 
 		memcg = mem_cgroup_iter(NULL, memcg, NULL);


* [PATCH 3/10] mm/memcg: add zone pointer into lruvec
@ 2012-02-20 23:30   ` Hugh Dickins
  0 siblings, 0 replies; 72+ messages in thread
From: Hugh Dickins @ 2012-02-20 23:30 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Konstantin Khlebnikov, KAMEZAWA Hiroyuki, Johannes Weiner,
	Ying Han, linux-mm, linux-kernel

The lruvec is looking rather useful: if we just add a zone pointer
into the lruvec, then we can pass the lruvec pointer around and save
some superfluous arguments and recomputations in various places.

Just occasionally we do want mem_cgroup_from_lruvec() to get back from
lruvec to memcg; with that in place, we can remove all uses of vmscan.c's
private mem_cgroup_zone *mz, passing the lruvec pointer instead.

And while we're there, get_scan_count() can call vmscan_swappiness()
once, instead of twice in a row.
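
To illustrate the shape of the change (a sketch only, not part of the
patch, and assuming memcg is enabled and a real memcg was passed in):

	struct lruvec *lruvec = mem_cgroup_zone_lruvec(zone, memcg);

	/* the zone is recovered straight from the lruvec */
	struct zone *zone_again = lruvec->zone;

	/* and the memcg only where it is really wanted */
	struct mem_cgroup *memcg_again = mem_cgroup_from_lruvec(lruvec);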

Signed-off-by: Hugh Dickins <hughd@google.com>
---
 include/linux/memcontrol.h |   23 ++-
 include/linux/mmzone.h     |    1 
 mm/memcontrol.c            |   47 ++++----
 mm/page_alloc.c            |    1 
 mm/vmscan.c                |  203 +++++++++++++++--------------------
 5 files changed, 128 insertions(+), 147 deletions(-)

--- mmotm.orig/include/linux/memcontrol.h	2012-02-18 11:57:20.391524062 -0800
+++ mmotm/include/linux/memcontrol.h	2012-02-18 11:57:28.371524252 -0800
@@ -63,6 +63,7 @@ extern int mem_cgroup_cache_charge(struc
 					gfp_t gfp_mask);
 
 struct lruvec *mem_cgroup_zone_lruvec(struct zone *, struct mem_cgroup *);
+extern struct mem_cgroup *mem_cgroup_from_lruvec(struct lruvec *lruvec);
 struct lruvec *mem_cgroup_lru_add_list(struct zone *, struct page *,
 				       enum lru_list);
 void mem_cgroup_lru_del_list(struct page *, enum lru_list);
@@ -113,13 +114,11 @@ void mem_cgroup_iter_break(struct mem_cg
 /*
  * For memory reclaim.
  */
-int mem_cgroup_inactive_anon_is_low(struct mem_cgroup *memcg,
-				    struct zone *zone);
-int mem_cgroup_inactive_file_is_low(struct mem_cgroup *memcg,
-				    struct zone *zone);
+int mem_cgroup_inactive_anon_is_low(struct lruvec *lruvec);
+int mem_cgroup_inactive_file_is_low(struct lruvec *lruvec);
 int mem_cgroup_select_victim_node(struct mem_cgroup *memcg);
-unsigned long mem_cgroup_zone_nr_lru_pages(struct mem_cgroup *memcg,
-					int nid, int zid, unsigned int lrumask);
+unsigned long mem_cgroup_zone_nr_lru_pages(struct lruvec *lruvec,
+					   unsigned int lrumask);
 struct zone_reclaim_stat*
 mem_cgroup_get_reclaim_stat_from_page(struct page *page);
 extern void mem_cgroup_print_oom_info(struct mem_cgroup *memcg,
@@ -249,6 +248,11 @@ static inline struct lruvec *mem_cgroup_
 	return &zone->lruvec;
 }
 
+static inline struct mem_cgroup *mem_cgroup_from_lruvec(struct lruvec *lruvec)
+{
+	return NULL;
+}
+
 static inline struct lruvec *mem_cgroup_lru_add_list(struct zone *zone,
 						     struct page *page,
 						     enum lru_list lru)
@@ -331,20 +335,19 @@ static inline bool mem_cgroup_disabled(v
 }
 
 static inline int
-mem_cgroup_inactive_anon_is_low(struct mem_cgroup *memcg, struct zone *zone)
+mem_cgroup_inactive_anon_is_low(struct lruvec *lruvec)
 {
 	return 1;
 }
 
 static inline int
-mem_cgroup_inactive_file_is_low(struct mem_cgroup *memcg, struct zone *zone)
+mem_cgroup_inactive_file_is_low(struct lruvec *lruvec)
 {
 	return 1;
 }
 
 static inline unsigned long
-mem_cgroup_zone_nr_lru_pages(struct mem_cgroup *memcg, int nid, int zid,
-				unsigned int lru_mask)
+mem_cgroup_zone_nr_lru_pages(struct lruvec *lruvec, unsigned int lru_mask)
 {
 	return 0;
 }
--- mmotm.orig/include/linux/mmzone.h	2012-02-18 11:57:20.391524062 -0800
+++ mmotm/include/linux/mmzone.h	2012-02-18 11:57:28.371524252 -0800
@@ -173,6 +173,7 @@ struct zone_reclaim_stat {
 };
 
 struct lruvec {
+	struct zone *zone;
 	struct list_head lists[NR_LRU_LISTS];
 	struct zone_reclaim_stat reclaim_stat;
 };
--- mmotm.orig/mm/memcontrol.c	2012-02-18 11:57:20.391524062 -0800
+++ mmotm/mm/memcontrol.c	2012-02-18 11:57:28.371524252 -0800
@@ -703,14 +703,13 @@ static void mem_cgroup_charge_statistics
 }
 
 unsigned long
-mem_cgroup_zone_nr_lru_pages(struct mem_cgroup *memcg, int nid, int zid,
-			unsigned int lru_mask)
+mem_cgroup_zone_nr_lru_pages(struct lruvec *lruvec, unsigned int lru_mask)
 {
 	struct mem_cgroup_per_zone *mz;
 	enum lru_list lru;
 	unsigned long ret = 0;
 
-	mz = mem_cgroup_zoneinfo(memcg, nid, zid);
+	mz = container_of(lruvec, struct mem_cgroup_per_zone, lruvec);
 
 	for_each_lru(lru) {
 		if (BIT(lru) & lru_mask)
@@ -723,12 +722,14 @@ static unsigned long
 mem_cgroup_node_nr_lru_pages(struct mem_cgroup *memcg,
 			int nid, unsigned int lru_mask)
 {
+	struct mem_cgroup_per_zone *mz;
 	u64 total = 0;
 	int zid;
 
-	for (zid = 0; zid < MAX_NR_ZONES; zid++)
-		total += mem_cgroup_zone_nr_lru_pages(memcg,
-						nid, zid, lru_mask);
+	for (zid = 0; zid < MAX_NR_ZONES; zid++) {
+		mz = mem_cgroup_zoneinfo(memcg, nid, zid);
+		total += mem_cgroup_zone_nr_lru_pages(&mz->lruvec, lru_mask);
+	}
 
 	return total;
 }
@@ -1003,13 +1004,24 @@ struct lruvec *mem_cgroup_zone_lruvec(st
 {
 	struct mem_cgroup_per_zone *mz;
 
-	if (mem_cgroup_disabled())
+	if (!memcg || mem_cgroup_disabled())
 		return &zone->lruvec;
 
 	mz = mem_cgroup_zoneinfo(memcg, zone_to_nid(zone), zone_idx(zone));
 	return &mz->lruvec;
 }
 
+struct mem_cgroup *mem_cgroup_from_lruvec(struct lruvec *lruvec)
+{
+	struct mem_cgroup_per_zone *mz;
+
+	if (mem_cgroup_disabled())
+		return NULL;
+
+	mz = container_of(lruvec, struct mem_cgroup_per_zone, lruvec);
+	return mz->memcg;
+}
+
 /*
  * Following LRU functions are allowed to be used without PCG_LOCK.
  * Operations are called by routine of global LRU independently from memcg.
@@ -1161,19 +1173,15 @@ int task_in_mem_cgroup(struct task_struc
 	return ret;
 }
 
-int mem_cgroup_inactive_anon_is_low(struct mem_cgroup *memcg, struct zone *zone)
+int mem_cgroup_inactive_anon_is_low(struct lruvec *lruvec)
 {
 	unsigned long inactive_ratio;
-	int nid = zone_to_nid(zone);
-	int zid = zone_idx(zone);
 	unsigned long inactive;
 	unsigned long active;
 	unsigned long gb;
 
-	inactive = mem_cgroup_zone_nr_lru_pages(memcg, nid, zid,
-						BIT(LRU_INACTIVE_ANON));
-	active = mem_cgroup_zone_nr_lru_pages(memcg, nid, zid,
-					      BIT(LRU_ACTIVE_ANON));
+	inactive = mem_cgroup_zone_nr_lru_pages(lruvec, BIT(LRU_INACTIVE_ANON));
+	active = mem_cgroup_zone_nr_lru_pages(lruvec, BIT(LRU_ACTIVE_ANON));
 
 	gb = (inactive + active) >> (30 - PAGE_SHIFT);
 	if (gb)
@@ -1184,17 +1192,13 @@ int mem_cgroup_inactive_anon_is_low(stru
 	return inactive * inactive_ratio < active;
 }
 
-int mem_cgroup_inactive_file_is_low(struct mem_cgroup *memcg, struct zone *zone)
+int mem_cgroup_inactive_file_is_low(struct lruvec *lruvec)
 {
 	unsigned long active;
 	unsigned long inactive;
-	int zid = zone_idx(zone);
-	int nid = zone_to_nid(zone);
 
-	inactive = mem_cgroup_zone_nr_lru_pages(memcg, nid, zid,
-						BIT(LRU_INACTIVE_FILE));
-	active = mem_cgroup_zone_nr_lru_pages(memcg, nid, zid,
-					      BIT(LRU_ACTIVE_FILE));
+	inactive = mem_cgroup_zone_nr_lru_pages(lruvec, BIT(LRU_INACTIVE_FILE));
+	active = mem_cgroup_zone_nr_lru_pages(lruvec, BIT(LRU_ACTIVE_FILE));
 
 	return (active > inactive);
 }
@@ -4755,6 +4759,7 @@ static int alloc_mem_cgroup_per_zone_inf
 
 	for (zone = 0; zone < MAX_NR_ZONES; zone++) {
 		mz = &pn->zoneinfo[zone];
+		mz->lruvec.zone = &NODE_DATA(node)->node_zones[zone];
 		for_each_lru(lru)
 			INIT_LIST_HEAD(&mz->lruvec.lists[lru]);
 		mz->usage_in_excess = 0;
--- mmotm.orig/mm/page_alloc.c	2012-02-18 11:57:20.395524062 -0800
+++ mmotm/mm/page_alloc.c	2012-02-18 11:57:28.375524252 -0800
@@ -4365,6 +4365,7 @@ static void __paginginit free_area_init_
 		zone->zone_pgdat = pgdat;
 
 		zone_pcp_init(zone);
+		zone->lruvec.zone = zone;
 		for_each_lru(lru)
 			INIT_LIST_HEAD(&zone->lruvec.lists[lru]);
 		zone->lruvec.reclaim_stat.recent_rotated[0] = 0;
--- mmotm.orig/mm/vmscan.c	2012-02-18 11:57:20.395524062 -0800
+++ mmotm/mm/vmscan.c	2012-02-18 11:57:28.375524252 -0800
@@ -115,11 +115,6 @@ struct scan_control {
 	nodemask_t	*nodemask;
 };
 
-struct mem_cgroup_zone {
-	struct mem_cgroup *mem_cgroup;
-	struct zone *zone;
-};
-
 #define lru_to_page(_head) (list_entry((_head)->prev, struct page, lru))
 
 #ifdef ARCH_HAS_PREFETCH
@@ -171,21 +166,12 @@ static bool global_reclaim(struct scan_c
 }
 #endif
 
-static struct zone_reclaim_stat *get_reclaim_stat(struct mem_cgroup_zone *mz)
-{
-	return &mem_cgroup_zone_lruvec(mz->zone, mz->mem_cgroup)->reclaim_stat;
-}
-
-static unsigned long zone_nr_lru_pages(struct mem_cgroup_zone *mz,
-				       enum lru_list lru)
+static unsigned long zone_nr_lru_pages(struct lruvec *lruvec, enum lru_list lru)
 {
 	if (!mem_cgroup_disabled())
-		return mem_cgroup_zone_nr_lru_pages(mz->mem_cgroup,
-						    zone_to_nid(mz->zone),
-						    zone_idx(mz->zone),
-						    BIT(lru));
+		return mem_cgroup_zone_nr_lru_pages(lruvec, BIT(lru));
 
-	return zone_page_state(mz->zone, NR_LRU_BASE + lru);
+	return zone_page_state(lruvec->zone, NR_LRU_BASE + lru);
 }
 
 
@@ -688,13 +674,13 @@ enum page_references {
 };
 
 static enum page_references page_check_references(struct page *page,
-						  struct mem_cgroup_zone *mz,
+						  struct mem_cgroup *memcg,
 						  struct scan_control *sc)
 {
 	int referenced_ptes, referenced_page;
 	unsigned long vm_flags;
 
-	referenced_ptes = page_referenced(page, 1, mz->mem_cgroup, &vm_flags);
+	referenced_ptes = page_referenced(page, 1, memcg, &vm_flags);
 	referenced_page = TestClearPageReferenced(page);
 
 	/* Lumpy reclaim - ignore references */
@@ -750,12 +736,13 @@ static enum page_references page_check_r
  * shrink_page_list() returns the number of reclaimed pages
  */
 static unsigned long shrink_page_list(struct list_head *page_list,
-				      struct mem_cgroup_zone *mz,
+				      struct lruvec *lruvec,
 				      struct scan_control *sc,
 				      int priority,
 				      unsigned long *ret_nr_dirty,
 				      unsigned long *ret_nr_writeback)
 {
+	struct mem_cgroup *memcg = mem_cgroup_from_lruvec(lruvec);
 	LIST_HEAD(ret_pages);
 	LIST_HEAD(free_pages);
 	int pgactivate = 0;
@@ -781,7 +768,7 @@ static unsigned long shrink_page_list(st
 			goto keep;
 
 		VM_BUG_ON(PageActive(page));
-		VM_BUG_ON(page_zone(page) != mz->zone);
+		VM_BUG_ON(page_zone(page) != lruvec->zone);
 
 		sc->nr_scanned++;
 
@@ -815,7 +802,7 @@ static unsigned long shrink_page_list(st
 			}
 		}
 
-		references = page_check_references(page, mz, sc);
+		references = page_check_references(page, memcg, sc);
 		switch (references) {
 		case PAGEREF_ACTIVATE:
 			goto activate_locked;
@@ -1007,7 +994,7 @@ keep_lumpy:
 	 * will encounter the same problem
 	 */
 	if (nr_dirty && nr_dirty == nr_congested && global_reclaim(sc))
-		zone_set_flag(mz->zone, ZONE_CONGESTED);
+		zone_set_flag(lruvec->zone, ZONE_CONGESTED);
 
 	free_hot_cold_page_list(&free_pages, 1);
 
@@ -1122,7 +1109,7 @@ int __isolate_lru_page(struct page *page
  * Appropriate locks must be held before calling this function.
  *
  * @nr_to_scan:	The number of pages to look through on the list.
- * @mz:		The mem_cgroup_zone to pull pages from.
+ * @lruvec:	The mem_cgroup/zone lruvec to pull pages from.
  * @dst:	The temp list to put pages on to.
  * @nr_scanned:	The number of pages that were scanned.
  * @sc:		The scan_control struct for this reclaim session
@@ -1133,11 +1120,10 @@ int __isolate_lru_page(struct page *page
  * returns how many pages were moved onto *@dst.
  */
 static unsigned long isolate_lru_pages(unsigned long nr_to_scan,
-		struct mem_cgroup_zone *mz, struct list_head *dst,
+		struct lruvec *lruvec, struct list_head *dst,
 		unsigned long *nr_scanned, struct scan_control *sc,
 		isolate_mode_t mode, int active, int file)
 {
-	struct lruvec *lruvec;
 	struct list_head *src;
 	unsigned long nr_taken = 0;
 	unsigned long nr_lumpy_taken = 0;
@@ -1146,7 +1132,6 @@ static unsigned long isolate_lru_pages(u
 	unsigned long scan;
 	int lru = LRU_BASE;
 
-	lruvec = mem_cgroup_zone_lruvec(mz->zone, mz->mem_cgroup);
 	if (active)
 		lru += LRU_ACTIVE;
 	if (file)
@@ -1344,11 +1329,10 @@ static int too_many_isolated(struct zone
 }
 
 static noinline_for_stack void
-putback_inactive_pages(struct mem_cgroup_zone *mz,
-		       struct list_head *page_list)
+putback_inactive_pages(struct lruvec *lruvec, struct list_head *page_list)
 {
-	struct zone_reclaim_stat *reclaim_stat = get_reclaim_stat(mz);
-	struct zone *zone = mz->zone;
+	struct zone_reclaim_stat *reclaim_stat = &lruvec->reclaim_stat;
+	struct zone *zone = lruvec->zone;
 	LIST_HEAD(pages_to_free);
 
 	/*
@@ -1395,12 +1379,9 @@ putback_inactive_pages(struct mem_cgroup
 }
 
 static noinline_for_stack void
-update_isolated_counts(struct mem_cgroup_zone *mz,
-		       struct list_head *page_list,
-		       unsigned long *nr_anon,
-		       unsigned long *nr_file)
+update_isolated_counts(struct zone *zone, struct list_head *page_list,
+		       unsigned long *nr_anon, unsigned long *nr_file)
 {
-	struct zone *zone = mz->zone;
 	unsigned int count[NR_LRU_LISTS] = { 0, };
 	unsigned long nr_active = 0;
 	struct page *page;
@@ -1486,9 +1467,11 @@ static inline bool should_reclaim_stall(
  * of reclaimed pages
  */
 static noinline_for_stack unsigned long
-shrink_inactive_list(unsigned long nr_to_scan, struct mem_cgroup_zone *mz,
+shrink_inactive_list(unsigned long nr_to_scan, struct lruvec *lruvec,
 		     struct scan_control *sc, int priority, int file)
 {
+	struct zone_reclaim_stat *reclaim_stat = &lruvec->reclaim_stat;
+	struct zone *zone = lruvec->zone;
 	LIST_HEAD(page_list);
 	unsigned long nr_scanned;
 	unsigned long nr_reclaimed = 0;
@@ -1498,8 +1481,6 @@ shrink_inactive_list(unsigned long nr_to
 	unsigned long nr_dirty = 0;
 	unsigned long nr_writeback = 0;
 	isolate_mode_t isolate_mode = ISOLATE_INACTIVE;
-	struct zone *zone = mz->zone;
-	struct zone_reclaim_stat *reclaim_stat = get_reclaim_stat(mz);
 
 	while (unlikely(too_many_isolated(zone, file, sc))) {
 		congestion_wait(BLK_RW_ASYNC, HZ/10);
@@ -1522,31 +1503,29 @@ shrink_inactive_list(unsigned long nr_to
 
 	spin_lock_irq(&zone->lru_lock);
 
-	nr_taken = isolate_lru_pages(nr_to_scan, mz, &page_list, &nr_scanned,
-				     sc, isolate_mode, 0, file);
+	nr_taken = isolate_lru_pages(nr_to_scan, lruvec, &page_list,
+				     &nr_scanned, sc, isolate_mode, 0, file);
 	if (global_reclaim(sc)) {
 		zone->pages_scanned += nr_scanned;
 		if (current_is_kswapd())
-			__count_zone_vm_events(PGSCAN_KSWAPD, zone,
-					       nr_scanned);
+			__count_zone_vm_events(PGSCAN_KSWAPD, zone, nr_scanned);
 		else
-			__count_zone_vm_events(PGSCAN_DIRECT, zone,
-					       nr_scanned);
+			__count_zone_vm_events(PGSCAN_DIRECT, zone, nr_scanned);
 	}
 	spin_unlock_irq(&zone->lru_lock);
 
 	if (nr_taken == 0)
 		return 0;
 
-	update_isolated_counts(mz, &page_list, &nr_anon, &nr_file);
+	update_isolated_counts(zone, &page_list, &nr_anon, &nr_file);
 
-	nr_reclaimed = shrink_page_list(&page_list, mz, sc, priority,
+	nr_reclaimed = shrink_page_list(&page_list, lruvec, sc, priority,
 						&nr_dirty, &nr_writeback);
 
 	/* Check if we should syncronously wait for writeback */
 	if (should_reclaim_stall(nr_taken, nr_reclaimed, priority, sc)) {
 		set_reclaim_mode(priority, sc, true);
-		nr_reclaimed += shrink_page_list(&page_list, mz, sc,
+		nr_reclaimed += shrink_page_list(&page_list, lruvec, sc,
 					priority, &nr_dirty, &nr_writeback);
 	}
 
@@ -1559,7 +1538,7 @@ shrink_inactive_list(unsigned long nr_to
 		__count_vm_events(KSWAPD_STEAL, nr_reclaimed);
 	__count_zone_vm_events(PGSTEAL, zone, nr_reclaimed);
 
-	putback_inactive_pages(mz, &page_list);
+	putback_inactive_pages(lruvec, &page_list);
 
 	__mod_zone_page_state(zone, NR_ISOLATED_ANON, -nr_anon);
 	__mod_zone_page_state(zone, NR_ISOLATED_FILE, -nr_file);
@@ -1659,10 +1638,13 @@ static void move_active_pages_to_lru(str
 }
 
 static void shrink_active_list(unsigned long nr_to_scan,
-			       struct mem_cgroup_zone *mz,
+			       struct lruvec *lruvec,
 			       struct scan_control *sc,
 			       int priority, int file)
 {
+	struct mem_cgroup *memcg = mem_cgroup_from_lruvec(lruvec);
+	struct zone_reclaim_stat *reclaim_stat = &lruvec->reclaim_stat;
+	struct zone *zone = lruvec->zone;
 	unsigned long nr_taken;
 	unsigned long nr_scanned;
 	unsigned long vm_flags;
@@ -1670,10 +1652,8 @@ static void shrink_active_list(unsigned
 	LIST_HEAD(l_active);
 	LIST_HEAD(l_inactive);
 	struct page *page;
-	struct zone_reclaim_stat *reclaim_stat = get_reclaim_stat(mz);
 	unsigned long nr_rotated = 0;
 	isolate_mode_t isolate_mode = ISOLATE_ACTIVE;
-	struct zone *zone = mz->zone;
 
 	lru_add_drain();
 
@@ -1684,8 +1664,8 @@ static void shrink_active_list(unsigned
 
 	spin_lock_irq(&zone->lru_lock);
 
-	nr_taken = isolate_lru_pages(nr_to_scan, mz, &l_hold, &nr_scanned, sc,
-				     isolate_mode, 1, file);
+	nr_taken = isolate_lru_pages(nr_to_scan, lruvec, &l_hold,
+				     &nr_scanned, sc, isolate_mode, 1, file);
 	if (global_reclaim(sc))
 		zone->pages_scanned += nr_scanned;
 
@@ -1717,7 +1697,7 @@ static void shrink_active_list(unsigned
 			}
 		}
 
-		if (page_referenced(page, 0, mz->mem_cgroup, &vm_flags)) {
+		if (page_referenced(page, 0, memcg, &vm_flags)) {
 			nr_rotated += hpage_nr_pages(page);
 			/*
 			 * Identify referenced, file-backed active pages and
@@ -1776,13 +1756,12 @@ static int inactive_anon_is_low_global(s
 
 /**
  * inactive_anon_is_low - check if anonymous pages need to be deactivated
- * @zone: zone to check
- * @sc:   scan control of this context
+ * @lruvec: The mem_cgroup/zone lruvec to check
  *
  * Returns true if the zone does not have enough inactive anon pages,
  * meaning some active anon pages need to be deactivated.
  */
-static int inactive_anon_is_low(struct mem_cgroup_zone *mz)
+static int inactive_anon_is_low(struct lruvec *lruvec)
 {
 	/*
 	 * If we don't have swap space, anonymous page deactivation
@@ -1792,13 +1771,12 @@ static int inactive_anon_is_low(struct m
 		return 0;
 
 	if (!mem_cgroup_disabled())
-		return mem_cgroup_inactive_anon_is_low(mz->mem_cgroup,
-						       mz->zone);
+		return mem_cgroup_inactive_anon_is_low(lruvec);
 
-	return inactive_anon_is_low_global(mz->zone);
+	return inactive_anon_is_low_global(lruvec->zone);
 }
 #else
-static inline int inactive_anon_is_low(struct mem_cgroup_zone *mz)
+static inline int inactive_anon_is_low(struct lruvec *lruvec)
 {
 	return 0;
 }
@@ -1816,7 +1794,7 @@ static int inactive_file_is_low_global(s
 
 /**
  * inactive_file_is_low - check if file pages need to be deactivated
- * @mz: memory cgroup and zone to check
+ * @lruvec: The mem_cgroup/zone lruvec to check
  *
  * When the system is doing streaming IO, memory pressure here
  * ensures that active file pages get deactivated, until more
@@ -1828,44 +1806,44 @@ static int inactive_file_is_low_global(s
  * This uses a different ratio than the anonymous pages, because
  * the page cache uses a use-once replacement algorithm.
  */
-static int inactive_file_is_low(struct mem_cgroup_zone *mz)
+static int inactive_file_is_low(struct lruvec *lruvec)
 {
 	if (!mem_cgroup_disabled())
-		return mem_cgroup_inactive_file_is_low(mz->mem_cgroup,
-						       mz->zone);
+		return mem_cgroup_inactive_file_is_low(lruvec);
 
-	return inactive_file_is_low_global(mz->zone);
+	return inactive_file_is_low_global(lruvec->zone);
 }
 
-static int inactive_list_is_low(struct mem_cgroup_zone *mz, int file)
+static int inactive_list_is_low(struct lruvec *lruvec, int file)
 {
 	if (file)
-		return inactive_file_is_low(mz);
+		return inactive_file_is_low(lruvec);
 	else
-		return inactive_anon_is_low(mz);
+		return inactive_anon_is_low(lruvec);
 }
 
 static unsigned long shrink_list(enum lru_list lru, unsigned long nr_to_scan,
-				 struct mem_cgroup_zone *mz,
+				 struct lruvec *lruvec,
 				 struct scan_control *sc, int priority)
 {
 	int file = is_file_lru(lru);
 
 	if (is_active_lru(lru)) {
-		if (inactive_list_is_low(mz, file))
-			shrink_active_list(nr_to_scan, mz, sc, priority, file);
+		if (inactive_list_is_low(lruvec, file))
+			shrink_active_list(nr_to_scan, lruvec, sc, priority,
+									file);
 		return 0;
 	}
 
-	return shrink_inactive_list(nr_to_scan, mz, sc, priority, file);
+	return shrink_inactive_list(nr_to_scan, lruvec, sc, priority, file);
 }
 
-static int vmscan_swappiness(struct mem_cgroup_zone *mz,
+static int vmscan_swappiness(struct lruvec *lruvec,
 			     struct scan_control *sc)
 {
 	if (global_reclaim(sc))
 		return vm_swappiness;
-	return mem_cgroup_swappiness(mz->mem_cgroup);
+	return mem_cgroup_swappiness(mem_cgroup_from_lruvec(lruvec));
 }
 
 /*
@@ -1876,13 +1854,14 @@ static int vmscan_swappiness(struct mem_
  *
  * nr[0] = anon pages to scan; nr[1] = file pages to scan
  */
-static void get_scan_count(struct mem_cgroup_zone *mz, struct scan_control *sc,
+static void get_scan_count(struct lruvec *lruvec, struct scan_control *sc,
 			   unsigned long *nr, int priority)
 {
+	struct zone_reclaim_stat *reclaim_stat = &lruvec->reclaim_stat;
+	struct zone *zone = lruvec->zone;
 	unsigned long anon, file, free;
 	unsigned long anon_prio, file_prio;
 	unsigned long ap, fp;
-	struct zone_reclaim_stat *reclaim_stat = get_reclaim_stat(mz);
 	u64 fraction[2], denominator;
 	enum lru_list lru;
 	int noswap = 0;
@@ -1898,7 +1877,7 @@ static void get_scan_count(struct mem_cg
 	 * latencies, so it's better to scan a minimum amount there as
 	 * well.
 	 */
-	if (current_is_kswapd() && mz->zone->all_unreclaimable)
+	if (current_is_kswapd() && zone->all_unreclaimable)
 		force_scan = true;
 	if (!global_reclaim(sc))
 		force_scan = true;
@@ -1912,16 +1891,16 @@ static void get_scan_count(struct mem_cg
 		goto out;
 	}
 
-	anon  = zone_nr_lru_pages(mz, LRU_ACTIVE_ANON) +
-		zone_nr_lru_pages(mz, LRU_INACTIVE_ANON);
-	file  = zone_nr_lru_pages(mz, LRU_ACTIVE_FILE) +
-		zone_nr_lru_pages(mz, LRU_INACTIVE_FILE);
+	anon  = zone_nr_lru_pages(lruvec, LRU_ACTIVE_ANON) +
+		zone_nr_lru_pages(lruvec, LRU_INACTIVE_ANON);
+	file  = zone_nr_lru_pages(lruvec, LRU_ACTIVE_FILE) +
+		zone_nr_lru_pages(lruvec, LRU_INACTIVE_FILE);
 
 	if (global_reclaim(sc)) {
-		free  = zone_page_state(mz->zone, NR_FREE_PAGES);
+		free  = zone_page_state(zone, NR_FREE_PAGES);
 		/* If we have very few page cache pages,
 		   force-scan anon pages. */
-		if (unlikely(file + free <= high_wmark_pages(mz->zone))) {
+		if (unlikely(file + free <= high_wmark_pages(zone))) {
 			fraction[0] = 1;
 			fraction[1] = 0;
 			denominator = 1;
@@ -1933,8 +1912,8 @@ static void get_scan_count(struct mem_cg
 	 * With swappiness at 100, anonymous and file have the same priority.
 	 * This scanning priority is essentially the inverse of IO cost.
 	 */
-	anon_prio = vmscan_swappiness(mz, sc);
-	file_prio = 200 - vmscan_swappiness(mz, sc);
+	anon_prio = vmscan_swappiness(lruvec, sc);
+	file_prio = 200 - anon_prio;
 
 	/*
 	 * OK, so we have swap space and a fair amount of page cache
@@ -1947,7 +1926,7 @@ static void get_scan_count(struct mem_cg
 	 *
 	 * anon in [0], file in [1]
 	 */
-	spin_lock_irq(&mz->zone->lru_lock);
+	spin_lock_irq(&zone->lru_lock);
 	if (unlikely(reclaim_stat->recent_scanned[0] > anon / 4)) {
 		reclaim_stat->recent_scanned[0] /= 2;
 		reclaim_stat->recent_rotated[0] /= 2;
@@ -1968,7 +1947,7 @@ static void get_scan_count(struct mem_cg
 
 	fp = (file_prio + 1) * (reclaim_stat->recent_scanned[1] + 1);
 	fp /= reclaim_stat->recent_rotated[1] + 1;
-	spin_unlock_irq(&mz->zone->lru_lock);
+	spin_unlock_irq(&zone->lru_lock);
 
 	fraction[0] = ap;
 	fraction[1] = fp;
@@ -1978,7 +1957,7 @@ out:
 		int file = is_file_lru(lru);
 		unsigned long scan;
 
-		scan = zone_nr_lru_pages(mz, lru);
+		scan = zone_nr_lru_pages(lruvec, lru);
 		if (priority || noswap) {
 			scan >>= priority;
 			if (!scan && force_scan)
@@ -1996,7 +1975,7 @@ out:
  * back to the allocator and call try_to_compact_zone(), we ensure that
  * there are enough free pages for it to be likely successful
  */
-static inline bool should_continue_reclaim(struct mem_cgroup_zone *mz,
+static inline bool should_continue_reclaim(struct lruvec *lruvec,
 					unsigned long nr_reclaimed,
 					unsigned long nr_scanned,
 					struct scan_control *sc)
@@ -2036,15 +2015,16 @@ static inline bool should_continue_recla
 	 * inactive lists are large enough, continue reclaiming
 	 */
 	pages_for_compaction = (2UL << sc->order);
-	inactive_lru_pages = zone_nr_lru_pages(mz, LRU_INACTIVE_FILE);
+	inactive_lru_pages = zone_nr_lru_pages(lruvec, LRU_INACTIVE_FILE);
 	if (nr_swap_pages > 0)
-		inactive_lru_pages += zone_nr_lru_pages(mz, LRU_INACTIVE_ANON);
+		inactive_lru_pages += zone_nr_lru_pages(lruvec,
+							LRU_INACTIVE_ANON);
 	if (sc->nr_reclaimed < pages_for_compaction &&
 			inactive_lru_pages > pages_for_compaction)
 		return true;
 
 	/* If compaction would go ahead or the allocation would succeed, stop */
-	switch (compaction_suitable(mz->zone, sc->order)) {
+	switch (compaction_suitable(lruvec->zone, sc->order)) {
 	case COMPACT_PARTIAL:
 	case COMPACT_CONTINUE:
 		return false;
@@ -2056,7 +2036,7 @@ static inline bool should_continue_recla
 /*
  * This is a basic per-zone page freer.  Used by both kswapd and direct reclaim.
  */
-static void shrink_mem_cgroup_zone(int priority, struct mem_cgroup_zone *mz,
+static void shrink_mem_cgroup_zone(int priority, struct lruvec *lruvec,
 				   struct scan_control *sc)
 {
 	unsigned long nr[NR_LRU_LISTS];
@@ -2069,7 +2049,7 @@ static void shrink_mem_cgroup_zone(int p
 restart:
 	nr_reclaimed = 0;
 	nr_scanned = sc->nr_scanned;
-	get_scan_count(mz, sc, nr, priority);
+	get_scan_count(lruvec, sc, nr, priority);
 
 	blk_start_plug(&plug);
 	while (nr[LRU_INACTIVE_ANON] || nr[LRU_ACTIVE_FILE] ||
@@ -2081,7 +2061,7 @@ restart:
 				nr[lru] -= nr_to_scan;
 
 				nr_reclaimed += shrink_list(lru, nr_to_scan,
-							    mz, sc, priority);
+							lruvec, sc, priority);
 			}
 		}
 		/*
@@ -2107,11 +2087,11 @@ restart:
 	 * Even if we did not try to evict anon pages at all, we want to
 	 * rebalance the anon lru active/inactive ratio.
 	 */
-	if (inactive_anon_is_low(mz))
-		shrink_active_list(SWAP_CLUSTER_MAX, mz, sc, priority, 0);
+	if (inactive_anon_is_low(lruvec))
+		shrink_active_list(SWAP_CLUSTER_MAX, lruvec, sc, priority, 0);
 
 	/* reclaim/compaction might need reclaim to continue */
-	if (should_continue_reclaim(mz, nr_reclaimed,
+	if (should_continue_reclaim(lruvec, nr_reclaimed,
 					sc->nr_scanned - nr_scanned, sc))
 		goto restart;
 
@@ -2130,12 +2110,9 @@ static void shrink_zone(int priority, st
 
 	memcg = mem_cgroup_iter(root, NULL, &reclaim);
 	do {
-		struct mem_cgroup_zone mz = {
-			.mem_cgroup = memcg,
-			.zone = zone,
-		};
+		struct lruvec *lruvec = mem_cgroup_zone_lruvec(zone, memcg);
 
-		shrink_mem_cgroup_zone(priority, &mz, sc);
+		shrink_mem_cgroup_zone(priority, lruvec, sc);
 		/*
 		 * Limit reclaim has historically picked one memcg and
 		 * scanned it with decreasing priority levels until
@@ -2463,10 +2440,7 @@ unsigned long mem_cgroup_shrink_node_zon
 		.order = 0,
 		.target_mem_cgroup = memcg,
 	};
-	struct mem_cgroup_zone mz = {
-		.mem_cgroup = memcg,
-		.zone = zone,
-	};
+	struct lruvec *lruvec = mem_cgroup_zone_lruvec(zone, memcg);
 
 	sc.gfp_mask = (gfp_mask & GFP_RECLAIM_MASK) |
 			(GFP_HIGHUSER_MOVABLE & ~GFP_RECLAIM_MASK);
@@ -2482,7 +2456,7 @@ unsigned long mem_cgroup_shrink_node_zon
 	 * will pick up pages from other mem cgroup's as well. We hack
 	 * the priority and make it zero.
 	 */
-	shrink_mem_cgroup_zone(0, &mz, &sc);
+	shrink_mem_cgroup_zone(0, lruvec, &sc);
 
 	trace_mm_vmscan_memcg_softlimit_reclaim_end(sc.nr_reclaimed);
 
@@ -2543,13 +2517,10 @@ static void age_active_anon(struct zone
 
 	memcg = mem_cgroup_iter(NULL, NULL, NULL);
 	do {
-		struct mem_cgroup_zone mz = {
-			.mem_cgroup = memcg,
-			.zone = zone,
-		};
+		struct lruvec *lruvec = mem_cgroup_zone_lruvec(zone, memcg);
 
-		if (inactive_anon_is_low(&mz))
-			shrink_active_list(SWAP_CLUSTER_MAX, &mz,
+		if (inactive_anon_is_low(lruvec))
+			shrink_active_list(SWAP_CLUSTER_MAX, lruvec,
 					   sc, priority, 0);
 
 		memcg = mem_cgroup_iter(NULL, memcg, NULL);


* [PATCH 4/10] mm/memcg: apply add/del_page to lruvec
  2012-02-20 23:26 ` Hugh Dickins
@ 2012-02-20 23:32   ` Hugh Dickins
  -1 siblings, 0 replies; 72+ messages in thread
From: Hugh Dickins @ 2012-02-20 23:32 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Konstantin Khlebnikov, KAMEZAWA Hiroyuki, Johannes Weiner,
	Ying Han, linux-mm, linux-kernel

Go further: pass lruvec instead of zone to add_page_to_lru_list() and
del_page_from_lru_list(); and have pagevec_lru_move_fn() pass lruvec
down to its target functions.

This cleanup eliminates a swathe of cruft in memcontrol.c,
including mem_cgroup_lru_add_list(), mem_cgroup_lru_del_list() and
mem_cgroup_lru_move_lists(), which never actually touched the lists.

In their place come mem_cgroup_page_lruvec(), to decide the lruvec
(previously a side-effect of the add), and mem_cgroup_update_lru_size(),
to maintain the lru_size stats.

Whilst these are simplifications in their own right, the goal is to
bring the evaluation of lruvec next to the spin_locking of the lrus,
in preparation for the next patch.
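
As a sketch of the resulting calling pattern (illustrative only, condensed
from the mm/swap.c changes below; page, zone, lruvec and lru are assumed
to be set up as in those callers):

	spin_lock_irq(&zone->lru_lock);
	lruvec = mem_cgroup_page_lruvec(page, zone);	/* decide the lruvec under lru_lock */
	del_page_from_lru_list(page, lruvec, page_lru(page));
	add_page_to_lru_list(page, lruvec, lru);	/* lru_size and zone stats updated inside */
	spin_unlock_irq(&zone->lru_lock);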

Signed-off-by: Hugh Dickins <hughd@google.com>
---
 include/linux/memcontrol.h |   40 ++-----------
 include/linux/mm_inline.h  |   20 +++---
 include/linux/swap.h       |    4 -
 mm/compaction.c            |    5 +
 mm/huge_memory.c           |    5 +
 mm/memcontrol.c            |  102 +++++++----------------------------
 mm/swap.c                  |   85 ++++++++++++++---------------
 mm/vmscan.c                |   60 ++++++++++++--------
 8 files changed, 128 insertions(+), 193 deletions(-)

--- mmotm.orig/include/linux/memcontrol.h	2012-02-18 11:57:28.371524252 -0800
+++ mmotm/include/linux/memcontrol.h	2012-02-18 11:57:35.583524425 -0800
@@ -63,13 +63,9 @@ extern int mem_cgroup_cache_charge(struc
 					gfp_t gfp_mask);
 
 struct lruvec *mem_cgroup_zone_lruvec(struct zone *, struct mem_cgroup *);
+extern struct lruvec *mem_cgroup_page_lruvec(struct page *, struct zone *);
 extern struct mem_cgroup *mem_cgroup_from_lruvec(struct lruvec *lruvec);
-struct lruvec *mem_cgroup_lru_add_list(struct zone *, struct page *,
-				       enum lru_list);
-void mem_cgroup_lru_del_list(struct page *, enum lru_list);
-void mem_cgroup_lru_del(struct page *);
-struct lruvec *mem_cgroup_lru_move_lists(struct zone *, struct page *,
-					 enum lru_list, enum lru_list);
+extern void mem_cgroup_update_lru_size(struct lruvec *, enum lru_list, int);
 
 /* For coalescing uncharge for reducing memcg' overhead*/
 extern void mem_cgroup_uncharge_start(void);
@@ -119,8 +115,6 @@ int mem_cgroup_inactive_file_is_low(stru
 int mem_cgroup_select_victim_node(struct mem_cgroup *memcg);
 unsigned long mem_cgroup_zone_nr_lru_pages(struct lruvec *lruvec,
 					   unsigned int lrumask);
-struct zone_reclaim_stat*
-mem_cgroup_get_reclaim_stat_from_page(struct page *page);
 extern void mem_cgroup_print_oom_info(struct mem_cgroup *memcg,
 					struct task_struct *p);
 extern void mem_cgroup_replace_page_cache(struct page *oldpage,
@@ -248,32 +242,20 @@ static inline struct lruvec *mem_cgroup_
 	return &zone->lruvec;
 }
 
-static inline struct mem_cgroup *mem_cgroup_from_lruvec(struct lruvec *lruvec)
-{
-	return NULL;
-}
-
-static inline struct lruvec *mem_cgroup_lru_add_list(struct zone *zone,
-						     struct page *page,
-						     enum lru_list lru)
+static inline struct lruvec *mem_cgroup_page_lruvec(struct page *page,
+						    struct zone *zone)
 {
 	return &zone->lruvec;
 }
 
-static inline void mem_cgroup_lru_del_list(struct page *page, enum lru_list lru)
-{
-}
-
-static inline void mem_cgroup_lru_del(struct page *page)
+static inline struct mem_cgroup *mem_cgroup_from_lruvec(struct lruvec *lruvec)
 {
+	return NULL;
 }
 
-static inline struct lruvec *mem_cgroup_lru_move_lists(struct zone *zone,
-						       struct page *page,
-						       enum lru_list from,
-						       enum lru_list to)
+static inline void mem_cgroup_update_lru_size(struct lruvec *lruvec,
+					      enum lru_list lru, int increment)
 {
-	return &zone->lruvec;
 }
 
 static inline struct mem_cgroup *try_get_mem_cgroup_from_page(struct page *page)
@@ -352,12 +334,6 @@ mem_cgroup_zone_nr_lru_pages(struct lruv
 	return 0;
 }
 
-static inline struct zone_reclaim_stat*
-mem_cgroup_get_reclaim_stat_from_page(struct page *page)
-{
-	return NULL;
-}
-
 static inline void
 mem_cgroup_print_oom_info(struct mem_cgroup *memcg, struct task_struct *p)
 {
--- mmotm.orig/include/linux/mm_inline.h	2012-02-18 11:56:23.639522714 -0800
+++ mmotm/include/linux/mm_inline.h	2012-02-18 11:57:35.583524425 -0800
@@ -21,22 +21,22 @@ static inline int page_is_file_cache(str
 	return !PageSwapBacked(page);
 }
 
-static inline void
-add_page_to_lru_list(struct zone *zone, struct page *page, enum lru_list lru)
+static inline void add_page_to_lru_list(struct page *page,
+				struct lruvec *lruvec, enum lru_list lru)
 {
-	struct lruvec *lruvec;
-
-	lruvec = mem_cgroup_lru_add_list(zone, page, lru);
+	int nr_pages = hpage_nr_pages(page);
+	mem_cgroup_update_lru_size(lruvec, lru, nr_pages);
 	list_add(&page->lru, &lruvec->lists[lru]);
-	__mod_zone_page_state(zone, NR_LRU_BASE + lru, hpage_nr_pages(page));
+	__mod_zone_page_state(lruvec->zone, NR_LRU_BASE + lru, nr_pages);
 }
 
-static inline void
-del_page_from_lru_list(struct zone *zone, struct page *page, enum lru_list lru)
+static inline void del_page_from_lru_list(struct page *page,
+				struct lruvec *lruvec, enum lru_list lru)
 {
-	mem_cgroup_lru_del_list(page, lru);
+	int nr_pages = hpage_nr_pages(page);
+	mem_cgroup_update_lru_size(lruvec, lru, -nr_pages);
 	list_del(&page->lru);
-	__mod_zone_page_state(zone, NR_LRU_BASE + lru, -hpage_nr_pages(page));
+	__mod_zone_page_state(lruvec->zone, NR_LRU_BASE + lru, -nr_pages);
 }
 
 /**
--- mmotm.orig/include/linux/swap.h	2012-02-18 11:56:23.639522714 -0800
+++ mmotm/include/linux/swap.h	2012-02-18 11:57:35.583524425 -0800
@@ -224,8 +224,8 @@ extern unsigned int nr_free_pagecache_pa
 /* linux/mm/swap.c */
 extern void __lru_cache_add(struct page *, enum lru_list lru);
 extern void lru_cache_add_lru(struct page *, enum lru_list lru);
-extern void lru_add_page_tail(struct zone* zone,
-			      struct page *page, struct page *page_tail);
+extern void lru_add_page_tail(struct page *page, struct page *page_tail,
+			      struct lruvec *lruvec);
 extern void activate_page(struct page *);
 extern void mark_page_accessed(struct page *);
 extern void lru_add_drain(void);
--- mmotm.orig/mm/compaction.c	2012-02-18 11:56:23.639522714 -0800
+++ mmotm/mm/compaction.c	2012-02-18 11:57:35.583524425 -0800
@@ -262,6 +262,7 @@ static isolate_migrate_t isolate_migrate
 	unsigned long nr_scanned = 0, nr_isolated = 0;
 	struct list_head *migratelist = &cc->migratepages;
 	isolate_mode_t mode = ISOLATE_ACTIVE|ISOLATE_INACTIVE;
+	struct lruvec *lruvec;
 
 	/* Do not scan outside zone boundaries */
 	low_pfn = max(cc->migrate_pfn, zone->zone_start_pfn);
@@ -378,10 +379,12 @@ static isolate_migrate_t isolate_migrate
 		if (__isolate_lru_page(page, mode, 0) != 0)
 			continue;
 
+		lruvec = mem_cgroup_page_lruvec(page, zone);
+
 		VM_BUG_ON(PageTransCompound(page));
 
 		/* Successfully isolated */
-		del_page_from_lru_list(zone, page, page_lru(page));
+		del_page_from_lru_list(page, lruvec, page_lru(page));
 		list_add(&page->lru, migratelist);
 		cc->nr_migratepages++;
 		nr_isolated++;
--- mmotm.orig/mm/huge_memory.c	2012-02-18 11:56:23.639522714 -0800
+++ mmotm/mm/huge_memory.c	2012-02-18 11:57:35.583524425 -0800
@@ -1223,10 +1223,13 @@ static void __split_huge_page_refcount(s
 {
 	int i;
 	struct zone *zone = page_zone(page);
+	struct lruvec *lruvec;
 	int tail_count = 0;
 
 	/* prevent PageLRU to go away from under us, and freeze lru stats */
 	spin_lock_irq(&zone->lru_lock);
+	lruvec = mem_cgroup_page_lruvec(page, zone);
+
 	compound_lock(page);
 	/* complete memcg works before add pages to LRU */
 	mem_cgroup_split_huge_fixup(page);
@@ -1302,7 +1305,7 @@ static void __split_huge_page_refcount(s
 		BUG_ON(!PageSwapBacked(page_tail));
 
 
-		lru_add_page_tail(zone, page, page_tail);
+		lru_add_page_tail(page, page_tail, lruvec);
 	}
 	atomic_sub(tail_count, &page->_count);
 	BUG_ON(atomic_read(&page->_count) <= 0);
--- mmotm.orig/mm/memcontrol.c	2012-02-18 11:57:28.371524252 -0800
+++ mmotm/mm/memcontrol.c	2012-02-18 11:57:35.587524424 -0800
@@ -993,7 +993,7 @@ EXPORT_SYMBOL(mem_cgroup_count_vm_event)
 /**
  * mem_cgroup_zone_lruvec - get the lru list vector for a zone and memcg
  * @zone: zone of the wanted lruvec
- * @mem: memcg of the wanted lruvec
+ * @memcg: memcg of the wanted lruvec
  *
  * Returns the lru list vector holding pages for the given @zone and
  * @mem.  This can be the global zone lruvec, if the memory controller
@@ -1037,19 +1037,11 @@ struct mem_cgroup *mem_cgroup_from_lruve
  */
 
 /**
- * mem_cgroup_lru_add_list - account for adding an lru page and return lruvec
- * @zone: zone of the page
+ * mem_cgroup_page_lruvec - return lruvec for adding an lru page
  * @page: the page
- * @lru: current lru
- *
- * This function accounts for @page being added to @lru, and returns
- * the lruvec for the given @zone and the memcg @page is charged to.
- *
- * The callsite is then responsible for physically linking the page to
- * the returned lruvec->lists[@lru].
+ * @zone: zone of the page
  */
-struct lruvec *mem_cgroup_lru_add_list(struct zone *zone, struct page *page,
-				       enum lru_list lru)
+struct lruvec *mem_cgroup_page_lruvec(struct page *page, struct zone *zone)
 {
 	struct mem_cgroup_per_zone *mz;
 	struct mem_cgroup *memcg;
@@ -1061,66 +1053,31 @@ struct lruvec *mem_cgroup_lru_add_list(s
 	pc = lookup_page_cgroup(page);
 	memcg = pc->mem_cgroup;
 	mz = page_cgroup_zoneinfo(memcg, page);
-	/* compound_order() is stabilized through lru_lock */
-	mz->lru_size[lru] += 1 << compound_order(page);
 	return &mz->lruvec;
 }
 
 /**
- * mem_cgroup_lru_del_list - account for removing an lru page
- * @page: the page
- * @lru: target lru
- *
- * This function accounts for @page being removed from @lru.
+ * mem_cgroup_update_lru_size - account for adding or removing an lru page
+ * @lruvec: mem_cgroup per zone lru vector
+ * @lru: index of lru list the page is sitting on
+ * @nr_pages: positive when adding or negative when removing
  *
- * The callsite is then responsible for physically unlinking
- * @page->lru.
+ * This function must be called when a page is added to or removed from an
+ * lru list.
  */
-void mem_cgroup_lru_del_list(struct page *page, enum lru_list lru)
+void mem_cgroup_update_lru_size(struct lruvec *lruvec, enum lru_list lru,
+				int nr_pages)
 {
 	struct mem_cgroup_per_zone *mz;
-	struct mem_cgroup *memcg;
-	struct page_cgroup *pc;
+	unsigned long *lru_size;
 
 	if (mem_cgroup_disabled())
 		return;
 
-	pc = lookup_page_cgroup(page);
-	memcg = pc->mem_cgroup;
-	VM_BUG_ON(!memcg);
-	mz = page_cgroup_zoneinfo(memcg, page);
-	/* huge page split is done under lru_lock. so, we have no races. */
-	VM_BUG_ON(mz->lru_size[lru] < (1 << compound_order(page)));
-	mz->lru_size[lru] -= 1 << compound_order(page);
-}
-
-void mem_cgroup_lru_del(struct page *page)
-{
-	mem_cgroup_lru_del_list(page, page_lru(page));
-}
-
-/**
- * mem_cgroup_lru_move_lists - account for moving a page between lrus
- * @zone: zone of the page
- * @page: the page
- * @from: current lru
- * @to: target lru
- *
- * This function accounts for @page being moved between the lrus @from
- * and @to, and returns the lruvec for the given @zone and the memcg
- * @page is charged to.
- *
- * The callsite is then responsible for physically relinking
- * @page->lru to the returned lruvec->lists[@to].
- */
-struct lruvec *mem_cgroup_lru_move_lists(struct zone *zone,
-					 struct page *page,
-					 enum lru_list from,
-					 enum lru_list to)
-{
-	/* XXX: Optimize this, especially for @from == @to */
-	mem_cgroup_lru_del_list(page, from);
-	return mem_cgroup_lru_add_list(zone, page, to);
+	mz = container_of(lruvec, struct mem_cgroup_per_zone, lruvec);
+	lru_size = mz->lru_size + lru;
+	*lru_size += nr_pages;
+	VM_BUG_ON((long)(*lru_size) < 0);
 }
 
 /*
@@ -1203,24 +1160,6 @@ int mem_cgroup_inactive_file_is_low(stru
 	return (active > inactive);
 }
 
-struct zone_reclaim_stat *
-mem_cgroup_get_reclaim_stat_from_page(struct page *page)
-{
-	struct page_cgroup *pc;
-	struct mem_cgroup_per_zone *mz;
-
-	if (mem_cgroup_disabled())
-		return NULL;
-
-	pc = lookup_page_cgroup(page);
-	if (!PageCgroupUsed(pc))
-		return NULL;
-	/* Ensure pc->mem_cgroup is visible after reading PCG_USED. */
-	smp_rmb();
-	mz = page_cgroup_zoneinfo(pc->mem_cgroup, page);
-	return &mz->lruvec.reclaim_stat;
-}
-
 #define mem_cgroup_from_res_counter(counter, member)	\
 	container_of(counter, struct mem_cgroup, member)
 
@@ -2695,6 +2634,7 @@ __mem_cgroup_commit_charge_lrucare(struc
 	struct zone *zone = page_zone(page);
 	unsigned long flags;
 	bool removed = false;
+	struct lruvec *lruvec;
 
 	/*
 	 * In some case, SwapCache, FUSE(splice_buf->radixtree), the page
@@ -2703,13 +2643,15 @@ __mem_cgroup_commit_charge_lrucare(struc
 	 */
 	spin_lock_irqsave(&zone->lru_lock, flags);
 	if (PageLRU(page)) {
-		del_page_from_lru_list(zone, page, page_lru(page));
+		lruvec = mem_cgroup_zone_lruvec(zone, pc->mem_cgroup);
+		del_page_from_lru_list(page, lruvec, page_lru(page));
 		ClearPageLRU(page);
 		removed = true;
 	}
 	__mem_cgroup_commit_charge(memcg, page, 1, pc, ctype);
 	if (removed) {
-		add_page_to_lru_list(zone, page, page_lru(page));
+		lruvec = mem_cgroup_zone_lruvec(zone, pc->mem_cgroup);
+		add_page_to_lru_list(page, lruvec, page_lru(page));
 		SetPageLRU(page);
 	}
 	spin_unlock_irqrestore(&zone->lru_lock, flags);
--- mmotm.orig/mm/swap.c	2012-02-18 11:57:20.395524062 -0800
+++ mmotm/mm/swap.c	2012-02-18 11:57:35.587524424 -0800
@@ -47,13 +47,15 @@ static DEFINE_PER_CPU(struct pagevec, lr
 static void __page_cache_release(struct page *page)
 {
 	if (PageLRU(page)) {
-		unsigned long flags;
 		struct zone *zone = page_zone(page);
+		struct lruvec *lruvec;
+		unsigned long flags;
 
 		spin_lock_irqsave(&zone->lru_lock, flags);
+		lruvec = mem_cgroup_page_lruvec(page, zone);
 		VM_BUG_ON(!PageLRU(page));
 		__ClearPageLRU(page);
-		del_page_from_lru_list(zone, page, page_off_lru(page));
+		del_page_from_lru_list(page, lruvec, page_off_lru(page));
 		spin_unlock_irqrestore(&zone->lru_lock, flags);
 	}
 }
@@ -202,11 +204,12 @@ void put_pages_list(struct list_head *pa
 EXPORT_SYMBOL(put_pages_list);
 
 static void pagevec_lru_move_fn(struct pagevec *pvec,
-				void (*move_fn)(struct page *page, void *arg),
-				void *arg)
+	void (*move_fn)(struct page *page, struct lruvec *lruvec, void *arg),
+	void *arg)
 {
 	int i;
 	struct zone *zone = NULL;
+	struct lruvec *lruvec;
 	unsigned long flags = 0;
 
 	for (i = 0; i < pagevec_count(pvec); i++) {
@@ -220,7 +223,8 @@ static void pagevec_lru_move_fn(struct p
 			spin_lock_irqsave(&zone->lru_lock, flags);
 		}
 
-		(*move_fn)(page, arg);
+		lruvec = mem_cgroup_page_lruvec(page, zone);
+		(*move_fn)(page, lruvec, arg);
 	}
 	if (zone)
 		spin_unlock_irqrestore(&zone->lru_lock, flags);
@@ -228,16 +232,13 @@ static void pagevec_lru_move_fn(struct p
 	pagevec_reinit(pvec);
 }
 
-static void pagevec_move_tail_fn(struct page *page, void *arg)
+static void pagevec_move_tail_fn(struct page *page, struct lruvec *lruvec,
+				 void *arg)
 {
 	int *pgmoved = arg;
 
 	if (PageLRU(page) && !PageActive(page) && !PageUnevictable(page)) {
 		enum lru_list lru = page_lru_base_type(page);
-		struct lruvec *lruvec;
-
-		lruvec = mem_cgroup_lru_move_lists(page_zone(page),
-						   page, lru, lru);
 		list_move_tail(&page->lru, &lruvec->lists[lru]);
 		(*pgmoved)++;
 	}
@@ -276,35 +277,30 @@ void rotate_reclaimable_page(struct page
 	}
 }
 
-static void update_page_reclaim_stat(struct zone *zone, struct page *page,
+static void update_page_reclaim_stat(struct lruvec *lruvec,
 				     int file, int rotated)
 {
-	struct zone_reclaim_stat *reclaim_stat;
-
-	reclaim_stat = mem_cgroup_get_reclaim_stat_from_page(page);
-	if (!reclaim_stat)
-		reclaim_stat = &zone->lruvec.reclaim_stat;
+	struct zone_reclaim_stat *reclaim_stat = &lruvec->reclaim_stat;
 
 	reclaim_stat->recent_scanned[file]++;
 	if (rotated)
 		reclaim_stat->recent_rotated[file]++;
 }
 
-static void __activate_page(struct page *page, void *arg)
+static void __activate_page(struct page *page, struct lruvec *lruvec,
+			    void *arg)
 {
-	struct zone *zone = page_zone(page);
-
 	if (PageLRU(page) && !PageActive(page) && !PageUnevictable(page)) {
 		int file = page_is_file_cache(page);
 		int lru = page_lru_base_type(page);
-		del_page_from_lru_list(zone, page, lru);
 
+		del_page_from_lru_list(page, lruvec, lru);
 		SetPageActive(page);
 		lru += LRU_ACTIVE;
-		add_page_to_lru_list(zone, page, lru);
-		__count_vm_event(PGACTIVATE);
+		add_page_to_lru_list(page, lruvec, lru);
 
-		update_page_reclaim_stat(zone, page, file, 1);
+		__count_vm_event(PGACTIVATE);
+		update_page_reclaim_stat(lruvec, file, 1);
 	}
 }
 
@@ -341,7 +337,7 @@ void activate_page(struct page *page)
 	struct zone *zone = page_zone(page);
 
 	spin_lock_irq(&zone->lru_lock);
-	__activate_page(page, NULL);
+	__activate_page(page, mem_cgroup_page_lruvec(page, zone), NULL);
 	spin_unlock_irq(&zone->lru_lock);
 }
 #endif
@@ -408,11 +404,13 @@ void lru_cache_add_lru(struct page *page
 void add_page_to_unevictable_list(struct page *page)
 {
 	struct zone *zone = page_zone(page);
+	struct lruvec *lruvec;
 
 	spin_lock_irq(&zone->lru_lock);
+	lruvec = mem_cgroup_page_lruvec(page, zone);
 	SetPageUnevictable(page);
 	SetPageLRU(page);
-	add_page_to_lru_list(zone, page, LRU_UNEVICTABLE);
+	add_page_to_lru_list(page, lruvec, LRU_UNEVICTABLE);
 	spin_unlock_irq(&zone->lru_lock);
 }
 
@@ -437,11 +435,11 @@ void add_page_to_unevictable_list(struct
  * be write it out by flusher threads as this is much more effective
  * than the single-page writeout from reclaim.
  */
-static void lru_deactivate_fn(struct page *page, void *arg)
+static void lru_deactivate_fn(struct page *page, struct lruvec *lruvec,
+			      void *arg)
 {
 	int lru, file;
 	bool active;
-	struct zone *zone = page_zone(page);
 
 	if (!PageLRU(page))
 		return;
@@ -454,13 +452,13 @@ static void lru_deactivate_fn(struct pag
 		return;
 
 	active = PageActive(page);
-
 	file = page_is_file_cache(page);
 	lru = page_lru_base_type(page);
-	del_page_from_lru_list(zone, page, lru + active);
+
+	del_page_from_lru_list(page, lruvec, lru + active);
 	ClearPageActive(page);
 	ClearPageReferenced(page);
-	add_page_to_lru_list(zone, page, lru);
+	add_page_to_lru_list(page, lruvec, lru);
 
 	if (PageWriteback(page) || PageDirty(page)) {
 		/*
@@ -470,19 +468,17 @@ static void lru_deactivate_fn(struct pag
 		 */
 		SetPageReclaim(page);
 	} else {
-		struct lruvec *lruvec;
 		/*
 		 * The page's writeback ends up during pagevec
 		 * We moves tha page into tail of inactive.
 		 */
-		lruvec = mem_cgroup_lru_move_lists(zone, page, lru, lru);
 		list_move_tail(&page->lru, &lruvec->lists[lru]);
 		__count_vm_event(PGROTATED);
 	}
 
 	if (active)
 		__count_vm_event(PGDEACTIVATE);
-	update_page_reclaim_stat(zone, page, file, 0);
+	update_page_reclaim_stat(lruvec, file, 0);
 }
 
 /*
@@ -582,6 +578,7 @@ void release_pages(struct page **pages,
 	int i;
 	LIST_HEAD(pages_to_free);
 	struct zone *zone = NULL;
+	struct lruvec *lruvec;
 	unsigned long uninitialized_var(flags);
 
 	for (i = 0; i < nr; i++) {
@@ -609,9 +606,11 @@ void release_pages(struct page **pages,
 				zone = pagezone;
 				spin_lock_irqsave(&zone->lru_lock, flags);
 			}
+
+			lruvec = mem_cgroup_page_lruvec(page, zone);
 			VM_BUG_ON(!PageLRU(page));
 			__ClearPageLRU(page);
-			del_page_from_lru_list(zone, page, page_off_lru(page));
+			del_page_from_lru_list(page, lruvec, page_off_lru(page));
 		}
 
 		list_add(&page->lru, &pages_to_free);
@@ -643,8 +642,8 @@ EXPORT_SYMBOL(__pagevec_release);
 
 #ifdef CONFIG_TRANSPARENT_HUGEPAGE
 /* used by __split_huge_page_refcount() */
-void lru_add_page_tail(struct zone* zone,
-		       struct page *page, struct page *page_tail)
+void lru_add_page_tail(struct page *page, struct page *page_tail,
+		       struct lruvec *lruvec)
 {
 	int active;
 	enum lru_list lru;
@@ -653,7 +652,7 @@ void lru_add_page_tail(struct zone* zone
 	VM_BUG_ON(!PageHead(page));
 	VM_BUG_ON(PageCompound(page_tail));
 	VM_BUG_ON(PageLRU(page_tail));
-	VM_BUG_ON(NR_CPUS != 1 && !spin_is_locked(&zone->lru_lock));
+	VM_BUG_ON(NR_CPUS != 1 && !spin_is_locked(&lruvec->zone->lru_lock));
 
 	SetPageLRU(page_tail);
 
@@ -666,7 +665,7 @@ void lru_add_page_tail(struct zone* zone
 			active = 0;
 			lru = LRU_INACTIVE_ANON;
 		}
-		update_page_reclaim_stat(zone, page_tail, file, active);
+		update_page_reclaim_stat(lruvec, file, active);
 	} else {
 		SetPageUnevictable(page_tail);
 		lru = LRU_UNEVICTABLE;
@@ -683,17 +682,17 @@ void lru_add_page_tail(struct zone* zone
 		 * Use the standard add function to put page_tail on the list,
 		 * but then correct its position so they all end up in order.
 		 */
-		add_page_to_lru_list(zone, page_tail, lru);
+		add_page_to_lru_list(page_tail, lruvec, lru);
 		list_head = page_tail->lru.prev;
 		list_move_tail(&page_tail->lru, list_head);
 	}
 }
 #endif /* CONFIG_TRANSPARENT_HUGEPAGE */
 
-static void __pagevec_lru_add_fn(struct page *page, void *arg)
+static void __pagevec_lru_add_fn(struct page *page, struct lruvec *lruvec,
+				 void *arg)
 {
 	enum lru_list lru = (enum lru_list)arg;
-	struct zone *zone = page_zone(page);
 	int file = is_file_lru(lru);
 	int active = is_active_lru(lru);
 
@@ -704,8 +703,8 @@ static void __pagevec_lru_add_fn(struct
 	SetPageLRU(page);
 	if (active)
 		SetPageActive(page);
-	update_page_reclaim_stat(zone, page, file, active);
-	add_page_to_lru_list(zone, page, lru);
+	update_page_reclaim_stat(lruvec, file, active);
+	add_page_to_lru_list(page, lruvec, lru);
 }
 
 /*
--- mmotm.orig/mm/vmscan.c	2012-02-18 11:57:28.375524252 -0800
+++ mmotm/mm/vmscan.c	2012-02-18 11:57:35.587524424 -0800
@@ -1124,6 +1124,7 @@ static unsigned long isolate_lru_pages(u
 		unsigned long *nr_scanned, struct scan_control *sc,
 		isolate_mode_t mode, int active, int file)
 {
+	struct lruvec *home_lruvec = lruvec;
 	struct list_head *src;
 	unsigned long nr_taken = 0;
 	unsigned long nr_lumpy_taken = 0;
@@ -1140,6 +1141,7 @@ static unsigned long isolate_lru_pages(u
 
 	for (scan = 0; scan < nr_to_scan && !list_empty(src); scan++) {
 		struct page *page;
+		int isolated_pages;
 		unsigned long pfn;
 		unsigned long end_pfn;
 		unsigned long page_pfn;
@@ -1152,9 +1154,10 @@ static unsigned long isolate_lru_pages(u
 
 		switch (__isolate_lru_page(page, mode, file)) {
 		case 0:
-			mem_cgroup_lru_del(page);
+			isolated_pages = hpage_nr_pages(page);
+			mem_cgroup_update_lru_size(lruvec, lru, -isolated_pages);
 			list_move(&page->lru, dst);
-			nr_taken += hpage_nr_pages(page);
+			nr_taken += isolated_pages;
 			break;
 
 		case -EBUSY:
@@ -1209,11 +1212,13 @@ static unsigned long isolate_lru_pages(u
 				break;
 
 			if (__isolate_lru_page(cursor_page, mode, file) == 0) {
-				unsigned int isolated_pages;
-
-				mem_cgroup_lru_del(cursor_page);
-				list_move(&cursor_page->lru, dst);
+				lruvec = mem_cgroup_page_lruvec(cursor_page,
+								lruvec->zone);
 				isolated_pages = hpage_nr_pages(cursor_page);
+				mem_cgroup_update_lru_size(lruvec,
+					page_lru(cursor_page), -isolated_pages);
+				list_move(&cursor_page->lru, dst);
+
 				nr_taken += isolated_pages;
 				nr_lumpy_taken += isolated_pages;
 				if (PageDirty(cursor_page))
@@ -1243,6 +1248,8 @@ static unsigned long isolate_lru_pages(u
 		/* If we break out of the loop above, lumpy reclaim failed */
 		if (pfn < end_pfn)
 			nr_lumpy_failed++;
+
+		lruvec = home_lruvec;
 	}
 
 	*nr_scanned = scan;
@@ -1288,15 +1295,16 @@ int isolate_lru_page(struct page *page)
 
 	if (PageLRU(page)) {
 		struct zone *zone = page_zone(page);
+		struct lruvec *lruvec;
 
 		spin_lock_irq(&zone->lru_lock);
+		lruvec = mem_cgroup_page_lruvec(page, zone);
 		if (PageLRU(page)) {
 			int lru = page_lru(page);
-			ret = 0;
 			get_page(page);
 			ClearPageLRU(page);
-
-			del_page_from_lru_list(zone, page, lru);
+			del_page_from_lru_list(page, lruvec, lru);
+			ret = 0;
 		}
 		spin_unlock_irq(&zone->lru_lock);
 	}
@@ -1350,9 +1358,13 @@ putback_inactive_pages(struct lruvec *lr
 			spin_lock_irq(&zone->lru_lock);
 			continue;
 		}
+
+		lruvec = mem_cgroup_page_lruvec(page, zone);
+
 		SetPageLRU(page);
 		lru = page_lru(page);
-		add_page_to_lru_list(zone, page, lru);
+		add_page_to_lru_list(page, lruvec, lru);
+
 		if (is_active_lru(lru)) {
 			int file = is_file_lru(lru);
 			int numpages = hpage_nr_pages(page);
@@ -1361,7 +1373,7 @@ putback_inactive_pages(struct lruvec *lr
 		if (put_page_testzero(page)) {
 			__ClearPageLRU(page);
 			__ClearPageActive(page);
-			del_page_from_lru_list(zone, page, lru);
+			del_page_from_lru_list(page, lruvec, lru);
 
 			if (unlikely(PageCompound(page))) {
 				spin_unlock_irq(&zone->lru_lock);
@@ -1599,30 +1611,32 @@ shrink_inactive_list(unsigned long nr_to
  * But we had to alter page->flags anyway.
  */
 
-static void move_active_pages_to_lru(struct zone *zone,
+static void move_active_pages_to_lru(struct lruvec *lruvec,
 				     struct list_head *list,
 				     struct list_head *pages_to_free,
 				     enum lru_list lru)
 {
+	struct zone *zone = lruvec->zone;
 	unsigned long pgmoved = 0;
 	struct page *page;
+	int nr_pages;
 
 	while (!list_empty(list)) {
-		struct lruvec *lruvec;
-
 		page = lru_to_page(list);
+		lruvec = mem_cgroup_page_lruvec(page, zone);
 
 		VM_BUG_ON(PageLRU(page));
 		SetPageLRU(page);
 
-		lruvec = mem_cgroup_lru_add_list(zone, page, lru);
+		nr_pages = hpage_nr_pages(page);
 		list_move(&page->lru, &lruvec->lists[lru]);
-		pgmoved += hpage_nr_pages(page);
+		mem_cgroup_update_lru_size(lruvec, lru, nr_pages);
+		pgmoved += nr_pages;
 
 		if (put_page_testzero(page)) {
 			__ClearPageLRU(page);
 			__ClearPageActive(page);
-			del_page_from_lru_list(zone, page, lru);
+			del_page_from_lru_list(page, lruvec, lru);
 
 			if (unlikely(PageCompound(page))) {
 				spin_unlock_irq(&zone->lru_lock);
@@ -1730,9 +1744,9 @@ static void shrink_active_list(unsigned
 	 */
 	reclaim_stat->recent_rotated[file] += nr_rotated;
 
-	move_active_pages_to_lru(zone, &l_active, &l_hold,
+	move_active_pages_to_lru(lruvec, &l_active, &l_hold,
 						LRU_ACTIVE + file * LRU_FILE);
-	move_active_pages_to_lru(zone, &l_inactive, &l_hold,
+	move_active_pages_to_lru(lruvec, &l_inactive, &l_hold,
 						LRU_BASE   + file * LRU_FILE);
 	__mod_zone_page_state(zone, NR_ISOLATED_ANON + file, -nr_taken);
 	spin_unlock_irq(&zone->lru_lock);
@@ -3529,6 +3543,7 @@ void check_move_unevictable_pages(struct
 			zone = pagezone;
 			spin_lock_irq(&zone->lru_lock);
 		}
+		lruvec = mem_cgroup_page_lruvec(page, zone);
 
 		if (!PageLRU(page) || !PageUnevictable(page))
 			continue;
@@ -3538,11 +3553,8 @@ void check_move_unevictable_pages(struct
 
 			VM_BUG_ON(PageActive(page));
 			ClearPageUnevictable(page);
-			__dec_zone_state(zone, NR_UNEVICTABLE);
-			lruvec = mem_cgroup_lru_move_lists(zone, page,
-						LRU_UNEVICTABLE, lru);
-			list_move(&page->lru, &lruvec->lists[lru]);
-			__inc_zone_state(zone, NR_INACTIVE_ANON + lru);
+			del_page_from_lru_list(page, lruvec, LRU_UNEVICTABLE);
+			add_page_to_lru_list(page, lruvec, lru);
 			pgrescued++;
 		}
 	}

^ permalink raw reply	[flat|nested] 72+ messages in thread

* [PATCH 4/10] mm/memcg: apply add/del_page to lruvec
@ 2012-02-20 23:32   ` Hugh Dickins
  0 siblings, 0 replies; 72+ messages in thread
From: Hugh Dickins @ 2012-02-20 23:32 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Konstantin Khlebnikov, KAMEZAWA Hiroyuki, Johannes Weiner,
	Ying Han, linux-mm, linux-kernel

Go further: pass lruvec instead of zone to add_page_to_lru_list() and
del_page_from_lru_list(); and have pagevec_lru_move_fn() pass lruvec
down to its target functions.

This cleanup eliminates a swathe of cruft in memcontrol.c,
including mem_cgroup_lru_add_list(), mem_cgroup_lru_del_list() and
mem_cgroup_lru_move_lists(), which never actually touched the lists.

In their place, mem_cgroup_page_lruvec() decides the lruvec
(previously a side-effect of the add), and mem_cgroup_update_lru_size()
maintains the lru_size stats.
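
For illustration only (a sketch of the resulting convention, not an
extra hunk of this patch): a caller which used to pass zone now looks
up the lruvec once under zone->lru_lock and passes that down instead -

	spin_lock_irq(&zone->lru_lock);
	lruvec = mem_cgroup_page_lruvec(page, zone);
	del_page_from_lru_list(page, lruvec, page_lru(page));
	...
	add_page_to_lru_list(page, lruvec, page_lru(page));
	spin_unlock_irq(&zone->lru_lock);

with add_page_to_lru_list() and del_page_from_lru_list() calling
mem_cgroup_update_lru_size() themselves, to keep lru_size in step.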

Whilst these are simplifications in their own right, the goal is to
bring the evaluation of lruvec next to the spin_locking of the lrus,
in preparation for the next patch.

Signed-off-by: Hugh Dickins <hughd@google.com>
---
 include/linux/memcontrol.h |   40 ++-----------
 include/linux/mm_inline.h  |   20 +++---
 include/linux/swap.h       |    4 -
 mm/compaction.c            |    5 +
 mm/huge_memory.c           |    5 +
 mm/memcontrol.c            |  102 +++++++----------------------------
 mm/swap.c                  |   85 ++++++++++++++---------------
 mm/vmscan.c                |   60 ++++++++++++--------
 8 files changed, 128 insertions(+), 193 deletions(-)

--- mmotm.orig/include/linux/memcontrol.h	2012-02-18 11:57:28.371524252 -0800
+++ mmotm/include/linux/memcontrol.h	2012-02-18 11:57:35.583524425 -0800
@@ -63,13 +63,9 @@ extern int mem_cgroup_cache_charge(struc
 					gfp_t gfp_mask);
 
 struct lruvec *mem_cgroup_zone_lruvec(struct zone *, struct mem_cgroup *);
+extern struct lruvec *mem_cgroup_page_lruvec(struct page *, struct zone *);
 extern struct mem_cgroup *mem_cgroup_from_lruvec(struct lruvec *lruvec);
-struct lruvec *mem_cgroup_lru_add_list(struct zone *, struct page *,
-				       enum lru_list);
-void mem_cgroup_lru_del_list(struct page *, enum lru_list);
-void mem_cgroup_lru_del(struct page *);
-struct lruvec *mem_cgroup_lru_move_lists(struct zone *, struct page *,
-					 enum lru_list, enum lru_list);
+extern void mem_cgroup_update_lru_size(struct lruvec *, enum lru_list, int);
 
 /* For coalescing uncharge for reducing memcg' overhead*/
 extern void mem_cgroup_uncharge_start(void);
@@ -119,8 +115,6 @@ int mem_cgroup_inactive_file_is_low(stru
 int mem_cgroup_select_victim_node(struct mem_cgroup *memcg);
 unsigned long mem_cgroup_zone_nr_lru_pages(struct lruvec *lruvec,
 					   unsigned int lrumask);
-struct zone_reclaim_stat*
-mem_cgroup_get_reclaim_stat_from_page(struct page *page);
 extern void mem_cgroup_print_oom_info(struct mem_cgroup *memcg,
 					struct task_struct *p);
 extern void mem_cgroup_replace_page_cache(struct page *oldpage,
@@ -248,32 +242,20 @@ static inline struct lruvec *mem_cgroup_
 	return &zone->lruvec;
 }
 
-static inline struct mem_cgroup *mem_cgroup_from_lruvec(struct lruvec *lruvec)
-{
-	return NULL;
-}
-
-static inline struct lruvec *mem_cgroup_lru_add_list(struct zone *zone,
-						     struct page *page,
-						     enum lru_list lru)
+static inline struct lruvec *mem_cgroup_page_lruvec(struct page *page,
+						    struct zone *zone)
 {
 	return &zone->lruvec;
 }
 
-static inline void mem_cgroup_lru_del_list(struct page *page, enum lru_list lru)
-{
-}
-
-static inline void mem_cgroup_lru_del(struct page *page)
+static inline struct mem_cgroup *mem_cgroup_from_lruvec(struct lruvec *lruvec)
 {
+	return NULL;
 }
 
-static inline struct lruvec *mem_cgroup_lru_move_lists(struct zone *zone,
-						       struct page *page,
-						       enum lru_list from,
-						       enum lru_list to)
+static inline void mem_cgroup_update_lru_size(struct lruvec *lruvec,
+					      enum lru_list lru, int increment)
 {
-	return &zone->lruvec;
 }
 
 static inline struct mem_cgroup *try_get_mem_cgroup_from_page(struct page *page)
@@ -352,12 +334,6 @@ mem_cgroup_zone_nr_lru_pages(struct lruv
 	return 0;
 }
 
-static inline struct zone_reclaim_stat*
-mem_cgroup_get_reclaim_stat_from_page(struct page *page)
-{
-	return NULL;
-}
-
 static inline void
 mem_cgroup_print_oom_info(struct mem_cgroup *memcg, struct task_struct *p)
 {
--- mmotm.orig/include/linux/mm_inline.h	2012-02-18 11:56:23.639522714 -0800
+++ mmotm/include/linux/mm_inline.h	2012-02-18 11:57:35.583524425 -0800
@@ -21,22 +21,22 @@ static inline int page_is_file_cache(str
 	return !PageSwapBacked(page);
 }
 
-static inline void
-add_page_to_lru_list(struct zone *zone, struct page *page, enum lru_list lru)
+static inline void add_page_to_lru_list(struct page *page,
+				struct lruvec *lruvec, enum lru_list lru)
 {
-	struct lruvec *lruvec;
-
-	lruvec = mem_cgroup_lru_add_list(zone, page, lru);
+	int nr_pages = hpage_nr_pages(page);
+	mem_cgroup_update_lru_size(lruvec, lru, nr_pages);
 	list_add(&page->lru, &lruvec->lists[lru]);
-	__mod_zone_page_state(zone, NR_LRU_BASE + lru, hpage_nr_pages(page));
+	__mod_zone_page_state(lruvec->zone, NR_LRU_BASE + lru, nr_pages);
 }
 
-static inline void
-del_page_from_lru_list(struct zone *zone, struct page *page, enum lru_list lru)
+static inline void del_page_from_lru_list(struct page *page,
+				struct lruvec *lruvec, enum lru_list lru)
 {
-	mem_cgroup_lru_del_list(page, lru);
+	int nr_pages = hpage_nr_pages(page);
+	mem_cgroup_update_lru_size(lruvec, lru, -nr_pages);
 	list_del(&page->lru);
-	__mod_zone_page_state(zone, NR_LRU_BASE + lru, -hpage_nr_pages(page));
+	__mod_zone_page_state(lruvec->zone, NR_LRU_BASE + lru, -nr_pages);
 }
 
 /**
--- mmotm.orig/include/linux/swap.h	2012-02-18 11:56:23.639522714 -0800
+++ mmotm/include/linux/swap.h	2012-02-18 11:57:35.583524425 -0800
@@ -224,8 +224,8 @@ extern unsigned int nr_free_pagecache_pa
 /* linux/mm/swap.c */
 extern void __lru_cache_add(struct page *, enum lru_list lru);
 extern void lru_cache_add_lru(struct page *, enum lru_list lru);
-extern void lru_add_page_tail(struct zone* zone,
-			      struct page *page, struct page *page_tail);
+extern void lru_add_page_tail(struct page *page, struct page *page_tail,
+			      struct lruvec *lruvec);
 extern void activate_page(struct page *);
 extern void mark_page_accessed(struct page *);
 extern void lru_add_drain(void);
--- mmotm.orig/mm/compaction.c	2012-02-18 11:56:23.639522714 -0800
+++ mmotm/mm/compaction.c	2012-02-18 11:57:35.583524425 -0800
@@ -262,6 +262,7 @@ static isolate_migrate_t isolate_migrate
 	unsigned long nr_scanned = 0, nr_isolated = 0;
 	struct list_head *migratelist = &cc->migratepages;
 	isolate_mode_t mode = ISOLATE_ACTIVE|ISOLATE_INACTIVE;
+	struct lruvec *lruvec;
 
 	/* Do not scan outside zone boundaries */
 	low_pfn = max(cc->migrate_pfn, zone->zone_start_pfn);
@@ -378,10 +379,12 @@ static isolate_migrate_t isolate_migrate
 		if (__isolate_lru_page(page, mode, 0) != 0)
 			continue;
 
+		lruvec = mem_cgroup_page_lruvec(page, zone);
+
 		VM_BUG_ON(PageTransCompound(page));
 
 		/* Successfully isolated */
-		del_page_from_lru_list(zone, page, page_lru(page));
+		del_page_from_lru_list(page, lruvec, page_lru(page));
 		list_add(&page->lru, migratelist);
 		cc->nr_migratepages++;
 		nr_isolated++;
--- mmotm.orig/mm/huge_memory.c	2012-02-18 11:56:23.639522714 -0800
+++ mmotm/mm/huge_memory.c	2012-02-18 11:57:35.583524425 -0800
@@ -1223,10 +1223,13 @@ static void __split_huge_page_refcount(s
 {
 	int i;
 	struct zone *zone = page_zone(page);
+	struct lruvec *lruvec;
 	int tail_count = 0;
 
 	/* prevent PageLRU to go away from under us, and freeze lru stats */
 	spin_lock_irq(&zone->lru_lock);
+	lruvec = mem_cgroup_page_lruvec(page, zone);
+
 	compound_lock(page);
 	/* complete memcg works before add pages to LRU */
 	mem_cgroup_split_huge_fixup(page);
@@ -1302,7 +1305,7 @@ static void __split_huge_page_refcount(s
 		BUG_ON(!PageSwapBacked(page_tail));
 
 
-		lru_add_page_tail(zone, page, page_tail);
+		lru_add_page_tail(page, page_tail, lruvec);
 	}
 	atomic_sub(tail_count, &page->_count);
 	BUG_ON(atomic_read(&page->_count) <= 0);
--- mmotm.orig/mm/memcontrol.c	2012-02-18 11:57:28.371524252 -0800
+++ mmotm/mm/memcontrol.c	2012-02-18 11:57:35.587524424 -0800
@@ -993,7 +993,7 @@ EXPORT_SYMBOL(mem_cgroup_count_vm_event)
 /**
  * mem_cgroup_zone_lruvec - get the lru list vector for a zone and memcg
  * @zone: zone of the wanted lruvec
- * @mem: memcg of the wanted lruvec
+ * @memcg: memcg of the wanted lruvec
  *
  * Returns the lru list vector holding pages for the given @zone and
  * @mem.  This can be the global zone lruvec, if the memory controller
@@ -1037,19 +1037,11 @@ struct mem_cgroup *mem_cgroup_from_lruve
  */
 
 /**
- * mem_cgroup_lru_add_list - account for adding an lru page and return lruvec
- * @zone: zone of the page
+ * mem_cgroup_page_lruvec - return lruvec for adding an lru page
  * @page: the page
- * @lru: current lru
- *
- * This function accounts for @page being added to @lru, and returns
- * the lruvec for the given @zone and the memcg @page is charged to.
- *
- * The callsite is then responsible for physically linking the page to
- * the returned lruvec->lists[@lru].
+ * @zone: zone of the page
  */
-struct lruvec *mem_cgroup_lru_add_list(struct zone *zone, struct page *page,
-				       enum lru_list lru)
+struct lruvec *mem_cgroup_page_lruvec(struct page *page, struct zone *zone)
 {
 	struct mem_cgroup_per_zone *mz;
 	struct mem_cgroup *memcg;
@@ -1061,66 +1053,31 @@ struct lruvec *mem_cgroup_lru_add_list(s
 	pc = lookup_page_cgroup(page);
 	memcg = pc->mem_cgroup;
 	mz = page_cgroup_zoneinfo(memcg, page);
-	/* compound_order() is stabilized through lru_lock */
-	mz->lru_size[lru] += 1 << compound_order(page);
 	return &mz->lruvec;
 }
 
 /**
- * mem_cgroup_lru_del_list - account for removing an lru page
- * @page: the page
- * @lru: target lru
- *
- * This function accounts for @page being removed from @lru.
+ * mem_cgroup_update_lru_size - account for adding or removing an lru page
+ * @lruvec: mem_cgroup per zone lru vector
+ * @lru: index of lru list the page is sitting on
+ * @nr_pages: positive when adding or negative when removing
  *
- * The callsite is then responsible for physically unlinking
- * @page->lru.
+ * This function must be called when a page is added to or removed from an
+ * lru list.
  */
-void mem_cgroup_lru_del_list(struct page *page, enum lru_list lru)
+void mem_cgroup_update_lru_size(struct lruvec *lruvec, enum lru_list lru,
+				int nr_pages)
 {
 	struct mem_cgroup_per_zone *mz;
-	struct mem_cgroup *memcg;
-	struct page_cgroup *pc;
+	unsigned long *lru_size;
 
 	if (mem_cgroup_disabled())
 		return;
 
-	pc = lookup_page_cgroup(page);
-	memcg = pc->mem_cgroup;
-	VM_BUG_ON(!memcg);
-	mz = page_cgroup_zoneinfo(memcg, page);
-	/* huge page split is done under lru_lock. so, we have no races. */
-	VM_BUG_ON(mz->lru_size[lru] < (1 << compound_order(page)));
-	mz->lru_size[lru] -= 1 << compound_order(page);
-}
-
-void mem_cgroup_lru_del(struct page *page)
-{
-	mem_cgroup_lru_del_list(page, page_lru(page));
-}
-
-/**
- * mem_cgroup_lru_move_lists - account for moving a page between lrus
- * @zone: zone of the page
- * @page: the page
- * @from: current lru
- * @to: target lru
- *
- * This function accounts for @page being moved between the lrus @from
- * and @to, and returns the lruvec for the given @zone and the memcg
- * @page is charged to.
- *
- * The callsite is then responsible for physically relinking
- * @page->lru to the returned lruvec->lists[@to].
- */
-struct lruvec *mem_cgroup_lru_move_lists(struct zone *zone,
-					 struct page *page,
-					 enum lru_list from,
-					 enum lru_list to)
-{
-	/* XXX: Optimize this, especially for @from == @to */
-	mem_cgroup_lru_del_list(page, from);
-	return mem_cgroup_lru_add_list(zone, page, to);
+	mz = container_of(lruvec, struct mem_cgroup_per_zone, lruvec);
+	lru_size = mz->lru_size + lru;
+	*lru_size += nr_pages;
+	VM_BUG_ON((long)(*lru_size) < 0);
 }
 
 /*
@@ -1203,24 +1160,6 @@ int mem_cgroup_inactive_file_is_low(stru
 	return (active > inactive);
 }
 
-struct zone_reclaim_stat *
-mem_cgroup_get_reclaim_stat_from_page(struct page *page)
-{
-	struct page_cgroup *pc;
-	struct mem_cgroup_per_zone *mz;
-
-	if (mem_cgroup_disabled())
-		return NULL;
-
-	pc = lookup_page_cgroup(page);
-	if (!PageCgroupUsed(pc))
-		return NULL;
-	/* Ensure pc->mem_cgroup is visible after reading PCG_USED. */
-	smp_rmb();
-	mz = page_cgroup_zoneinfo(pc->mem_cgroup, page);
-	return &mz->lruvec.reclaim_stat;
-}
-
 #define mem_cgroup_from_res_counter(counter, member)	\
 	container_of(counter, struct mem_cgroup, member)
 
@@ -2695,6 +2634,7 @@ __mem_cgroup_commit_charge_lrucare(struc
 	struct zone *zone = page_zone(page);
 	unsigned long flags;
 	bool removed = false;
+	struct lruvec *lruvec;
 
 	/*
 	 * In some case, SwapCache, FUSE(splice_buf->radixtree), the page
@@ -2703,13 +2643,15 @@ __mem_cgroup_commit_charge_lrucare(struc
 	 */
 	spin_lock_irqsave(&zone->lru_lock, flags);
 	if (PageLRU(page)) {
-		del_page_from_lru_list(zone, page, page_lru(page));
+		lruvec = mem_cgroup_zone_lruvec(zone, pc->mem_cgroup);
+		del_page_from_lru_list(page, lruvec, page_lru(page));
 		ClearPageLRU(page);
 		removed = true;
 	}
 	__mem_cgroup_commit_charge(memcg, page, 1, pc, ctype);
 	if (removed) {
-		add_page_to_lru_list(zone, page, page_lru(page));
+		lruvec = mem_cgroup_zone_lruvec(zone, pc->mem_cgroup);
+		add_page_to_lru_list(page, lruvec, page_lru(page));
 		SetPageLRU(page);
 	}
 	spin_unlock_irqrestore(&zone->lru_lock, flags);
--- mmotm.orig/mm/swap.c	2012-02-18 11:57:20.395524062 -0800
+++ mmotm/mm/swap.c	2012-02-18 11:57:35.587524424 -0800
@@ -47,13 +47,15 @@ static DEFINE_PER_CPU(struct pagevec, lr
 static void __page_cache_release(struct page *page)
 {
 	if (PageLRU(page)) {
-		unsigned long flags;
 		struct zone *zone = page_zone(page);
+		struct lruvec *lruvec;
+		unsigned long flags;
 
 		spin_lock_irqsave(&zone->lru_lock, flags);
+		lruvec = mem_cgroup_page_lruvec(page, zone);
 		VM_BUG_ON(!PageLRU(page));
 		__ClearPageLRU(page);
-		del_page_from_lru_list(zone, page, page_off_lru(page));
+		del_page_from_lru_list(page, lruvec, page_off_lru(page));
 		spin_unlock_irqrestore(&zone->lru_lock, flags);
 	}
 }
@@ -202,11 +204,12 @@ void put_pages_list(struct list_head *pa
 EXPORT_SYMBOL(put_pages_list);
 
 static void pagevec_lru_move_fn(struct pagevec *pvec,
-				void (*move_fn)(struct page *page, void *arg),
-				void *arg)
+	void (*move_fn)(struct page *page, struct lruvec *lruvec, void *arg),
+	void *arg)
 {
 	int i;
 	struct zone *zone = NULL;
+	struct lruvec *lruvec;
 	unsigned long flags = 0;
 
 	for (i = 0; i < pagevec_count(pvec); i++) {
@@ -220,7 +223,8 @@ static void pagevec_lru_move_fn(struct p
 			spin_lock_irqsave(&zone->lru_lock, flags);
 		}
 
-		(*move_fn)(page, arg);
+		lruvec = mem_cgroup_page_lruvec(page, zone);
+		(*move_fn)(page, lruvec, arg);
 	}
 	if (zone)
 		spin_unlock_irqrestore(&zone->lru_lock, flags);
@@ -228,16 +232,13 @@ static void pagevec_lru_move_fn(struct p
 	pagevec_reinit(pvec);
 }
 
-static void pagevec_move_tail_fn(struct page *page, void *arg)
+static void pagevec_move_tail_fn(struct page *page, struct lruvec *lruvec,
+				 void *arg)
 {
 	int *pgmoved = arg;
 
 	if (PageLRU(page) && !PageActive(page) && !PageUnevictable(page)) {
 		enum lru_list lru = page_lru_base_type(page);
-		struct lruvec *lruvec;
-
-		lruvec = mem_cgroup_lru_move_lists(page_zone(page),
-						   page, lru, lru);
 		list_move_tail(&page->lru, &lruvec->lists[lru]);
 		(*pgmoved)++;
 	}
@@ -276,35 +277,30 @@ void rotate_reclaimable_page(struct page
 	}
 }
 
-static void update_page_reclaim_stat(struct zone *zone, struct page *page,
+static void update_page_reclaim_stat(struct lruvec *lruvec,
 				     int file, int rotated)
 {
-	struct zone_reclaim_stat *reclaim_stat;
-
-	reclaim_stat = mem_cgroup_get_reclaim_stat_from_page(page);
-	if (!reclaim_stat)
-		reclaim_stat = &zone->lruvec.reclaim_stat;
+	struct zone_reclaim_stat *reclaim_stat = &lruvec->reclaim_stat;
 
 	reclaim_stat->recent_scanned[file]++;
 	if (rotated)
 		reclaim_stat->recent_rotated[file]++;
 }
 
-static void __activate_page(struct page *page, void *arg)
+static void __activate_page(struct page *page, struct lruvec *lruvec,
+			    void *arg)
 {
-	struct zone *zone = page_zone(page);
-
 	if (PageLRU(page) && !PageActive(page) && !PageUnevictable(page)) {
 		int file = page_is_file_cache(page);
 		int lru = page_lru_base_type(page);
-		del_page_from_lru_list(zone, page, lru);
 
+		del_page_from_lru_list(page, lruvec, lru);
 		SetPageActive(page);
 		lru += LRU_ACTIVE;
-		add_page_to_lru_list(zone, page, lru);
-		__count_vm_event(PGACTIVATE);
+		add_page_to_lru_list(page, lruvec, lru);
 
-		update_page_reclaim_stat(zone, page, file, 1);
+		__count_vm_event(PGACTIVATE);
+		update_page_reclaim_stat(lruvec, file, 1);
 	}
 }
 
@@ -341,7 +337,7 @@ void activate_page(struct page *page)
 	struct zone *zone = page_zone(page);
 
 	spin_lock_irq(&zone->lru_lock);
-	__activate_page(page, NULL);
+	__activate_page(page, mem_cgroup_page_lruvec(page, zone), NULL);
 	spin_unlock_irq(&zone->lru_lock);
 }
 #endif
@@ -408,11 +404,13 @@ void lru_cache_add_lru(struct page *page
 void add_page_to_unevictable_list(struct page *page)
 {
 	struct zone *zone = page_zone(page);
+	struct lruvec *lruvec;
 
 	spin_lock_irq(&zone->lru_lock);
+	lruvec = mem_cgroup_page_lruvec(page, zone);
 	SetPageUnevictable(page);
 	SetPageLRU(page);
-	add_page_to_lru_list(zone, page, LRU_UNEVICTABLE);
+	add_page_to_lru_list(page, lruvec, LRU_UNEVICTABLE);
 	spin_unlock_irq(&zone->lru_lock);
 }
 
@@ -437,11 +435,11 @@ void add_page_to_unevictable_list(struct
  * be write it out by flusher threads as this is much more effective
  * than the single-page writeout from reclaim.
  */
-static void lru_deactivate_fn(struct page *page, void *arg)
+static void lru_deactivate_fn(struct page *page, struct lruvec *lruvec,
+			      void *arg)
 {
 	int lru, file;
 	bool active;
-	struct zone *zone = page_zone(page);
 
 	if (!PageLRU(page))
 		return;
@@ -454,13 +452,13 @@ static void lru_deactivate_fn(struct pag
 		return;
 
 	active = PageActive(page);
-
 	file = page_is_file_cache(page);
 	lru = page_lru_base_type(page);
-	del_page_from_lru_list(zone, page, lru + active);
+
+	del_page_from_lru_list(page, lruvec, lru + active);
 	ClearPageActive(page);
 	ClearPageReferenced(page);
-	add_page_to_lru_list(zone, page, lru);
+	add_page_to_lru_list(page, lruvec, lru);
 
 	if (PageWriteback(page) || PageDirty(page)) {
 		/*
@@ -470,19 +468,17 @@ static void lru_deactivate_fn(struct pag
 		 */
 		SetPageReclaim(page);
 	} else {
-		struct lruvec *lruvec;
 		/*
 		 * The page's writeback ends up during pagevec
 		 * We moves tha page into tail of inactive.
 		 */
-		lruvec = mem_cgroup_lru_move_lists(zone, page, lru, lru);
 		list_move_tail(&page->lru, &lruvec->lists[lru]);
 		__count_vm_event(PGROTATED);
 	}
 
 	if (active)
 		__count_vm_event(PGDEACTIVATE);
-	update_page_reclaim_stat(zone, page, file, 0);
+	update_page_reclaim_stat(lruvec, file, 0);
 }
 
 /*
@@ -582,6 +578,7 @@ void release_pages(struct page **pages,
 	int i;
 	LIST_HEAD(pages_to_free);
 	struct zone *zone = NULL;
+	struct lruvec *lruvec;
 	unsigned long uninitialized_var(flags);
 
 	for (i = 0; i < nr; i++) {
@@ -609,9 +606,11 @@ void release_pages(struct page **pages,
 				zone = pagezone;
 				spin_lock_irqsave(&zone->lru_lock, flags);
 			}
+
+			lruvec = mem_cgroup_page_lruvec(page, zone);
 			VM_BUG_ON(!PageLRU(page));
 			__ClearPageLRU(page);
-			del_page_from_lru_list(zone, page, page_off_lru(page));
+			del_page_from_lru_list(page, lruvec, page_off_lru(page));
 		}
 
 		list_add(&page->lru, &pages_to_free);
@@ -643,8 +642,8 @@ EXPORT_SYMBOL(__pagevec_release);
 
 #ifdef CONFIG_TRANSPARENT_HUGEPAGE
 /* used by __split_huge_page_refcount() */
-void lru_add_page_tail(struct zone* zone,
-		       struct page *page, struct page *page_tail)
+void lru_add_page_tail(struct page *page, struct page *page_tail,
+		       struct lruvec *lruvec)
 {
 	int active;
 	enum lru_list lru;
@@ -653,7 +652,7 @@ void lru_add_page_tail(struct zone* zone
 	VM_BUG_ON(!PageHead(page));
 	VM_BUG_ON(PageCompound(page_tail));
 	VM_BUG_ON(PageLRU(page_tail));
-	VM_BUG_ON(NR_CPUS != 1 && !spin_is_locked(&zone->lru_lock));
+	VM_BUG_ON(NR_CPUS != 1 && !spin_is_locked(&lruvec->zone->lru_lock));
 
 	SetPageLRU(page_tail);
 
@@ -666,7 +665,7 @@ void lru_add_page_tail(struct zone* zone
 			active = 0;
 			lru = LRU_INACTIVE_ANON;
 		}
-		update_page_reclaim_stat(zone, page_tail, file, active);
+		update_page_reclaim_stat(lruvec, file, active);
 	} else {
 		SetPageUnevictable(page_tail);
 		lru = LRU_UNEVICTABLE;
@@ -683,17 +682,17 @@ void lru_add_page_tail(struct zone* zone
 		 * Use the standard add function to put page_tail on the list,
 		 * but then correct its position so they all end up in order.
 		 */
-		add_page_to_lru_list(zone, page_tail, lru);
+		add_page_to_lru_list(page_tail, lruvec, lru);
 		list_head = page_tail->lru.prev;
 		list_move_tail(&page_tail->lru, list_head);
 	}
 }
 #endif /* CONFIG_TRANSPARENT_HUGEPAGE */
 
-static void __pagevec_lru_add_fn(struct page *page, void *arg)
+static void __pagevec_lru_add_fn(struct page *page, struct lruvec *lruvec,
+				 void *arg)
 {
 	enum lru_list lru = (enum lru_list)arg;
-	struct zone *zone = page_zone(page);
 	int file = is_file_lru(lru);
 	int active = is_active_lru(lru);
 
@@ -704,8 +703,8 @@ static void __pagevec_lru_add_fn(struct
 	SetPageLRU(page);
 	if (active)
 		SetPageActive(page);
-	update_page_reclaim_stat(zone, page, file, active);
-	add_page_to_lru_list(zone, page, lru);
+	update_page_reclaim_stat(lruvec, file, active);
+	add_page_to_lru_list(page, lruvec, lru);
 }
 
 /*
--- mmotm.orig/mm/vmscan.c	2012-02-18 11:57:28.375524252 -0800
+++ mmotm/mm/vmscan.c	2012-02-18 11:57:35.587524424 -0800
@@ -1124,6 +1124,7 @@ static unsigned long isolate_lru_pages(u
 		unsigned long *nr_scanned, struct scan_control *sc,
 		isolate_mode_t mode, int active, int file)
 {
+	struct lruvec *home_lruvec = lruvec;
 	struct list_head *src;
 	unsigned long nr_taken = 0;
 	unsigned long nr_lumpy_taken = 0;
@@ -1140,6 +1141,7 @@ static unsigned long isolate_lru_pages(u
 
 	for (scan = 0; scan < nr_to_scan && !list_empty(src); scan++) {
 		struct page *page;
+		int isolated_pages;
 		unsigned long pfn;
 		unsigned long end_pfn;
 		unsigned long page_pfn;
@@ -1152,9 +1154,10 @@ static unsigned long isolate_lru_pages(u
 
 		switch (__isolate_lru_page(page, mode, file)) {
 		case 0:
-			mem_cgroup_lru_del(page);
+			isolated_pages = hpage_nr_pages(page);
+			mem_cgroup_update_lru_size(lruvec, lru, -isolated_pages);
 			list_move(&page->lru, dst);
-			nr_taken += hpage_nr_pages(page);
+			nr_taken += isolated_pages;
 			break;
 
 		case -EBUSY:
@@ -1209,11 +1212,13 @@ static unsigned long isolate_lru_pages(u
 				break;
 
 			if (__isolate_lru_page(cursor_page, mode, file) == 0) {
-				unsigned int isolated_pages;
-
-				mem_cgroup_lru_del(cursor_page);
-				list_move(&cursor_page->lru, dst);
+				lruvec = mem_cgroup_page_lruvec(cursor_page,
+								lruvec->zone);
 				isolated_pages = hpage_nr_pages(cursor_page);
+				mem_cgroup_update_lru_size(lruvec,
+					page_lru(cursor_page), -isolated_pages);
+				list_move(&cursor_page->lru, dst);
+
 				nr_taken += isolated_pages;
 				nr_lumpy_taken += isolated_pages;
 				if (PageDirty(cursor_page))
@@ -1243,6 +1248,8 @@ static unsigned long isolate_lru_pages(u
 		/* If we break out of the loop above, lumpy reclaim failed */
 		if (pfn < end_pfn)
 			nr_lumpy_failed++;
+
+		lruvec = home_lruvec;
 	}
 
 	*nr_scanned = scan;
@@ -1288,15 +1295,16 @@ int isolate_lru_page(struct page *page)
 
 	if (PageLRU(page)) {
 		struct zone *zone = page_zone(page);
+		struct lruvec *lruvec;
 
 		spin_lock_irq(&zone->lru_lock);
+		lruvec = mem_cgroup_page_lruvec(page, zone);
 		if (PageLRU(page)) {
 			int lru = page_lru(page);
-			ret = 0;
 			get_page(page);
 			ClearPageLRU(page);
-
-			del_page_from_lru_list(zone, page, lru);
+			del_page_from_lru_list(page, lruvec, lru);
+			ret = 0;
 		}
 		spin_unlock_irq(&zone->lru_lock);
 	}
@@ -1350,9 +1358,13 @@ putback_inactive_pages(struct lruvec *lr
 			spin_lock_irq(&zone->lru_lock);
 			continue;
 		}
+
+		lruvec = mem_cgroup_page_lruvec(page, zone);
+
 		SetPageLRU(page);
 		lru = page_lru(page);
-		add_page_to_lru_list(zone, page, lru);
+		add_page_to_lru_list(page, lruvec, lru);
+
 		if (is_active_lru(lru)) {
 			int file = is_file_lru(lru);
 			int numpages = hpage_nr_pages(page);
@@ -1361,7 +1373,7 @@ putback_inactive_pages(struct lruvec *lr
 		if (put_page_testzero(page)) {
 			__ClearPageLRU(page);
 			__ClearPageActive(page);
-			del_page_from_lru_list(zone, page, lru);
+			del_page_from_lru_list(page, lruvec, lru);
 
 			if (unlikely(PageCompound(page))) {
 				spin_unlock_irq(&zone->lru_lock);
@@ -1599,30 +1611,32 @@ shrink_inactive_list(unsigned long nr_to
  * But we had to alter page->flags anyway.
  */
 
-static void move_active_pages_to_lru(struct zone *zone,
+static void move_active_pages_to_lru(struct lruvec *lruvec,
 				     struct list_head *list,
 				     struct list_head *pages_to_free,
 				     enum lru_list lru)
 {
+	struct zone *zone = lruvec->zone;
 	unsigned long pgmoved = 0;
 	struct page *page;
+	int nr_pages;
 
 	while (!list_empty(list)) {
-		struct lruvec *lruvec;
-
 		page = lru_to_page(list);
+		lruvec = mem_cgroup_page_lruvec(page, zone);
 
 		VM_BUG_ON(PageLRU(page));
 		SetPageLRU(page);
 
-		lruvec = mem_cgroup_lru_add_list(zone, page, lru);
+		nr_pages = hpage_nr_pages(page);
 		list_move(&page->lru, &lruvec->lists[lru]);
-		pgmoved += hpage_nr_pages(page);
+		mem_cgroup_update_lru_size(lruvec, lru, nr_pages);
+		pgmoved += nr_pages;
 
 		if (put_page_testzero(page)) {
 			__ClearPageLRU(page);
 			__ClearPageActive(page);
-			del_page_from_lru_list(zone, page, lru);
+			del_page_from_lru_list(page, lruvec, lru);
 
 			if (unlikely(PageCompound(page))) {
 				spin_unlock_irq(&zone->lru_lock);
@@ -1730,9 +1744,9 @@ static void shrink_active_list(unsigned
 	 */
 	reclaim_stat->recent_rotated[file] += nr_rotated;
 
-	move_active_pages_to_lru(zone, &l_active, &l_hold,
+	move_active_pages_to_lru(lruvec, &l_active, &l_hold,
 						LRU_ACTIVE + file * LRU_FILE);
-	move_active_pages_to_lru(zone, &l_inactive, &l_hold,
+	move_active_pages_to_lru(lruvec, &l_inactive, &l_hold,
 						LRU_BASE   + file * LRU_FILE);
 	__mod_zone_page_state(zone, NR_ISOLATED_ANON + file, -nr_taken);
 	spin_unlock_irq(&zone->lru_lock);
@@ -3529,6 +3543,7 @@ void check_move_unevictable_pages(struct
 			zone = pagezone;
 			spin_lock_irq(&zone->lru_lock);
 		}
+		lruvec = mem_cgroup_page_lruvec(page, zone);
 
 		if (!PageLRU(page) || !PageUnevictable(page))
 			continue;
@@ -3538,11 +3553,8 @@ void check_move_unevictable_pages(struct
 
 			VM_BUG_ON(PageActive(page));
 			ClearPageUnevictable(page);
-			__dec_zone_state(zone, NR_UNEVICTABLE);
-			lruvec = mem_cgroup_lru_move_lists(zone, page,
-						LRU_UNEVICTABLE, lru);
-			list_move(&page->lru, &lruvec->lists[lru]);
-			__inc_zone_state(zone, NR_INACTIVE_ANON + lru);
+			del_page_from_lru_list(page, lruvec, LRU_UNEVICTABLE);
+			add_page_to_lru_list(page, lruvec, lru);
 			pgrescued++;
 		}
 	}


^ permalink raw reply	[flat|nested] 72+ messages in thread

* [PATCH 5/10] mm/memcg: introduce page_relock_lruvec
  2012-02-20 23:26 ` Hugh Dickins
@ 2012-02-20 23:33   ` Hugh Dickins
  -1 siblings, 0 replies; 72+ messages in thread
From: Hugh Dickins @ 2012-02-20 23:33 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Konstantin Khlebnikov, KAMEZAWA Hiroyuki, Johannes Weiner,
	Ying Han, linux-mm, linux-kernel

Delete the mem_cgroup_page_lruvec() which we just added, replacing it
and the nearby spin_lock_irq or spin_lock_irqsave of zone->lru_lock:
in most places with page_lock_lruvec() or page_relock_lruvec() (the
former being a simple case of the latter), elsewhere with plain
lock_lruvec().  unlock_lruvec() does the spin_unlock_irqrestore for
them all.

page_relock_lruvec() is born from that "pagezone" pattern in swap.c
and vmscan.c, where we loop over an array of pages, switching lock
whenever the zone changes: bearing in mind that if we were to refine
that lock to per-memcg per-zone, then we would have to switch whenever
the memcg changes too.

page_relock_lruvec(page, &lruvec) locates the right lruvec for page,
unlocks the old lruvec if different (and not NULL), locks the new,
and updates lruvec on return: so that we shall have just one routine
to locate and lock the lruvec, whereas originally it got re-evaluated
at different stages.  But I don't yet know how to satisfy sparse(1).
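
To illustrate the pattern (a sketch only, mirroring what
pagevec_lru_move_fn() becomes below, with generic loop variables):

	struct lruvec *lruvec = NULL;

	for (i = 0; i < nr; i++) {
		struct page *page = pages[i];

		/* unlock the old lruvec and lock the new if it differs */
		page_relock_lruvec(page, &lruvec);
		/* ... operate on page under its lru_lock ... */
	}
	if (lruvec)
		unlock_lruvec(lruvec);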

There are some loops where we never change zone, and a non-memcg kernel
would not change memcg: use no-op mem_cgroup_page_relock_lruvec() there.

In compaction's isolate_migratepages(), although we do know the zone,
we don't know the lruvec in advance: allow for taking the lock later,
and reorganize its cond_resched() lock-dropping accordingly.

page_relock_lruvec() (and its wrappers) is actually an _irqsave operation:
there are a few cases in swap.c where it may be needed at interrupt time
(to free or to rotate a page on I/O completion).  Ideally(?) we would use
straightforward _irq disabling elsewhere, but the variants get confusing,
and page_relock_lruvec() will itself grow more complicated in subsequent
patches: so keep it simple for now with just the one irqsaver everywhere.

Passing an irqflags argument/pointer down several levels looks messy
too, and I'm reluctant to add any more to the page reclaim stack: so
save the irqflags alongside the lru_lock and restore them from there.

It's a little sad now to be including mm.h in swap.h to get page_zone();
but I think that swap.h (despite its name) is the right place for these
lru functions, and without those inlines the optimizer cannot do so
well in the !MEM_RES_CTLR case.

(Is this an appropriate place to confess? that even at the end of the
series, we're left with a small bug in putback_inactive_pages(), one
that I've not yet decided is worth fixing: reclaim_stat there is from
the lruvec on entry, but we might update stats after dropping its lock.
And do zone->pages_scanned and zone->all_unreclaimable need locking?
page_alloc.c thinks zone->lock, vmscan.c thought zone->lru_lock,
and that weakens if we now split lru_lock by memcg.)

Signed-off-by: Hugh Dickins <hughd@google.com>
---
 include/linux/memcontrol.h |    7 --
 include/linux/mmzone.h     |    1 
 include/linux/swap.h       |   65 +++++++++++++++++++++++
 mm/compaction.c            |   45 ++++++++++------
 mm/huge_memory.c           |   10 +--
 mm/memcontrol.c            |   56 ++++++++++++--------
 mm/swap.c                  |   67 +++++++-----------------
 mm/vmscan.c                |   95 ++++++++++++++++-------------------
 8 files changed, 194 insertions(+), 152 deletions(-)

--- mmotm.orig/include/linux/memcontrol.h	2012-02-18 11:57:35.583524425 -0800
+++ mmotm/include/linux/memcontrol.h	2012-02-18 11:57:42.675524592 -0800
@@ -63,7 +63,6 @@ extern int mem_cgroup_cache_charge(struc
 					gfp_t gfp_mask);
 
 struct lruvec *mem_cgroup_zone_lruvec(struct zone *, struct mem_cgroup *);
-extern struct lruvec *mem_cgroup_page_lruvec(struct page *, struct zone *);
 extern struct mem_cgroup *mem_cgroup_from_lruvec(struct lruvec *lruvec);
 extern void mem_cgroup_update_lru_size(struct lruvec *, enum lru_list, int);
 
@@ -241,12 +240,6 @@ static inline struct lruvec *mem_cgroup_
 {
 	return &zone->lruvec;
 }
-
-static inline struct lruvec *mem_cgroup_page_lruvec(struct page *page,
-						    struct zone *zone)
-{
-	return &zone->lruvec;
-}
 
 static inline struct mem_cgroup *mem_cgroup_from_lruvec(struct lruvec *lruvec)
 {
--- mmotm.orig/include/linux/mmzone.h	2012-02-18 11:57:28.371524252 -0800
+++ mmotm/include/linux/mmzone.h	2012-02-18 11:57:42.675524592 -0800
@@ -374,6 +374,7 @@ struct zone {
 
 	/* Fields commonly accessed by the page reclaim scanner */
 	spinlock_t		lru_lock;
+	unsigned long		irqflags;
 	struct lruvec		lruvec;
 
 	unsigned long		pages_scanned;	   /* since last reclaim */
--- mmotm.orig/include/linux/swap.h	2012-02-18 11:57:35.583524425 -0800
+++ mmotm/include/linux/swap.h	2012-02-18 11:57:42.675524592 -0800
@@ -8,7 +8,7 @@
 #include <linux/memcontrol.h>
 #include <linux/sched.h>
 #include <linux/node.h>
-
+#include <linux/mm.h>			/* for page_zone(page) */
 #include <linux/atomic.h>
 #include <asm/page.h>
 
@@ -250,6 +250,69 @@ static inline void lru_cache_add_file(st
 	__lru_cache_add(page, LRU_INACTIVE_FILE);
 }
 
+static inline spinlock_t *lru_lockptr(struct lruvec *lruvec)
+{
+	return &lruvec->zone->lru_lock;
+}
+
+static inline void lock_lruvec(struct lruvec *lruvec)
+{
+	struct zone *zone = lruvec->zone;
+	unsigned long irqflags;
+
+	spin_lock_irqsave(&zone->lru_lock, irqflags);
+	zone->irqflags = irqflags;
+}
+
+static inline void unlock_lruvec(struct lruvec *lruvec)
+{
+	struct zone *zone = lruvec->zone;
+	unsigned long irqflags;
+
+	irqflags = zone->irqflags;
+	spin_unlock_irqrestore(&zone->lru_lock, irqflags);
+}
+
+#ifdef CONFIG_CGROUP_MEM_RES_CTLR
+/* linux/mm/memcontrol.c */
+extern void page_relock_lruvec(struct page *page, struct lruvec **lruvp);
+
+static inline void
+mem_cgroup_page_relock_lruvec(struct page *page, struct lruvec **lruvp)
+{
+	page_relock_lruvec(page, lruvp);
+}
+#else
+static inline void page_relock_lruvec(struct page *page, struct lruvec **lruvp)
+{
+	struct lruvec *lruvec;
+
+	lruvec = &page_zone(page)->lruvec;
+	if (*lruvp && *lruvp != lruvec) {
+		unlock_lruvec(*lruvp);
+		*lruvp = NULL;
+	}
+	if (!*lruvp) {
+		*lruvp = lruvec;
+		lock_lruvec(lruvec);
+	}
+}
+
+static inline void
+mem_cgroup_page_relock_lruvec(struct page *page, struct lruvec **lruvp)
+{
+	/* No-op used in a few places where zone is known not to change */
+}
+#endif /* CONFIG_CGROUP_MEM_RES_CTLR */
+
+static inline struct lruvec *page_lock_lruvec(struct page *page)
+{
+	struct lruvec *lruvec = NULL;
+
+	page_relock_lruvec(page, &lruvec);
+	return lruvec;
+}
+
 /* linux/mm/vmscan.c */
 extern unsigned long try_to_free_pages(struct zonelist *zonelist, int order,
 					gfp_t gfp_mask, nodemask_t *mask);
--- mmotm.orig/mm/compaction.c	2012-02-18 11:57:35.583524425 -0800
+++ mmotm/mm/compaction.c	2012-02-18 11:57:42.675524592 -0800
@@ -262,7 +262,7 @@ static isolate_migrate_t isolate_migrate
 	unsigned long nr_scanned = 0, nr_isolated = 0;
 	struct list_head *migratelist = &cc->migratepages;
 	isolate_mode_t mode = ISOLATE_ACTIVE|ISOLATE_INACTIVE;
-	struct lruvec *lruvec;
+	struct lruvec *lruvec = NULL;
 
 	/* Do not scan outside zone boundaries */
 	low_pfn = max(cc->migrate_pfn, zone->zone_start_pfn);
@@ -293,26 +293,23 @@ static isolate_migrate_t isolate_migrate
 	}
 
 	/* Time to isolate some pages for migration */
-	cond_resched();
-	spin_lock_irq(&zone->lru_lock);
 	for (; low_pfn < end_pfn; low_pfn++) {
 		struct page *page;
-		bool locked = true;
 
-		/* give a chance to irqs before checking need_resched() */
-		if (!((low_pfn+1) % SWAP_CLUSTER_MAX)) {
-			spin_unlock_irq(&zone->lru_lock);
-			locked = false;
-		}
-		if (need_resched() || spin_is_contended(&zone->lru_lock)) {
-			if (locked)
-				spin_unlock_irq(&zone->lru_lock);
+		/* give a chance to irqs before cond_resched() */
+		if (lruvec) {
+			if (!((low_pfn+1) % SWAP_CLUSTER_MAX) ||
+			    spin_is_contended(lru_lockptr(lruvec)) ||
+			    need_resched()) {
+				unlock_lruvec(lruvec);
+				lruvec = NULL;
+			}
+		}
+		if (!lruvec) {
 			cond_resched();
-			spin_lock_irq(&zone->lru_lock);
 			if (fatal_signal_pending(current))
 				break;
-		} else if (!locked)
-			spin_lock_irq(&zone->lru_lock);
+		}
 
 		/*
 		 * migrate_pfn does not necessarily start aligned to a
@@ -359,6 +356,15 @@ static isolate_migrate_t isolate_migrate
 			continue;
 		}
 
+		if (!lruvec) {
+			/*
+			 * We do need to take the lock before advancing to
+			 * check PageLRU etc., but there's no guarantee that
+			 * the page we're peeking at has a stable memcg here.
+			 */
+			lruvec = &zone->lruvec;
+			lock_lruvec(lruvec);
+		}
 		if (!PageLRU(page))
 			continue;
 
@@ -379,7 +385,7 @@ static isolate_migrate_t isolate_migrate
 		if (__isolate_lru_page(page, mode, 0) != 0)
 			continue;
 
-		lruvec = mem_cgroup_page_lruvec(page, zone);
+		page_relock_lruvec(page, &lruvec);
 
 		VM_BUG_ON(PageTransCompound(page));
 
@@ -396,9 +402,14 @@ static isolate_migrate_t isolate_migrate
 		}
 	}
 
+	if (!lruvec)
+		local_irq_disable();
 	acct_isolated(zone, cc);
+	if (lruvec)
+		unlock_lruvec(lruvec);
+	else
+		local_irq_enable();
 
-	spin_unlock_irq(&zone->lru_lock);
 	cc->migrate_pfn = low_pfn;
 
 	trace_mm_compaction_isolate_migratepages(nr_scanned, nr_isolated);
--- mmotm.orig/mm/huge_memory.c	2012-02-18 11:57:35.583524425 -0800
+++ mmotm/mm/huge_memory.c	2012-02-18 11:57:42.679524592 -0800
@@ -1222,13 +1222,11 @@ static int __split_huge_page_splitting(s
 static void __split_huge_page_refcount(struct page *page)
 {
 	int i;
-	struct zone *zone = page_zone(page);
 	struct lruvec *lruvec;
 	int tail_count = 0;
 
 	/* prevent PageLRU to go away from under us, and freeze lru stats */
-	spin_lock_irq(&zone->lru_lock);
-	lruvec = mem_cgroup_page_lruvec(page, zone);
+	lruvec = page_lock_lruvec(page);
 
 	compound_lock(page);
 	/* complete memcg works before add pages to LRU */
@@ -1310,12 +1308,12 @@ static void __split_huge_page_refcount(s
 	atomic_sub(tail_count, &page->_count);
 	BUG_ON(atomic_read(&page->_count) <= 0);
 
-	__dec_zone_page_state(page, NR_ANON_TRANSPARENT_HUGEPAGES);
-	__mod_zone_page_state(zone, NR_ANON_PAGES, HPAGE_PMD_NR);
+	__mod_zone_page_state(lruvec->zone, NR_ANON_TRANSPARENT_HUGEPAGES, -1);
+	__mod_zone_page_state(lruvec->zone, NR_ANON_PAGES, HPAGE_PMD_NR);
 
 	ClearPageCompound(page);
 	compound_unlock(page);
-	spin_unlock_irq(&zone->lru_lock);
+	unlock_lruvec(lruvec);
 
 	for (i = 1; i < HPAGE_PMD_NR; i++) {
 		struct page *page_tail = page + i;
--- mmotm.orig/mm/memcontrol.c	2012-02-18 11:57:35.587524424 -0800
+++ mmotm/mm/memcontrol.c	2012-02-18 11:57:42.679524592 -0800
@@ -1037,23 +1037,36 @@ struct mem_cgroup *mem_cgroup_from_lruve
  */
 
 /**
- * mem_cgroup_page_lruvec - return lruvec for adding an lru page
+ * page_relock_lruvec - lock and update lruvec for this page, unlocking previous
  * @page: the page
- * @zone: zone of the page
+ * @lruvp: pointer to where to output lruvec; unlock input lruvec if non-NULL
  */
-struct lruvec *mem_cgroup_page_lruvec(struct page *page, struct zone *zone)
+void page_relock_lruvec(struct page *page, struct lruvec **lruvp)
 {
 	struct mem_cgroup_per_zone *mz;
 	struct mem_cgroup *memcg;
 	struct page_cgroup *pc;
+	struct lruvec *lruvec;
 
 	if (mem_cgroup_disabled())
-		return &zone->lruvec;
+		lruvec = &page_zone(page)->lruvec;
+	else {
+		pc = lookup_page_cgroup(page);
+		memcg = pc->mem_cgroup;
+		mz = page_cgroup_zoneinfo(memcg, page);
+		lruvec = &mz->lruvec;
+	}
 
-	pc = lookup_page_cgroup(page);
-	memcg = pc->mem_cgroup;
-	mz = page_cgroup_zoneinfo(memcg, page);
-	return &mz->lruvec;
+	/*
+	 * For the moment, simply lock by zone just as before.
+	 */
+	if (*lruvp && (*lruvp)->zone != lruvec->zone) {
+		unlock_lruvec(*lruvp);
+		*lruvp = NULL;
+	}
+	if (!*lruvp)
+		lock_lruvec(lruvec);
+	*lruvp = lruvec;
 }
 
 /**
@@ -2631,30 +2644,27 @@ __mem_cgroup_commit_charge_lrucare(struc
 					enum charge_type ctype)
 {
 	struct page_cgroup *pc = lookup_page_cgroup(page);
-	struct zone *zone = page_zone(page);
-	unsigned long flags;
-	bool removed = false;
 	struct lruvec *lruvec;
+	bool removed = false;
 
 	/*
 	 * In some case, SwapCache, FUSE(splice_buf->radixtree), the page
 	 * is already on LRU. It means the page may on some other page_cgroup's
 	 * LRU. Take care of it.
 	 */
-	spin_lock_irqsave(&zone->lru_lock, flags);
+	lruvec = page_lock_lruvec(page);
 	if (PageLRU(page)) {
-		lruvec = mem_cgroup_zone_lruvec(zone, pc->mem_cgroup);
 		del_page_from_lru_list(page, lruvec, page_lru(page));
 		ClearPageLRU(page);
 		removed = true;
 	}
 	__mem_cgroup_commit_charge(memcg, page, 1, pc, ctype);
 	if (removed) {
-		lruvec = mem_cgroup_zone_lruvec(zone, pc->mem_cgroup);
+		page_relock_lruvec(page, &lruvec);
 		add_page_to_lru_list(page, lruvec, page_lru(page));
 		SetPageLRU(page);
 	}
-	spin_unlock_irqrestore(&zone->lru_lock, flags);
+	unlock_lruvec(lruvec);
 }
 
 int mem_cgroup_cache_charge(struct page *page, struct mm_struct *mm,
@@ -3572,15 +3582,15 @@ static int mem_cgroup_force_empty_list(s
 				int node, int zid, enum lru_list lru)
 {
 	struct mem_cgroup_per_zone *mz;
-	unsigned long flags, loop;
+	unsigned long loop;
 	struct list_head *list;
 	struct page *busy;
-	struct zone *zone;
+	struct lruvec *lruvec;
 	int ret = 0;
 
-	zone = &NODE_DATA(node)->node_zones[zid];
 	mz = mem_cgroup_zoneinfo(memcg, node, zid);
-	list = &mz->lruvec.lists[lru];
+	lruvec = &mz->lruvec;
+	list = &lruvec->lists[lru];
 
 	loop = mz->lru_size[lru];
 	/* give some margin against EBUSY etc...*/
@@ -3591,19 +3601,19 @@ static int mem_cgroup_force_empty_list(s
 		struct page *page;
 
 		ret = 0;
-		spin_lock_irqsave(&zone->lru_lock, flags);
+		lock_lruvec(lruvec);
 		if (list_empty(list)) {
-			spin_unlock_irqrestore(&zone->lru_lock, flags);
+			unlock_lruvec(lruvec);
 			break;
 		}
 		page = list_entry(list->prev, struct page, lru);
 		if (busy == page) {
 			list_move(&page->lru, list);
 			busy = NULL;
-			spin_unlock_irqrestore(&zone->lru_lock, flags);
+			unlock_lruvec(lruvec);
 			continue;
 		}
-		spin_unlock_irqrestore(&zone->lru_lock, flags);
+		unlock_lruvec(lruvec);
 
 		pc = lookup_page_cgroup(page);
 
--- mmotm.orig/mm/swap.c	2012-02-18 11:57:35.587524424 -0800
+++ mmotm/mm/swap.c	2012-02-18 11:57:42.679524592 -0800
@@ -47,16 +47,13 @@ static DEFINE_PER_CPU(struct pagevec, lr
 static void __page_cache_release(struct page *page)
 {
 	if (PageLRU(page)) {
-		struct zone *zone = page_zone(page);
 		struct lruvec *lruvec;
-		unsigned long flags;
 
-		spin_lock_irqsave(&zone->lru_lock, flags);
-		lruvec = mem_cgroup_page_lruvec(page, zone);
+		lruvec = page_lock_lruvec(page);
 		VM_BUG_ON(!PageLRU(page));
 		__ClearPageLRU(page);
 		del_page_from_lru_list(page, lruvec, page_off_lru(page));
-		spin_unlock_irqrestore(&zone->lru_lock, flags);
+		unlock_lruvec(lruvec);
 	}
 }
 
@@ -208,26 +205,16 @@ static void pagevec_lru_move_fn(struct p
 	void *arg)
 {
 	int i;
-	struct zone *zone = NULL;
-	struct lruvec *lruvec;
-	unsigned long flags = 0;
+	struct lruvec *lruvec = NULL;
 
 	for (i = 0; i < pagevec_count(pvec); i++) {
 		struct page *page = pvec->pages[i];
-		struct zone *pagezone = page_zone(page);
 
-		if (pagezone != zone) {
-			if (zone)
-				spin_unlock_irqrestore(&zone->lru_lock, flags);
-			zone = pagezone;
-			spin_lock_irqsave(&zone->lru_lock, flags);
-		}
-
-		lruvec = mem_cgroup_page_lruvec(page, zone);
+		page_relock_lruvec(page, &lruvec);
 		(*move_fn)(page, lruvec, arg);
 	}
-	if (zone)
-		spin_unlock_irqrestore(&zone->lru_lock, flags);
+	if (lruvec)
+		unlock_lruvec(lruvec);
 	release_pages(pvec->pages, pvec->nr, pvec->cold);
 	pagevec_reinit(pvec);
 }
@@ -334,11 +321,11 @@ static inline void activate_page_drain(i
 
 void activate_page(struct page *page)
 {
-	struct zone *zone = page_zone(page);
+	struct lruvec *lruvec;
 
-	spin_lock_irq(&zone->lru_lock);
-	__activate_page(page, mem_cgroup_page_lruvec(page, zone), NULL);
-	spin_unlock_irq(&zone->lru_lock);
+	lruvec = page_lock_lruvec(page);
+	__activate_page(page, lruvec, NULL);
+	unlock_lruvec(lruvec);
 }
 #endif
 
@@ -403,15 +390,13 @@ void lru_cache_add_lru(struct page *page
  */
 void add_page_to_unevictable_list(struct page *page)
 {
-	struct zone *zone = page_zone(page);
 	struct lruvec *lruvec;
 
-	spin_lock_irq(&zone->lru_lock);
-	lruvec = mem_cgroup_page_lruvec(page, zone);
+	lruvec = page_lock_lruvec(page);
 	SetPageUnevictable(page);
 	SetPageLRU(page);
 	add_page_to_lru_list(page, lruvec, LRU_UNEVICTABLE);
-	spin_unlock_irq(&zone->lru_lock);
+	unlock_lruvec(lruvec);
 }
 
 /*
@@ -577,17 +562,15 @@ void release_pages(struct page **pages,
 {
 	int i;
 	LIST_HEAD(pages_to_free);
-	struct zone *zone = NULL;
-	struct lruvec *lruvec;
-	unsigned long uninitialized_var(flags);
+	struct lruvec *lruvec = NULL;
 
 	for (i = 0; i < nr; i++) {
 		struct page *page = pages[i];
 
 		if (unlikely(PageCompound(page))) {
-			if (zone) {
-				spin_unlock_irqrestore(&zone->lru_lock, flags);
-				zone = NULL;
+			if (lruvec) {
+				unlock_lruvec(lruvec);
+				lruvec = NULL;
 			}
 			put_compound_page(page);
 			continue;
@@ -597,17 +580,7 @@ void release_pages(struct page **pages,
 			continue;
 
 		if (PageLRU(page)) {
-			struct zone *pagezone = page_zone(page);
-
-			if (pagezone != zone) {
-				if (zone)
-					spin_unlock_irqrestore(&zone->lru_lock,
-									flags);
-				zone = pagezone;
-				spin_lock_irqsave(&zone->lru_lock, flags);
-			}
-
-			lruvec = mem_cgroup_page_lruvec(page, zone);
+			page_relock_lruvec(page, &lruvec);
 			VM_BUG_ON(!PageLRU(page));
 			__ClearPageLRU(page);
 			del_page_from_lru_list(page, lruvec, page_off_lru(page));
@@ -615,8 +588,8 @@ void release_pages(struct page **pages,
 
 		list_add(&page->lru, &pages_to_free);
 	}
-	if (zone)
-		spin_unlock_irqrestore(&zone->lru_lock, flags);
+	if (lruvec)
+		unlock_lruvec(lruvec);
 
 	free_hot_cold_page_list(&pages_to_free, cold);
 }
@@ -652,7 +625,7 @@ void lru_add_page_tail(struct page *page
 	VM_BUG_ON(!PageHead(page));
 	VM_BUG_ON(PageCompound(page_tail));
 	VM_BUG_ON(PageLRU(page_tail));
-	VM_BUG_ON(NR_CPUS != 1 && !spin_is_locked(&lruvec->zone->lru_lock));
+	VM_BUG_ON(NR_CPUS != 1 && !spin_is_locked(lru_lockptr(lruvec)));
 
 	SetPageLRU(page_tail);
 
--- mmotm.orig/mm/vmscan.c	2012-02-18 11:57:35.587524424 -0800
+++ mmotm/mm/vmscan.c	2012-02-18 11:57:42.679524592 -0800
@@ -1212,8 +1212,8 @@ static unsigned long isolate_lru_pages(u
 				break;
 
 			if (__isolate_lru_page(cursor_page, mode, file) == 0) {
-				lruvec = mem_cgroup_page_lruvec(cursor_page,
-								lruvec->zone);
+				mem_cgroup_page_relock_lruvec(cursor_page,
+								&lruvec);
 				isolated_pages = hpage_nr_pages(cursor_page);
 				mem_cgroup_update_lru_size(lruvec,
 					page_lru(cursor_page), -isolated_pages);
@@ -1294,11 +1294,9 @@ int isolate_lru_page(struct page *page)
 	VM_BUG_ON(!page_count(page));
 
 	if (PageLRU(page)) {
-		struct zone *zone = page_zone(page);
 		struct lruvec *lruvec;
 
-		spin_lock_irq(&zone->lru_lock);
-		lruvec = mem_cgroup_page_lruvec(page, zone);
+		lruvec = page_lock_lruvec(page);
 		if (PageLRU(page)) {
 			int lru = page_lru(page);
 			get_page(page);
@@ -1306,7 +1304,7 @@ int isolate_lru_page(struct page *page)
 			del_page_from_lru_list(page, lruvec, lru);
 			ret = 0;
 		}
-		spin_unlock_irq(&zone->lru_lock);
+		unlock_lruvec(lruvec);
 	}
 	return ret;
 }
@@ -1337,10 +1335,9 @@ static int too_many_isolated(struct zone
 }
 
 static noinline_for_stack void
-putback_inactive_pages(struct lruvec *lruvec, struct list_head *page_list)
+putback_inactive_pages(struct lruvec **lruvec, struct list_head *page_list)
 {
-	struct zone_reclaim_stat *reclaim_stat = &lruvec->reclaim_stat;
-	struct zone *zone = lruvec->zone;
+	struct zone_reclaim_stat *reclaim_stat = &(*lruvec)->reclaim_stat;
 	LIST_HEAD(pages_to_free);
 
 	/*
@@ -1353,17 +1350,18 @@ putback_inactive_pages(struct lruvec *lr
 		VM_BUG_ON(PageLRU(page));
 		list_del(&page->lru);
 		if (unlikely(!page_evictable(page, NULL))) {
-			spin_unlock_irq(&zone->lru_lock);
+			unlock_lruvec(*lruvec);
 			putback_lru_page(page);
-			spin_lock_irq(&zone->lru_lock);
+			lock_lruvec(*lruvec);
 			continue;
 		}
 
-		lruvec = mem_cgroup_page_lruvec(page, zone);
+		/* lock lru, occasionally changing lruvec */
+		mem_cgroup_page_relock_lruvec(page, lruvec);
 
 		SetPageLRU(page);
 		lru = page_lru(page);
-		add_page_to_lru_list(page, lruvec, lru);
+		add_page_to_lru_list(page, *lruvec, lru);
 
 		if (is_active_lru(lru)) {
 			int file = is_file_lru(lru);
@@ -1373,12 +1371,12 @@ putback_inactive_pages(struct lruvec *lr
 		if (put_page_testzero(page)) {
 			__ClearPageLRU(page);
 			__ClearPageActive(page);
-			del_page_from_lru_list(page, lruvec, lru);
+			del_page_from_lru_list(page, *lruvec, lru);
 
 			if (unlikely(PageCompound(page))) {
-				spin_unlock_irq(&zone->lru_lock);
+				unlock_lruvec(*lruvec);
 				(*get_compound_page_dtor(page))(page);
-				spin_lock_irq(&zone->lru_lock);
+				lock_lruvec(*lruvec);
 			} else
 				list_add(&page->lru, &pages_to_free);
 		}
@@ -1513,7 +1511,7 @@ shrink_inactive_list(unsigned long nr_to
 	if (!sc->may_writepage)
 		isolate_mode |= ISOLATE_CLEAN;
 
-	spin_lock_irq(&zone->lru_lock);
+	lock_lruvec(lruvec);
 
 	nr_taken = isolate_lru_pages(nr_to_scan, lruvec, &page_list,
 				     &nr_scanned, sc, isolate_mode, 0, file);
@@ -1524,7 +1522,7 @@ shrink_inactive_list(unsigned long nr_to
 		else
 			__count_zone_vm_events(PGSCAN_DIRECT, zone, nr_scanned);
 	}
-	spin_unlock_irq(&zone->lru_lock);
+	unlock_lruvec(lruvec);
 
 	if (nr_taken == 0)
 		return 0;
@@ -1541,7 +1539,7 @@ shrink_inactive_list(unsigned long nr_to
 					priority, &nr_dirty, &nr_writeback);
 	}
 
-	spin_lock_irq(&zone->lru_lock);
+	lock_lruvec(lruvec);
 
 	reclaim_stat->recent_scanned[0] += nr_anon;
 	reclaim_stat->recent_scanned[1] += nr_file;
@@ -1550,12 +1548,12 @@ shrink_inactive_list(unsigned long nr_to
 		__count_vm_events(KSWAPD_STEAL, nr_reclaimed);
 	__count_zone_vm_events(PGSTEAL, zone, nr_reclaimed);
 
-	putback_inactive_pages(lruvec, &page_list);
+	putback_inactive_pages(&lruvec, &page_list);
 
 	__mod_zone_page_state(zone, NR_ISOLATED_ANON, -nr_anon);
 	__mod_zone_page_state(zone, NR_ISOLATED_FILE, -nr_file);
 
-	spin_unlock_irq(&zone->lru_lock);
+	unlock_lruvec(lruvec);
 
 	free_hot_cold_page_list(&page_list, 1);
 
@@ -1611,42 +1609,44 @@ shrink_inactive_list(unsigned long nr_to
  * But we had to alter page->flags anyway.
  */
 
-static void move_active_pages_to_lru(struct lruvec *lruvec,
+static void move_active_pages_to_lru(struct lruvec **lruvec,
 				     struct list_head *list,
 				     struct list_head *pages_to_free,
 				     enum lru_list lru)
 {
-	struct zone *zone = lruvec->zone;
 	unsigned long pgmoved = 0;
 	struct page *page;
 	int nr_pages;
 
 	while (!list_empty(list)) {
 		page = lru_to_page(list);
-		lruvec = mem_cgroup_page_lruvec(page, zone);
+
+		/* lock lru, occasionally changing lruvec */
+		mem_cgroup_page_relock_lruvec(page, lruvec);
 
 		VM_BUG_ON(PageLRU(page));
 		SetPageLRU(page);
 
 		nr_pages = hpage_nr_pages(page);
-		list_move(&page->lru, &lruvec->lists[lru]);
-		mem_cgroup_update_lru_size(lruvec, lru, nr_pages);
+		list_move(&page->lru, &(*lruvec)->lists[lru]);
+		mem_cgroup_update_lru_size(*lruvec, lru, nr_pages);
 		pgmoved += nr_pages;
 
 		if (put_page_testzero(page)) {
 			__ClearPageLRU(page);
 			__ClearPageActive(page);
-			del_page_from_lru_list(page, lruvec, lru);
+			del_page_from_lru_list(page, *lruvec, lru);
 
 			if (unlikely(PageCompound(page))) {
-				spin_unlock_irq(&zone->lru_lock);
+				unlock_lruvec(*lruvec);
 				(*get_compound_page_dtor(page))(page);
-				spin_lock_irq(&zone->lru_lock);
+				lock_lruvec(*lruvec);
 			} else
 				list_add(&page->lru, pages_to_free);
 		}
 	}
-	__mod_zone_page_state(zone, NR_LRU_BASE + lru, pgmoved);
+
+	__mod_zone_page_state((*lruvec)->zone, NR_LRU_BASE + lru, pgmoved);
 	if (!is_active_lru(lru))
 		__count_vm_events(PGDEACTIVATE, pgmoved);
 }
@@ -1676,7 +1676,7 @@ static void shrink_active_list(unsigned
 	if (!sc->may_writepage)
 		isolate_mode |= ISOLATE_CLEAN;
 
-	spin_lock_irq(&zone->lru_lock);
+	lock_lruvec(lruvec);
 
 	nr_taken = isolate_lru_pages(nr_to_scan, lruvec, &l_hold,
 				     &nr_scanned, sc, isolate_mode, 1, file);
@@ -1691,7 +1691,8 @@ static void shrink_active_list(unsigned
 	else
 		__mod_zone_page_state(zone, NR_ACTIVE_ANON, -nr_taken);
 	__mod_zone_page_state(zone, NR_ISOLATED_ANON + file, nr_taken);
-	spin_unlock_irq(&zone->lru_lock);
+
+	unlock_lruvec(lruvec);
 
 	while (!list_empty(&l_hold)) {
 		cond_resched();
@@ -1735,7 +1736,7 @@ static void shrink_active_list(unsigned
 	/*
 	 * Move pages back to the lru list.
 	 */
-	spin_lock_irq(&zone->lru_lock);
+	lock_lruvec(lruvec);
 	/*
 	 * Count referenced pages from currently used mappings as rotated,
 	 * even though only some of them are actually re-activated.  This
@@ -1744,12 +1745,13 @@ static void shrink_active_list(unsigned
 	 */
 	reclaim_stat->recent_rotated[file] += nr_rotated;
 
-	move_active_pages_to_lru(lruvec, &l_active, &l_hold,
+	move_active_pages_to_lru(&lruvec, &l_active, &l_hold,
 						LRU_ACTIVE + file * LRU_FILE);
-	move_active_pages_to_lru(lruvec, &l_inactive, &l_hold,
+	move_active_pages_to_lru(&lruvec, &l_inactive, &l_hold,
 						LRU_BASE   + file * LRU_FILE);
 	__mod_zone_page_state(zone, NR_ISOLATED_ANON + file, -nr_taken);
-	spin_unlock_irq(&zone->lru_lock);
+
+	unlock_lruvec(lruvec);
 
 	free_hot_cold_page_list(&l_hold, 1);
 }
@@ -1940,7 +1942,7 @@ static void get_scan_count(struct lruvec
 	 *
 	 * anon in [0], file in [1]
 	 */
-	spin_lock_irq(&zone->lru_lock);
+	lock_lruvec(lruvec);
 	if (unlikely(reclaim_stat->recent_scanned[0] > anon / 4)) {
 		reclaim_stat->recent_scanned[0] /= 2;
 		reclaim_stat->recent_rotated[0] /= 2;
@@ -1961,7 +1963,7 @@ static void get_scan_count(struct lruvec
 
 	fp = (file_prio + 1) * (reclaim_stat->recent_scanned[1] + 1);
 	fp /= reclaim_stat->recent_rotated[1] + 1;
-	spin_unlock_irq(&zone->lru_lock);
+	unlock_lruvec(lruvec);
 
 	fraction[0] = ap;
 	fraction[1] = fp;
@@ -3525,25 +3527,16 @@ int page_evictable(struct page *page, st
  */
 void check_move_unevictable_pages(struct page **pages, int nr_pages)
 {
-	struct lruvec *lruvec;
-	struct zone *zone = NULL;
+	struct lruvec *lruvec = NULL;
 	int pgscanned = 0;
 	int pgrescued = 0;
 	int i;
 
 	for (i = 0; i < nr_pages; i++) {
 		struct page *page = pages[i];
-		struct zone *pagezone;
 
 		pgscanned++;
-		pagezone = page_zone(page);
-		if (pagezone != zone) {
-			if (zone)
-				spin_unlock_irq(&zone->lru_lock);
-			zone = pagezone;
-			spin_lock_irq(&zone->lru_lock);
-		}
-		lruvec = mem_cgroup_page_lruvec(page, zone);
+		page_relock_lruvec(page, &lruvec);
 
 		if (!PageLRU(page) || !PageUnevictable(page))
 			continue;
@@ -3559,10 +3552,10 @@ void check_move_unevictable_pages(struct
 		}
 	}
 
-	if (zone) {
+	if (lruvec) {
 		__count_vm_events(UNEVICTABLE_PGRESCUED, pgrescued);
 		__count_vm_events(UNEVICTABLE_PGSCANNED, pgscanned);
-		spin_unlock_irq(&zone->lru_lock);
+		unlock_lruvec(lruvec);
 	}
 }
 #endif /* CONFIG_SHMEM */

^ permalink raw reply	[flat|nested] 72+ messages in thread

* [PATCH 5/10] mm/memcg: introduce page_relock_lruvec
@ 2012-02-20 23:33   ` Hugh Dickins
  0 siblings, 0 replies; 72+ messages in thread
From: Hugh Dickins @ 2012-02-20 23:33 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Konstantin Khlebnikov, KAMEZAWA Hiroyuki, Johannes Weiner,
	Ying Han, linux-mm, linux-kernel

Delete the mem_cgroup_page_lruvec() which we just added, replacing it
and the nearby spin_lock_irq or spin_lock_irqsave of zone->lru_lock:
in most places with page_lock_lruvec() or page_relock_lruvec() (the
former being a simple case of the latter), elsewhere with plain
lock_lruvec().  unlock_lruvec() does the spin_unlock_irqrestore for
them all.

page_relock_lruvec() is born from that "pagezone" pattern in swap.c
and vmscan.c, where we loop over an array of pages, switching lock
whenever the zone changes: bearing in mind that if we were to refine
that lock to per-memcg per-zone, then we would have to switch whenever
the memcg changes too.

page_relock_lruvec(page, &lruvec) locates the right lruvec for page,
unlocks the old lruvec if different (and not NULL), locks the new,
and updates lruvec on return: so that we shall have just one routine
to locate and lock the lruvec, whereas originally it got re-evaluated
at different stages.  But I don't yet know how to satisfy sparse(1).
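
For illustration, the resulting calling pattern over an array of pages
(a minimal sketch of the loops converted below; pages[] and nr_pages
are just placeholder names here):

	struct lruvec *lruvec = NULL;
	int i;

	for (i = 0; i < nr_pages; i++) {
		struct page *page = pages[i];

		/* locks the right lruvec, switching only when it changes */
		page_relock_lruvec(page, &lruvec);
		/* ... operate on page under its lru lock ... */
	}
	if (lruvec)
		unlock_lruvec(lruvec);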

There are some loops where we never change zone, and a non-memcg kernel
would not change memcg: use mem_cgroup_page_relock_lruvec() there, which
compiles away to a no-op when memcg is not configured.

In compaction's isolate_migratepages(), although we do know the zone,
we don't know the lruvec in advance: allow for taking the lock later,
and reorganize its cond_resched() lock-dropping accordingly.

page_relock_lruvec() (and its wrappers) is actually an _irqsave operation:
there are a few cases in swap.c where it may be needed at interrupt time
(to free or to rotate a page on I/O completion).  Ideally(?) we would use
straightforward _irq disabling elsewhere, but the variants get confusing,
and page_relock_lruvec() will itself grow more complicated in subsequent
patches: so keep it simple for now with just the one irqsaver everywhere.

Passing an irqflags argument/pointer down several levels looks messy
too, and I'm reluctant to add any more to the page reclaim stack: so
save the irqflags alongside the lru_lock and restore them from there.
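
At the call sites that amounts to this contrast (illustrative only,
with the critical sections elided):

	/* before: every caller had to carry the flags itself */
	unsigned long flags;

	spin_lock_irqsave(&zone->lru_lock, flags);
	/* ... lru manipulation ... */
	spin_unlock_irqrestore(&zone->lru_lock, flags);

	/* after: the flags are saved in the zone by lock_lruvec() */
	lock_lruvec(lruvec);
	/* ... lru manipulation ... */
	unlock_lruvec(lruvec);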

It's a little sad now to be including mm.h in swap.h to get page_zone();
but I think that swap.h (despite its name) is the right place for these
lru functions, and without those inlines the optimizer cannot do so
well in the !MEM_RES_CTLR case.

(Is this an appropriate place to confess? that even at the end of the
series, we're left with a small bug in putback_inactive_pages(), one
that I've not yet decided is worth fixing: reclaim_stat there is from
the lruvec on entry, but we might update stats after dropping its lock.
And do zone->pages_scanned and zone->all_unreclaimable need locking?
page_alloc.c thinks zone->lock, vmscan.c thought zone->lru_lock,
and that weakens if we now split lru_lock by memcg.)

Signed-off-by: Hugh Dickins <hughd@google.com>
---
 include/linux/memcontrol.h |    7 --
 include/linux/mmzone.h     |    1 
 include/linux/swap.h       |   65 +++++++++++++++++++++++
 mm/compaction.c            |   45 ++++++++++------
 mm/huge_memory.c           |   10 +--
 mm/memcontrol.c            |   56 ++++++++++++--------
 mm/swap.c                  |   67 +++++++-----------------
 mm/vmscan.c                |   95 ++++++++++++++++-------------------
 8 files changed, 194 insertions(+), 152 deletions(-)

--- mmotm.orig/include/linux/memcontrol.h	2012-02-18 11:57:35.583524425 -0800
+++ mmotm/include/linux/memcontrol.h	2012-02-18 11:57:42.675524592 -0800
@@ -63,7 +63,6 @@ extern int mem_cgroup_cache_charge(struc
 					gfp_t gfp_mask);
 
 struct lruvec *mem_cgroup_zone_lruvec(struct zone *, struct mem_cgroup *);
-extern struct lruvec *mem_cgroup_page_lruvec(struct page *, struct zone *);
 extern struct mem_cgroup *mem_cgroup_from_lruvec(struct lruvec *lruvec);
 extern void mem_cgroup_update_lru_size(struct lruvec *, enum lru_list, int);
 
@@ -241,12 +240,6 @@ static inline struct lruvec *mem_cgroup_
 {
 	return &zone->lruvec;
 }
-
-static inline struct lruvec *mem_cgroup_page_lruvec(struct page *page,
-						    struct zone *zone)
-{
-	return &zone->lruvec;
-}
 
 static inline struct mem_cgroup *mem_cgroup_from_lruvec(struct lruvec *lruvec)
 {
--- mmotm.orig/include/linux/mmzone.h	2012-02-18 11:57:28.371524252 -0800
+++ mmotm/include/linux/mmzone.h	2012-02-18 11:57:42.675524592 -0800
@@ -374,6 +374,7 @@ struct zone {
 
 	/* Fields commonly accessed by the page reclaim scanner */
 	spinlock_t		lru_lock;
+	unsigned long		irqflags;
 	struct lruvec		lruvec;
 
 	unsigned long		pages_scanned;	   /* since last reclaim */
--- mmotm.orig/include/linux/swap.h	2012-02-18 11:57:35.583524425 -0800
+++ mmotm/include/linux/swap.h	2012-02-18 11:57:42.675524592 -0800
@@ -8,7 +8,7 @@
 #include <linux/memcontrol.h>
 #include <linux/sched.h>
 #include <linux/node.h>
-
+#include <linux/mm.h>			/* for page_zone(page) */
 #include <linux/atomic.h>
 #include <asm/page.h>
 
@@ -250,6 +250,69 @@ static inline void lru_cache_add_file(st
 	__lru_cache_add(page, LRU_INACTIVE_FILE);
 }
 
+static inline spinlock_t *lru_lockptr(struct lruvec *lruvec)
+{
+	return &lruvec->zone->lru_lock;
+}
+
+static inline void lock_lruvec(struct lruvec *lruvec)
+{
+	struct zone *zone = lruvec->zone;
+	unsigned long irqflags;
+
+	spin_lock_irqsave(&zone->lru_lock, irqflags);
+	zone->irqflags = irqflags;
+}
+
+static inline void unlock_lruvec(struct lruvec *lruvec)
+{
+	struct zone *zone = lruvec->zone;
+	unsigned long irqflags;
+
+	irqflags = zone->irqflags;
+	spin_unlock_irqrestore(&zone->lru_lock, irqflags);
+}
+
+#ifdef CONFIG_CGROUP_MEM_RES_CTLR
+/* linux/mm/memcontrol.c */
+extern void page_relock_lruvec(struct page *page, struct lruvec **lruvp);
+
+static inline void
+mem_cgroup_page_relock_lruvec(struct page *page, struct lruvec **lruvp)
+{
+	page_relock_lruvec(page, lruvp);
+}
+#else
+static inline void page_relock_lruvec(struct page *page, struct lruvec **lruvp)
+{
+	struct lruvec *lruvec;
+
+	lruvec = &page_zone(page)->lruvec;
+	if (*lruvp && *lruvp != lruvec) {
+		unlock_lruvec(*lruvp);
+		*lruvp = NULL;
+	}
+	if (!*lruvp) {
+		*lruvp = lruvec;
+		lock_lruvec(lruvec);
+	}
+}
+
+static inline void
+mem_cgroup_page_relock_lruvec(struct page *page, struct lruvec **lruvp)
+{
+	/* No-op used in a few places where zone is known not to change */
+}
+#endif /* CONFIG_CGROUP_MEM_RES_CTLR */
+
+static inline struct lruvec *page_lock_lruvec(struct page *page)
+{
+	struct lruvec *lruvec = NULL;
+
+	page_relock_lruvec(page, &lruvec);
+	return lruvec;
+}
+
 /* linux/mm/vmscan.c */
 extern unsigned long try_to_free_pages(struct zonelist *zonelist, int order,
 					gfp_t gfp_mask, nodemask_t *mask);
--- mmotm.orig/mm/compaction.c	2012-02-18 11:57:35.583524425 -0800
+++ mmotm/mm/compaction.c	2012-02-18 11:57:42.675524592 -0800
@@ -262,7 +262,7 @@ static isolate_migrate_t isolate_migrate
 	unsigned long nr_scanned = 0, nr_isolated = 0;
 	struct list_head *migratelist = &cc->migratepages;
 	isolate_mode_t mode = ISOLATE_ACTIVE|ISOLATE_INACTIVE;
-	struct lruvec *lruvec;
+	struct lruvec *lruvec = NULL;
 
 	/* Do not scan outside zone boundaries */
 	low_pfn = max(cc->migrate_pfn, zone->zone_start_pfn);
@@ -293,26 +293,23 @@ static isolate_migrate_t isolate_migrate
 	}
 
 	/* Time to isolate some pages for migration */
-	cond_resched();
-	spin_lock_irq(&zone->lru_lock);
 	for (; low_pfn < end_pfn; low_pfn++) {
 		struct page *page;
-		bool locked = true;
 
-		/* give a chance to irqs before checking need_resched() */
-		if (!((low_pfn+1) % SWAP_CLUSTER_MAX)) {
-			spin_unlock_irq(&zone->lru_lock);
-			locked = false;
-		}
-		if (need_resched() || spin_is_contended(&zone->lru_lock)) {
-			if (locked)
-				spin_unlock_irq(&zone->lru_lock);
+		/* give a chance to irqs before cond_resched() */
+		if (lruvec) {
+			if (!((low_pfn+1) % SWAP_CLUSTER_MAX) ||
+			    spin_is_contended(lru_lockptr(lruvec)) ||
+			    need_resched()) {
+				unlock_lruvec(lruvec);
+				lruvec = NULL;
+			}
+		}
+		if (!lruvec) {
 			cond_resched();
-			spin_lock_irq(&zone->lru_lock);
 			if (fatal_signal_pending(current))
 				break;
-		} else if (!locked)
-			spin_lock_irq(&zone->lru_lock);
+		}
 
 		/*
 		 * migrate_pfn does not necessarily start aligned to a
@@ -359,6 +356,15 @@ static isolate_migrate_t isolate_migrate
 			continue;
 		}
 
+		if (!lruvec) {
+			/*
+			 * We do need to take the lock before advancing to
+			 * check PageLRU etc., but there's no guarantee that
+			 * the page we're peeking at has a stable memcg here.
+			 */
+			lruvec = &zone->lruvec;
+			lock_lruvec(lruvec);
+		}
 		if (!PageLRU(page))
 			continue;
 
@@ -379,7 +385,7 @@ static isolate_migrate_t isolate_migrate
 		if (__isolate_lru_page(page, mode, 0) != 0)
 			continue;
 
-		lruvec = mem_cgroup_page_lruvec(page, zone);
+		page_relock_lruvec(page, &lruvec);
 
 		VM_BUG_ON(PageTransCompound(page));
 
@@ -396,9 +402,14 @@ static isolate_migrate_t isolate_migrate
 		}
 	}
 
+	if (!lruvec)
+		local_irq_disable();
 	acct_isolated(zone, cc);
+	if (lruvec)
+		unlock_lruvec(lruvec);
+	else
+		local_irq_enable();
 
-	spin_unlock_irq(&zone->lru_lock);
 	cc->migrate_pfn = low_pfn;
 
 	trace_mm_compaction_isolate_migratepages(nr_scanned, nr_isolated);
--- mmotm.orig/mm/huge_memory.c	2012-02-18 11:57:35.583524425 -0800
+++ mmotm/mm/huge_memory.c	2012-02-18 11:57:42.679524592 -0800
@@ -1222,13 +1222,11 @@ static int __split_huge_page_splitting(s
 static void __split_huge_page_refcount(struct page *page)
 {
 	int i;
-	struct zone *zone = page_zone(page);
 	struct lruvec *lruvec;
 	int tail_count = 0;
 
 	/* prevent PageLRU to go away from under us, and freeze lru stats */
-	spin_lock_irq(&zone->lru_lock);
-	lruvec = mem_cgroup_page_lruvec(page, zone);
+	lruvec = page_lock_lruvec(page);
 
 	compound_lock(page);
 	/* complete memcg works before add pages to LRU */
@@ -1310,12 +1308,12 @@ static void __split_huge_page_refcount(s
 	atomic_sub(tail_count, &page->_count);
 	BUG_ON(atomic_read(&page->_count) <= 0);
 
-	__dec_zone_page_state(page, NR_ANON_TRANSPARENT_HUGEPAGES);
-	__mod_zone_page_state(zone, NR_ANON_PAGES, HPAGE_PMD_NR);
+	__mod_zone_page_state(lruvec->zone, NR_ANON_TRANSPARENT_HUGEPAGES, -1);
+	__mod_zone_page_state(lruvec->zone, NR_ANON_PAGES, HPAGE_PMD_NR);
 
 	ClearPageCompound(page);
 	compound_unlock(page);
-	spin_unlock_irq(&zone->lru_lock);
+	unlock_lruvec(lruvec);
 
 	for (i = 1; i < HPAGE_PMD_NR; i++) {
 		struct page *page_tail = page + i;
--- mmotm.orig/mm/memcontrol.c	2012-02-18 11:57:35.587524424 -0800
+++ mmotm/mm/memcontrol.c	2012-02-18 11:57:42.679524592 -0800
@@ -1037,23 +1037,36 @@ struct mem_cgroup *mem_cgroup_from_lruve
  */
 
 /**
- * mem_cgroup_page_lruvec - return lruvec for adding an lru page
+ * page_relock_lruvec - lock and update lruvec for this page, unlocking previous
  * @page: the page
- * @zone: zone of the page
+ * @lruvp: pointer to where to output lruvec; unlock input lruvec if non-NULL
  */
-struct lruvec *mem_cgroup_page_lruvec(struct page *page, struct zone *zone)
+void page_relock_lruvec(struct page *page, struct lruvec **lruvp)
 {
 	struct mem_cgroup_per_zone *mz;
 	struct mem_cgroup *memcg;
 	struct page_cgroup *pc;
+	struct lruvec *lruvec;
 
 	if (mem_cgroup_disabled())
-		return &zone->lruvec;
+		lruvec = &page_zone(page)->lruvec;
+	else {
+		pc = lookup_page_cgroup(page);
+		memcg = pc->mem_cgroup;
+		mz = page_cgroup_zoneinfo(memcg, page);
+		lruvec = &mz->lruvec;
+	}
 
-	pc = lookup_page_cgroup(page);
-	memcg = pc->mem_cgroup;
-	mz = page_cgroup_zoneinfo(memcg, page);
-	return &mz->lruvec;
+	/*
+	 * For the moment, simply lock by zone just as before.
+	 */
+	if (*lruvp && (*lruvp)->zone != lruvec->zone) {
+		unlock_lruvec(*lruvp);
+		*lruvp = NULL;
+	}
+	if (!*lruvp)
+		lock_lruvec(lruvec);
+	*lruvp = lruvec;
 }
 
 /**
@@ -2631,30 +2644,27 @@ __mem_cgroup_commit_charge_lrucare(struc
 					enum charge_type ctype)
 {
 	struct page_cgroup *pc = lookup_page_cgroup(page);
-	struct zone *zone = page_zone(page);
-	unsigned long flags;
-	bool removed = false;
 	struct lruvec *lruvec;
+	bool removed = false;
 
 	/*
 	 * In some case, SwapCache, FUSE(splice_buf->radixtree), the page
 	 * is already on LRU. It means the page may on some other page_cgroup's
 	 * LRU. Take care of it.
 	 */
-	spin_lock_irqsave(&zone->lru_lock, flags);
+	lruvec = page_lock_lruvec(page);
 	if (PageLRU(page)) {
-		lruvec = mem_cgroup_zone_lruvec(zone, pc->mem_cgroup);
 		del_page_from_lru_list(page, lruvec, page_lru(page));
 		ClearPageLRU(page);
 		removed = true;
 	}
 	__mem_cgroup_commit_charge(memcg, page, 1, pc, ctype);
 	if (removed) {
-		lruvec = mem_cgroup_zone_lruvec(zone, pc->mem_cgroup);
+		page_relock_lruvec(page, &lruvec);
 		add_page_to_lru_list(page, lruvec, page_lru(page));
 		SetPageLRU(page);
 	}
-	spin_unlock_irqrestore(&zone->lru_lock, flags);
+	unlock_lruvec(lruvec);
 }
 
 int mem_cgroup_cache_charge(struct page *page, struct mm_struct *mm,
@@ -3572,15 +3582,15 @@ static int mem_cgroup_force_empty_list(s
 				int node, int zid, enum lru_list lru)
 {
 	struct mem_cgroup_per_zone *mz;
-	unsigned long flags, loop;
+	unsigned long loop;
 	struct list_head *list;
 	struct page *busy;
-	struct zone *zone;
+	struct lruvec *lruvec;
 	int ret = 0;
 
-	zone = &NODE_DATA(node)->node_zones[zid];
 	mz = mem_cgroup_zoneinfo(memcg, node, zid);
-	list = &mz->lruvec.lists[lru];
+	lruvec = &mz->lruvec;
+	list = &lruvec->lists[lru];
 
 	loop = mz->lru_size[lru];
 	/* give some margin against EBUSY etc...*/
@@ -3591,19 +3601,19 @@ static int mem_cgroup_force_empty_list(s
 		struct page *page;
 
 		ret = 0;
-		spin_lock_irqsave(&zone->lru_lock, flags);
+		lock_lruvec(lruvec);
 		if (list_empty(list)) {
-			spin_unlock_irqrestore(&zone->lru_lock, flags);
+			unlock_lruvec(lruvec);
 			break;
 		}
 		page = list_entry(list->prev, struct page, lru);
 		if (busy == page) {
 			list_move(&page->lru, list);
 			busy = NULL;
-			spin_unlock_irqrestore(&zone->lru_lock, flags);
+			unlock_lruvec(lruvec);
 			continue;
 		}
-		spin_unlock_irqrestore(&zone->lru_lock, flags);
+		unlock_lruvec(lruvec);
 
 		pc = lookup_page_cgroup(page);
 
--- mmotm.orig/mm/swap.c	2012-02-18 11:57:35.587524424 -0800
+++ mmotm/mm/swap.c	2012-02-18 11:57:42.679524592 -0800
@@ -47,16 +47,13 @@ static DEFINE_PER_CPU(struct pagevec, lr
 static void __page_cache_release(struct page *page)
 {
 	if (PageLRU(page)) {
-		struct zone *zone = page_zone(page);
 		struct lruvec *lruvec;
-		unsigned long flags;
 
-		spin_lock_irqsave(&zone->lru_lock, flags);
-		lruvec = mem_cgroup_page_lruvec(page, zone);
+		lruvec = page_lock_lruvec(page);
 		VM_BUG_ON(!PageLRU(page));
 		__ClearPageLRU(page);
 		del_page_from_lru_list(page, lruvec, page_off_lru(page));
-		spin_unlock_irqrestore(&zone->lru_lock, flags);
+		unlock_lruvec(lruvec);
 	}
 }
 
@@ -208,26 +205,16 @@ static void pagevec_lru_move_fn(struct p
 	void *arg)
 {
 	int i;
-	struct zone *zone = NULL;
-	struct lruvec *lruvec;
-	unsigned long flags = 0;
+	struct lruvec *lruvec = NULL;
 
 	for (i = 0; i < pagevec_count(pvec); i++) {
 		struct page *page = pvec->pages[i];
-		struct zone *pagezone = page_zone(page);
 
-		if (pagezone != zone) {
-			if (zone)
-				spin_unlock_irqrestore(&zone->lru_lock, flags);
-			zone = pagezone;
-			spin_lock_irqsave(&zone->lru_lock, flags);
-		}
-
-		lruvec = mem_cgroup_page_lruvec(page, zone);
+		page_relock_lruvec(page, &lruvec);
 		(*move_fn)(page, lruvec, arg);
 	}
-	if (zone)
-		spin_unlock_irqrestore(&zone->lru_lock, flags);
+	if (lruvec)
+		unlock_lruvec(lruvec);
 	release_pages(pvec->pages, pvec->nr, pvec->cold);
 	pagevec_reinit(pvec);
 }
@@ -334,11 +321,11 @@ static inline void activate_page_drain(i
 
 void activate_page(struct page *page)
 {
-	struct zone *zone = page_zone(page);
+	struct lruvec *lruvec;
 
-	spin_lock_irq(&zone->lru_lock);
-	__activate_page(page, mem_cgroup_page_lruvec(page, zone), NULL);
-	spin_unlock_irq(&zone->lru_lock);
+	lruvec = page_lock_lruvec(page);
+	__activate_page(page, lruvec, NULL);
+	unlock_lruvec(lruvec);
 }
 #endif
 
@@ -403,15 +390,13 @@ void lru_cache_add_lru(struct page *page
  */
 void add_page_to_unevictable_list(struct page *page)
 {
-	struct zone *zone = page_zone(page);
 	struct lruvec *lruvec;
 
-	spin_lock_irq(&zone->lru_lock);
-	lruvec = mem_cgroup_page_lruvec(page, zone);
+	lruvec = page_lock_lruvec(page);
 	SetPageUnevictable(page);
 	SetPageLRU(page);
 	add_page_to_lru_list(page, lruvec, LRU_UNEVICTABLE);
-	spin_unlock_irq(&zone->lru_lock);
+	unlock_lruvec(lruvec);
 }
 
 /*
@@ -577,17 +562,15 @@ void release_pages(struct page **pages,
 {
 	int i;
 	LIST_HEAD(pages_to_free);
-	struct zone *zone = NULL;
-	struct lruvec *lruvec;
-	unsigned long uninitialized_var(flags);
+	struct lruvec *lruvec = NULL;
 
 	for (i = 0; i < nr; i++) {
 		struct page *page = pages[i];
 
 		if (unlikely(PageCompound(page))) {
-			if (zone) {
-				spin_unlock_irqrestore(&zone->lru_lock, flags);
-				zone = NULL;
+			if (lruvec) {
+				unlock_lruvec(lruvec);
+				lruvec = NULL;
 			}
 			put_compound_page(page);
 			continue;
@@ -597,17 +580,7 @@ void release_pages(struct page **pages,
 			continue;
 
 		if (PageLRU(page)) {
-			struct zone *pagezone = page_zone(page);
-
-			if (pagezone != zone) {
-				if (zone)
-					spin_unlock_irqrestore(&zone->lru_lock,
-									flags);
-				zone = pagezone;
-				spin_lock_irqsave(&zone->lru_lock, flags);
-			}
-
-			lruvec = mem_cgroup_page_lruvec(page, zone);
+			page_relock_lruvec(page, &lruvec);
 			VM_BUG_ON(!PageLRU(page));
 			__ClearPageLRU(page);
 			del_page_from_lru_list(page, lruvec, page_off_lru(page));
@@ -615,8 +588,8 @@ void release_pages(struct page **pages,
 
 		list_add(&page->lru, &pages_to_free);
 	}
-	if (zone)
-		spin_unlock_irqrestore(&zone->lru_lock, flags);
+	if (lruvec)
+		unlock_lruvec(lruvec);
 
 	free_hot_cold_page_list(&pages_to_free, cold);
 }
@@ -652,7 +625,7 @@ void lru_add_page_tail(struct page *page
 	VM_BUG_ON(!PageHead(page));
 	VM_BUG_ON(PageCompound(page_tail));
 	VM_BUG_ON(PageLRU(page_tail));
-	VM_BUG_ON(NR_CPUS != 1 && !spin_is_locked(&lruvec->zone->lru_lock));
+	VM_BUG_ON(NR_CPUS != 1 && !spin_is_locked(lru_lockptr(lruvec)));
 
 	SetPageLRU(page_tail);
 
--- mmotm.orig/mm/vmscan.c	2012-02-18 11:57:35.587524424 -0800
+++ mmotm/mm/vmscan.c	2012-02-18 11:57:42.679524592 -0800
@@ -1212,8 +1212,8 @@ static unsigned long isolate_lru_pages(u
 				break;
 
 			if (__isolate_lru_page(cursor_page, mode, file) == 0) {
-				lruvec = mem_cgroup_page_lruvec(cursor_page,
-								lruvec->zone);
+				mem_cgroup_page_relock_lruvec(cursor_page,
+								&lruvec);
 				isolated_pages = hpage_nr_pages(cursor_page);
 				mem_cgroup_update_lru_size(lruvec,
 					page_lru(cursor_page), -isolated_pages);
@@ -1294,11 +1294,9 @@ int isolate_lru_page(struct page *page)
 	VM_BUG_ON(!page_count(page));
 
 	if (PageLRU(page)) {
-		struct zone *zone = page_zone(page);
 		struct lruvec *lruvec;
 
-		spin_lock_irq(&zone->lru_lock);
-		lruvec = mem_cgroup_page_lruvec(page, zone);
+		lruvec = page_lock_lruvec(page);
 		if (PageLRU(page)) {
 			int lru = page_lru(page);
 			get_page(page);
@@ -1306,7 +1304,7 @@ int isolate_lru_page(struct page *page)
 			del_page_from_lru_list(page, lruvec, lru);
 			ret = 0;
 		}
-		spin_unlock_irq(&zone->lru_lock);
+		unlock_lruvec(lruvec);
 	}
 	return ret;
 }
@@ -1337,10 +1335,9 @@ static int too_many_isolated(struct zone
 }
 
 static noinline_for_stack void
-putback_inactive_pages(struct lruvec *lruvec, struct list_head *page_list)
+putback_inactive_pages(struct lruvec **lruvec, struct list_head *page_list)
 {
-	struct zone_reclaim_stat *reclaim_stat = &lruvec->reclaim_stat;
-	struct zone *zone = lruvec->zone;
+	struct zone_reclaim_stat *reclaim_stat = &(*lruvec)->reclaim_stat;
 	LIST_HEAD(pages_to_free);
 
 	/*
@@ -1353,17 +1350,18 @@ putback_inactive_pages(struct lruvec *lr
 		VM_BUG_ON(PageLRU(page));
 		list_del(&page->lru);
 		if (unlikely(!page_evictable(page, NULL))) {
-			spin_unlock_irq(&zone->lru_lock);
+			unlock_lruvec(*lruvec);
 			putback_lru_page(page);
-			spin_lock_irq(&zone->lru_lock);
+			lock_lruvec(*lruvec);
 			continue;
 		}
 
-		lruvec = mem_cgroup_page_lruvec(page, zone);
+		/* lock lru, occasionally changing lruvec */
+		mem_cgroup_page_relock_lruvec(page, lruvec);
 
 		SetPageLRU(page);
 		lru = page_lru(page);
-		add_page_to_lru_list(page, lruvec, lru);
+		add_page_to_lru_list(page, *lruvec, lru);
 
 		if (is_active_lru(lru)) {
 			int file = is_file_lru(lru);
@@ -1373,12 +1371,12 @@ putback_inactive_pages(struct lruvec *lr
 		if (put_page_testzero(page)) {
 			__ClearPageLRU(page);
 			__ClearPageActive(page);
-			del_page_from_lru_list(page, lruvec, lru);
+			del_page_from_lru_list(page, *lruvec, lru);
 
 			if (unlikely(PageCompound(page))) {
-				spin_unlock_irq(&zone->lru_lock);
+				unlock_lruvec(*lruvec);
 				(*get_compound_page_dtor(page))(page);
-				spin_lock_irq(&zone->lru_lock);
+				lock_lruvec(*lruvec);
 			} else
 				list_add(&page->lru, &pages_to_free);
 		}
@@ -1513,7 +1511,7 @@ shrink_inactive_list(unsigned long nr_to
 	if (!sc->may_writepage)
 		isolate_mode |= ISOLATE_CLEAN;
 
-	spin_lock_irq(&zone->lru_lock);
+	lock_lruvec(lruvec);
 
 	nr_taken = isolate_lru_pages(nr_to_scan, lruvec, &page_list,
 				     &nr_scanned, sc, isolate_mode, 0, file);
@@ -1524,7 +1522,7 @@ shrink_inactive_list(unsigned long nr_to
 		else
 			__count_zone_vm_events(PGSCAN_DIRECT, zone, nr_scanned);
 	}
-	spin_unlock_irq(&zone->lru_lock);
+	unlock_lruvec(lruvec);
 
 	if (nr_taken == 0)
 		return 0;
@@ -1541,7 +1539,7 @@ shrink_inactive_list(unsigned long nr_to
 					priority, &nr_dirty, &nr_writeback);
 	}
 
-	spin_lock_irq(&zone->lru_lock);
+	lock_lruvec(lruvec);
 
 	reclaim_stat->recent_scanned[0] += nr_anon;
 	reclaim_stat->recent_scanned[1] += nr_file;
@@ -1550,12 +1548,12 @@ shrink_inactive_list(unsigned long nr_to
 		__count_vm_events(KSWAPD_STEAL, nr_reclaimed);
 	__count_zone_vm_events(PGSTEAL, zone, nr_reclaimed);
 
-	putback_inactive_pages(lruvec, &page_list);
+	putback_inactive_pages(&lruvec, &page_list);
 
 	__mod_zone_page_state(zone, NR_ISOLATED_ANON, -nr_anon);
 	__mod_zone_page_state(zone, NR_ISOLATED_FILE, -nr_file);
 
-	spin_unlock_irq(&zone->lru_lock);
+	unlock_lruvec(lruvec);
 
 	free_hot_cold_page_list(&page_list, 1);
 
@@ -1611,42 +1609,44 @@ shrink_inactive_list(unsigned long nr_to
  * But we had to alter page->flags anyway.
  */
 
-static void move_active_pages_to_lru(struct lruvec *lruvec,
+static void move_active_pages_to_lru(struct lruvec **lruvec,
 				     struct list_head *list,
 				     struct list_head *pages_to_free,
 				     enum lru_list lru)
 {
-	struct zone *zone = lruvec->zone;
 	unsigned long pgmoved = 0;
 	struct page *page;
 	int nr_pages;
 
 	while (!list_empty(list)) {
 		page = lru_to_page(list);
-		lruvec = mem_cgroup_page_lruvec(page, zone);
+
+		/* lock lru, occasionally changing lruvec */
+		mem_cgroup_page_relock_lruvec(page, lruvec);
 
 		VM_BUG_ON(PageLRU(page));
 		SetPageLRU(page);
 
 		nr_pages = hpage_nr_pages(page);
-		list_move(&page->lru, &lruvec->lists[lru]);
-		mem_cgroup_update_lru_size(lruvec, lru, nr_pages);
+		list_move(&page->lru, &(*lruvec)->lists[lru]);
+		mem_cgroup_update_lru_size(*lruvec, lru, nr_pages);
 		pgmoved += nr_pages;
 
 		if (put_page_testzero(page)) {
 			__ClearPageLRU(page);
 			__ClearPageActive(page);
-			del_page_from_lru_list(page, lruvec, lru);
+			del_page_from_lru_list(page, *lruvec, lru);
 
 			if (unlikely(PageCompound(page))) {
-				spin_unlock_irq(&zone->lru_lock);
+				unlock_lruvec(*lruvec);
 				(*get_compound_page_dtor(page))(page);
-				spin_lock_irq(&zone->lru_lock);
+				lock_lruvec(*lruvec);
 			} else
 				list_add(&page->lru, pages_to_free);
 		}
 	}
-	__mod_zone_page_state(zone, NR_LRU_BASE + lru, pgmoved);
+
+	__mod_zone_page_state((*lruvec)->zone, NR_LRU_BASE + lru, pgmoved);
 	if (!is_active_lru(lru))
 		__count_vm_events(PGDEACTIVATE, pgmoved);
 }
@@ -1676,7 +1676,7 @@ static void shrink_active_list(unsigned
 	if (!sc->may_writepage)
 		isolate_mode |= ISOLATE_CLEAN;
 
-	spin_lock_irq(&zone->lru_lock);
+	lock_lruvec(lruvec);
 
 	nr_taken = isolate_lru_pages(nr_to_scan, lruvec, &l_hold,
 				     &nr_scanned, sc, isolate_mode, 1, file);
@@ -1691,7 +1691,8 @@ static void shrink_active_list(unsigned
 	else
 		__mod_zone_page_state(zone, NR_ACTIVE_ANON, -nr_taken);
 	__mod_zone_page_state(zone, NR_ISOLATED_ANON + file, nr_taken);
-	spin_unlock_irq(&zone->lru_lock);
+
+	unlock_lruvec(lruvec);
 
 	while (!list_empty(&l_hold)) {
 		cond_resched();
@@ -1735,7 +1736,7 @@ static void shrink_active_list(unsigned
 	/*
 	 * Move pages back to the lru list.
 	 */
-	spin_lock_irq(&zone->lru_lock);
+	lock_lruvec(lruvec);
 	/*
 	 * Count referenced pages from currently used mappings as rotated,
 	 * even though only some of them are actually re-activated.  This
@@ -1744,12 +1745,13 @@ static void shrink_active_list(unsigned
 	 */
 	reclaim_stat->recent_rotated[file] += nr_rotated;
 
-	move_active_pages_to_lru(lruvec, &l_active, &l_hold,
+	move_active_pages_to_lru(&lruvec, &l_active, &l_hold,
 						LRU_ACTIVE + file * LRU_FILE);
-	move_active_pages_to_lru(lruvec, &l_inactive, &l_hold,
+	move_active_pages_to_lru(&lruvec, &l_inactive, &l_hold,
 						LRU_BASE   + file * LRU_FILE);
 	__mod_zone_page_state(zone, NR_ISOLATED_ANON + file, -nr_taken);
-	spin_unlock_irq(&zone->lru_lock);
+
+	unlock_lruvec(lruvec);
 
 	free_hot_cold_page_list(&l_hold, 1);
 }
@@ -1940,7 +1942,7 @@ static void get_scan_count(struct lruvec
 	 *
 	 * anon in [0], file in [1]
 	 */
-	spin_lock_irq(&zone->lru_lock);
+	lock_lruvec(lruvec);
 	if (unlikely(reclaim_stat->recent_scanned[0] > anon / 4)) {
 		reclaim_stat->recent_scanned[0] /= 2;
 		reclaim_stat->recent_rotated[0] /= 2;
@@ -1961,7 +1963,7 @@ static void get_scan_count(struct lruvec
 
 	fp = (file_prio + 1) * (reclaim_stat->recent_scanned[1] + 1);
 	fp /= reclaim_stat->recent_rotated[1] + 1;
-	spin_unlock_irq(&zone->lru_lock);
+	unlock_lruvec(lruvec);
 
 	fraction[0] = ap;
 	fraction[1] = fp;
@@ -3525,25 +3527,16 @@ int page_evictable(struct page *page, st
  */
 void check_move_unevictable_pages(struct page **pages, int nr_pages)
 {
-	struct lruvec *lruvec;
-	struct zone *zone = NULL;
+	struct lruvec *lruvec = NULL;
 	int pgscanned = 0;
 	int pgrescued = 0;
 	int i;
 
 	for (i = 0; i < nr_pages; i++) {
 		struct page *page = pages[i];
-		struct zone *pagezone;
 
 		pgscanned++;
-		pagezone = page_zone(page);
-		if (pagezone != zone) {
-			if (zone)
-				spin_unlock_irq(&zone->lru_lock);
-			zone = pagezone;
-			spin_lock_irq(&zone->lru_lock);
-		}
-		lruvec = mem_cgroup_page_lruvec(page, zone);
+		page_relock_lruvec(page, &lruvec);
 
 		if (!PageLRU(page) || !PageUnevictable(page))
 			continue;
@@ -3559,10 +3552,10 @@ void check_move_unevictable_pages(struct
 		}
 	}
 
-	if (zone) {
+	if (lruvec) {
 		__count_vm_events(UNEVICTABLE_PGRESCUED, pgrescued);
 		__count_vm_events(UNEVICTABLE_PGSCANNED, pgscanned);
-		spin_unlock_irq(&zone->lru_lock);
+		unlock_lruvec(lruvec);
 	}
 }
 #endif /* CONFIG_SHMEM */

^ permalink raw reply	[flat|nested] 72+ messages in thread

* [PATCH 6/10] mm/memcg: take care over pc->mem_cgroup
  2012-02-20 23:26 ` Hugh Dickins
@ 2012-02-20 23:34   ` Hugh Dickins
  -1 siblings, 0 replies; 72+ messages in thread
From: Hugh Dickins @ 2012-02-20 23:34 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Konstantin Khlebnikov, KAMEZAWA Hiroyuki, Johannes Weiner,
	Ying Han, linux-mm, linux-kernel

page_relock_lruvec() is using lookup_page_cgroup(page)->mem_cgroup
to find the memcg, and hence its per-zone lruvec for the page.  We
therefore need to be careful to see the right pc->mem_cgroup: where
is it updated?

In __mem_cgroup_commit_charge(), under lruvec lock whenever lru
care might be needed, with lrucare holding the page off the lru at
that time.

In mem_cgroup_reset_owner(), not under lruvec lock, but before the
page can be visible to others - except compaction or lumpy reclaim,
which ignore the page because it is not yet PageLRU.

In mem_cgroup_split_huge_fixup(), always under lruvec lock.

In mem_cgroup_move_account(), which holds several locks, but an
lruvec lock not among them: yet it still appears to be safe, because
the page has been taken off its old lru and not yet put on the new.

Be particularly careful in compaction's isolate_migratepages() and
vmscan's lumpy handling in isolate_lru_pages(): those approach the
page by its physical location, and so can encounter pages which
would not be found by any logical lookup.  For those cases we have
to change __isolate_lru_page() slightly: it must leave ClearPageLRU
to the caller, because compaction and lumpy cannot safely interfere
with a page until they have first isolated it and then locked lruvec.
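
Roughly, the caller sequence then becomes (a sketch of what the
compaction and lumpy hunks below do, with the huge-page and
unevictable checks omitted, as is the reset-to-root call
introduced next):

	if (__isolate_lru_page(page, mode, file) != 0)
		continue;	/* could not get a page reference */

	/* only now is it safe to look up and lock the page's lruvec */
	page_relock_lruvec(page, &lruvec);
	if (!PageLRU(page)) {
		/* raced with release: drop our reference and move on */
		unlock_lruvec(lruvec);
		lruvec = NULL;
		put_page(page);
		continue;
	}
	ClearPageLRU(page);
	del_page_from_lru_list(page, lruvec, page_lru(page));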

To the list above we have to add __mem_cgroup_uncharge_common(),
and new function mem_cgroup_reset_uncharged_to_root(): the first
resetting pc->mem_cgroup to root_mem_cgroup when a page off lru is
uncharged, and the second when an uncharged page is taken off lru
(which used to be achieved implicitly with the PageAcctLRU flag).

That's because there's a remote risk that compaction or lumpy reclaim
will spy a page while it has PageLRU set; then it's taken off LRU and
freed, its mem_cgroup torn down and freed, the page reallocated (so
get_page_unless_zero again succeeds); then compaction or lumpy reclaim
reaches its page_relock_lruvec, using the stale mem_cgroup for locking.

So long as there's one charge on the mem_cgroup, or a page on one of
its lrus, mem_cgroup_force_empty() cannot succeed and the mem_cgroup
cannot be destroyed.  But when an uncharged page is taken off lru,
or a page off lru is uncharged, it no longer protects its old memcg,
and the one stable root_mem_cgroup must then be used for it.
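
Stated as an invariant, what the two new reset points maintain is
roughly this (purely illustrative, not a check added by the patch,
and ignoring the transient windows covered by the locking above):

	static void check_memcg_protection(struct page *page)
	{
		struct page_cgroup *pc = lookup_page_cgroup(page);

		/*
		 * A page that is neither charged nor on an lru pins no
		 * memcg, so it must point at the undying root_mem_cgroup.
		 */
		if (!PageCgroupUsed(pc) && !PageLRU(page))
			VM_BUG_ON(pc->mem_cgroup != root_mem_cgroup);
	}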

Signed-off-by: Hugh Dickins <hughd@google.com>
---
 include/linux/memcontrol.h |    5 ++
 mm/compaction.c            |   36 ++++++-----------
 mm/memcontrol.c            |   45 +++++++++++++++++++--
 mm/swap.c                  |    2 
 mm/vmscan.c                |   73 +++++++++++++++++++++++++----------
 5 files changed, 114 insertions(+), 47 deletions(-)

--- mmotm.orig/include/linux/memcontrol.h	2012-02-18 11:57:42.675524592 -0800
+++ mmotm/include/linux/memcontrol.h	2012-02-18 11:57:49.103524745 -0800
@@ -65,6 +65,7 @@ extern int mem_cgroup_cache_charge(struc
 struct lruvec *mem_cgroup_zone_lruvec(struct zone *, struct mem_cgroup *);
 extern struct mem_cgroup *mem_cgroup_from_lruvec(struct lruvec *lruvec);
 extern void mem_cgroup_update_lru_size(struct lruvec *, enum lru_list, int);
+extern void mem_cgroup_reset_uncharged_to_root(struct page *);
 
 /* For coalescing uncharge for reducing memcg' overhead*/
 extern void mem_cgroup_uncharge_start(void);
@@ -251,6 +252,10 @@ static inline void mem_cgroup_update_lru
 {
 }
 
+static inline void mem_cgroup_reset_uncharged_to_root(struct page *page)
+{
+}
+
 static inline struct mem_cgroup *try_get_mem_cgroup_from_page(struct page *page)
 {
 	return NULL;
--- mmotm.orig/mm/compaction.c	2012-02-18 11:57:42.675524592 -0800
+++ mmotm/mm/compaction.c	2012-02-18 11:57:49.103524745 -0800
@@ -356,28 +356,6 @@ static isolate_migrate_t isolate_migrate
 			continue;
 		}
 
-		if (!lruvec) {
-			/*
-			 * We do need to take the lock before advancing to
-			 * check PageLRU etc., but there's no guarantee that
-			 * the page we're peeking at has a stable memcg here.
-			 */
-			lruvec = &zone->lruvec;
-			lock_lruvec(lruvec);
-		}
-		if (!PageLRU(page))
-			continue;
-
-		/*
-		 * PageLRU is set, and lru_lock excludes isolation,
-		 * splitting and collapsing (collapsing has already
-		 * happened if PageLRU is set).
-		 */
-		if (PageTransHuge(page)) {
-			low_pfn += (1 << compound_order(page)) - 1;
-			continue;
-		}
-
 		if (!cc->sync)
 			mode |= ISOLATE_ASYNC_MIGRATE;
 
@@ -386,10 +364,24 @@ static isolate_migrate_t isolate_migrate
 			continue;
 
 		page_relock_lruvec(page, &lruvec);
+		if (unlikely(!PageLRU(page) || PageUnevictable(page) ||
+						PageTransHuge(page))) {
+			/*
+			 * lru_lock excludes splitting a huge page,
+			 * but we cannot hold lru_lock while freeing page.
+			 */
+			low_pfn += (1 << compound_order(page)) - 1;
+			unlock_lruvec(lruvec);
+			lruvec = NULL;
+			put_page(page);
+			continue;
+		}
 
 		VM_BUG_ON(PageTransCompound(page));
 
 		/* Successfully isolated */
+		ClearPageLRU(page);
+		mem_cgroup_reset_uncharged_to_root(page);
 		del_page_from_lru_list(page, lruvec, page_lru(page));
 		list_add(&page->lru, migratelist);
 		cc->nr_migratepages++;
--- mmotm.orig/mm/memcontrol.c	2012-02-18 11:57:42.679524592 -0800
+++ mmotm/mm/memcontrol.c	2012-02-18 11:57:49.107524745 -0800
@@ -1069,6 +1069,33 @@ void page_relock_lruvec(struct page *pag
 	*lruvp = lruvec;
 }
 
+void mem_cgroup_reset_uncharged_to_root(struct page *page)
+{
+	struct page_cgroup *pc;
+
+	if (mem_cgroup_disabled())
+		return;
+
+	VM_BUG_ON(PageLRU(page));
+
+	/*
+	 * Once an uncharged page is isolated from the mem_cgroup's lru,
+	 * it no longer protects that mem_cgroup from rmdir: reset to root.
+	 *
+	 * __page_cache_release() and release_pages() may be called at
+	 * interrupt time: we cannot lock_page_cgroup() then (we might
+	 * have interrupted a section with page_cgroup already locked),
+	 * nor do we need to since the page is frozen and about to be freed.
+	 */
+	pc = lookup_page_cgroup(page);
+	if (page_count(page))
+		lock_page_cgroup(pc);
+	if (!PageCgroupUsed(pc) && pc->mem_cgroup != root_mem_cgroup)
+		pc->mem_cgroup = root_mem_cgroup;
+	if (page_count(page))
+		unlock_page_cgroup(pc);
+}
+
 /**
  * mem_cgroup_update_lru_size - account for adding or removing an lru page
  * @lruvec: mem_cgroup per zone lru vector
@@ -2865,6 +2892,7 @@ __mem_cgroup_uncharge_common(struct page
 	struct mem_cgroup *memcg = NULL;
 	unsigned int nr_pages = 1;
 	struct page_cgroup *pc;
+	struct lruvec *lruvec;
 	bool anon;
 
 	if (mem_cgroup_disabled())
@@ -2884,6 +2912,7 @@ __mem_cgroup_uncharge_common(struct page
 	if (unlikely(!PageCgroupUsed(pc)))
 		return NULL;
 
+	lruvec = page_lock_lruvec(page);
 	lock_page_cgroup(pc);
 
 	memcg = pc->mem_cgroup;
@@ -2915,14 +2944,17 @@ __mem_cgroup_uncharge_common(struct page
 	mem_cgroup_charge_statistics(memcg, anon, -nr_pages);
 
 	ClearPageCgroupUsed(pc);
+
 	/*
-	 * pc->mem_cgroup is not cleared here. It will be accessed when it's
-	 * freed from LRU. This is safe because uncharged page is expected not
-	 * to be reused (freed soon). Exception is SwapCache, it's handled by
-	 * special functions.
+	 * Once an uncharged page is isolated from the mem_cgroup's lru,
+	 * it no longer protects that mem_cgroup from rmdir: reset to root.
 	 */
+	if (!PageLRU(page) && pc->mem_cgroup != root_mem_cgroup)
+		pc->mem_cgroup = root_mem_cgroup;
 
 	unlock_page_cgroup(pc);
+	unlock_lruvec(lruvec);
+
 	/*
 	 * even after unlock, we have memcg->res.usage here and this memcg
 	 * will never be freed.
@@ -2939,6 +2971,7 @@ __mem_cgroup_uncharge_common(struct page
 
 unlock_out:
 	unlock_page_cgroup(pc);
+	unlock_lruvec(lruvec);
 	return NULL;
 }
 
@@ -3327,7 +3360,9 @@ static struct page_cgroup *lookup_page_c
 	 * the first time, i.e. during boot or memory hotplug;
 	 * or when mem_cgroup_disabled().
 	 */
-	if (likely(pc) && PageCgroupUsed(pc))
+	if (!pc || PageCgroupUsed(pc))
+		return pc;
+	if (pc->mem_cgroup && pc->mem_cgroup != root_mem_cgroup)
 		return pc;
 	return NULL;
 }
--- mmotm.orig/mm/swap.c	2012-02-18 11:57:42.679524592 -0800
+++ mmotm/mm/swap.c	2012-02-18 11:57:49.107524745 -0800
@@ -52,6 +52,7 @@ static void __page_cache_release(struct
 		lruvec = page_lock_lruvec(page);
 		VM_BUG_ON(!PageLRU(page));
 		__ClearPageLRU(page);
+		mem_cgroup_reset_uncharged_to_root(page);
 		del_page_from_lru_list(page, lruvec, page_off_lru(page));
 		unlock_lruvec(lruvec);
 	}
@@ -583,6 +584,7 @@ void release_pages(struct page **pages,
 			page_relock_lruvec(page, &lruvec);
 			VM_BUG_ON(!PageLRU(page));
 			__ClearPageLRU(page);
+			mem_cgroup_reset_uncharged_to_root(page);
 			del_page_from_lru_list(page, lruvec, page_off_lru(page));
 		}
 
--- mmotm.orig/mm/vmscan.c	2012-02-18 11:57:42.679524592 -0800
+++ mmotm/mm/vmscan.c	2012-02-18 11:57:49.107524745 -0800
@@ -1087,11 +1087,11 @@ int __isolate_lru_page(struct page *page
 
 	if (likely(get_page_unless_zero(page))) {
 		/*
-		 * Be careful not to clear PageLRU until after we're
-		 * sure the page is not being freed elsewhere -- the
-		 * page release code relies on it.
+		 * Beware of interface change: now leave ClearPageLRU(page)
+		 * to the caller, because memcg's lumpy and compaction
+		 * cases (approaching the page by its physical location)
+		 * may not have the right lru_lock yet.
 		 */
-		ClearPageLRU(page);
 		ret = 0;
 	}
 
@@ -1154,7 +1154,16 @@ static unsigned long isolate_lru_pages(u
 
 		switch (__isolate_lru_page(page, mode, file)) {
 		case 0:
+#ifdef CONFIG_DEBUG_VM
+			/* check lock on page is lock we already got */
+			page_relock_lruvec(page, &lruvec);
+			BUG_ON(lruvec != home_lruvec);
+			BUG_ON(page != lru_to_page(src));
+			BUG_ON(page_lru(page) != lru);
+#endif
+			ClearPageLRU(page);
 			isolated_pages = hpage_nr_pages(page);
+			mem_cgroup_reset_uncharged_to_root(page);
 			mem_cgroup_update_lru_size(lruvec, lru, -isolated_pages);
 			list_move(&page->lru, dst);
 			nr_taken += isolated_pages;
@@ -1211,21 +1220,7 @@ static unsigned long isolate_lru_pages(u
 			    !PageSwapCache(cursor_page))
 				break;
 
-			if (__isolate_lru_page(cursor_page, mode, file) == 0) {
-				mem_cgroup_page_relock_lruvec(cursor_page,
-								&lruvec);
-				isolated_pages = hpage_nr_pages(cursor_page);
-				mem_cgroup_update_lru_size(lruvec,
-					page_lru(cursor_page), -isolated_pages);
-				list_move(&cursor_page->lru, dst);
-
-				nr_taken += isolated_pages;
-				nr_lumpy_taken += isolated_pages;
-				if (PageDirty(cursor_page))
-					nr_lumpy_dirty += isolated_pages;
-				scan++;
-				pfn += isolated_pages - 1;
-			} else {
+			if (__isolate_lru_page(cursor_page, mode, file) != 0) {
 				/*
 				 * Check if the page is freed already.
 				 *
@@ -1243,13 +1238,50 @@ static unsigned long isolate_lru_pages(u
 					continue;
 				break;
 			}
+
+			/*
+			 * This locking call is a no-op in the non-memcg
+			 * case, since we already hold the right lru_lock;
+			 * but it may change the lock in the memcg case.
+			 * It is then vital to recheck PageLRU (but not
+			 * necessary to recheck isolation mode).
+			 */
+			mem_cgroup_page_relock_lruvec(cursor_page, &lruvec);
+
+			if (PageLRU(cursor_page) &&
+			    !PageUnevictable(cursor_page)) {
+				ClearPageLRU(cursor_page);
+				isolated_pages = hpage_nr_pages(cursor_page);
+				mem_cgroup_reset_uncharged_to_root(cursor_page);
+				mem_cgroup_update_lru_size(lruvec,
+					page_lru(cursor_page), -isolated_pages);
+				list_move(&cursor_page->lru, dst);
+
+				nr_taken += isolated_pages;
+				nr_lumpy_taken += isolated_pages;
+				if (PageDirty(cursor_page))
+					nr_lumpy_dirty += isolated_pages;
+				scan++;
+				pfn += isolated_pages - 1;
+			} else {
+				/* Cannot hold lru_lock while freeing page */
+				unlock_lruvec(lruvec);
+				lruvec = NULL;
+				put_page(cursor_page);
+				break;
+			}
 		}
 
 		/* If we break out of the loop above, lumpy reclaim failed */
 		if (pfn < end_pfn)
 			nr_lumpy_failed++;
 
-		lruvec = home_lruvec;
+		if (lruvec != home_lruvec) {
+			if (lruvec)
+				unlock_lruvec(lruvec);
+			lruvec = home_lruvec;
+			lock_lruvec(lruvec);
+		}
 	}
 
 	*nr_scanned = scan;
@@ -1301,6 +1333,7 @@ int isolate_lru_page(struct page *page)
 			int lru = page_lru(page);
 			get_page(page);
 			ClearPageLRU(page);
+			mem_cgroup_reset_uncharged_to_root(page);
 			del_page_from_lru_list(page, lruvec, lru);
 			ret = 0;
 		}

^ permalink raw reply	[flat|nested] 72+ messages in thread

* [PATCH 7/10] mm/memcg: remove mem_cgroup_reset_owner
  2012-02-20 23:26 ` Hugh Dickins
@ 2012-02-20 23:35   ` Hugh Dickins
  -1 siblings, 0 replies; 72+ messages in thread
From: Hugh Dickins @ 2012-02-20 23:35 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Konstantin Khlebnikov, KAMEZAWA Hiroyuki, Johannes Weiner,
	Ying Han, linux-mm, linux-kernel

With mem_cgroup_reset_uncharged_to_root() now making sure that freed
pages point to root_mem_cgroup (instead of to a stale and perhaps
long-deleted memcg), we no longer need to initialize a page's memcg to
root in those odd places which put a page on lru before charging.
Delete mem_cgroup_reset_owner().

But: we have no init_page_cgroup() nowadays (and even when we had one,
it was called before root_mem_cgroup had been allocated); so until
a struct page has once entered the memcg lru cycle, its page_cgroup
->mem_cgroup will be NULL instead of root_mem_cgroup.

That could be fixed by reintroducing init_page_cgroup(), and ordering
properly: in future we shall probably want root_mem_cgroup in kernel
bss or data like swapper_space; but let's not get into that right now.

Instead allow for this in page_relock_lruvec(): treating NULL as
root_mem_cgroup, and correcting pc->mem_cgroup before going further.

What?  Before even taking the zone->lru_lock?  Is that safe?
Yes, because compaction and lumpy reclaim use __isolate_lru_page(),
which refuses unless it sees PageLRU - which may be cleared at any
instant, but we only need it to have been set once in the past for
pc->mem_cgroup to be initialized properly.

Signed-off-by: Hugh Dickins <hughd@google.com>
---
 include/linux/memcontrol.h |    5 -----
 mm/ksm.c                   |   11 -----------
 mm/memcontrol.c            |   23 ++++++-----------------
 mm/migrate.c               |    2 --
 mm/swap_state.c            |   10 ----------
 5 files changed, 6 insertions(+), 45 deletions(-)

--- mmotm.orig/include/linux/memcontrol.h	2012-02-18 11:57:49.103524745 -0800
+++ mmotm/include/linux/memcontrol.h	2012-02-18 11:57:55.551524898 -0800
@@ -120,7 +120,6 @@ extern void mem_cgroup_print_oom_info(st
 extern void mem_cgroup_replace_page_cache(struct page *oldpage,
 					struct page *newpage);
 
-extern void mem_cgroup_reset_owner(struct page *page);
 #ifdef CONFIG_CGROUP_MEM_RES_CTLR_SWAP
 extern int do_swap_account;
 #endif
@@ -383,10 +382,6 @@ static inline void mem_cgroup_replace_pa
 				struct page *newpage)
 {
 }
-
-static inline void mem_cgroup_reset_owner(struct page *page)
-{
-}
 #endif /* CONFIG_CGROUP_MEM_RES_CTLR */
 
 #if !defined(CONFIG_CGROUP_MEM_RES_CTLR) || !defined(CONFIG_DEBUG_VM)
--- mmotm.orig/mm/ksm.c	2012-02-18 11:56:23.435522709 -0800
+++ mmotm/mm/ksm.c	2012-02-18 11:57:55.551524898 -0800
@@ -28,7 +28,6 @@
 #include <linux/kthread.h>
 #include <linux/wait.h>
 #include <linux/slab.h>
-#include <linux/memcontrol.h>
 #include <linux/rbtree.h>
 #include <linux/memory.h>
 #include <linux/mmu_notifier.h>
@@ -1572,16 +1571,6 @@ struct page *ksm_does_need_to_copy(struc
 
 	new_page = alloc_page_vma(GFP_HIGHUSER_MOVABLE, vma, address);
 	if (new_page) {
-		/*
-		 * The memcg-specific accounting when moving
-		 * pages around the LRU lists relies on the
-		 * page's owner (memcg) to be valid.  Usually,
-		 * pages are assigned to a new owner before
-		 * being put on the LRU list, but since this
-		 * is not the case here, the stale owner from
-		 * a previous allocation cycle must be reset.
-		 */
-		mem_cgroup_reset_owner(new_page);
 		copy_user_highpage(new_page, page, address, vma);
 
 		SetPageDirty(new_page);
--- mmotm.orig/mm/memcontrol.c	2012-02-18 11:57:49.107524745 -0800
+++ mmotm/mm/memcontrol.c	2012-02-18 11:57:55.551524898 -0800
@@ -1053,6 +1053,12 @@ void page_relock_lruvec(struct page *pag
 	else {
 		pc = lookup_page_cgroup(page);
 		memcg = pc->mem_cgroup;
+		/*
+		 * At present we start up with all page_cgroups initialized
+		 * to zero: correct that to root_mem_cgroup once we see it.
+		 */
+		if (unlikely(!memcg))
+			memcg = pc->mem_cgroup = root_mem_cgroup;
 		mz = page_cgroup_zoneinfo(memcg, page);
 		lruvec = &mz->lruvec;
 	}
@@ -3038,23 +3044,6 @@ void mem_cgroup_uncharge_end(void)
 	batch->memcg = NULL;
 }
 
-/*
- * A function for resetting pc->mem_cgroup for newly allocated pages.
- * This function should be called if the newpage will be added to LRU
- * before start accounting.
- */
-void mem_cgroup_reset_owner(struct page *newpage)
-{
-	struct page_cgroup *pc;
-
-	if (mem_cgroup_disabled())
-		return;
-
-	pc = lookup_page_cgroup(newpage);
-	VM_BUG_ON(PageCgroupUsed(pc));
-	pc->mem_cgroup = root_mem_cgroup;
-}
-
 #ifdef CONFIG_SWAP
 /*
  * called after __delete_from_swap_cache() and drop "page" account.
--- mmotm.orig/mm/migrate.c	2012-02-18 11:56:23.435522709 -0800
+++ mmotm/mm/migrate.c	2012-02-18 11:57:55.551524898 -0800
@@ -839,8 +839,6 @@ static int unmap_and_move(new_page_t get
 	if (!newpage)
 		return -ENOMEM;
 
-	mem_cgroup_reset_owner(newpage);
-
 	if (page_count(page) == 1) {
 		/* page was freed from under us. So we are done. */
 		goto out;
--- mmotm.orig/mm/swap_state.c	2012-02-18 11:56:23.435522709 -0800
+++ mmotm/mm/swap_state.c	2012-02-18 11:57:55.551524898 -0800
@@ -300,16 +300,6 @@ struct page *read_swap_cache_async(swp_e
 			new_page = alloc_page_vma(gfp_mask, vma, addr);
 			if (!new_page)
 				break;		/* Out of memory */
-			/*
-			 * The memcg-specific accounting when moving
-			 * pages around the LRU lists relies on the
-			 * page's owner (memcg) to be valid.  Usually,
-			 * pages are assigned to a new owner before
-			 * being put on the LRU list, but since this
-			 * is not the case here, the stale owner from
-			 * a previous allocation cycle must be reset.
-			 */
-			mem_cgroup_reset_owner(new_page);
 		}
 
 		/*

^ permalink raw reply	[flat|nested] 72+ messages in thread

* [PATCH 8/10] mm/memcg: nest lru_lock inside page_cgroup lock
  2012-02-20 23:26 ` Hugh Dickins
@ 2012-02-20 23:36   ` Hugh Dickins
  -1 siblings, 0 replies; 72+ messages in thread
From: Hugh Dickins @ 2012-02-20 23:36 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Konstantin Khlebnikov, KAMEZAWA Hiroyuki, Johannes Weiner,
	Ying Han, linux-mm, linux-kernel

Cut back on some of the overhead we've added, particularly the lruvec
locking added to every __mem_cgroup_uncharge_common(), and the page
cgroup locking in mem_cgroup_reset_uncharged_to_root().

Our hands were tied by the lock ordering (page cgroup inside lruvec)
defined by __mem_cgroup_commit_charge_lrucare().  There is no strong
reason why that nesting needs to be one way or the other, and if
we invert it, then some optimizations become possible.

So delete __mem_cgroup_commit_charge_lrucare(), passing a bool lrucare
to __mem_cgroup_commit_charge() instead, using page_lock_lruvec() there
inside lock_page_cgroup() in the lrucare case.  (I'd prefer to work it
out internally, rather than rely upon an lrucare argument: but that is hard -
certainly PageLRU is not enough, racing with pages on pagevec about to
be flushed to lru.)  Use page_relock_lruvec() after setting mem_cgroup,
before adding to the appropriate new lruvec: so that (if lock depends
on memcg) old lock is held across change in ownership while off lru.
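
For readability, the resulting order of operations in the lrucare case is
roughly this (condensed from the __mem_cgroup_commit_charge() hunk below;
the used-check, statistics and event check are omitted):

	lock_page_cgroup(pc);
	if (lrucare) {
		lruvec = page_lock_lruvec(page);
		if (PageLRU(page)) {
			ClearPageLRU(page);
			del_page_from_lru_list(page, lruvec, page_lru(page));
			was_on_lru = true;
		}
	}
	pc->mem_cgroup = memcg;		/* ownership changes while off lru */
	SetPageCgroupUsed(pc);
	if (lrucare) {
		if (was_on_lru) {
			/* may drop the old lock and take the new owner's */
			page_relock_lruvec(page, &lruvec);
			if (!PageLRU(page)) {
				SetPageLRU(page);
				add_page_to_lru_list(page, lruvec, page_lru(page));
			}
		}
		unlock_lruvec(lruvec);
	}
	unlock_page_cgroup(pc);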

Delete the lruvec locking on entry to __mem_cgroup_uncharge_common();
but if the page being uncharged is not on lru, then we do need to
reset its ownership, and must dance very carefully with mem_cgroup_
reset_uncharged_to_root(), to make sure that when there's a race
between uncharging and removing from lru, one side or the other
will see it - smp_mb__after_clear_bit() at both ends.
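
Schematically, the two ends pair up like this (condensed from the hunks
below; the page_count() shortcut and lruvec locking are omitted here):

	/* uncharge side: __mem_cgroup_uncharge_common() */
	ClearPageCgroupUsed(pc);
	smp_mb__after_clear_bit();	/* publish !Used before testing PageLRU */
	if (!PageLRU(page) && pc->mem_cgroup != root_mem_cgroup)
		pc->mem_cgroup = root_mem_cgroup;

	/* lru side: mem_cgroup_reset_uncharged_to_root(), caller cleared PageLRU */
	smp_mb__after_clear_bit();	/* publish !LRU before testing Used */
	if (!PageCgroupUsed(pc) && pc->mem_cgroup != root_mem_cgroup)
		pc->mem_cgroup = root_mem_cgroup;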

Avoid overhead of calls to mem_cgroup_reset_uncharged_to_root() from
release_pages() and __page_cache_release(), by doing its work inside
page_relock_lruvec() when the page_count is 0, i.e. the page is frozen
from other references and about to be freed.  That was not possible
with the old lock ordering, since __mem_cgroup_uncharge_common()'s
lock then changed ownership too soon.
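
The fast free path then picks this up in passing (condensed from the
page_relock_lruvec() hunk below):

	/* page frozen: refcount 0, nobody else can take a reference now */
	if (memcg != root_mem_cgroup && !page_count(page))
		pc->mem_cgroup = root_mem_cgroup;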

Signed-off-by: Hugh Dickins <hughd@google.com>
---
 mm/memcontrol.c |  142 ++++++++++++++++++++++++----------------------
 mm/swap.c       |    2 
 2 files changed, 75 insertions(+), 69 deletions(-)

--- mmotm.orig/mm/memcontrol.c	2012-02-18 11:57:55.551524898 -0800
+++ mmotm/mm/memcontrol.c	2012-02-18 11:58:02.451525062 -0800
@@ -1059,6 +1059,14 @@ void page_relock_lruvec(struct page *pag
 		 */
 		if (unlikely(!memcg))
 			memcg = pc->mem_cgroup = root_mem_cgroup;
+		/*
+		 * We must reset pc->mem_cgroup back to root before freeing
+		 * a page: avoid additional callouts from hot paths by doing
+		 * it here when we see the page is frozen (can safely be done
+		 * before taking lru_lock because the page is frozen).
+		 */
+		if (memcg != root_mem_cgroup && !page_count(page))
+			pc->mem_cgroup = root_mem_cgroup;
 		mz = page_cgroup_zoneinfo(memcg, page);
 		lruvec = &mz->lruvec;
 	}
@@ -1083,23 +1091,20 @@ void mem_cgroup_reset_uncharged_to_root(
 		return;
 
 	VM_BUG_ON(PageLRU(page));
+	/*
+	 * Caller just did ClearPageLRU():
+	 * make sure that __mem_cgroup_uncharge_common()
+	 * can see that before we test PageCgroupUsed(pc).
+	 */
+	smp_mb__after_clear_bit();
 
 	/*
 	 * Once an uncharged page is isolated from the mem_cgroup's lru,
 	 * it no longer protects that mem_cgroup from rmdir: reset to root.
-	 *
-	 * __page_cache_release() and release_pages() may be called at
-	 * interrupt time: we cannot lock_page_cgroup() then (we might
-	 * have interrupted a section with page_cgroup already locked),
-	 * nor do we need to since the page is frozen and about to be freed.
 	 */
 	pc = lookup_page_cgroup(page);
-	if (page_count(page))
-		lock_page_cgroup(pc);
 	if (!PageCgroupUsed(pc) && pc->mem_cgroup != root_mem_cgroup)
 		pc->mem_cgroup = root_mem_cgroup;
-	if (page_count(page))
-		unlock_page_cgroup(pc);
 }
 
 /**
@@ -2422,9 +2427,11 @@ static void __mem_cgroup_commit_charge(s
 				       struct page *page,
 				       unsigned int nr_pages,
 				       struct page_cgroup *pc,
-				       enum charge_type ctype)
+				       enum charge_type ctype,
+				       bool lrucare)
 {
-	bool anon;
+	struct lruvec *lruvec;
+	bool was_on_lru = false;
 
 	lock_page_cgroup(pc);
 	if (unlikely(PageCgroupUsed(pc))) {
@@ -2433,28 +2440,41 @@ static void __mem_cgroup_commit_charge(s
 		return;
 	}
 	/*
-	 * we don't need page_cgroup_lock about tail pages, becase they are not
-	 * accessed by any other context at this point.
+	 * We don't need lock_page_cgroup on tail pages, because they are not
+	 * accessible to any other context at this point.
 	 */
-	pc->mem_cgroup = memcg;
+
 	/*
-	 * We access a page_cgroup asynchronously without lock_page_cgroup().
-	 * Especially when a page_cgroup is taken from a page, pc->mem_cgroup
-	 * is accessed after testing USED bit. To make pc->mem_cgroup visible
-	 * before USED bit, we need memory barrier here.
-	 * See mem_cgroup_add_lru_list(), etc.
- 	 */
-	smp_wmb();
+	 * In some cases, SwapCache and FUSE(splice_buf->radixtree), the page
+	 * may already be on some other page_cgroup's LRU.  Take care of it.
+	 */
+	if (lrucare) {
+		lruvec = page_lock_lruvec(page);
+		if (PageLRU(page)) {
+			ClearPageLRU(page);
+			del_page_from_lru_list(page, lruvec, page_lru(page));
+			was_on_lru = true;
+		}
+	}
 
+	pc->mem_cgroup = memcg;
 	SetPageCgroupUsed(pc);
-	if (ctype == MEM_CGROUP_CHARGE_TYPE_MAPPED)
-		anon = true;
-	else
-		anon = false;
 
-	mem_cgroup_charge_statistics(memcg, anon, nr_pages);
+	if (lrucare) {
+		if (was_on_lru) {
+			page_relock_lruvec(page, &lruvec);
+			if (!PageLRU(page)) {
+				SetPageLRU(page);
+				add_page_to_lru_list(page, lruvec, page_lru(page));
+			}
+		}
+		unlock_lruvec(lruvec);
+	}
+
+	mem_cgroup_charge_statistics(memcg,
+			ctype == MEM_CGROUP_CHARGE_TYPE_MAPPED, nr_pages);
 	unlock_page_cgroup(pc);
-	WARN_ON_ONCE(PageLRU(page));
+
 	/*
 	 * "charge_statistics" updated event counter. Then, check it.
 	 * Insert ancestor (and ancestor's ancestors), to softlimit RB-tree.
@@ -2652,7 +2672,7 @@ static int mem_cgroup_charge_common(stru
 	ret = __mem_cgroup_try_charge(mm, gfp_mask, nr_pages, &memcg, oom);
 	if (ret == -ENOMEM)
 		return ret;
-	__mem_cgroup_commit_charge(memcg, page, nr_pages, pc, ctype);
+	__mem_cgroup_commit_charge(memcg, page, nr_pages, pc, ctype, false);
 	return 0;
 }
 
@@ -2672,34 +2692,6 @@ static void
 __mem_cgroup_commit_charge_swapin(struct page *page, struct mem_cgroup *ptr,
 					enum charge_type ctype);
 
-static void
-__mem_cgroup_commit_charge_lrucare(struct page *page, struct mem_cgroup *memcg,
-					enum charge_type ctype)
-{
-	struct page_cgroup *pc = lookup_page_cgroup(page);
-	struct lruvec *lruvec;
-	bool removed = false;
-
-	/*
-	 * In some case, SwapCache, FUSE(splice_buf->radixtree), the page
-	 * is already on LRU. It means the page may on some other page_cgroup's
-	 * LRU. Take care of it.
-	 */
-	lruvec = page_lock_lruvec(page);
-	if (PageLRU(page)) {
-		del_page_from_lru_list(page, lruvec, page_lru(page));
-		ClearPageLRU(page);
-		removed = true;
-	}
-	__mem_cgroup_commit_charge(memcg, page, 1, pc, ctype);
-	if (removed) {
-		page_relock_lruvec(page, &lruvec);
-		add_page_to_lru_list(page, lruvec, page_lru(page));
-		SetPageLRU(page);
-	}
-	unlock_lruvec(lruvec);
-}
-
 int mem_cgroup_cache_charge(struct page *page, struct mm_struct *mm,
 				gfp_t gfp_mask)
 {
@@ -2777,13 +2769,16 @@ static void
 __mem_cgroup_commit_charge_swapin(struct page *page, struct mem_cgroup *memcg,
 					enum charge_type ctype)
 {
+	struct page_cgroup *pc;
+
 	if (mem_cgroup_disabled())
 		return;
 	if (!memcg)
 		return;
 	cgroup_exclude_rmdir(&memcg->css);
 
-	__mem_cgroup_commit_charge_lrucare(page, memcg, ctype);
+	pc = lookup_page_cgroup(page);
+	__mem_cgroup_commit_charge(memcg, page, 1, pc, ctype, true);
 	/*
 	 * Now swap is on-memory. This means this page may be
 	 * counted both as mem and swap....double count.
@@ -2898,7 +2893,6 @@ __mem_cgroup_uncharge_common(struct page
 	struct mem_cgroup *memcg = NULL;
 	unsigned int nr_pages = 1;
 	struct page_cgroup *pc;
-	struct lruvec *lruvec;
 	bool anon;
 
 	if (mem_cgroup_disabled())
@@ -2918,7 +2912,6 @@ __mem_cgroup_uncharge_common(struct page
 	if (unlikely(!PageCgroupUsed(pc)))
 		return NULL;
 
-	lruvec = page_lock_lruvec(page);
 	lock_page_cgroup(pc);
 
 	memcg = pc->mem_cgroup;
@@ -2950,16 +2943,31 @@ __mem_cgroup_uncharge_common(struct page
 	mem_cgroup_charge_statistics(memcg, anon, -nr_pages);
 
 	ClearPageCgroupUsed(pc);
+	/*
+	 * Make sure that mem_cgroup_reset_uncharged_to_root()
+	 * can see that before we test PageLRU(page).
+	 */
+	smp_mb__after_clear_bit();
 
 	/*
 	 * Once an uncharged page is isolated from the mem_cgroup's lru,
 	 * it no longer protects that mem_cgroup from rmdir: reset to root.
-	 */
-	if (!PageLRU(page) && pc->mem_cgroup != root_mem_cgroup)
-		pc->mem_cgroup = root_mem_cgroup;
-
+	 *
+	 * The page_count() test avoids the lock in the common case when
+	 * shrink_page_list()'s __remove_mapping() has frozen references
+	 * to 0 and the page is on its way to freedom.
+	 */
+	if (!PageLRU(page) && pc->mem_cgroup != root_mem_cgroup) {
+		struct lruvec *lruvec = NULL;
+
+		if (page_count(page))
+			lruvec = page_lock_lruvec(page);
+		if (!PageLRU(page))
+			pc->mem_cgroup = root_mem_cgroup;
+		if (lruvec)
+			unlock_lruvec(lruvec);
+	}
 	unlock_page_cgroup(pc);
-	unlock_lruvec(lruvec);
 
 	/*
 	 * even after unlock, we have memcg->res.usage here and this memcg
@@ -2977,7 +2985,6 @@ __mem_cgroup_uncharge_common(struct page
 
 unlock_out:
 	unlock_page_cgroup(pc);
-	unlock_lruvec(lruvec);
 	return NULL;
 }
 
@@ -3248,7 +3255,7 @@ int mem_cgroup_prepare_migration(struct
 		ctype = MEM_CGROUP_CHARGE_TYPE_CACHE;
 	else
 		ctype = MEM_CGROUP_CHARGE_TYPE_SHMEM;
-	__mem_cgroup_commit_charge(memcg, newpage, 1, pc, ctype);
+	__mem_cgroup_commit_charge(memcg, newpage, 1, pc, ctype, false);
 	return ret;
 }
 
@@ -3335,7 +3342,8 @@ void mem_cgroup_replace_page_cache(struc
 	 * the newpage may be on LRU(or pagevec for LRU) already. We lock
 	 * LRU while we overwrite pc->mem_cgroup.
 	 */
-	__mem_cgroup_commit_charge_lrucare(newpage, memcg, type);
+	pc = lookup_page_cgroup(newpage);
+	__mem_cgroup_commit_charge(memcg, newpage, 1, pc, type, true);
 }
 
 #ifdef CONFIG_DEBUG_VM
--- mmotm.orig/mm/swap.c	2012-02-18 11:57:49.107524745 -0800
+++ mmotm/mm/swap.c	2012-02-18 11:58:02.451525062 -0800
@@ -52,7 +52,6 @@ static void __page_cache_release(struct
 		lruvec = page_lock_lruvec(page);
 		VM_BUG_ON(!PageLRU(page));
 		__ClearPageLRU(page);
-		mem_cgroup_reset_uncharged_to_root(page);
 		del_page_from_lru_list(page, lruvec, page_off_lru(page));
 		unlock_lruvec(lruvec);
 	}
@@ -584,7 +583,6 @@ void release_pages(struct page **pages,
 			page_relock_lruvec(page, &lruvec);
 			VM_BUG_ON(!PageLRU(page));
 			__ClearPageLRU(page);
-			mem_cgroup_reset_uncharged_to_root(page);
 			del_page_from_lru_list(page, lruvec, page_off_lru(page));
 		}
 

^ permalink raw reply	[flat|nested] 72+ messages in thread

* [PATCH 9/10] mm/memcg: move lru_lock into lruvec
  2012-02-20 23:26 ` Hugh Dickins
@ 2012-02-20 23:38   ` Hugh Dickins
  -1 siblings, 0 replies; 72+ messages in thread
From: Hugh Dickins @ 2012-02-20 23:38 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Konstantin Khlebnikov, KAMEZAWA Hiroyuki, Johannes Weiner,
	Ying Han, linux-mm, linux-kernel

We're nearly there.  Now move lru_lock and irqflags into struct lruvec,
so they are in every zone (for !MEM_RES_CTLR and mem_cgroup_disabled()
cases) and in every memcg lruvec.

Extend the memcg version of page_relock_lruvec() to drop old and take
new lock whenever changing lruvec.  But the memcg will only be stable
once we already have the lock: so, having got it, check if it's still
the lock we want, and retry if not.  It's for this retry that we route
all page lruvec locking through page_relock_lruvec().

No need for lock_page_cgroup() in here (which would entail reinverting
the lock ordering, and _irq'ing all of its calls): the lrucare protocol
when charging (holding old lock while changing owner then acquiring new)
fits correctly with this retry protocol.  In some places we also rely on
page_count 0 preventing further references, in some places on !PageLRU
protecting a page from outside interference: mem_cgroup_move_account()
is one example of the latter.

What if page_relock_lruvec() were preempted for a while, after reading
a valid mem_cgroup from page_cgroup, but before acquiring the lock?
In that case, an rmdir might free the mem_cgroup and its associated
zoneinfo, and we would take a spin_lock in freed memory.  But rcu_read_lock()
before we read mem_cgroup keeps it safe: cgroup.c uses synchronize_rcu()
in between pre_destroy (force_empty) and destroy (freeing structures).
mem_cgroup_force_empty() cannot succeed while there's any charge, or any
page on any of its lrus - and checks list_empty() while holding the lock.
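
Putting the retry and the RCU protection together, the core of the memcg
path reads roughly as follows (condensed from the page_relock_lruvec()
hunk below; the reset-to-root fixups are omitted):

	rcu_read_lock();
again:
	memcg = rcu_dereference(pc->mem_cgroup);
	mz = page_cgroup_zoneinfo(memcg ? : root_mem_cgroup, page);
	lruvec = &mz->lruvec;
	if (*lruvp && *lruvp != lruvec) {
		unlock_lruvec(*lruvp);
		*lruvp = NULL;
	}
	if (!*lruvp) {
		*lruvp = lruvec;
		lock_lruvec(lruvec);
		/* owner may have changed before we got the lock: retry */
		if (unlikely(pc->mem_cgroup != memcg))
			goto again;
	}
	rcu_read_unlock();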

But although we are now fully prepared, this patch keeps on using
the zone->lru_lock for all of a zone's memcgs: so that the cost or benefit
of split locking can be easily compared with the final patch (but
of course, some costs and benefits come earlier in the series).

Signed-off-by: Hugh Dickins <hughd@google.com>
---
 include/linux/mmzone.h |    4 +-
 include/linux/swap.h   |   13 +++---
 mm/memcontrol.c        |   74 ++++++++++++++++++++++++++-------------
 mm/page_alloc.c        |    2 -
 4 files changed, 59 insertions(+), 34 deletions(-)

--- mmotm.orig/include/linux/mmzone.h	2012-02-18 11:57:42.675524592 -0800
+++ mmotm/include/linux/mmzone.h	2012-02-18 11:58:09.047525220 -0800
@@ -174,6 +174,8 @@ struct zone_reclaim_stat {
 
 struct lruvec {
 	struct zone *zone;
+	spinlock_t lru_lock;
+	unsigned long irqflags;
 	struct list_head lists[NR_LRU_LISTS];
 	struct zone_reclaim_stat reclaim_stat;
 };
@@ -373,8 +375,6 @@ struct zone {
 	ZONE_PADDING(_pad1_)
 
 	/* Fields commonly accessed by the page reclaim scanner */
-	spinlock_t		lru_lock;
-	unsigned long		irqflags;
 	struct lruvec		lruvec;
 
 	unsigned long		pages_scanned;	   /* since last reclaim */
--- mmotm.orig/include/linux/swap.h	2012-02-18 11:57:42.675524592 -0800
+++ mmotm/include/linux/swap.h	2012-02-18 11:58:09.047525220 -0800
@@ -252,25 +252,24 @@ static inline void lru_cache_add_file(st
 
 static inline spinlock_t *lru_lockptr(struct lruvec *lruvec)
 {
-	return &lruvec->zone->lru_lock;
+	/* Still use per-zone lru_lock */
+	return &lruvec->zone->lruvec.lru_lock;
 }
 
 static inline void lock_lruvec(struct lruvec *lruvec)
 {
-	struct zone *zone = lruvec->zone;
 	unsigned long irqflags;
 
-	spin_lock_irqsave(&zone->lru_lock, irqflags);
-	zone->irqflags = irqflags;
+	spin_lock_irqsave(lru_lockptr(lruvec), irqflags);
+	lruvec->irqflags = irqflags;
 }
 
 static inline void unlock_lruvec(struct lruvec *lruvec)
 {
-	struct zone *zone = lruvec->zone;
 	unsigned long irqflags;
 
-	irqflags = zone->irqflags;
-	spin_unlock_irqrestore(&zone->lru_lock, irqflags);
+	irqflags = lruvec->irqflags;
+	spin_unlock_irqrestore(lru_lockptr(lruvec), irqflags);
 }
 
 #ifdef CONFIG_CGROUP_MEM_RES_CTLR
--- mmotm.orig/mm/memcontrol.c	2012-02-18 11:58:02.451525062 -0800
+++ mmotm/mm/memcontrol.c	2012-02-18 11:58:09.051525220 -0800
@@ -1048,39 +1048,64 @@ void page_relock_lruvec(struct page *pag
 	struct page_cgroup *pc;
 	struct lruvec *lruvec;
 
-	if (mem_cgroup_disabled())
+	if (unlikely(mem_cgroup_disabled())) {
 		lruvec = &page_zone(page)->lruvec;
-	else {
-		pc = lookup_page_cgroup(page);
-		memcg = pc->mem_cgroup;
-		/*
-		 * At present we start up with all page_cgroups initialized
-		 * to zero: correct that to root_mem_cgroup once we see it.
-		 */
-		if (unlikely(!memcg))
-			memcg = pc->mem_cgroup = root_mem_cgroup;
-		/*
-		 * We must reset pc->mem_cgroup back to root before freeing
-		 * a page: avoid additional callouts from hot paths by doing
-		 * it here when we see the page is frozen (can safely be done
-		 * before taking lru_lock because the page is frozen).
-		 */
-		if (memcg != root_mem_cgroup && !page_count(page))
-			pc->mem_cgroup = root_mem_cgroup;
-		mz = page_cgroup_zoneinfo(memcg, page);
-		lruvec = &mz->lruvec;
+		if (*lruvp && *lruvp != lruvec) {
+			unlock_lruvec(*lruvp);
+			*lruvp = NULL;
+		}
+		if (!*lruvp) {
+			*lruvp = lruvec;
+			lock_lruvec(lruvec);
+		}
+		return;
 	}
 
+	pc = lookup_page_cgroup(page);
+	/*
+	 * Imagine being preempted for a long time: we need to make sure that
+	 * the structure at pc->mem_cgroup, and structures it links to, cannot
+	 * be freed while we locate and acquire its zone lru_lock.  cgroup's
+	 * synchronize_rcu() between pre_destroy and destroy makes this safe.
+	 */
+	rcu_read_lock();
+again:
+	memcg = rcu_dereference(pc->mem_cgroup);
 	/*
-	 * For the moment, simply lock by zone just as before.
+	 * At present we start up with all page_cgroups initialized
+	 * to zero: here treat NULL as root_mem_cgroup, then correct
+	 * the page_cgroup below once we really have it locked.
 	 */
-	if (*lruvp && (*lruvp)->zone != lruvec->zone) {
+	mz = page_cgroup_zoneinfo(memcg ? : root_mem_cgroup, page);
+	lruvec = &mz->lruvec;
+
+	/*
+	 * Sometimes we are called with non-NULL *lruvp spinlock already held:
+	 * hold on if we want the same lock again, otherwise drop and acquire.
+	 */
+	if (*lruvp && *lruvp != lruvec) {
 		unlock_lruvec(*lruvp);
 		*lruvp = NULL;
 	}
-	if (!*lruvp)
+	if (!*lruvp) {
+		*lruvp = lruvec;
 		lock_lruvec(lruvec);
-	*lruvp = lruvec;
+		/*
+		 * But pc->mem_cgroup may have changed since we looked...
+		 */
+		if (unlikely(pc->mem_cgroup != memcg))
+			goto again;
+	}
+
+	/*
+	 * We must reset pc->mem_cgroup back to root before freeing a page:
+	 * avoid additional callouts from hot paths by doing it here when we
+	 * see the page is frozen.  Also initialize pc at first use of page.
+	 */
+	if (memcg != root_mem_cgroup && (!memcg || !page_count(page)))
+		pc->mem_cgroup = root_mem_cgroup;
+
+	rcu_read_unlock();
 }
 
 void mem_cgroup_reset_uncharged_to_root(struct page *page)
@@ -4744,6 +4769,7 @@ static int alloc_mem_cgroup_per_zone_inf
 	for (zone = 0; zone < MAX_NR_ZONES; zone++) {
 		mz = &pn->zoneinfo[zone];
 		mz->lruvec.zone = &NODE_DATA(node)->node_zones[zone];
+		spin_lock_init(&mz->lruvec.lru_lock);
 		for_each_lru(lru)
 			INIT_LIST_HEAD(&mz->lruvec.lists[lru]);
 		mz->usage_in_excess = 0;
--- mmotm.orig/mm/page_alloc.c	2012-02-18 11:57:28.375524252 -0800
+++ mmotm/mm/page_alloc.c	2012-02-18 11:58:09.051525220 -0800
@@ -4360,12 +4360,12 @@ static void __paginginit free_area_init_
 #endif
 		zone->name = zone_names[j];
 		spin_lock_init(&zone->lock);
-		spin_lock_init(&zone->lru_lock);
 		zone_seqlock_init(zone);
 		zone->zone_pgdat = pgdat;
 
 		zone_pcp_init(zone);
 		zone->lruvec.zone = zone;
+		spin_lock_init(&zone->lruvec.lru_lock);
 		for_each_lru(lru)
 			INIT_LIST_HEAD(&zone->lruvec.lists[lru]);
 		zone->lruvec.reclaim_stat.recent_rotated[0] = 0;

^ permalink raw reply	[flat|nested] 72+ messages in thread

* [PATCH 10/10] mm/memcg: per-memcg per-zone lru locking
  2012-02-20 23:26 ` Hugh Dickins
@ 2012-02-20 23:39   ` Hugh Dickins
  -1 siblings, 0 replies; 72+ messages in thread
From: Hugh Dickins @ 2012-02-20 23:39 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Konstantin Khlebnikov, KAMEZAWA Hiroyuki, Johannes Weiner,
	Ying Han, linux-mm, linux-kernel

Flip the switch from per-zone lru locking to per-memcg per-zone lru locking.

Signed-off-by: Hugh Dickins <hughd@google.com>
---
 include/linux/swap.h |    4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

--- mmotm.orig/include/linux/swap.h	2012-02-18 11:58:09.047525220 -0800
+++ mmotm/include/linux/swap.h	2012-02-18 11:58:15.659525376 -0800
@@ -252,8 +252,8 @@ static inline void lru_cache_add_file(st
 
 static inline spinlock_t *lru_lockptr(struct lruvec *lruvec)
 {
-	/* Still use per-zone lru_lock */
-	return &lruvec->zone->lruvec.lru_lock;
+	/* Now use per-memcg-per-zone lru_lock */
+	return &lruvec->lru_lock;
 }
 
 static inline void lock_lruvec(struct lruvec *lruvec)

^ permalink raw reply	[flat|nested] 72+ messages in thread

* Re: [PATCH 6/10] mm/memcg: take care over pc->mem_cgroup
  2012-02-20 23:34   ` Hugh Dickins
@ 2012-02-21  5:55     ` Konstantin Khlebnikov
  -1 siblings, 0 replies; 72+ messages in thread
From: Konstantin Khlebnikov @ 2012-02-21  5:55 UTC (permalink / raw)
  To: Hugh Dickins
  Cc: Andrew Morton, KAMEZAWA Hiroyuki, Johannes Weiner, Ying Han,
	linux-mm, linux-kernel

Hugh Dickins wrote:
> page_relock_lruvec() is using lookup_page_cgroup(page)->mem_cgroup
> to find the memcg, and hence its per-zone lruvec for the page.  We
> therefore need to be careful to see the right pc->mem_cgroup: where
> is it updated?
>
> In __mem_cgroup_commit_charge(), under lruvec lock whenever lru
> care might be needed, lrucare holding the page off lru at that time.
>
> In mem_cgroup_reset_owner(), not under lruvec lock, but before the
> page can be visible to others - except compaction or lumpy reclaim,
> which ignore the page because it is not yet PageLRU.
>
> In mem_cgroup_split_huge_fixup(), always under lruvec lock.
>
> In mem_cgroup_move_account(), which holds several locks, but an
> lruvec lock not among them: yet it still appears to be safe, because
> the page has been taken off its old lru and not yet put on the new.
>
> Be particularly careful in compaction's isolate_migratepages() and
> vmscan's lumpy handling in isolate_lru_pages(): those approach the
> page by its physical location, and so can encounter pages which
> would not be found by any logical lookup.  For those cases we have
> to change __isolate_lru_page() slightly: it must leave ClearPageLRU
> to the caller, because compaction and lumpy cannot safely interfere
> with a page until they have first isolated it and then locked lruvec.
>
> To the list above we have to add __mem_cgroup_uncharge_common(),
> and new function mem_cgroup_reset_uncharged_to_root(): the first
> resetting pc->mem_cgroup to root_mem_cgroup when a page off lru is
> uncharged, and the second when an uncharged page is taken off lru
> (which used to be achieved implicitly with the PageAcctLRU flag).
>
> That's because there's a remote risk that compaction or lumpy reclaim
> will spy a page while it has PageLRU set; then it's taken off LRU and
> freed, its mem_cgroup torn down and freed, the page reallocated (so
> get_page_unless_zero again succeeds); then compaction or lumpy reclaim
> reach their page_relock_lruvec, using the stale mem_cgroup for locking.
>
> So long as there's one charge on the mem_cgroup, or a page on one of
> its lrus, mem_cgroup_force_empty() cannot succeed and the mem_cgroup
> cannot be destroyed.  But when an uncharged page is taken off lru,
> or a page off lru is uncharged, it no longer protects its old memcg,
> and the one stable root_mem_cgroup must then be used for it.

This is much better than my RCU-protected locking.
It will be great if it really is race-less!
I think I could steal this and polish it a little. =)

But just one question: how do uncharged pages appear on mem-cg lru lists?
Maybe we can forbid this case and uncharge these pages right in
__page_cache_release() and release_pages(), at the final removal from LRU.
This is how my old mem-controller works: there, pages on the lru are always charged.

>
> Signed-off-by: Hugh Dickins <hughd@google.com>
> ---
>   include/linux/memcontrol.h |    5 ++
>   mm/compaction.c            |   36 ++++++-----------
>   mm/memcontrol.c            |   45 +++++++++++++++++++--
>   mm/swap.c                  |    2
>   mm/vmscan.c                |   73 +++++++++++++++++++++++++----------
>   5 files changed, 114 insertions(+), 47 deletions(-)
>
> --- mmotm.orig/include/linux/memcontrol.h       2012-02-18 11:57:42.675524592 -0800
> +++ mmotm/include/linux/memcontrol.h    2012-02-18 11:57:49.103524745 -0800
> @@ -65,6 +65,7 @@ extern int mem_cgroup_cache_charge(struc
>   struct lruvec *mem_cgroup_zone_lruvec(struct zone *, struct mem_cgroup *);
>   extern struct mem_cgroup *mem_cgroup_from_lruvec(struct lruvec *lruvec);
>   extern void mem_cgroup_update_lru_size(struct lruvec *, enum lru_list, int);
> +extern void mem_cgroup_reset_uncharged_to_root(struct page *);
>
>   /* For coalescing uncharge for reducing memcg' overhead*/
>   extern void mem_cgroup_uncharge_start(void);
> @@ -251,6 +252,10 @@ static inline void mem_cgroup_update_lru
>   {
>   }
>
> +static inline void mem_cgroup_reset_uncharged_to_root(struct page *page)
> +{
> +}
> +
>   static inline struct mem_cgroup *try_get_mem_cgroup_from_page(struct page *page)
>   {
>          return NULL;
> --- mmotm.orig/mm/compaction.c  2012-02-18 11:57:42.675524592 -0800
> +++ mmotm/mm/compaction.c       2012-02-18 11:57:49.103524745 -0800
> @@ -356,28 +356,6 @@ static isolate_migrate_t isolate_migrate
>                          continue;
>                  }
>
> -               if (!lruvec) {
> -                       /*
> -                        * We do need to take the lock before advancing to
> -                        * check PageLRU etc., but there's no guarantee that
> -                        * the page we're peeking at has a stable memcg here.
> -                        */
> -                       lruvec = &zone->lruvec;
> -                       lock_lruvec(lruvec);
> -               }
> -               if (!PageLRU(page))
> -                       continue;
> -
> -               /*
> -                * PageLRU is set, and lru_lock excludes isolation,
> -                * splitting and collapsing (collapsing has already
> -                * happened if PageLRU is set).
> -                */
> -               if (PageTransHuge(page)) {
> -                       low_pfn += (1 << compound_order(page)) - 1;
> -                       continue;
> -               }
> -
>                  if (!cc->sync)
>                          mode |= ISOLATE_ASYNC_MIGRATE;
>
> @@ -386,10 +364,24 @@ static isolate_migrate_t isolate_migrate
>                          continue;
>
>                  page_relock_lruvec(page,&lruvec);
> +               if (unlikely(!PageLRU(page) || PageUnevictable(page) ||
> +                                               PageTransHuge(page))) {
> +                       /*
> +                        * lru_lock excludes splitting a huge page,
> +                        * but we cannot hold lru_lock while freeing page.
> +                        */
> +                       low_pfn += (1 << compound_order(page)) - 1;
> +                       unlock_lruvec(lruvec);
> +                       lruvec = NULL;
> +                       put_page(page);
> +                       continue;
> +               }
>
>                  VM_BUG_ON(PageTransCompound(page));
>
>                  /* Successfully isolated */
> +               ClearPageLRU(page);
> +               mem_cgroup_reset_uncharged_to_root(page);
>                  del_page_from_lru_list(page, lruvec, page_lru(page));
>                  list_add(&page->lru, migratelist);
>                  cc->nr_migratepages++;
> --- mmotm.orig/mm/memcontrol.c  2012-02-18 11:57:42.679524592 -0800
> +++ mmotm/mm/memcontrol.c       2012-02-18 11:57:49.107524745 -0800
> @@ -1069,6 +1069,33 @@ void page_relock_lruvec(struct page *pag
>          *lruvp = lruvec;
>   }
>
> +void mem_cgroup_reset_uncharged_to_root(struct page *page)
> +{
> +       struct page_cgroup *pc;
> +
> +       if (mem_cgroup_disabled())
> +               return;
> +
> +       VM_BUG_ON(PageLRU(page));
> +
> +       /*
> +        * Once an uncharged page is isolated from the mem_cgroup's lru,
> +        * it no longer protects that mem_cgroup from rmdir: reset to root.
> +        *
> +        * __page_cache_release() and release_pages() may be called at
> +        * interrupt time: we cannot lock_page_cgroup() then (we might
> +        * have interrupted a section with page_cgroup already locked),
> +        * nor do we need to since the page is frozen and about to be freed.
> +        */
> +       pc = lookup_page_cgroup(page);
> +       if (page_count(page))
> +               lock_page_cgroup(pc);
> +       if (!PageCgroupUsed(pc) && pc->mem_cgroup != root_mem_cgroup)
> +               pc->mem_cgroup = root_mem_cgroup;
> +       if (page_count(page))
> +               unlock_page_cgroup(pc);
> +}
> +
>   /**
>    * mem_cgroup_update_lru_size - account for adding or removing an lru page
>    * @lruvec: mem_cgroup per zone lru vector
> @@ -2865,6 +2892,7 @@ __mem_cgroup_uncharge_common(struct page
>          struct mem_cgroup *memcg = NULL;
>          unsigned int nr_pages = 1;
>          struct page_cgroup *pc;
> +       struct lruvec *lruvec;
>          bool anon;
>
>          if (mem_cgroup_disabled())
> @@ -2884,6 +2912,7 @@ __mem_cgroup_uncharge_common(struct page
>          if (unlikely(!PageCgroupUsed(pc)))
>                  return NULL;
>
> +       lruvec = page_lock_lruvec(page);
>          lock_page_cgroup(pc);
>
>          memcg = pc->mem_cgroup;
> @@ -2915,14 +2944,17 @@ __mem_cgroup_uncharge_common(struct page
>          mem_cgroup_charge_statistics(memcg, anon, -nr_pages);
>
>          ClearPageCgroupUsed(pc);
> +
>          /*
> -        * pc->mem_cgroup is not cleared here. It will be accessed when it's
> -        * freed from LRU. This is safe because uncharged page is expected not
> -        * to be reused (freed soon). Exception is SwapCache, it's handled by
> -        * special functions.
> +        * Once an uncharged page is isolated from the mem_cgroup's lru,
> +        * it no longer protects that mem_cgroup from rmdir: reset to root.
>           */
> +       if (!PageLRU(page) && pc->mem_cgroup != root_mem_cgroup)
> +               pc->mem_cgroup = root_mem_cgroup;
>
>          unlock_page_cgroup(pc);
> +       unlock_lruvec(lruvec);
> +
>          /*
>           * even after unlock, we have memcg->res.usage here and this memcg
>           * will never be freed.
> @@ -2939,6 +2971,7 @@ __mem_cgroup_uncharge_common(struct page
>
>   unlock_out:
>          unlock_page_cgroup(pc);
> +       unlock_lruvec(lruvec);
>          return NULL;
>   }
>
> @@ -3327,7 +3360,9 @@ static struct page_cgroup *lookup_page_c
>           * the first time, i.e. during boot or memory hotplug;
>           * or when mem_cgroup_disabled().
>           */
> -       if (likely(pc) && PageCgroupUsed(pc))
> +       if (!pc || PageCgroupUsed(pc))
> +               return pc;
> +       if (pc->mem_cgroup && pc->mem_cgroup != root_mem_cgroup)
>                  return pc;
>          return NULL;
>   }
> --- mmotm.orig/mm/swap.c        2012-02-18 11:57:42.679524592 -0800
> +++ mmotm/mm/swap.c     2012-02-18 11:57:49.107524745 -0800
> @@ -52,6 +52,7 @@ static void __page_cache_release(struct
>                  lruvec = page_lock_lruvec(page);
>                  VM_BUG_ON(!PageLRU(page));
>                  __ClearPageLRU(page);
> +               mem_cgroup_reset_uncharged_to_root(page);
>                  del_page_from_lru_list(page, lruvec, page_off_lru(page));
>                  unlock_lruvec(lruvec);
>          }
> @@ -583,6 +584,7 @@ void release_pages(struct page **pages,
>                          page_relock_lruvec(page, &lruvec);
>                          VM_BUG_ON(!PageLRU(page));
>                          __ClearPageLRU(page);
> +                       mem_cgroup_reset_uncharged_to_root(page);
>                          del_page_from_lru_list(page, lruvec, page_off_lru(page));
>                  }
>
> --- mmotm.orig/mm/vmscan.c      2012-02-18 11:57:42.679524592 -0800
> +++ mmotm/mm/vmscan.c   2012-02-18 11:57:49.107524745 -0800
> @@ -1087,11 +1087,11 @@ int __isolate_lru_page(struct page *page
>
>          if (likely(get_page_unless_zero(page))) {
>                  /*
> -                * Be careful not to clear PageLRU until after we're
> -                * sure the page is not being freed elsewhere -- the
> -                * page release code relies on it.
> +                * Beware of interface change: now leave ClearPageLRU(page)
> +                * to the caller, because memcg's lumpy and compaction
> +                * cases (approaching the page by its physical location)
> +                * may not have the right lru_lock yet.
>                   */
> -               ClearPageLRU(page);
>                  ret = 0;
>          }
>
> @@ -1154,7 +1154,16 @@ static unsigned long isolate_lru_pages(u
>
>                  switch (__isolate_lru_page(page, mode, file)) {
>                  case 0:
> +#ifdef CONFIG_DEBUG_VM
> +                       /* check lock on page is lock we already got */
> +                       page_relock_lruvec(page, &lruvec);
> +                       BUG_ON(lruvec != home_lruvec);
> +                       BUG_ON(page != lru_to_page(src));
> +                       BUG_ON(page_lru(page) != lru);
> +#endif
> +                       ClearPageLRU(page);
>                          isolated_pages = hpage_nr_pages(page);
> +                       mem_cgroup_reset_uncharged_to_root(page);
>                          mem_cgroup_update_lru_size(lruvec, lru, -isolated_pages);
>                          list_move(&page->lru, dst);
>                          nr_taken += isolated_pages;
> @@ -1211,21 +1220,7 @@ static unsigned long isolate_lru_pages(u
>                              !PageSwapCache(cursor_page))
>                                  break;
>
> -                       if (__isolate_lru_page(cursor_page, mode, file) == 0) {
> -                               mem_cgroup_page_relock_lruvec(cursor_page,
> -                                                       &lruvec);
> -                               isolated_pages = hpage_nr_pages(cursor_page);
> -                               mem_cgroup_update_lru_size(lruvec,
> -                                       page_lru(cursor_page), -isolated_pages);
> -                               list_move(&cursor_page->lru, dst);
> -
> -                               nr_taken += isolated_pages;
> -                               nr_lumpy_taken += isolated_pages;
> -                               if (PageDirty(cursor_page))
> -                                       nr_lumpy_dirty += isolated_pages;
> -                               scan++;
> -                               pfn += isolated_pages - 1;
> -                       } else {
> +                       if (__isolate_lru_page(cursor_page, mode, file) != 0) {
>                                  /*
>                                   * Check if the page is freed already.
>                                   *
> @@ -1243,13 +1238,50 @@ static unsigned long isolate_lru_pages(u
>                                          continue;
>                                  break;
>                          }
> +
> +                       /*
> +                        * This locking call is a no-op in the non-memcg
> +                        * case, since we already hold the right lru_lock;
> +                        * but it may change the lock in the memcg case.
> +                        * It is then vital to recheck PageLRU (but not
> +                        * necessary to recheck isolation mode).
> +                        */
> +                       mem_cgroup_page_relock_lruvec(cursor_page, &lruvec);
> +
> +                       if (PageLRU(cursor_page) &&
> +                           !PageUnevictable(cursor_page)) {
> +                               ClearPageLRU(cursor_page);
> +                               isolated_pages = hpage_nr_pages(cursor_page);
> +                               mem_cgroup_reset_uncharged_to_root(cursor_page);
> +                               mem_cgroup_update_lru_size(lruvec,
> +                                       page_lru(cursor_page), -isolated_pages);
> +                               list_move(&cursor_page->lru, dst);
> +
> +                               nr_taken += isolated_pages;
> +                               nr_lumpy_taken += isolated_pages;
> +                               if (PageDirty(cursor_page))
> +                                       nr_lumpy_dirty += isolated_pages;
> +                               scan++;
> +                               pfn += isolated_pages - 1;
> +                       } else {
> +                               /* Cannot hold lru_lock while freeing page */
> +                               unlock_lruvec(lruvec);
> +                               lruvec = NULL;
> +                               put_page(cursor_page);
> +                               break;
> +                       }
>                  }
>
>                  /* If we break out of the loop above, lumpy reclaim failed */
>                  if (pfn < end_pfn)
>                          nr_lumpy_failed++;
>
> -               lruvec = home_lruvec;
> +               if (lruvec != home_lruvec) {
> +                       if (lruvec)
> +                               unlock_lruvec(lruvec);
> +                       lruvec = home_lruvec;
> +                       lock_lruvec(lruvec);
> +               }
>          }
>
>          *nr_scanned = scan;
> @@ -1301,6 +1333,7 @@ int isolate_lru_page(struct page *page)
>                          int lru = page_lru(page);
>                          get_page(page);
>                          ClearPageLRU(page);
> +                       mem_cgroup_reset_uncharged_to_root(page);
>                          del_page_from_lru_list(page, lruvec, lru);
>                          ret = 0;
>                  }


^ permalink raw reply	[flat|nested] 72+ messages in thread

* Re: [PATCH 6/10] mm/memcg: take care over pc->mem_cgroup
@ 2012-02-21  5:55     ` Konstantin Khlebnikov
  0 siblings, 0 replies; 72+ messages in thread
From: Konstantin Khlebnikov @ 2012-02-21  5:55 UTC (permalink / raw)
  To: Hugh Dickins
  Cc: Andrew Morton, KAMEZAWA Hiroyuki, Johannes Weiner, Ying Han,
	linux-mm, linux-kernel

Hugh Dickins wrote:
> page_relock_lruvec() is using lookup_page_cgroup(page)->mem_cgroup
> to find the memcg, and hence its per-zone lruvec for the page.  We
> therefore need to be careful to see the right pc->mem_cgroup: where
> is it updated?
>
> In __mem_cgroup_commit_charge(), under lruvec lock whenever lru
> care might be needed, lrucare holding the page off lru at that time.
>
> In mem_cgroup_reset_owner(), not under lruvec lock, but before the
> page can be visible to others - except compaction or lumpy reclaim,
> which ignore the page because it is not yet PageLRU.
>
> In mem_cgroup_split_huge_fixup(), always under lruvec lock.
>
> In mem_cgroup_move_account(), which holds several locks, but an
> lruvec lock not among them: yet it still appears to be safe, because
> the page has been taken off its old lru and not yet put on the new.
>
> Be particularly careful in compaction's isolate_migratepages() and
> vmscan's lumpy handling in isolate_lru_pages(): those approach the
> page by its physical location, and so can encounter pages which
> would not be found by any logical lookup.  For those cases we have
> to change __isolate_lru_page() slightly: it must leave ClearPageLRU
> to the caller, because compaction and lumpy cannot safely interfere
> with a page until they have first isolated it and then locked lruvec.
>
> To the list above we have to add __mem_cgroup_uncharge_common(),
> and new function mem_cgroup_reset_uncharged_to_root(): the first
> resetting pc->mem_cgroup to root_mem_cgroup when a page off lru is
> uncharged, and the second when an uncharged page is taken off lru
> (which used to be achieved implicitly with the PageAcctLRU flag).
>
> That's because there's a remote risk that compaction or lumpy reclaim
> will spy a page while it has PageLRU set; then it's taken off LRU and
> freed, its mem_cgroup torn down and freed, the page reallocated (so
> get_page_unless_zero again succeeds); then compaction or lumpy reclaim
> reach their page_relock_lruvec, using the stale mem_cgroup for locking.
>
> So long as there's one charge on the mem_cgroup, or a page on one of
> its lrus, mem_cgroup_force_empty() cannot succeed and the mem_cgroup
> cannot be destroyed.  But when an uncharged page is taken off lru,
> or a page off lru is uncharged, it no longer protects its old memcg,
> and the one stable root_mem_cgroup must then be used for it.

This is much better than my RCU-protected locking.
It will be great if it really is race-free!
I think I could steal this and polish it a little. =)

But just one question: how do uncharged pages end up on the memcg lru lists?
Maybe we could forbid that case and uncharge such pages right in
__page_cache_release() and release_pages(), at the final removal from LRU.
That is how my old memory controller works: there, pages on an lru are always charged.
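Roughly, the idea would be something like this (a sketch only: mem_cgroup_uncharge_on_lru_removal() is an assumed name, not code from either series, and the surrounding lines just mirror the patched __page_cache_release() quoted below):

static void __page_cache_release(struct page *page)
{
	if (PageLRU(page)) {
		struct lruvec *lruvec;

		lruvec = page_lock_lruvec(page);
		VM_BUG_ON(!PageLRU(page));
		__ClearPageLRU(page);
		del_page_from_lru_list(page, lruvec, page_off_lru(page));
		unlock_lruvec(lruvec);
		/*
		 * Assumed helper: drop the memcg charge here, at the final
		 * removal from the LRU, so that a page on an lru is always
		 * charged and no pc->mem_cgroup reset to root is needed.
		 */
		mem_cgroup_uncharge_on_lru_removal(page);
	}
}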

>
> Signed-off-by: Hugh Dickins <hughd@google.com>
> ---
>   include/linux/memcontrol.h |    5 ++
>   mm/compaction.c            |   36 ++++++-----------
>   mm/memcontrol.c            |   45 +++++++++++++++++++--
>   mm/swap.c                  |    2
>   mm/vmscan.c                |   73 +++++++++++++++++++++++++----------
>   5 files changed, 114 insertions(+), 47 deletions(-)
>
> --- mmotm.orig/include/linux/memcontrol.h       2012-02-18 11:57:42.675524592 -0800
> +++ mmotm/include/linux/memcontrol.h    2012-02-18 11:57:49.103524745 -0800
> @@ -65,6 +65,7 @@ extern int mem_cgroup_cache_charge(struc
>   struct lruvec *mem_cgroup_zone_lruvec(struct zone *, struct mem_cgroup *);
>   extern struct mem_cgroup *mem_cgroup_from_lruvec(struct lruvec *lruvec);
>   extern void mem_cgroup_update_lru_size(struct lruvec *, enum lru_list, int);
> +extern void mem_cgroup_reset_uncharged_to_root(struct page *);
>
>   /* For coalescing uncharge for reducing memcg' overhead*/
>   extern void mem_cgroup_uncharge_start(void);
> @@ -251,6 +252,10 @@ static inline void mem_cgroup_update_lru
>   {
>   }
>
> +static inline void mem_cgroup_reset_uncharged_to_root(struct page *page)
> +{
> +}
> +
>   static inline struct mem_cgroup *try_get_mem_cgroup_from_page(struct page *page)
>   {
>          return NULL;
> --- mmotm.orig/mm/compaction.c  2012-02-18 11:57:42.675524592 -0800
> +++ mmotm/mm/compaction.c       2012-02-18 11:57:49.103524745 -0800
> @@ -356,28 +356,6 @@ static isolate_migrate_t isolate_migrate
>                          continue;
>                  }
>
> -               if (!lruvec) {
> -                       /*
> -                        * We do need to take the lock before advancing to
> -                        * check PageLRU etc., but there's no guarantee that
> -                        * the page we're peeking at has a stable memcg here.
> -                        */
> -                       lruvec = &zone->lruvec;
> -                       lock_lruvec(lruvec);
> -               }
> -               if (!PageLRU(page))
> -                       continue;
> -
> -               /*
> -                * PageLRU is set, and lru_lock excludes isolation,
> -                * splitting and collapsing (collapsing has already
> -                * happened if PageLRU is set).
> -                */
> -               if (PageTransHuge(page)) {
> -                       low_pfn += (1 << compound_order(page)) - 1;
> -                       continue;
> -               }
> -
>                  if (!cc->sync)
>                          mode |= ISOLATE_ASYNC_MIGRATE;
>
> @@ -386,10 +364,24 @@ static isolate_migrate_t isolate_migrate
>                          continue;
>
>                  page_relock_lruvec(page, &lruvec);
> +               if (unlikely(!PageLRU(page) || PageUnevictable(page) ||
> +                                               PageTransHuge(page))) {
> +                       /*
> +                        * lru_lock excludes splitting a huge page,
> +                        * but we cannot hold lru_lock while freeing page.
> +                        */
> +                       low_pfn += (1 << compound_order(page)) - 1;
> +                       unlock_lruvec(lruvec);
> +                       lruvec = NULL;
> +                       put_page(page);
> +                       continue;
> +               }
>
>                  VM_BUG_ON(PageTransCompound(page));
>
>                  /* Successfully isolated */
> +               ClearPageLRU(page);
> +               mem_cgroup_reset_uncharged_to_root(page);
>                  del_page_from_lru_list(page, lruvec, page_lru(page));
>                  list_add(&page->lru, migratelist);
>                  cc->nr_migratepages++;
> --- mmotm.orig/mm/memcontrol.c  2012-02-18 11:57:42.679524592 -0800
> +++ mmotm/mm/memcontrol.c       2012-02-18 11:57:49.107524745 -0800
> @@ -1069,6 +1069,33 @@ void page_relock_lruvec(struct page *pag
>          *lruvp = lruvec;
>   }
>
> +void mem_cgroup_reset_uncharged_to_root(struct page *page)
> +{
> +       struct page_cgroup *pc;
> +
> +       if (mem_cgroup_disabled())
> +               return;
> +
> +       VM_BUG_ON(PageLRU(page));
> +
> +       /*
> +        * Once an uncharged page is isolated from the mem_cgroup's lru,
> +        * it no longer protects that mem_cgroup from rmdir: reset to root.
> +        *
> +        * __page_cache_release() and release_pages() may be called at
> +        * interrupt time: we cannot lock_page_cgroup() then (we might
> +        * have interrupted a section with page_cgroup already locked),
> +        * nor do we need to since the page is frozen and about to be freed.
> +        */
> +       pc = lookup_page_cgroup(page);
> +       if (page_count(page))
> +               lock_page_cgroup(pc);
> +       if (!PageCgroupUsed(pc) && pc->mem_cgroup != root_mem_cgroup)
> +               pc->mem_cgroup = root_mem_cgroup;
> +       if (page_count(page))
> +               unlock_page_cgroup(pc);
> +}
> +
>   /**
>    * mem_cgroup_update_lru_size - account for adding or removing an lru page
>    * @lruvec: mem_cgroup per zone lru vector
> @@ -2865,6 +2892,7 @@ __mem_cgroup_uncharge_common(struct page
>          struct mem_cgroup *memcg = NULL;
>          unsigned int nr_pages = 1;
>          struct page_cgroup *pc;
> +       struct lruvec *lruvec;
>          bool anon;
>
>          if (mem_cgroup_disabled())
> @@ -2884,6 +2912,7 @@ __mem_cgroup_uncharge_common(struct page
>          if (unlikely(!PageCgroupUsed(pc)))
>                  return NULL;
>
> +       lruvec = page_lock_lruvec(page);
>          lock_page_cgroup(pc);
>
>          memcg = pc->mem_cgroup;
> @@ -2915,14 +2944,17 @@ __mem_cgroup_uncharge_common(struct page
>          mem_cgroup_charge_statistics(memcg, anon, -nr_pages);
>
>          ClearPageCgroupUsed(pc);
> +
>          /*
> -        * pc->mem_cgroup is not cleared here. It will be accessed when it's
> -        * freed from LRU. This is safe because uncharged page is expected not
> -        * to be reused (freed soon). Exception is SwapCache, it's handled by
> -        * special functions.
> +        * Once an uncharged page is isolated from the mem_cgroup's lru,
> +        * it no longer protects that mem_cgroup from rmdir: reset to root.
>           */
> +       if (!PageLRU(page) && pc->mem_cgroup != root_mem_cgroup)
> +               pc->mem_cgroup = root_mem_cgroup;
>
>          unlock_page_cgroup(pc);
> +       unlock_lruvec(lruvec);
> +
>          /*
>           * even after unlock, we have memcg->res.usage here and this memcg
>           * will never be freed.
> @@ -2939,6 +2971,7 @@ __mem_cgroup_uncharge_common(struct page
>
>   unlock_out:
>          unlock_page_cgroup(pc);
> +       unlock_lruvec(lruvec);
>          return NULL;
>   }
>
> @@ -3327,7 +3360,9 @@ static struct page_cgroup *lookup_page_c
>           * the first time, i.e. during boot or memory hotplug;
>           * or when mem_cgroup_disabled().
>           */
> -       if (likely(pc) && PageCgroupUsed(pc))
> +       if (!pc || PageCgroupUsed(pc))
> +               return pc;
> +       if (pc->mem_cgroup && pc->mem_cgroup != root_mem_cgroup)
>                  return pc;
>          return NULL;
>   }
> --- mmotm.orig/mm/swap.c        2012-02-18 11:57:42.679524592 -0800
> +++ mmotm/mm/swap.c     2012-02-18 11:57:49.107524745 -0800
> @@ -52,6 +52,7 @@ static void __page_cache_release(struct
>                  lruvec = page_lock_lruvec(page);
>                  VM_BUG_ON(!PageLRU(page));
>                  __ClearPageLRU(page);
> +               mem_cgroup_reset_uncharged_to_root(page);
>                  del_page_from_lru_list(page, lruvec, page_off_lru(page));
>                  unlock_lruvec(lruvec);
>          }
> @@ -583,6 +584,7 @@ void release_pages(struct page **pages,
>                          page_relock_lruvec(page, &lruvec);
>                          VM_BUG_ON(!PageLRU(page));
>                          __ClearPageLRU(page);
> +                       mem_cgroup_reset_uncharged_to_root(page);
>                          del_page_from_lru_list(page, lruvec, page_off_lru(page));
>                  }
>
> --- mmotm.orig/mm/vmscan.c      2012-02-18 11:57:42.679524592 -0800
> +++ mmotm/mm/vmscan.c   2012-02-18 11:57:49.107524745 -0800
> @@ -1087,11 +1087,11 @@ int __isolate_lru_page(struct page *page
>
>          if (likely(get_page_unless_zero(page))) {
>                  /*
> -                * Be careful not to clear PageLRU until after we're
> -                * sure the page is not being freed elsewhere -- the
> -                * page release code relies on it.
> +                * Beware of interface change: now leave ClearPageLRU(page)
> +                * to the caller, because memcg's lumpy and compaction
> +                * cases (approaching the page by its physical location)
> +                * may not have the right lru_lock yet.
>                   */
> -               ClearPageLRU(page);
>                  ret = 0;
>          }
>
> @@ -1154,7 +1154,16 @@ static unsigned long isolate_lru_pages(u
>
>                  switch (__isolate_lru_page(page, mode, file)) {
>                  case 0:
> +#ifdef CONFIG_DEBUG_VM
> +                       /* check lock on page is lock we already got */
> +                       page_relock_lruvec(page, &lruvec);
> +                       BUG_ON(lruvec != home_lruvec);
> +                       BUG_ON(page != lru_to_page(src));
> +                       BUG_ON(page_lru(page) != lru);
> +#endif
> +                       ClearPageLRU(page);
>                          isolated_pages = hpage_nr_pages(page);
> +                       mem_cgroup_reset_uncharged_to_root(page);
>                          mem_cgroup_update_lru_size(lruvec, lru, -isolated_pages);
>                          list_move(&page->lru, dst);
>                          nr_taken += isolated_pages;
> @@ -1211,21 +1220,7 @@ static unsigned long isolate_lru_pages(u
>                              !PageSwapCache(cursor_page))
>                                  break;
>
> -                       if (__isolate_lru_page(cursor_page, mode, file) == 0) {
> -                               mem_cgroup_page_relock_lruvec(cursor_page,
> -                                                       &lruvec);
> -                               isolated_pages = hpage_nr_pages(cursor_page);
> -                               mem_cgroup_update_lru_size(lruvec,
> -                                       page_lru(cursor_page), -isolated_pages);
> -                               list_move(&cursor_page->lru, dst);
> -
> -                               nr_taken += isolated_pages;
> -                               nr_lumpy_taken += isolated_pages;
> -                               if (PageDirty(cursor_page))
> -                                       nr_lumpy_dirty += isolated_pages;
> -                               scan++;
> -                               pfn += isolated_pages - 1;
> -                       } else {
> +                       if (__isolate_lru_page(cursor_page, mode, file) != 0) {
>                                  /*
>                                   * Check if the page is freed already.
>                                   *
> @@ -1243,13 +1238,50 @@ static unsigned long isolate_lru_pages(u
>                                          continue;
>                                  break;
>                          }
> +
> +                       /*
> +                        * This locking call is a no-op in the non-memcg
> +                        * case, since we already hold the right lru_lock;
> +                        * but it may change the lock in the memcg case.
> +                        * It is then vital to recheck PageLRU (but not
> +                        * necessary to recheck isolation mode).
> +                        */
> +                       mem_cgroup_page_relock_lruvec(cursor_page, &lruvec);
> +
> +                       if (PageLRU(cursor_page) &&
> +                           !PageUnevictable(cursor_page)) {
> +                               ClearPageLRU(cursor_page);
> +                               isolated_pages = hpage_nr_pages(cursor_page);
> +                               mem_cgroup_reset_uncharged_to_root(cursor_page);
> +                               mem_cgroup_update_lru_size(lruvec,
> +                                       page_lru(cursor_page), -isolated_pages);
> +                               list_move(&cursor_page->lru, dst);
> +
> +                               nr_taken += isolated_pages;
> +                               nr_lumpy_taken += isolated_pages;
> +                               if (PageDirty(cursor_page))
> +                                       nr_lumpy_dirty += isolated_pages;
> +                               scan++;
> +                               pfn += isolated_pages - 1;
> +                       } else {
> +                               /* Cannot hold lru_lock while freeing page */
> +                               unlock_lruvec(lruvec);
> +                               lruvec = NULL;
> +                               put_page(cursor_page);
> +                               break;
> +                       }
>                  }
>
>                  /* If we break out of the loop above, lumpy reclaim failed */
>                  if (pfn < end_pfn)
>                          nr_lumpy_failed++;
>
> -               lruvec = home_lruvec;
> +               if (lruvec != home_lruvec) {
> +                       if (lruvec)
> +                               unlock_lruvec(lruvec);
> +                       lruvec = home_lruvec;
> +                       lock_lruvec(lruvec);
> +               }
>          }
>
>          *nr_scanned = scan;
> @@ -1301,6 +1333,7 @@ int isolate_lru_page(struct page *page)
>                          int lru = page_lru(page);
>                          get_page(page);
>                          ClearPageLRU(page);
> +                       mem_cgroup_reset_uncharged_to_root(page);
>                          del_page_from_lru_list(page, lruvec, lru);
>                          ret = 0;
>                  }


^ permalink raw reply	[flat|nested] 72+ messages in thread

* Re: [PATCH 6/10] mm/memcg: take care over pc->mem_cgroup
@ 2012-02-21  6:05     ` Konstantin Khlebnikov
  0 siblings, 0 replies; 72+ messages in thread
From: Konstantin Khlebnikov @ 2012-02-21  6:05 UTC (permalink / raw)
  To: Hugh Dickins
  Cc: Andrew Morton, KAMEZAWA Hiroyuki, Johannes Weiner, Ying Han,
	linux-mm, linux-kernel

Hugh Dickins wrote:
> page_relock_lruvec() is using lookup_page_cgroup(page)->mem_cgroup
> to find the memcg, and hence its per-zone lruvec for the page.  We
> therefore need to be careful to see the right pc->mem_cgroup: where
> is it updated?
>
> In __mem_cgroup_commit_charge(), under lruvec lock whenever lru
> care might be needed, lrucare holding the page off lru at that time.
>
> In mem_cgroup_reset_owner(), not under lruvec lock, but before the
> page can be visible to others - except compaction or lumpy reclaim,
> which ignore the page because it is not yet PageLRU.
>
> In mem_cgroup_split_huge_fixup(), always under lruvec lock.
>
> In mem_cgroup_move_account(), which holds several locks, but an
> lruvec lock not among them: yet it still appears to be safe, because
> the page has been taken off its old lru and not yet put on the new.
>
> Be particularly careful in compaction's isolate_migratepages() and
> vmscan's lumpy handling in isolate_lru_pages(): those approach the
> page by its physical location, and so can encounter pages which
> would not be found by any logical lookup.  For those cases we have
> to change __isolate_lru_page() slightly: it must leave ClearPageLRU
> to the caller, because compaction and lumpy cannot safely interfere
> with a page until they have first isolated it and then locked lruvec.

Yeah, this is the most complicated part. I found one race here; see below.

>
> To the list above we have to add __mem_cgroup_uncharge_common(),
> and new function mem_cgroup_reset_uncharged_to_root(): the first
> resetting pc->mem_cgroup to root_mem_cgroup when a page off lru is
> uncharged, and the second when an uncharged page is taken off lru
> (which used to be achieved implicitly with the PageAcctLRU flag).
>
> That's because there's a remote risk that compaction or lumpy reclaim
> will spy a page while it has PageLRU set; then it's taken off LRU and
> freed, its mem_cgroup torn down and freed, the page reallocated (so
> get_page_unless_zero again succeeds); then compaction or lumpy reclaim
> reach their page_relock_lruvec, using the stale mem_cgroup for locking.
>
> So long as there's one charge on the mem_cgroup, or a page on one of
> its lrus, mem_cgroup_force_empty() cannot succeed and the mem_cgroup
> cannot be destroyed.  But when an uncharged page is taken off lru,
> or a page off lru is uncharged, it no longer protects its old memcg,
> and the one stable root_mem_cgroup must then be used for it.
>
> Signed-off-by: Hugh Dickins <hughd@google.com>
> ---
>   include/linux/memcontrol.h |    5 ++
>   mm/compaction.c            |   36 ++++++-----------
>   mm/memcontrol.c            |   45 +++++++++++++++++++--
>   mm/swap.c                  |    2
>   mm/vmscan.c                |   73 +++++++++++++++++++++++++----------
>   5 files changed, 114 insertions(+), 47 deletions(-)
>
> --- mmotm.orig/include/linux/memcontrol.h       2012-02-18 11:57:42.675524592 -0800
> +++ mmotm/include/linux/memcontrol.h    2012-02-18 11:57:49.103524745 -0800
> @@ -65,6 +65,7 @@ extern int mem_cgroup_cache_charge(struc
>   struct lruvec *mem_cgroup_zone_lruvec(struct zone *, struct mem_cgroup *);
>   extern struct mem_cgroup *mem_cgroup_from_lruvec(struct lruvec *lruvec);
>   extern void mem_cgroup_update_lru_size(struct lruvec *, enum lru_list, int);
> +extern void mem_cgroup_reset_uncharged_to_root(struct page *);
>
>   /* For coalescing uncharge for reducing memcg' overhead*/
>   extern void mem_cgroup_uncharge_start(void);
> @@ -251,6 +252,10 @@ static inline void mem_cgroup_update_lru
>   {
>   }
>
> +static inline void mem_cgroup_reset_uncharged_to_root(struct page *page)
> +{
> +}
> +
>   static inline struct mem_cgroup *try_get_mem_cgroup_from_page(struct page *page)
>   {
>          return NULL;
> --- mmotm.orig/mm/compaction.c  2012-02-18 11:57:42.675524592 -0800
> +++ mmotm/mm/compaction.c       2012-02-18 11:57:49.103524745 -0800
> @@ -356,28 +356,6 @@ static isolate_migrate_t isolate_migrate
>                          continue;
>                  }
>
> -               if (!lruvec) {
> -                       /*
> -                        * We do need to take the lock before advancing to
> -                        * check PageLRU etc., but there's no guarantee that
> -                        * the page we're peeking at has a stable memcg here.
> -                        */
> -                       lruvec = &zone->lruvec;
> -                       lock_lruvec(lruvec);
> -               }
> -               if (!PageLRU(page))
> -                       continue;
> -
> -               /*
> -                * PageLRU is set, and lru_lock excludes isolation,
> -                * splitting and collapsing (collapsing has already
> -                * happened if PageLRU is set).
> -                */
> -               if (PageTransHuge(page)) {
> -                       low_pfn += (1 << compound_order(page)) - 1;
> -                       continue;
> -               }
> -
>                  if (!cc->sync)
>                          mode |= ISOLATE_ASYNC_MIGRATE;
>
> @@ -386,10 +364,24 @@ static isolate_migrate_t isolate_migrate
>                          continue;
>
>                  page_relock_lruvec(page, &lruvec);

Here there is a race with mem_cgroup_move_account(): we hold the lock for the
old lruvec while move_account() recharges the page and puts it back onto a
different lruvec. Thus we see PageLRU(), but below we isolate the page from
the wrong lruvec.

In my patch set this is fixed with __wait_lru_unlock() [ spin_unlock_wait() ]
in mem_cgroup_move_account().
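A minimal sketch of that idea (the helper below is an assumption for illustration, as is the lru_lock field sitting inside the lruvec; neither line is quoted from a posted patch):

static inline void __wait_lru_unlock(struct lruvec *lruvec)
{
	/* wait until any current holder of this lruvec's lru_lock drops it */
	spin_unlock_wait(&lruvec->lru_lock);
}

mem_cgroup_move_account() would call this on the old lruvec after switching
pc->mem_cgroup and before the page can go back onto an lru, so a physical
scanner (compaction or lumpy reclaim) that already looked up the old lruvec
finishes under that lock before the move is considered complete.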

> +               if (unlikely(!PageLRU(page) || PageUnevictable(page) ||
> +                                               PageTransHuge(page))) {
> +                       /*
> +                        * lru_lock excludes splitting a huge page,
> +                        * but we cannot hold lru_lock while freeing page.
> +                        */
> +                       low_pfn += (1 << compound_order(page)) - 1;
> +                       unlock_lruvec(lruvec);
> +                       lruvec = NULL;
> +                       put_page(page);
> +                       continue;
> +               }
>
>                  VM_BUG_ON(PageTransCompound(page));
>
>                  /* Successfully isolated */
> +               ClearPageLRU(page);
> +               mem_cgroup_reset_uncharged_to_root(page);
>                  del_page_from_lru_list(page, lruvec, page_lru(page));
>                  list_add(&page->lru, migratelist);
>                  cc->nr_migratepages++;
> --- mmotm.orig/mm/memcontrol.c  2012-02-18 11:57:42.679524592 -0800
> +++ mmotm/mm/memcontrol.c       2012-02-18 11:57:49.107524745 -0800
> @@ -1069,6 +1069,33 @@ void page_relock_lruvec(struct page *pag
>          *lruvp = lruvec;
>   }
>
> +void mem_cgroup_reset_uncharged_to_root(struct page *page)
> +{
> +       struct page_cgroup *pc;
> +
> +       if (mem_cgroup_disabled())
> +               return;
> +
> +       VM_BUG_ON(PageLRU(page));
> +
> +       /*
> +        * Once an uncharged page is isolated from the mem_cgroup's lru,
> +        * it no longer protects that mem_cgroup from rmdir: reset to root.
> +        *
> +        * __page_cache_release() and release_pages() may be called at
> +        * interrupt time: we cannot lock_page_cgroup() then (we might
> +        * have interrupted a section with page_cgroup already locked),
> +        * nor do we need to since the page is frozen and about to be freed.
> +        */
> +       pc = lookup_page_cgroup(page);
> +       if (page_count(page))
> +               lock_page_cgroup(pc);
> +       if (!PageCgroupUsed(pc) && pc->mem_cgroup != root_mem_cgroup)
> +               pc->mem_cgroup = root_mem_cgroup;
> +       if (page_count(page))
> +               unlock_page_cgroup(pc);
> +}
> +
>   /**
>    * mem_cgroup_update_lru_size - account for adding or removing an lru page
>    * @lruvec: mem_cgroup per zone lru vector
> @@ -2865,6 +2892,7 @@ __mem_cgroup_uncharge_common(struct page
>          struct mem_cgroup *memcg = NULL;
>          unsigned int nr_pages = 1;
>          struct page_cgroup *pc;
> +       struct lruvec *lruvec;
>          bool anon;
>
>          if (mem_cgroup_disabled())
> @@ -2884,6 +2912,7 @@ __mem_cgroup_uncharge_common(struct page
>          if (unlikely(!PageCgroupUsed(pc)))
>                  return NULL;
>
> +       lruvec = page_lock_lruvec(page);
>          lock_page_cgroup(pc);
>
>          memcg = pc->mem_cgroup;
> @@ -2915,14 +2944,17 @@ __mem_cgroup_uncharge_common(struct page
>          mem_cgroup_charge_statistics(memcg, anon, -nr_pages);
>
>          ClearPageCgroupUsed(pc);
> +
>          /*
> -        * pc->mem_cgroup is not cleared here. It will be accessed when it's
> -        * freed from LRU. This is safe because uncharged page is expected not
> -        * to be reused (freed soon). Exception is SwapCache, it's handled by
> -        * special functions.
> +        * Once an uncharged page is isolated from the mem_cgroup's lru,
> +        * it no longer protects that mem_cgroup from rmdir: reset to root.
>           */
> +       if (!PageLRU(page) && pc->mem_cgroup != root_mem_cgroup)
> +               pc->mem_cgroup = root_mem_cgroup;
>
>          unlock_page_cgroup(pc);
> +       unlock_lruvec(lruvec);
> +
>          /*
>           * even after unlock, we have memcg->res.usage here and this memcg
>           * will never be freed.
> @@ -2939,6 +2971,7 @@ __mem_cgroup_uncharge_common(struct page
>
>   unlock_out:
>          unlock_page_cgroup(pc);
> +       unlock_lruvec(lruvec);
>          return NULL;
>   }
>
> @@ -3327,7 +3360,9 @@ static struct page_cgroup *lookup_page_c
>           * the first time, i.e. during boot or memory hotplug;
>           * or when mem_cgroup_disabled().
>           */
> -       if (likely(pc) && PageCgroupUsed(pc))
> +       if (!pc || PageCgroupUsed(pc))
> +               return pc;
> +       if (pc->mem_cgroup && pc->mem_cgroup != root_mem_cgroup)
>                  return pc;
>          return NULL;
>   }
> --- mmotm.orig/mm/swap.c        2012-02-18 11:57:42.679524592 -0800
> +++ mmotm/mm/swap.c     2012-02-18 11:57:49.107524745 -0800
> @@ -52,6 +52,7 @@ static void __page_cache_release(struct
>                  lruvec = page_lock_lruvec(page);
>                  VM_BUG_ON(!PageLRU(page));
>                  __ClearPageLRU(page);
> +               mem_cgroup_reset_uncharged_to_root(page);
>                  del_page_from_lru_list(page, lruvec, page_off_lru(page));
>                  unlock_lruvec(lruvec);
>          }
> @@ -583,6 +584,7 @@ void release_pages(struct page **pages,
>                          page_relock_lruvec(page, &lruvec);
>                          VM_BUG_ON(!PageLRU(page));
>                          __ClearPageLRU(page);
> +                       mem_cgroup_reset_uncharged_to_root(page);
>                          del_page_from_lru_list(page, lruvec, page_off_lru(page));
>                  }
>
> --- mmotm.orig/mm/vmscan.c      2012-02-18 11:57:42.679524592 -0800
> +++ mmotm/mm/vmscan.c   2012-02-18 11:57:49.107524745 -0800
> @@ -1087,11 +1087,11 @@ int __isolate_lru_page(struct page *page
>
>          if (likely(get_page_unless_zero(page))) {
>                  /*
> -                * Be careful not to clear PageLRU until after we're
> -                * sure the page is not being freed elsewhere -- the
> -                * page release code relies on it.
> +                * Beware of interface change: now leave ClearPageLRU(page)
> +                * to the caller, because memcg's lumpy and compaction
> +                * cases (approaching the page by its physical location)
> +                * may not have the right lru_lock yet.
>                   */
> -               ClearPageLRU(page);
>                  ret = 0;
>          }
>
> @@ -1154,7 +1154,16 @@ static unsigned long isolate_lru_pages(u
>
>                  switch (__isolate_lru_page(page, mode, file)) {
>                  case 0:
> +#ifdef CONFIG_DEBUG_VM
> +                       /* check lock on page is lock we already got */
> +                       page_relock_lruvec(page, &lruvec);
> +                       BUG_ON(lruvec != home_lruvec);
> +                       BUG_ON(page != lru_to_page(src));
> +                       BUG_ON(page_lru(page) != lru);
> +#endif
> +                       ClearPageLRU(page);
>                          isolated_pages = hpage_nr_pages(page);
> +                       mem_cgroup_reset_uncharged_to_root(page);
>                          mem_cgroup_update_lru_size(lruvec, lru, -isolated_pages);
>                          list_move(&page->lru, dst);
>                          nr_taken += isolated_pages;
> @@ -1211,21 +1220,7 @@ static unsigned long isolate_lru_pages(u
>                              !PageSwapCache(cursor_page))
>                                  break;
>
> -                       if (__isolate_lru_page(cursor_page, mode, file) == 0) {
> -                               mem_cgroup_page_relock_lruvec(cursor_page,
> -                                                               &lruvec);
> -                               isolated_pages = hpage_nr_pages(cursor_page);
> -                               mem_cgroup_update_lru_size(lruvec,
> -                                       page_lru(cursor_page), -isolated_pages);
> -                               list_move(&cursor_page->lru, dst);
> -
> -                               nr_taken += isolated_pages;
> -                               nr_lumpy_taken += isolated_pages;
> -                               if (PageDirty(cursor_page))
> -                                       nr_lumpy_dirty += isolated_pages;
> -                               scan++;
> -                               pfn += isolated_pages - 1;
> -                       } else {
> +                       if (__isolate_lru_page(cursor_page, mode, file) != 0) {
>                                  /*
>                                   * Check if the page is freed already.
>                                   *
> @@ -1243,13 +1238,50 @@ static unsigned long isolate_lru_pages(u
>                                          continue;
>                                  break;
>                          }
> +
> +                       /*
> +                        * This locking call is a no-op in the non-memcg
> +                        * case, since we already hold the right lru_lock;
> +                        * but it may change the lock in the memcg case.
> +                        * It is then vital to recheck PageLRU (but not
> +                        * necessary to recheck isolation mode).
> +                        */
> +                       mem_cgroup_page_relock_lruvec(cursor_page, &lruvec);
> +
> +                       if (PageLRU(cursor_page) &&
> +                           !PageUnevictable(cursor_page)) {
> +                               ClearPageLRU(cursor_page);
> +                               isolated_pages = hpage_nr_pages(cursor_page);
> +                               mem_cgroup_reset_uncharged_to_root(cursor_page);
> +                               mem_cgroup_update_lru_size(lruvec,
> +                                       page_lru(cursor_page), -isolated_pages);
> +                               list_move(&cursor_page->lru, dst);
> +
> +                               nr_taken += isolated_pages;
> +                               nr_lumpy_taken += isolated_pages;
> +                               if (PageDirty(cursor_page))
> +                                       nr_lumpy_dirty += isolated_pages;
> +                               scan++;
> +                               pfn += isolated_pages - 1;
> +                       } else {
> +                               /* Cannot hold lru_lock while freeing page */
> +                               unlock_lruvec(lruvec);
> +                               lruvec = NULL;
> +                               put_page(cursor_page);
> +                               break;
> +                       }
>                  }
>
>                  /* If we break out of the loop above, lumpy reclaim failed */
>                  if (pfn < end_pfn)
>                          nr_lumpy_failed++;
>
> -               lruvec = home_lruvec;
> +               if (lruvec != home_lruvec) {
> +                       if (lruvec)
> +                               unlock_lruvec(lruvec);
> +                       lruvec = home_lruvec;
> +                       lock_lruvec(lruvec);
> +               }
>          }
>
>          *nr_scanned = scan;
> @@ -1301,6 +1333,7 @@ int isolate_lru_page(struct page *page)
>                          int lru = page_lru(page);
>                          get_page(page);
>                          ClearPageLRU(page);
> +                       mem_cgroup_reset_uncharged_to_root(page);
>                          del_page_from_lru_list(page, lruvec, lru);
>                          ret = 0;
>                  }


* Re: [PATCH 9/10] mm/memcg: move lru_lock into lruvec
  2012-02-20 23:38   ` Hugh Dickins
@ 2012-02-21  7:08     ` Konstantin Khlebnikov
  -1 siblings, 0 replies; 72+ messages in thread
From: Konstantin Khlebnikov @ 2012-02-21  7:08 UTC (permalink / raw)
  To: Hugh Dickins
  Cc: Andrew Morton, KAMEZAWA Hiroyuki, Johannes Weiner, Ying Han,
	linux-mm, linux-kernel

Hugh Dickins wrote:
> We're nearly there.  Now move lru_lock and irqflags into struct lruvec,
> so they are in every zone (for !MEM_RES_CTLR and mem_cgroup_disabled()
> cases) and in every memcg lruvec.
>
> Extend the memcg version of page_relock_lruvec() to drop old and take
> new lock whenever changing lruvec.  But the memcg will only be stable
> once we already have the lock: so, having got it, check if it's still
> the lock we want, and retry if not.  It's for this retry that we route
> all page lruvec locking through page_relock_lruvec().
>
> No need for lock_page_cgroup() in here (which would entail reinverting
> the lock ordering, and _irq'ing all of its calls): the lrucare protocol
> when charging (holding old lock while changing owner then acquiring new)
> fits correctly with this retry protocol.  In some places we rely also on
> page_count 0 preventing further references, in some places on !PageLRU
> protecting a page from outside interference: mem_cgroup_move_account()
>
> What if page_relock_lruvec() were preempted for a while, after reading
> a valid mem_cgroup from page_cgroup, but before acquiring the lock?
> In that case, a rmdir might free the mem_cgroup and its associated
> zoneinfo, and we take a spin_lock in freed memory.  But rcu_read_lock()
> before we read mem_cgroup keeps it safe: cgroup.c uses synchronize_rcu()
> in between pre_destroy (force_empty) and destroy (freeing structures).
> mem_cgroup_force_empty() cannot succeed while there's any charge, or any
> page on any of its lrus - and checks list_empty() while holding the lock.

Heh, your code is RCU-protected too. =)

On lumpy/compaction isolate you do:

if (!PageLRU(page))
	continue

__isolate_lru_page()

page_relock_rcu_vec()
	rcu_read_lock()
	rcu_dereference()...
	spin_lock()...
	rcu_read_unlock()

You protect page_relock_rcu_vec() by switching pointers back to root.

I do:

catch_page_lru()
	rcu_read_lock()
	if (!PageLRU(page))
		return false
	rcu_dereference()...
	spin_lock()...
	rcu_read_unlock()
	if (PageLRU())
		return true
if true
	__isolate_lru_page()

I protect my catch_page_lruvec() with a PageLRU() check, taken under a single rcu interval together with the locking.
Thus my code is better, because it does not require switching pointers back to the root memcg.

Meanwhile, after seeing your patches, I realized that this rcu-protection is
required only for lock-by-pfn in lumpy/compaction isolation.
Thus my locking can be simplified and optimized.
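
To make the comparison concrete, here is a rough C sketch of my ordering
(catch-and-lock under one rcu read-side section; page_lruvec() and the
function name are illustrative only, not code from either series):

	static bool catch_page_lruvec(struct page *page, struct lruvec **lruvp)
	{
		struct lruvec *lruvec;
		bool caught = false;

		rcu_read_lock();
		if (PageLRU(page)) {
			/* assumed helper: page -> its current lruvec */
			lruvec = page_lruvec(page);
			lock_lruvec(lruvec);
			/* recheck under the lock: page may have moved or been freed */
			if (PageLRU(page)) {
				*lruvp = lruvec;
				caught = true;
			} else
				unlock_lruvec(lruvec);
		}
		rcu_read_unlock();
		return caught;
	}

Only if this returns true does the caller go on to __isolate_lru_page(),
so the lru_lock taken is always the one for the list the page is still on.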

>
> But although we are now fully prepared, in this patch keep on using
> the zone->lru_lock for all of its memcgs: so that the cost or benefit
> of split locking can be easily compared with the final patch (but
> of course, some costs and benefits come earlier in the series).
>
> Signed-off-by: Hugh Dickins <hughd@google.com>
> ---
>   include/linux/mmzone.h |    4 +-
>   include/linux/swap.h   |   13 +++---
>   mm/memcontrol.c        |   74 ++++++++++++++++++++++++++-------------
>   mm/page_alloc.c        |    2 -
>   4 files changed, 59 insertions(+), 34 deletions(-)
>
> --- mmotm.orig/include/linux/mmzone.h	2012-02-18 11:57:42.675524592 -0800
> +++ mmotm/include/linux/mmzone.h	2012-02-18 11:58:09.047525220 -0800
> @@ -174,6 +174,8 @@ struct zone_reclaim_stat {
>
>   struct lruvec {
>   	struct zone *zone;
> +	spinlock_t lru_lock;
> +	unsigned long irqflags;
>   	struct list_head lists[NR_LRU_LISTS];
>   	struct zone_reclaim_stat reclaim_stat;
>   };
> @@ -373,8 +375,6 @@ struct zone {
>   	ZONE_PADDING(_pad1_)
>
>   	/* Fields commonly accessed by the page reclaim scanner */
> -	spinlock_t		lru_lock;
> -	unsigned long		irqflags;
>   	struct lruvec		lruvec;
>
>   	unsigned long		pages_scanned;	   /* since last reclaim */
> --- mmotm.orig/include/linux/swap.h	2012-02-18 11:57:42.675524592 -0800
> +++ mmotm/include/linux/swap.h	2012-02-18 11:58:09.047525220 -0800
> @@ -252,25 +252,24 @@ static inline void lru_cache_add_file(st
>
>   static inline spinlock_t *lru_lockptr(struct lruvec *lruvec)
>   {
> -	return &lruvec->zone->lru_lock;
> +	/* Still use per-zone lru_lock */
> +	return &lruvec->zone->lruvec.lru_lock;
>   }
>
>   static inline void lock_lruvec(struct lruvec *lruvec)
>   {
> -	struct zone *zone = lruvec->zone;
>   	unsigned long irqflags;
>
> -	spin_lock_irqsave(&zone->lru_lock, irqflags);
> -	zone->irqflags = irqflags;
> +	spin_lock_irqsave(lru_lockptr(lruvec), irqflags);
> +	lruvec->irqflags = irqflags;
>   }
>
>   static inline void unlock_lruvec(struct lruvec *lruvec)
>   {
> -	struct zone *zone = lruvec->zone;
>   	unsigned long irqflags;
>
> -	irqflags = zone->irqflags;
> -	spin_unlock_irqrestore(&zone->lru_lock, irqflags);
> +	irqflags = lruvec->irqflags;
> +	spin_unlock_irqrestore(lru_lockptr(lruvec), irqflags);
>   }
>
>   #ifdef CONFIG_CGROUP_MEM_RES_CTLR
> --- mmotm.orig/mm/memcontrol.c	2012-02-18 11:58:02.451525062 -0800
> +++ mmotm/mm/memcontrol.c	2012-02-18 11:58:09.051525220 -0800
> @@ -1048,39 +1048,64 @@ void page_relock_lruvec(struct page *pag
>   	struct page_cgroup *pc;
>   	struct lruvec *lruvec;
>
> -	if (mem_cgroup_disabled())
> +	if (unlikely(mem_cgroup_disabled())) {
>   		lruvec = &page_zone(page)->lruvec;
> -	else {
> -		pc = lookup_page_cgroup(page);
> -		memcg = pc->mem_cgroup;
> -		/*
> -		 * At present we start up with all page_cgroups initialized
> -		 * to zero: correct that to root_mem_cgroup once we see it.
> -		 */
> -		if (unlikely(!memcg))
> -			memcg = pc->mem_cgroup = root_mem_cgroup;
> -		/*
> -		 * We must reset pc->mem_cgroup back to root before freeing
> -		 * a page: avoid additional callouts from hot paths by doing
> -		 * it here when we see the page is frozen (can safely be done
> -		 * before taking lru_lock because the page is frozen).
> -		 */
> -		if (memcg != root_mem_cgroup && !page_count(page))
> -			pc->mem_cgroup = root_mem_cgroup;
> -		mz = page_cgroup_zoneinfo(memcg, page);
> -		lruvec = &mz->lruvec;
> +		if (*lruvp && *lruvp != lruvec) {
> +			unlock_lruvec(*lruvp);
> +			*lruvp = NULL;
> +		}
> +		if (!*lruvp) {
> +			*lruvp = lruvec;
> +			lock_lruvec(lruvec);
> +		}
> +		return;
>   	}
>
> +	pc = lookup_page_cgroup(page);
> +	/*
> +	 * Imagine being preempted for a long time: we need to make sure that
> +	 * the structure at pc->mem_cgroup, and structures it links to, cannot
> +	 * be freed while we locate and acquire its zone lru_lock.  cgroup's
> +	 * synchronize_rcu() between pre_destroy and destroy makes this safe.
> +	 */
> +	rcu_read_lock();
> +again:
> +	memcg = rcu_dereference(pc->mem_cgroup);
>   	/*
> -	 * For the moment, simply lock by zone just as before.
> +	 * At present we start up with all page_cgroups initialized
> +	 * to zero: here treat NULL as root_mem_cgroup, then correct
> +	 * the page_cgroup below once we really have it locked.
>   	 */
> -	if (*lruvp && (*lruvp)->zone != lruvec->zone) {
> +	mz = page_cgroup_zoneinfo(memcg ? : root_mem_cgroup, page);
> +	lruvec = &mz->lruvec;
> +
> +	/*
> +	 * Sometimes we are called with non-NULL *lruvp spinlock already held:
> +	 * hold on if we want the same lock again, otherwise drop and acquire.
> +	 */
> +	if (*lruvp && *lruvp != lruvec) {
>   		unlock_lruvec(*lruvp);
>   		*lruvp = NULL;
>   	}
> -	if (!*lruvp)
> +	if (!*lruvp) {
> +		*lruvp = lruvec;
>   		lock_lruvec(lruvec);
> -	*lruvp = lruvec;
> +		/*
> +		 * But pc->mem_cgroup may have changed since we looked...
> +		 */
> +		if (unlikely(pc->mem_cgroup != memcg))
> +			goto again;
> +	}
> +
> +	/*
> +	 * We must reset pc->mem_cgroup back to root before freeing a page:
> +	 * avoid additional callouts from hot paths by doing it here when we
> +	 * see the page is frozen.  Also initialize pc at first use of page.
> +	 */
> +	if (memcg != root_mem_cgroup && (!memcg || !page_count(page)))
> +		pc->mem_cgroup = root_mem_cgroup;
> +
> +	rcu_read_unlock();
>   }
>
>   void mem_cgroup_reset_uncharged_to_root(struct page *page)
> @@ -4744,6 +4769,7 @@ static int alloc_mem_cgroup_per_zone_inf
>   	for (zone = 0; zone < MAX_NR_ZONES; zone++) {
>   		mz = &pn->zoneinfo[zone];
>   		mz->lruvec.zone = &NODE_DATA(node)->node_zones[zone];
> +		spin_lock_init(&mz->lruvec.lru_lock);
>   		for_each_lru(lru)
>   			INIT_LIST_HEAD(&mz->lruvec.lists[lru]);
>   		mz->usage_in_excess = 0;
> --- mmotm.orig/mm/page_alloc.c	2012-02-18 11:57:28.375524252 -0800
> +++ mmotm/mm/page_alloc.c	2012-02-18 11:58:09.051525220 -0800
> @@ -4360,12 +4360,12 @@ static void __paginginit free_area_init_
>   #endif
>   		zone->name = zone_names[j];
>   		spin_lock_init(&zone->lock);
> -		spin_lock_init(&zone->lru_lock);
>   		zone_seqlock_init(zone);
>   		zone->zone_pgdat = pgdat;
>
>   		zone_pcp_init(zone);
>   		zone->lruvec.zone = zone;
> +		spin_lock_init(&zone->lruvec.lru_lock);
>   		for_each_lru(lru)
>   			INIT_LIST_HEAD(&zone->lruvec.lists[lru]);
>   		zone->lruvec.reclaim_stat.recent_rotated[0] = 0;


* Re: [PATCH 1/10] mm/memcg: scanning_global_lru means mem_cgroup_disabled
  2012-02-20 23:28   ` Hugh Dickins
@ 2012-02-21  8:03     ` KAMEZAWA Hiroyuki
  -1 siblings, 0 replies; 72+ messages in thread
From: KAMEZAWA Hiroyuki @ 2012-02-21  8:03 UTC (permalink / raw)
  To: Hugh Dickins
  Cc: Andrew Morton, Konstantin Khlebnikov, Johannes Weiner, Ying Han,
	linux-mm, linux-kernel

On Mon, 20 Feb 2012 15:28:21 -0800 (PST)
Hugh Dickins <hughd@google.com> wrote:

> Although one has to admire the skill with which it has been concealed,
> scanning_global_lru(mz) is actually just an interesting way to test
> mem_cgroup_disabled().  Too many developer hours have been wasted on
> confusing it with global_reclaim(): just use mem_cgroup_disabled().
> 
> Signed-off-by: Hugh Dickins <hughd@google.com>

Ah, OK. We have both global_reclaim() and scanning_global_lru(), but
scanning_global_lru() is obsolete now.

Acked-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>



* Re: [PATCH 2/10] mm/memcg: move reclaim_stat into lruvec
  2012-02-20 23:29   ` Hugh Dickins
@ 2012-02-21  8:05     ` KAMEZAWA Hiroyuki
  -1 siblings, 0 replies; 72+ messages in thread
From: KAMEZAWA Hiroyuki @ 2012-02-21  8:05 UTC (permalink / raw)
  To: Hugh Dickins
  Cc: Andrew Morton, Konstantin Khlebnikov, Johannes Weiner, Ying Han,
	linux-mm, linux-kernel

On Mon, 20 Feb 2012 15:29:37 -0800 (PST)
Hugh Dickins <hughd@google.com> wrote:

> With mem_cgroup_disabled() now explicit, it becomes clear that the
> zone_reclaim_stat structure actually belongs in lruvec, per-zone
> when memcg is disabled but per-memcg per-zone when it's enabled.
> 
> We can delete mem_cgroup_get_reclaim_stat(), and change
> update_page_reclaim_stat() to update just the one set of stats,
> the one which get_scan_count() will actually use.
> 
> Signed-off-by: Hugh Dickins <hughd@google.com>

Seems nice to me.

Acked-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>


* Re: [PATCH 3/10] mm/memcg: add zone pointer into lruvec
  2012-02-20 23:30   ` Hugh Dickins
@ 2012-02-21  8:08     ` KAMEZAWA Hiroyuki
  -1 siblings, 0 replies; 72+ messages in thread
From: KAMEZAWA Hiroyuki @ 2012-02-21  8:08 UTC (permalink / raw)
  To: Hugh Dickins
  Cc: Andrew Morton, Konstantin Khlebnikov, Johannes Weiner, Ying Han,
	linux-mm, linux-kernel

On Mon, 20 Feb 2012 15:30:45 -0800 (PST)
Hugh Dickins <hughd@google.com> wrote:

> The lruvec is looking rather useful: if we just add a zone pointer
> into the lruvec, then we can pass the lruvec pointer around and save
> some superfluous arguments and recomputations in various places.
> 
> Just occasionally we do want mem_cgroup_from_lruvec() to get back from
> lruvec to memcg; but then we can remove all uses of vmscan.c's private
> mem_cgroup_zone *mz, passing the lruvec pointer instead.
> 
> And while we're there, get_scan_count() can call vmscan_swappiness()
> once, instead of twice in a row.
> 
> Signed-off-by: Hugh Dickins <hughd@google.com>
> ---

I like this cleanup.

Acked-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>


* Re: [PATCH 4/10] mm/memcg: apply add/del_page to lruvec
  2012-02-20 23:32   ` Hugh Dickins
@ 2012-02-21  8:20     ` KAMEZAWA Hiroyuki
  -1 siblings, 0 replies; 72+ messages in thread
From: KAMEZAWA Hiroyuki @ 2012-02-21  8:20 UTC (permalink / raw)
  To: Hugh Dickins
  Cc: Andrew Morton, Konstantin Khlebnikov, Johannes Weiner, Ying Han,
	linux-mm, linux-kernel

On Mon, 20 Feb 2012 15:32:06 -0800 (PST)
Hugh Dickins <hughd@google.com> wrote:

> Go further: pass lruvec instead of zone to add_page_to_lru_list() and
> del_page_from_lru_list(); and pagevec_lru_move_fn() pass lruvec down
> to its target functions.
> 
> This cleanup eliminates a swathe of cruft in memcontrol.c,
> including mem_cgroup_lru_add_list(), mem_cgroup_lru_del_list() and
> mem_cgroup_lru_move_lists(), which never actually touched the lists.
> 
> In their place, mem_cgroup_page_lruvec() to decide the lruvec,
> previously a side-effect of add, and mem_cgroup_update_lru_size()
> to maintain the lru_size stats.
> 
> Whilst these are simplifications in their own right, the goal is to
> bring the evaluation of lruvec next to the spin_locking of the lrus,
> in preparation for the next patch.
> 
> Signed-off-by: Hugh Dickins <hughd@google.com>


Hmm.. a nitpick.

You do 
  lruvec = mem_cgroup_page_lruvec(page, zone);

What is the difference from

  lruvec = mem_cgroup_page_lruvec(page, page_zone(page)) 

?

If we have a function
  lruvec = mem_cgroup_page_lruvec(page)

Do we need 
  lruvec = mem_cgroup_page_lruvec_zone(page, zone) 

?

Thanks,
-Kame


* Re: [PATCH 5/10] mm/memcg: introduce page_relock_lruvec
  2012-02-20 23:33   ` Hugh Dickins
@ 2012-02-21  8:38     ` KAMEZAWA Hiroyuki
  -1 siblings, 0 replies; 72+ messages in thread
From: KAMEZAWA Hiroyuki @ 2012-02-21  8:38 UTC (permalink / raw)
  To: Hugh Dickins
  Cc: Andrew Morton, Konstantin Khlebnikov, Johannes Weiner, Ying Han,
	linux-mm, linux-kernel

On Mon, 20 Feb 2012 15:33:20 -0800 (PST)
Hugh Dickins <hughd@google.com> wrote:

> Delete the mem_cgroup_page_lruvec() which we just added, replacing
> it and nearby spin_lock_irq or spin_lock_irqsave of zone->lru_lock:
> in most places by page_lock_lruvec() or page_relock_lruvec() (the
> former being a simple case of the latter) or just by lock_lruvec().
> unlock_lruvec() does the spin_unlock_irqrestore for them all.
> 

Wow..removed ;)

> page_relock_lruvec() is born from that "pagezone" pattern in swap.c
> and vmscan.c, where we loop over an array of pages, switching lock
> whenever the zone changes: bearing in mind that if we were to refine
> that lock to per-memcg per-zone, then we would have to switch whenever
> the memcg changes too.
> 
> page_relock_lruvec(page, &lruvec) locates the right lruvec for page,
> unlocks the old lruvec if different (and not NULL), locks the new,
> and updates lruvec on return: so that we shall have just one routine
> to locate and lock the lruvec, whereas originally it got re-evaluated
> at different stages.  But I don't yet know how to satisfy sparse(1).
> 

Ok, I like page_relock_lruvec().



> There are some loops where we never change zone, and a non-memcg kernel
> would not change memcg: use no-op mem_cgroup_page_relock_lruvec() there.
> 
> In compaction's isolate_migratepages(), although we do know the zone,
> we don't know the lruvec in advance: allow for taking the lock later,
> and reorganize its cond_resched() lock-dropping accordingly.
> 
> page_relock_lruvec() (and its wrappers) is actually an _irqsave operation:
> there are a few cases in swap.c where it may be needed at interrupt time
> (to free or to rotate a page on I/O completion).  Ideally(?) we would use
> straightforward _irq disabling elsewhere, but the variants get confusing,
> and page_relock_lruvec() will itself grow more complicated in subsequent
> patches: so keep it simple for now with just the one irqsaver everywhere.
> 
> Passing an irqflags argument/pointer down several levels looks messy
> too, and I'm reluctant to add any more to the page reclaim stack: so
> save the irqflags alongside the lru_lock and restore them from there.
> 
> It's a little sad now to be including mm.h in swap.h to get page_zone();
> but I think that swap.h (despite its name) is the right place for these
> lru functions, and without those inlines the optimizer cannot do so
> well in the !MEM_RES_CTLR case.
> 
> (Is this an appropriate place to confess? that even at the end of the
> series, we're left with a small bug in putback_inactive_pages(), one
> that I've not yet decided is worth fixing: reclaim_stat there is from
> the lruvec on entry, but we might update stats after dropping its lock.
> And do zone->pages_scanned and zone->all_unreclaimable need locking?
> page_alloc.c thinks zone->lock, vmscan.c thought zone->lru_lock,
> and that weakens if we now split lru_lock by memcg.)
> 
> Signed-off-by: Hugh Dickins <hughd@google.com>

No performance impact from replacing spin_lock_irq()/spin_unlock_irq() with
spin_lock_irqsave() and spin_unlock_irqrestore()?

Thanks,
-Kame



* Re: [PATCH 6/10] mm/memcg: take care over pc->mem_cgroup
  2012-02-20 23:34   ` Hugh Dickins
@ 2012-02-21  9:13     ` KAMEZAWA Hiroyuki
  -1 siblings, 0 replies; 72+ messages in thread
From: KAMEZAWA Hiroyuki @ 2012-02-21  9:13 UTC (permalink / raw)
  To: Hugh Dickins
  Cc: Andrew Morton, Konstantin Khlebnikov, Johannes Weiner, Ying Han,
	linux-mm, linux-kernel

On Mon, 20 Feb 2012 15:34:28 -0800 (PST)
Hugh Dickins <hughd@google.com> wrote:
>  	return NULL;
>  
> +	lruvec = page_lock_lruvec(page);
>  	lock_page_cgroup(pc);
>  

Do we need to take the lru_lock and disable irqs per page in this very, very hot path?

Hmm.... How about adding an NR_ISOLATED counter to the lruvec?

Then we can delay freeing the lruvec until all counters go down to zero,
as...

	bool we_can_free_lruvec = true;

	lock_lruvec(lruvec->lock);
	for_each_lru_lruvec(lru)
		if (!list_empty(&lruvec->lru[lru]))
			we_can_free_lruvec = false;
	if (lruvec->nr_isolated)
		we_can_free_lruvec = false;
	unlock_lruvec(lruvec)
	if (we_can_free_lruvec)
		kfree(lruvec);

If compaction or lumpy reclaim frees a page taken from the LRU,
it knows what it is doing and can decrement lruvec->nr_isolated properly
(it seems the zone's NR_ISOLATED is decremented at putback).
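
For example, just a sketch of where the new counter would be adjusted
(lruvec->nr_isolated and the helper below do not exist yet, they only
illustrate the suggestion):

	/* called with the lruvec's lru_lock held */
	static inline void lruvec_mod_isolated(struct lruvec *lruvec, long nr)
	{
		lruvec->nr_isolated += nr;
	}

	isolate:	lruvec_mod_isolated(lruvec, isolated_pages);
	putback, or freeing an isolated page:
			lruvec_mod_isolated(lruvec, -nr_pages);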


Thanks,
-Kame

>  	memcg = pc->mem_cgroup;
> @@ -2915,14 +2944,17 @@ __mem_cgroup_uncharge_common(struct page
>  	mem_cgroup_charge_statistics(memcg, anon, -nr_pages);
>  
>  	ClearPageCgroupUsed(pc);
> +
>  	/*
> -	 * pc->mem_cgroup is not cleared here. It will be accessed when it's
> -	 * freed from LRU. This is safe because uncharged page is expected not
> -	 * to be reused (freed soon). Exception is SwapCache, it's handled by
> -	 * special functions.
> +	 * Once an uncharged page is isolated from the mem_cgroup's lru,
> +	 * it no longer protects that mem_cgroup from rmdir: reset to root.
>  	 */
> +	if (!PageLRU(page) && pc->mem_cgroup != root_mem_cgroup)
> +		pc->mem_cgroup = root_mem_cgroup;
>  
>  	unlock_page_cgroup(pc);
> +	unlock_lruvec(lruvec);
> +
>  	/*
>  	 * even after unlock, we have memcg->res.usage here and this memcg
>  	 * will never be freed.
> @@ -2939,6 +2971,7 @@ __mem_cgroup_uncharge_common(struct page
>  
>  unlock_out:
>  	unlock_page_cgroup(pc);
> +	unlock_lruvec(lruvec);
>  	return NULL;
>  }
>  
> @@ -3327,7 +3360,9 @@ static struct page_cgroup *lookup_page_c
>  	 * the first time, i.e. during boot or memory hotplug;
>  	 * or when mem_cgroup_disabled().
>  	 */
> -	if (likely(pc) && PageCgroupUsed(pc))
> +	if (!pc || PageCgroupUsed(pc))
> +		return pc;
> +	if (pc->mem_cgroup && pc->mem_cgroup != root_mem_cgroup)
>  		return pc;
>  	return NULL;
>  }
> --- mmotm.orig/mm/swap.c	2012-02-18 11:57:42.679524592 -0800
> +++ mmotm/mm/swap.c	2012-02-18 11:57:49.107524745 -0800
> @@ -52,6 +52,7 @@ static void __page_cache_release(struct
>  		lruvec = page_lock_lruvec(page);
>  		VM_BUG_ON(!PageLRU(page));
>  		__ClearPageLRU(page);
> +		mem_cgroup_reset_uncharged_to_root(page);
>  		del_page_from_lru_list(page, lruvec, page_off_lru(page));
>  		unlock_lruvec(lruvec);
>  	}
> @@ -583,6 +584,7 @@ void release_pages(struct page **pages,
>  			page_relock_lruvec(page, &lruvec);
>  			VM_BUG_ON(!PageLRU(page));
>  			__ClearPageLRU(page);
> +			mem_cgroup_reset_uncharged_to_root(page);
>  			del_page_from_lru_list(page, lruvec, page_off_lru(page));
>  		}
>  
> --- mmotm.orig/mm/vmscan.c	2012-02-18 11:57:42.679524592 -0800
> +++ mmotm/mm/vmscan.c	2012-02-18 11:57:49.107524745 -0800
> @@ -1087,11 +1087,11 @@ int __isolate_lru_page(struct page *page
>  
>  	if (likely(get_page_unless_zero(page))) {
>  		/*
> -		 * Be careful not to clear PageLRU until after we're
> -		 * sure the page is not being freed elsewhere -- the
> -		 * page release code relies on it.
> +		 * Beware of interface change: now leave ClearPageLRU(page)
> +		 * to the caller, because memcg's lumpy and compaction
> +		 * cases (approaching the page by its physical location)
> +		 * may not have the right lru_lock yet.
>  		 */
> -		ClearPageLRU(page);
>  		ret = 0;
>  	}
>  
> @@ -1154,7 +1154,16 @@ static unsigned long isolate_lru_pages(u
>  
>  		switch (__isolate_lru_page(page, mode, file)) {
>  		case 0:
> +#ifdef CONFIG_DEBUG_VM
> +			/* check lock on page is lock we already got */
> +			page_relock_lruvec(page, &lruvec);
> +			BUG_ON(lruvec != home_lruvec);
> +			BUG_ON(page != lru_to_page(src));
> +			BUG_ON(page_lru(page) != lru);
> +#endif
> +			ClearPageLRU(page);
>  			isolated_pages = hpage_nr_pages(page);
> +			mem_cgroup_reset_uncharged_to_root(page);
>  			mem_cgroup_update_lru_size(lruvec, lru, -isolated_pages);
>  			list_move(&page->lru, dst);
>  			nr_taken += isolated_pages;
> @@ -1211,21 +1220,7 @@ static unsigned long isolate_lru_pages(u
>  			    !PageSwapCache(cursor_page))
>  				break;
>  
> -			if (__isolate_lru_page(cursor_page, mode, file) == 0) {
> -				mem_cgroup_page_relock_lruvec(cursor_page,
> -								&lruvec);
> -				isolated_pages = hpage_nr_pages(cursor_page);
> -				mem_cgroup_update_lru_size(lruvec,
> -					page_lru(cursor_page), -isolated_pages);
> -				list_move(&cursor_page->lru, dst);
> -
> -				nr_taken += isolated_pages;
> -				nr_lumpy_taken += isolated_pages;
> -				if (PageDirty(cursor_page))
> -					nr_lumpy_dirty += isolated_pages;
> -				scan++;
> -				pfn += isolated_pages - 1;
> -			} else {
> +			if (__isolate_lru_page(cursor_page, mode, file) != 0) {
>  				/*
>  				 * Check if the page is freed already.
>  				 *
> @@ -1243,13 +1238,50 @@ static unsigned long isolate_lru_pages(u
>  					continue;
>  				break;
>  			}
> +
> +			/*
> +			 * This locking call is a no-op in the non-memcg
> +			 * case, since we already hold the right lru_lock;
> +			 * but it may change the lock in the memcg case.
> +			 * It is then vital to recheck PageLRU (but not
> +			 * necessary to recheck isolation mode).
> +			 */
> +			mem_cgroup_page_relock_lruvec(cursor_page, &lruvec);
> +
> +			if (PageLRU(cursor_page) &&
> +			    !PageUnevictable(cursor_page)) {
> +				ClearPageLRU(cursor_page);
> +				isolated_pages = hpage_nr_pages(cursor_page);
> +				mem_cgroup_reset_uncharged_to_root(cursor_page);
> +				mem_cgroup_update_lru_size(lruvec,
> +					page_lru(cursor_page), -isolated_pages);
> +				list_move(&cursor_page->lru, dst);
> +
> +				nr_taken += isolated_pages;
> +				nr_lumpy_taken += isolated_pages;
> +				if (PageDirty(cursor_page))
> +					nr_lumpy_dirty += isolated_pages;
> +				scan++;
> +				pfn += isolated_pages - 1;
> +			} else {
> +				/* Cannot hold lru_lock while freeing page */
> +				unlock_lruvec(lruvec);
> +				lruvec = NULL;
> +				put_page(cursor_page);
> +				break;
> +			}
>  		}
>  
>  		/* If we break out of the loop above, lumpy reclaim failed */
>  		if (pfn < end_pfn)
>  			nr_lumpy_failed++;
>  
> -		lruvec = home_lruvec;
> +		if (lruvec != home_lruvec) {
> +			if (lruvec)
> +				unlock_lruvec(lruvec);
> +			lruvec = home_lruvec;
> +			lock_lruvec(lruvec);
> +		}
>  	}
>  
>  	*nr_scanned = scan;
> @@ -1301,6 +1333,7 @@ int isolate_lru_page(struct page *page)
>  			int lru = page_lru(page);
>  			get_page(page);
>  			ClearPageLRU(page);
> +			mem_cgroup_reset_uncharged_to_root(page);
>  			del_page_from_lru_list(page, lruvec, lru);
>  			ret = 0;
>  		}


* Re: [PATCH 7/10] mm/memcg: remove mem_cgroup_reset_owner
  2012-02-20 23:35   ` Hugh Dickins
@ 2012-02-21  9:17     ` KAMEZAWA Hiroyuki
  -1 siblings, 0 replies; 72+ messages in thread
From: KAMEZAWA Hiroyuki @ 2012-02-21  9:17 UTC (permalink / raw)
  To: Hugh Dickins
  Cc: Andrew Morton, Konstantin Khlebnikov, Johannes Weiner, Ying Han,
	linux-mm, linux-kernel

On Mon, 20 Feb 2012 15:35:38 -0800 (PST)
Hugh Dickins <hughd@google.com> wrote:

> With mem_cgroup_reset_uncharged_to_root() now making sure that freed
> pages point to root_mem_cgroup (instead of to a stale and perhaps
> long-deleted memcg), we no longer need to initialize page memcg to
> root in those odd places which put a page on lru before charging. 
> Delete mem_cgroup_reset_owner().
> 
> But: we have no init_page_cgroup() nowadays (and even when we had,
> it was called before root_mem_cgroup had been allocated); so until
> a struct page has once entered the memcg lru cycle, its page_cgroup
> ->mem_cgroup will be NULL instead of root_mem_cgroup.
> 
> That could be fixed by reintroducing init_page_cgroup(), and ordering
> properly: in future we shall probably want root_mem_cgroup in kernel
> bss or data like swapper_space; but let's not get into that right now.
> 
> Instead allow for this in page_relock_lruvec(): treating NULL as
> root_mem_cgroup, and correcting pc->mem_cgroup before going further.
> 
> What?  Before even taking the zone->lru_lock?  Is that safe?
> Yes, because compaction and lumpy reclaim use __isolate_lru_page(),
> which refuses unless it sees PageLRU - which may be cleared at any
> instant, but we only need it to have been set once in the past for
> pc->mem_cgroup to be initialized properly.
> 
> Signed-off-by: Hugh Dickins <hughd@google.com>
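
For reference, the check described above, condensed from page_relock_lruvec()
as it appears in the diff context quoted later in this thread (not new code):

	if (unlikely(!memcg))		/* never yet been on a memcg lru */
		memcg = pc->mem_cgroup = root_mem_cgroup;
	mz = page_cgroup_zoneinfo(memcg, page);
	lruvec = &mz->lruvec;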

Ok, this seems to be much better than current reset_owner().

Reviewed-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>

Thanks,
-Kame



^ permalink raw reply	[flat|nested] 72+ messages in thread

* Re: [PATCH 8/10] mm/memcg: nest lru_lock inside page_cgroup lock
  2012-02-20 23:36   ` Hugh Dickins
@ 2012-02-21  9:48     ` KAMEZAWA Hiroyuki
  -1 siblings, 0 replies; 72+ messages in thread
From: KAMEZAWA Hiroyuki @ 2012-02-21  9:48 UTC (permalink / raw)
  To: Hugh Dickins
  Cc: Andrew Morton, Konstantin Khlebnikov, Johannes Weiner, Ying Han,
	linux-mm, linux-kernel

On Mon, 20 Feb 2012 15:36:55 -0800 (PST)
Hugh Dickins <hughd@google.com> wrote:

> Cut back on some of the overhead we've added, particularly the lruvec
> locking added to every __mem_cgroup_uncharge_common(), and the page
> cgroup locking in mem_cgroup_reset_uncharged_to_root().
> 
> Our hands were tied by the lock ordering (page cgroup inside lruvec)
> defined by __mem_cgroup_commit_charge_lrucare().  There is no strong
> reason for why that nesting needs to be one way or the other, and if
> we invert it, then some optimizations become possible.
> 
> So delete __mem_cgroup_commit_charge_lrucare(), passing a bool lrucare
> to __mem_cgroup_commit_charge() instead, using page_lock_lruvec() there
> inside lock_page_cgroup() in the lrucare case.  (I'd prefer to work it
> out internally, than rely upon an lrucare argument: but that is hard -
> certainly PageLRU is not enough, racing with pages on pagevec about to
> be flushed to lru.)  Use page_relock_lruvec() after setting mem_cgroup,
> before adding to the appropriate new lruvec: so that (if lock depends
> on memcg) old lock is held across change in ownership while off lru.
> 
> Delete the lruvec locking on entry to __mem_cgroup_uncharge_common();
> but if the page being uncharged is not on lru, then we do need to
> reset its ownership, and must dance very carefully with mem_cgroup_
> reset_uncharged_to_root(), to make sure that when there's a race
> between uncharging and removing from lru, one side or the other
> will see it - smp_mb__after_clear_bit() at both ends.
> 

> Avoid overhead of calls to mem_cgroup_reset_uncharged_to_root() from
> release_pages() and __page_cache_release(), by doing its work inside
> page_relock_lruvec() when the page_count is 0 i.e. the page is frozen
> from other references and about to be freed.  That was not possible
> with the old lock ordering, since __mem_cgroup_uncharge_common()'s
> lock then changed ownership too soon.
> 
> Signed-off-by: Hugh Dickins <hughd@google.com>
> ---
>  mm/memcontrol.c |  142 ++++++++++++++++++++++++----------------------
>  mm/swap.c       |    2 
>  2 files changed, 75 insertions(+), 69 deletions(-)
> 
> --- mmotm.orig/mm/memcontrol.c	2012-02-18 11:57:55.551524898 -0800
> +++ mmotm/mm/memcontrol.c	2012-02-18 11:58:02.451525062 -0800
> @@ -1059,6 +1059,14 @@ void page_relock_lruvec(struct page *pag
>  		 */
>  		if (unlikely(!memcg))
>  			memcg = pc->mem_cgroup = root_mem_cgroup;
> +		/*
> +		 * We must reset pc->mem_cgroup back to root before freeing
> +		 * a page: avoid additional callouts from hot paths by doing
> +		 * it here when we see the page is frozen (can safely be done
> +		 * before taking lru_lock because the page is frozen).
> +		 */
> +		if (memcg != root_mem_cgroup && !page_count(page))
> +			pc->mem_cgroup = root_mem_cgroup;
>  		mz = page_cgroup_zoneinfo(memcg, page);
>  		lruvec = &mz->lruvec;
>  	}
> @@ -1083,23 +1091,20 @@ void mem_cgroup_reset_uncharged_to_root(
>  		return;
>  
>  	VM_BUG_ON(PageLRU(page));
> +	/*
> +	 * Caller just did ClearPageLRU():
> +	 * make sure that __mem_cgroup_uncharge_common()
> +	 * can see that before we test PageCgroupUsed(pc).
> +	 */
> +	smp_mb__after_clear_bit();
>  
>  	/*
>  	 * Once an uncharged page is isolated from the mem_cgroup's lru,
>  	 * it no longer protects that mem_cgroup from rmdir: reset to root.
> -	 *
> -	 * __page_cache_release() and release_pages() may be called at
> -	 * interrupt time: we cannot lock_page_cgroup() then (we might
> -	 * have interrupted a section with page_cgroup already locked),
> -	 * nor do we need to since the page is frozen and about to be freed.
>  	 */
>  	pc = lookup_page_cgroup(page);
> -	if (page_count(page))
> -		lock_page_cgroup(pc);
>  	if (!PageCgroupUsed(pc) && pc->mem_cgroup != root_mem_cgroup)
>  		pc->mem_cgroup = root_mem_cgroup;
> -	if (page_count(page))
> -		unlock_page_cgroup(pc);
>  }
>  
>  /**
> @@ -2422,9 +2427,11 @@ static void __mem_cgroup_commit_charge(s
>  				       struct page *page,
>  				       unsigned int nr_pages,
>  				       struct page_cgroup *pc,
> -				       enum charge_type ctype)
> +				       enum charge_type ctype,
> +				       bool lrucare)
>  {
> -	bool anon;
> +	struct lruvec *lruvec;
> +	bool was_on_lru = false;
>  
>  	lock_page_cgroup(pc);
>  	if (unlikely(PageCgroupUsed(pc))) {
> @@ -2433,28 +2440,41 @@ static void __mem_cgroup_commit_charge(s
>  		return;
>  	}
>  	/*
> -	 * we don't need page_cgroup_lock about tail pages, becase they are not
> -	 * accessed by any other context at this point.
> +	 * We don't need lock_page_cgroup on tail pages, because they are not
> +	 * accessible to any other context at this point.
>  	 */
> -	pc->mem_cgroup = memcg;
> +
>  	/*
> -	 * We access a page_cgroup asynchronously without lock_page_cgroup().
> -	 * Especially when a page_cgroup is taken from a page, pc->mem_cgroup
> -	 * is accessed after testing USED bit. To make pc->mem_cgroup visible
> -	 * before USED bit, we need memory barrier here.
> -	 * See mem_cgroup_add_lru_list(), etc.
> - 	 */
> -	smp_wmb();
> +	 * In some cases, SwapCache and FUSE(splice_buf->radixtree), the page
> +	 * may already be on some other page_cgroup's LRU.  Take care of it.
> +	 */
> +	if (lrucare) {
> +		lruvec = page_lock_lruvec(page);
> +		if (PageLRU(page)) {
> +			ClearPageLRU(page);
> +			del_page_from_lru_list(page, lruvec, page_lru(page));
> +			was_on_lru = true;
> +		}
> +	}
>  
> +	pc->mem_cgroup = memcg;
>  	SetPageCgroupUsed(pc);
> -	if (ctype == MEM_CGROUP_CHARGE_TYPE_MAPPED)
> -		anon = true;
> -	else
> -		anon = false;
>  
> -	mem_cgroup_charge_statistics(memcg, anon, nr_pages);
> +	if (lrucare) {
> +		if (was_on_lru) {
> +			page_relock_lruvec(page, &lruvec);
> +			if (!PageLRU(page)) {
> +				SetPageLRU(page);
> +				add_page_to_lru_list(page, lruvec, page_lru(page));
> +			}
> +		}
> +		unlock_lruvec(lruvec);
> +	}
> +
> +	mem_cgroup_charge_statistics(memcg,
> +			ctype == MEM_CGROUP_CHARGE_TYPE_MAPPED, nr_pages);
>  	unlock_page_cgroup(pc);
> -	WARN_ON_ONCE(PageLRU(page));
> +
>  	/*
>  	 * "charge_statistics" updated event counter. Then, check it.
>  	 * Insert ancestor (and ancestor's ancestors), to softlimit RB-tree.
> @@ -2652,7 +2672,7 @@ static int mem_cgroup_charge_common(stru
>  	ret = __mem_cgroup_try_charge(mm, gfp_mask, nr_pages, &memcg, oom);
>  	if (ret == -ENOMEM)
>  		return ret;
> -	__mem_cgroup_commit_charge(memcg, page, nr_pages, pc, ctype);
> +	__mem_cgroup_commit_charge(memcg, page, nr_pages, pc, ctype, false);
>  	return 0;
>  }
>  
> @@ -2672,34 +2692,6 @@ static void
>  __mem_cgroup_commit_charge_swapin(struct page *page, struct mem_cgroup *ptr,
>  					enum charge_type ctype);
>  
> -static void
> -__mem_cgroup_commit_charge_lrucare(struct page *page, struct mem_cgroup *memcg,
> -					enum charge_type ctype)
> -{
> -	struct page_cgroup *pc = lookup_page_cgroup(page);
> -	struct lruvec *lruvec;
> -	bool removed = false;
> -
> -	/*
> -	 * In some case, SwapCache, FUSE(splice_buf->radixtree), the page
> -	 * is already on LRU. It means the page may on some other page_cgroup's
> -	 * LRU. Take care of it.
> -	 */
> -	lruvec = page_lock_lruvec(page);
> -	if (PageLRU(page)) {
> -		del_page_from_lru_list(page, lruvec, page_lru(page));
> -		ClearPageLRU(page);
> -		removed = true;
> -	}
> -	__mem_cgroup_commit_charge(memcg, page, 1, pc, ctype);
> -	if (removed) {
> -		page_relock_lruvec(page, &lruvec);
> -		add_page_to_lru_list(page, lruvec, page_lru(page));
> -		SetPageLRU(page);
> -	}
> -	unlock_lruvec(lruvec);
> -}
> -
>  int mem_cgroup_cache_charge(struct page *page, struct mm_struct *mm,
>  				gfp_t gfp_mask)
>  {
> @@ -2777,13 +2769,16 @@ static void
>  __mem_cgroup_commit_charge_swapin(struct page *page, struct mem_cgroup *memcg,
>  					enum charge_type ctype)
>  {
> +	struct page_cgroup *pc;
> +
>  	if (mem_cgroup_disabled())
>  		return;
>  	if (!memcg)
>  		return;
>  	cgroup_exclude_rmdir(&memcg->css);
>  
> -	__mem_cgroup_commit_charge_lrucare(page, memcg, ctype);
> +	pc = lookup_page_cgroup(page);
> +	__mem_cgroup_commit_charge(memcg, page, 1, pc, ctype, true);
>  	/*
>  	 * Now swap is on-memory. This means this page may be
>  	 * counted both as mem and swap....double count.
> @@ -2898,7 +2893,6 @@ __mem_cgroup_uncharge_common(struct page
>  	struct mem_cgroup *memcg = NULL;
>  	unsigned int nr_pages = 1;
>  	struct page_cgroup *pc;
> -	struct lruvec *lruvec;
>  	bool anon;
>  
>  	if (mem_cgroup_disabled())
> @@ -2918,7 +2912,6 @@ __mem_cgroup_uncharge_common(struct page
>  	if (unlikely(!PageCgroupUsed(pc)))
>  		return NULL;
>  
> -	lruvec = page_lock_lruvec(page);
>  	lock_page_cgroup(pc);
>  
>  	memcg = pc->mem_cgroup;
> @@ -2950,16 +2943,31 @@ __mem_cgroup_uncharge_common(struct page
>  	mem_cgroup_charge_statistics(memcg, anon, -nr_pages);
>  
>  	ClearPageCgroupUsed(pc);
> +	/*
> +	 * Make sure that mem_cgroup_reset_uncharged_to_root()
> +	 * can see that before we test PageLRU(page).
> +	 */
> +	smp_mb__after_clear_bit();
>  
>  	/*
>  	 * Once an uncharged page is isolated from the mem_cgroup's lru,
>  	 * it no longer protects that mem_cgroup from rmdir: reset to root.
> -	 */
> -	if (!PageLRU(page) && pc->mem_cgroup != root_mem_cgroup)
> -		pc->mem_cgroup = root_mem_cgroup;
> -
> +	 *
> +	 * The page_count() test avoids the lock in the common case when
> +	 * shrink_page_list()'s __remove_mapping() has frozen references
> +	 * to 0 and the page is on its way to freedom.
> +	 */
> +	if (!PageLRU(page) && pc->mem_cgroup != root_mem_cgroup) {
> +		struct lruvec *lruvec = NULL;
> +
> +		if (page_count(page))
> +			lruvec = page_lock_lruvec(page);
> +		if (!PageLRU(page))
> +			pc->mem_cgroup = root_mem_cgroup;
> +		if (lruvec)
> +			unlock_lruvec(lruvec);
> +	}

Hmm, ok. isolate_lru_page() et al. take care of all problems if PageLRU()==true,
right?

I wonder which is better: delaying the freeing of the lruvec, or this locking scheme...
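
To make that pairing concrete, here is a condensed sketch of the two sides
described in the hunks above (a paraphrase for clarity only, with the locking
and page_count() details stripped out - not the actual code):

	/* uncharge side: __mem_cgroup_uncharge_common() */
	ClearPageCgroupUsed(pc);
	smp_mb__after_clear_bit();	/* clear visible before testing PageLRU */
	if (!PageLRU(page))		/* already isolated from the lru? */
		pc->mem_cgroup = root_mem_cgroup;

	/* isolation side: mem_cgroup_reset_uncharged_to_root() */
	/* the caller has just done ClearPageLRU(page) */
	smp_mb__after_clear_bit();	/* clear visible before testing PageCgroupUsed */
	if (!PageCgroupUsed(pc))	/* already uncharged? */
		pc->mem_cgroup = root_mem_cgroup;

Whichever side clears its bit second is guaranteed to see the other side's
clear, so at least one of the two performs the reset to root.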

Thanks,
-Kame




^ permalink raw reply	[flat|nested] 72+ messages in thread

* Re: [PATCH 6/10] mm/memcg: take care over pc->mem_cgroup
  2012-02-21  5:55     ` Konstantin Khlebnikov
@ 2012-02-21 19:37       ` Hugh Dickins
  -1 siblings, 0 replies; 72+ messages in thread
From: Hugh Dickins @ 2012-02-21 19:37 UTC (permalink / raw)
  To: Konstantin Khlebnikov
  Cc: Andrew Morton, KAMEZAWA Hiroyuki, Johannes Weiner, Ying Han,
	linux-mm, linux-kernel

On Tue, 21 Feb 2012, Konstantin Khlebnikov wrote:
> 
> But just one question: how appears uncharged pages in mem-cg lru lists?

One way is swapin readahead pages, which cannot be charged to a memcg
until they're "claimed"; but we do need them visible on lru, otherwise
memory pressure couldn't reclaim them when necessary.

Another way is simply that uncharging has not historically removed the
page from the lru list if it's on.  I usually assume that's an optimization:
why bother to get lru locks and take it off (and put it on the root lru?
if we don't, we're assuming it will be freed very shortly - I'm not
sure that's always a good assumption), if freeing the page will usually
do that for us (without having to change lrus).

If I thought for longer, I might come up with other scenarios.

> Maybe we can forbid this case and uncharge these pages right in
> __page_cache_release() and release_pages() at final removing from LRU.
> This is how my old mem-controller works. There pages in lru are always
> charged.

As things stand, that would mean lock_page_cgroup() has to disable irqs
everywhere.  I'm not sure of the further ramifications of moving uncharge
to __page_cache_release() and release_pages().  I don't think a change
like that is out of the question, but it's certainly a bigger change
than I'd like to consider in this series.

Hugh

^ permalink raw reply	[flat|nested] 72+ messages in thread

* Re: [PATCH 6/10] mm/memcg: take care over pc->mem_cgroup
  2012-02-21  6:05     ` Konstantin Khlebnikov
@ 2012-02-21 20:00       ` Hugh Dickins
  -1 siblings, 0 replies; 72+ messages in thread
From: Hugh Dickins @ 2012-02-21 20:00 UTC (permalink / raw)
  To: Konstantin Khlebnikov
  Cc: Andrew Morton, KAMEZAWA Hiroyuki, Johannes Weiner, Ying Han,
	linux-mm, linux-kernel

On Tue, 21 Feb 2012, Konstantin Khlebnikov wrote:
> > 
> > Be particularly careful in compaction's isolate_migratepages() and
> > vmscan's lumpy handling in isolate_lru_pages(): those approach the
> > page by its physical location, and so can encounter pages which
> > would not be found by any logical lookup.  For those cases we have
> > to change __isolate_lru_page() slightly: it must leave ClearPageLRU
> > to the caller, because compaction and lumpy cannot safely interfere
> > with a page until they have first isolated it and then locked lruvec.
> 
> Yeah, this is most complicated part.

Yes, I found myself leaving this patch until last when commenting.

And was not entirely convinced by what I then said of move_account().
Indeed, I wondered if it might be improved and simplified by taking
lruvec locks itself, in the manner that commit_charge lrucare ends
up doing.

> I found one race here, see below.

Thank you!

> > @@ -386,10 +364,24 @@ static isolate_migrate_t isolate_migrate
> >                          continue;
> > 
> >                  page_relock_lruvec(page,&lruvec);
> 
> Here race with mem_cgroup_move_account() we hold lock for old lruvec,
> while move_account() recharge page and put page back into other lruvec.
> Thus we see PageLRU(), but below we isolate page from wrong lruvec.

I expect you'll prove right on that, but I'm going to think about it
more later.

> 
> In my patch-set this is fixed with __wait_lru_unlock() [ spin_unlock_wait() ]
> in mem_cgroup_move_account()

Right now, after finishing mail, I want to concentrate on getting your
series working under my swapping load.

It's possible that I screwed up rediffing it to my slightly later
base (though the only parts that appeared to need fixing up were as
expected, near update_isolated_counts and move_active_pages_to_lru);
but if so I'd expect that to show up differently.

At present, although it runs fine with cgroup_disable=memory, with memcg
two machines very soon hit that BUG at include/linux/mm_inline.h:41!
when the lru_size or pages_count wraps around; on another it hit that
precisely when I stopped the test.

In all cases it's in release_pages from free_pages_and_swap_cache from
tlb_flush_mmu from tlb_finish_mmu from exit_mmap from mmput from exit_mm
from do_exit (but different processes exiting).

Hugh

^ permalink raw reply	[flat|nested] 72+ messages in thread

* Re: [PATCH 9/10] mm/memcg: move lru_lock into lruvec
  2012-02-21  7:08     ` Konstantin Khlebnikov
@ 2012-02-21 20:12       ` Hugh Dickins
  -1 siblings, 0 replies; 72+ messages in thread
From: Hugh Dickins @ 2012-02-21 20:12 UTC (permalink / raw)
  To: Konstantin Khlebnikov
  Cc: Andrew Morton, KAMEZAWA Hiroyuki, Johannes Weiner, Ying Han,
	linux-mm, linux-kernel

On Tue, 21 Feb 2012, Konstantin Khlebnikov wrote:
> 
> On lumpy/compaction isolate you do:
> 
> if (!PageLRU(page))
> 	continue
> 
> __isolate_lru_page()
> 
> page_relock_rcu_vec()
> 	rcu_read_lock()
> 	rcu_dereference()...
> 	spin_lock()...
> 	rcu_read_unlock()
> 
> You protect page_relock_rcu_vec with switching pointers back to root.
> 
> I do:
> 
> catch_page_lru()
> 	rcu_read_lock()
> 	if (!PageLRU(page))
> 		return false
> 	rcu_dereference()...
> 	spin_lock()...
> 	rcu_read_unlock()
> 	if (PageLRU())
> 		return true
> if true
> 	__isolate_lru_page()
> 
> I protect my catch_page_lruvec() with PageLRU() under single rcu-interval
> with locking.
> Thus my code is better, because it not requires switching pointers back to
> root memcg.

That sounds much better, yes - if it does work reliably.

I'll have to come back to think about your locking later too;
or maybe that's exactly where I need to look, when investigating
the mm_inline.h:41 BUG.

But at first sight, I have to say I'm very suspicious: I've never found
PageLRU a good enough test for whether we need such a lock, because of
races with those pages on percpu lruvec about to be put on the lru.

But maybe once I look closer, I'll find that's handled by your changes
away from lruvec; though I'd have thought the same issue exists,
independent of whether the pending pages are in vector or list.
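
For such physically-scanned pages, the pattern this series ends up with
(condensed from the isolate_lru_pages() hunk quoted earlier in the thread;
a restatement for clarity, not new code) only trusts PageLRU once the
lruvec lock is actually held:

	mem_cgroup_page_relock_lruvec(page, &lruvec);	/* may switch locks */
	if (PageLRU(page) && !PageUnevictable(page)) {
		ClearPageLRU(page);
		isolated_pages = hpage_nr_pages(page);
		mem_cgroup_reset_uncharged_to_root(page);
		mem_cgroup_update_lru_size(lruvec, page_lru(page),
					   -isolated_pages);
		list_move(&page->lru, dst);	/* safely isolated */
	} else {
		/* raced: never on the lru, or already taken by someone else */
	}

whereas catch_page_lruvec() above tests PageLRU both before and after taking
the lock, within one rcu read-side section, so that pointers need not be
switched back to root.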

Hugh

> 
> Meanwhile after seeing your patches, I realized that this rcu-protection is
> required only for lock-by-pfn in lumpy/compaction isolation.
> Thus my locking should be simplified and optimized.

^ permalink raw reply	[flat|nested] 72+ messages in thread

* Re: [PATCH 6/10] mm/memcg: take care over pc->mem_cgroup
  2012-02-21 19:37       ` Hugh Dickins
@ 2012-02-21 20:40         ` Konstantin Khlebnikov
  -1 siblings, 0 replies; 72+ messages in thread
From: Konstantin Khlebnikov @ 2012-02-21 20:40 UTC (permalink / raw)
  To: Hugh Dickins
  Cc: Andrew Morton, KAMEZAWA Hiroyuki, Johannes Weiner, Ying Han,
	linux-mm, linux-kernel

Hugh Dickins wrote:
> On Tue, 21 Feb 2012, Konstantin Khlebnikov wrote:
>>
>> But just one question: how appears uncharged pages in mem-cg lru lists?
>
> One way is swapin readahead pages, which cannot be charged to a memcg
> until they're "claimed"; but we do need them visible on lru, otherwise
> memory pressure couldn't reclaim them when necessary.

Ok, this is really reasonable.

>
> Another way is simply that uncharging has not historically removed the
> page from the lru list if it's on.  I usually assume that's an optimization:
> why bother to get lru locks and take it off (and put it on the root lru?
> if we don't, we're assuming it will be freed very shortly - I'm not
> sure that's always a good assumption), if freeing the page will usually
> do that for us (without having to change lrus).
>
> If I thought for longer, I might come up with other scenarios.
>
>> Maybe we can forbid this case and uncharge these pages right in
>> __page_cache_release() and release_pages() at final removing from LRU.
>> This is how my old mem-controller works. There pages in lru are always
>> charged.
>
> As things stand, that would mean lock_page_cgroup() has to disable irqs
> everywhere.  I'm not sure of the further ramifications of moving uncharge
> to __page_cache_release() and release_pages().  I don't think a change
> like that is out of the question, but it's certainly a bigger change
> than I'd like to consider in this series.

Ok. I have another big question: why do we remove pages from the lru at the last put_page()?

Logically we could remove them in truncate_inode_pages_range() for file pages,
and in free_pages_and_swap_cache() (or somewhere similar) at the last unmap for anon pages.
Pages are unreachable after that; they never become alive again.
The reclaimer also cannot reclaim them in this state, so there is no reason to keep them on the lru.
Pages come into those two functions in large batches, so we could remove them more efficiently there.
Currently they are likely to be removed at exactly that point anyway, just because release_pages()
drops the last references, but we could do the lru removal unconditionally.
Plus it never happens in irq context, so lru_lock can be converted to irq-unsafe in some distant future.
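
As a rough sketch of what that batched removal might look like, using the
helpers introduced in this series purely for illustration (the function name
and its callers are hypothetical, not anything proposed in a patch here):

	/* hypothetical: take a batch of doomed pages off their lrus */
	static void remove_batch_from_lru(struct page **pages, int nr)
	{
		struct lruvec *lruvec = NULL;
		int i;

		for (i = 0; i < nr; i++) {
			struct page *page = pages[i];

			if (!PageLRU(page))
				continue;
			/* take, or switch to, the right lru_lock for this page */
			if (lruvec)
				page_relock_lruvec(page, &lruvec);
			else
				lruvec = page_lock_lruvec(page);
			if (PageLRU(page)) {	/* recheck under the lock */
				ClearPageLRU(page);
				del_page_from_lru_list(page, lruvec,
						       page_lru(page));
			}
		}
		if (lruvec)
			unlock_lruvec(lruvec);
	}

truncate_inode_pages_range() and free_pages_and_swap_cache() would then call
something like this on their batch before dropping the last references.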

>
> Hugh


^ permalink raw reply	[flat|nested] 72+ messages in thread

* Re: [PATCH 9/10] mm/memcg: move lru_lock into lruvec
  2012-02-21 20:12       ` Hugh Dickins
@ 2012-02-21 21:35         ` Konstantin Khlebnikov
  -1 siblings, 0 replies; 72+ messages in thread
From: Konstantin Khlebnikov @ 2012-02-21 21:35 UTC (permalink / raw)
  To: Hugh Dickins
  Cc: Andrew Morton, KAMEZAWA Hiroyuki, Johannes Weiner, Ying Han,
	linux-mm, linux-kernel

Hugh Dickins wrote:
> On Tue, 21 Feb 2012, Konstantin Khlebnikov wrote:
>>
>> On lumpy/compaction isolate you do:
>>
>> if (!PageLRU(page))
>> 	continue
>>
>> __isolate_lru_page()
>>
>> page_relock_rcu_vec()
>> 	rcu_read_lock()
>> 	rcu_dereference()...
>> 	spin_lock()...
>> 	rcu_read_unlock()
>>
>> You protect page_relock_rcu_vec with switching pointers back to root.
>>
>> I do:
>>
>> catch_page_lru()
>> 	rcu_read_lock()
>> 	if (!PageLRU(page))
>> 		return false
>> 	rcu_dereference()...
>> 	spin_lock()...
>> 	rcu_read_unlock()
>> 	if (PageLRU())
>> 		return true
>> if true
>> 	__isolate_lru_page()
>>
>> I protect my catch_page_lruvec() with PageLRU() under single rcu-interval
>> with locking.
>> Thus my code is better, because it not requires switching pointers back to
>> root memcg.
>
> That sounds much better, yes - if it does work reliably.
>
> I'll have to come back to think about your locking later too;
> or maybe that's exactly where I need to look, when investigating
> the mm_inline.h:41 BUG.

The pages_count[] updates look correct.
This really may be a bug in locking, and this VM_BUG_ON catches it before list-debug would.

>
> But at first sight, I have to say I'm very suspicious: I've never found
> PageLRU a good enough test for whether we need such a lock, because of
> races with those pages on percpu lruvec about to be put on the lru.
>
> But maybe once I look closer, I'll find that's handled by your changes
> away from lruvec; though I'd have thought the same issue exists,
> independent of whether the pending pages are in vector or list.

Are you talking about my per-cpu page-lists for lru-adding?
This is just an unnecessary patch; I don't know why I included it in the v2 set.
It does not protect anything.

>
> Hugh
>
>>
>> Meanwhile after seeing your patches, I realized that this rcu-protection is
>> required only for lock-by-pfn in lumpy/compaction isolation.
>> Thus my locking should be simplified and optimized.


^ permalink raw reply	[flat|nested] 72+ messages in thread

* Re: [PATCH 6/10] mm/memcg: take care over pc->mem_cgroup
  2012-02-21 20:40         ` Konstantin Khlebnikov
@ 2012-02-21 22:05           ` Hugh Dickins
  -1 siblings, 0 replies; 72+ messages in thread
From: Hugh Dickins @ 2012-02-21 22:05 UTC (permalink / raw)
  To: Konstantin Khlebnikov
  Cc: Andrew Morton, KAMEZAWA Hiroyuki, Johannes Weiner, Ying Han,
	linux-mm, linux-kernel

On Wed, 22 Feb 2012, Konstantin Khlebnikov wrote:
> Hugh Dickins wrote:
> > 
> > As things stand, that would mean lock_page_cgroup() has to disable irqs
> > everywhere.  I'm not sure of the further ramifications of moving uncharge
> > to __page_cache_release() and release_pages().  I don't think a change
> > like that is out of the question, but it's certainly a bigger change
> > than I'd like to consider in this series.
> 
> Ok. I have another big question: why do we remove pages from the lru at
> the last put_page()?
>
> Logically we could remove them in truncate_inode_pages_range() for file
> pages, and in free_pages_and_swap_cache() (or somewhere similar) at the
> last unmap for anon pages.
> Pages are unreachable after that; they never become alive again.
> The reclaimer also cannot reclaim them in this state, so there is no
> reason to keep them on the lru.
> Pages come into those two functions in large batches, so we could remove
> them more efficiently there.
> Currently they are likely to be removed at exactly that point anyway,
> just because release_pages() drops the last references, but we could do
> the lru removal unconditionally.

That may be a very good idea, but I'm not going to commit myself
in a hurry.

I think Kamezawa-san was involved, and has a much better grasp than
I have, of the choices of precisely when to charge and uncharge;
and why we would not have removed from lru at the point of uncharge.

There may have been lock ordering reasons, now gone away, why it could
not have been done.  Or it may just have been the overriding reason,
now going away, that memcg should not make any change to what already
was happening without memcg.

One difficulty that comes to mind is that, at the point of uncharge,
the page may be (temporarily) off lru already: what then?  We certainly
don't want the uncharge to wait until the page comes back on to lru.
But it should be possible to deal with, by just making everywhere that
puts a page back on lru check for the charge first.  Hmm, but then
what of non-memcg, where there is never any charge?  And what of
those swapin readahead pages?  I think what you were suggesting
is probably slightly different from what I went on to imagine.
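
As a purely speculative sketch of that putback check (nothing like this is
in the series, and the helper name is made up):

	/* hypothetical: only put the page back on an lru if it is still charged */
	static void putback_if_charged(struct page *page, struct lruvec *lruvec)
	{
		struct page_cgroup *pc = lookup_page_cgroup(page);

		if (!PageCgroupUsed(pc))
			return;		/* uncharged: leave it off the lru */
		SetPageLRU(page);
		add_page_to_lru_list(page, lruvec, page_lru(page));
	}

which runs straight into the objections just raised: !PageCgroupUsed also
covers the non-memcg case and swapin readahead pages, which do need to be
visible on an lru.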

Please keep this idea in mind: maybe Kamezawa will immediately point
out the fatal flaw in it, or maybe we should come back to it later -
I'm not getting deeper into it now.

> Plus it never happens in irq context, so lru_lock can be converted to
> irq-unsafe in some distant future.

I'd love that: no very strong reason, the irq-disabling just irritates
me.  But note that the irq-disabling was introduced by Andrew, not for
I/O completion reasons (those somehow followed later IIRC), but because
the lock was so contended that he didn't want the holders interrupted.
Though I've not seen such a justification used recently.

We'd also have to do something about the "rotation",
maybe Mel's separate list would help, maybe not.

Hugh

^ permalink raw reply	[flat|nested] 72+ messages in thread

* Re: [PATCH 9/10] mm/memcg: move lru_lock into lruvec
  2012-02-21 21:35         ` Konstantin Khlebnikov
@ 2012-02-21 22:12           ` Hugh Dickins
  -1 siblings, 0 replies; 72+ messages in thread
From: Hugh Dickins @ 2012-02-21 22:12 UTC (permalink / raw)
  To: Konstantin Khlebnikov
  Cc: Andrew Morton, KAMEZAWA Hiroyuki, Johannes Weiner, Ying Han,
	linux-mm, linux-kernel

On Wed, 22 Feb 2012, Konstantin Khlebnikov wrote:
> Hugh Dickins wrote:
> > 
> > I'll have to come back to think about your locking later too;
> > or maybe that's exactly where I need to look, when investigating
> > the mm_inline.h:41 BUG.
> 
> pages_count[] updates looks correct.
> This really may be bug in locking, and this VM_BUG_ON catch it before
> list-debug.

I've still not got into looking at it yet.

You're right to mention DEBUG_LIST: I have that on some of the machines,
and I would expect that to be the first to catch a mislocking issue.

In the past my problems with that BUG (well, the spur to introduce it)
came from hugepages.

> > 
> > But at first sight, I have to say I'm very suspicious: I've never found
> > PageLRU a good enough test for whether we need such a lock, because of
> > races with those pages on percpu lruvec about to be put on the lru.
> > 
> > But maybe once I look closer, I'll find that's handled by your changes
> > away from lruvec; though I'd have thought the same issue exists,
> > independent of whether the pending pages are in vector or list.
> 
> Are you talking about my per-cpu page-lists for lru-adding?

Yes.

> This is just an unnecessary patch, I don't know why I include it into v2 set.
> It does not protect anything.

Okay.

Hugh


* Re: [PATCH 4/10] mm/memcg: apply add/del_page to lruvec
  2012-02-21  8:20     ` KAMEZAWA Hiroyuki
@ 2012-02-21 22:25       ` Hugh Dickins
  -1 siblings, 0 replies; 72+ messages in thread
From: Hugh Dickins @ 2012-02-21 22:25 UTC (permalink / raw)
  To: KAMEZAWA Hiroyuki
  Cc: Hugh Dickins, Andrew Morton, Konstantin Khlebnikov,
	Johannes Weiner, Ying Han, linux-mm, linux-kernel

Many thanks for inspecting these, and so soon.

On Tue, 21 Feb 2012, KAMEZAWA Hiroyuki wrote:
> 
> Hmm.. a nitpick.
> 
> You do 
>   lruvec = mem_cgroup_page_lruvec(page, zone);
> 
> What is the difference from
> 
>   lruvec = mem_cgroup_page_lruvec(page, page_zone(page)) 
> 
> ?

I hope they were equivalent: I just did it that way because in all cases
the zone had already been computed, so that saved recomputing it - as I
understand it, in some layouts (such as mine) it's pretty cheap to work
out the page's zone, but in others an expense to be avoided.

But then you discovered that it soon got removed again anyway.

Hugh

> 
> If we have a function
>   lruvec = mem_cgroup_page_lruvec(page)
> 
> Do we need 
>   lruvec = mem_cgroup_page_lruvec_zone(page, zone) 
> 
> ?


* Re: [PATCH 5/10] mm/memcg: introduce page_relock_lruvec
  2012-02-21  8:38     ` KAMEZAWA Hiroyuki
@ 2012-02-21 22:36       ` Hugh Dickins
  -1 siblings, 0 replies; 72+ messages in thread
From: Hugh Dickins @ 2012-02-21 22:36 UTC (permalink / raw)
  To: KAMEZAWA Hiroyuki
  Cc: Andrew Morton, Konstantin Khlebnikov, Johannes Weiner, Ying Han,
	linux-mm, linux-kernel

On Tue, 21 Feb 2012, KAMEZAWA Hiroyuki wrote:
> 
> No performance impact from replacing spin_lock_irq()/spin_unlock_irq()
> with spin_lock_irqsave() and spin_unlock_irqrestore()?

None that I noticed - but that is not at all a reassuring answer!

It worries me a little.  I think it would make more or less difference
on different architectures, and I forget where x86 stands there - one
of the more or the less affected?  Worth branches down inside
page_relock_lruvec()?

It's also unfortunate to be "losing" the information of where _irq
is needed and where _irqsave (but not much gets lost with git).

It's something that can be fixed - and I think Konstantin's version
already keeps the variants: I just didn't want to get confused by them,
while focussing on the locking details.
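
For reference, keeping both flavours need be no more than this (a sketch:
the helper names, and the lru_lock sitting in the lruvec as it does after
9/10, are illustrative rather than the posted code):

	static inline void lock_lruvec_irq(struct lruvec *lruvec)
	{
		spin_lock_irq(&lruvec->lru_lock);
	}

	static inline void unlock_lruvec_irq(struct lruvec *lruvec)
	{
		spin_unlock_irq(&lruvec->lru_lock);
	}

	static inline void lock_lruvec_irqsave(struct lruvec *lruvec,
					       unsigned long *flags)
	{
		spin_lock_irqsave(&lruvec->lru_lock, *flags);
	}

	static inline void unlock_lruvec_irqrestore(struct lruvec *lruvec,
						    unsigned long flags)
	{
		spin_unlock_irqrestore(&lruvec->lru_lock, flags);
	}

Callers which know they run with irqs enabled - the great majority -
would stay on the _irq flavour, and only the awkward cases would pay
for saving and restoring flags.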

Hugh


* Re: [PATCH 6/10] mm/memcg: take care over pc->mem_cgroup
  2012-02-21  9:13     ` KAMEZAWA Hiroyuki
@ 2012-02-21 23:03       ` Hugh Dickins
  -1 siblings, 0 replies; 72+ messages in thread
From: Hugh Dickins @ 2012-02-21 23:03 UTC (permalink / raw)
  To: KAMEZAWA Hiroyuki
  Cc: Andrew Morton, Konstantin Khlebnikov, Johannes Weiner, Ying Han,
	linux-mm, linux-kernel

On Tue, 21 Feb 2012, KAMEZAWA Hiroyuki wrote:
> On Mon, 20 Feb 2012 15:34:28 -0800 (PST)
> Hugh Dickins <hughd@google.com> wrote:
> 	return NULL;
> >  
> > +	lruvec = page_lock_lruvec(page);
> >  	lock_page_cgroup(pc);
> >  
> 
> Do we need to take lrulock+irq disable per page in this very very hot path ?

I'm sure we don't want to: I hope you were pleased to find it goes away
(from most cases) a couple of patches later.

I had lruvec lock nested inside page_cgroup lock in the rollup I sent in
December, whereas you went for page_cgroup lock nested inside lruvec lock
in your lrucare patch.

I couldn't find an imperative reason why they should be one way round or
the other, so I tried hard to stick with your ordering, and it did work
(in this 6/10).  But then I couldn't work out how to get rid of the
overheads added in doing it this way round, so swapped them back.
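
Spelled out, the difference is just the nesting (a condensed sketch;
the unlock names are assumed to mirror the lock names):

	/* this 6/10: lruvec lock taken outside the page_cgroup lock */
	lruvec = page_lock_lruvec(page);
	lock_page_cgroup(pc);
	/* ... inspect or switch pc->mem_cgroup ... */
	unlock_page_cgroup(pc);
	page_unlock_lruvec(lruvec);

	/* the December rollup, and where 8/10 swaps back to:
	   page_cgroup lock taken outside the lruvec lock */
	lock_page_cgroup(pc);
	lruvec = page_lock_lruvec(page);
	/* ... */
	page_unlock_lruvec(lruvec);
	unlock_page_cgroup(pc);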

> 
> Hmm.... How about adding NR_ISOLATED counter into lruvec ?
> 
> Then, we can delay freeing the lruvec until all counters go down to zero.
> as...
> 
> 	bool we_can_free_lruvec = true;
> 
> 	lock_lruvec(lruvec->lock);
> 	for_each_lru_lruvec(lru)
> 		if (!list_empty(&lruvec->lru[lru]))
> 			we_can_free_lruvec = false;
> 	if (lruvec->nr_isolated)
> 		we_can_free_lruvec = false;
> 	unlock_lruvec(lruvec)
> 	if (we_can_free_lruvec)
> 		kfree(lruvec);
> 
> If compaction, lumpy reclaim free a page taken from LRU,
> it knows what it does and can decrement lruvec->nr_isolated properly
> (it seems zone's NR_ISOLATED is decremented at putback.)

At the moment I'm thinking that what we end up with by 9/10 is
better than adding such a refcount.  But I'm not entirely happy with
mem_cgroup_reset_uncharged_to_root (it adds a further page_cgroup
lookup just after I got rid of some others), and need yet to think
about the race which Konstantin posits, so all options remain open.

Hugh


* Re: [PATCH 9/10] mm/memcg: move lru_lock into lruvec
  2012-02-21 22:12           ` Hugh Dickins
@ 2012-02-22  3:43             ` Konstantin Khlebnikov
  -1 siblings, 0 replies; 72+ messages in thread
From: Konstantin Khlebnikov @ 2012-02-22  3:43 UTC (permalink / raw)
  To: Hugh Dickins
  Cc: Andrew Morton, KAMEZAWA Hiroyuki, Johannes Weiner, Ying Han,
	linux-mm, linux-kernel

Hugh Dickins wrote:
> On Wed, 22 Feb 2012, Konstantin Khlebnikov wrote:
>> Hugh Dickins wrote:
>>>
>>> I'll have to come back to think about your locking later too;
>>> or maybe that's exactly where I need to look, when investigating
>>> the mm_inline.h:41 BUG.
>>
>> pages_count[] updates looks correct.
>> This really may be bug in locking, and this VM_BUG_ON catch it before
>> list-debug.
>
> I've still not got into looking at it yet.
>
> You're right to mention DEBUG_LIST: I have that on some of the machines,
> and I would expect that to be the first to catch a mislocking issue.
>
> In the past my problems with that BUG (well, the spur to introduce it)
> came from hugepages.

My patchset doesn't have your mem_cgroup_reset_uncharged_to_root protection,
or anything to replace it. So there is a race between cgroup removal and the
put-back of an isolated uncharged page, but it shouldn't corrupt the lru
lists. This must be something different.

>
>>>
>>> But at first sight, I have to say I'm very suspicious: I've never found
>>> PageLRU a good enough test for whether we need such a lock, because of
>>> races with those pages on percpu lruvec about to be put on the lru.
>>>
>>> But maybe once I look closer, I'll find that's handled by your changes
>>> away from lruvec; though I'd have thought the same issue exists,
>>> independent of whether the pending pages are in vector or list.
>>
>> Are you talking about my per-cpu page-lists for lru-adding?
>
> Yes.
>
>> This is just an unnecessary patch, I don't know why I include it into v2 set.
>> It does not protect anything.
>
> Okay.
>
> Hugh



* Re: [PATCH 6/10] mm/memcg: take care over pc->mem_cgroup
  2012-02-21 23:03       ` Hugh Dickins
@ 2012-02-22  4:05         ` Konstantin Khlebnikov
  -1 siblings, 0 replies; 72+ messages in thread
From: Konstantin Khlebnikov @ 2012-02-22  4:05 UTC (permalink / raw)
  To: Hugh Dickins
  Cc: KAMEZAWA Hiroyuki, Andrew Morton, Johannes Weiner, Ying Han,
	linux-mm, linux-kernel

Hugh Dickins wrote:
> On Tue, 21 Feb 2012, KAMEZAWA Hiroyuki wrote:
>> On Mon, 20 Feb 2012 15:34:28 -0800 (PST)
>> Hugh Dickins<hughd@google.com>  wrote:
>> 	return NULL;
>>>
>>> +	lruvec = page_lock_lruvec(page);
>>>   	lock_page_cgroup(pc);
>>>
>>
>> Do we need to take lrulock+irq disable per page in this very very hot path ?
>
> I'm sure we don't want to: I hope you were pleased to find it goes away
> (from most cases) a couple of patches later.
>
> I had lruvec lock nested inside page_cgroup lock in the rollup I sent in
> December, whereas you went for page_cgroup lock nested inside lruvec lock
> in your lrucare patch.
>
> I couldn't find an imperative reason why they should be one way round or
> the other, so I tried hard to stick with your ordering, and it did work
> (in this 6/10).  But then I couldn't work out how to get rid of the
> overheads added in doing it this way round, so swapped them back.
>
>>
>> Hmm.... How about adding NR_ISOLATED counter into lruvec ?
>>
>> Then, we can delay freeing the lruvec until all counters go down to zero.
>> as...
>>
>> 	bool we_can_free_lruvec = true;
>>
>> 	lock_lruvec(lruvec->lock);
>> 	for_each_lru_lruvec(lru)
>> 		if (!list_empty(&lruvec->lru[lru]))
>> 			we_can_free_lruvec = false;
>> 	if (lruvec->nr_isolated)
>> 		we_can_free_lruvec = false;
>> 	unlock_lruvec(lruvec)
>> 	if (we_can_free_lruvec)
>> 		kfree(lruvec);
>>
>> If compaction, lumpy reclaim free a page taken from LRU,
>> it knows what it does and can decrement lruvec->nr_isolated properly
>> (it seems zone's NR_ISOLATED is decremented at putback.)
>
> At the moment I'm thinking that what we end up with by 9/10 is
> better than adding such a refcount.  But I'm not entirely happy with
> mem_cgroup_reset_uncharged_to_root (it adds a further page_cgroup
> lookup just after I got rid of some others), and need yet to think
> about the race which Konstantin posits, so all options remain open.

This lruvec->nr_isolated seems reasonable, and managing it is not very costly.
In move_account() we anyway need to take old_lruvec->lru_lock after the
recharge, to stabilize PageLRU() before adding the page to new_lruvec
(because of that race).
In migration/compaction this is handled automatically, because they always
call putback_lru_page() at the end.
The main problem is shrink_page_list() for lumpy reclaim, but it seems that
is never used when memory compaction is enabled, so it can afford to be slow
and inefficient, with tons of lru_lock relocks.
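
In sketch form the accounting would be no more than this (field and
helper names here are illustrative, not the v3 code):

	/* both counters adjusted under lruvec->lru_lock */
	static void lruvec_isolate_page(struct lruvec *lruvec,
					struct page *page, enum lru_list lru)
	{
		list_del(&page->lru);
		lruvec->pages_count[lru] -= hpage_nr_pages(page);
		lruvec->nr_isolated += hpage_nr_pages(page);
	}

	static void lruvec_putback_page(struct lruvec *lruvec,
					struct page *page, enum lru_list lru)
	{
		list_add(&page->lru, &lruvec->lists[lru]);
		lruvec->pages_count[lru] += hpage_nr_pages(page);
		lruvec->nr_isolated -= hpage_nr_pages(page);
	}

An lruvec whose lru lists are empty but whose nr_isolated has not yet
dropped back to zero is then kept alive until the last putback, much as
in the pseudocode quoted above.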

>
> Hugh



* Re: [PATCH 9/10] mm/memcg: move lru_lock into lruvec
  2012-02-22  3:43             ` Konstantin Khlebnikov
@ 2012-02-22  6:09               ` Hugh Dickins
  -1 siblings, 0 replies; 72+ messages in thread
From: Hugh Dickins @ 2012-02-22  6:09 UTC (permalink / raw)
  To: Konstantin Khlebnikov
  Cc: Andrew Morton, KAMEZAWA Hiroyuki, Johannes Weiner, Ying Han,
	linux-mm, linux-kernel

On Wed, 22 Feb 2012, Konstantin Khlebnikov wrote:
> Hugh Dickins wrote:
> > On Wed, 22 Feb 2012, Konstantin Khlebnikov wrote:
> > > Hugh Dickins wrote:
> > > > 
> > > > I'll have to come back to think about your locking later too;
> > > > or maybe that's exactly where I need to look, when investigating
> > > > the mm_inline.h:41 BUG.
> > > 
> > > pages_count[] updates looks correct.
> > > This really may be bug in locking, and this VM_BUG_ON catch it before
> > > list-debug.
> > 
> > I've still not got into looking at it yet.
> > 
> > You're right to mention DEBUG_LIST: I have that on some of the machines,
> > and I would expect that to be the first to catch a mislocking issue.
> > 
> > In the past my problems with that BUG (well, the spur to introduce it)
> > came from hugepages.
> 
> My patchset doesn't have your mem_cgroup_reset_uncharged_to_root protection,
> or anything to replace it. So there is a race between cgroup removal and the
> put-back of an isolated uncharged page, but it shouldn't corrupt the lru
> lists. This must be something different.

Yes, I'm not into removing cgroups yet.

I've got it: your "can differ only on lumpy reclaim" belief, first
commented in 17/22 but then assumed in 20/22, is wrong: those swapin
readahead pages, for example, may shift from root_mem_cgroup to another
mem_cgroup while the page is isolated by shrink_active or shrink_inactive.

Patch below against the top of my version of your tree: probably won't
quite apply to yours, since we used different bases here; but easy
enough to correct yours from it.

Bisection was misleading: it appeared to be much easier to reproduce
with 22/22 taken off, and led to 16/22, but that's because that one
introduced a similar bug, which actually got fixed in 22/22:

relock_page_lruvec() and relock_page_lruvec_irq() in 16/22 onwards
are wrong, in each case the if block needs an
	} else
		lruvec = page_lruvec(page);

You'll want to fix that in 16/22, but here's the patch for the end state:

Signed-off-by: Hugh Dickins <hughd@google.com>
but forget that, just quietly fold the fixes into yours!
---
 mm/vmscan.c |   20 ++++++--------------
 1 file changed, 6 insertions(+), 14 deletions(-)

--- 3033K2.orig/mm/vmscan.c	2012-02-21 00:02:13.000000000 -0800
+++ 3033K2/mm/vmscan.c	2012-02-21 21:23:25.768381375 -0800
@@ -1342,7 +1342,6 @@ static int too_many_isolated(struct zone
  */
 static noinline_for_stack struct lruvec *
 putback_inactive_pages(struct lruvec *lruvec,
-		       struct scan_control *sc,
 		       struct list_head *page_list)
 {
 	struct zone_reclaim_stat *reclaim_stat = &lruvec->reclaim_stat;
@@ -1364,11 +1363,8 @@ putback_inactive_pages(struct lruvec *lr
 			continue;
 		}
 
-		/* can differ only on lumpy reclaim */
-		if (sc->order) {
-			lruvec = __relock_page_lruvec(lruvec, page);
-			reclaim_stat = &lruvec->reclaim_stat;
-		}
+		lruvec = __relock_page_lruvec(lruvec, page);
+		reclaim_stat = &lruvec->reclaim_stat;
 
 		SetPageLRU(page);
 		lru = page_lru(page);
@@ -1566,7 +1562,7 @@ shrink_inactive_list(unsigned long nr_to
 		__count_vm_events(KSWAPD_STEAL, nr_reclaimed);
 	__count_zone_vm_events(PGSTEAL, zone, nr_reclaimed);
 
-	lruvec = putback_inactive_pages(lruvec, sc, &page_list);
+	lruvec = putback_inactive_pages(lruvec, &page_list);
 
 	__mod_zone_page_state(zone, NR_ISOLATED_ANON, -nr_anon);
 	__mod_zone_page_state(zone, NR_ISOLATED_FILE, -nr_file);
@@ -1631,7 +1627,6 @@ shrink_inactive_list(unsigned long nr_to
 
 static struct lruvec *
 move_active_pages_to_lru(struct lruvec *lruvec,
-			 struct scan_control *sc,
 			 struct list_head *list,
 			 struct list_head *pages_to_free,
 			 enum lru_list lru)
@@ -1643,10 +1638,7 @@ move_active_pages_to_lru(struct lruvec *
 		int numpages;
 
 		page = lru_to_page(list);
-
-		/* can differ only on lumpy reclaim */
-		if (sc->order)
-			lruvec = __relock_page_lruvec(lruvec, page);
+		lruvec = __relock_page_lruvec(lruvec, page);
 
 		VM_BUG_ON(PageLRU(page));
 		SetPageLRU(page);
@@ -1770,9 +1762,9 @@ static void shrink_active_list(unsigned
 	 */
 	reclaim_stat->recent_rotated[file] += nr_rotated;
 
-	lruvec = move_active_pages_to_lru(lruvec, sc, &l_active, &l_hold,
+	lruvec = move_active_pages_to_lru(lruvec, &l_active, &l_hold,
 						LRU_ACTIVE + file * LRU_FILE);
-	lruvec = move_active_pages_to_lru(lruvec, sc, &l_inactive, &l_hold,
+	lruvec = move_active_pages_to_lru(lruvec, &l_inactive, &l_hold,
 						LRU_BASE   + file * LRU_FILE);
 	__mod_zone_page_state(zone, NR_ISOLATED_ANON + file, -nr_taken);
 	unlock_lruvec_irq(lruvec);


* Re: [PATCH 9/10] mm/memcg: move lru_lock into lruvec
  2012-02-22  6:09               ` Hugh Dickins
@ 2012-02-23 14:21                 ` Konstantin Khlebnikov
  -1 siblings, 0 replies; 72+ messages in thread
From: Konstantin Khlebnikov @ 2012-02-23 14:21 UTC (permalink / raw)
  To: Hugh Dickins
  Cc: Andrew Morton, KAMEZAWA Hiroyuki, Johannes Weiner, Ying Han,
	linux-mm, linux-kernel

Hugh Dickins wrote:
> On Wed, 22 Feb 2012, Konstantin Khlebnikov wrote:
>> Hugh Dickins wrote:
>>> On Wed, 22 Feb 2012, Konstantin Khlebnikov wrote:
>>>> Hugh Dickins wrote:
>>>>>
>>>>> I'll have to come back to think about your locking later too;
>>>>> or maybe that's exactly where I need to look, when investigating
>>>>> the mm_inline.h:41 BUG.
>>>>
>>>> pages_count[] updates looks correct.
>>>> This really may be bug in locking, and this VM_BUG_ON catch it before
>>>> list-debug.
>>>
>>> I've still not got into looking at it yet.
>>>
>>> You're right to mention DEBUG_LIST: I have that on some of the machines,
>>> and I would expect that to be the first to catch a mislocking issue.
>>>
>>> In the past my problems with that BUG (well, the spur to introduce it)
>>> came from hugepages.
>>
>> My patchset doesn't have your mem_cgroup_reset_uncharged_to_root protection,
>> or anything to replace it. So there is a race between cgroup removal and the
>> put-back of an isolated uncharged page, but it shouldn't corrupt the lru
>> lists. This must be something different.
>
> Yes, I'm not into removing cgroups yet.

Ok, my v3 patchset can deal with cgroup removal. At least I believe so. =)

I have implemented an isolated-pages counter.
It seems the overhead isn't fatal, and it can be reduced.
Plus these counters can be used not only as reference counts:
they also provide useful statistics for the reclaimer.

>
> I've got it: your "can differ only on lumpy reclaim" belief, first
> commented in 17/22 but then assumed in 20/22, is wrong: those swapin
> readahead pages, for example, may shift from root_mem_cgroup to another
> mem_cgroup while the page is isolated by shrink_active or shrink_inactive.

Ok, thanks.

>
> Patch below against the top of my version of your tree: probably won't
> quite apply to yours, since we used different bases here; but easy
> enough to correct yours from it.
>
> Bisection was misleading: it appeared to be much easier to reproduce
> with 22/22 taken off, and led to 16/22, but that's because that one
> introduced a similar bug, which actually got fixed in 22/22:
>
> relock_page_lruvec() and relock_page_lruvec_irq() in 16/22 onwards
> are wrong, in each case the if block needs an
> 	} else
> 		lruvec = page_lruvec(page);

Ok, fixed in v3

>
> You'll want to fix that in 16/22, but here's the patch for the end state:
>
> Signed-off-by: Hugh Dickins<hughd@google.com>
> but forget that, just quietly fold the fixes into yours!

This actually reverts my "mm: optimize putback for 0-order reclaim",
so I have removed that wrong optimization in v3.

> ---
>   mm/vmscan.c |   20 ++++++--------------
>   1 file changed, 6 insertions(+), 14 deletions(-)
>
> --- 3033K2.orig/mm/vmscan.c	2012-02-21 00:02:13.000000000 -0800
> +++ 3033K2/mm/vmscan.c	2012-02-21 21:23:25.768381375 -0800
> @@ -1342,7 +1342,6 @@ static int too_many_isolated(struct zone
>    */
>   static noinline_for_stack struct lruvec *
>   putback_inactive_pages(struct lruvec *lruvec,
> -		       struct scan_control *sc,
>   		       struct list_head *page_list)
>   {
>   	struct zone_reclaim_stat *reclaim_stat =&lruvec->reclaim_stat;
> @@ -1364,11 +1363,8 @@ putback_inactive_pages(struct lruvec *lr
>   			continue;
>   		}
>
> -		/* can differ only on lumpy reclaim */
> -		if (sc->order) {
> -			lruvec = __relock_page_lruvec(lruvec, page);
> -			reclaim_stat =&lruvec->reclaim_stat;
> -		}
> +		lruvec = __relock_page_lruvec(lruvec, page);
> +		reclaim_stat =&lruvec->reclaim_stat;
>
>   		SetPageLRU(page);
>   		lru = page_lru(page);
> @@ -1566,7 +1562,7 @@ shrink_inactive_list(unsigned long nr_to
>   		__count_vm_events(KSWAPD_STEAL, nr_reclaimed);
>   	__count_zone_vm_events(PGSTEAL, zone, nr_reclaimed);
>
> -	lruvec = putback_inactive_pages(lruvec, sc,&page_list);
> +	lruvec = putback_inactive_pages(lruvec,&page_list);
>
>   	__mod_zone_page_state(zone, NR_ISOLATED_ANON, -nr_anon);
>   	__mod_zone_page_state(zone, NR_ISOLATED_FILE, -nr_file);
> @@ -1631,7 +1627,6 @@ shrink_inactive_list(unsigned long nr_to
>
>   static struct lruvec *
>   move_active_pages_to_lru(struct lruvec *lruvec,
> -			 struct scan_control *sc,
>   			 struct list_head *list,
>   			 struct list_head *pages_to_free,
>   			 enum lru_list lru)
> @@ -1643,10 +1638,7 @@ move_active_pages_to_lru(struct lruvec *
>   		int numpages;
>
>   		page = lru_to_page(list);
> -
> -		/* can differ only on lumpy reclaim */
> -		if (sc->order)
> -			lruvec = __relock_page_lruvec(lruvec, page);
> +		lruvec = __relock_page_lruvec(lruvec, page);
>
>   		VM_BUG_ON(PageLRU(page));
>   		SetPageLRU(page);
> @@ -1770,9 +1762,9 @@ static void shrink_active_list(unsigned
>   	 */
>   	reclaim_stat->recent_rotated[file] += nr_rotated;
>
> -	lruvec = move_active_pages_to_lru(lruvec, sc,&l_active,&l_hold,
> +	lruvec = move_active_pages_to_lru(lruvec,&l_active,&l_hold,
>   						LRU_ACTIVE + file * LRU_FILE);
> -	lruvec = move_active_pages_to_lru(lruvec, sc,&l_inactive,&l_hold,
> +	lruvec = move_active_pages_to_lru(lruvec,&l_inactive,&l_hold,
>   						LRU_BASE   + file * LRU_FILE);
>   	__mod_zone_page_state(zone, NR_ISOLATED_ANON + file, -nr_taken);
>   	unlock_lruvec_irq(lruvec);



end of thread, other threads:[~2012-02-23 14:21 UTC | newest]

Thread overview: 72+ messages
2012-02-20 23:26 [PATCH 0/10] mm/memcg: per-memcg per-zone lru locking Hugh Dickins
2012-02-20 23:28 ` [PATCH 1/10] mm/memcg: scanning_global_lru means mem_cgroup_disabled Hugh Dickins
2012-02-21  8:03   ` KAMEZAWA Hiroyuki
2012-02-20 23:29 ` [PATCH 2/10] mm/memcg: move reclaim_stat into lruvec Hugh Dickins
2012-02-21  8:05   ` KAMEZAWA Hiroyuki
2012-02-20 23:30 ` [PATCH 3/10] mm/memcg: add zone pointer " Hugh Dickins
2012-02-21  8:08   ` KAMEZAWA Hiroyuki
2012-02-20 23:32 ` [PATCH 4/10] mm/memcg: apply add/del_page to lruvec Hugh Dickins
2012-02-21  8:20   ` KAMEZAWA Hiroyuki
2012-02-21 22:25     ` Hugh Dickins
2012-02-20 23:33 ` [PATCH 5/10] mm/memcg: introduce page_relock_lruvec Hugh Dickins
2012-02-21  8:38   ` KAMEZAWA Hiroyuki
2012-02-21 22:36     ` Hugh Dickins
2012-02-20 23:34 ` [PATCH 6/10] mm/memcg: take care over pc->mem_cgroup Hugh Dickins
2012-02-21  5:55   ` Konstantin Khlebnikov
2012-02-21 19:37     ` Hugh Dickins
2012-02-21 20:40       ` Konstantin Khlebnikov
2012-02-21 22:05         ` Hugh Dickins
2012-02-21  6:05   ` Konstantin Khlebnikov
2012-02-21 20:00     ` Hugh Dickins
2012-02-21  9:13   ` KAMEZAWA Hiroyuki
2012-02-21 23:03     ` Hugh Dickins
2012-02-22  4:05       ` Konstantin Khlebnikov
2012-02-20 23:35 ` [PATCH 7/10] mm/memcg: remove mem_cgroup_reset_owner Hugh Dickins
2012-02-21  9:17   ` KAMEZAWA Hiroyuki
2012-02-20 23:36 ` [PATCH 8/10] mm/memcg: nest lru_lock inside page_cgroup lock Hugh Dickins
2012-02-21  9:48   ` KAMEZAWA Hiroyuki
2012-02-20 23:38 ` [PATCH 9/10] mm/memcg: move lru_lock into lruvec Hugh Dickins
2012-02-21  7:08   ` Konstantin Khlebnikov
2012-02-21 20:12     ` Hugh Dickins
2012-02-21 21:35       ` Konstantin Khlebnikov
2012-02-21 22:12         ` Hugh Dickins
2012-02-22  3:43           ` Konstantin Khlebnikov
2012-02-22  6:09             ` Hugh Dickins
2012-02-23 14:21               ` Konstantin Khlebnikov
2012-02-20 23:39 ` [PATCH 10/10] mm/memcg: per-memcg per-zone lru locking Hugh Dickins