* [PATCH RFC 00/15] mm: memory book keeping and lru_lock splitting
@ 2012-02-15 22:57 Konstantin Khlebnikov
  2012-02-15 22:57 ` [PATCH RFC 01/15] mm: rename struct lruvec into struct book Konstantin Khlebnikov
                   ` (16 more replies)
  0 siblings, 17 replies; 31+ messages in thread
From: Konstantin Khlebnikov @ 2012-02-15 22:57 UTC (permalink / raw)
  To: linux-mm; +Cc: linux-kernel

There should be no logic changes in this patchset; it only tosses bits around.
[ This patchset is on top of some memcg cleanup/rework patches,
  which I sent to linux-mm@ today/yesterday ]

Most of the changes in this patchset are self-descriptive, so here is a brief plan:

* Transmute struct lruvec into struct book. Like a real book, this struct stores
  a set of pages for one zone. It becomes the working unit of the reclaimer code.
[ If memcg is disabled in the config, there is only one book, embedded into struct zone ]

* Move the page-lru counters to struct book (see the sketch after this list).
[ This adds extra overhead in add_page_to_lru_list()/del_page_from_lru_list() for
  the non-memcg case, but I believe it will be invisible: only one non-atomic add/sub
  in the same cacheline as the lru list ]

* Unify inactive_list_is_low_global() and clean up the reclaimer code.
* Replace struct mem_cgroup_zone with a single pointer to struct book.
* Optimize page-to-book translations: move them higher up the call stack and
  replace some struct zone arguments with struct book pointers.
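
A rough sketch of the counter move in the non-memcg fast path (this is essentially
what add_page_to_lru_list() looks like after patch 03 below):

static inline void
add_page_to_lru_list(struct zone *zone, struct page *page, enum lru_list lru)
{
	struct book *book = page_book(page);
	int numpages = hpage_nr_pages(page);

	list_add(&page->lru, &book->pages_lru[lru]);
	book->pages_count[lru] += numpages;	/* new per-book counter */
	__mod_zone_page_state(zone, NR_LRU_BASE + lru, numpages);
}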

The page -> book dereference becomes the main operation: a page (even a free one)
always points to exactly one book in its zone, so the page->flags bits may contain a
direct reference to that book. Maybe we can even replace page_zone(page) with
book_zone(page_book(page)); without mem cgroups the book -> zone dereference is a
simple container_of().
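
For reference, without mem cgroups the translation collapses to trivial inline
helpers (these are the !CONFIG_MEMORY_BOOKKEEPING definitions from patch 02 below):

static inline struct book *page_book(struct page *page)
{
	return &page_zone(page)->book;	/* the only book is embedded in struct zone */
}

static inline struct zone *book_zone(struct book *book)
{
	return container_of(book, struct zone, book);
}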

Finally, some new locking primitives appear to encapsulate the lru_lock splitting logic.
The final patch actually splits zone->lru_lock into small per-book pieces.
All of this code is currently *completely untested*, but it looks like it can already work.


After that, there are two options for managing struct book on mem-cgroup create/destroy:
a) [ currently implemented ] allocate and release via rcu.
   Thus lock_page_book() will be protected with rcu_read_lock() (sketched below).
b) allocate and never release struct book; reuse it after an rcu grace period.
   This avoids some rcu_read_lock()/rcu_read_unlock() calls on hot paths.
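
lock_page_book() itself is not quoted in this excerpt (it belongs to the locking
primitives introduced later in the series); a minimal sketch of how option a) could
look, assuming a hypothetical per-book lru_lock field, is:

static inline struct book *lock_page_book(struct page *page)
{
	struct book *book;

	rcu_read_lock();	/* option a): books are freed by rcu */
	for (;;) {
		book = page_book(page);
		spin_lock(&book->lru_lock);	/* illustrative per-book lock */
		/* recheck: the page may have moved to another book meanwhile */
		if (book == page_book(page))
			break;
		spin_unlock(&book->lru_lock);
	}
	rcu_read_unlock();
	return book;
}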


Motivation:
I wrote a similar memory controller for our rhel6-based openvz/virtuozzo kernel,
including split lru-locks and some other [patented LOL] cool stuff.
[ A general description without technical details: http://wiki.openvz.org/VSwap ]
That kernel has been in production for a long time and has been rather stable.

---

Konstantin Khlebnikov (15):
      mm: rename struct lruvec into struct book
      mm: memory bookkeeping core
      mm: add book->pages_count
      mm: unify inactive_list_is_low()
      mm: add book->reclaim_stat
      mm: kill struct mem_cgroup_zone
      mm: move page-to-book translation upper
      mm: introduce book locking primitives
      mm: handle book relocks on lumpy reclaim
      mm: handle book relocks in compaction
      mm: handle book relock in memory controller
      mm: optimize books in update_page_reclaim_stat()
      mm: optimize books in pagevec_lru_move_fn()
      mm: optimize putback for 0-order reclaim
      mm: split zone->lru_lock


 include/linux/memcontrol.h |   52 -------
 include/linux/mm_inline.h  |  222 ++++++++++++++++++++++++++++-
 include/linux/mmzone.h     |   26 ++-
 include/linux/swap.h       |    2 
 init/Kconfig               |    4 +
 mm/compaction.c            |   35 +++--
 mm/huge_memory.c           |   10 +
 mm/memcontrol.c            |  238 ++++++++++---------------------
 mm/page_alloc.c            |   20 ++-
 mm/swap.c                  |  128 ++++++-----------
 mm/vmscan.c                |  334 +++++++++++++++++++-------------------------
 11 files changed, 554 insertions(+), 517 deletions(-)

-- 
Signature


* [PATCH RFC 01/15] mm: rename struct lruvec into struct book
  2012-02-15 22:57 [PATCH RFC 00/15] mm: memory book keeping and lru_lock splitting Konstantin Khlebnikov
@ 2012-02-15 22:57 ` Konstantin Khlebnikov
  2012-02-15 22:57 ` [PATCH RFC 02/15] mm: memory bookkeeping core Konstantin Khlebnikov
                   ` (15 subsequent siblings)
  16 siblings, 0 replies; 31+ messages in thread
From: Konstantin Khlebnikov @ 2012-02-15 22:57 UTC (permalink / raw)
  To: linux-mm; +Cc: linux-kernel

This patch renames:
struct lruvec into struct book
lruvec.lists into book.pages_lru
mem_cgroup_zone_lruvec(zone, memcg) into mem_cgroup_zone_book(zone, memcg)

Signed-off-by: Konstantin Khlebnikov <khlebnikov@openvz.org>
---
 include/linux/memcontrol.h |   18 +++++++++---------
 include/linux/mm_inline.h  |    6 +++---
 include/linux/mmzone.h     |   10 +++++-----
 mm/memcontrol.c            |   38 +++++++++++++++++++-------------------
 mm/page_alloc.c            |    2 +-
 mm/swap.c                  |   12 ++++++------
 mm/vmscan.c                |   18 +++++++++---------
 7 files changed, 52 insertions(+), 52 deletions(-)

diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
index c697eda..c97fff9 100644
--- a/include/linux/memcontrol.h
+++ b/include/linux/memcontrol.h
@@ -62,12 +62,12 @@ extern void mem_cgroup_cancel_charge_swapin(struct mem_cgroup *memcg);
 extern int mem_cgroup_cache_charge(struct page *page, struct mm_struct *mm,
 					gfp_t gfp_mask);
 
-struct lruvec *mem_cgroup_zone_lruvec(struct zone *, struct mem_cgroup *);
-struct lruvec *mem_cgroup_lru_add_list(struct zone *, struct page *,
+struct book *mem_cgroup_zone_book(struct zone *, struct mem_cgroup *);
+struct book *mem_cgroup_lru_add_list(struct zone *, struct page *,
 				       enum lru_list);
 void mem_cgroup_lru_del_list(struct page *, enum lru_list);
 void mem_cgroup_lru_del(struct page *);
-struct lruvec *mem_cgroup_lru_move_lists(struct zone *, struct page *,
+struct book *mem_cgroup_lru_move_lists(struct zone *, struct page *,
 					 enum lru_list, enum lru_list);
 
 /* For coalescing uncharge for reducing memcg' overhead*/
@@ -214,17 +214,17 @@ static inline void mem_cgroup_uncharge_cache_page(struct page *page)
 {
 }
 
-static inline struct lruvec *mem_cgroup_zone_lruvec(struct zone *zone,
+static inline struct book *mem_cgroup_zone_book(struct zone *zone,
 						    struct mem_cgroup *memcg)
 {
-	return &zone->lruvec;
+	return &zone->book;
 }
 
-static inline struct lruvec *mem_cgroup_lru_add_list(struct zone *zone,
+static inline struct book *mem_cgroup_lru_add_list(struct zone *zone,
 						     struct page *page,
 						     enum lru_list lru)
 {
-	return &zone->lruvec;
+	return &zone->book;
 }
 
 static inline void mem_cgroup_lru_del_list(struct page *page, enum lru_list lru)
@@ -235,12 +235,12 @@ static inline void mem_cgroup_lru_del(struct page *page)
 {
 }
 
-static inline struct lruvec *mem_cgroup_lru_move_lists(struct zone *zone,
+static inline struct book *mem_cgroup_lru_move_lists(struct zone *zone,
 						       struct page *page,
 						       enum lru_list from,
 						       enum lru_list to)
 {
-	return &zone->lruvec;
+	return &zone->book;
 }
 
 static inline struct mem_cgroup *try_get_mem_cgroup_from_page(struct page *page)
diff --git a/include/linux/mm_inline.h b/include/linux/mm_inline.h
index 227fd3e..e0b78ca 100644
--- a/include/linux/mm_inline.h
+++ b/include/linux/mm_inline.h
@@ -24,10 +24,10 @@ static inline int page_is_file_cache(struct page *page)
 static inline void
 add_page_to_lru_list(struct zone *zone, struct page *page, enum lru_list lru)
 {
-	struct lruvec *lruvec;
+	struct book *book;
 
-	lruvec = mem_cgroup_lru_add_list(zone, page, lru);
-	list_add(&page->lru, &lruvec->lists[lru]);
+	book = mem_cgroup_lru_add_list(zone, page, lru);
+	list_add(&page->lru, &book->pages_lru[lru]);
 	__mod_zone_page_state(zone, NR_LRU_BASE + lru, hpage_nr_pages(page));
 }
 
diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h
index facbe02..0b6e5d4 100644
--- a/include/linux/mmzone.h
+++ b/include/linux/mmzone.h
@@ -159,10 +159,6 @@ static inline int is_unevictable_lru(enum lru_list lru)
 	return (lru == LRU_UNEVICTABLE);
 }
 
-struct lruvec {
-	struct list_head lists[NR_LRU_LISTS];
-};
-
 /* Mask used at gathering information at once (see memcontrol.c) */
 #define LRU_ALL_FILE (BIT(LRU_INACTIVE_FILE) | BIT(LRU_ACTIVE_FILE))
 #define LRU_ALL_ANON (BIT(LRU_INACTIVE_ANON) | BIT(LRU_ACTIVE_ANON))
@@ -300,6 +296,10 @@ struct zone_reclaim_stat {
 	unsigned long		recent_scanned[2];
 };
 
+struct book {
+	struct list_head	pages_lru[NR_LRU_LISTS];
+};
+
 struct zone {
 	/* Fields commonly accessed by the page allocator */
 
@@ -374,7 +374,7 @@ struct zone {
 
 	/* Fields commonly accessed by the page reclaim scanner */
 	spinlock_t		lru_lock;
-	struct lruvec		lruvec;
+	struct book		book;
 
 	struct zone_reclaim_stat reclaim_stat;
 
diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index 343324a..578118b 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -134,7 +134,7 @@ struct mem_cgroup_reclaim_iter {
  * per-zone information in memory controller.
  */
 struct mem_cgroup_per_zone {
-	struct lruvec		lruvec;
+	struct book		book;
 	unsigned long		count[NR_LRU_LISTS];
 
 	struct mem_cgroup_reclaim_iter reclaim_iter[DEF_PRIORITY + 1];
@@ -990,24 +990,24 @@ out:
 EXPORT_SYMBOL(mem_cgroup_count_vm_event);
 
 /**
- * mem_cgroup_zone_lruvec - get the lru list vector for a zone and memcg
- * @zone: zone of the wanted lruvec
- * @mem: memcg of the wanted lruvec
+ * mem_cgroup_zone_book - get the lru list vector for a zone and memcg
+ * @zone: zone of the wanted book
+ * @mem: memcg of the wanted book
  *
  * Returns the lru list vector holding pages for the given @zone and
- * @mem.  This can be the global zone lruvec, if the memory controller
+ * @mem.  This can be the global zone book, if the memory controller
  * is disabled.
  */
-struct lruvec *mem_cgroup_zone_lruvec(struct zone *zone,
+struct book *mem_cgroup_zone_book(struct zone *zone,
 				      struct mem_cgroup *memcg)
 {
 	struct mem_cgroup_per_zone *mz;
 
 	if (mem_cgroup_disabled())
-		return &zone->lruvec;
+		return &zone->book;
 
 	mz = mem_cgroup_zoneinfo(memcg, zone_to_nid(zone), zone_idx(zone));
-	return &mz->lruvec;
+	return &mz->book;
 }
 
 /*
@@ -1025,18 +1025,18 @@ struct lruvec *mem_cgroup_zone_lruvec(struct zone *zone,
  */
 
 /**
- * mem_cgroup_lru_add_list - account for adding an lru page and return lruvec
+ * mem_cgroup_lru_add_list - account for adding an lru page and return book
  * @zone: zone of the page
  * @page: the page
  * @lru: current lru
  *
  * This function accounts for @page being added to @lru, and returns
- * the lruvec for the given @zone and the memcg @page is charged to.
+ * the book for the given @zone and the memcg @page is charged to.
  *
  * The callsite is then responsible for physically linking the page to
- * the returned lruvec->lists[@lru].
+ * the returned book->pages_lru[@lru].
  */
-struct lruvec *mem_cgroup_lru_add_list(struct zone *zone, struct page *page,
+struct book *mem_cgroup_lru_add_list(struct zone *zone, struct page *page,
 				       enum lru_list lru)
 {
 	struct mem_cgroup_per_zone *mz;
@@ -1044,14 +1044,14 @@ struct lruvec *mem_cgroup_lru_add_list(struct zone *zone, struct page *page,
 	struct page_cgroup *pc;
 
 	if (mem_cgroup_disabled())
-		return &zone->lruvec;
+		return &zone->book;
 
 	pc = lookup_page_cgroup(page);
 	memcg = pc->mem_cgroup;
 	mz = page_cgroup_zoneinfo(memcg, page);
 	/* compound_order() is stabilized through lru_lock */
 	MEM_CGROUP_ZSTAT(mz, lru) += 1 << compound_order(page);
-	return &mz->lruvec;
+	return &mz->book;
 }
 
 /**
@@ -1095,13 +1095,13 @@ void mem_cgroup_lru_del(struct page *page)
  * @to: target lru
  *
  * This function accounts for @page being moved between the lrus @from
- * and @to, and returns the lruvec for the given @zone and the memcg
+ * and @to, and returns the book for the given @zone and the memcg
  * @page is charged to.
  *
  * The callsite is then responsible for physically relinking
- * @page->lru to the returned lruvec->lists[@to].
+ * @page->lru to the returned book->lists[@to].
  */
-struct lruvec *mem_cgroup_lru_move_lists(struct zone *zone,
+struct book *mem_cgroup_lru_move_lists(struct zone *zone,
 					 struct page *page,
 					 enum lru_list from,
 					 enum lru_list to)
@@ -3610,7 +3610,7 @@ static int mem_cgroup_force_empty_list(struct mem_cgroup *memcg,
 
 	zone = &NODE_DATA(node)->node_zones[zid];
 	mz = mem_cgroup_zoneinfo(memcg, node, zid);
-	list = &mz->lruvec.lists[lru];
+	list = &mz->book.pages_lru[lru];
 
 	loop = MEM_CGROUP_ZSTAT(mz, lru);
 	/* give some margin against EBUSY etc...*/
@@ -4739,7 +4739,7 @@ static int alloc_mem_cgroup_per_zone_info(struct mem_cgroup *memcg, int node)
 	for (zone = 0; zone < MAX_NR_ZONES; zone++) {
 		mz = &pn->zoneinfo[zone];
 		for_each_lru(l)
-			INIT_LIST_HEAD(&mz->lruvec.lists[l]);
+			INIT_LIST_HEAD(&mz->book.pages_lru[l]);
 		mz->usage_in_excess = 0;
 		mz->on_tree = false;
 		mz->mem = memcg;
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index dd4ea43..08b4e4b 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -4311,7 +4311,7 @@ static void __paginginit free_area_init_core(struct pglist_data *pgdat,
 
 		zone_pcp_init(zone);
 		for_each_lru(lru)
-			INIT_LIST_HEAD(&zone->lruvec.lists[lru]);
+			INIT_LIST_HEAD(&zone->book.pages_lru[lru]);
 		zone->reclaim_stat.recent_rotated[0] = 0;
 		zone->reclaim_stat.recent_rotated[1] = 0;
 		zone->reclaim_stat.recent_scanned[0] = 0;
diff --git a/mm/swap.c b/mm/swap.c
index fff1ff7..d7c4c8f 100644
--- a/mm/swap.c
+++ b/mm/swap.c
@@ -234,11 +234,11 @@ static void pagevec_move_tail_fn(struct page *page, void *arg)
 
 	if (PageLRU(page) && !PageActive(page) && !PageUnevictable(page)) {
 		enum lru_list lru = page_lru_base_type(page);
-		struct lruvec *lruvec;
+		struct book *book;
 
-		lruvec = mem_cgroup_lru_move_lists(page_zone(page),
+		book = mem_cgroup_lru_move_lists(page_zone(page),
 						   page, lru, lru);
-		list_move_tail(&page->lru, &lruvec->lists[lru]);
+		list_move_tail(&page->lru, &book->pages_lru[lru]);
 		(*pgmoved)++;
 	}
 }
@@ -476,13 +476,13 @@ static void lru_deactivate_fn(struct page *page, void *arg)
 		 */
 		SetPageReclaim(page);
 	} else {
-		struct lruvec *lruvec;
+		struct book *book;
 		/*
 		 * The page's writeback ends up during pagevec
 		 * We moves tha page into tail of inactive.
 		 */
-		lruvec = mem_cgroup_lru_move_lists(zone, page, lru, lru);
-		list_move_tail(&page->lru, &lruvec->lists[lru]);
+		book = mem_cgroup_lru_move_lists(zone, page, lru, lru);
+		list_move_tail(&page->lru, &book->pages_lru[lru]);
 		__count_vm_event(PGROTATED);
 	}
 
diff --git a/mm/vmscan.c b/mm/vmscan.c
index 751fab3..fba9dfd 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -1158,7 +1158,7 @@ static unsigned long isolate_lru_pages(unsigned long nr_to_scan,
 		unsigned long *nr_scanned, int order, isolate_mode_t mode,
 		int active, int file)
 {
-	struct lruvec *lruvec;
+	struct book *book;
 	struct list_head *src;
 	unsigned long nr_taken = 0;
 	unsigned long nr_lumpy_taken = 0;
@@ -1167,12 +1167,12 @@ static unsigned long isolate_lru_pages(unsigned long nr_to_scan,
 	unsigned long scan;
 	int lru = LRU_BASE;
 
-	lruvec = mem_cgroup_zone_lruvec(mz->zone, mz->mem_cgroup);
+	book = mem_cgroup_zone_book(mz->zone, mz->mem_cgroup);
 	if (active)
 		lru += LRU_ACTIVE;
 	if (file)
 		lru += LRU_FILE;
-	src = &lruvec->lists[lru];
+	src = &book->pages_lru[lru];
 
 	for (scan = 0; scan < nr_to_scan && !list_empty(src); scan++) {
 		struct page *page;
@@ -1669,15 +1669,15 @@ static void move_active_pages_to_lru(struct zone *zone,
 	}
 
 	while (!list_empty(list)) {
-		struct lruvec *lruvec;
+		struct book *book;
 
 		page = lru_to_page(list);
 
 		VM_BUG_ON(PageLRU(page));
 		SetPageLRU(page);
 
-		lruvec = mem_cgroup_lru_add_list(zone, page, lru);
-		list_move(&page->lru, &lruvec->lists[lru]);
+		book = mem_cgroup_lru_add_list(zone, page, lru);
+		list_move(&page->lru, &book->pages_lru[lru]);
 		pgmoved += hpage_nr_pages(page);
 
 		if (put_page_testzero(page)) {
@@ -3514,7 +3514,7 @@ int page_evictable(struct page *page, struct vm_area_struct *vma)
  */
 void check_move_unevictable_pages(struct page **pages, int nr_pages)
 {
-	struct lruvec *lruvec;
+	struct book *book;
 	struct zone *zone = NULL;
 	int pgscanned = 0;
 	int pgrescued = 0;
@@ -3542,9 +3542,9 @@ void check_move_unevictable_pages(struct page **pages, int nr_pages)
 			VM_BUG_ON(PageActive(page));
 			ClearPageUnevictable(page);
 			__dec_zone_state(zone, NR_UNEVICTABLE);
-			lruvec = mem_cgroup_lru_move_lists(zone, page,
+			book = mem_cgroup_lru_move_lists(zone, page,
 						LRU_UNEVICTABLE, lru);
-			list_move(&page->lru, &lruvec->lists[lru]);
+			list_move(&page->lru, &book->pages_lru[lru]);
 			__inc_zone_state(zone, NR_INACTIVE_ANON + lru);
 			pgrescued++;
 		}



* [PATCH RFC 02/15] mm: memory bookkeeping core
  2012-02-15 22:57 [PATCH RFC 00/15] mm: memory book keeping and lru_lock splitting Konstantin Khlebnikov
  2012-02-15 22:57 ` [PATCH RFC 01/15] mm: rename struct lruvec into struct book Konstantin Khlebnikov
@ 2012-02-15 22:57 ` Konstantin Khlebnikov
  2012-02-15 22:57 ` [PATCH RFC 03/15] mm: add book->pages_count Konstantin Khlebnikov
                   ` (14 subsequent siblings)
  16 siblings, 0 replies; 31+ messages in thread
From: Konstantin Khlebnikov @ 2012-02-15 22:57 UTC (permalink / raw)
  To: linux-mm; +Cc: linux-kernel

core blueprint

TODO:
* add a direct link from page to book in page->flags instead of zoneid and nodeid
* switch the page's book-id in page->flags bits atomically
  [ something like atomic_long_xor(&page->flags, old_book_id ^ new_book_id) ]
* move freed pages into the root book (zone->book)
* don't free struct book; reuse it after an rcu grace period to keep rcu iteration safe

Signed-off-by: Konstantin Khlebnikov <khlebnikov@openvz.org>
---
 include/linux/mm_inline.h |   36 ++++++++++++++++++++++++++++++++++
 include/linux/mmzone.h    |    8 ++++++++
 init/Kconfig              |    4 ++++
 mm/memcontrol.c           |   48 ++++++++++++++++++++++++++++++++++++++++++++-
 mm/page_alloc.c           |    6 ++++++
 5 files changed, 101 insertions(+), 1 deletions(-)

diff --git a/include/linux/mm_inline.h b/include/linux/mm_inline.h
index e0b78ca..6f42819 100644
--- a/include/linux/mm_inline.h
+++ b/include/linux/mm_inline.h
@@ -3,6 +3,42 @@
 
 #include <linux/huge_mm.h>
 
+#ifdef CONFIG_MEMORY_BOOKKEEPING
+
+extern struct book *page_book(struct page *book);
+
+static inline struct zone *book_zone(struct book *book)
+{
+	return book->zone;
+}
+
+static inline struct pglist_data *book_node(struct book *book)
+{
+	return book->node;
+}
+
+#else /* CONFIG_MEMORY_BOOKKEEPING */
+
+static inline struct book *page_book(struct page *page)
+{
+	return &page_zone(page)->book;
+}
+
+static inline struct zone *book_zone(struct book *book)
+{
+	return container_of(book, struct zone, book);
+}
+
+static inline struct pglist_data *book_node(struct book *book)
+{
+	return book_zone(book)->zone_pgdat;
+}
+
+#endif /* CONFIG_MEMORY_BOOKKEEPING */
+
+#define for_each_book(book, zone) \
+	list_for_each_entry_rcu(book, &zone->book_list, list)
+
 /**
  * page_is_file_cache - should the page be on a file LRU or anon LRU?
  * @page: the page to test
diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h
index 0b6e5d4..e05b003 100644
--- a/include/linux/mmzone.h
+++ b/include/linux/mmzone.h
@@ -297,7 +297,12 @@ struct zone_reclaim_stat {
 };
 
 struct book {
+#ifdef CONFIG_MEMORY_BOOKKEEPING
+	struct pglist_data	*node;
+	struct zone		*zone;
+#endif
 	struct list_head	pages_lru[NR_LRU_LISTS];
+	struct list_head	list;	/* for zone->book_list */
 };
 
 struct zone {
@@ -442,6 +447,9 @@ struct zone {
 	unsigned long		spanned_pages;	/* total size, including holes */
 	unsigned long		present_pages;	/* amount of memory (excluding holes) */
 
+	/* RCU-protected list of all books in this zone */
+	struct list_head	book_list;
+
 	/*
 	 * rarely used fields:
 	 */
diff --git a/init/Kconfig b/init/Kconfig
index ab0d680..70c3cef 100644
--- a/init/Kconfig
+++ b/init/Kconfig
@@ -651,10 +651,14 @@ config RESOURCE_COUNTERS
 	  This option enables controller independent resource accounting
 	  infrastructure that works with cgroups.
 
+config MEMORY_BOOKKEEPING
+	bool
+
 config CGROUP_MEM_RES_CTLR
 	bool "Memory Resource Controller for Control Groups"
 	depends on RESOURCE_COUNTERS
 	select MM_OWNER
+	select MEMORY_BOOKKEEPING
 	help
 	  Provides a memory resource controller that manages both anonymous
 	  memory and page cache. (See Documentation/cgroups/memory.txt)
diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index 578118b..06d946f 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -152,6 +152,7 @@ struct mem_cgroup_per_zone {
 
 struct mem_cgroup_per_node {
 	struct mem_cgroup_per_zone zoneinfo[MAX_NR_ZONES];
+	struct rcu_head		rcu_head;
 };
 
 struct mem_cgroup_lru_info {
@@ -990,6 +991,32 @@ out:
 EXPORT_SYMBOL(mem_cgroup_count_vm_event);
 
 /**
+ * page_book - get the book where this page is located
+ * @page: the struct page pointer with stable reference
+ *
+ * Returns pointer to struct book.
+ *
+ * FIXME: optimize this translation, add a direct link from page to book.
+ */
+struct book *page_book(struct page *page)
+{
+	struct mem_cgroup_per_zone *mz;
+	struct page_cgroup *pc;
+
+	if (mem_cgroup_disabled())
+		return &page_zone(page)->book;
+
+	pc = lookup_page_cgroup(page);
+	if (!PageCgroupUsed(pc))
+		return &page_zone(page)->book;
+	/* Ensure pc->mem_cgroup is visible after reading PCG_USED. */
+	smp_rmb();
+	mz = mem_cgroup_zoneinfo(pc->mem_cgroup,
+			page_to_nid(page), page_zonenum(page));
+	return &mz->book;
+}
+
+/**
  * mem_cgroup_zone_book - get the lru list vector for a zone and memcg
  * @zone: zone of the wanted book
  * @mem: memcg of the wanted book
@@ -4740,6 +4767,11 @@ static int alloc_mem_cgroup_per_zone_info(struct mem_cgroup *memcg, int node)
 		mz = &pn->zoneinfo[zone];
 		for_each_lru(l)
 			INIT_LIST_HEAD(&mz->book.pages_lru[l]);
+		mz->book.node = NODE_DATA(node);
+		mz->book.zone = &NODE_DATA(node)->node_zones[zone];
+		spin_lock(&mz->book.zone->lock);
+		list_add_tail_rcu(&mz->book.list, &mz->book.zone->book_list);
+		spin_unlock(&mz->book.zone->lock);
 		mz->usage_in_excess = 0;
 		mz->on_tree = false;
 		mz->mem = memcg;
@@ -4750,7 +4782,21 @@ static int alloc_mem_cgroup_per_zone_info(struct mem_cgroup *memcg, int node)
 
 static void free_mem_cgroup_per_zone_info(struct mem_cgroup *memcg, int node)
 {
-	kfree(memcg->info.nodeinfo[node]);
+	struct mem_cgroup_per_node *pn;
+	struct mem_cgroup_per_zone *mz;
+	int zone;
+
+	pn = memcg->info.nodeinfo[node];
+	if (!pn)
+		return;
+
+	for (zone = 0; zone < MAX_NR_ZONES; zone++) {
+		mz = mem_cgroup_zoneinfo(memcg, node, zone);
+		spin_lock(&mz->book.zone->lock);
+		list_del_rcu(&mz->book.list);
+		spin_unlock(&mz->book.zone->lock);
+	}
+	kfree_rcu(pn, rcu_head);
 }
 
 static struct mem_cgroup *mem_cgroup_alloc(void)
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 08b4e4b..ead327b 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -4312,6 +4312,12 @@ static void __paginginit free_area_init_core(struct pglist_data *pgdat,
 		zone_pcp_init(zone);
 		for_each_lru(lru)
 			INIT_LIST_HEAD(&zone->book.pages_lru[lru]);
+#ifdef CONFIG_MEMORY_BOOKKEEPING
+		zone->book.node = pgdat;
+		zone->book.zone = zone;
+#endif
+		INIT_LIST_HEAD(&zone->book_list);
+		list_add(&zone->book.list, &zone->book_list);
 		zone->reclaim_stat.recent_rotated[0] = 0;
 		zone->reclaim_stat.recent_rotated[1] = 0;
 		zone->reclaim_stat.recent_scanned[0] = 0;



* [PATCH RFC 03/15] mm: add book->pages_count
  2012-02-15 22:57 [PATCH RFC 00/15] mm: memory book keeping and lru_lock splitting Konstantin Khlebnikov
  2012-02-15 22:57 ` [PATCH RFC 01/15] mm: rename struct lruvec into struct book Konstantin Khlebnikov
  2012-02-15 22:57 ` [PATCH RFC 02/15] mm: memory bookkeeping core Konstantin Khlebnikov
@ 2012-02-15 22:57 ` Konstantin Khlebnikov
  2012-02-15 22:57 ` [PATCH RFC 04/15] mm: unify inactive_list_is_low() Konstantin Khlebnikov
                   ` (13 subsequent siblings)
  16 siblings, 0 replies; 31+ messages in thread
From: Konstantin Khlebnikov @ 2012-02-15 22:57 UTC (permalink / raw)
  To: linux-mm; +Cc: linux-kernel

Move the lru pages counter from mem_cgroup_per_zone->count[] to book->pages_count[].

Account pages in all books, including the root one;
this isn't a huge overhead, but it greatly simplifies the code.

Redundant page -> book translations will be optimized in further patches.

Signed-off-by: Konstantin Khlebnikov <khlebnikov@openvz.org>
---
 include/linux/memcontrol.h |   29 -------------
 include/linux/mm_inline.h  |   14 ++++--
 include/linux/mmzone.h     |    2 +
 mm/memcontrol.c            |   98 ++------------------------------------------
 mm/page_alloc.c            |    4 +-
 mm/swap.c                  |    7 +--
 mm/vmscan.c                |   21 +++++++--
 7 files changed, 36 insertions(+), 139 deletions(-)

diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
index c97fff9..4183753 100644
--- a/include/linux/memcontrol.h
+++ b/include/linux/memcontrol.h
@@ -63,12 +63,6 @@ extern int mem_cgroup_cache_charge(struct page *page, struct mm_struct *mm,
 					gfp_t gfp_mask);
 
 struct book *mem_cgroup_zone_book(struct zone *, struct mem_cgroup *);
-struct book *mem_cgroup_lru_add_list(struct zone *, struct page *,
-				       enum lru_list);
-void mem_cgroup_lru_del_list(struct page *, enum lru_list);
-void mem_cgroup_lru_del(struct page *);
-struct book *mem_cgroup_lru_move_lists(struct zone *, struct page *,
-					 enum lru_list, enum lru_list);
 
 /* For coalescing uncharge for reducing memcg' overhead*/
 extern void mem_cgroup_uncharge_start(void);
@@ -220,29 +214,6 @@ static inline struct book *mem_cgroup_zone_book(struct zone *zone,
 	return &zone->book;
 }
 
-static inline struct book *mem_cgroup_lru_add_list(struct zone *zone,
-						     struct page *page,
-						     enum lru_list lru)
-{
-	return &zone->book;
-}
-
-static inline void mem_cgroup_lru_del_list(struct page *page, enum lru_list lru)
-{
-}
-
-static inline void mem_cgroup_lru_del(struct page *page)
-{
-}
-
-static inline struct book *mem_cgroup_lru_move_lists(struct zone *zone,
-						       struct page *page,
-						       enum lru_list from,
-						       enum lru_list to)
-{
-	return &zone->book;
-}
-
 static inline struct mem_cgroup *try_get_mem_cgroup_from_page(struct page *page)
 {
 	return NULL;
diff --git a/include/linux/mm_inline.h b/include/linux/mm_inline.h
index 6f42819..9c484c0 100644
--- a/include/linux/mm_inline.h
+++ b/include/linux/mm_inline.h
@@ -60,19 +60,23 @@ static inline int page_is_file_cache(struct page *page)
 static inline void
 add_page_to_lru_list(struct zone *zone, struct page *page, enum lru_list lru)
 {
-	struct book *book;
+	struct book *book = page_book(page);
+	int numpages = hpage_nr_pages(page);
 
-	book = mem_cgroup_lru_add_list(zone, page, lru);
 	list_add(&page->lru, &book->pages_lru[lru]);
-	__mod_zone_page_state(zone, NR_LRU_BASE + lru, hpage_nr_pages(page));
+	book->pages_count[lru] += numpages;
+	__mod_zone_page_state(zone, NR_LRU_BASE + lru, numpages);
 }
 
 static inline void
 del_page_from_lru_list(struct zone *zone, struct page *page, enum lru_list lru)
 {
-	mem_cgroup_lru_del_list(page, lru);
+	struct book *book = page_book(page);
+	int numpages = hpage_nr_pages(page);
+
 	list_del(&page->lru);
-	__mod_zone_page_state(zone, NR_LRU_BASE + lru, -hpage_nr_pages(page));
+	book->pages_count[lru] -= numpages;
+	__mod_zone_page_state(zone, NR_LRU_BASE + lru, -numpages);
 }
 
 /**
diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h
index e05b003..ef4b984 100644
--- a/include/linux/mmzone.h
+++ b/include/linux/mmzone.h
@@ -302,6 +302,8 @@ struct book {
 	struct zone		*zone;
 #endif
 	struct list_head	pages_lru[NR_LRU_LISTS];
+	unsigned long		pages_count[NR_LRU_LISTS];
+
 	struct list_head	list;	/* for zone->book_list */
 };
 
diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index 06d946f..8e1765a 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -135,7 +135,6 @@ struct mem_cgroup_reclaim_iter {
  */
 struct mem_cgroup_per_zone {
 	struct book		book;
-	unsigned long		count[NR_LRU_LISTS];
 
 	struct mem_cgroup_reclaim_iter reclaim_iter[DEF_PRIORITY + 1];
 
@@ -147,8 +146,6 @@ struct mem_cgroup_per_zone {
 	struct mem_cgroup	*mem;		/* Back pointer, we cannot */
 						/* use container_of	   */
 };
-/* Macro for accessing counter */
-#define MEM_CGROUP_ZSTAT(mz, idx)	((mz)->count[(idx)])
 
 struct mem_cgroup_per_node {
 	struct mem_cgroup_per_zone zoneinfo[MAX_NR_ZONES];
@@ -715,7 +712,7 @@ mem_cgroup_zone_nr_lru_pages(struct mem_cgroup *memcg, int nid, int zid,
 
 	for_each_lru(l) {
 		if (BIT(l) & lru_mask)
-			ret += MEM_CGROUP_ZSTAT(mz, l);
+			ret += mz->book.pages_count[l];
 	}
 	return ret;
 }
@@ -1051,93 +1048,6 @@ struct book *mem_cgroup_zone_book(struct zone *zone,
  * When moving account, the page is not on LRU. It's isolated.
  */
 
-/**
- * mem_cgroup_lru_add_list - account for adding an lru page and return book
- * @zone: zone of the page
- * @page: the page
- * @lru: current lru
- *
- * This function accounts for @page being added to @lru, and returns
- * the book for the given @zone and the memcg @page is charged to.
- *
- * The callsite is then responsible for physically linking the page to
- * the returned book->pages_lru[@lru].
- */
-struct book *mem_cgroup_lru_add_list(struct zone *zone, struct page *page,
-				       enum lru_list lru)
-{
-	struct mem_cgroup_per_zone *mz;
-	struct mem_cgroup *memcg;
-	struct page_cgroup *pc;
-
-	if (mem_cgroup_disabled())
-		return &zone->book;
-
-	pc = lookup_page_cgroup(page);
-	memcg = pc->mem_cgroup;
-	mz = page_cgroup_zoneinfo(memcg, page);
-	/* compound_order() is stabilized through lru_lock */
-	MEM_CGROUP_ZSTAT(mz, lru) += 1 << compound_order(page);
-	return &mz->book;
-}
-
-/**
- * mem_cgroup_lru_del_list - account for removing an lru page
- * @page: the page
- * @lru: target lru
- *
- * This function accounts for @page being removed from @lru.
- *
- * The callsite is then responsible for physically unlinking
- * @page->lru.
- */
-void mem_cgroup_lru_del_list(struct page *page, enum lru_list lru)
-{
-	struct mem_cgroup_per_zone *mz;
-	struct mem_cgroup *memcg;
-	struct page_cgroup *pc;
-
-	if (mem_cgroup_disabled())
-		return;
-
-	pc = lookup_page_cgroup(page);
-	memcg = pc->mem_cgroup;
-	VM_BUG_ON(!memcg);
-	mz = page_cgroup_zoneinfo(memcg, page);
-	/* huge page split is done under lru_lock. so, we have no races. */
-	VM_BUG_ON(MEM_CGROUP_ZSTAT(mz, lru) < (1 << compound_order(page)));
-	MEM_CGROUP_ZSTAT(mz, lru) -= 1 << compound_order(page);
-}
-
-void mem_cgroup_lru_del(struct page *page)
-{
-	mem_cgroup_lru_del_list(page, page_lru(page));
-}
-
-/**
- * mem_cgroup_lru_move_lists - account for moving a page between lrus
- * @zone: zone of the page
- * @page: the page
- * @from: current lru
- * @to: target lru
- *
- * This function accounts for @page being moved between the lrus @from
- * and @to, and returns the book for the given @zone and the memcg
- * @page is charged to.
- *
- * The callsite is then responsible for physically relinking
- * @page->lru to the returned book->lists[@to].
- */
-struct book *mem_cgroup_lru_move_lists(struct zone *zone,
-					 struct page *page,
-					 enum lru_list from,
-					 enum lru_list to)
-{
-	/* XXX: Optimize this, especially for @from == @to */
-	mem_cgroup_lru_del_list(page, from);
-	return mem_cgroup_lru_add_list(zone, page, to);
-}
-
 /*
  * Checks whether given mem is same or in the root_mem_cgroup's
  * hierarchy subtree
@@ -3639,7 +3549,7 @@ static int mem_cgroup_force_empty_list(struct mem_cgroup *memcg,
 	mz = mem_cgroup_zoneinfo(memcg, node, zid);
 	list = &mz->book.pages_lru[lru];
 
-	loop = MEM_CGROUP_ZSTAT(mz, lru);
+	loop = mz->book.pages_count[lru];
 	/* give some margin against EBUSY etc...*/
 	loop += 256;
 	busy = NULL;
@@ -4765,8 +4675,10 @@ static int alloc_mem_cgroup_per_zone_info(struct mem_cgroup *memcg, int node)
 
 	for (zone = 0; zone < MAX_NR_ZONES; zone++) {
 		mz = &pn->zoneinfo[zone];
-		for_each_lru(l)
+		for_each_lru(l) {
 			INIT_LIST_HEAD(&mz->book.pages_lru[l]);
+			mz->book.pages_count[l] = 0;
+		}
 		mz->book.node = NODE_DATA(node);
 		mz->book.zone = &NODE_DATA(node)->node_zones[zone];
 		spin_lock(&mz->book.zone->lock);
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index ead327b..c62a1d2 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -4310,8 +4310,10 @@ static void __paginginit free_area_init_core(struct pglist_data *pgdat,
 		zone->zone_pgdat = pgdat;
 
 		zone_pcp_init(zone);
-		for_each_lru(lru)
+		for_each_lru(lru) {
 			INIT_LIST_HEAD(&zone->book.pages_lru[lru]);
+			zone->book.pages_count[lru] = 0;
+		}
 #ifdef CONFIG_MEMORY_BOOKKEEPING
 		zone->book.node = pgdat;
 		zone->book.zone = zone;
diff --git a/mm/swap.c b/mm/swap.c
index d7c4c8f..ba29c3c 100644
--- a/mm/swap.c
+++ b/mm/swap.c
@@ -234,10 +234,8 @@ static void pagevec_move_tail_fn(struct page *page, void *arg)
 
 	if (PageLRU(page) && !PageActive(page) && !PageUnevictable(page)) {
 		enum lru_list lru = page_lru_base_type(page);
-		struct book *book;
+		struct book *book = page_book(page);
 
-		book = mem_cgroup_lru_move_lists(page_zone(page),
-						   page, lru, lru);
 		list_move_tail(&page->lru, &book->pages_lru[lru]);
 		(*pgmoved)++;
 	}
@@ -476,12 +474,11 @@ static void lru_deactivate_fn(struct page *page, void *arg)
 		 */
 		SetPageReclaim(page);
 	} else {
-		struct book *book;
+		struct book *book = page_book(page);
 		/*
 		 * The page's writeback ends up during pagevec
 		 * We moves tha page into tail of inactive.
 		 */
-		book = mem_cgroup_lru_move_lists(zone, page, lru, lru);
 		list_move_tail(&page->lru, &book->pages_lru[lru]);
 		__count_vm_event(PGROTATED);
 	}
diff --git a/mm/vmscan.c b/mm/vmscan.c
index fba9dfd..eddf617 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -1188,7 +1188,6 @@ static unsigned long isolate_lru_pages(unsigned long nr_to_scan,
 
 		switch (__isolate_lru_page(page, mode, file)) {
 		case 0:
-			mem_cgroup_lru_del(page);
 			list_move(&page->lru, dst);
 			nr_taken += hpage_nr_pages(page);
 			break;
@@ -1246,10 +1245,14 @@ static unsigned long isolate_lru_pages(unsigned long nr_to_scan,
 
 			if (__isolate_lru_page(cursor_page, mode, file) == 0) {
 				unsigned int isolated_pages;
+				struct book *cursor_book;
+				int cursor_lru = page_lru(cursor_page);
 
-				mem_cgroup_lru_del(cursor_page);
 				list_move(&cursor_page->lru, dst);
 				isolated_pages = hpage_nr_pages(cursor_page);
+				cursor_book = page_book(cursor_page);
+				cursor_book->pages_count[cursor_lru] -=
+								isolated_pages;
 				nr_taken += isolated_pages;
 				nr_lumpy_taken += isolated_pages;
 				if (PageDirty(cursor_page))
@@ -1281,6 +1284,8 @@ static unsigned long isolate_lru_pages(unsigned long nr_to_scan,
 			nr_lumpy_failed++;
 	}
 
+	book->pages_count[lru] -= nr_taken - nr_lumpy_taken;
+
 	*nr_scanned = scan;
 
 	trace_mm_vmscan_lru_isolate(order,
@@ -1670,15 +1675,18 @@ static void move_active_pages_to_lru(struct zone *zone,
 
 	while (!list_empty(list)) {
 		struct book *book;
+		int numpages;
 
 		page = lru_to_page(list);
 
 		VM_BUG_ON(PageLRU(page));
 		SetPageLRU(page);
 
-		book = mem_cgroup_lru_add_list(zone, page, lru);
+		book = page_book(page);
 		list_move(&page->lru, &book->pages_lru[lru]);
-		pgmoved += hpage_nr_pages(page);
+		numpages = hpage_nr_pages(page);
+		book->pages_count[lru] += numpages;
+		pgmoved += numpages;
 
 		if (put_page_testzero(page)) {
 			__ClearPageLRU(page);
@@ -3542,8 +3550,9 @@ void check_move_unevictable_pages(struct page **pages, int nr_pages)
 			VM_BUG_ON(PageActive(page));
 			ClearPageUnevictable(page);
 			__dec_zone_state(zone, NR_UNEVICTABLE);
-			book = mem_cgroup_lru_move_lists(zone, page,
-						LRU_UNEVICTABLE, lru);
+			book = page_book(page);
+			book->pages_count[LRU_UNEVICTABLE]--;
+			book->pages_count[lru]++;
 			list_move(&page->lru, &book->pages_lru[lru]);
 			__inc_zone_state(zone, NR_INACTIVE_ANON + lru);
 			pgrescued++;



* [PATCH RFC 04/15] mm: unify inactive_list_is_low()
  2012-02-15 22:57 [PATCH RFC 00/15] mm: memory book keeping and lru_lock splitting Konstantin Khlebnikov
                   ` (2 preceding siblings ...)
  2012-02-15 22:57 ` [PATCH RFC 03/15] mm: add book->pages_count Konstantin Khlebnikov
@ 2012-02-15 22:57 ` Konstantin Khlebnikov
  2012-02-15 22:57 ` [PATCH RFC 05/15] mm: add book->reclaim_stat Konstantin Khlebnikov
                   ` (12 subsequent siblings)
  16 siblings, 0 replies; 31+ messages in thread
From: Konstantin Khlebnikov @ 2012-02-15 22:57 UTC (permalink / raw)
  To: linux-mm; +Cc: linux-kernel

Unify the memcg and non-memcg logic: always use the exact counters from struct book.

Signed-off-by: Konstantin Khlebnikov <khlebnikov@openvz.org>
---
 mm/vmscan.c |   30 ++++++++----------------------
 1 files changed, 8 insertions(+), 22 deletions(-)

diff --git a/mm/vmscan.c b/mm/vmscan.c
index eddf617..61ffc8a 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -1815,6 +1815,7 @@ static int inactive_anon_is_low(struct mem_cgroup_zone *mz,
 {
 	unsigned long active, inactive;
 	unsigned int ratio;
+	struct book *book;
 
 	/*
 	 * If we don't have swap space, anonymous page deactivation
@@ -1828,17 +1829,9 @@ static int inactive_anon_is_low(struct mem_cgroup_zone *mz,
 	else
 		ratio = mem_cgroup_inactive_ratio(sc->target_mem_cgroup);
 
-	if (scanning_global_lru(mz)) {
-		active = zone_page_state(mz->zone, NR_ACTIVE_ANON);
-		inactive = zone_page_state(mz->zone, NR_INACTIVE_ANON);
-	} else {
-		active = mem_cgroup_zone_nr_lru_pages(mz->mem_cgroup,
-				zone_to_nid(mz->zone), zone_idx(mz->zone),
-				BIT(LRU_ACTIVE_ANON));
-		inactive = mem_cgroup_zone_nr_lru_pages(mz->mem_cgroup,
-				zone_to_nid(mz->zone), zone_idx(mz->zone),
-				BIT(LRU_INACTIVE_ANON));
-	}
+	book = mem_cgroup_zone_book(mz->zone, mz->mem_cgroup);
+	active = book->pages_count[NR_ACTIVE_ANON];
+	inactive = book->pages_count[NR_INACTIVE_ANON];
 
 	return inactive * ratio < active;
 }
@@ -1867,18 +1860,11 @@ static inline int inactive_anon_is_low(struct mem_cgroup_zone *mz,
 static int inactive_file_is_low(struct mem_cgroup_zone *mz)
 {
 	unsigned long active, inactive;
+	struct book *book;
 
-	if (scanning_global_lru(mz)) {
-		active = zone_page_state(mz->zone, NR_ACTIVE_FILE);
-		inactive = zone_page_state(mz->zone, NR_INACTIVE_FILE);
-	} else {
-		active = mem_cgroup_zone_nr_lru_pages(mz->mem_cgroup,
-				zone_to_nid(mz->zone), zone_idx(mz->zone),
-				BIT(LRU_ACTIVE_FILE));
-		inactive = mem_cgroup_zone_nr_lru_pages(mz->mem_cgroup,
-				zone_to_nid(mz->zone), zone_idx(mz->zone),
-				BIT(LRU_INACTIVE_FILE));
-	}
+	book = mem_cgroup_zone_book(mz->zone, mz->mem_cgroup);
+	active = book->pages_count[NR_ACTIVE_FILE];
+	inactive = book->pages_count[NR_INACTIVE_FILE];
 
 	return inactive < active;
 }



* [PATCH RFC 05/15] mm: add book->reclaim_stat
  2012-02-15 22:57 [PATCH RFC 00/15] mm: memory book keeping and lru_lock splitting Konstantin Khlebnikov
                   ` (3 preceding siblings ...)
  2012-02-15 22:57 ` [PATCH RFC 04/15] mm: unify inactive_list_is_low() Konstantin Khlebnikov
@ 2012-02-15 22:57 ` Konstantin Khlebnikov
  2012-02-15 22:57 ` [PATCH RFC 06/15] mm: kill struct mem_cgroup_zone Konstantin Khlebnikov
                   ` (11 subsequent siblings)
  16 siblings, 0 replies; 31+ messages in thread
From: Konstantin Khlebnikov @ 2012-02-15 22:57 UTC (permalink / raw)
  To: linux-mm; +Cc: linux-kernel

Merge the memcg and non-memcg reclaim stats, so only one of them needs to be updated.
Move zone->reclaim_stat and mem_cgroup_per_zone->reclaim_stat to struct book.

struct book will become the operating unit of the reclaimer logic,
so this is the perfect place for these counters.

Signed-off-by: Konstantin Khlebnikov <khlebnikov@openvz.org>
---
 include/linux/memcontrol.h |   17 --------------
 include/linux/mmzone.h     |    4 ++-
 mm/memcontrol.c            |   52 +++++---------------------------------------
 mm/page_alloc.c            |    6 ++---
 mm/swap.c                  |   12 ++--------
 mm/vmscan.c                |    5 +++-
 6 files changed, 15 insertions(+), 81 deletions(-)

diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
index 4183753..78d0fcd 100644
--- a/include/linux/memcontrol.h
+++ b/include/linux/memcontrol.h
@@ -111,10 +111,6 @@ unsigned int mem_cgroup_inactive_ratio(struct mem_cgroup *memcg);
 int mem_cgroup_select_victim_node(struct mem_cgroup *memcg);
 unsigned long mem_cgroup_zone_nr_lru_pages(struct mem_cgroup *memcg,
 					int nid, int zid, unsigned int lrumask);
-struct zone_reclaim_stat *mem_cgroup_get_reclaim_stat(struct mem_cgroup *memcg,
-						      struct zone *zone);
-struct zone_reclaim_stat*
-mem_cgroup_get_reclaim_stat_from_page(struct page *page);
 extern void mem_cgroup_print_oom_info(struct mem_cgroup *memcg,
 					struct task_struct *p);
 extern void mem_cgroup_replace_page_cache(struct page *oldpage,
@@ -284,19 +280,6 @@ mem_cgroup_zone_nr_lru_pages(struct mem_cgroup *memcg, int nid, int zid,
 	return 0;
 }
 
-
-static inline struct zone_reclaim_stat*
-mem_cgroup_get_reclaim_stat(struct mem_cgroup *memcg, struct zone *zone)
-{
-	return NULL;
-}
-
-static inline struct zone_reclaim_stat*
-mem_cgroup_get_reclaim_stat_from_page(struct page *page)
-{
-	return NULL;
-}
-
 static inline void
 mem_cgroup_print_oom_info(struct mem_cgroup *memcg, struct task_struct *p)
 {
diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h
index ef4b984..5bcd5b1 100644
--- a/include/linux/mmzone.h
+++ b/include/linux/mmzone.h
@@ -304,6 +304,8 @@ struct book {
 	struct list_head	pages_lru[NR_LRU_LISTS];
 	unsigned long		pages_count[NR_LRU_LISTS];
 
+	struct zone_reclaim_stat	reclaim_stat;
+
 	struct list_head	list;	/* for zone->book_list */
 };
 
@@ -383,8 +385,6 @@ struct zone {
 	spinlock_t		lru_lock;
 	struct book		book;
 
-	struct zone_reclaim_stat reclaim_stat;
-
 	unsigned long		pages_scanned;	   /* since last reclaim */
 	unsigned long		flags;		   /* zone flags, see below */
 
diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index 8e1765a..ff82d6e 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -138,7 +138,6 @@ struct mem_cgroup_per_zone {
 
 	struct mem_cgroup_reclaim_iter reclaim_iter[DEF_PRIORITY + 1];
 
-	struct zone_reclaim_stat reclaim_stat;
 	struct rb_node		tree_node;	/* RB tree node */
 	unsigned long long	usage_in_excess;/* Set to the value by which */
 						/* the soft limit is exceeded*/
@@ -448,15 +447,6 @@ struct cgroup_subsys_state *mem_cgroup_css(struct mem_cgroup *memcg)
 	return &memcg->css;
 }
 
-static struct mem_cgroup_per_zone *
-page_cgroup_zoneinfo(struct mem_cgroup *memcg, struct page *page)
-{
-	int nid = page_to_nid(page);
-	int zid = page_zonenum(page);
-
-	return mem_cgroup_zoneinfo(memcg, nid, zid);
-}
-
 static struct mem_cgroup_tree_per_zone *
 soft_limit_tree_node_zone(int nid, int zid)
 {
@@ -1098,34 +1088,6 @@ int task_in_mem_cgroup(struct task_struct *task, const struct mem_cgroup *memcg)
 	return ret;
 }
 
-struct zone_reclaim_stat *mem_cgroup_get_reclaim_stat(struct mem_cgroup *memcg,
-						      struct zone *zone)
-{
-	int nid = zone_to_nid(zone);
-	int zid = zone_idx(zone);
-	struct mem_cgroup_per_zone *mz = mem_cgroup_zoneinfo(memcg, nid, zid);
-
-	return &mz->reclaim_stat;
-}
-
-struct zone_reclaim_stat *
-mem_cgroup_get_reclaim_stat_from_page(struct page *page)
-{
-	struct page_cgroup *pc;
-	struct mem_cgroup_per_zone *mz;
-
-	if (mem_cgroup_disabled())
-		return NULL;
-
-	pc = lookup_page_cgroup(page);
-	if (!PageCgroupUsed(pc))
-		return NULL;
-	/* Ensure pc->mem_cgroup is visible after reading PCG_USED. */
-	smp_rmb();
-	mz = page_cgroup_zoneinfo(pc->mem_cgroup, page);
-	return &mz->reclaim_stat;
-}
-
 #define mem_cgroup_from_res_counter(counter, member)	\
 	container_of(counter, struct mem_cgroup, member)
 
@@ -4098,21 +4060,19 @@ static int mem_control_stat_show(struct cgroup *cont, struct cftype *cft,
 	{
 		int nid, zid;
 		struct mem_cgroup_per_zone *mz;
+		struct zone_reclaim_stat *rs;
 		unsigned long recent_rotated[2] = {0, 0};
 		unsigned long recent_scanned[2] = {0, 0};
 
 		for_each_online_node(nid)
 			for (zid = 0; zid < MAX_NR_ZONES; zid++) {
 				mz = mem_cgroup_zoneinfo(mem_cont, nid, zid);
+				rs = &mz->book.reclaim_stat;
 
-				recent_rotated[0] +=
-					mz->reclaim_stat.recent_rotated[0];
-				recent_rotated[1] +=
-					mz->reclaim_stat.recent_rotated[1];
-				recent_scanned[0] +=
-					mz->reclaim_stat.recent_scanned[0];
-				recent_scanned[1] +=
-					mz->reclaim_stat.recent_scanned[1];
+				recent_rotated[0] += rs->recent_rotated[0];
+				recent_rotated[1] += rs->recent_rotated[1];
+				recent_scanned[0] += rs->recent_scanned[0];
+				recent_scanned[1] += rs->recent_scanned[1];
 			}
 		cb->fill(cb, "recent_rotated_anon", recent_rotated[0]);
 		cb->fill(cb, "recent_rotated_file", recent_rotated[1]);
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index c62a1d2..a2df69e 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -4320,10 +4320,8 @@ static void __paginginit free_area_init_core(struct pglist_data *pgdat,
 #endif
 		INIT_LIST_HEAD(&zone->book_list);
 		list_add(&zone->book.list, &zone->book_list);
-		zone->reclaim_stat.recent_rotated[0] = 0;
-		zone->reclaim_stat.recent_rotated[1] = 0;
-		zone->reclaim_stat.recent_scanned[0] = 0;
-		zone->reclaim_stat.recent_scanned[1] = 0;
+		memset(&zone->book.reclaim_stat, 0,
+				sizeof(struct zone_reclaim_stat));
 		zap_zone_vm_stats(zone);
 		zone->flags = 0;
 		if (!size)
diff --git a/mm/swap.c b/mm/swap.c
index ba29c3c..2268ee7 100644
--- a/mm/swap.c
+++ b/mm/swap.c
@@ -277,21 +277,13 @@ void rotate_reclaimable_page(struct page *page)
 static void update_page_reclaim_stat(struct zone *zone, struct page *page,
 				     int file, int rotated)
 {
-	struct zone_reclaim_stat *reclaim_stat = &zone->reclaim_stat;
-	struct zone_reclaim_stat *memcg_reclaim_stat;
+	struct zone_reclaim_stat *reclaim_stat;
 
-	memcg_reclaim_stat = mem_cgroup_get_reclaim_stat_from_page(page);
+	reclaim_stat = &page_book(page)->reclaim_stat;
 
 	reclaim_stat->recent_scanned[file]++;
 	if (rotated)
 		reclaim_stat->recent_rotated[file]++;
-
-	if (!memcg_reclaim_stat)
-		return;
-
-	memcg_reclaim_stat->recent_scanned[file]++;
-	if (rotated)
-		memcg_reclaim_stat->recent_rotated[file]++;
 }
 
 static void __activate_page(struct page *page, void *arg)
diff --git a/mm/vmscan.c b/mm/vmscan.c
index 61ffc8a..ba95e83 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -190,9 +190,10 @@ static bool scanning_global_lru(struct mem_cgroup_zone *mz)
 static struct zone_reclaim_stat *get_reclaim_stat(struct mem_cgroup_zone *mz)
 {
 	if (!scanning_global_lru(mz))
-		return mem_cgroup_get_reclaim_stat(mz->mem_cgroup, mz->zone);
+		return &mem_cgroup_zone_book(mz->zone,
+				mz->mem_cgroup)->reclaim_stat;
 
-	return &mz->zone->reclaim_stat;
+	return &mz->zone->book.reclaim_stat;
 }
 
 static unsigned long zone_nr_lru_pages(struct mem_cgroup_zone *mz,



* [PATCH RFC 06/15] mm: kill struct mem_cgroup_zone
  2012-02-15 22:57 [PATCH RFC 00/15] mm: memory book keeping and lru_lock splitting Konstantin Khlebnikov
                   ` (4 preceding siblings ...)
  2012-02-15 22:57 ` [PATCH RFC 05/15] mm: add book->reclaim_stat Konstantin Khlebnikov
@ 2012-02-15 22:57 ` Konstantin Khlebnikov
  2012-02-15 22:57 ` [PATCH RFC 07/15] mm: move page-to-book translation upper Konstantin Khlebnikov
                   ` (10 subsequent siblings)
  16 siblings, 0 replies; 31+ messages in thread
From: Konstantin Khlebnikov @ 2012-02-15 22:57 UTC (permalink / raw)
  To: linux-mm; +Cc: linux-kernel

Now struct mem_cgroup_zone always refers to exactly one book, either the root
zone->book or one from a memcg, so this fancy intermediate structure can be
replaced with a direct pointer to struct book.

Signed-off-by: Konstantin Khlebnikov <khlebnikov@openvz.org>
---
 mm/vmscan.c |  181 ++++++++++++++++++++++-------------------------------------
 1 files changed, 66 insertions(+), 115 deletions(-)

diff --git a/mm/vmscan.c b/mm/vmscan.c
index ba95e83..c59d4f7 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -121,11 +121,6 @@ struct scan_control {
 	nodemask_t	*nodemask;
 };
 
-struct mem_cgroup_zone {
-	struct mem_cgroup *mem_cgroup;
-	struct zone *zone;
-};
-
 #define lru_to_page(_head) (list_entry((_head)->prev, struct page, lru))
 
 #ifdef ARCH_HAS_PREFETCH
@@ -170,45 +165,13 @@ static bool global_reclaim(struct scan_control *sc)
 {
 	return !sc->target_mem_cgroup;
 }
-
-static bool scanning_global_lru(struct mem_cgroup_zone *mz)
-{
-	return !mz->mem_cgroup;
-}
 #else
 static bool global_reclaim(struct scan_control *sc)
 {
 	return true;
 }
-
-static bool scanning_global_lru(struct mem_cgroup_zone *mz)
-{
-	return true;
-}
 #endif
 
-static struct zone_reclaim_stat *get_reclaim_stat(struct mem_cgroup_zone *mz)
-{
-	if (!scanning_global_lru(mz))
-		return &mem_cgroup_zone_book(mz->zone,
-				mz->mem_cgroup)->reclaim_stat;
-
-	return &mz->zone->book.reclaim_stat;
-}
-
-static unsigned long zone_nr_lru_pages(struct mem_cgroup_zone *mz,
-				       enum lru_list lru)
-{
-	if (!scanning_global_lru(mz))
-		return mem_cgroup_zone_nr_lru_pages(mz->mem_cgroup,
-						    zone_to_nid(mz->zone),
-						    zone_idx(mz->zone),
-						    BIT(lru));
-
-	return zone_page_state(mz->zone, NR_LRU_BASE + lru);
-}
-
-
 /*
  * Add a shrinker callback to be called from the vm
  */
@@ -772,7 +735,7 @@ static enum page_references page_check_references(struct page *page,
  * shrink_page_list() returns the number of reclaimed pages
  */
 static unsigned long shrink_page_list(struct list_head *page_list,
-				      struct mem_cgroup_zone *mz,
+				      struct book *book,
 				      struct scan_control *sc,
 				      int priority,
 				      unsigned long *ret_nr_dirty,
@@ -803,7 +766,7 @@ static unsigned long shrink_page_list(struct list_head *page_list,
 			goto keep;
 
 		VM_BUG_ON(PageActive(page));
-		VM_BUG_ON(page_zone(page) != mz->zone);
+		VM_BUG_ON(page_zone(page) != book_zone(book));
 
 		sc->nr_scanned++;
 
@@ -1029,7 +992,7 @@ keep_lumpy:
 	 * will encounter the same problem
 	 */
 	if (nr_dirty && nr_dirty == nr_congested && global_reclaim(sc))
-		zone_set_flag(mz->zone, ZONE_CONGESTED);
+		zone_set_flag(book_zone(book), ZONE_CONGESTED);
 
 	free_hot_cold_page_list(&free_pages, 1);
 
@@ -1144,7 +1107,7 @@ int __isolate_lru_page(struct page *page, isolate_mode_t mode, int file)
  * Appropriate locks must be held before calling this function.
  *
  * @nr_to_scan:	The number of pages to look through on the list.
- * @mz:		The mem_cgroup_zone to pull pages from.
+ * @book:	The struct book to pull pages from.
  * @dst:	The temp list to put pages on to.
  * @nr_scanned:	The number of pages that were scanned.
  * @order:	The caller's attempted allocation order
@@ -1155,11 +1118,10 @@ int __isolate_lru_page(struct page *page, isolate_mode_t mode, int file)
  * returns how many pages were moved onto *@dst.
  */
 static unsigned long isolate_lru_pages(unsigned long nr_to_scan,
-		struct mem_cgroup_zone *mz, struct list_head *dst,
+		struct book *book, struct list_head *dst,
 		unsigned long *nr_scanned, int order, isolate_mode_t mode,
 		int active, int file)
 {
-	struct book *book;
 	struct list_head *src;
 	unsigned long nr_taken = 0;
 	unsigned long nr_lumpy_taken = 0;
@@ -1168,7 +1130,6 @@ static unsigned long isolate_lru_pages(unsigned long nr_to_scan,
 	unsigned long scan;
 	int lru = LRU_BASE;
 
-	book = mem_cgroup_zone_book(mz->zone, mz->mem_cgroup);
 	if (active)
 		lru += LRU_ACTIVE;
 	if (file)
@@ -1376,11 +1337,11 @@ static int too_many_isolated(struct zone *zone, int file,
 }
 
 static noinline_for_stack void
-putback_inactive_pages(struct mem_cgroup_zone *mz,
+putback_inactive_pages(struct book *book,
 		       struct list_head *page_list)
 {
-	struct zone_reclaim_stat *reclaim_stat = get_reclaim_stat(mz);
-	struct zone *zone = mz->zone;
+	struct zone_reclaim_stat *reclaim_stat = &book->reclaim_stat;
+	struct zone *zone = book_zone(book);
 	LIST_HEAD(pages_to_free);
 
 	/*
@@ -1427,13 +1388,13 @@ putback_inactive_pages(struct mem_cgroup_zone *mz,
 }
 
 static noinline_for_stack void
-update_isolated_counts(struct mem_cgroup_zone *mz,
+update_isolated_counts(struct book *book,
 		       struct list_head *page_list,
 		       unsigned long *nr_anon,
 		       unsigned long *nr_file)
 {
-	struct zone_reclaim_stat *reclaim_stat = get_reclaim_stat(mz);
-	struct zone *zone = mz->zone;
+	struct zone_reclaim_stat *reclaim_stat = &book->reclaim_stat;
+	struct zone *zone = book_zone(book);
 	unsigned int count[NR_LRU_LISTS] = { 0, };
 	unsigned long nr_active = 0;
 	struct page *page;
@@ -1517,7 +1478,7 @@ static inline bool should_reclaim_stall(unsigned long nr_taken,
  * of reclaimed pages
  */
 static noinline_for_stack unsigned long
-shrink_inactive_list(unsigned long nr_to_scan, struct mem_cgroup_zone *mz,
+shrink_inactive_list(unsigned long nr_to_scan, struct book *book,
 		     struct scan_control *sc, int priority, int file)
 {
 	LIST_HEAD(page_list);
@@ -1529,7 +1490,7 @@ shrink_inactive_list(unsigned long nr_to_scan, struct mem_cgroup_zone *mz,
 	unsigned long nr_dirty = 0;
 	unsigned long nr_writeback = 0;
 	isolate_mode_t reclaim_mode = ISOLATE_INACTIVE;
-	struct zone *zone = mz->zone;
+	struct zone *zone = book_zone(book);
 
 	while (unlikely(too_many_isolated(zone, file, sc))) {
 		congestion_wait(BLK_RW_ASYNC, HZ/10);
@@ -1552,7 +1513,7 @@ shrink_inactive_list(unsigned long nr_to_scan, struct mem_cgroup_zone *mz,
 
 	spin_lock_irq(&zone->lru_lock);
 
-	nr_taken = isolate_lru_pages(nr_to_scan, mz, &page_list,
+	nr_taken = isolate_lru_pages(nr_to_scan, book, &page_list,
 				     &nr_scanned, sc->order,
 				     reclaim_mode, 0, file);
 	if (global_reclaim(sc)) {
@@ -1570,20 +1531,20 @@ shrink_inactive_list(unsigned long nr_to_scan, struct mem_cgroup_zone *mz,
 		return 0;
 	}
 
-	update_isolated_counts(mz, &page_list, &nr_anon, &nr_file);
+	update_isolated_counts(book, &page_list, &nr_anon, &nr_file);
 
 	__mod_zone_page_state(zone, NR_ISOLATED_ANON, nr_anon);
 	__mod_zone_page_state(zone, NR_ISOLATED_FILE, nr_file);
 
 	spin_unlock_irq(&zone->lru_lock);
 
-	nr_reclaimed = shrink_page_list(&page_list, mz, sc, priority,
+	nr_reclaimed = shrink_page_list(&page_list, book, sc, priority,
 						&nr_dirty, &nr_writeback);
 
 	/* Check if we should syncronously wait for writeback */
 	if (should_reclaim_stall(nr_taken, nr_reclaimed, priority, sc)) {
 		set_reclaim_mode(priority, sc, true);
-		nr_reclaimed += shrink_page_list(&page_list, mz, sc,
+		nr_reclaimed += shrink_page_list(&page_list, book, sc,
 					priority, &nr_dirty, &nr_writeback);
 	}
 
@@ -1593,7 +1554,7 @@ shrink_inactive_list(unsigned long nr_to_scan, struct mem_cgroup_zone *mz,
 		__count_vm_events(KSWAPD_STEAL, nr_reclaimed);
 	__count_zone_vm_events(PGSTEAL, zone, nr_reclaimed);
 
-	putback_inactive_pages(mz, &page_list);
+	putback_inactive_pages(book, &page_list);
 
 	__mod_zone_page_state(zone, NR_ISOLATED_ANON, -nr_anon);
 	__mod_zone_page_state(zone, NR_ISOLATED_FILE, -nr_file);
@@ -1708,7 +1669,7 @@ static void move_active_pages_to_lru(struct zone *zone,
 }
 
 static void shrink_active_list(unsigned long nr_to_scan,
-			       struct mem_cgroup_zone *mz,
+			       struct book *book,
 			       struct scan_control *sc,
 			       int priority, int file)
 {
@@ -1719,10 +1680,10 @@ static void shrink_active_list(unsigned long nr_to_scan,
 	LIST_HEAD(l_active);
 	LIST_HEAD(l_inactive);
 	struct page *page;
-	struct zone_reclaim_stat *reclaim_stat = get_reclaim_stat(mz);
+	struct zone_reclaim_stat *reclaim_stat = &book->reclaim_stat;
 	unsigned long nr_rotated = 0;
 	isolate_mode_t reclaim_mode = ISOLATE_ACTIVE;
-	struct zone *zone = mz->zone;
+	struct zone *zone = book_zone(book);
 
 	lru_add_drain();
 
@@ -1733,7 +1694,7 @@ static void shrink_active_list(unsigned long nr_to_scan,
 
 	spin_lock_irq(&zone->lru_lock);
 
-	nr_taken = isolate_lru_pages(nr_to_scan, mz, &l_hold,
+	nr_taken = isolate_lru_pages(nr_to_scan, book, &l_hold,
 				     &nr_scanned, sc->order,
 				     reclaim_mode, 1, file);
 	if (global_reclaim(sc))
@@ -1811,12 +1772,11 @@ static void shrink_active_list(unsigned long nr_to_scan,
  * Returns true if the zone does not have enough inactive anon pages,
  * meaning some active anon pages need to be deactivated.
  */
-static int inactive_anon_is_low(struct mem_cgroup_zone *mz,
+static int inactive_anon_is_low(struct book *book,
 				struct scan_control *sc)
 {
 	unsigned long active, inactive;
 	unsigned int ratio;
-	struct book *book;
 
 	/*
 	 * If we don't have swap space, anonymous page deactivation
@@ -1826,18 +1786,17 @@ static int inactive_anon_is_low(struct mem_cgroup_zone *mz,
 		return 0;
 
 	if (global_reclaim(sc))
-		ratio = mz->zone->inactive_ratio;
+		ratio = book_zone(book)->inactive_ratio;
 	else
 		ratio = mem_cgroup_inactive_ratio(sc->target_mem_cgroup);
 
-	book = mem_cgroup_zone_book(mz->zone, mz->mem_cgroup);
 	active = book->pages_count[NR_ACTIVE_ANON];
 	inactive = book->pages_count[NR_INACTIVE_ANON];
 
 	return inactive * ratio < active;
 }
 #else
-static inline int inactive_anon_is_low(struct mem_cgroup_zone *mz,
+static inline int inactive_anon_is_low(struct book *book,
 				       struct scan_control *sc)
 {
 	return 0;
@@ -1858,40 +1817,38 @@ static inline int inactive_anon_is_low(struct mem_cgroup_zone *mz,
  * This uses a different ratio than the anonymous pages, because
  * the page cache uses a use-once replacement algorithm.
  */
-static int inactive_file_is_low(struct mem_cgroup_zone *mz)
+static int inactive_file_is_low(struct book *book)
 {
 	unsigned long active, inactive;
-	struct book *book;
 
-	book = mem_cgroup_zone_book(mz->zone, mz->mem_cgroup);
 	active = book->pages_count[NR_ACTIVE_FILE];
 	inactive = book->pages_count[NR_INACTIVE_FILE];
 
 	return inactive < active;
 }
 
-static int inactive_list_is_low(struct mem_cgroup_zone *mz,
+static int inactive_list_is_low(struct book *book,
 				struct scan_control *sc, int file)
 {
 	if (file)
-		return inactive_file_is_low(mz);
+		return inactive_file_is_low(book);
 	else
-		return inactive_anon_is_low(mz, sc);
+		return inactive_anon_is_low(book, sc);
 }
 
 static unsigned long shrink_list(enum lru_list lru, unsigned long nr_to_scan,
-				 struct mem_cgroup_zone *mz,
+				 struct book *book,
 				 struct scan_control *sc, int priority)
 {
 	int file = is_file_lru(lru);
 
 	if (is_active_lru(lru)) {
-		if (inactive_list_is_low(mz, sc, file))
-			shrink_active_list(nr_to_scan, mz, sc, priority, file);
+		if (inactive_list_is_low(book, sc, file))
+			shrink_active_list(nr_to_scan, book, sc, priority, file);
 		return 0;
 	}
 
-	return shrink_inactive_list(nr_to_scan, mz, sc, priority, file);
+	return shrink_inactive_list(nr_to_scan, book, sc, priority, file);
 }
 
 static int vmscan_swappiness(struct scan_control *sc)
@@ -1909,17 +1866,18 @@ static int vmscan_swappiness(struct scan_control *sc)
  *
  * nr[0] = anon pages to scan; nr[1] = file pages to scan
  */
-static void get_scan_count(struct mem_cgroup_zone *mz, struct scan_control *sc,
+static void get_scan_count(struct book *book, struct scan_control *sc,
 			   unsigned long *nr, int priority)
 {
 	unsigned long anon, file, free;
 	unsigned long anon_prio, file_prio;
 	unsigned long ap, fp;
-	struct zone_reclaim_stat *reclaim_stat = get_reclaim_stat(mz);
+	struct zone_reclaim_stat *reclaim_stat = &book->reclaim_stat;
 	u64 fraction[2], denominator;
 	enum lru_list lru;
 	int noswap = 0;
 	bool force_scan = false;
+	struct zone *zone = book_zone(book);
 
 	/*
 	 * If the zone or memcg is small, nr[l] can be 0.  This
@@ -1931,7 +1889,7 @@ static void get_scan_count(struct mem_cgroup_zone *mz, struct scan_control *sc,
 	 * latencies, so it's better to scan a minimum amount there as
 	 * well.
 	 */
-	if (current_is_kswapd() && mz->zone->all_unreclaimable)
+	if (current_is_kswapd() && zone->all_unreclaimable)
 		force_scan = true;
 	if (!global_reclaim(sc))
 		force_scan = true;
@@ -1945,16 +1903,16 @@ static void get_scan_count(struct mem_cgroup_zone *mz, struct scan_control *sc,
 		goto out;
 	}
 
-	anon  = zone_nr_lru_pages(mz, LRU_ACTIVE_ANON) +
-		zone_nr_lru_pages(mz, LRU_INACTIVE_ANON);
-	file  = zone_nr_lru_pages(mz, LRU_ACTIVE_FILE) +
-		zone_nr_lru_pages(mz, LRU_INACTIVE_FILE);
+	anon  = book->pages_count[LRU_ACTIVE_ANON] +
+		book->pages_count[LRU_INACTIVE_ANON];
+	file  = book->pages_count[LRU_ACTIVE_FILE] +
+		book->pages_count[LRU_INACTIVE_FILE];
 
 	if (global_reclaim(sc)) {
-		free  = zone_page_state(mz->zone, NR_FREE_PAGES);
+		free  = zone_page_state(zone, NR_FREE_PAGES);
 		/* If we have very few page cache pages,
 		   force-scan anon pages. */
-		if (unlikely(file + free <= high_wmark_pages(mz->zone))) {
+		if (unlikely(file + free <= high_wmark_pages(zone))) {
 			fraction[0] = 1;
 			fraction[1] = 0;
 			denominator = 1;
@@ -1980,7 +1938,7 @@ static void get_scan_count(struct mem_cgroup_zone *mz, struct scan_control *sc,
 	 *
 	 * anon in [0], file in [1]
 	 */
-	spin_lock_irq(&mz->zone->lru_lock);
+	spin_lock_irq(&zone->lru_lock);
 	if (unlikely(reclaim_stat->recent_scanned[0] > anon / 4)) {
 		reclaim_stat->recent_scanned[0] /= 2;
 		reclaim_stat->recent_rotated[0] /= 2;
@@ -2001,7 +1959,7 @@ static void get_scan_count(struct mem_cgroup_zone *mz, struct scan_control *sc,
 
 	fp = (file_prio + 1) * (reclaim_stat->recent_scanned[1] + 1);
 	fp /= reclaim_stat->recent_rotated[1] + 1;
-	spin_unlock_irq(&mz->zone->lru_lock);
+	spin_unlock_irq(&zone->lru_lock);
 
 	fraction[0] = ap;
 	fraction[1] = fp;
@@ -2011,7 +1969,7 @@ out:
 		int file = is_file_lru(lru);
 		unsigned long scan;
 
-		scan = zone_nr_lru_pages(mz, lru);
+		scan = book->pages_count[lru];
 		if (priority || noswap) {
 			scan >>= priority;
 			if (!scan && force_scan)
@@ -2029,7 +1987,7 @@ out:
  * back to the allocator and call try_to_compact_zone(), we ensure that
  * there are enough free pages for it to be likely successful
  */
-static inline bool should_continue_reclaim(struct mem_cgroup_zone *mz,
+static inline bool should_continue_reclaim(struct book *book,
 					unsigned long nr_reclaimed,
 					unsigned long nr_scanned,
 					struct scan_control *sc)
@@ -2069,15 +2027,15 @@ static inline bool should_continue_reclaim(struct mem_cgroup_zone *mz,
 	 * inactive lists are large enough, continue reclaiming
 	 */
 	pages_for_compaction = (2UL << sc->order);
-	inactive_lru_pages = zone_nr_lru_pages(mz, LRU_INACTIVE_FILE);
+	inactive_lru_pages = book->pages_count[LRU_INACTIVE_FILE];
 	if (nr_swap_pages > 0)
-		inactive_lru_pages += zone_nr_lru_pages(mz, LRU_INACTIVE_ANON);
+		inactive_lru_pages += book->pages_count[LRU_INACTIVE_ANON];
 	if (sc->nr_reclaimed < pages_for_compaction &&
 			inactive_lru_pages > pages_for_compaction)
 		return true;
 
 	/* If compaction would go ahead or the allocation would succeed, stop */
-	switch (compaction_suitable(mz->zone, sc->order)) {
+	switch (compaction_suitable(book_zone(book), sc->order)) {
 	case COMPACT_PARTIAL:
 	case COMPACT_CONTINUE:
 		return false;
@@ -2089,8 +2047,8 @@ static inline bool should_continue_reclaim(struct mem_cgroup_zone *mz,
 /*
  * This is a basic per-zone page freer.  Used by both kswapd and direct reclaim.
  */
-static void shrink_mem_cgroup_zone(int priority, struct mem_cgroup_zone *mz,
-				   struct scan_control *sc)
+static void shrink_book(int priority, struct book *book,
+			struct scan_control *sc)
 {
 	unsigned long nr[NR_LRU_LISTS];
 	unsigned long nr_to_scan;
@@ -2102,7 +2060,7 @@ static void shrink_mem_cgroup_zone(int priority, struct mem_cgroup_zone *mz,
 restart:
 	nr_reclaimed = 0;
 	nr_scanned = sc->nr_scanned;
-	get_scan_count(mz, sc, nr, priority);
+	get_scan_count(book, sc, nr, priority);
 
 	blk_start_plug(&plug);
 	while (nr[LRU_INACTIVE_ANON] || nr[LRU_ACTIVE_FILE] ||
@@ -2114,7 +2072,7 @@ restart:
 				nr[lru] -= nr_to_scan;
 
 				nr_reclaimed += shrink_list(lru, nr_to_scan,
-							    mz, sc, priority);
+							    book, sc, priority);
 			}
 		}
 		/*
@@ -2135,11 +2093,11 @@ restart:
 	 * Even if we did not try to evict anon pages at all, we want to
 	 * rebalance the anon lru active/inactive ratio.
 	 */
-	if (inactive_anon_is_low(mz, sc))
-		shrink_active_list(SWAP_CLUSTER_MAX, mz, sc, priority, 0);
+	if (inactive_anon_is_low(book, sc))
+		shrink_active_list(SWAP_CLUSTER_MAX, book, sc, priority, 0);
 
 	/* reclaim/compaction might need reclaim to continue */
-	if (should_continue_reclaim(mz, nr_reclaimed,
+	if (should_continue_reclaim(book, nr_reclaimed,
 					sc->nr_scanned - nr_scanned, sc))
 		goto restart;
 
@@ -2155,18 +2113,17 @@ static void shrink_zone(int priority, struct zone *zone,
 		.priority = priority,
 	};
 	struct mem_cgroup *memcg;
+	struct book *book;
 
 	memcg = mem_cgroup_iter(root, NULL, &reclaim);
 	do {
-		struct mem_cgroup_zone mz = {
-			.mem_cgroup = memcg,
-			.zone = zone,
-		};
+		book = mem_cgroup_zone_book(zone, memcg);
 
 		if (!global_reclaim(sc))
 			sc->current_mem_cgroup = memcg;
 
-		shrink_mem_cgroup_zone(priority, &mz, sc);
+		shrink_book(priority, book, sc);
+
 		/*
 		 * Limit reclaim has historically picked one memcg and
 		 * scanned it with decreasing priority levels until
@@ -2483,10 +2440,7 @@ unsigned long mem_cgroup_shrink_node_zone(struct mem_cgroup *memcg,
 		.target_mem_cgroup = memcg,
 		.current_mem_cgroup = memcg,
 	};
-	struct mem_cgroup_zone mz = {
-		.mem_cgroup = memcg,
-		.zone = zone,
-	};
+	struct book *book = mem_cgroup_zone_book(zone, memcg);
 
 	sc.gfp_mask = (gfp_mask & GFP_RECLAIM_MASK) |
 			(GFP_HIGHUSER_MOVABLE & ~GFP_RECLAIM_MASK);
@@ -2502,7 +2456,7 @@ unsigned long mem_cgroup_shrink_node_zone(struct mem_cgroup *memcg,
 	 * will pick up pages from other mem cgroup's as well. We hack
 	 * the priority and make it zero.
 	 */
-	shrink_mem_cgroup_zone(0, &mz, &sc);
+	shrink_book(0, book, &sc);
 
 	trace_mm_vmscan_memcg_softlimit_reclaim_end(sc.nr_reclaimed);
 
@@ -2563,13 +2517,10 @@ static void age_active_anon(struct zone *zone, struct scan_control *sc,
 
 	memcg = mem_cgroup_iter(NULL, NULL, NULL);
 	do {
-		struct mem_cgroup_zone mz = {
-			.mem_cgroup = memcg,
-			.zone = zone,
-		};
+		struct book *book = mem_cgroup_zone_book(zone, memcg);
 
-		if (inactive_anon_is_low(&mz, sc))
-			shrink_active_list(SWAP_CLUSTER_MAX, &mz,
+		if (inactive_anon_is_low(book, sc))
+			shrink_active_list(SWAP_CLUSTER_MAX, book,
 					   sc, priority, 0);
 
 		memcg = mem_cgroup_iter(NULL, memcg, NULL);



* [PATCH RFC 07/15] mm: move page-to-book translation upper
  2012-02-15 22:57 [PATCH RFC 00/15] mm: memory book keeping and lru_lock splitting Konstantin Khlebnikov
                   ` (5 preceding siblings ...)
  2012-02-15 22:57 ` [PATCH RFC 06/15] mm: kill struct mem_cgroup_zone Konstantin Khlebnikov
@ 2012-02-15 22:57 ` Konstantin Khlebnikov
  2012-02-15 22:57 ` [PATCH RFC 08/15] mm: introduce book locking primitives Konstantin Khlebnikov
                   ` (9 subsequent siblings)
  16 siblings, 0 replies; 31+ messages in thread
From: Konstantin Khlebnikov @ 2012-02-15 22:57 UTC (permalink / raw)
  To: linux-mm; +Cc: linux-kernel

Move page_book() out of add_page_to_lru_list() and del_page_from_lru_list(),
and switch their first argument from zone to book.
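
Roughly, the calling convention changes like this (condensed sketch, not an
exact quote of the hunks below):

	/* before: the helpers resolved the book themselves */
	del_page_from_lru_list(zone, page, page_lru(page));

	/* after: the caller translates page -> book once and passes it down */
	struct book *book = page_book(page);

	del_page_from_lru_list(book, page, page_lru(page));
	...
	add_page_to_lru_list(book, page, page_lru(page));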

Signed-off-by: Konstantin Khlebnikov <khlebnikov@openvz.org>
---
 include/linux/mm_inline.h |   10 ++++------
 mm/compaction.c           |    4 +++-
 mm/memcontrol.c           |    7 +++++--
 mm/swap.c                 |   30 ++++++++++++++++++++----------
 mm/vmscan.c               |   16 +++++++++++-----
 5 files changed, 43 insertions(+), 24 deletions(-)

diff --git a/include/linux/mm_inline.h b/include/linux/mm_inline.h
index 9c484c0..286da9b 100644
--- a/include/linux/mm_inline.h
+++ b/include/linux/mm_inline.h
@@ -58,25 +58,23 @@ static inline int page_is_file_cache(struct page *page)
 }
 
 static inline void
-add_page_to_lru_list(struct zone *zone, struct page *page, enum lru_list lru)
+add_page_to_lru_list(struct book *book, struct page *page, enum lru_list lru)
 {
-	struct book *book = page_book(page);
 	int numpages = hpage_nr_pages(page);
 
 	list_add(&page->lru, &book->pages_lru[lru]);
 	book->pages_count[lru] += numpages;
-	__mod_zone_page_state(zone, NR_LRU_BASE + lru, numpages);
+	__mod_zone_page_state(book_zone(book), NR_LRU_BASE + lru, numpages);
 }
 
 static inline void
-del_page_from_lru_list(struct zone *zone, struct page *page, enum lru_list lru)
+del_page_from_lru_list(struct book *book, struct page *page, enum lru_list lru)
 {
-	struct book *book = page_book(page);
 	int numpages = hpage_nr_pages(page);
 
 	list_del(&page->lru);
 	book->pages_count[lru] -= numpages;
-	__mod_zone_page_state(zone, NR_LRU_BASE + lru, -numpages);
+	__mod_zone_page_state(book_zone(book), NR_LRU_BASE + lru, -numpages);
 }
 
 /**
diff --git a/mm/compaction.c b/mm/compaction.c
index 177f0d3..680a725 100644
--- a/mm/compaction.c
+++ b/mm/compaction.c
@@ -269,6 +269,7 @@ static isolate_migrate_t isolate_migratepages(struct zone *zone,
 	unsigned long nr_scanned = 0, nr_isolated = 0;
 	struct list_head *migratelist = &cc->migratepages;
 	isolate_mode_t mode = ISOLATE_ACTIVE|ISOLATE_INACTIVE;
+	struct book *book;
 
 	/* Do not scan outside zone boundaries */
 	low_pfn = max(cc->migrate_pfn, zone->zone_start_pfn);
@@ -388,7 +389,8 @@ static isolate_migrate_t isolate_migratepages(struct zone *zone,
 		VM_BUG_ON(PageTransCompound(page));
 
 		/* Successfully isolated */
-		del_page_from_lru_list(zone, page, page_lru(page));
+		book = page_book(page);
+		del_page_from_lru_list(book, page, page_lru(page));
 		list_add(&page->lru, migratelist);
 		cc->nr_migratepages++;
 		nr_isolated++;
diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index ff82d6e..5136017 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -2536,6 +2536,7 @@ __mem_cgroup_commit_charge_lrucare(struct page *page, struct mem_cgroup *memcg,
 {
 	struct page_cgroup *pc = lookup_page_cgroup(page);
 	struct zone *zone = page_zone(page);
+	struct book *book;
 	unsigned long flags;
 	bool removed = false;
 
@@ -2546,13 +2547,15 @@ __mem_cgroup_commit_charge_lrucare(struct page *page, struct mem_cgroup *memcg,
 	 */
 	spin_lock_irqsave(&zone->lru_lock, flags);
 	if (PageLRU(page)) {
-		del_page_from_lru_list(zone, page, page_lru(page));
+		book = page_book(page);
+		del_page_from_lru_list(book, page, page_lru(page));
 		ClearPageLRU(page);
 		removed = true;
 	}
 	__mem_cgroup_commit_charge(memcg, page, 1, pc, ctype);
 	if (removed) {
-		add_page_to_lru_list(zone, page, page_lru(page));
+		book = page_book(page);
+		add_page_to_lru_list(book, page, page_lru(page));
 		SetPageLRU(page);
 	}
 	spin_unlock_irqrestore(&zone->lru_lock, flags);
diff --git a/mm/swap.c b/mm/swap.c
index 2268ee7..4e00463 100644
--- a/mm/swap.c
+++ b/mm/swap.c
@@ -49,11 +49,13 @@ static void __page_cache_release(struct page *page)
 	if (PageLRU(page)) {
 		unsigned long flags;
 		struct zone *zone = page_zone(page);
+		struct book *book;
 
 		spin_lock_irqsave(&zone->lru_lock, flags);
 		VM_BUG_ON(!PageLRU(page));
 		__ClearPageLRU(page);
-		del_page_from_lru_list(zone, page, page_off_lru(page));
+		book = page_book(page);
+		del_page_from_lru_list(book, page, page_off_lru(page));
 		spin_unlock_irqrestore(&zone->lru_lock, flags);
 	}
 }
@@ -293,11 +295,13 @@ static void __activate_page(struct page *page, void *arg)
 	if (PageLRU(page) && !PageActive(page) && !PageUnevictable(page)) {
 		int file = page_is_file_cache(page);
 		int lru = page_lru_base_type(page);
-		del_page_from_lru_list(zone, page, lru);
+		struct book *book = page_book(page);
+
+		del_page_from_lru_list(book, page, lru);
 
 		SetPageActive(page);
 		lru += LRU_ACTIVE;
-		add_page_to_lru_list(zone, page, lru);
+		add_page_to_lru_list(book, page, lru);
 		__count_vm_event(PGACTIVATE);
 
 		update_page_reclaim_stat(zone, page, file, 1);
@@ -404,11 +408,13 @@ void lru_cache_add_lru(struct page *page, enum lru_list lru)
 void add_page_to_unevictable_list(struct page *page)
 {
 	struct zone *zone = page_zone(page);
+	struct book *book;
 
 	spin_lock_irq(&zone->lru_lock);
 	SetPageUnevictable(page);
 	SetPageLRU(page);
-	add_page_to_lru_list(zone, page, LRU_UNEVICTABLE);
+	book = page_book(page);
+	add_page_to_lru_list(book, page, LRU_UNEVICTABLE);
 	spin_unlock_irq(&zone->lru_lock);
 }
 
@@ -438,6 +444,7 @@ static void lru_deactivate_fn(struct page *page, void *arg)
 	int lru, file;
 	bool active;
 	struct zone *zone = page_zone(page);
+	struct book *book;
 
 	if (!PageLRU(page))
 		return;
@@ -453,10 +460,11 @@ static void lru_deactivate_fn(struct page *page, void *arg)
 
 	file = page_is_file_cache(page);
 	lru = page_lru_base_type(page);
-	del_page_from_lru_list(zone, page, lru + active);
+	book = page_book(page);
+	del_page_from_lru_list(book, page, lru + active);
 	ClearPageActive(page);
 	ClearPageReferenced(page);
-	add_page_to_lru_list(zone, page, lru);
+	add_page_to_lru_list(book, page, lru);
 
 	if (PageWriteback(page) || PageDirty(page)) {
 		/*
@@ -466,7 +474,6 @@ static void lru_deactivate_fn(struct page *page, void *arg)
 		 */
 		SetPageReclaim(page);
 	} else {
-		struct book *book = page_book(page);
 		/*
 		 * The page's writeback ends up during pagevec
 		 * We moves tha page into tail of inactive.
@@ -596,6 +603,7 @@ void release_pages(struct page **pages, int nr, int cold)
 
 		if (PageLRU(page)) {
 			struct zone *pagezone = page_zone(page);
+			struct book *book = page_book(page);
 
 			if (pagezone != zone) {
 				if (zone)
@@ -606,7 +614,7 @@ void release_pages(struct page **pages, int nr, int cold)
 			}
 			VM_BUG_ON(!PageLRU(page));
 			__ClearPageLRU(page);
-			del_page_from_lru_list(zone, page, page_off_lru(page));
+			del_page_from_lru_list(book, page, page_off_lru(page));
 		}
 
 		list_add(&page->lru, &pages_to_free);
@@ -644,6 +652,7 @@ void lru_add_page_tail(struct zone* zone,
 	int active;
 	enum lru_list lru;
 	const int file = 0;
+	struct book *book = page_book(page);
 
 	VM_BUG_ON(!PageHead(page));
 	VM_BUG_ON(PageCompound(page_tail));
@@ -678,7 +687,7 @@ void lru_add_page_tail(struct zone* zone,
 		 * Use the standard add function to put page_tail on the list,
 		 * but then correct its position so they all end up in order.
 		 */
-		add_page_to_lru_list(zone, page_tail, lru);
+		add_page_to_lru_list(book, page_tail, lru);
 		list_head = page_tail->lru.prev;
 		list_move_tail(&page_tail->lru, list_head);
 	}
@@ -689,6 +698,7 @@ static void __pagevec_lru_add_fn(struct page *page, void *arg)
 {
 	enum lru_list lru = (enum lru_list)arg;
 	struct zone *zone = page_zone(page);
+	struct book *book = page_book(page);
 	int file = is_file_lru(lru);
 	int active = is_active_lru(lru);
 
@@ -700,7 +710,7 @@ static void __pagevec_lru_add_fn(struct page *page, void *arg)
 	if (active)
 		SetPageActive(page);
 	update_page_reclaim_stat(zone, page, file, active);
-	add_page_to_lru_list(zone, page, lru);
+	add_page_to_lru_list(book, page, lru);
 }
 
 /*
diff --git a/mm/vmscan.c b/mm/vmscan.c
index c59d4f7..0e0bbe2 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -1291,6 +1291,7 @@ int isolate_lru_page(struct page *page)
 
 	if (PageLRU(page)) {
 		struct zone *zone = page_zone(page);
+		struct book *book;
 
 		spin_lock_irq(&zone->lru_lock);
 		if (PageLRU(page)) {
@@ -1298,8 +1299,8 @@ int isolate_lru_page(struct page *page)
 			ret = 0;
 			get_page(page);
 			ClearPageLRU(page);
-
-			del_page_from_lru_list(zone, page, lru);
+			book = page_book(page);
+			del_page_from_lru_list(book, page, lru);
 		}
 		spin_unlock_irq(&zone->lru_lock);
 	}
@@ -1349,6 +1350,7 @@ putback_inactive_pages(struct book *book,
 	 */
 	while (!list_empty(page_list)) {
 		struct page *page = lru_to_page(page_list);
+		struct book *book;
 		int lru;
 
 		VM_BUG_ON(PageLRU(page));
@@ -1361,7 +1363,10 @@ putback_inactive_pages(struct book *book,
 		}
 		SetPageLRU(page);
 		lru = page_lru(page);
-		add_page_to_lru_list(zone, page, lru);
+
+		/* can differ only on lumpy reclaim */
+		book = page_book(page);
+		add_page_to_lru_list(book, page, lru);
 		if (is_active_lru(lru)) {
 			int file = is_file_lru(lru);
 			int numpages = hpage_nr_pages(page);
@@ -1370,7 +1375,7 @@ putback_inactive_pages(struct book *book,
 		if (put_page_testzero(page)) {
 			__ClearPageLRU(page);
 			__ClearPageActive(page);
-			del_page_from_lru_list(zone, page, lru);
+			del_page_from_lru_list(book, page, lru);
 
 			if (unlikely(PageCompound(page))) {
 				spin_unlock_irq(&zone->lru_lock);
@@ -1644,6 +1649,7 @@ static void move_active_pages_to_lru(struct zone *zone,
 		VM_BUG_ON(PageLRU(page));
 		SetPageLRU(page);
 
+		/* can differ only on lumpy reclaim */
 		book = page_book(page);
 		list_move(&page->lru, &book->pages_lru[lru]);
 		numpages = hpage_nr_pages(page);
@@ -1653,7 +1659,7 @@ static void move_active_pages_to_lru(struct zone *zone,
 		if (put_page_testzero(page)) {
 			__ClearPageLRU(page);
 			__ClearPageActive(page);
-			del_page_from_lru_list(zone, page, lru);
+			del_page_from_lru_list(book, page, lru);
 
 			if (unlikely(PageCompound(page))) {
 				spin_unlock_irq(&zone->lru_lock);



* [PATCH RFC 08/15] mm: introduce book locking primitives
  2012-02-15 22:57 [PATCH RFC 00/15] mm: memory book keeping and lru_lock splitting Konstantin Khlebnikov
                   ` (6 preceding siblings ...)
  2012-02-15 22:57 ` [PATCH RFC 07/15] mm: move page-to-book translation upper Konstantin Khlebnikov
@ 2012-02-15 22:57 ` Konstantin Khlebnikov
  2012-02-15 22:57 ` [PATCH RFC 09/15] mm: handle book relocks on lumpy reclaim Konstantin Khlebnikov
                   ` (8 subsequent siblings)
  16 siblings, 0 replies; 31+ messages in thread
From: Konstantin Khlebnikov @ 2012-02-15 22:57 UTC (permalink / raw)
  To: linux-mm; +Cc: linux-kernel

This is initial preparation for lru_lock splitting.
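
Nothing changes functionally yet: all of these primitives still take and
release the per-zone lru_lock of the page's book. Roughly, the intended
usage pattern (condensed from the pagevec_lru_move_fn() and release_pages()
hunks below, loop variables illustrative):

	struct book *book = NULL;
	unsigned long flags;
	int i;

	for (i = 0; i < nr_pages; i++) {
		struct page *page = pages[i];

		/* drops the previously held lock and takes the right one
		 * only when the page belongs to a different book */
		book = relock_page_book(book, page, &flags);

		/* ... touch the page's lru state under the book lock ... */
	}
	if (book)
		unlock_book(book, &flags);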

Signed-off-by: Konstantin Khlebnikov <khlebnikov@openvz.org>
---
 include/linux/mm_inline.h |  135 +++++++++++++++++++++++++++++++++++++++++++++
 mm/huge_memory.c          |   10 ++-
 mm/memcontrol.c           |    8 +--
 mm/swap.c                 |   56 ++++++-------------
 mm/vmscan.c               |   81 +++++++++++----------------
 5 files changed, 196 insertions(+), 94 deletions(-)

diff --git a/include/linux/mm_inline.h b/include/linux/mm_inline.h
index 286da9b..9cb3a7e 100644
--- a/include/linux/mm_inline.h
+++ b/include/linux/mm_inline.h
@@ -36,6 +36,141 @@ static inline struct pglist_data *book_node(struct book *book)
 
 #endif /* CONFIG_MEMORY_BOOKKEEPING */
 
+static inline void lock_book(struct book *book, unsigned long *flags)
+{
+	spin_lock_irqsave(&book_zone(book)->lru_lock, *flags);
+}
+
+static inline void lock_book_irq(struct book *book)
+{
+	spin_lock_irq(&book_zone(book)->lru_lock);
+}
+
+static inline void unlock_book(struct book *book, unsigned long *flags)
+{
+	spin_unlock_irqrestore(&book_zone(book)->lru_lock, *flags);
+}
+
+static inline void unlock_book_irq(struct book *book)
+{
+	spin_unlock_irq(&book_zone(book)->lru_lock);
+}
+
+#ifdef CONFIG_MEMORY_BOOKKEEPING
+
+static inline struct book *lock_page_book(struct page *page,
+					  unsigned long *flags)
+{
+	struct zone *zone = page_zone(page);
+
+	spin_lock_irqsave(&zone->lru_lock, *flags);
+	return page_book(page);
+}
+
+static inline struct book *lock_page_book_irq(struct page *page)
+{
+	struct zone *zone = page_zone(page);
+
+	spin_lock_irq(&zone->lru_lock);
+	return page_book(page);
+}
+
+static inline struct book *relock_page_book(struct book *locked_book,
+					    struct page *page,
+					    unsigned long *flags)
+{
+	struct zone *zone = page_zone(page);
+
+	if (!locked_book || zone != book_zone(locked_book)) {
+		if (locked_book)
+			unlock_book(locked_book, flags);
+		locked_book = lock_page_book(page, flags);
+	}
+
+	return locked_book;
+}
+
+static inline struct book *relock_page_book_irq(struct book *locked_book,
+						struct page *page)
+{
+	struct zone *zone = page_zone(page);
+
+	if (!locked_book || zone != book_zone(locked_book)) {
+		if (locked_book)
+			unlock_book_irq(locked_book);
+		locked_book = lock_page_book_irq(page);
+	}
+
+	return locked_book;
+}
+
+/*
+ * like relock_page_book_irq, but book always locked, interrupts disabled and
+ * page always in the same zone
+ */
+static inline struct book *__relock_page_book(struct book *locked_book,
+					      struct page *page)
+{
+	return page_book(page);
+}
+
+#else /* CONFIG_MEMORY_BOOKKEEPING */
+
+static inline struct book *lock_page_book(struct page *page,
+					  unsigned long *flags)
+{
+	struct book *book = page_book(page);
+
+	lock_book(book, flags);
+	return book;
+}
+
+static inline struct book *lock_page_book_irq(struct page *page)
+{
+	struct book *book = page_book(page);
+
+	lock_book_irq(book);
+	return book;
+}
+
+static inline struct book *relock_page_book(struct book *locked_book,
+					    struct page *page,
+					    unsigned long *flags)
+{
+	struct book *book = page_book(page);
+
+	if (unlikely(locked_book != book)) {
+		if (locked_book)
+			unlock_book(locked_book, flags);
+		lock_book(book, flags);
+		locked_book = book;
+	}
+	return locked_book;
+}
+
+static inline struct book *relock_page_book_irq(struct book *locked_book,
+						struct page *page)
+{
+	struct book *book = page_book(page);
+
+	if (unlikely(locked_book != book)) {
+		if (locked_book)
+			unlock_book_irq(locked_book);
+		lock_book_irq(book);
+		locked_book = book;
+	}
+	return locked_book;
+}
+
+static inline struct book *__relock_page_book(struct book *locked_book,
+					      struct page *page)
+{
+	/* one book per zone, there is nothing to do */

+	return locked_book;
+}
+
+#endif /* CONFIG_MEMORY_BOOKKEEPING */
+
 #define for_each_book(book, zone) \
 	list_for_each_entry_rcu(book, &zone->book_list, list)
 
diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index 91d3efb..8e7e289 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -1228,11 +1228,11 @@ static int __split_huge_page_splitting(struct page *page,
 static void __split_huge_page_refcount(struct page *page)
 {
 	int i;
-	struct zone *zone = page_zone(page);
+	struct book *book;
 	int tail_count = 0;
 
 	/* prevent PageLRU to go away from under us, and freeze lru stats */
-	spin_lock_irq(&zone->lru_lock);
+	book = lock_page_book_irq(page);
 	compound_lock(page);
 	/* complete memcg works before add pages to LRU */
 	mem_cgroup_split_huge_fixup(page);
@@ -1308,17 +1308,17 @@ static void __split_huge_page_refcount(struct page *page)
 		BUG_ON(!PageSwapBacked(page_tail));
 
 
-		lru_add_page_tail(zone, page, page_tail);
+		lru_add_page_tail(book_zone(book), page, page_tail);
 	}
 	atomic_sub(tail_count, &page->_count);
 	BUG_ON(atomic_read(&page->_count) <= 0);
 
 	__dec_zone_page_state(page, NR_ANON_TRANSPARENT_HUGEPAGES);
-	__mod_zone_page_state(zone, NR_ANON_PAGES, HPAGE_PMD_NR);
+	__mod_zone_page_state(book_zone(book), NR_ANON_PAGES, HPAGE_PMD_NR);
 
 	ClearPageCompound(page);
 	compound_unlock(page);
-	spin_unlock_irq(&zone->lru_lock);
+	unlock_book_irq(book);
 
 	for (i = 1; i < HPAGE_PMD_NR; i++) {
 		struct page *page_tail = page + i;
diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index 5136017..84e04ae 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -3523,19 +3523,19 @@ static int mem_cgroup_force_empty_list(struct mem_cgroup *memcg,
 		struct page *page;
 
 		ret = 0;
-		spin_lock_irqsave(&zone->lru_lock, flags);
+		lock_book(&mz->book, &flags);
 		if (list_empty(list)) {
-			spin_unlock_irqrestore(&zone->lru_lock, flags);
+			unlock_book(&mz->book, &flags);
 			break;
 		}
 		page = list_entry(list->prev, struct page, lru);
 		if (busy == page) {
 			list_move(&page->lru, list);
 			busy = NULL;
-			spin_unlock_irqrestore(&zone->lru_lock, flags);
+			unlock_book(&mz->book, &flags);
 			continue;
 		}
-		spin_unlock_irqrestore(&zone->lru_lock, flags);
+		unlock_book(&mz->book, &flags);
 
 		pc = lookup_page_cgroup(page);
 
diff --git a/mm/swap.c b/mm/swap.c
index 4e00463..677b529 100644
--- a/mm/swap.c
+++ b/mm/swap.c
@@ -48,15 +48,13 @@ static void __page_cache_release(struct page *page)
 {
 	if (PageLRU(page)) {
 		unsigned long flags;
-		struct zone *zone = page_zone(page);
 		struct book *book;
 
-		spin_lock_irqsave(&zone->lru_lock, flags);
+		book = lock_page_book(page, &flags);
 		VM_BUG_ON(!PageLRU(page));
 		__ClearPageLRU(page);
-		book = page_book(page);
 		del_page_from_lru_list(book, page, page_off_lru(page));
-		spin_unlock_irqrestore(&zone->lru_lock, flags);
+		unlock_book(book, &flags);
 	}
 }
 
@@ -208,24 +206,17 @@ static void pagevec_lru_move_fn(struct pagevec *pvec,
 				void *arg)
 {
 	int i;
-	struct zone *zone = NULL;
+	struct book *book = NULL;
 	unsigned long flags = 0;
 
 	for (i = 0; i < pagevec_count(pvec); i++) {
 		struct page *page = pvec->pages[i];
-		struct zone *pagezone = page_zone(page);
-
-		if (pagezone != zone) {
-			if (zone)
-				spin_unlock_irqrestore(&zone->lru_lock, flags);
-			zone = pagezone;
-			spin_lock_irqsave(&zone->lru_lock, flags);
-		}
 
+		book = relock_page_book(book, page, &flags);
 		(*move_fn)(page, arg);
 	}
-	if (zone)
-		spin_unlock_irqrestore(&zone->lru_lock, flags);
+	if (book)
+		unlock_book(book, &flags);
 	release_pages(pvec->pages, pvec->nr, pvec->cold);
 	pagevec_reinit(pvec);
 }
@@ -338,11 +329,11 @@ static inline void activate_page_drain(int cpu)
 
 void activate_page(struct page *page)
 {
-	struct zone *zone = page_zone(page);
+	struct book *book;
 
-	spin_lock_irq(&zone->lru_lock);
+	book = lock_page_book_irq(page);
 	__activate_page(page, NULL);
-	spin_unlock_irq(&zone->lru_lock);
+	unlock_book_irq(book);
 }
 #endif
 
@@ -407,15 +398,13 @@ void lru_cache_add_lru(struct page *page, enum lru_list lru)
  */
 void add_page_to_unevictable_list(struct page *page)
 {
-	struct zone *zone = page_zone(page);
 	struct book *book;
 
-	spin_lock_irq(&zone->lru_lock);
+	book = lock_page_book_irq(page);
 	SetPageUnevictable(page);
 	SetPageLRU(page);
-	book = page_book(page);
 	add_page_to_lru_list(book, page, LRU_UNEVICTABLE);
-	spin_unlock_irq(&zone->lru_lock);
+	unlock_book_irq(book);
 }
 
 /*
@@ -583,16 +572,16 @@ void release_pages(struct page **pages, int nr, int cold)
 {
 	int i;
 	LIST_HEAD(pages_to_free);
-	struct zone *zone = NULL;
+	struct book *book = NULL;
 	unsigned long uninitialized_var(flags);
 
 	for (i = 0; i < nr; i++) {
 		struct page *page = pages[i];
 
 		if (unlikely(PageCompound(page))) {
-			if (zone) {
-				spin_unlock_irqrestore(&zone->lru_lock, flags);
-				zone = NULL;
+			if (book) {
+				unlock_book(book, &flags);
+				book = NULL;
 			}
 			put_compound_page(page);
 			continue;
@@ -602,16 +591,7 @@ void release_pages(struct page **pages, int nr, int cold)
 			continue;
 
 		if (PageLRU(page)) {
-			struct zone *pagezone = page_zone(page);
-			struct book *book = page_book(page);
-
-			if (pagezone != zone) {
-				if (zone)
-					spin_unlock_irqrestore(&zone->lru_lock,
-									flags);
-				zone = pagezone;
-				spin_lock_irqsave(&zone->lru_lock, flags);
-			}
+			book = relock_page_book(book, page, &flags);
 			VM_BUG_ON(!PageLRU(page));
 			__ClearPageLRU(page);
 			del_page_from_lru_list(book, page, page_off_lru(page));
@@ -619,8 +599,8 @@ void release_pages(struct page **pages, int nr, int cold)
 
 		list_add(&page->lru, &pages_to_free);
 	}
-	if (zone)
-		spin_unlock_irqrestore(&zone->lru_lock, flags);
+	if (book)
+		unlock_book(book, &flags);
 
 	free_hot_cold_page_list(&pages_to_free, cold);
 }
diff --git a/mm/vmscan.c b/mm/vmscan.c
index 0e0bbe2..0b973ff 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -1290,19 +1290,17 @@ int isolate_lru_page(struct page *page)
 	VM_BUG_ON(!page_count(page));
 
 	if (PageLRU(page)) {
-		struct zone *zone = page_zone(page);
 		struct book *book;
 
-		spin_lock_irq(&zone->lru_lock);
+		book = lock_page_book_irq(page);
 		if (PageLRU(page)) {
 			int lru = page_lru(page);
 			ret = 0;
 			get_page(page);
 			ClearPageLRU(page);
-			book = page_book(page);
 			del_page_from_lru_list(book, page, lru);
 		}
-		spin_unlock_irq(&zone->lru_lock);
+		unlock_book_irq(book);
 	}
 	return ret;
 }
@@ -1342,7 +1340,6 @@ putback_inactive_pages(struct book *book,
 		       struct list_head *page_list)
 {
 	struct zone_reclaim_stat *reclaim_stat = &book->reclaim_stat;
-	struct zone *zone = book_zone(book);
 	LIST_HEAD(pages_to_free);
 
 	/*
@@ -1350,22 +1347,21 @@ putback_inactive_pages(struct book *book,
 	 */
 	while (!list_empty(page_list)) {
 		struct page *page = lru_to_page(page_list);
-		struct book *book;
 		int lru;
 
 		VM_BUG_ON(PageLRU(page));
 		list_del(&page->lru);
 		if (unlikely(!page_evictable(page, NULL))) {
-			spin_unlock_irq(&zone->lru_lock);
+			unlock_book_irq(book);
 			putback_lru_page(page);
-			spin_lock_irq(&zone->lru_lock);
+			lock_book_irq(book);
 			continue;
 		}
 		SetPageLRU(page);
 		lru = page_lru(page);
 
 		/* can differ only on lumpy reclaim */
-		book = page_book(page);
+		book = __relock_page_book(book, page);
 		add_page_to_lru_list(book, page, lru);
 		if (is_active_lru(lru)) {
 			int file = is_file_lru(lru);
@@ -1378,9 +1374,9 @@ putback_inactive_pages(struct book *book,
 			del_page_from_lru_list(book, page, lru);
 
 			if (unlikely(PageCompound(page))) {
-				spin_unlock_irq(&zone->lru_lock);
+				unlock_book_irq(book);
 				(*get_compound_page_dtor(page))(page);
-				spin_lock_irq(&zone->lru_lock);
+				lock_book_irq(book);
 			} else
 				list_add(&page->lru, &pages_to_free);
 		}
@@ -1516,7 +1512,7 @@ shrink_inactive_list(unsigned long nr_to_scan, struct book *book,
 	if (!sc->may_writepage)
 		reclaim_mode |= ISOLATE_CLEAN;
 
-	spin_lock_irq(&zone->lru_lock);
+	lock_book_irq(book);
 
 	nr_taken = isolate_lru_pages(nr_to_scan, book, &page_list,
 				     &nr_scanned, sc->order,
@@ -1532,7 +1528,7 @@ shrink_inactive_list(unsigned long nr_to_scan, struct book *book,
 	}
 
 	if (nr_taken == 0) {
-		spin_unlock_irq(&zone->lru_lock);
+		unlock_book_irq(book);
 		return 0;
 	}
 
@@ -1541,7 +1537,7 @@ shrink_inactive_list(unsigned long nr_to_scan, struct book *book,
 	__mod_zone_page_state(zone, NR_ISOLATED_ANON, nr_anon);
 	__mod_zone_page_state(zone, NR_ISOLATED_FILE, nr_file);
 
-	spin_unlock_irq(&zone->lru_lock);
+	unlock_book_irq(book);
 
 	nr_reclaimed = shrink_page_list(&page_list, book, sc, priority,
 						&nr_dirty, &nr_writeback);
@@ -1553,7 +1549,7 @@ shrink_inactive_list(unsigned long nr_to_scan, struct book *book,
 					priority, &nr_dirty, &nr_writeback);
 	}
 
-	spin_lock_irq(&zone->lru_lock);
+	lock_book_irq(book);
 
 	if (current_is_kswapd())
 		__count_vm_events(KSWAPD_STEAL, nr_reclaimed);
@@ -1564,7 +1560,7 @@ shrink_inactive_list(unsigned long nr_to_scan, struct book *book,
 	__mod_zone_page_state(zone, NR_ISOLATED_ANON, -nr_anon);
 	__mod_zone_page_state(zone, NR_ISOLATED_FILE, -nr_file);
 
-	spin_unlock_irq(&zone->lru_lock);
+	unlock_book_irq(book);
 
 	free_hot_cold_page_list(&page_list, 1);
 
@@ -1620,7 +1616,7 @@ shrink_inactive_list(unsigned long nr_to_scan, struct book *book,
  * But we had to alter page->flags anyway.
  */
 
-static void move_active_pages_to_lru(struct zone *zone,
+static void move_active_pages_to_lru(struct book *book,
 				     struct list_head *list,
 				     struct list_head *pages_to_free,
 				     enum lru_list lru)
@@ -1629,7 +1625,7 @@ static void move_active_pages_to_lru(struct zone *zone,
 	struct page *page;
 
 	if (buffer_heads_over_limit) {
-		spin_unlock_irq(&zone->lru_lock);
+		unlock_book_irq(book);
 		list_for_each_entry(page, list, lru) {
 			if (page_has_private(page) && trylock_page(page)) {
 				if (page_has_private(page))
@@ -1637,11 +1633,10 @@ static void move_active_pages_to_lru(struct zone *zone,
 				unlock_page(page);
 			}
 		}
-		spin_lock_irq(&zone->lru_lock);
+		lock_book_irq(book);
 	}
 
 	while (!list_empty(list)) {
-		struct book *book;
 		int numpages;
 
 		page = lru_to_page(list);
@@ -1650,7 +1645,7 @@ static void move_active_pages_to_lru(struct zone *zone,
 		SetPageLRU(page);
 
 		/* can differ only on lumpy reclaim */
-		book = page_book(page);
+		book = __relock_page_book(book, page);
 		list_move(&page->lru, &book->pages_lru[lru]);
 		numpages = hpage_nr_pages(page);
 		book->pages_count[lru] += numpages;
@@ -1662,14 +1657,14 @@ static void move_active_pages_to_lru(struct zone *zone,
 			del_page_from_lru_list(book, page, lru);
 
 			if (unlikely(PageCompound(page))) {
-				spin_unlock_irq(&zone->lru_lock);
+				unlock_book_irq(book);
 				(*get_compound_page_dtor(page))(page);
-				spin_lock_irq(&zone->lru_lock);
+				lock_book_irq(book);
 			} else
 				list_add(&page->lru, pages_to_free);
 		}
 	}
-	__mod_zone_page_state(zone, NR_LRU_BASE + lru, pgmoved);
+	__mod_zone_page_state(book_zone(book), NR_LRU_BASE + lru, pgmoved);
 	if (!is_active_lru(lru))
 		__count_vm_events(PGDEACTIVATE, pgmoved);
 }
@@ -1698,7 +1693,7 @@ static void shrink_active_list(unsigned long nr_to_scan,
 	if (!sc->may_writepage)
 		reclaim_mode |= ISOLATE_CLEAN;
 
-	spin_lock_irq(&zone->lru_lock);
+	lock_book_irq(book);
 
 	nr_taken = isolate_lru_pages(nr_to_scan, book, &l_hold,
 				     &nr_scanned, sc->order,
@@ -1714,7 +1709,8 @@ static void shrink_active_list(unsigned long nr_to_scan,
 	else
 		__mod_zone_page_state(zone, NR_ACTIVE_ANON, -nr_taken);
 	__mod_zone_page_state(zone, NR_ISOLATED_ANON + file, nr_taken);
-	spin_unlock_irq(&zone->lru_lock);
+
+	unlock_book_irq(book);
 
 	while (!list_empty(&l_hold)) {
 		cond_resched();
@@ -1750,7 +1746,7 @@ static void shrink_active_list(unsigned long nr_to_scan,
 	/*
 	 * Move pages back to the lru list.
 	 */
-	spin_lock_irq(&zone->lru_lock);
+	lock_book_irq(book);
 	/*
 	 * Count referenced pages from currently used mappings as rotated,
 	 * even though only some of them are actually re-activated.  This
@@ -1759,12 +1755,12 @@ static void shrink_active_list(unsigned long nr_to_scan,
 	 */
 	reclaim_stat->recent_rotated[file] += nr_rotated;
 
-	move_active_pages_to_lru(zone, &l_active, &l_hold,
+	move_active_pages_to_lru(book, &l_active, &l_hold,
 						LRU_ACTIVE + file * LRU_FILE);
-	move_active_pages_to_lru(zone, &l_inactive, &l_hold,
+	move_active_pages_to_lru(book, &l_inactive, &l_hold,
 						LRU_BASE   + file * LRU_FILE);
 	__mod_zone_page_state(zone, NR_ISOLATED_ANON + file, -nr_taken);
-	spin_unlock_irq(&zone->lru_lock);
+	unlock_book_irq(book);
 
 	free_hot_cold_page_list(&l_hold, 1);
 }
@@ -1944,7 +1940,7 @@ static void get_scan_count(struct book *book, struct scan_control *sc,
 	 *
 	 * anon in [0], file in [1]
 	 */
-	spin_lock_irq(&zone->lru_lock);
+	lock_book_irq(book);
 	if (unlikely(reclaim_stat->recent_scanned[0] > anon / 4)) {
 		reclaim_stat->recent_scanned[0] /= 2;
 		reclaim_stat->recent_rotated[0] /= 2;
@@ -1965,7 +1961,7 @@ static void get_scan_count(struct book *book, struct scan_control *sc,
 
 	fp = (file_prio + 1) * (reclaim_stat->recent_scanned[1] + 1);
 	fp /= reclaim_stat->recent_rotated[1] + 1;
-	spin_unlock_irq(&zone->lru_lock);
+	unlock_book_irq(book);
 
 	fraction[0] = ap;
 	fraction[1] = fp;
@@ -3466,24 +3462,16 @@ int page_evictable(struct page *page, struct vm_area_struct *vma)
  */
 void check_move_unevictable_pages(struct page **pages, int nr_pages)
 {
-	struct book *book;
-	struct zone *zone = NULL;
+	struct book *book = NULL;
 	int pgscanned = 0;
 	int pgrescued = 0;
 	int i;
 
 	for (i = 0; i < nr_pages; i++) {
 		struct page *page = pages[i];
-		struct zone *pagezone;
 
 		pgscanned++;
-		pagezone = page_zone(page);
-		if (pagezone != zone) {
-			if (zone)
-				spin_unlock_irq(&zone->lru_lock);
-			zone = pagezone;
-			spin_lock_irq(&zone->lru_lock);
-		}
+		book = relock_page_book_irq(book, page);
 
 		if (!PageLRU(page) || !PageUnevictable(page))
 			continue;
@@ -3493,20 +3481,19 @@ void check_move_unevictable_pages(struct page **pages, int nr_pages)
 
 			VM_BUG_ON(PageActive(page));
 			ClearPageUnevictable(page);
-			__dec_zone_state(zone, NR_UNEVICTABLE);
-			book = page_book(page);
+			__dec_zone_state(book_zone(book), NR_UNEVICTABLE);
 			book->pages_count[LRU_UNEVICTABLE]--;
 			book->pages_count[lru]++;
 			list_move(&page->lru, &book->pages_lru[lru]);
-			__inc_zone_state(zone, NR_INACTIVE_ANON + lru);
+			__inc_zone_state(book_zone(book), NR_INACTIVE_ANON + lru);
 			pgrescued++;
 		}
 	}
 
-	if (zone) {
+	if (book) {
 		__count_vm_events(UNEVICTABLE_PGRESCUED, pgrescued);
 		__count_vm_events(UNEVICTABLE_PGSCANNED, pgscanned);
-		spin_unlock_irq(&zone->lru_lock);
+		unlock_book_irq(book);
 	}
 }
 #endif /* CONFIG_SHMEM */



* [PATCH RFC 09/15] mm: handle book relocks on lumpy reclaim
  2012-02-15 22:57 [PATCH RFC 00/15] mm: memory book keeping and lru_lock splitting Konstantin Khlebnikov
                   ` (7 preceding siblings ...)
  2012-02-15 22:57 ` [PATCH RFC 08/15] mm: introduce book locking primitives Konstantin Khlebnikov
@ 2012-02-15 22:57 ` Konstantin Khlebnikov
  2012-02-15 22:57 ` [PATCH RFC 10/15] mm: handle book relocks in compaction Konstantin Khlebnikov
                   ` (7 subsequent siblings)
  16 siblings, 0 replies; 31+ messages in thread
From: Konstantin Khlebnikov @ 2012-02-15 22:57 UTC (permalink / raw)
  To: linux-mm; +Cc: linux-kernel

Prepare for lock splitting in move_active_pages_to_lru() and putback_inactive_pages():
on lumpy reclaim they can put pages back into different books.
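
In effect the putback helpers hand lock ownership back to the caller;
roughly (condensed from the hunks below):

	lock_book_irq(book);
	/* ... isolate, reclaim ... */

	/* may return with a different book locked after lumpy putback */
	book = putback_inactive_pages(book, &page_list);
	...
	unlock_book_irq(book);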

Signed-off-by: Konstantin Khlebnikov <khlebnikov@openvz.org>
---
 mm/vmscan.c |   26 ++++++++++++++++++--------
 1 files changed, 18 insertions(+), 8 deletions(-)

diff --git a/mm/vmscan.c b/mm/vmscan.c
index 0b973ff..9a3fb72 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -1335,7 +1335,10 @@ static int too_many_isolated(struct zone *zone, int file,
 	return isolated > inactive;
 }
 
-static noinline_for_stack void
+/*
+ * Returns currently locked book
+ */
+static noinline_for_stack struct book *
 putback_inactive_pages(struct book *book,
 		       struct list_head *page_list)
 {
@@ -1386,6 +1389,8 @@ putback_inactive_pages(struct book *book,
 	 * To save our caller's stack, now use input list for pages to free.
 	 */
 	list_splice(&pages_to_free, page_list);
+
+	return book;
 }
 
 static noinline_for_stack void
@@ -1555,7 +1560,7 @@ shrink_inactive_list(unsigned long nr_to_scan, struct book *book,
 		__count_vm_events(KSWAPD_STEAL, nr_reclaimed);
 	__count_zone_vm_events(PGSTEAL, zone, nr_reclaimed);
 
-	putback_inactive_pages(book, &page_list);
+	book = putback_inactive_pages(book, &page_list);
 
 	__mod_zone_page_state(zone, NR_ISOLATED_ANON, -nr_anon);
 	__mod_zone_page_state(zone, NR_ISOLATED_FILE, -nr_file);
@@ -1614,12 +1619,15 @@ shrink_inactive_list(unsigned long nr_to_scan, struct book *book,
  *
  * The downside is that we have to touch page->_count against each page.
  * But we had to alter page->flags anyway.
+ *
+ * Returns currently locked book
  */
 
-static void move_active_pages_to_lru(struct book *book,
-				     struct list_head *list,
-				     struct list_head *pages_to_free,
-				     enum lru_list lru)
+static struct book *
+move_active_pages_to_lru(struct book *book,
+			 struct list_head *list,
+			 struct list_head *pages_to_free,
+			 enum lru_list lru)
 {
 	unsigned long pgmoved = 0;
 	struct page *page;
@@ -1667,6 +1675,8 @@ static void move_active_pages_to_lru(struct book *book,
 	__mod_zone_page_state(book_zone(book), NR_LRU_BASE + lru, pgmoved);
 	if (!is_active_lru(lru))
 		__count_vm_events(PGDEACTIVATE, pgmoved);
+
+	return book;
 }
 
 static void shrink_active_list(unsigned long nr_to_scan,
@@ -1755,9 +1765,9 @@ static void shrink_active_list(unsigned long nr_to_scan,
 	 */
 	reclaim_stat->recent_rotated[file] += nr_rotated;
 
-	move_active_pages_to_lru(book, &l_active, &l_hold,
+	book = move_active_pages_to_lru(book, &l_active, &l_hold,
 						LRU_ACTIVE + file * LRU_FILE);
-	move_active_pages_to_lru(book, &l_inactive, &l_hold,
+	book = move_active_pages_to_lru(book, &l_inactive, &l_hold,
 						LRU_BASE   + file * LRU_FILE);
 	__mod_zone_page_state(zone, NR_ISOLATED_ANON + file, -nr_taken);
 	unlock_book_irq(book);



* [PATCH RFC 10/15] mm: handle book relocks in compaction
  2012-02-15 22:57 [PATCH RFC 00/15] mm: memory book keeping and lru_lock splitting Konstantin Khlebnikov
                   ` (8 preceding siblings ...)
  2012-02-15 22:57 ` [PATCH RFC 09/15] mm: handle book relocks on lumpy reclaim Konstantin Khlebnikov
@ 2012-02-15 22:57 ` Konstantin Khlebnikov
  2012-02-15 22:57 ` [PATCH RFC 11/15] mm: handle book relock in memory controller Konstantin Khlebnikov
                   ` (6 subsequent siblings)
  16 siblings, 0 replies; 31+ messages in thread
From: Konstantin Khlebnikov @ 2012-02-15 22:57 UTC (permalink / raw)
  To: linux-mm; +Cc: linux-kernel

Prepare for lru_lock splitting in memory compaction code.
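
Roughly, isolate_migratepages() now takes the lru lock lazily and only for
the book of the page it is about to isolate (condensed from the hunk below,
control flow simplified):

	struct book *book = NULL;

	for (; low_pfn < end_pfn; low_pfn++) {
		...
		if (!PageLRU(page))
			continue;

		if (book)
			book = __relock_page_book(book, page);
		else
			book = lock_page_book_irq(page);

		if (!PageLRU(page))	/* recheck under the lock */
			continue;

		del_page_from_lru_list(book, page, page_lru(page));
		list_add(&page->lru, migratelist);
	}
	if (book)
		unlock_book_irq(book);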

Signed-off-by: Konstantin Khlebnikov <khlebnikov@openvz.org>
---
 mm/compaction.c |   32 ++++++++++++++++++++------------
 1 files changed, 20 insertions(+), 12 deletions(-)

diff --git a/mm/compaction.c b/mm/compaction.c
index 680a725..f521edf 100644
--- a/mm/compaction.c
+++ b/mm/compaction.c
@@ -269,7 +269,7 @@ static isolate_migrate_t isolate_migratepages(struct zone *zone,
 	unsigned long nr_scanned = 0, nr_isolated = 0;
 	struct list_head *migratelist = &cc->migratepages;
 	isolate_mode_t mode = ISOLATE_ACTIVE|ISOLATE_INACTIVE;
-	struct book *book;
+	struct book *book = NULL;
 
 	/* Do not scan outside zone boundaries */
 	low_pfn = max(cc->migrate_pfn, zone->zone_start_pfn);
@@ -301,25 +301,23 @@ static isolate_migrate_t isolate_migratepages(struct zone *zone,
 
 	/* Time to isolate some pages for migration */
 	cond_resched();
-	spin_lock_irq(&zone->lru_lock);
 	for (; low_pfn < end_pfn; low_pfn++) {
 		struct page *page;
-		bool locked = true;
 
 		/* give a chance to irqs before checking need_resched() */
 		if (!((low_pfn+1) % SWAP_CLUSTER_MAX)) {
-			spin_unlock_irq(&zone->lru_lock);
-			locked = false;
+			if (book)
+				unlock_book_irq(book);
+			book = NULL;
 		}
 		if (need_resched() || spin_is_contended(&zone->lru_lock)) {
-			if (locked)
-				spin_unlock_irq(&zone->lru_lock);
+			if (book)
+				unlock_book_irq(book);
+			book = NULL;
 			cond_resched();
-			spin_lock_irq(&zone->lru_lock);
 			if (fatal_signal_pending(current))
 				break;
-		} else if (!locked)
-			spin_lock_irq(&zone->lru_lock);
+		}
 
 		/*
 		 * migrate_pfn does not necessarily start aligned to a
@@ -345,6 +343,7 @@ static isolate_migrate_t isolate_migratepages(struct zone *zone,
 		 * as memory compaction should not move pages between nodes.
 		 */
 		page = pfn_to_page(low_pfn);
+
 		if (page_zone(page) != zone)
 			continue;
 
@@ -369,6 +368,14 @@ static isolate_migrate_t isolate_migratepages(struct zone *zone,
 		if (!PageLRU(page))
 			continue;
 
+		if (book)
+			book = __relock_page_book(book, page);
+		else
+			book = lock_page_book_irq(page);
+
+		if (!PageLRU(page))
+			continue;
+
 		/*
 		 * PageLRU is set, and lru_lock excludes isolation,
 		 * splitting and collapsing (collapsing has already
@@ -389,7 +396,6 @@ static isolate_migrate_t isolate_migratepages(struct zone *zone,
 		VM_BUG_ON(PageTransCompound(page));
 
 		/* Successfully isolated */
-		book = page_book(page);
 		del_page_from_lru_list(book, page, page_lru(page));
 		list_add(&page->lru, migratelist);
 		cc->nr_migratepages++;
@@ -402,9 +408,11 @@ static isolate_migrate_t isolate_migratepages(struct zone *zone,
 		}
 	}
 
+	if (book)
+		unlock_book_irq(book);
+
 	acct_isolated(zone, cc);
 
-	spin_unlock_irq(&zone->lru_lock);
 	cc->migrate_pfn = low_pfn;
 
 	trace_mm_compaction_isolate_migratepages(nr_scanned, nr_isolated);



* [PATCH RFC 11/15] mm: handle book relock in memory controller
  2012-02-15 22:57 [PATCH RFC 00/15] mm: memory book keeping and lru_lock splitting Konstantin Khlebnikov
                   ` (9 preceding siblings ...)
  2012-02-15 22:57 ` [PATCH RFC 10/15] mm: handle book relocks in compaction Konstantin Khlebnikov
@ 2012-02-15 22:57 ` Konstantin Khlebnikov
  2012-02-15 22:57 ` [PATCH RFC 12/15] mm: optimize books in update_page_reclaim_stat() Konstantin Khlebnikov
                   ` (5 subsequent siblings)
  16 siblings, 0 replies; 31+ messages in thread
From: Konstantin Khlebnikov @ 2012-02-15 22:57 UTC (permalink / raw)
  To: linux-mm; +Cc: linux-kernel

Carefully relock the book lru lock when a page changes its memory cgroup.
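
The subtle part is that __mem_cgroup_commit_charge() may move the page to
another memcg and therefore to another book, so the book has to be looked up
again before the page goes back on a list (condensed from the hunk below):

	__mem_cgroup_commit_charge(memcg, page, 1, pc, ctype);
	if (removed) {
		/* the charge may have changed which book the page is in */
		book = __relock_page_book(book, page);
		add_page_to_lru_list(book, page, page_lru(page));
		SetPageLRU(page);
	}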

Signed-off-by: Konstantin Khlebnikov <khlebnikov@openvz.org>
---
 mm/memcontrol.c |    8 +++-----
 1 files changed, 3 insertions(+), 5 deletions(-)

diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index 84e04ae..90e21d2 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -2535,7 +2535,6 @@ __mem_cgroup_commit_charge_lrucare(struct page *page, struct mem_cgroup *memcg,
 					enum charge_type ctype)
 {
 	struct page_cgroup *pc = lookup_page_cgroup(page);
-	struct zone *zone = page_zone(page);
 	struct book *book;
 	unsigned long flags;
 	bool removed = false;
@@ -2545,20 +2544,19 @@ __mem_cgroup_commit_charge_lrucare(struct page *page, struct mem_cgroup *memcg,
 	 * is already on LRU. It means the page may on some other page_cgroup's
 	 * LRU. Take care of it.
 	 */
-	spin_lock_irqsave(&zone->lru_lock, flags);
+	book = lock_page_book(page, &flags);
 	if (PageLRU(page)) {
-		book = page_book(page);
 		del_page_from_lru_list(book, page, page_lru(page));
 		ClearPageLRU(page);
 		removed = true;
 	}
 	__mem_cgroup_commit_charge(memcg, page, 1, pc, ctype);
 	if (removed) {
-		book = page_book(page);
+		book = __relock_page_book(book, page);
 		add_page_to_lru_list(book, page, page_lru(page));
 		SetPageLRU(page);
 	}
-	spin_unlock_irqrestore(&zone->lru_lock, flags);
+	unlock_book(book, &flags);
 	return;
 }
 



* [PATCH RFC 12/15] mm: optimize books in update_page_reclaim_stat()
  2012-02-15 22:57 [PATCH RFC 00/15] mm: memory book keeping and lru_lock splitting Konstantin Khlebnikov
                   ` (10 preceding siblings ...)
  2012-02-15 22:57 ` [PATCH RFC 11/15] mm: handle book relock in memory controller Konstantin Khlebnikov
@ 2012-02-15 22:57 ` Konstantin Khlebnikov
  2012-02-15 22:57 ` [PATCH RFC 13/15] mm: optimize books in pagevec_lru_move_fn() Konstantin Khlebnikov
                   ` (4 subsequent siblings)
  16 siblings, 0 replies; 31+ messages in thread
From: Konstantin Khlebnikov @ 2012-02-15 22:57 UTC (permalink / raw)
  To: linux-mm; +Cc: linux-kernel

Push the book pointer into update_page_reclaim_stat().
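
The call sites change roughly like this (illustrative):

	/* before: the stat was looked up via page_book(page) internally */
	update_page_reclaim_stat(zone, page, file, rotated);

	/* after: the caller passes the book whose lock it already holds */
	update_page_reclaim_stat(book, page, file, rotated);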

Signed-off-by: Konstantin Khlebnikov <khlebnikov@openvz.org>
---
 mm/swap.c |   18 ++++++------------
 1 files changed, 6 insertions(+), 12 deletions(-)

diff --git a/mm/swap.c b/mm/swap.c
index 677b529..e57c4c6 100644
--- a/mm/swap.c
+++ b/mm/swap.c
@@ -267,12 +267,10 @@ void rotate_reclaimable_page(struct page *page)
 	}
 }
 
-static void update_page_reclaim_stat(struct zone *zone, struct page *page,
+static void update_page_reclaim_stat(struct book *book, struct page *page,
 				     int file, int rotated)
 {
-	struct zone_reclaim_stat *reclaim_stat;
-
-	reclaim_stat = &page_book(page)->reclaim_stat;
+	struct zone_reclaim_stat *reclaim_stat = &book->reclaim_stat;
 
 	reclaim_stat->recent_scanned[file]++;
 	if (rotated)
@@ -281,8 +279,6 @@ static void update_page_reclaim_stat(struct zone *zone, struct page *page,
 
 static void __activate_page(struct page *page, void *arg)
 {
-	struct zone *zone = page_zone(page);
-
 	if (PageLRU(page) && !PageActive(page) && !PageUnevictable(page)) {
 		int file = page_is_file_cache(page);
 		int lru = page_lru_base_type(page);
@@ -295,7 +291,7 @@ static void __activate_page(struct page *page, void *arg)
 		add_page_to_lru_list(book, page, lru);
 		__count_vm_event(PGACTIVATE);
 
-		update_page_reclaim_stat(zone, page, file, 1);
+		update_page_reclaim_stat(book, page, file, 1);
 	}
 }
 
@@ -432,7 +428,6 @@ static void lru_deactivate_fn(struct page *page, void *arg)
 {
 	int lru, file;
 	bool active;
-	struct zone *zone = page_zone(page);
 	struct book *book;
 
 	if (!PageLRU(page))
@@ -473,7 +468,7 @@ static void lru_deactivate_fn(struct page *page, void *arg)
 
 	if (active)
 		__count_vm_event(PGDEACTIVATE);
-	update_page_reclaim_stat(zone, page, file, 0);
+	update_page_reclaim_stat(book, page, file, 0);
 }
 
 /*
@@ -650,7 +645,7 @@ void lru_add_page_tail(struct zone* zone,
 			active = 0;
 			lru = LRU_INACTIVE_ANON;
 		}
-		update_page_reclaim_stat(zone, page_tail, file, active);
+		update_page_reclaim_stat(book, page_tail, file, active);
 	} else {
 		SetPageUnevictable(page_tail);
 		lru = LRU_UNEVICTABLE;
@@ -677,7 +672,6 @@ void lru_add_page_tail(struct zone* zone,
 static void __pagevec_lru_add_fn(struct page *page, void *arg)
 {
 	enum lru_list lru = (enum lru_list)arg;
-	struct zone *zone = page_zone(page);
 	struct book *book = page_book(page);
 	int file = is_file_lru(lru);
 	int active = is_active_lru(lru);
@@ -689,7 +683,7 @@ static void __pagevec_lru_add_fn(struct page *page, void *arg)
 	SetPageLRU(page);
 	if (active)
 		SetPageActive(page);
-	update_page_reclaim_stat(zone, page, file, active);
+	update_page_reclaim_stat(book, page, file, active);
 	add_page_to_lru_list(book, page, lru);
 }
 



* [PATCH RFC 13/15] mm: optimize books in pagevec_lru_move_fn()
  2012-02-15 22:57 [PATCH RFC 00/15] mm: memory book keeping and lru_lock splitting Konstantin Khlebnikov
                   ` (11 preceding siblings ...)
  2012-02-15 22:57 ` [PATCH RFC 12/15] mm: optimize books in update_page_reclaim_stat() Konstantin Khlebnikov
@ 2012-02-15 22:57 ` Konstantin Khlebnikov
  2012-02-15 22:57 ` [PATCH RFC 14/15] mm: optimize putback for 0-order reclaim Konstantin Khlebnikov
                   ` (3 subsequent siblings)
  16 siblings, 0 replies; 31+ messages in thread
From: Konstantin Khlebnikov @ 2012-02-15 22:57 UTC (permalink / raw)
  To: linux-mm; +Cc: linux-kernel

Push the book pointer from pagevec_lru_move_fn() down into the iterator functions.
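
The iterator callbacks change their signature accordingly; roughly
(callback name illustrative):

	static void my_move_fn(struct book *book, struct page *page, void *arg)
	{
		/* book is already locked by pagevec_lru_move_fn() and is
		 * the book this page belongs to */
	}

	...
	pagevec_lru_move_fn(pvec, my_move_fn, arg);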

Signed-off-by: Konstantin Khlebnikov <khlebnikov@openvz.org>
---
 include/linux/swap.h |    2 +-
 mm/huge_memory.c     |    2 +-
 mm/swap.c            |   25 +++++++++++--------------
 3 files changed, 13 insertions(+), 16 deletions(-)

diff --git a/include/linux/swap.h b/include/linux/swap.h
index 80cf6b8..7fa6f1d 100644
--- a/include/linux/swap.h
+++ b/include/linux/swap.h
@@ -210,7 +210,7 @@ extern unsigned int nr_free_pagecache_pages(void);
 /* linux/mm/swap.c */
 extern void __lru_cache_add(struct page *, enum lru_list lru);
 extern void lru_cache_add_lru(struct page *, enum lru_list lru);
-extern void lru_add_page_tail(struct zone* zone,
+extern void lru_add_page_tail(struct book *book,
 			      struct page *page, struct page *page_tail);
 extern void activate_page(struct page *);
 extern void mark_page_accessed(struct page *);
diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index 8e7e289..0824655 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -1308,7 +1308,7 @@ static void __split_huge_page_refcount(struct page *page)
 		BUG_ON(!PageSwapBacked(page_tail));
 
 
-		lru_add_page_tail(book_zone(book), page, page_tail);
+		lru_add_page_tail(book, page, page_tail);
 	}
 	atomic_sub(tail_count, &page->_count);
 	BUG_ON(atomic_read(&page->_count) <= 0);
diff --git a/mm/swap.c b/mm/swap.c
index e57c4c6..652e691 100644
--- a/mm/swap.c
+++ b/mm/swap.c
@@ -202,7 +202,8 @@ void put_pages_list(struct list_head *pages)
 EXPORT_SYMBOL(put_pages_list);
 
 static void pagevec_lru_move_fn(struct pagevec *pvec,
-				void (*move_fn)(struct page *page, void *arg),
+				void (*move_fn)(struct book *book,
+						struct page *page, void *arg),
 				void *arg)
 {
 	int i;
@@ -213,7 +214,7 @@ static void pagevec_lru_move_fn(struct pagevec *pvec,
 		struct page *page = pvec->pages[i];
 
 		book = relock_page_book(book, page, &flags);
-		(*move_fn)(page, arg);
+		(*move_fn)(book, page, arg);
 	}
 	if (book)
 		unlock_book(book, &flags);
@@ -221,13 +222,13 @@ static void pagevec_lru_move_fn(struct pagevec *pvec,
 	pagevec_reinit(pvec);
 }
 
-static void pagevec_move_tail_fn(struct page *page, void *arg)
+static void pagevec_move_tail_fn(struct book *book,
+				 struct page *page, void *arg)
 {
 	int *pgmoved = arg;
 
 	if (PageLRU(page) && !PageActive(page) && !PageUnevictable(page)) {
 		enum lru_list lru = page_lru_base_type(page);
-		struct book *book = page_book(page);
 
 		list_move_tail(&page->lru, &book->pages_lru[lru]);
 		(*pgmoved)++;
@@ -277,12 +278,11 @@ static void update_page_reclaim_stat(struct book *book, struct page *page,
 		reclaim_stat->recent_rotated[file]++;
 }
 
-static void __activate_page(struct page *page, void *arg)
+static void __activate_page(struct book *book, struct page *page, void *arg)
 {
 	if (PageLRU(page) && !PageActive(page) && !PageUnevictable(page)) {
 		int file = page_is_file_cache(page);
 		int lru = page_lru_base_type(page);
-		struct book *book = page_book(page);
 
 		del_page_from_lru_list(book, page, lru);
 
@@ -424,11 +424,10 @@ void add_page_to_unevictable_list(struct page *page)
  * be write it out by flusher threads as this is much more effective
  * than the single-page writeout from reclaim.
  */
-static void lru_deactivate_fn(struct page *page, void *arg)
+static void lru_deactivate_fn(struct book *book, struct page *page, void *arg)
 {
 	int lru, file;
 	bool active;
-	struct book *book;
 
 	if (!PageLRU(page))
 		return;
@@ -444,7 +443,6 @@ static void lru_deactivate_fn(struct page *page, void *arg)
 
 	file = page_is_file_cache(page);
 	lru = page_lru_base_type(page);
-	book = page_book(page);
 	del_page_from_lru_list(book, page, lru + active);
 	ClearPageActive(page);
 	ClearPageReferenced(page);
@@ -621,18 +619,17 @@ EXPORT_SYMBOL(__pagevec_release);
 
 #ifdef CONFIG_TRANSPARENT_HUGEPAGE
 /* used by __split_huge_page_refcount() */
-void lru_add_page_tail(struct zone* zone,
+void lru_add_page_tail(struct book *book,
 		       struct page *page, struct page *page_tail)
 {
 	int active;
 	enum lru_list lru;
 	const int file = 0;
-	struct book *book = page_book(page);
 
 	VM_BUG_ON(!PageHead(page));
 	VM_BUG_ON(PageCompound(page_tail));
 	VM_BUG_ON(PageLRU(page_tail));
-	VM_BUG_ON(NR_CPUS != 1 && !spin_is_locked(&zone->lru_lock));
+	VM_BUG_ON(NR_CPUS != 1 && !spin_is_locked(&book_zone(book)->lru_lock));
 
 	SetPageLRU(page_tail);
 
@@ -669,10 +666,10 @@ void lru_add_page_tail(struct zone* zone,
 }
 #endif /* CONFIG_TRANSPARENT_HUGEPAGE */
 
-static void __pagevec_lru_add_fn(struct page *page, void *arg)
+static void __pagevec_lru_add_fn(struct book *book,
+				 struct page *page, void *arg)
 {
 	enum lru_list lru = (enum lru_list)arg;
-	struct book *book = page_book(page);
 	int file = is_file_lru(lru);
 	int active = is_active_lru(lru);
 


^ permalink raw reply related	[flat|nested] 31+ messages in thread

* [PATCH RFC 14/15] mm: optimize putback for 0-order reclaim
  2012-02-15 22:57 [PATCH RFC 00/15] mm: memory book keeping and lru_lock splitting Konstantin Khlebnikov
                   ` (12 preceding siblings ...)
  2012-02-15 22:57 ` [PATCH RFC 13/15] mm: optimize books in pagevec_lru_move_fn() Konstantin Khlebnikov
@ 2012-02-15 22:57 ` Konstantin Khlebnikov
  2012-02-15 22:58 ` [PATCH RFC 15/15] mm: split zone->lru_lock Konstantin Khlebnikov
                   ` (2 subsequent siblings)
  16 siblings, 0 replies; 31+ messages in thread
From: Konstantin Khlebnikov @ 2012-02-15 22:57 UTC (permalink / raw)
  To: linux-mm; +Cc: linux-kernel

In 0-order reclaim all pages are taken from a single book,
so there is no need to recheck and relock the page's book on putback.

Maybe it would be better to collect lumpy-isolated pages into a
separate list and handle them independently.

Signed-off-by: Konstantin Khlebnikov <khlebnikov@openvz.org>
---
 mm/vmscan.c |   16 +++++++++++-----
 1 files changed, 11 insertions(+), 5 deletions(-)

diff --git a/mm/vmscan.c b/mm/vmscan.c
index 9a3fb72..9fc814f 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -1340,6 +1340,7 @@ static int too_many_isolated(struct zone *zone, int file,
  */
 static noinline_for_stack struct book *
 putback_inactive_pages(struct book *book,
+		       struct scan_control *sc,
 		       struct list_head *page_list)
 {
 	struct zone_reclaim_stat *reclaim_stat = &book->reclaim_stat;
@@ -1364,7 +1365,9 @@ putback_inactive_pages(struct book *book,
 		lru = page_lru(page);
 
 		/* can differ only on lumpy reclaim */
-		book = __relock_page_book(book, page);
+		if (sc->order)
+			book = __relock_page_book(book, page);
+
 		add_page_to_lru_list(book, page, lru);
 		if (is_active_lru(lru)) {
 			int file = is_file_lru(lru);
@@ -1560,7 +1563,7 @@ shrink_inactive_list(unsigned long nr_to_scan, struct book *book,
 		__count_vm_events(KSWAPD_STEAL, nr_reclaimed);
 	__count_zone_vm_events(PGSTEAL, zone, nr_reclaimed);
 
-	book = putback_inactive_pages(book, &page_list);
+	book = putback_inactive_pages(book, sc, &page_list);
 
 	__mod_zone_page_state(zone, NR_ISOLATED_ANON, -nr_anon);
 	__mod_zone_page_state(zone, NR_ISOLATED_FILE, -nr_file);
@@ -1625,6 +1628,7 @@ shrink_inactive_list(unsigned long nr_to_scan, struct book *book,
 
 static struct book *
 move_active_pages_to_lru(struct book *book,
+			 struct scan_control *sc,
 			 struct list_head *list,
 			 struct list_head *pages_to_free,
 			 enum lru_list lru)
@@ -1653,7 +1657,9 @@ move_active_pages_to_lru(struct book *book,
 		SetPageLRU(page);
 
 		/* can differ only on lumpy reclaim */
-		book = __relock_page_book(book, page);
+		if (sc->order)
+			book = __relock_page_book(book, page);
+
 		list_move(&page->lru, &book->pages_lru[lru]);
 		numpages = hpage_nr_pages(page);
 		book->pages_count[lru] += numpages;
@@ -1765,9 +1771,9 @@ static void shrink_active_list(unsigned long nr_to_scan,
 	 */
 	reclaim_stat->recent_rotated[file] += nr_rotated;
 
-	book = move_active_pages_to_lru(book, &l_active, &l_hold,
+	book = move_active_pages_to_lru(book, sc, &l_active, &l_hold,
 						LRU_ACTIVE + file * LRU_FILE);
-	book = move_active_pages_to_lru(book, &l_inactive, &l_hold,
+	book = move_active_pages_to_lru(book, sc, &l_inactive, &l_hold,
 						LRU_BASE   + file * LRU_FILE);
 	__mod_zone_page_state(zone, NR_ISOLATED_ANON + file, -nr_taken);
 	unlock_book_irq(book);


^ permalink raw reply related	[flat|nested] 31+ messages in thread

* [PATCH RFC 15/15] mm: split zone->lru_lock
  2012-02-15 22:57 [PATCH RFC 00/15] mm: memory book keeping and lru_lock splitting Konstantin Khlebnikov
                   ` (13 preceding siblings ...)
  2012-02-15 22:57 ` [PATCH RFC 14/15] mm: optimize putback for 0-order reclaim Konstantin Khlebnikov
@ 2012-02-15 22:58 ` Konstantin Khlebnikov
  2012-02-16  2:04 ` [PATCH RFC 00/15] mm: memory book keeping and lru_lock splitting KAMEZAWA Hiroyuki
  2012-02-16  2:37 ` Hugh Dickins
  16 siblings, 0 replies; 31+ messages in thread
From: Konstantin Khlebnikov @ 2012-02-15 22:58 UTC (permalink / raw)
  To: linux-mm; +Cc: linux-kernel

Looks like everything is ready for splitting zone->lru_lock into small per-book pieces.

The lock loop is protected with rcu: the memory controller already releases its
mem_cgroup_per_node via rcu, and books embedded into zones are never released.

Signed-off-by: Konstantin Khlebnikov <khlebnikov@openvz.org>
---
 include/linux/mm_inline.h |   67 ++++++++++++++++++++++++++++++++++-----------
 include/linux/mmzone.h    |    2 +
 mm/compaction.c           |    3 +-
 mm/memcontrol.c           |    1 +
 mm/page_alloc.c           |    2 +
 mm/swap.c                 |    2 +
 6 files changed, 56 insertions(+), 21 deletions(-)

diff --git a/include/linux/mm_inline.h b/include/linux/mm_inline.h
index 9cb3a7e..5d6df15 100644
--- a/include/linux/mm_inline.h
+++ b/include/linux/mm_inline.h
@@ -38,22 +38,22 @@ static inline struct pglist_data *book_node(struct book *book)
 
 static inline void lock_book(struct book *book, unsigned long *flags)
 {
-	spin_lock_irqsave(&book_zone(book)->lru_lock, *flags);
+	spin_lock_irqsave(&book->lru_lock, *flags);
 }
 
 static inline void lock_book_irq(struct book *book)
 {
-	spin_lock_irq(&book_zone(book)->lru_lock);
+	spin_lock_irq(&book->lru_lock);
 }
 
 static inline void unlock_book(struct book *book, unsigned long *flags)
 {
-	spin_unlock_irqrestore(&book_zone(book)->lru_lock, *flags);
+	spin_unlock_irqrestore(&book->lru_lock, *flags);
 }
 
 static inline void unlock_book_irq(struct book *book)
 {
-	spin_unlock_irq(&book_zone(book)->lru_lock);
+	spin_unlock_irq(&book->lru_lock);
 }
 
 #ifdef CONFIG_MEMORY_BOOKKEEPING
@@ -61,27 +61,47 @@ static inline void unlock_book_irq(struct book *book)
 static inline struct book *lock_page_book(struct page *page,
 					  unsigned long *flags)
 {
-	struct zone *zone = page_zone(page);
+	struct book *locked_book, *book;
 
-	spin_lock_irqsave(&zone->lru_lock, *flags);
-	return page_book(page);
+	rcu_read_lock();
+	local_irq_save(*flags);
+	book = page_book(page);
+	do {
+		spin_lock(&book->lru_lock);
+		locked_book = book;
+		book = page_book(page);
+		if (likely(book == locked_book)) {
+			rcu_read_unlock();
+			return book;
+		}
+		spin_unlock(&locked_book->lru_lock);
+	} while (1);
 }
 
 static inline struct book *lock_page_book_irq(struct page *page)
 {
-	struct zone *zone = page_zone(page);
+	struct book *locked_book, *book;
 
-	spin_lock_irq(&zone->lru_lock);
-	return page_book(page);
+	rcu_read_lock();
+	local_irq_disable();
+	book = page_book(page);
+	do {
+		spin_lock(&book->lru_lock);
+		locked_book = book;
+		book = page_book(page);
+		if (likely(book == locked_book)) {
+			rcu_read_unlock();
+			return book;
+		}
+		spin_unlock(&locked_book->lru_lock);
+	} while (1);
 }
 
 static inline struct book *relock_page_book(struct book *locked_book,
 					    struct page *page,
 					    unsigned long *flags)
 {
-	struct zone *zone = page_zone(page);
-
-	if (!locked_book || zone != book_zone(locked_book)) {
+	if (unlikely(locked_book != page_book(page))) {
 		if (locked_book)
 			unlock_book(locked_book, flags);
 		locked_book = lock_page_book(page, flags);
@@ -93,9 +113,7 @@ static inline struct book *relock_page_book(struct book *locked_book,
 static inline struct book *relock_page_book_irq(struct book *locked_book,
 						struct page *page)
 {
-	struct zone *zone = page_zone(page);
-
-	if (!locked_book || zone != book_zone(locked_book)) {
+	if (unlikely(locked_book != page_book(page))) {
 		if (locked_book)
 			unlock_book_irq(locked_book);
 		locked_book = lock_page_book_irq(page);
@@ -111,7 +129,22 @@ static inline struct book *relock_page_book_irq(struct book *locked_book,
 static inline struct book *__relock_page_book(struct book *locked_book,
 					      struct page *page)
 {
-	return page_book(page);
+	struct book *book;
+
+	if (likely(locked_book == page_book(page)))
+		return locked_book;
+
+	rcu_read_lock();
+	do {
+		book = page_book(page);
+		if (book == locked_book) {
+			rcu_read_unlock();
+			return book;
+		}
+		spin_unlock(&locked_book->lru_lock);
+		spin_lock(&book->lru_lock);
+		locked_book = book;
+	} while (1);
 }
 
 #else /* CONFIG_MEMORY_BOOKKEEPING */
diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h
index 5bcd5b1..629c6bd 100644
--- a/include/linux/mmzone.h
+++ b/include/linux/mmzone.h
@@ -301,6 +301,7 @@ struct book {
 	struct pglist_data	*node;
 	struct zone		*zone;
 #endif
+	spinlock_t		lru_lock;
 	struct list_head	pages_lru[NR_LRU_LISTS];
 	unsigned long		pages_count[NR_LRU_LISTS];
 
@@ -382,7 +383,6 @@ struct zone {
 	ZONE_PADDING(_pad1_)
 
 	/* Fields commonly accessed by the page reclaim scanner */
-	spinlock_t		lru_lock;
 	struct book		book;
 
 	unsigned long		pages_scanned;	   /* since last reclaim */
diff --git a/mm/compaction.c b/mm/compaction.c
index f521edf..cb9266a 100644
--- a/mm/compaction.c
+++ b/mm/compaction.c
@@ -310,7 +310,8 @@ static isolate_migrate_t isolate_migratepages(struct zone *zone,
 				unlock_book_irq(book);
 			book = NULL;
 		}
-		if (need_resched() || spin_is_contended(&zone->lru_lock)) {
+		if (need_resched() ||
+		    (book && spin_is_contended(&book->lru_lock))) {
 			if (book)
 				unlock_book_irq(book);
 			book = NULL;
diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index 90e21d2..eabc2ef 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -4640,6 +4640,7 @@ static int alloc_mem_cgroup_per_zone_info(struct mem_cgroup *memcg, int node)
 			INIT_LIST_HEAD(&mz->book.pages_lru[l]);
 			mz->book.pages_count[l] = 0;
 		}
+		spin_lock_init(&mz->book.lru_lock);
 		mz->book.node = NODE_DATA(node);
 		mz->book.zone = &NODE_DATA(node)->node_zones[zone];
 		spin_lock(&mz->book.zone->lock);
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index a2df69e..144cef8 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -4305,7 +4305,7 @@ static void __paginginit free_area_init_core(struct pglist_data *pgdat,
 #endif
 		zone->name = zone_names[j];
 		spin_lock_init(&zone->lock);
-		spin_lock_init(&zone->lru_lock);
+		spin_lock_init(&zone->book.lru_lock);
 		zone_seqlock_init(zone);
 		zone->zone_pgdat = pgdat;
 
diff --git a/mm/swap.c b/mm/swap.c
index 652e691..58ea4f3 100644
--- a/mm/swap.c
+++ b/mm/swap.c
@@ -629,7 +629,7 @@ void lru_add_page_tail(struct book *book,
 	VM_BUG_ON(!PageHead(page));
 	VM_BUG_ON(PageCompound(page_tail));
 	VM_BUG_ON(PageLRU(page_tail));
-	VM_BUG_ON(NR_CPUS != 1 && !spin_is_locked(&book_zone(book)->lru_lock));
+	VM_BUG_ON(NR_CPUS != 1 && !spin_is_locked(&book->lru_lock));
 
 	SetPageLRU(page_tail);
 


^ permalink raw reply related	[flat|nested] 31+ messages in thread

* Re: [PATCH RFC 00/15] mm: memory book keeping and lru_lock splitting
  2012-02-15 22:57 [PATCH RFC 00/15] mm: memory book keeping and lru_lock splitting Konstantin Khlebnikov
                   ` (14 preceding siblings ...)
  2012-02-15 22:58 ` [PATCH RFC 15/15] mm: split zone->lru_lock Konstantin Khlebnikov
@ 2012-02-16  2:04 ` KAMEZAWA Hiroyuki
  2012-02-16  5:43   ` Konstantin Khlebnikov
  2012-02-16  2:37 ` Hugh Dickins
  16 siblings, 1 reply; 31+ messages in thread
From: KAMEZAWA Hiroyuki @ 2012-02-16  2:04 UTC (permalink / raw)
  To: Konstantin Khlebnikov; +Cc: linux-mm, linux-kernel, Hugh Dickins, hannes

On Thu, 16 Feb 2012 02:57:04 +0400
Konstantin Khlebnikov <khlebnikov@openvz.org> wrote:

> There should be no logic changes in this patchset, this is only tossing bits around.
> [ This patchset is on top some memcg cleanup/rework patches,
>   which I sent to linux-mm@ today/yesterday ]
> 
> Most of things in this patchset are self-descriptive, so here brief plan:
> 

AFAIK, Hugh Dickins said he has per-zone-per-lru-lock and is testing it.
So, please CC him and Johannes, at least.


> * Transmute struct lruvec into struct book. Like real book this struct will
>   store set of pages for one zone. It will be working unit for reclaimer code.
> [ If memcg is disabled in config there will only one book embedded into struct zone ]
> 

Why do you need to add a new structure rather than enhancing lruvec ?
Does "book" mean a binder of pages ?


> * move page-lru counters to struct book
> [ this adds extra overhead in add_page_to_lru_list()/del_page_from_lru_list() for
>   non-memcg case, but I believe it will be invisible, only one non-atomic add/sub
>   in the same cacheline with lru list ]
> 

This seems straightforward.

> * unify inactive_list_is_low_global() and cleanup reclaimer code
> * replace struct mem_cgroup_zone with single pointer to struct book

Hm, ok.

> * optimize page to book translations, move it upper in the call stack,
>   replace some struct zone arguments with struct book pointer.
> 

a page->book translator from patch 2/15

+struct book *page_book(struct page *page)
+{
+	struct mem_cgroup_per_zone *mz;
+	struct page_cgroup *pc;
+
+	if (mem_cgroup_disabled())
+		return &page_zone(page)->book;
+
+	pc = lookup_page_cgroup(page);
+	if (!PageCgroupUsed(pc))
+		return &page_zone(page)->book;
+	/* Ensure pc->mem_cgroup is visible after reading PCG_USED. */
+	smp_rmb();
+	mz = mem_cgroup_zoneinfo(pc->mem_cgroup,
+			page_to_nid(page), page_zonenum(page));
+	return &mz->book;
+}

What happens when pc->mem_cgroup is rewritten by move_account() ?
Where is the guard for lockless access to it ?

Thanks,
-Kame


^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [PATCH RFC 00/15] mm: memory book keeping and lru_lock splitting
  2012-02-15 22:57 [PATCH RFC 00/15] mm: memory book keeping and lru_lock splitting Konstantin Khlebnikov
                   ` (15 preceding siblings ...)
  2012-02-16  2:04 ` [PATCH RFC 00/15] mm: memory book keeping and lru_lock splitting KAMEZAWA Hiroyuki
@ 2012-02-16  2:37 ` Hugh Dickins
  2012-02-16  4:51   ` Konstantin Khlebnikov
  16 siblings, 1 reply; 31+ messages in thread
From: Hugh Dickins @ 2012-02-16  2:37 UTC (permalink / raw)
  To: Konstantin Khlebnikov; +Cc: linux-mm, linux-kernel

On Thu, 16 Feb 2012, Konstantin Khlebnikov wrote:

> There should be no logic changes in this patchset, this is only tossing bits around.
> [ This patchset is on top some memcg cleanup/rework patches,
>   which I sent to linux-mm@ today/yesterday ]
> 
> Most of things in this patchset are self-descriptive, so here brief plan:
> 
> * Transmute struct lruvec into struct book. Like real book this struct will
>   store set of pages for one zone. It will be working unit for reclaimer code.
> [ If memcg is disabled in config there will only one book embedded into struct zone ]
> 
> * move page-lru counters to struct book
> [ this adds extra overhead in add_page_to_lru_list()/del_page_from_lru_list() for
>   non-memcg case, but I believe it will be invisible, only one non-atomic add/sub
>   in the same cacheline with lru list ]
> 
> * unify inactive_list_is_low_global() and cleanup reclaimer code
> * replace struct mem_cgroup_zone with single pointer to struct book
> * optimize page to book translations, move it upper in the call stack,
>   replace some struct zone arguments with struct book pointer.
> 
> page -> book dereference become main operation, page (even free) will always
> points to one book in its zone. so page->flags bits may contains direct reference to book.
> Maybe we can replace page_zone(page) with book_zone(page_book(page)), without mem cgroups
> book -> zone dereference will be simple container_of().
> 
> Finally, there appears some new locking primitives for decorating lru_lock splitting logic.
> Final patch actually splits zone->lru_lock into small per-book pieces.

Well done, it looks like you've beaten me by a few days: my per-memcg
per-zone locking patches are now split up and ready, except for the
commit comments and performance numbers to support them.

Or perhaps what we've been doing is orthogonal: I've not glanced beyond
your Subjects yet, but those do look very familiar from my own work -
though we're still using "lruvec"s rather than "book"s.

Anyway, I should be well-placed to review what you've done, and had
better switch away from my own patches to testing and reviewing yours
now, checking if we've caught anything that you're missing.  Or maybe
it'll be worth posting mine anyway, we'll see: I'll look to yours first.

> All this code currently *completely untested*, but seems like it already can work.

Oh, perhaps I'm ahead of you after all :)

> 
> After that, there two options how manage struct book on mem-cgroup create/destroy:
> a) [ currently implemented ] allocate and release by rcu.
>    Thus lock_page_book() will be protected with rcu_read_lock().
> b) allocate and never release struct book, reuse them after rcu grace period.
>    It allows to avoid some rcu_read_lock()/rcu_read_unlock() calls on hot paths.
> 
> 
> Motivation:
> I wrote the similar memory controller for our rhel6-based openvz/virtuozzo kernel,
> including splitted lru-locks and some other [patented LOL] cool stuff.
> [ common descrioption without techical details: http://wiki.openvz.org/VSwap ]
> That kernel already in production and rather stable for a long time.
> 
> ---
> 
> Konstantin Khlebnikov (15):
>       mm: rename struct lruvec into struct book
>       mm: memory bookkeeping core
>       mm: add book->pages_count
>       mm: unify inactive_list_is_low()
>       mm: add book->reclaim_stat
>       mm: kill struct mem_cgroup_zone
>       mm: move page-to-book translation upper
>       mm: introduce book locking primitives
>       mm: handle book relocks on lumpy reclaim
>       mm: handle book relocks in compaction
>       mm: handle book relock in memory controller
>       mm: optimize books in update_page_reclaim_stat()
>       mm: optimize books in pagevec_lru_move_fn()
>       mm: optimize putback for 0-order reclaim
>       mm: split zone->lru_lock
> 
> 
>  include/linux/memcontrol.h |   52 -------
>  include/linux/mm_inline.h  |  222 ++++++++++++++++++++++++++++-
>  include/linux/mmzone.h     |   26 ++-
>  include/linux/swap.h       |    2 
>  init/Kconfig               |    4 +
>  mm/compaction.c            |   35 +++--
>  mm/huge_memory.c           |   10 +
>  mm/memcontrol.c            |  238 ++++++++++---------------------
>  mm/page_alloc.c            |   20 ++-
>  mm/swap.c                  |  128 ++++++-----------
>  mm/vmscan.c                |  334 +++++++++++++++++++-------------------------
>  11 files changed, 554 insertions(+), 517 deletions(-)

That's a very familiar list of files to me!

Hugh

^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [PATCH RFC 00/15] mm: memory book keeping and lru_lock splitting
  2012-02-16  2:37 ` Hugh Dickins
@ 2012-02-16  4:51   ` Konstantin Khlebnikov
  2012-02-16 21:37     ` Hugh Dickins
  0 siblings, 1 reply; 31+ messages in thread
From: Konstantin Khlebnikov @ 2012-02-16  4:51 UTC (permalink / raw)
  To: Hugh Dickins; +Cc: linux-mm, linux-kernel

Hugh Dickins wrote:
> On Thu, 16 Feb 2012, Konstantin Khlebnikov wrote:
>
>> There should be no logic changes in this patchset, this is only tossing bits around.
>> [ This patchset is on top some memcg cleanup/rework patches,
>>    which I sent to linux-mm@ today/yesterday ]
>>
>> Most of things in this patchset are self-descriptive, so here brief plan:
>>
>> * Transmute struct lruvec into struct book. Like real book this struct will
>>    store set of pages for one zone. It will be working unit for reclaimer code.
>> [ If memcg is disabled in config there will only one book embedded into struct zone ]
>>
>> * move page-lru counters to struct book
>> [ this adds extra overhead in add_page_to_lru_list()/del_page_from_lru_list() for
>>    non-memcg case, but I believe it will be invisible, only one non-atomic add/sub
>>    in the same cacheline with lru list ]
>>
>> * unify inactive_list_is_low_global() and cleanup reclaimer code
>> * replace struct mem_cgroup_zone with single pointer to struct book
>> * optimize page to book translations, move it upper in the call stack,
>>    replace some struct zone arguments with struct book pointer.
>>
>> page ->  book dereference become main operation, page (even free) will always
>> points to one book in its zone. so page->flags bits may contains direct reference to book.
>> Maybe we can replace page_zone(page) with book_zone(page_book(page)), without mem cgroups
>> book ->  zone dereference will be simple container_of().
>>
>> Finally, there appears some new locking primitives for decorating lru_lock splitting logic.
>> Final patch actually splits zone->lru_lock into small per-book pieces.
>
> Well done, it looks like you've beaten me by a few days: my per-memcg
> per-zone locking patches are now split up and ready, except for the
> commit comments and performance numbers to support them.

Heh, nice. How do you organize the link from page to "book"?
My patchset still uses page_cgroup, and I'm afraid something is broken in the locking around it.

>
> Or perhaps what we've been doing is orthogonal: I've not glanced beyond
> your Subjects yet, but those do look very familiar from my own work -
> though we're still using "lruvec"s rather than "book"s.

I think a good name is always good. "book" sounds much better than "mem_cgroup_zone".
Plus I want to split mm and mem-cg as much as possible, like it is done in sched grouping.
After my patchset there are no links from "book" to "memcg", so books of a different
nature can exist. It would be nice to use this memory management not only for cgroups.
For example, what do you think about a per-struct-user memory controller,
an "rt" prio memory group which is never pruned by "normal" prio memory group users,
or per-session-id memory auto-grouping =)

>
> Anyway, I should be well-placed to review what you've done, and had
> better switch away from my own patches to testing and reviewing yours
> now, checking if we've caught anything that you're missing.  Or maybe
> it'll be worth posting mine anyway, we'll see: I'll look to yours first.

Competition is good. It would be nice to see your patches too, if they are as ready as mine.

I already found two bugs in my patchset:
in "mm: unify inactive_list_is_low()" I used NR_ACTIVE_* instead of LRU_ACTIVE_*,
and in "mm: handle book relocks on lumpy reclaim" I forgot the locking in isolate_lru_pages()
[ but with CONFIG_COMPACTION=y this branch is never used, I guess ]

>
>> All this code currently *completely untested*, but seems like it already can work.
>
> Oh, perhaps I'm ahead of you after all :)

Not exactly, I already wrote the same code more than half a year ago.
So it wasn't an absolutely fair competition from the start. =)

>
>>
>> After that, there two options how manage struct book on mem-cgroup create/destroy:
>> a) [ currently implemented ] allocate and release by rcu.
>>     Thus lock_page_book() will be protected with rcu_read_lock().
>> b) allocate and never release struct book, reuse them after rcu grace period.
>>     It allows to avoid some rcu_read_lock()/rcu_read_unlock() calls on hot paths.
>>
>>
>> Motivation:
>> I wrote the similar memory controller for our rhel6-based openvz/virtuozzo kernel,
>> including splitted lru-locks and some other [patented LOL] cool stuff.
>> [ common descrioption without techical details: http://wiki.openvz.org/VSwap ]
>> That kernel already in production and rather stable for a long time.
>>
>> ---
>>
>> Konstantin Khlebnikov (15):
>>        mm: rename struct lruvec into struct book
>>        mm: memory bookkeeping core
>>        mm: add book->pages_count
>>        mm: unify inactive_list_is_low()
>>        mm: add book->reclaim_stat
>>        mm: kill struct mem_cgroup_zone
>>        mm: move page-to-book translation upper
>>        mm: introduce book locking primitives
>>        mm: handle book relocks on lumpy reclaim
>>        mm: handle book relocks in compaction
>>        mm: handle book relock in memory controller
>>        mm: optimize books in update_page_reclaim_stat()
>>        mm: optimize books in pagevec_lru_move_fn()
>>        mm: optimize putback for 0-order reclaim
>>        mm: split zone->lru_lock
>>
>>
>>   include/linux/memcontrol.h |   52 -------
>>   include/linux/mm_inline.h  |  222 ++++++++++++++++++++++++++++-
>>   include/linux/mmzone.h     |   26 ++-
>>   include/linux/swap.h       |    2
>>   init/Kconfig               |    4 +
>>   mm/compaction.c            |   35 +++--
>>   mm/huge_memory.c           |   10 +
>>   mm/memcontrol.c            |  238 ++++++++++---------------------
>>   mm/page_alloc.c            |   20 ++-
>>   mm/swap.c                  |  128 ++++++-----------
>>   mm/vmscan.c                |  334 +++++++++++++++++++-------------------------
>>   11 files changed, 554 insertions(+), 517 deletions(-)
>
> That's a very familiar list of files to me!
>
> Hugh


^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [PATCH RFC 00/15] mm: memory book keeping and lru_lock splitting
  2012-02-16  2:04 ` [PATCH RFC 00/15] mm: memory book keeping and lru_lock splitting KAMEZAWA Hiroyuki
@ 2012-02-16  5:43   ` Konstantin Khlebnikov
  2012-02-16  8:24     ` KAMEZAWA Hiroyuki
  0 siblings, 1 reply; 31+ messages in thread
From: Konstantin Khlebnikov @ 2012-02-16  5:43 UTC (permalink / raw)
  To: KAMEZAWA Hiroyuki; +Cc: linux-mm, linux-kernel, Hugh Dickins, hannes

KAMEZAWA Hiroyuki wrote:
> On Thu, 16 Feb 2012 02:57:04 +0400
> Konstantin Khlebnikov<khlebnikov@openvz.org>  wrote:
>
>> There should be no logic changes in this patchset, this is only tossing bits around.
>> [ This patchset is on top some memcg cleanup/rework patches,
>>    which I sent to linux-mm@ today/yesterday ]
>>
>> Most of things in this patchset are self-descriptive, so here brief plan:
>>
>
> AFAIK, Hugh Dickins said he has per-zone-per-lru-lock and is testing it.
> So, please CC him and Johannes, at least.
>

Ok

>
>> * Transmute struct lruvec into struct book. Like real book this struct will
>>    store set of pages for one zone. It will be working unit for reclaimer code.
>> [ If memcg is disabled in config there will only one book embedded into struct zone ]
>>
>
> Why you need to add new structure rahter than enhancing lruvec ?
> "book" means a binder of pages ?
>

I responded to this in the reply to Hugh Dickins.

>
>> * move page-lru counters to struct book
>> [ this adds extra overhead in add_page_to_lru_list()/del_page_from_lru_list() for
>>    non-memcg case, but I believe it will be invisible, only one non-atomic add/sub
>>    in the same cacheline with lru list ]
>>
>
> This seems straightforward.
>
>> * unify inactive_list_is_low_global() and cleanup reclaimer code
>> * replace struct mem_cgroup_zone with single pointer to struct book
>
> Hm, ok.
>
>> * optimize page to book translations, move it upper in the call stack,
>>    replace some struct zone arguments with struct book pointer.
>>
>
> a page->book transrater from patch 2/15
>
> +struct book *page_book(struct page *page)
> +{
> +	struct mem_cgroup_per_zone *mz;
> +	struct page_cgroup *pc;
> +
> +	if (mem_cgroup_disabled())
> +		return&page_zone(page)->book;
> +
> +	pc = lookup_page_cgroup(page);
> +	if (!PageCgroupUsed(pc))
> +		return&page_zone(page)->book;
> +	/* Ensure pc->mem_cgroup is visible after reading PCG_USED. */
> +	smp_rmb();
> +	mz = mem_cgroup_zoneinfo(pc->mem_cgroup,
> +			page_to_nid(page), page_zonenum(page));
> +	return&mz->book;
> +}
>
> What happens when pc->mem_cgroup is rewritten by move_account() ?
> Where is the guard for lockless access of this ?

Initially this is supposed to be protected with lru_lock; in the final patch it is protected with rcu.
After the final patch all page_book() calls are collected in the [__re]lock_page_book[_irq]() functions.
They pick some book reference, lock its lru and recheck the page -> book reference in a loop until it is stable.

So far I found only one potential problem there: free_mem_cgroup_per_zone_info() in "mm: memory bookkeeping core"
should maybe call spin_unlock_wait(&zone->lru_lock), because someone could pick up page_book(pfn_to_page(pfn))
and try to isolate that page. But I am not sure how this could happen. In the final patch it is completely fixed with rcu.
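
In short, the lock-and-recheck pattern from the final patch boils down to roughly
this (condensed from lock_page_book() in patch 15; a sketch only, the names are
the patchset's):

	rcu_read_lock();
	local_irq_save(*flags);
	book = page_book(page);
	for (;;) {
		spin_lock(&book->lru_lock);
		if (likely(book == page_book(page)))
			break;		/* page -> book link is stable under this lock */
		spin_unlock(&book->lru_lock);
		book = page_book(page);	/* the page was moved, retry with the new book */
	}
	rcu_read_unlock();
	return book;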

>
> Thanks,
> -Kame
>


^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [PATCH RFC 00/15] mm: memory book keeping and lru_lock splitting
  2012-02-16  5:43   ` Konstantin Khlebnikov
@ 2012-02-16  8:24     ` KAMEZAWA Hiroyuki
  2012-02-16 11:02       ` Konstantin Khlebnikov
  0 siblings, 1 reply; 31+ messages in thread
From: KAMEZAWA Hiroyuki @ 2012-02-16  8:24 UTC (permalink / raw)
  To: Konstantin Khlebnikov; +Cc: linux-mm, linux-kernel, Hugh Dickins, hannes

On Thu, 16 Feb 2012 09:43:52 +0400
Konstantin Khlebnikov <khlebnikov@openvz.org> wrote:

> KAMEZAWA Hiroyuki wrote:
> > On Thu, 16 Feb 2012 02:57:04 +0400
> > Konstantin Khlebnikov<khlebnikov@openvz.org>  wrote:

> >> * optimize page to book translations, move it upper in the call stack,
> >>    replace some struct zone arguments with struct book pointer.
> >>
> >
> > a page->book transrater from patch 2/15
> >
> > +struct book *page_book(struct page *page)
> > +{
> > +	struct mem_cgroup_per_zone *mz;
> > +	struct page_cgroup *pc;
> > +
> > +	if (mem_cgroup_disabled())
> > +		return&page_zone(page)->book;
> > +
> > +	pc = lookup_page_cgroup(page);
> > +	if (!PageCgroupUsed(pc))
> > +		return&page_zone(page)->book;
> > +	/* Ensure pc->mem_cgroup is visible after reading PCG_USED. */
> > +	smp_rmb();
> > +	mz = mem_cgroup_zoneinfo(pc->mem_cgroup,
> > +			page_to_nid(page), page_zonenum(page));
> > +	return&mz->book;
> > +}
> >
> > What happens when pc->mem_cgroup is rewritten by move_account() ?
> > Where is the guard for lockless access of this ?
> 
> Initially this suppose to be protected with lru_lock, in final patch they are protected with rcu.

Hmm, VM_BUG_ON(!PageLRU(page)) ?

move_account() overwrites pc->mem_cgroup while isolating the page from the LRU,
but it doesn't take lru_lock.

BTW, what amount of performance benefit do you get?

Thanks,
-Kame


^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [PATCH RFC 00/15] mm: memory book keeping and lru_lock splitting
  2012-02-16  8:24     ` KAMEZAWA Hiroyuki
@ 2012-02-16 11:02       ` Konstantin Khlebnikov
  2012-02-16 15:54         ` Konstantin Khlebnikov
  2012-02-16 23:54         ` KAMEZAWA Hiroyuki
  0 siblings, 2 replies; 31+ messages in thread
From: Konstantin Khlebnikov @ 2012-02-16 11:02 UTC (permalink / raw)
  To: KAMEZAWA Hiroyuki; +Cc: linux-mm, linux-kernel, Hugh Dickins, hannes

KAMEZAWA Hiroyuki wrote:
> On Thu, 16 Feb 2012 09:43:52 +0400
> Konstantin Khlebnikov<khlebnikov@openvz.org>  wrote:
>
>> KAMEZAWA Hiroyuki wrote:
>>> On Thu, 16 Feb 2012 02:57:04 +0400
>>> Konstantin Khlebnikov<khlebnikov@openvz.org>   wrote:
>
>>>> * optimize page to book translations, move it upper in the call stack,
>>>>     replace some struct zone arguments with struct book pointer.
>>>>
>>>
>>> a page->book transrater from patch 2/15
>>>
>>> +struct book *page_book(struct page *page)
>>> +{
>>> +	struct mem_cgroup_per_zone *mz;
>>> +	struct page_cgroup *pc;
>>> +
>>> +	if (mem_cgroup_disabled())
>>> +		return&page_zone(page)->book;
>>> +
>>> +	pc = lookup_page_cgroup(page);
>>> +	if (!PageCgroupUsed(pc))
>>> +		return&page_zone(page)->book;
>>> +	/* Ensure pc->mem_cgroup is visible after reading PCG_USED. */
>>> +	smp_rmb();
>>> +	mz = mem_cgroup_zoneinfo(pc->mem_cgroup,
>>> +			page_to_nid(page), page_zonenum(page));
>>> +	return&mz->book;
>>> +}
>>>
>>> What happens when pc->mem_cgroup is rewritten by move_account() ?
>>> Where is the guard for lockless access of this ?
>>
>> Initially this suppose to be protected with lru_lock, in final patch they are protected with rcu.
>
> Hmm, VM_BUG_ON(!PageLRU(page)) ?

Where?

>
> move_account() overwrites pc->mem_cgroup with isolating page from LRU.
> but it doesn't take lru_lock.

There are three kinds of lock_page_book() users:
1) the caller wants to catch a page in the LRU: it locks either the old or the new book and
    rechecks PageLRU() after locking; if the page is not in the LRU it doesn't touch anything.
    Some of these functions have a stable reference to the page, some of them do not.
  [ There actually is a small race here, I knew about it, I just forgot to pick this chunk from the old code. See below. ]
2) the page is isolated by the caller, which wants to put it back. The book link is stable. No problems.
3) page-release functions. The page counter is zero. No references -- no problems.

race for 1)

catcher					switcher

					# isolate
					old_book = lock_page_book(page)
					ClearPageLRU(page)
					unlock_book(old_book)				
					# charge
old_book = lock_page_book(page)		
					# switch
					page->book = new_book
					# putback
					lock_book(new_book)
					SetPageLRU(page)
					unlock_book(new_book)
if (PageLRU(page))
	oops, page actually in new_book
unlock_book(old_book)


I'll protect the "switch" phase with the old_book lru-lock:

lock_book(old_book)
page->book = new_book
unlock_book(old_book)

The other option is to recheck the page's book in the "catcher" after PageLRU();
maybe there are some other variants.

> BTW, what amount of perfomance benefit ?

It depends, but usually lru_lock is very, very hot.
This lock splitting can also be used without cgroups and containers:
now huge zones can easily be sliced into arbitrary pieces, for example one book per 256Mb.



In my experience, one of the complicated things here is how to postpone "book" destruction
while some of its pages are isolated. For example, lumpy reclaim and memory compaction isolate pages
from several books and want to put them back. Currently this can break if someone removes a
cgroup at the wrong moment. Funny races appear with three players: catcher, switcher and destroyer.
This can be fixed with some extra reference counting or some other sleepable synchronization.
In my rhel6-based implementation I use extra reference counting, and it looks ugly. So I want to invent something better.
The other option is to just never release books and reuse them after an rcu grace period, for rcu-list iterating.

>
> Thanks,
> -Kame
>


^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [PATCH RFC 00/15] mm: memory book keeping and lru_lock splitting
  2012-02-16 11:02       ` Konstantin Khlebnikov
@ 2012-02-16 15:54         ` Konstantin Khlebnikov
  2012-02-16 23:54         ` KAMEZAWA Hiroyuki
  1 sibling, 0 replies; 31+ messages in thread
From: Konstantin Khlebnikov @ 2012-02-16 15:54 UTC (permalink / raw)
  To: KAMEZAWA Hiroyuki; +Cc: linux-mm, linux-kernel, Hugh Dickins, hannes

Konstantin Khlebnikov wrote:
> KAMEZAWA Hiroyuki wrote:
>> On Thu, 16 Feb 2012 09:43:52 +0400
>> Konstantin Khlebnikov<khlebnikov@openvz.org>   wrote:
>>
>>> KAMEZAWA Hiroyuki wrote:
>>>> On Thu, 16 Feb 2012 02:57:04 +0400
>>>> Konstantin Khlebnikov<khlebnikov@openvz.org>    wrote:
>>
>>>>> * optimize page to book translations, move it upper in the call stack,
>>>>>      replace some struct zone arguments with struct book pointer.
>>>>>
>>>>
>>>> a page->book transrater from patch 2/15
>>>>
>>>> +struct book *page_book(struct page *page)
>>>> +{
>>>> +	struct mem_cgroup_per_zone *mz;
>>>> +	struct page_cgroup *pc;
>>>> +
>>>> +	if (mem_cgroup_disabled())
>>>> +		return&page_zone(page)->book;
>>>> +
>>>> +	pc = lookup_page_cgroup(page);
>>>> +	if (!PageCgroupUsed(pc))
>>>> +		return&page_zone(page)->book;
>>>> +	/* Ensure pc->mem_cgroup is visible after reading PCG_USED. */
>>>> +	smp_rmb();
>>>> +	mz = mem_cgroup_zoneinfo(pc->mem_cgroup,
>>>> +			page_to_nid(page), page_zonenum(page));
>>>> +	return&mz->book;
>>>> +}
>>>>
>>>> What happens when pc->mem_cgroup is rewritten by move_account() ?
>>>> Where is the guard for lockless access of this ?
>>>
>>> Initially this suppose to be protected with lru_lock, in final patch they are protected with rcu.
>>
>> Hmm, VM_BUG_ON(!PageLRU(page)) ?
>
> Where?
>
>>
>> move_account() overwrites pc->mem_cgroup with isolating page from LRU.
>> but it doesn't take lru_lock.
>
> There three kinds of lock_page_book() users:
> 1) caller want to catch page in LRU, it will lock either old or new book and
>      recheck PageLRU() after locking, if page not it in LRU it don't touch anything.
>      some of these functions has stable reference to page, some of them not.
>    [ There actually exist small race, I knew about it, just forget to pick this chunk from old code. See below. ]
> 2) page is isolated by caller, it want to put it back. book link is stable. no problems.
> 3) page-release functions. page-counter is zero. no references -- no problems.
>
> race for 1)
>
> catcher					switcher
>
> 					# isolate
> 					old_book = lock_page_book(page)
> 					ClearPageLRU(page)
> 					unlock_book(old_book)				
> 					# charge
> old_book = lock_page_book(page)		
> 					# switch
> 					page->book = new_book
> 					# putback
> 					lock_book(new_book)
> 					SetPageLRU(page)
> 					unlock_book(new_book)
> if (PageLRU(page))
> 	oops, page actually in new_book
> unlock_book(old_book)
>
>
> I'll protect "switch" phase with old_book lru-lock:
>
> lock_book(old_book)
> page->book = new_book
> unlock_book(old_book)

I found a better solution for the switcher sequence:

#isolate
old_book = lock_page_book(page)
ClearPageLRU(page)
unlock_book(old_book)				

#charge

#switch
page->book = new_book
spin_unlock_wait(&old_book->lru_lock)

#putback
lock_book(new_book)
SetPageLRU(page)
unlock_book(new_book)

This spin_unlock_wait() effectively stabilizes the PageLRU() flag
for a potential old_book lock holder.
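
To illustrate, the catcher side would look roughly like this (a sketch only;
the helpers are the patchset's, the body itself is hypothetical):

	book = lock_page_book(page, &flags);
	/*
	 * The switcher sets PageLRU only after it has seen old_book->lru_lock
	 * unlocked, so if this lock_page_book() returned the old book the
	 * flag is still clear here and we simply do nothing.
	 */
	if (PageLRU(page))
		del_page_from_lru_list(book, page, page_lru(page));
	unlock_book(book, &flags);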

>
> The other option is recheck in "catcher" page book after PageLRU()
> maybe there exists some other variants.
>
>> BTW, what amount of perfomance benefit ?
>
> It depends, but usually lru_lock is very-very hot.
> This lock splitting can be used without cgroups and containers,
> now huge zones can be easily sliced into arbitrary pieces, for example one book per 256Mb.
>
>
>
> According to my experience, one of complicated thing there is how to postpone "book" destroying
> if some its pages are isolated. For example lumpy reclaim and memory compaction isolates pages
> from several books. And they wants to put them back. Currently this can be broken, if someone removes
> cgroup in wrong moment. There appears funny races with three players: catcher, switcher and destroyer.
> This can be fixed with some extra reference-counting or some other sleepable synchronizing.
> In my rhel6-based implementation I uses extra reference-counting, and it looks ugly. So I want to invent something better.
> Other option is just never release books, reuse them after rcu grace period for rcu-list iterating.

Looks like it is not broken: a charged page will keep the memcg's books alive.
To make it completely safe, the rcu-free callback must wait with spin_unlock_wait(&book->lru_lock).
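
Something like this, perhaps (a sketch only; it assumes struct book grows an
rcu_head for this, and book_free() stands in for whatever actually releases the
containing structure):

	static void book_free_rcu(struct rcu_head *head)
	{
		struct book *book = container_of(head, struct book, rcu_head);

		/*
		 * A racing lock_page_book() may still be spinning on this
		 * lock while it rechecks the page -> book link, let it
		 * finish before the memory goes away.
		 */
		spin_unlock_wait(&book->lru_lock);
		book_free(book);
	}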

>
>>
>> Thanks,
>> -Kame
>>
>


^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [PATCH RFC 00/15] mm: memory book keeping and lru_lock splitting
  2012-02-16  4:51   ` Konstantin Khlebnikov
@ 2012-02-16 21:37     ` Hugh Dickins
  2012-02-17 19:56       ` Konstantin Khlebnikov
  2012-02-18  2:13       ` Hugh Dickins
  0 siblings, 2 replies; 31+ messages in thread
From: Hugh Dickins @ 2012-02-16 21:37 UTC (permalink / raw)
  To: Konstantin Khlebnikov; +Cc: KAMEZAWA Hiroyuki, Ying Han, linux-mm, linux-kernel

On Thu, 16 Feb 2012, Konstantin Khlebnikov wrote:
> Hugh Dickins wrote:
> > On Thu, 16 Feb 2012, Konstantin Khlebnikov wrote:
> > > 
> > > Finally, there appears some new locking primitives for decorating
> > > lru_lock splitting logic.
> > > Final patch actually splits zone->lru_lock into small per-book pieces.
> > 
> > Well done, it looks like you've beaten me by a few days: my per-memcg
> > per-zone locking patches are now split up and ready, except for the
> > commit comments and performance numbers to support them.
> 
> Heh, nice. How do you organize link from page to "book"?
> My patchset still uses page_cgroup, and I afraid something broken in locking
> around it.

By page_cgroup, yes.  The locking is not straightforward; but was pretty
much impossible in the days of PageCgroupAcctLRU, we had to get rid of
that to move forward.

Given your encouragement below, I'm thinking to post my patchset once
I've added in the patch comments, in a couple of days.  I won't wait
to include any performance results, those will have to follow.

So I'd rather work on those comments to the relevant patches,
than try to go into it now without showing them.

> 
> > 
> > Or perhaps what we've been doing is orthogonal: I've not glanced beyond
> > your Subjects yet, but those do look very familiar from my own work -
> > though we're still using "lruvec"s rather than "book"s.
> 
> I think, good name is always good. "book" sounds much better than
> "mem_cgroup_zone".

"book" is fun, but I don't find it helpful: I prefer the current, more
boring "lruvec".  I guess I place a higher value on a book than as just
a bundle of pages, so hardly see the connection!

If we were designing a new user interface paradigm, then "book" might
be a great name for an object; but this is all just low-level mechanism.

I was a bit disappointed at how much of your patchset is merely renaming.
I think it would be better if you cut back on that, then if you're really
keen on the "book" naming, do that in a final "lruvec" to "book" patch
that people can take or leave as they please.

(I was impressed to see that lwn.net already has a "Book review"
headlining; but then it turned out not to be of your work ;)

And disappointed it wasn't built on linux-next mm, so I had to revert the
cleanups I've already got in there.  But you may well have been wise to
post against a more stable upstream tree: linux-next shifts from day to
day, and is not always good - I did a lot on rc2-next-20120210, that was
a good tree, but rc3-next-20120210 had a number of issues (especially on
PowerPC, even reverting Arjan's parallel cpu bringup).  I'll probably
try moving on to tonight's, but don't know what to expect of it.

> Plus I want to split mm and mem-cg as much as possible, like this done in
> sched-grouping.
> After my patchset there no links from "book" to "memcg", so now there can
> exist books of
> different nature. It would be nice to use this memory management not only for
> cgroups.
> For example, what do you think about per-struct-user memory controller,
> "rt" prio memory group, which never pruned by "normal" prio memory group
> users.
> or per-session-id memory auto-grouping =)

I'm not thinking of it at all - and probably not as interested in a forest
of cgroups as you guys are.  But I don't see "lruvec" as any more tied to
memcg than "book".

> 
> > 
> > Anyway, I should be well-placed to review what you've done, and had
> > better switch away from my own patches to testing and reviewing yours
> > now, checking if we've caught anything that you're missing.  Or maybe
> > it'll be worth posting mine anyway, we'll see: I'll look to yours first.
> 
> Competition is good. It would be nice to see your patches too, if they are as
> ready as my.

Thanks, I'll get on with them.

It is very very striking, the extent to which we have done the same:
with just args in a different order, or my page_relock_lruvec() versus
your relock_page_book().  As if I'd stolen your patches and then made
little mods to conceal the traces - I may need Ying to testify in a
court of law that I didn't, I gave her a copy of what I had on Monday!

(A lot of my cleanups, which converge with yours, were not in the rollup
I posted on December 5th: they came about as I tried to turn that into a
palatable series.)

If you move to lruvec, or I to book, then it will be easier for people
to compare the two.  I expect we'll find rather more difference once I
get deeper in (but will concentrate on mine for the moment): probably
when it comes to the locking, which was our whole motivation.

> 
> I already found two bugs in my patchset:
> in "mm: unify inactive_list_is_low()" NR_ACTIVE_* instead of LRU_ACTIVE_*
> and in "mm: handle book relocks on lumpy reclaim" I forget locking in
> isolate_lru_pages()
> [ but with CONFIG_COMPACTION=y this branch never used, as I guess ]

I included those LRU_ instead of NR_ changes in what I ran up last night;
but didn't attempt to fix up your isolate_lru_pages() locking, and indeed
very soon crashed there, after lots of list_del warnings.

Yours are not the only patches I was testing in that tree, I tried to
gather several other series which I should be reviewing if I ever have
time: Kamezawa-san's page cgroup diet 6, Xiao Guangrong's 4 prio_tree
cleanups, your 3 radix_tree changes, your 6 shmem changes, your 4 memcg
miscellaneous, and then your 15 books.

The tree before your final 15 did well under pressure, until I tried to
rmdir one of the cgroups afterwards: then it crashed nastily, I'll have
to bisect into that, probably either Kamezawa's or your memcg changes.

The tree with your final 15 booted up fine, but very soon crashed under
load: I don't think I'll spend more time on that for now, probably you
need to work on your locking while I work on my descriptions.

Hugh

> 
> > 
> > > All this code currently *completely untested*, but seems like it already
> > > can work.
> > 
> > Oh, perhaps I'm ahead of you after all :)
> 
> Not exactly, I already wrote the same code more than half-year ago.
> So, it wasn't absolutely fair competition from the start. =)
> 
> > 
> > > 
> > > After that, there two options how manage struct book on mem-cgroup
> > > create/destroy:
> > > a) [ currently implemented ] allocate and release by rcu.
> > >     Thus lock_page_book() will be protected with rcu_read_lock().
> > > b) allocate and never release struct book, reuse them after rcu grace
> > > period.
> > >     It allows to avoid some rcu_read_lock()/rcu_read_unlock() calls on
> > > hot paths.
> > > 
> > > 
> > > Motivation:
> > > I wrote the similar memory controller for our rhel6-based
> > > openvz/virtuozzo kernel,
> > > including splitted lru-locks and some other [patented LOL] cool stuff.
> > > [ common descrioption without techical details:
> > > http://wiki.openvz.org/VSwap ]
> > > That kernel already in production and rather stable for a long time.
> > > 
> > > ---
> > > 
> > > Konstantin Khlebnikov (15):
> > >        mm: rename struct lruvec into struct book
> > >        mm: memory bookkeeping core
> > >        mm: add book->pages_count
> > >        mm: unify inactive_list_is_low()
> > >        mm: add book->reclaim_stat
> > >        mm: kill struct mem_cgroup_zone
> > >        mm: move page-to-book translation upper
> > >        mm: introduce book locking primitives
> > >        mm: handle book relocks on lumpy reclaim
> > >        mm: handle book relocks in compaction
> > >        mm: handle book relock in memory controller
> > >        mm: optimize books in update_page_reclaim_stat()
> > >        mm: optimize books in pagevec_lru_move_fn()
> > >        mm: optimize putback for 0-order reclaim
> > >        mm: split zone->lru_lock
> > > 
> > > 
> > >   include/linux/memcontrol.h |   52 -------
> > >   include/linux/mm_inline.h  |  222 ++++++++++++++++++++++++++++-
> > >   include/linux/mmzone.h     |   26 ++-
> > >   include/linux/swap.h       |    2
> > >   init/Kconfig               |    4 +
> > >   mm/compaction.c            |   35 +++--
> > >   mm/huge_memory.c           |   10 +
> > >   mm/memcontrol.c            |  238 ++++++++++---------------------
> > >   mm/page_alloc.c            |   20 ++-
> > >   mm/swap.c                  |  128 ++++++-----------
> > >   mm/vmscan.c                |  334
> > > +++++++++++++++++++-------------------------
> > >   11 files changed, 554 insertions(+), 517 deletions(-)
> > 
> > That's a very familiar list of files to me!
> > 
> > Hugh

^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [PATCH RFC 00/15] mm: memory book keeping and lru_lock splitting
  2012-02-16 11:02       ` Konstantin Khlebnikov
  2012-02-16 15:54         ` Konstantin Khlebnikov
@ 2012-02-16 23:54         ` KAMEZAWA Hiroyuki
  2012-02-18  9:09           ` Konstantin Khlebnikov
  1 sibling, 1 reply; 31+ messages in thread
From: KAMEZAWA Hiroyuki @ 2012-02-16 23:54 UTC (permalink / raw)
  To: Konstantin Khlebnikov; +Cc: linux-mm, linux-kernel, Hugh Dickins, hannes

On Thu, 16 Feb 2012 15:02:27 +0400
Konstantin Khlebnikov <khlebnikov@openvz.org> wrote:

> KAMEZAWA Hiroyuki wrote:
> > On Thu, 16 Feb 2012 09:43:52 +0400
> > Konstantin Khlebnikov<khlebnikov@openvz.org>  wrote:
> >
> >> KAMEZAWA Hiroyuki wrote:
> >>> On Thu, 16 Feb 2012 02:57:04 +0400
> >>> Konstantin Khlebnikov<khlebnikov@openvz.org>   wrote:
> >
> >>>> * optimize page to book translations, move it upper in the call stack,
> >>>>     replace some struct zone arguments with struct book pointer.
> >>>>
> >>>
> >>> a page->book transrater from patch 2/15
> >>>
> >>> +struct book *page_book(struct page *page)
> >>> +{
> >>> +	struct mem_cgroup_per_zone *mz;
> >>> +	struct page_cgroup *pc;
> >>> +
> >>> +	if (mem_cgroup_disabled())
> >>> +		return&page_zone(page)->book;
> >>> +
> >>> +	pc = lookup_page_cgroup(page);
> >>> +	if (!PageCgroupUsed(pc))
> >>> +		return&page_zone(page)->book;
> >>> +	/* Ensure pc->mem_cgroup is visible after reading PCG_USED. */
> >>> +	smp_rmb();
> >>> +	mz = mem_cgroup_zoneinfo(pc->mem_cgroup,
> >>> +			page_to_nid(page), page_zonenum(page));
> >>> +	return&mz->book;
> >>> +}
> >>>
> >>> What happens when pc->mem_cgroup is rewritten by move_account() ?
> >>> Where is the guard for lockless access of this ?
> >>
> >> Initially this suppose to be protected with lru_lock, in final patch they are protected with rcu.
> >
> > Hmm, VM_BUG_ON(!PageLRU(page)) ?
> 
> Where?
> 

You said this is guarded by lru_lock. So, the page should be on the LRU.



> >
> > move_account() overwrites pc->mem_cgroup with isolating page from LRU.
> > but it doesn't take lru_lock.
> 
> There three kinds of lock_page_book() users:
> 1) caller want to catch page in LRU, it will lock either old or new book and
>     recheck PageLRU() after locking, if page not it in LRU it don't touch anything.
>     some of these functions has stable reference to page, some of them not.
>   [ There actually exist small race, I knew about it, just forget to pick this chunk from old code. See below. ]
> 2) page is isolated by caller, it want to put it back. book link is stable. no problems.
> 3) page-release functions. page-counter is zero. no references -- no problems.
> 
> race for 1)
> 
> catcher					switcher
> 
> 					# isolate
> 					old_book = lock_page_book(page)
> 					ClearPageLRU(page)
> 					unlock_book(old_book)				
> 					# charge
> old_book = lock_page_book(page)		
> 					# switch
> 					page->book = new_book
> 					# putback
> 					lock_book(new_book)
> 					SetPageLRU(page)
> 					unlock_book(new_book)
> if (PageLRU(page))
> 	oops, page actually in new_book
> unlock_book(old_book)
> 
> 
> I'll protect "switch" phase with old_book lru-lock:
> 
In linux-next, pc->mem_cgroup is modified only when the page is on the LRU.

What happens when we need to touch the "book" and !PageLRU() ?


> lock_book(old_book)
> page->book = new_book
> unlock_book(old_book)
> 
> The other option is recheck in "catcher" page book after PageLRU()
> maybe there exists some other variants.
> 
> > BTW, what amount of perfomance benefit ?
> 
> It depends, but usually lru_lock is very-very hot.
> This lock splitting can be used without cgroups and containers,
> now huge zones can be easily sliced into arbitrary pieces, for example one book per 256Mb.
> 
I personally think reducing lock contention by pagevec works well enough.
So, I want to see performance numbers on a real machine with real apps.


> 
> According to my experience, one of complicated thing there is how to postpone "book" destroying
> if some its pages are isolated. For example lumpy reclaim and memory compaction isolates pages
> from several books. And they wants to put them back. Currently this can be broken, if someone removes
> cgroup in wrong moment. There appears funny races with three players: catcher, switcher and destroyer.

Thank you for pointing that out. Hmm... can it happen ? Currently, at cgroup destruction,
force_empty() works like this:

  1. find a page from LRU
  2. remove it from LRU
  3. move it or reclaim it (you said "switcher")
  4. if res.usage != 0 goto 1.

I think "4" will finally keep cgroup from being destroyed.


> This can be fixed with some extra reference-counting or some other sleepable synchronizing.
> In my rhel6-based implementation I uses extra reference-counting, and it looks ugly. So I want to invent something better.
> Other option is just never release books, reuse them after rcu grace period for rcu-list iterating.
> 

Another reference counter is very, very bad.



Thanks,
-Kame






^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [PATCH RFC 00/15] mm: memory book keeping and lru_lock splitting
  2012-02-16 21:37     ` Hugh Dickins
@ 2012-02-17 19:56       ` Konstantin Khlebnikov
  2012-02-18  2:13       ` Hugh Dickins
  1 sibling, 0 replies; 31+ messages in thread
From: Konstantin Khlebnikov @ 2012-02-17 19:56 UTC (permalink / raw)
  To: Hugh Dickins; +Cc: KAMEZAWA Hiroyuki, Ying Han, linux-mm, linux-kernel

Hugh Dickins wrote:
> On Thu, 16 Feb 2012, Konstantin Khlebnikov wrote:
>> Hugh Dickins wrote:
>>> On Thu, 16 Feb 2012, Konstantin Khlebnikov wrote:
>>>>
>>>> Finally, there appears some new locking primitives for decorating
>>>> lru_lock splitting logic.
>>>> Final patch actually splits zone->lru_lock into small per-book pieces.
>>>
>>> Well done, it looks like you've beaten me by a few days: my per-memcg
>>> per-zone locking patches are now split up and ready, except for the
>>> commit comments and performance numbers to support them.
>>
>> Heh, nice. How do you organize link from page to "book"?
>> My patchset still uses page_cgroup, and I afraid something broken in locking
>> around it.
>
> By page_cgroup, yes.  The locking is not straightforward; but was pretty
> much impossible in the days of PageCgroupAcctLRU, we had to get rid of
> that to move forward.

Hmm, my old rhel6-based code has no extra locking except the split lru-lock
and an optimized resource counter. Though it does not account mapped pages for the cgroup --
that is not so easy and seems completely useless in real life.

>
> Given your encouragement below, I'm thinking to post my patchset once
> I've added in the patch comments, in a couple of days.  I won't wait
> to include any performance results, those will have to follow.
>
> So I'd rather work on those comments to the relevant patches,
> than try to go into it now without showing them.
>
>>
>>>
>>> Or perhaps what we've been doing is orthogonal: I've not glanced beyond
>>> your Subjects yet, but those do look very familiar from my own work -
>>> though we're still using "lruvec"s rather than "book"s.
>>
>> I think, good name is always good. "book" sounds much better than
>> "mem_cgroup_zone".
>
> "book" is fun, but I don't find it helpful: I prefer the current, more
> boring "lruvec".  I guess I place a higher value on a book than as just
> a bundle of pages, so hardly see the connection!
>
> If we were designing a new user interface paradigm, then "book" might
> be a great name for an object; but this is all just low-level mechanism.

In my patchset the book contains all the information the reclaimer requires.
So it is not such a low-level object; it is much more than an lru vector.
That is why I want to give it a short single-word name,
one with no second meaning in the kernel. Maybe "pile" =)
if you want to reserve "book" for something more delicious.

>
> I was a bit disappointed at how much of your patchset is merely renaming.
> I think it would be better if you cut back on that, then if you're really
> keen on the "book" naming, do that in a final "lruvec" to "book" patch
> that people can take or leave as they please.

There is only one renaming patch, and it is not so big.
Plus my "book", "pages_lru" and "pages_count" are totally unique;
nothing else in the kernel uses these names.
That is very handy for grep-ing.

>
> (I was impressed to see that lwn.net already has a "Book review"
> headlining; but then it turned out not to be of your work ;)
>
> And disappointed it wasn't built on linux-next mm, so I had to revert the
> cleanups I've already got in there.  But you may well have been wise to
> post against a more stable upstream tree: linux-next shifts from day to
> day, and is not always good - I did a lot on rc2-next-20120210, that was
> a good tree, but rc3-next-20120210 had a number of issues (especially on
> PowerPC, even reverting Arjan's parallel cpu bringup).  I'll probably
> try moving on to tonight's, but don't know what to expect of it.

I work on top of the current Linus tree. I'll rebase it onto linux-next.

>
>> Plus I want to split mm and mem-cg as much as possible, like this done in
>> sched-grouping.
>> After my patchset there no links from "book" to "memcg", so now there can
>> exist books of
>> different nature. It would be nice to use this memory management not only for
>> cgroups.
>> For example, what do you think about per-struct-user memory controller,
>> "rt" prio memory group, which never pruned by "normal" prio memory group
>> users.
>> or per-session-id memory auto-grouping =)
>
> I'm not thinking of it at all - and probably not as interested in a forest
> of cgroups as you guys are.  But I don't see "lruvec" as any more tied to
> memcg than "book".
>

Actually, I'm not very interested in cgroups either; they are very limited and bloated.
What is required here is a simple and effective basis for inventing more intelligent
memory management. It must be automatic and work without hundreds of tuning knobs.

>>
>>>
>>> Anyway, I should be well-placed to review what you've done, and had
>>> better switch away from my own patches to testing and reviewing yours
>>> now, checking if we've caught anything that you're missing.  Or maybe
>>> it'll be worth posting mine anyway, we'll see: I'll look to yours first.
>>
>> Competition is good. It would be nice to see your patches too, if they are as
>> ready as my.
>
> Thanks, I'll get on with them.
>
> It is very very striking, the extent to which we have done the same:
> with just args in a different order, or my page_relock_lruvec() versus
> your relock_page_book().  As if I'd stolen your patches and then made
> little mods to conceal the traces - I may need Ying to testify in a
> court of law that I didn't, I gave her a copy of what I had on Monday!
>
> (A lot of my cleanups, which converge with yours, were not in the rollup
> I posted on December 5th: they came about as I tried to turn that into a
> palatable series.)
>
> If you move to lruvec, or I to book, then it will be easier for people
> to compare the two.  I expect we'll find rather more difference once I
> get deeper in (but will concentrate on mine for the moment): probably
> when it comes to the locking, which was our whole motivation.
>
>>
>> I already found two bugs in my patchset:
>> in "mm: unify inactive_list_is_low()" NR_ACTIVE_* instead of LRU_ACTIVE_*
>> and in "mm: handle book relocks on lumpy reclaim" I forget locking in
>> isolate_lru_pages()
>> [ but with CONFIG_COMPACTION=y this branch never used, as I guess ]
>
> I included those LRU_ instead of NR_ changes in what I ran up last night;
> but didn't attempt to fix up your isolate_lru_pages() locking, and indeed
> very soon crashed there, after lots of list_del warnings.
>
> Yours are not the only patches I was testing in that tree, I tried to
> gather several other series which I should be reviewing if I ever have
> time: Kamezawa-san's page cgroup diet 6, Xiao Guangrong's 4 prio_tree
> cleanups, your 3 radix_tree changes, your 6 shmem changes, your 4 memcg
> miscellaneous, and then your 15 books.
>
> The tree before your final 15 did well under pressure, until I tried to
> rmdir one of the cgroups afterwards: then it crashed nastily, I'll have
> to bisect into that, probably either Kamezawa's or your memcg changes.
>
> The tree with your final 15 booted up fine, but very soon crashed under
> load: I don't think I'll spend more time on that for now, probably you
> need to work on your locking while I work on my descriptions.

Yeah, I found many bugs. The up-to-date version is here: https://github.com/koct9i/linux
I'll send v2 for review after testing.

I invented clearer locking. It does not interact with the memcg magic, and does not add
overhead, except for one spin_unlock_wait(old_book->lru_lock) in mem_cgroup_move_account().
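
Roughly, the "switcher" side looks like this (only a sketch; the helper names and the
exact call site are simplified compared to the real patch, and in the real code the
putback onto the new LRU is done by the caller):

	/*
	 * Sketch of the "switcher" side in mem_cgroup_move_account().
	 * The page is already isolated from the LRU, so the only lockless
	 * reader to worry about is a "catcher" that resolved page_book()
	 * to the old book and took (or is taking) its lru_lock.
	 */
	old_book = page_book(page);

	pc->mem_cgroup = to;		/* switch the page to the new book */
	smp_wmb();			/* pairs with smp_rmb() in page_book() */

	/*
	 * One-off wait: whoever already took old_book->lru_lock based on
	 * the stale pointer finishes and rechecks PageLRU() (clear, since
	 * the page is isolated) before the page reappears on any LRU.
	 */
	spin_unlock_wait(&old_book->lru_lock);

	new_book = page_book(page);
	spin_lock_irq(&new_book->lru_lock);
	SetPageLRU(page);
	add_page_to_lru_list(page, new_book, page_lru(page));
	spin_unlock_irq(&new_book->lru_lock);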

>
> Hugh
>
>>
>>>
>>>> All this code currently *completely untested*, but seems like it already
>>>> can work.
>>>
>>> Oh, perhaps I'm ahead of you after all :)
>>
>> Not exactly, I already wrote the same code more than half-year ago.
>> So, it wasn't absolutely fair competition from the start. =)
>>
>>>
>>>>
>>>> After that, there two options how manage struct book on mem-cgroup
>>>> create/destroy:
>>>> a) [ currently implemented ] allocate and release by rcu.
>>>>      Thus lock_page_book() will be protected with rcu_read_lock().
>>>> b) allocate and never release struct book, reuse them after rcu grace
>>>> period.
>>>>      It allows to avoid some rcu_read_lock()/rcu_read_unlock() calls on
>>>> hot paths.
>>>>
>>>>
>>>> Motivation:
>>>> I wrote the similar memory controller for our rhel6-based
>>>> openvz/virtuozzo kernel,
>>>> including splitted lru-locks and some other [patented LOL] cool stuff.
>>>> [ common descrioption without techical details:
>>>> http://wiki.openvz.org/VSwap ]
>>>> That kernel already in production and rather stable for a long time.
>>>>
>>>> ---
>>>>
>>>> Konstantin Khlebnikov (15):
>>>>         mm: rename struct lruvec into struct book
>>>>         mm: memory bookkeeping core
>>>>         mm: add book->pages_count
>>>>         mm: unify inactive_list_is_low()
>>>>         mm: add book->reclaim_stat
>>>>         mm: kill struct mem_cgroup_zone
>>>>         mm: move page-to-book translation upper
>>>>         mm: introduce book locking primitives
>>>>         mm: handle book relocks on lumpy reclaim
>>>>         mm: handle book relocks in compaction
>>>>         mm: handle book relock in memory controller
>>>>         mm: optimize books in update_page_reclaim_stat()
>>>>         mm: optimize books in pagevec_lru_move_fn()
>>>>         mm: optimize putback for 0-order reclaim
>>>>         mm: split zone->lru_lock
>>>>
>>>>
>>>>    include/linux/memcontrol.h |   52 -------
>>>>    include/linux/mm_inline.h  |  222 ++++++++++++++++++++++++++++-
>>>>    include/linux/mmzone.h     |   26 ++-
>>>>    include/linux/swap.h       |    2
>>>>    init/Kconfig               |    4 +
>>>>    mm/compaction.c            |   35 +++--
>>>>    mm/huge_memory.c           |   10 +
>>>>    mm/memcontrol.c            |  238 ++++++++++---------------------
>>>>    mm/page_alloc.c            |   20 ++-
>>>>    mm/swap.c                  |  128 ++++++-----------
>>>>    mm/vmscan.c                |  334
>>>> +++++++++++++++++++-------------------------
>>>>    11 files changed, 554 insertions(+), 517 deletions(-)
>>>
>>> That's a very familiar list of files to me!
>>>
>>> Hugh


^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [PATCH RFC 00/15] mm: memory book keeping and lru_lock splitting
  2012-02-16 21:37     ` Hugh Dickins
  2012-02-17 19:56       ` Konstantin Khlebnikov
@ 2012-02-18  2:13       ` Hugh Dickins
  2012-02-18  6:35         ` Konstantin Khlebnikov
  1 sibling, 1 reply; 31+ messages in thread
From: Hugh Dickins @ 2012-02-18  2:13 UTC (permalink / raw)
  To: Hugh Dickins
  Cc: Konstantin Khlebnikov, KAMEZAWA Hiroyuki, Ying Han, linux-mm,
	linux-kernel

On Thu, 16 Feb 2012, Hugh Dickins wrote:
> 
> Yours are not the only patches I was testing in that tree, I tried to
> gather several other series which I should be reviewing if I ever have
> time: Kamezawa-san's page cgroup diet 6, Xiao Guangrong's 4 prio_tree
> cleanups, your 3 radix_tree changes, your 6 shmem changes, your 4 memcg
> miscellaneous, and then your 15 books.
> 
> The tree before your final 15 did well under pressure, until I tried to
> rmdir one of the cgroups afterwards: then it crashed nastily, I'll have
> to bisect into that, probably either Kamezawa's or your memcg changes.

So far I haven't succeeded in reproducing that at all: it was real,
but obviously harder to get than I assumed - indeed, no good reason
to associate it with any of those patches, might even be in 3.3-rc.

It did involve a NULL pointer dereference in mem_cgroup_page_lruvec(),
somewhere below compact_zone() - but repercussions were causing the
stacktrace to scroll offscreen, so I didn't get good details.

Hugh

^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [PATCH RFC 00/15] mm: memory book keeping and lru_lock splitting
  2012-02-18  2:13       ` Hugh Dickins
@ 2012-02-18  6:35         ` Konstantin Khlebnikov
  2012-02-18  7:14           ` Hugh Dickins
  0 siblings, 1 reply; 31+ messages in thread
From: Konstantin Khlebnikov @ 2012-02-18  6:35 UTC (permalink / raw)
  To: Hugh Dickins; +Cc: KAMEZAWA Hiroyuki, Ying Han, linux-mm, linux-kernel

Hugh Dickins wrote:
> On Thu, 16 Feb 2012, Hugh Dickins wrote:
>>
>> Yours are not the only patches I was testing in that tree, I tried to
>> gather several other series which I should be reviewing if I ever have
>> time: Kamezawa-san's page cgroup diet 6, Xiao Guangrong's 4 prio_tree
>> cleanups, your 3 radix_tree changes, your 6 shmem changes, your 4 memcg
>> miscellaneous, and then your 15 books.
>>
>> The tree before your final 15 did well under pressure, until I tried to
>> rmdir one of the cgroups afterwards: then it crashed nastily, I'll have
>> to bisect into that, probably either Kamezawa's or your memcg changes.
>
> So far I haven't succeeded in reproducing that at all: it was real,
> but obviously harder to get than I assumed - indeed, no good reason
> to associate it with any of those patches, might even be in 3.3-rc.
>
> It did involve a NULL pointer dereference in mem_cgroup_page_lruvec(),
> somewhere below compact_zone() - but repercussions were causing the
> stacktrace to scroll offscreen, so I didn't get good details.

There are some stupid bugs in my v1 patchset; it shouldn't work at all.
I did not expect that someone would try to use it. I sent it just for discussion.

The most destructive bug is this PageCgroupUsed() check below:

+struct book *page_book(struct page *page)
+{
+	struct mem_cgroup_per_zone *mz;
+	struct page_cgroup *pc;
+
+	if (mem_cgroup_disabled())
+		return &page_zone(page)->book;
+
+	pc = lookup_page_cgroup(page);
+	if (!PageCgroupUsed(pc))
+		return &page_zone(page)->book;
+	/* Ensure pc->mem_cgroup is visible after reading PCG_USED. */
+	smp_rmb();
+	mz = mem_cgroup_zoneinfo(pc->mem_cgroup,
+			page_to_nid(page), page_zonenum(page));
+	return &mz->book;
+}

Thus, after a page uncharge, I remove the page from the wrong book, under the wrong lock =)

[ as I wrote, the updated patchset is here: https://github.com/koct9i/linux ]

>
> Hugh


^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [PATCH RFC 00/15] mm: memory book keeping and lru_lock splitting
  2012-02-18  6:35         ` Konstantin Khlebnikov
@ 2012-02-18  7:14           ` Hugh Dickins
  2012-02-20  0:32             ` KAMEZAWA Hiroyuki
  0 siblings, 1 reply; 31+ messages in thread
From: Hugh Dickins @ 2012-02-18  7:14 UTC (permalink / raw)
  To: Konstantin Khlebnikov; +Cc: KAMEZAWA Hiroyuki, Ying Han, linux-mm, linux-kernel

On Sat, 18 Feb 2012, Konstantin Khlebnikov wrote:
> Hugh Dickins wrote:
> > On Thu, 16 Feb 2012, Hugh Dickins wrote:
> > > 
> > > Yours are not the only patches I was testing in that tree, I tried to
> > > gather several other series which I should be reviewing if I ever have
> > > time: Kamezawa-san's page cgroup diet 6, Xiao Guangrong's 4 prio_tree
> > > cleanups, your 3 radix_tree changes, your 6 shmem changes, your 4 memcg
> > > miscellaneous, and then your 15 books.
> > > 
> > > The tree before your final 15 did well under pressure, until I tried to
> > > rmdir one of the cgroups afterwards: then it crashed nastily, I'll have
> > > to bisect into that, probably either Kamezawa's or your memcg changes.
> > 
> > So far I haven't succeeded in reproducing that at all: it was real,
> > but obviously harder to get than I assumed - indeed, no good reason
> > to associate it with any of those patches, might even be in 3.3-rc.
> > 
> > It did involve a NULL pointer dereference in mem_cgroup_page_lruvec(),
> > somewhere below compact_zone() - but repercussions were causing the
> > stacktrace to scroll offscreen, so I didn't get good details.
> 
> There some stupid bugs in my v1 patchset, it shouldn't works at all.
> I did not expect that someone will try to use it. I sent it just to discuss.

Yes, but as I said, that bug appeared before I put your patchset (the 15) on.

Hugh

^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [PATCH RFC 00/15] mm: memory book keeping and lru_lock splitting
  2012-02-16 23:54         ` KAMEZAWA Hiroyuki
@ 2012-02-18  9:09           ` Konstantin Khlebnikov
  0 siblings, 0 replies; 31+ messages in thread
From: Konstantin Khlebnikov @ 2012-02-18  9:09 UTC (permalink / raw)
  To: KAMEZAWA Hiroyuki; +Cc: linux-mm, linux-kernel, Hugh Dickins, hannes

KAMEZAWA Hiroyuki wrote:
> On Thu, 16 Feb 2012 15:02:27 +0400
> Konstantin Khlebnikov<khlebnikov@openvz.org>  wrote:
>
>> KAMEZAWA Hiroyuki wrote:
>>> On Thu, 16 Feb 2012 09:43:52 +0400
>>> Konstantin Khlebnikov<khlebnikov@openvz.org>   wrote:
>>>
>>>> KAMEZAWA Hiroyuki wrote:
>>>>> On Thu, 16 Feb 2012 02:57:04 +0400
>>>>> Konstantin Khlebnikov<khlebnikov@openvz.org>    wrote:
>>>
>>>>>> * optimize page to book translations, move it upper in the call stack,
>>>>>>      replace some struct zone arguments with struct book pointer.
>>>>>>
>>>>>
>>>>> a page->book transrater from patch 2/15
>>>>>
>>>>> +struct book *page_book(struct page *page)
>>>>> +{
>>>>> +	struct mem_cgroup_per_zone *mz;
>>>>> +	struct page_cgroup *pc;
>>>>> +
>>>>> +	if (mem_cgroup_disabled())
>>>>> +		return&page_zone(page)->book;
>>>>> +
>>>>> +	pc = lookup_page_cgroup(page);
>>>>> +	if (!PageCgroupUsed(pc))
>>>>> +		return&page_zone(page)->book;
>>>>> +	/* Ensure pc->mem_cgroup is visible after reading PCG_USED. */
>>>>> +	smp_rmb();
>>>>> +	mz = mem_cgroup_zoneinfo(pc->mem_cgroup,
>>>>> +			page_to_nid(page), page_zonenum(page));
>>>>> +	return&mz->book;
>>>>> +}
>>>>>
>>>>> What happens when pc->mem_cgroup is rewritten by move_account() ?
>>>>> Where is the guard for lockless access of this ?
>>>>
>>>> Initially this suppose to be protected with lru_lock, in final patch they are protected with rcu.
>>>
>>> Hmm, VM_BUG_ON(!PageLRU(page)) ?
>>
>> Where?
>>
>
> You said this is guarded by lru_lock. So, page should be on LRU.

Not exactly; in add_page_to_lru_list() the page is currently not yet on the LRU =)

Plus, we can race with LRU isolation, so such a BUG_ON would be invalid.
All callers must recheck PageLRU() after locking page->book->lru_lock.

My locking uses a combination of PageLRU(), RCU, and a little help from the memcg code:

The page -> book dereference is protected with RCU. If we know the page has a valid mem_cg
pointer, we can pick the book, lock it, and recheck the pointer in a loop. If PageLRU is set
after that, it means we have successfully secured the page in its LRU.

Stability of PageLRU() versus a memcg change is guaranteed by the spin_unlock_wait(old_book->lru_lock)
in mem_cgroup_move_account(), placed between assigning the new pointer and putting the page back onto the new LRU.
Otherwise someone could see PageLRU under the old lru_lock while the page has already been put onto the new LRU under the other lru_lock.

If we do not know whether the page->book pointer is valid (at isolation via pfn in compaction/lumpy reclaim),
we must check PageLRU() first; if it is set, page->book must be valid. And as usual, after taking the lru lock,
PageLRU must be rechecked. The whole operation runs under one RCU read lock, which keeps all these pointers valid.

Plus, memcg destroy waits for an RCU grace period, and then for the lru_lock if it is locked, before freeing
this structure, because every lock holder drops its RCU read lock only after securing the lru_lock.

This scheme looks pretty simple and complete.
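
The "catcher" side, as a sketch (not the exact lock_page_book() from the patchset,
just the idea; lock_page_book_sketch is a made-up name):

/*
 * Sketch: resolve a page to its book and take that book's lru_lock.
 * RCU keeps the book structure alive while we chase the pointer; once
 * the lock is held and the pointer rechecked, the book cannot change
 * under us, and PageLRU() tells whether the page is really on its lists.
 */
static struct book *lock_page_book_sketch(struct page *page,
					   unsigned long *flags)
{
	struct book *book;

	rcu_read_lock();
	for (;;) {
		book = page_book(page);		/* may be switched under us */
		spin_lock_irqsave(&book->lru_lock, *flags);
		if (likely(book == page_book(page)))
			break;			/* pointer is stable now */
		spin_unlock_irqrestore(&book->lru_lock, *flags);
	}
	rcu_read_unlock();	/* the held lru_lock now delays memcg destroy */
	return book;
}

Callers then do:

	book = lock_page_book_sketch(page, &flags);
	if (PageLRU(page)) {
		/* the page is securely on this book's LRU */
	}
	spin_unlock_irqrestore(&book->lru_lock, flags);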

>
>
>
>>>
>>> move_account() overwrites pc->mem_cgroup with isolating page from LRU.
>>> but it doesn't take lru_lock.
>>
>> There three kinds of lock_page_book() users:
>> 1) caller want to catch page in LRU, it will lock either old or new book and
>>      recheck PageLRU() after locking, if page not it in LRU it don't touch anything.
>>      some of these functions has stable reference to page, some of them not.
>>    [ There actually exist small race, I knew about it, just forget to pick this chunk from old code. See below. ]
>> 2) page is isolated by caller, it want to put it back. book link is stable. no problems.
>> 3) page-release functions. page-counter is zero. no references -- no problems.
>>
>> race for 1)
>>
>> catcher					switcher
>>
>> 					# isolate
>> 					old_book = lock_page_book(page)
>> 					ClearPageLRU(page)
>> 					unlock_book(old_book)				
>> 					# charge
>> old_book = lock_page_book(page)		
>> 					# switch
>> 					page->book = new_book
>> 					# putback
>> 					lock_book(new_book)
>> 					SetPageLRU(page)
>> 					unlock_book(new_book)
>> if (PageLRU(page))
>> 	oops, page actually in new_book
>> unlock_book(old_book)
>>
>>
>> I'll protect "switch" phase with old_book lru-lock:
>>
> In linex-next, pc->mem_cgroup is modified only when Page is on LRU.

I hope the page is isolated before this modification? =)

>
> When we need to touch "book", if !PageLRU() ?

Never, but for isolation via pfn we need to check it carefully.

>
>
>> lock_book(old_book)
>> page->book = new_book
>> unlock_book(old_book)
>>
>> The other option is recheck in "catcher" page book after PageLRU()
>> maybe there exists some other variants.
>>
>>> BTW, what amount of perfomance benefit ?
>>
>> It depends, but usually lru_lock is very-very hot.
>> This lock splitting can be used without cgroups and containers,
>> now huge zones can be easily sliced into arbitrary pieces, for example one book per 256Mb.
>>
> I personally think reducing lock contention via pagevecs already works well enough.
> So, I want to see performance numbers on a real machine with real apps.

I separated the locking primitives into a small set of static-inline functions, so we can
switch the locking scheme via a config option. This works without extra overhead: if there
is only one lock per zone, some of those static-inline functions become lockless or empty.

There is another problem with the current per-cpu page vectors: after lru-lock splitting
they will drop and retake different locks more frequently.

Thus, for optimal scalability, these vectors should be split too, or replaced with something else.
For example, lru_add_pvecs could be replaced with a per-book (lruvec) singly-linked list of pending
inserts, spliced in with a cmpxchg-based atomic splice-insert. In combination with pfn-based
zone-book interleaving (the last patch in my github tree) it may be really faster than small
per-cpu page vectors. Adding a page to this pending list does not even require elevating its
refcount: if the count drops to zero while the page is still on the pending list, put_page()
can just commit the list into the LRU.
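
Purely as an illustration of that idea (nothing like this is in the patchset, and
book->pending is an invented field), such a pending list could look roughly like:

/*
 * Sketch of a per-book lock-free list of pages waiting to be added to
 * the LRU.  "pending" is an invented field: a singly-linked chain
 * threaded through page->lru.next while the page is off the real LRU.
 */
static void book_add_pending(struct book *book, struct page *page)
{
	struct page *first;

	do {
		first = ACCESS_ONCE(book->pending);
		page->lru.next = first ? &first->lru : NULL;
	} while (cmpxchg(&book->pending, first, page) != first);
}

static void book_commit_pending(struct book *book)
{
	/* grab the whole chain in one atomic step, then splice under the lock */
	struct page *page = xchg(&book->pending, NULL);

	spin_lock_irq(&book->lru_lock);
	while (page) {
		struct page *next = page->lru.next ?
			list_entry(page->lru.next, struct page, lru) : NULL;

		SetPageLRU(page);
		add_page_to_lru_list(page, book, page_lru(page));
		page = next;
	}
	spin_unlock_irq(&book->lru_lock);
}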

>
>
>>
>> According to my experience, one of complicated thing there is how to postpone "book" destroying
>> if some its pages are isolated. For example lumpy reclaim and memory compaction isolates pages
>> from several books. And they wants to put them back. Currently this can be broken, if someone removes
>> cgroup in wrong moment. There appears funny races with three players: catcher, switcher and destroyer.
>
> Thank you for pointing out. Hmm... it can happen ? Currently, at cgroup destroying,
> force_empty() works
>
>    1. find a page from LRU
>    2. remove it from LRU
>    3. move it or reclaim it (you said "switcher")
>    4. if res.usage != 0 goto 1.
>
> I think "4" will finally keep cgroup from being destroyed.

Ok, seems so.

>
>
>> This can be fixed with some extra reference-counting or some other sleepable synchronizing.
>> In my rhel6-based implementation I uses extra reference-counting, and it looks ugly. So I want to invent something better.
>> Other option is just never release books, reuse them after rcu grace period for rcu-list iterating.
>>
>
> Another reference counting is very very bad.

Ack)

>
>
>
> Thanks,
> -Kame
>
>
>
>
>
> --
> To unsubscribe, send a message with 'unsubscribe linux-mm' in
> the body to majordomo@kvack.org.  For more info on Linux MM,
> see: http://www.linux-mm.org/ .
> Fight unfair telecom internet charges in Canada: sign http://stopthemeter.ca/
> Don't email:<a href=mailto:"dont@kvack.org">  email@kvack.org</a>


^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [PATCH RFC 00/15] mm: memory book keeping and lru_lock splitting
  2012-02-18  7:14           ` Hugh Dickins
@ 2012-02-20  0:32             ` KAMEZAWA Hiroyuki
  0 siblings, 0 replies; 31+ messages in thread
From: KAMEZAWA Hiroyuki @ 2012-02-20  0:32 UTC (permalink / raw)
  To: Hugh Dickins; +Cc: Konstantin Khlebnikov, Ying Han, linux-mm, linux-kernel

On Fri, 17 Feb 2012 23:14:01 -0800 (PST)
Hugh Dickins <hughd@google.com> wrote:

> On Sat, 18 Feb 2012, Konstantin Khlebnikov wrote:
> > Hugh Dickins wrote:
> > > On Thu, 16 Feb 2012, Hugh Dickins wrote:
> > > > 
> > > > Yours are not the only patches I was testing in that tree, I tried to
> > > > gather several other series which I should be reviewing if I ever have
> > > > time: Kamezawa-san's page cgroup diet 6, Xiao Guangrong's 4 prio_tree
> > > > cleanups, your 3 radix_tree changes, your 6 shmem changes, your 4 memcg
> > > > miscellaneous, and then your 15 books.
> > > > 
> > > > The tree before your final 15 did well under pressure, until I tried to
> > > > rmdir one of the cgroups afterwards: then it crashed nastily, I'll have
> > > > to bisect into that, probably either Kamezawa's or your memcg changes.
> > > 
> > > So far I haven't succeeded in reproducing that at all: it was real,
> > > but obviously harder to get than I assumed - indeed, no good reason
> > > to associate it with any of those patches, might even be in 3.3-rc.
> > > 
> > > It did involve a NULL pointer dereference in mem_cgroup_page_lruvec(),
> > > somewhere below compact_zone() - but repercussions were causing the
> > > stacktrace to scroll offscreen, so I didn't get good details.
> > 
> > There some stupid bugs in my v1 patchset, it shouldn't works at all.
> > I did not expect that someone will try to use it. I sent it just to discuss.
> 
> Yes, but as I said, that bug appeared before I put your patchset (the 15) on.
> 

Hm, a NULL pointer dereference in mem_cgroup_page_lruvec() via compaction tends to
mean pc->mem_cgroup was NULL...

IIUC,

- compaction takes pages from the LRU list and isolates/migrates them. So, when pages
  on the LRU are migrated by compact_zone(), pc->mem_cgroup should never be NULL.
- All newly allocated pages for migration will be reset by mem_cgroup_reset_owner().

Hm, something unexpected is happening...

Regards,
-Kame
 


^ permalink raw reply	[flat|nested] 31+ messages in thread

end of thread, other threads:[~2012-02-20  0:33 UTC | newest]

Thread overview: 31+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2012-02-15 22:57 [PATCH RFC 00/15] mm: memory book keeping and lru_lock splitting Konstantin Khlebnikov
2012-02-15 22:57 ` [PATCH RFC 01/15] mm: rename struct lruvec into struct book Konstantin Khlebnikov
2012-02-15 22:57 ` [PATCH RFC 02/15] mm: memory bookkeeping core Konstantin Khlebnikov
2012-02-15 22:57 ` [PATCH RFC 03/15] mm: add book->pages_count Konstantin Khlebnikov
2012-02-15 22:57 ` [PATCH RFC 04/15] mm: unify inactive_list_is_low() Konstantin Khlebnikov
2012-02-15 22:57 ` [PATCH RFC 05/15] mm: add book->reclaim_stat Konstantin Khlebnikov
2012-02-15 22:57 ` [PATCH RFC 06/15] mm: kill struct mem_cgroup_zone Konstantin Khlebnikov
2012-02-15 22:57 ` [PATCH RFC 07/15] mm: move page-to-book translation upper Konstantin Khlebnikov
2012-02-15 22:57 ` [PATCH RFC 08/15] mm: introduce book locking primitives Konstantin Khlebnikov
2012-02-15 22:57 ` [PATCH RFC 09/15] mm: handle book relocks on lumpy reclaim Konstantin Khlebnikov
2012-02-15 22:57 ` [PATCH RFC 10/15] mm: handle book relocks in compaction Konstantin Khlebnikov
2012-02-15 22:57 ` [PATCH RFC 11/15] mm: handle book relock in memory controller Konstantin Khlebnikov
2012-02-15 22:57 ` [PATCH RFC 12/15] mm: optimize books in update_page_reclaim_stat() Konstantin Khlebnikov
2012-02-15 22:57 ` [PATCH RFC 13/15] mm: optimize books in pagevec_lru_move_fn() Konstantin Khlebnikov
2012-02-15 22:57 ` [PATCH RFC 14/15] mm: optimize putback for 0-order reclaim Konstantin Khlebnikov
2012-02-15 22:58 ` [PATCH RFC 15/15] mm: split zone->lru_lock Konstantin Khlebnikov
2012-02-16  2:04 ` [PATCH RFC 00/15] mm: memory book keeping and lru_lock splitting KAMEZAWA Hiroyuki
2012-02-16  5:43   ` Konstantin Khlebnikov
2012-02-16  8:24     ` KAMEZAWA Hiroyuki
2012-02-16 11:02       ` Konstantin Khlebnikov
2012-02-16 15:54         ` Konstantin Khlebnikov
2012-02-16 23:54         ` KAMEZAWA Hiroyuki
2012-02-18  9:09           ` Konstantin Khlebnikov
2012-02-16  2:37 ` Hugh Dickins
2012-02-16  4:51   ` Konstantin Khlebnikov
2012-02-16 21:37     ` Hugh Dickins
2012-02-17 19:56       ` Konstantin Khlebnikov
2012-02-18  2:13       ` Hugh Dickins
2012-02-18  6:35         ` Konstantin Khlebnikov
2012-02-18  7:14           ` Hugh Dickins
2012-02-20  0:32             ` KAMEZAWA Hiroyuki

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).