* [PATCH 0/5] mm: workingset: per-cgroup thrash detection
From: Johannes Weiner @ 2016-01-26 21:00 UTC
  To: Andrew Morton
  Cc: Vladimir Davydov, Michal Hocko, linux-mm, cgroups, linux-kernel,
	kernel-team

Hi,

these patches make the workingset code cgroup-aware, so that page
reclaim works properly when using the cgroup memory controller. More
details in the 5/5 changelog.

This should have been part of the original thrash detection patches,
but those were already too complex. So here we go.

Thanks,
Johannes

 fs/buffer.c                |  14 ++--
 fs/xfs/xfs_aops.c          |   8 +--
 include/linux/memcontrol.h |  55 ++++++++++++++--
 include/linux/mmzone.h     |  11 ++--
 include/linux/swap.h       |   1 +
 mm/filemap.c               |  12 ++--
 mm/memcontrol.c            |  59 ++++-------------
 mm/page-writeback.c        |  28 ++++----
 mm/rmap.c                  |   8 +--
 mm/truncate.c              |   6 +-
 mm/vmscan.c                |  26 ++++----
 mm/workingset.c            | 151 ++++++++++++++++++++++++++++++++-----------
 12 files changed, 236 insertions(+), 143 deletions(-)

* [PATCH 1/5] mm: memcontrol: generalize locking for the page->mem_cgroup binding
From: Johannes Weiner @ 2016-01-26 21:00 UTC
  To: Andrew Morton
  Cc: Vladimir Davydov, Michal Hocko, linux-mm, cgroups, linux-kernel,
	kernel-team

So far, the only sites that needed to exclude charge migration in
order to stabilize page->mem_cgroup have been the per-cgroup page
statistics updates, hence the name mem_cgroup_begin_page_stat(). But
per-cgroup thrash detection will add another site that needs to ensure
page->mem_cgroup lifetime.

Rename these locking functions to the more generic lock_page_memcg()
and unlock_page_memcg(). Since charge migration is a cgroup1-only
feature, we might be able to delete it at some point, and these now
easily identifiable locking sites along with it.
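
For reference, a stat-updating call site after the rename follows this
pattern (adapted from the updated kerneldoc in the hunk below;
TestClearPageState stands in for whatever page flag the counter is
tied to):

        memcg = lock_page_memcg(page);          /* pins page->mem_cgroup */
        if (TestClearPageState(page))
                mem_cgroup_update_page_stat(memcg, state, -1);
        unlock_page_memcg(memcg);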

Suggested-by: Vladimir Davydov <vdavydov@virtuozzo.com>
Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
---
 fs/buffer.c                | 14 +++++++-------
 fs/xfs/xfs_aops.c          |  8 ++++----
 include/linux/memcontrol.h | 16 +++++++++++-----
 mm/filemap.c               | 12 ++++++------
 mm/memcontrol.c            | 34 ++++++++++++++--------------------
 mm/page-writeback.c        | 28 ++++++++++++++--------------
 mm/rmap.c                  |  8 ++++----
 mm/truncate.c              |  6 +++---
 mm/vmscan.c                |  8 ++++----
 9 files changed, 67 insertions(+), 67 deletions(-)

diff --git a/fs/buffer.c b/fs/buffer.c
index e1632abb4ca9..dc991510bb06 100644
--- a/fs/buffer.c
+++ b/fs/buffer.c
@@ -621,7 +621,7 @@ EXPORT_SYMBOL(mark_buffer_dirty_inode);
  * If warn is true, then emit a warning if the page is not uptodate and has
  * not been truncated.
  *
- * The caller must hold mem_cgroup_begin_page_stat() lock.
+ * The caller must hold lock_page_memcg().
  */
 static void __set_page_dirty(struct page *page, struct address_space *mapping,
 			     struct mem_cgroup *memcg, int warn)
@@ -683,17 +683,17 @@ int __set_page_dirty_buffers(struct page *page)
 		} while (bh != head);
 	}
 	/*
-	 * Use mem_group_begin_page_stat() to keep PageDirty synchronized with
-	 * per-memcg dirty page counters.
+	 * Lock out page->mem_cgroup migration to keep PageDirty
+	 * synchronized with per-memcg dirty page counters.
 	 */
-	memcg = mem_cgroup_begin_page_stat(page);
+	memcg = lock_page_memcg(page);
 	newly_dirty = !TestSetPageDirty(page);
 	spin_unlock(&mapping->private_lock);
 
 	if (newly_dirty)
 		__set_page_dirty(page, mapping, memcg, 1);
 
-	mem_cgroup_end_page_stat(memcg);
+	unlock_page_memcg(memcg);
 
 	if (newly_dirty)
 		__mark_inode_dirty(mapping->host, I_DIRTY_PAGES);
@@ -1169,13 +1169,13 @@ void mark_buffer_dirty(struct buffer_head *bh)
 		struct address_space *mapping = NULL;
 		struct mem_cgroup *memcg;
 
-		memcg = mem_cgroup_begin_page_stat(page);
+		memcg = lock_page_memcg(page);
 		if (!TestSetPageDirty(page)) {
 			mapping = page_mapping(page);
 			if (mapping)
 				__set_page_dirty(page, mapping, memcg, 0);
 		}
-		mem_cgroup_end_page_stat(memcg);
+		unlock_page_memcg(memcg);
 		if (mapping)
 			__mark_inode_dirty(mapping->host, I_DIRTY_PAGES);
 	}
diff --git a/fs/xfs/xfs_aops.c b/fs/xfs/xfs_aops.c
index 379c089fb051..28eb6f535ca9 100644
--- a/fs/xfs/xfs_aops.c
+++ b/fs/xfs/xfs_aops.c
@@ -1974,10 +1974,10 @@ xfs_vm_set_page_dirty(
 		} while (bh != head);
 	}
 	/*
-	 * Use mem_group_begin_page_stat() to keep PageDirty synchronized with
-	 * per-memcg dirty page counters.
+	 * Lock out page->mem_cgroup migration to keep PageDirty
+	 * synchronized with per-memcg dirty page counters.
 	 */
-	memcg = mem_cgroup_begin_page_stat(page);
+	memcg = lock_page_memcg(page);
 	newly_dirty = !TestSetPageDirty(page);
 	spin_unlock(&mapping->private_lock);
 
@@ -1994,7 +1994,7 @@ xfs_vm_set_page_dirty(
 		}
 		spin_unlock_irqrestore(&mapping->tree_lock, flags);
 	}
-	mem_cgroup_end_page_stat(memcg);
+	unlock_page_memcg(memcg);
 	if (newly_dirty)
 		__mark_inode_dirty(mapping->host, I_DIRTY_PAGES);
 	return newly_dirty;
diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
index 792c8981e633..2685aaf91511 100644
--- a/include/linux/memcontrol.h
+++ b/include/linux/memcontrol.h
@@ -429,8 +429,8 @@ bool mem_cgroup_oom_synchronize(bool wait);
 extern int do_swap_account;
 #endif
 
-struct mem_cgroup *mem_cgroup_begin_page_stat(struct page *page);
-void mem_cgroup_end_page_stat(struct mem_cgroup *memcg);
+struct mem_cgroup *lock_page_memcg(struct page *page);
+void unlock_page_memcg(struct mem_cgroup *memcg);
 
 /**
  * mem_cgroup_update_page_stat - update page state statistics
@@ -438,7 +438,13 @@ void mem_cgroup_end_page_stat(struct mem_cgroup *memcg);
  * @idx: page state item to account
  * @val: number of pages (positive or negative)
  *
- * See mem_cgroup_begin_page_stat() for locking requirements.
+ * Callers must use lock_page_memcg() to prevent double accounting
+ * when the page is concurrently being moved to another memcg:
+ *
+ *   memcg = lock_page_memcg(page);
+ *   if (TestClearPageState(page))
+ *     mem_cgroup_update_page_stat(memcg, state, -1);
+ *   unlock_page_memcg(memcg);
  */
 static inline void mem_cgroup_update_page_stat(struct mem_cgroup *memcg,
 				 enum mem_cgroup_stat_index idx, int val)
@@ -613,12 +619,12 @@ mem_cgroup_print_oom_info(struct mem_cgroup *memcg, struct task_struct *p)
 {
 }
 
-static inline struct mem_cgroup *mem_cgroup_begin_page_stat(struct page *page)
+static inline struct mem_cgroup *lock_page_memcg(struct page *page)
 {
 	return NULL;
 }
 
-static inline void mem_cgroup_end_page_stat(struct mem_cgroup *memcg)
+static inline void unlock_page_memcg(struct mem_cgroup *memcg)
 {
 }
 
diff --git a/mm/filemap.c b/mm/filemap.c
index 0720c9d19365..f812976350ca 100644
--- a/mm/filemap.c
+++ b/mm/filemap.c
@@ -101,7 +101,7 @@
  *    ->tree_lock		(page_remove_rmap->set_page_dirty)
  *    bdi.wb->list_lock		(page_remove_rmap->set_page_dirty)
  *    ->inode->i_lock		(page_remove_rmap->set_page_dirty)
- *    ->memcg->move_lock	(page_remove_rmap->mem_cgroup_begin_page_stat)
+ *    ->memcg->move_lock	(page_remove_rmap->lock_page_memcg)
  *    bdi.wb->list_lock		(zap_pte_range->set_page_dirty)
  *    ->inode->i_lock		(zap_pte_range->set_page_dirty)
  *    ->private_lock		(zap_pte_range->__set_page_dirty_buffers)
@@ -177,7 +177,7 @@ static void page_cache_tree_delete(struct address_space *mapping,
  * Delete a page from the page cache and free it. Caller has to make
  * sure the page is locked and that nobody else uses it - or that usage
  * is safe.  The caller must hold the mapping's tree_lock and
- * mem_cgroup_begin_page_stat().
+ * lock_page_memcg().
  */
 void __delete_from_page_cache(struct page *page, void *shadow,
 			      struct mem_cgroup *memcg)
@@ -240,11 +240,11 @@ void delete_from_page_cache(struct page *page)
 
 	freepage = mapping->a_ops->freepage;
 
-	memcg = mem_cgroup_begin_page_stat(page);
+	memcg = lock_page_memcg(page);
 	spin_lock_irqsave(&mapping->tree_lock, flags);
 	__delete_from_page_cache(page, NULL, memcg);
 	spin_unlock_irqrestore(&mapping->tree_lock, flags);
-	mem_cgroup_end_page_stat(memcg);
+	unlock_page_memcg(memcg);
 
 	if (freepage)
 		freepage(page);
@@ -542,7 +542,7 @@ int replace_page_cache_page(struct page *old, struct page *new, gfp_t gfp_mask)
 		new->mapping = mapping;
 		new->index = offset;
 
-		memcg = mem_cgroup_begin_page_stat(old);
+		memcg = lock_page_memcg(old);
 		spin_lock_irqsave(&mapping->tree_lock, flags);
 		__delete_from_page_cache(old, NULL, memcg);
 		error = radix_tree_insert(&mapping->page_tree, offset, new);
@@ -557,7 +557,7 @@ int replace_page_cache_page(struct page *old, struct page *new, gfp_t gfp_mask)
 		if (PageSwapBacked(new))
 			__inc_zone_page_state(new, NR_SHMEM);
 		spin_unlock_irqrestore(&mapping->tree_lock, flags);
-		mem_cgroup_end_page_stat(memcg);
+		unlock_page_memcg(memcg);
 		mem_cgroup_replace_page(old, new);
 		radix_tree_preload_end();
 		if (freepage)
diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index d06cae2de783..953f0f984392 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -1709,19 +1709,13 @@ cleanup:
 }
 
 /**
- * mem_cgroup_begin_page_stat - begin a page state statistics transaction
- * @page: page that is going to change accounted state
- *
- * This function must mark the beginning of an accounted page state
- * change to prevent double accounting when the page is concurrently
- * being moved to another memcg:
+ * lock_page_memcg - lock a page->mem_cgroup binding
+ * @page: the page
  *
- *   memcg = mem_cgroup_begin_page_stat(page);
- *   if (TestClearPageState(page))
- *     mem_cgroup_update_page_stat(memcg, state, -1);
- *   mem_cgroup_end_page_stat(memcg);
+ * This function protects unlocked LRU pages from being moved to
+ * another cgroup and stabilizes their page->mem_cgroup binding.
  */
-struct mem_cgroup *mem_cgroup_begin_page_stat(struct page *page)
+struct mem_cgroup *lock_page_memcg(struct page *page)
 {
 	struct mem_cgroup *memcg;
 	unsigned long flags;
@@ -1759,20 +1753,20 @@ again:
 	/*
 	 * When charge migration first begins, we can have locked and
 	 * unlocked page stat updates happening concurrently.  Track
-	 * the task who has the lock for mem_cgroup_end_page_stat().
+	 * the task who has the lock for unlock_page_memcg().
 	 */
 	memcg->move_lock_task = current;
 	memcg->move_lock_flags = flags;
 
 	return memcg;
 }
-EXPORT_SYMBOL(mem_cgroup_begin_page_stat);
+EXPORT_SYMBOL(lock_page_memcg);
 
 /**
- * mem_cgroup_end_page_stat - finish a page state statistics transaction
- * @memcg: the memcg that was accounted against
+ * unlock_page_memcg - unlock a page->mem_cgroup binding
+ * @memcg: the memcg returned by lock_page_memcg()
  */
-void mem_cgroup_end_page_stat(struct mem_cgroup *memcg)
+void unlock_page_memcg(struct mem_cgroup *memcg)
 {
 	if (memcg && memcg->move_lock_task == current) {
 		unsigned long flags = memcg->move_lock_flags;
@@ -1785,7 +1779,7 @@ void mem_cgroup_end_page_stat(struct mem_cgroup *memcg)
 
 	rcu_read_unlock();
 }
-EXPORT_SYMBOL(mem_cgroup_end_page_stat);
+EXPORT_SYMBOL(unlock_page_memcg);
 
 /*
  * size of first charge trial. "32" comes from vmscan.c's magic value.
@@ -4923,9 +4917,9 @@ static void mem_cgroup_move_charge(struct mm_struct *mm)
 
 	lru_add_drain_all();
 	/*
-	 * Signal mem_cgroup_begin_page_stat() to take the memcg's
-	 * move_lock while we're moving its pages to another memcg.
-	 * Then wait for already started RCU-only updates to finish.
+	 * Signal lock_page_memcg() to take the memcg's move_lock
+	 * while we're moving its pages to another memcg. Then wait
+	 * for already started RCU-only updates to finish.
 	 */
 	atomic_inc(&mc.from->moving_account);
 	synchronize_rcu();
diff --git a/mm/page-writeback.c b/mm/page-writeback.c
index d782cbab735a..2b5ea1271e32 100644
--- a/mm/page-writeback.c
+++ b/mm/page-writeback.c
@@ -2410,7 +2410,7 @@ int __set_page_dirty_no_writeback(struct page *page)
 /*
  * Helper function for set_page_dirty family.
  *
- * Caller must hold mem_cgroup_begin_page_stat().
+ * Caller must hold lock_page_memcg().
  *
  * NOTE: This relies on being atomic wrt interrupts.
  */
@@ -2442,7 +2442,7 @@ EXPORT_SYMBOL(account_page_dirtied);
 /*
  * Helper function for deaccounting dirty page without writeback.
  *
- * Caller must hold mem_cgroup_begin_page_stat().
+ * Caller must hold lock_page_memcg().
  */
 void account_page_cleaned(struct page *page, struct address_space *mapping,
 			  struct mem_cgroup *memcg, struct bdi_writeback *wb)
@@ -2471,13 +2471,13 @@ int __set_page_dirty_nobuffers(struct page *page)
 {
 	struct mem_cgroup *memcg;
 
-	memcg = mem_cgroup_begin_page_stat(page);
+	memcg = lock_page_memcg(page);
 	if (!TestSetPageDirty(page)) {
 		struct address_space *mapping = page_mapping(page);
 		unsigned long flags;
 
 		if (!mapping) {
-			mem_cgroup_end_page_stat(memcg);
+			unlock_page_memcg(memcg);
 			return 1;
 		}
 
@@ -2488,7 +2488,7 @@ int __set_page_dirty_nobuffers(struct page *page)
 		radix_tree_tag_set(&mapping->page_tree, page_index(page),
 				   PAGECACHE_TAG_DIRTY);
 		spin_unlock_irqrestore(&mapping->tree_lock, flags);
-		mem_cgroup_end_page_stat(memcg);
+		unlock_page_memcg(memcg);
 
 		if (mapping->host) {
 			/* !PageAnon && !swapper_space */
@@ -2496,7 +2496,7 @@ int __set_page_dirty_nobuffers(struct page *page)
 		}
 		return 1;
 	}
-	mem_cgroup_end_page_stat(memcg);
+	unlock_page_memcg(memcg);
 	return 0;
 }
 EXPORT_SYMBOL(__set_page_dirty_nobuffers);
@@ -2629,14 +2629,14 @@ void cancel_dirty_page(struct page *page)
 		struct mem_cgroup *memcg;
 		bool locked;
 
-		memcg = mem_cgroup_begin_page_stat(page);
+		memcg = lock_page_memcg(page);
 		wb = unlocked_inode_to_wb_begin(inode, &locked);
 
 		if (TestClearPageDirty(page))
 			account_page_cleaned(page, mapping, memcg, wb);
 
 		unlocked_inode_to_wb_end(inode, locked);
-		mem_cgroup_end_page_stat(memcg);
+		unlock_page_memcg(memcg);
 	} else {
 		ClearPageDirty(page);
 	}
@@ -2705,7 +2705,7 @@ int clear_page_dirty_for_io(struct page *page)
 		 * always locked coming in here, so we get the desired
 		 * exclusion.
 		 */
-		memcg = mem_cgroup_begin_page_stat(page);
+		memcg = lock_page_memcg(page);
 		wb = unlocked_inode_to_wb_begin(inode, &locked);
 		if (TestClearPageDirty(page)) {
 			mem_cgroup_dec_page_stat(memcg, MEM_CGROUP_STAT_DIRTY);
@@ -2714,7 +2714,7 @@ int clear_page_dirty_for_io(struct page *page)
 			ret = 1;
 		}
 		unlocked_inode_to_wb_end(inode, locked);
-		mem_cgroup_end_page_stat(memcg);
+		unlock_page_memcg(memcg);
 		return ret;
 	}
 	return TestClearPageDirty(page);
@@ -2727,7 +2727,7 @@ int test_clear_page_writeback(struct page *page)
 	struct mem_cgroup *memcg;
 	int ret;
 
-	memcg = mem_cgroup_begin_page_stat(page);
+	memcg = lock_page_memcg(page);
 	if (mapping) {
 		struct inode *inode = mapping->host;
 		struct backing_dev_info *bdi = inode_to_bdi(inode);
@@ -2755,7 +2755,7 @@ int test_clear_page_writeback(struct page *page)
 		dec_zone_page_state(page, NR_WRITEBACK);
 		inc_zone_page_state(page, NR_WRITTEN);
 	}
-	mem_cgroup_end_page_stat(memcg);
+	unlock_page_memcg(memcg);
 	return ret;
 }
 
@@ -2765,7 +2765,7 @@ int __test_set_page_writeback(struct page *page, bool keep_write)
 	struct mem_cgroup *memcg;
 	int ret;
 
-	memcg = mem_cgroup_begin_page_stat(page);
+	memcg = lock_page_memcg(page);
 	if (mapping) {
 		struct inode *inode = mapping->host;
 		struct backing_dev_info *bdi = inode_to_bdi(inode);
@@ -2796,7 +2796,7 @@ int __test_set_page_writeback(struct page *page, bool keep_write)
 		mem_cgroup_inc_page_stat(memcg, MEM_CGROUP_STAT_WRITEBACK);
 		inc_zone_page_state(page, NR_WRITEBACK);
 	}
-	mem_cgroup_end_page_stat(memcg);
+	unlock_page_memcg(memcg);
 	return ret;
 
 }
diff --git a/mm/rmap.c b/mm/rmap.c
index 79f3bf047f38..2871e7d3cced 100644
--- a/mm/rmap.c
+++ b/mm/rmap.c
@@ -1289,19 +1289,19 @@ void page_add_file_rmap(struct page *page)
 {
 	struct mem_cgroup *memcg;
 
-	memcg = mem_cgroup_begin_page_stat(page);
+	memcg = lock_page_memcg(page);
 	if (atomic_inc_and_test(&page->_mapcount)) {
 		__inc_zone_page_state(page, NR_FILE_MAPPED);
 		mem_cgroup_inc_page_stat(memcg, MEM_CGROUP_STAT_FILE_MAPPED);
 	}
-	mem_cgroup_end_page_stat(memcg);
+	unlock_page_memcg(memcg);
 }
 
 static void page_remove_file_rmap(struct page *page)
 {
 	struct mem_cgroup *memcg;
 
-	memcg = mem_cgroup_begin_page_stat(page);
+	memcg = lock_page_memcg(page);
 
 	/* Hugepages are not counted in NR_FILE_MAPPED for now. */
 	if (unlikely(PageHuge(page))) {
@@ -1325,7 +1325,7 @@ static void page_remove_file_rmap(struct page *page)
 	if (unlikely(PageMlocked(page)))
 		clear_page_mlock(page);
 out:
-	mem_cgroup_end_page_stat(memcg);
+	unlock_page_memcg(memcg);
 }
 
 static void page_remove_anon_compound_rmap(struct page *page)
diff --git a/mm/truncate.c b/mm/truncate.c
index e3ee0e27cd17..51a24f6a555d 100644
--- a/mm/truncate.c
+++ b/mm/truncate.c
@@ -528,7 +528,7 @@ invalidate_complete_page2(struct address_space *mapping, struct page *page)
 	if (page_has_private(page) && !try_to_release_page(page, GFP_KERNEL))
 		return 0;
 
-	memcg = mem_cgroup_begin_page_stat(page);
+	memcg = lock_page_memcg(page);
 	spin_lock_irqsave(&mapping->tree_lock, flags);
 	if (PageDirty(page))
 		goto failed;
@@ -536,7 +536,7 @@ invalidate_complete_page2(struct address_space *mapping, struct page *page)
 	BUG_ON(page_has_private(page));
 	__delete_from_page_cache(page, NULL, memcg);
 	spin_unlock_irqrestore(&mapping->tree_lock, flags);
-	mem_cgroup_end_page_stat(memcg);
+	unlock_page_memcg(memcg);
 
 	if (mapping->a_ops->freepage)
 		mapping->a_ops->freepage(page);
@@ -545,7 +545,7 @@ invalidate_complete_page2(struct address_space *mapping, struct page *page)
 	return 1;
 failed:
 	spin_unlock_irqrestore(&mapping->tree_lock, flags);
-	mem_cgroup_end_page_stat(memcg);
+	unlock_page_memcg(memcg);
 	return 0;
 }
 
diff --git a/mm/vmscan.c b/mm/vmscan.c
index 30e0cd7a0ceb..45771323e863 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -616,7 +616,7 @@ static int __remove_mapping(struct address_space *mapping, struct page *page,
 	BUG_ON(!PageLocked(page));
 	BUG_ON(mapping != page_mapping(page));
 
-	memcg = mem_cgroup_begin_page_stat(page);
+	memcg = lock_page_memcg(page);
 	spin_lock_irqsave(&mapping->tree_lock, flags);
 	/*
 	 * The non racy check for a busy page.
@@ -656,7 +656,7 @@ static int __remove_mapping(struct address_space *mapping, struct page *page,
 		mem_cgroup_swapout(page, swap);
 		__delete_from_swap_cache(page);
 		spin_unlock_irqrestore(&mapping->tree_lock, flags);
-		mem_cgroup_end_page_stat(memcg);
+		unlock_page_memcg(memcg);
 		swapcache_free(swap);
 	} else {
 		void (*freepage)(struct page *);
@@ -684,7 +684,7 @@ static int __remove_mapping(struct address_space *mapping, struct page *page,
 			shadow = workingset_eviction(mapping, page);
 		__delete_from_page_cache(page, shadow, memcg);
 		spin_unlock_irqrestore(&mapping->tree_lock, flags);
-		mem_cgroup_end_page_stat(memcg);
+		unlock_page_memcg(memcg);
 
 		if (freepage != NULL)
 			freepage(page);
@@ -694,7 +694,7 @@ static int __remove_mapping(struct address_space *mapping, struct page *page,
 
 cannot_free:
 	spin_unlock_irqrestore(&mapping->tree_lock, flags);
-	mem_cgroup_end_page_stat(memcg);
+	unlock_page_memcg(memcg);
 	return 0;
 }
 
-- 
2.7.0

* [PATCH 2/5] mm: workingset: #define radix entry eviction mask
From: Johannes Weiner @ 2016-01-26 21:00 UTC
  To: Andrew Morton
  Cc: Vladimir Davydov, Michal Hocko, linux-mm, cgroups, linux-kernel,
	kernel-team

This is a compile-time constant, no need to calculate it on refault.

Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
---
 mm/workingset.c | 10 ++++++----
 1 file changed, 6 insertions(+), 4 deletions(-)

diff --git a/mm/workingset.c b/mm/workingset.c
index 61ead9e5549d..3ef92f6e41fe 100644
--- a/mm/workingset.c
+++ b/mm/workingset.c
@@ -152,6 +152,10 @@
  * refault distance will immediately activate the refaulting page.
  */
 
+#define EVICTION_SHIFT	(RADIX_TREE_EXCEPTIONAL_ENTRY + \
+			 ZONES_SHIFT + NODES_SHIFT)
+#define EVICTION_MASK	(~0UL >> EVICTION_SHIFT)
+
 static void *pack_shadow(unsigned long eviction, struct zone *zone)
 {
 	eviction = (eviction << NODES_SHIFT) | zone_to_nid(zone);
@@ -168,7 +172,6 @@ static void unpack_shadow(void *shadow,
 	unsigned long entry = (unsigned long)shadow;
 	unsigned long eviction;
 	unsigned long refault;
-	unsigned long mask;
 	int zid, nid;
 
 	entry >>= RADIX_TREE_EXCEPTIONAL_SHIFT;
@@ -181,8 +184,7 @@ static void unpack_shadow(void *shadow,
 	*zone = NODE_DATA(nid)->node_zones + zid;
 
 	refault = atomic_long_read(&(*zone)->inactive_age);
-	mask = ~0UL >> (NODES_SHIFT + ZONES_SHIFT +
-			RADIX_TREE_EXCEPTIONAL_SHIFT);
+
 	/*
 	 * The unsigned subtraction here gives an accurate distance
 	 * across inactive_age overflows in most cases.
@@ -199,7 +201,7 @@ static void unpack_shadow(void *shadow,
 	 * inappropriate activation leading to pressure on the active
 	 * list is not a problem.
 	 */
-	*distance = (refault - eviction) & mask;
+	*distance = (refault - eviction) & EVICTION_MASK;
 }
 
 /**
-- 
2.7.0

* [PATCH 3/5] mm: workingset: separate shadow unpacking and refault calculation
From: Johannes Weiner @ 2016-01-26 21:00 UTC
  To: Andrew Morton
  Cc: Vladimir Davydov, Michal Hocko, linux-mm, cgroups, linux-kernel,
	kernel-team

Per-cgroup thrash detection will need to derive a live memcg from the
eviction cookie, and doing that inside unpack_shadow() will get nasty
with the reference handling spread over two functions.

In preparation, reduce unpack_shadow() to extracting the statically
packed data from the radix tree entry, and let workingset_refault() do
all the higher-level handling.
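
Condensed from the hunks below, the refault path then reads roughly:

        unpack_shadow(shadow, &zone, &eviction);
        refault = atomic_long_read(&zone->inactive_age);
        refault_distance = (refault - eviction) & EVICTION_MASK;

with the activation decision based on refault_distance as before.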

Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
---
 mm/workingset.c | 56 ++++++++++++++++++++++++++++----------------------------
 1 file changed, 28 insertions(+), 28 deletions(-)

diff --git a/mm/workingset.c b/mm/workingset.c
index 3ef92f6e41fe..f874b2c663e3 100644
--- a/mm/workingset.c
+++ b/mm/workingset.c
@@ -165,13 +165,10 @@ static void *pack_shadow(unsigned long eviction, struct zone *zone)
 	return (void *)(eviction | RADIX_TREE_EXCEPTIONAL_ENTRY);
 }
 
-static void unpack_shadow(void *shadow,
-			  struct zone **zone,
-			  unsigned long *distance)
+static void unpack_shadow(void *shadow, struct zone **zonep,
+			  unsigned long *evictionp)
 {
 	unsigned long entry = (unsigned long)shadow;
-	unsigned long eviction;
-	unsigned long refault;
 	int zid, nid;
 
 	entry >>= RADIX_TREE_EXCEPTIONAL_SHIFT;
@@ -179,29 +176,9 @@ static void unpack_shadow(void *shadow,
 	entry >>= ZONES_SHIFT;
 	nid = entry & ((1UL << NODES_SHIFT) - 1);
 	entry >>= NODES_SHIFT;
-	eviction = entry;
-
-	*zone = NODE_DATA(nid)->node_zones + zid;
 
-	refault = atomic_long_read(&(*zone)->inactive_age);
-
-	/*
-	 * The unsigned subtraction here gives an accurate distance
-	 * across inactive_age overflows in most cases.
-	 *
-	 * There is a special case: usually, shadow entries have a
-	 * short lifetime and are either refaulted or reclaimed along
-	 * with the inode before they get too old.  But it is not
-	 * impossible for the inactive_age to lap a shadow entry in
-	 * the field, which can then can result in a false small
-	 * refault distance, leading to a false activation should this
-	 * old entry actually refault again.  However, earlier kernels
-	 * used to deactivate unconditionally with *every* reclaim
-	 * invocation for the longest time, so the occasional
-	 * inappropriate activation leading to pressure on the active
-	 * list is not a problem.
-	 */
-	*distance = (refault - eviction) & EVICTION_MASK;
+	*zonep = NODE_DATA(nid)->node_zones + zid;
+	*evictionp = entry;
 }
 
 /**
@@ -233,9 +210,32 @@ void *workingset_eviction(struct address_space *mapping, struct page *page)
 bool workingset_refault(void *shadow)
 {
 	unsigned long refault_distance;
+	unsigned long eviction;
+	unsigned long refault;
 	struct zone *zone;
 
-	unpack_shadow(shadow, &zone, &refault_distance);
+	unpack_shadow(shadow, &zone, &eviction);
+
+	refault = atomic_long_read(&zone->inactive_age);
+
+	/*
+	 * The unsigned subtraction here gives an accurate distance
+	 * across inactive_age overflows in most cases.
+	 *
+	 * There is a special case: usually, shadow entries have a
+	 * short lifetime and are either refaulted or reclaimed along
+	 * with the inode before they get too old.  But it is not
+	 * impossible for the inactive_age to lap a shadow entry in
+	 * the field, which can then can result in a false small
+	 * refault distance, leading to a false activation should this
+	 * old entry actually refault again.  However, earlier kernels
+	 * used to deactivate unconditionally with *every* reclaim
+	 * invocation for the longest time, so the occasional
+	 * inappropriate activation leading to pressure on the active
+	 * list is not a problem.
+	 */
+	refault_distance = (refault - eviction) & EVICTION_MASK;
+
 	inc_zone_state(zone, WORKINGSET_REFAULT);
 
 	if (refault_distance <= zone_page_state(zone, NR_ACTIVE_FILE)) {
-- 
2.7.0

* [PATCH 4/5] mm: workingset: eviction buckets for bigmem/lowbit machines
From: Johannes Weiner @ 2016-01-26 21:00 UTC
  To: Andrew Morton
  Cc: Vladimir Davydov, Michal Hocko, linux-mm, cgroups, linux-kernel,
	kernel-team

For per-cgroup thrash detection, we need to store the memcg ID inside
the radix tree cookie as well. However, on 32 bit that doesn't leave
enough bits for the eviction timestamp to cover the necessary range of
recently evicted pages. The radix tree entry would look like this:

[ RADIX_TREE_EXCEPTIONAL(2) | ZONEID(2) | MEMCGID(16) | EVICTION(12) ]

12 bits means 4096 distinct eviction timestamps, which with 4K pages
covers only 16M worth of recently evicted pages. But refaults are
actionable up to distances covering half of memory. To not miss
refaults, we have to stretch out the range at the cost of how precisely
we can tell when a page was evicted: we shave off lower bits from the
eviction timestamp until the necessary range is covered. E.g. grouping
evictions into 1M buckets (256 pages each) stretches the longest
representable refault distance to 4G.
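
As a quick sanity check on that arithmetic, a minimal userspace sketch
(assuming 4K pages and the 32-bit layout above; illustration only, not
part of the patch):

        #include <stdio.h>

        int main(void)
        {
                unsigned long long page_size = 4096;         /* assuming 4K pages */
                unsigned long long timestamps = 1ULL << 12;  /* 12 eviction bits */
                unsigned long long bucket_pages = 256;       /* 1M buckets */

                printf("plain coverage:    %lluM\n",
                       (timestamps * page_size) >> 20);                /* 16M */
                printf("bucketed coverage: %lluM\n",
                       (timestamps * bucket_pages * page_size) >> 20); /* 4096M, i.e. 4G */
                return 0;
        }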

This patch implements eviction buckets that are automatically sized
according to the available bits and the necessary refault range, in
preparation for per-cgroup thrash detection.

The maximum actionable distance is currently half of memory, but to
support memory hotplug of up to 200% of boot-time memory, we size the
buckets to cover double the distance. Beyond that, thrashing won't be
detectable anymore.

During boot, the kernel will print out the exact parameters, like so:

[    0.113929] workingset: timestamp_bits=12 max_order=18 bucket_order=6

In this example, there are 12 radix entry bits available for the
eviction timestamp, to cover a maximum distance of 2^18 pages (this is
a 1G machine). Consequently, evictions must be grouped into buckets of
2^6 pages, or 256K.
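
The sizing rule itself, as a standalone sketch with this example's
numbers plugged in (the actual code in the hunk below derives them
from totalram_pages and fls_long() instead of hardcoded values):

        #include <stdio.h>

        int main(void)
        {
                unsigned int timestamp_bits = 12;               /* as in the boot line above */
                unsigned long long totalram_pages = 1ULL << 18; /* 1G machine, 4K pages */
                unsigned int max_order = 0, bucket_order = 0;

                /* equivalent of fls_long(totalram_pages - 1) */
                while ((1ULL << max_order) < totalram_pages)
                        max_order++;

                if (max_order > timestamp_bits)
                        bucket_order = max_order - timestamp_bits;

                printf("workingset: timestamp_bits=%u max_order=%u bucket_order=%u\n",
                       timestamp_bits, max_order, bucket_order);
                return 0;
        }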

Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
---
 mm/workingset.c | 30 +++++++++++++++++++++++++++++-
 1 file changed, 29 insertions(+), 1 deletion(-)

diff --git a/mm/workingset.c b/mm/workingset.c
index f874b2c663e3..c5e77c3e3dca 100644
--- a/mm/workingset.c
+++ b/mm/workingset.c
@@ -156,8 +156,19 @@
 			 ZONES_SHIFT + NODES_SHIFT)
 #define EVICTION_MASK	(~0UL >> EVICTION_SHIFT)
 
+/*
+ * Eviction timestamps need to be able to cover the full range of
+ * actionable refaults. However, bits are tight in the radix tree
+ * entry, and after storing the identifier for the lruvec there might
+ * not be enough left to represent every single actionable refault. In
+ * that case, we have to sacrifice granularity for distance, and group
+ * evictions into coarser buckets by shaving off lower timestamp bits.
+ */
+static unsigned int bucket_order;
+
 static void *pack_shadow(unsigned long eviction, struct zone *zone)
 {
+	eviction >>= bucket_order;
 	eviction = (eviction << NODES_SHIFT) | zone_to_nid(zone);
 	eviction = (eviction << ZONES_SHIFT) | zone_idx(zone);
 	eviction = (eviction << RADIX_TREE_EXCEPTIONAL_SHIFT);
@@ -178,7 +189,7 @@ static void unpack_shadow(void *shadow, struct zone **zonep,
 	entry >>= NODES_SHIFT;
 
 	*zonep = NODE_DATA(nid)->node_zones + zid;
-	*evictionp = entry;
+	*evictionp = entry << bucket_order;
 }
 
 /**
@@ -400,8 +411,25 @@ static struct lock_class_key shadow_nodes_key;
 
 static int __init workingset_init(void)
 {
+	unsigned int timestamp_bits;
+	unsigned int max_order;
 	int ret;
 
+	BUILD_BUG_ON(BITS_PER_LONG < EVICTION_SHIFT);
+	/*
+	 * Calculate the eviction bucket size to cover the longest
+	 * actionable refault distance, which is currently half of
+	 * memory (totalram_pages/2). However, memory hotplug may add
+	 * some more pages at runtime, so keep working with up to
+	 * double the initial memory by using totalram_pages as-is.
+	 */
+	timestamp_bits = BITS_PER_LONG - EVICTION_SHIFT;
+	max_order = fls_long(totalram_pages - 1);
+	if (max_order > timestamp_bits)
+		bucket_order = max_order - timestamp_bits;
+	printk("workingset: timestamp_bits=%d max_order=%d bucket_order=%u\n",
+	       timestamp_bits, max_order, bucket_order);
+
 	ret = list_lru_init_key(&workingset_shadow_nodes, &shadow_nodes_key);
 	if (ret)
 		goto err;
-- 
2.7.0

+	 * actionable refault distance, which is currently half of
+	 * memory (totalram_pages/2). However, memory hotplug may add
+	 * some more pages at runtime, so keep working with up to
+	 * double the initial memory by using totalram_pages as-is.
+	 */
+	timestamp_bits = BITS_PER_LONG - EVICTION_SHIFT;
+	max_order = fls_long(totalram_pages - 1);
+	if (max_order > timestamp_bits)
+		bucket_order = max_order - timestamp_bits;
+	printk("workingset: timestamp_bits=%d max_order=%d bucket_order=%u\n",
+	       timestamp_bits, max_order, bucket_order);
+
 	ret = list_lru_init_key(&workingset_shadow_nodes, &shadow_nodes_key);
 	if (ret)
 		goto err;
-- 
2.7.0


^ permalink raw reply related	[flat|nested] 33+ messages in thread

* [PATCH 5/5] mm: workingset: per-cgroup cache thrash detection
  2016-01-26 21:00 ` Johannes Weiner
  (?)
@ 2016-01-26 21:00   ` Johannes Weiner
  -1 siblings, 0 replies; 33+ messages in thread
From: Johannes Weiner @ 2016-01-26 21:00 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Vladimir Davydov, Michal Hocko, linux-mm, cgroups, linux-kernel,
	kernel-team

Cache thrash detection (see a528910e12ec "mm: thrash detection-based
file cache sizing" for details) currently only works on the system
level, not inside cgroups. Worse, as the refaults are compared to the
global number of active cache, cgroups might wrongfully get all their
refaults activated when their pages are hotter than those of others.

Move the refault machinery from the zone to the lruvec, and then tag
eviction entries with the memcg ID. This makes the thrash detection
work correctly inside cgroups.
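
In essence, the activation decision is now made against the evicting
lruvec instead of the whole zone. Simplified sketch of the logic in
the workingset.c hunk below (not a literal excerpt):

    lruvec = mem_cgroup_zone_lruvec(zone, memcg);
    refault = atomic_long_read(&lruvec->inactive_age);
    refault_distance = (refault - eviction) & EVICTION_MASK;
    if (refault_distance <= lruvec_lru_size(lruvec, LRU_ACTIVE_FILE))
        return true;    /* refault within this cgroup's working set */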

Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
---
 include/linux/memcontrol.h | 39 ++++++++++++++++++++++++++
 include/linux/mmzone.h     | 11 ++++----
 include/linux/swap.h       |  1 +
 mm/memcontrol.c            | 25 -----------------
 mm/vmscan.c                | 18 ++++++------
 mm/workingset.c            | 69 +++++++++++++++++++++++++++++++++++++++-------
 6 files changed, 113 insertions(+), 50 deletions(-)

diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
index 2685aaf91511..845b40905bf3 100644
--- a/include/linux/memcontrol.h
+++ b/include/linux/memcontrol.h
@@ -89,6 +89,10 @@ enum mem_cgroup_events_target {
 };
 
 #ifdef CONFIG_MEMCG
+
+#define MEM_CGROUP_ID_SHIFT	16
+#define MEM_CGROUP_ID_MAX	USHRT_MAX
+
 struct mem_cgroup_stat_cpu {
 	long count[MEMCG_NR_STAT];
 	unsigned long events[MEMCG_NR_EVENTS];
@@ -312,6 +316,25 @@ struct mem_cgroup *mem_cgroup_iter(struct mem_cgroup *,
 				   struct mem_cgroup_reclaim_cookie *);
 void mem_cgroup_iter_break(struct mem_cgroup *, struct mem_cgroup *);
 
+static inline unsigned short mem_cgroup_id(struct mem_cgroup *memcg)
+{
+	return memcg->css.id;
+}
+
+/**
+ * mem_cgroup_from_id - look up a memcg from an id
+ * @id: the id to look up
+ *
+ * Caller must hold rcu_read_lock() and use css_tryget() as necessary.
+ */
+static inline struct mem_cgroup *mem_cgroup_from_id(unsigned short id)
+{
+	struct cgroup_subsys_state *css;
+
+	css = css_from_id(id, &memory_cgrp_subsys);
+	return mem_cgroup_from_css(css);
+}
+
 /**
  * parent_mem_cgroup - find the accounting parent of a memcg
  * @memcg: memcg whose parent to find
@@ -502,6 +525,10 @@ void mem_cgroup_split_huge_fixup(struct page *head);
 #endif
 
 #else /* CONFIG_MEMCG */
+
+#define MEM_CGROUP_ID_SHIFT	0
+#define MEM_CGROUP_ID_MAX	0
+
 struct mem_cgroup;
 
 static inline void mem_cgroup_events(struct mem_cgroup *memcg,
@@ -586,6 +613,18 @@ static inline void mem_cgroup_iter_break(struct mem_cgroup *root,
 {
 }
 
+static inline unsigned short mem_cgroup_id(struct mem_cgroup *memcg)
+{
+	return 0;
+}
+
+static inline struct mem_cgroup *mem_cgroup_from_id(unsigned short id)
+{
+	WARN_ON_ONCE(id);
+	/* XXX: This should always return root_mem_cgroup */
+	return NULL;
+}
+
 static inline bool mem_cgroup_disabled(void)
 {
 	return true;
diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h
index 7b6c2cfee390..f58e61315982 100644
--- a/include/linux/mmzone.h
+++ b/include/linux/mmzone.h
@@ -209,10 +209,12 @@ struct zone_reclaim_stat {
 };
 
 struct lruvec {
-	struct list_head lists[NR_LRU_LISTS];
-	struct zone_reclaim_stat reclaim_stat;
+	struct list_head		lists[NR_LRU_LISTS];
+	struct zone_reclaim_stat	reclaim_stat;
+	/* Evictions & activations on the inactive file list */
+	atomic_long_t			inactive_age;
 #ifdef CONFIG_MEMCG
-	struct zone *zone;
+	struct zone			*zone;
 #endif
 };
 
@@ -487,9 +489,6 @@ struct zone {
 	spinlock_t		lru_lock;
 	struct lruvec		lruvec;
 
-	/* Evictions & activations on the inactive file list */
-	atomic_long_t		inactive_age;
-
 	/*
 	 * When free pages are below this point, additional steps are taken
 	 * when reading the number of free pages to avoid per-cpu counter
diff --git a/include/linux/swap.h b/include/linux/swap.h
index b14a2bb33514..1cf3065c143b 100644
--- a/include/linux/swap.h
+++ b/include/linux/swap.h
@@ -317,6 +317,7 @@ extern void lru_cache_add_active_or_unevictable(struct page *page,
 
 /* linux/mm/vmscan.c */
 extern unsigned long zone_reclaimable_pages(struct zone *zone);
+extern unsigned long lruvec_lru_size(struct lruvec *lruvec, enum lru_list lru);
 extern unsigned long try_to_free_pages(struct zonelist *zonelist, int order,
 					gfp_t gfp_mask, nodemask_t *mask);
 extern int __isolate_lru_page(struct page *page, isolate_mode_t mode);
diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index 953f0f984392..864e237f32d9 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -268,31 +268,6 @@ static inline bool mem_cgroup_is_root(struct mem_cgroup *memcg)
 	return (memcg == root_mem_cgroup);
 }
 
-/*
- * We restrict the id in the range of [1, 65535], so it can fit into
- * an unsigned short.
- */
-#define MEM_CGROUP_ID_MAX	USHRT_MAX
-
-static inline unsigned short mem_cgroup_id(struct mem_cgroup *memcg)
-{
-	return memcg->css.id;
-}
-
-/*
- * A helper function to get mem_cgroup from ID. must be called under
- * rcu_read_lock().  The caller is responsible for calling
- * css_tryget_online() if the mem_cgroup is used for charging. (dropping
- * refcnt from swap can be called against removed memcg.)
- */
-static inline struct mem_cgroup *mem_cgroup_from_id(unsigned short id)
-{
-	struct cgroup_subsys_state *css;
-
-	css = css_from_id(id, &memory_cgrp_subsys);
-	return mem_cgroup_from_css(css);
-}
-
 #ifndef CONFIG_SLOB
 /*
  * This will be the memcg's index in each cache's ->memcg_params.memcg_caches.
diff --git a/mm/vmscan.c b/mm/vmscan.c
index 45771323e863..9342a0fe8836 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -213,7 +213,7 @@ bool zone_reclaimable(struct zone *zone)
 		zone_reclaimable_pages(zone) * 6;
 }
 
-static unsigned long get_lru_size(struct lruvec *lruvec, enum lru_list lru)
+unsigned long lruvec_lru_size(struct lruvec *lruvec, enum lru_list lru)
 {
 	if (!mem_cgroup_disabled())
 		return mem_cgroup_get_lru_size(lruvec, lru);
@@ -1931,8 +1931,8 @@ static bool inactive_file_is_low(struct lruvec *lruvec)
 	unsigned long inactive;
 	unsigned long active;
 
-	inactive = get_lru_size(lruvec, LRU_INACTIVE_FILE);
-	active = get_lru_size(lruvec, LRU_ACTIVE_FILE);
+	inactive = lruvec_lru_size(lruvec, LRU_INACTIVE_FILE);
+	active = lruvec_lru_size(lruvec, LRU_ACTIVE_FILE);
 
 	return active > inactive;
 }
@@ -2071,7 +2071,7 @@ static void get_scan_count(struct lruvec *lruvec, struct mem_cgroup *memcg,
 	 * system is under heavy pressure.
 	 */
 	if (!inactive_file_is_low(lruvec) &&
-	    get_lru_size(lruvec, LRU_INACTIVE_FILE) >> sc->priority) {
+	    lruvec_lru_size(lruvec, LRU_INACTIVE_FILE) >> sc->priority) {
 		scan_balance = SCAN_FILE;
 		goto out;
 	}
@@ -2097,10 +2097,10 @@ static void get_scan_count(struct lruvec *lruvec, struct mem_cgroup *memcg,
 	 * anon in [0], file in [1]
 	 */
 
-	anon  = get_lru_size(lruvec, LRU_ACTIVE_ANON) +
-		get_lru_size(lruvec, LRU_INACTIVE_ANON);
-	file  = get_lru_size(lruvec, LRU_ACTIVE_FILE) +
-		get_lru_size(lruvec, LRU_INACTIVE_FILE);
+	anon  = lruvec_lru_size(lruvec, LRU_ACTIVE_ANON) +
+		lruvec_lru_size(lruvec, LRU_INACTIVE_ANON);
+	file  = lruvec_lru_size(lruvec, LRU_ACTIVE_FILE) +
+		lruvec_lru_size(lruvec, LRU_INACTIVE_FILE);
 
 	spin_lock_irq(&zone->lru_lock);
 	if (unlikely(reclaim_stat->recent_scanned[0] > anon / 4)) {
@@ -2138,7 +2138,7 @@ out:
 			unsigned long size;
 			unsigned long scan;
 
-			size = get_lru_size(lruvec, lru);
+			size = lruvec_lru_size(lruvec, lru);
 			scan = size >> sc->priority;
 
 			if (!scan && pass && force_scan)
diff --git a/mm/workingset.c b/mm/workingset.c
index c5e77c3e3dca..bdc5a0f62ad8 100644
--- a/mm/workingset.c
+++ b/mm/workingset.c
@@ -153,7 +153,8 @@
  */
 
 #define EVICTION_SHIFT	(RADIX_TREE_EXCEPTIONAL_ENTRY + \
-			 ZONES_SHIFT + NODES_SHIFT)
+			 ZONES_SHIFT + NODES_SHIFT +	\
+			 MEM_CGROUP_ID_SHIFT)
 #define EVICTION_MASK	(~0UL >> EVICTION_SHIFT)
 
 /*
@@ -166,9 +167,10 @@
  */
 static unsigned int bucket_order;
 
-static void *pack_shadow(unsigned long eviction, struct zone *zone)
+static void *pack_shadow(int memcgid, struct zone *zone, unsigned long eviction)
 {
 	eviction >>= bucket_order;
+	eviction = (eviction << MEM_CGROUP_ID_SHIFT) | memcgid;
 	eviction = (eviction << NODES_SHIFT) | zone_to_nid(zone);
 	eviction = (eviction << ZONES_SHIFT) | zone_idx(zone);
 	eviction = (eviction << RADIX_TREE_EXCEPTIONAL_SHIFT);
@@ -176,18 +178,21 @@ static void *pack_shadow(unsigned long eviction, struct zone *zone)
 	return (void *)(eviction | RADIX_TREE_EXCEPTIONAL_ENTRY);
 }
 
-static void unpack_shadow(void *shadow, struct zone **zonep,
+static void unpack_shadow(void *shadow, int *memcgidp, struct zone **zonep,
 			  unsigned long *evictionp)
 {
 	unsigned long entry = (unsigned long)shadow;
-	int zid, nid;
+	int memcgid, nid, zid;
 
 	entry >>= RADIX_TREE_EXCEPTIONAL_SHIFT;
 	zid = entry & ((1UL << ZONES_SHIFT) - 1);
 	entry >>= ZONES_SHIFT;
 	nid = entry & ((1UL << NODES_SHIFT) - 1);
 	entry >>= NODES_SHIFT;
+	memcgid = entry & ((1UL << MEM_CGROUP_ID_SHIFT) - 1);
+	entry >>= MEM_CGROUP_ID_SHIFT;
 
+	*memcgidp = memcgid;
 	*zonep = NODE_DATA(nid)->node_zones + zid;
 	*evictionp = entry << bucket_order;
 }
@@ -202,11 +207,20 @@ static void unpack_shadow(void *shadow, struct zone **zonep,
  */
 void *workingset_eviction(struct address_space *mapping, struct page *page)
 {
+	struct mem_cgroup *memcg = page_memcg(page);
 	struct zone *zone = page_zone(page);
+	int memcgid = mem_cgroup_id(memcg);
 	unsigned long eviction;
+	struct lruvec *lruvec;
 
-	eviction = atomic_long_inc_return(&zone->inactive_age);
-	return pack_shadow(eviction, zone);
+	/* Page is fully exclusive and pins page->mem_cgroup */
+	VM_BUG_ON_PAGE(PageLRU(page), page);
+	VM_BUG_ON_PAGE(page_count(page), page);
+	VM_BUG_ON_PAGE(!PageLocked(page), page);
+
+	lruvec = mem_cgroup_zone_lruvec(zone, memcg);
+	eviction = atomic_long_inc_return(&lruvec->inactive_age);
+	return pack_shadow(memcgid, zone, eviction);
 }
 
 /**
@@ -221,13 +235,42 @@ void *workingset_eviction(struct address_space *mapping, struct page *page)
 bool workingset_refault(void *shadow)
 {
 	unsigned long refault_distance;
+	unsigned long active_file;
+	struct mem_cgroup *memcg;
 	unsigned long eviction;
+	struct lruvec *lruvec;
 	unsigned long refault;
 	struct zone *zone;
+	int memcgid;
 
-	unpack_shadow(shadow, &zone, &eviction);
+	unpack_shadow(shadow, &memcgid, &zone, &eviction);
 
-	refault = atomic_long_read(&zone->inactive_age);
+	rcu_read_lock();
+	/*
+	 * Look up the memcg associated with the stored ID. It might
+	 * have been deleted since the page's eviction.
+	 *
+	 * Note that in rare events the ID could have been recycled
+	 * for a new cgroup that refaults a shared page. This is
+	 * impossible to tell from the available data. However, this
+	 * should be a rare and limited disturbance, and activations
+	 * are always speculative anyway. Ultimately, it's the aging
+	 * algorithm's job to shake out the minimum access frequency
+	 * for the active cache.
+	 *
+	 * XXX: On !CONFIG_MEMCG, this will always return NULL; it
+	 * would be better if the root_mem_cgroup existed in all
+	 * configurations instead.
+	 */
+	memcg = mem_cgroup_from_id(memcgid);
+	if (!mem_cgroup_disabled() && !memcg) {
+		rcu_read_unlock();
+		return false;
+	}
+	lruvec = mem_cgroup_zone_lruvec(zone, memcg);
+	refault = atomic_long_read(&lruvec->inactive_age);
+	active_file = lruvec_lru_size(lruvec, LRU_ACTIVE_FILE);
+	rcu_read_unlock();
 
 	/*
 	 * The unsigned subtraction here gives an accurate distance
@@ -249,7 +292,7 @@ bool workingset_refault(void *shadow)
 
 	inc_zone_state(zone, WORKINGSET_REFAULT);
 
-	if (refault_distance <= zone_page_state(zone, NR_ACTIVE_FILE)) {
+	if (refault_distance <= active_file) {
 		inc_zone_state(zone, WORKINGSET_ACTIVATE);
 		return true;
 	}
@@ -262,7 +305,13 @@ bool workingset_refault(void *shadow)
  */
 void workingset_activation(struct page *page)
 {
-	atomic_long_inc(&page_zone(page)->inactive_age);
+	struct mem_cgroup *memcg;
+	struct lruvec *lruvec;
+
+	memcg = lock_page_memcg(page);
+	lruvec = mem_cgroup_zone_lruvec(page_zone(page), memcg);
+	atomic_long_inc(&lruvec->inactive_age);
+	unlock_page_memcg(memcg);
 }
 
 /*
-- 
2.7.0

^ permalink raw reply related	[flat|nested] 33+ messages in thread

* [PATCH 5/5] mm: workingset: per-cgroup cache thrash detection
@ 2016-01-26 21:00   ` Johannes Weiner
  0 siblings, 0 replies; 33+ messages in thread
From: Johannes Weiner @ 2016-01-26 21:00 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Vladimir Davydov, Michal Hocko, linux-mm, cgroups, linux-kernel,
	kernel-team

Cache thrash detection (see a528910e12ec "mm: thrash detection-based
file cache sizing" for details) currently only works on the system
level, not inside cgroups. Worse, as the refaults are compared to the
global number of active cache, cgroups might wrongfully get all their
refaults activated when their pages are hotter than those of others.

Move the refault machinery from the zone to the lruvec, and then tag
eviction entries with the memcg ID. This makes the thrash detection
work correctly inside cgroups.

Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
---
 include/linux/memcontrol.h | 39 ++++++++++++++++++++++++++
 include/linux/mmzone.h     | 11 ++++----
 include/linux/swap.h       |  1 +
 mm/memcontrol.c            | 25 -----------------
 mm/vmscan.c                | 18 ++++++------
 mm/workingset.c            | 69 +++++++++++++++++++++++++++++++++++++++-------
 6 files changed, 113 insertions(+), 50 deletions(-)

diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
index 2685aaf91511..845b40905bf3 100644
--- a/include/linux/memcontrol.h
+++ b/include/linux/memcontrol.h
@@ -89,6 +89,10 @@ enum mem_cgroup_events_target {
 };
 
 #ifdef CONFIG_MEMCG
+
+#define MEM_CGROUP_ID_SHIFT	16
+#define MEM_CGROUP_ID_MAX	USHRT_MAX
+
 struct mem_cgroup_stat_cpu {
 	long count[MEMCG_NR_STAT];
 	unsigned long events[MEMCG_NR_EVENTS];
@@ -312,6 +316,25 @@ struct mem_cgroup *mem_cgroup_iter(struct mem_cgroup *,
 				   struct mem_cgroup_reclaim_cookie *);
 void mem_cgroup_iter_break(struct mem_cgroup *, struct mem_cgroup *);
 
+static inline unsigned short mem_cgroup_id(struct mem_cgroup *memcg)
+{
+	return memcg->css.id;
+}
+
+/**
+ * mem_cgroup_from_id - look up a memcg from an id
+ * @id: the id to look up
+ *
+ * Caller must hold rcu_read_lock() and use css_tryget() as necessary.
+ */
+static inline struct mem_cgroup *mem_cgroup_from_id(unsigned short id)
+{
+	struct cgroup_subsys_state *css;
+
+	css = css_from_id(id, &memory_cgrp_subsys);
+	return mem_cgroup_from_css(css);
+}
+
 /**
  * parent_mem_cgroup - find the accounting parent of a memcg
  * @memcg: memcg whose parent to find
@@ -502,6 +525,10 @@ void mem_cgroup_split_huge_fixup(struct page *head);
 #endif
 
 #else /* CONFIG_MEMCG */
+
+#define MEM_CGROUP_ID_SHIFT	0
+#define MEM_CGROUP_ID_MAX	0
+
 struct mem_cgroup;
 
 static inline void mem_cgroup_events(struct mem_cgroup *memcg,
@@ -586,6 +613,18 @@ static inline void mem_cgroup_iter_break(struct mem_cgroup *root,
 {
 }
 
+static inline unsigned short mem_cgroup_id(struct mem_cgroup *memcg)
+{
+	return 0;
+}
+
+static inline struct mem_cgroup *mem_cgroup_from_id(unsigned short id)
+{
+	WARN_ON_ONCE(id);
+	/* XXX: This should always return root_mem_cgroup */
+	return NULL;
+}
+
 static inline bool mem_cgroup_disabled(void)
 {
 	return true;
diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h
index 7b6c2cfee390..f58e61315982 100644
--- a/include/linux/mmzone.h
+++ b/include/linux/mmzone.h
@@ -209,10 +209,12 @@ struct zone_reclaim_stat {
 };
 
 struct lruvec {
-	struct list_head lists[NR_LRU_LISTS];
-	struct zone_reclaim_stat reclaim_stat;
+	struct list_head		lists[NR_LRU_LISTS];
+	struct zone_reclaim_stat	reclaim_stat;
+	/* Evictions & activations on the inactive file list */
+	atomic_long_t			inactive_age;
 #ifdef CONFIG_MEMCG
-	struct zone *zone;
+	struct zone			*zone;
 #endif
 };
 
@@ -487,9 +489,6 @@ struct zone {
 	spinlock_t		lru_lock;
 	struct lruvec		lruvec;
 
-	/* Evictions & activations on the inactive file list */
-	atomic_long_t		inactive_age;
-
 	/*
 	 * When free pages are below this point, additional steps are taken
 	 * when reading the number of free pages to avoid per-cpu counter
diff --git a/include/linux/swap.h b/include/linux/swap.h
index b14a2bb33514..1cf3065c143b 100644
--- a/include/linux/swap.h
+++ b/include/linux/swap.h
@@ -317,6 +317,7 @@ extern void lru_cache_add_active_or_unevictable(struct page *page,
 
 /* linux/mm/vmscan.c */
 extern unsigned long zone_reclaimable_pages(struct zone *zone);
+extern unsigned long lruvec_lru_size(struct lruvec *lruvec, enum lru_list lru);
 extern unsigned long try_to_free_pages(struct zonelist *zonelist, int order,
 					gfp_t gfp_mask, nodemask_t *mask);
 extern int __isolate_lru_page(struct page *page, isolate_mode_t mode);
diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index 953f0f984392..864e237f32d9 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -268,31 +268,6 @@ static inline bool mem_cgroup_is_root(struct mem_cgroup *memcg)
 	return (memcg == root_mem_cgroup);
 }
 
-/*
- * We restrict the id in the range of [1, 65535], so it can fit into
- * an unsigned short.
- */
-#define MEM_CGROUP_ID_MAX	USHRT_MAX
-
-static inline unsigned short mem_cgroup_id(struct mem_cgroup *memcg)
-{
-	return memcg->css.id;
-}
-
-/*
- * A helper function to get mem_cgroup from ID. must be called under
- * rcu_read_lock().  The caller is responsible for calling
- * css_tryget_online() if the mem_cgroup is used for charging. (dropping
- * refcnt from swap can be called against removed memcg.)
- */
-static inline struct mem_cgroup *mem_cgroup_from_id(unsigned short id)
-{
-	struct cgroup_subsys_state *css;
-
-	css = css_from_id(id, &memory_cgrp_subsys);
-	return mem_cgroup_from_css(css);
-}
-
 #ifndef CONFIG_SLOB
 /*
  * This will be the memcg's index in each cache's ->memcg_params.memcg_caches.
diff --git a/mm/vmscan.c b/mm/vmscan.c
index 45771323e863..9342a0fe8836 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -213,7 +213,7 @@ bool zone_reclaimable(struct zone *zone)
 		zone_reclaimable_pages(zone) * 6;
 }
 
-static unsigned long get_lru_size(struct lruvec *lruvec, enum lru_list lru)
+unsigned long lruvec_lru_size(struct lruvec *lruvec, enum lru_list lru)
 {
 	if (!mem_cgroup_disabled())
 		return mem_cgroup_get_lru_size(lruvec, lru);
@@ -1931,8 +1931,8 @@ static bool inactive_file_is_low(struct lruvec *lruvec)
 	unsigned long inactive;
 	unsigned long active;
 
-	inactive = get_lru_size(lruvec, LRU_INACTIVE_FILE);
-	active = get_lru_size(lruvec, LRU_ACTIVE_FILE);
+	inactive = lruvec_lru_size(lruvec, LRU_INACTIVE_FILE);
+	active = lruvec_lru_size(lruvec, LRU_ACTIVE_FILE);
 
 	return active > inactive;
 }
@@ -2071,7 +2071,7 @@ static void get_scan_count(struct lruvec *lruvec, struct mem_cgroup *memcg,
 	 * system is under heavy pressure.
 	 */
 	if (!inactive_file_is_low(lruvec) &&
-	    get_lru_size(lruvec, LRU_INACTIVE_FILE) >> sc->priority) {
+	    lruvec_lru_size(lruvec, LRU_INACTIVE_FILE) >> sc->priority) {
 		scan_balance = SCAN_FILE;
 		goto out;
 	}
@@ -2097,10 +2097,10 @@ static void get_scan_count(struct lruvec *lruvec, struct mem_cgroup *memcg,
 	 * anon in [0], file in [1]
 	 */
 
-	anon  = get_lru_size(lruvec, LRU_ACTIVE_ANON) +
-		get_lru_size(lruvec, LRU_INACTIVE_ANON);
-	file  = get_lru_size(lruvec, LRU_ACTIVE_FILE) +
-		get_lru_size(lruvec, LRU_INACTIVE_FILE);
+	anon  = lruvec_lru_size(lruvec, LRU_ACTIVE_ANON) +
+		lruvec_lru_size(lruvec, LRU_INACTIVE_ANON);
+	file  = lruvec_lru_size(lruvec, LRU_ACTIVE_FILE) +
+		lruvec_lru_size(lruvec, LRU_INACTIVE_FILE);
 
 	spin_lock_irq(&zone->lru_lock);
 	if (unlikely(reclaim_stat->recent_scanned[0] > anon / 4)) {
@@ -2138,7 +2138,7 @@ out:
 			unsigned long size;
 			unsigned long scan;
 
-			size = get_lru_size(lruvec, lru);
+			size = lruvec_lru_size(lruvec, lru);
 			scan = size >> sc->priority;
 
 			if (!scan && pass && force_scan)
diff --git a/mm/workingset.c b/mm/workingset.c
index c5e77c3e3dca..bdc5a0f62ad8 100644
--- a/mm/workingset.c
+++ b/mm/workingset.c
@@ -153,7 +153,8 @@
  */
 
 #define EVICTION_SHIFT	(RADIX_TREE_EXCEPTIONAL_ENTRY + \
-			 ZONES_SHIFT + NODES_SHIFT)
+			 ZONES_SHIFT + NODES_SHIFT +	\
+			 MEM_CGROUP_ID_SHIFT)
 #define EVICTION_MASK	(~0UL >> EVICTION_SHIFT)
 
 /*
@@ -166,9 +167,10 @@
  */
 static unsigned int bucket_order;
 
-static void *pack_shadow(unsigned long eviction, struct zone *zone)
+static void *pack_shadow(int memcgid, struct zone *zone, unsigned long eviction)
 {
 	eviction >>= bucket_order;
+	eviction = (eviction << MEM_CGROUP_ID_SHIFT) | memcgid;
 	eviction = (eviction << NODES_SHIFT) | zone_to_nid(zone);
 	eviction = (eviction << ZONES_SHIFT) | zone_idx(zone);
 	eviction = (eviction << RADIX_TREE_EXCEPTIONAL_SHIFT);
@@ -176,18 +178,21 @@ static void *pack_shadow(unsigned long eviction, struct zone *zone)
 	return (void *)(eviction | RADIX_TREE_EXCEPTIONAL_ENTRY);
 }
 
-static void unpack_shadow(void *shadow, struct zone **zonep,
+static void unpack_shadow(void *shadow, int *memcgidp, struct zone **zonep,
 			  unsigned long *evictionp)
 {
 	unsigned long entry = (unsigned long)shadow;
-	int zid, nid;
+	int memcgid, nid, zid;
 
 	entry >>= RADIX_TREE_EXCEPTIONAL_SHIFT;
 	zid = entry & ((1UL << ZONES_SHIFT) - 1);
 	entry >>= ZONES_SHIFT;
 	nid = entry & ((1UL << NODES_SHIFT) - 1);
 	entry >>= NODES_SHIFT;
+	memcgid = entry & ((1UL << MEM_CGROUP_ID_SHIFT) - 1);
+	entry >>= MEM_CGROUP_ID_SHIFT;
 
+	*memcgidp = memcgid;
 	*zonep = NODE_DATA(nid)->node_zones + zid;
 	*evictionp = entry << bucket_order;
 }
@@ -202,11 +207,20 @@ static void unpack_shadow(void *shadow, struct zone **zonep,
  */
 void *workingset_eviction(struct address_space *mapping, struct page *page)
 {
+	struct mem_cgroup *memcg = page_memcg(page);
 	struct zone *zone = page_zone(page);
+	int memcgid = mem_cgroup_id(memcg);
 	unsigned long eviction;
+	struct lruvec *lruvec;
 
-	eviction = atomic_long_inc_return(&zone->inactive_age);
-	return pack_shadow(eviction, zone);
+	/* Page is fully exclusive and pins page->mem_cgroup */
+	VM_BUG_ON_PAGE(PageLRU(page), page);
+	VM_BUG_ON_PAGE(page_count(page), page);
+	VM_BUG_ON_PAGE(!PageLocked(page), page);
+
+	lruvec = mem_cgroup_zone_lruvec(zone, memcg);
+	eviction = atomic_long_inc_return(&lruvec->inactive_age);
+	return pack_shadow(memcgid, zone, eviction);
 }
 
 /**
@@ -221,13 +235,42 @@ void *workingset_eviction(struct address_space *mapping, struct page *page)
 bool workingset_refault(void *shadow)
 {
 	unsigned long refault_distance;
+	unsigned long active_file;
+	struct mem_cgroup *memcg;
 	unsigned long eviction;
+	struct lruvec *lruvec;
 	unsigned long refault;
 	struct zone *zone;
+	int memcgid;
 
-	unpack_shadow(shadow, &zone, &eviction);
+	unpack_shadow(shadow, &memcgid, &zone, &eviction);
 
-	refault = atomic_long_read(&zone->inactive_age);
+	rcu_read_lock();
+	/*
+	 * Look up the memcg associated with the stored ID. It might
+	 * have been deleted since the page's eviction.
+	 *
+	 * Note that in rare events the ID could have been recycled
+	 * for a new cgroup that refaults a shared page. This is
+	 * impossible to tell from the available data. However, this
+	 * should be a rare and limited disturbance, and activations
+	 * are always speculative anyway. Ultimately, it's the aging
+	 * algorithm's job to shake out the minimum access frequency
+	 * for the active cache.
+	 *
+	 * XXX: On !CONFIG_MEMCG, this will always return NULL; it
+	 * would be better if the root_mem_cgroup existed in all
+	 * configurations instead.
+	 */
+	memcg = mem_cgroup_from_id(memcgid);
+	if (!mem_cgroup_disabled() && !memcg) {
+		rcu_read_unlock();
+		return false;
+	}
+	lruvec = mem_cgroup_zone_lruvec(zone, memcg);
+	refault = atomic_long_read(&lruvec->inactive_age);
+	active_file = lruvec_lru_size(lruvec, LRU_ACTIVE_FILE);
+	rcu_read_unlock();
 
 	/*
 	 * The unsigned subtraction here gives an accurate distance
@@ -249,7 +292,7 @@ bool workingset_refault(void *shadow)
 
 	inc_zone_state(zone, WORKINGSET_REFAULT);
 
-	if (refault_distance <= zone_page_state(zone, NR_ACTIVE_FILE)) {
+	if (refault_distance <= active_file) {
 		inc_zone_state(zone, WORKINGSET_ACTIVATE);
 		return true;
 	}
@@ -262,7 +305,13 @@ bool workingset_refault(void *shadow)
  */
 void workingset_activation(struct page *page)
 {
-	atomic_long_inc(&page_zone(page)->inactive_age);
+	struct mem_cgroup *memcg;
+	struct lruvec *lruvec;
+
+	memcg = lock_page_memcg(page);
+	lruvec = mem_cgroup_zone_lruvec(page_zone(page), memcg);
+	atomic_long_inc(&lruvec->inactive_age);
+	unlock_page_memcg(memcg);
 }
 
 /*
-- 
2.7.0


^ permalink raw reply related	[flat|nested] 33+ messages in thread

* [PATCH 5/5] mm: workingset: per-cgroup cache thrash detection
@ 2016-01-26 21:00   ` Johannes Weiner
  0 siblings, 0 replies; 33+ messages in thread
From: Johannes Weiner @ 2016-01-26 21:00 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Vladimir Davydov, Michal Hocko, linux-mm-Bw31MaZKKs3YtjvyW6yDsg,
	cgroups-u79uwXL29TY76Z2rM5mHXA,
	linux-kernel-u79uwXL29TY76Z2rM5mHXA, kernel-team-b10kYP2dOMg

Cache thrash detection (see a528910e12ec "mm: thrash detection-based
file cache sizing" for details) currently only works on the system
level, not inside cgroups. Worse, as the refaults are compared to the
global number of active cache, cgroups might wrongfully get all their
refaults activated when their pages are hotter than those of others.

Move the refault machinery from the zone to the lruvec, and then tag
eviction entries with the memcg ID. This makes the thrash detection
work correctly inside cgroups.

Signed-off-by: Johannes Weiner <hannes-druUgvl0LCNAfugRpC6u6w@public.gmane.org>
---
 include/linux/memcontrol.h | 39 ++++++++++++++++++++++++++
 include/linux/mmzone.h     | 11 ++++----
 include/linux/swap.h       |  1 +
 mm/memcontrol.c            | 25 -----------------
 mm/vmscan.c                | 18 ++++++------
 mm/workingset.c            | 69 +++++++++++++++++++++++++++++++++++++++-------
 6 files changed, 113 insertions(+), 50 deletions(-)

diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
index 2685aaf91511..845b40905bf3 100644
--- a/include/linux/memcontrol.h
+++ b/include/linux/memcontrol.h
@@ -89,6 +89,10 @@ enum mem_cgroup_events_target {
 };
 
 #ifdef CONFIG_MEMCG
+
+#define MEM_CGROUP_ID_SHIFT	16
+#define MEM_CGROUP_ID_MAX	USHRT_MAX
+
 struct mem_cgroup_stat_cpu {
 	long count[MEMCG_NR_STAT];
 	unsigned long events[MEMCG_NR_EVENTS];
@@ -312,6 +316,25 @@ struct mem_cgroup *mem_cgroup_iter(struct mem_cgroup *,
 				   struct mem_cgroup_reclaim_cookie *);
 void mem_cgroup_iter_break(struct mem_cgroup *, struct mem_cgroup *);
 
+static inline unsigned short mem_cgroup_id(struct mem_cgroup *memcg)
+{
+	return memcg->css.id;
+}
+
+/**
+ * mem_cgroup_from_id - look up a memcg from an id
+ * @id: the id to look up
+ *
+ * Caller must hold rcu_read_lock() and use css_tryget() as necessary.
+ */
+static inline struct mem_cgroup *mem_cgroup_from_id(unsigned short id)
+{
+	struct cgroup_subsys_state *css;
+
+	css = css_from_id(id, &memory_cgrp_subsys);
+	return mem_cgroup_from_css(css);
+}
+
 /**
  * parent_mem_cgroup - find the accounting parent of a memcg
  * @memcg: memcg whose parent to find
@@ -502,6 +525,10 @@ void mem_cgroup_split_huge_fixup(struct page *head);
 #endif
 
 #else /* CONFIG_MEMCG */
+
+#define MEM_CGROUP_ID_SHIFT	0
+#define MEM_CGROUP_ID_MAX	0
+
 struct mem_cgroup;
 
 static inline void mem_cgroup_events(struct mem_cgroup *memcg,
@@ -586,6 +613,18 @@ static inline void mem_cgroup_iter_break(struct mem_cgroup *root,
 {
 }
 
+static inline unsigned short mem_cgroup_id(struct mem_cgroup *memcg)
+{
+	return 0;
+}
+
+static inline struct mem_cgroup *mem_cgroup_from_id(unsigned short id)
+{
+	WARN_ON_ONCE(id);
+	/* XXX: This should always return root_mem_cgroup */
+	return NULL;
+}
+
 static inline bool mem_cgroup_disabled(void)
 {
 	return true;
diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h
index 7b6c2cfee390..f58e61315982 100644
--- a/include/linux/mmzone.h
+++ b/include/linux/mmzone.h
@@ -209,10 +209,12 @@ struct zone_reclaim_stat {
 };
 
 struct lruvec {
-	struct list_head lists[NR_LRU_LISTS];
-	struct zone_reclaim_stat reclaim_stat;
+	struct list_head		lists[NR_LRU_LISTS];
+	struct zone_reclaim_stat	reclaim_stat;
+	/* Evictions & activations on the inactive file list */
+	atomic_long_t			inactive_age;
 #ifdef CONFIG_MEMCG
-	struct zone *zone;
+	struct zone			*zone;
 #endif
 };
 
@@ -487,9 +489,6 @@ struct zone {
 	spinlock_t		lru_lock;
 	struct lruvec		lruvec;
 
-	/* Evictions & activations on the inactive file list */
-	atomic_long_t		inactive_age;
-
 	/*
 	 * When free pages are below this point, additional steps are taken
 	 * when reading the number of free pages to avoid per-cpu counter
diff --git a/include/linux/swap.h b/include/linux/swap.h
index b14a2bb33514..1cf3065c143b 100644
--- a/include/linux/swap.h
+++ b/include/linux/swap.h
@@ -317,6 +317,7 @@ extern void lru_cache_add_active_or_unevictable(struct page *page,
 
 /* linux/mm/vmscan.c */
 extern unsigned long zone_reclaimable_pages(struct zone *zone);
+extern unsigned long lruvec_lru_size(struct lruvec *lruvec, enum lru_list lru);
 extern unsigned long try_to_free_pages(struct zonelist *zonelist, int order,
 					gfp_t gfp_mask, nodemask_t *mask);
 extern int __isolate_lru_page(struct page *page, isolate_mode_t mode);
diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index 953f0f984392..864e237f32d9 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -268,31 +268,6 @@ static inline bool mem_cgroup_is_root(struct mem_cgroup *memcg)
 	return (memcg == root_mem_cgroup);
 }
 
-/*
- * We restrict the id in the range of [1, 65535], so it can fit into
- * an unsigned short.
- */
-#define MEM_CGROUP_ID_MAX	USHRT_MAX
-
-static inline unsigned short mem_cgroup_id(struct mem_cgroup *memcg)
-{
-	return memcg->css.id;
-}
-
-/*
- * A helper function to get mem_cgroup from ID. must be called under
- * rcu_read_lock().  The caller is responsible for calling
- * css_tryget_online() if the mem_cgroup is used for charging. (dropping
- * refcnt from swap can be called against removed memcg.)
- */
-static inline struct mem_cgroup *mem_cgroup_from_id(unsigned short id)
-{
-	struct cgroup_subsys_state *css;
-
-	css = css_from_id(id, &memory_cgrp_subsys);
-	return mem_cgroup_from_css(css);
-}
-
 #ifndef CONFIG_SLOB
 /*
  * This will be the memcg's index in each cache's ->memcg_params.memcg_caches.
diff --git a/mm/vmscan.c b/mm/vmscan.c
index 45771323e863..9342a0fe8836 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -213,7 +213,7 @@ bool zone_reclaimable(struct zone *zone)
 		zone_reclaimable_pages(zone) * 6;
 }
 
-static unsigned long get_lru_size(struct lruvec *lruvec, enum lru_list lru)
+unsigned long lruvec_lru_size(struct lruvec *lruvec, enum lru_list lru)
 {
 	if (!mem_cgroup_disabled())
 		return mem_cgroup_get_lru_size(lruvec, lru);
@@ -1931,8 +1931,8 @@ static bool inactive_file_is_low(struct lruvec *lruvec)
 	unsigned long inactive;
 	unsigned long active;
 
-	inactive = get_lru_size(lruvec, LRU_INACTIVE_FILE);
-	active = get_lru_size(lruvec, LRU_ACTIVE_FILE);
+	inactive = lruvec_lru_size(lruvec, LRU_INACTIVE_FILE);
+	active = lruvec_lru_size(lruvec, LRU_ACTIVE_FILE);
 
 	return active > inactive;
 }
@@ -2071,7 +2071,7 @@ static void get_scan_count(struct lruvec *lruvec, struct mem_cgroup *memcg,
 	 * system is under heavy pressure.
 	 */
 	if (!inactive_file_is_low(lruvec) &&
-	    get_lru_size(lruvec, LRU_INACTIVE_FILE) >> sc->priority) {
+	    lruvec_lru_size(lruvec, LRU_INACTIVE_FILE) >> sc->priority) {
 		scan_balance = SCAN_FILE;
 		goto out;
 	}
@@ -2097,10 +2097,10 @@ static void get_scan_count(struct lruvec *lruvec, struct mem_cgroup *memcg,
 	 * anon in [0], file in [1]
 	 */
 
-	anon  = get_lru_size(lruvec, LRU_ACTIVE_ANON) +
-		get_lru_size(lruvec, LRU_INACTIVE_ANON);
-	file  = get_lru_size(lruvec, LRU_ACTIVE_FILE) +
-		get_lru_size(lruvec, LRU_INACTIVE_FILE);
+	anon  = lruvec_lru_size(lruvec, LRU_ACTIVE_ANON) +
+		lruvec_lru_size(lruvec, LRU_INACTIVE_ANON);
+	file  = lruvec_lru_size(lruvec, LRU_ACTIVE_FILE) +
+		lruvec_lru_size(lruvec, LRU_INACTIVE_FILE);
 
 	spin_lock_irq(&zone->lru_lock);
 	if (unlikely(reclaim_stat->recent_scanned[0] > anon / 4)) {
@@ -2138,7 +2138,7 @@ out:
 			unsigned long size;
 			unsigned long scan;
 
-			size = get_lru_size(lruvec, lru);
+			size = lruvec_lru_size(lruvec, lru);
 			scan = size >> sc->priority;
 
 			if (!scan && pass && force_scan)
diff --git a/mm/workingset.c b/mm/workingset.c
index c5e77c3e3dca..bdc5a0f62ad8 100644
--- a/mm/workingset.c
+++ b/mm/workingset.c
@@ -153,7 +153,8 @@
  */
 
 #define EVICTION_SHIFT	(RADIX_TREE_EXCEPTIONAL_ENTRY + \
-			 ZONES_SHIFT + NODES_SHIFT)
+			 ZONES_SHIFT + NODES_SHIFT +	\
+			 MEM_CGROUP_ID_SHIFT)
 #define EVICTION_MASK	(~0UL >> EVICTION_SHIFT)
 
 /*
@@ -166,9 +167,10 @@
  */
 static unsigned int bucket_order;
 
-static void *pack_shadow(unsigned long eviction, struct zone *zone)
+static void *pack_shadow(int memcgid, struct zone *zone, unsigned long eviction)
 {
 	eviction >>= bucket_order;
+	eviction = (eviction << MEM_CGROUP_ID_SHIFT) | memcgid;
 	eviction = (eviction << NODES_SHIFT) | zone_to_nid(zone);
 	eviction = (eviction << ZONES_SHIFT) | zone_idx(zone);
 	eviction = (eviction << RADIX_TREE_EXCEPTIONAL_SHIFT);
@@ -176,18 +178,21 @@ static void *pack_shadow(unsigned long eviction, struct zone *zone)
 	return (void *)(eviction | RADIX_TREE_EXCEPTIONAL_ENTRY);
 }
 
-static void unpack_shadow(void *shadow, struct zone **zonep,
+static void unpack_shadow(void *shadow, int *memcgidp, struct zone **zonep,
 			  unsigned long *evictionp)
 {
 	unsigned long entry = (unsigned long)shadow;
-	int zid, nid;
+	int memcgid, nid, zid;
 
 	entry >>= RADIX_TREE_EXCEPTIONAL_SHIFT;
 	zid = entry & ((1UL << ZONES_SHIFT) - 1);
 	entry >>= ZONES_SHIFT;
 	nid = entry & ((1UL << NODES_SHIFT) - 1);
 	entry >>= NODES_SHIFT;
+	memcgid = entry & ((1UL << MEM_CGROUP_ID_SHIFT) - 1);
+	entry >>= MEM_CGROUP_ID_SHIFT;
 
+	*memcgidp = memcgid;
 	*zonep = NODE_DATA(nid)->node_zones + zid;
 	*evictionp = entry << bucket_order;
 }
@@ -202,11 +207,20 @@ static void unpack_shadow(void *shadow, struct zone **zonep,
  */
 void *workingset_eviction(struct address_space *mapping, struct page *page)
 {
+	struct mem_cgroup *memcg = page_memcg(page);
 	struct zone *zone = page_zone(page);
+	int memcgid = mem_cgroup_id(memcg);
 	unsigned long eviction;
+	struct lruvec *lruvec;
 
-	eviction = atomic_long_inc_return(&zone->inactive_age);
-	return pack_shadow(eviction, zone);
+	/* Page is fully exclusive and pins page->mem_cgroup */
+	VM_BUG_ON_PAGE(PageLRU(page), page);
+	VM_BUG_ON_PAGE(page_count(page), page);
+	VM_BUG_ON_PAGE(!PageLocked(page), page);
+
+	lruvec = mem_cgroup_zone_lruvec(zone, memcg);
+	eviction = atomic_long_inc_return(&lruvec->inactive_age);
+	return pack_shadow(memcgid, zone, eviction);
 }
 
 /**
@@ -221,13 +235,42 @@ void *workingset_eviction(struct address_space *mapping, struct page *page)
 bool workingset_refault(void *shadow)
 {
 	unsigned long refault_distance;
+	unsigned long active_file;
+	struct mem_cgroup *memcg;
 	unsigned long eviction;
+	struct lruvec *lruvec;
 	unsigned long refault;
 	struct zone *zone;
+	int memcgid;
 
-	unpack_shadow(shadow, &zone, &eviction);
+	unpack_shadow(shadow, &memcgid, &zone, &eviction);
 
-	refault = atomic_long_read(&zone->inactive_age);
+	rcu_read_lock();
+	/*
+	 * Look up the memcg associated with the stored ID. It might
+	 * have been deleted since the page's eviction.
+	 *
+	 * Note that in rare events the ID could have been recycled
+	 * for a new cgroup that refaults a shared page. This is
+	 * impossible to tell from the available data. However, this
+	 * should be a rare and limited disturbance, and activations
+	 * are always speculative anyway. Ultimately, it's the aging
+	 * algorithm's job to shake out the minimum access frequency
+	 * for the active cache.
+	 *
+	 * XXX: On !CONFIG_MEMCG, this will always return NULL; it
+	 * would be better if the root_mem_cgroup existed in all
+	 * configurations instead.
+	 */
+	memcg = mem_cgroup_from_id(memcgid);
+	if (!mem_cgroup_disabled() && !memcg) {
+		rcu_read_unlock();
+		return false;
+	}
+	lruvec = mem_cgroup_zone_lruvec(zone, memcg);
+	refault = atomic_long_read(&lruvec->inactive_age);
+	active_file = lruvec_lru_size(lruvec, LRU_ACTIVE_FILE);
+	rcu_read_unlock();
 
 	/*
 	 * The unsigned subtraction here gives an accurate distance
@@ -249,7 +292,7 @@ bool workingset_refault(void *shadow)
 
 	inc_zone_state(zone, WORKINGSET_REFAULT);
 
-	if (refault_distance <= zone_page_state(zone, NR_ACTIVE_FILE)) {
+	if (refault_distance <= active_file) {
 		inc_zone_state(zone, WORKINGSET_ACTIVATE);
 		return true;
 	}
@@ -262,7 +305,13 @@ bool workingset_refault(void *shadow)
  */
 void workingset_activation(struct page *page)
 {
-	atomic_long_inc(&page_zone(page)->inactive_age);
+	struct mem_cgroup *memcg;
+	struct lruvec *lruvec;
+
+	memcg = lock_page_memcg(page);
+	lruvec = mem_cgroup_zone_lruvec(page_zone(page), memcg);
+	atomic_long_inc(&lruvec->inactive_age);
+	unlock_page_memcg(memcg);
 }
 
 /*
-- 
2.7.0

^ permalink raw reply related	[flat|nested] 33+ messages in thread

* Re: [PATCH 1/5] mm: memcontrol: generalize locking for the page->mem_cgroup binding
  2016-01-26 21:00   ` Johannes Weiner
  (?)
@ 2016-01-27 14:30     ` Vladimir Davydov
  -1 siblings, 0 replies; 33+ messages in thread
From: Vladimir Davydov @ 2016-01-27 14:30 UTC (permalink / raw)
  To: Johannes Weiner
  Cc: Andrew Morton, Michal Hocko, linux-mm, cgroups, linux-kernel,
	kernel-team

On Tue, Jan 26, 2016 at 04:00:02PM -0500, Johannes Weiner wrote:

> @@ -683,17 +683,17 @@ int __set_page_dirty_buffers(struct page *page)
>  		} while (bh != head);
>  	}
>  	/*
> -	 * Use mem_group_begin_page_stat() to keep PageDirty synchronized with
> -	 * per-memcg dirty page counters.
> +	 * Lock out page->mem_cgroup migration to keep PageDirty
> +	 * synchronized with per-memcg dirty page counters.
>  	 */
> -	memcg = mem_cgroup_begin_page_stat(page);
> +	memcg = lock_page_memcg(page);
>  	newly_dirty = !TestSetPageDirty(page);
>  	spin_unlock(&mapping->private_lock);
>  
>  	if (newly_dirty)
>  		__set_page_dirty(page, mapping, memcg, 1);

Do we really want to pass memcg to __set_page_dirty and then to
account_page_dirtied, increasing stack/regs usage even when memory
cgroups are disabled? Maybe it'd be better to make
mem_cgroup_update_page_stat take a page instead of a memcg?

Thanks,
Vladimir

>  
> -	mem_cgroup_end_page_stat(memcg);
> +	unlock_page_memcg(memcg);
>  
>  	if (newly_dirty)
>  		__mark_inode_dirty(mapping->host, I_DIRTY_PAGES);

^ permalink raw reply	[flat|nested] 33+ messages in thread

* Re: [PATCH 1/5] mm: memcontrol: generalize locking for the page->mem_cgroup binding
@ 2016-01-27 14:30     ` Vladimir Davydov
  0 siblings, 0 replies; 33+ messages in thread
From: Vladimir Davydov @ 2016-01-27 14:30 UTC (permalink / raw)
  To: Johannes Weiner
  Cc: Andrew Morton, Michal Hocko, linux-mm, cgroups, linux-kernel,
	kernel-team

On Tue, Jan 26, 2016 at 04:00:02PM -0500, Johannes Weiner wrote:

> @@ -683,17 +683,17 @@ int __set_page_dirty_buffers(struct page *page)
>  		} while (bh != head);
>  	}
>  	/*
> -	 * Use mem_group_begin_page_stat() to keep PageDirty synchronized with
> -	 * per-memcg dirty page counters.
> +	 * Lock out page->mem_cgroup migration to keep PageDirty
> +	 * synchronized with per-memcg dirty page counters.
>  	 */
> -	memcg = mem_cgroup_begin_page_stat(page);
> +	memcg = lock_page_memcg(page);
>  	newly_dirty = !TestSetPageDirty(page);
>  	spin_unlock(&mapping->private_lock);
>  
>  	if (newly_dirty)
>  		__set_page_dirty(page, mapping, memcg, 1);

Do we really want to pass memcg to __set_page_dirty and then to
account_page_dirtied, increasing stack/regs usage even when memory
cgroups are disabled? Maybe it'd be better to make
mem_cgroup_update_page_stat take a page instead of a memcg?

Thanks,
Vladimir

>  
> -	mem_cgroup_end_page_stat(memcg);
> +	unlock_page_memcg(memcg);
>  
>  	if (newly_dirty)
>  		__mark_inode_dirty(mapping->host, I_DIRTY_PAGES);


^ permalink raw reply	[flat|nested] 33+ messages in thread

* Re: [PATCH 1/5] mm: memcontrol: generalize locking for the page->mem_cgroup binding
@ 2016-01-27 14:30     ` Vladimir Davydov
  0 siblings, 0 replies; 33+ messages in thread
From: Vladimir Davydov @ 2016-01-27 14:30 UTC (permalink / raw)
  To: Johannes Weiner
  Cc: Andrew Morton, Michal Hocko, linux-mm-Bw31MaZKKs3YtjvyW6yDsg,
	cgroups-u79uwXL29TY76Z2rM5mHXA,
	linux-kernel-u79uwXL29TY76Z2rM5mHXA, kernel-team-b10kYP2dOMg

On Tue, Jan 26, 2016 at 04:00:02PM -0500, Johannes Weiner wrote:

> @@ -683,17 +683,17 @@ int __set_page_dirty_buffers(struct page *page)
>  		} while (bh != head);
>  	}
>  	/*
> -	 * Use mem_group_begin_page_stat() to keep PageDirty synchronized with
> -	 * per-memcg dirty page counters.
> +	 * Lock out page->mem_cgroup migration to keep PageDirty
> +	 * synchronized with per-memcg dirty page counters.
>  	 */
> -	memcg = mem_cgroup_begin_page_stat(page);
> +	memcg = lock_page_memcg(page);
>  	newly_dirty = !TestSetPageDirty(page);
>  	spin_unlock(&mapping->private_lock);
>  
>  	if (newly_dirty)
>  		__set_page_dirty(page, mapping, memcg, 1);

Do we really want to pass memcg to __set_page_dirty and then to
account_page_dirtied, increasing stack/regs usage even when memory
cgroups are disabled? Maybe it'd be better to make
mem_cgroup_update_page_stat take a page instead of a memcg?

Thanks,
Vladimir

>  
> -	mem_cgroup_end_page_stat(memcg);
> +	unlock_page_memcg(memcg);
>  
>  	if (newly_dirty)
>  		__mark_inode_dirty(mapping->host, I_DIRTY_PAGES);

^ permalink raw reply	[flat|nested] 33+ messages in thread

* Re: [PATCH 2/5] mm: workingset: #define radix entry eviction mask
  2016-01-26 21:00   ` Johannes Weiner
  (?)
@ 2016-01-27 14:32     ` Vladimir Davydov
  -1 siblings, 0 replies; 33+ messages in thread
From: Vladimir Davydov @ 2016-01-27 14:32 UTC (permalink / raw)
  To: Johannes Weiner
  Cc: Andrew Morton, Michal Hocko, linux-mm, cgroups, linux-kernel,
	kernel-team

On Tue, Jan 26, 2016 at 04:00:03PM -0500, Johannes Weiner wrote:
> This is a compile-time constant, no need to calculate it on refault.
> 
> Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>

Reviewed-by: Vladimir Davydov <vdavydov@virtuozzo.com>

^ permalink raw reply	[flat|nested] 33+ messages in thread

* Re: [PATCH 2/5] mm: workingset: #define radix entry eviction mask
@ 2016-01-27 14:32     ` Vladimir Davydov
  0 siblings, 0 replies; 33+ messages in thread
From: Vladimir Davydov @ 2016-01-27 14:32 UTC (permalink / raw)
  To: Johannes Weiner
  Cc: Andrew Morton, Michal Hocko, linux-mm, cgroups, linux-kernel,
	kernel-team

On Tue, Jan 26, 2016 at 04:00:03PM -0500, Johannes Weiner wrote:
> This is a compile-time constant, no need to calculate it on refault.
> 
> Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>

Reviewed-by: Vladimir Davydov <vdavydov@virtuozzo.com>


^ permalink raw reply	[flat|nested] 33+ messages in thread

* Re: [PATCH 2/5] mm: workingset: #define radix entry eviction mask
@ 2016-01-27 14:32     ` Vladimir Davydov
  0 siblings, 0 replies; 33+ messages in thread
From: Vladimir Davydov @ 2016-01-27 14:32 UTC (permalink / raw)
  To: Johannes Weiner
  Cc: Andrew Morton, Michal Hocko, linux-mm-Bw31MaZKKs3YtjvyW6yDsg,
	cgroups-u79uwXL29TY76Z2rM5mHXA,
	linux-kernel-u79uwXL29TY76Z2rM5mHXA, kernel-team-b10kYP2dOMg

On Tue, Jan 26, 2016 at 04:00:03PM -0500, Johannes Weiner wrote:
> This is a compile-time constant, no need to calculate it on refault.
> 
> Signed-off-by: Johannes Weiner <hannes-druUgvl0LCNAfugRpC6u6w@public.gmane.org>

Reviewed-by: Vladimir Davydov <vdavydov-5HdwGun5lf+gSpxsJD1C4w@public.gmane.org>

^ permalink raw reply	[flat|nested] 33+ messages in thread

* Re: [PATCH 3/5] mm: workingset: separate shadow unpacking and refault calculation
  2016-01-26 21:00   ` Johannes Weiner
  (?)
@ 2016-01-27 14:34     ` Vladimir Davydov
  -1 siblings, 0 replies; 33+ messages in thread
From: Vladimir Davydov @ 2016-01-27 14:34 UTC (permalink / raw)
  To: Johannes Weiner
  Cc: Andrew Morton, Michal Hocko, linux-mm, cgroups, linux-kernel,
	kernel-team

On Tue, Jan 26, 2016 at 04:00:04PM -0500, Johannes Weiner wrote:
> Per-cgroup thrash detection will need to derive a live memcg from the
> eviction cookie, and doing that inside unpack_shadow() will get nasty
> with the reference handling spread over two functions.
> 
> In preparation, make unpack_shadow() clearly about extracting static
> data, and let workingset_refault() do all the higher-level handling.
> 
> Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>

Reviewed-by: Vladimir Davydov <vdavydov@virtuozzo.com>

^ permalink raw reply	[flat|nested] 33+ messages in thread

* Re: [PATCH 3/5] mm: workingset: separate shadow unpacking and refault calculation
@ 2016-01-27 14:34     ` Vladimir Davydov
  0 siblings, 0 replies; 33+ messages in thread
From: Vladimir Davydov @ 2016-01-27 14:34 UTC (permalink / raw)
  To: Johannes Weiner
  Cc: Andrew Morton, Michal Hocko, linux-mm, cgroups, linux-kernel,
	kernel-team

On Tue, Jan 26, 2016 at 04:00:04PM -0500, Johannes Weiner wrote:
> Per-cgroup thrash detection will need to derive a live memcg from the
> eviction cookie, and doing that inside unpack_shadow() will get nasty
> with the reference handling spread over two functions.
> 
> In preparation, make unpack_shadow() clearly about extracting static
> data, and let workingset_refault() do all the higher-level handling.
> 
> Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>

Reviewed-by: Vladimir Davydov <vdavydov@virtuozzo.com>


^ permalink raw reply	[flat|nested] 33+ messages in thread

* Re: [PATCH 3/5] mm: workingset: separate shadow unpacking and refault calculation
@ 2016-01-27 14:34     ` Vladimir Davydov
  0 siblings, 0 replies; 33+ messages in thread
From: Vladimir Davydov @ 2016-01-27 14:34 UTC (permalink / raw)
  To: Johannes Weiner
  Cc: Andrew Morton, Michal Hocko, linux-mm-Bw31MaZKKs3YtjvyW6yDsg,
	cgroups-u79uwXL29TY76Z2rM5mHXA,
	linux-kernel-u79uwXL29TY76Z2rM5mHXA, kernel-team-b10kYP2dOMg

On Tue, Jan 26, 2016 at 04:00:04PM -0500, Johannes Weiner wrote:
> Per-cgroup thrash detection will need to derive a live memcg from the
> eviction cookie, and doing that inside unpack_shadow() will get nasty
> with the reference handling spread over two functions.
> 
> In preparation, make unpack_shadow() clearly about extracting static
> data, and let workingset_refault() do all the higher-level handling.
> 
> Signed-off-by: Johannes Weiner <hannes-druUgvl0LCNAfugRpC6u6w@public.gmane.org>

Reviewed-by: Vladimir Davydov <vdavydov-5HdwGun5lf+gSpxsJD1C4w@public.gmane.org>

^ permalink raw reply	[flat|nested] 33+ messages in thread

* Re: [PATCH 4/5] mm: workingset: eviction buckets for bigmem/lowbit machines
  2016-01-26 21:00   ` Johannes Weiner
  (?)
@ 2016-01-27 14:39     ` Vladimir Davydov
  -1 siblings, 0 replies; 33+ messages in thread
From: Vladimir Davydov @ 2016-01-27 14:39 UTC (permalink / raw)
  To: Johannes Weiner
  Cc: Andrew Morton, Michal Hocko, linux-mm, cgroups, linux-kernel,
	kernel-team

On Tue, Jan 26, 2016 at 04:00:05PM -0500, Johannes Weiner wrote:
> For per-cgroup thrash detection, we need to store the memcg ID inside
> the radix tree cookie as well. However, on 32 bit that doesn't leave
> enough bits for the eviction timestamp to cover the necessary range of
> recently evicted pages. The radix tree entry would look like this:
> 
> [ RADIX_TREE_EXCEPTIONAL(2) | ZONEID(2) | MEMCGID(16) | EVICTION(12) ]
> 
> 12 bits means 4096 pages, means 16M worth of recently evicted pages.
> But refaults are actionable up to distances covering half of memory.
> To not miss refaults, we have to stretch out the range at the cost of
> how precisely we can tell when a page was evicted. This way we can
> shave off lower bits from the eviction timestamp until the necessary
> range is covered. E.g. grouping evictions into 1M buckets (256 pages)
> will stretch the longest representable refault distance to 4G.
> 
> This patch implements eviction buckets that are automatically sized
> according to the available bits and the necessary refault range, in
> preparation for per-cgroup thrash detection.
> 
> The maximum actionable distance is currently half of memory, but to
> support memory hotplug of up to 200% of boot-time memory, we size the
> buckets to cover double the distance. Beyond that, thrashing won't be
> detectable anymore.
> 
> During boot, the kernel will print out the exact parameters, like so:
> 
> [    0.113929] workingset: timestamp_bits=12 max_order=18 bucket_order=6
> 
> In this example, there are 12 radix entry bits available for the
> eviction timestamp, to cover a maximum distance of 2^18 pages (this is
> a 1G machine). Consequently, evictions must be grouped into buckets of
> 2^6 pages, or 256K.
> 
> Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>

Reviewed-by: Vladimir Davydov <vdavydov@virtuozzo.com>

One nit below.

> +/*
> + * Eviction timestamps need to be able to cover the full range of
> + * actionable refaults. However, bits are tight in the radix tree
> + * entry, and after storing the identifier for the lruvec there might
> + * not be enough left to represent every single actionable refault. In
> + * that case, we have to sacrifice granularity for distance, and group
> + * evictions into coarser buckets by shaving off lower timestamp bits.
> + */
> +static unsigned int bucket_order;

__read_mostly?
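
I.e., presumably just:

    static unsigned int bucket_order __read_mostly;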

Thanks,
Vladimir

^ permalink raw reply	[flat|nested] 33+ messages in thread

* Re: [PATCH 5/5] mm: workingset: per-cgroup cache thrash detection
  2016-01-26 21:00   ` Johannes Weiner
@ 2016-01-27 14:58     ` Vladimir Davydov
  -1 siblings, 0 replies; 33+ messages in thread
From: Vladimir Davydov @ 2016-01-27 14:58 UTC (permalink / raw)
  To: Johannes Weiner
  Cc: Andrew Morton, Michal Hocko, linux-mm, cgroups, linux-kernel,
	kernel-team

On Tue, Jan 26, 2016 at 04:00:06PM -0500, Johannes Weiner wrote:
> diff --git a/include/linux/swap.h b/include/linux/swap.h
> index b14a2bb33514..1cf3065c143b 100644
> --- a/include/linux/swap.h
> +++ b/include/linux/swap.h
> @@ -317,6 +317,7 @@ extern void lru_cache_add_active_or_unevictable(struct page *page,
>  
>  /* linux/mm/vmscan.c */
>  extern unsigned long zone_reclaimable_pages(struct zone *zone);
> +extern unsigned long lruvec_lru_size(struct lruvec *lruvec, enum lru_list lru);

Better declare it in mm/internal.h? And is there any point in renaming
it?

> diff --git a/mm/memcontrol.c b/mm/memcontrol.c
> index 953f0f984392..864e237f32d9 100644
> --- a/mm/memcontrol.c
> +++ b/mm/memcontrol.c
> @@ -202,11 +207,20 @@ static void unpack_shadow(void *shadow, struct zone **zonep,
>   */
>  void *workingset_eviction(struct address_space *mapping, struct page *page)
>  {
> +	struct mem_cgroup *memcg = page_memcg(page);
>  	struct zone *zone = page_zone(page);
> +	int memcgid = mem_cgroup_id(memcg);

This will crash in case memcg is disabled via boot param.

Thanks,
Vladimir

^ permalink raw reply	[flat|nested] 33+ messages in thread
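
For reference, the shadow entry produced at eviction now has to carry the memcg
id next to the zone id and the bucketed timestamp. Below is a rough sketch of
such packing under the 32-bit field widths quoted earlier in the thread; the
helper name, masks and shifts are assumptions, and the patch's own pack/unpack
helpers are what actually implement this:

#include <linux/mmzone.h>
#include <linux/radix-tree.h>

static unsigned int bucket_order;	/* sized as in the sketch above */

static void *pack_shadow_sketch(int memcgid, struct zone *zone,
				unsigned long eviction)
{
	eviction >>= bucket_order;				/* EVICTION: coarse bucket */
	eviction = (eviction << 16) | (memcgid & 0xffff);	/* MEMCGID(16) */
	eviction = (eviction << 2) | zone_idx(zone);		/* ZONEID(2) */
	eviction <<= RADIX_TREE_EXCEPTIONAL_SHIFT;

	/* tag the cookie so the radix tree treats it as an exceptional entry */
	return (void *)(eviction | RADIX_TREE_EXCEPTIONAL_ENTRY);
}

The refault path would reverse these shifts to recover the memcg id, look up
the matching lruvec, and compare the stored eviction bucket against that
lruvec's current inactive age.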

* Re: [PATCH 1/5] mm: memcontrol: generalize locking for the page->mem_cgroup binding
  2016-01-27 14:30     ` Vladimir Davydov
@ 2016-01-29 16:43       ` Johannes Weiner
  -1 siblings, 0 replies; 33+ messages in thread
From: Johannes Weiner @ 2016-01-29 16:43 UTC (permalink / raw)
  To: Vladimir Davydov
  Cc: Andrew Morton, Michal Hocko, linux-mm, cgroups, linux-kernel,
	kernel-team

On Wed, Jan 27, 2016 at 05:30:45PM +0300, Vladimir Davydov wrote:
> On Tue, Jan 26, 2016 at 04:00:02PM -0500, Johannes Weiner wrote:
> 
> > @@ -683,17 +683,17 @@ int __set_page_dirty_buffers(struct page *page)
> >  		} while (bh != head);
> >  	}
> >  	/*
> > -	 * Use mem_group_begin_page_stat() to keep PageDirty synchronized with
> > -	 * per-memcg dirty page counters.
> > +	 * Lock out page->mem_cgroup migration to keep PageDirty
> > +	 * synchronized with per-memcg dirty page counters.
> >  	 */
> > -	memcg = mem_cgroup_begin_page_stat(page);
> > +	memcg = lock_page_memcg(page);
> >  	newly_dirty = !TestSetPageDirty(page);
> >  	spin_unlock(&mapping->private_lock);
> >  
> >  	if (newly_dirty)
> >  		__set_page_dirty(page, mapping, memcg, 1);
> 
> Do we really want to pass memcg to __set_page_dirty and then to
> account_page_dirtied, increasing stack/regs usage even when the memory
> cgroup controller is disabled? Maybe it'd be better to make
> mem_cgroup_update_page_stat take a page instead of a memcg?

I'll look into that. It will need changing migration to leave the
page->mem_cgroup binding of live pages alone, but that's something
worth doing anyway. It's beyond the scope of these patches, though.

Thanks

^ permalink raw reply	[flat|nested] 33+ messages in thread
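
The alternative floated in the quote would look roughly like the sketch below,
as it might sit in include/linux/memcontrol.h. This is an assumption for
illustration, not code from the thread: the helper looks up page->mem_cgroup
itself, under lock_page_memcg(), so __set_page_dirty() and
account_page_dirtied() no longer need a memcg argument threaded through:

/* hypothetical page-based variant of mem_cgroup_update_page_stat() */
static inline void mem_cgroup_update_page_stat(struct page *page,
					       enum mem_cgroup_stat_index idx,
					       int val)
{
	/* page->mem_cgroup is stable while lock_page_memcg() is held */
	if (page->mem_cgroup)
		/* per-cpu stat array, field names per the existing memcg helper */
		this_cpu_add(page->mem_cgroup->stat->count[idx], val);
}

With the controller disabled, page->mem_cgroup is NULL and the call reduces to
a single test, which is what the stack/register concern above is getting at.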

* Re: [PATCH 5/5] mm: workingset: per-cgroup cache thrash detection
  2016-01-27 14:58     ` Vladimir Davydov
@ 2016-01-29 17:30       ` Johannes Weiner
  -1 siblings, 0 replies; 33+ messages in thread
From: Johannes Weiner @ 2016-01-29 17:30 UTC (permalink / raw)
  To: Vladimir Davydov
  Cc: Andrew Morton, Michal Hocko, linux-mm, cgroups, linux-kernel,
	kernel-team

On Wed, Jan 27, 2016 at 05:58:27PM +0300, Vladimir Davydov wrote:
> On Tue, Jan 26, 2016 at 04:00:06PM -0500, Johannes Weiner wrote:
> > diff --git a/include/linux/swap.h b/include/linux/swap.h
> > index b14a2bb33514..1cf3065c143b 100644
> > --- a/include/linux/swap.h
> > +++ b/include/linux/swap.h
> > @@ -317,6 +317,7 @@ extern void lru_cache_add_active_or_unevictable(struct page *page,
> >  
> >  /* linux/mm/vmscan.c */
> >  extern unsigned long zone_reclaimable_pages(struct zone *zone);
> > +extern unsigned long lruvec_lru_size(struct lruvec *lruvec, enum lru_list lru);
> 
> Better declare it in mm/internal.h? And is there any point in renaming
> it?

mm/internal.h kinda sucks. However, it was only in swap.h because it
started as an inline function. I now moved it to mmzone.h where struct
lruvec is defined and the other lruvec functions are declared/defined.

As for the rename, get_lru_size() is not a great name for a wider
scope. So I followed the spirit of lruvec_init() and lruvec_zone().

> > diff --git a/mm/memcontrol.c b/mm/memcontrol.c
> > index 953f0f984392..864e237f32d9 100644
> > --- a/mm/memcontrol.c
> > +++ b/mm/memcontrol.c
> > @@ -202,11 +207,20 @@ static void unpack_shadow(void *shadow, struct zone **zonep,
> >   */
> >  void *workingset_eviction(struct address_space *mapping, struct page *page)
> >  {
> > +	struct mem_cgroup *memcg = page_memcg(page);
> >  	struct zone *zone = page_zone(page);
> > +	int memcgid = mem_cgroup_id(memcg);
> 
> This will crash in case memcg is disabled via boot param.

Thanks for catching that. I added a mem_cgroup_disabled() check to
mem_cgroup_id() since it's now a public interface.

^ permalink raw reply	[flat|nested] 33+ messages in thread
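
A minimal sketch of that guard, as it might look in include/linux/memcontrol.h
(assumed shape; css ids are assumed to start at 1, so 0 is free to mean "no
cgroup"):

static inline unsigned short mem_cgroup_id(struct mem_cgroup *memcg)
{
	if (mem_cgroup_disabled())
		return 0;

	return memcg->css.id;	/* id of the memcg's css, assumed to fit 16 bits */
}

With cgroup_disable=memory, page_memcg() returns NULL, so the early return
avoids the dereference and the shadow entry simply records memcg id 0; the
refault side can then fall back to the zone-wide lruvec, preserving the old
global behavior.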

end of thread, other threads:[~2016-01-29 17:30 UTC | newest]

Thread overview: 33+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2016-01-26 21:00 [PATCH 0/5] mm: workingset: per-cgroup thrash detection Johannes Weiner
2016-01-26 21:00 ` [PATCH 1/5] mm: memcontrol: generalize locking for the page->mem_cgroup binding Johannes Weiner
2016-01-27 14:30   ` Vladimir Davydov
2016-01-29 16:43     ` Johannes Weiner
2016-01-26 21:00 ` [PATCH 2/5] mm: workingset: #define radix entry eviction mask Johannes Weiner
2016-01-27 14:32   ` Vladimir Davydov
2016-01-26 21:00 ` [PATCH 3/5] mm: workingset: separate shadow unpacking and refault calculation Johannes Weiner
2016-01-27 14:34   ` Vladimir Davydov
2016-01-26 21:00 ` [PATCH 4/5] mm: workingset: eviction buckets for bigmem/lowbit machines Johannes Weiner
2016-01-27 14:39   ` Vladimir Davydov
2016-01-26 21:00 ` [PATCH 5/5] mm: workingset: per-cgroup cache thrash detection Johannes Weiner
2016-01-27 14:58   ` Vladimir Davydov
2016-01-29 17:30     ` Johannes Weiner
