* [PATCH v2] mm: workingset: make workingset detection logic memcg aware
From: Vladimir Davydov @ 2016-01-24 16:56 UTC
  To: Andrew Morton; +Cc: Johannes Weiner, Michal Hocko, linux-mm, linux-kernel

Currently, inactive_age is maintained per zone, which results in
unexpected file page activations in case memory cgroups are used. For
example, if the total number of active pages is big, a memory cgroup
might get every refaulted file page activated even if refault distance
is much greater than the number of active file pages in the cgroup. This
patch fixes this issue by making inactive_age per lruvec.

The patch is pretty straightforward and self-explaining, but there are
two things that should be noted:

 - workingset_{eviction,activation} need to get lruvec given a page.
   On the default hierarchy one can safely access page->mem_cgroup
   provided the page is pinned, but on the legacy hierarchy a page can
   be migrated from one cgroup to another at any moment, so extra care
   must be taken to assure page->mem_cgroup will stay put.

   workingset_eviction is passed a locked page, so it is safe to use
   page->mem_cgroup in this function. workingset_activation is trickier:
   it is called from mark_page_accessed, where the page is not
   necessarily locked. To protect it against page->mem_cgroup change, we
   move it to __activate_page, which is called by mark_page_accessed
   once there's enough pages on percpu pagevec. This function is called
   with zone->lru_lock held, which rules out page charge migration.

 - To calculate refault distance correctly even in case a page is
   refaulted by a different cgroup, we need to store memcg id in shadow
   entry. There's no problem with it on 64-bit, but on 32-bit there's
   not much space left in radix tree slot after storing information
   about node, zone, and memory cgroup, so we can't just save eviction
   counter as is, because it would trim max refault distance making it
   unusable.

   To overcome this problem, we increase refault distance granularity,
   as proposed by Johannes Weiner. We disregard 10 least significant
   bits of eviction counter. This reduces refault distance accuracy to
   4MB, which is still fine. With the default NODE_SHIFT (3) this leaves
   us 9 bits for storing eviction counter, hence maximal refault
   distance will be 2GB, which should be enough for 32-bit systems.
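
   For reference, here is the rough bit budget behind these numbers,
   assuming 4KB pages and ZONES_SHIFT == 2 (both of which are
   config-dependent and not spelled out above):

     32 bits total
      -  2  RADIX_TREE_EXCEPTIONAL_SHIFT
      -  2  ZONES_SHIFT
      -  3  NODES_SHIFT
      - 16  memory cgroup id
      =  9  bits left for the eviction counter

     granularity          = 2^10 pages * 4KB = 4MB
     max refault distance = 2^9 * 4MB        = 2GB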

Signed-off-by: Vladimir Davydov <vdavydov@virtuozzo.com>
---
Changes from v1:
 - Handle refaults by different cgroups properly (Johannes).

v1: http://www.spinics.net/lists/linux-mm/msg92466.html

 include/linux/memcontrol.h |  44 ++++++++++++++++
 include/linux/mmzone.h     |   7 +--
 include/linux/swap.h       |   1 +
 mm/memcontrol.c            |  25 ---------
 mm/swap.c                  |   5 +-
 mm/truncate.c              |   1 +
 mm/workingset.c            | 125 ++++++++++++++++++++++++++++++++++++++++-----
 7 files changed, 166 insertions(+), 42 deletions(-)

diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
index 9ae48d4aeb5e..fd67027bf2e7 100644
--- a/include/linux/memcontrol.h
+++ b/include/linux/memcontrol.h
@@ -366,6 +366,42 @@ static inline bool mem_cgroup_online(struct mem_cgroup *memcg)
 }
 
 /*
+ * We restrict the id in the range of [1, 65535], so it can fit into
+ * an unsigned short.
+ */
+#define MEM_CGROUP_ID_SHIFT	16
+#define MEM_CGROUP_ID_MAX	((1 << MEM_CGROUP_ID_SHIFT) - 1)
+
+static inline unsigned short mem_cgroup_id(struct mem_cgroup *memcg)
+{
+	return memcg->css.id;
+}
+
+/*
+ * A helper function to get mem_cgroup from ID. must be called under
+ * rcu_read_lock().  The caller is responsible for calling
+ * css_tryget_online() if the mem_cgroup is used for charging. (dropping
+ * refcnt from swap can be called against removed memcg.)
+ */
+static inline struct mem_cgroup *mem_cgroup_from_id(unsigned short id)
+{
+	struct cgroup_subsys_state *css;
+
+	css = css_from_id(id, &memory_cgrp_subsys);
+	return mem_cgroup_from_css(css);
+}
+
+static inline void mem_cgroup_get(struct mem_cgroup *memcg)
+{
+	css_get(&memcg->css);
+}
+
+static inline void mem_cgroup_put(struct mem_cgroup *memcg)
+{
+	css_put(&memcg->css);
+}
+
+/*
  * For memory reclaim.
  */
 int mem_cgroup_select_victim_node(struct mem_cgroup *memcg);
@@ -590,6 +626,14 @@ static inline bool mem_cgroup_online(struct mem_cgroup *memcg)
 	return true;
 }
 
+static inline void mem_cgroup_get(struct mem_cgroup *memcg)
+{
+}
+
+static inline void mem_cgroup_put(struct mem_cgroup *memcg)
+{
+}
+
 static inline bool
 mem_cgroup_inactive_anon_is_low(struct lruvec *lruvec)
 {
diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h
index 7b6c2cfee390..684368ccea50 100644
--- a/include/linux/mmzone.h
+++ b/include/linux/mmzone.h
@@ -211,6 +211,10 @@ struct zone_reclaim_stat {
 struct lruvec {
 	struct list_head lists[NR_LRU_LISTS];
 	struct zone_reclaim_stat reclaim_stat;
+
+	/* Evictions & activations on the inactive file list */
+	atomic_long_t		inactive_age;
+
 #ifdef CONFIG_MEMCG
 	struct zone *zone;
 #endif
@@ -487,9 +491,6 @@ struct zone {
 	spinlock_t		lru_lock;
 	struct lruvec		lruvec;
 
-	/* Evictions & activations on the inactive file list */
-	atomic_long_t		inactive_age;
-
 	/*
 	 * When free pages are below this point, additional steps are taken
 	 * when reading the number of free pages to avoid per-cpu counter
diff --git a/include/linux/swap.h b/include/linux/swap.h
index b14a2bb33514..b3713332c754 100644
--- a/include/linux/swap.h
+++ b/include/linux/swap.h
@@ -252,6 +252,7 @@ struct swap_info_struct {
 void *workingset_eviction(struct address_space *mapping, struct page *page);
 bool workingset_refault(void *shadow);
 void workingset_activation(struct page *page);
+void workingset_release_shadow(void *shadow);
 extern struct list_lru workingset_shadow_nodes;
 
 static inline unsigned int workingset_node_pages(struct radix_tree_node *node)
diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index d06cae2de783..4ea79f225fe8 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -268,31 +268,6 @@ static inline bool mem_cgroup_is_root(struct mem_cgroup *memcg)
 	return (memcg == root_mem_cgroup);
 }
 
-/*
- * We restrict the id in the range of [1, 65535], so it can fit into
- * an unsigned short.
- */
-#define MEM_CGROUP_ID_MAX	USHRT_MAX
-
-static inline unsigned short mem_cgroup_id(struct mem_cgroup *memcg)
-{
-	return memcg->css.id;
-}
-
-/*
- * A helper function to get mem_cgroup from ID. must be called under
- * rcu_read_lock().  The caller is responsible for calling
- * css_tryget_online() if the mem_cgroup is used for charging. (dropping
- * refcnt from swap can be called against removed memcg.)
- */
-static inline struct mem_cgroup *mem_cgroup_from_id(unsigned short id)
-{
-	struct cgroup_subsys_state *css;
-
-	css = css_from_id(id, &memory_cgrp_subsys);
-	return mem_cgroup_from_css(css);
-}
-
 #ifndef CONFIG_SLOB
 /*
  * This will be the memcg's index in each cache's ->memcg_params.memcg_caches.
diff --git a/mm/swap.c b/mm/swap.c
index 09fe5e97714a..4b5d7a1f9742 100644
--- a/mm/swap.c
+++ b/mm/swap.c
@@ -270,6 +270,9 @@ static void __activate_page(struct page *page, struct lruvec *lruvec,
 
 		__count_vm_event(PGACTIVATE);
 		update_page_reclaim_stat(lruvec, file, 1);
+
+		if (file)
+			workingset_activation(page);
 	}
 }
 
@@ -375,8 +378,6 @@ void mark_page_accessed(struct page *page)
 		else
 			__lru_cache_activate_page(page);
 		ClearPageReferenced(page);
-		if (page_is_file_cache(page))
-			workingset_activation(page);
 	} else if (!PageReferenced(page)) {
 		SetPageReferenced(page);
 	}
diff --git a/mm/truncate.c b/mm/truncate.c
index e3ee0e27cd17..a8bae846d399 100644
--- a/mm/truncate.c
+++ b/mm/truncate.c
@@ -52,6 +52,7 @@ static void clear_exceptional_entry(struct address_space *mapping,
 			goto unlock;
 		radix_tree_replace_slot(slot, NULL);
 		mapping->nrexceptional--;
+		workingset_release_shadow(entry);
 		if (!node)
 			goto unlock;
 		workingset_node_shadows_dec(node);
diff --git a/mm/workingset.c b/mm/workingset.c
index 61ead9e5549d..30298eaee397 100644
--- a/mm/workingset.c
+++ b/mm/workingset.c
@@ -142,7 +142,7 @@
  *		Implementation
  *
  * For each zone's file LRU lists, a counter for inactive evictions
- * and activations is maintained (zone->inactive_age).
+ * and activations is maintained (lruvec->inactive_age).
  *
  * On eviction, a snapshot of this counter (along with some bits to
  * identify the zone) is stored in the now empty page cache radix tree
@@ -152,8 +152,72 @@
  * refault distance will immediately activate the refaulting page.
  */
 
-static void *pack_shadow(unsigned long eviction, struct zone *zone)
+#ifdef CONFIG_MEMCG
+/*
+ * On 32-bit there is not much space left in radix tree slot after
+ * storing information about node, zone, and memory cgroup, so we
+ * disregard 10 least significant bits of eviction counter. This
+ * reduces refault distance accuracy to 4MB, which is still fine.
+ *
+ * With the default NODE_SHIFT (3) this leaves us 9 bits for storing
+ * eviction counter, hence maximal refault distance will be 2GB, which
+ * should be enough for 32-bit systems.
+ */
+#ifdef CONFIG_64BIT
+# define REFAULT_DISTANCE_GRANULARITY		0
+#else
+# define REFAULT_DISTANCE_GRANULARITY		10
+#endif
+
+static unsigned long pack_shadow_memcg(unsigned long eviction,
+				       struct mem_cgroup *memcg)
+{
+	if (mem_cgroup_disabled())
+		return eviction;
+
+	eviction >>= REFAULT_DISTANCE_GRANULARITY;
+	eviction = (eviction << MEM_CGROUP_ID_SHIFT) | mem_cgroup_id(memcg);
+	return eviction;
+}
+
+static unsigned long unpack_shadow_memcg(unsigned long entry,
+					 unsigned long *mask,
+					 struct mem_cgroup **memcg)
+{
+	if (mem_cgroup_disabled()) {
+		*memcg = NULL;
+		return entry;
+	}
+
+	rcu_read_lock();
+	*memcg = mem_cgroup_from_id(entry & MEM_CGROUP_ID_MAX);
+	rcu_read_unlock();
+
+	entry >>= MEM_CGROUP_ID_SHIFT;
+	entry <<= REFAULT_DISTANCE_GRANULARITY;
+	*mask >>= MEM_CGROUP_ID_SHIFT - REFAULT_DISTANCE_GRANULARITY;
+	return entry;
+}
+#else /* !CONFIG_MEMCG */
+static unsigned long pack_shadow_memcg(unsigned long eviction,
+				       struct mem_cgroup *memcg)
+{
+	return eviction;
+}
+
+static unsigned long unpack_shadow_memcg(unsigned long entry,
+					 unsigned long *mask,
+					 struct mem_cgroup **memcg)
+{
+	*memcg = NULL;
+	return entry;
+}
+#endif /* CONFIG_MEMCG */
+
+static void *pack_shadow(unsigned long eviction, struct zone *zone,
+			 struct mem_cgroup *memcg)
 {
+	eviction = pack_shadow_memcg(eviction, memcg);
 	eviction = (eviction << NODES_SHIFT) | zone_to_nid(zone);
 	eviction = (eviction << ZONES_SHIFT) | zone_idx(zone);
 	eviction = (eviction << RADIX_TREE_EXCEPTIONAL_SHIFT);
@@ -163,6 +227,7 @@ static void *pack_shadow(unsigned long eviction, struct zone *zone)
 
 static void unpack_shadow(void *shadow,
 			  struct zone **zone,
+			  struct mem_cgroup **memcg,
 			  unsigned long *distance)
 {
 	unsigned long entry = (unsigned long)shadow;
@@ -170,19 +235,23 @@ static void unpack_shadow(void *shadow,
 	unsigned long refault;
 	unsigned long mask;
 	int zid, nid;
+	struct lruvec *lruvec;
 
 	entry >>= RADIX_TREE_EXCEPTIONAL_SHIFT;
 	zid = entry & ((1UL << ZONES_SHIFT) - 1);
 	entry >>= ZONES_SHIFT;
 	nid = entry & ((1UL << NODES_SHIFT) - 1);
 	entry >>= NODES_SHIFT;
-	eviction = entry;
 
-	*zone = NODE_DATA(nid)->node_zones + zid;
-
-	refault = atomic_long_read(&(*zone)->inactive_age);
 	mask = ~0UL >> (NODES_SHIFT + ZONES_SHIFT +
 			RADIX_TREE_EXCEPTIONAL_SHIFT);
+
+	eviction = unpack_shadow_memcg(entry, &mask, memcg);
+
+	*zone = NODE_DATA(nid)->node_zones + zid;
+	lruvec = mem_cgroup_zone_lruvec(*zone, *memcg);
+
+	refault = atomic_long_read(&lruvec->inactive_age);
 	/*
 	 * The unsigned subtraction here gives an accurate distance
 	 * across inactive_age overflows in most cases.
@@ -213,10 +282,16 @@ static void unpack_shadow(void *shadow,
 void *workingset_eviction(struct address_space *mapping, struct page *page)
 {
 	struct zone *zone = page_zone(page);
+	struct mem_cgroup *memcg = page_memcg(page);
+	struct lruvec *lruvec;
 	unsigned long eviction;
 
-	eviction = atomic_long_inc_return(&zone->inactive_age);
-	return pack_shadow(eviction, zone);
+	if (!mem_cgroup_disabled())
+		mem_cgroup_get(memcg);
+
+	lruvec = mem_cgroup_zone_lruvec(zone, memcg);
+	eviction = atomic_long_inc_return(&lruvec->inactive_age);
+	return pack_shadow(eviction, zone, memcg);
 }
 
 /**
@@ -230,13 +305,22 @@ void *workingset_eviction(struct address_space *mapping, struct page *page)
  */
 bool workingset_refault(void *shadow)
 {
-	unsigned long refault_distance;
+	unsigned long refault_distance, nr_active;
 	struct zone *zone;
+	struct mem_cgroup *memcg;
+	struct lruvec *lruvec;
 
-	unpack_shadow(shadow, &zone, &refault_distance);
+	unpack_shadow(shadow, &zone, &memcg, &refault_distance);
 	inc_zone_state(zone, WORKINGSET_REFAULT);
 
-	if (refault_distance <= zone_page_state(zone, NR_ACTIVE_FILE)) {
+	if (!mem_cgroup_disabled()) {
+		lruvec = mem_cgroup_zone_lruvec(zone, memcg);
+		nr_active = mem_cgroup_get_lru_size(lruvec, LRU_ACTIVE_FILE);
+		mem_cgroup_put(memcg);
+	} else
+		nr_active = zone_page_state(zone, NR_ACTIVE_FILE);
+
+	if (refault_distance <= nr_active) {
 		inc_zone_state(zone, WORKINGSET_ACTIVATE);
 		return true;
 	}
@@ -249,7 +333,23 @@ bool workingset_refault(void *shadow)
  */
 void workingset_activation(struct page *page)
 {
-	atomic_long_inc(&page_zone(page)->inactive_age);
+	struct lruvec *lruvec;
+
+	lruvec = mem_cgroup_page_lruvec(page, page_zone(page));
+	atomic_long_inc(&lruvec->inactive_age);
+}
+
+void workingset_release_shadow(void *shadow)
+{
+	unsigned long refault_distance;
+	struct zone *zone;
+	struct mem_cgroup *memcg;
+
+	if (mem_cgroup_disabled())
+		return;
+
+	unpack_shadow(shadow, &zone, &memcg, &refault_distance);
+	mem_cgroup_put(memcg);
 }
 
 /*
@@ -348,6 +448,7 @@ static enum lru_status shadow_lru_isolate(struct list_head *item,
 	for (i = 0; i < RADIX_TREE_MAP_SIZE; i++) {
 		if (node->slots[i]) {
 			BUG_ON(!radix_tree_exceptional_entry(node->slots[i]));
+			workingset_release_shadow(node->slots[i]);
 			node->slots[i] = NULL;
 			BUG_ON(node->count < (1U << RADIX_TREE_COUNT_SHIFT));
 			node->count -= 1U << RADIX_TREE_COUNT_SHIFT;
-- 
2.1.4

* Re: [PATCH v2] mm: workingset: make workingset detection logic memcg aware
From: Johannes Weiner @ 2016-01-25 16:39 UTC
  To: Vladimir Davydov; +Cc: Andrew Morton, Michal Hocko, linux-mm, linux-kernel

On Sun, Jan 24, 2016 at 07:56:16PM +0300, Vladimir Davydov wrote:
> Currently, inactive_age is maintained per zone, which results in
> unexpected file page activations in case memory cgroups are used. For
> example, if the total number of active pages is big, a memory cgroup
> might get every refaulted file page activated even if refault distance
> is much greater than the number of active file pages in the cgroup. This
> patch fixes this issue by making inactive_age per lruvec.

Argh!!

It's great that you're still interested in this and kept working on
it. I just regret that I worked on the same stuff the last couple days
without pinging you before picking it up. Oh well...

However, my patches are sufficiently different that I think it makes
sense to discuss them both and figure out the best end result. I have
some comments below and will follow up this email with my version.

> The patch is pretty straightforward and self-explaining, but there are
> two things that should be noted:
> 
>  - workingset_{eviction,activation} need to get lruvec given a page.
>    On the default hierarchy one can safely access page->mem_cgroup
>    provided the page is pinned, but on the legacy hierarchy a page can
>    be migrated from one cgroup to another at any moment, so extra care
>    must be taken to assure page->mem_cgroup will stay put.
> 
>    workingset_eviction is passed a locked page, so it is safe to use
>    page->mem_cgroup in this function. workingset_activation is trickier:
>    it is called from mark_page_accessed, where the page is not
>    necessarily locked. To protect it against page->mem_cgroup change, we
>    move it to __activate_page, which is called by mark_page_accessed
>    once there's enough pages on percpu pagevec. This function is called
>    with zone->lru_lock held, which rules out page charge migration.

When a page moves to another cgroup at the same time it's activated,
there really is no wrong lruvec to age. Both would be correct. The
locking guarantees a stable answer, but we don't need it here. It's
enough to take the rcu lock here to ensure page_memcg() isn't freed.
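
In code, that amounts to something like the following (this is what
patch 3/3 below does in workingset_activation()):

	rcu_read_lock();
	lruvec = mem_cgroup_zone_lruvec(page_zone(page), page_memcg(page));
	atomic_long_inc(&lruvec->inactive_age);
	rcu_read_unlock();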

>  - To calculate refault distance correctly even in case a page is
>    refaulted by a different cgroup, we need to store memcg id in shadow
>    entry. There's no problem with it on 64-bit, but on 32-bit there's
>    not much space left in radix tree slot after storing information
>    about node, zone, and memory cgroup, so we can't just save eviction
>    counter as is, because it would trim max refault distance making it
>    unusable.
> 
>    To overcome this problem, we increase refault distance granularity,
>    as proposed by Johannes Weiner. We disregard 10 least significant
>    bits of eviction counter. This reduces refault distance accuracy to
>    4MB, which is still fine. With the default NODE_SHIFT (3) this leaves
>    us 9 bits for storing eviction counter, hence maximal refault
>    distance will be 2GB, which should be enough for 32-bit systems.

If we limit it to 2G it becomes a clear-cut correctness issue once you
have more memory. Instead, we should continue to stretch out the
distance with an ever-increasing bucket size. The more memory you
have, the less important the granularity becomes anyway. With 8G, an
8M granularity is still okay, and so forth. And once we get beyond a
certain point, and it creates problems for people, it should be fair
enough to recommend upgrading to 64 bit.
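
As a rough sketch, this is essentially the bucket_order calculation
that patch 1/3 below introduces in workingset_init():

	/* Bits left for the eviction timestamp in the shadow entry */
	timestamp_bits = BITS_PER_LONG - EVICTION_SHIFT;
	/* Highest order needed to cover all of memory */
	max_order = fls_long(totalram_pages - 1);
	if (max_order > timestamp_bits)
		bucket_order = max_order - timestamp_bits;
	/* Evictions are then grouped into 2^bucket_order page buckets */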

> @@ -152,8 +152,72 @@
>   * refault distance will immediately activate the refaulting page.
>   */
>  
> -static void *pack_shadow(unsigned long eviction, struct zone *zone)
> +#ifdef CONFIG_MEMCG
> +/*
> + * On 32-bit there is not much space left in radix tree slot after
> + * storing information about node, zone, and memory cgroup, so we
> + * disregard 10 least significant bits of eviction counter. This
> + * reduces refault distance accuracy to 4MB, which is still fine.
> + *
> + * With the default NODE_SHIFT (3) this leaves us 9 bits for storing
> + * eviction counter, hence maximal refault distance will be 2GB, which
> + * should be enough for 32-bit systems.
> + */
> +#ifdef CONFIG_64BIT
> +# define REFAULT_DISTANCE_GRANULARITY		0
> +#else
> +# define REFAULT_DISTANCE_GRANULARITY		10
> +#endif
> +
> +static unsigned long pack_shadow_memcg(unsigned long eviction,
> +				       struct mem_cgroup *memcg)
> +{
> +	if (mem_cgroup_disabled())
> +		return eviction;
> +
> +	eviction >>= REFAULT_DISTANCE_GRANULARITY;
> +	eviction = (eviction << MEM_CGROUP_ID_SHIFT) | mem_cgroup_id(memcg);
> +	return eviction;
> +}
> +
> +static unsigned long unpack_shadow_memcg(unsigned long entry,
> +					 unsigned long *mask,
> +					 struct mem_cgroup **memcg)
> +{
> +	if (mem_cgroup_disabled()) {
> +		*memcg = NULL;
> +		return entry;
> +	}
> +
> +	rcu_read_lock();
> +	*memcg = mem_cgroup_from_id(entry & MEM_CGROUP_ID_MAX);
> +	rcu_read_unlock();
> +
> +	entry >>= MEM_CGROUP_ID_SHIFT;
> +	entry <<= REFAULT_DISTANCE_GRANULARITY;
> +	*mask >>= MEM_CGROUP_ID_SHIFT - REFAULT_DISTANCE_GRANULARITY;
> +	return entry;
> +}
> +#else /* !CONFIG_MEMCG */
> +static unsigned long pack_shadow_memcg(unsigned long eviction,
> +				       struct mem_cgroup *memcg)
> +{
> +	return eviction;
> +}
> +
> +static unsigned long unpack_shadow_memcg(unsigned long entry,
> +					 unsigned long *mask,
> +					 struct mem_cgroup **memcg)
> +{
> +	*memcg = NULL;
> +	return entry;
> +}
> +#endif /* CONFIG_MEMCG */
> +
> +static void *pack_shadow(unsigned long eviction, struct zone *zone,
> +			 struct mem_cgroup *memcg)
>  {
> +	eviction = pack_shadow_memcg(eviction, memcg);
>  	eviction = (eviction << NODES_SHIFT) | zone_to_nid(zone);
>  	eviction = (eviction << ZONES_SHIFT) | zone_idx(zone);
>  	eviction = (eviction << RADIX_TREE_EXCEPTIONAL_SHIFT);

For !CONFIG_MEMCG, we can define MEMCG_ID_SHIFT to 0 and pass in a
cssid of 0. That would save much of the special casing here.

> @@ -213,10 +282,16 @@ static void unpack_shadow(void *shadow,
>  void *workingset_eviction(struct address_space *mapping, struct page *page)
>  {
>  	struct zone *zone = page_zone(page);
> +	struct mem_cgroup *memcg = page_memcg(page);
> +	struct lruvec *lruvec;
>  	unsigned long eviction;
>  
> -	eviction = atomic_long_inc_return(&zone->inactive_age);
> -	return pack_shadow(eviction, zone);
> +	if (!mem_cgroup_disabled())
> +		mem_cgroup_get(memcg);
> +
> +	lruvec = mem_cgroup_zone_lruvec(zone, memcg);
> +	eviction = atomic_long_inc_return(&lruvec->inactive_age);
> +	return pack_shadow(eviction, zone, memcg);

I don't think we need to hold a reference to the memcg here; it should
be enough to verify whether the cssid still exists upon refault.

What could theoretically happen then is that the original memcg gets
deleted, a new one will reuse the same id and then refault the same
pages. However, there are two things that should make this acceptable:
1. a couple pages don't matter in this case, and sharing data between
cgroups on a large scale already leads to weird accounting artifacts
in other places; this wouldn't be much different. 2. from a system
perspective, those pages were in fact recently evicted, so even if we
activate them by accident in the new cgroup, it wouldn't be entirely
unreasonable. The workload will shake out what the true active list
frequency is on its own.

So I think we can just do away with the reference counting for now,
and reconsider it should the false sharing in this case create new
problems that are worse than existing consequences of false sharing.
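
A minimal sketch of that refault-side check (patch 3/3 below implements
it this way):

	rcu_read_lock();
	memcg = mem_cgroup_from_id(memcgid);
	if (!mem_cgroup_disabled() && !memcg) {
		/* The remembered cgroup is gone, don't count the refault */
		rcu_read_unlock();
		return false;
	}
	/* ... look up the lruvec and read inactive_age under RCU ... */
	rcu_read_unlock();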

> @@ -230,13 +305,22 @@ void *workingset_eviction(struct address_space *mapping, struct page *page)
>   */
>  bool workingset_refault(void *shadow)
>  {
> -	unsigned long refault_distance;
> +	unsigned long refault_distance, nr_active;
>  	struct zone *zone;
> +	struct mem_cgroup *memcg;
> +	struct lruvec *lruvec;
>  
> -	unpack_shadow(shadow, &zone, &refault_distance);
> +	unpack_shadow(shadow, &zone, &memcg, &refault_distance);
>  	inc_zone_state(zone, WORKINGSET_REFAULT);
>  
> -	if (refault_distance <= zone_page_state(zone, NR_ACTIVE_FILE)) {
> +	if (!mem_cgroup_disabled()) {
> +		lruvec = mem_cgroup_zone_lruvec(zone, memcg);
> +		nr_active = mem_cgroup_get_lru_size(lruvec, LRU_ACTIVE_FILE);
> +		mem_cgroup_put(memcg);
> +	} else
> +		nr_active = zone_page_state(zone, NR_ACTIVE_FILE);

This is basically get_lru_size(), so I reused that instead.

Thanks!

* [PATCH 1/3] mm: workingset: eviction buckets for bigmem/lowbit machines
From: Johannes Weiner @ 2016-01-25 16:41 UTC
  To: Vladimir Davydov; +Cc: Andrew Morton, Michal Hocko, linux-mm, linux-kernel

Space in the shadow entry will get tight once we need to identify the
memcg as well. Add eviction buckets to stretch out the representable
refault distance by sacrificing granularity.

Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
---
 mm/workingset.c | 40 +++++++++++++++++++++++++++++++++++-----
 1 file changed, 35 insertions(+), 5 deletions(-)

diff --git a/mm/workingset.c b/mm/workingset.c
index 61ead9e5549d..6f3ba184ffb2 100644
--- a/mm/workingset.c
+++ b/mm/workingset.c
@@ -152,8 +152,23 @@
  * refault distance will immediately activate the refaulting page.
  */
 
+#define EVICTION_SHIFT (NODES_SHIFT + ZONES_SHIFT +	\
+			RADIX_TREE_EXCEPTIONAL_SHIFT)
+#define EVICTION_MASK (~0UL >> EVICTION_SHIFT)
+
+/*
+ * Eviction timestamps need to be able to cover the full range of
+ * actionable refaults. However, bits are tight in the radix tree
+ * entry, and after storing the identifier for the lruvec there might
+ * not be enough left to represent every single actionable refault. In
+ * that case, we have to sacrifice granularity for distance, and group
+ * evictions into coarser buckets by shaving off lower timestamp bits.
+ */
+static unsigned int bucket_order;
+
 static void *pack_shadow(unsigned long eviction, struct zone *zone)
 {
+	eviction >>= bucket_order;
 	eviction = (eviction << NODES_SHIFT) | zone_to_nid(zone);
 	eviction = (eviction << ZONES_SHIFT) | zone_idx(zone);
 	eviction = (eviction << RADIX_TREE_EXCEPTIONAL_SHIFT);
@@ -168,7 +183,6 @@ static void unpack_shadow(void *shadow,
 	unsigned long entry = (unsigned long)shadow;
 	unsigned long eviction;
 	unsigned long refault;
-	unsigned long mask;
 	int zid, nid;
 
 	entry >>= RADIX_TREE_EXCEPTIONAL_SHIFT;
@@ -176,13 +190,12 @@ static void unpack_shadow(void *shadow,
 	entry >>= ZONES_SHIFT;
 	nid = entry & ((1UL << NODES_SHIFT) - 1);
 	entry >>= NODES_SHIFT;
-	eviction = entry;
+	eviction = entry << bucket_order;
 
 	*zone = NODE_DATA(nid)->node_zones + zid;
 
 	refault = atomic_long_read(&(*zone)->inactive_age);
-	mask = ~0UL >> (NODES_SHIFT + ZONES_SHIFT +
-			RADIX_TREE_EXCEPTIONAL_SHIFT);
+
 	/*
 	 * The unsigned subtraction here gives an accurate distance
 	 * across inactive_age overflows in most cases.
@@ -199,7 +212,7 @@ static void unpack_shadow(void *shadow,
 	 * inappropriate activation leading to pressure on the active
 	 * list is not a problem.
 	 */
-	*distance = (refault - eviction) & mask;
+	*distance = (refault - eviction) & EVICTION_MASK;
 }
 
 /**
@@ -398,8 +411,25 @@ static struct lock_class_key shadow_nodes_key;
 
 static int __init workingset_init(void)
 {
+	unsigned int timestamp_bits;
+	unsigned int max_order;
 	int ret;
 
+	BUILD_BUG_ON(BITS_PER_LONG < EVICTION_SHIFT);
+	/*
+	 * Calculate the eviction bucket size to cover the longest
+	 * actionable refault distance, which is currently half of
+	 * memory (totalram_pages/2). However, memory hotplug may add
+	 * some more pages at runtime, so keep working with up to
+	 * double the initial memory by using totalram_pages as-is.
+	 */
+	timestamp_bits = BITS_PER_LONG - EVICTION_SHIFT;
+	max_order = fls_long(totalram_pages - 1);
+	if (max_order > timestamp_bits)
+		bucket_order = max_order - timestamp_bits;
+	printk("workingset: timestamp_bits=%d max_order=%d bucket_order=%u\n",
+	       timestamp_bits, max_order, bucket_order);
+
 	ret = list_lru_init_key(&workingset_shadow_nodes, &shadow_nodes_key);
 	if (ret)
 		goto err;
-- 
2.7.0

* [PATCH 2/3] mm: workingset: separate shadow unpacking and refault calculation
From: Johannes Weiner @ 2016-01-25 16:41 UTC
  To: Vladimir Davydov; +Cc: Andrew Morton, Michal Hocko, linux-mm, linux-kernel

This used to be alright, but we're going to add memcg awareness, and
then the difference between unpacking static data from the radix tree
entry and dealing with dynamic objects and doing calculations becomes
more pronounced and would make things awkward.

Keep unpacking simple and move the higher-level logic to
workingset_refault().

Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
---
 mm/workingset.c | 55 ++++++++++++++++++++++++++++---------------------------
 1 file changed, 28 insertions(+), 27 deletions(-)

diff --git a/mm/workingset.c b/mm/workingset.c
index 6f3ba184ffb2..ac6eb7bc1faa 100644
--- a/mm/workingset.c
+++ b/mm/workingset.c
@@ -176,13 +176,10 @@ static void *pack_shadow(unsigned long eviction, struct zone *zone)
 	return (void *)(eviction | RADIX_TREE_EXCEPTIONAL_ENTRY);
 }
 
-static void unpack_shadow(void *shadow,
-			  struct zone **zone,
-			  unsigned long *distance)
+static void unpack_shadow(void *shadow, struct zone **zonep,
+			  unsigned long *evictionp)
 {
 	unsigned long entry = (unsigned long)shadow;
-	unsigned long eviction;
-	unsigned long refault;
 	int zid, nid;
 
 	entry >>= RADIX_TREE_EXCEPTIONAL_SHIFT;
@@ -190,29 +187,10 @@ static void unpack_shadow(void *shadow,
 	entry >>= ZONES_SHIFT;
 	nid = entry & ((1UL << NODES_SHIFT) - 1);
 	entry >>= NODES_SHIFT;
-	eviction = entry << bucket_order;
-
-	*zone = NODE_DATA(nid)->node_zones + zid;
 
-	refault = atomic_long_read(&(*zone)->inactive_age);
+	*zonep = NODE_DATA(nid)->node_zones + zid;
+	*evictionp = entry << bucket_order;
 
-	/*
-	 * The unsigned subtraction here gives an accurate distance
-	 * across inactive_age overflows in most cases.
-	 *
-	 * There is a special case: usually, shadow entries have a
-	 * short lifetime and are either refaulted or reclaimed along
-	 * with the inode before they get too old.  But it is not
-	 * impossible for the inactive_age to lap a shadow entry in
-	 * the field, which can then can result in a false small
-	 * refault distance, leading to a false activation should this
-	 * old entry actually refault again.  However, earlier kernels
-	 * used to deactivate unconditionally with *every* reclaim
-	 * invocation for the longest time, so the occasional
-	 * inappropriate activation leading to pressure on the active
-	 * list is not a problem.
-	 */
-	*distance = (refault - eviction) & EVICTION_MASK;
 }
 
 /**
@@ -244,9 +222,32 @@ void *workingset_eviction(struct address_space *mapping, struct page *page)
 bool workingset_refault(void *shadow)
 {
 	unsigned long refault_distance;
+	unsigned long eviction;
+	unsigned long refault;
 	struct zone *zone;
 
-	unpack_shadow(shadow, &zone, &refault_distance);
+	unpack_shadow(shadow, &zone, &eviction);
+
+	refault = atomic_long_read(&zone->inactive_age);
+
+	/*
+	 * The unsigned subtraction here gives an accurate distance
+	 * across inactive_age overflows in most cases.
+	 *
+	 * There is a special case: usually, shadow entries have a
+	 * short lifetime and are either refaulted or reclaimed along
+	 * with the inode before they get too old.  But it is not
+	 * impossible for the inactive_age to lap a shadow entry in
+	 * the field, which can then can result in a false small
+	 * refault distance, leading to a false activation should this
+	 * old entry actually refault again.  However, earlier kernels
+	 * used to deactivate unconditionally with *every* reclaim
+	 * invocation for the longest time, so the occasional
+	 * inappropriate activation leading to pressure on the active
+	 * list is not a problem.
+	 */
+	refault_distance = (refault - eviction) & EVICTION_MASK;
+
 	inc_zone_state(zone, WORKINGSET_REFAULT);
 
 	if (refault_distance <= zone_page_state(zone, NR_ACTIVE_FILE)) {
-- 
2.7.0

* [PATCH 3/3] mm: workingset: cgroup-aware
From: Johannes Weiner @ 2016-01-25 16:42 UTC
  To: Vladimir Davydov; +Cc: Andrew Morton, Michal Hocko, linux-mm, linux-kernel

TODO:
- cgroup-aware shadow node reclaim

Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
---
 include/linux/memcontrol.h | 39 ++++++++++++++++++++++++++++++
 include/linux/mmzone.h     | 11 ++++-----
 include/linux/swap.h       |  1 +
 mm/memcontrol.c            | 25 --------------------
 mm/vmscan.c                | 18 +++++++-------
 mm/workingset.c            | 59 +++++++++++++++++++++++++++++++++++++---------
 6 files changed, 102 insertions(+), 51 deletions(-)

diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
index 792c8981e633..705aba54a50d 100644
--- a/include/linux/memcontrol.h
+++ b/include/linux/memcontrol.h
@@ -89,6 +89,10 @@ enum mem_cgroup_events_target {
 };
 
 #ifdef CONFIG_MEMCG
+
+#define MEM_CGROUP_ID_SHIFT	16
+#define MEM_CGROUP_ID_MAX	USHRT_MAX
+
 struct mem_cgroup_stat_cpu {
 	long count[MEMCG_NR_STAT];
 	unsigned long events[MEMCG_NR_EVENTS];
@@ -312,6 +316,25 @@ struct mem_cgroup *mem_cgroup_iter(struct mem_cgroup *,
 				   struct mem_cgroup_reclaim_cookie *);
 void mem_cgroup_iter_break(struct mem_cgroup *, struct mem_cgroup *);
 
+static inline unsigned short mem_cgroup_id(struct mem_cgroup *memcg)
+{
+	return memcg->css.id;
+}
+
+/**
+ * mem_cgroup_from_id - look up a memcg from an id
+ * @id: the id to look up
+ *
+ * Caller must hold rcu_read_lock() and use css_tryget() as necessary.
+ */
+static inline struct mem_cgroup *mem_cgroup_from_id(unsigned short id)
+{
+	struct cgroup_subsys_state *css;
+
+	css = css_from_id(id, &memory_cgrp_subsys);
+	return mem_cgroup_from_css(css);
+}
+
 /**
  * parent_mem_cgroup - find the accounting parent of a memcg
  * @memcg: memcg whose parent to find
@@ -496,6 +519,10 @@ void mem_cgroup_split_huge_fixup(struct page *head);
 #endif
 
 #else /* CONFIG_MEMCG */
+
+#define MEM_CGROUP_ID_SHIFT	0
+#define MEM_CGROUP_ID_MAX	0
+
 struct mem_cgroup;
 
 static inline void mem_cgroup_events(struct mem_cgroup *memcg,
@@ -580,6 +607,18 @@ static inline void mem_cgroup_iter_break(struct mem_cgroup *root,
 {
 }
 
+static inline unsigned short mem_cgroup_id(struct mem_cgroup *memcg)
+{
+	return 0;
+}
+
+static inline struct mem_cgroup *mem_cgroup_from_id(unsigned short id)
+{
+	WARN_ON_ONCE(id);
+	/* XXX: This should always return root_mem_cgroup */
+	return NULL;
+}
+
 static inline bool mem_cgroup_disabled(void)
 {
 	return true;
diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h
index 33bb1b19273e..a7d8eeb6658a 100644
--- a/include/linux/mmzone.h
+++ b/include/linux/mmzone.h
@@ -209,10 +209,12 @@ struct zone_reclaim_stat {
 };
 
 struct lruvec {
-	struct list_head lists[NR_LRU_LISTS];
-	struct zone_reclaim_stat reclaim_stat;
+	struct list_head		lists[NR_LRU_LISTS];
+	struct zone_reclaim_stat	reclaim_stat;
+	/* Evictions & activations on the inactive file list */
+	atomic_long_t			inactive_age;
 #ifdef CONFIG_MEMCG
-	struct zone *zone;
+	struct zone			*zone;
 #endif
 };
 
@@ -487,9 +489,6 @@ struct zone {
 	spinlock_t		lru_lock;
 	struct lruvec		lruvec;
 
-	/* Evictions & activations on the inactive file list */
-	atomic_long_t		inactive_age;
-
 	/*
 	 * When free pages are below this point, additional steps are taken
 	 * when reading the number of free pages to avoid per-cpu counter
diff --git a/include/linux/swap.h b/include/linux/swap.h
index b14a2bb33514..1cf3065c143b 100644
--- a/include/linux/swap.h
+++ b/include/linux/swap.h
@@ -317,6 +317,7 @@ extern void lru_cache_add_active_or_unevictable(struct page *page,
 
 /* linux/mm/vmscan.c */
 extern unsigned long zone_reclaimable_pages(struct zone *zone);
+extern unsigned long lruvec_lru_size(struct lruvec *lruvec, enum lru_list lru);
 extern unsigned long try_to_free_pages(struct zonelist *zonelist, int order,
 					gfp_t gfp_mask, nodemask_t *mask);
 extern int __isolate_lru_page(struct page *page, isolate_mode_t mode);
diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index d06cae2de783..4ea79f225fe8 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -268,31 +268,6 @@ static inline bool mem_cgroup_is_root(struct mem_cgroup *memcg)
 	return (memcg == root_mem_cgroup);
 }
 
-/*
- * We restrict the id in the range of [1, 65535], so it can fit into
- * an unsigned short.
- */
-#define MEM_CGROUP_ID_MAX	USHRT_MAX
-
-static inline unsigned short mem_cgroup_id(struct mem_cgroup *memcg)
-{
-	return memcg->css.id;
-}
-
-/*
- * A helper function to get mem_cgroup from ID. must be called under
- * rcu_read_lock().  The caller is responsible for calling
- * css_tryget_online() if the mem_cgroup is used for charging. (dropping
- * refcnt from swap can be called against removed memcg.)
- */
-static inline struct mem_cgroup *mem_cgroup_from_id(unsigned short id)
-{
-	struct cgroup_subsys_state *css;
-
-	css = css_from_id(id, &memory_cgrp_subsys);
-	return mem_cgroup_from_css(css);
-}
-
 #ifndef CONFIG_SLOB
 /*
  * This will be the memcg's index in each cache's ->memcg_params.memcg_caches.
diff --git a/mm/vmscan.c b/mm/vmscan.c
index 30e0cd7a0ceb..f4ac04c0d35a 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -213,7 +213,7 @@ bool zone_reclaimable(struct zone *zone)
 		zone_reclaimable_pages(zone) * 6;
 }
 
-static unsigned long get_lru_size(struct lruvec *lruvec, enum lru_list lru)
+unsigned long lruvec_lru_size(struct lruvec *lruvec, enum lru_list lru)
 {
 	if (!mem_cgroup_disabled())
 		return mem_cgroup_get_lru_size(lruvec, lru);
@@ -1931,8 +1931,8 @@ static bool inactive_file_is_low(struct lruvec *lruvec)
 	unsigned long inactive;
 	unsigned long active;
 
-	inactive = get_lru_size(lruvec, LRU_INACTIVE_FILE);
-	active = get_lru_size(lruvec, LRU_ACTIVE_FILE);
+	inactive = lruvec_lru_size(lruvec, LRU_INACTIVE_FILE);
+	active = lruvec_lru_size(lruvec, LRU_ACTIVE_FILE);
 
 	return active > inactive;
 }
@@ -2071,7 +2071,7 @@ static void get_scan_count(struct lruvec *lruvec, struct mem_cgroup *memcg,
 	 * system is under heavy pressure.
 	 */
 	if (!inactive_file_is_low(lruvec) &&
-	    get_lru_size(lruvec, LRU_INACTIVE_FILE) >> sc->priority) {
+	    lruvec_lru_size(lruvec, LRU_INACTIVE_FILE) >> sc->priority) {
 		scan_balance = SCAN_FILE;
 		goto out;
 	}
@@ -2097,10 +2097,10 @@ static void get_scan_count(struct lruvec *lruvec, struct mem_cgroup *memcg,
 	 * anon in [0], file in [1]
 	 */
 
-	anon  = get_lru_size(lruvec, LRU_ACTIVE_ANON) +
-		get_lru_size(lruvec, LRU_INACTIVE_ANON);
-	file  = get_lru_size(lruvec, LRU_ACTIVE_FILE) +
-		get_lru_size(lruvec, LRU_INACTIVE_FILE);
+	anon  = lruvec_lru_size(lruvec, LRU_ACTIVE_ANON) +
+		lruvec_lru_size(lruvec, LRU_INACTIVE_ANON);
+	file  = lruvec_lru_size(lruvec, LRU_ACTIVE_FILE) +
+		lruvec_lru_size(lruvec, LRU_INACTIVE_FILE);
 
 	spin_lock_irq(&zone->lru_lock);
 	if (unlikely(reclaim_stat->recent_scanned[0] > anon / 4)) {
@@ -2138,7 +2138,7 @@ out:
 			unsigned long size;
 			unsigned long scan;
 
-			size = get_lru_size(lruvec, lru);
+			size = lruvec_lru_size(lruvec, lru);
 			scan = size >> sc->priority;
 
 			if (!scan && pass && force_scan)
diff --git a/mm/workingset.c b/mm/workingset.c
index ac6eb7bc1faa..fe69da29bbad 100644
--- a/mm/workingset.c
+++ b/mm/workingset.c
@@ -152,7 +152,8 @@
  * refault distance will immediately activate the refaulting page.
  */
 
-#define EVICTION_SHIFT (NODES_SHIFT + ZONES_SHIFT +	\
+#define EVICTION_SHIFT (MEM_CGROUP_ID_SHIFT +		\
+			NODES_SHIFT + ZONES_SHIFT +	\
 			RADIX_TREE_EXCEPTIONAL_SHIFT)
 #define EVICTION_MASK (~0UL >> EVICTION_SHIFT)
 
@@ -166,9 +167,10 @@
  */
 static unsigned int bucket_order;
 
-static void *pack_shadow(unsigned long eviction, struct zone *zone)
+static void *pack_shadow(int memcgid, struct zone *zone, unsigned long eviction)
 {
 	eviction >>= bucket_order;
+	eviction = (eviction << MEM_CGROUP_ID_SHIFT) | memcgid;
 	eviction = (eviction << NODES_SHIFT) | zone_to_nid(zone);
 	eviction = (eviction << ZONES_SHIFT) | zone_idx(zone);
 	eviction = (eviction << RADIX_TREE_EXCEPTIONAL_SHIFT);
@@ -176,21 +178,23 @@ static void *pack_shadow(unsigned long eviction, struct zone *zone)
 	return (void *)(eviction | RADIX_TREE_EXCEPTIONAL_ENTRY);
 }
 
-static void unpack_shadow(void *shadow, struct zone **zonep,
+static void unpack_shadow(void *shadow, int *memcgidp, struct zone **zonep,
 			  unsigned long *evictionp)
 {
 	unsigned long entry = (unsigned long)shadow;
-	int zid, nid;
+	int memcgid, nid, zid;
 
 	entry >>= RADIX_TREE_EXCEPTIONAL_SHIFT;
 	zid = entry & ((1UL << ZONES_SHIFT) - 1);
 	entry >>= ZONES_SHIFT;
 	nid = entry & ((1UL << NODES_SHIFT) - 1);
 	entry >>= NODES_SHIFT;
+	memcgid = entry & ((1UL << MEM_CGROUP_ID_SHIFT) - 1);
+	entry >>= MEM_CGROUP_ID_SHIFT;
 
+	*memcgidp = memcgid;
 	*zonep = NODE_DATA(nid)->node_zones + zid;
 	*evictionp = entry << bucket_order;
-
 }
 
 /**
@@ -203,11 +207,20 @@ static void unpack_shadow(void *shadow, struct zone **zonep,
  */
 void *workingset_eviction(struct address_space *mapping, struct page *page)
 {
+	struct mem_cgroup *memcg = page_memcg(page);
 	struct zone *zone = page_zone(page);
+	int memcgid = mem_cgroup_id(memcg);
 	unsigned long eviction;
+	struct lruvec *lruvec;
+
+	/* Page is fully exclusive and pins page->mem_cgroup */
+	VM_BUG_ON_PAGE(PageLRU(page), page);
+	VM_BUG_ON_PAGE(page_count(page), page);
+	VM_BUG_ON_PAGE(!PageLocked(page), page);
 
-	eviction = atomic_long_inc_return(&zone->inactive_age);
-	return pack_shadow(eviction, zone);
+	lruvec = mem_cgroup_zone_lruvec(zone, memcg);
+	eviction = atomic_long_inc_return(&lruvec->inactive_age);
+	return pack_shadow(memcgid, zone, eviction);
 }
 
 /**
@@ -222,13 +235,32 @@ void *workingset_eviction(struct address_space *mapping, struct page *page)
 bool workingset_refault(void *shadow)
 {
 	unsigned long refault_distance;
+	unsigned long active_file;
+	struct mem_cgroup *memcg;
 	unsigned long eviction;
+	struct lruvec *lruvec;
 	unsigned long refault;
 	struct zone *zone;
+	int memcgid;
 
-	unpack_shadow(shadow, &zone, &eviction);
+	unpack_shadow(shadow, &memcgid, &zone, &eviction);
 
-	refault = atomic_long_read(&zone->inactive_age);
+	rcu_read_lock();
+	memcg = mem_cgroup_from_id(memcgid);
+	/*
+	 * Don't count a refault if the remembered memcg has been
+	 * deleted since. XXX: On !CONFIG_MEMCG, this will always
+	 * return NULL; it would be better if the root_mem_cgroup
+	 * existed in all configurations instead.
+	 */
+	if (!mem_cgroup_disabled() && !memcg) {
+		rcu_read_unlock();
+		return false;
+	}
+	lruvec = mem_cgroup_zone_lruvec(zone, memcg);
+	refault = atomic_long_read(&lruvec->inactive_age);
+	active_file = lruvec_lru_size(lruvec, LRU_ACTIVE_FILE);
+	rcu_read_unlock();
 
 	/*
 	 * The unsigned subtraction here gives an accurate distance
@@ -250,7 +282,7 @@ bool workingset_refault(void *shadow)
 
 	inc_zone_state(zone, WORKINGSET_REFAULT);
 
-	if (refault_distance <= zone_page_state(zone, NR_ACTIVE_FILE)) {
+	if (refault_distance <= active_file) {
 		inc_zone_state(zone, WORKINGSET_ACTIVATE);
 		return true;
 	}
@@ -263,7 +295,12 @@ bool workingset_refault(void *shadow)
  */
 void workingset_activation(struct page *page)
 {
-	atomic_long_inc(&page_zone(page)->inactive_age);
+	struct lruvec *lruvec;
+
+	rcu_read_lock();
+	lruvec = mem_cgroup_zone_lruvec(page_zone(page), page_memcg(page));
+	atomic_long_inc(&lruvec->inactive_age);
+	rcu_read_unlock();
 }
 
 /*
-- 
2.7.0

* Re: [PATCH v2] mm: workingset: make workingset detection logic memcg aware
From: Vladimir Davydov @ 2016-01-26  8:27 UTC
  To: Johannes Weiner; +Cc: Andrew Morton, Michal Hocko, linux-mm, linux-kernel

On Mon, Jan 25, 2016 at 11:39:07AM -0500, Johannes Weiner wrote:
> On Sun, Jan 24, 2016 at 07:56:16PM +0300, Vladimir Davydov wrote:
> > Currently, inactive_age is maintained per zone, which results in
> > unexpected file page activations in case memory cgroups are used. For
> > example, if the total number of active pages is big, a memory cgroup
> > might get every refaulted file page activated even if refault distance
> > is much greater than the number of active file pages in the cgroup. This
> > patch fixes this issue by making inactive_age per lruvec.
> 
> Argh!!
> 
> It's great that you're still interested in this and kept working on
> it. I just regret that I worked on the same stuff the last couple days
> without pinging you before picking it up. Oh well...

:-)

> 
> However, my patches are sufficiently different that I think it makes
> sense to discuss them both and figure out the best end result.  I have
> some comments below and will followup this email with my version.
> 
> > The patch is pretty straightforward and self-explaining, but there are
> > two things that should be noted:
> > 
> >  - workingset_{eviction,activation} need to get lruvec given a page.
> >    On the default hierarchy one can safely access page->mem_cgroup
> >    provided the page is pinned, but on the legacy hierarchy a page can
> >    be migrated from one cgroup to another at any moment, so extra care
> >    must be taken to assure page->mem_cgroup will stay put.
> > 
> >    workingset_eviction is passed a locked page, so it is safe to use
> >    page->mem_cgroup in this function. workingset_activation is trickier:
> >    it is called from mark_page_accessed, where the page is not
> >    necessarily locked. To protect it against page->mem_cgroup change, we
> >    move it to __activate_page, which is called by mark_page_accessed
> >    once there's enough pages on percpu pagevec. This function is called
> >    with zone->lru_lock held, which rules out page charge migration.
> 
> When a page moves to another cgroup at the same time it's activated,
> there really is no wrong lruvec to age. Both would be correct. The
> locking guarantees a stable answer, but we don't need it here. It's
> enough to take the rcu lock here to ensure page_memcg() isn't freed.

Yeah, that's true, but I think that taking the rcu lock when memcg is
disabled or even compiled out is ugly. Maybe we should introduce
helpers lock/unlock_page_memcg(), which would be no-ops on the unified
hierarchy while on the legacy hierarchy they would act just like
mem_cgroup_begin/end_page_stat, i.e. we could just rename the latter
two. This would also simplify dropping the legacy code once the legacy
hierarchy has passed away.
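
A rough sketch of what such helpers could look like - hypothetical, not
part of either patch set, assuming mem_cgroup_begin/end_page_stat()
keep their current signatures:

	static inline struct mem_cgroup *lock_page_memcg(struct page *page)
	{
		/* On the default hierarchy a pinned page's memcg is stable */
		if (cgroup_subsys_on_dfl(memory_cgrp_subsys))
			return page_memcg(page);
		/* Legacy hierarchy: block charge migration for this page */
		return mem_cgroup_begin_page_stat(page);
	}

	static inline void unlock_page_memcg(struct mem_cgroup *memcg)
	{
		if (!cgroup_subsys_on_dfl(memory_cgrp_subsys))
			mem_cgroup_end_page_stat(memcg);
	}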

> 
> >  - To calculate refault distance correctly even in case a page is
> >    refaulted by a different cgroup, we need to store memcg id in shadow
> >    entry. There's no problem with it on 64-bit, but on 32-bit there's
> >    not much space left in radix tree slot after storing information
> >    about node, zone, and memory cgroup, so we can't just save eviction
> >    counter as is, because it would trim max refault distance making it
> >    unusable.
> > 
> >    To overcome this problem, we increase refault distance granularity,
> >    as proposed by Johannes Weiner. We disregard 10 least significant
> >    bits of eviction counter. This reduces refault distance accuracy to
> >    4MB, which is still fine. With the default NODE_SHIFT (3) this leaves
> >    us 9 bits for storing eviction counter, hence maximal refault
> >    distance will be 2GB, which should be enough for 32-bit systems.
> 
> If we limit it to 2G it becomes a clear-cut correctness issue once you
> have more memory. Instead, we should continue to stretch out the
> distance with an ever-increasing bucket size. The more memory you
> have, the less important the granularity becomes anyway. With 8G, an
> 8M granularity is still okay, and so forth. And once we get beyond a
> certain point, and it creates problems for people, it should be fair
> enough to recommend upgrading to 64 bit.

That's a great idea. I'm all for it.

> 
> > @@ -152,8 +152,72 @@
> >   * refault distance will immediately activate the refaulting page.
> >   */
> >  
> > -static void *pack_shadow(unsigned long eviction, struct zone *zone)
> > +#ifdef CONFIG_MEMCG
> > +/*
> > + * On 32-bit there is not much space left in radix tree slot after
> > + * storing information about node, zone, and memory cgroup, so we
> > + * disregard 10 least significant bits of eviction counter. This
> > + * reduces refault distance accuracy to 4MB, which is still fine.
> > + *
> > + * With the default NODE_SHIFT (3) this leaves us 9 bits for storing
> > + * eviction counter, hence maximal refault distance will be 2GB, which
> > + * should be enough for 32-bit systems.
> > + */
> > +#ifdef CONFIG_64BIT
> > +# define REFAULT_DISTANCE_GRANULARITY		0
> > +#else
> > +# define REFAULT_DISTANCE_GRANULARITY		10
> > +#endif
> > +
> > +static unsigned long pack_shadow_memcg(unsigned long eviction,
> > +				       struct mem_cgroup *memcg)
> > +{
> > +	if (mem_cgroup_disabled())
> > +		return eviction;
> > +
> > +	eviction >>= REFAULT_DISTANCE_GRANULARITY;
> > +	eviction = (eviction << MEM_CGROUP_ID_SHIFT) | mem_cgroup_id(memcg);
> > +	return eviction;
> > +}
> > +
> > +static unsigned long unpack_shadow_memcg(unsigned long entry,
> > +					 unsigned long *mask,
> > +					 struct mem_cgroup **memcg)
> > +{
> > +	if (mem_cgroup_disabled()) {
> > +		*memcg = NULL;
> > +		return entry;
> > +	}
> > +
> > +	rcu_read_lock();
> > +	*memcg = mem_cgroup_from_id(entry & MEM_CGROUP_ID_MAX);
> > +	rcu_read_unlock();
> > +
> > +	entry >>= MEM_CGROUP_ID_SHIFT;
> > +	entry <<= REFAULT_DISTANCE_GRANULARITY;
> > +	*mask >>= MEM_CGROUP_ID_SHIFT - REFAULT_DISTANCE_GRANULARITY;
> > +	return entry;
> > +}
> > +#else /* !CONFIG_MEMCG */
> > +static unsigned long pack_shadow_memcg(unsigned long eviction,
> > +				       struct mem_cgroup *memcg)
> > +{
> > +	return eviction;
> > +}
> > +
> > +static unsigned long unpack_shadow_memcg(unsigned long entry,
> > +					 unsigned long *mask,
> > +					 struct mem_cgroup **memcg)
> > +{
> > +	*memcg = NULL;
> > +	return entry;
> > +}
> > +#endif /* CONFIG_MEMCG */
> > +
> > +static void *pack_shadow(unsigned long eviction, struct zone *zone,
> > +			 struct mem_cgroup *memcg)
> >  {
> > +	eviction = pack_shadow_memcg(eviction, memcg);
> >  	eviction = (eviction << NODES_SHIFT) | zone_to_nid(zone);
> >  	eviction = (eviction << ZONES_SHIFT) | zone_idx(zone);
> >  	eviction = (eviction << RADIX_TREE_EXCEPTIONAL_SHIFT);
> 
> For !CONFIG_MEMCG, we can define MEMCG_ID_SHIFT to 0 and pass in a
> cssid of 0. That would save much of the special casing here.

But what about the mem_cgroup_disabled() case? Do we want to waste 16
bits there as well?

> 
> > @@ -213,10 +282,16 @@ static void unpack_shadow(void *shadow,
> >  void *workingset_eviction(struct address_space *mapping, struct page *page)
> >  {
> >  	struct zone *zone = page_zone(page);
> > +	struct mem_cgroup *memcg = page_memcg(page);
> > +	struct lruvec *lruvec;
> >  	unsigned long eviction;
> >  
> > -	eviction = atomic_long_inc_return(&zone->inactive_age);
> > -	return pack_shadow(eviction, zone);
> > +	if (!mem_cgroup_disabled())
> > +		mem_cgroup_get(memcg);
> > +
> > +	lruvec = mem_cgroup_zone_lruvec(zone, memcg);
> > +	eviction = atomic_long_inc_return(&lruvec->inactive_age);
> > +	return pack_shadow(eviction, zone, memcg);
> 
> I don't think we need to hold a reference to the memcg here, it should
> be enough to verify whether the cssid is still existent upon refault.
> 
> What could theoretically happen then is that the original memcg gets
> deleted, a new one will reuse the same id and then refault the same
> pages. However, there are two things that should make this acceptable:
> 1. a couple pages don't matter in this case, and sharing data between
> cgroups on a large scale already leads to weird accounting artifacts
> in other places; this wouldn't be much different. 2. from a system
> perspective, those pages were in fact recently evicted, so even if we
> activate them by accident in the new cgroup, it wouldn't be entirely
> unreasonable. The workload will shake out what the true active list
> frequency is on its own.
> 
> So I think we can just do away with the reference counting for now,
> and reconsider it should the false sharing in this case create new
> problems that are worse than existing consequences of false sharing.

Memory cgroup creation/destruction are rare events, so I agree that we
can close our eyes on this and not clutter the code with
mem_cgroup_get/put.

> 
> > @@ -230,13 +305,22 @@ void *workingset_eviction(struct address_space *mapping, struct page *page)
> >   */
> >  bool workingset_refault(void *shadow)
> >  {
> > -	unsigned long refault_distance;
> > +	unsigned long refault_distance, nr_active;
> >  	struct zone *zone;
> > +	struct mem_cgroup *memcg;
> > +	struct lruvec *lruvec;
> >  
> > -	unpack_shadow(shadow, &zone, &refault_distance);
> > +	unpack_shadow(shadow, &zone, &memcg, &refault_distance);
> >  	inc_zone_state(zone, WORKINGSET_REFAULT);
> >  
> > -	if (refault_distance <= zone_page_state(zone, NR_ACTIVE_FILE)) {
> > +	if (!mem_cgroup_disabled()) {
> > +		lruvec = mem_cgroup_zone_lruvec(zone, memcg);
> > +		nr_active = mem_cgroup_get_lru_size(lruvec, LRU_ACTIVE_FILE);
> > +		mem_cgroup_put(memcg);
> > +	} else
> > +		nr_active = zone_page_state(zone, NR_ACTIVE_FILE);
> 
> This is basically get_lru_size(), so I reused that instead.

Yeah. In the previous version I did the same.

From a quick glance, I think that, in general, your patch set looks better.

Thanks,
Vladimir
