All of lore.kernel.org
 help / color / mirror / Atom feed
* [PATCH RFC 0/2] Anonymous memory threshold notifications for memcg
@ 2014-09-11 15:41 ` Vladimir Davydov
  0 siblings, 0 replies; 22+ messages in thread
From: Vladimir Davydov @ 2014-09-11 15:41 UTC (permalink / raw)
  To: linux-kernel
  Cc: Kamezawa Hiroyuki, Johannes Weiner, Michal Hocko, Greg Thelen,
	Hugh Dickins, Motohiro Kosaki, Glauber Costa, Tejun Heo,
	Andrew Morton, Pavel Emelianov, Konstantin Khorenko, linux-mm,
	cgroups

Hi,

This series introduces anonymous memory threshold notifications for
memory cgroups. For rationale please see the comment to patch 2.

I'm sending it to attract attention to the following thread:

http://www.spinics.net/lists/linux-mm/msg78180.html

Thanks,
Vladimir

^ permalink raw reply	[flat|nested] 22+ messages in thread

* [PATCH RFC 0/2] Anonymous memory threshold notifications for memcg
@ 2014-09-11 15:41 ` Vladimir Davydov
  0 siblings, 0 replies; 22+ messages in thread
From: Vladimir Davydov @ 2014-09-11 15:41 UTC (permalink / raw)
  To: linux-kernel
  Cc: Kamezawa Hiroyuki, Johannes Weiner, Michal Hocko, Greg Thelen,
	Hugh Dickins, Motohiro Kosaki, Glauber Costa, Tejun Heo,
	Andrew Morton, Pavel Emelianov, Konstantin Khorenko, linux-mm,
	cgroups

Hi,

This series introduces anonymous memory threshold notifications for
memory cgroups. For rationale please see the comment to patch 2.

I'm sending it to attract attention to the following thread:

http://www.spinics.net/lists/linux-mm/msg78180.html

Thanks,
Vladimir

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 22+ messages in thread

* [PATCH RFC 0/2] Anonymous memory threshold notifications for memcg
@ 2014-09-11 15:41 ` Vladimir Davydov
  0 siblings, 0 replies; 22+ messages in thread
From: Vladimir Davydov @ 2014-09-11 15:41 UTC (permalink / raw)
  To: linux-kernel-u79uwXL29TY76Z2rM5mHXA
  Cc: Kamezawa Hiroyuki, Johannes Weiner, Michal Hocko, Greg Thelen,
	Hugh Dickins, Motohiro Kosaki, Glauber Costa, Tejun Heo,
	Andrew Morton, Pavel Emelianov, Konstantin Khorenko,
	linux-mm-Bw31MaZKKs3YtjvyW6yDsg, cgroups-u79uwXL29TY76Z2rM5mHXA

Hi,

This series introduces anonymous memory threshold notifications for
memory cgroups. For rationale please see the comment to patch 2.

I'm sending it to attract attention to the following thread:

http://www.spinics.net/lists/linux-mm/msg78180.html

Thanks,
Vladimir

^ permalink raw reply	[flat|nested] 22+ messages in thread

* [PATCH RFC 1/2] memcg: use percpu_counter for statistics
  2014-09-11 15:41 ` Vladimir Davydov
@ 2014-09-11 15:41   ` Vladimir Davydov
  -1 siblings, 0 replies; 22+ messages in thread
From: Vladimir Davydov @ 2014-09-11 15:41 UTC (permalink / raw)
  To: linux-kernel
  Cc: Kamezawa Hiroyuki, Johannes Weiner, Michal Hocko, Greg Thelen,
	Hugh Dickins, Motohiro Kosaki, Glauber Costa, Tejun Heo,
	Andrew Morton, Pavel Emelianov, Konstantin Khorenko, linux-mm,
	cgroups

In the next patch I need a quick way to get a value of
MEM_CGROUP_STAT_RSS. The current procedure (mem_cgroup_read_stat) is
slow (iterates over all cpus) and may sleep (uses get/put_online_cpus),
so it's a no-go.

This patch converts memory cgroup statistics to use percpu_counter so
that percpu_counter_read will do the trick.

Signed-off-by: Vladimir Davydov <vdavydov@parallels.com>
---
 mm/memcontrol.c |  217 ++++++++++++++++++-------------------------------------
 1 file changed, 69 insertions(+), 148 deletions(-)

diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index 085dc6d2f876..7e8d65e0608a 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -136,9 +136,7 @@ enum mem_cgroup_events_target {
 #define SOFTLIMIT_EVENTS_TARGET 1024
 #define NUMAINFO_EVENTS_TARGET	1024
 
-struct mem_cgroup_stat_cpu {
-	long count[MEM_CGROUP_STAT_NSTATS];
-	unsigned long events[MEM_CGROUP_EVENTS_NSTATS];
+struct mem_cgroup_ratelimit_state {
 	unsigned long nr_page_events;
 	unsigned long targets[MEM_CGROUP_NTARGETS];
 };
@@ -341,16 +339,10 @@ struct mem_cgroup {
 	atomic_t	moving_account;
 	/* taken only while moving_account > 0 */
 	spinlock_t	move_lock;
-	/*
-	 * percpu counter.
-	 */
-	struct mem_cgroup_stat_cpu __percpu *stat;
-	/*
-	 * used when a cpu is offlined or other synchronizations
-	 * See mem_cgroup_read_stat().
-	 */
-	struct mem_cgroup_stat_cpu nocpu_base;
-	spinlock_t pcp_counter_lock;
+
+	struct percpu_counter stat[MEM_CGROUP_STAT_NSTATS];
+	struct percpu_counter events[MEM_CGROUP_EVENTS_NSTATS];
+	struct mem_cgroup_ratelimit_state __percpu *ratelimit;
 
 	atomic_t	dead_count;
 #if defined(CONFIG_MEMCG_KMEM) && defined(CONFIG_INET)
@@ -849,59 +841,16 @@ mem_cgroup_largest_soft_limit_node(struct mem_cgroup_tree_per_zone *mctz)
 	return mz;
 }
 
-/*
- * Implementation Note: reading percpu statistics for memcg.
- *
- * Both of vmstat[] and percpu_counter has threshold and do periodic
- * synchronization to implement "quick" read. There are trade-off between
- * reading cost and precision of value. Then, we may have a chance to implement
- * a periodic synchronizion of counter in memcg's counter.
- *
- * But this _read() function is used for user interface now. The user accounts
- * memory usage by memory cgroup and he _always_ requires exact value because
- * he accounts memory. Even if we provide quick-and-fuzzy read, we always
- * have to visit all online cpus and make sum. So, for now, unnecessary
- * synchronization is not implemented. (just implemented for cpu hotplug)
- *
- * If there are kernel internal actions which can make use of some not-exact
- * value, and reading all cpu value can be performance bottleneck in some
- * common workload, threashold and synchonization as vmstat[] should be
- * implemented.
- */
 static long mem_cgroup_read_stat(struct mem_cgroup *memcg,
 				 enum mem_cgroup_stat_index idx)
 {
-	long val = 0;
-	int cpu;
-
-	get_online_cpus();
-	for_each_online_cpu(cpu)
-		val += per_cpu(memcg->stat->count[idx], cpu);
-#ifdef CONFIG_HOTPLUG_CPU
-	spin_lock(&memcg->pcp_counter_lock);
-	val += memcg->nocpu_base.count[idx];
-	spin_unlock(&memcg->pcp_counter_lock);
-#endif
-	put_online_cpus();
-	return val;
+	return percpu_counter_read(&memcg->stat[idx]);
 }
 
 static unsigned long mem_cgroup_read_events(struct mem_cgroup *memcg,
 					    enum mem_cgroup_events_index idx)
 {
-	unsigned long val = 0;
-	int cpu;
-
-	get_online_cpus();
-	for_each_online_cpu(cpu)
-		val += per_cpu(memcg->stat->events[idx], cpu);
-#ifdef CONFIG_HOTPLUG_CPU
-	spin_lock(&memcg->pcp_counter_lock);
-	val += memcg->nocpu_base.events[idx];
-	spin_unlock(&memcg->pcp_counter_lock);
-#endif
-	put_online_cpus();
-	return val;
+	return percpu_counter_read(&memcg->events[idx]);
 }
 
 static void mem_cgroup_charge_statistics(struct mem_cgroup *memcg,
@@ -913,25 +862,21 @@ static void mem_cgroup_charge_statistics(struct mem_cgroup *memcg,
 	 * counted as CACHE even if it's on ANON LRU.
 	 */
 	if (PageAnon(page))
-		__this_cpu_add(memcg->stat->count[MEM_CGROUP_STAT_RSS],
+		percpu_counter_add(&memcg->stat[MEM_CGROUP_STAT_RSS],
 				nr_pages);
 	else
-		__this_cpu_add(memcg->stat->count[MEM_CGROUP_STAT_CACHE],
+		percpu_counter_add(&memcg->stat[MEM_CGROUP_STAT_CACHE],
 				nr_pages);
 
 	if (PageTransHuge(page))
-		__this_cpu_add(memcg->stat->count[MEM_CGROUP_STAT_RSS_HUGE],
+		percpu_counter_add(&memcg->stat[MEM_CGROUP_STAT_RSS_HUGE],
 				nr_pages);
 
 	/* pagein of a big page is an event. So, ignore page size */
 	if (nr_pages > 0)
-		__this_cpu_inc(memcg->stat->events[MEM_CGROUP_EVENTS_PGPGIN]);
-	else {
-		__this_cpu_inc(memcg->stat->events[MEM_CGROUP_EVENTS_PGPGOUT]);
-		nr_pages = -nr_pages; /* for event */
-	}
-
-	__this_cpu_add(memcg->stat->nr_page_events, nr_pages);
+		percpu_counter_inc(&memcg->events[MEM_CGROUP_EVENTS_PGPGIN]);
+	else
+		percpu_counter_inc(&memcg->events[MEM_CGROUP_EVENTS_PGPGOUT]);
 }
 
 unsigned long mem_cgroup_get_lru_size(struct lruvec *lruvec, enum lru_list lru)
@@ -981,8 +926,8 @@ static bool mem_cgroup_event_ratelimit(struct mem_cgroup *memcg,
 {
 	unsigned long val, next;
 
-	val = __this_cpu_read(memcg->stat->nr_page_events);
-	next = __this_cpu_read(memcg->stat->targets[target]);
+	val = __this_cpu_read(memcg->ratelimit->nr_page_events);
+	next = __this_cpu_read(memcg->ratelimit->targets[target]);
 	/* from time_after() in jiffies.h */
 	if ((long)next - (long)val < 0) {
 		switch (target) {
@@ -998,7 +943,7 @@ static bool mem_cgroup_event_ratelimit(struct mem_cgroup *memcg,
 		default:
 			break;
 		}
-		__this_cpu_write(memcg->stat->targets[target], next);
+		__this_cpu_write(memcg->ratelimit->targets[target], next);
 		return true;
 	}
 	return false;
@@ -1006,10 +951,15 @@ static bool mem_cgroup_event_ratelimit(struct mem_cgroup *memcg,
 
 /*
  * Check events in order.
- *
  */
-static void memcg_check_events(struct mem_cgroup *memcg, struct page *page)
+static void memcg_check_events(struct mem_cgroup *memcg, struct page *page,
+			       unsigned long nr_pages)
 {
+	unsigned long flags;
+
+	local_irq_save(flags);
+	__this_cpu_add(memcg->ratelimit->nr_page_events, nr_pages);
+
 	/* threshold event is triggered in finer grain than soft limit */
 	if (unlikely(mem_cgroup_event_ratelimit(memcg,
 						MEM_CGROUP_TARGET_THRESH))) {
@@ -1030,6 +980,7 @@ static void memcg_check_events(struct mem_cgroup *memcg, struct page *page)
 			atomic_inc(&memcg->numainfo_events);
 #endif
 	}
+	local_irq_restore(flags);
 }
 
 struct mem_cgroup *mem_cgroup_from_task(struct task_struct *p)
@@ -1294,10 +1245,10 @@ void __mem_cgroup_count_vm_event(struct mm_struct *mm, enum vm_event_item idx)
 
 	switch (idx) {
 	case PGFAULT:
-		this_cpu_inc(memcg->stat->events[MEM_CGROUP_EVENTS_PGFAULT]);
+		percpu_counter_inc(&memcg->events[MEM_CGROUP_EVENTS_PGFAULT]);
 		break;
 	case PGMAJFAULT:
-		this_cpu_inc(memcg->stat->events[MEM_CGROUP_EVENTS_PGMAJFAULT]);
+		percpu_counter_inc(&memcg->events[MEM_CGROUP_EVENTS_PGMAJFAULT]);
 		break;
 	default:
 		BUG();
@@ -2306,7 +2257,7 @@ void mem_cgroup_update_page_stat(struct page *page,
 	if (unlikely(!memcg || !PageCgroupUsed(pc)))
 		return;
 
-	this_cpu_add(memcg->stat->count[idx], val);
+	percpu_counter_add(&memcg->stat[idx], val);
 }
 
 /*
@@ -2476,37 +2427,12 @@ static void drain_all_stock_sync(struct mem_cgroup *root_memcg)
 	mutex_unlock(&percpu_charge_mutex);
 }
 
-/*
- * This function drains percpu counter value from DEAD cpu and
- * move it to local cpu. Note that this function can be preempted.
- */
-static void mem_cgroup_drain_pcp_counter(struct mem_cgroup *memcg, int cpu)
-{
-	int i;
-
-	spin_lock(&memcg->pcp_counter_lock);
-	for (i = 0; i < MEM_CGROUP_STAT_NSTATS; i++) {
-		long x = per_cpu(memcg->stat->count[i], cpu);
-
-		per_cpu(memcg->stat->count[i], cpu) = 0;
-		memcg->nocpu_base.count[i] += x;
-	}
-	for (i = 0; i < MEM_CGROUP_EVENTS_NSTATS; i++) {
-		unsigned long x = per_cpu(memcg->stat->events[i], cpu);
-
-		per_cpu(memcg->stat->events[i], cpu) = 0;
-		memcg->nocpu_base.events[i] += x;
-	}
-	spin_unlock(&memcg->pcp_counter_lock);
-}
-
 static int memcg_cpu_hotplug_callback(struct notifier_block *nb,
 					unsigned long action,
 					void *hcpu)
 {
 	int cpu = (unsigned long)hcpu;
 	struct memcg_stock_pcp *stock;
-	struct mem_cgroup *iter;
 
 	if (action == CPU_ONLINE)
 		return NOTIFY_OK;
@@ -2514,9 +2440,6 @@ static int memcg_cpu_hotplug_callback(struct notifier_block *nb,
 	if (action != CPU_DEAD && action != CPU_DEAD_FROZEN)
 		return NOTIFY_OK;
 
-	for_each_mem_cgroup(iter)
-		mem_cgroup_drain_pcp_counter(iter, cpu);
-
 	stock = &per_cpu(memcg_stock, cpu);
 	drain_stock(stock);
 	return NOTIFY_OK;
@@ -3419,8 +3342,8 @@ void mem_cgroup_split_huge_fixup(struct page *head)
 		pc->mem_cgroup = memcg;
 		pc->flags = head_pc->flags;
 	}
-	__this_cpu_sub(memcg->stat->count[MEM_CGROUP_STAT_RSS_HUGE],
-		       HPAGE_PMD_NR);
+	percpu_counter_sub(&memcg->stat[MEM_CGROUP_STAT_RSS_HUGE],
+			   HPAGE_PMD_NR);
 }
 #endif /* CONFIG_TRANSPARENT_HUGEPAGE */
 
@@ -3475,17 +3398,17 @@ static int mem_cgroup_move_account(struct page *page,
 	move_lock_mem_cgroup(from, &flags);
 
 	if (!PageAnon(page) && page_mapped(page)) {
-		__this_cpu_sub(from->stat->count[MEM_CGROUP_STAT_FILE_MAPPED],
-			       nr_pages);
-		__this_cpu_add(to->stat->count[MEM_CGROUP_STAT_FILE_MAPPED],
-			       nr_pages);
+		percpu_counter_sub(&from->stat[MEM_CGROUP_STAT_FILE_MAPPED],
+				   nr_pages);
+		percpu_counter_add(&to->stat[MEM_CGROUP_STAT_FILE_MAPPED],
+				   nr_pages);
 	}
 
 	if (PageWriteback(page)) {
-		__this_cpu_sub(from->stat->count[MEM_CGROUP_STAT_WRITEBACK],
-			       nr_pages);
-		__this_cpu_add(to->stat->count[MEM_CGROUP_STAT_WRITEBACK],
-			       nr_pages);
+		percpu_counter_sub(&from->stat[MEM_CGROUP_STAT_WRITEBACK],
+				   nr_pages);
+		percpu_counter_add(&to->stat[MEM_CGROUP_STAT_WRITEBACK],
+				   nr_pages);
 	}
 
 	/*
@@ -3499,12 +3422,10 @@ static int mem_cgroup_move_account(struct page *page,
 	move_unlock_mem_cgroup(from, &flags);
 	ret = 0;
 
-	local_irq_disable();
 	mem_cgroup_charge_statistics(to, page, nr_pages);
-	memcg_check_events(to, page);
+	memcg_check_events(to, page, nr_pages);
 	mem_cgroup_charge_statistics(from, page, -nr_pages);
-	memcg_check_events(from, page);
-	local_irq_enable();
+	memcg_check_events(from, page, nr_pages);
 out_unlock:
 	unlock_page(page);
 out:
@@ -3582,7 +3503,7 @@ static void mem_cgroup_swap_statistics(struct mem_cgroup *memcg,
 					 bool charge)
 {
 	int val = (charge) ? 1 : -1;
-	this_cpu_add(memcg->stat->count[MEM_CGROUP_STAT_SWAP], val);
+	percpu_counter_add(&memcg->stat[MEM_CGROUP_STAT_SWAP], val);
 }
 
 /**
@@ -5413,25 +5334,11 @@ static void free_mem_cgroup_per_zone_info(struct mem_cgroup *memcg, int node)
 
 static struct mem_cgroup *mem_cgroup_alloc(void)
 {
-	struct mem_cgroup *memcg;
 	size_t size;
 
 	size = sizeof(struct mem_cgroup);
 	size += nr_node_ids * sizeof(struct mem_cgroup_per_node *);
-
-	memcg = kzalloc(size, GFP_KERNEL);
-	if (!memcg)
-		return NULL;
-
-	memcg->stat = alloc_percpu(struct mem_cgroup_stat_cpu);
-	if (!memcg->stat)
-		goto out_free;
-	spin_lock_init(&memcg->pcp_counter_lock);
-	return memcg;
-
-out_free:
-	kfree(memcg);
-	return NULL;
+	return kzalloc(size, GFP_KERNEL);
 }
 
 /*
@@ -5448,13 +5355,20 @@ out_free:
 static void __mem_cgroup_free(struct mem_cgroup *memcg)
 {
 	int node;
+	int i;
 
 	mem_cgroup_remove_from_trees(memcg);
 
 	for_each_node(node)
 		free_mem_cgroup_per_zone_info(memcg, node);
 
-	free_percpu(memcg->stat);
+	for (i = 0; i < MEM_CGROUP_STAT_NSTATS; i++)
+		percpu_counter_destroy(&memcg->stat[i]);
+
+	for (i = 0; i < MEM_CGROUP_EVENTS_NSTATS; i++)
+		percpu_counter_destroy(&memcg->events[i]);
+
+	free_percpu(memcg->ratelimit);
 
 	/*
 	 * We need to make sure that (at least for now), the jump label
@@ -5511,11 +5425,24 @@ mem_cgroup_css_alloc(struct cgroup_subsys_state *parent_css)
 	struct mem_cgroup *memcg;
 	long error = -ENOMEM;
 	int node;
+	int i;
 
 	memcg = mem_cgroup_alloc();
 	if (!memcg)
 		return ERR_PTR(error);
 
+	for (i = 0; i < MEM_CGROUP_STAT_NSTATS; i++)
+		if (percpu_counter_init(&memcg->stat[i], 0, GFP_KERNEL))
+			goto free_out;
+
+	for (i = 0; i < MEM_CGROUP_EVENTS_NSTATS; i++)
+		if (percpu_counter_init(&memcg->events[i], 0, GFP_KERNEL))
+			goto free_out;
+
+	memcg->ratelimit = alloc_percpu(struct mem_cgroup_ratelimit_state);
+	if (!memcg->ratelimit)
+		goto free_out;
+
 	for_each_node(node)
 		if (alloc_mem_cgroup_per_zone_info(memcg, node))
 			goto free_out;
@@ -6507,10 +6434,8 @@ void mem_cgroup_commit_charge(struct page *page, struct mem_cgroup *memcg,
 		VM_BUG_ON_PAGE(!PageTransHuge(page), page);
 	}
 
-	local_irq_disable();
 	mem_cgroup_charge_statistics(memcg, page, nr_pages);
-	memcg_check_events(memcg, page);
-	local_irq_enable();
+	memcg_check_events(memcg, page, nr_pages);
 
 	if (do_swap_account && PageSwapCache(page)) {
 		swp_entry_t entry = { .val = page_private(page) };
@@ -6557,8 +6482,6 @@ static void uncharge_batch(struct mem_cgroup *memcg, unsigned long pgpgout,
 			   unsigned long nr_anon, unsigned long nr_file,
 			   unsigned long nr_huge, struct page *dummy_page)
 {
-	unsigned long flags;
-
 	if (!mem_cgroup_is_root(memcg)) {
 		if (nr_mem)
 			res_counter_uncharge(&memcg->res,
@@ -6569,14 +6492,12 @@ static void uncharge_batch(struct mem_cgroup *memcg, unsigned long pgpgout,
 		memcg_oom_recover(memcg);
 	}
 
-	local_irq_save(flags);
-	__this_cpu_sub(memcg->stat->count[MEM_CGROUP_STAT_RSS], nr_anon);
-	__this_cpu_sub(memcg->stat->count[MEM_CGROUP_STAT_CACHE], nr_file);
-	__this_cpu_sub(memcg->stat->count[MEM_CGROUP_STAT_RSS_HUGE], nr_huge);
-	__this_cpu_add(memcg->stat->events[MEM_CGROUP_EVENTS_PGPGOUT], pgpgout);
-	__this_cpu_add(memcg->stat->nr_page_events, nr_anon + nr_file);
-	memcg_check_events(memcg, dummy_page);
-	local_irq_restore(flags);
+	percpu_counter_sub(&memcg->stat[MEM_CGROUP_STAT_RSS], nr_anon);
+	percpu_counter_sub(&memcg->stat[MEM_CGROUP_STAT_CACHE], nr_file);
+	percpu_counter_sub(&memcg->stat[MEM_CGROUP_STAT_RSS_HUGE], nr_huge);
+	percpu_counter_add(&memcg->events[MEM_CGROUP_EVENTS_PGPGOUT], pgpgout);
+
+	memcg_check_events(memcg, dummy_page, nr_anon + nr_file);
 }
 
 static void uncharge_list(struct list_head *page_list)
-- 
1.7.10.4


^ permalink raw reply related	[flat|nested] 22+ messages in thread

* [PATCH RFC 1/2] memcg: use percpu_counter for statistics
@ 2014-09-11 15:41   ` Vladimir Davydov
  0 siblings, 0 replies; 22+ messages in thread
From: Vladimir Davydov @ 2014-09-11 15:41 UTC (permalink / raw)
  To: linux-kernel
  Cc: Kamezawa Hiroyuki, Johannes Weiner, Michal Hocko, Greg Thelen,
	Hugh Dickins, Motohiro Kosaki, Glauber Costa, Tejun Heo,
	Andrew Morton, Pavel Emelianov, Konstantin Khorenko, linux-mm,
	cgroups

In the next patch I need a quick way to get a value of
MEM_CGROUP_STAT_RSS. The current procedure (mem_cgroup_read_stat) is
slow (iterates over all cpus) and may sleep (uses get/put_online_cpus),
so it's a no-go.

This patch converts memory cgroup statistics to use percpu_counter so
that percpu_counter_read will do the trick.

Signed-off-by: Vladimir Davydov <vdavydov@parallels.com>
---
 mm/memcontrol.c |  217 ++++++++++++++++++-------------------------------------
 1 file changed, 69 insertions(+), 148 deletions(-)

diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index 085dc6d2f876..7e8d65e0608a 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -136,9 +136,7 @@ enum mem_cgroup_events_target {
 #define SOFTLIMIT_EVENTS_TARGET 1024
 #define NUMAINFO_EVENTS_TARGET	1024
 
-struct mem_cgroup_stat_cpu {
-	long count[MEM_CGROUP_STAT_NSTATS];
-	unsigned long events[MEM_CGROUP_EVENTS_NSTATS];
+struct mem_cgroup_ratelimit_state {
 	unsigned long nr_page_events;
 	unsigned long targets[MEM_CGROUP_NTARGETS];
 };
@@ -341,16 +339,10 @@ struct mem_cgroup {
 	atomic_t	moving_account;
 	/* taken only while moving_account > 0 */
 	spinlock_t	move_lock;
-	/*
-	 * percpu counter.
-	 */
-	struct mem_cgroup_stat_cpu __percpu *stat;
-	/*
-	 * used when a cpu is offlined or other synchronizations
-	 * See mem_cgroup_read_stat().
-	 */
-	struct mem_cgroup_stat_cpu nocpu_base;
-	spinlock_t pcp_counter_lock;
+
+	struct percpu_counter stat[MEM_CGROUP_STAT_NSTATS];
+	struct percpu_counter events[MEM_CGROUP_EVENTS_NSTATS];
+	struct mem_cgroup_ratelimit_state __percpu *ratelimit;
 
 	atomic_t	dead_count;
 #if defined(CONFIG_MEMCG_KMEM) && defined(CONFIG_INET)
@@ -849,59 +841,16 @@ mem_cgroup_largest_soft_limit_node(struct mem_cgroup_tree_per_zone *mctz)
 	return mz;
 }
 
-/*
- * Implementation Note: reading percpu statistics for memcg.
- *
- * Both of vmstat[] and percpu_counter has threshold and do periodic
- * synchronization to implement "quick" read. There are trade-off between
- * reading cost and precision of value. Then, we may have a chance to implement
- * a periodic synchronizion of counter in memcg's counter.
- *
- * But this _read() function is used for user interface now. The user accounts
- * memory usage by memory cgroup and he _always_ requires exact value because
- * he accounts memory. Even if we provide quick-and-fuzzy read, we always
- * have to visit all online cpus and make sum. So, for now, unnecessary
- * synchronization is not implemented. (just implemented for cpu hotplug)
- *
- * If there are kernel internal actions which can make use of some not-exact
- * value, and reading all cpu value can be performance bottleneck in some
- * common workload, threashold and synchonization as vmstat[] should be
- * implemented.
- */
 static long mem_cgroup_read_stat(struct mem_cgroup *memcg,
 				 enum mem_cgroup_stat_index idx)
 {
-	long val = 0;
-	int cpu;
-
-	get_online_cpus();
-	for_each_online_cpu(cpu)
-		val += per_cpu(memcg->stat->count[idx], cpu);
-#ifdef CONFIG_HOTPLUG_CPU
-	spin_lock(&memcg->pcp_counter_lock);
-	val += memcg->nocpu_base.count[idx];
-	spin_unlock(&memcg->pcp_counter_lock);
-#endif
-	put_online_cpus();
-	return val;
+	return percpu_counter_read(&memcg->stat[idx]);
 }
 
 static unsigned long mem_cgroup_read_events(struct mem_cgroup *memcg,
 					    enum mem_cgroup_events_index idx)
 {
-	unsigned long val = 0;
-	int cpu;
-
-	get_online_cpus();
-	for_each_online_cpu(cpu)
-		val += per_cpu(memcg->stat->events[idx], cpu);
-#ifdef CONFIG_HOTPLUG_CPU
-	spin_lock(&memcg->pcp_counter_lock);
-	val += memcg->nocpu_base.events[idx];
-	spin_unlock(&memcg->pcp_counter_lock);
-#endif
-	put_online_cpus();
-	return val;
+	return percpu_counter_read(&memcg->events[idx]);
 }
 
 static void mem_cgroup_charge_statistics(struct mem_cgroup *memcg,
@@ -913,25 +862,21 @@ static void mem_cgroup_charge_statistics(struct mem_cgroup *memcg,
 	 * counted as CACHE even if it's on ANON LRU.
 	 */
 	if (PageAnon(page))
-		__this_cpu_add(memcg->stat->count[MEM_CGROUP_STAT_RSS],
+		percpu_counter_add(&memcg->stat[MEM_CGROUP_STAT_RSS],
 				nr_pages);
 	else
-		__this_cpu_add(memcg->stat->count[MEM_CGROUP_STAT_CACHE],
+		percpu_counter_add(&memcg->stat[MEM_CGROUP_STAT_CACHE],
 				nr_pages);
 
 	if (PageTransHuge(page))
-		__this_cpu_add(memcg->stat->count[MEM_CGROUP_STAT_RSS_HUGE],
+		percpu_counter_add(&memcg->stat[MEM_CGROUP_STAT_RSS_HUGE],
 				nr_pages);
 
 	/* pagein of a big page is an event. So, ignore page size */
 	if (nr_pages > 0)
-		__this_cpu_inc(memcg->stat->events[MEM_CGROUP_EVENTS_PGPGIN]);
-	else {
-		__this_cpu_inc(memcg->stat->events[MEM_CGROUP_EVENTS_PGPGOUT]);
-		nr_pages = -nr_pages; /* for event */
-	}
-
-	__this_cpu_add(memcg->stat->nr_page_events, nr_pages);
+		percpu_counter_inc(&memcg->events[MEM_CGROUP_EVENTS_PGPGIN]);
+	else
+		percpu_counter_inc(&memcg->events[MEM_CGROUP_EVENTS_PGPGOUT]);
 }
 
 unsigned long mem_cgroup_get_lru_size(struct lruvec *lruvec, enum lru_list lru)
@@ -981,8 +926,8 @@ static bool mem_cgroup_event_ratelimit(struct mem_cgroup *memcg,
 {
 	unsigned long val, next;
 
-	val = __this_cpu_read(memcg->stat->nr_page_events);
-	next = __this_cpu_read(memcg->stat->targets[target]);
+	val = __this_cpu_read(memcg->ratelimit->nr_page_events);
+	next = __this_cpu_read(memcg->ratelimit->targets[target]);
 	/* from time_after() in jiffies.h */
 	if ((long)next - (long)val < 0) {
 		switch (target) {
@@ -998,7 +943,7 @@ static bool mem_cgroup_event_ratelimit(struct mem_cgroup *memcg,
 		default:
 			break;
 		}
-		__this_cpu_write(memcg->stat->targets[target], next);
+		__this_cpu_write(memcg->ratelimit->targets[target], next);
 		return true;
 	}
 	return false;
@@ -1006,10 +951,15 @@ static bool mem_cgroup_event_ratelimit(struct mem_cgroup *memcg,
 
 /*
  * Check events in order.
- *
  */
-static void memcg_check_events(struct mem_cgroup *memcg, struct page *page)
+static void memcg_check_events(struct mem_cgroup *memcg, struct page *page,
+			       unsigned long nr_pages)
 {
+	unsigned long flags;
+
+	local_irq_save(flags);
+	__this_cpu_add(memcg->ratelimit->nr_page_events, nr_pages);
+
 	/* threshold event is triggered in finer grain than soft limit */
 	if (unlikely(mem_cgroup_event_ratelimit(memcg,
 						MEM_CGROUP_TARGET_THRESH))) {
@@ -1030,6 +980,7 @@ static void memcg_check_events(struct mem_cgroup *memcg, struct page *page)
 			atomic_inc(&memcg->numainfo_events);
 #endif
 	}
+	local_irq_restore(flags);
 }
 
 struct mem_cgroup *mem_cgroup_from_task(struct task_struct *p)
@@ -1294,10 +1245,10 @@ void __mem_cgroup_count_vm_event(struct mm_struct *mm, enum vm_event_item idx)
 
 	switch (idx) {
 	case PGFAULT:
-		this_cpu_inc(memcg->stat->events[MEM_CGROUP_EVENTS_PGFAULT]);
+		percpu_counter_inc(&memcg->events[MEM_CGROUP_EVENTS_PGFAULT]);
 		break;
 	case PGMAJFAULT:
-		this_cpu_inc(memcg->stat->events[MEM_CGROUP_EVENTS_PGMAJFAULT]);
+		percpu_counter_inc(&memcg->events[MEM_CGROUP_EVENTS_PGMAJFAULT]);
 		break;
 	default:
 		BUG();
@@ -2306,7 +2257,7 @@ void mem_cgroup_update_page_stat(struct page *page,
 	if (unlikely(!memcg || !PageCgroupUsed(pc)))
 		return;
 
-	this_cpu_add(memcg->stat->count[idx], val);
+	percpu_counter_add(&memcg->stat[idx], val);
 }
 
 /*
@@ -2476,37 +2427,12 @@ static void drain_all_stock_sync(struct mem_cgroup *root_memcg)
 	mutex_unlock(&percpu_charge_mutex);
 }
 
-/*
- * This function drains percpu counter value from DEAD cpu and
- * move it to local cpu. Note that this function can be preempted.
- */
-static void mem_cgroup_drain_pcp_counter(struct mem_cgroup *memcg, int cpu)
-{
-	int i;
-
-	spin_lock(&memcg->pcp_counter_lock);
-	for (i = 0; i < MEM_CGROUP_STAT_NSTATS; i++) {
-		long x = per_cpu(memcg->stat->count[i], cpu);
-
-		per_cpu(memcg->stat->count[i], cpu) = 0;
-		memcg->nocpu_base.count[i] += x;
-	}
-	for (i = 0; i < MEM_CGROUP_EVENTS_NSTATS; i++) {
-		unsigned long x = per_cpu(memcg->stat->events[i], cpu);
-
-		per_cpu(memcg->stat->events[i], cpu) = 0;
-		memcg->nocpu_base.events[i] += x;
-	}
-	spin_unlock(&memcg->pcp_counter_lock);
-}
-
 static int memcg_cpu_hotplug_callback(struct notifier_block *nb,
 					unsigned long action,
 					void *hcpu)
 {
 	int cpu = (unsigned long)hcpu;
 	struct memcg_stock_pcp *stock;
-	struct mem_cgroup *iter;
 
 	if (action == CPU_ONLINE)
 		return NOTIFY_OK;
@@ -2514,9 +2440,6 @@ static int memcg_cpu_hotplug_callback(struct notifier_block *nb,
 	if (action != CPU_DEAD && action != CPU_DEAD_FROZEN)
 		return NOTIFY_OK;
 
-	for_each_mem_cgroup(iter)
-		mem_cgroup_drain_pcp_counter(iter, cpu);
-
 	stock = &per_cpu(memcg_stock, cpu);
 	drain_stock(stock);
 	return NOTIFY_OK;
@@ -3419,8 +3342,8 @@ void mem_cgroup_split_huge_fixup(struct page *head)
 		pc->mem_cgroup = memcg;
 		pc->flags = head_pc->flags;
 	}
-	__this_cpu_sub(memcg->stat->count[MEM_CGROUP_STAT_RSS_HUGE],
-		       HPAGE_PMD_NR);
+	percpu_counter_sub(&memcg->stat[MEM_CGROUP_STAT_RSS_HUGE],
+			   HPAGE_PMD_NR);
 }
 #endif /* CONFIG_TRANSPARENT_HUGEPAGE */
 
@@ -3475,17 +3398,17 @@ static int mem_cgroup_move_account(struct page *page,
 	move_lock_mem_cgroup(from, &flags);
 
 	if (!PageAnon(page) && page_mapped(page)) {
-		__this_cpu_sub(from->stat->count[MEM_CGROUP_STAT_FILE_MAPPED],
-			       nr_pages);
-		__this_cpu_add(to->stat->count[MEM_CGROUP_STAT_FILE_MAPPED],
-			       nr_pages);
+		percpu_counter_sub(&from->stat[MEM_CGROUP_STAT_FILE_MAPPED],
+				   nr_pages);
+		percpu_counter_add(&to->stat[MEM_CGROUP_STAT_FILE_MAPPED],
+				   nr_pages);
 	}
 
 	if (PageWriteback(page)) {
-		__this_cpu_sub(from->stat->count[MEM_CGROUP_STAT_WRITEBACK],
-			       nr_pages);
-		__this_cpu_add(to->stat->count[MEM_CGROUP_STAT_WRITEBACK],
-			       nr_pages);
+		percpu_counter_sub(&from->stat[MEM_CGROUP_STAT_WRITEBACK],
+				   nr_pages);
+		percpu_counter_add(&to->stat[MEM_CGROUP_STAT_WRITEBACK],
+				   nr_pages);
 	}
 
 	/*
@@ -3499,12 +3422,10 @@ static int mem_cgroup_move_account(struct page *page,
 	move_unlock_mem_cgroup(from, &flags);
 	ret = 0;
 
-	local_irq_disable();
 	mem_cgroup_charge_statistics(to, page, nr_pages);
-	memcg_check_events(to, page);
+	memcg_check_events(to, page, nr_pages);
 	mem_cgroup_charge_statistics(from, page, -nr_pages);
-	memcg_check_events(from, page);
-	local_irq_enable();
+	memcg_check_events(from, page, nr_pages);
 out_unlock:
 	unlock_page(page);
 out:
@@ -3582,7 +3503,7 @@ static void mem_cgroup_swap_statistics(struct mem_cgroup *memcg,
 					 bool charge)
 {
 	int val = (charge) ? 1 : -1;
-	this_cpu_add(memcg->stat->count[MEM_CGROUP_STAT_SWAP], val);
+	percpu_counter_add(&memcg->stat[MEM_CGROUP_STAT_SWAP], val);
 }
 
 /**
@@ -5413,25 +5334,11 @@ static void free_mem_cgroup_per_zone_info(struct mem_cgroup *memcg, int node)
 
 static struct mem_cgroup *mem_cgroup_alloc(void)
 {
-	struct mem_cgroup *memcg;
 	size_t size;
 
 	size = sizeof(struct mem_cgroup);
 	size += nr_node_ids * sizeof(struct mem_cgroup_per_node *);
-
-	memcg = kzalloc(size, GFP_KERNEL);
-	if (!memcg)
-		return NULL;
-
-	memcg->stat = alloc_percpu(struct mem_cgroup_stat_cpu);
-	if (!memcg->stat)
-		goto out_free;
-	spin_lock_init(&memcg->pcp_counter_lock);
-	return memcg;
-
-out_free:
-	kfree(memcg);
-	return NULL;
+	return kzalloc(size, GFP_KERNEL);
 }
 
 /*
@@ -5448,13 +5355,20 @@ out_free:
 static void __mem_cgroup_free(struct mem_cgroup *memcg)
 {
 	int node;
+	int i;
 
 	mem_cgroup_remove_from_trees(memcg);
 
 	for_each_node(node)
 		free_mem_cgroup_per_zone_info(memcg, node);
 
-	free_percpu(memcg->stat);
+	for (i = 0; i < MEM_CGROUP_STAT_NSTATS; i++)
+		percpu_counter_destroy(&memcg->stat[i]);
+
+	for (i = 0; i < MEM_CGROUP_EVENTS_NSTATS; i++)
+		percpu_counter_destroy(&memcg->events[i]);
+
+	free_percpu(memcg->ratelimit);
 
 	/*
 	 * We need to make sure that (at least for now), the jump label
@@ -5511,11 +5425,24 @@ mem_cgroup_css_alloc(struct cgroup_subsys_state *parent_css)
 	struct mem_cgroup *memcg;
 	long error = -ENOMEM;
 	int node;
+	int i;
 
 	memcg = mem_cgroup_alloc();
 	if (!memcg)
 		return ERR_PTR(error);
 
+	for (i = 0; i < MEM_CGROUP_STAT_NSTATS; i++)
+		if (percpu_counter_init(&memcg->stat[i], 0, GFP_KERNEL))
+			goto free_out;
+
+	for (i = 0; i < MEM_CGROUP_EVENTS_NSTATS; i++)
+		if (percpu_counter_init(&memcg->events[i], 0, GFP_KERNEL))
+			goto free_out;
+
+	memcg->ratelimit = alloc_percpu(struct mem_cgroup_ratelimit_state);
+	if (!memcg->ratelimit)
+		goto free_out;
+
 	for_each_node(node)
 		if (alloc_mem_cgroup_per_zone_info(memcg, node))
 			goto free_out;
@@ -6507,10 +6434,8 @@ void mem_cgroup_commit_charge(struct page *page, struct mem_cgroup *memcg,
 		VM_BUG_ON_PAGE(!PageTransHuge(page), page);
 	}
 
-	local_irq_disable();
 	mem_cgroup_charge_statistics(memcg, page, nr_pages);
-	memcg_check_events(memcg, page);
-	local_irq_enable();
+	memcg_check_events(memcg, page, nr_pages);
 
 	if (do_swap_account && PageSwapCache(page)) {
 		swp_entry_t entry = { .val = page_private(page) };
@@ -6557,8 +6482,6 @@ static void uncharge_batch(struct mem_cgroup *memcg, unsigned long pgpgout,
 			   unsigned long nr_anon, unsigned long nr_file,
 			   unsigned long nr_huge, struct page *dummy_page)
 {
-	unsigned long flags;
-
 	if (!mem_cgroup_is_root(memcg)) {
 		if (nr_mem)
 			res_counter_uncharge(&memcg->res,
@@ -6569,14 +6492,12 @@ static void uncharge_batch(struct mem_cgroup *memcg, unsigned long pgpgout,
 		memcg_oom_recover(memcg);
 	}
 
-	local_irq_save(flags);
-	__this_cpu_sub(memcg->stat->count[MEM_CGROUP_STAT_RSS], nr_anon);
-	__this_cpu_sub(memcg->stat->count[MEM_CGROUP_STAT_CACHE], nr_file);
-	__this_cpu_sub(memcg->stat->count[MEM_CGROUP_STAT_RSS_HUGE], nr_huge);
-	__this_cpu_add(memcg->stat->events[MEM_CGROUP_EVENTS_PGPGOUT], pgpgout);
-	__this_cpu_add(memcg->stat->nr_page_events, nr_anon + nr_file);
-	memcg_check_events(memcg, dummy_page);
-	local_irq_restore(flags);
+	percpu_counter_sub(&memcg->stat[MEM_CGROUP_STAT_RSS], nr_anon);
+	percpu_counter_sub(&memcg->stat[MEM_CGROUP_STAT_CACHE], nr_file);
+	percpu_counter_sub(&memcg->stat[MEM_CGROUP_STAT_RSS_HUGE], nr_huge);
+	percpu_counter_add(&memcg->events[MEM_CGROUP_EVENTS_PGPGOUT], pgpgout);
+
+	memcg_check_events(memcg, dummy_page, nr_anon + nr_file);
 }
 
 static void uncharge_list(struct list_head *page_list)
-- 
1.7.10.4

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply related	[flat|nested] 22+ messages in thread

* [PATCH RFC 2/2] memcg: add threshold for anon rss
  2014-09-11 15:41 ` Vladimir Davydov
@ 2014-09-11 15:41   ` Vladimir Davydov
  -1 siblings, 0 replies; 22+ messages in thread
From: Vladimir Davydov @ 2014-09-11 15:41 UTC (permalink / raw)
  To: linux-kernel
  Cc: Kamezawa Hiroyuki, Johannes Weiner, Michal Hocko, Greg Thelen,
	Hugh Dickins, Motohiro Kosaki, Glauber Costa, Tejun Heo,
	Andrew Morton, Pavel Emelianov, Konstantin Khorenko, linux-mm,
	cgroups

Though hard memory limits suit perfectly for sand-boxing, they are not
that efficient when it comes to partitioning a server's resources among
multiple containers. The point is a container consuming a particular
amount of memory most of time may have infrequent spikes in the load.
Setting the hard limit to the maximal possible usage (spike) will lower
server utilization while setting it to the "normal" usage will result in
heavy lags during the spikes.

To handle such scenarios soft limits were introduced. The idea is to
allow a container to breach the limit freely when there's enough free
memory, but shrink it back to the limit aggressively on global memory
pressure. However, the concept of soft limits is intrinsically unsafe
by itself: if a container eats too much anonymous memory, it will be
very slow or even impossible (if there's no swap) to reclaim its
resources back to the limit. As a result the whole system will be
feeling bad until it finally realizes the culprit must die.

Currently we have no way to react to anonymous memory + swap usage
growth inside a container: the memsw counter accounts both anonymous
memory and file caches and swap, so we have neither a limit for
anon+swap nor a threshold notification. Actually, memsw is totally
useless if one wants to make full use of soft limits: it should be set
to a very large value or infinity then, otherwise it just makes no
sense.

That's one of the reasons why I think we should replace memsw with a
kind of anonsw so that it'd account only anon+swap. This way we'd still
be able to sand-box apps, but it'd also allow us to avoid nasty
surprises like the one I described above. For more arguments for and
against this idea, please see the following thread:

http://www.spinics.net/lists/linux-mm/msg78180.html

There's an alternative to this approach backed by Kamezawa. He thinks
that OOM on anon+swap limit hit is a no-go and proposes to use memory
thresholds for it. I still strongly disagree with the proposal, because
it's unsafe (what if the userspace handler won't react in time?).
Nevertheless, I implement his idea in this RFC. I hope this will fuel
the debate, because sadly enough nobody seems to care about this
problem.

So this patch adds the "memory.rss" file that shows the amount of
anonymous memory consumed by a cgroup and the event to handle threshold
notifications coming from it. The notification works exactly in the same
fashion as the existing memory/memsw usage notifications.

Please note this is improper implementation - we should rework
thresholds interface first.

Signed-off-by: Vladimir Davydov <vdavydov@parallels.com>
---
 mm/memcontrol.c |   61 +++++++++++++++++++++++++++++++++++++++++++++++--------
 1 file changed, 53 insertions(+), 8 deletions(-)

diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index 7e8d65e0608a..2cb4e498bc5f 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -325,6 +325,9 @@ struct mem_cgroup {
 	/* thresholds for mem+swap usage. RCU-protected */
 	struct mem_cgroup_thresholds memsw_thresholds;
 
+	/* thresholds for anonymous memory usage. RCU-protected */
+	struct mem_cgroup_thresholds rss_thresholds;
+
 	/* For oom notifier event fd */
 	struct list_head oom_notify;
 
@@ -464,6 +467,7 @@ enum res_type {
 	_MEMSWAP,
 	_OOM_TYPE,
 	_KMEM,
+	_RSS,
 };
 
 #define MEMFILE_PRIVATE(x, val)	((x) << 16 | (val))
@@ -4076,6 +4080,10 @@ static u64 mem_cgroup_read_u64(struct cgroup_subsys_state *css,
 		if (name == RES_USAGE)
 			return mem_cgroup_usage(memcg, true);
 		return res_counter_read_u64(&memcg->memsw, name);
+	case _RSS:
+		BUG_ON(name != RES_USAGE);
+		return mem_cgroup_recursive_stat(memcg, MEM_CGROUP_STAT_RSS)
+								<< PAGE_SHIFT;
 	case _KMEM:
 		return res_counter_read_u64(&memcg->kmem, name);
 		break;
@@ -4528,22 +4536,30 @@ static int mem_cgroup_swappiness_write(struct cgroup_subsys_state *css,
 	return 0;
 }
 
-static void __mem_cgroup_threshold(struct mem_cgroup *memcg, bool swap)
+static void __mem_cgroup_threshold(struct mem_cgroup *memcg, enum res_type type)
 {
 	struct mem_cgroup_threshold_ary *t;
 	u64 usage;
 	int i;
 
 	rcu_read_lock();
-	if (!swap)
+	if (type == _MEM)
 		t = rcu_dereference(memcg->thresholds.primary);
-	else
+	else if (type == _MEMSWAP)
 		t = rcu_dereference(memcg->memsw_thresholds.primary);
+	else if (type == _RSS)
+		t = rcu_dereference(memcg->rss_thresholds.primary);
+	else
+		BUG();
 
 	if (!t)
 		goto unlock;
 
-	usage = mem_cgroup_usage(memcg, swap);
+	if (type == _RSS)
+		usage = mem_cgroup_recursive_stat(memcg, MEM_CGROUP_STAT_RSS)
+								<< PAGE_SHIFT;
+	else
+		usage = mem_cgroup_usage(memcg, type == _MEMSWAP);
 
 	/*
 	 * current_threshold points to threshold just below or equal to usage.
@@ -4582,9 +4598,10 @@ unlock:
 static void mem_cgroup_threshold(struct mem_cgroup *memcg)
 {
 	while (memcg) {
-		__mem_cgroup_threshold(memcg, false);
+		__mem_cgroup_threshold(memcg, _MEM);
 		if (do_swap_account)
-			__mem_cgroup_threshold(memcg, true);
+			__mem_cgroup_threshold(memcg, _MEMSWAP);
+		__mem_cgroup_threshold(memcg, _RSS);
 
 		memcg = parent_mem_cgroup(memcg);
 	}
@@ -4645,12 +4662,16 @@ static int __mem_cgroup_usage_register_event(struct mem_cgroup *memcg,
 	} else if (type == _MEMSWAP) {
 		thresholds = &memcg->memsw_thresholds;
 		usage = mem_cgroup_usage(memcg, true);
+	} else if (type == _RSS) {
+		thresholds = &memcg->rss_thresholds;
+		usage = mem_cgroup_recursive_stat(memcg, MEM_CGROUP_STAT_RSS)
+								<< PAGE_SHIFT;
 	} else
 		BUG();
 
 	/* Check if a threshold crossed before adding a new one */
 	if (thresholds->primary)
-		__mem_cgroup_threshold(memcg, type == _MEMSWAP);
+		__mem_cgroup_threshold(memcg, type);
 
 	size = thresholds->primary ? thresholds->primary->size + 1 : 1;
 
@@ -4718,6 +4739,12 @@ static int memsw_cgroup_usage_register_event(struct mem_cgroup *memcg,
 	return __mem_cgroup_usage_register_event(memcg, eventfd, args, _MEMSWAP);
 }
 
+static int mem_cgroup_rss_register_event(struct mem_cgroup *memcg,
+	struct eventfd_ctx *eventfd, const char *args)
+{
+	return __mem_cgroup_usage_register_event(memcg, eventfd, args, _RSS);
+}
+
 static void __mem_cgroup_usage_unregister_event(struct mem_cgroup *memcg,
 	struct eventfd_ctx *eventfd, enum res_type type)
 {
@@ -4734,6 +4761,10 @@ static void __mem_cgroup_usage_unregister_event(struct mem_cgroup *memcg,
 	} else if (type == _MEMSWAP) {
 		thresholds = &memcg->memsw_thresholds;
 		usage = mem_cgroup_usage(memcg, true);
+	} else if (type == _RSS) {
+		thresholds = &memcg->rss_thresholds;
+		usage = mem_cgroup_recursive_stat(memcg, MEM_CGROUP_STAT_RSS)
+								<< PAGE_SHIFT;
 	} else
 		BUG();
 
@@ -4741,7 +4772,7 @@ static void __mem_cgroup_usage_unregister_event(struct mem_cgroup *memcg,
 		goto unlock;
 
 	/* Check if a threshold crossed before removing */
-	__mem_cgroup_threshold(memcg, type == _MEMSWAP);
+	__mem_cgroup_threshold(memcg, type);
 
 	/* Calculate new number of threshold */
 	size = 0;
@@ -4808,6 +4839,12 @@ static void memsw_cgroup_usage_unregister_event(struct mem_cgroup *memcg,
 	return __mem_cgroup_usage_unregister_event(memcg, eventfd, _MEMSWAP);
 }
 
+static void mem_cgroup_rss_unregister_event(struct mem_cgroup *memcg,
+	struct eventfd_ctx *eventfd)
+{
+	return __mem_cgroup_usage_unregister_event(memcg, eventfd, _RSS);
+}
+
 static int mem_cgroup_oom_register_event(struct mem_cgroup *memcg,
 	struct eventfd_ctx *eventfd, const char *args)
 {
@@ -5112,6 +5149,9 @@ static ssize_t memcg_write_event_control(struct kernfs_open_file *of,
 	} else if (!strcmp(name, "memory.memsw.usage_in_bytes")) {
 		event->register_event = memsw_cgroup_usage_register_event;
 		event->unregister_event = memsw_cgroup_usage_unregister_event;
+	} else if (!strcmp(name, "memory.rss")) {
+		event->register_event = mem_cgroup_rss_register_event;
+		event->unregister_event = mem_cgroup_rss_unregister_event;
 	} else {
 		ret = -EINVAL;
 		goto out_put_cfile;
@@ -5192,6 +5232,11 @@ static struct cftype mem_cgroup_files[] = {
 		.read_u64 = mem_cgroup_read_u64,
 	},
 	{
+		.name = "rss",
+		.private = MEMFILE_PRIVATE(_RSS, RES_USAGE),
+		.read_u64 = mem_cgroup_read_u64,
+	},
+	{
 		.name = "stat",
 		.seq_show = memcg_stat_show,
 	},
-- 
1.7.10.4


^ permalink raw reply related	[flat|nested] 22+ messages in thread

* [PATCH RFC 2/2] memcg: add threshold for anon rss
@ 2014-09-11 15:41   ` Vladimir Davydov
  0 siblings, 0 replies; 22+ messages in thread
From: Vladimir Davydov @ 2014-09-11 15:41 UTC (permalink / raw)
  To: linux-kernel
  Cc: Kamezawa Hiroyuki, Johannes Weiner, Michal Hocko, Greg Thelen,
	Hugh Dickins, Motohiro Kosaki, Glauber Costa, Tejun Heo,
	Andrew Morton, Pavel Emelianov, Konstantin Khorenko, linux-mm,
	cgroups

Though hard memory limits suit perfectly for sand-boxing, they are not
that efficient when it comes to partitioning a server's resources among
multiple containers. The point is a container consuming a particular
amount of memory most of time may have infrequent spikes in the load.
Setting the hard limit to the maximal possible usage (spike) will lower
server utilization while setting it to the "normal" usage will result in
heavy lags during the spikes.

To handle such scenarios soft limits were introduced. The idea is to
allow a container to breach the limit freely when there's enough free
memory, but shrink it back to the limit aggressively on global memory
pressure. However, the concept of soft limits is intrinsically unsafe
by itself: if a container eats too much anonymous memory, it will be
very slow or even impossible (if there's no swap) to reclaim its
resources back to the limit. As a result the whole system will be
feeling bad until it finally realizes the culprit must die.

Currently we have no way to react to anonymous memory + swap usage
growth inside a container: the memsw counter accounts both anonymous
memory and file caches and swap, so we have neither a limit for
anon+swap nor a threshold notification. Actually, memsw is totally
useless if one wants to make full use of soft limits: it should be set
to a very large value or infinity then, otherwise it just makes no
sense.

That's one of the reasons why I think we should replace memsw with a
kind of anonsw so that it'd account only anon+swap. This way we'd still
be able to sand-box apps, but it'd also allow us to avoid nasty
surprises like the one I described above. For more arguments for and
against this idea, please see the following thread:

http://www.spinics.net/lists/linux-mm/msg78180.html

There's an alternative to this approach backed by Kamezawa. He thinks
that OOM on anon+swap limit hit is a no-go and proposes to use memory
thresholds for it. I still strongly disagree with the proposal, because
it's unsafe (what if the userspace handler won't react in time?).
Nevertheless, I implement his idea in this RFC. I hope this will fuel
the debate, because sadly enough nobody seems to care about this
problem.

So this patch adds the "memory.rss" file that shows the amount of
anonymous memory consumed by a cgroup and the event to handle threshold
notifications coming from it. The notification works exactly in the same
fashion as the existing memory/memsw usage notifications.

Please note this is improper implementation - we should rework
thresholds interface first.

Signed-off-by: Vladimir Davydov <vdavydov@parallels.com>
---
 mm/memcontrol.c |   61 +++++++++++++++++++++++++++++++++++++++++++++++--------
 1 file changed, 53 insertions(+), 8 deletions(-)

diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index 7e8d65e0608a..2cb4e498bc5f 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -325,6 +325,9 @@ struct mem_cgroup {
 	/* thresholds for mem+swap usage. RCU-protected */
 	struct mem_cgroup_thresholds memsw_thresholds;
 
+	/* thresholds for anonymous memory usage. RCU-protected */
+	struct mem_cgroup_thresholds rss_thresholds;
+
 	/* For oom notifier event fd */
 	struct list_head oom_notify;
 
@@ -464,6 +467,7 @@ enum res_type {
 	_MEMSWAP,
 	_OOM_TYPE,
 	_KMEM,
+	_RSS,
 };
 
 #define MEMFILE_PRIVATE(x, val)	((x) << 16 | (val))
@@ -4076,6 +4080,10 @@ static u64 mem_cgroup_read_u64(struct cgroup_subsys_state *css,
 		if (name == RES_USAGE)
 			return mem_cgroup_usage(memcg, true);
 		return res_counter_read_u64(&memcg->memsw, name);
+	case _RSS:
+		BUG_ON(name != RES_USAGE);
+		return mem_cgroup_recursive_stat(memcg, MEM_CGROUP_STAT_RSS)
+								<< PAGE_SHIFT;
 	case _KMEM:
 		return res_counter_read_u64(&memcg->kmem, name);
 		break;
@@ -4528,22 +4536,30 @@ static int mem_cgroup_swappiness_write(struct cgroup_subsys_state *css,
 	return 0;
 }
 
-static void __mem_cgroup_threshold(struct mem_cgroup *memcg, bool swap)
+static void __mem_cgroup_threshold(struct mem_cgroup *memcg, enum res_type type)
 {
 	struct mem_cgroup_threshold_ary *t;
 	u64 usage;
 	int i;
 
 	rcu_read_lock();
-	if (!swap)
+	if (type == _MEM)
 		t = rcu_dereference(memcg->thresholds.primary);
-	else
+	else if (type == _MEMSWAP)
 		t = rcu_dereference(memcg->memsw_thresholds.primary);
+	else if (type == _RSS)
+		t = rcu_dereference(memcg->rss_thresholds.primary);
+	else
+		BUG();
 
 	if (!t)
 		goto unlock;
 
-	usage = mem_cgroup_usage(memcg, swap);
+	if (type == _RSS)
+		usage = mem_cgroup_recursive_stat(memcg, MEM_CGROUP_STAT_RSS)
+								<< PAGE_SHIFT;
+	else
+		usage = mem_cgroup_usage(memcg, type == _MEMSWAP);
 
 	/*
 	 * current_threshold points to threshold just below or equal to usage.
@@ -4582,9 +4598,10 @@ unlock:
 static void mem_cgroup_threshold(struct mem_cgroup *memcg)
 {
 	while (memcg) {
-		__mem_cgroup_threshold(memcg, false);
+		__mem_cgroup_threshold(memcg, _MEM);
 		if (do_swap_account)
-			__mem_cgroup_threshold(memcg, true);
+			__mem_cgroup_threshold(memcg, _MEMSWAP);
+		__mem_cgroup_threshold(memcg, _RSS);
 
 		memcg = parent_mem_cgroup(memcg);
 	}
@@ -4645,12 +4662,16 @@ static int __mem_cgroup_usage_register_event(struct mem_cgroup *memcg,
 	} else if (type == _MEMSWAP) {
 		thresholds = &memcg->memsw_thresholds;
 		usage = mem_cgroup_usage(memcg, true);
+	} else if (type == _RSS) {
+		thresholds = &memcg->rss_thresholds;
+		usage = mem_cgroup_recursive_stat(memcg, MEM_CGROUP_STAT_RSS)
+								<< PAGE_SHIFT;
 	} else
 		BUG();
 
 	/* Check if a threshold crossed before adding a new one */
 	if (thresholds->primary)
-		__mem_cgroup_threshold(memcg, type == _MEMSWAP);
+		__mem_cgroup_threshold(memcg, type);
 
 	size = thresholds->primary ? thresholds->primary->size + 1 : 1;
 
@@ -4718,6 +4739,12 @@ static int memsw_cgroup_usage_register_event(struct mem_cgroup *memcg,
 	return __mem_cgroup_usage_register_event(memcg, eventfd, args, _MEMSWAP);
 }
 
+static int mem_cgroup_rss_register_event(struct mem_cgroup *memcg,
+	struct eventfd_ctx *eventfd, const char *args)
+{
+	return __mem_cgroup_usage_register_event(memcg, eventfd, args, _RSS);
+}
+
 static void __mem_cgroup_usage_unregister_event(struct mem_cgroup *memcg,
 	struct eventfd_ctx *eventfd, enum res_type type)
 {
@@ -4734,6 +4761,10 @@ static void __mem_cgroup_usage_unregister_event(struct mem_cgroup *memcg,
 	} else if (type == _MEMSWAP) {
 		thresholds = &memcg->memsw_thresholds;
 		usage = mem_cgroup_usage(memcg, true);
+	} else if (type == _RSS) {
+		thresholds = &memcg->rss_thresholds;
+		usage = mem_cgroup_recursive_stat(memcg, MEM_CGROUP_STAT_RSS)
+								<< PAGE_SHIFT;
 	} else
 		BUG();
 
@@ -4741,7 +4772,7 @@ static void __mem_cgroup_usage_unregister_event(struct mem_cgroup *memcg,
 		goto unlock;
 
 	/* Check if a threshold crossed before removing */
-	__mem_cgroup_threshold(memcg, type == _MEMSWAP);
+	__mem_cgroup_threshold(memcg, type);
 
 	/* Calculate new number of threshold */
 	size = 0;
@@ -4808,6 +4839,12 @@ static void memsw_cgroup_usage_unregister_event(struct mem_cgroup *memcg,
 	return __mem_cgroup_usage_unregister_event(memcg, eventfd, _MEMSWAP);
 }
 
+static void mem_cgroup_rss_unregister_event(struct mem_cgroup *memcg,
+	struct eventfd_ctx *eventfd)
+{
+	return __mem_cgroup_usage_unregister_event(memcg, eventfd, _RSS);
+}
+
 static int mem_cgroup_oom_register_event(struct mem_cgroup *memcg,
 	struct eventfd_ctx *eventfd, const char *args)
 {
@@ -5112,6 +5149,9 @@ static ssize_t memcg_write_event_control(struct kernfs_open_file *of,
 	} else if (!strcmp(name, "memory.memsw.usage_in_bytes")) {
 		event->register_event = memsw_cgroup_usage_register_event;
 		event->unregister_event = memsw_cgroup_usage_unregister_event;
+	} else if (!strcmp(name, "memory.rss")) {
+		event->register_event = mem_cgroup_rss_register_event;
+		event->unregister_event = mem_cgroup_rss_unregister_event;
 	} else {
 		ret = -EINVAL;
 		goto out_put_cfile;
@@ -5192,6 +5232,11 @@ static struct cftype mem_cgroup_files[] = {
 		.read_u64 = mem_cgroup_read_u64,
 	},
 	{
+		.name = "rss",
+		.private = MEMFILE_PRIVATE(_RSS, RES_USAGE),
+		.read_u64 = mem_cgroup_read_u64,
+	},
+	{
 		.name = "stat",
 		.seq_show = memcg_stat_show,
 	},
-- 
1.7.10.4

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply related	[flat|nested] 22+ messages in thread

* Re: [PATCH RFC 2/2] memcg: add threshold for anon rss
@ 2014-09-11 17:20     ` Austin S Hemmelgarn
  0 siblings, 0 replies; 22+ messages in thread
From: Austin S Hemmelgarn @ 2014-09-11 17:20 UTC (permalink / raw)
  To: Vladimir Davydov, linux-kernel
  Cc: Kamezawa Hiroyuki, Johannes Weiner, Michal Hocko, Greg Thelen,
	Hugh Dickins, Motohiro Kosaki, Glauber Costa, Tejun Heo,
	Andrew Morton, Pavel Emelianov, Konstantin Khorenko, linux-mm,
	cgroups

[-- Attachment #1: Type: text/plain, Size: 4148 bytes --]

On 2014-09-11 11:41, Vladimir Davydov wrote:
> Though hard memory limits suit perfectly for sand-boxing, they are not
> that efficient when it comes to partitioning a server's resources among
> multiple containers. The point is a container consuming a particular
> amount of memory most of time may have infrequent spikes in the load.
> Setting the hard limit to the maximal possible usage (spike) will lower
> server utilization while setting it to the "normal" usage will result in
> heavy lags during the spikes.
> 
> To handle such scenarios soft limits were introduced. The idea is to
> allow a container to breach the limit freely when there's enough free
> memory, but shrink it back to the limit aggressively on global memory
> pressure. However, the concept of soft limits is intrinsically unsafe
> by itself: if a container eats too much anonymous memory, it will be
> very slow or even impossible (if there's no swap) to reclaim its
> resources back to the limit. As a result the whole system will be
> feeling bad until it finally realizes the culprit must die.
I have actually seen this happen on a number of occasions.  I use
cgroups to sandbox anything I run under wine (cause it's gotten so good
at mimicking windows that a number of windows viruses will run on it),
and have had issues with wine processes with memory leaks bringing the
system to it's knees on occasion.  There are a lot of other stupid
programs out there too, I've seen stuff that does it's own caching, but
doesn't free any of the cached items until it either gets a failed
malloc() or the system starts swapping it out.
> 
> Currently we have no way to react to anonymous memory + swap usage
> growth inside a container: the memsw counter accounts both anonymous
> memory and file caches and swap, so we have neither a limit for
> anon+swap nor a threshold notification. Actually, memsw is totally
> useless if one wants to make full use of soft limits: it should be set
> to a very large value or infinity then, otherwise it just makes no
> sense.
> 
> That's one of the reasons why I think we should replace memsw with a
> kind of anonsw so that it'd account only anon+swap. This way we'd still
> be able to sand-box apps, but it'd also allow us to avoid nasty
> surprises like the one I described above. For more arguments for and
> against this idea, please see the following thread:
> 
> http://www.spinics.net/lists/linux-mm/msg78180.html
> 
> There's an alternative to this approach backed by Kamezawa. He thinks
> that OOM on anon+swap limit hit is a no-go and proposes to use memory
> thresholds for it. I still strongly disagree with the proposal, because
> it's unsafe (what if the userspace handler won't react in time?).
> Nevertheless, I implement his idea in this RFC. I hope this will fuel
> the debate, because sadly enough nobody seems to care about this
> problem.

So, I've actually been following the discussion mentioned above rather
closely, I just haven't had the time to comment on it.
Personally, I think both ideas have merits, but would like to propose a
third solution.

I would propose that we keep memsw like it is right now (because being
able to limit the sum of anon+cache+swap is useful, especially if you
are using cgroups to do strict partitioning of a machine), but give it a
better name (vss maybe?), add a separate counter for anonymous memory
and swap, and then provide for each of them an option to control whether
the OOM killer is used when the limit is hit (possibly with the option
of a delay before running the OOM killer), and a separate option for
threshold notifications.  Users than would be able to choose whether
they want a particular container killed when it hits a particular limit,
and whether or not they want notifications when it gets within a certain
percentage of the limit, or potentially both.

We still need to have a way to hard limit sum of anon+cache+swap (and
ideally kmem once that is working correctly), because that useful for
systems that have to provide guaranteed minimum amounts of virtual
memory to containers.


[-- Attachment #2: S/MIME Cryptographic Signature --]
[-- Type: application/pkcs7-signature, Size: 2455 bytes --]

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [PATCH RFC 2/2] memcg: add threshold for anon rss
@ 2014-09-11 17:20     ` Austin S Hemmelgarn
  0 siblings, 0 replies; 22+ messages in thread
From: Austin S Hemmelgarn @ 2014-09-11 17:20 UTC (permalink / raw)
  To: Vladimir Davydov, linux-kernel-u79uwXL29TY76Z2rM5mHXA
  Cc: Kamezawa Hiroyuki, Johannes Weiner, Michal Hocko, Greg Thelen,
	Hugh Dickins, Motohiro Kosaki, Glauber Costa, Tejun Heo,
	Andrew Morton, Pavel Emelianov, Konstantin Khorenko,
	linux-mm-Bw31MaZKKs3YtjvyW6yDsg, cgroups-u79uwXL29TY76Z2rM5mHXA

[-- Attachment #1: Type: text/plain, Size: 4148 bytes --]

On 2014-09-11 11:41, Vladimir Davydov wrote:
> Though hard memory limits suit perfectly for sand-boxing, they are not
> that efficient when it comes to partitioning a server's resources among
> multiple containers. The point is a container consuming a particular
> amount of memory most of time may have infrequent spikes in the load.
> Setting the hard limit to the maximal possible usage (spike) will lower
> server utilization while setting it to the "normal" usage will result in
> heavy lags during the spikes.
> 
> To handle such scenarios soft limits were introduced. The idea is to
> allow a container to breach the limit freely when there's enough free
> memory, but shrink it back to the limit aggressively on global memory
> pressure. However, the concept of soft limits is intrinsically unsafe
> by itself: if a container eats too much anonymous memory, it will be
> very slow or even impossible (if there's no swap) to reclaim its
> resources back to the limit. As a result the whole system will be
> feeling bad until it finally realizes the culprit must die.
I have actually seen this happen on a number of occasions.  I use
cgroups to sandbox anything I run under wine (cause it's gotten so good
at mimicking windows that a number of windows viruses will run on it),
and have had issues with wine processes with memory leaks bringing the
system to it's knees on occasion.  There are a lot of other stupid
programs out there too, I've seen stuff that does it's own caching, but
doesn't free any of the cached items until it either gets a failed
malloc() or the system starts swapping it out.
> 
> Currently we have no way to react to anonymous memory + swap usage
> growth inside a container: the memsw counter accounts both anonymous
> memory and file caches and swap, so we have neither a limit for
> anon+swap nor a threshold notification. Actually, memsw is totally
> useless if one wants to make full use of soft limits: it should be set
> to a very large value or infinity then, otherwise it just makes no
> sense.
> 
> That's one of the reasons why I think we should replace memsw with a
> kind of anonsw so that it'd account only anon+swap. This way we'd still
> be able to sand-box apps, but it'd also allow us to avoid nasty
> surprises like the one I described above. For more arguments for and
> against this idea, please see the following thread:
> 
> http://www.spinics.net/lists/linux-mm/msg78180.html
> 
> There's an alternative to this approach backed by Kamezawa. He thinks
> that OOM on anon+swap limit hit is a no-go and proposes to use memory
> thresholds for it. I still strongly disagree with the proposal, because
> it's unsafe (what if the userspace handler won't react in time?).
> Nevertheless, I implement his idea in this RFC. I hope this will fuel
> the debate, because sadly enough nobody seems to care about this
> problem.

So, I've actually been following the discussion mentioned above rather
closely, I just haven't had the time to comment on it.
Personally, I think both ideas have merits, but would like to propose a
third solution.

I would propose that we keep memsw like it is right now (because being
able to limit the sum of anon+cache+swap is useful, especially if you
are using cgroups to do strict partitioning of a machine), but give it a
better name (vss maybe?), add a separate counter for anonymous memory
and swap, and then provide for each of them an option to control whether
the OOM killer is used when the limit is hit (possibly with the option
of a delay before running the OOM killer), and a separate option for
threshold notifications.  Users than would be able to choose whether
they want a particular container killed when it hits a particular limit,
and whether or not they want notifications when it gets within a certain
percentage of the limit, or potentially both.

We still need to have a way to hard limit sum of anon+cache+swap (and
ideally kmem once that is working correctly), because that useful for
systems that have to provide guaranteed minimum amounts of virtual
memory to containers.


[-- Attachment #2: S/MIME Cryptographic Signature --]
[-- Type: application/pkcs7-signature, Size: 2455 bytes --]

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [PATCH RFC 1/2] memcg: use percpu_counter for statistics
  2014-09-11 15:41   ` Vladimir Davydov
@ 2014-09-12  1:10     ` Kamezawa Hiroyuki
  -1 siblings, 0 replies; 22+ messages in thread
From: Kamezawa Hiroyuki @ 2014-09-12  1:10 UTC (permalink / raw)
  To: Vladimir Davydov, linux-kernel
  Cc: Johannes Weiner, Michal Hocko, Greg Thelen, Hugh Dickins,
	Motohiro Kosaki, Glauber Costa, Tejun Heo, Andrew Morton,
	Pavel Emelianov, Konstantin Khorenko, linux-mm, cgroups

(2014/09/12 0:41), Vladimir Davydov wrote:
> In the next patch I need a quick way to get a value of
> MEM_CGROUP_STAT_RSS. The current procedure (mem_cgroup_read_stat) is
> slow (iterates over all cpus) and may sleep (uses get/put_online_cpus),
> so it's a no-go.
> 
> This patch converts memory cgroup statistics to use percpu_counter so
> that percpu_counter_read will do the trick.
> 
> Signed-off-by: Vladimir Davydov <vdavydov@parallels.com>


I have no strong objections but you need performance comparison to go with this.

I thought percpu counter was messy to be used for "array".
I can't understand why you started from fixing future performance problem before
merging new feature.

Thanks,
-Kame


> ---
>   mm/memcontrol.c |  217 ++++++++++++++++++-------------------------------------
>   1 file changed, 69 insertions(+), 148 deletions(-)
> 
> diff --git a/mm/memcontrol.c b/mm/memcontrol.c
> index 085dc6d2f876..7e8d65e0608a 100644
> --- a/mm/memcontrol.c
> +++ b/mm/memcontrol.c
> @@ -136,9 +136,7 @@ enum mem_cgroup_events_target {
>   #define SOFTLIMIT_EVENTS_TARGET 1024
>   #define NUMAINFO_EVENTS_TARGET	1024
>   
> -struct mem_cgroup_stat_cpu {
> -	long count[MEM_CGROUP_STAT_NSTATS];
> -	unsigned long events[MEM_CGROUP_EVENTS_NSTATS];
> +struct mem_cgroup_ratelimit_state {
>   	unsigned long nr_page_events;
>   	unsigned long targets[MEM_CGROUP_NTARGETS];
>   };
> @@ -341,16 +339,10 @@ struct mem_cgroup {
>   	atomic_t	moving_account;
>   	/* taken only while moving_account > 0 */
>   	spinlock_t	move_lock;
> -	/*
> -	 * percpu counter.
> -	 */
> -	struct mem_cgroup_stat_cpu __percpu *stat;
> -	/*
> -	 * used when a cpu is offlined or other synchronizations
> -	 * See mem_cgroup_read_stat().
> -	 */
> -	struct mem_cgroup_stat_cpu nocpu_base;
> -	spinlock_t pcp_counter_lock;
> +
> +	struct percpu_counter stat[MEM_CGROUP_STAT_NSTATS];
> +	struct percpu_counter events[MEM_CGROUP_EVENTS_NSTATS];
> +	struct mem_cgroup_ratelimit_state __percpu *ratelimit;
>   
>   	atomic_t	dead_count;
>   #if defined(CONFIG_MEMCG_KMEM) && defined(CONFIG_INET)
> @@ -849,59 +841,16 @@ mem_cgroup_largest_soft_limit_node(struct mem_cgroup_tree_per_zone *mctz)
>   	return mz;
>   }
>   
> -/*
> - * Implementation Note: reading percpu statistics for memcg.
> - *
> - * Both of vmstat[] and percpu_counter has threshold and do periodic
> - * synchronization to implement "quick" read. There are trade-off between
> - * reading cost and precision of value. Then, we may have a chance to implement
> - * a periodic synchronizion of counter in memcg's counter.
> - *
> - * But this _read() function is used for user interface now. The user accounts
> - * memory usage by memory cgroup and he _always_ requires exact value because
> - * he accounts memory. Even if we provide quick-and-fuzzy read, we always
> - * have to visit all online cpus and make sum. So, for now, unnecessary
> - * synchronization is not implemented. (just implemented for cpu hotplug)
> - *
> - * If there are kernel internal actions which can make use of some not-exact
> - * value, and reading all cpu value can be performance bottleneck in some
> - * common workload, threashold and synchonization as vmstat[] should be
> - * implemented.
> - */
>   static long mem_cgroup_read_stat(struct mem_cgroup *memcg,
>   				 enum mem_cgroup_stat_index idx)
>   {
> -	long val = 0;
> -	int cpu;
> -
> -	get_online_cpus();
> -	for_each_online_cpu(cpu)
> -		val += per_cpu(memcg->stat->count[idx], cpu);
> -#ifdef CONFIG_HOTPLUG_CPU
> -	spin_lock(&memcg->pcp_counter_lock);
> -	val += memcg->nocpu_base.count[idx];
> -	spin_unlock(&memcg->pcp_counter_lock);
> -#endif
> -	put_online_cpus();
> -	return val;
> +	return percpu_counter_read(&memcg->stat[idx]);
>   }
>   
>   static unsigned long mem_cgroup_read_events(struct mem_cgroup *memcg,
>   					    enum mem_cgroup_events_index idx)
>   {
> -	unsigned long val = 0;
> -	int cpu;
> -
> -	get_online_cpus();
> -	for_each_online_cpu(cpu)
> -		val += per_cpu(memcg->stat->events[idx], cpu);
> -#ifdef CONFIG_HOTPLUG_CPU
> -	spin_lock(&memcg->pcp_counter_lock);
> -	val += memcg->nocpu_base.events[idx];
> -	spin_unlock(&memcg->pcp_counter_lock);
> -#endif
> -	put_online_cpus();
> -	return val;
> +	return percpu_counter_read(&memcg->events[idx]);
>   }
>   
>   static void mem_cgroup_charge_statistics(struct mem_cgroup *memcg,
> @@ -913,25 +862,21 @@ static void mem_cgroup_charge_statistics(struct mem_cgroup *memcg,
>   	 * counted as CACHE even if it's on ANON LRU.
>   	 */
>   	if (PageAnon(page))
> -		__this_cpu_add(memcg->stat->count[MEM_CGROUP_STAT_RSS],
> +		percpu_counter_add(&memcg->stat[MEM_CGROUP_STAT_RSS],
>   				nr_pages);
>   	else
> -		__this_cpu_add(memcg->stat->count[MEM_CGROUP_STAT_CACHE],
> +		percpu_counter_add(&memcg->stat[MEM_CGROUP_STAT_CACHE],
>   				nr_pages);
>   
>   	if (PageTransHuge(page))
> -		__this_cpu_add(memcg->stat->count[MEM_CGROUP_STAT_RSS_HUGE],
> +		percpu_counter_add(&memcg->stat[MEM_CGROUP_STAT_RSS_HUGE],
>   				nr_pages);
>   
>   	/* pagein of a big page is an event. So, ignore page size */
>   	if (nr_pages > 0)
> -		__this_cpu_inc(memcg->stat->events[MEM_CGROUP_EVENTS_PGPGIN]);
> -	else {
> -		__this_cpu_inc(memcg->stat->events[MEM_CGROUP_EVENTS_PGPGOUT]);
> -		nr_pages = -nr_pages; /* for event */
> -	}
> -
> -	__this_cpu_add(memcg->stat->nr_page_events, nr_pages);
> +		percpu_counter_inc(&memcg->events[MEM_CGROUP_EVENTS_PGPGIN]);
> +	else
> +		percpu_counter_inc(&memcg->events[MEM_CGROUP_EVENTS_PGPGOUT]);
>   }
>   
>   unsigned long mem_cgroup_get_lru_size(struct lruvec *lruvec, enum lru_list lru)
> @@ -981,8 +926,8 @@ static bool mem_cgroup_event_ratelimit(struct mem_cgroup *memcg,
>   {
>   	unsigned long val, next;
>   
> -	val = __this_cpu_read(memcg->stat->nr_page_events);
> -	next = __this_cpu_read(memcg->stat->targets[target]);
> +	val = __this_cpu_read(memcg->ratelimit->nr_page_events);
> +	next = __this_cpu_read(memcg->ratelimit->targets[target]);
>   	/* from time_after() in jiffies.h */
>   	if ((long)next - (long)val < 0) {
>   		switch (target) {
> @@ -998,7 +943,7 @@ static bool mem_cgroup_event_ratelimit(struct mem_cgroup *memcg,
>   		default:
>   			break;
>   		}
> -		__this_cpu_write(memcg->stat->targets[target], next);
> +		__this_cpu_write(memcg->ratelimit->targets[target], next);
>   		return true;
>   	}
>   	return false;
> @@ -1006,10 +951,15 @@ static bool mem_cgroup_event_ratelimit(struct mem_cgroup *memcg,
>   
>   /*
>    * Check events in order.
> - *
>    */
> -static void memcg_check_events(struct mem_cgroup *memcg, struct page *page)
> +static void memcg_check_events(struct mem_cgroup *memcg, struct page *page,
> +			       unsigned long nr_pages)
>   {
> +	unsigned long flags;
> +
> +	local_irq_save(flags);
> +	__this_cpu_add(memcg->ratelimit->nr_page_events, nr_pages);
> +
>   	/* threshold event is triggered in finer grain than soft limit */
>   	if (unlikely(mem_cgroup_event_ratelimit(memcg,
>   						MEM_CGROUP_TARGET_THRESH))) {
> @@ -1030,6 +980,7 @@ static void memcg_check_events(struct mem_cgroup *memcg, struct page *page)
>   			atomic_inc(&memcg->numainfo_events);
>   #endif
>   	}
> +	local_irq_restore(flags);
>   }
>   
>   struct mem_cgroup *mem_cgroup_from_task(struct task_struct *p)
> @@ -1294,10 +1245,10 @@ void __mem_cgroup_count_vm_event(struct mm_struct *mm, enum vm_event_item idx)
>   
>   	switch (idx) {
>   	case PGFAULT:
> -		this_cpu_inc(memcg->stat->events[MEM_CGROUP_EVENTS_PGFAULT]);
> +		percpu_counter_inc(&memcg->events[MEM_CGROUP_EVENTS_PGFAULT]);
>   		break;
>   	case PGMAJFAULT:
> -		this_cpu_inc(memcg->stat->events[MEM_CGROUP_EVENTS_PGMAJFAULT]);
> +		percpu_counter_inc(&memcg->events[MEM_CGROUP_EVENTS_PGMAJFAULT]);
>   		break;
>   	default:
>   		BUG();
> @@ -2306,7 +2257,7 @@ void mem_cgroup_update_page_stat(struct page *page,
>   	if (unlikely(!memcg || !PageCgroupUsed(pc)))
>   		return;
>   
> -	this_cpu_add(memcg->stat->count[idx], val);
> +	percpu_counter_add(&memcg->stat[idx], val);
>   }
>   
>   /*
> @@ -2476,37 +2427,12 @@ static void drain_all_stock_sync(struct mem_cgroup *root_memcg)
>   	mutex_unlock(&percpu_charge_mutex);
>   }
>   
> -/*
> - * This function drains percpu counter value from DEAD cpu and
> - * move it to local cpu. Note that this function can be preempted.
> - */
> -static void mem_cgroup_drain_pcp_counter(struct mem_cgroup *memcg, int cpu)
> -{
> -	int i;
> -
> -	spin_lock(&memcg->pcp_counter_lock);
> -	for (i = 0; i < MEM_CGROUP_STAT_NSTATS; i++) {
> -		long x = per_cpu(memcg->stat->count[i], cpu);
> -
> -		per_cpu(memcg->stat->count[i], cpu) = 0;
> -		memcg->nocpu_base.count[i] += x;
> -	}
> -	for (i = 0; i < MEM_CGROUP_EVENTS_NSTATS; i++) {
> -		unsigned long x = per_cpu(memcg->stat->events[i], cpu);
> -
> -		per_cpu(memcg->stat->events[i], cpu) = 0;
> -		memcg->nocpu_base.events[i] += x;
> -	}
> -	spin_unlock(&memcg->pcp_counter_lock);
> -}
> -
>   static int memcg_cpu_hotplug_callback(struct notifier_block *nb,
>   					unsigned long action,
>   					void *hcpu)
>   {
>   	int cpu = (unsigned long)hcpu;
>   	struct memcg_stock_pcp *stock;
> -	struct mem_cgroup *iter;
>   
>   	if (action == CPU_ONLINE)
>   		return NOTIFY_OK;
> @@ -2514,9 +2440,6 @@ static int memcg_cpu_hotplug_callback(struct notifier_block *nb,
>   	if (action != CPU_DEAD && action != CPU_DEAD_FROZEN)
>   		return NOTIFY_OK;
>   
> -	for_each_mem_cgroup(iter)
> -		mem_cgroup_drain_pcp_counter(iter, cpu);
> -
>   	stock = &per_cpu(memcg_stock, cpu);
>   	drain_stock(stock);
>   	return NOTIFY_OK;
> @@ -3419,8 +3342,8 @@ void mem_cgroup_split_huge_fixup(struct page *head)
>   		pc->mem_cgroup = memcg;
>   		pc->flags = head_pc->flags;
>   	}
> -	__this_cpu_sub(memcg->stat->count[MEM_CGROUP_STAT_RSS_HUGE],
> -		       HPAGE_PMD_NR);
> +	percpu_counter_sub(&memcg->stat[MEM_CGROUP_STAT_RSS_HUGE],
> +			   HPAGE_PMD_NR);
>   }
>   #endif /* CONFIG_TRANSPARENT_HUGEPAGE */
>   
> @@ -3475,17 +3398,17 @@ static int mem_cgroup_move_account(struct page *page,
>   	move_lock_mem_cgroup(from, &flags);
>   
>   	if (!PageAnon(page) && page_mapped(page)) {
> -		__this_cpu_sub(from->stat->count[MEM_CGROUP_STAT_FILE_MAPPED],
> -			       nr_pages);
> -		__this_cpu_add(to->stat->count[MEM_CGROUP_STAT_FILE_MAPPED],
> -			       nr_pages);
> +		percpu_counter_sub(&from->stat[MEM_CGROUP_STAT_FILE_MAPPED],
> +				   nr_pages);
> +		percpu_counter_add(&to->stat[MEM_CGROUP_STAT_FILE_MAPPED],
> +				   nr_pages);
>   	}
>   
>   	if (PageWriteback(page)) {
> -		__this_cpu_sub(from->stat->count[MEM_CGROUP_STAT_WRITEBACK],
> -			       nr_pages);
> -		__this_cpu_add(to->stat->count[MEM_CGROUP_STAT_WRITEBACK],
> -			       nr_pages);
> +		percpu_counter_sub(&from->stat[MEM_CGROUP_STAT_WRITEBACK],
> +				   nr_pages);
> +		percpu_counter_add(&to->stat[MEM_CGROUP_STAT_WRITEBACK],
> +				   nr_pages);
>   	}
>   
>   	/*
> @@ -3499,12 +3422,10 @@ static int mem_cgroup_move_account(struct page *page,
>   	move_unlock_mem_cgroup(from, &flags);
>   	ret = 0;
>   
> -	local_irq_disable();
>   	mem_cgroup_charge_statistics(to, page, nr_pages);
> -	memcg_check_events(to, page);
> +	memcg_check_events(to, page, nr_pages);
>   	mem_cgroup_charge_statistics(from, page, -nr_pages);
> -	memcg_check_events(from, page);
> -	local_irq_enable();
> +	memcg_check_events(from, page, nr_pages);
>   out_unlock:
>   	unlock_page(page);
>   out:
> @@ -3582,7 +3503,7 @@ static void mem_cgroup_swap_statistics(struct mem_cgroup *memcg,
>   					 bool charge)
>   {
>   	int val = (charge) ? 1 : -1;
> -	this_cpu_add(memcg->stat->count[MEM_CGROUP_STAT_SWAP], val);
> +	percpu_counter_add(&memcg->stat[MEM_CGROUP_STAT_SWAP], val);
>   }
>   
>   /**
> @@ -5413,25 +5334,11 @@ static void free_mem_cgroup_per_zone_info(struct mem_cgroup *memcg, int node)
>   
>   static struct mem_cgroup *mem_cgroup_alloc(void)
>   {
> -	struct mem_cgroup *memcg;
>   	size_t size;
>   
>   	size = sizeof(struct mem_cgroup);
>   	size += nr_node_ids * sizeof(struct mem_cgroup_per_node *);
> -
> -	memcg = kzalloc(size, GFP_KERNEL);
> -	if (!memcg)
> -		return NULL;
> -
> -	memcg->stat = alloc_percpu(struct mem_cgroup_stat_cpu);
> -	if (!memcg->stat)
> -		goto out_free;
> -	spin_lock_init(&memcg->pcp_counter_lock);
> -	return memcg;
> -
> -out_free:
> -	kfree(memcg);
> -	return NULL;
> +	return kzalloc(size, GFP_KERNEL);
>   }
>   
>   /*
> @@ -5448,13 +5355,20 @@ out_free:
>   static void __mem_cgroup_free(struct mem_cgroup *memcg)
>   {
>   	int node;
> +	int i;
>   
>   	mem_cgroup_remove_from_trees(memcg);
>   
>   	for_each_node(node)
>   		free_mem_cgroup_per_zone_info(memcg, node);
>   
> -	free_percpu(memcg->stat);
> +	for (i = 0; i < MEM_CGROUP_STAT_NSTATS; i++)
> +		percpu_counter_destroy(&memcg->stat[i]);
> +
> +	for (i = 0; i < MEM_CGROUP_EVENTS_NSTATS; i++)
> +		percpu_counter_destroy(&memcg->events[i]);
> +
> +	free_percpu(memcg->ratelimit);
>   
>   	/*
>   	 * We need to make sure that (at least for now), the jump label
> @@ -5511,11 +5425,24 @@ mem_cgroup_css_alloc(struct cgroup_subsys_state *parent_css)
>   	struct mem_cgroup *memcg;
>   	long error = -ENOMEM;
>   	int node;
> +	int i;
>   
>   	memcg = mem_cgroup_alloc();
>   	if (!memcg)
>   		return ERR_PTR(error);
>   
> +	for (i = 0; i < MEM_CGROUP_STAT_NSTATS; i++)
> +		if (percpu_counter_init(&memcg->stat[i], 0, GFP_KERNEL))
> +			goto free_out;
> +
> +	for (i = 0; i < MEM_CGROUP_EVENTS_NSTATS; i++)
> +		if (percpu_counter_init(&memcg->events[i], 0, GFP_KERNEL))
> +			goto free_out;
> +
> +	memcg->ratelimit = alloc_percpu(struct mem_cgroup_ratelimit_state);
> +	if (!memcg->ratelimit)
> +		goto free_out;
> +
>   	for_each_node(node)
>   		if (alloc_mem_cgroup_per_zone_info(memcg, node))
>   			goto free_out;
> @@ -6507,10 +6434,8 @@ void mem_cgroup_commit_charge(struct page *page, struct mem_cgroup *memcg,
>   		VM_BUG_ON_PAGE(!PageTransHuge(page), page);
>   	}
>   
> -	local_irq_disable();
>   	mem_cgroup_charge_statistics(memcg, page, nr_pages);
> -	memcg_check_events(memcg, page);
> -	local_irq_enable();
> +	memcg_check_events(memcg, page, nr_pages);
>   
>   	if (do_swap_account && PageSwapCache(page)) {
>   		swp_entry_t entry = { .val = page_private(page) };
> @@ -6557,8 +6482,6 @@ static void uncharge_batch(struct mem_cgroup *memcg, unsigned long pgpgout,
>   			   unsigned long nr_anon, unsigned long nr_file,
>   			   unsigned long nr_huge, struct page *dummy_page)
>   {
> -	unsigned long flags;
> -
>   	if (!mem_cgroup_is_root(memcg)) {
>   		if (nr_mem)
>   			res_counter_uncharge(&memcg->res,
> @@ -6569,14 +6492,12 @@ static void uncharge_batch(struct mem_cgroup *memcg, unsigned long pgpgout,
>   		memcg_oom_recover(memcg);
>   	}
>   
> -	local_irq_save(flags);
> -	__this_cpu_sub(memcg->stat->count[MEM_CGROUP_STAT_RSS], nr_anon);
> -	__this_cpu_sub(memcg->stat->count[MEM_CGROUP_STAT_CACHE], nr_file);
> -	__this_cpu_sub(memcg->stat->count[MEM_CGROUP_STAT_RSS_HUGE], nr_huge);
> -	__this_cpu_add(memcg->stat->events[MEM_CGROUP_EVENTS_PGPGOUT], pgpgout);
> -	__this_cpu_add(memcg->stat->nr_page_events, nr_anon + nr_file);
> -	memcg_check_events(memcg, dummy_page);
> -	local_irq_restore(flags);
> +	percpu_counter_sub(&memcg->stat[MEM_CGROUP_STAT_RSS], nr_anon);
> +	percpu_counter_sub(&memcg->stat[MEM_CGROUP_STAT_CACHE], nr_file);
> +	percpu_counter_sub(&memcg->stat[MEM_CGROUP_STAT_RSS_HUGE], nr_huge);
> +	percpu_counter_add(&memcg->events[MEM_CGROUP_EVENTS_PGPGOUT], pgpgout);
> +
> +	memcg_check_events(memcg, dummy_page, nr_anon + nr_file);
>   }
>   
>   static void uncharge_list(struct list_head *page_list)
> 



^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [PATCH RFC 1/2] memcg: use percpu_counter for statistics
@ 2014-09-12  1:10     ` Kamezawa Hiroyuki
  0 siblings, 0 replies; 22+ messages in thread
From: Kamezawa Hiroyuki @ 2014-09-12  1:10 UTC (permalink / raw)
  To: Vladimir Davydov, linux-kernel
  Cc: Johannes Weiner, Michal Hocko, Greg Thelen, Hugh Dickins,
	Motohiro Kosaki, Glauber Costa, Tejun Heo, Andrew Morton,
	Pavel Emelianov, Konstantin Khorenko, linux-mm, cgroups

(2014/09/12 0:41), Vladimir Davydov wrote:
> In the next patch I need a quick way to get a value of
> MEM_CGROUP_STAT_RSS. The current procedure (mem_cgroup_read_stat) is
> slow (iterates over all cpus) and may sleep (uses get/put_online_cpus),
> so it's a no-go.
> 
> This patch converts memory cgroup statistics to use percpu_counter so
> that percpu_counter_read will do the trick.
> 
> Signed-off-by: Vladimir Davydov <vdavydov@parallels.com>


I have no strong objections but you need performance comparison to go with this.

I thought percpu counter was messy to be used for "array".
I can't understand why you started from fixing future performance problem before
merging new feature.

Thanks,
-Kame


> ---
>   mm/memcontrol.c |  217 ++++++++++++++++++-------------------------------------
>   1 file changed, 69 insertions(+), 148 deletions(-)
> 
> diff --git a/mm/memcontrol.c b/mm/memcontrol.c
> index 085dc6d2f876..7e8d65e0608a 100644
> --- a/mm/memcontrol.c
> +++ b/mm/memcontrol.c
> @@ -136,9 +136,7 @@ enum mem_cgroup_events_target {
>   #define SOFTLIMIT_EVENTS_TARGET 1024
>   #define NUMAINFO_EVENTS_TARGET	1024
>   
> -struct mem_cgroup_stat_cpu {
> -	long count[MEM_CGROUP_STAT_NSTATS];
> -	unsigned long events[MEM_CGROUP_EVENTS_NSTATS];
> +struct mem_cgroup_ratelimit_state {
>   	unsigned long nr_page_events;
>   	unsigned long targets[MEM_CGROUP_NTARGETS];
>   };
> @@ -341,16 +339,10 @@ struct mem_cgroup {
>   	atomic_t	moving_account;
>   	/* taken only while moving_account > 0 */
>   	spinlock_t	move_lock;
> -	/*
> -	 * percpu counter.
> -	 */
> -	struct mem_cgroup_stat_cpu __percpu *stat;
> -	/*
> -	 * used when a cpu is offlined or other synchronizations
> -	 * See mem_cgroup_read_stat().
> -	 */
> -	struct mem_cgroup_stat_cpu nocpu_base;
> -	spinlock_t pcp_counter_lock;
> +
> +	struct percpu_counter stat[MEM_CGROUP_STAT_NSTATS];
> +	struct percpu_counter events[MEM_CGROUP_EVENTS_NSTATS];
> +	struct mem_cgroup_ratelimit_state __percpu *ratelimit;
>   
>   	atomic_t	dead_count;
>   #if defined(CONFIG_MEMCG_KMEM) && defined(CONFIG_INET)
> @@ -849,59 +841,16 @@ mem_cgroup_largest_soft_limit_node(struct mem_cgroup_tree_per_zone *mctz)
>   	return mz;
>   }
>   
> -/*
> - * Implementation Note: reading percpu statistics for memcg.
> - *
> - * Both of vmstat[] and percpu_counter has threshold and do periodic
> - * synchronization to implement "quick" read. There are trade-off between
> - * reading cost and precision of value. Then, we may have a chance to implement
> - * a periodic synchronizion of counter in memcg's counter.
> - *
> - * But this _read() function is used for user interface now. The user accounts
> - * memory usage by memory cgroup and he _always_ requires exact value because
> - * he accounts memory. Even if we provide quick-and-fuzzy read, we always
> - * have to visit all online cpus and make sum. So, for now, unnecessary
> - * synchronization is not implemented. (just implemented for cpu hotplug)
> - *
> - * If there are kernel internal actions which can make use of some not-exact
> - * value, and reading all cpu value can be performance bottleneck in some
> - * common workload, threashold and synchonization as vmstat[] should be
> - * implemented.
> - */
>   static long mem_cgroup_read_stat(struct mem_cgroup *memcg,
>   				 enum mem_cgroup_stat_index idx)
>   {
> -	long val = 0;
> -	int cpu;
> -
> -	get_online_cpus();
> -	for_each_online_cpu(cpu)
> -		val += per_cpu(memcg->stat->count[idx], cpu);
> -#ifdef CONFIG_HOTPLUG_CPU
> -	spin_lock(&memcg->pcp_counter_lock);
> -	val += memcg->nocpu_base.count[idx];
> -	spin_unlock(&memcg->pcp_counter_lock);
> -#endif
> -	put_online_cpus();
> -	return val;
> +	return percpu_counter_read(&memcg->stat[idx]);
>   }
>   
>   static unsigned long mem_cgroup_read_events(struct mem_cgroup *memcg,
>   					    enum mem_cgroup_events_index idx)
>   {
> -	unsigned long val = 0;
> -	int cpu;
> -
> -	get_online_cpus();
> -	for_each_online_cpu(cpu)
> -		val += per_cpu(memcg->stat->events[idx], cpu);
> -#ifdef CONFIG_HOTPLUG_CPU
> -	spin_lock(&memcg->pcp_counter_lock);
> -	val += memcg->nocpu_base.events[idx];
> -	spin_unlock(&memcg->pcp_counter_lock);
> -#endif
> -	put_online_cpus();
> -	return val;
> +	return percpu_counter_read(&memcg->events[idx]);
>   }
>   
>   static void mem_cgroup_charge_statistics(struct mem_cgroup *memcg,
> @@ -913,25 +862,21 @@ static void mem_cgroup_charge_statistics(struct mem_cgroup *memcg,
>   	 * counted as CACHE even if it's on ANON LRU.
>   	 */
>   	if (PageAnon(page))
> -		__this_cpu_add(memcg->stat->count[MEM_CGROUP_STAT_RSS],
> +		percpu_counter_add(&memcg->stat[MEM_CGROUP_STAT_RSS],
>   				nr_pages);
>   	else
> -		__this_cpu_add(memcg->stat->count[MEM_CGROUP_STAT_CACHE],
> +		percpu_counter_add(&memcg->stat[MEM_CGROUP_STAT_CACHE],
>   				nr_pages);
>   
>   	if (PageTransHuge(page))
> -		__this_cpu_add(memcg->stat->count[MEM_CGROUP_STAT_RSS_HUGE],
> +		percpu_counter_add(&memcg->stat[MEM_CGROUP_STAT_RSS_HUGE],
>   				nr_pages);
>   
>   	/* pagein of a big page is an event. So, ignore page size */
>   	if (nr_pages > 0)
> -		__this_cpu_inc(memcg->stat->events[MEM_CGROUP_EVENTS_PGPGIN]);
> -	else {
> -		__this_cpu_inc(memcg->stat->events[MEM_CGROUP_EVENTS_PGPGOUT]);
> -		nr_pages = -nr_pages; /* for event */
> -	}
> -
> -	__this_cpu_add(memcg->stat->nr_page_events, nr_pages);
> +		percpu_counter_inc(&memcg->events[MEM_CGROUP_EVENTS_PGPGIN]);
> +	else
> +		percpu_counter_inc(&memcg->events[MEM_CGROUP_EVENTS_PGPGOUT]);
>   }
>   
>   unsigned long mem_cgroup_get_lru_size(struct lruvec *lruvec, enum lru_list lru)
> @@ -981,8 +926,8 @@ static bool mem_cgroup_event_ratelimit(struct mem_cgroup *memcg,
>   {
>   	unsigned long val, next;
>   
> -	val = __this_cpu_read(memcg->stat->nr_page_events);
> -	next = __this_cpu_read(memcg->stat->targets[target]);
> +	val = __this_cpu_read(memcg->ratelimit->nr_page_events);
> +	next = __this_cpu_read(memcg->ratelimit->targets[target]);
>   	/* from time_after() in jiffies.h */
>   	if ((long)next - (long)val < 0) {
>   		switch (target) {
> @@ -998,7 +943,7 @@ static bool mem_cgroup_event_ratelimit(struct mem_cgroup *memcg,
>   		default:
>   			break;
>   		}
> -		__this_cpu_write(memcg->stat->targets[target], next);
> +		__this_cpu_write(memcg->ratelimit->targets[target], next);
>   		return true;
>   	}
>   	return false;
> @@ -1006,10 +951,15 @@ static bool mem_cgroup_event_ratelimit(struct mem_cgroup *memcg,
>   
>   /*
>    * Check events in order.
> - *
>    */
> -static void memcg_check_events(struct mem_cgroup *memcg, struct page *page)
> +static void memcg_check_events(struct mem_cgroup *memcg, struct page *page,
> +			       unsigned long nr_pages)
>   {
> +	unsigned long flags;
> +
> +	local_irq_save(flags);
> +	__this_cpu_add(memcg->ratelimit->nr_page_events, nr_pages);
> +
>   	/* threshold event is triggered in finer grain than soft limit */
>   	if (unlikely(mem_cgroup_event_ratelimit(memcg,
>   						MEM_CGROUP_TARGET_THRESH))) {
> @@ -1030,6 +980,7 @@ static void memcg_check_events(struct mem_cgroup *memcg, struct page *page)
>   			atomic_inc(&memcg->numainfo_events);
>   #endif
>   	}
> +	local_irq_restore(flags);
>   }
>   
>   struct mem_cgroup *mem_cgroup_from_task(struct task_struct *p)
> @@ -1294,10 +1245,10 @@ void __mem_cgroup_count_vm_event(struct mm_struct *mm, enum vm_event_item idx)
>   
>   	switch (idx) {
>   	case PGFAULT:
> -		this_cpu_inc(memcg->stat->events[MEM_CGROUP_EVENTS_PGFAULT]);
> +		percpu_counter_inc(&memcg->events[MEM_CGROUP_EVENTS_PGFAULT]);
>   		break;
>   	case PGMAJFAULT:
> -		this_cpu_inc(memcg->stat->events[MEM_CGROUP_EVENTS_PGMAJFAULT]);
> +		percpu_counter_inc(&memcg->events[MEM_CGROUP_EVENTS_PGMAJFAULT]);
>   		break;
>   	default:
>   		BUG();
> @@ -2306,7 +2257,7 @@ void mem_cgroup_update_page_stat(struct page *page,
>   	if (unlikely(!memcg || !PageCgroupUsed(pc)))
>   		return;
>   
> -	this_cpu_add(memcg->stat->count[idx], val);
> +	percpu_counter_add(&memcg->stat[idx], val);
>   }
>   
>   /*
> @@ -2476,37 +2427,12 @@ static void drain_all_stock_sync(struct mem_cgroup *root_memcg)
>   	mutex_unlock(&percpu_charge_mutex);
>   }
>   
> -/*
> - * This function drains percpu counter value from DEAD cpu and
> - * move it to local cpu. Note that this function can be preempted.
> - */
> -static void mem_cgroup_drain_pcp_counter(struct mem_cgroup *memcg, int cpu)
> -{
> -	int i;
> -
> -	spin_lock(&memcg->pcp_counter_lock);
> -	for (i = 0; i < MEM_CGROUP_STAT_NSTATS; i++) {
> -		long x = per_cpu(memcg->stat->count[i], cpu);
> -
> -		per_cpu(memcg->stat->count[i], cpu) = 0;
> -		memcg->nocpu_base.count[i] += x;
> -	}
> -	for (i = 0; i < MEM_CGROUP_EVENTS_NSTATS; i++) {
> -		unsigned long x = per_cpu(memcg->stat->events[i], cpu);
> -
> -		per_cpu(memcg->stat->events[i], cpu) = 0;
> -		memcg->nocpu_base.events[i] += x;
> -	}
> -	spin_unlock(&memcg->pcp_counter_lock);
> -}
> -
>   static int memcg_cpu_hotplug_callback(struct notifier_block *nb,
>   					unsigned long action,
>   					void *hcpu)
>   {
>   	int cpu = (unsigned long)hcpu;
>   	struct memcg_stock_pcp *stock;
> -	struct mem_cgroup *iter;
>   
>   	if (action == CPU_ONLINE)
>   		return NOTIFY_OK;
> @@ -2514,9 +2440,6 @@ static int memcg_cpu_hotplug_callback(struct notifier_block *nb,
>   	if (action != CPU_DEAD && action != CPU_DEAD_FROZEN)
>   		return NOTIFY_OK;
>   
> -	for_each_mem_cgroup(iter)
> -		mem_cgroup_drain_pcp_counter(iter, cpu);
> -
>   	stock = &per_cpu(memcg_stock, cpu);
>   	drain_stock(stock);
>   	return NOTIFY_OK;
> @@ -3419,8 +3342,8 @@ void mem_cgroup_split_huge_fixup(struct page *head)
>   		pc->mem_cgroup = memcg;
>   		pc->flags = head_pc->flags;
>   	}
> -	__this_cpu_sub(memcg->stat->count[MEM_CGROUP_STAT_RSS_HUGE],
> -		       HPAGE_PMD_NR);
> +	percpu_counter_sub(&memcg->stat[MEM_CGROUP_STAT_RSS_HUGE],
> +			   HPAGE_PMD_NR);
>   }
>   #endif /* CONFIG_TRANSPARENT_HUGEPAGE */
>   
> @@ -3475,17 +3398,17 @@ static int mem_cgroup_move_account(struct page *page,
>   	move_lock_mem_cgroup(from, &flags);
>   
>   	if (!PageAnon(page) && page_mapped(page)) {
> -		__this_cpu_sub(from->stat->count[MEM_CGROUP_STAT_FILE_MAPPED],
> -			       nr_pages);
> -		__this_cpu_add(to->stat->count[MEM_CGROUP_STAT_FILE_MAPPED],
> -			       nr_pages);
> +		percpu_counter_sub(&from->stat[MEM_CGROUP_STAT_FILE_MAPPED],
> +				   nr_pages);
> +		percpu_counter_add(&to->stat[MEM_CGROUP_STAT_FILE_MAPPED],
> +				   nr_pages);
>   	}
>   
>   	if (PageWriteback(page)) {
> -		__this_cpu_sub(from->stat->count[MEM_CGROUP_STAT_WRITEBACK],
> -			       nr_pages);
> -		__this_cpu_add(to->stat->count[MEM_CGROUP_STAT_WRITEBACK],
> -			       nr_pages);
> +		percpu_counter_sub(&from->stat[MEM_CGROUP_STAT_WRITEBACK],
> +				   nr_pages);
> +		percpu_counter_add(&to->stat[MEM_CGROUP_STAT_WRITEBACK],
> +				   nr_pages);
>   	}
>   
>   	/*
> @@ -3499,12 +3422,10 @@ static int mem_cgroup_move_account(struct page *page,
>   	move_unlock_mem_cgroup(from, &flags);
>   	ret = 0;
>   
> -	local_irq_disable();
>   	mem_cgroup_charge_statistics(to, page, nr_pages);
> -	memcg_check_events(to, page);
> +	memcg_check_events(to, page, nr_pages);
>   	mem_cgroup_charge_statistics(from, page, -nr_pages);
> -	memcg_check_events(from, page);
> -	local_irq_enable();
> +	memcg_check_events(from, page, nr_pages);
>   out_unlock:
>   	unlock_page(page);
>   out:
> @@ -3582,7 +3503,7 @@ static void mem_cgroup_swap_statistics(struct mem_cgroup *memcg,
>   					 bool charge)
>   {
>   	int val = (charge) ? 1 : -1;
> -	this_cpu_add(memcg->stat->count[MEM_CGROUP_STAT_SWAP], val);
> +	percpu_counter_add(&memcg->stat[MEM_CGROUP_STAT_SWAP], val);
>   }
>   
>   /**
> @@ -5413,25 +5334,11 @@ static void free_mem_cgroup_per_zone_info(struct mem_cgroup *memcg, int node)
>   
>   static struct mem_cgroup *mem_cgroup_alloc(void)
>   {
> -	struct mem_cgroup *memcg;
>   	size_t size;
>   
>   	size = sizeof(struct mem_cgroup);
>   	size += nr_node_ids * sizeof(struct mem_cgroup_per_node *);
> -
> -	memcg = kzalloc(size, GFP_KERNEL);
> -	if (!memcg)
> -		return NULL;
> -
> -	memcg->stat = alloc_percpu(struct mem_cgroup_stat_cpu);
> -	if (!memcg->stat)
> -		goto out_free;
> -	spin_lock_init(&memcg->pcp_counter_lock);
> -	return memcg;
> -
> -out_free:
> -	kfree(memcg);
> -	return NULL;
> +	return kzalloc(size, GFP_KERNEL);
>   }
>   
>   /*
> @@ -5448,13 +5355,20 @@ out_free:
>   static void __mem_cgroup_free(struct mem_cgroup *memcg)
>   {
>   	int node;
> +	int i;
>   
>   	mem_cgroup_remove_from_trees(memcg);
>   
>   	for_each_node(node)
>   		free_mem_cgroup_per_zone_info(memcg, node);
>   
> -	free_percpu(memcg->stat);
> +	for (i = 0; i < MEM_CGROUP_STAT_NSTATS; i++)
> +		percpu_counter_destroy(&memcg->stat[i]);
> +
> +	for (i = 0; i < MEM_CGROUP_EVENTS_NSTATS; i++)
> +		percpu_counter_destroy(&memcg->events[i]);
> +
> +	free_percpu(memcg->ratelimit);
>   
>   	/*
>   	 * We need to make sure that (at least for now), the jump label
> @@ -5511,11 +5425,24 @@ mem_cgroup_css_alloc(struct cgroup_subsys_state *parent_css)
>   	struct mem_cgroup *memcg;
>   	long error = -ENOMEM;
>   	int node;
> +	int i;
>   
>   	memcg = mem_cgroup_alloc();
>   	if (!memcg)
>   		return ERR_PTR(error);
>   
> +	for (i = 0; i < MEM_CGROUP_STAT_NSTATS; i++)
> +		if (percpu_counter_init(&memcg->stat[i], 0, GFP_KERNEL))
> +			goto free_out;
> +
> +	for (i = 0; i < MEM_CGROUP_EVENTS_NSTATS; i++)
> +		if (percpu_counter_init(&memcg->events[i], 0, GFP_KERNEL))
> +			goto free_out;
> +
> +	memcg->ratelimit = alloc_percpu(struct mem_cgroup_ratelimit_state);
> +	if (!memcg->ratelimit)
> +		goto free_out;
> +
>   	for_each_node(node)
>   		if (alloc_mem_cgroup_per_zone_info(memcg, node))
>   			goto free_out;
> @@ -6507,10 +6434,8 @@ void mem_cgroup_commit_charge(struct page *page, struct mem_cgroup *memcg,
>   		VM_BUG_ON_PAGE(!PageTransHuge(page), page);
>   	}
>   
> -	local_irq_disable();
>   	mem_cgroup_charge_statistics(memcg, page, nr_pages);
> -	memcg_check_events(memcg, page);
> -	local_irq_enable();
> +	memcg_check_events(memcg, page, nr_pages);
>   
>   	if (do_swap_account && PageSwapCache(page)) {
>   		swp_entry_t entry = { .val = page_private(page) };
> @@ -6557,8 +6482,6 @@ static void uncharge_batch(struct mem_cgroup *memcg, unsigned long pgpgout,
>   			   unsigned long nr_anon, unsigned long nr_file,
>   			   unsigned long nr_huge, struct page *dummy_page)
>   {
> -	unsigned long flags;
> -
>   	if (!mem_cgroup_is_root(memcg)) {
>   		if (nr_mem)
>   			res_counter_uncharge(&memcg->res,
> @@ -6569,14 +6492,12 @@ static void uncharge_batch(struct mem_cgroup *memcg, unsigned long pgpgout,
>   		memcg_oom_recover(memcg);
>   	}
>   
> -	local_irq_save(flags);
> -	__this_cpu_sub(memcg->stat->count[MEM_CGROUP_STAT_RSS], nr_anon);
> -	__this_cpu_sub(memcg->stat->count[MEM_CGROUP_STAT_CACHE], nr_file);
> -	__this_cpu_sub(memcg->stat->count[MEM_CGROUP_STAT_RSS_HUGE], nr_huge);
> -	__this_cpu_add(memcg->stat->events[MEM_CGROUP_EVENTS_PGPGOUT], pgpgout);
> -	__this_cpu_add(memcg->stat->nr_page_events, nr_anon + nr_file);
> -	memcg_check_events(memcg, dummy_page);
> -	local_irq_restore(flags);
> +	percpu_counter_sub(&memcg->stat[MEM_CGROUP_STAT_RSS], nr_anon);
> +	percpu_counter_sub(&memcg->stat[MEM_CGROUP_STAT_CACHE], nr_file);
> +	percpu_counter_sub(&memcg->stat[MEM_CGROUP_STAT_RSS_HUGE], nr_huge);
> +	percpu_counter_add(&memcg->events[MEM_CGROUP_EVENTS_PGPGOUT], pgpgout);
> +
> +	memcg_check_events(memcg, dummy_page, nr_anon + nr_file);
>   }
>   
>   static void uncharge_list(struct list_head *page_list)
> 


--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [PATCH RFC 2/2] memcg: add threshold for anon rss
  2014-09-11 15:41   ` Vladimir Davydov
@ 2014-09-12  1:23     ` Kamezawa Hiroyuki
  -1 siblings, 0 replies; 22+ messages in thread
From: Kamezawa Hiroyuki @ 2014-09-12  1:23 UTC (permalink / raw)
  To: Vladimir Davydov, linux-kernel
  Cc: Johannes Weiner, Michal Hocko, Greg Thelen, Hugh Dickins,
	Motohiro Kosaki, Glauber Costa, Tejun Heo, Andrew Morton,
	Pavel Emelianov, Konstantin Khorenko, linux-mm, cgroups

(2014/09/12 0:41), Vladimir Davydov wrote:
> Though hard memory limits suit perfectly for sand-boxing, they are not
> that efficient when it comes to partitioning a server's resources among
> multiple containers. The point is a container consuming a particular
> amount of memory most of time may have infrequent spikes in the load.
> Setting the hard limit to the maximal possible usage (spike) will lower
> server utilization while setting it to the "normal" usage will result in
> heavy lags during the spikes.
> 
> To handle such scenarios soft limits were introduced. The idea is to
> allow a container to breach the limit freely when there's enough free
> memory, but shrink it back to the limit aggressively on global memory
> pressure. However, the concept of soft limits is intrinsically unsafe
> by itself: if a container eats too much anonymous memory, it will be
> very slow or even impossible (if there's no swap) to reclaim its
> resources back to the limit. As a result the whole system will be
> feeling bad until it finally realizes the culprit must die.
> 
> Currently we have no way to react to anonymous memory + swap usage
> growth inside a container: the memsw counter accounts both anonymous
> memory and file caches and swap, so we have neither a limit for
> anon+swap nor a threshold notification. Actually, memsw is totally
> useless if one wants to make full use of soft limits: it should be set
> to a very large value or infinity then, otherwise it just makes no
> sense.
> 
> That's one of the reasons why I think we should replace memsw with a
> kind of anonsw so that it'd account only anon+swap. This way we'd still
> be able to sand-box apps, but it'd also allow us to avoid nasty
> surprises like the one I described above. For more arguments for and
> against this idea, please see the following thread:
> 
> http://www.spinics.net/lists/linux-mm/msg78180.html
> 
> There's an alternative to this approach backed by Kamezawa. He thinks
> that OOM on anon+swap limit hit is a no-go and proposes to use memory
> thresholds for it. I still strongly disagree with the proposal, because
> it's unsafe (what if the userspace handler won't react in time?).
> Nevertheless, I implement his idea in this RFC. I hope this will fuel
> the debate, because sadly enough nobody seems to care about this
> problem.
> 
> So this patch adds the "memory.rss" file that shows the amount of
> anonymous memory consumed by a cgroup and the event to handle threshold
> notifications coming from it. The notification works exactly in the same
> fashion as the existing memory/memsw usage notifications.
> 
>

So, now, you know you can handle "threshould".

If you want to implement "automatic-oom-killall-in-a-contanier-threshold-in-kernel",
I don't have any objections.

What you want is not limit, you want a trigger for killing process.
Threshold + Kill is enough, using res_counter for that is overspec.

You don't need res_counter and don't need to break other guy's use case.

Thanks,
-Kame





^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [PATCH RFC 2/2] memcg: add threshold for anon rss
@ 2014-09-12  1:23     ` Kamezawa Hiroyuki
  0 siblings, 0 replies; 22+ messages in thread
From: Kamezawa Hiroyuki @ 2014-09-12  1:23 UTC (permalink / raw)
  To: Vladimir Davydov, linux-kernel
  Cc: Johannes Weiner, Michal Hocko, Greg Thelen, Hugh Dickins,
	Motohiro Kosaki, Glauber Costa, Tejun Heo, Andrew Morton,
	Pavel Emelianov, Konstantin Khorenko, linux-mm, cgroups

(2014/09/12 0:41), Vladimir Davydov wrote:
> Though hard memory limits suit perfectly for sand-boxing, they are not
> that efficient when it comes to partitioning a server's resources among
> multiple containers. The point is a container consuming a particular
> amount of memory most of time may have infrequent spikes in the load.
> Setting the hard limit to the maximal possible usage (spike) will lower
> server utilization while setting it to the "normal" usage will result in
> heavy lags during the spikes.
> 
> To handle such scenarios soft limits were introduced. The idea is to
> allow a container to breach the limit freely when there's enough free
> memory, but shrink it back to the limit aggressively on global memory
> pressure. However, the concept of soft limits is intrinsically unsafe
> by itself: if a container eats too much anonymous memory, it will be
> very slow or even impossible (if there's no swap) to reclaim its
> resources back to the limit. As a result the whole system will be
> feeling bad until it finally realizes the culprit must die.
> 
> Currently we have no way to react to anonymous memory + swap usage
> growth inside a container: the memsw counter accounts both anonymous
> memory and file caches and swap, so we have neither a limit for
> anon+swap nor a threshold notification. Actually, memsw is totally
> useless if one wants to make full use of soft limits: it should be set
> to a very large value or infinity then, otherwise it just makes no
> sense.
> 
> That's one of the reasons why I think we should replace memsw with a
> kind of anonsw so that it'd account only anon+swap. This way we'd still
> be able to sand-box apps, but it'd also allow us to avoid nasty
> surprises like the one I described above. For more arguments for and
> against this idea, please see the following thread:
> 
> http://www.spinics.net/lists/linux-mm/msg78180.html
> 
> There's an alternative to this approach backed by Kamezawa. He thinks
> that OOM on anon+swap limit hit is a no-go and proposes to use memory
> thresholds for it. I still strongly disagree with the proposal, because
> it's unsafe (what if the userspace handler won't react in time?).
> Nevertheless, I implement his idea in this RFC. I hope this will fuel
> the debate, because sadly enough nobody seems to care about this
> problem.
> 
> So this patch adds the "memory.rss" file that shows the amount of
> anonymous memory consumed by a cgroup and the event to handle threshold
> notifications coming from it. The notification works exactly in the same
> fashion as the existing memory/memsw usage notifications.
> 
>

So, now, you know you can handle "threshould".

If you want to implement "automatic-oom-killall-in-a-contanier-threshold-in-kernel",
I don't have any objections.

What you want is not limit, you want a trigger for killing process.
Threshold + Kill is enough, using res_counter for that is overspec.

You don't need res_counter and don't need to break other guy's use case.

Thanks,
-Kame




--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [PATCH RFC 1/2] memcg: use percpu_counter for statistics
  2014-09-12  1:10     ` Kamezawa Hiroyuki
  (?)
@ 2014-09-12  7:41       ` Vladimir Davydov
  -1 siblings, 0 replies; 22+ messages in thread
From: Vladimir Davydov @ 2014-09-12  7:41 UTC (permalink / raw)
  To: Kamezawa Hiroyuki
  Cc: linux-kernel, Johannes Weiner, Michal Hocko, Greg Thelen,
	Hugh Dickins, Motohiro Kosaki, Glauber Costa, Tejun Heo,
	Andrew Morton, Pavel Emelianov, Konstantin Khorenko, linux-mm,
	cgroups

On Fri, Sep 12, 2014 at 10:10:52AM +0900, Kamezawa Hiroyuki wrote:
> (2014/09/12 0:41), Vladimir Davydov wrote:
> > In the next patch I need a quick way to get a value of
> > MEM_CGROUP_STAT_RSS. The current procedure (mem_cgroup_read_stat) is
> > slow (iterates over all cpus) and may sleep (uses get/put_online_cpus),
> > so it's a no-go.
> > 
> > This patch converts memory cgroup statistics to use percpu_counter so
> > that percpu_counter_read will do the trick.
> > 
> > Signed-off-by: Vladimir Davydov <vdavydov@parallels.com>
> 
> 
> I have no strong objections but you need performance comparison to go with this.
> 
> I thought percpu counter was messy to be used for "array".
> I can't understand why you started from fixing future performance problem before
> merging new feature.

Because the present implementation of mem_cgroup_read_stat may sleep
(get/put_online_cpus) while I need to call it from atomic context in the
next patch.

I didn't do any performance comparisons, because it's just an RFC. It
exists only to attract attention to the problem. Using percpu counters
was the quickest way to implement a draft version, that's why I chose
them. It may have performance impact though, so it shouldn't be merged
w/o performance analysis.

Thanks,
Vladimir

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [PATCH RFC 1/2] memcg: use percpu_counter for statistics
@ 2014-09-12  7:41       ` Vladimir Davydov
  0 siblings, 0 replies; 22+ messages in thread
From: Vladimir Davydov @ 2014-09-12  7:41 UTC (permalink / raw)
  To: Kamezawa Hiroyuki
  Cc: linux-kernel, Johannes Weiner, Michal Hocko, Greg Thelen,
	Hugh Dickins, Motohiro Kosaki, Glauber Costa, Tejun Heo,
	Andrew Morton, Pavel Emelianov, Konstantin Khorenko, linux-mm,
	cgroups

On Fri, Sep 12, 2014 at 10:10:52AM +0900, Kamezawa Hiroyuki wrote:
> (2014/09/12 0:41), Vladimir Davydov wrote:
> > In the next patch I need a quick way to get a value of
> > MEM_CGROUP_STAT_RSS. The current procedure (mem_cgroup_read_stat) is
> > slow (iterates over all cpus) and may sleep (uses get/put_online_cpus),
> > so it's a no-go.
> > 
> > This patch converts memory cgroup statistics to use percpu_counter so
> > that percpu_counter_read will do the trick.
> > 
> > Signed-off-by: Vladimir Davydov <vdavydov@parallels.com>
> 
> 
> I have no strong objections but you need performance comparison to go with this.
> 
> I thought percpu counter was messy to be used for "array".
> I can't understand why you started from fixing future performance problem before
> merging new feature.

Because the present implementation of mem_cgroup_read_stat may sleep
(get/put_online_cpus) while I need to call it from atomic context in the
next patch.

I didn't do any performance comparisons, because it's just an RFC. It
exists only to attract attention to the problem. Using percpu counters
was the quickest way to implement a draft version, that's why I chose
them. It may have performance impact though, so it shouldn't be merged
w/o performance analysis.

Thanks,
Vladimir

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [PATCH RFC 1/2] memcg: use percpu_counter for statistics
@ 2014-09-12  7:41       ` Vladimir Davydov
  0 siblings, 0 replies; 22+ messages in thread
From: Vladimir Davydov @ 2014-09-12  7:41 UTC (permalink / raw)
  To: Kamezawa Hiroyuki
  Cc: linux-kernel-u79uwXL29TY76Z2rM5mHXA, Johannes Weiner,
	Michal Hocko, Greg Thelen, Hugh Dickins, Motohiro Kosaki,
	Glauber Costa, Tejun Heo, Andrew Morton, Pavel Emelianov,
	Konstantin Khorenko, linux-mm-Bw31MaZKKs3YtjvyW6yDsg,
	cgroups-u79uwXL29TY76Z2rM5mHXA

On Fri, Sep 12, 2014 at 10:10:52AM +0900, Kamezawa Hiroyuki wrote:
> (2014/09/12 0:41), Vladimir Davydov wrote:
> > In the next patch I need a quick way to get a value of
> > MEM_CGROUP_STAT_RSS. The current procedure (mem_cgroup_read_stat) is
> > slow (iterates over all cpus) and may sleep (uses get/put_online_cpus),
> > so it's a no-go.
> > 
> > This patch converts memory cgroup statistics to use percpu_counter so
> > that percpu_counter_read will do the trick.
> > 
> > Signed-off-by: Vladimir Davydov <vdavydov-bzQdu9zFT3WakBO8gow8eQ@public.gmane.org>
> 
> 
> I have no strong objections but you need performance comparison to go with this.
> 
> I thought percpu counter was messy to be used for "array".
> I can't understand why you started from fixing future performance problem before
> merging new feature.

Because the present implementation of mem_cgroup_read_stat may sleep
(get/put_online_cpus) while I need to call it from atomic context in the
next patch.

I didn't do any performance comparisons, because it's just an RFC. It
exists only to attract attention to the problem. Using percpu counters
was the quickest way to implement a draft version, that's why I chose
them. It may have performance impact though, so it shouldn't be merged
w/o performance analysis.

Thanks,
Vladimir

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [PATCH RFC 2/2] memcg: add threshold for anon rss
  2014-09-11 17:20     ` Austin S Hemmelgarn
  (?)
@ 2014-09-12  8:27       ` Vladimir Davydov
  -1 siblings, 0 replies; 22+ messages in thread
From: Vladimir Davydov @ 2014-09-12  8:27 UTC (permalink / raw)
  To: Austin S Hemmelgarn
  Cc: linux-kernel, Kamezawa Hiroyuki, Johannes Weiner, Michal Hocko,
	Greg Thelen, Hugh Dickins, Motohiro Kosaki, Glauber Costa,
	Tejun Heo, Andrew Morton, Pavel Emelianov, Konstantin Khorenko,
	linux-mm, cgroups

Hi Austin,

On Thu, Sep 11, 2014 at 01:20:34PM -0400, Austin S Hemmelgarn wrote:
> On 2014-09-11 11:41, Vladimir Davydov wrote:
> > Though hard memory limits suit perfectly for sand-boxing, they are not
> > that efficient when it comes to partitioning a server's resources among
> > multiple containers. The point is a container consuming a particular
> > amount of memory most of time may have infrequent spikes in the load.
> > Setting the hard limit to the maximal possible usage (spike) will lower
> > server utilization while setting it to the "normal" usage will result in
> > heavy lags during the spikes.
> > 
> > To handle such scenarios soft limits were introduced. The idea is to
> > allow a container to breach the limit freely when there's enough free
> > memory, but shrink it back to the limit aggressively on global memory
> > pressure. However, the concept of soft limits is intrinsically unsafe
> > by itself: if a container eats too much anonymous memory, it will be
> > very slow or even impossible (if there's no swap) to reclaim its
> > resources back to the limit. As a result the whole system will be
> > feeling bad until it finally realizes the culprit must die.
> I have actually seen this happen on a number of occasions.  I use
> cgroups to sandbox anything I run under wine (cause it's gotten so good
> at mimicking windows that a number of windows viruses will run on it),
> and have had issues with wine processes with memory leaks bringing the
> system to it's knees on occasion.  There are a lot of other stupid
> programs out there too, I've seen stuff that does it's own caching, but
> doesn't free any of the cached items until it either gets a failed
> malloc() or the system starts swapping it out.

Good example. For desktop users, it can be solved by setting hard memsw
limit, but when there are hundreds of containers running on the same
server setting memsw limit for each container would be just inflexible.
There might be containers that would make use of extra file caches, and
hard limiting them would increase overall disk load. OTOH setting only
soft limit would be dangerous if there's e.g. a wine user in one of the
containers.

> > Currently we have no way to react to anonymous memory + swap usage
> > growth inside a container: the memsw counter accounts both anonymous
> > memory and file caches and swap, so we have neither a limit for
> > anon+swap nor a threshold notification. Actually, memsw is totally
> > useless if one wants to make full use of soft limits: it should be set
> > to a very large value or infinity then, otherwise it just makes no
> > sense.
> > 
> > That's one of the reasons why I think we should replace memsw with a
> > kind of anonsw so that it'd account only anon+swap. This way we'd still
> > be able to sand-box apps, but it'd also allow us to avoid nasty
> > surprises like the one I described above. For more arguments for and
> > against this idea, please see the following thread:
> > 
> > http://www.spinics.net/lists/linux-mm/msg78180.html
> > 
> > There's an alternative to this approach backed by Kamezawa. He thinks
> > that OOM on anon+swap limit hit is a no-go and proposes to use memory
> > thresholds for it. I still strongly disagree with the proposal, because
> > it's unsafe (what if the userspace handler won't react in time?).
> > Nevertheless, I implement his idea in this RFC. I hope this will fuel
> > the debate, because sadly enough nobody seems to care about this
> > problem.
> 
> So, I've actually been following the discussion mentioned above rather
> closely, I just haven't had the time to comment on it.
> Personally, I think both ideas have merits, but would like to propose a
> third solution.
> 
> I would propose that we keep memsw like it is right now (because being
> able to limit the sum of anon+cache+swap is useful, especially if you
> are using cgroups to do strict partitioning of a machine), but give it a
> better name (vss maybe?), add a separate counter for anonymous memory
> and swap, and then provide for each of them an option to control whether
> the OOM killer is used when the limit is hit (possibly with the option
> of a delay before running the OOM killer), and a separate option for
> threshold notifications.  Users than would be able to choose whether
> they want a particular container killed when it hits a particular limit,
> and whether or not they want notifications when it gets within a certain
> percentage of the limit, or potentially both.

The problem is adding yet another counter means extra overhead, more
code written, wider user interface. I don't think anybody would accept
that. There is even an opinion we don't need a separate kmem limit (i.e.
kmem should only be accounted in mem and memsw).

> We still need to have a way to hard limit sum of anon+cache+swap (and
> ideally kmem once that is working correctly), because that useful for
> systems that have to provide guaranteed minimum amounts of virtual
> memory to containers.

Do you have any use cases where anon+swap and anon+cache can't satisfy
the user request while anon+cache+swap and anon+cache can? I'd
appreciate if you could provide me with one, because currently I'm
pretty convinced that anon+swap and anon+cache would be sufficient for
both sand-boxing and loose partitioning, it'd be just a bit different to
configure. That's why for now I stand for substituting anon+cache+swap
with anon+swap.

Thanks,
Vladimir

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [PATCH RFC 2/2] memcg: add threshold for anon rss
@ 2014-09-12  8:27       ` Vladimir Davydov
  0 siblings, 0 replies; 22+ messages in thread
From: Vladimir Davydov @ 2014-09-12  8:27 UTC (permalink / raw)
  To: Austin S Hemmelgarn
  Cc: linux-kernel, Kamezawa Hiroyuki, Johannes Weiner, Michal Hocko,
	Greg Thelen, Hugh Dickins, Motohiro Kosaki, Glauber Costa,
	Tejun Heo, Andrew Morton, Pavel Emelianov, Konstantin Khorenko,
	linux-mm, cgroups

Hi Austin,

On Thu, Sep 11, 2014 at 01:20:34PM -0400, Austin S Hemmelgarn wrote:
> On 2014-09-11 11:41, Vladimir Davydov wrote:
> > Though hard memory limits suit perfectly for sand-boxing, they are not
> > that efficient when it comes to partitioning a server's resources among
> > multiple containers. The point is a container consuming a particular
> > amount of memory most of time may have infrequent spikes in the load.
> > Setting the hard limit to the maximal possible usage (spike) will lower
> > server utilization while setting it to the "normal" usage will result in
> > heavy lags during the spikes.
> > 
> > To handle such scenarios soft limits were introduced. The idea is to
> > allow a container to breach the limit freely when there's enough free
> > memory, but shrink it back to the limit aggressively on global memory
> > pressure. However, the concept of soft limits is intrinsically unsafe
> > by itself: if a container eats too much anonymous memory, it will be
> > very slow or even impossible (if there's no swap) to reclaim its
> > resources back to the limit. As a result the whole system will be
> > feeling bad until it finally realizes the culprit must die.
> I have actually seen this happen on a number of occasions.  I use
> cgroups to sandbox anything I run under wine (cause it's gotten so good
> at mimicking windows that a number of windows viruses will run on it),
> and have had issues with wine processes with memory leaks bringing the
> system to it's knees on occasion.  There are a lot of other stupid
> programs out there too, I've seen stuff that does it's own caching, but
> doesn't free any of the cached items until it either gets a failed
> malloc() or the system starts swapping it out.

Good example. For desktop users, it can be solved by setting hard memsw
limit, but when there are hundreds of containers running on the same
server setting memsw limit for each container would be just inflexible.
There might be containers that would make use of extra file caches, and
hard limiting them would increase overall disk load. OTOH setting only
soft limit would be dangerous if there's e.g. a wine user in one of the
containers.

> > Currently we have no way to react to anonymous memory + swap usage
> > growth inside a container: the memsw counter accounts both anonymous
> > memory and file caches and swap, so we have neither a limit for
> > anon+swap nor a threshold notification. Actually, memsw is totally
> > useless if one wants to make full use of soft limits: it should be set
> > to a very large value or infinity then, otherwise it just makes no
> > sense.
> > 
> > That's one of the reasons why I think we should replace memsw with a
> > kind of anonsw so that it'd account only anon+swap. This way we'd still
> > be able to sand-box apps, but it'd also allow us to avoid nasty
> > surprises like the one I described above. For more arguments for and
> > against this idea, please see the following thread:
> > 
> > http://www.spinics.net/lists/linux-mm/msg78180.html
> > 
> > There's an alternative to this approach backed by Kamezawa. He thinks
> > that OOM on anon+swap limit hit is a no-go and proposes to use memory
> > thresholds for it. I still strongly disagree with the proposal, because
> > it's unsafe (what if the userspace handler won't react in time?).
> > Nevertheless, I implement his idea in this RFC. I hope this will fuel
> > the debate, because sadly enough nobody seems to care about this
> > problem.
> 
> So, I've actually been following the discussion mentioned above rather
> closely, I just haven't had the time to comment on it.
> Personally, I think both ideas have merits, but would like to propose a
> third solution.
> 
> I would propose that we keep memsw like it is right now (because being
> able to limit the sum of anon+cache+swap is useful, especially if you
> are using cgroups to do strict partitioning of a machine), but give it a
> better name (vss maybe?), add a separate counter for anonymous memory
> and swap, and then provide for each of them an option to control whether
> the OOM killer is used when the limit is hit (possibly with the option
> of a delay before running the OOM killer), and a separate option for
> threshold notifications.  Users than would be able to choose whether
> they want a particular container killed when it hits a particular limit,
> and whether or not they want notifications when it gets within a certain
> percentage of the limit, or potentially both.

The problem is adding yet another counter means extra overhead, more
code written, wider user interface. I don't think anybody would accept
that. There is even an opinion we don't need a separate kmem limit (i.e.
kmem should only be accounted in mem and memsw).

> We still need to have a way to hard limit sum of anon+cache+swap (and
> ideally kmem once that is working correctly), because that useful for
> systems that have to provide guaranteed minimum amounts of virtual
> memory to containers.

Do you have any use cases where anon+swap and anon+cache can't satisfy
the user request while anon+cache+swap and anon+cache can? I'd
appreciate if you could provide me with one, because currently I'm
pretty convinced that anon+swap and anon+cache would be sufficient for
both sand-boxing and loose partitioning, it'd be just a bit different to
configure. That's why for now I stand for substituting anon+cache+swap
with anon+swap.

Thanks,
Vladimir

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [PATCH RFC 2/2] memcg: add threshold for anon rss
@ 2014-09-12  8:27       ` Vladimir Davydov
  0 siblings, 0 replies; 22+ messages in thread
From: Vladimir Davydov @ 2014-09-12  8:27 UTC (permalink / raw)
  To: Austin S Hemmelgarn
  Cc: linux-kernel-u79uwXL29TY76Z2rM5mHXA, Kamezawa Hiroyuki,
	Johannes Weiner, Michal Hocko, Greg Thelen, Hugh Dickins,
	Motohiro Kosaki, Glauber Costa, Tejun Heo, Andrew Morton,
	Pavel Emelianov, Konstantin Khorenko,
	linux-mm-Bw31MaZKKs3YtjvyW6yDsg, cgroups-u79uwXL29TY76Z2rM5mHXA

Hi Austin,

On Thu, Sep 11, 2014 at 01:20:34PM -0400, Austin S Hemmelgarn wrote:
> On 2014-09-11 11:41, Vladimir Davydov wrote:
> > Though hard memory limits suit perfectly for sand-boxing, they are not
> > that efficient when it comes to partitioning a server's resources among
> > multiple containers. The point is a container consuming a particular
> > amount of memory most of time may have infrequent spikes in the load.
> > Setting the hard limit to the maximal possible usage (spike) will lower
> > server utilization while setting it to the "normal" usage will result in
> > heavy lags during the spikes.
> > 
> > To handle such scenarios soft limits were introduced. The idea is to
> > allow a container to breach the limit freely when there's enough free
> > memory, but shrink it back to the limit aggressively on global memory
> > pressure. However, the concept of soft limits is intrinsically unsafe
> > by itself: if a container eats too much anonymous memory, it will be
> > very slow or even impossible (if there's no swap) to reclaim its
> > resources back to the limit. As a result the whole system will be
> > feeling bad until it finally realizes the culprit must die.
> I have actually seen this happen on a number of occasions.  I use
> cgroups to sandbox anything I run under wine (cause it's gotten so good
> at mimicking windows that a number of windows viruses will run on it),
> and have had issues with wine processes with memory leaks bringing the
> system to it's knees on occasion.  There are a lot of other stupid
> programs out there too, I've seen stuff that does it's own caching, but
> doesn't free any of the cached items until it either gets a failed
> malloc() or the system starts swapping it out.

Good example. For desktop users, it can be solved by setting hard memsw
limit, but when there are hundreds of containers running on the same
server setting memsw limit for each container would be just inflexible.
There might be containers that would make use of extra file caches, and
hard limiting them would increase overall disk load. OTOH setting only
soft limit would be dangerous if there's e.g. a wine user in one of the
containers.

> > Currently we have no way to react to anonymous memory + swap usage
> > growth inside a container: the memsw counter accounts both anonymous
> > memory and file caches and swap, so we have neither a limit for
> > anon+swap nor a threshold notification. Actually, memsw is totally
> > useless if one wants to make full use of soft limits: it should be set
> > to a very large value or infinity then, otherwise it just makes no
> > sense.
> > 
> > That's one of the reasons why I think we should replace memsw with a
> > kind of anonsw so that it'd account only anon+swap. This way we'd still
> > be able to sand-box apps, but it'd also allow us to avoid nasty
> > surprises like the one I described above. For more arguments for and
> > against this idea, please see the following thread:
> > 
> > http://www.spinics.net/lists/linux-mm/msg78180.html
> > 
> > There's an alternative to this approach backed by Kamezawa. He thinks
> > that OOM on anon+swap limit hit is a no-go and proposes to use memory
> > thresholds for it. I still strongly disagree with the proposal, because
> > it's unsafe (what if the userspace handler won't react in time?).
> > Nevertheless, I implement his idea in this RFC. I hope this will fuel
> > the debate, because sadly enough nobody seems to care about this
> > problem.
> 
> So, I've actually been following the discussion mentioned above rather
> closely, I just haven't had the time to comment on it.
> Personally, I think both ideas have merits, but would like to propose a
> third solution.
> 
> I would propose that we keep memsw like it is right now (because being
> able to limit the sum of anon+cache+swap is useful, especially if you
> are using cgroups to do strict partitioning of a machine), but give it a
> better name (vss maybe?), add a separate counter for anonymous memory
> and swap, and then provide for each of them an option to control whether
> the OOM killer is used when the limit is hit (possibly with the option
> of a delay before running the OOM killer), and a separate option for
> threshold notifications.  Users than would be able to choose whether
> they want a particular container killed when it hits a particular limit,
> and whether or not they want notifications when it gets within a certain
> percentage of the limit, or potentially both.

The problem is adding yet another counter means extra overhead, more
code written, wider user interface. I don't think anybody would accept
that. There is even an opinion we don't need a separate kmem limit (i.e.
kmem should only be accounted in mem and memsw).

> We still need to have a way to hard limit sum of anon+cache+swap (and
> ideally kmem once that is working correctly), because that useful for
> systems that have to provide guaranteed minimum amounts of virtual
> memory to containers.

Do you have any use cases where anon+swap and anon+cache can't satisfy
the user request while anon+cache+swap and anon+cache can? I'd
appreciate if you could provide me with one, because currently I'm
pretty convinced that anon+swap and anon+cache would be sufficient for
both sand-boxing and loose partitioning, it'd be just a bit different to
configure. That's why for now I stand for substituting anon+cache+swap
with anon+swap.

Thanks,
Vladimir

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [PATCH RFC 2/2] memcg: add threshold for anon rss
  2014-09-12  1:23     ` Kamezawa Hiroyuki
  (?)
@ 2014-09-12  9:02       ` Vladimir Davydov
  -1 siblings, 0 replies; 22+ messages in thread
From: Vladimir Davydov @ 2014-09-12  9:02 UTC (permalink / raw)
  To: Kamezawa Hiroyuki
  Cc: linux-kernel, Johannes Weiner, Michal Hocko, Greg Thelen,
	Hugh Dickins, Motohiro Kosaki, Glauber Costa, Tejun Heo,
	Andrew Morton, Pavel Emelianov, Konstantin Khorenko, linux-mm,
	cgroups

On Fri, Sep 12, 2014 at 10:23:08AM +0900, Kamezawa Hiroyuki wrote:
> (2014/09/12 0:41), Vladimir Davydov wrote:
> > Though hard memory limits suit perfectly for sand-boxing, they are not
> > that efficient when it comes to partitioning a server's resources among
> > multiple containers. The point is a container consuming a particular
> > amount of memory most of time may have infrequent spikes in the load.
> > Setting the hard limit to the maximal possible usage (spike) will lower
> > server utilization while setting it to the "normal" usage will result in
> > heavy lags during the spikes.
> > 
> > To handle such scenarios soft limits were introduced. The idea is to
> > allow a container to breach the limit freely when there's enough free
> > memory, but shrink it back to the limit aggressively on global memory
> > pressure. However, the concept of soft limits is intrinsically unsafe
> > by itself: if a container eats too much anonymous memory, it will be
> > very slow or even impossible (if there's no swap) to reclaim its
> > resources back to the limit. As a result the whole system will be
> > feeling bad until it finally realizes the culprit must die.
> > 
> > Currently we have no way to react to anonymous memory + swap usage
> > growth inside a container: the memsw counter accounts both anonymous
> > memory and file caches and swap, so we have neither a limit for
> > anon+swap nor a threshold notification. Actually, memsw is totally
> > useless if one wants to make full use of soft limits: it should be set
> > to a very large value or infinity then, otherwise it just makes no
> > sense.
> > 
> > That's one of the reasons why I think we should replace memsw with a
> > kind of anonsw so that it'd account only anon+swap. This way we'd still
> > be able to sand-box apps, but it'd also allow us to avoid nasty
> > surprises like the one I described above. For more arguments for and
> > against this idea, please see the following thread:
> > 
> > http://www.spinics.net/lists/linux-mm/msg78180.html
> > 
> > There's an alternative to this approach backed by Kamezawa. He thinks
> > that OOM on anon+swap limit hit is a no-go and proposes to use memory
> > thresholds for it. I still strongly disagree with the proposal, because
> > it's unsafe (what if the userspace handler won't react in time?).
> > Nevertheless, I implement his idea in this RFC. I hope this will fuel
> > the debate, because sadly enough nobody seems to care about this
> > problem.
> > 
> > So this patch adds the "memory.rss" file that shows the amount of
> > anonymous memory consumed by a cgroup and the event to handle threshold
> > notifications coming from it. The notification works exactly in the same
> > fashion as the existing memory/memsw usage notifications.
> > 
> >
> 
> So, now, you know you can handle "threshould".
> 
> If you want to implement "automatic-oom-killall-in-a-contanier-threshold-in-kernel",
> I don't have any objections.
> 
> What you want is not limit, you want a trigger for killing process.
> Threshold + Kill is enough, using res_counter for that is overspec.

I'm still unsure if it's always enough. Handing this job out to the
userspace may work in 90% percent of situations, but fail under some
circumstances (a bunch of containers go mad so that the userspace daemon
doesn't react in time). Can the admin take a risk like that?

> You don't need res_counter and don't need to break other guy's use case.

This is the time when we have a great chance to rework the user
interface. That's why I started this thread.

>From what I read from the comment to the memsw patch and slides,
anon+swap wasn't even considered as an alternative to anon+cache+swap.
The only question raised was "Why not a separate swap limit, why
mem+swap?". It was clearly answered "no need to recharge on swap
in/out", but anon+swap isn't a bit worse in this respect - caches can't
migrate from swap to mem anyway. I guess nobody considered the anon+swap
alternative, simply because there was no notion of soft limits at that
time, so mem+swap had no problems. But today the things have changed, so
let's face it now. Why not anon+swap?

Thanks,
Vladimir

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [PATCH RFC 2/2] memcg: add threshold for anon rss
@ 2014-09-12  9:02       ` Vladimir Davydov
  0 siblings, 0 replies; 22+ messages in thread
From: Vladimir Davydov @ 2014-09-12  9:02 UTC (permalink / raw)
  To: Kamezawa Hiroyuki
  Cc: linux-kernel, Johannes Weiner, Michal Hocko, Greg Thelen,
	Hugh Dickins, Motohiro Kosaki, Glauber Costa, Tejun Heo,
	Andrew Morton, Pavel Emelianov, Konstantin Khorenko, linux-mm,
	cgroups

On Fri, Sep 12, 2014 at 10:23:08AM +0900, Kamezawa Hiroyuki wrote:
> (2014/09/12 0:41), Vladimir Davydov wrote:
> > Though hard memory limits suit perfectly for sand-boxing, they are not
> > that efficient when it comes to partitioning a server's resources among
> > multiple containers. The point is a container consuming a particular
> > amount of memory most of time may have infrequent spikes in the load.
> > Setting the hard limit to the maximal possible usage (spike) will lower
> > server utilization while setting it to the "normal" usage will result in
> > heavy lags during the spikes.
> > 
> > To handle such scenarios soft limits were introduced. The idea is to
> > allow a container to breach the limit freely when there's enough free
> > memory, but shrink it back to the limit aggressively on global memory
> > pressure. However, the concept of soft limits is intrinsically unsafe
> > by itself: if a container eats too much anonymous memory, it will be
> > very slow or even impossible (if there's no swap) to reclaim its
> > resources back to the limit. As a result the whole system will be
> > feeling bad until it finally realizes the culprit must die.
> > 
> > Currently we have no way to react to anonymous memory + swap usage
> > growth inside a container: the memsw counter accounts both anonymous
> > memory and file caches and swap, so we have neither a limit for
> > anon+swap nor a threshold notification. Actually, memsw is totally
> > useless if one wants to make full use of soft limits: it should be set
> > to a very large value or infinity then, otherwise it just makes no
> > sense.
> > 
> > That's one of the reasons why I think we should replace memsw with a
> > kind of anonsw so that it'd account only anon+swap. This way we'd still
> > be able to sand-box apps, but it'd also allow us to avoid nasty
> > surprises like the one I described above. For more arguments for and
> > against this idea, please see the following thread:
> > 
> > http://www.spinics.net/lists/linux-mm/msg78180.html
> > 
> > There's an alternative to this approach backed by Kamezawa. He thinks
> > that OOM on anon+swap limit hit is a no-go and proposes to use memory
> > thresholds for it. I still strongly disagree with the proposal, because
> > it's unsafe (what if the userspace handler won't react in time?).
> > Nevertheless, I implement his idea in this RFC. I hope this will fuel
> > the debate, because sadly enough nobody seems to care about this
> > problem.
> > 
> > So this patch adds the "memory.rss" file that shows the amount of
> > anonymous memory consumed by a cgroup and the event to handle threshold
> > notifications coming from it. The notification works exactly in the same
> > fashion as the existing memory/memsw usage notifications.
> > 
> >
> 
> So, now, you know you can handle "threshould".
> 
> If you want to implement "automatic-oom-killall-in-a-contanier-threshold-in-kernel",
> I don't have any objections.
> 
> What you want is not limit, you want a trigger for killing process.
> Threshold + Kill is enough, using res_counter for that is overspec.

I'm still unsure if it's always enough. Handing this job out to the
userspace may work in 90% percent of situations, but fail under some
circumstances (a bunch of containers go mad so that the userspace daemon
doesn't react in time). Can the admin take a risk like that?

> You don't need res_counter and don't need to break other guy's use case.

This is the time when we have a great chance to rework the user
interface. That's why I started this thread.

>From what I read from the comment to the memsw patch and slides,
anon+swap wasn't even considered as an alternative to anon+cache+swap.
The only question raised was "Why not a separate swap limit, why
mem+swap?". It was clearly answered "no need to recharge on swap
in/out", but anon+swap isn't a bit worse in this respect - caches can't
migrate from swap to mem anyway. I guess nobody considered the anon+swap
alternative, simply because there was no notion of soft limits at that
time, so mem+swap had no problems. But today the things have changed, so
let's face it now. Why not anon+swap?

Thanks,
Vladimir

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [PATCH RFC 2/2] memcg: add threshold for anon rss
@ 2014-09-12  9:02       ` Vladimir Davydov
  0 siblings, 0 replies; 22+ messages in thread
From: Vladimir Davydov @ 2014-09-12  9:02 UTC (permalink / raw)
  To: Kamezawa Hiroyuki
  Cc: linux-kernel, Johannes Weiner, Michal Hocko, Greg Thelen,
	Hugh Dickins, Motohiro Kosaki, Glauber Costa, Tejun Heo,
	Andrew Morton, Pavel Emelianov, Konstantin Khorenko, linux-mm,
	cgroups

On Fri, Sep 12, 2014 at 10:23:08AM +0900, Kamezawa Hiroyuki wrote:
> (2014/09/12 0:41), Vladimir Davydov wrote:
> > Though hard memory limits suit perfectly for sand-boxing, they are not
> > that efficient when it comes to partitioning a server's resources among
> > multiple containers. The point is a container consuming a particular
> > amount of memory most of time may have infrequent spikes in the load.
> > Setting the hard limit to the maximal possible usage (spike) will lower
> > server utilization while setting it to the "normal" usage will result in
> > heavy lags during the spikes.
> > 
> > To handle such scenarios soft limits were introduced. The idea is to
> > allow a container to breach the limit freely when there's enough free
> > memory, but shrink it back to the limit aggressively on global memory
> > pressure. However, the concept of soft limits is intrinsically unsafe
> > by itself: if a container eats too much anonymous memory, it will be
> > very slow or even impossible (if there's no swap) to reclaim its
> > resources back to the limit. As a result the whole system will be
> > feeling bad until it finally realizes the culprit must die.
> > 
> > Currently we have no way to react to anonymous memory + swap usage
> > growth inside a container: the memsw counter accounts both anonymous
> > memory and file caches and swap, so we have neither a limit for
> > anon+swap nor a threshold notification. Actually, memsw is totally
> > useless if one wants to make full use of soft limits: it should be set
> > to a very large value or infinity then, otherwise it just makes no
> > sense.
> > 
> > That's one of the reasons why I think we should replace memsw with a
> > kind of anonsw so that it'd account only anon+swap. This way we'd still
> > be able to sand-box apps, but it'd also allow us to avoid nasty
> > surprises like the one I described above. For more arguments for and
> > against this idea, please see the following thread:
> > 
> > http://www.spinics.net/lists/linux-mm/msg78180.html
> > 
> > There's an alternative to this approach backed by Kamezawa. He thinks
> > that OOM on anon+swap limit hit is a no-go and proposes to use memory
> > thresholds for it. I still strongly disagree with the proposal, because
> > it's unsafe (what if the userspace handler won't react in time?).
> > Nevertheless, I implement his idea in this RFC. I hope this will fuel
> > the debate, because sadly enough nobody seems to care about this
> > problem.
> > 
> > So this patch adds the "memory.rss" file that shows the amount of
> > anonymous memory consumed by a cgroup and the event to handle threshold
> > notifications coming from it. The notification works exactly in the same
> > fashion as the existing memory/memsw usage notifications.
> > 
> >
> 
> So, now, you know you can handle "threshould".
> 
> If you want to implement "automatic-oom-killall-in-a-contanier-threshold-in-kernel",
> I don't have any objections.
> 
> What you want is not limit, you want a trigger for killing process.
> Threshold + Kill is enough, using res_counter for that is overspec.

I'm still unsure if it's always enough. Handing this job out to the
userspace may work in 90% percent of situations, but fail under some
circumstances (a bunch of containers go mad so that the userspace daemon
doesn't react in time). Can the admin take a risk like that?

> You don't need res_counter and don't need to break other guy's use case.

This is the time when we have a great chance to rework the user
interface. That's why I started this thread.

From what I read from the comment to the memsw patch and slides,
anon+swap wasn't even considered as an alternative to anon+cache+swap.
The only question raised was "Why not a separate swap limit, why
mem+swap?". It was clearly answered "no need to recharge on swap
in/out", but anon+swap isn't a bit worse in this respect - caches can't
migrate from swap to mem anyway. I guess nobody considered the anon+swap
alternative, simply because there was no notion of soft limits at that
time, so mem+swap had no problems. But today the things have changed, so
let's face it now. Why not anon+swap?

Thanks,
Vladimir

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 22+ messages in thread

end of thread, other threads:[~2014-09-12  9:03 UTC | newest]

Thread overview: 22+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2014-09-11 15:41 [PATCH RFC 0/2] Anonymous memory threshold notifications for memcg Vladimir Davydov
2014-09-11 15:41 ` Vladimir Davydov
2014-09-11 15:41 ` Vladimir Davydov
2014-09-11 15:41 ` [PATCH RFC 1/2] memcg: use percpu_counter for statistics Vladimir Davydov
2014-09-11 15:41   ` Vladimir Davydov
2014-09-12  1:10   ` Kamezawa Hiroyuki
2014-09-12  1:10     ` Kamezawa Hiroyuki
2014-09-12  7:41     ` Vladimir Davydov
2014-09-12  7:41       ` Vladimir Davydov
2014-09-12  7:41       ` Vladimir Davydov
2014-09-11 15:41 ` [PATCH RFC 2/2] memcg: add threshold for anon rss Vladimir Davydov
2014-09-11 15:41   ` Vladimir Davydov
2014-09-11 17:20   ` Austin S Hemmelgarn
2014-09-11 17:20     ` Austin S Hemmelgarn
2014-09-12  8:27     ` Vladimir Davydov
2014-09-12  8:27       ` Vladimir Davydov
2014-09-12  8:27       ` Vladimir Davydov
2014-09-12  1:23   ` Kamezawa Hiroyuki
2014-09-12  1:23     ` Kamezawa Hiroyuki
2014-09-12  9:02     ` Vladimir Davydov
2014-09-12  9:02       ` Vladimir Davydov
2014-09-12  9:02       ` Vladimir Davydov

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.