* [PATCH 00/14] mm: memcontrol: account socket memory in unified hierarchy
@ 2015-11-12 23:41 Johannes Weiner
  2015-11-12 23:41 ` [PATCH 01/14] mm: memcontrol: export root_mem_cgroup Johannes Weiner
                   ` (13 more replies)
  0 siblings, 14 replies; 61+ messages in thread
From: Johannes Weiner @ 2015-11-12 23:41 UTC
  To: David Miller, Andrew Morton
  Cc: Vladimir Davydov, Tejun Heo, Michal Hocko, netdev, linux-mm,
	cgroups, linux-kernel, kernel-team

Hi,

this is version 3 of the patches to add socket memory accounting to
the unified hierarchy memory controller. Changes since v2 include:

- Fixed an underflow bug in the mem+swap counter that came through the
  design of the per-cpu charge cache. To fix that, the unused mem+swap
  counter is now fully patched out on unified hierarchy. Double whammy.

- Restored the counting jump label such that the networking callbacks
  get patched out again when the last memory-controlled cgroup goes
  away. The code was already there, so we might as well keep it.

- Broke down the massive tcp_memcontrol rewrite patch into smaller
  logical pieces to (hopefully) make it easier to review and verify.

---

Socket buffer memory can make up a significant share of a workload's
memory footprint, and it can be attributed directly to userspace
activity. It therefore needs to be part of the memory controller to
provide proper resource isolation and containment.

Historically, socket buffers were accounted in a separate counter,
without any pressure equalization between anonymous memory, page
cache, and the socket buffers. When the socket buffer pool was
exhausted, buffer allocations would fail hard and cause network
performance to tank, regardless of whether there was still memory
available to the group or not. Likewise, struggling anonymous or cache
working sets could not dip into an idle socket memory pool. Because of
this, the feature was not usable for many real-life applications.

To avoid repeating this mistake, the new memory controller accounts
all types of memory pages it tracks on behalf of a cgroup in a single
pool. Under pressure, the VM reclaims, shrinks, and applies pressure
to whatever memory consumer in that pool is within its reach.

For socket memory, pressure feedback is provided through vmpressure
events. When the VM has trouble freeing memory, the network code is
instructed to stop growing the cgroup's transmit windows.
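
For illustration only (a sketch, not part of the series): a transmit
path could consult that signal through the tcp_under_memory_pressure()
helper reworked below. The function tcp_may_grow_window() is made up
for this example:

	static bool tcp_may_grow_window(const struct sock *sk)
	{
		/* global *or* per-memcg pressure vetoes window growth */
		if (tcp_under_memory_pressure(sk))
			return false;

		return sk_stream_memory_free(sk);
	}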

This series begins with a rework of the existing tcp memory controller
that simplifies and cleans up the code while allowing us to have only
one set of networking hooks for both memory controller versions. The
original behavior of the existing tcp controller should be preserved.

It then adds socket accounting to the v2 memory controller, including
the use of the per-cpu charge cache and async memory.high enforcement
from socket memory charges.

Lastly, vmpressure is hooked up to the socket code so that it stops
growing transmit windows when the VM has trouble reclaiming memory.
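
Distilled from the patches below, the resulting charge path in
__sk_mem_schedule() looks roughly like this (error handling elided;
the sk_memcg field is introduced in patch 9):

	if (mem_cgroup_sockets_enabled && sk->sk_memcg &&
	    !mem_cgroup_charge_skmem(sk->sk_memcg, amt))
		goto suppress_allocation;

with every uncharge mirrored by:

	if (mem_cgroup_sockets_enabled && sk->sk_memcg)
		mem_cgroup_uncharge_skmem(sk->sk_memcg, amt);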

 include/linux/memcontrol.h   |  71 ++++++----
 include/net/sock.h           | 149 ++------------------
 include/net/tcp.h            |   5 +-
 include/net/tcp_memcontrol.h |   1 -
 mm/backing-dev.c             |   2 +-
 mm/memcontrol.c              | 303 +++++++++++++++++++++++++++--------------
 mm/vmpressure.c              |  25 +++-
 mm/vmscan.c                  |  31 +++--
 net/core/sock.c              |  78 +++--------
 net/ipv4/tcp.c               |   3 +-
 net/ipv4/tcp_ipv4.c          |   9 +-
 net/ipv4/tcp_memcontrol.c    |  85 ++++--------
 net/ipv4/tcp_output.c        |   7 +-
 net/ipv6/tcp_ipv6.c          |   3 -
 14 files changed, 353 insertions(+), 419 deletions(-)



* [PATCH 01/14] mm: memcontrol: export root_mem_cgroup
  2015-11-12 23:41 [PATCH 00/14] mm: memcontrol: account socket memory in unified hierarchy Johannes Weiner
@ 2015-11-12 23:41 ` Johannes Weiner
  2015-11-13 15:59   ` David Miller
  2015-11-14 12:17   ` Vladimir Davydov
  2015-11-12 23:41 ` [PATCH 02/14] mm: vmscan: simplify memcg vs. global shrinker invocation Johannes Weiner
                   ` (12 subsequent siblings)
  13 siblings, 2 replies; 61+ messages in thread
From: Johannes Weiner @ 2015-11-12 23:41 UTC
  To: David Miller, Andrew Morton
  Cc: Vladimir Davydov, Tejun Heo, Michal Hocko, netdev, linux-mm,
	cgroups, linux-kernel, kernel-team

A later patch will need this symbol in files other than memcontrol.c,
so export it now and replace mem_cgroup_root_css at the same time.

Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
Acked-by: Michal Hocko <mhocko@suse.com>
---
 include/linux/memcontrol.h | 3 ++-
 mm/backing-dev.c           | 2 +-
 mm/memcontrol.c            | 5 ++---
 3 files changed, 5 insertions(+), 5 deletions(-)

diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
index ffc5460..9a7a24a 100644
--- a/include/linux/memcontrol.h
+++ b/include/linux/memcontrol.h
@@ -275,7 +275,8 @@ struct mem_cgroup {
 	struct mem_cgroup_per_node *nodeinfo[0];
 	/* WARNING: nodeinfo must be the last member here */
 };
-extern struct cgroup_subsys_state *mem_cgroup_root_css;
+
+extern struct mem_cgroup *root_mem_cgroup;
 
 /**
  * mem_cgroup_events - count memory events against a cgroup
diff --git a/mm/backing-dev.c b/mm/backing-dev.c
index 9160853..fdc6f4d 100644
--- a/mm/backing-dev.c
+++ b/mm/backing-dev.c
@@ -707,7 +707,7 @@ static int cgwb_bdi_init(struct backing_dev_info *bdi)
 
 	ret = wb_init(&bdi->wb, bdi, 1, GFP_KERNEL);
 	if (!ret) {
-		bdi->wb.memcg_css = mem_cgroup_root_css;
+		bdi->wb.memcg_css = &root_mem_cgroup->css;
 		bdi->wb.blkcg_css = blkcg_root_css;
 	}
 	return ret;
diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index fcd7c4e..a5d2586 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -76,9 +76,9 @@
 struct cgroup_subsys memory_cgrp_subsys __read_mostly;
 EXPORT_SYMBOL(memory_cgrp_subsys);
 
+struct mem_cgroup *root_mem_cgroup __read_mostly;
+
 #define MEM_CGROUP_RECLAIM_RETRIES	5
-static struct mem_cgroup *root_mem_cgroup __read_mostly;
-struct cgroup_subsys_state *mem_cgroup_root_css __read_mostly;
 
 /* Whether the swap controller is active */
 #ifdef CONFIG_MEMCG_SWAP
@@ -4211,7 +4211,6 @@ mem_cgroup_css_alloc(struct cgroup_subsys_state *parent_css)
 	/* root ? */
 	if (parent_css == NULL) {
 		root_mem_cgroup = memcg;
-		mem_cgroup_root_css = &memcg->css;
 		page_counter_init(&memcg->memory, NULL);
 		memcg->high = PAGE_COUNTER_MAX;
 		memcg->soft_limit = PAGE_COUNTER_MAX;
-- 
2.6.2



* [PATCH 02/14] mm: vmscan: simplify memcg vs. global shrinker invocation
  2015-11-12 23:41 [PATCH 00/14] mm: memcontrol: account socket memory in unified hierarchy Johannes Weiner
  2015-11-12 23:41 ` [PATCH 01/14] mm: memcontrol: export root_mem_cgroup Johannes Weiner
@ 2015-11-12 23:41 ` Johannes Weiner
  2015-11-13 15:59   ` David Miller
  2015-11-14 12:36   ` Vladimir Davydov
  2015-11-12 23:41 ` [PATCH 03/14] net: tcp_memcontrol: properly detect ancestor socket pressure Johannes Weiner
                   ` (11 subsequent siblings)
  13 siblings, 2 replies; 61+ messages in thread
From: Johannes Weiner @ 2015-11-12 23:41 UTC
  To: David Miller, Andrew Morton
  Cc: Vladimir Davydov, Tejun Heo, Michal Hocko, netdev, linux-mm,
	cgroups, linux-kernel, kernel-team

Letting shrink_slab() handle the root_mem_cgroup, and implicitly the
!CONFIG_MEMCG case, allows shrink_zone() to invoke the shrinkers
unconditionally from within the memcg iteration loop.
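
The resulting call pattern in shrink_zone() is roughly the following
(a sketch; the hunks below are authoritative):

	memcg = mem_cgroup_iter(root, NULL, &reclaim);
	do {
		shrink_lruvec(lruvec, swappiness, sc, &lru_pages);
		/*
		 * For global reclaim the walk starts at root_mem_cgroup,
		 * which shrink_slab() now maps to "all slabs"; with
		 * !CONFIG_MEMCG both are NULL and the check folds away.
		 */
		if (is_classzone)
			shrink_slab(sc->gfp_mask, zone_to_nid(zone), memcg,
				    sc->nr_scanned - scanned, lru_pages);
	} while ((memcg = mem_cgroup_iter(root, memcg, &reclaim)));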

Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
Acked-by: Michal Hocko <mhocko@suse.com>
---
 include/linux/memcontrol.h |  2 ++
 mm/vmscan.c                | 31 ++++++++++++++++---------------
 2 files changed, 18 insertions(+), 15 deletions(-)

diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
index 9a7a24a..251bb51 100644
--- a/include/linux/memcontrol.h
+++ b/include/linux/memcontrol.h
@@ -502,6 +502,8 @@ void mem_cgroup_split_huge_fixup(struct page *head);
 #else /* CONFIG_MEMCG */
 struct mem_cgroup;
 
+#define root_mem_cgroup NULL
+
 static inline void mem_cgroup_events(struct mem_cgroup *memcg,
 				     enum mem_cgroup_events_index idx,
 				     unsigned int nr)
diff --git a/mm/vmscan.c b/mm/vmscan.c
index a4507ec..e4f5b3c 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -411,6 +411,10 @@ static unsigned long shrink_slab(gfp_t gfp_mask, int nid,
 	struct shrinker *shrinker;
 	unsigned long freed = 0;
 
+	/* Global shrinker mode */
+	if (memcg == root_mem_cgroup)
+		memcg = NULL;
+
 	if (memcg && !memcg_kmem_is_active(memcg))
 		return 0;
 
@@ -2410,11 +2414,22 @@ static bool shrink_zone(struct zone *zone, struct scan_control *sc,
 			shrink_lruvec(lruvec, swappiness, sc, &lru_pages);
 			zone_lru_pages += lru_pages;
 
-			if (memcg && is_classzone)
+			/*
+			 * Shrink the slab caches in the same proportion that
+			 * the eligible LRU pages were scanned.
+			 */
+			if (is_classzone) {
 				shrink_slab(sc->gfp_mask, zone_to_nid(zone),
 					    memcg, sc->nr_scanned - scanned,
 					    lru_pages);
 
+				if (reclaim_state) {
+					sc->nr_reclaimed +=
+						reclaim_state->reclaimed_slab;
+					reclaim_state->reclaimed_slab = 0;
+				}
+			}
+
 			/*
 			 * Direct reclaim and kswapd have to scan all memory
 			 * cgroups to fulfill the overall scan target for the
@@ -2432,20 +2447,6 @@ static bool shrink_zone(struct zone *zone, struct scan_control *sc,
 			}
 		} while ((memcg = mem_cgroup_iter(root, memcg, &reclaim)));
 
-		/*
-		 * Shrink the slab caches in the same proportion that
-		 * the eligible LRU pages were scanned.
-		 */
-		if (global_reclaim(sc) && is_classzone)
-			shrink_slab(sc->gfp_mask, zone_to_nid(zone), NULL,
-				    sc->nr_scanned - nr_scanned,
-				    zone_lru_pages);
-
-		if (reclaim_state) {
-			sc->nr_reclaimed += reclaim_state->reclaimed_slab;
-			reclaim_state->reclaimed_slab = 0;
-		}
-
 		vmpressure(sc->gfp_mask, sc->target_mem_cgroup,
 			   sc->nr_scanned - nr_scanned,
 			   sc->nr_reclaimed - nr_reclaimed);
-- 
2.6.2



* [PATCH 03/14] net: tcp_memcontrol: properly detect ancestor socket pressure
  2015-11-12 23:41 [PATCH 00/14] mm: memcontrol: account socket memory in unified hierarchy Johannes Weiner
  2015-11-12 23:41 ` [PATCH 01/14] mm: memcontrol: export root_mem_cgroup Johannes Weiner
  2015-11-12 23:41 ` [PATCH 02/14] mm: vmscan: simplify memcg vs. global shrinker invocation Johannes Weiner
@ 2015-11-12 23:41 ` Johannes Weiner
  2015-11-13 16:00   ` David Miller
  2015-11-14 12:45   ` Vladimir Davydov
  2015-11-12 23:41 ` [PATCH 04/14] net: tcp_memcontrol: remove bogus hierarchy pressure propagation Johannes Weiner
                   ` (10 subsequent siblings)
  13 siblings, 2 replies; 61+ messages in thread
From: Johannes Weiner @ 2015-11-12 23:41 UTC
  To: David Miller, Andrew Morton
  Cc: Vladimir Davydov, Tejun Heo, Michal Hocko, netdev, linux-mm,
	cgroups, linux-kernel, kernel-team

When charging socket memory, the code currently checks only the local
page counter for excess to determine whether the memcg is under socket
pressure. But even if the local counter is fine, one of the ancestors
could have breached its limit, which should also force this child to
enter socket pressure. This currently doesn't happen.

Fix this by using page_counter_try_charge() first. If that fails, it
means that either the local counter or one of the ancestors is in
excess of its limit, and the child should enter socket pressure.
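
For illustration (hypothetical numbers, assuming the bool-returning
page_counter_try_charge() this series builds on): say parent P has a
1G limit and is maxed out while child C is far below its own limit.
The try-charge walks up from C and fails at P, as in the hunk below:

	struct page_counter *fail;

	if (!page_counter_try_charge(&prot->memory_allocated, amt, &fail)) {
		/* 'fail' is whichever counter is in excess: C's or P's */
		page_counter_charge(&prot->memory_allocated, amt);
		*parent_status = OVER_LIMIT;
	}

The charge still goes through either way, but the allocation is now
treated as over the hard limit even though C's own counter is fine.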

Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
---
 include/net/sock.h | 10 ++++++----
 1 file changed, 6 insertions(+), 4 deletions(-)

diff --git a/include/net/sock.h b/include/net/sock.h
index bbf7c2c..c4b33c9 100644
--- a/include/net/sock.h
+++ b/include/net/sock.h
@@ -1190,11 +1190,13 @@ static inline void memcg_memory_allocated_add(struct cg_proto *prot,
 					      unsigned long amt,
 					      int *parent_status)
 {
-	page_counter_charge(&prot->memory_allocated, amt);
+	struct page_counter *counter;
+
+	if (page_counter_try_charge(&prot->memory_allocated, amt, &counter))
+		return;
 
-	if (page_counter_read(&prot->memory_allocated) >
-	    prot->memory_allocated.limit)
-		*parent_status = OVER_LIMIT;
+	page_counter_charge(&prot->memory_allocated, amt);
+	*parent_status = OVER_LIMIT;
 }
 
 static inline void memcg_memory_allocated_sub(struct cg_proto *prot,
-- 
2.6.2



* [PATCH 04/14] net: tcp_memcontrol: remove bogus hierarchy pressure propagation
  2015-11-12 23:41 [PATCH 00/14] mm: memcontrol: account socket memory in unified hierarchy Johannes Weiner
                   ` (2 preceding siblings ...)
  2015-11-12 23:41 ` [PATCH 03/14] net: tcp_memcontrol: properly detect ancestor socket pressure Johannes Weiner
@ 2015-11-12 23:41 ` Johannes Weiner
  2015-11-13 16:00   ` David Miller
  2015-11-20  9:07   ` Vladimir Davydov
  2015-11-12 23:41 ` [PATCH 05/14] net: tcp_memcontrol: protect all tcp_memcontrol calls by jump-label Johannes Weiner
                   ` (9 subsequent siblings)
  13 siblings, 2 replies; 61+ messages in thread
From: Johannes Weiner @ 2015-11-12 23:41 UTC
  To: David Miller, Andrew Morton
  Cc: Vladimir Davydov, Tejun Heo, Michal Hocko, netdev, linux-mm,
	cgroups, linux-kernel, kernel-team

When a cgroup currently breaches its socket memory limit, it enters
memory pressure mode for itself and its *ancestors*. This throttles
transmission in unrelated sibling and cousin subtrees that have
nothing to do with the breached limit.

On the contrary, breaching a limit should make that group and its
*children* enter memory pressure mode. But this happens already,
albeit lazily: if an ancestor limit is breached, siblings will enter
memory pressure on their own once the next packet arrives for them.

So no additional hierarchy code is needed. Remove the bogus stuff.
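
To illustrate (made-up hierarchy): take A with children B1 and B2, and
a limit on A that B1's traffic breaches. The old code would put A and
A's *ancestors* under pressure, throttling subtrees that do not even
share the limit. With the previous patch in place, B2 enters pressure
by itself as soon as one of its sockets tries to charge and the
hierarchical try-charge fails at A; no explicit propagation is needed.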

Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
---
 include/net/sock.h | 19 ++++---------------
 1 file changed, 4 insertions(+), 15 deletions(-)

diff --git a/include/net/sock.h b/include/net/sock.h
index c4b33c9..6fc9147 100644
--- a/include/net/sock.h
+++ b/include/net/sock.h
@@ -1152,14 +1152,8 @@ static inline void sk_leave_memory_pressure(struct sock *sk)
 	if (*memory_pressure)
 		*memory_pressure = 0;
 
-	if (mem_cgroup_sockets_enabled && sk->sk_cgrp) {
-		struct cg_proto *cg_proto = sk->sk_cgrp;
-		struct proto *prot = sk->sk_prot;
-
-		for (; cg_proto; cg_proto = parent_cg_proto(prot, cg_proto))
-			cg_proto->memory_pressure = 0;
-	}
-
+	if (mem_cgroup_sockets_enabled && sk->sk_cgrp)
+		sk->sk_cgrp->memory_pressure = 0;
 }
 
 static inline void sk_enter_memory_pressure(struct sock *sk)
@@ -1167,13 +1161,8 @@ static inline void sk_enter_memory_pressure(struct sock *sk)
 	if (!sk->sk_prot->enter_memory_pressure)
 		return;
 
-	if (mem_cgroup_sockets_enabled && sk->sk_cgrp) {
-		struct cg_proto *cg_proto = sk->sk_cgrp;
-		struct proto *prot = sk->sk_prot;
-
-		for (; cg_proto; cg_proto = parent_cg_proto(prot, cg_proto))
-			cg_proto->memory_pressure = 1;
-	}
+	if (mem_cgroup_sockets_enabled && sk->sk_cgrp)
+		sk->sk_cgrp->memory_pressure = 1;
 
 	sk->sk_prot->enter_memory_pressure(sk);
 }
-- 
2.6.2



* [PATCH 05/14] net: tcp_memcontrol: protect all tcp_memcontrol calls by jump-label
  2015-11-12 23:41 [PATCH 00/14] mm: memcontrol: account socket memory in unified hierarchy Johannes Weiner
                   ` (3 preceding siblings ...)
  2015-11-12 23:41 ` [PATCH 04/14] net: tcp_memcontrol: remove bogus hierarchy pressure propagation Johannes Weiner
@ 2015-11-12 23:41 ` Johannes Weiner
  2015-11-13 16:01   ` David Miller
  2015-11-14 16:33   ` Vladimir Davydov
  2015-11-12 23:41 ` [PATCH 06/14] net: tcp_memcontrol: remove dead per-memcg count of allocated sockets Johannes Weiner
                   ` (8 subsequent siblings)
  13 siblings, 2 replies; 61+ messages in thread
From: Johannes Weiner @ 2015-11-12 23:41 UTC
  To: David Miller, Andrew Morton
  Cc: Vladimir Davydov, Tejun Heo, Michal Hocko, netdev, linux-mm,
	cgroups, linux-kernel, kernel-team

Move the jump-label from sock_update_memcg() and sock_release_memcg()
to the callsite, and so eliminate those function calls when socket
accounting is not enabled.

This also eliminates the need for dummy functions because the calls
will be optimized away if the Kconfig options are not enabled.
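
A sketch of why the stubs can go, using the definitions already in
net/sock.h:

	#if defined(CONFIG_MEMCG_KMEM) && defined(CONFIG_NET)
	#define mem_cgroup_sockets_enabled static_key_false(&memcg_socket_limit_enabled)
	#else
	#define mem_cgroup_sockets_enabled 0
	#endif

	/*
	 * With the macro folded to a constant 0, the compiler
	 * eliminates both the branch and the call, so no dummy
	 * definition of sock_release_memcg() is needed:
	 */
	if (mem_cgroup_sockets_enabled && sk->sk_cgrp)
		sock_release_memcg(sk);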

Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
---
 include/linux/memcontrol.h |  9 +-------
 mm/memcontrol.c            | 56 +++++++++++++++++++++-------------------------
 net/core/sock.c            |  9 ++------
 net/ipv4/tcp.c             |  3 ++-
 net/ipv4/tcp_ipv4.c        |  4 +++-
 5 files changed, 33 insertions(+), 48 deletions(-)

diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
index 251bb51..e930803 100644
--- a/include/linux/memcontrol.h
+++ b/include/linux/memcontrol.h
@@ -709,17 +709,10 @@ static inline void mem_cgroup_wb_stats(struct bdi_writeback *wb,
 
 #endif	/* CONFIG_CGROUP_WRITEBACK */
 
-struct sock;
 #if defined(CONFIG_INET) && defined(CONFIG_MEMCG_KMEM)
+struct sock;
 void sock_update_memcg(struct sock *sk);
 void sock_release_memcg(struct sock *sk);
-#else
-static inline void sock_update_memcg(struct sock *sk)
-{
-}
-static inline void sock_release_memcg(struct sock *sk)
-{
-}
 #endif /* CONFIG_INET && CONFIG_MEMCG_KMEM */
 
 #ifdef CONFIG_MEMCG_KMEM
diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index a5d2586..57f4539 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -293,46 +293,40 @@ static inline struct mem_cgroup *mem_cgroup_from_id(unsigned short id)
 
 void sock_update_memcg(struct sock *sk)
 {
-	if (mem_cgroup_sockets_enabled) {
-		struct mem_cgroup *memcg;
-		struct cg_proto *cg_proto;
+	struct mem_cgroup *memcg;
+	struct cg_proto *cg_proto;
 
-		BUG_ON(!sk->sk_prot->proto_cgroup);
+	BUG_ON(!sk->sk_prot->proto_cgroup);
 
-		/* Socket cloning can throw us here with sk_cgrp already
-		 * filled. It won't however, necessarily happen from
-		 * process context. So the test for root memcg given
-		 * the current task's memcg won't help us in this case.
-		 *
-		 * Respecting the original socket's memcg is a better
-		 * decision in this case.
-		 */
-		if (sk->sk_cgrp) {
-			BUG_ON(mem_cgroup_is_root(sk->sk_cgrp->memcg));
-			css_get(&sk->sk_cgrp->memcg->css);
-			return;
-		}
+	/* Socket cloning can throw us here with sk_cgrp already
+	 * filled. It won't however, necessarily happen from
+	 * process context. So the test for root memcg given
+	 * the current task's memcg won't help us in this case.
+	 *
+	 * Respecting the original socket's memcg is a better
+	 * decision in this case.
+	 */
+	if (sk->sk_cgrp) {
+		BUG_ON(mem_cgroup_is_root(sk->sk_cgrp->memcg));
+		css_get(&sk->sk_cgrp->memcg->css);
+		return;
+	}
 
-		rcu_read_lock();
-		memcg = mem_cgroup_from_task(current);
-		cg_proto = sk->sk_prot->proto_cgroup(memcg);
-		if (cg_proto && test_bit(MEMCG_SOCK_ACTIVE, &cg_proto->flags) &&
-		    css_tryget_online(&memcg->css)) {
-			sk->sk_cgrp = cg_proto;
-		}
-		rcu_read_unlock();
+	rcu_read_lock();
+	memcg = mem_cgroup_from_task(current);
+	cg_proto = sk->sk_prot->proto_cgroup(memcg);
+	if (cg_proto && test_bit(MEMCG_SOCK_ACTIVE, &cg_proto->flags) &&
+	    css_tryget_online(&memcg->css)) {
+		sk->sk_cgrp = cg_proto;
 	}
+	rcu_read_unlock();
 }
 EXPORT_SYMBOL(sock_update_memcg);
 
 void sock_release_memcg(struct sock *sk)
 {
-	if (mem_cgroup_sockets_enabled && sk->sk_cgrp) {
-		struct mem_cgroup *memcg;
-		WARN_ON(!sk->sk_cgrp->memcg);
-		memcg = sk->sk_cgrp->memcg;
-		css_put(&sk->sk_cgrp->memcg->css);
-	}
+	WARN_ON(!sk->sk_cgrp->memcg);
+	css_put(&sk->sk_cgrp->memcg->css);
 }
 
 struct cg_proto *tcp_proto_cgroup(struct mem_cgroup *memcg)
diff --git a/net/core/sock.c b/net/core/sock.c
index 1e4dd54..04e54bc 100644
--- a/net/core/sock.c
+++ b/net/core/sock.c
@@ -1488,12 +1488,6 @@ void sk_free(struct sock *sk)
 }
 EXPORT_SYMBOL(sk_free);
 
-static void sk_update_clone(const struct sock *sk, struct sock *newsk)
-{
-	if (mem_cgroup_sockets_enabled && sk->sk_cgrp)
-		sock_update_memcg(newsk);
-}
-
 /**
  *	sk_clone_lock - clone a socket, and lock its clone
  *	@sk: the socket to clone
@@ -1589,7 +1583,8 @@ struct sock *sk_clone_lock(const struct sock *sk, const gfp_t priority)
 		sk_set_socket(newsk, NULL);
 		newsk->sk_wq = NULL;
 
-		sk_update_clone(sk, newsk);
+		if (mem_cgroup_sockets_enabled && sk->sk_cgrp)
+			sock_update_memcg(newsk);
 
 		if (newsk->sk_prot->sockets_allocated)
 			sk_sockets_allocated_inc(newsk);
diff --git a/net/ipv4/tcp.c b/net/ipv4/tcp.c
index 0cfa7c0..1b324606 100644
--- a/net/ipv4/tcp.c
+++ b/net/ipv4/tcp.c
@@ -422,7 +422,8 @@ void tcp_init_sock(struct sock *sk)
 	sk->sk_rcvbuf = sysctl_tcp_rmem[1];
 
 	local_bh_disable();
-	sock_update_memcg(sk);
+	if (mem_cgroup_sockets_enabled)
+		sock_update_memcg(sk);
 	sk_sockets_allocated_inc(sk);
 	local_bh_enable();
 }
diff --git a/net/ipv4/tcp_ipv4.c b/net/ipv4/tcp_ipv4.c
index 950e28c..317a246 100644
--- a/net/ipv4/tcp_ipv4.c
+++ b/net/ipv4/tcp_ipv4.c
@@ -1812,7 +1812,9 @@ void tcp_v4_destroy_sock(struct sock *sk)
 	tcp_saved_syn_free(tp);
 
 	sk_sockets_allocated_dec(sk);
-	sock_release_memcg(sk);
+
+	if (mem_cgroup_sockets_enabled && sk->sk_cgrp)
+		sock_release_memcg(sk);
 }
 EXPORT_SYMBOL(tcp_v4_destroy_sock);
 
-- 
2.6.2



* [PATCH 06/14] net: tcp_memcontrol: remove dead per-memcg count of allocated sockets
  2015-11-12 23:41 [PATCH 00/14] mm: memcontrol: account socket memory in unified hierarchy Johannes Weiner
                   ` (4 preceding siblings ...)
  2015-11-12 23:41 ` [PATCH 05/14] net: tcp_memcontrol: protect all tcp_memcontrol calls by jump-label Johannes Weiner
@ 2015-11-12 23:41 ` Johannes Weiner
  2015-11-13 16:01   ` David Miller
  2015-11-20  9:48   ` Vladimir Davydov
  2015-11-12 23:41 ` [PATCH 07/14] net: tcp_memcontrol: simplify the per-memcg limit access Johannes Weiner
                   ` (7 subsequent siblings)
  13 siblings, 2 replies; 61+ messages in thread
From: Johannes Weiner @ 2015-11-12 23:41 UTC
  To: David Miller, Andrew Morton
  Cc: Vladimir Davydov, Tejun Heo, Michal Hocko, netdev, linux-mm,
	cgroups, linux-kernel, kernel-team

The number of allocated sockets is used for calculations in the soft
limit phase, where packets are accepted but the socket is under memory
pressure. Since there is no soft limit phase in tcp_memcontrol, and
memory pressure is only entered when packets are already dropped, this
is actually dead code. Remove it.

As this is the last user of parent_cg_proto(), remove that too.

Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
---
 include/linux/memcontrol.h |  1 -
 include/net/sock.h         | 39 +++------------------------------------
 net/ipv4/tcp_memcontrol.c  |  3 ---
 3 files changed, 3 insertions(+), 40 deletions(-)

diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
index e930803..185df8c 100644
--- a/include/linux/memcontrol.h
+++ b/include/linux/memcontrol.h
@@ -97,7 +97,6 @@ enum cg_proto_flags {
 
 struct cg_proto {
 	struct page_counter	memory_allocated;	/* Current allocated memory. */
-	struct percpu_counter	sockets_allocated;	/* Current number of sockets. */
 	int			memory_pressure;
 	long			sysctl_mem[3];
 	unsigned long		flags;
diff --git a/include/net/sock.h b/include/net/sock.h
index 6fc9147..ed141b3 100644
--- a/include/net/sock.h
+++ b/include/net/sock.h
@@ -1095,19 +1095,9 @@ static inline void sk_refcnt_debug_release(const struct sock *sk)
 
 #if defined(CONFIG_MEMCG_KMEM) && defined(CONFIG_NET)
 extern struct static_key memcg_socket_limit_enabled;
-static inline struct cg_proto *parent_cg_proto(struct proto *proto,
-					       struct cg_proto *cg_proto)
-{
-	return proto->proto_cgroup(parent_mem_cgroup(cg_proto->memcg));
-}
 #define mem_cgroup_sockets_enabled static_key_false(&memcg_socket_limit_enabled)
 #else
 #define mem_cgroup_sockets_enabled 0
-static inline struct cg_proto *parent_cg_proto(struct proto *proto,
-					       struct cg_proto *cg_proto)
-{
-	return NULL;
-}
 #endif
 
 static inline bool sk_stream_memory_free(const struct sock *sk)
@@ -1233,41 +1223,18 @@ sk_memory_allocated_sub(struct sock *sk, int amt)
 
 static inline void sk_sockets_allocated_dec(struct sock *sk)
 {
-	struct proto *prot = sk->sk_prot;
-
-	if (mem_cgroup_sockets_enabled && sk->sk_cgrp) {
-		struct cg_proto *cg_proto = sk->sk_cgrp;
-
-		for (; cg_proto; cg_proto = parent_cg_proto(prot, cg_proto))
-			percpu_counter_dec(&cg_proto->sockets_allocated);
-	}
-
-	percpu_counter_dec(prot->sockets_allocated);
+	percpu_counter_dec(sk->sk_prot->sockets_allocated);
 }
 
 static inline void sk_sockets_allocated_inc(struct sock *sk)
 {
-	struct proto *prot = sk->sk_prot;
-
-	if (mem_cgroup_sockets_enabled && sk->sk_cgrp) {
-		struct cg_proto *cg_proto = sk->sk_cgrp;
-
-		for (; cg_proto; cg_proto = parent_cg_proto(prot, cg_proto))
-			percpu_counter_inc(&cg_proto->sockets_allocated);
-	}
-
-	percpu_counter_inc(prot->sockets_allocated);
+	percpu_counter_inc(sk->sk_prot->sockets_allocated);
 }
 
 static inline int
 sk_sockets_allocated_read_positive(struct sock *sk)
 {
-	struct proto *prot = sk->sk_prot;
-
-	if (mem_cgroup_sockets_enabled && sk->sk_cgrp)
-		return percpu_counter_read_positive(&sk->sk_cgrp->sockets_allocated);
-
-	return percpu_counter_read_positive(prot->sockets_allocated);
+	return percpu_counter_read_positive(sk->sk_prot->sockets_allocated);
 }
 
 static inline int
diff --git a/net/ipv4/tcp_memcontrol.c b/net/ipv4/tcp_memcontrol.c
index 2379c1b..8965638 100644
--- a/net/ipv4/tcp_memcontrol.c
+++ b/net/ipv4/tcp_memcontrol.c
@@ -32,7 +32,6 @@ int tcp_init_cgroup(struct mem_cgroup *memcg, struct cgroup_subsys *ss)
 		counter_parent = &parent_cg->memory_allocated;
 
 	page_counter_init(&cg_proto->memory_allocated, counter_parent);
-	percpu_counter_init(&cg_proto->sockets_allocated, 0, GFP_KERNEL);
 
 	return 0;
 }
@@ -46,8 +45,6 @@ void tcp_destroy_cgroup(struct mem_cgroup *memcg)
 	if (!cg_proto)
 		return;
 
-	percpu_counter_destroy(&cg_proto->sockets_allocated);
-
 	if (test_bit(MEMCG_SOCK_ACTIVATED, &cg_proto->flags))
 		static_key_slow_dec(&memcg_socket_limit_enabled);
 
-- 
2.6.2



* [PATCH 07/14] net: tcp_memcontrol: simplify the per-memcg limit access
  2015-11-12 23:41 [PATCH 00/14] mm: memcontrol: account socket memory in unified hierarchy Johannes Weiner
                   ` (5 preceding siblings ...)
  2015-11-12 23:41 ` [PATCH 06/14] net: tcp_memcontrol: remove dead per-memcg count of allocated sockets Johannes Weiner
@ 2015-11-12 23:41 ` Johannes Weiner
  2015-11-20  9:51   ` Vladimir Davydov
  2015-11-12 23:41 ` [PATCH 08/14] net: tcp_memcontrol: sanitize tcp memory accounting callbacks Johannes Weiner
                   ` (6 subsequent siblings)
  13 siblings, 1 reply; 61+ messages in thread
From: Johannes Weiner @ 2015-11-12 23:41 UTC
  To: David Miller, Andrew Morton
  Cc: Vladimir Davydov, Tejun Heo, Michal Hocko, netdev, linux-mm,
	cgroups, linux-kernel, kernel-team

tcp_memcontrol replicates the global sysctl_mem limit array per
cgroup, but it only ever sets these entries to the value of the
memory_allocated page_counter limit. Use the latter directly.
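
For example (made-up numbers): with sysctl_tcp_mem[2] worth 4G and a
cgroup limit of 1G, the old code stored min(4G, 1G) = 1G in the
per-cgroup sysctl_mem copy whenever the limit was set. After this
patch, sk_prot_mem_limits() computes the same min() against
memory_allocated.limit at read time, so the duplicated array can go.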

Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
---
 include/linux/memcontrol.h | 1 -
 include/net/sock.h         | 8 +++++---
 net/ipv4/tcp_memcontrol.c  | 8 --------
 3 files changed, 5 insertions(+), 12 deletions(-)

diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
index 185df8c..96ca3d3 100644
--- a/include/linux/memcontrol.h
+++ b/include/linux/memcontrol.h
@@ -98,7 +98,6 @@ enum cg_proto_flags {
 struct cg_proto {
 	struct page_counter	memory_allocated;	/* Current allocated memory. */
 	int			memory_pressure;
-	long			sysctl_mem[3];
 	unsigned long		flags;
 	/*
 	 * memcg field is used to find which memcg we belong directly
diff --git a/include/net/sock.h b/include/net/sock.h
index ed141b3..2eefc99 100644
--- a/include/net/sock.h
+++ b/include/net/sock.h
@@ -1159,10 +1159,12 @@ static inline void sk_enter_memory_pressure(struct sock *sk)
 
 static inline long sk_prot_mem_limits(const struct sock *sk, int index)
 {
-	long *prot = sk->sk_prot->sysctl_mem;
+	long limit = sk->sk_prot->sysctl_mem[index];
+
 	if (mem_cgroup_sockets_enabled && sk->sk_cgrp)
-		prot = sk->sk_cgrp->sysctl_mem;
-	return prot[index];
+		limit = min_t(long, limit, sk->sk_cgrp->memory_allocated.limit);
+
+	return limit;
 }
 
 static inline void memcg_memory_allocated_add(struct cg_proto *prot,
diff --git a/net/ipv4/tcp_memcontrol.c b/net/ipv4/tcp_memcontrol.c
index 8965638..c383e68 100644
--- a/net/ipv4/tcp_memcontrol.c
+++ b/net/ipv4/tcp_memcontrol.c
@@ -21,9 +21,6 @@ int tcp_init_cgroup(struct mem_cgroup *memcg, struct cgroup_subsys *ss)
 	if (!cg_proto)
 		return 0;
 
-	cg_proto->sysctl_mem[0] = sysctl_tcp_mem[0];
-	cg_proto->sysctl_mem[1] = sysctl_tcp_mem[1];
-	cg_proto->sysctl_mem[2] = sysctl_tcp_mem[2];
 	cg_proto->memory_pressure = 0;
 	cg_proto->memcg = memcg;
 
@@ -54,7 +51,6 @@ EXPORT_SYMBOL(tcp_destroy_cgroup);
 static int tcp_update_limit(struct mem_cgroup *memcg, unsigned long nr_pages)
 {
 	struct cg_proto *cg_proto;
-	int i;
 	int ret;
 
 	cg_proto = tcp_prot.proto_cgroup(memcg);
@@ -65,10 +61,6 @@ static int tcp_update_limit(struct mem_cgroup *memcg, unsigned long nr_pages)
 	if (ret)
 		return ret;
 
-	for (i = 0; i < 3; i++)
-		cg_proto->sysctl_mem[i] = min_t(long, nr_pages,
-						sysctl_tcp_mem[i]);
-
 	if (nr_pages == PAGE_COUNTER_MAX)
 		clear_bit(MEMCG_SOCK_ACTIVE, &cg_proto->flags);
 	else {
-- 
2.6.2



* [PATCH 08/14] net: tcp_memcontrol: sanitize tcp memory accounting callbacks
  2015-11-12 23:41 [PATCH 00/14] mm: memcontrol: account socket memory in unified hierarchy Johannes Weiner
                   ` (6 preceding siblings ...)
  2015-11-12 23:41 ` [PATCH 07/14] net: tcp_memcontrol: simplify the per-memcg limit access Johannes Weiner
@ 2015-11-12 23:41 ` Johannes Weiner
  2015-11-13  4:53   ` Eric Dumazet
  2015-11-20 10:58   ` Vladimir Davydov
  2015-11-12 23:41 ` [PATCH 09/14] net: tcp_memcontrol: simplify linkage between socket and page counter Johannes Weiner
                   ` (5 subsequent siblings)
  13 siblings, 2 replies; 61+ messages in thread
From: Johannes Weiner @ 2015-11-12 23:41 UTC
  To: David Miller, Andrew Morton
  Cc: Vladimir Davydov, Tejun Heo, Michal Hocko, netdev, linux-mm,
	cgroups, linux-kernel, kernel-team

There won't be a tcp control soft limit, so integrating the memcg code
into the global skmem limiting scheme complicates things
unnecessarily. Replace this with simple and clear charge and uncharge
calls--hidden behind a jump label--to account skb memory.

Note that this is not purely aesthetic: as a result of shoehorning the
per-memcg code into the same memory accounting functions that handle
the global level, the old code would compare the per-memcg consumption
against the smaller of the per-memcg limit and the global limit. This
allowed the total consumption of multiple cgroups to exceed the global
limit, as long as the individual cgroups stayed within bounds. After
this change, the code will always compare the per-memcg consumption to
the per-memcg limit, and the global consumption to the global limit,
and thus close this loophole.
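
For illustration (made-up numbers): two cgroups limited to 1G each
under a global limit of 1.5G. The old code checked each cgroup's
consumption against min(1G, 1.5G) = 1G, so together they could pin 2G
of socket buffers, 0.5G past the global limit. With the checks
separated, charges start failing once global consumption crosses 1.5G,
regardless of how far below their own limits the cgroups are.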

Without a soft limit, the per-memcg memory pressure state in sockets
is generally questionable. However, we did it until now, so we
continue to enter it when the hard limit is hit, and packets are
dropped, to let other sockets in the cgroup know that they shouldn't
grow their transmit windows, either. However, keep it simple in the
new callback model and leave memory pressure lazily when the next
packet is accepted (as opposed to doing it synchronously when packets
are processed). When packets are dropped, network performance will
already be in the toilet, so that should be a reasonable trade-off.

As described above, consumption is now checked on the per-memcg level
and the global level separately. Likewise, memory pressure states are
maintained on both the per-memcg level and the global level, and a
socket is considered under pressure when either level asserts as much.

Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
---
 include/linux/memcontrol.h | 12 ++++-----
 include/net/sock.h         | 63 ++++++----------------------------------------
 include/net/tcp.h          |  5 ++--
 mm/memcontrol.c            | 32 +++++++++++++++++++++++
 net/core/sock.c            | 26 +++++++++++--------
 net/ipv4/tcp_output.c      |  7 ++++--
 6 files changed, 69 insertions(+), 76 deletions(-)

diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
index 96ca3d3..906dfff 100644
--- a/include/linux/memcontrol.h
+++ b/include/linux/memcontrol.h
@@ -676,12 +676,6 @@ void mem_cgroup_count_vm_event(struct mm_struct *mm, enum vm_event_item idx)
 }
 #endif /* CONFIG_MEMCG */
 
-enum {
-	UNDER_LIMIT,
-	SOFT_LIMIT,
-	OVER_LIMIT,
-};
-
 #ifdef CONFIG_CGROUP_WRITEBACK
 
 struct list_head *mem_cgroup_cgwb_list(struct mem_cgroup *memcg);
@@ -711,6 +705,12 @@ static inline void mem_cgroup_wb_stats(struct bdi_writeback *wb,
 struct sock;
 void sock_update_memcg(struct sock *sk);
 void sock_release_memcg(struct sock *sk);
+bool mem_cgroup_charge_skmem(struct cg_proto *proto, unsigned int nr_pages);
+void mem_cgroup_uncharge_skmem(struct cg_proto *proto, unsigned int nr_pages);
+static inline bool mem_cgroup_under_socket_pressure(struct cg_proto *proto)
+{
+	return proto->memory_pressure;
+}
 #endif /* CONFIG_INET && CONFIG_MEMCG_KMEM */
 
 #ifdef CONFIG_MEMCG_KMEM
diff --git a/include/net/sock.h b/include/net/sock.h
index 2eefc99..8cc7613 100644
--- a/include/net/sock.h
+++ b/include/net/sock.h
@@ -1126,8 +1126,9 @@ static inline bool sk_under_memory_pressure(const struct sock *sk)
 	if (!sk->sk_prot->memory_pressure)
 		return false;
 
-	if (mem_cgroup_sockets_enabled && sk->sk_cgrp)
-		return !!sk->sk_cgrp->memory_pressure;
+	if (mem_cgroup_sockets_enabled && sk->sk_cgrp &&
+	    mem_cgroup_under_socket_pressure(sk->sk_cgrp))
+		return true;
 
 	return !!*sk->sk_prot->memory_pressure;
 }
@@ -1141,9 +1141,6 @@ static inline void sk_leave_memory_pressure(struct sock *sk)
 
 	if (*memory_pressure)
 		*memory_pressure = 0;
-
-	if (mem_cgroup_sockets_enabled && sk->sk_cgrp)
-		sk->sk_cgrp->memory_pressure = 0;
 }
 
 static inline void sk_enter_memory_pressure(struct sock *sk)
@@ -1151,76 +1148,30 @@ static inline void sk_enter_memory_pressure(struct sock *sk)
 	if (!sk->sk_prot->enter_memory_pressure)
 		return;
 
-	if (mem_cgroup_sockets_enabled && sk->sk_cgrp)
-		sk->sk_cgrp->memory_pressure = 1;
-
 	sk->sk_prot->enter_memory_pressure(sk);
 }
 
 static inline long sk_prot_mem_limits(const struct sock *sk, int index)
 {
-	long limit = sk->sk_prot->sysctl_mem[index];
-
-	if (mem_cgroup_sockets_enabled && sk->sk_cgrp)
-		limit = min_t(long, limit, sk->sk_cgrp->memory_allocated.limit);
-
-	return limit;
-}
-
-static inline void memcg_memory_allocated_add(struct cg_proto *prot,
-					      unsigned long amt,
-					      int *parent_status)
-{
-	struct page_counter *counter;
-
-	if (page_counter_try_charge(&prot->memory_allocated, amt, &counter))
-		return;
-
-	page_counter_charge(&prot->memory_allocated, amt);
-	*parent_status = OVER_LIMIT;
-}
-
-static inline void memcg_memory_allocated_sub(struct cg_proto *prot,
-					      unsigned long amt)
-{
-	page_counter_uncharge(&prot->memory_allocated, amt);
+	return sk->sk_prot->sysctl_mem[index];
 }
 
 static inline long
 sk_memory_allocated(const struct sock *sk)
 {
-	struct proto *prot = sk->sk_prot;
-
-	if (mem_cgroup_sockets_enabled && sk->sk_cgrp)
-		return page_counter_read(&sk->sk_cgrp->memory_allocated);
-
-	return atomic_long_read(prot->memory_allocated);
+	return atomic_long_read(sk->sk_prot->memory_allocated);
 }
 
 static inline long
-sk_memory_allocated_add(struct sock *sk, int amt, int *parent_status)
+sk_memory_allocated_add(struct sock *sk, int amt)
 {
-	struct proto *prot = sk->sk_prot;
-
-	if (mem_cgroup_sockets_enabled && sk->sk_cgrp) {
-		memcg_memory_allocated_add(sk->sk_cgrp, amt, parent_status);
-		/* update the root cgroup regardless */
-		atomic_long_add_return(amt, prot->memory_allocated);
-		return page_counter_read(&sk->sk_cgrp->memory_allocated);
-	}
-
-	return atomic_long_add_return(amt, prot->memory_allocated);
+	return atomic_long_add_return(amt, sk->sk_prot->memory_allocated);
 }
 
 static inline void
 sk_memory_allocated_sub(struct sock *sk, int amt)
 {
-	struct proto *prot = sk->sk_prot;
-
-	if (mem_cgroup_sockets_enabled && sk->sk_cgrp)
-		memcg_memory_allocated_sub(sk->sk_cgrp, amt);
-
-	atomic_long_sub(amt, prot->memory_allocated);
+	atomic_long_sub(amt, sk->sk_prot->memory_allocated);
 }
 
 static inline void sk_sockets_allocated_dec(struct sock *sk)
diff --git a/include/net/tcp.h b/include/net/tcp.h
index f80e74c..04517d6 100644
--- a/include/net/tcp.h
+++ b/include/net/tcp.h
@@ -292,8 +292,9 @@ extern int tcp_memory_pressure;
 /* optimized version of sk_under_memory_pressure() for TCP sockets */
 static inline bool tcp_under_memory_pressure(const struct sock *sk)
 {
-	if (mem_cgroup_sockets_enabled && sk->sk_cgrp)
-		return !!sk->sk_cgrp->memory_pressure;
+	if (mem_cgroup_sockets_enabled && sk->sk_cgrp &&
+	    mem_cgroup_under_socket_pressure(sk->sk_cgrp))
+		return true;
 
 	return tcp_memory_pressure;
 }
diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index 57f4539..3462a52 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -338,6 +338,38 @@ struct cg_proto *tcp_proto_cgroup(struct mem_cgroup *memcg)
 }
 EXPORT_SYMBOL(tcp_proto_cgroup);
 
+/**
+ * mem_cgroup_charge_skmem - charge socket memory
+ * @proto: proto to charge
+ * @nr_pages: number of pages to charge
+ *
+ * Charges @nr_pages to @proto. Returns %true if the charge fit within
+ * @proto's configured limit, %false if the charge had to be forced.
+ */
+bool mem_cgroup_charge_skmem(struct cg_proto *proto, unsigned int nr_pages)
+{
+	struct page_counter *counter;
+
+	if (page_counter_try_charge(&proto->memory_allocated,
+				    nr_pages, &counter)) {
+		proto->memory_pressure = 0;
+		return true;
+	}
+	page_counter_charge(&proto->memory_allocated, nr_pages);
+	proto->memory_pressure = 1;
+	return false;
+}
+
+/**
+ * mem_cgroup_uncharge_skmem - uncharge socket memory
+ * @proto: proto to uncharge
+ * @nr_pages: number of pages to uncharge
+ */
+void mem_cgroup_uncharge_skmem(struct cg_proto *proto, unsigned int nr_pages)
+{
+	page_counter_uncharge(&proto->memory_allocated, nr_pages);
+}
+
 #endif
 
 #ifdef CONFIG_MEMCG_KMEM
diff --git a/net/core/sock.c b/net/core/sock.c
index 04e54bc..5b1b96f 100644
--- a/net/core/sock.c
+++ b/net/core/sock.c
@@ -2066,27 +2066,27 @@ int __sk_mem_schedule(struct sock *sk, int size, int kind)
 	struct proto *prot = sk->sk_prot;
 	int amt = sk_mem_pages(size);
 	long allocated;
-	int parent_status = UNDER_LIMIT;
 
 	sk->sk_forward_alloc += amt * SK_MEM_QUANTUM;
 
-	allocated = sk_memory_allocated_add(sk, amt, &parent_status);
+	allocated = sk_memory_allocated_add(sk, amt);
+
+	if (mem_cgroup_sockets_enabled && sk->sk_cgrp &&
+	    !mem_cgroup_charge_skmem(sk->sk_cgrp, amt))
+		goto suppress_allocation;
 
 	/* Under limit. */
-	if (parent_status == UNDER_LIMIT &&
-			allocated <= sk_prot_mem_limits(sk, 0)) {
+	if (allocated <= sk_prot_mem_limits(sk, 0)) {
 		sk_leave_memory_pressure(sk);
 		return 1;
 	}
 
-	/* Under pressure. (we or our parents) */
-	if ((parent_status > SOFT_LIMIT) ||
-			allocated > sk_prot_mem_limits(sk, 1))
+	/* Under pressure. */
+	if (allocated > sk_prot_mem_limits(sk, 1))
 		sk_enter_memory_pressure(sk);
 
-	/* Over hard limit (we or our parents) */
-	if ((parent_status == OVER_LIMIT) ||
-			(allocated > sk_prot_mem_limits(sk, 2)))
+	/* Over hard limit. */
+	if (allocated > sk_prot_mem_limits(sk, 2))
 		goto suppress_allocation;
 
 	/* guarantee minimum buffer size under pressure */
@@ -2135,6 +2135,9 @@ suppress_allocation:
 
 	sk_memory_allocated_sub(sk, amt);
 
+	if (mem_cgroup_sockets_enabled && sk->sk_cgrp)
+		mem_cgroup_uncharge_skmem(sk->sk_cgrp, amt);
+
 	return 0;
 }
 EXPORT_SYMBOL(__sk_mem_schedule);
@@ -2150,6 +2153,9 @@ void __sk_mem_reclaim(struct sock *sk, int amount)
 	sk_memory_allocated_sub(sk, amount);
 	sk->sk_forward_alloc -= amount << SK_MEM_QUANTUM_SHIFT;
 
+	if (mem_cgroup_sockets_enabled && sk->sk_cgrp)
+		mem_cgroup_uncharge_skmem(sk->sk_cgrp, amount);
+
 	if (sk_under_memory_pressure(sk) &&
 	    (sk_memory_allocated(sk) < sk_prot_mem_limits(sk, 0)))
 		sk_leave_memory_pressure(sk);
diff --git a/net/ipv4/tcp_output.c b/net/ipv4/tcp_output.c
index cb7ca56..7aa168a 100644
--- a/net/ipv4/tcp_output.c
+++ b/net/ipv4/tcp_output.c
@@ -2813,13 +2813,16 @@ begin_fwd:
  */
 void sk_forced_mem_schedule(struct sock *sk, int size)
 {
-	int amt, status;
+	int amt;
 
 	if (size <= sk->sk_forward_alloc)
 		return;
 	amt = sk_mem_pages(size);
 	sk->sk_forward_alloc += amt * SK_MEM_QUANTUM;
-	sk_memory_allocated_add(sk, amt, &status);
+	sk_memory_allocated_add(sk, amt);
+
+	if (mem_cgroup_sockets_enabled && sk->sk_cgrp)
+		mem_cgroup_charge_skmem(sk->sk_cgrp, amt);
 }
 
 /* Send a FIN. The caller locks the socket for us.
-- 
2.6.2



* [PATCH 09/14] net: tcp_memcontrol: simplify linkage between socket and page counter
  2015-11-12 23:41 [PATCH 00/14] mm: memcontrol: account socket memory in unified hierarchy Johannes Weiner
                   ` (7 preceding siblings ...)
  2015-11-12 23:41 ` [PATCH 08/14] net: tcp_memcontrol: sanitize tcp memory accounting callbacks Johannes Weiner
@ 2015-11-12 23:41 ` Johannes Weiner
  2015-11-20 12:42   ` Vladimir Davydov
  2015-11-12 23:41 ` [PATCH 10/14] mm: memcontrol: generalize the socket accounting jump label Johannes Weiner
                   ` (4 subsequent siblings)
  13 siblings, 1 reply; 61+ messages in thread
From: Johannes Weiner @ 2015-11-12 23:41 UTC
  To: David Miller, Andrew Morton
  Cc: Vladimir Davydov, Tejun Heo, Michal Hocko, netdev, linux-mm,
	cgroups, linux-kernel, kernel-team

There won't be any separate counters for socket memory consumed by
protocols other than TCP in the future. Remove the indirection and
link sockets directly to their owning memory cgroup.

Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
---
 include/linux/memcontrol.h   | 18 +++---------
 include/net/sock.h           | 37 ++++-------------------
 include/net/tcp.h            |  4 +--
 include/net/tcp_memcontrol.h |  1 -
 mm/memcontrol.c              | 57 ++++++++++++++----------------------
 net/core/sock.c              | 52 +++++---------------------------
 net/ipv4/tcp_ipv4.c          |  7 +----
 net/ipv4/tcp_memcontrol.c    | 70 ++++++++++++++++++--------------------------
 net/ipv4/tcp_output.c        |  4 +--
 net/ipv6/tcp_ipv6.c          |  3 --
 10 files changed, 71 insertions(+), 182 deletions(-)

diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
index 906dfff..1c71f27 100644
--- a/include/linux/memcontrol.h
+++ b/include/linux/memcontrol.h
@@ -99,16 +99,6 @@ struct cg_proto {
 	struct page_counter	memory_allocated;	/* Current allocated memory. */
 	int			memory_pressure;
 	unsigned long		flags;
-	/*
-	 * memcg field is used to find which memcg we belong directly
-	 * Each memcg struct can hold more than one cg_proto, so container_of
-	 * won't really cut.
-	 *
-	 * The elegant solution would be having an inverse function to
-	 * proto_cgroup in struct proto, but that means polluting the structure
-	 * for everybody, instead of just for memcg users.
-	 */
-	struct mem_cgroup	*memcg;
 };
 
 #ifdef CONFIG_MEMCG
@@ -705,11 +695,11 @@ static inline void mem_cgroup_wb_stats(struct bdi_writeback *wb,
 struct sock;
 void sock_update_memcg(struct sock *sk);
 void sock_release_memcg(struct sock *sk);
-bool mem_cgroup_charge_skmem(struct cg_proto *proto, unsigned int nr_pages);
-void mem_cgroup_uncharge_skmem(struct cg_proto *proto, unsigned int nr_pages);
-static inline bool mem_cgroup_under_socket_pressure(struct cg_proto *proto)
+bool mem_cgroup_charge_skmem(struct mem_cgroup *memcg, unsigned int nr_pages);
+void mem_cgroup_uncharge_skmem(struct mem_cgroup *memcg, unsigned int nr_pages);
+static inline bool mem_cgroup_under_socket_pressure(struct mem_cgroup *memcg)
 {
-	return proto->memory_pressure;
+	return memcg->tcp_mem.memory_pressure;
 }
 #endif /* CONFIG_INET && CONFIG_MEMCG_KMEM */
 
diff --git a/include/net/sock.h b/include/net/sock.h
index 8cc7613..b439dcc 100644
--- a/include/net/sock.h
+++ b/include/net/sock.h
@@ -69,22 +69,6 @@
 #include <net/tcp_states.h>
 #include <linux/net_tstamp.h>
 
-struct cgroup;
-struct cgroup_subsys;
-#ifdef CONFIG_NET
-int mem_cgroup_sockets_init(struct mem_cgroup *memcg, struct cgroup_subsys *ss);
-void mem_cgroup_sockets_destroy(struct mem_cgroup *memcg);
-#else
-static inline
-int mem_cgroup_sockets_init(struct mem_cgroup *memcg, struct cgroup_subsys *ss)
-{
-	return 0;
-}
-static inline
-void mem_cgroup_sockets_destroy(struct mem_cgroup *memcg)
-{
-}
-#endif
 /*
  * This structure really needs to be cleaned up.
  * Most of it is for TCP, and not used by any of
@@ -310,7 +294,7 @@ struct cg_proto;
   *	@sk_security: used by security modules
   *	@sk_mark: generic packet mark
   *	@sk_classid: this socket's cgroup classid
-  *	@sk_cgrp: this socket's cgroup-specific proto data
+  *	@sk_memcg: this socket's memory cgroup association
   *	@sk_write_pending: a write to stream socket waits to start
   *	@sk_state_change: callback to indicate change in the state of the sock
   *	@sk_data_ready: callback to indicate there is data to be processed
@@ -447,7 +431,7 @@ struct sock {
 #ifdef CONFIG_CGROUP_NET_CLASSID
 	u32			sk_classid;
 #endif
-	struct cg_proto		*sk_cgrp;
+	struct mem_cgroup	*sk_memcg;
 	void			(*sk_state_change)(struct sock *sk);
 	void			(*sk_data_ready)(struct sock *sk);
 	void			(*sk_write_space)(struct sock *sk);
@@ -1051,18 +1035,6 @@ struct proto {
 #ifdef SOCK_REFCNT_DEBUG
 	atomic_t		socks;
 #endif
-#ifdef CONFIG_MEMCG_KMEM
-	/*
-	 * cgroup specific init/deinit functions. Called once for all
-	 * protocols that implement it, from cgroups populate function.
-	 * This function has to setup any files the protocol want to
-	 * appear in the kmem cgroup filesystem.
-	 */
-	int			(*init_cgroup)(struct mem_cgroup *memcg,
-					       struct cgroup_subsys *ss);
-	void			(*destroy_cgroup)(struct mem_cgroup *memcg);
-	struct cg_proto		*(*proto_cgroup)(struct mem_cgroup *memcg);
-#endif
 };
 
 int proto_register(struct proto *prot, int alloc_slab);
@@ -1126,9 +1098,9 @@ static inline bool sk_under_memory_pressure(const struct sock *sk)
 	if (!sk->sk_prot->memory_pressure)
 		return false;
 
-	if (mem_cgroup_sockets_enabled && sk->sk_cgrp &&
-	    mem_cgroup_under_socket_pressure(sk->sk_cgrp))
+	if (mem_cgroup_sockets_enabled && sk->sk_memcg &&
+	    mem_cgroup_under_socket_pressure(sk->sk_memcg))
 		return true;
 
 	return !!*sk->sk_prot->memory_pressure;
 }
diff --git a/include/net/tcp.h b/include/net/tcp.h
index 04517d6..c008535 100644
--- a/include/net/tcp.h
+++ b/include/net/tcp.h
@@ -292,8 +292,8 @@ extern int tcp_memory_pressure;
 /* optimized version of sk_under_memory_pressure() for TCP sockets */
 static inline bool tcp_under_memory_pressure(const struct sock *sk)
 {
-	if (mem_cgroup_sockets_enabled && sk->sk_cgrp &&
-	    mem_cgroup_under_socket_pressure(sk->sk_cgrp))
+	if (mem_cgroup_sockets_enabled && sk->sk_memcg &&
+	    mem_cgroup_under_socket_pressure(sk->sk_memcg))
 		return true;
 
 	return tcp_memory_pressure;
diff --git a/include/net/tcp_memcontrol.h b/include/net/tcp_memcontrol.h
index 05b94d9..3a17b16 100644
--- a/include/net/tcp_memcontrol.h
+++ b/include/net/tcp_memcontrol.h
@@ -1,7 +1,6 @@
 #ifndef _TCP_MEMCG_H
 #define _TCP_MEMCG_H
 
-struct cg_proto *tcp_proto_cgroup(struct mem_cgroup *memcg);
 int tcp_init_cgroup(struct mem_cgroup *memcg, struct cgroup_subsys *ss);
 void tcp_destroy_cgroup(struct mem_cgroup *memcg);
 #endif /* _TCP_MEMCG_H */
diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index 3462a52..89b1d9e 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -294,9 +294,6 @@ static inline struct mem_cgroup *mem_cgroup_from_id(unsigned short id)
 void sock_update_memcg(struct sock *sk)
 {
 	struct mem_cgroup *memcg;
-	struct cg_proto *cg_proto;
-
-	BUG_ON(!sk->sk_prot->proto_cgroup);
 
 	/* Socket cloning can throw us here with sk_cgrp already
 	 * filled. It won't however, necessarily happen from
@@ -306,68 +303,58 @@ void sock_update_memcg(struct sock *sk)
 	 * Respecting the original socket's memcg is a better
 	 * decision in this case.
 	 */
-	if (sk->sk_cgrp) {
-		BUG_ON(mem_cgroup_is_root(sk->sk_cgrp->memcg));
-		css_get(&sk->sk_cgrp->memcg->css);
+	if (sk->sk_memcg) {
+		BUG_ON(mem_cgroup_is_root(sk->sk_memcg));
+		css_get(&sk->sk_memcg->css);
 		return;
 	}
 
 	rcu_read_lock();
 	memcg = mem_cgroup_from_task(current);
-	cg_proto = sk->sk_prot->proto_cgroup(memcg);
-	if (cg_proto && test_bit(MEMCG_SOCK_ACTIVE, &cg_proto->flags) &&
-	    css_tryget_online(&memcg->css)) {
-		sk->sk_cgrp = cg_proto;
-	}
+	if (memcg != root_mem_cgroup &&
+	    test_bit(MEMCG_SOCK_ACTIVE, &memcg->tcp_mem.flags) &&
+	    css_tryget_online(&memcg->css))
+		sk->sk_memcg = memcg;
 	rcu_read_unlock();
 }
 EXPORT_SYMBOL(sock_update_memcg);
 
 void sock_release_memcg(struct sock *sk)
 {
-	WARN_ON(!sk->sk_cgrp->memcg);
-	css_put(&sk->sk_cgrp->memcg->css);
-}
-
-struct cg_proto *tcp_proto_cgroup(struct mem_cgroup *memcg)
-{
-	if (!memcg || mem_cgroup_is_root(memcg))
-		return NULL;
-
-	return &memcg->tcp_mem;
+	WARN_ON(!sk->sk_memcg);
+	css_put(&sk->sk_memcg->css);
 }
-EXPORT_SYMBOL(tcp_proto_cgroup);
 
 /**
  * mem_cgroup_charge_skmem - charge socket memory
- * @proto: proto to charge
+ * @memcg: memcg to charge
  * @nr_pages: number of pages to charge
  *
- * Charges @nr_pages to @proto. Returns %true if the charge fit within
- * @proto's configured limit, %false if the charge had to be forced.
+ * Charges @nr_pages to @memcg. Returns %true if the charge fit within
+ * @memcg's configured limit, %false if the charge had to be forced.
  */
-bool mem_cgroup_charge_skmem(struct cg_proto *proto, unsigned int nr_pages)
+bool mem_cgroup_charge_skmem(struct mem_cgroup *memcg, unsigned int nr_pages)
 {
 	struct page_counter *counter;
 
-	if (page_counter_try_charge(&proto->memory_allocated,
+	if (page_counter_try_charge(&memcg->tcp_mem.memory_allocated,
 				    nr_pages, &counter)) {
-		proto->memory_pressure = 0;
+		memcg->tcp_mem.memory_pressure = 0;
 		return true;
 	}
-	page_counter_charge(&proto->memory_allocated, nr_pages);
-	proto->memory_pressure = 1;
+	page_counter_charge(&memcg->tcp_mem.memory_allocated, nr_pages);
+	memcg->tcp_mem.memory_pressure = 1;
 	return false;
 }
 
 /**
  * mem_cgroup_uncharge_skmem - uncharge socket memory
- * @proto: proto to uncharge
+ * @memcg: memcg to uncharge
  * @nr_pages: number of pages to uncharge
  */
-void mem_cgroup_uncharge_skmem(struct cg_proto *proto, unsigned int nr_pages)
+void mem_cgroup_uncharge_skmem(struct mem_cgroup *memcg, unsigned int nr_pages)
 {
-	page_counter_uncharge(&proto->memory_allocated, nr_pages);
+	page_counter_uncharge(&memcg->tcp_mem.memory_allocated, nr_pages);
 }
 
 #endif
@@ -3623,7 +3610,7 @@ static int memcg_init_kmem(struct mem_cgroup *memcg, struct cgroup_subsys *ss)
 	if (ret)
 		return ret;
 
-	return mem_cgroup_sockets_init(memcg, ss);
+	return tcp_init_cgroup(memcg, ss);
 }
 
 static void memcg_deactivate_kmem(struct mem_cgroup *memcg)
@@ -3679,7 +3666,7 @@ static void memcg_destroy_kmem(struct mem_cgroup *memcg)
 		static_key_slow_dec(&memcg_kmem_enabled_key);
 		WARN_ON(page_counter_read(&memcg->kmem));
 	}
-	mem_cgroup_sockets_destroy(memcg);
+	tcp_destroy_cgroup(memcg);
 }
 #else
 static int memcg_init_kmem(struct mem_cgroup *memcg, struct cgroup_subsys *ss)
diff --git a/net/core/sock.c b/net/core/sock.c
index 5b1b96f..6486b0d 100644
--- a/net/core/sock.c
+++ b/net/core/sock.c
@@ -194,44 +194,6 @@ bool sk_net_capable(const struct sock *sk, int cap)
 }
 EXPORT_SYMBOL(sk_net_capable);
 
-
-#ifdef CONFIG_MEMCG_KMEM
-int mem_cgroup_sockets_init(struct mem_cgroup *memcg, struct cgroup_subsys *ss)
-{
-	struct proto *proto;
-	int ret = 0;
-
-	mutex_lock(&proto_list_mutex);
-	list_for_each_entry(proto, &proto_list, node) {
-		if (proto->init_cgroup) {
-			ret = proto->init_cgroup(memcg, ss);
-			if (ret)
-				goto out;
-		}
-	}
-
-	mutex_unlock(&proto_list_mutex);
-	return ret;
-out:
-	list_for_each_entry_continue_reverse(proto, &proto_list, node)
-		if (proto->destroy_cgroup)
-			proto->destroy_cgroup(memcg);
-	mutex_unlock(&proto_list_mutex);
-	return ret;
-}
-
-void mem_cgroup_sockets_destroy(struct mem_cgroup *memcg)
-{
-	struct proto *proto;
-
-	mutex_lock(&proto_list_mutex);
-	list_for_each_entry_reverse(proto, &proto_list, node)
-		if (proto->destroy_cgroup)
-			proto->destroy_cgroup(memcg);
-	mutex_unlock(&proto_list_mutex);
-}
-#endif
-
 /*
  * Each address family might have different locking rules, so we have
  * one slock key per address family:
@@ -1583,7 +1545,7 @@ struct sock *sk_clone_lock(const struct sock *sk, const gfp_t priority)
 		sk_set_socket(newsk, NULL);
 		newsk->sk_wq = NULL;
 
-		if (mem_cgroup_sockets_enabled && sk->sk_cgrp)
+		if (mem_cgroup_sockets_enabled && sk->sk_memcg)
 			sock_update_memcg(newsk);
 
 		if (newsk->sk_prot->sockets_allocated)
@@ -2071,8 +2033,8 @@ int __sk_mem_schedule(struct sock *sk, int size, int kind)
 
 	allocated = sk_memory_allocated_add(sk, amt);
 
-	if (mem_cgroup_sockets_enabled && sk->sk_cgrp &&
-	    !mem_cgroup_charge_skmem(sk->sk_cgrp, amt))
+	if (mem_cgroup_sockets_enabled && sk->sk_memcg &&
+	    !mem_cgroup_charge_skmem(sk->sk_memcg, amt))
 		goto suppress_allocation;
 
 	/* Under limit. */
@@ -2135,8 +2097,8 @@ suppress_allocation:
 
 	sk_memory_allocated_sub(sk, amt);
 
-	if (mem_cgroup_sockets_enabled && sk->sk_cgrp)
-		mem_cgroup_uncharge_skmem(sk->sk_cgrp, amt);
+	if (mem_cgroup_sockets_enabled && sk->sk_memcg)
+		mem_cgroup_uncharge_skmem(sk->sk_memcg, amt);
 
 	return 0;
 }
@@ -2153,8 +2115,8 @@ void __sk_mem_reclaim(struct sock *sk, int amount)
 	sk_memory_allocated_sub(sk, amount);
 	sk->sk_forward_alloc -= amount << SK_MEM_QUANTUM_SHIFT;
 
-	if (mem_cgroup_sockets_enabled && sk->sk_cgrp)
-		mem_cgroup_uncharge_skmem(sk->sk_cgrp, amount);
+	if (mem_cgroup_sockets_enabled && sk->sk_memcg)
+		mem_cgroup_uncharge_skmem(sk->sk_memcg, amount);
 
 	if (sk_under_memory_pressure(sk) &&
 	    (sk_memory_allocated(sk) < sk_prot_mem_limits(sk, 0)))
diff --git a/net/ipv4/tcp_ipv4.c b/net/ipv4/tcp_ipv4.c
index 317a246..045d3af 100644
--- a/net/ipv4/tcp_ipv4.c
+++ b/net/ipv4/tcp_ipv4.c
@@ -1813,7 +1813,7 @@ void tcp_v4_destroy_sock(struct sock *sk)
 
 	sk_sockets_allocated_dec(sk);
 
-	if (mem_cgroup_sockets_enabled && sk->sk_cgrp)
+	if (mem_cgroup_sockets_enabled && sk->sk_memcg)
 		sock_release_memcg(sk);
 }
 EXPORT_SYMBOL(tcp_v4_destroy_sock);
@@ -2336,11 +2336,6 @@ struct proto tcp_prot = {
 	.compat_setsockopt	= compat_tcp_setsockopt,
 	.compat_getsockopt	= compat_tcp_getsockopt,
 #endif
-#ifdef CONFIG_MEMCG_KMEM
-	.init_cgroup		= tcp_init_cgroup,
-	.destroy_cgroup		= tcp_destroy_cgroup,
-	.proto_cgroup		= tcp_proto_cgroup,
-#endif
 };
 EXPORT_SYMBOL(tcp_prot);
 
diff --git a/net/ipv4/tcp_memcontrol.c b/net/ipv4/tcp_memcontrol.c
index c383e68..47addc3 100644
--- a/net/ipv4/tcp_memcontrol.c
+++ b/net/ipv4/tcp_memcontrol.c
@@ -8,61 +8,48 @@
 
 int tcp_init_cgroup(struct mem_cgroup *memcg, struct cgroup_subsys *ss)
 {
+	struct mem_cgroup *parent = parent_mem_cgroup(memcg);
+	struct page_counter *counter_parent = NULL;
 	/*
 	 * The root cgroup does not use page_counters, but rather,
 	 * rely on the data already collected by the network
 	 * subsystem
 	 */
-	struct mem_cgroup *parent = parent_mem_cgroup(memcg);
-	struct page_counter *counter_parent = NULL;
-	struct cg_proto *cg_proto, *parent_cg;
-
-	cg_proto = tcp_prot.proto_cgroup(memcg);
-	if (!cg_proto)
+	if (memcg == root_mem_cgroup)
 		return 0;
 
-	cg_proto->memory_pressure = 0;
-	cg_proto->memcg = memcg;
+	memcg->tcp_mem.memory_pressure = 0;
 
-	parent_cg = tcp_prot.proto_cgroup(parent);
-	if (parent_cg)
-		counter_parent = &parent_cg->memory_allocated;
+	if (parent)
+		counter_parent = &parent->tcp_mem.memory_allocated;
 
-	page_counter_init(&cg_proto->memory_allocated, counter_parent);
+	page_counter_init(&memcg->tcp_mem.memory_allocated, counter_parent);
 
 	return 0;
 }
-EXPORT_SYMBOL(tcp_init_cgroup);
 
 void tcp_destroy_cgroup(struct mem_cgroup *memcg)
 {
-	struct cg_proto *cg_proto;
-
-	cg_proto = tcp_prot.proto_cgroup(memcg);
-	if (!cg_proto)
+	if (memcg == root_mem_cgroup)
 		return;
 
-	if (test_bit(MEMCG_SOCK_ACTIVATED, &cg_proto->flags))
+	if (test_bit(MEMCG_SOCK_ACTIVATED, &memcg->tcp_mem.flags))
 		static_key_slow_dec(&memcg_socket_limit_enabled);
-
 }
-EXPORT_SYMBOL(tcp_destroy_cgroup);
 
 static int tcp_update_limit(struct mem_cgroup *memcg, unsigned long nr_pages)
 {
-	struct cg_proto *cg_proto;
 	int ret;
 
-	cg_proto = tcp_prot.proto_cgroup(memcg);
-	if (!cg_proto)
+	if (memcg == root_mem_cgroup)
 		return -EINVAL;
 
-	ret = page_counter_limit(&cg_proto->memory_allocated, nr_pages);
+	ret = page_counter_limit(&memcg->tcp_mem.memory_allocated, nr_pages);
 	if (ret)
 		return ret;
 
 	if (nr_pages == PAGE_COUNTER_MAX)
-		clear_bit(MEMCG_SOCK_ACTIVE, &cg_proto->flags);
+		clear_bit(MEMCG_SOCK_ACTIVE, &memcg->tcp_mem.flags);
 	else {
 		/*
 		 * The active bit needs to be written after the static_key
@@ -84,9 +71,10 @@ static int tcp_update_limit(struct mem_cgroup *memcg, unsigned long nr_pages)
 		 * will do the update in the same memcg. Without that, we can't
 		 * properly shutdown the static key.
 		 */
-		if (!test_and_set_bit(MEMCG_SOCK_ACTIVATED, &cg_proto->flags))
+		if (!test_and_set_bit(MEMCG_SOCK_ACTIVATED,
+				      &memcg->tcp_mem.flags))
 			static_key_slow_inc(&memcg_socket_limit_enabled);
-		set_bit(MEMCG_SOCK_ACTIVE, &cg_proto->flags);
+		set_bit(MEMCG_SOCK_ACTIVE, &memcg->tcp_mem.flags);
 	}
 
 	return 0;
@@ -130,32 +118,32 @@ static ssize_t tcp_cgroup_write(struct kernfs_open_file *of,
 static u64 tcp_cgroup_read(struct cgroup_subsys_state *css, struct cftype *cft)
 {
 	struct mem_cgroup *memcg = mem_cgroup_from_css(css);
-	struct cg_proto *cg_proto = tcp_prot.proto_cgroup(memcg);
 	u64 val;
 
 	switch (cft->private) {
 	case RES_LIMIT:
-		if (!cg_proto)
-			return PAGE_COUNTER_MAX;
-		val = cg_proto->memory_allocated.limit;
+		if (memcg == root_mem_cgroup)
+			val = PAGE_COUNTER_MAX;
+		else
+			val = memcg->tcp_mem.memory_allocated.limit;
 		val *= PAGE_SIZE;
 		break;
 	case RES_USAGE:
-		if (!cg_proto)
+		if (memcg == root_mem_cgroup)
 			val = atomic_long_read(&tcp_memory_allocated);
 		else
-			val = page_counter_read(&cg_proto->memory_allocated);
+			val = page_counter_read(&memcg->tcp_mem.memory_allocated);
 		val *= PAGE_SIZE;
 		break;
 	case RES_FAILCNT:
-		if (!cg_proto)
+		if (memcg == root_mem_cgroup)
 			return 0;
-		val = cg_proto->memory_allocated.failcnt;
+		val = memcg->tcp_mem.memory_allocated.failcnt;
 		break;
 	case RES_MAX_USAGE:
-		if (!cg_proto)
+		if (memcg == root_mem_cgroup)
 			return 0;
-		val = cg_proto->memory_allocated.watermark;
+		val = memcg->tcp_mem.memory_allocated.watermark;
 		val *= PAGE_SIZE;
 		break;
 	default:
@@ -168,19 +156,17 @@ static ssize_t tcp_cgroup_reset(struct kernfs_open_file *of,
 				char *buf, size_t nbytes, loff_t off)
 {
 	struct mem_cgroup *memcg;
-	struct cg_proto *cg_proto;
 
 	memcg = mem_cgroup_from_css(of_css(of));
-	cg_proto = tcp_prot.proto_cgroup(memcg);
-	if (!cg_proto)
+	if (memcg == root_mem_cgroup)
 		return nbytes;
 
 	switch (of_cft(of)->private) {
 	case RES_MAX_USAGE:
-		page_counter_reset_watermark(&cg_proto->memory_allocated);
+		page_counter_reset_watermark(&memcg->tcp_mem.memory_allocated);
 		break;
 	case RES_FAILCNT:
-		cg_proto->memory_allocated.failcnt = 0;
+		memcg->tcp_mem.memory_allocated.failcnt = 0;
 		break;
 	}
 
diff --git a/net/ipv4/tcp_output.c b/net/ipv4/tcp_output.c
index 7aa168a..7b83a65 100644
--- a/net/ipv4/tcp_output.c
+++ b/net/ipv4/tcp_output.c
@@ -2821,8 +2821,8 @@ void sk_forced_mem_schedule(struct sock *sk, int size)
 	sk->sk_forward_alloc += amt * SK_MEM_QUANTUM;
 	sk_memory_allocated_add(sk, amt);
 
-	if (mem_cgroup_sockets_enabled && sk->sk_cgrp)
-		mem_cgroup_charge_skmem(sk->sk_cgrp, amt);
+	if (mem_cgroup_sockets_enabled && sk->sk_memcg)
+		mem_cgroup_charge_skmem(sk->sk_memcg, amt);
 }
 
 /* Send a FIN. The caller locks the socket for us.
diff --git a/net/ipv6/tcp_ipv6.c b/net/ipv6/tcp_ipv6.c
index 5baa8e7..6340537 100644
--- a/net/ipv6/tcp_ipv6.c
+++ b/net/ipv6/tcp_ipv6.c
@@ -1869,9 +1869,6 @@ struct proto tcpv6_prot = {
 	.compat_setsockopt	= compat_tcp_setsockopt,
 	.compat_getsockopt	= compat_tcp_getsockopt,
 #endif
-#ifdef CONFIG_MEMCG_KMEM
-	.proto_cgroup		= tcp_proto_cgroup,
-#endif
 	.clear_sk		= tcp_v6_clear_sk,
 };
 
-- 
2.6.2


^ permalink raw reply related	[flat|nested] 61+ messages in thread

* [PATCH 10/14] mm: memcontrol: generalize the socket accounting jump label
  2015-11-12 23:41 [PATCH 00/14] mm: memcontrol: account socket memory in unified hierarchy Johannes Weiner
                   ` (8 preceding siblings ...)
  2015-11-12 23:41 ` [PATCH 09/14] net: tcp_memcontrol: simplify linkage between socket and page counter Johannes Weiner
@ 2015-11-12 23:41 ` Johannes Weiner
  2015-11-13 10:43   ` Michal Hocko
  2015-11-14 13:29   ` Vladimir Davydov
  2015-11-12 23:41 ` [PATCH 11/14] mm: memcontrol: do not account memory+swap on unified hierarchy Johannes Weiner
                   ` (3 subsequent siblings)
  13 siblings, 2 replies; 61+ messages in thread
From: Johannes Weiner @ 2015-11-12 23:41 UTC (permalink / raw)
  To: David Miller, Andrew Morton
  Cc: Vladimir Davydov, Tejun Heo, Michal Hocko, netdev, linux-mm,
	cgroups, linux-kernel, kernel-team

The unified hierarchy memory controller is going to use this jump
label as well to control the networking callbacks. Move it to the
memory controller code and give it a more generic name.
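
For illustration, the jump-label pattern in a minimal sketch (the
names match this series, but the snippet itself is not part of the
patch):

        struct static_key memcg_sockets_enabled_key;  /* initially false */

        /* fast path: compiles down to a patched nop/jmp, ~free when off */
        if (static_key_false(&memcg_sockets_enabled_key))
                mem_cgroup_charge_skmem(sk->sk_memcg, nr_pages);

        /* slow path: flipped when the first/last user comes and goes */
        static_key_slow_inc(&memcg_sockets_enabled_key);
        static_key_slow_dec(&memcg_sockets_enabled_key);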

Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
---
 include/linux/memcontrol.h | 4 ++++
 include/net/sock.h         | 7 -------
 mm/memcontrol.c            | 3 +++
 net/core/sock.c            | 5 -----
 net/ipv4/tcp_memcontrol.c  | 4 ++--
 5 files changed, 9 insertions(+), 14 deletions(-)

diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
index 1c71f27..4cf5afa 100644
--- a/include/linux/memcontrol.h
+++ b/include/linux/memcontrol.h
@@ -693,6 +693,8 @@ static inline void mem_cgroup_wb_stats(struct bdi_writeback *wb,
 
 #if defined(CONFIG_INET) && defined(CONFIG_MEMCG_KMEM)
 struct sock;
+extern struct static_key memcg_sockets_enabled_key;
+#define mem_cgroup_sockets_enabled static_key_false(&memcg_sockets_enabled_key)
 void sock_update_memcg(struct sock *sk);
 void sock_release_memcg(struct sock *sk);
 bool mem_cgroup_charge_skmem(struct mem_cgroup *memcg, unsigned int nr_pages);
@@ -701,6 +703,8 @@ static inline bool mem_cgroup_under_socket_pressure(struct mem_cgroup *memcg)
 {
 	return memcg->tcp_mem.memory_pressure;
 }
+#else
+#define mem_cgroup_sockets_enabled 0
 #endif /* CONFIG_INET && CONFIG_MEMCG_KMEM */
 
 #ifdef CONFIG_MEMCG_KMEM
diff --git a/include/net/sock.h b/include/net/sock.h
index b439dcc..bf1b901 100644
--- a/include/net/sock.h
+++ b/include/net/sock.h
@@ -1065,13 +1065,6 @@ static inline void sk_refcnt_debug_release(const struct sock *sk)
 #define sk_refcnt_debug_release(sk) do { } while (0)
 #endif /* SOCK_REFCNT_DEBUG */
 
-#if defined(CONFIG_MEMCG_KMEM) && defined(CONFIG_NET)
-extern struct static_key memcg_socket_limit_enabled;
-#define mem_cgroup_sockets_enabled static_key_false(&memcg_socket_limit_enabled)
-#else
-#define mem_cgroup_sockets_enabled 0
-#endif
-
 static inline bool sk_stream_memory_free(const struct sock *sk)
 {
 	if (sk->sk_wmem_queued >= sk->sk_sndbuf)
diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index 89b1d9e..658bef2 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -291,6 +291,9 @@ static inline struct mem_cgroup *mem_cgroup_from_id(unsigned short id)
 /* Writing them here to avoid exposing memcg's inner layout */
 #if defined(CONFIG_INET) && defined(CONFIG_MEMCG_KMEM)
 
+struct static_key memcg_sockets_enabled_key;
+EXPORT_SYMBOL(memcg_sockets_enabled_key);
+
 void sock_update_memcg(struct sock *sk)
 {
 	struct mem_cgroup *memcg;
diff --git a/net/core/sock.c b/net/core/sock.c
index 6486b0d..c5435b5 100644
--- a/net/core/sock.c
+++ b/net/core/sock.c
@@ -201,11 +201,6 @@ EXPORT_SYMBOL(sk_net_capable);
 static struct lock_class_key af_family_keys[AF_MAX];
 static struct lock_class_key af_family_slock_keys[AF_MAX];
 
-#if defined(CONFIG_MEMCG_KMEM)
-struct static_key memcg_socket_limit_enabled;
-EXPORT_SYMBOL(memcg_socket_limit_enabled);
-#endif
-
 /*
  * Make lock validator output more readable. (we pre-construct these
  * strings build-time, so that runtime initialization of socket
diff --git a/net/ipv4/tcp_memcontrol.c b/net/ipv4/tcp_memcontrol.c
index 47addc3..17df9dd 100644
--- a/net/ipv4/tcp_memcontrol.c
+++ b/net/ipv4/tcp_memcontrol.c
@@ -34,7 +34,7 @@ void tcp_destroy_cgroup(struct mem_cgroup *memcg)
 		return;
 
 	if (test_bit(MEMCG_SOCK_ACTIVATED, &memcg->tcp_mem.flags))
-		static_key_slow_dec(&memcg_socket_limit_enabled);
+		static_key_slow_dec(&memcg_sockets_enabled_key);
 }
 
 static int tcp_update_limit(struct mem_cgroup *memcg, unsigned long nr_pages)
@@ -73,7 +73,7 @@ static int tcp_update_limit(struct mem_cgroup *memcg, unsigned long nr_pages)
 		 */
 		if (!test_and_set_bit(MEMCG_SOCK_ACTIVATED,
 				      &memcg->tcp_mem.flags))
-			static_key_slow_inc(&memcg_socket_limit_enabled);
+			static_key_slow_inc(&memcg_sockets_enabled_key);
 		set_bit(MEMCG_SOCK_ACTIVE, &memcg->tcp_mem.flags);
 	}
 
-- 
2.6.2


^ permalink raw reply related	[flat|nested] 61+ messages in thread

* [PATCH 11/14] mm: memcontrol: do not account memory+swap on unified hierarchy
  2015-11-12 23:41 [PATCH 00/14] mm: memcontrol: account socket memory in unified hierarchy Johannes Weiner
                   ` (9 preceding siblings ...)
  2015-11-12 23:41 ` [PATCH 10/14] mm: memcontrol: generalize the socket accounting jump label Johannes Weiner
@ 2015-11-12 23:41 ` Johannes Weiner
  2015-11-13 10:37   ` Michal Hocko
  2015-11-14 13:23   ` Vladimir Davydov
  2015-11-12 23:41 ` [PATCH 12/14] mm: memcontrol: move socket code for unified hierarchy accounting Johannes Weiner
                   ` (2 subsequent siblings)
  13 siblings, 2 replies; 61+ messages in thread
From: Johannes Weiner @ 2015-11-12 23:41 UTC (permalink / raw)
  To: David Miller, Andrew Morton
  Cc: Vladimir Davydov, Tejun Heo, Michal Hocko, netdev, linux-mm,
	cgroups, linux-kernel, kernel-team

The unified hierarchy memory controller doesn't expose the memory+swap
counter to userspace, but its accounting is hardcoded in all charge
paths right now, including the per-cpu charge cache ("the stock").

To avoid adding yet more pointless memory+swap accounting with the
socket memory support in unified hierarchy, disable the counter
altogether when in unified hierarchy mode.
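
The gate itself is a one-liner, taken from the patch below:

        static bool do_memsw_account(void)
        {
                return !cgroup_subsys_on_dfl(memory_cgrp_subsys) && do_swap_account;
        }

so on the unified hierarchy every memsw charge and uncharge becomes a
no-op, regardless of the swapaccount= boot parameter.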

Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
---
 mm/memcontrol.c | 44 +++++++++++++++++++++++++-------------------
 1 file changed, 25 insertions(+), 19 deletions(-)

diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index 658bef2..e7f1a79 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -87,6 +87,12 @@ int do_swap_account __read_mostly;
 #define do_swap_account		0
 #endif
 
+/* Whether legacy memory+swap accounting is active */
+static bool do_memsw_account(void)
+{
+	return !cgroup_subsys_on_dfl(memory_cgrp_subsys) && do_swap_account;
+}
+
 static const char * const mem_cgroup_stat_names[] = {
 	"cache",
 	"rss",
@@ -1177,7 +1183,7 @@ static unsigned long mem_cgroup_margin(struct mem_cgroup *memcg)
 	if (count < limit)
 		margin = limit - count;
 
-	if (do_swap_account) {
+	if (do_memsw_account()) {
 		count = page_counter_read(&memcg->memsw);
 		limit = READ_ONCE(memcg->memsw.limit);
 		if (count <= limit)
@@ -1280,7 +1286,7 @@ void mem_cgroup_print_oom_info(struct mem_cgroup *memcg, struct task_struct *p)
 		pr_cont(":");
 
 		for (i = 0; i < MEM_CGROUP_STAT_NSTATS; i++) {
-			if (i == MEM_CGROUP_STAT_SWAP && !do_swap_account)
+			if (i == MEM_CGROUP_STAT_SWAP && !do_memsw_account())
 				continue;
 			pr_cont(" %s:%luKB", mem_cgroup_stat_names[i],
 				K(mem_cgroup_read_stat(iter, i)));
@@ -1903,7 +1909,7 @@ static void drain_stock(struct memcg_stock_pcp *stock)
 
 	if (stock->nr_pages) {
 		page_counter_uncharge(&old->memory, stock->nr_pages);
-		if (do_swap_account)
+		if (do_memsw_account())
 			page_counter_uncharge(&old->memsw, stock->nr_pages);
 		css_put_many(&old->css, stock->nr_pages);
 		stock->nr_pages = 0;
@@ -2033,11 +2039,11 @@ retry:
 	if (consume_stock(memcg, nr_pages))
 		return 0;
 
-	if (!do_swap_account ||
+	if (!do_memsw_account() ||
 	    page_counter_try_charge(&memcg->memsw, batch, &counter)) {
 		if (page_counter_try_charge(&memcg->memory, batch, &counter))
 			goto done_restock;
-		if (do_swap_account)
+		if (do_memsw_account())
 			page_counter_uncharge(&memcg->memsw, batch);
 		mem_over_limit = mem_cgroup_from_counter(counter, memory);
 	} else {
@@ -2124,7 +2130,7 @@ force:
 	 * temporarily by force charging it.
 	 */
 	page_counter_charge(&memcg->memory, nr_pages);
-	if (do_swap_account)
+	if (do_memsw_account())
 		page_counter_charge(&memcg->memsw, nr_pages);
 	css_get_many(&memcg->css, nr_pages);
 
@@ -2161,7 +2167,7 @@ static void cancel_charge(struct mem_cgroup *memcg, unsigned int nr_pages)
 		return;
 
 	page_counter_uncharge(&memcg->memory, nr_pages);
-	if (do_swap_account)
+	if (do_memsw_account())
 		page_counter_uncharge(&memcg->memsw, nr_pages);
 
 	css_put_many(&memcg->css, nr_pages);
@@ -2441,7 +2447,7 @@ void __memcg_kmem_uncharge(struct page *page, int order)
 
 	page_counter_uncharge(&memcg->kmem, nr_pages);
 	page_counter_uncharge(&memcg->memory, nr_pages);
-	if (do_swap_account)
+	if (do_memsw_account())
 		page_counter_uncharge(&memcg->memsw, nr_pages);
 
 	page->mem_cgroup = NULL;
@@ -3154,7 +3160,7 @@ static int memcg_stat_show(struct seq_file *m, void *v)
 	BUILD_BUG_ON(ARRAY_SIZE(mem_cgroup_lru_names) != NR_LRU_LISTS);
 
 	for (i = 0; i < MEM_CGROUP_STAT_NSTATS; i++) {
-		if (i == MEM_CGROUP_STAT_SWAP && !do_swap_account)
+		if (i == MEM_CGROUP_STAT_SWAP && !do_memsw_account())
 			continue;
 		seq_printf(m, "%s %lu\n", mem_cgroup_stat_names[i],
 			   mem_cgroup_read_stat(memcg, i) * PAGE_SIZE);
@@ -3176,14 +3182,14 @@ static int memcg_stat_show(struct seq_file *m, void *v)
 	}
 	seq_printf(m, "hierarchical_memory_limit %llu\n",
 		   (u64)memory * PAGE_SIZE);
-	if (do_swap_account)
+	if (do_memsw_account())
 		seq_printf(m, "hierarchical_memsw_limit %llu\n",
 			   (u64)memsw * PAGE_SIZE);
 
 	for (i = 0; i < MEM_CGROUP_STAT_NSTATS; i++) {
 		unsigned long long val = 0;
 
-		if (i == MEM_CGROUP_STAT_SWAP && !do_swap_account)
+		if (i == MEM_CGROUP_STAT_SWAP && !do_memsw_account())
 			continue;
 		for_each_mem_cgroup_tree(mi, memcg)
 			val += mem_cgroup_read_stat(mi, i) * PAGE_SIZE;
@@ -3314,7 +3320,7 @@ static void mem_cgroup_threshold(struct mem_cgroup *memcg)
 {
 	while (memcg) {
 		__mem_cgroup_threshold(memcg, false);
-		if (do_swap_account)
+		if (do_memsw_account())
 			__mem_cgroup_threshold(memcg, true);
 
 		memcg = parent_mem_cgroup(memcg);
@@ -4460,7 +4466,7 @@ static struct page *mc_handle_swap_pte(struct vm_area_struct *vma,
 	 * we call find_get_page() with swapper_space directly.
 	 */
 	page = find_get_page(swap_address_space(ent), ent.val);
-	if (do_swap_account)
+	if (do_memsw_account())
 		entry->val = ent.val;
 
 	return page;
@@ -4495,7 +4501,7 @@ static struct page *mc_handle_file_pte(struct vm_area_struct *vma,
 		page = find_get_entry(mapping, pgoff);
 		if (radix_tree_exceptional_entry(page)) {
 			swp_entry_t swp = radix_to_swp_entry(page);
-			if (do_swap_account)
+			if (do_memsw_account())
 				*entry = swp;
 			page = find_get_page(swap_address_space(swp), swp.val);
 		}
@@ -5270,7 +5276,7 @@ int mem_cgroup_try_charge(struct page *page, struct mm_struct *mm,
 		if (page->mem_cgroup)
 			goto out;
 
-		if (do_swap_account) {
+		if (do_memsw_account()) {
 			swp_entry_t ent = { .val = page_private(page), };
 			unsigned short id = lookup_swap_cgroup_id(ent);
 
@@ -5334,7 +5340,7 @@ void mem_cgroup_commit_charge(struct page *page, struct mem_cgroup *memcg,
 	memcg_check_events(memcg, page);
 	local_irq_enable();
 
-	if (do_swap_account && PageSwapCache(page)) {
+	if (do_memsw_account() && PageSwapCache(page)) {
 		swp_entry_t entry = { .val = page_private(page) };
 		/*
 		 * The swap entry might not get freed for a long time,
@@ -5379,7 +5385,7 @@ static void uncharge_batch(struct mem_cgroup *memcg, unsigned long pgpgout,
 
 	if (!mem_cgroup_is_root(memcg)) {
 		page_counter_uncharge(&memcg->memory, nr_pages);
-		if (do_swap_account)
+		if (do_memsw_account())
 			page_counter_uncharge(&memcg->memsw, nr_pages);
 		memcg_oom_recover(memcg);
 	}
@@ -5587,7 +5593,7 @@ void mem_cgroup_swapout(struct page *page, swp_entry_t entry)
 	VM_BUG_ON_PAGE(PageLRU(page), page);
 	VM_BUG_ON_PAGE(page_count(page), page);
 
-	if (!do_swap_account)
+	if (!do_memsw_account())
 		return;
 
 	memcg = page->mem_cgroup;
@@ -5627,7 +5633,7 @@ void mem_cgroup_uncharge_swap(swp_entry_t entry)
 	struct mem_cgroup *memcg;
 	unsigned short id;
 
-	if (!do_swap_account)
+	if (!do_memsw_account())
 		return;
 
 	id = swap_cgroup_record(entry, 0);
-- 
2.6.2


^ permalink raw reply related	[flat|nested] 61+ messages in thread

* [PATCH 12/14] mm: memcontrol: move socket code for unified hierarchy accounting
  2015-11-12 23:41 [PATCH 00/14] mm: memcontrol: account socket memory in unified hierarchy Johannes Weiner
                   ` (10 preceding siblings ...)
  2015-11-12 23:41 ` [PATCH 11/14] mm: memcontrol: do not account memory+swap on unified hierarchy Johannes Weiner
@ 2015-11-12 23:41 ` Johannes Weiner
  2015-11-20 12:44   ` Vladimir Davydov
  2015-11-12 23:41 ` [PATCH 13/14] mm: memcontrol: account socket memory in unified hierarchy memory controller Johannes Weiner
  2015-11-12 23:41 ` [PATCH 14/14] mm: memcontrol: hook up vmpressure to socket pressure Johannes Weiner
  13 siblings, 1 reply; 61+ messages in thread
From: Johannes Weiner @ 2015-11-12 23:41 UTC (permalink / raw)
  To: David Miller, Andrew Morton
  Cc: Vladimir Davydov, Tejun Heo, Michal Hocko, netdev, linux-mm,
	cgroups, linux-kernel, kernel-team

The unified hierarchy memory controller will account socket
memory. Move the infrastructure functions accordingly.

Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
Acked-by: Michal Hocko <mhocko@suse.com>
---
 mm/memcontrol.c | 148 ++++++++++++++++++++++++++++----------------------------
 1 file changed, 74 insertions(+), 74 deletions(-)

diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index e7f1a79..408fb04 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -294,80 +294,6 @@ static inline struct mem_cgroup *mem_cgroup_from_id(unsigned short id)
 	return mem_cgroup_from_css(css);
 }
 
-/* Writing them here to avoid exposing memcg's inner layout */
-#if defined(CONFIG_INET) && defined(CONFIG_MEMCG_KMEM)
-
-struct static_key memcg_sockets_enabled_key;
-EXPORT_SYMBOL(memcg_sockets_enabled_key);
-
-void sock_update_memcg(struct sock *sk)
-{
-	struct mem_cgroup *memcg;
-
-	/* Socket cloning can throw us here with sk_cgrp already
-	 * filled. It won't however, necessarily happen from
-	 * process context. So the test for root memcg given
-	 * the current task's memcg won't help us in this case.
-	 *
-	 * Respecting the original socket's memcg is a better
-	 * decision in this case.
-	 */
-	if (sk->sk_memcg) {
-		BUG_ON(mem_cgroup_is_root(sk->sk_memcg));
-		css_get(&sk->sk_memcg->css);
-		return;
-	}
-
-	rcu_read_lock();
-	memcg = mem_cgroup_from_task(current);
-	if (memcg != root_mem_cgroup &&
-	    test_bit(MEMCG_SOCK_ACTIVE, &memcg->tcp_mem.flags) &&
-	    css_tryget_online(&memcg->css))
-		sk->sk_memcg = memcg;
-	rcu_read_unlock();
-}
-EXPORT_SYMBOL(sock_update_memcg);
-
-void sock_release_memcg(struct sock *sk)
-{
-	WARN_ON(!sk->sk_memcg);
-	css_put(&sk->sk_memcg->css);
-}
-
-/**
- * mem_cgroup_charge_skmem - charge socket memory
- * @memcg: memcg to charge
- * @nr_pages: number of pages to charge
- *
- * Charges @nr_pages to @memcg. Returns %true if the charge fit within
- * @memcg's configured limit, %false if the charge had to be forced.
- */
-bool mem_cgroup_charge_skmem(struct mem_cgroup *memcg, unsigned int nr_pages)
-{
-	struct page_counter *counter;
-
-	if (page_counter_try_charge(&memcg->tcp_mem.memory_allocated,
-				    nr_pages, &counter)) {
-		memcg->tcp_mem.memory_pressure = 0;
-		return true;
-	}
-	page_counter_charge(&memcg->tcp_mem.memory_allocated, nr_pages);
-	memcg->tcp_mem.memory_pressure = 1;
-	return false;
-}
-
-/**
- * mem_cgroup_uncharge_skmem - uncharge socket memory
- * @memcg - memcg to uncharge
- * @nr_pages - number of pages to uncharge
- */
-void mem_cgroup_uncharge_skmem(struct mem_cgroup *memcg, unsigned int nr_pages)
-{
-	page_counter_uncharge(&memcg->tcp_mem.memory_allocated, nr_pages);
-}
-
-#endif
-
 #ifdef CONFIG_MEMCG_KMEM
 /*
  * This will be the memcg's index in each cache's ->memcg_params.memcg_caches.
@@ -5538,6 +5464,80 @@ void mem_cgroup_replace_page(struct page *oldpage, struct page *newpage)
 	commit_charge(newpage, memcg, true);
 }
 
+/* Writing them here to avoid exposing memcg's inner layout */
+#if defined(CONFIG_INET) && defined(CONFIG_MEMCG_KMEM)
+
+struct static_key memcg_sockets_enabled_key;
+EXPORT_SYMBOL(memcg_sockets_enabled_key);
+
+void sock_update_memcg(struct sock *sk)
+{
+	struct mem_cgroup *memcg;
+
+	/* Socket cloning can throw us here with sk_cgrp already
+	 * filled. It won't however, necessarily happen from
+	 * process context. So the test for root memcg given
+	 * the current task's memcg won't help us in this case.
+	 *
+	 * Respecting the original socket's memcg is a better
+	 * decision in this case.
+	 */
+	if (sk->sk_memcg) {
+		BUG_ON(mem_cgroup_is_root(sk->sk_memcg));
+		css_get(&sk->sk_memcg->css);
+		return;
+	}
+
+	rcu_read_lock();
+	memcg = mem_cgroup_from_task(current);
+	if (memcg != root_mem_cgroup &&
+	    test_bit(MEMCG_SOCK_ACTIVE, &memcg->tcp_mem.flags) &&
+	    css_tryget_online(&memcg->css))
+		sk->sk_memcg = memcg;
+	rcu_read_unlock();
+}
+EXPORT_SYMBOL(sock_update_memcg);
+
+void sock_release_memcg(struct sock *sk)
+{
+	WARN_ON(!sk->sk_memcg);
+	css_put(&sk->sk_memcg->css);
+}
+
+/**
+ * mem_cgroup_charge_skmem - charge socket memory
+ * @memcg: memcg to charge
+ * @nr_pages: number of pages to charge
+ *
+ * Charges @nr_pages to @memcg. Returns %true if the charge fit within
+ * @memcg's configured limit, %false if the charge had to be forced.
+ */
+bool mem_cgroup_charge_skmem(struct mem_cgroup *memcg, unsigned int nr_pages)
+{
+	struct page_counter *counter;
+
+	if (page_counter_try_charge(&memcg->tcp_mem.memory_allocated,
+				    nr_pages, &counter)) {
+		memcg->tcp_mem.memory_pressure = 0;
+		return true;
+	}
+	page_counter_charge(&memcg->tcp_mem.memory_allocated, nr_pages);
+	memcg->tcp_mem.memory_pressure = 1;
+	return false;
+}
+
+/**
+ * mem_cgroup_uncharge_skmem - uncharge socket memory
+ * @memcg - memcg to uncharge
+ * @nr_pages - number of pages to uncharge
+ */
+void mem_cgroup_uncharge_skmem(struct mem_cgroup *memcg, unsigned int nr_pages)
+{
+	page_counter_uncharge(&memcg->tcp_mem.memory_allocated, nr_pages);
+}
+
+#endif
+
 /*
  * subsys_initcall() for memory controller.
  *
-- 
2.6.2


^ permalink raw reply related	[flat|nested] 61+ messages in thread

* [PATCH 13/14] mm: memcontrol: account socket memory in unified hierarchy memory controller
  2015-11-12 23:41 [PATCH 00/14] mm: memcontrol: account socket memory in unified hierarchy Johannes Weiner
                   ` (11 preceding siblings ...)
  2015-11-12 23:41 ` [PATCH 12/14] mm: memcontrol: move socket code for unified hierarchy accounting Johannes Weiner
@ 2015-11-12 23:41 ` Johannes Weiner
  2015-11-16 15:59   ` Michal Hocko
  2015-11-20 13:10   ` Vladimir Davydov
  2015-11-12 23:41 ` [PATCH 14/14] mm: memcontrol: hook up vmpressure to socket pressure Johannes Weiner
  13 siblings, 2 replies; 61+ messages in thread
From: Johannes Weiner @ 2015-11-12 23:41 UTC (permalink / raw)
  To: David Miller, Andrew Morton
  Cc: Vladimir Davydov, Tejun Heo, Michal Hocko, netdev, linux-mm,
	cgroups, linux-kernel, kernel-team

Socket memory can be a significant share of overall memory consumed by
common workloads. In order to provide reasonable resource isolation in
the unified hierarchy, this type of memory needs to be included in the
tracking/accounting of a cgroup under active memory resource control.

Overhead is only incurred when a non-root control group is created AND
the memory controller is instructed to track and account the memory
footprint of that group. cgroup.memory=nosocket can be specified on
the kernel boot command line to override any runtime configuration and
forcibly exclude socket memory from active memory resource control.
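
For example, an illustrative bootloader entry (kernel path and version
are made up) would disable it like this:

        linux /boot/vmlinuz-4.4 root=/dev/sda1 ro cgroup.memory=nosocket

The parameter is parsed as a comma-separated token list, so future
cgroup.memory options can share the same key.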

Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
---
 include/linux/memcontrol.h |  12 ++++-
 mm/memcontrol.c            | 131 +++++++++++++++++++++++++++++++++++++--------
 2 files changed, 118 insertions(+), 25 deletions(-)

diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
index 4cf5afa..809d6de 100644
--- a/include/linux/memcontrol.h
+++ b/include/linux/memcontrol.h
@@ -256,6 +256,10 @@ struct mem_cgroup {
 	struct wb_domain cgwb_domain;
 #endif
 
+#ifdef CONFIG_INET
+	struct work_struct	socket_work;
+#endif
+
 	/* List of events which userspace want to receive */
 	struct list_head event_list;
 	spinlock_t event_list_lock;
@@ -691,7 +695,7 @@ static inline void mem_cgroup_wb_stats(struct bdi_writeback *wb,
 
 #endif	/* CONFIG_CGROUP_WRITEBACK */
 
-#if defined(CONFIG_INET) && defined(CONFIG_MEMCG_KMEM)
+#ifdef CONFIG_INET
 struct sock;
 extern struct static_key memcg_sockets_enabled_key;
 #define mem_cgroup_sockets_enabled static_key_false(&memcg_sockets_enabled_key)
@@ -701,11 +705,15 @@ bool mem_cgroup_charge_skmem(struct mem_cgroup *memcg, unsigned int nr_pages);
 void mem_cgroup_uncharge_skmem(struct mem_cgroup *memcg, unsigned int nr_pages);
 static inline bool mem_cgroup_under_socket_pressure(struct mem_cgroup *memcg)
 {
+#ifdef CONFIG_MEMCG_KMEM
 	return memcg->tcp_mem.memory_pressure;
+#else
+	return false;
+#endif
 }
 #else
 #define mem_cgroup_sockets_enabled 0
-#endif /* CONFIG_INET && CONFIG_MEMCG_KMEM */
+#endif /* CONFIG_INET */
 
 #ifdef CONFIG_MEMCG_KMEM
 extern struct static_key memcg_kmem_enabled_key;
diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index 408fb04..cad9525 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -80,6 +80,9 @@ struct mem_cgroup *root_mem_cgroup __read_mostly;
 
 #define MEM_CGROUP_RECLAIM_RETRIES	5
 
+/* Socket memory accounting disabled? */
+static bool cgroup_memory_nosocket;
+
 /* Whether the swap controller is active */
 #ifdef CONFIG_MEMCG_SWAP
 int do_swap_account __read_mostly;
@@ -1923,6 +1926,18 @@ static int memcg_cpu_hotplug_callback(struct notifier_block *nb,
 	return NOTIFY_OK;
 }
 
+static void reclaim_high(struct mem_cgroup *memcg,
+			 unsigned int nr_pages,
+			 gfp_t gfp_mask)
+{
+	do {
+		if (page_counter_read(&memcg->memory) <= memcg->high)
+			continue;
+		mem_cgroup_events(memcg, MEMCG_HIGH, 1);
+		try_to_free_mem_cgroup_pages(memcg, nr_pages, gfp_mask, true);
+	} while ((memcg = parent_mem_cgroup(memcg)));
+}
+
 /*
  * Scheduled by try_charge() to be executed from the userland return path
  * and reclaims memory over the high limit.
@@ -1930,20 +1945,13 @@ static int memcg_cpu_hotplug_callback(struct notifier_block *nb,
 void mem_cgroup_handle_over_high(void)
 {
 	unsigned int nr_pages = current->memcg_nr_pages_over_high;
-	struct mem_cgroup *memcg, *pos;
+	struct mem_cgroup *memcg;
 
 	if (likely(!nr_pages))
 		return;
 
-	pos = memcg = get_mem_cgroup_from_mm(current->mm);
-
-	do {
-		if (page_counter_read(&pos->memory) <= pos->high)
-			continue;
-		mem_cgroup_events(pos, MEMCG_HIGH, 1);
-		try_to_free_mem_cgroup_pages(pos, nr_pages, GFP_KERNEL, true);
-	} while ((pos = parent_mem_cgroup(pos)));
-
+	memcg = get_mem_cgroup_from_mm(current->mm);
+	reclaim_high(memcg, nr_pages, GFP_KERNEL);
 	css_put(&memcg->css);
 	current->memcg_nr_pages_over_high = 0;
 }
@@ -4141,6 +4149,8 @@ struct mem_cgroup *parent_mem_cgroup(struct mem_cgroup *memcg)
 }
 EXPORT_SYMBOL(parent_mem_cgroup);
 
+static void socket_work_func(struct work_struct *work);
+
 static struct cgroup_subsys_state * __ref
 mem_cgroup_css_alloc(struct cgroup_subsys_state *parent_css)
 {
@@ -4180,6 +4190,9 @@ mem_cgroup_css_alloc(struct cgroup_subsys_state *parent_css)
 #ifdef CONFIG_CGROUP_WRITEBACK
 	INIT_LIST_HEAD(&memcg->cgwb_list);
 #endif
+#ifdef CONFIG_INET
+	INIT_WORK(&memcg->socket_work, socket_work_func);
+#endif
 	return &memcg->css;
 
 free_out:
@@ -4237,6 +4250,11 @@ mem_cgroup_css_online(struct cgroup_subsys_state *css)
 	if (ret)
 		return ret;
 
+#ifdef CONFIG_INET
+	if (cgroup_subsys_on_dfl(memory_cgrp_subsys) && !cgroup_memory_nosocket)
+		static_key_slow_inc(&memcg_sockets_enabled_key);
+#endif
+
 	/*
 	 * Make sure the memcg is initialized: mem_cgroup_iter()
 	 * orders reading memcg->initialized against its callers
@@ -4276,6 +4294,11 @@ static void mem_cgroup_css_free(struct cgroup_subsys_state *css)
 	struct mem_cgroup *memcg = mem_cgroup_from_css(css);
 
 	memcg_destroy_kmem(memcg);
+#ifdef CONFIG_INET
+	if (cgroup_subsys_on_dfl(memory_cgrp_subsys) && !cgroup_memory_nosocket)
+		static_key_slow_dec(&memcg_sockets_enabled_key);
+	cancel_work_sync(&memcg->socket_work);
+#endif
 	__mem_cgroup_free(memcg);
 }
 
@@ -5464,8 +5487,7 @@ void mem_cgroup_replace_page(struct page *oldpage, struct page *newpage)
 	commit_charge(newpage, memcg, true);
 }
 
-/* Writing them here to avoid exposing memcg's inner layout */
-#if defined(CONFIG_INET) && defined(CONFIG_MEMCG_KMEM)
+#ifdef CONFIG_INET
 
 struct static_key memcg_sockets_enabled_key;
 EXPORT_SYMBOL(memcg_sockets_enabled_key);
@@ -5490,10 +5512,16 @@ void sock_update_memcg(struct sock *sk)
 
 	rcu_read_lock();
 	memcg = mem_cgroup_from_task(current);
-	if (memcg != root_mem_cgroup &&
-	    test_bit(MEMCG_SOCK_ACTIVE, &memcg->tcp_mem.flags) &&
-	    css_tryget_online(&memcg->css))
+	if (memcg == root_mem_cgroup)
+		goto out;
+#ifdef CONFIG_MEMCG_KMEM
+	if (!cgroup_subsys_on_dfl(memory_cgrp_subsys) &&
+	    !test_bit(MEMCG_SOCK_ACTIVE, &memcg->tcp_mem.flags))
+		goto out;
+#endif
+	if (css_tryget_online(&memcg->css))
 		sk->sk_memcg = memcg;
+out:
 	rcu_read_unlock();
 }
 EXPORT_SYMBOL(sock_update_memcg);
@@ -5504,6 +5532,14 @@ void sock_release_memcg(struct sock *sk)
 	css_put(&sk->sk_memcg->css);
 }
 
+static void socket_work_func(struct work_struct *work)
+{
+	struct mem_cgroup *memcg;
+
+	memcg = container_of(work, struct mem_cgroup, socket_work);
+	reclaim_high(memcg, CHARGE_BATCH, GFP_KERNEL);
+}
+
 /**
  * mem_cgroup_charge_skmem - charge socket memory
  * @memcg: memcg to charge
@@ -5514,16 +5550,43 @@ void sock_release_memcg(struct sock *sk)
  */
 bool mem_cgroup_charge_skmem(struct mem_cgroup *memcg, unsigned int nr_pages)
 {
+	unsigned int batch = max(CHARGE_BATCH, nr_pages);
 	struct page_counter *counter;
+	bool force = false;
 
-	if (page_counter_try_charge(&memcg->tcp_mem.memory_allocated,
-				    nr_pages, &counter)) {
-		memcg->tcp_mem.memory_pressure = 0;
+#ifdef CONFIG_MEMCG_KMEM
+	if (!cgroup_subsys_on_dfl(memory_cgrp_subsys)) {
+		if (page_counter_try_charge(&memcg->tcp_mem.memory_allocated,
+					    nr_pages, &counter)) {
+			memcg->tcp_mem.memory_pressure = 0;
+			return true;
+		}
+		page_counter_charge(&memcg->tcp_mem.memory_allocated, nr_pages);
+		memcg->tcp_mem.memory_pressure = 1;
+		return false;
+	}
+#endif
+	if (consume_stock(memcg, nr_pages))
 		return true;
+retry:
+	if (page_counter_try_charge(&memcg->memory, batch, &counter))
+		goto done;
+
+	if (batch > nr_pages) {
+		batch = nr_pages;
+		goto retry;
 	}
-	page_counter_charge(&memcg->tcp_mem.memory_allocated, nr_pages);
-	memcg->tcp_mem.memory_pressure = 1;
-	return false;
+
+	page_counter_charge(&memcg->memory, batch);
+	force = true;
+done:
+	css_get_many(&memcg->css, batch);
+	if (batch > nr_pages)
+		refill_stock(memcg, batch - nr_pages);
+
+	schedule_work(&memcg->socket_work);
+
+	return !force;
 }
 
 /**
@@ -5533,10 +5596,32 @@ bool mem_cgroup_charge_skmem(struct mem_cgroup *memcg, unsigned int nr_pages)
  */
 void mem_cgroup_uncharge_skmem(struct mem_cgroup *memcg, unsigned int nr_pages)
 {
-	page_counter_uncharge(&memcg->tcp_mem.memory_allocated, nr_pages);
+#ifdef CONFIG_MEMCG_KMEM
+	if (!cgroup_subsys_on_dfl(memory_cgrp_subsys)) {
+		page_counter_uncharge(&memcg->tcp_mem.memory_allocated,
+				      nr_pages);
+		return;
+	}
+#endif
+	page_counter_uncharge(&memcg->memory, nr_pages);
+	css_put_many(&memcg->css, nr_pages);
 }
 
-#endif
+#endif /* CONFIG_INET */
+
+static int __init cgroup_memory(char *s)
+{
+	char *token;
+
+	while ((token = strsep(&s, ",")) != NULL) {
+		if (!*token)
+			continue;
+		if (!strcmp(token, "nosocket"))
+			cgroup_memory_nosocket = true;
+	}
+	return 0;
+}
+__setup("cgroup.memory=", cgroup_memory);
 
 /*
  * subsys_initcall() for memory controller.
-- 
2.6.2


^ permalink raw reply related	[flat|nested] 61+ messages in thread

* [PATCH 14/14] mm: memcontrol: hook up vmpressure to socket pressure
  2015-11-12 23:41 [PATCH 00/14] mm: memcontrol: account socket memory in unified hierarchy Johannes Weiner
                   ` (12 preceding siblings ...)
  2015-11-12 23:41 ` [PATCH 13/14] mm: memcontrol: account socket memory in unified hierarchy memory controller Johannes Weiner
@ 2015-11-12 23:41 ` Johannes Weiner
  2015-11-15 13:54   ` Vladimir Davydov
  13 siblings, 1 reply; 61+ messages in thread
From: Johannes Weiner @ 2015-11-12 23:41 UTC (permalink / raw)
  To: David Miller, Andrew Morton
  Cc: Vladimir Davydov, Tejun Heo, Michal Hocko, netdev, linux-mm,
	cgroups, linux-kernel, kernel-team

Let the networking stack know when a memcg is under reclaim pressure
so that it can clamp its transmit windows accordingly.

Whenever the reclaim efficiency of a cgroup's LRU lists drops low
enough for a MEDIUM or HIGH vmpressure event to occur, assert a
pressure state in the socket and tcp memory code that tells it to curb
consumption growth from sockets associated with said control group.

vmpressure events are naturally edge-triggered, so for hysteresis,
assert socket pressure for one second after each event; subsequent
vmpressure events can then renew the state before the socket code
returns to normal.

This will likely need fine-tuning for a wider variety of workloads,
but for now stick to the vmpressure presets and keep the hysteresis
simple.
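
In code terms, the hysteresis is a timestamp comparison, distilled
from the patch below:

        /* reclaim side, on a MEDIUM or HIGH vmpressure event: */
        memcg->socket_pressure = jiffies + HZ;

        /* socket side, checked for the memcg and all its ancestors: */
        if (time_before(jiffies, memcg->socket_pressure))
                return true;    /* keep clamping transmit windows */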

Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
---
 include/linux/memcontrol.h | 29 +++++++++++++++++++++++++----
 mm/memcontrol.c            | 15 +--------------
 mm/vmpressure.c            | 25 ++++++++++++++++++++-----
 3 files changed, 46 insertions(+), 23 deletions(-)

diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
index 809d6de..dba43cb 100644
--- a/include/linux/memcontrol.h
+++ b/include/linux/memcontrol.h
@@ -258,6 +258,7 @@ struct mem_cgroup {
 
 #ifdef CONFIG_INET
 	struct work_struct	socket_work;
+	unsigned long		socket_pressure;
 #endif
 
 	/* List of events which userspace want to receive */
@@ -303,18 +304,34 @@ struct lruvec *mem_cgroup_page_lruvec(struct page *, struct zone *);
 
 bool task_in_mem_cgroup(struct task_struct *task, struct mem_cgroup *memcg);
 struct mem_cgroup *mem_cgroup_from_task(struct task_struct *p);
-struct mem_cgroup *parent_mem_cgroup(struct mem_cgroup *memcg);
 
 static inline
 struct mem_cgroup *mem_cgroup_from_css(struct cgroup_subsys_state *css){
 	return css ? container_of(css, struct mem_cgroup, css) : NULL;
 }
 
+#define mem_cgroup_from_counter(counter, member)	\
+	container_of(counter, struct mem_cgroup, member)
+
 struct mem_cgroup *mem_cgroup_iter(struct mem_cgroup *,
 				   struct mem_cgroup *,
 				   struct mem_cgroup_reclaim_cookie *);
 void mem_cgroup_iter_break(struct mem_cgroup *, struct mem_cgroup *);
 
+/**
+ * parent_mem_cgroup - find the accounting parent of a memcg
+ * @memcg: memcg whose parent to find
+ *
+ * Returns the parent memcg, or NULL if this is the root or the memory
+ * controller is in legacy no-hierarchy mode.
+ */
+static inline struct mem_cgroup *parent_mem_cgroup(struct mem_cgroup *memcg)
+{
+	if (!memcg->memory.parent)
+		return NULL;
+	return mem_cgroup_from_counter(memcg->memory.parent, memory);
+}
+
 static inline bool mem_cgroup_is_descendant(struct mem_cgroup *memcg,
 			      struct mem_cgroup *root)
 {
@@ -706,10 +723,14 @@ void mem_cgroup_uncharge_skmem(struct mem_cgroup *memcg, unsigned int nr_pages);
 static inline bool mem_cgroup_under_socket_pressure(struct mem_cgroup *memcg)
 {
 #ifdef CONFIG_MEMCG_KMEM
-	return memcg->tcp_mem.memory_pressure;
-#else
-	return false;
+	if (memcg->tcp_mem.memory_pressure)
+		return true;
 #endif
+	do {
+		if (time_before(jiffies, memcg->socket_pressure))
+			return true;
+	} while ((memcg = parent_mem_cgroup(memcg)));
+	return false;
 }
 #else
 #define mem_cgroup_sockets_enabled 0
diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index cad9525..4068662 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -1091,9 +1091,6 @@ bool task_in_mem_cgroup(struct task_struct *task, struct mem_cgroup *memcg)
 	return ret;
 }
 
-#define mem_cgroup_from_counter(counter, member)	\
-	container_of(counter, struct mem_cgroup, member)
-
 /**
  * mem_cgroup_margin - calculate chargeable space of a memory cgroup
  * @memcg: the memory cgroup
@@ -4138,17 +4135,6 @@ static void __mem_cgroup_free(struct mem_cgroup *memcg)
 	kfree(memcg);
 }
 
-/*
- * Returns the parent mem_cgroup in memcgroup hierarchy with hierarchy enabled.
- */
-struct mem_cgroup *parent_mem_cgroup(struct mem_cgroup *memcg)
-{
-	if (!memcg->memory.parent)
-		return NULL;
-	return mem_cgroup_from_counter(memcg->memory.parent, memory);
-}
-EXPORT_SYMBOL(parent_mem_cgroup);
-
 static void socket_work_func(struct work_struct *work);
 
 static struct cgroup_subsys_state * __ref
@@ -4192,6 +4178,7 @@ mem_cgroup_css_alloc(struct cgroup_subsys_state *parent_css)
 #endif
 #ifdef CONFIG_INET
 	INIT_WORK(&memcg->socket_work, socket_work_func);
+	memcg->socket_pressure = jiffies;
 #endif
 	return &memcg->css;
 
diff --git a/mm/vmpressure.c b/mm/vmpressure.c
index 4c25e62..07e8440 100644
--- a/mm/vmpressure.c
+++ b/mm/vmpressure.c
@@ -137,14 +137,11 @@ struct vmpressure_event {
 };
 
 static bool vmpressure_event(struct vmpressure *vmpr,
-			     unsigned long scanned, unsigned long reclaimed)
+			     enum vmpressure_levels level)
 {
 	struct vmpressure_event *ev;
-	enum vmpressure_levels level;
 	bool signalled = false;
 
-	level = vmpressure_calc_level(scanned, reclaimed);
-
 	mutex_lock(&vmpr->events_lock);
 
 	list_for_each_entry(ev, &vmpr->events, node) {
@@ -162,6 +159,7 @@ static bool vmpressure_event(struct vmpressure *vmpr,
 static void vmpressure_work_fn(struct work_struct *work)
 {
 	struct vmpressure *vmpr = work_to_vmpressure(work);
+	enum vmpressure_levels level;
 	unsigned long scanned;
 	unsigned long reclaimed;
 
@@ -185,8 +183,25 @@ static void vmpressure_work_fn(struct work_struct *work)
 	vmpr->reclaimed = 0;
 	spin_unlock(&vmpr->sr_lock);
 
+	level = vmpressure_calc_level(scanned, reclaimed);
+
+	if (level > VMPRESSURE_LOW) {
+		struct mem_cgroup *memcg;
+		/*
+		 * Let the socket buffer allocator know that we are
+		 * having trouble reclaiming LRU pages.
+		 *
+		 * For hysteresis, keep the pressure state asserted
+		 * for a second in which subsequent pressure events
+		 * can occur.
+		 */
+		memcg = container_of(vmpr, struct mem_cgroup, vmpressure);
+		if (memcg != root_mem_cgroup)
+			memcg->socket_pressure = jiffies + HZ;
+	}
+
 	do {
-		if (vmpressure_event(vmpr, scanned, reclaimed))
+		if (vmpressure_event(vmpr, level))
 			break;
 		/*
 		 * If not handled, propagate the event upward into the
-- 
2.6.2


^ permalink raw reply related	[flat|nested] 61+ messages in thread

* Re: [PATCH 08/14] net: tcp_memcontrol: sanitize tcp memory accounting callbacks
  2015-11-12 23:41 ` [PATCH 08/14] net: tcp_memcontrol: sanitize tcp memory accounting callbacks Johannes Weiner
@ 2015-11-13  4:53   ` Eric Dumazet
  2015-11-13  5:44     ` Johannes Weiner
  2015-11-20 10:58   ` Vladimir Davydov
  1 sibling, 1 reply; 61+ messages in thread
From: Eric Dumazet @ 2015-11-13  4:53 UTC (permalink / raw)
  To: Johannes Weiner
  Cc: David Miller, Andrew Morton, Vladimir Davydov, Tejun Heo,
	Michal Hocko, netdev, linux-mm, cgroups, linux-kernel,
	kernel-team

On Thu, 2015-11-12 at 18:41 -0500, Johannes Weiner wrote:


> @@ -711,6 +705,12 @@ static inline void mem_cgroup_wb_stats(struct bdi_writeback *wb,
>  struct sock;
>  void sock_update_memcg(struct sock *sk);
>  void sock_release_memcg(struct sock *sk);
> +bool mem_cgroup_charge_skmem(struct cg_proto *proto, unsigned int nr_pages);
> +void mem_cgroup_uncharge_skmem(struct cg_proto *proto, unsigned int nr_pages);
> +static inline bool mem_cgroup_under_socket_pressure(struct cg_proto *proto)
> +{
> +	return proto->memory_pressure;
> +}
>  #endif /* CONFIG_INET && CONFIG_MEMCG_KMEM */
>  
>  #ifdef CONFIG_MEMCG_KMEM
> diff --git a/include/net/sock.h b/include/net/sock.h
> index 2eefc99..8cc7613 100644
> --- a/include/net/sock.h
> +++ b/include/net/sock.h
> @@ -1126,8 +1126,8 @@ static inline bool sk_under_memory_pressure(const struct sock *sk)
>  	if (!sk->sk_prot->memory_pressure)
>  		return false;
>  
> -	if (mem_cgroup_sockets_enabled && sk->sk_cgrp)
> -		return !!sk->sk_cgrp->memory_pressure;
> +	if (mem_cgroup_sockets_enabled && sk->sk_cgrp &&
> +	    mem_cgroup_under_socket_pressure(sk->sk_cgrp))
>  
>  	return !!*sk->sk_prot->memory_pressure;
>  }


This looks wrong?

if (A && B && C)
    return !!*sk->sk_prot->memory_pressure;

<compiler should eventually barf, 
as this function should not return void>
}

^ permalink raw reply	[flat|nested] 61+ messages in thread

* Re: [PATCH 08/14] net: tcp_memcontrol: sanitize tcp memory accounting callbacks
  2015-11-13  4:53   ` Eric Dumazet
@ 2015-11-13  5:44     ` Johannes Weiner
  0 siblings, 0 replies; 61+ messages in thread
From: Johannes Weiner @ 2015-11-13  5:44 UTC (permalink / raw)
  To: Eric Dumazet
  Cc: David Miller, Andrew Morton, Vladimir Davydov, Tejun Heo,
	Michal Hocko, netdev, linux-mm, cgroups, linux-kernel,
	kernel-team

On Thu, Nov 12, 2015 at 08:53:38PM -0800, Eric Dumazet wrote:
> On Thu, 2015-11-12 at 18:41 -0500, Johannes Weiner wrote:
> > @@ -711,6 +705,12 @@ static inline void mem_cgroup_wb_stats(struct bdi_writeback *wb,
> >  struct sock;
> >  void sock_update_memcg(struct sock *sk);
> >  void sock_release_memcg(struct sock *sk);
> > +bool mem_cgroup_charge_skmem(struct cg_proto *proto, unsigned int nr_pages);
> > +void mem_cgroup_uncharge_skmem(struct cg_proto *proto, unsigned int nr_pages);
> > +static inline bool mem_cgroup_under_socket_pressure(struct cg_proto *proto)
> > +{
> > +	return proto->memory_pressure;
> > +}
> >  #endif /* CONFIG_INET && CONFIG_MEMCG_KMEM */
> >  
> >  #ifdef CONFIG_MEMCG_KMEM
> > diff --git a/include/net/sock.h b/include/net/sock.h
> > index 2eefc99..8cc7613 100644
> > --- a/include/net/sock.h
> > +++ b/include/net/sock.h
> > @@ -1126,8 +1126,8 @@ static inline bool sk_under_memory_pressure(const struct sock *sk)
> >  	if (!sk->sk_prot->memory_pressure)
> >  		return false;
> >  
> > -	if (mem_cgroup_sockets_enabled && sk->sk_cgrp)
> > -		return !!sk->sk_cgrp->memory_pressure;
> > +	if (mem_cgroup_sockets_enabled && sk->sk_cgrp &&
> > +	    mem_cgroup_under_socket_pressure(sk->sk_cgrp))
> >  
> >  	return !!*sk->sk_prot->memory_pressure;
> >  }
> 
> 
> This looks wrong ?
> 
> if (A && B && C)
>     return !!*sk->sk_prot->memory_pressure;
> 
> <compiler should eventually barf, 
> as this function should not return void>

Yikes, you're right. This is missing a return true.

[ Just forced a complete rebuild and of course it warns about
  control reaching the end of a non-void function. ]

I'm stumped by how I could have missed it as I rebuild after every
commit with make -s, so a warning should stand out. And it should
definitely rebuild the callers frequently as most patches change
memcontrol.h. Probably a screwup in the final series polishing.
I'm going to go over this carefully one more time tomorrow.

Meanwhile, this is the missing piece and the updated patch.

Thanks Eric.

diff --git a/include/net/sock.h b/include/net/sock.h
index 8cc7613..f954e2a 100644
--- a/include/net/sock.h
+++ b/include/net/sock.h
@@ -1128,6 +1128,7 @@ static inline bool sk_under_memory_pressure(const struct sock *sk)
 
 	if (mem_cgroup_sockets_enabled && sk->sk_cgrp &&
 	    mem_cgroup_under_socket_pressure(sk->sk_cgrp))
+		return true;
 
 	return !!*sk->sk_prot->memory_pressure;
 }

---
>From 4a24ca67e5b0f651a68807ee99f714437ffd6109 Mon Sep 17 00:00:00 2001
From: Johannes Weiner <hannes@cmpxchg.org>
Date: Tue, 10 Nov 2015 17:14:41 -0500
Subject: [PATCH v2] net: tcp_memcontrol: sanitize tcp memory accounting callbacks

There won't be a tcp control soft limit, so integrating the memcg code
into the global skmem limiting scheme complicates things
unnecessarily. Replace this with simple and clear charge and uncharge
calls--hidden behind a jump label--to account skb memory.

Note that this is not purely aesthetic: as a result of shoehorning the
per-memcg code into the same memory accounting functions that handle
the global level, the old code would compare the per-memcg consumption
against the smaller of the per-memcg limit and the global limit. This
allowed the total consumption of multiple sockets to exceed the global
limit, as long as the individual sockets stayed within bounds. After
this change, the code will always compare the per-memcg consumption to
the per-memcg limit, and the global consumption to the global limit,
and thus close this loophole.
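
Concretely: with a global limit of, say, 1000 pages and two groups
limited to 800 pages each (numbers are illustrative), each group's
sockets only had to stay below min(800, 1000) = 800, so the two
groups together could consume 1600 pages without the global limit
ever kicking in.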

Without a soft limit, the per-memcg memory pressure state in sockets
is generally questionable. However, we did it until now, so we
continue to enter it when the hard limit is hit, and packets are
dropped, to let other sockets in the cgroup know that they shouldn't
grow their transmit windows, either. However, keep it simple in the
new callback model and leave memory pressure lazily when the next
packet is accepted (as opposed to doing it synchronously when packets
are processed). When packets are dropped, network performance will
already be in the toilet, so that should be a reasonable trade-off.
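
The resulting pressure semantics, distilled from the charge callback
in the patch below:

        if (page_counter_try_charge(&proto->memory_allocated,
                                    nr_pages, &counter)) {
                proto->memory_pressure = 0;     /* leave pressure lazily */
                return true;
        }
        page_counter_charge(&proto->memory_allocated, nr_pages); /* force */
        proto->memory_pressure = 1;     /* enter when the hard limit is hit */
        return false;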

As described above, consumption is now checked on the per-memcg level
and the global level separately. Likewise, memory pressure states are
maintained on both the per-memcg level and the global level, and a
socket is considered under pressure when either level asserts as much.

Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
---
 include/linux/memcontrol.h | 12 ++++-----
 include/net/sock.h         | 64 ++++++----------------------------------------
 include/net/tcp.h          |  5 ++--
 mm/memcontrol.c            | 32 +++++++++++++++++++++++
 net/core/sock.c            | 26 +++++++++++--------
 net/ipv4/tcp_output.c      |  7 +++--
 6 files changed, 70 insertions(+), 76 deletions(-)

diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
index 96ca3d3..906dfff 100644
--- a/include/linux/memcontrol.h
+++ b/include/linux/memcontrol.h
@@ -676,12 +676,6 @@ void mem_cgroup_count_vm_event(struct mm_struct *mm, enum vm_event_item idx)
 }
 #endif /* CONFIG_MEMCG */
 
-enum {
-	UNDER_LIMIT,
-	SOFT_LIMIT,
-	OVER_LIMIT,
-};
-
 #ifdef CONFIG_CGROUP_WRITEBACK
 
 struct list_head *mem_cgroup_cgwb_list(struct mem_cgroup *memcg);
@@ -711,6 +705,12 @@ static inline void mem_cgroup_wb_stats(struct bdi_writeback *wb,
 struct sock;
 void sock_update_memcg(struct sock *sk);
 void sock_release_memcg(struct sock *sk);
+bool mem_cgroup_charge_skmem(struct cg_proto *proto, unsigned int nr_pages);
+void mem_cgroup_uncharge_skmem(struct cg_proto *proto, unsigned int nr_pages);
+static inline bool mem_cgroup_under_socket_pressure(struct cg_proto *proto)
+{
+	return proto->memory_pressure;
+}
 #endif /* CONFIG_INET && CONFIG_MEMCG_KMEM */
 
 #ifdef CONFIG_MEMCG_KMEM
diff --git a/include/net/sock.h b/include/net/sock.h
index 2eefc99..f954e2a 100644
--- a/include/net/sock.h
+++ b/include/net/sock.h
@@ -1126,8 +1126,9 @@ static inline bool sk_under_memory_pressure(const struct sock *sk)
 	if (!sk->sk_prot->memory_pressure)
 		return false;
 
-	if (mem_cgroup_sockets_enabled && sk->sk_cgrp)
-		return !!sk->sk_cgrp->memory_pressure;
+	if (mem_cgroup_sockets_enabled && sk->sk_cgrp &&
+	    mem_cgroup_under_socket_pressure(sk->sk_cgrp))
+		return true;
 
 	return !!*sk->sk_prot->memory_pressure;
 }
@@ -1141,9 +1142,6 @@ static inline void sk_leave_memory_pressure(struct sock *sk)
 
 	if (*memory_pressure)
 		*memory_pressure = 0;
-
-	if (mem_cgroup_sockets_enabled && sk->sk_cgrp)
-		sk->sk_cgrp->memory_pressure = 0;
 }
 
 static inline void sk_enter_memory_pressure(struct sock *sk)
@@ -1151,76 +1149,30 @@ static inline void sk_enter_memory_pressure(struct sock *sk)
 	if (!sk->sk_prot->enter_memory_pressure)
 		return;
 
-	if (mem_cgroup_sockets_enabled && sk->sk_cgrp)
-		sk->sk_cgrp->memory_pressure = 1;
-
 	sk->sk_prot->enter_memory_pressure(sk);
 }
 
 static inline long sk_prot_mem_limits(const struct sock *sk, int index)
 {
-	long limit = sk->sk_prot->sysctl_mem[index];
-
-	if (mem_cgroup_sockets_enabled && sk->sk_cgrp)
-		limit = min_t(long, limit, sk->sk_cgrp->memory_allocated.limit);
-
-	return limit;
-}
-
-static inline void memcg_memory_allocated_add(struct cg_proto *prot,
-					      unsigned long amt,
-					      int *parent_status)
-{
-	struct page_counter *counter;
-
-	if (page_counter_try_charge(&prot->memory_allocated, amt, &counter))
-		return;
-
-	page_counter_charge(&prot->memory_allocated, amt);
-	*parent_status = OVER_LIMIT;
-}
-
-static inline void memcg_memory_allocated_sub(struct cg_proto *prot,
-					      unsigned long amt)
-{
-	page_counter_uncharge(&prot->memory_allocated, amt);
+	return sk->sk_prot->sysctl_mem[index];
 }
 
 static inline long
 sk_memory_allocated(const struct sock *sk)
 {
-	struct proto *prot = sk->sk_prot;
-
-	if (mem_cgroup_sockets_enabled && sk->sk_cgrp)
-		return page_counter_read(&sk->sk_cgrp->memory_allocated);
-
-	return atomic_long_read(prot->memory_allocated);
+	return atomic_long_read(sk->sk_prot->memory_allocated);
 }
 
 static inline long
-sk_memory_allocated_add(struct sock *sk, int amt, int *parent_status)
+sk_memory_allocated_add(struct sock *sk, int amt)
 {
-	struct proto *prot = sk->sk_prot;
-
-	if (mem_cgroup_sockets_enabled && sk->sk_cgrp) {
-		memcg_memory_allocated_add(sk->sk_cgrp, amt, parent_status);
-		/* update the root cgroup regardless */
-		atomic_long_add_return(amt, prot->memory_allocated);
-		return page_counter_read(&sk->sk_cgrp->memory_allocated);
-	}
-
-	return atomic_long_add_return(amt, prot->memory_allocated);
+	return atomic_long_add_return(amt, sk->sk_prot->memory_allocated);
 }
 
 static inline void
 sk_memory_allocated_sub(struct sock *sk, int amt)
 {
-	struct proto *prot = sk->sk_prot;
-
-	if (mem_cgroup_sockets_enabled && sk->sk_cgrp)
-		memcg_memory_allocated_sub(sk->sk_cgrp, amt);
-
-	atomic_long_sub(amt, prot->memory_allocated);
+	atomic_long_sub(amt, sk->sk_prot->memory_allocated);
 }
 
 static inline void sk_sockets_allocated_dec(struct sock *sk)
diff --git a/include/net/tcp.h b/include/net/tcp.h
index f80e74c..04517d6 100644
--- a/include/net/tcp.h
+++ b/include/net/tcp.h
@@ -292,8 +292,9 @@ extern int tcp_memory_pressure;
 /* optimized version of sk_under_memory_pressure() for TCP sockets */
 static inline bool tcp_under_memory_pressure(const struct sock *sk)
 {
-	if (mem_cgroup_sockets_enabled && sk->sk_cgrp)
-		return !!sk->sk_cgrp->memory_pressure;
+	if (mem_cgroup_sockets_enabled && sk->sk_cgrp &&
+	    mem_cgroup_under_socket_pressure(sk->sk_cgrp))
+		return true;
 
 	return tcp_memory_pressure;
 }
diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index 57f4539..3462a52 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -338,6 +338,38 @@ struct cg_proto *tcp_proto_cgroup(struct mem_cgroup *memcg)
 }
 EXPORT_SYMBOL(tcp_proto_cgroup);
 
+/**
+ * mem_cgroup_charge_skmem - charge socket memory
+ * @proto: proto to charge
+ * @nr_pages: number of pages to charge
+ *
+ * Charges @nr_pages to @proto. Returns %true if the charge fit within
+ * @proto's configured limit, %false if the charge had to be forced.
+ */
+bool mem_cgroup_charge_skmem(struct cg_proto *proto, unsigned int nr_pages)
+{
+	struct page_counter *counter;
+
+	if (page_counter_try_charge(&proto->memory_allocated,
+				    nr_pages, &counter)) {
+		proto->memory_pressure = 0;
+		return true;
+	}
+	page_counter_charge(&proto->memory_allocated, nr_pages);
+	proto->memory_pressure = 1;
+	return false;
+}
+
+/**
+ * mem_cgroup_uncharge_skmem - uncharge socket memory
+ * @proto - proto to uncharge
+ * @nr_pages - number of pages to uncharge
+ */
+void mem_cgroup_uncharge_skmem(struct cg_proto *proto, unsigned int nr_pages)
+{
+	page_counter_uncharge(&proto->memory_allocated, nr_pages);
+}
+
 #endif
 
 #ifdef CONFIG_MEMCG_KMEM
diff --git a/net/core/sock.c b/net/core/sock.c
index 04e54bc..5b1b96f 100644
--- a/net/core/sock.c
+++ b/net/core/sock.c
@@ -2066,27 +2066,27 @@ int __sk_mem_schedule(struct sock *sk, int size, int kind)
 	struct proto *prot = sk->sk_prot;
 	int amt = sk_mem_pages(size);
 	long allocated;
-	int parent_status = UNDER_LIMIT;
 
 	sk->sk_forward_alloc += amt * SK_MEM_QUANTUM;
 
-	allocated = sk_memory_allocated_add(sk, amt, &parent_status);
+	allocated = sk_memory_allocated_add(sk, amt);
+
+	if (mem_cgroup_sockets_enabled && sk->sk_cgrp &&
+	    !mem_cgroup_charge_skmem(sk->sk_cgrp, amt))
+		goto suppress_allocation;
 
 	/* Under limit. */
-	if (parent_status == UNDER_LIMIT &&
-			allocated <= sk_prot_mem_limits(sk, 0)) {
+	if (allocated <= sk_prot_mem_limits(sk, 0)) {
 		sk_leave_memory_pressure(sk);
 		return 1;
 	}
 
-	/* Under pressure. (we or our parents) */
-	if ((parent_status > SOFT_LIMIT) ||
-			allocated > sk_prot_mem_limits(sk, 1))
+	/* Under pressure. */
+	if (allocated > sk_prot_mem_limits(sk, 1))
 		sk_enter_memory_pressure(sk);
 
-	/* Over hard limit (we or our parents) */
-	if ((parent_status == OVER_LIMIT) ||
-			(allocated > sk_prot_mem_limits(sk, 2)))
+	/* Over hard limit. */
+	if (allocated > sk_prot_mem_limits(sk, 2))
 		goto suppress_allocation;
 
 	/* guarantee minimum buffer size under pressure */
@@ -2135,6 +2135,9 @@ suppress_allocation:
 
 	sk_memory_allocated_sub(sk, amt);
 
+	if (mem_cgroup_sockets_enabled && sk->sk_cgrp)
+		mem_cgroup_uncharge_skmem(sk->sk_cgrp, amt);
+
 	return 0;
 }
 EXPORT_SYMBOL(__sk_mem_schedule);
@@ -2150,6 +2153,9 @@ void __sk_mem_reclaim(struct sock *sk, int amount)
 	sk_memory_allocated_sub(sk, amount);
 	sk->sk_forward_alloc -= amount << SK_MEM_QUANTUM_SHIFT;
 
+	if (mem_cgroup_sockets_enabled && sk->sk_cgrp)
+		mem_cgroup_uncharge_skmem(sk->sk_cgrp, amount);
+
 	if (sk_under_memory_pressure(sk) &&
 	    (sk_memory_allocated(sk) < sk_prot_mem_limits(sk, 0)))
 		sk_leave_memory_pressure(sk);
diff --git a/net/ipv4/tcp_output.c b/net/ipv4/tcp_output.c
index cb7ca56..7aa168a 100644
--- a/net/ipv4/tcp_output.c
+++ b/net/ipv4/tcp_output.c
@@ -2813,13 +2813,16 @@ begin_fwd:
  */
 void sk_forced_mem_schedule(struct sock *sk, int size)
 {
-	int amt, status;
+	int amt;
 
 	if (size <= sk->sk_forward_alloc)
 		return;
 	amt = sk_mem_pages(size);
 	sk->sk_forward_alloc += amt * SK_MEM_QUANTUM;
-	sk_memory_allocated_add(sk, amt, &status);
+	sk_memory_allocated_add(sk, amt);
+
+	if (mem_cgroup_sockets_enabled && sk->sk_cgrp)
+		mem_cgroup_charge_skmem(sk->sk_cgrp, amt);
 }
 
 /* Send a FIN. The caller locks the socket for us.
-- 
2.6.2

^ permalink raw reply related	[flat|nested] 61+ messages in thread

* Re: [PATCH 11/14] mm: memcontrol: do not account memory+swap on unified hierarchy
  2015-11-12 23:41 ` [PATCH 11/14] mm: memcontrol: do not account memory+swap on unified hierarchy Johannes Weiner
@ 2015-11-13 10:37   ` Michal Hocko
  2015-11-14 13:23   ` Vladimir Davydov
  1 sibling, 0 replies; 61+ messages in thread
From: Michal Hocko @ 2015-11-13 10:37 UTC (permalink / raw)
  To: Johannes Weiner
  Cc: David Miller, Andrew Morton, Vladimir Davydov, Tejun Heo, netdev,
	linux-mm, cgroups, linux-kernel, kernel-team

On Thu 12-11-15 18:41:30, Johannes Weiner wrote:
> The unified hierarchy memory controller doesn't expose the memory+swap
> counter to userspace, but its accounting is hardcoded in all charge
> paths right now, including the per-cpu charge cache ("the stock").
> 
> To avoid adding yet more pointless memory+swap accounting with the
> socket memory support in unified hierarchy, disable the counter
> altogether when in unified hierarchy mode.
> 
> Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>

Acked-by: Michal Hocko <mhocko@suse.com>

> ---
>  mm/memcontrol.c | 44 +++++++++++++++++++++++++-------------------
>  1 file changed, 25 insertions(+), 19 deletions(-)
> 
> diff --git a/mm/memcontrol.c b/mm/memcontrol.c
> index 658bef2..e7f1a79 100644
> --- a/mm/memcontrol.c
> +++ b/mm/memcontrol.c
> @@ -87,6 +87,12 @@ int do_swap_account __read_mostly;
>  #define do_swap_account		0
>  #endif
>  
> +/* Whether legacy memory+swap accounting is active */
> +static bool do_memsw_account(void)
> +{
> +	return !cgroup_subsys_on_dfl(memory_cgrp_subsys) && do_swap_account;
> +}
> +
>  static const char * const mem_cgroup_stat_names[] = {
>  	"cache",
>  	"rss",
> @@ -1177,7 +1183,7 @@ static unsigned long mem_cgroup_margin(struct mem_cgroup *memcg)
>  	if (count < limit)
>  		margin = limit - count;
>  
> -	if (do_swap_account) {
> +	if (do_memsw_account()) {
>  		count = page_counter_read(&memcg->memsw);
>  		limit = READ_ONCE(memcg->memsw.limit);
>  		if (count <= limit)
> @@ -1280,7 +1286,7 @@ void mem_cgroup_print_oom_info(struct mem_cgroup *memcg, struct task_struct *p)
>  		pr_cont(":");
>  
>  		for (i = 0; i < MEM_CGROUP_STAT_NSTATS; i++) {
> -			if (i == MEM_CGROUP_STAT_SWAP && !do_swap_account)
> +			if (i == MEM_CGROUP_STAT_SWAP && !do_memsw_account())
>  				continue;
>  			pr_cont(" %s:%luKB", mem_cgroup_stat_names[i],
>  				K(mem_cgroup_read_stat(iter, i)));
> @@ -1903,7 +1909,7 @@ static void drain_stock(struct memcg_stock_pcp *stock)
>  
>  	if (stock->nr_pages) {
>  		page_counter_uncharge(&old->memory, stock->nr_pages);
> -		if (do_swap_account)
> +		if (do_memsw_account())
>  			page_counter_uncharge(&old->memsw, stock->nr_pages);
>  		css_put_many(&old->css, stock->nr_pages);
>  		stock->nr_pages = 0;
> @@ -2033,11 +2039,11 @@ retry:
>  	if (consume_stock(memcg, nr_pages))
>  		return 0;
>  
> -	if (!do_swap_account ||
> +	if (!do_memsw_account() ||
>  	    page_counter_try_charge(&memcg->memsw, batch, &counter)) {
>  		if (page_counter_try_charge(&memcg->memory, batch, &counter))
>  			goto done_restock;
> -		if (do_swap_account)
> +		if (do_memsw_account())
>  			page_counter_uncharge(&memcg->memsw, batch);
>  		mem_over_limit = mem_cgroup_from_counter(counter, memory);
>  	} else {
> @@ -2124,7 +2130,7 @@ force:
>  	 * temporarily by force charging it.
>  	 */
>  	page_counter_charge(&memcg->memory, nr_pages);
> -	if (do_swap_account)
> +	if (do_memsw_account())
>  		page_counter_charge(&memcg->memsw, nr_pages);
>  	css_get_many(&memcg->css, nr_pages);
>  
> @@ -2161,7 +2167,7 @@ static void cancel_charge(struct mem_cgroup *memcg, unsigned int nr_pages)
>  		return;
>  
>  	page_counter_uncharge(&memcg->memory, nr_pages);
> -	if (do_swap_account)
> +	if (do_memsw_account())
>  		page_counter_uncharge(&memcg->memsw, nr_pages);
>  
>  	css_put_many(&memcg->css, nr_pages);
> @@ -2441,7 +2447,7 @@ void __memcg_kmem_uncharge(struct page *page, int order)
>  
>  	page_counter_uncharge(&memcg->kmem, nr_pages);
>  	page_counter_uncharge(&memcg->memory, nr_pages);
> -	if (do_swap_account)
> +	if (do_memsw_account())
>  		page_counter_uncharge(&memcg->memsw, nr_pages);
>  
>  	page->mem_cgroup = NULL;
> @@ -3154,7 +3160,7 @@ static int memcg_stat_show(struct seq_file *m, void *v)
>  	BUILD_BUG_ON(ARRAY_SIZE(mem_cgroup_lru_names) != NR_LRU_LISTS);
>  
>  	for (i = 0; i < MEM_CGROUP_STAT_NSTATS; i++) {
> -		if (i == MEM_CGROUP_STAT_SWAP && !do_swap_account)
> +		if (i == MEM_CGROUP_STAT_SWAP && !do_memsw_account())
>  			continue;
>  		seq_printf(m, "%s %lu\n", mem_cgroup_stat_names[i],
>  			   mem_cgroup_read_stat(memcg, i) * PAGE_SIZE);
> @@ -3176,14 +3182,14 @@ static int memcg_stat_show(struct seq_file *m, void *v)
>  	}
>  	seq_printf(m, "hierarchical_memory_limit %llu\n",
>  		   (u64)memory * PAGE_SIZE);
> -	if (do_swap_account)
> +	if (do_memsw_account())
>  		seq_printf(m, "hierarchical_memsw_limit %llu\n",
>  			   (u64)memsw * PAGE_SIZE);
>  
>  	for (i = 0; i < MEM_CGROUP_STAT_NSTATS; i++) {
>  		unsigned long long val = 0;
>  
> -		if (i == MEM_CGROUP_STAT_SWAP && !do_swap_account)
> +		if (i == MEM_CGROUP_STAT_SWAP && !do_memsw_account())
>  			continue;
>  		for_each_mem_cgroup_tree(mi, memcg)
>  			val += mem_cgroup_read_stat(mi, i) * PAGE_SIZE;
> @@ -3314,7 +3320,7 @@ static void mem_cgroup_threshold(struct mem_cgroup *memcg)
>  {
>  	while (memcg) {
>  		__mem_cgroup_threshold(memcg, false);
> -		if (do_swap_account)
> +		if (do_memsw_account())
>  			__mem_cgroup_threshold(memcg, true);
>  
>  		memcg = parent_mem_cgroup(memcg);
> @@ -4460,7 +4466,7 @@ static struct page *mc_handle_swap_pte(struct vm_area_struct *vma,
>  	 * we call find_get_page() with swapper_space directly.
>  	 */
>  	page = find_get_page(swap_address_space(ent), ent.val);
> -	if (do_swap_account)
> +	if (do_memsw_account())
>  		entry->val = ent.val;
>  
>  	return page;
> @@ -4495,7 +4501,7 @@ static struct page *mc_handle_file_pte(struct vm_area_struct *vma,
>  		page = find_get_entry(mapping, pgoff);
>  		if (radix_tree_exceptional_entry(page)) {
>  			swp_entry_t swp = radix_to_swp_entry(page);
> -			if (do_swap_account)
> +			if (do_memsw_account())
>  				*entry = swp;
>  			page = find_get_page(swap_address_space(swp), swp.val);
>  		}
> @@ -5270,7 +5276,7 @@ int mem_cgroup_try_charge(struct page *page, struct mm_struct *mm,
>  		if (page->mem_cgroup)
>  			goto out;
>  
> -		if (do_swap_account) {
> +		if (do_memsw_account()) {
>  			swp_entry_t ent = { .val = page_private(page), };
>  			unsigned short id = lookup_swap_cgroup_id(ent);
>  
> @@ -5334,7 +5340,7 @@ void mem_cgroup_commit_charge(struct page *page, struct mem_cgroup *memcg,
>  	memcg_check_events(memcg, page);
>  	local_irq_enable();
>  
> -	if (do_swap_account && PageSwapCache(page)) {
> +	if (do_memsw_account() && PageSwapCache(page)) {
>  		swp_entry_t entry = { .val = page_private(page) };
>  		/*
>  		 * The swap entry might not get freed for a long time,
> @@ -5379,7 +5385,7 @@ static void uncharge_batch(struct mem_cgroup *memcg, unsigned long pgpgout,
>  
>  	if (!mem_cgroup_is_root(memcg)) {
>  		page_counter_uncharge(&memcg->memory, nr_pages);
> -		if (do_swap_account)
> +		if (do_memsw_account())
>  			page_counter_uncharge(&memcg->memsw, nr_pages);
>  		memcg_oom_recover(memcg);
>  	}
> @@ -5587,7 +5593,7 @@ void mem_cgroup_swapout(struct page *page, swp_entry_t entry)
>  	VM_BUG_ON_PAGE(PageLRU(page), page);
>  	VM_BUG_ON_PAGE(page_count(page), page);
>  
> -	if (!do_swap_account)
> +	if (!do_memsw_account())
>  		return;
>  
>  	memcg = page->mem_cgroup;
> @@ -5627,7 +5633,7 @@ void mem_cgroup_uncharge_swap(swp_entry_t entry)
>  	struct mem_cgroup *memcg;
>  	unsigned short id;
>  
> -	if (!do_swap_account)
> +	if (!do_memsw_account())
>  		return;
>  
>  	id = swap_cgroup_record(entry, 0);
> -- 
> 2.6.2

-- 
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 61+ messages in thread

* Re: [PATCH 10/14] mm: memcontrol: generalize the socket accounting jump label
  2015-11-12 23:41 ` [PATCH 10/14] mm: memcontrol: generalize the socket accounting jump label Johannes Weiner
@ 2015-11-13 10:43   ` Michal Hocko
  2015-11-14 13:29   ` Vladimir Davydov
  1 sibling, 0 replies; 61+ messages in thread
From: Michal Hocko @ 2015-11-13 10:43 UTC (permalink / raw)
  To: Johannes Weiner
  Cc: David Miller, Andrew Morton, Vladimir Davydov, Tejun Heo, netdev,
	linux-mm, cgroups, linux-kernel, kernel-team

On Thu 12-11-15 18:41:29, Johannes Weiner wrote:
> The unified hierarchy memory controller is going to use this jump
> label as well to control the networking callbacks. Move it to the
> memory controller code and give it a more generic name.
> 
> Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>

Yes, it makes more sense in memcg proper.
Acked-by: Michal Hocko <mhocko@suse.com>

> ---
>  include/linux/memcontrol.h | 4 ++++
>  include/net/sock.h         | 7 -------
>  mm/memcontrol.c            | 3 +++
>  net/core/sock.c            | 5 -----
>  net/ipv4/tcp_memcontrol.c  | 4 ++--
>  5 files changed, 9 insertions(+), 14 deletions(-)
> 
> diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
> index 1c71f27..4cf5afa 100644
> --- a/include/linux/memcontrol.h
> +++ b/include/linux/memcontrol.h
> @@ -693,6 +693,8 @@ static inline void mem_cgroup_wb_stats(struct bdi_writeback *wb,
>  
>  #if defined(CONFIG_INET) && defined(CONFIG_MEMCG_KMEM)
>  struct sock;
> +extern struct static_key memcg_sockets_enabled_key;
> +#define mem_cgroup_sockets_enabled static_key_false(&memcg_sockets_enabled_key)
>  void sock_update_memcg(struct sock *sk);
>  void sock_release_memcg(struct sock *sk);
>  bool mem_cgroup_charge_skmem(struct mem_cgroup *memcg, unsigned int nr_pages);
> @@ -701,6 +703,8 @@ static inline bool mem_cgroup_under_socket_pressure(struct mem_cgroup *memcg)
>  {
>  	return memcg->tcp_mem.memory_pressure;
>  }
> +#else
> +#define mem_cgroup_sockets_enabled 0
>  #endif /* CONFIG_INET && CONFIG_MEMCG_KMEM */
>  
>  #ifdef CONFIG_MEMCG_KMEM
> diff --git a/include/net/sock.h b/include/net/sock.h
> index b439dcc..bf1b901 100644
> --- a/include/net/sock.h
> +++ b/include/net/sock.h
> @@ -1065,13 +1065,6 @@ static inline void sk_refcnt_debug_release(const struct sock *sk)
>  #define sk_refcnt_debug_release(sk) do { } while (0)
>  #endif /* SOCK_REFCNT_DEBUG */
>  
> -#if defined(CONFIG_MEMCG_KMEM) && defined(CONFIG_NET)
> -extern struct static_key memcg_socket_limit_enabled;
> -#define mem_cgroup_sockets_enabled static_key_false(&memcg_socket_limit_enabled)
> -#else
> -#define mem_cgroup_sockets_enabled 0
> -#endif
> -
>  static inline bool sk_stream_memory_free(const struct sock *sk)
>  {
>  	if (sk->sk_wmem_queued >= sk->sk_sndbuf)
> diff --git a/mm/memcontrol.c b/mm/memcontrol.c
> index 89b1d9e..658bef2 100644
> --- a/mm/memcontrol.c
> +++ b/mm/memcontrol.c
> @@ -291,6 +291,9 @@ static inline struct mem_cgroup *mem_cgroup_from_id(unsigned short id)
>  /* Writing them here to avoid exposing memcg's inner layout */
>  #if defined(CONFIG_INET) && defined(CONFIG_MEMCG_KMEM)
>  
> +struct static_key memcg_sockets_enabled_key;
> +EXPORT_SYMBOL(memcg_sockets_enabled_key);
> +
>  void sock_update_memcg(struct sock *sk)
>  {
>  	struct mem_cgroup *memcg;
> diff --git a/net/core/sock.c b/net/core/sock.c
> index 6486b0d..c5435b5 100644
> --- a/net/core/sock.c
> +++ b/net/core/sock.c
> @@ -201,11 +201,6 @@ EXPORT_SYMBOL(sk_net_capable);
>  static struct lock_class_key af_family_keys[AF_MAX];
>  static struct lock_class_key af_family_slock_keys[AF_MAX];
>  
> -#if defined(CONFIG_MEMCG_KMEM)
> -struct static_key memcg_socket_limit_enabled;
> -EXPORT_SYMBOL(memcg_socket_limit_enabled);
> -#endif
> -
>  /*
>   * Make lock validator output more readable. (we pre-construct these
>   * strings build-time, so that runtime initialization of socket
> diff --git a/net/ipv4/tcp_memcontrol.c b/net/ipv4/tcp_memcontrol.c
> index 47addc3..17df9dd 100644
> --- a/net/ipv4/tcp_memcontrol.c
> +++ b/net/ipv4/tcp_memcontrol.c
> @@ -34,7 +34,7 @@ void tcp_destroy_cgroup(struct mem_cgroup *memcg)
>  		return;
>  
>  	if (test_bit(MEMCG_SOCK_ACTIVATED, &memcg->tcp_mem.flags))
> -		static_key_slow_dec(&memcg_socket_limit_enabled);
> +		static_key_slow_dec(&memcg_sockets_enabled_key);
>  }
>  
>  static int tcp_update_limit(struct mem_cgroup *memcg, unsigned long nr_pages)
> @@ -73,7 +73,7 @@ static int tcp_update_limit(struct mem_cgroup *memcg, unsigned long nr_pages)
>  		 */
>  		if (!test_and_set_bit(MEMCG_SOCK_ACTIVATED,
>  				      &memcg->tcp_mem.flags))
> -			static_key_slow_inc(&memcg_socket_limit_enabled);
> +			static_key_slow_inc(&memcg_sockets_enabled_key);
>  		set_bit(MEMCG_SOCK_ACTIVE, &memcg->tcp_mem.flags);
>  	}
>  
> -- 
> 2.6.2

-- 
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 61+ messages in thread
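
For readers unfamiliar with the jump-label API being relocated here: the hot-path test compiles to a patchable no-op and the slow-path inc/dec rewrites the branch at runtime, so disabled-case callers pay nothing. A minimal sketch with hypothetical demo_* names, mirroring the zero-initialized declaration in the hunk:

static struct static_key demo_key;	/* zero-initialized: disabled */

static void demo_slow_feature(void)
{
	/* the rarely-enabled work */
}

static inline void demo_hot_path(void)
{
	if (static_key_false(&demo_key))	/* a no-op until enabled */
		demo_slow_feature();
}

static void demo_enable(void)
{
	static_key_slow_inc(&demo_key);	/* patches the branch in */
}

static void demo_disable(void)
{
	static_key_slow_dec(&demo_key);	/* restores the no-op at zero refs */
}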

* Re: [PATCH 01/14] mm: memcontrol: export root_mem_cgroup
  2015-11-12 23:41 ` [PATCH 01/14] mm: memcontrol: export root_mem_cgroup Johannes Weiner
@ 2015-11-13 15:59   ` David Miller
  2015-11-14 12:17   ` Vladimir Davydov
  1 sibling, 0 replies; 61+ messages in thread
From: David Miller @ 2015-11-13 15:59 UTC (permalink / raw)
  To: hannes
  Cc: akpm, vdavydov, tj, mhocko, netdev, linux-mm, cgroups,
	linux-kernel, kernel-team

From: Johannes Weiner <hannes@cmpxchg.org>
Date: Thu, 12 Nov 2015 18:41:20 -0500

> A later patch will need this symbol in files other than memcontrol.c,
> so export it now and replace mem_cgroup_root_css at the same time.
> 
> Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
> Acked-by: Michal Hocko <mhocko@suse.com>

Acked-by: David S. Miller <davem@davemloft.net>

^ permalink raw reply	[flat|nested] 61+ messages in thread

* Re: [PATCH 02/14] mm: vmscan: simplify memcg vs. global shrinker invocation
  2015-11-12 23:41 ` [PATCH 02/14] mm: vmscan: simplify memcg vs. global shrinker invocation Johannes Weiner
@ 2015-11-13 15:59   ` David Miller
  2015-11-14 12:36   ` Vladimir Davydov
  1 sibling, 0 replies; 61+ messages in thread
From: David Miller @ 2015-11-13 15:59 UTC (permalink / raw)
  To: hannes
  Cc: akpm, vdavydov, tj, mhocko, netdev, linux-mm, cgroups,
	linux-kernel, kernel-team

From: Johannes Weiner <hannes@cmpxchg.org>
Date: Thu, 12 Nov 2015 18:41:21 -0500

> Letting shrink_slab() handle the root_mem_cgroup, and implicitly the
> !CONFIG_MEMCG case, allows shrink_zone() to invoke the shrinkers
> unconditionally from within the memcg iteration loop.
> 
> Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
> Acked-by: Michal Hocko <mhocko@suse.com>

Acked-by: David S. Miller <davem@davemloft.net>

^ permalink raw reply	[flat|nested] 61+ messages in thread

* Re: [PATCH 03/14] net: tcp_memcontrol: properly detect ancestor socket pressure
  2015-11-12 23:41 ` [PATCH 03/14] net: tcp_memcontrol: properly detect ancestor socket pressure Johannes Weiner
@ 2015-11-13 16:00   ` David Miller
  2015-11-14 12:45   ` Vladimir Davydov
  1 sibling, 0 replies; 61+ messages in thread
From: David Miller @ 2015-11-13 16:00 UTC (permalink / raw)
  To: hannes
  Cc: akpm, vdavydov, tj, mhocko, netdev, linux-mm, cgroups,
	linux-kernel, kernel-team

From: Johannes Weiner <hannes@cmpxchg.org>
Date: Thu, 12 Nov 2015 18:41:22 -0500

> When charging socket memory, the code currently checks only the local
> page counter for excess to determine whether the memcg is under socket
> pressure. But even if the local counter is fine, one of the ancestors
> could have breached its limit, which should also force this child to
> enter socket pressure. This currently doesn't happen.
> 
> Fix this by using page_counter_try_charge() first. If that fails, it
> means that either the local counter or one of the ancestors is in
> excess of its limit, and the child should enter socket pressure.
> 
> Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>

Acked-by: David S. Miller <davem@davemloft.net>

^ permalink raw reply	[flat|nested] 61+ messages in thread
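
The fix leans on the hierarchical contract of page_counter_try_charge(): the charge is attempted against the counter and every ancestor, and fully backed out if any level would exceed its limit, with *fail pointing at the first counter in excess. A condensed sketch of that contract (abridged from mm/page_counter.c of that era; watermark tracking omitted):

bool page_counter_try_charge(struct page_counter *counter,
			     unsigned long nr_pages,
			     struct page_counter **fail)
{
	struct page_counter *c;

	for (c = counter; c; c = c->parent) {
		long new = atomic_long_add_return(nr_pages, &c->count);

		if (new > c->limit) {
			atomic_long_sub(nr_pages, &c->count);
			*fail = c;	/* local counter or an ancestor */
			goto failed;
		}
	}
	return true;

failed:
	for (c = counter; c != *fail; c = c->parent)
		page_counter_cancel(c, nr_pages);
	return false;
}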

* Re: [PATCH 04/14] net: tcp_memcontrol: remove bogus hierarchy pressure propagation
  2015-11-12 23:41 ` [PATCH 04/14] net: tcp_memcontrol: remove bogus hierarchy pressure propagation Johannes Weiner
@ 2015-11-13 16:00   ` David Miller
  2015-11-20  9:07   ` Vladimir Davydov
  1 sibling, 0 replies; 61+ messages in thread
From: David Miller @ 2015-11-13 16:00 UTC (permalink / raw)
  To: hannes
  Cc: akpm, vdavydov, tj, mhocko, netdev, linux-mm, cgroups,
	linux-kernel, kernel-team

From: Johannes Weiner <hannes@cmpxchg.org>
Date: Thu, 12 Nov 2015 18:41:23 -0500

> When a cgroup breaches its socket memory limit, it currently enters
> memory pressure mode for itself and its *ancestors*. This throttles
> transmission in unrelated sibling and cousin subtrees that have
> nothing to do with the breached limit.
> 
> On the contrary, breaching a limit should make that group and its
> *children* enter memory pressure mode. But this happens already,
> albeit lazily: if an ancestor limit is breached, siblings will enter
> memory pressure on their own once the next packet arrives for them.
> 
> So no additional hierarchy code is needed. Remove the bogus stuff.
> 
> Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>

Acked-by: David S. Miller <davem@davemloft.net>

^ permalink raw reply	[flat|nested] 61+ messages in thread

* Re: [PATCH 05/14] net: tcp_memcontrol: protect all tcp_memcontrol calls by jump-label
  2015-11-12 23:41 ` [PATCH 05/14] net: tcp_memcontrol: protect all tcp_memcontrol calls by jump-label Johannes Weiner
@ 2015-11-13 16:01   ` David Miller
  2015-11-14 16:33   ` Vladimir Davydov
  1 sibling, 0 replies; 61+ messages in thread
From: David Miller @ 2015-11-13 16:01 UTC (permalink / raw)
  To: hannes
  Cc: akpm, vdavydov, tj, mhocko, netdev, linux-mm, cgroups,
	linux-kernel, kernel-team

From: Johannes Weiner <hannes@cmpxchg.org>
Date: Thu, 12 Nov 2015 18:41:24 -0500

> Move the jump-label from sock_update_memcg() and sock_release_memcg()
> to the callsite, and so eliminate those function calls when socket
> accounting is not enabled.
> 
> This also eliminates the need for dummy functions because the calls
> will be optimized away if the Kconfig options are not enabled.
> 
> Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>

Acked-by: David S. Miller <davem@davemloft.net>

^ permalink raw reply	[flat|nested] 61+ messages in thread

* Re: [PATCH 06/14] net: tcp_memcontrol: remove dead per-memcg count of allocated sockets
  2015-11-12 23:41 ` [PATCH 06/14] net: tcp_memcontrol: remove dead per-memcg count of allocated sockets Johannes Weiner
@ 2015-11-13 16:01   ` David Miller
  2015-11-20  9:48   ` Vladimir Davydov
  1 sibling, 0 replies; 61+ messages in thread
From: David Miller @ 2015-11-13 16:01 UTC (permalink / raw)
  To: hannes
  Cc: akpm, vdavydov, tj, mhocko, netdev, linux-mm, cgroups,
	linux-kernel, kernel-team

From: Johannes Weiner <hannes@cmpxchg.org>
Date: Thu, 12 Nov 2015 18:41:25 -0500

> The number of allocated sockets is used for calculations in the soft
> limit phase, where packets are accepted but the socket is under memory
> pressure. Since there is no soft limit phase in tcp_memcontrol, and
> memory pressure is only entered when packets are already dropped, this
> is actually dead code. Remove it.
> 
> As this is the last user of parent_cg_proto(), remove that too.
> 
> Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>

Acked-by: David S. Miller <davem@davemloft.net>

^ permalink raw reply	[flat|nested] 61+ messages in thread
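
For context, the "soft limit phase" referenced in the changelog comes from the three entries of the global tcp_mem sysctl (in pages): below min, allocations proceed freely; between pressure and max, allocations still succeed but the protocol is under memory pressure; above max, they fail. These are the same thresholds read back as sk_prot_mem_limits(sk, 0..2). Illustrative output (the values are auto-sized per machine, so these numbers are only an example):

# sysctl net.ipv4.tcp_mem
net.ipv4.tcp_mem = 92406	123208	184812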

* Re: [PATCH 01/14] mm: memcontrol: export root_mem_cgroup
  2015-11-12 23:41 ` [PATCH 01/14] mm: memcontrol: export root_mem_cgroup Johannes Weiner
  2015-11-13 15:59   ` David Miller
@ 2015-11-14 12:17   ` Vladimir Davydov
  1 sibling, 0 replies; 61+ messages in thread
From: Vladimir Davydov @ 2015-11-14 12:17 UTC (permalink / raw)
  To: Johannes Weiner
  Cc: David Miller, Andrew Morton, Tejun Heo, Michal Hocko, netdev,
	linux-mm, cgroups, linux-kernel, kernel-team

On Thu, Nov 12, 2015 at 06:41:20PM -0500, Johannes Weiner wrote:
> A later patch will need this symbol in files other than memcontrol.c,
> so export it now and replace mem_cgroup_root_css at the same time.
> 
> Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
> Acked-by: Michal Hocko <mhocko@suse.com>

Reviewed-by: Vladimir Davydov <vdavydov@virtuozzo.com>

^ permalink raw reply	[flat|nested] 61+ messages in thread

* Re: [PATCH 02/14] mm: vmscan: simplify memcg vs. global shrinker invocation
  2015-11-12 23:41 ` [PATCH 02/14] mm: vmscan: simplify memcg vs. global shrinker invocation Johannes Weiner
  2015-11-13 15:59   ` David Miller
@ 2015-11-14 12:36   ` Vladimir Davydov
  2015-11-14 15:06     ` Johannes Weiner
  1 sibling, 1 reply; 61+ messages in thread
From: Vladimir Davydov @ 2015-11-14 12:36 UTC (permalink / raw)
  To: Johannes Weiner
  Cc: David Miller, Andrew Morton, Tejun Heo, Michal Hocko, netdev,
	linux-mm, cgroups, linux-kernel, kernel-team

On Thu, Nov 12, 2015 at 06:41:21PM -0500, Johannes Weiner wrote:
...
> diff --git a/mm/vmscan.c b/mm/vmscan.c
> index a4507ec..e4f5b3c 100644
> --- a/mm/vmscan.c
> +++ b/mm/vmscan.c
> @@ -411,6 +411,10 @@ static unsigned long shrink_slab(gfp_t gfp_mask, int nid,
>  	struct shrinker *shrinker;
>  	unsigned long freed = 0;
>  
> +	/* Global shrinker mode */
> +	if (memcg == root_mem_cgroup)
> +		memcg = NULL;
> +
>  	if (memcg && !memcg_kmem_is_active(memcg))
>  		return 0;
>  
> @@ -2410,11 +2414,22 @@ static bool shrink_zone(struct zone *zone, struct scan_control *sc,
>  			shrink_lruvec(lruvec, swappiness, sc, &lru_pages);
>  			zone_lru_pages += lru_pages;
>  
> -			if (memcg && is_classzone)
> +			/*
> +			 * Shrink the slab caches in the same proportion that
> +			 * the eligible LRU pages were scanned.
> +			 */
> +			if (is_classzone) {
>  				shrink_slab(sc->gfp_mask, zone_to_nid(zone),
>  					    memcg, sc->nr_scanned - scanned,
>  					    lru_pages);
>  
> +				if (reclaim_state) {
> +					sc->nr_reclaimed +=
> +						reclaim_state->reclaimed_slab;
> +					reclaim_state->reclaimed_slab = 0;
> +				}
> +			}
> +
>  			/*
>  			 * Direct reclaim and kswapd have to scan all memory
>  			 * cgroups to fulfill the overall scan target for the
> @@ -2432,20 +2447,6 @@ static bool shrink_zone(struct zone *zone, struct scan_control *sc,
>  			}
>  		} while ((memcg = mem_cgroup_iter(root, memcg, &reclaim)));
>  
> -		/*
> -		 * Shrink the slab caches in the same proportion that
> -		 * the eligible LRU pages were scanned.
> -		 */
> -		if (global_reclaim(sc) && is_classzone)
> -			shrink_slab(sc->gfp_mask, zone_to_nid(zone), NULL,
> -				    sc->nr_scanned - nr_scanned,
> -				    zone_lru_pages);
> -
> -		if (reclaim_state) {
> -			sc->nr_reclaimed += reclaim_state->reclaimed_slab;
> -			reclaim_state->reclaimed_slab = 0;
> -		}
> -

AFAICS this patch badly breaks the memcg-unaware shrinkers vs LRU
balance: currently we scan a (*total* LRU scanned / *total* LRU pages)
fraction of all such objects; with this patch we'd use the numbers from
the root cgroup instead. If most processes reside in memory cgroups, the
root cgroup will have only a few LRU pages, and hence the pressure
exerted upon such objects will be unfairly severe.

Thanks,
Vladimir

>  		vmpressure(sc->gfp_mask, sc->target_mem_cgroup,
>  			   sc->nr_scanned - nr_scanned,
>  			   sc->nr_reclaimed - nr_reclaimed);

^ permalink raw reply	[flat|nested] 61+ messages in thread
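
The proportionality Vladimir refers to is computed in do_shrink_slab(), which sizes each shrinker's scan target by the LRU scan ratio it is handed (lightly abridged from mm/vmscan.c of that era):

freeable = shrinker->count_objects(shrinker, shrinkctl);

/* scan a (nr_scanned / nr_eligible) fraction of the freeable objects */
delta = (4 * nr_scanned) / shrinker->seeks;
delta *= freeable;
do_div(delta, nr_eligible + 1);
total_scan += delta;

With the patch, memcg-unaware shrinkers would be fed the root cgroup's nr_scanned and nr_eligible; a tiny root-group nr_eligible inflates delta, which is exactly the unfairly severe pressure described above.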

* Re: [PATCH 03/14] net: tcp_memcontrol: properly detect ancestor socket pressure
  2015-11-12 23:41 ` [PATCH 03/14] net: tcp_memcontrol: properly detect ancestor socket pressure Johannes Weiner
  2015-11-13 16:00   ` David Miller
@ 2015-11-14 12:45   ` Vladimir Davydov
  2015-11-14 15:15     ` Johannes Weiner
  1 sibling, 1 reply; 61+ messages in thread
From: Vladimir Davydov @ 2015-11-14 12:45 UTC (permalink / raw)
  To: Johannes Weiner
  Cc: David Miller, Andrew Morton, Tejun Heo, Michal Hocko, netdev,
	linux-mm, cgroups, linux-kernel, kernel-team

On Thu, Nov 12, 2015 at 06:41:22PM -0500, Johannes Weiner wrote:
> When charging socket memory, the code currently checks only the local
> page counter for excess to determine whether the memcg is under socket
> pressure. But even if the local counter is fine, one of the ancestors
> could have breached its limit, which should also force this child to
> enter socket pressure. This currently doesn't happen.
> 
> Fix this by using page_counter_try_charge() first. If that fails, it
> means that either the local counter or one of the ancestors is in
> excess of its limit, and the child should enter socket pressure.
> 
> Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>

Reviewed-by: Vladimir Davydov <vdavydov@virtuozzo.com>

For the record: it was broken by commit 3e32cb2e0a12 ("mm: memcontrol:
lockless page counters").

^ permalink raw reply	[flat|nested] 61+ messages in thread

* Re: [PATCH 11/14] mm: memcontrol: do not account memory+swap on unified hierarchy
  2015-11-12 23:41 ` [PATCH 11/14] mm: memcontrol: do not account memory+swap on unified hierarchy Johannes Weiner
  2015-11-13 10:37   ` Michal Hocko
@ 2015-11-14 13:23   ` Vladimir Davydov
  1 sibling, 0 replies; 61+ messages in thread
From: Vladimir Davydov @ 2015-11-14 13:23 UTC (permalink / raw)
  To: Johannes Weiner
  Cc: David Miller, Andrew Morton, Tejun Heo, Michal Hocko, netdev,
	linux-mm, cgroups, linux-kernel, kernel-team

On Thu, Nov 12, 2015 at 06:41:30PM -0500, Johannes Weiner wrote:
> The unified hierarchy memory controller doesn't expose the memory+swap
> counter to userspace, but its accounting is hardcoded in all charge
> paths right now, including the per-cpu charge cache ("the stock").
> 
> To avoid adding yet more pointless memory+swap accounting with the
> socket memory support in unified hierarchy, disable the counter
> altogether when in unified hierarchy mode.
> 
> Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>

Reviewed-by: Vladimir Davydov <vdavydov@virtuozzo.com>

^ permalink raw reply	[flat|nested] 61+ messages in thread

* Re: [PATCH 10/14] mm: memcontrol: generalize the socket accounting jump label
  2015-11-12 23:41 ` [PATCH 10/14] mm: memcontrol: generalize the socket accounting jump label Johannes Weiner
  2015-11-13 10:43   ` Michal Hocko
@ 2015-11-14 13:29   ` Vladimir Davydov
  1 sibling, 0 replies; 61+ messages in thread
From: Vladimir Davydov @ 2015-11-14 13:29 UTC (permalink / raw)
  To: Johannes Weiner
  Cc: David Miller, Andrew Morton, Tejun Heo, Michal Hocko, netdev,
	linux-mm, cgroups, linux-kernel, kernel-team

On Thu, Nov 12, 2015 at 06:41:29PM -0500, Johannes Weiner wrote:
> The unified hierarchy memory controller is going to use this jump
> label as well to control the networking callbacks. Move it to the
> memory controller code and give it a more generic name.
> 
> Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>

Reviewed-by: Vladimir Davydov <vdavydov@virtuozzo.com>

^ permalink raw reply	[flat|nested] 61+ messages in thread

* Re: [PATCH 02/14] mm: vmscan: simplify memcg vs. global shrinker invocation
  2015-11-14 12:36   ` Vladimir Davydov
@ 2015-11-14 15:06     ` Johannes Weiner
  0 siblings, 0 replies; 61+ messages in thread
From: Johannes Weiner @ 2015-11-14 15:06 UTC (permalink / raw)
  To: Vladimir Davydov
  Cc: David Miller, Andrew Morton, Tejun Heo, Michal Hocko, netdev,
	linux-mm, cgroups, linux-kernel, kernel-team

On Sat, Nov 14, 2015 at 03:36:50PM +0300, Vladimir Davydov wrote:
> On Thu, Nov 12, 2015 at 06:41:21PM -0500, Johannes Weiner wrote:
> > @@ -2432,20 +2447,6 @@ static bool shrink_zone(struct zone *zone, struct scan_control *sc,
> >  			}
> >  		} while ((memcg = mem_cgroup_iter(root, memcg, &reclaim)));
> >  
> > -		/*
> > -		 * Shrink the slab caches in the same proportion that
> > -		 * the eligible LRU pages were scanned.
> > -		 */
> > -		if (global_reclaim(sc) && is_classzone)
> > -			shrink_slab(sc->gfp_mask, zone_to_nid(zone), NULL,
> > -				    sc->nr_scanned - nr_scanned,
> > -				    zone_lru_pages);
> > -
> > -		if (reclaim_state) {
> > -			sc->nr_reclaimed += reclaim_state->reclaimed_slab;
> > -			reclaim_state->reclaimed_slab = 0;
> > -		}
> > -
> 
> AFAICS this patch badly breaks the memcg-unaware shrinkers vs LRU
> balance: currently we scan a (*total* LRU scanned / *total* LRU pages)
> fraction of all such objects; with this patch we'd use the numbers from
> the root cgroup instead. If most processes reside in memory cgroups, the
> root cgroup will have only a few LRU pages, and hence the pressure
> exerted upon such objects will be unfairly severe.

You're absolutely right, good catch.

Please disregard this patch. It's not necessary for this series after
v2, I just kept it because I thought it's a nice simplification that's
possible after making root_mem_cgroup public.

^ permalink raw reply	[flat|nested] 61+ messages in thread

* Re: [PATCH 03/14] net: tcp_memcontrol: properly detect ancestor socket pressure
  2015-11-14 12:45   ` Vladimir Davydov
@ 2015-11-14 15:15     ` Johannes Weiner
  0 siblings, 0 replies; 61+ messages in thread
From: Johannes Weiner @ 2015-11-14 15:15 UTC (permalink / raw)
  To: Vladimir Davydov
  Cc: David Miller, Andrew Morton, Tejun Heo, Michal Hocko, netdev,
	linux-mm, cgroups, linux-kernel, kernel-team

On Sat, Nov 14, 2015 at 03:45:52PM +0300, Vladimir Davydov wrote:
> On Thu, Nov 12, 2015 at 06:41:22PM -0500, Johannes Weiner wrote:
> > When charging socket memory, the code currently checks only the local
> > page counter for excess to determine whether the memcg is under socket
> > pressure. But even if the local counter is fine, one of the ancestors
> > could have breached its limit, which should also force this child to
> > enter socket pressure. This currently doesn't happen.
> > 
> > Fix this by using page_counter_try_charge() first. If that fails, it
> > means that either the local counter or one of the ancestors is in
> > excess of its limit, and the child should enter socket pressure.
> > 
> > Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
> 
> Reviewed-by: Vladimir Davydov <vdavydov@virtuozzo.com>

Thanks, Vladimir!

> For the record: it was broken by commit 3e32cb2e0a12 ("mm: memcontrol:
> lockless page counters").

I didn't realize that.

Fixes: 3e32cb2e0a12 ("mm: memcontrol: lockless page counters")

^ permalink raw reply	[flat|nested] 61+ messages in thread

* Re: [PATCH 05/14] net: tcp_memcontrol: protect all tcp_memcontrol calls by jump-label
  2015-11-12 23:41 ` [PATCH 05/14] net: tcp_memcontrol: protect all tcp_memcontrol calls by jump-label Johannes Weiner
  2015-11-13 16:01   ` David Miller
@ 2015-11-14 16:33   ` Vladimir Davydov
  2015-11-16 17:52     ` Johannes Weiner
  1 sibling, 1 reply; 61+ messages in thread
From: Vladimir Davydov @ 2015-11-14 16:33 UTC (permalink / raw)
  To: Johannes Weiner
  Cc: David Miller, Andrew Morton, Tejun Heo, Michal Hocko, netdev,
	linux-mm, cgroups, linux-kernel, kernel-team

On Thu, Nov 12, 2015 at 06:41:24PM -0500, Johannes Weiner wrote:
> Move the jump-label from sock_update_memcg() and sock_release_memcg()
> to the callsite, and so eliminate those function calls when socket
> accounting is not enabled.

I don't believe this patch is necessary, because these functions aren't
hot paths. Neither do I think it makes the code look better. Anyway,
it's rather a matter of personal preference, and the patch looks correct
to me, so

Reviewed-by: Vladimir Davydov <vdavydov@virtuozzo.com>

> 
> This also eliminates the need for dummy functions because the calls
> will be optimized away if the Kconfig options are not enabled.
> 
> Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>

^ permalink raw reply	[flat|nested] 61+ messages in thread

* Re: [PATCH 14/14] mm: memcontrol: hook up vmpressure to socket pressure
  2015-11-12 23:41 ` [PATCH 14/14] mm: memcontrol: hook up vmpressure to socket pressure Johannes Weiner
@ 2015-11-15 13:54   ` Vladimir Davydov
  2015-11-16 18:53     ` Johannes Weiner
  0 siblings, 1 reply; 61+ messages in thread
From: Vladimir Davydov @ 2015-11-15 13:54 UTC (permalink / raw)
  To: Johannes Weiner
  Cc: David Miller, Andrew Morton, Tejun Heo, Michal Hocko, netdev,
	linux-mm, cgroups, linux-kernel, kernel-team

On Thu, Nov 12, 2015 at 06:41:33PM -0500, Johannes Weiner wrote:
> Let the networking stack know when a memcg is under reclaim pressure
> so that it can clamp its transmit windows accordingly.
> 
> Whenever the reclaim efficiency of a cgroup's LRU lists drops low
> enough for a MEDIUM or HIGH vmpressure event to occur, assert a
> pressure state in the socket and tcp memory code that tells it to curb
> consumption growth from sockets associated with said control group.
> 
> vmpressure events are naturally edge triggered, so for hysteresis
> assert socket pressure for a second to allow for subsequent vmpressure
> events to occur before letting the socket code return to normal.

AFAICS, in contrast to v1, you no longer modify vmpressure behavior,
which means socket_pressure will only be set when a cgroup hits its
high/hard limit. On a tightly packed system, this is unlikely IMO -
cgroups will mostly experience pressure due to memory shortage at the
global level and/or their low limit configuration, in which case no
vmpressure events will be triggered and therefore the tcp window won't
be clamped accordingly.

Maybe we could use a per-memcg slab shrinker to detect memory
pressure? This looks like abusing the shrinker API, though.

Thanks,
Vladimir

^ permalink raw reply	[flat|nested] 61+ messages in thread
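
The hysteresis described in the quoted changelog amounts to a time-based latch. A minimal sketch of the mechanism, assuming the jiffies-valued socket_pressure field that patch 14 adds to struct mem_cgroup (the legacy tcp_mem handling and the ancestor walk are left out):

/* reclaim side: on a MEDIUM or HIGH vmpressure event for @memcg */
memcg->socket_pressure = jiffies + HZ;	/* assert pressure for ~1 second */

/* socket side: consulted before growing transmit windows */
static inline bool mem_cgroup_under_socket_pressure(struct mem_cgroup *memcg)
{
	return time_before(jiffies, READ_ONCE(memcg->socket_pressure));
}

Because the latch rearms on every event, back-to-back vmpressure events keep the socket code throttled, while a single transient event decays after a second.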

* Re: [PATCH 13/14] mm: memcontrol: account socket memory in unified hierarchy memory controller
  2015-11-12 23:41 ` [PATCH 13/14] mm: memcontrol: account socket memory in unified hierarchy memory controller Johannes Weiner
@ 2015-11-16 15:59   ` Michal Hocko
  2015-11-16 18:18     ` Johannes Weiner
  2015-11-20 13:10   ` Vladimir Davydov
  1 sibling, 1 reply; 61+ messages in thread
From: Michal Hocko @ 2015-11-16 15:59 UTC (permalink / raw)
  To: Johannes Weiner
  Cc: David Miller, Andrew Morton, Vladimir Davydov, Tejun Heo, netdev,
	linux-mm, cgroups, linux-kernel, kernel-team

On Thu 12-11-15 18:41:32, Johannes Weiner wrote:
> Socket memory can be a significant share of overall memory consumed by
> common workloads. In order to provide reasonable resource isolation in
> the unified hierarchy, this type of memory needs to be included in the
> tracking/accounting of a cgroup under active memory resource control.
> 
> Overhead is only incurred when a non-root control group is created AND
> the memory controller is instructed to track and account the memory
> footprint of that group. cgroup.memory=nosocket can be specified on
> the boot commandline to override any runtime configuration and
> forcibly exclude socket memory from active memory resource control.

Do you have any numbers about the overhead?

> Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>

With a way to disable this feature I am OK with it.
cgroup.memory=nosocket should be documented (at least in
Documentation/kernel-parameters.txt)

Other than that
Acked-by: Michal Hocko <mhocko@suse.com>

> ---
>  include/linux/memcontrol.h |  12 ++++-
>  mm/memcontrol.c            | 131 +++++++++++++++++++++++++++++++++++++--------
>  2 files changed, 118 insertions(+), 25 deletions(-)
> 
> diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
> index 4cf5afa..809d6de 100644
> --- a/include/linux/memcontrol.h
> +++ b/include/linux/memcontrol.h
> @@ -256,6 +256,10 @@ struct mem_cgroup {
>  	struct wb_domain cgwb_domain;
>  #endif
>  
> +#ifdef CONFIG_INET
> +	struct work_struct	socket_work;
> +#endif
> +
>  	/* List of events which userspace want to receive */
>  	struct list_head event_list;
>  	spinlock_t event_list_lock;
> @@ -691,7 +695,7 @@ static inline void mem_cgroup_wb_stats(struct bdi_writeback *wb,
>  
>  #endif	/* CONFIG_CGROUP_WRITEBACK */
>  
> -#if defined(CONFIG_INET) && defined(CONFIG_MEMCG_KMEM)
> +#ifdef CONFIG_INET
>  struct sock;
>  extern struct static_key memcg_sockets_enabled_key;
>  #define mem_cgroup_sockets_enabled static_key_false(&memcg_sockets_enabled_key)
> @@ -701,11 +705,15 @@ bool mem_cgroup_charge_skmem(struct mem_cgroup *memcg, unsigned int nr_pages);
>  void mem_cgroup_uncharge_skmem(struct mem_cgroup *memcg, unsigned int nr_pages);
>  static inline bool mem_cgroup_under_socket_pressure(struct mem_cgroup *memcg)
>  {
> +#ifdef CONFIG_MEMCG_KMEM
>  	return memcg->tcp_mem.memory_pressure;
> +#else
> +	return false;
> +#endif
>  }
>  #else
>  #define mem_cgroup_sockets_enabled 0
> -#endif /* CONFIG_INET && CONFIG_MEMCG_KMEM */
> +#endif /* CONFIG_INET */
>  
>  #ifdef CONFIG_MEMCG_KMEM
>  extern struct static_key memcg_kmem_enabled_key;
> diff --git a/mm/memcontrol.c b/mm/memcontrol.c
> index 408fb04..cad9525 100644
> --- a/mm/memcontrol.c
> +++ b/mm/memcontrol.c
> @@ -80,6 +80,9 @@ struct mem_cgroup *root_mem_cgroup __read_mostly;
>  
>  #define MEM_CGROUP_RECLAIM_RETRIES	5
>  
> +/* Socket memory accounting disabled? */
> +static bool cgroup_memory_nosocket;
> +
>  /* Whether the swap controller is active */
>  #ifdef CONFIG_MEMCG_SWAP
>  int do_swap_account __read_mostly;
> @@ -1923,6 +1926,18 @@ static int memcg_cpu_hotplug_callback(struct notifier_block *nb,
>  	return NOTIFY_OK;
>  }
>  
> +static void reclaim_high(struct mem_cgroup *memcg,
> +			 unsigned int nr_pages,
> +			 gfp_t gfp_mask)
> +{
> +	do {
> +		if (page_counter_read(&memcg->memory) <= memcg->high)
> +			continue;
> +		mem_cgroup_events(memcg, MEMCG_HIGH, 1);
> +		try_to_free_mem_cgroup_pages(memcg, nr_pages, gfp_mask, true);
> +	} while ((memcg = parent_mem_cgroup(memcg)));
> +}
> +
>  /*
>   * Scheduled by try_charge() to be executed from the userland return path
>   * and reclaims memory over the high limit.
> @@ -1930,20 +1945,13 @@ static int memcg_cpu_hotplug_callback(struct notifier_block *nb,
>  void mem_cgroup_handle_over_high(void)
>  {
>  	unsigned int nr_pages = current->memcg_nr_pages_over_high;
> -	struct mem_cgroup *memcg, *pos;
> +	struct mem_cgroup *memcg;
>  
>  	if (likely(!nr_pages))
>  		return;
>  
> -	pos = memcg = get_mem_cgroup_from_mm(current->mm);
> -
> -	do {
> -		if (page_counter_read(&pos->memory) <= pos->high)
> -			continue;
> -		mem_cgroup_events(pos, MEMCG_HIGH, 1);
> -		try_to_free_mem_cgroup_pages(pos, nr_pages, GFP_KERNEL, true);
> -	} while ((pos = parent_mem_cgroup(pos)));
> -
> +	memcg = get_mem_cgroup_from_mm(current->mm);
> +	reclaim_high(memcg, nr_pages, GFP_KERNEL);
>  	css_put(&memcg->css);
>  	current->memcg_nr_pages_over_high = 0;
>  }
> @@ -4141,6 +4149,8 @@ struct mem_cgroup *parent_mem_cgroup(struct mem_cgroup *memcg)
>  }
>  EXPORT_SYMBOL(parent_mem_cgroup);
>  
> +static void socket_work_func(struct work_struct *work);
> +
>  static struct cgroup_subsys_state * __ref
>  mem_cgroup_css_alloc(struct cgroup_subsys_state *parent_css)
>  {
> @@ -4180,6 +4190,9 @@ mem_cgroup_css_alloc(struct cgroup_subsys_state *parent_css)
>  #ifdef CONFIG_CGROUP_WRITEBACK
>  	INIT_LIST_HEAD(&memcg->cgwb_list);
>  #endif
> +#ifdef CONFIG_INET
> +	INIT_WORK(&memcg->socket_work, socket_work_func);
> +#endif
>  	return &memcg->css;
>  
>  free_out:
> @@ -4237,6 +4250,11 @@ mem_cgroup_css_online(struct cgroup_subsys_state *css)
>  	if (ret)
>  		return ret;
>  
> +#ifdef CONFIG_INET
> +	if (cgroup_subsys_on_dfl(memory_cgrp_subsys) && !cgroup_memory_nosocket)
> +		static_key_slow_inc(&memcg_sockets_enabled_key);
> +#endif
> +
>  	/*
>  	 * Make sure the memcg is initialized: mem_cgroup_iter()
>  	 * orders reading memcg->initialized against its callers
> @@ -4276,6 +4294,11 @@ static void mem_cgroup_css_free(struct cgroup_subsys_state *css)
>  	struct mem_cgroup *memcg = mem_cgroup_from_css(css);
>  
>  	memcg_destroy_kmem(memcg);
> +#ifdef CONFIG_INET
> +	if (cgroup_subsys_on_dfl(memory_cgrp_subsys) && !cgroup_memory_nosocket)
> +		static_key_slow_dec(&memcg_sockets_enabled_key);
> +	cancel_work_sync(&memcg->socket_work);
> +#endif
>  	__mem_cgroup_free(memcg);
>  }
>  
> @@ -5464,8 +5487,7 @@ void mem_cgroup_replace_page(struct page *oldpage, struct page *newpage)
>  	commit_charge(newpage, memcg, true);
>  }
>  
> -/* Writing them here to avoid exposing memcg's inner layout */
> -#if defined(CONFIG_INET) && defined(CONFIG_MEMCG_KMEM)
> +#ifdef CONFIG_INET
>  
>  struct static_key memcg_sockets_enabled_key;
>  EXPORT_SYMBOL(memcg_sockets_enabled_key);
> @@ -5490,10 +5512,16 @@ void sock_update_memcg(struct sock *sk)
>  
>  	rcu_read_lock();
>  	memcg = mem_cgroup_from_task(current);
> -	if (memcg != root_mem_cgroup &&
> -	    test_bit(MEMCG_SOCK_ACTIVE, &memcg->tcp_mem.flags) &&
> -	    css_tryget_online(&memcg->css))
> +	if (memcg == root_mem_cgroup)
> +		goto out;
> +#ifdef CONFIG_MEMCG_KMEM
> +	if (!cgroup_subsys_on_dfl(memory_cgrp_subsys) &&
> +	    !test_bit(MEMCG_SOCK_ACTIVE, &memcg->tcp_mem.flags))
> +		goto out;
> +#endif
> +	if (css_tryget_online(&memcg->css))
>  		sk->sk_memcg = memcg;
> +out:
>  	rcu_read_unlock();
>  }
>  EXPORT_SYMBOL(sock_update_memcg);
> @@ -5504,6 +5532,14 @@ void sock_release_memcg(struct sock *sk)
>  	css_put(&sk->sk_memcg->css);
>  }
>  
> +static void socket_work_func(struct work_struct *work)
> +{
> +	struct mem_cgroup *memcg;
> +
> +	memcg = container_of(work, struct mem_cgroup, socket_work);
> +	reclaim_high(memcg, CHARGE_BATCH, GFP_KERNEL);
> +}
> +
>  /**
>   * mem_cgroup_charge_skmem - charge socket memory
>   * @memcg: memcg to charge
> @@ -5514,16 +5550,43 @@ void sock_release_memcg(struct sock *sk)
>   */
>  bool mem_cgroup_charge_skmem(struct mem_cgroup *memcg, unsigned int nr_pages)
>  {
> +	unsigned int batch = max(CHARGE_BATCH, nr_pages);
>  	struct page_counter *counter;
> +	bool force = false;
>  
> -	if (page_counter_try_charge(&memcg->tcp_mem.memory_allocated,
> -				    nr_pages, &counter)) {
> -		memcg->tcp_mem.memory_pressure = 0;
> +#ifdef CONFIG_MEMCG_KMEM
> +	if (!cgroup_subsys_on_dfl(memory_cgrp_subsys)) {
> +		if (page_counter_try_charge(&memcg->tcp_mem.memory_allocated,
> +					    nr_pages, &counter)) {
> +			memcg->tcp_mem.memory_pressure = 0;
> +			return true;
> +		}
> +		page_counter_charge(&memcg->tcp_mem.memory_allocated, nr_pages);
> +		memcg->tcp_mem.memory_pressure = 1;
> +		return false;
> +	}
> +#endif
> +	if (consume_stock(memcg, nr_pages))
>  		return true;
> +retry:
> +	if (page_counter_try_charge(&memcg->memory, batch, &counter))
> +		goto done;
> +
> +	if (batch > nr_pages) {
> +		batch = nr_pages;
> +		goto retry;
>  	}
> -	page_counter_charge(&memcg->tcp_mem.memory_allocated, nr_pages);
> -	memcg->tcp_mem.memory_pressure = 1;
> -	return false;
> +
> +	page_counter_charge(&memcg->memory, batch);
> +	force = true;
> +done:
> +	css_get_many(&memcg->css, batch);
> +	if (batch > nr_pages)
> +		refill_stock(memcg, batch - nr_pages);
> +
> +	schedule_work(&memcg->socket_work);
> +
> +	return !force;
>  }
>  
>  /**
> @@ -5533,10 +5596,32 @@ bool mem_cgroup_charge_skmem(struct mem_cgroup *memcg, unsigned int nr_pages)
>   */
>  void mem_cgroup_uncharge_skmem(struct mem_cgroup *memcg, unsigned int nr_pages)
>  {
> -	page_counter_uncharge(&memcg->tcp_mem.memory_allocated, nr_pages);
> +#ifdef CONFIG_MEMCG_KMEM
> +	if (!cgroup_subsys_on_dfl(memory_cgrp_subsys)) {
> +		page_counter_uncharge(&memcg->tcp_mem.memory_allocated,
> +				      nr_pages);
> +		return;
> +	}
> +#endif
> +	page_counter_uncharge(&memcg->memory, nr_pages);
> +	css_put_many(&memcg->css, nr_pages);
>  }
>  
> -#endif
> +#endif /* CONFIG_INET */
> +
> +static int __init cgroup_memory(char *s)
> +{
> +	char *token;
> +
> +	while ((token = strsep(&s, ",")) != NULL) {
> +		if (!*token)
> +			continue;
> +		if (!strcmp(token, "nosocket"))
> +			cgroup_memory_nosocket = true;
> +	}
> +	return 0;
> +}
> +__setup("cgroup.memory=", cgroup_memory);
>  
>  /*
>   * subsys_initcall() for memory controller.
> -- 
> 2.6.2

-- 
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 61+ messages in thread
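
One point relevant to the overhead question above: on the unified hierarchy, mem_cgroup_charge_skmem() reuses the per-cpu charge cache, so the common case never touches the shared page counter. A condensed sketch of the batching pattern in the hunk, with the goto-retry flow straightened out (names from the diff; CHARGE_BATCH is the per-cpu stock size, 32 pages in memcontrol.c):

bool mem_cgroup_charge_skmem(struct mem_cgroup *memcg, unsigned int nr_pages)
{
	unsigned int batch = max(CHARGE_BATCH, nr_pages);
	struct page_counter *counter;
	bool force = false;

	if (consume_stock(memcg, nr_pages))	/* served from the per-cpu cache */
		return true;

	if (page_counter_try_charge(&memcg->memory, batch, &counter))
		goto done;
	if (batch > nr_pages) {
		batch = nr_pages;	/* batched charge failed, retry exact */
		if (page_counter_try_charge(&memcg->memory, batch, &counter))
			goto done;
	}
	page_counter_charge(&memcg->memory, batch);	/* overcharge past the limit */
	force = true;
done:
	css_get_many(&memcg->css, batch);
	if (batch > nr_pages)
		refill_stock(memcg, batch - nr_pages);	/* park the surplus per-cpu */

	schedule_work(&memcg->socket_work);	/* async memory.high reclaim */
	return !force;
}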

* Re: [PATCH 05/14] net: tcp_memcontrol: protect all tcp_memcontrol calls by jump-label
  2015-11-14 16:33   ` Vladimir Davydov
@ 2015-11-16 17:52     ` Johannes Weiner
  0 siblings, 0 replies; 61+ messages in thread
From: Johannes Weiner @ 2015-11-16 17:52 UTC (permalink / raw)
  To: Vladimir Davydov
  Cc: David Miller, Andrew Morton, Tejun Heo, Michal Hocko, netdev,
	linux-mm, cgroups, linux-kernel, kernel-team

On Sat, Nov 14, 2015 at 07:33:10PM +0300, Vladimir Davydov wrote:
> On Thu, Nov 12, 2015 at 06:41:24PM -0500, Johannes Weiner wrote:
> > Move the jump-label from sock_update_memcg() and sock_release_memcg()
> > to the callsite, and so eliminate those function calls when socket
> > accounting is not enabled.
> 
> I don't believe this patch's necessary, because these functions aren't
> hot paths. Neither do I think it makes the code look better. Anyway,
> it's rather a matter of personal preference, and the patch looks correct
> to me, so

Yeah, it's not a hotpath. What I like primarily about this patch I
guess is that it makes it more consistent how memcg entry is gated. You
don't have to remember which functions have the checks in the caller
and which have them in the function itself. And I really hate the

static inline void foo(void)
{
	if (foo_enabled())
		__foo()
}

pattern in header files.

> Reviewed-by: Vladimir Davydov <vdavydov@virtuozzo.com>

Thanks, I appreciate you acking it despite your personal preference!

^ permalink raw reply	[flat|nested] 61+ messages in thread

* Re: [PATCH 13/14] mm: memcontrol: account socket memory in unified hierarchy memory controller
  2015-11-16 15:59   ` Michal Hocko
@ 2015-11-16 18:18     ` Johannes Weiner
  2015-11-18 16:22       ` Michal Hocko
  0 siblings, 1 reply; 61+ messages in thread
From: Johannes Weiner @ 2015-11-16 18:18 UTC (permalink / raw)
  To: Michal Hocko
  Cc: David Miller, Andrew Morton, Vladimir Davydov, Tejun Heo, netdev,
	linux-mm, cgroups, linux-kernel, kernel-team

On Mon, Nov 16, 2015 at 04:59:25PM +0100, Michal Hocko wrote:
> On Thu 12-11-15 18:41:32, Johannes Weiner wrote:
> > Socket memory can be a significant share of overall memory consumed by
> > common workloads. In order to provide reasonable resource isolation in
> > the unified hierarchy, this type of memory needs to be included in the
> > tracking/accounting of a cgroup under active memory resource control.
> > 
> > Overhead is only incurred when a non-root control group is created AND
> > the memory controller is instructed to track and account the memory
> > footprint of that group. cgroup.memory=nosocket can be specified on
> > the boot commandline to override any runtime configuration and
> > forcibly exclude socket memory from active memory resource control.
> 
> Do you have any numbers about the overhead?

Hm? Performance numbers make sense when you have a specific scenario
and a theory on how to optimize the implementation for it. What load
would you test and what would be the baseline to compare it to?

> > Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
> 
> With a way to disable this feature I am OK with it.
> cgroup.memory=nosocket should be documented (at least in
> Documentation/kernel-parameters.txt)

diff --git a/Documentation/kernel-parameters.txt b/Documentation/kernel-parameters.txt
index f8aae63..d518340 100644
--- a/Documentation/kernel-parameters.txt
+++ b/Documentation/kernel-parameters.txt
@@ -599,6 +599,10 @@ bytes respectively. Such letter suffixes can also be entirely omitted.
 			cut the overhead, others just disable the usage. So
 			only cgroup_disable=memory is actually worthy}
 
+	cgroup.memory=	[KNL] Pass options to the cgroup memory controller.
+			Format: <string>
+			nosocket -- Disable socket memory accounting.
+
 	checkreqprot	[SELINUX] Set initial checkreqprot flag value.
 			Format: { "0" | "1" }
 			See security/selinux/Kconfig help text.

> Other than that
> Acked-by: Michal Hocko <mhocko@suse.com>

Thanks!

---
From abe0670cec1424a5b4f43dfabff8bb27a1007ced Mon Sep 17 00:00:00 2001
From: Johannes Weiner <hannes@cmpxchg.org>
Date: Wed, 14 Oct 2015 22:25:20 -0700
Subject: [PATCH] mm: memcontrol: account socket memory in unified hierarchy
 memory controller

Socket memory can be a significant share of overall memory consumed by
common workloads. In order to provide reasonable resource isolation in
the unified hierarchy, this type of memory needs to be included in the
tracking/accounting of a cgroup under active memory resource control.

Overhead is only incurred when a non-root control group is created AND
the memory controller is instructed to track and account the memory
footprint of that group. cgroup.memory=nosocket can be specified on
the boot commandline to override any runtime configuration and
forcibly exclude socket memory from active memory resource control.

Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
Acked-by: Michal Hocko <mhocko@suse.com>
---
 Documentation/kernel-parameters.txt |   4 ++
 include/linux/memcontrol.h          |  12 +++-
 mm/memcontrol.c                     | 131 +++++++++++++++++++++++++++++-------
 3 files changed, 122 insertions(+), 25 deletions(-)

diff --git a/Documentation/kernel-parameters.txt b/Documentation/kernel-parameters.txt
index f8aae63..d518340 100644
--- a/Documentation/kernel-parameters.txt
+++ b/Documentation/kernel-parameters.txt
@@ -599,6 +599,10 @@ bytes respectively. Such letter suffixes can also be entirely omitted.
 			cut the overhead, others just disable the usage. So
 			only cgroup_disable=memory is actually worthy}
 
+	cgroup.memory=	[KNL] Pass options to the cgroup memory controller.
+			Format: <string>
+			nosocket -- Disable socket memory accounting.
+
 	checkreqprot	[SELINUX] Set initial checkreqprot flag value.
 			Format: { "0" | "1" }
 			See security/selinux/Kconfig help text.
diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
index 4cf5afa..809d6de 100644
--- a/include/linux/memcontrol.h
+++ b/include/linux/memcontrol.h
@@ -256,6 +256,10 @@ struct mem_cgroup {
 	struct wb_domain cgwb_domain;
 #endif
 
+#ifdef CONFIG_INET
+	struct work_struct	socket_work;
+#endif
+
 	/* List of events which userspace want to receive */
 	struct list_head event_list;
 	spinlock_t event_list_lock;
@@ -691,7 +695,7 @@ static inline void mem_cgroup_wb_stats(struct bdi_writeback *wb,
 
 #endif	/* CONFIG_CGROUP_WRITEBACK */
 
-#if defined(CONFIG_INET) && defined(CONFIG_MEMCG_KMEM)
+#ifdef CONFIG_INET
 struct sock;
 extern struct static_key memcg_sockets_enabled_key;
 #define mem_cgroup_sockets_enabled static_key_false(&memcg_sockets_enabled_key)
@@ -701,11 +705,15 @@ bool mem_cgroup_charge_skmem(struct mem_cgroup *memcg, unsigned int nr_pages);
 void mem_cgroup_uncharge_skmem(struct mem_cgroup *memcg, unsigned int nr_pages);
 static inline bool mem_cgroup_under_socket_pressure(struct mem_cgroup *memcg)
 {
+#ifdef CONFIG_MEMCG_KMEM
 	return memcg->tcp_mem.memory_pressure;
+#else
+	return false;
+#endif
 }
 #else
 #define mem_cgroup_sockets_enabled 0
-#endif /* CONFIG_INET && CONFIG_MEMCG_KMEM */
+#endif /* CONFIG_INET */
 
 #ifdef CONFIG_MEMCG_KMEM
 extern struct static_key memcg_kmem_enabled_key;
diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index 408fb04..cad9525 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -80,6 +80,9 @@ struct mem_cgroup *root_mem_cgroup __read_mostly;
 
 #define MEM_CGROUP_RECLAIM_RETRIES	5
 
+/* Socket memory accounting disabled? */
+static bool cgroup_memory_nosocket;
+
 /* Whether the swap controller is active */
 #ifdef CONFIG_MEMCG_SWAP
 int do_swap_account __read_mostly;
@@ -1923,6 +1926,18 @@ static int memcg_cpu_hotplug_callback(struct notifier_block *nb,
 	return NOTIFY_OK;
 }
 
+static void reclaim_high(struct mem_cgroup *memcg,
+			 unsigned int nr_pages,
+			 gfp_t gfp_mask)
+{
+	do {
+		if (page_counter_read(&memcg->memory) <= memcg->high)
+			continue;
+		mem_cgroup_events(memcg, MEMCG_HIGH, 1);
+		try_to_free_mem_cgroup_pages(memcg, nr_pages, gfp_mask, true);
+	} while ((memcg = parent_mem_cgroup(memcg)));
+}
+
 /*
  * Scheduled by try_charge() to be executed from the userland return path
  * and reclaims memory over the high limit.
@@ -1930,20 +1945,13 @@ static int memcg_cpu_hotplug_callback(struct notifier_block *nb,
 void mem_cgroup_handle_over_high(void)
 {
 	unsigned int nr_pages = current->memcg_nr_pages_over_high;
-	struct mem_cgroup *memcg, *pos;
+	struct mem_cgroup *memcg;
 
 	if (likely(!nr_pages))
 		return;
 
-	pos = memcg = get_mem_cgroup_from_mm(current->mm);
-
-	do {
-		if (page_counter_read(&pos->memory) <= pos->high)
-			continue;
-		mem_cgroup_events(pos, MEMCG_HIGH, 1);
-		try_to_free_mem_cgroup_pages(pos, nr_pages, GFP_KERNEL, true);
-	} while ((pos = parent_mem_cgroup(pos)));
-
+	memcg = get_mem_cgroup_from_mm(current->mm);
+	reclaim_high(memcg, nr_pages, GFP_KERNEL);
 	css_put(&memcg->css);
 	current->memcg_nr_pages_over_high = 0;
 }
@@ -4141,6 +4149,8 @@ struct mem_cgroup *parent_mem_cgroup(struct mem_cgroup *memcg)
 }
 EXPORT_SYMBOL(parent_mem_cgroup);
 
+static void socket_work_func(struct work_struct *work);
+
 static struct cgroup_subsys_state * __ref
 mem_cgroup_css_alloc(struct cgroup_subsys_state *parent_css)
 {
@@ -4180,6 +4190,9 @@ mem_cgroup_css_alloc(struct cgroup_subsys_state *parent_css)
 #ifdef CONFIG_CGROUP_WRITEBACK
 	INIT_LIST_HEAD(&memcg->cgwb_list);
 #endif
+#ifdef CONFIG_INET
+	INIT_WORK(&memcg->socket_work, socket_work_func);
+#endif
 	return &memcg->css;
 
 free_out:
@@ -4237,6 +4250,11 @@ mem_cgroup_css_online(struct cgroup_subsys_state *css)
 	if (ret)
 		return ret;
 
+#ifdef CONFIG_INET
+	if (cgroup_subsys_on_dfl(memory_cgrp_subsys) && !cgroup_memory_nosocket)
+		static_key_slow_inc(&memcg_sockets_enabled_key);
+#endif
+
 	/*
 	 * Make sure the memcg is initialized: mem_cgroup_iter()
 	 * orders reading memcg->initialized against its callers
@@ -4276,6 +4294,11 @@ static void mem_cgroup_css_free(struct cgroup_subsys_state *css)
 	struct mem_cgroup *memcg = mem_cgroup_from_css(css);
 
 	memcg_destroy_kmem(memcg);
+#ifdef CONFIG_INET
+	if (cgroup_subsys_on_dfl(memory_cgrp_subsys) && !cgroup_memory_nosocket)
+		static_key_slow_dec(&memcg_sockets_enabled_key);
+	cancel_work_sync(&memcg->socket_work);
+#endif
 	__mem_cgroup_free(memcg);
 }
 
@@ -5464,8 +5487,7 @@ void mem_cgroup_replace_page(struct page *oldpage, struct page *newpage)
 	commit_charge(newpage, memcg, true);
 }
 
-/* Writing them here to avoid exposing memcg's inner layout */
-#if defined(CONFIG_INET) && defined(CONFIG_MEMCG_KMEM)
+#ifdef CONFIG_INET
 
 struct static_key memcg_sockets_enabled_key;
 EXPORT_SYMBOL(memcg_sockets_enabled_key);
@@ -5490,10 +5512,16 @@ void sock_update_memcg(struct sock *sk)
 
 	rcu_read_lock();
 	memcg = mem_cgroup_from_task(current);
-	if (memcg != root_mem_cgroup &&
-	    test_bit(MEMCG_SOCK_ACTIVE, &memcg->tcp_mem.flags) &&
-	    css_tryget_online(&memcg->css))
+	if (memcg == root_mem_cgroup)
+		goto out;
+#ifdef CONFIG_MEMCG_KMEM
+	if (!cgroup_subsys_on_dfl(memory_cgrp_subsys) &&
+	    !test_bit(MEMCG_SOCK_ACTIVE, &memcg->tcp_mem.flags))
+		goto out;
+#endif
+	if (css_tryget_online(&memcg->css))
 		sk->sk_memcg = memcg;
+out:
 	rcu_read_unlock();
 }
 EXPORT_SYMBOL(sock_update_memcg);
@@ -5504,6 +5532,14 @@ void sock_release_memcg(struct sock *sk)
 	css_put(&sk->sk_memcg->css);
 }
 
+static void socket_work_func(struct work_struct *work)
+{
+	struct mem_cgroup *memcg;
+
+	memcg = container_of(work, struct mem_cgroup, socket_work);
+	reclaim_high(memcg, CHARGE_BATCH, GFP_KERNEL);
+}
+
 /**
  * mem_cgroup_charge_skmem - charge socket memory
  * @memcg: memcg to charge
@@ -5514,16 +5550,43 @@ void sock_release_memcg(struct sock *sk)
  */
 bool mem_cgroup_charge_skmem(struct mem_cgroup *memcg, unsigned int nr_pages)
 {
+	unsigned int batch = max(CHARGE_BATCH, nr_pages);
 	struct page_counter *counter;
+	bool force = false;
 
-	if (page_counter_try_charge(&memcg->tcp_mem.memory_allocated,
-				    nr_pages, &counter)) {
-		memcg->tcp_mem.memory_pressure = 0;
+#ifdef CONFIG_MEMCG_KMEM
+	if (!cgroup_subsys_on_dfl(memory_cgrp_subsys)) {
+		if (page_counter_try_charge(&memcg->tcp_mem.memory_allocated,
+					    nr_pages, &counter)) {
+			memcg->tcp_mem.memory_pressure = 0;
+			return true;
+		}
+		page_counter_charge(&memcg->tcp_mem.memory_allocated, nr_pages);
+		memcg->tcp_mem.memory_pressure = 1;
+		return false;
+	}
+#endif
+	if (consume_stock(memcg, nr_pages))
 		return true;
+retry:
+	if (page_counter_try_charge(&memcg->memory, batch, &counter))
+		goto done;
+
+	if (batch > nr_pages) {
+		batch = nr_pages;
+		goto retry;
 	}
-	page_counter_charge(&memcg->tcp_mem.memory_allocated, nr_pages);
-	memcg->tcp_mem.memory_pressure = 1;
-	return false;
+
+	page_counter_charge(&memcg->memory, batch);
+	force = true;
+done:
+	css_get_many(&memcg->css, batch);
+	if (batch > nr_pages)
+		refill_stock(memcg, batch - nr_pages);
+
+	schedule_work(&memcg->socket_work);
+
+	return !force;
 }
 
 /**
@@ -5533,10 +5596,32 @@ bool mem_cgroup_charge_skmem(struct mem_cgroup *memcg, unsigned int nr_pages)
  */
 void mem_cgroup_uncharge_skmem(struct mem_cgroup *memcg, unsigned int nr_pages)
 {
-	page_counter_uncharge(&memcg->tcp_mem.memory_allocated, nr_pages);
+#ifdef CONFIG_MEMCG_KMEM
+	if (!cgroup_subsys_on_dfl(memory_cgrp_subsys)) {
+		page_counter_uncharge(&memcg->tcp_mem.memory_allocated,
+				      nr_pages);
+		return;
+	}
+#endif
+	page_counter_uncharge(&memcg->memory, nr_pages);
+	css_put_many(&memcg->css, nr_pages);
 }
 
-#endif
+#endif /* CONFIG_INET */
+
+static int __init cgroup_memory(char *s)
+{
+	char *token;
+
+	while ((token = strsep(&s, ",")) != NULL) {
+		if (!*token)
+			continue;
+		if (!strcmp(token, "nosocket"))
+			cgroup_memory_nosocket = true;
+	}
+	return 0;
+}
+__setup("cgroup.memory=", cgroup_memory);
 
 /*
  * subsys_initcall() for memory controller.
-- 
2.6.2


^ permalink raw reply related	[flat|nested] 61+ messages in thread
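
As a usage note for the documentation hunk above: the option is consumed from the kernel command line at boot, e.g. appended to an illustrative bootloader entry:

linux /boot/vmlinuz root=/dev/sda1 ro cgroup.memory=nosocket

With that set, cgroup_memory_nosocket stays true and the sockets static key is never enabled, regardless of the cgroup mode in use.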

* Re: [PATCH 14/14] mm: memcontrol: hook up vmpressure to socket pressure
  2015-11-15 13:54   ` Vladimir Davydov
@ 2015-11-16 18:53     ` Johannes Weiner
  2015-11-17 20:18       ` Vladimir Davydov
  0 siblings, 1 reply; 61+ messages in thread
From: Johannes Weiner @ 2015-11-16 18:53 UTC (permalink / raw)
  To: Vladimir Davydov
  Cc: David Miller, Andrew Morton, Tejun Heo, Michal Hocko, netdev,
	linux-mm, cgroups, linux-kernel, kernel-team

On Sun, Nov 15, 2015 at 04:54:57PM +0300, Vladimir Davydov wrote:
> On Thu, Nov 12, 2015 at 06:41:33PM -0500, Johannes Weiner wrote:
> > Let the networking stack know when a memcg is under reclaim pressure
> > so that it can clamp its transmit windows accordingly.
> > 
> > Whenever the reclaim efficiency of a cgroup's LRU lists drops low
> > enough for a MEDIUM or HIGH vmpressure event to occur, assert a
> > pressure state in the socket and tcp memory code that tells it to curb
> > consumption growth from sockets associated with said control group.
> > 
> > vmpressure events are naturally edge triggered, so for hysteresis
> > assert socket pressure for a second to allow for subsequent vmpressure
> > events to occur before letting the socket code return to normal.
> 
> AFAICS, in contrast to v1, now you don't modify vmpressure behavior,
> which means socket_pressure will only be set when a cgroup hits its
> high/hard limit. On a tightly packed system, this is unlikely IMO -
> cgroups will mostly experience pressure due to memory shortage at the
> global level and/or their low limit configuration, in which case no
> vmpressure events will be triggered and therefore the tcp window won't
> be clamped accordingly.

Yeah, this is an inherent problem in the vmpressure design and it
makes the feature significantly less useful than it could be IMO.

But you guys were wary about the patch that changed it, and this
series has kicked up enough dust already, so I backed it out.

But this will still be useful. Yes, it won't help in rebalancing a
regularly working system, which would be cool, but it'll still help
contain a workload that is growing beyond expectations, which is the
scenario that kickstarted this work.
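
For the record, the hysteresis described in the quoted changelog boils
down to something like this (a sketch only; the exact field and helper
names in the patch may differ):

	/*
	 * vmpressure stamps a deadline one second out; the socket code
	 * considers the memcg under pressure until that deadline passes.
	 */
	static void vmpressure_to_sockets(struct mem_cgroup *memcg, int level)
	{
		if (level >= VMPRESSURE_MEDIUM)
			memcg->socket_pressure = jiffies + HZ;
	}

	static bool memcg_under_socket_pressure(struct mem_cgroup *memcg)
	{
		return time_before(jiffies, memcg->socket_pressure);
	}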

> Maybe we could use a per-memcg slab shrinker to detect memory
> pressure? This looks like abusing the shrinkers API though.

Actually, I thought about doing this long-term.

Shrinkers are a nice way to export VM pressure to auxiliary allocators
and caches. But currently, the only metric we export is LRU scan rate,
whose application is limited to ageable caches: it doesn't make sense
to cause auxiliary workingsets to shrink when the VM is merely picking
up the drop-behind pages of a one-off page cache stream. I think it
would make sense for shrinkers to include reclaim efficiency so that
they can be used by caches that don't have 'accessed' bits and object
rotation, but are able to shrink based on the cost they're imposing.
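
For reference, the callback pair a shrinker registers today looks
roughly like this (a sketch of the existing API; the demo_cache_*
helpers are made up):

	static unsigned long demo_count(struct shrinker *shrink,
					struct shrink_control *sc)
	{
		/* How many objects could be freed right now? */
		return demo_cache_count(sc->memcg);
	}

	static unsigned long demo_scan(struct shrinker *shrink,
				       struct shrink_control *sc)
	{
		/* Free up to sc->nr_to_scan objects, return how many went. */
		return demo_cache_trim(sc->memcg, sc->nr_to_scan);
	}

	static struct shrinker demo_shrinker = {
		.count_objects	= demo_count,
		.scan_objects	= demo_scan,
		.seeks		= DEFAULT_SEEKS,
		.flags		= SHRINKER_MEMCG_AWARE,	/* sets sc->memcg */
	};

Note that nothing in shrink_control conveys reclaim cost or efficiency;
all a shrinker sees is nr_to_scan, derived from LRU scan rates - which
is the limitation described above.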

But a change like this is beyond the scope of this series, IMO.

^ permalink raw reply	[flat|nested] 61+ messages in thread

* Re: [PATCH 14/14] mm: memcontrol: hook up vmpressure to socket pressure
  2015-11-16 18:53     ` Johannes Weiner
@ 2015-11-17 20:18       ` Vladimir Davydov
  2015-11-17 22:22         ` Johannes Weiner
  0 siblings, 1 reply; 61+ messages in thread
From: Vladimir Davydov @ 2015-11-17 20:18 UTC (permalink / raw)
  To: Johannes Weiner
  Cc: David Miller, Andrew Morton, Tejun Heo, Michal Hocko, netdev,
	linux-mm, cgroups, linux-kernel, kernel-team

On Mon, Nov 16, 2015 at 01:53:16PM -0500, Johannes Weiner wrote:
> On Sun, Nov 15, 2015 at 04:54:57PM +0300, Vladimir Davydov wrote:
> > On Thu, Nov 12, 2015 at 06:41:33PM -0500, Johannes Weiner wrote:
> > > Let the networking stack know when a memcg is under reclaim pressure
> > > so that it can clamp its transmit windows accordingly.
> > > 
> > > Whenever the reclaim efficiency of a cgroup's LRU lists drops low
> > > enough for a MEDIUM or HIGH vmpressure event to occur, assert a
> > > pressure state in the socket and tcp memory code that tells it to curb
> > > consumption growth from sockets associated with said control group.
> > > 
> > > vmpressure events are naturally edge triggered, so for hysteresis
> > > assert socket pressure for a second to allow for subsequent vmpressure
> > > events to occur before letting the socket code return to normal.
> > 
> > AFAICS, in contrast to v1, now you don't modify vmpressure behavior,
> > which means socket_pressure will only be set when a cgroup hits its
> > high/hard limit. On a tightly packed system, this is unlikely IMO -
> > cgroups will mostly experience pressure due to memory shortage at the
> > global level and/or their low limit configuration, in which case no
> > vmpressure events will be triggered and therefore the tcp window won't
> > be clamped accordingly.
> 
> Yeah, this is an inherent problem in the vmpressure design and it
> makes the feature significantly less useful than it could be IMO.

AFAIK vmpressure was designed to allow userspace to tune hard limits of
cgroups in accordance with their demands, in which case the way
vmpressure notifications work makes sense.

> 
> But you guys were wary about the patch that changed it, and this

Changing vmpressure semantics as you proposed in v1 would result in
userspace getting notifications even if a cgroup does not hit its limit.
Maybe it could be useful to someone (e.g. it could help with tuning
memory.low), but I am pretty sure this would also result in breakages
for others.

> series has kicked up enough dust already, so I backed it out.
> 
> But this will still be useful. Yes, it won't help in rebalancing a
> regularly working system, which would be cool, but it'll still help
> contain a workload that is growing beyond expectations, which is the
> scenario that kickstarted this work.

I haven't looked through all the previous patches in the series, but
AFAIU they should do the trick, no? Notifying sockets about vmpressure
is needed rather to protect a workload from itself. And with this patch
it will work this way, but only if sum limits < total ram, which is
rather rare in practice. On tightly packed systems it does nothing.

That said, I don't think we should commit this particular patch. Neither
do I think socket accounting should be enabled by default in the unified
hierarchy for now, since the implementation is still incomplete. IMHO.

Thanks,
Vladimir

> 
> > Maybe we could use a per-memcg slab shrinker to detect memory
> > pressure? This looks like abusing the shrinkers API though.
> 
> Actually, I thought about doing this long-term.
> 
> Shrinkers are a nice way to export VM pressure to auxiliary allocators
> and caches. But currently, the only metric we export is LRU scan rate,
> whose application is limited to ageable caches: it doesn't make sense
> to cause auxiliary workingsets to shrink when the VM is merely picking
> up the drop-behind pages of a one-off page cache stream. I think it
> would make sense for shrinkers to include reclaim efficiency so that
> they can be used by caches that don't have 'accessed' bits and object
> rotation, but are able to shrink based on the cost they're imposing.
> 
> But a change like this is beyond the scope of this series, IMO.
> 

^ permalink raw reply	[flat|nested] 61+ messages in thread

* Re: [PATCH 14/14] mm: memcontrol: hook up vmpressure to socket pressure
  2015-11-17 20:18       ` Vladimir Davydov
@ 2015-11-17 22:22         ` Johannes Weiner
  2015-11-18 16:02           ` Vladimir Davydov
  0 siblings, 1 reply; 61+ messages in thread
From: Johannes Weiner @ 2015-11-17 22:22 UTC (permalink / raw)
  To: Vladimir Davydov
  Cc: David Miller, Andrew Morton, Tejun Heo, Michal Hocko, netdev,
	linux-mm, cgroups, linux-kernel, kernel-team

On Tue, Nov 17, 2015 at 11:18:50PM +0300, Vladimir Davydov wrote:
> AFAIK vmpressure was designed to allow userspace to tune hard limits of
> cgroups in accordance with their demands, in which case the way
> vmpressure notifications work makes sense.

You can still do that when the reporting happens on the reclaim level,
it's easy to figure out where the pressure comes from once a group is
struggling to reclaim its LRU pages.

Reporting on the pressure level does nothing but destroy valuable
information that would be useful in scenarios other than tuning a
hierarchical memory limit.

> > But you guys were wary about the patch that changed it, and this
> 
> Changing vmpressure semantics as you proposed in v1 would result in
> userspace getting notifications even if a cgroup does not hit its limit.
> Maybe it could be useful to someone (e.g. it could help with tuning
> memory.low), but I am pretty sure this would also result in breakages
> for others.

Maybe. I'll look into a two-layer vmpressure recording/reporting model
that would give us reclaim-level events internally while retaining
pressure-level events for the existing userspace interface.

> > series has kicked up enough dust already, so I backed it out.
> > 
> > But this will still be useful. Yes, it won't help in rebalancing a
> > regularly working system, which would be cool, but it'll still help
> > contain a workload that is growing beyond expectations, which is the
> > scenario that kickstarted this work.
> 
> I haven't looked through all the previous patches in the series, but
> AFAIU they should do the trick, no? Notifying sockets about vmpressure
> > is needed rather to protect a workload from itself.

No, the only critical thing is to protect the system from OOM
conditions caused by what should be containerized processes.

That's a correctness issue.

How much we mitigate the consequences inside the container when the
workload screws up is secondary. But even that is already much better
in this series compared to memcg v1, while leaving us with all the
freedom to continue improving this internal mitigation in the future.

> And with this patch it will work this way, but only if sum limits <
> total ram, which is rather rare in practice. On tightly packed
> systems it does nothing.

That's not true, it's still useful when things go south inside a
cgroup, even with overcommitted limits. See above.

We can optimize the continuous global pressure rebalancing later on;
whether that'll be based on a modified vmpressure implementation, or
adding reclaim efficiency to the shrinker API or whatever.

> That said, I don't think we should commit this particular patch. Neither
> do I think socket accounting should be enabled by default in the unified
> hierarchy for now, since the implementation is still incomplete. IMHO.

I don't see a technical basis for either of those suggestions.

^ permalink raw reply	[flat|nested] 61+ messages in thread

* Re: [PATCH 14/14] mm: memcontrol: hook up vmpressure to socket pressure
  2015-11-17 22:22         ` Johannes Weiner
@ 2015-11-18 16:02           ` Vladimir Davydov
  2015-11-18 18:27             ` Johannes Weiner
  0 siblings, 1 reply; 61+ messages in thread
From: Vladimir Davydov @ 2015-11-18 16:02 UTC (permalink / raw)
  To: Johannes Weiner
  Cc: David Miller, Andrew Morton, Tejun Heo, Michal Hocko, netdev,
	linux-mm, cgroups, linux-kernel, kernel-team

On Tue, Nov 17, 2015 at 05:22:17PM -0500, Johannes Weiner wrote:
> On Tue, Nov 17, 2015 at 11:18:50PM +0300, Vladimir Davydov wrote:
> > AFAIK vmpressure was designed to allow userspace to tune hard limits of
> > cgroups in accordance with their demands, in which case the way
> > vmpressure notifications work makes sense.
> 
> You can still do that when the reporting happens on the reclaim level,
> it's easy to figure out where the pressure comes from once a group is
> struggling to reclaim its LRU pages.

Right, one only needs to check usage-vs-limit in the cgroup that
received a notification and all its ancestors, but I doubt existing
applications do that.

> 
> Reporting on the pressure level does nothing but destroy valuable
> information that would be useful in scenarios other than tuning a
> hierarchical memory limit.

Agree.

> 
> > > But you guys were wary about the patch that changed it, and this
> > 
> > Changing vmpressure semantics as you proposed in v1 would result in
> > userspace getting notifications even if a cgroup does not hit its limit.
> > Maybe it could be useful to someone (e.g. it could help with tuning
> > memory.low), but I am pretty sure this would also result in breakages
> > for others.
> 
> Maybe. I'll look into a two-layer vmpressure recording/reporting model
> that would give us reclaim-level events internally while retaining
> pressure-level events for the existing userspace interface.

It would be great. I think vmpressure, as you propose to implement it,
could be useful even in the unified hierarchy for tuning memory.low and
memory.high.

> 
> > > series has kicked up enough dust already, so I backed it out.
> > > 
> > > But this will still be useful. Yes, it won't help in rebalancing a
> > > regularly working system, which would be cool, but it'll still help
> > > contain a workload that is growing beyond expectations, which is the
> > > scenario that kickstarted this work.
> > 
> > I haven't looked through all the previous patches in the series, but
> > AFAIU they should do the trick, no? Notifying sockets about vmpressure
> > is needed rather to protect a workload from itself.
> 
> No, the only critical thing is to protect the system from OOM
> conditions caused by what should be containerized processes.
> 
> That's a correctness issue.
> 
> How much we mitigate the consequences inside the container when the
> workload screws up is secondary. But even that is already much better
> in this series compared to memcg v1, while leaving us with all the
> freedom to continue improving this internal mitigation in the future.
> 
> > And with this patch it will work this way, but only if sum limits <
> > total ram, which is rather rare in practice. On tightly packed
> > systems it does nothing.
> 
> That's not true, it's still useful when things go south inside a
> cgroup, even with overcommitted limits. See above.

I meant solely this patch here, not the rest of the patch set. In the
overcommitted case there is no difference if we have the last patch or
not AFAIU.

> 
> We can optimize the continuous global pressure rebalancing later on;
> whether that'll be based on a modified vmpressure implementation, or
> adding reclaim efficiency to the shrinker API or whatever.
> 
> > That said, I don't think we should commit this particular patch. Neither
> > do I think socket accounting should be enabled by default in the unified
> > hierarchy for now, since the implementation is still incomplete. IMHO.
> 
> I don't see a technical basis for either of those suggestions.
> 

IMHO users switching to the unified hierarchy don't expect that
something gets broken in the default setup unless it's a bug. They
expect API changes, new functionality appearing, some features being
dropped, but not breakages.

With this patch set, one gets socket accounting enabled by default,
which would be OK if it always worked right, at least in theory. But it
does not if the node is overcommitted - one might get unexpected local
OOMs due to growing socket buffers, which have never been seen in the
legacy hierarchy.

You say that it will help cope with global OOM, which is true, but it
looks like trading an old problem for a new one, which is unacceptable
in this particular case IMHO, because the legacy hierarchy has been
used for years and people are likely to be used to old problems such as
the lack of socket buffer accounting - they might already work around
this problem by tuning global tcp limits, for instance. After switching
to the unified hierarchy they'll get a new problem in the default
setup, which is no good IMHO.

I'm not against enabling socket buffer accounting by default, but only
once it is expected to work in 99% of cases, at least theoretically.

Why can't we apply all patches but the last one (they look OK at first
glance, but I need more time to review them carefully) and disable
socket accounting by default for now? Then you or someone else would
prepare a separate patch set introducing vmpressure propagation to
the socket code, so that socket accounting could be enabled by default.

I don't insist. It's just my vision of how things should be done.

Thanks,
Vladimir

^ permalink raw reply	[flat|nested] 61+ messages in thread

* Re: [PATCH 13/14] mm: memcontrol: account socket memory in unified hierarchy memory controller
  2015-11-16 18:18     ` Johannes Weiner
@ 2015-11-18 16:22       ` Michal Hocko
  2015-11-18 21:48         ` Johannes Weiner
  0 siblings, 1 reply; 61+ messages in thread
From: Michal Hocko @ 2015-11-18 16:22 UTC (permalink / raw)
  To: Johannes Weiner
  Cc: David Miller, Andrew Morton, Vladimir Davydov, Tejun Heo, netdev,
	linux-mm, cgroups, linux-kernel, kernel-team

On Mon 16-11-15 13:18:10, Johannes Weiner wrote:
> On Mon, Nov 16, 2015 at 04:59:25PM +0100, Michal Hocko wrote:
> > On Thu 12-11-15 18:41:32, Johannes Weiner wrote:
> > > Socket memory can be a significant share of overall memory consumed by
> > > common workloads. In order to provide reasonable resource isolation in
> > > the unified hierarchy, this type of memory needs to be included in the
> > > tracking/accounting of a cgroup under active memory resource control.
> > > 
> > > Overhead is only incurred when a non-root control group is created AND
> > > the memory controller is instructed to track and account the memory
> > > footprint of that group. cgroup.memory=nosocket can be specified on
> > > the boot commandline to override any runtime configuration and
> > > forcibly exclude socket memory from active memory resource control.
> > 
> > Do you have any numbers about the overhead?
> 
> Hm? Performance numbers make sense when you have a specific scenario
> and a theory on how to optimize the implementation for it.

The fact that there was a strong push to use static branches to put
the code out of line to reduce the overhead before the feature was
merged shows that people are sensitive to network performance and that
significant effort has been spent to eliminate that overhead. My point
was that you are enabling the feature for all memcg users in the
unified hierarchy now without a performance impact overview which users
can use to judge whether to keep it enabled or disable it before they
start seeing regressions, or to make regressions easier to track once
they happen.

> What load would you test and what would be the baseline to compare it
> to?

It seems like netperf with a stream load running in a memcg with no
limits vs. in root memcg (and no other cgroups) should give at least a
hint about the runtime overhead, no?
-- 
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 61+ messages in thread

* Re: [PATCH 14/14] mm: memcontrol: hook up vmpressure to socket pressure
  2015-11-18 16:02           ` Vladimir Davydov
@ 2015-11-18 18:27             ` Johannes Weiner
  0 siblings, 0 replies; 61+ messages in thread
From: Johannes Weiner @ 2015-11-18 18:27 UTC (permalink / raw)
  To: Vladimir Davydov
  Cc: David Miller, Andrew Morton, Tejun Heo, Michal Hocko, netdev,
	linux-mm, cgroups, linux-kernel, kernel-team

On Wed, Nov 18, 2015 at 07:02:54PM +0300, Vladimir Davydov wrote:
> On Tue, Nov 17, 2015 at 05:22:17PM -0500, Johannes Weiner wrote:
> > On Tue, Nov 17, 2015 at 11:18:50PM +0300, Vladimir Davydov wrote:
> > > And with this patch it will work this way, but only if sum limits <
> > > total ram, which is rather rare in practice. On tightly packed
> > > systems it does nothing.
> > 
> > That's not true, it's still useful when things go south inside a
> > cgroup, even with overcommitted limits. See above.
> 
> I meant solely this patch here, not the rest of the patch set. In the
> overcommitted case there is no difference if we have the last patch or
> not AFAIU.

Even this patch, and even in the overcommitted case. When things go
bad inside a cgroup it can steal free memory (it's rare that machines
are at 100% utilization in practice) or memory from other groups until
it hits its own limit. I expect most users, except for some large-scale
deployments, to frequently hit memory.high (or max) in practice.

Obviously the utopian case of full utilization will be even smoother
when we make vmpressure more fine-grained, but why would that be an
argument *against* this patch here, which is useful everywhere else?

> Why can't we apply all patches but the last one (they look OK at first
> glance, but I need more time to review them carefully) and disable
> socket accounting by default for now? Then you or someone else would
> prepare a separate patch set introducing vmpressure propagation to
> the socket code, so that socket accounting could be enabled by default.

This is not going to happen, and we discussed this several times
before. I really wish Michal and you would put more thought into
interface implications. It's trivial to fix up implementation if
actual shortcomings are observed, but it's nigh impossible to fix
interfaces and user-visible behavior once published. It requires
enormous undertakings such as unified hierarchy to rectify things.

Please take your time to review this series, no problem.

But I'm no longer reacting to suggestions to make interface tradeoffs
because new code is not proven to work 99% of the time. That's simply
ridiculous. Any problems will have to be fixed either way, and we're
giving users the cmdline options to work around them in the meantime.

^ permalink raw reply	[flat|nested] 61+ messages in thread

* Re: [PATCH 13/14] mm: memcontrol: account socket memory in unified hierarchy memory controller
  2015-11-18 16:22       ` Michal Hocko
@ 2015-11-18 21:48         ` Johannes Weiner
  2015-11-19 13:50           ` Michal Hocko
  0 siblings, 1 reply; 61+ messages in thread
From: Johannes Weiner @ 2015-11-18 21:48 UTC (permalink / raw)
  To: Michal Hocko
  Cc: David Miller, Andrew Morton, Vladimir Davydov, Tejun Heo, netdev,
	linux-mm, cgroups, linux-kernel, kernel-team

On Wed, Nov 18, 2015 at 05:22:56PM +0100, Michal Hocko wrote:
> On Mon 16-11-15 13:18:10, Johannes Weiner wrote:
> > What load would you test and what would be the baseline to compare it
> > to?
> 
> It seems like netperf with a stream load running in a memcg with no
> limits vs. in root memcg (and no other cgroups) should give at least a
> hint about the runtime overhead, no?

Comparing root vs. dedicated group generally doesn't make sense since
you either need containment or you don't. It makes more sense to test
both times inside a memory-controlled cgroup, one with a regular boot,
one with cgroup.memory=nosocket.

So I ran perf record -g -a netperf -t TCP_STREAM multiple times inside
a memory-controlled cgroup, but mostly mem_cgroup_charge_skmem() does
not show up in the profile at all. Once it was there with 0.00%.

I ran another test that downloads the latest kernel image from
kernel.org at 13MB/s (on my i5 laptop) and it looks like this:

     0.02%     0.01%  irq/44-iwlwifi   [kernel.kallsyms]           [k] mem_cgroup_charge_skmem
             |
             ---mem_cgroup_charge_skmem
                __sk_mem_schedule
                tcp_try_rmem_schedule
                tcp_data_queue
                tcp_rcv_established
                tcp_v4_do_rcv
                tcp_v4_rcv
                ip_local_deliver
                ip_rcv
                __netif_receive_skb_core
                __netif_receive_skb
                netif_receive_skb_internal
                napi_gro_complete

The runs vary too much for this to be measurable in elapsed time.

^ permalink raw reply	[flat|nested] 61+ messages in thread

* Re: [PATCH 13/14] mm: memcontrol: account socket memory in unified hierarchy memory controller
  2015-11-18 21:48         ` Johannes Weiner
@ 2015-11-19 13:50           ` Michal Hocko
  2015-11-19 16:52             ` Johannes Weiner
  0 siblings, 1 reply; 61+ messages in thread
From: Michal Hocko @ 2015-11-19 13:50 UTC (permalink / raw)
  To: Johannes Weiner
  Cc: David Miller, Andrew Morton, Vladimir Davydov, Tejun Heo, netdev,
	linux-mm, cgroups, linux-kernel, kernel-team

On Wed 18-11-15 16:48:22, Johannes Weiner wrote:
[...]
> So I ran perf record -g -a netperf -t TCP_STREAM multiple times inside
> a memory-controlled cgroup, but mostly mem_cgroup_charge_skmem() does
> not show up in the profile at all. Once it was there with 0.00%.

OK, this sounds very good! This means that most workloads which are not
focused solely on network traffic shouldn't even notice. I can
imagine that workloads with high throughput demands would notice, but I
would also expect them to disable the feature.

Could you add this information to the changelog, please?

> I ran another test that downloads the latest kernel image from
> kernel.org at 13MB/s (on my i5 laptop) and it looks like this:
> 
>      0.02%     0.01%  irq/44-iwlwifi   [kernel.kallsyms]           [k] mem_cgroup_charge_skmem
>              |
>              ---mem_cgroup_charge_skmem
>                 __sk_mem_schedule
>                 tcp_try_rmem_schedule
>                 tcp_data_queue
>                 tcp_rcv_established
>                 tcp_v4_do_rcv
>                 tcp_v4_rcv
>                 ip_local_deliver
>                 ip_rcv
>                 __netif_receive_skb_core
>                 __netif_receive_skb
>                 netif_receive_skb_internal
>                 napi_gro_complete
> 
> The runs vary too much for this to be measurable in elapsed time.

-- 
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 61+ messages in thread

* Re: [PATCH 13/14] mm: memcontrol: account socket memory in unified hierarchy memory controller
  2015-11-19 13:50           ` Michal Hocko
@ 2015-11-19 16:52             ` Johannes Weiner
  0 siblings, 0 replies; 61+ messages in thread
From: Johannes Weiner @ 2015-11-19 16:52 UTC (permalink / raw)
  To: Michal Hocko
  Cc: David Miller, Andrew Morton, Vladimir Davydov, Tejun Heo, netdev,
	linux-mm, cgroups, linux-kernel, kernel-team

On Thu, Nov 19, 2015 at 02:50:24PM +0100, Michal Hocko wrote:
> On Wed 18-11-15 16:48:22, Johannes Weiner wrote:
> [...]
> > So I ran perf record -g -a netperf -t TCP_STREAM multiple times inside
> > a memory-controlled cgroup, but mostly mem_cgroup_charge_skmem() does
> > not show up in the profile at all. Once it was there with 0.00%.
> 
> OK, this sounds very good! This means that most workloads which are not
> focusing solely on the network traffic shouldn't even notice. I can
> imagine that workloads with high throughput demands would notice but I
> would also expect them to disable the feature.

Even for high throughput, the cost of this is a function of the number
of packets sent. E.g. the 13MB/s over wifi showed the socket charging at
0.02%. But I just did an http transfer over 1Gbit ethernet at around
110MB/s, ten times the bandwidth, and the charge function is at 0.00%.

> Could you add this information to the changelog, please?

Sure, but which information exactly?

If we had found a realistic networking workload that is expected to be
containerized and had shown that load to be negatively affected by the
charging calls, that would have been worth bringing up in conjunction
with the boot-time flag. But what do we have to say here? People care
about cost. It seems unnecessary to point out the absence of it.

^ permalink raw reply	[flat|nested] 61+ messages in thread

* Re: [PATCH 04/14] net: tcp_memcontrol: remove bogus hierarchy pressure propagation
  2015-11-12 23:41 ` [PATCH 04/14] net: tcp_memcontrol: remove bogus hierarchy pressure propagation Johannes Weiner
  2015-11-13 16:00   ` David Miller
@ 2015-11-20  9:07   ` Vladimir Davydov
  1 sibling, 0 replies; 61+ messages in thread
From: Vladimir Davydov @ 2015-11-20  9:07 UTC (permalink / raw)
  To: Johannes Weiner
  Cc: David Miller, Andrew Morton, Tejun Heo, Michal Hocko, netdev,
	linux-mm, cgroups, linux-kernel, kernel-team

On Thu, Nov 12, 2015 at 06:41:23PM -0500, Johannes Weiner wrote:
> When a cgroup currently breaches its socket memory limit, it enters
> memory pressure mode for itself and its *ancestors*. This throttles
> transmission in unrelated sibling and cousin subtrees that have
> nothing to do with the breached limit.
> 
> On the contrary, breaching a limit should make that group and its
> *children* enter memory pressure mode. But this happens already,
> albeit lazily: if an ancestor limit is breached, siblings will enter
> memory pressure on their own once the next packet arrives for them.

Hmm, we still call sk_prot->enter_memory_pressure, which might hurt a
workload in the root cgroup AFAICS. Strange. You fix it in patch 8
though.

> 
> So no additional hierarchy code is needed. Remove the bogus stuff.
> 
> Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>

Reviewed-by: Vladimir Davydov <vdavydov@virtuozzo.com>

^ permalink raw reply	[flat|nested] 61+ messages in thread

* Re: [PATCH 06/14] net: tcp_memcontrol: remove dead per-memcg count of allocated sockets
  2015-11-12 23:41 ` [PATCH 06/14] net: tcp_memcontrol: remove dead per-memcg count of allocated sockets Johannes Weiner
  2015-11-13 16:01   ` David Miller
@ 2015-11-20  9:48   ` Vladimir Davydov
  1 sibling, 0 replies; 61+ messages in thread
From: Vladimir Davydov @ 2015-11-20  9:48 UTC (permalink / raw)
  To: Johannes Weiner
  Cc: David Miller, Andrew Morton, Tejun Heo, Michal Hocko, netdev,
	linux-mm, cgroups, linux-kernel, kernel-team

On Thu, Nov 12, 2015 at 06:41:25PM -0500, Johannes Weiner wrote:
> The number of allocated sockets is used for calculations in the soft
> limit phase, where packets are accepted but the socket is under memory
> pressure. Since there is no soft limit phase in tcp_memcontrol, and
> memory pressure is only entered when packets are already dropped, this
> is actually dead code. Remove it.

Actually, we can get into the soft limit phase due to the global limit
(tcp_memory_pressure is set), but then using the per-memcg
sockets_allocated counter is just wrong.

> 
> As this is the last user of parent_cg_proto(), remove that too.
> 
> Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>

Reviewed-by: Vladimir Davydov <vdavydov@virtuozzo.com>

^ permalink raw reply	[flat|nested] 61+ messages in thread

* Re: [PATCH 07/14] net: tcp_memcontrol: simplify the per-memcg limit access
  2015-11-12 23:41 ` [PATCH 07/14] net: tcp_memcontrol: simplify the per-memcg limit access Johannes Weiner
@ 2015-11-20  9:51   ` Vladimir Davydov
  0 siblings, 0 replies; 61+ messages in thread
From: Vladimir Davydov @ 2015-11-20  9:51 UTC (permalink / raw)
  To: Johannes Weiner
  Cc: David Miller, Andrew Morton, Tejun Heo, Michal Hocko, netdev,
	linux-mm, cgroups, linux-kernel, kernel-team

On Thu, Nov 12, 2015 at 06:41:26PM -0500, Johannes Weiner wrote:
> tcp_memcontrol replicates the global sysctl_mem limit array per
> cgroup, but it only ever sets these entries to the value of the
> memory_allocated page_counter limit. Use the latter directly.
> 
> Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>

Reviewed-by: Vladimir Davydov <vdavydov@virtuozzo.com>

^ permalink raw reply	[flat|nested] 61+ messages in thread

* Re: [PATCH 08/14] net: tcp_memcontrol: sanitize tcp memory accounting callbacks
  2015-11-12 23:41 ` [PATCH 08/14] net: tcp_memcontrol: sanitize tcp memory accounting callbacks Johannes Weiner
  2015-11-13  4:53   ` Eric Dumazet
@ 2015-11-20 10:58   ` Vladimir Davydov
  2015-11-20 18:42     ` Johannes Weiner
  1 sibling, 1 reply; 61+ messages in thread
From: Vladimir Davydov @ 2015-11-20 10:58 UTC (permalink / raw)
  To: Johannes Weiner
  Cc: David Miller, Andrew Morton, Tejun Heo, Michal Hocko, netdev,
	linux-mm, cgroups, linux-kernel, kernel-team

On Thu, Nov 12, 2015 at 06:41:27PM -0500, Johannes Weiner wrote:
> There won't be a tcp control soft limit, so integrating the memcg code
> into the global skmem limiting scheme complicates things
> unnecessarily. Replace this with simple and clear charge and uncharge
> calls--hidden behind a jump label--to account skb memory.
> 
> Note that this is not purely aesthetic: as a result of shoehorning the
> per-memcg code into the same memory accounting functions that handle
> the global level, the old code would compare the per-memcg consumption
> against the smaller of the per-memcg limit and the global limit. This
> allowed the total consumption of multiple sockets to exceed the global
> limit, as long as the individual sockets stayed within bounds. After
> this change, the code will always compare the per-memcg consumption to
> the per-memcg limit, and the global consumption to the global limit,
> and thus close this loophole.
> 
> Without a soft limit, the per-memcg memory pressure state in sockets
> is generally questionable. However, we did it until now, so we
> continue to enter it when the hard limit is hit, and packets are
> dropped, to let other sockets in the cgroup know that they shouldn't
> grow their transmit windows, either. However, keep it simple in the
> new callback model and leave memory pressure lazily when the next
> packet is accepted (as opposed to doing it synchronously when packets
> are processed). When packets are dropped, network performance will
> already be in the toilet, so that should be a reasonable trade-off.
> 
> As described above, consumption is now checked on the per-memcg level
> and the global level separately. Likewise, memory pressure states are
> maintained on both the per-memcg level and the global level, and a
> socket is considered under pressure when either level asserts as much.
> 
> Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>

It leaves the legacy functionality intact, while making the code look
much better.

Reviewed-by: Vladimir Davydov <vdavydov@virtuozzo.com>

^ permalink raw reply	[flat|nested] 61+ messages in thread

* Re: [PATCH 09/14] net: tcp_memcontrol: simplify linkage between socket and page counter
  2015-11-12 23:41 ` [PATCH 09/14] net: tcp_memcontrol: simplify linkage between socket and page counter Johannes Weiner
@ 2015-11-20 12:42   ` Vladimir Davydov
  2015-11-20 18:56     ` Johannes Weiner
  0 siblings, 1 reply; 61+ messages in thread
From: Vladimir Davydov @ 2015-11-20 12:42 UTC (permalink / raw)
  To: Johannes Weiner
  Cc: David Miller, Andrew Morton, Tejun Heo, Michal Hocko, netdev,
	linux-mm, cgroups, linux-kernel, kernel-team

On Thu, Nov 12, 2015 at 06:41:28PM -0500, Johannes Weiner wrote:
> There won't be any separate counters for socket memory consumed by
> protocols other than TCP in the future. Remove the indirection and

I really want to believe you're right. And with vmpressure propagation
implemented properly you are likely to be right.

However, we might still want to account other socket protos to
memcg->memory in the unified hierarchy, e.g. UDP, SCTP, or whatever
else. Adding new consumers should be trivial, but it will break the
legacy usecase, where only TCP sockets are supposed to be accounted.
What about adding a check to sock_update_memcg() so that it would enable
accounting only for TCP sockets in case the legacy hierarchy is used?

For the same reason, I think we'd better rename memcg->tcp_mem to
something like memcg->sk_mem, or we can even drop the cg_proto struct
altogether, embedding its fields directly into the mem_cgroup struct.

Also, I don't see any reason to have a tcp_memcontrol.c file. It's tiny
and with this patch it does not depend on tcp code any more. Let's move
it to memcontrol.c?

Other than that this patch looks OK to me.

Thanks,
Vladimir

> link sockets directly to their owning memory cgroup.
> 
> Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>

^ permalink raw reply	[flat|nested] 61+ messages in thread

* Re: [PATCH 12/14] mm: memcontrol: move socket code for unified hierarchy accounting
  2015-11-12 23:41 ` [PATCH 12/14] mm: memcontrol: move socket code for unified hierarchy accounting Johannes Weiner
@ 2015-11-20 12:44   ` Vladimir Davydov
  0 siblings, 0 replies; 61+ messages in thread
From: Vladimir Davydov @ 2015-11-20 12:44 UTC (permalink / raw)
  To: Johannes Weiner
  Cc: David Miller, Andrew Morton, Tejun Heo, Michal Hocko, netdev,
	linux-mm, cgroups, linux-kernel, kernel-team

On Thu, Nov 12, 2015 at 06:41:31PM -0500, Johannes Weiner wrote:
> The unified hierarchy memory controller will account socket
> memory. Move the infrastructure functions accordingly.
> 
> Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
> Acked-by: Michal Hocko <mhocko@suse.com>

Reviewed-by: Vladimir Davydov <vdavydov@virtuozzo.com>

^ permalink raw reply	[flat|nested] 61+ messages in thread

* Re: [PATCH 13/14] mm: memcontrol: account socket memory in unified hierarchy memory controller
  2015-11-12 23:41 ` [PATCH 13/14] mm: memcontrol: account socket memory in unified hierarchy memory controller Johannes Weiner
  2015-11-16 15:59   ` Michal Hocko
@ 2015-11-20 13:10   ` Vladimir Davydov
  2015-11-20 19:25     ` Johannes Weiner
  1 sibling, 1 reply; 61+ messages in thread
From: Vladimir Davydov @ 2015-11-20 13:10 UTC (permalink / raw)
  To: Johannes Weiner
  Cc: David Miller, Andrew Morton, Tejun Heo, Michal Hocko, netdev,
	linux-mm, cgroups, linux-kernel, kernel-team

On Thu, Nov 12, 2015 at 06:41:32PM -0500, Johannes Weiner wrote:
...
> @@ -5514,16 +5550,43 @@ void sock_release_memcg(struct sock *sk)
>   */
>  bool mem_cgroup_charge_skmem(struct mem_cgroup *memcg, unsigned int nr_pages)
>  {
> +	unsigned int batch = max(CHARGE_BATCH, nr_pages);
>  	struct page_counter *counter;
> +	bool force = false;
>  
> -	if (page_counter_try_charge(&memcg->tcp_mem.memory_allocated,
> -				    nr_pages, &counter)) {
> -		memcg->tcp_mem.memory_pressure = 0;
> +#ifdef CONFIG_MEMCG_KMEM
> +	if (!cgroup_subsys_on_dfl(memory_cgrp_subsys)) {
> +		if (page_counter_try_charge(&memcg->tcp_mem.memory_allocated,
> +					    nr_pages, &counter)) {
> +			memcg->tcp_mem.memory_pressure = 0;
> +			return true;
> +		}
> +		page_counter_charge(&memcg->tcp_mem.memory_allocated, nr_pages);
> +		memcg->tcp_mem.memory_pressure = 1;
> +		return false;
> +	}
> +#endif
> +	if (consume_stock(memcg, nr_pages))
>  		return true;
> +retry:
> +	if (page_counter_try_charge(&memcg->memory, batch, &counter))
> +		goto done;
> +
> +	if (batch > nr_pages) {
> +		batch = nr_pages;
> +		goto retry;
>  	}
> -	page_counter_charge(&memcg->tcp_mem.memory_allocated, nr_pages);
> -	memcg->tcp_mem.memory_pressure = 1;
> -	return false;
> +
> +	page_counter_charge(&memcg->memory, batch);
> +	force = true;
> +done:

> +	css_get_many(&memcg->css, batch);

Is there any point in taking a css reference for each charged page? For
kmem it is absolutely necessary, because dangling slabs must block
destruction of the memcg's kmem caches, which are destroyed on css_free.
But for sockets there's no such problem: the memcg will be destroyed
only after all sockets are destroyed and therefore uncharged (since
sock_update_memcg pins the css).

> +	if (batch > nr_pages)
> +		refill_stock(memcg, batch - nr_pages);
> +
> +	schedule_work(&memcg->socket_work);

I think it's suboptimal to schedule the work even if we are below the
high threshold.

BTW why do we need this work at all? Why is reclaim_high called from
task_work not enough?

Thanks,
Vladimir

> +
> +	return !force;
>  }
>  
>  /**

^ permalink raw reply	[flat|nested] 61+ messages in thread

* Re: [PATCH 08/14] net: tcp_memcontrol: sanitize tcp memory accounting callbacks
  2015-11-20 10:58   ` Vladimir Davydov
@ 2015-11-20 18:42     ` Johannes Weiner
  0 siblings, 0 replies; 61+ messages in thread
From: Johannes Weiner @ 2015-11-20 18:42 UTC (permalink / raw)
  To: Vladimir Davydov
  Cc: David Miller, Andrew Morton, Tejun Heo, Michal Hocko, netdev,
	linux-mm, cgroups, linux-kernel, kernel-team

On Fri, Nov 20, 2015 at 01:58:57PM +0300, Vladimir Davydov wrote:
> On Thu, Nov 12, 2015 at 06:41:27PM -0500, Johannes Weiner wrote:
> > There won't be a tcp control soft limit, so integrating the memcg code
> > into the global skmem limiting scheme complicates things
> > unnecessarily. Replace this with simple and clear charge and uncharge
> > calls--hidden behind a jump label--to account skb memory.
> > 
> > Note that this is not purely aesthetic: as a result of shoehorning the
> > per-memcg code into the same memory accounting functions that handle
> > the global level, the old code would compare the per-memcg consumption
> > against the smaller of the per-memcg limit and the global limit. This
> > allowed the total consumption of multiple sockets to exceed the global
> > limit, as long as the individual sockets stayed within bounds. After
> > this change, the code will always compare the per-memcg consumption to
> > the per-memcg limit, and the global consumption to the global limit,
> > and thus close this loophole.
> > 
> > Without a soft limit, the per-memcg memory pressure state in sockets
> > is generally questionable. However, we did it until now, so we
> > continue to enter it when the hard limit is hit, and packets are
> > dropped, to let other sockets in the cgroup know that they shouldn't
> > grow their transmit windows, either. However, keep it simple in the
> > new callback model and leave memory pressure lazily when the next
> > packet is accepted (as opposed to doing it synchronously when packets
> > are processed). When packets are dropped, network performance will
> > already be in the toilet, so that should be a reasonable trade-off.
> > 
> > As described above, consumption is now checked on the per-memcg level
> > and the global level separately. Likewise, memory pressure states are
> > maintained on both the per-memcg level and the global level, and a
> > socket is considered under pressure when either level asserts as much.
> > 
> > Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
> 
> It leaves the legacy functionality intact, while making the code look
> much better.
> 
> Reviewed-by: Vladimir Davydov <vdavydov@virtuozzo.com>

Thank you very much!

^ permalink raw reply	[flat|nested] 61+ messages in thread

* Re: [PATCH 09/14] net: tcp_memcontrol: simplify linkage between socket and page counter
  2015-11-20 12:42   ` Vladimir Davydov
@ 2015-11-20 18:56     ` Johannes Weiner
  2015-11-23  9:36       ` Vladimir Davydov
  0 siblings, 1 reply; 61+ messages in thread
From: Johannes Weiner @ 2015-11-20 18:56 UTC (permalink / raw)
  To: Vladimir Davydov
  Cc: David Miller, Andrew Morton, Tejun Heo, Michal Hocko, netdev,
	linux-mm, cgroups, linux-kernel, kernel-team

On Fri, Nov 20, 2015 at 03:42:16PM +0300, Vladimir Davydov wrote:
> On Thu, Nov 12, 2015 at 06:41:28PM -0500, Johannes Weiner wrote:
> > There won't be any separate counters for socket memory consumed by
> > protocols other than TCP in the future. Remove the indirection and
> 
> I really want to believe you're right. And with vmpressure propagation
> implemented properly you are likely to be right.
> 
> However, we might still want to account other socket protos to
> memcg->memory in the unified hierarchy, e.g. UDP, SCTP, or whatever
> else. Adding new consumers should be trivial, but it will break the
> legacy usecase, where only TCP sockets are supposed to be accounted.
> What about adding a check to sock_update_memcg() so that it would enable
> accounting only for TCP sockets in case the legacy hierarchy is used?

Yup, I was thinking the same thing. But we can cross that bridge when
we come to it and are actually adding further packet types.

> For the same reason, I think we'd better rename memcg->tcp_mem to
> something like memcg->sk_mem, or we can even drop the cg_proto struct
> altogether, embedding its fields directly into the mem_cgroup struct.
> 
> Also, I don't see any reason to have a tcp_memcontrol.c file. It's tiny
> and with this patch it does not depend on tcp code any more. Let's move
> it to memcontrol.c?

I actually had all this at first, but then wondered if it makes more
sense to keep the legacy code in isolation. Don't you think it would
be easier to keep track of what's v1 and what's v2 if we keep the
legacy stuff physically separate as much as possible? In particular I
found that 'tcp_mem.' marker really useful while working on the code.

In the same vein, tcp_memcontrol.c doesn't really hurt anybody and I'd
expect it to remain mostly unopened and unchanged in the future. But
if we merge it into memcontrol.c, that code will likely be in the way
and we'd have to make it explicit somehow that this is not actually
part of the new memory controller anymore.

What do you think?

> Other than that this patch looks OK to me.

Thank you!

^ permalink raw reply	[flat|nested] 61+ messages in thread

* Re: [PATCH 13/14] mm: memcontrol: account socket memory in unified hierarchy memory controller
  2015-11-20 13:10   ` Vladimir Davydov
@ 2015-11-20 19:25     ` Johannes Weiner
  2015-11-23 10:00       ` Vladimir Davydov
  0 siblings, 1 reply; 61+ messages in thread
From: Johannes Weiner @ 2015-11-20 19:25 UTC (permalink / raw)
  To: Vladimir Davydov
  Cc: David Miller, Andrew Morton, Tejun Heo, Michal Hocko, netdev,
	linux-mm, cgroups, linux-kernel, kernel-team

On Fri, Nov 20, 2015 at 04:10:33PM +0300, Vladimir Davydov wrote:
> On Thu, Nov 12, 2015 at 06:41:32PM -0500, Johannes Weiner wrote:
> ...
> > @@ -5514,16 +5550,43 @@ void sock_release_memcg(struct sock *sk)
> >   */
> >  bool mem_cgroup_charge_skmem(struct mem_cgroup *memcg, unsigned int nr_pages)
> >  {
> > +	unsigned int batch = max(CHARGE_BATCH, nr_pages);
> >  	struct page_counter *counter;
> > +	bool force = false;
> >  
> > -	if (page_counter_try_charge(&memcg->tcp_mem.memory_allocated,
> > -				    nr_pages, &counter)) {
> > -		memcg->tcp_mem.memory_pressure = 0;
> > +#ifdef CONFIG_MEMCG_KMEM
> > +	if (!cgroup_subsys_on_dfl(memory_cgrp_subsys)) {
> > +		if (page_counter_try_charge(&memcg->tcp_mem.memory_allocated,
> > +					    nr_pages, &counter)) {
> > +			memcg->tcp_mem.memory_pressure = 0;
> > +			return true;
> > +		}
> > +		page_counter_charge(&memcg->tcp_mem.memory_allocated, nr_pages);
> > +		memcg->tcp_mem.memory_pressure = 1;
> > +		return false;
> > +	}
> > +#endif
> > +	if (consume_stock(memcg, nr_pages))
> >  		return true;
> > +retry:
> > +	if (page_counter_try_charge(&memcg->memory, batch, &counter))
> > +		goto done;
> > +
> > +	if (batch > nr_pages) {
> > +		batch = nr_pages;
> > +		goto retry;
> >  	}
> > -	page_counter_charge(&memcg->tcp_mem.memory_allocated, nr_pages);
> > -	memcg->tcp_mem.memory_pressure = 1;
> > -	return false;
> > +
> > +	page_counter_charge(&memcg->memory, batch);
> > +	force = true;
> > +done:
> 
> > +	css_get_many(&memcg->css, batch);
> 
> > Is there any point in taking a css reference for each charged page? For
> > kmem it is absolutely necessary, because dangling slabs must block
> > destruction of the memcg's kmem caches, which are destroyed on css_free.
> > But for sockets there's no such problem: the memcg will be destroyed
> > only after all sockets are destroyed and therefore uncharged (since
> > sock_update_memcg pins the css).

I'm afraid we have to when we want to share 'stock' with cache and
anon pages, which hold individual references. drain_stock() always
assumes one reference per cached page.
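
For context, drain_stock() looks roughly like this (quoting from
memory, so treat it as a sketch):

	static void drain_stock(struct memcg_stock_pcp *stock)
	{
		struct mem_cgroup *old = stock->cached;

		if (stock->nr_pages) {
			page_counter_uncharge(&old->memory, stock->nr_pages);
			if (do_swap_account)
				page_counter_uncharge(&old->memsw,
						      stock->nr_pages);
			/* One css reference per cached page: */
			css_put_many(&old->css, stock->nr_pages);
			stock->nr_pages = 0;
		}
		stock->cached = NULL;
	}

So any page parked in the stock has to come with its own css reference,
whether it was charged for a socket or not.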

> > +	if (batch > nr_pages)
> > +		refill_stock(memcg, batch - nr_pages);
> > +
> > +	schedule_work(&memcg->socket_work);
> 
> I think it's suboptimal to schedule the work even if we are below the
> high threshold.

Hm, it seemed unnecessary to duplicate the hierarchy check since this
is in the batch-exhausted slowpath anyway.

> BTW why do we need this work at all? Why is reclaim_high called from
> task_work not enough?

The problem lies in the memcg association: the random task that gets
interrupted by an arriving packet might not be in the same memcg as
the one owning the receiving socket. And multiple interrupts could
happen while we're in the kernel already charging pages. We'd basically
have to maintain a list of memcgs that need to run reclaim_high
associated with current.
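
(For completeness: the setup side is outside the hunks quoted here; it
amounts to a single line in the css allocation path, presumably

	INIT_WORK(&memcg->socket_work, socket_work_func);

next to the other mem_cgroup initializers.)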

^ permalink raw reply	[flat|nested] 61+ messages in thread

* Re: [PATCH 09/14] net: tcp_memcontrol: simplify linkage between socket and page counter
  2015-11-20 18:56     ` Johannes Weiner
@ 2015-11-23  9:36       ` Vladimir Davydov
  2015-11-23 18:20         ` Johannes Weiner
  0 siblings, 1 reply; 61+ messages in thread
From: Vladimir Davydov @ 2015-11-23  9:36 UTC (permalink / raw)
  To: Johannes Weiner
  Cc: David Miller, Andrew Morton, Tejun Heo, Michal Hocko, netdev,
	linux-mm, cgroups, linux-kernel, kernel-team

On Fri, Nov 20, 2015 at 01:56:48PM -0500, Johannes Weiner wrote:
> On Fri, Nov 20, 2015 at 03:42:16PM +0300, Vladimir Davydov wrote:
> > On Thu, Nov 12, 2015 at 06:41:28PM -0500, Johannes Weiner wrote:
> > > There won't be any separate counters for socket memory consumed by
> > > protocols other than TCP in the future. Remove the indirection and
> > 
> > I really want to believe you're right. And with vmpressure propagation
> > implemented properly you are likely to be right.
> > 
> > However, we might still want to account other socket protos to
> > memcg->memory in the unified hierarchy, e.g. UDP, SCTP, or whatever
> > else. Adding new consumers should be trivial, but it will break the
> > legacy usecase, where only TCP sockets are supposed to be accounted.
> > What about adding a check to sock_update_memcg() so that it would enable
> > accounting only for TCP sockets in case the legacy hierarchy is used?
> 
> Yup, I was thinking the same thing. But we can cross that bridge when
> we come to it and are actually adding further packet types.

Fair enough.

> 
> > For the same reason, I think we'd better rename memcg->tcp_mem to
> > something like memcg->sk_mem, or we can even drop the cg_proto struct
> > altogether, embedding its fields directly into the mem_cgroup struct.
> > 
> > Also, I don't see any reason to have a tcp_memcontrol.c file. It's tiny
> > and with this patch it does not depend on tcp code any more. Let's move
> > it to memcontrol.c?
> 
> I actually had all this at first, but then wondered if it makes more
> sense to keep the legacy code in isolation. Don't you think it would
> be easier to keep track of what's v1 and what's v2 if we keep the
> legacy stuff physically separate as much as possible? In particular I
> found that 'tcp_mem.' marker really useful while working on the code.
> 
> In the same vein, tcp_memcontrol.c doesn't really hurt anybody and I'd
> expect it to remain mostly unopened and unchanged in the future. But
> if we merge it into memcontrol.c, that code will likely be in the way
> and we'd have to make it explicit somehow that this is not actually
> part of the new memory controller anymore.
> 
> What do you think?

There isn't much code left in tcp_memcontrol.c, and not all of it is
legacy. We still want to call tcp_init_cgroup and tcp_destroy_cgroup
from memcontrol.c - in fact, it's the only call site, so I think we'd
better keep these functions there. Apart from init/destroy, there is
only stuff for handling legacy files, which is relatively small and
isolated. We can just put it along with the memsw and kmem legacy files
at the end of memcontrol.c, adding a comment that it's legacy. Personally,
I'd find the code easier to follow then, because currently the logic
behind the ACTIVE flag as well as memcg->tcp_mem init/use/destroy turns
out to be scattered between two files in different subsystems for no
apparent reason now, as it does not need tcp_prot any more. Besides,
this would allow us to accurately reuse the ACTIVE flag in init/destroy
for inc/dec of the static branch, and probably in sock_update_memcg
instead of sprinkling cgroup_subsys_on_dfl all over the place, which
would make the code a bit cleaner IMO (in fact, that's why I proposed
to drop the ACTIVATED bit and replace cg_proto->flags with ->active
bool).
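
In other words, something like this (a sketch; only the fields visible
in this series are shown):

	struct cg_proto {
		struct page_counter	memory_allocated; /* socket memory charged */
		int			memory_pressure;
		bool			active;	/* replaces the flags bitmask */
	};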

Regarding the tcp_mem marker, well, currently it's OK, because we don't
account anything but TCP sockets, but when that changes (and I'm pretty
sure it will), we'll have to rename it anyway. For now, I'm OK with
leaving it as is though.

Thanks,
Vladimir

^ permalink raw reply	[flat|nested] 61+ messages in thread

* Re: [PATCH 13/14] mm: memcontrol: account socket memory in unified hierarchy memory controller
  2015-11-20 19:25     ` Johannes Weiner
@ 2015-11-23 10:00       ` Vladimir Davydov
  2015-11-23 19:31         ` Johannes Weiner
  0 siblings, 1 reply; 61+ messages in thread
From: Vladimir Davydov @ 2015-11-23 10:00 UTC (permalink / raw)
  To: Johannes Weiner
  Cc: David Miller, Andrew Morton, Tejun Heo, Michal Hocko, netdev,
	linux-mm, cgroups, linux-kernel, kernel-team

On Fri, Nov 20, 2015 at 02:25:06PM -0500, Johannes Weiner wrote:
> On Fri, Nov 20, 2015 at 04:10:33PM +0300, Vladimir Davydov wrote:
> > On Thu, Nov 12, 2015 at 06:41:32PM -0500, Johannes Weiner wrote:
> > ...
> > > @@ -5514,16 +5550,43 @@ void sock_release_memcg(struct sock *sk)
> > >   */
> > >  bool mem_cgroup_charge_skmem(struct mem_cgroup *memcg, unsigned int nr_pages)
> > >  {
> > > +	unsigned int batch = max(CHARGE_BATCH, nr_pages);
> > >  	struct page_counter *counter;
> > > +	bool force = false;
> > >  
> > > -	if (page_counter_try_charge(&memcg->tcp_mem.memory_allocated,
> > > -				    nr_pages, &counter)) {
> > > -		memcg->tcp_mem.memory_pressure = 0;
> > > +#ifdef CONFIG_MEMCG_KMEM
> > > +	if (!cgroup_subsys_on_dfl(memory_cgrp_subsys)) {
> > > +		if (page_counter_try_charge(&memcg->tcp_mem.memory_allocated,
> > > +					    nr_pages, &counter)) {
> > > +			memcg->tcp_mem.memory_pressure = 0;
> > > +			return true;
> > > +		}
> > > +		page_counter_charge(&memcg->tcp_mem.memory_allocated, nr_pages);
> > > +		memcg->tcp_mem.memory_pressure = 1;
> > > +		return false;
> > > +	}
> > > +#endif
> > > +	if (consume_stock(memcg, nr_pages))
> > >  		return true;
> > > +retry:
> > > +	if (page_counter_try_charge(&memcg->memory, batch, &counter))
> > > +		goto done;
> > > +
> > > +	if (batch > nr_pages) {
> > > +		batch = nr_pages;
> > > +		goto retry;
> > >  	}
> > > -	page_counter_charge(&memcg->tcp_mem.memory_allocated, nr_pages);
> > > -	memcg->tcp_mem.memory_pressure = 1;
> > > -	return false;
> > > +
> > > +	page_counter_charge(&memcg->memory, batch);
> > > +	force = true;
> > > +done:
> > 
> > > +	css_get_many(&memcg->css, batch);
> > 
> > Is there any point in taking a css reference for each charged page? For
> > kmem it is absolutely necessary, because dangling slabs must block
> > destruction of the memcg's kmem caches, which are destroyed on css_free.
> > But for sockets there's no such problem: the memcg will be destroyed
> > only after all sockets are destroyed and therefore uncharged (since
> > sock_update_memcg pins the css).
> 
> I'm afraid we have to when we want to share 'stock' with cache and
> anon pages, which hold individual references. drain_stock() always
> assumes one reference per cached page.

Missed that, you're right.

> 
> > > +	if (batch > nr_pages)
> > > +		refill_stock(memcg, batch - nr_pages);
> > > +
> > > +	schedule_work(&memcg->socket_work);
> > 
> > I think it's suboptimal to schedule the work even if we are below the
> > high threshold.
> 
> Hm, it seemed unnecessary to duplicate the hierarchy check since this
> is in the batch-exhausted slowpath anyway.

Dunno, maybe you're right.

I have another question regarding this socket_work: its reclaim target
always equals CHARGE_BATCH. Can't it result in a workload exceeding
memory.high in case there are a lot of allocations coming from different
cpus? In this case the work might not manage to complete before another
allocation happens. Maybe we should accumulate the number of pages to
be reclaimed by the work, as we do in try_charge?
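
Something along these lines, say (the counter field is hypothetical):

	/* Charge path: remember how much the worker should reclaim. */
	atomic_long_add(nr_pages, &memcg->socket_nr_pages_over_high);
	schedule_work(&memcg->socket_work);

	/* Worker: reclaim everything accumulated so far. */
	static void socket_work_func(struct work_struct *work)
	{
		struct mem_cgroup *memcg = container_of(work, struct mem_cgroup,
							socket_work);
		unsigned long nr;

		nr = atomic_long_xchg(&memcg->socket_nr_pages_over_high, 0);
		if (nr)
			reclaim_high(memcg, nr, GFP_KERNEL);
	}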

> 
> > BTW why do we need this work at all? Why is reclaim_high called from
> > task_work not enough?
> 
> The problem lies in the memcg association: the random task that gets
> interrupted by an arriving packet might not be in the same memcg as
> the one owning the receiving socket. And multiple interrupts could
> happen while we're in the kernel already charging pages. We'd basically
> have to maintain a list of memcgs that need to run reclaim_high
> associated with current.
> 

Right, I think this is worth placing in a comment on memcg->socket_work.
I wonder if we could use it *instead* of task_work for handling every
allocation, not only socket-related ones. Would it make any sense? Maybe
it could reduce the latency experienced by tasks in memory cgroups.

Thanks,
Vladimir

^ permalink raw reply	[flat|nested] 61+ messages in thread

* Re: [PATCH 09/14] net: tcp_memcontrol: simplify linkage between socket and page counter
  2015-11-23  9:36       ` Vladimir Davydov
@ 2015-11-23 18:20         ` Johannes Weiner
  2015-11-24 13:43           ` Vladimir Davydov
  0 siblings, 1 reply; 61+ messages in thread
From: Johannes Weiner @ 2015-11-23 18:20 UTC (permalink / raw)
  To: Vladimir Davydov
  Cc: David Miller, Andrew Morton, Tejun Heo, Michal Hocko, netdev,
	linux-mm, cgroups, linux-kernel, kernel-team

On Mon, Nov 23, 2015 at 12:36:46PM +0300, Vladimir Davydov wrote:
> On Fri, Nov 20, 2015 at 01:56:48PM -0500, Johannes Weiner wrote:
> > I actually had all this at first, but then wondered if it makes more
> > sense to keep the legacy code in isolation. Don't you think it would
> > be easier to keep track of what's v1 and what's v2 if we keep the
> > legacy stuff physically separate as much as possible? In particular I
> > found that 'tcp_mem.' marker really useful while working on the code.
> > 
> > In the same vein, tcp_memcontrol.c doesn't really hurt anybody and I'd
> > expect it to remain mostly unopened and unchanged in the future. But
> > if we merge it into memcontrol.c, that code will likely be in the way
> > and we'd have to make it explicit somehow that this is not actually
> > part of the new memory controller anymore.
> > 
> > What do you think?
> 
> There isn't much code left in tcp_memcontrol.c, and not all of it is
> legacy. We still want to call tcp_init_cgroup and tcp_destroy_cgroup
> from memcontrol.c - in fact, it's the only call site, so I think we'd
> better keep these functions there. Apart from init/destroy, there is
> only the legacy-file handling, which is relatively small and isolated.
> We can just put it at the end of memcontrol.c along with the memsw and
> kmem legacy files, with a comment that it's legacy. Personally, I'd
> find the code easier to follow then, because currently the logic
> behind the ACTIVE flag, as well as memcg->tcp_mem init/use/destroy, is
> scattered across two files in different subsystems for no apparent
> reason, now that it no longer needs tcp_prot. Besides, this would let
> us cleanly reuse the ACTIVE flag in init/destroy for the static branch
> inc/dec, and probably in sock_update_memcg instead of sprinkling
> cgroup_subsys_on_dfl all over the place, which would make the code a
> bit cleaner IMO (in fact, that's why I proposed to drop the ACTIVATED
> bit and replace cg_proto->flags with an ->active bool).

As far as I can see, all of tcp_memcontrol.c is legacy, including the
init and destroy functions. We only call them to set up the legacy
tcp_mem state and do legacy jump-label maintenance. Delete it all and
the unified hierarchy controller would still work. So I don't really
see the benefit of consolidating it, only a risk of convoluting things.

That being said, if you care strongly about it and see opportunities
to cut down code and make things more readable, please feel free to
turn the flags -> bool patch into a followup series and I'll be happy
to review it.

Thanks!


* Re: [PATCH 13/14] mm: memcontrol: account socket memory in unified hierarchy memory controller
  2015-11-23 10:00       ` Vladimir Davydov
@ 2015-11-23 19:31         ` Johannes Weiner
  0 siblings, 0 replies; 61+ messages in thread
From: Johannes Weiner @ 2015-11-23 19:31 UTC (permalink / raw)
  To: Vladimir Davydov
  Cc: David Miller, Andrew Morton, Tejun Heo, Michal Hocko, netdev,
	linux-mm, cgroups, linux-kernel, kernel-team

On Mon, Nov 23, 2015 at 01:00:59PM +0300, Vladimir Davydov wrote:
> I have another question regarding this socket_work: its reclaim target
> always equals CHARGE_BATCH. Can't that result in a workload exceeding
> memory.high when a lot of allocations come in from different cpus? In
> that case the work might not manage to complete before the next
> allocation happens. Maybe we should accumulate the number of pages to
> be reclaimed by the work, as we do in try_charge?

Actually, try_to_free_mem_cgroup_pages() rounds it up to 2MB anyway. I
would hate to add locking or more atomics to accumulate a reclaim goal
for the worker on spec, so let's wait to see if this is a real issue.

> > > BTW why do we need this work at all? Why is reclaim_high called from
> > > task_work not enough?
> > 
> > The problem lies in the memcg association: the random task that gets
> > interrupted by an arriving packet might not be in the same memcg as
> > the one owning the receiving socket. And multiple interrupts could
> > happen while we're in the kernel already charging pages. We'd
> > basically have to maintain a list of memcgs that need to run
> > reclaim_high associated with current.
> > 
> 
> Right, I think this is worth noting in a comment on memcg->socket_work.

Okay, will do.

> I wonder if we could use it *instead* of task_work for handling every
> allocation, not only socket-related ones. Would that make any sense?
> Maybe it could reduce the latency experienced by tasks in memory
> cgroups.

No, we *want* charging tasks to do reclaim work once memory.high is
breached, in order to match their speed to memory availability. That
needs to remain synchronous.

What we could try is make memcg->socket_work purely about the receive
side when we're inside the softirq, and arm the per-task work when in
process context on the sending side. I'll look into that.
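
Roughly like this - an untested sketch, assuming the
memcg_nr_pages_over_high/return-to-userspace machinery from the
memory.high work is usable here:

/* in the skmem charge path, after the charge went through: */
if (in_softirq()) {
	/* receive path: no suitable task context, defer to the worker */
	schedule_work(&memcg->socket_work);
} else {
	/* send path: make the sender reclaim on return to userspace */
	current->memcg_nr_pages_over_high += nr_pages;
	set_notify_resume(current);
}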


* Re: [PATCH 09/14] net: tcp_memcontrol: simplify linkage between socket and page counter
  2015-11-23 18:20         ` Johannes Weiner
@ 2015-11-24 13:43           ` Vladimir Davydov
  0 siblings, 0 replies; 61+ messages in thread
From: Vladimir Davydov @ 2015-11-24 13:43 UTC (permalink / raw)
  To: Johannes Weiner
  Cc: David Miller, Andrew Morton, Tejun Heo, Michal Hocko, netdev,
	linux-mm, cgroups, linux-kernel, kernel-team

On Mon, Nov 23, 2015 at 01:20:37PM -0500, Johannes Weiner wrote:
> On Mon, Nov 23, 2015 at 12:36:46PM +0300, Vladimir Davydov wrote:
> > On Fri, Nov 20, 2015 at 01:56:48PM -0500, Johannes Weiner wrote:
> > > I actually had all this at first, but then wondered if it makes more
> > > sense to keep the legacy code in isolation. Don't you think it would
> > > be easier to keep track of what's v1 and what's v2 if we keep the
> > > legacy stuff physically separate as much as possible? In particular I
> > > found that 'tcp_mem.' marker really useful while working on the code.
> > > 
> > > In the same vein, tcp_memcontrol.c doesn't really hurt anybody and I'd
> > > expect it to remain mostly unopened and unchanged in the future. But
> > > if we merge it into memcontrol.c, that code will likely be in the way
> > > and we'd have to make it explicit somehow that this is not actually
> > > part of the new memory controller anymore.
> > > 
> > > What do you think?
> > 
> > There isn't much code left in tcp_memcontrol.c, and not all of it is
> > legacy. We still want to call tcp_init_cgroup and tcp_destroy_cgroup
> > from memcontrol.c - in fact, it's the only call site, so I think we'd
> > better keep these functions there. Apart from init/destroy, there is
> > only the legacy-file handling, which is relatively small and isolated.
> > We can just put it at the end of memcontrol.c along with the memsw and
> > kmem legacy files, with a comment that it's legacy. Personally, I'd
> > find the code easier to follow then, because currently the logic
> > behind the ACTIVE flag, as well as memcg->tcp_mem init/use/destroy, is
> > scattered across two files in different subsystems for no apparent
> > reason, now that it no longer needs tcp_prot. Besides, this would let
> > us cleanly reuse the ACTIVE flag in init/destroy for the static branch
> > inc/dec, and probably in sock_update_memcg instead of sprinkling
> > cgroup_subsys_on_dfl all over the place, which would make the code a
> > bit cleaner IMO (in fact, that's why I proposed to drop the ACTIVATED
> > bit and replace cg_proto->flags with an ->active bool).
> 
> As far as I can see, all of tcp_memcontrol.c is legacy, including the
> init and destroy functions. We only call them to set up the legacy
> tcp_mem state and do legacy jump-label maintenance. Delete it all and
> the unified hierarchy controller would still work. So I don't really
> see the benefit of consolidating it, only a risk of convoluting things.
> 
> That being said, if you care strongly about it and see opportunities
> to cut down code and make things more readable, please feel free to
> turn the flags -> bool patch into a followup series and I'll be happy
> to review it.

OK, I'll look into that.

Regarding this patch, I don't have any questions left,

Reviewed-by: Vladimir Davydov <vdavydov@virtuozzo.com>

Thanks,
Vladimir


Thread overview: 61+ messages
2015-11-12 23:41 [PATCH 00/14] mm: memcontrol: account socket memory in unified hierarchy Johannes Weiner
2015-11-12 23:41 ` [PATCH 01/14] mm: memcontrol: export root_mem_cgroup Johannes Weiner
2015-11-13 15:59   ` David Miller
2015-11-14 12:17   ` Vladimir Davydov
2015-11-12 23:41 ` [PATCH 02/14] mm: vmscan: simplify memcg vs. global shrinker invocation Johannes Weiner
2015-11-13 15:59   ` David Miller
2015-11-14 12:36   ` Vladimir Davydov
2015-11-14 15:06     ` Johannes Weiner
2015-11-12 23:41 ` [PATCH 03/14] net: tcp_memcontrol: properly detect ancestor socket pressure Johannes Weiner
2015-11-13 16:00   ` David Miller
2015-11-14 12:45   ` Vladimir Davydov
2015-11-14 15:15     ` Johannes Weiner
2015-11-12 23:41 ` [PATCH 04/14] net: tcp_memcontrol: remove bogus hierarchy pressure propagation Johannes Weiner
2015-11-13 16:00   ` David Miller
2015-11-20  9:07   ` Vladimir Davydov
2015-11-12 23:41 ` [PATCH 05/14] net: tcp_memcontrol: protect all tcp_memcontrol calls by jump-label Johannes Weiner
2015-11-13 16:01   ` David Miller
2015-11-14 16:33   ` Vladimir Davydov
2015-11-16 17:52     ` Johannes Weiner
2015-11-12 23:41 ` [PATCH 06/14] net: tcp_memcontrol: remove dead per-memcg count of allocated sockets Johannes Weiner
2015-11-13 16:01   ` David Miller
2015-11-20  9:48   ` Vladimir Davydov
2015-11-12 23:41 ` [PATCH 07/14] net: tcp_memcontrol: simplify the per-memcg limit access Johannes Weiner
2015-11-20  9:51   ` Vladimir Davydov
2015-11-12 23:41 ` [PATCH 08/14] net: tcp_memcontrol: sanitize tcp memory accounting callbacks Johannes Weiner
2015-11-13  4:53   ` Eric Dumazet
2015-11-13  5:44     ` Johannes Weiner
2015-11-20 10:58   ` Vladimir Davydov
2015-11-20 18:42     ` Johannes Weiner
2015-11-12 23:41 ` [PATCH 09/14] net: tcp_memcontrol: simplify linkage between socket and page counter Johannes Weiner
2015-11-20 12:42   ` Vladimir Davydov
2015-11-20 18:56     ` Johannes Weiner
2015-11-23  9:36       ` Vladimir Davydov
2015-11-23 18:20         ` Johannes Weiner
2015-11-24 13:43           ` Vladimir Davydov
2015-11-12 23:41 ` [PATCH 10/14] mm: memcontrol: generalize the socket accounting jump label Johannes Weiner
2015-11-13 10:43   ` Michal Hocko
2015-11-14 13:29   ` Vladimir Davydov
2015-11-12 23:41 ` [PATCH 11/14] mm: memcontrol: do not account memory+swap on unified hierarchy Johannes Weiner
2015-11-13 10:37   ` Michal Hocko
2015-11-14 13:23   ` Vladimir Davydov
2015-11-12 23:41 ` [PATCH 12/14] mm: memcontrol: move socket code for unified hierarchy accounting Johannes Weiner
2015-11-20 12:44   ` Vladimir Davydov
2015-11-12 23:41 ` [PATCH 13/14] mm: memcontrol: account socket memory in unified hierarchy memory controller Johannes Weiner
2015-11-16 15:59   ` Michal Hocko
2015-11-16 18:18     ` Johannes Weiner
2015-11-18 16:22       ` Michal Hocko
2015-11-18 21:48         ` Johannes Weiner
2015-11-19 13:50           ` Michal Hocko
2015-11-19 16:52             ` Johannes Weiner
2015-11-20 13:10   ` Vladimir Davydov
2015-11-20 19:25     ` Johannes Weiner
2015-11-23 10:00       ` Vladimir Davydov
2015-11-23 19:31         ` Johannes Weiner
2015-11-12 23:41 ` [PATCH 14/14] mm: memcontrol: hook up vmpressure to socket pressure Johannes Weiner
2015-11-15 13:54   ` Vladimir Davydov
2015-11-16 18:53     ` Johannes Weiner
2015-11-17 20:18       ` Vladimir Davydov
2015-11-17 22:22         ` Johannes Weiner
2015-11-18 16:02           ` Vladimir Davydov
2015-11-18 18:27             ` Johannes Weiner
