* [PATCH 0/8] mm: memcontrol: account socket memory in unified hierarchy
From: Johannes Weiner @ 2015-10-22  4:21 UTC (permalink / raw)
  To: David S. Miller, Andrew Morton
  Cc: Michal Hocko, Vladimir Davydov, Tejun Heo, netdev, linux-mm,
	cgroups, linux-kernel

Hi,

this series adds socket buffer memory tracking and accounting to the
unified hierarchy memory cgroup controller.

[ Networking people, at this time please check the diffstat below to
  avoid going into convulsions. ]

Socket buffer memory can make up a significant share of a workload's
memory footprint, and so it needs to be accounted and tracked out of
the box, along with other types of memory that can be directly linked
to userspace activity, in order to provide useful resource isolation.

Historically, socket buffers were accounted in a separate counter,
without any pressure equalization between anonymous memory, page
cache, and the socket buffers. When the socket buffer pool was
exhausted, buffer allocations would fail hard and cause network
performance to tank, regardless of whether there was still memory
available to the group. Likewise, struggling anonymous or cache
working sets could not dip into an idle socket memory pool. Because of
this, the feature was not usable for many real-life applications.

To avoid repeating this mistake, the new memory controller accounts
all types of memory pages it tracks on behalf of a cgroup in a single
pool. Upon pressure, the VM reclaims and shrinks whatever memory in
that pool is within its reach.

These patches add accounting for memory consumed by sockets associated
with a cgroup to the existing pool of anonymous pages and page cache.

Patch #3 reworks the existing memcg socket infrastructure. The old
code has many provisions for future plans that won't materialize, and
much of it simply evaporates. The networking people should be happy
about this.

Patch #5 adds accounting and tracking of socket memory to the unified
hierarchy memory controller, as described above. It uses the existing
per-cpu charge caches and triggers high limit reclaim asynchronously.
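
[ For illustration only, not the code of patch #5: a rough sketch of
  how such a charge path could look. consume_stock() and the
  high_work member are assumptions made for this example; memcg->memory
  and memcg->high are the counters of the unified controller. The
  point is that socket charges always succeed, and breaching the high
  limit only schedules reclaim instead of blocking the networking
  path. ]

    bool mem_cgroup_charge_skmem(struct mem_cgroup *memcg,
                                 unsigned int nr_pages)
    {
            /* fast path: consume from the per-cpu charge cache */
            if (!consume_stock(memcg, nr_pages))
                    page_counter_charge(&memcg->memory, nr_pages);

            /* over the high limit? punt the reclaim to a worker */
            if (page_counter_read(&memcg->memory) > memcg->high)
                    schedule_work(&memcg->high_work);

            return true;
    }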

Patch #8 uses the vmpressure extension to equalize pressure between
the pages tracked natively by the VM and socket buffer pages. As the
pool is shared, it makes sense that while natively tracked pages are
under duress, network transmit windows are not increased either.
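
[ Again illustrative only: one way the socket side could consult that
  shared pressure state. mem_cgroup_under_socket_pressure() is a
  made-up name for whatever signal patch #8 derives from vmpressure;
  the rest mirrors the existing tcp_under_memory_pressure() helper. ]

    static inline bool tcp_under_memory_pressure(const struct sock *sk)
    {
            /* hold back window growth while the cgroup's natively
             * tracked pages are under reclaim pressure */
            if (mem_cgroup_do_sockets() && sk->sk_memcg &&
                mem_cgroup_under_socket_pressure(sk->sk_memcg))
                    return true;

            return tcp_memory_pressure;
    }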

As per above, this is an essential part of the new memory controller's
core functionality. With the unified hierarchy nearing release, please
consider this for 4.4.

 include/linux/memcontrol.h   |  90 +++++++++-------
 include/linux/page_counter.h |   6 +-
 include/net/sock.h           | 139 ++----------------------
 include/net/tcp.h            |   5 +-
 include/net/tcp_memcontrol.h |   7 --
 mm/backing-dev.c             |   2 +-
 mm/hugetlb_cgroup.c          |   3 +-
 mm/memcontrol.c              | 235 ++++++++++++++++++++++++++---------------
 mm/page_counter.c            |  14 +--
 mm/vmpressure.c              |  29 ++++-
 mm/vmscan.c                  |  41 +++----
 net/core/sock.c              |  78 ++++----------
 net/ipv4/sysctl_net_ipv4.c   |   1 -
 net/ipv4/tcp.c               |   3 +-
 net/ipv4/tcp_ipv4.c          |   9 +-
 net/ipv4/tcp_memcontrol.c    | 147 ++++----------------------
 net/ipv4/tcp_output.c        |   6 +-
 net/ipv6/tcp_ipv6.c          |   3 -
 18 files changed, 319 insertions(+), 499 deletions(-)


* [PATCH 1/8] mm: page_counter: let page_counter_try_charge() return bool
From: Johannes Weiner @ 2015-10-22  4:21 UTC (permalink / raw)
  To: David S. Miller, Andrew Morton
  Cc: Michal Hocko, Vladimir Davydov, Tejun Heo, netdev, linux-mm,
	cgroups, linux-kernel

page_counter_try_charge() currently returns 0 on success and -ENOMEM
on failure, which is surprising behavior given the function name.

Make it follow the expected pattern of try_stuff() functions that
return a boolean true to indicate success, or false for failure.
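
[ Purely to illustrate the calling convention; this is a condensed
  version of the __memcg_kmem_charge_memcg() hunk below. ]

    struct page_counter *counter;

    /* success/failure now reads naturally at the call site */
    if (!page_counter_try_charge(&memcg->kmem, nr_pages, &counter))
            return -ENOMEM;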

Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
---
 include/linux/page_counter.h |  6 +++---
 mm/hugetlb_cgroup.c          |  3 ++-
 mm/memcontrol.c              | 11 +++++------
 mm/page_counter.c            | 14 +++++++-------
 4 files changed, 17 insertions(+), 17 deletions(-)

diff --git a/include/linux/page_counter.h b/include/linux/page_counter.h
index 17fa4f8..7e62920 100644
--- a/include/linux/page_counter.h
+++ b/include/linux/page_counter.h
@@ -36,9 +36,9 @@ static inline unsigned long page_counter_read(struct page_counter *counter)
 
 void page_counter_cancel(struct page_counter *counter, unsigned long nr_pages);
 void page_counter_charge(struct page_counter *counter, unsigned long nr_pages);
-int page_counter_try_charge(struct page_counter *counter,
-			    unsigned long nr_pages,
-			    struct page_counter **fail);
+bool page_counter_try_charge(struct page_counter *counter,
+			     unsigned long nr_pages,
+			     struct page_counter **fail);
 void page_counter_uncharge(struct page_counter *counter, unsigned long nr_pages);
 int page_counter_limit(struct page_counter *counter, unsigned long limit);
 int page_counter_memparse(const char *buf, const char *max,
diff --git a/mm/hugetlb_cgroup.c b/mm/hugetlb_cgroup.c
index 6a44263..d8fb10d 100644
--- a/mm/hugetlb_cgroup.c
+++ b/mm/hugetlb_cgroup.c
@@ -186,7 +186,8 @@ again:
 	}
 	rcu_read_unlock();
 
-	ret = page_counter_try_charge(&h_cg->hugepage[idx], nr_pages, &counter);
+	if (!page_counter_try_charge(&h_cg->hugepage[idx], nr_pages, &counter))
+		ret = -ENOMEM;
 	css_put(&h_cg->css);
 done:
 	*ptr = h_cg;
diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index c71fe40..a8ccdbc 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -2018,8 +2018,8 @@ retry:
 		return 0;
 
 	if (!do_swap_account ||
-	    !page_counter_try_charge(&memcg->memsw, batch, &counter)) {
-		if (!page_counter_try_charge(&memcg->memory, batch, &counter))
+	    page_counter_try_charge(&memcg->memsw, batch, &counter)) {
+		if (page_counter_try_charge(&memcg->memory, batch, &counter))
 			goto done_restock;
 		if (do_swap_account)
 			page_counter_uncharge(&memcg->memsw, batch);
@@ -2383,14 +2383,13 @@ int __memcg_kmem_charge_memcg(struct page *page, gfp_t gfp, int order,
 {
 	unsigned int nr_pages = 1 << order;
 	struct page_counter *counter;
-	int ret = 0;
+	int ret;
 
 	if (!memcg_kmem_is_active(memcg))
 		return 0;
 
-	ret = page_counter_try_charge(&memcg->kmem, nr_pages, &counter);
-	if (ret)
-		return ret;
+	if (!page_counter_try_charge(&memcg->kmem, nr_pages, &counter))
+		return -ENOMEM;
 
 	ret = try_charge(memcg, gfp, nr_pages);
 	if (ret) {
diff --git a/mm/page_counter.c b/mm/page_counter.c
index 11b4bed..7c6a63d 100644
--- a/mm/page_counter.c
+++ b/mm/page_counter.c
@@ -56,12 +56,12 @@ void page_counter_charge(struct page_counter *counter, unsigned long nr_pages)
  * @nr_pages: number of pages to charge
  * @fail: points first counter to hit its limit, if any
  *
- * Returns 0 on success, or -ENOMEM and @fail if the counter or one of
- * its ancestors has hit its configured limit.
+ * Returns %true on success, or %false and @fail if the counter or one
+ * of its ancestors has hit its configured limit.
  */
-int page_counter_try_charge(struct page_counter *counter,
-			    unsigned long nr_pages,
-			    struct page_counter **fail)
+bool page_counter_try_charge(struct page_counter *counter,
+			     unsigned long nr_pages,
+			     struct page_counter **fail)
 {
 	struct page_counter *c;
 
@@ -99,13 +99,13 @@ int page_counter_try_charge(struct page_counter *counter,
 		if (new > c->watermark)
 			c->watermark = new;
 	}
-	return 0;
+	return true;
 
 failed:
 	for (c = counter; c != *fail; c = c->parent)
 		page_counter_cancel(c, nr_pages);
 
-	return -ENOMEM;
+	return false;
 }
 
 /**
-- 
2.6.1


* [PATCH 2/8] mm: memcontrol: export root_mem_cgroup
From: Johannes Weiner @ 2015-10-22  4:21 UTC (permalink / raw)
  To: David S. Miller, Andrew Morton
  Cc: Michal Hocko, Vladimir Davydov, Tejun Heo, netdev, linux-mm,
	cgroups, linux-kernel

A later patch will need this symbol in files other than memcontrol.c,
so export it now and replace mem_cgroup_root_css at the same time.
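
[ For context, a sketch of the kind of use the later patch has in
  mind: special-casing the root group outside memcontrol.c, as the
  tcp_memcontrol.c changes in patch #3 do. ]

    /* illustrative consumer in another file */
    if (memcg == root_mem_cgroup)
            return -EINVAL; /* the root group has no limit to set */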

Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
---
 include/linux/memcontrol.h | 3 ++-
 mm/backing-dev.c           | 2 +-
 mm/memcontrol.c            | 5 ++---
 3 files changed, 5 insertions(+), 5 deletions(-)

diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
index 805da1f..19ff87b 100644
--- a/include/linux/memcontrol.h
+++ b/include/linux/memcontrol.h
@@ -275,7 +275,8 @@ struct mem_cgroup {
 	struct mem_cgroup_per_node *nodeinfo[0];
 	/* WARNING: nodeinfo must be the last member here */
 };
-extern struct cgroup_subsys_state *mem_cgroup_root_css;
+
+extern struct mem_cgroup *root_mem_cgroup;
 
 /**
  * mem_cgroup_events - count memory events against a cgroup
diff --git a/mm/backing-dev.c b/mm/backing-dev.c
index 095b23b..73ab967 100644
--- a/mm/backing-dev.c
+++ b/mm/backing-dev.c
@@ -702,7 +702,7 @@ static int cgwb_bdi_init(struct backing_dev_info *bdi)
 
 	ret = wb_init(&bdi->wb, bdi, 1, GFP_KERNEL);
 	if (!ret) {
-		bdi->wb.memcg_css = mem_cgroup_root_css;
+		bdi->wb.memcg_css = &root_mem_cgroup->css;
 		bdi->wb.blkcg_css = blkcg_root_css;
 	}
 	return ret;
diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index a8ccdbc..e54f434 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -76,9 +76,9 @@
 struct cgroup_subsys memory_cgrp_subsys __read_mostly;
 EXPORT_SYMBOL(memory_cgrp_subsys);
 
+struct mem_cgroup *root_mem_cgroup __read_mostly;
+
 #define MEM_CGROUP_RECLAIM_RETRIES	5
-static struct mem_cgroup *root_mem_cgroup __read_mostly;
-struct cgroup_subsys_state *mem_cgroup_root_css __read_mostly;
 
 /* Whether the swap controller is active */
 #ifdef CONFIG_MEMCG_SWAP
@@ -4213,7 +4213,6 @@ mem_cgroup_css_alloc(struct cgroup_subsys_state *parent_css)
 	/* root ? */
 	if (parent_css == NULL) {
 		root_mem_cgroup = memcg;
-		mem_cgroup_root_css = &memcg->css;
 		page_counter_init(&memcg->memory, NULL);
 		memcg->high = PAGE_COUNTER_MAX;
 		memcg->soft_limit = PAGE_COUNTER_MAX;
-- 
2.6.1


* [PATCH 3/8] net: consolidate memcg socket buffer tracking and accounting
From: Johannes Weiner @ 2015-10-22  4:21 UTC (permalink / raw)
  To: David S. Miller, Andrew Morton
  Cc: Michal Hocko, Vladimir Davydov, Tejun Heo, netdev, linux-mm,
	cgroups, linux-kernel

The tcp memory controller has extensive provisions for future memory
accounting interfaces that won't materialize after all. Cut the code
base down to what's actually used, now and in the likely future.

- There won't be any different protocol counters in the future, so a
  direct sock->sk_memcg linkage is enough. This eliminates a lot of
  the callback maze and boilerplate code, and restores most of the
  socket allocation code to its pre-tcp_memcontrol state.

- There won't be a tcp control soft limit, so integrating the memcg
  code into the global skmem limiting scheme complicates things
  unnecessarily. Replace all that with simple and clear charge and
  uncharge calls--hidden behind a jump label--to account skb memory
  (see the condensed sketch after this list).

- The previous jump label code was an elaborate state machine that
  tracked the number of cgroups with an active socket limit in order
  to enable the skmem tracking and accounting code only when actively
  necessary. But this is overengineered: it was meant to protect the
  people who never use this feature in the first place. Simply enable
  the branches when the first limit is set, and leave them enabled
  until the next reboot.
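
[ To make the second point concrete, the charge and uncharge sites in
  net/core/sock.c reduce to roughly the following after this patch
  (condensed from the diff below): ]

    /* __sk_mem_schedule(): charge the socket's memcg, or back off */
    if (mem_cgroup_do_sockets() && sk->sk_memcg &&
        !mem_cgroup_charge_skmem(sk->sk_memcg, amt))
            goto suppress_allocation;

    /* __sk_mem_reclaim(): give the pages back */
    if (mem_cgroup_do_sockets() && sk->sk_memcg)
            mem_cgroup_uncharge_skmem(sk->sk_memcg, amount);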

Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
---
 include/linux/memcontrol.h   |  64 ++++++++-----------
 include/net/sock.h           | 135 +++------------------------------------
 include/net/tcp.h            |   3 -
 include/net/tcp_memcontrol.h |   7 ---
 mm/memcontrol.c              | 101 +++++++++++++++--------------
 net/core/sock.c              |  78 ++++++-----------------
 net/ipv4/sysctl_net_ipv4.c   |   1 -
 net/ipv4/tcp.c               |   3 +-
 net/ipv4/tcp_ipv4.c          |   9 +--
 net/ipv4/tcp_memcontrol.c    | 147 +++++++------------------------------------
 net/ipv4/tcp_output.c        |   6 +-
 net/ipv6/tcp_ipv6.c          |   3 -
 12 files changed, 136 insertions(+), 421 deletions(-)
 delete mode 100644 include/net/tcp_memcontrol.h

diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
index 19ff87b..5b72f83 100644
--- a/include/linux/memcontrol.h
+++ b/include/linux/memcontrol.h
@@ -85,34 +85,6 @@ enum mem_cgroup_events_target {
 	MEM_CGROUP_NTARGETS,
 };
 
-/*
- * Bits in struct cg_proto.flags
- */
-enum cg_proto_flags {
-	/* Currently active and new sockets should be assigned to cgroups */
-	MEMCG_SOCK_ACTIVE,
-	/* It was ever activated; we must disarm static keys on destruction */
-	MEMCG_SOCK_ACTIVATED,
-};
-
-struct cg_proto {
-	struct page_counter	memory_allocated;	/* Current allocated memory. */
-	struct percpu_counter	sockets_allocated;	/* Current number of sockets. */
-	int			memory_pressure;
-	long			sysctl_mem[3];
-	unsigned long		flags;
-	/*
-	 * memcg field is used to find which memcg we belong directly
-	 * Each memcg struct can hold more than one cg_proto, so container_of
-	 * won't really cut.
-	 *
-	 * The elegant solution would be having an inverse function to
-	 * proto_cgroup in struct proto, but that means polluting the structure
-	 * for everybody, instead of just for memcg users.
-	 */
-	struct mem_cgroup	*memcg;
-};
-
 #ifdef CONFIG_MEMCG
 struct mem_cgroup_stat_cpu {
 	long count[MEM_CGROUP_STAT_NSTATS];
@@ -185,8 +157,15 @@ struct mem_cgroup {
 
 	/* Accounted resources */
 	struct page_counter memory;
+
+	/*
+	 * Legacy non-resource counters. In unified hierarchy, all
+	 * memory is accounted and limited through memcg->memory.
+	 * Consumer breakdown happens in the statistics.
+	 */
 	struct page_counter memsw;
 	struct page_counter kmem;
+	struct page_counter skmem;
 
 	/* Normal memory consumption range */
 	unsigned long low;
@@ -246,9 +225,6 @@ struct mem_cgroup {
 	 */
 	struct mem_cgroup_stat_cpu __percpu *stat;
 
-#if defined(CONFIG_MEMCG_KMEM) && defined(CONFIG_INET)
-	struct cg_proto tcp_mem;
-#endif
 #if defined(CONFIG_MEMCG_KMEM)
         /* Index in the kmem_cache->memcg_params.memcg_caches array */
 	int kmemcg_id;
@@ -676,12 +652,6 @@ void mem_cgroup_count_vm_event(struct mm_struct *mm, enum vm_event_item idx)
 }
 #endif /* CONFIG_MEMCG */
 
-enum {
-	UNDER_LIMIT,
-	SOFT_LIMIT,
-	OVER_LIMIT,
-};
-
 #ifdef CONFIG_CGROUP_WRITEBACK
 
 struct list_head *mem_cgroup_cgwb_list(struct mem_cgroup *memcg);
@@ -707,15 +677,35 @@ static inline void mem_cgroup_wb_stats(struct bdi_writeback *wb,
 
 struct sock;
 #if defined(CONFIG_INET) && defined(CONFIG_MEMCG_KMEM)
+extern struct static_key_false mem_cgroup_sockets;
+static inline bool mem_cgroup_do_sockets(void)
+{
+	return static_branch_unlikely(&mem_cgroup_sockets);
+}
 void sock_update_memcg(struct sock *sk);
 void sock_release_memcg(struct sock *sk);
+bool mem_cgroup_charge_skmem(struct mem_cgroup *memcg, unsigned int nr_pages);
+void mem_cgroup_uncharge_skmem(struct mem_cgroup *memcg, unsigned int nr_pages);
 #else
+static inline bool mem_cgroup_do_sockets(void)
+{
+	return false;
+}
 static inline void sock_update_memcg(struct sock *sk)
 {
 }
 static inline void sock_release_memcg(struct sock *sk)
 {
 }
+static inline bool mem_cgroup_charge_skmem(struct mem_cgroup *memcg,
+					   unsigned int nr_pages)
+{
+	return true;
+}
+static inline void mem_cgroup_uncharge_skmem(struct mem_cgroup *memcg,
+					     unsigned int nr_pages)
+{
+}
 #endif /* CONFIG_INET && CONFIG_MEMCG_KMEM */
 
 #ifdef CONFIG_MEMCG_KMEM
diff --git a/include/net/sock.h b/include/net/sock.h
index 59a7196..67795fc 100644
--- a/include/net/sock.h
+++ b/include/net/sock.h
@@ -69,22 +69,6 @@
 #include <net/tcp_states.h>
 #include <linux/net_tstamp.h>
 
-struct cgroup;
-struct cgroup_subsys;
-#ifdef CONFIG_NET
-int mem_cgroup_sockets_init(struct mem_cgroup *memcg, struct cgroup_subsys *ss);
-void mem_cgroup_sockets_destroy(struct mem_cgroup *memcg);
-#else
-static inline
-int mem_cgroup_sockets_init(struct mem_cgroup *memcg, struct cgroup_subsys *ss)
-{
-	return 0;
-}
-static inline
-void mem_cgroup_sockets_destroy(struct mem_cgroup *memcg)
-{
-}
-#endif
 /*
  * This structure really needs to be cleaned up.
  * Most of it is for TCP, and not used by any of
@@ -243,7 +227,6 @@ struct sock_common {
 	/* public: */
 };
 
-struct cg_proto;
 /**
   *	struct sock - network layer representation of sockets
   *	@__sk_common: shared layout with inet_timewait_sock
@@ -310,7 +293,7 @@ struct cg_proto;
   *	@sk_security: used by security modules
   *	@sk_mark: generic packet mark
   *	@sk_classid: this socket's cgroup classid
-  *	@sk_cgrp: this socket's cgroup-specific proto data
+  *	@sk_memcg: this socket's memcg association
   *	@sk_write_pending: a write to stream socket waits to start
   *	@sk_state_change: callback to indicate change in the state of the sock
   *	@sk_data_ready: callback to indicate there is data to be processed
@@ -447,7 +430,7 @@ struct sock {
 #ifdef CONFIG_CGROUP_NET_CLASSID
 	u32			sk_classid;
 #endif
-	struct cg_proto		*sk_cgrp;
+	struct mem_cgroup	*sk_memcg;
 	void			(*sk_state_change)(struct sock *sk);
 	void			(*sk_data_ready)(struct sock *sk);
 	void			(*sk_write_space)(struct sock *sk);
@@ -1051,18 +1034,6 @@ struct proto {
 #ifdef SOCK_REFCNT_DEBUG
 	atomic_t		socks;
 #endif
-#ifdef CONFIG_MEMCG_KMEM
-	/*
-	 * cgroup specific init/deinit functions. Called once for all
-	 * protocols that implement it, from cgroups populate function.
-	 * This function has to setup any files the protocol want to
-	 * appear in the kmem cgroup filesystem.
-	 */
-	int			(*init_cgroup)(struct mem_cgroup *memcg,
-					       struct cgroup_subsys *ss);
-	void			(*destroy_cgroup)(struct mem_cgroup *memcg);
-	struct cg_proto		*(*proto_cgroup)(struct mem_cgroup *memcg);
-#endif
 };
 
 int proto_register(struct proto *prot, int alloc_slab);
@@ -1093,23 +1064,6 @@ static inline void sk_refcnt_debug_release(const struct sock *sk)
 #define sk_refcnt_debug_release(sk) do { } while (0)
 #endif /* SOCK_REFCNT_DEBUG */
 
-#if defined(CONFIG_MEMCG_KMEM) && defined(CONFIG_NET)
-extern struct static_key memcg_socket_limit_enabled;
-static inline struct cg_proto *parent_cg_proto(struct proto *proto,
-					       struct cg_proto *cg_proto)
-{
-	return proto->proto_cgroup(parent_mem_cgroup(cg_proto->memcg));
-}
-#define mem_cgroup_sockets_enabled static_key_false(&memcg_socket_limit_enabled)
-#else
-#define mem_cgroup_sockets_enabled 0
-static inline struct cg_proto *parent_cg_proto(struct proto *proto,
-					       struct cg_proto *cg_proto)
-{
-	return NULL;
-}
-#endif
-
 static inline bool sk_stream_memory_free(const struct sock *sk)
 {
 	if (sk->sk_wmem_queued >= sk->sk_sndbuf)
@@ -1136,9 +1090,6 @@ static inline bool sk_under_memory_pressure(const struct sock *sk)
 	if (!sk->sk_prot->memory_pressure)
 		return false;
 
-	if (mem_cgroup_sockets_enabled && sk->sk_cgrp)
-		return !!sk->sk_cgrp->memory_pressure;
-
 	return !!*sk->sk_prot->memory_pressure;
 }
 
@@ -1146,61 +1097,19 @@ static inline void sk_leave_memory_pressure(struct sock *sk)
 {
 	int *memory_pressure = sk->sk_prot->memory_pressure;
 
-	if (!memory_pressure)
-		return;
-
-	if (*memory_pressure)
+	if (memory_pressure && *memory_pressure)
 		*memory_pressure = 0;
-
-	if (mem_cgroup_sockets_enabled && sk->sk_cgrp) {
-		struct cg_proto *cg_proto = sk->sk_cgrp;
-		struct proto *prot = sk->sk_prot;
-
-		for (; cg_proto; cg_proto = parent_cg_proto(prot, cg_proto))
-			cg_proto->memory_pressure = 0;
-	}
-
 }
 
 static inline void sk_enter_memory_pressure(struct sock *sk)
 {
-	if (!sk->sk_prot->enter_memory_pressure)
-		return;
-
-	if (mem_cgroup_sockets_enabled && sk->sk_cgrp) {
-		struct cg_proto *cg_proto = sk->sk_cgrp;
-		struct proto *prot = sk->sk_prot;
-
-		for (; cg_proto; cg_proto = parent_cg_proto(prot, cg_proto))
-			cg_proto->memory_pressure = 1;
-	}
-
-	sk->sk_prot->enter_memory_pressure(sk);
+	if (sk->sk_prot->enter_memory_pressure)
+		sk->sk_prot->enter_memory_pressure(sk);
 }
 
 static inline long sk_prot_mem_limits(const struct sock *sk, int index)
 {
-	long *prot = sk->sk_prot->sysctl_mem;
-	if (mem_cgroup_sockets_enabled && sk->sk_cgrp)
-		prot = sk->sk_cgrp->sysctl_mem;
-	return prot[index];
-}
-
-static inline void memcg_memory_allocated_add(struct cg_proto *prot,
-					      unsigned long amt,
-					      int *parent_status)
-{
-	page_counter_charge(&prot->memory_allocated, amt);
-
-	if (page_counter_read(&prot->memory_allocated) >
-	    prot->memory_allocated.limit)
-		*parent_status = OVER_LIMIT;
-}
-
-static inline void memcg_memory_allocated_sub(struct cg_proto *prot,
-					      unsigned long amt)
-{
-	page_counter_uncharge(&prot->memory_allocated, amt);
+	return sk->sk_prot->sysctl_mem[index];
 }
 
 static inline long
@@ -1208,24 +1117,14 @@ sk_memory_allocated(const struct sock *sk)
 {
 	struct proto *prot = sk->sk_prot;
 
-	if (mem_cgroup_sockets_enabled && sk->sk_cgrp)
-		return page_counter_read(&sk->sk_cgrp->memory_allocated);
-
 	return atomic_long_read(prot->memory_allocated);
 }
 
 static inline long
-sk_memory_allocated_add(struct sock *sk, int amt, int *parent_status)
+sk_memory_allocated_add(struct sock *sk, int amt)
 {
 	struct proto *prot = sk->sk_prot;
 
-	if (mem_cgroup_sockets_enabled && sk->sk_cgrp) {
-		memcg_memory_allocated_add(sk->sk_cgrp, amt, parent_status);
-		/* update the root cgroup regardless */
-		atomic_long_add_return(amt, prot->memory_allocated);
-		return page_counter_read(&sk->sk_cgrp->memory_allocated);
-	}
-
 	return atomic_long_add_return(amt, prot->memory_allocated);
 }
 
@@ -1234,9 +1133,6 @@ sk_memory_allocated_sub(struct sock *sk, int amt)
 {
 	struct proto *prot = sk->sk_prot;
 
-	if (mem_cgroup_sockets_enabled && sk->sk_cgrp)
-		memcg_memory_allocated_sub(sk->sk_cgrp, amt);
-
 	atomic_long_sub(amt, prot->memory_allocated);
 }
 
@@ -1244,13 +1140,6 @@ static inline void sk_sockets_allocated_dec(struct sock *sk)
 {
 	struct proto *prot = sk->sk_prot;
 
-	if (mem_cgroup_sockets_enabled && sk->sk_cgrp) {
-		struct cg_proto *cg_proto = sk->sk_cgrp;
-
-		for (; cg_proto; cg_proto = parent_cg_proto(prot, cg_proto))
-			percpu_counter_dec(&cg_proto->sockets_allocated);
-	}
-
 	percpu_counter_dec(prot->sockets_allocated);
 }
 
@@ -1258,13 +1147,6 @@ static inline void sk_sockets_allocated_inc(struct sock *sk)
 {
 	struct proto *prot = sk->sk_prot;
 
-	if (mem_cgroup_sockets_enabled && sk->sk_cgrp) {
-		struct cg_proto *cg_proto = sk->sk_cgrp;
-
-		for (; cg_proto; cg_proto = parent_cg_proto(prot, cg_proto))
-			percpu_counter_inc(&cg_proto->sockets_allocated);
-	}
-
 	percpu_counter_inc(prot->sockets_allocated);
 }
 
@@ -1273,9 +1155,6 @@ sk_sockets_allocated_read_positive(struct sock *sk)
 {
 	struct proto *prot = sk->sk_prot;
 
-	if (mem_cgroup_sockets_enabled && sk->sk_cgrp)
-		return percpu_counter_read_positive(&sk->sk_cgrp->sockets_allocated);
-
 	return percpu_counter_read_positive(prot->sockets_allocated);
 }
 
diff --git a/include/net/tcp.h b/include/net/tcp.h
index eed94fc..77b6c7e 100644
--- a/include/net/tcp.h
+++ b/include/net/tcp.h
@@ -291,9 +291,6 @@ extern int tcp_memory_pressure;
 /* optimized version of sk_under_memory_pressure() for TCP sockets */
 static inline bool tcp_under_memory_pressure(const struct sock *sk)
 {
-	if (mem_cgroup_sockets_enabled && sk->sk_cgrp)
-		return !!sk->sk_cgrp->memory_pressure;
-
 	return tcp_memory_pressure;
 }
 /*
diff --git a/include/net/tcp_memcontrol.h b/include/net/tcp_memcontrol.h
deleted file mode 100644
index 05b94d9..0000000
--- a/include/net/tcp_memcontrol.h
+++ /dev/null
@@ -1,7 +0,0 @@
-#ifndef _TCP_MEMCG_H
-#define _TCP_MEMCG_H
-
-struct cg_proto *tcp_proto_cgroup(struct mem_cgroup *memcg);
-int tcp_init_cgroup(struct mem_cgroup *memcg, struct cgroup_subsys *ss);
-void tcp_destroy_cgroup(struct mem_cgroup *memcg);
-#endif /* _TCP_MEMCG_H */
diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index e54f434..c41e6d7 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -66,7 +66,6 @@
 #include "internal.h"
 #include <net/sock.h>
 #include <net/ip.h>
-#include <net/tcp_memcontrol.h>
 #include "slab.h"
 
 #include <asm/uaccess.h>
@@ -291,58 +290,68 @@ static inline struct mem_cgroup *mem_cgroup_from_id(unsigned short id)
 /* Writing them here to avoid exposing memcg's inner layout */
 #if defined(CONFIG_INET) && defined(CONFIG_MEMCG_KMEM)
 
+DEFINE_STATIC_KEY_FALSE(mem_cgroup_sockets);
+
 void sock_update_memcg(struct sock *sk)
 {
-	if (mem_cgroup_sockets_enabled) {
-		struct mem_cgroup *memcg;
-		struct cg_proto *cg_proto;
-
-		BUG_ON(!sk->sk_prot->proto_cgroup);
-
-		/* Socket cloning can throw us here with sk_cgrp already
-		 * filled. It won't however, necessarily happen from
-		 * process context. So the test for root memcg given
-		 * the current task's memcg won't help us in this case.
-		 *
-		 * Respecting the original socket's memcg is a better
-		 * decision in this case.
-		 */
-		if (sk->sk_cgrp) {
-			BUG_ON(mem_cgroup_is_root(sk->sk_cgrp->memcg));
-			css_get(&sk->sk_cgrp->memcg->css);
-			return;
-		}
-
-		rcu_read_lock();
-		memcg = mem_cgroup_from_task(current);
-		cg_proto = sk->sk_prot->proto_cgroup(memcg);
-		if (cg_proto && test_bit(MEMCG_SOCK_ACTIVE, &cg_proto->flags) &&
-		    css_tryget_online(&memcg->css)) {
-			sk->sk_cgrp = cg_proto;
-		}
-		rcu_read_unlock();
+	struct mem_cgroup *memcg;
+	/*
+	 * Socket cloning can throw us here with sk_cgrp already
+	 * filled. It won't however, necessarily happen from
+	 * process context. So the test for root memcg given
+	 * the current task's memcg won't help us in this case.
+	 *
+	 * Respecting the original socket's memcg is a better
+	 * decision in this case.
+	 */
+	if (sk->sk_memcg) {
+		BUG_ON(mem_cgroup_is_root(sk->sk_memcg));
+		css_get(&sk->sk_memcg->css);
+		return;
 	}
+
+	rcu_read_lock();
+	memcg = mem_cgroup_from_task(current);
+	if (css_tryget_online(&memcg->css))
+		sk->sk_memcg = memcg;
+	rcu_read_unlock();
 }
 EXPORT_SYMBOL(sock_update_memcg);
 
 void sock_release_memcg(struct sock *sk)
 {
-	if (mem_cgroup_sockets_enabled && sk->sk_cgrp) {
-		struct mem_cgroup *memcg;
-		WARN_ON(!sk->sk_cgrp->memcg);
-		memcg = sk->sk_cgrp->memcg;
-		css_put(&sk->sk_cgrp->memcg->css);
-	}
+	if (sk->sk_memcg)
+		css_put(&sk->sk_memcg->css);
 }
 
-struct cg_proto *tcp_proto_cgroup(struct mem_cgroup *memcg)
+/**
+ * mem_cgroup_charge_skmem - charge socket memory
+ * @memcg: memcg to charge
+ * @nr_pages: number of pages to charge
+ *
+ * Charges @nr_pages to @memcg. Returns %true if the charge fit within
+ * the memcg's configured limit, %false if the charge had to be forced.
+ */
+bool mem_cgroup_charge_skmem(struct mem_cgroup *memcg, unsigned int nr_pages)
 {
-	if (!memcg || mem_cgroup_is_root(memcg))
-		return NULL;
+	struct page_counter *counter;
+
+	if (page_counter_try_charge(&memcg->skmem, nr_pages, &counter))
+		return true;
 
-	return &memcg->tcp_mem;
+	page_counter_charge(&memcg->skmem, nr_pages);
+	return false;
+}
+
+/**
+ * mem_cgroup_uncharge_skmem - uncharge socket memory
+ * @memcg: memcg to uncharge
+ * @nr_pages: number of pages to uncharge
+ */
+void mem_cgroup_uncharge_skmem(struct mem_cgroup *memcg, unsigned int nr_pages)
+{
+	page_counter_uncharge(&memcg->skmem, nr_pages);
 }
-EXPORT_SYMBOL(tcp_proto_cgroup);
 
 #endif
 
@@ -3592,13 +3601,7 @@ static int mem_cgroup_oom_control_write(struct cgroup_subsys_state *css,
 #ifdef CONFIG_MEMCG_KMEM
 static int memcg_init_kmem(struct mem_cgroup *memcg, struct cgroup_subsys *ss)
 {
-	int ret;
-
-	ret = memcg_propagate_kmem(memcg);
-	if (ret)
-		return ret;
-
-	return mem_cgroup_sockets_init(memcg, ss);
+	return memcg_propagate_kmem(memcg);
 }
 
 static void memcg_deactivate_kmem(struct mem_cgroup *memcg)
@@ -3654,7 +3657,6 @@ static void memcg_destroy_kmem(struct mem_cgroup *memcg)
 		static_key_slow_dec(&memcg_kmem_enabled_key);
 		WARN_ON(page_counter_read(&memcg->kmem));
 	}
-	mem_cgroup_sockets_destroy(memcg);
 }
 #else
 static int memcg_init_kmem(struct mem_cgroup *memcg, struct cgroup_subsys *ss)
@@ -4218,6 +4220,7 @@ mem_cgroup_css_alloc(struct cgroup_subsys_state *parent_css)
 		memcg->soft_limit = PAGE_COUNTER_MAX;
 		page_counter_init(&memcg->memsw, NULL);
 		page_counter_init(&memcg->kmem, NULL);
+		page_counter_init(&memcg->skmem, NULL);
 	}
 
 	memcg->last_scanned_node = MAX_NUMNODES;
@@ -4266,6 +4269,7 @@ mem_cgroup_css_online(struct cgroup_subsys_state *css)
 		memcg->soft_limit = PAGE_COUNTER_MAX;
 		page_counter_init(&memcg->memsw, &parent->memsw);
 		page_counter_init(&memcg->kmem, &parent->kmem);
+		page_counter_init(&memcg->skmem, &parent->skmem);
 
 		/*
 		 * No need to take a reference to the parent because cgroup
@@ -4277,6 +4281,7 @@ mem_cgroup_css_online(struct cgroup_subsys_state *css)
 		memcg->soft_limit = PAGE_COUNTER_MAX;
 		page_counter_init(&memcg->memsw, NULL);
 		page_counter_init(&memcg->kmem, NULL);
+		page_counter_init(&memcg->skmem, NULL);
 		/*
 		 * Deeper hierachy with use_hierarchy == false doesn't make
 		 * much sense so let cgroup subsystem know about this
diff --git a/net/core/sock.c b/net/core/sock.c
index 0fafd27..0debff5 100644
--- a/net/core/sock.c
+++ b/net/core/sock.c
@@ -194,44 +194,6 @@ bool sk_net_capable(const struct sock *sk, int cap)
 }
 EXPORT_SYMBOL(sk_net_capable);
 
-
-#ifdef CONFIG_MEMCG_KMEM
-int mem_cgroup_sockets_init(struct mem_cgroup *memcg, struct cgroup_subsys *ss)
-{
-	struct proto *proto;
-	int ret = 0;
-
-	mutex_lock(&proto_list_mutex);
-	list_for_each_entry(proto, &proto_list, node) {
-		if (proto->init_cgroup) {
-			ret = proto->init_cgroup(memcg, ss);
-			if (ret)
-				goto out;
-		}
-	}
-
-	mutex_unlock(&proto_list_mutex);
-	return ret;
-out:
-	list_for_each_entry_continue_reverse(proto, &proto_list, node)
-		if (proto->destroy_cgroup)
-			proto->destroy_cgroup(memcg);
-	mutex_unlock(&proto_list_mutex);
-	return ret;
-}
-
-void mem_cgroup_sockets_destroy(struct mem_cgroup *memcg)
-{
-	struct proto *proto;
-
-	mutex_lock(&proto_list_mutex);
-	list_for_each_entry_reverse(proto, &proto_list, node)
-		if (proto->destroy_cgroup)
-			proto->destroy_cgroup(memcg);
-	mutex_unlock(&proto_list_mutex);
-}
-#endif
-
 /*
  * Each address family might have different locking rules, so we have
  * one slock key per address family:
@@ -239,11 +201,6 @@ void mem_cgroup_sockets_destroy(struct mem_cgroup *memcg)
 static struct lock_class_key af_family_keys[AF_MAX];
 static struct lock_class_key af_family_slock_keys[AF_MAX];
 
-#if defined(CONFIG_MEMCG_KMEM)
-struct static_key memcg_socket_limit_enabled;
-EXPORT_SYMBOL(memcg_socket_limit_enabled);
-#endif
-
 /*
  * Make lock validator output more readable. (we pre-construct these
  * strings build-time, so that runtime initialization of socket
@@ -1476,12 +1433,6 @@ void sk_free(struct sock *sk)
 }
 EXPORT_SYMBOL(sk_free);
 
-static void sk_update_clone(const struct sock *sk, struct sock *newsk)
-{
-	if (mem_cgroup_sockets_enabled && sk->sk_cgrp)
-		sock_update_memcg(newsk);
-}
-
 /**
  *	sk_clone_lock - clone a socket, and lock its clone
  *	@sk: the socket to clone
@@ -1577,7 +1528,8 @@ struct sock *sk_clone_lock(const struct sock *sk, const gfp_t priority)
 		sk_set_socket(newsk, NULL);
 		newsk->sk_wq = NULL;
 
-		sk_update_clone(sk, newsk);
+		if (mem_cgroup_do_sockets())
+			sock_update_memcg(newsk);
 
 		if (newsk->sk_prot->sockets_allocated)
 			sk_sockets_allocated_inc(newsk);
@@ -2036,27 +1988,27 @@ int __sk_mem_schedule(struct sock *sk, int size, int kind)
 	struct proto *prot = sk->sk_prot;
 	int amt = sk_mem_pages(size);
 	long allocated;
-	int parent_status = UNDER_LIMIT;
 
 	sk->sk_forward_alloc += amt * SK_MEM_QUANTUM;
 
-	allocated = sk_memory_allocated_add(sk, amt, &parent_status);
+	allocated = sk_memory_allocated_add(sk, amt);
+
+	if (mem_cgroup_do_sockets() && sk->sk_memcg &&
+	    !mem_cgroup_charge_skmem(sk->sk_memcg, amt))
+		goto suppress_allocation;
 
 	/* Under limit. */
-	if (parent_status == UNDER_LIMIT &&
-			allocated <= sk_prot_mem_limits(sk, 0)) {
+	if (allocated <= sk_prot_mem_limits(sk, 0)) {
 		sk_leave_memory_pressure(sk);
 		return 1;
 	}
 
-	/* Under pressure. (we or our parents) */
-	if ((parent_status > SOFT_LIMIT) ||
-			allocated > sk_prot_mem_limits(sk, 1))
+	/* Under pressure. */
+	if (allocated > sk_prot_mem_limits(sk, 1))
 		sk_enter_memory_pressure(sk);
 
-	/* Over hard limit (we or our parents) */
-	if ((parent_status == OVER_LIMIT) ||
-			(allocated > sk_prot_mem_limits(sk, 2)))
+	/* Over hard limit. */
+	if (allocated > sk_prot_mem_limits(sk, 2))
 		goto suppress_allocation;
 
 	/* guarantee minimum buffer size under pressure */
@@ -2105,6 +2057,9 @@ suppress_allocation:
 
 	sk_memory_allocated_sub(sk, amt);
 
+	if (mem_cgroup_do_sockets() && sk->sk_memcg)
+		mem_cgroup_uncharge_skmem(sk->sk_memcg, amt);
+
 	return 0;
 }
 EXPORT_SYMBOL(__sk_mem_schedule);
@@ -2120,6 +2075,9 @@ void __sk_mem_reclaim(struct sock *sk, int amount)
 	sk_memory_allocated_sub(sk, amount);
 	sk->sk_forward_alloc -= amount << SK_MEM_QUANTUM_SHIFT;
 
+	if (mem_cgroup_do_sockets() && sk->sk_memcg)
+		mem_cgroup_uncharge_skmem(sk->sk_memcg, amount);
+
 	if (sk_under_memory_pressure(sk) &&
 	    (sk_memory_allocated(sk) < sk_prot_mem_limits(sk, 0)))
 		sk_leave_memory_pressure(sk);
diff --git a/net/ipv4/sysctl_net_ipv4.c b/net/ipv4/sysctl_net_ipv4.c
index 894da3a..1f00819 100644
--- a/net/ipv4/sysctl_net_ipv4.c
+++ b/net/ipv4/sysctl_net_ipv4.c
@@ -24,7 +24,6 @@
 #include <net/cipso_ipv4.h>
 #include <net/inet_frag.h>
 #include <net/ping.h>
-#include <net/tcp_memcontrol.h>
 
 static int zero;
 static int one = 1;
diff --git a/net/ipv4/tcp.c b/net/ipv4/tcp.c
index ac1bdbb..ec931c0 100644
--- a/net/ipv4/tcp.c
+++ b/net/ipv4/tcp.c
@@ -421,7 +421,8 @@ void tcp_init_sock(struct sock *sk)
 	sk->sk_rcvbuf = sysctl_tcp_rmem[1];
 
 	local_bh_disable();
-	sock_update_memcg(sk);
+	if (mem_cgroup_do_sockets())
+		sock_update_memcg(sk);
 	sk_sockets_allocated_inc(sk);
 	local_bh_enable();
 }
diff --git a/net/ipv4/tcp_ipv4.c b/net/ipv4/tcp_ipv4.c
index 30dd45c..bb5f4f2 100644
--- a/net/ipv4/tcp_ipv4.c
+++ b/net/ipv4/tcp_ipv4.c
@@ -73,7 +73,6 @@
 #include <net/timewait_sock.h>
 #include <net/xfrm.h>
 #include <net/secure_seq.h>
-#include <net/tcp_memcontrol.h>
 #include <net/busy_poll.h>
 
 #include <linux/inet.h>
@@ -1808,7 +1807,8 @@ void tcp_v4_destroy_sock(struct sock *sk)
 	tcp_saved_syn_free(tp);
 
 	sk_sockets_allocated_dec(sk);
-	sock_release_memcg(sk);
+	if (mem_cgroup_do_sockets())
+		sock_release_memcg(sk);
 }
 EXPORT_SYMBOL(tcp_v4_destroy_sock);
 
@@ -2330,11 +2330,6 @@ struct proto tcp_prot = {
 	.compat_setsockopt	= compat_tcp_setsockopt,
 	.compat_getsockopt	= compat_tcp_getsockopt,
 #endif
-#ifdef CONFIG_MEMCG_KMEM
-	.init_cgroup		= tcp_init_cgroup,
-	.destroy_cgroup		= tcp_destroy_cgroup,
-	.proto_cgroup		= tcp_proto_cgroup,
-#endif
 };
 EXPORT_SYMBOL(tcp_prot);
 
diff --git a/net/ipv4/tcp_memcontrol.c b/net/ipv4/tcp_memcontrol.c
index 2379c1b..09a37eb 100644
--- a/net/ipv4/tcp_memcontrol.c
+++ b/net/ipv4/tcp_memcontrol.c
@@ -1,107 +1,10 @@
-#include <net/tcp.h>
-#include <net/tcp_memcontrol.h>
-#include <net/sock.h>
-#include <net/ip.h>
-#include <linux/nsproxy.h>
+#include <linux/page_counter.h>
 #include <linux/memcontrol.h>
+#include <linux/cgroup.h>
 #include <linux/module.h>
-
-int tcp_init_cgroup(struct mem_cgroup *memcg, struct cgroup_subsys *ss)
-{
-	/*
-	 * The root cgroup does not use page_counters, but rather,
-	 * rely on the data already collected by the network
-	 * subsystem
-	 */
-	struct mem_cgroup *parent = parent_mem_cgroup(memcg);
-	struct page_counter *counter_parent = NULL;
-	struct cg_proto *cg_proto, *parent_cg;
-
-	cg_proto = tcp_prot.proto_cgroup(memcg);
-	if (!cg_proto)
-		return 0;
-
-	cg_proto->sysctl_mem[0] = sysctl_tcp_mem[0];
-	cg_proto->sysctl_mem[1] = sysctl_tcp_mem[1];
-	cg_proto->sysctl_mem[2] = sysctl_tcp_mem[2];
-	cg_proto->memory_pressure = 0;
-	cg_proto->memcg = memcg;
-
-	parent_cg = tcp_prot.proto_cgroup(parent);
-	if (parent_cg)
-		counter_parent = &parent_cg->memory_allocated;
-
-	page_counter_init(&cg_proto->memory_allocated, counter_parent);
-	percpu_counter_init(&cg_proto->sockets_allocated, 0, GFP_KERNEL);
-
-	return 0;
-}
-EXPORT_SYMBOL(tcp_init_cgroup);
-
-void tcp_destroy_cgroup(struct mem_cgroup *memcg)
-{
-	struct cg_proto *cg_proto;
-
-	cg_proto = tcp_prot.proto_cgroup(memcg);
-	if (!cg_proto)
-		return;
-
-	percpu_counter_destroy(&cg_proto->sockets_allocated);
-
-	if (test_bit(MEMCG_SOCK_ACTIVATED, &cg_proto->flags))
-		static_key_slow_dec(&memcg_socket_limit_enabled);
-
-}
-EXPORT_SYMBOL(tcp_destroy_cgroup);
-
-static int tcp_update_limit(struct mem_cgroup *memcg, unsigned long nr_pages)
-{
-	struct cg_proto *cg_proto;
-	int i;
-	int ret;
-
-	cg_proto = tcp_prot.proto_cgroup(memcg);
-	if (!cg_proto)
-		return -EINVAL;
-
-	ret = page_counter_limit(&cg_proto->memory_allocated, nr_pages);
-	if (ret)
-		return ret;
-
-	for (i = 0; i < 3; i++)
-		cg_proto->sysctl_mem[i] = min_t(long, nr_pages,
-						sysctl_tcp_mem[i]);
-
-	if (nr_pages == PAGE_COUNTER_MAX)
-		clear_bit(MEMCG_SOCK_ACTIVE, &cg_proto->flags);
-	else {
-		/*
-		 * The active bit needs to be written after the static_key
-		 * update. This is what guarantees that the socket activation
-		 * function is the last one to run. See sock_update_memcg() for
-		 * details, and note that we don't mark any socket as belonging
-		 * to this memcg until that flag is up.
-		 *
-		 * We need to do this, because static_keys will span multiple
-		 * sites, but we can't control their order. If we mark a socket
-		 * as accounted, but the accounting functions are not patched in
-		 * yet, we'll lose accounting.
-		 *
-		 * We never race with the readers in sock_update_memcg(),
-		 * because when this value change, the code to process it is not
-		 * patched in yet.
-		 *
-		 * The activated bit is used to guarantee that no two writers
-		 * will do the update in the same memcg. Without that, we can't
-		 * properly shutdown the static key.
-		 */
-		if (!test_and_set_bit(MEMCG_SOCK_ACTIVATED, &cg_proto->flags))
-			static_key_slow_inc(&memcg_socket_limit_enabled);
-		set_bit(MEMCG_SOCK_ACTIVE, &cg_proto->flags);
-	}
-
-	return 0;
-}
+#include <linux/kernfs.h>
+#include <linux/mutex.h>
+#include <net/tcp.h>
 
 enum {
 	RES_USAGE,
@@ -124,11 +27,17 @@ static ssize_t tcp_cgroup_write(struct kernfs_open_file *of,
 	switch (of_cft(of)->private) {
 	case RES_LIMIT:
 		/* see memcontrol.c */
+		if (memcg == root_mem_cgroup) {
+			ret = -EINVAL;
+			break;
+		}
 		ret = page_counter_memparse(buf, "-1", &nr_pages);
 		if (ret)
 			break;
 		mutex_lock(&tcp_limit_mutex);
-		ret = tcp_update_limit(memcg, nr_pages);
+		ret = page_counter_limit(&memcg->skmem, nr_pages);
+		if (!ret)
+			static_branch_enable(&mem_cgroup_sockets);
 		mutex_unlock(&tcp_limit_mutex);
 		break;
 	default:
@@ -141,32 +50,28 @@ static ssize_t tcp_cgroup_write(struct kernfs_open_file *of,
 static u64 tcp_cgroup_read(struct cgroup_subsys_state *css, struct cftype *cft)
 {
 	struct mem_cgroup *memcg = mem_cgroup_from_css(css);
-	struct cg_proto *cg_proto = tcp_prot.proto_cgroup(memcg);
 	u64 val;
 
 	switch (cft->private) {
 	case RES_LIMIT:
-		if (!cg_proto)
-			return PAGE_COUNTER_MAX;
-		val = cg_proto->memory_allocated.limit;
+		val = memcg->skmem.limit;
 		val *= PAGE_SIZE;
 		break;
 	case RES_USAGE:
-		if (!cg_proto)
+		if (memcg == root_mem_cgroup)
 			val = atomic_long_read(&tcp_memory_allocated);
 		else
-			val = page_counter_read(&cg_proto->memory_allocated);
+			val = page_counter_read(&memcg->skmem);
 		val *= PAGE_SIZE;
 		break;
 	case RES_FAILCNT:
-		if (!cg_proto)
-			return 0;
-		val = cg_proto->memory_allocated.failcnt;
+		val = memcg->skmem.failcnt;
 		break;
 	case RES_MAX_USAGE:
-		if (!cg_proto)
-			return 0;
-		val = cg_proto->memory_allocated.watermark;
+		if (memcg == root_mem_cgroup)
+			val = 0;
+		else
+			val = memcg->skmem.watermark;
 		val *= PAGE_SIZE;
 		break;
 	default:
@@ -178,20 +83,14 @@ static u64 tcp_cgroup_read(struct cgroup_subsys_state *css, struct cftype *cft)
 static ssize_t tcp_cgroup_reset(struct kernfs_open_file *of,
 				char *buf, size_t nbytes, loff_t off)
 {
-	struct mem_cgroup *memcg;
-	struct cg_proto *cg_proto;
-
-	memcg = mem_cgroup_from_css(of_css(of));
-	cg_proto = tcp_prot.proto_cgroup(memcg);
-	if (!cg_proto)
-		return nbytes;
+	struct mem_cgroup *memcg = mem_cgroup_from_css(of_css(of));
 
 	switch (of_cft(of)->private) {
 	case RES_MAX_USAGE:
-		page_counter_reset_watermark(&cg_proto->memory_allocated);
+		page_counter_reset_watermark(&memcg->skmem);
 		break;
 	case RES_FAILCNT:
-		cg_proto->memory_allocated.failcnt = 0;
+		memcg->skmem.failcnt = 0;
 		break;
 	}
 
diff --git a/net/ipv4/tcp_output.c b/net/ipv4/tcp_output.c
index 19adedb..b496fc9 100644
--- a/net/ipv4/tcp_output.c
+++ b/net/ipv4/tcp_output.c
@@ -2819,13 +2819,15 @@ begin_fwd:
  */
 void sk_forced_mem_schedule(struct sock *sk, int size)
 {
-	int amt, status;
+	int amt;
 
 	if (size <= sk->sk_forward_alloc)
 		return;
 	amt = sk_mem_pages(size);
 	sk->sk_forward_alloc += amt * SK_MEM_QUANTUM;
-	sk_memory_allocated_add(sk, amt, &status);
+	sk_memory_allocated_add(sk, amt);
+	if (mem_cgroup_do_sockets() && sk->sk_memcg)
+		mem_cgroup_charge_skmem(sk->sk_memcg, amt);
 }
 
 /* Send a FIN. The caller locks the socket for us.
diff --git a/net/ipv6/tcp_ipv6.c b/net/ipv6/tcp_ipv6.c
index f495d18..cf19e65 100644
--- a/net/ipv6/tcp_ipv6.c
+++ b/net/ipv6/tcp_ipv6.c
@@ -1862,9 +1862,6 @@ struct proto tcpv6_prot = {
 	.compat_setsockopt	= compat_tcp_setsockopt,
 	.compat_getsockopt	= compat_tcp_getsockopt,
 #endif
-#ifdef CONFIG_MEMCG_KMEM
-	.proto_cgroup		= tcp_proto_cgroup,
-#endif
 	.clear_sk		= tcp_v6_clear_sk,
 };
 
-- 
2.6.1


   *	@sk_classid: this socket's cgroup classid
-  *	@sk_cgrp: this socket's cgroup-specific proto data
+  *	@sk_memcg: this socket's memcg association
   *	@sk_write_pending: a write to stream socket waits to start
   *	@sk_state_change: callback to indicate change in the state of the sock
   *	@sk_data_ready: callback to indicate there is data to be processed
@@ -447,7 +430,7 @@ struct sock {
 #ifdef CONFIG_CGROUP_NET_CLASSID
 	u32			sk_classid;
 #endif
-	struct cg_proto		*sk_cgrp;
+	struct mem_cgroup	*sk_memcg;
 	void			(*sk_state_change)(struct sock *sk);
 	void			(*sk_data_ready)(struct sock *sk);
 	void			(*sk_write_space)(struct sock *sk);
@@ -1051,18 +1034,6 @@ struct proto {
 #ifdef SOCK_REFCNT_DEBUG
 	atomic_t		socks;
 #endif
-#ifdef CONFIG_MEMCG_KMEM
-	/*
-	 * cgroup specific init/deinit functions. Called once for all
-	 * protocols that implement it, from cgroups populate function.
-	 * This function has to setup any files the protocol want to
-	 * appear in the kmem cgroup filesystem.
-	 */
-	int			(*init_cgroup)(struct mem_cgroup *memcg,
-					       struct cgroup_subsys *ss);
-	void			(*destroy_cgroup)(struct mem_cgroup *memcg);
-	struct cg_proto		*(*proto_cgroup)(struct mem_cgroup *memcg);
-#endif
 };
 
 int proto_register(struct proto *prot, int alloc_slab);
@@ -1093,23 +1064,6 @@ static inline void sk_refcnt_debug_release(const struct sock *sk)
 #define sk_refcnt_debug_release(sk) do { } while (0)
 #endif /* SOCK_REFCNT_DEBUG */
 
-#if defined(CONFIG_MEMCG_KMEM) && defined(CONFIG_NET)
-extern struct static_key memcg_socket_limit_enabled;
-static inline struct cg_proto *parent_cg_proto(struct proto *proto,
-					       struct cg_proto *cg_proto)
-{
-	return proto->proto_cgroup(parent_mem_cgroup(cg_proto->memcg));
-}
-#define mem_cgroup_sockets_enabled static_key_false(&memcg_socket_limit_enabled)
-#else
-#define mem_cgroup_sockets_enabled 0
-static inline struct cg_proto *parent_cg_proto(struct proto *proto,
-					       struct cg_proto *cg_proto)
-{
-	return NULL;
-}
-#endif
-
 static inline bool sk_stream_memory_free(const struct sock *sk)
 {
 	if (sk->sk_wmem_queued >= sk->sk_sndbuf)
@@ -1136,9 +1090,6 @@ static inline bool sk_under_memory_pressure(const struct sock *sk)
 	if (!sk->sk_prot->memory_pressure)
 		return false;
 
-	if (mem_cgroup_sockets_enabled && sk->sk_cgrp)
-		return !!sk->sk_cgrp->memory_pressure;
-
 	return !!*sk->sk_prot->memory_pressure;
 }
 
@@ -1146,61 +1097,19 @@ static inline void sk_leave_memory_pressure(struct sock *sk)
 {
 	int *memory_pressure = sk->sk_prot->memory_pressure;
 
-	if (!memory_pressure)
-		return;
-
-	if (*memory_pressure)
+	if (memory_pressure && *memory_pressure)
 		*memory_pressure = 0;
-
-	if (mem_cgroup_sockets_enabled && sk->sk_cgrp) {
-		struct cg_proto *cg_proto = sk->sk_cgrp;
-		struct proto *prot = sk->sk_prot;
-
-		for (; cg_proto; cg_proto = parent_cg_proto(prot, cg_proto))
-			cg_proto->memory_pressure = 0;
-	}
-
 }
 
 static inline void sk_enter_memory_pressure(struct sock *sk)
 {
-	if (!sk->sk_prot->enter_memory_pressure)
-		return;
-
-	if (mem_cgroup_sockets_enabled && sk->sk_cgrp) {
-		struct cg_proto *cg_proto = sk->sk_cgrp;
-		struct proto *prot = sk->sk_prot;
-
-		for (; cg_proto; cg_proto = parent_cg_proto(prot, cg_proto))
-			cg_proto->memory_pressure = 1;
-	}
-
-	sk->sk_prot->enter_memory_pressure(sk);
+	if (sk->sk_prot->enter_memory_pressure)
+		sk->sk_prot->enter_memory_pressure(sk);
 }
 
 static inline long sk_prot_mem_limits(const struct sock *sk, int index)
 {
-	long *prot = sk->sk_prot->sysctl_mem;
-	if (mem_cgroup_sockets_enabled && sk->sk_cgrp)
-		prot = sk->sk_cgrp->sysctl_mem;
-	return prot[index];
-}
-
-static inline void memcg_memory_allocated_add(struct cg_proto *prot,
-					      unsigned long amt,
-					      int *parent_status)
-{
-	page_counter_charge(&prot->memory_allocated, amt);
-
-	if (page_counter_read(&prot->memory_allocated) >
-	    prot->memory_allocated.limit)
-		*parent_status = OVER_LIMIT;
-}
-
-static inline void memcg_memory_allocated_sub(struct cg_proto *prot,
-					      unsigned long amt)
-{
-	page_counter_uncharge(&prot->memory_allocated, amt);
+	return sk->sk_prot->sysctl_mem[index];
 }
 
 static inline long
@@ -1208,24 +1117,14 @@ sk_memory_allocated(const struct sock *sk)
 {
 	struct proto *prot = sk->sk_prot;
 
-	if (mem_cgroup_sockets_enabled && sk->sk_cgrp)
-		return page_counter_read(&sk->sk_cgrp->memory_allocated);
-
 	return atomic_long_read(prot->memory_allocated);
 }
 
 static inline long
-sk_memory_allocated_add(struct sock *sk, int amt, int *parent_status)
+sk_memory_allocated_add(struct sock *sk, int amt)
 {
 	struct proto *prot = sk->sk_prot;
 
-	if (mem_cgroup_sockets_enabled && sk->sk_cgrp) {
-		memcg_memory_allocated_add(sk->sk_cgrp, amt, parent_status);
-		/* update the root cgroup regardless */
-		atomic_long_add_return(amt, prot->memory_allocated);
-		return page_counter_read(&sk->sk_cgrp->memory_allocated);
-	}
-
 	return atomic_long_add_return(amt, prot->memory_allocated);
 }
 
@@ -1234,9 +1133,6 @@ sk_memory_allocated_sub(struct sock *sk, int amt)
 {
 	struct proto *prot = sk->sk_prot;
 
-	if (mem_cgroup_sockets_enabled && sk->sk_cgrp)
-		memcg_memory_allocated_sub(sk->sk_cgrp, amt);
-
 	atomic_long_sub(amt, prot->memory_allocated);
 }
 
@@ -1244,13 +1140,6 @@ static inline void sk_sockets_allocated_dec(struct sock *sk)
 {
 	struct proto *prot = sk->sk_prot;
 
-	if (mem_cgroup_sockets_enabled && sk->sk_cgrp) {
-		struct cg_proto *cg_proto = sk->sk_cgrp;
-
-		for (; cg_proto; cg_proto = parent_cg_proto(prot, cg_proto))
-			percpu_counter_dec(&cg_proto->sockets_allocated);
-	}
-
 	percpu_counter_dec(prot->sockets_allocated);
 }
 
@@ -1258,13 +1147,6 @@ static inline void sk_sockets_allocated_inc(struct sock *sk)
 {
 	struct proto *prot = sk->sk_prot;
 
-	if (mem_cgroup_sockets_enabled && sk->sk_cgrp) {
-		struct cg_proto *cg_proto = sk->sk_cgrp;
-
-		for (; cg_proto; cg_proto = parent_cg_proto(prot, cg_proto))
-			percpu_counter_inc(&cg_proto->sockets_allocated);
-	}
-
 	percpu_counter_inc(prot->sockets_allocated);
 }
 
@@ -1273,9 +1155,6 @@ sk_sockets_allocated_read_positive(struct sock *sk)
 {
 	struct proto *prot = sk->sk_prot;
 
-	if (mem_cgroup_sockets_enabled && sk->sk_cgrp)
-		return percpu_counter_read_positive(&sk->sk_cgrp->sockets_allocated);
-
 	return percpu_counter_read_positive(prot->sockets_allocated);
 }
 
diff --git a/include/net/tcp.h b/include/net/tcp.h
index eed94fc..77b6c7e 100644
--- a/include/net/tcp.h
+++ b/include/net/tcp.h
@@ -291,9 +291,6 @@ extern int tcp_memory_pressure;
 /* optimized version of sk_under_memory_pressure() for TCP sockets */
 static inline bool tcp_under_memory_pressure(const struct sock *sk)
 {
-	if (mem_cgroup_sockets_enabled && sk->sk_cgrp)
-		return !!sk->sk_cgrp->memory_pressure;
-
 	return tcp_memory_pressure;
 }
 /*
diff --git a/include/net/tcp_memcontrol.h b/include/net/tcp_memcontrol.h
deleted file mode 100644
index 05b94d9..0000000
--- a/include/net/tcp_memcontrol.h
+++ /dev/null
@@ -1,7 +0,0 @@
-#ifndef _TCP_MEMCG_H
-#define _TCP_MEMCG_H
-
-struct cg_proto *tcp_proto_cgroup(struct mem_cgroup *memcg);
-int tcp_init_cgroup(struct mem_cgroup *memcg, struct cgroup_subsys *ss);
-void tcp_destroy_cgroup(struct mem_cgroup *memcg);
-#endif /* _TCP_MEMCG_H */
diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index e54f434..c41e6d7 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -66,7 +66,6 @@
 #include "internal.h"
 #include <net/sock.h>
 #include <net/ip.h>
-#include <net/tcp_memcontrol.h>
 #include "slab.h"
 
 #include <asm/uaccess.h>
@@ -291,58 +290,68 @@ static inline struct mem_cgroup *mem_cgroup_from_id(unsigned short id)
 /* Writing them here to avoid exposing memcg's inner layout */
 #if defined(CONFIG_INET) && defined(CONFIG_MEMCG_KMEM)
 
+DEFINE_STATIC_KEY_FALSE(mem_cgroup_sockets);
+
 void sock_update_memcg(struct sock *sk)
 {
-	if (mem_cgroup_sockets_enabled) {
-		struct mem_cgroup *memcg;
-		struct cg_proto *cg_proto;
-
-		BUG_ON(!sk->sk_prot->proto_cgroup);
-
-		/* Socket cloning can throw us here with sk_cgrp already
-		 * filled. It won't however, necessarily happen from
-		 * process context. So the test for root memcg given
-		 * the current task's memcg won't help us in this case.
-		 *
-		 * Respecting the original socket's memcg is a better
-		 * decision in this case.
-		 */
-		if (sk->sk_cgrp) {
-			BUG_ON(mem_cgroup_is_root(sk->sk_cgrp->memcg));
-			css_get(&sk->sk_cgrp->memcg->css);
-			return;
-		}
-
-		rcu_read_lock();
-		memcg = mem_cgroup_from_task(current);
-		cg_proto = sk->sk_prot->proto_cgroup(memcg);
-		if (cg_proto && test_bit(MEMCG_SOCK_ACTIVE, &cg_proto->flags) &&
-		    css_tryget_online(&memcg->css)) {
-			sk->sk_cgrp = cg_proto;
-		}
-		rcu_read_unlock();
+	struct mem_cgroup *memcg;
+	/*
+	 * Socket cloning can throw us here with sk_cgrp already
+	 * filled. It won't however, necessarily happen from
+	 * process context. So the test for root memcg given
+	 * the current task's memcg won't help us in this case.
+	 *
+	 * Respecting the original socket's memcg is a better
+	 * decision in this case.
+	 */
+	if (sk->sk_memcg) {
+		BUG_ON(mem_cgroup_is_root(sk->sk_memcg));
+		css_get(&sk->sk_memcg->css);
+		return;
 	}
+
+	rcu_read_lock();
+	memcg = mem_cgroup_from_task(current);
+	if (css_tryget_online(&memcg->css))
+		sk->sk_memcg = memcg;
+	rcu_read_unlock();
 }
 EXPORT_SYMBOL(sock_update_memcg);
 
 void sock_release_memcg(struct sock *sk)
 {
-	if (mem_cgroup_sockets_enabled && sk->sk_cgrp) {
-		struct mem_cgroup *memcg;
-		WARN_ON(!sk->sk_cgrp->memcg);
-		memcg = sk->sk_cgrp->memcg;
-		css_put(&sk->sk_cgrp->memcg->css);
-	}
+	if (sk->sk_memcg)
+		css_put(&sk->sk_memcg->css);
 }
 
-struct cg_proto *tcp_proto_cgroup(struct mem_cgroup *memcg)
+/**
+ * mem_cgroup_charge_skmem - charge socket memory
+ * @memcg: memcg to charge
+ * @nr_pages: number of pages to charge
+ *
+ * Charges @nr_pages to @memcg. Returns %true if the charge fit within
+ * the memcg's configured limit, %false if the charge had to be forced.
+ */
+bool mem_cgroup_charge_skmem(struct mem_cgroup *memcg, unsigned int nr_pages)
 {
-	if (!memcg || mem_cgroup_is_root(memcg))
-		return NULL;
+	struct page_counter *counter;
+
+	if (page_counter_try_charge(&memcg->skmem, nr_pages, &counter))
+		return true;
 
-	return &memcg->tcp_mem;
+	page_counter_charge(&memcg->skmem, nr_pages);
+	return false;
+}
+
+/**
+ * mem_cgroup_uncharge_skmem - uncharge socket memory
+ * @memcg: memcg to uncharge
+ * @nr_pages: number of pages to uncharge
+ */
+void mem_cgroup_uncharge_skmem(struct mem_cgroup *memcg, unsigned int nr_pages)
+{
+	page_counter_uncharge(&memcg->skmem, nr_pages);
 }
-EXPORT_SYMBOL(tcp_proto_cgroup);
 
 #endif
 
@@ -3592,13 +3601,7 @@ static int mem_cgroup_oom_control_write(struct cgroup_subsys_state *css,
 #ifdef CONFIG_MEMCG_KMEM
 static int memcg_init_kmem(struct mem_cgroup *memcg, struct cgroup_subsys *ss)
 {
-	int ret;
-
-	ret = memcg_propagate_kmem(memcg);
-	if (ret)
-		return ret;
-
-	return mem_cgroup_sockets_init(memcg, ss);
+	return memcg_propagate_kmem(memcg);
 }
 
 static void memcg_deactivate_kmem(struct mem_cgroup *memcg)
@@ -3654,7 +3657,6 @@ static void memcg_destroy_kmem(struct mem_cgroup *memcg)
 		static_key_slow_dec(&memcg_kmem_enabled_key);
 		WARN_ON(page_counter_read(&memcg->kmem));
 	}
-	mem_cgroup_sockets_destroy(memcg);
 }
 #else
 static int memcg_init_kmem(struct mem_cgroup *memcg, struct cgroup_subsys *ss)
@@ -4218,6 +4220,7 @@ mem_cgroup_css_alloc(struct cgroup_subsys_state *parent_css)
 		memcg->soft_limit = PAGE_COUNTER_MAX;
 		page_counter_init(&memcg->memsw, NULL);
 		page_counter_init(&memcg->kmem, NULL);
+		page_counter_init(&memcg->skmem, NULL);
 	}
 
 	memcg->last_scanned_node = MAX_NUMNODES;
@@ -4266,6 +4269,7 @@ mem_cgroup_css_online(struct cgroup_subsys_state *css)
 		memcg->soft_limit = PAGE_COUNTER_MAX;
 		page_counter_init(&memcg->memsw, &parent->memsw);
 		page_counter_init(&memcg->kmem, &parent->kmem);
+		page_counter_init(&memcg->skmem, &parent->skmem);
 
 		/*
 		 * No need to take a reference to the parent because cgroup
@@ -4277,6 +4281,7 @@ mem_cgroup_css_online(struct cgroup_subsys_state *css)
 		memcg->soft_limit = PAGE_COUNTER_MAX;
 		page_counter_init(&memcg->memsw, NULL);
 		page_counter_init(&memcg->kmem, NULL);
+		page_counter_init(&memcg->skmem, NULL);
 		/*
 		 * Deeper hierachy with use_hierarchy == false doesn't make
 		 * much sense so let cgroup subsystem know about this
diff --git a/net/core/sock.c b/net/core/sock.c
index 0fafd27..0debff5 100644
--- a/net/core/sock.c
+++ b/net/core/sock.c
@@ -194,44 +194,6 @@ bool sk_net_capable(const struct sock *sk, int cap)
 }
 EXPORT_SYMBOL(sk_net_capable);
 
-
-#ifdef CONFIG_MEMCG_KMEM
-int mem_cgroup_sockets_init(struct mem_cgroup *memcg, struct cgroup_subsys *ss)
-{
-	struct proto *proto;
-	int ret = 0;
-
-	mutex_lock(&proto_list_mutex);
-	list_for_each_entry(proto, &proto_list, node) {
-		if (proto->init_cgroup) {
-			ret = proto->init_cgroup(memcg, ss);
-			if (ret)
-				goto out;
-		}
-	}
-
-	mutex_unlock(&proto_list_mutex);
-	return ret;
-out:
-	list_for_each_entry_continue_reverse(proto, &proto_list, node)
-		if (proto->destroy_cgroup)
-			proto->destroy_cgroup(memcg);
-	mutex_unlock(&proto_list_mutex);
-	return ret;
-}
-
-void mem_cgroup_sockets_destroy(struct mem_cgroup *memcg)
-{
-	struct proto *proto;
-
-	mutex_lock(&proto_list_mutex);
-	list_for_each_entry_reverse(proto, &proto_list, node)
-		if (proto->destroy_cgroup)
-			proto->destroy_cgroup(memcg);
-	mutex_unlock(&proto_list_mutex);
-}
-#endif
-
 /*
  * Each address family might have different locking rules, so we have
  * one slock key per address family:
@@ -239,11 +201,6 @@ void mem_cgroup_sockets_destroy(struct mem_cgroup *memcg)
 static struct lock_class_key af_family_keys[AF_MAX];
 static struct lock_class_key af_family_slock_keys[AF_MAX];
 
-#if defined(CONFIG_MEMCG_KMEM)
-struct static_key memcg_socket_limit_enabled;
-EXPORT_SYMBOL(memcg_socket_limit_enabled);
-#endif
-
 /*
  * Make lock validator output more readable. (we pre-construct these
  * strings build-time, so that runtime initialization of socket
@@ -1476,12 +1433,6 @@ void sk_free(struct sock *sk)
 }
 EXPORT_SYMBOL(sk_free);
 
-static void sk_update_clone(const struct sock *sk, struct sock *newsk)
-{
-	if (mem_cgroup_sockets_enabled && sk->sk_cgrp)
-		sock_update_memcg(newsk);
-}
-
 /**
  *	sk_clone_lock - clone a socket, and lock its clone
  *	@sk: the socket to clone
@@ -1577,7 +1528,8 @@ struct sock *sk_clone_lock(const struct sock *sk, const gfp_t priority)
 		sk_set_socket(newsk, NULL);
 		newsk->sk_wq = NULL;
 
-		sk_update_clone(sk, newsk);
+		if (mem_cgroup_do_sockets())
+			sock_update_memcg(newsk);
 
 		if (newsk->sk_prot->sockets_allocated)
 			sk_sockets_allocated_inc(newsk);
@@ -2036,27 +1988,27 @@ int __sk_mem_schedule(struct sock *sk, int size, int kind)
 	struct proto *prot = sk->sk_prot;
 	int amt = sk_mem_pages(size);
 	long allocated;
-	int parent_status = UNDER_LIMIT;
 
 	sk->sk_forward_alloc += amt * SK_MEM_QUANTUM;
 
-	allocated = sk_memory_allocated_add(sk, amt, &parent_status);
+	allocated = sk_memory_allocated_add(sk, amt);
+
+	if (mem_cgroup_do_sockets() && sk->sk_memcg &&
+	    !mem_cgroup_charge_skmem(sk->sk_memcg, amt))
+		goto suppress_allocation;
 
 	/* Under limit. */
-	if (parent_status == UNDER_LIMIT &&
-			allocated <= sk_prot_mem_limits(sk, 0)) {
+	if (allocated <= sk_prot_mem_limits(sk, 0)) {
 		sk_leave_memory_pressure(sk);
 		return 1;
 	}
 
-	/* Under pressure. (we or our parents) */
-	if ((parent_status > SOFT_LIMIT) ||
-			allocated > sk_prot_mem_limits(sk, 1))
+	/* Under pressure. */
+	if (allocated > sk_prot_mem_limits(sk, 1))
 		sk_enter_memory_pressure(sk);
 
-	/* Over hard limit (we or our parents) */
-	if ((parent_status == OVER_LIMIT) ||
-			(allocated > sk_prot_mem_limits(sk, 2)))
+	/* Over hard limit. */
+	if (allocated > sk_prot_mem_limits(sk, 2))
 		goto suppress_allocation;
 
 	/* guarantee minimum buffer size under pressure */
@@ -2105,6 +2057,9 @@ suppress_allocation:
 
 	sk_memory_allocated_sub(sk, amt);
 
+	if (mem_cgroup_do_sockets() && sk->sk_memcg)
+		mem_cgroup_uncharge_skmem(sk->sk_memcg, amt);
+
 	return 0;
 }
 EXPORT_SYMBOL(__sk_mem_schedule);
@@ -2120,6 +2075,9 @@ void __sk_mem_reclaim(struct sock *sk, int amount)
 	sk_memory_allocated_sub(sk, amount);
 	sk->sk_forward_alloc -= amount << SK_MEM_QUANTUM_SHIFT;
 
+	if (mem_cgroup_do_sockets() && sk->sk_memcg)
+		mem_cgroup_uncharge_skmem(sk->sk_memcg, amount);
+
 	if (sk_under_memory_pressure(sk) &&
 	    (sk_memory_allocated(sk) < sk_prot_mem_limits(sk, 0)))
 		sk_leave_memory_pressure(sk);
diff --git a/net/ipv4/sysctl_net_ipv4.c b/net/ipv4/sysctl_net_ipv4.c
index 894da3a..1f00819 100644
--- a/net/ipv4/sysctl_net_ipv4.c
+++ b/net/ipv4/sysctl_net_ipv4.c
@@ -24,7 +24,6 @@
 #include <net/cipso_ipv4.h>
 #include <net/inet_frag.h>
 #include <net/ping.h>
-#include <net/tcp_memcontrol.h>
 
 static int zero;
 static int one = 1;
diff --git a/net/ipv4/tcp.c b/net/ipv4/tcp.c
index ac1bdbb..ec931c0 100644
--- a/net/ipv4/tcp.c
+++ b/net/ipv4/tcp.c
@@ -421,7 +421,8 @@ void tcp_init_sock(struct sock *sk)
 	sk->sk_rcvbuf = sysctl_tcp_rmem[1];
 
 	local_bh_disable();
-	sock_update_memcg(sk);
+	if (mem_cgroup_do_sockets())
+		sock_update_memcg(sk);
 	sk_sockets_allocated_inc(sk);
 	local_bh_enable();
 }
diff --git a/net/ipv4/tcp_ipv4.c b/net/ipv4/tcp_ipv4.c
index 30dd45c..bb5f4f2 100644
--- a/net/ipv4/tcp_ipv4.c
+++ b/net/ipv4/tcp_ipv4.c
@@ -73,7 +73,6 @@
 #include <net/timewait_sock.h>
 #include <net/xfrm.h>
 #include <net/secure_seq.h>
-#include <net/tcp_memcontrol.h>
 #include <net/busy_poll.h>
 
 #include <linux/inet.h>
@@ -1808,7 +1807,8 @@ void tcp_v4_destroy_sock(struct sock *sk)
 	tcp_saved_syn_free(tp);
 
 	sk_sockets_allocated_dec(sk);
-	sock_release_memcg(sk);
+	if (mem_cgroup_do_sockets())
+		sock_release_memcg(sk);
 }
 EXPORT_SYMBOL(tcp_v4_destroy_sock);
 
@@ -2330,11 +2330,6 @@ struct proto tcp_prot = {
 	.compat_setsockopt	= compat_tcp_setsockopt,
 	.compat_getsockopt	= compat_tcp_getsockopt,
 #endif
-#ifdef CONFIG_MEMCG_KMEM
-	.init_cgroup		= tcp_init_cgroup,
-	.destroy_cgroup		= tcp_destroy_cgroup,
-	.proto_cgroup		= tcp_proto_cgroup,
-#endif
 };
 EXPORT_SYMBOL(tcp_prot);
 
diff --git a/net/ipv4/tcp_memcontrol.c b/net/ipv4/tcp_memcontrol.c
index 2379c1b..09a37eb 100644
--- a/net/ipv4/tcp_memcontrol.c
+++ b/net/ipv4/tcp_memcontrol.c
@@ -1,107 +1,10 @@
-#include <net/tcp.h>
-#include <net/tcp_memcontrol.h>
-#include <net/sock.h>
-#include <net/ip.h>
-#include <linux/nsproxy.h>
+#include <linux/page_counter.h>
 #include <linux/memcontrol.h>
+#include <linux/cgroup.h>
 #include <linux/module.h>
-
-int tcp_init_cgroup(struct mem_cgroup *memcg, struct cgroup_subsys *ss)
-{
-	/*
-	 * The root cgroup does not use page_counters, but rather,
-	 * rely on the data already collected by the network
-	 * subsystem
-	 */
-	struct mem_cgroup *parent = parent_mem_cgroup(memcg);
-	struct page_counter *counter_parent = NULL;
-	struct cg_proto *cg_proto, *parent_cg;
-
-	cg_proto = tcp_prot.proto_cgroup(memcg);
-	if (!cg_proto)
-		return 0;
-
-	cg_proto->sysctl_mem[0] = sysctl_tcp_mem[0];
-	cg_proto->sysctl_mem[1] = sysctl_tcp_mem[1];
-	cg_proto->sysctl_mem[2] = sysctl_tcp_mem[2];
-	cg_proto->memory_pressure = 0;
-	cg_proto->memcg = memcg;
-
-	parent_cg = tcp_prot.proto_cgroup(parent);
-	if (parent_cg)
-		counter_parent = &parent_cg->memory_allocated;
-
-	page_counter_init(&cg_proto->memory_allocated, counter_parent);
-	percpu_counter_init(&cg_proto->sockets_allocated, 0, GFP_KERNEL);
-
-	return 0;
-}
-EXPORT_SYMBOL(tcp_init_cgroup);
-
-void tcp_destroy_cgroup(struct mem_cgroup *memcg)
-{
-	struct cg_proto *cg_proto;
-
-	cg_proto = tcp_prot.proto_cgroup(memcg);
-	if (!cg_proto)
-		return;
-
-	percpu_counter_destroy(&cg_proto->sockets_allocated);
-
-	if (test_bit(MEMCG_SOCK_ACTIVATED, &cg_proto->flags))
-		static_key_slow_dec(&memcg_socket_limit_enabled);
-
-}
-EXPORT_SYMBOL(tcp_destroy_cgroup);
-
-static int tcp_update_limit(struct mem_cgroup *memcg, unsigned long nr_pages)
-{
-	struct cg_proto *cg_proto;
-	int i;
-	int ret;
-
-	cg_proto = tcp_prot.proto_cgroup(memcg);
-	if (!cg_proto)
-		return -EINVAL;
-
-	ret = page_counter_limit(&cg_proto->memory_allocated, nr_pages);
-	if (ret)
-		return ret;
-
-	for (i = 0; i < 3; i++)
-		cg_proto->sysctl_mem[i] = min_t(long, nr_pages,
-						sysctl_tcp_mem[i]);
-
-	if (nr_pages == PAGE_COUNTER_MAX)
-		clear_bit(MEMCG_SOCK_ACTIVE, &cg_proto->flags);
-	else {
-		/*
-		 * The active bit needs to be written after the static_key
-		 * update. This is what guarantees that the socket activation
-		 * function is the last one to run. See sock_update_memcg() for
-		 * details, and note that we don't mark any socket as belonging
-		 * to this memcg until that flag is up.
-		 *
-		 * We need to do this, because static_keys will span multiple
-		 * sites, but we can't control their order. If we mark a socket
-		 * as accounted, but the accounting functions are not patched in
-		 * yet, we'll lose accounting.
-		 *
-		 * We never race with the readers in sock_update_memcg(),
-		 * because when this value change, the code to process it is not
-		 * patched in yet.
-		 *
-		 * The activated bit is used to guarantee that no two writers
-		 * will do the update in the same memcg. Without that, we can't
-		 * properly shutdown the static key.
-		 */
-		if (!test_and_set_bit(MEMCG_SOCK_ACTIVATED, &cg_proto->flags))
-			static_key_slow_inc(&memcg_socket_limit_enabled);
-		set_bit(MEMCG_SOCK_ACTIVE, &cg_proto->flags);
-	}
-
-	return 0;
-}
+#include <linux/kernfs.h>
+#include <linux/mutex.h>
+#include <net/tcp.h>
 
 enum {
 	RES_USAGE,
@@ -124,11 +27,17 @@ static ssize_t tcp_cgroup_write(struct kernfs_open_file *of,
 	switch (of_cft(of)->private) {
 	case RES_LIMIT:
 		/* see memcontrol.c */
+		if (memcg == root_mem_cgroup) {
+			ret = -EINVAL;
+			break;
+		}
 		ret = page_counter_memparse(buf, "-1", &nr_pages);
 		if (ret)
 			break;
 		mutex_lock(&tcp_limit_mutex);
-		ret = tcp_update_limit(memcg, nr_pages);
+		ret = page_counter_limit(&memcg->skmem, nr_pages);
+		if (!ret)
+			static_branch_enable(&mem_cgroup_sockets);
 		mutex_unlock(&tcp_limit_mutex);
 		break;
 	default:
@@ -141,32 +50,28 @@ static ssize_t tcp_cgroup_write(struct kernfs_open_file *of,
 static u64 tcp_cgroup_read(struct cgroup_subsys_state *css, struct cftype *cft)
 {
 	struct mem_cgroup *memcg = mem_cgroup_from_css(css);
-	struct cg_proto *cg_proto = tcp_prot.proto_cgroup(memcg);
 	u64 val;
 
 	switch (cft->private) {
 	case RES_LIMIT:
-		if (!cg_proto)
-			return PAGE_COUNTER_MAX;
-		val = cg_proto->memory_allocated.limit;
+		val = memcg->skmem.limit;
 		val *= PAGE_SIZE;
 		break;
 	case RES_USAGE:
-		if (!cg_proto)
+		if (memcg == root_mem_cgroup)
 			val = atomic_long_read(&tcp_memory_allocated);
 		else
-			val = page_counter_read(&cg_proto->memory_allocated);
+			val = page_counter_read(&memcg->skmem);
 		val *= PAGE_SIZE;
 		break;
 	case RES_FAILCNT:
-		if (!cg_proto)
-			return 0;
-		val = cg_proto->memory_allocated.failcnt;
+		val = memcg->skmem.failcnt;
 		break;
 	case RES_MAX_USAGE:
-		if (!cg_proto)
-			return 0;
-		val = cg_proto->memory_allocated.watermark;
+		if (memcg == root_mem_cgroup)
+			val = 0;
+		else
+			val = memcg->skmem.watermark;
 		val *= PAGE_SIZE;
 		break;
 	default:
@@ -178,20 +83,14 @@ static u64 tcp_cgroup_read(struct cgroup_subsys_state *css, struct cftype *cft)
 static ssize_t tcp_cgroup_reset(struct kernfs_open_file *of,
 				char *buf, size_t nbytes, loff_t off)
 {
-	struct mem_cgroup *memcg;
-	struct cg_proto *cg_proto;
-
-	memcg = mem_cgroup_from_css(of_css(of));
-	cg_proto = tcp_prot.proto_cgroup(memcg);
-	if (!cg_proto)
-		return nbytes;
+	struct mem_cgroup *memcg = mem_cgroup_from_css(of_css(of));
 
 	switch (of_cft(of)->private) {
 	case RES_MAX_USAGE:
-		page_counter_reset_watermark(&cg_proto->memory_allocated);
+		page_counter_reset_watermark(&memcg->skmem);
 		break;
 	case RES_FAILCNT:
-		cg_proto->memory_allocated.failcnt = 0;
+		memcg->skmem.failcnt = 0;
 		break;
 	}
 
diff --git a/net/ipv4/tcp_output.c b/net/ipv4/tcp_output.c
index 19adedb..b496fc9 100644
--- a/net/ipv4/tcp_output.c
+++ b/net/ipv4/tcp_output.c
@@ -2819,13 +2819,15 @@ begin_fwd:
  */
 void sk_forced_mem_schedule(struct sock *sk, int size)
 {
-	int amt, status;
+	int amt;
 
 	if (size <= sk->sk_forward_alloc)
 		return;
 	amt = sk_mem_pages(size);
 	sk->sk_forward_alloc += amt * SK_MEM_QUANTUM;
-	sk_memory_allocated_add(sk, amt, &status);
+	sk_memory_allocated_add(sk, amt);
+	if (mem_cgroup_do_sockets() && sk->sk_memcg)
+		mem_cgroup_charge_skmem(sk->sk_memcg, amt);
 }
 
 /* Send a FIN. The caller locks the socket for us.
diff --git a/net/ipv6/tcp_ipv6.c b/net/ipv6/tcp_ipv6.c
index f495d18..cf19e65 100644
--- a/net/ipv6/tcp_ipv6.c
+++ b/net/ipv6/tcp_ipv6.c
@@ -1862,9 +1862,6 @@ struct proto tcpv6_prot = {
 	.compat_setsockopt	= compat_tcp_setsockopt,
 	.compat_getsockopt	= compat_tcp_getsockopt,
 #endif
-#ifdef CONFIG_MEMCG_KMEM
-	.proto_cgroup		= tcp_proto_cgroup,
-#endif
 	.clear_sk		= tcp_v6_clear_sk,
 };
 
-- 
2.6.1


^ permalink raw reply related	[flat|nested] 156+ messages in thread

* [PATCH 4/8] mm: memcontrol: prepare for unified hierarchy socket accounting
  2015-10-22  4:21 ` Johannes Weiner
@ 2015-10-22  4:21   ` Johannes Weiner
  -1 siblings, 0 replies; 156+ messages in thread
From: Johannes Weiner @ 2015-10-22  4:21 UTC (permalink / raw)
  To: David S. Miller, Andrew Morton
  Cc: Michal Hocko, Vladimir Davydov, Tejun Heo, netdev, linux-mm,
	cgroups, linux-kernel

The unified hierarchy memory controller will account socket
memory. Move the infrastructure functions accordingly.

Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
---
 mm/memcontrol.c | 136 ++++++++++++++++++++++++++++----------------------------
 1 file changed, 68 insertions(+), 68 deletions(-)

diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index c41e6d7..3789050 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -287,74 +287,6 @@ static inline struct mem_cgroup *mem_cgroup_from_id(unsigned short id)
 	return mem_cgroup_from_css(css);
 }
 
-/* Writing them here to avoid exposing memcg's inner layout */
-#if defined(CONFIG_INET) && defined(CONFIG_MEMCG_KMEM)
-
-DEFINE_STATIC_KEY_FALSE(mem_cgroup_sockets);
-
-void sock_update_memcg(struct sock *sk)
-{
-	struct mem_cgroup *memcg;
-	/*
-	 * Socket cloning can throw us here with sk_cgrp already
-	 * filled. It won't however, necessarily happen from
-	 * process context. So the test for root memcg given
-	 * the current task's memcg won't help us in this case.
-	 *
-	 * Respecting the original socket's memcg is a better
-	 * decision in this case.
-	 */
-	if (sk->sk_memcg) {
-		BUG_ON(mem_cgroup_is_root(sk->sk_memcg));
-		css_get(&sk->sk_memcg->css);
-		return;
-	}
-
-	rcu_read_lock();
-	memcg = mem_cgroup_from_task(current);
-	if (css_tryget_online(&memcg->css))
-		sk->sk_memcg = memcg;
-	rcu_read_unlock();
-}
-EXPORT_SYMBOL(sock_update_memcg);
-
-void sock_release_memcg(struct sock *sk)
-{
-	if (sk->sk_memcg)
-		css_put(&sk->sk_memcg->css);
-}
-
-/**
- * mem_cgroup_charge_skmem - charge socket memory
- * @memcg: memcg to charge
- * @nr_pages: number of pages to charge
- *
- * Charges @nr_pages to @memcg. Returns %true if the charge fit within
- * the memcg's configured limit, %false if the charge had to be forced.
- */
-bool mem_cgroup_charge_skmem(struct mem_cgroup *memcg, unsigned int nr_pages)
-{
-	struct page_counter *counter;
-
-	if (page_counter_try_charge(&memcg->skmem, nr_pages, &counter))
-		return true;
-
-	page_counter_charge(&memcg->skmem, nr_pages);
-	return false;
-}
-
-/**
- * mem_cgroup_uncharge_skmem - uncharge socket memory
- * @memcg: memcg to uncharge
- * @nr_pages: number of pages to uncharge
- */
-void mem_cgroup_uncharge_skmem(struct mem_cgroup *memcg, unsigned int nr_pages)
-{
-	page_counter_uncharge(&memcg->skmem, nr_pages);
-}
-
-#endif
-
 #ifdef CONFIG_MEMCG_KMEM
 /*
  * This will be the memcg's index in each cache's ->memcg_params.memcg_caches.
@@ -5521,6 +5453,74 @@ void mem_cgroup_replace_page(struct page *oldpage, struct page *newpage)
 	commit_charge(newpage, memcg, true);
 }
 
+/* Writing them here to avoid exposing memcg's inner layout */
+#if defined(CONFIG_INET) && defined(CONFIG_MEMCG_KMEM)
+
+DEFINE_STATIC_KEY_FALSE(mem_cgroup_sockets);
+
+void sock_update_memcg(struct sock *sk)
+{
+	struct mem_cgroup *memcg;
+	/*
+	 * Socket cloning can throw us here with sk_cgrp already
+	 * filled. It won't however, necessarily happen from
+	 * process context. So the test for root memcg given
+	 * the current task's memcg won't help us in this case.
+	 *
+	 * Respecting the original socket's memcg is a better
+	 * decision in this case.
+	 */
+	if (sk->sk_memcg) {
+		BUG_ON(mem_cgroup_is_root(sk->sk_memcg));
+		css_get(&sk->sk_memcg->css);
+		return;
+	}
+
+	rcu_read_lock();
+	memcg = mem_cgroup_from_task(current);
+	if (css_tryget_online(&memcg->css))
+		sk->sk_memcg = memcg;
+	rcu_read_unlock();
+}
+EXPORT_SYMBOL(sock_update_memcg);
+
+void sock_release_memcg(struct sock *sk)
+{
+	if (sk->sk_memcg)
+		css_put(&sk->sk_memcg->css);
+}
+
+/**
+ * mem_cgroup_charge_skmem - charge socket memory
+ * @memcg: memcg to charge
+ * @nr_pages: number of pages to charge
+ *
+ * Charges @nr_pages to @memcg. Returns %true if the charge fit within
+ * the memcg's configured limit, %false if the charge had to be forced.
+ */
+bool mem_cgroup_charge_skmem(struct mem_cgroup *memcg, unsigned int nr_pages)
+{
+	struct page_counter *counter;
+
+	if (page_counter_try_charge(&memcg->skmem, nr_pages, &counter))
+		return true;
+
+	page_counter_charge(&memcg->skmem, nr_pages);
+	return false;
+}
+
+/**
+ * mem_cgroup_uncharge_skmem - uncharge socket memory
+ * @memcg: memcg to uncharge
+ * @nr_pages: number of pages to uncharge
+ */
+void mem_cgroup_uncharge_skmem(struct mem_cgroup *memcg, unsigned int nr_pages)
+{
+	page_counter_uncharge(&memcg->skmem, nr_pages);
+}
+
+#endif
+
 /*
  * subsys_initcall() for memory controller.
  *
-- 
2.6.1


^ permalink raw reply related	[flat|nested] 156+ messages in thread

* [PATCH 5/8] mm: memcontrol: account socket memory on unified hierarchy
  2015-10-22  4:21 ` Johannes Weiner
@ 2015-10-22  4:21   ` Johannes Weiner
  -1 siblings, 0 replies; 156+ messages in thread
From: Johannes Weiner @ 2015-10-22  4:21 UTC (permalink / raw)
  To: David S. Miller, Andrew Morton
  Cc: Michal Hocko, Vladimir Davydov, Tejun Heo, netdev, linux-mm,
	cgroups, linux-kernel

Socket memory can be a significant share of overall memory consumed
by common workloads. In order to provide reasonable resource isolation
out of the box in the unified hierarchy, this type of memory needs to
be accounted and tracked by default in the memory controller.

Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
---
 include/linux/memcontrol.h | 16 ++++++--
 mm/memcontrol.c            | 95 ++++++++++++++++++++++++++++++++++++----------
 2 files changed, 87 insertions(+), 24 deletions(-)

diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
index 5b72f83..6f1e0f8 100644
--- a/include/linux/memcontrol.h
+++ b/include/linux/memcontrol.h
@@ -244,6 +244,10 @@ struct mem_cgroup {
 	struct wb_domain cgwb_domain;
 #endif
 
+#ifdef CONFIG_INET
+	struct work_struct socket_work;
+#endif
+
 	/* List of events which userspace want to receive */
 	struct list_head event_list;
 	spinlock_t event_list_lock;
@@ -676,11 +680,15 @@ static inline void mem_cgroup_wb_stats(struct bdi_writeback *wb,
 #endif	/* CONFIG_CGROUP_WRITEBACK */
 
 struct sock;
-#if defined(CONFIG_INET) && defined(CONFIG_MEMCG_KMEM)
-extern struct static_key_false mem_cgroup_sockets;
+#ifdef CONFIG_INET
+extern struct static_key_true mem_cgroup_sockets;
 static inline bool mem_cgroup_do_sockets(void)
 {
-	return static_branch_unlikely(&mem_cgroup_sockets);
+	if (mem_cgroup_disabled())
+		return false;
+	if (!static_branch_likely(&mem_cgroup_sockets))
+		return false;
+	return true;
 }
 void sock_update_memcg(struct sock *sk);
 void sock_release_memcg(struct sock *sk);
@@ -706,7 +714,7 @@ static inline void mem_cgroup_uncharge_skmem(struct mem_cgroup *memcg,
 					     unsigned int nr_pages)
 {
 }
-#endif /* CONFIG_INET && CONFIG_MEMCG_KMEM */
+#endif /* CONFIG_INET */
 
 #ifdef CONFIG_MEMCG_KMEM
 extern struct static_key memcg_kmem_enabled_key;
diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index 3789050..cb1d6aa 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -1916,6 +1916,18 @@ static int memcg_cpu_hotplug_callback(struct notifier_block *nb,
 	return NOTIFY_OK;
 }
 
+static void reclaim_high(struct mem_cgroup *memcg,
+			 unsigned int nr_pages,
+			 gfp_t gfp_mask)
+{
+	do {
+		if (page_counter_read(&memcg->memory) <= memcg->high)
+			continue;
+		mem_cgroup_events(memcg, MEMCG_HIGH, 1);
+		try_to_free_mem_cgroup_pages(memcg, nr_pages, gfp_mask, true);
+	} while ((memcg = parent_mem_cgroup(memcg)));
+}
+
 /*
  * Scheduled by try_charge() to be executed from the userland return path
  * and reclaims memory over the high limit.
@@ -1923,20 +1935,13 @@ static int memcg_cpu_hotplug_callback(struct notifier_block *nb,
 void mem_cgroup_handle_over_high(void)
 {
 	unsigned int nr_pages = current->memcg_nr_pages_over_high;
-	struct mem_cgroup *memcg, *pos;
+	struct mem_cgroup *memcg;
 
 	if (likely(!nr_pages))
 		return;
 
-	pos = memcg = get_mem_cgroup_from_mm(current->mm);
-
-	do {
-		if (page_counter_read(&pos->memory) <= pos->high)
-			continue;
-		mem_cgroup_events(pos, MEMCG_HIGH, 1);
-		try_to_free_mem_cgroup_pages(pos, nr_pages, GFP_KERNEL, true);
-	} while ((pos = parent_mem_cgroup(pos)));
-
+	memcg = get_mem_cgroup_from_mm(current->mm);
+	reclaim_high(memcg, nr_pages, GFP_KERNEL);
 	css_put(&memcg->css);
 	current->memcg_nr_pages_over_high = 0;
 }
@@ -4129,6 +4134,8 @@ struct mem_cgroup *parent_mem_cgroup(struct mem_cgroup *memcg)
 }
 EXPORT_SYMBOL(parent_mem_cgroup);
 
+static void socket_work_func(struct work_struct *work);
+
 static struct cgroup_subsys_state * __ref
 mem_cgroup_css_alloc(struct cgroup_subsys_state *parent_css)
 {
@@ -4169,6 +4176,9 @@ mem_cgroup_css_alloc(struct cgroup_subsys_state *parent_css)
 #ifdef CONFIG_CGROUP_WRITEBACK
 	INIT_LIST_HEAD(&memcg->cgwb_list);
 #endif
+#ifdef CONFIG_INET
+	INIT_WORK(&memcg->socket_work, socket_work_func);
+#endif
 	return &memcg->css;
 
 free_out:
@@ -4266,6 +4276,8 @@ static void mem_cgroup_css_free(struct cgroup_subsys_state *css)
 {
 	struct mem_cgroup *memcg = mem_cgroup_from_css(css);
 
+	cancel_work_sync(&memcg->socket_work);
+
 	memcg_destroy_kmem(memcg);
 	__mem_cgroup_free(memcg);
 }
@@ -4948,10 +4960,15 @@ static void mem_cgroup_bind(struct cgroup_subsys_state *root_css)
 	 * guarantees that @root doesn't have any children, so turning it
 	 * on for the root memcg is enough.
 	 */
-	if (cgroup_subsys_on_dfl(memory_cgrp_subsys))
+	if (cgroup_subsys_on_dfl(memory_cgrp_subsys)) {
 		root_mem_cgroup->use_hierarchy = true;
-	else
+#ifdef CONFIG_INET
+		/* unified hierarchy always counts skmem */
+		static_branch_enable(&mem_cgroup_sockets);
+#endif
+	} else {
 		root_mem_cgroup->use_hierarchy = false;
+	}
 }
 
 static u64 memory_current_read(struct cgroup_subsys_state *css,
@@ -5453,10 +5470,9 @@ void mem_cgroup_replace_page(struct page *oldpage, struct page *newpage)
 	commit_charge(newpage, memcg, true);
 }
 
-/* Writing them here to avoid exposing memcg's inner layout */
-#if defined(CONFIG_INET) && defined(CONFIG_MEMCG_KMEM)
+#ifdef CONFIG_INET
 
-DEFINE_STATIC_KEY_FALSE(mem_cgroup_sockets);
+DEFINE_STATIC_KEY_TRUE(mem_cgroup_sockets);
 
 void sock_update_memcg(struct sock *sk)
 {
@@ -5490,6 +5506,14 @@ void sock_release_memcg(struct sock *sk)
 		css_put(&sk->sk_memcg->css);
 }
 
+static void socket_work_func(struct work_struct *work)
+{
+	struct mem_cgroup *memcg;
+
+	memcg = container_of(work, struct mem_cgroup, socket_work);
+	reclaim_high(memcg, CHARGE_BATCH, GFP_KERNEL);
+}
+
 /**
  * mem_cgroup_charge_skmem - charge socket memory
  * @memcg: memcg to charge
@@ -5500,13 +5524,38 @@ void sock_release_memcg(struct sock *sk)
  */
 bool mem_cgroup_charge_skmem(struct mem_cgroup *memcg, unsigned int nr_pages)
 {
+	unsigned int batch = max(CHARGE_BATCH, nr_pages);
 	struct page_counter *counter;
+	bool force = false;
 
-	if (page_counter_try_charge(&memcg->skmem, nr_pages, &counter))
+	if (!cgroup_subsys_on_dfl(memory_cgrp_subsys)) {
+		if (page_counter_try_charge(&memcg->skmem, nr_pages, &counter))
+			return true;
+		page_counter_charge(&memcg->skmem, nr_pages);
+		return false;
+	}
+
+	if (consume_stock(memcg, nr_pages))
 		return true;
+retry:
+	if (page_counter_try_charge(&memcg->memory, batch, &counter))
+		goto done;
 
-	page_counter_charge(&memcg->skmem, nr_pages);
-	return false;
+	if (batch > nr_pages) {
+		batch = nr_pages;
+		goto retry;
+	}
+
+	force = true;
+	page_counter_charge(&memcg->memory, batch);
+done:
+	css_get_many(&memcg->css, batch);
+	if (batch > nr_pages)
+		refill_stock(memcg, batch - nr_pages);
+
+	schedule_work(&memcg->socket_work);
+
+	return !force;
 }
 
 /**
@@ -5516,10 +5565,16 @@ bool mem_cgroup_charge_skmem(struct mem_cgroup *memcg, unsigned int nr_pages)
  */
 void mem_cgroup_uncharge_skmem(struct mem_cgroup *memcg, unsigned int nr_pages)
 {
-	page_counter_uncharge(&memcg->skmem, nr_pages);
+	if (!cgroup_subsys_on_dfl(memory_cgrp_subsys)) {
+		page_counter_uncharge(&memcg->skmem, nr_pages);
+		return;
+	}
+
+	page_counter_uncharge(&memcg->memory, nr_pages);
+	css_put_many(&memcg->css, nr_pages);
 }
 
-#endif
+#endif /* CONFIG_INET */
 
 /*
  * subsys_initcall() for memory controller.
-- 
2.6.1


^ permalink raw reply related	[flat|nested] 156+ messages in thread

* [PATCH 6/8] mm: vmscan: simplify memcg vs. global shrinker invocation
  2015-10-22  4:21 ` Johannes Weiner
@ 2015-10-22  4:21   ` Johannes Weiner
  -1 siblings, 0 replies; 156+ messages in thread
From: Johannes Weiner @ 2015-10-22  4:21 UTC (permalink / raw)
  To: David S. Miller, Andrew Morton
  Cc: Michal Hocko, Vladimir Davydov, Tejun Heo, netdev, linux-mm,
	cgroups, linux-kernel

Letting shrink_slab() handle the root_mem_cgroup, and implicitly the
!CONFIG_MEMCG case, allows shrink_zone() to invoke the shrinkers
unconditionally from within the memcg iteration loop.
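
For illustration only (not part of the patch), a rough sketch of the
resulting loop shape in shrink_zone(); identifiers are as in the
mm/vmscan.c hunk below, and shrink_slab() itself now translates the
root memcg (plain NULL under !CONFIG_MEMCG) into global shrinker mode:

	/* illustrative sketch only */
	memcg = mem_cgroup_iter(root, NULL, &reclaim);
	do {
		/* per-memcg LRU reclaim ... */
		if (is_classzone)
			shrink_slab(sc->gfp_mask, zone_to_nid(zone), memcg,
				    sc->nr_scanned - scanned, lru_pages);
	} while ((memcg = mem_cgroup_iter(root, memcg, &reclaim)));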

Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
---
 include/linux/memcontrol.h |  2 ++
 mm/vmscan.c                | 31 ++++++++++++++++---------------
 2 files changed, 18 insertions(+), 15 deletions(-)

diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
index 6f1e0f8..d66ae18 100644
--- a/include/linux/memcontrol.h
+++ b/include/linux/memcontrol.h
@@ -482,6 +482,8 @@ void mem_cgroup_split_huge_fixup(struct page *head);
 #else /* CONFIG_MEMCG */
 struct mem_cgroup;
 
+#define root_mem_cgroup NULL
+
 static inline void mem_cgroup_events(struct mem_cgroup *memcg,
 				     enum mem_cgroup_events_index idx,
 				     unsigned int nr)
diff --git a/mm/vmscan.c b/mm/vmscan.c
index 9b52ecf..ecc2125 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -411,6 +411,10 @@ static unsigned long shrink_slab(gfp_t gfp_mask, int nid,
 	struct shrinker *shrinker;
 	unsigned long freed = 0;
 
+	/* Global shrinker mode */
+	if (memcg == root_mem_cgroup)
+		memcg = NULL;
+
 	if (memcg && !memcg_kmem_is_active(memcg))
 		return 0;
 
@@ -2417,11 +2421,22 @@ static bool shrink_zone(struct zone *zone, struct scan_control *sc,
 			shrink_lruvec(lruvec, swappiness, sc, &lru_pages);
 			zone_lru_pages += lru_pages;
 
-			if (memcg && is_classzone)
+			/*
+			 * Shrink the slab caches in the same proportion that
+			 * the eligible LRU pages were scanned.
+			 */
+			if (is_classzone) {
 				shrink_slab(sc->gfp_mask, zone_to_nid(zone),
 					    memcg, sc->nr_scanned - scanned,
 					    lru_pages);
 
+				if (reclaim_state) {
+					sc->nr_reclaimed +=
+						reclaim_state->reclaimed_slab;
+					reclaim_state->reclaimed_slab = 0;
+				}
+			}
+
 			/*
 			 * Direct reclaim and kswapd have to scan all memory
 			 * cgroups to fulfill the overall scan target for the
@@ -2439,20 +2454,6 @@ static bool shrink_zone(struct zone *zone, struct scan_control *sc,
 			}
 		} while ((memcg = mem_cgroup_iter(root, memcg, &reclaim)));
 
-		/*
-		 * Shrink the slab caches in the same proportion that
-		 * the eligible LRU pages were scanned.
-		 */
-		if (global_reclaim(sc) && is_classzone)
-			shrink_slab(sc->gfp_mask, zone_to_nid(zone), NULL,
-				    sc->nr_scanned - nr_scanned,
-				    zone_lru_pages);
-
-		if (reclaim_state) {
-			sc->nr_reclaimed += reclaim_state->reclaimed_slab;
-			reclaim_state->reclaimed_slab = 0;
-		}
-
 		vmpressure(sc->gfp_mask, sc->target_mem_cgroup,
 			   sc->nr_scanned - nr_scanned,
 			   sc->nr_reclaimed - nr_reclaimed);
-- 
2.6.1



* [PATCH 7/8] mm: vmscan: report vmpressure at the level of reclaim activity
  2015-10-22  4:21 ` Johannes Weiner
@ 2015-10-22  4:21   ` Johannes Weiner
  -1 siblings, 0 replies; 156+ messages in thread
From: Johannes Weiner @ 2015-10-22  4:21 UTC (permalink / raw)
  To: David S. Miller, Andrew Morton
  Cc: Michal Hocko, Vladimir Davydov, Tejun Heo, netdev, linux-mm,
	cgroups, linux-kernel

The vmpressure metric is based on reclaim efficiency, which in turn is
an attribute of the LRU. However, vmpressure events are currently
reported at the source of pressure rather than at the reclaim level.

Switch the reporting to the reclaim level to allow finer-grained
analysis of which memcg is having trouble reclaiming its pages.

As far as memory.pressure_level interface semantics go, events are
escalated up the hierarchy until a listener is found, so this won't
affect existing users that listen at higher levels.

This also prepares vmpressure for hooking it up to the networking
stack's memory pressure code.
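
The mechanical change is that the scanned/reclaimed deltas are now
snapshotted per memcg iteration instead of once per zone pass; condensed,
the patched loop body does roughly:

	reclaimed = sc->nr_reclaimed;
	scanned = sc->nr_scanned;

	shrink_lruvec(lruvec, swappiness, sc, &lru_pages);

	/* report pressure for the memcg whose LRUs were actually scanned */
	vmpressure(sc->gfp_mask, memcg,
		   sc->nr_scanned - scanned,
		   sc->nr_reclaimed - reclaimed);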

Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
---
 mm/vmscan.c | 10 ++++++----
 1 file changed, 6 insertions(+), 4 deletions(-)

diff --git a/mm/vmscan.c b/mm/vmscan.c
index ecc2125..50630c8 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -2404,6 +2404,7 @@ static bool shrink_zone(struct zone *zone, struct scan_control *sc,
 		memcg = mem_cgroup_iter(root, NULL, &reclaim);
 		do {
 			unsigned long lru_pages;
+			unsigned long reclaimed;
 			unsigned long scanned;
 			struct lruvec *lruvec;
 			int swappiness;
@@ -2416,6 +2417,7 @@ static bool shrink_zone(struct zone *zone, struct scan_control *sc,
 
 			lruvec = mem_cgroup_zone_lruvec(zone, memcg);
 			swappiness = mem_cgroup_swappiness(memcg);
+			reclaimed = sc->nr_reclaimed;
 			scanned = sc->nr_scanned;
 
 			shrink_lruvec(lruvec, swappiness, sc, &lru_pages);
@@ -2437,6 +2439,10 @@ static bool shrink_zone(struct zone *zone, struct scan_control *sc,
 				}
 			}
 
+			vmpressure(sc->gfp_mask, memcg,
+				   sc->nr_scanned - scanned,
+				   sc->nr_reclaimed - reclaimed);
+
 			/*
 			 * Direct reclaim and kswapd have to scan all memory
 			 * cgroups to fulfill the overall scan target for the
@@ -2454,10 +2460,6 @@ static bool shrink_zone(struct zone *zone, struct scan_control *sc,
 			}
 		} while ((memcg = mem_cgroup_iter(root, memcg, &reclaim)));
 
-		vmpressure(sc->gfp_mask, sc->target_mem_cgroup,
-			   sc->nr_scanned - nr_scanned,
-			   sc->nr_reclaimed - nr_reclaimed);
-
 		if (sc->nr_reclaimed - nr_reclaimed)
 			reclaimable = true;
 
-- 
2.6.1



* [PATCH 8/8] mm: memcontrol: hook up vmpressure to socket pressure
  2015-10-22  4:21 ` Johannes Weiner
@ 2015-10-22  4:21   ` Johannes Weiner
  -1 siblings, 0 replies; 156+ messages in thread
From: Johannes Weiner @ 2015-10-22  4:21 UTC (permalink / raw)
  To: David S. Miller, Andrew Morton
  Cc: Michal Hocko, Vladimir Davydov, Tejun Heo, netdev, linux-mm,
	cgroups, linux-kernel

Let the networking stack know when a memcg is under reclaim pressure,
so it can shrink its transmit windows accordingly.

Whenever the reclaim efficiency of a memcg's LRU lists drops low
enough for a MEDIUM or HIGH vmpressure event to occur, assert a
pressure state in the socket and tcp memory code that tells it to
reduce memory usage in sockets associated with said memory cgroup.

vmpressure events are edge triggered, so for hysteresis assert socket
pressure for a second to allow for subsequent vmpressure events to
occur before letting the socket code return to normal.
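
The hysteresis is a plain jiffies timestamp: the vmpressure worker arms
it for one second, and the socket/tcp pressure checks consult it until
it expires (sketch of the hunks below):

	/* vmpressure worker, on a MEDIUM or HIGH event */
	memcg->socket_pressure = jiffies + HZ;

	/* socket/tcp side */
	static inline bool mem_cgroup_socket_pressure(struct mem_cgroup *memcg)
	{
		return time_before(jiffies, memcg->socket_pressure);
	}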

Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
---
 include/linux/memcontrol.h |  9 +++++++++
 include/net/sock.h         |  4 ++++
 include/net/tcp.h          |  4 ++++
 mm/memcontrol.c            |  1 +
 mm/vmpressure.c            | 29 ++++++++++++++++++++++++-----
 5 files changed, 42 insertions(+), 5 deletions(-)

diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
index d66ae18..b9990f7 100644
--- a/include/linux/memcontrol.h
+++ b/include/linux/memcontrol.h
@@ -246,6 +246,7 @@ struct mem_cgroup {
 
 #ifdef CONFIG_INET
 	struct work_struct socket_work;
+	unsigned long socket_pressure;
 #endif
 
 	/* List of events which userspace want to receive */
@@ -696,6 +697,10 @@ void sock_update_memcg(struct sock *sk);
 void sock_release_memcg(struct sock *sk);
 bool mem_cgroup_charge_skmem(struct mem_cgroup *memcg, unsigned int nr_pages);
 void mem_cgroup_uncharge_skmem(struct mem_cgroup *memcg, unsigned int nr_pages);
+static inline bool mem_cgroup_socket_pressure(struct mem_cgroup *memcg)
+{
+	return time_before(jiffies, memcg->socket_pressure);
+}
 #else
 static inline bool mem_cgroup_do_sockets(void)
 {
@@ -716,6 +721,10 @@ static inline void mem_cgroup_uncharge_skmem(struct mem_cgroup *memcg,
 					     unsigned int nr_pages)
 {
 }
+static inline bool mem_cgroup_socket_pressure(struct mem_cgroup *memcg)
+{
+	return false;
+}
 #endif /* CONFIG_INET */
 
 #ifdef CONFIG_MEMCG_KMEM
diff --git a/include/net/sock.h b/include/net/sock.h
index 67795fc..22bfb9c 100644
--- a/include/net/sock.h
+++ b/include/net/sock.h
@@ -1087,6 +1087,10 @@ static inline bool sk_has_memory_pressure(const struct sock *sk)
 
 static inline bool sk_under_memory_pressure(const struct sock *sk)
 {
+	if (mem_cgroup_do_sockets() && sk->sk_memcg &&
+	    mem_cgroup_socket_pressure(sk->sk_memcg))
+		return true;
+
 	if (!sk->sk_prot->memory_pressure)
 		return false;
 
diff --git a/include/net/tcp.h b/include/net/tcp.h
index 77b6c7e..c7d342c 100644
--- a/include/net/tcp.h
+++ b/include/net/tcp.h
@@ -291,6 +291,10 @@ extern int tcp_memory_pressure;
 /* optimized version of sk_under_memory_pressure() for TCP sockets */
 static inline bool tcp_under_memory_pressure(const struct sock *sk)
 {
+	if (mem_cgroup_do_sockets() && sk->sk_memcg &&
+	    mem_cgroup_socket_pressure(sk->sk_memcg))
+		return true;
+
 	return tcp_memory_pressure;
 }
 /*
diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index cb1d6aa..2e09def 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -4178,6 +4178,7 @@ mem_cgroup_css_alloc(struct cgroup_subsys_state *parent_css)
 #endif
 #ifdef CONFIG_INET
 	INIT_WORK(&memcg->socket_work, socket_work_func);
+	memcg->socket_pressure = jiffies;
 #endif
 	return &memcg->css;
 
diff --git a/mm/vmpressure.c b/mm/vmpressure.c
index 4c25e62..f64c0e1 100644
--- a/mm/vmpressure.c
+++ b/mm/vmpressure.c
@@ -137,14 +137,11 @@ struct vmpressure_event {
 };
 
 static bool vmpressure_event(struct vmpressure *vmpr,
-			     unsigned long scanned, unsigned long reclaimed)
+			     enum vmpressure_levels level)
 {
 	struct vmpressure_event *ev;
-	enum vmpressure_levels level;
 	bool signalled = false;
 
-	level = vmpressure_calc_level(scanned, reclaimed);
-
 	mutex_lock(&vmpr->events_lock);
 
 	list_for_each_entry(ev, &vmpr->events, node) {
@@ -162,6 +159,7 @@ static bool vmpressure_event(struct vmpressure *vmpr,
 static void vmpressure_work_fn(struct work_struct *work)
 {
 	struct vmpressure *vmpr = work_to_vmpressure(work);
+	enum vmpressure_levels level;
 	unsigned long scanned;
 	unsigned long reclaimed;
 
@@ -185,8 +183,29 @@ static void vmpressure_work_fn(struct work_struct *work)
 	vmpr->reclaimed = 0;
 	spin_unlock(&vmpr->sr_lock);
 
+	level = vmpressure_calc_level(scanned, reclaimed);
+
+	if (level > VMPRESSURE_LOW) {
+		struct mem_cgroup *memcg;
+		/*
+		 * Let the socket buffer allocator know that we are
+		 * having trouble reclaiming LRU pages.
+		 *
+		 * For hysteresis, keep the pressure state asserted
+		 * for a second in which subsequent pressure events
+		 * can occur.
+		 *
+		 * XXX: is vmpressure a global feature or part of
+		 * memcg? There shouldn't be anything memcg-specific
+		 * about exporting reclaim success ratios from the VM.
+		 */
+		memcg = container_of(vmpr, struct mem_cgroup, vmpressure);
+		if (memcg != root_mem_cgroup)
+			memcg->socket_pressure = jiffies + HZ;
+	}
+
 	do {
-		if (vmpressure_event(vmpr, scanned, reclaimed))
+		if (vmpressure_event(vmpr, level))
 			break;
 		/*
 		 * If not handled, propagate the event upward into the
-- 
2.6.1



* Re: [PATCH 0/8] mm: memcontrol: account socket memory in unified hierarchy
  2015-10-22  4:21 ` Johannes Weiner
  (?)
@ 2015-10-22 18:45   ` Vladimir Davydov
  -1 siblings, 0 replies; 156+ messages in thread
From: Vladimir Davydov @ 2015-10-22 18:45 UTC (permalink / raw)
  To: Johannes Weiner
  Cc: David S. Miller, Andrew Morton, Michal Hocko, Tejun Heo, netdev,
	linux-mm, cgroups, linux-kernel

Hi Johannes,

On Thu, Oct 22, 2015 at 12:21:28AM -0400, Johannes Weiner wrote:
...
> Patch #5 adds accounting and tracking of socket memory to the unified
> hierarchy memory controller, as described above. It uses the existing
> per-cpu charge caches and triggers high limit reclaim asynchroneously.
> 
> Patch #8 uses the vmpressure extension to equalize pressure between
> the pages tracked natively by the VM and socket buffer pages. As the
> pool is shared, it makes sense that while natively tracked pages are
> under duress the network transmit windows are also not increased.

First of all, I've no experience in networking, so I'm likely to be
mistaken. Nevertheless, I beg to disagree that this patch set is a step
in the right direction. Here is why.

I admit that your idea to get rid of the explicit tcp window control knobs
and size the windows dynamically based on memory pressure instead does
sound tempting, but I don't think it'd always work. The problem is that,
in contrast to, say, dcache, we can't shrink tcp buffers AFAIU, we can
only stop growing them. Now suppose a system hasn't experienced memory
pressure for a while. If we don't have an explicit tcp window limit, tcp
buffers on such a system might have eaten almost all available memory
(because of network load/problems). If a user workload that needs a
significant amount of memory is then started suddenly, the network code
will receive a notification and surely stop growing buffers, but all
the buffers accumulated so far won't disappear instantly. As a result, the
workload might be unable to find enough free memory and have no choice
but to invoke the OOM killer. This looks unexpected from the user's POV.

That said, I think we do need per memcg tcp window control similar to
what we have system-wide. In other words, Glauber's work makes sense to
me. You might want to point me at my RFC patch where I proposed to
revert it (https://lkml.org/lkml/2014/9/12/401). Well, I've changed my
mind since then: now I think I was mistaken, and luckily I was stopped.
However, I may be mistaken again :-)

Thanks,
Vladimir


* Re: [PATCH 3/8] net: consolidate memcg socket buffer tracking and accounting
  2015-10-22  4:21   ` Johannes Weiner
  (?)
@ 2015-10-22 18:46     ` Vladimir Davydov
  -1 siblings, 0 replies; 156+ messages in thread
From: Vladimir Davydov @ 2015-10-22 18:46 UTC (permalink / raw)
  To: Johannes Weiner
  Cc: David S. Miller, Andrew Morton, Michal Hocko, Tejun Heo, netdev,
	linux-mm, cgroups, linux-kernel

On Thu, Oct 22, 2015 at 12:21:31AM -0400, Johannes Weiner wrote:
> The tcp memory controller has extensive provisions for future memory
> accounting interfaces that won't materialize after all. Cut the code
> base down to what's actually used, now and in the likely future.
> 
> - There won't be any different protocol counters in the future, so a
>   direct sock->sk_memcg linkage is enough. This eliminates a lot of
>   callback maze and boilerplate code, and restores most of the socket
>   allocation code to pre-tcp_memcontrol state.
> 
> - There won't be a tcp control soft limit, so integrating the memcg

In fact, the code is ready for the "soft" limit (I mean the min, pressure,
max tuple); it just lacks a knob.

>   code into the global skmem limiting scheme complicates things
>   unnecessarily. Replace all that with simple and clear charge and
>   uncharge calls--hidden behind a jump label--to account skb memory.
> 
> - The previous jump label code was an elaborate state machine that
>   tracked the number of cgroups with an active socket limit in order
>   to enable the skmem tracking and accounting code only when actively
>   necessary. But this is overengineered: it was meant to protect the
>   people who never use this feature in the first place. Simply enable
>   the branches once when the first limit is set until the next reboot.
> 
...
> @@ -1136,9 +1090,6 @@ static inline bool sk_under_memory_pressure(const struct sock *sk)
>  	if (!sk->sk_prot->memory_pressure)
>  		return false;
>  
> -	if (mem_cgroup_sockets_enabled && sk->sk_cgrp)
> -		return !!sk->sk_cgrp->memory_pressure;
> -

AFAIU, now we won't shrink the window on hitting the limit, i.e. this
patch subtly changes the behavior of the existing knobs, potentially
breaking them.

Thanks,
Vladimir


* Re: [PATCH 5/8] mm: memcontrol: account socket memory on unified hierarchy
  2015-10-22  4:21   ` Johannes Weiner
  (?)
@ 2015-10-22 18:47     ` Vladimir Davydov
  -1 siblings, 0 replies; 156+ messages in thread
From: Vladimir Davydov @ 2015-10-22 18:47 UTC (permalink / raw)
  To: Johannes Weiner
  Cc: David S. Miller, Andrew Morton, Michal Hocko, Tejun Heo, netdev,
	linux-mm, cgroups, linux-kernel

On Thu, Oct 22, 2015 at 12:21:33AM -0400, Johannes Weiner wrote:
...
> @@ -5500,13 +5524,38 @@ void sock_release_memcg(struct sock *sk)
>   */
>  bool mem_cgroup_charge_skmem(struct mem_cgroup *memcg, unsigned int nr_pages)
>  {
> +	unsigned int batch = max(CHARGE_BATCH, nr_pages);
>  	struct page_counter *counter;
> +	bool force = false;
>  
> -	if (page_counter_try_charge(&memcg->skmem, nr_pages, &counter))
> +	if (!cgroup_subsys_on_dfl(memory_cgrp_subsys)) {
> +		if (page_counter_try_charge(&memcg->skmem, nr_pages, &counter))
> +			return true;
> +		page_counter_charge(&memcg->skmem, nr_pages);
> +		return false;
> +	}
> +
> +	if (consume_stock(memcg, nr_pages))
>  		return true;
> +retry:
> +	if (page_counter_try_charge(&memcg->memory, batch, &counter))
> +		goto done;

Currently, we use memcg->memory only for charging memory pages. Besides,
every page charged to this counter (including kmem) has ->mem_cgroup
field set appropriately. This looks consistent and nice. As an extra
benefit, we can track all pages charged to a memory cgroup via
/proc/kpagecgroup.

Now, you charge the "window size" to it, which AFAIU isn't necessarily equal
to the amount of memory actually consumed by the cgroup for socket
buffers. I think this looks ugly and inconsistent with the existing
behavior. I agree that we need to charge socket buffers to ->memory, but
IMO we should do that for each skb page, using memcg_kmem_charge_kmem
somewhere in alloc_skb_with_frags, invoking the reclaimer just as we do
for kmalloc, while tcp window size control stays separate.
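
Something along these lines, say (a purely hypothetical sketch; the memcg
would still have to be plumbed into alloc_skb_with_frags(), and the
per-page charge helper's name and signature -- returning 0 on success --
are assumed here):

	/* inside the frag allocation loop of alloc_skb_with_frags() */
	page = alloc_pages(gfp_mask, order);
	if (page && memcg &&
	    memcg_kmem_charge_kmem(memcg, gfp_mask, 1 << order)) {
		/* charge failed even after reclaim: treat it as an allocation failure */
		__free_pages(page, order);
		page = NULL;
	}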

Thanks,
Vladimir


* Re: [PATCH 7/8] mm: vmscan: report vmpressure at the level of reclaim activity
@ 2015-10-22 18:48     ` Vladimir Davydov
  0 siblings, 0 replies; 156+ messages in thread
From: Vladimir Davydov @ 2015-10-22 18:48 UTC (permalink / raw)
  To: Johannes Weiner
  Cc: David S. Miller, Andrew Morton, Michal Hocko, Tejun Heo, netdev,
	linux-mm, cgroups, linux-kernel

On Thu, Oct 22, 2015 at 12:21:35AM -0400, Johannes Weiner wrote:
...
> @@ -2437,6 +2439,10 @@ static bool shrink_zone(struct zone *zone, struct scan_control *sc,
>  				}
>  			}
>  
> +			vmpressure(sc->gfp_mask, memcg,
> +				   sc->nr_scanned - scanned,
> +				   sc->nr_reclaimed - reclaimed);
> +
>  			/*
>  			 * Direct reclaim and kswapd have to scan all memory
>  			 * cgroups to fulfill the overall scan target for the
> @@ -2454,10 +2460,6 @@ static bool shrink_zone(struct zone *zone, struct scan_control *sc,
>  			}
>  		} while ((memcg = mem_cgroup_iter(root, memcg, &reclaim)));
>  
> -		vmpressure(sc->gfp_mask, sc->target_mem_cgroup,
> -			   sc->nr_scanned - nr_scanned,
> -			   sc->nr_reclaimed - nr_reclaimed);
> -
>  		if (sc->nr_reclaimed - nr_reclaimed)
>  			reclaimable = true;
>  

I may be mistaken, but AFAIU this patch subtly changes the behavior of
vmpressure as visible from userspace: w/o this patch a userspace
process will receive a notification for a memory cgroup only if
*this* memory cgroup invokes the reclaimer; with this patch a userspace
notification will be issued even if the reclaimer is invoked by a cgroup
further up the hierarchy.

Thanks,
Vladimir


* Re: [PATCH 8/8] mm: memcontrol: hook up vmpressure to socket pressure
  2015-10-22  4:21   ` Johannes Weiner
  (?)
@ 2015-10-22 18:57     ` Vladimir Davydov
  -1 siblings, 0 replies; 156+ messages in thread
From: Vladimir Davydov @ 2015-10-22 18:57 UTC (permalink / raw)
  To: Johannes Weiner
  Cc: David S. Miller, Andrew Morton, Michal Hocko, Tejun Heo, netdev,
	linux-mm, cgroups, linux-kernel

On Thu, Oct 22, 2015 at 12:21:36AM -0400, Johannes Weiner wrote:
...
> @@ -185,8 +183,29 @@ static void vmpressure_work_fn(struct work_struct *work)
>  	vmpr->reclaimed = 0;
>  	spin_unlock(&vmpr->sr_lock);
>  
> +	level = vmpressure_calc_level(scanned, reclaimed);
> +
> +	if (level > VMPRESSURE_LOW) {

So we start socket_pressure at MEDIUM. Why not at LOW or CRITICAL?

> +		struct mem_cgroup *memcg;
> +		/*
> +		 * Let the socket buffer allocator know that we are
> +		 * having trouble reclaiming LRU pages.
> +		 *
> +		 * For hysteresis, keep the pressure state asserted
> +		 * for a second in which subsequent pressure events
> +		 * can occur.
> +		 *
> +		 * XXX: is vmpressure a global feature or part of
> +		 * memcg? There shouldn't be anything memcg-specific
> +		 * about exporting reclaim success ratios from the VM.
> +		 */
> +		memcg = container_of(vmpr, struct mem_cgroup, vmpressure);
> +		if (memcg != root_mem_cgroup)
> +			memcg->socket_pressure = jiffies + HZ;

Why 1 second?

Thanks,
Vladimir

> +	}
> +
>  	do {
> -		if (vmpressure_event(vmpr, scanned, reclaimed))
> +		if (vmpressure_event(vmpr, level))
>  			break;
>  		/*
>  		 * If not handled, propagate the event upward into the


* Re: [PATCH 3/8] net: consolidate memcg socket buffer tracking and accounting
  2015-10-22 18:46     ` Vladimir Davydov
@ 2015-10-22 19:09       ` Johannes Weiner
  -1 siblings, 0 replies; 156+ messages in thread
From: Johannes Weiner @ 2015-10-22 19:09 UTC (permalink / raw)
  To: Vladimir Davydov
  Cc: David S. Miller, Andrew Morton, Michal Hocko, Tejun Heo, netdev,
	linux-mm, cgroups, linux-kernel

On Thu, Oct 22, 2015 at 09:46:12PM +0300, Vladimir Davydov wrote:
> On Thu, Oct 22, 2015 at 12:21:31AM -0400, Johannes Weiner wrote:
> > The tcp memory controller has extensive provisions for future memory
> > accounting interfaces that won't materialize after all. Cut the code
> > base down to what's actually used, now and in the likely future.
> > 
> > - There won't be any different protocol counters in the future, so a
> >   direct sock->sk_memcg linkage is enough. This eliminates a lot of
> >   callback maze and boilerplate code, and restores most of the socket
> >   allocation code to pre-tcp_memcontrol state.
> > 
> > - There won't be a tcp control soft limit, so integrating the memcg
> 
> In fact, the code is ready for the "soft" limit (I mean min, pressure,
> max tuple), it just lacks a knob.

Yeah, but that's not going to materialize if the entire interface for
dedicated tcp throttling is considered obsolete.

> > @@ -1136,9 +1090,6 @@ static inline bool sk_under_memory_pressure(const struct sock *sk)
> >  	if (!sk->sk_prot->memory_pressure)
> >  		return false;
> >  
> > -	if (mem_cgroup_sockets_enabled && sk->sk_cgrp)
> > -		return !!sk->sk_cgrp->memory_pressure;
> > -
> 
> AFAIU, now we won't shrink the window on hitting the limit, i.e. this
> patch subtly changes the behavior of the existing knobs, potentially
> breaking them.

Hm, but there is no grace period in which something meaningful could
happen with the window shrinking, is there? Any buffer allocation is
still going to fail hard.

I don't see how this would change anything in practice.


* Re: [PATCH 1/8] mm: page_counter: let page_counter_try_charge() return bool
  2015-10-22  4:21   ` Johannes Weiner
@ 2015-10-23 11:31     ` Michal Hocko
  -1 siblings, 0 replies; 156+ messages in thread
From: Michal Hocko @ 2015-10-23 11:31 UTC (permalink / raw)
  To: Johannes Weiner
  Cc: David S. Miller, Andrew Morton, Vladimir Davydov, Tejun Heo,
	netdev, linux-mm, cgroups, linux-kernel

On Thu 22-10-15 00:21:29, Johannes Weiner wrote:
> page_counter_try_charge() currently returns 0 on success and -ENOMEM
> on failure, which is surprising behavior given the function name.
> 
> Make it follow the expected pattern of try_stuff() functions that
> return a boolean true to indicate success, or false for failure.
> 
> Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>

Acked-by: Michal Hocko <mhocko@suse.com>

> ---
>  include/linux/page_counter.h |  6 +++---
>  mm/hugetlb_cgroup.c          |  3 ++-
>  mm/memcontrol.c              | 11 +++++------
>  mm/page_counter.c            | 14 +++++++-------
>  4 files changed, 17 insertions(+), 17 deletions(-)
> 
> diff --git a/include/linux/page_counter.h b/include/linux/page_counter.h
> index 17fa4f8..7e62920 100644
> --- a/include/linux/page_counter.h
> +++ b/include/linux/page_counter.h
> @@ -36,9 +36,9 @@ static inline unsigned long page_counter_read(struct page_counter *counter)
>  
>  void page_counter_cancel(struct page_counter *counter, unsigned long nr_pages);
>  void page_counter_charge(struct page_counter *counter, unsigned long nr_pages);
> -int page_counter_try_charge(struct page_counter *counter,
> -			    unsigned long nr_pages,
> -			    struct page_counter **fail);
> +bool page_counter_try_charge(struct page_counter *counter,
> +			     unsigned long nr_pages,
> +			     struct page_counter **fail);
>  void page_counter_uncharge(struct page_counter *counter, unsigned long nr_pages);
>  int page_counter_limit(struct page_counter *counter, unsigned long limit);
>  int page_counter_memparse(const char *buf, const char *max,
> diff --git a/mm/hugetlb_cgroup.c b/mm/hugetlb_cgroup.c
> index 6a44263..d8fb10d 100644
> --- a/mm/hugetlb_cgroup.c
> +++ b/mm/hugetlb_cgroup.c
> @@ -186,7 +186,8 @@ again:
>  	}
>  	rcu_read_unlock();
>  
> -	ret = page_counter_try_charge(&h_cg->hugepage[idx], nr_pages, &counter);
> +	if (!page_counter_try_charge(&h_cg->hugepage[idx], nr_pages, &counter))
> +		ret = -ENOMEM;
>  	css_put(&h_cg->css);
>  done:
>  	*ptr = h_cg;
> diff --git a/mm/memcontrol.c b/mm/memcontrol.c
> index c71fe40..a8ccdbc 100644
> --- a/mm/memcontrol.c
> +++ b/mm/memcontrol.c
> @@ -2018,8 +2018,8 @@ retry:
>  		return 0;
>  
>  	if (!do_swap_account ||
> -	    !page_counter_try_charge(&memcg->memsw, batch, &counter)) {
> -		if (!page_counter_try_charge(&memcg->memory, batch, &counter))
> +	    page_counter_try_charge(&memcg->memsw, batch, &counter)) {
> +		if (page_counter_try_charge(&memcg->memory, batch, &counter))
>  			goto done_restock;
>  		if (do_swap_account)
>  			page_counter_uncharge(&memcg->memsw, batch);
> @@ -2383,14 +2383,13 @@ int __memcg_kmem_charge_memcg(struct page *page, gfp_t gfp, int order,
>  {
>  	unsigned int nr_pages = 1 << order;
>  	struct page_counter *counter;
> -	int ret = 0;
> +	int ret;
>  
>  	if (!memcg_kmem_is_active(memcg))
>  		return 0;
>  
> -	ret = page_counter_try_charge(&memcg->kmem, nr_pages, &counter);
> -	if (ret)
> -		return ret;
> +	if (!page_counter_try_charge(&memcg->kmem, nr_pages, &counter))
> +		return -ENOMEM;
>  
>  	ret = try_charge(memcg, gfp, nr_pages);
>  	if (ret) {
> diff --git a/mm/page_counter.c b/mm/page_counter.c
> index 11b4bed..7c6a63d 100644
> --- a/mm/page_counter.c
> +++ b/mm/page_counter.c
> @@ -56,12 +56,12 @@ void page_counter_charge(struct page_counter *counter, unsigned long nr_pages)
>   * @nr_pages: number of pages to charge
>   * @fail: points first counter to hit its limit, if any
>   *
> - * Returns 0 on success, or -ENOMEM and @fail if the counter or one of
> - * its ancestors has hit its configured limit.
> + * Returns %true on success, or %false and @fail if the counter or one
> + * of its ancestors has hit its configured limit.
>   */
> -int page_counter_try_charge(struct page_counter *counter,
> -			    unsigned long nr_pages,
> -			    struct page_counter **fail)
> +bool page_counter_try_charge(struct page_counter *counter,
> +			     unsigned long nr_pages,
> +			     struct page_counter **fail)
>  {
>  	struct page_counter *c;
>  
> @@ -99,13 +99,13 @@ int page_counter_try_charge(struct page_counter *counter,
>  		if (new > c->watermark)
>  			c->watermark = new;
>  	}
> -	return 0;
> +	return true;
>  
>  failed:
>  	for (c = counter; c != *fail; c = c->parent)
>  		page_counter_cancel(c, nr_pages);
>  
> -	return -ENOMEM;
> +	return false;
>  }
>  
>  /**
> -- 
> 2.6.1

-- 
Michal Hocko
SUSE Labs
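
[ A minimal sketch of the call-site pattern the new convention gives, as
  in the hugetlb and kmem hunks above; the variable names are illustrative: ]

	struct page_counter *fail;

	if (!page_counter_try_charge(&memcg->kmem, nr_pages, &fail))
		return -ENOMEM;	/* @fail points at the counter that hit its
				   limit; nothing remains charged on failure */

	/* ... and the matching undo on the success path: */
	page_counter_uncharge(&memcg->kmem, nr_pages);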

^ permalink raw reply	[flat|nested] 156+ messages in thread

* Re: [PATCH 2/8] mm: memcontrol: export root_mem_cgroup
  2015-10-22  4:21   ` Johannes Weiner
@ 2015-10-23 11:32     ` Michal Hocko
  -1 siblings, 0 replies; 156+ messages in thread
From: Michal Hocko @ 2015-10-23 11:32 UTC (permalink / raw)
  To: Johannes Weiner
  Cc: David S. Miller, Andrew Morton, Vladimir Davydov, Tejun Heo,
	netdev, linux-mm, cgroups, linux-kernel

On Thu 22-10-15 00:21:30, Johannes Weiner wrote:
> A later patch will need this symbol in files other than memcontrol.c,
> so export it now and replace mem_cgroup_root_css at the same time.
> 
> Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>

Acked-by: Michal Hocko <mhocko@suse.com>

> ---
>  include/linux/memcontrol.h | 3 ++-
>  mm/backing-dev.c           | 2 +-
>  mm/memcontrol.c            | 5 ++---
>  3 files changed, 5 insertions(+), 5 deletions(-)
> 
> diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
> index 805da1f..19ff87b 100644
> --- a/include/linux/memcontrol.h
> +++ b/include/linux/memcontrol.h
> @@ -275,7 +275,8 @@ struct mem_cgroup {
>  	struct mem_cgroup_per_node *nodeinfo[0];
>  	/* WARNING: nodeinfo must be the last member here */
>  };
> -extern struct cgroup_subsys_state *mem_cgroup_root_css;
> +
> +extern struct mem_cgroup *root_mem_cgroup;
>  
>  /**
>   * mem_cgroup_events - count memory events against a cgroup
> diff --git a/mm/backing-dev.c b/mm/backing-dev.c
> index 095b23b..73ab967 100644
> --- a/mm/backing-dev.c
> +++ b/mm/backing-dev.c
> @@ -702,7 +702,7 @@ static int cgwb_bdi_init(struct backing_dev_info *bdi)
>  
>  	ret = wb_init(&bdi->wb, bdi, 1, GFP_KERNEL);
>  	if (!ret) {
> -		bdi->wb.memcg_css = mem_cgroup_root_css;
> +		bdi->wb.memcg_css = &root_mem_cgroup->css;
>  		bdi->wb.blkcg_css = blkcg_root_css;
>  	}
>  	return ret;
> diff --git a/mm/memcontrol.c b/mm/memcontrol.c
> index a8ccdbc..e54f434 100644
> --- a/mm/memcontrol.c
> +++ b/mm/memcontrol.c
> @@ -76,9 +76,9 @@
>  struct cgroup_subsys memory_cgrp_subsys __read_mostly;
>  EXPORT_SYMBOL(memory_cgrp_subsys);
>  
> +struct mem_cgroup *root_mem_cgroup __read_mostly;
> +
>  #define MEM_CGROUP_RECLAIM_RETRIES	5
> -static struct mem_cgroup *root_mem_cgroup __read_mostly;
> -struct cgroup_subsys_state *mem_cgroup_root_css __read_mostly;
>  
>  /* Whether the swap controller is active */
>  #ifdef CONFIG_MEMCG_SWAP
> @@ -4213,7 +4213,6 @@ mem_cgroup_css_alloc(struct cgroup_subsys_state *parent_css)
>  	/* root ? */
>  	if (parent_css == NULL) {
>  		root_mem_cgroup = memcg;
> -		mem_cgroup_root_css = &memcg->css;
>  		page_counter_init(&memcg->memory, NULL);
>  		memcg->high = PAGE_COUNTER_MAX;
>  		memcg->soft_limit = PAGE_COUNTER_MAX;
> -- 
> 2.6.1

-- 
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 156+ messages in thread

* Re: [PATCH 3/8] net: consolidate memcg socket buffer tracking and accounting
  2015-10-22  4:21   ` Johannes Weiner
@ 2015-10-23 12:38     ` Michal Hocko
  -1 siblings, 0 replies; 156+ messages in thread
From: Michal Hocko @ 2015-10-23 12:38 UTC (permalink / raw)
  To: Johannes Weiner
  Cc: David S. Miller, Andrew Morton, Vladimir Davydov, Tejun Heo,
	netdev, linux-mm, cgroups, linux-kernel

On Thu 22-10-15 00:21:31, Johannes Weiner wrote:
> The tcp memory controller has extensive provisions for future memory
> accounting interfaces that won't materialize after all. Cut the code
> base down to what's actually used, now and in the likely future.
> 
> - There won't be any different protocol counters in the future, so a
>   direct sock->sk_memcg linkage is enough. This eliminates a lot of
>   callback maze and boilerplate code, and restores most of the socket
>   allocation code to pre-tcp_memcontrol state.
> 
> - There won't be a tcp control soft limit, so integrating the memcg
>   code into the global skmem limiting scheme complicates things
>   unnecessarily. Replace all that with simple and clear charge and
>   uncharge calls--hidden behind a jump label--to account skb memory.
> 
> - The previous jump label code was an elaborate state machine that
>   tracked the number of cgroups with an active socket limit in order
>   to enable the skmem tracking and accounting code only when actively
>   necessary. But this is overengineered: it was meant to protect the
>   people who never use this feature in the first place. Simply enable
>   the branches once when the first limit is set until the next reboot.
> 
> Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>

The changelog is certainly attractive. I have looked through the patch,
but my knowledge of the networking subsystem and its memory management
is close to zero, so I cannot really do a competent review.

Anyway, I support any simplification of the tcp kmem accounting. If the
networking people are OK with the changes, including the reduction of
functionality described by Vladimir, then I have no objections to this
being merged.

Thanks!
> ---
>  include/linux/memcontrol.h   |  64 ++++++++-----------
>  include/net/sock.h           | 135 +++------------------------------------
>  include/net/tcp.h            |   3 -
>  include/net/tcp_memcontrol.h |   7 ---
>  mm/memcontrol.c              | 101 +++++++++++++++--------------
>  net/core/sock.c              |  78 ++++++-----------------
>  net/ipv4/sysctl_net_ipv4.c   |   1 -
>  net/ipv4/tcp.c               |   3 +-
>  net/ipv4/tcp_ipv4.c          |   9 +--
>  net/ipv4/tcp_memcontrol.c    | 147 +++++++------------------------------------
>  net/ipv4/tcp_output.c        |   6 +-
>  net/ipv6/tcp_ipv6.c          |   3 -
>  12 files changed, 136 insertions(+), 421 deletions(-)
>  delete mode 100644 include/net/tcp_memcontrol.h
> 
> diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
> index 19ff87b..5b72f83 100644
> --- a/include/linux/memcontrol.h
> +++ b/include/linux/memcontrol.h
> @@ -85,34 +85,6 @@ enum mem_cgroup_events_target {
>  	MEM_CGROUP_NTARGETS,
>  };
>  
> -/*
> - * Bits in struct cg_proto.flags
> - */
> -enum cg_proto_flags {
> -	/* Currently active and new sockets should be assigned to cgroups */
> -	MEMCG_SOCK_ACTIVE,
> -	/* It was ever activated; we must disarm static keys on destruction */
> -	MEMCG_SOCK_ACTIVATED,
> -};
> -
> -struct cg_proto {
> -	struct page_counter	memory_allocated;	/* Current allocated memory. */
> -	struct percpu_counter	sockets_allocated;	/* Current number of sockets. */
> -	int			memory_pressure;
> -	long			sysctl_mem[3];
> -	unsigned long		flags;
> -	/*
> -	 * memcg field is used to find which memcg we belong directly
> -	 * Each memcg struct can hold more than one cg_proto, so container_of
> -	 * won't really cut.
> -	 *
> -	 * The elegant solution would be having an inverse function to
> -	 * proto_cgroup in struct proto, but that means polluting the structure
> -	 * for everybody, instead of just for memcg users.
> -	 */
> -	struct mem_cgroup	*memcg;
> -};
> -
>  #ifdef CONFIG_MEMCG
>  struct mem_cgroup_stat_cpu {
>  	long count[MEM_CGROUP_STAT_NSTATS];
> @@ -185,8 +157,15 @@ struct mem_cgroup {
>  
>  	/* Accounted resources */
>  	struct page_counter memory;
> +
> +	/*
> +	 * Legacy non-resource counters. In unified hierarchy, all
> +	 * memory is accounted and limited through memcg->memory.
> +	 * Consumer breakdown happens in the statistics.
> +	 */
>  	struct page_counter memsw;
>  	struct page_counter kmem;
> +	struct page_counter skmem;
>  
>  	/* Normal memory consumption range */
>  	unsigned long low;
> @@ -246,9 +225,6 @@ struct mem_cgroup {
>  	 */
>  	struct mem_cgroup_stat_cpu __percpu *stat;
>  
> -#if defined(CONFIG_MEMCG_KMEM) && defined(CONFIG_INET)
> -	struct cg_proto tcp_mem;
> -#endif
>  #if defined(CONFIG_MEMCG_KMEM)
>          /* Index in the kmem_cache->memcg_params.memcg_caches array */
>  	int kmemcg_id;
> @@ -676,12 +652,6 @@ void mem_cgroup_count_vm_event(struct mm_struct *mm, enum vm_event_item idx)
>  }
>  #endif /* CONFIG_MEMCG */
>  
> -enum {
> -	UNDER_LIMIT,
> -	SOFT_LIMIT,
> -	OVER_LIMIT,
> -};
> -
>  #ifdef CONFIG_CGROUP_WRITEBACK
>  
>  struct list_head *mem_cgroup_cgwb_list(struct mem_cgroup *memcg);
> @@ -707,15 +677,35 @@ static inline void mem_cgroup_wb_stats(struct bdi_writeback *wb,
>  
>  struct sock;
>  #if defined(CONFIG_INET) && defined(CONFIG_MEMCG_KMEM)
> +extern struct static_key_false mem_cgroup_sockets;
> +static inline bool mem_cgroup_do_sockets(void)
> +{
> +	return static_branch_unlikely(&mem_cgroup_sockets);
> +}
>  void sock_update_memcg(struct sock *sk);
>  void sock_release_memcg(struct sock *sk);
> +bool mem_cgroup_charge_skmem(struct mem_cgroup *memcg, unsigned int nr_pages);
> +void mem_cgroup_uncharge_skmem(struct mem_cgroup *memcg, unsigned int nr_pages);
>  #else
> +static inline bool mem_cgroup_do_sockets(void)
> +{
> +	return false;
> +}
>  static inline void sock_update_memcg(struct sock *sk)
>  {
>  }
>  static inline void sock_release_memcg(struct sock *sk)
>  {
>  }
> +static inline bool mem_cgroup_charge_skmem(struct mem_cgroup *memcg,
> +					   unsigned int nr_pages)
> +{
> +	return true;
> +}
> +static inline void mem_cgroup_uncharge_skmem(struct mem_cgroup *memcg,
> +					     unsigned int nr_pages)
> +{
> +}
>  #endif /* CONFIG_INET && CONFIG_MEMCG_KMEM */
>  
>  #ifdef CONFIG_MEMCG_KMEM
> diff --git a/include/net/sock.h b/include/net/sock.h
> index 59a7196..67795fc 100644
> --- a/include/net/sock.h
> +++ b/include/net/sock.h
> @@ -69,22 +69,6 @@
>  #include <net/tcp_states.h>
>  #include <linux/net_tstamp.h>
>  
> -struct cgroup;
> -struct cgroup_subsys;
> -#ifdef CONFIG_NET
> -int mem_cgroup_sockets_init(struct mem_cgroup *memcg, struct cgroup_subsys *ss);
> -void mem_cgroup_sockets_destroy(struct mem_cgroup *memcg);
> -#else
> -static inline
> -int mem_cgroup_sockets_init(struct mem_cgroup *memcg, struct cgroup_subsys *ss)
> -{
> -	return 0;
> -}
> -static inline
> -void mem_cgroup_sockets_destroy(struct mem_cgroup *memcg)
> -{
> -}
> -#endif
>  /*
>   * This structure really needs to be cleaned up.
>   * Most of it is for TCP, and not used by any of
> @@ -243,7 +227,6 @@ struct sock_common {
>  	/* public: */
>  };
>  
> -struct cg_proto;
>  /**
>    *	struct sock - network layer representation of sockets
>    *	@__sk_common: shared layout with inet_timewait_sock
> @@ -310,7 +293,7 @@ struct cg_proto;
>    *	@sk_security: used by security modules
>    *	@sk_mark: generic packet mark
>    *	@sk_classid: this socket's cgroup classid
> -  *	@sk_cgrp: this socket's cgroup-specific proto data
> +  *	@sk_memcg: this socket's memcg association
>    *	@sk_write_pending: a write to stream socket waits to start
>    *	@sk_state_change: callback to indicate change in the state of the sock
>    *	@sk_data_ready: callback to indicate there is data to be processed
> @@ -447,7 +430,7 @@ struct sock {
>  #ifdef CONFIG_CGROUP_NET_CLASSID
>  	u32			sk_classid;
>  #endif
> -	struct cg_proto		*sk_cgrp;
> +	struct mem_cgroup	*sk_memcg;
>  	void			(*sk_state_change)(struct sock *sk);
>  	void			(*sk_data_ready)(struct sock *sk);
>  	void			(*sk_write_space)(struct sock *sk);
> @@ -1051,18 +1034,6 @@ struct proto {
>  #ifdef SOCK_REFCNT_DEBUG
>  	atomic_t		socks;
>  #endif
> -#ifdef CONFIG_MEMCG_KMEM
> -	/*
> -	 * cgroup specific init/deinit functions. Called once for all
> -	 * protocols that implement it, from cgroups populate function.
> -	 * This function has to setup any files the protocol want to
> -	 * appear in the kmem cgroup filesystem.
> -	 */
> -	int			(*init_cgroup)(struct mem_cgroup *memcg,
> -					       struct cgroup_subsys *ss);
> -	void			(*destroy_cgroup)(struct mem_cgroup *memcg);
> -	struct cg_proto		*(*proto_cgroup)(struct mem_cgroup *memcg);
> -#endif
>  };
>  
>  int proto_register(struct proto *prot, int alloc_slab);
> @@ -1093,23 +1064,6 @@ static inline void sk_refcnt_debug_release(const struct sock *sk)
>  #define sk_refcnt_debug_release(sk) do { } while (0)
>  #endif /* SOCK_REFCNT_DEBUG */
>  
> -#if defined(CONFIG_MEMCG_KMEM) && defined(CONFIG_NET)
> -extern struct static_key memcg_socket_limit_enabled;
> -static inline struct cg_proto *parent_cg_proto(struct proto *proto,
> -					       struct cg_proto *cg_proto)
> -{
> -	return proto->proto_cgroup(parent_mem_cgroup(cg_proto->memcg));
> -}
> -#define mem_cgroup_sockets_enabled static_key_false(&memcg_socket_limit_enabled)
> -#else
> -#define mem_cgroup_sockets_enabled 0
> -static inline struct cg_proto *parent_cg_proto(struct proto *proto,
> -					       struct cg_proto *cg_proto)
> -{
> -	return NULL;
> -}
> -#endif
> -
>  static inline bool sk_stream_memory_free(const struct sock *sk)
>  {
>  	if (sk->sk_wmem_queued >= sk->sk_sndbuf)
> @@ -1136,9 +1090,6 @@ static inline bool sk_under_memory_pressure(const struct sock *sk)
>  	if (!sk->sk_prot->memory_pressure)
>  		return false;
>  
> -	if (mem_cgroup_sockets_enabled && sk->sk_cgrp)
> -		return !!sk->sk_cgrp->memory_pressure;
> -
>  	return !!*sk->sk_prot->memory_pressure;
>  }
>  
> @@ -1146,61 +1097,19 @@ static inline void sk_leave_memory_pressure(struct sock *sk)
>  {
>  	int *memory_pressure = sk->sk_prot->memory_pressure;
>  
> -	if (!memory_pressure)
> -		return;
> -
> -	if (*memory_pressure)
> +	if (memory_pressure && *memory_pressure)
>  		*memory_pressure = 0;
> -
> -	if (mem_cgroup_sockets_enabled && sk->sk_cgrp) {
> -		struct cg_proto *cg_proto = sk->sk_cgrp;
> -		struct proto *prot = sk->sk_prot;
> -
> -		for (; cg_proto; cg_proto = parent_cg_proto(prot, cg_proto))
> -			cg_proto->memory_pressure = 0;
> -	}
> -
>  }
>  
>  static inline void sk_enter_memory_pressure(struct sock *sk)
>  {
> -	if (!sk->sk_prot->enter_memory_pressure)
> -		return;
> -
> -	if (mem_cgroup_sockets_enabled && sk->sk_cgrp) {
> -		struct cg_proto *cg_proto = sk->sk_cgrp;
> -		struct proto *prot = sk->sk_prot;
> -
> -		for (; cg_proto; cg_proto = parent_cg_proto(prot, cg_proto))
> -			cg_proto->memory_pressure = 1;
> -	}
> -
> -	sk->sk_prot->enter_memory_pressure(sk);
> +	if (sk->sk_prot->enter_memory_pressure)
> +		sk->sk_prot->enter_memory_pressure(sk);
>  }
>  
>  static inline long sk_prot_mem_limits(const struct sock *sk, int index)
>  {
> -	long *prot = sk->sk_prot->sysctl_mem;
> -	if (mem_cgroup_sockets_enabled && sk->sk_cgrp)
> -		prot = sk->sk_cgrp->sysctl_mem;
> -	return prot[index];
> -}
> -
> -static inline void memcg_memory_allocated_add(struct cg_proto *prot,
> -					      unsigned long amt,
> -					      int *parent_status)
> -{
> -	page_counter_charge(&prot->memory_allocated, amt);
> -
> -	if (page_counter_read(&prot->memory_allocated) >
> -	    prot->memory_allocated.limit)
> -		*parent_status = OVER_LIMIT;
> -}
> -
> -static inline void memcg_memory_allocated_sub(struct cg_proto *prot,
> -					      unsigned long amt)
> -{
> -	page_counter_uncharge(&prot->memory_allocated, amt);
> +	return sk->sk_prot->sysctl_mem[index];
>  }
>  
>  static inline long
> @@ -1208,24 +1117,14 @@ sk_memory_allocated(const struct sock *sk)
>  {
>  	struct proto *prot = sk->sk_prot;
>  
> -	if (mem_cgroup_sockets_enabled && sk->sk_cgrp)
> -		return page_counter_read(&sk->sk_cgrp->memory_allocated);
> -
>  	return atomic_long_read(prot->memory_allocated);
>  }
>  
>  static inline long
> -sk_memory_allocated_add(struct sock *sk, int amt, int *parent_status)
> +sk_memory_allocated_add(struct sock *sk, int amt)
>  {
>  	struct proto *prot = sk->sk_prot;
>  
> -	if (mem_cgroup_sockets_enabled && sk->sk_cgrp) {
> -		memcg_memory_allocated_add(sk->sk_cgrp, amt, parent_status);
> -		/* update the root cgroup regardless */
> -		atomic_long_add_return(amt, prot->memory_allocated);
> -		return page_counter_read(&sk->sk_cgrp->memory_allocated);
> -	}
> -
>  	return atomic_long_add_return(amt, prot->memory_allocated);
>  }
>  
> @@ -1234,9 +1133,6 @@ sk_memory_allocated_sub(struct sock *sk, int amt)
>  {
>  	struct proto *prot = sk->sk_prot;
>  
> -	if (mem_cgroup_sockets_enabled && sk->sk_cgrp)
> -		memcg_memory_allocated_sub(sk->sk_cgrp, amt);
> -
>  	atomic_long_sub(amt, prot->memory_allocated);
>  }
>  
> @@ -1244,13 +1140,6 @@ static inline void sk_sockets_allocated_dec(struct sock *sk)
>  {
>  	struct proto *prot = sk->sk_prot;
>  
> -	if (mem_cgroup_sockets_enabled && sk->sk_cgrp) {
> -		struct cg_proto *cg_proto = sk->sk_cgrp;
> -
> -		for (; cg_proto; cg_proto = parent_cg_proto(prot, cg_proto))
> -			percpu_counter_dec(&cg_proto->sockets_allocated);
> -	}
> -
>  	percpu_counter_dec(prot->sockets_allocated);
>  }
>  
> @@ -1258,13 +1147,6 @@ static inline void sk_sockets_allocated_inc(struct sock *sk)
>  {
>  	struct proto *prot = sk->sk_prot;
>  
> -	if (mem_cgroup_sockets_enabled && sk->sk_cgrp) {
> -		struct cg_proto *cg_proto = sk->sk_cgrp;
> -
> -		for (; cg_proto; cg_proto = parent_cg_proto(prot, cg_proto))
> -			percpu_counter_inc(&cg_proto->sockets_allocated);
> -	}
> -
>  	percpu_counter_inc(prot->sockets_allocated);
>  }
>  
> @@ -1273,9 +1155,6 @@ sk_sockets_allocated_read_positive(struct sock *sk)
>  {
>  	struct proto *prot = sk->sk_prot;
>  
> -	if (mem_cgroup_sockets_enabled && sk->sk_cgrp)
> -		return percpu_counter_read_positive(&sk->sk_cgrp->sockets_allocated);
> -
>  	return percpu_counter_read_positive(prot->sockets_allocated);
>  }
>  
> diff --git a/include/net/tcp.h b/include/net/tcp.h
> index eed94fc..77b6c7e 100644
> --- a/include/net/tcp.h
> +++ b/include/net/tcp.h
> @@ -291,9 +291,6 @@ extern int tcp_memory_pressure;
>  /* optimized version of sk_under_memory_pressure() for TCP sockets */
>  static inline bool tcp_under_memory_pressure(const struct sock *sk)
>  {
> -	if (mem_cgroup_sockets_enabled && sk->sk_cgrp)
> -		return !!sk->sk_cgrp->memory_pressure;
> -
>  	return tcp_memory_pressure;
>  }
>  /*
> diff --git a/include/net/tcp_memcontrol.h b/include/net/tcp_memcontrol.h
> deleted file mode 100644
> index 05b94d9..0000000
> --- a/include/net/tcp_memcontrol.h
> +++ /dev/null
> @@ -1,7 +0,0 @@
> -#ifndef _TCP_MEMCG_H
> -#define _TCP_MEMCG_H
> -
> -struct cg_proto *tcp_proto_cgroup(struct mem_cgroup *memcg);
> -int tcp_init_cgroup(struct mem_cgroup *memcg, struct cgroup_subsys *ss);
> -void tcp_destroy_cgroup(struct mem_cgroup *memcg);
> -#endif /* _TCP_MEMCG_H */
> diff --git a/mm/memcontrol.c b/mm/memcontrol.c
> index e54f434..c41e6d7 100644
> --- a/mm/memcontrol.c
> +++ b/mm/memcontrol.c
> @@ -66,7 +66,6 @@
>  #include "internal.h"
>  #include <net/sock.h>
>  #include <net/ip.h>
> -#include <net/tcp_memcontrol.h>
>  #include "slab.h"
>  
>  #include <asm/uaccess.h>
> @@ -291,58 +290,68 @@ static inline struct mem_cgroup *mem_cgroup_from_id(unsigned short id)
>  /* Writing them here to avoid exposing memcg's inner layout */
>  #if defined(CONFIG_INET) && defined(CONFIG_MEMCG_KMEM)
>  
> +DEFINE_STATIC_KEY_FALSE(mem_cgroup_sockets);
> +
>  void sock_update_memcg(struct sock *sk)
>  {
> -	if (mem_cgroup_sockets_enabled) {
> -		struct mem_cgroup *memcg;
> -		struct cg_proto *cg_proto;
> -
> -		BUG_ON(!sk->sk_prot->proto_cgroup);
> -
> -		/* Socket cloning can throw us here with sk_cgrp already
> -		 * filled. It won't however, necessarily happen from
> -		 * process context. So the test for root memcg given
> -		 * the current task's memcg won't help us in this case.
> -		 *
> -		 * Respecting the original socket's memcg is a better
> -		 * decision in this case.
> -		 */
> -		if (sk->sk_cgrp) {
> -			BUG_ON(mem_cgroup_is_root(sk->sk_cgrp->memcg));
> -			css_get(&sk->sk_cgrp->memcg->css);
> -			return;
> -		}
> -
> -		rcu_read_lock();
> -		memcg = mem_cgroup_from_task(current);
> -		cg_proto = sk->sk_prot->proto_cgroup(memcg);
> -		if (cg_proto && test_bit(MEMCG_SOCK_ACTIVE, &cg_proto->flags) &&
> -		    css_tryget_online(&memcg->css)) {
> -			sk->sk_cgrp = cg_proto;
> -		}
> -		rcu_read_unlock();
> +	struct mem_cgroup *memcg;
> +	/*
> +	 * Socket cloning can throw us here with sk_cgrp already
> +	 * filled. It won't however, necessarily happen from
> +	 * process context. So the test for root memcg given
> +	 * the current task's memcg won't help us in this case.
> +	 *
> +	 * Respecting the original socket's memcg is a better
> +	 * decision in this case.
> +	 */
> +	if (sk->sk_memcg) {
> +		BUG_ON(mem_cgroup_is_root(sk->sk_memcg));
> +		css_get(&sk->sk_memcg->css);
> +		return;
>  	}
> +
> +	rcu_read_lock();
> +	memcg = mem_cgroup_from_task(current);
> +	if (css_tryget_online(&memcg->css))
> +		sk->sk_memcg = memcg;
> +	rcu_read_unlock();
>  }
>  EXPORT_SYMBOL(sock_update_memcg);
>  
>  void sock_release_memcg(struct sock *sk)
>  {
> -	if (mem_cgroup_sockets_enabled && sk->sk_cgrp) {
> -		struct mem_cgroup *memcg;
> -		WARN_ON(!sk->sk_cgrp->memcg);
> -		memcg = sk->sk_cgrp->memcg;
> -		css_put(&sk->sk_cgrp->memcg->css);
> -	}
> +	if (sk->sk_memcg)
> +		css_put(&sk->sk_memcg->css);
>  }
>  
> -struct cg_proto *tcp_proto_cgroup(struct mem_cgroup *memcg)
> +/**
> + * mem_cgroup_charge_skmem - charge socket memory
> + * @memcg: memcg to charge
> + * @nr_pages: number of pages to charge
> + *
> + * Charges @nr_pages to @memcg. Returns %true if the charge fit within
> + * the memcg's configured limit, %false if the charge had to be forced.
> + */
> +bool mem_cgroup_charge_skmem(struct mem_cgroup *memcg, unsigned int nr_pages)
>  {
> -	if (!memcg || mem_cgroup_is_root(memcg))
> -		return NULL;
> +	struct page_counter *counter;
> +
> +	if (page_counter_try_charge(&memcg->skmem, nr_pages, &counter))
> +		return true;
>  
> -	return &memcg->tcp_mem;
> +	page_counter_charge(&memcg->skmem, nr_pages);
> +	return false;
> +}
> +
> +/**
> + * mem_cgroup_uncharge_skmem - uncharge socket memory
> + * @memcg: memcg to uncharge
> + * @nr_pages: number of pages to uncharge
> + */
> +void mem_cgroup_uncharge_skmem(struct mem_cgroup *memcg, unsigned int nr_pages)
> +{
> +	page_counter_uncharge(&memcg->skmem, nr_pages);
>  }
> -EXPORT_SYMBOL(tcp_proto_cgroup);
>  
>  #endif
>  
> @@ -3592,13 +3601,7 @@ static int mem_cgroup_oom_control_write(struct cgroup_subsys_state *css,
>  #ifdef CONFIG_MEMCG_KMEM
>  static int memcg_init_kmem(struct mem_cgroup *memcg, struct cgroup_subsys *ss)
>  {
> -	int ret;
> -
> -	ret = memcg_propagate_kmem(memcg);
> -	if (ret)
> -		return ret;
> -
> -	return mem_cgroup_sockets_init(memcg, ss);
> +	return memcg_propagate_kmem(memcg);
>  }
>  
>  static void memcg_deactivate_kmem(struct mem_cgroup *memcg)
> @@ -3654,7 +3657,6 @@ static void memcg_destroy_kmem(struct mem_cgroup *memcg)
>  		static_key_slow_dec(&memcg_kmem_enabled_key);
>  		WARN_ON(page_counter_read(&memcg->kmem));
>  	}
> -	mem_cgroup_sockets_destroy(memcg);
>  }
>  #else
>  static int memcg_init_kmem(struct mem_cgroup *memcg, struct cgroup_subsys *ss)
> @@ -4218,6 +4220,7 @@ mem_cgroup_css_alloc(struct cgroup_subsys_state *parent_css)
>  		memcg->soft_limit = PAGE_COUNTER_MAX;
>  		page_counter_init(&memcg->memsw, NULL);
>  		page_counter_init(&memcg->kmem, NULL);
> +		page_counter_init(&memcg->skmem, NULL);
>  	}
>  
>  	memcg->last_scanned_node = MAX_NUMNODES;
> @@ -4266,6 +4269,7 @@ mem_cgroup_css_online(struct cgroup_subsys_state *css)
>  		memcg->soft_limit = PAGE_COUNTER_MAX;
>  		page_counter_init(&memcg->memsw, &parent->memsw);
>  		page_counter_init(&memcg->kmem, &parent->kmem);
> +		page_counter_init(&memcg->skmem, &parent->skmem);
>  
>  		/*
>  		 * No need to take a reference to the parent because cgroup
> @@ -4277,6 +4281,7 @@ mem_cgroup_css_online(struct cgroup_subsys_state *css)
>  		memcg->soft_limit = PAGE_COUNTER_MAX;
>  		page_counter_init(&memcg->memsw, NULL);
>  		page_counter_init(&memcg->kmem, NULL);
> +		page_counter_init(&memcg->skmem, NULL);
>  		/*
>  		 * Deeper hierachy with use_hierarchy == false doesn't make
>  		 * much sense so let cgroup subsystem know about this
> diff --git a/net/core/sock.c b/net/core/sock.c
> index 0fafd27..0debff5 100644
> --- a/net/core/sock.c
> +++ b/net/core/sock.c
> @@ -194,44 +194,6 @@ bool sk_net_capable(const struct sock *sk, int cap)
>  }
>  EXPORT_SYMBOL(sk_net_capable);
>  
> -
> -#ifdef CONFIG_MEMCG_KMEM
> -int mem_cgroup_sockets_init(struct mem_cgroup *memcg, struct cgroup_subsys *ss)
> -{
> -	struct proto *proto;
> -	int ret = 0;
> -
> -	mutex_lock(&proto_list_mutex);
> -	list_for_each_entry(proto, &proto_list, node) {
> -		if (proto->init_cgroup) {
> -			ret = proto->init_cgroup(memcg, ss);
> -			if (ret)
> -				goto out;
> -		}
> -	}
> -
> -	mutex_unlock(&proto_list_mutex);
> -	return ret;
> -out:
> -	list_for_each_entry_continue_reverse(proto, &proto_list, node)
> -		if (proto->destroy_cgroup)
> -			proto->destroy_cgroup(memcg);
> -	mutex_unlock(&proto_list_mutex);
> -	return ret;
> -}
> -
> -void mem_cgroup_sockets_destroy(struct mem_cgroup *memcg)
> -{
> -	struct proto *proto;
> -
> -	mutex_lock(&proto_list_mutex);
> -	list_for_each_entry_reverse(proto, &proto_list, node)
> -		if (proto->destroy_cgroup)
> -			proto->destroy_cgroup(memcg);
> -	mutex_unlock(&proto_list_mutex);
> -}
> -#endif
> -
>  /*
>   * Each address family might have different locking rules, so we have
>   * one slock key per address family:
> @@ -239,11 +201,6 @@ void mem_cgroup_sockets_destroy(struct mem_cgroup *memcg)
>  static struct lock_class_key af_family_keys[AF_MAX];
>  static struct lock_class_key af_family_slock_keys[AF_MAX];
>  
> -#if defined(CONFIG_MEMCG_KMEM)
> -struct static_key memcg_socket_limit_enabled;
> -EXPORT_SYMBOL(memcg_socket_limit_enabled);
> -#endif
> -
>  /*
>   * Make lock validator output more readable. (we pre-construct these
>   * strings build-time, so that runtime initialization of socket
> @@ -1476,12 +1433,6 @@ void sk_free(struct sock *sk)
>  }
>  EXPORT_SYMBOL(sk_free);
>  
> -static void sk_update_clone(const struct sock *sk, struct sock *newsk)
> -{
> -	if (mem_cgroup_sockets_enabled && sk->sk_cgrp)
> -		sock_update_memcg(newsk);
> -}
> -
>  /**
>   *	sk_clone_lock - clone a socket, and lock its clone
>   *	@sk: the socket to clone
> @@ -1577,7 +1528,8 @@ struct sock *sk_clone_lock(const struct sock *sk, const gfp_t priority)
>  		sk_set_socket(newsk, NULL);
>  		newsk->sk_wq = NULL;
>  
> -		sk_update_clone(sk, newsk);
> +		if (mem_cgroup_do_sockets())
> +			sock_update_memcg(newsk);
>  
>  		if (newsk->sk_prot->sockets_allocated)
>  			sk_sockets_allocated_inc(newsk);
> @@ -2036,27 +1988,27 @@ int __sk_mem_schedule(struct sock *sk, int size, int kind)
>  	struct proto *prot = sk->sk_prot;
>  	int amt = sk_mem_pages(size);
>  	long allocated;
> -	int parent_status = UNDER_LIMIT;
>  
>  	sk->sk_forward_alloc += amt * SK_MEM_QUANTUM;
>  
> -	allocated = sk_memory_allocated_add(sk, amt, &parent_status);
> +	allocated = sk_memory_allocated_add(sk, amt);
> +
> +	if (mem_cgroup_do_sockets() && sk->sk_memcg &&
> +	    !mem_cgroup_charge_skmem(sk->sk_memcg, amt))
> +		goto suppress_allocation;
>  
>  	/* Under limit. */
> -	if (parent_status == UNDER_LIMIT &&
> -			allocated <= sk_prot_mem_limits(sk, 0)) {
> +	if (allocated <= sk_prot_mem_limits(sk, 0)) {
>  		sk_leave_memory_pressure(sk);
>  		return 1;
>  	}
>  
> -	/* Under pressure. (we or our parents) */
> -	if ((parent_status > SOFT_LIMIT) ||
> -			allocated > sk_prot_mem_limits(sk, 1))
> +	/* Under pressure. */
> +	if (allocated > sk_prot_mem_limits(sk, 1))
>  		sk_enter_memory_pressure(sk);
>  
> -	/* Over hard limit (we or our parents) */
> -	if ((parent_status == OVER_LIMIT) ||
> -			(allocated > sk_prot_mem_limits(sk, 2)))
> +	/* Over hard limit. */
> +	if (allocated > sk_prot_mem_limits(sk, 2))
>  		goto suppress_allocation;
>  
>  	/* guarantee minimum buffer size under pressure */
> @@ -2105,6 +2057,9 @@ suppress_allocation:
>  
>  	sk_memory_allocated_sub(sk, amt);
>  
> +	if (mem_cgroup_do_sockets() && sk->sk_memcg)
> +		mem_cgroup_uncharge_skmem(sk->sk_memcg, amt);
> +
>  	return 0;
>  }
>  EXPORT_SYMBOL(__sk_mem_schedule);
> @@ -2120,6 +2075,9 @@ void __sk_mem_reclaim(struct sock *sk, int amount)
>  	sk_memory_allocated_sub(sk, amount);
>  	sk->sk_forward_alloc -= amount << SK_MEM_QUANTUM_SHIFT;
>  
> +	if (mem_cgroup_do_sockets() && sk->sk_memcg)
> +		mem_cgroup_uncharge_skmem(sk->sk_memcg, amount);
> +
>  	if (sk_under_memory_pressure(sk) &&
>  	    (sk_memory_allocated(sk) < sk_prot_mem_limits(sk, 0)))
>  		sk_leave_memory_pressure(sk);
> diff --git a/net/ipv4/sysctl_net_ipv4.c b/net/ipv4/sysctl_net_ipv4.c
> index 894da3a..1f00819 100644
> --- a/net/ipv4/sysctl_net_ipv4.c
> +++ b/net/ipv4/sysctl_net_ipv4.c
> @@ -24,7 +24,6 @@
>  #include <net/cipso_ipv4.h>
>  #include <net/inet_frag.h>
>  #include <net/ping.h>
> -#include <net/tcp_memcontrol.h>
>  
>  static int zero;
>  static int one = 1;
> diff --git a/net/ipv4/tcp.c b/net/ipv4/tcp.c
> index ac1bdbb..ec931c0 100644
> --- a/net/ipv4/tcp.c
> +++ b/net/ipv4/tcp.c
> @@ -421,7 +421,8 @@ void tcp_init_sock(struct sock *sk)
>  	sk->sk_rcvbuf = sysctl_tcp_rmem[1];
>  
>  	local_bh_disable();
> -	sock_update_memcg(sk);
> +	if (mem_cgroup_do_sockets())
> +		sock_update_memcg(sk);
>  	sk_sockets_allocated_inc(sk);
>  	local_bh_enable();
>  }
> diff --git a/net/ipv4/tcp_ipv4.c b/net/ipv4/tcp_ipv4.c
> index 30dd45c..bb5f4f2 100644
> --- a/net/ipv4/tcp_ipv4.c
> +++ b/net/ipv4/tcp_ipv4.c
> @@ -73,7 +73,6 @@
>  #include <net/timewait_sock.h>
>  #include <net/xfrm.h>
>  #include <net/secure_seq.h>
> -#include <net/tcp_memcontrol.h>
>  #include <net/busy_poll.h>
>  
>  #include <linux/inet.h>
> @@ -1808,7 +1807,8 @@ void tcp_v4_destroy_sock(struct sock *sk)
>  	tcp_saved_syn_free(tp);
>  
>  	sk_sockets_allocated_dec(sk);
> -	sock_release_memcg(sk);
> +	if (mem_cgroup_do_sockets())
> +		sock_release_memcg(sk);
>  }
>  EXPORT_SYMBOL(tcp_v4_destroy_sock);
>  
> @@ -2330,11 +2330,6 @@ struct proto tcp_prot = {
>  	.compat_setsockopt	= compat_tcp_setsockopt,
>  	.compat_getsockopt	= compat_tcp_getsockopt,
>  #endif
> -#ifdef CONFIG_MEMCG_KMEM
> -	.init_cgroup		= tcp_init_cgroup,
> -	.destroy_cgroup		= tcp_destroy_cgroup,
> -	.proto_cgroup		= tcp_proto_cgroup,
> -#endif
>  };
>  EXPORT_SYMBOL(tcp_prot);
>  
> diff --git a/net/ipv4/tcp_memcontrol.c b/net/ipv4/tcp_memcontrol.c
> index 2379c1b..09a37eb 100644
> --- a/net/ipv4/tcp_memcontrol.c
> +++ b/net/ipv4/tcp_memcontrol.c
> @@ -1,107 +1,10 @@
> -#include <net/tcp.h>
> -#include <net/tcp_memcontrol.h>
> -#include <net/sock.h>
> -#include <net/ip.h>
> -#include <linux/nsproxy.h>
> +#include <linux/page_counter.h>
>  #include <linux/memcontrol.h>
> +#include <linux/cgroup.h>
>  #include <linux/module.h>
> -
> -int tcp_init_cgroup(struct mem_cgroup *memcg, struct cgroup_subsys *ss)
> -{
> -	/*
> -	 * The root cgroup does not use page_counters, but rather,
> -	 * rely on the data already collected by the network
> -	 * subsystem
> -	 */
> -	struct mem_cgroup *parent = parent_mem_cgroup(memcg);
> -	struct page_counter *counter_parent = NULL;
> -	struct cg_proto *cg_proto, *parent_cg;
> -
> -	cg_proto = tcp_prot.proto_cgroup(memcg);
> -	if (!cg_proto)
> -		return 0;
> -
> -	cg_proto->sysctl_mem[0] = sysctl_tcp_mem[0];
> -	cg_proto->sysctl_mem[1] = sysctl_tcp_mem[1];
> -	cg_proto->sysctl_mem[2] = sysctl_tcp_mem[2];
> -	cg_proto->memory_pressure = 0;
> -	cg_proto->memcg = memcg;
> -
> -	parent_cg = tcp_prot.proto_cgroup(parent);
> -	if (parent_cg)
> -		counter_parent = &parent_cg->memory_allocated;
> -
> -	page_counter_init(&cg_proto->memory_allocated, counter_parent);
> -	percpu_counter_init(&cg_proto->sockets_allocated, 0, GFP_KERNEL);
> -
> -	return 0;
> -}
> -EXPORT_SYMBOL(tcp_init_cgroup);
> -
> -void tcp_destroy_cgroup(struct mem_cgroup *memcg)
> -{
> -	struct cg_proto *cg_proto;
> -
> -	cg_proto = tcp_prot.proto_cgroup(memcg);
> -	if (!cg_proto)
> -		return;
> -
> -	percpu_counter_destroy(&cg_proto->sockets_allocated);
> -
> -	if (test_bit(MEMCG_SOCK_ACTIVATED, &cg_proto->flags))
> -		static_key_slow_dec(&memcg_socket_limit_enabled);
> -
> -}
> -EXPORT_SYMBOL(tcp_destroy_cgroup);
> -
> -static int tcp_update_limit(struct mem_cgroup *memcg, unsigned long nr_pages)
> -{
> -	struct cg_proto *cg_proto;
> -	int i;
> -	int ret;
> -
> -	cg_proto = tcp_prot.proto_cgroup(memcg);
> -	if (!cg_proto)
> -		return -EINVAL;
> -
> -	ret = page_counter_limit(&cg_proto->memory_allocated, nr_pages);
> -	if (ret)
> -		return ret;
> -
> -	for (i = 0; i < 3; i++)
> -		cg_proto->sysctl_mem[i] = min_t(long, nr_pages,
> -						sysctl_tcp_mem[i]);
> -
> -	if (nr_pages == PAGE_COUNTER_MAX)
> -		clear_bit(MEMCG_SOCK_ACTIVE, &cg_proto->flags);
> -	else {
> -		/*
> -		 * The active bit needs to be written after the static_key
> -		 * update. This is what guarantees that the socket activation
> -		 * function is the last one to run. See sock_update_memcg() for
> -		 * details, and note that we don't mark any socket as belonging
> -		 * to this memcg until that flag is up.
> -		 *
> -		 * We need to do this, because static_keys will span multiple
> -		 * sites, but we can't control their order. If we mark a socket
> -		 * as accounted, but the accounting functions are not patched in
> -		 * yet, we'll lose accounting.
> -		 *
> -		 * We never race with the readers in sock_update_memcg(),
> -		 * because when this value change, the code to process it is not
> -		 * patched in yet.
> -		 *
> -		 * The activated bit is used to guarantee that no two writers
> -		 * will do the update in the same memcg. Without that, we can't
> -		 * properly shutdown the static key.
> -		 */
> -		if (!test_and_set_bit(MEMCG_SOCK_ACTIVATED, &cg_proto->flags))
> -			static_key_slow_inc(&memcg_socket_limit_enabled);
> -		set_bit(MEMCG_SOCK_ACTIVE, &cg_proto->flags);
> -	}
> -
> -	return 0;
> -}
> +#include <linux/kernfs.h>
> +#include <linux/mutex.h>
> +#include <net/tcp.h>
>  
>  enum {
>  	RES_USAGE,
> @@ -124,11 +27,17 @@ static ssize_t tcp_cgroup_write(struct kernfs_open_file *of,
>  	switch (of_cft(of)->private) {
>  	case RES_LIMIT:
>  		/* see memcontrol.c */
> +		if (memcg == root_mem_cgroup) {
> +			ret = -EINVAL;
> +			break;
> +		}
>  		ret = page_counter_memparse(buf, "-1", &nr_pages);
>  		if (ret)
>  			break;
>  		mutex_lock(&tcp_limit_mutex);
> -		ret = tcp_update_limit(memcg, nr_pages);
> +		ret = page_counter_limit(&memcg->skmem, nr_pages);
> +		if (!ret)
> +			static_branch_enable(&mem_cgroup_sockets);
>  		mutex_unlock(&tcp_limit_mutex);
>  		break;
>  	default:
> @@ -141,32 +50,28 @@ static ssize_t tcp_cgroup_write(struct kernfs_open_file *of,
>  static u64 tcp_cgroup_read(struct cgroup_subsys_state *css, struct cftype *cft)
>  {
>  	struct mem_cgroup *memcg = mem_cgroup_from_css(css);
> -	struct cg_proto *cg_proto = tcp_prot.proto_cgroup(memcg);
>  	u64 val;
>  
>  	switch (cft->private) {
>  	case RES_LIMIT:
> -		if (!cg_proto)
> -			return PAGE_COUNTER_MAX;
> -		val = cg_proto->memory_allocated.limit;
> +		val = memcg->skmem.limit;
>  		val *= PAGE_SIZE;
>  		break;
>  	case RES_USAGE:
> -		if (!cg_proto)
> +		if (memcg == root_mem_cgroup)
>  			val = atomic_long_read(&tcp_memory_allocated);
>  		else
> -			val = page_counter_read(&cg_proto->memory_allocated);
> +			val = page_counter_read(&memcg->skmem);
>  		val *= PAGE_SIZE;
>  		break;
>  	case RES_FAILCNT:
> -		if (!cg_proto)
> -			return 0;
> -		val = cg_proto->memory_allocated.failcnt;
> +		val = memcg->skmem.failcnt;
>  		break;
>  	case RES_MAX_USAGE:
> -		if (!cg_proto)
> -			return 0;
> -		val = cg_proto->memory_allocated.watermark;
> +		if (memcg == root_mem_cgroup)
> +			val = 0;
> +		else
> +			val = memcg->skmem.watermark;
>  		val *= PAGE_SIZE;
>  		break;
>  	default:
> @@ -178,20 +83,14 @@ static u64 tcp_cgroup_read(struct cgroup_subsys_state *css, struct cftype *cft)
>  static ssize_t tcp_cgroup_reset(struct kernfs_open_file *of,
>  				char *buf, size_t nbytes, loff_t off)
>  {
> -	struct mem_cgroup *memcg;
> -	struct cg_proto *cg_proto;
> -
> -	memcg = mem_cgroup_from_css(of_css(of));
> -	cg_proto = tcp_prot.proto_cgroup(memcg);
> -	if (!cg_proto)
> -		return nbytes;
> +	struct mem_cgroup *memcg = mem_cgroup_from_css(of_css(of));
>  
>  	switch (of_cft(of)->private) {
>  	case RES_MAX_USAGE:
> -		page_counter_reset_watermark(&cg_proto->memory_allocated);
> +		page_counter_reset_watermark(&memcg->skmem);
>  		break;
>  	case RES_FAILCNT:
> -		cg_proto->memory_allocated.failcnt = 0;
> +		memcg->skmem.failcnt = 0;
>  		break;
>  	}
>  
> diff --git a/net/ipv4/tcp_output.c b/net/ipv4/tcp_output.c
> index 19adedb..b496fc9 100644
> --- a/net/ipv4/tcp_output.c
> +++ b/net/ipv4/tcp_output.c
> @@ -2819,13 +2819,15 @@ begin_fwd:
>   */
>  void sk_forced_mem_schedule(struct sock *sk, int size)
>  {
> -	int amt, status;
> +	int amt;
>  
>  	if (size <= sk->sk_forward_alloc)
>  		return;
>  	amt = sk_mem_pages(size);
>  	sk->sk_forward_alloc += amt * SK_MEM_QUANTUM;
> -	sk_memory_allocated_add(sk, amt, &status);
> +	sk_memory_allocated_add(sk, amt);
> +	if (mem_cgroup_do_sockets() && sk->sk_memcg)
> +		mem_cgroup_charge_skmem(sk->sk_memcg, amt);
>  }
>  
>  /* Send a FIN. The caller locks the socket for us.
> diff --git a/net/ipv6/tcp_ipv6.c b/net/ipv6/tcp_ipv6.c
> index f495d18..cf19e65 100644
> --- a/net/ipv6/tcp_ipv6.c
> +++ b/net/ipv6/tcp_ipv6.c
> @@ -1862,9 +1862,6 @@ struct proto tcpv6_prot = {
>  	.compat_setsockopt	= compat_tcp_setsockopt,
>  	.compat_getsockopt	= compat_tcp_getsockopt,
>  #endif
> -#ifdef CONFIG_MEMCG_KMEM
> -	.proto_cgroup		= tcp_proto_cgroup,
> -#endif
>  	.clear_sk		= tcp_v6_clear_sk,
>  };
>  
> -- 
> 2.6.1

-- 
Michal Hocko
SUSE Labs
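
[ As a side note on the jump-label simplification, the three pieces from
  the quoted patch fit together like this (condensed, names as in the
  patch); the key is enabled once and never disarmed, which is what makes
  the old ACTIVE/ACTIVATED state machine unnecessary: ]

/* mm/memcontrol.c: the key starts off, so the checks are patched out */
DEFINE_STATIC_KEY_FALSE(mem_cgroup_sockets);

/* include/linux/memcontrol.h: cheap test used in the socket hot paths */
static inline bool mem_cgroup_do_sockets(void)
{
	return static_branch_unlikely(&mem_cgroup_sockets);
}

/* net/ipv4/tcp_memcontrol.c: flipped on when the first limit is written */
static_branch_enable(&mem_cgroup_sockets);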

^ permalink raw reply	[flat|nested] 156+ messages in thread

* Re: [PATCH 3/8] net: consolidate memcg socket buffer tracking and accounting
@ 2015-10-23 12:38     ` Michal Hocko
  0 siblings, 0 replies; 156+ messages in thread
From: Michal Hocko @ 2015-10-23 12:38 UTC (permalink / raw)
  To: Johannes Weiner
  Cc: David S. Miller, Andrew Morton, Vladimir Davydov, Tejun Heo,
	netdev, linux-mm, cgroups, linux-kernel

On Thu 22-10-15 00:21:31, Johannes Weiner wrote:
> The tcp memory controller has extensive provisions for future memory
> accounting interfaces that won't materialize after all. Cut the code
> base down to what's actually used, now and in the likely future.
> 
> - There won't be any different protocol counters in the future, so a
>   direct sock->sk_memcg linkage is enough. This eliminates a lot of
>   callback maze and boilerplate code, and restores most of the socket
>   allocation code to pre-tcp_memcontrol state.
> 
> - There won't be a tcp control soft limit, so integrating the memcg
>   code into the global skmem limiting scheme complicates things
>   unnecessarily. Replace all that with simple and clear charge and
>   uncharge calls--hidden behind a jump label--to account skb memory.
> 
> - The previous jump label code was an elaborate state machine that
>   tracked the number of cgroups with an active socket limit in order
>   to enable the skmem tracking and accounting code only when actively
>   necessary. But this is overengineered: it was meant to protect the
>   people who never use this feature in the first place. Simply enable
>   the branches once when the first limit is set until the next reboot.
> 
> Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>

The changelog is certainly attractive. I have looked through the patch,
but my knowledge of the networking subsystem and its memory management
is close to zero, so I cannot really do a competent review.

Anyway, I support any simplification of the tcp kmem accounting. If the
networking people are OK with the changes, including the reduction of
functionality described by Vladimir, then I have no objections to this
being merged.

Thanks!
> ---
>  include/linux/memcontrol.h   |  64 ++++++++-----------
>  include/net/sock.h           | 135 +++------------------------------------
>  include/net/tcp.h            |   3 -
>  include/net/tcp_memcontrol.h |   7 ---
>  mm/memcontrol.c              | 101 +++++++++++++++--------------
>  net/core/sock.c              |  78 ++++++-----------------
>  net/ipv4/sysctl_net_ipv4.c   |   1 -
>  net/ipv4/tcp.c               |   3 +-
>  net/ipv4/tcp_ipv4.c          |   9 +--
>  net/ipv4/tcp_memcontrol.c    | 147 +++++++------------------------------------
>  net/ipv4/tcp_output.c        |   6 +-
>  net/ipv6/tcp_ipv6.c          |   3 -
>  12 files changed, 136 insertions(+), 421 deletions(-)
>  delete mode 100644 include/net/tcp_memcontrol.h
> 
> diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
> index 19ff87b..5b72f83 100644
> --- a/include/linux/memcontrol.h
> +++ b/include/linux/memcontrol.h
> @@ -85,34 +85,6 @@ enum mem_cgroup_events_target {
>  	MEM_CGROUP_NTARGETS,
>  };
>  
> -/*
> - * Bits in struct cg_proto.flags
> - */
> -enum cg_proto_flags {
> -	/* Currently active and new sockets should be assigned to cgroups */
> -	MEMCG_SOCK_ACTIVE,
> -	/* It was ever activated; we must disarm static keys on destruction */
> -	MEMCG_SOCK_ACTIVATED,
> -};
> -
> -struct cg_proto {
> -	struct page_counter	memory_allocated;	/* Current allocated memory. */
> -	struct percpu_counter	sockets_allocated;	/* Current number of sockets. */
> -	int			memory_pressure;
> -	long			sysctl_mem[3];
> -	unsigned long		flags;
> -	/*
> -	 * memcg field is used to find which memcg we belong directly
> -	 * Each memcg struct can hold more than one cg_proto, so container_of
> -	 * won't really cut.
> -	 *
> -	 * The elegant solution would be having an inverse function to
> -	 * proto_cgroup in struct proto, but that means polluting the structure
> -	 * for everybody, instead of just for memcg users.
> -	 */
> -	struct mem_cgroup	*memcg;
> -};
> -
>  #ifdef CONFIG_MEMCG
>  struct mem_cgroup_stat_cpu {
>  	long count[MEM_CGROUP_STAT_NSTATS];
> @@ -185,8 +157,15 @@ struct mem_cgroup {
>  
>  	/* Accounted resources */
>  	struct page_counter memory;
> +
> +	/*
> +	 * Legacy non-resource counters. In unified hierarchy, all
> +	 * memory is accounted and limited through memcg->memory.
> +	 * Consumer breakdown happens in the statistics.
> +	 */
>  	struct page_counter memsw;
>  	struct page_counter kmem;
> +	struct page_counter skmem;
>  
>  	/* Normal memory consumption range */
>  	unsigned long low;
> @@ -246,9 +225,6 @@ struct mem_cgroup {
>  	 */
>  	struct mem_cgroup_stat_cpu __percpu *stat;
>  
> -#if defined(CONFIG_MEMCG_KMEM) && defined(CONFIG_INET)
> -	struct cg_proto tcp_mem;
> -#endif
>  #if defined(CONFIG_MEMCG_KMEM)
>          /* Index in the kmem_cache->memcg_params.memcg_caches array */
>  	int kmemcg_id;
> @@ -676,12 +652,6 @@ void mem_cgroup_count_vm_event(struct mm_struct *mm, enum vm_event_item idx)
>  }
>  #endif /* CONFIG_MEMCG */
>  
> -enum {
> -	UNDER_LIMIT,
> -	SOFT_LIMIT,
> -	OVER_LIMIT,
> -};
> -
>  #ifdef CONFIG_CGROUP_WRITEBACK
>  
>  struct list_head *mem_cgroup_cgwb_list(struct mem_cgroup *memcg);
> @@ -707,15 +677,35 @@ static inline void mem_cgroup_wb_stats(struct bdi_writeback *wb,
>  
>  struct sock;
>  #if defined(CONFIG_INET) && defined(CONFIG_MEMCG_KMEM)
> +extern struct static_key_false mem_cgroup_sockets;
> +static inline bool mem_cgroup_do_sockets(void)
> +{
> +	return static_branch_unlikely(&mem_cgroup_sockets);
> +}
>  void sock_update_memcg(struct sock *sk);
>  void sock_release_memcg(struct sock *sk);
> +bool mem_cgroup_charge_skmem(struct mem_cgroup *memcg, unsigned int nr_pages);
> +void mem_cgroup_uncharge_skmem(struct mem_cgroup *memcg, unsigned int nr_pages);
>  #else
> +static inline bool mem_cgroup_do_sockets(void)
> +{
> +	return false;
> +}
>  static inline void sock_update_memcg(struct sock *sk)
>  {
>  }
>  static inline void sock_release_memcg(struct sock *sk)
>  {
>  }
> +static inline bool mem_cgroup_charge_skmem(struct mem_cgroup *memcg,
> +					   unsigned int nr_pages)
> +{
> +	return true;
> +}
> +static inline void mem_cgroup_uncharge_skmem(struct mem_cgroup *memcg,
> +					     unsigned int nr_pages)
> +{
> +}
>  #endif /* CONFIG_INET && CONFIG_MEMCG_KMEM */
>  
>  #ifdef CONFIG_MEMCG_KMEM
> diff --git a/include/net/sock.h b/include/net/sock.h
> index 59a7196..67795fc 100644
> --- a/include/net/sock.h
> +++ b/include/net/sock.h
> @@ -69,22 +69,6 @@
>  #include <net/tcp_states.h>
>  #include <linux/net_tstamp.h>
>  
> -struct cgroup;
> -struct cgroup_subsys;
> -#ifdef CONFIG_NET
> -int mem_cgroup_sockets_init(struct mem_cgroup *memcg, struct cgroup_subsys *ss);
> -void mem_cgroup_sockets_destroy(struct mem_cgroup *memcg);
> -#else
> -static inline
> -int mem_cgroup_sockets_init(struct mem_cgroup *memcg, struct cgroup_subsys *ss)
> -{
> -	return 0;
> -}
> -static inline
> -void mem_cgroup_sockets_destroy(struct mem_cgroup *memcg)
> -{
> -}
> -#endif
>  /*
>   * This structure really needs to be cleaned up.
>   * Most of it is for TCP, and not used by any of
> @@ -243,7 +227,6 @@ struct sock_common {
>  	/* public: */
>  };
>  
> -struct cg_proto;
>  /**
>    *	struct sock - network layer representation of sockets
>    *	@__sk_common: shared layout with inet_timewait_sock
> @@ -310,7 +293,7 @@ struct cg_proto;
>    *	@sk_security: used by security modules
>    *	@sk_mark: generic packet mark
>    *	@sk_classid: this socket's cgroup classid
> -  *	@sk_cgrp: this socket's cgroup-specific proto data
> +  *	@sk_memcg: this socket's memcg association
>    *	@sk_write_pending: a write to stream socket waits to start
>    *	@sk_state_change: callback to indicate change in the state of the sock
>    *	@sk_data_ready: callback to indicate there is data to be processed
> @@ -447,7 +430,7 @@ struct sock {
>  #ifdef CONFIG_CGROUP_NET_CLASSID
>  	u32			sk_classid;
>  #endif
> -	struct cg_proto		*sk_cgrp;
> +	struct mem_cgroup	*sk_memcg;
>  	void			(*sk_state_change)(struct sock *sk);
>  	void			(*sk_data_ready)(struct sock *sk);
>  	void			(*sk_write_space)(struct sock *sk);
> @@ -1051,18 +1034,6 @@ struct proto {
>  #ifdef SOCK_REFCNT_DEBUG
>  	atomic_t		socks;
>  #endif
> -#ifdef CONFIG_MEMCG_KMEM
> -	/*
> -	 * cgroup specific init/deinit functions. Called once for all
> -	 * protocols that implement it, from cgroups populate function.
> -	 * This function has to setup any files the protocol want to
> -	 * appear in the kmem cgroup filesystem.
> -	 */
> -	int			(*init_cgroup)(struct mem_cgroup *memcg,
> -					       struct cgroup_subsys *ss);
> -	void			(*destroy_cgroup)(struct mem_cgroup *memcg);
> -	struct cg_proto		*(*proto_cgroup)(struct mem_cgroup *memcg);
> -#endif
>  };
>  
>  int proto_register(struct proto *prot, int alloc_slab);
> @@ -1093,23 +1064,6 @@ static inline void sk_refcnt_debug_release(const struct sock *sk)
>  #define sk_refcnt_debug_release(sk) do { } while (0)
>  #endif /* SOCK_REFCNT_DEBUG */
>  
> -#if defined(CONFIG_MEMCG_KMEM) && defined(CONFIG_NET)
> -extern struct static_key memcg_socket_limit_enabled;
> -static inline struct cg_proto *parent_cg_proto(struct proto *proto,
> -					       struct cg_proto *cg_proto)
> -{
> -	return proto->proto_cgroup(parent_mem_cgroup(cg_proto->memcg));
> -}
> -#define mem_cgroup_sockets_enabled static_key_false(&memcg_socket_limit_enabled)
> -#else
> -#define mem_cgroup_sockets_enabled 0
> -static inline struct cg_proto *parent_cg_proto(struct proto *proto,
> -					       struct cg_proto *cg_proto)
> -{
> -	return NULL;
> -}
> -#endif
> -
>  static inline bool sk_stream_memory_free(const struct sock *sk)
>  {
>  	if (sk->sk_wmem_queued >= sk->sk_sndbuf)
> @@ -1136,9 +1090,6 @@ static inline bool sk_under_memory_pressure(const struct sock *sk)
>  	if (!sk->sk_prot->memory_pressure)
>  		return false;
>  
> -	if (mem_cgroup_sockets_enabled && sk->sk_cgrp)
> -		return !!sk->sk_cgrp->memory_pressure;
> -
>  	return !!*sk->sk_prot->memory_pressure;
>  }
>  
> @@ -1146,61 +1097,19 @@ static inline void sk_leave_memory_pressure(struct sock *sk)
>  {
>  	int *memory_pressure = sk->sk_prot->memory_pressure;
>  
> -	if (!memory_pressure)
> -		return;
> -
> -	if (*memory_pressure)
> +	if (memory_pressure && *memory_pressure)
>  		*memory_pressure = 0;
> -
> -	if (mem_cgroup_sockets_enabled && sk->sk_cgrp) {
> -		struct cg_proto *cg_proto = sk->sk_cgrp;
> -		struct proto *prot = sk->sk_prot;
> -
> -		for (; cg_proto; cg_proto = parent_cg_proto(prot, cg_proto))
> -			cg_proto->memory_pressure = 0;
> -	}
> -
>  }
>  
>  static inline void sk_enter_memory_pressure(struct sock *sk)
>  {
> -	if (!sk->sk_prot->enter_memory_pressure)
> -		return;
> -
> -	if (mem_cgroup_sockets_enabled && sk->sk_cgrp) {
> -		struct cg_proto *cg_proto = sk->sk_cgrp;
> -		struct proto *prot = sk->sk_prot;
> -
> -		for (; cg_proto; cg_proto = parent_cg_proto(prot, cg_proto))
> -			cg_proto->memory_pressure = 1;
> -	}
> -
> -	sk->sk_prot->enter_memory_pressure(sk);
> +	if (sk->sk_prot->enter_memory_pressure)
> +		sk->sk_prot->enter_memory_pressure(sk);
>  }
>  
>  static inline long sk_prot_mem_limits(const struct sock *sk, int index)
>  {
> -	long *prot = sk->sk_prot->sysctl_mem;
> -	if (mem_cgroup_sockets_enabled && sk->sk_cgrp)
> -		prot = sk->sk_cgrp->sysctl_mem;
> -	return prot[index];
> -}
> -
> -static inline void memcg_memory_allocated_add(struct cg_proto *prot,
> -					      unsigned long amt,
> -					      int *parent_status)
> -{
> -	page_counter_charge(&prot->memory_allocated, amt);
> -
> -	if (page_counter_read(&prot->memory_allocated) >
> -	    prot->memory_allocated.limit)
> -		*parent_status = OVER_LIMIT;
> -}
> -
> -static inline void memcg_memory_allocated_sub(struct cg_proto *prot,
> -					      unsigned long amt)
> -{
> -	page_counter_uncharge(&prot->memory_allocated, amt);
> +	return sk->sk_prot->sysctl_mem[index];
>  }
>  
>  static inline long
> @@ -1208,24 +1117,14 @@ sk_memory_allocated(const struct sock *sk)
>  {
>  	struct proto *prot = sk->sk_prot;
>  
> -	if (mem_cgroup_sockets_enabled && sk->sk_cgrp)
> -		return page_counter_read(&sk->sk_cgrp->memory_allocated);
> -
>  	return atomic_long_read(prot->memory_allocated);
>  }
>  
>  static inline long
> -sk_memory_allocated_add(struct sock *sk, int amt, int *parent_status)
> +sk_memory_allocated_add(struct sock *sk, int amt)
>  {
>  	struct proto *prot = sk->sk_prot;
>  
> -	if (mem_cgroup_sockets_enabled && sk->sk_cgrp) {
> -		memcg_memory_allocated_add(sk->sk_cgrp, amt, parent_status);
> -		/* update the root cgroup regardless */
> -		atomic_long_add_return(amt, prot->memory_allocated);
> -		return page_counter_read(&sk->sk_cgrp->memory_allocated);
> -	}
> -
>  	return atomic_long_add_return(amt, prot->memory_allocated);
>  }
>  
> @@ -1234,9 +1133,6 @@ sk_memory_allocated_sub(struct sock *sk, int amt)
>  {
>  	struct proto *prot = sk->sk_prot;
>  
> -	if (mem_cgroup_sockets_enabled && sk->sk_cgrp)
> -		memcg_memory_allocated_sub(sk->sk_cgrp, amt);
> -
>  	atomic_long_sub(amt, prot->memory_allocated);
>  }
>  
> @@ -1244,13 +1140,6 @@ static inline void sk_sockets_allocated_dec(struct sock *sk)
>  {
>  	struct proto *prot = sk->sk_prot;
>  
> -	if (mem_cgroup_sockets_enabled && sk->sk_cgrp) {
> -		struct cg_proto *cg_proto = sk->sk_cgrp;
> -
> -		for (; cg_proto; cg_proto = parent_cg_proto(prot, cg_proto))
> -			percpu_counter_dec(&cg_proto->sockets_allocated);
> -	}
> -
>  	percpu_counter_dec(prot->sockets_allocated);
>  }
>  
> @@ -1258,13 +1147,6 @@ static inline void sk_sockets_allocated_inc(struct sock *sk)
>  {
>  	struct proto *prot = sk->sk_prot;
>  
> -	if (mem_cgroup_sockets_enabled && sk->sk_cgrp) {
> -		struct cg_proto *cg_proto = sk->sk_cgrp;
> -
> -		for (; cg_proto; cg_proto = parent_cg_proto(prot, cg_proto))
> -			percpu_counter_inc(&cg_proto->sockets_allocated);
> -	}
> -
>  	percpu_counter_inc(prot->sockets_allocated);
>  }
>  
> @@ -1273,9 +1155,6 @@ sk_sockets_allocated_read_positive(struct sock *sk)
>  {
>  	struct proto *prot = sk->sk_prot;
>  
> -	if (mem_cgroup_sockets_enabled && sk->sk_cgrp)
> -		return percpu_counter_read_positive(&sk->sk_cgrp->sockets_allocated);
> -
>  	return percpu_counter_read_positive(prot->sockets_allocated);
>  }
>  
> diff --git a/include/net/tcp.h b/include/net/tcp.h
> index eed94fc..77b6c7e 100644
> --- a/include/net/tcp.h
> +++ b/include/net/tcp.h
> @@ -291,9 +291,6 @@ extern int tcp_memory_pressure;
>  /* optimized version of sk_under_memory_pressure() for TCP sockets */
>  static inline bool tcp_under_memory_pressure(const struct sock *sk)
>  {
> -	if (mem_cgroup_sockets_enabled && sk->sk_cgrp)
> -		return !!sk->sk_cgrp->memory_pressure;
> -
>  	return tcp_memory_pressure;
>  }
>  /*
> diff --git a/include/net/tcp_memcontrol.h b/include/net/tcp_memcontrol.h
> deleted file mode 100644
> index 05b94d9..0000000
> --- a/include/net/tcp_memcontrol.h
> +++ /dev/null
> @@ -1,7 +0,0 @@
> -#ifndef _TCP_MEMCG_H
> -#define _TCP_MEMCG_H
> -
> -struct cg_proto *tcp_proto_cgroup(struct mem_cgroup *memcg);
> -int tcp_init_cgroup(struct mem_cgroup *memcg, struct cgroup_subsys *ss);
> -void tcp_destroy_cgroup(struct mem_cgroup *memcg);
> -#endif /* _TCP_MEMCG_H */
> diff --git a/mm/memcontrol.c b/mm/memcontrol.c
> index e54f434..c41e6d7 100644
> --- a/mm/memcontrol.c
> +++ b/mm/memcontrol.c
> @@ -66,7 +66,6 @@
>  #include "internal.h"
>  #include <net/sock.h>
>  #include <net/ip.h>
> -#include <net/tcp_memcontrol.h>
>  #include "slab.h"
>  
>  #include <asm/uaccess.h>
> @@ -291,58 +290,68 @@ static inline struct mem_cgroup *mem_cgroup_from_id(unsigned short id)
>  /* Writing them here to avoid exposing memcg's inner layout */
>  #if defined(CONFIG_INET) && defined(CONFIG_MEMCG_KMEM)
>  
> +DEFINE_STATIC_KEY_FALSE(mem_cgroup_sockets);
> +
>  void sock_update_memcg(struct sock *sk)
>  {
> -	if (mem_cgroup_sockets_enabled) {
> -		struct mem_cgroup *memcg;
> -		struct cg_proto *cg_proto;
> -
> -		BUG_ON(!sk->sk_prot->proto_cgroup);
> -
> -		/* Socket cloning can throw us here with sk_cgrp already
> -		 * filled. It won't however, necessarily happen from
> -		 * process context. So the test for root memcg given
> -		 * the current task's memcg won't help us in this case.
> -		 *
> -		 * Respecting the original socket's memcg is a better
> -		 * decision in this case.
> -		 */
> -		if (sk->sk_cgrp) {
> -			BUG_ON(mem_cgroup_is_root(sk->sk_cgrp->memcg));
> -			css_get(&sk->sk_cgrp->memcg->css);
> -			return;
> -		}
> -
> -		rcu_read_lock();
> -		memcg = mem_cgroup_from_task(current);
> -		cg_proto = sk->sk_prot->proto_cgroup(memcg);
> -		if (cg_proto && test_bit(MEMCG_SOCK_ACTIVE, &cg_proto->flags) &&
> -		    css_tryget_online(&memcg->css)) {
> -			sk->sk_cgrp = cg_proto;
> -		}
> -		rcu_read_unlock();
> +	struct mem_cgroup *memcg;
> +	/*
> +	 * Socket cloning can throw us here with sk_cgrp already
> +	 * filled. It won't however, necessarily happen from
> +	 * process context. So the test for root memcg given
> +	 * the current task's memcg won't help us in this case.
> +	 *
> +	 * Respecting the original socket's memcg is a better
> +	 * decision in this case.
> +	 */
> +	if (sk->sk_memcg) {
> +		BUG_ON(mem_cgroup_is_root(sk->sk_memcg));
> +		css_get(&sk->sk_memcg->css);
> +		return;
>  	}
> +
> +	rcu_read_lock();
> +	memcg = mem_cgroup_from_task(current);
> +	if (css_tryget_online(&memcg->css))
> +		sk->sk_memcg = memcg;
> +	rcu_read_unlock();
>  }
>  EXPORT_SYMBOL(sock_update_memcg);
>  
>  void sock_release_memcg(struct sock *sk)
>  {
> -	if (mem_cgroup_sockets_enabled && sk->sk_cgrp) {
> -		struct mem_cgroup *memcg;
> -		WARN_ON(!sk->sk_cgrp->memcg);
> -		memcg = sk->sk_cgrp->memcg;
> -		css_put(&sk->sk_cgrp->memcg->css);
> -	}
> +	if (sk->sk_memcg)
> +		css_put(&sk->sk_memcg->css);
>  }
>  
> -struct cg_proto *tcp_proto_cgroup(struct mem_cgroup *memcg)
> +/**
> + * mem_cgroup_charge_skmem - charge socket memory
> + * @memcg: memcg to charge
> + * @nr_pages: number of pages to charge
> + *
> + * Charges @nr_pages to @memcg. Returns %true if the charge fit within
> + * the memcg's configured limit, %false if the charge had to be forced.
> + */
> +bool mem_cgroup_charge_skmem(struct mem_cgroup *memcg, unsigned int nr_pages)
>  {
> -	if (!memcg || mem_cgroup_is_root(memcg))
> -		return NULL;
> +	struct page_counter *counter;
> +
> +	if (page_counter_try_charge(&memcg->skmem, nr_pages, &counter))
> +		return true;
>  
> -	return &memcg->tcp_mem;
> +	page_counter_charge(&memcg->skmem, nr_pages);
> +	return false;
> +}
> +
> +/**
> + * mem_cgroup_uncharge_skmem - uncharge socket memory
> + * @memcg: memcg to uncharge
> + * @nr_pages: number of pages to uncharge
> + */
> +void mem_cgroup_uncharge_skmem(struct mem_cgroup *memcg, unsigned int nr_pages)
> +{
> +	page_counter_uncharge(&memcg->skmem, nr_pages);
>  }
> -EXPORT_SYMBOL(tcp_proto_cgroup);
>  
>  #endif
>  
> @@ -3592,13 +3601,7 @@ static int mem_cgroup_oom_control_write(struct cgroup_subsys_state *css,
>  #ifdef CONFIG_MEMCG_KMEM
>  static int memcg_init_kmem(struct mem_cgroup *memcg, struct cgroup_subsys *ss)
>  {
> -	int ret;
> -
> -	ret = memcg_propagate_kmem(memcg);
> -	if (ret)
> -		return ret;
> -
> -	return mem_cgroup_sockets_init(memcg, ss);
> +	return memcg_propagate_kmem(memcg);
>  }
>  
>  static void memcg_deactivate_kmem(struct mem_cgroup *memcg)
> @@ -3654,7 +3657,6 @@ static void memcg_destroy_kmem(struct mem_cgroup *memcg)
>  		static_key_slow_dec(&memcg_kmem_enabled_key);
>  		WARN_ON(page_counter_read(&memcg->kmem));
>  	}
> -	mem_cgroup_sockets_destroy(memcg);
>  }
>  #else
>  static int memcg_init_kmem(struct mem_cgroup *memcg, struct cgroup_subsys *ss)
> @@ -4218,6 +4220,7 @@ mem_cgroup_css_alloc(struct cgroup_subsys_state *parent_css)
>  		memcg->soft_limit = PAGE_COUNTER_MAX;
>  		page_counter_init(&memcg->memsw, NULL);
>  		page_counter_init(&memcg->kmem, NULL);
> +		page_counter_init(&memcg->skmem, NULL);
>  	}
>  
>  	memcg->last_scanned_node = MAX_NUMNODES;
> @@ -4266,6 +4269,7 @@ mem_cgroup_css_online(struct cgroup_subsys_state *css)
>  		memcg->soft_limit = PAGE_COUNTER_MAX;
>  		page_counter_init(&memcg->memsw, &parent->memsw);
>  		page_counter_init(&memcg->kmem, &parent->kmem);
> +		page_counter_init(&memcg->skmem, &parent->skmem);
>  
>  		/*
>  		 * No need to take a reference to the parent because cgroup
> @@ -4277,6 +4281,7 @@ mem_cgroup_css_online(struct cgroup_subsys_state *css)
>  		memcg->soft_limit = PAGE_COUNTER_MAX;
>  		page_counter_init(&memcg->memsw, NULL);
>  		page_counter_init(&memcg->kmem, NULL);
> +		page_counter_init(&memcg->skmem, NULL);
>  		/*
>  		 * Deeper hierachy with use_hierarchy == false doesn't make
>  		 * much sense so let cgroup subsystem know about this
> diff --git a/net/core/sock.c b/net/core/sock.c
> index 0fafd27..0debff5 100644
> --- a/net/core/sock.c
> +++ b/net/core/sock.c
> @@ -194,44 +194,6 @@ bool sk_net_capable(const struct sock *sk, int cap)
>  }
>  EXPORT_SYMBOL(sk_net_capable);
>  
> -
> -#ifdef CONFIG_MEMCG_KMEM
> -int mem_cgroup_sockets_init(struct mem_cgroup *memcg, struct cgroup_subsys *ss)
> -{
> -	struct proto *proto;
> -	int ret = 0;
> -
> -	mutex_lock(&proto_list_mutex);
> -	list_for_each_entry(proto, &proto_list, node) {
> -		if (proto->init_cgroup) {
> -			ret = proto->init_cgroup(memcg, ss);
> -			if (ret)
> -				goto out;
> -		}
> -	}
> -
> -	mutex_unlock(&proto_list_mutex);
> -	return ret;
> -out:
> -	list_for_each_entry_continue_reverse(proto, &proto_list, node)
> -		if (proto->destroy_cgroup)
> -			proto->destroy_cgroup(memcg);
> -	mutex_unlock(&proto_list_mutex);
> -	return ret;
> -}
> -
> -void mem_cgroup_sockets_destroy(struct mem_cgroup *memcg)
> -{
> -	struct proto *proto;
> -
> -	mutex_lock(&proto_list_mutex);
> -	list_for_each_entry_reverse(proto, &proto_list, node)
> -		if (proto->destroy_cgroup)
> -			proto->destroy_cgroup(memcg);
> -	mutex_unlock(&proto_list_mutex);
> -}
> -#endif
> -
>  /*
>   * Each address family might have different locking rules, so we have
>   * one slock key per address family:
> @@ -239,11 +201,6 @@ void mem_cgroup_sockets_destroy(struct mem_cgroup *memcg)
>  static struct lock_class_key af_family_keys[AF_MAX];
>  static struct lock_class_key af_family_slock_keys[AF_MAX];
>  
> -#if defined(CONFIG_MEMCG_KMEM)
> -struct static_key memcg_socket_limit_enabled;
> -EXPORT_SYMBOL(memcg_socket_limit_enabled);
> -#endif
> -
>  /*
>   * Make lock validator output more readable. (we pre-construct these
>   * strings build-time, so that runtime initialization of socket
> @@ -1476,12 +1433,6 @@ void sk_free(struct sock *sk)
>  }
>  EXPORT_SYMBOL(sk_free);
>  
> -static void sk_update_clone(const struct sock *sk, struct sock *newsk)
> -{
> -	if (mem_cgroup_sockets_enabled && sk->sk_cgrp)
> -		sock_update_memcg(newsk);
> -}
> -
>  /**
>   *	sk_clone_lock - clone a socket, and lock its clone
>   *	@sk: the socket to clone
> @@ -1577,7 +1528,8 @@ struct sock *sk_clone_lock(const struct sock *sk, const gfp_t priority)
>  		sk_set_socket(newsk, NULL);
>  		newsk->sk_wq = NULL;
>  
> -		sk_update_clone(sk, newsk);
> +		if (mem_cgroup_do_sockets())
> +			sock_update_memcg(newsk);
>  
>  		if (newsk->sk_prot->sockets_allocated)
>  			sk_sockets_allocated_inc(newsk);
> @@ -2036,27 +1988,27 @@ int __sk_mem_schedule(struct sock *sk, int size, int kind)
>  	struct proto *prot = sk->sk_prot;
>  	int amt = sk_mem_pages(size);
>  	long allocated;
> -	int parent_status = UNDER_LIMIT;
>  
>  	sk->sk_forward_alloc += amt * SK_MEM_QUANTUM;
>  
> -	allocated = sk_memory_allocated_add(sk, amt, &parent_status);
> +	allocated = sk_memory_allocated_add(sk, amt);
> +
> +	if (mem_cgroup_do_sockets() && sk->sk_memcg &&
> +	    !mem_cgroup_charge_skmem(sk->sk_memcg, amt))
> +		goto suppress_allocation;
>  
>  	/* Under limit. */
> -	if (parent_status == UNDER_LIMIT &&
> -			allocated <= sk_prot_mem_limits(sk, 0)) {
> +	if (allocated <= sk_prot_mem_limits(sk, 0)) {
>  		sk_leave_memory_pressure(sk);
>  		return 1;
>  	}
>  
> -	/* Under pressure. (we or our parents) */
> -	if ((parent_status > SOFT_LIMIT) ||
> -			allocated > sk_prot_mem_limits(sk, 1))
> +	/* Under pressure. */
> +	if (allocated > sk_prot_mem_limits(sk, 1))
>  		sk_enter_memory_pressure(sk);
>  
> -	/* Over hard limit (we or our parents) */
> -	if ((parent_status == OVER_LIMIT) ||
> -			(allocated > sk_prot_mem_limits(sk, 2)))
> +	/* Over hard limit. */
> +	if (allocated > sk_prot_mem_limits(sk, 2))
>  		goto suppress_allocation;
>  
>  	/* guarantee minimum buffer size under pressure */
> @@ -2105,6 +2057,9 @@ suppress_allocation:
>  
>  	sk_memory_allocated_sub(sk, amt);
>  
> +	if (mem_cgroup_do_sockets() && sk->sk_memcg)
> +		mem_cgroup_uncharge_skmem(sk->sk_memcg, amt);
> +
>  	return 0;
>  }
>  EXPORT_SYMBOL(__sk_mem_schedule);
> @@ -2120,6 +2075,9 @@ void __sk_mem_reclaim(struct sock *sk, int amount)
>  	sk_memory_allocated_sub(sk, amount);
>  	sk->sk_forward_alloc -= amount << SK_MEM_QUANTUM_SHIFT;
>  
> +	if (mem_cgroup_do_sockets() && sk->sk_memcg)
> +		mem_cgroup_uncharge_skmem(sk->sk_memcg, amount);
> +
>  	if (sk_under_memory_pressure(sk) &&
>  	    (sk_memory_allocated(sk) < sk_prot_mem_limits(sk, 0)))
>  		sk_leave_memory_pressure(sk);
> diff --git a/net/ipv4/sysctl_net_ipv4.c b/net/ipv4/sysctl_net_ipv4.c
> index 894da3a..1f00819 100644
> --- a/net/ipv4/sysctl_net_ipv4.c
> +++ b/net/ipv4/sysctl_net_ipv4.c
> @@ -24,7 +24,6 @@
>  #include <net/cipso_ipv4.h>
>  #include <net/inet_frag.h>
>  #include <net/ping.h>
> -#include <net/tcp_memcontrol.h>
>  
>  static int zero;
>  static int one = 1;
> diff --git a/net/ipv4/tcp.c b/net/ipv4/tcp.c
> index ac1bdbb..ec931c0 100644
> --- a/net/ipv4/tcp.c
> +++ b/net/ipv4/tcp.c
> @@ -421,7 +421,8 @@ void tcp_init_sock(struct sock *sk)
>  	sk->sk_rcvbuf = sysctl_tcp_rmem[1];
>  
>  	local_bh_disable();
> -	sock_update_memcg(sk);
> +	if (mem_cgroup_do_sockets())
> +		sock_update_memcg(sk);
>  	sk_sockets_allocated_inc(sk);
>  	local_bh_enable();
>  }
> diff --git a/net/ipv4/tcp_ipv4.c b/net/ipv4/tcp_ipv4.c
> index 30dd45c..bb5f4f2 100644
> --- a/net/ipv4/tcp_ipv4.c
> +++ b/net/ipv4/tcp_ipv4.c
> @@ -73,7 +73,6 @@
>  #include <net/timewait_sock.h>
>  #include <net/xfrm.h>
>  #include <net/secure_seq.h>
> -#include <net/tcp_memcontrol.h>
>  #include <net/busy_poll.h>
>  
>  #include <linux/inet.h>
> @@ -1808,7 +1807,8 @@ void tcp_v4_destroy_sock(struct sock *sk)
>  	tcp_saved_syn_free(tp);
>  
>  	sk_sockets_allocated_dec(sk);
> -	sock_release_memcg(sk);
> +	if (mem_cgroup_do_sockets())
> +		sock_release_memcg(sk);
>  }
>  EXPORT_SYMBOL(tcp_v4_destroy_sock);
>  
> @@ -2330,11 +2330,6 @@ struct proto tcp_prot = {
>  	.compat_setsockopt	= compat_tcp_setsockopt,
>  	.compat_getsockopt	= compat_tcp_getsockopt,
>  #endif
> -#ifdef CONFIG_MEMCG_KMEM
> -	.init_cgroup		= tcp_init_cgroup,
> -	.destroy_cgroup		= tcp_destroy_cgroup,
> -	.proto_cgroup		= tcp_proto_cgroup,
> -#endif
>  };
>  EXPORT_SYMBOL(tcp_prot);
>  
> diff --git a/net/ipv4/tcp_memcontrol.c b/net/ipv4/tcp_memcontrol.c
> index 2379c1b..09a37eb 100644
> --- a/net/ipv4/tcp_memcontrol.c
> +++ b/net/ipv4/tcp_memcontrol.c
> @@ -1,107 +1,10 @@
> -#include <net/tcp.h>
> -#include <net/tcp_memcontrol.h>
> -#include <net/sock.h>
> -#include <net/ip.h>
> -#include <linux/nsproxy.h>
> +#include <linux/page_counter.h>
>  #include <linux/memcontrol.h>
> +#include <linux/cgroup.h>
>  #include <linux/module.h>
> -
> -int tcp_init_cgroup(struct mem_cgroup *memcg, struct cgroup_subsys *ss)
> -{
> -	/*
> -	 * The root cgroup does not use page_counters, but rather,
> -	 * rely on the data already collected by the network
> -	 * subsystem
> -	 */
> -	struct mem_cgroup *parent = parent_mem_cgroup(memcg);
> -	struct page_counter *counter_parent = NULL;
> -	struct cg_proto *cg_proto, *parent_cg;
> -
> -	cg_proto = tcp_prot.proto_cgroup(memcg);
> -	if (!cg_proto)
> -		return 0;
> -
> -	cg_proto->sysctl_mem[0] = sysctl_tcp_mem[0];
> -	cg_proto->sysctl_mem[1] = sysctl_tcp_mem[1];
> -	cg_proto->sysctl_mem[2] = sysctl_tcp_mem[2];
> -	cg_proto->memory_pressure = 0;
> -	cg_proto->memcg = memcg;
> -
> -	parent_cg = tcp_prot.proto_cgroup(parent);
> -	if (parent_cg)
> -		counter_parent = &parent_cg->memory_allocated;
> -
> -	page_counter_init(&cg_proto->memory_allocated, counter_parent);
> -	percpu_counter_init(&cg_proto->sockets_allocated, 0, GFP_KERNEL);
> -
> -	return 0;
> -}
> -EXPORT_SYMBOL(tcp_init_cgroup);
> -
> -void tcp_destroy_cgroup(struct mem_cgroup *memcg)
> -{
> -	struct cg_proto *cg_proto;
> -
> -	cg_proto = tcp_prot.proto_cgroup(memcg);
> -	if (!cg_proto)
> -		return;
> -
> -	percpu_counter_destroy(&cg_proto->sockets_allocated);
> -
> -	if (test_bit(MEMCG_SOCK_ACTIVATED, &cg_proto->flags))
> -		static_key_slow_dec(&memcg_socket_limit_enabled);
> -
> -}
> -EXPORT_SYMBOL(tcp_destroy_cgroup);
> -
> -static int tcp_update_limit(struct mem_cgroup *memcg, unsigned long nr_pages)
> -{
> -	struct cg_proto *cg_proto;
> -	int i;
> -	int ret;
> -
> -	cg_proto = tcp_prot.proto_cgroup(memcg);
> -	if (!cg_proto)
> -		return -EINVAL;
> -
> -	ret = page_counter_limit(&cg_proto->memory_allocated, nr_pages);
> -	if (ret)
> -		return ret;
> -
> -	for (i = 0; i < 3; i++)
> -		cg_proto->sysctl_mem[i] = min_t(long, nr_pages,
> -						sysctl_tcp_mem[i]);
> -
> -	if (nr_pages == PAGE_COUNTER_MAX)
> -		clear_bit(MEMCG_SOCK_ACTIVE, &cg_proto->flags);
> -	else {
> -		/*
> -		 * The active bit needs to be written after the static_key
> -		 * update. This is what guarantees that the socket activation
> -		 * function is the last one to run. See sock_update_memcg() for
> -		 * details, and note that we don't mark any socket as belonging
> -		 * to this memcg until that flag is up.
> -		 *
> -		 * We need to do this, because static_keys will span multiple
> -		 * sites, but we can't control their order. If we mark a socket
> -		 * as accounted, but the accounting functions are not patched in
> -		 * yet, we'll lose accounting.
> -		 *
> -		 * We never race with the readers in sock_update_memcg(),
> -		 * because when this value change, the code to process it is not
> -		 * patched in yet.
> -		 *
> -		 * The activated bit is used to guarantee that no two writers
> -		 * will do the update in the same memcg. Without that, we can't
> -		 * properly shutdown the static key.
> -		 */
> -		if (!test_and_set_bit(MEMCG_SOCK_ACTIVATED, &cg_proto->flags))
> -			static_key_slow_inc(&memcg_socket_limit_enabled);
> -		set_bit(MEMCG_SOCK_ACTIVE, &cg_proto->flags);
> -	}
> -
> -	return 0;
> -}
> +#include <linux/kernfs.h>
> +#include <linux/mutex.h>
> +#include <net/tcp.h>
>  
>  enum {
>  	RES_USAGE,
> @@ -124,11 +27,17 @@ static ssize_t tcp_cgroup_write(struct kernfs_open_file *of,
>  	switch (of_cft(of)->private) {
>  	case RES_LIMIT:
>  		/* see memcontrol.c */
> +		if (memcg == root_mem_cgroup) {
> +			ret = -EINVAL;
> +			break;
> +		}
>  		ret = page_counter_memparse(buf, "-1", &nr_pages);
>  		if (ret)
>  			break;
>  		mutex_lock(&tcp_limit_mutex);
> -		ret = tcp_update_limit(memcg, nr_pages);
> +		ret = page_counter_limit(&memcg->skmem, nr_pages);
> +		if (!ret)
> +			static_branch_enable(&mem_cgroup_sockets);
>  		mutex_unlock(&tcp_limit_mutex);
>  		break;
>  	default:
> @@ -141,32 +50,28 @@ static ssize_t tcp_cgroup_write(struct kernfs_open_file *of,
>  static u64 tcp_cgroup_read(struct cgroup_subsys_state *css, struct cftype *cft)
>  {
>  	struct mem_cgroup *memcg = mem_cgroup_from_css(css);
> -	struct cg_proto *cg_proto = tcp_prot.proto_cgroup(memcg);
>  	u64 val;
>  
>  	switch (cft->private) {
>  	case RES_LIMIT:
> -		if (!cg_proto)
> -			return PAGE_COUNTER_MAX;
> -		val = cg_proto->memory_allocated.limit;
> +		val = memcg->skmem.limit;
>  		val *= PAGE_SIZE;
>  		break;
>  	case RES_USAGE:
> -		if (!cg_proto)
> +		if (memcg == root_mem_cgroup)
>  			val = atomic_long_read(&tcp_memory_allocated);
>  		else
> -			val = page_counter_read(&cg_proto->memory_allocated);
> +			val = page_counter_read(&memcg->skmem);
>  		val *= PAGE_SIZE;
>  		break;
>  	case RES_FAILCNT:
> -		if (!cg_proto)
> -			return 0;
> -		val = cg_proto->memory_allocated.failcnt;
> +		val = memcg->skmem.failcnt;
>  		break;
>  	case RES_MAX_USAGE:
> -		if (!cg_proto)
> -			return 0;
> -		val = cg_proto->memory_allocated.watermark;
> +		if (memcg == root_mem_cgroup)
> +			val = 0;
> +		else
> +			val = memcg->skmem.watermark;
>  		val *= PAGE_SIZE;
>  		break;
>  	default:
> @@ -178,20 +83,14 @@ static u64 tcp_cgroup_read(struct cgroup_subsys_state *css, struct cftype *cft)
>  static ssize_t tcp_cgroup_reset(struct kernfs_open_file *of,
>  				char *buf, size_t nbytes, loff_t off)
>  {
> -	struct mem_cgroup *memcg;
> -	struct cg_proto *cg_proto;
> -
> -	memcg = mem_cgroup_from_css(of_css(of));
> -	cg_proto = tcp_prot.proto_cgroup(memcg);
> -	if (!cg_proto)
> -		return nbytes;
> +	struct mem_cgroup *memcg = mem_cgroup_from_css(of_css(of));
>  
>  	switch (of_cft(of)->private) {
>  	case RES_MAX_USAGE:
> -		page_counter_reset_watermark(&cg_proto->memory_allocated);
> +		page_counter_reset_watermark(&memcg->skmem);
>  		break;
>  	case RES_FAILCNT:
> -		cg_proto->memory_allocated.failcnt = 0;
> +		memcg->skmem.failcnt = 0;
>  		break;
>  	}
>  
> diff --git a/net/ipv4/tcp_output.c b/net/ipv4/tcp_output.c
> index 19adedb..b496fc9 100644
> --- a/net/ipv4/tcp_output.c
> +++ b/net/ipv4/tcp_output.c
> @@ -2819,13 +2819,15 @@ begin_fwd:
>   */
>  void sk_forced_mem_schedule(struct sock *sk, int size)
>  {
> -	int amt, status;
> +	int amt;
>  
>  	if (size <= sk->sk_forward_alloc)
>  		return;
>  	amt = sk_mem_pages(size);
>  	sk->sk_forward_alloc += amt * SK_MEM_QUANTUM;
> -	sk_memory_allocated_add(sk, amt, &status);
> +	sk_memory_allocated_add(sk, amt);
> +	if (mem_cgroup_do_sockets() && sk->sk_memcg)
> +		mem_cgroup_charge_skmem(sk->sk_memcg, amt);
>  }
>  
>  /* Send a FIN. The caller locks the socket for us.
> diff --git a/net/ipv6/tcp_ipv6.c b/net/ipv6/tcp_ipv6.c
> index f495d18..cf19e65 100644
> --- a/net/ipv6/tcp_ipv6.c
> +++ b/net/ipv6/tcp_ipv6.c
> @@ -1862,9 +1862,6 @@ struct proto tcpv6_prot = {
>  	.compat_setsockopt	= compat_tcp_setsockopt,
>  	.compat_getsockopt	= compat_tcp_getsockopt,
>  #endif
> -#ifdef CONFIG_MEMCG_KMEM
> -	.proto_cgroup		= tcp_proto_cgroup,
> -#endif
>  	.clear_sk		= tcp_v6_clear_sk,
>  };
>  
> -- 
> 2.6.1

-- 
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 156+ messages in thread

* Re: [PATCH 4/8] mm: memcontrol: prepare for unified hierarchy socket accounting
  2015-10-22  4:21   ` Johannes Weiner
@ 2015-10-23 12:39     ` Michal Hocko
  -1 siblings, 0 replies; 156+ messages in thread
From: Michal Hocko @ 2015-10-23 12:39 UTC (permalink / raw)
  To: Johannes Weiner
  Cc: David S. Miller, Andrew Morton, Vladimir Davydov, Tejun Heo,
	netdev, linux-mm, cgroups, linux-kernel

On Thu 22-10-15 00:21:32, Johannes Weiner wrote:
> The unified hierarchy memory controller will account socket
> memory. Move the infrastructure functions accordingly.
> 
> Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>

Acked-by: Michal Hocko <mhocko@suse.com>

> ---
>  mm/memcontrol.c | 136 ++++++++++++++++++++++++++++----------------------------
>  1 file changed, 68 insertions(+), 68 deletions(-)
> 
> diff --git a/mm/memcontrol.c b/mm/memcontrol.c
> index c41e6d7..3789050 100644
> --- a/mm/memcontrol.c
> +++ b/mm/memcontrol.c
> @@ -287,74 +287,6 @@ static inline struct mem_cgroup *mem_cgroup_from_id(unsigned short id)
>  	return mem_cgroup_from_css(css);
>  }
>  
> -/* Writing them here to avoid exposing memcg's inner layout */
> -#if defined(CONFIG_INET) && defined(CONFIG_MEMCG_KMEM)
> -
> -DEFINE_STATIC_KEY_FALSE(mem_cgroup_sockets);
> -
> -void sock_update_memcg(struct sock *sk)
> -{
> -	struct mem_cgroup *memcg;
> -	/*
> -	 * Socket cloning can throw us here with sk_cgrp already
> -	 * filled. It won't however, necessarily happen from
> -	 * process context. So the test for root memcg given
> -	 * the current task's memcg won't help us in this case.
> -	 *
> -	 * Respecting the original socket's memcg is a better
> -	 * decision in this case.
> -	 */
> -	if (sk->sk_memcg) {
> -		BUG_ON(mem_cgroup_is_root(sk->sk_memcg));
> -		css_get(&sk->sk_memcg->css);
> -		return;
> -	}
> -
> -	rcu_read_lock();
> -	memcg = mem_cgroup_from_task(current);
> -	if (css_tryget_online(&memcg->css))
> -		sk->sk_memcg = memcg;
> -	rcu_read_unlock();
> -}
> -EXPORT_SYMBOL(sock_update_memcg);
> -
> -void sock_release_memcg(struct sock *sk)
> -{
> -	if (sk->sk_memcg)
> -		css_put(&sk->sk_memcg->css);
> -}
> -
> -/**
> - * mem_cgroup_charge_skmem - charge socket memory
> - * @memcg: memcg to charge
> - * @nr_pages: number of pages to charge
> - *
> - * Charges @nr_pages to @memcg. Returns %true if the charge fit within
> - * the memcg's configured limit, %false if the charge had to be forced.
> - */
> -bool mem_cgroup_charge_skmem(struct mem_cgroup *memcg, unsigned int nr_pages)
> -{
> -	struct page_counter *counter;
> -
> -	if (page_counter_try_charge(&memcg->skmem, nr_pages, &counter))
> -		return true;
> -
> -	page_counter_charge(&memcg->skmem, nr_pages);
> -	return false;
> -}
> -
> -/**
> - * mem_cgroup_uncharge_skmem - uncharge socket memory
> - * @memcg: memcg to uncharge
> - * @nr_pages: number of pages to uncharge
> - */
> -void mem_cgroup_uncharge_skmem(struct mem_cgroup *memcg, unsigned int nr_pages)
> -{
> -	page_counter_uncharge(&memcg->skmem, nr_pages);
> -}
> -
> -#endif
> -
>  #ifdef CONFIG_MEMCG_KMEM
>  /*
>   * This will be the memcg's index in each cache's ->memcg_params.memcg_caches.
> @@ -5521,6 +5453,74 @@ void mem_cgroup_replace_page(struct page *oldpage, struct page *newpage)
>  	commit_charge(newpage, memcg, true);
>  }
>  
> +/* Writing them here to avoid exposing memcg's inner layout */
> +#if defined(CONFIG_INET) && defined(CONFIG_MEMCG_KMEM)
> +
> +DEFINE_STATIC_KEY_FALSE(mem_cgroup_sockets);
> +
> +void sock_update_memcg(struct sock *sk)
> +{
> +	struct mem_cgroup *memcg;
> +	/*
> +	 * Socket cloning can throw us here with sk_cgrp already
> +	 * filled. It won't however, necessarily happen from
> +	 * process context. So the test for root memcg given
> +	 * the current task's memcg won't help us in this case.
> +	 *
> +	 * Respecting the original socket's memcg is a better
> +	 * decision in this case.
> +	 */
> +	if (sk->sk_memcg) {
> +		BUG_ON(mem_cgroup_is_root(sk->sk_memcg));
> +		css_get(&sk->sk_memcg->css);
> +		return;
> +	}
> +
> +	rcu_read_lock();
> +	memcg = mem_cgroup_from_task(current);
> +	if (css_tryget_online(&memcg->css))
> +		sk->sk_memcg = memcg;
> +	rcu_read_unlock();
> +}
> +EXPORT_SYMBOL(sock_update_memcg);
> +
> +void sock_release_memcg(struct sock *sk)
> +{
> +	if (sk->sk_memcg)
> +		css_put(&sk->sk_memcg->css);
> +}
> +
> +/**
> + * mem_cgroup_charge_skmem - charge socket memory
> + * @memcg: memcg to charge
> + * @nr_pages: number of pages to charge
> + *
> + * Charges @nr_pages to @memcg. Returns %true if the charge fit within
> + * the memcg's configured limit, %false if the charge had to be forced.
> + */
> +bool mem_cgroup_charge_skmem(struct mem_cgroup *memcg, unsigned int nr_pages)
> +{
> +	struct page_counter *counter;
> +
> +	if (page_counter_try_charge(&memcg->skmem, nr_pages, &counter))
> +		return true;
> +
> +	page_counter_charge(&memcg->skmem, nr_pages);
> +	return false;
> +}
> +
> +/**
> + * mem_cgroup_uncharge_skmem - uncharge socket memory
> + * @memcg: memcg to uncharge
> + * @nr_pages: number of pages to uncharge
> + */
> +void mem_cgroup_uncharge_skmem(struct mem_cgroup *memcg, unsigned int nr_pages)
> +{
> +	page_counter_uncharge(&memcg->skmem, nr_pages);
> +}
> +
> +#endif
> +
>  /*
>   * subsys_initcall() for memory controller.
>   *
> -- 
> 2.6.1

-- 
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 156+ messages in thread

* Re: [PATCH 5/8] mm: memcontrol: account socket memory on unified hierarchy
@ 2015-10-23 13:19     ` Michal Hocko
  0 siblings, 0 replies; 156+ messages in thread
From: Michal Hocko @ 2015-10-23 13:19 UTC (permalink / raw)
  To: Johannes Weiner
  Cc: David S. Miller, Andrew Morton, Vladimir Davydov, Tejun Heo,
	netdev, linux-mm, cgroups, linux-kernel

On Thu 22-10-15 00:21:33, Johannes Weiner wrote:
> Socket memory can be a significant share of overall memory consumed by
> common workloads. In order to provide reasonable resource isolation
> out-of-the-box in the unified hierarchy, this type of memory needs to
> be accounted and tracked per default in the memory controller.

What about users who do not want to pay the additional overhead of the
accounting? How can they disable it?
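
[ For illustration only: one way such an opt-out could look is a boot
  parameter that gates the accounting. This is a rough sketch; the
  "cgroup.memory=nosocket" name and the wiring below are hypothetical,
  not something proposed by this series. ]

	#include <linux/init.h>
	#include <linux/string.h>
	#include <linux/jump_label.h>

	static bool cgroup_memory_nosocket;	/* hypothetical opt-out flag */

	static int __init cgroup_memory(char *s)
	{
		if (!strcmp(s, "nosocket"))
			cgroup_memory_nosocket = true;
		return 1;
	}
	__setup("cgroup.memory=", cgroup_memory);

	/* hypothetical hook run during memcg init: turn the otherwise
	 * default-enabled accounting branch off when requested */
	static int __init mem_cgroup_socket_opt(void)
	{
		if (cgroup_memory_nosocket)
			static_branch_disable(&mem_cgroup_sockets);
		return 0;
	}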

> Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>

[...]

> @@ -5453,10 +5470,9 @@ void mem_cgroup_replace_page(struct page *oldpage, struct page *newpage)
>  	commit_charge(newpage, memcg, true);
>  }
>  
> -/* Writing them here to avoid exposing memcg's inner layout */
> -#if defined(CONFIG_INET) && defined(CONFIG_MEMCG_KMEM)
> +#ifdef CONFIG_INET
>  
> -DEFINE_STATIC_KEY_FALSE(mem_cgroup_sockets);
> +DEFINE_STATIC_KEY_TRUE(mem_cgroup_sockets);

AFAIU this means that the jump label is enabled by default. Is this
intended, given that you enable it explicitly where needed?
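
[ For reference, a minimal sketch of the <linux/jump_label.h> semantics
  in question; the key and helper names below are illustrative and not
  taken from this series. ]

	#include <linux/jump_label.h>

	/* A _TRUE key starts out enabled: static_branch_likely()/unlikely()
	 * on it return true until someone calls static_branch_disable(). */
	DEFINE_STATIC_KEY_TRUE(default_on_key);

	/* A _FALSE key starts out disabled: the branch stays patched out
	 * until static_branch_enable() is called, e.g. when a limit is
	 * configured for the first time. */
	DEFINE_STATIC_KEY_FALSE(default_off_key);

	static bool example(void)
	{
		/* true from boot for the _TRUE key above */
		return static_branch_unlikely(&default_on_key);
	}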

>  
>  void sock_update_memcg(struct sock *sk)
>  {
-- 
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 156+ messages in thread

* Re: [PATCH 6/8] mm: vmscan: simplify memcg vs. global shrinker invocation
  2015-10-22  4:21   ` Johannes Weiner
@ 2015-10-23 13:26     ` Michal Hocko
  -1 siblings, 0 replies; 156+ messages in thread
From: Michal Hocko @ 2015-10-23 13:26 UTC (permalink / raw)
  To: Johannes Weiner
  Cc: David S. Miller, Andrew Morton, Vladimir Davydov, Tejun Heo,
	netdev, linux-mm, cgroups, linux-kernel

On Thu 22-10-15 00:21:34, Johannes Weiner wrote:
> Letting shrink_slab() handle the root_mem_cgroup, and implicitely the
> !CONFIG_MEMCG case, allows shrink_zone() to invoke the shrinkers
> unconditionally from within the memcg iteration loop.
> 
> Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>

Acked-by: Michal Hocko <mhocko@suse.com>

> ---
>  include/linux/memcontrol.h |  2 ++
>  mm/vmscan.c                | 31 ++++++++++++++++---------------
>  2 files changed, 18 insertions(+), 15 deletions(-)
> 
> diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
> index 6f1e0f8..d66ae18 100644
> --- a/include/linux/memcontrol.h
> +++ b/include/linux/memcontrol.h
> @@ -482,6 +482,8 @@ void mem_cgroup_split_huge_fixup(struct page *head);
>  #else /* CONFIG_MEMCG */
>  struct mem_cgroup;
>  
> +#define root_mem_cgroup NULL
> +
>  static inline void mem_cgroup_events(struct mem_cgroup *memcg,
>  				     enum mem_cgroup_events_index idx,
>  				     unsigned int nr)
> diff --git a/mm/vmscan.c b/mm/vmscan.c
> index 9b52ecf..ecc2125 100644
> --- a/mm/vmscan.c
> +++ b/mm/vmscan.c
> @@ -411,6 +411,10 @@ static unsigned long shrink_slab(gfp_t gfp_mask, int nid,
>  	struct shrinker *shrinker;
>  	unsigned long freed = 0;
>  
> +	/* Global shrinker mode */
> +	if (memcg == root_mem_cgroup)
> +		memcg = NULL;
> +
>  	if (memcg && !memcg_kmem_is_active(memcg))
>  		return 0;
>  
> @@ -2417,11 +2421,22 @@ static bool shrink_zone(struct zone *zone, struct scan_control *sc,
>  			shrink_lruvec(lruvec, swappiness, sc, &lru_pages);
>  			zone_lru_pages += lru_pages;
>  
> -			if (memcg && is_classzone)
> +			/*
> +			 * Shrink the slab caches in the same proportion that
> +			 * the eligible LRU pages were scanned.
> +			 */
> +			if (is_classzone) {
>  				shrink_slab(sc->gfp_mask, zone_to_nid(zone),
>  					    memcg, sc->nr_scanned - scanned,
>  					    lru_pages);
>  
> +				if (reclaim_state) {
> +					sc->nr_reclaimed +=
> +						reclaim_state->reclaimed_slab;
> +					reclaim_state->reclaimed_slab = 0;
> +				}
> +			}
> +
>  			/*
>  			 * Direct reclaim and kswapd have to scan all memory
>  			 * cgroups to fulfill the overall scan target for the
> @@ -2439,20 +2454,6 @@ static bool shrink_zone(struct zone *zone, struct scan_control *sc,
>  			}
>  		} while ((memcg = mem_cgroup_iter(root, memcg, &reclaim)));
>  
> -		/*
> -		 * Shrink the slab caches in the same proportion that
> -		 * the eligible LRU pages were scanned.
> -		 */
> -		if (global_reclaim(sc) && is_classzone)
> -			shrink_slab(sc->gfp_mask, zone_to_nid(zone), NULL,
> -				    sc->nr_scanned - nr_scanned,
> -				    zone_lru_pages);
> -
> -		if (reclaim_state) {
> -			sc->nr_reclaimed += reclaim_state->reclaimed_slab;
> -			reclaim_state->reclaimed_slab = 0;
> -		}
> -
>  		vmpressure(sc->gfp_mask, sc->target_mem_cgroup,
>  			   sc->nr_scanned - nr_scanned,
>  			   sc->nr_reclaimed - nr_reclaimed);
> -- 
> 2.6.1

-- 
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 156+ messages in thread

* Re: [PATCH 3/8] net: consolidate memcg socket buffer tracking and accounting
  2015-10-22 19:09       ` Johannes Weiner
  (?)
@ 2015-10-23 13:42         ` Vladimir Davydov
  -1 siblings, 0 replies; 156+ messages in thread
From: Vladimir Davydov @ 2015-10-23 13:42 UTC (permalink / raw)
  To: Johannes Weiner
  Cc: David S. Miller, Andrew Morton, Michal Hocko, Tejun Heo, netdev,
	linux-mm, cgroups, linux-kernel

On Thu, Oct 22, 2015 at 03:09:43PM -0400, Johannes Weiner wrote:
> On Thu, Oct 22, 2015 at 09:46:12PM +0300, Vladimir Davydov wrote:
> > On Thu, Oct 22, 2015 at 12:21:31AM -0400, Johannes Weiner wrote:
> > > The tcp memory controller has extensive provisions for future memory
> > > accounting interfaces that won't materialize after all. Cut the code
> > > base down to what's actually used, now and in the likely future.
> > > 
> > > - There won't be any different protocol counters in the future, so a
> > >   direct sock->sk_memcg linkage is enough. This eliminates a lot of
> > >   callback maze and boilerplate code, and restores most of the socket
> > >   allocation code to pre-tcp_memcontrol state.
> > > 
> > > - There won't be a tcp control soft limit, so integrating the memcg
> > 
> > In fact, the code is ready for the "soft" limit (I mean min, pressure,
> > max tuple), it just lacks a knob.
> 
> Yeah, but that's not going to materialize if the entire interface for
> dedicated tcp throttling is considered obsolete.

Maybe it shouldn't be. My current understanding is that per-memcg tcp
window control is necessary, because:

 - We need to be able to protect a containerized workload from its
   growing network buffers. Using vmpressure notifications for that does
   not look reassuring to me.

 - We need a way to limit the network buffers of a particular container;
   otherwise it can fill the system-wide window, throttling other
   containers, which is unfair.

> 
> > > @@ -1136,9 +1090,6 @@ static inline bool sk_under_memory_pressure(const struct sock *sk)
> > >  	if (!sk->sk_prot->memory_pressure)
> > >  		return false;
> > >  
> > > -	if (mem_cgroup_sockets_enabled && sk->sk_cgrp)
> > > -		return !!sk->sk_cgrp->memory_pressure;
> > > -
> > 
> > AFAIU, now we won't shrink the window on hitting the limit, i.e. this
> > patch subtly changes the behavior of the existing knobs, potentially
> > breaking them.
> 
> Hm, but there is no grace period in which something meaningful could
> happen with the window shrinking, is there? Any buffer allocation is
> still going to fail hard.

AFAIU when we hit the limit, we not only throttle the socket that is
allocating, but also try to release space reserved by other sockets.
After your patch, we won't. This looks unfair to me.
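
[ For context, a rough sketch of the kind of check being referred to,
  modeled on the logic around tcp_should_expand_sndbuf(); the helper
  name is illustrative and this is not the exact upstream code. ]

	#include <net/tcp.h>

	static bool may_expand_sndbuf(const struct sock *sk)
	{
		/* Under memory pressure, do not grow this socket's send
		 * buffer. The pre-patch memcg pressure flag fed into checks
		 * like this, which is how other sockets in the group were
		 * throttled once one of them hit the limit. */
		if (tcp_under_memory_pressure(sk))
			return false;

		/* Likewise when total allocations exceed the low bound. */
		if (sk_memory_allocated(sk) >= sk_prot_mem_limits(sk, 0))
			return false;

		return true;
	}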

Thanks,
Vladimir

^ permalink raw reply	[flat|nested] 156+ messages in thread

* Re: [PATCH 3/8] net: consolidate memcg socket buffer tracking and accounting
@ 2015-10-23 13:42         ` Vladimir Davydov
  0 siblings, 0 replies; 156+ messages in thread
From: Vladimir Davydov @ 2015-10-23 13:42 UTC (permalink / raw)
  To: Johannes Weiner
  Cc: David S. Miller, Andrew Morton, Michal Hocko, Tejun Heo, netdev,
	linux-mm, cgroups, linux-kernel

On Thu, Oct 22, 2015 at 03:09:43PM -0400, Johannes Weiner wrote:
> On Thu, Oct 22, 2015 at 09:46:12PM +0300, Vladimir Davydov wrote:
> > On Thu, Oct 22, 2015 at 12:21:31AM -0400, Johannes Weiner wrote:
> > > The tcp memory controller has extensive provisions for future memory
> > > accounting interfaces that won't materialize after all. Cut the code
> > > base down to what's actually used, now and in the likely future.
> > > 
> > > - There won't be any different protocol counters in the future, so a
> > >   direct sock->sk_memcg linkage is enough. This eliminates a lot of
> > >   callback maze and boilerplate code, and restores most of the socket
> > >   allocation code to pre-tcp_memcontrol state.
> > > 
> > > - There won't be a tcp control soft limit, so integrating the memcg
> > 
> > In fact, the code is ready for the "soft" limit (I mean min, pressure,
> > max tuple), it just lacks a knob.
> 
> Yeah, but that's not going to materialize if the entire interface for
> dedicated tcp throttling is considered obsolete.

May be, it shouldn't be. My current understanding is that per memcg tcp
window control is necessary, because:

 - We need to be able to protect a containerized workload from its
   growing network buffers. Using vmpressure notifications for that does
   not look reassuring to me.

 - We need a way to limit network buffers of a particular container,
   otherwise it can fill the system-wide window throttling other
   containers, which is unfair.

> 
> > > @@ -1136,9 +1090,6 @@ static inline bool sk_under_memory_pressure(const struct sock *sk)
> > >  	if (!sk->sk_prot->memory_pressure)
> > >  		return false;
> > >  
> > > -	if (mem_cgroup_sockets_enabled && sk->sk_cgrp)
> > > -		return !!sk->sk_cgrp->memory_pressure;
> > > -
> > 
> > AFAIU, now we won't shrink the window on hitting the limit, i.e. this
> > patch subtly changes the behavior of the existing knobs, potentially
> > breaking them.
> 
> Hm, but there is no grace period in which something meaningful could
> happen with the window shrinking, is there? Any buffer allocation is
> still going to fail hard.

AFAIU when we hit the limit, we not only throttle the socket which
allocates, but also try to release space reserved by other sockets.
After your patch we won't. This looks unfair to me.

Thanks,
Vladimir

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 156+ messages in thread

* Re: [PATCH 7/8] mm: vmscan: report vmpressure at the level of reclaim activity
@ 2015-10-23 13:49     ` Michal Hocko
  0 siblings, 0 replies; 156+ messages in thread
From: Michal Hocko @ 2015-10-23 13:49 UTC (permalink / raw)
  To: Johannes Weiner
  Cc: David S. Miller, Andrew Morton, Vladimir Davydov, Tejun Heo,
	netdev, linux-mm, cgroups, linux-kernel

On Thu 22-10-15 00:21:35, Johannes Weiner wrote:
> The vmpressure metric is based on reclaim efficiency, which in turn is
> an attribute of the LRU. However, vmpressure events are currently
> reported at the source of pressure rather than at the reclaim level.
> 
> Switch the reporting to the reclaim level to allow finer-grained
> analysis of which memcg is having trouble reclaiming its pages.

I can see how this can be useful.
 
> As far as memory.pressure_level interface semantics go, events are
> escalated up the hierarchy until a listener is found, so this won't
> affect existing users that listen at higher levels.

This is true, but the parent will not see cumulative events anymore.
One memcg might be struggling and barely reclaiming anything, so it would
report high pressure, while another would be doing just fine. The parent
will just see conflicting events in a short time period and cannot match
them to the source memcg. This sounds really confusing. Even more confusing
than the current semantics, which allow the same behavior under certain
configurations.

I dunno, I have to think about it some more. Maybe we need to rethink the
way the pressure is signaled. If we want a breakdown by particular
memcgs, then we should be able to identify them for this to be useful.

[...]
-- 
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 156+ messages in thread

* Re: [PATCH 5/8] mm: memcontrol: account socket memory on unified hierarchy
@ 2015-10-23 13:59       ` David Miller
  0 siblings, 0 replies; 156+ messages in thread
From: David Miller @ 2015-10-23 13:59 UTC (permalink / raw)
  To: mhocko
  Cc: hannes, akpm, vdavydov, tj, netdev, linux-mm, cgroups, linux-kernel

From: Michal Hocko <mhocko@kernel.org>
Date: Fri, 23 Oct 2015 15:19:56 +0200

> On Thu 22-10-15 00:21:33, Johannes Weiner wrote:
>> Socket memory can be a significant share of overall memory consumed by
>> common workloads. In order to provide reasonable resource isolation
>> out-of-the-box in the unified hierarchy, this type of memory needs to
>> be accounted and tracked per default in the memory controller.
> 
> What about users who do not want to pay an additional overhead for the
> accounting? How can they disable it?

Yeah, this really cannot pass.

This extra overhead will be seen by 99.9999% of users, since entities
(especially distributions) just flip on all of these config options by
default.

^ permalink raw reply	[flat|nested] 156+ messages in thread

* Re: [PATCH 5/8] mm: memcontrol: account socket memory on unified hierarchy
  2015-10-23 13:59       ` David Miller
@ 2015-10-26 16:56         ` Johannes Weiner
  -1 siblings, 0 replies; 156+ messages in thread
From: Johannes Weiner @ 2015-10-26 16:56 UTC (permalink / raw)
  To: David Miller
  Cc: mhocko, akpm, vdavydov, tj, netdev, linux-mm, cgroups, linux-kernel

On Fri, Oct 23, 2015 at 06:59:57AM -0700, David Miller wrote:
> From: Michal Hocko <mhocko@kernel.org>
> Date: Fri, 23 Oct 2015 15:19:56 +0200
> 
> > On Thu 22-10-15 00:21:33, Johannes Weiner wrote:
> >> Socket memory can be a significant share of overall memory consumed by
> >> common workloads. In order to provide reasonable resource isolation
> >> out-of-the-box in the unified hierarchy, this type of memory needs to
> >> be accounted and tracked per default in the memory controller.
> > 
> > What about users who do not want to pay an additional overhead for the
> > accounting? How can they disable it?
> 
> Yeah, this really cannot pass.
> 
> This extra overhead will be seen by 99.9999% of users, since entities
> (especially distributions) just flip on all of these config options by
> default.

Okay, there are several layers to this issue.

If you boot a machine with a CONFIG_MEMCG distribution kernel and
don't create any cgroups, I agree there shouldn't be any overhead.

I already sent a patch to generally remove memory accounting on the
system or root level. I can easily update this patch here to not have
any socket buffer accounting overhead for systems that don't actively
use cgroups. Would you be okay with a branch on sk->sk_memcg in the
network accounting path? I'd leave that NULL on the system level then.
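
[ Editorial sketch for illustration: the branch proposed above could
  look roughly like the following. The mem_cgroup_charge_skmem() helper
  is an assumption of this sketch rather than a function taken from the
  series; sk->sk_memcg is the linkage discussed in patch 3. ]

/* Hypothetical charge path: sockets created outside an accounted
 * cgroup keep sk->sk_memcg NULL and never take the memcg branch. */
static inline bool sk_charge_memcg(struct sock *sk, unsigned int nr_pages)
{
        if (!sk->sk_memcg)      /* system level / unaccounted cgroup */
                return true;

        return mem_cgroup_charge_skmem(sk->sk_memcg, nr_pages);
}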

Then there is of course the case when you create cgroups for process
organization but don't care about memory accounting. Systemd comes to
mind. Or even if you create cgroups to track other resources like CPU
but don't care about memory. The unified hierarchy no longer enables
controllers on new cgroups per default, so unless you create a cgroup
and specifically tell it to account and track memory, you won't have
the socket memory accounting overhead, either.
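
[ Editorial sketch for illustration: on the unified hierarchy, memory
  accounting has to be enabled explicitly, e.g. by writing "+memory" to
  the parent's cgroup.subtree_control file. The helper and the mount
  point in the usage hint are assumptions made for the example. ]

#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>

/* Enable the memory controller for the children of the given cgroup,
 * e.g. enable_memory_controller("/sys/fs/cgroup"). Only then do the
 * children get memory.current/high/max files and the accounting that
 * goes with them. */
static int enable_memory_controller(const char *cgrp)
{
        char path[4096];
        int fd, ret;

        snprintf(path, sizeof(path), "%s/cgroup.subtree_control", cgrp);
        fd = open(path, O_WRONLY);
        if (fd < 0)
                return -1;
        ret = (write(fd, "+memory", 7) == 7) ? 0 : -1;
        close(fd);
        return ret;
}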

Then there is the third case, where you create a control group to
specifically manage and limit the memory consumption of a workload. In
that scenario, a major memory consumer like socket buffers, which can
easily grow until OOM, should definitely be included in the tracking
in order to properly contain both untrusted (possibly malicious) and
trusted (possibly buggy) workloads. This is not a hole we can
reasonably leave unpatched for general-purpose resource management.

Now you could argue that there might exist specialized workloads that
need to account anonymous pages and page cache, but not socket memory
buffers. Or any other combination of pick-and-choose consumers. But
honestly, nowadays all our paths are lockless, and the counting is an
atomic-add-return with a per-cpu batch cache. I don't think there is a
compelling case for an elaborate interface to make individual memory
consumers configurable inside the memory controller.
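
[ Editorial sketch for illustration: roughly how a per-cpu batch cache
  in front of a shared counter works. The structure and function names
  are made up for the example, the details of the actual memcg charge
  cache differ, and locking/preemption handling is omitted. ]

#define CHARGE_BATCH    64

struct charge_cache {
        struct mem_cgroup *cached;      /* memcg the reserve belongs to */
        unsigned int nr_pages;          /* pre-charged pages left over */
};
static DEFINE_PER_CPU(struct charge_cache, charge_cache);

static bool try_charge(struct mem_cgroup *memcg, unsigned int nr_pages)
{
        struct charge_cache *cache = this_cpu_ptr(&charge_cache);
        struct page_counter *fail;

        /* Hot path: consume the per-cpu reserve, shared counter untouched. */
        if (cache->cached == memcg && cache->nr_pages >= nr_pages) {
                cache->nr_pages -= nr_pages;
                return true;
        }
        /* Slow path: one atomic add-return on the shared counter refills
         * the reserve (assumes page_counter_try_charge() returns true on
         * success; a real implementation would also drain a reserve that
         * belongs to a different memcg). */
        if (!page_counter_try_charge(&memcg->memory, nr_pages + CHARGE_BATCH, &fail))
                return false;
        cache->cached = memcg;
        cache->nr_pages = CHARGE_BATCH;
        return true;
}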

So in summary, would you be okay with this patch if networking only
called into the memory controller when you explicitly create a cgroup
AND tell it to track the memory footprint of the workload in it?

^ permalink raw reply	[flat|nested] 156+ messages in thread

* Re: [PATCH 0/8] mm: memcontrol: account socket memory in unified hierarchy
  2015-10-22 18:45   ` Vladimir Davydov
@ 2015-10-26 17:22     ` Johannes Weiner
  -1 siblings, 0 replies; 156+ messages in thread
From: Johannes Weiner @ 2015-10-26 17:22 UTC (permalink / raw)
  To: Vladimir Davydov
  Cc: David S. Miller, Andrew Morton, Michal Hocko, Tejun Heo, netdev,
	linux-mm, cgroups, linux-kernel

On Thu, Oct 22, 2015 at 09:45:10PM +0300, Vladimir Davydov wrote:
> Hi Johannes,
> 
> On Thu, Oct 22, 2015 at 12:21:28AM -0400, Johannes Weiner wrote:
> ...
> > Patch #5 adds accounting and tracking of socket memory to the unified
> > hierarchy memory controller, as described above. It uses the existing
> > per-cpu charge caches and triggers high limit reclaim asynchroneously.
> > 
> > Patch #8 uses the vmpressure extension to equalize pressure between
> > the pages tracked natively by the VM and socket buffer pages. As the
> > pool is shared, it makes sense that while natively tracked pages are
> > under duress the network transmit windows are also not increased.
> 
> First of all, I've no experience in networking, so I'm likely to be
> mistaken. Nevertheless I beg to disagree that this patch set is a step
> in the right direction. Here goes why.
> 
> I admit that your idea to get rid of explicit tcp window control knobs
> and size it dynamically basing on memory pressure instead does sound
> tempting, but I don't think it'd always work. The problem is that in
> contrast to, say, dcache, we can't shrink tcp buffers AFAIU, we can only
> stop growing them. Now suppose a system hasn't experienced memory
> pressure for a while. If we don't have explicit tcp window limit, tcp
> buffers on such a system might have eaten almost all available memory
> (because of network load/problems). If a user workload that needs a
> significant amount of memory is started suddenly then, the network code
> will receive a notification and surely stop growing buffers, but all
> those buffers accumulated won't disappear instantly. As a result, the
> workload might be unable to find enough free memory and have no choice
> but invoke OOM killer. This looks unexpected from the user POV.

I'm not getting rid of those knobs, I'm just reusing the old socket
accounting infrastructure in an attempt to make the memory accounting
feature useful to more people in cgroups v2 (unified hierarchy).

We can always come back to think about per-cgroup tcp window limits in
the unified hierarchy, my patches don't get in the way of this. I'm
not removing the knobs in cgroups v1 and I'm not preventing them in v2.

But regardless of tcp window control, we need to account socket memory
in the main memory accounting pool where pressure is shared (to the
best of our abilities) between all accounted memory consumers.

From an interface standpoint alone, I don't think it's reasonable to
ask users per default to limit different consumers on a case by case
basis. I certainly have no problem with finetuning for scenarios you
describe above, but with memory.current, memory.high, memory.max we
are providing a generic interface to account and contain memory
consumption of workloads. This has to include all major memory
consumers to make semantical sense.

But also, there are people right now for whom the socket buffers cause
system OOM, but the memcg's existing hard tcp window limit
absolutely wrecks network performance for them. It's not usable
the way it is. It'd be much better to have the socket buffers exert
pressure on the shared pool, and then propagate the overall pressure
back to individual consumers with reclaim, shrinkers, vmpressure etc.

^ permalink raw reply	[flat|nested] 156+ messages in thread

* Re: [PATCH 0/8] mm: memcontrol: account socket memory in unified hierarchy
  2015-10-26 17:22     ` Johannes Weiner
@ 2015-10-27  8:43       ` Vladimir Davydov
  -1 siblings, 0 replies; 156+ messages in thread
From: Vladimir Davydov @ 2015-10-27  8:43 UTC (permalink / raw)
  To: Johannes Weiner
  Cc: David S. Miller, Andrew Morton, Michal Hocko, Tejun Heo, netdev,
	linux-mm, cgroups, linux-kernel

On Mon, Oct 26, 2015 at 01:22:16PM -0400, Johannes Weiner wrote:
> On Thu, Oct 22, 2015 at 09:45:10PM +0300, Vladimir Davydov wrote:
> > Hi Johannes,
> > 
> > On Thu, Oct 22, 2015 at 12:21:28AM -0400, Johannes Weiner wrote:
> > ...
> > > Patch #5 adds accounting and tracking of socket memory to the unified
> > > hierarchy memory controller, as described above. It uses the existing
> > > per-cpu charge caches and triggers high limit reclaim asynchronously.
> > > 
> > > Patch #8 uses the vmpressure extension to equalize pressure between
> > > the pages tracked natively by the VM and socket buffer pages. As the
> > > pool is shared, it makes sense that while natively tracked pages are
> > > under duress the network transmit windows are also not increased.
> > 
> > First of all, I've no experience in networking, so I'm likely to be
> > mistaken. Nevertheless I beg to disagree that this patch set is a step
> > in the right direction. Here goes why.
> > 
> > I admit that your idea to get rid of explicit tcp window control knobs
> > and size it dynamically basing on memory pressure instead does sound
> > tempting, but I don't think it'd always work. The problem is that in
> > contrast to, say, dcache, we can't shrink tcp buffers AFAIU, we can only
> > stop growing them. Now suppose a system hasn't experienced memory
> > pressure for a while. If we don't have explicit tcp window limit, tcp
> > buffers on such a system might have eaten almost all available memory
> > (because of network load/problems). If a user workload that needs a
> > significant amount of memory is started suddenly then, the network code
> > will receive a notification and surely stop growing buffers, but all
> > those buffers accumulated won't disappear instantly. As a result, the
> > workload might be unable to find enough free memory and have no choice
> > but invoke OOM killer. This looks unexpected from the user POV.
> 
> I'm not getting rid of those knobs, I'm just reusing the old socket
> accounting infrastructure in an attempt to make the memory accounting
> feature useful to more people in cgroups v2 (unified hierarchy).
> 

My understanding is that in the meantime you effectively break the
existing per memcg tcp window control logic.

> We can always come back to think about per-cgroup tcp window limits in
> the unified hierarchy, my patches don't get in the way of this. I'm
> not removing the knobs in cgroups v1 and I'm not preventing them in v2.
> 
> But regardless of tcp window control, we need to account socket memory
> in the main memory accounting pool where pressure is shared (to the
> best of our abilities) between all accounted memory consumers.
> 

No objections to this point. However, I really don't like the idea of
charging the tcp window size to memory.current instead of charging the
individual pages consumed by the workload for storing socket buffers,
because it is inconsistent with what we have now. Can't we charge
individual skb pages as we do in the case of other kmem allocations?

> From an interface standpoint alone, I don't think it's reasonable to
> ask users per default to limit different consumers on a case by case
> basis. I certainly have no problem with finetuning for scenarios you
> describe above, but with memory.current, memory.high, memory.max we
> are providing a generic interface to account and contain memory
> consumption of workloads. This has to include all major memory
> consumers to make semantical sense.

We can propose a reasonable default as we do in the global case.

> 
> But also, there are people right now for whom the socket buffers cause
> system OOM, but the existing memcg's hard tcp window limitq that
> exists absolutely wrecks network performance for them. It's not usable
> the way it is. It'd be much better to have the socket buffers exert
> pressure on the shared pool, and then propagate the overall pressure
> back to individual consumers with reclaim, shrinkers, vmpressure etc.
> 

This might or might not work. I'm not an expert to judge. But if you do
this only for memcg leaving the global case as it is, networking people
won't budge IMO. So could you please start such a major rework from the
global case? Could you please try to deprecate the tcp window limits not
only in the legacy memcg hierarchy, but also system-wide in order to
attract attention of networking experts?

Thanks,
Vladimir

^ permalink raw reply	[flat|nested] 156+ messages in thread

* Re: [PATCH 5/8] mm: memcontrol: account socket memory on unified hierarchy
  2015-10-26 16:56         ` Johannes Weiner
@ 2015-10-27 12:26           ` Michal Hocko
  -1 siblings, 0 replies; 156+ messages in thread
From: Michal Hocko @ 2015-10-27 12:26 UTC (permalink / raw)
  To: Johannes Weiner
  Cc: David Miller, akpm, vdavydov, tj, netdev, linux-mm, cgroups,
	linux-kernel

On Mon 26-10-15 12:56:19, Johannes Weiner wrote:
[...]
> Now you could argue that there might exist specialized workloads that
> need to account anonymous pages and page cache, but not socket memory
> buffers.

Exactly, and there are loads doing this. Memcg groups are also created to
limit anon/page cache consumers so that they do not affect the others
running on the system (basically in the root memcg context from the memcg
POV), which don't care about tracking and definitely do not want to pay
for an additional overhead. We should be able to offer a global disable
knob for them. The same applies to kmem accounting in general.

I can understand having the accounting enabled by default once we
are reasonably sure that both kmem/tcp are stable enough (which I am
not convinced about yet, to be honest), but there will always be special
loads which simply do not care about kmem/tcp accounting and would
rather pay a global balancing price (even OOM) than a permanent one.
And they should get a way to opt out.

> Or any other combination of pick-and-choose consumers. But
> honestly, nowadays all our paths are lockless, and the counting is an
> atomic-add-return with a per-cpu batch cache.

You are still hooking into hot paths and there are users who want to
squeeze every single cycle from the HW.

> I don't think there is a compelling case for an elaborate interface
> to make individual memory consumers configurable inside the memory
> controller.

I do not think we need an elaborate interface. We just want to have
a global boot-time knob to override the default behavior. This is
a few lines of code and it should give sufficient flexibility.
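
[ Editorial sketch for illustration: a boot-time opt-out along these
  lines is indeed only a few lines. The "cgroup.memory=nosocket" name
  and the flag are assumptions of this sketch, not a knob agreed on in
  this thread; the socket charge path would then simply skip setting up
  sk->sk_memcg while the flag is set. ]

static bool cgroup_memory_nosocket __read_mostly;

static int __init cgroup_memory(char *s)
{
        char *token;

        while ((token = strsep(&s, ",")) != NULL) {
                if (!*token)
                        continue;
                if (!strcmp(token, "nosocket"))
                        cgroup_memory_nosocket = true;
        }
        return 1;       /* option consumed */
}
__setup("cgroup.memory=", cgroup_memory);
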
-- 
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 156+ messages in thread

* Re: [PATCH 5/8] mm: memcontrol: account socket memory on unified hierarchy
@ 2015-10-27 13:49             ` David Miller
  0 siblings, 0 replies; 156+ messages in thread
From: David Miller @ 2015-10-27 13:49 UTC (permalink / raw)
  To: mhocko
  Cc: hannes, akpm, vdavydov, tj, netdev, linux-mm, cgroups, linux-kernel

From: Michal Hocko <mhocko@kernel.org>
Date: Tue, 27 Oct 2015 13:26:47 +0100

> On Mon 26-10-15 12:56:19, Johannes Weiner wrote:
> [...]
>> Or any other combination of pick-and-choose consumers. But
>> honestly, nowadays all our paths are lockless, and the counting is an
>> atomic-add-return with a per-cpu batch cache.
> 
> You are still hooking into hot paths and there are users who want to
> squeeze every single cycle from the HW.

Yeah, you're probably undoing half a year of work by another
developer who was able to remove an atomic from these paths.

^ permalink raw reply	[flat|nested] 156+ messages in thread

* Re: [PATCH 5/8] mm: memcontrol: account socket memory on unified hierarchy
@ 2015-10-27 15:41             ` Johannes Weiner
  0 siblings, 0 replies; 156+ messages in thread
From: Johannes Weiner @ 2015-10-27 15:41 UTC (permalink / raw)
  To: Michal Hocko
  Cc: David Miller, akpm, vdavydov, tj, netdev, linux-mm, cgroups,
	linux-kernel

On Tue, Oct 27, 2015 at 01:26:47PM +0100, Michal Hocko wrote:
> On Mon 26-10-15 12:56:19, Johannes Weiner wrote:
> [...]
> > Now you could argue that there might exist specialized workloads that
> > need to account anonymous pages and page cache, but not socket memory
> > buffers.
> 
> Exactly, and there are loads doing this. Memcg groups are also created to
> limit anon/page cache consumers to not affect the others running on
> the system (basically in the root memcg context from memcg POV) which
> don't care about tracking and they definitely do not want to pay for an
> additional overhead. We should definitely be able to offer a global
> disable knob for them. The same applies to kmem accounting in general.

I don't see how you make such a clear distinction between, say, page
cache and the dentry cache, and call one user memory and the other
kernel memory. That just doesn't make sense to me. They're both kernel
memory allocated on behalf of the user, the only difference being that
one is tracked on the page level and the other on the slab level, and
we started accounting one before the other.

IMO that's an implementation detail and a historical artifact that
should not be exposed to the user. And that's the thing I hate about
the current opt-out knob.

> > I don't think there is a compelling case for an elaborate interface
> > to make individual memory consumers configurable inside the memory
> > controller.
> 
> I do not think we need an elaborate interface. We just want to have
> a global boot time knob to override the default behavior. This is
> a few lines of code and it should give sufficient flexibility.

Okay, then let's add this for the socket memory to start with. I'll
have to think more about how to distinguish the slab-based consumers.
Or maybe you have an idea.

For now, something like this as a boot commandline?

cgroup.memory=nosocket

So again in summary, no default overhead until you create a cgroup to
specifically track and account memory. And then, when you know what
you are doing and have a specialized workload, you can disable socket
memory as a specific consumer to remove that particular overhead while
still being able to contain page cache, anon, kmem, whatever.

Does that sound like reasonable user interfacing to everyone?
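For illustration, parsing the proposed flag really is only a handful of
lines; a hedged sketch, with the parameter string taken from the proposal
above and the flag variable name being an assumption:

#include <linux/init.h>
#include <linux/string.h>
#include <linux/types.h>

static bool cgroup_memory_nosocket;	/* assumed flag name */

static int __init cgroup_memory_setup(char *s)
{
	char *token;

	/* accept a comma-separated list so further options can be added */
	while ((token = strsep(&s, ",")) != NULL) {
		if (!*token)
			continue;
		if (!strcmp(token, "nosocket"))
			cgroup_memory_nosocket = true;
	}
	return 1;	/* option consumed */
}
__setup("cgroup.memory=", cgroup_memory_setup);

Usage would simply be booting with cgroup.memory=nosocket appended to the
kernel command line.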

^ permalink raw reply	[flat|nested] 156+ messages in thread

* Re: [PATCH 0/8] mm: memcontrol: account socket memory in unified hierarchy
  2015-10-27  8:43       ` Vladimir Davydov
@ 2015-10-27 16:01         ` Johannes Weiner
  -1 siblings, 0 replies; 156+ messages in thread
From: Johannes Weiner @ 2015-10-27 16:01 UTC (permalink / raw)
  To: Vladimir Davydov
  Cc: David S. Miller, Andrew Morton, Michal Hocko, Tejun Heo, netdev,
	linux-mm, cgroups, linux-kernel

On Tue, Oct 27, 2015 at 11:43:21AM +0300, Vladimir Davydov wrote:
> On Mon, Oct 26, 2015 at 01:22:16PM -0400, Johannes Weiner wrote:
> > I'm not getting rid of those knobs, I'm just reusing the old socket
> > accounting infrastructure in an attempt to make the memory accounting
> > feature useful to more people in cgroups v2 (unified hierarchy).
> 
> My understanding is that in the meantime you effectively break the
> existing per memcg tcp window control logic.

That's not my intention, this stuff has to keep working. I'm assuming
you mean the changes to sk_enter_memory_pressure() when hitting the
charge limit; let me address this in the other subthread.

> > We can always come back to think about per-cgroup tcp window limits in
> > the unified hierarchy, my patches don't get in the way of this. I'm
> > not removing the knobs in cgroups v1 and I'm not preventing them in v2.
> > 
> > But regardless of tcp window control, we need to account socket memory
> > in the main memory accounting pool where pressure is shared (to the
> > best of our abilities) between all accounted memory consumers.
> > 
> 
> No objections to this point. However, I really don't like the idea to
> charge tcp window size to memory.current instead of charging individual
> pages consumed by the workload for storing socket buffers, because it is
> inconsistent with what we have now. Can't we charge individual skb pages
> as we do in case of other kmem allocations?

Absolutely, both work for me. I chose that route because it's where
the networking code already tracks and accounts memory consumed, so it
seemed like a better site to hook into.

But I understand your concerns. We want to track this stuff as close
to the memory allocators as possible.
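To make the two candidate hook sites concrete: the place referred to above
is where the socket core already extends a socket's accounted memory in
page-sized steps. A charge there might look roughly like the sketch below,
in which sk_memcg and memcg_charge_pages() are assumed names rather than
the posted code:

#include <net/sock.h>

struct mem_cgroup;

/* assumed helper: returns false when the cgroup is at its limit */
bool memcg_charge_pages(struct mem_cgroup *memcg, unsigned int nr_pages);

/* Sketch only: charge the owning memcg at the point where the socket core
 * already raises the socket's forward-allocated memory in whole pages. */
static int sk_account_pages(struct sock *sk, int nr_pages)
{
	if (sk->sk_memcg && !memcg_charge_pages(sk->sk_memcg, nr_pages))
		return -ENOMEM;		/* cgroup is at its limit */

	sk->sk_forward_alloc += nr_pages << PAGE_SHIFT;
	return 0;
}

The alternative suggested above would instead charge the individual skb
data pages at allocation time, the way other kmem allocations are charged.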

> > But also, there are people right now for whom the socket buffers cause
> > system OOM, but the existing memcg's hard tcp window limit that
> > exists absolutely wrecks network performance for them. It's not usable
> > the way it is. It'd be much better to have the socket buffers exert
> > pressure on the shared pool, and then propagate the overall pressure
> > back to individual consumers with reclaim, shrinkers, vmpressure etc.
> 
> This might or might not work. I'm not an expert to judge. But if you do
> this only for memcg leaving the global case as it is, networking people
> won't budge IMO. So could you please start such a major rework from the
> global case? Could you please try to deprecate the tcp window limits not
> only in the legacy memcg hierarchy, but also system-wide in order to
> attract attention of networking experts?

I'm definitely interested in addressing this globally as well.

The idea behind this was to use the memcg part as a testbed. cgroup2
is going to be new and people are prepared for hiccups when migrating
their applications to it; and they can roll back to cgroup1 and tcp
window limits at any time should they run into problems in production.

So this seemed like a good way to prove a new mechanism before rolling
it out to every single Linux setup, rather than switch everybody over
after the limited scope testing I can do as a developer on my own.

Keep in mind that my patches are not committing anything in terms of
interface, so we retain all the freedom to fix and tune the way this
is implemented, including the freedom to re-add tcp window limits in
case the pressure balancing is not a comprehensive solution.

^ permalink raw reply	[flat|nested] 156+ messages in thread

* Re: [PATCH 5/8] mm: memcontrol: account socket memory on unified hierarchy
  2015-10-27 15:41             ` Johannes Weiner
@ 2015-10-27 16:15               ` Michal Hocko
  -1 siblings, 0 replies; 156+ messages in thread
From: Michal Hocko @ 2015-10-27 16:15 UTC (permalink / raw)
  To: Johannes Weiner
  Cc: David Miller, akpm, vdavydov, tj, netdev, linux-mm, cgroups,
	linux-kernel

On Tue 27-10-15 11:41:38, Johannes Weiner wrote:
> On Tue, Oct 27, 2015 at 01:26:47PM +0100, Michal Hocko wrote:
> > On Mon 26-10-15 12:56:19, Johannes Weiner wrote:
> > [...]
> > > Now you could argue that there might exist specialized workloads that
> > > need to account anonymous pages and page cache, but not socket memory
> > > buffers.
> > 
> > Exactly, and there are loads doing this. Memcg groups are also created to
> > limit anon/page cache consumers to not affect the others running on
> > the system (basically in the root memcg context from memcg POV) which
> > don't care about tracking and they definitely do not want to pay for an
> > additional overhead. We should definitely be able to offer a global
> > disable knob for them. The same applies to kmem accounting in general.
> 
> I don't see how you make such a clear distinction between, say, page
> cache and the dentry cache, and call one user memory and the other
> kernel memory.

Because the kernel memory footprint would be so small that it simply
doesn't change the picture at all, while the page cache or anonymous
memory consumption might be large enough to be disruptive. I am
talking about loads where good enough is better than "perfect", and
ephemeral global memory pressure when kmem goes over expectations is
better than a permanent cpu overhead. Whatever we do, that overhead
will always be non-zero.

Also kmem accounting will make the load more non-deterministic because
many of the resources are shared between tasks in separate cgroups
unless they are explicitly configured. E.g. [id]cache will be shared
and first to touch gets charged so you would end up with more false
sharing.

Nevertheless, I do not want to shift the discussion from the topic. I
just think that one-fits-all simply won't work.

> That just doesn't make sense to me. They're both kernel
> memory allocated on behalf of the user, the only difference being that
> one is tracked on the page level and the other on the slab level, and
> we started accounting one before the other.
> 
> IMO that's an implementation detail and a historical artifact that
> should not be exposed to the user. And that's the thing I hate about
> the current opt-out knob.
> 
> > > I don't think there is a compelling case for an elaborate interface
> > > to make individual memory consumers configurable inside the memory
> > > controller.
> > 
> > I do not think we need an elaborate interface. We just want to have
> > a global boot time knob to override the default behavior. This is
> > a few lines of code and it should give sufficient flexibility.
> 
> Okay, then let's add this for the socket memory to start with. I'll
> have to think more about how to distinguish the slab-based consumers.
> Or maybe you have an idea.

Isn't that as simple as enabling the jump label during the
initialization depending on the knob value? All the charging paths
should be disabled by default already.

> For now, something like this as a boot commandline?
> 
> cgroup.memory=nosocket

That would work for me. I could even see a place for a
CONFIG_MEMCG_TCP_KMEM_ENABLED config option to set the default, with
[no]socket as a kernel parameter to override the configuration default.
This would allow distributions to define their policy without enforcing
it hard, and those who compile the kernel to define their own.
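A rough sketch of that arrangement, where the proposed
CONFIG_MEMCG_TCP_KMEM_ENABLED option seeds the default, the [no]socket
parameter can override it at boot, and the result decides whether the jump
label is enabled during initialization (all identifiers are illustrative,
none of this is existing code):

#include <linux/init.h>
#include <linux/jump_label.h>
#include <linux/types.h>

DEFINE_STATIC_KEY_FALSE(memcg_socket_accounting_key);

/* compile-time default from the proposed option; a "cgroup.memory=[no]socket"
 * parameter, parsed from the command line, would flip this at boot */
static bool socket_accounting __initdata =
	IS_ENABLED(CONFIG_MEMCG_TCP_KMEM_ENABLED);

static int __init memcg_socket_accounting_init(void)
{
	if (socket_accounting)
		static_branch_enable(&memcg_socket_accounting_key);
	return 0;
}
core_initcall(memcg_socket_accounting_init);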

-- 
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 156+ messages in thread

* Re: [PATCH 5/8] mm: memcontrol: account socket memory on unified hierarchy
  2015-10-27 16:15               ` Michal Hocko
@ 2015-10-27 16:42                 ` Johannes Weiner
  -1 siblings, 0 replies; 156+ messages in thread
From: Johannes Weiner @ 2015-10-27 16:42 UTC (permalink / raw)
  To: Michal Hocko
  Cc: David Miller, akpm, vdavydov, tj, netdev, linux-mm, cgroups,
	linux-kernel

On Tue, Oct 27, 2015 at 05:15:54PM +0100, Michal Hocko wrote:
> On Tue 27-10-15 11:41:38, Johannes Weiner wrote:
> > On Tue, Oct 27, 2015 at 01:26:47PM +0100, Michal Hocko wrote:
> > > On Mon 26-10-15 12:56:19, Johannes Weiner wrote:
> > > [...]
> > > > Now you could argue that there might exist specialized workloads that
> > > > need to account anonymous pages and page cache, but not socket memory
> > > > buffers.
> > > 
> > > Exactly, and there are loads doing this. Memcg groups are also created to
> > > limit anon/page cache consumers to not affect the others running on
> > > the system (basically in the root memcg context from memcg POV) which
> > > don't care about tracking and they definitely do not want to pay for an
> > > additional overhead. We should definitely be able to offer a global
> > > disable knob for them. The same applies to kmem accounting in general.
> > 
> > I don't see how you make such a clear distinction between, say, page
> > cache and the dentry cache, and call one user memory and the other
> > kernel memory.
> 
> Because the kernel memory footprint would be so small that it simply
> doesn't change the picture at all, while the page cache or anonymous
> memory consumption might be large enough to be disruptive.

Or it could be exactly the other way around when you have a workload
that is heavy on filesystem metadata. I don't see why one scenario
should be more important than the other.

I'm not saying that distinguishing between consumers is wrong, just
that "user memory vs kernel memory" is a false classification. Why do
you call page cache user memory but dentry cache kernel memory? It
doesn't make any sense.

> Also kmem accounting will make the load more non-deterministic because
> many of the resources are shared between tasks in separate cgroups
> unless they are explicitly configured. E.g. [id]cache will be shared
> and first to touch gets charged so you would end up with more false
> sharing.

Exactly like page cache. This differentiation isn't based on reality.

> Nevertheless, I do not want to shift the discussion from the topic. I
> just think that one-fits-all simply won't work.

Okay, this is something we can converge on.

> > That just doesn't make sense to me. They're both kernel
> > memory allocated on behalf of the user, the only difference being that
> > one is tracked on the page level and the other on the slab level, and
> > we started accounting one before the other.
> > 
> > IMO that's an implementation detail and a historical artifact that
> > should not be exposed to the user. And that's the thing I hate about
> > the current opt-out knob.

You carefully skipped over this part. We can ignore it for socket
memory but it's something we need to figure out when it comes to slab
accounting and tracking.

> > > > I don't think there is a compelling case for an elaborate interface
> > > > to make individual memory consumers configurable inside the memory
> > > > controller.
> > > 
> > > I do not think we need an elaborate interface. We just want to have
> > > a global boot time knob to override the default behavior. This is
> > > a few lines of code and it should give sufficient flexibility.
> > 
> > Okay, then let's add this for the socket memory to start with. I'll
> > have to think more about how to distinguish the slab-based consumers.
> > Or maybe you have an idea.
> 
> Isn't that as simple as enabling the jump label during the
> initialization depending on the knob value? All the charging paths
> should be disabled by default already.

You missed my point. It's not about the implementation, it's about how
we present these choices to the user.

Having page cache accounting built in while presenting dentry+inode
cache as a configurable extension is completely random and doesn't
make sense. They are both first class memory consumers. They're not
separate categories. One isn't more "core" than the other.

> > For now, something like this as a boot commandline?
> > 
> > cgroup.memory=nosocket
> 
> That would work for me.

Okay, then I'll go that route for the socket stuff.

Dave is that cool with you?

^ permalink raw reply	[flat|nested] 156+ messages in thread

* Re: [PATCH 5/8] mm: memcontrol: account socket memory on unified hierarchy
@ 2015-10-28  0:45                   ` David Miller
  0 siblings, 0 replies; 156+ messages in thread
From: David Miller @ 2015-10-28  0:45 UTC (permalink / raw)
  To: hannes
  Cc: mhocko, akpm, vdavydov, tj, netdev, linux-mm, cgroups, linux-kernel

From: Johannes Weiner <hannes@cmpxchg.org>
Date: Tue, 27 Oct 2015 09:42:27 -0700

> On Tue, Oct 27, 2015 at 05:15:54PM +0100, Michal Hocko wrote:
>> > For now, something like this as a boot commandline?
>> > 
>> > cgroup.memory=nosocket
>> 
>> That would work for me.
> 
> Okay, then I'll go that route for the socket stuff.
> 
> Dave is that cool with you?

Depends upon the default.

Until the user configures something explicitly into the memory
controller, the networking bits should all evaluate to nothing.

^ permalink raw reply	[flat|nested] 156+ messages in thread

* Re: [PATCH 5/8] mm: memcontrol: account socket memory on unified hierarchy
  2015-10-28  0:45                   ` David Miller
@ 2015-10-28  3:05                     ` Johannes Weiner
  -1 siblings, 0 replies; 156+ messages in thread
From: Johannes Weiner @ 2015-10-28  3:05 UTC (permalink / raw)
  To: David Miller
  Cc: mhocko, akpm, vdavydov, tj, netdev, linux-mm, cgroups, linux-kernel

On Tue, Oct 27, 2015 at 05:45:32PM -0700, David Miller wrote:
> From: Johannes Weiner <hannes@cmpxchg.org>
> Date: Tue, 27 Oct 2015 09:42:27 -0700
> 
> > On Tue, Oct 27, 2015 at 05:15:54PM +0100, Michal Hocko wrote:
> >> > For now, something like this as a boot commandline?
> >> > 
> >> > cgroup.memory=nosocket
> >> 
> >> That would work for me.
> > 
> > Okay, then I'll go that route for the socket stuff.
> > 
> > Dave is that cool with you?
> 
> Depends upon the default.
> 
> Until the user configures something explicitly into the memory
> controller, the networking bits should all evaluate to nothing.

Yep, I'll stick them behind a default-off jump label again.

This bootflag is only to override an active memory controller
configuration and force-off that jump label permanently.
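In code terms, the scheme described above is roughly the following; the
key, flag and function names are illustrative:

#include <linux/jump_label.h>
#include <linux/types.h>

DEFINE_STATIC_KEY_FALSE(memcg_sockets_key);	/* default: off */

static bool cgroup_memory_nosocket;	/* set from the boot flag */

/* The networking hot path only sees a patched-out branch until the key
 * is flipped, so an unconfigured system pays (next to) nothing. */
static inline bool memcg_socket_accounting_active(void)
{
	return static_branch_unlikely(&memcg_sockets_key);
}

/* Called once socket accounting is actually configured for a cgroup;
 * the boot flag can veto it permanently. */
static void memcg_sockets_activate(void)
{
	if (!cgroup_memory_nosocket)
		static_branch_enable(&memcg_sockets_key);
}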

^ permalink raw reply	[flat|nested] 156+ messages in thread

* Re: [PATCH 0/8] mm: memcontrol: account socket memory in unified hierarchy
  2015-10-27 16:01         ` Johannes Weiner
@ 2015-10-28  8:20           ` Vladimir Davydov
  -1 siblings, 0 replies; 156+ messages in thread
From: Vladimir Davydov @ 2015-10-28  8:20 UTC (permalink / raw)
  To: Johannes Weiner
  Cc: David S. Miller, Andrew Morton, Michal Hocko, Tejun Heo, netdev,
	linux-mm, cgroups, linux-kernel

On Tue, Oct 27, 2015 at 09:01:08AM -0700, Johannes Weiner wrote:
...
> > > But regardless of tcp window control, we need to account socket memory
> > > in the main memory accounting pool where pressure is shared (to the
> > > best of our abilities) between all accounted memory consumers.
> > > 
> > 
> > No objections to this point. However, I really don't like the idea to
> > charge tcp window size to memory.current instead of charging individual
> > pages consumed by the workload for storing socket buffers, because it is
> > inconsistent with what we have now. Can't we charge individual skb pages
> > as we do in case of other kmem allocations?
> 
> Absolutely, both work for me. I chose that route because it's where
> the networking code already tracks and accounts memory consumed, so it
> seemed like a better site to hook into.
> 
> But I understand your concerns. We want to track this stuff as close
> to the memory allocators as possible.

Exactly.

> 
> > > But also, there are people right now for whom the socket buffers cause
> > > system OOM, but the existing memcg's hard tcp window limit that
> > > exists absolutely wrecks network performance for them. It's not usable
> > > the way it is. It'd be much better to have the socket buffers exert
> > > pressure on the shared pool, and then propagate the overall pressure
> > > back to individual consumers with reclaim, shrinkers, vmpressure etc.
> > 
> > This might or might not work. I'm not an expert to judge. But if you do
> > this only for memcg leaving the global case as it is, networking people
> > won't budge IMO. So could you please start such a major rework from the
> > global case? Could you please try to deprecate the tcp window limits not
> > only in the legacy memcg hierarchy, but also system-wide in order to
> > attract attention of networking experts?
> 
> I'm definitely interested in addressing this globally as well.
> 
> The idea behind this was to use the memcg part as a testbed. cgroup2
> is going to be new and people are prepared for hiccups when migrating
> their applications to it; and they can roll back to cgroup1 and tcp
> window limits at any time should they run into problems in production.

Then you'd better not touch existing tcp limits at all, because they
just work, and the logic behind them is very close to that of global tcp
limits. I don't think one can simplify it somehow. Moreover, frankly I
still have my reservations about this vmpressure propagation to skb
you're proposing. It might work, but I doubt it will allow us to throw
away explicit tcp limit, as I explained previously. So, even with your
approach I think we may still need per memcg tcp limit *unless* you get
rid of global tcp limit somehow.

> 
> So this seemed like a good way to prove a new mechanism before rolling
> it out to every single Linux setup, rather than switch everybody over
> after the limited scope testing I can do as a developer on my own.
> 
> Keep in mind that my patches are not committing anything in terms of
> interface, so we retain all the freedom to fix and tune the way this
> is implemented, including the freedom to re-add tcp window limits in
> case the pressure balancing is not a comprehensive solution.
> 

I really dislike this kind of proof. It looks like you're trying to push
something you think is right covertly, w/o having a proper discussion
with networking people and then say that it just works and hence should
be done globally, but what if it won't? Revert it? We already have a lot
of dubious stuff in memcg that should be reverted, so let's please try
to avoid this kind of mistake in the future. Note, I say "w/o having a
proper discussion with networking people", because I don't think they
will really care *unless* you change the global logic, simply because
most of them aren't very interested in memcg AFAICS.

That effectively means you lose a chance to listen to networking
experts, who could point you at design flaws and propose an improvement
right away. Let's please not miss such an opportunity. You said that
you'd seen this problem happen w/o cgroups, so you have a use case that
might need fixing at the global level. IMO it shouldn't be difficult to
prepare an RFC patch for the global case first and see what people think
about it.

Thanks,
Vladimir

^ permalink raw reply	[flat|nested] 156+ messages in thread

* Re: [PATCH 0/8] mm: memcontrol: account socket memory in unified hierarchy
  2015-10-28  8:20           ` Vladimir Davydov
@ 2015-10-28 18:58             ` Johannes Weiner
  -1 siblings, 0 replies; 156+ messages in thread
From: Johannes Weiner @ 2015-10-28 18:58 UTC (permalink / raw)
  To: Vladimir Davydov
  Cc: David S. Miller, Andrew Morton, Michal Hocko, Tejun Heo, netdev,
	linux-mm, cgroups, linux-kernel

On Wed, Oct 28, 2015 at 11:20:03AM +0300, Vladimir Davydov wrote:
> Then you'd better not touch existing tcp limits at all, because they
> just work, and the logic behind them is very close to that of global tcp
> limits. I don't think one can simplify it somehow.

Uhm, no, there is a crapload of boilerplate code and complication that
seems entirely unnecessary. The only thing missing from my patch seems
to be the part where it enters memory pressure state when the limit is
hit. I'm adding this for completeness, but I doubt it even matters.

> Moreover, frankly I still have my reservations about this vmpressure
> propagation to skb you're proposing. It might work, but I doubt it
> will allow us to throw away explicit tcp limit, as I explained
> previously. So, even with your approach I think we may still need
> per memcg tcp limit *unless* you get rid of global tcp limit
> somehow.

Having the hard limit as a failsafe (or a minimum for other consumers)
is one thing, and certainly something I'm open to for cgroupv2, should
we have problems with loads starting up after a socket memory landgrab.

That being said, if the VM is struggling to reclaim pages, or is even
swapping, it makes perfect sense to let the socket memory scheduler
know it shouldn't continue to increase its footprint until the VM
recovers. Regardless of any hard limitations/minimum guarantees.

This is what my patch does, and it seems pretty straightforward to
me. I don't really understand why this is so controversial.

The *next* step would be to figure out whether we can actually
*reclaim* memory in the network subsystem--shrink windows and steal
buffers back--and that might even be an avenue to replace tcp window
limits. But it's not necessary for *this* patch series to be useful.
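
[ A minimal sketch of what that pressure propagation could look like,
  for illustration only: socket_pressure, mem_cgroup_under_socket_pressure()
  and mem_cgroup_charge_skmem() are stand-in names for the mechanism being
  discussed, not the actual code in the series. ]

static inline bool mem_cgroup_under_socket_pressure(struct mem_cgroup *memcg)
{
	/* vmpressure sets this timestamp while reclaim is struggling */
	return time_before(jiffies, READ_ONCE(memcg->socket_pressure));
}

bool mem_cgroup_charge_skmem(struct mem_cgroup *memcg, unsigned int nr_pages)
{
	/* socket memory goes into the same pool as anon and page cache */
	page_counter_charge(&memcg->memory, nr_pages);

	/* under duress, tell the socket layer to stop growing its footprint */
	return !mem_cgroup_under_socket_pressure(memcg);
}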

> > So this seemed like a good way to prove a new mechanism before rolling
> > it out to every single Linux setup, rather than switch everybody over
> > after the limited scope testing I can do as a developer on my own.
> > 
> > Keep in mind that my patches are not committing anything in terms of
> > interface, so we retain all the freedom to fix and tune the way this
> > is implemented, including the freedom to re-add tcp window limits in
> > case the pressure balancing is not a comprehensive solution.
> 
> I really dislike this kind of proof. It looks like you're trying to
> push something you think is right covertly, w/o having a proper
> discussion with networking people and then say that it just works
> and hence should be done globally, but what if it won't? Revert it?
> We already have a lot of dubious stuff in memcg that should be
> reverted, so let's please try to avoid this kind of mistakes in
> future. Note, I say "w/o having a proper discussion with networking
> people", because I don't think they will really care *unless* you
> change the global logic, simply because most of them aren't very
> interested in memcg AFAICS.

Come on, Dave is the first To and netdev is CC'd. They might not care
about memcg, but "pushing things covertly" is a bit of a stretch.

> That effectively means you loose a chance to listen to networking
> experts, who could point you at design flaws and propose an improvement
> right away. Let's please not miss such an opportunity. You said that
> you'd seen this problem happen w/o cgroups, so you have a use case that
> might need fixing at the global level. IMO it shouldn't be difficult to
> prepare an RFC patch for the global case first and see what people think
> about it.

No, the problem we are running into is when network memory is not
tracked per cgroup. The lack of containment means that the socket
memory consumption of individual cgroups can trigger system OOM.

We tried using the per-memcg tcp limits, and that prevents the OOMs
for sure, but it's horrendous for network performance. There is no
"stop growing" phase, it just keeps going full throttle until it hits
the wall hard.

Now, we could probably try to replicate the global knobs and add a
per-memcg soft limit. But you know better than anyone else how hard it
is to estimate the overall workingset size of a workload, and the
margins on containerized loads are razor-thin. Performance is much
more sensitive to input errors, and oftentimes parameters must be
adjusted continuously during the runtime of a workload. It'd be
disastrous to rely on yet more static, error-prone user input here.

What all this means to me is that fixing it on the cgroup level has
higher priority. But it also means that once we have figured it out in
such a high-pressure environment, it's much easier to apply to the
global case and potentially replace the soft limit there.

This seems like a better approach to me than starting globally, only
to realize that the solution is not workable for cgroups and we need
yet something else.

^ permalink raw reply	[flat|nested] 156+ messages in thread


* Re: [PATCH 0/8] mm: memcontrol: account socket memory in unified hierarchy
  2015-10-28 18:58             ` Johannes Weiner
  (?)
@ 2015-10-29  9:27               ` Vladimir Davydov
  -1 siblings, 0 replies; 156+ messages in thread
From: Vladimir Davydov @ 2015-10-29  9:27 UTC (permalink / raw)
  To: Johannes Weiner
  Cc: David S. Miller, Andrew Morton, Michal Hocko, Tejun Heo, netdev,
	linux-mm, cgroups, linux-kernel

On Wed, Oct 28, 2015 at 11:58:10AM -0700, Johannes Weiner wrote:
> On Wed, Oct 28, 2015 at 11:20:03AM +0300, Vladimir Davydov wrote:
> > Then you'd better not touch existing tcp limits at all, because they
> > just work, and the logic behind them is very close to that of global tcp
> > limits. I don't think one can simplify it somehow.
> 
> Uhm, no, there is a crapload of boilerplate code and complication that
> seems entirely unnecessary. The only thing missing from my patch seems
> to be the part where it enters memory pressure state when the limit is
> hit. I'm adding this for completeness, but I doubt it even matters.
> 
> > Moreover, frankly I still have my reservations about this vmpressure
> > propagation to skb you're proposing. It might work, but I doubt it
> > will allow us to throw away explicit tcp limit, as I explained
> > previously. So, even with your approach I think we can still need
> > per memcg tcp limit *unless* you get rid of global tcp limit
> > somehow.
> 
> Having the hard limit as a failsafe (or a minimum for other consumers)
> is one thing, and certainly something I'm open to for cgroupv2, should
> we have problems with load startup up after a socket memory landgrab.
> 
> That being said, if the VM is struggling to reclaim pages, or is even
> swapping, it makes perfect sense to let the socket memory scheduler
> know it shouldn't continue to increase its footprint until the VM
> recovers. Regardless of any hard limitations/minimum guarantees.
> 
> This is what my patch does and it seems pretty straight-forward to
> me. I don't really understand why this is so controversial.

I'm not arguing that the idea behind this patch set is necessarily bad.
Quite the contrary, it does look interesting to me. I'm just saying that
IMO it can't replace hard/soft limits. It probably could if it were
possible to shrink buffers, but I don't think that's feasible, even
theoretically. That's why I propose not to change the behavior of the
existing per-memcg tcp limit at all. And frankly I don't get why you are
so keen on simplifying it. You say it's a "crapload of boilerplate
code". Well, I don't see how it is - it just replicates the global knobs,
and I don't see how it could be done in a better way. The code is hidden
behind jump labels, so the overhead is zero if it isn't used. If you
really dislike this code, we can isolate it under a separate config
option. But all right, I don't rule out the possibility that the code
could be simplified. If you do that w/o breaking it, that'll be OK with
me, but I don't see why it should be tied to this particular patch
set.
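
[ For reference, the jump-label pattern being referred to is roughly the
  sketch below; the key and helper names are made up for illustration and
  are not the actual symbols. ]

/* the memcg branch costs nothing until the first cgroup sets a limit */
struct static_key memcg_tcp_limit_enabled = STATIC_KEY_INIT_FALSE;

static inline bool memcg_tcp_limit_active(void)
{
	/* compiles to a patched-out jump when no limit is configured */
	return static_key_false(&memcg_tcp_limit_enabled);
}

/* flipped the first time someone writes memory.kmem.tcp.limit_in_bytes */
void memcg_tcp_limit_enable(void)
{
	static_key_slow_inc(&memcg_tcp_limit_enabled);
}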

> 
> The *next* step would be to figure out whether we can actually
> *reclaim* memory in the network subsystem--shrink windows and steal
> buffers back--and that might even be an avenue to replace tcp window
> limits. But it's not necessary for *this* patch series to be useful.

Again, I don't think we can *reclaim* network memory, but you're right.

> 
> > > So this seemed like a good way to prove a new mechanism before rolling
> > > it out to every single Linux setup, rather than switch everybody over
> > > after the limited scope testing I can do as a developer on my own.
> > > 
> > > Keep in mind that my patches are not committing anything in terms of
> > > interface, so we retain all the freedom to fix and tune the way this
> > > is implemented, including the freedom to re-add tcp window limits in
> > > case the pressure balancing is not a comprehensive solution.
> > 
> > I really dislike this kind of proof. It looks like you're trying to
> > push something you think is right covertly, w/o having a proper
> > discussion with networking people and then say that it just works
> > and hence should be done globally, but what if it won't? Revert it?
> > We already have a lot of dubious stuff in memcg that should be
> > reverted, so let's please try to avoid this kind of mistakes in
> > future. Note, I say "w/o having a proper discussion with networking
> > people", because I don't think they will really care *unless* you
> > change the global logic, simply because most of them aren't very
> > interested in memcg AFAICS.
> 
> Come on, Dave is the first To and netdev is CC'd. They might not care
> about memcg, but "pushing things covertly" is a bit of a stretch.

Sorry if it sounded rude to you. I'm just looking back at my experience
patching slab internals to make kmem accountable, and AFAICS Christoph
didn't really care about *what* I was doing, he only cared about the
global case - if there was no performance degradation when kmemcg was
disabled, he was usually fine with it, even if from the memcg pov it was
crap.

Anyway, I can't force you to patch the global case first or
simultaneously with the memcg case, so let's just hope I'm a bit too
overcautious.

> 
> > That effectively means you loose a chance to listen to networking
> > experts, who could point you at design flaws and propose an improvement
> > right away. Let's please not miss such an opportunity. You said that
> > you'd seen this problem happen w/o cgroups, so you have a use case that
> > might need fixing at the global level. IMO it shouldn't be difficult to
> > prepare an RFC patch for the global case first and see what people think
> > about it.
> 
> No, the problem we are running into is when network memory is not
> tracked per cgroup. The lack of containment means that the socket
> memory consumption of individual cgroups can trigger system OOM.
> 
> We tried using the per-memcg tcp limits, and that prevents the OOMs
> for sure, but it's horrendous for network performance. There is no
> "stop growing" phase, it just keeps going full throttle until it hits
> the wall hard.
> 
> Now, we could probably try to replicate the global knobs and add a
> per-memcg soft limit. But you know better than anyone else how hard it
> is to estimate the overall workingset size of a workload, and the
> margins on containerized loads are razor-thin. Performance is much
> more sensitive to input errors, and often times parameters must be
> adjusted continuously during the runtime of a workload. It'd be
> disasterous to rely on yet more static, error-prone user input here.

Yeah, but the dynamic approach proposed in your patch set doesn't
guarantee we won't hit OOM in a memcg due to overgrown buffers. It just
reduces the possibility. Of course, a memcg OOM is not nearly as
disastrous as a global one, but it still usually means workload breakage.

The static approach is error-prone for sure, but it has existed for
years and worked satisfactorily AFAIK.

> 
> What all this means to me is that fixing it on the cgroup level has
> higher priority. But it also means that once we figured it out under
> such a high-pressure environment, it's much easier to apply to the
> global case and potentially replace the soft limit there.
> 
> This seems like a better approach to me than starting globally, only
> to realize that the solution is not workable for cgroups and we need
> yet something else.
> 

Are we in a rush? I think if you try your approach at the global level and
fail, it's still good, because it will probably give us all a better
understanding of the problem. If you successfully fix the global case,
but then realize that it doesn't fit memcg, it's even better, because
you actually fixed a problem. If you patch both global and memcg cases,
it's perfect.

But of course, that's my understanding and I may be mistaken. Let's hope
you're right.

Thanks,
Vladimir

^ permalink raw reply	[flat|nested] 156+ messages in thread


* Re: [PATCH 5/8] mm: memcontrol: account socket memory on unified hierarchy
  2015-10-27 16:42                 ` Johannes Weiner
@ 2015-10-29 15:25                   ` Michal Hocko
  -1 siblings, 0 replies; 156+ messages in thread
From: Michal Hocko @ 2015-10-29 15:25 UTC (permalink / raw)
  To: Johannes Weiner
  Cc: David Miller, akpm, vdavydov, tj, netdev, linux-mm, cgroups,
	linux-kernel

On Tue 27-10-15 09:42:27, Johannes Weiner wrote:
> On Tue, Oct 27, 2015 at 05:15:54PM +0100, Michal Hocko wrote:
> > On Tue 27-10-15 11:41:38, Johannes Weiner wrote:
[...]
> Or it could be exactly the other way around when you have a workload
> that is heavy on filesystem metadata. I don't see why any scenario
> would be more important than the other.

Yes, I definitely agree. No scenario is more important. We can only
come up with a default that makes more sense for the majority and
allow the minority to override it. That was basically what I wanted to say.

> I'm not saying that distinguishing between consumers is wrong, just
> that "user memory vs kernel memory" is a false classification. Why do
> you call page cache user memory but dentry cache kernel memory? It
> doesn't make any sense.

We are not talking about dcache vs. page cache alone here, though. We
are talking about _all_ slab allocations vs. only user-accessed memory.
The slab consumption is directly under kernel control. A great pile of
this logic is completely hidden from userspace. While users can estimate
the user memory, it is hard (if at all possible) to do that for the kernel
memory footprint - not to mention that it is variable and dependent on
the particular kernel version.

> > Also kmem accounting will make the load more non-deterministic because
> > many of the resources are shared between tasks in separate cgroups
> > unless they are explicitly configured. E.g. [id]cache will be shared
> > and first to touch gets charged so you would end up with more false
> > sharing.
> 
> Exactly like page cache. This differentiation isn't based on reality.

Yes, false sharing is already an existing, long-term problem. I just
wanted to point out that the false sharing would be an even bigger
problem here, because some kernel-tracked resources are shared more
naturally than files.

> > > IMO that's an implementation detail and a historical artifact that
> > > should not be exposed to the user. And that's the thing I hate about
> > > the current opt-out knob.
> 
> You carefully skipped over this part. We can ignore it for socket
> memory but it's something we need to figure out when it comes to slab
> accounting and tracking.

I am sorry, I didn't mean to skip this part; I thought it would be clear
from the previous text. I think kmem accounting falls into the same
category. Have a sane default and a global boot-time knob to override it
for those that think differently - for whatever reason they might have.

[...]
 
> Having page cache accounting built in while presenting dentry+inode
> cache as a configurable extension is completely random and doesn't
> make sense. They are both first class memory consumers. They're not
> separate categories. One isn't more "core" than the other.

Again, we are talking about all slab allocations, not just the dcache.

> > > For now, something like this as a boot commandline?
> > > 
> > > cgroup.memory=nosocket
> > 
> > That would work for me.
> 
> Okay, then I'll go that route for the socket stuff.

Thanks!

-- 
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 156+ messages in thread


* Re: [PATCH 5/8] mm: memcontrol: account socket memory on unified hierarchy
@ 2015-10-29 16:10                     ` Johannes Weiner
  0 siblings, 0 replies; 156+ messages in thread
From: Johannes Weiner @ 2015-10-29 16:10 UTC (permalink / raw)
  To: Michal Hocko
  Cc: David Miller, akpm, vdavydov, tj, netdev, linux-mm, cgroups,
	linux-kernel

On Thu, Oct 29, 2015 at 04:25:46PM +0100, Michal Hocko wrote:
> On Tue 27-10-15 09:42:27, Johannes Weiner wrote:
> > On Tue, Oct 27, 2015 at 05:15:54PM +0100, Michal Hocko wrote:
> > > On Tue 27-10-15 11:41:38, Johannes Weiner wrote:
> > > > IMO that's an implementation detail and a historical artifact that
> > > > should not be exposed to the user. And that's the thing I hate about
> > > > the current opt-out knob.
> > 
> > You carefully skipped over this part. We can ignore it for socket
> > memory but it's something we need to figure out when it comes to slab
> > accounting and tracking.
> 
> I am sorry, I didn't mean to skip this part, I though it would be clear
> from the previous text. I think kmem accounting falls into the same
> category. Have a sane default and a global boottime knob to override it
> for those that think differently - for whatever reason they might have.

Yes, that makes sense to me.

Like cgroup.memory=nosocket, do you think it makes sense to include
slab in the default for functional/semantic completeness and provide
a cgroup.memory=noslab for power users?
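
[ A rough sketch of such a knob, with the option names following the
  question above; the parsing helper is an assumption, not code from
  this series: ]

static bool cgroup_memory_nosocket;
static bool cgroup_memory_noslab;

static int __init cgroup_memory(char *s)
{
	char *token;

	while ((token = strsep(&s, ",")) != NULL) {
		if (!*token)
			continue;
		if (!strcmp(token, "nosocket"))
			cgroup_memory_nosocket = true;
		if (!strcmp(token, "noslab"))
			cgroup_memory_noslab = true;
	}
	return 1;
}
__setup("cgroup.memory=", cgroup_memory);

Socket accounting would then only be wired up when cgroup_memory_nosocket
is false, and likewise for slab once that part exists.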

^ permalink raw reply	[flat|nested] 156+ messages in thread


* Re: [PATCH 0/8] mm: memcontrol: account socket memory in unified hierarchy
  2015-10-29  9:27               ` Vladimir Davydov
  (?)
@ 2015-10-29 17:52                 ` Johannes Weiner
  -1 siblings, 0 replies; 156+ messages in thread
From: Johannes Weiner @ 2015-10-29 17:52 UTC (permalink / raw)
  To: Vladimir Davydov
  Cc: David S. Miller, Andrew Morton, Michal Hocko, Tejun Heo, netdev,
	linux-mm, cgroups, linux-kernel

On Thu, Oct 29, 2015 at 12:27:47PM +0300, Vladimir Davydov wrote:
> On Wed, Oct 28, 2015 at 11:58:10AM -0700, Johannes Weiner wrote:
> > Having the hard limit as a failsafe (or a minimum for other consumers)
> > is one thing, and certainly something I'm open to for cgroupv2, should
> > we have problems with load startup up after a socket memory landgrab.
> > 
> > That being said, if the VM is struggling to reclaim pages, or is even
> > swapping, it makes perfect sense to let the socket memory scheduler
> > know it shouldn't continue to increase its footprint until the VM
> > recovers. Regardless of any hard limitations/minimum guarantees.
> > 
> > This is what my patch does and it seems pretty straight-forward to
> > me. I don't really understand why this is so controversial.
> 
> I'm not arguing that the idea behind this patch set is necessarily bad.
> Quite the contrary, it does look interesting to me. I'm just saying that
> IMO it can't replace hard/soft limits. It probably could if it was
> possible to shrink buffers, but I don't think it's feasible, even
> theoretically. That's why I propose not to change the behavior of the
> existing per memcg tcp limit at all. And frankly I don't get why you are
> so keen on simplifying it. You say it's a "crapload of boilerplate
> code". Well, I don't see how it is - it just replicates global knobs and
> I don't see how it could be done in a better way. The code is hidden
> behind jump labels, so the overhead is zero if it isn't used. If you
> really dislike this code, we can isolate it under a separate config
> option. But all right, I don't rule out the possibility that the code
> could be simplified. If you do that w/o breaking it, that'll be OK to
> me, but I don't see why it should be related to this particular patch
> set.

Okay, I see your concern.

I'm not trying to change the behavior, just the implementation,
because it's too complex for the functionality it actually provides.

And the reason it's part of this patch set is because I'm using the
same code to hook into the memory accounting, so it makes sense to
refactor this stuff in the same go.

There is also a niceness factor of not adding more memcg callbacks to
the networking subsystem when there is an option to consolidate them.

Now, you mentioned that you'd rather see the socket buffers accounted
at the allocator level, but I looked at the different allocation paths
and network protocols and I'm not convinced that this makes sense. We
don't want to be in the hotpath of every single packet when a lot of
them are small, short-lived management blips that the kernel disposes
of without ever involving user space.

__sk_mem_schedule(), on the other hand, is already wired up to exactly
those consumers we are interested in for memory isolation: those with
bigger chunks of data attached to them and those whose receive queues
explode when userspace fails to read(). UDP and TCP.

I mean, there is a reason why the global memory limits apply to only
those types of packets in the first place: everything else is noise.

I agree that it's appealing to account at the allocator level and set
page->mem_cgroup etc., but then we'd pay extra to capture a lot of
noise, and I don't want to pay that just for aesthetics. In this case
it's better to track ownership at the socket level and only count
packets that can accumulate a significant amount of memory.
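
[ For illustration, hooking in there could look roughly like this.
  sk->sk_memcg, mem_cgroup_charge_skmem() and mem_cgroup_uncharge_skmem()
  stand in for the association and charge helpers this series introduces,
  and the existing global tcp_mem/udp_mem handling is elided: ]

int __sk_mem_schedule(struct sock *sk, int size, int kind)
{
	int pages = sk_mem_pages(size);

	sk->sk_forward_alloc += pages << SK_MEM_QUANTUM_SHIFT;

	/* charge the owning cgroup; back off if it is under pressure */
	if (sk->sk_memcg && !mem_cgroup_charge_skmem(sk->sk_memcg, pages)) {
		mem_cgroup_uncharge_skmem(sk->sk_memcg, pages);
		sk->sk_forward_alloc -= pages << SK_MEM_QUANTUM_SHIFT;
		return 0;
	}

	/* ... existing global protocol memory limit handling ... */
	return 1;
}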

> > We tried using the per-memcg tcp limits, and that prevents the OOMs
> > for sure, but it's horrendous for network performance. There is no
> > "stop growing" phase, it just keeps going full throttle until it hits
> > the wall hard.
> > 
> > Now, we could probably try to replicate the global knobs and add a
> > per-memcg soft limit. But you know better than anyone else how hard it
> > is to estimate the overall workingset size of a workload, and the
> > margins on containerized loads are razor-thin. Performance is much
> > more sensitive to input errors, and often times parameters must be
> > adjusted continuously during the runtime of a workload. It'd be
> > disasterous to rely on yet more static, error-prone user input here.
> 
> Yeah, but the dynamic approach proposed in your patch set doesn't
> guarantee we won't hit OOM in memcg due to overgrown buffers. It just
> reduces this possibility. Of course, memcg OOM is far not as disastrous
> as the global one, but still it usually means the workload breakage.

Right now, the entire machine breaks. Confining the breakage to the
faulty memcg, as well as reducing the likelihood of that OOM in many
cases, seems like a good move in the right direction, no?

And how likely are memcg OOMs because of this anyway? There is of
course a scenario imaginable where the packets pile up, followed by
some *other* part of the workload, the one that doesn't read() and
process packets, trying to expand--which then doesn't work and goes
OOM. But that seems like a complete corner case. In the vast majority
of cases, the application will be in full operation and just fail to
read() fast enough--because the network bandwidth is enormous compared
to the container's size, or because it shares the CPU with thousands
of other workloads and there is scheduling latency.

This would be the perfect point to rein in the transmit window...
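
[ The natural hook for that already exists: the send buffer growth check
  backs off under global socket memory pressure. Roughly, simplified from
  the existing tcp_should_expand_sndbuf(), with the memcg-aware pressure
  state being the part this series would feed in: ]

static bool tcp_should_expand_sndbuf(const struct sock *sk)
{
	const struct tcp_sock *tp = tcp_sk(sk);

	/* don't grow the window while memory is under pressure */
	if (sk_under_memory_pressure(sk))
		return false;

	/* don't grow past the protocol's memory limits */
	if (sk_memory_allocated(sk) >= sk_prot_mem_limits(sk, 0))
		return false;

	/* don't grow if the congestion window is already filled */
	if (tp->packets_out >= tp->snd_cwnd)
		return false;

	return true;
}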

> The static approach is error-prone for sure, but it has existed for
> years and worked satisfactory AFAIK.

...but that point is not a fixed amount of memory consumed. It depends
on the workload and the random interactions it's having with thousands
of other containers on that same machine.

The point of containers is to maximize utilization of your hardware
and systematically eliminate slack in the system. But it's exactly
that slack on dedicated bare-metal machines that allowed us to take a
wild guess at the settings and then tune them based on observing a
handful of workloads. This approach is not going to work anymore when
we pack the machine to capacity and still expect every single
container out of thousands to perform well. We need that automation.

The fact that the static setting works okay at the global level is also
why I'm not interested in starting to experiment with it. There is no reason
to change it. It's much more likely that any attempt to change it will
be shot down, not because of the approach chosen, but because there is
no problem to solve there. I doubt we can get networking people to
care about containers by screwing with things that work for them ;-)

^ permalink raw reply	[flat|nested] 156+ messages in thread

* Re: [PATCH 0/8] mm: memcontrol: account socket memory in unified hierarchy
@ 2015-10-29 17:52                 ` Johannes Weiner
  0 siblings, 0 replies; 156+ messages in thread
From: Johannes Weiner @ 2015-10-29 17:52 UTC (permalink / raw)
  To: Vladimir Davydov
  Cc: David S. Miller, Andrew Morton, Michal Hocko, Tejun Heo,
	netdev-u79uwXL29TY76Z2rM5mHXA, linux-mm-Bw31MaZKKs3YtjvyW6yDsg,
	cgroups-u79uwXL29TY76Z2rM5mHXA,
	linux-kernel-u79uwXL29TY76Z2rM5mHXA

On Thu, Oct 29, 2015 at 12:27:47PM +0300, Vladimir Davydov wrote:
> On Wed, Oct 28, 2015 at 11:58:10AM -0700, Johannes Weiner wrote:
> > Having the hard limit as a failsafe (or a minimum for other consumers)
> > is one thing, and certainly something I'm open to for cgroupv2, should
> > we have problems with load startup up after a socket memory landgrab.
> > 
> > That being said, if the VM is struggling to reclaim pages, or is even
> > swapping, it makes perfect sense to let the socket memory scheduler
> > know it shouldn't continue to increase its footprint until the VM
> > recovers. Regardless of any hard limitations/minimum guarantees.
> > 
> > This is what my patch does and it seems pretty straight-forward to
> > me. I don't really understand why this is so controversial.
> 
> I'm not arguing that the idea behind this patch set is necessarily bad.
> Quite the contrary, it does look interesting to me. I'm just saying that
> IMO it can't replace hard/soft limits. It probably could if it was
> possible to shrink buffers, but I don't think it's feasible, even
> theoretically. That's why I propose not to change the behavior of the
> existing per memcg tcp limit at all. And frankly I don't get why you are
> so keen on simplifying it. You say it's a "crapload of boilerplate
> code". Well, I don't see how it is - it just replicates global knobs and
> I don't see how it could be done in a better way. The code is hidden
> behind jump labels, so the overhead is zero if it isn't used. If you
> really dislike this code, we can isolate it under a separate config
> option. But all right, I don't rule out the possibility that the code
> could be simplified. If you do that w/o breaking it, that'll be OK to
> me, but I don't see why it should be related to this particular patch
> set.

Okay, I see your concern.

I'm not trying to change the behavior, just the implementation,
because it's too complex for the functionality it actually provides.

And the reason it's part of this patch set is because I'm using the
same code to hook into the memory accounting, so it makes sense to
refactor this stuff in the same go.

There is also a niceness factor of not adding more memcg callbacks to
the networking subsystem when there is an option to consolidate them.

Now, you mentioned that you'd rather see the socket buffers accounted
at the allocator level, but I looked at the different allocation paths
and network protocols and I'm not convinced that this makes sense. We
don't want to be in the hotpath of every single packet when a lot of
them are small, short-lived management blips that don't involve user
space to let the kernel dispose of them.

__sk_mem_schedule() on the other hand is already wired up to exactly
those consumers we are interested in for memory isolation: those with
bigger chunks of data attached to them and those that have exploding
receive queues when userspace fails to read(). UDP and TCP.

I mean, there is a reason why the global memory limits apply to only
those types of packets in the first place: everything else is noise.

I agree that it's appealing to account at the allocator level and set
page->mem_cgroup etc. but in this case we'd pay extra to capture a lot
of noise, and I don't want to pay that just for aesthetics. In this
case it's better to track ownership on the socket level and only count
packets that can accumulate a significant amount of memory consumed.

> > We tried using the per-memcg tcp limits, and that prevents the OOMs
> > for sure, but it's horrendous for network performance. There is no
> > "stop growing" phase, it just keeps going full throttle until it hits
> > the wall hard.
> > 
> > Now, we could probably try to replicate the global knobs and add a
> > per-memcg soft limit. But you know better than anyone else how hard it
> > is to estimate the overall workingset size of a workload, and the
> > margins on containerized loads are razor-thin. Performance is much
> > more sensitive to input errors, and often times parameters must be
> > adjusted continuously during the runtime of a workload. It'd be
> > disastrous to rely on yet more static, error-prone user input here.
> 
> Yeah, but the dynamic approach proposed in your patch set doesn't
> guarantee we won't hit OOM in memcg due to overgrown buffers. It just
> reduces this possibility. Of course, memcg OOM is not nearly as disastrous
> as the global one, but it still usually means workload breakage.

Right now, the entire machine breaks. Confining it to a faulty memcg,
as well as reducing the likelihood of that OOM in many cases seems
like a good move in the right direction, no?

And how likely are memcg OOMs because of this anyway? There is of
course a scenario imaginable where the packets pile up, followed by
some *other* part of the workload, the one that doesn't read() and
process packets, trying to expand--which then doesn't work and goes
OOM. But that seems like a complete corner case. In the vast majority
of cases, the application will be in full operation and just fail to
read() fast enough--because the network bandwidth is enormous compared
to the container's size, or because it shares the CPU with thousands
of other workloads and there is scheduling latency.

This would be the perfect point to rein in the transmit window...

> The static approach is error-prone for sure, but it has existed for
> years and worked satisfactorily AFAIK.

...but that point is not a fixed amount of memory consumed. It depends
on the workload and the random interactions it's having with thousands
of other containers on that same machine.

The point of containers is to maximize utilization of your hardware
and systematically eliminate slack in the system. But it's exactly
that slack on dedicated bare-metal machines that allowed us to take a
wild guess at the settings and then tune them based on observing a
handful of workloads. This approach is not going to work anymore when
we pack the machine to capacity and still expect every single
container out of thousands to perform well. We need that automation.

The static setting working okay on the global level is also why I'm
not interested in starting to experiment with it. There is no reason
to change it. It's much more likely that any attempt to change it will
be shot down, not because of the approach chosen, but because there is
no problem to solve there. I doubt we can get networking people to
care about containers by screwing with things that work for them ;-)

^ permalink raw reply	[flat|nested] 156+ messages in thread

* Re: [PATCH 0/8] mm: memcontrol: account socket memory in unified hierarchy
  2015-10-29 17:52                 ` Johannes Weiner
@ 2015-11-02 14:47                   ` Vladimir Davydov
  -1 siblings, 0 replies; 156+ messages in thread
From: Vladimir Davydov @ 2015-11-02 14:47 UTC (permalink / raw)
  To: Johannes Weiner
  Cc: David S. Miller, Andrew Morton, Michal Hocko, Tejun Heo, netdev,
	linux-mm, cgroups, linux-kernel

On Thu, Oct 29, 2015 at 10:52:28AM -0700, Johannes Weiner wrote:
...
> Now, you mentioned that you'd rather see the socket buffers accounted
> at the allocator level, but I looked at the different allocation paths
> and network protocols and I'm not convinced that this makes sense. We
> don't want to be in the hotpath of every single packet when a lot of
> them are small, short-lived management blips that don't involve user
> space to let the kernel dispose of them.
> 
> __sk_mem_schedule() on the other hand is already wired up to exactly
> those consumers we are interested in for memory isolation: those with
> bigger chunks of data attached to them and those that have exploding
> receive queues when userspace fails to read(). UDP and TCP.
> 
> I mean, there is a reason why the global memory limits apply to only
> those types of packets in the first place: everything else is noise.
> 
> I agree that it's appealing to account at the allocator level and set
> page->mem_cgroup etc. but in this case we'd pay extra to capture a lot
> of noise, and I don't want to pay that just for aesthetics. In this
> case it's better to track ownership on the socket level and only count
> packets that can accumulate a significant amount of memory consumed.

Sigh, you seem to be right. Moreover, I can't even think of a neat way
to account skb pages to memcg, because rcv skbs are generated in device
drivers, where we don't know which socket/memcg they will go to. We could
recharge individual pages when the skb gets to the network or transport
layer, but it would result in unjustified overhead.

> 
> > > We tried using the per-memcg tcp limits, and that prevents the OOMs
> > > for sure, but it's horrendous for network performance. There is no
> > > "stop growing" phase, it just keeps going full throttle until it hits
> > > the wall hard.
> > > 
> > > Now, we could probably try to replicate the global knobs and add a
> > > per-memcg soft limit. But you know better than anyone else how hard it
> > > is to estimate the overall workingset size of a workload, and the
> > > margins on containerized loads are razor-thin. Performance is much
> > > more sensitive to input errors, and often times parameters must be
> > > adjusted continuously during the runtime of a workload. It'd be
> > > disastrous to rely on yet more static, error-prone user input here.
> > 
> > Yeah, but the dynamic approach proposed in your patch set doesn't
> > guarantee we won't hit OOM in memcg due to overgrown buffers. It just
> > reduces this possibility. Of course, memcg OOM is not nearly as disastrous
> > as the global one, but it still usually means workload breakage.
> 
> Right now, the entire machine breaks. Confining it to a faulty memcg,
> as well as reducing the likelihood of that OOM in many cases seems
> like a good move in the right direction, no?

It seems so. However, memcg OOM is also bad; we should strive to avoid it
if we can.

> 
> And how likely are memcg OOMs because of this anyway? There is of

Frankly, I've no idea. Your arguments below sound reassuring though.

> course a scenario imaginable where the packets pile up, followed by
> some *other* part of the workload, the one that doesn't read() and
> process packets, trying to expand--which then doesn't work and goes
> OOM. But that seems like a complete corner case. In the vast majority
> of cases, the application will be in full operation and just fail to
> read() fast enough--because the network bandwidth is enormous compared
> to the container's size, or because it shares the CPU with thousands
> of other workloads and there is scheduling latency.
> 
> This would be the perfect point to rein in the transmit window...
> 
> > The static approach is error-prone for sure, but it has existed for
> > years and worked satisfactorily AFAIK.
> 
> ...but that point is not a fixed amount of memory consumed. It depends
> on the workload and the random interactions it's having with thousands
> of other containers on that same machine.
> 
> The point of containers is to maximize utilization of your hardware
> and systematically eliminate slack in the system. But it's exactly
> that slack on dedicated bare-metal machines that allowed us to take a
> wild guess at the settings and then tune them based on observing a
> handful of workloads. This approach is not going to work anymore when
> we pack the machine to capacity and still expect every single
> container out of thousands to perform well. We need that automation.

But we do use a static approach when setting memory limits, no?
memory.{low,high,max} - they are all static.

I understand it's appealing to have just one knob - memory size - like
in the case of virtual machines, but it doesn't seem to work with
containers. You added the memory.low and memory.high knobs. VMs don't
have anything like that. How is one supposed to set them? Depends on
the workload, I guess. Also, there is the pids cgroup for limiting the
number of pids that can be used by a cgroup, because a pid turns out to
be a resource in the case of containers. Maybe the tcp window should be
considered a separate resource as well, as it is now, and not go into
memcg? I'm just wondering...
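
Just to illustrate how static those knobs are in practice, here is a
minimal sketch of setting them from a program, assuming a hypothetical
group at /sys/fs/cgroup/web and made-up byte values:

/*
 * Illustration only: the unified hierarchy knobs are plain static
 * values written into cgroupfs.  The group name and numbers are
 * hypothetical.
 */
#include <stdio.h>

static int write_knob(const char *path, const char *val)
{
	FILE *f = fopen(path, "w");

	if (!f)
		return -1;
	fprintf(f, "%s\n", val);
	return fclose(f);
}

int main(void)
{
	write_knob("/sys/fs/cgroup/web/memory.low",  "1073741824"); /* 1G protected */
	write_knob("/sys/fs/cgroup/web/memory.high", "2147483648"); /* 2G pressure  */
	write_knob("/sys/fs/cgroup/web/memory.max",  "3221225472"); /* 3G hard cap  */
	return 0;
}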

> 
> The static setting working okay on the global level is also why I'm
> not interested in starting to experiment with it. There is no reason
> to change it. It's much more likely that any attempt to change it will
> be shot down, not because of the approach chosen, but because there is
> no problem to solve there. I doubt we can get networking people to
> care about containers by screwing with things that work for them ;-)

Fair enough.

Thanks,
Vladimir

^ permalink raw reply	[flat|nested] 156+ messages in thread

* Re: [PATCH 5/8] mm: memcontrol: account socket memory on unified hierarchy
  2015-10-29 16:10                     ` Johannes Weiner
@ 2015-11-04 10:42                       ` Michal Hocko
  -1 siblings, 0 replies; 156+ messages in thread
From: Michal Hocko @ 2015-11-04 10:42 UTC (permalink / raw)
  To: Johannes Weiner
  Cc: David Miller, akpm, vdavydov, tj, netdev, linux-mm, cgroups,
	linux-kernel

On Thu 29-10-15 09:10:09, Johannes Weiner wrote:
> On Thu, Oct 29, 2015 at 04:25:46PM +0100, Michal Hocko wrote:
> > On Tue 27-10-15 09:42:27, Johannes Weiner wrote:
[...]
> > > You carefully skipped over this part. We can ignore it for socket
> > > memory but it's something we need to figure out when it comes to slab
> > > accounting and tracking.
> > 
> > I am sorry, I didn't mean to skip this part, I thought it would be clear
> > from the previous text. I think kmem accounting falls into the same
> > category. Have a sane default and a global boottime knob to override it
> > for those that think differently - for whatever reason they might have.
> 
> Yes, that makes sense to me.
> 
> Like cgroup.memory=nosocket, would you think it makes sense to include
> slab in the default for functional/semantical completeness and provide
> a cgroup.memory=noslab for powerusers?

I am still not sure whether the kmem accounting is stable enough to be
enabled by default. If nothing else, the allocation failures, which
are not allowed in the global case and are easily triggered by the hard
limit, might be a big problem. My last attempts to allow GFP_NOFS to
fail made me quite skeptical. I still believe this is something which
will be solved in the long term, but the current state might still be too
fragile. So I would rather be conservative and have the kmem accounting
disabled by default with a config option and boot parameter to override.
If somebody is confident that the desired load is stable then the config
can be enabled easily.
-- 
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 156+ messages in thread

* Re: [PATCH 5/8] mm: memcontrol: account socket memory on unified hierarchy
@ 2015-11-04 19:50                         ` Johannes Weiner
  0 siblings, 0 replies; 156+ messages in thread
From: Johannes Weiner @ 2015-11-04 19:50 UTC (permalink / raw)
  To: Michal Hocko
  Cc: David Miller, akpm, vdavydov, tj, netdev, linux-mm, cgroups,
	linux-kernel

On Wed, Nov 04, 2015 at 11:42:40AM +0100, Michal Hocko wrote:
> On Thu 29-10-15 09:10:09, Johannes Weiner wrote:
> > On Thu, Oct 29, 2015 at 04:25:46PM +0100, Michal Hocko wrote:
> > > On Tue 27-10-15 09:42:27, Johannes Weiner wrote:
> [...]
> > > > You carefully skipped over this part. We can ignore it for socket
> > > > memory but it's something we need to figure out when it comes to slab
> > > > accounting and tracking.
> > > 
> > > I am sorry, I didn't mean to skip this part, I thought it would be clear
> > > from the previous text. I think kmem accounting falls into the same
> > > category. Have a sane default and a global boottime knob to override it
> > > for those that think differently - for whatever reason they might have.
> > 
> > Yes, that makes sense to me.
> > 
> > Like cgroup.memory=nosocket, would you think it makes sense to include
> > slab in the default for functional/semantical completeness and provide
> > a cgroup.memory=noslab for powerusers?
> 
> I am still not sure whether the kmem accounting is stable enough to be
> enabled by default. If nothing else, the allocation failures, which
> are not allowed in the global case and are easily triggered by the hard
> limit, might be a big problem. My last attempts to allow GFP_NOFS to
> fail made me quite skeptical. I still believe this is something which
> will be solved in the long term, but the current state might still be too
> fragile. So I would rather be conservative and have the kmem accounting
> disabled by default with a config option and boot parameter to override.
> If somebody is confident that the desired load is stable then the config
> can be enabled easily.

I agree with your assessment of the current kmem code state, but I
think your conclusion is completely backwards here.

The interface will be set in stone forever, whereas any stability
issues will be transient and will have to be addressed in a finite
amount of time anyway. It doesn't make sense to design an interface
based on temporary quality of implementation. Only one of those two
can ever be changed.

Because it goes without saying that once the cgroupv2 interface is
released, and people use it in production, there is no way we can then
*add* dentry cache, inode cache, and others to memory.current. That
would be an unacceptable change in interface behavior. On the other
hand, people will be prepared for hiccups in the early stages of
cgroupv2 release, and we're providing cgroup.memory=noslab to let them
work around severe problems in production until we fix it without
forcing them to fully revert to cgroupv1.
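
As a rough sketch of what parsing such a boot option could look like,
here is a small userspace model of splitting a cgroup.memory= option
list; the flag names are illustrative only and not a committed
interface (in the kernel this would sit behind an early __setup()
handler):

/*
 * Userspace model of parsing a "cgroup.memory=" style option list,
 * e.g. "nosocket,noslab".  Flag names are illustrative.
 */
#include <stdio.h>
#include <string.h>
#include <stdbool.h>

static bool cgroup_memory_nosocket;
static bool cgroup_memory_noslab;

static void cgroup_memory(char *s)
{
	char *token;

	while ((token = strsep(&s, ",")) != NULL) {
		if (!*token)
			continue;
		if (!strcmp(token, "nosocket"))
			cgroup_memory_nosocket = true;
		else if (!strcmp(token, "noslab"))
			cgroup_memory_noslab = true;
	}
}

int main(void)
{
	char arg[] = "nosocket,noslab";

	cgroup_memory(arg);
	printf("socket accounting: %s, slab accounting: %s\n",
	       cgroup_memory_nosocket ? "off" : "on",
	       cgroup_memory_noslab ? "off" : "on");
	return 0;
}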

So if we agree that there are no fundamental architectural concerns
with slab accounting, i.e. nothing that can't be addressed in the
implementation, we have to make the call now.

And I maintain that not accounting dentry cache and inode cache is a
gaping hole in memory isolation, so it should be included by default.
(The rest of the slabs is arguable, but IMO the risk of missing
something important is higher than the cost of including them.)


As far as your allocation failure concerns go, I think the kmem code
is currently not behaving as Glauber originally intended, which is to
force charge if reclaim and OOM killing weren't able to make enough
space. See this recently rewritten section of the kmem charge path:

-               /*
-                * try_charge() chose to bypass to root due to OOM kill or
-                * fatal signal.  Since our only options are to either fail
-                * the allocation or charge it to this cgroup, do it as a
-                * temporary condition. But we can't fail. From a kmem/slab
-                * perspective, the cache has already been selected, by
-                * mem_cgroup_kmem_get_cache(), so it is too late to change
-                * our minds.
-                *
-                * This condition will only trigger if the task entered
-                * memcg_charge_kmem in a sane state, but was OOM-killed
-                * during try_charge() above. Tasks that were already dying
-                * when the allocation triggers should have been already
-                * directed to the root cgroup in memcontrol.h
-                */
-               page_counter_charge(&memcg->memory, nr_pages);
-               if (do_swap_account)
-                       page_counter_charge(&memcg->memsw, nr_pages);

It could be that this never properly worked as it was tied to the
-EINTR bypass trick, but the idea was these charges never fail.

And this makes sense. If the allocator semantics are such that we
never fail these page allocations for slab, and the callsites rely on
that, surely we should not fail them in the memory controller, either.

And it makes a lot more sense to account them in excess of the limit
than pretend they don't exist. We might not be able to completely
fulfill the containment part of the memory controller (although these
slab charges will still create significant pressure before that), but
at least we don't fail the accounting part on top of it.
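
To make the force-charge idea concrete, here is a minimal userspace
model: try the limited charge first and, if that fails, account the
pages in excess of the limit rather than failing the allocation. The
page_counter below is a simplified stand-in, not the kernel's actual
API:

#include <stdio.h>
#include <stdbool.h>

struct page_counter {
	long count;
	long limit;
};

/* Fail the charge if it would exceed the limit. */
static bool page_counter_try_charge(struct page_counter *c, long pages)
{
	if (c->count + pages > c->limit)
		return false;
	c->count += pages;
	return true;
}

/* Charge unconditionally, even past the limit. */
static void page_counter_charge(struct page_counter *c, long pages)
{
	c->count += pages;
}

/*
 * A kmem charge that is not allowed to fail: if reclaim/OOM (elided
 * here) cannot make room, account the pages over the limit instead of
 * failing the slab allocation.
 */
static void kmem_charge(struct page_counter *memory, long pages)
{
	if (page_counter_try_charge(memory, pages))
		return;
	/* reclaim and OOM killing would go here */
	page_counter_charge(memory, pages);
}

int main(void)
{
	struct page_counter memory = { .count = 30, .limit = 32 };

	kmem_charge(&memory, 4);	/* would exceed the limit... */
	printf("usage %ld / limit %ld\n", memory.count, memory.limit);
	return 0;			/* ...but the accounting never failed */
}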

^ permalink raw reply	[flat|nested] 156+ messages in thread

* Re: [PATCH 5/8] mm: memcontrol: account socket memory on unified hierarchy
  2015-11-05 14:40                           ` Michal Hocko
@ 2015-11-05 16:16                             ` David Miller
  -1 siblings, 0 replies; 156+ messages in thread
From: David Miller @ 2015-11-05 16:16 UTC (permalink / raw)
  To: mhocko
  Cc: hannes, akpm, vdavydov, tj, netdev, linux-mm, cgroups, linux-kernel

From: Michal Hocko <mhocko@kernel.org>
Date: Thu, 5 Nov 2015 15:40:02 +0100

> On Wed 04-11-15 14:50:37, Johannes Weiner wrote:
> [...]
>> Because it goes without saying that once the cgroupv2 interface is
>> released, and people use it in production, there is no way we can then
>> *add* dentry cache, inode cache, and others to memory.current. That
>> would be an unacceptable change in interface behavior.
> 
> They would still have to _enable_ the config option _explicitly_. make
> oldconfig wouldn't change it silently for them. I do not think
> it is an unacceptable change of behavior if the config is changed
> explicitly.

Every user is going to get this config option when they update their
distribution kernel or whatever.

Then they will all wonder why their networking performance went down.

This is why I do not want the networking accounting bits on by default
even if the kconfig option is enabled.  They must be off by default
and guarded by a static branch so the cost is exactly zero.
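
Roughly the kind of guard I have in mind, just as a sketch (the key name
and the slow-path helper below are made up, not taken from the series):

/*
 * Sketch only: all socket-memory accounting sits behind a static key
 * that stays patched out until somebody actually configures a limit,
 * so the default path costs a single no-op jump.
 */
#include <linux/jump_label.h>
#include <net/sock.h>

DEFINE_STATIC_KEY_FALSE(memcg_sockets_enabled_key);

/* Hypothetical slow path doing the real accounting work. */
extern bool do_charge_skmem(struct sock *sk, unsigned int nr_pages);

static inline bool sk_memcg_charge(struct sock *sk, unsigned int nr_pages)
{
	if (!static_branch_unlikely(&memcg_sockets_enabled_key))
		return true;	/* accounting off: no overhead, always succeed */

	return do_charge_skmem(sk, nr_pages);
}

/* Flipped only when a memory limit is actually set on some cgroup. */
static void memcg_sockets_enable(void)
{
	static_branch_enable(&memcg_sockets_enabled_key);
}

With something like that, a kernel built with the option but with no
limits configured should behave exactly like one built without it.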

^ permalink raw reply	[flat|nested] 156+ messages in thread

* Re: [PATCH 5/8] mm: memcontrol: account socket memory on unified hierarchy
@ 2015-11-05 16:28                               ` Michal Hocko
  0 siblings, 0 replies; 156+ messages in thread
From: Michal Hocko @ 2015-11-05 16:28 UTC (permalink / raw)
  To: David Miller
  Cc: hannes, akpm, vdavydov, tj, netdev, linux-mm, cgroups, linux-kernel

On Thu 05-11-15 11:16:09, David S. Miller wrote:
> From: Michal Hocko <mhocko@kernel.org>
> Date: Thu, 5 Nov 2015 15:40:02 +0100
> 
> > On Wed 04-11-15 14:50:37, Johannes Weiner wrote:
> > [...]
> >> Because it goes without saying that once the cgroupv2 interface is
> >> released, and people use it in production, there is no way we can then
> >> *add* dentry cache, inode cache, and others to memory.current. That
> >> would be an unacceptable change in interface behavior.
> > 
> > They would still have to _enable_ the config option _explicitly_. make
> > oldconfig wouldn't change it silently for them. I do not think
> > it is an unacceptable change of behavior if the config is changed
> > explicitly.
> 
> Every user is going to get this config option when they update their
> distibution kernel or whatever.
> 
> Then they will all wonder why their networking performance went down.
> 
> This is why I do not want the networking accounting bits on by default
> even if the kconfig option is enabled.  They must be off by default
> and guarded by a static branch so the cost is exactly zero.

Yes, that part is clear, and Johannes made it clear that the kmem tcp
part is disabled by default. Or are you also considering all the slab
usage by the networking code?

-- 
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 156+ messages in thread

* Re: [PATCH 5/8] mm: memcontrol: account socket memory on unified hierarchy
  2015-11-05 16:28                               ` Michal Hocko
@ 2015-11-05 16:30                                 ` David Miller
  -1 siblings, 0 replies; 156+ messages in thread
From: David Miller @ 2015-11-05 16:30 UTC (permalink / raw)
  To: mhocko
  Cc: hannes, akpm, vdavydov, tj, netdev, linux-mm, cgroups, linux-kernel

From: Michal Hocko <mhocko@kernel.org>
Date: Thu, 5 Nov 2015 17:28:03 +0100

> Yes, that part is clear and Johannes made it clear that the kmem tcp
> part is disabled by default. Or are you considering also all the slab
> usage by the networking code as well?

I'm still thinking about the implications of that aspect, and will
comment when I have something coherent to say about it.

^ permalink raw reply	[flat|nested] 156+ messages in thread

* Re: [PATCH 5/8] mm: memcontrol: account socket memory on unified hierarchy
  2015-11-05 14:40                           ` Michal Hocko
@ 2015-11-05 20:55                             ` Johannes Weiner
  -1 siblings, 0 replies; 156+ messages in thread
From: Johannes Weiner @ 2015-11-05 20:55 UTC (permalink / raw)
  To: Michal Hocko
  Cc: David Miller, akpm, vdavydov, tj, netdev, linux-mm, cgroups,
	linux-kernel

On Thu, Nov 05, 2015 at 03:40:02PM +0100, Michal Hocko wrote:
> On Wed 04-11-15 14:50:37, Johannes Weiner wrote:
> [...]
> > Because it goes without saying that once the cgroupv2 interface is
> > released, and people use it in production, there is no way we can then
> > *add* dentry cache, inode cache, and others to memory.current. That
> > would be an unacceptable change in interface behavior.
> 
> They would still have to _enable_ the config option _explicitly_. make
> oldconfig wouldn't change it silently for them. I do not think
> it is an unacceptable change of behavior if the config is changed
> explicitly.

Yeah, as Dave said these will all get turned on anyway, so there is no
point in fragmenting the Kconfig space in the first place.

> > On the other
> > hand, people will be prepared for hiccups in the early stages of
> > cgroupv2 release, and we're providing cgroup.memory=noslab to let them
> > workaround severe problems in production until we fix it without
> > forcing them to fully revert to cgroupv1.
> 
> This would be true if they moved on to the new cgroup API intentionally.
> The reality is more complicated though. AFAIK sysmted is waiting for
> cgroup2 already and privileged services enable all available resource
> controllers by default as I've learned just recently.

Have you filed a report with them? I don't think they should turn them
on unless users explicitly configure resource control for the unit.

But what I said still holds: critical production machines don't just
get rolling updates and "accidentally" switch to all this new
code. And those that do take the plunge have the cmdline options.

> > And it makes a lot more sense to account them in excess of the limit
> > than pretend they don't exist. We might not be able to completely
> > fullfill the containment part of the memory controller (although these
> > slab charges will still create significant pressure before that), but
> > at least we don't fail the accounting part on top of it.
> 
> Hmm, wouldn't that kill the whole purpose of the kmem accounting? Any
> load could simply runaway via kernel allocations. What is even worse we
> might even not trigger memcg OOM killer before we hit the global OOM. So
> the whole containment goes straight to hell.
>
> I can see four options here:
> 1) enable kmem by default with the current semantic which we know can
>    BUG_ON (at least btrfs is known to hit this) or lead to other issues.

Can you point me to that report?

That's not "semantics", that's a bug! Whether or not a feature is
enabled by default, it can not be allowed to crash the kernel.

Presenting this as a choice is a bit of a strawman argument.

> 2) enable kmem by default and change the semantic for cgroup2 to allow
>    runaway charges above the hard limit which would defeat the whole
>    purpose of the containment for cgroup2. This can be a temporary
>    workaround until we can afford kmem failures. This has a big risk
>    that we will end up with this permanently because there is a strong
>    pressure that GFP_KERNEL allocations should never fail. Yet this is
>    the most common type of request. Or do we change the consistency with
>    the global case at some point?

As per 1) we *have* to fail containment eventually if not doing so
means crashes and lockups. That's not a choice of semantics.

But that doesn't mean we have to give up *immediately* and allow
unrestrained "runaway charges"--again, more of a strawman than a
choice. We can still throttle the allocator and apply significant
pressure on the memory pool, culminating in OOM kills eventually.

Once we run out of available containment tools, however, we *have* to
follow the semantics of the page and slab allocator and succeed the
request. We can not just return -ENOMEM if that causes kernel bugs.

That's the only thing we can do right now.
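
To make that concrete, a rough sketch of the charge order I mean
(structure and names are purely illustrative, not the actual memcontrol
code):

/*
 * Sketch only: keep applying containment (reclaim, then OOM kill) and,
 * only when those are exhausted, force the charge over the limit
 * instead of failing an allocation the caller cannot handle.
 */
#include <linux/types.h>

struct memcg_stub {
	unsigned long usage;	/* pages currently charged */
	unsigned long limit;	/* hard limit in pages */
};

/* Hypothetical containment tools. */
extern bool reclaim_from(struct memcg_stub *memcg, unsigned long nr_pages);
extern void oom_kill_in(struct memcg_stub *memcg);

static void kmem_charge_cannot_fail(struct memcg_stub *memcg,
				    unsigned long nr_pages)
{
	int retries = 5;	/* arbitrary bound for the sketch */

	while (memcg->usage + nr_pages > memcg->limit && retries--) {
		if (reclaim_from(memcg, nr_pages))
			continue;
		oom_kill_in(memcg);
	}

	/*
	 * Containment exhausted: charge past the limit rather than
	 * return -ENOMEM.  The excess keeps the group under pressure
	 * until the next reclaimable allocation pulls it back down.
	 */
	memcg->usage += nr_pages;
}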

In fact, it's likely going to be the best we will ever be able to do
when it comes to kernel memory accounting. Linus made it clear where
he stands on failing kernel allocations, so all we can do is continue
to improve our containment tools and then give up on containment when
they're exhausted and force the charge past the limit.

> 3) keep only some (safe) cache types enabled by default with the current
>    failing semantic and require an explicit enabling for the complete
>    kmem accounting. [di]cache code paths should be quite robust to
>    handle allocation failures.

Vladimir, what would be your opinion on this?

> 4) disable kmem by default and change the config default later to signal
>    the accounting is safe as far as we are aware and let people enable
>    the functionality on those basis. We would keep the current failing
>    semantic.
> 
> To me 4) sounds like the safest option because it still keeps the
> functionality available to those who can benefit from it in v1 already
> while we are not exposing a potentially buggy behavior to the majority
> (many of them even unintentionally). Moreover we still allow to change
> the default later on an explicit basis.

I'm not interested in fragmenting the interface forever out of caution
because there might be a bug in the implementation right now. As I said,
we have to fix any instability in the features we provide, whether they
are turned on by default or not. I don't see how this is relevant to the
interface discussion.

Also, there is no way we can later fundamentally change the semantics
of memory.current, so it would have to remain configurable forever,
forcing people forever to select multiple options in order to piece
together a single logical kernel feature.

This is really not an option, either.

If there are show-stopping bugs in the implementation, I'd rather hold
off the release of the unified hierarchy than commit to a half-assed
interface right out of the gate. The point of v2 is sane interfaces.

So let's please focus on fixing any problems that slab accounting may
have, rather than designing complex config options and transition
procedures whose sole purpose is to defer dealing with our issues.

Please?

^ permalink raw reply	[flat|nested] 156+ messages in thread

* Re: [PATCH 5/8] mm: memcontrol: account socket memory on unified hierarchy
@ 2015-11-05 22:32                                 ` Johannes Weiner
  0 siblings, 0 replies; 156+ messages in thread
From: Johannes Weiner @ 2015-11-05 22:32 UTC (permalink / raw)
  To: Michal Hocko
  Cc: David Miller, akpm, vdavydov, tj, netdev, linux-mm, cgroups,
	linux-kernel

On Thu, Nov 05, 2015 at 05:28:03PM +0100, Michal Hocko wrote:
> On Thu 05-11-15 11:16:09, David S. Miller wrote:
> > From: Michal Hocko <mhocko@kernel.org>
> > Date: Thu, 5 Nov 2015 15:40:02 +0100
> > 
> > > On Wed 04-11-15 14:50:37, Johannes Weiner wrote:
> > > [...]
> > >> Because it goes without saying that once the cgroupv2 interface is
> > >> released, and people use it in production, there is no way we can then
> > >> *add* dentry cache, inode cache, and others to memory.current. That
> > >> would be an unacceptable change in interface behavior.
> > > 
> > > They would still have to _enable_ the config option _explicitly_. make
> > > oldconfig wouldn't change it silently for them. I do not think
> > > it is an unacceptable change of behavior if the config is changed
> > > explicitly.
> > 
> > Every user is going to get this config option when they update their
> > distibution kernel or whatever.
> > 
> > Then they will all wonder why their networking performance went down.
> > 
> > This is why I do not want the networking accounting bits on by default
> > even if the kconfig option is enabled.  They must be off by default
> > and guarded by a static branch so the cost is exactly zero.
> 
> Yes, that part is clear and Johannes made it clear that the kmem tcp
> part is disabled by default. Or are you considering also all the slab
> usage by the networking code as well?

Michal, there shouldn't be any tracking or accounting going on by
default when you boot into a fresh system.

I removed all accounting and statistics on the system level in
cgroupv2, so distribution kernels can compile-time enable a single,
feature-complete CONFIG_MEMCG that provides a full memory controller
while at the same time putting no overhead on users that don't benefit
from memory control at all and just want to use the machine bare-metal.

This is completely doable. My new series does it for skmem, but I also
want to retrofit the code to eliminate that current overhead for page
cache, anonymous memory, slab memory and so forth.
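
As an illustration of what a charge call site then costs when nobody is
using the controller (the field and helper names below are assumptions
for the sketch, not necessarily what the series ends up with):

/*
 * Sketch only: the default path is one patched-out jump; only sockets
 * created inside a limited cgroup carry a memcg pointer and ever reach
 * the accounting code.
 */
#include <linux/jump_label.h>
#include <net/sock.h>

struct mem_cgroup;

extern struct static_key_false memcg_sockets_enabled_key;
extern bool mem_cgroup_charge_skmem(struct mem_cgroup *memcg,
				    unsigned int nr_pages);

static inline bool sk_account_tx(struct sock *sk, unsigned int nr_pages)
{
	if (!static_branch_unlikely(&memcg_sockets_enabled_key))
		return true;		/* bare-metal boot: nothing to do */

	if (!sk->sk_memcg)		/* socket not under a limited group */
		return true;

	return mem_cgroup_charge_skmem(sk->sk_memcg, nr_pages);
}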

This is the only sane way to make the memory controller powerful and
generally useful without having to make unreasonable compromises with
memory consumers. We shouldn't even be *having* the discussion about
whether we should sacrifice the quality of our interface in order to
compromise with a class of users that doesn't care about any of this
in the first place.

So let's eliminate the cost for non-users, but make the memory
controller feature-complete and useful--with reasonable cost,
implementation, and interface--for our actual userbase.

Paying the necessary cost for a functionality you actually want is not
the problem. Paying for something that doesn't benefit you is.

^ permalink raw reply	[flat|nested] 156+ messages in thread

* Re: [PATCH 5/8] mm: memcontrol: account socket memory on unified hierarchy
@ 2015-11-05 22:52                               ` Johannes Weiner
  0 siblings, 0 replies; 156+ messages in thread
From: Johannes Weiner @ 2015-11-05 22:52 UTC (permalink / raw)
  To: Michal Hocko
  Cc: David Miller, akpm, vdavydov, tj, netdev, linux-mm, cgroups,
	linux-kernel

On Thu, Nov 05, 2015 at 03:55:22PM -0500, Johannes Weiner wrote:
> On Thu, Nov 05, 2015 at 03:40:02PM +0100, Michal Hocko wrote:
> > This would be true if they moved on to the new cgroup API intentionally.
> > The reality is more complicated though. AFAIK sysmted is waiting for
> > cgroup2 already and privileged services enable all available resource
> > controllers by default as I've learned just recently.
> 
> Have you filed a report with them? I don't think they should turn them
> on unless users explicitely configure resource control for the unit.

Okay, verified with systemd people that they're not planning on
enabling resource control by default.

Inflammatory half-truths, man. This is not constructive.

^ permalink raw reply	[flat|nested] 156+ messages in thread

* Re: [PATCH 5/8] mm: memcontrol: account socket memory on unified hierarchy
@ 2015-11-06  9:05                               ` Vladimir Davydov
  0 siblings, 0 replies; 156+ messages in thread
From: Vladimir Davydov @ 2015-11-06  9:05 UTC (permalink / raw)
  To: Johannes Weiner
  Cc: Michal Hocko, David Miller, akpm, tj, netdev, linux-mm, cgroups,
	linux-kernel

On Thu, Nov 05, 2015 at 03:55:22PM -0500, Johannes Weiner wrote:
> On Thu, Nov 05, 2015 at 03:40:02PM +0100, Michal Hocko wrote:
...
> > 3) keep only some (safe) cache types enabled by default with the current
> >    failing semantic and require an explicit enabling for the complete
> >    kmem accounting. [di]cache code paths should be quite robust to
> >    handle allocation failures.
> 
> Vladimir, what would be your opinion on this?

I'm all for this option. Actually, I've been thinking about this since I
introduced the __GFP_NOACCOUNT flag. Not because of the failing
semantics, since we can always let kmem allocations breach the limit.
This shouldn't be critical, because I don't think it's possible to issue
a series of kmem allocations w/o a single user page allocation, which
would reclaim/kill the excess.

The point is there are allocations that are shared system-wide and
therefore shouldn't go to any memcg. Most obvious examples are: mempool
users and radix_tree/idr preloads. Accounting them to memcg is likely to
result in noticeable memory overhead as memory cgroups are
created/destroyed, because they pin dead memory cgroups with all their
kmem caches, which aren't tiny.

Another funny example is objects destroyed lazily for performance
reasons, e.g. vmap_area. Such objects are usually very small, so
delaying destruction of a bunch of them will normally go unnoticed.
However, if kmemcg is used, the effective memory consumption caused by
such objects can be multiplied many times over due to dangling kmem
caches.

We can, of course, mark all such allocations as __GFP_NOACCOUNT, but the
problem is they are tricky to identify, because they are scattered all
over the kernel source tree. E.g. Dave Chinner mentioned that XFS
internals do a lot of allocations that are shared among all XFS
filesystems and therefore should not be accounted (BTW that's why
list_lru's used by XFS are not marked as memcg-aware). There must be
more out there. Besides, kernel developers don't usually even know about
kmemcg (they just write the code for their subsys, so why should they?)
so they won't think about using __GFP_NOACCOUNT, and hence new
falsely-accounted allocations are likely to appear.

That said, by switching from black-list (__GFP_NOACCOUNT) to white-list
(__GFP_ACCOUNT) kmem accounting policy we would make the system more
predictable and robust IMO. OTOH what would we lose? Security? Well,
containers aren't secure IMHO. In fact, I doubt they will ever be (as
secure as VMs). Anyway, if a runaway allocation is reported, it should
be trivial to fix by adding __GFP_ACCOUNT where appropriate.

If there are no objections, I'll prepare a patch switching to the
white-list approach. Let's start from obvious things like fs_struct,
mm_struct, task_struct, signal_struct, dentry, inode, which can be
easily allocated from user space. This should cover 90% of all
allocations that should be accounted AFAICS. The rest will be added
later if necessary.
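
Roughly what I have in mind, as a sketch only (the __GFP_ACCOUNT bit does
not exist yet, its value below is a placeholder, and the charge hook is a
stub):

/*
 * Sketch of the white-list policy: nothing gets charged to a memcg
 * unless the call site explicitly opts in with __GFP_ACCOUNT.
 */
#include <linux/gfp.h>
#include <linux/slab.h>
#include <linux/fs_struct.h>

#ifndef __GFP_ACCOUNT
/* Placeholder value for the sketch; a real patch would pick a free bit. */
#define __GFP_ACCOUNT	((__force gfp_t)0x1000000u)
#endif

/* Hypothetical hook called from the slab/page allocation path. */
extern int memcg_charge_kmem_sketch(gfp_t gfp, unsigned int order);

static inline int memcg_maybe_charge(gfp_t gfp, unsigned int order)
{
	/* Opt-in: untagged allocations are never charged to a memcg. */
	if (!(gfp & __GFP_ACCOUNT))
		return 0;

	return memcg_charge_kmem_sketch(gfp, order);
}

/* A call site that opts in: an object userspace can easily multiply. */
static struct fs_struct *alloc_fs_accounted(void)
{
	return kzalloc(sizeof(struct fs_struct), GFP_KERNEL | __GFP_ACCOUNT);
}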

Thanks,
Vladimir

^ permalink raw reply	[flat|nested] 156+ messages in thread

* Re: [PATCH 5/8] mm: memcontrol: account socket memory on unified hierarchy
  2015-11-05 22:52                               ` Johannes Weiner
@ 2015-11-06 10:57                                 ` Michal Hocko
  -1 siblings, 0 replies; 156+ messages in thread
From: Michal Hocko @ 2015-11-06 10:57 UTC (permalink / raw)
  To: Johannes Weiner
  Cc: David Miller, akpm, vdavydov, tj, netdev, linux-mm, cgroups,
	linux-kernel

On Thu 05-11-15 17:52:00, Johannes Weiner wrote:
> On Thu, Nov 05, 2015 at 03:55:22PM -0500, Johannes Weiner wrote:
> > On Thu, Nov 05, 2015 at 03:40:02PM +0100, Michal Hocko wrote:
> > > This would be true if they moved on to the new cgroup API intentionally.
> > The reality is more complicated though. AFAIK systemd is waiting for
> > > cgroup2 already and privileged services enable all available resource
> > > controllers by default as I've learned just recently.
> > 
> > Have you filed a report with them? I don't think they should turn them
> > on unless users explicitly configure resource control for the unit.
> 
> Okay, verified with systemd people that they're not planning on
> enabling resource control per default.
> 
> Inflammatory half-truths, man. This is not constructive.

What about the Delegate=yes feature then? We have just been burnt by this
quite heavily. AFAIU nspawn@.service and nspawn@.service have this
enabled by default
http://lists.freedesktop.org/archives/systemd-commits/2014-November/007400.html
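
For reference, the knob itself is a one-line unit setting; a local
drop-in like the following (the path and unit name are just an example)
is all it takes to turn it on for a unit:

  # /etc/systemd/system/systemd-nspawn@.service.d/delegate.conf
  [Service]
  Delegate=yes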

-- 
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 156+ messages in thread

* Re: [PATCH 5/8] mm: memcontrol: account socket memory on unified hierarchy
  2015-11-05 22:32                                 ` Johannes Weiner
@ 2015-11-06 12:51                                   ` Michal Hocko
  -1 siblings, 0 replies; 156+ messages in thread
From: Michal Hocko @ 2015-11-06 12:51 UTC (permalink / raw)
  To: Johannes Weiner
  Cc: David Miller, akpm, vdavydov, tj, netdev, linux-mm, cgroups,
	linux-kernel

On Thu 05-11-15 17:32:51, Johannes Weiner wrote:
> On Thu, Nov 05, 2015 at 05:28:03PM +0100, Michal Hocko wrote:
[...]
> > Yes, that part is clear and Johannes made it clear that the kmem tcp
> > part is disabled by default. Or are you considering also all the slab
> > usage by the networking code as well?
> 
> Michal, there shouldn't be any tracking or accounting going on per
> default when you boot into a fresh system.
> 
> I removed all accounting and statistics on the system level in
> cgroupv2, so distribution kernels can compile-time enable a single,
> feature-complete CONFIG_MEMCG that provides a full memory controller
> while at the same time puts no overhead on users that don't benefit
> from mem control at all and just want to use the machine bare-metal.

Yes, that part is clear and I am not disputing it _at all_. It is just
that chances are high that the memory controller _will_ be enabled on a
typical distribution system. E.g. systemd _is_ enabling all resource
controllers by default for some services with the Delegate=yes option.

> This is completely doable. My new series does it for skmem, but I also
> want to retrofit the code to eliminate that current overhead for page
> cache, anonymous memory, slab memory and so forth.
> 
> This is the only sane way to make the memory controller powerful and
> generally useful without having to make unreasonable compromises with
> memory consumers. We shouldn't even be *having* the discussion about
> whether we should sacrifice the quality of our interface in order to
> compromise with a class of users that doesn't care about any of this
> in the first place.
> 
> So let's eliminate the cost for non-users, but make the memory
> controller feature-complete and useful--with reasonable cost,
> implementation, and interface--for our actual userbase.
> 
> Paying the necessary cost for a functionality you actually want is not
> the problem. Paying for something that doesn't benefit you is.

I completely agree that a reasonable cost is acceptable for those who
_want_ the functionality. It hasn't been shown that people in the wild
have actually been missing kmem accounting so far. E.g. the kmem
controller is not even enabled in openSUSE or SLES kernels, and I do not
remember any huge push to enable it.

I do understand that you want to have an out-of-the-box isolation
behavior, which I agree is a nice-to-have feature, especially with the
growing penetration of containerized workloads. But my point still
holds: this is not something everybody wants to have. So having a
configuration option plus a boot time option to override it is the most
reasonable way to go. You can clearly see that there is already demand
for this from the tcp kmem extension, because those users really _care_
about every single cpu cycle even though some part of the userspace
happens to have memcg enabled.

The configuration default is a separate question, and we can discuss
that, because it is not an easy one to decide right now IMHO.
-- 
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 156+ messages in thread

* Re: [PATCH 5/8] mm: memcontrol: account socket memory on unified hierarchy
  2015-11-05 20:55                             ` Johannes Weiner
@ 2015-11-06 13:21                               ` Michal Hocko
  -1 siblings, 0 replies; 156+ messages in thread
From: Michal Hocko @ 2015-11-06 13:21 UTC (permalink / raw)
  To: Johannes Weiner
  Cc: David Miller, akpm, vdavydov, tj, netdev, linux-mm, cgroups,
	linux-kernel

On Thu 05-11-15 15:55:22, Johannes Weiner wrote:
> On Thu, Nov 05, 2015 at 03:40:02PM +0100, Michal Hocko wrote:
> > On Wed 04-11-15 14:50:37, Johannes Weiner wrote:
[...]
> > This would be true if they moved on to the new cgroup API intentionally.
> > The reality is more complicated though. AFAIK systemd is waiting for
> > cgroup2 already and privileged services enable all available resource
> > controllers by default as I've learned just recently.
> 
> Have you filed a report with them? I don't think they should turn them
> on unless users explicitly configure resource control for the unit.

We have just been bitten by this (aka Delegate=yes for some basic
services), and our systemd people are supposed to bring this up upstream.
I've mentioned that in the other email where you accuse me of spreading
FUD.
 
> But what I said still holds: critical production machines don't just
> get rolling updates and "accidentally" switch to all this new
> code. And those that do take the plunge have the cmdline options.

That is exactly my point, and it is why I do not think re-evaluating the
default config option is a problem at all. The default wouldn't matter
for existing users. Those who care can have all the functionality they
need right away - be it with kmem enabled or disabled.

> > > And it makes a lot more sense to account them in excess of the limit
> > > than pretend they don't exist. We might not be able to completely
> > > fulfill the containment part of the memory controller (although these
> > > slab charges will still create significant pressure before that), but
> > > at least we don't fail the accounting part on top of it.
> > 
> > Hmm, wouldn't that kill the whole purpose of the kmem accounting? Any
> > load could simply run away via kernel allocations. What is even worse, we
> > might even not trigger memcg OOM killer before we hit the global OOM. So
> > the whole containment goes straight to hell.
> >
> > I can see four options here:
> > 1) enable kmem by default with the current semantic which we know can
> >    BUG_ON (at least btrfs is known to hit this) or lead to other issues.
> 
> Can you point me to that report?

git grep "BUG_ON.*ENOMEM" -- fs/btrfs

just to give you a picture. Not all of them are kmalloc failures, and
others are not annotated with an ENOMEM comment. This came up as a
result of my last attempt to allow GFP_NOFS allocations to fail
(http://lkml.kernel.org/r/1438768284-30927-1-git-send-email-mhocko%40kernel.org)
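
The call sites it finds are roughly of this shape (illustrative only,
not a verbatim btrfs hunk):

  item = kmalloc(sizeof(*item), GFP_NOFS);
  /* a failed kmem charge would make this BUG_ON fire instead of the
   * allocation error being handled gracefully */
  BUG_ON(!item);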
 
> That's not "semantics", that's a bug! Whether or not a feature is
> enabled by default, it can not be allowed to crash the kernel.

Yes, those are bugs and have to be fixed. Not an easy task, but nothing
that couldn't be solved; it just takes some time. They are not very
likely to trigger right now because they are reduced to corner cases,
but they become more visible with the current kmem accounting semantic.
So if the accounting should be on by default, we either change the
semantic or wait until this gets fixed.

> Presenting this as a choice is a bit of a strawman argument.
> 
> > 2) enable kmem by default and change the semantic for cgroup2 to allow
> >    runaway charges above the hard limit which would defeat the whole
> >    purpose of the containment for cgroup2. This can be a temporary
> >    workaround until we can afford kmem failures. This has a big risk
> >    that we will end up with this permanently because there is a strong
> >    pressure that GFP_KERNEL allocations should never fail. Yet this is
> >    the most common type of request. Or do we change the consistency with
> >    the global case at some point?
> 
> As per 1) we *have* to fail containment eventually if not doing so
> means crashes and lockups. That's not a choice of semantics.
> 
> But that doesn't mean we have to give up *immediately* and allow
> unrestrained "runaway charges"--again, more of a strawman than a
> choice. We can still throttle the allocator and apply significant
> pressure on the memory pool, culminating in OOM kills eventually.
> 
> Once we run out of available containment tools, however, we *have* to
> follow the semantics of the page and slab allocator and succeed the
> request. We can not just return -ENOMEM if that causes kernel bugs.
> 
> That's the only thing we can do right now.
> 
> In fact, it's likely going to be the best we will ever be able to do
> when it comes to kernel memory accounting. Linus made it clear where
> he stands on failing kernel allocations, so all we can do is continue
> to improve our containment tools and then give up on containment when
> they're exhausted and force the charge past the limit.

OK, then we need all the additional measures to keep the excess over the
hard limit bounded.

> > 3) keep only some (safe) cache types enabled by default with the current
> >    failing semantic and require an explicit enabling for the complete
> >    kmem accounting. [di]cache code paths should be quite robust to
> >    handle allocation failures.
> 
> Vladimir, what would be your opinion on this?
> 
> > 4) disable kmem by default and change the config default later to signal
> >    the accounting is safe as far as we are aware and let people enable
> >    the functionality on that basis. We would keep the current failing
> >    semantic.
> > 
> > To me 4) sounds like the safest option because it still keeps the
> > functionality available to those who can benefit from it in v1 already
> > while we are not exposing a potentially buggy behavior to the majority
> > (many of them even unintentionally). Moreover we still allow to change
> > the default later on an explicit basis.
> 
> I'm not interested in fragmenting the interface forever out of caution
> because there might be a bug in the implementation right now. As I
> said we have to fix any instability in the features we provide whether
> they are turned on by default or not. I don't see how this is relevant
> to the interface discussion.
> 
> Also, there is no way we can later fundamentally change the semantics
> of memory.current, so it would have to remain configurable forever,
> forcing people forever to select multiple options in order to piece
> together a single logical kernel feature.
> 
> This is really not an option, either.

Why not? I can clearly see people who would really want to have this
disabled, and doing that via a config option is much easier than
providing a command line parameter. A config option doesn't add any more
maintenance burden than the boot time parameter does.
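
Just to sketch what I mean (the option name is made up and this is not
meant as the actual Kconfig text):

  config MEMCG_KMEM_DEFAULT_ON
          bool "Account kernel memory to memory cgroups by default"
          depends on MEMCG
          default n
          help
            Distributions would flip this once the accounting is
            considered safe; a boot time parameter could still override
            it either way.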
 
> If there are show-stopping bugs in the implementation, I'd rather hold
> off the release of the unified hierarchy than commit to a half-assed
> interface right out of the gate.

If you are willing to postpone releasing cgroup2 until this gets
resolved - one way or another - then I have no objections. My impression
was that Tejun wanted to release it sooner rather than later. As this
very discussion shows, we are not even sure what the kmem failure
behavior should be.

> The point of v2 is sane interfaces.

And the sane interface to me is to use a single set of knobs regardless
of memory type. We are currently only discussing what should be
accounted by default. My understanding of what David said is that tcp
kmem should be enabled only when explicitly opted in. Did I get this
wrong?

> So let's please focus on fixing any problems that slab accounting may
> have, rather than designing complex config options and transition
> procedures whose sole purpose is to defer dealing with our issues.
> 
> Please?

-- 
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 156+ messages in thread

* Re: [PATCH 5/8] mm: memcontrol: account socket memory on unified hierarchy
  2015-11-06  9:05                               ` Vladimir Davydov
@ 2015-11-06 13:29                                 ` Michal Hocko
  -1 siblings, 0 replies; 156+ messages in thread
From: Michal Hocko @ 2015-11-06 13:29 UTC (permalink / raw)
  To: Vladimir Davydov
  Cc: Johannes Weiner, David Miller, akpm, tj, netdev, linux-mm,
	cgroups, linux-kernel

On Fri 06-11-15 12:05:55, Vladimir Davydov wrote:
[...]
> If there are no objections, I'll prepare a patch switching to the
> white-list approach. Let's start from obvious things like fs_struct,
> mm_struct, task_struct, signal_struct, dentry, inode, which can be
> easily allocated from user space.

pipe buffers, kernel stacks and who knows what more.

> This should cover 90% of all
> allocations that should be accounted AFAICS. The rest will be added
> later if necessary.

The more I think about it, the more I am convinced that this is the only
sane way forward. The only concern I would have is how we deal with the
old interface in cgroup1. We do not want to break existing deployments
which might depend on the current behavior. I doubt there are any,
but...
-- 
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 156+ messages in thread

* Re: [PATCH 5/8] mm: memcontrol: account socket memory on unified hierarchy
  2015-11-06 10:57                                 ` Michal Hocko
@ 2015-11-06 16:19                                   ` Johannes Weiner
  -1 siblings, 0 replies; 156+ messages in thread
From: Johannes Weiner @ 2015-11-06 16:19 UTC (permalink / raw)
  To: Michal Hocko
  Cc: David Miller, akpm, vdavydov, tj, netdev, linux-mm, cgroups,
	linux-kernel

On Fri, Nov 06, 2015 at 11:57:24AM +0100, Michal Hocko wrote:
> On Thu 05-11-15 17:52:00, Johannes Weiner wrote:
> > On Thu, Nov 05, 2015 at 03:55:22PM -0500, Johannes Weiner wrote:
> > > On Thu, Nov 05, 2015 at 03:40:02PM +0100, Michal Hocko wrote:
> > > > This would be true if they moved on to the new cgroup API intentionally.
> > > > The reality is more complicated though. AFAIK systemd is waiting for
> > > > cgroup2 already and privileged services enable all available resource
> > > > controllers by default as I've learned just recently.
> > > 
> > > Have you filed a report with them? I don't think they should turn them
> > > on unless users explicitly configure resource control for the unit.
> > 
> > Okay, verified with systemd people that they're not planning on
> > enabling resource control per default.
> > 
> > Inflammatory half-truths, man. This is not constructive.
> 
> What about the Delegate=yes feature then? We have just been burnt by this
> quite heavily. AFAIU nspawn@.service and nspawn@.service have this
> enabled by default
> http://lists.freedesktop.org/archives/systemd-commits/2014-November/007400.html

That's when you launch a *container* and want it to be able to use
nested resource control.

We're talking about actual container users here. It's not turning on
resource control for all "privileged services", which is what we were
worried about here. Can you at least admit that when you yourself link
to the refuting evidence?

And if you've been "burnt quite heavily" by this, where is your bug
report to stop other users from getting "burnt quite heavily" as well?

All I read here is vague inflammatory language to spread FUD.

You might think sending these emails is helpful, but it really
isn't. Not only is it not contributing code, insights, or solutions,
you're now actively sabotaging someone else's effort to build something.

^ permalink raw reply	[flat|nested] 156+ messages in thread

* Re: [PATCH 5/8] mm: memcontrol: account socket memory on unified hierarchy
  2015-11-06  9:05                               ` Vladimir Davydov
@ 2015-11-06 16:35                                 ` Johannes Weiner
  -1 siblings, 0 replies; 156+ messages in thread
From: Johannes Weiner @ 2015-11-06 16:35 UTC (permalink / raw)
  To: Vladimir Davydov
  Cc: Michal Hocko, David Miller, akpm, tj, netdev, linux-mm, cgroups,
	linux-kernel

On Fri, Nov 06, 2015 at 12:05:55PM +0300, Vladimir Davydov wrote:
> On Thu, Nov 05, 2015 at 03:55:22PM -0500, Johannes Weiner wrote:
> > On Thu, Nov 05, 2015 at 03:40:02PM +0100, Michal Hocko wrote:
> ...
> > > 3) keep only some (safe) cache types enabled by default with the current
> > >    failing semantic and require an explicit enabling for the complete
> > >    kmem accounting. [di]cache code paths should be quite robust to
> > >    handle allocation failures.
> > 
> > Vladimir, what would be your opinion on this?
> 
> I'm all for this option. Actually, I've been thinking about this since I
> introduced the __GFP_NOACCOUNT flag. Not because of the failing
> semantics, since we can always let kmem allocations breach the limit.
> This shouldn't be critical, because I don't think it's possible to issue
> a series of kmem allocations w/o a single user page allocation, which
> would reclaim/kill the excess.
> 
> The point is there are allocations that are shared system-wide and
> therefore shouldn't go to any memcg. Most obvious examples are: mempool
> users and radix_tree/idr preloads. Accounting them to memcg is likely to
> result in noticeable memory overhead as memory cgroups are
> created/destroyed, because they pin dead memory cgroups with all their
> kmem caches, which aren't tiny.
> 
> Another funny example is objects destroyed lazily for performance
> reasons, e.g. vmap_area. Such objects are usually very small, so
> delaying destruction of a bunch of them will normally go unnoticed.
> However, if kmemcg is used, the effective memory consumption caused by
> such objects can be multiplied many times over due to dangling kmem
> caches.
> 
> We can, of course, mark all such allocations as __GFP_NOACCOUNT, but the
> problem is they are tricky to identify, because they are scattered all
> over the kernel source tree. E.g. Dave Chinner mentioned that XFS
> internals do a lot of allocations that are shared among all XFS
> filesystems and therefore should not be accounted (BTW that's why
> list_lru's used by XFS are not marked as memcg-aware). There must be
> more out there. Besides, kernel developers don't usually even know about
> kmemcg (they just write the code for their subsys, so why should they?),
> so they won't think about using __GFP_NOACCOUNT, and hence new
> falsely-accounted allocations are likely to appear.
> 
> That is why, by switching from a black-list (__GFP_NOACCOUNT) to a
> white-list (__GFP_ACCOUNT) kmem accounting policy, we would make the
> system more predictable and robust IMO. OTOH what would we lose?
> Security? Well,
> containers aren't secure IMHO. In fact, I doubt they will ever be (as
> secure as VMs). Anyway, if a runaway allocation is reported, it should
> be trivial to fix by adding __GFP_ACCOUNT where appropriate.

I wholeheartedly agree with all of this.

> If there are no objections, I'll prepare a patch switching to the
> white-list approach. Let's start from obvious things like fs_struct,
> mm_struct, task_struct, signal_struct, dentry, inode, which can be
> easily allocated from user space. This should cover 90% of all
> allocations that should be accounted AFAICS. The rest will be added
> later if necessary.

Awesome, I'm looking forward to that patch!

^ permalink raw reply	[flat|nested] 156+ messages in thread

* Re: [PATCH 5/8] mm: memcontrol: account socket memory on unified hierarchy
@ 2015-11-06 16:46                                     ` Michal Hocko
  0 siblings, 0 replies; 156+ messages in thread
From: Michal Hocko @ 2015-11-06 16:46 UTC (permalink / raw)
  To: Johannes Weiner
  Cc: David Miller, akpm, vdavydov, tj, netdev, linux-mm, cgroups,
	linux-kernel

On Fri 06-11-15 11:19:53, Johannes Weiner wrote:
> On Fri, Nov 06, 2015 at 11:57:24AM +0100, Michal Hocko wrote:
> > On Thu 05-11-15 17:52:00, Johannes Weiner wrote:
> > > On Thu, Nov 05, 2015 at 03:55:22PM -0500, Johannes Weiner wrote:
> > > > On Thu, Nov 05, 2015 at 03:40:02PM +0100, Michal Hocko wrote:
> > > > > This would be true if they moved on to the new cgroup API intentionally.
> > > > > The reality is more complicated though. AFAIK systemd is waiting for
> > > > > cgroup2 already and privileged services enable all available resource
> > > > > controllers by default as I've learned just recently.
> > > > 
> > > > Have you filed a report with them? I don't think they should turn them
> > > > on unless users explicitly configure resource control for the unit.
> > > 
> > > Okay, verified with systemd people that they're not planning on
> > > enabling resource control per default.
> > > 
> > > Inflammatory half-truths, man. This is not constructive.
> > 
> > What about the Delegate=yes feature then? We have just been burnt by this
> > quite heavily. AFAIU nspawn@.service and nspawn@.service have this
> > enabled by default
> > http://lists.freedesktop.org/archives/systemd-commits/2014-November/007400.html
> 
> That's when you launch a *container* and want it to be able to use
> nested resource control.

Oops, copy&paste error here. The second one should have been
user@.service. So it is not only about containers AFAIU but about all
user-defined sessions.

> We're talking about actual container users here. It's not turning on
> resource control for all "privileged services", which is what we were
> worried about here. Can you at least admit that when you yourself link
> to the refuting evidence?

My bad, that was a misunderstanding of the changelog.

> And if you've been "burnt quite heavily" by this, where is your bug
> report to stop other users from getting "burnt quite heavily" as well?

The bug report is still internal because it is tracking an unreleased
product. We have ended up reverting the Delegate feature. Our systemd
developers are supposed to bring this up with upstream.

The basic problem was that the Delegate feature has been backported to
our systemd package without further consideration and that has
invalidated a lot of performance testing because some resource
controllers have measurable effects on those benchmarks.

> All I read here is vague inflammatory language to spread FUD.

I was merely pointing out that the memory controller might be enabled
without the _user_ even noticing, because the controller wasn't enabled
explicitly. I haven't blamed anybody for that.

> You might think sending these emails is helpful, but it really
> isn't. Not only is it not contributing code, insights, or solutions,
> you're now actively sabotaging someone else's effort to build something.

Come on! Are you even serious?

-- 
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 156+ messages in thread

* Re: [PATCH 5/8] mm: memcontrol: account socket memory on unified hierarchy
@ 2015-11-06 17:45                                       ` Johannes Weiner
  0 siblings, 0 replies; 156+ messages in thread
From: Johannes Weiner @ 2015-11-06 17:45 UTC (permalink / raw)
  To: Michal Hocko
  Cc: David Miller, akpm, vdavydov, tj, netdev, linux-mm, cgroups,
	linux-kernel

On Fri, Nov 06, 2015 at 05:46:57PM +0100, Michal Hocko wrote:
> The basic problem was that the Delegate feature has been backported to
> our systemd package without further consideration and that has
> invalidated a lot of performance testing because some resource
> controllers have measurable effects on those benchmarks.

You're talking about a userspace bug. No amount of fragmenting and
layering and opt-in in the kernel's runtime configuration space is
going to help you if you screw up and enable it all by accident.

> > All I read here is vague inflammatory language to spread FUD.
> 
> I was merely pointing out that the memory controller might be enabled
> without the _user_ even noticing, because the controller wasn't enabled
> explicitly.

Why does that have anything to do with how we design our interface?

We can't do more than present a sane interface in good faith and lobby
userspace projects if we think they misuse it.

^ permalink raw reply	[flat|nested] 156+ messages in thread

* Re: [PATCH 5/8] mm: memcontrol: account socket memory on unified hierarchy
  2015-11-06 16:46                                     ` Michal Hocko
@ 2015-11-07  3:45                                       ` David Miller
  -1 siblings, 0 replies; 156+ messages in thread
From: David Miller @ 2015-11-07  3:45 UTC (permalink / raw)
  To: mhocko
  Cc: hannes, akpm, vdavydov, tj, netdev, linux-mm, cgroups, linux-kernel

From: Michal Hocko <mhocko@kernel.org>
Date: Fri, 6 Nov 2015 17:46:57 +0100

> On Fri 06-11-15 11:19:53, Johannes Weiner wrote:
>> You might think sending these emails is helpful, but it really
>> isn't. Not only is it not contributing code, insights, or solutions,
>> you're now actively sabotaging someone else's effort to build something.
> 
> Come on! Are you even serious?

He is, and I agree 100% with him FWIW.

^ permalink raw reply	[flat|nested] 156+ messages in thread

* Re: [PATCH 5/8] mm: memcontrol: account socket memory on unified hierarchy
@ 2015-11-12 18:36                                     ` Mel Gorman
  0 siblings, 0 replies; 156+ messages in thread
From: Mel Gorman @ 2015-11-12 18:36 UTC (permalink / raw)
  To: Johannes Weiner
  Cc: Michal Hocko, David Miller, akpm, vdavydov, tj, netdev, linux-mm,
	cgroups, linux-kernel

On Fri, Nov 06, 2015 at 11:19:53AM -0500, Johannes Weiner wrote:
> On Fri, Nov 06, 2015 at 11:57:24AM +0100, Michal Hocko wrote:
> > On Thu 05-11-15 17:52:00, Johannes Weiner wrote:
> > > On Thu, Nov 05, 2015 at 03:55:22PM -0500, Johannes Weiner wrote:
> > > > On Thu, Nov 05, 2015 at 03:40:02PM +0100, Michal Hocko wrote:
> > > > > This would be true if they moved on to the new cgroup API intentionally.
> > > > > The reality is more complicated though. AFAIK systemd is waiting for
> > > > > cgroup2 already and privileged services enable all available resource
> > > > > controllers by default as I've learned just recently.
> > > > 
> > > > Have you filed a report with them? I don't think they should turn them
> > > > on unless users explicitly configure resource control for the unit.
> > > 
> > > Okay, verified with systemd people that they're not planning on
> > > enabling resource control per default.
> > > 
> > > Inflammatory half-truths, man. This is not constructive.
> > 
> > What about the Delegate=yes feature then? We have just been burnt by this
> > quite heavily. AFAIU nspawn@.service and nspawn@.service have this
> > enabled by default
> > http://lists.freedesktop.org/archives/systemd-commits/2014-November/007400.html
> 
> That's when you launch a *container* and want it to be able to use
> nested resource control.
> 
> We're talking about actual container users here. It's not turning on
> resource control for all "privileged services", which is what we were
> worried about here. Can you at least admit that when you yourself link
> to the refuting evidence?
> 
> And if you've been "burnt quite heavily" by this, where is your bug
> report to stop other users from getting "burnt quite heavily" as well?
> 

I didn't read this thread in detail, but the lack of public information on
problems with cgroup controllers is partially my fault, so I'd like to
correct that.

https://bugzilla.suse.com/show_bug.cgi?id=954765

This bug documents some of the impact that was incurred due to ssh
sessions being resource controlled by default. It talks primarily about
pipetest being impacted by cpu,cpuacct. It was also found in the recent
past that dbench4 was impacted because the blkio controller was enabled.
That bug is not public, but basically dbench4 regressed by 80% because
the journal thread was in a different cgroup than dbench4. dbench4 would
stall for 8ms whenever more IO was issued before the journal thread could
issue any IO. The openSUSE bug 954765 is not affected by blkio because
it's disabled by a distribution-specific patch. Mike Galbraith adds some
additional information on why activating the cpu controller can have an
impact on semantics even if the overhead were zero.

It may be the case that it's an oversight by the systemd developers and the
intent was only to affect containers. My experience was that everything
was affected. It may also be the case that this is an openSUSE-specific
problem due to how the maintainers packaged systemd. I don't actually
know, and hopefully the bug will be able to determine whether upstream is
really affected or not.

There is also a link to this bug on the upstream project, so there is some
chance they are aware of it: https://github.com/systemd/systemd/issues/1715

Bottom line, there is legitimate confusion over whether cgroup controllers
are going to be enabled by default or not in the future. If they are
enabled by default, there is a non-zero cost to that and a change in
semantics that people may or may not be surprised by.
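
To make the dbench4/journal situation concrete, whether two tasks ended up
in the same cgroup can be read straight from /proc/<pid>/cgroup. A minimal
sketch, assuming only that procfs is mounted at /proc and taking the pid as
an argument (this helper is illustrative and not taken from any of the bugs
above):

#include <stdio.h>

/* Print the cgroup membership of a process: one line per mounted v1
 * hierarchy plus the "0::" line for the unified hierarchy, if any. */
int main(int argc, char **argv)
{
	char path[64], line[512];
	FILE *f;

	if (argc != 2) {
		fprintf(stderr, "usage: %s <pid>\n", argv[0]);
		return 1;
	}
	snprintf(path, sizeof(path), "/proc/%s/cgroup", argv[1]);
	f = fopen(path, "r");
	if (!f) {
		perror(path);
		return 1;
	}
	while (fgets(line, sizeof(line), f))
		fputs(line, stdout);
	fclose(f);
	return 0;
}

Two processes whose lines differ for the blkio (or cpu) hierarchy are being
scheduled against each other by that controller, which is the situation the
dbench4 regression describes.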

-- 
Mel Gorman
SUSE Labs

^ permalink raw reply	[flat|nested] 156+ messages in thread

* Re: [PATCH 5/8] mm: memcontrol: account socket memory on unified hierarchy
  2015-11-12 18:36                                     ` Mel Gorman
@ 2015-11-12 19:12                                       ` Johannes Weiner
  -1 siblings, 0 replies; 156+ messages in thread
From: Johannes Weiner @ 2015-11-12 19:12 UTC (permalink / raw)
  To: Mel Gorman
  Cc: Michal Hocko, David Miller, akpm, vdavydov, tj, netdev, linux-mm,
	cgroups, linux-kernel

On Thu, Nov 12, 2015 at 06:36:20PM +0000, Mel Gorman wrote:
> Bottom line, there is legimate confusion over whether cgroup controllers
> are going to be enabled by default or not in the future. If they are
> enabled by default, there is a non-zero cost to that and a change in
> semantics that people may or may not be surprised by.

Thanks for elaborating, Mel.

My understanding is that this is a plain bug. I don't think anybody
wants to put costs without benefits on their users.

But I'll keep an eye on these reports, and I'll work with the systemd
people should issues with the kernel interface materialize that would
force them to enable resource control prematurely.

^ permalink raw reply	[flat|nested] 156+ messages in thread

end of thread, other threads:[~2015-11-12 19:12 UTC | newest]

Thread overview: 156+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2015-10-22  4:21 [PATCH 0/8] mm: memcontrol: account socket memory in unified hierarchy Johannes Weiner
2015-10-22  4:21 ` Johannes Weiner
2015-10-22  4:21 ` [PATCH 1/8] mm: page_counter: let page_counter_try_charge() return bool Johannes Weiner
2015-10-22  4:21   ` Johannes Weiner
2015-10-23 11:31   ` Michal Hocko
2015-10-23 11:31     ` Michal Hocko
2015-10-22  4:21 ` [PATCH 2/8] mm: memcontrol: export root_mem_cgroup Johannes Weiner
2015-10-22  4:21   ` Johannes Weiner
2015-10-23 11:32   ` Michal Hocko
2015-10-23 11:32     ` Michal Hocko
2015-10-22  4:21 ` [PATCH 3/8] net: consolidate memcg socket buffer tracking and accounting Johannes Weiner
2015-10-22  4:21   ` Johannes Weiner
2015-10-22 18:46   ` Vladimir Davydov
2015-10-22 18:46     ` Vladimir Davydov
2015-10-22 18:46     ` Vladimir Davydov
2015-10-22 19:09     ` Johannes Weiner
2015-10-22 19:09       ` Johannes Weiner
2015-10-23 13:42       ` Vladimir Davydov
2015-10-23 13:42         ` Vladimir Davydov
2015-10-23 13:42         ` Vladimir Davydov
2015-10-23 12:38   ` Michal Hocko
2015-10-23 12:38     ` Michal Hocko
2015-10-22  4:21 ` [PATCH 4/8] mm: memcontrol: prepare for unified hierarchy socket accounting Johannes Weiner
2015-10-22  4:21   ` Johannes Weiner
2015-10-23 12:39   ` Michal Hocko
2015-10-23 12:39     ` Michal Hocko
2015-10-22  4:21 ` [PATCH 5/8] mm: memcontrol: account socket memory on unified hierarchy Johannes Weiner
2015-10-22  4:21   ` Johannes Weiner
2015-10-22 18:47   ` Vladimir Davydov
2015-10-22 18:47     ` Vladimir Davydov
2015-10-22 18:47     ` Vladimir Davydov
2015-10-23 13:19   ` Michal Hocko
2015-10-23 13:19     ` Michal Hocko
2015-10-23 13:19     ` Michal Hocko
2015-10-23 13:59     ` David Miller
2015-10-23 13:59       ` David Miller
2015-10-23 13:59       ` David Miller
2015-10-26 16:56       ` Johannes Weiner
2015-10-26 16:56         ` Johannes Weiner
2015-10-27 12:26         ` Michal Hocko
2015-10-27 12:26           ` Michal Hocko
2015-10-27 13:49           ` David Miller
2015-10-27 13:49             ` David Miller
2015-10-27 13:49             ` David Miller
2015-10-27 15:41           ` Johannes Weiner
2015-10-27 15:41             ` Johannes Weiner
2015-10-27 15:41             ` Johannes Weiner
2015-10-27 16:15             ` Michal Hocko
2015-10-27 16:15               ` Michal Hocko
2015-10-27 16:42               ` Johannes Weiner
2015-10-27 16:42                 ` Johannes Weiner
2015-10-28  0:45                 ` David Miller
2015-10-28  0:45                   ` David Miller
2015-10-28  0:45                   ` David Miller
2015-10-28  3:05                   ` Johannes Weiner
2015-10-28  3:05                     ` Johannes Weiner
2015-10-29 15:25                 ` Michal Hocko
2015-10-29 15:25                   ` Michal Hocko
2015-10-29 16:10                   ` Johannes Weiner
2015-10-29 16:10                     ` Johannes Weiner
2015-10-29 16:10                     ` Johannes Weiner
2015-11-04 10:42                     ` Michal Hocko
2015-11-04 10:42                       ` Michal Hocko
2015-11-04 19:50                       ` Johannes Weiner
2015-11-04 19:50                         ` Johannes Weiner
2015-11-04 19:50                         ` Johannes Weiner
2015-11-05 14:40                         ` Michal Hocko
2015-11-05 14:40                           ` Michal Hocko
2015-11-05 16:16                           ` David Miller
2015-11-05 16:16                             ` David Miller
2015-11-05 16:28                             ` Michal Hocko
2015-11-05 16:28                               ` Michal Hocko
2015-11-05 16:28                               ` Michal Hocko
2015-11-05 16:30                               ` David Miller
2015-11-05 16:30                                 ` David Miller
2015-11-05 22:32                               ` Johannes Weiner
2015-11-05 22:32                                 ` Johannes Weiner
2015-11-05 22:32                                 ` Johannes Weiner
2015-11-06 12:51                                 ` Michal Hocko
2015-11-06 12:51                                   ` Michal Hocko
2015-11-05 20:55                           ` Johannes Weiner
2015-11-05 20:55                             ` Johannes Weiner
2015-11-05 22:52                             ` Johannes Weiner
2015-11-05 22:52                               ` Johannes Weiner
2015-11-05 22:52                               ` Johannes Weiner
2015-11-06 10:57                               ` Michal Hocko
2015-11-06 10:57                                 ` Michal Hocko
2015-11-06 16:19                                 ` Johannes Weiner
2015-11-06 16:19                                   ` Johannes Weiner
2015-11-06 16:46                                   ` Michal Hocko
2015-11-06 16:46                                     ` Michal Hocko
2015-11-06 16:46                                     ` Michal Hocko
2015-11-06 17:45                                     ` Johannes Weiner
2015-11-06 17:45                                       ` Johannes Weiner
2015-11-06 17:45                                       ` Johannes Weiner
2015-11-07  3:45                                     ` David Miller
2015-11-07  3:45                                       ` David Miller
2015-11-12 18:36                                   ` Mel Gorman
2015-11-12 18:36                                     ` Mel Gorman
2015-11-12 18:36                                     ` Mel Gorman
2015-11-12 19:12                                     ` Johannes Weiner
2015-11-12 19:12                                       ` Johannes Weiner
2015-11-06  9:05                             ` Vladimir Davydov
2015-11-06  9:05                               ` Vladimir Davydov
2015-11-06  9:05                               ` Vladimir Davydov
2015-11-06  9:05                               ` Vladimir Davydov
2015-11-06 13:29                               ` Michal Hocko
2015-11-06 13:29                                 ` Michal Hocko
2015-11-06 16:35                               ` Johannes Weiner
2015-11-06 16:35                                 ` Johannes Weiner
2015-11-06 13:21                             ` Michal Hocko
2015-11-06 13:21                               ` Michal Hocko
2015-10-22  4:21 ` [PATCH 6/8] mm: vmscan: simplify memcg vs. global shrinker invocation Johannes Weiner
2015-10-22  4:21   ` Johannes Weiner
2015-10-23 13:26   ` Michal Hocko
2015-10-23 13:26     ` Michal Hocko
2015-10-22  4:21 ` [PATCH 7/8] mm: vmscan: report vmpressure at the level of reclaim activity Johannes Weiner
2015-10-22  4:21   ` Johannes Weiner
2015-10-22 18:48   ` Vladimir Davydov
2015-10-22 18:48     ` Vladimir Davydov
2015-10-22 18:48     ` Vladimir Davydov
2015-10-22 18:48     ` Vladimir Davydov
2015-10-23 13:49   ` Michal Hocko
2015-10-23 13:49     ` Michal Hocko
2015-10-23 13:49     ` Michal Hocko
2015-10-22  4:21 ` [PATCH 8/8] mm: memcontrol: hook up vmpressure to socket pressure Johannes Weiner
2015-10-22  4:21   ` Johannes Weiner
2015-10-22 18:57   ` Vladimir Davydov
2015-10-22 18:57     ` Vladimir Davydov
2015-10-22 18:57     ` Vladimir Davydov
2015-10-22 18:45 ` [PATCH 0/8] mm: memcontrol: account socket memory in unified hierarchy Vladimir Davydov
2015-10-22 18:45   ` Vladimir Davydov
2015-10-22 18:45   ` Vladimir Davydov
2015-10-26 17:22   ` Johannes Weiner
2015-10-26 17:22     ` Johannes Weiner
2015-10-26 17:22     ` Johannes Weiner
2015-10-26 17:22     ` Johannes Weiner
2015-10-27  8:43     ` Vladimir Davydov
2015-10-27  8:43       ` Vladimir Davydov
2015-10-27  8:43       ` Vladimir Davydov
2015-10-27 16:01       ` Johannes Weiner
2015-10-27 16:01         ` Johannes Weiner
2015-10-28  8:20         ` Vladimir Davydov
2015-10-28  8:20           ` Vladimir Davydov
2015-10-28  8:20           ` Vladimir Davydov
2015-10-28 18:58           ` Johannes Weiner
2015-10-28 18:58             ` Johannes Weiner
2015-10-29  9:27             ` Vladimir Davydov
2015-10-29  9:27               ` Vladimir Davydov
2015-10-29  9:27               ` Vladimir Davydov
2015-10-29 17:52               ` Johannes Weiner
2015-10-29 17:52                 ` Johannes Weiner
2015-10-29 17:52                 ` Johannes Weiner
2015-11-02 14:47                 ` Vladimir Davydov
2015-11-02 14:47                   ` Vladimir Davydov
2015-11-02 14:47                   ` Vladimir Davydov
