* [PATCH 1/2] memcg: flatten task_struct->memcg_oom
@ 2015-09-13 18:59 ` Tejun Heo
  0 siblings, 0 replies; 49+ messages in thread
From: Tejun Heo @ 2015-09-13 18:59 UTC (permalink / raw)
  To: akpm, hannes, mhocko; +Cc: cgroups, linux-mm, vdavydov, kernel-team

task_struct->memcg_oom is a sub-struct containing the fields used
for async memcg oom handling.  Most task_struct fields aren't
packaged this way, and the grouping can lead to unnecessary alignment
padding.
This patch flattens it.

* task.memcg_oom.memcg          -> task.memcg_in_oom
* task.memcg_oom.gfp_mask	-> task.memcg_oom_gfp_mask
* task.memcg_oom.order          -> task.memcg_oom_order
* task.memcg_oom.may_oom        -> task.memcg_may_oom

In addition, task.memcg_may_oom is relocated to where the other
bitfields are, which reduces the size of task_struct.
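
For illustration, here is a minimal userspace sketch of the padding
argument (hypothetical stand-in fields; sizes assume a typical LP64
ABI such as x86-64; this is not part of the patch):

  /* gcc -O2 padding.c && ./a.out */
  #include <stdio.h>

  struct nested {                  /* roughly the old task_struct shape */
          unsigned other_bits:3;   /* stand-in for existing bitfields */
          struct {
                  void *memcg;     /* struct mem_cgroup * in the kernel */
                  unsigned int gfp_mask;
                  int order;
                  unsigned int may_oom:1;  /* can't merge with other_bits */
          } memcg_oom;             /* 20 bytes of fields, padded to 24 */
  };

  struct flat {                    /* roughly the new shape */
          unsigned other_bits:3;
          unsigned memcg_may_oom:1;        /* shares the bitfield word */
          void *memcg_in_oom;
          unsigned int memcg_oom_gfp_mask;
          int memcg_oom_order;
  };

  int main(void)
  {
          /* typically prints "nested: 32, flat: 24" */
          printf("nested: %zu, flat: %zu\n",
                 sizeof(struct nested), sizeof(struct flat));
          return 0;
  }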

Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: Michal Hocko <mhocko@suse.com>
Reviewed-by: Vladimir Davydov <vdavydov@parallels.com>
---
Hello,

Andrew, these are the two patches that were acked in the following
thread.

 http://lkml.kernel.org/g/20150828220158.GD11089@htj.dyndns.org

Acks are added and the second patch's description is updated as
suggested by Michal and Vladimir.

Can you please put them in -mm?

Thanks!

 include/linux/memcontrol.h |   10 +++++-----
 include/linux/sched.h      |   13 ++++++-------
 mm/memcontrol.c            |   16 ++++++++--------
 3 files changed, 19 insertions(+), 20 deletions(-)

--- a/include/linux/memcontrol.h
+++ b/include/linux/memcontrol.h
@@ -407,19 +407,19 @@ void mem_cgroup_print_oom_info(struct me
 
 static inline void mem_cgroup_oom_enable(void)
 {
-	WARN_ON(current->memcg_oom.may_oom);
-	current->memcg_oom.may_oom = 1;
+	WARN_ON(current->memcg_may_oom);
+	current->memcg_may_oom = 1;
 }
 
 static inline void mem_cgroup_oom_disable(void)
 {
-	WARN_ON(!current->memcg_oom.may_oom);
-	current->memcg_oom.may_oom = 0;
+	WARN_ON(!current->memcg_may_oom);
+	current->memcg_may_oom = 0;
 }
 
 static inline bool task_in_memcg_oom(struct task_struct *p)
 {
-	return p->memcg_oom.memcg;
+	return p->memcg_in_oom;
 }
 
 bool mem_cgroup_oom_synchronize(bool wait);
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -1451,7 +1451,9 @@ struct task_struct {
 	unsigned sched_reset_on_fork:1;
 	unsigned sched_contributes_to_load:1;
 	unsigned sched_migrated:1;
-
+#ifdef CONFIG_MEMCG
+	unsigned memcg_may_oom:1;
+#endif
 #ifdef CONFIG_MEMCG_KMEM
 	unsigned memcg_kmem_skip_account:1;
 #endif
@@ -1782,12 +1784,9 @@ struct task_struct {
 	unsigned long trace_recursion;
 #endif /* CONFIG_TRACING */
 #ifdef CONFIG_MEMCG
-	struct memcg_oom_info {
-		struct mem_cgroup *memcg;
-		gfp_t gfp_mask;
-		int order;
-		unsigned int may_oom:1;
-	} memcg_oom;
+	struct mem_cgroup *memcg_in_oom;
+	gfp_t memcg_oom_gfp_mask;
+	int memcg_oom_order;
 #endif
 #ifdef CONFIG_UPROBES
 	struct uprobe_task *utask;
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -1652,7 +1652,7 @@ static void memcg_oom_recover(struct mem
 
 static void mem_cgroup_oom(struct mem_cgroup *memcg, gfp_t mask, int order)
 {
-	if (!current->memcg_oom.may_oom)
+	if (!current->memcg_may_oom)
 		return;
 	/*
 	 * We are in the middle of the charge context here, so we
@@ -1669,9 +1669,9 @@ static void mem_cgroup_oom(struct mem_cg
 	 * and when we know whether the fault was overall successful.
 	 */
 	css_get(&memcg->css);
-	current->memcg_oom.memcg = memcg;
-	current->memcg_oom.gfp_mask = mask;
-	current->memcg_oom.order = order;
+	current->memcg_in_oom = memcg;
+	current->memcg_oom_gfp_mask = mask;
+	current->memcg_oom_order = order;
 }
 
 /**
@@ -1693,7 +1693,7 @@ static void mem_cgroup_oom(struct mem_cg
  */
 bool mem_cgroup_oom_synchronize(bool handle)
 {
-	struct mem_cgroup *memcg = current->memcg_oom.memcg;
+	struct mem_cgroup *memcg = current->memcg_in_oom;
 	struct oom_wait_info owait;
 	bool locked;
 
@@ -1721,8 +1721,8 @@ bool mem_cgroup_oom_synchronize(bool han
 	if (locked && !memcg->oom_kill_disable) {
 		mem_cgroup_unmark_under_oom(memcg);
 		finish_wait(&memcg_oom_waitq, &owait.wait);
-		mem_cgroup_out_of_memory(memcg, current->memcg_oom.gfp_mask,
-					 current->memcg_oom.order);
+		mem_cgroup_out_of_memory(memcg, current->memcg_oom_gfp_mask,
+					 current->memcg_oom_order);
 	} else {
 		schedule();
 		mem_cgroup_unmark_under_oom(memcg);
@@ -1739,7 +1739,7 @@ bool mem_cgroup_oom_synchronize(bool han
 		memcg_oom_recover(memcg);
 	}
 cleanup:
-	current->memcg_oom.memcg = NULL;
+	current->memcg_in_oom = NULL;
 	css_put(&memcg->css);
 	return true;
 }


* [PATCH v3 2/2] memcg: punt high overage reclaim to return-to-userland path
  2015-09-13 18:59 ` Tejun Heo
@ 2015-09-13 19:00 ` Tejun Heo
  2015-09-15  7:47     ` Johannes Weiner
  -1 siblings, 1 reply; 49+ messages in thread
From: Tejun Heo @ 2015-09-13 19:00 UTC (permalink / raw)
  To: akpm, hannes, mhocko; +Cc: cgroups, linux-mm, vdavydov, kernel-team

Currently, try_charge() tries to reclaim memory synchronously when
the high limit is breached; however, if the allocation doesn't have
__GFP_WAIT, synchronous reclaim is skipped.  If a process performs
only speculative allocations, it can blow way past the high limit.
This is easily reproducible by simply doing "find /".  The slab/slub
allocator tries speculative allocations first, so as long as there's
memory which can be consumed without blocking, it can keep allocating
memory regardless of the high limit.

This patch makes try_charge() always punt the over-high reclaim to
the return-to-userland path.  If try_charge() detects that the high
limit is breached, it adds the overage to
current->memcg_nr_pages_over_high and schedules execution of
mem_cgroup_handle_over_high(), which performs synchronous reclaim
from the return-to-userland path.
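
As a rough userspace analogue of that flow (hypothetical names, not
the kernel code itself): the charge site only records the overage and
arms a flag, and the return-to-userland hook does the actual work.

  /* gcc -O2 punt.c && ./a.out */
  #include <stdbool.h>
  #include <stdio.h>

  /* stand-ins for current->memcg_nr_pages_over_high and the resume flag */
  static unsigned int nr_pages_over_high;
  static bool notify_resume;

  static void try_charge(unsigned int nr_pages, unsigned int usage,
                         unsigned int high)
  {
          if (usage > high) {             /* over high: don't reclaim here */
                  nr_pages_over_high += nr_pages;
                  notify_resume = true;   /* like set_notify_resume(current) */
          }
  }

  static void return_to_userland(void)    /* like tracehook_notify_resume() */
  {
          if (!notify_resume)
                  return;
          notify_resume = false;
          printf("reclaim %u pages with GFP_KERNEL\n", nr_pages_over_high);
          nr_pages_over_high = 0;
  }

  int main(void)
  {
          try_charge(32, 1032, 1000);     /* e.g. !__GFP_WAIT charges past high */
          try_charge(32, 1064, 1000);
          return_to_userland();           /* prints "reclaim 64 pages ..." */
          return 0;
  }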

As long as the kernel doesn't have a runaway allocation spree, this
should provide enough protection while making kmemcg behave more
consistently.  It also has the following benefits.

- All over-high reclaims can use GFP_KERNEL regardless of the
  specific gfp mask, e.g. GFP_NOFS, that was in use when the limit
  was breached.

- It copes with prio inversion.  Previously, a low-prio task with
  small memory.high might perform over-high reclaim with a bunch of
  locks held.  If a higher prio task needed any of these locks, it
  would have to wait until the low prio task finished reclaim and
  released the locks.  By handing over-high reclaim to the
  return-to-userland path, this issue can be avoided.

v3: - Description updated.

v2: - Switched to reclaiming only the overage caused by current rather
      than the difference between usage and high as suggested by
      Michal.
    - Don't record the memcg which went over high limit.  This makes
      exit path handling unnecessary.  Dropped.
    - Drop mentions of avoiding high stack usage from description as
      suggested by Vladimir.  max limit still triggers direct reclaim.

Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: Michal Hocko <mhocko@kernel.org>
Reviewed-by: Vladimir Davydov <vdavydov@parallels.com>
---
 include/linux/memcontrol.h |    6 +++++
 include/linux/sched.h      |    3 ++
 include/linux/tracehook.h  |    3 ++
 mm/memcontrol.c            |   47 +++++++++++++++++++++++++++++++++++++--------
 4 files changed, 51 insertions(+), 8 deletions(-)

--- a/include/linux/memcontrol.h
+++ b/include/linux/memcontrol.h
@@ -402,6 +402,8 @@ static inline int mem_cgroup_inactive_an
 	return inactive * inactive_ratio < active;
 }
 
+void mem_cgroup_handle_over_high(void);
+
 void mem_cgroup_print_oom_info(struct mem_cgroup *memcg,
 				struct task_struct *p);
 
@@ -621,6 +623,10 @@ static inline void mem_cgroup_end_page_s
 {
 }
 
+static inline void mem_cgroup_handle_over_high(void)
+{
+}
+
 static inline void mem_cgroup_oom_enable(void)
 {
 }
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -1787,6 +1787,9 @@ struct task_struct {
 	struct mem_cgroup *memcg_in_oom;
 	gfp_t memcg_oom_gfp_mask;
 	int memcg_oom_order;
+
+	/* number of pages to reclaim on returning to userland */
+	unsigned int memcg_nr_pages_over_high;
 #endif
 #ifdef CONFIG_UPROBES
 	struct uprobe_task *utask;
--- a/include/linux/tracehook.h
+++ b/include/linux/tracehook.h
@@ -50,6 +50,7 @@
 #include <linux/ptrace.h>
 #include <linux/security.h>
 #include <linux/task_work.h>
+#include <linux/memcontrol.h>
 struct linux_binprm;
 
 /*
@@ -188,6 +189,8 @@ static inline void tracehook_notify_resu
 	smp_mb__after_atomic();
 	if (unlikely(current->task_works))
 		task_work_run();
+
+	mem_cgroup_handle_over_high();
 }
 
 #endif	/* <linux/tracehook.h> */
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -62,6 +62,7 @@
 #include <linux/oom.h>
 #include <linux/lockdep.h>
 #include <linux/file.h>
+#include <linux/tracehook.h>
 #include "internal.h"
 #include <net/sock.h>
 #include <net/ip.h>
@@ -1963,6 +1964,31 @@ static int memcg_cpu_hotplug_callback(st
 	return NOTIFY_OK;
 }
 
+/*
+ * Scheduled by try_charge() to be executed from the userland return path
+ * and reclaims memory over the high limit.
+ */
+void mem_cgroup_handle_over_high(void)
+{
+	unsigned int nr_pages = current->memcg_nr_pages_over_high;
+	struct mem_cgroup *memcg, *pos;
+
+	if (likely(!nr_pages))
+		return;
+
+	pos = memcg = get_mem_cgroup_from_mm(current->mm);
+
+	do {
+		if (page_counter_read(&pos->memory) <= pos->high)
+			continue;
+		mem_cgroup_events(pos, MEMCG_HIGH, 1);
+		try_to_free_mem_cgroup_pages(pos, nr_pages, GFP_KERNEL, true);
+	} while ((pos = parent_mem_cgroup(pos)));
+
+	css_put(&memcg->css);
+	current->memcg_nr_pages_over_high = 0;
+}
+
 static int try_charge(struct mem_cgroup *memcg, gfp_t gfp_mask,
 		      unsigned int nr_pages)
 {
@@ -2071,17 +2097,22 @@ done_restock:
 	css_get_many(&memcg->css, batch);
 	if (batch > nr_pages)
 		refill_stock(memcg, batch - nr_pages);
-	if (!(gfp_mask & __GFP_WAIT))
-		goto done;
+
 	/*
-	 * If the hierarchy is above the normal consumption range,
-	 * make the charging task trim their excess contribution.
+	 * If the hierarchy is above the normal consumption range, schedule
+	 * reclaim on returning to userland.  We can perform reclaim here
+	 * if __GFP_WAIT but let's always punt for simplicity and so that
+	 * GFP_KERNEL can consistently be used during reclaim.  @memcg is
+	 * not recorded as it most likely matches current's and won't
+	 * change in the meantime.  As high limit is checked again before
+	 * reclaim, the cost of mismatch is negligible.
 	 */
 	do {
-		if (page_counter_read(&memcg->memory) <= memcg->high)
-			continue;
-		mem_cgroup_events(memcg, MEMCG_HIGH, 1);
-		try_to_free_mem_cgroup_pages(memcg, nr_pages, gfp_mask, true);
+		if (page_counter_read(&memcg->memory) > memcg->high) {
+			current->memcg_nr_pages_over_high += nr_pages;
+			set_notify_resume(current);
+			break;
+		}
 	} while ((memcg = parent_mem_cgroup(memcg)));
 done:
 	return ret;


* Re: [PATCH 1/2] memcg: flatten task_struct->memcg_oom
@ 2015-09-15  7:37   ` Johannes Weiner
  0 siblings, 0 replies; 49+ messages in thread
From: Johannes Weiner @ 2015-09-15  7:37 UTC (permalink / raw)
  To: Tejun Heo; +Cc: akpm, mhocko, cgroups, linux-mm, vdavydov, kernel-team

On Sun, Sep 13, 2015 at 02:59:40PM -0400, Tejun Heo wrote:
> task_struct->memcg_oom is a sub-struct containing fields which are
> used for async memcg oom handling.  Most task_struct fields aren't
> packaged this way and it can lead to unnecessary alignment paddings.
> This patch flattens it.
> 
> * task.memcg_oom.memcg          -> task.memcg_in_oom
> * task.memcg_oom.gfp_mask	-> task.memcg_oom_gfp_mask
> * task.memcg_oom.order          -> task.memcg_oom_order
> * task.memcg_oom.may_oom        -> task.memcg_may_oom
> 
> In addition, task.memcg_may_oom is relocated to where other bitfields
> are which reduces the size of task_struct.
> 
> Signed-off-by: Tejun Heo <tj@kernel.org>
> Acked-by: Michal Hocko <mhocko@suse.com>
> Reviewed-by: Vladimir Davydov <vdavydov@parallels.com>

Looks good to me, thanks.

Acked-by: Johannes Weiner <hannes@cmpxchg.org>


* Re: [PATCH v3 2/2] memcg: punt high overage reclaim to return-to-userland path
@ 2015-09-15  7:47     ` Johannes Weiner
  0 siblings, 0 replies; 49+ messages in thread
From: Johannes Weiner @ 2015-09-15  7:47 UTC (permalink / raw)
  To: Tejun Heo; +Cc: akpm, mhocko, cgroups, linux-mm, vdavydov, kernel-team

On Sun, Sep 13, 2015 at 03:00:08PM -0400, Tejun Heo wrote:
> Currently, try_charge() tries to reclaim memory synchronously when the
> high limit is breached; however, if the allocation doesn't have
> __GFP_WAIT, synchronous reclaim is skipped.  If a process performs
> only speculative allocations, it can blow way past the high limit.
> This is actually easily reproducible by simply doing "find /".
> slab/slub allocator tries speculative allocations first, so as long as
> there's memory which can be consumed without blocking, it can keep
> allocating memory regardless of the high limit.
> 
> This patch makes try_charge() always punt the over-high reclaim to the
> return-to-userland path.  If try_charge() detects that high limit is
> breached, it adds the overage to current->memcg_nr_pages_over_high and
> schedules execution of mem_cgroup_handle_over_high() which performs
> synchronous reclaim from the return-to-userland path.

Why can't we simply fail NOWAIT allocations when the high limit is
breached? We do the same for the max limit.

As I see it, NOWAIT allocations are speculative attempts on available
memory. We should be able to just fail them and have somebody that is
allowed to reclaim try again, just like with the max limit.


* Re: [PATCH v3 2/2] memcg: punt high overage reclaim to return-to-userland path
  2015-09-15  7:47     ` Johannes Weiner
@ 2015-09-15 15:53     ` Tejun Heo
  2015-09-15 16:12       ` Johannes Weiner
  -1 siblings, 1 reply; 49+ messages in thread
From: Tejun Heo @ 2015-09-15 15:53 UTC (permalink / raw)
  To: Johannes Weiner; +Cc: akpm, mhocko, cgroups, linux-mm, vdavydov, kernel-team

Hello, Johannes.

On Tue, Sep 15, 2015 at 09:47:24AM +0200, Johannes Weiner wrote:
> Why can't we simply fail NOWAIT allocations when the high limit is
> breached? We do the same for the max limit.

Because that can lead to continued systematic failures of NOWAIT
allocations.  For that to work, we'll have to add async reclaiming.

> As I see it, NOWAIT allocations are speculative attempts on available
> memory. We should be able to just fail them and have somebody that is
> allowed to reclaim try again, just like with the max limit.

Yes, but the assumption is that even back-to-back NOWAIT allocations
won't continue to fail indefinitely.

Thanks.

-- 
tejun


* Re: [PATCH v3 2/2] memcg: punt high overage reclaim to return-to-userland path
  2015-09-15 15:53     ` Tejun Heo
@ 2015-09-15 16:12       ` Johannes Weiner
  2015-09-15 16:22           ` Tejun Heo
  0 siblings, 1 reply; 49+ messages in thread
From: Johannes Weiner @ 2015-09-15 16:12 UTC (permalink / raw)
  To: Tejun Heo; +Cc: akpm, mhocko, cgroups, linux-mm, vdavydov, kernel-team

On Tue, Sep 15, 2015 at 11:53:55AM -0400, Tejun Heo wrote:
> Hello, Johannes.
> 
> On Tue, Sep 15, 2015 at 09:47:24AM +0200, Johannes Weiner wrote:
> > Why can't we simply fail NOWAIT allocations when the high limit is
> > breached? We do the same for the max limit.
> 
> Because that can lead to continued systematic failures of NOWAIT
> allocations.  For that to work, we'll have to add async reclaiming.
> 
> > As I see it, NOWAIT allocations are speculative attempts on available
> > memory. We should be able to just fail them and have somebody that is
> > allowed to reclaim try again, just like with the max limit.
> 
> Yes, but the assumption is that even back-to-back NOWAIT allocations
> won't continue to fail indefinitely.

But they have been failing indefinitely forever once you hit the hard
limit in the past. There was never an async reclaim provision there.

I can definitely see that the unconstrained high limit breaching needs
to be fixed one way or another, I just don't quite understand why you
chose to go for new semantics. Is there a new or a specific usecase
you had in mind when you chose deferred reclaim over simply failing?


* Re: [PATCH v3 2/2] memcg: punt high overage reclaim to return-to-userland path
@ 2015-09-15 16:22           ` Tejun Heo
  0 siblings, 0 replies; 49+ messages in thread
From: Tejun Heo @ 2015-09-15 16:22 UTC (permalink / raw)
  To: Johannes Weiner; +Cc: akpm, mhocko, cgroups, linux-mm, vdavydov, kernel-team

Hello,

On Tue, Sep 15, 2015 at 06:12:18PM +0200, Johannes Weiner wrote:
> But they have been failing indefinitely forever once you hit the hard
> limit in the past. There was never an async reclaim provision there.
>
> I can definitely see that the unconstrained high limit breaching needs
> to be fixed one way or another, I just don't quite understand why you
> chose to go for new semantics. Is there a new or a specific usecase
> you had in mind when you chose deferred reclaim over simply failing?

Hmmm... so if we just fail, it breaks the assumptions that slab/slub
is making and while they might not fail outright would behave in an
undesirable way.  It's just that we didn't notice that before with
limit_on_bytes and at least on the v2 interface the distinction
between high and max makes the problem easy to deal with from high
enforcement.

Thanks.

-- 
tejun


* Re: [PATCH v3 2/2] memcg: punt high overage reclaim to return-to-userland path
@ 2015-09-15 16:33             ` Johannes Weiner
  0 siblings, 0 replies; 49+ messages in thread
From: Johannes Weiner @ 2015-09-15 16:33 UTC (permalink / raw)
  To: Tejun Heo; +Cc: akpm, mhocko, cgroups, linux-mm, vdavydov, kernel-team

On Tue, Sep 15, 2015 at 12:22:53PM -0400, Tejun Heo wrote:
> Hello,
> 
> On Tue, Sep 15, 2015 at 06:12:18PM +0200, Johannes Weiner wrote:
> > But they have been failing indefinitely forever once you hit the hard
> > limit in the past. There was never an async reclaim provision there.
> >
> > I can definitely see that the unconstrained high limit breaching needs
> > to be fixed one way or another, I just don't quite understand why you
> > chose to go for new semantics. Is there a new or a specific usecase
> > you had in mind when you chose deferred reclaim over simply failing?
> 
> Hmmm... so if we just fail, it breaks the assumptions that slab/slub
> is making and while they might not fail outright would behave in an
> undesirable way.  It's just that we didn't notice that before with
> limit_in_bytes and at least on the v2 interface the distinction
> between high and max makes the problem easy to deal with from high
> enforcement.

Gotcha, it makes sense to address this then. Thanks for clarifying.

Acked-by: Johannes Weiner <hannes@cmpxchg.org>


* Re: [PATCH 1/2] memcg: flatten task_struct->memcg_oom
@ 2015-09-20 14:45   ` Sasha Levin
  0 siblings, 0 replies; 49+ messages in thread
From: Sasha Levin @ 2015-09-20 14:45 UTC (permalink / raw)
  To: Tejun Heo, akpm, hannes, mhocko; +Cc: cgroups, linux-mm, vdavydov, kernel-team

On 09/13/2015 02:59 PM, Tejun Heo wrote:
> task_struct->memcg_oom is a sub-struct containing fields which are
> used for async memcg oom handling.  Most task_struct fields aren't
> packaged this way and it can lead to unnecessary alignment paddings.
> This patch flattens it.
> 
> * task.memcg_oom.memcg          -> task.memcg_in_oom
> * task.memcg_oom.gfp_mask	-> task.memcg_oom_gfp_mask
> * task.memcg_oom.order          -> task.memcg_oom_order
> * task.memcg_oom.may_oom        -> task.memcg_may_oom
> 
> In addition, task.memcg_may_oom is relocated to where other bitfields
> are which reduces the size of task_struct.
> 
> Signed-off-by: Tejun Heo <tj@kernel.org>
> Acked-by: Michal Hocko <mhocko@suse.com>
> Reviewed-by: Vladimir Davydov <vdavydov@parallels.com>
> ---
> Hello,
> 
> Andrew, these are the two patches which got acked from the following
> thread.
> 
>  http://lkml.kernel.org/g/20150828220158.GD11089@htj.dyndns.org
> 
> Acks are added and the second patch's description is updated as
> suggested by Michal and Vladimir.
> 
> Can you please put them in -mm?

Hi Tejun,

I've started seeing these warnings:

[1598889.250160] WARNING: CPU: 3 PID: 11648 at include/linux/memcontrol.h:414 handle_mm_fault+0x1020/0x3fa0()
[1598889.786891] Modules linked in:
[1598890.883223] Unable to find swap-space signature
[1598891.463736]
[1598892.236001] CPU: 3 PID: 11648 Comm: trinity-c10 Not tainted 4.3.0-rc1-next-20150918-sasha-00081-g4b7392a-dirty #2565
[1598892.239377]  ffffffffb4746d40 ffff8802edad7c70 ffffffffabfe97ba 0000000000000000
[1598892.241723]  ffff8802edad7cb0 ffffffffaa367466 ffffffffaa764140 ffff88037f00bec0
[1598892.244135]  ffff8802ecb94000 ffff88042420a000 0000000000b98fe8 ffff88042420a000
[1598892.246393] Call Trace:
[1598892.247256] dump_stack (lib/dump_stack.c:52)
[1598892.249105] warn_slowpath_common (kernel/panic.c:448)
[1598892.253202] warn_slowpath_null (kernel/panic.c:482)
[1598892.255148] handle_mm_fault (include/linux/memcontrol.h:414 mm/memory.c:3430)
[1598892.268151] __do_page_fault (arch/x86/mm/fault.c:1239)
[1598892.269022] trace_do_page_fault (arch/x86/mm/fault.c:1331 include/linux/jump_label.h:133 include/linux/context_tracking_state.h:30 include/linux/context_tracking.h:46 arch/x86/mm/fault.c:1332)
[1598892.269894] do_async_page_fault (arch/x86/kernel/kvm.c:280)
[1598892.270792] async_page_fault (arch/x86/entry/entry_64.S:989)

Not sure if it's because of this patch or not, but I haven't seen them before.


Thanks,
Sasha


* Re: [PATCH 1/2] memcg: flatten task_struct->memcg_oom
@ 2015-09-21 20:01     ` Tejun Heo
  0 siblings, 0 replies; 49+ messages in thread
From: Tejun Heo @ 2015-09-21 20:01 UTC (permalink / raw)
  To: Peter Zijlstra, Ingo Molnar, Sasha Levin
  Cc: akpm, hannes, mhocko, cgroups, linux-mm, vdavydov, kernel-team

(cc'ing scheduler folks)

On Sun, Sep 20, 2015 at 10:45:25AM -0400, Sasha Levin wrote:
> On 09/13/2015 02:59 PM, Tejun Heo wrote:
> > task_struct->memcg_oom is a sub-struct containing fields which are
> > used for async memcg oom handling.  Most task_struct fields aren't
> > packaged this way and it can lead to unnecessary alignment paddings.
> > This patch flattens it.
> > 
> > * task.memcg_oom.memcg          -> task.memcg_in_oom
> > * task.memcg_oom.gfp_mask	-> task.memcg_oom_gfp_mask
> > * task.memcg_oom.order          -> task.memcg_oom_order
> > * task.memcg_oom.may_oom        -> task.memcg_may_oom
...
> I've started seeing these warnings:
> 
> [1598889.250160] WARNING: CPU: 3 PID: 11648 at include/linux/memcontrol.h:414 handle_mm_fault+0x1020/0x3fa0()
...
> [1598892.247256] dump_stack (lib/dump_stack.c:52)
> [1598892.249105] warn_slowpath_common (kernel/panic.c:448)
> [1598892.253202] warn_slowpath_null (kernel/panic.c:482)
> [1598892.255148] handle_mm_fault (include/linux/memcontrol.h:414 mm/memory.c:3430)
> [1598892.268151] __do_page_fault (arch/x86/mm/fault.c:1239)
> [1598892.269022] trace_do_page_fault (arch/x86/mm/fault.c:1331 include/linux/jump_label.h:133 include/linux/context_tracking_state.h:30 include/linux/context_tracking.h:46 arch/x86/mm/fault.c:1332)
> [1598892.269894] do_async_page_fault (arch/x86/kernel/kvm.c:280)
> [1598892.270792] async_page_fault (arch/x86/entry/entry_64.S:989)
> 
> Not sure if it's because of this patch or not, but I haven't seen them before.

So, the only way the patch could have caused the above is if someone
who isn't the task itself is writing to the bitfields while the task
is running.  Looking through the fields, ->sched_reset_on_fork seems a
bit suspicious.  __sched_setscheduler() looks like it can modify the
bit while the target task is running.  Peter, am I misreading the
code?

Thanks.

-- 
tejun


* Re: [PATCH 1/2] memcg: flatten task_struct->memcg_oom
@ 2015-09-30 18:54       ` Tejun Heo
  0 siblings, 0 replies; 49+ messages in thread
From: Tejun Heo @ 2015-09-30 18:54 UTC (permalink / raw)
  To: Peter Zijlstra, Ingo Molnar, Sasha Levin
  Cc: akpm, hannes, mhocko, cgroups, linux-mm, vdavydov, kernel-team

On Mon, Sep 21, 2015 at 04:01:41PM -0400, Tejun Heo wrote:
> (cc'ing scheduler folks)
> 
> On Sun, Sep 20, 2015 at 10:45:25AM -0400, Sasha Levin wrote:
> > On 09/13/2015 02:59 PM, Tejun Heo wrote:
> > > task_struct->memcg_oom is a sub-struct containing fields which are
> > > used for async memcg oom handling.  Most task_struct fields aren't
> > > packaged this way and it can lead to unnecessary alignment paddings.
> > > This patch flattens it.
> > > 
> > > * task.memcg_oom.memcg          -> task.memcg_in_oom
> > > * task.memcg_oom.gfp_mask	-> task.memcg_oom_gfp_mask
> > > * task.memcg_oom.order          -> task.memcg_oom_order
> > > * task.memcg_oom.may_oom        -> task.memcg_may_oom
> ...
> > I've started seeing these warnings:
> > 
> > [1598889.250160] WARNING: CPU: 3 PID: 11648 at include/linux/memcontrol.h:414 handle_mm_fault+0x1020/0x3fa0()
> ...
> > [1598892.247256] dump_stack (lib/dump_stack.c:52)
> > [1598892.249105] warn_slowpath_common (kernel/panic.c:448)
> > [1598892.253202] warn_slowpath_null (kernel/panic.c:482)
> > [1598892.255148] handle_mm_fault (include/linux/memcontrol.h:414 mm/memory.c:3430)
> > [1598892.268151] __do_page_fault (arch/x86/mm/fault.c:1239)
> > [1598892.269022] trace_do_page_fault (arch/x86/mm/fault.c:1331 include/linux/jump_label.h:133 include/linux/context_tracking_state.h:30 include/linux/context_tracking.h:46 arch/x86/mm/fault.c:1332)
> > [1598892.269894] do_async_page_fault (arch/x86/kernel/kvm.c:280)
> > [1598892.270792] async_page_fault (arch/x86/entry/entry_64.S:989)
> > 
> > Not sure if it's because of this patch or not, but I haven't seen them before.
> 
> So, the only way the patch could have caused the above is if someone
> who isn't the task itself is writing to the bitfields while the task
> is running.  Looking through the fields, ->sched_reset_on_fork seems a
> bit suspicious.  __sched_setscheduler() looks like it can modify the
> bit while the target task is running.  Peter, am I misreading the
> code?

Ping?

-- 
tejun


* Re: [PATCH 1/2] memcg: flatten task_struct->memcg_oom
@ 2015-11-25 14:43       ` Peter Zijlstra
  0 siblings, 0 replies; 49+ messages in thread
From: Peter Zijlstra @ 2015-11-25 14:43 UTC (permalink / raw)
  To: Tejun Heo
  Cc: Ingo Molnar, Sasha Levin, akpm, hannes, mhocko, cgroups,
	linux-mm, vdavydov, kernel-team

On Mon, Sep 21, 2015 at 04:01:41PM -0400, Tejun Heo wrote:
> (cc'ing scheduler folks)
> 
> On Sun, Sep 20, 2015 at 10:45:25AM -0400, Sasha Levin wrote:
> > On 09/13/2015 02:59 PM, Tejun Heo wrote:
> > > task_struct->memcg_oom is a sub-struct containing fields which are
> > > used for async memcg oom handling.  Most task_struct fields aren't
> > > packaged this way and it can lead to unnecessary alignment paddings.
> > > This patch flattens it.
> > > 
> > > * task.memcg_oom.memcg          -> task.memcg_in_oom
> > > * task.memcg_oom.gfp_mask	-> task.memcg_oom_gfp_mask
> > > * task.memcg_oom.order          -> task.memcg_oom_order
> > > * task.memcg_oom.may_oom        -> task.memcg_may_oom
> ...
> > I've started seeing these warnings:
> > 
> > [1598889.250160] WARNING: CPU: 3 PID: 11648 at include/linux/memcontrol.h:414 handle_mm_fault+0x1020/0x3fa0()
> ...
> > [1598892.247256] dump_stack (lib/dump_stack.c:52)
> > [1598892.249105] warn_slowpath_common (kernel/panic.c:448)
> > [1598892.253202] warn_slowpath_null (kernel/panic.c:482)
> > [1598892.255148] handle_mm_fault (include/linux/memcontrol.h:414 mm/memory.c:3430)
> > [1598892.268151] __do_page_fault (arch/x86/mm/fault.c:1239)
> > [1598892.269022] trace_do_page_fault (arch/x86/mm/fault.c:1331 include/linux/jump_label.h:133 include/linux/context_tracking_state.h:30 include/linux/context_tracking.h:46 arch/x86/mm/fault.c:1332)
> > [1598892.269894] do_async_page_fault (arch/x86/kernel/kvm.c:280)
> > [1598892.270792] async_page_fault (arch/x86/entry/entry_64.S:989)
> > 
> > Not sure if it's because of this patch or not, but I haven't seen them before.
> 
> So, the only way the patch could have caused the above is if someone
> who isn't the task itself is writing to the bitfields while the task
> is running.  Looking through the fields, ->sched_reset_on_fork seems a
> bit suspicious.  __sched_setscheduler() looks like it can modify the
> bit while the target task is running.  Peter, am I misreading the
> code?

Nope, that's quite possible. Looks like we need to break up those
bitfields a bit. All the scheduler ones should be serialized by
scheduler locks, but the others are fair game.
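
For illustration, a minimal userspace sketch of that lost-update
problem (hypothetical, using pthreads instead of the scheduler and
page-fault paths): two contexts doing plain read-modify-write stores
to bits that share a word can clobber each other's updates.

  /* gcc -O2 -pthread bitrace.c && ./a.out */
  #include <pthread.h>
  #include <stdio.h>

  struct flags {           /* two bits sharing one word, as in task_struct */
          unsigned a:1;    /* think sched_reset_on_fork, written remotely */
          unsigned b:1;    /* think memcg_may_oom, written by 'current' */
  };

  /* volatile only keeps the compiler from collapsing the loops;
   * the unsynchronized word-sized stores are the point */
  static volatile struct flags f;

  static void *toggle_a(void *arg)
  {
          for (int i = 0; i < 10000000; i++)
                  f.a = !f.a;  /* load whole word, flip bit, store whole word */
          return NULL;
  }

  static void *toggle_b(void *arg)
  {
          for (int i = 0; i < 10000000; i++)
                  f.b = !f.b;
          return NULL;
  }

  int main(void)
  {
          pthread_t ta, tb;

          pthread_create(&ta, NULL, toggle_a, NULL);
          pthread_create(&tb, NULL, toggle_b, NULL);
          pthread_join(ta, NULL);
          pthread_join(tb, NULL);
          /* an even number of toggles should leave both bits 0, but lost
           * updates regularly leave one of them set */
          printf("a=%d b=%d\n", f.a, f.b);
          return 0;
  }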


* Re: [PATCH 1/2] memcg: flatten task_struct->memcg_oom
@ 2015-11-25 15:02         ` Peter Zijlstra
  0 siblings, 0 replies; 49+ messages in thread
From: Peter Zijlstra @ 2015-11-25 15:02 UTC (permalink / raw)
  To: Tejun Heo
  Cc: Ingo Molnar, Sasha Levin, akpm, hannes, mhocko, cgroups,
	linux-mm, vdavydov, kernel-team

On Wed, Nov 25, 2015 at 03:43:54PM +0100, Peter Zijlstra wrote:
> On Mon, Sep 21, 2015 at 04:01:41PM -0400, Tejun Heo wrote:
> > So, the only way the patch could have caused the above is if someone
> > who isn't the task itself is writing to the bitfields while the task
> > is running.  Looking through the fields, ->sched_reset_on_fork seems a
> > bit suspicious.  __sched_setscheduler() looks like it can modify the
> > bit while the target task is running.  Peter, am I misreading the
> > code?
> 
> Nope, that's quite possible. Looks like we need to break up those
> bitfields a bit. All the scheduler ones should be serialized by
> scheduler locks, but the others are fair game.

Maybe something like so; but my brain is a complete mess today.

---
 include/linux/sched.h | 11 ++++++-----
 1 file changed, 6 insertions(+), 5 deletions(-)

diff --git a/include/linux/sched.h b/include/linux/sched.h
index f425aac63317..b474e0f05327 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -1455,14 +1455,15 @@ struct task_struct {
 	/* Used for emulating ABI behavior of previous Linux versions */
 	unsigned int personality;
 
-	unsigned in_execve:1;	/* Tell the LSMs that the process is doing an
-				 * execve */
-	unsigned in_iowait:1;
-
-	/* Revert to default priority/policy when forking */
+	/* scheduler bits, serialized by scheduler locks */
 	unsigned sched_reset_on_fork:1;
 	unsigned sched_contributes_to_load:1;
 	unsigned sched_migrated:1;
+	unsigned __padding_sched:29;
+
+	/* unserialized, strictly 'current' */
+	unsigned in_execve:1; /* bit to tell LSMs we're in execve */
+	unsigned in_iowait:1;
 #ifdef CONFIG_MEMCG
 	unsigned memcg_may_oom:1;
 #endif


* Re: [PATCH 1/2] memcg: flatten task_struct->memcg_oom
  2015-11-25 15:02         ` Peter Zijlstra
  (?)
@ 2015-11-25 15:31         ` Andrey Ryabinin
  2015-11-25 17:34             ` Dmitry Vyukov
  2015-11-25 17:44           ` Peter Zijlstra
  -1 siblings, 2 replies; 49+ messages in thread
From: Andrey Ryabinin @ 2015-11-25 15:31 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Tejun Heo, Ingo Molnar, Sasha Levin, Andrew Morton,
	Johannes Weiner, mhocko, cgroups, linux-mm, vdavydov,
	kernel-team, Dmitry Vyukov

2015-11-25 18:02 GMT+03:00 Peter Zijlstra <peterz@infradead.org>:
> On Wed, Nov 25, 2015 at 03:43:54PM +0100, Peter Zijlstra wrote:
>> On Mon, Sep 21, 2015 at 04:01:41PM -0400, Tejun Heo wrote:
>> > So, the only way the patch could have caused the above is if someone
>> > who isn't the task itself is writing to the bitfields while the task
>> > is running.  Looking through the fields, ->sched_reset_on_fork seems a
>> > bit suspicious.  __sched_setscheduler() looks like it can modify the
>> > bit while the target task is running.  Peter, am I misreading the
>> > code?
>>
>> Nope, that's quite possible. Looks like we need to break up those
>> bitfields a bit. All the scheduler ones should be serialized by
>> scheduler locks, but the others are fair game.
>
> Maybe something like so; but my brain is a complete mess today.
>
> ---
>  include/linux/sched.h | 11 ++++++-----
>  1 file changed, 6 insertions(+), 5 deletions(-)
>
> diff --git a/include/linux/sched.h b/include/linux/sched.h
> index f425aac63317..b474e0f05327 100644
> --- a/include/linux/sched.h
> +++ b/include/linux/sched.h
> @@ -1455,14 +1455,15 @@ struct task_struct {
>         /* Used for emulating ABI behavior of previous Linux versions */
>         unsigned int personality;
>
> -       unsigned in_execve:1;   /* Tell the LSMs that the process is doing an
> -                                * execve */
> -       unsigned in_iowait:1;
> -
> -       /* Revert to default priority/policy when forking */
> +       /* scheduler bits, serialized by scheduler locks */
>         unsigned sched_reset_on_fork:1;
>         unsigned sched_contributes_to_load:1;
>         unsigned sched_migrated:1;
> +       unsigned __padding_sched:29;

AFAIK the order of bit fields is implementation defined, so GCC could
sort all these bits as it wants.
You could use an unnamed zero-width bit-field to force padding:

         unsigned :0; // force alignment to the next boundary.
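
A small standalone sketch of that (illustrative member names, not the
kernel struct): the unnamed :0 member closes the current allocation
unit, so the bit-fields after it start in a fresh unsigned, which
sizeof makes visible.

#include <stdio.h>

struct packed {		/* all five bits share a single unsigned */
	unsigned a:1;
	unsigned b:1;
	unsigned c:1;
	unsigned d:1;
	unsigned e:1;
};

struct split {
	unsigned a:1;
	unsigned b:1;
	unsigned c:1;
	unsigned :0;	/* force alignment to the next boundary */
	unsigned d:1;	/* d and e now live in a second unsigned */
	unsigned e:1;
};

int main(void)
{
	/* Typically prints 4 and 8 with a 32-bit unsigned int. */
	printf("packed: %zu bytes, split: %zu bytes\n",
	       sizeof(struct packed), sizeof(struct split));
	return 0;
}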

> +
> +       /* unserialized, strictly 'current' */
> +       unsigned in_execve:1; /* bit to tell LSMs we're in execve */
> +       unsigned in_iowait:1;
>  #ifdef CONFIG_MEMCG
>         unsigned memcg_may_oom:1;
>  #endif
>


^ permalink raw reply	[flat|nested] 49+ messages in thread

* Re: [PATCH 1/2] memcg: flatten task_struct->memcg_oom
@ 2015-11-25 17:34             ` Dmitry Vyukov
  0 siblings, 0 replies; 49+ messages in thread
From: Dmitry Vyukov @ 2015-11-25 17:34 UTC (permalink / raw)
  To: Andrey Ryabinin
  Cc: Peter Zijlstra, Tejun Heo, Ingo Molnar, Sasha Levin,
	Andrew Morton, Johannes Weiner, Michal Hocko, cgroups, linux-mm,
	vdavydov, kernel-team

On Wed, Nov 25, 2015 at 4:31 PM, Andrey Ryabinin <ryabinin.a.a@gmail.com> wrote:
> 2015-11-25 18:02 GMT+03:00 Peter Zijlstra <peterz@infradead.org>:
>> On Wed, Nov 25, 2015 at 03:43:54PM +0100, Peter Zijlstra wrote:
>>> On Mon, Sep 21, 2015 at 04:01:41PM -0400, Tejun Heo wrote:
>>> > So, the only way the patch could have caused the above is if someone
>>> > who isn't the task itself is writing to the bitfields while the task
>>> > is running.  Looking through the fields, ->sched_reset_on_fork seems a
>>> > bit suspicious.  __sched_setscheduler() looks like it can modify the
>>> > bit while the target task is running.  Peter, am I misreading the
>>> > code?
>>>
>>> Nope, that's quite possible. Looks like we need to break up those
>>> bitfields a bit. All the scheduler ones should be serialized by
>>> scheduler locks, but the others are fair game.
>>
>> Maybe something like so; but my brain is a complete mess today.
>>
>> ---
>>  include/linux/sched.h | 11 ++++++-----
>>  1 file changed, 6 insertions(+), 5 deletions(-)
>>
>> diff --git a/include/linux/sched.h b/include/linux/sched.h
>> index f425aac63317..b474e0f05327 100644
>> --- a/include/linux/sched.h
>> +++ b/include/linux/sched.h
>> @@ -1455,14 +1455,15 @@ struct task_struct {
>>         /* Used for emulating ABI behavior of previous Linux versions */
>>         unsigned int personality;
>>
>> -       unsigned in_execve:1;   /* Tell the LSMs that the process is doing an
>> -                                * execve */
>> -       unsigned in_iowait:1;
>> -
>> -       /* Revert to default priority/policy when forking */
>> +       /* scheduler bits, serialized by scheduler locks */
>>         unsigned sched_reset_on_fork:1;
>>         unsigned sched_contributes_to_load:1;
>>         unsigned sched_migrated:1;
>> +       unsigned __padding_sched:29;
>
> AFAIK the order of bit fields is implementation defined, so GCC could
> sort all these bits as it wants.
> You could use an unnamed zero-width bit-field to force padding:
>
>          unsigned :0; // force alignment to the next boundary.
>
>> +
>> +       /* unserialized, strictly 'current' */
>> +       unsigned in_execve:1; /* bit to tell LSMs we're in execve */
>> +       unsigned in_iowait:1;
>>  #ifdef CONFIG_MEMCG
>>         unsigned memcg_may_oom:1;
>>  #endif
>>

I've gathered some evidence that in my case the guilty bit is
sched_reset_on_fork:
https://groups.google.com/d/msg/syzkaller/o8VqvYNEu_I/I0pXGx79DQAJ

This patch should help.


^ permalink raw reply	[flat|nested] 49+ messages in thread

* Re: [PATCH 1/2] memcg: flatten task_struct->memcg_oom
  2015-11-25 15:31         ` Andrey Ryabinin
  2015-11-25 17:34             ` Dmitry Vyukov
@ 2015-11-25 17:44           ` Peter Zijlstra
  2015-12-11 16:25               ` Tejun Heo
  1 sibling, 1 reply; 49+ messages in thread
From: Peter Zijlstra @ 2015-11-25 17:44 UTC (permalink / raw)
  To: Andrey Ryabinin
  Cc: Tejun Heo, Ingo Molnar, Sasha Levin, Andrew Morton,
	Johannes Weiner, mhocko, cgroups, linux-mm, vdavydov,
	kernel-team, Dmitry Vyukov

On Wed, Nov 25, 2015 at 06:31:41PM +0300, Andrey Ryabinin wrote:
> > +       /* scheduler bits, serialized by scheduler locks */
> >         unsigned sched_reset_on_fork:1;
> >         unsigned sched_contributes_to_load:1;
> >         unsigned sched_migrated:1;
> > +       unsigned __padding_sched:29;
> 
> AFAIK the order of bit fields is implementation defined, so GCC could
> sort all these bits as it wants.

We're relying on it doing DTRT in other places, so I'm fairly confident
this'll work, otoh

> You could use an unnamed zero-width bit-field to force padding:
>
>          unsigned :0; // force alignment to the next boundary.

That's a nice trick I was not aware of, thanks!


^ permalink raw reply	[flat|nested] 49+ messages in thread

* Re: [PATCH 1/2] memcg: flatten task_struct->memcg_oom
@ 2015-12-11 16:25               ` Tejun Heo
  0 siblings, 0 replies; 49+ messages in thread
From: Tejun Heo @ 2015-12-11 16:25 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Andrey Ryabinin, Ingo Molnar, Sasha Levin, Andrew Morton,
	Johannes Weiner, mhocko, cgroups, linux-mm, vdavydov,
	kernel-team, Dmitry Vyukov

Hello, Peter, Ingo.

On Wed, Nov 25, 2015 at 06:44:49PM +0100, Peter Zijlstra wrote:
> On Wed, Nov 25, 2015 at 06:31:41PM +0300, Andrey Ryabinin wrote:
> > > +       /* scheduler bits, serialized by scheduler locks */
> > >         unsigned sched_reset_on_fork:1;
> > >         unsigned sched_contributes_to_load:1;
> > >         unsigned sched_migrated:1;
> > > +       unsigned __padding_sched:29;
> > 
> > AFAIK the order of bit fields is implementation defined, so GCC could
> > sort all these bits as it wants.
> 
> We're relying on it doing DTRT in other places, so I'm fairly confident
> this'll work, otoh
> 
> > You could use an unnamed zero-width bit-field to force padding:
> >
> >          unsigned :0; // force alignment to the next boundary.
> 
> That's a nice trick I was not aware of, thanks!

Has this been fixed yet?  While I'm not completely sure and I don't
think there's a way to be certain after the fact, we have a single
report of a machine which is showing ~4G as loadavg and one plausible
explanation could be that one of the ->nr_uninterruptible counters
underflowed from sched_contributes_to_load getting corrupted, so it'd
be great to get this one fixed soon.
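
As a quick illustration of why the corruption shows up as ~4G: a single
spurious decrement of an unsigned 32-bit counter sitting at zero wraps
it to 4294967295 (this is only an illustrative stand-in, not the
kernel's actual per-rq accounting).

#include <stdio.h>

int main(void)
{
	unsigned int nr_uninterruptible = 0;	/* illustrative stand-in */

	nr_uninterruptible--;			/* the spurious decrement */
	printf("%u\n", nr_uninterruptible);	/* 4294967295, i.e. ~4G */
	return 0;
}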

Thanks.

-- 
tejun


^ permalink raw reply	[flat|nested] 49+ messages in thread

* Re: [PATCH 1/2] memcg: flatten task_struct->memcg_oom
  2015-12-11 16:25               ` Tejun Heo
  (?)
@ 2015-12-15 19:22               ` Peter Zijlstra
  2015-12-30  9:23                 ` [PATCH v4.4-rc7] sched: isolate task_struct bitfields according to synchronization domains Tejun Heo
  -1 siblings, 1 reply; 49+ messages in thread
From: Peter Zijlstra @ 2015-12-15 19:22 UTC (permalink / raw)
  To: Tejun Heo
  Cc: Andrey Ryabinin, Ingo Molnar, Sasha Levin, Andrew Morton,
	Johannes Weiner, mhocko, cgroups, linux-mm, vdavydov,
	kernel-team, Dmitry Vyukov

On Fri, Dec 11, 2015 at 11:25:54AM -0500, Tejun Heo wrote:
> Hello, Peter, Ingo.
> 
> On Wed, Nov 25, 2015 at 06:44:49PM +0100, Peter Zijlstra wrote:
> > On Wed, Nov 25, 2015 at 06:31:41PM +0300, Andrey Ryabinin wrote:
> > > > +       /* scheduler bits, serialized by scheduler locks */
> > > >         unsigned sched_reset_on_fork:1;
> > > >         unsigned sched_contributes_to_load:1;
> > > >         unsigned sched_migrated:1;
> > > > +       unsigned __padding_sched:29;
> > > 
> > > AFAIK the order of bit fields is implementation defined, so GCC could
> > > sort all these bits as it wants.
> > 
> > We're relying on it doing DTRT in other places, so I'm fairly confident
> > this'll work, otoh
> > 
> > > You could use an unnamed zero-width bit-field to force padding:
> > >
> > >          unsigned :0; // force alignment to the next boundary.
> > 
> > That's a nice trick I was not aware of, thanks!
> 
> Has this been fixed yet?  While I'm not completely sure and I don't
> think there's a way to be certain after the fact, we have a single
> report of a machine which is showing ~4G as loadavg and one plausible
> explanation could be that one of the ->nr_uninterruptible counters
> underflowed from sched_contributes_to_load getting corrupted, so it'd
> be great to get this one fixed soon.

Nope, lemme write a Changelog and queue it. Thanks for the reminder.


^ permalink raw reply	[flat|nested] 49+ messages in thread

* [PATCH v4.4-rc7] sched: isolate task_struct bitfields according to synchronization domains
  2015-12-15 19:22               ` Peter Zijlstra
@ 2015-12-30  9:23                 ` Tejun Heo
  2015-12-30 20:10                     ` Linus Torvalds
  0 siblings, 1 reply; 49+ messages in thread
From: Tejun Heo @ 2015-12-30  9:23 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Andrey Ryabinin, Ingo Molnar, Sasha Levin, Andrew Morton,
	Johannes Weiner, mhocko, cgroups, linux-mm, vdavydov,
	kernel-team, Dmitry Vyukov, Peter Zijlstra

task_struct has a cluster of unsigned bitfields.  Some are updated
under scheduler locks while others are updated only by the task
itself.  Currently, the two classes of bitfields aren't distinguished
and end up on the same word, which can lead to clobbering when there
are simultaneous read-modify-write attempts.  While difficult to prove
definitively, it's likely that the resulting inconsistency led to
low-frequency failures such as wrong memcg_may_oom state or loadavg
underflow due to a clobbered sched_contributes_to_load.

Fix it by putting the two classes of the bitfields into separate
unsigned longs.

Original-patch-by: Peter Zijlstra <peterz@infradead.org>
Signed-off-by: Tejun Heo <tj@kernel.org>
Link: http://lkml.kernel.org/g/55FEC685.5010404@oracle.com
Cc: stable@vger.kernel.org
---
Hello,

Peter, I took the patch and changed the bitfields to ulong.

Thanks.

 include/linux/sched.h |   25 ++++++++++++++-----------
 1 file changed, 14 insertions(+), 11 deletions(-)

diff --git a/include/linux/sched.h b/include/linux/sched.h
index edad7a4..e51464d 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -1455,22 +1455,25 @@ struct task_struct {
 	/* Used for emulating ABI behavior of previous Linux versions */
 	unsigned int personality;
 
-	unsigned in_execve:1;	/* Tell the LSMs that the process is doing an
-				 * execve */
-	unsigned in_iowait:1;
-
-	/* Revert to default priority/policy when forking */
-	unsigned sched_reset_on_fork:1;
-	unsigned sched_contributes_to_load:1;
-	unsigned sched_migrated:1;
+	/* scheduler bits, serialized by scheduler locks */
+	unsigned long sched_reset_on_fork:1;
+	unsigned long sched_contributes_to_load:1;
+	unsigned long sched_migrated:1;
+
+	/* force alignment to the next boundary */
+	unsigned long :0;
+
+	/* unserialized, strictly 'current' */
+	unsigned long in_execve:1; /* bit to tell LSMs we're in execve */
+	unsigned long in_iowait:1;
 #ifdef CONFIG_MEMCG
-	unsigned memcg_may_oom:1;
+	unsigned long memcg_may_oom:1;
 #endif
 #ifdef CONFIG_MEMCG_KMEM
-	unsigned memcg_kmem_skip_account:1;
+	unsigned long memcg_kmem_skip_account:1;
 #endif
 #ifdef CONFIG_COMPAT_BRK
-	unsigned brk_randomized:1;
+	unsigned long brk_randomized:1;
 #endif
 
 	unsigned long atomic_flags; /* Flags needing atomic access. */


^ permalink raw reply related	[flat|nested] 49+ messages in thread

* Re: [PATCH v4.4-rc7] sched: isolate task_struct bitfields according to synchronization domains
@ 2015-12-30 20:10                     ` Linus Torvalds
  0 siblings, 0 replies; 49+ messages in thread
From: Linus Torvalds @ 2015-12-30 20:10 UTC (permalink / raw)
  To: Tejun Heo
  Cc: Andrey Ryabinin, Ingo Molnar, Sasha Levin, Andrew Morton,
	Johannes Weiner, Michal Hocko, cgroups, linux-mm,
	Vladimir Davydov, kernel-team, Dmitry Vyukov, Peter Zijlstra

On Wed, Dec 30, 2015 at 1:23 AM, Tejun Heo <tj@kernel.org> wrote:
>
> Peter, I took the patch and changed the bitfields to ulong.

I wouldn't expect the unsigned long part to matter, except for the
forced split with

   unsigned long :0;

itself.

Also, quite frankly, since this is basically very close to other
fields that are *not* unsigned longs, I'd really prefer to not
unnecessarily use a 64-bit field for three bits each.

So why not just do it with plain unsigned "int", and then maybe just
intersperse them with the other int-sized fields in that neighborhood.

I'm also wondering if we shouldn't just put the scheduler bits in the
"atomic_flags" thing instead?

                     Linus


^ permalink raw reply	[flat|nested] 49+ messages in thread

* Re: [PATCH v4.4-rc7] sched: isolate task_struct bitfields according to synchronization domains
  2015-12-30 20:10                     ` Linus Torvalds
  (?)
@ 2015-12-30 20:17                     ` Linus Torvalds
  -1 siblings, 0 replies; 49+ messages in thread
From: Linus Torvalds @ 2015-12-30 20:17 UTC (permalink / raw)
  To: Tejun Heo
  Cc: Andrey Ryabinin, Ingo Molnar, Sasha Levin, Andrew Morton,
	Johannes Weiner, Michal Hocko, cgroups, linux-mm,
	Vladimir Davydov, kernel-team, Dmitry Vyukov, Peter Zijlstra

On Wed, Dec 30, 2015 at 12:10 PM, Linus Torvalds
<torvalds@linux-foundation.org> wrote:
>
> Also, quite frankly, since this is basically very close to other
> fields that are *not* unsigned longs, I'd really prefer to not
> unnecessarily use a 64-bit field for three bits each.

Side note: I don't hate the patch. I think it's a good catch, and
would take it as-is. I just think it could be better.

               Linus


^ permalink raw reply	[flat|nested] 49+ messages in thread

* Re: [PATCH v4.4-rc7] sched: isolate task_struct bitfields according to synchronization domains
@ 2015-12-30 20:41                       ` Tejun Heo
  0 siblings, 0 replies; 49+ messages in thread
From: Tejun Heo @ 2015-12-30 20:41 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Andrey Ryabinin, Ingo Molnar, Sasha Levin, Andrew Morton,
	Johannes Weiner, Michal Hocko, cgroups, linux-mm,
	Vladimir Davydov, kernel-team, Dmitry Vyukov, Peter Zijlstra

Hello, Linus.

On Wed, Dec 30, 2015 at 12:10:12PM -0800, Linus Torvalds wrote:
> On Wed, Dec 30, 2015 at 1:23 AM, Tejun Heo <tj@kernel.org> wrote:
> >
> > Peter, I took the patch and changed the bitfields to ulong.
> 
> I wouldn't expect the unsigned long part to matter, except for the
> forced split with

Right, I was thinking alpha was doing rmw's for things smaller than
64bit.  That's 32bit, not 64.

>    unsigned long :0;
> 
> itself.
> 
> Also, quite frankly, since this is basically very close to other
> fields that are *not* unsigned longs, I'd really prefer to not
> unnecessarily use a 64-bit field for three bits each.
> 
> So why not just do it with plain unsigned "int", and then maybe just
> intersperse them with the other int-sized fields in that neighborhood.
>
> I'm also wondering if we shouldn't just put the scheduler bits in the
> "atomic_flags" thing instead?

Sure.

-- 
tejun


^ permalink raw reply	[flat|nested] 49+ messages in thread

* Re: [PATCH v4.4-rc7] sched: isolate task_struct bitfields according to synchronization domains
  2015-12-30 20:41                       ` Tejun Heo
  (?)
@ 2015-12-30 20:43                       ` Linus Torvalds
  -1 siblings, 0 replies; 49+ messages in thread
From: Linus Torvalds @ 2015-12-30 20:43 UTC (permalink / raw)
  To: Tejun Heo
  Cc: Andrey Ryabinin, Ingo Molnar, Sasha Levin, Andrew Morton,
	Johannes Weiner, Michal Hocko, cgroups, linux-mm,
	Vladimir Davydov, kernel-team, Dmitry Vyukov, Peter Zijlstra

On Wed, Dec 30, 2015 at 12:41 PM, Tejun Heo <tj@kernel.org> wrote:
>
> Right, I was thinking alpha was doing rmw's for things smaller than
> 64bit.  That's 32bit, not 64.

Right. Alpha has trouble only with 8-bit and 16-bit fields. 32-bit
fields should be "atomic" on all architectures, modulo compiler bugs
(and we've had those).

                 Linus


^ permalink raw reply	[flat|nested] 49+ messages in thread

* [PATCH v4.4-rc7] sched: move sched lock synchronized bitfields in task_struct into ->atomic_flags
@ 2016-01-01  2:56                       ` Tejun Heo
  0 siblings, 0 replies; 49+ messages in thread
From: Tejun Heo @ 2016-01-01  2:56 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Andrey Ryabinin, Ingo Molnar, Sasha Levin, Andrew Morton,
	Johannes Weiner, Michal Hocko, cgroups, linux-mm,
	Vladimir Davydov, kernel-team, Dmitry Vyukov, Peter Zijlstra

task_struct has a cluster of unsigned bitfields.  Some are updated
under scheduler locks while others are updated only by the task
itself.  Currently, the two classes of bitfields aren't distinguished
and end up on the same word, which can lead to clobbering when there
are simultaneous read-modify-write attempts.  While difficult to prove
definitively, it's likely that the resulting inconsistency led to
low-frequency failures such as wrong memcg_may_oom state or loadavg
underflow due to a clobbered sched_contributes_to_load.

Fix it by moving sched lock synchronized bitfields into
->atomic_flags.

v2: Move flags into ->atomic_flags instead of segregating bitfields.

Original-patch-by: Peter Zijlstra <peterz@infradead.org>
Signed-off-by: Tejun Heo <tj@kernel.org>
Link: http://lkml.kernel.org/g/55FEC685.5010404@oracle.com
Cc: stable@vger.kernel.org
---
Hello,

task_struct is pretty well packed and I couldn't find a good hole to
fit a separate integer into.  atomic_flags is a bit cumbersome but it
looks like the better option.

Thanks.

 include/linux/perf_event.h |    6 +++---
 include/linux/sched.h      |   31 ++++++++++++++++++++++++-------
 kernel/sched/core.c        |   22 +++++++++++-----------
 3 files changed, 38 insertions(+), 21 deletions(-)

diff --git a/include/linux/perf_event.h b/include/linux/perf_event.h
index f9828a4..e5a80a4 100644
--- a/include/linux/perf_event.h
+++ b/include/linux/perf_event.h
@@ -921,7 +921,7 @@ perf_sw_migrate_enabled(void)
 static inline void perf_event_task_migrate(struct task_struct *task)
 {
 	if (perf_sw_migrate_enabled())
-		task->sched_migrated = 1;
+		task_set_sched_migrated(task);
 }
 
 static inline void perf_event_task_sched_in(struct task_struct *prev,
@@ -930,12 +930,12 @@ static inline void perf_event_task_sched_in(struct task_struct *prev,
 	if (static_key_false(&perf_sched_events.key))
 		__perf_event_task_sched_in(prev, task);
 
-	if (perf_sw_migrate_enabled() && task->sched_migrated) {
+	if (perf_sw_migrate_enabled() && task_sched_migrated(task)) {
 		struct pt_regs *regs = this_cpu_ptr(&__perf_regs[0]);
 
 		perf_fetch_caller_regs(regs);
 		___perf_sw_event(PERF_COUNT_SW_CPU_MIGRATIONS, 1, regs, 0);
-		task->sched_migrated = 0;
+		task_clear_sched_migrated(task);
 	}
 }
 
diff --git a/include/linux/sched.h b/include/linux/sched.h
index edad7a4..b289f47 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -1455,14 +1455,9 @@ struct task_struct {
 	/* Used for emulating ABI behavior of previous Linux versions */
 	unsigned int personality;
 
-	unsigned in_execve:1;	/* Tell the LSMs that the process is doing an
-				 * execve */
+	/* unserialized, strictly 'current' */
+	unsigned in_execve:1; /* bit to tell LSMs we're in execve */
 	unsigned in_iowait:1;
-
-	/* Revert to default priority/policy when forking */
-	unsigned sched_reset_on_fork:1;
-	unsigned sched_contributes_to_load:1;
-	unsigned sched_migrated:1;
 #ifdef CONFIG_MEMCG
 	unsigned memcg_may_oom:1;
 #endif
@@ -2144,6 +2139,10 @@ static inline void memalloc_noio_restore(unsigned int flags)
 #define PFA_SPREAD_PAGE  1      /* Spread page cache over cpuset */
 #define PFA_SPREAD_SLAB  2      /* Spread some slab caches over cpuset */
 
+#define PFA_SCHED_RESET_ON_FORK		3 /* revert priority/policy on fork */
+#define PFA_SCHED_CONTRIBUTES_TO_LOAD	4
+#define PFA_SCHED_MIGRATED		5
+
 
 #define TASK_PFA_TEST(name, func)					\
 	static inline bool task_##func(struct task_struct *p)		\
@@ -2154,6 +2153,10 @@ static inline void memalloc_noio_restore(unsigned int flags)
 #define TASK_PFA_CLEAR(name, func)					\
 	static inline void task_clear_##func(struct task_struct *p)	\
 	{ clear_bit(PFA_##name, &p->atomic_flags); }
+#define TASK_PFA_UPDATE(name, func)					\
+	static inline void task_update_##func(struct task_struct *p, bool v) \
+	{ if (v) set_bit(PFA_##name, &p->atomic_flags);			\
+	  else clear_bit(PFA_##name, &p->atomic_flags); }
 
 TASK_PFA_TEST(NO_NEW_PRIVS, no_new_privs)
 TASK_PFA_SET(NO_NEW_PRIVS, no_new_privs)
@@ -2166,6 +2169,20 @@ TASK_PFA_TEST(SPREAD_SLAB, spread_slab)
 TASK_PFA_SET(SPREAD_SLAB, spread_slab)
 TASK_PFA_CLEAR(SPREAD_SLAB, spread_slab)
 
+TASK_PFA_TEST(SCHED_RESET_ON_FORK, sched_reset_on_fork);
+TASK_PFA_SET(SCHED_RESET_ON_FORK, sched_reset_on_fork);
+TASK_PFA_CLEAR(SCHED_RESET_ON_FORK, sched_reset_on_fork);
+TASK_PFA_UPDATE(SCHED_RESET_ON_FORK, sched_reset_on_fork);
+
+TASK_PFA_TEST(SCHED_CONTRIBUTES_TO_LOAD, sched_contributes_to_load);
+TASK_PFA_SET(SCHED_CONTRIBUTES_TO_LOAD, sched_contributes_to_load);
+TASK_PFA_CLEAR(SCHED_CONTRIBUTES_TO_LOAD, sched_contributes_to_load);
+TASK_PFA_UPDATE(SCHED_CONTRIBUTES_TO_LOAD, sched_contributes_to_load);
+
+TASK_PFA_TEST(SCHED_MIGRATED, sched_migrated);
+TASK_PFA_SET(SCHED_MIGRATED, sched_migrated);
+TASK_PFA_CLEAR(SCHED_MIGRATED, sched_migrated);
+
 /*
  * task->jobctl flags
  */
diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 732e993..c5a6a8c 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -1751,7 +1751,7 @@ ttwu_do_activate(struct rq *rq, struct task_struct *p, int wake_flags)
 	lockdep_assert_held(&rq->lock);
 
 #ifdef CONFIG_SMP
-	if (p->sched_contributes_to_load)
+	if (task_sched_contributes_to_load(p))
 		rq->nr_uninterruptible--;
 #endif
 
@@ -1982,7 +1982,7 @@ try_to_wake_up(struct task_struct *p, unsigned int state, int wake_flags)
 	 */
 	smp_rmb();
 
-	p->sched_contributes_to_load = !!task_contributes_to_load(p);
+	task_update_sched_contributes_to_load(p, task_contributes_to_load(p));
 	p->state = TASK_WAKING;
 
 	if (p->sched_class->task_waking)
@@ -2205,7 +2205,7 @@ int sched_fork(unsigned long clone_flags, struct task_struct *p)
 	/*
 	 * Revert to default priority/policy on fork if requested.
 	 */
-	if (unlikely(p->sched_reset_on_fork)) {
+	if (unlikely(task_sched_reset_on_fork(p))) {
 		if (task_has_dl_policy(p) || task_has_rt_policy(p)) {
 			p->policy = SCHED_NORMAL;
 			p->static_prio = NICE_TO_PRIO(0);
@@ -2220,7 +2220,7 @@ int sched_fork(unsigned long clone_flags, struct task_struct *p)
 		 * We don't need the reset flag anymore after the fork. It has
 		 * fulfilled its duty:
 		 */
-		p->sched_reset_on_fork = 0;
+		task_clear_sched_reset_on_fork(p);
 	}
 
 	if (dl_prio(p->prio)) {
@@ -3799,7 +3799,7 @@ static int __sched_setscheduler(struct task_struct *p,
 recheck:
 	/* double check policy once rq lock held */
 	if (policy < 0) {
-		reset_on_fork = p->sched_reset_on_fork;
+		reset_on_fork = task_sched_reset_on_fork(p);
 		policy = oldpolicy = p->policy;
 	} else {
 		reset_on_fork = !!(attr->sched_flags & SCHED_FLAG_RESET_ON_FORK);
@@ -3870,7 +3870,7 @@ static int __sched_setscheduler(struct task_struct *p,
 			return -EPERM;
 
 		/* Normal users shall not reset the sched_reset_on_fork flag */
-		if (p->sched_reset_on_fork && !reset_on_fork)
+		if (task_sched_reset_on_fork(p) && !reset_on_fork)
 			return -EPERM;
 	}
 
@@ -3909,7 +3909,7 @@ static int __sched_setscheduler(struct task_struct *p,
 		if (dl_policy(policy) && dl_param_changed(p, attr))
 			goto change;
 
-		p->sched_reset_on_fork = reset_on_fork;
+		task_update_sched_reset_on_fork(p, reset_on_fork);
 		task_rq_unlock(rq, p, &flags);
 		return 0;
 	}
@@ -3963,7 +3963,7 @@ static int __sched_setscheduler(struct task_struct *p,
 		return -EBUSY;
 	}
 
-	p->sched_reset_on_fork = reset_on_fork;
+	task_update_sched_reset_on_fork(p, reset_on_fork);
 	oldprio = p->prio;
 
 	if (pi) {
@@ -4260,8 +4260,8 @@ SYSCALL_DEFINE1(sched_getscheduler, pid_t, pid)
 	if (p) {
 		retval = security_task_getscheduler(p);
 		if (!retval)
-			retval = p->policy
-				| (p->sched_reset_on_fork ? SCHED_RESET_ON_FORK : 0);
+			retval = p->policy | (task_sched_reset_on_fork(p) ?
+					      SCHED_RESET_ON_FORK : 0);
 	}
 	rcu_read_unlock();
 	return retval;
@@ -4377,7 +4377,7 @@ SYSCALL_DEFINE4(sched_getattr, pid_t, pid, struct sched_attr __user *, uattr,
 		goto out_unlock;
 
 	attr.sched_policy = p->policy;
-	if (p->sched_reset_on_fork)
+	if (task_sched_reset_on_fork(p))
 		attr.sched_flags |= SCHED_FLAG_RESET_ON_FORK;
 	if (task_has_dl_policy(p))
 		__getparam_dl(p, &attr);
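
As a rough userspace model of the pattern above (task_model stands in
for task_struct, and set_bit/clear_bit/test_bit are reimplemented with
GCC/Clang __atomic builtins instead of the kernel's bitops), the
accessors that the TASK_PFA_* macros generate for SCHED_RESET_ON_FORK
come out to roughly this:

#include <stdbool.h>
#include <stdio.h>

#define PFA_SCHED_RESET_ON_FORK 3

struct task_model {
	unsigned long atomic_flags;
};

static inline void set_bit(int nr, unsigned long *addr)
{
	__atomic_fetch_or(addr, 1UL << nr, __ATOMIC_RELAXED);
}

static inline void clear_bit(int nr, unsigned long *addr)
{
	__atomic_fetch_and(addr, ~(1UL << nr), __ATOMIC_RELAXED);
}

static inline bool test_bit(int nr, const unsigned long *addr)
{
	return __atomic_load_n(addr, __ATOMIC_RELAXED) & (1UL << nr);
}

/* Hand-expanded TASK_PFA_TEST/SET/CLEAR/UPDATE for SCHED_RESET_ON_FORK. */
static inline bool task_sched_reset_on_fork(struct task_model *p)
{
	return test_bit(PFA_SCHED_RESET_ON_FORK, &p->atomic_flags);
}

static inline void task_set_sched_reset_on_fork(struct task_model *p)
{
	set_bit(PFA_SCHED_RESET_ON_FORK, &p->atomic_flags);
}

static inline void task_clear_sched_reset_on_fork(struct task_model *p)
{
	clear_bit(PFA_SCHED_RESET_ON_FORK, &p->atomic_flags);
}

static inline void task_update_sched_reset_on_fork(struct task_model *p, bool v)
{
	if (v)
		set_bit(PFA_SCHED_RESET_ON_FORK, &p->atomic_flags);
	else
		clear_bit(PFA_SCHED_RESET_ON_FORK, &p->atomic_flags);
}

int main(void)
{
	struct task_model t = { .atomic_flags = 0 };

	task_set_sched_reset_on_fork(&t);
	printf("after set:    %d\n", task_sched_reset_on_fork(&t));	/* 1 */
	task_update_sched_reset_on_fork(&t, false);
	printf("after update: %d\n", task_sched_reset_on_fork(&t));	/* 0 */
	return 0;
}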


^ permalink raw reply related	[flat|nested] 49+ messages in thread

* Re: [PATCH v4.4-rc7] sched: move sched lock synchronized bitfields in task_struct into ->atomic_flags
@ 2016-01-06 13:44                         ` Tejun Heo
  0 siblings, 0 replies; 49+ messages in thread
From: Tejun Heo @ 2016-01-06 13:44 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Andrey Ryabinin, Ingo Molnar, Sasha Levin, Andrew Morton,
	Johannes Weiner, Michal Hocko, cgroups, linux-mm,
	Vladimir Davydov, kernel-team, Dmitry Vyukov, Peter Zijlstra

On Thu, Dec 31, 2015 at 09:56:28PM -0500, Tejun Heo wrote:
> task_struct has a cluster of unsigned bitfields.  Some are updated
> under scheduler locks while others are updated only by the task
> itself.  Currently, the two classes of bitfields aren't distinguished
> and end up on the same word, which can lead to clobbering when there
> are simultaneous read-modify-write attempts.  While difficult to prove
> definitively, it's likely that the resulting inconsistency led to
> low-frequency failures such as wrong memcg_may_oom state or loadavg
> underflow due to a clobbered sched_contributes_to_load.

Ping.

-- 
tejun


^ permalink raw reply	[flat|nested] 49+ messages in thread

* [tip:sched/core] sched/core: Fix unserialized r-m-w scribbling stuff
  2015-11-25 15:02         ` Peter Zijlstra
  (?)
  (?)
@ 2016-01-06 18:48         ` tip-bot for Peter Zijlstra
  2016-01-06 20:17           ` Tejun Heo
  -1 siblings, 1 reply; 49+ messages in thread
From: tip-bot for Peter Zijlstra @ 2016-01-06 18:48 UTC (permalink / raw)
  To: linux-tip-commits
  Cc: tglx, tj, hpa, sasha.levin, torvalds, mingo, peterz, dvyukov,
	linux-kernel

Commit-ID:  be958bdc96f18bc1356177bbb79d46ea0c037b96
Gitweb:     http://git.kernel.org/tip/be958bdc96f18bc1356177bbb79d46ea0c037b96
Author:     Peter Zijlstra <peterz@infradead.org>
AuthorDate: Wed, 25 Nov 2015 16:02:07 +0100
Committer:  Ingo Molnar <mingo@kernel.org>
CommitDate: Wed, 6 Jan 2016 11:01:07 +0100

sched/core: Fix unserialized r-m-w scribbling stuff

Some of the sched bitfields (notably sched_reset_on_fork) can be set
on a task other than current; this can cause the r-m-w to race with
other updates.

Since all the sched bits are serialized by scheduler locks, pull them
into a separate word.

Reported-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Dmitry Vyukov <dvyukov@google.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Sasha Levin <sasha.levin@oracle.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: akpm@linux-foundation.org
Cc: hannes@cmpxchg.org
Cc: mhocko@kernel.org
Cc: vdavydov@parallels.com
Link: http://lkml.kernel.org/r/20151125150207.GM11639@twins.programming.kicks-ass.net
Signed-off-by: Ingo Molnar <mingo@kernel.org>
---
 include/linux/sched.h | 11 ++++++-----
 1 file changed, 6 insertions(+), 5 deletions(-)

diff --git a/include/linux/sched.h b/include/linux/sched.h
index 9cf9dd1..fa39434 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -1455,14 +1455,15 @@ struct task_struct {
 	/* Used for emulating ABI behavior of previous Linux versions */
 	unsigned int personality;
 
-	unsigned in_execve:1;	/* Tell the LSMs that the process is doing an
-				 * execve */
-	unsigned in_iowait:1;
-
-	/* Revert to default priority/policy when forking */
+	/* scheduler bits, serialized by scheduler locks */
 	unsigned sched_reset_on_fork:1;
 	unsigned sched_contributes_to_load:1;
 	unsigned sched_migrated:1;
+	unsigned :0; /* force alignment to the next boundary */
+
+	/* unserialized, strictly 'current' */
+	unsigned in_execve:1; /* bit to tell LSMs we're in execve */
+	unsigned in_iowait:1;
 #ifdef CONFIG_MEMCG
 	unsigned memcg_may_oom:1;
 #endif

^ permalink raw reply related	[flat|nested] 49+ messages in thread

* Re: [tip:sched/core] sched/core: Fix unserialized r-m-w scribbling stuff
  2016-01-06 18:48         ` [tip:sched/core] sched/core: Fix unserialized r-m-w scribbling stuff tip-bot for Peter Zijlstra
@ 2016-01-06 20:17           ` Tejun Heo
  0 siblings, 0 replies; 49+ messages in thread
From: Tejun Heo @ 2016-01-06 20:17 UTC (permalink / raw)
  To: linux-kernel, dvyukov, mingo, peterz, sasha.levin, torvalds, hpa, tglx
  Cc: linux-tip-commits

On Wed, Jan 06, 2016 at 10:48:15AM -0800, tip-bot for Peter Zijlstra wrote:
> Commit-ID:  be958bdc96f18bc1356177bbb79d46ea0c037b96
> Gitweb:     http://git.kernel.org/tip/be958bdc96f18bc1356177bbb79d46ea0c037b96
> Author:     Peter Zijlstra <peterz@infradead.org>
> AuthorDate: Wed, 25 Nov 2015 16:02:07 +0100
> Committer:  Ingo Molnar <mingo@kernel.org>
> CommitDate: Wed, 6 Jan 2016 11:01:07 +0100
> 
> sched/core: Fix unserialized r-m-w scribbling stuff
> 
> Some of the sched bitfields (notably sched_reset_on_fork) can be set
> on a task other than current; this can cause the r-m-w to race with
> other updates.
> 
> Since all the sched bits are serialized by scheduler locks, pull them
> into a separate word.
> 
> Reported-by: Tejun Heo <tj@kernel.org>
> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
> Cc: Dmitry Vyukov <dvyukov@google.com>
> Cc: Linus Torvalds <torvalds@linux-foundation.org>
> Cc: Peter Zijlstra <peterz@infradead.org>
> Cc: Sasha Levin <sasha.levin@oracle.com>
> Cc: Thomas Gleixner <tglx@linutronix.de>
> Cc: akpm@linux-foundation.org
> Cc: hannes@cmpxchg.org
> Cc: mhocko@kernel.org
> Cc: vdavydov@parallels.com
> Link: http://lkml.kernel.org/r/20151125150207.GM11639@twins.programming.kicks-ass.net
> Signed-off-by: Ingo Molnar <mingo@kernel.org>

cc stable?

-- 
tejun

^ permalink raw reply	[flat|nested] 49+ messages in thread

* Re: [PATCH 1/2] memcg: flatten task_struct->memcg_oom
@ 2015-09-02 11:45   ` Vladimir Davydov
  0 siblings, 0 replies; 49+ messages in thread
From: Vladimir Davydov @ 2015-09-02 11:45 UTC (permalink / raw)
  To: Tejun Heo; +Cc: hannes, mhocko, akpm, cgroups, linux-mm, kernel-team

On Fri, Aug 28, 2015 at 06:01:58PM -0400, Tejun Heo wrote:
> task_struct->memcg_oom is a sub-struct containing fields which are
> used for async memcg oom handling.  Most task_struct fields aren't
> packaged this way and it can lead to unnecessary alignment paddings.
> This patch flattens it.
> 
> * task.memcg_oom.memcg          -> task.memcg_in_oom
> * task.memcg_oom.gfp_mask	-> task.memcg_oom_gfp_mask
> * task.memcg_oom.order          -> task.memcg_oom_order
> * task.memcg_oom.may_oom        -> task.memcg_may_oom
> 
> In addition, task.memcg_may_oom is relocated to where other bitfields
> are which reduces the size of task_struct.
> 
> Signed-off-by: Tejun Heo <tj@kernel.org>

Reviewed-by: Vladimir Davydov <vdavydov@parallels.com>


^ permalink raw reply	[flat|nested] 49+ messages in thread

* Re: [PATCH 1/2] memcg: flatten task_struct->memcg_oom
@ 2015-09-01 15:25   ` Michal Hocko
  0 siblings, 0 replies; 49+ messages in thread
From: Michal Hocko @ 2015-09-01 15:25 UTC (permalink / raw)
  To: Tejun Heo; +Cc: hannes, akpm, cgroups, linux-mm, vdavydov, kernel-team

On Fri 28-08-15 18:01:58, Tejun Heo wrote:
> task_struct->memcg_oom is a sub-struct containing fields which are
> used for async memcg oom handling.  Most task_struct fields aren't
> packaged this way and it can lead to unnecessary alignment paddings.
> This patch flattens it.
> 
> * task.memcg_oom.memcg          -> task.memcg_in_oom
> * task.memcg_oom.gfp_mask	-> task.memcg_oom_gfp_mask
> * task.memcg_oom.order          -> task.memcg_oom_order
> * task.memcg_oom.may_oom        -> task.memcg_may_oom
> 
> In addition, task.memcg_may_oom is relocated to where other bitfields
> are which reduces the size of task_struct.
> 
> Signed-off-by: Tejun Heo <tj@kernel.org>

Acked-by: Michal Hocko <mhocko@suse.com>

> ---
> Hello,
> 
> These two patches are what survived from the following patchset.
> 
>  http://lkml.kernel.org/g/1440775530-18630-1-git-send-email-tj@kernel.org
> 
> Thanks.
> 
>  include/linux/memcontrol.h |   10 +++++-----
>  include/linux/sched.h      |   13 ++++++-------
>  mm/memcontrol.c            |   16 ++++++++--------
>  3 files changed, 19 insertions(+), 20 deletions(-)
> 
> --- a/include/linux/memcontrol.h
> +++ b/include/linux/memcontrol.h
> @@ -407,19 +407,19 @@ void mem_cgroup_print_oom_info(struct me
>  
>  static inline void mem_cgroup_oom_enable(void)
>  {
> -	WARN_ON(current->memcg_oom.may_oom);
> -	current->memcg_oom.may_oom = 1;
> +	WARN_ON(current->memcg_may_oom);
> +	current->memcg_may_oom = 1;
>  }
>  
>  static inline void mem_cgroup_oom_disable(void)
>  {
> -	WARN_ON(!current->memcg_oom.may_oom);
> -	current->memcg_oom.may_oom = 0;
> +	WARN_ON(!current->memcg_may_oom);
> +	current->memcg_may_oom = 0;
>  }
>  
>  static inline bool task_in_memcg_oom(struct task_struct *p)
>  {
> -	return p->memcg_oom.memcg;
> +	return p->memcg_in_oom;
>  }
>  
>  bool mem_cgroup_oom_synchronize(bool wait);
> --- a/include/linux/sched.h
> +++ b/include/linux/sched.h
> @@ -1451,7 +1451,9 @@ struct task_struct {
>  	unsigned sched_reset_on_fork:1;
>  	unsigned sched_contributes_to_load:1;
>  	unsigned sched_migrated:1;
> -
> +#ifdef CONFIG_MEMCG
> +	unsigned memcg_may_oom:1;
> +#endif
>  #ifdef CONFIG_MEMCG_KMEM
>  	unsigned memcg_kmem_skip_account:1;
>  #endif
> @@ -1782,12 +1784,9 @@ struct task_struct {
>  	unsigned long trace_recursion;
>  #endif /* CONFIG_TRACING */
>  #ifdef CONFIG_MEMCG
> -	struct memcg_oom_info {
> -		struct mem_cgroup *memcg;
> -		gfp_t gfp_mask;
> -		int order;
> -		unsigned int may_oom:1;
> -	} memcg_oom;
> +	struct mem_cgroup *memcg_in_oom;
> +	gfp_t memcg_oom_gfp_mask;
> +	int memcg_oom_order;
>  #endif
>  #ifdef CONFIG_UPROBES
>  	struct uprobe_task *utask;
> --- a/mm/memcontrol.c
> +++ b/mm/memcontrol.c
> @@ -1652,7 +1652,7 @@ static void memcg_oom_recover(struct mem
>  
>  static void mem_cgroup_oom(struct mem_cgroup *memcg, gfp_t mask, int order)
>  {
> -	if (!current->memcg_oom.may_oom)
> +	if (!current->memcg_may_oom)
>  		return;
>  	/*
>  	 * We are in the middle of the charge context here, so we
> @@ -1669,9 +1669,9 @@ static void mem_cgroup_oom(struct mem_cg
>  	 * and when we know whether the fault was overall successful.
>  	 */
>  	css_get(&memcg->css);
> -	current->memcg_oom.memcg = memcg;
> -	current->memcg_oom.gfp_mask = mask;
> -	current->memcg_oom.order = order;
> +	current->memcg_in_oom = memcg;
> +	current->memcg_oom_gfp_mask = mask;
> +	current->memcg_oom_order = order;
>  }
>  
>  /**
> @@ -1693,7 +1693,7 @@ static void mem_cgroup_oom(struct mem_cg
>   */
>  bool mem_cgroup_oom_synchronize(bool handle)
>  {
> -	struct mem_cgroup *memcg = current->memcg_oom.memcg;
> +	struct mem_cgroup *memcg = current->memcg_in_oom;
>  	struct oom_wait_info owait;
>  	bool locked;
>  
> @@ -1721,8 +1721,8 @@ bool mem_cgroup_oom_synchronize(bool han
>  	if (locked && !memcg->oom_kill_disable) {
>  		mem_cgroup_unmark_under_oom(memcg);
>  		finish_wait(&memcg_oom_waitq, &owait.wait);
> -		mem_cgroup_out_of_memory(memcg, current->memcg_oom.gfp_mask,
> -					 current->memcg_oom.order);
> +		mem_cgroup_out_of_memory(memcg, current->memcg_oom_gfp_mask,
> +					 current->memcg_oom_order);
>  	} else {
>  		schedule();
>  		mem_cgroup_unmark_under_oom(memcg);
> @@ -1739,7 +1739,7 @@ bool mem_cgroup_oom_synchronize(bool han
>  		memcg_oom_recover(memcg);
>  	}
>  cleanup:
> -	current->memcg_oom.memcg = NULL;
> +	current->memcg_in_oom = NULL;
>  	css_put(&memcg->css);
>  	return true;
>  }

-- 
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 49+ messages in thread

* [PATCH 1/2] memcg: flatten task_struct->memcg_oom
@ 2015-08-28 22:01 ` Tejun Heo
  0 siblings, 0 replies; 49+ messages in thread
From: Tejun Heo @ 2015-08-28 22:01 UTC (permalink / raw)
  To: hannes, mhocko, akpm; +Cc: cgroups, linux-mm, vdavydov, kernel-team

task_struct->memcg_oom is a sub-struct containing fields which are
used for async memcg oom handling.  Most task_struct fields aren't
packaged this way and it can lead to unnecessary alignment paddings.
This patch flattens it.

* task.memcg_oom.memcg          -> task.memcg_in_oom
* task.memcg_oom.gfp_mask	-> task.memcg_oom_gfp_mask
* task.memcg_oom.order          -> task.memcg_oom_order
* task.memcg_oom.may_oom        -> task.memcg_may_oom

In addition, task.memcg_may_oom is relocated to where other bitfields
are which reduces the size of task_struct.

Signed-off-by: Tejun Heo <tj@kernel.org>
---
Hello,

These two patches are what survived from the following patchset.

 http://lkml.kernel.org/g/1440775530-18630-1-git-send-email-tj@kernel.org

Thanks.

 include/linux/memcontrol.h |   10 +++++-----
 include/linux/sched.h      |   13 ++++++-------
 mm/memcontrol.c            |   16 ++++++++--------
 3 files changed, 19 insertions(+), 20 deletions(-)

--- a/include/linux/memcontrol.h
+++ b/include/linux/memcontrol.h
@@ -407,19 +407,19 @@ void mem_cgroup_print_oom_info(struct me
 
 static inline void mem_cgroup_oom_enable(void)
 {
-	WARN_ON(current->memcg_oom.may_oom);
-	current->memcg_oom.may_oom = 1;
+	WARN_ON(current->memcg_may_oom);
+	current->memcg_may_oom = 1;
 }
 
 static inline void mem_cgroup_oom_disable(void)
 {
-	WARN_ON(!current->memcg_oom.may_oom);
-	current->memcg_oom.may_oom = 0;
+	WARN_ON(!current->memcg_may_oom);
+	current->memcg_may_oom = 0;
 }
 
 static inline bool task_in_memcg_oom(struct task_struct *p)
 {
-	return p->memcg_oom.memcg;
+	return p->memcg_in_oom;
 }
 
 bool mem_cgroup_oom_synchronize(bool wait);
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -1451,7 +1451,9 @@ struct task_struct {
 	unsigned sched_reset_on_fork:1;
 	unsigned sched_contributes_to_load:1;
 	unsigned sched_migrated:1;
-
+#ifdef CONFIG_MEMCG
+	unsigned memcg_may_oom:1;
+#endif
 #ifdef CONFIG_MEMCG_KMEM
 	unsigned memcg_kmem_skip_account:1;
 #endif
@@ -1782,12 +1784,9 @@ struct task_struct {
 	unsigned long trace_recursion;
 #endif /* CONFIG_TRACING */
 #ifdef CONFIG_MEMCG
-	struct memcg_oom_info {
-		struct mem_cgroup *memcg;
-		gfp_t gfp_mask;
-		int order;
-		unsigned int may_oom:1;
-	} memcg_oom;
+	struct mem_cgroup *memcg_in_oom;
+	gfp_t memcg_oom_gfp_mask;
+	int memcg_oom_order;
 #endif
 #ifdef CONFIG_UPROBES
 	struct uprobe_task *utask;
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -1652,7 +1652,7 @@ static void memcg_oom_recover(struct mem
 
 static void mem_cgroup_oom(struct mem_cgroup *memcg, gfp_t mask, int order)
 {
-	if (!current->memcg_oom.may_oom)
+	if (!current->memcg_may_oom)
 		return;
 	/*
 	 * We are in the middle of the charge context here, so we
@@ -1669,9 +1669,9 @@ static void mem_cgroup_oom(struct mem_cg
 	 * and when we know whether the fault was overall successful.
 	 */
 	css_get(&memcg->css);
-	current->memcg_oom.memcg = memcg;
-	current->memcg_oom.gfp_mask = mask;
-	current->memcg_oom.order = order;
+	current->memcg_in_oom = memcg;
+	current->memcg_oom_gfp_mask = mask;
+	current->memcg_oom_order = order;
 }
 
 /**
@@ -1693,7 +1693,7 @@ static void mem_cgroup_oom(struct mem_cg
  */
 bool mem_cgroup_oom_synchronize(bool handle)
 {
-	struct mem_cgroup *memcg = current->memcg_oom.memcg;
+	struct mem_cgroup *memcg = current->memcg_in_oom;
 	struct oom_wait_info owait;
 	bool locked;
 
@@ -1721,8 +1721,8 @@ bool mem_cgroup_oom_synchronize(bool han
 	if (locked && !memcg->oom_kill_disable) {
 		mem_cgroup_unmark_under_oom(memcg);
 		finish_wait(&memcg_oom_waitq, &owait.wait);
-		mem_cgroup_out_of_memory(memcg, current->memcg_oom.gfp_mask,
-					 current->memcg_oom.order);
+		mem_cgroup_out_of_memory(memcg, current->memcg_oom_gfp_mask,
+					 current->memcg_oom_order);
 	} else {
 		schedule();
 		mem_cgroup_unmark_under_oom(memcg);
@@ -1739,7 +1739,7 @@ bool mem_cgroup_oom_synchronize(bool han
 		memcg_oom_recover(memcg);
 	}
 cleanup:
-	current->memcg_oom.memcg = NULL;
+	current->memcg_in_oom = NULL;
 	css_put(&memcg->css);
 	return true;
 }
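
To make the padding argument in the changelog concrete, here is a minimal stand-alone C sketch. The struct and field names are hypothetical and only mimic the shape of the change; they are not the real task_struct definitions, and the exact sizes depend on the ABI.

/*
 * Illustration of the layout effect: a lone bitfield inside a nested
 * sub-struct occupies its own padded word, while a flattened flag can
 * share the word already used by neighbouring bitfields.  Numbers in
 * the comments assume a typical LP64 target such as x86-64.
 */
#include <stdio.h>

struct nested_style {			/* roughly the old layout */
	unsigned flag_a:1;		/* pre-existing bitfield word */
	unsigned flag_b:1;
	struct oom_info {
		void *memcg;
		unsigned gfp_mask;
		int order;
		unsigned may_oom:1;	/* forces its own padded word */
	} oom;
};

struct flat_style {			/* roughly the new layout */
	unsigned flag_a:1;		/* pre-existing bitfield word */
	unsigned flag_b:1;
	unsigned may_oom:1;		/* packs into the same word */
	void *memcg_in_oom;
	unsigned oom_gfp_mask;
	int oom_order;
};

int main(void)
{
	/* Typically prints 32 vs 24 bytes on x86-64: the nested
	 * sub-struct costs an extra pointer-aligned word of padding. */
	printf("nested: %zu bytes, flat: %zu bytes\n",
	       sizeof(struct nested_style), sizeof(struct flat_style));
	return 0;
}

On a real kernel build the effect can be checked by comparing the task_struct layout reported by pahole (e.g. pahole -C task_struct on vmlinux) before and after the patch.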


^ permalink raw reply	[flat|nested] 49+ messages in thread

end of thread, other threads:[~2016-01-06 20:17 UTC | newest]

Thread overview: 49+ messages
-- links below jump to the message on this page --
2015-09-13 18:59 [PATCH 1/2] memcg: flatten task_struct->memcg_oom Tejun Heo
2015-09-13 18:59 ` Tejun Heo
2015-09-13 19:00 ` [PATCH v3 2/2] memcg: punt high overage reclaim to return-to-userland path Tejun Heo
2015-09-15  7:47   ` Johannes Weiner
2015-09-15  7:47     ` Johannes Weiner
2015-09-15 15:53     ` Tejun Heo
2015-09-15 16:12       ` Johannes Weiner
2015-09-15 16:22         ` Tejun Heo
2015-09-15 16:22           ` Tejun Heo
2015-09-15 16:33           ` Johannes Weiner
2015-09-15 16:33             ` Johannes Weiner
2015-09-15  7:37 ` [PATCH 1/2] memcg: flatten task_struct->memcg_oom Johannes Weiner
2015-09-15  7:37   ` Johannes Weiner
2015-09-20 14:45 ` Sasha Levin
2015-09-20 14:45   ` Sasha Levin
2015-09-21 20:01   ` Tejun Heo
2015-09-21 20:01     ` Tejun Heo
2015-09-30 18:54     ` Tejun Heo
2015-09-30 18:54       ` Tejun Heo
2015-11-25 14:43     ` Peter Zijlstra
2015-11-25 14:43       ` Peter Zijlstra
2015-11-25 15:02       ` Peter Zijlstra
2015-11-25 15:02         ` Peter Zijlstra
2015-11-25 15:31         ` Andrey Ryabinin
2015-11-25 17:34           ` Dmitry Vyukov
2015-11-25 17:34             ` Dmitry Vyukov
2015-11-25 17:44           ` Peter Zijlstra
2015-12-11 16:25             ` Tejun Heo
2015-12-11 16:25               ` Tejun Heo
2015-12-15 19:22               ` Peter Zijlstra
2015-12-30  9:23                 ` [PATCH v4.4-rc7] sched: isolate task_struct bitfields according to synchronization domains Tejun Heo
2015-12-30 20:10                   ` Linus Torvalds
2015-12-30 20:10                     ` Linus Torvalds
2015-12-30 20:17                     ` Linus Torvalds
2015-12-30 20:41                     ` Tejun Heo
2015-12-30 20:41                       ` Tejun Heo
2015-12-30 20:43                       ` Linus Torvalds
2016-01-01  2:56                     ` [PATCH v4.4-rc7] sched: move sched lock synchronized bitfields in task_struct into ->atomic_flags Tejun Heo
2016-01-01  2:56                       ` Tejun Heo
2016-01-06 13:44                       ` Tejun Heo
2016-01-06 13:44                         ` Tejun Heo
2016-01-06 18:48         ` [tip:sched/core] sched/core: Fix unserialized r-m-w scribbling stuff tip-bot for Peter Zijlstra
2016-01-06 20:17           ` Tejun Heo
  -- strict thread matches above, loose matches on Subject: below --
2015-08-28 22:01 [PATCH 1/2] memcg: flatten task_struct->memcg_oom Tejun Heo
2015-08-28 22:01 ` Tejun Heo
2015-09-01 15:25 ` Michal Hocko
2015-09-01 15:25   ` Michal Hocko
2015-09-02 11:45 ` Vladimir Davydov
2015-09-02 11:45   ` Vladimir Davydov
