* [PATCH] sched: Do not re-read h_load_next during hierarchical load calculation v2
@ 2019-03-19 12:36 Mel Gorman
  2019-03-19 15:37 ` Peter Zijlstra
  2019-04-03  8:37 ` [tip:sched/core] sched/fair: Do not re-read ->h_load_next during hierarchical load calculation tip-bot for Mel Gorman
  0 siblings, 2 replies; 3+ messages in thread
From: Mel Gorman @ 2019-03-19 12:36 UTC (permalink / raw)
  To: Peter Zijlstra; +Cc: Ingo Molnar, Valentin Schneider, linux-kernel

Changelog since v1
o Use WRITE_ONCE
o Add Fixes:
o Add reviewed-by for the READ_ONCE part as I considered it to still be
  ok even after the WRITE_ONCE

A NULL pointer dereference bug was reported on a distribution kernel but
the same issue should be present on the mainline kernel. It occurred on s390
but should not be arch-specific.  A partial oops looks like

[775277.408564] Unable to handle kernel pointer dereference in virtual kernel address space
...
[775277.408759] Call Trace:
[775277.408763] ([<0002c11c56899c61>] 0x2c11c56899c61)
[775277.408766]  [<0000000000177bb4>] try_to_wake_up+0xfc/0x450
[775277.408773]  [<000003ff81ede872>] vhost_poll_wakeup+0x3a/0x50 [vhost]
[775277.408777]  [<0000000000194ae4>] __wake_up_common+0xbc/0x178
[775277.408779]  [<0000000000194f86>] __wake_up_common_lock+0x9e/0x160
[775277.408780]  [<00000000001950de>] __wake_up_sync_key+0x4e/0x60
[775277.408785]  [<00000000005d911e>] sock_def_readable+0x5e/0x98

The bug hits any time between 1 hour and 3 days. The dereference occurs
in update_cfs_rq_h_load when accumulating h_load. The problem is that
cfs_rq->h_load_next is not protected by any locking and can be updated
by parallel calls to task_h_load. Depending on the compiler, code may be
generated that re-reads cfs_rq->h_load_next after the check for NULL and
then oopses when reading se->avg.load_avg. The disassembly showed that it
was possible to re-read h_load_next after the check for NULL.
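
To make the re-read concrete, here is a minimal userspace sketch of the
two shapes the load can take. The struct and function names are
hypothetical stand-ins, not the kernel definitions, and the second
function is only a hand-written model of what a compiler is permitted to
emit for a plain (unmarked) access, not actual compiler output.

```c
#include <stddef.h>

/* Hypothetical stand-ins for sched_entity and cfs_rq (illustration only). */
struct entity {
	unsigned long load_avg;
};

struct queue {
	struct entity *next;	/* plays the role of h_load_next */
};

/* As written in the source: one load, one NULL check, one dereference. */
unsigned long load_as_written(struct queue *q)
{
	struct entity *se;

	if ((se = q->next) != NULL)
		return se->load_avg;
	return 0;
}

/*
 * Model of a legal transformation for a plain access: the pointer is
 * loaded once for the NULL check and loaded again for the dereference.
 * If another CPU stores NULL between the two loads, the second load
 * yields NULL and the dereference faults.
 */
unsigned long load_as_refetched(struct queue *q)
{
	if (q->next != NULL)			/* first load */
		return q->next->load_avg;	/* second load */
	return 0;
}
```

Single-threaded, both functions return the same value; they only differ
when a concurrent writer changes the pointer between the two loads.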

While this does not appear to be an issue for later compilers, it's still
an accident if the correct code is generated. Full locking in this path
would have high overhead so this patch uses READ_ONCE to read h_load_next
only once and check for NULL before dereferencing. It was confirmed that
there were no further oops after 10 days of testing.

As Peter pointed out, it is also necessary to use WRITE_ONCE to avoid any
potential problems with store tearing.
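
The pattern can be sketched in userspace as follows. This is a minimal
illustration under stated assumptions: the READ_ONCE/WRITE_ONCE macros
below are simplified volatile-cast approximations (the kernel versions
handle more cases), they rely on the GCC/Clang __typeof__ extension, and
the struct and function names are hypothetical rather than taken from
kernel/sched/fair.c.

```c
#include <stddef.h>

/* Simplified approximations of the kernel macros (assumption, GCC/Clang only). */
#define READ_ONCE(x)		(*(const volatile __typeof__(x) *)&(x))
#define WRITE_ONCE(x, val)	(*(volatile __typeof__(x) *)&(x) = (val))

struct queue;

/* Hypothetical stand-ins for sched_entity and cfs_rq. */
struct entity {
	unsigned long load_avg;
	struct queue *my_q;	/* queue owned by this entity */
};

struct queue {
	struct entity *next;	/* plays the role of h_load_next */
};

/* Writer side: store the pointer with a single marked store. */
void set_next(struct queue *q, struct entity *se)
{
	WRITE_ONCE(q->next, se);
}

/*
 * Reader side: load the pointer exactly once per iteration into a local,
 * check the local for NULL, and dereference only the local. The compiler
 * cannot replace the local with a second load of q->next.
 */
unsigned long sum_chain(struct queue *q)
{
	struct entity *se;
	unsigned long sum = 0;

	while ((se = READ_ONCE(q->next)) != NULL) {
		sum += se->load_avg;
		q = se->my_q;
	}
	return sum;
}
```

The marked accesses only constrain the compiler for that one load or
store; they do not serialise readers and writers the way a lock would,
which is why they are cheap enough for this path.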

Fixes: 685207963be9 ("sched: Move h_load calculation to task_h_load()")
[peterz@infradead.org: Use WRITE_ONCE to protect against store tearing]
Signed-off-by: Mel Gorman <mgorman@techsingularity.net>
Reviewed-by: Valentin Schneider <valentin.schneider@arm.com>
Cc: stable@vger.kernel.org
---
 kernel/sched/fair.c | 6 +++---
 1 file changed, 3 insertions(+), 3 deletions(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 310d0637fe4b..5e61a1a99e38 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -7713,10 +7713,10 @@ static void update_cfs_rq_h_load(struct cfs_rq *cfs_rq)
 	if (cfs_rq->last_h_load_update == now)
 		return;
 
-	cfs_rq->h_load_next = NULL;
+	WRITE_ONCE(cfs_rq->h_load_next, NULL);
 	for_each_sched_entity(se) {
 		cfs_rq = cfs_rq_of(se);
-		cfs_rq->h_load_next = se;
+		WRITE_ONCE(cfs_rq->h_load_next, se);
 		if (cfs_rq->last_h_load_update == now)
 			break;
 	}
@@ -7726,7 +7726,7 @@ static void update_cfs_rq_h_load(struct cfs_rq *cfs_rq)
 		cfs_rq->last_h_load_update = now;
 	}
 
-	while ((se = cfs_rq->h_load_next) != NULL) {
+	while ((se = READ_ONCE(cfs_rq->h_load_next)) != NULL) {
 		load = cfs_rq->h_load;
 		load = div64_ul(load * se->avg.load_avg,
 			cfs_rq_load_avg(cfs_rq) + 1);


* Re: [PATCH] sched: Do not re-read h_load_next during hierarchical load calculation v2
  2019-03-19 12:36 [PATCH] sched: Do not re-read h_load_next during hierarchical load calculation v2 Mel Gorman
@ 2019-03-19 15:37 ` Peter Zijlstra
  2019-04-03  8:37 ` [tip:sched/core] sched/fair: Do not re-read ->h_load_next during hierarchical load calculation tip-bot for Mel Gorman
  1 sibling, 0 replies; 3+ messages in thread
From: Peter Zijlstra @ 2019-03-19 15:37 UTC (permalink / raw)
  To: Mel Gorman; +Cc: Ingo Molnar, Valentin Schneider, linux-kernel

On Tue, Mar 19, 2019 at 12:36:10PM +0000, Mel Gorman wrote:
> Changelog since v1
> o Use WRITE_ONCE
> o Add Fixes:
> o Add reviewed-by for the READ_ONCE part as I considered it to still be
>   ok even after the WRITE_ONCE
> 
> A NULL pointer dereference bug was reported on a distribution kernel but
> the same issue should be present on the mainline kernel. It occurred on s390
> but should not be arch-specific.  A partial oops looks like
> 
> [775277.408564] Unable to handle kernel pointer dereference in virtual kernel address space
> ...
> [775277.408759] Call Trace:
> [775277.408763] ([<0002c11c56899c61>] 0x2c11c56899c61)
> [775277.408766]  [<0000000000177bb4>] try_to_wake_up+0xfc/0x450
> [775277.408773]  [<000003ff81ede872>] vhost_poll_wakeup+0x3a/0x50 [vhost]
> [775277.408777]  [<0000000000194ae4>] __wake_up_common+0xbc/0x178
> [775277.408779]  [<0000000000194f86>] __wake_up_common_lock+0x9e/0x160
> [775277.408780]  [<00000000001950de>] __wake_up_sync_key+0x4e/0x60
> [775277.408785]  [<00000000005d911e>] sock_def_readable+0x5e/0x98
> 
> The bug hits any time between 1 hour and 3 days. The dereference occurs
> in update_cfs_rq_h_load when accumulating h_load. The problem is that
> cfs_rq->h_load_next is not protected by any locking and can be updated
> by parallel calls to task_h_load. Depending on the compiler, code may be
> generated that re-reads cfs_rq->h_load_next after the check for NULL and
> then oopses when reading se->avg.load_avg. The disassembly showed that it
> was possible to re-read h_load_next after the check for NULL.
> 
> While this does not appear to be an issue for later compilers, it's still
> an accident if the correct code is generated. Full locking in this path
> would have high overhead so this patch uses READ_ONCE to read h_load_next
> only once and check for NULL before dereferencing. It was confirmed that
> there were no further oops after 10 days of testing.
> 
> As Peter pointed out, it is also necessary to use WRITE_ONCE to avoid any
> potential problems with store tearing.
> 
> Fixes: 685207963be9 ("sched: Move h_load calculation to task_h_load()")
> [peterz@infradead.org: Use WRITE_ONCE to protect against store tearing]
> Signed-off-by: Mel Gorman <mgorman@techsingularity.net>
> Reviewed-by: Valentin Schneider <valentin.schneider@arm.com>
> Cc: stable@vger.kernel.org

Thanks!


* [tip:sched/core] sched/fair: Do not re-read ->h_load_next during hierarchical load calculation
  2019-03-19 12:36 [PATCH] sched: Do not re-read h_load_next during hierarchical load calculation v2 Mel Gorman
  2019-03-19 15:37 ` Peter Zijlstra
@ 2019-04-03  8:37 ` tip-bot for Mel Gorman
  1 sibling, 0 replies; 3+ messages in thread
From: tip-bot for Mel Gorman @ 2019-04-03  8:37 UTC (permalink / raw)
  To: linux-tip-commits
  Cc: hpa, mgorman, valentin.schneider, efault, mingo, peterz,
	linux-kernel, stable, torvalds, tglx

Commit-ID:  0e9f02450da07fc7b1346c8c32c771555173e397
Gitweb:     https://git.kernel.org/tip/0e9f02450da07fc7b1346c8c32c771555173e397
Author:     Mel Gorman <mgorman@techsingularity.net>
AuthorDate: Tue, 19 Mar 2019 12:36:10 +0000
Committer:  Ingo Molnar <mingo@kernel.org>
CommitDate: Wed, 3 Apr 2019 09:50:22 +0200

sched/fair: Do not re-read ->h_load_next during hierarchical load calculation

A NULL pointer dereference bug was reported on a distribution kernel but
the same issue should be present on the mainline kernel. It occurred on s390
but should not be arch-specific.  A partial oops looks like:

  Unable to handle kernel pointer dereference in virtual kernel address space
  ...
  Call Trace:
    ...
    try_to_wake_up+0xfc/0x450
    vhost_poll_wakeup+0x3a/0x50 [vhost]
    __wake_up_common+0xbc/0x178
    __wake_up_common_lock+0x9e/0x160
    __wake_up_sync_key+0x4e/0x60
    sock_def_readable+0x5e/0x98

The bug hits any time between 1 hour and 3 days. The dereference occurs
in update_cfs_rq_h_load when accumulating h_load. The problem is that
cfs_rq->h_load_next is not protected by any locking and can be updated
by parallel calls to task_h_load. Depending on the compiler, code may be
generated that re-reads cfs_rq->h_load_next after the check for NULL and
then oopses when reading se->avg.load_avg. The disassembly showed that it
was possible to re-read h_load_next after the check for NULL.

While this does not appear to be an issue for later compilers, it's still
an accident if the correct code is generated. Full locking in this path
would have high overhead so this patch uses READ_ONCE to read h_load_next
only once and check for NULL before dereferencing. It was confirmed that
there were no further oops after 10 days of testing.

As Peter pointed out, it is also necessary to use WRITE_ONCE() to avoid any
potential problems with store tearing.

Signed-off-by: Mel Gorman <mgorman@techsingularity.net>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Valentin Schneider <valentin.schneider@arm.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: <stable@vger.kernel.org>
Fixes: 685207963be9 ("sched: Move h_load calculation to task_h_load()")
Link: https://lkml.kernel.org/r/20190319123610.nsivgf3mjbjjesxb@techsingularity.net
Signed-off-by: Ingo Molnar <mingo@kernel.org>
---
 kernel/sched/fair.c | 6 +++---
 1 file changed, 3 insertions(+), 3 deletions(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index fdab7eb6f351..40bd1e27b1b7 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -7784,10 +7784,10 @@ static void update_cfs_rq_h_load(struct cfs_rq *cfs_rq)
 	if (cfs_rq->last_h_load_update == now)
 		return;
 
-	cfs_rq->h_load_next = NULL;
+	WRITE_ONCE(cfs_rq->h_load_next, NULL);
 	for_each_sched_entity(se) {
 		cfs_rq = cfs_rq_of(se);
-		cfs_rq->h_load_next = se;
+		WRITE_ONCE(cfs_rq->h_load_next, se);
 		if (cfs_rq->last_h_load_update == now)
 			break;
 	}
@@ -7797,7 +7797,7 @@ static void update_cfs_rq_h_load(struct cfs_rq *cfs_rq)
 		cfs_rq->last_h_load_update = now;
 	}
 
-	while ((se = cfs_rq->h_load_next) != NULL) {
+	while ((se = READ_ONCE(cfs_rq->h_load_next)) != NULL) {
 		load = cfs_rq->h_load;
 		load = div64_ul(load * se->avg.load_avg,
 			cfs_rq_load_avg(cfs_rq) + 1);
