From mboxrd@z Thu Jan 1 00:00:00 1970
Return-Path:
Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand
	id S1424310AbcFIJar (ORCPT ); Thu, 9 Jun 2016 05:30:47 -0400
Received: from mga04.intel.com ([192.55.52.120]:19062 "EHLO mga04.intel.com"
	rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP
	id S1161346AbcFIJap (ORCPT ); Thu, 9 Jun 2016 05:30:45 -0400
X-ExtLoop1: 1
X-IronPort-AV: E=Sophos;i="5.26,444,1459839600"; d="scan'208";a="994175188"
Date: Thu, 9 Jun 2016 09:33:24 +0800
From: Yuyang Du
To: Chris Wilson
Cc: Peter Zijlstra, Andrey Ryabinin, Linus Torvalds, Mike Galbraith,
	Thomas Gleixner, bsegall@google.com, morten.rasmussen@arm.com,
	pjt@google.com, steve.muckle@linaro.org, linux-kernel@vger.kernel.org
Subject: Re: Divide-by-zero in post_init_entity_util_avg
Message-ID: <20160609013324.GH8105@intel.com>
References: <20160609090142.GS32344@nuc-i3427.alporthouse.com>
MIME-Version: 1.0
Content-Type: text/plain; charset=us-ascii
Content-Disposition: inline
In-Reply-To: <20160609090142.GS32344@nuc-i3427.alporthouse.com>
User-Agent: Mutt/1.5.21 (2010-09-15)
Sender: linux-kernel-owner@vger.kernel.org
List-ID:
X-Mailing-List: linux-kernel@vger.kernel.org

On Thu, Jun 09, 2016 at 10:01:42AM +0100, Chris Wilson wrote:
> I've presumed commit 2b8c41daba327 ("sched/fair: Initiate a new task's
> util avg to a bounded value") to be at fault, hence the CCs. Though it
> may just be a victim.
>
> gdb says 0x43/0x80 is
>
>    725		if (cfs_rq->avg.util_avg != 0) {
>    726			sa->util_avg  = cfs_rq->avg.util_avg * se->load.weight;
> -> 727			sa->util_avg /= (cfs_rq->avg.load_avg + 1);
>    728
>    729			if (sa->util_avg > cap)
>    730				sa->util_avg = cap;
>    731		} else {
>
> I've run the same fork-heavy workload that seemed to hit the initial
> fault under kasan. kasan has not reported any errors, nor has the bug
> reoccurred after a day (earlier I had a couple of panics within a few
> hours).
>
> Is it possible for a race window where cfg_rq->avg.load_avg is indeed
> -1?

Any evidence of other memory corruption in the above?

-1 should not be possible; it sounds like a soft error. But a race is
hazardous anyway.

Thanks a lot, Chris.

--

Subject: [PATCH] sched/fair: Avoid hazardous reading cfs_rq->avg.load_avg without rq lock

Commit 2b8c41daba327 ("sched/fair: Initiate a new task's util avg to
a bounded value") reads cfs_rq->avg.load_avg and then uses the value as
a divisor (actually cfs_rq->avg.load_avg + 1). Because the read happens
without the rq lock held, this race may cause a divide-by-zero
exception.

Fix it by moving the post_init_entity_util_avg() call into the
rq-locked section.

Reported-by: Chris Wilson
Signed-off-by: Yuyang Du
---
 kernel/sched/core.c | 3 +--
 1 file changed, 1 insertion(+), 2 deletions(-)

diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 385c947..b9f44df 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -2535,10 +2535,9 @@ void wake_up_new_task(struct task_struct *p)
 	 */
 	set_task_cpu(p, select_task_rq(p, task_cpu(p), SD_BALANCE_FORK, 0));
 #endif
+	rq = __task_rq_lock(p, &rf);
 	/* Post initialize new task's util average when its cfs_rq is set */
 	post_init_entity_util_avg(&p->se);
-
-	rq = __task_rq_lock(p, &rf);
 	activate_task(rq, p, 0);
 	p->on_rq = TASK_ON_RQ_QUEUED;
 	trace_sched_wakeup_new(p);
--
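
For readers following the arithmetic, here is a minimal userspace sketch of
why a load_avg of -1 would fault on the quoted line 727. It is not kernel
code and not part of the patch: struct avg_model and both helper functions
are hypothetical names made up for illustration, and it assumes the averages
are stored as unsigned long, so that "-1" means all bits set. The zero-divisor
check is added only to make the failure observable; the patch itself adds no
such check and instead moves the computation under the rq lock.

```c
#include <limits.h>
#include <stdbool.h>

/* Hypothetical stand-in for the two cfs_rq->avg fields read in the listing. */
struct avg_model {
	unsigned long load_avg;
	unsigned long util_avg;
};

/* True when (load_avg + 1) wraps to zero, i.e. load_avg is all-ones ("-1"). */
static bool divisor_wraps_to_zero(unsigned long load_avg)
{
	return load_avg + 1 == 0;
}

/*
 * Models the quoted lines 726-730: scale the queue's util_avg by the
 * entity weight, divide by (load_avg + 1), then clamp to cap.
 * Returns false instead of dividing when the divisor is zero; this
 * check is an illustration only and does not exist in the kernel code.
 */
static bool bounded_util_avg(const struct avg_model *cfs, unsigned long weight,
			     unsigned long cap, unsigned long *out)
{
	unsigned long divisor = cfs->load_avg + 1;
	unsigned long util;

	if (divisor == 0)
		return false;	/* the kernel would take a divide error here */

	util = cfs->util_avg * weight / divisor;
	*out = util > cap ? cap : util;
	return true;
}
```

With any ordinary load_avg the divisor is at least 1, which matches the
remark above that -1 should not be possible from a consistent read.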