Message-ID: <1365467464.25498.36.camel@gandalf.local.home>
Subject: Re: [PATCH] sched_clock: Prevent 64bit inatomicity on 32bit systems
From: Steven Rostedt
To: Thomas Gleixner
Cc: LKML, Peter Zijlstra, Ingo Molnar, "H. Peter Anvin", "Wulsch, Siegfried"
Date: Mon, 08 Apr 2013 20:31:04 -0400
X-Mailing-List: linux-kernel@vger.kernel.org

On Sat, 2013-04-06 at 10:10 +0200, Thomas Gleixner wrote:
> The sched_clock_remote() implementation has the following inatomicity
> problem on 32bit systems when accessing the remote scd->clock, which
> is a 64bit value.
>
> CPU0                                  CPU1
>
> sched_clock_local()                   sched_clock_remote(CPU0)
> ...
>                                       remote_clock = scd[CPU0]->clock
>                                           read_low32bit(scd[CPU0]->clock)
> cmpxchg64(scd->clock,...)
>                                           read_high32bit(scd[CPU0]->clock)
>
> While the update of scd->clock is using an atomic64 mechanism, the
> readout on the remote cpu is not, which can cause completely bogus
> readouts.
>
> It is a quite rare problem, because it requires the update to hit the
> narrow race window between the low/high readout and the update must go
> across the 32bit boundary.
>
> The resulting misbehaviour is that CPU1 will see the sched_clock on
> CPU0 ~4 seconds ahead of its own and update CPU1's sched_clock value
> to this bogus timestamp. This stays that way due to the clamping
> implementation for about 4 seconds until the synchronization with
> CLOCK_MONOTONIC undoes the problem.
>
> The issue is hard to observe, because it might only result in a less
> accurate SCHED_OTHER timeslicing behaviour. To create observable
> damage on realtime scheduling classes, it is necessary that the bogus
> update of CPU1's sched_clock happens in the context of a realtime
> thread, which then gets charged 4 seconds of RT runtime, which causes
> the RT throttler mechanism to trigger and prevent scheduling of RT
> tasks for a little less than 4 seconds. So this is quite unlikely as
> well.
>
> The issue was quite hard to decode as the reproduction time is between
> 2 days and 3 weeks and intrusive tracing makes it less likely, but the
> following trace recorded with trace_clock=global, which uses
> sched_clock_local(), gave the final hint:
>
>   <idle>-0   0d..30 400269.477150: hrtimer_cancel: hrtimer=0xf7061e80
>   <idle>-0   0d..30 400269.477151: hrtimer_start:  hrtimer=0xf7061e80 ...
> irq/20-S-587 1d..32 400273.772118: sched_wakeup:   comm= ... target_cpu=0
>   <idle>-0   0dN.30 400273.772118: hrtimer_cancel: hrtimer=0xf7061e80
>
> What happens is that CPU0 goes idle and invokes
> sched_clock_idle_sleep_event() which invokes sched_clock_local() and
> CPU1 runs a remote wakeup for CPU0 at the same time, which invokes
> sched_clock_remote(). The time jump gets propagated to CPU0 via
> sched_clock_remote() and stays stale on both cores for ~4 seconds.
>
> There are only two other possibilities, which could cause a stale
> sched clock:
>
> 1) ktime_get() which reads out CLOCK_MONOTONIC returns a sporadic
>    wrong value.
>
> 2) sched_clock() which reads the TSC returns a sporadic wrong value.
>
> #1 can be excluded because sched_clock would continue to increase for
> one jiffy and then go stale.
>
> #2 can be excluded because it would not make the clock jump
> forward. It would just result in a stale sched_clock for one jiffy.
>
> After quite some brain twisting and finding the same pattern on other
> traces, sched_clock_remote() remained the only place which could cause
> such a problem and as explained above it's indeed racy on 32bit
> systems.
>
> So while on 64bit systems the readout is atomic, we need to verify the
> remote readout on 32bit machines. We need to protect the local->clock
> readout in sched_clock_remote() on 32bit as well because an NMI could
> hit between the low and the high readout, call sched_clock_local() and
> modify local->clock.
>
> Thanks to Siegfried Wulsch for bearing with my debug requests and
> going through the tedious tasks of running a bunch of reproducer
> systems to generate the debug information which let me decode the
> issue.

Ug. That looks painful. Nice catch!

-- Steve
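
For anyone wondering where the "~4 seconds" in the quoted changelog comes from, here is a rough user-space sketch of the torn read. It is only an illustration, not the patch and not kernel code. The helper names torn_read() and torn_error_ns() are made up for this example, and it assumes the clock value is in nanoseconds and that the low word is sampled before the update and the high word after it, as in the CPU0/CPU1 diagram.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Hypothetical helper (not a kernel function): model a non-atomic 64-bit
 * load on a 32-bit machine. The low word is taken from the value before
 * the concurrent update, the high word from the value after it.
 */
static uint64_t torn_read(uint64_t before, uint64_t after)
{
	assert(after >= before);		/* the clock only moves forward */

	uint32_t lo = (uint32_t)before;		/* sampled before the update */
	uint32_t hi = (uint32_t)(after >> 32);	/* sampled after the update */

	return ((uint64_t)hi << 32) | lo;
}

/*
 * Hypothetical helper: signed difference, in clock units, between the
 * torn value and the value the remote clock really holds after the update.
 */
static int64_t torn_error_ns(uint64_t before, uint64_t after)
{
	return (int64_t)(torn_read(before, after) - after);
}
```

With before = 0x00000001FFFFFF00 and after = 0x0000000200000100 (an update that carries into the high word), torn_read() yields 0x00000002FFFFFF00, which is exactly 2^32 ns (about 4.29 s) past the old value. If the update does not cross the 32bit boundary, the torn value is merely the old one, which matches the remark that the update must go across the boundary for the race to matter.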