LKML Archive on
From: Steven Rostedt <>
To: Thomas Gleixner <>
Cc: LKML <>,
	Peter Zijlstra <>,
	Ingo Molnar <>, "H. Peter Anvin" <>,
	"Wulsch, Siegfried" <>
Subject: Re: [PATCH] sched_clock: Prevent 64bit inatomicity on 32bit systems
Date: Mon, 08 Apr 2013 20:31:04 -0400
Message-ID: <1365467464.25498.36.camel@gandalf.local.home> (raw)
In-Reply-To: <alpine.LFD.2.02.1304051544160.21884@ionos>

On Sat, 2013-04-06 at 10:10 +0200, Thomas Gleixner wrote:
> The sched_clock_remote() implementation has the following inatomicity
> problem on 32bit systems when accessing the remote scd->clock, which
> is a 64bit value.
> CPU0			CPU1
> sched_clock_local()	sched_clock_remote(CPU0)
> ...
> 			remote_clock = scd[CPU0]->clock
> 			    read_low32bit(scd[CPU0]->clock)
> cmpxchg64(scd->clock,...)
> 			    read_high32bit(scd[CPU0]->clock)
> While the update of scd->clock is using an atomic64 mechanism, the
> readout on the remote cpu is not, which can cause completely bogus
> readouts.
> It is a quite rare problem, because it requires the update to hit the
> narrow race window between the low/high readout and the update must go
> across the 32bit boundary.
> The resulting misbehaviour is that CPU1 will see the sched_clock on
> CPU0 ~4 seconds ahead of its own and update CPU1's sched_clock value
> to this bogus timestamp. Due to the clamping implementation, this
> stays that way for about 4 seconds until the synchronization with
> CLOCK_MONOTONIC undoes the problem.
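
The size of the jump follows from the arithmetic of a torn read. Below is a minimal user-space sketch, not kernel code: torn_read() is a hypothetical helper that pairs the low 32 bits of the value before the update with the high 32 bits of the value after it, matching the read order in the diagram above. It assumes the clock counts nanoseconds.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Hypothetical illustration helper (not a kernel function):
 * simulate a non-atomic 64-bit read on a 32-bit machine where the
 * low word is read before the writer's update and the high word
 * is read after it.
 */
static uint64_t torn_read(uint64_t before, uint64_t after)
{
	uint32_t low, high;

	/* The writer only ever moves the clock forward. */
	assert(after >= before);

	low = (uint32_t)before;          /* stale low half  */
	high = (uint32_t)(after >> 32);  /* fresh high half */

	return ((uint64_t)high << 32) | low;
}
```

With before = 0x00000001FFFFFF00 and an update of 0x200 ns, the real clock becomes 0x0000000200000100, but the torn read yields 0x00000002FFFFFF00. That is 0xFFFFFE00 ns (just under 2^32 ns, roughly 4.29 seconds) ahead of the real value. If the update does not carry into the high word, the same read merely returns the old value, which is why the update has to cross the 32bit boundary.
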
> The issue is hard to observe, because it might only result in a less
> accurate SCHED_OTHER timeslicing behaviour. To create observable
> damage on realtime scheduling classes, it is necessary that the bogus
> update of CPU1's sched_clock happens in the context of a realtime
> thread, which then gets charged 4 seconds of RT runtime, which causes
> the RT throttler mechanism to trigger and prevent scheduling of RT
> tasks for a little less than 4 seconds. So this is quite unlikely as
> well.
> The issue was quite hard to decode as the reproduction time is between
> 2 days and 3 weeks and intrusive tracing makes it less likely, but the
> following trace recorded with trace_clock=global, which uses
> sched_clock_local(), gave the final hint:
>   <idle>-0   0d..30 400269.477150: hrtimer_cancel: hrtimer=0xf7061e80
>   <idle>-0   0d..30 400269.477151: hrtimer_start:  hrtimer=0xf7061e80 ...
> irq/20-S-587 1d..32 400273.772118: sched_wakeup:   comm= ... target_cpu=0
>   <idle>-0   0dN.30 400273.772118: hrtimer_cancel: hrtimer=0xf7061e80
> What happens is that CPU0 goes idle and invokes
> sched_clock_idle_sleep_event() which invokes sched_clock_local() and
> CPU1 runs a remote wakeup for CPU0 at the same time, which invokes
> sched_clock_remote(). The time jump gets propagated to CPU0 via
> sched_clock_remote() and stays stale on both cores for ~4 seconds.
> There are only two other possibilities, which could cause a stale
> sched clock:
> 1) ktime_get() which reads out CLOCK_MONOTONIC returns a sporadic
>    wrong value.
> 2) sched_clock() which reads the TSC returns a sporadic wrong value.
> #1 can be excluded because sched_clock would continue to increase for
>    one jiffy and then go stale.
> #2 can be excluded because it would not make the clock jump
>    forward. It would just result in a stale sched_clock for one jiffy.
> After quite some brain twisting and finding the same pattern on other
> traces, sched_clock_remote() remained the only place which could cause
> such a problem and as explained above it's indeed racy on 32bit
> systems.
> So while on 64bit systems the readout is atomic, we need to verify the
> remote readout on 32bit machines. We need to protect the local->clock
> readout in sched_clock_remote() on 32bit as well because an NMI could
> hit between the low and the high readout, call sched_clock_local() and
> modify local->clock.
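
One generic way to get an untorn 64-bit read when only an atomic compare-and-exchange is available is to use that primitive as a read. The sketch below uses C11 atomics in user space; demo_clock, demo_clock_read() and demo_clock_set() are hypothetical names for illustration, and this is an assumption about the general technique, not a statement of what the patch itself does.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdint.h>

/* Hypothetical stand-in for one CPU's 64-bit clock value. */
static _Atomic uint64_t demo_clock;

/*
 * Atomic 64-bit read built from compare-and-exchange.
 * If the clock is 0, it is replaced by 0, which changes nothing.
 * Otherwise the exchange fails and the current value is copied
 * into 'expected'. Either way 'expected' holds a value that was
 * observed in one piece.
 */
static uint64_t demo_clock_read(void)
{
	uint64_t expected = 0;

	atomic_compare_exchange_strong(&demo_clock, &expected, 0);
	return expected;
}

/* Atomic update; the demo clock is only allowed to move forward. */
static void demo_clock_set(uint64_t now)
{
	assert(now >= demo_clock_read());
	atomic_store(&demo_clock, now);
}
```

Because the read goes through the same atomic mechanism as the update, a reader can no longer combine halves from two different values, and the same applies to a local readout that an NMI might interrupt.
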
> Thanks to Siegfried Wulsch for bearing with my debug requests and
> going through the tedious tasks of running a bunch of reproducer
> systems to generate the debug information which let me decode the
> issue.

Ug. That looks painful.

Nice catch!

-- Steve

Thread overview: 4+ messages
2013-04-06  8:10 Thomas Gleixner
2013-04-06 16:28 ` Peter Zijlstra
2013-04-08 10:24 ` [tip:sched/urgent] " tip-bot for Thomas Gleixner
2013-04-09  0:31 ` Steven Rostedt [this message]