From mboxrd@z Thu Jan  1 00:00:00 1970
Return-path: <tglx@linutronix.de>
Received: from p5492e61e.dip0.t-ipconnect.de ([84.146.230.30]
 helo=nanos.glx-home)	by Galois.linutronix.de with esmtpsa
 (TLS1.2:DHE_RSA_AES_256_CBC_SHA256:256)	(Exim 4.80)	(envelope-from
 <tglx@linutronix.de>)	id 1fEfcE-0001Gq-G6	for speck@linutronix.de; Fri, 04
 May 2018 20:39:34 +0200
Date: Fri, 4 May 2018 20:39:31 +0200 (CEST)
From: Thomas Gleixner <tglx@linutronix.de>
Subject: Re: [PATCH 3/5] SSB extra 1
In-Reply-To: <30de6834-6580-4d88-f5f8-23d8fa8a4bad@linux.intel.com>
Message-ID: <alpine.DEB.2.21.1805042039240.1685@nanos.tec.linutronix.de>
References: <cover.1525383411.git.dave.hansen@intel.com> =?utf-8?q?=3Cd4ffdf?=
 =?utf-8?q?50f25bca207b3942fc4a390d2273487517=2E1525383411=2Egit=2Edave=2E?=
 =?utf-8?q?hansen=40intel=2Ecom=3E?=
 <alpine.DEB.2.21.1805041528400.1675@nanos.tec.linutronix.de>
 <1bf0c44d-c972-2c2e-5d90-4f51b8f2c4c9@linux.intel.com>
 <alpine.DEB.2.21.1805041624280.1675@nanos.tec.linutronix.de>
 <20180504160408.GG75137@tassilo.jf.intel.com>
 <alpine.DEB.2.21.1805041808210.1675@nanos.tec.linutronix.de>
 <20180504162813.GH75137@tassilo.jf.intel.com>
 <alpine.DEB.2.21.1805041831590.1675@nanos.tec.linutronix.de>
 <30de6834-6580-4d88-f5f8-23d8fa8a4bad@linux.intel.com>
MIME-Version: 1.0
Content-Type: text/plain; charset="utf-8"
Content-Transfer-Encoding: 7bit
To: speck@linutronix.de
List-ID: <speck.linutronix.de>

On Fri, 4 May 2018, speck for Dave Hansen wrote:
> On 05/04/2018 09:32 AM, speck for Thomas Gleixner wrote:
> >> The flag doesn't know anything about the timer. You would
> >> need another flag that says "start a delay timer on the new CPU
> >> too".
> > Color me confused. I dont see a timer anywhere.
> 
> We were trying not to ping-pong the MSRs too much if we are going in/out
> of the BFP code frequently.
> 
> So, there's a schedule_delayed_work_on() in there that waits 10ms and
> turns the mitigations off, so we're only ping-ponging the MSRs every 10ms.
> 
> So, even in the case that we're generating BPF instructions for the
> mitigation enable/disable, we still might want some mechanism to make
> sure that we're not touching the MSRs *too* frequently if we're going
> in/out of BFP frequently.
> 
> That's a wee bit harder if we're tracking the mitigation on the task
> level than the CPU.  I don't think it's impossible, but it's certainly
> more code than is there at the moment.

Groan, you're again conflating stuff in a completely nonsensical way.

Let's do a proper problem analysis first:

1) CPU resource speculation

   Can be controlled:

     A) Globally (command line on/off or not vulnerable system)

     B) Context dependent

     	Doing a time-based prevention of toggling it too often is a
     	completely orthogonal problem. Let's talk about that later.

2) Context

   The disable/enable is tied to execution contexts

     - task via prctl

     - task via seccomp

     - ebpf

   Now the fundamental context in Linux is a task. Soft and hard interrupts
   nest into the task context, or you can consider them context
   stealing.

   That has a fundamental consequence:

     Nested contexts can always disable the speculation if requested, but
     they can only enable speculation when the context in which they nest
     has it enabled as well.

  So any nested context has to look at the previous level to see whether it
  can reenable. So you need storage for that.

  The regular task storage is TIF_RDS, the softirq storage can be per cpu
  and the hardirq context does not need storage, assuming that NMI is not
  allowed to fiddle with that. But we can very simply use a per cpu
  refcount for soft and hard interrupt contexts. That refcount is
  incremented on disable and decremented on enable. prctl/seccomp has no
  influence at that point.  But if on enable the count goes to zero then it
  has to check the task's TIF_RDS to decide whether it can be reenabled or not.
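
  To make the refcount rule concrete, here is a rough user-space model of
  one CPU. It is only a sketch: struct cpu_model, irq_spec_disable() and
  irq_spec_enable() are made-up names for illustration, and the
  spec_disabled field merely stands in for the real MSR state.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical user-space model of a single CPU; not kernel code. */
struct cpu_model {
	bool task_tif_rds;     /* models TIF_RDS of the interrupted task */
	unsigned int irq_refs; /* per-CPU refcount of nested disable requests */
	bool spec_disabled;    /* stands in for the state held in the MSR */
};

/* A soft or hard interrupt context requests speculation disable. */
static void irq_spec_disable(struct cpu_model *cpu)
{
	cpu->irq_refs++;
	cpu->spec_disabled = true; /* a disable request always succeeds */
}

/* The same context is finished and requests reenabling. */
static void irq_spec_enable(struct cpu_model *cpu)
{
	assert(cpu->irq_refs > 0);
	if (--cpu->irq_refs > 0)
		return; /* an outer nested context still wants it disabled */

	/* Count dropped to zero: the task level decides via TIF_RDS. */
	cpu->spec_disabled = cpu->task_tif_rds;
}
```

  With task_tif_rds clear, two nested disables followed by two enables end
  with speculation enabled again. With task_tif_rds set, the last enable
  leaves it disabled, because the task level still wants it off.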

  Now let's look at EBPF. EBPF is also a nesting context.

  So, if EBPF runs in preemptible task context, then it sets a flag in the
  task 'ebf_speculation_disabled' and sets TIF_RDS, which means that on
  migration the normal switch_to() logic will take care of it. Obviously we
  need a per task storage for the prctl selected state. I already did this
  for the force disable thing. If EBPF reenables then it uses the per task
  prctl state.

  If EBPF runs in soft or hardirq context then it can use the per cpu
  refcount. The above rules apply.
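
  The task context case can be sketched the same way. Again this is a
  user-space model under assumptions: struct task_model, ebpf_task_enter()
  and ebpf_task_exit() are invented for illustration, prctl_spec_disabled
  stands for the per task prctl selected state, and tif_rds stands for
  the flag which switch_to() would act on.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical user-space model of the per-task state; not kernel code. */
struct task_model {
	bool prctl_spec_disabled;      /* state selected via prctl/seccomp */
	bool ebf_speculation_disabled; /* flag name as written in this mail */
	bool tif_rds;                  /* models TIF_RDS for switch_to() */
};

/* EBPF starts running in preemptible task context. */
static void ebpf_task_enter(struct task_model *t)
{
	t->ebf_speculation_disabled = true;
	t->tif_rds = true; /* follows the task across migration */
}

/* EBPF is done: fall back to whatever prctl/seccomp selected. */
static void ebpf_task_exit(struct task_model *t)
{
	assert(t->ebf_speculation_disabled);
	t->ebf_speculation_disabled = false;
	t->tif_rds = t->prctl_spec_disabled;
}
```

  Because the state lives in the task, nothing in this model has to know
  which CPU the task runs on.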

So now let's talk about the toggle timer thing.

  We have one central place where the MSR is written to and we already only
  write it when the control state changes between two contexts.

  So on every toggle, you increment a per cpu counter and you have a timer
  which polls that counter periodically and if the toggle count is over a
  threshold then it sets a per cpu flag which prevents MSR enable writes. A
  speculation disable write must always succeed, but that's at maximum one
  for the observation period. The toggle counter is still updated so the
  timer can check whether the wave of toggles has subsided or not. If yes,
  it lifts the MSR write restriction, if not it stays.
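
  A rough model of that throttle, as a sketch only: TOGGLE_THRESHOLD,
  struct throttle_model, set_spec_disabled() and throttle_poll() are
  hypothetical names, the threshold value is arbitrary, and msr_writes
  just counts what would be real MSR writes. The sketch also assumes that
  after the restriction is lifted the MSR is left alone until the next
  enable request arrives, which is not specified above.

```c
#include <assert.h>
#include <stdbool.h>

#define TOGGLE_THRESHOLD 4 /* hypothetical value */

/* Hypothetical user-space model of one CPU; not kernel code. */
struct throttle_model {
	bool wanted_disabled;    /* last requested control state */
	bool msr_disabled;       /* stands in for the state held in the MSR */
	bool block_enable;       /* per-CPU flag: refuse enable writes */
	unsigned int toggles;    /* per-CPU toggle counter */
	unsigned int msr_writes; /* how many writes would have happened */
};

/* The one central place where the MSR would be written. */
static void set_spec_disabled(struct throttle_model *c, bool disable)
{
	if (disable != c->wanted_disabled) {
		c->wanted_disabled = disable;
		c->toggles++; /* counted even while enable writes are blocked */
	}
	if (disable == c->msr_disabled)
		return; /* MSR already in the requested state */
	if (!disable && c->block_enable)
		return; /* enable writes are suppressed during a toggle wave */
	c->msr_disabled = disable; /* a disable write always succeeds */
	c->msr_writes++;
}

/* Periodic timer: check whether the wave of toggles has subsided. */
static void throttle_poll(struct throttle_model *c)
{
	c->block_enable = c->toggles > TOGGLE_THRESHOLD;
	c->toggles = 0;
}
```

  While block_enable is set, a burst of toggles costs at most one disable
  write per observation period in this model, and the counter keeps
  running so the poll can lift the restriction once things calm down.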

That just works and has the right separation levels and covers everything
from high speed context switches to high speed ebpf invocations in every
nested or non-nested context. And it keeps everything which is in regular
task context tied to the task, and therefore preemption and migration are
nothing special. No notifiers, no timer migration, no preempt disable
assumptions, nothing.

It's really that simple if you do a proper analysis before trying to solve
it by duct taping things together which are fundamentally separate.

Thanks,

	tglx