From mboxrd@z Thu Jan 1 00:00:00 1970
Return-path:
Received: from p5492e61e.dip0.t-ipconnect.de ([84.146.230.30] helo=nanos.glx-home)
	by Galois.linutronix.de with esmtpsa (TLS1.2:DHE_RSA_AES_256_CBC_SHA256:256)
	(Exim 4.80)
	(envelope-from )
	id 1fEfcE-0001Gq-G6
	for speck@linutronix.de; Fri, 04 May 2018 20:39:34 +0200
Date: Fri, 4 May 2018 20:39:31 +0200 (CEST)
From: Thomas Gleixner
Subject: Re: [PATCH 3/5] SSB extra 1
In-Reply-To: <30de6834-6580-4d88-f5f8-23d8fa8a4bad@linux.intel.com>
Message-ID:
References: =?utf-8?q?=3Cd4ffdf?=
	=?utf-8?q?50f25bca207b3942fc4a390d2273487517=2E1525383411=2Egit=2Edave=2E?=
	=?utf-8?q?hansen=40intel=2Ecom=3E?=
	<1bf0c44d-c972-2c2e-5d90-4f51b8f2c4c9@linux.intel.com>
	<20180504160408.GG75137@tassilo.jf.intel.com>
	<20180504162813.GH75137@tassilo.jf.intel.com>
	<30de6834-6580-4d88-f5f8-23d8fa8a4bad@linux.intel.com>
MIME-Version: 1.0
Content-Type: text/plain; charset="utf-8"
Content-Transfer-Encoding: 7bit
To: speck@linutronix.de
List-ID:

On Fri, 4 May 2018, speck for Dave Hansen wrote:
> On 05/04/2018 09:32 AM, speck for Thomas Gleixner wrote:
> >> The flag doesn't know anything about the timer. You would
> >> need another flag that says "start a delay timer on the new CPU
> >> too".
> > Color me confused. I dont see a timer anywhere.
>
> We were trying not to ping-pong the MSRs too much if we are going in/out
> of the BFP code frequently.
>
> So, there's a schedule_delayed_work_on() in there that waits 10ms and
> turns the mitigations off, so we're only ping-ponging the MSRs every 10ms.
>
> So, even in the case that we're generating BPF instructions for the
> mitigation enable/disable, we still might want some mechanism to make
> sure that we're not touching the MSRs *too* frequently if we're going
> in/out of BFP frequently.
>
> That's a wee bit harder if we're tracking the mitigation on the task
> level than the CPU. I don't think it's impossible, but it's certainly
> more code than is there at the moment.

Groan, you're again convoluting stuff in a completely nonsensical way.

Let's do a proper problem analysis first:

1) CPU resource speculation

   Can be controlled:

   A) Globally (command line on/off or not vulnerable system)

   B) Context dependent

   Doing a time based prevention of toggling it too often is a completely
   orthogonal problem. Let's talk about that later.

2) Context

   The disable/enable is tied to execution contexts

     - task via prctl
     - task via seccomp
     - ebpf

   Now the fundamental context in Linux is a task. Soft and hard interrupts
   are nesting into the task context, or you can consider them context
   stealing.

   That has a fundamental consequence:

     Nested contexts can always disable the speculation if requested, but
     they can only enable speculation when the context in which they nest
     has it enabled as well.

   So any nested context has to look at the previous level to see whether
   it can reenable. So you need storage for that. The regular task storage
   is TIF_RDS, the softirq storage can be per cpu and the hardirq context
   does not need storage, assuming that NMI is not allowed to fiddle with
   that.

   But we can very simply use a per cpu refcount for soft and hard
   interrupt contexts. That refcount is incremented on disable and
   decremented on enable. prctl/seccomp has no influence at that point. But
   if on enable the count goes to zero then it has to check the task's
   TIF_RDS to decide whether it can be reenabled or not.

   Now let's look at EBPF. EBPF is also a nesting context.

   So, if EBPF runs in preemptible task context, then it sets a flag in the
   task 'ebf_speculation_disabled' and sets TIF_RDS, which means that on
   migration the normal switch_to() logic will take care of it. Obviously
   we need a per task storage for the prctl selected state. I already did
   this for the force disable thing. If EBPF reenables then it uses the per
   task prctl state.

   If EBPF runs in soft or hardirq context then it can use the per cpu
   refcount. The above rules apply.

So now let's talk about the toggle timer thing.

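Before getting to the timer, here is a rough sketch of the nesting rules above as a user space model in C. It is not kernel code and not the actual patch. All struct and function names (task_model, cpu_model, irq_spec_disable() and so on) are made up for illustration, the MSR is modelled as a plain bool, and the "only write on change" helper is an assumption of the sketch.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical per task state. tif_rds models TIF_RDS (set = speculation off). */
struct task_model {
	bool prctl_disabled;	/* state selected via prctl/seccomp */
	bool ebpf_disabled;	/* models the 'ebf_speculation_disabled' flag */
	bool tif_rds;
};

/* Hypothetical per CPU state. */
struct cpu_model {
	int irq_disable_count;	/* refcount shared by soft and hard irq context */
	bool spec_disabled;	/* stands in for the MSR */
	struct task_model *curr;
};

/* Assumption: one central write path which only acts on a state change. */
static void write_state(struct cpu_model *cpu, bool disable)
{
	if (cpu->spec_disabled != disable)
		cpu->spec_disabled = disable;
}

/* Nested (soft/hard irq) context: disabling is always allowed. */
static void irq_spec_disable(struct cpu_model *cpu)
{
	cpu->irq_disable_count++;
	write_state(cpu, true);
}

/*
 * Nested context: reenabling is only allowed once the count drops to zero
 * and the task in which we nest does not have TIF_RDS set.
 */
static void irq_spec_enable(struct cpu_model *cpu)
{
	assert(cpu->irq_disable_count > 0);
	if (--cpu->irq_disable_count == 0)
		write_state(cpu, cpu->curr->tif_rds);
}

/* EBPF in preemptible task context: flag the task and set TIF_RDS. */
static void task_ebpf_disable(struct cpu_model *cpu)
{
	struct task_model *t = cpu->curr;

	t->ebpf_disabled = true;
	t->tif_rds = true;
	write_state(cpu, true);
}

/* EBPF reenable in task context falls back to the per task prctl state. */
static void task_ebpf_enable(struct cpu_model *cpu)
{
	struct task_model *t = cpu->curr;

	t->ebpf_disabled = false;
	t->tif_rds = t->prctl_disabled;
	/* Task context, so no irq level disable is outstanding on this CPU. */
	write_state(cpu, t->tif_rds);
}
```

Because the task level state lives in the task itself, nothing in this model has to know about migration. An interrupt which nests into a task that has TIF_RDS set simply leaves speculation disabled when its count drops back to zero.
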
We have one central place where the MSR is written to and we already only
write it when the control state changes between two contexts.

So on every toggle, you increment a per cpu counter and you have a timer
which polls that counter periodically. If the toggle count is over a
threshold then it sets a per cpu flag which prevents MSR enable writes. A
speculation disable write must always succeed, but that's at maximum one
for the observation period.

The toggle counter is still updated so the timer can check whether the wave
of toggles has subsided or not. If yes, it lifts the MSR write restriction,
if not it stays.

That just works and has the right separation levels and covers everything
from high speed context switches to high speed ebpf invocations in every
nested or non nested context. And it keeps everything which is in regular
task context tied to the task and therefore preemption, migration are
nothing special. No notifiers, no timer migration, no preempt disable
assumptions, nothing.

It's really that simple if you do a proper analysis before trying to solve
it by duct taping things together which are fundamentally separate.

Thanks,

	tglx
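
For reference, the toggle limiter described above could look roughly like the following user space model in C. This is a sketch under stated assumptions, not the real implementation: the names (cpu_throttle, spec_ctrl_update(), spec_ctrl_timer_poll()) and the TOGGLE_THRESHOLD value are invented, the MSR is a bool plus a write counter, and bringing the MSR back in line with the requested state when the restriction is lifted is an assumption which the description above does not spell out.

```c
#include <stdbool.h>

#define TOGGLE_THRESHOLD 100	/* hypothetical toggles per observation period */

/* Hypothetical per CPU throttle state. */
struct cpu_throttle {
	bool requested_disabled;	/* what the current context asked for */
	bool msr_disabled;		/* what was last written to the modelled MSR */
	bool block_enable;		/* per CPU flag: suppress enable writes */
	unsigned long toggles;		/* per CPU toggle counter */
	unsigned long last_polled;	/* counter value at the previous poll */
	unsigned long msr_writes;	/* number of modelled MSR writes */
};

/* The one central place which writes the modelled MSR. */
static void spec_ctrl_update(struct cpu_throttle *c, bool want_disabled)
{
	if (c->requested_disabled == want_disabled)
		return;			/* control state unchanged, no write */

	c->requested_disabled = want_disabled;
	c->toggles++;			/* counted even while throttled */

	if (!want_disabled && c->block_enable)
		return;			/* enable write suppressed */

	/* A disable always goes through, but only if the MSR is not there yet. */
	if (c->msr_disabled != want_disabled) {
		c->msr_disabled = want_disabled;
		c->msr_writes++;
	}
}

/* Periodic timer: check whether the wave of toggles has subsided. */
static void spec_ctrl_timer_poll(struct cpu_throttle *c)
{
	unsigned long delta = c->toggles - c->last_polled;

	c->last_polled = c->toggles;

	if (delta > TOGGLE_THRESHOLD) {
		c->block_enable = true;
	} else if (c->block_enable) {
		c->block_enable = false;
		/* Assumption: catch up with the requested state on lift. */
		if (c->msr_disabled != c->requested_disabled) {
			c->msr_disabled = c->requested_disabled;
			c->msr_writes++;
		}
	}
}
```

While the flag is set, a burst of toggles costs at most one disable write, after which the modelled MSR stays in the disabled state until a poll sees the toggle rate drop below the threshold.
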