All of lore.kernel.org
* [MODERATED] L1D-Fault KVM mitigation
@ 2018-04-24  9:06 Joerg Roedel
  2018-04-24  9:35 ` [MODERATED] " Peter Zijlstra
  0 siblings, 1 reply; 91+ messages in thread
From: Joerg Roedel @ 2018-04-24  9:06 UTC (permalink / raw)
  To: speck

Hey,

I've been looking into the mitigation for the L1D fault issue in KVM,
and since the hardware seems to speculate with the GPA as an HPA, it
seems we have to disable SMT to be fully secure here because otherwise
two different guests running on HT siblings could spy on each other.

I'd like to discuss how we mitigate this. The big hammer would be to not
initialize the HT siblings at boot on affected machines, but that is
probably a bit too eager, as it also penalizes people not using KVM.

Another option is to just print a fat warning and/or refuse to load the
KVM modules on affected machines when HT is enabled.

So what are the opinions on how we should best mitigate this issue?


Regards,

	Joerg

^ permalink raw reply	[flat|nested] 91+ messages in thread

* [MODERATED] Re: L1D-Fault KVM mitigation
  2018-04-24  9:06 [MODERATED] L1D-Fault KVM mitigation Joerg Roedel
@ 2018-04-24  9:35 ` Peter Zijlstra
  2018-04-24  9:48   ` David Woodhouse
                     ` (2 more replies)
  0 siblings, 3 replies; 91+ messages in thread
From: Peter Zijlstra @ 2018-04-24  9:35 UTC (permalink / raw)
  To: speck

On Tue, Apr 24, 2018 at 11:06:30AM +0200, speck for Joerg Roedel wrote:
> Hey,
> 
> I've been looking into the mitigation for the L1D fault issue in KVM,
> and since the hardware seems to speculate with the GPA as an HPA, it
> seems we have to disable SMT to be fully secure here because otherwise
> two different guests running on HT siblings could spy on each other.
> 
> I'd like to discuss how we mitigate this. The big hammer would be to not
> initialize the HT siblings at boot on affected machines, but that is
> probably a bit too eager, as it also penalizes people not using KVM.
> 
> Another option is to just print a fat warning and/or refuse to load the
> KVM modules on affected machines when HT is enabled.
> 
> So what are the opinions on how we should best mitigate this issue?

Another option, that is being explored, is to co-schedule siblings.
So ensure all siblings either run vcpus of the _same_ VM or idle.

Of course, this is all rather intrusive and ugly and brings with it
setup costs as well, because you'd have to sync up on VMENTER, VMEXIT
and interrupts (on the idle CPUs).

Another complication is that on overcommitted systems the regular load
balancer will happily migrate vcpu tasks around. So it is fairly tricky
to ensure runnable vcpu threads of the same VM are in fact around to be
run on a core.

Not to mention that Linus has basically said: "No way, Jose".

I know that I worked a little with Tim on this, and I know Google did
their own thing (but have not seen patches from them -- is pjt on this
list?). I've also heard Amazon was also working on things (are they
here?). And I think RHT was also looking into something (mingo, bonzini
-- are you guys reading?)

In any case, if any of that is to go fly we need very solid numbers to
convince Linus to reconsider.

Another idea I had was to only allow trusted guest kernels, as in
trusted computing, key-verified images, etc. Of course, they too can be
compromised, but hopefully it avoids the most egregious hostile-guest
scenarios.


* [MODERATED] Re: L1D-Fault KVM mitigation
  2018-04-24  9:35 ` [MODERATED] " Peter Zijlstra
@ 2018-04-24  9:48   ` David Woodhouse
  2018-04-24 11:04     ` Peter Zijlstra
  2018-04-24 10:30   ` [MODERATED] Re: ***UNCHECKED*** " Joerg Roedel
  2018-04-24 12:53   ` Paolo Bonzini
  2 siblings, 1 reply; 91+ messages in thread
From: David Woodhouse @ 2018-04-24  9:48 UTC (permalink / raw)
  To: speck

On Tue, 2018-04-24 at 11:35 +0200, speck for Peter Zijlstra wrote:
> 
> Another option, that is being explored, is to co-schedule siblings.
> So ensure all siblings either run vcpus of the _same_ VM or idle.
> 
> Of course, this is all rather intrusive and ugly and brings with it
> setup costs as well, because you'd have to sync up on VMENTER, VMEXIT
> and interrupts (on the idle CPUs).

I hate to suggest more microcode hacks but... if there was an MSR bit
which, when set, would pause any HT sibling that was currently in VMX
non-root mode, then we could set that up to be automatically set on
vmexit and it would automatically pause the problematic siblings.
Meaning that co-ordinating vmexits with them might actually be
feasible?

The precise definition of 'pause' in the above could survive some
bikeshedding, but basically it shouldn't run any more guest
instructions, but it *should* be allowed to vmexit on interrupts, etc.


* [MODERATED] Re: ***UNCHECKED*** Re: L1D-Fault KVM mitigation
  2018-04-24  9:35 ` [MODERATED] " Peter Zijlstra
  2018-04-24  9:48   ` David Woodhouse
@ 2018-04-24 10:30   ` Joerg Roedel
  2018-04-24 11:09     ` Thomas Gleixner
  2018-04-24 12:53   ` Paolo Bonzini
  2 siblings, 1 reply; 91+ messages in thread
From: Joerg Roedel @ 2018-04-24 10:30 UTC (permalink / raw)
  To: speck

On Tue, Apr 24, 2018 at 11:35:37AM +0200, speck for Peter Zijlstra wrote:
> Another option, that is being explored, is to co-schedule siblings.
> So ensure all siblings either run vcpus of the _same_ VM or idle.
> 
> Of course, this is all rather intrusive and ugly and brings with it
> setup costs as well, because you'd have to sync up on VMENTER, VMEXIT
> and interrupts (on the idle CPUs).

Not to mention that it is going to be a maintenance nightmare for the
years to come.

And even if we end up with gang-scheduling in the end, which I don't see
coming yet, we need a simpler plan to have a mitigation when the embargo
lifts.


Regards,

	Joerg


* [MODERATED] Re: L1D-Fault KVM mitigation
  2018-04-24  9:48   ` David Woodhouse
@ 2018-04-24 11:04     ` Peter Zijlstra
  2018-04-24 11:16       ` David Woodhouse
  2018-05-23  9:45       ` David Woodhouse
  0 siblings, 2 replies; 91+ messages in thread
From: Peter Zijlstra @ 2018-04-24 11:04 UTC (permalink / raw)
  To: speck

On Tue, Apr 24, 2018 at 10:48:12AM +0100, speck for David Woodhouse wrote:
> On Tue, 2018-04-24 at 11:35 +0200, speck for Peter Zijlstra wrote:
> > 
> > Another option, that is being explored, is to co-schedule siblings.
> > So ensure all siblings either run vcpus of the _same_ VM or idle.
> > 
> > Of course, this is all rather intrusive and ugly and brings with it
> > setup costs as well, because you'd have to sync up on VMENTER, VMEXIT
> > and interrupts (on the idle CPUs).
> 
> I hate to suggest more microcode hacks but... if there was an MSR bit
> which, when set, would pause any HT sibling that was currently in VMX
> non-root mode, then we could set that up to be automatically set on
> vmexit and it would automatically pause the problematic siblings.
> Meaning that co-ordinating vmexits with them might actually be
> feasible?

Not sure I'm following. The above assumes a sibling is running a VCPU of
another VM, right? But it could equally well run any regular old task
(including idle).

So only pausing siblings in VMX mode wouldn't help anything. The !VMX
tasks could still be loading stuff into L1.


* Re: L1D-Fault KVM mitigation
  2018-04-24 10:30   ` [MODERATED] Re: ***UNCHECKED*** " Joerg Roedel
@ 2018-04-24 11:09     ` Thomas Gleixner
  2018-04-24 16:06       ` [MODERATED] " Andi Kleen
  0 siblings, 1 reply; 91+ messages in thread
From: Thomas Gleixner @ 2018-04-24 11:09 UTC (permalink / raw)
  To: speck

On Tue, 24 Apr 2018, speck for Joerg Roedel wrote:
> On Tue, Apr 24, 2018 at 11:35:37AM +0200, speck for Peter Zijlstra wrote:
> > Another option, that is being explored, is to co-schedule siblings.
> > So ensure all siblings either run vcpus of the _same_ VM or idle.
> > 
> > Of course, this is all rather intrusive and ugly and brings with it
> > setup costs as well, because you'd have to sync up on VMENTER, VMEXIT
> > and interrupts (on the idle CPUs).
> 
> Not to mention that it is going to be a maintenance nightmare for the
> years to come.
> 
> And even if we end up with gang-scheduling in the end, which I don't see
> coming yet, we need a simpler plan to have a mitigation when the embargo
> lifts.

There are other problems aside from those Peter mentioned.

There is only a particular class of workloads which can benefit from that:

 multi-vcpu guests where all VCPUs are doing CPU-bound computations in
 parallel

Everything else, which is vmexit heavy due to I/O or whatever reasons, is
going to suffer badly from the synchronization costs on VMENTER/EXIT.

Not to mention the single-VCPU guests which are pretty widespread in
hosting environments. There the crippled gang-scheduling of VCPU and idle
is not beneficial at all because it's basically the same as HT off, just
with the ugly synchronization overhead plus increased power consumption.

Aside from that, NMIs are an interesting problem as well. You certainly
don't want to do a synchronization dance from NMI, but ignoring NMIs if
you are paranoid does not work either, because they can make very
interesting data cache hot depending on what is being monitored.

I've yet to see any numbers which show the effect of this gang-scheduling
mess on a broad range of workloads.

Thanks,

	tglx


* [MODERATED] Re: L1D-Fault KVM mitigation
  2018-04-24 11:04     ` Peter Zijlstra
@ 2018-04-24 11:16       ` David Woodhouse
  2018-04-24 15:10         ` Jon Masters
  2018-05-23  9:45       ` David Woodhouse
  1 sibling, 1 reply; 91+ messages in thread
From: David Woodhouse @ 2018-04-24 11:16 UTC (permalink / raw)
  To: speck



On Tue, 2018-04-24 at 13:04 +0200, speck for Peter Zijlstra wrote:
> Not sure I'm following. The above assumes a sibling is running a VCPU of
> another VM, right? But it could equally well run any regular old task
> (including idle).
> 
> So only pausing siblings in VMX mode wouldn't help anything. The !VMX
> tasks could still be loading stuff into L1.

Er, yeah... I may have briefly forgotten that some people sometimes run
actual userspace, not just VM guests.

It's ring 3 *and* VMX non-root which would need to be paused on HT
siblings. And it would need to be triggered on any transition back into
the kernel from userspace too, not just vmexit. Which makes it a little
bit harder.


* [MODERATED] Re: L1D-Fault KVM mitigation
  2018-04-24  9:35 ` [MODERATED] " Peter Zijlstra
  2018-04-24  9:48   ` David Woodhouse
  2018-04-24 10:30   ` [MODERATED] Re: ***UNCHECKED*** " Joerg Roedel
@ 2018-04-24 12:53   ` Paolo Bonzini
  2018-05-03 16:20     ` Konrad Rzeszutek Wilk
  2 siblings, 1 reply; 91+ messages in thread
From: Paolo Bonzini @ 2018-04-24 12:53 UTC (permalink / raw)
  To: speck


On 24/04/2018 11:35, speck for Peter Zijlstra wrote:
> I know that I worked a little with Tim on this, and I know Google did
> their own thing (but have not seen patches from them -- is pjt on this
> list?). I've also heard Amazon was also working on things (are they
> here?). And I think RHT was also looking into something (mingo, bonzini
> -- are you guys reading?)

Yes, I am.  First of all: the cost of doing an L1D flush on every
vmentry is absolutely horrible on KVM microbenchmarks, but seems a
little better (around 6% worst case) on syscall microbenchmarks.
"Message-passing" workloads with vCPUs repeatedly going to sleep are the
worst.

More generally, hyperthreading doesn't exactly shine when running many
small virtual machines, since the VMs are unlikely to share any code or
data and each thread can only use half the normal amount of L1 cache.
Perhaps KSM shares the guest kernels and recovers some of the icache
(assuming that there are kernel-heavy benchmarks that _also_ benefit
from hyperthreading), but hyperthreading is more likely to degrade
performance than improve it.

Hyperthreading may provide slightly better jitter when you run two
different guests on the siblings.  But with gang scheduling you wouldn't
do that, so that's not an issue.  As a result, in the overcommitted case
the main issue is having to explain to the customers that disabling
hyperthreading is not that bad.

Even in the non-overcommitted case, there is a possibility that host
IRQs or NMIs happen, which as Thomas pointed out can also pollute the cache.

The only case where hyperthreading may be salvaged is the one where 1)
each guest CPU is pinned to its own physical CPU and memory is also
reserved because you use 1GB hugetlbfs, 2) host IRQs are either using
VT-d posted interrupts or are pinned away from the physical CPUs that
run guests, 3) you are using nohz_full and other similarly fine-tuned
configuration to ensure that the guest CPUs run smoothly.  This includes
NFV use cases, but big databases like SAP are also run like this sometimes.

In this case, you'd need to add synchronization around vmexit.  However,
these workloads _will_ actually do vmexits, sometimes a lot of them
(e.g. unless you use nohz_full in the guest as well, you'll have
vmexits to program the LAPIC timer).  Either all of them will have to
suffer the synchronization cost, or you have to arbitrarily decide
that some vmexits are "confined" and unlikely to pollute the cache; in
that case you skip the synchronization and the L1D flush.  For example
you could say "anything that does not do get_user_pages is confined".

Because you've made this arbitrary choice, the synchronization is total
security theater unless you know what you're doing: no two guests on the
same core, no interrupt handlers that can run during a vmexit and
pollute the L1 cache (if that happens, the other sibling would be able
to read that data), etc..

BUT: 1) I'm not saying hyperthreading is valuable in those cases, only
that it can be salvaged; 2) if you're paranoid you're more likely to
disable HT anyway.  So while I do plan to test what happens when we do
synchronization, it's far from certain that we're going to ship it.  And
even that would only be if it is acceptable upstream---I'm not going to
make it a special Red Hat-only patch.

Ingo suggested, for ease of testing and also for ease of deployment, a
knob to easily online/offline all siblings but the first on each core.
There's still the chance that some userspace daemon is started before
hyperthreading is software-disabled that way, and is confused by the
number of CPUs suddenly halving, so it would have to be both on the
kernel command line and in debugfs.

Paolo



* [MODERATED] Re: L1D-Fault KVM mitigation
  2018-04-24 11:16       ` David Woodhouse
@ 2018-04-24 15:10         ` Jon Masters
  0 siblings, 0 replies; 91+ messages in thread
From: Jon Masters @ 2018-04-24 15:10 UTC (permalink / raw)
  To: speck


On 04/24/2018 07:16 AM, speck for David Woodhouse wrote:
> On Tue, 2018-04-24 at 13:04 +0200, speck for Peter Zijlstra wrote:
>> Not sure I'm following. The above assumes a sibling is running a
>> VCPU of another VM, right? But it could equally well run any regular
>> old task (including idle).

>> So only pausing siblings in VMX mode wouldn't help anything. The
>> !VMX tasks could still be loading stuff into L1.

> Er, yeah... I may have briefly forgotten that some people sometimes
> run actual userspace, not just VM guests.

:)

More than once over the past few months, I've pointed out that Annapurna
was the right way to go. Personally, I think the only possible way to
make any of this safe is by moving "all" userspace off host like AMZ.

> It's ring 3 *and* VMX non-root which would need to be paused on HT
> siblings. And it would need to be triggered on any transition back
> into the kernel from userspace too, not just vmexit. Which makes it a
> little bit harder.

The additional ucode hack (pausing the sibling thread) has been
discussed at some length already. It's a good idea, except for the host
side as you note, and there we have all manner of interrupts to handle.
In the case of a traditional OS on the host, that's a lot of code
happily pulling secrets into the L1 and lots of paths to track if we
wanted to do like one of the others (dunno if I can say who, but someone
is tracking all secrets loaded into the L1 and scrubbing them after, but
they also refactored enough other stuff to make that just about work).

On our end, we've discussed automatic handling at boot along the lines
that have been pondered this morning here. The problem is that you don't
want to penalize people until they run KVM, and you don't want them to
hot-unplug CPUs or do crazy things that might affect tunings (I already
got shot down, rightly, for suggesting that one). So, instead, it's going
to have to be messaging, and maybe some kind of tainting as insecure.

I've requested that RHEL's installer be modified to effectively add a
checkbox that defaults to enabled but very prominently offers to disable
HT if you're using virtualization. We'll then need some guidance
coordinated across the industry that for total security (even in the
face of some of the gang scheduling proposals) you need to !HT.

Jon.

-- 
Computer Architect | Sent from my Fedora powered laptop



* [MODERATED] Re: L1D-Fault KVM mitigation
  2018-04-24 11:09     ` Thomas Gleixner
@ 2018-04-24 16:06       ` Andi Kleen
  0 siblings, 0 replies; 91+ messages in thread
From: Andi Kleen @ 2018-04-24 16:06 UTC (permalink / raw)
  To: speck

> Aside from that, NMIs are an interesting problem as well. You certainly
> don't want to do a synchronization dance from NMI, but ignoring NMIs if
> you are paranoid does not work either, because they can make very
> interesting data cache hot depending on what is being monitored.

Like what interesting data? 

AFAIK NMIs only access some uninteresting kernel data structures,
and potentially the current context, if you're talking about perf's stack
walking. The current context would of course already be leakable without
NMIs.

-Andi


* [MODERATED] Re: L1D-Fault KVM mitigation
  2018-04-24 12:53   ` Paolo Bonzini
@ 2018-05-03 16:20     ` Konrad Rzeszutek Wilk
  2018-05-07 17:11       ` Paolo Bonzini
  0 siblings, 1 reply; 91+ messages in thread
From: Konrad Rzeszutek Wilk @ 2018-05-03 16:20 UTC (permalink / raw)
  To: speck

On Tue, Apr 24, 2018 at 02:53:15PM +0200, speck for Paolo Bonzini wrote:
> On 24/04/2018 11:35, speck for Peter Zijlstra wrote:
> > I know that I worked a little with Tim on this, and I know Google did
> > their own thing (but have not seen patches from them -- is pjt on this
> > list?). I've also heard Amazon was also working on things (are they
> > here?). And I think RHT was also looking into something (mingo, bonzini
> > -- are you guys reading?)
> 
> Yes, I am.  First of all: the cost of doing an L1D flush on every

..snip..

> Ingo suggested, for ease of testing and also for ease of deployment, a
> knob to easily online/offline all siblings but the first on each core.
> There's still the chance that some userspace daemon is started before
> hyperthreading is software-disabled that way, and is confused by the
> number of CPUs suddenly halving, so it would have to be both on the
> kernel command line and in debugfs.

Are there any patches that you would be willing to share so folks
can review/test/etc? I was going to start doing this next but I suspect
you have already most of this ?


* [MODERATED] Re: L1D-Fault KVM mitigation
  2018-05-03 16:20     ` Konrad Rzeszutek Wilk
@ 2018-05-07 17:11       ` Paolo Bonzini
  2018-05-16  8:51         ` Jiri Kosina
  0 siblings, 1 reply; 91+ messages in thread
From: Paolo Bonzini @ 2018-05-07 17:11 UTC (permalink / raw)
  To: speck


On 03/05/2018 18:20, speck for Konrad Rzeszutek Wilk wrote:
> On Tue, Apr 24, 2018 at 02:53:15PM +0200, speck for Paolo Bonzini wrote:
>> On 24/04/2018 11:35, speck for Peter Zijlstra wrote:
>>> I know that I worked a little with Tim on this, and I know Google did
>>> their own thing (but have not seen patches from them -- is pjt on this
>>> list?). I've also heard Amazon was also working on things (are they
>>> here?). And I think RHT was also looking into something (mingo, bonzini
>>> -- are you guys reading?)
>>
>> Yes, I am.  First of all: the cost of doing an L1D flush on every
> 
> ..snip..
> 
>> Ingo suggested, for ease of testing and also for ease of deployment, a
>> knob to easily online/offline all siblings but the first on each core.
>> There's still the chance that some userspace daemon is started before
>> hyperthreading is software-disabled that way, and is confused by the
>> number of CPUs suddenly halving, so it would have to be both on the
>> kernel command line and in debugfs.
> 
> Are there any patches that you would be willing to share so folks
> can review/test/etc? I was going to start doing this next but I suspect
> you have already most of this ?
> 

Yes, I'll send them out tomorrow.

Paolo



* [MODERATED] Re: L1D-Fault KVM mitigation
  2018-05-07 17:11       ` Paolo Bonzini
@ 2018-05-16  8:51         ` Jiri Kosina
  2018-05-16  8:53           ` Paolo Bonzini
  0 siblings, 1 reply; 91+ messages in thread
From: Jiri Kosina @ 2018-05-16  8:51 UTC (permalink / raw)
  To: speck

On Mon, 7 May 2018, speck for Paolo Bonzini wrote:

> > Are there any patches that you would be willing to share so folks can 
> > review/test/etc? I was going to start doing this next but I suspect
> > you have already most of this ?
> 
> Yes, I'll send them out tomorrow.

Hi Paolo,

have you sent those somewhere and I missed that, or is it still pending?

Thanks,

-- 
Jiri Kosina
SUSE Labs


* [MODERATED] Re: L1D-Fault KVM mitigation
  2018-05-16  8:51         ` Jiri Kosina
@ 2018-05-16  8:53           ` Paolo Bonzini
  2018-05-21 10:06             ` David Woodhouse
  0 siblings, 1 reply; 91+ messages in thread
From: Paolo Bonzini @ 2018-05-16  8:53 UTC (permalink / raw)
  To: speck


On 16/05/2018 10:51, speck for Jiri Kosina wrote:
> On Mon, 7 May 2018, speck for Paolo Bonzini wrote:
> 
>>> Are there any patches that you would be willing to share so folks can 
>>> review/test/etc? I was going to start doing this next but I suspect
>>> you have already most of this ?
>>
>> Yes, I'll send them out tomorrow.
> 
> Hi Paolo,
> 
> have you sent those somewhere and I missed that, or is it still pending?

I'm figuring out right now the scripts to send patches here. :)

Paolo



* [MODERATED] Re: L1D-Fault KVM mitigation
  2018-05-16  8:53           ` Paolo Bonzini
@ 2018-05-21 10:06             ` David Woodhouse
  2018-05-21 13:40               ` Thomas Gleixner
  0 siblings, 1 reply; 91+ messages in thread
From: David Woodhouse @ 2018-05-21 10:06 UTC (permalink / raw)
  To: speck

On Wed, 2018-05-16 at 10:53 +0200, speck for Paolo Bonzini wrote:
> On 16/05/2018 10:51, speck for Jiri Kosina wrote:
> > 
> > On Mon, 7 May 2018, speck for Paolo Bonzini wrote:
> > 
> > > 
> > > > 
> > > > Are there any patches that you would be willing to share so
> > > > folks can 
> > > > review/test/etc? I was going to start doing this next but I
> > > > suspect
> > > > you have already most of this ?
> > > Yes, I'll send them out tomorrow.
> >
> > Hi Paolo,
> > 
> > have you sent those somewhere and I missed that, or is it still
> > pending?
>
> I'm figuring out right now the scripts to send patches here. :)

Did you work it out? Is this one also part of the release at 21:00 UTC
today, or do we have a bit longer?


* Re: L1D-Fault KVM mitigation
  2018-05-21 10:06             ` David Woodhouse
@ 2018-05-21 13:40               ` Thomas Gleixner
  0 siblings, 0 replies; 91+ messages in thread
From: Thomas Gleixner @ 2018-05-21 13:40 UTC (permalink / raw)
  To: speck


On Mon, 21 May 2018, speck for David Woodhouse wrote:

> On Wed, 2018-05-16 at 10:53 +0200, speck for Paolo Bonzini wrote:
> > On 16/05/2018 10:51, speck for Jiri Kosina wrote:
> > > 
> > > On Mon, 7 May 2018, speck for Paolo Bonzini wrote:
> > > 
> > > > 
> > > > > 
> > > > > Are there any patches that you would be willing to share so
> > > > > folks can 
> > > > > review/test/etc? I was going to start doing this next but I
> > > > > suspect
> > > > > you have already most of this ?
> > > > Yes, I'll send them out tomorrow.
> > >
> > > Hi Paolo,
> > > 
> > > have you sent those somewhere and I missed that, or is it still
> > > pending?
> >
> > I'm figuring out right now the scripts to send patches here. :)
> 
> Did you work it out? Is this one also part of the release at 21:00 UTC
> today, or do we have a bit longer?

That's not part of today. Today is GPZ4 only.

Thanks,

	tglx


* [MODERATED] Re: L1D-Fault KVM mitigation
  2018-04-24 11:04     ` Peter Zijlstra
  2018-04-24 11:16       ` David Woodhouse
@ 2018-05-23  9:45       ` David Woodhouse
  2018-05-24  9:45         ` Peter Zijlstra
  1 sibling, 1 reply; 91+ messages in thread
From: David Woodhouse @ 2018-05-23  9:45 UTC (permalink / raw)
  To: speck



On Tue, 2018-04-24 at 13:04 +0200, speck for Peter Zijlstra wrote:
> On Tue, Apr 24, 2018 at 10:48:12AM +0100, speck for David Woodhouse wrote:
> > 
> > On Tue, 2018-04-24 at 11:35 +0200, speck for Peter Zijlstra wrote:
> > > 
> > > 
> > > Another option, that is being explored, is to co-schedule siblings.
> > > So ensure all siblings either run vcpus of the _same_ VM or idle.
> > > 
> > > Of course, this is all rather intrusive and ugly and brings with it
> > > setup costs as well, because you'd have to sync up on VMENTER, VMEXIT
> > > and interrupts (on the idle CPUs).
>
> > I hate to suggest more microcode hacks but... if there was an MSR bit
> > which, when set, would pause any HT sibling that was currently in VMX
> > non-root mode, then we could set that up to be automatically set on
> > vmexit and it would automatically pause the problematic siblings.
> > Meaning that co-ordinating vmexits with them might actually be
> > feasible?

> Not sure I'm following. The above assumes a sibling is running a VCPU of
> another VM, right? But it could equally well run any regular old task
> (including idle).
> 
> So only pausing siblings in VMX mode wouldn't help anything. The !VMX
> tasks could still be loading stuff into L1.

That's OK because it's only the VMX tasks which can abuse it, isn't it?

Let's assume we've fixed the problem for normal tasks, by flipping the
top bit in absent PTEs that actually contain swap pointers, etc.

The only thing we have left is VM guests. The microcode bit would say
that *if* a CPU thread is in non-root mode then *it* gets paused unless
its sibling is also in non-root mode for the same VMID.

So when both siblings are actually in the VM, they get to run. If one
sibling comes *out* of the VM to the host kernel or to run (host)
userspace, then the other one doesn't execute any guest instructions.
It can take exceptions which cause a vmexit though.

We'd also want a vCPU to be able to run if its sibling is actually in
the host but *idle* (and has flushed the L1. Perhaps we actually
automatically flush the L1 when resuming a sibling that got paused).

It does still depend on gang scheduling (or at least forced sibling
idle which is a subset of that), or a singleton vCPU might *never* get
run. But we were going to have to do something along those lines
anyway. The microcode trick just makes it a lot easier because we don't
have to *explicitly* pause the sibling vCPUs and manage their state on
every vmexit/entry. And avoids potential race conditions with managing
that in software.


* [MODERATED] Re: L1D-Fault KVM mitigation
  2018-05-23  9:45       ` David Woodhouse
@ 2018-05-24  9:45         ` Peter Zijlstra
  2018-05-24 14:14           ` Jon Masters
                             ` (2 more replies)
  0 siblings, 3 replies; 91+ messages in thread
From: Peter Zijlstra @ 2018-05-24  9:45 UTC (permalink / raw)
  To: speck

On Wed, May 23, 2018 at 10:45:45AM +0100, speck for David Woodhouse wrote:

> That's OK because it's only the VMX tasks which can abuse it, isn't it?

If, like you outline below, this is an (optional) ucode assist to
co-scheduling matching VCPU threads, then yes.

> Let's assume we've fixed the problem for normal tasks, by flipping the
> top bit in absent PTEs that actually contain swap pointers, etc.
> 
> The only thing we have left is VM guests. The microcode bit would say
> that *if* a CPU thread is in non-root mode then *it* gets paused unless
> its sibling is also in non-root mode for the same VMID.
> 
> So when both siblings are actually in the VM, they get to run. If one
> sibling comes *out* of the VM to the host kernel or to run (host)
> userspace, then the other one doesn't execute any guest instructions.
> It can take exceptions which cause a vmexit though.

Would it make sense to time-limit the 'stuck' state, much like PLE?

> We'd also want a vCPU to be able to run if its sibling is actually in
> the host but *idle* (and has flushed the L1. Perhaps we actually
> automatically flush the L1 when resuming a sibling that got paused).

Right, idle is a wildcard which matches with any VCPU. We don't care
about the cache state of the sibling though. L1 is shared and since
VMENTER must flush L1, that is sufficient.

> It does still depend on gang scheduling (or at least forced sibling
> idle which is a subset of that), or a singleton vCPU might *never* get
> run. But we were going to have to do something along those lines
> anyway.

Linus has opinions on that.. but yes, without that all that remains is
disabling HT afaict.

> The microcode trick just makes it a lot easier because we don't
> have to *explicitly* pause the sibling vCPUs and manage their state on
> every vmexit/entry. And avoids potential race conditions with managing
> that in software.

Yes, it would certainly help and avoid a fair bit of ugly. It would, for
instance, avoid having to modify irq_enter() / irq_exit(), which would
otherwise be required (and possibly leak all data touched up until that
point is reached).

But even with all that, adding L1-flush to every VMENTER will hurt lots.
Consider for example the PIO emulation used when booting a guest from a
disk image. That causes VMEXIT/VMENTER at stupendous rates.

Also, none of this readily addresses the problem of load-balancing
shredding the VCPU localities required for this.


* [MODERATED] Re: L1D-Fault KVM mitigation
  2018-05-24  9:45         ` Peter Zijlstra
@ 2018-05-24 14:14           ` Jon Masters
  2018-05-24 15:04           ` Thomas Gleixner
  2018-05-24 15:38           ` Linus Torvalds
  2 siblings, 0 replies; 91+ messages in thread
From: Jon Masters @ 2018-05-24 14:14 UTC (permalink / raw)
  To: speck


On 05/24/2018 05:45 AM, speck for Peter Zijlstra wrote:
> On Wed, May 23, 2018 at 10:45:45AM +0100, speck for David Woodhouse wrote:

>> The microcode trick just makes it a lot easier because we don't
>> have to *explicitly* pause the sibling vCPUs and manage their state on
>> every vmexit/entry. And avoids potential race conditions with managing
>> that in software.
> 
> Yes, it would certainly help and avoid a fair bit of ugly.

We specifically requested this microcode change a few months ago and
were told that it isn't possible to implement across the fleet. If that
changes/changed, then great, wonderful, we're all for it.

Jon.

-- 
Computer Architect | Sent from my Fedora powered laptop


^ permalink raw reply	[flat|nested] 91+ messages in thread

* Re: L1D-Fault KVM mitigation
  2018-05-24  9:45         ` Peter Zijlstra
  2018-05-24 14:14           ` Jon Masters
@ 2018-05-24 15:04           ` Thomas Gleixner
  2018-05-24 15:33             ` Thomas Gleixner
  2018-05-24 15:44             ` [MODERATED] Re: L1D-Fault KVM mitigation Andi Kleen
  2018-05-24 15:38           ` Linus Torvalds
  2 siblings, 2 replies; 91+ messages in thread
From: Thomas Gleixner @ 2018-05-24 15:04 UTC (permalink / raw)
  To: speck

On Thu, 24 May 2018, speck for Peter Zijlstra wrote:
> On Wed, May 23, 2018 at 10:45:45AM +0100, speck for David Woodhouse wrote:
> > The microcode trick just makes it a lot easier because we don't
> > have to *explicitly* pause the sibling vCPUs and manage their state on
> > every vmexit/entry. And avoids potential race conditions with managing
> > that in software.
> 
> Yes, it would certainly help and avoid a fair bit of ugly. It would, for
> instance, avoid having to modify irq_enter() / irq_exit(), which would
> otherwise be required (and possibly leak all data touched up until that
> point is reached).
> 
> But even with all that, adding L1-flush to every VMENTER will hurt lots.
> Consider for example the PIO emulation used when booting a guest from a
> disk image. That causes VMEXIT/VMENTER at stupendous rates.

Just did a test on SKL Client where I have ucode. It does not have HT so
it's not suffering from any HT side effects when L1D is flushed.

Boot time from a disk image is ~1s measured from the first vcpu enter.

With L1D Flush on vmenter the boot time is about 5-10% slower. And that has
lots of PIO operations in the early boot.

For a kernel build the L1D Flush has an overhead of < 1%.

Netperf guest to host has a slight drop of the throughput in the 2%
range. Host to guest surprisingly goes up by ~3%. Fun stuff!

Now I isolated two host CPUs and pinned the two vCPUs on them to be able to
measure the overhead. Running cyclictest with a period of 25us in the guest
on an isolated guest CPU and monitoring the behaviour with perf on the host
for the corresponding host CPU gives

No Flush	      	       Flush

1.31 insn per cycle	       1.14 insn per cycle

2e6 L1-dcache-load-misses/sec  26e6 L1-dcache-load-misses/sec

In that simple test the L1D misses go up by a factor of 13.

Now with the whole gang scheduling the numbers I heard through the
grapevine are in the range of factor 130, i.e. 13k% for a simple boot from
disk image. 13 minutes instead of 6 seconds...

That's not surprising at all, though the magnitude is way higher than I
expected. I don't see a realistic chance for vmexit heavy workloads to work
with that synchronization thing at all, whether it's ucode assisted or not.
 
The only workload types which will ever benefit from that co-scheduling
stuff are CPU bound workloads which more or less never vmexit. But are
those workloads really workloads which benefit from HT? Compute workloads
tend to use floating point or vector instructions which are not really HT
friendly.

Can the virt folks who know what runs on their cloudy offerings please shed
some light on this? Has anyone made a proper analysis of cloud workloads
and their behaviour on HT and their vmexit rates?

Thanks,

	tglx

^ permalink raw reply	[flat|nested] 91+ messages in thread

* Re: L1D-Fault KVM mitigation
  2018-05-24 15:04           ` Thomas Gleixner
@ 2018-05-24 15:33             ` Thomas Gleixner
  2018-05-24 15:38               ` [MODERATED] " Jiri Kosina
  2018-05-24 23:18               ` [MODERATED] Encrypted Message Tim Chen
  2018-05-24 15:44             ` [MODERATED] Re: L1D-Fault KVM mitigation Andi Kleen
  1 sibling, 2 replies; 91+ messages in thread
From: Thomas Gleixner @ 2018-05-24 15:33 UTC (permalink / raw)
  To: speck

On Thu, 24 May 2018, speck for Thomas Gleixner wrote:
> On Thu, 24 May 2018, speck for Peter Zijlstra wrote:
> > On Wed, May 23, 2018 at 10:45:45AM +0100, speck for David Woodhouse wrote:
> > > The microcode trick just makes it a lot easier because we don't
> > > have to *explicitly* pause the sibling vCPUs and manage their state on
> > > every vmexit/entry. And avoids potential race conditions with managing
> > > that in software.
> > 
> > Yes, it would certainly help and avoid a fair bit of ugly. It would, for
> > instance, avoid having to modify irq_enter() / irq_exit(), which would
> > otherwise be required (and possibly leak all data touched up until that
> > point is reached).
> > 
> > But even with all that, adding L1-flush to every VMENTER will hurt lots.
> > Consider for example the PIO emulation used when booting a guest from a
> > disk image. That causes VMEXIT/VMENTER at stupendous rates.
> 
> Just did a test on SKL Client where I have ucode. It does not have HT so
> it's not suffering from any HT side effects when L1D is flushed.
> 
> Boot time from a disk image is ~1s measured from the first vcpu enter.
> 
> With L1D Flush on vmenter the boot time is about 5-10% slower. And that has
> lots of PIO operations in the early boot.
> 
> For a kernel build the L1D Flush has an overhead of < 1%.
> 
> Netperf guest to host has a slight drop of the throughput in the 2%
> range. Host to guest surprisingly goes up by ~3%. Fun stuff!
> 
> Now I isolated two host CPUs and pinned the two vCPUs on them to be able to
> measure the overhead. Running cyclictest with a period of 25us in the guest
> on an isolated guest CPU and monitoring the behaviour with perf on the host
> for the corresponding host CPU gives
> 
> No Flush	      	       Flush
> 
> 1.31 insn per cycle	       1.14 insn per cycle
> 
> 2e6 L1-dcache-load-misses/sec  26e6 L1-dcache-load-misses/sec
> 
> In that simple test the L1D misses go up by a factor of 13.
> 
> Now with the whole gang scheduling the numbers I heard through the
> grapevine are in the range of factor 130, i.e. 13k% for a simple boot from
> disk image. 13 minutes instead of 6 seconds...
> 
> That's not surprising at all, though the magnitude is way higher than I
> expected. I don't see a realistic chance for vmexit heavy workloads to work
> with that synchronization thing at all, whether it's ucode assisted or not.

That said, I think we should stage the host side mitigations plus the L1
flush on vmenter ASAP so we are not standing there with our pants down when
the cat comes out of the bag early. That means HT off, but it's still
better than having absolutely nothing.

The gang scheduling nonsense can be added on top if it should
surprisingly turn out to be usable at all.

Thanks,

	tglx

^ permalink raw reply	[flat|nested] 91+ messages in thread

* [MODERATED] Re: L1D-Fault KVM mitigation
  2018-05-24  9:45         ` Peter Zijlstra
  2018-05-24 14:14           ` Jon Masters
  2018-05-24 15:04           ` Thomas Gleixner
@ 2018-05-24 15:38           ` Linus Torvalds
  2018-05-24 15:59             ` David Woodhouse
  2 siblings, 1 reply; 91+ messages in thread
From: Linus Torvalds @ 2018-05-24 15:38 UTC (permalink / raw)
  To: speck



On Thu, 24 May 2018, speck for Peter Zijlstra wrote:
> 
> > It does still depend on gang scheduling (or at least forced sibling
> > idle which is a subset of that), or a singleton vCPU might *never* get
> > run. But we were going to have to do something along those lines
> > anyway.
> 
> Linus has opinions on that.. 

Let's call them "beliefs".

I don't believe for a moment that anybody can come up with anything 
remotely reasonable for gang scheduling.

I'm willing to entertain the possibility that some really smart person can 
solve the problem cleanly and without messing up anything else.

I don't think it's remotely _likely_ to happen, but I'm willing to 
consider it within the realm of possibilities.

So right now, I consider the gang scheduling a pipe dream by people who 
underestimate how hard and ugly it would be, and often have political 
reasons why they are pushing the idea (ie they want to claim it's not a 
hardware deficiency, but just a small software problem).

And I do have opinions on those kinds of _people_.

           Linus

^ permalink raw reply	[flat|nested] 91+ messages in thread

* [MODERATED] Re: L1D-Fault KVM mitigation
  2018-05-24 15:33             ` Thomas Gleixner
@ 2018-05-24 15:38               ` Jiri Kosina
  2018-05-24 17:22                 ` Dave Hansen
  2018-05-24 23:18               ` [MODERATED] Encrypted Message Tim Chen
  1 sibling, 1 reply; 91+ messages in thread
From: Jiri Kosina @ 2018-05-24 15:38 UTC (permalink / raw)
  To: speck

On Thu, 24 May 2018, speck for Thomas Gleixner wrote:

> That said, I think we should stage the host side mitigations plus the L1
> flush on vmenter ASAP so we are not standing there with our pants down when
> the cat comes out of the bag early. 

Agreed.

> That means HT off, but it's still better than having absolutely nothing.

Will we actually be enforcing switching SMT off (before anything better
exists) by either offlining all the siblings or forcing them to idle at
the moment the first virtual machine gets started, from the kernel directly?

It seems like this policy would better be enforced by userspace
(libvirt?), but the kernel should probably at least warn on affected CPUs
if it detects this is being violated.

Thanks,

-- 
Jiri Kosina
SUSE Labs

^ permalink raw reply	[flat|nested] 91+ messages in thread

* [MODERATED] Re: L1D-Fault KVM mitigation
  2018-05-24 15:04           ` Thomas Gleixner
  2018-05-24 15:33             ` Thomas Gleixner
@ 2018-05-24 15:44             ` Andi Kleen
  1 sibling, 0 replies; 91+ messages in thread
From: Andi Kleen @ 2018-05-24 15:44 UTC (permalink / raw)
  To: speck

> Now with the whole gang scheduling the numbers I heard through the
> grapevine are in the range of factor 130, i.e. 13k% for a simple boot from
> disk image. 13 minutes instead of 6 seconds...

That's unoptimized, and also for the extreme PIO case (which does one
exit for every dword). With some optimizations we hope to do better.

Also the PIO case is really not that interesting, although it would
be nice to get rid of the slowdown so that users don't
have to fix their boot loaders.

But yes, PIO will suffer a bit; that's unavoidable. As long
as the slowdown is not too bad it should be acceptable. If it's
a problem they can always use some other mechanism to load
the kernel.

> 
> That's not surprising at all, though the magnitude is way higher than I
> expected. I don't see a realistic chance for vmexit heavy workloads to work
> with that synchronization thing at all, whether it's ucode assisted or not.

Nothing should be anywhere near as VMEXIT intensive as PIO.

The worst realistic one is likely IO intensive with lots of 
small transactions. But even there you are nowhere near
"one exit for every 2 bytes in a long loop" like PIO.

>  
> The only workload types which will ever benefit from that co-scheduling
> stuff are CPU bound workloads which more or less never vmexit. But are
> those workloads really workloads which benefit from HT?

That's a very extreme conclusion which I don't think is backed at all
by your data.

> Compute workloads
> tend to use floating point or vector instructions which are not really HT
> friendly.

There are plenty of compute workloads that benefit from HT: usually
everything that is not completely memory bandwidth dominated per single
thread.

> 
> Can the virt folks who know what runs on their clowdy offerings please shed
> some light on this? Has anyone made a proper analysis of clowd workloads
> and their behaviour on HT and their vmexit rates?

My understanding is that most cloud providers only sell cores, so they
are already covered as far as other guests are concerned.

Usually they use some other affinity mechanism to reach a similar
effect.

But we still have to fix the general case, e.g. just for "someone
runs a KVM guest on a random system with HT on".

-Andi

^ permalink raw reply	[flat|nested] 91+ messages in thread

* [MODERATED] Re: L1D-Fault KVM mitigation
  2018-05-24 15:38           ` Linus Torvalds
@ 2018-05-24 15:59             ` David Woodhouse
  2018-05-24 16:35               ` Linus Torvalds
  0 siblings, 1 reply; 91+ messages in thread
From: David Woodhouse @ 2018-05-24 15:59 UTC (permalink / raw)
  To: speck



On Thu, 2018-05-24 at 08:38 -0700, speck for Linus Torvalds wrote:
> 
> On Thu, 24 May 2018, speck for Peter Zijlstra wrote:
> > 
> > 
> > > 
> > > It does still depend on gang scheduling (or at least forced sibling
> > > idle which is a subset of that), or a singleton vCPU might *never* get
> > > run. But we were going to have to do something along those lines
> > > anyway.
> > 
> > Linus has opinions on that.. 
>
> Let's call them "beliefs".
> 
> I don't believe for a moment that anybody can come up with anything 
> remotely reasonable for gang scheduling.
> 
> I'm willing to entertain the possibility that some really smart person can 
> solve the problem cleanly and without messing up anything else.
> 
> I don't think it's remotely _likely_ to happen, but I'm willing to 
> consider it within the realm of possibilities.
> 
> So right now, I consider the gang scheduling a pipe dream by people who 
> underestimate how hard and ugly it would be, and often have political 
> reasons why they are pushing the idea (ie they want to claim it's not a 
> hardware deficiency, but just a small software problem).

Gang scheduling for the general case is probably a pipe dream.

But there are use cases where we were actually pinning sibling vCPUs to
the sibling pCPUs anyway because HT was *already* a nightmare for
information leakage paranoia.

In those cases, "pause the vCPU when the sibling pCPU vmexits to
anything other than idle" is the necessary and mostly sufficient
missing piece. Because we effectively already *have* gang scheduling to
the extent that it matters.

^ permalink raw reply	[flat|nested] 91+ messages in thread

* [MODERATED] Re: L1D-Fault KVM mitigation
  2018-05-24 15:59             ` David Woodhouse
@ 2018-05-24 16:35               ` Linus Torvalds
  2018-05-24 16:51                 ` David Woodhouse
  0 siblings, 1 reply; 91+ messages in thread
From: Linus Torvalds @ 2018-05-24 16:35 UTC (permalink / raw)
  To: speck



On Thu, 24 May 2018, speck for David Woodhouse wrote:
> 
> In those cases, "pause the vCPU when the sibling pCPU vmexits to
> anything other than idle" is the necessary and mostly sufficient
> missing piece. Because we effectively already *have* gang scheduling to
> the extent that it matters.

I'll believe it when I see the patch and actual numbers on real loads. 

As far as I can tell, if you want to avoid the data leak, you need to

 - on vmenter, the sibling *must* be in idle, or in "waiting to vmenter 
   the same VM". If the sibling is doing anything else, you don't know 
   what the f*ck it's doing, so you need to do an IPI or a sleep to 
   synchronize with it.

   And regardless, you're going to have to do (probably several) atomic 
   locked accesses just to make sure this is race-free. And you have to 
   clear your L1 caches before that "waiting to vmenter" _and_ before the 
   "going to idle" state, becasue otherwise there's a big race the other 
   way waiting to happen (which is trivial for the guest to hit, since the 
   guest trivially controls vmenter/vmexit by choice of instructions).

 - on exit from idle, you need to IPI the other CPU if it's in VM mode, 
   since you're now starting to touch L1 cache lines. I guess you can do a 
   synchronous wait, depending on why you're exiting idle, but probably 
   not without a big hit on latencies. You're basically primarily in an 
   interrupt handler, for chrissake!

 - on vmexit, you know that the sibling CPU can't have been doing anything 
   else (unless the exit was because the sibling IPI'd you to _get_ you 
   out of vmx). But if it was in vmx mode, since you're now going to 
   pollute the L1 caches, you need to either wait for it synchronously to 
   exit its VM, or you need to IPI it to force it to exit. Presumably, you 
   want to wait for it most of the time, but if you exited due to an 
   interrupt or something, that may not be an option, so maybe the IPI is 
   the common case, the same as exit from idle above?

Am I missing something fundamental? Because those three are just the basic 
requirements that have nothing to do with *scheduling*. They are 
fundamental to the whole "I must not touch any interesting data while 
another sibling is in VM mode".

Honestly, I don't care one whit what the actual VM entry/exit code does, 
since it doesn't hurt anything else, and the sane workaround remains to 
just disable HT. But I care deeply about the enter/exit from idle, 
especially if it now involves locked operations and L1 cache flushing, 
just in case.

But note how none of the above has anything to do with the *scheduler*. 
The gang scheduling is just to make the above fuck-up perform much better.

And the above all happens when we expect real things to be going on in the 
system, ie the host actually has timers and networking going on, and the 
actual _paying_ primary clients on something like Amazon cloud don't tend 
to be so much about CPU in the first place (those will be using the lowest 
cost bulk tiers), so they likely have a fair amount of real code going 
on too.

So absolutely none of this is *rare*. Interrupts, vmenter, and vmexit 
happen all the time, and whoever came up with the gang scheduling scheme 
was presumably looking a lot at the best-case numbers.

Now you want to add gang scheduling on *top* of the above disaster? 

Yeah, good luck. 

Maybe I've explained my skepticism sufficiently above. I don't have any 
scheduling frequency numbers and average times in VMX, and presumably you 
do, and maybe I'm being just unreasonably pessimistic. But it all really 
sounds like a complete pipe dream to me.  My guess is that the cost of all 
of the above is going to be *much* bigger than the advantage of HT.

Last I saw, HT on Intel processors was on the order of 30% win. Are you 
seriously claiming that the above costs are going to be so small that you 
make up for it? Because that's a hard limit. If it's more expensive than 
30%, it makes no sense to do. 

That's different from all the other mitigation we've done. It has to be 
cheap, and it has to be cheap for the *bad* case - the case where you 
actually have lots of VMX activity.

The only case I see making up for it is when vm entry/exit are so rare 
(because you don't actually do much virtualization at all) that you're 
willing to make them much much slower just to keep HT for the 95% of the 
time when you're not doing any virtualization at all.

But that case doesn't say "cloud provider" to me. It says "desktop machine 
that occasionally runs virtualbox". And that case doesn't care about the 
security issue in the first place, since it's not running other people's 
untrusted shit to begin with.

So I really find it unlikely that anybody comes up with something 
worthwhile. 

              Linus

^ permalink raw reply	[flat|nested] 91+ messages in thread

* [MODERATED] Re: L1D-Fault KVM mitigation
  2018-05-24 16:35               ` Linus Torvalds
@ 2018-05-24 16:51                 ` David Woodhouse
  2018-05-24 16:57                   ` Linus Torvalds
  0 siblings, 1 reply; 91+ messages in thread
From: David Woodhouse @ 2018-05-24 16:51 UTC (permalink / raw)
  To: speck

On Thu, 2018-05-24 at 09:35 -0700, speck for Linus Torvalds wrote:
> 
> The only case I see making up for it is when vm entry/exit are so rare 
> (because you don't actually do much virtualization at all) that you're 
> willing to make them much much slower just to keep HT for the 95% of the 
> time when you're not doing any virtualization at all.

With decently designed hardware and SR-IOV passthrough for all I/O, and
posted interrupts... 

^ permalink raw reply	[flat|nested] 91+ messages in thread

* [MODERATED] Re: L1D-Fault KVM mitigation
  2018-05-24 16:51                 ` David Woodhouse
@ 2018-05-24 16:57                   ` Linus Torvalds
  2018-05-25 11:29                     ` David Woodhouse
  0 siblings, 1 reply; 91+ messages in thread
From: Linus Torvalds @ 2018-05-24 16:57 UTC (permalink / raw)
  To: speck



On Thu, 24 May 2018, speck for David Woodhouse wrote:
> 
> With decently designed hardware and SR-IOV passthrough for all I/O, and
> posted interrupts... 

If we had decently designed hardware, we wouldn't be having this 
discussion in the first place.

Anyway, I'll wait for the numbers (and the patch - even good numbers with 
a hacky patch don't necessarily mean "worth upstreaming").

                 Linus

^ permalink raw reply	[flat|nested] 91+ messages in thread

* [MODERATED] Re: L1D-Fault KVM mitigation
  2018-05-24 15:38               ` [MODERATED] " Jiri Kosina
@ 2018-05-24 17:22                 ` Dave Hansen
  2018-05-24 17:30                   ` Linus Torvalds
  0 siblings, 1 reply; 91+ messages in thread
From: Dave Hansen @ 2018-05-24 17:22 UTC (permalink / raw)
  To: speck

[-- Attachment #1: Type: text/plain, Size: 847 bytes --]

On 05/24/2018 08:38 AM, speck for Jiri Kosina wrote:
>> That means HT off, but it's still better than having absolutely nothing.
> Will we actually be enforcing switching SMT off (before anything better 
> exists) by either offlining all the siblings or forcing them to idle at 
> the moment first virtual machine gets started, from the kernel directly?
> 
> This seems like this policy would better be enforced by userspace 
> (libvirt?), but kernel should probably at least warn on affected CPUs if 
> it detects this is being violated.

The most straightforward thing is to trigger the same behavior as
"noht" as part of our arch/x86/kernel/cpu/bugs.c code whenever KVM is
compile-time enabled.

I think we have to do that by default, but allow folks to override it if
they want, like if they know KVM will never get used.


^ permalink raw reply	[flat|nested] 91+ messages in thread

* [MODERATED] Re: L1D-Fault KVM mitigation
  2018-05-24 17:22                 ` Dave Hansen
@ 2018-05-24 17:30                   ` Linus Torvalds
  0 siblings, 0 replies; 91+ messages in thread
From: Linus Torvalds @ 2018-05-24 17:30 UTC (permalink / raw)
  To: speck



On Thu, 24 May 2018, speck for Dave Hansen wrote:
> 
> The most straightforward thing is to do trigger the same behavior as
> "noht" as part of our arch/x86/kernel/cpu/bugs.c code whenever KVM is
> compile-time enabled.

Absolutely not.

The whole HT issue is only for people who run untrusted loads. 

Lots of people use kvm for testing, but also for just running qemu etc. 
They have absolutely zero reason to disable HT in general.

The only people who worry are Amazon etc.

We should document the hell out of it, and happily the tech press will be 
very happy to inform everybody about the issue. But most people won't care 
one whit, and will run only "good" loads (ie operating systems that have 
been fixed) in their vm's and not random garbage.

Don't try to make this HT workaround be anything but what it is: an issue 
for the Amazons and googles of the world.

               Linus

^ permalink raw reply	[flat|nested] 91+ messages in thread

* [MODERATED] Encrypted Message
  2018-05-24 15:33             ` Thomas Gleixner
  2018-05-24 15:38               ` [MODERATED] " Jiri Kosina
@ 2018-05-24 23:18               ` Tim Chen
  2018-05-24 23:28                 ` [MODERATED] Re: L1D-Fault KVM mitigation Linus Torvalds
                                   ` (2 more replies)
  1 sibling, 3 replies; 91+ messages in thread
From: Tim Chen @ 2018-05-24 23:18 UTC (permalink / raw)
  To: speck

[-- Warning: decoded text below may be mangled, UTF-8 assumed --]
[-- Attachment #1: Type: text/rfc822-headers; protected-headers="v1", Size: 134 bytes --]

From: Tim Chen <tim.c.chen@linux.intel.com>
To: speck for Thomas Gleixner <speck@linutronix.de>
Subject: Re: L1D-Fault KVM mitigation

[-- Attachment #2: Type: text/plain, Size: 3619 bytes --]

On 05/24/2018 08:33 AM, speck for Thomas Gleixner wrote:
> On Thu, 24 May 2018, speck for Thomas Gleixner wrote:
>> On Thu, 24 May 2018, speck for Peter Zijlstra wrote:
>>> On Wed, May 23, 2018 at 10:45:45AM +0100, speck for David Woodhouse wrote:
>>>> The microcode trick just makes it a lot easier because we don't
>>>> have to *explicitly* pause the sibling vCPUs and manage their state on
>>>> every vmexit/entry. And avoids potential race conditions with managing
>>>> that in software.
>>>
>>> Yes, it would certainly help and avoid a fair bit of ugly. It would, for
>>> instance, avoid having to modify irq_enter() / irq_exit(), which would
>>> otherwise be required (and possibly leak all data touched up until that
>>> point is reached).
>>>
>>> But even with all that, adding L1-flush to every VMENTER will hurt lots.
>>> Consider for example the PIO emulation used when booting a guest from a
>>> disk image. That causes VMEXIT/VMENTER at stupendous rates.
>>
>> Just did a test on SKL Client where I have ucode. It does not have HT so
>> it's not suffering from any HT side effects when L1D is flushed.
>>
>> Boot time from a disk image is ~1s measured from the first vcpu enter.
>>
>> With L1D Flush on vmenter the boot time is about 5-10% slower. And that has
>> lots of PIO operations in the early boot.
>>
>> For a kernel build the L1D Flush has an overhead of < 1%.
>>
>> Netperf guest to host has a slight drop of the throughput in the 2%
>> range. Host to guest surprisingly goes up by ~3%. Fun stuff!
>>
>> Now I isolated two host CPUs and pinned the two vCPUs on them to be able to
>> measure the overhead. Running cyclictest with a period of 25us in the guest
>> on an isolated guest CPU and monitoring the behaviour with perf on the host
>> for the corresponding host CPU gives
>>
>> No Flush	      	       Flush
>>
>> 1.31 insn per cycle	       1.14 insn per cycle
>>
>> 2e6 L1-dcache-load-misses/sec  26e6 L1-dcache-load-misses/sec
>>
>> In that simple test the L1D misses go up by a factor of 13.
>>
>> Now with the whole gang scheduling the numbers I heard through the
>> grapevine are in the range of factor 130, i.e. 13k% for a simple boot from
>> disk image. 13 minutes instead of 6 seconds...

The performance is highly dependent on how often we VM exit.
Working with Peter Z on his prototype, the performance ranges from
no regression for a network loopback, through ~20% regression for a kernel
compile, to ~100% regression on file IO.  PIO brings out the worst aspect
of the synchronization overhead as we VM exit on every dword PIO read; the
kernel and initrd image was about 50 MB for the experiment, which led to
13 min of load time.

We may need to do the co-scheduling only when the VM exit rate is low, and
turn off SMT when the VM exit rate becomes too high.

(Note: I haven't added the L1 flush on VM entry to my experiment; that is
on the todo list.)

Tim


>>
>> That's not surprising at all, though the magnitude is way higher than I
>> expected. I don't see a realistic chance for vmexit heavy workloads to work
>> with that synchronization thing at all, whether it's ucode assisted or not.
> 
> That said, I think we should stage the host side mitigations plus the L1
> flush on vmenter ASAP so we are not standing there with our pants down when
> the cat comes out of the bag early. That means HT off, but it's still
> better than having absolutely nothing.
> 
> The gang scheduling nonsense can be added on top if it should
> surprisingly turn out to be usable at all.
> 
> Thanks,
> 
> 	tglx
> 



^ permalink raw reply	[flat|nested] 91+ messages in thread

* [MODERATED] Re: L1D-Fault KVM mitigation
  2018-05-24 23:18               ` [MODERATED] Encrypted Message Tim Chen
@ 2018-05-24 23:28                 ` Linus Torvalds
  2018-05-25  8:31                   ` Thomas Gleixner
  2018-05-25 18:22                 ` [MODERATED] Encrypted Message Tim Chen
  2018-05-26 19:14                 ` L1D-Fault KVM mitigation Thomas Gleixner
  2 siblings, 1 reply; 91+ messages in thread
From: Linus Torvalds @ 2018-05-24 23:28 UTC (permalink / raw)
  To: speck



On Thu, 24 May 2018, speck for Tim Chen wrote:
> 
> We may need to do the co-scheduling only when VM exit rate is low, and
> turn off the SMT when VM exit rate becomes too high.

I don't think there's any way to actually turn off HT dynamically, is 
there?

Yes, you can park the sibling in idle, but afaik, some core resources will 
still be statically partitioned, so it's different from the "turn off HT 
at boot" case.

But if there is actually a way to turn HT off dynamically, maybe we could 
make it be CPU hotplug. That would certainly be _so_ much better than the 
nasty "turn it on/off in BIOS" even for other uses.

               Linus

^ permalink raw reply	[flat|nested] 91+ messages in thread

* Re: L1D-Fault KVM mitigation
  2018-05-24 23:28                 ` [MODERATED] Re: L1D-Fault KVM mitigation Linus Torvalds
@ 2018-05-25  8:31                   ` Thomas Gleixner
  2018-05-28 14:43                     ` [MODERATED] " Paolo Bonzini
  0 siblings, 1 reply; 91+ messages in thread
From: Thomas Gleixner @ 2018-05-25  8:31 UTC (permalink / raw)
  To: speck

On Thu, 24 May 2018, speck for Linus Torvalds wrote:
> On Thu, 24 May 2018, speck for Tim Chen wrote:
> > 
> > We may need to do the co-scheduling only when VM exit rate is low, and
> > turn off the SMT when VM exit rate becomes too high.
> 
> I don't think there's any way to actually turn off HT dynamically, is 
> there?
> 
> Yes, you can park the sibling in idle, but afaik, some core resources will 
> still be statically partitioned, so it's different from the "turn off HT 
> at boot" case.
> 
> But if there is actually a way to turn HT off dynamically, maybe we could 
> make it be CPU hotplug. That would certainly be _so_ much better than the 
> nasty "turn it on/off in BIOS" even for other uses.

CPU hotplug is pretty much the only solution for runtime HT disable. In the
current state it won't give back the per-CPU resources allocated at boot
time, and nr_present_cpus() will still be the same as before shutting them
down, but everything else will be cleaned out. I need to check the scheduler
topology stuff, but that should see the change as well. We could optimize
that with the static key which is already available (sched_smt_present),
but I have to look at which nastiness is involved when toggling that at
runtime.

So yes, it's different from the case where HT is disabled in the BIOS, but
I don't think it really matters much in practice.

I'll implement a knob in sysfs for that and see how that behaves.

Thanks,

	tglx

^ permalink raw reply	[flat|nested] 91+ messages in thread

* [MODERATED] Re: L1D-Fault KVM mitigation
  2018-05-24 16:57                   ` Linus Torvalds
@ 2018-05-25 11:29                     ` David Woodhouse
  0 siblings, 0 replies; 91+ messages in thread
From: David Woodhouse @ 2018-05-25 11:29 UTC (permalink / raw)
  To: speck

On Thu, 2018-05-24 at 09:57 -0700, speck for Linus Torvalds wrote:
> 
> 
> On Thu, 24 May 2018, speck for David Woodhouse wrote:
> > 
> > With decently designed hardware and SR-IOV passthrough for all I/O, and
> > posted interrupts... 
> 
> If we had decently designed hardware, we wouldn't be having this 
> discussion in the first place.

Touché.

But with decently designed hardware *around* the CPUs in question…

…actually, who am I kidding? That's bullshit too. There are plenty of
skeletons in *that* cupboard.

There may be no such thing as "decently designed hardware" but we
*have* managed to eliminate vmexits to the point where they are no
longer the first and only thing we ever have to care about. Which
really does fit into the category you mentioned:

> The only case I see making up for it is when vm entry/exit are so rare 
> (because you don't actually do much virtualization at all) that you're 
> willing to make them much much slower just to keep HT for the 95% of the 
> time when you're not doing any virtualization at all.

My point was that you can do virtualisation without taking vmexits.
Much of the last decade or so of Intel CPU and platform design has
been about achieving exactly that.

^ permalink raw reply	[flat|nested] 91+ messages in thread

* [MODERATED] Encrypted Message
  2018-05-24 23:18               ` [MODERATED] Encrypted Message Tim Chen
  2018-05-24 23:28                 ` [MODERATED] Re: L1D-Fault KVM mitigation Linus Torvalds
@ 2018-05-25 18:22                 ` Tim Chen
  2018-05-26 19:14                 ` L1D-Fault KVM mitigation Thomas Gleixner
  2 siblings, 0 replies; 91+ messages in thread
From: Tim Chen @ 2018-05-25 18:22 UTC (permalink / raw)
  To: speck

From: Tim Chen <tim.c.chen@linux.intel.com>
To: speck for Tim Chen <speck@linutronix.de>
Subject: Re: L1D-Fault KVM mitigation

On 05/24/2018 04:18 PM, speck for Tim Chen wrote:
> On 05/24/2018 08:33 AM, speck for Thomas Gleixner wrote:
>> On Thu, 24 May 2018, speck for Thomas Gleixner wrote:
>>> On Thu, 24 May 2018, speck for Peter Zijlstra wrote:
>>>> On Wed, May 23, 2018 at 10:45:45AM +0100, speck for David Woodhouse wrote:
>>>>> The microcode trick just makes it a lot easier because we don't
>>>>> have to *explicitly* pause the sibling vCPUs and manage their state on
>>>>> every vmexit/entry. And avoids potential race conditions with managing
>>>>> that in software.
>>>>
>>>> Yes, it would certainly help and avoid a fair bit of ugly. It would, for
>>>> instance, avoid having to modify irq_enter() / irq_exit(), which would
>>>> otherwise be required (and possibly leak all data touched up until that
>>>> point is reached).
>>>>
>>>> But even with all that, adding L1-flush to every VMENTER will hurt lots.
>>>> Consider for example the PIO emulation used when booting a guest from a
>>>> disk image. That causes VMEXIT/VMENTER at stupendous rates.
>>>
>>> Just did a test on SKL Client where I have ucode. It does not have HT so
>>> it's not suffering from any HT side effects when L1D is flushed.
>>>
>>> Boot time from a disk image is ~1s measured from the first vcpu enter.
>>>
>>> With L1D Flush on vmenter the boot time is about 5-10% slower. And that has
>>> lots of PIO operations in the early boot.
>>>
>>> For a kernel build the L1D Flush has an overhead of < 1%.
>>>
>>> Netperf guest to host has a slight drop of the throughput in the 2%
>>> range. Host to guest surprisingly goes up by ~3%. Fun stuff!
>>>
>>> Now I isolated two host CPUs and pinned the two vCPUs on it to be able to
>>> measure the overhead. Running cyclictest with a period of 25us in the guest
>>> on a isolated guest CPU and monitoring the behaviour with perf on the host
>>> for the corresponding host CPU gives
>>>
>>> No Flush	      	       Flush
>>>
>>> 1.31 insn per cycle	       1.14 insn per cycle
>>>
>>> 2e6 L1-dcache-load-misses/sec  26e6 L1-dcache-load-misses/sec
>>>
>>> In that simple test the L1D misses go up by a factor of 13.
>>>
>>> Now with the whole gang scheduling the numbers I heard through the
>>> grapevine are in the range of factor 130, i.e. 13k% for a simple boot from
>>> disk image. 13 minutes instead of 6 seconds...
> 
> The performance is highly dependent on how often we VM exit.
> Working with Peter Z on his prototype, the performance ranges from
> no regression for a network loop back, ~20% regression for kernel compile
> to ~100% regression on File IO.  PIO brings out the worst aspect
> of the synchronization overhead as we VM exit on every dword PIO read in, and the
> kernel and initrd image was about 50 MB for the experiment, and led to
> 13 min of load time.
> 
> We may need to do the co-scheduling only when VM exit rate is low, and
> turn off the SMT when VM exit rate becomes too high.
> 
> (Note: I haven't added in the L1 flush on VM entry for my experiment, that is on
> the todo).

As a post note, I added in the L1 flush and the performance numbers
pretty much stay the same.  So the synchronization overhead is
dominant and L1 flush overhead is secondary.

Tim



^ permalink raw reply	[flat|nested] 91+ messages in thread

* Re: L1D-Fault KVM mitigation
  2018-05-24 23:18               ` [MODERATED] Encrypted Message Tim Chen
  2018-05-24 23:28                 ` [MODERATED] Re: L1D-Fault KVM mitigation Linus Torvalds
  2018-05-25 18:22                 ` [MODERATED] Encrypted Message Tim Chen
@ 2018-05-26 19:14                 ` Thomas Gleixner
  2018-05-26 20:43                   ` [MODERATED] " Andi Kleen
  2018-05-29 19:29                   ` [MODERATED] Encrypted Message Tim Chen
  2 siblings, 2 replies; 91+ messages in thread
From: Thomas Gleixner @ 2018-05-26 19:14 UTC (permalink / raw)
  To: speck

On Thu, 24 May 2018, speck for Tim Chen wrote:

>> Now with the whole gang scheduling the numbers I heard through the
>> grapevine are in the range of factor 130, i.e. 13k% for a simple boot from
>> disk image. 13 minutes instead of 6 seconds...

> The performance is highly dependent on how often we VM exit.

That's pretty obvious.

> Working with Peter Z on his prototype, the performance ranges from
> no regression for a network loop back, ~20% regression for kernel compile
> to ~100% regression on File IO.

These numbers are not that interesting when you do not provide comparisons
vs. single threaded. See below.

> PIO brings out the worst aspect of the synchronization overhead as we VM
> exit on every dword PIO read in, and the kernel and initrd image was
> about 50 MB for the experiment, and led to 13 min of load time.
>
> We may need to do the co-scheduling only when VM exit rate is low, and
> turn off the SMT when VM exit rate becomes too high.

You cannot do that during runtime. That will destroy placement schemes and
whatever. The SMT off decision needs to be done at a quiescent moment,
i.e. before starting VMs.

The PIO case _IS_ interesting because it highlights the problem with the
synchronization overhead. And it does not matter at all whether you VMEXIT
because of a PIO access or due to any other reason. So even if you optimize
it then you still have a gazillion of vm_exits on boot. The simple boot
tests I did have ~250k vm_exits in 5 seconds and only half of them are PIO.

Removing the PIO access makes the boot faster because you avoid 50% of the
vmexits, but the rest of the vmexits will still get a massive overhead,
unless you have a scenario where two vCPUs of a guest are runnable and
ready to enter at the same time and vmexit at the same time. Any other
scenario will lose due to the busy waiting synchronization overhead. Just
look at traces and do the math.

I did the following test:

 - Two CPUs (siblings) on the host (HSW-EX) fully isolated

 - One guest with two vCPUs affine to the isolated host CPUs. idle=poll on
   the guest command line to avoid the single vCPU case.

 - No L1 Flush

 - Running a kernel compile on the guest in the regular virtio disk backed
   filesystem. Modified the build script to stop before the final linkage
   because that is single-threaded.

Time:  88 seconds

vmexits:   vCPU0         86.218
           vCPU1         85.703
           total        171.921

That's about 2 vmexits per ms.

Running the same compile single threaded (offlining vCPU1 in the guest)
increases the time to 107 seconds.

    107 / 88  = 1.22

I.e. it's 20% slower than the one using two threads. That means that it is
the same slowdown as having two threads synchronized (your number).

So if I take the above example and assume that the overhead of
synchronization is ~20% then the average vmenter/vmexit time is close to
50us.
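
Spelling that estimate out (all inputs are the numbers measured above; only
the arithmetic is new, and it charges the overhead to both the vmenter and
the vmexit of each exit event):

```python
# Measured values from the two-vCPU kernel compile above.
exits_vcpu0 = 86_218
exits_vcpu1 = 85_703
total_exits = exits_vcpu0 + exits_vcpu1     # 171_921
runtime_s = 88                              # two-thread compile time

exit_rate_per_ms = total_exits / (runtime_s * 1000)
print(f"{exit_rate_per_ms:.1f} vmexits per ms")          # ~2 per ms

single_thread_s = 107
slowdown = single_thread_s / runtime_s       # 1.22, i.e. ~20% slower

# If synchronization costs ~20% of the two-thread runtime, spread over
# every vmenter *and* vmexit (two transitions per exit event):
overhead_s = 0.20 * runtime_s
transitions = 2 * total_exits
per_transition_us = overhead_s * 1e6 / transitions
print(f"{per_transition_us:.0f} us per vmenter/vmexit")  # ~50us
```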

Next I did an experiment with synchronizing the vmenter/vmexit. It's
probably more stupid than what you have as the overhead I observe is way
higher, but then I don't know how and what you tested exactly, so it's hard
to compare.

Nevertheless it gave me very interesting insights via tracing the
synchronization mechanics. The interesting thing is that halfway
synchronous vmexits on both vCPUs are rather cheap. The slightly async ones
make the big difference, and at some points in the trace the stuff starts to
ping pong in and out of guest mode without really making progress for a
while. So there is not only the overhead itself, it's timing-dependent
overhead which can accumulate rather fast. And there is absolutely nothing
you can do about that.

So I can see the usefulness for scenarios which David Woodhouse described
where vCPU and host CPU have a fixed relationship and the guests exit once
in a while. But that should really be done with ucode assistance which
avoids all the nasty synchronization hackery more or less completely.

But if anyone believes that the gang scheduling scheme with full software
synchronization can be applied to random usecases, then he's probably
working for the marketing department and authoring the L1 terminal fuckup
press release and whitepaper.

I'm surely open for a surprisingly clever trick which makes this all work,
but I certainly won't hold my breath.

Thanks,

        tglx

^ permalink raw reply	[flat|nested] 91+ messages in thread

* [MODERATED] Re: L1D-Fault KVM mitigation
  2018-05-26 19:14                 ` L1D-Fault KVM mitigation Thomas Gleixner
@ 2018-05-26 20:43                   ` Andi Kleen
  2018-05-26 20:48                     ` Linus Torvalds
  2018-05-27 15:42                     ` Thomas Gleixner
  2018-05-29 19:29                   ` [MODERATED] Encrypted Message Tim Chen
  1 sibling, 2 replies; 91+ messages in thread
From: Andi Kleen @ 2018-05-26 20:43 UTC (permalink / raw)
  To: speck

> The PIO case _IS_ interesting because it highlights the problem with the
> synchronization overhead. And it does not matter at all whether you VMEXIT
> because of a PIO access or due to any other reason. So even if you optimize
> it then you still have a gazillion of vm_exits on boot. The simple boot
> tests I did have ~250k vm_exits in 5 seconds and only half of them are PIO.

Keep in mind that we don't need to synchronize when the other CPU is idle
in the guest, so it's only a problem when all the CPUs are busy.

That should be the common case for boot.

Right now something doesn't seem to be working right with this, so 
the PIO overhead is still high. 

> Nevertheless it gave me very interesting insights via tracing the
> synchronization mechanics. The interesting thing is that halfways
> synchronous vmexits on both vCPUs are rather cheap. The slightly async ones

What's an async vmexit? One that blocks?

I didn't think we had that many of those. What exactly you are seeing?

-Andi

^ permalink raw reply	[flat|nested] 91+ messages in thread

* [MODERATED] Re: L1D-Fault KVM mitigation
  2018-05-26 20:43                   ` [MODERATED] " Andi Kleen
@ 2018-05-26 20:48                     ` Linus Torvalds
  2018-05-27 18:25                       ` Andi Kleen
  2018-05-27 15:42                     ` Thomas Gleixner
  1 sibling, 1 reply; 91+ messages in thread
From: Linus Torvalds @ 2018-05-26 20:48 UTC (permalink / raw)
  To: speck



On Sat, 26 May 2018, speck for Andi Kleen wrote:
> 
> Keep in mind that we don't need to synchronize when the other CPU is idle
> in the guest, so it's only a problem when all the CPUs are busy.

What? No.

Maybe you mean "idle in the _host_"?

Which is not necessarily at all the same thing. 

> That should be the common case for boot.

Again, no. The guest booting has absolutely nothing to do with the host 
being idle or not.

If the argument is that "most of the time you have idle host CPU 
resources", then the solution to that is simple: "turn off HT".

                    Linus

^ permalink raw reply	[flat|nested] 91+ messages in thread

* Re: L1D-Fault KVM mitigation
  2018-05-26 20:43                   ` [MODERATED] " Andi Kleen
  2018-05-26 20:48                     ` Linus Torvalds
@ 2018-05-27 15:42                     ` Thomas Gleixner
  2018-05-27 16:26                       ` [MODERATED] " Linus Torvalds
  2018-05-27 18:31                       ` Andi Kleen
  1 sibling, 2 replies; 91+ messages in thread
From: Thomas Gleixner @ 2018-05-27 15:42 UTC (permalink / raw)
  To: speck

On Sat, 26 May 2018, speck for Andi Kleen wrote:
> > The PIO case _IS_ interesting because it highlights the problem with the
> > synchronization overhead. And it does not matter at all whether you VMEXIT
> > because of a PIO access or due to any other reason. So even if you optimize
> > it then you still have a gazillion of vm_exits on boot. The simple boot
> > tests I did have ~250k vm_exits in 5 seconds and only half of them are PIO.
> 
> Keep in mind that we don't need to synchronize when the other CPU is idle
> in the guest, so it's only a problem when all the CPUs are busy.

No. It does not matter at all what a guest CPU does. The allowed states are:

	CPU0			CPU1

	In host			In host

	In guest		In host forced idle

	In host forced idle	In guest

	In guest       		In guest

Whatever the guest mode does is irrelevant.
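
For clarity, the state table above as a tiny checker; the names are mine and
this is not kernel code. Note that, as Linus points out downthread, the
guest/guest pair is only safe when both vCPUs belong to the same guest.

```python
HOST, GUEST, FORCED_IDLE = "host", "guest", "forced-idle"

def pair_allowed(a, b):
    """Return True if the (CPU0, CPU1) sibling state pair is permitted."""
    allowed = {
        (HOST, HOST),
        (GUEST, FORCED_IDLE),
        (FORCED_IDLE, GUEST),
        (GUEST, GUEST),   # only if both vCPUs belong to the SAME guest
    }
    return (a, b) in allowed

# One sibling in guest mode while the other runs arbitrary host code
# is exactly the state that must never occur.
assert not pair_allowed(GUEST, HOST)
```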

> That should be the common case for boot.

> > Nevertheless it gave me very interesting insights via tracing the
> > synchronization mechanics. The interesting thing is that halfways
> > synchronous vmexits on both vCPUs are rather cheap. The slightly async ones
> 
> What's an async vmexit? One that blocks?

No, I'm talking about timing. Sorry, I should have said simultaneous. Let me
rephrase.

If the vmexits of both guest CPUs happen almost at the same time,
i.e. simultaneously, then the overhead is pretty small. That's the case for
the tick. But that's pretty much the only event which has that property.

All other vmexits I see are singular events of one guest. There you have a
choice of busy waiting for the other vCPU to vmexit as well or force it out
via an IPI. The method I use is IPI as busy waiting would be horribly slow
for obvious reasons.

Now there are situations which show the following behaviour:

CPU0	  vmexit
	  IPI CPU1
CPU0	  sync_exit
CPU1	  vmexit
CPU1	  sync_exit
CPU1	  sync_enter

CPU0	  do_stuff
CPU0	  sync_enter

CPU0	  vmenter
CPU1	  vmenter

CPU0	  vmexit	immediately after vmenter
	  IPI CPU1
....

and this ping pong goes on 10 times in a row, taking 2+ milliseconds, while
the progress made in the guest is minimal. The reason for this is interrupts
targeted to one of the vCPUs or operations in one of the guest threads
which cause several exits in a row. And there is nothing you can do about
that. It's completely workload dependent.
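
As a toy model of the wait/IPI scheme: two threads must rendezvous at a
barrier before every "vmenter" (sync_enter) and again at every "vmexit"
(sync_exit), so a single vCPU exiting always drags the sibling out too. This
is purely illustrative Python threading, with no relation to the actual
prototype code.

```python
import threading

NVCPUS = 2
ITERATIONS = 1000
enter_barrier = threading.Barrier(NVCPUS)  # sync_enter: both enter together
exit_barrier = threading.Barrier(NVCPUS)   # sync_exit: both leave together
guest_time = [0] * NVCPUS

def vcpu(idx):
    for _ in range(ITERATIONS):
        enter_barrier.wait()   # cannot vmenter until the sibling is ready
        guest_time[idx] += 1   # "in guest"
        exit_barrier.wait()    # one sibling's exit forces the other out too

threads = [threading.Thread(target=vcpu, args=(i,)) for i in range(NVCPUS)]
for t in threads:
    t.start()
for t in threads:
    t.join()

# Neither vCPU can run ahead of the other; progress is strictly lock-step.
assert guest_time == [ITERATIONS, ITERATIONS]
print("both vCPUs made", ITERATIONS, "synchronized enter/exit round trips")
```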

So unless you have a fully controlled scenario where the guests almost
never exit, the whole synchronization stuff is doomed. But fully controlled
means a 1:1 relationship of physical and virtual CPUs like David mentioned.
Yes, that stuff can benefit, but then we rather want ucode assistance than
the whole wait/IPI dance in software.

Thanks,

	tglx

^ permalink raw reply	[flat|nested] 91+ messages in thread

* [MODERATED] Re: L1D-Fault KVM mitigation
  2018-05-27 15:42                     ` Thomas Gleixner
@ 2018-05-27 16:26                       ` Linus Torvalds
  2018-05-27 18:31                       ` Andi Kleen
  1 sibling, 0 replies; 91+ messages in thread
From: Linus Torvalds @ 2018-05-27 16:26 UTC (permalink / raw)
  To: speck



On Sun, 27 May 2018, speck for Thomas Gleixner wrote:
> 
> No. It does not matter at all what a guest CPU does. The allowed states are:

It's even tighter than that.

> 	CPU0			CPU1
> 
> 	In host			In host
> 
> 	In guest		In host forced idle
> 
> 	In host forced idle	In guest
> 
> 	In guest       		In guest

That last case is "both in SAME guest".

You mustn't have two different VM's running on two sibling CPU's, or one 
might siphon data from the other.

                 Linus

^ permalink raw reply	[flat|nested] 91+ messages in thread

* [MODERATED] Re: L1D-Fault KVM mitigation
  2018-05-26 20:48                     ` Linus Torvalds
@ 2018-05-27 18:25                       ` Andi Kleen
  2018-05-27 18:49                         ` Linus Torvalds
  0 siblings, 1 reply; 91+ messages in thread
From: Andi Kleen @ 2018-05-27 18:25 UTC (permalink / raw)
  To: speck

On Sat, May 26, 2018 at 01:48:01PM -0700, speck for Linus Torvalds wrote:
> 
> 
> On Sat, 26 May 2018, speck for Andi Kleen wrote:
> > 
> > Keep in mind that we don't need to synchronize when the other CPU is idle
> > in the guest, so it's only a problem when all the CPUs are busy.
> 
> What? No.
> 
> Maybe you mean "idle in the _host_"?

It's both actually. Either idle in the host, or idle in the guest.

When it's idle in the host we don't need to synchronize, unless 
there is an interrupt (which does its own synchronization) because
the idle loop has nothing valuable to leak.

And if it's idle in the guest then the vcpu is blocked and
also obviously doesn't need to be synchronized.

Both cases can be reasonably common.

-Andi

^ permalink raw reply	[flat|nested] 91+ messages in thread

* [MODERATED] Re: L1D-Fault KVM mitigation
  2018-05-27 15:42                     ` Thomas Gleixner
  2018-05-27 16:26                       ` [MODERATED] " Linus Torvalds
@ 2018-05-27 18:31                       ` Andi Kleen
  1 sibling, 0 replies; 91+ messages in thread
From: Andi Kleen @ 2018-05-27 18:31 UTC (permalink / raw)
  To: speck

On Sun, May 27, 2018 at 05:42:46PM +0200, speck for Thomas Gleixner wrote:
> Whatever the guest mode does is irrelevant.

Idle in the guest means already exited to the host because
HLT exit is always enabled.

> 
> and this ping pong goes on 10 times in a row taking 2+ milliseconds and the
> progress made is the guest is minimal. The reason for this are interrupts
> targeted to one of the vCPUs or operations in one of the guest threads
> which cause several exits in a row. And there is nothing you can do about
> that. It's completely workload dependent.

There's a threshold of the exit rate somewhere beyond which HT off is
probably better, but I don't think we have really characterized well so far
where exactly this threshold lies.

My suspicion is that a lot of workloads will be below the threshold.
We'll see what works out.

> 
> So unless you have a fully controlled scenario where the guests almost
> never exit, the whole synchronization stuff is doomed. But fully controlled
> means a 1:1 relationship of physical and virtual CPUs like David mentioned.
> Yes, that stuff can benefit, but then we rather want ucode assistance than
> the whole wait/IPI dance in software.

Right that would be an obvious optimization. However in the traces
I looked so far the IPI was actually not the most expensive part.

I think it's because IPIs between siblings are not very expensive.

-Andi

^ permalink raw reply	[flat|nested] 91+ messages in thread

* [MODERATED] Re: L1D-Fault KVM mitigation
  2018-05-27 18:25                       ` Andi Kleen
@ 2018-05-27 18:49                         ` Linus Torvalds
  2018-05-27 18:57                           ` Thomas Gleixner
                                             ` (2 more replies)
  0 siblings, 3 replies; 91+ messages in thread
From: Linus Torvalds @ 2018-05-27 18:49 UTC (permalink / raw)
  To: speck



On Sun, 27 May 2018, speck for Andi Kleen wrote:
> > 
> > Maybe you mean "idle in the _host_"?
> 
> It's both actually. Either idle in the host, or idle in the guest.

It really really isn't.

Andi, if one sibling does a vmexit, and the other sibling is still in vmx 
mode, we have to force-exit the other sibling. It doesn't matter one whit 
whether it's idle or not - because we won't know.

Or is there some magical sideband that I am not aware of?

> When it's idle in the host we don't need to synchronize, unless 
> there is an interrupt (which does its own synchronization) because
> the idle loop has nothing valuable to leak.

Right. The only case that doesn't need synchronization is "other sibling 
is idle in the _host_".

> And if it's idle in the guest then the vcpu is blocked and
> also obviously doesn't need to be synchronized.

If the other sibling is idle inside the guest, then we don't even know 
that.

Of course, if somebody is doing exit-on-{hlt/mwait/pause}, and we end up 
actually being in the host and _emulating_ the idle, then that might be 
acceptable. Except even then we'd be potentially leaking a *lot* of host 
caches. The vmx emulation code ends up having a not all that insignificant 
footprint in that (host percpu state, thread state etc).

But why would you emulate halt/mwait/pause anyway? That sounds insane to 
me. The reason you would want exit-on-halt is so that the host can do 
something else if a vcpu goes idle, not so that it can just stay in some 
emulated idle state.

If you want to go to low-power mode, you'd just let the halt/mwait happen 
inside the guest.

But to be honest, I haven't actually checked what kvm users (or xen, or 
whatever) really do. Am I missing something?

           Linus

^ permalink raw reply	[flat|nested] 91+ messages in thread

* Re: L1D-Fault KVM mitigation
  2018-05-27 18:49                         ` Linus Torvalds
@ 2018-05-27 18:57                           ` Thomas Gleixner
  2018-05-27 19:13                           ` [MODERATED] " Andrew Cooper
  2018-05-28 14:40                           ` Paolo Bonzini
  2 siblings, 0 replies; 91+ messages in thread
From: Thomas Gleixner @ 2018-05-27 18:57 UTC (permalink / raw)
  To: speck

On Sun, 27 May 2018, speck for Linus Torvalds wrote:
> On Sun, 27 May 2018, speck for Andi Kleen wrote:
> > > 
> > > Maybe you mean "idle in the _host_"?
> > 
> > It's both actually. Either idle in the host, or idle in the guest.
> 
> It really really isn't.
> 
> Andi, if one sibling does a vmexit, and the other sibling is still in vmx 
> mode, we have to force-exit the other sibling. It doesn't matter one whit 
> whether it's idle or not - because we won't know.
> 
> Or is there some magical sideband that I am not aware of?
> 
> > When it's idle in the host we don't need to synchronize, unless 
> > there is an interrupt (which does its own synchronization) because
> > the idle loop has nothing valuable to leak.
> 
> Right. The only case that doesn't need synchronization is "other sibling 
> is idle in the _host_".

It still needs synchronization as you have to enforce that the other
sibling _IS_ in forced idle and not trying to do something else. Unless it
has reached forced idle state you can't vmenter.

Thanks,

	tglx

^ permalink raw reply	[flat|nested] 91+ messages in thread

* [MODERATED] Re: L1D-Fault KVM mitigation
  2018-05-27 18:49                         ` Linus Torvalds
  2018-05-27 18:57                           ` Thomas Gleixner
@ 2018-05-27 19:13                           ` Andrew Cooper
  2018-05-27 19:26                             ` Linus Torvalds
  2018-05-28 14:40                           ` Paolo Bonzini
  2 siblings, 1 reply; 91+ messages in thread
From: Andrew Cooper @ 2018-05-27 19:13 UTC (permalink / raw)
  To: speck

On 27/05/2018 19:49, speck for Linus Torvalds wrote:
>
> On Sun, 27 May 2018, speck for Andi Kleen wrote:
>>> Maybe you mean "idle in the _host_"?
>> It's both actually. Either idle in the host, or idle in the guest.
> It really really isn't.
>
> Andi, if one sibling does a vmexit, and the other sibling is still in vmx 
> mode, we have to force-exit the other sibling. It doesn't matter one whit 
> whether it's idle or not - because we won't know.
>
> Or is there some magical sideband that I am not aware of?
>
>> When it's idle in the host we don't need to synchronize, unless 
>> there is an interrupt (which does its own synchronization) because
>> the idle loop has nothing valuable to leak.
> Right. The only case that doesn't need synchronization is "other sibling 
> is idle in the _host_".
>
>> And if it's idle in the guest then the vcpu is blocked and
>> also obviously doesn't need to be synchronized.
> If the other sibling is idle inside the guest, then we don't even know 
> that.
>
> Of course, if somebody is doing exit-on-{hlt/mwait/pause}, and we end up 
> actually being in the host and _emulating_ the idle, then that might be 
> acceptable. Except even then we'd be potentially leaking a *lot* of host 
> caches. The vmx emulation code ends up having a not all that insignificant 
> footprint in that (host percpu state, thread state etc).
>
> But why would you emulate halt/mwait/pause anyway? That sounds insane to 
> me. The reason you would want exit-on-halt is so that the host can do 
> something else if a vcpu goes idle, not so that it can just stay in some 
> emulated idle state.
>
> If you want to go to low-power mode, you'd just let the halt/mwait happen 
> inside the guest.
>
> But to be honest, I haven't actually checked what kvm users (or xen, or 
> whatever) really do. Am I missing something?

Xen doesn't ever let vcpus enter idle themselves.  We trap HLT/etc and will
either schedule another VM, or choose to idle the host if there really
is nothing else to do.

I've never come across a plausible usecase for letting non-root mode
idle the cores into a low power state.  They simply aren't in a position
to know whether other work needs doing or not.

~Andrew


^ permalink raw reply	[flat|nested] 91+ messages in thread

* [MODERATED] Re: L1D-Fault KVM mitigation
  2018-05-27 19:13                           ` [MODERATED] " Andrew Cooper
@ 2018-05-27 19:26                             ` Linus Torvalds
  2018-05-27 19:41                               ` Thomas Gleixner
  0 siblings, 1 reply; 91+ messages in thread
From: Linus Torvalds @ 2018-05-27 19:26 UTC (permalink / raw)
  To: speck



On Sun, 27 May 2018, speck for Andrew Cooper wrote:
> 
> Xen doesn't ever let vcpus enter idle themselves.  We trap HLT/etc and will
> either schedule another VM, or choose to idle the host if there really
> is nothing else to do.

Right. But then there is never any such thing as "guest idle". There is 
only "host idle" or "host is doing something else entirely".

You _could_ have a "idle polling" mode which is separate from the regular 
host idle loop, I guess.

> I've never come across a plausible usecase for letting non-root mode
> idle the cores into a low power state.  They simply aren't in a position
> to know whether other work needs doing or not.

Afaik, it mainly makes sense when there is no actual host OS at all, just 
the bare-metal hypervisor used for partitioning resources, not scheduling 
them.

                 Linus

^ permalink raw reply	[flat|nested] 91+ messages in thread

* Re: L1D-Fault KVM mitigation
  2018-05-27 19:26                             ` Linus Torvalds
@ 2018-05-27 19:41                               ` Thomas Gleixner
  2018-05-27 22:26                                 ` [MODERATED] " Andrew Cooper
  0 siblings, 1 reply; 91+ messages in thread
From: Thomas Gleixner @ 2018-05-27 19:41 UTC (permalink / raw)
  To: speck

On Sun, 27 May 2018, speck for Linus Torvalds wrote:
> On Sun, 27 May 2018, speck for Andrew Cooper wrote:
> > 
> > Xen doesn't ever let vcpus enter idle themselves.  We trap HLT/etc and will
> > either schedule another VM, or choose to idle the host if there really
> > is nothing else to do.
> 
> Right. But then there is never any such thing as "guest idle". There is 
> only "host idle" or "host is doing something else entirely".
> 
> You _could_ have a "idle polling" mode which is separate from the regular 
> host idle loop, I guess.
> 
> > I've never come across a plausible usecase for letting non-root mode
> > idle the cores into a low power state.  They simply aren't in a position
> > to know whether other work needs doing or not.
> 
> Afaik, it mainly makes sense when there is no actual host OS at all, just 
> the bare-metal hypervisor used for partitioning resources, not scheduling 
> them.

Right. That's what the Jailhouse hypervisor does. It's a zero vmexit setup.

Thanks,

	tglx

^ permalink raw reply	[flat|nested] 91+ messages in thread

* [MODERATED] Re: L1D-Fault KVM mitigation
  2018-05-27 19:41                               ` Thomas Gleixner
@ 2018-05-27 22:26                                 ` Andrew Cooper
  2018-05-28  6:47                                   ` Thomas Gleixner
  0 siblings, 1 reply; 91+ messages in thread
From: Andrew Cooper @ 2018-05-27 22:26 UTC (permalink / raw)
  To: speck

On 27/05/2018 20:41, speck for Thomas Gleixner wrote:
> On Sun, 27 May 2018, speck for Linus Torvalds wrote:
>> On Sun, 27 May 2018, speck for Andrew Cooper wrote:
>>> Xen doesn't ever let vcpus enter idle themselves.  We trap HLT/etc and will
>>> either schedule another VM, or choose to idle the host if there really
>>> is nothing else to do.
>> Right. But then there is never any such thing as "guest idle". There is 
>> only "host idle" or "host is doing something else entirely".
>>
>> You _could_ have a "idle polling" mode which is separate from the regular 
>> host idle loop, I guess.
>>
>>> I've never come across a plausible usecase for letting non-root mode
>>> idle the cores into a low power state.  They simply aren't in a position
>>> to know whether other work needs doing or not.
>> Afaik, it mainly makes sense when there is no actual host OS at all, just 
>> the bare-metal hypervisor used for partitioning resources, not scheduling 
>> them.
> Right. That's what the Jailhouse hypervisor does. It's a zero vmexit setup.

Jailhouse also has static assignment of resources, which means they can
arrange never to have two different VMs on the same sibling hyperthreads.

They still need to disable hyperthreads or find a working
synchronisation algorithm for entry/exit, but they don't have the added
gang scheduling problem of ensuring that two hyperthreads are always
occupied by vcpus of the same VM.

FWIW, my gut feeling at the moment is that the overhead of
synchronisation will outweigh disabling hyperthreading, but I'd like to
be proved wrong.  Others in the Xen community are looking to extend
shadow paging to be as performant as EPT is currently (because at that
point, the hypervisor controls every PTE accessible to the pagewalk), and
again, I'd like to see this succeed, but my gut feeling is that it won't.

~Andrew


^ permalink raw reply	[flat|nested] 91+ messages in thread

* Re: L1D-Fault KVM mitigation
  2018-05-27 22:26                                 ` [MODERATED] " Andrew Cooper
@ 2018-05-28  6:47                                   ` Thomas Gleixner
  2018-05-28 12:26                                     ` [MODERATED] " Andrew Cooper
  0 siblings, 1 reply; 91+ messages in thread
From: Thomas Gleixner @ 2018-05-28  6:47 UTC (permalink / raw)
  To: speck

On Sun, 27 May 2018, speck for Andrew Cooper wrote:
> On 27/05/2018 20:41, speck for Thomas Gleixner wrote:
> > On Sun, 27 May 2018, speck for Linus Torvalds wrote:
> >> Afaik, it mainly makes sense when there is no actual host OS at all, just 
> >> the bare-metal hypervisor used for partitioning resources, not scheduling 
> >> them.
> > Right. That's what the Jailhouse hypervisor does. It's a zero vmexit setup.
> 
> Jailhouse also has static assignment of resources, which means they can
> arrange never to have two different VMs on the same sibling hyperthreads.

Right.

> They still need to disable hyperthreads or find a working
> synchronisation algorithm for entry/exit, but they don't have the added
> gang scheduling problem of ensuring that two hyperthreads are always
> occupied by vcpus of the same VM.

If Jailhouse exits, then there is something badly wrong and the guest will
usually be terminated. There are a few odd cases where an exit actually is
non-fatal, but that should be fixable.

> FWIW, my gut feeling at the moment is that the overhead of
> synchronisation will outweigh disabling hyperthreading, but I'd like to
> be proved wrong.  Others in the Xen community are looking to extend
> shadow paging to be as performant as EPT is currently (because at that
> point, the hypervisor control every PTE accessible to the pagewalk), and
> again, I'd like to see this succeed, but my gut feeling is that it wont.

It might be a viable solution for some of the common scenarios like mass
hosting, which tends to have a lot of single vcpu guests; there the overhead
of shadow page tables might be less than the overhead of forcing siblings
into idle and putting restrictions on load balancing etc. At least worth
investigating.

Thanks,

	tglx

^ permalink raw reply	[flat|nested] 91+ messages in thread

* [MODERATED] Re: L1D-Fault KVM mitigation
  2018-05-28  6:47                                   ` Thomas Gleixner
@ 2018-05-28 12:26                                     ` Andrew Cooper
  0 siblings, 0 replies; 91+ messages in thread
From: Andrew Cooper @ 2018-05-28 12:26 UTC (permalink / raw)
  To: speck

[-- Attachment #1: Type: text/plain, Size: 1779 bytes --]

On 28/05/2018 07:47, speck for Thomas Gleixner wrote:
> On Sun, 27 May 2018, speck for Andrew Cooper wrote:
>> On 27/05/2018 20:41, speck for Thomas Gleixner wrote:
>>
>> FWIW, my gut feeling at the moment is that the overhead of
>> synchronisation will outweigh disabling hyperthreading, but I'd like to
>> be proved wrong.  Others in the Xen community are looking to extend
>> shadow paging to be as performant as EPT is currently (because at that
>> point, the hypervisor control every PTE accessible to the pagewalk), and
>> again, I'd like to see this succeed, but my gut feeling is that it wont.
> It might be a viable solution for some of the common scenarios like mass
> hosting which tends to have a lot of single vcpu guests; there the overhead
> of shadow page tables might be less than the overhead of forcing siblings
> into idle and putting restrictions on load balancing etc. At least worth to
> investigate.

Sadly, KPTI has taken what was a manageable performance difference
between shadow and EPT, and wrecked it.  A number of common tasks are
between 6 and 16 times slower than EPT, due to all the CR3 vmexits.

The CR3-target feature is attractive because it does let writes to CR3
happen, and can even filter on the NOFLUSH bit being set.  The problem
is the lack of a GPA => HPA translation, so a naive hypervisor which
tries to use this has its guests wandering off their shadows, and
everything explodes.

As of this morning, Juergen and I are experimenting with the hypervisor
providing the translation table to the guest, so it can write the
correctly-translated CR3 and avoid the vmexit.  I've also asked whether
this would be feasible to do in microcode, to avoid having to make any
guest modifications.

~Andrew


^ permalink raw reply	[flat|nested] 91+ messages in thread

* [MODERATED] Re: L1D-Fault KVM mitigation
  2018-05-27 18:49                         ` Linus Torvalds
  2018-05-27 18:57                           ` Thomas Gleixner
  2018-05-27 19:13                           ` [MODERATED] " Andrew Cooper
@ 2018-05-28 14:40                           ` Paolo Bonzini
  2018-05-28 15:56                             ` Thomas Gleixner
  2 siblings, 1 reply; 91+ messages in thread
From: Paolo Bonzini @ 2018-05-28 14:40 UTC (permalink / raw)
  To: speck

[-- Attachment #1: Type: text/plain, Size: 1379 bytes --]

On 27/05/2018 20:49, speck for Linus Torvalds wrote:
> But why would you emulate halt/mwait/pause anyway? That sounds insane to 
> me. The reason you would want exit-on-halt is so that the host can do 
> something else if a vcpu goes idle, not so that it can just stay in some 
> emulated idle state.

When hlt/mwait is emulated, the thread goes to sleep.  When pause is
emulated, the thread checks if there is another CPU to yield to, but
otherwise stays running.

KVM recently grew a new mode where hlt/mwait/pause is passed directly to
the guest.  It was (partly) contributed by Amazon because that's what
they're doing in their KVM-based cloud stuff - it does all I/O in custom
hardware, so all interrupts will be VT-d posted interrupts, avoiding the
overhead of the vCPU thread going to sleep and waking back up.

There is a problem though.  The only way to know if the guest is in
hlt/mwait/pause is to cause a vmexit, e.g. with an IPI, and read the VM
control state (which you can only do from the CPU that was running it!).
So even for the "idle in the guest" case you pretty much have to do
synchronization.

Paolo

> If you want to go to low-power mode, you'd just let the halt/mwait happen 
> inside the guest.
> 
> But to be honest, I haven't actually checked what kvm users (or xen, or 
> whatever) really do. Am I missing something?



^ permalink raw reply	[flat|nested] 91+ messages in thread

* [MODERATED] Re: L1D-Fault KVM mitigation
  2018-05-25  8:31                   ` Thomas Gleixner
@ 2018-05-28 14:43                     ` Paolo Bonzini
  0 siblings, 0 replies; 91+ messages in thread
From: Paolo Bonzini @ 2018-05-28 14:43 UTC (permalink / raw)
  To: speck

[-- Attachment #1: Type: text/plain, Size: 1849 bytes --]

On 25/05/2018 10:31, speck for Thomas Gleixner wrote:
> On Thu, 24 May 2018, speck for Linus Torvalds wrote:
>> On Thu, 24 May 2018, speck for Tim Chen wrote:
>>>
>>> We may need to do the co-scheduling only when VM exit rate is low, and
>>> turn off the SMT when VM exit rate becomes too high.
>>
>> I don't think there's any way to actually turn off HT dynamically, is 
>> there?
>>
>> Yes, you can park the sibling in idle, but afaik, some core resources will 
>> still be statically partitioned, so it's different from the "turn off HT 
>> at boot" case.
>>
>> But if there is actually a way to turn HT off dynamically, maybe we could 
>> make it be CPU hotplug. That would certainly be _so_ much better than the 
>> nasty "turn it on/off in BIOS" even for other uses.
> 
> CPU hotplug is pretty much the only solution for runtime HT disable. In the
> current state it won't give the boot time allocated per cpu resources back
> and nr_present_cpus() will still be the same as before shutting them down,
> but everything else will be cleaned out. I need to check the scheduler
> topology stuff, but that should see the change as well. We could optimize
> that with the static key which is already available (sched_smt_present),
> but I have to look which nastiness is involved when toggling that at
> runtime.
> 
> So yes, it's different from the case where HT is disabled in the BIOS, but
> I don't think it really matters much in practice.
> 
> I'll implement a knob in sysfs for that and see how that behaves.

/me is still alive though slightly jet-lagged

That was also Ingo's suggestion when I talked to him (about a month back).

Also, PIO workloads are not necessarily _that_ bad because of course KVM
will do only one vmexit per "rep ins".  But some bootloaders are better
than others.

Paolo


^ permalink raw reply	[flat|nested] 91+ messages in thread

* Re: L1D-Fault KVM mitigation
  2018-05-28 14:40                           ` Paolo Bonzini
@ 2018-05-28 15:56                             ` Thomas Gleixner
  2018-05-28 17:15                               ` [MODERATED] " Paolo Bonzini
  0 siblings, 1 reply; 91+ messages in thread
From: Thomas Gleixner @ 2018-05-28 15:56 UTC (permalink / raw)
  To: speck

On Mon, 28 May 2018, speck for Paolo Bonzini wrote:
> On 27/05/2018 20:49, speck for Linus Torvalds wrote:
> > But why would you emulate halt/mwait/pause anyway? That sounds insane to 
> > me. The reason you would want exit-on-halt is so that the host can do 
> > something else if a vcpu goes idle, not so that it can just stay in some 
> > emulated idle state.
> 
> When hlt/mwait is emulated, the thread goes to sleep.  When pause is
> emulated, the thread checks if there is another CPU to yield to, but
> otherwise stays running.
> 
> KVM recently grew a new mode where hlt/mwait/pause is passed directly to
> the guest.  It was (partly) contributed by Amazon because that's what
> they're doing in their KVM-based cloud stuff - it does all I/O in custom
> hardware so all interrupts will be VT-d posted interrupts and avoid the
> overhead of the vCPU thread going to sleep and back running.
> 
> There is a problem though.  The only way to know if the guest is in
> hlt/mwait/pause, is to cause a vmexit, e.g. with an IPI, and read the VM
> control state (which you can only do from the CPU that was running it!).
>  So even for the "idle in the guest" case you pretty much have to do
> synchronization.

Right, but ucode assist would be handy there. If the sibling is in
hlt/mwait, then the other could do a vmexit undisturbed, and when the
mwait/hlt resumes it either forces a vmexit if the other sibling is
still in the host, or lets it continue when the other sibling has
reentered guest mode.

Thanks,

	tglx

^ permalink raw reply	[flat|nested] 91+ messages in thread

* [MODERATED] Re: L1D-Fault KVM mitigation
  2018-05-28 15:56                             ` Thomas Gleixner
@ 2018-05-28 17:15                               ` Paolo Bonzini
  0 siblings, 0 replies; 91+ messages in thread
From: Paolo Bonzini @ 2018-05-28 17:15 UTC (permalink / raw)
  To: speck

[-- Attachment #1: Type: text/plain, Size: 1063 bytes --]

On 28/05/2018 17:56, speck for Thomas Gleixner wrote:
>> There is a problem though.  The only way to know if the guest is in
>> hlt/mwait/pause, is to cause a vmexit, e.g. with an IPI, and read the VM
>> control state (which you can only do from the CPU that was running it!).
>>  So even for the "idle in the guest" case you pretty much have to do
>> synchronization.
> Right, but there ucode assist would be handy. If the sibling is in
> hlt/mwait then the other could do a vmexit undisturbed and when the
> mwait/hlt resumes then it either forces a vmexit if the other sibling is
> still in the host or it lets it continue when the other sibling reentered
> guest mode again.

Yes, but I am not sure how microcode can do it since it could
potentially block a sibling forever.

Perhaps some kind of "sibling not in VMX non-root mode" vmexit, to
automate the IPI part of the synchronization, would be useful.  When the
hypervisor receives it, it can decide whether to poll for completion of
the sibling's vmexit, or to go to sleep.

Paolo


^ permalink raw reply	[flat|nested] 91+ messages in thread

* [MODERATED] Encrypted Message
  2018-05-26 19:14                 ` L1D-Fault KVM mitigation Thomas Gleixner
  2018-05-26 20:43                   ` [MODERATED] " Andi Kleen
@ 2018-05-29 19:29                   ` Tim Chen
  2018-05-29 21:14                     ` L1D-Fault KVM mitigation Thomas Gleixner
  1 sibling, 1 reply; 91+ messages in thread
From: Tim Chen @ 2018-05-29 19:29 UTC (permalink / raw)
  To: speck

[-- Warning: decoded text below may be mangled, UTF-8 assumed --]
[-- Attachment #1: Type: text/rfc822-headers; protected-headers="v1", Size: 134 bytes --]

From: Tim Chen <tim.c.chen@linux.intel.com>
To: speck for Thomas Gleixner <speck@linutronix.de>
Subject: Re: L1D-Fault KVM mitigation

[-- Attachment #2: Type: text/plain, Size: 1799 bytes --]

On 05/26/2018 12:14 PM, speck for Thomas Gleixner wrote:
> On Thu, 24 May 2018, speck for Tim Chen wrote:
> 
>>
>> We may need to do the co-scheduling only when VM exit rate is low, and
>> turn off the SMT when VM exit rate becomes too high.
> 
> You cannot do that during runtime. That will destroy placement schemes and
> whatever. The SMT off decision needs to be done at a quiescent moment,
> i.e. before starting VMs.

Taking SMT offline is a bit much and too big a hammer.  Andi and I
thought about having the scheduler force the other thread into idle
instead for the high VM exit rate scenario. Then we don't have to
bother with syncing against the other idle thread.

But we have issues about fairness, as we will be starving the
other run queue.

> 

> Running the same compile single threaded (offlining vCPU1 in the guest)
> increases the time to 107 seconds.
> 
>     107 / 88  = 1.22
> 
> I.e. it's 20% slower than the one using two threads. That means that it is
> the same slowdown as having two threads synchronized (your number).

yes, with compile workload, the HT speedup was mostly eaten up by
overhead.

> 
> So if I take the above example and assume that the overhead of
> synchronization is ~20% then the average vmenter/vmexit time is close to
> 50us.
> 

> 
> So I can see the usefulness for scenarios which David Woodhouse described
> where vCPU and host CPU have a fixed relationship and the guests exit once
> in a while. But that should really be done with ucode assistance which
> avoids all the nasty synchronization hackery more or less completely.

The ucode guys are looking into such possibilities.  It is tough as they
have to work within the constraint of limited ucode headroom.

Thanks.

Tim


^ permalink raw reply	[flat|nested] 91+ messages in thread

* Re: L1D-Fault KVM mitigation
  2018-05-29 19:29                   ` [MODERATED] Encrypted Message Tim Chen
@ 2018-05-29 21:14                     ` Thomas Gleixner
  2018-05-30 16:38                       ` [MODERATED] Encrypted Message Tim Chen
  0 siblings, 1 reply; 91+ messages in thread
From: Thomas Gleixner @ 2018-05-29 21:14 UTC (permalink / raw)
  To: speck

On Tue, 29 May 2018, speck for Tim Chen wrote:
> On 05/26/2018 12:14 PM, speck for Thomas Gleixner wrote:
> > On Thu, 24 May 2018, speck for Tim Chen wrote:
> 
> > > We may need to do the co-scheduling only when VM exit rate is low, and
> > > turn off the SMT when VM exit rate becomes too high.
> > 
> > You cannot do that during runtime. That will destroy placement schemes and
> > whatever. The SMT off decision needs to be done at a quiescent moment,
> > i.e. before starting VMs.

> Taking the SMT offline is a bit much and too big a hammer.

Sorry, that's bullshit. It massively depends on the workload and the
scenario. I've explained it a gazillion times by now that there are enough
workloads which will massively lose with SMT on and the extra overhead. It's
trivial enough to figure that out without implementing all bells and
whistles.

> Andi and I thought about having the scheduler forcing the other thread in
> idle instead for high VM exit rate scenario. We don't have to bother
> about doing sync with the other idle thread.

You still have to make sure that the other idle thread _IS_ idle. It's not
the full synchronization scheme, but it's extra work in a hotpath when the
guest is exit heavy. And you still have the problem of interrupts and
softirqs being served on the 'idle' sibling. It's not that simple.

> But we have issues about fairness, as we will be starving the
> other run queue.

That's more than obvious. And you will create even worse issues because
workloads which have a placement scheme, i.e. vCPU affinities, will have no
chance to migrate to another CPU. Not to talk about wrecking the load
balancer completely.

> > I.e. it's 20% slower than the one using two threads. That means that it is
> > the same slowdown as having two threads synchronized (your number).

> yes, with compile workload, the HT speedup was mostly eaten up by
> overhead.

So where is the point of the exercise?

You will not find a generic solution for this problem ever simply because
the workloads and guest scenarios are too different. There are clearly
scenarios which can benefit, but at the same time there are scenarios which
will be way worse off than with SMT disabled.

I completely understand that Intel wants to avoid the 'disable SMT'
solution by all means, but this cannot be done with something which is
obviously creating more problems than it solves in the first place.

At some point reality has to kick in and you have to admit that there is no
generic solution and the only solution for a lot of use cases will be to
disable SMT. Special workloads like the fully partitioned ones David
mentioned do not need the extra mess all over the place, especially not
when there is ucode assist, at least to the point which fits into the
patch space; some of it really should not take a huge amount of effort,
like the forced sibling vmexit to avoid the whole IPI machinery.

Thanks,

	tglx

^ permalink raw reply	[flat|nested] 91+ messages in thread

* [MODERATED] Encrypted Message
  2018-05-29 21:14                     ` L1D-Fault KVM mitigation Thomas Gleixner
@ 2018-05-30 16:38                       ` Tim Chen
  0 siblings, 0 replies; 91+ messages in thread
From: Tim Chen @ 2018-05-30 16:38 UTC (permalink / raw)
  To: speck

[-- Warning: decoded text below may be mangled, UTF-8 assumed --]
[-- Attachment #1: Type: text/rfc822-headers; protected-headers="v1", Size: 134 bytes --]

From: Tim Chen <tim.c.chen@linux.intel.com>
To: speck for Thomas Gleixner <speck@linutronix.de>
Subject: Re: L1D-Fault KVM mitigation

[-- Attachment #2: Type: text/plain, Size: 1626 bytes --]

On 05/29/2018 02:14 PM, speck for Thomas Gleixner wrote:
> 
>> yes, with compile workload, the HT speedup was mostly eaten up by
>> overhead.
> 
> So where is the point of the exercise?
> 
> You will not find a generic solution for this problem ever simply because
> the workloads and guest scenarios are too different. There are clearly
> scenarios which can benefit, but at the same time there are scenarios which
> will be way worse off than with SMT disabled.
> 
> I completely understand that Intel wants to avoid the 'disable SMT'
> solution by all means, but this cannot be done with something which is
> obvioulsy creating more problems than it solves in the first place.
> 
> At some point reality has to kick in and you have to admit that there is no
> generic solution and the only solution for a lot of use cases will be to
> disable SMT. The solution for special workloads like the fully partitioned
> ones David mentioned do not need the extra mess all over the place
> especially not when there is ucode assist at least to the point which fits
> into the patch space and some of it really should not take a huge amount of
> effort, like the forced sibling vmexit to avoid the whole IPI machinery.
> 

Having to sync on VM entry, on VM exit, and on interrupts to the idle
sibling sucks. Hopefully the ucode guys can come up with something
to provide an option that forces the sibling to vmexit on vmexit,
and on interrupts to the idle sibling. This should cut the sync overhead
in half. Then only VM entry needs to be synced, should we still want to
do co-scheduling.

Thanks.

Tim



^ permalink raw reply	[flat|nested] 91+ messages in thread

* [MODERATED] Encrypted Message
  2019-03-04 17:17     ` [MODERATED] " Josh Poimboeuf
@ 2019-03-06 16:22       ` Jon Masters
  0 siblings, 0 replies; 91+ messages in thread
From: Jon Masters @ 2019-03-06 16:22 UTC (permalink / raw)
  To: speck

[-- Warning: decoded text below may be mangled, UTF-8 assumed --]
[-- Attachment #1: Type: text/rfc822-headers; protected-headers="v1", Size: 117 bytes --]

From: Jon Masters <jcm@redhat.com>
To: speck for Josh Poimboeuf <speck@linutronix.de>
Subject: Re: Encrypted Message

[-- Attachment #2: Type: text/plain, Size: 778 bytes --]

On 3/4/19 12:17 PM, speck for Josh Poimboeuf wrote:
> On Sun, Mar 03, 2019 at 10:58:01PM -0500, speck for Jon Masters wrote:
> 
>> On 3/3/19 8:24 PM, speck for Josh Poimboeuf wrote:
>>
>>> +		if (sched_smt_active() && !boot_cpu_has(X86_BUG_MSBDS_ONLY))
>>> +			pr_warn_once(MDS_MSG_SMT);
>>
>> It's never fully safe to use SMT. I get that if we only had MSBDS then
>> it's unlikely we'll hit the e.g. power state change cases needed to
>> exploit it but I think it would be prudent to display something anyway?
> 
> My understanding is that the idle state changes are mitigated elsewhere
> in the MDS patches, so it should be safe in theory.

Looked at it again. Agree. Sorry about that.

Jon.

-- 
Computer Architect | Sent with my Fedora powered laptop


^ permalink raw reply	[flat|nested] 91+ messages in thread

* [MODERATED] Encrypted Message
  2019-03-05 15:34     ` Thomas Gleixner
@ 2019-03-06 16:21       ` Jon Masters
  0 siblings, 0 replies; 91+ messages in thread
From: Jon Masters @ 2019-03-06 16:21 UTC (permalink / raw)
  To: speck

[-- Warning: decoded text below may be mangled, UTF-8 assumed --]
[-- Attachment #1: Type: text/rfc822-headers; protected-headers="v1", Size: 118 bytes --]

From: Jon Masters <jcm@redhat.com>
To: speck for Thomas Gleixner <speck@linutronix.de>
Subject: Re: Encrypted Message

[-- Attachment #2: Type: text/plain, Size: 990 bytes --]

On 3/5/19 10:34 AM, speck for Thomas Gleixner wrote:
> On Mon, 4 Mar 2019, speck for Jon Masters wrote:
> 
>> On 3/1/19 4:47 PM, speck for Thomas Gleixner wrote:
>>>       if (static_branch_unlikely(&vmx_l1d_should_flush))
>>>               vmx_l1d_flush(vcpu);
>>> +     else if (static_branch_unlikely(&mds_user_clear))
>>> +             mds_clear_cpu_buffers();
>>
>> Does this cover the case where we have older ucode installed that does
>> L1D flush but NOT the MD_CLEAR? I'm about to go check to see if there's
>> logic handling this but wanted to call it out.
> 
> If no updated microcode is available then it's pretty irrelevant which code
> path you take. None of them will mitigate MDS.

You're right. My fear was we'd have some microcode that mitigated L1D
without implied MD clear but also did MDS. I was incorrect - all ucode
that will be publicly released will have both properties.

Jon.

-- 
Computer Architect | Sent with my Fedora powered laptop


^ permalink raw reply	[flat|nested] 91+ messages in thread

* [MODERATED] Encrypted Message
  2019-03-05 22:31     ` Andrew Cooper
@ 2019-03-06 16:18       ` Jon Masters
  0 siblings, 0 replies; 91+ messages in thread
From: Jon Masters @ 2019-03-06 16:18 UTC (permalink / raw)
  To: speck

[-- Warning: decoded text below may be mangled, UTF-8 assumed --]
[-- Attachment #1: Type: text/rfc822-headers; protected-headers="v1", Size: 121 bytes --]

From: Jon Masters <jcm@redhat.com>
To: speck for Andrew Cooper <speck@linutronix.de>
Subject: Re: Starting to go public?

[-- Attachment #2: Type: text/plain, Size: 1380 bytes --]

On 3/5/19 5:31 PM, speck for Andrew Cooper wrote:
> On 05/03/2019 20:36, speck for Jiri Kosina wrote:
>> On Tue, 5 Mar 2019, speck for Andrew Cooper wrote:
>>
>>>> Looks like the papers are starting to leak:
>>>>
>>>>    https://arxiv.org/pdf/1903.00446.pdf
>>>>
>>>> yes, yes, a lot of the attack seems to be about rowhammer, but the
>>>> "spoiler" part looks like MDS.
>>> So Intel was aware of that paper, but wasn't expecting it to go public
>>> today.
>>>
>>> From their point of view, it is a traditional timing sidechannel on a
>>> piece of the pipeline (which happens to be component which exists for
>>> speculative memory disambiguation).
>>>
>>> There are no proposed changes to the MDS timeline at this point.
>> So this is not the paper that caused the panic fearing that PSF might leak 
>> earlier than the rest of the issues in mid-february (which few days later 
>> Intel claimed to have succesfully negotiated with the researches not to 
>> publish before the CRD)?
> 
> Correct.
> 
> The incident you are referring to is a researcher who definitely found
> PSF, contacted Intel and was initially displeased at the proposed embargo.

Indeed. There are at least three different teams with papers that read
on MDS, and all of them are holding to the embargo.

Jon.

-- 
Computer Architect | Sent with my Fedora powered laptop


^ permalink raw reply	[flat|nested] 91+ messages in thread

* [MODERATED] Encrypted Message
  2019-03-05 16:43 [MODERATED] Starting to go public? Linus Torvalds
  2019-03-05 17:02 ` [MODERATED] " Andrew Cooper
@ 2019-03-05 17:10 ` Jon Masters
  1 sibling, 0 replies; 91+ messages in thread
From: Jon Masters @ 2019-03-05 17:10 UTC (permalink / raw)
  To: speck

[-- Warning: decoded text below may be mangled, UTF-8 assumed --]
[-- Attachment #1: Type: text/rfc822-headers; protected-headers="v1", Size: 135 bytes --]

From: Jon Masters <jcm@redhat.com>
To: speck for Linus Torvalds <speck@linutronix.de>
Subject: NOT PUBLIC - Re: Starting to go public?

[-- Attachment #2: Type: text/plain, Size: 796 bytes --]

On 3/5/19 11:43 AM, speck for Linus Torvalds wrote:
> Looks like the papers are starting to leak:
> 
>    https://arxiv.org/pdf/1903.00446.pdf
> 
> yes, yes, a lot of the attack seems to be about rowhammer, but the
> "spoiler" part looks like MDS.

It's not, but it is close to finding PSF behavior. The thing they found
is described separately in one of Intel's original store patents. So we
are at risk but should not panic.

I've spoken with several researchers sitting on MDS papers and confirmed
that they are NOT concerned at this stage. Of course everyone is
carefully watching and that's why we need to have contingency. People
will start looking in this area (I know of three teams doing so) now.

Jon.

-- 
Computer Architect | Sent with my Fedora powered laptop


^ permalink raw reply	[flat|nested] 91+ messages in thread

* [MODERATED] Encrypted Message
  2019-03-04  7:06     ` Jon Masters
@ 2019-03-04  8:12       ` Jon Masters
  0 siblings, 0 replies; 91+ messages in thread
From: Jon Masters @ 2019-03-04  8:12 UTC (permalink / raw)
  To: speck

[-- Warning: decoded text below may be mangled, UTF-8 assumed --]
[-- Attachment #1: Type: text/rfc822-headers; protected-headers="v1", Size: 126 bytes --]

From: Jon Masters <jcm@redhat.com>
To: speck for Jon Masters <speck@linutronix.de>
Subject: Re: [patch V6 08/14] MDS basics 8

[-- Attachment #2: Type: text/plain, Size: 1075 bytes --]

On 3/4/19 2:06 AM, speck for Jon Masters wrote:
> On 3/4/19 1:57 AM, speck for Jon Masters wrote:
>> On 3/1/19 4:47 PM, speck for Thomas Gleixner wrote:
>>>  	if (static_branch_unlikely(&vmx_l1d_should_flush))
>>>  		vmx_l1d_flush(vcpu);
>>> +	else if (static_branch_unlikely(&mds_user_clear))
>>> +		mds_clear_cpu_buffers();
>>
>> Does this cover the case where we have older ucode installed that does
>> L1D flush but NOT the MD_CLEAR? I'm about to go check to see if there's
>> logic handling this but wanted to call it out.
> 
> Aside from the above question, I've reviewed all of the patches
> extensively at this point. Feel free to add a Reviewed-by or Tested-by
> according to your preference. I've a bunch of further tests running,
> including on AMD platforms just so to check nothing broke with those
> platforms that are not susceptible to MDS.

Running fine on AMD platform here and reports correctly:

$ cat /sys/devices/system/cpu/vulnerabilities/mds
Not affected

Jon.

-- 
Computer Architect | Sent with my Fedora powered laptop


^ permalink raw reply	[flat|nested] 91+ messages in thread

* [MODERATED] Encrypted Message
  2019-03-04  7:30   ` [MODERATED] Re: [PATCH RFC 1/4] 1 Greg KH
@ 2019-03-04  7:45     ` Jon Masters
  0 siblings, 0 replies; 91+ messages in thread
From: Jon Masters @ 2019-03-04  7:45 UTC (permalink / raw)
  To: speck

[-- Warning: decoded text below may be mangled, UTF-8 assumed --]
[-- Attachment #1: Type: text/rfc822-headers; protected-headers="v1", Size: 110 bytes --]

From: Jon Masters <jcm@redhat.com>
To: speck for Greg KH <speck@linutronix.de>
Subject: Re: [PATCH RFC 1/4] 1

[-- Attachment #2: Type: text/plain, Size: 1867 bytes --]

On 3/4/19 2:30 AM, speck for Greg KH wrote:
> On Sun, Mar 03, 2019 at 07:23:22PM -0600, speck for Josh Poimboeuf wrote:
>> From: Josh Poimboeuf <jpoimboe@redhat.com>
>> Subject: [PATCH RFC 1/4] x86/speculation/mds: Add mds=full,nosmt cmdline
>>  option
>>
>> Add the mds=full,nosmt cmdline option.  This is like mds=full, but with
>> SMT disabled if the CPU is vulnerable.
>>
>> Signed-off-by: Josh Poimboeuf <jpoimboe@redhat.com>
>> ---
>>  Documentation/admin-guide/hw-vuln/mds.rst       |  3 +++
>>  Documentation/admin-guide/kernel-parameters.txt |  6 ++++--
>>  arch/x86/kernel/cpu/bugs.c                      | 10 ++++++++++
>>  3 files changed, 17 insertions(+), 2 deletions(-)
>>
>> diff --git a/Documentation/admin-guide/hw-vuln/mds.rst b/Documentation/admin-guide/hw-vuln/mds.rst
>> index 1de29d28903d..244ab47d1fb3 100644
>> --- a/Documentation/admin-guide/hw-vuln/mds.rst
>> +++ b/Documentation/admin-guide/hw-vuln/mds.rst
>> @@ -260,6 +260,9 @@ time with the option "mds=". The valid arguments for this option are:
>>  
>>  		It does not automatically disable SMT.
>>  
>> +  full,nosmt	The same as mds=full, with SMT disabled on vulnerable
>> +		CPUs.  This is the complete mitigation.
> 
> While I understand the intention, the number of different combinations
> we are "offering" to userspace here is huge, and everyone is going to be
> confused as to what to do.  If we really think/say that SMT is a major
> issue for this, why don't we just have "full" disable SMT?

Frankly, it ought to, for safety (SMT can't be made safe). The reason cited
for not doing so (Thomas and Linus can speak up on this part) was
upgrades vs new installs. The concern was not to break existing folks by
losing half their logical CPU count when upgrading a kernel.

Jon.

-- 
Computer Architect | Sent with my Fedora powered laptop


^ permalink raw reply	[flat|nested] 91+ messages in thread

* [MODERATED] Encrypted Message
  2019-03-04  6:57   ` [MODERATED] Encrypted Message Jon Masters
@ 2019-03-04  7:06     ` Jon Masters
  2019-03-04  8:12       ` Jon Masters
  2019-03-05 15:34     ` Thomas Gleixner
  1 sibling, 1 reply; 91+ messages in thread
From: Jon Masters @ 2019-03-04  7:06 UTC (permalink / raw)
  To: speck

[-- Warning: decoded text below may be mangled, UTF-8 assumed --]
[-- Attachment #1: Type: text/rfc822-headers; protected-headers="v1", Size: 126 bytes --]

From: Jon Masters <jcm@redhat.com>
To: speck for Jon Masters <speck@linutronix.de>
Subject: Re: [patch V6 08/14] MDS basics 8

[-- Attachment #2: Type: text/plain, Size: 877 bytes --]

On 3/4/19 1:57 AM, speck for Jon Masters wrote:
> On 3/1/19 4:47 PM, speck for Thomas Gleixner wrote:
>>  	if (static_branch_unlikely(&vmx_l1d_should_flush))
>>  		vmx_l1d_flush(vcpu);
>> +	else if (static_branch_unlikely(&mds_user_clear))
>> +		mds_clear_cpu_buffers();
> 
> Does this cover the case where we have older ucode installed that does
> L1D flush but NOT the MD_CLEAR? I'm about to go check to see if there's
> logic handling this but wanted to call it out.

Aside from the above question, I've reviewed all of the patches
extensively at this point. Feel free to add a Reviewed-by or Tested-by
according to your preference. I've a bunch of further tests running,
including on AMD platforms just so to check nothing broke with those
platforms that are not susceptible to MDS.

Jon.

-- 
Computer Architect | Sent with my Fedora powered laptop


^ permalink raw reply	[flat|nested] 91+ messages in thread

* [MODERATED] Encrypted Message
  2019-03-01 21:47 ` [patch V6 08/14] MDS basics 8 Thomas Gleixner
@ 2019-03-04  6:57   ` Jon Masters
  2019-03-04  7:06     ` Jon Masters
  2019-03-05 15:34     ` Thomas Gleixner
  0 siblings, 2 replies; 91+ messages in thread
From: Jon Masters @ 2019-03-04  6:57 UTC (permalink / raw)
  To: speck

[-- Warning: decoded text below may be mangled, UTF-8 assumed --]
[-- Attachment #1: Type: text/rfc822-headers; protected-headers="v1", Size: 130 bytes --]

From: Jon Masters <jcm@redhat.com>
To: speck for Thomas Gleixner <speck@linutronix.de>
Subject: Re: [patch V6 08/14] MDS basics 8


On 3/1/19 4:47 PM, speck for Thomas Gleixner wrote:
>  	if (static_branch_unlikely(&vmx_l1d_should_flush))
>  		vmx_l1d_flush(vcpu);
> +	else if (static_branch_unlikely(&mds_user_clear))
> +		mds_clear_cpu_buffers();

Does this cover the case where we have older ucode installed that does
L1D flush but NOT the MD_CLEAR? I'm about to go check to see if there's
logic handling this but wanted to call it out.

Jon.

-- 
Computer Architect | Sent with my Fedora powered laptop


^ permalink raw reply	[flat|nested] 91+ messages in thread

* [MODERATED] Encrypted Message
  2019-03-01 21:47 ` [patch V6 10/14] MDS basics 10 Thomas Gleixner
@ 2019-03-04  6:45   ` Jon Masters
  0 siblings, 0 replies; 91+ messages in thread
From: Jon Masters @ 2019-03-04  6:45 UTC (permalink / raw)
  To: speck


From: Jon Masters <jcm@redhat.com>
To: speck for Thomas Gleixner <speck@linutronix.de>
Subject: Re: [patch V6 10/14] MDS basics 10


On 3/1/19 4:47 PM, speck for Thomas Gleixner wrote:

> +	/*
> +	 * Enable the idle clearing on CPUs which are affected only by
> +	 * MDBDS and not any other MDS variant. The other variants cannot
           ^^^^^
           MSBDS


-- 
Computer Architect | Sent with my Fedora powered laptop


^ permalink raw reply	[flat|nested] 91+ messages in thread

* [MODERATED] Encrypted Message
  2019-03-01 21:47 ` [patch V6 06/14] MDS basics 6 Thomas Gleixner
@ 2019-03-04  6:28   ` Jon Masters
  0 siblings, 0 replies; 91+ messages in thread
From: Jon Masters @ 2019-03-04  6:28 UTC (permalink / raw)
  To: speck


From: Jon Masters <jcm@redhat.com>
To: speck for Thomas Gleixner <speck@linutronix.de>
Subject: Re: [patch V6 06/14] MDS basics 6


On 3/1/19 4:47 PM, speck for Thomas Gleixner wrote:
> Provide a inline function with the assembly magic. The argument of the VERW
> instruction must be a memory operand as documented:
> 
>   "MD_CLEAR enumerates that the memory-operand variant of VERW (for
>    example, VERW m16) has been extended to also overwrite buffers affected
>    by MDS. This buffer overwriting functionality is not guaranteed for the
>    register operand variant of VERW."
> 
> Documentation also recommends to use a writable data segment selector:
> 
>   "The buffer overwriting occurs regardless of the result of the VERW
>    permission check, as well as when the selector is null or causes a
>    descriptor load segment violation. However, for lowest latency we
>    recommend using a selector that indicates a valid writable data
>    segment."

Note that we raised this again with Intel last week amid Andrew's
results and they are going to get back to us if this guidance changes as
a result of further measurements on their end. It's a few cycles
difference in the Coffeelake case, but it could always be higher.

Jon.

-- 
Computer Architect | Sent with my Fedora powered laptop


^ permalink raw reply	[flat|nested] 91+ messages in thread

* [MODERATED] Encrypted Message
  2019-03-01 21:47 ` [patch V6 12/14] MDS basics 12 Thomas Gleixner
@ 2019-03-04  5:47   ` Jon Masters
  0 siblings, 0 replies; 91+ messages in thread
From: Jon Masters @ 2019-03-04  5:47 UTC (permalink / raw)
  To: speck


From: Jon Masters <jcm@redhat.com>
To: speck for Thomas Gleixner <speck@linutronix.de>
Subject: Re: [patch V6 12/14] MDS basics 12


On 3/1/19 4:47 PM, speck for Thomas Gleixner wrote:

> Subject: [patch V6 12/14] x86/speculation/mds: Add mitigation mode VMWERV
> From: Thomas Gleixner <tglx@linutronix.de>
> 
> In virtualized environments it can happen that the host has the microcode
> update which utilizes the VERW instruction to clear CPU buffers, but the
> hypervisor is not yet updated to expose the X86_FEATURE_MD_CLEAR CPUID bit
> to guests.
> 
> Introduce an internal mitigation mode VMWERV which enables the invocation
> of the CPU buffer clearing even if X86_FEATURE_MD_CLEAR is not set. If the
> system has no updated microcode this results in a pointless execution of
> the VERW instruction wasting a few CPU cycles. If the microcode is updated,
> but not exposed to a guest then the CPU buffers will be cleared.
> 
> That said: Virtual Machines Will Eventually Receive Vaccine

The effect of this patch, currently, is that a (bare metal) machine
without updated ucode will print the following:

[    1.576602] MDS: Vulnerable: Clear CPU buffers attempted, no microcode

The intention of the patch is to say "hey, you might be on a VM, so
we'll try anyway in case we didn't get told you had MD_CLEAR". But the
effect on bare metal might be ambiguous. It's reasonable for someone
else reading it to assume we might be using a software sequence to try flushing.

Perhaps the wording should convey something like:

"MDS: Vulnerable: Clear CPU buffers may not work, no microcode"

Jon.

-- 
Computer Architect | Sent with my Fedora powered laptop


^ permalink raw reply	[flat|nested] 91+ messages in thread

* [MODERATED] Encrypted Message
  2019-03-01 21:47 [patch V6 00/14] MDS basics 0 Thomas Gleixner
                   ` (3 preceding siblings ...)
  2019-03-01 21:47 ` [patch V6 12/14] MDS basics 12 Thomas Gleixner
@ 2019-03-04  5:30 ` Jon Masters
  4 siblings, 0 replies; 91+ messages in thread
From: Jon Masters @ 2019-03-04  5:30 UTC (permalink / raw)
  To: speck


From: Jon Masters <jcm@redhat.com>
To: speck for Thomas Gleixner <speck@linutronix.de>
Subject: Re: [patch V6 00/14] MDS basics 0


On 3/1/19 4:47 PM, speck for Thomas Gleixner wrote:
> Changes vs. V5:
> 
>   - Fix tools/ build (Josh)
> 
>   - Dropped the AIRMONT_MID change as it needs confirmation from Intel
> 
>   - Made the consolidated whitelist more readable and correct
> 
>   - Added the MSBDS only quirk for XEON PHI, made the idle flush
>     depend on it and updated the sysfs output accordingly.
> 
>   - Fixed the protection matrix in the admin documentation and clarified
>     the SMT situation vs. MSBDS only.
> 
>   - Updated the KVM/VMX changelog.
> 
> Delta patch against V5 below.
> 
> Available from git:
> 
>    cvs.ou.linutronix.de:linux/speck/linux WIP.mds
> 
> The linux-4.20.y, linux-4.19.y and linux-4.14.y branches are updated as
> well and contain the untested backports of the pile for reference.
> 
> I'll send git bundles of the pile as well.

Tested on Coffeelake with updated ucode successfully:

processor       : 0
vendor_id       : GenuineIntel
cpu family      : 6
model           : 158
model name      : Intel(R) Core(TM) i7-8086K CPU @ 4.00GHz
stepping        : 10
microcode       : 0xae

[jcm@stephen ~]$ dmesg|grep MDS
[    1.633165] MDS: Mitigation: Clear CPU buffers

[jcm@stephen ~]$ cat /sys/devices/system/cpu/vulnerabilities/mds
Mitigation: Clear CPU buffers; SMT vulnerable

Jon.

-- 
Computer Architect | Sent with my Fedora powered laptop


^ permalink raw reply	[flat|nested] 91+ messages in thread

* [MODERATED] Encrypted Message
  2019-03-04  1:25 ` [MODERATED] [PATCH RFC 4/4] 4 Josh Poimboeuf
@ 2019-03-04  4:07   ` Jon Masters
  0 siblings, 0 replies; 91+ messages in thread
From: Jon Masters @ 2019-03-04  4:07 UTC (permalink / raw)
  To: speck


From: Jon Masters <jcm@redhat.com>
To: speck for Josh Poimboeuf <speck@linutronix.de>
Subject: Re: [PATCH RFC 4/4] 4


On 3/3/19 8:25 PM, speck for Josh Poimboeuf wrote:
> From: Josh Poimboeuf <jpoimboe@redhat.com>
> Subject: [PATCH RFC 4/4] x86/speculation: Add 'cpu_spec_mitigations=' cmdline
>  options
> 
> Keeping track of the number of mitigations for all the CPU speculation
> bugs has become overwhelming for many users.  It's getting more and more
> complicated to decide what mitigations are needed for a given
> architecture.
> 
> Most users fall into a few basic categories:
> 
> - want all mitigations off;
> 
> - want all reasonable mitigations on, with SMT enabled even if it's
>   vulnerable; or
> 
> - want all reasonable mitigations on, with SMT disabled if vulnerable.
> 
> Define a set of curated, arch-independent options, each of which is an
> aggregation of existing options:
> 
> - cpu_spec_mitigations=off: Disable all mitigations.
> 
> - cpu_spec_mitigations=auto: [default] Enable all the default mitigations,
>   but leave SMT enabled, even if it's vulnerable.
> 
> - cpu_spec_mitigations=auto,nosmt: Enable all the default mitigations,
>   disabling SMT if needed by a mitigation.
> 
> See the documentation for more details.

Looks good. There's an effort to upstream mitigation controls for
arm64, but that's not in place yet. They'll want to wire that up later. I
actually had missed the s390x etokens work so that was fun to see here.

Jon.

-- 
Computer Architect | Sent with my Fedora powered laptop


^ permalink raw reply	[flat|nested] 91+ messages in thread

* [MODERATED] Encrypted Message
  2019-03-04  1:24 ` [MODERATED] [PATCH RFC 3/4] 3 Josh Poimboeuf
@ 2019-03-04  3:58   ` Jon Masters
  2019-03-04 17:17     ` [MODERATED] " Josh Poimboeuf
  0 siblings, 1 reply; 91+ messages in thread
From: Jon Masters @ 2019-03-04  3:58 UTC (permalink / raw)
  To: speck


From: Jon Masters <jcm@redhat.com>
To: speck for Josh Poimboeuf <speck@linutronix.de>
Subject: Re: [PATCH RFC 3/4] 3


On 3/3/19 8:24 PM, speck for Josh Poimboeuf wrote:

> +		if (sched_smt_active() && !boot_cpu_has(X86_BUG_MSBDS_ONLY))
> +			pr_warn_once(MDS_MSG_SMT);

It's never fully safe to use SMT. I get that if we only have MSBDS then
it's unlikely we'll hit the (e.g.) power state change cases needed to
exploit it, but I think it would be prudent to display something anyway?

Jon.

-- 
Computer Architect | Sent with my Fedora powered laptop


^ permalink raw reply	[flat|nested] 91+ messages in thread

* [MODERATED] Encrypted Message
  2019-03-04  1:23 ` [MODERATED] [PATCH RFC 1/4] 1 Josh Poimboeuf
@ 2019-03-04  3:55   ` Jon Masters
  2019-03-04  7:30   ` [MODERATED] Re: [PATCH RFC 1/4] 1 Greg KH
  1 sibling, 0 replies; 91+ messages in thread
From: Jon Masters @ 2019-03-04  3:55 UTC (permalink / raw)
  To: speck


From: Jon Masters <jcm@redhat.com>
To: speck for Josh Poimboeuf <speck@linutronix.de>
Subject: Re: [PATCH RFC 1/4] 1


On 3/3/19 8:23 PM, speck for Josh Poimboeuf wrote:

> diff --git a/arch/x86/kernel/cpu/bugs.c b/arch/x86/kernel/cpu/bugs.c
> index e11654f93e71..0c71ab0d57e3 100644
> --- a/arch/x86/kernel/cpu/bugs.c
> +++ b/arch/x86/kernel/cpu/bugs.c
> @@ -221,6 +221,7 @@ static void x86_amd_ssb_disable(void)
>  
>  /* Default mitigation for L1TF-affected CPUs */
>  static enum mds_mitigations mds_mitigation __ro_after_init = MDS_MITIGATION_FULL;
> +static bool mds_nosmt __ro_after_init = false;
>  
>  static const char * const mds_strings[] = {
>  	[MDS_MITIGATION_OFF]	= "Vulnerable",
> @@ -238,8 +239,13 @@ static void mds_select_mitigation(void)
>  	if (mds_mitigation == MDS_MITIGATION_FULL) {
>  		if (!boot_cpu_has(X86_FEATURE_MD_CLEAR))
>  			mds_mitigation = MDS_MITIGATION_VMWERV;
> +
>  		static_branch_enable(&mds_user_clear);
> +
> +		if (mds_nosmt && !boot_cpu_has(X86_BUG_MSBDS_ONLY))
> +			cpu_smt_disable(false);

Is there some logic missing here to disable SMT?

Jon.

-- 
Computer Architect | Sent with my Fedora powered laptop


^ permalink raw reply	[flat|nested] 91+ messages in thread

* [MODERATED] Encrypted Message
  2019-03-01 20:58     ` [MODERATED] Encrypted Message Jon Masters
@ 2019-03-01 22:14       ` Jon Masters
  0 siblings, 0 replies; 91+ messages in thread
From: Jon Masters @ 2019-03-01 22:14 UTC (permalink / raw)
  To: speck


From: Jon Masters <jcm@redhat.com>
To: speck for Jon Masters <speck@linutronix.de>
Subject: Re: [patch V4 04/11] x86/speculation/mds: Add mds_clear_cpu_buffer()


On 3/1/19 3:58 PM, speck for Jon Masters wrote:
> On 2/26/19 9:19 AM, speck for Josh Poimboeuf wrote:
> 
>> On Fri, Feb 22, 2019 at 11:24:22PM +0100, speck for Thomas Gleixner wrote:
>>> +MFBDS leaks Fill Buffer Entries. Fill buffers are used internally to manage
>>> +L1 miss situations and to hold data which is returned or sent in response
>>> +to a memory or I/O operation. Fill buffers can forward data to a load
>>> +operation and also write data to the cache. When the fill buffer is
>>> +deallocated it can retain the stale data of the preceding operations which
>>> +can then be forwarded to a faulting or assisting load operation, which can
>>> +be exploited under certain conditions. Fill buffers are shared between
>>> +Hyper-Threads so cross thread leakage is possible.
> 
> The fill buffers sit opposite the L1D$ and participate in coherency
> directly. They supply data directly to the load store units. Here's the
> internal summary I wrote (feel free to use any of it that is useful):
> 
> "Intel processors utilize fill buffers to perform loads of data when a
> miss occurs in the Level 1 data cache. The fill buffer allows the
> processor to implement a non-blocking cache, continuing with other
> operations while the necessary cache data “line” is loaded from a higher
> level cache or from memory. It also allows the result of the fill to be
> forwarded directly to the EU (Execution Unit) requiring the load,
> without waiting for it to be written into the L1 Data Cache.
> 
> A load operation is not decoupled in the same way that a store is, but
> it does involve an AGU (Address Generation Unit) operation. If the AGU
> generates a fault (#PF, etc.) or an assist (A/D bits) then the classical
> Intel design would block the load and later reissue it. In contemporary
> designs, it instead allows subsequent speculation operations to
> temporarily see a forwarded data value from the fill buffer slot prior
> to the load actually taking place. Thus it is possible to read data that
> was recently accessed by another thread, if the fill buffer entry is not
> reused.
> 
> It is this attack that allows cross-thread SMT leakage and breaks HT
> without recourse other than to disable it or to implement core
> scheduling in the Linux kernel.
> 
> Variants of this include loads that cross cache or page boundaries due
> to further optimizations in Intel’s implementation. For example, Intel
> incorporate logic to guess at address generation prior to determining
> whether it crosses such a boundary (covered in US5335333A) and will
> forward this to the TLB/load logic prior to resolving the full address.
> They will retry the load by re-issuing uops in the case of a cross
> cacheline/page boundary but in that case will leak state as well."

Btw, I've various reproducers here that I'm happy to share if useful
with the right folks. Thomas and Linus should already have my IFU one
for later testing of that, I've also e.g. an FBBF. Currently it just
spews whatever it sees from the other threads, but in the next few days
I'll have it cleaned up to send/receive specific messages - then I can
just wrap it with a bow so it can print yes/no vulnerable.

Ping if you have a need for a repro (keybase/email) and I'll go through
our process for sharing as appropriate.

Jon.

-- 
Computer Architect | Sent with my Fedora powered laptop


^ permalink raw reply	[flat|nested] 91+ messages in thread

* [MODERATED] Encrypted Message
  2019-02-26 14:19   ` [MODERATED] " Josh Poimboeuf
@ 2019-03-01 20:58     ` Jon Masters
  2019-03-01 22:14       ` Jon Masters
  0 siblings, 1 reply; 91+ messages in thread
From: Jon Masters @ 2019-03-01 20:58 UTC (permalink / raw)
  To: speck


From: Jon Masters <jcm@redhat.com>
To: speck for Josh Poimboeuf <speck@linutronix.de>
Subject: Re: [patch V4 04/11] x86/speculation/mds: Add mds_clear_cpu_buffer()


On 2/26/19 9:19 AM, speck for Josh Poimboeuf wrote:

> On Fri, Feb 22, 2019 at 11:24:22PM +0100, speck for Thomas Gleixner wrote:
>> +MFBDS leaks Fill Buffer Entries. Fill buffers are used internally to manage
>> +L1 miss situations and to hold data which is returned or sent in response
>> +to a memory or I/O operation. Fill buffers can forward data to a load
>> +operation and also write data to the cache. When the fill buffer is
>> +deallocated it can retain the stale data of the preceding operations which
>> +can then be forwarded to a faulting or assisting load operation, which can
>> +be exploited under certain conditions. Fill buffers are shared between
>> +Hyper-Threads so cross thread leakage is possible.

The fill buffers sit opposite the L1D$ and participate in coherency
directly. They supply data directly to the load store units. Here's the
internal summary I wrote (feel free to use any of it that is useful):

"Intel processors utilize fill buffers to perform loads of data when a
miss occurs in the Level 1 data cache. The fill buffer allows the
processor to implement a non-blocking cache, continuing with other
operations while the necessary cache data “line” is loaded from a higher
level cache or from memory. It also allows the result of the fill to be
forwarded directly to the EU (Execution Unit) requiring the load,
without waiting for it to be written into the L1 Data Cache.

A load operation is not decoupled in the same way that a store is, but
it does involve an AGU (Address Generation Unit) operation. If the AGU
generates a fault (#PF, etc.) or an assist (A/D bits) then the classical
Intel design would block the load and later reissue it. In contemporary
designs, it instead allows subsequent speculation operations to
temporarily see a forwarded data value from the fill buffer slot prior
to the load actually taking place. Thus it is possible to read data that
was recently accessed by another thread, if the fill buffer entry is not
reused.

It is this attack that allows cross-thread SMT leakage and breaks HT
without recourse other than to disable it or to implement core
scheduling in the Linux kernel.

Variants of this include loads that cross cache or page boundaries due
to further optimizations in Intel’s implementation. For example, Intel
incorporate logic to guess at address generation prior to determining
whether it crosses such a boundary (covered in US5335333A) and will
forward this to the TLB/load logic prior to resolving the full address.
They will retry the load by re-issuing uops in the case of a cross
cacheline/page boundary but in that case will leak state as well."

Jon.

-- 
Computer Architect | Sent with my Fedora powered laptop


^ permalink raw reply	[flat|nested] 91+ messages in thread

* [MODERATED] Encrypted Message
  2019-02-25 16:30   ` [MODERATED] " Greg KH
@ 2019-02-25 16:41     ` Jon Masters
  0 siblings, 0 replies; 91+ messages in thread
From: Jon Masters @ 2019-02-25 16:41 UTC (permalink / raw)
  To: speck


From: Jon Masters <jcm@redhat.com>
To: speck for Greg KH <speck@linutronix.de>
Subject: Re: [PATCH v6 10/43] MDSv6


On 2/25/19 11:30 AM, speck for Greg KH wrote:

>> +BPF could attack the rest of the kernel if it can successfully
>> +measure side channel side effects.
> 
> Can it do such a measurement?

The researchers involved in MDS are actively working on an exploit using
BPF as well, so I expect we'll know soon. My assumption is "yes".

Jon.

-- 
Computer Architect | Sent with my Fedora powered laptop


^ permalink raw reply	[flat|nested] 91+ messages in thread

* [MODERATED] Encrypted Message
  2019-02-25 16:00           ` [MODERATED] " Greg KH
@ 2019-02-25 16:19             ` Jon Masters
  0 siblings, 0 replies; 91+ messages in thread
From: Jon Masters @ 2019-02-25 16:19 UTC (permalink / raw)
  To: speck


From: Jon Masters <jcm@redhat.com>
To: speck for Greg KH <speck@linutronix.de>
Subject: Re: Encrypted Message


On 2/25/19 11:00 AM, speck for Greg KH wrote:
> On Mon, Feb 25, 2019 at 10:52:30AM -0500, speck for Jon Masters wrote:
>> From: Jon Masters <jcm@redhat.com>
>> To: speck for Greg KH <speck@linutronix.de>
>> Subject: Re: [PATCH v6 31/43] MDSv6
> 
>> On 2/25/19 10:49 AM, speck for Greg KH wrote:
>>> On Mon, Feb 25, 2019 at 07:34:11AM -0800, speck for Andi Kleen wrote:
>>
>>
>>>> However I will probably not be able to write a detailed
>>>> description for each of the interrupt handlers changed because
>>>> there are just too many.
>>>
>>> Then how do you expect each subsystem / driver author to know if this is
>>> an acceptable change or not?  How do you expect to educate driver
>>> authors to have them determine if they need to do this on their new
>>> drivers or not?  Are you going to hand-audit each new driver that gets
>>> added to the kernel for forever?
>>>
>>> Without this type of information, this seems like a futile exercise.
>>
>> Forgive me if I'm being too cautious here, but it seems to make most
>> sense to have the basic MDS infrastructure in place at unembargo. Unless
>> it's very clear how the auto stuff can be safe, and the audit
>> comprehensive, I wonder if that shouldn't just be done after.
> 
> I thought that was what Thomas's patchset provided and is what was
> alluded to in patch 00/43 of this series.

Indeed. I'm asking whether we're trying to figure out the "auto" stuff
as well before unembargo or is the other discussion just for planning?

Jon.

-- 
Computer Architect | Sent with my Fedora powered laptop


^ permalink raw reply	[flat|nested] 91+ messages in thread

* [MODERATED] Encrypted Message
  2019-02-25 15:49       ` Greg KH
@ 2019-02-25 15:52         ` Jon Masters
  2019-02-25 16:00           ` [MODERATED] " Greg KH
  0 siblings, 1 reply; 91+ messages in thread
From: Jon Masters @ 2019-02-25 15:52 UTC (permalink / raw)
  To: speck


From: Jon Masters <jcm@redhat.com>
To: speck for Greg KH <speck@linutronix.de>
Subject: Re: [PATCH v6 31/43] MDSv6


On 2/25/19 10:49 AM, speck for Greg KH wrote:
> On Mon, Feb 25, 2019 at 07:34:11AM -0800, speck for Andi Kleen wrote:


>> However I will probably not be able to write a detailed
>> description for each of the interrupt handlers changed because
>> there are just too many.
> 
> Then how do you expect each subsystem / driver author to know if this is
> an acceptable change or not?  How do you expect to educate driver
> authors to have them determine if they need to do this on their new
> drivers or not?  Are you going to hand-audit each new driver that gets
> added to the kernel for forever?
> 
> Without this type of information, this seems like a futile exercise.

Forgive me if I'm being too cautious here, but it seems to make most
sense to have the basic MDS infrastructure in place at unembargo. Unless
it's very clear how the auto stuff can be safe, and the audit
comprehensive, I wonder if that shouldn't just be done after.

Jon.

-- 
Computer Architect | Sent with my Fedora powered laptop


^ permalink raw reply	[flat|nested] 91+ messages in thread

* [MODERATED] Encrypted Message
  2019-02-21 23:44 ` [patch V3 4/9] MDS basics 4 Thomas Gleixner
@ 2019-02-22  7:45   ` Jon Masters
  0 siblings, 0 replies; 91+ messages in thread
From: Jon Masters @ 2019-02-22  7:45 UTC (permalink / raw)
  To: speck


From: Jon Masters <jcm@redhat.com>
To: speck for Thomas Gleixner <speck@linutronix.de>
Subject: Re: [patch V3 4/9] MDS basics 4


On 2/21/19 6:44 PM, speck for Thomas Gleixner wrote:
> +#include <asm/segment.h>
> +
> +/**
> + * mds_clear_cpu_buffers - Mitigation for MDS vulnerability
> + *
> + * This uses the otherwise unused and obsolete VERW instruction in
> + * combination with microcode which triggers a CPU buffer flush when the
> + * instruction is executed.
> + */
> +static inline void mds_clear_cpu_buffers(void)
> +{
> +	static const u16 ds = __KERNEL_DS;

Dunno if it's worth documenting that using a specifically valid segment
is faster than a zero selector according to Intel.

Jon.

-- 
Computer Architect | Sent with my Fedora powered laptop


^ permalink raw reply	[flat|nested] 91+ messages in thread

* [MODERATED] Encrypted Message
  2019-02-20 17:10   ` [MODERATED] " mark gross
@ 2019-02-21 19:26     ` Tim Chen
  0 siblings, 0 replies; 91+ messages in thread
From: Tim Chen @ 2019-02-21 19:26 UTC (permalink / raw)
  To: speck


From: Tim Chen <tim.c.chen@linux.intel.com>
To: speck for mark gross <speck@linutronix.de>
Subject: Re: [patch V2 04/10] MDS basics+ 4


On 2/20/19 9:10 AM, speck for mark gross wrote:

>> +
>> +      - KGBD

s/KGBD/KGDB

>> +
>> +        If the kernel debugger is accessible by an unpriviledged attacker,
>> +        then the NMI handler is the least of the problems.
>> +

...

> 
> However; if I'm being pedantic, the attacker not having controlability aspect
> of your argument can apply to most aspects of the MDS vulnerability.  I think
> that's why its name uses "data sampling".  Also, I need to ask the chip heads
> about if this list of NMI's is complete and can be expected to stay that way
> across processor and platfrom generations.
> 
> --mark
> 


I don't think any of the code paths listed touches any user data.  So even
if an attacker has some means to control NMI, he won't get any useful data.

Thanks.

Tim 



^ permalink raw reply	[flat|nested] 91+ messages in thread

* [MODERATED] Encrypted Message
  2019-02-19 12:44 [patch 0/8] MDS basics 0 Thomas Gleixner
@ 2019-02-21 16:14 ` Jon Masters
  0 siblings, 0 replies; 91+ messages in thread
From: Jon Masters @ 2019-02-21 16:14 UTC (permalink / raw)
  To: speck


From: Jon Masters <jcm@redhat.com>
To: speck for Thomas Gleixner <speck@linutronix.de>
Subject: Re: [patch 0/8] MDS basics 0


Hi Thomas,

Just a note on testing. I built a few Coffeelake client systems for Red
Hat using the 8086K anniversary processor for which we have test ucode.
I will build and test these patches and ask the RH perf team to test.

Jon.

-- 
Computer Architect | Sent with my Fedora powered laptop


^ permalink raw reply	[flat|nested] 91+ messages in thread

* [MODERATED] Encrypted Message
  2019-02-08 10:53         ` [MODERATED] [RFC][PATCH] performance walnuts Peter Zijlstra
@ 2019-02-15 23:45           ` Jon Masters
  0 siblings, 0 replies; 91+ messages in thread
From: Jon Masters @ 2019-02-15 23:45 UTC (permalink / raw)
  To: speck


From: Jon Masters <jcm@redhat.com>
To: speck for Peter Zijlstra <speck@linutronix.de>
Subject: Re: [RFC][PATCH] performance walnuts


On 2/8/19 5:53 AM, speck for Peter Zijlstra wrote:
> +static void intel_set_tfa(struct cpu_hw_events *cpuc, bool on)
> +{
> +	u64 val = MSR_TFA_RTM_FORCE_ABORT * on;
> +
> +	if (cpuc->tfa_shadow != val) {
> +		cpuc->tfa_shadow = val;
> +		wrmsrl(MSR_TSX_FORCE_ABORT, val);
> +	}
> +}

Ok let me ask a stupid question.

This MSR is exposed on a given core. What's the impact (if any) on
*other* cores that might be using TSX? For example, suppose I'm running
an application using RTM on one core while another application on
another core begins profiling. What impact does this MSR write have on
other cores? (Architecturally.)

I'm assuming the implementation of HLE relies on whatever you're doing
fitting into the local core's cache and you just abort on any snoop,
etc. so it ought to be fairly self contained, but I want to know.

Jon.

-- 
Computer Architect | Sent with my Fedora powered laptop


^ permalink raw reply	[flat|nested] 91+ messages in thread

* [MODERATED] Encrypted Message
  2019-01-14 19:20   ` [MODERATED] " Dave Hansen
@ 2019-01-18  7:33     ` Jon Masters
  0 siblings, 0 replies; 91+ messages in thread
From: Jon Masters @ 2019-01-18  7:33 UTC (permalink / raw)
  To: speck


From: Jon Masters <jcm@redhat.com>
To: speck for Dave Hansen <speck@linutronix.de>
Subject: Re: [PATCH v4 05/28] MDSv4 10


On 1/14/19 2:20 PM, speck for Dave Hansen wrote:

> On 1/11/19 5:29 PM, speck for Andi Kleen wrote:
>> When entering idle the internal state of the current CPU might
>> become visible to the thread sibling because the CPU "frees" some
>> internal resources.
> 
> Is there some documentation somewhere about what "idle" means here?  It
> looks like MWAIT and HLT certainly count, but is there anything else?

We know power state transitions in addition can cause the peer to
dynamically sleep or wake up. MWAIT was the main example I got out of
Intel for how you'd explicitly cause a thread to be deallocated.

When Andi is talking about "frees" above he means (for example) the
dynamic allocation/deallocation of store buffer entries as threads come
and go - e.g. in Skylake there are 56 entries in a distributed store
buffer that splits into 2x28. I am not aware of fill buffer behavior
changing as threads come and go, and this isn't documented AFAICS.

I've been wondering whether we want a bit more detail in the docs. I
spent a /lot/ of time last week going through all of Intel's patents in
this area, which really helped me understand it. If folks feel we could do
with a bit more meaty summary, I can try to suggest something.

Jon.

-- 
Computer Architect | Sent with my Fedora powered laptop



* [MODERATED] Encrypted Message
  2019-01-12  1:29 ` [MODERATED] [PATCH v4 10/28] MDSv4 24 Andi Kleen
@ 2019-01-15  1:05   ` Tim Chen
  0 siblings, 0 replies; 91+ messages in thread
From: Tim Chen @ 2019-01-15  1:05 UTC (permalink / raw)
  To: speck

[-- Attachment #1: Type: text/rfc822-headers; protected-headers="v1", Size: 130 bytes --]

From: Tim Chen <tim.c.chen@linux.intel.com>
To: speck for Andi Kleen <speck@linutronix.de>
Subject: Re: [PATCH v4 10/28] MDSv4 24

[-- Attachment #2: Type: text/plain, Size: 5059 bytes --]


On 1/11/19 5:29 PM, speck for Andi Kleen wrote:

> +Some CPUs can leave read or written data in internal buffers,
> +which then later might be sampled through side effects.
> +For more details see CVE-2018-12126 CVE-2018-12130 CVE-2018-12127
> +
> +This can be avoided by explicitely clearing the CPU state.

s/explicitely/explicitly/

> +
> +We trying to avoid leaking data between different processes,

Suggest changing the above phrase to the below:

CPU state clearing prevents leaking data between different processes,

...

> +Basic requirements and assumptions
> +----------------------------------
> +
> +Kernel addresses and kernel temporary data are not sensitive.
> +
> +User data is sensitive, but only for other processes.
> +
> +Kernel data is sensitive when it is cryptographic keys.

s/when it is/when it involves/

> +
> +Guidance for driver/subsystem developers
> +----------------------------------------
> +
> +When you touch user supplied data of *other* processes in system call
> +context add lazy_clear_cpu().
> +
> +For the cases below we care only about data from other processes.
> +Touching non cryptographic data from the current process is always allowed.
> +
> +Touching only pointers to user data is always allowed.
> +
> +When your interrupt does not touch user data directly consider marking

Add a "," between "directly" and "consider"

> +it with IRQF_NO_USER.
> +
> +When your tasklet does not touch user data directly consider marking

Add a "," between "directly" and "consider"

> +it with TASKLET_NO_USER using tasklet_init_flags/or
> +DECLARE_TASKLET*_NOUSER.
> +
> +When your timer does not touch user data mark it with TIMER_NO_USER.

Add a "," between "data" and "mark"

> +If it is a hrtimer mark it with HRTIMER_MODE_NO_USER.

Add a "," between "hrtimer" and "mark"

> +
> +When your irq poll handler does not touch user data, mark it
> +with IRQ_POLL_F_NO_USER through irq_poll_init_flags.
> +
> +For networking code make sure to only touch user data through

Add a "," between "code" and "make"

> +skb_push/put/copy [add more], unless it is data from the current
> +process. If that is not ensured add lazy_clear_cpu or

Add a "," between "ensured" and "add"

> +lazy_clear_cpu_interrupt. When the non skb data access is only in a
> +hardware interrupt controlled by the driver, it can rely on not
> +setting IRQF_NO_USER for that interrupt.
> +
> +Any cryptographic code touching key data should use memzero_explicit
> +or kzfree.
> +
> +If your RCU callback touches user data add lazy_clear_cpu().
> +
> +These steps are currently only needed for code that runs on MDS affected
> +CPUs, which is currently only x86. But might be worth being prepared
> +if other architectures become affected too.
> +
> +Implementation details/assumptions
> +----------------------------------
> +
> +If a system call touches data it is for its own process, so does not

Suggest rephrasing to:

If a system call touches data of its own process, CPU state does not

> +need to be cleared, because it has already access to it.
> +
> +When context switching we clear data, unless the context switch
> +is inside a process, or from/to idle. We also clear after any
> +context switches from kernel threads.
> +
> +Idle does not have sensitive data, except for in interrupts, which
> +are handled separately.
> +
> +Cryptographic keys inside the kernel should be protected.
> +We assume they use kzfree() or memzero_explicit() to clear
> +state, so these functions trigger a cpu clear.
> +
> +Hard interrupts, tasklets, timers which can run asynchronous are
> +assumed to touch random user data, unless they have been audited, and
> +marked with NO_USER flags.
> +
> +Most interrupt handlers for modern devices should not touch
> +user data because they rely on DMA and only manipulate
> +pointers. This needs auditing to confirm though.
> +
> +For softirqs we assume that if they touch user data they use

Add "," between "data" and "they"

...

> +Technically we would only need to do this if the BPF program
> +contains conditional branches and loads dominated by them, but
> +let's assume that near all do.
s/near/nearly/

> +
> +This could be further optimized by allowing callers that do
> +a lot of individual BPF runs and are sure they don't touch
> +other user's data inbetween to do the clear only once
> +at the beginning. 

Suggest breaking the above sentence.  It is quite difficult to read.

> We can add such optimizations later based on
> +profile data.
> +
> +Virtualization
> +--------------
> +
> +When entering a guest in KVM we clear to avoid any leakage to a guest.
... we clear CPU state to avoid ....

> +Normally this is done implicitely as part of the L1TF mitigation.

s/implicitely/implicitly/

> +It relies on this being enabled. It also uses the "fast exit"
> +optimization that only clears if an interrupt or context switch
> +happened.
> 




* [MODERATED] Encrypted Message
  2019-01-12  1:29 ` [MODERATED] [PATCH v4 05/28] MDSv4 10 Andi Kleen
  2019-01-14 19:20   ` [MODERATED] " Dave Hansen
@ 2019-01-14 23:39   ` Tim Chen
  1 sibling, 0 replies; 91+ messages in thread
From: Tim Chen @ 2019-01-14 23:39 UTC (permalink / raw)
  To: speck

[-- Attachment #1: Type: text/rfc822-headers; protected-headers="v1", Size: 130 bytes --]

From: Tim Chen <tim.c.chen@linux.intel.com>
To: speck for Andi Kleen <speck@linutronix.de>
Subject: Re: [PATCH v4 05/28] MDSv4 10

[-- Attachment #2: Type: text/plain, Size: 526 bytes --]


> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index 50aa2aba69bd..b5a1bd4a1a46 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -5980,6 +5980,7 @@ static inline int find_idlest_cpu(struct sched_domain *sd, struct task_struct *p
>  
>  #ifdef CONFIG_SCHED_SMT
>  DEFINE_STATIC_KEY_FALSE(sched_smt_present);
> +EXPORT_SYMBOL(sched_smt_present);

This export is not needed since sched_smt_present is not used in the patch series.
Only sched_smt_active() is used.

Thanks.

Tim



* [MODERATED] Encrypted Message
  2018-06-12 17:29 [MODERATED] FYI - Reading uncached memory Jon Masters
@ 2018-06-14 16:59 ` Tim Chen
  0 siblings, 0 replies; 91+ messages in thread
From: Tim Chen @ 2018-06-14 16:59 UTC (permalink / raw)
  To: speck

[-- Attachment #1: Type: text/rfc822-headers; protected-headers="v1", Size: 135 bytes --]

From: Tim Chen <tim.c.chen@linux.intel.com>
To: speck for Jon Masters <speck@linutronix.de>
Subject: Re: FYI - Reading uncached memory

[-- Attachment #2: Type: text/plain, Size: 586 bytes --]

On 06/12/2018 10:29 AM, speck for Jon Masters wrote:
> FYI Graz have been able to prove the Intel processors will allow
> speculative reads of /explicitly/ UC memory (e.g. marked in MTRR). I
> believe they actually use the QPI SAD table to determine what memory is
> speculation safe and what memory has side effects (i.e. if it's HA'able
> memory then it's deemed ok to rampantly speculate from it).
> 
> Just in case anyone thought UC was safe against attacks.
> 
> Jon.
> 

Thanks for forwarding the info.  Yes, the internal Intel team
is aware of this issue.

Tim



* [MODERATED] Encrypted Message
  2018-06-05 23:37               ` Tim Chen
@ 2018-06-07 19:11                 ` Tim Chen
  0 siblings, 0 replies; 91+ messages in thread
From: Tim Chen @ 2018-06-07 19:11 UTC (permalink / raw)
  To: speck

[-- Attachment #1: Type: text/rfc822-headers; protected-headers="v1", Size: 165 bytes --]

From: Tim Chen <tim.c.chen@linux.intel.com>
To: speck for Konrad Rzeszutek Wilk <speck@linutronix.de>
Subject: Re: Is: Tim, Q to you. Was:Re: [PATCH 1/2] L1TF KVM 1

[-- Attachment #2: Type: text/plain, Size: 2489 bytes --]

On 06/05/2018 04:37 PM, Tim Chen wrote:
> On 06/05/2018 04:34 PM, Tim Chen wrote:
>> On 06/04/2018 06:11 AM, speck for Konrad Rzeszutek Wilk wrote:
>>> On Mon, Jun 04, 2018 at 10:24:59AM +0200, speck for Martin Pohlack wrote:
>>>> [resending as new message as the replay seems to have been lost on at
>>>> least some mail paths]
>>>>
>>>> On 30.05.2018 11:01, speck for Paolo Bonzini wrote:
>>>>> On 30/05/2018 01:54, speck for Andrew Cooper wrote:
>>>>>> Other bits I don't understand are the 64k limit in the first place, why
>>>>>> it gets walked over in 4k strides to begin with (I'm not aware of any
>>>>>> prefetching which would benefit that...) and why a particularly
>>>>>> obfuscated piece of magic is used for the 64byte strides.
>>>>>
>>>>> That is the only part I understood, :) the 4k strides ensure that the
>>>>> source data is in the TLB.  Why that is needed is still a mystery though.
>>>>
>>>> I think the reasoning is that you first want to populate the TLB for the
>>>> whole flush array, then fence, to make sure TLB walks do not interfere
>>>> with the actual flushing later, either for performance reasons or for
>>>> preventing leakage of partial walk results.
>>>>
>>>> Not sure about the 64K, it likely is about the LRU implementation for L1
>>>> replacement not being perfect (but pseudo LRU), so you need to flush
>>>> more than the L1 size (32K) in software.  But I have also seen smaller
>>>> recommendations for that (52K).
>>>
>>
>> Had some discussions with other Intel folks.
>>
>> Our recommendation is not to use the software sequence for L1 clear but
>> use wrmsrl(MSR_IA32_FLUSH_L1D, MSR_IA32_FLUSH_L1D_VALUE).
>> We expect that all affected systems will be receiving a ucode update
>> to provide L1 clearing capability.
>>
>> Yes, the 4k stride is for getting TLB walks out of the way and
>> the 64kB replacement is to accommodate pseudo LRU.
> 
> I will try to see if I can get hold of the relevant documentation
> on pseudo LRU.
> 

The HW folks mentioned that if nothing from the flush buffer is in L1,
then 32 KB would be sufficient (every load would miss).

However, that's not the case. If some data from the flush buffer is
already in L1, it could protect an unrelated line that the pseudo-LRU
considers "near" from getting flushed.  To make sure that does not
happen, we go through 64 KB of data to guarantee that every line in L1
encounters a load miss and is flushed.
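The displacement-flush pattern being discussed can be sketched in plain
C. This is a hypothetical userspace simplification for illustration
only - the real sequence is hand-tuned assembly with a fence between
the two passes - but the constants (64 KB array, 4 KB pages, 64-byte
lines) follow the discussion above:

```c
#include <stdint.h>
#include <stddef.h>

#define L1D_FLUSH_SIZE  (64 * 1024) /* 2x the 32 KB L1D, to defeat pseudo-LRU */
#define PAGE_SIZE       4096
#define CACHE_LINE      64

static uint8_t flush_pages[L1D_FLUSH_SIZE];

/* Returns a dummy sum so the compiler cannot elide the loads. */
unsigned long l1d_flush_sw(void)
{
	volatile uint8_t *p = flush_pages;
	unsigned long sum = 0;
	size_t i;

	/* Pass 1 (4 KB stride): touch one byte per page so every TLB
	 * entry for the flush array is populated before the
	 * displacement pass. */
	for (i = 0; i < L1D_FLUSH_SIZE; i += PAGE_SIZE)
		sum += p[i];

	/* A fence (e.g. mfence) goes here in the real sequence. */

	/* Pass 2 (64-byte stride): load one byte per cache line across
	 * all 64 KB, displacing every resident L1D line despite the
	 * pseudo-LRU replacement policy. */
	for (i = 0; i < L1D_FLUSH_SIZE; i += CACHE_LINE)
		sum += p[i];

	return sum;
}
```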

Tim



* [MODERATED] Encrypted Message
  2018-06-05 23:34             ` Tim Chen
@ 2018-06-05 23:37               ` Tim Chen
  2018-06-07 19:11                 ` Tim Chen
  0 siblings, 1 reply; 91+ messages in thread
From: Tim Chen @ 2018-06-05 23:37 UTC (permalink / raw)
  To: speck

[-- Attachment #1: Type: text/rfc822-headers; protected-headers="v1", Size: 165 bytes --]

From: Tim Chen <tim.c.chen@linux.intel.com>
To: speck for Konrad Rzeszutek Wilk <speck@linutronix.de>
Subject: Re: Is: Tim, Q to you. Was:Re: [PATCH 1/2] L1TF KVM 1

[-- Attachment #2: Type: text/plain, Size: 1939 bytes --]

On 06/05/2018 04:34 PM, Tim Chen wrote:
> On 06/04/2018 06:11 AM, speck for Konrad Rzeszutek Wilk wrote:
>> On Mon, Jun 04, 2018 at 10:24:59AM +0200, speck for Martin Pohlack wrote:
>>> [resending as new message as the replay seems to have been lost on at
>>> least some mail paths]
>>>
>>> On 30.05.2018 11:01, speck for Paolo Bonzini wrote:
>>>> On 30/05/2018 01:54, speck for Andrew Cooper wrote:
>>>>> Other bits I don't understand are the 64k limit in the first place, why
>>>>> it gets walked over in 4k strides to begin with (I'm not aware of any
>>>>> prefetching which would benefit that...) and why a particularly
>>>>> obfuscated piece of magic is used for the 64byte strides.
>>>>
>>>> That is the only part I understood, :) the 4k strides ensure that the
>>>> source data is in the TLB.  Why that is needed is still a mystery though.
>>>
>>> I think the reasoning is that you first want to populate the TLB for the
>>> whole flush array, then fence, to make sure TLB walks do not interfere
>>> with the actual flushing later, either for performance reasons or for
>>> preventing leakage of partial walk results.
>>>
>>> Not sure about the 64K, it likely is about the LRU implementation for L1
>>> replacement not being perfect (but pseudo LRU), so you need to flush
>>> more than the L1 size (32K) in software.  But I have also seen smaller
>>> recommendations for that (52K).
>>
> 
> Had some discussions with other Intel folks.
> 
> Our recommendation is not to use the software sequence for L1 clear but
> use wrmsrl(MSR_IA32_FLUSH_L1D, MSR_IA32_FLUSH_L1D_VALUE).
> We expect that all affected systems will be receiving a ucode update
> to provide L1 clearing capability.
> 
> Yes, the 4k stride is for getting TLB walks out of the way and
> the 64kB replacement is to accommodate pseudo LRU.

I will try to see if I can get hold of the relevant documentation
on pseudo LRU.

Tim



* [MODERATED] Encrypted Message
  2018-06-04 13:11           ` [MODERATED] Is: Tim, Q to you. Was:Re: " Konrad Rzeszutek Wilk
  2018-06-04 17:59             ` [MODERATED] Encrypted Message Tim Chen
@ 2018-06-05 23:34             ` Tim Chen
  2018-06-05 23:37               ` Tim Chen
  1 sibling, 1 reply; 91+ messages in thread
From: Tim Chen @ 2018-06-05 23:34 UTC (permalink / raw)
  To: speck

[-- Attachment #1: Type: text/rfc822-headers; protected-headers="v1", Size: 165 bytes --]

From: Tim Chen <tim.c.chen@linux.intel.com>
To: speck for Konrad Rzeszutek Wilk <speck@linutronix.de>
Subject: Re: Is: Tim, Q to you. Was:Re: [PATCH 1/2] L1TF KVM 1

[-- Attachment #2: Type: text/plain, Size: 1779 bytes --]

On 06/04/2018 06:11 AM, speck for Konrad Rzeszutek Wilk wrote:
> On Mon, Jun 04, 2018 at 10:24:59AM +0200, speck for Martin Pohlack wrote:
>> [resending as new message as the replay seems to have been lost on at
>> least some mail paths]
>>
>> On 30.05.2018 11:01, speck for Paolo Bonzini wrote:
>>> On 30/05/2018 01:54, speck for Andrew Cooper wrote:
>>>> Other bits I don't understand are the 64k limit in the first place, why
>>>> it gets walked over in 4k strides to begin with (I'm not aware of any
>>>> prefetching which would benefit that...) and why a particularly
>>>> obfuscated piece of magic is used for the 64byte strides.
>>>
>>> That is the only part I understood, :) the 4k strides ensure that the
>>> source data is in the TLB.  Why that is needed is still a mystery though.
>>
>> I think the reasoning is that you first want to populate the TLB for the
>> whole flush array, then fence, to make sure TLB walks do not interfere
>> with the actual flushing later, either for performance reasons or for
>> preventing leakage of partial walk results.
>>
>> Not sure about the 64K, it likely is about the LRU implementation for L1
>> replacement not being perfect (but pseudo LRU), so you need to flush
>> more than the L1 size (32K) in software.  But I have also seen smaller
>> recommendations for that (52K).
> 

Had some discussions with other Intel folks.

Our recommendation is not to use the software sequence for L1 clear but
use wrmsrl(MSR_IA32_FLUSH_L1D, MSR_IA32_FLUSH_L1D_VALUE).
We expect that all affected systems will be receiving a ucode update
to provide L1 clearing capability.

Yes, the 4k stride is for getting TLB walks out of the way and
the 64kB replacement is to accommodate pseudo LRU.
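(For reference, the MSR flush ultimately landed upstream under
different names than the ones used here: MSR_IA32_FLUSH_CMD, 0x10b,
written with the L1D_FLUSH command bit. A sketch of the resulting
dispatch logic, with wrmsrl() stubbed out since MSRs cannot be written
from userspace:)

```c
#include <stdbool.h>
#include <stdint.h>

#define MSR_IA32_FLUSH_CMD  0x10b       /* final upstream name/number */
#define L1D_FLUSH           (1ULL << 0) /* write-only command bit */

/* Stubs standing in for the real hardware accessors. */
static uint32_t last_msr;
static uint64_t last_val;
static void wrmsrl(uint32_t msr, uint64_t val)
{
	last_msr = msr;
	last_val = val;
}
static void l1d_flush_sw_fallback(void)
{
	/* 64 KB displacement walk, as discussed earlier in the thread. */
}

/* Prefer the microcode-provided MSR flush; fall back to the software
 * displacement sequence only when the CPU lacks the capability. */
void l1d_flush(bool has_flush_msr)
{
	if (has_flush_msr)
		wrmsrl(MSR_IA32_FLUSH_CMD, L1D_FLUSH);
	else
		l1d_flush_sw_fallback();
}
```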

Thanks.

Tim



* [MODERATED] Encrypted Message
  2018-06-04 13:11           ` [MODERATED] Is: Tim, Q to you. Was:Re: " Konrad Rzeszutek Wilk
@ 2018-06-04 17:59             ` Tim Chen
  2018-06-05 23:34             ` Tim Chen
  1 sibling, 0 replies; 91+ messages in thread
From: Tim Chen @ 2018-06-04 17:59 UTC (permalink / raw)
  To: speck

[-- Attachment #1: Type: text/rfc822-headers; protected-headers="v1", Size: 165 bytes --]

From: Tim Chen <tim.c.chen@linux.intel.com>
To: speck for Konrad Rzeszutek Wilk <speck@linutronix.de>
Subject: Re: Is: Tim, Q to you. Was:Re: [PATCH 1/2] L1TF KVM 1

[-- Attachment #2: Type: text/plain, Size: 1464 bytes --]

On 06/04/2018 06:11 AM, speck for Konrad Rzeszutek Wilk wrote:
> On Mon, Jun 04, 2018 at 10:24:59AM +0200, speck for Martin Pohlack wrote:
>> [resending as new message as the replay seems to have been lost on at
>> least some mail paths]
>>
>> On 30.05.2018 11:01, speck for Paolo Bonzini wrote:
>>> On 30/05/2018 01:54, speck for Andrew Cooper wrote:
>>>> Other bits I don't understand are the 64k limit in the first place, why
>>>> it gets walked over in 4k strides to begin with (I'm not aware of any
>>>> prefetching which would benefit that...) and why a particularly
>>>> obfuscated piece of magic is used for the 64byte strides.
>>>
>>> That is the only part I understood, :) the 4k strides ensure that the
>>> source data is in the TLB.  Why that is needed is still a mystery though.
>>
>> I think the reasoning is that you first want to populate the TLB for the
>> whole flush array, then fence, to make sure TLB walks do not interfere
>> with the actual flushing later, either for performance reasons or for
>> preventing leakage of partial walk results.
>>
>> Not sure about the 64K, it likely is about the LRU implementation for L1
>> replacement not being perfect (but pseudo LRU), so you need to flush
>> more than the L1 size (32K) in software.  But I have also seen smaller
>> recommendations for that (52K).
> 
> Isn't Tim Chen from Intel on this mailing list? Tim, could you find out
> please?
> 

Will do.

Tim



* [MODERATED] Encrypted Message
  2018-05-18 14:29   ` Thomas Gleixner
@ 2018-05-18 19:50     ` Tim Chen
  0 siblings, 0 replies; 91+ messages in thread
From: Tim Chen @ 2018-05-18 19:50 UTC (permalink / raw)
  To: speck

[-- Attachment #1: Type: text/rfc822-headers; protected-headers="v1", Size: 163 bytes --]

From: Tim Chen <tim.c.chen@linux.intel.com>
To: speck for Thomas Gleixner <speck@linutronix.de>
Subject: Re: Is: Sleep states ?Was:Re: SSB status - V18 pushed out

[-- Attachment #2: Type: text/plain, Size: 2667 bytes --]

On 05/18/2018 07:29 AM, speck for Thomas Gleixner wrote:
> On Fri, 18 May 2018, speck for Konrad Rzeszutek Wilk wrote:
>> On Thu, May 17, 2018 at 10:53:28PM +0200, speck for Thomas Gleixner wrote:
>>> Folks,
>>>
>>> we finally reached a stable state with the SSB patches. I've updated all 3
>>> branches master/linux-4.16.y/linux-4.14.y in the repo and attached the
>>> resulting git bundles. They merge cleanly on top of the current HEADs of
>>> the relevant trees.
>>>
>>> The lot survived light testing on my side and it would be great if everyone
>>> involved could expose it to their test scenarios.
>>>
>>> Thanks to everyone who participated in that effort (patches, review,
>>> testing ...)!
>>
>> Yeey! Thank you.
>>
>> I was reading the updated Intel doc today (instead of skim reading it) and it mentioned:
>>
>> "Intel recommends that the SSBD MSR bit be cleared when in a sleep state on such processors."
> 
> Well, the same recommendation was for IBRS and the reason is that with HT
> enabled the other hyperthread will not be able to go full speed because the
> sleeping one vanished with IBRS set. SSBD works the same way.
> 
> " SW should clear [SSBD] when enter sleep state, just as is suggested for
>   IBRS and STIBP on existing implementations"
> 
> and that document says:
> 
> "Enabling IBRS on one logical processor of a core with Intel
>  Hyper-Threading Technology may affect branch prediction on other logical
>  processors of the same core. For this reason, software should disable IBRS
>  (by clearing IA32_SPEC_CTRL.IBRS) prior to entering a sleep state (e.g.,
>  by executing HLT or MWAIT) and re-enable IBRS upon wakeup and prior to
>  executing any indirect branch."
> 
> So it's only a performance issue and not a fundamental problem to have it
> on when executing HLT/MWAIT
> 
> So we have two situations here:
> 
> 1) ssbd = on, i.e X86_FEATURE_SPEC_STORE_BYPASS_DISABLE
> 
>    There it is irrelevant because both threads have SSBD set permanentely,
>    so unsetting it on HLT/MWAIT is not going to lift the restriction for
>    the running sibling thread. And HLT/MWAIT is not going to be faster by
>    unsetting it and then setting it on wakeup again....
> 
> 2) SSBD via prctl/seccomp
> 
>    Nothing to do there, because idle task does not have TIF_SSBD set so it
>    never goes with SSBD set into HLT/MWAIT.
> 
> So I think we're good, but it would be nice if Intel folks would confirm
> that.

Yes, we thought about turning off SSBD in the mwait path earlier, but
decided it was unnecessary for exactly the reasons Thomas mentioned.

Thanks.

Tim



* [MODERATED] Encrypted Message
  2018-05-02 21:51 [patch V11 00/16] SSB 0 Thomas Gleixner
@ 2018-05-03  4:27 ` Tim Chen
  0 siblings, 0 replies; 91+ messages in thread
From: Tim Chen @ 2018-05-03  4:27 UTC (permalink / raw)
  To: speck

[-- Attachment #1: Type: text/rfc822-headers; protected-headers="v1", Size: 133 bytes --]

From: Tim Chen <tim.c.chen@linux.intel.com>
To: speck for Thomas Gleixner <speck@linutronix.de>
Subject: Re: [patch V11 00/16] SSB 0

[-- Attachment #2: Type: text/plain, Size: 1580 bytes --]



On 05/02/2018 02:51 PM, speck for Thomas Gleixner wrote:
> Changes since V10:
> 
>   - Addressed Ingos review feedback
> 
>   - Picked up Reviewed-bys
> 
> Delta patch below. Bundle is coming in separate mail. Git repo branches are
> updated as well. The master branch contains also the fix for the lost IBRS
> issue Tim was seeing.
> 
> If there are no further issues and nitpicks, I'm going to make the
> changes immutable and changes need to go incremental on top.
> 
> Thanks,
> 
> 	tglx
> 
> 

I notice that this code ignores the current process's TIF_RDS setting
in the prctl case:

#define firmware_restrict_branch_speculation_end()                      \
do {                                                                    \
        u64 val = x86_get_default_spec_ctrl();                          \
                                                                        \
        alternative_msr_write(MSR_IA32_SPEC_CTRL, val,                  \
                              X86_FEATURE_USE_IBRS_FW);                 \
        preempt_enable();                                               \
} while (0)

x86_get_default_spec_ctrl() will return x86_spec_ctrl_base, which
will result in x86_spec_ctrl_base being written to the MSR in the
prctl case on Intel CPUs.  That incorrectly ignores the current
process's TIF_RDS setting, so the RDS bit will not be set.

Instead, the following value should have been written to the MSR
on Intel CPUs:
x86_spec_ctrl_base | rds_tif_to_spec_ctrl(current_thread_info()->flags)
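The fix described above can be sketched as follows. Constants are
illustrative: SPEC_CTRL_RDS is bit 2 of IA32_SPEC_CTRL (later renamed
SSBD upstream), while the TIF_RDS bit position here is hypothetical:

```c
#include <stdint.h>

#define SPEC_CTRL_IBRS  (1ULL << 0)
#define SPEC_CTRL_RDS   (1ULL << 2)  /* bit 2 of IA32_SPEC_CTRL */
#define TIF_RDS         5            /* hypothetical thread-flag bit */

uint64_t x86_spec_ctrl_base;         /* system-wide base MSR value */

/* Map the per-task TIF_RDS thread flag to the SPEC_CTRL_RDS MSR bit. */
static uint64_t rds_tif_to_spec_ctrl(uint64_t tifn)
{
	return (tifn & (1ULL << TIF_RDS)) ? SPEC_CTRL_RDS : 0;
}

/* Value to restore on firmware_restrict_branch_speculation_end():
 * the base value plus the current task's RDS bit, rather than the
 * base value alone. */
uint64_t spec_ctrl_restore_val(uint64_t task_flags)
{
	return x86_spec_ctrl_base | rds_tif_to_spec_ctrl(task_flags);
}
```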

Thanks.

Tim



end of thread, other threads:[~2019-03-06 16:22 UTC | newest]

Thread overview: 91+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2018-04-24  9:06 [MODERATED] L1D-Fault KVM mitigation Joerg Roedel
2018-04-24  9:35 ` [MODERATED] " Peter Zijlstra
2018-04-24  9:48   ` David Woodhouse
2018-04-24 11:04     ` Peter Zijlstra
2018-04-24 11:16       ` David Woodhouse
2018-04-24 15:10         ` Jon Masters
2018-05-23  9:45       ` David Woodhouse
2018-05-24  9:45         ` Peter Zijlstra
2018-05-24 14:14           ` Jon Masters
2018-05-24 15:04           ` Thomas Gleixner
2018-05-24 15:33             ` Thomas Gleixner
2018-05-24 15:38               ` [MODERATED] " Jiri Kosina
2018-05-24 17:22                 ` Dave Hansen
2018-05-24 17:30                   ` Linus Torvalds
2018-05-24 23:18               ` [MODERATED] Encrypted Message Tim Chen
2018-05-24 23:28                 ` [MODERATED] Re: L1D-Fault KVM mitigation Linus Torvalds
2018-05-25  8:31                   ` Thomas Gleixner
2018-05-28 14:43                     ` [MODERATED] " Paolo Bonzini
2018-05-25 18:22                 ` [MODERATED] Encrypted Message Tim Chen
2018-05-26 19:14                 ` L1D-Fault KVM mitigation Thomas Gleixner
2018-05-26 20:43                   ` [MODERATED] " Andi Kleen
2018-05-26 20:48                     ` Linus Torvalds
2018-05-27 18:25                       ` Andi Kleen
2018-05-27 18:49                         ` Linus Torvalds
2018-05-27 18:57                           ` Thomas Gleixner
2018-05-27 19:13                           ` [MODERATED] " Andrew Cooper
2018-05-27 19:26                             ` Linus Torvalds
2018-05-27 19:41                               ` Thomas Gleixner
2018-05-27 22:26                                 ` [MODERATED] " Andrew Cooper
2018-05-28  6:47                                   ` Thomas Gleixner
2018-05-28 12:26                                     ` [MODERATED] " Andrew Cooper
2018-05-28 14:40                           ` Paolo Bonzini
2018-05-28 15:56                             ` Thomas Gleixner
2018-05-28 17:15                               ` [MODERATED] " Paolo Bonzini
2018-05-27 15:42                     ` Thomas Gleixner
2018-05-27 16:26                       ` [MODERATED] " Linus Torvalds
2018-05-27 18:31                       ` Andi Kleen
2018-05-29 19:29                   ` [MODERATED] Encrypted Message Tim Chen
2018-05-29 21:14                     ` L1D-Fault KVM mitigation Thomas Gleixner
2018-05-30 16:38                       ` [MODERATED] Encrypted Message Tim Chen
2018-05-24 15:44             ` [MODERATED] Re: L1D-Fault KVM mitigation Andi Kleen
2018-05-24 15:38           ` Linus Torvalds
2018-05-24 15:59             ` David Woodhouse
2018-05-24 16:35               ` Linus Torvalds
2018-05-24 16:51                 ` David Woodhouse
2018-05-24 16:57                   ` Linus Torvalds
2018-05-25 11:29                     ` David Woodhouse
2018-04-24 10:30   ` [MODERATED] Re: ***UNCHECKED*** " Joerg Roedel
2018-04-24 11:09     ` Thomas Gleixner
2018-04-24 16:06       ` [MODERATED] " Andi Kleen
2018-04-24 12:53   ` Paolo Bonzini
2018-05-03 16:20     ` Konrad Rzeszutek Wilk
2018-05-07 17:11       ` Paolo Bonzini
2018-05-16  8:51         ` Jiri Kosina
2018-05-16  8:53           ` Paolo Bonzini
2018-05-21 10:06             ` David Woodhouse
2018-05-21 13:40               ` Thomas Gleixner
2018-05-02 21:51 [patch V11 00/16] SSB 0 Thomas Gleixner
2018-05-03  4:27 ` [MODERATED] Encrypted Message Tim Chen
2018-05-17 20:53 SSB status - V18 pushed out Thomas Gleixner
2018-05-18 13:54 ` [MODERATED] Is: Sleep states ?Was:Re: " Konrad Rzeszutek Wilk
2018-05-18 14:29   ` Thomas Gleixner
2018-05-18 19:50     ` [MODERATED] Encrypted Message Tim Chen
2018-05-29 19:42 [MODERATED] [PATCH 0/2] L1TF KVM 0 Paolo Bonzini
     [not found] ` <20180529194240.7F1336110A@crypto-ml.lab.linutronix.de>
2018-05-29 22:49   ` [PATCH 1/2] L1TF KVM 1 Thomas Gleixner
2018-05-29 23:54     ` [MODERATED] " Andrew Cooper
2018-05-30  9:01       ` Paolo Bonzini
2018-06-04  8:24         ` [MODERATED] " Martin Pohlack
2018-06-04 13:11           ` [MODERATED] Is: Tim, Q to you. Was:Re: " Konrad Rzeszutek Wilk
2018-06-04 17:59             ` [MODERATED] Encrypted Message Tim Chen
2018-06-05 23:34             ` Tim Chen
2018-06-05 23:37               ` Tim Chen
2018-06-07 19:11                 ` Tim Chen
2018-06-12 17:29 [MODERATED] FYI - Reading uncached memory Jon Masters
2018-06-14 16:59 ` [MODERATED] Encrypted Message Tim Chen
2019-01-12  1:29 [MODERATED] [PATCH v4 00/28] MDSv4 2 Andi Kleen
2019-01-12  1:29 ` [MODERATED] [PATCH v4 05/28] MDSv4 10 Andi Kleen
2019-01-14 19:20   ` [MODERATED] " Dave Hansen
2019-01-18  7:33     ` [MODERATED] Encrypted Message Jon Masters
2019-01-14 23:39   ` Tim Chen
2019-01-12  1:29 ` [MODERATED] [PATCH v4 10/28] MDSv4 24 Andi Kleen
2019-01-15  1:05   ` [MODERATED] Encrypted Message Tim Chen
2019-02-07 23:41 [MODERATED] [PATCH v3 0/6] PERFv3 Andi Kleen
2019-02-07 23:41 ` [MODERATED] [PATCH v3 2/6] PERFv3 Andi Kleen
2019-02-08  0:51   ` [MODERATED] Re: [SUSPECTED SPAM][PATCH " Andrew Cooper
2019-02-08  9:01     ` Peter Zijlstra
2019-02-08  9:39       ` Peter Zijlstra
2019-02-08 10:53         ` [MODERATED] [RFC][PATCH] performance walnuts Peter Zijlstra
2019-02-15 23:45           ` [MODERATED] Encrypted Message Jon Masters
2019-02-19 12:44 [patch 0/8] MDS basics 0 Thomas Gleixner
2019-02-21 16:14 ` [MODERATED] Encrypted Message Jon Masters
2019-02-20 15:07 [patch V2 00/10] MDS basics+ 0 Thomas Gleixner
2019-02-20 15:07 ` [patch V2 04/10] MDS basics+ 4 Thomas Gleixner
2019-02-20 17:10   ` [MODERATED] " mark gross
2019-02-21 19:26     ` [MODERATED] Encrypted Message Tim Chen
2019-02-21 23:44 [patch V3 0/9] MDS basics 0 Thomas Gleixner
2019-02-21 23:44 ` [patch V3 4/9] MDS basics 4 Thomas Gleixner
2019-02-22  7:45   ` [MODERATED] Encrypted Message Jon Masters
2019-02-22 22:24 [patch V4 00/11] MDS basics Thomas Gleixner
2019-02-22 22:24 ` [patch V4 04/11] x86/speculation/mds: Add mds_clear_cpu_buffer() Thomas Gleixner
2019-02-26 14:19   ` [MODERATED] " Josh Poimboeuf
2019-03-01 20:58     ` [MODERATED] Encrypted Message Jon Masters
2019-03-01 22:14       ` Jon Masters
2019-02-24 15:07 [MODERATED] [PATCH v6 00/43] MDSv6 Andi Kleen
2019-02-24 15:07 ` [MODERATED] [PATCH v6 10/43] MDSv6 Andi Kleen
2019-02-25 16:30   ` [MODERATED] " Greg KH
2019-02-25 16:41     ` [MODERATED] Encrypted Message Jon Masters
2019-02-24 15:07 ` [MODERATED] [PATCH v6 31/43] MDSv6 Andi Kleen
2019-02-25 15:19   ` [MODERATED] " Greg KH
2019-02-25 15:34     ` Andi Kleen
2019-02-25 15:49       ` Greg KH
2019-02-25 15:52         ` [MODERATED] Encrypted Message Jon Masters
2019-02-25 16:00           ` [MODERATED] " Greg KH
2019-02-25 16:19             ` [MODERATED] " Jon Masters
2019-03-01 21:47 [patch V6 00/14] MDS basics 0 Thomas Gleixner
2019-03-01 21:47 ` [patch V6 06/14] MDS basics 6 Thomas Gleixner
2019-03-04  6:28   ` [MODERATED] Encrypted Message Jon Masters
2019-03-01 21:47 ` [patch V6 08/14] MDS basics 8 Thomas Gleixner
2019-03-04  6:57   ` [MODERATED] Encrypted Message Jon Masters
2019-03-04  7:06     ` Jon Masters
2019-03-04  8:12       ` Jon Masters
2019-03-05 15:34     ` Thomas Gleixner
2019-03-06 16:21       ` [MODERATED] " Jon Masters
2019-03-01 21:47 ` [patch V6 10/14] MDS basics 10 Thomas Gleixner
2019-03-04  6:45   ` [MODERATED] Encrypted Message Jon Masters
2019-03-01 21:47 ` [patch V6 12/14] MDS basics 12 Thomas Gleixner
2019-03-04  5:47   ` [MODERATED] Encrypted Message Jon Masters
2019-03-04  5:30 ` Jon Masters
2019-03-04  1:21 [MODERATED] [PATCH RFC 0/4] Proposed cmdline improvements Josh Poimboeuf
2019-03-04  1:23 ` [MODERATED] [PATCH RFC 1/4] 1 Josh Poimboeuf
2019-03-04  3:55   ` [MODERATED] Encrypted Message Jon Masters
2019-03-04  7:30   ` [MODERATED] Re: [PATCH RFC 1/4] 1 Greg KH
2019-03-04  7:45     ` [MODERATED] Encrypted Message Jon Masters
2019-03-04  1:24 ` [MODERATED] [PATCH RFC 3/4] 3 Josh Poimboeuf
2019-03-04  3:58   ` [MODERATED] Encrypted Message Jon Masters
2019-03-04 17:17     ` [MODERATED] " Josh Poimboeuf
2019-03-06 16:22       ` [MODERATED] " Jon Masters
2019-03-04  1:25 ` [MODERATED] [PATCH RFC 4/4] 4 Josh Poimboeuf
2019-03-04  4:07   ` [MODERATED] Encrypted Message Jon Masters
2019-03-05 16:43 [MODERATED] Starting to go public? Linus Torvalds
2019-03-05 17:02 ` [MODERATED] " Andrew Cooper
2019-03-05 20:36   ` Jiri Kosina
2019-03-05 22:31     ` Andrew Cooper
2019-03-06 16:18       ` [MODERATED] Encrypted Message Jon Masters
2019-03-05 17:10 ` Jon Masters
