LKML Archive on lore.kernel.org
From: Ingo Molnar <mingo@kernel.org>
To: Balbir Singh <sblbir@amazon.com>
Cc: torvalds@linux-foundation.org, a.p.zijlstra@chello.nl,
	akpm@linux-foundation.org, bp@alien8.de,
	linux-kernel@vger.kernel.org, luto@kernel.org,
	tglx@linutronix.de, benh@kernel.crashing.org
Subject: Re: [GIT PULL] x86/mm changes for v5.8
Date: Tue, 2 Jun 2020 09:33:50 +0200
Message-ID: <20200602073350.GA481221@gmail.com> (raw)
In-Reply-To: <CAHk-=wgXf_wQ9zrJKv2Hy4EpEbLuqty-Cjbs2u00gm7XcYHBfw@mail.gmail.com>


* Balbir Singh <sblbir@amazon.com> wrote:

> > At a _minimum_, SMT being enabled should disable this kind of crazy
> > pseudo-security entirely, since it is completely pointless in that
> > situation. Scheduling simply isn't a synchronization point with SMT
> > on, so saying "sure, I'll flush the L1 at context switch" is beyond
> > stupid.
> >
> > I do not want the kernel to do things that seem to be "beyond stupid".
> >
> > Because I really think this is just PR and pseudo-security, and I
> > think there's a real cost in making people think "oh, I'm so special
> > that I should enable this".
> >
> > I'm more than happy to be educated on why I'm wrong, but for now I'm
> > unpulling it for lack of data.
> >
> > Maybe it never happens on SMT because of all those subtle static
> > branch rules, but I'd really like that to be explained.
> 
> The documentation calls out the SMT restrictions.

That's not what Linus suggested though, and you didn't answer his 
concerns AFAICS.

The documentation commit merely mentions that this feature is useless 
with SMT:

  0fcfdf55db9e: ("Documentation: Add L1D flushing Documentation")

+Limitations
+-----------
+
+The mechanism does not mitigate L1D data leaks between tasks belonging to
+different processes which are concurrently executing on sibling threads of
+a physical CPU core when SMT is enabled on the system.
+
+This can be addressed by controlled placement of processes on physical CPU
+cores or by disabling SMT. See the relevant chapter in the L1TF mitigation
+document: :ref:`Documentation/admin-guide/hw-vuln/l1tf.rst <smt_control>`.

Linus is right that the proper response is for this feature to do 
*nothing* if SMT is enabled on affected CPUs - but that's not 
implemented in the patches ...

Or rather, we should ask a higher-level question as well: should we 
be doing this feature at all?

Typically cloud computing systems such as AWS will have SMT enabled, 
because cloud computing pricing is essentially per vCPU, and they want 
to sell the hyperthreads as vCPUs. So the safest solution, disabling 
SMT on affected systems, is not actually done, because it's an 
economic non-starter. (I'd like to note the security double standard 
there: the most secure option, to disable SMT, is not actually used ...)

BTW., I wonder how Amazon is solving the single-vCPU customer workload 
problem on AWS: if the vast majority of AWS computing capacity is 
running on a single vCPU, because it's the cheapest tier and because 
it's more than enough capacity to run a website. Even core-scheduling 
doesn't solve this fundamental SMT security problem: separate customer 
workloads *cannot* share the same core - but this means that the 
single-vCPU workloads will only be able to utilize 50% of all 
available vCPUs if they are properly isolated.

Or if the majority of AWS EC2 etc. customer systems are using 2, 4 or 
more vCPUs, then both this feature and 'core-scheduling' are 
effectively pointless from a security POV, because the cloud computing 
systems are de-facto partitioned into cores already, with each core 
accounted as 2 vCPUs.

The hour-up-rounded way AWS (and many other cloud providers) bill 
system runtime suggests that they are doing relatively static 
partitioning of customer workloads already, i.e. customer workloads 
are mapped to actual physical hardware in an exclusive fashion, with 
no overcommitting of physical resources and no sharing of cores 
between customers.

If I look at the pricing and capabilities table of AWS:

  https://aws.amazon.com/ec2/pricing/on-demand/

Only the 't2' and 't3' On-Demand instance families have 'Variable' 
pricing, which covers only 9% of the 228 offered configurations.

I.e. I strongly suspect that neither L1D flushing nor core-scheduling 
is actually required on affected vulnerable CPUs to keep customer 
workloads isolated from each other, on the majority of cloud computing 
systems, because they are already isolated via semi-static 
partitioning, using pricing that reflects static partitioning.

Thanks,

	Ingo

Thread overview: 15+ messages
2020-06-01 17:01 Ingo Molnar
2020-06-01 21:42 ` Linus Torvalds
2020-06-02  2:35   ` Linus Torvalds
2020-06-02 10:25     ` Singh, Balbir
2020-06-02  7:33   ` Ingo Molnar [this message]
2020-06-02  9:37     ` Benjamin Herrenschmidt
2020-06-02 18:28       ` Thomas Gleixner
2020-06-02 19:14         ` Linus Torvalds
2020-06-02 23:01           ` Singh, Balbir
2020-06-02 23:28             ` Linus Torvalds
2020-06-03  1:31               ` Singh, Balbir
2020-06-04 17:29           ` [GIT PULL v2] " Ingo Molnar
2020-06-05  2:41             ` Linus Torvalds
2020-06-05  8:11               ` [GIT PULL v3] " Ingo Molnar
2020-06-05 20:40                 ` pr-tracker-bot

