From mboxrd@z Thu Jan 1 00:00:00 1970
Return-Path:
Received: from mail.linutronix.de (146.0.238.70:993) by crypto-ml.lab.linutronix.de
	with IMAP4-SSL for ; 09 Jul 2018 11:04:41 -0000
Received: from mail-wr1-x42a.google.com ([2a00:1450:4864:20::42a])
	by Galois.linutronix.de with esmtps (TLS1.2:RSA_AES_128_CBC_SHA1:128)
	(Exim 4.80) (envelope-from ) id 1fcTyC-0005DH-Mw
	for speck@linutronix.de; Mon, 09 Jul 2018 13:04:40 +0200
Received: by mail-wr1-x42a.google.com with SMTP id h9-v6so10525037wro.3
	for ; Mon, 09 Jul 2018 04:04:40 -0700 (PDT)
Received: from gmail.com (2E8B0CD5.catv.pool.telekom.hu. [46.139.12.213])
	by smtp.gmail.com with ESMTPSA id i6-v6sm5117110wrr.10.2018.07.09.04.04.33
	for (version=TLS1_2 cipher=ECDHE-RSA-CHACHA20-POLY1305 bits=256/256);
	Mon, 09 Jul 2018 04:04:34 -0700 (PDT)
Sender: Ingo Molnar
Date: Mon, 9 Jul 2018 13:04:32 +0200
From: Ingo Molnar
Subject: [MODERATED] Re: [patch 2/2] Command line and documentation 2
Message-ID: <20180709110432.GB26055@gmail.com>
References: <20180708125216.197406530@linutronix.de> <20180708125654.812951995@linutronix.de>
MIME-Version: 1.0
In-Reply-To: <20180708125654.812951995@linutronix.de>
Content-Type: text/plain; charset="utf-8"
Content-Transfer-Encoding: 7bit
To: speck@linutronix.de
List-ID:

* speck for Thomas Gleixner wrote:

> From: Thomas Gleixner
> Subject: [patch 2/2] Documentation: Add section about CPU vulnerabilities
>
> Signed-off-by: Thomas Gleixner
> ---
>  Documentation/admin-guide/index.rst |    9
>  Documentation/admin-guide/l1tf.rst  |  356 ++++++++++++++++++++++++++++++++++++
>  2 files changed, 365 insertions(+)
>
> --- a/Documentation/admin-guide/index.rst
> +++ b/Documentation/admin-guide/index.rst
> @@ -17,6 +17,15 @@ etc.
>     kernel-parameters
>     devices
>
> +This section describes CPU vulnerabilities and provides an overview over
> +the possible mitigations along with guidance for selecting mitigations if
> +they are configurable at compile, boot or run time.
> +
> +.. toctree::
> +   :maxdepth: 1
> +
> +   l1tf
> +
>  Here is a set of documents aimed at users who are trying to track down
>  problems and bugs in particular.
>
> --- /dev/null
> +++ b/Documentation/admin-guide/l1tf.rst
> @@ -0,0 +1,356 @@
> +L1TF - L1 Terminal Fault
> +========================
> +
> +L1 Terminal Fault is a hardware vulnerability which allows unconstrained
> +speculative access to data which is available in the Level 1 Data Cache
> +when the page table entry controlling the virtual access, which is used for
> +the access, has the present bit cleared.

Would it be clearer to say "unprivileged" instead of "unconstrained"?
"Unconstrained" could mean a number of other things. (At least to me.)

> +Affected CPUs
> +-------------
> +
> +This vulnerability affects a wide range of Intel processors. The
> +vulnerability is not present on:
> +
> +   - Older models, where the CPU family is < 6
> +
> +   - A range of ATOM processors (Cedarview, Cloverview, Lincroft, Penwell,
> +     Pineview, Slivermont, Airmont, Merrifield)
> +
> +   - The Core Duo Yonah variants (2006 - 2008)
> +
> +   - The XEON PHI family
> +
> +   - Processors which have the ARCH_CAP_RDCL_NO bit set in the
> +     IA32_ARCH_CAPABILITIES MSR. If the bit is set the CPU is also not
> +     affected by the Meltdown vulnerabitly. These CPUs should become
> +     available end of 2018.

s/vulnerabitly /vulnerability

Also, maybe this is a bit clearer:

  If the bit is set then the CPU is not affected by the Meltdown
  vulnerability either.
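[ Side note, not part of the review: a minimal user-space sketch for
  checking that bit directly. It assumes the msr module is loaded and root
  privileges, and uses the architectural numbers from the SDM
  (IA32_ARCH_CAPABILITIES is MSR 0x10a, RDCL_NO is bit 0); on CPUs without
  the MSR the read simply fails: ]

/* Read IA32_ARCH_CAPABILITIES on CPU 0 via the msr driver and report
 * whether the RDCL_NO bit is set. Requires root and a loaded msr module. */
#include <stdio.h>
#include <stdint.h>
#include <fcntl.h>
#include <unistd.h>

int main(void)
{
	uint64_t val;
	int fd = open("/dev/cpu/0/msr", O_RDONLY);

	if (fd < 0 || pread(fd, &val, sizeof(val), 0x10a) != sizeof(val)) {
		perror("reading IA32_ARCH_CAPABILITIES");
		return 1;
	}
	printf("RDCL_NO: %s\n", (val & 1) ? "set (not affected)" : "clear");
	close(fd);
	return 0;
}

( In practice the sysfs vulnerabilities files discussed further below are
  the saner interface for this. )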
> +If an instruction accesses a virtual address for which the relevant page
> +table entry (PTE) has the present bit cleared, then the speculative
> +execution can load the data into the speculation flow when the data from
> +the physical address which is referenced in the PTE address bits is
> +available in the Level 1 Data Cache. This is a purely speculative
> +mechanism and the instruction will raise a page fault when it is retired.

s/then the speculative execution /then speculative execution

Also, maybe clarify it the following way:

> +If an instruction accesses a virtual address for which the relevant page
> +table entry (PTE) has the present bit cleared, then speculative
> +execution might ignore the cleared present bit and might load the
> +referenced data from the Level 1 Data Cache (if cached), as if the page was
> +still present and accessible.
>
> +While this is a purely speculative mechanism and the instruction will raise a
> +page fault when it is retired eventually, the pure act of loading the
> +data and making it available to other speculative instructions opens up
> +timing based side channel attacks to unprivileged malicious code, similar
> +to the Meltdown attack.

?

> +This flaw is very similar to the Meltdown vulnerability, which speculates
> +on data which should be not accessible from user space because the
> +speculation ignores the permission bits. Contrary to Meltdown L1TF can not
> +be exploited without actually generating page faults. While Meltdown breaks
> +the user space to kernel space protection, L1TF has a broader scope. It
> +allows to attack any physical memory address in the system and the attack
> +works across all protection domains. It allows to attack SGX and also
> +works from inside virtual machines because the speculation bypasses the
> +extended page table (EPT) protection mechanism.

To the extent this paragraph survives into the final version:

s/It allows to attack SGX /It allows an attack of SGX

?

> +
> +Attack scenarios
> +----------------
> +
> +1. Malicious user space:
> +
> +   Operating Systems store arbitrary information in the address bits of a
> +   PTE which is marked non present. This allows a malicious user space
> +   application to attack the physical memory to which these PTEs resolve.

Maybe also add this:

  "In some cases user-space can maliciously influence (i.e. set to a broad
   range of arbitrary values) the information encoded in the address bits
   of the PTE - making attacks more deterministic and more practical."

?

> +
> +   The Linux kernel contains a mitigation for this attack vector which is
> +   permanently enabled and has no performance impact. A system with an up
> +   to date kernel is not vulnerable as there is no way for a malicious
> +   application to control the content of PTEs which are not marked present.
> +
> + |

It might make sense to mention the acronym of the mitigation here?
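[ Illustration only, not the kernel's actual implementation: a rough
  user-space sketch of the idea behind such a mitigation - encoding the
  metadata of a non-present PTE (e.g. a swap entry) with its address bits
  inverted, so that whatever physical address those bits appear to name
  lies above populated memory. The PTE layout and the 46-bit physical
  address width below are assumptions made up for the example: ]

/* Conceptual sketch only - NOT kernel code. Encode/decode metadata in a
 * non-present PTE with the PFN field inverted so the apparent physical
 * address never points into populated RAM. */
#include <stdio.h>
#include <stdint.h>

#define MAX_PA_BITS	46	/* assumed physical address width */
#define PTE_PFN_SHIFT	12
#define PTE_PFN_MASK	((((uint64_t)1 << (MAX_PA_BITS - PTE_PFN_SHIFT)) - 1) \
			 << PTE_PFN_SHIFT)

static uint64_t encode_nonpresent_pte(uint64_t metadata)
{
	uint64_t pte = (metadata << PTE_PFN_SHIFT) & PTE_PFN_MASK;

	return pte ^ PTE_PFN_MASK;	/* present bit (bit 0) stays clear */
}

static uint64_t decode_nonpresent_pte(uint64_t pte)
{
	/* Undo the inversion to recover the original metadata. */
	return ((pte ^ PTE_PFN_MASK) & PTE_PFN_MASK) >> PTE_PFN_SHIFT;
}

int main(void)
{
	uint64_t pte = encode_nonpresent_pte(0x1234);

	printf("encoded PTE: %#llx, decoded entry: %#llx\n",
	       (unsigned long long)pte,
	       (unsigned long long)decode_nonpresent_pte(pte));
	return 0;
}

( The point being that even a user-space-influenced value ends up with
  address bits that do not name interesting memory. )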
> +2. Malicious guest in a virtual machine
> +
> +   The fact that L1TF breaks all domain protections allows malicious guest
> +   OSes, which can control the PTEs directly, and malicious userspace,
> +   which runs on an unprotected guest kernel, to attack physical host
> +   memory.

s/malicious userspace /malicious guest user-space applications

?

Also, maybe mention that 'unprotected guest kernel' refers to the
non-present PTE encoding protection technique, because it's a bit vague
here I think.

> +   A special aspect of L1TF in the context of virtualization is symmetric
> +   multi threading (SMT). The Intel implementation of SMT is called
> +   HyperThreading. The fact that Hyperthreads on the affected processors
> +   share the L1 Data Cache (L1D) is important for this. As the flaw allows
> +   only to attack data which is present in L1D, a malicious guest running
> +   on one thread can attack the data which is brought into L1D by the
> +   context which runs on the sibling thread of the same physical core. This
> +   context can be host OS, host user space or a different guest.

For more clarity:

s/on one thread /on one CPU thread

s/brought into L1D /brought into the L1D

s/sibling thread /sibling CPU thread

> +   While solutions exist to mitigate these attack vectors fully, these
> +   mitigations are not enabled by default in the Linux kernel because they
> +   can affect performance significantly. The kernel provides several
> +   mechanisms which can be utilized to address the problem depending on the
> +   deployment scenario.

Maybe provide a list of mitigations here, or a reference, to allow the
reader to look further?

> +
> +L1TF system information
> +-----------------------
> +
> +The Linux kernel provides a sysfs interface to read out the information
> +about L1TF. The relevant sysfs file is:
> +
> +/sys/devices/system/cpu/vulnerabilities/l1tf
> +
> +It provides information whether the system is affected by L1TF or not and
> +in case it is affected it provides information about the active mitigation
> +mechanisms.

Maybe, for more clarity:

  The Linux kernel provides a sysfs interface to enumerate the current
  L1TF status of the system: whether the system is vulnerable, and which
  mitigations are available and active. The relevant sysfs file is:

  /sys/devices/system/cpu/vulnerabilities/l1tf

?
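[ Illustration only: a minimal C sketch that simply dumps that sysfs file;
  nothing beyond the path quoted above is assumed, and the file is absent
  on kernels without the L1TF patches, which the error path covers: ]

/* Print the L1TF status string exposed by the kernel, e.g. "Not affected"
 * or a "Mitigation: ..." description. */
#include <stdio.h>

int main(void)
{
	char buf[256];
	FILE *f = fopen("/sys/devices/system/cpu/vulnerabilities/l1tf", "r");

	if (!f) {
		perror("l1tf sysfs file (kernel without L1TF support?)");
		return 1;
	}
	if (fgets(buf, sizeof(buf), f))
		printf("L1TF status: %s", buf);
	fclose(f);
	return 0;
}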
> +Guest mitigation mechanisms
> +---------------------------
> +
> +1. L1D flush on VMENTER
> +
> +   To make sure that a guest cannot attack data which is present in L1D the
> +   hypervisor flushes L1D before entering the guest.
>
> +   Flushing L1D evicts not only the data which should not be accessed by a
> +   potentially malicious guest, it also flushes the guest data. Flushing
> +   L1D has a performance impact as the processor has to bring the flushed
> +   guest data back into L1D. Depending on the frequency of VMEXIT/VMENTER
> +   and the type of computations in the guest performance degradation in the
> +   range of 1% to 50% has been observed. For scenarios where guest
> +   VMEXIT/VMENTER are rare the performance impact is minimal. Virtio and
> +   mechanisms like posted interrupts are designed to confine the VMEXITs to
> +   a bare minimum, but specific configurations and application scenarios
> +   might still suffer from a high VMEXIT rate.

A few articles are missing I think, here's the proposed fix:

  1. L1D flush on VMENTER

     To make sure that a guest cannot attack data which is present in the
     L1D the hypervisor flushes the L1D before entering the guest.

     Flushing the L1D evicts not only the data which should not be accessed
     by a potentially malicious guest, it also flushes the guest data.
     Flushing the L1D has a performance impact, as the processor has to
     bring the flushed guest data back into the L1D. Depending on the
     frequency of VMEXIT/VMENTER and the type of computations in the guest,
     performance degradation in the range of 1% to 50% has been observed.
     For scenarios where guest VMEXIT/VMENTER are rare the performance
     impact is minimal. Virtio and other mechanisms like posted interrupts
     are designed to confine the VMEXITs to a bare minimum, but specific
     configurations and application scenarios might still suffer from a
     high VMEXIT rate.

> +   The general recommendation is to enable L1D flush on VMENTER.
> +
> +   Note, that L1D flush does not prevent the SMT problem because the
> +   sibling thread will also bring back its data into L1D which makes it
> +   attackable again.

s/into L1D /into the L1D

> +2. Guest VCPU confinement to dedicated physical cores
> +
> +   To address the SMT problem, it is possible to make a guest or a group of
> +   guests affine to one or more physical cores. The proper mechanism for
> +   that is to utilize cpusets and to make sure that no other guest or host
> +   tasks can run on these cores.
> +
> +   If only a single guest or related guests run on sibling SMT threads on
> +   the same physical core then they can only attack their own memory and
> +   restricted parts of host memory.
> +
> +   Host memory is attackable when one of the sibling threads runs in
> +   host OS (hypervisor) context and the other in guest context. The amount
> +   of valuable information from the host OS context depends on the context
> +   which the host OS executes, i.e. interrupts, soft interrupts and kernel
> +   threads. The amount of valuable data from these contexts cannot be
> +   declared as non-interesting for an attacker without deep inspection of
> +   the code.
> +
> +   Note that assigning guests to a fixed set of physical cores affects the
> +   ability of the scheduler to perform load balancing and might have negative
> +   effects on CPU utilization depending on the hosting scenario. Disabling
> +   SMT might be a viable alternative for particular scenarios.
> +
> +   For further information about confining guests to a single or to a group
> +   of cores consult the cpusets documentation.

I've edited this one in place, just a few article problems.

> +   novirt,nowarn: Same as 'novirt', but hypervisors will not warn when
> +                  a VM is started in a potentially insecure configuration.
> +
> +The default is 'novirt'.

Isn't the default 'novirt,nowarn'?

> +   cond: Flush L1D on VMENTER only when the code between VMEXIT and
> +         VMENTER can leak host memory which is considered
> +         interesting for an attacker. This still can leak host data
> +         which allows e.g. to determine the hosts address space layout.

s/hosts /host's

?

Other than these, LGTM:

Reviewed-by: Ingo Molnar

Thanks,

	Ingo