[MODERATED] Re: [patch V10 10/10] Control knobs and Documentation 10

From: Borislav Petkov <bp@suse.de>
To: speck@linutronix.de
Subject: [MODERATED] Re: [patch V10 10/10] Control knobs and Documentation 10
Date: Sun, 15 Jul 2018 09:30:58 +0200
Message-ID: <20180715073058.GA25608@nazgul.tnic>
In-Reply-To: <20180712142957.791282859@linutronix.de>

On Thu, Jul 12, 2018 at 04:19:12PM +0200, speck for Thomas Gleixner wrote:
>  Documentation/admin-guide/index.rst |    9 
>  Documentation/admin-guide/l1tf.rst  |  572 ++++++++++++++++++++++++++++++++++++
>  2 files changed, 581 insertions(+)

Reads nicely, just a couple of minor things which jumped out at me while
reading, below:

...

> +2. Malicious guest in a virtual machine
> +^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
> +
> +   The fact that L1TF breaks all domain protections allows malicious guest
> +   OSes, which can control the PTEs directly, and malicious guest user
> +   space applications, which run on an unprotected guest kernel lacking the
> +   PTE inversion mitigation for L1TF, to attack physical host memory.
> +
> +   A special aspect of L1TF in the context of virtualization is symmetric
> +   multi threading (SMT). The Intel implementation of SMT is called
> +   HyperThreading. The fact that Hyperthreads on the affected processors
> +   share the L1 Data Cache (L1D) is important for this. As the flaw allows
> +   only to attack data which is present in L1D, a malicious guest running
> +   on one Hyperthread can attack the data which is brought into the L1D by
> +   the context which runs on the sibling Hyperthread of the same physical
> +   core. This context can be host OS, host user space or a different guest.
> +
> +   If the processor does not support Extended Page Tables, the attack is
> +   only possible, when the hypervisor does not sanitize the content of the
> +   effective (shadow) page tables.
> +
> +   While solutions exist to mitigate these attack vectors fully, these
> +   mitigations are not enabled by default in the Linux kernel because they
> +   can affect performance significantly. The kernel provides several
> +   mechanisms which can be utilized to address the problem depending on the
> +   deployment scenario. The mitigations, their protection scope and impact
> +   are described in the next sections.
> +
> +   The default mitigations and the rationale for chosing them are explained

choosing

> +   at the end of this document. See :ref:`default_mitigations`.
> +
> +.. _l1tf_sys_info:
> +

...

> +1. L1D flush on VMENTER
> +^^^^^^^^^^^^^^^^^^^^^^^
> +
> +   To make sure that a guest cannot attack data which is present in the L1D
> +   the hypervisor flushes the L1D before entering the guest.
> +
> +   Flushing the L1D evicts not only the data which should not be accessed
> +   by a potentially malicious guest, it also flushes the guest
> +   data. Flushing the L1D has a performance impact as the processor has to

s/Flushing the L1D/Therefore it/

> +   bring the flushed guest data back into the L1D. Depending on the
> +   frequency of VMEXIT/VMENTER and the type of computations in the guest
> +   performance degradation in the range of 1% to 50% has been observed. For
> +   scenarios where guest VMEXIT/VMENTER are rare the performance impact is
> +   minimal. Virtio and mechanisms like posted interrupts are designed to
> +   confine the VMEXITs to a bare minimum, but specific configurations and
> +   application scenarios might still suffer from a high VMEXIT rate.
> +
> +   The general recommendation is to enable L1D flush on VMENTER.
> +
> +   Note, that L1D flush does not prevent the SMT problem because the

s/,//

> +   sibling thread will also bring back its data into the L1D which makes it
> +   attackable again.
> +
> +   L1D flush can be controlled by the administrator via the kernel command
> +   line and sysfs control files. See :ref:`mitigation_control_command_line`
> +   and :ref:`mitigation_control_kvm`.
> +
> +.. _guest_confinement:
> +
> +2. Guest VCPU confinement to dedicated physical cores
> +^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
> +
> +   To address the SMT problem, it is possible to make a guest or a group of
> +   guests affine to one or more physical cores. The proper mechanism for
> +   that is to utilize exclusive cpusets to ensure that no other guest or
> +   host tasks can run on these cores.
> +
> +   If only a single guest or related guests run on sibling SMT threads on
> +   the same physical core then they can only attack their own memory and
> +   restricted parts of the host memory.
> +
> +   Host memory is attackable, when one of the sibling SMT threads runs in
> +   host OS (hypervisor) context and the other in guest context. The amount
> +   of valuable information from the host OS context depends on the context
> +   which the host OS executes, i.e. interrupts, soft interrupts and kernel
> +   threads. The amount of valuable data from these contexts cannot be
> +   declared as non-interesting for an attacker without deep inspection of
> +   the code.
> +
> +   Note, that assigning guests to a fixed set of physical cores affects the

s/,//

> +   ability of the scheduler to do load balancing and might have negative
> +   effects on CPU utilization depending on the hosting scenario. Disabling
> +   SMT might be a viable alternative for particular scenarios.
> +
> +   For further information about confining guests to a single or to a group
> +   of cores consult the cpusets documentation:
> +
> +   https://www.kernel.org/doc/Documentation/cgroup-v1/cpusets.txt

Should this reference be relative to our Documentation/ tree instead, i.e.,

       ../cgroup-v1/cpusets.txt

so that the doc system resolves it to the respective absolute URL
wherever it is displayed?
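
As an aside on the confinement paragraphs above: the "no other guest or
host tasks can run on these cores" condition only holds if the cpuset
covers whole cores, i.e. every SMT sibling of each guest CPU is in the
set too. A minimal sketch of that check, assuming a hypothetical
`siblings` map (CPU number to the set of logical CPUs on the same core,
which could for instance be built from the topology information in
sysfs; that source is my assumption, not something the patch states):

```python
def covers_whole_cores(guest_cpus, siblings):
    """Return True if every SMT sibling of each guest CPU is also a guest CPU.

    `siblings` is a hypothetical map: logical CPU -> set of logical CPUs
    sharing the same physical core (including the CPU itself).
    """
    guest = set(guest_cpus)
    return all(siblings[cpu] <= guest for cpu in guest)


def stray_siblings(guest_cpus, siblings):
    """Return the sibling threads that are left outside the guest CPU set."""
    guest = set(guest_cpus)
    stray = set()
    for cpu in guest:
        stray |= siblings[cpu] - guest
    return stray
```

With two cores whose threads are (0,4) and (1,5), the set {0,4} covers a
whole core, while {0,1} leaves threads 4 and 5 free for host or other
guest work, which is the situation the text warns about.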

> +
> +.. _interrupt_isolation:
> +
> +3. Interrupt affinity
> +^^^^^^^^^^^^^^^^^^^^^
> +
> +   Interrupts can be made affine to logical CPUs. This is not universally
> +   true because there are types of interrupts which are truly per CPU
> +   interrupts, e.g. the local timer interrupt. Aside of that multi queue
							       ^
							       ,

> +   devices affine their interrupts to single CPUs or groups of CPUs per

s/affine/assign/

> +   queue without allowing the administrator to control the affinities.
> +
> +   Moving the interrupts, which can be affinity controlled, away from CPUs
> +   which run untrusted guests, reduces the attack vector space.
> +
> +   Whether the interrupts with are affine to CPUs, which run untrusted

s/with //

> +   guests, provide interesting data for an attacker depends on the system

	      provides

> +   configuration and the scenarios which run on the system. While for some
> +   of the interrupts it can be assumed that they wont expose interesting
> +   information beyond exposing hints about the host OS memory layout, there

s/exposing //

> +   is no way to make general assumptions.
> +
> +   Interrupt affinity can be controlled by the administrator via the
> +   /proc/irq/$NR/smp_affinity[_list] files. Limited documentation is
> +   available at:
> +
> +   https://www.kernel.org/doc/Documentation/IRQ-affinity.txt

Same comment as above.
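
Since the text points at smp_affinity[_list] without showing the format,
a short illustration might help readers. The sketch below parses a CPU
list string of the "0-3,8" form and computes which CPUs remain for
interrupts once the guest CPUs are taken out. The function names are
made up for illustration, and the plain hex mask ignores the
comma-separated grouping the kernel uses for large masks:

```python
def parse_cpu_list(text):
    """Parse a CPU list such as "0-3,8" into a set of CPU numbers."""
    cpus = set()
    for part in text.strip().split(","):
        if not part:
            continue
        if "-" in part:
            lo, hi = part.split("-", 1)
            cpus.update(range(int(lo), int(hi) + 1))
        else:
            cpus.add(int(part))
    return cpus


def cpu_mask_hex(cpus):
    """Return the CPUs as a plain hex bitmask, bit N set for CPU N."""
    mask = 0
    for cpu in cpus:
        mask |= 1 << cpu
    return format(mask, "x")


def irq_cpus(all_cpus, guest_cpus):
    """CPUs left for interrupts after removing those running untrusted guests."""
    return set(all_cpus) - set(guest_cpus)
```

For example, with CPUs 0-7 online and guests confined to "2-3,6-7", the
remaining set is {0, 1, 4, 5}, i.e. mask 33.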

> +
> +.. _smt_control:
> +
> +4. SMT control
> +^^^^^^^^^^^^^^
> +
> +   To prevent the SMT issues of L1TF it might be necessary to disable SMT
> +   completely. Disabling SMT can have a significant performance impact, but
> +   the impact depends on the hosting scenario and the type of workloads.
> +   The impact of disabling SMT needs also to be weighted against the impact

s/weighted/weighed/

> +   of other mitigation solutions like confining guests to dedicated cores.
> +
> +   The kernel provides a sysfs interface to retrieve the status of SMT and
> +   to control it. It also provides a kernel command line interface to
> +   control SMT.
> +
> +   The kernel command line interface consists of the following options:
> +

...

> +5. Disabling EPT
> +^^^^^^^^^^^^^^^^
> +
> +  Disabling EPT for virtual machines provides full mitigation for L1TF even
> +  with SMT enabled, because the effective page tables for guests are
> +  managed and sanitized by the hypervisor. Though disabling EPT has a

s/Though/However/

> +  significant performance impact especially when the Meltdown mitigation
> +  KPTI is enabled.
> +
> +  EPT can be disabled in the hypervisor via the 'kvm-intel.ept' parameter.
> +
> +There is ongoing research and development for new mitigation mechanisms to
> +address the performance impact of disabling SMT or EPT.
> +
> +.. _mitigation_control_command_line:
> +
> +Mitigation control on the kernel command line
> +---------------------------------------------
> +
> +The kernel command line allows to control the L1TF mitigations at boot

Passive:

"L1TF mitigations are controlled on the kernel command line with the
option ..."

> +time with the option "l1tf=". The valid arguments for this option are:
> +
> +  ============  ===================================================
> +  full		Provides all available mitigations for the L1TF
> +		vulnerability. Disables SMT and enables all mitigations in
> +		the hypervisors.
> +
> +		SMT control and L1D flush control via the sysfs interface
> +		is still possible after boot.  Hypervisors will issue a
> +		warning when the first VM is started in a potentially
> +		insecure configuration, i.e. SMT enabled or L1D flush
> +		disabled.
> +

...
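
To illustrate how the "l1tf=" option appears on a command line, here is
a small sketch that pulls its argument out of a command line string. The
helper name is hypothetical, it does not validate the argument against
the table above, and letting a later occurrence override an earlier one
is an assumption of this sketch:

```python
def l1tf_option(cmdline):
    """Return the argument of the last "l1tf=" token, or None if absent."""
    value = None
    for token in cmdline.split():
        if token.startswith("l1tf="):
            value = token[len("l1tf="):]
    return value
```

Called on "ro quiet l1tf=full" this returns "full"; without the option
it returns None, in which case the default mitigations apply.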

> +
> +  - Interrupt isolation:
> +
> +    Isolating the guest CPUs from interrupts can reduce the attack surface
> +    further, but still allows a malicious guest to explore a limited amount
> +    of host physical memory. This can at least be used to gain knowledge
> +    about the host address space layout. The interrupts which have a fixed
> +    affinity to the CPUs which run the untrusted guests can depending on
> +    the scenario still trigger soft interrupts and schedule kernel threads

"... which run the untrusted guests can - depending on the scenario - still
trigger ..."

> +    which might expose valuable information. See
> +    :ref:`interrupt_isolation`.
> +

-- 
Regards/Gruss,
    Boris.

SUSE Linux GmbH, GF: Felix Imendörffer, Jane Smithard, Graham Norton, HRB 21284 (AG Nürnberg)