[MODERATED] Re: [PATCH v6 10/43] MDSv6

From: Dave Hansen <dave.hansen@intel.com>
To: speck@linutronix.de
Subject: [MODERATED] Re: [PATCH v6 10/43] MDSv6
Date: Mon, 25 Feb 2019 09:18:40 -0800	[thread overview]
Message-ID: <757c2247-fbe3-acda-710c-7120eb6a08f6@intel.com> (raw)
In-Reply-To: <fd985a6564dd500cf316665c5de823cb13843a1d.1551019522.git.ak@linux.intel.com>

[-- Attachment #1: Type: text/plain, Size: 7783 bytes --]

I haven't been paying super-close attention to all this MDS stuff, but I
have written a bunch of docs on all these things, so here are some
suggestions if you find them valuable.  Take 'em or leave 'em. :)

> +Some CPUs can leave read or written data in internal buffers,
> +which then later might be sampled through side effects.
> +For more details see CVE-2018-12126 CVE-2018-12130 CVE-2018-12127

Maybe:

Some CPUs can leave data in internal buffers which might later be
leaked.  For more details see CVE-2018-12126 CVE-2018-12130 CVE-2018-12127

> +This can be avoided by explicitly clearing the CPU state.

If you say "buffers" in the previous paragraph, it's probably good to be
consistent in the next paragraph.

> +We attempt to avoid leaking data between different processes,
> +and also some sensitive data, like cryptographic data, to
> +user space.

The kernel attempts to avoid leaking sensitive data from inside the
kernel to applications and also between applications.

> +We support three modes:
> +
> +(1) mitigation off (mds=off)
> +(2) clear only when needed (default)
> +(3) clear on every kernel exit, or guest entry (mds=full)
> +
> +(1) and (3) are trivial, the rest of the document discusses (2)
> +
> +In general option (3) is the most conservative choice. It does
> +not make ST assumptions about leaking data.

ST?

Do we need to say anything about why you might make one choice or another?

> +Basic requirements and assumptions
> +----------------------------------
> +
> +Kernel addresses and kernel temporary data are not sensitive.

Should we say something more about the implications?  Basically, picking
mode-1 or mode-2 weakens KASLR.

> +User data is sensitive, but only for other processes.

I thought we went to some trouble with Spectre-v2 to at least have some
minimal checks for app-to-app trust.  This makes it sound like we
consider all apps untrusted whether they are from the same user or not.

> +User data is anything in the user address space, or data buffers
> +directly copied from/to the user (e.g. read/write). It does not
> +include metadata, or flag settings. For example packet headers
> +or file names are not sensitive in this model.
> +
> +Block IO data (but not meta data) is sensitive.

Aren't filenames part of metadata?  We go to the trouble of encrypting
them for ecryptfs.  I thought that's because they can be sensitive.

> +Most data structures in the kernel are not sensitive.
> +
> +Kernel data is sensitive when it involves cryptographic keys.
> +
> +We consider data from input devices (such as key presses)
> +sensitive. We also consider sound data or terminal
> +data sensitive.

One problem I have with the tone of this documentation is that it seems
to state fact after fact with none of the backing logic.  For instance,
I think you're saying that the terminals are sensitive because folks
type passwords into them.  But, it would be wonderful to actually say that.

That way, folks can actually check the logic of the author or expand on it.

> +We assume that only data actually accessed by the kernel by explicit
> +instructions can be leaked.

I'm not a fan of saying "explicit instructions".  Explicit to me means
an instruction in the code base.  I think you mean something closer to
"instructions with architectural effects" or "retired instructions" or
something.

I'd also think that a gadget formed out of the middle of an instruction
that comes from the codebase wouldn't be "explicit", but this model
doesn't seem to comprehend those.

> Note that this may not be always
> +true, in theory prefetching or speculation may touch more. The assumption
> +is that if any such happens it will be very low bandwidth and hard
> +to control due to the existing Spectre and other mitigations,
> +such as memory randomization.  If a user is concerned about this they
> +need to use mds=full.
What is "memory randomization"?  Do you mean user and kernel ASLR?

Should this maybe say something like:

	This approach does not mitigate leaks that originate from non-
	architectural data access such as speculation or prefetching.

I also have a hard time restating the "low-bandwidth" bits, mostly
because I don't know enough about it to say if such a channel would be
high or low-bandwidh.  I *hope* it would be low-bandwidth, and that's
obviously required by the model here, but I just don't know what we're
supposed to take away from this.

Why even bother to mention "Spectre and other mitigations"?

> +Guidance for driver/subsystem developers
> +----------------------------------------
> +
> +[These generally need to be enforced in code review for new code now]
> +
> +When you touch user supplied data of *other* processes in system call

Nit: needs to be "user-supplied"

> +context add lazy_clear_cpu().

An example here would be nice.  I can't think of one of these of the top
of my head other than ptrace, and *that* is actually a counter-example.

> +For the cases below we care only about data from other processes.
> +Touching non cryptographic data from the current process is always allowed.

Nit: needs to be "non-cryptographic"

> +Touching only pointers to user data is always allowed.

Is this referring to the pointer or the data pointed to?  But, I usually
think of "touching a pointer" as dereferencing it.

> +When your interrupt does touch user data directly mark it with IRQF_USER_DATA.
> +
> +When your tasklet does touch user data directly, mark it TASKLET_USER_DATA
> +using tasklet_init_flags/or DECLARE_TASKLET_USERDATA*.
> +
> +When your timer does touch user data mark it with TIMER_USER_DATA
> +If it is a hrtimer and touches user data, mark it with HRTIMER_MODE_USER_DATA.

This all looks good and straightforward.

> +When your irq poll handler does touch user data, mark it lazy_clear_cpu().

This is different than the others.  This is a *call* you have to make,
and it has to be done in a specific order versus touching the data.  If
you clear before touching the data, it's no good.  This is backwards
from some of the other mitigations, so probably really good to mention.

> +For networking code, make sure to only touch user data through
> +skb_push/put/copy [add more], unless it is data from the current
> +process. If that is not ensured add lazy_clear_cpu or
> +lazy_clear_cpu_interrupt.

Nit: lazy_clear_cpu has parenthesis in the previous paragraph but not
here.  Consistency would be nice.

> +Any cryptographic code touching key data should use memzero_explicit
> +or kzfree to free the data.

Is it just key data?  I would think the plaintext is just as sensitive
as the key data.

> +If your RCU callback touches user data add lazy_clear_cpu().
> +
> +These steps are currently only needed for code that runs on MDS affected

Nit: MDS-affected

> +CPUs, which is currently only x86. But might be worth being prepared
> +if other architectures become affected too.

I think you're trying to say that even code that can't or doesn't run on
x86 can add these mitigations.  They might want to do this in the case
that similar vulnerabilities appear in the future.

> +Implementation details/assumptions
> +----------------------------------
> +
> +Any buffer clearing is done lazily on next kernel exit. lazy_clear*
> +is only a few fast instructions with no cache misses setting
> +a flag and can be used frequently even in fast paths.

This is a bit overzealous. :)

It's not like we're using some magic instructions that never miss the
caches.  Maybe:

	lazy_clear* is designed to be low-cost and can be used
	frequently even in performance-sensitive paths.