Re: [PATCH v2] drivers/virt: vmgenid: add vm generation id driver

From: Alexander Graf <graf@amazon.de>
To: Jann Horn <jannh@google.com>,
	"Catangiu, Adrian Costin" <acatan@amazon.com>
Cc: Christian Borntraeger <borntraeger@de.ibm.com>,
	"Jason A. Donenfeld" <Jason@zx2c4.com>, Willy Tarreau <w@1wt.eu>,
	"MacCarthaigh, Colm" <colmmacc@amazon.com>,
	Andy Lutomirski <luto@kernel.org>,
	"Theodore Y. Ts'o" <tytso@mit.edu>,
	Eric Biggers <ebiggers@kernel.org>,
	"open list:DOCUMENTATION" <linux-doc@vger.kernel.org>,
	kernel list <linux-kernel@vger.kernel.org>,
	"Woodhouse, David" <dwmw@amazon.co.uk>,
	"bonzini@gnu.org" <bonzini@gnu.org>,
	"Singh, Balbir" <sblbir@amazon.com>,
	"Weiss, Radu" <raduweis@amazon.com>,
	"oridgar@gmail.com" <oridgar@gmail.com>,
	"ghammer@redhat.com" <ghammer@redhat.com>,
	Jonathan Corbet <corbet@lwn.net>,
	Greg Kroah-Hartman <gregkh@linuxfoundation.org>,
	"Michael S. Tsirkin" <mst@redhat.com>,
	Qemu Developers <qemu-devel@nongnu.org>,
	KVM list <kvm@vger.kernel.org>, Michal Hocko <mhocko@kernel.org>,
	"Rafael J. Wysocki" <rafael@kernel.org>,
	Pavel Machek <pavel@ucw.cz>,
	Linux API <linux-api@vger.kernel.org>,
	"mpe@ellerman.id.au" <mpe@ellerman.id.au>,
	linux-s390 <linux-s390@vger.kernel.org>,
	"areber@redhat.com" <areber@redhat.com>,
	Pavel Emelyanov <ovzxemul@gmail.com>,
	Andrey Vagin <avagin@gmail.com>, Mike Rapoport <rppt@kernel.org>,
	Dmitry Safonov <0x7f454c46@gmail.com>,
	Pavel Tikhomirov <ptikhomirov@virtuozzo.com>,
	"gil@azul.com" <gil@azul.com>,
	"asmehra@redhat.com" <asmehra@redhat.com>,
	"dgunigun@redhat.com" <dgunigun@redhat.com>,
	"vijaysun@ca.ibm.com" <vijaysun@ca.ibm.com>
Subject: Re: [PATCH v2] drivers/virt: vmgenid: add vm generation id driver
Date: Mon, 7 Dec 2020 15:22:06 +0100
Message-ID: <113122dd-1600-4948-1faa-72ddf46c0439@amazon.de>
In-Reply-To: <CAG48ez2akv0pGSt084sNHtESbjJNXpx=Ko86JEsyZM24+5zLqw@mail.gmail.com>

On 27.11.20 21:20, Jann Horn wrote:
> 
> On Fri, Nov 27, 2020 at 8:04 PM Catangiu, Adrian Costin
> <acatan@amazon.com> wrote:
>> On 27/11/2020 20:22, Jann Horn wrote:
>>> On Fri, Nov 20, 2020 at 11:29 PM Jann Horn <jannh@google.com> wrote:
>>>> On Mon, Nov 16, 2020 at 4:35 PM Catangiu, Adrian Costin
>>>> <acatan@amazon.com> wrote:
>>>>> This patch is a driver that exposes a monotonic incremental Virtual
>>>>> Machine Generation u32 counter via a char-dev FS interface that
>>>>> provides sync and async VmGen counter updates notifications. It also
>>>>> provides VmGen counter retrieval and confirmation mechanisms.
>>>>>
>>>>> The hw provided UUID is not exposed to userspace, it is internally
>>>>> used by the driver to keep accounting for the exposed VmGen counter.
>>>>> The counter starts from zero when the driver is initialized and
>>>>> monotonically increments every time the hw UUID changes (the VM
>>>>> generation changes).
>>>>>
>>>>> On each hw UUID change, the new hypervisor-provided UUID is also fed
>>>>> to the kernel RNG.
>>>> As for v1:
>>>>
>>>> Is there a reasonable usecase for the "confirmation" mechanism? It
>>>> doesn't seem very useful to me.
>>
>> I think it adds value in complex scenarios with multiple users of the
>> mechanism, potentially at varying layers of the stack, different
>> processes and/or runtime libraries.
>>
>> The driver offers a natural place to handle minimal orchestration
>> support and offer visibility in system-wide status.
>>
>> A high-level service that trusts all system components to properly use
>> the confirmation mechanism can actually block and wait patiently for the
>> system to adjust to the new world. Even if it doesn't trust all
>> components it can still do a best-effort, timeout block.
> 
> What concrete action would that high-level service be able to take
> after waiting for such an event?

For us, it would only allow incoming requests to the target container 
after the container has successfully adjusted.

You can think of other models too. Your container orchestration engine
could prevent network traffic from reaching the container applications
until the full container is adjusted, for example.

> My model of the vmgenid mechanism is that RNGs and cryptographic
> libraries that use it need to be fundamentally written such that it is
> guaranteed that a VM fork can not cause the same random number /
> counter / ... to be reused for two different cryptographic operations
> in a way visible to an attacker. This means that e.g. TLS libraries
> need to, between accepting unencrypted input and sending out encrypted
> data, check whether the vmgenid changed since the connection was set
> up, and if a vmgenid change occurred, kill the connection.
> 
> Can you give a concrete example of a usecase where the vmgenid
> mechanism is used securely and the confirmation mechanism is necessary
> as part of that?

The crux here is that we have two fundamental approaches:

1) Transactional

For an RNG, the natural place to adjust yourself to a resumed snapshot 
is in the random number retrieval. You just check if your generation is 
still identical when you fetch the next random number.

Ideally, you also do the same for anything consuming such a random 
number. So random number retrieval would no longer just return ( number 
), but instead ( number, generation ). That way you could check at every 
consumer side of the random number that it's actually still random. That 
would need to cascade down.
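
As a rough sketch of what ( number, generation ) could look like in a
userspace RNG: all names below are made up for illustration,
vm_generation merely stands in for the counter you would read from the
shared page, and xorshift64 is a toy generator, not a CSPRNG.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical stand-in for the generation counter in the shared page. */
static volatile uint32_t vm_generation;

/* A random number tagged with the generation it was drawn under. */
struct tagged_random {
	uint64_t number;
	uint32_t generation;
};

/* Toy RNG state. xorshift64 is NOT cryptographically secure. */
static uint64_t rng_state = 0x9e3779b97f4a7c15ull;
static uint32_t rng_generation;

static void rng_reseed(uint32_t gen)
{
	/* A real RNG would pull fresh entropy from the kernel here. */
	rng_state ^= ((uint64_t)gen + 1) * 0xbf58476d1ce4e5b9ull;
	if (rng_state == 0)
		rng_state = 1;
	rng_generation = gen;
}

/* Check the generation on every retrieval, reseed if it moved. */
static struct tagged_random rng_next(void)
{
	uint32_t gen = vm_generation;

	if (gen != rng_generation)
		rng_reseed(gen);
	rng_state ^= rng_state << 13;
	rng_state ^= rng_state >> 7;
	rng_state ^= rng_state << 17;
	return (struct tagged_random){ rng_state, gen };
}

/* Consumers call this right before they actually use the number. */
static bool tagged_random_is_fresh(const struct tagged_random *r)
{
	return r->generation == vm_generation;
}
```

A consumer that finds tagged_random_is_fresh() returning false would
throw the number away and call rng_next() again.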

So every key you derive from a random number and every UUID you
generate would need to store the generation as well, and compare
whether the current generation is still the same as when they were
generated. That means you need to convert every data access method to a
function call that checks whether the value is still consumable and, if
not, regenerates it. The same applies to global values, such as a
system global UUID that is shared among multiple processes.
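
To illustrate what "convert every data access to a function call" means
in practice, here is a minimal sketch for a cached UUID. Everything in
it is hypothetical: vm_generation stands in for the shared counter, and
the "UUID" is a placeholder string built from counters rather than a
real random UUID.

```c
#include <stdint.h>
#include <stdio.h>

/* Hypothetical stand-in for the generation counter in the shared page. */
static volatile uint32_t vm_generation;

struct system_uuid {
	char text[37];        /* 36 characters plus NUL */
	uint32_t generation;  /* generation the value was created under */
	int valid;
};

static struct system_uuid cached_uuid;
static unsigned int uuid_regenerations;

/* Placeholder generator; real code would create a proper random UUID. */
static void uuid_regenerate(struct system_uuid *u, uint32_t gen)
{
	uuid_regenerations++;
	snprintf(u->text, sizeof(u->text),
		 "00000000-0000-4000-8000-%04x%08x",
		 uuid_regenerations & 0xffffu, (unsigned int)gen);
	u->generation = gen;
	u->valid = 1;
}

/* The only way to read the UUID: validates the generation first. */
static const char *system_uuid_get(void)
{
	uint32_t gen = vm_generation;

	if (!cached_uuid.valid || cached_uuid.generation != gen)
		uuid_regenerate(&cached_uuid, gen);
	return cached_uuid.text;
}
```

Every former direct read of the UUID variable turns into a call to
system_uuid_get(), and the same conversion would be needed for each
derived value.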

If you slowly move away from super integrated environments like a TLS 
library and start thinking of samba system UUIDs or SSH host keys, 
you'll quickly see how that approach reaches its limits.

2) Event based

Let's take a look at the complicated things to implement with the 
transactional approach: samba system UUIDs, SSH host keys, global 
variables in a random Java application that get initialized on 
application start.

All of these are very easy to resolve through an event based mechanism.
Based on the "new generation" event, you can just generate a new UUID.
Or a new host key. For this to be non-racy, all you need is the
guarantee that the target services have adjusted themselves before you
actually use them. In most container workloads, that can be achieved by
not letting network traffic in before the event is fully processed.
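
A minimal sketch of that event flow, under the assumption that some
notification source calls on_new_generation() once per generation
change. The handler registry, the two regenerate_* placeholders and the
traffic_allowed flag are all invented for illustration; a real setup
would regenerate actual keys and flip a firewall rule instead.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define MAX_HANDLERS 8

typedef void (*regen_handler)(uint32_t new_generation);

static regen_handler handlers[MAX_HANDLERS];
static size_t handler_count;
static bool traffic_allowed = true;   /* stand-in for a network gate */
static uint32_t adjusted_generation;  /* last fully processed generation */

/* Placeholders: record the generation instead of creating real secrets. */
static uint32_t host_key_generation;
static uint32_t system_uuid_generation;

static void regenerate_host_key(uint32_t gen)
{
	host_key_generation = gen;
}

static void regenerate_system_uuid(uint32_t gen)
{
	system_uuid_generation = gen;
}

static bool register_regen_handler(regen_handler h)
{
	if (handler_count == MAX_HANDLERS)
		return false;
	handlers[handler_count++] = h;
	return true;
}

/* Called once per "new generation" event. */
static void on_new_generation(uint32_t new_generation)
{
	traffic_allowed = false;              /* close the gate first */
	for (size_t i = 0; i < handler_count; i++)
		handlers[i](new_generation);  /* let everyone adjust */
	adjusted_generation = new_generation;
	traffic_allowed = true;               /* reopen only when done */
}
```

The ordering is the point: the gate closes before any handler runs and
reopens only after the last one has returned.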

What this patch set does is provide both: We allow the transactional
approach to be implemented through mmap() of a shared page for stacks
where that's easiest. You can use that when your logic is realistically
convertible to transactional. We also allow for an asynchronous event,
which can be used in environments where the transactional approach is
hard because of design constraints (language, API, system, etc.).
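
For the mmap() side, the userspace part could look roughly like the
sketch below. It takes an already opened file descriptor and assumes,
purely for illustration, that the generation counter is a u32 at offset
0 of the mapped page; genid_map() and genid_changed() are hypothetical
helper names, not part of the patch.

```c
#define _POSIX_C_SOURCE 200809L

#include <stddef.h>
#include <stdint.h>
#include <sys/mman.h>
#include <unistd.h>

/*
 * Map one read-only shared page from fd. Assumption for this sketch:
 * the generation counter is a u32 at offset 0 of that page.
 */
static const volatile uint32_t *genid_map(int fd)
{
	long page = sysconf(_SC_PAGESIZE);
	void *p;

	if (page <= 0)
		return NULL;
	p = mmap(NULL, (size_t)page, PROT_READ, MAP_SHARED, fd, 0);
	return p == MAP_FAILED ? NULL : (const volatile uint32_t *)p;
}

/*
 * Compare the mapped counter against the caller's cached copy.
 * Returns 1 and updates the cache if the generation changed, else 0.
 */
static int genid_changed(const volatile uint32_t *page, uint32_t *cached)
{
	uint32_t now = *page;

	if (now == *cached)
		return 0;
	*cached = now;
	return 1;
}
```

The check itself is a single memory read, so it needs no syscall on the
hot path.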

Combining the two, you get the best of both worlds IMHO.

> 
>>>> How do you envision integrating this with libraries that have to work
>>>> in restrictive seccomp sandboxes? If this was in the vDSO, that would
>>>> be much easier.
>>
>> Since this mechanism targets all of userspace stack, the usecase greatly
>> vary. I doubt we can have a single silver bullet interface.
>>
>> For example, the mmap interface targets user space RNGs, where as fast
>> and as race free as possible is key. But there also higher level
>> applications that don't manage their own memory or don't have access to
>> low-level primitives so they can't use the mmap or even vDSO interfaces.
>> That's what the rest of the logic is there for, the read+poll interface
>> and all of the orchestration logic.
> 
> Are you saying that, because people might not want to write proper
> bindings for this interface while also being unwilling to take the
> performance hit of calling read() in every place where they would have
> to do so to be fully correct, you want to build a "best-effort"
> mechanism that is deliberately designed to allow some cryptographic
> state reuse in a limited time window?

I seriously hope that for crypto, we will always(?) be able to use the 
transactional approach. And there we don't even have to resort to read() 
- you can just mmap() the generation ID.

The event based mechanism is there for the other cases that are not
easily converted to such an approach. As a library owner, you always
have the choice.

That said, I don't think the "best-effort" mechanism is as bad as you
describe it above. If you're thinking of a normal VM image, imagine
systemd implemented vmgenid support. It could install a quick BPF
program that just blocks all network traffic from the VM until its
genid is fully synchronized across all processes. Ideally, it would
even be able to kill uncooperative processes eventually, so that your
resumed VM is reachable after a timeout.
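
The decision logic of such a supervisor can be sketched as below. This
is only a model under stated assumptions: struct watcher, gate_evaluate()
and the millisecond clock are invented here, and "killed" is a flag
where real code would send a signal and reap the process.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* One process that is expected to confirm each new generation. */
struct watcher {
	int pid;                    /* illustrative identifier only */
	uint32_t acked_generation;  /* last generation it confirmed */
	bool killed;                /* real code would send a signal */
};

enum gate_state { GATE_BLOCKED, GATE_OPEN };

/*
 * Decide whether network traffic may flow again. Watchers that have
 * not confirmed current_generation keep the gate blocked until the
 * timeout expires; after that they are marked killed and ignored.
 */
static enum gate_state gate_evaluate(struct watcher *w, size_t n,
				     uint32_t current_generation,
				     unsigned int elapsed_ms,
				     unsigned int timeout_ms)
{
	size_t pending = 0;

	for (size_t i = 0; i < n; i++) {
		if (w[i].killed ||
		    w[i].acked_generation == current_generation)
			continue;
		if (elapsed_ms >= timeout_ms)
			w[i].killed = true;  /* uncooperative: give up */
		else
			pending++;
	}
	return pending ? GATE_BLOCKED : GATE_OPEN;
}
```

The supervisor would call gate_evaluate() whenever a confirmation comes
in or a timer fires, and lift the traffic block once it returns
GATE_OPEN.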

>> Like you correctly point out, there are also scenarios like tight
>> seccomp jails where even the FS interfaces is inaccessible. For cases
>> like this and others, I believe we will have to work incrementally to
>> build up the interface diversity to cater to all the user scenarios
>> diversity.
> 
> It would be much nicer if we could have one simple interface that lets
> everyone correctly do what they need to, though...

I want a pony too :). We need to do what's best for our users here. I am
not convinced that only offering a transaction based interface is going
to find the adoption we're hoping for. That means we'll end up less
secure than we want to be, not more.

Alex

Amazon Development Center Germany GmbH
Krausenstr. 38
10117 Berlin
Geschaeftsfuehrung: Christian Schlaeger, Jonathan Weiss
Eingetragen am Amtsgericht Charlottenburg unter HRB 149173 B
Sitz: Berlin
Ust-ID: DE 289 237 879