Re: [PATCH v7 1/2] drivers/misc: sysgenid: add system generation id driver

From: Alexander Graf <graf@amazon.com>
To: "Michael S. Tsirkin" <mst@redhat.com>
Cc: Adrian Catangiu <acatan@amazon.com>, <linux-doc@vger.kernel.org>,
	<linux-kernel@vger.kernel.org>, <qemu-devel@nongnu.org>,
	<kvm@vger.kernel.org>, <linux-s390@vger.kernel.org>,
	<gregkh@linuxfoundation.org>, <rdunlap@infradead.org>,
	<arnd@arndb.de>, <ebiederm@xmission.com>, <rppt@kernel.org>,
	<0x7f454c46@gmail.com>, <borntraeger@de.ibm.com>,
	<Jason@zx2c4.com>, <jannh@google.com>, <w@1wt.eu>,
	<colmmacc@amazon.com>, <luto@kernel.org>, <tytso@mit.edu>,
	<ebiggers@kernel.org>, <dwmw@amazon.co.uk>, <bonzini@gnu.org>,
	<sblbir@amazon.com>, <raduweis@amazon.com>, <corbet@lwn.net>,
	<mhocko@kernel.org>, <rafael@kernel.org>, <pavel@ucw.cz>,
	<mpe@ellerman.id.au>, <areber@redhat.com>, <ovzxemul@gmail.com>,
	<avagin@gmail.com>, <ptikhomirov@virtuozzo.com>, <gil@azul.com>,
	<asmehra@redhat.com>, <dgunigun@redhat.com>,
	<vijaysun@ca.ibm.com>, <oridgar@gmail.com>, <ghammer@redhat.com>
Date: Thu, 25 Feb 2021 00:22:41 +0100
Message-ID: <e7768780-ce08-9998-8200-d3c33d34fade@amazon.com>
In-Reply-To: <20210224173205-mutt-send-email-mst@kernel.org>

On 24.02.21 23:41, Michael S. Tsirkin wrote:
> 
> On Wed, Feb 24, 2021 at 02:45:03PM +0100, Alexander Graf wrote:
>>> Above should try harder to explain what are the things that need to be
>>> scrubbed and why. For example, I personally don't really know what is
>>> the OpenSSL session token example and what makes it vulnerable. I guess
>>> snapshots can attack each other?
>>>
>>> Here's a simple example of a workflow that submits transactions
>>> to a database and wants to avoid duplicate transactions.
>>> This does not require overseer magic. It does however require
>>> a correct genid from hypervisor, so no mmap tricks work.
>>>
>>>           int genid, oldgenid, transid;
>>>           read(&genid);
>>> start:
>>>           oldgenid = genid;
>>>           transid = submit transaction
>>>           read(&genid);
>>>           /* genid changed while the transaction was in flight */
>>>           if (genid != oldgenid) {
>>>                   revert transaction (transid);
>>>                   goto start;
>>>           }
>>
>> I'm not sure I fully follow. For starters, if this is a VM local database, I
>> don't think you'd care about the genid. If it's a remote database, your
>> connection would get dropped already at the point when you clone/resume,
>> because TCP and your connection state machine will get really confused when
>> you suddenly have a different IP address or two consumers of the same stream
>> :).
>>
>> But for the sake of the argument, let's assume you can have a connectionless
>> database connection that maintains its own connection uniqueness logic.
> 
> Right. E.g. not uncommon with REST APIs. They survive disconnect easily
> and use cookies or such.
> 
>> That
>> database connector would need to understand how to abort the connection (and
>> thus the transaction!) when the generation changes.
> 
> The point is that instead of all that, you discover that the transaction
> is a duplicate and revert it.
> 
> 
>> And that's logic you
>> would do with the read/write/notify mechanism. So your main loop would check
>> for reads on the genid fd and after sending a connection termination, notify
>> the overlord that it's safe to use the VM now.
>>
>> The OpenSSL case (with mmap) is for libraries that are stateless and cannot
>> guarantee that they receive a genid notification event in a timely manner.
>>
>> Since you asked, this is mainly important for the PRNG. Imagine an https
>> server. You create a snapshot. You resume from that snapshot. OpenSSL is
>> fully initialized with a user space PRNG randomness pool that it considers
>> safe to consume. However, that means your first connection after resume will
>> be 100% predictable randomness wise.
> 
> I wonder whether something similar is possible here. I.e. use the secret
> to encrypt stuff but check the gen ID before actually sending data.
> If it changed re-encrypt. Hmm?

I don't see why you would though. Once you control the application
level, just use the event-based API. That's the much easier one to use.
The mmap one is really just there to cover cases where you don't own the
main event loop, yet can't afford the syscall overhead of checking on
every invocation whether the genid changed.
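
To make the mmap case a bit more concrete, here is a minimal sketch of
the pattern, under loud assumptions: fake_mapped_genid stands in for the
word a real mapping would expose, struct prng, prng_init() and
prng_next() are hypothetical names, fresh_entropy is a placeholder for
whatever the library would pull from the kernel, and the xorshift64 step
is a toy generator swapped in for illustration, not what OpenSSL does.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Stand-in for the generation counter an mmap() of the device would
 * expose. Hypothetical: a real consumer would point at the mapping. */
static volatile uint32_t fake_mapped_genid = 0;

struct prng {
	const volatile uint32_t *genid;	/* where the current generation lives */
	uint32_t seen_gen;		/* generation the pool was seeded under */
	uint64_t state;			/* toy state, NOT cryptographically secure */
	unsigned int reseeds;		/* reseeds triggered by a generation change */
};

static void prng_init(struct prng *p, const volatile uint32_t *genid,
		      uint64_t seed)
{
	assert(genid != NULL);
	p->genid = genid;
	p->seen_gen = *genid;
	p->state = seed ? seed : 1;	/* xorshift state must be non-zero */
	p->reseeds = 0;
}

/* Every invocation compares the cached generation against the mapped
 * one with a plain memory load, so there is no syscall on the fast path. */
static uint64_t prng_next(struct prng *p, uint64_t fresh_entropy)
{
	uint32_t now = *p->genid;

	if (now != p->seen_gen) {
		/* We were cloned/resumed: the old pool is no longer private. */
		p->state = fresh_entropy ? fresh_entropy : 1;
		p->seen_gen = now;
		p->reseeds++;
	}

	/* xorshift64 step */
	p->state ^= p->state << 13;
	p->state ^= p->state >> 7;
	p->state ^= p->state << 17;
	return p->state;
}
```

Two instances initialized with the same seed model two clones of one
snapshot: they return identical values until the generation changes,
and diverge afterwards only because each one reseeds from its own fresh
entropy on the next call.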

> 
>>
>> The mmap mechanism allows the PRNG to reseed after a genid change. Because
>> we don't have an event mechanism for this code path, that can happen minutes
>> after the resume. But that's ok, we "just" have to ensure that nobody is
>> consuming secret data at the point of the snapshot.
> 
> 
> Something I am still not clear on is whether it's really important to
> skip the system call here. If not I think it's prudent to just stick
> to read for now, I think there's a slightly lower chance that
> it will get misused. mmap which gives you a laggy gen id value
> really seems like it would be hard to use correctly.

The read is not any less racy than the mmap. The real "safety" of the 
read interface comes from the acknowledge path. And that path requires 
you to be part of the event loop.
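
As a rough illustration of why the acknowledge step is what matters,
here is a toy in-process model. It is not the driver's interface: struct
genid_model and the watcher_*() helpers are hypothetical, and rejecting
a stale acknowledgement is an assumption of this sketch.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define MAX_WATCHERS 8

/* Toy model of the generation counter plus per-watcher ack state. */
struct genid_model {
	uint32_t current;		/* current generation */
	uint32_t acked[MAX_WATCHERS];	/* last generation each watcher acked */
	int nwatchers;
};

/* Register a watcher; it starts out in sync with the current generation. */
static int watcher_open(struct genid_model *m)
{
	assert(m->nwatchers < MAX_WATCHERS);
	m->acked[m->nwatchers] = m->current;
	return m->nwatchers++;
}

/* "read": just a sample of the counter. It can be stale the instant
 * it returns, which is why it is no less racy than the mmap. */
static uint32_t watcher_read(const struct genid_model *m)
{
	return m->current;
}

/* "acknowledge": only accepted if it names the current generation. */
static bool watcher_ack(struct genid_model *m, int w, uint32_t gen)
{
	assert(w >= 0 && w < m->nwatchers);
	if (gen != m->current)
		return false;
	m->acked[w] = gen;
	return true;
}

/* What the overseer waits on: watchers that have not caught up yet. */
static int outdated_watchers(const struct genid_model *m)
{
	int n = 0;

	for (int i = 0; i < m->nwatchers; i++)
		if (m->acked[i] != m->current)
			n++;
	return n;
}
```

In this model the overseer would keep the system quiesced until
outdated_watchers() drops to zero. A watcher that only ever calls
watcher_read() never changes that count, which is the sense in which
the safety comes from the acknowledge path and not from the read.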

> 
> 
>>>
>>>> +Simplifying assumption - safety prerequisite
>>>> +--------------------------------------------
>>>> +
>>>> +**Control the snapshot flow**, disallow snapshots coming at arbitrary
>>>> +moments in the workload lifetime.
>>>> +
>>>> +Use a system-level overseer entity that quiesces the system before
>>>> +snapshot and, after snapshot resume, verifies that software components
>>>> +have readjusted to the new environment, to the new generation. Only
>>>> +then will the overseer un-quiesce the system and allow active workloads.
>>>> +
>>>> +Software components can choose whether they want to be tracked and
>>>> +waited on by the overseer by using the ``SYSGENID_SET_WATCHER_TRACKING``
>>>> +IOCTL.
>>>> +
>>>> +The sysgenid framework standardizes the API for system software to
>>>> +find out about needing to readjust and at the same time provides a
>>>> +mechanism for the overseer entity to wait for everyone to be done, the
>>>> +system to have readjusted, so it can un-quiesce.
>>>> +
>>>> +Example snapshot-safe workflow
>>>> +------------------------------
>>>> +
>>>> +1) Before taking a snapshot, quiesce the VM/container/system. Exactly
>>>> +   how this is achieved is very workload-specific, but the general
>>>> +   description is to get all software to an expected state where their
>>>> +   event loops dry up and they are effectively quiesced.
>>>
>>> If you have the ability to do this by communicating with
>>> all processes e.g. through a unix domain socket,
>>> why do you need the rest of the stuff in the kernel?
>>> Quiescing is a harder problem than waking up.
>>
>> That depends. Think of a typical VM workload. Let's take the web server
>> example again. You can preboot the full VM and snapshot it as is. As long as
>> you don't allow any incoming connections, you can guarantee that the system
>> is "quiesced" well enough for the snapshot.
> 
> Well you can use a firewall or such to block incoming packets,
> but I am not at all sure that means e.g. all socket buffers
> are empty.

If it's a fresh VM that only started the web server and did nothing 
else, there shouldn't be anything in its socket buffers :).

I agree that it won't allow us to cover 100% of all cases automatically
and seamlessly. I can't think of any solution that does - if you can
think of something I'm all ears. But this API at least gives us a path
to slowly move the ecosystem to a point where applications and libraries
can enable themselves to become VM/container clone aware. Today we don't
even give them the opportunity to self-adjust.

Alex

Amazon Development Center Germany GmbH
Krausenstr. 38
10117 Berlin
Geschaeftsfuehrung: Christian Schlaeger, Jonathan Weiss
Eingetragen am Amtsgericht Charlottenburg unter HRB 149173 B
Sitz: Berlin
Ust-ID: DE 289 237 879