From: "Catangiu, Adrian Costin" <email@example.com> To: "Jason A. Donenfeld" <Jason@zx2c4.com>, Jann Horn <firstname.lastname@example.org> Cc: Willy Tarreau <email@example.com>, "MacCarthaigh, Colm" <firstname.lastname@example.org>, "Andy Lutomirski" <email@example.com>, "Theodore Y. Ts'o" <firstname.lastname@example.org>, "Eric Biggers" <email@example.com>, "open list:DOCUMENTATION" <firstname.lastname@example.org>, kernel list <email@example.com>, "open list:VIRTIO GPU DRIVER" <firstname.lastname@example.org>, "Graf (AWS), Alexander" <email@example.com>, "Woodhouse, David" <firstname.lastname@example.org>, "email@example.com" <firstname.lastname@example.org>, "Singh, Balbir" <email@example.com>, "Weiss, Radu" <firstname.lastname@example.org>, "email@example.com" <firstname.lastname@example.org>, "email@example.com" <firstname.lastname@example.org>, "Jonathan Corbet" <email@example.com>, Greg Kroah-Hartman <firstname.lastname@example.org>, "Michael S. Tsirkin" <email@example.com>, Qemu Developers <firstname.lastname@example.org>, KVM list <email@example.com>, Michal Hocko <firstname.lastname@example.org>, "Rafael J. Wysocki" <email@example.com>, Pavel Machek <firstname.lastname@example.org>, Linux API <email@example.com> Subject: Re: [PATCH] drivers/virt: vmgenid: add vm generation id driver Date: Sat, 17 Oct 2020 18:06:25 +0000 Message-ID: <76FF304F-C828-46CB-B82B-0948D4E927F5@amazon.com> (raw) In-Reply-To: <CAHmME9qHGSF8w3DoyCP+ud_N0MAJ5_8zsUWx=rxQB1mFnGcu9w@mail.gmail.com> After discussing this offline with Jann a bit, I have a few general comments on the design of this. First, the UUID communicated by the hypervisor should be consumed by the kernel -- added as another input to the rng -- and then userspace should be notified that it should reseed any userspace RNGs that it may have, without actually communicating that UUID to userspace. IOW, I agree with Jann there. Then, it's the functioning of this notification mechanism to userspace that is interesting to me. Agreed! The UUID/vmgenid is the glue to the hypervisor to be able to find out about VM snapshots/forks. The really interesting (and important) topic here is finding the right notification mechanism to userspace. ...In retrospect, I should have posted this as RFC instead of PATCH. So, anyway, here are a few options with some pros and cons for the kernel notifying userspace that its RNG should reseed. 1. SIGRND - a new signal. Lol. 2. Userspace opens a file descriptor that it can epoll on. Pros are that many notification mechanisms already use this. Cons is that this requires syscall and might be more racy than we want. Another con is that this a new thing for userspace programs to do. 3. We stick an atomic counter in the vDSO, Jann's suggestion. Pros are that this is extremely fast, and also simple to use and implement. There are enough sequence points in typical crypto programs that checking to see whether this counter has changed before doing whatever operation seems easy enough. Cons are that typically we've been conservative about adding things to the vDSO, and this is also a new thing for userspace programs to do. For each 1, 2, and 3 options, userspace programs _have to do smth new_ anyway, so I wouldn't weigh that as a con. An atomic counter in the vDSO looks like the most bang for the buck to me. I'm really curious to hear more opinions on why we shouldn't do it. 4. We already have a mechanism for this kind of thing, because the same issue comes up when fork()ing. The solution was MADV_WIPEONFORK, where userspace marks a page to be zeroed when forking, for the purposes of the RNG being notified when its world gets split in two. This is basically the same thing as we're discussing here with guest snapshots, except it's on the system level rather than the process level, and a system has many processes. But the problem space is still almost the same, and we could simply reuse that same mechanism. There are a few implementation strategies for that: I don't think we can piggy back on MADV_WIPEONFORK. That madvise flag has a clear contract of only wiping _on fork_. Overloading it with wiping on VM-fork - while process doesn't fork - might break some of its users. 4a. We mess with the PTEs of all processes' pages that are MADV_WIPEONFORK, like fork does now, when the hypervisor notifies us to do so. Then we wind up reusing the already existing logic for userspace RNGs. Cons might be that this usually requires semaphores, and we're in irq context, so we'd have to hoist to a workqueue, which means either more wake up latency, or a larger race window. 4b. We just memzero all processes' pages that are MADV_WIPEONFORK, when the hypervisor notifies us to do so. Then we wind up reusing the already existing logic for userspace RNGs. 4c. The guest kernel maintains an array of physical addresses that are MADV_WIPEONFORK. The hypervisor knows about this array and its location through whatever protocol, and before resuming a moved/snapshotted/duplicated VM, it takes the responsibility for memzeroing this memory. The huge pro here would be that this eliminates all races, and reduces complexity quite a bit, because the hypervisor can perfectly synchronize its bringup (and SMP bringup) with this, and it can even optimize things like on-disk memory snapshots to simply not write out those pages to disk. I've previously proposed a path similar (in concept at least) with a combination of 4 a,b and c - https://lwn.net/ml/linux-mm/B7793B7A-3660-4769-9B9A-FFCF250728BB@amazon.com/ without reusing MADV_WIPEONFORK, but by adding a dedicated MADV_WIPEONSUSPEND. That proposal was clunky however with many people raising concerns around how the interface is too subtle and hard to work with. A vmgenid driver offering a clean FS interface seemed cleaner, although, like some of you observed, it still allows a window of time between actual VM fork and userspace handling of the event. One other direction that I would like to explore and I feel it’s similar to your 4c proposal is to do smth like: "mm: extend memfd with ability to create 'secret' memory" https://patchwork.kernel.org/project/linux-mm/patch/20200130162340.GA14232@rapoport-lnx/ Maybe we can combine ideas from the two patches in smth like: instead of libs using anon mem with MADV_WIPEONSUSPEND, they can use a secret_mem_fd with a (secretmem specific) WIPEONSUSPEND flag; then instead of crawling PTEs in the core PM code looking for MADV_WIPEONSUSPEND mappings, we can register this secretmem device to wipe its own secrets during a pm callback. From a crypto safety pov, the ideal solution would be one where wiping happens before or during VM forks, not after. Having said that, I think a vDSO (updated by the guest kernel _after_ VM fork) still closes that gap enough to be practically safe. Thanks, Adrian. Amazon Development Center (Romania) S.R.L. registered office: 27A Sf. Lazar Street, UBC5, floor 2, Iasi, Iasi County, 700045, Romania. Registered in Romania. Registration number J22/2621/2005.
next prev parent reply index Thread overview: 30+ messages / expand[flat|nested] mbox.gz Atom feed top [not found] <AQHWo8lIfZnFKGe8nkGmhTCXwq5R3w==> 2020-10-16 14:33 ` Catangiu, Adrian Costin 2020-10-16 15:00 ` Catangiu, Adrian Costin 2020-10-16 15:14 ` gregkh 2020-10-17 1:40 ` Jann Horn 2020-10-17 3:36 ` Willy Tarreau 2020-10-17 4:02 ` Jann Horn 2020-10-17 4:34 ` Colm MacCarthaigh 2020-10-17 5:01 ` Jann Horn 2020-10-17 5:29 ` Colm MacCarthaigh 2020-10-17 5:37 ` Willy Tarreau 2020-10-17 5:52 ` Jann Horn 2020-10-17 6:44 ` Willy Tarreau 2020-10-17 6:55 ` Jann Horn 2020-10-17 7:17 ` Willy Tarreau 2020-10-17 13:24 ` Jason A. Donenfeld 2020-10-17 18:06 ` Catangiu, Adrian Costin [this message] 2020-10-17 18:09 ` Alexander Graf 2020-10-18 2:08 ` Jann Horn 2020-10-20 9:35 ` Christian Borntraeger 2020-10-20 9:54 ` Alexander Graf 2020-10-20 16:54 ` Catangiu, Adrian Costin 2020-10-18 3:14 ` Colm MacCarthaigh 2020-10-18 15:52 ` Michael S. Tsirkin 2020-10-18 15:54 ` Andy Lutomirski 2020-10-18 15:59 ` Michael S. Tsirkin 2020-10-18 16:14 ` Andy Lutomirski 2020-10-19 15:00 ` Michael S. Tsirkin 2020-10-17 18:10 ` Andy Lutomirski 2020-10-19 17:15 ` Mathieu Desnoyers 2020-10-20 10:00 ` Alexander Graf
Reply instructions: You may reply publicly to this message via plain-text email using any one of the following methods: * Save the following mbox file, import it into your mail client, and reply-to-all from there: mbox Avoid top-posting and favor interleaved quoting: https://en.wikipedia.org/wiki/Posting_style#Interleaved_style * Reply using the --to, --cc, and --in-reply-to switches of git-send-email(1): git send-email \ --in-reply-to=76FF304F-C828-46CB-B82B-0948D4E927F5@amazon.com \ --firstname.lastname@example.org \ --cc=Jason@zx2c4.com \ --email@example.com \ --firstname.lastname@example.org \ --email@example.com \ --firstname.lastname@example.org \ --email@example.com \ --firstname.lastname@example.org \ --email@example.com \ --firstname.lastname@example.org \ --email@example.com \ --firstname.lastname@example.org \ --email@example.com \ --firstname.lastname@example.org \ --email@example.com \ --firstname.lastname@example.org \ --email@example.com \ --firstname.lastname@example.org \ --email@example.com \ --firstname.lastname@example.org \ --email@example.com \ --firstname.lastname@example.org \ --email@example.com \ --firstname.lastname@example.org \ --email@example.com \ --firstname.lastname@example.org \ --email@example.com \ /path/to/YOUR_REPLY https://kernel.org/pub/software/scm/git/docs/git-send-email.html * If your mail client supports setting the In-Reply-To header via mailto: links, try the mailto: link
QEMU-Devel Archive on lore.kernel.org Archives are clonable: git clone --mirror https://lore.kernel.org/qemu-devel/0 qemu-devel/git/0.git git clone --mirror https://lore.kernel.org/qemu-devel/1 qemu-devel/git/1.git git clone --mirror https://lore.kernel.org/qemu-devel/2 qemu-devel/git/2.git # If you have public-inbox 1.1+ installed, you may # initialize and index your mirror using the following commands: public-inbox-init -V2 qemu-devel qemu-devel/ https://lore.kernel.org/qemu-devel \ firstname.lastname@example.org public-inbox-index qemu-devel Example config snippet for mirrors Newsgroup available over NNTP: nntp://nntp.lore.kernel.org/org.nongnu.qemu-devel AGPL code for this site: git clone https://public-inbox.org/public-inbox.git