From: David Hildenbrand
Organization: Red Hat
To: Peter Xu
Cc: Tiberiu A Georgescu, akpm@linux-foundation.org, viro@zeniv.linux.org.uk,
    christian.brauner@ubuntu.com, ebiederm@xmission.com, adobriyan@gmail.com,
    songmuchun@bytedance.com, axboe@kernel.dk, vincenzo.frascino@arm.com,
    catalin.marinas@arm.com, peterz@infradead.org, chinwen.chang@mediatek.com,
    linmiaohe@huawei.com, jannh@google.com, apopple@nvidia.com,
    linux-kernel@vger.kernel.org, linux-fsdevel@vger.kernel.org,
    linux-mm@kvack.org, ivan.teterevkov@nutanix.com, florian.schmidt@nutanix.com,
    carl.waldspurger@nutanix.com, jonathan.davies@nutanix.com, Andrea Arcangeli
Subject: Re: [PATCH 0/1] pagemap: swap location for shared pages
Date: Wed, 11 Aug 2021 22:13:59 +0200
Message-ID: <154c2804-9861-ab91-4bfe-5354683fdfd3@redhat.com>

>>
>> Good question, I'd imagine e.g., file sealing could forbid uffd (or however
>> it is called) registration on a file, and there would have to be a way to
>> reject files that have uffd registered. But it's certainly a valid concern -
>> and it raises the question to *what* we actually want to apply such a
>> concept. Random files? random memfd? most probably not. Special memfds
>> created with an ALLOW_UFFD flag? sounds like a good idea.
>
> Note that when daemons open files, they may not be aware of what's underneath
> but read that file directly. The attacker could still create the file with
> uffd-wp enabled with any flag we introduce.

Right, but we could, for example, use a prctl to make a process opt
in to opening possibly-uffd-wp-protected files at all. I guess securing
that aspect shouldn't be a hard nut to crack. At least in my thinking.

>
>>
>>>
>>> I also don't know the initial concept when uffd is designed and why it's
>>> designed at pte level. Avoid vma manipulation should be a major factor, but I
>>> can't say I understand all of them. Not sure whether Andrea has any input here.
>>
>> AFAIU originally a) avoid signal handler madness and b) avoid VMA
>> modifications and c) avoid taking the mmap lock in write (well, that didn't
>> work out completely for uffd-wp for now IIRC).
>
> Nadav fixed that; it's with read lock now just like when it's introduced.
> Please see mwriteprotect_range() and commit 6ce64428d62026a10c.

Oh, rings a bell, thanks!

>>
>>>
>>> That's why I think current uffd can still make sense with per-process concepts
>>> and keep it that way.
>>> When register uffd-wp yes we need to do that for
>>> multiple processes, but it also means each process is fully aware that this is
>>> happening so it's kind of verified that this is wanted behavior for that
>>> process. It'll happen with less "surprises", and smells safer.
>>>
>>> I don't think that will not work out. It may require all the process to
>>> support uffd-wp apis and cooperate, but that's so far how it should work for me
>>> in a safe and self-contained way. Say, every process should be aware of what's
>>> going to happen on blocked page faults.
>>
>> That's a valid concern, although I wonder if it can just be handled via
>> specially marked memfds ("this memfd might get a uffd handler registered
>> later").
>
> Yes, please see my above concern. So I think we at least reached consensus on:
> (1) that idea is already not userfaultfd but something else; what's that is
> still to be defined. And, (2) that definitely needs further thoughts and
> context to support its validity and safety. Now uffd got people worried about
> safety already, that's why all the uffd selinux and privileged_userfaultfd
> sysctl comes to mainline; we'd wish good luck with the new concept!

Sure, whenever you introduce random ever-lasting delays, we have to be
very careful what we support. Even if that means not supporting some ioctls
for such a special memfd (hello secretmem :)).

>
> OTOH, uffd whole idea is already in mainline, it has limitations on requiring
> to rework all processes to support uffd-wp, but actually the same to MISSING
> messages has already happened and our QE is testing those: that's what we do
> with e.g. postcopy-migrating vhost-user enabled OVS-DPDK - we pass over uffd
> registered with missing mode and let QEMU handle the page fault. So it's a bit
> complicated but it should work. And I hope you can also agree we don't need to
> block uffd before that idea settles.

Let's phrase it this way: instead of extending something that just
doesn't fit cleanly and feels kind of hackish (see my approach to
teaching QEMU background snapshots above), I'd much rather see something
clean and actually performant for the use cases I am aware of.

That doesn't mean that your current uffd-wp approach on shmem is all bad
IMHO (well, I make no decisions either way :) ), I'd just like us to
eventually look into an approach that handles this cleanly, instead
of trying to solve problems we might not have to solve after all (pte
markers) when things are done differently.

>
> The pte markers idea need comment; that's about implementation, and it'll be
> great to have comments there or even NACK (better with a better suggestion,
> though :). But the original idea of uffd that is pte-based has never changed.

Right, I hope some other people can comment. If we want to go down that
path for uffd-wp, pte markers make sense. I'm not convinced we want them
to handle swapped shared pages, but that discussion is better continued in
your posting.

>>
>>>>
>>>> Again, I am not sure if uffd-wp or softdirty make too much sense in general
>>>> when applied to shmem. But I'm happy to learn more.
>>>
>>> Me too, I'm more than glad to know whether the page cache idea could be
>>> welcomed or am I just wrong about it. Before I understand more things around
>>> this, so far I still think the per-process based and fd-based solution of uffd
>>> still makes sense.
>>
>> I'd be curious about applications where the per-process approach would
>> actually solve something a per-fd approach couldn't solve. Maybe there are
>> some that I just can't envision.
>
> Right, that's a good point.
>
> Actually it could be when like virtio-mem that some process shouldn't have
> write privilege, but we still allow some other process writing to the shmem.
> Something like that.

With virtio-mem, you most probably wouldn't want anybody writing to it,
at least from what I can tell. But I understand the rough idea, just
that you cannot enforce something on another process that doesn't play
along (at least with the current uffd-wp approach! You could with an
fd-based approach).

>
>>
>> (using shmem for a single process only isn't a use case I consider important
>> :) )
>
> If you still remember the discussion about "having qemu start to use memfd and
> shmem as default"? :)

Oh yes :)

>
> shmem is hard but it's indeed useful in many cases, even if single threaded.
> For example, shmem-based VMs can do local binary update without migrating guest
> RAMs (because memory is shared between old/new binaries!). To me it's always a
> valid request to enable both shmem and write protect.

Right, but it would also just work with an fd-based approach. (Well,
unless we're dealing with shared anonymous RAM, but that is just some
weird stuff for really exotic use cases.)

-- 
Thanks,

David / dhildenb
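The local binary update mentioned above relies on one property of fd-backed shared memory: any new MAP_SHARED mapping of the same memfd, for instance one created by a freshly exec'd binary from an inherited fd, sees the previous mapping's contents without a copy. A minimal sketch, assuming Linux and glibc 2.27 or later for memfd_create(2); the helper names (create_guest_ram, map_guest_ram, simulate_local_update) are hypothetical, and the exec is only simulated by remapping the fd inside one process:

```c
#define _GNU_SOURCE /* for memfd_create() in glibc */
#include <string.h>
#include <sys/mman.h>
#include <sys/types.h>
#include <unistd.h>

/* Hypothetical helper: create a memfd standing in for guest RAM.
 * No MFD_CLOEXEC, so the fd would survive execve() into a new binary. */
static int create_guest_ram(size_t size)
{
    int fd = memfd_create("guest-ram", 0);
    if (fd < 0)
        return -1;
    if (ftruncate(fd, (off_t)size) < 0) {
        close(fd);
        return -1;
    }
    return fd;
}

/* Map the whole memfd MAP_SHARED, as either binary would. */
static char *map_guest_ram(int fd, size_t size)
{
    void *p = mmap(NULL, size, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
    return p == MAP_FAILED ? NULL : (char *)p;
}

/* Hypothetical driver: the "old binary" writes through one mapping and
 * drops it; the "new binary" maps the same fd again and should see the
 * data without any copy. Returns 0 if the contents survived, else -1.
 * `size` must be at least sizeof("guest state"). */
static int simulate_local_update(size_t size)
{
    static const char state[] = "guest state";
    int fd = create_guest_ram(size);
    if (fd < 0)
        return -1;

    char *old_map = map_guest_ram(fd, size);
    if (!old_map) {
        close(fd);
        return -1;
    }
    memcpy(old_map, state, sizeof(state));
    munmap(old_map, size); /* old binary goes away; the memfd keeps the data */

    char *new_map = map_guest_ram(fd, size);
    if (!new_map) {
        close(fd);
        return -1;
    }
    int same = memcmp(new_map, state, sizeof(state)) == 0;
    munmap(new_map, size);
    close(fd);
    return same ? 0 : -1;
}
```

For example, simulate_local_update(4096) should return 0. Because the state lives in the memfd rather than in any single mapping, anything attached to the fd travels with the memory across such an update.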