Date: Thu, 2 Sep 2021 21:47:11 +0300
From: "Kirill A. Shutemov"
To: Sean Christopherson
Cc: Paolo Bonzini, Vitaly Kuznetsov, Wanpeng Li, Jim Mattson,
 Joerg Roedel, kvm@vger.kernel.org, linux-kernel@vger.kernel.org,
 Borislav Petkov, Andy Lutomirski, Andrew Morton, Joerg Roedel,
 Andi Kleen, David Rientjes, Vlastimil Babka, Tom Lendacky,
 Thomas Gleixner, Peter Zijlstra, Ingo Molnar, Varad Gautam,
 Dario Faggioli, x86@kernel.org, linux-mm@kvack.org,
 linux-coco@lists.linux.dev, "Kirill A . Shutemov",
 Kuppuswamy Sathyanarayanan, David Hildenbrand, Dave Hansen, Yu Zhang
Subject: Re: [RFC] KVM: mm: fd-based approach for supporting KVM guest
 private memory
Message-ID: <20210902184711.7v65p5lwhpr2pvk7@box.shutemov.name>
References: <20210824005248.200037-1-seanjc@google.com>
In-Reply-To: <20210824005248.200037-1-seanjc@google.com>

Hi folks,

I've tried to sketch what the memfd changes would look like.

I've added F_SEAL_GUEST. The new seal is only allowed if there are no
pre-existing pages in the fd (i_mapping->nrpages check) and there is no
existing mapping of the file
(RB_EMPTY_ROOT(&i_mapping->i_mmap.rb_root) check).

After the seal is set, no read/write/mmap from userspace is allowed.

It's not clear, though, how to serialize the read check vs. seal setup:
the seal is protected with inode_lock(), which we don't hold in the
read path because it is expensive.
I don't know yet how to get it right. For TDX, it's okay to allow read
as it cannot trigger #MCE. Maybe we can allow it?

Truncate and punch hole are tricky. We want to allow them so we can
save memory if a substantial range is converted to shared. Partial
truncate and punch hole effectively write zeros to the partially
truncated page and may lead to #MCE. We can reject any partial
truncate/punch requests, but that doesn't help the situation with THPs.

If we truncate to the middle of a THP, we try to split it into small
pages and proceed as usual for small pages. But the split is allowed to
fail. If that happens, we zero part of the THP. I guess we may reject
the truncate if the split fails. It should work fine if we only use it
for saving memory.

We need to modify the truncation/punch path to notify KVM that pages
are about to be freed. I think we will register a callback in the memfd
when adding the fd to a KVM memslot, and that callback is going to be
called for the notification. That means 1:1 between memfd and memslot.
I guess it's okay.

Migration is always going to fail on F_SEAL_GUEST for now. It can be
modified to use a callback in the future.

Swapout will also always fail on F_SEAL_GUEST. It seems trivial. Again,
it can be a callback in the future.

For GPA->PFN translation KVM could use vm_ops->fault(). Semantically it
is a good fit, but we don't have any VMAs around and ->mmap is forbidden
for F_SEAL_GUEST. The other option is to call shmem_getpage() directly,
but it looks like a layering violation to me. And it's not available to
modules :/

Any comments?

--
 Kirill A. Shutemov
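
P.S. To make the rules above a bit more concrete, here is a rough
userspace toy model in C of the seal precondition, the post-seal access
check, the single notification callback per memfd, and the rejection of
partial punch requests. It is only a sketch under stated assumptions,
not kernel code: every model_* name, MODEL_SEAL_GUEST, MODEL_PAGE_SIZE
and the errno values are made up for illustration, and the THP
split-failure case is not modelled at all.

```c
/* Userspace toy model of the checks sketched above; NOT kernel code.
 * All names and errno values are hypothetical, chosen for illustration. */
#include <assert.h>
#include <errno.h>
#include <stddef.h>

#define MODEL_PAGE_SIZE  4096UL
#define MODEL_SEAL_GUEST 0x1u	/* stand-in for the proposed F_SEAL_GUEST */

/* Called before the pages in [start, end) would be freed. */
typedef void (*model_invalidate_fn)(void *owner, unsigned long start,
				    unsigned long end);

struct model_memfd {
	unsigned long nrpages;		/* models i_mapping->nrpages */
	unsigned long nr_mappings;	/* nonzero models a non-empty i_mmap tree */
	unsigned int seals;
	model_invalidate_fn invalidate;	/* single slot: 1:1 memfd <-> memslot */
	void *owner;
};

struct model_range {
	unsigned long start, end;
};

/* The seal is refused if the fd already has pages or mappings. */
int model_add_guest_seal(struct model_memfd *f)
{
	if (f->nrpages != 0 || f->nr_mappings != 0)
		return -EBUSY;
	f->seals |= MODEL_SEAL_GUEST;
	return 0;
}

/* Once sealed, read/write/mmap from userspace is refused. */
int model_user_access(const struct model_memfd *f)
{
	return (f->seals & MODEL_SEAL_GUEST) ? -EPERM : 0;
}

/* Only one callback can be registered, hence one memslot per memfd. */
int model_register_memslot(struct model_memfd *f, model_invalidate_fn fn,
			   void *owner)
{
	if (f->invalidate)
		return -EBUSY;
	f->invalidate = fn;
	f->owner = owner;
	return 0;
}

/* On a sealed fd: reject partial-page requests, then notify the owner
 * before the range would be freed. */
int model_punch_hole(struct model_memfd *f, unsigned long start,
		     unsigned long len)
{
	if (f->seals & MODEL_SEAL_GUEST) {
		if (start % MODEL_PAGE_SIZE || len % MODEL_PAGE_SIZE)
			return -EINVAL;
		if (f->invalidate)
			f->invalidate(f->owner, start, start + len);
	}
	/* A real implementation would free the pages here. */
	return 0;
}

/* Example callback: records the last range it was notified about. */
void model_record_range(void *owner, unsigned long start, unsigned long end)
{
	struct model_range *r = owner;

	assert(owner != NULL);
	r->start = start;
	r->end = end;
}
```

In this model, sealing a model_memfd that already has pages fails, and
after a successful seal a page-aligned model_punch_hole() hands the
range to the registered callback before anything would be freed, while
an unaligned request is rejected without notifying anyone.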