Date: Thu, 9 Jun 2022 20:29:06 +0000
From: Sean Christopherson
To: Vishal Annapurve
Cc: Chao Peng, Marc Orr, kvm list, LKML, linux-mm@kvack.org,
    linux-fsdevel@vger.kernel.org, linux-api@vger.kernel.org,
    linux-doc@vger.kernel.org, qemu-devel@nongnu.org, Paolo Bonzini,
    Jonathan Corbet, Vitaly Kuznetsov, Wanpeng Li, Jim Mattson,
    Joerg Roedel, Thomas Gleixner, Ingo Molnar, Borislav Petkov, x86,
    "H . Peter Anvin", Hugh Dickins, Jeff Layton, "J . Bruce Fields",
    Andrew Morton, Mike Rapoport, Steven Price, "Maciej S . Szmigiero",
    Vlastimil Babka, Yu Zhang, "Kirill A . Shutemov", Andy Lutomirski,
    Jun Nakajima, Dave Hansen, Andi Kleen, David Hildenbrand,
    aarcange@redhat.com, ddutile@redhat.com, dhildenb@redhat.com,
    Quentin Perret, Michael Roth, mhocko@suse.com
Subject: Re: [PATCH v6 0/8] KVM: mm: fd-based approach for supporting KVM guest private memory
References: <20220519153713.819591-1-chao.p.peng@linux.intel.com>
    <20220607065749.GA1513445@chaop.bj.intel.com>
    <20220608021820.GA1548172@chaop.bj.intel.com>

On Wed, Jun 08, 2022, Vishal Annapurve wrote:
> ...
> > With this patch series, it's actually even not possible for userspace VMM
> > to allocate private page by a direct write, it's basically unmapped from
> > there. If it really wants to, it should so something special, by intention,
> > that's basically the conversion, which we should allow.
> >
>
> A VM can pass GPA backed by private pages to userspace VMM and when
> Userspace VMM accesses the backing hva there will be pages allocated
> to back the shared fd causing 2 sets of pages backing the same guest
> memory range.
>
> > Thanks for bringing this up. But in my mind I still think userspace VMM
> > can do and it's its responsibility to guarantee that, if that is hard
> > required.

That was my initial reaction too, but there are unfortunate side effects to
punting this to userspace.

> > By design, userspace VMM is the decision-maker for page
> > conversion and has all the necessary information to know which page is
> > shared/private. It also has the necessary knobs to allocate/free the
> > physical pages for guest memory. Definitely, we should make userspace
> > VMM more robust.
>
> Making Userspace VMM more robust to avoid double allocation can get
> complex, it will have to keep track of all in-use (by Userspace VMM)
> shared fd memory to disallow conversion from shared to private and
> will have to ensure that all guest supplied addresses belong to shared
> GPA ranges.

IMO, the complexity argument isn't sufficient justification for introducing
new kernel functionality.  If multiple processes are accessing guest memory
then there already needs to be some amount of coordination, i.e. it can't be
_that_ complex.

My concern with forcing userspace to fully handle unmapping shared memory is
that it may lead to additional performance overhead and/or noisy neighbor
issues, even if all guests are well-behaved.

Unmapping arbitrary ranges will fragment the virtual address space and consume
more memory for all the resulting VMAs.  The extra memory consumption isn't
that big of a deal, and it will be self-healing to some extent as VMAs will
get merged when the holes are filled back in (if the guest converts back to
shared), but it's still less than desirable.

More concerning is having to take mmap_lock for write for every conversion,
which is very problematic for configurations where a single userspace process
maps memory belonging to multiple VMs.  Unmapping and remapping on every
conversion will create a bottleneck, especially if a VM has sub-optimal
behavior and is converting pages at a high rate.

One argument is that userspace can simply rely on cgroups to detect
misbehaving guests, but (a) those types of OOMs will be a nightmare to debug
and (b) an OOM kill from the host is typically considered a _host_ issue and
will be treated as a missed SLO.

An idea for handling this in the kernel without too much complexity would be
to add F_SEAL_FAULT_ALLOCATIONS (terrible name) that would prevent page faults
from allocating pages, i.e. holes can only be filled by an explicit
fallocate().  Minor faults, e.g. due to NUMA balancing stupidity, and major
faults due to swap would still work, but a process that writes to previously
unreserved/unallocated memory would get a SIGSEGV on something it has mapped.
That would allow the userspace VMM to prevent unintentional allocations
without having to coordinate unmapping/remapping across multiple processes.