Date: Tue, 25 Apr 2023 16:01:04 -0700
Subject: Re: [PATCH v10 9/9] KVM: Enable and expose KVM_MEM_PRIVATE
From: Sean Christopherson
To: Ackerley Tng
Cc: chao.p.peng@linux.intel.com, xiaoyao.li@intel.com, isaku.yamahata@gmail.com,
	kvm@vger.kernel.org, linux-kernel@vger.kernel.org, linux-mm@kvack.org,
	linux-fsdevel@vger.kernel.org, linux-arch@vger.kernel.org,
	linux-api@vger.kernel.org, linux-doc@vger.kernel.org, qemu-devel@nongnu.org,
	pbonzini@redhat.com, corbet@lwn.net, vkuznets@redhat.com,
	wanpengli@tencent.com, jmattson@google.com, joro@8bytes.org,
	tglx@linutronix.de, mingo@redhat.com, bp@alien8.de, arnd@arndb.de,
	naoya.horiguchi@nec.com, linmiaohe@huawei.com, x86@kernel.org,
	hpa@zytor.com, hughd@google.com, jlayton@kernel.org, bfields@fieldses.org,
	akpm@linux-foundation.org, shuah@kernel.org, rppt@kernel.org,
	steven.price@arm.com, mail@maciej.szmigiero.name, vbabka@suse.cz,
	vannapurve@google.com, yu.c.zhang@linux.intel.com,
	kirill.shutemov@linux.intel.com, luto@kernel.org, jun.nakajima@intel.com,
	dave.hansen@intel.com, ak@linux.intel.com, david@redhat.com,
	aarcange@redhat.com, ddutile@redhat.com, dhildenb@redhat.com,
	qperret@google.com, tabba@google.com, michael.roth@amd.com,
	mhocko@suse.com, wei.w.wang@intel.com
On Tue, Apr 18, 2023, Ackerley Tng wrote:
> Sean Christopherson writes:
> > I agree, a pure alignment check is too restrictive, and not really what I
> > intended despite past me literally saying that's what I wanted :-) I think
> > I may have also inverted the "less alignment" statement, but luckily I
> > believe that ends up being a moot point.
> >
> > The goal is to avoid having to juggle scenarios where KVM wants to create a
> > hugepage, but restrictedmem can't provide one because of a misaligned file
> > offset. I think the rule we want is that the offset must be aligned to the
> > largest page size allowed by the memslot _size_. E.g. on x86, if the
> > memslot size is >=1GiB then the offset must be 1GiB-aligned or better,
> > ditto for >=2MiB and >=4KiB (ignoring that 4KiB is already a requirement).
> >
> > We could loosen that to say the largest size allowed by the memslot, but I
> > don't think that's worth the effort unless it's trivially easy to implement
> > in code, e.g.
> > KVM could technically allow a 4KiB-aligned offset if the memslot is 2MiB
> > sized but only 4KiB aligned on the GPA. I doubt there's a real use case
> > for such a memslot, so I want to disallow that unless it's super easy to
> > implement.
>
> Checking my understanding here about why we need this alignment check:
>
> When KVM requests a page from restrictedmem, KVM will provide an offset
> into the file in terms of 4K pages.
>
> When shmem is configured to use hugepages, shmem_get_folio() will round
> the requested offset down to the nearest hugepage-aligned boundary in
> shmem_alloc_hugefolio().
>
> Example of problematic configuration provided to
> KVM_SET_USER_MEMORY_REGION2:
>
> + shmem configured to use 1GB pages
> + restrictedmem_offset provided to KVM_SET_USER_MEMORY_REGION2: 0x4000
> + memory_size provided in KVM_SET_USER_MEMORY_REGION2: 1GB
> + KVM requests offset (pgoff_t) 0x8, which translates to offset 0x8000
>
> restrictedmem_get_page() and shmem_get_folio() return the page for
> offset 0x0 in the file, since rounding down 0x8000 to the nearest 1GB is
> 0x0. This is allocating outside the range that KVM is supposed to use,
> since the parameters provided in KVM_SET_USER_MEMORY_REGION2 are only
> supposed to cover offset 0x4000 to (0x4000 + 1GB = 0x40004000) in the
> file.
>
> IIUC shmem will actually just round down (0x4000 rounded down to the
> nearest 1GB will be 0x0) and allocate without checking bounds, so if
> offset 0x0 to 0x4000 in the file were supposed to be used by something
> else, there might be issues.
>
> Hence, this alignment check ensures that rounding down of any offsets
> provided by KVM (based on the page size configured in the backing file
> provided) to restrictedmem_get_page() must not go below the offset
> provided to KVM_SET_USER_MEMORY_REGION2.
>
> Enforcing alignment of restrictedmem_offset based on the currently-set
> page size in the backing file (i.e.
> shmem) may not be effective, since the size of the pages in the backing
> file can be adjusted to a larger size after KVM_SET_USER_MEMORY_REGION2
> succeeds. With that, we may still end up allocating outside the range
> that KVM was provided with.
>
> Hence, to be safe, we should check alignment to the max page size across
> all backing filesystems, so the constraint is
>
>     rounding down restrictedmem_offset to
>     min(max page size across all backing filesystems,
>         max page size that fits in memory_size) == restrictedmem_offset
>
> which is the same check as
>
>     restrictedmem_offset must be aligned to min(max page size across all
>     backing filesystems, max page size that fits in memory_size)
>
> which can safely reduce to
>
>     restrictedmem_offset must be aligned to max page size that fits in
>     memory_size
>
> since "max page size that fits in memory_size" is probably <= "max page
> size across all backing filesystems", and if it's larger, it'll just be
> a tighter constraint. Yes?

The alignment check isn't strictly required, KVM _could_ deal with the above
scenario, it's just a lot simpler and safer for KVM if the file offset needs
to be sanely aligned.

> If the above understanding is correct:
>
> + We must enforce this in the KVM_SET_USER_MEMORY_REGION2 handler, since
>   IIUC shmem will just round down and allocate without checking bounds.
>
> + I think this is okay because holes in the restrictedmem file (in
>   terms of offset) made to accommodate this constraint don't cost us
>   anything anyway(?) Are they just arbitrary offsets in a file? In
>   our case, this file is usually a new and empty file.
>
> + In the case of migration of a restrictedmem file between two KVM
>   VMs, this constraint would cause a problem if the largest possible
>   page size on the destination machine is larger than that of the
>   source machine. In that case, we might have to move the data in the
>   file to a different offset (a separate problem).
Hmm, I was thinking this would be a non-issue because the check would be
tied to the max _possible_ page size irrespective of hardware support, but
that would be problematic if KVM ever supports 512GiB pages. I'm not sure
that speculatively requiring super huge memslots to be 512GiB aligned is
sensible.

Aha! If we go with a KVM ioctl(), a clean way around this is to tie the
alignment requirement to the memfd flags, e.g. if userspace requests the
memfd to be backed by PMD hugepages, then the memslot offset needs to be
2MiB aligned on x86. That will continue to work if (big if) KVM supports
512GiB pages because the "legacy" memfd would still be capped at 2MiB pages.
Architectures that support variable hugepage sizes might need to do
something else, but I don't think that possibility affects what x86
can/can't do.

> + On this note, it seems like there is no check for when the range is
>   smaller than the allocated page? Like if the range provided is 4KB in
>   size, but shmem is then configured to use a 1GB page, will we end up
>   allocating past the end of the range?

No, KVM already gracefully handles situations like this. Well, x86 does, I
assume other architectures do too :-)

As above, the intent of the extra restriction is so that KVM doesn't need
even more weird code (read: math) to gracefully handle the new edge cases
that would come with fd-only memslots.