From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: X-Spam-Checker-Version: SpamAssassin 3.4.0 (2014-02-07) on aws-us-west-2-korg-lkml-1.web.codeaurora.org Received: from vger.kernel.org (vger.kernel.org [23.128.96.18]) by smtp.lore.kernel.org (Postfix) with ESMTP id 8B860C77B7C for ; Tue, 18 Apr 2023 23:40:46 +0000 (UTC) Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S231748AbjDRXko (ORCPT ); Tue, 18 Apr 2023 19:40:44 -0400 Received: from lindbergh.monkeyblade.net ([23.128.96.19]:52486 "EHLO lindbergh.monkeyblade.net" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S231646AbjDRXk2 (ORCPT ); Tue, 18 Apr 2023 19:40:28 -0400 Received: from mail-pl1-x64a.google.com (mail-pl1-x64a.google.com [IPv6:2607:f8b0:4864:20::64a]) by lindbergh.monkeyblade.net (Postfix) with ESMTPS id 695C014449 for ; Tue, 18 Apr 2023 16:38:34 -0700 (PDT) Received: by mail-pl1-x64a.google.com with SMTP id d9443c01a7336-1a52649285bso18696395ad.0 for ; Tue, 18 Apr 2023 16:38:34 -0700 (PDT) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=google.com; s=20221208; t=1681861114; x=1684453114; h=cc:to:from:subject:message-id:mime-version:in-reply-to:date:from:to :cc:subject:date:message-id:reply-to; bh=KZZcbhEIXep3CsDuGt1XX4NDaxpUgfaEfLEdZFPghSQ=; b=hm9bDIHJQIeAjh+0Wi4XwWxk5FEWf9SU2iifaVaFm2SHEQi+MfvRomA6hNFQPypblO ajskHfJ21yXbKIgYGAkYYUqP0Ygv5283TmzvvXkvGPLvNGcY8IspXO87zRQLxoMgm7+A 44LeeK44/TjVVWr4ejNfVV1NjMwkMAjVBZwnQccmKfvNgc/rCXeLfUDkrNJrLDo+aM/k 5pFpnDDsiw5EE9Dzm//Lga2qY9q3z+kEsegm3Xv/EsUZ8FyYGeSCyJOZ49dURNwlnL2q Y5lq2y1JVNHacWD2DjdNMHYSACvPNhgvLrByCLzQkYOtXAP0CIMW0LotNLXFCHIJamQv WJuQ== X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=1e100.net; s=20221208; t=1681861114; x=1684453114; h=cc:to:from:subject:message-id:mime-version:in-reply-to:date :x-gm-message-state:from:to:cc:subject:date:message-id:reply-to; bh=KZZcbhEIXep3CsDuGt1XX4NDaxpUgfaEfLEdZFPghSQ=; b=GriTjDowPqCwDut6Vj+4Ww5haJin2tGF0Y/pR42ezMWpbDPj4tMp4/26Ch7nqN5b2o oBDOu34rBkYbzrerZujOsXZRyR2ACCHKk153td5PQEzMaH/tn/nHZStoeWpPEj5SHmnM RSEV4MfXW7h+sOTN4BFy+OnX/ATxL9btowsAg3PBTEOIUVae2FBrFq4td89JPiOlmH3N C8KdQMSjzbJFn8mRKnFOId48DNL6KiPjIt9zTPALBCFFJ4/hKhQj22JRi8yTEAzRTVAa Q8fcFss9MepFI2N6e+QNUl0RVWR/BHul9vnQrN25HbgCPRe1HUwNP0ttmjxsbgNH5I4Q HDSQ== X-Gm-Message-State: AAQBX9chhQfM0BSiqEHmcGsD6mWTs5EPhVH4r0ioyNfltxOumwpaPZAd kRtvw3Z3/iBNPFSNs8WHSWcMfwxbTBDadq0ALw== X-Google-Smtp-Source: AKy350Y4yrZy3jVvN3wQp7+RZBqQNJ9wdfC4ADrLjc4RYHgu0a5KoOvNA6U3D5YLtIMfO6ES8yW87Yjz2s2FcMbRug== X-Received: from ackerleytng-cloudtop.c.googlers.com ([fda3:e722:ac3:cc00:7f:e700:c0a8:1f5f]) (user=ackerleytng job=sendgmr) by 2002:a17:903:41c5:b0:19a:7bd4:5b0d with SMTP id u5-20020a17090341c500b0019a7bd45b0dmr1394106ple.8.1681861113900; Tue, 18 Apr 2023 16:38:33 -0700 (PDT) Date: Tue, 18 Apr 2023 23:38:32 +0000 In-Reply-To: (message from Sean Christopherson on Fri, 14 Apr 2023 14:08:08 -0700) Mime-Version: 1.0 Message-ID: Subject: Re: [PATCH v10 9/9] KVM: Enable and expose KVM_MEM_PRIVATE From: Ackerley Tng To: Sean Christopherson Cc: chao.p.peng@linux.intel.com, xiaoyao.li@intel.com, isaku.yamahata@gmail.com, kvm@vger.kernel.org, linux-kernel@vger.kernel.org, linux-mm@kvack.org, linux-fsdevel@vger.kernel.org, linux-arch@vger.kernel.org, linux-api@vger.kernel.org, linux-doc@vger.kernel.org, qemu-devel@nongnu.org, pbonzini@redhat.com, corbet@lwn.net, vkuznets@redhat.com, wanpengli@tencent.com, jmattson@google.com, joro@8bytes.org, tglx@linutronix.de, mingo@redhat.com, bp@alien8.de, arnd@arndb.de, naoya.horiguchi@nec.com, linmiaohe@huawei.com, x86@kernel.org, hpa@zytor.com, hughd@google.com, jlayton@kernel.org, bfields@fieldses.org, akpm@linux-foundation.org, shuah@kernel.org, rppt@kernel.org, steven.price@arm.com, mail@maciej.szmigiero.name, vbabka@suse.cz, vannapurve@google.com, yu.c.zhang@linux.intel.com, kirill.shutemov@linux.intel.com, luto@kernel.org, jun.nakajima@intel.com, dave.hansen@intel.com, ak@linux.intel.com, david@redhat.com, aarcange@redhat.com, ddutile@redhat.com, dhildenb@redhat.com, qperret@google.com, tabba@google.com, michael.roth@amd.com, mhocko@suse.com, wei.w.wang@intel.com Content-Type: text/plain; charset="UTF-8"; format=flowed; delsp=yes Precedence: bulk List-ID: X-Mailing-List: linux-api@vger.kernel.org Sean Christopherson writes: > On Tue, Mar 28, 2023, Chao Peng wrote: >> On Fri, Mar 24, 2023 at 10:29:25AM +0800, Xiaoyao Li wrote: >> > On 3/24/2023 10:10 AM, Chao Peng wrote: >> > > On Wed, Mar 22, 2023 at 05:41:31PM -0700, Isaku Yamahata wrote: >> > > > On Wed, Mar 08, 2023 at 03:40:26PM +0800, >> > > > Chao Peng wrote: >> > > > >> > > > > On Wed, Mar 08, 2023 at 12:13:24AM +0000, Ackerley Tng wrote: >> > > > > > Chao Peng writes: >> > > > > > >> > > > > > > On Sat, Jan 14, 2023 at 12:01:01AM +0000, Sean >> Christopherson wrote: >> > > > > > > > On Fri, Dec 02, 2022, Chao Peng wrote: >> > > > > > > +static bool kvm_check_rmem_offset_alignment(u64 offset, u64 >> gpa) >> > > > > > > +{ >> > > > > > > + if (!offset) >> > > > > > > + return true; >> > > > > > > + if (!gpa) >> > > > > > > + return false; >> > > > > > > + >> > > > > > > + return !!(count_trailing_zeros(offset) >= >> count_trailing_zeros(gpa)); >> > > > >> > > > This check doesn't work expected. For example, offset = 2GB, >> gpa=4GB >> > > > this check fails. >> > > >> > > This case is expected to fail as Sean initially suggested[*]: >> > > I would rather reject memslot if the gfn has lesser alignment than >> > > the offset. I'm totally ok with this approach _if_ there's a use >> case. >> > > Until such a use case presents itself, I would rather be >> conservative >> > > from a uAPI perspective. >> > > >> > > I understand that we put tighter restriction on this but if you see >> such >> > > restriction is really a big issue for real usage, instead of a >> > > theoretical problem, then we can loosen the check here. But at that >> time >> > > below code is kind of x86 specific and may need improve. >> > > >> > > BTW, in latest code, I replaced count_trailing_zeros() with fls64(): >> > > return !!(fls64(offset) >= fls64(gpa)); >> > >> > wouldn't it be !!(ffs64(offset) <= ffs64(gpa)) ? >> As the function document explains, here we want to return true when >> ALIGNMENT(offset) >= ALIGNMENT(gpa), so '>=' is what we need. >> It's worthy clarifying that in Sean's original suggestion he actually >> mentioned the opposite. He said 'reject memslot if the gfn has lesser >> alignment than the offset', but I wonder this is his purpose, since >> if ALIGNMENT(offset) < ALIGNMENT(gpa), we wouldn't be possible to map >> the page as largepage. Consider we have below config: >> gpa=2M, offset=1M >> In this case KVM tries to map gpa at 2M as 2M hugepage but the physical >> page at the offset(1M) in private_fd cannot provide the 2M page due to >> misalignment. >> But as we discussed in the off-list thread, here we do find a real use >> case indicating this check is too strict. i.e. QEMU immediately fails >> when launch a guest > 2G memory. For this case QEMU splits guest memory >> space into two slots: >> Slot#1(ram_below_4G): gpa=0x0, offset=0x0, size=2G >> Slot#2(ram_above_4G): gpa=4G, offset=2G, size=totalsize-2G >> This strict alignment check fails for slot#2 because offset(2G) has less >> alignment than gpa(4G). To allow this, one solution can revert to my >> previous change in kvm_alloc_memslot_metadata() to disallow hugepage >> only when the offset/gpa are not aligned to related page size. >> Sean, How do you think? > I agree, a pure alignment check is too restrictive, and not really what I > intended > despite past me literally saying that's what I wanted :-) I think I may > have also > inverted the "less alignment" statement, but luckily I believe that ends > up being > a moot point. > The goal is to avoid having to juggle scenarios where KVM wants to create > a hugepage, > but restrictedmem can't provide one because of a misaligned file offset. > I think > the rule we want is that the offset must be aligned to the largest page > size allowed > by the memslot _size_. E.g. on x86, if the memslot size is >=1GiB then > the offset > must be 1GiB or beter, ditto for >=2MiB and >=4KiB (ignoring that 4KiB is > already a > requirement). > We could loosen that to say the largest size allowed by the memslot, but > I don't > think that's worth the effort unless it's trivially easy to implement in > code, > e.g. KVM could technically allow a 4KiB aligned offset if the memslot is > 2MiB > sized but only 4KiB aligned on the GPA. I doubt there's a real use case > for such > a memslot, so I want to disallow that unless it's super easy to implement. Checking my understanding here about why we need this alignment check: When KVM requests a page from restrictedmem, KVM will provide an offset into the file in terms of 4K pages. When shmem is configured to use hugepages, shmem_get_folio() will round the requested offset down to the nearest hugepage-aligned boundary in shmem_alloc_hugefolio(). Example of problematic configuration provided to KVM_SET_USER_MEMORY_REGION2: + shmem configured to use 1GB pages + restrictedmem_offset provided to KVM_SET_USER_MEMORY_REGION2: 0x4000 + memory_size provided in KVM_SET_USER_MEMORY_REGION2: 1GB + KVM requests offset (pgoff_t) 0x8, which translates to offset 0x8000 restrictedmem_get_page() and shmem_get_folio() returns the page for offset 0x0 in the file, since rounding down 0x8000 to the nearest 1GB is 0x0. This is allocating outside the range that KVM is supposed to use, since the parameters provided in KVM_SET_USER_MEMORY_REGION2 is only supposed to be offset 0x4000 to (0x4000 + 1GB = 0x40004000) in the file. IIUC shmem will actually just round down (0x4000 rounded down to nearest 1GB will be 0x0) and allocate without checking bounds, so if offset 0x0 to 0x4000 in the file were supposed to be used by something else, there might be issues. Hence, this alignment check ensures that rounding down of any offsets provided by KVM (based on page size configured in the backing file provided) to restrictedmem_get_page() must not go below the offset provided to KVM_SET_USER_MEMORY_REGION2. Enforcing alignment of restrictedmem_offset based on the currently-set page size in the backing file (i.e. shmem) may not be effective, since the size of the pages in the backing file can be adjusted to a larger size after KVM_SET_USER_MEMORY_REGION2 succeeds. With that, we may still end up allocating outside the range that KVM was provided with. Hence, to be safe, we should check alignment to the max page size across all backing filesystems, so the constraint is rounding down restrictedmem_offset to min(max page size across all backing filesystems, max page size that fits in memory_size) == restrictedmem_offset which is the same check as restrictedmem_offset must be aligned to min(max page size across all backing filesystems, max page size that fits in memory_size) which can safely reduce to restrictedmem_offset must be aligned to max page size that fits in memory_size since "max page size that fits in memory_size" is probably <= to "max page size across all backing filesystems", and if it's larger, it'll just be a tighter constraint. If the above understanding is correct: + We must enforce this in the KVM_SET_USER_MEMORY_REGION2 handler, since IIUC shmem will just round down and allocate without checking bounds. + I think this is okay because holes in the restrictedmem file (in terms of offset) made to accommodate this constraint don't cost us anything anyway(?) Are they just arbitrary offsets in a file? In our case, this file is usually a new and empty file. + In the case of migration of a restrictedmem file between two KVM VMs, this constraint would cause a problem is if the largest possible page size on the destination machine is larger than that of the source machine. In that case, we might have to move the data in the file to a different offset (a separate problem). + On this note, it seems like there is no check for when the range is smaller than the allocated page? Like if the range provided is 4KB in size, but shmem is then configured to use a 1GB page, will we end up allocating past the end of the range?