From mboxrd@z Thu Jan 1 00:00:00 1970
Date: Tue, 18 Apr 2023 23:38:32 +0000
In-Reply-To: (message from Sean Christopherson on Fri, 14 Apr 2023 14:08:08 -0700)
Mime-Version: 1.0
Message-ID:
Subject: Re: [PATCH v10 9/9] KVM: Enable and expose KVM_MEM_PRIVATE
From: Ackerley Tng
To: Sean Christopherson
Cc: chao.p.peng@linux.intel.com, xiaoyao.li@intel.com, isaku.yamahata@gmail.com,
    kvm@vger.kernel.org, linux-kernel@vger.kernel.org, linux-mm@kvack.org,
    linux-fsdevel@vger.kernel.org, linux-arch@vger.kernel.org,
    linux-api@vger.kernel.org, linux-doc@vger.kernel.org, qemu-devel@nongnu.org,
    pbonzini@redhat.com, corbet@lwn.net, vkuznets@redhat.com,
    wanpengli@tencent.com, jmattson@google.com, joro@8bytes.org,
    tglx@linutronix.de, mingo@redhat.com, bp@alien8.de, arnd@arndb.de,
    naoya.horiguchi@nec.com, linmiaohe@huawei.com, x86@kernel.org,
    hpa@zytor.com, hughd@google.com, jlayton@kernel.org, bfields@fieldses.org,
    akpm@linux-foundation.org, shuah@kernel.org, rppt@kernel.org,
    steven.price@arm.com, mail@maciej.szmigiero.name, vbabka@suse.cz,
    vannapurve@google.com, yu.c.zhang@linux.intel.com,
    kirill.shutemov@linux.intel.com, luto@kernel.org, jun.nakajima@intel.com,
    dave.hansen@intel.com, ak@linux.intel.com, david@redhat.com,
    aarcange@redhat.com, ddutile@redhat.com, dhildenb@redhat.com,
    qperret@google.com, tabba@google.com, michael.roth@amd.com,
    mhocko@suse.com, wei.w.wang@intel.com
Content-Type: text/plain; charset="UTF-8"; format=flowed; delsp=yes
Sean Christopherson writes:

> On Tue, Mar 28, 2023, Chao Peng wrote:
>> On Fri, Mar 24, 2023 at 10:29:25AM +0800, Xiaoyao Li wrote:
>> > On 3/24/2023 10:10 AM, Chao Peng wrote:
>> > > On Wed, Mar 22, 2023 at 05:41:31PM -0700, Isaku Yamahata wrote:
>> > > > On Wed, Mar 08, 2023 at 03:40:26PM +0800,
>> > > > Chao Peng wrote:
>> > > >
>> > > > > On Wed, Mar 08, 2023 at 12:13:24AM +0000, Ackerley Tng wrote:
>> > > > > > Chao Peng writes:
>> > > > > >
>> > > > > > > On Sat, Jan 14, 2023 at 12:01:01AM +0000, Sean Christopherson wrote:
>> > > > > > > > On Fri, Dec 02, 2022, Chao Peng wrote:
>> > > > > > > +static bool kvm_check_rmem_offset_alignment(u64 offset, u64 gpa)
>> > > > > > > +{
>> > > > > > > +	if (!offset)
>> > > > > > > +		return true;
>> > > > > > > +	if (!gpa)
>> > > > > > > +		return false;
>> > > > > > > +
>> > > > > > > +	return !!(count_trailing_zeros(offset) >= count_trailing_zeros(gpa));
>> > > >
>> > > > This check doesn't work expected. For example, offset = 2GB, gpa=4GB
>> > > > this check fails.
>> > >
>> > > This case is expected to fail as Sean initially suggested[*]:
>> > >   I would rather reject memslot if the gfn has lesser alignment than
>> > >   the offset. I'm totally ok with this approach _if_ there's a use case.
>> > >   Until such a use case presents itself, I would rather be conservative
>> > >   from a uAPI perspective.
>> > >
>> > > I understand that we put tighter restriction on this but if you see such
>> > > restriction is really a big issue for real usage, instead of a
>> > > theoretical problem, then we can loosen the check here.
>> > > But at that time
>> > > below code is kind of x86 specific and may need improve.
>> > >
>> > > BTW, in latest code, I replaced count_trailing_zeros() with fls64():
>> > >   return !!(fls64(offset) >= fls64(gpa));
>> >
>> > wouldn't it be !!(ffs64(offset) <= ffs64(gpa)) ?
>>
>> As the function document explains, here we want to return true when
>> ALIGNMENT(offset) >= ALIGNMENT(gpa), so '>=' is what we need.
>>
>> It's worthy clarifying that in Sean's original suggestion he actually
>> mentioned the opposite. He said 'reject memslot if the gfn has lesser
>> alignment than the offset', but I wonder this is his purpose, since
>> if ALIGNMENT(offset) < ALIGNMENT(gpa), we wouldn't be possible to map
>> the page as largepage. Consider we have below config:
>>   gpa=2M, offset=1M
>>
>> In this case KVM tries to map gpa at 2M as 2M hugepage but the physical
>> page at the offset(1M) in private_fd cannot provide the 2M page due to
>> misalignment.
>>
>> But as we discussed in the off-list thread, here we do find a real use
>> case indicating this check is too strict. i.e. QEMU immediately fails
>> when launch a guest > 2G memory. For this case QEMU splits guest memory
>> space into two slots:
>>   Slot#1(ram_below_4G): gpa=0x0, offset=0x0, size=2G
>>   Slot#2(ram_above_4G): gpa=4G, offset=2G, size=totalsize-2G
>>
>> This strict alignment check fails for slot#2 because offset(2G) has less
>> alignment than gpa(4G). To allow this, one solution can revert to my
>> previous change in kvm_alloc_memslot_metadata() to disallow hugepage
>> only when the offset/gpa are not aligned to related page size.
>>
>> Sean, How do you think?
>
> I agree, a pure alignment check is too restrictive, and not really what I intended
> despite past me literally saying that's what I wanted :-)  I think I may have also
> inverted the "less alignment" statement, but luckily I believe that ends up being
> a moot point.
>
> The goal is to avoid having to juggle scenarios where KVM wants to create a hugepage,
> but restrictedmem can't provide one because of a misaligned file offset.  I think
> the rule we want is that the offset must be aligned to the largest page size allowed
> by the memslot _size_.  E.g. on x86, if the memslot size is >=1GiB then the offset
> must be 1GiB or beter, ditto for >=2MiB and >=4KiB (ignoring that 4KiB is already a
> requirement).
>
> We could loosen that to say the largest size allowed by the memslot, but I don't
> think that's worth the effort unless it's trivially easy to implement in code,
> e.g. KVM could technically allow a 4KiB aligned offset if the memslot is 2MiB
> sized but only 4KiB aligned on the GPA.  I doubt there's a real use case for such
> a memslot, so I want to disallow that unless it's super easy to implement.

Checking my understanding here about why we need this alignment check:

When KVM requests a page from restrictedmem, KVM will provide an offset
into the file in terms of 4K pages.

When shmem is configured to use hugepages, shmem_get_folio() will round
the requested offset down to the nearest hugepage-aligned boundary in
shmem_alloc_hugefolio().
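To make the rounding concrete, here is a small standalone sketch (not
the actual shmem code; round_down() just mirrors the kernel macro, and
the 2MB/0x1000/0x3000 values are made up for illustration):

/*
 * Standalone sketch of the round-down described above. Not the actual
 * shmem code; round_down() mirrors the kernel macro, and the sizes and
 * offsets are hypothetical.
 */
#include <inttypes.h>
#include <stdint.h>
#include <stdio.h>

#define round_down(x, y) ((x) & ~((uint64_t)(y) - 1))

int main(void)
{
	uint64_t hugepage_size = 2 * 1024 * 1024; /* backing file uses 2MB pages */
	uint64_t slot_file_offset = 0x1000;       /* restrictedmem_offset of the memslot */
	uint64_t requested_offset = 0x3000;       /* file offset KVM asks restrictedmem for */

	/* The folio backing the request starts at the hugepage-aligned boundary. */
	uint64_t folio_start = round_down(requested_offset, hugepage_size);

	printf("folio starts at 0x%" PRIx64 ", memslot's file range starts at 0x%" PRIx64 "\n",
	       folio_start, slot_file_offset);

	/* 0x3000 rounded down to 2MB is 0x0, which is below 0x1000. */
	if (folio_start < slot_file_offset)
		printf("the backing allocation begins outside the memslot's file range\n");

	return 0;
}

(The 1GB example below is the same arithmetic with the actual
problematic values.)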
Example of problematic configuration provided to
KVM_SET_USER_MEMORY_REGION2:

+ shmem configured to use 1GB pages
+ restrictedmem_offset provided to KVM_SET_USER_MEMORY_REGION2: 0x4000
+ memory_size provided in KVM_SET_USER_MEMORY_REGION2: 1GB
+ KVM requests offset (pgoff_t) 0x8, which translates to offset 0x8000

restrictedmem_get_page() and shmem_get_folio() return the page for
offset 0x0 in the file, since rounding 0x8000 down to the nearest 1GB
boundary is 0x0.

This allocates outside the range that KVM is supposed to use, since the
parameters provided in KVM_SET_USER_MEMORY_REGION2 only cover offsets
0x4000 to (0x4000 + 1GB = 0x40004000) in the file.

IIUC shmem will actually just round down (0x4000 rounded down to the
nearest 1GB is 0x0) and allocate without checking bounds, so if offsets
0x0 to 0x4000 in the file were supposed to be used by something else,
there might be issues.

Hence, this alignment check ensures that rounding down any offset that
KVM provides to restrictedmem_get_page() (based on the page size
configured in the backing file) does not go below the offset provided
to KVM_SET_USER_MEMORY_REGION2.

Enforcing alignment of restrictedmem_offset based on the currently-set
page size in the backing file (i.e. shmem) may not be effective, since
the page size in the backing file can be adjusted to a larger size after
KVM_SET_USER_MEMORY_REGION2 succeeds. With that, we may still end up
allocating outside the range that KVM was provided with.

Hence, to be safe, we should check alignment against the max page size
across all backing filesystems, so the constraint is

  rounding restrictedmem_offset down to
  min(max page size across all backing filesystems,
      max page size that fits in memory_size)
  == restrictedmem_offset

which is the same check as

  restrictedmem_offset must be aligned to
  min(max page size across all backing filesystems,
      max page size that fits in memory_size)

which can safely be reduced to

  restrictedmem_offset must be aligned to the max page size that fits in
  memory_size

since "max page size that fits in memory_size" is probably <= "max page
size across all backing filesystems", and if it is larger, the check
just becomes a tighter constraint.

If the above understanding is correct:

+ We must enforce this in the KVM_SET_USER_MEMORY_REGION2 handler, since
  IIUC shmem will just round down and allocate without checking bounds
  (see the sketch after this list for roughly what I mean).

+ I think this is okay because holes in the restrictedmem file (in terms
  of offset) made to accommodate this constraint don't cost us anything
  anyway(?) Are they just arbitrary offsets in a file? In our case, this
  file is usually a new and empty file.

+ In the case of migration of a restrictedmem file between two KVM VMs,
  this constraint would cause a problem if the largest possible page
  size on the destination machine is larger than that of the source
  machine. In that case, we might have to move the data in the file to a
  different offset (a separate problem).

+ On this note, it seems like there is no check for when the range is
  smaller than the allocated page? E.g. if the range provided is 4KB in
  size, but shmem is then configured to use a 1GB page, will we end up
  allocating past the end of the range?
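For concreteness, a rough sketch of the check I have in mind for the
KVM_SET_USER_MEMORY_REGION2 handler. The helper name is made up, it
hard-codes the x86 page sizes, and it assumes the kernel's IS_ALIGNED()
and SZ_* constants; it is only meant to make the "aligned to the max
page size that fits in memory_size" constraint concrete, not to be the
actual patch:

/*
 * Hypothetical sketch, not the actual patch: accept the memslot only if
 * restrictedmem_offset is aligned to the largest page size that fits in
 * memory_size. Then, for any backing page size that also fits in the
 * memslot, rounding a requested offset down to that page size cannot go
 * below restrictedmem_offset.
 */
static bool rmem_offset_is_safe(u64 offset, u64 size)
{
	u64 max_page_size;

	/* Largest x86 page size that fits in the memslot size. */
	if (size >= SZ_1G)
		max_page_size = SZ_1G;
	else if (size >= SZ_2M)
		max_page_size = SZ_2M;
	else
		max_page_size = SZ_4K;

	return IS_ALIGNED(offset, max_page_size);
}

If we wanted the stricter "max page size across all backing filesystems"
variant instead, only the computation of max_page_size would change.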