From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: X-Spam-Checker-Version: SpamAssassin 3.4.0 (2014-02-07) on aws-us-west-2-korg-lkml-1.web.codeaurora.org Received: from vger.kernel.org (vger.kernel.org [23.128.96.18]) by smtp.lore.kernel.org (Postfix) with ESMTP id 40662C04A95 for ; Tue, 25 Oct 2022 13:45:18 +0000 (UTC) Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S232338AbiJYNpR (ORCPT ); Tue, 25 Oct 2022 09:45:17 -0400 Received: from lindbergh.monkeyblade.net ([23.128.96.19]:47394 "EHLO lindbergh.monkeyblade.net" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S230193AbiJYNpN (ORCPT ); Tue, 25 Oct 2022 09:45:13 -0400 Received: from mail-io1-xd2a.google.com (mail-io1-xd2a.google.com [IPv6:2607:f8b0:4864:20::d2a]) by lindbergh.monkeyblade.net (Postfix) with ESMTPS id 3F3E6147056 for ; Tue, 25 Oct 2022 06:45:12 -0700 (PDT) Received: by mail-io1-xd2a.google.com with SMTP id y80so10324051iof.3 for ; Tue, 25 Oct 2022 06:45:12 -0700 (PDT) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=google.com; s=20210112; h=cc:to:subject:message-id:date:from:in-reply-to:references :mime-version:from:to:cc:subject:date:message-id:reply-to; bh=KgMDPhJWvWPx+5LJ/Qv7vJHmHiGIhKGLj7RjUESz09M=; b=WumVmYJxhniCTmZMefNCEm3UcPRsmZeEHJM8hqvVntpzvgyZGF/jvSU6opR0WzrC9W SMegd9e/PZJXVf5poNQdoYE2OTiOVdP8bdUnPOBLSMDR5AhNVe8FeON9n+FAgyMkQV+3 IIm1RDfijAM52g9K3lWfNrRiYQbJZHEv6hqL9ct7vkYEUrDKjPt8CZm/l979ZaAbtZgZ UZvD0cDeYGgcnMddbdLSSuSh+gQe/L8xr+z/7bXLOO7k/DzPLGODC4V2f8paSK39lAKd aTKwBy6JpTxFmrEYDWJM1HXyrmhIX5iH6iGt5EuLBCYBQLEYS6nJrXjTRlEaVEgDDitM SgaA== X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=1e100.net; s=20210112; h=cc:to:subject:message-id:date:from:in-reply-to:references :mime-version:x-gm-message-state:from:to:cc:subject:date:message-id :reply-to; bh=KgMDPhJWvWPx+5LJ/Qv7vJHmHiGIhKGLj7RjUESz09M=; b=akm1dkHMfMvEGma98KxPzf4nDGr+Xtz4RODB9i/wW0un3FKVMFWuaDFCSj4FC2xrB/ BHXo+kZFSGpqSnbvEg4EtWYCFPSqMVh5y3FB7KQXICwlF1jr1wMKsJ0dYBba8x2c+SPe svtaK8I56dx/iCjb3/nd+cFQXdmYobLwAjXfnsRYD19XMfvEWM0nCr4qQ9w87NbSOma0 71hV8fN9aRhpd21D6C/m3t54o4k1BGYv0YcgPq6oPCwgtIJWw+8KLDXHmZtlfXMwbudH LXpkzUonONFeNCyoY35neQy2ZSla6EL4sa+W9svTOV/vcwIVipRxwQlt3zTUrUblNHIO fkgA== X-Gm-Message-State: ACrzQf3gsVUR27+320dmM0G3GqnOPuBh4rMa+WRmL2P7G8QZcQ7WUPHA RjXqO0a0sD+OtfiE1QxcPoMapTYvRbxHKcSoEmv5qw== X-Google-Smtp-Source: AMsMyM7ZrmQax2xcIXLFfXG6apPiVAukou4+KudnV0Fh3BwBnAb15t7Vd9OEZdU+B06IcvnW44Plgur2XuRcFXBjdAc= X-Received: by 2002:a05:6638:38a5:b0:363:f688:94dc with SMTP id b37-20020a05663838a500b00363f68894dcmr25269301jav.133.1666705511462; Tue, 25 Oct 2022 06:45:11 -0700 (PDT) MIME-Version: 1.0 References: <20221022111403.531902164@infradead.org> <20221022114424.515572025@infradead.org> <2c800ed1-d17a-def4-39e1-09281ee78d05@nvidia.com> <87fsfcuxu6.fsf@nvidia.com> In-Reply-To: <87fsfcuxu6.fsf@nvidia.com> From: Jann Horn Date: Tue, 25 Oct 2022 15:44:34 +0200 Message-ID: Subject: Re: [PATCH 01/13] mm: Update ptep_get_lockless()s comment To: Alistair Popple Cc: Matthew Wilcox , Linus Torvalds , Peter Zijlstra , John Hubbard , x86@kernel.org, akpm@linux-foundation.org, linux-kernel@vger.kernel.org, linux-mm@kvack.org, aarcange@redhat.com, kirill.shutemov@linux.intel.com, jroedel@suse.de, ubizjak@gmail.com Content-Type: text/plain; charset="UTF-8" Precedence: bulk List-ID: X-Mailing-List: linux-kernel@vger.kernel.org On Tue, Oct 25, 2022 at 10:11 AM Alistair Popple wrote: > > > Matthew Wilcox writes: > > > On Mon, Oct 24, 2022 at 10:23:51PM +0200, Jann Horn wrote: > >> """ > >> This guarantees that the page tables that are being walked > >> aren't freed concurrently, but at the end of the walk, we > >> have to grab a stable reference to the referenced page. > >> For this we use the grab-reference-and-revalidate trick > >> from above again: > >> First we (locklessly) load the page > >> table entry, then we grab a reference to the page that it > >> points to (which can fail if the refcount is zero, in that > >> case we bail), then we recheck that the page table entry > >> is still the same, and if it changed in between, we drop the > >> page reference and bail. > >> This can, again, grab a reference to a page after it has > >> already been freed and reallocated. The reason why this is > >> fine is that the metadata structure that holds this refcount, > >> `struct folio` (or `struct page`, depending on which kernel > >> version you're looking at; in current kernels it's `folio` > >> but `struct page` and `struct folio` are actually aliases for > >> the same memory, basically, though that is supposed to maybe > >> change at some point) is never freed; even when a page is > >> freed and reallocated, the corresponding `struct folio` > >> stays. This does have the fun consequence that whenever a > >> page/folio has a non-zero refcount, the refcount can > >> spuriously go up and then back down for a little bit. > >> (Also it's technically not as simple as I just described it, > >> because the `struct page` that the PTE points to might be > >> a "tail page" of a `struct folio`. > >> So actually we first read the PTE, the PTE gives us the > >> `page*`, then from that we go to the `folio*`, then we > >> try to grab a reference to the `folio`, then if that worked > >> we check that the `page` still points to the same `folio`, > >> and then we recheck that the PTE is still the same.) > >> """ > > > > Nngh. In trying to make this description fit all kernels (with > > both pages and folios), you've complicated it maximally. Let's > > try a more simple explanation: > > > > First we (locklessly) load the page table entry, then we grab a > > reference to the folio that contains it (which can fail if the > > refcount is zero, in that case we bail), then we recheck that the > > page table entry is still the same, and if it changed in between, > > we drop the folio reference and bail. > > This can, again, grab a reference to a folio after it has > > already been freed and reallocated. The reason why this is > > fine is that the metadata structure that holds this refcount, > > `struct folio` is never freed; even when a folio is > > freed and reallocated, the corresponding `struct folio` > > stays. Oh, thanks. You're right, trying to talk about kernels with folios made it unnecessarily complicated... > I'm probably missing something obvious but how is that synchronised > against memory hotplug? AFAICT if it isn't couldn't the pages be freed > and memory removed? In that case the above would no longer hold because > (I think) the metadata structure could have been freed. Hm... that's this codepath? arch_remove_memory -> __remove_pages -> __remove_section -> sparse_remove_section -> section_deactivate -> depopulate_section_memmap -> vmemmap_free -> remove_pagetable which then walks down the page tables and ends up freeing individual pages in remove_pte_table() using the confusingly-named free_pagetable()? I'm not sure what the synchronization against hotplug is - GUP-fast is running with IRQs disabled, but other codepaths might not, like get_ksm_page()? I don't know if that's holding something else for protection...