From mboxrd@z Thu Jan 1 00:00:00 1970
Date: Mon, 5 Sep 2022 13:11:40 +0200
Subject: Re: [PATCH] mm: gup: fix the fast GUP race against THP collapse
From: David Hildenbrand
To: Baolin Wang, John Hubbard, Yang Shi, peterx@redhat.com,
 kirill.shutemov@linux.intel.com, jgg@nvidia.com, hughd@google.com,
 akpm@linux-foundation.org
Cc: linux-mm@kvack.org, linux-kernel@vger.kernel.org
References: <20220901222707.477402-1-shy828301@gmail.com>
 <0c9d9774-77dd-fd93-b5b6-fc63f3d01b7f@linux.alibaba.com>
 <383fec21-9801-9b60-7570-856da2133ea9@redhat.com>
In-Reply-To: <383fec21-9801-9b60-7570-856da2133ea9@redhat.com>
Organization: Red Hat

On 05.09.22 12:24, David Hildenbrand wrote:
> On 05.09.22 12:16, Baolin Wang wrote:
>>
>>
>> On 9/5/2022 3:59 PM, David Hildenbrand wrote:
>>> On 05.09.22 00:29, John Hubbard wrote:
>>>> On 9/1/22 15:27, Yang Shi wrote:
>>>>> Since general RCU GUP fast was introduced in commit 2667f50e8b81 ("mm:
>>>>> introduce a general RCU get_user_pages_fast()"), a TLB flush is no
>>>>> longer sufficient to handle concurrent GUP-fast in all cases, it only
>>>>> handles traditional IPI-based GUP-fast correctly.  On architectures
>>>>> that send an IPI broadcast on TLB flush, it works as expected.  But on
>>>>> the architectures that do not use IPI to broadcast TLB flush, it may
>>>>> have the below race:
>>>>>
>>>>>     CPU A                                          CPU B
>>>>> THP collapse                                     fast GUP
>>>>>                                                gup_pmd_range() <-- see valid pmd
>>>>>                                                    gup_pte_range() <-- work on pte
>>>>> pmdp_collapse_flush() <-- clear pmd and flush
>>>>> __collapse_huge_page_isolate()
>>>>>      check page pinned <-- before GUP bump refcount
>>>>>                                                        pin the page
>>>>>                                                        check PTE <-- no change
>>>>> __collapse_huge_page_copy()
>>>>>      copy data to huge page
>>>>>      ptep_clear()
>>>>> install huge pmd for the huge page
>>>>>                                                        return the stale page
>>>>> discard the stale page
>>>>
>>>> Hi Yang,
>>>>
>>>> Thanks for taking the trouble to write down these notes. I always
>>>> forget which race we are dealing with, and this is a great help. :)
>>>>
>>>> More...
>>>>
>>>>>
>>>>> The race could be fixed by checking whether PMD is changed or not after
>>>>> taking the page pin in fast GUP, just like what it does for PTE.  If the
>>>>> PMD is changed it means there may be parallel THP collapse, so GUP
>>>>> should back off.
>>>>>
>>>>> Also update the stale comment about serializing against fast GUP in
>>>>> khugepaged.
>>>>>
>>>>> Fixes: 2667f50e8b81 ("mm: introduce a general RCU get_user_pages_fast()")
>>>>> Signed-off-by: Yang Shi
>>>>> ---
>>>>>   mm/gup.c        | 30 ++++++++++++++++++++++++------
>>>>>   mm/khugepaged.c | 10 ++++++----
>>>>>   2 files changed, 30 insertions(+), 10 deletions(-)
>>>>>
>>>>> diff --git a/mm/gup.c b/mm/gup.c
>>>>> index f3fc1f08d90c..4365b2811269 100644
>>>>> --- a/mm/gup.c
>>>>> +++ b/mm/gup.c
>>>>> @@ -2380,8 +2380,9 @@ static void __maybe_unused undo_dev_pagemap(int *nr, int nr_start,
>>>>>   }
>>>>>   #ifdef CONFIG_ARCH_HAS_PTE_SPECIAL
>>>>> -static int gup_pte_range(pmd_t pmd, unsigned long addr, unsigned long end,
>>>>> -             unsigned int flags, struct page **pages, int *nr)
>>>>> +static int gup_pte_range(pmd_t pmd, pmd_t *pmdp, unsigned long addr,
>>>>> +             unsigned long end, unsigned int flags,
>>>>> +             struct page **pages, int *nr)
>>>>>   {
>>>>>       struct dev_pagemap *pgmap = NULL;
>>>>>       int nr_start = *nr, ret = 0;
>>>>> @@ -2423,7 +2424,23 @@ static int gup_pte_range(pmd_t pmd, unsigned long addr, unsigned long end,
>>>>>               goto pte_unmap;
>>>>>           }
>>>>> -        if (unlikely(pte_val(pte) != pte_val(*ptep))) {
>>>>> +        /*
>>>>> +         * THP collapse conceptually does:
>>>>> +         *   1. Clear and flush PMD
>>>>> +         *   2. Check the base page refcount
>>>>> +         *   3. Copy data to huge page
>>>>> +         *   4. Clear PTE
>>>>> +         *   5. Discard the base page
>>>>> +         *
>>>>> +         * So fast GUP may race with THP collapse then pin and
>>>>> +         * return an old page since TLB flush is no longer sufficient
>>>>> +         * to serialize against fast GUP.
>>>>> +         *
>>>>> +         * Check PMD, if it is changed just back off since it
>>>>> +         * means there may be parallel THP collapse.
>>>>> +         */
>>>>
>>>> As I mentioned in the other thread, it would be a nice touch to move
>>>> such discussion into the comment header.
>>>>
>>>>> +        if (unlikely(pmd_val(pmd) != pmd_val(*pmdp)) ||
>>>>> +            unlikely(pte_val(pte) != pte_val(*ptep))) {
>>>>
>>>> That should be READ_ONCE() for the *pmdp and *ptep reads. Because this
>>>> whole lockless house of cards may fall apart if we try reading the
>>>> page table values without READ_ONCE().
>>>
>>> I came to the conclusion that the implicit memory barrier when grabbing
>>> a reference on the page is sufficient such that we don't need READ_ONCE
>>> here.
>>
>> IMHO the compiler may optimize the code 'pte_val(*ptep)' to always be
>> read from a register, so we can get an old value if another thread did
>> set_pte(). I am not sure how the implicit memory barrier can prevent
>> this compiler optimization. Please correct me if I missed something.
>
> IIUC, a memory barrier always implies a compiler barrier.

To clarify what I mean, Documentation/atomic_t.txt documents:

  NOTE: when the atomic RmW ops are fully ordered, they should also imply
  a compiler barrier.

-- 
Thanks,

David / dhildenb
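A minimal, self-contained userspace sketch of the pattern discussed above
may help make the argument concrete. It is not kernel code: fake_page,
try_grab_ref and pin_if_stable are hypothetical stand-ins for the
page-refcount helpers and page-table accessors used by gup_pte_range(),
and C11 atomics stand in for the kernel's atomic_t operations. The pin is
taken with a fully ordered atomic read-modify-write, which per
Documentation/atomic_t.txt also implies a compiler barrier, so the PMD/PTE
re-checks after it cannot be satisfied from values the compiler cached
before the pin. The re-reads below still go through a READ_ONCE()-style
volatile access so the userspace example stays free of C-level data races,
independent of whether the kernel strictly needs READ_ONCE() there.

/* sketch.c - illustrative userspace model, not kernel code. */
#include <stdatomic.h>
#include <stdbool.h>
#include <stdio.h>

/* Hypothetical stand-ins for pte_t/pmd_t and struct page. */
typedef unsigned long pte_t;
typedef unsigned long pmd_t;

struct fake_page {
        atomic_int refcount;
};

/*
 * Kernel-style READ_ONCE(): a volatile access the compiler may not elide,
 * merge, or satisfy from a stale register copy.
 */
#define READ_ONCE(x) (*(const volatile __typeof__(x) *)&(x))

/*
 * Stand-in for the GUP-fast refcount grab: a fully ordered atomic
 * read-modify-write.  Fully ordered RmW ops also imply a compiler
 * barrier, so the re-reads done after a successful grab are real loads,
 * not stale register contents.
 */
static bool try_grab_ref(struct fake_page *page)
{
        int old = atomic_load_explicit(&page->refcount, memory_order_relaxed);

        do {
                if (old == 0)
                        return false;   /* page being freed: back off */
        } while (!atomic_compare_exchange_weak_explicit(&page->refcount,
                        &old, old + 1,
                        memory_order_seq_cst, memory_order_relaxed));
        return true;
}

static void put_ref(struct fake_page *page)
{
        atomic_fetch_sub(&page->refcount, 1);
}

/*
 * Mirrors the control flow of the patched gup_pte_range(): pin first,
 * then re-read the PMD and the PTE and back off if either changed, since
 * a changed PMD indicates a concurrent THP collapse.
 */
static bool pin_if_stable(struct fake_page *page,
                          pte_t pte_seen, pte_t *ptep,
                          pmd_t pmd_seen, pmd_t *pmdp)
{
        if (!try_grab_ref(page))
                return false;

        if (READ_ONCE(*pmdp) != pmd_seen || READ_ONCE(*ptep) != pte_seen) {
                put_ref(page);          /* raced: undo the pin */
                return false;
        }
        return true;                    /* pinned and still mapped */
}

int main(void)
{
        struct fake_page page;
        pte_t pte = 0x1000;
        pmd_t pmd = 0x2000;

        atomic_init(&page.refcount, 1);

        /* Nothing changed between the walk and the pin: pin succeeds. */
        if (pin_if_stable(&page, pte, &pte, pmd, &pmd))
                printf("pinned, refcount=%d\n", atomic_load(&page.refcount));

        /*
         * Simulate pmdp_collapse_flush() clearing the PMD after the walk
         * recorded pmd_seen but before the pin: the re-check backs off.
         */
        pmd_t pmd_seen = pmd;
        pmd = 0;
        if (!pin_if_stable(&page, pte, &pte, pmd_seen, &pmd))
                printf("backed off, refcount=%d\n", atomic_load(&page.refcount));

        return 0;
}

Built with a C11 compiler (e.g. gcc -std=c11 -O2 sketch.c), the first call
pins the page and the second backs off because the simulated collapse
changed the PMD between the page-table walk and the pin, mirroring how the
patched GUP-fast path drops its pin when the PMD no longer matches.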