From: Baolin Wang <baolin.wang@linux.alibaba.com>
Date: Thu, 25 Aug 2022 18:54:07 +0800
Subject: Re: [PATCH v2 1/5] mm/hugetlb: fix races when looking up a CONT-PTE size hugetlb page
To: David Hildenbrand, Mike Kravetz
Cc: akpm@linux-foundation.org, songmuchun@bytedance.com, linux-mm@kvack.org, linux-kernel@vger.kernel.org
Message-ID: <9b5b9b7a-3cba-37a1-411a-5031a67208fe@linux.alibaba.com>
In-Reply-To: <887ca2e2-a7c5-93a7-46cb-185daccd4444@redhat.com>
References: <0e5d92da043d147a867f634b17acbcc97a7f0e64.1661240170.git.baolin.wang@linux.alibaba.com>
 <4c24b891-04ce-2608-79d2-a75dc236533f@redhat.com>
 <376d2e0a-d28a-984b-903c-1f6451b04a15@linux.alibaba.com>
 <7d4e7f47-30a5-3cc6-dc9f-aa89120847d8@redhat.com>
 <64669c0a-4a6e-f034-a15b-c4a8deea9e5d@linux.alibaba.com>
 <7ee73879-e402-9175-eae8-41471d80d59e@redhat.com>
 <887ca2e2-a7c5-93a7-46cb-185daccd4444@redhat.com>
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8; format=flowed
Content-Transfer-Encoding: 7bit
User-Agent: Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:91.0) Gecko/20100101 Thunderbird/91.12.0

On 8/25/2022 3:25 PM, David Hildenbrand wrote:
>> Is the primary concern the locking? If so, I am not sure we have an issue.
>> As mentioned in your commit message, current code will use
>> pte_offset_map_lock(). pte_offset_map_lock uses pte_lockptr, and pte_lockptr
>> will either be the mm-wide lock or the pmd_page lock. To me, it seems that
>> either would provide correct synchronization for CONT-PTE entries. Am I
>> missing something or misreading the code?
>>
>> I started looking at the code cleanup suggested by David. Here is a quick
>> patch (not tested and likely containing errors) to see if this is a step
>> in the right direction.
>>
>> I like it because we get rid of/combine all those follow_huge_p*d
>> routines.
>>
>
> Yes, see comments below.
>
>> From 35d117a707c1567ddf350554298697d40eace0d7 Mon Sep 17 00:00:00 2001
>> From: Mike Kravetz
>> Date: Wed, 24 Aug 2022 15:59:15 -0700
>> Subject: [PATCH] hugetlb: call hugetlb_follow_page_mask for hugetlb pages in
>>  follow_page_mask
>>
>> At the beginning of follow_page_mask, there is currently a call to
>> follow_huge_addr which 'may' handle hugetlb pages. ia64 is the only
>> architecture which (incorrectly) provides a follow_huge_addr routine
>> that does not return error. Instead, at each level of the page table a
>> check is made for a hugetlb entry. If a hugetlb entry is found, a call
>> is made to a routine associated with that page table level, such as
>> follow_huge_pmd.
>>
>> All the follow_huge_p*d routines are basically the same. In addition,
>> the huge page size can be derived from the vma, so we know where in the
>> page table a huge page would reside. So, replace follow_huge_addr with a
>> new architecture-independent routine which provides the same
>> functionality as the follow_huge_p*d routines. We can then eliminate
>> the p*d_huge checks in follow_page_mask page table walking, as well as
>> the follow_huge_p*d routines themselves.
>>
>> follow_page_mask still has is_hugepd hugetlb checks during page table
>> walking. This is because these checks and follow_huge_pd are
>> architecture specific. These can be eliminated if
>> hugetlb_follow_page_mask can be overridden by architectures (powerpc)
>> that need to do follow_huge_pd processing.
>
> But won't the
>
>> +	/* hugetlb is special */
>> +	if (is_vm_hugetlb_page(vma))
>> +		return hugetlb_follow_page_mask(vma, address, flags);
>
> code route everything via hugetlb_follow_page_mask(), so that all these
> (beloved) hugepd checks would essentially be unreachable?
>
> At least my understanding is that hugepd only applies to hugetlb.
>
> Can't we move the hugepd handling code into hugetlb_follow_page_mask()
> as well?
>
> I mean, doesn't follow_hugetlb_page() also have to handle that hugepd
> stuff already ... ?

Yes, I also thought about this, and I did a simple (untested) patch based
on Mike's patch to make it cleaner:

diff --git a/mm/gup.c b/mm/gup.c
index d3239ea63159..1003c03dcf78 100644
--- a/mm/gup.c
+++ b/mm/gup.c
@@ -626,14 +626,7 @@ static struct page *follow_pmd_mask(struct vm_area_struct *vma,
 	pmdval = READ_ONCE(*pmd);
 	if (pmd_none(pmdval))
 		return no_page_table(vma, flags);
-	if (is_hugepd(__hugepd(pmd_val(pmdval)))) {
-		page = follow_huge_pd(vma, address,
-				      __hugepd(pmd_val(pmdval)), flags,
-				      PMD_SHIFT);
-		if (page)
-			return page;
-		return no_page_table(vma, flags);
-	}
+
 retry:
 	if (!pmd_present(pmdval)) {
 		/*
@@ -723,14 +716,6 @@ static struct page *follow_pud_mask(struct vm_area_struct *vma,
 	pud = pud_offset(p4dp, address);
 	if (pud_none(*pud))
 		return no_page_table(vma, flags);
-	if (is_hugepd(__hugepd(pud_val(*pud)))) {
-		page = follow_huge_pd(vma, address,
-				      __hugepd(pud_val(*pud)), flags,
-				      PUD_SHIFT);
-		if (page)
-			return page;
-		return no_page_table(vma, flags);
-	}
 	if (pud_devmap(*pud)) {
 		ptl = pud_lock(mm, pud);
 		page = follow_devmap_pud(vma, address, pud, flags, &ctx->pgmap);
@@ -759,14 +744,6 @@ static struct page *follow_p4d_mask(struct vm_area_struct *vma,
 	if (unlikely(p4d_bad(*p4d)))
 		return no_page_table(vma, flags);
 
-	if (is_hugepd(__hugepd(p4d_val(*p4d)))) {
-		page = follow_huge_pd(vma, address,
-				      __hugepd(p4d_val(*p4d)), flags,
-				      P4D_SHIFT);
-		if (page)
-			return page;
-		return no_page_table(vma, flags);
-	}
 	return follow_pud_mask(vma, address, p4d, flags, ctx);
 }
 
@@ -813,15 +790,6 @@ static struct page *follow_page_mask(struct vm_area_struct *vma,
 	if (pgd_none(*pgd) || unlikely(pgd_bad(*pgd)))
 		return no_page_table(vma, flags);
 
-	if (is_hugepd(__hugepd(pgd_val(*pgd)))) {
-		page = follow_huge_pd(vma, address,
-				      __hugepd(pgd_val(*pgd)), flags,
-				      PGDIR_SHIFT);
-		if (page)
-			return page;
-		return no_page_table(vma, flags);
-	}
-
 	return follow_p4d_mask(vma, address, pgd, flags, ctx);
 }
 
diff --git a/mm/hugetlb.c b/mm/hugetlb.c
index 2c107e7ebd66..848b4fb7a05d 100644
--- a/mm/hugetlb.c
+++ b/mm/hugetlb.c
@@ -6118,6 +6118,81 @@ static inline bool __follow_hugetlb_must_fault(unsigned int flags, pte_t *pte,
 	return false;
 }
 
+static struct page *hugetlb_follow_hugepd(struct vm_area_struct *vma,
+					  unsigned long address,
+					  unsigned int flags)
+{
+	struct mm_struct *mm = vma->vm_mm;
+	struct page *page;
+	pgd_t *pgd;
+	p4d_t *p4d;
+	pud_t *pud;
+	pmd_t *pmd;
+
+	pgd = pgd_offset(mm, address);
+	if (pgd_none(*pgd) || pgd_bad(*pgd))
+		return ERR_PTR(-EFAULT);
+
+	if (pgd_huge(*pgd))
+		return NULL;
+
+	if (is_hugepd(__hugepd(pgd_val(*pgd)))) {
+		page = follow_huge_pd(vma, address,
+				      __hugepd(pgd_val(*pgd)), flags,
+				      PGDIR_SHIFT);
+		if (page)
+			return page;
+		return ERR_PTR(-EFAULT);
+	}
+
+	p4d = p4d_offset(pgd, address);
+	if (p4d_none(*p4d) || p4d_bad(*p4d))
+		return ERR_PTR(-EFAULT);
+
+	if (is_hugepd(__hugepd(p4d_val(*p4d)))) {
+		page = follow_huge_pd(vma, address,
+				      __hugepd(p4d_val(*p4d)), flags,
+				      P4D_SHIFT);
+		if (page)
+			return page;
+		return ERR_PTR(-EFAULT);
+	}
+
+	pud = pud_offset(p4d, address);
+	if (pud_none(*pud) || pud_bad(*pud))
+		return ERR_PTR(-EFAULT);
+
+	if (pud_huge(*pud))
+		return NULL;
+
+	if (is_hugepd(__hugepd(pud_val(*pud)))) {
+		page = follow_huge_pd(vma, address,
+				      __hugepd(pud_val(*pud)), flags,
+				      PUD_SHIFT);
+		if (page)
+			return page;
+		return ERR_PTR(-EFAULT);
+	}
+
+	pmd = pmd_offset(pud, address);
+	if (pmd_none(*pmd) || pmd_bad(*pmd))
+		return ERR_PTR(-EFAULT);
+
+	if (pmd_huge(*pmd))
+		return NULL;
+
+	if (is_hugepd(__hugepd(pmd_val(*pmd)))) {
+		page = follow_huge_pd(vma, address,
+				      __hugepd(pmd_val(*pmd)), flags,
+				      PMD_SHIFT);
+		if (page)
+			return page;
+		return ERR_PTR(-EFAULT);
+	}
+
+	return NULL;
+}
+
 struct page *hugetlb_follow_page_mask(struct vm_area_struct *vma,
 				unsigned long address, unsigned int flags)
 {
@@ -6135,6 +6210,10 @@ struct page *hugetlb_follow_page_mask(struct vm_area_struct *vma,
 	if (WARN_ON_ONCE(flags & FOLL_PIN))
 		return NULL;
 
+	page = hugetlb_follow_hugepd(vma, address, flags);
+	if (page)
+		return page;
+
 	pte = huge_pte_offset(mm, haddr, huge_page_size(h));
 	if (!pte)
 		return NULL;

>>
>> +struct page *hugetlb_follow_page_mask(struct vm_area_struct *vma,
>> +				unsigned long address, unsigned int flags)
>> +{
>> +	struct hstate *h = hstate_vma(vma);
>> +	struct mm_struct *mm = vma->vm_mm;
>> +	unsigned long haddr = address & huge_page_mask(h);
>> +	struct page *page = NULL;
>> +	spinlock_t *ptl;
>> +	pte_t *pte, entry;
>> +
>> +	/*
>> +	 * FOLL_PIN is not supported for follow_page(). Ordinary GUP goes via
>> +	 * follow_hugetlb_page().
>> +	 */
>> +	if (WARN_ON_ONCE(flags & FOLL_PIN))
>> +		return NULL;
>> +
>> +	pte = huge_pte_offset(mm, haddr, huge_page_size(h));
>> +	if (!pte)
>> +		return NULL;
>> +
>> +retry:
>> +	ptl = huge_pte_lock(h, mm, pte);
>> +	entry = huge_ptep_get(pte);
>> +	if (pte_present(entry)) {
>> +		page = pte_page(entry);
>> +		/*
>> +		 * try_grab_page() should always succeed here, because we hold
>> +		 * the ptl lock and have verified pte_present().
>> +		 */
>> +		if (WARN_ON_ONCE(!try_grab_page(page, flags))) {
>> +			page = NULL;
>> +			goto out;
>> +		}
>> +	} else {
>> +		if (is_hugetlb_entry_migration(entry)) {
>> +			spin_unlock(ptl);
>> +			__migration_entry_wait_huge(pte, ptl);
>> +			goto retry;
>> +		}
>> +		/*
>> +		 * hwpoisoned entry is treated as no_page_table in
>> +		 * follow_page_mask().
>> +		 */
>> +	}
>> +out:
>> +	spin_unlock(ptl);
>> +	return page;
>
>
> This is neat and clean enough to not reuse follow_hugetlb_page(). I
> wonder if we want to add a comment to the function describing how it
> differs from follow_hugetlb_page().
>
> ... or do we maybe want to rename follow_hugetlb_page() to something like
> __hugetlb_get_user_pages() to make it clearer in which context it will
> get called?

Sounds reasonable to me.

> I guess it might be feasible in the future to eliminate
> follow_hugetlb_page() and centralize the faulting code. For now, this
> certainly improves the situation.
>
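
For reference on Mike's locking question above: as far as I can see, the
two lock-selection helpers involved behave roughly like the sketch below.
This is only a paraphrase of include/linux/mm.h and include/linux/hugetlb.h
as I read them, not verbatim kernel code, and the details vary by config
and kernel version:

/*
 * Sketch only. pte_offset_map_lock() on the follow_page() path picks its
 * lock via pte_lockptr(); with split PTE locks (the common case) that is
 * the per-PMD-page PTE lock. Without split locks it would be the mm-wide
 * page_table_lock instead (not shown here).
 */
static inline spinlock_t *pte_lockptr(struct mm_struct *mm, pmd_t *pmd)
{
	return ptlock_ptr(pmd_page(*pmd));
}

/*
 * huge_pte_lock() on the hugetlb path picks its lock via huge_pte_lockptr():
 * PMD-sized pages use the PMD split lock, while every other size (including
 * CONT-PTE sized pages, which are smaller than PMD_SIZE) falls back to the
 * mm-wide page_table_lock.
 */
static inline spinlock_t *huge_pte_lockptr(struct hstate *h,
					   struct mm_struct *mm, pte_t *pte)
{
	if (huge_page_size(h) == PMD_SIZE)
		return pmd_lockptr(mm, (pmd_t *)pte);
	return &mm->page_table_lock;
}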