From: Nadav Amit
To: linux-mm@kvack.org, linux-kernel@vger.kernel.org
Cc: Nadav Amit, Andrea Arcangeli, Andrew Morton, Andy Lutomirski,
 Dave Hansen, Peter Zijlstra, Thomas Gleixner, Will Deacon, Yu Zhao,
 x86@kernel.org
Subject: [RFC 15/20] mm: detect deferred TLB flushes in vma granularity
Date: Sat, 30 Jan 2021 16:11:27 -0800
Message-Id: <20210131001132.3368247-16-namit@vmware.com>
X-Mailer: git-send-email 2.25.1
In-Reply-To: <20210131001132.3368247-1-namit@vmware.com>
References: <20210131001132.3368247-1-namit@vmware.com>

From: Nadav Amit

Currently, deferred TLB flushes are detected at mm granularity: if there is
any deferred TLB flush in the entire address space due to NUMA migration,
pte_accessible() on x86 returns true, and ptep_clear_flush() requires a TLB
flush. This happens even if the PTE resides in a completely different vma.

Recent code changes and possible future enhancements might require detecting
deferred TLB flushes at a finer granularity. Finer-granularity detection can
also enable more aggressive TLB-flush deferring in the future.

Record in each vma the mm's TLB generation after the last deferred PTE/PMD
change, while the page-table lock is still held. Increase the mm's generation
before recording it, to indicate that a TLB flush is pending. Record in the
mmu_gather struct the mm's TLB generation at the time the last TLB flush was
deferred. Once the deferred TLB flush eventually takes place, use the
deferred TLB generation that is recorded in mmu_gather.

Detection of deferred TLB flushes is performed by checking whether the mm's
completed TLB generation is lower than or equal to the mm's TLB generation.
Architectures that use the TLB generation logic are required to perform a
full TLB flush if they detect that a new TLB flush request "skips" a
generation (as already done by the x86 code).

To indicate that a deferred TLB flush takes place, increase the mm's TLB
generation after updating the PTEs. However, avoid increasing the mm's
generation again after subsequent PTE updates, as increasing it again would
lead to a full TLB flush once the deferred TLB flushes are performed (due to
the "skipped" TLB generation). Therefore, if the mm's generation did not
change after a subsequent PTE update, reuse the previous generation.

As multiple updates of the vma generation can be performed concurrently, use
atomic operations to ensure that the TLB generation recorded in the vma is
always the most recent one.
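In the patch below, store_deferred_tlb_gen() relies on tlb_update_generation()
(which already exists at this point in the series) for that atomic update. A
minimal sketch of what such a monotonic update could look like, assuming a
cmpxchg-based implementation; this is an illustration only and is not taken
from the series:

	/*
	 * Sketch only: assumes a cmpxchg-based monotonic update; not the
	 * series' actual implementation of tlb_update_generation().
	 */
	static inline void tlb_update_generation(atomic64_t *gen, u64 new_gen)
	{
		u64 cur = atomic64_read(gen);

		/* Only move the recorded generation forward. */
		while (cur < new_gen) {
			u64 prev = atomic64_cmpxchg(gen, cur, new_gen);

			if (prev == cur)
				break;		/* new_gen was installed */
			cur = prev;		/* raced with a newer update */
		}
	}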
By the time a deferred TLB flush is eventually performed, it might already be
redundant: another TLB flush may have covered it by doing a full TLB flush
once the "skipped" generation was detected. This case can be detected by
checking whether the deferred TLB generation, as recorded in mmu_gather, was
already completed. However, deferred PUD/P4D flushes are not recorded, and
freeing page tables also requires a flush on cores in lazy TLB mode. In such
cases a TLB flush is needed even if the mm's completed TLB generation
indicates the flush was already "performed".

Signed-off-by: Nadav Amit
Cc: Andrea Arcangeli
Cc: Andrew Morton
Cc: Andy Lutomirski
Cc: Dave Hansen
Cc: Peter Zijlstra
Cc: Thomas Gleixner
Cc: Will Deacon
Cc: Yu Zhao
Cc: x86@kernel.org
---
 arch/x86/include/asm/tlb.h      |  18 ++++--
 arch/x86/include/asm/tlbflush.h |   5 ++
 arch/x86/mm/tlb.c               |  14 ++++-
 include/asm-generic/tlb.h       | 104 ++++++++++++++++++++++++++++++--
 include/linux/mm_types.h        |  19 ++++++
 mm/mmap.c                       |   1 +
 mm/mmu_gather.c                 |   3 +
 7 files changed, 150 insertions(+), 14 deletions(-)

diff --git a/arch/x86/include/asm/tlb.h b/arch/x86/include/asm/tlb.h
index 580636cdc257..ecf538e6c6d5 100644
--- a/arch/x86/include/asm/tlb.h
+++ b/arch/x86/include/asm/tlb.h
@@ -9,15 +9,23 @@ static inline void tlb_flush(struct mmu_gather *tlb);
 
 static inline void tlb_flush(struct mmu_gather *tlb)
 {
-	unsigned long start = 0UL, end = TLB_FLUSH_ALL;
 	unsigned int stride_shift = tlb_get_unmap_shift(tlb);
 
-	if (!tlb->fullmm && !tlb->need_flush_all) {
-		start = tlb->start;
-		end = tlb->end;
+	/* Perform full flush when needed */
+	if (tlb->fullmm || tlb->need_flush_all) {
+		flush_tlb_mm_range(tlb->mm, 0, TLB_FLUSH_ALL, stride_shift,
+				   tlb->freed_tables);
+		return;
 	}
 
-	flush_tlb_mm_range(tlb->mm, start, end, stride_shift, tlb->freed_tables);
+	/* Check if flush was already performed */
+	if (!tlb->freed_tables && !tlb->cleared_puds &&
+	    !tlb->cleared_p4ds &&
+	    atomic64_read(&tlb->mm->tlb_gen_completed) > tlb->defer_gen)
+		return;
+
+	flush_tlb_mm_range_gen(tlb->mm, tlb->start, tlb->end, stride_shift,
+			       tlb->freed_tables, tlb->defer_gen);
 }
 
 /*
diff --git a/arch/x86/include/asm/tlbflush.h b/arch/x86/include/asm/tlbflush.h
index 2110b98026a7..296a00545056 100644
--- a/arch/x86/include/asm/tlbflush.h
+++ b/arch/x86/include/asm/tlbflush.h
@@ -225,6 +225,11 @@ void flush_tlb_others(const struct cpumask *cpumask,
 			: PAGE_SHIFT, false)
 
 extern void flush_tlb_all(void);
+
+extern void flush_tlb_mm_range_gen(struct mm_struct *mm, unsigned long start,
+				unsigned long end, unsigned int stride_shift,
+				bool freed_tables, u64 gen);
+
 extern void flush_tlb_mm_range(struct mm_struct *mm, unsigned long start,
 				unsigned long end, unsigned int stride_shift,
 				bool freed_tables);
diff --git a/arch/x86/mm/tlb.c b/arch/x86/mm/tlb.c
index d17b5575531e..48f4b56fc4a7 100644
--- a/arch/x86/mm/tlb.c
+++ b/arch/x86/mm/tlb.c
@@ -883,12 +883,11 @@ static inline void put_flush_tlb_info(void)
 #endif
 }
 
-void flush_tlb_mm_range(struct mm_struct *mm, unsigned long start,
+void flush_tlb_mm_range_gen(struct mm_struct *mm, unsigned long start,
 				unsigned long end, unsigned int stride_shift,
-				bool freed_tables)
+				bool freed_tables, u64 new_tlb_gen)
 {
 	struct flush_tlb_info *info;
-	u64 new_tlb_gen;
 	int cpu;
 
 	cpu = get_cpu();
@@ -923,6 +922,15 @@ void flush_tlb_mm_range(struct mm_struct *mm, unsigned long start,
 	put_cpu();
 }
 
+void flush_tlb_mm_range(struct mm_struct *mm, unsigned long start,
+				unsigned long end, unsigned int stride_shift,
+				bool freed_tables)
+{
+	u64 new_tlb_gen = inc_mm_tlb_gen(mm);
+
+	flush_tlb_mm_range_gen(mm, start, end, stride_shift, freed_tables,
+			       new_tlb_gen);
+}
 
 static void do_flush_tlb_all(void *info)
 {
diff --git a/include/asm-generic/tlb.h b/include/asm-generic/tlb.h
index 10690763090a..f25d2d955076 100644
--- a/include/asm-generic/tlb.h
+++ b/include/asm-generic/tlb.h
@@ -295,6 +295,11 @@ struct mmu_gather {
 	unsigned int		cleared_puds : 1;
 	unsigned int		cleared_p4ds : 1;
 
+	/*
+	 * Whether a TLB flush was needed for PTEs in the current table
+	 */
+	unsigned int		cleared_ptes_in_table : 1;
+
 	unsigned int		batch_count;
 
 #ifndef CONFIG_MMU_GATHER_NO_GATHER
@@ -305,6 +310,10 @@ struct mmu_gather {
 #ifdef CONFIG_MMU_GATHER_PAGE_SIZE
 	unsigned int page_size;
 #endif
+
+#ifdef CONFIG_ARCH_HAS_TLB_GENERATIONS
+	u64 defer_gen;
+#endif
 #endif
 };
 
@@ -381,7 +390,8 @@ static inline void tlb_flush(struct mmu_gather *tlb)
 #endif
 
 #if __is_defined(tlb_flush) || \
-	IS_ENABLED(CONFIG_ARCH_WANT_AGGRESSIVE_TLB_FLUSH_BATCHING)
+	IS_ENABLED(CONFIG_ARCH_WANT_AGGRESSIVE_TLB_FLUSH_BATCHING) || \
+	IS_ENABLED(CONFIG_ARCH_HAS_TLB_GENERATIONS)
 
 static inline void tlb_update_vma(struct mmu_gather *tlb,
 				  struct vm_area_struct *vma)
@@ -472,7 +482,8 @@ static inline unsigned long tlb_get_unmap_size(struct mmu_gather *tlb)
  */
 static inline void tlb_start_vma(struct mmu_gather *tlb, struct vm_area_struct *vma)
 {
-	if (tlb->fullmm)
+	if (IS_ENABLED(CONFIG_ARCH_WANT_AGGRESSIVE_TLB_FLUSH_BATCHING) &&
+	    tlb->fullmm)
 		return;
 
 	tlb_update_vma(tlb, vma);
@@ -530,16 +541,87 @@ static inline void mark_mm_tlb_gen_done(struct mm_struct *mm, u64 gen)
 	tlb_update_generation(&mm->tlb_gen_completed, gen);
 }
 
-#endif /* CONFIG_ARCH_HAS_TLB_GENERATIONS */
+static inline void read_defer_tlb_flush_gen(struct mmu_gather *tlb)
+{
+	struct mm_struct *mm = tlb->mm;
+	u64 mm_gen;
+
+	/*
+	 * Any change of a PTE before calling track_defer_tlb_flush() must be
+	 * performed using an RMW atomic operation that provides a memory
+	 * barrier, such as ptep_modify_prot_start(). The barrier ensures the
+	 * PTEs are written before the current generation is read,
+	 * synchronizing (implicitly) with flush_tlb_mm_range().
+	 */
+	smp_mb__after_atomic();
+
+	mm_gen = atomic64_read(&mm->tlb_gen);
+
+	/*
+	 * This condition checks both for the first deferred TLB flush and for
+	 * other pending or executed TLB flushes after the last table that we
+	 * updated. In the latter case, we are going to skip a generation,
+	 * which would lead to a full TLB flush. This should therefore not
+	 * cause correctness issues, and should not induce overheads, since in
+	 * TLB storms it is anyhow better to perform a full TLB flush.
+	 */
+	if (mm_gen != tlb->defer_gen) {
+		VM_BUG_ON(mm_gen < tlb->defer_gen);
+
+		tlb->defer_gen = inc_mm_tlb_gen(mm);
+	}
+}
+
+/*
+ * Store the deferred TLB generation in the VMA
+ */
+static inline void store_deferred_tlb_gen(struct mmu_gather *tlb)
+{
+	tlb_update_generation(&tlb->vma->defer_tlb_gen, tlb->defer_gen);
+}
+
+/*
+ * Track deferred TLB flushes for PTEs and PMDs to allow fine-granularity
+ * checks of whether a PTE is accessible. The TLB generation after the PTE
+ * is flushed is saved in the mmu_gather struct. Once a flush is performed,
+ * the generation is advanced.
+ */
+static inline void track_defer_tlb_flush(struct mmu_gather *tlb)
+{
+	if (tlb->fullmm)
+		return;
+
+	BUG_ON(!tlb->vma);
+
+	read_defer_tlb_flush_gen(tlb);
+	store_deferred_tlb_gen(tlb);
+}
+
+#define init_vma_tlb_generation(vma)				\
+	atomic64_set(&(vma)->defer_tlb_gen, 0)
+#else
+static inline void init_vma_tlb_generation(struct vm_area_struct *vma) { }
+#endif
 
 #define tlb_start_ptes(tlb)						\
 	do {								\
 		struct mmu_gather *_tlb = (tlb);			\
 									\
 		flush_tlb_batched_pending(_tlb->mm);			\
+		if (IS_ENABLED(CONFIG_ARCH_HAS_TLB_GENERATIONS))	\
+			_tlb->cleared_ptes_in_table = 0;		\
 	} while (0)
 
-static inline void tlb_end_ptes(struct mmu_gather *tlb) { }
+static inline void tlb_end_ptes(struct mmu_gather *tlb)
+{
+	if (!IS_ENABLED(CONFIG_ARCH_HAS_TLB_GENERATIONS))
+		return;
+
+	if (tlb->cleared_ptes_in_table)
+		track_defer_tlb_flush(tlb);
+
+	tlb->cleared_ptes_in_table = 0;
+}
 
 /*
  * tlb_flush_{pte|pmd|pud|p4d}_range() adjust the tlb->start and tlb->end,
@@ -550,15 +632,25 @@ static inline void tlb_flush_pte_range(struct mmu_gather *tlb,
 {
 	__tlb_adjust_range(tlb, address, size);
 	tlb->cleared_ptes = 1;
+
+	if (IS_ENABLED(CONFIG_ARCH_HAS_TLB_GENERATIONS))
+		tlb->cleared_ptes_in_table = 1;
 }
 
-static inline void tlb_flush_pmd_range(struct mmu_gather *tlb,
+static inline void __tlb_flush_pmd_range(struct mmu_gather *tlb,
 				     unsigned long address, unsigned long size)
 {
 	__tlb_adjust_range(tlb, address, size);
 	tlb->cleared_pmds = 1;
 }
 
+static inline void tlb_flush_pmd_range(struct mmu_gather *tlb,
+				     unsigned long address, unsigned long size)
+{
+	__tlb_flush_pmd_range(tlb, address, size);
+	track_defer_tlb_flush(tlb);
+}
+
 static inline void tlb_flush_pud_range(struct mmu_gather *tlb,
 				     unsigned long address, unsigned long size)
 {
@@ -649,7 +741,7 @@ static inline void tlb_flush_p4d_range(struct mmu_gather *tlb,
 #ifndef pte_free_tlb
 #define pte_free_tlb(tlb, ptep, address)			\
 	do {							\
-		tlb_flush_pmd_range(tlb, address, PAGE_SIZE);	\
+		__tlb_flush_pmd_range(tlb, address, PAGE_SIZE);	\
 		tlb->freed_tables = 1;				\
 		__pte_free_tlb(tlb, ptep, address);		\
 	} while (0)
diff --git a/include/linux/mm_types.h b/include/linux/mm_types.h
index 676795dfd5d4..bbe5d4a422f7 100644
--- a/include/linux/mm_types.h
+++ b/include/linux/mm_types.h
@@ -367,6 +367,9 @@ struct vm_area_struct {
 #endif
 #ifdef CONFIG_NUMA
 	struct mempolicy *vm_policy;	/* NUMA policy for the VMA */
+#endif
+#ifdef CONFIG_ARCH_HAS_TLB_GENERATIONS
+	atomic64_t defer_tlb_gen;	/* Deferred TLB flushes generation */
 #endif
 	struct vm_userfaultfd_ctx vm_userfaultfd_ctx;
 } __randomize_layout;
@@ -628,6 +631,21 @@ static inline bool mm_tlb_flush_pending(struct mm_struct *mm)
 	return atomic_read(&mm->tlb_flush_pending);
 }
 
+#ifdef CONFIG_ARCH_HAS_TLB_GENERATIONS
+static inline bool pte_tlb_flush_pending(struct vm_area_struct *vma, pte_t *pte)
+{
+	struct mm_struct *mm = vma->vm_mm;
+
+	return atomic64_read(&vma->defer_tlb_gen) < atomic64_read(&mm->tlb_gen_completed);
+}
+
+static inline bool pmd_tlb_flush_pending(struct vm_area_struct *vma, pmd_t *pmd)
+{
+	struct mm_struct *mm = vma->vm_mm;
+
+	return atomic64_read(&vma->defer_tlb_gen) < atomic64_read(&mm->tlb_gen_completed);
+}
+#else /* CONFIG_ARCH_HAS_TLB_GENERATIONS */
 static inline bool pte_tlb_flush_pending(struct vm_area_struct *vma, pte_t *pte)
 {
 	return mm_tlb_flush_pending(vma->vm_mm);
@@ -637,6 +655,7 @@ static inline bool pmd_tlb_flush_pending(struct vm_area_struct *vma, pmd_t *pmd)
 {
 	return mm_tlb_flush_pending(vma->vm_mm);
 }
+#endif /* CONFIG_ARCH_HAS_TLB_GENERATIONS */
 
 static inline bool mm_tlb_flush_nested(struct mm_struct *mm)
 {
diff --git a/mm/mmap.c b/mm/mmap.c
index 90673febce6a..a81ef902e296 100644
--- a/mm/mmap.c
+++ b/mm/mmap.c
@@ -3337,6 +3337,7 @@ struct vm_area_struct *copy_vma(struct vm_area_struct **vmap,
 			get_file(new_vma->vm_file);
 		if (new_vma->vm_ops && new_vma->vm_ops->open)
 			new_vma->vm_ops->open(new_vma);
+		init_vma_tlb_generation(new_vma);
 		vma_link(mm, new_vma, prev, rb_link, rb_parent);
 		*need_rmap_locks = false;
 	}
diff --git a/mm/mmu_gather.c b/mm/mmu_gather.c
index 13338c096cc6..0d554f2f92ac 100644
--- a/mm/mmu_gather.c
+++ b/mm/mmu_gather.c
@@ -329,6 +329,9 @@ static void __tlb_gather_mmu(struct mmu_gather *tlb, struct mm_struct *mm,
 #endif
 
 	tlb_table_init(tlb);
+#ifdef CONFIG_ARCH_HAS_TLB_GENERATIONS
+	tlb->defer_gen = 0;
+#endif
 #ifdef CONFIG_MMU_GATHER_PAGE_SIZE
 	tlb->page_size = 0;
 #endif
-- 
2.25.1
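Illustration only, not part of the patch: the "skipped generation" rule from
the commit message means that, on the flushing side, a CPU compares the
generation it is asked to reach with the generation it has already completed.
A minimal sketch of that decision, with hypothetical names and callbacks (the
real logic lives in x86's TLB flush IPI handler):

	/*
	 * Hypothetical sketch, not the kernel's actual code: catch a CPU up
	 * from the generation it has completed to the one it was asked for.
	 */
	static void catch_up_tlb_gen(u64 completed_gen, u64 asked_gen,
				     void (*flush_range)(void),
				     void (*flush_all)(void))
	{
		if (asked_gen <= completed_gen)
			return;			/* already flushed by someone else */

		if (asked_gen == completed_gen + 1)
			flush_range();		/* no generation was skipped */
		else
			flush_all();		/* a generation was skipped */
	}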