From: Andy Lutomirski
Subject: Re: [RFC 15/20] mm: detect deferred TLB flushes in vma granularity
Date: Mon, 1 Feb 2021 16:14:58 -0800
Message-Id: <8F37526F-8189-483A-A16E-E0EB8662AD98@amacapital.net>
To: Nadav Amit
Cc: Linux-MM, LKML, Andy Lutomirski, Andrea Arcangeli, Andrew Morton,
 Dave Hansen, Peter Zijlstra, Thomas Gleixner, Will Deacon, Yu Zhao, X86 ML

> On Feb 1, 2021, at 2:04 PM, Nadav Amit wrote:
>
>> On Jan 30, 2021, at 4:11 PM, Nadav Amit wrote:
>>
>> From: Nadav Amit
>>
>> Currently, deferred TLB flushes are detected in the mm granularity: if
>> there is any deferred TLB flush in the entire address space due to NUMA
>> migration, pte_accessible() in x86 would return true, and
>> ptep_clear_flush() would require a TLB flush. This would happen even if
>> the PTE resides in a completely different vma.
>
> [ snip ]
>
>> +static inline void read_defer_tlb_flush_gen(struct mmu_gather *tlb)
>> +{
>> +	struct mm_struct *mm = tlb->mm;
>> +	u64 mm_gen;
>> +
>> +	/*
>> +	 * Any change of a PTE before calling __track_deferred_tlb_flush()
>> +	 * must be performed using an RMW atomic operation that provides a
>> +	 * memory barrier, such as ptep_modify_prot_start(). The barrier
>> +	 * ensures the PTEs are written before the current generation is
>> +	 * read, synchronizing (implicitly) with flush_tlb_mm_range().
>> +	 */
>> +	smp_mb__after_atomic();
>> +
>> +	mm_gen = atomic64_read(&mm->tlb_gen);
>> +
>> +	/*
>> +	 * This condition checks both for the first deferred TLB flush and
>> +	 * for other TLB flushes pending or executed after the last table
>> +	 * update. In the latter case, we are going to skip a generation,
>> +	 * which would lead to a full TLB flush. This should therefore not
>> +	 * cause correctness issues, and should not induce overhead, since
>> +	 * during TLB storms it is better to perform a full TLB flush anyhow.
>> +	 */
>> +	if (mm_gen != tlb->defer_gen) {
>> +		VM_BUG_ON(mm_gen < tlb->defer_gen);
>> +
>> +		tlb->defer_gen = inc_mm_tlb_gen(mm);
>> +	}
>> +}
>
> Andy's comments managed to make me realize this code is wrong. We must
> call inc_mm_tlb_gen(mm) every time.
>
> Otherwise, consider a CPU that saw the old tlb_gen and updated it in its
> local cpu_tlbstate on a context switch. If the process was not running
> when the TLB flush was issued, no IPI will be sent to that CPU.
> Therefore, a later switch_mm_irqs_off() back to the process will not
> flush the local TLB.
>
> I need to think if there is a better solution.
> Multiple calls to inc_mm_tlb_gen() during deferred flushes would trigger
> a full TLB flush instead of one specific to the ranges, once the flush
> actually takes place. On x86 it's practically a non-issue, since anyhow
> any update of more than 33 entries or so causes a full TLB flush, but
> this is still ugly.

What if we had a per-mm ring buffer of flushes? When starting a flush, we
would stick the range in the ring buffer and, when flushing, we would read
the ring buffer to catch up. This would mostly replace the flush_tlb_info
struct, and it would let us process multiple partial flushes together.
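To make Nadav's correction concrete: if inc_mm_tlb_gen() must be called on
every deferred flush, read_defer_tlb_flush_gen() collapses to something
like the sketch below. This is only a reconstruction from the quoted code,
not a tested patch:

static inline void read_defer_tlb_flush_gen(struct mmu_gather *tlb)
{
	struct mm_struct *mm = tlb->mm;

	/*
	 * As in the quoted comment: the preceding PTE update must be an
	 * RMW atomic, so this barrier orders the PTE write before the
	 * generation update.
	 */
	smp_mb__after_atomic();

	/*
	 * Advance the generation unconditionally. A CPU that cached the
	 * old tlb_gen in its cpu_tlbstate on a context switch, and that
	 * got no IPI because the process was not running, will then
	 * observe a newer generation in switch_mm_irqs_off() and flush
	 * its local TLB.
	 */
	tlb->defer_gen = inc_mm_tlb_gen(mm);
}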
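As a minimal sketch of the ring-buffer idea, assuming a fixed power-of-two
ring indexed by tlb_gen: every name below (struct tlb_flush_ring,
tlb_ring_queue(), the flush_tlb_local_*() callees) is invented for
illustration, and the overflow policy is simply "fall back to a full
flush":

#define TLB_RING_SIZE	64	/* power of two; slot chosen by generation */

struct tlb_flush_entry {
	unsigned long	start;
	unsigned long	end;
	u64		gen;	/* generation this range was queued under */
};

struct tlb_flush_ring {
	struct tlb_flush_entry	entries[TLB_RING_SIZE];
};

/* Producer: record a range under the gen returned by inc_mm_tlb_gen(). */
static void tlb_ring_queue(struct tlb_flush_ring *ring, u64 gen,
			   unsigned long start, unsigned long end)
{
	struct tlb_flush_entry *e = &ring->entries[gen & (TLB_RING_SIZE - 1)];

	e->start = start;
	e->end = end;
	/* Publish gen last: a reader that sees it also sees start/end. */
	smp_store_release(&e->gen, gen);
}

/* Consumer: a CPU whose local_gen lags mm_gen replays the missed ranges. */
static void tlb_ring_catch_up(struct tlb_flush_ring *ring,
			      u64 local_gen, u64 mm_gen)
{
	u64 gen;

	/* Entries more than a ring behind have been overwritten: punt. */
	if (mm_gen - local_gen >= TLB_RING_SIZE)
		goto full_flush;

	for (gen = local_gen + 1; gen <= mm_gen; gen++) {
		struct tlb_flush_entry *e =
			&ring->entries[gen & (TLB_RING_SIZE - 1)];

		/* A writer lapped us while we were walking: punt. */
		if (smp_load_acquire(&e->gen) != gen)
			goto full_flush;

		/*
		 * Racy sketch: a writer could overwrite start/end after the
		 * gen check above; a real implementation would re-check gen
		 * after reading the range and punt on a mismatch.
		 */
		flush_tlb_local_range(e->start, e->end);
	}
	return;

full_flush:
	flush_tlb_local_full();
}

This keeps the "catch up" semantics of the proposal: a lagging CPU walks
all queued ranges at once, so multiple partial flushes coalesce, and any
CPU that falls too far behind degrades to the full flush that x86 would
perform anyway for large batches.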