From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: X-Spam-Checker-Version: SpamAssassin 3.4.0 (2014-02-07) on aws-us-west-2-korg-lkml-1.web.codeaurora.org X-Spam-Level: X-Spam-Status: No, score=-2.5 required=3.0 tests=HEADER_FROM_DIFFERENT_DOMAINS, MAILING_LIST_MULTI,SPF_HELO_NONE,SPF_PASS,USER_AGENT_MUTT autolearn=ham autolearn_force=no version=3.4.0 Received: from mail.kernel.org (mail.kernel.org [198.145.29.99]) by smtp.lore.kernel.org (Postfix) with ESMTP id A20AFC04AB5 for ; Mon, 3 Jun 2019 17:57:29 +0000 (UTC) Received: from vger.kernel.org (vger.kernel.org [209.132.180.67]) by mail.kernel.org (Postfix) with ESMTP id 826EE246CA for ; Mon, 3 Jun 2019 17:57:29 +0000 (UTC) Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1726341AbfFCR53 (ORCPT ); Mon, 3 Jun 2019 13:57:29 -0400 Received: from foss.arm.com ([217.140.101.70]:57226 "EHLO foss.arm.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1726270AbfFCR52 (ORCPT ); Mon, 3 Jun 2019 13:57:28 -0400 Received: from usa-sjc-imap-foss1.foss.arm.com (unknown [10.72.51.249]) by usa-sjc-mx-foss1.foss.arm.com (Postfix) with ESMTP id 22C6B80D; Mon, 3 Jun 2019 10:57:28 -0700 (PDT) Received: from fuggles.cambridge.arm.com (usa-sjc-imap-foss1.foss.arm.com [10.72.51.249]) by usa-sjc-imap-foss1.foss.arm.com (Postfix) with ESMTPSA id 2FFC33F5AF; Mon, 3 Jun 2019 10:57:26 -0700 (PDT) Date: Mon, 3 Jun 2019 18:57:19 +0100 From: Will Deacon To: Nicholas Piggin Cc: akpm@linux-foundation.org, "Aneesh Kumar K.V" , jstancek@redhat.com, mgorman@suse.de, minchan@kernel.org, mm-commits@vger.kernel.org, namit@vmware.com, Peter Zijlstra , stable@vger.kernel.org, yang.shi@linux.alibaba.com Subject: Re: + mm-mmu_gather-remove-__tlb_reset_range-for-force-flush.patch added to -mm tree Message-ID: <20190603175719.GA13018@fuggles.cambridge.arm.com> References: <20190521231833.P5ThR%akpm@linux-foundation.org> <20190527110158.GB2623@hirez.programming.kicks-ass.net> <335de44e-02f5-ce92-c026-e8ac4a34a766@linux.ibm.com> <20190527142552.GD2623@hirez.programming.kicks-ass.net> <1559270298.wiy8c3d4zs.astroid@bobo.none> <20190531094931.GM2623@hirez.programming.kicks-ass.net> <1559527383.76rykleqz1.astroid@bobo.none> <20190603103009.GB27507@fuggles.cambridge.arm.com> <1559569861.n3f6bbdn43.astroid@bobo.none> MIME-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: <1559569861.n3f6bbdn43.astroid@bobo.none> User-Agent: Mutt/1.11.1+86 (6f28e57d73f2) () Sender: stable-owner@vger.kernel.org Precedence: bulk List-ID: X-Mailing-List: stable@vger.kernel.org Hi Nick, On Tue, Jun 04, 2019 at 12:10:37AM +1000, Nicholas Piggin wrote: > Will Deacon's on June 3, 2019 8:30 pm: > > On Mon, Jun 03, 2019 at 12:11:38PM +1000, Nicholas Piggin wrote: > >> Peter Zijlstra's on May 31, 2019 7:49 pm: > >> > On Fri, May 31, 2019 at 12:46:56PM +1000, Nicholas Piggin wrote: > >> >> I don't think it's very nice to set fullmm and freed_tables for this > >> >> case though. Is this concurrent zapping an important fast path? It > >> >> must have been, in order to justify all this complexity to the mm, so > >> >> we don't want to tie this boat anchor to it AFAIKS? > >> > > >> > I'm not convinced its an important fast path, afaict it is an > >> > unfortunate correctness issue caused by allowing concurrenct frees. > >> > >> I mean -- concurrent freeing was an important fastpath, right? > >> And concurrent freeing means that you hit this case. So this > >> case itself should be important too. > > > > I honestly don't think we (Peter and I) know. Our first involvement with > > this was because TLBs were found to contain stale user entries: > > > > https://lore.kernel.org/linux-arm-kernel/1817839533.20996552.1557065445233.JavaMail.zimbra@redhat.com/ > > > > so the initial work to support the concurrent freeing was done separately > > and, I assume, motivated by some real workloads. I would also very much > > like to know more about that, since nothing remotely realistic has surfaced > > in this discussion, other than some historical glibc thing which has long > > since been fixed. > > Well, it seems like it is important. While the complexity is carried > in the mm, we should not skimp on this last small piece. As I say, I really don't know. But yes, if we can do something better we should. > >> >> Is the problem just that the freed page tables flags get cleared by > >> >> __tlb_reset_range()? Why not just remove that then, so the bits are > >> >> set properly for the munmap? > >> > > >> > That's insufficient; as argued in my initial suggestion: > >> > > >> > https://lkml.kernel.org/r/20190509103813.GP2589@hirez.programming.kicks-ass.net > >> > > >> > Since we don't know what was flushed by the concorrent flushes, we must > >> > flush all state (page sizes, tables etc..). > >> > >> Page tables should not be concurrently freed I think. Just don't clear > >> those page table free flags and it should be okay. Page sizes yes, > >> but we accommodated for that in the arch code. I could see reason to > >> add a flag to the gather struct like "concurrent_free" and set that > >> from the generic code, which the arch has to take care of. > > > > I think you're correct that two CPUs cannot free the page tables > > concurrently (I misunderstood this initially), although I also think > > there may be some subtle issues if tlb->freed_tables is not set, > > depending on the architecture. Roughly speaking, if one CPU is clearing > > a PMD as part of munmap() and another CPU in madvise() does only last-level > > TLB invalidation, then I think there's the potential for the invalidation > > to be ineffective if observing a cleared PMD doesn't imply that the last > > level has been unmapped from the perspective of the page-table walker. > > That should not be the case because the last level table should have > had all entries cleared before the pointer to it has been cleared. The devil is in the detail here, and I think specifically it depends what you mean by "before". Does that mean memory barrier, or special page-table walker barrier, or TLB invalidation or ...? > So the page table walker could begin from the now-freed page table, > but it would never instantiate a valid TLB entry from there. So a > TLB invalidation would behave properly despite not flushing page > tables. > > Powerpc at least would want to avoid over flushing here, AFAIKS. For arm64 it really depends how often this hits. Simply not setting tlb->freed_tables would also break things for us, because we have an optimisation where we elide invalidation in the fullmm && !freed_tables case, since this is indicative of the mm going away and so we simply avoid reallocating its ASID. > >> > But it looks like benchmarks (for the one test-case we have) seem to > >> > favour flushing the world over flushing a smaller range. > >> > >> Testing on 16MB unmap is possibly not a good benchmark, I didn't run > >> it exactly but it looks likely to go beyond the range flush threshold > >> and flush the entire PID anyway. > > > > If we can get a better idea of what a "good benchmark" might look like (i.e. > > something that is representative of the cases in which real workloads are > > likely to trigger this path) then we can definitely try to optimise around > > that. > > Hard to say unfortunately. A smaller unmap range to start with, but > even then when you have a TLB over-flushing case, then an unmap micro > benchmark is not a great test because you'd like to see more impact of > other useful entries being flushed (e.g., you need an actual working > set). Right, sounds like somebody needs to do some better analysis than what's been done so far. > > In the meantime, I would really like to see this patch land in mainline > > since it fixes a regression. > > Sorry I didn't provide input earlier. I would like to improve the fix or > at least make an option for archs to provide an optimised way to flush > this case, so it would be nice not to fix archs this way and then have > to change the fix significantly right away. Please send patches ;) > But the bug does need to be fixed of course, if there needs to be more > thought about it maybe it's best to take this fix for next release. Agreed. Will