Date: Thu, 3 Oct 2019 09:29:52 +0200
From: Peter Zijlstra
To: Leonardo Bras
Cc: linuxppc-dev@lists.ozlabs.org, linux-kernel@vger.kernel.org,
	kvm-ppc@vger.kernel.org, linux-arch@vger.kernel.org,
	linux-mm@kvack.org, Benjamin Herrenschmidt, Paul Mackerras,
	Michael Ellerman, Arnd Bergmann, "Aneesh Kumar K.V",
	Christophe Leroy, Nicholas Piggin, Andrew Morton,
	Mahesh Salgaonkar, Reza Arbab, Santosh Sivaraj, Balbir Singh,
	Thomas Gleixner, Greg Kroah-Hartman, Mike Rapoport,
	Allison Randal, Jason Gunthorpe, Dan Williams, Vlastimil Babka,
	Christoph Lameter, Logan Gunthorpe, Andrey Ryabinin,
	Alexey Dobriyan, Souptick Joarder, Mathieu Desnoyers,
	Ralph Campbell, Jesper Dangaard Brouer, Jann Horn,
	Davidlohr Bueso, Ingo Molnar, Christian Brauner, Michal Hocko,
	Elena Reshetova, Roman Gushchin, Andrea Arcangeli, Al Viro,
	"Dmitry V. Levin", Jérôme Glisse, Song Liu,
	Bartlomiej Zolnierkiewicz, Ira Weiny, "Kirill A. Shutemov",
	John Hubbard, Keith Busch
Subject: Re: [PATCH v5 00/11] Introduces new count-based method for
	tracking lockless pagetable walks
Message-ID: <20191003072952.GN4536@hirez.programming.kicks-ass.net>
References: <20191003013325.2614-1-leonardo@linux.ibm.com>
In-Reply-To: <20191003013325.2614-1-leonardo@linux.ibm.com>

On Wed, Oct 02, 2019 at 10:33:14PM -0300, Leonardo Bras wrote:
> If a process (qemu) with a lot of CPUs (128) try to munmap() a large
> chunk of memory (496GB) mapped with THP, it takes an average of 275
> seconds, which can cause a lot of problems to the load (in qemu case,
> the guest will lock for this time).
>
> Trying to find the source of this bug, I found out most of this time is
> spent on serialize_against_pte_lookup(). This function will take a lot
> of time in smp_call_function_many() if there is more than a couple CPUs
> running the user process. Since it has to happen to all THP mapped, it
> will take a very long time for large amounts of memory.
>
> By the docs, serialize_against_pte_lookup() is needed in order to avoid
> pmd_t to pte_t casting inside find_current_mm_pte(), or any lockless
> pagetable walk, to happen concurrently with THP splitting/collapsing.
>
> It does so by calling a do_nothing() on each CPU in mm->cpu_bitmap[],
> after interrupts are re-enabled.
> Since, interrupts are (usually) disabled during lockless pagetable
> walk, and serialize_against_pte_lookup will only return after
> interrupts are enabled, it is protected.

This is something entirely specific to Power; you shouldn't be touching
generic code at all.

Also, I'm not sure I understand things properly.

So serialize_against_pte_lookup() wants to wait for all currently
outstanding __find_linux_pte() instances (which are very similar to
gup_fast).

It seems to want to do this before flushing the THP TLB for some reason;
why? Should THP not observe the normal page-table freeing rules, which
already include an RCU-like grace period like this? Why is THP special
here? This doesn't seem adequately explained.

Also, specifically for munmap(), this seems entirely superfluous;
munmap() uses the normal page-table freeing code and should be entirely
fine without additional waiting.

Furthermore, Power never accurately tracks mm_cpumask(), so using that
makes the whole thing more expensive than it needs to be. Also, I suppose
that is buggered vs file-backed THP.
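
For readers following the thread, the serialization scheme being
discussed can be sketched as a toy, single-threaded userspace model.
This is not kernel code and not the Power implementation; every name in
it (toy_cpu, toy_walk_begin, toy_cpu_step, toy_serialize) is
hypothetical. It assumes only what the quoted text states: a lockless
walker runs with interrupts off, and the serializer sends an empty IPI
to every CPU in a mask and returns once each handler has run.

```c
#include <assert.h>
#include <limits.h>
#include <stdbool.h>
#include <stddef.h>

/* Toy model of one CPU; all names here are hypothetical, not kernel API. */
struct toy_cpu {
	bool irqs_off;        /* models interrupts disabled around a lockless walk */
	bool ipi_pending;     /* an IPI was sent but its handler has not run yet */
	int  walk_steps_left; /* remaining work in the current lockless walk */
};

struct toy_result {
	int ipis_sent;     /* one per CPU set in the mask, busy or not */
	int rounds_waited; /* time steps until every IPI was handled */
};

/* Start a lockless walk of `steps` steps (steps > 0) with "interrupts" off. */
static void toy_walk_begin(struct toy_cpu *c, int steps)
{
	c->irqs_off = true;
	c->walk_steps_left = steps;
}

/* Advance one CPU by one time step. */
static void toy_cpu_step(struct toy_cpu *c)
{
	if (c->irqs_off && --c->walk_steps_left <= 0)
		c->irqs_off = false;    /* walk finished, interrupts back on */
	if (!c->irqs_off && c->ipi_pending)
		c->ipi_pending = false; /* the empty IPI handler runs here */
}

/*
 * Model of the serializer: post an IPI to every CPU in `mask`, then
 * step all CPUs until no IPI is pending. A CPU that was mid-walk when
 * the IPI was posted cannot acknowledge it until its walk has ended.
 */
static struct toy_result toy_serialize(struct toy_cpu *cpus, size_t n,
				       unsigned long mask)
{
	struct toy_result r = { 0, 0 };
	size_t i;

	assert(n <= sizeof(mask) * CHAR_BIT);

	for (i = 0; i < n; i++) {
		if (mask & (1UL << i)) {
			cpus[i].ipi_pending = true;
			r.ipis_sent++;
		}
	}

	for (;;) {
		bool pending = false;

		for (i = 0; i < n; i++)
			pending |= cpus[i].ipi_pending;
		if (!pending)
			break;

		for (i = 0; i < n; i++)
			toy_cpu_step(&cpus[i]);
		r.rounds_waited++;
	}
	return r;
}
```

In this model, with four CPUs in the mask and only CPU 0 in a three-step
walk, toy_serialize() sends four IPIs and waits three rounds. The three
idle CPUs still cost one IPI each, which is the sense in which an
over-inclusive mask makes the operation more expensive than it needs to
be.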