From: Matthew Wilcox
To: Jerome Glisse
Cc: linux-mm@kvack.org, Laurent Dufour, David Rientjes, Vlastimil Babka,
 Hugh Dickins, Michel Lespinasse, Davidlohr Bueso
Subject: Re: Splitting the mmap_sem
Date: Sun, 8 Dec 2019 19:33:35 -0800
Message-ID: <20191209033335.GC32169@bombadil.infradead.org>
In-Reply-To: <20191206173030.GA3648@redhat.com>

On Fri, Dec 06, 2019 at 12:30:30PM -0500,
Jerome Glisse wrote:
> On Thu, Dec 05, 2019 at 09:13:22PM -0800, Matthew Wilcox wrote:
> > On Thu, Dec 05, 2019 at 12:21:50PM -0500, Jerome Glisse wrote:
> > > On Tue, Dec 03, 2019 at 02:21:47PM -0800, Matthew Wilcox wrote:
> > > > While one thread is calling mmap(MAP_FIXED), two other threads which
> > > > are accessing the same address may see different data from each other
> > > > and have different page translations in their respective CPU caches
> > > > until the thread calling mmap() returns.  I believe this is OK, but
> > > > would greatly appreciate hearing from people who know better.
> > >
> > > I do not believe this is OK; I believe this is wrong (not even
> > > considering possible hardware issues that can arise from such aliasing).
> >
> > Well, OK, but why do you believe it is wrong?  If thread A is executing
> > a load instruction at the same time that thread B is calling mmap(),
> > it really is indeterminate what value A loads.  It might be from before
> > the call to mmap() and it might be from after.  And if thread C is also
> > executing a load instruction at the same time, then it might already get
> > a different result from thread A.  And can threads A and C really tell
> > which of them executed the load instruction 'first'?  I think this is
> > all so indeterminate already that the (lack of) guarantees I outlined
> > above are acceptable.
> >
> > But we should all agree on this, so _please_ continue to argue your case
> > for why you believe it to be wrong.
>
> I agree that such an application might look like it is doing something
> indeterminate, but there might be applications that catch SIGSEGV and use
> it as synchronization.  I did something similar for reverse engineering a
> long time ago with a library called libsegfault ...
>
> In any case, I agree that an application that is not catching SIGSEGV,
> and which is doing the above (access patterns), is doing something
> indeterminate.
>
> Nonetheless, I believe it is important that at any point in time, for all
> the threads in a given process, on all the CPUs, a given virtual address
> should always point to the same physical memory (or to nothing), ie we
> should never have one CPU that sees different physical memory from
> another CPU for the same virtual address.
>
> Also, I feel like you are not discussing munmap(); the above seemed to be
> about MAP_FIXED (replacing an existing mapping).  For munmap too I
> believe we should agree on what the expected behavior should be, and from
> my POV again we should not allow a new mapping to appear until a
> "running" munmap is fully done (ie all CPU caches and TLBs flushed).
> For the same reason as above, ie all CPUs always see the same physical
> memory (or nothing) for a given virtual address.

I see MAP_FIXED as being the harder case, but sure, let's talk about
munmap!  I agree that a munmap() + mmap() call should not permit thread B
to see the old value after thread A has seen the new value.  But, as long
as no new mmap can occupy that range, then it's OK if thread A takes a
segfault while thread B can still load the old value.  At least for a
short window.

We can replicate that behaviour by ensuring that new lookups see a NULL
entry, but new attempts to allocate will not reuse that range until the
munmap has finished and all TLB entries are flushed.  The maple tree
actually supports a "ZERO" entry (just like the XArray does) which has
this behaviour -- lookups see NULL, but attempts to allocate do not see it
as free.  We already use that property to prevent allocating above the end
of the process address space.

> This is what we have today with the big rwsem and I think we need to keep
> that behavior even with concurrency.  I do not believe this will impact
> performance, and it is easy enough to solve, so I feel safer doing so
> given it does not cost anything.
>
> So I would rather argue about why we should change the current behavior,
> if we can fix the concurrency without changing it (hence why discussing
> solutions might also be relevant here).

It seems like you want to force a thread which sees an ongoing munmap to
spin or sleep until the munmap is done, rather than immediately take a
segfault, and I don't know that that's a useful behaviour.

> > > Just to make sure I understand: you propose that ->map_pages() becomes
> > > a new ->fault() handler that gets called before ->fault() without a
> > > refcount, so that we can update fs/drivers slowly to perform better in
> > > the new scheme (ie avoid the overhead of refcounting if possible at
> > > all)?
> > >
> > > The ->fault() callback would then be the "slow" path which will
> > > require a refcount on the vma (taken by core mm code before dropping
> > > the rcu lock).
> >
> > I would actually propose never updating most drivers.  There's just no
> > need for them to handle such an unstable and tricky situation as this.
> > Let's not make driver writers' lives harder.
> >
> > For the ones which need this kind of scalability (and let's be clear,
> > they would already have *better* scalability than today due to the
> > rwsem being split into a per-VMA refcount), then yes, implementing
> > ->map_pages would be the way to go.  Indeed, they would probably
> > benefit from implementing it today, since it will reduce the number of
> > page faults.
>
> Yes, they will get better scalability, but I see some of those drivers
> might want the extra few mini-percent :)  In any case, I believe it might
> be better to give it a new name, ie fix the current map_pages() users and
> rename that callback to something more explicit (atomic_map_pages() or
> something similar; I am not good at naming).  But otherwise this looks
> like a good plan to avoid excessive refcount overhead.

OK, great.  I don't think the current name is bad, but if someone comes up
with a better one, I don't have a problem with renaming it.