From mboxrd@z Thu Jan  1 00:00:00 1970
Return-Path: <SRS0=wijy=Z4=kvack.org=owner-linux-mm@kernel.org>
X-Spam-Checker-Version: SpamAssassin 3.4.0 (2014-02-07) on
	aws-us-west-2-korg-lkml-1.web.codeaurora.org
X-Spam-Level: 
X-Spam-Status: No, score=-2.4 required=3.0 tests=DKIMWL_WL_HIGH,DKIM_SIGNED,
	DKIM_VALID,DKIM_VALID_AU,HEADER_FROM_DIFFERENT_DOMAINS,MAILING_LIST_MULTI,
	SPF_HELO_NONE,SPF_PASS,USER_AGENT_SANE_1 autolearn=no autolearn_force=no
	version=3.4.0
Received: from mail.kernel.org (mail.kernel.org [198.145.29.99])
	by smtp.lore.kernel.org (Postfix) with ESMTP id 75100C43603
	for <linux-mm@archiver.kernel.org>; Fri,  6 Dec 2019 17:30:46 +0000 (UTC)
Received: from kanga.kvack.org (kanga.kvack.org [205.233.56.17])
	by mail.kernel.org (Postfix) with ESMTP id 2E51E206DF
	for <linux-mm@archiver.kernel.org>; Fri,  6 Dec 2019 17:30:46 +0000 (UTC)
Authentication-Results: mail.kernel.org;
	dkim=pass (1024-bit key) header.d=redhat.com header.i=@redhat.com header.b="T0Q/VFJ1"
DMARC-Filter: OpenDMARC Filter v1.3.2 mail.kernel.org 2E51E206DF
Authentication-Results: mail.kernel.org; dmarc=fail (p=none dis=none) header.from=redhat.com
Authentication-Results: mail.kernel.org; spf=pass smtp.mailfrom=owner-linux-mm@kvack.org
Received: by kanga.kvack.org (Postfix)
	id BE1426B16F6; Fri,  6 Dec 2019 12:30:45 -0500 (EST)
Received: by kanga.kvack.org (Postfix, from userid 40)
	id B93C26B16F7; Fri,  6 Dec 2019 12:30:45 -0500 (EST)
X-Delivered-To: int-list-linux-mm@kvack.org
Received: by kanga.kvack.org (Postfix, from userid 63042)
	id A80706B16F8; Fri,  6 Dec 2019 12:30:45 -0500 (EST)
X-Delivered-To: linux-mm@kvack.org
Received: from forelay.hostedemail.com (smtprelay0209.hostedemail.com [216.40.44.209])
	by kanga.kvack.org (Postfix) with ESMTP id 8F3AF6B16F6
	for <linux-mm@kvack.org>; Fri,  6 Dec 2019 12:30:45 -0500 (EST)
Received: from smtpin22.hostedemail.com (10.5.19.251.rfc1918.com [10.5.19.251])
	by forelay02.hostedemail.com (Postfix) with SMTP id 4142E6D8F
	for <linux-mm@kvack.org>; Fri,  6 Dec 2019 17:30:45 +0000 (UTC)
X-FDA: 76235406450.22.class27_4090040483134
X-HE-Tag: class27_4090040483134
X-Filterd-Recvd-Size: 8122
Received: from us-smtp-1.mimecast.com (us-smtp-delivery-1.mimecast.com [207.211.31.120])
	by imf06.hostedemail.com (Postfix) with ESMTP
	for <linux-mm@kvack.org>; Fri,  6 Dec 2019 17:30:43 +0000 (UTC)
DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=redhat.com;
	s=mimecast20190719; t=1575653442;
	h=from:from:reply-to:subject:subject:date:date:message-id:message-id:
	 to:to:cc:cc:mime-version:mime-version:content-type:content-type:
	 content-transfer-encoding:content-transfer-encoding:
	 in-reply-to:in-reply-to:references:references;
	bh=9DZxy3E4/qfEQ7vpwKUDVTvJ2EFgasTEIzq8NxvMdkQ=;
	b=T0Q/VFJ1rNiAWe0K8r97FT+X8lpEtJRElSDRVQ1gC8w5uMyifro7z/sG5Ic4OqGPH9m2Et
	8izf5fn5oE5Z9nksxqXtpNMpQ9vBF4QAOsxw2ahYGm2gnKNn/FOh97IiNja8AaAJ58w2Dv
	HM+3YcEl9ZFT9LPllH+gXs+THTJO730=
Received: from mimecast-mx01.redhat.com (mimecast-mx01.redhat.com
 [209.132.183.4]) (Using TLS) by relay.mimecast.com with ESMTP id
 us-mta-305-mdUGRTAaNleLMuRaeVqrJA-1; Fri, 06 Dec 2019 12:30:39 -0500
Received: from smtp.corp.redhat.com (int-mx01.intmail.prod.int.phx2.redhat.com [10.5.11.11])
	(using TLSv1.2 with cipher AECDH-AES256-SHA (256/256 bits))
	(No client certificate requested)
	by mimecast-mx01.redhat.com (Postfix) with ESMTPS id 8C00B18AAFA3;
	Fri,  6 Dec 2019 17:30:33 +0000 (UTC)
Received: from redhat.com (ovpn-125-174.rdu2.redhat.com [10.10.125.174])
	by smtp.corp.redhat.com (Postfix) with ESMTPS id 374EE60159;
	Fri,  6 Dec 2019 17:30:32 +0000 (UTC)
Date: Fri, 6 Dec 2019 12:30:30 -0500
From: Jerome Glisse <jglisse@redhat.com>
To: Matthew Wilcox <willy@infradead.org>
Cc: linux-mm@kvack.org, Laurent Dufour <ldufour@linux.ibm.com>,
	David Rientjes <rientjes@google.com>,
	Vlastimil Babka <vbabka@suse.cz>, Hugh Dickins <hughd@google.com>,
	Michel Lespinasse <walken@google.com>,
	Davidlohr Bueso <dbueso@suse.de>
Subject: Re: Splitting the mmap_sem
Message-ID: <20191206173030.GA3648@redhat.com>
References: <20191203222147.GV20752@bombadil.infradead.org>
 <20191205172150.GD5819@redhat.com>
 <20191206051322.GA21007@bombadil.infradead.org>
MIME-Version: 1.0
In-Reply-To: <20191206051322.GA21007@bombadil.infradead.org>
User-Agent: Mutt/1.12.1 (2019-06-15)
X-Scanned-By: MIMEDefang 2.79 on 10.5.11.11
X-MC-Unique: mdUGRTAaNleLMuRaeVqrJA-1
X-Mimecast-Spam-Score: 0
Content-Type: text/plain; charset=iso-8859-1
Content-Transfer-Encoding: quoted-printable
Content-Disposition: inline
X-Bogosity: Ham, tests=bogofilter, spamicity=0.000000, version=1.2.4
Sender: owner-linux-mm@kvack.org
Precedence: bulk
X-Loop: owner-majordomo@kvack.org
List-ID: <linux-mm.kvack.org>

On Thu, Dec 05, 2019 at 09:13:22PM -0800, Matthew Wilcox wrote:
> On Thu, Dec 05, 2019 at 12:21:50PM -0500, Jerome Glisse wrote:
> > Adding few interested people in cc
>=20
> I figured they all read linux-mm already ;-)
>=20
> > On Tue, Dec 03, 2019 at 02:21:47PM -0800, Matthew Wilcox wrote:
> > > While one thread is calling mmap(MAP_FIXED), two other threads which =
are
> > > accessing the same address may see different data from each other and
> > > have different page translations in their respective CPU caches until
> > > the thread calling mmap() returns.  I believe this is OK, but would
> > > greatly appreciate hearing from people who know better.
> >=20
> > I do not believe this is OK, i believe this is wrong (not even consider=
ing
> > possible hardware issues that can arise from such aliasing).
>=20
> Well, OK, but why do you believe it is wrong?  If thread A is executing
> a load instruction at the same time that thread B is calling mmap(),
> it really is indeterminate what value A loads.  It might be from before
> the call to mmap() and it might be from after.  And if thread C is also
> executing a load instruction at the same time, then it might already get
> a different result from thread A.  And can threads A and C really tell
> which of them executed the load instruction 'first'?  I think this is
> all so indeterminate already that the (lack of) guarantees I outlined
> above are acceptable.
>=20
> But we should all agree on this, so _please_ continue to argue your case
> for why you believe it to be wrong.
>=20

I agree that such application might looks like it is doing something that
is undeterminate but their might be application that catch SEGFAULT and use
it as synchronization. I did something similar for reverse engineering a
long time ago with a library call libsegfault ...

In any case, i agree that an application that is not catching SEGFAULT, and
which is doing the above (access patterns) is doing something undeterminate=
.

Nonetheless i believe it is important that at any point in time for all the
threads in a given process, on all the CPUs, a given virtual address should
always point to the same physical memory (or to nothing) ie we should never
have one CPU that sees a different physical memory from another CPU for the
same virtual address.


> [snip proposed solution -- if the problem needs solving, we can argue
> about how to solve it later]

Well i feel like you are also not discussing about the munmap() the above
seemed to be about MAP_FIXED (replacing an existing mapping). For munmap
too i believe we should agree on what should be the expected behavior and
from my POV again we should not allow new mapping to appear until a "runnin=
g"
munmap is not fully done (ie all CPUs cache and TLB flushed). For the same
reason as above ie all CPUs always see same physical memory (or nothing) fo=
r
a given virtual address.

This is what we have today with the big rwsem and i think we need to keep
that behavior even with concurency. I do not believe this will impact the
performance and it is easy enough to solve so i feel safer doing so given
it does not cost anything.

So i would rather argue on why we should change the current behavior if we
can fix the concurrency without changing it (hence why discussing solution
might also be relevant here).


> > > Some people are concerned that a reference count on the VMA will lead=
 to
> > > contention moving from the mmap_sem to the refcount on a very large V=
MA
> > > for workloads which have one giant VMA covering the entire working se=
t.
> > > For those workloads, I propose we use the existing ->map_pages() call=
back
> > > (changed to return a vm_fault_t from the current void).
> > >=20
> > > It will be called with the RCU lock held and no reference count on
> > > the vma.  If it needs to sleep, it should bump the refcount, drop the
> > > RCU lock, prepare enough so that the next call will not need to sleep=
,
> > > then drop the refcount and return VM_FAULT_RETRY so the VM knows the
> > > VMA is no longer good, and it needs to walk the VMA tree from the sta=
rt.
> >=20
> > Just to make sure i understand, you propose that ->map_pages() becomes
> > a new ->fault() handler that get calls before ->fault() without refcoun=
t
> > so that we can update fs/drivers slowly to perform better in the new sc=
heme
> > (ie avoid the overead of refcounting if possible at all) ?
> >=20
> > The ->fault() callback would then be the "slow" path which will require
> > a refcount on the vma (taken by core mm code before dropping rcu lock).
>=20
> I would actually propose never updating most drivers.  There's just no
> need for them to handle such an unstable and tricky situation as this.
> Let's not make driver writers lives harder.
>=20
> For the ones which need this kind of scalability (and let's be clear, the=
y
> would already have *better* scalability than today due to the rwsem being
> split into a per-VMA refcount), then yes, implementing ->map_pages would
> be the way to go.  Indeed, they would probably benefit from implementing
> it today, since it will reduce the number of page faults.

Yes they will get better scalability but i see some of those drivers might
want the extra few mini-percent :) In any case, i believe that it might be
better to give a new name ie fix current map_pages() user and rename that
callback to something more explicit (atomic_map_pages() or something simila=
r
i am not good at naming). But otherwise this looks like a good plan to avoi=
d
excessive refcount overhead.

Cheers,
J=E9r=F4me