Subject: Re: [PATCH 0/4] Some more lock_page work..
From: Linus Torvalds
Date: Wed, 14 Oct 2020 09:53:35 -0700
To: "Kirill A. Shutemov"
Cc: Hugh Dickins, Matthew Wilcox, Linux-MM, Andrew Morton,
 linux-fsdevel, Amir Goldstein
In-Reply-To: <20201014130555.kdbxyavqoyfnpos3@box>

On Wed, Oct 14, 2020 at 6:05 AM Kirill A. Shutemov wrote:
>
> So we remove strict serialization against truncation in
> filemap_map_pages() if the mapping is private.

Well, that particular patch does. But as mentioned, that's the wrong
way of doing it. It's just that the right way - which keeps the
serialization - requires some changes to how we actually install the
pte.

Right now that goes through alloc_set_pte(), and that one is basically
written with the page lock in mind. So the current logic is

 - lock page
 - check page->mapping etc that is now stable
 - lock page tables if required
   (alloc_set_pte -> pte_alloc_one_map -> pte_offset_map_lock)
 - insert pte

but that whole alloc_set_pte() thing is actually much too generic for
what "map_pages()" really wants, and I think we could re-organize it.
In particular, what I _think_ we could do is:

 - lock the page tables
 - check that the page isn't locked
 - increment the page mapcount (atomic and ordered)
 - check that the page still isn't locked
 - insert pte

without taking the page lock. And the reason that's safe is that *if*
we're racing with something that is about to remove the page mapping,
then *that* code will

 (a) hold the page lock

 (b) before it removes the mapping, it has to remove it from any
existing mappings, which involves checking the mapcount and going off
and getting the page table locks to remove any ptes.

but my patch did *not* do that, because you have to re-organize things
a bit.

Note that the above still keeps the "optimistic" model - if the page
is locked, we'll skip that pte, and we'll end up doing the heavy thing
(which will then take the pte lock if required). Which is why we
basically get away with that "only test page lock, don't take it"
approach with some ordering.

But my theory is that the actual truncate case is *so* rare that we
basically don't need to even worry about it, and that once the
(common) page fault case doesn't need the page lock, then that whole
"fall back to slow case" simply doesn't happen in practice.

> IIUC, we can end up with a page with ->mapping == NULL set up in a PTE for
> such mappings. The "page->mapping != mapping" check makes the race window
> smaller, but doesn't remove it.

Yes, that patch I posted as an RFC does exactly that. But I think we
can keep the serialization if we just make slightly more involved
changes.

And they aren't necessarily a _lot_ more involved. In fact, I think we
may already hold the page table lock due to doing that
"pte_alloc_one_map()" thing over all of filemap_map_pages(). So I
think the only _real_ problem is that I think we increment the
page_mapcount() too late in alloc_set_pte().

And yeah, I realize this is a bit handwavy. But that's why I sent
these things out as an RFC.
To see if people can shoot holes in this, and maybe do that proper
patch instead of my "let's get discussion going" one.

Also, I really do think that our current "private mappings are
coherent with file changes" is a big QoI issue: it's the right thing
to do. So I wouldn't actually really want to take advantage of
"private mappings don't really _have_ to be coherent according to
POSIX". That's a lazy cop-out.

So I dislike my patch, and don't think it's really acceptable, but as
a "let's try this" I think it's a good first step.

> I'm not sure all codepaths are fine with this. For instance, looks like
> migration will back off such pages: __unmap_and_move() doesn't know how to
> deal with mapped pages with ->mapping == NULL.

So as outlined above, I think we can avoid the ->mapping == NULL
case, and we can remove all the races. But it would still do new
things, like increment the page mapcount without holding the page
lock.

I think that's ok - we already _decrement_ it without holding the
lock, so it's not that the count is stable without locking - but at a
minimum we'd have that "need to make sure the memory ordering between
testing the page lock initially, incrementing the count, and
re-testing is ok".

And maybe some code that expects things to be stable.

So I'm not at all claiming that things are obviously fine. I'm happy
to see that Hugh started some practical stress-tests on that RFC
patch, and I think we can improve on the situation.

              Linus