Subject: Re: [PATCH 0/4] Some more lock_page work..
From: Linus Torvalds
Date: Wed, 14 Oct 2020 09:53:35 -0700
To: "Kirill A. Shutemov"
Cc: Hugh Dickins, Matthew Wilcox, Linux-MM, Andrew Morton,
 linux-fsdevel, Amir Goldstein
In-Reply-To: <20201014130555.kdbxyavqoyfnpos3@box>

On Wed, Oct 14, 2020 at 6:05 AM Kirill A. Shutemov wrote:
>
> So we remove strict serialization against truncation in
> filemap_map_pages() if the mapping is private.

Well, that particular patch does. But as mentioned, that's the wrong
way of doing it. It's just that the right way - which keeps the
serialization - requires some changes to how we actually install the
pte.

Right now that goes through alloc_set_pte(), and that one is basically
written with the page lock in mind. So the current logic is

 - lock page
 - check page->mapping etc that is now stable
 - lock page tables if required
   (alloc_set_pte -> pte_alloc_one_map -> pte_offset_map_lock)
 - insert pte

but that whole alloc_set_pte() thing is actually much too generic for
what "map_pages()" really wants, and I think we could re-organize it.
In particular, what I _think_ we could do is:

 - lock the page tables
 - check that the page isn't locked
 - increment the page mapcount (atomic and ordered)
 - check that the page still isn't locked
 - insert pte

without taking the page lock. And the reason that's safe is that *if*
we're racing with something that is about to remove the page mapping,
then *that* code will

 (a) hold the page lock

 (b) before it removes the mapping, it has to remove it from any
existing mappings, which involves checking the mapcount and going off
and getting the page table locks to remove any ptes.

but my patch did *not* do that, because you have to re-organize things
a bit.

Note that the above still keeps the "optimistic" model - if the page
is locked, we'll skip that pte, and we'll end up doing the heavy thing
(which will then take the pte lock if required). Which is why we
basically get away with that "only test page lock, don't take it"
approach with some ordering.

But my theory is that the actual truncate case is *so* rare that we
basically don't need to even worry about it, and that once the
(common) page fault case doesn't need the page lock, then that whole
"fall back to slow case" simply doesn't happen in practice.

> IIUC, we can end up with a page with ->mapping == NULL set up in a PTE for
> such mappings. The "page->mapping != mapping" check makes the race window
> smaller, but doesn't remove it.

Yes, that patch I posted as an RFC does exactly that. But I think we
can keep the serialization if we just make slightly more involved
changes.

And they aren't necessarily a _lot_ more involved. In fact, I think we
may already hold the page table lock due to doing that
"pte_alloc_one_map()" thing over all of filemap_map_pages(). So I
think the only _real_ problem is that I think we increment the
page_mapcount() too late in alloc_set_pte().

And yeah, I realize this is a bit handwavy. But that's why I sent
these things out as an RFC.
To see if people can shoot holes in this, and maybe do that proper
patch instead of my "let's get discussion going" one.

Also, I really do think that our current "private mappings are
coherent with file changes" is a big QoI issue: it's the right thing
to do. So I wouldn't actually really want to take advantage of
"private mappings don't really _have_ to be coherent according to
POSIX". That's a lazy cop-out.

So I dislike my patch, and don't think it's really acceptable, but as
a "let's try this" I think it's a good first step.

> I'm not sure all codepaths are fine with this. For instance, looks like
> migration will back off such pages: __unmap_and_move() doesn't know how to
> deal with mapped pages with ->mapping == NULL.

So as outlined above, I think we can avoid the ->mapping == NULL
case, and we can remove all the races. But it would still do new
things, like increment the page mapcount without holding the page
lock.

I think that's ok - we already _decrement_ it without holding the
lock, so it's not that the count is stable without locking - but at a
minimum we'd have that "need to make sure the memory ordering between
testing the page lock initially, incrementing the count, and
re-testing is ok".

And maybe some code that expects things to be stable.

So I'm not at all claiming that things are obviously fine. I'm happy
to see that Hugh started some practical stress-tests on that RFC
patch, and I think we can improve on the situation.

              Linus