From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: X-Spam-Checker-Version: SpamAssassin 3.4.0 (2014-02-07) on aws-us-west-2-korg-lkml-1.web.codeaurora.org X-Spam-Level: X-Spam-Status: No, score=-3.9 required=3.0 tests=BAYES_00,DKIMWL_WL_HIGH, DKIM_SIGNED,DKIM_VALID,DKIM_VALID_AU,HEADER_FROM_DIFFERENT_DOMAINS, MAILING_LIST_MULTI,SPF_HELO_NONE,SPF_PASS,URIBL_BLOCKED autolearn=no autolearn_force=no version=3.4.0 Received: from mail.kernel.org (mail.kernel.org [198.145.29.99]) by smtp.lore.kernel.org (Postfix) with ESMTP id 48D06C433E7 for ; Wed, 14 Oct 2020 16:53:57 +0000 (UTC) Received: from vger.kernel.org (vger.kernel.org [23.128.96.18]) by mail.kernel.org (Postfix) with ESMTP id E035A221EB for ; Wed, 14 Oct 2020 16:53:56 +0000 (UTC) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/simple; d=kernel.org; s=default; t=1602694437; bh=BBYAXSAn4mBquWe8qwqDlzIGr+XrpOnPBH+5TXDXxDM=; h=References:In-Reply-To:From:Date:Subject:To:Cc:List-ID:From; b=Bl8FRpSFHCZ2mNhEKFH1Anh73Zsyd73Y7GdTGTMJylvLhMftsAAPslf6mrDwFSzf3 ZN2T1KLqObwhneM0WFIzHqbB+T1JbJkV6tJUauSRs7RJv8Zdl5FTIB0kL8SLoHkr9f zzJyrxLK6hb45Z8qhGBgJly1rTkh/OpkSUSPhpa8= Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1729482AbgJNQx4 (ORCPT ); Wed, 14 Oct 2020 12:53:56 -0400 Received: from lindbergh.monkeyblade.net ([23.128.96.19]:53878 "EHLO lindbergh.monkeyblade.net" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1726152AbgJNQx4 (ORCPT ); Wed, 14 Oct 2020 12:53:56 -0400 Received: from mail-lj1-x242.google.com (mail-lj1-x242.google.com [IPv6:2a00:1450:4864:20::242]) by lindbergh.monkeyblade.net (Postfix) with ESMTPS id 0E563C061755 for ; Wed, 14 Oct 2020 09:53:55 -0700 (PDT) Received: by mail-lj1-x242.google.com with SMTP id p15so158598ljj.8 for ; Wed, 14 Oct 2020 09:53:54 -0700 (PDT) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=linux-foundation.org; s=google; h=mime-version:references:in-reply-to:from:date:message-id:subject:to :cc; bh=jUWxdkdjg3lFrQZhFI9PuRj+O488sFbEr8pq3LquhNM=; b=Dt066gP1+r7WsWUip/hPQkGd3om2xvR/D+iICX0sXkTFFYEBD2RqEDZQy8tSaGaMUR H5wB5kzJTkCT8oQU72tNDP92wSRVr2NCIZjn7r0Dv76pfk4Ja635M0iatP/0mGcrd5ff ONoPiJsCy5e36akQzfPFFU9VTMraqg5N1opLw= X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=1e100.net; s=20161025; h=x-gm-message-state:mime-version:references:in-reply-to:from:date :message-id:subject:to:cc; bh=jUWxdkdjg3lFrQZhFI9PuRj+O488sFbEr8pq3LquhNM=; b=bSp3A3w/E12h5oTTvdQFpjdii1/ptqwAX/r4SuDlN2BcmvzYxwd9eksy080jGdKuM8 GvjYh59NMMEPveCqqE7mcLss07KwA9ZefXuYyowkSm3Z2pHDE+kz45xWJbUlIj1HVzlm iPjqN9S2XPgNiwPbJAEvVW7cxLcGBY33UD6kdXmL6REIOCOaGmMr0CvKhRSzORVTvvVm 8Aw/4YXaAdEQx65o3ByiHujLZb1uu2ixbFUtZRo+LxpW3aUJWF26SYw3/0AYiCYnwo10 3YkT6KZVlke/XTsXHoT6Qc4gcdxuZOC98KFbeNqhxqwWDE0Z/6zuptcq3hr08vO0o6yX to2g== X-Gm-Message-State: AOAM533YKzNBxsStoUdWEcQu0Pbvc2sKDv7ADecjhCxj9n+ME9NlWTPT Ys5s2QHicLi4tDlyyWxfYe7mYBJ6cIYuVQ== X-Google-Smtp-Source: ABdhPJzTa0u/FFjWcHykLgWj5Dd91qKAy4Yf6gWjs3+PraocwCY+gA9Zvp8YWz59aWEmAN/N8Ld/cQ== X-Received: by 2002:a2e:9bcf:: with SMTP id w15mr57781ljj.14.1602694433138; Wed, 14 Oct 2020 09:53:53 -0700 (PDT) Received: from mail-lj1-f172.google.com (mail-lj1-f172.google.com. [209.85.208.172]) by smtp.gmail.com with ESMTPSA id u15sm1341794lfl.140.2020.10.14.09.53.51 for (version=TLS1_3 cipher=TLS_AES_128_GCM_SHA256 bits=128/128); Wed, 14 Oct 2020 09:53:51 -0700 (PDT) Received: by mail-lj1-f172.google.com with SMTP id c21so122674ljn.13 for ; Wed, 14 Oct 2020 09:53:51 -0700 (PDT) X-Received: by 2002:a2e:868b:: with SMTP id l11mr55751lji.102.1602694431261; Wed, 14 Oct 2020 09:53:51 -0700 (PDT) MIME-Version: 1.0 References: <20201014130555.kdbxyavqoyfnpos3@box> In-Reply-To: <20201014130555.kdbxyavqoyfnpos3@box> From: Linus Torvalds Date: Wed, 14 Oct 2020 09:53:35 -0700 X-Gmail-Original-Message-ID: Message-ID: Subject: Re: [PATCH 0/4] Some more lock_page work.. To: "Kirill A. Shutemov" Cc: Hugh Dickins , Matthew Wilcox , Linux-MM , Andrew Morton , linux-fsdevel , Amir Goldstein Content-Type: text/plain; charset="UTF-8" Precedence: bulk List-ID: X-Mailing-List: linux-fsdevel@vger.kernel.org On Wed, Oct 14, 2020 at 6:05 AM Kirill A. Shutemov wrote: > > So we remove strict serialization against truncation in > filemap_map_pages() if the mapping is private. Well, that particular patch does. But as mentioned, that's the wrong way of doing it. It's just that the right way - which keeps the serialization - requires some changes to how we actually install the pte. Right now that goes through alloc_set_pte(), and that one is basically written with the page lock in mind. So the current logic is - lock page - check page->mapping etc that is now stable - lock page tables if required (alloc_set_pte -> pte_alloc_one_map -> pte_offset_map_lock) - insert pte but that whole alloc_set_pte() thing is actually much too generic for what "map_pages()" really wants, and I think we could re-organize it. In particular, what I _think_ we could do is: - lock the page tables - check that the page isn't locked - increment the page mapcount (atomic and ordered) - check that the page still isn't locked - insert pte without taking the page lock. And the reason that's safe is that *if* we're racing with something that is about the remove the page mapping, then *that* code will (a) hold the page lock (b) before it removes the mapping, it has to remove it from any existing mappings, which involves checking the mapcount and going off and getting the page table locks to remove any ptes. but my patch did *not* do that, because you have to re-organize things a bit. Note that the above still keeps the "optimistic" model - if the page is locked, we'll skip that pte, and we'll end up doing the heavy thing (which will then take the pte lock if required). Which is why we basically get away with that "only test page lock, don't take it" approach with some ordering. But my theory is that the actual truncate case is *so* rare that we basically don't need to even worry about it, and that once the (common) page fault case doesn't need the page lock, then that whole "fall back to slow case" simply doesn't happen in practice. > IIUC, we can end up with a page with ->mapping == NULL set up in a PTE for > such mappings. The "page->mapping != mapping" check makes the race window > smaller, but doesn't remove it. Yes, that patch I posted as an RFC does exactly that. But I think we can keep the serialization if we just make slightly more involved changes. And they aren't necessarily a _lot_ more involved. In fact, I think we may already hold the page table lock due to doing that "pte_alloc_one_map()" thing over all of filemap_map_pages(). So I think the only _real_ problem is that I think we increment the page_mapcount() too late in alloc_set_pte(). And yeah, I realize this is a bit handwavy. But that's why I sent these things out as an RFC. To see if people can shoot holes in this, and maybe do that proper patch instead of my "let's get discussion going" one. Also, I really do think that our current "private mappings are coherent with fiel changes" is a big QoI issue: it's the right thing to do. So I wouldn't actually really want to take advantage of "private mappings don't really _have_ to be coherent according to POSIX". That's a lazy cop-out. So I dislike my patch, and don't think it's really acceptable, but as a "let's try this" I think it's a good first step. > I'm not sure all codepaths are fine with this. For instance, looks like > migration will back off such pages: __unmap_and_move() doesn't know how to > deal with mapped pages with ->mapping == NULL. So as outlined above, I think we can avoid the ->mapping == NULL case, and we can remove all the races. But it would still do new things, like incremnt the page mapcount without holding the page lock. I think that's ok - we already _decrement_ it without holding the lock, so it's not that the count is stable without locking - but at a minimum we'd have that "need to make sure the memory ordering between testing the page lock initially, incrementing the count, and re-testing is ok". And maybe some code that expects things to be stable. So I'm not at all claiming that things are obviously fine. I'm happy to see that Hugh started some practical stress-tests on that RFC patch, and I think we can improve on the situation. Linus