Date: Mon, 28 Sep 2020 20:57:39 -0300
From: Jason Gunthorpe
To: Linus Torvalds
Cc: Peter Xu, Leon Romanovsky, John Hubbard, Linux-MM,
    Linux Kernel Mailing List, Andrew Morton, Jan Kara, Michal Hocko,
    Kirill Tkhai, Kirill Shutemov, Hugh Dickins, Christoph Hellwig,
    Andrea Arcangeli, Oleg Nesterov, Jann Horn
Subject: Re: [PATCH 1/5] mm: Introduce mm_struct.has_pinned
Message-ID: <20200928235739.GU9916@ziepe.ca>
References: <20200927062337.GE2280698@unreal> <20200928124937.GN9916@ziepe.ca> <20200928172256.GB59869@xz-x1> <20200928183928.GR9916@ziepe.ca>
X-Mailing-List: linux-kernel@vger.kernel.org

On Mon, Sep 28, 2020 at 12:29:55PM -0700, Linus Torvalds wrote:

> So a read pin action would basically never work for the fast-path for
> a few cases, notably a shared read-only mapping - because we could
> never mark it in the page tables as "fast pin accessible"

Agree, I was assuming we'd lose more of the fast path to create this
thing. It would only still be fast if the pages are already writable.

I strongly suspect the case of DMA'ing actual read-only data is the
minority here; the usual case is probably filling a writable buffer
with something interesting and then triggering the DMA. The DMA just
happens to be a read from the driver's view, so the driver doesn't set
FOLL_WRITE.

Looking at the FOLL_LONGTERM users, which should be the banner use case
for this, there are very few that do a read pin and use fast.

> And it would basically have no advantages over a writable FOLL_PIN.
> It would break the association with any backing store for private pages,
> because otherwise it can't follow future writes.

Yes, I wasn't clear enough, I'm looking at this from a driver API
perspective. We have this API

  pin_user_pages(FOLL_LONGTERM | FOLL_WRITE)

Which now has no decoherence issues with the MM. If the driver
naturally wants to do read-only access it might be tempted to do:

  pin_user_pages(FOLL_LONGTERM)

Which is now NOT the same thing and brings all these really surprising
mm coherence issues back.

The driver author might discover this in testing, then be tempted to
hardwire 'FOLL_LONGTERM | FOLL_WRITE'. Now their uAPI is broken for
things that are actually read-only like .rodata.

If they discover this then they add a FOLL_FORCE to the mix.

When someone comes along to read this later it is a big leap to see

  pin_user_pages(FOLL_LONGTERM | FOLL_FORCE | FOLL_WRITE)

and realize this is code for "read only mapping". At least it took me
a while to decipher it the first time I saw it.

I think this is really hard to use and ugly. My thinking has been to
just stick:

  if (flags & FOLL_LONGTERM)
      flags |= FOLL_FORCE | FOLL_WRITE;

In pin_user_pages(). It would make the driver API cleaner. If we can
do a bit better somehow by not COW'ing for certain VMAs as you
explained then all the better, but that is not my primary goal.

Basically, I think if a driver is using FOLL_LONGTERM | FOLL_PIN we
should guarantee that driver a consistent MM and take the gup_fast
performance hit to do it.

AFAICT the giant whack of other cases not using FOLL_LONGTERM really
shouldn't care about read-decoherence. For those cases the user should
really not be racing writes with data under a read-only pin, and the
new COW logic looks like it solves the other issues with this.

I know Jann/John have been careful to not have special behaviors for
the DMA case, but I think it makes sense here. It is actually
different.

Jason
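
The flag promotion proposed above can be sketched as a standalone
userspace function. This is only an illustration under assumptions: the
flag values are stand-ins rather than the kernel's definitions, and
normalize_pin_flags() and check_pin_flags() are hypothetical names, not
functions in the tree.

```c
#include <assert.h>

/* Stand-in flag bits for illustration only; NOT the kernel's definitions. */
#define FOLL_WRITE    0x01u
#define FOLL_FORCE    0x10u
#define FOLL_LONGTERM 0x10000u

/*
 * Hypothetical helper: a long-term pin is always promoted to a forced
 * write pin, so a driver asking for read-only access gets the same
 * MM-coherent behaviour as one asking for write access.
 */
unsigned int normalize_pin_flags(unsigned int flags)
{
	if (flags & FOLL_LONGTERM)
		flags |= FOLL_FORCE | FOLL_WRITE;
	return flags;
}

/* Small self-check of the behaviour described above. */
void check_pin_flags(void)
{
	/* A read-only long-term pin is silently promoted. */
	assert(normalize_pin_flags(FOLL_LONGTERM) ==
	       (FOLL_LONGTERM | FOLL_FORCE | FOLL_WRITE));
	/* Pins without FOLL_LONGTERM are left untouched. */
	assert(normalize_pin_flags(0) == 0);
	assert(normalize_pin_flags(FOLL_WRITE) == FOLL_WRITE);
}
```

With something like this inside pin_user_pages(), a driver could pass
plain FOLL_LONGTERM for a read-only mapping and would not need to spell
out FOLL_FORCE | FOLL_WRITE itself.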