From: Linus Torvalds
Date: Wed, 9 Dec 2020 11:04:13 -0800
Subject: Re: [PATCH 1/2] mm: Allow architectures to request 'old' entries when prefaulting
To: Will Deacon
Cc: Linux Kernel Mailing List, Linux-MM, Linux ARM, Catalin Marinas,
 Jan Kara, Minchan Kim, Andrew Morton, "Kirill A. Shutemov",
 Vinayak Menon, Android Kernel Team
In-Reply-To: <20201209184049.GA8778@willie-the-truck>
References: <20201209163950.8494-1-will@kernel.org>
 <20201209163950.8494-2-will@kernel.org>
 <20201209184049.GA8778@willie-the-truck>
X-Mailing-List: linux-kernel@vger.kernel.org
Content-Type: text/plain; charset="UTF-8"

On Wed, Dec 9, 2020 at 10:40 AM Will Deacon wrote:
>
> > And yes, that probably means that you need to change "alloc_set_pte()"
> > to actually pass in the real address, and leave "vmf->address" alone -
> > so that it can know which ones are prefaulted and which one is real,
> > but that sounds like a good idea anyway.
>
> Right, I deliberately avoided that based on the feedback from Jan on an
> older version [1], but I can certainly look at it again.

Side note: I absolutely detest alloc_set_pte() in general, and it would
be very good to try to just change the rules of that function entirely.

The alloc_set_pte() function is written to be some generic "iterate
over ptes, setting them" function, but it has exactly two users, and
neither of those users really wants the semantics it gives them, I
feel. And the semantics that alloc_set_pte() does have actually
*hurt* them.

In particular, it made it a nightmare to read what do_fault_around()
does: it does that odd

        if (pmd_none(*vmf->pmd)) {
                vmf->prealloc_pte = pte_alloc_one(vmf->vma->vm_mm);

and then it calls ->map_pages() (which is always filemap_map_pages(),
except for xfs, where it is also always filemap_map_pages() but it
takes a lock first).

And it did that prealloc_pte because that's what alloc_set_pte() needs
in order to be atomic - and it needs to be atomic because it's all done
under the RCU lock.

So essentially, the _major_ user of alloc_set_pte() requires atomicity
- but that's not at all obvious when you actually read alloc_set_pte()
itself, and in fact when you read it, you see that

        if (!vmf->pte) {
                ret = pte_alloc_one_map(vmf);

which can block. Except it won't block if we have vmf->prealloc_pte,
because then it will just take that instead.

In other words, ALL THAT IS COMPLETE GARBAGE. And it's written to be as
obtuse as humanly possible, and there is no way in hell anybody
understands it by just looking at the code - you have to follow all
these odd quirks back to see what's going on.

So it would be much better to get rid of alloc_set_pte() entirely, and
move the allocation and the pte locking into the caller, and just clean
it up (because it turns out the callers have their own models for
allocation anyway).
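Just to illustrate the shape I mean - and this is purely a sketch, not
the actual code or any real patch: the allocation happens up front in
sleepable context (before any RCU walk), the pte lock is taken visibly
in the caller, and everything between lock and unlock is trivially
atomic. The page-cache walk, rmap and memory-counter accounting are
elided, and the function name is made up:

        /* Sketch only: the caller owns both the page-table allocation
         * and the pte locking, so the atomic part is visible where it
         * actually happens. */
        static vm_fault_t map_one_prefault_sketch(struct vm_fault *vmf,
                                                   unsigned long addr,
                                                   struct page *page)
        {
                struct vm_area_struct *vma = vmf->vma;
                spinlock_t *ptl;
                pte_t *pte;

                /* Sleepable context: make sure the page table exists up front. */
                if (pte_alloc(vma->vm_mm, vmf->pmd))
                        return VM_FAULT_OOM;

                /* Atomic part: pte lock taken here, in the caller, visibly. */
                pte = pte_offset_map_lock(vma->vm_mm, vmf->pmd, addr, &ptl);
                if (pte_none(*pte)) {
                        pte_t entry = mk_pte(page, vma->vm_page_prot);

                        /* prefaulted entries could be made old here, the real one young */
                        set_pte_at(vma->vm_mm, addr, pte, entry);
                        update_mmu_cache(vma, addr, pte);
                }
                pte_unmap_unlock(pte, ptl);

                return 0;
        }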
And then each of the callers would be a lot more understandable,
instead of having this insanely subtle "I need to be atomic, so I will
pre-allocate in one place, and then take the RCU lock in another place,
in order to call a _third_ place that is only atomic because of that
first step".

The other user of alloc_set_pte() is "finish_fault()", and honestly,
that's what alloc_set_pte() was written for, and why alloc_set_pte()
does all those things that filemap_map_pages() doesn't even want.

But while that other user does want what alloc_set_pte() does, there's
no downside to just moving those things into the caller. It would once
again just clarify things to make the allocation and the setting be
separate operations - even in that place where it doesn't have the same
kind of very subtle behavior with pre-allocation and atomicity.

I think the problem with alloc_set_pte() is that it has had many
different people working on it over the years, and Kirill massaged that
old use to fit the new pre-alloc use. Before the pre-faulting, I think
both cases were much closer to each other.

So I'm not really blaming anybody here, but the ugly and very
hard-to-follow rules seem to come from historical behavior that was
massaged over time to "just work" rather than from that function being
made more logical.

Kirill - I think you know this code best. Would you be willing to look
at splitting out that "allocate and map" from alloc_set_pte() and into
the callers?

At that point, I think the current very special and odd
do_fault_around() pre-allocation could be made into just a _regular_
"allocate the pmd if it doesn't exist". And then the pte locking could
be moved into filemap_map_pages(), and suddenly the semantics and rules
around all that would be a whole lot more obvious.

Pretty please?
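(Again only illustrative, with error propagation and the pgoff range
setup glossed over - the fault-around side might then shrink to little
more than:)

        /* Ordinary "make sure the page table exists" - no special
         * prealloc_pte dance needed any more. */
        if (pmd_none(*vmf->pmd) && __pte_alloc(vmf->vma->vm_mm, vmf->pmd))
                return VM_FAULT_OOM;

        /* ->map_pages() does its own pte locking from here on. */
        vmf->vma->vm_ops->map_pages(vmf, start_pgoff, end_pgoff);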
Linus