Date: Tue, 5 Feb 2019 12:17:50 -0800 (PST)
From: Hugh Dickins
To: Linus Torvalds
cc: Hugh Dickins, Artem Savkov, Baoquan He, Qian Cai, Andrea Arcangeli,
    Michal Hocko, Vlastimil Babka, Andrew Morton,
    Linux List Kernel Mailing, Linux-MM
Subject: Re: mm: race in put_and_wait_on_page_locked()
References: <20190204091300.GB13536@shodan.usersys.redhat.com>
User-Agent: Alpine 2.11 (LSU 23 2013-08-11)
MIME-Version: 1.0
Content-Type: TEXT/PLAIN; charset=US-ASCII

On Tue, 5 Feb 2019, Linus Torvalds wrote:
> On Mon, Feb 4, 2019 at 8:43 PM Hugh Dickins wrote:
> >
> > Something I shall not be doing, is verifying the correctness of the
> > low-level get_page_unless_zero() versus page_ref_freeze() protocol
> > on arm64 and power - nobody has reported on x86, and I do wonder if
> > there's a barrier missing somewhere, that could manifest in this way -
> > but I'm unlikely to be the one to find that (and also think that any
> > weakness there should have shown up long before now).
>
> Remind me what the page_ref_freeze() rules even _are_?
>
> It's a very special thing, setting the page count down to zero if it
> matches the "expected" count.
>
> Now, if another CPU does a put_page() at that point, that certainly
> will hit the "oops, we dropped the ref to something that was zero".
>
> So the "expected" count had better be only references we have and own
> 100%, but some of those references aren't really necessarily private
> to our thread.
>
> For example, what happens if
>
> (a) one CPU is doing migration_entry_wait() (counting expected page
> refs etc, before doing page_ref_freeze)

s/migration_entry_wait/migrate_page_move_mapping/

>
> (b) another CPU is dirtying a page that was in the swap cache and
> takes a reference to it, but drops it from the swap cache

This is reuse_swap_page() called from do_wp_page(), I presume.

>
> Note how (b) does not change the refcount on the page at all, because
> it just moves the ref-count from "swap cache entry" to "I own the page
> in my page tables". Which means that when (a) does the "count expected
> count, and match it", it happily matches, and the page_ref_freeze()
> succeeds and makes the page count be zero.
>
> But now (b) has a private reference to that page, and can drop it, so
> the "freeze" isn't a freeze at all.
>
> Ok, so clearly the above cannot happen, and there's something I'm
> missing with the freezing. I think we hold the page lock while this is
> going on, which means those two things cannot happen at the same time.
> But maybe there is something else that does the above kind of "move
> page ref from one owner to another"?

You're right that the page lock prevents even getting there (and is
essential whenever mucking around with PageSwapCache), but more to the
point is that the expected_count passed to page_ref_freeze() does not
include any user mapping references (mapcount). All user mappings (known
of at that instant) have been unmapped before migrate_page_move_mapping()
is called, and if any got added since (difficult without page lock, but I
wouldn't assume impossible), their associated page references are sure to
make the page_ref_freeze() fail (so long as the page refcounting has not
been broken).

reuse_swap_page() is called while holding the page table lock: so
although do_wp_page() cannot quite be said to own the page, it is making
sure that it cannot be racily unmapped at that point. So until the
pte_unmap_unlock() (by which time it has done its own get_page()) it can
treat the page reference associated with the user mapping as safe, as if
it were its own. And no racing page_ref_freeze() could succeed while
it's there, page lock or not.
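
To make that concrete, the two halves of the protocol look roughly like
this - a heavily simplified sketch, not the actual mm/migrate.c or
include/linux/pagemap.h code, and try_freeze_for_migration() and
try_grab_speculative() are names invented just for illustration:

/*
 * Simplified sketch of the freeze versus speculative-get protocol under
 * discussion; locking, the cache-tree update and huge page accounting
 * are all omitted.
 */
#include <linux/mm.h>
#include <linux/page_ref.h>

/* Migration side: succeed only if every reference is accounted for. */
static bool try_freeze_for_migration(struct page *page, int expected_count)
{
	/*
	 * page_ref_freeze() atomically replaces the refcount with 0, but
	 * only if it was exactly expected_count.  Any reference we did
	 * not account for - such as one backing a user mapping added
	 * behind our back - makes it fail, and the caller backs off.
	 */
	return page_ref_freeze(page, expected_count);
}

/* Lookup side: the pattern that get_page_unless_zero() supports. */
static bool try_grab_speculative(struct page *page)
{
	/*
	 * get_page_unless_zero() refuses to take a reference on a page
	 * whose refcount has been frozen to 0 (or already freed); on
	 * success the caller must still recheck that it got the page it
	 * wanted, and put_page() if not.
	 */
	return get_page_unless_zero(page);
}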
Page lock is more important at the "outer" level of the page migration
protocol: holding together the "atomic" switch from old to new page with
the copying of data and flags from old to new. And more important with
anon (swapless) pages, for which there's no shared visible cache, so
migrate_page_move_mapping() does not even bother with a page_ref_freeze()
(though sometimes I want it to).

>
> The page_ref_freeze() rules don't seem to be documented anywhere.

I would not enjoy documenting what has to be done at what stage in the
page migration sequence: it has evolved, it is subtle, and we're grateful
just to have working code.

At the inner level (where I worried we might have some barrier problem),
the relation between page_ref_freeze() and get_page_unless_zero(): the
best documentation I can think of on page_ref_freeze() indeed does not
mention it as such at all (I think it has gone through renamings): the
big comment Nick wrote above page_cache_get_speculative() in
include/linux/pagemap.h. And there's a helpful little comment in
include/linux/mm.h get_page() too.

(Ah, those innocent days, when the word "speculative" filled us with
delight and hope, instead of horror and dread.)

By the way, I won't put this bug out of my mind, until I've done an audit
of PagePrivate: I've a feeling that fs/iomap.c is not the first place to
forget that PagePrivate must be associated with a page reference (which
is not a special demand of page migration: page reclaim won't work
without it either); and put_and_wait_blah may now expose such
incorrectness as crashes.

Hugh
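
P.S. To spell out the PagePrivate convention I have in mind, here is a
sketch of the attach/detach pairing - not lifted from any particular
filesystem, and attach_private()/detach_private() are invented names,
though the helpers they call are the real ones:

#include <linux/mm.h>
#include <linux/page-flags.h>

/* Whoever sets PagePrivate holds a page reference on its behalf. */
static void attach_private(struct page *page, unsigned long data)
{
	get_page(page);			/* reference pinned for page->private */
	set_page_private(page, data);
	SetPagePrivate(page);
}

/* ...and drops that reference again when clearing PagePrivate. */
static unsigned long detach_private(struct page *page)
{
	unsigned long data = page_private(page);

	ClearPagePrivate(page);
	set_page_private(page, 0);
	put_page(page);			/* drop the reference taken at attach time */
	return data;
}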