From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: X-Spam-Checker-Version: SpamAssassin 3.4.0 (2014-02-07) on aws-us-west-2-korg-lkml-1.web.codeaurora.org X-Spam-Level: X-Spam-Status: No, score=-14.8 required=3.0 tests=BAYES_00,DKIMWL_WL_MED, DKIM_SIGNED,DKIM_VALID,DKIM_VALID_AU,HEADER_FROM_DIFFERENT_DOMAINS, MAILING_LIST_MULTI,SPF_HELO_NONE,SPF_PASS,USER_AGENT_SANE_1, USER_IN_DEF_DKIM_WL autolearn=no autolearn_force=no version=3.4.0 Received: from mail.kernel.org (mail.kernel.org [198.145.29.99]) by smtp.lore.kernel.org (Postfix) with ESMTP id 4E878C2D0E4 for ; Tue, 24 Nov 2020 06:34:50 +0000 (UTC) Received: from vger.kernel.org (vger.kernel.org [23.128.96.18]) by mail.kernel.org (Postfix) with ESMTP id DDA78206CA for ; Tue, 24 Nov 2020 06:34:49 +0000 (UTC) Authentication-Results: mail.kernel.org; dkim=pass (2048-bit key) header.d=google.com header.i=@google.com header.b="qhe1mkmW" Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1729808AbgKXGe3 (ORCPT ); Tue, 24 Nov 2020 01:34:29 -0500 Received: from lindbergh.monkeyblade.net ([23.128.96.19]:54006 "EHLO lindbergh.monkeyblade.net" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1726757AbgKXGe2 (ORCPT ); Tue, 24 Nov 2020 01:34:28 -0500 Received: from mail-oi1-x235.google.com (mail-oi1-x235.google.com [IPv6:2607:f8b0:4864:20::235]) by lindbergh.monkeyblade.net (Postfix) with ESMTPS id 16238C0613CF for ; Mon, 23 Nov 2020 22:34:28 -0800 (PST) Received: by mail-oi1-x235.google.com with SMTP id v202so19486547oia.9 for ; Mon, 23 Nov 2020 22:34:28 -0800 (PST) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=google.com; s=20161025; h=date:from:to:cc:subject:in-reply-to:message-id:references :user-agent:mime-version; bh=NRH6VONBYNIxV3zO4nNRr9V11VXjtGQdzP+hrNGAm3A=; b=qhe1mkmWMwhGuf8fqGpCpj+e8tRFPMtqWbixqm50z4cHkRXVBUsZ7CU1/j/3vZBSEx i02OxFKXk2LaAEF2fZz5tPHf9/q0ss9DbDxNVwhC7LKVfn+oQXptj0CFmw2v7eEfmSFJ M86CkeLLA/t8Wn0Nn9rrOWw8p4ZUls/7JJw8vEpy+N6au2YzdMd/T4PwcEHuveafA3sq zxAujieBdrx0MmJ2WiNkfYcYGM5h/HaN0/81OBBEE+PaVIft5zJvVBg0mxu7bj5mV/zF 0CCxa96dPF8hFwabQ+wMVg+QK5D/GRETSm2SsNTVmgVJ6yDhF5p3QGC7wnIvZr2Oad3A aq6g== X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=1e100.net; s=20161025; h=x-gm-message-state:date:from:to:cc:subject:in-reply-to:message-id :references:user-agent:mime-version; bh=NRH6VONBYNIxV3zO4nNRr9V11VXjtGQdzP+hrNGAm3A=; b=WbIp1HuMZuL1h0/MdyYDj3O6X0cHr0ZYutcJbhymD10ufn0O2eY1kbXzr9N5oiPGPf TCBYIpZ0giP6nDarxJwbihtJo4PQTY9Tw5qdGWA8oGd84N2LsXGIoTvT/vQGJoK52KEJ /nxWkFqQMUvo2LLE7GQQPMsXOQ+MPVjz2qvcO/SexQlZgPZFpfyRoSt8KRAfrx+CFBEs H0zQCruUi47yJouAP4McapAlY2BADadDevxb0fp6LNZP2uXnmiWcNjfSlTZd7Lpo+dew zpQ89QwpY/QEuXereJvxZBRb8HL2ZfyMI8nJjadklH8TDN89/GLICWQXi0F1iqOjrK8E d7cg== X-Gm-Message-State: AOAM5315EvZy8wbinTq4nRMhxaKBe3utYOMRz2P6o2+UITyqudTJPnsw KpTEFFiHf4KTLASQOkn6qDmlBw== X-Google-Smtp-Source: ABdhPJyQ+wMmSncaYJbGrsIp+MuAFAXtXoBkYvhGm11e160LsNcAa3QgtmSHaVmvUvzrXb/dS3G0lQ== X-Received: by 2002:aca:3087:: with SMTP id w129mr1737657oiw.78.1606199667073; Mon, 23 Nov 2020 22:34:27 -0800 (PST) Received: from eggly.attlocal.net (172-10-233-147.lightspeed.sntcca.sbcglobal.net. [172.10.233.147]) by smtp.gmail.com with ESMTPSA id m65sm8116677otm.40.2020.11.23.22.34.24 (version=TLS1 cipher=ECDHE-ECDSA-AES128-SHA bits=128/128); Mon, 23 Nov 2020 22:34:25 -0800 (PST) Date: Mon, 23 Nov 2020 22:34:12 -0800 (PST) From: Hugh Dickins X-X-Sender: hugh@eggly.anvils To: Linus Torvalds cc: Hugh Dickins , Jan Kara , syzbot , Andreas Dilger , Ext4 Developers List , Linux Kernel Mailing List , syzkaller-bugs , Theodore Ts'o , Linux-MM , Oleg Nesterov , Andrew Morton , "Kirill A. Shutemov" , Nicholas Piggin , Alex Shi , Qian Cai , Christoph Hellwig , "Darrick J. Wong" , Matthew Wilcox , William Kucharski , Jens Axboe , linux-fsdevel , linux-xfs Subject: Re: kernel BUG at fs/ext4/inode.c:LINE! In-Reply-To: Message-ID: References: <000000000000d3a33205add2f7b2@google.com> <20200828100755.GG7072@quack2.suse.cz> <20200831100340.GA26519@quack2.suse.cz> User-Agent: Alpine 2.11 (LSU 23 2013-08-11) MIME-Version: 1.0 Content-Type: TEXT/PLAIN; charset=US-ASCII Precedence: bulk List-ID: X-Mailing-List: linux-kernel@vger.kernel.org On Mon, 23 Nov 2020, Linus Torvalds wrote: > On Mon, Nov 23, 2020 at 8:07 PM Hugh Dickins wrote: > > > > The problem is that PageWriteback is not accompanied by a page reference > > (as the NOTE at the end of test_clear_page_writeback() acknowledges): as > > soon as TestClearPageWriteback has been done, that page could be removed > > from page cache, freed, and reused for something else by the time that > > wake_up_page() is reached. > > Ugh. > > Would it be possible to instead just make PageWriteback take the ref? > > I don't hate your patch per se, but looking at that long explanation, > and looking at the gyrations end_page_writeback() does, I go "why > don't we do that?" > > IOW, why couldn't we just make the __test_set_page_writeback() > increment the page count if the writeback flag wasn't already set, and > then make the end_page_writeback() do a put_page() after it all? Right, that should be a lot simpler, and will not require any of the cleanup (much as I liked that). If you're reasonably confident that adding the extra get_page+put_page to every writeback (instead of just to the waited case, which I presume significantly less common) will get lost in the noise - I was not confident of that, nor confident of devising realistic tests to decide it. What I did look into before sending, was whether in the filesystems there was a pattern of doing a put_page() after *set_page_writeback(), when it would just be a matter of deleting that put_page() and doing it instead at the end of end_page_writeback(). But no: there were a few cases like that, but in general no such pattern. Though, what I think I'll try is not quite what you suggest there, but instead do both get_page() and put_page() in end_page_writeback(). The reason being, there are a number of places (in mm at least) where we judge what to do by the expected refcount: places that know to add 1 on when PagePrivate is set (for buffers), but do not expect to add 1 on when PageWriteback is set. Now, all of those places probably have to have their own wait_on_page_writeback() too, but I'd rather narrow the window when the refcount is raised, than work through what if any change would be needed in those places. > > > > Then on crashing a second time, realized there's a stronger reason against > > that approach. If my testing just occasionally crashes on that check, > > when the page is reused for part of a compound page, wouldn't it be much > > more common for the page to get reused as an order-0 page before reaching > > wake_up_page()? And on rare occasions, might that reused page already be > > marked PageWriteback by its new user, and already be waited upon? What > > would that look like? > > > > It would look like BUG_ON(PageWriteback) after wait_on_page_writeback() > > in write_cache_pages() (though I have never seen that crash myself). > > So looking more at the patch, I started looking at this part: > > > + writeback = TestClearPageWriteback(page); > > + /* No need for smp_mb__after_atomic() after TestClear */ > > + waiters = PageWaiters(page); > > + if (waiters) { > > + /* > > + * Writeback doesn't hold a page reference on its own, relying > > + * on truncation to wait for the clearing of PG_writeback. > > + * We could safely wake_up_page_bit(page, PG_writeback) here, > > + * while holding i_pages lock: but that would be a poor choice > > + * if the page is on a long hash chain; so instead choose to > > + * get_page+put_page - though atomics will add some overhead. > > + */ > > + get_page(page); > > + } > > and thinking more about this, my first reaction was "but that has the > same race, just a smaller window". > > And then reading the comment more, I realize you relied on the i_pages > lock, and that this odd ordering was to avoid the possible latency. Yes. I decided to send the get_page+put_page variant, rather than the wake_up_page_bit while holding i_pages variant (also tested), in part because it's easier to edit the get_page+put_page one to the other. > > But what about the non-mapping case? I'm not sure how that happens, > but this does seem very fragile. I don't see how the non-mapping case would ever occur: I think it probably comes from a general pattern of caution about NULL mapping when akpm (I think) originally wrote these functions. > > I'm wondering why you didn't want to just do the get_page() > unconditionally and early. Is avoiding the refcount really such a big > optimization? I don't know: I trust your judgement more than mine. Hugh From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: X-Spam-Checker-Version: SpamAssassin 3.4.0 (2014-02-07) on aws-us-west-2-korg-lkml-1.web.codeaurora.org X-Spam-Level: X-Spam-Status: No, score=-14.8 required=3.0 tests=BAYES_00,DKIMWL_WL_MED, DKIM_SIGNED,DKIM_VALID,DKIM_VALID_AU,HEADER_FROM_DIFFERENT_DOMAINS, MAILING_LIST_MULTI,SPF_HELO_NONE,SPF_PASS,USER_AGENT_SANE_1, USER_IN_DEF_DKIM_WL autolearn=no autolearn_force=no version=3.4.0 Received: from mail.kernel.org (mail.kernel.org [198.145.29.99]) by smtp.lore.kernel.org (Postfix) with ESMTP id 0C98BC63798 for ; Tue, 24 Nov 2020 06:34:31 +0000 (UTC) Received: from kanga.kvack.org (kanga.kvack.org [205.233.56.17]) by mail.kernel.org (Postfix) with ESMTP id 440A1206FA for ; Tue, 24 Nov 2020 06:34:29 +0000 (UTC) Authentication-Results: mail.kernel.org; dkim=pass (2048-bit key) header.d=google.com header.i=@google.com header.b="qhe1mkmW" DMARC-Filter: OpenDMARC Filter v1.3.2 mail.kernel.org 440A1206FA Authentication-Results: mail.kernel.org; dmarc=fail (p=reject dis=none) header.from=google.com Authentication-Results: mail.kernel.org; spf=pass smtp.mailfrom=owner-linux-mm@kvack.org Received: by kanga.kvack.org (Postfix) id 4FF816B005D; Tue, 24 Nov 2020 01:34:29 -0500 (EST) Received: by kanga.kvack.org (Postfix, from userid 40) id 4AF976B006E; Tue, 24 Nov 2020 01:34:29 -0500 (EST) X-Delivered-To: int-list-linux-mm@kvack.org Received: by kanga.kvack.org (Postfix, from userid 63042) id 377A96B0070; Tue, 24 Nov 2020 01:34:29 -0500 (EST) X-Delivered-To: linux-mm@kvack.org Received: from forelay.hostedemail.com (smtprelay0130.hostedemail.com [216.40.44.130]) by kanga.kvack.org (Postfix) with ESMTP id 0A2936B005D for ; Tue, 24 Nov 2020 01:34:28 -0500 (EST) Received: from smtpin28.hostedemail.com (10.5.19.251.rfc1918.com [10.5.19.251]) by forelay05.hostedemail.com (Postfix) with ESMTP id A5FF3181AC9C6 for ; Tue, 24 Nov 2020 06:34:28 +0000 (UTC) X-FDA: 77518347816.28.star22_630665f2736b Received: from filter.hostedemail.com (10.5.16.251.rfc1918.com [10.5.16.251]) by smtpin28.hostedemail.com (Postfix) with ESMTP id 8C8CF6C26 for ; Tue, 24 Nov 2020 06:34:28 +0000 (UTC) X-HE-Tag: star22_630665f2736b X-Filterd-Recvd-Size: 8967 Received: from mail-oi1-f177.google.com (mail-oi1-f177.google.com [209.85.167.177]) by imf31.hostedemail.com (Postfix) with ESMTP for ; Tue, 24 Nov 2020 06:34:28 +0000 (UTC) Received: by mail-oi1-f177.google.com with SMTP id j15so17316969oih.4 for ; Mon, 23 Nov 2020 22:34:27 -0800 (PST) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=google.com; s=20161025; h=date:from:to:cc:subject:in-reply-to:message-id:references :user-agent:mime-version; bh=NRH6VONBYNIxV3zO4nNRr9V11VXjtGQdzP+hrNGAm3A=; b=qhe1mkmWMwhGuf8fqGpCpj+e8tRFPMtqWbixqm50z4cHkRXVBUsZ7CU1/j/3vZBSEx i02OxFKXk2LaAEF2fZz5tPHf9/q0ss9DbDxNVwhC7LKVfn+oQXptj0CFmw2v7eEfmSFJ M86CkeLLA/t8Wn0Nn9rrOWw8p4ZUls/7JJw8vEpy+N6au2YzdMd/T4PwcEHuveafA3sq zxAujieBdrx0MmJ2WiNkfYcYGM5h/HaN0/81OBBEE+PaVIft5zJvVBg0mxu7bj5mV/zF 0CCxa96dPF8hFwabQ+wMVg+QK5D/GRETSm2SsNTVmgVJ6yDhF5p3QGC7wnIvZr2Oad3A aq6g== X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=1e100.net; s=20161025; h=x-gm-message-state:date:from:to:cc:subject:in-reply-to:message-id :references:user-agent:mime-version; bh=NRH6VONBYNIxV3zO4nNRr9V11VXjtGQdzP+hrNGAm3A=; b=Xlvlau+uNbJK7/4WmFVevCF5zIZEFTaNr5Zpza8u7WMRDG6vLwXq8OgECIHTdOYlyl GgGwR8nfmN7L0gBs7Y3ObEEzIcbblq+/ENQfayVgXiJgENwi+A9NsrhJA5340N2OvFsW Aat1z9FFhQQuh6D+Hfb3BJSzVgmSzKcBgVQ762vn7EbT8uAKjYC92jGwN+DhlCLJMwK4 8Isge/xP5e5bOyfytWFwLevGTzF0noRYRGTTVM1qDUJnji4VCi1z6aRIwVLNpd+c+ZxL 49nld0BIBVwgrsxLT08Zn+FBpH/LG23q/sXwMT+IC5g2ULMPzz+4GNz+FiOL637bGxQm VNAA== X-Gm-Message-State: AOAM532ugJxN7Me/U0/bmzIgaCg/S7v1KxLnnokDFG1rmRz+f5q91sPW Oc+HKqAW86KVdydJFMhBzJzyEQ== X-Google-Smtp-Source: ABdhPJyQ+wMmSncaYJbGrsIp+MuAFAXtXoBkYvhGm11e160LsNcAa3QgtmSHaVmvUvzrXb/dS3G0lQ== X-Received: by 2002:aca:3087:: with SMTP id w129mr1737657oiw.78.1606199667073; Mon, 23 Nov 2020 22:34:27 -0800 (PST) Received: from eggly.attlocal.net (172-10-233-147.lightspeed.sntcca.sbcglobal.net. [172.10.233.147]) by smtp.gmail.com with ESMTPSA id m65sm8116677otm.40.2020.11.23.22.34.24 (version=TLS1 cipher=ECDHE-ECDSA-AES128-SHA bits=128/128); Mon, 23 Nov 2020 22:34:25 -0800 (PST) Date: Mon, 23 Nov 2020 22:34:12 -0800 (PST) From: Hugh Dickins X-X-Sender: hugh@eggly.anvils To: Linus Torvalds cc: Hugh Dickins , Jan Kara , syzbot , Andreas Dilger , Ext4 Developers List , Linux Kernel Mailing List , syzkaller-bugs , Theodore Ts'o , Linux-MM , Oleg Nesterov , Andrew Morton , "Kirill A. Shutemov" , Nicholas Piggin , Alex Shi , Qian Cai , Christoph Hellwig , "Darrick J. Wong" , Matthew Wilcox , William Kucharski , Jens Axboe , linux-fsdevel , linux-xfs Subject: Re: kernel BUG at fs/ext4/inode.c:LINE! In-Reply-To: Message-ID: References: <000000000000d3a33205add2f7b2@google.com> <20200828100755.GG7072@quack2.suse.cz> <20200831100340.GA26519@quack2.suse.cz> User-Agent: Alpine 2.11 (LSU 23 2013-08-11) MIME-Version: 1.0 Content-Type: TEXT/PLAIN; charset=US-ASCII X-Bogosity: Ham, tests=bogofilter, spamicity=0.000000, version=1.2.4 Sender: owner-linux-mm@kvack.org Precedence: bulk X-Loop: owner-majordomo@kvack.org List-ID: On Mon, 23 Nov 2020, Linus Torvalds wrote: > On Mon, Nov 23, 2020 at 8:07 PM Hugh Dickins wrote: > > > > The problem is that PageWriteback is not accompanied by a page reference > > (as the NOTE at the end of test_clear_page_writeback() acknowledges): as > > soon as TestClearPageWriteback has been done, that page could be removed > > from page cache, freed, and reused for something else by the time that > > wake_up_page() is reached. > > Ugh. > > Would it be possible to instead just make PageWriteback take the ref? > > I don't hate your patch per se, but looking at that long explanation, > and looking at the gyrations end_page_writeback() does, I go "why > don't we do that?" > > IOW, why couldn't we just make the __test_set_page_writeback() > increment the page count if the writeback flag wasn't already set, and > then make the end_page_writeback() do a put_page() after it all? Right, that should be a lot simpler, and will not require any of the cleanup (much as I liked that). If you're reasonably confident that adding the extra get_page+put_page to every writeback (instead of just to the waited case, which I presume significantly less common) will get lost in the noise - I was not confident of that, nor confident of devising realistic tests to decide it. What I did look into before sending, was whether in the filesystems there was a pattern of doing a put_page() after *set_page_writeback(), when it would just be a matter of deleting that put_page() and doing it instead at the end of end_page_writeback(). But no: there were a few cases like that, but in general no such pattern. Though, what I think I'll try is not quite what you suggest there, but instead do both get_page() and put_page() in end_page_writeback(). The reason being, there are a number of places (in mm at least) where we judge what to do by the expected refcount: places that know to add 1 on when PagePrivate is set (for buffers), but do not expect to add 1 on when PageWriteback is set. Now, all of those places probably have to have their own wait_on_page_writeback() too, but I'd rather narrow the window when the refcount is raised, than work through what if any change would be needed in those places. > > > > Then on crashing a second time, realized there's a stronger reason against > > that approach. If my testing just occasionally crashes on that check, > > when the page is reused for part of a compound page, wouldn't it be much > > more common for the page to get reused as an order-0 page before reaching > > wake_up_page()? And on rare occasions, might that reused page already be > > marked PageWriteback by its new user, and already be waited upon? What > > would that look like? > > > > It would look like BUG_ON(PageWriteback) after wait_on_page_writeback() > > in write_cache_pages() (though I have never seen that crash myself). > > So looking more at the patch, I started looking at this part: > > > + writeback = TestClearPageWriteback(page); > > + /* No need for smp_mb__after_atomic() after TestClear */ > > + waiters = PageWaiters(page); > > + if (waiters) { > > + /* > > + * Writeback doesn't hold a page reference on its own, relying > > + * on truncation to wait for the clearing of PG_writeback. > > + * We could safely wake_up_page_bit(page, PG_writeback) here, > > + * while holding i_pages lock: but that would be a poor choice > > + * if the page is on a long hash chain; so instead choose to > > + * get_page+put_page - though atomics will add some overhead. > > + */ > > + get_page(page); > > + } > > and thinking more about this, my first reaction was "but that has the > same race, just a smaller window". > > And then reading the comment more, I realize you relied on the i_pages > lock, and that this odd ordering was to avoid the possible latency. Yes. I decided to send the get_page+put_page variant, rather than the wake_up_page_bit while holding i_pages variant (also tested), in part because it's easier to edit the get_page+put_page one to the other. > > But what about the non-mapping case? I'm not sure how that happens, > but this does seem very fragile. I don't see how the non-mapping case would ever occur: I think it probably comes from a general pattern of caution about NULL mapping when akpm (I think) originally wrote these functions. > > I'm wondering why you didn't want to just do the get_page() > unconditionally and early. Is avoiding the refcount really such a big > optimization? I don't know: I trust your judgement more than mine. Hugh