linux-xfs.vger.kernel.org archive mirror
 help / color / mirror / Atom feed
* Re: kernel BUG at fs/ext4/inode.c:LINE!
       [not found]     ` <CAHk-=wivRS_1uy326sLqKuwerbL0APyKYKwa+vWVGsQg8sxhLw@mail.gmail.com>
@ 2020-11-24  4:07       ` Hugh Dickins
  2020-11-24  4:26         ` Linus Torvalds
                           ` (2 more replies)
  0 siblings, 3 replies; 16+ messages in thread
From: Hugh Dickins @ 2020-11-24  4:07 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Jan Kara, syzbot, Andreas Dilger, Ext4 Developers List,
	Linux Kernel Mailing List, syzkaller-bugs, Theodore Ts'o,
	Linux-MM, Oleg Nesterov, Andrew Morton, Kirill A. Shutemov,
	Nicholas Piggin, Alex Shi, Qian Cai, Christoph Hellwig,
	Darrick J. Wong, Matthew Wilcox, William Kucharski, Jens Axboe,
	linux-fsdevel, linux-xfs, Hugh Dickins

On Mon, 30 Aug 2020, Linus Torvalds wrote:
> On Mon, Aug 31, 2020 at 3:03 AM Jan Kara <jack@suse.cz> wrote:
> >
> > On Fri 28-08-20 12:07:55, Jan Kara wrote:
> > >
> > > Doh, so this is:
> > >
> > >                         wait_on_page_writeback(page);
> > > >>>                     BUG_ON(PageWriteback(page));
> > >
> > > in mpage_prepare_extent_to_map(). So we have PageWriteback() page after we
> > > have called wait_on_page_writeback() on a locked page. Not sure how this
> > > could ever happen even less how ext4 could cause this...
> >
> > I was poking a bit into this and there were actually recent changes into
> > page bit waiting logic by Linus. Linus, any idea?
> 
> So the main change is that now if somebody does a wake_up_page(), the
> page waiter will be released - even if somebody else then set the bit
> again (or possible if the waker never cleared it!).
> 
> It used to be that the waiter went back to sleep.
> 
> Which really shouldn't matter, but if we had any code that did something like
> 
>         end_page_writeback();
>         .. something does set_page_writeback() on the page again ..
> 
> then the old BUG_ON() would likely never have triggered (because the
> waiter would have seen the writeback bit being set again and gone back
> to sleep), but now it will.
> 
> So I would suspect a pre-existing issue that was just hidden by the
> old behavior and was basically impossible to trigger unless you hit
> *just* the right timing.
> 
> And now it's easy to trigger, because the first time somebody clears
> PG_writeback, the wait_on_page_writeback() will just return *without*
> re-testing and *without* going back to sleep.
> 
> Could there be somebody who does set_page_writeback() without holding
> the page lock?
> 
> Maybe adding a
> 
>         WARN_ON_ONCE(!PageLocked(page));
> 
> at the top of __test_set_page_writeback() might find something?
> 
> Note that it looks like this problem has been reported on Android
> before according to that syzbot thing. Ie, this thing:
> 
>     https://groups.google.com/g/syzkaller-android-bugs/c/2CfEdQd4EE0/m/xk_GRJEHBQAJ
> 
> looks very similar, and predates the wake_up_page() changes.
> 
> So it was probably just much _harder_ to hit before, and got easier to hit.
> 
> Hmm. In fact, googling for
> 
>         mpage_prepare_extent_to_map "kernel BUG"
> 
> seems to find stuff going back years. Here's a patchwork discussion
> where you had a debug patch to try to figure it out back in 2016:
> 
>     https://patchwork.ozlabs.org/project/linux-ext4/patch/20161122133452.GF3973@quack2.suse.cz/
> 
> although that one seems to be a different BUG_ON() in the same area.
> 
> Maybe entirely unrelated, but the fact that this function shows up a
> fair amount is perhaps a sign of some long-running issue..

No recent updates here, nor in
https://lore.kernel.org/linux-mm/37abe67a5a7d83b361932464b4af499fdeaf5ef7.camel@redhat.com/
but I believe I've found the answer (or an answer) to this issue.

You may not care for this patch, but I haven't thought of a better,
so let me explain in its commit message.  And its "Fixes:" tag is unfair
to your patch, sorry: I agree the issue has probably lurked there longer.

[PATCH] mm: fix VM_BUG_ON(PageTail) and BUG_ON(PageWriteback)

Twice now, when exercising ext4 looped on shmem huge pages, I have crashed
on the PF_ONLY_HEAD check inside PageWaiters(): ext4_finish_bio() calling
end_page_writeback() calling wake_up_page() on tail of a shmem huge page,
no longer an ext4 page at all.

The problem is that PageWriteback is not accompanied by a page reference
(as the NOTE at the end of test_clear_page_writeback() acknowledges): as
soon as TestClearPageWriteback has been done, that page could be removed
from page cache, freed, and reused for something else by the time that
wake_up_page() is reached.

https://lore.kernel.org/linux-mm/20200827122019.GC14765@casper.infradead.org/
Matthew Wilcox suggested avoiding or weakening the PageWaiters() tail
check; but I'm paranoid about even looking at an unreferenced struct page,
lest its memory might itself have already been reused or hotremoved (and
wake_up_page_bit() may modify that memory with its ClearPageWaiters()).

Then on crashing a second time, realized there's a stronger reason against
that approach.  If my testing just occasionally crashes on that check,
when the page is reused for part of a compound page, wouldn't it be much
more common for the page to get reused as an order-0 page before reaching
wake_up_page()?  And on rare occasions, might that reused page already be
marked PageWriteback by its new user, and already be waited upon?  What
would that look like?

It would look like BUG_ON(PageWriteback) after wait_on_page_writeback()
in write_cache_pages() (though I have never seen that crash myself).

And prior to 2a9127fcf229 ("mm: rewrite wait_on_page_bit_common() logic")
this would have been much less likely: before that, wake_page_function()'s
non-exclusive case would stop walking and not wake if it found Writeback
already set again; whereas now the non-exclusive case proceeds to wake.

I have not thought of a fix that does not add a little overhead.
It would be safe to wake_up_page() after TestClearPageWriteback()
while still holding i_pages lock (since that lock is needed to remove
the page from cache), but the history of long page lock hash chains
cautions against; so the patch below does get_page() when PageWaiters
there, and put_page() after wake_up_page_bit() at the end.  And in
any case, we do need the i_pages lock, even though it was skipped
before when !mapping_use_writeback_tags() i.e. swap.  Can mapping be
NULL? I don't see how, but allow for that with a WARN_ON_ONCE(): this
patch is no worse than before, but does not fix the issue if !mapping.

The bulk of the patch below is cleanup: it was not helpful to separate
test_clear_page_writeback() from end_page_writeback(), especially with
the latter declaring BUG() on a condition which the former was working
around: combine them into end_page_writeback() in mm/page-writeback.c.

Was there a chance of missed wakeups before, since a page freed before
reaching wake_up_page() would have PageWaiters cleared?  I think not,
because each waiter does hold a reference on the page: this bug comes
not from real waiters, but from when PageWaiters is a false positive.

Reported-by: syzbot+3622cea378100f45d59f@syzkaller.appspotmail.com
Reported-by: Qian Cai <cai@lca.pw>
Fixes: 2a9127fcf229 ("mm: rewrite wait_on_page_bit_common() logic")
Signed-off-by: Hugh Dickins <hughd@google.com>
Cc: stable@vger.kernel.org # v5.8+
---

 include/linux/page-flags.h |    1 
 include/linux/pagemap.h    |    1 
 kernel/sched/wait_bit.c    |    5 -
 mm/filemap.c               |   35 ------------
 mm/page-writeback.c        |   96 ++++++++++++++++++++++-------------
 5 files changed, 67 insertions(+), 71 deletions(-)

--- 5.10-rc5/include/linux/page-flags.h	2020-10-25 16:45:47.061817039 -0700
+++ linux/include/linux/page-flags.h	2020-11-22 18:31:21.303046924 -0800
@@ -550,7 +550,6 @@ static __always_inline void SetPageUptod
 
 CLEARPAGEFLAG(Uptodate, uptodate, PF_NO_TAIL)
 
-int test_clear_page_writeback(struct page *page);
 int __test_set_page_writeback(struct page *page, bool keep_write);
 
 #define test_set_page_writeback(page)			\
--- 5.10-rc5/include/linux/pagemap.h	2020-11-22 17:43:01.585279333 -0800
+++ linux/include/linux/pagemap.h	2020-11-22 18:31:21.303046924 -0800
@@ -660,6 +660,7 @@ static inline int lock_page_or_retry(str
  */
 extern void wait_on_page_bit(struct page *page, int bit_nr);
 extern int wait_on_page_bit_killable(struct page *page, int bit_nr);
+extern void wake_up_page_bit(struct page *page, int bit_nr);
 
 /* 
  * Wait for a page to be unlocked.
--- 5.10-rc5/kernel/sched/wait_bit.c	2020-03-29 15:25:41.000000000 -0700
+++ linux/kernel/sched/wait_bit.c	2020-11-22 18:31:21.303046924 -0800
@@ -90,9 +90,8 @@ __wait_on_bit_lock(struct wait_queue_hea
 			ret = action(&wbq_entry->key, mode);
 			/*
 			 * See the comment in prepare_to_wait_event().
-			 * finish_wait() does not necessarily takes wwq_head->lock,
-			 * but test_and_set_bit() implies mb() which pairs with
-			 * smp_mb__after_atomic() before wake_up_page().
+			 * finish_wait() does not necessarily take
+			 * wwq_head->lock, but test_and_set_bit() implies mb().
 			 */
 			if (ret)
 				finish_wait(wq_head, &wbq_entry->wq_entry);
--- 5.10-rc5/mm/filemap.c	2020-11-22 17:43:01.637279974 -0800
+++ linux/mm/filemap.c	2020-11-22 18:31:21.303046924 -0800
@@ -1093,7 +1093,7 @@ static int wake_page_function(wait_queue
 	return (flags & WQ_FLAG_EXCLUSIVE) != 0;
 }
 
-static void wake_up_page_bit(struct page *page, int bit_nr)
+void wake_up_page_bit(struct page *page, int bit_nr)
 {
 	wait_queue_head_t *q = page_waitqueue(page);
 	struct wait_page_key key;
@@ -1147,13 +1147,6 @@ static void wake_up_page_bit(struct page
 	spin_unlock_irqrestore(&q->lock, flags);
 }
 
-static void wake_up_page(struct page *page, int bit)
-{
-	if (!PageWaiters(page))
-		return;
-	wake_up_page_bit(page, bit);
-}
-
 /*
  * A choice of three behaviors for wait_on_page_bit_common():
  */
@@ -1466,32 +1459,6 @@ void unlock_page(struct page *page)
 }
 EXPORT_SYMBOL(unlock_page);
 
-/**
- * end_page_writeback - end writeback against a page
- * @page: the page
- */
-void end_page_writeback(struct page *page)
-{
-	/*
-	 * TestClearPageReclaim could be used here but it is an atomic
-	 * operation and overkill in this particular case. Failing to
-	 * shuffle a page marked for immediate reclaim is too mild to
-	 * justify taking an atomic operation penalty at the end of
-	 * ever page writeback.
-	 */
-	if (PageReclaim(page)) {
-		ClearPageReclaim(page);
-		rotate_reclaimable_page(page);
-	}
-
-	if (!test_clear_page_writeback(page))
-		BUG();
-
-	smp_mb__after_atomic();
-	wake_up_page(page, PG_writeback);
-}
-EXPORT_SYMBOL(end_page_writeback);
-
 /*
  * After completing I/O on a page, call this routine to update the page
  * flags appropriately
--- 5.10-rc5/mm/page-writeback.c	2020-10-25 16:45:47.977843485 -0700
+++ linux/mm/page-writeback.c	2020-11-22 18:31:21.303046924 -0800
@@ -589,7 +589,7 @@ static void wb_domain_writeout_inc(struc
 
 /*
  * Increment @wb's writeout completion count and the global writeout
- * completion count. Called from test_clear_page_writeback().
+ * completion count. Called from end_page_writeback().
  */
 static inline void __wb_writeout_inc(struct bdi_writeback *wb)
 {
@@ -2719,55 +2719,85 @@ int clear_page_dirty_for_io(struct page
 }
 EXPORT_SYMBOL(clear_page_dirty_for_io);
 
-int test_clear_page_writeback(struct page *page)
+/**
+ * end_page_writeback - end writeback against a page
+ * @page: the page
+ */
+void end_page_writeback(struct page *page)
 {
-	struct address_space *mapping = page_mapping(page);
+	struct address_space *mapping;
 	struct mem_cgroup *memcg;
 	struct lruvec *lruvec;
-	int ret;
+	unsigned long flags;
+	int writeback;
+	int waiters;
+
+	/*
+	 * TestClearPageReclaim could be used here but it is an atomic
+	 * operation and overkill in this particular case. Failing to
+	 * shuffle a page marked for immediate reclaim is too mild to
+	 * justify taking an atomic operation penalty at the end of
+	 * every page writeback.
+	 */
+	if (PageReclaim(page)) {
+		ClearPageReclaim(page);
+		rotate_reclaimable_page(page);
+	}
 
+	mapping = page_mapping(page);
+	WARN_ON_ONCE(!mapping);
 	memcg = lock_page_memcg(page);
 	lruvec = mem_cgroup_page_lruvec(page, page_pgdat(page));
+
+	dec_lruvec_state(lruvec, NR_WRITEBACK);
+	dec_zone_page_state(page, NR_ZONE_WRITE_PENDING);
+	inc_node_page_state(page, NR_WRITTEN);
+
+	if (mapping)
+		xa_lock_irqsave(&mapping->i_pages, flags);
+
+	writeback = TestClearPageWriteback(page);
+	/* No need for smp_mb__after_atomic() after TestClear */
+	waiters = PageWaiters(page);
+	if (waiters) {
+		/*
+		 * Writeback doesn't hold a page reference on its own, relying
+		 * on truncation to wait for the clearing of PG_writeback.
+		 * We could safely wake_up_page_bit(page, PG_writeback) here,
+		 * while holding i_pages lock: but that would be a poor choice
+		 * if the page is on a long hash chain; so instead choose to
+		 * get_page+put_page - though atomics will add some overhead.
+		 */
+		get_page(page);
+	}
+
 	if (mapping && mapping_use_writeback_tags(mapping)) {
 		struct inode *inode = mapping->host;
 		struct backing_dev_info *bdi = inode_to_bdi(inode);
-		unsigned long flags;
 
-		xa_lock_irqsave(&mapping->i_pages, flags);
-		ret = TestClearPageWriteback(page);
-		if (ret) {
-			__xa_clear_mark(&mapping->i_pages, page_index(page),
+		__xa_clear_mark(&mapping->i_pages, page_index(page),
 						PAGECACHE_TAG_WRITEBACK);
-			if (bdi->capabilities & BDI_CAP_WRITEBACK_ACCT) {
-				struct bdi_writeback *wb = inode_to_wb(inode);
+		if (bdi->capabilities & BDI_CAP_WRITEBACK_ACCT) {
+			struct bdi_writeback *wb = inode_to_wb(inode);
 
-				dec_wb_stat(wb, WB_WRITEBACK);
-				__wb_writeout_inc(wb);
-			}
+			dec_wb_stat(wb, WB_WRITEBACK);
+			__wb_writeout_inc(wb);
 		}
+		if (inode && !mapping_tagged(mapping, PAGECACHE_TAG_WRITEBACK))
+			sb_clear_inode_writeback(inode);
+	}
 
-		if (mapping->host && !mapping_tagged(mapping,
-						     PAGECACHE_TAG_WRITEBACK))
-			sb_clear_inode_writeback(mapping->host);
-
+	if (mapping)
 		xa_unlock_irqrestore(&mapping->i_pages, flags);
-	} else {
-		ret = TestClearPageWriteback(page);
-	}
-	/*
-	 * NOTE: Page might be free now! Writeback doesn't hold a page
-	 * reference on its own, it relies on truncation to wait for
-	 * the clearing of PG_writeback. The below can only access
-	 * page state that is static across allocation cycles.
-	 */
-	if (ret) {
-		dec_lruvec_state(lruvec, NR_WRITEBACK);
-		dec_zone_page_state(page, NR_ZONE_WRITE_PENDING);
-		inc_node_page_state(page, NR_WRITTEN);
-	}
 	__unlock_page_memcg(memcg);
-	return ret;
+
+	if (waiters) {
+		wake_up_page_bit(page, PG_writeback);
+		put_page(page);
+	}
+	BUG_ON(!writeback);
 }
+EXPORT_SYMBOL(end_page_writeback);
 
 int __test_set_page_writeback(struct page *page, bool keep_write)
 {

^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: kernel BUG at fs/ext4/inode.c:LINE!
  2020-11-24  4:07       ` kernel BUG at fs/ext4/inode.c:LINE! Hugh Dickins
@ 2020-11-24  4:26         ` Linus Torvalds
  2020-11-24  4:53         ` Linus Torvalds
  2020-11-24 12:19         ` Matthew Wilcox
  2 siblings, 0 replies; 16+ messages in thread
From: Linus Torvalds @ 2020-11-24  4:26 UTC (permalink / raw)
  To: Hugh Dickins
  Cc: Jan Kara, syzbot, Andreas Dilger, Ext4 Developers List,
	Linux Kernel Mailing List, syzkaller-bugs, Theodore Ts'o,
	Linux-MM, Oleg Nesterov, Andrew Morton, Kirill A. Shutemov,
	Nicholas Piggin, Alex Shi, Qian Cai, Christoph Hellwig,
	Darrick J. Wong, Matthew Wilcox, William Kucharski, Jens Axboe,
	linux-fsdevel, linux-xfs

On Mon, Nov 23, 2020 at 8:07 PM Hugh Dickins <hughd@google.com> wrote:
>
> The problem is that PageWriteback is not accompanied by a page reference
> (as the NOTE at the end of test_clear_page_writeback() acknowledges): as
> soon as TestClearPageWriteback has been done, that page could be removed
> from page cache, freed, and reused for something else by the time that
> wake_up_page() is reached.

Ugh.

Would it be possible to instead just make PageWriteback take the ref?

I don't hate your patch per se, but looking at that long explanation,
and looking at the gyrations end_page_writeback() does, I go "why
don't we do that?"

IOW, why couldn't we just make the __test_set_page_writeback()
increment the page count if the writeback flag wasn't already set, and
then make the end_page_writeback() do a put_page() after it all?

            Linus

^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: kernel BUG at fs/ext4/inode.c:LINE!
  2020-11-24  4:07       ` kernel BUG at fs/ext4/inode.c:LINE! Hugh Dickins
  2020-11-24  4:26         ` Linus Torvalds
@ 2020-11-24  4:53         ` Linus Torvalds
  2020-11-24  6:34           ` Hugh Dickins
  2020-11-24 12:19         ` Matthew Wilcox
  2 siblings, 1 reply; 16+ messages in thread
From: Linus Torvalds @ 2020-11-24  4:53 UTC (permalink / raw)
  To: Hugh Dickins
  Cc: Jan Kara, syzbot, Andreas Dilger, Ext4 Developers List,
	Linux Kernel Mailing List, syzkaller-bugs, Theodore Ts'o,
	Linux-MM, Oleg Nesterov, Andrew Morton, Kirill A. Shutemov,
	Nicholas Piggin, Alex Shi, Qian Cai, Christoph Hellwig,
	Darrick J. Wong, Matthew Wilcox, William Kucharski, Jens Axboe,
	linux-fsdevel, linux-xfs

On Mon, Nov 23, 2020 at 8:07 PM Hugh Dickins <hughd@google.com> wrote:
>
> Then on crashing a second time, realized there's a stronger reason against
> that approach.  If my testing just occasionally crashes on that check,
> when the page is reused for part of a compound page, wouldn't it be much
> more common for the page to get reused as an order-0 page before reaching
> wake_up_page()?  And on rare occasions, might that reused page already be
> marked PageWriteback by its new user, and already be waited upon?  What
> would that look like?
>
> It would look like BUG_ON(PageWriteback) after wait_on_page_writeback()
> in write_cache_pages() (though I have never seen that crash myself).

So looking more at the patch, I started looking at this part:

> +       writeback = TestClearPageWriteback(page);
> +       /* No need for smp_mb__after_atomic() after TestClear */
> +       waiters = PageWaiters(page);
> +       if (waiters) {
> +               /*
> +                * Writeback doesn't hold a page reference on its own, relying
> +                * on truncation to wait for the clearing of PG_writeback.
> +                * We could safely wake_up_page_bit(page, PG_writeback) here,
> +                * while holding i_pages lock: but that would be a poor choice
> +                * if the page is on a long hash chain; so instead choose to
> +                * get_page+put_page - though atomics will add some overhead.
> +                */
> +               get_page(page);
> +       }

and thinking more about this, my first reaction was "but that has the
same race, just a smaller window".

And then reading the comment more, I realize you relied on the i_pages
lock, and that this odd ordering was to avoid the possible latency.

But what about the non-mapping case? I'm not sure how that happens,
but this does seem very fragile.

I'm wondering why you didn't want to just do the get_page()
unconditionally and early. Is avoiding the refcount really such a big
optimization?

            Linus

^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: kernel BUG at fs/ext4/inode.c:LINE!
  2020-11-24  4:53         ` Linus Torvalds
@ 2020-11-24  6:34           ` Hugh Dickins
  2020-11-24 16:46             ` Hugh Dickins
  0 siblings, 1 reply; 16+ messages in thread
From: Hugh Dickins @ 2020-11-24  6:34 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Hugh Dickins, Jan Kara, syzbot, Andreas Dilger,
	Ext4 Developers List, Linux Kernel Mailing List, syzkaller-bugs,
	Theodore Ts'o, Linux-MM, Oleg Nesterov, Andrew Morton,
	Kirill A. Shutemov, Nicholas Piggin, Alex Shi, Qian Cai,
	Christoph Hellwig, Darrick J. Wong, Matthew Wilcox,
	William Kucharski, Jens Axboe, linux-fsdevel, linux-xfs

On Mon, 23 Nov 2020, Linus Torvalds wrote:
> On Mon, Nov 23, 2020 at 8:07 PM Hugh Dickins <hughd@google.com> wrote:
> >
> > The problem is that PageWriteback is not accompanied by a page reference
> > (as the NOTE at the end of test_clear_page_writeback() acknowledges): as
> > soon as TestClearPageWriteback has been done, that page could be removed
> > from page cache, freed, and reused for something else by the time that
> > wake_up_page() is reached.
> 
> Ugh.
> 
> Would it be possible to instead just make PageWriteback take the ref?
> 
> I don't hate your patch per se, but looking at that long explanation,
> and looking at the gyrations end_page_writeback() does, I go "why
> don't we do that?"
> 
> IOW, why couldn't we just make the __test_set_page_writeback()
> increment the page count if the writeback flag wasn't already set, and
> then make the end_page_writeback() do a put_page() after it all?

Right, that should be a lot simpler, and will not require any of the
cleanup (much as I liked that).  If you're reasonably confident that
adding the extra get_page+put_page to every writeback (instead of
just to the waited case, which I presume significantly less common)
will get lost in the noise - I was not confident of that, nor
confident of devising realistic tests to decide it.

What I did look into before sending, was whether in the filesystems
there was a pattern of doing a put_page() after *set_page_writeback(),
when it would just be a matter of deleting that put_page() and doing
it instead at the end of end_page_writeback().  But no: there were a
few cases like that, but in general no such pattern.

Though, what I think I'll try is not quite what you suggest there,
but instead do both get_page() and put_page() in end_page_writeback().
The reason being, there are a number of places (in mm at least) where
we judge what to do by the expected refcount: places that know to add
1 on when PagePrivate is set (for buffers), but do not expect to add
1 on when PageWriteback is set.  Now, all of those places probably
have to have their own wait_on_page_writeback() too, but I'd rather
narrow the window when the refcount is raised, than work through
what if any change would be needed in those places.

> >
> > Then on crashing a second time, realized there's a stronger reason against
> > that approach.  If my testing just occasionally crashes on that check,
> > when the page is reused for part of a compound page, wouldn't it be much
> > more common for the page to get reused as an order-0 page before reaching
> > wake_up_page()?  And on rare occasions, might that reused page already be
> > marked PageWriteback by its new user, and already be waited upon?  What
> > would that look like?
> >
> > It would look like BUG_ON(PageWriteback) after wait_on_page_writeback()
> > in write_cache_pages() (though I have never seen that crash myself).
> 
> So looking more at the patch, I started looking at this part:
> 
> > +       writeback = TestClearPageWriteback(page);
> > +       /* No need for smp_mb__after_atomic() after TestClear */
> > +       waiters = PageWaiters(page);
> > +       if (waiters) {
> > +               /*
> > +                * Writeback doesn't hold a page reference on its own, relying
> > +                * on truncation to wait for the clearing of PG_writeback.
> > +                * We could safely wake_up_page_bit(page, PG_writeback) here,
> > +                * while holding i_pages lock: but that would be a poor choice
> > +                * if the page is on a long hash chain; so instead choose to
> > +                * get_page+put_page - though atomics will add some overhead.
> > +                */
> > +               get_page(page);
> > +       }
> 
> and thinking more about this, my first reaction was "but that has the
> same race, just a smaller window".
> 
> And then reading the comment more, I realize you relied on the i_pages
> lock, and that this odd ordering was to avoid the possible latency.

Yes.  I decided to send the get_page+put_page variant, rather than the
wake_up_page_bit while holding i_pages variant (also tested), in part
because it's easier to edit the get_page+put_page one to the other.

> 
> But what about the non-mapping case? I'm not sure how that happens,
> but this does seem very fragile.

I don't see how the non-mapping case would ever occur: I think it
probably comes from a general pattern of caution about NULL mapping
when akpm (I think) originally wrote these functions.

> 
> I'm wondering why you didn't want to just do the get_page()
> unconditionally and early. Is avoiding the refcount really such a big
> optimization?

I don't know: I trust your judgement more than mine.

Hugh

^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: kernel BUG at fs/ext4/inode.c:LINE!
  2020-11-24  4:07       ` kernel BUG at fs/ext4/inode.c:LINE! Hugh Dickins
  2020-11-24  4:26         ` Linus Torvalds
  2020-11-24  4:53         ` Linus Torvalds
@ 2020-11-24 12:19         ` Matthew Wilcox
  2020-11-24 16:28           ` Hugh Dickins
  2020-11-25  9:20           ` Jan Kara
  2 siblings, 2 replies; 16+ messages in thread
From: Matthew Wilcox @ 2020-11-24 12:19 UTC (permalink / raw)
  To: Hugh Dickins
  Cc: Linus Torvalds, Jan Kara, syzbot, Andreas Dilger,
	Ext4 Developers List, Linux Kernel Mailing List, syzkaller-bugs,
	Theodore Ts'o, Linux-MM, Oleg Nesterov, Andrew Morton,
	Kirill A. Shutemov, Nicholas Piggin, Alex Shi, Qian Cai,
	Christoph Hellwig, Darrick J. Wong, William Kucharski,
	Jens Axboe, linux-fsdevel, linux-xfs

On Mon, Nov 23, 2020 at 08:07:24PM -0800, Hugh Dickins wrote:
> Twice now, when exercising ext4 looped on shmem huge pages, I have crashed
> on the PF_ONLY_HEAD check inside PageWaiters(): ext4_finish_bio() calling
> end_page_writeback() calling wake_up_page() on tail of a shmem huge page,
> no longer an ext4 page at all.
> 
> The problem is that PageWriteback is not accompanied by a page reference
> (as the NOTE at the end of test_clear_page_writeback() acknowledges): as
> soon as TestClearPageWriteback has been done, that page could be removed
> from page cache, freed, and reused for something else by the time that
> wake_up_page() is reached.
> 
> https://lore.kernel.org/linux-mm/20200827122019.GC14765@casper.infradead.org/
> Matthew Wilcox suggested avoiding or weakening the PageWaiters() tail
> check; but I'm paranoid about even looking at an unreferenced struct page,
> lest its memory might itself have already been reused or hotremoved (and
> wake_up_page_bit() may modify that memory with its ClearPageWaiters()).
> 
> Then on crashing a second time, realized there's a stronger reason against
> that approach.  If my testing just occasionally crashes on that check,
> when the page is reused for part of a compound page, wouldn't it be much
> more common for the page to get reused as an order-0 page before reaching
> wake_up_page()?  And on rare occasions, might that reused page already be
> marked PageWriteback by its new user, and already be waited upon?  What
> would that look like?
> 
> It would look like BUG_ON(PageWriteback) after wait_on_page_writeback()
> in write_cache_pages() (though I have never seen that crash myself).

I don't think this is it.  write_cache_pages() holds a reference to the
page -- indeed, it holds the page lock!  So this particular race cannot
cause the page to get recycled.  I still have no good ideas what this
is :-(

^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: kernel BUG at fs/ext4/inode.c:LINE!
  2020-11-24 12:19         ` Matthew Wilcox
@ 2020-11-24 16:28           ` Hugh Dickins
  2020-11-24 18:33             ` Matthew Wilcox
  2020-11-25  9:20           ` Jan Kara
  1 sibling, 1 reply; 16+ messages in thread
From: Hugh Dickins @ 2020-11-24 16:28 UTC (permalink / raw)
  To: Matthew Wilcox
  Cc: Hugh Dickins, Linus Torvalds, Jan Kara, syzbot, Andreas Dilger,
	Ext4 Developers List, Linux Kernel Mailing List, syzkaller-bugs,
	Theodore Ts'o, Linux-MM, Oleg Nesterov, Andrew Morton,
	Kirill A. Shutemov, Nicholas Piggin, Alex Shi, Qian Cai,
	Christoph Hellwig, Darrick J. Wong, William Kucharski,
	Jens Axboe, linux-fsdevel, linux-xfs

On Tue, 24 Nov 2020, Matthew Wilcox wrote:
> On Mon, Nov 23, 2020 at 08:07:24PM -0800, Hugh Dickins wrote:
> > 
> > Then on crashing a second time, realized there's a stronger reason against
> > that approach.  If my testing just occasionally crashes on that check,
> > when the page is reused for part of a compound page, wouldn't it be much
> > more common for the page to get reused as an order-0 page before reaching
> > wake_up_page()?  And on rare occasions, might that reused page already be
> > marked PageWriteback by its new user, and already be waited upon?  What
> > would that look like?
> > 
> > It would look like BUG_ON(PageWriteback) after wait_on_page_writeback()
> > in write_cache_pages() (though I have never seen that crash myself).
> 
> I don't think this is it.  write_cache_pages() holds a reference to the
> page -- indeed, it holds the page lock!  So this particular race cannot
> cause the page to get recycled.  I still have no good ideas what this
> is :-(

It is confusing. I tried to explain that in the final paragraph:

> > Was there a chance of missed wakeups before, since a page freed before
> > reaching wake_up_page() would have PageWaiters cleared?  I think not,
> > because each waiter does hold a reference on the page: this bug comes
> > not from real waiters, but from when PageWaiters is a false positive.

but got lost in between the original end_page_writeback() and the patched
version when writing that last part - false positive PageWaiters are not
relevant.  I'll try rewording that in the simpler version, following.

The BUG_ON(PageWriteback) would occur when the old use of the page, the
one we do TestClearPageWriteback on, had *no* waiters, so no additional
page reference beyond the page cache (and whoever racily frees it). The
reuse of the page definitely has a waiter holding a reference, as you
point out, and PageWriteback still set; but our belated wake_up_page()
has woken it to hit the BUG_ON.

Hugh

^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: kernel BUG at fs/ext4/inode.c:LINE!
  2020-11-24  6:34           ` Hugh Dickins
@ 2020-11-24 16:46             ` Hugh Dickins
  0 siblings, 0 replies; 16+ messages in thread
From: Hugh Dickins @ 2020-11-24 16:46 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Jan Kara, syzbot, Andreas Dilger, Ext4 Developers List,
	Linux Kernel Mailing List, syzkaller-bugs, Theodore Ts'o,
	Linux-MM, Oleg Nesterov, Andrew Morton, Kirill A. Shutemov,
	Nicholas Piggin, Alex Shi, Qian Cai, Christoph Hellwig,
	Darrick J. Wong, Matthew Wilcox, William Kucharski, Jens Axboe,
	linux-fsdevel, linux-xfs, Hugh Dickins

On Mon, 23 Nov 2020, Hugh Dickins wrote:
> On Mon, 23 Nov 2020, Linus Torvalds wrote:
> > 
> > IOW, why couldn't we just make the __test_set_page_writeback()
> > increment the page count if the writeback flag wasn't already set, and
> > then make the end_page_writeback() do a put_page() after it all?
> 
> Right, that should be a lot simpler, and will not require any of the
> cleanup (much as I liked that).  If you're reasonably confident that
> adding the extra get_page+put_page to every writeback (instead of
> just to the waited case, which I presume significantly less common)
> will get lost in the noise - I was not confident of that, nor
> confident of devising realistic tests to decide it.
> 
> What I did look into before sending, was whether in the filesystems
> there was a pattern of doing a put_page() after *set_page_writeback(),
> when it would just be a matter of deleting that put_page() and doing
> it instead at the end of end_page_writeback().  But no: there were a
> few cases like that, but in general no such pattern.
> 
> Though, what I think I'll try is not quite what you suggest there,
> but instead do both get_page() and put_page() in end_page_writeback().
> The reason being, there are a number of places (in mm at least) where
> we judge what to do by the expected refcount: places that know to add
> 1 on when PagePrivate is set (for buffers), but do not expect to add
> 1 on when PageWriteback is set.  Now, all of those places probably
> have to have their own wait_on_page_writeback() too, but I'd rather
> narrow the window when the refcount is raised, than work through
> what if any change would be needed in those places.

This ran fine overnight on several machines - just to check I hadn't
screwed it up.  Vanishingly unlikely to have hit either condition,
nor would I have noticed any difference in performance.

[PATCH] mm: fix VM_BUG_ON(PageTail) and BUG_ON(PageWriteback)

Twice now, when exercising ext4 looped on shmem huge pages, I have crashed
on the PF_ONLY_HEAD check inside PageWaiters(): ext4_finish_bio() calling
end_page_writeback() calling wake_up_page() on tail of a shmem huge page,
no longer an ext4 page at all.

The problem is that PageWriteback is not accompanied by a page reference
(as the NOTE at the end of test_clear_page_writeback() acknowledges): as
soon as TestClearPageWriteback has been done, that page could be removed
from page cache, freed, and reused for something else by the time that
wake_up_page() is reached.

https://lore.kernel.org/linux-mm/20200827122019.GC14765@casper.infradead.org/
Matthew Wilcox suggested avoiding or weakening the PageWaiters() tail
check; but I'm paranoid about even looking at an unreferenced struct page,
lest its memory might itself have already been reused or hotremoved (and
wake_up_page_bit() may modify that memory with its ClearPageWaiters()).

Then on crashing a second time, realized there's a stronger reason against
that approach.  If my testing just occasionally crashes on that check,
when the page is reused for part of a compound page, wouldn't it be much
more common for the page to get reused as an order-0 page before reaching
wake_up_page()?  And on rare occasions, might that reused page already be
marked PageWriteback by its new user, and already be waited upon?  What
would that look like?

It would look like BUG_ON(PageWriteback) after wait_on_page_writeback()
in write_cache_pages() (though I have never seen that crash myself).

And prior to 2a9127fcf229 ("mm: rewrite wait_on_page_bit_common() logic")
this would have been much less likely: before that, wake_page_function()'s
non-exclusive case would stop walking and not wake if it found Writeback
already set again; whereas now the non-exclusive case proceeds to wake.

I have not thought of a fix that does not add a little overhead: the
simplest fix is for end_page_writeback() to get_page() before calling
test_clear_page_writeback(), then put_page() after wake_up_page().

Was there a chance of missed wakeups before, since a page freed before
reaching wake_up_page() would have PageWaiters cleared?  I think not,
because each waiter does hold a reference on the page.  This bug comes
when the old use of the page, the one we do TestClearPageWriteback on,
had *no* waiters, so no additional page reference beyond the page cache
(and whoever racily freed it).  The reuse of the page has a waiter
holding a reference, and its own PageWriteback set; but the belated
wake_up_page() has woken the reuse to hit that BUG_ON(PageWriteback).

Reported-by: syzbot+3622cea378100f45d59f@syzkaller.appspotmail.com
Reported-by: Qian Cai <cai@lca.pw>
Fixes: 2a9127fcf229 ("mm: rewrite wait_on_page_bit_common() logic")
Signed-off-by: Hugh Dickins <hughd@google.com>
Cc: stable@vger.kernel.org # v5.8+
---

 mm/filemap.c        |    8 ++++++++
 mm/page-writeback.c |    6 ------
 2 files changed, 8 insertions(+), 6 deletions(-)

--- 5.10-rc5/mm/filemap.c	2020-11-22 17:43:01.637279974 -0800
+++ linux/mm/filemap.c	2020-11-23 23:08:20.141851113 -0800
@@ -1484,11 +1484,19 @@ void end_page_writeback(struct page *pag
 		rotate_reclaimable_page(page);
 	}
 
+	/*
+	 * Writeback does not hold a page reference of its own, relying
+	 * on truncation to wait for the clearing of PG_writeback.
+	 * But here we must make sure that the page is not freed and
+	 * reused before the wake_up_page().
+	 */
+	get_page(page);
 	if (!test_clear_page_writeback(page))
 		BUG();
 
 	smp_mb__after_atomic();
 	wake_up_page(page, PG_writeback);
+	put_page(page);
 }
 EXPORT_SYMBOL(end_page_writeback);
 
--- 5.10-rc5/mm/page-writeback.c	2020-10-25 16:45:47.977843485 -0700
+++ linux/mm/page-writeback.c	2020-11-23 23:08:20.141851113 -0800
@@ -2754,12 +2754,6 @@ int test_clear_page_writeback(struct pag
 	} else {
 		ret = TestClearPageWriteback(page);
 	}
-	/*
-	 * NOTE: Page might be free now! Writeback doesn't hold a page
-	 * reference on its own, it relies on truncation to wait for
-	 * the clearing of PG_writeback. The below can only access
-	 * page state that is static across allocation cycles.
-	 */
 	if (ret) {
 		dec_lruvec_state(lruvec, NR_WRITEBACK);
 		dec_zone_page_state(page, NR_ZONE_WRITE_PENDING);

^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: kernel BUG at fs/ext4/inode.c:LINE!
  2020-11-24 16:28           ` Hugh Dickins
@ 2020-11-24 18:33             ` Matthew Wilcox
  2020-11-24 19:00               ` Linus Torvalds
  0 siblings, 1 reply; 16+ messages in thread
From: Matthew Wilcox @ 2020-11-24 18:33 UTC (permalink / raw)
  To: Hugh Dickins
  Cc: Linus Torvalds, Jan Kara, syzbot, Andreas Dilger,
	Ext4 Developers List, Linux Kernel Mailing List, syzkaller-bugs,
	Theodore Ts'o, Linux-MM, Oleg Nesterov, Andrew Morton,
	Kirill A. Shutemov, Nicholas Piggin, Alex Shi, Qian Cai,
	Christoph Hellwig, Darrick J. Wong, William Kucharski,
	Jens Axboe, linux-fsdevel, linux-xfs

On Tue, Nov 24, 2020 at 08:28:16AM -0800, Hugh Dickins wrote:
> On Tue, 24 Nov 2020, Matthew Wilcox wrote:
> > On Mon, Nov 23, 2020 at 08:07:24PM -0800, Hugh Dickins wrote:
> > > 
> > > Then on crashing a second time, realized there's a stronger reason against
> > > that approach.  If my testing just occasionally crashes on that check,
> > > when the page is reused for part of a compound page, wouldn't it be much
> > > more common for the page to get reused as an order-0 page before reaching
> > > wake_up_page()?  And on rare occasions, might that reused page already be
> > > marked PageWriteback by its new user, and already be waited upon?  What
> > > would that look like?
> > > 
> > > It would look like BUG_ON(PageWriteback) after wait_on_page_writeback()
> > > in write_cache_pages() (though I have never seen that crash myself).
> > 
> > I don't think this is it.  write_cache_pages() holds a reference to the
> > page -- indeed, it holds the page lock!  So this particular race cannot
> > cause the page to get recycled.  I still have no good ideas what this
> > is :-(
> 
> It is confusing. I tried to explain that in the final paragraph:
> 
> > > Was there a chance of missed wakeups before, since a page freed before
> > > reaching wake_up_page() would have PageWaiters cleared?  I think not,
> > > because each waiter does hold a reference on the page: this bug comes
> > > not from real waiters, but from when PageWaiters is a false positive.
> 
> but got lost in between the original end_page_writeback() and the patched
> version when writing that last part - false positive PageWaiters are not
> relevant.  I'll try rewording that in the simpler version, following.
> 
> The BUG_ON(PageWriteback) would occur when the old use of the page, the
> one we do TestClearPageWriteback on, had *no* waiters, so no additional
> page reference beyond the page cache (and whoever racily frees it). The
> reuse of the page definitely has a waiter holding a reference, as you
> point out, and PageWriteback still set; but our belated wake_up_page()
> has woken it to hit the BUG_ON.

I ... think I see.  Let me try to write it out:

page is allocated, added to page cache, dirtied, writeback starts,

--- thread A ---
filesystem calls end_page_writeback()
	test_clear_page_writeback()
--- context switch to thread B ---
truncate_inode_pages_range() finds the page, it doesn't have writeback set,
we delete it from the page cache.  Page gets reallocated, dirtied, writeback
starts again.  Then we call write_cache_pages(), see
PageWriteback() set, call wait_on_page_writeback()
--- context switch back to thread A ---
wake_up_page(page, PG_writeback);
... thread B is woken, but because the wakeup was for the old use of
the page, PageWriteback is still set.

Devious.

We could fix this by turning that 'if' into a 'while' in
write_cache_pages().  Just accept that spurious wakeups can happen
and they're harmless.  We do need to remove that check of PageWaiters
in wake_up_page() -- as you say, we shouldn't be checking that after
dropping the reference.  I had patches to do that ..

https://lore.kernel.org/linux-mm/20200416220130.13343-1-willy@infradead.org/
specifically:
https://lore.kernel.org/linux-mm/20200416220130.13343-11-willy@infradead.org/

^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: kernel BUG at fs/ext4/inode.c:LINE!
  2020-11-24 18:33             ` Matthew Wilcox
@ 2020-11-24 19:00               ` Linus Torvalds
  2020-11-24 20:15                 ` Matthew Wilcox
  0 siblings, 1 reply; 16+ messages in thread
From: Linus Torvalds @ 2020-11-24 19:00 UTC (permalink / raw)
  To: Matthew Wilcox
  Cc: Hugh Dickins, Jan Kara, syzbot, Andreas Dilger,
	Ext4 Developers List, Linux Kernel Mailing List, syzkaller-bugs,
	Theodore Ts'o, Linux-MM, Oleg Nesterov, Andrew Morton,
	Kirill A. Shutemov, Nicholas Piggin, Alex Shi, Qian Cai,
	Christoph Hellwig, Darrick J. Wong, William Kucharski,
	Jens Axboe, linux-fsdevel, linux-xfs

On Tue, Nov 24, 2020 at 10:33 AM Matthew Wilcox <willy@infradead.org> wrote:
>
> We could fix this by turning that 'if' into a 'while' in
> write_cache_pages().

That might be the simplest patch indeed.

At the same time, I do worry about other cases like this: while
spurious wakeup events are normal and happen in other places, this is
a bit different.

This is literally a wakeup that leaks from a previous use of a page,
and makes us think that something could have happened to the new use.

The unlock_page() case presumably never hits that, because even if we
have some unlock without a page ref (which I don't think can happen,
but whatever..), the exclusive nature of "lock_page()" means that no
locker can care - once you get the lock, you own the page./

The writeback code is special in that the writeback bit isn't some
kind of exclusive bit, but this code kind of expected it to be that.

So I'd _like_ to have something like

        WARN_ON_ONCE(!page_count(page));

in the wake_up_page_bit() function, to catch things that wake up a
page that has already been released and might be reused..

And that would require the "get_page()" to be done when we set the
writeback bit and queue the page up for IO (so that then
end_page_writeback() would clear the bit, do the wakeup, and then drop
the ref).

Hugh's second patch isn't pretty - I think the "get_page()" is
conceptually in the wrong place - but it "works" in that it keeps that
"implicit page reference" being kept by the PG_writeback bit, and then
it takes an explicit page reference before it clears the bit.

So while I don't love the whole "PG_writeback is an implicit reference
to the page" model, Hugh's patch at least makes that model much more
straightforward: we really either have that PG_writeback, _or_ we have
a real ref to the page, and we never have that odd "we could actually
lose the page" situation.

So I think I prefer Hugh's two-liner over your one-liner suggestion.

But your one-liner is technically not just smaller, it obviously also
avoids the whole mucking with the atomic page ref.

I don't _think_ that the extra get/put overhead could possibly really
matter: doing the writeback is going to be a lot more expensive
anyway. And an atomic access to a 'struct page' sounds expensive, but
that cacheline is already likely dirty in the L1 cache because we've
touch page->flags and done other things to it).

So I'd personally be inclined to go with Hugh's patch. Comments?

                 Linus

^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: kernel BUG at fs/ext4/inode.c:LINE!
  2020-11-24 19:00               ` Linus Torvalds
@ 2020-11-24 20:15                 ` Matthew Wilcox
  2020-11-24 20:34                   ` Linus Torvalds
  0 siblings, 1 reply; 16+ messages in thread
From: Matthew Wilcox @ 2020-11-24 20:15 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Hugh Dickins, Jan Kara, syzbot, Andreas Dilger,
	Ext4 Developers List, Linux Kernel Mailing List, syzkaller-bugs,
	Theodore Ts'o, Linux-MM, Oleg Nesterov, Andrew Morton,
	Kirill A. Shutemov, Nicholas Piggin, Alex Shi, Qian Cai,
	Christoph Hellwig, Darrick J. Wong, William Kucharski,
	Jens Axboe, linux-fsdevel, linux-xfs

On Tue, Nov 24, 2020 at 11:00:42AM -0800, Linus Torvalds wrote:
> On Tue, Nov 24, 2020 at 10:33 AM Matthew Wilcox <willy@infradead.org> wrote:
> >
> > We could fix this by turning that 'if' into a 'while' in
> > write_cache_pages().
> 
> That might be the simplest patch indeed.
> 
> At the same time, I do worry about other cases like this: while
> spurious wakeup events are normal and happen in other places, this is
> a bit different.
> 
> This is literally a wakeup that leaks from a previous use of a page,
> and makes us think that something could have happened to the new use.
> 
> The unlock_page() case presumably never hits that, because even if we
> have some unlock without a page ref (which I don't think can happen,
> but whatever..), the exclusive nature of "lock_page()" means that no
> locker can care - once you get the lock, you own the page./
> 
> The writeback code is special in that the writeback bit isn't some
> kind of exclusive bit, but this code kind of expected it to be that.
> 
> So I'd _like_ to have something like
> 
>         WARN_ON_ONCE(!page_count(page));
> 
> in the wake_up_page_bit() function, to catch things that wake up a
> page that has already been released and might be reused..
> 
> And that would require the "get_page()" to be done when we set the
> writeback bit and queue the page up for IO (so that then
> end_page_writeback() would clear the bit, do the wakeup, and then drop
> the ref).
> 
> Hugh's second patch isn't pretty - I think the "get_page()" is
> conceptually in the wrong place - but it "works" in that it keeps that
> "implicit page reference" being kept by the PG_writeback bit, and then
> it takes an explicit page reference before it clears the bit.
> 
> So while I don't love the whole "PG_writeback is an implicit reference
> to the page" model, Hugh's patch at least makes that model much more
> straightforward: we really either have that PG_writeback, _or_ we have
> a real ref to the page, and we never have that odd "we could actually
> lose the page" situation.
> 
> So I think I prefer Hugh's two-liner over your one-liner suggestion.
> 
> But your one-liner is technically not just smaller, it obviously also
> avoids the whole mucking with the atomic page ref.
> 
> I don't _think_ that the extra get/put overhead could possibly really
> matter: doing the writeback is going to be a lot more expensive
> anyway. And an atomic access to a 'struct page' sounds expensive, but
> that cacheline is already likely dirty in the L1 cache because we've
> touch page->flags and done other things to it).
> 
> So I'd personally be inclined to go with Hugh's patch. Comments?

My only objection to Hugh's patch is that it may cause us to fail
to split pages when we can currently split them.  That is, we do:

	wait_on_page_writeback()
	if (page_has_private(page))
		do_invalidatepage(page, offset, length);
	split_huge_page()

(at least we do in my THP patchset; not sure if there's any of that
in the kernel today), and the extra reference held for a few nanoseconds
after calling wake_up_page() will cause us to fail to split the page.
It probably doesn't matter; there has to be a fallback path anyway.

Now I'm looking at that codepath, and the race that Hugh uncovered now
looks like a real bug.  Consider this sequence:

page allocated, added to page cache, dirtied, writeback started

--- thread A ---
end_page_writeback()
	test_clear_page_writeback
--- ctx switch to thread B ---
alloc page, add to page cache, dirty page, start page writeback,
truncate_inode_pages_range()
	wait_on_page_writeback()
--- ctx switch to thread A ---
	wake_up_page()
--- ctx switch to thread B ---
free page
alloc page
write new data to page

... now the DMA actually starts to do page writeback, and it's writing
the new data.

So my s/if/while/ suggestion is wrong and we need to do something to
prevent spurious wakeups.  Unless we bury the spurious wakeup logic
inside wait_on_page_writeback() ...

^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: kernel BUG at fs/ext4/inode.c:LINE!
  2020-11-24 20:15                 ` Matthew Wilcox
@ 2020-11-24 20:34                   ` Linus Torvalds
  2020-11-24 21:46                     ` Hugh Dickins
  0 siblings, 1 reply; 16+ messages in thread
From: Linus Torvalds @ 2020-11-24 20:34 UTC (permalink / raw)
  To: Matthew Wilcox
  Cc: Hugh Dickins, Jan Kara, syzbot, Andreas Dilger,
	Ext4 Developers List, Linux Kernel Mailing List, syzkaller-bugs,
	Theodore Ts'o, Linux-MM, Oleg Nesterov, Andrew Morton,
	Kirill A. Shutemov, Nicholas Piggin, Alex Shi, Qian Cai,
	Christoph Hellwig, Darrick J. Wong, William Kucharski,
	Jens Axboe, linux-fsdevel, linux-xfs

On Tue, Nov 24, 2020 at 12:16 PM Matthew Wilcox <willy@infradead.org> wrote:
>
> So my s/if/while/ suggestion is wrong and we need to do something to
> prevent spurious wakeups.  Unless we bury the spurious wakeup logic
> inside wait_on_page_writeback() ...

We can certainly make the "if()" in that loop be a "while()'.

That's basically what the old code did - simply by virtue of the
wakeup not happening if the writeback bit was set in
wake_page_function():

        if (test_bit(key->bit_nr, &key->page->flags))
                return -1;

of course, the race was still there - because the writeback bit might
be clear at that point, but another CPU would reallocate and dirty it,
and then autoremove_wake_function() would happen anyway.

But back in the bad old days, the wait_on_page_bit_common() code would
then double-check in a loop, so it would catch that case, re-insert
itself on the wait queue, and try again. Except for the DROP case,
which isn't used by writeback.

Anyway, making that "if()" be a "while()" in wait_on_page_writeback()
would basically re-introduce that old behavior. I don't really care,
because it was the lock bit that really mattered, the writeback bit is
not really all that interesting (except from a "let's fix this bug"
angle)

I'm not 100% sure I like the fragility of this writeback thing.

Anyway, I'm certainly happy with either model, whether it be an added
while() in wait_on_page_writeback(), or it be the page reference count
in end_page_writeback().

Strong opinions?

            Linus

^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: kernel BUG at fs/ext4/inode.c:LINE!
  2020-11-24 20:34                   ` Linus Torvalds
@ 2020-11-24 21:46                     ` Hugh Dickins
  2020-11-24 23:24                       ` Linus Torvalds
  0 siblings, 1 reply; 16+ messages in thread
From: Hugh Dickins @ 2020-11-24 21:46 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Matthew Wilcox, Hugh Dickins, Jan Kara, syzbot, Andreas Dilger,
	Ext4 Developers List, Linux Kernel Mailing List, syzkaller-bugs,
	Theodore Ts'o, Linux-MM, Oleg Nesterov, Andrew Morton,
	Kirill A. Shutemov, Nicholas Piggin, Alex Shi, Qian Cai,
	Christoph Hellwig, Darrick J. Wong, William Kucharski,
	Jens Axboe, linux-fsdevel, linux-xfs

On Tue, 24 Nov 2020, Linus Torvalds wrote:
> On Tue, Nov 24, 2020 at 12:16 PM Matthew Wilcox <willy@infradead.org> wrote:
> >
> > So my s/if/while/ suggestion is wrong and we need to do something to
> > prevent spurious wakeups.  Unless we bury the spurious wakeup logic
> > inside wait_on_page_writeback() ...
> 
> We can certainly make the "if()" in that loop be a "while()'.
> 
> That's basically what the old code did - simply by virtue of the
> wakeup not happening if the writeback bit was set in
> wake_page_function():
> 
>         if (test_bit(key->bit_nr, &key->page->flags))
>                 return -1;
> 
> of course, the race was still there - because the writeback bit might
> be clear at that point, but another CPU would reallocate and dirty it,
> and then autoremove_wake_function() would happen anyway.
> 
> But back in the bad old days, the wait_on_page_bit_common() code would
> then double-check in a loop, so it would catch that case, re-insert
> itself on the wait queue, and try again. Except for the DROP case,
> which isn't used by writeback.
> 
> Anyway, making that "if()" be a "while()" in wait_on_page_writeback()
> would basically re-introduce that old behavior. I don't really care,
> because it was the lock bit that really mattered, the writeback bit is
> not really all that interesting (except from a "let's fix this bug"
> angle)
> 
> I'm not 100% sure I like the fragility of this writeback thing.
> 
> Anyway, I'm certainly happy with either model, whether it be an added
> while() in wait_on_page_writeback(), or it be the page reference count
> in end_page_writeback().
> 
> Strong opinions?

Responding to "Strong opinions?" before having digested Matthew's
DMA sequence (no, not his DNA sequence).

I think it comes down to whether my paranoia (about accessing an
unreferenced struct page) is realistic or not: since I do hold
that paranoia, I do prefer (whatever variant of) my patch.

I'm not a memory hotremove guy. I did search mm/memory_hotplug.c
for references to rcu or stop_machine(), but found none.  I can
imagine that the memory containing the struct pages would be
located elsewhere than the memory itself, with some strong
barrier in between removals; but think there were patches posted
just a few days ago, with intent to allocate struct pages from
the same memory block.  It would be easy to forget this writeback
issue when hotremove advances, if we don't fix it properly now.

Another problem with the s/if/while/ solution: I think Matthew
pointed to another patch needed, to prevent wake_up_page_bit()
from doing an inappropriate ClearPageWaiters (I've not studied
that patch); and would also need a further patch to deal with
my PF_ONLY_HEAD VM_BUG_ON(PageTail).  More?

I think the unreferenced struct page asks for trouble.

Hugh

^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: kernel BUG at fs/ext4/inode.c:LINE!
  2020-11-24 21:46                     ` Hugh Dickins
@ 2020-11-24 23:24                       ` Linus Torvalds
  2020-11-25 21:30                         ` Linus Torvalds
  0 siblings, 1 reply; 16+ messages in thread
From: Linus Torvalds @ 2020-11-24 23:24 UTC (permalink / raw)
  To: Hugh Dickins
  Cc: Matthew Wilcox, Jan Kara, syzbot, Andreas Dilger,
	Ext4 Developers List, Linux Kernel Mailing List, syzkaller-bugs,
	Theodore Ts'o, Linux-MM, Oleg Nesterov, Andrew Morton,
	Kirill A. Shutemov, Nicholas Piggin, Alex Shi, Qian Cai,
	Christoph Hellwig, Darrick J. Wong, William Kucharski,
	Jens Axboe, linux-fsdevel, linux-xfs

On Tue, Nov 24, 2020 at 1:47 PM Hugh Dickins <hughd@google.com> wrote:
>
> I think the unreferenced struct page asks for trouble.

I do agree.

I've applied your second patch (the smaller one that just takes a ref
around the critical section). If somebody comes up with some great
alternative, we can always revisit this.

            Linus

^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: kernel BUG at fs/ext4/inode.c:LINE!
  2020-11-24 12:19         ` Matthew Wilcox
  2020-11-24 16:28           ` Hugh Dickins
@ 2020-11-25  9:20           ` Jan Kara
  1 sibling, 0 replies; 16+ messages in thread
From: Jan Kara @ 2020-11-25  9:20 UTC (permalink / raw)
  To: Matthew Wilcox
  Cc: Hugh Dickins, Linus Torvalds, Jan Kara, syzbot, Andreas Dilger,
	Ext4 Developers List, Linux Kernel Mailing List, syzkaller-bugs,
	Theodore Ts'o, Linux-MM, Oleg Nesterov, Andrew Morton,
	Kirill A. Shutemov, Nicholas Piggin, Alex Shi, Qian Cai,
	Christoph Hellwig, Darrick J. Wong, William Kucharski,
	Jens Axboe, linux-fsdevel, linux-xfs

On Tue 24-11-20 12:19:12, Matthew Wilcox wrote:
> On Mon, Nov 23, 2020 at 08:07:24PM -0800, Hugh Dickins wrote:
> > Twice now, when exercising ext4 looped on shmem huge pages, I have crashed
> > on the PF_ONLY_HEAD check inside PageWaiters(): ext4_finish_bio() calling
> > end_page_writeback() calling wake_up_page() on tail of a shmem huge page,
> > no longer an ext4 page at all.
> > 
> > The problem is that PageWriteback is not accompanied by a page reference
> > (as the NOTE at the end of test_clear_page_writeback() acknowledges): as
> > soon as TestClearPageWriteback has been done, that page could be removed
> > from page cache, freed, and reused for something else by the time that
> > wake_up_page() is reached.
> > 
> > https://lore.kernel.org/linux-mm/20200827122019.GC14765@casper.infradead.org/
> > Matthew Wilcox suggested avoiding or weakening the PageWaiters() tail
> > check; but I'm paranoid about even looking at an unreferenced struct page,
> > lest its memory might itself have already been reused or hotremoved (and
> > wake_up_page_bit() may modify that memory with its ClearPageWaiters()).
> > 
> > Then on crashing a second time, realized there's a stronger reason against
> > that approach.  If my testing just occasionally crashes on that check,
> > when the page is reused for part of a compound page, wouldn't it be much
> > more common for the page to get reused as an order-0 page before reaching
> > wake_up_page()?  And on rare occasions, might that reused page already be
> > marked PageWriteback by its new user, and already be waited upon?  What
> > would that look like?
> > 
> > It would look like BUG_ON(PageWriteback) after wait_on_page_writeback()
> > in write_cache_pages() (though I have never seen that crash myself).
> 
> I don't think this is it.  write_cache_pages() holds a reference to the
> page -- indeed, it holds the page lock!  So this particular race cannot
> cause the page to get recycled.  I still have no good ideas what this
> is :-(

But does it really matter what write_cache_pages() does? I mean we start
page writeback. I mean struct bio holds no reference to the page it writes.
The only thing that prevents the page from being freed under bio's hands is
PageWriteback bit. So when the bio is completing we do (e.g. in
ext4_end_bio()), we usually walk all pages in a bio
bio_for_each_segment_all() and for each page call end_page_writeback(), now
once end_page_writeback() calls test_clear_page_writeback() which clears
PageWriteback(), the page can get freed. And that can happen before the
wake_up_page() call in end_page_writeback(). So a race will be like:

CPU1					CPU2
ext4_end_bio()
  ...
  end_page_writeback(page)
    test_clear_page_writeback(page)
					free page
					reallocate page for something else
					we can even dirty & start to
					  writeback 'page'
    wake_up_page(page)

and we have a "spurious" wake up on 'page'.

								Honza
-- 
Jan Kara <jack@suse.com>
SUSE Labs, CR

^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: kernel BUG at fs/ext4/inode.c:LINE!
  2020-11-24 23:24                       ` Linus Torvalds
@ 2020-11-25 21:30                         ` Linus Torvalds
  2020-11-25 22:01                           ` Linus Torvalds
  0 siblings, 1 reply; 16+ messages in thread
From: Linus Torvalds @ 2020-11-25 21:30 UTC (permalink / raw)
  To: Hugh Dickins
  Cc: Matthew Wilcox, Jan Kara, syzbot, Andreas Dilger,
	Ext4 Developers List, Linux Kernel Mailing List, syzkaller-bugs,
	Theodore Ts'o, Linux-MM, Oleg Nesterov, Andrew Morton,
	Kirill A. Shutemov, Nicholas Piggin, Alex Shi, Qian Cai,
	Christoph Hellwig, Darrick J. Wong, William Kucharski,
	Jens Axboe, linux-fsdevel, linux-xfs

On Tue, Nov 24, 2020 at 3:24 PM Linus Torvalds
<torvalds@linux-foundation.org> wrote:
>
> I've applied your second patch (the smaller one that just takes a ref
> around the critical section). If somebody comes up with some great
> alternative, we can always revisit this.

Hmm.

I'm not sure about "great alternative", but it strikes me that we
*could* move the clearing of the PG_writeback bit _into_
wake_up_page_bit(), under the page waitqueue lock.

IOW, we could make the rule be that the bit isn't actually cleared
before calling wake_up_page() at all, and we'd clear it with something
like

    unsigned long flags = READ_ONCE(page->flags);

    // We can clear PG_writeback directly if PG_waiters isn't set
    while (!(flags & (1ul << PG_waiters))) {
        unsigned long new = flags & ~(1ul << PG_writeback);
        // PG_writeback was already clear??!!?
        if (WARN_ON_ONCE(new == flags))
            return;
        new = cmpxchg(&page->flags, flags, new);
        if (likely(flags == new))
            return;
        flags = new;
    }

    // Otherwise, clear the bit at the end - but under the
    // page waitqueue lock - inside wake_up_page_bit()
    return wake_up_page_bit(..);

instead.

That would basically make the bit clearing atomic wrt the PG_waiters
flags - either using that atomic cmpxchg, or by doing it under the
page queue lock so that it's atomic wrt any new waiters.

This seems conceptually like the right thing to do - and if would also
make the (fair) exclusive lock hand-off case atomic too, because the
bit we're waking up on would never be cleared if it gets handed off
directly.

The above is entirely untested crap written in my MUA, and obviously
requires that all callers of wake_up_page() be moved to that new world
order, but I think we only have two cases: unlock_page() and
end_page_writeback().

And unlock_page() already has that
"clear_bit_unlock_is_negative_byte()" special case that is an ugly
special case of PG_waiters atomicity. So we'd get rid of that, because
the cmpxchg loop would be the better model.

I'm not sure I'm willing to write and test the real patch, but it
doesn't look _too_ nasty from just looking at the code. The bookmark
thing makes it important to only actually clear the bit at the end (as
does the handoff case anyway), but the way wake_up_page_bit() is
written, that's actually very straightforward - just after the
while-loop. That's when we've woken up everybody.

So I'm sending this idea out to see if somebody can shoot it down, or
even wants to possibly even try to do it..

                Linus

^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: kernel BUG at fs/ext4/inode.c:LINE!
  2020-11-25 21:30                         ` Linus Torvalds
@ 2020-11-25 22:01                           ` Linus Torvalds
  0 siblings, 0 replies; 16+ messages in thread
From: Linus Torvalds @ 2020-11-25 22:01 UTC (permalink / raw)
  To: Hugh Dickins
  Cc: Matthew Wilcox, Jan Kara, syzbot, Andreas Dilger,
	Ext4 Developers List, Linux Kernel Mailing List, syzkaller-bugs,
	Theodore Ts'o, Linux-MM, Oleg Nesterov, Andrew Morton,
	Kirill A. Shutemov, Nicholas Piggin, Alex Shi, Qian Cai,
	Christoph Hellwig, Darrick J. Wong, William Kucharski,
	Jens Axboe, linux-fsdevel, linux-xfs

On Wed, Nov 25, 2020 at 1:30 PM Linus Torvalds
<torvalds@linux-foundation.org> wrote:
>
> I'm not sure I'm willing to write and test the real patch, but it
> doesn't look _too_ nasty from just looking at the code. The bookmark
> thing makes it important to only actually clear the bit at the end (as
> does the handoff case anyway), but the way wake_up_page_bit() is
> written, that's actually very straightforward - just after the
> while-loop. That's when we've woken up everybody.

Actually, there's a problem. We don't know if we've done the hand-off
or not, so we don't know if we should clear the bit after waking
everybody up or not.

We set that WQ_FLAG_DONE bit for the hand-0off case, but only the
woken party sees that - the waker itself doesn't know about it (and we
have no good way to return it in that call chain: wake_up_page_bit ->
__wake_up_locked_key_bookmark -> __wake_up_common ->
wake_page_function().

We could easily hide the flag in the "bookmark" wait queue entry, but
that smells a bit hacky to me.

So I don't think it's worth it, unless somebody really wants to give it a try.

But if it turns out that the page ref change from Hugh causes some
unexpected problem, we do have this model as a backup.

            Linus

^ permalink raw reply	[flat|nested] 16+ messages in thread

end of thread, other threads:[~2020-11-25 22:09 UTC | newest]

Thread overview: 16+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
     [not found] <000000000000d3a33205add2f7b2@google.com>
     [not found] ` <20200828100755.GG7072@quack2.suse.cz>
     [not found]   ` <20200831100340.GA26519@quack2.suse.cz>
     [not found]     ` <CAHk-=wivRS_1uy326sLqKuwerbL0APyKYKwa+vWVGsQg8sxhLw@mail.gmail.com>
2020-11-24  4:07       ` kernel BUG at fs/ext4/inode.c:LINE! Hugh Dickins
2020-11-24  4:26         ` Linus Torvalds
2020-11-24  4:53         ` Linus Torvalds
2020-11-24  6:34           ` Hugh Dickins
2020-11-24 16:46             ` Hugh Dickins
2020-11-24 12:19         ` Matthew Wilcox
2020-11-24 16:28           ` Hugh Dickins
2020-11-24 18:33             ` Matthew Wilcox
2020-11-24 19:00               ` Linus Torvalds
2020-11-24 20:15                 ` Matthew Wilcox
2020-11-24 20:34                   ` Linus Torvalds
2020-11-24 21:46                     ` Hugh Dickins
2020-11-24 23:24                       ` Linus Torvalds
2020-11-25 21:30                         ` Linus Torvalds
2020-11-25 22:01                           ` Linus Torvalds
2020-11-25  9:20           ` Jan Kara

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).