Linux-BTRFS Archive on lore.kernel.org
 help / color / Atom feed
* [PATCH] btrfs: save i_size in compress_file_range
@ 2019-10-11 13:03 Josef Bacik
  2019-10-11 15:28 ` Filipe Manana
                   ` (2 more replies)
  0 siblings, 3 replies; 6+ messages in thread
From: Josef Bacik @ 2019-10-11 13:03 UTC (permalink / raw)
  To: linux-btrfs, kernel-team; +Cc: stable

We hit a regression while rolling out 5.2 internally where we were
hitting the following panic

kernel BUG at mm/page-writeback.c:2659!
RIP: 0010:clear_page_dirty_for_io+0xe6/0x1f0
Call Trace:
 __process_pages_contig+0x25a/0x350
 ? extent_clear_unlock_delalloc+0x43/0x70
 submit_compressed_extents+0x359/0x4d0
 normal_work_helper+0x15a/0x330
 process_one_work+0x1f5/0x3f0
 worker_thread+0x2d/0x3d0
 ? rescuer_thread+0x340/0x340
 kthread+0x111/0x130
 ? kthread_create_on_node+0x60/0x60
 ret_from_fork+0x1f/0x30

this is happening because the page is not locked when doing
clear_page_dirty_for_io.  Looking at the core dump it was because our
async_extent had a ram_size of 24576 but our async_chunk range only
spanned 20480, so we had a whole extra page in our ram_size for our
async_extent.

This happened because we try not to compress pages outside of our
i_size, however a cleanup patch changed us to do

actual_end = min_t(u64, i_size_read(inode), end + 1);

which is problematic because i_size_read() can evaluate to different
values in between checking and assigning.  So either a expanding
truncate or a fallocate could increase our i_size while we're doing
writeout and actual_end would end up being past the range we have
locked.

I confirmed this was what was happening by installing a debug kernel
that had

actual_end = min_t(u64, i_size_read(inode), end + 1);
if (actual_end > end + 1) {
	printk(KERN_ERR "WE GOT FUCKED\n");
	actual_end = end + 1;
}

and installing it onto 500 boxes of the tier that had been seeing the
problem regularly.  Last night I got my debug message and no panic,
confirming what I expected.

Fixes: 62b37622718c ("btrfs: Remove isize local variable in compress_file_range")
cc: stable@vger.kernel.org
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
---
 fs/btrfs/inode.c | 9 ++++++++-
 1 file changed, 8 insertions(+), 1 deletion(-)

diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c
index 2eb1d7249f83..9a483d1f61f8 100644
--- a/fs/btrfs/inode.c
+++ b/fs/btrfs/inode.c
@@ -474,6 +474,7 @@ static noinline int compress_file_range(struct async_chunk *async_chunk)
 	u64 start = async_chunk->start;
 	u64 end = async_chunk->end;
 	u64 actual_end;
+	loff_t i_size = i_size_read(inode);
 	int ret = 0;
 	struct page **pages = NULL;
 	unsigned long nr_pages;
@@ -488,7 +489,13 @@ static noinline int compress_file_range(struct async_chunk *async_chunk)
 	inode_should_defrag(BTRFS_I(inode), start, end, end - start + 1,
 			SZ_16K);
 
-	actual_end = min_t(u64, i_size_read(inode), end + 1);
+	/*
+	 * We need to save i_size before now because it could change in between
+	 * us evaluating the size and assigning it.  This is because we lock and
+	 * unlock the page in truncate and fallocate, and then modify the i_size
+	 * later on.
+	 */
+	actual_end = min_t(u64, i_size, end + 1);
 again:
 	will_compress = 0;
 	nr_pages = (end >> PAGE_SHIFT) - (start >> PAGE_SHIFT) + 1;
-- 
2.21.0


^ permalink raw reply	[flat|nested] 6+ messages in thread

* Re: [PATCH] btrfs: save i_size in compress_file_range
  2019-10-11 13:03 [PATCH] btrfs: save i_size in compress_file_range Josef Bacik
@ 2019-10-11 15:28 ` Filipe Manana
  2019-10-11 17:15   ` Josef Bacik
  2019-10-14 11:52 ` David Sterba
  2019-10-23 16:51 ` David Sterba
  2 siblings, 1 reply; 6+ messages in thread
From: Filipe Manana @ 2019-10-11 15:28 UTC (permalink / raw)
  To: Josef Bacik; +Cc: linux-btrfs, kernel-team, stable

On Fri, Oct 11, 2019 at 2:05 PM Josef Bacik <josef@toxicpanda.com> wrote:
>
> We hit a regression while rolling out 5.2 internally where we were
> hitting the following panic
>
> kernel BUG at mm/page-writeback.c:2659!
> RIP: 0010:clear_page_dirty_for_io+0xe6/0x1f0
> Call Trace:
>  __process_pages_contig+0x25a/0x350
>  ? extent_clear_unlock_delalloc+0x43/0x70
>  submit_compressed_extents+0x359/0x4d0
>  normal_work_helper+0x15a/0x330
>  process_one_work+0x1f5/0x3f0
>  worker_thread+0x2d/0x3d0
>  ? rescuer_thread+0x340/0x340
>  kthread+0x111/0x130
>  ? kthread_create_on_node+0x60/0x60
>  ret_from_fork+0x1f/0x30
>
> this is happening because the page is not locked when doing
> clear_page_dirty_for_io.  Looking at the core dump it was because our
> async_extent had a ram_size of 24576 but our async_chunk range only
> spanned 20480, so we had a whole extra page in our ram_size for our
> async_extent.
>
> This happened because we try not to compress pages outside of our
> i_size, however a cleanup patch changed us to do
>
> actual_end = min_t(u64, i_size_read(inode), end + 1);
>
> which is problematic because i_size_read() can evaluate to different
> values in between checking and assigning.  So either a expanding
> truncate or a fallocate could increase our i_size while we're doing

Well, not just those operations but a write starting at or beyond eof
could also do it,
since at writeback time we don't hold the inode's lock and the
buffered write path doesn't lock the range if it starts at or past
current i_size.

> writeout and actual_end would end up being past the range we have
> locked.
>
> I confirmed this was what was happening by installing a debug kernel
> that had
>
> actual_end = min_t(u64, i_size_read(inode), end + 1);
> if (actual_end > end + 1) {
>         printk(KERN_ERR "WE GOT FUCKED\n");

I think we coud be more polite on changelogs and mailing lists, or we
could add a tag in the subjects warning about improper content like
video games and music records :)

>         actual_end = end + 1;
> }
>
> and installing it onto 500 boxes of the tier that had been seeing the
> problem regularly.  Last night I got my debug message and no panic,
> confirming what I expected.
>
> Fixes: 62b37622718c ("btrfs: Remove isize local variable in compress_file_range")
> cc: stable@vger.kernel.org
> Signed-off-by: Josef Bacik <josef@toxicpanda.com>

Anyway, looks good to me.

Reviewed-by: Filipe Manana <fdmanana@suse.com>

> ---
>  fs/btrfs/inode.c | 9 ++++++++-
>  1 file changed, 8 insertions(+), 1 deletion(-)
>
> diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c
> index 2eb1d7249f83..9a483d1f61f8 100644
> --- a/fs/btrfs/inode.c
> +++ b/fs/btrfs/inode.c
> @@ -474,6 +474,7 @@ static noinline int compress_file_range(struct async_chunk *async_chunk)
>         u64 start = async_chunk->start;
>         u64 end = async_chunk->end;
>         u64 actual_end;
> +       loff_t i_size = i_size_read(inode);
>         int ret = 0;
>         struct page **pages = NULL;
>         unsigned long nr_pages;
> @@ -488,7 +489,13 @@ static noinline int compress_file_range(struct async_chunk *async_chunk)
>         inode_should_defrag(BTRFS_I(inode), start, end, end - start + 1,
>                         SZ_16K);
>
> -       actual_end = min_t(u64, i_size_read(inode), end + 1);
> +       /*
> +        * We need to save i_size before now because it could change in between
> +        * us evaluating the size and assigning it.  This is because we lock and
> +        * unlock the page in truncate and fallocate, and then modify the i_size
> +        * later on.
> +        */
> +       actual_end = min_t(u64, i_size, end + 1);
>  again:
>         will_compress = 0;
>         nr_pages = (end >> PAGE_SHIFT) - (start >> PAGE_SHIFT) + 1;
> --
> 2.21.0
>


-- 
Filipe David Manana,

“Whether you think you can, or you think you can't — you're right.”

^ permalink raw reply	[flat|nested] 6+ messages in thread

* Re: [PATCH] btrfs: save i_size in compress_file_range
  2019-10-11 15:28 ` Filipe Manana
@ 2019-10-11 17:15   ` Josef Bacik
  0 siblings, 0 replies; 6+ messages in thread
From: Josef Bacik @ 2019-10-11 17:15 UTC (permalink / raw)
  To: Filipe Manana; +Cc: Josef Bacik, linux-btrfs, kernel-team, stable

On Fri, Oct 11, 2019 at 04:28:28PM +0100, Filipe Manana wrote:
> On Fri, Oct 11, 2019 at 2:05 PM Josef Bacik <josef@toxicpanda.com> wrote:
> >
> > We hit a regression while rolling out 5.2 internally where we were
> > hitting the following panic
> >
> > kernel BUG at mm/page-writeback.c:2659!
> > RIP: 0010:clear_page_dirty_for_io+0xe6/0x1f0
> > Call Trace:
> >  __process_pages_contig+0x25a/0x350
> >  ? extent_clear_unlock_delalloc+0x43/0x70
> >  submit_compressed_extents+0x359/0x4d0
> >  normal_work_helper+0x15a/0x330
> >  process_one_work+0x1f5/0x3f0
> >  worker_thread+0x2d/0x3d0
> >  ? rescuer_thread+0x340/0x340
> >  kthread+0x111/0x130
> >  ? kthread_create_on_node+0x60/0x60
> >  ret_from_fork+0x1f/0x30
> >
> > this is happening because the page is not locked when doing
> > clear_page_dirty_for_io.  Looking at the core dump it was because our
> > async_extent had a ram_size of 24576 but our async_chunk range only
> > spanned 20480, so we had a whole extra page in our ram_size for our
> > async_extent.
> >
> > This happened because we try not to compress pages outside of our
> > i_size, however a cleanup patch changed us to do
> >
> > actual_end = min_t(u64, i_size_read(inode), end + 1);
> >
> > which is problematic because i_size_read() can evaluate to different
> > values in between checking and assigning.  So either a expanding
> > truncate or a fallocate could increase our i_size while we're doing
> 
> Well, not just those operations but a write starting at or beyond eof
> could also do it,
> since at writeback time we don't hold the inode's lock and the
> buffered write path doesn't lock the range if it starts at or past
> current i_size.
>

Yeah it was just for example.

> > writeout and actual_end would end up being past the range we have
> > locked.
> >
> > I confirmed this was what was happening by installing a debug kernel
> > that had
> >
> > actual_end = min_t(u64, i_size_read(inode), end + 1);
> > if (actual_end > end + 1) {
> >         printk(KERN_ERR "WE GOT FUCKED\n");
> 
> I think we coud be more polite on changelogs and mailing lists, or we
> could add a tag in the subjects warning about improper content like
> video games and music records :)
> 

Hah, well that's what I put, but we can change it for the changelog if it's a
problem.  Thanks,

Josef

^ permalink raw reply	[flat|nested] 6+ messages in thread

* Re: [PATCH] btrfs: save i_size in compress_file_range
  2019-10-11 13:03 [PATCH] btrfs: save i_size in compress_file_range Josef Bacik
  2019-10-11 15:28 ` Filipe Manana
@ 2019-10-14 11:52 ` David Sterba
  2019-10-23 16:51 ` David Sterba
  2 siblings, 0 replies; 6+ messages in thread
From: David Sterba @ 2019-10-14 11:52 UTC (permalink / raw)
  To: Josef Bacik; +Cc: linux-btrfs, kernel-team, stable

On Fri, Oct 11, 2019 at 09:03:54AM -0400, Josef Bacik wrote:
> this is happening because the page is not locked when doing
> clear_page_dirty_for_io.  Looking at the core dump it was because our
> async_extent had a ram_size of 24576 but our async_chunk range only
> spanned 20480, so we had a whole extra page in our ram_size for our
> async_extent.
> 
> This happened because we try not to compress pages outside of our
> i_size, however a cleanup patch changed us to do
> 
> actual_end = min_t(u64, i_size_read(inode), end + 1);
> 
> which is problematic because i_size_read() can evaluate to different
> values in between checking and assigning.

The other patch adding READ_ONCE to i_size_read actually has the
interesting bits why this happens, so this changelog should have them
too.

> So either a expanding
> truncate or a fallocate could increase our i_size while we're doing
> writeout and actual_end would end up being past the range we have
> locked.

Yeah, it's there, the problematic part is the double load of inode size

  mov    0x20(%rsp),%rax
  cmp    %rax,0x48(%r15)
  movl   $0x0,0x18(%rsp)
  mov    %rax,%r12
  mov    %r14,%rax
  cmovbe 0x48(%r15),%r12

Where r15 is inode and 0x48 is offset of i_size.

> I confirmed this was what was happening by installing a debug kernel
> that had
> 
> actual_end = min_t(u64, i_size_read(inode), end + 1);
> if (actual_end > end + 1) {
> 	printk(KERN_ERR "WE GOT FUCKED\n");
> 	actual_end = end + 1;
> }
> 
> and installing it onto 500 boxes of the tier that had been seeing the
> problem regularly.  Last night I got my debug message and no panic,
> confirming what I expected.
> 
> Fixes: 62b37622718c ("btrfs: Remove isize local variable in compress_file_range")
> cc: stable@vger.kernel.org
> Signed-off-by: Josef Bacik <josef@toxicpanda.com>
> ---
>  fs/btrfs/inode.c | 9 ++++++++-
>  1 file changed, 8 insertions(+), 1 deletion(-)
> 
> diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c
> index 2eb1d7249f83..9a483d1f61f8 100644
> --- a/fs/btrfs/inode.c
> +++ b/fs/btrfs/inode.c
> @@ -474,6 +474,7 @@ static noinline int compress_file_range(struct async_chunk *async_chunk)
>  	u64 start = async_chunk->start;
>  	u64 end = async_chunk->end;
>  	u64 actual_end;
> +	loff_t i_size = i_size_read(inode);
>  	int ret = 0;
>  	struct page **pages = NULL;
>  	unsigned long nr_pages;
> @@ -488,7 +489,13 @@ static noinline int compress_file_range(struct async_chunk *async_chunk)
>  	inode_should_defrag(BTRFS_I(inode), start, end, end - start + 1,
>  			SZ_16K);
>  
> -	actual_end = min_t(u64, i_size_read(inode), end + 1);
> +	/*
> +	 * We need to save i_size before now because it could change in between
> +	 * us evaluating the size and assigning it.  This is because we lock and
> +	 * unlock the page in truncate and fallocate, and then modify the i_size
> +	 * later on.
> +	 */
> +	actual_end = min_t(u64, i_size, end + 1);

Unfortunatelly this is not sufficient and does not prevent the compiler
to do the same optimization, though current compilers seem to do just
one load it's not prevented in principle.  The only cure is READ_ONCE,
either globally or locally done in compress_file_range.

The other patch addding READ_ONCE to i_size_read would be IMHO ideal.

I've experimented with various ways to do the 'once' semantics, the
following worked:

* force READ_ONCE on the isize varaible, as READ_ONCE requires lvalue,
  we can't do READ_ONCE(i_size_read())

  u64 isize;
  ...
  isize = i_read_size(inode);
  isize = READ_ONCE(isize);

* insert barrier

  u64 isize;
  ...
  isize = i_size_read(inode);
  barrier();

* i_size_read returns READ_ONCE(inode->i_size)

  same as current code, ie.
  actual_end = min_t(..., i_size_read(inode), ...)

All verified on the assembly level that this does not do the double
load.

^ permalink raw reply	[flat|nested] 6+ messages in thread

* Re: [PATCH] btrfs: save i_size in compress_file_range
  2019-10-11 13:03 [PATCH] btrfs: save i_size in compress_file_range Josef Bacik
  2019-10-11 15:28 ` Filipe Manana
  2019-10-14 11:52 ` David Sterba
@ 2019-10-23 16:51 ` David Sterba
  2019-11-04 19:59   ` David Sterba
  2 siblings, 1 reply; 6+ messages in thread
From: David Sterba @ 2019-10-23 16:51 UTC (permalink / raw)
  To: Josef Bacik; +Cc: linux-btrfs, kernel-team, stable

On Fri, Oct 11, 2019 at 09:03:54AM -0400, Josef Bacik wrote:
> --- a/fs/btrfs/inode.c
> +++ b/fs/btrfs/inode.c
> @@ -474,6 +474,7 @@ static noinline int compress_file_range(struct async_chunk *async_chunk)
>  	u64 start = async_chunk->start;
>  	u64 end = async_chunk->end;
>  	u64 actual_end;
> +	loff_t i_size = i_size_read(inode);
>  	int ret = 0;
>  	struct page **pages = NULL;
>  	unsigned long nr_pages;
> @@ -488,7 +489,13 @@ static noinline int compress_file_range(struct async_chunk *async_chunk)
>  	inode_should_defrag(BTRFS_I(inode), start, end, end - start + 1,
>  			SZ_16K);
>  
> -	actual_end = min_t(u64, i_size_read(inode), end + 1);
> +	/*
> +	 * We need to save i_size before now because it could change in between
> +	 * us evaluating the size and assigning it.  This is because we lock and
> +	 * unlock the page in truncate and fallocate, and then modify the i_size
> +	 * later on.
> +	 */
> +	actual_end = min_t(u64, i_size, end + 1);

Ping. This is not a future proof fix, please update the changelog and
code according to the reply I sent.

^ permalink raw reply	[flat|nested] 6+ messages in thread

* Re: [PATCH] btrfs: save i_size in compress_file_range
  2019-10-23 16:51 ` David Sterba
@ 2019-11-04 19:59   ` David Sterba
  0 siblings, 0 replies; 6+ messages in thread
From: David Sterba @ 2019-11-04 19:59 UTC (permalink / raw)
  To: David Sterba; +Cc: Josef Bacik, linux-btrfs, kernel-team, stable

On Wed, Oct 23, 2019 at 06:51:02PM +0200, David Sterba wrote:
> On Fri, Oct 11, 2019 at 09:03:54AM -0400, Josef Bacik wrote:
> > --- a/fs/btrfs/inode.c
> > +++ b/fs/btrfs/inode.c
> > @@ -474,6 +474,7 @@ static noinline int compress_file_range(struct async_chunk *async_chunk)
> >  	u64 start = async_chunk->start;
> >  	u64 end = async_chunk->end;
> >  	u64 actual_end;
> > +	loff_t i_size = i_size_read(inode);
> >  	int ret = 0;
> >  	struct page **pages = NULL;
> >  	unsigned long nr_pages;
> > @@ -488,7 +489,13 @@ static noinline int compress_file_range(struct async_chunk *async_chunk)
> >  	inode_should_defrag(BTRFS_I(inode), start, end, end - start + 1,
> >  			SZ_16K);
> >  
> > -	actual_end = min_t(u64, i_size_read(inode), end + 1);
> > +	/*
> > +	 * We need to save i_size before now because it could change in between
> > +	 * us evaluating the size and assigning it.  This is because we lock and
> > +	 * unlock the page in truncate and fallocate, and then modify the i_size
> > +	 * later on.
> > +	 */
> > +	actual_end = min_t(u64, i_size, end + 1);
> 
> Ping. This is not a future proof fix, please update the changelog and
> code according to the reply I sent.

The vfs i_size_read patch is being ignored so I guess we're on our own.
I have postponed last weeks pull request so I'll add this patch on top
and send in a day or two. The READ_ONCE will be simulated by barrier()s,
I've verified the assembly and actually that's alwo what read once does
among other things.

^ permalink raw reply	[flat|nested] 6+ messages in thread

end of thread, back to index

Thread overview: 6+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2019-10-11 13:03 [PATCH] btrfs: save i_size in compress_file_range Josef Bacik
2019-10-11 15:28 ` Filipe Manana
2019-10-11 17:15   ` Josef Bacik
2019-10-14 11:52 ` David Sterba
2019-10-23 16:51 ` David Sterba
2019-11-04 19:59   ` David Sterba

Linux-BTRFS Archive on lore.kernel.org

Archives are clonable:
	git clone --mirror https://lore.kernel.org/linux-btrfs/0 linux-btrfs/git/0.git

	# If you have public-inbox 1.1+ installed, you may
	# initialize and index your mirror using the following commands:
	public-inbox-init -V2 linux-btrfs linux-btrfs/ https://lore.kernel.org/linux-btrfs \
		linux-btrfs@vger.kernel.org
	public-inbox-index linux-btrfs

Example config snippet for mirrors

Newsgroup available over NNTP:
	nntp://nntp.lore.kernel.org/org.kernel.vger.linux-btrfs


AGPL code for this site: git clone https://public-inbox.org/public-inbox.git