linux-fsdevel.vger.kernel.org archive mirror
* [PATCH] bdev: Do not return EBUSY if bdev discard races with write
@ 2021-01-07 15:40 Jan Kara
  2021-01-07 15:48 ` Maxim Levitsky
                   ` (3 more replies)
  0 siblings, 4 replies; 8+ messages in thread
From: Jan Kara @ 2021-01-07 15:40 UTC
  To: Jens Axboe; +Cc: Maxim Levitsky, linux-fsdevel, Jan Kara, Darrick J. Wong

blkdev_fallocate() tries to detect whether a discard raced with an
overlapping write by calling invalidate_inode_pages2_range(). However,
this check can give both false negatives (when writing using direct IO
or when writeback has already written out the dirtied pagecache range)
and false positives (when the write does not actually overlap but ends
in the same page, with blocksize < pagesize). This causes real issues
for qemu, which gets confused by the EBUSY errors.
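
The false-positive case can be illustrated with a small user-space
sketch. This is not kernel code: the helper names and the 4 KiB page
size are assumptions made only for the example. It compares an exact
byte-range overlap test with the same test done at page granularity,
which is all a page-cache based check can see.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Assumed for illustration only: 4 KiB pages. */
#define SKETCH_PAGE_SHIFT 12

/* True if the inclusive ranges [s1, e1] and [s2, e2] share an element. */
static bool bytes_overlap(uint64_t s1, uint64_t e1, uint64_t s2, uint64_t e2)
{
	assert(s1 <= e1 && s2 <= e2);
	return s1 <= e2 && s2 <= e1;
}

/*
 * The same test after rounding both byte ranges to page indices.
 * Hypothetical helper: it models the granularity of a page-cache
 * based conflict check, not the actual kernel implementation.
 */
static bool pages_overlap(uint64_t s1, uint64_t e1, uint64_t s2, uint64_t e2)
{
	return bytes_overlap(s1 >> SKETCH_PAGE_SHIFT, e1 >> SKETCH_PAGE_SHIFT,
			     s2 >> SKETCH_PAGE_SHIFT, e2 >> SKETCH_PAGE_SHIFT);
}
```

With 512-byte blocks, a discard of bytes 0..2047 and a write to bytes
2048..2559 do not overlap (bytes_overlap() is false), yet both fall in
page 0, so pages_overlap() reports a conflict.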

Fix the problem by removing this conflicting write detection since it is
inherently racy and thus of little use anyway.

Reported-by: Maxim Levitsky <mlevitsk@redhat.com>
CC: "Darrick J. Wong" <darrick.wong@oracle.com>
Link: https://lore.kernel.org/qemu-devel/20201111153913.41840-1-mlevitsk@redhat.com
Signed-off-by: Jan Kara <jack@suse.cz>
---
 fs/block_dev.c | 10 ++++------
 1 file changed, 4 insertions(+), 6 deletions(-)

diff --git a/fs/block_dev.c b/fs/block_dev.c
index 3e5b02f6606c..a97f43b49839 100644
--- a/fs/block_dev.c
+++ b/fs/block_dev.c
@@ -1797,13 +1797,11 @@ static long blkdev_fallocate(struct file *file, int mode, loff_t start,
 		return error;
 
 	/*
-	 * Invalidate again; if someone wandered in and dirtied a page,
-	 * the caller will be given -EBUSY.  The third argument is
-	 * inclusive, so the rounding here is safe.
+	 * Invalidate the page cache again; if someone wandered in and dirtied
+	 * a page, we just discard it - userspace has no way of knowing whether
+	 * the write happened before or after discard completing...
 	 */
-	return invalidate_inode_pages2_range(bdev->bd_inode->i_mapping,
-					     start >> PAGE_SHIFT,
-					     end >> PAGE_SHIFT);
+	return truncate_bdev_range(bdev, file->f_mode, start, end);
 }
 
 const struct file_operations def_blk_fops = {
-- 
2.26.2



* Re: [PATCH] bdev: Do not return EBUSY if bdev discard races with write
  2021-01-07 15:40 [PATCH] bdev: Do not return EBUSY if bdev discard races with write Jan Kara
@ 2021-01-07 15:48 ` Maxim Levitsky
  2021-01-07 15:52   ` Maxim Levitsky
  2021-01-07 19:40 ` Darrick J. Wong
                   ` (2 subsequent siblings)
  3 siblings, 1 reply; 8+ messages in thread
From: Maxim Levitsky @ 2021-01-07 15:48 UTC
  To: Jan Kara, Jens Axboe; +Cc: linux-fsdevel, Darrick J. Wong

On Thu, 2021-01-07 at 16:40 +0100, Jan Kara wrote:
> blkdev_fallocate() tries to detect whether a discard raced with an
> overlapping write by calling invalidate_inode_pages2_range(). However
> this check can give both false negatives (when writing using direct IO
> or when writeback already writes out the written pagecache range) and
> false positives (when write is not actually overlapping but ends in the
> same page when blocksize < pagesize). This actually causes issues for
> qemu which is getting confused by EBUSY errors.
> 
> Fix the problem by removing this conflicting write detection since it is
> inherently racy and thus of little use anyway.
> 
> Reported-by: Maxim Levitsky <mlevitsk@redhat.com>
> CC: "Darrick J. Wong" <darrick.wong@oracle.com>
> Link: https://lore.kernel.org/qemu-devel/20201111153913.41840-1-mlevitsk@redhat.com
> Signed-off-by: Jan Kara <jack@suse.cz>
> ---
>  fs/block_dev.c | 10 ++++------
>  1 file changed, 4 insertions(+), 6 deletions(-)
> 
> diff --git a/fs/block_dev.c b/fs/block_dev.c
> index 3e5b02f6606c..a97f43b49839 100644
> --- a/fs/block_dev.c
> +++ b/fs/block_dev.c
> @@ -1797,13 +1797,11 @@ static long blkdev_fallocate(struct file *file, int mode, loff_t start,
>  		return error;
>  
>  	/*
> -	 * Invalidate again; if someone wandered in and dirtied a page,
> -	 * the caller will be given -EBUSY.  The third argument is
> -	 * inclusive, so the rounding here is safe.
> +	 * Invalidate the page cache again; if someone wandered in and dirtied
> +	 * a page, we just discard it - userspace has no way of knowing whether
> +	 * the write happened before or after discard completing...
>  	 */
> -	return invalidate_inode_pages2_range(bdev->bd_inode->i_mapping,
> -					     start >> PAGE_SHIFT,
> -					     end >> PAGE_SHIFT);
> +	return truncate_bdev_range(bdev, file->f_mode, start, end);
>  }


But what happens if write and discard don't overlap? Won't we
discard the written data in this case?


Best regards,
	Maxim Levitsky


>  
>  const struct file_operations def_blk_fops = {




* Re: [PATCH] bdev: Do not return EBUSY if bdev discard races with write
  2021-01-07 15:48 ` Maxim Levitsky
@ 2021-01-07 15:52   ` Maxim Levitsky
  0 siblings, 0 replies; 8+ messages in thread
From: Maxim Levitsky @ 2021-01-07 15:52 UTC
  To: Jan Kara, Jens Axboe; +Cc: linux-fsdevel, Darrick J. Wong

On Thu, 2021-01-07 at 17:48 +0200, Maxim Levitsky wrote:
> On Thu, 2021-01-07 at 16:40 +0100, Jan Kara wrote:
> > blkdev_fallocate() tries to detect whether a discard raced with an
> > overlapping write by calling invalidate_inode_pages2_range(). However
> > this check can give both false negatives (when writing using direct IO
> > or when writeback already writes out the written pagecache range) and
> > false positives (when write is not actually overlapping but ends in the
> > same page when blocksize < pagesize). This actually causes issues for
> > qemu which is getting confused by EBUSY errors.
> > 
> > Fix the problem by removing this conflicting write detection since it is
> > inherently racy and thus of little use anyway.
> > 
> > Reported-by: Maxim Levitsky <mlevitsk@redhat.com>
> > CC: "Darrick J. Wong" <darrick.wong@oracle.com>
> > Link: https://lore.kernel.org/qemu-devel/20201111153913.41840-1-mlevitsk@redhat.com
> > Signed-off-by: Jan Kara <jack@suse.cz>
> > ---
> >  fs/block_dev.c | 10 ++++------
> >  1 file changed, 4 insertions(+), 6 deletions(-)
> > 
> > diff --git a/fs/block_dev.c b/fs/block_dev.c
> > index 3e5b02f6606c..a97f43b49839 100644
> > --- a/fs/block_dev.c
> > +++ b/fs/block_dev.c
> > @@ -1797,13 +1797,11 @@ static long blkdev_fallocate(struct file *file, int mode, loff_t start,
> >  		return error;
> >  
> >  	/*
> > -	 * Invalidate again; if someone wandered in and dirtied a page,
> > -	 * the caller will be given -EBUSY.  The third argument is
> > -	 * inclusive, so the rounding here is safe.
> > +	 * Invalidate the page cache again; if someone wandered in and dirtied
> > +	 * a page, we just discard it - userspace has no way of knowing whether
> > +	 * the write happened before or after discard completing...
> >  	 */
> > -	return invalidate_inode_pages2_range(bdev->bd_inode->i_mapping,
> > -					     start >> PAGE_SHIFT,
> > -					     end >> PAGE_SHIFT);
> > +	return truncate_bdev_range(bdev, file->f_mode, start, end);
> >  }
> 
> But what happens if write and discard don't overlap? Won't we
> discard the written data in this case?

Ah, I see: truncate_bdev_range() preserves the parts of partially
covered pages that fall outside the range.

In this case this indeed looks right.
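
A user-space sketch of why data outside the range survives, assuming
4 KiB pages and hypothetical helper names (this is not the kernel
code): a byte-range truncation only drops pages that lie wholly inside
[lstart, lend], which means rounding the start up and the end down to
page boundaries. As far as I understand, truncate_inode_pages_range()
rounds inward in a similar way and handles the partially covered edge
pages separately instead of dropping them.

```c
#include <assert.h>
#include <stdint.h>

/* Assumed for illustration only: 4 KiB pages. */
#define SKETCH_PAGE_SIZE  4096ULL
#define SKETCH_PAGE_SHIFT 12

/* Half-open interval [first, end) of page indices. */
struct full_pages {
	uint64_t first;
	uint64_t end;
};

/*
 * Page indices that lie entirely inside the inclusive byte range
 * [lstart, lend]. Pages only partially covered at either edge are
 * excluded, so bytes in them outside the range are left alone.
 */
static struct full_pages fully_covered_pages(uint64_t lstart, uint64_t lend)
{
	struct full_pages fp;

	assert(lstart <= lend);
	fp.first = (lstart + SKETCH_PAGE_SIZE - 1) >> SKETCH_PAGE_SHIFT; /* round up */
	fp.end = (lend + 1) >> SKETCH_PAGE_SHIFT;                        /* round down */
	if (fp.end < fp.first)
		fp.end = fp.first; /* range sits inside a single page: nothing whole */
	return fp;
}
```

For the byte range 2048..10239 only page 1 is fully covered; pages 0
and 2 are partial, so bytes 0..2047 and 10240..12287 are not part of
what gets dropped.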

Reviewed-by: Maxim Levitsky <mlevitsk@redhat.com>

Best regards,
	Maxim Levitsky


> 
> 
> Best regards,
> 	Maxim Levitsky
> 
> 
> >  
> >  const struct file_operations def_blk_fops = {




* Re: [PATCH] bdev: Do not return EBUSY if bdev discard races with write
  2021-01-07 15:40 [PATCH] bdev: Do not return EBUSY if bdev discard races with write Jan Kara
  2021-01-07 15:48 ` Maxim Levitsky
@ 2021-01-07 19:40 ` Darrick J. Wong
  2021-01-09 10:42 ` Christoph Hellwig
  2021-01-26 10:02 ` Jan Kara
  3 siblings, 0 replies; 8+ messages in thread
From: Darrick J. Wong @ 2021-01-07 19:40 UTC
  To: Jan Kara; +Cc: Jens Axboe, Maxim Levitsky, linux-fsdevel

On Thu, Jan 07, 2021 at 04:40:34PM +0100, Jan Kara wrote:
> blkdev_fallocate() tries to detect whether a discard raced with an
> overlapping write by calling invalidate_inode_pages2_range(). However
> this check can give both false negatives (when writing using direct IO
> or when writeback already writes out the written pagecache range) and
> false positives (when write is not actually overlapping but ends in the
> same page when blocksize < pagesize). This actually causes issues for
> qemu which is getting confused by EBUSY errors.
> 
> Fix the problem by removing this conflicting write detection since it is
> inherently racy and thus of little use anyway.
> 
> Reported-by: Maxim Levitsky <mlevitsk@redhat.com>
> CC: "Darrick J. Wong" <darrick.wong@oracle.com>
> Link: https://lore.kernel.org/qemu-devel/20201111153913.41840-1-mlevitsk@redhat.com
> Signed-off-by: Jan Kara <jack@suse.cz>

Looks good to me,
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>

--D

> ---
>  fs/block_dev.c | 10 ++++------
>  1 file changed, 4 insertions(+), 6 deletions(-)
> 
> diff --git a/fs/block_dev.c b/fs/block_dev.c
> index 3e5b02f6606c..a97f43b49839 100644
> --- a/fs/block_dev.c
> +++ b/fs/block_dev.c
> @@ -1797,13 +1797,11 @@ static long blkdev_fallocate(struct file *file, int mode, loff_t start,
>  		return error;
>  
>  	/*
> -	 * Invalidate again; if someone wandered in and dirtied a page,
> -	 * the caller will be given -EBUSY.  The third argument is
> -	 * inclusive, so the rounding here is safe.
> +	 * Invalidate the page cache again; if someone wandered in and dirtied
> +	 * a page, we just discard it - userspace has no way of knowing whether
> +	 * the write happened before or after discard completing...
>  	 */
> -	return invalidate_inode_pages2_range(bdev->bd_inode->i_mapping,
> -					     start >> PAGE_SHIFT,
> -					     end >> PAGE_SHIFT);
> +	return truncate_bdev_range(bdev, file->f_mode, start, end);
>  }
>  
>  const struct file_operations def_blk_fops = {
> -- 
> 2.26.2
> 


* Re: [PATCH] bdev: Do not return EBUSY if bdev discard races with write
  2021-01-07 15:40 [PATCH] bdev: Do not return EBUSY if bdev discard races with write Jan Kara
  2021-01-07 15:48 ` Maxim Levitsky
  2021-01-07 19:40 ` Darrick J. Wong
@ 2021-01-09 10:42 ` Christoph Hellwig
  2021-01-26 10:02 ` Jan Kara
  3 siblings, 0 replies; 8+ messages in thread
From: Christoph Hellwig @ 2021-01-09 10:42 UTC
  To: Jan Kara; +Cc: Jens Axboe, Maxim Levitsky, linux-fsdevel, Darrick J. Wong

Looks good,

Reviewed-by: Christoph Hellwig <hch@lst.de>


* Re: [PATCH] bdev: Do not return EBUSY if bdev discard races with write
  2021-01-07 15:40 [PATCH] bdev: Do not return EBUSY if bdev discard races with write Jan Kara
                   ` (2 preceding siblings ...)
  2021-01-09 10:42 ` Christoph Hellwig
@ 2021-01-26 10:02 ` Jan Kara
  2021-01-26 17:22   ` Jens Axboe
  3 siblings, 1 reply; 8+ messages in thread
From: Jan Kara @ 2021-01-26 10:02 UTC
  To: Jens Axboe; +Cc: Maxim Levitsky, linux-fsdevel, Jan Kara, Darrick J. Wong

On Thu 07-01-21 16:40:34, Jan Kara wrote:
> blkdev_fallocate() tries to detect whether a discard raced with an
> overlapping write by calling invalidate_inode_pages2_range(). However
> this check can give both false negatives (when writing using direct IO
> or when writeback already writes out the written pagecache range) and
> false positives (when write is not actually overlapping but ends in the
> same page when blocksize < pagesize). This actually causes issues for
> qemu which is getting confused by EBUSY errors.
> 
> Fix the problem by removing this conflicting write detection since it is
> inherently racy and thus of little use anyway.
> 
> Reported-by: Maxim Levitsky <mlevitsk@redhat.com>
> CC: "Darrick J. Wong" <darrick.wong@oracle.com>
> Link: https://lore.kernel.org/qemu-devel/20201111153913.41840-1-mlevitsk@redhat.com
> Signed-off-by: Jan Kara <jack@suse.cz>

Jens, can you please pick up this patch? Thanks!

									Honza

> ---
>  fs/block_dev.c | 10 ++++------
>  1 file changed, 4 insertions(+), 6 deletions(-)
> 
> diff --git a/fs/block_dev.c b/fs/block_dev.c
> index 3e5b02f6606c..a97f43b49839 100644
> --- a/fs/block_dev.c
> +++ b/fs/block_dev.c
> @@ -1797,13 +1797,11 @@ static long blkdev_fallocate(struct file *file, int mode, loff_t start,
>  		return error;
>  
>  	/*
> -	 * Invalidate again; if someone wandered in and dirtied a page,
> -	 * the caller will be given -EBUSY.  The third argument is
> -	 * inclusive, so the rounding here is safe.
> +	 * Invalidate the page cache again; if someone wandered in and dirtied
> +	 * a page, we just discard it - userspace has no way of knowing whether
> +	 * the write happened before or after discard completing...
>  	 */
> -	return invalidate_inode_pages2_range(bdev->bd_inode->i_mapping,
> -					     start >> PAGE_SHIFT,
> -					     end >> PAGE_SHIFT);
> +	return truncate_bdev_range(bdev, file->f_mode, start, end);
>  }
>  
>  const struct file_operations def_blk_fops = {
> -- 
> 2.26.2
> 
-- 
Jan Kara <jack@suse.com>
SUSE Labs, CR


* Re: [PATCH] bdev: Do not return EBUSY if bdev discard races with write
  2021-01-26 10:02 ` Jan Kara
@ 2021-01-26 17:22   ` Jens Axboe
  2021-01-27  9:12     ` Jan Kara
  0 siblings, 1 reply; 8+ messages in thread
From: Jens Axboe @ 2021-01-26 17:22 UTC
  To: Jan Kara; +Cc: Maxim Levitsky, linux-fsdevel, Darrick J. Wong

On 1/26/21 3:02 AM, Jan Kara wrote:
> On Thu 07-01-21 16:40:34, Jan Kara wrote:
>> blkdev_fallocate() tries to detect whether a discard raced with an
>> overlapping write by calling invalidate_inode_pages2_range(). However
>> this check can give both false negatives (when writing using direct IO
>> or when writeback already writes out the written pagecache range) and
>> false positives (when write is not actually overlapping but ends in the
>> same page when blocksize < pagesize). This actually causes issues for
>> qemu which is getting confused by EBUSY errors.
>>
>> Fix the problem by removing this conflicting write detection since it is
>> inherently racy and thus of little use anyway.
>>
>> Reported-by: Maxim Levitsky <mlevitsk@redhat.com>
>> CC: "Darrick J. Wong" <darrick.wong@oracle.com>
>> Link: https://lore.kernel.org/qemu-devel/20201111153913.41840-1-mlevitsk@redhat.com
>> Signed-off-by: Jan Kara <jack@suse.cz>
> 
> Jens, can you please pick up this patch? Thanks!

Picked it up for 5.12, hope that works. It looks simple enough, but it
doesn't really meet the criteria for 5.11 at this point.

-- 
Jens Axboe



* Re: [PATCH] bdev: Do not return EBUSY if bdev discard races with write
  2021-01-26 17:22   ` Jens Axboe
@ 2021-01-27  9:12     ` Jan Kara
  0 siblings, 0 replies; 8+ messages in thread
From: Jan Kara @ 2021-01-27  9:12 UTC
  To: Jens Axboe; +Cc: Jan Kara, Maxim Levitsky, linux-fsdevel, Darrick J. Wong

On Tue 26-01-21 10:22:56, Jens Axboe wrote:
> On 1/26/21 3:02 AM, Jan Kara wrote:
> > On Thu 07-01-21 16:40:34, Jan Kara wrote:
> >> blkdev_fallocate() tries to detect whether a discard raced with an
> >> overlapping write by calling invalidate_inode_pages2_range(). However
> >> this check can give both false negatives (when writing using direct IO
> >> or when writeback already writes out the written pagecache range) and
> >> false positives (when write is not actually overlapping but ends in the
> >> same page when blocksize < pagesize). This actually causes issues for
> >> qemu which is getting confused by EBUSY errors.
> >>
> >> Fix the problem by removing this conflicting write detection since it is
> >> inherently racy and thus of little use anyway.
> >>
> >> Reported-by: Maxim Levitsky <mlevitsk@redhat.com>
> >> CC: "Darrick J. Wong" <darrick.wong@oracle.com>
> >> Link: https://lore.kernel.org/qemu-devel/20201111153913.41840-1-mlevitsk@redhat.com
> >> Signed-off-by: Jan Kara <jack@suse.cz>
> > 
> > Jens, can you please pick up this patch? Thanks!
> 
> Picked it up for 5.12, hope that works. It looks simple enough, but it
> doesn't really meet the criteria for 5.11 at this point.

Sure, 5.12 is fine. We've been living with the current behavior for quite
some time and not many people complained...

								Honza
-- 
Jan Kara <jack@suse.com>
SUSE Labs, CR

