* Re: [PATCH] bdev: Do not return EBUSY if bdev discard races with write
2021-01-07 15:40 [PATCH] bdev: Do not return EBUSY if bdev discard races with write Jan Kara
@ 2021-01-07 15:48 ` Maxim Levitsky
2021-01-07 15:52 ` Maxim Levitsky
2021-01-07 19:40 ` Darrick J. Wong
` (2 subsequent siblings)
3 siblings, 1 reply; 8+ messages in thread
From: Maxim Levitsky @ 2021-01-07 15:48 UTC (permalink / raw)
To: Jan Kara, Jens Axboe; +Cc: linux-fsdevel, Darrick J. Wong
On Thu, 2021-01-07 at 16:40 +0100, Jan Kara wrote:
> blkdev_fallocate() tries to detect whether a discard raced with an
> overlapping write by calling invalidate_inode_pages2_range(). However
> this check can give both false negatives (when writing using direct IO
> or when writeback already writes out the written pagecache range) and
> false positives (when write is not actually overlapping but ends in the
> same page when blocksize < pagesize). This actually causes issues for
> qemu which is getting confused by EBUSY errors.
>
> Fix the problem by removing this conflicting write detection since it is
> inherently racy and thus of little use anyway.
>
> Reported-by: Maxim Levitsky <mlevitsk@redhat.com>
> CC: "Darrick J. Wong" <darrick.wong@oracle.com>
> Link: https://lore.kernel.org/qemu-devel/20201111153913.41840-1-mlevitsk@redhat.com
> Signed-off-by: Jan Kara <jack@suse.cz>
> ---
> fs/block_dev.c | 10 ++++------
> 1 file changed, 4 insertions(+), 6 deletions(-)
>
> diff --git a/fs/block_dev.c b/fs/block_dev.c
> index 3e5b02f6606c..a97f43b49839 100644
> --- a/fs/block_dev.c
> +++ b/fs/block_dev.c
> @@ -1797,13 +1797,11 @@ static long blkdev_fallocate(struct file *file, int mode, loff_t start,
> return error;
>
> /*
> - * Invalidate again; if someone wandered in and dirtied a page,
> - * the caller will be given -EBUSY. The third argument is
> - * inclusive, so the rounding here is safe.
> + * Invalidate the page cache again; if someone wandered in and dirtied
> + * a page, we just discard it - userspace has no way of knowing whether
> + * the write happened before or after discard completing...
> */
> - return invalidate_inode_pages2_range(bdev->bd_inode->i_mapping,
> - start >> PAGE_SHIFT,
> - end >> PAGE_SHIFT);
> + return truncate_bdev_range(bdev, file->f_mode, start, end);
> }
But what happens if write and discard don't overlap? Won't we
discard the written data in this case?
Best regards,
Maxim Levitsky
>
> const struct file_operations def_blk_fops = {
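The false positive described in the commit message (a write that does not overlap the discard but ends in the same page when blocksize < pagesize) can be illustrated with a small userspace sketch. This is not kernel code: EX_PAGE_SHIFT, bytes_overlap() and pages_overlap() are hypothetical names introduced here, and a 4 KiB page size is assumed. It only shows that a check done at page-index granularity, like the removed start >> PAGE_SHIFT / end >> PAGE_SHIFT call, can report a conflict where the byte ranges are disjoint.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Assumption: 4 KiB pages. The kernel's PAGE_SHIFT is architecture-specific. */
#define EX_PAGE_SHIFT 12

/* True if the inclusive byte ranges [a_start, a_end] and [b_start, b_end]
 * share at least one byte. */
bool bytes_overlap(uint64_t a_start, uint64_t a_end,
                   uint64_t b_start, uint64_t b_end)
{
    assert(a_start <= a_end && b_start <= b_end);
    return a_start <= b_end && b_start <= a_end;
}

/* True if the two inclusive byte ranges touch at least one common page.
 * This is the granularity a page-index based check works at. */
bool pages_overlap(uint64_t a_start, uint64_t a_end,
                   uint64_t b_start, uint64_t b_end)
{
    assert(a_start <= a_end && b_start <= b_end);
    return (a_start >> EX_PAGE_SHIFT) <= (b_end >> EX_PAGE_SHIFT) &&
           (b_start >> EX_PAGE_SHIFT) <= (a_end >> EX_PAGE_SHIFT);
}
```

With 512-byte blocks, a write to bytes 0..511 and a discard of bytes 512..4095 are disjoint at byte level but both fall in page 0, so the page-level check sees a conflict.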
* Re: [PATCH] bdev: Do not return EBUSY if bdev discard races with write
2021-01-07 15:48 ` Maxim Levitsky
@ 2021-01-07 15:52 ` Maxim Levitsky
0 siblings, 0 replies; 8+ messages in thread
From: Maxim Levitsky @ 2021-01-07 15:52 UTC (permalink / raw)
To: Jan Kara, Jens Axboe; +Cc: linux-fsdevel, Darrick J. Wong
On Thu, 2021-01-07 at 17:48 +0200, Maxim Levitsky wrote:
> On Thu, 2021-01-07 at 16:40 +0100, Jan Kara wrote:
> > blkdev_fallocate() tries to detect whether a discard raced with an
> > overlapping write by calling invalidate_inode_pages2_range(). However
> > this check can give both false negatives (when writing using direct IO
> > or when writeback already writes out the written pagecache range) and
> > false positives (when write is not actually overlapping but ends in the
> > same page when blocksize < pagesize). This actually causes issues for
> > qemu which is getting confused by EBUSY errors.
> >
> > Fix the problem by removing this conflicting write detection since it is
> > inherently racy and thus of little use anyway.
> >
> > Reported-by: Maxim Levitsky <mlevitsk@redhat.com>
> > CC: "Darrick J. Wong" <darrick.wong@oracle.com>
> > Link: https://lore.kernel.org/qemu-devel/20201111153913.41840-1-mlevitsk@redhat.com
> > Signed-off-by: Jan Kara <jack@suse.cz>
> > ---
> > fs/block_dev.c | 10 ++++------
> > 1 file changed, 4 insertions(+), 6 deletions(-)
> >
> > diff --git a/fs/block_dev.c b/fs/block_dev.c
> > index 3e5b02f6606c..a97f43b49839 100644
> > --- a/fs/block_dev.c
> > +++ b/fs/block_dev.c
> > @@ -1797,13 +1797,11 @@ static long blkdev_fallocate(struct file *file, int mode, loff_t start,
> > return error;
> >
> > /*
> > - * Invalidate again; if someone wandered in and dirtied a page,
> > - * the caller will be given -EBUSY. The third argument is
> > - * inclusive, so the rounding here is safe.
> > + * Invalidate the page cache again; if someone wandered in and dirtied
> > + * a page, we just discard it - userspace has no way of knowing whether
> > + * the write happened before or after discard completing...
> > */
> > - return invalidate_inode_pages2_range(bdev->bd_inode->i_mapping,
> > - start >> PAGE_SHIFT,
> > - end >> PAGE_SHIFT);
> > + return truncate_bdev_range(bdev, file->f_mode, start, end);
> > }
>
> But what happens if write and discard don't overlap? Won't we
> discard the written data in this case?
Ah, I see: truncate_bdev_range() preserves the parts of partially
covered pages that lie outside the range.
In that case this does indeed look right.
Reviewed-by: Maxim Levitsky <mlevitsk@redhat.com>
Best regards,
Maxim Levitsky
>
>
> Best regards,
> Maxim Levitsky
>
>
> >
> > const struct file_operations def_blk_fops = {
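Maxim's observation that the areas outside the range are preserved can be sketched as range arithmetic. The following is a userspace model, not the kernel implementation: it assumes that pages lying wholly inside the range are dropped from the page cache, while a partially covered first or last page only has its in-range bytes zeroed. EX_PAGE_SIZE, struct trunc_split and split_range() are hypothetical names, and a 4 KiB page size is assumed.

```c
#include <assert.h>
#include <stdint.h>

/* Assumption: 4 KiB pages. */
#define EX_PAGE_SIZE UINT64_C(4096)

struct trunc_split {
    uint64_t first_full_page; /* index of the first page dropped entirely */
    uint64_t nr_full_pages;   /* number of pages dropped entirely */
    uint64_t head_zeroed;     /* in-range bytes zeroed in a partial first page */
    uint64_t tail_zeroed;     /* in-range bytes zeroed in a partial last page */
};

/* Split the inclusive byte range [lstart, lend] into whole pages and
 * partial head/tail pieces. Bytes outside the range are never counted,
 * which is the "preserved" part in this model. */
struct trunc_split split_range(uint64_t lstart, uint64_t lend)
{
    struct trunc_split s = {0, 0, 0, 0};
    uint64_t end_excl, first_full, full_end;

    assert(lstart <= lend && lend < UINT64_MAX);
    end_excl = lend + 1;
    /* Round the start up and the end down to page boundaries. */
    first_full = lstart / EX_PAGE_SIZE + (lstart % EX_PAGE_SIZE != 0);
    full_end = end_excl / EX_PAGE_SIZE;
    s.first_full_page = first_full;

    if (first_full > full_end) {
        /* The range lies strictly inside a single page. */
        s.head_zeroed = end_excl - lstart;
        return s;
    }
    s.nr_full_pages = full_end - first_full;
    s.head_zeroed = first_full * EX_PAGE_SIZE - lstart;
    s.tail_zeroed = end_excl - full_end * EX_PAGE_SIZE;
    return s;
}
```

For a discard of bytes 512..4095, no page is dropped whole and only 3584 bytes of page 0 are zeroed in this model, so data cached for bytes 0..511 is left alone.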
* Re: [PATCH] bdev: Do not return EBUSY if bdev discard races with write
2021-01-07 15:40 [PATCH] bdev: Do not return EBUSY if bdev discard races with write Jan Kara
2021-01-07 15:48 ` Maxim Levitsky
@ 2021-01-07 19:40 ` Darrick J. Wong
2021-01-09 10:42 ` Christoph Hellwig
2021-01-26 10:02 ` Jan Kara
3 siblings, 0 replies; 8+ messages in thread
From: Darrick J. Wong @ 2021-01-07 19:40 UTC (permalink / raw)
To: Jan Kara; +Cc: Jens Axboe, Maxim Levitsky, linux-fsdevel
On Thu, Jan 07, 2021 at 04:40:34PM +0100, Jan Kara wrote:
> blkdev_fallocate() tries to detect whether a discard raced with an
> overlapping write by calling invalidate_inode_pages2_range(). However
> this check can give both false negatives (when writing using direct IO
> or when writeback already writes out the written pagecache range) and
> false positives (when write is not actually overlapping but ends in the
> same page when blocksize < pagesize). This actually causes issues for
> qemu which is getting confused by EBUSY errors.
>
> Fix the problem by removing this conflicting write detection since it is
> inherently racy and thus of little use anyway.
>
> Reported-by: Maxim Levitsky <mlevitsk@redhat.com>
> CC: "Darrick J. Wong" <darrick.wong@oracle.com>
> Link: https://lore.kernel.org/qemu-devel/20201111153913.41840-1-mlevitsk@redhat.com
> Signed-off-by: Jan Kara <jack@suse.cz>
Looks good to me,
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
--D
> ---
> fs/block_dev.c | 10 ++++------
> 1 file changed, 4 insertions(+), 6 deletions(-)
>
> diff --git a/fs/block_dev.c b/fs/block_dev.c
> index 3e5b02f6606c..a97f43b49839 100644
> --- a/fs/block_dev.c
> +++ b/fs/block_dev.c
> @@ -1797,13 +1797,11 @@ static long blkdev_fallocate(struct file *file, int mode, loff_t start,
> return error;
>
> /*
> - * Invalidate again; if someone wandered in and dirtied a page,
> - * the caller will be given -EBUSY. The third argument is
> - * inclusive, so the rounding here is safe.
> + * Invalidate the page cache again; if someone wandered in and dirtied
> + * a page, we just discard it - userspace has no way of knowing whether
> + * the write happened before or after discard completing...
> */
> - return invalidate_inode_pages2_range(bdev->bd_inode->i_mapping,
> - start >> PAGE_SHIFT,
> - end >> PAGE_SHIFT);
> + return truncate_bdev_range(bdev, file->f_mode, start, end);
> }
>
> const struct file_operations def_blk_fops = {
> --
> 2.26.2
>
* Re: [PATCH] bdev: Do not return EBUSY if bdev discard races with write
2021-01-07 15:40 [PATCH] bdev: Do not return EBUSY if bdev discard races with write Jan Kara
2021-01-07 15:48 ` Maxim Levitsky
2021-01-07 19:40 ` Darrick J. Wong
@ 2021-01-09 10:42 ` Christoph Hellwig
2021-01-26 10:02 ` Jan Kara
3 siblings, 0 replies; 8+ messages in thread
From: Christoph Hellwig @ 2021-01-09 10:42 UTC (permalink / raw)
To: Jan Kara; +Cc: Jens Axboe, Maxim Levitsky, linux-fsdevel, Darrick J. Wong
Looks good,
Reviewed-by: Christoph Hellwig <hch@lst.de>
* Re: [PATCH] bdev: Do not return EBUSY if bdev discard races with write
2021-01-07 15:40 [PATCH] bdev: Do not return EBUSY if bdev discard races with write Jan Kara
` (2 preceding siblings ...)
2021-01-09 10:42 ` Christoph Hellwig
@ 2021-01-26 10:02 ` Jan Kara
2021-01-26 17:22 ` Jens Axboe
3 siblings, 1 reply; 8+ messages in thread
From: Jan Kara @ 2021-01-26 10:02 UTC (permalink / raw)
To: Jens Axboe; +Cc: Maxim Levitsky, linux-fsdevel, Jan Kara, Darrick J. Wong
On Thu 07-01-21 16:40:34, Jan Kara wrote:
> blkdev_fallocate() tries to detect whether a discard raced with an
> overlapping write by calling invalidate_inode_pages2_range(). However
> this check can give both false negatives (when writing using direct IO
> or when writeback already writes out the written pagecache range) and
> false positives (when write is not actually overlapping but ends in the
> same page when blocksize < pagesize). This actually causes issues for
> qemu which is getting confused by EBUSY errors.
>
> Fix the problem by removing this conflicting write detection since it is
> inherently racy and thus of little use anyway.
>
> Reported-by: Maxim Levitsky <mlevitsk@redhat.com>
> CC: "Darrick J. Wong" <darrick.wong@oracle.com>
> Link: https://lore.kernel.org/qemu-devel/20201111153913.41840-1-mlevitsk@redhat.com
> Signed-off-by: Jan Kara <jack@suse.cz>
Jens, can you please pick up this patch? Thanks!
Honza
> ---
> fs/block_dev.c | 10 ++++------
> 1 file changed, 4 insertions(+), 6 deletions(-)
>
> diff --git a/fs/block_dev.c b/fs/block_dev.c
> index 3e5b02f6606c..a97f43b49839 100644
> --- a/fs/block_dev.c
> +++ b/fs/block_dev.c
> @@ -1797,13 +1797,11 @@ static long blkdev_fallocate(struct file *file, int mode, loff_t start,
> return error;
>
> /*
> - * Invalidate again; if someone wandered in and dirtied a page,
> - * the caller will be given -EBUSY. The third argument is
> - * inclusive, so the rounding here is safe.
> + * Invalidate the page cache again; if someone wandered in and dirtied
> + * a page, we just discard it - userspace has no way of knowing whether
> + * the write happened before or after discard completing...
> */
> - return invalidate_inode_pages2_range(bdev->bd_inode->i_mapping,
> - start >> PAGE_SHIFT,
> - end >> PAGE_SHIFT);
> + return truncate_bdev_range(bdev, file->f_mode, start, end);
> }
>
> const struct file_operations def_blk_fops = {
> --
> 2.26.2
>
--
Jan Kara <jack@suse.com>
SUSE Labs, CR
* Re: [PATCH] bdev: Do not return EBUSY if bdev discard races with write
2021-01-26 10:02 ` Jan Kara
@ 2021-01-26 17:22 ` Jens Axboe
2021-01-27 9:12 ` Jan Kara
0 siblings, 1 reply; 8+ messages in thread
From: Jens Axboe @ 2021-01-26 17:22 UTC (permalink / raw)
To: Jan Kara; +Cc: Maxim Levitsky, linux-fsdevel, Darrick J. Wong
On 1/26/21 3:02 AM, Jan Kara wrote:
> On Thu 07-01-21 16:40:34, Jan Kara wrote:
>> blkdev_fallocate() tries to detect whether a discard raced with an
>> overlapping write by calling invalidate_inode_pages2_range(). However
>> this check can give both false negatives (when writing using direct IO
>> or when writeback already writes out the written pagecache range) and
>> false positives (when write is not actually overlapping but ends in the
>> same page when blocksize < pagesize). This actually causes issues for
>> qemu which is getting confused by EBUSY errors.
>>
>> Fix the problem by removing this conflicting write detection since it is
>> inherently racy and thus of little use anyway.
>>
>> Reported-by: Maxim Levitsky <mlevitsk@redhat.com>
>> CC: "Darrick J. Wong" <darrick.wong@oracle.com>
>> Link: https://lore.kernel.org/qemu-devel/20201111153913.41840-1-mlevitsk@redhat.com
>> Signed-off-by: Jan Kara <jack@suse.cz>
>
> Jens, can you please pick up this patch? Thanks!
Picked it up for 5.12, hope that works. It looks simple enough but not
really meeting criteria for 5.11 at this point.
--
Jens Axboe
* Re: [PATCH] bdev: Do not return EBUSY if bdev discard races with write
2021-01-26 17:22 ` Jens Axboe
@ 2021-01-27 9:12 ` Jan Kara
0 siblings, 0 replies; 8+ messages in thread
From: Jan Kara @ 2021-01-27 9:12 UTC (permalink / raw)
To: Jens Axboe; +Cc: Jan Kara, Maxim Levitsky, linux-fsdevel, Darrick J. Wong
On Tue 26-01-21 10:22:56, Jens Axboe wrote:
> On 1/26/21 3:02 AM, Jan Kara wrote:
> > On Thu 07-01-21 16:40:34, Jan Kara wrote:
> >> blkdev_fallocate() tries to detect whether a discard raced with an
> >> overlapping write by calling invalidate_inode_pages2_range(). However
> >> this check can give both false negatives (when writing using direct IO
> >> or when writeback already writes out the written pagecache range) and
> >> false positives (when write is not actually overlapping but ends in the
> >> same page when blocksize < pagesize). This actually causes issues for
> >> qemu which is getting confused by EBUSY errors.
> >>
> >> Fix the problem by removing this conflicting write detection since it is
> >> inherently racy and thus of little use anyway.
> >>
> >> Reported-by: Maxim Levitsky <mlevitsk@redhat.com>
> >> CC: "Darrick J. Wong" <darrick.wong@oracle.com>
> >> Link: https://lore.kernel.org/qemu-devel/20201111153913.41840-1-mlevitsk@redhat.com
> >> Signed-off-by: Jan Kara <jack@suse.cz>
> >
> > Jens, can you please pick up this patch? Thanks!
>
> Picked it up for 5.12, hope that works. It looks simple enough but not
> really meeting criteria for 5.11 at this point.
Sure, 5.12 is fine. We've been living with the current behavior for quite
some time and not many people complained...
Honza
--
Jan Kara <jack@suse.com>
SUSE Labs, CR