* [PATCH 0/2] RFC: Issue with discards on raw block device without O_DIRECT
@ 2020-11-11 15:39 Maxim Levitsky
  2020-11-11 15:39 ` [PATCH 1/2] file-posix: allow -EBUSY errors during write zeros on raw block devices Maxim Levitsky
  ` (2 more replies)
  0 siblings, 3 replies; 25+ messages in thread

From: Maxim Levitsky @ 2020-11-11 15:39 UTC (permalink / raw)
To: qemu-devel
Cc: Kevin Wolf, Jan Kara, qemu-block, Darrick J. Wong, Peter Lieven,
	Max Reitz, Paolo Bonzini, Maxim Levitsky

clone of "starship_production"

Maxim Levitsky (2):
  file-posix: allow -EBUSY errors during write zeros on raw block
    devices
  qemu-img: align next status sector on destination alignment.

 block/file-posix.c |  1 +
 qemu-img.c         | 13 ++++++++-----
 2 files changed, 9 insertions(+), 5 deletions(-)

-- 
2.26.2

^ permalink raw reply	[flat|nested] 25+ messages in thread
* [PATCH 1/2] file-posix: allow -EBUSY errors during write zeros on raw block devices
  2020-11-11 15:39 [PATCH 0/2] RFC: Issue with discards on raw block device without O_DIRECT Maxim Levitsky
@ 2020-11-11 15:39 ` Maxim Levitsky
  2020-11-16 14:48   ` Kevin Wolf
  2020-11-11 15:39 ` [PATCH 2/2] qemu-img: align next status sector on destination alignment Maxim Levitsky
  2020-11-11 15:44 ` [PATCH 0/2] RFC: Issue with discards on raw block device without O_DIRECT Maxim Levitsky
  2 siblings, 1 reply; 25+ messages in thread

From: Maxim Levitsky @ 2020-11-11 15:39 UTC (permalink / raw)
To: qemu-devel
Cc: Kevin Wolf, Jan Kara, qemu-block, Darrick J. Wong, Peter Lieven,
	Max Reitz, Paolo Bonzini, Maxim Levitsky

On Linux, fallocate(fd, FALLOC_FL_PUNCH_HOLE), when used on a block
device without O_DIRECT, can return -EBUSY if it races with another
write to the same page.

Since this is rare and discard is not a critical operation, ignore
this error.

Signed-off-by: Maxim Levitsky <mlevitsk@redhat.com>
---
 block/file-posix.c | 1 +
 1 file changed, 1 insertion(+)

diff --git a/block/file-posix.c b/block/file-posix.c
index c63926d592..d5fd1dbcd2 100644
--- a/block/file-posix.c
+++ b/block/file-posix.c
@@ -1698,6 +1698,7 @@ static int handle_aiocb_write_zeroes_unmap(void *opaque)
     switch (ret) {
     case -ENOTSUP:
     case -EINVAL:
+    case -EBUSY:
         break;
     default:
         return ret;
-- 
2.26.2

^ permalink raw reply related	[flat|nested] 25+ messages in thread
* Re: [PATCH 1/2] file-posix: allow -EBUSY errors during write zeros on raw block devices
  2020-11-11 15:39 ` [PATCH 1/2] file-posix: allow -EBUSY errors during write zeros on raw block devices Maxim Levitsky
@ 2020-11-16 14:48   ` Kevin Wolf
  2021-01-07 12:44     ` Maxim Levitsky
  0 siblings, 1 reply; 25+ messages in thread

From: Kevin Wolf @ 2020-11-16 14:48 UTC (permalink / raw)
To: Maxim Levitsky
Cc: Jan Kara, qemu-block, Darrick J. Wong, Peter Lieven, qemu-devel,
	Max Reitz, Paolo Bonzini

Am 11.11.2020 um 16:39 hat Maxim Levitsky geschrieben:
> On Linux, fallocate(fd, FALLOC_FL_PUNCH_HOLE), when used on a block
> device without O_DIRECT, can return -EBUSY if it races with another
> write to the same page.
>
> Since this is rare and discard is not a critical operation, ignore
> this error.
>
> Signed-off-by: Maxim Levitsky <mlevitsk@redhat.com>

I'm applying this one for 5.2, it certainly shouldn't hurt and makes
things work at least, even if possibly not in the optimal way.

Patch 2 seems to be a bit less obvious and discussion is ongoing, so
that's probably more 6.0 material.

Kevin

^ permalink raw reply	[flat|nested] 25+ messages in thread
* Re: [PATCH 1/2] file-posix: allow -EBUSY errors during write zeros on raw block devices
  2020-11-16 14:48 ` Kevin Wolf
@ 2021-01-07 12:44   ` Maxim Levitsky
  0 siblings, 0 replies; 25+ messages in thread

From: Maxim Levitsky @ 2021-01-07 12:44 UTC (permalink / raw)
To: Kevin Wolf
Cc: Jan Kara, qemu-block, Darrick J. Wong, Peter Lieven, qemu-devel,
	Max Reitz, Paolo Bonzini

On Mon, 2020-11-16 at 15:48 +0100, Kevin Wolf wrote:
> Am 11.11.2020 um 16:39 hat Maxim Levitsky geschrieben:
> > On Linux, fallocate(fd, FALLOC_FL_PUNCH_HOLE), when used on a block
> > device without O_DIRECT, can return -EBUSY if it races with another
> > write to the same page.
> >
> > Since this is rare and discard is not a critical operation, ignore
> > this error.
> >
> > Signed-off-by: Maxim Levitsky <mlevitsk@redhat.com>
>
> I'm applying this one for 5.2, it certainly shouldn't hurt and makes
> things work at least, even if possibly not in the optimal way.
>
> Patch 2 seems to be a bit less obvious and discussion is ongoing, so
> that's probably more 6.0 material.
>
> Kevin

Any feedback on patch 2?

Best regards,
	Maxim Levitsky

^ permalink raw reply	[flat|nested] 25+ messages in thread
* [PATCH 2/2] qemu-img: align next status sector on destination alignment.
  2020-11-11 15:39 [PATCH 0/2] RFC: Issue with discards on raw block device without O_DIRECT Maxim Levitsky
  2020-11-11 15:39 ` [PATCH 1/2] file-posix: allow -EBUSY errors during write zeros on raw block devices Maxim Levitsky
@ 2020-11-11 15:39 ` Maxim Levitsky
  2020-11-12 12:40   ` Peter Lieven
  2020-11-11 15:44 ` [PATCH 0/2] RFC: Issue with discards on raw block device without O_DIRECT Maxim Levitsky
  2 siblings, 1 reply; 25+ messages in thread

From: Maxim Levitsky @ 2020-11-11 15:39 UTC (permalink / raw)
To: qemu-devel
Cc: Kevin Wolf, Jan Kara, qemu-block, Darrick J. Wong, Peter Lieven,
	Max Reitz, Paolo Bonzini, Maxim Levitsky

This helps avoid unneeded writes and discards.

Signed-off-by: Maxim Levitsky <mlevitsk@redhat.com>
---
 qemu-img.c | 13 ++++++++-----
 1 file changed, 8 insertions(+), 5 deletions(-)

diff --git a/qemu-img.c b/qemu-img.c
index c2c56fc797..7e9b0f659f 100644
--- a/qemu-img.c
+++ b/qemu-img.c
@@ -1722,7 +1722,7 @@ static void convert_select_part(ImgConvertState *s, int64_t sector_num,
 static int convert_iteration_sectors(ImgConvertState *s, int64_t sector_num)
 {
     int64_t src_cur_offset;
-    int ret, n, src_cur;
+    int ret, n, src_cur, alignment;
     bool post_backing_zero = false;

     convert_select_part(s, sector_num, &src_cur, &src_cur_offset);
@@ -1785,11 +1785,14 @@ static int convert_iteration_sectors(ImgConvertState *s, int64_t sector_num)
     n = DIV_ROUND_UP(count, BDRV_SECTOR_SIZE);

     /*
-     * Avoid that s->sector_next_status becomes unaligned to the source
-     * request alignment and/or cluster size to avoid unnecessary read
-     * cycles.
+     * Avoid that s->sector_next_status becomes unaligned to the
+     * source/destination request alignment and/or cluster size to avoid
+     * unnecessary read/write cycles.
      */
-    tail = (sector_num - src_cur_offset + n) % s->src_alignment[src_cur];
+    alignment = MAX(s->src_alignment[src_cur], s->alignment);
+    assert(is_power_of_2(alignment));
+
+    tail = (sector_num - src_cur_offset + n) % alignment;
     if (n > tail) {
         n -= tail;
     }
-- 
2.26.2

^ permalink raw reply related	[flat|nested] 25+ messages in thread
* Re: [PATCH 2/2] qemu-img: align next status sector on destination alignment.
  2020-11-11 15:39 ` [PATCH 2/2] qemu-img: align next status sector on destination alignment Maxim Levitsky
@ 2020-11-12 12:40   ` Peter Lieven
  2020-11-12 13:45     ` Eric Blake
  0 siblings, 1 reply; 25+ messages in thread

From: Peter Lieven @ 2020-11-12 12:40 UTC (permalink / raw)
To: Maxim Levitsky, qemu-devel
Cc: Kevin Wolf, Jan Kara, qemu-block, Darrick J. Wong, Max Reitz,
	Paolo Bonzini

Am 11.11.20 um 16:39 schrieb Maxim Levitsky:
> This helps avoid unneeded writes and discards.
>
> Signed-off-by: Maxim Levitsky <mlevitsk@redhat.com>
> ---
>  qemu-img.c | 13 ++++++++-----
>  1 file changed, 8 insertions(+), 5 deletions(-)
>
> diff --git a/qemu-img.c b/qemu-img.c
> index c2c56fc797..7e9b0f659f 100644
> --- a/qemu-img.c
> +++ b/qemu-img.c
> @@ -1722,7 +1722,7 @@ static void convert_select_part(ImgConvertState *s, int64_t sector_num,
>  static int convert_iteration_sectors(ImgConvertState *s, int64_t sector_num)
>  {
>      int64_t src_cur_offset;
> -    int ret, n, src_cur;
> +    int ret, n, src_cur, alignment;
>      bool post_backing_zero = false;
>
>      convert_select_part(s, sector_num, &src_cur, &src_cur_offset);
> @@ -1785,11 +1785,14 @@ static int convert_iteration_sectors(ImgConvertState *s, int64_t sector_num)
>      n = DIV_ROUND_UP(count, BDRV_SECTOR_SIZE);
>
>      /*
> -     * Avoid that s->sector_next_status becomes unaligned to the source
> -     * request alignment and/or cluster size to avoid unnecessary read
> -     * cycles.
> +     * Avoid that s->sector_next_status becomes unaligned to the
> +     * source/destination request alignment and/or cluster size to avoid
> +     * unnecessary read/write cycles.
>      */
> -    tail = (sector_num - src_cur_offset + n) % s->src_alignment[src_cur];
> +    alignment = MAX(s->src_alignment[src_cur], s->alignment);
> +    assert(is_power_of_2(alignment));
> +
> +    tail = (sector_num - src_cur_offset + n) % alignment;
>      if (n > tail) {
>          n -= tail;
>      }

I was also considering including the s->alignment when adding this
change. However, you need the least common multiple of both alignments,
not the maximum, otherwise you might get misaligned to either source or
destination.

Why exactly do you need the power of two requirement?

Peter

^ permalink raw reply	[flat|nested] 25+ messages in thread
* Re: [PATCH 2/2] qemu-img: align next status sector on destination alignment.
  2020-11-12 12:40 ` Peter Lieven
@ 2020-11-12 13:45   ` Eric Blake
  2020-11-12 15:04     ` Maxim Levitsky
  0 siblings, 1 reply; 25+ messages in thread

From: Eric Blake @ 2020-11-12 13:45 UTC (permalink / raw)
To: Peter Lieven, Maxim Levitsky, qemu-devel
Cc: Kevin Wolf, Jan Kara, qemu-block, Darrick J. Wong, Max Reitz,
	Paolo Bonzini

On 11/12/20 6:40 AM, Peter Lieven wrote:
>>      /*
>> -     * Avoid that s->sector_next_status becomes unaligned to the source
>> -     * request alignment and/or cluster size to avoid unnecessary read
>> -     * cycles.
>> +     * Avoid that s->sector_next_status becomes unaligned to the
>> +     * source/destination request alignment and/or cluster size to avoid
>> +     * unnecessary read/write cycles.
>>      */
>> -    tail = (sector_num - src_cur_offset + n) % s->src_alignment[src_cur];
>> +    alignment = MAX(s->src_alignment[src_cur], s->alignment);
>> +    assert(is_power_of_2(alignment));
>> +
>> +    tail = (sector_num - src_cur_offset + n) % alignment;
>>      if (n > tail) {
>>          n -= tail;
>>      }
>
> I was also considering including the s->alignment when adding this
> change. However, you need the least common multiple of both alignments,
> not the maximum, otherwise you might get misaligned to either source or
> destination.
>
> Why exactly do you need the power of two requirement?

The power of two requirement ensures that you have the least common
multiple of both alignments ;)

However, there ARE devices in practice that have advertised a
non-power-of-2 discard granularity, such as 15MiB (see commits 430b26a8,
63188c24).  Which means you probably don't want to assert power-of-2,
and in turn need to worry about least common multiple.

-- 
Eric Blake, Principal Software Engineer
Red Hat, Inc.           +1-919-301-3226
Virtualization:  qemu.org | libvirt.org

^ permalink raw reply	[flat|nested] 25+ messages in thread
* Re: [PATCH 2/2] qemu-img: align next status sector on destination alignment.
  2020-11-12 13:45 ` Eric Blake
@ 2020-11-12 15:04   ` Maxim Levitsky
  2021-03-18  9:57     ` Maxim Levitsky
  0 siblings, 1 reply; 25+ messages in thread

From: Maxim Levitsky @ 2020-11-12 15:04 UTC (permalink / raw)
To: Eric Blake, Peter Lieven, qemu-devel
Cc: Kevin Wolf, Jan Kara, qemu-block, Darrick J. Wong, Max Reitz,
	Paolo Bonzini

On Thu, 2020-11-12 at 07:45 -0600, Eric Blake wrote:
> On 11/12/20 6:40 AM, Peter Lieven wrote:
> > >      /*
> > > -     * Avoid that s->sector_next_status becomes unaligned to the source
> > > -     * request alignment and/or cluster size to avoid unnecessary read
> > > -     * cycles.
> > > +     * Avoid that s->sector_next_status becomes unaligned to the
> > > +     * source/destination request alignment and/or cluster size to avoid
> > > +     * unnecessary read/write cycles.
> > >      */
> > > -    tail = (sector_num - src_cur_offset + n) % s->src_alignment[src_cur];
> > > +    alignment = MAX(s->src_alignment[src_cur], s->alignment);
> > > +    assert(is_power_of_2(alignment));
> > > +
> > > +    tail = (sector_num - src_cur_offset + n) % alignment;
> > >      if (n > tail) {
> > >          n -= tail;
> > >      }
> >
> > I was also considering including the s->alignment when adding this
> > change. However, you need the least common multiple of both alignments,
> > not the maximum, otherwise you might get misaligned to either source or
> > destination.
> >
> > Why exactly do you need the power of two requirement?
>
> The power of two requirement ensures that you have the least common
> multiple of both alignments ;)

Yes, and in addition to that both alignments are already powers of two
because we align them to it. The assert I added is just in case.

This is how we calculate the destination alignment:

    s.alignment = MAX(pow2floor(s.min_sparse),
                      DIV_ROUND_UP(out_bs->bl.request_alignment,
                                   BDRV_SECTOR_SIZE));

This is how we calculate the source alignments (it is possible to have
several source files):

    s.src_alignment[bs_i] = DIV_ROUND_UP(src_bs->bl.request_alignment,
                                         BDRV_SECTOR_SIZE);
    if (!bdrv_get_info(src_bs, &bdi)) {
        s.src_alignment[bs_i] = MAX(s.src_alignment[bs_i],
                                    bdi.cluster_size / BDRV_SECTOR_SIZE);
    }

The bl.request_alignment comment mentions that it must be a power of 2,
and cluster sizes are also very likely to be a power of 2 in all drivers
we support. An assert for s.src_alignment won't hurt though.

Note though that we don't really read the discard alignment.
We have max_pdiscard and pdiscard_alignment, but max_pdiscard is only
used to split 'too large' discard requests, and pdiscard_alignment is
advisory and used only in a couple of places. Neither are set by
file-posix.

We do have request_alignment, which file-posix tries to align to the
logical block size, which is still often 512 for backward compatibility
on many devices (even NVMe).

Now both source and destination alignments in qemu-img convert are based
on request_alignment and on min_sparse (which is by default 4K,
controlled by the qemu-img -S parameter, which otherwise controls the
minimum number of blocks that have to be zero to consider discarding
them in convert).

This means that there is no guarantee to have 4K alignment, and besides,
with a sufficiently fragmented source file, bdrv_block_status_above can
return arbitrarily short and unaligned extents which can't be aligned,
thus as I said this patch alone can't guarantee that we won't have a
write and a discard sharing the same page.

Another thing that can be discussed is why is_allocated_sectors was
patched to convert short discards to writes. Probably because underlying
hardware ignores them or something? In theory we should detect this and
fall back to regular zeroing in this case.

Again though, while in the case of converting an empty file this is the
only source of writes, and eliminating it also 'fixes' this case, with a
sufficiently fragmented source file we can get a write and a discard
sharing a page even without this code.

Best regards,
	Maxim Levitsky

> However, there ARE devices in practice that have advertised a
> non-power-of-2 discard granularity, such as 15MiB (see commits 430b26a8,
> 63188c24).  Which means you probably don't want to assert power-of-2,
> and in turn need to worry about least common multiple.

^ permalink raw reply	[flat|nested] 25+ messages in thread
* Re: [PATCH 2/2] qemu-img: align next status sector on destination alignment.
  2020-11-12 15:04 ` Maxim Levitsky
@ 2021-03-18  9:57   ` Maxim Levitsky
  0 siblings, 0 replies; 25+ messages in thread

From: Maxim Levitsky @ 2021-03-18 9:57 UTC (permalink / raw)
To: Eric Blake, Peter Lieven, qemu-devel
Cc: Kevin Wolf, Jan Kara, qemu-block, Darrick J. Wong, Max Reitz,
	Paolo Bonzini

On Thu, 2020-11-12 at 17:04 +0200, Maxim Levitsky wrote:
> On Thu, 2020-11-12 at 07:45 -0600, Eric Blake wrote:
> > On 11/12/20 6:40 AM, Peter Lieven wrote:
> > > >      /*
> > > > -     * Avoid that s->sector_next_status becomes unaligned to the source
> > > > -     * request alignment and/or cluster size to avoid unnecessary read
> > > > -     * cycles.
> > > > +     * Avoid that s->sector_next_status becomes unaligned to the
> > > > +     * source/destination request alignment and/or cluster size to avoid
> > > > +     * unnecessary read/write cycles.
> > > >      */
> > > > -    tail = (sector_num - src_cur_offset + n) % s->src_alignment[src_cur];
> > > > +    alignment = MAX(s->src_alignment[src_cur], s->alignment);
> > > > +    assert(is_power_of_2(alignment));
> > > > +
> > > > +    tail = (sector_num - src_cur_offset + n) % alignment;
> > > >      if (n > tail) {
> > > >          n -= tail;
> > > >      }
> > >
> > > I was also considering including the s->alignment when adding this
> > > change. However, you need the least common multiple of both alignments,
> > > not the maximum, otherwise you might get misaligned to either source or
> > > destination.
> > >
> > > Why exactly do you need the power of two requirement?
> >
> > The power of two requirement ensures that you have the least common
> > multiple of both alignments ;)
>
> Yes, and in addition to that both alignments are already powers of two
> because we align them to it. The assert I added is just in case.
>
> This is how we calculate the destination alignment:
>
>     s.alignment = MAX(pow2floor(s.min_sparse),
>                       DIV_ROUND_UP(out_bs->bl.request_alignment,
>                                    BDRV_SECTOR_SIZE));
>
> This is how we calculate the source alignments (it is possible to have
> several source files):
>
>     s.src_alignment[bs_i] = DIV_ROUND_UP(src_bs->bl.request_alignment,
>                                          BDRV_SECTOR_SIZE);
>     if (!bdrv_get_info(src_bs, &bdi)) {
>         s.src_alignment[bs_i] = MAX(s.src_alignment[bs_i],
>                                     bdi.cluster_size / BDRV_SECTOR_SIZE);
>     }
>
> The bl.request_alignment comment mentions that it must be a power of 2,
> and cluster sizes are also very likely to be a power of 2 in all drivers
> we support. An assert for s.src_alignment won't hurt though.
>
> Note though that we don't really read the discard alignment.
> We have max_pdiscard and pdiscard_alignment, but max_pdiscard is only
> used to split 'too large' discard requests, and pdiscard_alignment is
> advisory and used only in a couple of places. Neither are set by
> file-posix.
>
> We do have request_alignment, which file-posix tries to align to the
> logical block size, which is still often 512 for backward compatibility
> on many devices (even NVMe).
>
> Now both source and destination alignments in qemu-img convert are based
> on request_alignment and on min_sparse (which is by default 4K,
> controlled by the qemu-img -S parameter, which otherwise controls the
> minimum number of blocks that have to be zero to consider discarding
> them in convert).
>
> This means that there is no guarantee to have 4K alignment, and besides,
> with a sufficiently fragmented source file, bdrv_block_status_above can
> return arbitrarily short and unaligned extents which can't be aligned,
> thus as I said this patch alone can't guarantee that we won't have a
> write and a discard sharing the same page.
>
> Another thing that can be discussed is why is_allocated_sectors was
> patched to convert short discards to writes. Probably because underlying
> hardware ignores them or something? In theory we should detect this and
> fall back to regular zeroing in this case.
>
> Again though, while in the case of converting an empty file this is the
> only source of writes, and eliminating it also 'fixes' this case, with a
> sufficiently fragmented source file we can get a write and a discard
> sharing a page even without this code.
>
> Best regards,
> 	Maxim Levitsky
>
> > However, there ARE devices in practice that have advertised a
> > non-power-of-2 discard granularity, such as 15MiB (see commits 430b26a8,
> > 63188c24).  Which means you probably don't want to assert power-of-2,
> > and in turn need to worry about least common multiple.

Any update on this patch?

Best regards,
	Maxim Levitsky

^ permalink raw reply	[flat|nested] 25+ messages in thread
* Re: [PATCH 0/2] RFC: Issue with discards on raw block device without O_DIRECT
  2020-11-11 15:39 [PATCH 0/2] RFC: Issue with discards on raw block device without O_DIRECT Maxim Levitsky
  2020-11-11 15:39 ` [PATCH 1/2] file-posix: allow -EBUSY errors during write zeros on raw block devices Maxim Levitsky
  2020-11-11 15:39 ` [PATCH 2/2] qemu-img: align next status sector on destination alignment Maxim Levitsky
@ 2020-11-11 15:44 ` Maxim Levitsky
  2020-11-12 11:19   ` Jan Kara
  2 siblings, 1 reply; 25+ messages in thread

From: Maxim Levitsky @ 2020-11-11 15:44 UTC (permalink / raw)
To: qemu-devel
Cc: Kevin Wolf, Jan Kara, qemu-block, Darrick J. Wong, Peter Lieven,
	Max Reitz, Paolo Bonzini

On Wed, 2020-11-11 at 17:39 +0200, Maxim Levitsky wrote:
> clone of "starship_production"

The git-publish destroyed the cover letter:

For the reference, this is for bz #1872633.

The issue is that the current kernel code that implements 'fallocate'
on block devices roughly works like this:

1. Flush the page cache on the range that is about to be discarded.

2. Issue the discard and wait for it to finish.
   (as far as I can see, the discard doesn't go through the page cache)

3. Check if the page cache is dirty for this range; if it is dirty
   (meaning that someone wrote to it meanwhile), return -EBUSY.

This means that if qemu (or qemu-img) issues a write, and then a
discard to an area that shares a page with the write, -EBUSY can be
returned by the kernel.

On the other hand, for example, the ext4 implementation of discard
doesn't seem to be affected. It does take a lock on the inode to avoid
concurrent IO, and flushes O_DIRECT writers prior to doing the discard,
though.

Doing fsync and retrying seems to resolve this issue, but it might be
too big a hammer. Just retrying doesn't work, indicating that maybe
the code that flushes the page cache in (1) doesn't do this correctly?

It also can be racy unless special means are taken to block IO from
qemu during this fsync.

This patch series contains two patches:

The first patch just lets file-posix ignore the -EBUSY errors, which
is technically enough to fall back to a plain write in this case, but
seems wrong.

The second patch adds an optimization to qemu-img to avoid such a
fragmented write/discard in the first place.

Both patches make the reproducer work for this particular bugzilla,
but I don't think they are enough.

What do you think?

Best regards,
	Maxim Levitsky

^ permalink raw reply	[flat|nested] 25+ messages in thread
* Re: [PATCH 0/2] RFC: Issue with discards on raw block device without O_DIRECT
  2020-11-11 15:44 ` [PATCH 0/2] RFC: Issue with discards on raw block device without O_DIRECT Maxim Levitsky
@ 2020-11-12 11:19   ` Jan Kara
  0 siblings, 0 replies; 25+ messages in thread

From: Jan Kara @ 2020-11-12 11:19 UTC (permalink / raw)
To: Maxim Levitsky
Cc: qemu-devel, Kevin Wolf, Peter Lieven, Jan Kara, Paolo Bonzini,
	Max Reitz, Darrick J. Wong, qemu-block, linux-block,
	Christoph Hellwig, Jens Axboe

[added some relevant people and lists to CC]

On Wed 11-11-20 17:44:05, Maxim Levitsky wrote:
> On Wed, 2020-11-11 at 17:39 +0200, Maxim Levitsky wrote:
> > clone of "starship_production"
>
> The git-publish destroyed the cover letter:
>
> For the reference, this is for bz #1872633.
>
> The issue is that the current kernel code that implements 'fallocate'
> on block devices roughly works like this:
>
> 1. Flush the page cache on the range that is about to be discarded.
>
> 2. Issue the discard and wait for it to finish.
>    (as far as I can see, the discard doesn't go through the page cache)
>
> 3. Check if the page cache is dirty for this range; if it is dirty
>    (meaning that someone wrote to it meanwhile), return -EBUSY.
>
> This means that if qemu (or qemu-img) issues a write, and then a
> discard to an area that shares a page with the write, -EBUSY can be
> returned by the kernel.

Indeed, if you don't submit PAGE_SIZE aligned discards, you can get back
EBUSY which seems wrong to me. IMO we should handle this gracefully in the
kernel so we need to fix this.

> On the other hand, for example, the ext4 implementation of discard
> doesn't seem to be affected. It does take a lock on the inode to avoid
> concurrent IO, and flushes O_DIRECT writers prior to doing the discard,
> though.

Well, filesystem hole punching is somewhat different beast than block
device discard (at least implementation wise).

> Doing fsync and retrying seems to resolve this issue, but it might be
> too big a hammer. Just retrying doesn't work, indicating that maybe
> the code that flushes the page cache in (1) doesn't do this correctly?
>
> It also can be racy unless special means are taken to block IO from
> qemu during this fsync.
>
> This patch series contains two patches:
>
> The first patch just lets file-posix ignore the -EBUSY errors, which
> is technically enough to fall back to a plain write in this case, but
> seems wrong.
>
> The second patch adds an optimization to qemu-img to avoid such a
> fragmented write/discard in the first place.
>
> Both patches make the reproducer work for this particular bugzilla,
> but I don't think they are enough.
>
> What do you think?

So if the EBUSY error happens because something happened to the page cache
outside of discarded range (like you describe above), that is a kernel bug
that needs to get fixed. EBUSY should really mean - someone wrote to the
discarded range while discard was running and userspace app has to deal
with that depending on what it aims to do...

								Honza
-- 
Jan Kara <jack@suse.com>
SUSE Labs, CR

^ permalink raw reply	[flat|nested] 25+ messages in thread
* Re: [PATCH 0/2] RFC: Issue with discards on raw block device without O_DIRECT
  2020-11-12 11:19 ` Jan Kara
@ 2020-11-12 12:00   ` Jan Kara
  0 siblings, 0 replies; 25+ messages in thread

From: Jan Kara @ 2020-11-12 12:00 UTC (permalink / raw)
To: Maxim Levitsky
Cc: qemu-devel, Kevin Wolf, Peter Lieven, Jan Kara, Paolo Bonzini,
	Max Reitz, Darrick J. Wong, qemu-block, linux-block,
	Christoph Hellwig, Jens Axboe

On Thu 12-11-20 12:19:51, Jan Kara wrote:
> [added some relevant people and lists to CC]
>
> On Wed 11-11-20 17:44:05, Maxim Levitsky wrote:
> > The issue is that the current kernel code that implements 'fallocate'
> > on block devices roughly works like this:
> >
> > 1. Flush the page cache on the range that is about to be discarded.
> >
> > 2. Issue the discard and wait for it to finish.
> >    (as far as I can see, the discard doesn't go through the page cache)
> >
> > 3. Check if the page cache is dirty for this range; if it is dirty
> >    (meaning that someone wrote to it meanwhile), return -EBUSY.
> >
> > This means that if qemu (or qemu-img) issues a write, and then a
> > discard to an area that shares a page with the write, -EBUSY can be
> > returned by the kernel.
>
> Indeed, if you don't submit PAGE_SIZE aligned discards, you can get back
> EBUSY which seems wrong to me. IMO we should handle this gracefully in the
> kernel so we need to fix this.
>
> > On the other hand, for example, the ext4 implementation of discard
> > doesn't seem to be affected. It does take a lock on the inode to avoid
> > concurrent IO, and flushes O_DIRECT writers prior to doing the discard,
> > though.
>
> Well, filesystem hole punching is somewhat different beast than block
> device discard (at least implementation wise).
>
> > Doing fsync and retrying seems to resolve this issue, but it might be
> > too big a hammer. Just retrying doesn't work, indicating that maybe
> > the code that flushes the page cache in (1) doesn't do this correctly?
> >
> > It also can be racy unless special means are taken to block IO from
> > qemu during this fsync.
> >
> > Both patches make the reproducer work for this particular bugzilla,
> > but I don't think they are enough.
> >
> > What do you think?
>
> So if the EBUSY error happens because something happened to the page cache
> outside of discarded range (like you describe above), that is a kernel bug
> that needs to get fixed. EBUSY should really mean - someone wrote to the
> discarded range while discard was running and userspace app has to deal
> with that depending on what it aims to do...

So I was looking what it would take to fix this inside the kernel. The
problem is that invalidate_inode_pages2_range() is working on page
granularity and it is non-trivial to extend it to work on byte granularity
since we don't support something like "try to reclaim part of a page". But
I'm also somewhat wondering why we use invalidate_inode_pages2_range()
here instead of truncate_inode_pages_range() again? I mean the EBUSY
detection cannot be reliable anyway and userspace has no way of knowing
whether a write happened before discard or after it so just discarding
data is fine from this point of view. Darrick?

								Honza
-- 
Jan Kara <jack@suse.com>
SUSE Labs, CR

^ permalink raw reply	[flat|nested] 25+ messages in thread
* Re: [PATCH 0/2] RFC: Issue with discards on raw block device without O_DIRECT 2020-11-12 12:00 ` Jan Kara @ 2020-11-12 22:08 ` Darrick J. Wong -1 siblings, 0 replies; 25+ messages in thread From: Darrick J. Wong @ 2020-11-12 22:08 UTC (permalink / raw) To: Jan Kara Cc: Maxim Levitsky, qemu-devel, Kevin Wolf, Peter Lieven, Paolo Bonzini, Max Reitz, qemu-block, linux-block, Christoph Hellwig, Jens Axboe On Thu, Nov 12, 2020 at 01:00:56PM +0100, Jan Kara wrote: > On Thu 12-11-20 12:19:51, Jan Kara wrote: > > [added some relevant people and lists to CC] > > > > On Wed 11-11-20 17:44:05, Maxim Levitsky wrote: > > > On Wed, 2020-11-11 at 17:39 +0200, Maxim Levitsky wrote: > > > > clone of "starship_production" > > > > > > The git-publish destroyed the cover letter: > > > > > > For the reference this is for bz #1872633 > > > > > > The issue is that current kernel code that implements 'fallocate' > > > on kernel block devices roughly works like that: > > > > > > 1. Flush the page cache on the range that is about to be discarded. > > > 2. Issue the discard and wait for it to finish. > > > (as far as I can see the discard doesn't go through the > > > page cache). > > > > > > 3. Check if the page cache is dirty for this range, > > > if it is dirty (meaning that someone wrote to it meanwhile) > > > return -EBUSY. > > > > > > This means that if qemu (or qemu-img) issues a write, and then > > > discard to the area that shares a page, -EBUSY can be returned by > > > the kernel. > > > > Indeed, if you don't submit PAGE_SIZE aligned discards, you can get back > > EBUSY which seems wrong to me. IMO we should handle this gracefully in the > > kernel so we need to fix this. > > > > > On the other hand, for example, the ext4 implementation of discard > > > doesn't seem to be affected. It does take a lock on the inode to avoid > > > concurrent IO and flushes O_DIRECT writers prior to doing discard thought. 
> > > > Well, filesystem hole punching is somewhat different beast than block device > > discard (at least implementation wise). > > > > > Doing fsync and retrying is seems to resolve this issue, but it might be > > > a too big hammer. Just retrying doesn't work, indicating that maybe the > > > code that flushes the page cache in (1) doesn't do this correctly ? > > > > > > It also can be racy unless special means are done to block IO from happening > > > from qemu during this fsync. > > > > > > This patch series contains two patches: > > > > > > First patch just lets the file-posix ignore the -EBUSY errors, which is > > > technically enough to fail back to plain write in this case, but seems wrong. > > > > > > And the second patch adds an optimization to qemu-img to avoid such a > > > fragmented write/discard in the first place. > > > > > > Both patches make the reproducer work for this particular bugzilla, > > > but I don't think they are enough. > > > > > > What do you think? > > > > So if the EBUSY error happens because something happened to the page cache > > outside of discarded range (like you describe above), that is a kernel bug > > than needs to get fixed. EBUSY should really mean - someone wrote to the > > discarded range while discard was running and userspace app has to deal > > with that depending on what it aims to do... > > So I was looking what it would take to fix this inside the kernel. The > problem is that invalidate_inode_pages2_range() is working on page > granularity and it is non-trivial to extend it to work on byte granularity > since we don't support something like "try to reclaim part of a page". But > I'm also somewhat wondering why we use invalidate_inode_pages2_range() here > instead of truncate_inode_pages_range() again? I mean the EBUSY detection > cannot be reliable anyway and userspace has no way of knowing whether a > write happened before discard or after it so just discarding data is fine > from this point of view. Darrick? 
Hmmm, I think I overlooked the fact that we can do buffered writes into a block device's pagecache without taking any of the usual locks that have to be held for filesystem files. This is essentially a race between a not-page-aligned fallocate and a buffered write to a different sector that is mapped by a page that caches part of the fallocate range. So yes, Jan is right that we need to use truncate_bdev_range instead of invalidate_inode_pages2_range because the former will zero the sub-page ranges on either end of the fallocate request instead of returning -EBUSY because someone dirtied a part of a page that wasn't involved in the fallocate operation. I /probably/ just copy-pasta'd that invalidation call from directio without thinking hard enough about it, sorry about that. :( --D > > Honza > -- > Jan Kara <jack@suse.com> > SUSE Labs, CR ^ permalink raw reply [flat|nested] 25+ messages in thread
* Re: [PATCH 0/2] RFC: Issue with discards on raw block device without O_DIRECT 2020-11-12 22:08 ` Darrick J. Wong @ 2020-12-07 17:23 ` Maxim Levitsky -1 siblings, 0 replies; 25+ messages in thread From: Maxim Levitsky @ 2020-12-07 17:23 UTC (permalink / raw) To: Darrick J. Wong, Jan Kara Cc: qemu-devel, Kevin Wolf, Peter Lieven, Paolo Bonzini, Max Reitz, qemu-block, linux-block, Christoph Hellwig, Jens Axboe On Thu, 2020-11-12 at 14:08 -0800, Darrick J. Wong wrote: > On Thu, Nov 12, 2020 at 01:00:56PM +0100, Jan Kara wrote: > > On Thu 12-11-20 12:19:51, Jan Kara wrote: > > > [added some relevant people and lists to CC] > > > > > > On Wed 11-11-20 17:44:05, Maxim Levitsky wrote: > > > > On Wed, 2020-11-11 at 17:39 +0200, Maxim Levitsky wrote: > > > > > clone of "starship_production" > > > > > > > > The git-publish destroyed the cover letter: > > > > > > > > For the reference this is for bz #1872633 > > > > > > > > The issue is that current kernel code that implements 'fallocate' > > > > on kernel block devices roughly works like that: > > > > > > > > 1. Flush the page cache on the range that is about to be discarded. > > > > 2. Issue the discard and wait for it to finish. > > > > (as far as I can see the discard doesn't go through the > > > > page cache). > > > > > > > > 3. Check if the page cache is dirty for this range, > > > > if it is dirty (meaning that someone wrote to it meanwhile) > > > > return -EBUSY. > > > > > > > > This means that if qemu (or qemu-img) issues a write, and then > > > > discard to the area that shares a page, -EBUSY can be returned by > > > > the kernel. > > > > > > Indeed, if you don't submit PAGE_SIZE aligned discards, you can get back > > > EBUSY which seems wrong to me. IMO we should handle this gracefully in the > > > kernel so we need to fix this. > > > > > > > On the other hand, for example, the ext4 implementation of discard > > > > doesn't seem to be affected. 
It does take a lock on the inode to avoid > > > > concurrent IO and flushes O_DIRECT writers prior to doing discard thought. > > > > > > Well, filesystem hole punching is somewhat different beast than block device > > > discard (at least implementation wise). > > > > > > > Doing fsync and retrying is seems to resolve this issue, but it might be > > > > a too big hammer. Just retrying doesn't work, indicating that maybe the > > > > code that flushes the page cache in (1) doesn't do this correctly ? > > > > > > > > It also can be racy unless special means are done to block IO from happening > > > > from qemu during this fsync. > > > > > > > > This patch series contains two patches: > > > > > > > > First patch just lets the file-posix ignore the -EBUSY errors, which is > > > > technically enough to fail back to plain write in this case, but seems wrong. > > > > > > > > And the second patch adds an optimization to qemu-img to avoid such a > > > > fragmented write/discard in the first place. > > > > > > > > Both patches make the reproducer work for this particular bugzilla, > > > > but I don't think they are enough. > > > > > > > > What do you think? > > > > > > So if the EBUSY error happens because something happened to the page cache > > > outside of discarded range (like you describe above), that is a kernel bug > > > than needs to get fixed. EBUSY should really mean - someone wrote to the > > > discarded range while discard was running and userspace app has to deal > > > with that depending on what it aims to do... > > > > So I was looking what it would take to fix this inside the kernel. The > > problem is that invalidate_inode_pages2_range() is working on page > > granularity and it is non-trivial to extend it to work on byte granularity > > since we don't support something like "try to reclaim part of a page". But > > I'm also somewhat wondering why we use invalidate_inode_pages2_range() here > > instead of truncate_inode_pages_range() again? 
I mean the EBUSY detection > > cannot be reliable anyway and userspace has no way of knowing whether a > > write happened before discard or after it so just discarding data is fine > > from this point of view. Darrick? > > Hmmm, I think I overlooked the fact that we can do buffered writes into > a block device's pagecache without taking any of the usual locks that > have to be held for filesystem files. This is essentially a race > between a not-page-aligned fallocate and a buffered write to a different > sector that is mapped by a page that caches part of the fallocate range. > > So yes, Jan is right that we need to use truncate_bdev_range instead of > invalidate_inode_pages2_range because the former will zero the sub-page > ranges on either end of the fallocate request instead of returning > -EBUSY because someone dirtied a part of a page that wasn't involved in > the fallocate operation. > > I /probably/ just copy-pasta'd that invalidation call from directio > without thinking hard enough about it, sorry about that. :( Any update on this? Today I took a look at both truncate_bdev_range and invalidate_inode_pages2_range. I see that truncate_bdev_range can't really fail, other than the check for a mounted filesystem. It calls truncate_inode_pages_range, which writes back all dirty pages and waits for the writeback to finish, so it won't detect dirty pages the way invalidate_inode_pages2_range does. Also, AFAIK the kernel page cache only tracks dirtiness per whole page (e.g. a 512-byte write will cause the whole page to be written back, unless the write was done using O_DIRECT). So if a page that is partially discarded becomes dirty, we can't really know whether someone wrote to the discarded region or only near it. I vote to keep the call to invalidate_inode_pages2_range but exclude the non-page-aligned areas from it. This way we will still do a best-effort detection of this case, while not causing any false positives. If you agree, I'll send a patch for this.
Best regards, Maxim Levitsky > > --D > > > Honza > > -- > > Jan Kara <jack@suse.com> > > SUSE Labs, CR ^ permalink raw reply [flat|nested] 25+ messages in thread
* [PATCH] block: fallocate: avoid false positive on collision detection 2020-12-07 17:23 ` Maxim Levitsky (?) @ 2021-01-07 12:40 ` Maxim Levitsky 2021-01-07 12:42 ` Maxim Levitsky 2021-01-07 15:37 ` Jan Kara -1 siblings, 2 replies; 25+ messages in thread
From: Maxim Levitsky @ 2021-01-07 12:40 UTC (permalink / raw)
To: linux-kernel
Cc: Darrick J. Wong, Alexander Viro, open list:FILESYSTEMS (VFS and infrastructure), Jan Kara, Maxim Levitsky

Align start and end on page boundaries before calling
invalidate_inode_pages2_range.

This might allow us to miss a collision if the write and the discard were
done to the same page and do overlap, but it is still better than returning
-EBUSY if those writes didn't overlap.

Signed-off-by: Maxim Levitsky <mlevitsk@redhat.com>
---
 fs/block_dev.c | 20 ++++++++++++++++----
 1 file changed, 16 insertions(+), 4 deletions(-)

diff --git a/fs/block_dev.c b/fs/block_dev.c
index 9e84b1928b94..97f0d16661b5 100644
--- a/fs/block_dev.c
+++ b/fs/block_dev.c
@@ -1970,6 +1970,7 @@ static long blkdev_fallocate(struct file *file, int mode, loff_t start,
 	loff_t end = start + len - 1;
 	loff_t isize;
 	int error;
+	pgoff_t invalidate_first_page, invalidate_last_page;
 
 	/* Fail if we don't recognize the flags. */
 	if (mode & ~BLKDEV_FALLOC_FL_SUPPORTED)
@@ -2020,12 +2021,23 @@ static long blkdev_fallocate(struct file *file, int mode, loff_t start,
 
 	/*
 	 * Invalidate again; if someone wandered in and dirtied a page,
-	 * the caller will be given -EBUSY.  The third argument is
-	 * inclusive, so the rounding here is safe.
+	 * the caller will be given -EBUSY.
+	 *
+	 * If the start/end of the range is not page aligned, exclude the
+	 * non aligned regions to avoid false positives.
 	 */
+	invalidate_first_page = DIV_ROUND_UP(start, PAGE_SIZE);
+	invalidate_last_page = end >> PAGE_SHIFT;
+
+	if ((end + 1) & ~PAGE_MASK)
+		invalidate_last_page--;
+
+	if (invalidate_last_page < invalidate_first_page)
+		return 0;
+
 	return invalidate_inode_pages2_range(bdev->bd_inode->i_mapping,
-					     start >> PAGE_SHIFT,
-					     end >> PAGE_SHIFT);
+					     invalidate_first_page,
+					     invalidate_last_page);
 }
 
 const struct file_operations def_blk_fops = {
-- 
2.26.2

^ permalink raw reply related [flat|nested] 25+ messages in thread
* Re: [PATCH] block: fallocate: avoid false positive on collision detection 2021-01-07 12:40 ` [PATCH] block: fallocate: avoid false positive on collision detection Maxim Levitsky @ 2021-01-07 12:42 ` Maxim Levitsky 2021-01-07 15:37 ` Jan Kara 1 sibling, 0 replies; 25+ messages in thread From: Maxim Levitsky @ 2021-01-07 12:42 UTC (permalink / raw) To: linux-kernel Cc: Darrick J . Wong, Alexander Viro, open list:FILESYSTEMS (VFS and infrastructure), Jan Kara On Thu, 2021-01-07 at 14:40 +0200, Maxim Levitsky wrote: > Align start and end on page boundaries before calling > invalidate_inode_pages2_range. > > This might allow us to miss a collision if the write and the discard were done > to the same page and do overlap but it is still better than returning -EBUSY > if those writes didn't overlap. > > Signed-off-by: Maxim Levitsky <mlevitsk@redhat.com> This is my attempt to fix this issue. I am not 100% sure that this is the right solution though. Any feedback is welcome! Best regards, Maxim Levitsky ^ permalink raw reply [flat|nested] 25+ messages in thread
* Re: [PATCH] block: fallocate: avoid false positive on collision detection 2021-01-07 12:40 ` [PATCH] block: fallocate: avoid false positive on collision detection Maxim Levitsky 2021-01-07 12:42 ` Maxim Levitsky @ 2021-01-07 15:37 ` Jan Kara 1 sibling, 0 replies; 25+ messages in thread From: Jan Kara @ 2021-01-07 15:37 UTC (permalink / raw) To: Maxim Levitsky Cc: linux-kernel, Darrick J . Wong, Alexander Viro, open list:FILESYSTEMS (VFS and infrastructure), Jan Kara [-- Attachment #1: Type: text/plain, Size: 2249 bytes --] On Thu 07-01-21 14:40:22, Maxim Levitsky wrote: > Align start and end on page boundaries before calling > invalidate_inode_pages2_range. > > This might allow us to miss a collision if the write and the discard were done > to the same page and do overlap but it is still better than returning -EBUSY > if those writes didn't overlap. > > Signed-off-by: Maxim Levitsky <mlevitsk@redhat.com> Thanks for getting back to this and I'm sorry I didn't get to this earlier myself! I actually think the fix should be different as we discussed with Darrick. Attached patch should fix the issue for you (I'll also post it formally for inclusion). Honza > --- > fs/block_dev.c | 20 ++++++++++++++++---- > 1 file changed, 16 insertions(+), 4 deletions(-) > > diff --git a/fs/block_dev.c b/fs/block_dev.c > index 9e84b1928b94..97f0d16661b5 100644 > --- a/fs/block_dev.c > +++ b/fs/block_dev.c > @@ -1970,6 +1970,7 @@ static long blkdev_fallocate(struct file *file, int mode, loff_t start, > loff_t end = start + len - 1; > loff_t isize; > int error; > + pgoff_t invalidate_first_page, invalidate_last_page; > > /* Fail if we don't recognize the flags. */ > if (mode & ~BLKDEV_FALLOC_FL_SUPPORTED) > @@ -2020,12 +2021,23 @@ static long blkdev_fallocate(struct file *file, int mode, loff_t start, > > /* > * Invalidate again; if someone wandered in and dirtied a page, > - * the caller will be given -EBUSY. The third argument is > - * inclusive, so the rounding here is safe. 
> + * the caller will be given -EBUSY. > + * > + * If the start/end of the range is not page aligned, exclude the > + * non aligned regions to avoid false positives. > */ > + invalidate_first_page = DIV_ROUND_UP(start, PAGE_SIZE); > + invalidate_last_page = end >> PAGE_SHIFT; > + > + if ((end + 1) & PAGE_MASK) > + invalidate_last_page--; > + > + if (invalidate_last_page < invalidate_first_page) > + return 0; > + > return invalidate_inode_pages2_range(bdev->bd_inode->i_mapping, > - start >> PAGE_SHIFT, > - end >> PAGE_SHIFT); > + invalidate_first_page, > + invalidate_last_page); > } > > const struct file_operations def_blk_fops = { > -- > 2.26.2 > -- Jan Kara <jack@suse.com> SUSE Labs, CR [-- Attachment #2: 0001-bdev-Do-not-return-EBUSY-if-bdev-discard-races-with-.patch --] [-- Type: text/x-patch, Size: 1918 bytes --] From 36f751ac88420a6bda8a3c161986455629dc80d4 Mon Sep 17 00:00:00 2001 From: Jan Kara <jack@suse.cz> Date: Thu, 7 Jan 2021 16:26:52 +0100 Subject: [PATCH] bdev: Do not return EBUSY if bdev discard races with write blkdev_fallocate() tries to detect whether a discard raced with an overlapping write by calling invalidate_inode_pages2_range(). However this check can give both false negatives (when writing using direct IO or when writeback already writes out the written pagecache range) and false positives (when write is not actually overlapping but ends in the same page when blocksize < pagesize). This actually causes issues for qemu which is getting confused by EBUSY errors. Fix the problem by removing this conflicting write detection since it is inherently racy and thus of little use anyway. Reported-by: Maxim Levitsky <mlevitsk@redhat.com> CC: "Darrick J. 
Wong" <darrick.wong@oracle.com> Signed-off-by: Jan Kara <jack@suse.cz> --- fs/block_dev.c | 10 ++++------ 1 file changed, 4 insertions(+), 6 deletions(-) diff --git a/fs/block_dev.c b/fs/block_dev.c index 3e5b02f6606c..a97f43b49839 100644 --- a/fs/block_dev.c +++ b/fs/block_dev.c @@ -1797,13 +1797,11 @@ static long blkdev_fallocate(struct file *file, int mode, loff_t start, return error; /* - * Invalidate again; if someone wandered in and dirtied a page, - * the caller will be given -EBUSY. The third argument is - * inclusive, so the rounding here is safe. + * Invalidate the page cache again; if someone wandered in and dirtied + * a page, we just discard it - userspace has no way of knowing whether + * the write happened before or after discard completing... */ - return invalidate_inode_pages2_range(bdev->bd_inode->i_mapping, - start >> PAGE_SHIFT, - end >> PAGE_SHIFT); + return truncate_bdev_range(bdev, file->f_mode, start, end); } const struct file_operations def_blk_fops = { -- 2.26.2 ^ permalink raw reply related [flat|nested] 25+ messages in thread
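[Editor's note: the partial-page exclusion that Maxim's patch above performs with DIV_ROUND_UP()/PAGE_SHIFT can be sketched in plain Python. This is an illustration of the arithmetic only — the function name and the 4096-byte PAGE_SIZE are assumptions, not kernel code.]

```python
PAGE_SIZE = 4096  # assumed 4 KiB pages, as on x86-64


def full_page_range(start, end):
    """Page indices of the complete pages inside the inclusive byte
    range [start, end], or None when the range covers no whole page.
    first rounds start up to a page boundary (DIV_ROUND_UP); last
    rounds end+1 down and steps back to the last fully covered page."""
    first = (start + PAGE_SIZE - 1) // PAGE_SIZE
    last = (end + 1) // PAGE_SIZE - 1
    return None if last < first else (first, last)


# The fallocate that failed in the report starts mid-page at 0x7ffffe00,
# so the page it shares with the preceding write is excluded:
assert full_page_range(0x7ffffe00, 0x801fefff) == (0x80000, 0x801fe)
# A sub-page range leaves nothing to invalidate at all:
assert full_page_range(0x7ffffe00, 0x7ffffeff) is None
```

Invalidating only the fully covered pages avoids the false-positive EBUSY, at the cost (noted in the commit message) of missing a genuine collision confined to a partial page.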
* Re: [PATCH 0/2] RFC: Issue with discards on raw block device without O_DIRECT
  2020-11-12 11:19 ` Jan Kara
@ 2020-11-12 15:38 ` Maxim Levitsky
  -1 siblings, 0 replies; 25+ messages in thread
From: Maxim Levitsky @ 2020-11-12 15:38 UTC (permalink / raw)
To: Jan Kara
Cc: qemu-devel, Kevin Wolf, Peter Lieven, Paolo Bonzini, Max Reitz,
    Darrick J . Wong, qemu-block, linux-block, Christoph Hellwig,
    Jens Axboe

[-- Attachment #1: Type: text/plain, Size: 3345 bytes --]

On Thu, 2020-11-12 at 12:19 +0100, Jan Kara wrote:
> [added some relevant people and lists to CC]
>
> On Wed 11-11-20 17:44:05, Maxim Levitsky wrote:
> > On Wed, 2020-11-11 at 17:39 +0200, Maxim Levitsky wrote:
> > > clone of "starship_production"
> >
> > git-publish destroyed the cover letter:
> >
> > For reference, this is for bz #1872633
> >
> > The issue is that the current kernel code implementing 'fallocate'
> > on kernel block devices roughly works like this:
> >
> > 1. Flush the page cache on the range that is about to be discarded.
> > 2. Issue the discard and wait for it to finish.
> >    (as far as I can see the discard doesn't go through the
> >    page cache).
> > 3. Check if the page cache is dirty for this range;
> >    if it is dirty (meaning that someone wrote to it meanwhile),
> >    return -EBUSY.
> >
> > This means that if qemu (or qemu-img) issues a write, and then a
> > discard to an area that shares a page with the write, -EBUSY can be
> > returned by the kernel.
>
> Indeed, if you don't submit PAGE_SIZE-aligned discards, you can get back
> EBUSY, which seems wrong to me. IMO we should handle this gracefully in
> the kernel, so we need to fix this.
>
> > On the other hand, for example, the ext4 implementation of discard
> > doesn't seem to be affected. It does take a lock on the inode to avoid
> > concurrent IO and flushes O_DIRECT writers prior to doing the discard,
> > though.
>
> Well, filesystem hole punching is a somewhat different beast than block
> device discard (at least implementation wise).
>
> > Doing fsync and retrying seems to resolve this issue, but it might be
> > too big a hammer. Just retrying doesn't work, indicating that maybe the
> > code that flushes the page cache in (1) doesn't do this correctly?
> >
> > It can also be racy unless special measures are taken to block IO from
> > qemu while the fsync is running.
> >
> > This patch series contains two patches:
> >
> > The first patch just lets file-posix ignore the -EBUSY errors, which is
> > technically enough to fall back to a plain write in this case, but
> > seems wrong.
> >
> > And the second patch adds an optimization to qemu-img to avoid such a
> > fragmented write/discard in the first place.
> >
> > Both patches make the reproducer work for this particular bugzilla,
> > but I don't think they are enough.
> >
> > What do you think?
>
> So if the EBUSY error happens because something happened to the page
> cache outside of the discarded range (like you describe above), that is
> a kernel bug that needs to get fixed. EBUSY should really mean: someone
> wrote to the discarded range while the discard was running, and the
> userspace app has to deal with that depending on what it aims to do...

I double-checked this; these are the writes/discards according to my debug
prints (I print start and then start+len-1 for each request).
I have attached the patch for this for reference.
ZERO: 0x00007fe00000 00007fffefff (len:0x1ff000) fallocate 00007fe00000 00007fffefff WRITE: 0x00007ffff000 00007ffffdff (len:0xe00) write 00007ffff000 00007ffffdff ZERO: 0x00007ffffe00 0000801fefff (len:0x1ff200) fallocate 00007ffffe00 0000801fefff FALLOCATE failed with error 16 qemu-img: error while writing at byte 2147483136: Device or resource busy Best regards, Maxim Levitsky > > Honza [-- Attachment #2: hacks.diff --] [-- Type: text/x-patch, Size: 3079 bytes --] commit ce897250babe3527f451c1c54c86b62659a2c29e Author: Maxim Levitsky <mlevitsk@redhat.com> Date: Thu Oct 15 17:02:58 2020 +0300 hacks diff --git a/block/file-posix.c b/block/file-posix.c index c63926d592..91ca690505 100644 --- a/block/file-posix.c +++ b/block/file-posix.c @@ -1440,6 +1440,12 @@ static ssize_t handle_aiocb_rw_linear(RawPosixAIOData *aiocb, char *buf) while (offset < aiocb->aio_nbytes) { if (aiocb->aio_type & QEMU_AIO_WRITE) { + + long unsigned int size = aiocb->aio_nbytes - offset; + long unsigned int off = aiocb->aio_offset + offset; + + printf(" write %012lx %012lx\n", off, off+size-1); + len = pwrite(aiocb->aio_fildes, (const char *)buf + offset, aiocb->aio_nbytes - offset, @@ -1581,10 +1587,15 @@ static int translate_err(int err) static int do_fallocate(int fd, int mode, off_t offset, off_t len) { do { + + printf(" fallocate %012lx %012lx\n", offset, offset+len-1); + if (fallocate(fd, mode, offset, len) == 0) { return 0; } } while (errno == EINTR); + + printf("FALLOCATE failed with error %d\n", errno); return translate_err(-errno); } #endif diff --git a/qemu-img.c b/qemu-img.c index c2c56fc797..64d3b84728 100644 --- a/qemu-img.c +++ b/qemu-img.c @@ -1803,6 +1803,7 @@ static int convert_iteration_sectors(ImgConvertState *s, int64_t sector_num) } s->sector_next_status = sector_num + n; + printf("NEXT_STATUS: %lx\n", s->sector_next_status << 9); } n = MIN(n, s->sector_next_status - sector_num); @@ -1878,6 +1879,10 @@ static int coroutine_fn convert_co_read(ImgConvertState *s, int64_t 
sector_num, return 0; } +static void test_print(const char* op, unsigned long start, unsigned long len) +{ + printf("%s: 0x%012lx %012lx (len:0x%lx)\n", op, start, start + len-1, len); +} static int coroutine_fn convert_co_write(ImgConvertState *s, int64_t sector_num, int nb_sectors, uint8_t *buf, @@ -1911,6 +1916,8 @@ static int coroutine_fn convert_co_write(ImgConvertState *s, int64_t sector_num, (s->compressed && !buffer_is_zero(buf, n * BDRV_SECTOR_SIZE))) { + test_print("WRITE", sector_num << BDRV_SECTOR_BITS, n << BDRV_SECTOR_BITS); + ret = blk_co_pwrite(s->target, sector_num << BDRV_SECTOR_BITS, n << BDRV_SECTOR_BITS, buf, flags); if (ret < 0) { @@ -1925,6 +1932,8 @@ static int coroutine_fn convert_co_write(ImgConvertState *s, int64_t sector_num, assert(!s->target_has_backing); break; } + test_print("ZERO", sector_num << BDRV_SECTOR_BITS, n << BDRV_SECTOR_BITS); + ret = blk_co_pwrite_zeroes(s->target, sector_num << BDRV_SECTOR_BITS, n << BDRV_SECTOR_BITS, ^ permalink raw reply related [flat|nested] 25+ messages in thread
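[Editor's note: plugging the traced offsets into a quick page-arithmetic check (Python sketch; a 4 KiB page size is assumed) confirms why the last fallocate fails — its start is mid-page, and that page was just dirtied by the preceding write.]

```python
PAGE_SIZE = 4096  # assumed


def page_of(offset):
    """Page number containing the given byte offset."""
    return offset // PAGE_SIZE


# Offsets from the qemu-img trace above (inclusive start/end):
write_start, write_end = 0x7ffff000, 0x7ffffdff   # the 0xe00-byte write
disc_start, disc_end = 0x7ffffe00, 0x801fefff     # the fallocate that fails

# The discard start is not page aligned ...
assert disc_start % PAGE_SIZE != 0
# ... and it lands in the very page the write just dirtied, so the
# kernel's post-discard dirty-page check returns -EBUSY (16).
assert page_of(disc_start) == page_of(write_end)
```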
* Re: [PATCH 0/2] RFC: Issue with discards on raw block device without O_DIRECT
  2020-11-12 15:38 ` Maxim Levitsky
@ 2020-11-13 10:07 ` Jan Kara
  -1 siblings, 0 replies; 25+ messages in thread
From: Jan Kara @ 2020-11-13 10:07 UTC (permalink / raw)
To: Maxim Levitsky
Cc: Jan Kara, qemu-devel, Kevin Wolf, Peter Lieven, Paolo Bonzini,
    Max Reitz, Darrick J . Wong, qemu-block, linux-block,
    Christoph Hellwig, Jens Axboe

On Thu 12-11-20 17:38:36, Maxim Levitsky wrote:
> On Thu, 2020-11-12 at 12:19 +0100, Jan Kara wrote:
> > [added some relevant people and lists to CC]
> >
> > On Wed 11-11-20 17:44:05, Maxim Levitsky wrote:
> > > On Wed, 2020-11-11 at 17:39 +0200, Maxim Levitsky wrote:
> > > > clone of "starship_production"
> > >
> > > git-publish destroyed the cover letter:
> > >
> > > For reference, this is for bz #1872633
> > >
> > > The issue is that the current kernel code implementing 'fallocate'
> > > on kernel block devices roughly works like this:
> > >
> > > 1. Flush the page cache on the range that is about to be discarded.
> > > 2. Issue the discard and wait for it to finish.
> > >    (as far as I can see the discard doesn't go through the
> > >    page cache).
> > > 3. Check if the page cache is dirty for this range;
> > >    if it is dirty (meaning that someone wrote to it meanwhile),
> > >    return -EBUSY.
> > >
> > > This means that if qemu (or qemu-img) issues a write, and then a
> > > discard to an area that shares a page with the write, -EBUSY can be
> > > returned by the kernel.
> >
> > Indeed, if you don't submit PAGE_SIZE-aligned discards, you can get
> > back EBUSY, which seems wrong to me. IMO we should handle this
> > gracefully in the kernel, so we need to fix this.
> >
> > > On the other hand, for example, the ext4 implementation of discard
> > > doesn't seem to be affected. It does take a lock on the inode to
> > > avoid concurrent IO and flushes O_DIRECT writers prior to doing the
> > > discard, though.
> >
> > Well, filesystem hole punching is a somewhat different beast than
> > block device discard (at least implementation wise).
> >
> > > Doing fsync and retrying seems to resolve this issue, but it might
> > > be too big a hammer. Just retrying doesn't work, indicating that
> > > maybe the code that flushes the page cache in (1) doesn't do this
> > > correctly?
> > >
> > > It can also be racy unless special measures are taken to block IO
> > > from qemu while the fsync is running.
> > >
> > > This patch series contains two patches:
> > >
> > > The first patch just lets file-posix ignore the -EBUSY errors, which
> > > is technically enough to fall back to a plain write in this case,
> > > but seems wrong.
> > >
> > > And the second patch adds an optimization to qemu-img to avoid such
> > > a fragmented write/discard in the first place.
> > >
> > > Both patches make the reproducer work for this particular bugzilla,
> > > but I don't think they are enough.
> > >
> > > What do you think?
> >
> > So if the EBUSY error happens because something happened to the page
> > cache outside of the discarded range (like you describe above), that
> > is a kernel bug that needs to get fixed. EBUSY should really mean:
> > someone wrote to the discarded range while the discard was running,
> > and the userspace app has to deal with that depending on what it aims
> > to do...
>
> I double-checked this; these are the writes/discards according to my
> debug prints (I print start and then start+len-1 for each request).
> I have attached the patch for this for reference.
>
> ZERO: 0x00007fe00000 00007fffefff (len:0x1ff000)
>  fallocate 00007fe00000 00007fffefff

Yeah, the end at 7ffff000 is indeed not 4k aligned...

> WRITE: 0x00007ffff000 00007ffffdff (len:0xe00)
>  write 00007ffff000 00007ffffdff

... and this write follows the discarded area in the same page
(7ffff000..7ffffdff).

								Honza
--
Jan Kara <jack@suse.com>
SUSE Labs, CR

^ permalink raw reply	[flat|nested] 25+ messages in thread
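[Editor's note: Jan's point about what EBUSY "should really mean", together with the fix he posted earlier in the thread, amounts to a semantic change that can be modeled in a few lines of Python. This is a toy model by the editor, not kernel code — the page cache is reduced to a set of dirty page numbers, and the function names are invented.]

```python
EBUSY = 16


def fallocate_old(dirty_pages, first_page, last_page):
    """Pre-fix blkdev_fallocate: fail with -EBUSY if any page in the
    discarded range was dirtied while the discard was running."""
    if any(first_page <= p <= last_page for p in dirty_pages):
        return -EBUSY
    return 0


def fallocate_fixed(dirty_pages, first_page, last_page):
    """Post-fix behavior: simply drop the racing pages from the cache.
    Userspace cannot order a racing write against the discard anyway."""
    dirty_pages.difference_update(
        p for p in set(dirty_pages) if first_page <= p <= last_page)
    return 0


cache = {0x7ffff}  # page dirtied by the racing write from the trace
assert fallocate_old(set(cache), 0x7ffff, 0x801fe) == -EBUSY
assert fallocate_fixed(cache, 0x7ffff, 0x801fe) == 0 and cache == set()
```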
end of thread, other threads:[~2021-03-18  9:58 UTC | newest]

Thread overview: 25+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2020-11-11 15:39 [PATCH 0/2] RFC: Issue with discards on raw block device without O_DIRECT Maxim Levitsky
2020-11-11 15:39 ` [PATCH 1/2] file-posix: allow -EBUSY errors during write zeros on raw block devices Maxim Levitsky
2020-11-16 14:48   ` Kevin Wolf
2021-01-07 12:44     ` Maxim Levitsky
2020-11-11 15:39 ` [PATCH 2/2] qemu-img: align next status sector on destination alignment Maxim Levitsky
2020-11-12 12:40   ` Peter Lieven
2020-11-12 13:45     ` Eric Blake
2020-11-12 15:04       ` Maxim Levitsky
2021-03-18  9:57         ` Maxim Levitsky
2020-11-11 15:44 ` [PATCH 0/2] RFC: Issue with discards on raw block device without O_DIRECT Maxim Levitsky
2020-11-12 11:19   ` Jan Kara
2020-11-12 11:19     ` Jan Kara
2020-11-12 12:00     ` Jan Kara
2020-11-12 12:00       ` Jan Kara
2020-11-12 22:08       ` Darrick J. Wong
2020-11-12 22:08         ` Darrick J. Wong
2020-12-07 17:23         ` Maxim Levitsky
2020-12-07 17:23           ` Maxim Levitsky
2021-01-07 12:40           ` [PATCH] block: fallocate: avoid false positive on collision detection Maxim Levitsky
2021-01-07 12:42             ` Maxim Levitsky
2021-01-07 15:37             ` Jan Kara
2020-11-12 15:38   ` [PATCH 0/2] RFC: Issue with discards on raw block device without O_DIRECT Maxim Levitsky
2020-11-12 15:38     ` Maxim Levitsky
2020-11-13 10:07     ` Jan Kara
2020-11-13 10:07       ` Jan Kara