* [PATCH 0/2] RFC: Issue with discards on raw block device without O_DIRECT
@ 2020-11-11 15:39 Maxim Levitsky
  2020-11-11 15:39 ` [PATCH 1/2] file-posix: allow -EBUSY errors during write zeros on raw block devices Maxim Levitsky
  ` (2 more replies)
  0 siblings, 3 replies; 25+ messages in thread

From: Maxim Levitsky @ 2020-11-11 15:39 UTC (permalink / raw)
To: qemu-devel
Cc: Kevin Wolf, Jan Kara, qemu-block, Darrick J. Wong, Peter Lieven,
	Max Reitz, Paolo Bonzini, Maxim Levitsky

clone of "starship_production"

Maxim Levitsky (2):
  file-posix: allow -EBUSY errors during write zeros on raw block
    devices
  qemu-img: align next status sector on destination alignment.

 block/file-posix.c |  1 +
 qemu-img.c         | 13 ++++++++-----
 2 files changed, 9 insertions(+), 5 deletions(-)

-- 
2.26.2

^ permalink raw reply	[flat|nested] 25+ messages in thread
* [PATCH 1/2] file-posix: allow -EBUSY errors during write zeros on raw block devices
  2020-11-11 15:39 [PATCH 0/2] RFC: Issue with discards on raw block device without O_DIRECT Maxim Levitsky
@ 2020-11-11 15:39 ` Maxim Levitsky
  2020-11-16 14:48   ` Kevin Wolf
  2020-11-11 15:39 ` [PATCH 2/2] qemu-img: align next status sector on destination alignment Maxim Levitsky
  2020-11-11 15:44 ` [PATCH 0/2] RFC: Issue with discards on raw block device without O_DIRECT Maxim Levitsky
  2 siblings, 1 reply; 25+ messages in thread

From: Maxim Levitsky @ 2020-11-11 15:39 UTC (permalink / raw)
To: qemu-devel
Cc: Kevin Wolf, Jan Kara, qemu-block, Darrick J. Wong, Peter Lieven,
	Max Reitz, Paolo Bonzini, Maxim Levitsky

On Linux, fallocate(fd, FALLOC_FL_PUNCH_HOLE), when used on a block
device without O_DIRECT, can return -EBUSY if it races with another
write to the same page.

Since this is rare and discard is not a critical operation, ignore
this error.

Signed-off-by: Maxim Levitsky <mlevitsk@redhat.com>
---
 block/file-posix.c | 1 +
 1 file changed, 1 insertion(+)

diff --git a/block/file-posix.c b/block/file-posix.c
index c63926d592..d5fd1dbcd2 100644
--- a/block/file-posix.c
+++ b/block/file-posix.c
@@ -1698,6 +1698,7 @@ static int handle_aiocb_write_zeroes_unmap(void *opaque)
     switch (ret) {
     case -ENOTSUP:
     case -EINVAL:
+    case -EBUSY:
         break;
     default:
         return ret;
-- 
2.26.2

^ permalink raw reply related	[flat|nested] 25+ messages in thread
* Re: [PATCH 1/2] file-posix: allow -EBUSY errors during write zeros on raw block devices
  2020-11-11 15:39 ` [PATCH 1/2] file-posix: allow -EBUSY errors during write zeros on raw block devices Maxim Levitsky
@ 2020-11-16 14:48   ` Kevin Wolf
  2021-01-07 12:44     ` Maxim Levitsky
  0 siblings, 1 reply; 25+ messages in thread

From: Kevin Wolf @ 2020-11-16 14:48 UTC (permalink / raw)
To: Maxim Levitsky
Cc: Jan Kara, qemu-block, Darrick J. Wong, Peter Lieven, qemu-devel,
	Max Reitz, Paolo Bonzini

Am 11.11.2020 um 16:39 hat Maxim Levitsky geschrieben:
> On Linux, fallocate(fd, FALLOC_FL_PUNCH_HOLE), when used on a block
> device without O_DIRECT, can return -EBUSY if it races with another
> write to the same page.
>
> Since this is rare and discard is not a critical operation, ignore
> this error.
>
> Signed-off-by: Maxim Levitsky <mlevitsk@redhat.com>

I'm applying this one for 5.2, it certainly shouldn't hurt and makes
things work at least, even if possibly not in the optimal way.

Patch 2 seems to be a bit less obvious and discussion is ongoing, so
that's probably more 6.0 material.

Kevin

^ permalink raw reply	[flat|nested] 25+ messages in thread
* Re: [PATCH 1/2] file-posix: allow -EBUSY errors during write zeros on raw block devices
  2020-11-16 14:48 ` Kevin Wolf
@ 2021-01-07 12:44   ` Maxim Levitsky
  0 siblings, 0 replies; 25+ messages in thread

From: Maxim Levitsky @ 2021-01-07 12:44 UTC (permalink / raw)
To: Kevin Wolf
Cc: Jan Kara, qemu-block, Darrick J. Wong, Peter Lieven, qemu-devel,
	Max Reitz, Paolo Bonzini

On Mon, 2020-11-16 at 15:48 +0100, Kevin Wolf wrote:
> Am 11.11.2020 um 16:39 hat Maxim Levitsky geschrieben:
> > On Linux, fallocate(fd, FALLOC_FL_PUNCH_HOLE), when used on a block
> > device without O_DIRECT, can return -EBUSY if it races with another
> > write to the same page.
> >
> > Since this is rare and discard is not a critical operation, ignore
> > this error.
> >
> > Signed-off-by: Maxim Levitsky <mlevitsk@redhat.com>
>
> I'm applying this one for 5.2, it certainly shouldn't hurt and makes
> things work at least, even if possibly not in the optimal way.
>
> Patch 2 seems to be a bit less obvious and discussion is ongoing, so
> that's probably more 6.0 material.
>
> Kevin

Any feedback on patch 2?

Best regards,
	Maxim Levitsky

^ permalink raw reply	[flat|nested] 25+ messages in thread
* [PATCH 2/2] qemu-img: align next status sector on destination alignment.
  2020-11-11 15:39 [PATCH 0/2] RFC: Issue with discards on raw block device without O_DIRECT Maxim Levitsky
  2020-11-11 15:39 ` [PATCH 1/2] file-posix: allow -EBUSY errors during write zeros on raw block devices Maxim Levitsky
@ 2020-11-11 15:39 ` Maxim Levitsky
  2020-11-12 12:40   ` Peter Lieven
  2020-11-11 15:44 ` [PATCH 0/2] RFC: Issue with discards on raw block device without O_DIRECT Maxim Levitsky
  2 siblings, 1 reply; 25+ messages in thread

From: Maxim Levitsky @ 2020-11-11 15:39 UTC (permalink / raw)
To: qemu-devel
Cc: Kevin Wolf, Jan Kara, qemu-block, Darrick J. Wong, Peter Lieven,
	Max Reitz, Paolo Bonzini, Maxim Levitsky

This helps avoid unneeded writes and discards.

Signed-off-by: Maxim Levitsky <mlevitsk@redhat.com>
---
 qemu-img.c | 13 ++++++++-----
 1 file changed, 8 insertions(+), 5 deletions(-)

diff --git a/qemu-img.c b/qemu-img.c
index c2c56fc797..7e9b0f659f 100644
--- a/qemu-img.c
+++ b/qemu-img.c
@@ -1722,7 +1722,7 @@ static void convert_select_part(ImgConvertState *s, int64_t sector_num,
 static int convert_iteration_sectors(ImgConvertState *s, int64_t sector_num)
 {
     int64_t src_cur_offset;
-    int ret, n, src_cur;
+    int ret, n, src_cur, alignment;
     bool post_backing_zero = false;

     convert_select_part(s, sector_num, &src_cur, &src_cur_offset);
@@ -1785,11 +1785,14 @@ static int convert_iteration_sectors(ImgConvertState *s, int64_t sector_num)
     n = DIV_ROUND_UP(count, BDRV_SECTOR_SIZE);

     /*
-     * Avoid that s->sector_next_status becomes unaligned to the source
-     * request alignment and/or cluster size to avoid unnecessary read
-     * cycles.
+     * Avoid that s->sector_next_status becomes unaligned to the
+     * source/destination request alignment and/or cluster size to avoid
+     * unnecessary read/write cycles.
      */
-    tail = (sector_num - src_cur_offset + n) % s->src_alignment[src_cur];
+    alignment = MAX(s->src_alignment[src_cur], s->alignment);
+    assert(is_power_of_2(alignment));
+
+    tail = (sector_num - src_cur_offset + n) % alignment;
     if (n > tail) {
         n -= tail;
     }
-- 
2.26.2

^ permalink raw reply related	[flat|nested] 25+ messages in thread
* Re: [PATCH 2/2] qemu-img: align next status sector on destination alignment.
  2020-11-11 15:39 ` [PATCH 2/2] qemu-img: align next status sector on destination alignment Maxim Levitsky
@ 2020-11-12 12:40   ` Peter Lieven
  2020-11-12 13:45     ` Eric Blake
  0 siblings, 1 reply; 25+ messages in thread

From: Peter Lieven @ 2020-11-12 12:40 UTC (permalink / raw)
To: Maxim Levitsky, qemu-devel
Cc: Kevin Wolf, Jan Kara, qemu-block, Darrick J. Wong, Max Reitz,
	Paolo Bonzini

Am 11.11.20 um 16:39 schrieb Maxim Levitsky:
> This helps avoid unneeded writes and discards.
>
> Signed-off-by: Maxim Levitsky <mlevitsk@redhat.com>
> ---
>  qemu-img.c | 13 ++++++++-----
>  1 file changed, 8 insertions(+), 5 deletions(-)
>
> diff --git a/qemu-img.c b/qemu-img.c
> index c2c56fc797..7e9b0f659f 100644
> --- a/qemu-img.c
> +++ b/qemu-img.c
> @@ -1722,7 +1722,7 @@ static void convert_select_part(ImgConvertState *s, int64_t sector_num,
>  static int convert_iteration_sectors(ImgConvertState *s, int64_t sector_num)
>  {
>      int64_t src_cur_offset;
> -    int ret, n, src_cur;
> +    int ret, n, src_cur, alignment;
>      bool post_backing_zero = false;
>
>      convert_select_part(s, sector_num, &src_cur, &src_cur_offset);
> @@ -1785,11 +1785,14 @@ static int convert_iteration_sectors(ImgConvertState *s, int64_t sector_num)
>      n = DIV_ROUND_UP(count, BDRV_SECTOR_SIZE);
>
>      /*
> -     * Avoid that s->sector_next_status becomes unaligned to the source
> -     * request alignment and/or cluster size to avoid unnecessary read
> -     * cycles.
> +     * Avoid that s->sector_next_status becomes unaligned to the
> +     * source/destination request alignment and/or cluster size to avoid
> +     * unnecessary read/write cycles.
>      */
> -    tail = (sector_num - src_cur_offset + n) % s->src_alignment[src_cur];
> +    alignment = MAX(s->src_alignment[src_cur], s->alignment);
> +    assert(is_power_of_2(alignment));
> +
> +    tail = (sector_num - src_cur_offset + n) % alignment;
>      if (n > tail) {
>          n -= tail;
>      }

I was also considering including the s->alignment when adding this
change. However, you need the least common multiple of both alignments,
not the maximum, otherwise you might get misaligned to either source or
destination.

Why exactly do you need the power of two requirement?

Peter

^ permalink raw reply	[flat|nested] 25+ messages in thread
* Re: [PATCH 2/2] qemu-img: align next status sector on destination alignment.
  2020-11-12 12:40 ` Peter Lieven
@ 2020-11-12 13:45   ` Eric Blake
  2020-11-12 15:04     ` Maxim Levitsky
  0 siblings, 1 reply; 25+ messages in thread

From: Eric Blake @ 2020-11-12 13:45 UTC (permalink / raw)
To: Peter Lieven, Maxim Levitsky, qemu-devel
Cc: Kevin Wolf, Jan Kara, qemu-block, Darrick J. Wong, Max Reitz,
	Paolo Bonzini

On 11/12/20 6:40 AM, Peter Lieven wrote:
>>      /*
>> -     * Avoid that s->sector_next_status becomes unaligned to the source
>> -     * request alignment and/or cluster size to avoid unnecessary read
>> -     * cycles.
>> +     * Avoid that s->sector_next_status becomes unaligned to the
>> +     * source/destination request alignment and/or cluster size to avoid
>> +     * unnecessary read/write cycles.
>>      */
>> -    tail = (sector_num - src_cur_offset + n) % s->src_alignment[src_cur];
>> +    alignment = MAX(s->src_alignment[src_cur], s->alignment);
>> +    assert(is_power_of_2(alignment));
>> +
>> +    tail = (sector_num - src_cur_offset + n) % alignment;
>>      if (n > tail) {
>>          n -= tail;
>>      }
>
> I was also considering including the s->alignment when adding this
> change. However, you need the least common multiple of both alignments,
> not the maximum, otherwise you might get misaligned to either source or
> destination.
>
> Why exactly do you need the power of two requirement?

The power of two requirement ensures that you have the least common
multiple of both alignments ;)

However, there ARE devices in practice that have advertised a
non-power-of-2 discard granularity, such as 15MiB (see commits 430b26a8,
63188c24).  Which means you probably don't want to assert power-of-2,
and in turn need to worry about least common multiple.

-- 
Eric Blake, Principal Software Engineer
Red Hat, Inc.           +1-919-301-3226
Virtualization:  qemu.org | libvirt.org

^ permalink raw reply	[flat|nested] 25+ messages in thread
* Re: [PATCH 2/2] qemu-img: align next status sector on destination alignment.
  2020-11-12 13:45 ` Eric Blake
@ 2020-11-12 15:04   ` Maxim Levitsky
  2021-03-18  9:57     ` Maxim Levitsky
  0 siblings, 1 reply; 25+ messages in thread

From: Maxim Levitsky @ 2020-11-12 15:04 UTC (permalink / raw)
To: Eric Blake, Peter Lieven, qemu-devel
Cc: Kevin Wolf, Jan Kara, qemu-block, Darrick J. Wong, Max Reitz,
	Paolo Bonzini

On Thu, 2020-11-12 at 07:45 -0600, Eric Blake wrote:
> On 11/12/20 6:40 AM, Peter Lieven wrote:
> > >      /*
> > > -     * Avoid that s->sector_next_status becomes unaligned to the source
> > > -     * request alignment and/or cluster size to avoid unnecessary read
> > > -     * cycles.
> > > +     * Avoid that s->sector_next_status becomes unaligned to the
> > > +     * source/destination request alignment and/or cluster size to avoid
> > > +     * unnecessary read/write cycles.
> > >      */
> > > -    tail = (sector_num - src_cur_offset + n) % s->src_alignment[src_cur];
> > > +    alignment = MAX(s->src_alignment[src_cur], s->alignment);
> > > +    assert(is_power_of_2(alignment));
> > > +
> > > +    tail = (sector_num - src_cur_offset + n) % alignment;
> > >      if (n > tail) {
> > >          n -= tail;
> > >      }
> >
> > I was also considering including the s->alignment when adding this
> > change. However, you need the least common multiple of both alignments,
> > not the maximum, otherwise you might get misaligned to either source or
> > destination.
> >
> > Why exactly do you need the power of two requirement?
>
> The power of two requirement ensures that you have the least common
> multiple of both alignments ;)

Yes, and in addition to that both alignments are already powers of two
because we align them to it. The assert I added is just in case.

This is how we calculate the destination alignment:

    s.alignment = MAX(pow2floor(s.min_sparse),
                      DIV_ROUND_UP(out_bs->bl.request_alignment,
                                   BDRV_SECTOR_SIZE));

This is how we calculate the source alignments (it is possible to have
several source files):

    s.src_alignment[bs_i] = DIV_ROUND_UP(src_bs->bl.request_alignment,
                                         BDRV_SECTOR_SIZE);
    if (!bdrv_get_info(src_bs, &bdi)) {
        s.src_alignment[bs_i] = MAX(s.src_alignment[bs_i],
                                    bdi.cluster_size / BDRV_SECTOR_SIZE);
    }

The bl.request_alignment comment mentions that it must be a power of 2,
and cluster sizes are also very likely to be a power of 2 in all drivers
we support. An assert for s.src_alignment won't hurt though.

Note though that we don't really read the discard alignment.
We have max_pdiscard and pdiscard_alignment, but max_pdiscard is only
used to split 'too large' discard requests, and pdiscard_alignment is
advisory and used only in a couple of places. Neither are set by
file-posix.

We do have request_alignment, which file-posix tries to align to the
logical block size, which is still often 512 for backward compatibility
on many devices (even NVMe).

Now both source and destination alignments in qemu-img convert are based
on request_alignment and on min_sparse (which is by default 4K,
controlled by the qemu-img -S parameter, which otherwise controls the
minimum number of blocks that have to be zero to consider discarding
them in convert).

This means that there is no guarantee to have 4K alignment, and besides,
with a sufficiently fragmented source file, bdrv_block_status_above can
return arbitrarily short and unaligned extents which can't be aligned,
thus as I said this patch alone can't guarantee that we won't have a
write and a discard sharing the same page.

Another thing that can be discussed is why is_allocated_sectors was
patched to convert short discards to writes. Probably because underlying
hardware ignores them or something? In theory we should detect this and
fall back to regular zeroing in this case.

Again though, while in the case of converting an empty file this is the
only source of writes, and eliminating it also 'fixes' this case, with a
sufficiently fragmented source file we can get a write and a discard
sharing a page even without this code.

Best regards,
	Maxim Levitsky

> However, there ARE devices in practice that have advertised a
> non-power-of-2 discard granularity, such as 15MiB (see commits 430b26a8,
> 63188c24).  Which means you probably don't want to assert power-of-2,
> and in turn need to worry about least common multiple.

^ permalink raw reply	[flat|nested] 25+ messages in thread
* Re: [PATCH 2/2] qemu-img: align next status sector on destination alignment.
  2020-11-12 15:04 ` Maxim Levitsky
@ 2021-03-18  9:57   ` Maxim Levitsky
  0 siblings, 0 replies; 25+ messages in thread

From: Maxim Levitsky @ 2021-03-18 9:57 UTC (permalink / raw)
To: Eric Blake, Peter Lieven, qemu-devel
Cc: Kevin Wolf, Jan Kara, qemu-block, Darrick J. Wong, Max Reitz,
	Paolo Bonzini

On Thu, 2020-11-12 at 17:04 +0200, Maxim Levitsky wrote:
> On Thu, 2020-11-12 at 07:45 -0600, Eric Blake wrote:
> > On 11/12/20 6:40 AM, Peter Lieven wrote:
> > > >      /*
> > > > -     * Avoid that s->sector_next_status becomes unaligned to the source
> > > > -     * request alignment and/or cluster size to avoid unnecessary read
> > > > -     * cycles.
> > > > +     * Avoid that s->sector_next_status becomes unaligned to the
> > > > +     * source/destination request alignment and/or cluster size to avoid
> > > > +     * unnecessary read/write cycles.
> > > >      */
> > > > -    tail = (sector_num - src_cur_offset + n) % s->src_alignment[src_cur];
> > > > +    alignment = MAX(s->src_alignment[src_cur], s->alignment);
> > > > +    assert(is_power_of_2(alignment));
> > > > +
> > > > +    tail = (sector_num - src_cur_offset + n) % alignment;
> > > >      if (n > tail) {
> > > >          n -= tail;
> > > >      }
> > >
> > > I was also considering including the s->alignment when adding this
> > > change. However, you need the least common multiple of both alignments,
> > > not the maximum, otherwise you might get misaligned to either source or
> > > destination.
> > >
> > > Why exactly do you need the power of two requirement?
> >
> > The power of two requirement ensures that you have the least common
> > multiple of both alignments ;)
>
> Yes, and in addition to that both alignments are already powers of two
> because we align them to it. The assert I added is just in case.
>
> This is how we calculate the destination alignment:
>
>     s.alignment = MAX(pow2floor(s.min_sparse),
>                       DIV_ROUND_UP(out_bs->bl.request_alignment,
>                                    BDRV_SECTOR_SIZE));
>
> This is how we calculate the source alignments (it is possible to have
> several source files):
>
>     s.src_alignment[bs_i] = DIV_ROUND_UP(src_bs->bl.request_alignment,
>                                          BDRV_SECTOR_SIZE);
>     if (!bdrv_get_info(src_bs, &bdi)) {
>         s.src_alignment[bs_i] = MAX(s.src_alignment[bs_i],
>                                     bdi.cluster_size / BDRV_SECTOR_SIZE);
>     }
>
> The bl.request_alignment comment mentions that it must be a power of 2,
> and cluster sizes are also very likely to be a power of 2 in all drivers
> we support. An assert for s.src_alignment won't hurt though.
>
> Note though that we don't really read the discard alignment.
> We have max_pdiscard and pdiscard_alignment, but max_pdiscard is only
> used to split 'too large' discard requests, and pdiscard_alignment is
> advisory and used only in a couple of places. Neither are set by
> file-posix.
>
> We do have request_alignment, which file-posix tries to align to the
> logical block size, which is still often 512 for backward compatibility
> on many devices (even NVMe).
>
> Now both source and destination alignments in qemu-img convert are based
> on request_alignment and on min_sparse (which is by default 4K,
> controlled by the qemu-img -S parameter, which otherwise controls the
> minimum number of blocks that have to be zero to consider discarding
> them in convert).
>
> This means that there is no guarantee to have 4K alignment, and besides,
> with a sufficiently fragmented source file, bdrv_block_status_above can
> return arbitrarily short and unaligned extents which can't be aligned,
> thus as I said this patch alone can't guarantee that we won't have a
> write and a discard sharing the same page.
>
> Another thing that can be discussed is why is_allocated_sectors was
> patched to convert short discards to writes. Probably because underlying
> hardware ignores them or something? In theory we should detect this and
> fall back to regular zeroing in this case.
>
> Again though, while in the case of converting an empty file this is the
> only source of writes, and eliminating it also 'fixes' this case, with a
> sufficiently fragmented source file we can get a write and a discard
> sharing a page even without this code.
>
> Best regards,
> 	Maxim Levitsky
>
> > However, there ARE devices in practice that have advertised a
> > non-power-of-2 discard granularity, such as 15MiB (see commits 430b26a8,
> > 63188c24).  Which means you probably don't want to assert power-of-2,
> > and in turn need to worry about least common multiple.

Any update on this patch?

Best regards,
	Maxim Levitsky

^ permalink raw reply	[flat|nested] 25+ messages in thread
* Re: [PATCH 0/2] RFC: Issue with discards on raw block device without O_DIRECT
  2020-11-11 15:39 [PATCH 0/2] RFC: Issue with discards on raw block device without O_DIRECT Maxim Levitsky
  2020-11-11 15:39 ` [PATCH 1/2] file-posix: allow -EBUSY errors during write zeros on raw block devices Maxim Levitsky
  2020-11-11 15:39 ` [PATCH 2/2] qemu-img: align next status sector on destination alignment Maxim Levitsky
@ 2020-11-11 15:44 ` Maxim Levitsky
  2020-11-12 11:19   ` Jan Kara
  2 siblings, 1 reply; 25+ messages in thread

From: Maxim Levitsky @ 2020-11-11 15:44 UTC (permalink / raw)
To: qemu-devel
Cc: Kevin Wolf, Jan Kara, qemu-block, Darrick J. Wong, Peter Lieven,
	Max Reitz, Paolo Bonzini

On Wed, 2020-11-11 at 17:39 +0200, Maxim Levitsky wrote:
> clone of "starship_production"

The git-publish destroyed the cover letter:

For the reference, this is for bz #1872633.

The issue is that the current kernel code that implements 'fallocate'
on block devices roughly works like this:

1. Flush the page cache on the range that is about to be discarded.

2. Issue the discard and wait for it to finish.
   (as far as I can see, the discard doesn't go through the page cache)

3. Check if the page cache is dirty for this range; if it is dirty
   (meaning that someone wrote to it meanwhile), return -EBUSY.

This means that if qemu (or qemu-img) issues a write, and then a
discard to an area that shares a page with the write, -EBUSY can be
returned by the kernel.

On the other hand, for example, the ext4 implementation of discard
doesn't seem to be affected. It does take a lock on the inode to avoid
concurrent IO, and flushes O_DIRECT writers prior to doing the discard,
though.

Doing fsync and retrying seems to resolve this issue, but it might be
too big a hammer. Just retrying doesn't work, indicating that maybe
the code that flushes the page cache in (1) doesn't do this correctly?

It also can be racy unless special means are taken to block IO from
qemu during this fsync.

This patch series contains two patches:

The first patch just lets file-posix ignore the -EBUSY errors, which
is technically enough to fall back to a plain write in this case, but
seems wrong.

The second patch adds an optimization to qemu-img to avoid such a
fragmented write/discard in the first place.

Both patches make the reproducer work for this particular bugzilla,
but I don't think they are enough.

What do you think?

Best regards,
	Maxim Levitsky

^ permalink raw reply	[flat|nested] 25+ messages in thread
* Re: [PATCH 0/2] RFC: Issue with discards on raw block device without O_DIRECT
  2020-11-11 15:44 ` [PATCH 0/2] RFC: Issue with discards on raw block device without O_DIRECT Maxim Levitsky
@ 2020-11-12 11:19   ` Jan Kara
  0 siblings, 0 replies; 25+ messages in thread

From: Jan Kara @ 2020-11-12 11:19 UTC (permalink / raw)
To: Maxim Levitsky
Cc: qemu-devel, Kevin Wolf, Peter Lieven, Jan Kara, Paolo Bonzini,
	Max Reitz, Darrick J. Wong, qemu-block, linux-block,
	Christoph Hellwig, Jens Axboe

[added some relevant people and lists to CC]

On Wed 11-11-20 17:44:05, Maxim Levitsky wrote:
> On Wed, 2020-11-11 at 17:39 +0200, Maxim Levitsky wrote:
> > clone of "starship_production"
>
> The git-publish destroyed the cover letter:
>
> For the reference, this is for bz #1872633.
>
> The issue is that the current kernel code that implements 'fallocate'
> on block devices roughly works like this:
>
> 1. Flush the page cache on the range that is about to be discarded.
>
> 2. Issue the discard and wait for it to finish.
>    (as far as I can see, the discard doesn't go through the page cache)
>
> 3. Check if the page cache is dirty for this range; if it is dirty
>    (meaning that someone wrote to it meanwhile), return -EBUSY.
>
> This means that if qemu (or qemu-img) issues a write, and then a
> discard to an area that shares a page with the write, -EBUSY can be
> returned by the kernel.

Indeed, if you don't submit PAGE_SIZE aligned discards, you can get back
EBUSY which seems wrong to me. IMO we should handle this gracefully in the
kernel so we need to fix this.

> On the other hand, for example, the ext4 implementation of discard
> doesn't seem to be affected. It does take a lock on the inode to avoid
> concurrent IO, and flushes O_DIRECT writers prior to doing the discard,
> though.

Well, filesystem hole punching is somewhat different beast than block
device discard (at least implementation wise).

> Doing fsync and retrying seems to resolve this issue, but it might be
> too big a hammer. Just retrying doesn't work, indicating that maybe
> the code that flushes the page cache in (1) doesn't do this correctly?
>
> It also can be racy unless special means are taken to block IO from
> qemu during this fsync.
>
> This patch series contains two patches:
>
> The first patch just lets file-posix ignore the -EBUSY errors, which
> is technically enough to fall back to a plain write in this case, but
> seems wrong.
>
> The second patch adds an optimization to qemu-img to avoid such a
> fragmented write/discard in the first place.
>
> Both patches make the reproducer work for this particular bugzilla,
> but I don't think they are enough.
>
> What do you think?

So if the EBUSY error happens because something happened to the page cache
outside of discarded range (like you describe above), that is a kernel bug
that needs to get fixed. EBUSY should really mean - someone wrote to the
discarded range while discard was running and userspace app has to deal
with that depending on what it aims to do...

								Honza
-- 
Jan Kara <jack@suse.com>
SUSE Labs, CR

^ permalink raw reply	[flat|nested] 25+ messages in thread
* Re: [PATCH 0/2] RFC: Issue with discards on raw block device without O_DIRECT
  2020-11-12 11:19 ` Jan Kara
@ 2020-11-12 12:00   ` Jan Kara
  0 siblings, 0 replies; 25+ messages in thread

From: Jan Kara @ 2020-11-12 12:00 UTC (permalink / raw)
To: Maxim Levitsky
Cc: qemu-devel, Kevin Wolf, Peter Lieven, Jan Kara, Paolo Bonzini,
	Max Reitz, Darrick J. Wong, qemu-block, linux-block,
	Christoph Hellwig, Jens Axboe

On Thu 12-11-20 12:19:51, Jan Kara wrote:
> [added some relevant people and lists to CC]
>
> On Wed 11-11-20 17:44:05, Maxim Levitsky wrote:
> > The issue is that the current kernel code that implements 'fallocate'
> > on block devices roughly works like this:
> >
> > 1. Flush the page cache on the range that is about to be discarded.
> >
> > 2. Issue the discard and wait for it to finish.
> >    (as far as I can see, the discard doesn't go through the page cache)
> >
> > 3. Check if the page cache is dirty for this range; if it is dirty
> >    (meaning that someone wrote to it meanwhile), return -EBUSY.
> >
> > This means that if qemu (or qemu-img) issues a write, and then a
> > discard to an area that shares a page with the write, -EBUSY can be
> > returned by the kernel.
>
> Indeed, if you don't submit PAGE_SIZE aligned discards, you can get back
> EBUSY which seems wrong to me. IMO we should handle this gracefully in the
> kernel so we need to fix this.
>
> > On the other hand, for example, the ext4 implementation of discard
> > doesn't seem to be affected. It does take a lock on the inode to avoid
> > concurrent IO, and flushes O_DIRECT writers prior to doing the discard,
> > though.
>
> Well, filesystem hole punching is somewhat different beast than block
> device discard (at least implementation wise).
>
> > Doing fsync and retrying seems to resolve this issue, but it might be
> > too big a hammer. Just retrying doesn't work, indicating that maybe
> > the code that flushes the page cache in (1) doesn't do this correctly?
> >
> > It also can be racy unless special means are taken to block IO from
> > qemu during this fsync.
> >
> > Both patches make the reproducer work for this particular bugzilla,
> > but I don't think they are enough.
> >
> > What do you think?
>
> So if the EBUSY error happens because something happened to the page cache
> outside of discarded range (like you describe above), that is a kernel bug
> that needs to get fixed. EBUSY should really mean - someone wrote to the
> discarded range while discard was running and userspace app has to deal
> with that depending on what it aims to do...

So I was looking what it would take to fix this inside the kernel. The
problem is that invalidate_inode_pages2_range() is working on page
granularity and it is non-trivial to extend it to work on byte granularity
since we don't support something like "try to reclaim part of a page". But
I'm also somewhat wondering why we use invalidate_inode_pages2_range()
here instead of truncate_inode_pages_range() again? I mean the EBUSY
detection cannot be reliable anyway and userspace has no way of knowing
whether a write happened before discard or after it so just discarding
data is fine from this point of view. Darrick?

								Honza
-- 
Jan Kara <jack@suse.com>
SUSE Labs, CR

^ permalink raw reply	[flat|nested] 25+ messages in thread
* Re: [PATCH 0/2] RFC: Issue with discards on raw block device without O_DIRECT 2020-11-12 12:00 ` Jan Kara @ 2020-11-12 22:08 ` Darrick J. Wong -1 siblings, 0 replies; 25+ messages in thread From: Darrick J. Wong @ 2020-11-12 22:08 UTC (permalink / raw) To: Jan Kara Cc: Maxim Levitsky, qemu-devel, Kevin Wolf, Peter Lieven, Paolo Bonzini, Max Reitz, qemu-block, linux-block, Christoph Hellwig, Jens Axboe On Thu, Nov 12, 2020 at 01:00:56PM +0100, Jan Kara wrote: > On Thu 12-11-20 12:19:51, Jan Kara wrote: > > [added some relevant people and lists to CC] > > > > On Wed 11-11-20 17:44:05, Maxim Levitsky wrote: > > > On Wed, 2020-11-11 at 17:39 +0200, Maxim Levitsky wrote: > > > > clone of "starship_production" > > > > > > The git-publish destroyed the cover letter: > > > > > > For the reference this is for bz #1872633 > > > > > > The issue is that current kernel code that implements 'fallocate' > > > on kernel block devices roughly works like that: > > > > > > 1. Flush the page cache on the range that is about to be discarded. > > > 2. Issue the discard and wait for it to finish. > > > (as far as I can see the discard doesn't go through the > > > page cache). > > > > > > 3. Check if the page cache is dirty for this range, > > > if it is dirty (meaning that someone wrote to it meanwhile) > > > return -EBUSY. > > > > > > This means that if qemu (or qemu-img) issues a write, and then > > > discard to the area that shares a page, -EBUSY can be returned by > > > the kernel. > > > > Indeed, if you don't submit PAGE_SIZE aligned discards, you can get back > > EBUSY which seems wrong to me. IMO we should handle this gracefully in the > > kernel so we need to fix this. > > > > > On the other hand, for example, the ext4 implementation of discard > > > doesn't seem to be affected. It does take a lock on the inode to avoid > > > concurrent IO and flushes O_DIRECT writers prior to doing discard thought. 
> > > > Well, filesystem hole punching is somewhat different beast than block device > > discard (at least implementation wise). > > > > > Doing fsync and retrying is seems to resolve this issue, but it might be > > > a too big hammer. Just retrying doesn't work, indicating that maybe the > > > code that flushes the page cache in (1) doesn't do this correctly ? > > > > > > It also can be racy unless special means are done to block IO from happening > > > from qemu during this fsync. > > > > > > This patch series contains two patches: > > > > > > First patch just lets the file-posix ignore the -EBUSY errors, which is > > > technically enough to fail back to plain write in this case, but seems wrong. > > > > > > And the second patch adds an optimization to qemu-img to avoid such a > > > fragmented write/discard in the first place. > > > > > > Both patches make the reproducer work for this particular bugzilla, > > > but I don't think they are enough. > > > > > > What do you think? > > > > So if the EBUSY error happens because something happened to the page cache > > outside of discarded range (like you describe above), that is a kernel bug > > than needs to get fixed. EBUSY should really mean - someone wrote to the > > discarded range while discard was running and userspace app has to deal > > with that depending on what it aims to do... > > So I was looking what it would take to fix this inside the kernel. The > problem is that invalidate_inode_pages2_range() is working on page > granularity and it is non-trivial to extend it to work on byte granularity > since we don't support something like "try to reclaim part of a page". But > I'm also somewhat wondering why we use invalidate_inode_pages2_range() here > instead of truncate_inode_pages_range() again? I mean the EBUSY detection > cannot be reliable anyway and userspace has no way of knowing whether a > write happened before discard or after it so just discarding data is fine > from this point of view. Darrick? 
Hmmm, I think I overlooked the fact that we can do buffered writes into a block device's pagecache without taking any of the usual locks that have to be held for filesystem files. This is essentially a race between a not-page-aligned fallocate and a buffered write to a different sector that is mapped by a page that caches part of the fallocate range. So yes, Jan is right that we need to use truncate_bdev_range instead of invalidate_inode_pages2_range because the former will zero the sub-page ranges on either end of the fallocate request instead of returning -EBUSY because someone dirtied a part of a page that wasn't involved in the fallocate operation. I /probably/ just copy-pasta'd that invalidation call from directio without thinking hard enough about it, sorry about that. :( --D > > Honza > -- > Jan Kara <jack@suse.com> > SUSE Labs, CR ^ permalink raw reply [flat|nested] 25+ messages in thread
* Re: [PATCH 0/2] RFC: Issue with discards on raw block device without O_DIRECT 2020-11-12 22:08 ` Darrick J. Wong @ 2020-12-07 17:23 ` Maxim Levitsky -1 siblings, 0 replies; 25+ messages in thread From: Maxim Levitsky @ 2020-12-07 17:23 UTC (permalink / raw) To: Darrick J. Wong, Jan Kara Cc: qemu-devel, Kevin Wolf, Peter Lieven, Paolo Bonzini, Max Reitz, qemu-block, linux-block, Christoph Hellwig, Jens Axboe On Thu, 2020-11-12 at 14:08 -0800, Darrick J. Wong wrote: > On Thu, Nov 12, 2020 at 01:00:56PM +0100, Jan Kara wrote: > > On Thu 12-11-20 12:19:51, Jan Kara wrote: > > > [added some relevant people and lists to CC] > > > > > > On Wed 11-11-20 17:44:05, Maxim Levitsky wrote: > > > > On Wed, 2020-11-11 at 17:39 +0200, Maxim Levitsky wrote: > > > > > clone of "starship_production" > > > > > > > > The git-publish destroyed the cover letter: > > > > > > > > For the reference this is for bz #1872633 > > > > > > > > The issue is that current kernel code that implements 'fallocate' > > > > on kernel block devices roughly works like that: > > > > > > > > 1. Flush the page cache on the range that is about to be discarded. > > > > 2. Issue the discard and wait for it to finish. > > > > (as far as I can see the discard doesn't go through the > > > > page cache). > > > > > > > > 3. Check if the page cache is dirty for this range, > > > > if it is dirty (meaning that someone wrote to it meanwhile) > > > > return -EBUSY. > > > > > > > > This means that if qemu (or qemu-img) issues a write, and then > > > > discard to the area that shares a page, -EBUSY can be returned by > > > > the kernel. > > > > > > Indeed, if you don't submit PAGE_SIZE aligned discards, you can get back > > > EBUSY which seems wrong to me. IMO we should handle this gracefully in the > > > kernel so we need to fix this. > > > > > > > On the other hand, for example, the ext4 implementation of discard > > > > doesn't seem to be affected. 
It does take a lock on the inode to avoid > > > > concurrent IO and flushes O_DIRECT writers prior to doing discard thought. > > > > > > Well, filesystem hole punching is somewhat different beast than block device > > > discard (at least implementation wise). > > > > > > > Doing fsync and retrying is seems to resolve this issue, but it might be > > > > a too big hammer. Just retrying doesn't work, indicating that maybe the > > > > code that flushes the page cache in (1) doesn't do this correctly ? > > > > > > > > It also can be racy unless special means are done to block IO from happening > > > > from qemu during this fsync. > > > > > > > > This patch series contains two patches: > > > > > > > > First patch just lets the file-posix ignore the -EBUSY errors, which is > > > > technically enough to fail back to plain write in this case, but seems wrong. > > > > > > > > And the second patch adds an optimization to qemu-img to avoid such a > > > > fragmented write/discard in the first place. > > > > > > > > Both patches make the reproducer work for this particular bugzilla, > > > > but I don't think they are enough. > > > > > > > > What do you think? > > > > > > So if the EBUSY error happens because something happened to the page cache > > > outside of discarded range (like you describe above), that is a kernel bug > > > than needs to get fixed. EBUSY should really mean - someone wrote to the > > > discarded range while discard was running and userspace app has to deal > > > with that depending on what it aims to do... > > > > So I was looking what it would take to fix this inside the kernel. The > > problem is that invalidate_inode_pages2_range() is working on page > > granularity and it is non-trivial to extend it to work on byte granularity > > since we don't support something like "try to reclaim part of a page". But > > I'm also somewhat wondering why we use invalidate_inode_pages2_range() here > > instead of truncate_inode_pages_range() again? 
I mean the EBUSY detection > > cannot be reliable anyway and userspace has no way of knowing whether a > > write happened before discard or after it so just discarding data is fine > > from this point of view. Darrick? > > Hmmm, I think I overlooked the fact that we can do buffered writes into > a block device's pagecache without taking any of the usual locks that > have to be held for filesystem files. This is essentially a race > between a not-page-aligned fallocate and a buffered write to a different > sector that is mapped by a page that caches part of the fallocate range. > > So yes, Jan is right that we need to use truncate_bdev_range instead of > invalidate_inode_pages2_range because the former will zero the sub-page > ranges on either end of the fallocate request instead of returning > -EBUSY because someone dirtied a part of a page that wasn't involved in > the fallocate operation. > > I /probably/ just copy-pasta'd that invalidation call from directio > without thinking hard enough about it, sorry about that. :( Any update on this? Today I took a look at both truncate_bdev_range and invalidate_inode_pages2_range. I see that truncate_bdev_range can't really fail, other than the check for a mounted filesystem. It calls truncate_inode_pages_range, which writes back all dirty pages and waits for the writeback to finish, so it won't detect dirty pages the way invalidate_inode_pages2_range does. Also, AFAIK the kernel page cache only tracks dirtiness per whole page (e.g. a 512-byte write will cause the whole page to be written back, unless the write was done using O_DIRECT). So if a page that is partially discarded becomes dirty, we can't really know whether someone wrote to the discarded region or only near it. I vote to keep the call to invalidate_inode_pages2_range but exclude the non-page-aligned areas from it. This way we will still do a best-effort detection of this case, while not causing any false positives. If you agree, I'll send a patch for this.
Best regards, Maxim Levitsky > > --D > > > Honza > > -- > > Jan Kara <jack@suse.com> > > SUSE Labs, CR ^ permalink raw reply [flat|nested] 25+ messages in thread
* [PATCH] block: fallocate: avoid false positive on collision detection 2020-12-07 17:23 ` Maxim Levitsky (?) @ 2021-01-07 12:40 ` Maxim Levitsky 2021-01-07 12:42 ` Maxim Levitsky 2021-01-07 15:37 ` Jan Kara -1 siblings, 2 replies; 25+ messages in thread
From: Maxim Levitsky @ 2021-01-07 12:40 UTC (permalink / raw)
To: linux-kernel
Cc: Darrick J. Wong, Alexander Viro, open list:FILESYSTEMS (VFS and infrastructure), Jan Kara, Maxim Levitsky

Align start and end on page boundaries before calling
invalidate_inode_pages2_range.

This might allow us to miss a collision if the write and the discard were
done to the same page and do overlap, but it is still better than returning
-EBUSY if those writes didn't overlap.

Signed-off-by: Maxim Levitsky <mlevitsk@redhat.com>
---
 fs/block_dev.c | 20 ++++++++++++++++----
 1 file changed, 16 insertions(+), 4 deletions(-)

diff --git a/fs/block_dev.c b/fs/block_dev.c
index 9e84b1928b94..97f0d16661b5 100644
--- a/fs/block_dev.c
+++ b/fs/block_dev.c
@@ -1970,6 +1970,7 @@ static long blkdev_fallocate(struct file *file, int mode, loff_t start,
 	loff_t end = start + len - 1;
 	loff_t isize;
 	int error;
+	pgoff_t invalidate_first_page, invalidate_last_page;
 
 	/* Fail if we don't recognize the flags. */
 	if (mode & ~BLKDEV_FALLOC_FL_SUPPORTED)
@@ -2020,12 +2021,23 @@ static long blkdev_fallocate(struct file *file, int mode, loff_t start,
 
 	/*
 	 * Invalidate again; if someone wandered in and dirtied a page,
-	 * the caller will be given -EBUSY.  The third argument is
-	 * inclusive, so the rounding here is safe.
+	 * the caller will be given -EBUSY.
+	 *
+	 * If the start/end of the range is not page aligned, exclude the
+	 * non aligned regions to avoid false positives.
 	 */
+	invalidate_first_page = DIV_ROUND_UP(start, PAGE_SIZE);
+	invalidate_last_page = end >> PAGE_SHIFT;
+
+	if ((end + 1) & ~PAGE_MASK)
+		invalidate_last_page--;
+
+	if (invalidate_last_page < invalidate_first_page)
+		return 0;
+
 	return invalidate_inode_pages2_range(bdev->bd_inode->i_mapping,
-					     start >> PAGE_SHIFT,
-					     end >> PAGE_SHIFT);
+					     invalidate_first_page,
+					     invalidate_last_page);
 }
 
 const struct file_operations def_blk_fops = {
-- 
2.26.2

^ permalink raw reply related [flat|nested] 25+ messages in thread
* Re: [PATCH] block: fallocate: avoid false positive on collision detection 2021-01-07 12:40 ` [PATCH] block: fallocate: avoid false positive on collision detection Maxim Levitsky @ 2021-01-07 12:42 ` Maxim Levitsky 2021-01-07 15:37 ` Jan Kara 1 sibling, 0 replies; 25+ messages in thread From: Maxim Levitsky @ 2021-01-07 12:42 UTC (permalink / raw) To: linux-kernel Cc: Darrick J . Wong, Alexander Viro, open list:FILESYSTEMS (VFS and infrastructure), Jan Kara On Thu, 2021-01-07 at 14:40 +0200, Maxim Levitsky wrote: > Align start and end on page boundaries before calling > invalidate_inode_pages2_range. > > This might allow us to miss a collision if the write and the discard were done > to the same page and do overlap but it is still better than returning -EBUSY > if those writes didn't overlap. > > Signed-off-by: Maxim Levitsky <mlevitsk@redhat.com> This is my attempt to fix this issue. I am not 100% sure that this is the right solution though. Any feedback is welcome! Best regards, Maxim Levitsky ^ permalink raw reply [flat|nested] 25+ messages in thread
* Re: [PATCH] block: fallocate: avoid false positive on collision detection 2021-01-07 12:40 ` [PATCH] block: fallocate: avoid false positive on collision detection Maxim Levitsky 2021-01-07 12:42 ` Maxim Levitsky @ 2021-01-07 15:37 ` Jan Kara 1 sibling, 0 replies; 25+ messages in thread From: Jan Kara @ 2021-01-07 15:37 UTC (permalink / raw) To: Maxim Levitsky Cc: linux-kernel, Darrick J . Wong, Alexander Viro, open list:FILESYSTEMS (VFS and infrastructure), Jan Kara [-- Attachment #1: Type: text/plain, Size: 2249 bytes --] On Thu 07-01-21 14:40:22, Maxim Levitsky wrote: > Align start and end on page boundaries before calling > invalidate_inode_pages2_range. > > This might allow us to miss a collision if the write and the discard were done > to the same page and do overlap but it is still better than returning -EBUSY > if those writes didn't overlap. > > Signed-off-by: Maxim Levitsky <mlevitsk@redhat.com> Thanks for getting back to this and I'm sorry I didn't get to this earlier myself! I actually think the fix should be different as we discussed with Darrick. Attached patch should fix the issue for you (I'll also post it formally for inclusion). Honza > --- > fs/block_dev.c | 20 ++++++++++++++++---- > 1 file changed, 16 insertions(+), 4 deletions(-) > > diff --git a/fs/block_dev.c b/fs/block_dev.c > index 9e84b1928b94..97f0d16661b5 100644 > --- a/fs/block_dev.c > +++ b/fs/block_dev.c > @@ -1970,6 +1970,7 @@ static long blkdev_fallocate(struct file *file, int mode, loff_t start, > loff_t end = start + len - 1; > loff_t isize; > int error; > + pgoff_t invalidate_first_page, invalidate_last_page; > > /* Fail if we don't recognize the flags. */ > if (mode & ~BLKDEV_FALLOC_FL_SUPPORTED) > @@ -2020,12 +2021,23 @@ static long blkdev_fallocate(struct file *file, int mode, loff_t start, > > /* > * Invalidate again; if someone wandered in and dirtied a page, > - * the caller will be given -EBUSY. The third argument is > - * inclusive, so the rounding here is safe. 
> + * the caller will be given -EBUSY. > + * > + * If the start/end of the range is not page aligned, exclude the > + * non aligned regions to avoid false positives. > */ > + invalidate_first_page = DIV_ROUND_UP(start, PAGE_SIZE); > + invalidate_last_page = end >> PAGE_SHIFT; > + > + if ((end + 1) & PAGE_MASK) > + invalidate_last_page--; > + > + if (invalidate_last_page < invalidate_first_page) > + return 0; > + > return invalidate_inode_pages2_range(bdev->bd_inode->i_mapping, > - start >> PAGE_SHIFT, > - end >> PAGE_SHIFT); > + invalidate_first_page, > + invalidate_last_page); > } > > const struct file_operations def_blk_fops = { > -- > 2.26.2 > -- Jan Kara <jack@suse.com> SUSE Labs, CR [-- Attachment #2: 0001-bdev-Do-not-return-EBUSY-if-bdev-discard-races-with-.patch --] [-- Type: text/x-patch, Size: 1918 bytes --] From 36f751ac88420a6bda8a3c161986455629dc80d4 Mon Sep 17 00:00:00 2001 From: Jan Kara <jack@suse.cz> Date: Thu, 7 Jan 2021 16:26:52 +0100 Subject: [PATCH] bdev: Do not return EBUSY if bdev discard races with write blkdev_fallocate() tries to detect whether a discard raced with an overlapping write by calling invalidate_inode_pages2_range(). However this check can give both false negatives (when writing using direct IO or when writeback already writes out the written pagecache range) and false positives (when write is not actually overlapping but ends in the same page when blocksize < pagesize). This actually causes issues for qemu which is getting confused by EBUSY errors. Fix the problem by removing this conflicting write detection since it is inherently racy and thus of little use anyway. Reported-by: Maxim Levitsky <mlevitsk@redhat.com> CC: "Darrick J. 
Wong" <darrick.wong@oracle.com> Signed-off-by: Jan Kara <jack@suse.cz> --- fs/block_dev.c | 10 ++++------ 1 file changed, 4 insertions(+), 6 deletions(-) diff --git a/fs/block_dev.c b/fs/block_dev.c index 3e5b02f6606c..a97f43b49839 100644 --- a/fs/block_dev.c +++ b/fs/block_dev.c @@ -1797,13 +1797,11 @@ static long blkdev_fallocate(struct file *file, int mode, loff_t start, return error; /* - * Invalidate again; if someone wandered in and dirtied a page, - * the caller will be given -EBUSY. The third argument is - * inclusive, so the rounding here is safe. + * Invalidate the page cache again; if someone wandered in and dirtied + * a page, we just discard it - userspace has no way of knowing whether + * the write happened before or after discard completing... */ - return invalidate_inode_pages2_range(bdev->bd_inode->i_mapping, - start >> PAGE_SHIFT, - end >> PAGE_SHIFT); + return truncate_bdev_range(bdev, file->f_mode, start, end); } const struct file_operations def_blk_fops = { -- 2.26.2 ^ permalink raw reply related [flat|nested] 25+ messages in thread
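[Editor's note: the partial-page exclusion that Maxim's patch above performs with DIV_ROUND_UP()/PAGE_SHIFT can be sketched in plain Python. This is an illustration of the arithmetic only — the function name and the 4096-byte PAGE_SIZE are assumptions, not kernel code.]

```python
PAGE_SIZE = 4096  # assumed 4 KiB pages, as on x86-64


def full_page_range(start, end):
    """Page indices of the complete pages inside the inclusive byte
    range [start, end], or None when the range covers no whole page.
    first rounds start up to a page boundary (DIV_ROUND_UP); last
    rounds end+1 down and steps back to the last fully covered page."""
    first = (start + PAGE_SIZE - 1) // PAGE_SIZE
    last = (end + 1) // PAGE_SIZE - 1
    return None if last < first else (first, last)


# The fallocate that failed in the report starts mid-page at 0x7ffffe00,
# so the page it shares with the preceding write is excluded:
assert full_page_range(0x7ffffe00, 0x801fefff) == (0x80000, 0x801fe)
# A sub-page range leaves nothing to invalidate at all:
assert full_page_range(0x7ffffe00, 0x7ffffeff) is None
```

Invalidating only the fully covered pages avoids the false-positive EBUSY, at the cost (noted in the commit message) of missing a genuine collision confined to a partial page.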
* Re: [PATCH 0/2] RFC: Issue with discards on raw block device without O_DIRECT
  2020-11-12 11:19 ` Jan Kara
@ 2020-11-12 15:38 ` Maxim Levitsky
  -1 siblings, 0 replies; 25+ messages in thread
From: Maxim Levitsky @ 2020-11-12 15:38 UTC (permalink / raw)
To: Jan Kara
Cc: qemu-devel, Kevin Wolf, Peter Lieven, Paolo Bonzini, Max Reitz,
    Darrick J . Wong, qemu-block, linux-block, Christoph Hellwig,
    Jens Axboe

[-- Attachment #1: Type: text/plain, Size: 3345 bytes --]

On Thu, 2020-11-12 at 12:19 +0100, Jan Kara wrote:
> [added some relevant people and lists to CC]
>
> On Wed 11-11-20 17:44:05, Maxim Levitsky wrote:
> > On Wed, 2020-11-11 at 17:39 +0200, Maxim Levitsky wrote:
> > > clone of "starship_production"
> >
> > git-publish destroyed the cover letter:
> >
> > For reference, this is for bz #1872633
> >
> > The issue is that the current kernel code implementing 'fallocate'
> > on kernel block devices roughly works like this:
> >
> > 1. Flush the page cache on the range that is about to be discarded.
> > 2. Issue the discard and wait for it to finish.
> >    (as far as I can see the discard doesn't go through the
> >    page cache).
> > 3. Check if the page cache is dirty for this range;
> >    if it is dirty (meaning that someone wrote to it meanwhile),
> >    return -EBUSY.
> >
> > This means that if qemu (or qemu-img) issues a write, and then a
> > discard to an area that shares a page with the write, -EBUSY can be
> > returned by the kernel.
>
> Indeed, if you don't submit PAGE_SIZE-aligned discards, you can get back
> EBUSY, which seems wrong to me. IMO we should handle this gracefully in
> the kernel, so we need to fix this.
>
> > On the other hand, for example, the ext4 implementation of discard
> > doesn't seem to be affected. It does take a lock on the inode to avoid
> > concurrent IO and flushes O_DIRECT writers prior to doing the discard,
> > though.
>
> Well, filesystem hole punching is a somewhat different beast than block
> device discard (at least implementation wise).
>
> > Doing fsync and retrying seems to resolve this issue, but it might be
> > too big a hammer. Just retrying doesn't work, indicating that maybe the
> > code that flushes the page cache in (1) doesn't do this correctly?
> >
> > It can also be racy unless special measures are taken to block IO from
> > qemu while the fsync is running.
> >
> > This patch series contains two patches:
> >
> > The first patch just lets file-posix ignore the -EBUSY errors, which is
> > technically enough to fall back to a plain write in this case, but
> > seems wrong.
> >
> > And the second patch adds an optimization to qemu-img to avoid such a
> > fragmented write/discard in the first place.
> >
> > Both patches make the reproducer work for this particular bugzilla,
> > but I don't think they are enough.
> >
> > What do you think?
>
> So if the EBUSY error happens because something happened to the page
> cache outside of the discarded range (like you describe above), that is
> a kernel bug that needs to get fixed. EBUSY should really mean: someone
> wrote to the discarded range while the discard was running, and the
> userspace app has to deal with that depending on what it aims to do...

I double-checked this; these are the writes/discards according to my debug
prints (I print start and then start+len-1 for each request).
I have attached the patch for this for reference.
ZERO: 0x00007fe00000 00007fffefff (len:0x1ff000) fallocate 00007fe00000 00007fffefff WRITE: 0x00007ffff000 00007ffffdff (len:0xe00) write 00007ffff000 00007ffffdff ZERO: 0x00007ffffe00 0000801fefff (len:0x1ff200) fallocate 00007ffffe00 0000801fefff FALLOCATE failed with error 16 qemu-img: error while writing at byte 2147483136: Device or resource busy Best regards, Maxim Levitsky > > Honza [-- Attachment #2: hacks.diff --] [-- Type: text/x-patch, Size: 3079 bytes --] commit ce897250babe3527f451c1c54c86b62659a2c29e Author: Maxim Levitsky <mlevitsk@redhat.com> Date: Thu Oct 15 17:02:58 2020 +0300 hacks diff --git a/block/file-posix.c b/block/file-posix.c index c63926d592..91ca690505 100644 --- a/block/file-posix.c +++ b/block/file-posix.c @@ -1440,6 +1440,12 @@ static ssize_t handle_aiocb_rw_linear(RawPosixAIOData *aiocb, char *buf) while (offset < aiocb->aio_nbytes) { if (aiocb->aio_type & QEMU_AIO_WRITE) { + + long unsigned int size = aiocb->aio_nbytes - offset; + long unsigned int off = aiocb->aio_offset + offset; + + printf(" write %012lx %012lx\n", off, off+size-1); + len = pwrite(aiocb->aio_fildes, (const char *)buf + offset, aiocb->aio_nbytes - offset, @@ -1581,10 +1587,15 @@ static int translate_err(int err) static int do_fallocate(int fd, int mode, off_t offset, off_t len) { do { + + printf(" fallocate %012lx %012lx\n", offset, offset+len-1); + if (fallocate(fd, mode, offset, len) == 0) { return 0; } } while (errno == EINTR); + + printf("FALLOCATE failed with error %d\n", errno); return translate_err(-errno); } #endif diff --git a/qemu-img.c b/qemu-img.c index c2c56fc797..64d3b84728 100644 --- a/qemu-img.c +++ b/qemu-img.c @@ -1803,6 +1803,7 @@ static int convert_iteration_sectors(ImgConvertState *s, int64_t sector_num) } s->sector_next_status = sector_num + n; + printf("NEXT_STATUS: %lx\n", s->sector_next_status << 9); } n = MIN(n, s->sector_next_status - sector_num); @@ -1878,6 +1879,10 @@ static int coroutine_fn convert_co_read(ImgConvertState *s, int64_t 
sector_num, return 0; } +static void test_print(const char* op, unsigned long start, unsigned long len) +{ + printf("%s: 0x%012lx %012lx (len:0x%lx)\n", op, start, start + len-1, len); +} static int coroutine_fn convert_co_write(ImgConvertState *s, int64_t sector_num, int nb_sectors, uint8_t *buf, @@ -1911,6 +1916,8 @@ static int coroutine_fn convert_co_write(ImgConvertState *s, int64_t sector_num, (s->compressed && !buffer_is_zero(buf, n * BDRV_SECTOR_SIZE))) { + test_print("WRITE", sector_num << BDRV_SECTOR_BITS, n << BDRV_SECTOR_BITS); + ret = blk_co_pwrite(s->target, sector_num << BDRV_SECTOR_BITS, n << BDRV_SECTOR_BITS, buf, flags); if (ret < 0) { @@ -1925,6 +1932,8 @@ static int coroutine_fn convert_co_write(ImgConvertState *s, int64_t sector_num, assert(!s->target_has_backing); break; } + test_print("ZERO", sector_num << BDRV_SECTOR_BITS, n << BDRV_SECTOR_BITS); + ret = blk_co_pwrite_zeroes(s->target, sector_num << BDRV_SECTOR_BITS, n << BDRV_SECTOR_BITS, ^ permalink raw reply related [flat|nested] 25+ messages in thread
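[Editor's note: plugging the traced offsets into a quick page-arithmetic check (Python sketch; a 4 KiB page size is assumed) confirms why the last fallocate fails — its start is mid-page, and that page was just dirtied by the preceding write.]

```python
PAGE_SIZE = 4096  # assumed


def page_of(offset):
    """Page number containing the given byte offset."""
    return offset // PAGE_SIZE


# Offsets from the qemu-img trace above (inclusive start/end):
write_start, write_end = 0x7ffff000, 0x7ffffdff   # the 0xe00-byte write
disc_start, disc_end = 0x7ffffe00, 0x801fefff     # the fallocate that fails

# The discard start is not page aligned ...
assert disc_start % PAGE_SIZE != 0
# ... and it lands in the very page the write just dirtied, so the
# kernel's post-discard dirty-page check returns -EBUSY (16).
assert page_of(disc_start) == page_of(write_end)
```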
* Re: [PATCH 0/2] RFC: Issue with discards on raw block device without O_DIRECT
  2020-11-12 15:38 ` Maxim Levitsky
@ 2020-11-13 10:07 ` Jan Kara
  -1 siblings, 0 replies; 25+ messages in thread
From: Jan Kara @ 2020-11-13 10:07 UTC (permalink / raw)
To: Maxim Levitsky
Cc: Jan Kara, qemu-devel, Kevin Wolf, Peter Lieven, Paolo Bonzini,
    Max Reitz, Darrick J . Wong, qemu-block, linux-block,
    Christoph Hellwig, Jens Axboe

On Thu 12-11-20 17:38:36, Maxim Levitsky wrote:
> On Thu, 2020-11-12 at 12:19 +0100, Jan Kara wrote:
> > [added some relevant people and lists to CC]
> >
> > On Wed 11-11-20 17:44:05, Maxim Levitsky wrote:
> > > On Wed, 2020-11-11 at 17:39 +0200, Maxim Levitsky wrote:
> > > > clone of "starship_production"
> > >
> > > git-publish destroyed the cover letter:
> > >
> > > For reference, this is for bz #1872633
> > >
> > > The issue is that the current kernel code implementing 'fallocate'
> > > on kernel block devices roughly works like this:
> > >
> > > 1. Flush the page cache on the range that is about to be discarded.
> > > 2. Issue the discard and wait for it to finish.
> > >    (as far as I can see the discard doesn't go through the
> > >    page cache).
> > > 3. Check if the page cache is dirty for this range;
> > >    if it is dirty (meaning that someone wrote to it meanwhile),
> > >    return -EBUSY.
> > >
> > > This means that if qemu (or qemu-img) issues a write, and then a
> > > discard to an area that shares a page with the write, -EBUSY can be
> > > returned by the kernel.
> >
> > Indeed, if you don't submit PAGE_SIZE-aligned discards, you can get
> > back EBUSY, which seems wrong to me. IMO we should handle this
> > gracefully in the kernel, so we need to fix this.
> >
> > > On the other hand, for example, the ext4 implementation of discard
> > > doesn't seem to be affected. It does take a lock on the inode to
> > > avoid concurrent IO and flushes O_DIRECT writers prior to doing the
> > > discard, though.
> >
> > Well, filesystem hole punching is a somewhat different beast than
> > block device discard (at least implementation wise).
> >
> > > Doing fsync and retrying seems to resolve this issue, but it might
> > > be too big a hammer. Just retrying doesn't work, indicating that
> > > maybe the code that flushes the page cache in (1) doesn't do this
> > > correctly?
> > >
> > > It can also be racy unless special measures are taken to block IO
> > > from qemu while the fsync is running.
> > >
> > > This patch series contains two patches:
> > >
> > > The first patch just lets file-posix ignore the -EBUSY errors, which
> > > is technically enough to fall back to a plain write in this case,
> > > but seems wrong.
> > >
> > > And the second patch adds an optimization to qemu-img to avoid such
> > > a fragmented write/discard in the first place.
> > >
> > > Both patches make the reproducer work for this particular bugzilla,
> > > but I don't think they are enough.
> > >
> > > What do you think?
> >
> > So if the EBUSY error happens because something happened to the page
> > cache outside of the discarded range (like you describe above), that
> > is a kernel bug that needs to get fixed. EBUSY should really mean:
> > someone wrote to the discarded range while the discard was running,
> > and the userspace app has to deal with that depending on what it aims
> > to do...
>
> I double-checked this; these are the writes/discards according to my
> debug prints (I print start and then start+len-1 for each request).
> I have attached the patch for this for reference.
>
> ZERO: 0x00007fe00000 00007fffefff (len:0x1ff000)
>  fallocate 00007fe00000 00007fffefff

Yeah, the end at 7ffff000 is indeed not 4k aligned...

> WRITE: 0x00007ffff000 00007ffffdff (len:0xe00)
>  write 00007ffff000 00007ffffdff

... and this write follows the discarded area in the same page
(7ffff000..7ffffdff).

								Honza
--
Jan Kara <jack@suse.com>
SUSE Labs, CR

^ permalink raw reply	[flat|nested] 25+ messages in thread
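[Editor's note: Jan's point about what EBUSY "should really mean", together with the fix he posted earlier in the thread, amounts to a semantic change that can be modeled in a few lines of Python. This is a toy model by the editor, not kernel code — the page cache is reduced to a set of dirty page numbers, and the function names are invented.]

```python
EBUSY = 16


def fallocate_old(dirty_pages, first_page, last_page):
    """Pre-fix blkdev_fallocate: fail with -EBUSY if any page in the
    discarded range was dirtied while the discard was running."""
    if any(first_page <= p <= last_page for p in dirty_pages):
        return -EBUSY
    return 0


def fallocate_fixed(dirty_pages, first_page, last_page):
    """Post-fix behavior: simply drop the racing pages from the cache.
    Userspace cannot order a racing write against the discard anyway."""
    dirty_pages.difference_update(
        p for p in set(dirty_pages) if first_page <= p <= last_page)
    return 0


cache = {0x7ffff}  # page dirtied by the racing write from the trace
assert fallocate_old(set(cache), 0x7ffff, 0x801fe) == -EBUSY
assert fallocate_fixed(cache, 0x7ffff, 0x801fe) == 0 and cache == set()
```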
end of thread, other threads:[~2021-03-18  9:58 UTC | newest]

Thread overview: 25+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2020-11-11 15:39 [PATCH 0/2] RFC: Issue with discards on raw block device without O_DIRECT Maxim Levitsky
2020-11-11 15:39 ` [PATCH 1/2] file-posix: allow -EBUSY errors during write zeros on raw block devices Maxim Levitsky
2020-11-16 14:48   ` Kevin Wolf
2021-01-07 12:44     ` Maxim Levitsky
2020-11-11 15:39 ` [PATCH 2/2] qemu-img: align next status sector on destination alignment Maxim Levitsky
2020-11-12 12:40   ` Peter Lieven
2020-11-12 13:45     ` Eric Blake
2020-11-12 15:04       ` Maxim Levitsky
2021-03-18  9:57         ` Maxim Levitsky
2020-11-11 15:44 ` [PATCH 0/2] RFC: Issue with discards on raw block device without O_DIRECT Maxim Levitsky
2020-11-12 11:19   ` Jan Kara
2020-11-12 11:19     ` Jan Kara
2020-11-12 12:00     ` Jan Kara
2020-11-12 12:00       ` Jan Kara
2020-11-12 22:08       ` Darrick J. Wong
2020-11-12 22:08         ` Darrick J. Wong
2020-12-07 17:23         ` Maxim Levitsky
2020-12-07 17:23           ` Maxim Levitsky
2021-01-07 12:40           ` [PATCH] block: fallocate: avoid false positive on collision detection Maxim Levitsky
2021-01-07 12:42             ` Maxim Levitsky
2021-01-07 15:37             ` Jan Kara
2020-11-12 15:38   ` [PATCH 0/2] RFC: Issue with discards on raw block device without O_DIRECT Maxim Levitsky
2020-11-12 15:38     ` Maxim Levitsky
2020-11-13 10:07     ` Jan Kara
2020-11-13 10:07       ` Jan Kara