From: Maxim Levitsky <mlevitsk@redhat.com> To: Jan Kara <jack@suse.cz> Cc: qemu-devel@nongnu.org, Kevin Wolf <kwolf@redhat.com>, Peter Lieven <pl@kamp.de>, Paolo Bonzini <pbonzini@redhat.com>, Max Reitz <mreitz@redhat.com>, "Darrick J . Wong" <darrick.wong@oracle.com>, qemu-block@nongnu.org, linux-block@vger.kernel.org, Christoph Hellwig <hch@infradead.org>, Jens Axboe <axboe@kernel.dk> Subject: Re: [PATCH 0/2] RFC: Issue with discards on raw block device without O_DIRECT Date: Thu, 12 Nov 2020 17:38:36 +0200 [thread overview] Message-ID: <fbe9f98d6fa9ecd5f53fd284216c740d2d4a723a.camel@redhat.com> (raw) In-Reply-To: <20201112111951.GB27697@quack2.suse.cz> [-- Attachment #1: Type: text/plain, Size: 3345 bytes --] On Thu, 2020-11-12 at 12:19 +0100, Jan Kara wrote: > [added some relevant people and lists to CC] > > On Wed 11-11-20 17:44:05, Maxim Levitsky wrote: > > On Wed, 2020-11-11 at 17:39 +0200, Maxim Levitsky wrote: > > > clone of "starship_production" > > > > The git-publish destroyed the cover letter: > > > > For the reference this is for bz #1872633 > > > > The issue is that current kernel code that implements 'fallocate' > > on kernel block devices roughly works like that: > > > > 1. Flush the page cache on the range that is about to be discarded. > > 2. Issue the discard and wait for it to finish. > > (as far as I can see the discard doesn't go through the > > page cache). > > > > 3. Check if the page cache is dirty for this range, > > if it is dirty (meaning that someone wrote to it meanwhile) > > return -EBUSY. > > > > This means that if qemu (or qemu-img) issues a write, and then > > discard to the area that shares a page, -EBUSY can be returned by > > the kernel. > > Indeed, if you don't submit PAGE_SIZE aligned discards, you can get back > EBUSY which seems wrong to me. IMO we should handle this gracefully in the > kernel so we need to fix this. > > > On the other hand, for example, the ext4 implementation of discard > > doesn't seem to be affected. It does take a lock on the inode to avoid > > concurrent IO and flushes O_DIRECT writers prior to doing discard thought. > > Well, filesystem hole punching is somewhat different beast than block device > discard (at least implementation wise). > > > Doing fsync and retrying is seems to resolve this issue, but it might be > > a too big hammer. Just retrying doesn't work, indicating that maybe the > > code that flushes the page cache in (1) doesn't do this correctly ? > > > > It also can be racy unless special means are done to block IO from happening > > from qemu during this fsync. > > > > This patch series contains two patches: > > > > First patch just lets the file-posix ignore the -EBUSY errors, which is > > technically enough to fail back to plain write in this case, but seems wrong. > > > > And the second patch adds an optimization to qemu-img to avoid such a > > fragmented write/discard in the first place. > > > > Both patches make the reproducer work for this particular bugzilla, > > but I don't think they are enough. > > > > What do you think? > > So if the EBUSY error happens because something happened to the page cache > outside of discarded range (like you describe above), that is a kernel bug > than needs to get fixed. EBUSY should really mean - someone wrote to the > discarded range while discard was running and userspace app has to deal > with that depending on what it aims to do... I double checked this, those are the writes/discards according to my debug prints (I print start and then start+len-1 for each request) I have attached the patch for this for reference. ZERO: 0x00007fe00000 00007fffefff (len:0x1ff000) fallocate 00007fe00000 00007fffefff WRITE: 0x00007ffff000 00007ffffdff (len:0xe00) write 00007ffff000 00007ffffdff ZERO: 0x00007ffffe00 0000801fefff (len:0x1ff200) fallocate 00007ffffe00 0000801fefff FALLOCATE failed with error 16 qemu-img: error while writing at byte 2147483136: Device or resource busy Best regards, Maxim Levitsky > > Honza [-- Attachment #2: hacks.diff --] [-- Type: text/x-patch, Size: 3079 bytes --] commit ce897250babe3527f451c1c54c86b62659a2c29e Author: Maxim Levitsky <mlevitsk@redhat.com> Date: Thu Oct 15 17:02:58 2020 +0300 hacks diff --git a/block/file-posix.c b/block/file-posix.c index c63926d592..91ca690505 100644 --- a/block/file-posix.c +++ b/block/file-posix.c @@ -1440,6 +1440,12 @@ static ssize_t handle_aiocb_rw_linear(RawPosixAIOData *aiocb, char *buf) while (offset < aiocb->aio_nbytes) { if (aiocb->aio_type & QEMU_AIO_WRITE) { + + long unsigned int size = aiocb->aio_nbytes - offset; + long unsigned int off = aiocb->aio_offset + offset; + + printf(" write %012lx %012lx\n", off, off+size-1); + len = pwrite(aiocb->aio_fildes, (const char *)buf + offset, aiocb->aio_nbytes - offset, @@ -1581,10 +1587,15 @@ static int translate_err(int err) static int do_fallocate(int fd, int mode, off_t offset, off_t len) { do { + + printf(" fallocate %012lx %012lx\n", offset, offset+len-1); + if (fallocate(fd, mode, offset, len) == 0) { return 0; } } while (errno == EINTR); + + printf("FALLOCATE failed with error %d\n", errno); return translate_err(-errno); } #endif diff --git a/qemu-img.c b/qemu-img.c index c2c56fc797..64d3b84728 100644 --- a/qemu-img.c +++ b/qemu-img.c @@ -1803,6 +1803,7 @@ static int convert_iteration_sectors(ImgConvertState *s, int64_t sector_num) } s->sector_next_status = sector_num + n; + printf("NEXT_STATUS: %lx\n", s->sector_next_status << 9); } n = MIN(n, s->sector_next_status - sector_num); @@ -1878,6 +1879,10 @@ static int coroutine_fn convert_co_read(ImgConvertState *s, int64_t sector_num, return 0; } +static void test_print(const char* op, unsigned long start, unsigned long len) +{ + printf("%s: 0x%012lx %012lx (len:0x%lx)\n", op, start, start + len-1, len); +} static int coroutine_fn convert_co_write(ImgConvertState *s, int64_t sector_num, int nb_sectors, uint8_t *buf, @@ -1911,6 +1916,8 @@ static int coroutine_fn convert_co_write(ImgConvertState *s, int64_t sector_num, (s->compressed && !buffer_is_zero(buf, n * BDRV_SECTOR_SIZE))) { + test_print("WRITE", sector_num << BDRV_SECTOR_BITS, n << BDRV_SECTOR_BITS); + ret = blk_co_pwrite(s->target, sector_num << BDRV_SECTOR_BITS, n << BDRV_SECTOR_BITS, buf, flags); if (ret < 0) { @@ -1925,6 +1932,8 @@ static int coroutine_fn convert_co_write(ImgConvertState *s, int64_t sector_num, assert(!s->target_has_backing); break; } + test_print("ZERO", sector_num << BDRV_SECTOR_BITS, n << BDRV_SECTOR_BITS); + ret = blk_co_pwrite_zeroes(s->target, sector_num << BDRV_SECTOR_BITS, n << BDRV_SECTOR_BITS,
WARNING: multiple messages have this Message-ID (diff)
From: Maxim Levitsky <mlevitsk@redhat.com> To: Jan Kara <jack@suse.cz> Cc: Kevin Wolf <kwolf@redhat.com>, Christoph Hellwig <hch@infradead.org>, qemu-block@nongnu.org, "Darrick J . Wong" <darrick.wong@oracle.com>, Peter Lieven <pl@kamp.de>, qemu-devel@nongnu.org, Max Reitz <mreitz@redhat.com>, linux-block@vger.kernel.org, Jens Axboe <axboe@kernel.dk>, Paolo Bonzini <pbonzini@redhat.com> Subject: Re: [PATCH 0/2] RFC: Issue with discards on raw block device without O_DIRECT Date: Thu, 12 Nov 2020 17:38:36 +0200 [thread overview] Message-ID: <fbe9f98d6fa9ecd5f53fd284216c740d2d4a723a.camel@redhat.com> (raw) In-Reply-To: <20201112111951.GB27697@quack2.suse.cz> [-- Attachment #1: Type: text/plain, Size: 3345 bytes --] On Thu, 2020-11-12 at 12:19 +0100, Jan Kara wrote: > [added some relevant people and lists to CC] > > On Wed 11-11-20 17:44:05, Maxim Levitsky wrote: > > On Wed, 2020-11-11 at 17:39 +0200, Maxim Levitsky wrote: > > > clone of "starship_production" > > > > The git-publish destroyed the cover letter: > > > > For the reference this is for bz #1872633 > > > > The issue is that current kernel code that implements 'fallocate' > > on kernel block devices roughly works like that: > > > > 1. Flush the page cache on the range that is about to be discarded. > > 2. Issue the discard and wait for it to finish. > > (as far as I can see the discard doesn't go through the > > page cache). > > > > 3. Check if the page cache is dirty for this range, > > if it is dirty (meaning that someone wrote to it meanwhile) > > return -EBUSY. > > > > This means that if qemu (or qemu-img) issues a write, and then > > discard to the area that shares a page, -EBUSY can be returned by > > the kernel. > > Indeed, if you don't submit PAGE_SIZE aligned discards, you can get back > EBUSY which seems wrong to me. IMO we should handle this gracefully in the > kernel so we need to fix this. > > > On the other hand, for example, the ext4 implementation of discard > > doesn't seem to be affected. It does take a lock on the inode to avoid > > concurrent IO and flushes O_DIRECT writers prior to doing discard thought. > > Well, filesystem hole punching is somewhat different beast than block device > discard (at least implementation wise). > > > Doing fsync and retrying is seems to resolve this issue, but it might be > > a too big hammer. Just retrying doesn't work, indicating that maybe the > > code that flushes the page cache in (1) doesn't do this correctly ? > > > > It also can be racy unless special means are done to block IO from happening > > from qemu during this fsync. > > > > This patch series contains two patches: > > > > First patch just lets the file-posix ignore the -EBUSY errors, which is > > technically enough to fail back to plain write in this case, but seems wrong. > > > > And the second patch adds an optimization to qemu-img to avoid such a > > fragmented write/discard in the first place. > > > > Both patches make the reproducer work for this particular bugzilla, > > but I don't think they are enough. > > > > What do you think? > > So if the EBUSY error happens because something happened to the page cache > outside of discarded range (like you describe above), that is a kernel bug > than needs to get fixed. EBUSY should really mean - someone wrote to the > discarded range while discard was running and userspace app has to deal > with that depending on what it aims to do... I double checked this, those are the writes/discards according to my debug prints (I print start and then start+len-1 for each request) I have attached the patch for this for reference. ZERO: 0x00007fe00000 00007fffefff (len:0x1ff000) fallocate 00007fe00000 00007fffefff WRITE: 0x00007ffff000 00007ffffdff (len:0xe00) write 00007ffff000 00007ffffdff ZERO: 0x00007ffffe00 0000801fefff (len:0x1ff200) fallocate 00007ffffe00 0000801fefff FALLOCATE failed with error 16 qemu-img: error while writing at byte 2147483136: Device or resource busy Best regards, Maxim Levitsky > > Honza [-- Attachment #2: hacks.diff --] [-- Type: text/x-patch, Size: 3079 bytes --] commit ce897250babe3527f451c1c54c86b62659a2c29e Author: Maxim Levitsky <mlevitsk@redhat.com> Date: Thu Oct 15 17:02:58 2020 +0300 hacks diff --git a/block/file-posix.c b/block/file-posix.c index c63926d592..91ca690505 100644 --- a/block/file-posix.c +++ b/block/file-posix.c @@ -1440,6 +1440,12 @@ static ssize_t handle_aiocb_rw_linear(RawPosixAIOData *aiocb, char *buf) while (offset < aiocb->aio_nbytes) { if (aiocb->aio_type & QEMU_AIO_WRITE) { + + long unsigned int size = aiocb->aio_nbytes - offset; + long unsigned int off = aiocb->aio_offset + offset; + + printf(" write %012lx %012lx\n", off, off+size-1); + len = pwrite(aiocb->aio_fildes, (const char *)buf + offset, aiocb->aio_nbytes - offset, @@ -1581,10 +1587,15 @@ static int translate_err(int err) static int do_fallocate(int fd, int mode, off_t offset, off_t len) { do { + + printf(" fallocate %012lx %012lx\n", offset, offset+len-1); + if (fallocate(fd, mode, offset, len) == 0) { return 0; } } while (errno == EINTR); + + printf("FALLOCATE failed with error %d\n", errno); return translate_err(-errno); } #endif diff --git a/qemu-img.c b/qemu-img.c index c2c56fc797..64d3b84728 100644 --- a/qemu-img.c +++ b/qemu-img.c @@ -1803,6 +1803,7 @@ static int convert_iteration_sectors(ImgConvertState *s, int64_t sector_num) } s->sector_next_status = sector_num + n; + printf("NEXT_STATUS: %lx\n", s->sector_next_status << 9); } n = MIN(n, s->sector_next_status - sector_num); @@ -1878,6 +1879,10 @@ static int coroutine_fn convert_co_read(ImgConvertState *s, int64_t sector_num, return 0; } +static void test_print(const char* op, unsigned long start, unsigned long len) +{ + printf("%s: 0x%012lx %012lx (len:0x%lx)\n", op, start, start + len-1, len); +} static int coroutine_fn convert_co_write(ImgConvertState *s, int64_t sector_num, int nb_sectors, uint8_t *buf, @@ -1911,6 +1916,8 @@ static int coroutine_fn convert_co_write(ImgConvertState *s, int64_t sector_num, (s->compressed && !buffer_is_zero(buf, n * BDRV_SECTOR_SIZE))) { + test_print("WRITE", sector_num << BDRV_SECTOR_BITS, n << BDRV_SECTOR_BITS); + ret = blk_co_pwrite(s->target, sector_num << BDRV_SECTOR_BITS, n << BDRV_SECTOR_BITS, buf, flags); if (ret < 0) { @@ -1925,6 +1932,8 @@ static int coroutine_fn convert_co_write(ImgConvertState *s, int64_t sector_num, assert(!s->target_has_backing); break; } + test_print("ZERO", sector_num << BDRV_SECTOR_BITS, n << BDRV_SECTOR_BITS); + ret = blk_co_pwrite_zeroes(s->target, sector_num << BDRV_SECTOR_BITS, n << BDRV_SECTOR_BITS,
next prev parent reply other threads:[~2020-11-12 15:38 UTC|newest] Thread overview: 25+ messages / expand[flat|nested] mbox.gz Atom feed top 2020-11-11 15:39 [PATCH 0/2] RFC: Issue with discards on raw block device without O_DIRECT Maxim Levitsky 2020-11-11 15:39 ` [PATCH 1/2] file-posix: allow -EBUSY errors during write zeros on raw block devices Maxim Levitsky 2020-11-16 14:48 ` Kevin Wolf 2021-01-07 12:44 ` Maxim Levitsky 2020-11-11 15:39 ` [PATCH 2/2] qemu-img: align next status sector on destination alignment Maxim Levitsky 2020-11-12 12:40 ` Peter Lieven 2020-11-12 13:45 ` Eric Blake 2020-11-12 15:04 ` Maxim Levitsky 2021-03-18 9:57 ` Maxim Levitsky 2020-11-11 15:44 ` [PATCH 0/2] RFC: Issue with discards on raw block device without O_DIRECT Maxim Levitsky 2020-11-12 11:19 ` Jan Kara 2020-11-12 11:19 ` Jan Kara 2020-11-12 12:00 ` Jan Kara 2020-11-12 12:00 ` Jan Kara 2020-11-12 22:08 ` Darrick J. Wong 2020-11-12 22:08 ` Darrick J. Wong 2020-12-07 17:23 ` Maxim Levitsky 2020-12-07 17:23 ` Maxim Levitsky 2021-01-07 12:40 ` [PATCH] block: fallocate: avoid false positive on collision detection Maxim Levitsky 2021-01-07 12:42 ` Maxim Levitsky 2021-01-07 15:37 ` Jan Kara 2020-11-12 15:38 ` Maxim Levitsky [this message] 2020-11-12 15:38 ` [PATCH 0/2] RFC: Issue with discards on raw block device without O_DIRECT Maxim Levitsky 2020-11-13 10:07 ` Jan Kara 2020-11-13 10:07 ` Jan Kara
Reply instructions: You may reply publicly to this message via plain-text email using any one of the following methods: * Save the following mbox file, import it into your mail client, and reply-to-all from there: mbox Avoid top-posting and favor interleaved quoting: https://en.wikipedia.org/wiki/Posting_style#Interleaved_style * Reply using the --to, --cc, and --in-reply-to switches of git-send-email(1): git send-email \ --in-reply-to=fbe9f98d6fa9ecd5f53fd284216c740d2d4a723a.camel@redhat.com \ --to=mlevitsk@redhat.com \ --cc=axboe@kernel.dk \ --cc=darrick.wong@oracle.com \ --cc=hch@infradead.org \ --cc=jack@suse.cz \ --cc=kwolf@redhat.com \ --cc=linux-block@vger.kernel.org \ --cc=mreitz@redhat.com \ --cc=pbonzini@redhat.com \ --cc=pl@kamp.de \ --cc=qemu-block@nongnu.org \ --cc=qemu-devel@nongnu.org \ /path/to/YOUR_REPLY https://kernel.org/pub/software/scm/git/docs/git-send-email.html * If your mail client supports setting the In-Reply-To header via mailto: links, try the mailto: linkBe sure your reply has a Subject: header at the top and a blank line before the message body.
This is an external index of several public inboxes, see mirroring instructions on how to clone and mirror all data and code used by this external index.