Re: [PATCH 0/2] RFC: Issue with discards on raw block device without O_DIRECT

From: Maxim Levitsky <mlevitsk@redhat.com>
To: Jan Kara <jack@suse.cz>
Cc: qemu-devel@nongnu.org, Kevin Wolf <kwolf@redhat.com>,
	Peter Lieven <pl@kamp.de>, Paolo Bonzini <pbonzini@redhat.com>,
	Max Reitz <mreitz@redhat.com>,
	"Darrick J . Wong" <darrick.wong@oracle.com>,
	qemu-block@nongnu.org, linux-block@vger.kernel.org,
	Christoph Hellwig <hch@infradead.org>,
	Jens Axboe <axboe@kernel.dk>
Subject: Re: [PATCH 0/2] RFC: Issue with discards on raw block device without O_DIRECT
Date: Thu, 12 Nov 2020 17:38:36 +0200	[thread overview]
Message-ID: <fbe9f98d6fa9ecd5f53fd284216c740d2d4a723a.camel@redhat.com> (raw)
In-Reply-To: <20201112111951.GB27697@quack2.suse.cz>

[-- Attachment #1: Type: text/plain, Size: 3345 bytes --]

On Thu, 2020-11-12 at 12:19 +0100, Jan Kara wrote:
> [added some relevant people and lists to CC]
> 
> On Wed 11-11-20 17:44:05, Maxim Levitsky wrote:
> > On Wed, 2020-11-11 at 17:39 +0200, Maxim Levitsky wrote:
> > > clone of "starship_production"
> > 
> > The git-publish destroyed the cover letter:
> > 
> > For the reference this is for bz #1872633
> > 
> > The issue is that current kernel code that implements 'fallocate'
> > on kernel block devices roughly works like that:
> > 
> > 1. Flush the page cache on the range that is about to be discarded.
> > 2. Issue the discard and wait for it to finish.
> >    (as far as I can see the discard doesn't go through the
> >    page cache).
> > 
> > 3. Check if the page cache is dirty for this range,
> >    if it is dirty (meaning that someone wrote to it meanwhile)
> >    return -EBUSY.
> > 
> > This means that if qemu (or qemu-img) issues a write, and then
> > discard to the area that shares a page, -EBUSY can be returned by
> > the kernel.
> 
> Indeed, if you don't submit PAGE_SIZE aligned discards, you can get back
> EBUSY which seems wrong to me. IMO we should handle this gracefully in the
> kernel so we need to fix this.
> 
> > On the other hand, for example, the ext4 implementation of discard
> > doesn't seem to be affected. It does take a lock on the inode to avoid
> > concurrent IO and flushes O_DIRECT writers prior to doing discard thought.
> 
> Well, filesystem hole punching is somewhat different beast than block device
> discard (at least implementation wise).
> 
> > Doing fsync and retrying is seems to resolve this issue, but it might be
> > a too big hammer.  Just retrying doesn't work, indicating that maybe the
> > code that flushes the page cache in (1) doesn't do this correctly ?
> > 
> > It also can be racy unless special means are done to block IO from happening
> > from qemu during this fsync.
> > 
> > This patch series contains two patches:
> > 
> > First patch just lets the file-posix ignore the -EBUSY errors, which is
> > technically enough to fail back to plain write in this case, but seems wrong.
> > 
> > And the second patch adds an optimization to qemu-img to avoid such a
> > fragmented write/discard in the first place.
> > 
> > Both patches make the reproducer work for this particular bugzilla,
> > but I don't think they are enough.
> > 
> > What do you think?
> 
> So if the EBUSY error happens because something happened to the page cache
> outside of discarded range (like you describe above), that is a kernel bug
> than needs to get fixed. EBUSY should really mean - someone wrote to the
> discarded range while discard was running and userspace app has to deal
> with that depending on what it aims to do...
I double checked this, those are the writes/discards according to my debug
prints (I print start and then start+len-1 for each request)
I have attached the patch for this for reference.

ZERO: 0x00007fe00000 00007fffefff (len:0x1ff000)
       fallocate 00007fe00000 00007fffefff
WRITE: 0x00007ffff000 00007ffffdff (len:0xe00)
       write 00007ffff000 00007ffffdff
ZERO: 0x00007ffffe00
0000801fefff (len:0x1ff200)
       fallocate 00007ffffe00 0000801fefff
FALLOCATE failed with error 16
qemu-img: error while writing at byte 2147483136: Device or resource busy


Best regards,
     Maxim Levitsky

> 
> 								Honza


[-- Attachment #2: hacks.diff --]
[-- Type: text/x-patch, Size: 3079 bytes --]

commit ce897250babe3527f451c1c54c86b62659a2c29e
Author: Maxim Levitsky <mlevitsk@redhat.com>
Date:   Thu Oct 15 17:02:58 2020 +0300

    hacks

diff --git a/block/file-posix.c b/block/file-posix.c
index c63926d592..91ca690505 100644
--- a/block/file-posix.c
+++ b/block/file-posix.c
@@ -1440,6 +1440,12 @@ static ssize_t handle_aiocb_rw_linear(RawPosixAIOData *aiocb, char *buf)
 
     while (offset < aiocb->aio_nbytes) {
         if (aiocb->aio_type & QEMU_AIO_WRITE) {
+
+            long unsigned int size = aiocb->aio_nbytes - offset;
+            long unsigned int off = aiocb->aio_offset + offset;
+
+            printf("       write %012lx %012lx\n", off, off+size-1);
+
             len = pwrite(aiocb->aio_fildes,
                          (const char *)buf + offset,
                          aiocb->aio_nbytes - offset,
@@ -1581,10 +1587,15 @@ static int translate_err(int err)
 static int do_fallocate(int fd, int mode, off_t offset, off_t len)
 {
     do {
+
+        printf("       fallocate %012lx %012lx\n", offset, offset+len-1);
+
         if (fallocate(fd, mode, offset, len) == 0) {
             return 0;
         }
     } while (errno == EINTR);
+
+    printf("FALLOCATE failed with error %d\n", errno);
     return translate_err(-errno);
 }
 #endif
diff --git a/qemu-img.c b/qemu-img.c
index c2c56fc797..64d3b84728 100644
--- a/qemu-img.c
+++ b/qemu-img.c
@@ -1803,6 +1803,7 @@ static int convert_iteration_sectors(ImgConvertState *s, int64_t sector_num)
         }
 
         s->sector_next_status = sector_num + n;
+        printf("NEXT_STATUS: %lx\n", s->sector_next_status << 9);
     }
 
     n = MIN(n, s->sector_next_status - sector_num);
@@ -1878,6 +1879,10 @@ static int coroutine_fn convert_co_read(ImgConvertState *s, int64_t sector_num,
     return 0;
 }
 
+static void test_print(const char* op, unsigned long start, unsigned long len)
+{
+    printf("%s: 0x%012lx %012lx (len:0x%lx)\n", op, start, start + len-1, len);
+}
 
 static int coroutine_fn convert_co_write(ImgConvertState *s, int64_t sector_num,
                                          int nb_sectors, uint8_t *buf,
@@ -1911,6 +1916,8 @@ static int coroutine_fn convert_co_write(ImgConvertState *s, int64_t sector_num,
                 (s->compressed &&
                  !buffer_is_zero(buf, n * BDRV_SECTOR_SIZE)))
             {
+                test_print("WRITE", sector_num << BDRV_SECTOR_BITS, n << BDRV_SECTOR_BITS);
+
                 ret = blk_co_pwrite(s->target, sector_num << BDRV_SECTOR_BITS,
                                     n << BDRV_SECTOR_BITS, buf, flags);
                 if (ret < 0) {
@@ -1925,6 +1932,8 @@ static int coroutine_fn convert_co_write(ImgConvertState *s, int64_t sector_num,
                 assert(!s->target_has_backing);
                 break;
             }
+            test_print("ZERO", sector_num << BDRV_SECTOR_BITS, n << BDRV_SECTOR_BITS);
+
             ret = blk_co_pwrite_zeroes(s->target,
                                        sector_num << BDRV_SECTOR_BITS,
                                        n << BDRV_SECTOR_BITS,