[Qemu-devel] [PATCH v2 0/7] eliminate data write in bdrv_write

All of lore.kernel.org
 help / color / mirror / Atom feed

* [Qemu-devel] [PATCH v2 0/7] eliminate data write in bdrv_write_zeroes on Linux
@ 2014-12-26 12:35 Denis V. Lunev
  2014-12-26 12:35 ` [Qemu-devel] [PATCH 1/7] block: fix maximum length sent to bdrv_co_do_write_zeroes callback in bs Denis V. Lunev
                   ` (6 more replies)
  0 siblings, 7 replies; 17+ messages in thread
From: Denis V. Lunev @ 2014-12-26 12:35 UTC (permalink / raw)
  Cc: Kevin Wolf, Denis V. Lunev, qemu-devel, Stefan Hajnoczi

These patches eliminate data writes completely on Linux if fallocate
FALLOC_FL_ZERO_RANGE or FALLOC_FL_PUNCH_HOLE are  supported on
underlying filesystem.

This should seriously increase performance in some special cases.

Signed-off-by: Denis V. Lunev <den@openvz.org>
CC: Kevin Wolf <kwolf@redhat.com>
CC: Stefan Hajnoczi <stefanha@redhat.com>

^ permalink raw reply	[flat|nested] 17+ messages in thread

* [Qemu-devel] [PATCH 1/7] block: fix maximum length sent to bdrv_co_do_write_zeroes callback in bs
  2014-12-26 12:35 [Qemu-devel] [PATCH v2 0/7] eliminate data write in bdrv_write_zeroes on Linux Denis V. Lunev
@ 2014-12-26 12:35 ` Denis V. Lunev
  2014-12-26 13:13   ` Peter Lieven
  2014-12-26 12:35 ` [Qemu-devel] [PATCH 2/7] block: use fallocate(FALLOC_FL_ZERO_RANGE) in handle_aiocb_write_zeroes Denis V. Lunev
                   ` (5 subsequent siblings)
  6 siblings, 1 reply; 17+ messages in thread
From: Denis V. Lunev @ 2014-12-26 12:35 UTC (permalink / raw)
  Cc: Kevin Wolf, Peter Lieven, qemu-devel, Stefan Hajnoczi, Denis V. Lunev

The check for maximum length was added by
  commit c31cb70728d2c0c8900b35a66784baa446fd5147
  Author: Peter Lieven <pl@kamp.de>
  Date:   Thu Oct 24 12:06:58 2013 +0200
    block: honour BlockLimits in bdrv_co_do_write_zeroes

but actually if driver provides .bdrv_co_write_zeroes callback, there is
no need to limit the size to 32 MB. Callback should provide effective
implementation which normally should not write any zeroes in comparable
amount.

Correct check should be as follows to honour BlockLimits for slow writes:
- if bs->bl.max_write_zeroes is specified, it should be applied to all
  paths (fast and slow)
- MAX_WRITE_ZEROES_DEFAULT should be applied for slow path only

Signed-off-by: Denis V. Lunev <den@openvz.org>
CC: Eric Blake <eblake@redhat.com>
CC: Peter Lieven <pl@kamp.de>
CC: Kevin Wolf <kwolf@redhat.com>
CC: Stefan Hajnoczi <stefanha@redhat.com>
---
 block.c | 21 +++++++++++++++------
 1 file changed, 15 insertions(+), 6 deletions(-)

diff --git a/block.c b/block.c
index 4165d42..0ec8b15 100644
--- a/block.c
+++ b/block.c
@@ -3182,8 +3182,12 @@ static int coroutine_fn bdrv_co_do_write_zeroes(BlockDriverState *bs,
     struct iovec iov = {0};
     int ret = 0;
 
-    int max_write_zeroes = bs->bl.max_write_zeroes ?
-                           bs->bl.max_write_zeroes : MAX_WRITE_ZEROES_DEFAULT;
+    int max_fast_write_size = nb_sectors;
+    int max_slow_write_size = MAX_WRITE_ZEROES_DEFAULT;
+    if (bs->bl.max_write_zeroes != 0) {
+        max_fast_write_size = bs->bl.max_write_zeroes;
+        max_slow_write_size = bs->bl.max_write_zeroes;
+    }
 
     while (nb_sectors > 0 && !ret) {
         int num = nb_sectors;
@@ -3205,9 +3209,9 @@ static int coroutine_fn bdrv_co_do_write_zeroes(BlockDriverState *bs,
             }
         }
 
-        /* limit request size */
-        if (num > max_write_zeroes) {
-            num = max_write_zeroes;
+        /* limit request size for fast path */
+        if (num > max_fast_write_size) {
+            num = max_fast_write_size;
         }
 
         ret = -ENOTSUP;
@@ -3217,6 +3221,11 @@ static int coroutine_fn bdrv_co_do_write_zeroes(BlockDriverState *bs,
         }
 
         if (ret == -ENOTSUP) {
+            /* limit request size further for slow path */
+            if (num > max_slow_write_size) {
+                num = max_slow_write_size;
+            }
+
             /* Fall back to bounce buffer if write zeroes is unsupported */
             iov.iov_len = num * BDRV_SECTOR_SIZE;
             if (iov.iov_base == NULL) {
@@ -3234,7 +3243,7 @@ static int coroutine_fn bdrv_co_do_write_zeroes(BlockDriverState *bs,
             /* Keep bounce buffer around if it is big enough for all
              * all future requests.
              */
-            if (num < max_write_zeroes) {
+            if (num < max_slow_write_size) {
                 qemu_vfree(iov.iov_base);
                 iov.iov_base = NULL;
             }
-- 
1.9.1

^ permalink raw reply related	[flat|nested] 17+ messages in thread

* [Qemu-devel] [PATCH 2/7] block: use fallocate(FALLOC_FL_ZERO_RANGE) in handle_aiocb_write_zeroes
  2014-12-26 12:35 [Qemu-devel] [PATCH v2 0/7] eliminate data write in bdrv_write_zeroes on Linux Denis V. Lunev
  2014-12-26 12:35 ` [Qemu-devel] [PATCH 1/7] block: fix maximum length sent to bdrv_co_do_write_zeroes callback in bs Denis V. Lunev
@ 2014-12-26 12:35 ` Denis V. Lunev
  2014-12-26 12:35 ` [Qemu-devel] [PATCH 3/7] block/raw-posix: create do_fallocate helper Denis V. Lunev
                   ` (4 subsequent siblings)
  6 siblings, 0 replies; 17+ messages in thread
From: Denis V. Lunev @ 2014-12-26 12:35 UTC (permalink / raw)
  Cc: Kevin Wolf, Denis V. Lunev, qemu-devel, Stefan Hajnoczi

This efficiently writes zeroes on Linux if the kernel is capable enough.
FALLOC_FL_ZERO_RANGE correctly handles all cases, including and not
including file expansion.

Signed-off-by: Denis V. Lunev <den@openvz.org>
CC: Kevin Wolf <kwolf@redhat.com>
CC: Stefan Hajnoczi <stefanha@redhat.com>
---
 block/raw-posix.c | 13 ++++++++++++-
 configure         | 19 +++++++++++++++++++
 2 files changed, 31 insertions(+), 1 deletion(-)

diff --git a/block/raw-posix.c b/block/raw-posix.c
index e51293a..66ebaab 100644
--- a/block/raw-posix.c
+++ b/block/raw-posix.c
@@ -60,7 +60,7 @@
 #define FS_NOCOW_FL                     0x00800000 /* Do not cow file */
 #endif
 #endif
-#ifdef CONFIG_FALLOCATE_PUNCH_HOLE
+#if defined(CONFIG_FALLOCATE_PUNCH_HOLE) || defined(CONFIG_FALLOCATE_ZERO_RANGE)
 #include <linux/falloc.h>
 #endif
 #if defined (__FreeBSD__) || defined(__FreeBSD_kernel__)
@@ -919,6 +919,17 @@ static ssize_t handle_aiocb_write_zeroes(RawPosixAIOData *aiocb)
             return xfs_write_zeroes(s, aiocb->aio_offset, aiocb->aio_nbytes);
         }
 #endif
+
+#ifdef CONFIG_FALLOCATE_ZERO_RANGE
+        do {
+            if (fallocate(s->fd, FALLOC_FL_ZERO_RANGE,
+                          aiocb->aio_offset, aiocb->aio_nbytes) == 0) {
+                return 0;
+            }
+        } while (errno == EINTR);
+
+        ret = -errno;
+#endif
     }
 
     if (ret == -ENODEV || ret == -ENOSYS || ret == -EOPNOTSUPP ||
diff --git a/configure b/configure
index cae588c..dfcf7b3 100755
--- a/configure
+++ b/configure
@@ -3309,6 +3309,22 @@ if compile_prog "" "" ; then
   fallocate_punch_hole=yes
 fi
 
+# check that fallocate supports range zeroing inside the file
+fallocate_zero_range=no
+cat > $TMPC << EOF
+#include <fcntl.h>
+#include <linux/falloc.h>
+
+int main(void)
+{
+    fallocate(0, FALLOC_FL_ZERO_RANGE, 0, 0);
+    return 0;
+}
+EOF
+if compile_prog "" "" ; then
+  fallocate_zero_range=yes
+fi
+
 # check for posix_fallocate
 posix_fallocate=no
 cat > $TMPC << EOF
@@ -4538,6 +4554,9 @@ fi
 if test "$fallocate_punch_hole" = "yes" ; then
   echo "CONFIG_FALLOCATE_PUNCH_HOLE=y" >> $config_host_mak
 fi
+if test "$fallocate_zero_range" = "yes" ; then
+  echo "CONFIG_FALLOCATE_ZERO_RANGE=y" >> $config_host_mak
+fi
 if test "$posix_fallocate" = "yes" ; then
   echo "CONFIG_POSIX_FALLOCATE=y" >> $config_host_mak
 fi
-- 
1.9.1

^ permalink raw reply related	[flat|nested] 17+ messages in thread

* [Qemu-devel] [PATCH 3/7] block/raw-posix: create do_fallocate helper
  2014-12-26 12:35 [Qemu-devel] [PATCH v2 0/7] eliminate data write in bdrv_write_zeroes on Linux Denis V. Lunev
  2014-12-26 12:35 ` [Qemu-devel] [PATCH 1/7] block: fix maximum length sent to bdrv_co_do_write_zeroes callback in bs Denis V. Lunev
  2014-12-26 12:35 ` [Qemu-devel] [PATCH 2/7] block: use fallocate(FALLOC_FL_ZERO_RANGE) in handle_aiocb_write_zeroes Denis V. Lunev
@ 2014-12-26 12:35 ` Denis V. Lunev
  2014-12-26 12:35 ` [Qemu-devel] [PATCH 4/7] block/raw-posix: create translate_err helper to merge errno values Denis V. Lunev
                   ` (3 subsequent siblings)
  6 siblings, 0 replies; 17+ messages in thread
From: Denis V. Lunev @ 2014-12-26 12:35 UTC (permalink / raw)
  Cc: Kevin Wolf, Denis V. Lunev, qemu-devel, Stefan Hajnoczi

The pattern
    do {
        if (fallocate(s->fd, mode, offset, len) == 0) {
            return 0;
        }
    } while (errno == EINTR);
is used twice at the moment and I am going to add more usages. Move it
to the helper function.

Signed-off-by: Denis V. Lunev <den@openvz.org>
CC: Kevin Wolf <kwolf@redhat.com>
CC: Stefan Hajnoczi <stefanha@redhat.com>
---
 block/raw-posix.c | 33 +++++++++++++++++----------------
 1 file changed, 17 insertions(+), 16 deletions(-)

diff --git a/block/raw-posix.c b/block/raw-posix.c
index 66ebaab..a7c8816 100644
--- a/block/raw-posix.c
+++ b/block/raw-posix.c
@@ -893,6 +893,19 @@ static int xfs_discard(BDRVRawState *s, int64_t offset, uint64_t bytes)
 }
 #endif
 
+
+#if defined(CONFIG_FALLOCATE_PUNCH_HOLE) || defined(CONFIG_FALLOCATE_ZERO_RANGE)
+static int do_fallocate(int fd, int mode, off_t offset, off_t len)
+{
+    do {
+        if (fallocate(fd, mode, offset, len) == 0) {
+            return 0;
+        }
+    } while (errno == EINTR);
+    return -errno;
+}
+#endif
+
 static ssize_t handle_aiocb_write_zeroes(RawPosixAIOData *aiocb)
 {
     int ret = -EOPNOTSUPP;
@@ -921,14 +934,8 @@ static ssize_t handle_aiocb_write_zeroes(RawPosixAIOData *aiocb)
 #endif
 
 #ifdef CONFIG_FALLOCATE_ZERO_RANGE
-        do {
-            if (fallocate(s->fd, FALLOC_FL_ZERO_RANGE,
-                          aiocb->aio_offset, aiocb->aio_nbytes) == 0) {
-                return 0;
-            }
-        } while (errno == EINTR);
-
-        ret = -errno;
+        ret = do_fallocate(s->fd, FALLOC_FL_ZERO_RANGE,
+                           aiocb->aio_offset, aiocb->aio_nbytes);
 #endif
     }
 
@@ -968,14 +975,8 @@ static ssize_t handle_aiocb_discard(RawPosixAIOData *aiocb)
 #endif
 
 #ifdef CONFIG_FALLOCATE_PUNCH_HOLE
-        do {
-            if (fallocate(s->fd, FALLOC_FL_PUNCH_HOLE | FALLOC_FL_KEEP_SIZE,
-                          aiocb->aio_offset, aiocb->aio_nbytes) == 0) {
-                return 0;
-            }
-        } while (errno == EINTR);
-
-        ret = -errno;
+        ret = do_fallocate(s->fd, FALLOC_FL_PUNCH_HOLE | FALLOC_FL_KEEP_SIZE,
+                           aiocb->aio_offset, aiocb->aio_nbytes);
 #endif
     }
 
-- 
1.9.1

^ permalink raw reply related	[flat|nested] 17+ messages in thread

* [Qemu-devel] [PATCH 4/7] block/raw-posix: create translate_err helper to merge errno values
  2014-12-26 12:35 [Qemu-devel] [PATCH v2 0/7] eliminate data write in bdrv_write_zeroes on Linux Denis V. Lunev
                   ` (2 preceding siblings ...)
  2014-12-26 12:35 ` [Qemu-devel] [PATCH 3/7] block/raw-posix: create do_fallocate helper Denis V. Lunev
@ 2014-12-26 12:35 ` Denis V. Lunev
  2014-12-26 12:35 ` [Qemu-devel] [PATCH 5/7] block/raw-posix: refactor handle_aiocb_write_zeroes a bit Denis V. Lunev
                   ` (2 subsequent siblings)
  6 siblings, 0 replies; 17+ messages in thread
From: Denis V. Lunev @ 2014-12-26 12:35 UTC (permalink / raw)
  Cc: Kevin Wolf, Denis V. Lunev, qemu-devel, Stefan Hajnoczi

actually the code
    if (ret == -ENODEV || ret == -ENOSYS || ret == -EOPNOTSUPP ||
        ret == -ENOTTY) {
        ret = -ENOTSUP;
    }
is present twice and will be added a couple more times. Create helper
for this. Place it into do_fallocate() for further convinience.

Signed-off-by: Denis V. Lunev <den@openvz.org>
CC: Kevin Wolf <kwolf@redhat.com>
CC: Stefan Hajnoczi <stefanha@redhat.com>
---
 block/raw-posix.c | 21 ++++++++++++++-------
 1 file changed, 14 insertions(+), 7 deletions(-)

diff --git a/block/raw-posix.c b/block/raw-posix.c
index a7c8816..25a6947 100644
--- a/block/raw-posix.c
+++ b/block/raw-posix.c
@@ -894,6 +894,15 @@ static int xfs_discard(BDRVRawState *s, int64_t offset, uint64_t bytes)
 #endif
 
 
+static int translate_err(int err)
+{
+    if (err == -ENODEV || err == -ENOSYS || err == -EOPNOTSUPP ||
+        err == -ENOTTY) {
+        err = -ENOTSUP;
+    }
+    return err;
+}
+
 #if defined(CONFIG_FALLOCATE_PUNCH_HOLE) || defined(CONFIG_FALLOCATE_ZERO_RANGE)
 static int do_fallocate(int fd, int mode, off_t offset, off_t len)
 {
@@ -902,7 +911,7 @@ static int do_fallocate(int fd, int mode, off_t offset, off_t len)
             return 0;
         }
     } while (errno == EINTR);
-    return -errno;
+    return translate_err(-errno);
 }
 #endif
 
@@ -939,10 +948,9 @@ static ssize_t handle_aiocb_write_zeroes(RawPosixAIOData *aiocb)
 #endif
     }
 
-    if (ret == -ENODEV || ret == -ENOSYS || ret == -EOPNOTSUPP ||
-        ret == -ENOTTY) {
+    ret = translate_err(ret);
+    if (ret == -ENOTSUP) {
         s->has_write_zeroes = false;
-        ret = -ENOTSUP;
     }
     return ret;
 }
@@ -980,10 +988,9 @@ static ssize_t handle_aiocb_discard(RawPosixAIOData *aiocb)
 #endif
     }
 
-    if (ret == -ENODEV || ret == -ENOSYS || ret == -EOPNOTSUPP ||
-        ret == -ENOTTY) {
+    ret = translate_err(ret);
+    if (ret == -ENOTSUP) {
         s->has_discard = false;
-        ret = -ENOTSUP;
     }
     return ret;
 }
-- 
1.9.1

^ permalink raw reply related	[flat|nested] 17+ messages in thread

* [Qemu-devel] [PATCH 5/7] block/raw-posix: refactor handle_aiocb_write_zeroes a bit
  2014-12-26 12:35 [Qemu-devel] [PATCH v2 0/7] eliminate data write in bdrv_write_zeroes on Linux Denis V. Lunev
                   ` (3 preceding siblings ...)
  2014-12-26 12:35 ` [Qemu-devel] [PATCH 4/7] block/raw-posix: create translate_err helper to merge errno values Denis V. Lunev
@ 2014-12-26 12:35 ` Denis V. Lunev
  2014-12-26 12:35 ` [Qemu-devel] [PATCH 6/7] block: use fallocate(FALLOC_FL_PUNCH_HOLE) & fallocate(0) to write zeroes Denis V. Lunev
  2014-12-26 12:35 ` [Qemu-devel] [PATCH 7/7] block/raw-posix: call plain fallocate in handle_aiocb_write_zeroes Denis V. Lunev
  6 siblings, 0 replies; 17+ messages in thread
From: Denis V. Lunev @ 2014-12-26 12:35 UTC (permalink / raw)
  Cc: Kevin Wolf, Denis V. Lunev, qemu-devel, Stefan Hajnoczi

move code dealing with a block device to a separate function. This will
allow to implement additional processing for an ordinary files.

Pls note, that xfs_code has been moved before checking for
s->has_write_zeroes as xfs_write_zeroes does not touch this flag inside.
This makes code a bit more consistent.

Signed-off-by: Denis V. Lunev <den@openvz.org>
CC: Kevin Wolf <kwolf@redhat.com>
CC: Stefan Hajnoczi <stefanha@redhat.com>
---
 block/raw-posix.c | 60 +++++++++++++++++++++++++++++++++++--------------------
 1 file changed, 38 insertions(+), 22 deletions(-)

diff --git a/block/raw-posix.c b/block/raw-posix.c
index 25a6947..7866d31 100644
--- a/block/raw-posix.c
+++ b/block/raw-posix.c
@@ -915,46 +915,62 @@ static int do_fallocate(int fd, int mode, off_t offset, off_t len)
 }
 #endif
 
-static ssize_t handle_aiocb_write_zeroes(RawPosixAIOData *aiocb)
+static ssize_t handle_aiocb_write_zeroes_block(RawPosixAIOData *aiocb)
 {
     int ret = -EOPNOTSUPP;
     BDRVRawState *s = aiocb->bs->opaque;
 
-    if (s->has_write_zeroes == 0) {
+    if (!s->has_write_zeroes) {
         return -ENOTSUP;
     }
 
-    if (aiocb->aio_type & QEMU_AIO_BLKDEV) {
 #ifdef BLKZEROOUT
-        do {
-            uint64_t range[2] = { aiocb->aio_offset, aiocb->aio_nbytes };
-            if (ioctl(aiocb->aio_fildes, BLKZEROOUT, range) == 0) {
-                return 0;
-            }
-        } while (errno == EINTR);
-
-        ret = -errno;
-#endif
-    } else {
-#ifdef CONFIG_XFS
-        if (s->is_xfs) {
-            return xfs_write_zeroes(s, aiocb->aio_offset, aiocb->aio_nbytes);
+    do {
+        uint64_t range[2] = { aiocb->aio_offset, aiocb->aio_nbytes };
+        if (ioctl(aiocb->aio_fildes, BLKZEROOUT, range) == 0) {
+            return 0;
         }
-#endif
+    } while (errno == EINTR);
 
-#ifdef CONFIG_FALLOCATE_ZERO_RANGE
-        ret = do_fallocate(s->fd, FALLOC_FL_ZERO_RANGE,
-                           aiocb->aio_offset, aiocb->aio_nbytes);
+    ret = translate_err(-errno);
 #endif
-    }
 
-    ret = translate_err(ret);
     if (ret == -ENOTSUP) {
         s->has_write_zeroes = false;
     }
     return ret;
 }
 
+static ssize_t handle_aiocb_write_zeroes(RawPosixAIOData *aiocb)
+{
+    BDRVRawState *s;
+
+    if (aiocb->aio_type & QEMU_AIO_BLKDEV) {
+        return handle_aiocb_write_zeroes_block(aiocb);
+    }
+
+#ifdef CONFIG_XFS
+    if (s->is_xfs) {
+        return xfs_write_zeroes(s, aiocb->aio_offset, aiocb->aio_nbytes);
+    }
+#endif
+
+    s = aiocb->bs->opaque;
+
+#ifdef CONFIG_FALLOCATE_ZERO_RANGE
+    if (s->has_write_zeroes) {
+        int ret = do_fallocate(s->fd, FALLOC_FL_ZERO_RANGE,
+                               aiocb->aio_offset, aiocb->aio_nbytes);
+        if (ret == 0 && ret != -ENOTSUP) {
+            return ret;
+        }
+    }
+#endif
+
+    s->has_write_zeroes = false;
+    return -ENOTSUP;
+}
+
 static ssize_t handle_aiocb_discard(RawPosixAIOData *aiocb)
 {
     int ret = -EOPNOTSUPP;
-- 
1.9.1

^ permalink raw reply related	[flat|nested] 17+ messages in thread

* [Qemu-devel] [PATCH 6/7] block: use fallocate(FALLOC_FL_PUNCH_HOLE) & fallocate(0) to write zeroes
  2014-12-26 12:35 [Qemu-devel] [PATCH v2 0/7] eliminate data write in bdrv_write_zeroes on Linux Denis V. Lunev
                   ` (4 preceding siblings ...)
  2014-12-26 12:35 ` [Qemu-devel] [PATCH 5/7] block/raw-posix: refactor handle_aiocb_write_zeroes a bit Denis V. Lunev
@ 2014-12-26 12:35 ` Denis V. Lunev
  2014-12-26 12:35 ` [Qemu-devel] [PATCH 7/7] block/raw-posix: call plain fallocate in handle_aiocb_write_zeroes Denis V. Lunev
  6 siblings, 0 replies; 17+ messages in thread
From: Denis V. Lunev @ 2014-12-26 12:35 UTC (permalink / raw)
  Cc: Kevin Wolf, Denis V. Lunev, qemu-devel, Stefan Hajnoczi

This sequence works efficiently if FALLOC_FL_ZERO_RANGE is not supported.

Simple fallocate(0) will extend file with zeroes when appropriate in the
middle of the file if there is a hole there and at the end of the file.
Unfortunately fallocate(0) does not drop the content of the file if
there is a data on this offset. Therefore to make the situation consistent
we should drop the data beforehand. This is done using FALLOC_FL_PUNCH_HOLE

This should increase the performance a bit for not-so-modern kernels or for
filesystems which do not support FALLOC_FL_ZERO_RANGE.

Signed-off-by: Denis V. Lunev <den@openvz.org>
CC: Kevin Wolf <kwolf@redhat.com>
CC: Stefan Hajnoczi <stefanha@redhat.com>
---
 block/raw-posix.c | 17 +++++++++++++++++
 1 file changed, 17 insertions(+)

diff --git a/block/raw-posix.c b/block/raw-posix.c
index 7866d31..96a8678 100644
--- a/block/raw-posix.c
+++ b/block/raw-posix.c
@@ -968,6 +968,23 @@ static ssize_t handle_aiocb_write_zeroes(RawPosixAIOData *aiocb)
 #endif
 
     s->has_write_zeroes = false;
+
+#ifdef CONFIG_FALLOCATE_PUNCH_HOLE
+    if (s->has_discard) {
+        int ret;
+        ret = do_fallocate(s->fd, FALLOC_FL_PUNCH_HOLE | FALLOC_FL_KEEP_SIZE,
+                           aiocb->aio_offset, aiocb->aio_nbytes);
+        if (ret < 0) {
+            if (ret == -ENOTSUP) {
+                s->has_discard = false;
+            }
+            return ret;
+        }
+        return do_fallocate(s->fd, 0, aiocb->aio_offset, aiocb->aio_nbytes);
+    }
+#endif
+
+    s->has_discard = false;
     return -ENOTSUP;
 }
 
-- 
1.9.1

^ permalink raw reply related	[flat|nested] 17+ messages in thread

* [Qemu-devel] [PATCH 7/7] block/raw-posix: call plain fallocate in handle_aiocb_write_zeroes
  2014-12-26 12:35 [Qemu-devel] [PATCH v2 0/7] eliminate data write in bdrv_write_zeroes on Linux Denis V. Lunev
                   ` (5 preceding siblings ...)
  2014-12-26 12:35 ` [Qemu-devel] [PATCH 6/7] block: use fallocate(FALLOC_FL_PUNCH_HOLE) & fallocate(0) to write zeroes Denis V. Lunev
@ 2014-12-26 12:35 ` Denis V. Lunev
  6 siblings, 0 replies; 17+ messages in thread
From: Denis V. Lunev @ 2014-12-26 12:35 UTC (permalink / raw)
  Cc: Kevin Wolf, Denis V. Lunev, qemu-devel, Stefan Hajnoczi

There is a possibility that we are extending our image and thus writing
zeroes beyond end of the file. In this case we do not need to care
about the hole to make sure that there is no data in the file under
this offset (pre-condition to fallocate(0) to work). We could simply call
fallocate(0).

This improves the performance of writing zeroes even on really old
platforms which do not have even FALLOC_FL_PUNCH_HOLE.

Signed-off-by: Denis V. Lunev <den@openvz.org>
CC: Kevin Wolf <kwolf@redhat.com>
CC: Stefan Hajnoczi <stefanha@redhat.com>
---
 block/raw-posix.c | 8 ++++++--
 1 file changed, 6 insertions(+), 2 deletions(-)

diff --git a/block/raw-posix.c b/block/raw-posix.c
index 96a8678..334f818 100644
--- a/block/raw-posix.c
+++ b/block/raw-posix.c
@@ -60,7 +60,7 @@
 #define FS_NOCOW_FL                     0x00800000 /* Do not cow file */
 #endif
 #endif
-#if defined(CONFIG_FALLOCATE_PUNCH_HOLE) || defined(CONFIG_FALLOCATE_ZERO_RANGE)
+#ifdef CONFIG_FALLOCATE
 #include <linux/falloc.h>
 #endif
 #if defined (__FreeBSD__) || defined(__FreeBSD_kernel__)
@@ -903,7 +903,7 @@ static int translate_err(int err)
     return err;
 }
 
-#if defined(CONFIG_FALLOCATE_PUNCH_HOLE) || defined(CONFIG_FALLOCATE_ZERO_RANGE)
+#ifdef CONFIG_FALLOCATE
 static int do_fallocate(int fd, int mode, off_t offset, off_t len)
 {
     do {
@@ -957,6 +957,10 @@ static ssize_t handle_aiocb_write_zeroes(RawPosixAIOData *aiocb)
 
     s = aiocb->bs->opaque;
 
+    if (aiocb->aio_offset >= aiocb->bs->total_sectors << BDRV_SECTOR_BITS) {
+        return do_fallocate(s->fd, 0, aiocb->aio_offset, aiocb->aio_nbytes);
+    }
+
 #ifdef CONFIG_FALLOCATE_ZERO_RANGE
     if (s->has_write_zeroes) {
         int ret = do_fallocate(s->fd, FALLOC_FL_ZERO_RANGE,
-- 
1.9.1

^ permalink raw reply related	[flat|nested] 17+ messages in thread

* Re: [Qemu-devel] [PATCH 1/7] block: fix maximum length sent to bdrv_co_do_write_zeroes callback in bs
  2014-12-26 12:35 ` [Qemu-devel] [PATCH 1/7] block: fix maximum length sent to bdrv_co_do_write_zeroes callback in bs Denis V. Lunev
@ 2014-12-26 13:13   ` Peter Lieven
  2014-12-26 13:32     ` Denis V. Lunev
  0 siblings, 1 reply; 17+ messages in thread
From: Peter Lieven @ 2014-12-26 13:13 UTC (permalink / raw)
  To: Denis V. Lunev; +Cc: Kevin Wolf, qemu-devel, Stefan Hajnoczi

Am 26.12.2014 um 13:35 schrieb Denis V. Lunev:
> The check for maximum length was added by
>   commit c31cb70728d2c0c8900b35a66784baa446fd5147
>   Author: Peter Lieven <pl@kamp.de>
>   Date:   Thu Oct 24 12:06:58 2013 +0200
>     block: honour BlockLimits in bdrv_co_do_write_zeroes
>
> but actually if driver provides .bdrv_co_write_zeroes callback, there is
> no need to limit the size to 32 MB. Callback should provide effective
> implementation which normally should not write any zeroes in comparable
> amount.

NACK.

First there is no guarantee that bdrv_co_do_write_zeroes is a fast operation.
This heaviliy depends on several circumstances that the block layer is not aware of.
If a specific protocol knows it is very fast in writing zeroes under any circumstance
it should provide INT_MAX in bs->bl.max_write_zeroes. It is then still allowed to
return -ENOTSUP if the request size or alignment doesn't fit.

There are known backends e.g. anything that deals with SCSI that have a known
limitation of the maximum number of zeroes they can write fast in a single request.
This number MUST NOT be exceeded. The below patch would break all those backends.

What issue are you trying to fix with this patch? Maybe there is a better way to fix
it at another point in the code.

Peter

>
> Correct check should be as follows to honour BlockLimits for slow writes:
> - if bs->bl.max_write_zeroes is specified, it should be applied to all
>   paths (fast and slow)
> - MAX_WRITE_ZEROES_DEFAULT should be applied for slow path only
>
> Signed-off-by: Denis V. Lunev <den@openvz.org>
> CC: Eric Blake <eblake@redhat.com>
> CC: Peter Lieven <pl@kamp.de>
> CC: Kevin Wolf <kwolf@redhat.com>
> CC: Stefan Hajnoczi <stefanha@redhat.com>
> ---
>  block.c | 21 +++++++++++++++------
>  1 file changed, 15 insertions(+), 6 deletions(-)
>
> diff --git a/block.c b/block.c
> index 4165d42..0ec8b15 100644
> --- a/block.c
> +++ b/block.c
> @@ -3182,8 +3182,12 @@ static int coroutine_fn bdrv_co_do_write_zeroes(BlockDriverState *bs,
>      struct iovec iov = {0};
>      int ret = 0;
>  
> -    int max_write_zeroes = bs->bl.max_write_zeroes ?
> -                           bs->bl.max_write_zeroes : MAX_WRITE_ZEROES_DEFAULT;
> +    int max_fast_write_size = nb_sectors;
> +    int max_slow_write_size = MAX_WRITE_ZEROES_DEFAULT;
> +    if (bs->bl.max_write_zeroes != 0) {
> +        max_fast_write_size = bs->bl.max_write_zeroes;
> +        max_slow_write_size = bs->bl.max_write_zeroes;
> +    }
>  
>      while (nb_sectors > 0 && !ret) {
>          int num = nb_sectors;
> @@ -3205,9 +3209,9 @@ static int coroutine_fn bdrv_co_do_write_zeroes(BlockDriverState *bs,
>              }
>          }
>  
> -        /* limit request size */
> -        if (num > max_write_zeroes) {
> -            num = max_write_zeroes;
> +        /* limit request size for fast path */
> +        if (num > max_fast_write_size) {
> +            num = max_fast_write_size;
>          }
>  
>          ret = -ENOTSUP;
> @@ -3217,6 +3221,11 @@ static int coroutine_fn bdrv_co_do_write_zeroes(BlockDriverState *bs,
>          }
>  
>          if (ret == -ENOTSUP) {
> +            /* limit request size further for slow path */
> +            if (num > max_slow_write_size) {
> +                num = max_slow_write_size;
> +            }
> +
>              /* Fall back to bounce buffer if write zeroes is unsupported */
>              iov.iov_len = num * BDRV_SECTOR_SIZE;
>              if (iov.iov_base == NULL) {
> @@ -3234,7 +3243,7 @@ static int coroutine_fn bdrv_co_do_write_zeroes(BlockDriverState *bs,
>              /* Keep bounce buffer around if it is big enough for all
>               * all future requests.
>               */
> -            if (num < max_write_zeroes) {
> +            if (num < max_slow_write_size) {
>                  qemu_vfree(iov.iov_base);
>                  iov.iov_base = NULL;
>              }

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [Qemu-devel] [PATCH 1/7] block: fix maximum length sent to bdrv_co_do_write_zeroes callback in bs
  2014-12-26 13:13   ` Peter Lieven
@ 2014-12-26 13:32     ` Denis V. Lunev
  2014-12-26 19:15       ` Denis V. Lunev
  0 siblings, 1 reply; 17+ messages in thread
From: Denis V. Lunev @ 2014-12-26 13:32 UTC (permalink / raw)
  To: Peter Lieven, Denis V. Lunev; +Cc: Kevin Wolf, qemu-devel, Stefan Hajnoczi

On 26/12/14 16:13, Peter Lieven wrote:
> Am 26.12.2014 um 13:35 schrieb Denis V. Lunev:
>> The check for maximum length was added by
>>    commit c31cb70728d2c0c8900b35a66784baa446fd5147
>>    Author: Peter Lieven <pl@kamp.de>
>>    Date:   Thu Oct 24 12:06:58 2013 +0200
>>      block: honour BlockLimits in bdrv_co_do_write_zeroes
>>
>> but actually if driver provides .bdrv_co_write_zeroes callback, there is
>> no need to limit the size to 32 MB. Callback should provide effective
>> implementation which normally should not write any zeroes in comparable
>> amount.
>
> NACK.
>
> First there is no guarantee that bdrv_co_do_write_zeroes is a fast operation.
> This heaviliy depends on several circumstances that the block layer is not aware of.
> If a specific protocol knows it is very fast in writing zeroes under any circumstance
> it should provide INT_MAX in bs->bl.max_write_zeroes. It is then still allowed to
> return -ENOTSUP if the request size or alignment doesn't fit.

the idea is that (from my point of view) if .bdrv_co_do_write_zeroes is
specified, the cost is almost the same for any amount of zeroes
written. This is true for fallocate from my point of view. The amount
of actually written data will be in several orders less than specified
except slow path, which honors 32 MB limit.

If the operation is complex in realization, then it will be rate-limited
below, in actual implementation.

> There are known backends e.g. anything that deals with SCSI that have a known
> limitation of the maximum number of zeroes they can write fast in a single request.
> This number MUST NOT be exceeded. The below patch would break all those backends.

could you pls point me this backends. Actually, from my point of
view, they should properly setup max_write_zeroes for themselves.
This is done at least in block/iscsi.c and it would be consistent
way of doing so.

>
> What issue are you trying to fix with this patch? Maybe there is a better way to fix
> it at another point in the code.
>

I am trying to minimize amount of metadata updates for a file.
This provides some speedup even on ext4 and this will provide
even more speedup with a distributed filesystem like CEPH
where size updates of the files and block allocation are
costly.

Regards,
	Den

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [Qemu-devel] [PATCH 1/7] block: fix maximum length sent to bdrv_co_do_write_zeroes callback in bs
  2014-12-26 13:32     ` Denis V. Lunev
@ 2014-12-26 19:15       ` Denis V. Lunev
  2014-12-27 14:52         ` Peter Lieven
  0 siblings, 1 reply; 17+ messages in thread
From: Denis V. Lunev @ 2014-12-26 19:15 UTC (permalink / raw)
  To: Peter Lieven; +Cc: Kevin Wolf, qemu-devel, Stefan Hajnoczi

On 26/12/14 16:32, Denis V. Lunev wrote:
> On 26/12/14 16:13, Peter Lieven wrote:
>> Am 26.12.2014 um 13:35 schrieb Denis V. Lunev:
>>> The check for maximum length was added by
>>>    commit c31cb70728d2c0c8900b35a66784baa446fd5147
>>>    Author: Peter Lieven <pl@kamp.de>
>>>    Date:   Thu Oct 24 12:06:58 2013 +0200
>>>      block: honour BlockLimits in bdrv_co_do_write_zeroes
>>>
>>> but actually if driver provides .bdrv_co_write_zeroes callback, 
>>> there is
>>> no need to limit the size to 32 MB. Callback should provide effective
>>> implementation which normally should not write any zeroes in comparable
>>> amount.
>>
>> NACK.
>>
>> First there is no guarantee that bdrv_co_do_write_zeroes is a fast 
>> operation.
>> This heaviliy depends on several circumstances that the block layer 
>> is not aware of.
>> If a specific protocol knows it is very fast in writing zeroes under 
>> any circumstance
>> it should provide INT_MAX in bs->bl.max_write_zeroes. It is then 
>> still allowed to
>> return -ENOTSUP if the request size or alignment doesn't fit.
>
> the idea is that (from my point of view) if .bdrv_co_do_write_zeroes is
> specified, the cost is almost the same for any amount of zeroes
> written. This is true for fallocate from my point of view. The amount
> of actually written data will be in several orders less than specified
> except slow path, which honors 32 MB limit.
>
> If the operation is complex in realization, then it will be rate-limited
> below, in actual implementation.
>
>> There are known backends e.g. anything that deals with SCSI that have 
>> a known
>> limitation of the maximum number of zeroes they can write fast in a 
>> single request.
>> This number MUST NOT be exceeded. The below patch would break all 
>> those backends.
>
> could you pls point me this backends. Actually, from my point of
> view, they should properly setup max_write_zeroes for themselves.
> This is done at least in block/iscsi.c and it would be consistent
> way of doing so.
>
>>
>> What issue are you trying to fix with this patch? Maybe there is a 
>> better way to fix
>> it at another point in the code.
>>
>
> I am trying to minimize amount of metadata updates for a file.
> This provides some speedup even on ext4 and this will provide
> even more speedup with a distributed filesystem like CEPH
> where size updates of the files and block allocation are
> costly.
>
> Regards,
>     Den
First of all, the patch is really wrong :) It was written using
wrong assumptions.

OK. I have spent some time reading your original patchset and
and did not found any useful justification for default limit
for both discard and write zero.

Yes, there are drivers which requires block level to call
.bdrv_co_do_write_zeroes with alignment and with upper limit.
But in this case driver setups max_write_zeroes. All buggy
drivers should do that not to affect not buggy ones from
my opinion.

This is the only purpose of the original patches for limits.
I have wrongly interpret BlockLimits as something connected
with time of the operation. Sorry for that.

Therefore there is no good reason for limiting the amount of
data sent to drv->bdrv_co_writev with any data size. The only
thing is that it would be good not to allocate too many memory
at once. We could do something like

base = qemu_try_blockalign(bs, MIN(2048, num) * BDRV_SECTOR_SIZE);
added = 0;
for (added = 0; added < num; add += MIN(2048, num)) {
     qemu_iovec_add(qiov, base, MIN(2048, num));
}

to avoid really big allocations here even if .max_write_zeroes is
very high. Do you think that this might be useful?

As for .bdrv_co_do_write_zeroes itself, can we still drop
default 32 Mb limit? If there are some buggy drivers, they
should have .max_write_zeroes specified.

The same applies to .max_discard

Regards,
     Den

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [Qemu-devel] [PATCH 1/7] block: fix maximum length sent to bdrv_co_do_write_zeroes callback in bs
  2014-12-26 19:15       ` Denis V. Lunev
@ 2014-12-27 14:52         ` Peter Lieven
  2014-12-27 17:42           ` Denis V. Lunev
  0 siblings, 1 reply; 17+ messages in thread
From: Peter Lieven @ 2014-12-27 14:52 UTC (permalink / raw)
  To: Denis V. Lunev; +Cc: Kevin Wolf, Paolo Bonzini, qemu-devel, Stefan Hajnoczi

Am 26.12.2014 um 20:15 schrieb Denis V. Lunev:
> On 26/12/14 16:32, Denis V. Lunev wrote:
>> On 26/12/14 16:13, Peter Lieven wrote:
>>> Am 26.12.2014 um 13:35 schrieb Denis V. Lunev:
>>>> The check for maximum length was added by
>>>>    commit c31cb70728d2c0c8900b35a66784baa446fd5147
>>>>    Author: Peter Lieven <pl@kamp.de>
>>>>    Date:   Thu Oct 24 12:06:58 2013 +0200
>>>>      block: honour BlockLimits in bdrv_co_do_write_zeroes
>>>>
>>>> but actually if driver provides .bdrv_co_write_zeroes callback, there is
>>>> no need to limit the size to 32 MB. Callback should provide effective
>>>> implementation which normally should not write any zeroes in comparable
>>>> amount.
>>>
>>> NACK.
>>>
>>> First there is no guarantee that bdrv_co_do_write_zeroes is a fast operation.
>>> This heaviliy depends on several circumstances that the block layer is not aware of.
>>> If a specific protocol knows it is very fast in writing zeroes under any circumstance
>>> it should provide INT_MAX in bs->bl.max_write_zeroes. It is then still allowed to
>>> return -ENOTSUP if the request size or alignment doesn't fit.
>>
>> the idea is that (from my point of view) if .bdrv_co_do_write_zeroes is
>> specified, the cost is almost the same for any amount of zeroes
>> written. This is true for fallocate from my point of view. The amount
>> of actually written data will be in several orders less than specified
>> except slow path, which honors 32 MB limit.
>>
>> If the operation is complex in realization, then it will be rate-limited
>> below, in actual implementation.
>>
>>> There are known backends e.g. anything that deals with SCSI that have a known
>>> limitation of the maximum number of zeroes they can write fast in a single request.
>>> This number MUST NOT be exceeded. The below patch would break all those backends.
>>
>> could you pls point me this backends. Actually, from my point of
>> view, they should properly setup max_write_zeroes for themselves.
>> This is done at least in block/iscsi.c and it would be consistent
>> way of doing so.
>>
>>>
>>> What issue are you trying to fix with this patch? Maybe there is a better way to fix
>>> it at another point in the code.
>>>
>>
>> I am trying to minimize amount of metadata updates for a file.
>> This provides some speedup even on ext4 and this will provide
>> even more speedup with a distributed filesystem like CEPH
>> where size updates of the files and block allocation are
>> costly.
>>
>> Regards,
>>     Den
> First of all, the patch is really wrong :) It was written using
> wrong assumptions.
>
> OK. I have spent some time reading your original patchset and
> and did not found any useful justification for default limit
> for both discard and write zero.

32768 is the largest power of two fitting into a uint16.
And uint16 is quite common for nb_sectors in backends.

>
> Yes, there are drivers which requires block level to call
> .bdrv_co_do_write_zeroes with alignment and with upper limit.
> But in this case driver setups max_write_zeroes. All buggy
> drivers should do that not to affect not buggy ones from
> my opinion.
>
> This is the only purpose of the original patches for limits.
> I have wrongly interpret BlockLimits as something connected
> with time of the operation. Sorry for that.
>
> Therefore there is no good reason for limiting the amount of
> data sent to drv->bdrv_co_writev with any data size. The only
> thing is that it would be good not to allocate too many memory
> at once. We could do something like
>
> base = qemu_try_blockalign(bs, MIN(2048, num) * BDRV_SECTOR_SIZE);
> added = 0;
> for (added = 0; added < num; add += MIN(2048, num)) {
>     qemu_iovec_add(qiov, base, MIN(2048, num));
> }
>
> to avoid really big allocations here even if .max_write_zeroes is
> very high. Do you think that this might be useful?
>
> As for .bdrv_co_do_write_zeroes itself, can we still drop
> default 32 Mb limit? If there are some buggy drivers, they
> should have .max_write_zeroes specified.
>
> The same applies to .max_discard

Its always risky to change default behaviour. In the original discussion we
agreed that there should be a limit for each request. I think the 2^15 was
Paolos suggestion.

You where talking of metadata updates for a file. So the operation that is too slow
for you is bdrv_write_zeroes inside a container file? What is the underlaying filesystem?
What is the exact operation that you try to optimize?

I am wondering because as far as I can see write zeroes is only supported for
XFS and block devices which support BLKZEROOUT. The latter only works for
cache=none. So its not that easy to end up in an optimized (fast) path anyway.

Peter

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [Qemu-devel] [PATCH 1/7] block: fix maximum length sent to bdrv_co_do_write_zeroes callback in bs
  2014-12-27 14:52         ` Peter Lieven
@ 2014-12-27 17:42           ` Denis V. Lunev
  2014-12-27 20:01             ` Peter Lieven
  0 siblings, 1 reply; 17+ messages in thread
From: Denis V. Lunev @ 2014-12-27 17:42 UTC (permalink / raw)
  To: Peter Lieven, Denis V. Lunev
  Cc: Kevin Wolf, Paolo Bonzini, qemu-devel, Stefan Hajnoczi

On 27/12/14 17:52, Peter Lieven wrote:
> Am 26.12.2014 um 20:15 schrieb Denis V. Lunev:
>> On 26/12/14 16:32, Denis V. Lunev wrote:
>>> On 26/12/14 16:13, Peter Lieven wrote:
>>>> Am 26.12.2014 um 13:35 schrieb Denis V. Lunev:
>>>>> The check for maximum length was added by
>>>>>     commit c31cb70728d2c0c8900b35a66784baa446fd5147
>>>>>     Author: Peter Lieven <pl@kamp.de>
>>>>>     Date:   Thu Oct 24 12:06:58 2013 +0200
>>>>>       block: honour BlockLimits in bdrv_co_do_write_zeroes
>>>>>
>>>>> but actually if driver provides .bdrv_co_write_zeroes callback, there is
>>>>> no need to limit the size to 32 MB. Callback should provide effective
>>>>> implementation which normally should not write any zeroes in comparable
>>>>> amount.
>>>>
>>>> NACK.
>>>>
>>>> First there is no guarantee that bdrv_co_do_write_zeroes is a fast operation.
>>>> This heaviliy depends on several circumstances that the block layer is not aware of.
>>>> If a specific protocol knows it is very fast in writing zeroes under any circumstance
>>>> it should provide INT_MAX in bs->bl.max_write_zeroes. It is then still allowed to
>>>> return -ENOTSUP if the request size or alignment doesn't fit.
>>>
>>> the idea is that (from my point of view) if .bdrv_co_do_write_zeroes is
>>> specified, the cost is almost the same for any amount of zeroes
>>> written. This is true for fallocate from my point of view. The amount
>>> of actually written data will be in several orders less than specified
>>> except slow path, which honors 32 MB limit.
>>>
>>> If the operation is complex in realization, then it will be rate-limited
>>> below, in actual implementation.
>>>
>>>> There are known backends e.g. anything that deals with SCSI that have a known
>>>> limitation of the maximum number of zeroes they can write fast in a single request.
>>>> This number MUST NOT be exceeded. The below patch would break all those backends.
>>>
>>> could you pls point me this backends. Actually, from my point of
>>> view, they should properly setup max_write_zeroes for themselves.
>>> This is done at least in block/iscsi.c and it would be consistent
>>> way of doing so.
>>>
>>>>
>>>> What issue are you trying to fix with this patch? Maybe there is a better way to fix
>>>> it at another point in the code.
>>>>
>>>
>>> I am trying to minimize amount of metadata updates for a file.
>>> This provides some speedup even on ext4 and this will provide
>>> even more speedup with a distributed filesystem like CEPH
>>> where size updates of the files and block allocation are
>>> costly.
>>>
>>> Regards,
>>>      Den
>> First of all, the patch is really wrong :) It was written using
>> wrong assumptions.
>>
>> OK. I have spent some time reading your original patchset and
>> and did not found any useful justification for default limit
>> for both discard and write zero.
>
> 32768 is the largest power of two fitting into a uint16.
> And uint16 is quite common for nb_sectors in backends.
>

ok. This could be reasonable.


>>
>> Yes, there are drivers which requires block level to call
>> .bdrv_co_do_write_zeroes with alignment and with upper limit.
>> But in this case driver setups max_write_zeroes. All buggy
>> drivers should do that not to affect not buggy ones from
>> my opinion.
>>
>> This is the only purpose of the original patches for limits.
>> I have wrongly interpret BlockLimits as something connected
>> with time of the operation. Sorry for that.
>>
>> Therefore there is no good reason for limiting the amount of
>> data sent to drv->bdrv_co_writev with any data size. The only
>> thing is that it would be good not to allocate too many memory
>> at once. We could do something like
>>
>> base = qemu_try_blockalign(bs, MIN(2048, num) * BDRV_SECTOR_SIZE);
>> added = 0;
>> for (added = 0; added < num; add += MIN(2048, num)) {
>>      qemu_iovec_add(qiov, base, MIN(2048, num));
>> }
>>
>> to avoid really big allocations here even if .max_write_zeroes is
>> very high. Do you think that this might be useful?
>>
>> As for .bdrv_co_do_write_zeroes itself, can we still drop
>> default 32 Mb limit? If there are some buggy drivers, they
>> should have .max_write_zeroes specified.
>>
>> The same applies to .max_discard
>
> Its always risky to change default behaviour. In the original discussion we
> agreed that there should be a limit for each request. I think the 2^15 was
> Paolos suggestion.
>
> You where talking of metadata updates for a file. So the operation that is too slow
> for you is bdrv_write_zeroes inside a container file? What is the underlaying filesystem?
> What is the exact operation that you try to optimize?
>
> I am wondering because as far as I can see write zeroes is only supported for
> XFS and block devices which support BLKZEROOUT. The latter only works for
> cache=none. So its not that easy to end up in an optimized (fast) path anyway.
>
> Peter
>
>
you have missed 6 patches below ;) f.e. patch 2/7

OK. I'll redo changes and fix on raw-posix level.

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [Qemu-devel] [PATCH 1/7] block: fix maximum length sent to bdrv_co_do_write_zeroes callback in bs
  2014-12-27 17:42           ` Denis V. Lunev
@ 2014-12-27 20:01             ` Peter Lieven
  2014-12-27 20:14               ` Denis V. Lunev
  0 siblings, 1 reply; 17+ messages in thread
From: Peter Lieven @ 2014-12-27 20:01 UTC (permalink / raw)
  To: Denis V. Lunev, Denis V. Lunev
  Cc: Kevin Wolf, Paolo Bonzini, qemu-devel, Stefan Hajnoczi

Am 27.12.2014 um 18:42 schrieb Denis V. Lunev:
> On 27/12/14 17:52, Peter Lieven wrote:
>> Am 26.12.2014 um 20:15 schrieb Denis V. Lunev:
>>> On 26/12/14 16:32, Denis V. Lunev wrote:
>>>> On 26/12/14 16:13, Peter Lieven wrote:
>>>>> Am 26.12.2014 um 13:35 schrieb Denis V. Lunev:
>>>>>> The check for maximum length was added by
>>>>>>     commit c31cb70728d2c0c8900b35a66784baa446fd5147
>>>>>>     Author: Peter Lieven <pl@kamp.de>
>>>>>>     Date:   Thu Oct 24 12:06:58 2013 +0200
>>>>>>       block: honour BlockLimits in bdrv_co_do_write_zeroes
>>>>>>
>>>>>> but actually if driver provides .bdrv_co_write_zeroes callback, there is
>>>>>> no need to limit the size to 32 MB. Callback should provide effective
>>>>>> implementation which normally should not write any zeroes in comparable
>>>>>> amount.
>>>>>
>>>>> NACK.
>>>>>
>>>>> First there is no guarantee that bdrv_co_do_write_zeroes is a fast operation.
>>>>> This heaviliy depends on several circumstances that the block layer is not aware of.
>>>>> If a specific protocol knows it is very fast in writing zeroes under any circumstance
>>>>> it should provide INT_MAX in bs->bl.max_write_zeroes. It is then still allowed to
>>>>> return -ENOTSUP if the request size or alignment doesn't fit.
>>>>
>>>> the idea is that (from my point of view) if .bdrv_co_do_write_zeroes is
>>>> specified, the cost is almost the same for any amount of zeroes
>>>> written. This is true for fallocate from my point of view. The amount
>>>> of actually written data will be in several orders less than specified
>>>> except slow path, which honors 32 MB limit.
>>>>
>>>> If the operation is complex in realization, then it will be rate-limited
>>>> below, in actual implementation.
>>>>
>>>>> There are known backends e.g. anything that deals with SCSI that have a known
>>>>> limitation of the maximum number of zeroes they can write fast in a single request.
>>>>> This number MUST NOT be exceeded. The below patch would break all those backends.
>>>>
>>>> could you pls point me this backends. Actually, from my point of
>>>> view, they should properly setup max_write_zeroes for themselves.
>>>> This is done at least in block/iscsi.c and it would be consistent
>>>> way of doing so.
>>>>
>>>>>
>>>>> What issue are you trying to fix with this patch? Maybe there is a better way to fix
>>>>> it at another point in the code.
>>>>>
>>>>
>>>> I am trying to minimize amount of metadata updates for a file.
>>>> This provides some speedup even on ext4 and this will provide
>>>> even more speedup with a distributed filesystem like CEPH
>>>> where size updates of the files and block allocation are
>>>> costly.
>>>>
>>>> Regards,
>>>>      Den
>>> First of all, the patch is really wrong :) It was written using
>>> wrong assumptions.
>>>
>>> OK. I have spent some time reading your original patchset and
>>> and did not found any useful justification for default limit
>>> for both discard and write zero.
>>
>> 32768 is the largest power of two fitting into a uint16.
>> And uint16 is quite common for nb_sectors in backends.
>>
>
> ok. This could be reasonable.
>
>
>>>
>>> Yes, there are drivers which requires block level to call
>>> .bdrv_co_do_write_zeroes with alignment and with upper limit.
>>> But in this case driver setups max_write_zeroes. All buggy
>>> drivers should do that not to affect not buggy ones from
>>> my opinion.
>>>
>>> This is the only purpose of the original patches for limits.
>>> I have wrongly interpret BlockLimits as something connected
>>> with time of the operation. Sorry for that.
>>>
>>> Therefore there is no good reason for limiting the amount of
>>> data sent to drv->bdrv_co_writev with any data size. The only
>>> thing is that it would be good not to allocate too many memory
>>> at once. We could do something like
>>>
>>> base = qemu_try_blockalign(bs, MIN(2048, num) * BDRV_SECTOR_SIZE);
>>> added = 0;
>>> for (added = 0; added < num; add += MIN(2048, num)) {
>>>      qemu_iovec_add(qiov, base, MIN(2048, num));
>>> }
>>>
>>> to avoid really big allocations here even if .max_write_zeroes is
>>> very high. Do you think that this might be useful?
>>>
>>> As for .bdrv_co_do_write_zeroes itself, can we still drop
>>> default 32 Mb limit? If there are some buggy drivers, they
>>> should have .max_write_zeroes specified.
>>>
>>> The same applies to .max_discard
>>
>> Its always risky to change default behaviour. In the original discussion we
>> agreed that there should be a limit for each request. I think the 2^15 was
>> Paolos suggestion.
>>
>> You where talking of metadata updates for a file. So the operation that is too slow
>> for you is bdrv_write_zeroes inside a container file? What is the underlaying filesystem?
>> What is the exact operation that you try to optimize?
>>
>> I am wondering because as far as I can see write zeroes is only supported for
>> XFS and block devices which support BLKZEROOUT. The latter only works for
>> cache=none. So its not that easy to end up in an optimized (fast) path anyway.
>>
>> Peter
>>
>>
> you have missed 6 patches below ;) f.e. patch 2/7
>
> OK. I'll redo changes and fix on raw-posix level.

I was not in CC on the series. Can you please include me in CC for all patches when you respin.

Thanks,
Peter

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [Qemu-devel] [PATCH 1/7] block: fix maximum length sent to bdrv_co_do_write_zeroes callback in bs
  2014-12-27 20:01             ` Peter Lieven
@ 2014-12-27 20:14               ` Denis V. Lunev
  0 siblings, 0 replies; 17+ messages in thread
From: Denis V. Lunev @ 2014-12-27 20:14 UTC (permalink / raw)
  To: Peter Lieven, Denis V. Lunev
  Cc: Kevin Wolf, Paolo Bonzini, qemu-devel, Stefan Hajnoczi

On 27/12/14 23:01, Peter Lieven wrote:
> Am 27.12.2014 um 18:42 schrieb Denis V. Lunev:
>> On 27/12/14 17:52, Peter Lieven wrote:
>>> Am 26.12.2014 um 20:15 schrieb Denis V. Lunev:
>>>> On 26/12/14 16:32, Denis V. Lunev wrote:
>>>>> On 26/12/14 16:13, Peter Lieven wrote:
>>>>>> Am 26.12.2014 um 13:35 schrieb Denis V. Lunev:
>>>>>>> The check for maximum length was added by
>>>>>>>      commit c31cb70728d2c0c8900b35a66784baa446fd5147
>>>>>>>      Author: Peter Lieven <pl@kamp.de>
>>>>>>>      Date:   Thu Oct 24 12:06:58 2013 +0200
>>>>>>>        block: honour BlockLimits in bdrv_co_do_write_zeroes
>>>>>>>
>>>>>>> but actually if driver provides .bdrv_co_write_zeroes callback, there is
>>>>>>> no need to limit the size to 32 MB. Callback should provide effective
>>>>>>> implementation which normally should not write any zeroes in comparable
>>>>>>> amount.
>>>>>>
>>>>>> NACK.
>>>>>>
>>>>>> First there is no guarantee that bdrv_co_do_write_zeroes is a fast operation.
>>>>>> This heaviliy depends on several circumstances that the block layer is not aware of.
>>>>>> If a specific protocol knows it is very fast in writing zeroes under any circumstance
>>>>>> it should provide INT_MAX in bs->bl.max_write_zeroes. It is then still allowed to
>>>>>> return -ENOTSUP if the request size or alignment doesn't fit.
>>>>>
>>>>> the idea is that (from my point of view) if .bdrv_co_do_write_zeroes is
>>>>> specified, the cost is almost the same for any amount of zeroes
>>>>> written. This is true for fallocate from my point of view. The amount
>>>>> of actually written data will be in several orders less than specified
>>>>> except slow path, which honors 32 MB limit.
>>>>>
>>>>> If the operation is complex in realization, then it will be rate-limited
>>>>> below, in actual implementation.
>>>>>
>>>>>> There are known backends e.g. anything that deals with SCSI that have a known
>>>>>> limitation of the maximum number of zeroes they can write fast in a single request.
>>>>>> This number MUST NOT be exceeded. The below patch would break all those backends.
>>>>>
>>>>> could you pls point me this backends. Actually, from my point of
>>>>> view, they should properly setup max_write_zeroes for themselves.
>>>>> This is done at least in block/iscsi.c and it would be consistent
>>>>> way of doing so.
>>>>>
>>>>>>
>>>>>> What issue are you trying to fix with this patch? Maybe there is a better way to fix
>>>>>> it at another point in the code.
>>>>>>
>>>>>
>>>>> I am trying to minimize amount of metadata updates for a file.
>>>>> This provides some speedup even on ext4 and this will provide
>>>>> even more speedup with a distributed filesystem like CEPH
>>>>> where size updates of the files and block allocation are
>>>>> costly.
>>>>>
>>>>> Regards,
>>>>>       Den
>>>> First of all, the patch is really wrong :) It was written using
>>>> wrong assumptions.
>>>>
>>>> OK. I have spent some time reading your original patchset and
>>>> and did not found any useful justification for default limit
>>>> for both discard and write zero.
>>>
>>> 32768 is the largest power of two fitting into a uint16.
>>> And uint16 is quite common for nb_sectors in backends.
>>>
>>
>> ok. This could be reasonable.
>>
>>
>>>>
>>>> Yes, there are drivers which requires block level to call
>>>> .bdrv_co_do_write_zeroes with alignment and with upper limit.
>>>> But in this case driver setups max_write_zeroes. All buggy
>>>> drivers should do that not to affect not buggy ones from
>>>> my opinion.
>>>>
>>>> This is the only purpose of the original patches for limits.
>>>> I have wrongly interpret BlockLimits as something connected
>>>> with time of the operation. Sorry for that.
>>>>
>>>> Therefore there is no good reason for limiting the amount of
>>>> data sent to drv->bdrv_co_writev with any data size. The only
>>>> thing is that it would be good not to allocate too many memory
>>>> at once. We could do something like
>>>>
>>>> base = qemu_try_blockalign(bs, MIN(2048, num) * BDRV_SECTOR_SIZE);
>>>> added = 0;
>>>> for (added = 0; added < num; add += MIN(2048, num)) {
>>>>       qemu_iovec_add(qiov, base, MIN(2048, num));
>>>> }
>>>>
>>>> to avoid really big allocations here even if .max_write_zeroes is
>>>> very high. Do you think that this might be useful?
>>>>
>>>> As for .bdrv_co_do_write_zeroes itself, can we still drop
>>>> default 32 Mb limit? If there are some buggy drivers, they
>>>> should have .max_write_zeroes specified.
>>>>
>>>> The same applies to .max_discard
>>>
>>> Its always risky to change default behaviour. In the original discussion we
>>> agreed that there should be a limit for each request. I think the 2^15 was
>>> Paolos suggestion.
>>>
>>> You where talking of metadata updates for a file. So the operation that is too slow
>>> for you is bdrv_write_zeroes inside a container file? What is the underlaying filesystem?
>>> What is the exact operation that you try to optimize?
>>>
>>> I am wondering because as far as I can see write zeroes is only supported for
>>> XFS and block devices which support BLKZEROOUT. The latter only works for
>>> cache=none. So its not that easy to end up in an optimized (fast) path anyway.
>>>
>>> Peter
>>>
>>>
>> you have missed 6 patches below ;) f.e. patch 2/7
>>
>> OK. I'll redo changes and fix on raw-posix level.
>
> I was not in CC on the series. Can you please include me in CC for all patches when you respin.
>

no prob :)

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [Qemu-devel] [PATCH 6/7] block: use fallocate(FALLOC_FL_PUNCH_HOLE) & fallocate(0) to write zeroes
  2015-01-30  8:42 ` [Qemu-devel] [PATCH 6/7] block: use fallocate(FALLOC_FL_PUNCH_HOLE) & fallocate(0) to write zeroes Denis V. Lunev
@ 2015-01-30 15:02   ` Max Reitz
  0 siblings, 0 replies; 17+ messages in thread
From: Max Reitz @ 2015-01-30 15:02 UTC (permalink / raw)
  To: Denis V. Lunev
  Cc: Kevin Wolf, Fam Zheng, Peter Lieven, qemu-devel, Stefan Hajnoczi

On 2015-01-30 at 03:42, Denis V. Lunev wrote:
> This sequence works efficiently if FALLOC_FL_ZERO_RANGE is not supported.
> Unfortunately, FALLOC_FL_ZERO_RANGE is supported on really modern systems
> and only for a couple of filesystems. FALLOC_FL_PUNCH_HOLE is much more
> mature.
>
> The sequence of 2 operations FALLOC_FL_PUNCH_HOLE and 0 is necessary due
> to the following reasons:
> - FALLOC_FL_PUNCH_HOLE creates a hole in the file, the file becomes
>    sparse. In order to retain original functionality we must allocate
>    disk space afterwards. This is done using fallocate(0) call
> - fallocate(0) without preceeding FALLOC_FL_PUNCH_HOLE will do nothing
>    if called above already allocated areas of the file, i.e. the content
>    will not be zeroed
>
> This should increase the performance a bit for not-so-modern kernels.
>
> Signed-off-by: Denis V. Lunev <den@openvz.org>
> CC: Max Reitz <mreitz@redhat.com>
> CC: Kevin Wolf <kwolf@redhat.com>
> CC: Stefan Hajnoczi <stefanha@redhat.com>
> CC: Peter Lieven <pl@kamp.de>
> CC: Fam Zheng <famz@redhat.com>
> ---
>   block/raw-posix.c | 19 +++++++++++++++++++
>   1 file changed, 19 insertions(+)

Reviewed-by: Max Reitz <mreitz@redhat.com>

^ permalink raw reply	[flat|nested] 17+ messages in thread

* [Qemu-devel] [PATCH 6/7] block: use fallocate(FALLOC_FL_PUNCH_HOLE) & fallocate(0) to write zeroes
  2015-01-30  8:42 [Qemu-devel] [PATCH v6 0/7] eliminate data write in bdrv_write_zeroes on Linux in raw-posix.c Denis V. Lunev
@ 2015-01-30  8:42 ` Denis V. Lunev
  2015-01-30 15:02   ` Max Reitz
  0 siblings, 1 reply; 17+ messages in thread
From: Denis V. Lunev @ 2015-01-30  8:42 UTC (permalink / raw)
  Cc: Kevin Wolf, Fam Zheng, Peter Lieven, qemu-devel, Max Reitz,
	Stefan Hajnoczi, Denis V. Lunev

This sequence works efficiently if FALLOC_FL_ZERO_RANGE is not supported.
Unfortunately, FALLOC_FL_ZERO_RANGE is supported on really modern systems
and only for a couple of filesystems. FALLOC_FL_PUNCH_HOLE is much more
mature.

The sequence of 2 operations FALLOC_FL_PUNCH_HOLE and 0 is necessary due
to the following reasons:
- FALLOC_FL_PUNCH_HOLE creates a hole in the file, the file becomes
  sparse. In order to retain original functionality we must allocate
  disk space afterwards. This is done using fallocate(0) call
- fallocate(0) without preceeding FALLOC_FL_PUNCH_HOLE will do nothing
  if called above already allocated areas of the file, i.e. the content
  will not be zeroed

This should increase the performance a bit for not-so-modern kernels.

Signed-off-by: Denis V. Lunev <den@openvz.org>
CC: Max Reitz <mreitz@redhat.com>
CC: Kevin Wolf <kwolf@redhat.com>
CC: Stefan Hajnoczi <stefanha@redhat.com>
CC: Peter Lieven <pl@kamp.de>
CC: Fam Zheng <famz@redhat.com>
---
 block/raw-posix.c | 19 +++++++++++++++++++
 1 file changed, 19 insertions(+)

diff --git a/block/raw-posix.c b/block/raw-posix.c
index 1c88ad8..7b42f37 100644
--- a/block/raw-posix.c
+++ b/block/raw-posix.c
@@ -967,6 +967,25 @@ static ssize_t handle_aiocb_write_zeroes(RawPosixAIOData *aiocb)
     }
 #endif
 
+#ifdef CONFIG_FALLOCATE_PUNCH_HOLE
+    if (s->has_discard && s->has_fallocate) {
+        int ret = do_fallocate(s->fd,
+                               FALLOC_FL_PUNCH_HOLE | FALLOC_FL_KEEP_SIZE,
+                               aiocb->aio_offset, aiocb->aio_nbytes);
+        if (ret == 0) {
+            ret = do_fallocate(s->fd, 0, aiocb->aio_offset, aiocb->aio_nbytes);
+            if (ret == 0 || ret != -ENOTSUP) {
+                return ret;
+            }
+            s->has_fallocate = false;
+        } else if (ret != -ENOTSUP) {
+            return ret;
+        } else {
+            s->has_discard = false;
+        }
+    }
+#endif
+
 #ifdef CONFIG_FALLOCATE
     if (s->has_fallocate && aiocb->aio_offset >= bdrv_getlength(aiocb->bs)) {
         int ret = do_fallocate(s->fd, 0, aiocb->aio_offset, aiocb->aio_nbytes);
-- 
1.9.1

^ permalink raw reply related	[flat|nested] 17+ messages in thread

end of thread, other threads:[~2015-01-30 15:02 UTC | newest]

Thread overview: 17+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2014-12-26 12:35 [Qemu-devel] [PATCH v2 0/7] eliminate data write in bdrv_write_zeroes on Linux Denis V. Lunev
2014-12-26 12:35 ` [Qemu-devel] [PATCH 1/7] block: fix maximum length sent to bdrv_co_do_write_zeroes callback in bs Denis V. Lunev
2014-12-26 13:13   ` Peter Lieven
2014-12-26 13:32     ` Denis V. Lunev
2014-12-26 19:15       ` Denis V. Lunev
2014-12-27 14:52         ` Peter Lieven
2014-12-27 17:42           ` Denis V. Lunev
2014-12-27 20:01             ` Peter Lieven
2014-12-27 20:14               ` Denis V. Lunev
2014-12-26 12:35 ` [Qemu-devel] [PATCH 2/7] block: use fallocate(FALLOC_FL_ZERO_RANGE) in handle_aiocb_write_zeroes Denis V. Lunev
2014-12-26 12:35 ` [Qemu-devel] [PATCH 3/7] block/raw-posix: create do_fallocate helper Denis V. Lunev
2014-12-26 12:35 ` [Qemu-devel] [PATCH 4/7] block/raw-posix: create translate_err helper to merge errno values Denis V. Lunev
2014-12-26 12:35 ` [Qemu-devel] [PATCH 5/7] block/raw-posix: refactor handle_aiocb_write_zeroes a bit Denis V. Lunev
2014-12-26 12:35 ` [Qemu-devel] [PATCH 6/7] block: use fallocate(FALLOC_FL_PUNCH_HOLE) & fallocate(0) to write zeroes Denis V. Lunev
2014-12-26 12:35 ` [Qemu-devel] [PATCH 7/7] block/raw-posix: call plain fallocate in handle_aiocb_write_zeroes Denis V. Lunev
2015-01-30  8:42 [Qemu-devel] [PATCH v6 0/7] eliminate data write in bdrv_write_zeroes on Linux in raw-posix.c Denis V. Lunev
2015-01-30  8:42 ` [Qemu-devel] [PATCH 6/7] block: use fallocate(FALLOC_FL_PUNCH_HOLE) & fallocate(0) to write zeroes Denis V. Lunev
2015-01-30 15:02   ` Max Reitz

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.