* [Qemu-devel] [PATCH v4 0/7] eliminate data write in bdrv_write_zeroes on Linux in raw-posix.c @ 2015-01-27 13:51 Denis V. Lunev 2015-01-27 13:51 ` [Qemu-devel] [PATCH 1/7] block/raw-posix: create translate_err helper to merge errno values Denis V. Lunev ` (6 more replies) 0 siblings, 7 replies; 32+ messages in thread From: Denis V. Lunev @ 2015-01-27 13:51 UTC (permalink / raw) Cc: Kevin Wolf, Fam Zheng, Peter Lieven, qemu-devel, Stefan Hajnoczi, Denis V. Lunev These patches eliminate data writes completely on Linux if fallocate FALLOC_FL_ZERO_RANGE or FALLOC_FL_PUNCH_HOLE are supported on underlying filesystem. I have performed several tests with non-aligned fallocate calls and in all cases (with non-aligned fallocates) Linux performs fine, i.e. areas are zeroed correctly. Checks were made on Linux 3.16.0-28-generic #38-Ubuntu SMP This should seriously increase performance of bdrv_write_zeroes Changes from v3: - dropped original patch 1, equivalent stuff was merged already - reordered patches as suggested by Fam - fixes spelling errors as suggested by Fam - fixed not initialized value in handle_aiocb_write_zeroes - fixed wrong error processing from do_fallocate(FALLOC_FL_ZERO_RANGE) Changes from v2: - added Peter Lieven to CC - added CONFIG_FALLOCATE check to call do_fallocate in patch 7 - dropped patch 1 as NACK-ed - added processing of very large data areas in bdrv_co_write_zeroes (new patch 1) - set bl.max_write_zeroes to INT_MAX in raw-posix.c for regular files (new patch 8) Signed-off-by: Denis V. Lunev <den@openvz.org> CC: Kevin Wolf <kwolf@redhat.com> CC: Stefan Hajnoczi <stefanha@redhat.com> CC: Peter Lieven <pl@kamp.de> CC: Fam Zheng <famz@redhat.com> ^ permalink raw reply [flat|nested] 32+ messages in thread
* [Qemu-devel] [PATCH 1/7] block/raw-posix: create translate_err helper to merge errno values 2015-01-27 13:51 [Qemu-devel] [PATCH v4 0/7] eliminate data write in bdrv_write_zeroes on Linux in raw-posix.c Denis V. Lunev @ 2015-01-27 13:51 ` Denis V. Lunev 2015-01-27 16:50 ` Max Reitz 2015-01-27 13:51 ` [Qemu-devel] [PATCH 2/7] block/raw-posix: create do_fallocate helper Denis V. Lunev ` (5 subsequent siblings) 6 siblings, 1 reply; 32+ messages in thread From: Denis V. Lunev @ 2015-01-27 13:51 UTC (permalink / raw) Cc: Kevin Wolf, Fam Zheng, Peter Lieven, qemu-devel, Stefan Hajnoczi, Denis V. Lunev actually the code if (ret == -ENODEV || ret == -ENOSYS || ret == -EOPNOTSUPP || ret == -ENOTTY) { ret = -ENOTSUP; } is present twice and will be added a couple more times. Create helper for this. Signed-off-by: Denis V. Lunev <den@openvz.org> CC: Kevin Wolf <kwolf@redhat.com> CC: Stefan Hajnoczi <stefanha@redhat.com> CC: Peter Lieven <pl@kamp.de> CC: Fam Zheng <famz@redhat.com> --- block/raw-posix.c | 19 +++++++++++++------ 1 file changed, 13 insertions(+), 6 deletions(-) diff --git a/block/raw-posix.c b/block/raw-posix.c index e51293a..24300d0 100644 --- a/block/raw-posix.c +++ b/block/raw-posix.c @@ -893,6 +893,15 @@ static int xfs_discard(BDRVRawState *s, int64_t offset, uint64_t bytes) } #endif +static int translate_err(int err) +{ + if (err == -ENODEV || err == -ENOSYS || err == -EOPNOTSUPP || + err == -ENOTTY) { + err = -ENOTSUP; + } + return err; +} + static ssize_t handle_aiocb_write_zeroes(RawPosixAIOData *aiocb) { int ret = -EOPNOTSUPP; @@ -921,10 +930,9 @@ static ssize_t handle_aiocb_write_zeroes(RawPosixAIOData *aiocb) #endif } - if (ret == -ENODEV || ret == -ENOSYS || ret == -EOPNOTSUPP || - ret == -ENOTTY) { + ret = translate_err(ret); + if (ret == -ENOTSUP) { s->has_write_zeroes = false; - ret = -ENOTSUP; } return ret; } @@ -968,10 +976,9 @@ static ssize_t handle_aiocb_discard(RawPosixAIOData *aiocb) #endif } - if (ret == -ENODEV || ret == -ENOSYS || ret == -EOPNOTSUPP || - ret == -ENOTTY) { + ret = translate_err(ret); + if (ret == -ENOTSUP) { s->has_discard = false; - ret = -ENOTSUP; } return ret; } -- 1.9.1 ^ permalink raw reply related [flat|nested] 32+ messages in thread
* Re: [Qemu-devel] [PATCH 1/7] block/raw-posix: create translate_err helper to merge errno values 2015-01-27 13:51 ` [Qemu-devel] [PATCH 1/7] block/raw-posix: create translate_err helper to merge errno values Denis V. Lunev @ 2015-01-27 16:50 ` Max Reitz 0 siblings, 0 replies; 32+ messages in thread From: Max Reitz @ 2015-01-27 16:50 UTC (permalink / raw) To: Denis V. Lunev Cc: Kevin Wolf, Peter Lieven, Fam Zheng, qemu-devel, Stefan Hajnoczi On 2015-01-27 at 08:51, Denis V. Lunev wrote: > actually the code > if (ret == -ENODEV || ret == -ENOSYS || ret == -EOPNOTSUPP || > ret == -ENOTTY) { > ret = -ENOTSUP; > } > is present twice and will be added a couple more times. Create helper > for this. > > Signed-off-by: Denis V. Lunev <den@openvz.org> > CC: Kevin Wolf <kwolf@redhat.com> > CC: Stefan Hajnoczi <stefanha@redhat.com> > CC: Peter Lieven <pl@kamp.de> > CC: Fam Zheng <famz@redhat.com> > --- > block/raw-posix.c | 19 +++++++++++++------ > 1 file changed, 13 insertions(+), 6 deletions(-) Reviewed-by: Max Reitz <mreitz@redhat.com> ^ permalink raw reply [flat|nested] 32+ messages in thread
* [Qemu-devel] [PATCH 2/7] block/raw-posix: create do_fallocate helper 2015-01-27 13:51 [Qemu-devel] [PATCH v4 0/7] eliminate data write in bdrv_write_zeroes on Linux in raw-posix.c Denis V. Lunev 2015-01-27 13:51 ` [Qemu-devel] [PATCH 1/7] block/raw-posix: create translate_err helper to merge errno values Denis V. Lunev @ 2015-01-27 13:51 ` Denis V. Lunev 2015-01-27 16:57 ` Max Reitz 2015-01-27 13:51 ` [Qemu-devel] [PATCH 3/7] block/raw-posix: refactor handle_aiocb_write_zeroes a bit Denis V. Lunev ` (4 subsequent siblings) 6 siblings, 1 reply; 32+ messages in thread From: Denis V. Lunev @ 2015-01-27 13:51 UTC (permalink / raw) Cc: Kevin Wolf, Fam Zheng, Peter Lieven, qemu-devel, Stefan Hajnoczi, Denis V. Lunev The pattern do { if (fallocate(s->fd, mode, offset, len) == 0) { return 0; } } while (errno == EINTR); ret = translate_err(-errno); will be commonly useful in next patches. Create helper for it. Signed-off-by: Denis V. Lunev <den@openvz.org> CC: Kevin Wolf <kwolf@redhat.com> CC: Stefan Hajnoczi <stefanha@redhat.com> CC: Peter Lieven <pl@kamp.de> CC: Fam Zheng <famz@redhat.com> --- block/raw-posix.c | 22 ++++++++++++++-------- 1 file changed, 14 insertions(+), 8 deletions(-) diff --git a/block/raw-posix.c b/block/raw-posix.c index 24300d0..2aa268a 100644 --- a/block/raw-posix.c +++ b/block/raw-posix.c @@ -902,6 +902,18 @@ static int translate_err(int err) return err; } +#if defined(CONFIG_FALLOCATE_PUNCH_HOLE) +static int do_fallocate(int fd, int mode, off_t offset, off_t len) +{ + do { + if (fallocate(fd, mode, offset, len) == 0) { + return 0; + } + } while (errno == EINTR); + return translate_err(-errno); +} +#endif + static ssize_t handle_aiocb_write_zeroes(RawPosixAIOData *aiocb) { int ret = -EOPNOTSUPP; @@ -965,14 +977,8 @@ static ssize_t handle_aiocb_discard(RawPosixAIOData *aiocb) #endif #ifdef CONFIG_FALLOCATE_PUNCH_HOLE - do { - if (fallocate(s->fd, FALLOC_FL_PUNCH_HOLE | FALLOC_FL_KEEP_SIZE, - aiocb->aio_offset, aiocb->aio_nbytes) == 0) { - return 0; - } - } while (errno == EINTR); - - ret = -errno; + ret = do_fallocate(s->fd, FALLOC_FL_PUNCH_HOLE | FALLOC_FL_KEEP_SIZE, + aiocb->aio_offset, aiocb->aio_nbytes); #endif } -- 1.9.1 ^ permalink raw reply related [flat|nested] 32+ messages in thread
* Re: [Qemu-devel] [PATCH 2/7] block/raw-posix: create do_fallocate helper 2015-01-27 13:51 ` [Qemu-devel] [PATCH 2/7] block/raw-posix: create do_fallocate helper Denis V. Lunev @ 2015-01-27 16:57 ` Max Reitz 0 siblings, 0 replies; 32+ messages in thread From: Max Reitz @ 2015-01-27 16:57 UTC (permalink / raw) To: Denis V. Lunev Cc: Kevin Wolf, Peter Lieven, Fam Zheng, qemu-devel, Stefan Hajnoczi On 2015-01-27 at 08:51, Denis V. Lunev wrote: > The pattern > do { > if (fallocate(s->fd, mode, offset, len) == 0) { > return 0; > } > } while (errno == EINTR); > ret = translate_err(-errno); > will be commonly useful in next patches. Create helper for it. > > Signed-off-by: Denis V. Lunev <den@openvz.org> > CC: Kevin Wolf <kwolf@redhat.com> > CC: Stefan Hajnoczi <stefanha@redhat.com> > CC: Peter Lieven <pl@kamp.de> > CC: Fam Zheng <famz@redhat.com> > --- > block/raw-posix.c | 22 ++++++++++++++-------- > 1 file changed, 14 insertions(+), 8 deletions(-) > > diff --git a/block/raw-posix.c b/block/raw-posix.c > index 24300d0..2aa268a 100644 > --- a/block/raw-posix.c > +++ b/block/raw-posix.c > @@ -902,6 +902,18 @@ static int translate_err(int err) > return err; > } > > +#if defined(CONFIG_FALLOCATE_PUNCH_HOLE) > +static int do_fallocate(int fd, int mode, off_t offset, off_t len) > +{ > + do { > + if (fallocate(fd, mode, offset, len) == 0) { > + return 0; > + } > + } while (errno == EINTR); > + return translate_err(-errno); The translate_err() seems superfluous so far, but won't hurt and I guess the redundancy will disappear in a later patch of this series. Reviewed-by: Max Reitz <mreitz@redhat.com> ^ permalink raw reply [flat|nested] 32+ messages in thread
* [Qemu-devel] [PATCH 3/7] block/raw-posix: refactor handle_aiocb_write_zeroes a bit 2015-01-27 13:51 [Qemu-devel] [PATCH v4 0/7] eliminate data write in bdrv_write_zeroes on Linux in raw-posix.c Denis V. Lunev 2015-01-27 13:51 ` [Qemu-devel] [PATCH 1/7] block/raw-posix: create translate_err helper to merge errno values Denis V. Lunev 2015-01-27 13:51 ` [Qemu-devel] [PATCH 2/7] block/raw-posix: create do_fallocate helper Denis V. Lunev @ 2015-01-27 13:51 ` Denis V. Lunev 2015-01-27 17:13 ` Max Reitz 2015-01-27 13:51 ` [Qemu-devel] [PATCH 4/7] block: use fallocate(FALLOC_FL_ZERO_RANGE) in handle_aiocb_write_zeroes Denis V. Lunev ` (3 subsequent siblings) 6 siblings, 1 reply; 32+ messages in thread From: Denis V. Lunev @ 2015-01-27 13:51 UTC (permalink / raw) Cc: Kevin Wolf, Fam Zheng, Peter Lieven, qemu-devel, Stefan Hajnoczi, Denis V. Lunev move code dealing with a block device to a separate function. This will allow to implement additional processing for ordinary files. Pls note, that xfs_code has been moved before checking for s->has_write_zeroes as xfs_write_zeroes does not touch this flag inside. This makes code a bit more consistent. Signed-off-by: Denis V. Lunev <den@openvz.org> CC: Kevin Wolf <kwolf@redhat.com> CC: Stefan Hajnoczi <stefanha@redhat.com> CC: Peter Lieven <pl@kamp.de> CC: Fam Zheng <famz@redhat.com> --- block/raw-posix.c | 48 +++++++++++++++++++++++++++++------------------- 1 file changed, 29 insertions(+), 19 deletions(-) diff --git a/block/raw-posix.c b/block/raw-posix.c index 2aa268a..24e1fab 100644 --- a/block/raw-posix.c +++ b/block/raw-posix.c @@ -914,41 +914,51 @@ static int do_fallocate(int fd, int mode, off_t offset, off_t len) } #endif -static ssize_t handle_aiocb_write_zeroes(RawPosixAIOData *aiocb) +static ssize_t handle_aiocb_write_zeroes_block(RawPosixAIOData *aiocb) { - int ret = -EOPNOTSUPP; + int ret = -ENOTSUP; BDRVRawState *s = aiocb->bs->opaque; - if (s->has_write_zeroes == 0) { + if (!s->has_write_zeroes) { return -ENOTSUP; } - if (aiocb->aio_type & QEMU_AIO_BLKDEV) { #ifdef BLKZEROOUT - do { - uint64_t range[2] = { aiocb->aio_offset, aiocb->aio_nbytes }; - if (ioctl(aiocb->aio_fildes, BLKZEROOUT, range) == 0) { - return 0; - } - } while (errno == EINTR); - - ret = -errno; -#endif - } else { -#ifdef CONFIG_XFS - if (s->is_xfs) { - return xfs_write_zeroes(s, aiocb->aio_offset, aiocb->aio_nbytes); + do { + uint64_t range[2] = { aiocb->aio_offset, aiocb->aio_nbytes }; + if (ioctl(aiocb->aio_fildes, BLKZEROOUT, range) == 0) { + return 0; } + } while (errno == EINTR); + + ret = translate_err(-errno); #endif - } - ret = translate_err(ret); if (ret == -ENOTSUP) { s->has_write_zeroes = false; } return ret; } +static ssize_t handle_aiocb_write_zeroes(RawPosixAIOData *aiocb) +{ + int ret = -ENOTSUP; + BDRVRawState *s = aiocb->bs->opaque; + + if (aiocb->aio_type & QEMU_AIO_BLKDEV) { + return handle_aiocb_write_zeroes_block(aiocb); + } + +#ifdef CONFIG_XFS + if (s->is_xfs) { + return xfs_write_zeroes(s, aiocb->aio_offset, aiocb->aio_nbytes); + } +#endif + + s->has_write_zeroes = false; + return ret; +} + static ssize_t handle_aiocb_discard(RawPosixAIOData *aiocb) { int ret = -EOPNOTSUPP; -- 1.9.1 ^ permalink raw reply related [flat|nested] 32+ messages in thread
* Re: [Qemu-devel] [PATCH 3/7] block/raw-posix: refactor handle_aiocb_write_zeroes a bit 2015-01-27 13:51 ` [Qemu-devel] [PATCH 3/7] block/raw-posix: refactor handle_aiocb_write_zeroes a bit Denis V. Lunev @ 2015-01-27 17:13 ` Max Reitz 0 siblings, 0 replies; 32+ messages in thread From: Max Reitz @ 2015-01-27 17:13 UTC (permalink / raw) To: Denis V. Lunev Cc: Kevin Wolf, Peter Lieven, Fam Zheng, qemu-devel, Stefan Hajnoczi On 2015-01-27 at 08:51, Denis V. Lunev wrote: > move code dealing with a block device to a separate function. This will > allow to implement additional processing for ordinary files. > > Pls note, that xfs_code has been moved before checking for Please use "Please". :-) > s->has_write_zeroes as xfs_write_zeroes does not touch this flag inside. > This makes code a bit more consistent. > > Signed-off-by: Denis V. Lunev <den@openvz.org> > CC: Kevin Wolf <kwolf@redhat.com> > CC: Stefan Hajnoczi <stefanha@redhat.com> > CC: Peter Lieven <pl@kamp.de> > CC: Fam Zheng <famz@redhat.com> > --- > block/raw-posix.c | 48 +++++++++++++++++++++++++++++------------------- > 1 file changed, 29 insertions(+), 19 deletions(-) > > diff --git a/block/raw-posix.c b/block/raw-posix.c > index 2aa268a..24e1fab 100644 > --- a/block/raw-posix.c > +++ b/block/raw-posix.c > @@ -914,41 +914,51 @@ static int do_fallocate(int fd, int mode, off_t offset, off_t len) > } > #endif > > -static ssize_t handle_aiocb_write_zeroes(RawPosixAIOData *aiocb) > +static ssize_t handle_aiocb_write_zeroes_block(RawPosixAIOData *aiocb) > { > - int ret = -EOPNOTSUPP; > + int ret = -ENOTSUP; > BDRVRawState *s = aiocb->bs->opaque; > > - if (s->has_write_zeroes == 0) { > + if (!s->has_write_zeroes) { > return -ENOTSUP; > } > > - if (aiocb->aio_type & QEMU_AIO_BLKDEV) { > #ifdef BLKZEROOUT > - do { > - uint64_t range[2] = { aiocb->aio_offset, aiocb->aio_nbytes }; > - if (ioctl(aiocb->aio_fildes, BLKZEROOUT, range) == 0) { > - return 0; > - } > - } while (errno == EINTR); > - > - ret = -errno; > -#endif > - } else { > -#ifdef CONFIG_XFS > - if (s->is_xfs) { > - return xfs_write_zeroes(s, aiocb->aio_offset, aiocb->aio_nbytes); > + do { > + uint64_t range[2] = { aiocb->aio_offset, aiocb->aio_nbytes }; > + if (ioctl(aiocb->aio_fildes, BLKZEROOUT, range) == 0) { > + return 0; > } > + } while (errno == EINTR); > + > + ret = translate_err(-errno); > #endif > - } > > - ret = translate_err(ret); > if (ret == -ENOTSUP) { > s->has_write_zeroes = false; > } > return ret; > } > > +static ssize_t handle_aiocb_write_zeroes(RawPosixAIOData *aiocb) > +{ > + int ret = -ENOTSUP; > + BDRVRawState *s = aiocb->bs->opaque; > + > + if (aiocb->aio_type & QEMU_AIO_BLKDEV) { > + return handle_aiocb_write_zeroes_block(aiocb); > + } > + > +#ifdef CONFIG_XFS > + if (s->is_xfs) { > + return xfs_write_zeroes(s, aiocb->aio_offset, aiocb->aio_nbytes); > + } > +#endif > + > + s->has_write_zeroes = false; > + return ret; > +} > + It'll probably look nicer if you remove the "ret" variable from this function completely and just "return -ENOTSUP" at the end. With s/Pls/Please/ in the commit message and with or without "ret" removed from this function: Reviewed-by: Max Reitz <mreitz@redhat.com> ^ permalink raw reply [flat|nested] 32+ messages in thread
* [Qemu-devel] [PATCH 4/7] block: use fallocate(FALLOC_FL_ZERO_RANGE) in handle_aiocb_write_zeroes 2015-01-27 13:51 [Qemu-devel] [PATCH v4 0/7] eliminate data write in bdrv_write_zeroes on Linux in raw-posix.c Denis V. Lunev ` (2 preceding siblings ...) 2015-01-27 13:51 ` [Qemu-devel] [PATCH 3/7] block/raw-posix: refactor handle_aiocb_write_zeroes a bit Denis V. Lunev @ 2015-01-27 13:51 ` Denis V. Lunev 2015-01-27 17:30 ` Max Reitz 2015-01-27 13:51 ` [Qemu-devel] [PATCH 5/7] block: use fallocate(FALLOC_FL_PUNCH_HOLE) & fallocate(0) to write zeroes Denis V. Lunev ` (2 subsequent siblings) 6 siblings, 1 reply; 32+ messages in thread From: Denis V. Lunev @ 2015-01-27 13:51 UTC (permalink / raw) Cc: Kevin Wolf, Fam Zheng, Peter Lieven, qemu-devel, Stefan Hajnoczi, Denis V. Lunev This efficiently writes zeroes on Linux if the kernel is capable enough. FALLOC_FL_ZERO_RANGE correctly handles all cases, including and not including file expansion. Signed-off-by: Denis V. Lunev <den@openvz.org> CC: Kevin Wolf <kwolf@redhat.com> CC: Stefan Hajnoczi <stefanha@redhat.com> CC: Peter Lieven <pl@kamp.de> CC: Fam Zheng <famz@redhat.com> --- block/raw-posix.c | 16 ++++++++++++++-- configure | 19 +++++++++++++++++++ 2 files changed, 33 insertions(+), 2 deletions(-) diff --git a/block/raw-posix.c b/block/raw-posix.c index 24e1fab..3c35b2f 100644 --- a/block/raw-posix.c +++ b/block/raw-posix.c @@ -60,7 +60,7 @@ #define FS_NOCOW_FL 0x00800000 /* Do not cow file */ #endif #endif -#ifdef CONFIG_FALLOCATE_PUNCH_HOLE +#if defined(CONFIG_FALLOCATE_PUNCH_HOLE) || defined(CONFIG_FALLOCATE_ZERO_RANGE) #include <linux/falloc.h> #endif #if defined (__FreeBSD__) || defined(__FreeBSD_kernel__) @@ -902,7 +902,7 @@ static int translate_err(int err) return err; } -#if defined(CONFIG_FALLOCATE_PUNCH_HOLE) +#if defined(CONFIG_FALLOCATE_PUNCH_HOLE) || defined(CONFIG_FALLOCATE_ZERO_RANGE) static int do_fallocate(int fd, int mode, off_t offset, off_t len) { do { @@ -955,6 +955,18 @@ static ssize_t handle_aiocb_write_zeroes(RawPosixAIOData *aiocb) } #endif +#ifdef CONFIG_FALLOCATE_ZERO_RANGE + if (s->has_write_zeroes) { + ret = do_fallocate(s->fd, FALLOC_FL_ZERO_RANGE, + aiocb->aio_offset, aiocb->aio_nbytes); + if (ret == 0 || ret != -ENOTSUP) { + return ret; + } + s->has_write_zeroes = false; + return ret; + } +#endif + s->has_write_zeroes = false; return ret; } diff --git a/configure b/configure index f185dd0..e00e03a 100755 --- a/configure +++ b/configure @@ -3335,6 +3335,22 @@ if compile_prog "" "" ; then fallocate_punch_hole=yes fi +# check that fallocate supports range zeroing inside the file +fallocate_zero_range=no +cat > $TMPC << EOF +#include <fcntl.h> +#include <linux/falloc.h> + +int main(void) +{ + fallocate(0, FALLOC_FL_ZERO_RANGE, 0, 0); + return 0; +} +EOF +if compile_prog "" "" ; then + fallocate_zero_range=yes +fi + # check for posix_fallocate posix_fallocate=no cat > $TMPC << EOF @@ -4567,6 +4583,9 @@ fi if test "$fallocate_punch_hole" = "yes" ; then echo "CONFIG_FALLOCATE_PUNCH_HOLE=y" >> $config_host_mak fi +if test "$fallocate_zero_range" = "yes" ; then + echo "CONFIG_FALLOCATE_ZERO_RANGE=y" >> $config_host_mak +fi if test "$posix_fallocate" = "yes" ; then echo "CONFIG_POSIX_FALLOCATE=y" >> $config_host_mak fi -- 1.9.1 ^ permalink raw reply related [flat|nested] 32+ messages in thread
* Re: [Qemu-devel] [PATCH 4/7] block: use fallocate(FALLOC_FL_ZERO_RANGE) in handle_aiocb_write_zeroes 2015-01-27 13:51 ` [Qemu-devel] [PATCH 4/7] block: use fallocate(FALLOC_FL_ZERO_RANGE) in handle_aiocb_write_zeroes Denis V. Lunev @ 2015-01-27 17:30 ` Max Reitz 0 siblings, 0 replies; 32+ messages in thread From: Max Reitz @ 2015-01-27 17:30 UTC (permalink / raw) To: Denis V. Lunev Cc: Kevin Wolf, Peter Lieven, Fam Zheng, qemu-devel, Stefan Hajnoczi On 2015-01-27 at 08:51, Denis V. Lunev wrote: > This efficiently writes zeroes on Linux if the kernel is capable enough. > FALLOC_FL_ZERO_RANGE correctly handles all cases, including and not > including file expansion. > > Signed-off-by: Denis V. Lunev <den@openvz.org> > CC: Kevin Wolf <kwolf@redhat.com> > CC: Stefan Hajnoczi <stefanha@redhat.com> > CC: Peter Lieven <pl@kamp.de> > CC: Fam Zheng <famz@redhat.com> > --- > block/raw-posix.c | 16 ++++++++++++++-- > configure | 19 +++++++++++++++++++ > 2 files changed, 33 insertions(+), 2 deletions(-) Okay, now the "ret" in handle_aiocb_write_zeroes() is necessary, so please disregard my statement about removing it in patch 3. > diff --git a/block/raw-posix.c b/block/raw-posix.c > index 24e1fab..3c35b2f 100644 > --- a/block/raw-posix.c > +++ b/block/raw-posix.c > @@ -60,7 +60,7 @@ > #define FS_NOCOW_FL 0x00800000 /* Do not cow file */ > #endif > #endif > -#ifdef CONFIG_FALLOCATE_PUNCH_HOLE > +#if defined(CONFIG_FALLOCATE_PUNCH_HOLE) || defined(CONFIG_FALLOCATE_ZERO_RANGE) > #include <linux/falloc.h> > #endif > #if defined (__FreeBSD__) || defined(__FreeBSD_kernel__) > @@ -902,7 +902,7 @@ static int translate_err(int err) > return err; > } > > -#if defined(CONFIG_FALLOCATE_PUNCH_HOLE) > +#if defined(CONFIG_FALLOCATE_PUNCH_HOLE) || defined(CONFIG_FALLOCATE_ZERO_RANGE) > static int do_fallocate(int fd, int mode, off_t offset, off_t len) > { > do { > @@ -955,6 +955,18 @@ static ssize_t handle_aiocb_write_zeroes(RawPosixAIOData *aiocb) > } > #endif > > +#ifdef CONFIG_FALLOCATE_ZERO_RANGE > + if (s->has_write_zeroes) { > + ret = do_fallocate(s->fd, FALLOC_FL_ZERO_RANGE, > + aiocb->aio_offset, aiocb->aio_nbytes); > + if (ret == 0 || ret != -ENOTSUP) { > + return ret; > + } > + s->has_write_zeroes = false; > + return ret; > + } First, you probably want to simply fall through here; right now, you are immediately failing with -ENOTSUP on the first call, but falling through on the second call. After this patch, it doesn't make a difference, but after the next one, it might. Second, while using s->has_write_zeroes here seems correct to me, I personally don't like sharing it with handle_aiocb_write_zeroes_block(); and if you do introduce a new flag like "has_zero_range", please don't make it a bit field (I will give you an R-b regardless of whether you make it a bit field or not, I just won't like it). Feel free to keep has_write_zeroes, though, while it doesn't look good to me it certainly is correct from a technical perspective. Max ^ permalink raw reply [flat|nested] 32+ messages in thread
* [Qemu-devel] [PATCH 5/7] block: use fallocate(FALLOC_FL_PUNCH_HOLE) & fallocate(0) to write zeroes 2015-01-27 13:51 [Qemu-devel] [PATCH v4 0/7] eliminate data write in bdrv_write_zeroes on Linux in raw-posix.c Denis V. Lunev ` (3 preceding siblings ...) 2015-01-27 13:51 ` [Qemu-devel] [PATCH 4/7] block: use fallocate(FALLOC_FL_ZERO_RANGE) in handle_aiocb_write_zeroes Denis V. Lunev @ 2015-01-27 13:51 ` Denis V. Lunev 2015-01-27 17:48 ` Max Reitz 2015-01-27 13:51 ` [Qemu-devel] [PATCH 6/7] block/raw-posix: call plain fallocate in handle_aiocb_write_zeroes Denis V. Lunev 2015-01-27 13:51 ` [Qemu-devel] [PATCH 7/7] block/raw-posix: set max_write_zeroes to INT_MAX for regular files Denis V. Lunev 6 siblings, 1 reply; 32+ messages in thread From: Denis V. Lunev @ 2015-01-27 13:51 UTC (permalink / raw) Cc: Kevin Wolf, Fam Zheng, Peter Lieven, qemu-devel, Stefan Hajnoczi, Denis V. Lunev This sequence works efficiently if FALLOC_FL_ZERO_RANGE is not supported. Unfortunately, FALLOC_FL_ZERO_RANGE is supported on really modern systems and only for a couple of filesystems. FALLOC_FL_PUNCH_HOLE is much more mature. The sequence of 2 operations FALLOC_FL_PUNCH_HOLE and 0 is necessary due to the following reasons: - FALLOC_FL_PUNCH_HOLE creates a hole in the file, the file becomes sparse. In order to retain original functionality we must allocate disk space afterwards. This is done using fallocate(0) call - fallocate(0) without preceeding FALLOC_FL_PUNCH_HOLE will do nothing if called above already allocated areas of the file, i.e. the content will not be zeroed This should increase the performance a bit for not-so-modern kernels. Signed-off-by: Denis V. Lunev <den@openvz.org> CC: Kevin Wolf <kwolf@redhat.com> CC: Stefan Hajnoczi <stefanha@redhat.com> CC: Peter Lieven <pl@kamp.de> CC: Fam Zheng <famz@redhat.com> --- block/raw-posix.c | 14 ++++++++++++++ 1 file changed, 14 insertions(+) diff --git a/block/raw-posix.c b/block/raw-posix.c index 3c35b2f..c039bef 100644 --- a/block/raw-posix.c +++ b/block/raw-posix.c @@ -967,6 +967,20 @@ static ssize_t handle_aiocb_write_zeroes(RawPosixAIOData *aiocb) } #endif +#ifdef CONFIG_FALLOCATE_PUNCH_HOLE + if (s->has_discard) { + ret = do_fallocate(s->fd, FALLOC_FL_PUNCH_HOLE | FALLOC_FL_KEEP_SIZE, + aiocb->aio_offset, aiocb->aio_nbytes); + if (ret < 0) { + if (ret == -ENOTSUP) { + s->has_discard = false; + } + return ret; + } + return do_fallocate(s->fd, 0, aiocb->aio_offset, aiocb->aio_nbytes); + } +#endif + s->has_write_zeroes = false; return ret; } -- 1.9.1 ^ permalink raw reply related [flat|nested] 32+ messages in thread
* Re: [Qemu-devel] [PATCH 5/7] block: use fallocate(FALLOC_FL_PUNCH_HOLE) & fallocate(0) to write zeroes 2015-01-27 13:51 ` [Qemu-devel] [PATCH 5/7] block: use fallocate(FALLOC_FL_PUNCH_HOLE) & fallocate(0) to write zeroes Denis V. Lunev @ 2015-01-27 17:48 ` Max Reitz 0 siblings, 0 replies; 32+ messages in thread From: Max Reitz @ 2015-01-27 17:48 UTC (permalink / raw) To: Denis V. Lunev Cc: Kevin Wolf, Peter Lieven, Fam Zheng, qemu-devel, Stefan Hajnoczi On 2015-01-27 at 08:51, Denis V. Lunev wrote: > This sequence works efficiently if FALLOC_FL_ZERO_RANGE is not supported. > Unfortunately, FALLOC_FL_ZERO_RANGE is supported on really modern systems > and only for a couple of filesystems. FALLOC_FL_PUNCH_HOLE is much more > mature. > > The sequence of 2 operations FALLOC_FL_PUNCH_HOLE and 0 is necessary due > to the following reasons: > - FALLOC_FL_PUNCH_HOLE creates a hole in the file, the file becomes > sparse. In order to retain original functionality we must allocate > disk space afterwards. This is done using fallocate(0) call > - fallocate(0) without preceeding FALLOC_FL_PUNCH_HOLE will do nothing > if called above already allocated areas of the file, i.e. the content > will not be zeroed > > This should increase the performance a bit for not-so-modern kernels. > > Signed-off-by: Denis V. Lunev <den@openvz.org> > CC: Kevin Wolf <kwolf@redhat.com> > CC: Stefan Hajnoczi <stefanha@redhat.com> > CC: Peter Lieven <pl@kamp.de> > CC: Fam Zheng <famz@redhat.com> > --- > block/raw-posix.c | 14 ++++++++++++++ > 1 file changed, 14 insertions(+) > > diff --git a/block/raw-posix.c b/block/raw-posix.c > index 3c35b2f..c039bef 100644 > --- a/block/raw-posix.c > +++ b/block/raw-posix.c > @@ -967,6 +967,20 @@ static ssize_t handle_aiocb_write_zeroes(RawPosixAIOData *aiocb) > } > #endif > > +#ifdef CONFIG_FALLOCATE_PUNCH_HOLE > + if (s->has_discard) { > + ret = do_fallocate(s->fd, FALLOC_FL_PUNCH_HOLE | FALLOC_FL_KEEP_SIZE, > + aiocb->aio_offset, aiocb->aio_nbytes); > + if (ret < 0) { > + if (ret == -ENOTSUP) { > + s->has_discard = false; > + } > + return ret; > + } > + return do_fallocate(s->fd, 0, aiocb->aio_offset, aiocb->aio_nbytes); > + } > +#endif > + Sharing "has_discard" with handle_aiocb_discard() looks fine to me, because it's used for the the same do_fallocate() call there. Once again, you should not abort if the first do_fallocate() returns ENOTSUP, because this is inconsistent with the behavior on the second call to handle_aiocb_write_zeroes() (where it falls through due to has_discard being false). Once again, this doesn't make a difference now, but very well might after the next patch. And finally, do we need another has_foo for the fallocate(0) call? (like just "has_fallocate") Max ^ permalink raw reply [flat|nested] 32+ messages in thread
* [Qemu-devel] [PATCH 6/7] block/raw-posix: call plain fallocate in handle_aiocb_write_zeroes 2015-01-27 13:51 [Qemu-devel] [PATCH v4 0/7] eliminate data write in bdrv_write_zeroes on Linux in raw-posix.c Denis V. Lunev ` (4 preceding siblings ...) 2015-01-27 13:51 ` [Qemu-devel] [PATCH 5/7] block: use fallocate(FALLOC_FL_PUNCH_HOLE) & fallocate(0) to write zeroes Denis V. Lunev @ 2015-01-27 13:51 ` Denis V. Lunev 2015-01-27 17:57 ` Max Reitz 2015-01-27 13:51 ` [Qemu-devel] [PATCH 7/7] block/raw-posix: set max_write_zeroes to INT_MAX for regular files Denis V. Lunev 6 siblings, 1 reply; 32+ messages in thread From: Denis V. Lunev @ 2015-01-27 13:51 UTC (permalink / raw) Cc: Kevin Wolf, Fam Zheng, Peter Lieven, qemu-devel, Stefan Hajnoczi, Denis V. Lunev There is a possibility that we are extending our image and thus writing zeroes beyond the end of the file. In this case we do not need to care about the hole to make sure that there is no data in the file under this offset (pre-condition to fallocate(0) to work). We could simply call fallocate(0). This improves the performance of writing zeroes even on really old platforms which do not have even FALLOC_FL_PUNCH_HOLE. Signed-off-by: Denis V. Lunev <den@openvz.org> CC: Kevin Wolf <kwolf@redhat.com> CC: Stefan Hajnoczi <stefanha@redhat.com> CC: Peter Lieven <pl@kamp.de> CC: Fam Zheng <famz@redhat.com> --- block/raw-posix.c | 10 ++++++++-- 1 file changed, 8 insertions(+), 2 deletions(-) diff --git a/block/raw-posix.c b/block/raw-posix.c index c039bef..fa05239 100644 --- a/block/raw-posix.c +++ b/block/raw-posix.c @@ -60,7 +60,7 @@ #define FS_NOCOW_FL 0x00800000 /* Do not cow file */ #endif #endif -#if defined(CONFIG_FALLOCATE_PUNCH_HOLE) || defined(CONFIG_FALLOCATE_ZERO_RANGE) +#ifdef CONFIG_FALLOCATE #include <linux/falloc.h> #endif #if defined (__FreeBSD__) || defined(__FreeBSD_kernel__) @@ -902,7 +902,7 @@ static int translate_err(int err) return err; } -#if defined(CONFIG_FALLOCATE_PUNCH_HOLE) || defined(CONFIG_FALLOCATE_ZERO_RANGE) +#ifdef CONFIG_FALLOCATE static int do_fallocate(int fd, int mode, off_t offset, off_t len) { do { @@ -981,6 +981,12 @@ static ssize_t handle_aiocb_write_zeroes(RawPosixAIOData *aiocb) } #endif +#ifdef CONFIG_FALLOCATE + if (aiocb->aio_offset >= aiocb->bs->total_sectors << BDRV_SECTOR_BITS) { + return do_fallocate(s->fd, 0, aiocb->aio_offset, aiocb->aio_nbytes); + } +#endif + s->has_write_zeroes = false; return ret; } -- 1.9.1 ^ permalink raw reply related [flat|nested] 32+ messages in thread
* Re: [Qemu-devel] [PATCH 6/7] block/raw-posix: call plain fallocate in handle_aiocb_write_zeroes 2015-01-27 13:51 ` [Qemu-devel] [PATCH 6/7] block/raw-posix: call plain fallocate in handle_aiocb_write_zeroes Denis V. Lunev @ 2015-01-27 17:57 ` Max Reitz 2015-01-27 18:19 ` Denis V. Lunev 0 siblings, 1 reply; 32+ messages in thread From: Max Reitz @ 2015-01-27 17:57 UTC (permalink / raw) To: Denis V. Lunev Cc: Kevin Wolf, Peter Lieven, Fam Zheng, qemu-devel, Stefan Hajnoczi On 2015-01-27 at 08:51, Denis V. Lunev wrote: > There is a possibility that we are extending our image and thus writing > zeroes beyond the end of the file. In this case we do not need to care > about the hole to make sure that there is no data in the file under > this offset (pre-condition to fallocate(0) to work). We could simply call > fallocate(0). > > This improves the performance of writing zeroes even on really old > platforms which do not have even FALLOC_FL_PUNCH_HOLE. > > Signed-off-by: Denis V. Lunev <den@openvz.org> > CC: Kevin Wolf <kwolf@redhat.com> > CC: Stefan Hajnoczi <stefanha@redhat.com> > CC: Peter Lieven <pl@kamp.de> > CC: Fam Zheng <famz@redhat.com> > --- > block/raw-posix.c | 10 ++++++++-- > 1 file changed, 8 insertions(+), 2 deletions(-) > > diff --git a/block/raw-posix.c b/block/raw-posix.c > index c039bef..fa05239 100644 > --- a/block/raw-posix.c > +++ b/block/raw-posix.c > @@ -60,7 +60,7 @@ > #define FS_NOCOW_FL 0x00800000 /* Do not cow file */ > #endif > #endif > -#if defined(CONFIG_FALLOCATE_PUNCH_HOLE) || defined(CONFIG_FALLOCATE_ZERO_RANGE) > +#ifdef CONFIG_FALLOCATE This change doesn't seem right; CONFIG_FALLOCATE is set if posix_fallocate() is available, not for the Linux-specific fallocate() from linux/falloc.h. > #include <linux/falloc.h> > #endif > #if defined (__FreeBSD__) || defined(__FreeBSD_kernel__) > @@ -902,7 +902,7 @@ static int translate_err(int err) > return err; > } > > -#if defined(CONFIG_FALLOCATE_PUNCH_HOLE) || defined(CONFIG_FALLOCATE_ZERO_RANGE) > +#ifdef CONFIG_FALLOCATE Same here. > static int do_fallocate(int fd, int mode, off_t offset, off_t len) > { > do { > @@ -981,6 +981,12 @@ static ssize_t handle_aiocb_write_zeroes(RawPosixAIOData *aiocb) > } > #endif > > +#ifdef CONFIG_FALLOCATE > + if (aiocb->aio_offset >= aiocb->bs->total_sectors << BDRV_SECTOR_BITS) { > + return do_fallocate(s->fd, 0, aiocb->aio_offset, aiocb->aio_nbytes); > + } > +#endif > + This seems fine though, but as I've asked in patch 5: Do we want to have a "has_fallocate"? Other than that, this is the first usage of bs->total_sectors in this file; raw_co_get_block_status() does a similar check, but it uses bdrv_getlength() instead. If bs->total_sectors is correct, bdrv_getlength() will actually do nothing but return bs->total_sectors * BDRV_SECTOR_SIZE; it will only do more (that is, update bs->total_sectors) if it is not correct to use bs->total_sectors (and I feel like it may not be correct because BlockDriver.has_variable_length is true). Max > s->has_write_zeroes = false; > return ret; > } ^ permalink raw reply [flat|nested] 32+ messages in thread
* Re: [Qemu-devel] [PATCH 6/7] block/raw-posix: call plain fallocate in handle_aiocb_write_zeroes 2015-01-27 17:57 ` Max Reitz @ 2015-01-27 18:19 ` Denis V. Lunev 2015-01-27 18:24 ` Max Reitz 0 siblings, 1 reply; 32+ messages in thread From: Denis V. Lunev @ 2015-01-27 18:19 UTC (permalink / raw) To: Max Reitz Cc: Kevin Wolf, Peter Lieven, Fam Zheng, qemu-devel, Stefan Hajnoczi On 27/01/15 20:57, Max Reitz wrote: > On 2015-01-27 at 08:51, Denis V. Lunev wrote: >> There is a possibility that we are extending our image and thus writing >> zeroes beyond the end of the file. In this case we do not need to care >> about the hole to make sure that there is no data in the file under >> this offset (pre-condition to fallocate(0) to work). We could simply >> call >> fallocate(0). >> >> This improves the performance of writing zeroes even on really old >> platforms which do not have even FALLOC_FL_PUNCH_HOLE. >> >> Signed-off-by: Denis V. Lunev <den@openvz.org> >> CC: Kevin Wolf <kwolf@redhat.com> >> CC: Stefan Hajnoczi <stefanha@redhat.com> >> CC: Peter Lieven <pl@kamp.de> >> CC: Fam Zheng <famz@redhat.com> >> --- >> block/raw-posix.c | 10 ++++++++-- >> 1 file changed, 8 insertions(+), 2 deletions(-) >> >> diff --git a/block/raw-posix.c b/block/raw-posix.c >> index c039bef..fa05239 100644 >> --- a/block/raw-posix.c >> +++ b/block/raw-posix.c >> @@ -60,7 +60,7 @@ >> #define FS_NOCOW_FL 0x00800000 /* Do not cow >> file */ >> #endif >> #endif >> -#if defined(CONFIG_FALLOCATE_PUNCH_HOLE) || >> defined(CONFIG_FALLOCATE_ZERO_RANGE) >> +#ifdef CONFIG_FALLOCATE > > This change doesn't seem right; CONFIG_FALLOCATE is set if > posix_fallocate() is available, not for the Linux-specific fallocate() > from linux/falloc.h. > here is a check for fallocate and posix_fallocate in configure script # check for fallocate fallocate=no cat > $TMPC << EOF #include <fcntl.h> int main(void) { fallocate(0, 0, 0, 0); return 0; } EOF if compile_prog "" "" ; then fallocate=yes fi ... # check for posix_fallocate posix_fallocate=no cat > $TMPC << EOF #include <fcntl.h> int main(void) { posix_fallocate(0, 0, 0); return 0; } EOF if compile_prog "" "" ; then posix_fallocate=yes fi ... if test "$fallocate" = "yes" ; then echo "CONFIG_FALLOCATE=y" >> $config_host_mak fi ... if test "$posix_fallocate" = "yes" ; then echo "CONFIG_POSIX_FALLOCATE=y" >> $config_host_mak fi Thus my check looks correct to me. >> #include <linux/falloc.h> >> #endif >> #if defined (__FreeBSD__) || defined(__FreeBSD_kernel__) >> @@ -902,7 +902,7 @@ static int translate_err(int err) >> return err; >> } >> -#if defined(CONFIG_FALLOCATE_PUNCH_HOLE) || >> defined(CONFIG_FALLOCATE_ZERO_RANGE) >> +#ifdef CONFIG_FALLOCATE > > Same here. > >> static int do_fallocate(int fd, int mode, off_t offset, off_t len) >> { >> do { >> @@ -981,6 +981,12 @@ static ssize_t >> handle_aiocb_write_zeroes(RawPosixAIOData *aiocb) >> } >> #endif >> +#ifdef CONFIG_FALLOCATE >> + if (aiocb->aio_offset >= aiocb->bs->total_sectors << >> BDRV_SECTOR_BITS) { >> + return do_fallocate(s->fd, 0, aiocb->aio_offset, >> aiocb->aio_nbytes); >> + } >> +#endif >> + > > This seems fine though, but as I've asked in patch 5: Do we want to > have a "has_fallocate"? > > Other than that, this is the first usage of bs->total_sectors in this > file; raw_co_get_block_status() does a similar check, but it uses > bdrv_getlength() instead. If bs->total_sectors is correct, > bdrv_getlength() will actually do nothing but return bs->total_sectors > * BDRV_SECTOR_SIZE; it will only do more (that is, update > bs->total_sectors) if it is not correct to use bs->total_sectors (and > I feel like it may not be correct because > BlockDriver.has_variable_length is true). > > Max > ok, will do ^ permalink raw reply [flat|nested] 32+ messages in thread
* Re: [Qemu-devel] [PATCH 6/7] block/raw-posix: call plain fallocate in handle_aiocb_write_zeroes 2015-01-27 18:19 ` Denis V. Lunev @ 2015-01-27 18:24 ` Max Reitz 2015-01-27 18:33 ` Denis V. Lunev 0 siblings, 1 reply; 32+ messages in thread From: Max Reitz @ 2015-01-27 18:24 UTC (permalink / raw) To: Denis V. Lunev Cc: Kevin Wolf, Peter Lieven, Fam Zheng, qemu-devel, Stefan Hajnoczi On 2015-01-27 at 13:19, Denis V. Lunev wrote: > On 27/01/15 20:57, Max Reitz wrote: >> On 2015-01-27 at 08:51, Denis V. Lunev wrote: >>> There is a possibility that we are extending our image and thus writing >>> zeroes beyond the end of the file. In this case we do not need to care >>> about the hole to make sure that there is no data in the file under >>> this offset (pre-condition to fallocate(0) to work). We could simply >>> call >>> fallocate(0). >>> >>> This improves the performance of writing zeroes even on really old >>> platforms which do not have even FALLOC_FL_PUNCH_HOLE. >>> >>> Signed-off-by: Denis V. Lunev <den@openvz.org> >>> CC: Kevin Wolf <kwolf@redhat.com> >>> CC: Stefan Hajnoczi <stefanha@redhat.com> >>> CC: Peter Lieven <pl@kamp.de> >>> CC: Fam Zheng <famz@redhat.com> >>> --- >>> block/raw-posix.c | 10 ++++++++-- >>> 1 file changed, 8 insertions(+), 2 deletions(-) >>> >>> diff --git a/block/raw-posix.c b/block/raw-posix.c >>> index c039bef..fa05239 100644 >>> --- a/block/raw-posix.c >>> +++ b/block/raw-posix.c >>> @@ -60,7 +60,7 @@ >>> #define FS_NOCOW_FL 0x00800000 /* Do not cow >>> file */ >>> #endif >>> #endif >>> -#if defined(CONFIG_FALLOCATE_PUNCH_HOLE) || >>> defined(CONFIG_FALLOCATE_ZERO_RANGE) >>> +#ifdef CONFIG_FALLOCATE >> >> This change doesn't seem right; CONFIG_FALLOCATE is set if >> posix_fallocate() is available, not for the Linux-specific >> fallocate() from linux/falloc.h. >> > > here is a check for fallocate and posix_fallocate in configure script > > # check for fallocate > fallocate=no > cat > $TMPC << EOF > #include <fcntl.h> > > int main(void) > { > fallocate(0, 0, 0, 0); > return 0; > } > EOF > if compile_prog "" "" ; then > fallocate=yes > fi > ... > # check for posix_fallocate > posix_fallocate=no > cat > $TMPC << EOF > #include <fcntl.h> > > int main(void) > { > posix_fallocate(0, 0, 0); > return 0; > } > EOF > if compile_prog "" "" ; then > posix_fallocate=yes > fi > ... > if test "$fallocate" = "yes" ; then > echo "CONFIG_FALLOCATE=y" >> $config_host_mak > fi > ... > if test "$posix_fallocate" = "yes" ; then > echo "CONFIG_POSIX_FALLOCATE=y" >> $config_host_mak > fi > > Thus my check looks correct to me. Oh, sorry, I somehow mixed those checks. You're right. Very well then; maybe you want to mention this change in the commit message, though? Max > >>> #include <linux/falloc.h> >>> #endif >>> #if defined (__FreeBSD__) || defined(__FreeBSD_kernel__) >>> @@ -902,7 +902,7 @@ static int translate_err(int err) >>> return err; >>> } >>> -#if defined(CONFIG_FALLOCATE_PUNCH_HOLE) || >>> defined(CONFIG_FALLOCATE_ZERO_RANGE) >>> +#ifdef CONFIG_FALLOCATE >> >> Same here. >> >>> static int do_fallocate(int fd, int mode, off_t offset, off_t len) >>> { >>> do { >>> @@ -981,6 +981,12 @@ static ssize_t >>> handle_aiocb_write_zeroes(RawPosixAIOData *aiocb) >>> } >>> #endif >>> +#ifdef CONFIG_FALLOCATE >>> + if (aiocb->aio_offset >= aiocb->bs->total_sectors << >>> BDRV_SECTOR_BITS) { >>> + return do_fallocate(s->fd, 0, aiocb->aio_offset, >>> aiocb->aio_nbytes); >>> + } >>> +#endif >>> + >> >> This seems fine though, but as I've asked in patch 5: Do we want to >> have a "has_fallocate"? >> >> Other than that, this is the first usage of bs->total_sectors in this >> file; raw_co_get_block_status() does a similar check, but it uses >> bdrv_getlength() instead. If bs->total_sectors is correct, >> bdrv_getlength() will actually do nothing but return >> bs->total_sectors * BDRV_SECTOR_SIZE; it will only do more (that is, >> update bs->total_sectors) if it is not correct to use >> bs->total_sectors (and I feel like it may not be correct because >> BlockDriver.has_variable_length is true). >> >> Max >> > ok, will do ^ permalink raw reply [flat|nested] 32+ messages in thread
* Re: [Qemu-devel] [PATCH 6/7] block/raw-posix: call plain fallocate in handle_aiocb_write_zeroes 2015-01-27 18:24 ` Max Reitz @ 2015-01-27 18:33 ` Denis V. Lunev 0 siblings, 0 replies; 32+ messages in thread From: Denis V. Lunev @ 2015-01-27 18:33 UTC (permalink / raw) To: Max Reitz Cc: Kevin Wolf, Peter Lieven, Fam Zheng, qemu-devel, Stefan Hajnoczi On 27/01/15 21:24, Max Reitz wrote: > On 2015-01-27 at 13:19, Denis V. Lunev wrote: >> On 27/01/15 20:57, Max Reitz wrote: >>> On 2015-01-27 at 08:51, Denis V. Lunev wrote: >>>> There is a possibility that we are extending our image and thus >>>> writing >>>> zeroes beyond the end of the file. In this case we do not need to care >>>> about the hole to make sure that there is no data in the file under >>>> this offset (pre-condition to fallocate(0) to work). We could >>>> simply call >>>> fallocate(0). >>>> >>>> This improves the performance of writing zeroes even on really old >>>> platforms which do not have even FALLOC_FL_PUNCH_HOLE. >>>> >>>> Signed-off-by: Denis V. Lunev <den@openvz.org> >>>> CC: Kevin Wolf <kwolf@redhat.com> >>>> CC: Stefan Hajnoczi <stefanha@redhat.com> >>>> CC: Peter Lieven <pl@kamp.de> >>>> CC: Fam Zheng <famz@redhat.com> >>>> --- >>>> block/raw-posix.c | 10 ++++++++-- >>>> 1 file changed, 8 insertions(+), 2 deletions(-) >>>> >>>> diff --git a/block/raw-posix.c b/block/raw-posix.c >>>> index c039bef..fa05239 100644 >>>> --- a/block/raw-posix.c >>>> +++ b/block/raw-posix.c >>>> @@ -60,7 +60,7 @@ >>>> #define FS_NOCOW_FL 0x00800000 /* Do not cow >>>> file */ >>>> #endif >>>> #endif >>>> -#if defined(CONFIG_FALLOCATE_PUNCH_HOLE) || >>>> defined(CONFIG_FALLOCATE_ZERO_RANGE) >>>> +#ifdef CONFIG_FALLOCATE >>> >>> This change doesn't seem right; CONFIG_FALLOCATE is set if >>> posix_fallocate() is available, not for the Linux-specific >>> fallocate() from linux/falloc.h. >>> >> >> here is a check for fallocate and posix_fallocate in configure script >> >> # check for fallocate >> fallocate=no >> cat > $TMPC << EOF >> #include <fcntl.h> >> >> int main(void) >> { >> fallocate(0, 0, 0, 0); >> return 0; >> } >> EOF >> if compile_prog "" "" ; then >> fallocate=yes >> fi >> ... >> # check for posix_fallocate >> posix_fallocate=no >> cat > $TMPC << EOF >> #include <fcntl.h> >> >> int main(void) >> { >> posix_fallocate(0, 0, 0); >> return 0; >> } >> EOF >> if compile_prog "" "" ; then >> posix_fallocate=yes >> fi >> ... >> if test "$fallocate" = "yes" ; then >> echo "CONFIG_FALLOCATE=y" >> $config_host_mak >> fi >> ... >> if test "$posix_fallocate" = "yes" ; then >> echo "CONFIG_POSIX_FALLOCATE=y" >> $config_host_mak >> fi >> >> Thus my check looks correct to me. > > Oh, sorry, I somehow mixed those checks. You're right. > > Very well then; maybe you want to mention this change in the commit > message, though? > > Max > no prob ^ permalink raw reply [flat|nested] 32+ messages in thread
* [Qemu-devel] [PATCH 7/7] block/raw-posix: set max_write_zeroes to INT_MAX for regular files 2015-01-27 13:51 [Qemu-devel] [PATCH v4 0/7] eliminate data write in bdrv_write_zeroes on Linux in raw-posix.c Denis V. Lunev ` (5 preceding siblings ...) 2015-01-27 13:51 ` [Qemu-devel] [PATCH 6/7] block/raw-posix: call plain fallocate in handle_aiocb_write_zeroes Denis V. Lunev @ 2015-01-27 13:51 ` Denis V. Lunev 2015-01-27 18:05 ` Max Reitz 6 siblings, 1 reply; 32+ messages in thread From: Denis V. Lunev @ 2015-01-27 13:51 UTC (permalink / raw) Cc: Kevin Wolf, Fam Zheng, Peter Lieven, qemu-devel, Stefan Hajnoczi, Denis V. Lunev fallocate() works fine and could handle properly with arbitrary size requests. There is no sense to reduce the amount of space to fallocate. The bigger is the size, the better is the performance as the amount of journal updates is reduced. Signed-off-by: Denis V. Lunev <den@openvz.org> CC: Kevin Wolf <kwolf@redhat.com> CC: Stefan Hajnoczi <stefanha@redhat.com> CC: Peter Lieven <pl@kamp.de> CC: Fam Zheng <famz@redhat.com> --- block/raw-posix.c | 17 +++++++++++++++++ 1 file changed, 17 insertions(+) diff --git a/block/raw-posix.c b/block/raw-posix.c index fa05239..e0b35c9 100644 --- a/block/raw-posix.c +++ b/block/raw-posix.c @@ -292,6 +292,20 @@ static void raw_probe_alignment(BlockDriverState *bs, int fd, Error **errp) } } +static void raw_probe_max_write_zeroes(BlockDriverState *bs) +{ + BDRVRawState *s = bs->opaque; + struct stat st; + + if (fstat(s->fd, &st) < 0) { + return; /* no problem, keep default value */ + } + if (!S_ISREG(st.st_mode) || !s->discard_zeroes) { + return; + } + bs->bl.max_write_zeroes = INT_MAX; +} + static void raw_parse_flags(int bdrv_flags, int *open_flags) { assert(open_flags != NULL); @@ -598,6 +612,7 @@ static int raw_reopen_prepare(BDRVReopenState *state, /* Fail already reopen_prepare() if we can't get a working O_DIRECT * alignment with the new fd. */ if (raw_s->fd != -1) { + raw_probe_max_write_zeroes(state->bs); raw_probe_alignment(state->bs, raw_s->fd, &local_err); if (local_err) { qemu_close(raw_s->fd); @@ -651,6 +666,8 @@ static void raw_refresh_limits(BlockDriverState *bs, Error **errp) raw_probe_alignment(bs, s->fd, errp); bs->bl.opt_mem_alignment = s->buf_align; + + raw_probe_max_write_zeroes(bs); } static ssize_t handle_aiocb_ioctl(RawPosixAIOData *aiocb) -- 1.9.1 ^ permalink raw reply related [flat|nested] 32+ messages in thread
* Re: [Qemu-devel] [PATCH 7/7] block/raw-posix: set max_write_zeroes to INT_MAX for regular files 2015-01-27 13:51 ` [Qemu-devel] [PATCH 7/7] block/raw-posix: set max_write_zeroes to INT_MAX for regular files Denis V. Lunev @ 2015-01-27 18:05 ` Max Reitz 2015-01-27 18:11 ` Denis V. Lunev 2015-01-28 6:39 ` Denis V. Lunev 0 siblings, 2 replies; 32+ messages in thread From: Max Reitz @ 2015-01-27 18:05 UTC (permalink / raw) To: Denis V. Lunev Cc: Kevin Wolf, Peter Lieven, Fam Zheng, qemu-devel, Stefan Hajnoczi On 2015-01-27 at 08:51, Denis V. Lunev wrote: > fallocate() works fine and could handle properly with arbitrary size > requests. Maybe "could properly handle arbitrary size requests" (or "...arbitrarily sized requests")? > There is no sense to reduce the amount of space to fallocate. > The bigger is the size, the better is the performance as the amount of > journal updates is reduced. True for fallocate(). But is it true for xfs_write_zeroes(), too? I guess so, but I don't know. If it does, the patch looks good to me. Max > Signed-off-by: Denis V. Lunev <den@openvz.org> > CC: Kevin Wolf <kwolf@redhat.com> > CC: Stefan Hajnoczi <stefanha@redhat.com> > CC: Peter Lieven <pl@kamp.de> > CC: Fam Zheng <famz@redhat.com> > --- > block/raw-posix.c | 17 +++++++++++++++++ > 1 file changed, 17 insertions(+) ^ permalink raw reply [flat|nested] 32+ messages in thread
* Re: [Qemu-devel] [PATCH 7/7] block/raw-posix: set max_write_zeroes to INT_MAX for regular files 2015-01-27 18:05 ` Max Reitz @ 2015-01-27 18:11 ` Denis V. Lunev 2015-01-28 6:39 ` Denis V. Lunev 1 sibling, 0 replies; 32+ messages in thread From: Denis V. Lunev @ 2015-01-27 18:11 UTC (permalink / raw) To: Max Reitz Cc: Kevin Wolf, Peter Lieven, Fam Zheng, qemu-devel, Stefan Hajnoczi On 27/01/15 21:05, Max Reitz wrote: > On 2015-01-27 at 08:51, Denis V. Lunev wrote: >> fallocate() works fine and could handle properly with arbitrary size >> requests. > > Maybe "could properly handle arbitrary size requests" (or > "...arbitrarily sized requests")? > >> There is no sense to reduce the amount of space to fallocate. >> The bigger is the size, the better is the performance as the amount of >> journal updates is reduced. > > True for fallocate(). But is it true for xfs_write_zeroes(), too? I > guess so, but I don't know. > > If it does, the patch looks good to me. > > Max > >> Signed-off-by: Denis V. Lunev <den@openvz.org> >> CC: Kevin Wolf <kwolf@redhat.com> >> CC: Stefan Hajnoczi <stefanha@redhat.com> >> CC: Peter Lieven <pl@kamp.de> >> CC: Fam Zheng <famz@redhat.com> >> --- >> block/raw-posix.c | 17 +++++++++++++++++ >> 1 file changed, 17 insertions(+) thank you very much for a review. I will proceed with these findings, they look quite reasonable. ^ permalink raw reply [flat|nested] 32+ messages in thread
* Re: [Qemu-devel] [PATCH 7/7] block/raw-posix: set max_write_zeroes to INT_MAX for regular files 2015-01-27 18:05 ` Max Reitz 2015-01-27 18:11 ` Denis V. Lunev @ 2015-01-28 6:39 ` Denis V. Lunev 1 sibling, 0 replies; 32+ messages in thread From: Denis V. Lunev @ 2015-01-28 6:39 UTC (permalink / raw) To: Max Reitz Cc: Kevin Wolf, Peter Lieven, Fam Zheng, qemu-devel, Stefan Hajnoczi On 27/01/15 21:05, Max Reitz wrote: > On 2015-01-27 at 08:51, Denis V. Lunev wrote: >> fallocate() works fine and could handle properly with arbitrary size >> requests. > > Maybe "could properly handle arbitrary size requests" (or > "...arbitrarily sized requests")? > >> There is no sense to reduce the amount of space to fallocate. >> The bigger is the size, the better is the performance as the amount of >> journal updates is reduced. > > True for fallocate(). But is it true for xfs_write_zeroes(), too? I > guess so, but I don't know. > > If it does, the patch looks good to me. > > Max > checked. xfs_ioc_space (ioctl handler) calls exactly the same xfs_zero_file_space as performed by xfs_file_fallocate on fallocate path. I will reflect this in the description Regards, Den ^ permalink raw reply [flat|nested] 32+ messages in thread
* [Qemu-devel] [PATCH v5 0/7] eliminate data write in bdrv_write_zeroes on Linux in raw-posix.c @ 2015-01-28 18:38 Denis V. Lunev 2015-01-28 18:38 ` [Qemu-devel] [PATCH 7/7] block/raw-posix: set max_write_zeroes to INT_MAX for regular files Denis V. Lunev 0 siblings, 1 reply; 32+ messages in thread From: Denis V. Lunev @ 2015-01-28 18:38 UTC (permalink / raw) Cc: Kevin Wolf, Fam Zheng, Peter Lieven, qemu-devel, Max Reitz, Stefan Hajnoczi, Denis V. Lunev These patches eliminate data writes completely on Linux if fallocate FALLOC_FL_ZERO_RANGE or FALLOC_FL_PUNCH_HOLE are supported on underlying filesystem. I have performed several tests with non-aligned fallocate calls and in all cases (with non-aligned fallocates) Linux performs fine, i.e. areas are zeroed correctly. Checks were made on Linux 3.16.0-28-generic #38-Ubuntu SMP This should seriously increase performance of bdrv_write_zeroes Changes from v4: - comments to patches are improved by Max Reitz suggestions - replaced 'return ret' with 'return -ENOTSUP' handle_aiocb_write_zeroes The idea is that we can reach that point only if original ret was equal to -ENOTSUP - added has_fallocate flag to indicate that fallocate is working for given BDS - checked file length with bdrv_getlength Changes from v3: - dropped original patch 1, equivalent stuff was merged already - reordered patches as suggested by Fam - fixes spelling errors as suggested by Fam - fixed not initialized value in handle_aiocb_write_zeroes - fixed wrong error processing from do_fallocate(FALLOC_FL_ZERO_RANGE) Changes from v2: - added Peter Lieven to CC - added CONFIG_FALLOCATE check to call do_fallocate in patch 7 - dropped patch 1 as NACK-ed - added processing of very large data areas in bdrv_co_write_zeroes (new patch 1) - set bl.max_write_zeroes to INT_MAX in raw-posix.c for regular files (new patch 8) Signed-off-by: Denis V. Lunev <den@openvz.org> CC: Max Reitz <mreitz@redhat.com> CC: Kevin Wolf <kwolf@redhat.com> CC: Stefan Hajnoczi <stefanha@redhat.com> CC: Peter Lieven <pl@kamp.de> CC: Fam Zheng <famz@redhat.com> ^ permalink raw reply [flat|nested] 32+ messages in thread
* [Qemu-devel] [PATCH 7/7] block/raw-posix: set max_write_zeroes to INT_MAX for regular files 2015-01-28 18:38 [Qemu-devel] [PATCH v5 0/7] eliminate data write in bdrv_write_zeroes on Linux in raw-posix.c Denis V. Lunev @ 2015-01-28 18:38 ` Denis V. Lunev 2015-01-29 22:51 ` Max Reitz 0 siblings, 1 reply; 32+ messages in thread From: Denis V. Lunev @ 2015-01-28 18:38 UTC (permalink / raw) Cc: Kevin Wolf, Fam Zheng, Peter Lieven, qemu-devel, Max Reitz, Stefan Hajnoczi, Denis V. Lunev fallocate() works fine and could handle properly with arbitrary size requests. There is no sense to reduce the amount of space to fallocate. The bigger is the size, the better is the performance as the amount of journal updates is reduced. The patch changes behavior for both generic filesystem and XFS codepaths, which are different in handle_aiocb_write_zeroes. The implementation of falocate and xfsctl(XFS_IOC_ZERO_RANGE) for XFS are exactly the same thus the change is fine for both ways. Signed-off-by: Denis V. Lunev <den@openvz.org> CC: Max Reitz <mreitz@redhat.com> CC: Kevin Wolf <kwolf@redhat.com> CC: Stefan Hajnoczi <stefanha@redhat.com> CC: Peter Lieven <pl@kamp.de> CC: Fam Zheng <famz@redhat.com> --- block/raw-posix.c | 17 +++++++++++++++++ 1 file changed, 17 insertions(+) diff --git a/block/raw-posix.c b/block/raw-posix.c index 3db911a..ec38fee 100644 --- a/block/raw-posix.c +++ b/block/raw-posix.c @@ -293,6 +293,20 @@ static void raw_probe_alignment(BlockDriverState *bs, int fd, Error **errp) } } +static void raw_probe_max_write_zeroes(BlockDriverState *bs) +{ + BDRVRawState *s = bs->opaque; + struct stat st; + + if (fstat(s->fd, &st) < 0) { + return; /* no problem, keep default value */ + } + if (!S_ISREG(st.st_mode) || !s->discard_zeroes) { + return; + } + bs->bl.max_write_zeroes = INT_MAX; +} + static void raw_parse_flags(int bdrv_flags, int *open_flags) { assert(open_flags != NULL); @@ -600,6 +614,7 @@ static int raw_reopen_prepare(BDRVReopenState *state, /* Fail already reopen_prepare() if we can't get a working O_DIRECT * alignment with the new fd. */ if (raw_s->fd != -1) { + raw_probe_max_write_zeroes(state->bs); raw_probe_alignment(state->bs, raw_s->fd, &local_err); if (local_err) { qemu_close(raw_s->fd); @@ -653,6 +668,8 @@ static void raw_refresh_limits(BlockDriverState *bs, Error **errp) raw_probe_alignment(bs, s->fd, errp); bs->bl.opt_mem_alignment = s->buf_align; + + raw_probe_max_write_zeroes(bs); } static ssize_t handle_aiocb_ioctl(RawPosixAIOData *aiocb) -- 1.9.1 ^ permalink raw reply related [flat|nested] 32+ messages in thread
* Re: [Qemu-devel] [PATCH 7/7] block/raw-posix: set max_write_zeroes to INT_MAX for regular files 2015-01-28 18:38 ` [Qemu-devel] [PATCH 7/7] block/raw-posix: set max_write_zeroes to INT_MAX for regular files Denis V. Lunev @ 2015-01-29 22:51 ` Max Reitz 0 siblings, 0 replies; 32+ messages in thread From: Max Reitz @ 2015-01-29 22:51 UTC (permalink / raw) To: Denis V. Lunev Cc: Kevin Wolf, Fam Zheng, Peter Lieven, qemu-devel, Stefan Hajnoczi On 2015-01-28 at 13:38, Denis V. Lunev wrote: > fallocate() works fine and could handle properly with arbitrary size > requests. There is no sense to reduce the amount of space to fallocate. > The bigger is the size, the better is the performance as the amount of > journal updates is reduced. > > The patch changes behavior for both generic filesystem and XFS codepaths, > which are different in handle_aiocb_write_zeroes. The implementation > of falocate and xfsctl(XFS_IOC_ZERO_RANGE) for XFS are exactly the same *fallocate With that fixed: Reviewed-by: Max Reitz <mreitz@redhat.com> > thus the change is fine for both ways. > > Signed-off-by: Denis V. Lunev <den@openvz.org> > CC: Max Reitz <mreitz@redhat.com> > CC: Kevin Wolf <kwolf@redhat.com> > CC: Stefan Hajnoczi <stefanha@redhat.com> > CC: Peter Lieven <pl@kamp.de> > CC: Fam Zheng <famz@redhat.com> > --- > block/raw-posix.c | 17 +++++++++++++++++ > 1 file changed, 17 insertions(+) > > diff --git a/block/raw-posix.c b/block/raw-posix.c > index 3db911a..ec38fee 100644 > --- a/block/raw-posix.c > +++ b/block/raw-posix.c > @@ -293,6 +293,20 @@ static void raw_probe_alignment(BlockDriverState *bs, int fd, Error **errp) > } > } > > +static void raw_probe_max_write_zeroes(BlockDriverState *bs) > +{ > + BDRVRawState *s = bs->opaque; > + struct stat st; > + > + if (fstat(s->fd, &st) < 0) { > + return; /* no problem, keep default value */ > + } > + if (!S_ISREG(st.st_mode) || !s->discard_zeroes) { > + return; > + } > + bs->bl.max_write_zeroes = INT_MAX; > +} > + > static void raw_parse_flags(int bdrv_flags, int *open_flags) > { > assert(open_flags != NULL); > @@ -600,6 +614,7 @@ static int raw_reopen_prepare(BDRVReopenState *state, > /* Fail already reopen_prepare() if we can't get a working O_DIRECT > * alignment with the new fd. */ > if (raw_s->fd != -1) { > + raw_probe_max_write_zeroes(state->bs); > raw_probe_alignment(state->bs, raw_s->fd, &local_err); > if (local_err) { > qemu_close(raw_s->fd); > @@ -653,6 +668,8 @@ static void raw_refresh_limits(BlockDriverState *bs, Error **errp) > > raw_probe_alignment(bs, s->fd, errp); > bs->bl.opt_mem_alignment = s->buf_align; > + > + raw_probe_max_write_zeroes(bs); > } > > static ssize_t handle_aiocb_ioctl(RawPosixAIOData *aiocb) ^ permalink raw reply [flat|nested] 32+ messages in thread
* [Qemu-devel] [PATCH v6 0/7] eliminate data write in bdrv_write_zeroes on Linux in raw-posix.c @ 2015-01-30 8:42 Denis V. Lunev 2015-01-30 8:42 ` [Qemu-devel] [PATCH 7/7] block/raw-posix: set max_write_zeroes to INT_MAX for regular files Denis V. Lunev 0 siblings, 1 reply; 32+ messages in thread From: Denis V. Lunev @ 2015-01-30 8:42 UTC (permalink / raw) Cc: Kevin Wolf, Fam Zheng, Peter Lieven, qemu-devel, Max Reitz, Stefan Hajnoczi, Denis V. Lunev I have performed several tests with non-aligned fallocate calls and in all cases (with non-aligned fallocates) Linux performs fine, i.e. areas are zeroed correctly. Checks were made on Linux 3.16.0-28-generic #38-Ubuntu SMP This should seriously increase performance of bdrv_write_zeroes Changes from v5: - changed order between patches 5/6 - check has_fallocate in FALLOC_FL_PUNCH_HOLE branch of the code - minor comment tweaks as pointed out by Max Reitz Changes from v4: - comments to patches are improved by Max Reitz suggestions - replaced 'return ret' with 'return -ENOTSUP' handle_aiocb_write_zeroes The idea is that we can reach that point only if original ret was equal to -ENOTSUP - added has_fallocate flag to indicate that fallocate is working for given BDS - checked file length with bdrv_getlength Changes from v3: - dropped original patch 1, equivalent stuff was merged already - reordered patches as suggested by Fam - fixes spelling errors as suggested by Fam - fixed not initialized value in handle_aiocb_write_zeroes - fixed wrong error processing from do_fallocate(FALLOC_FL_ZERO_RANGE) Changes from v2: - added Peter Lieven to CC - added CONFIG_FALLOCATE check to call do_fallocate in patch 7 - dropped patch 1 as NACK-ed - added processing of very large data areas in bdrv_co_write_zeroes (new patch 1) - set bl.max_write_zeroes to INT_MAX in raw-posix.c for regular files (new patch 8) Signed-off-by: Denis V. Lunev <den@openvz.org> CC: Max Reitz <mreitz@redhat.com> CC: Kevin Wolf <kwolf@redhat.com> CC: Stefan Hajnoczi <stefanha@redhat.com> CC: Peter Lieven <pl@kamp.de> CC: Fam Zheng <famz@redhat.com> ^ permalink raw reply [flat|nested] 32+ messages in thread
* [Qemu-devel] [PATCH 7/7] block/raw-posix: set max_write_zeroes to INT_MAX for regular files 2015-01-30 8:42 [Qemu-devel] [PATCH v6 0/7] eliminate data write in bdrv_write_zeroes on Linux in raw-posix.c Denis V. Lunev @ 2015-01-30 8:42 ` Denis V. Lunev 2015-02-02 13:23 ` Kevin Wolf 0 siblings, 1 reply; 32+ messages in thread From: Denis V. Lunev @ 2015-01-30 8:42 UTC (permalink / raw) Cc: Kevin Wolf, Fam Zheng, Peter Lieven, qemu-devel, Stefan Hajnoczi, Denis V. Lunev fallocate() works fine and could handle properly with arbitrary size requests. There is no sense to reduce the amount of space to fallocate. The bigger is the size, the better is the performance as the amount of journal updates is reduced. The patch changes behavior for both generic filesystem and XFS codepaths, which are different in handle_aiocb_write_zeroes. The implementation of fallocate and xfsctl(XFS_IOC_ZERO_RANGE) for XFS are exactly the same thus the change is fine for both ways. Signed-off-by: Denis V. Lunev <den@openvz.org> Reviewed-by: Max Reitz <mreitz@redhat.com> CC: Kevin Wolf <kwolf@redhat.com> CC: Stefan Hajnoczi <stefanha@redhat.com> CC: Peter Lieven <pl@kamp.de> CC: Fam Zheng <famz@redhat.com> --- block/raw-posix.c | 17 +++++++++++++++++ 1 file changed, 17 insertions(+) diff --git a/block/raw-posix.c b/block/raw-posix.c index 7b42f37..933c778 100644 --- a/block/raw-posix.c +++ b/block/raw-posix.c @@ -293,6 +293,20 @@ static void raw_probe_alignment(BlockDriverState *bs, int fd, Error **errp) } } +static void raw_probe_max_write_zeroes(BlockDriverState *bs) +{ + BDRVRawState *s = bs->opaque; + struct stat st; + + if (fstat(s->fd, &st) < 0) { + return; /* no problem, keep default value */ + } + if (!S_ISREG(st.st_mode) || !s->discard_zeroes) { + return; + } + bs->bl.max_write_zeroes = INT_MAX; +} + static void raw_parse_flags(int bdrv_flags, int *open_flags) { assert(open_flags != NULL); @@ -600,6 +614,7 @@ static int raw_reopen_prepare(BDRVReopenState *state, /* Fail already reopen_prepare() if we can't get a working O_DIRECT * alignment with the new fd. */ if (raw_s->fd != -1) { + raw_probe_max_write_zeroes(state->bs); raw_probe_alignment(state->bs, raw_s->fd, &local_err); if (local_err) { qemu_close(raw_s->fd); @@ -653,6 +668,8 @@ static void raw_refresh_limits(BlockDriverState *bs, Error **errp) raw_probe_alignment(bs, s->fd, errp); bs->bl.opt_mem_alignment = s->buf_align; + + raw_probe_max_write_zeroes(bs); } static ssize_t handle_aiocb_ioctl(RawPosixAIOData *aiocb) -- 1.9.1 ^ permalink raw reply related [flat|nested] 32+ messages in thread
* Re: [Qemu-devel] [PATCH 7/7] block/raw-posix: set max_write_zeroes to INT_MAX for regular files 2015-01-30 8:42 ` [Qemu-devel] [PATCH 7/7] block/raw-posix: set max_write_zeroes to INT_MAX for regular files Denis V. Lunev @ 2015-02-02 13:23 ` Kevin Wolf 2015-02-02 13:55 ` Peter Lieven 0 siblings, 1 reply; 32+ messages in thread From: Kevin Wolf @ 2015-02-02 13:23 UTC (permalink / raw) To: Denis V. Lunev; +Cc: Fam Zheng, Peter Lieven, qemu-devel, Stefan Hajnoczi Am 30.01.2015 um 09:42 hat Denis V. Lunev geschrieben: > fallocate() works fine and could handle properly with arbitrary size > requests. There is no sense to reduce the amount of space to fallocate. > The bigger is the size, the better is the performance as the amount of > journal updates is reduced. > > The patch changes behavior for both generic filesystem and XFS codepaths, > which are different in handle_aiocb_write_zeroes. The implementation > of fallocate and xfsctl(XFS_IOC_ZERO_RANGE) for XFS are exactly the same > thus the change is fine for both ways. > > Signed-off-by: Denis V. Lunev <den@openvz.org> > Reviewed-by: Max Reitz <mreitz@redhat.com> > CC: Kevin Wolf <kwolf@redhat.com> > CC: Stefan Hajnoczi <stefanha@redhat.com> > CC: Peter Lieven <pl@kamp.de> > CC: Fam Zheng <famz@redhat.com> > --- > block/raw-posix.c | 17 +++++++++++++++++ > 1 file changed, 17 insertions(+) > > diff --git a/block/raw-posix.c b/block/raw-posix.c > index 7b42f37..933c778 100644 > --- a/block/raw-posix.c > +++ b/block/raw-posix.c > @@ -293,6 +293,20 @@ static void raw_probe_alignment(BlockDriverState *bs, int fd, Error **errp) > } > } > > +static void raw_probe_max_write_zeroes(BlockDriverState *bs) > +{ > + BDRVRawState *s = bs->opaque; > + struct stat st; > + > + if (fstat(s->fd, &st) < 0) { > + return; /* no problem, keep default value */ > + } > + if (!S_ISREG(st.st_mode) || !s->discard_zeroes) { > + return; > + } > + bs->bl.max_write_zeroes = INT_MAX; > +} Peter, do you remember why INT_MAX isn't actually the default? I think the most reasonable behaviour would be that a limitation is only used if a block driver requests it, and otherwise unlimited is assumed. We can take this patch to raw-posix, it is certainly not wrong. But any format driver or filter will still, in most cases needlessly, apply MAX_WRITE_ZEROES_DEFAULT, i.e. a 16 MB maximum, so I think we should consider making a change to the default. Kevin ^ permalink raw reply [flat|nested] 32+ messages in thread
* Re: [Qemu-devel] [PATCH 7/7] block/raw-posix: set max_write_zeroes to INT_MAX for regular files 2015-02-02 13:23 ` Kevin Wolf @ 2015-02-02 13:55 ` Peter Lieven 2015-02-02 14:04 ` Kevin Wolf 0 siblings, 1 reply; 32+ messages in thread From: Peter Lieven @ 2015-02-02 13:55 UTC (permalink / raw) To: Kevin Wolf, Denis V. Lunev; +Cc: Fam Zheng, qemu-devel, Stefan Hajnoczi Am 02.02.2015 um 14:23 schrieb Kevin Wolf: > Am 30.01.2015 um 09:42 hat Denis V. Lunev geschrieben: >> fallocate() works fine and could handle properly with arbitrary size >> requests. There is no sense to reduce the amount of space to fallocate. >> The bigger is the size, the better is the performance as the amount of >> journal updates is reduced. >> >> The patch changes behavior for both generic filesystem and XFS codepaths, >> which are different in handle_aiocb_write_zeroes. The implementation >> of fallocate and xfsctl(XFS_IOC_ZERO_RANGE) for XFS are exactly the same >> thus the change is fine for both ways. >> >> Signed-off-by: Denis V. Lunev <den@openvz.org> >> Reviewed-by: Max Reitz <mreitz@redhat.com> >> CC: Kevin Wolf <kwolf@redhat.com> >> CC: Stefan Hajnoczi <stefanha@redhat.com> >> CC: Peter Lieven <pl@kamp.de> >> CC: Fam Zheng <famz@redhat.com> >> --- >> block/raw-posix.c | 17 +++++++++++++++++ >> 1 file changed, 17 insertions(+) >> >> diff --git a/block/raw-posix.c b/block/raw-posix.c >> index 7b42f37..933c778 100644 >> --- a/block/raw-posix.c >> +++ b/block/raw-posix.c >> @@ -293,6 +293,20 @@ static void raw_probe_alignment(BlockDriverState *bs, int fd, Error **errp) >> } >> } >> >> +static void raw_probe_max_write_zeroes(BlockDriverState *bs) >> +{ >> + BDRVRawState *s = bs->opaque; >> + struct stat st; >> + >> + if (fstat(s->fd, &st) < 0) { >> + return; /* no problem, keep default value */ >> + } >> + if (!S_ISREG(st.st_mode) || !s->discard_zeroes) { >> + return; >> + } >> + bs->bl.max_write_zeroes = INT_MAX; >> +} > Peter, do you remember why INT_MAX isn't actually the default? I think > the most reasonable behaviour would be that a limitation is only used if > a block driver requests it, and otherwise unlimited is assumed. The default (0) actually means unlimited or undefined. We introduced that limit of 16MB in bdrv_co_write_zeroes to create only reasonable sized requests because there is no guarantee that write zeroes is a fast operation. We should set INT_MAX only if we know that write zeroes of an arbitrary size is always fast. Peter ^ permalink raw reply [flat|nested] 32+ messages in thread
* Re: [Qemu-devel] [PATCH 7/7] block/raw-posix: set max_write_zeroes to INT_MAX for regular files 2015-02-02 13:55 ` Peter Lieven @ 2015-02-02 14:04 ` Kevin Wolf 2015-02-02 14:12 ` Peter Lieven 0 siblings, 1 reply; 32+ messages in thread From: Kevin Wolf @ 2015-02-02 14:04 UTC (permalink / raw) To: Peter Lieven; +Cc: Denis V. Lunev, Fam Zheng, qemu-devel, Stefan Hajnoczi Am 02.02.2015 um 14:55 hat Peter Lieven geschrieben: > Am 02.02.2015 um 14:23 schrieb Kevin Wolf: > >Am 30.01.2015 um 09:42 hat Denis V. Lunev geschrieben: > >>fallocate() works fine and could handle properly with arbitrary size > >>requests. There is no sense to reduce the amount of space to fallocate. > >>The bigger is the size, the better is the performance as the amount of > >>journal updates is reduced. > >> > >>The patch changes behavior for both generic filesystem and XFS codepaths, > >>which are different in handle_aiocb_write_zeroes. The implementation > >>of fallocate and xfsctl(XFS_IOC_ZERO_RANGE) for XFS are exactly the same > >>thus the change is fine for both ways. > >> > >>Signed-off-by: Denis V. Lunev <den@openvz.org> > >>Reviewed-by: Max Reitz <mreitz@redhat.com> > >>CC: Kevin Wolf <kwolf@redhat.com> > >>CC: Stefan Hajnoczi <stefanha@redhat.com> > >>CC: Peter Lieven <pl@kamp.de> > >>CC: Fam Zheng <famz@redhat.com> > >>--- > >> block/raw-posix.c | 17 +++++++++++++++++ > >> 1 file changed, 17 insertions(+) > >> > >>diff --git a/block/raw-posix.c b/block/raw-posix.c > >>index 7b42f37..933c778 100644 > >>--- a/block/raw-posix.c > >>+++ b/block/raw-posix.c > >>@@ -293,6 +293,20 @@ static void raw_probe_alignment(BlockDriverState *bs, int fd, Error **errp) > >> } > >> } > >>+static void raw_probe_max_write_zeroes(BlockDriverState *bs) > >>+{ > >>+ BDRVRawState *s = bs->opaque; > >>+ struct stat st; > >>+ > >>+ if (fstat(s->fd, &st) < 0) { > >>+ return; /* no problem, keep default value */ > >>+ } > >>+ if (!S_ISREG(st.st_mode) || !s->discard_zeroes) { > >>+ return; > >>+ } > >>+ bs->bl.max_write_zeroes = INT_MAX; > >>+} > >Peter, do you remember why INT_MAX isn't actually the default? I think > >the most reasonable behaviour would be that a limitation is only used if > >a block driver requests it, and otherwise unlimited is assumed. > > The default (0) actually means unlimited or undefined. We introduced > that limit of 16MB in bdrv_co_write_zeroes to create only reasonable > sized requests because there is no guarantee that write zeroes is a > fast operation. We should set INT_MAX only if we know that write > zeroes of an arbitrary size is always fast. Well, splitting it up doesn't make it any faster. I think we can assume that drv->bdrv_co_write_zeroes() wants to know the full request size unless the driver has explicitly set bs->bl.max_write_zeroes. Only if we go on emulating the operation with a zero-filled buffer, I understand that we might need to split it up so that our bounce buffer doesn't become huge. Kevin ^ permalink raw reply [flat|nested] 32+ messages in thread
* Re: [Qemu-devel] [PATCH 7/7] block/raw-posix: set max_write_zeroes to INT_MAX for regular files 2015-02-02 14:04 ` Kevin Wolf @ 2015-02-02 14:12 ` Peter Lieven 2015-02-02 14:16 ` Kevin Wolf 0 siblings, 1 reply; 32+ messages in thread From: Peter Lieven @ 2015-02-02 14:12 UTC (permalink / raw) To: Kevin Wolf; +Cc: Denis V. Lunev, Fam Zheng, qemu-devel, Stefan Hajnoczi Am 02.02.2015 um 15:04 schrieb Kevin Wolf: > Am 02.02.2015 um 14:55 hat Peter Lieven geschrieben: >> Am 02.02.2015 um 14:23 schrieb Kevin Wolf: >>> Am 30.01.2015 um 09:42 hat Denis V. Lunev geschrieben: >>>> fallocate() works fine and could handle properly with arbitrary size >>>> requests. There is no sense to reduce the amount of space to fallocate. >>>> The bigger is the size, the better is the performance as the amount of >>>> journal updates is reduced. >>>> >>>> The patch changes behavior for both generic filesystem and XFS codepaths, >>>> which are different in handle_aiocb_write_zeroes. The implementation >>>> of fallocate and xfsctl(XFS_IOC_ZERO_RANGE) for XFS are exactly the same >>>> thus the change is fine for both ways. >>>> >>>> Signed-off-by: Denis V. Lunev <den@openvz.org> >>>> Reviewed-by: Max Reitz <mreitz@redhat.com> >>>> CC: Kevin Wolf <kwolf@redhat.com> >>>> CC: Stefan Hajnoczi <stefanha@redhat.com> >>>> CC: Peter Lieven <pl@kamp.de> >>>> CC: Fam Zheng <famz@redhat.com> >>>> --- >>>> block/raw-posix.c | 17 +++++++++++++++++ >>>> 1 file changed, 17 insertions(+) >>>> >>>> diff --git a/block/raw-posix.c b/block/raw-posix.c >>>> index 7b42f37..933c778 100644 >>>> --- a/block/raw-posix.c >>>> +++ b/block/raw-posix.c >>>> @@ -293,6 +293,20 @@ static void raw_probe_alignment(BlockDriverState *bs, int fd, Error **errp) >>>> } >>>> } >>>> +static void raw_probe_max_write_zeroes(BlockDriverState *bs) >>>> +{ >>>> + BDRVRawState *s = bs->opaque; >>>> + struct stat st; >>>> + >>>> + if (fstat(s->fd, &st) < 0) { >>>> + return; /* no problem, keep default value */ >>>> + } >>>> + if (!S_ISREG(st.st_mode) || !s->discard_zeroes) { >>>> + return; >>>> + } >>>> + bs->bl.max_write_zeroes = INT_MAX; >>>> +} >>> Peter, do you remember why INT_MAX isn't actually the default? I think >>> the most reasonable behaviour would be that a limitation is only used if >>> a block driver requests it, and otherwise unlimited is assumed. >> The default (0) actually means unlimited or undefined. We introduced >> that limit of 16MB in bdrv_co_write_zeroes to create only reasonable >> sized requests because there is no guarantee that write zeroes is a >> fast operation. We should set INT_MAX only if we know that write >> zeroes of an arbitrary size is always fast. > Well, splitting it up doesn't make it any faster. I think we can assume > that drv->bdrv_co_write_zeroes() wants to know the full request size > unless the driver has explicitly set bs->bl.max_write_zeroes. You mean sth like this: diff --git a/block.c b/block.c index 61412e9..8272ef9 100644 --- a/block.c +++ b/block.c @@ -3192,10 +3192,7 @@ int coroutine_fn bdrv_co_copy_on_readv(BlockDriverState *bs, BDRV_REQ_COPY_ON_READ); } -/* if no limit is specified in the BlockLimits use a default - * of 32768 512-byte sectors (16 MiB) per request. - */ -#define MAX_WRITE_ZEROES_DEFAULT 32768 +#define MAX_WRITE_ZEROES_BOUNCE_BUFFER 32768 static int coroutine_fn bdrv_co_do_write_zeroes(BlockDriverState *bs, int64_t sector_num, int nb_sectors, BdrvRequestFlags flags) @@ -3206,7 +3203,7 @@ static int coroutine_fn bdrv_co_do_write_zeroes(BlockDriverState *bs, int ret = 0; int max_write_zeroes = bs->bl.max_write_zeroes ? - bs->bl.max_write_zeroes : MAX_WRITE_ZEROES_DEFAULT; + bs->bl.max_write_zeroes : INT_MAX; while (nb_sectors > 0 && !ret) { int num = nb_sectors; @@ -3242,7 +3239,7 @@ static int coroutine_fn bdrv_co_do_write_zeroes(BlockDriverState *bs, if (ret == -ENOTSUP) { /* Fall back to bounce buffer if write zeroes is unsupported */ int max_xfer_len = MIN_NON_ZERO(bs->bl.max_transfer_length, - MAX_WRITE_ZEROES_DEFAULT); + MAX_WRITE_ZEROES_BOUNCE_BUFFER); num = MIN(num, max_xfer_len); iov.iov_len = num * BDRV_SECTOR_SIZE; if (iov.iov_base == NULL) { @@ -5099,11 +5096,6 @@ static void coroutine_fn bdrv_discard_co_entry(void *opaque) rwco->ret = bdrv_co_discard(rwco->bs, rwco->sector_num, rwco->nb_sectors); } -/* if no limit is specified in the BlockLimits use a default - * of 32768 512-byte sectors (16 MiB) per request. - */ -#define MAX_DISCARD_DEFAULT 32768 - int coroutine_fn bdrv_co_discard(BlockDriverState *bs, int64_t sector_num, int nb_sectors) { @@ -5128,7 +5120,7 @@ int coroutine_fn bdrv_co_discard(BlockDriverState *bs, int64_t sector_num, return 0; } - max_discard = bs->bl.max_discard ? bs->bl.max_discard : MAX_DISCARD_DEFAULT; + max_discard = bs->bl.max_discard ? bs->bl.max_discard : INT_MAX; while (nb_sectors > 0) { int ret; int num = nb_sectors; Peter ^ permalink raw reply related [flat|nested] 32+ messages in thread
* Re: [Qemu-devel] [PATCH 7/7] block/raw-posix: set max_write_zeroes to INT_MAX for regular files 2015-02-02 14:12 ` Peter Lieven @ 2015-02-02 14:16 ` Kevin Wolf 2015-02-02 14:20 ` Peter Lieven 0 siblings, 1 reply; 32+ messages in thread From: Kevin Wolf @ 2015-02-02 14:16 UTC (permalink / raw) To: Peter Lieven; +Cc: Denis V. Lunev, Fam Zheng, qemu-devel, Stefan Hajnoczi Am 02.02.2015 um 15:12 hat Peter Lieven geschrieben: > Am 02.02.2015 um 15:04 schrieb Kevin Wolf: > >Am 02.02.2015 um 14:55 hat Peter Lieven geschrieben: > >>Am 02.02.2015 um 14:23 schrieb Kevin Wolf: > >>>Am 30.01.2015 um 09:42 hat Denis V. Lunev geschrieben: > >>>>fallocate() works fine and could handle properly with arbitrary size > >>>>requests. There is no sense to reduce the amount of space to fallocate. > >>>>The bigger is the size, the better is the performance as the amount of > >>>>journal updates is reduced. > >>>> > >>>>The patch changes behavior for both generic filesystem and XFS codepaths, > >>>>which are different in handle_aiocb_write_zeroes. The implementation > >>>>of fallocate and xfsctl(XFS_IOC_ZERO_RANGE) for XFS are exactly the same > >>>>thus the change is fine for both ways. > >>>> > >>>>Signed-off-by: Denis V. Lunev <den@openvz.org> > >>>>Reviewed-by: Max Reitz <mreitz@redhat.com> > >>>>CC: Kevin Wolf <kwolf@redhat.com> > >>>>CC: Stefan Hajnoczi <stefanha@redhat.com> > >>>>CC: Peter Lieven <pl@kamp.de> > >>>>CC: Fam Zheng <famz@redhat.com> > >>>>--- > >>>> block/raw-posix.c | 17 +++++++++++++++++ > >>>> 1 file changed, 17 insertions(+) > >>>> > >>>>diff --git a/block/raw-posix.c b/block/raw-posix.c > >>>>index 7b42f37..933c778 100644 > >>>>--- a/block/raw-posix.c > >>>>+++ b/block/raw-posix.c > >>>>@@ -293,6 +293,20 @@ static void raw_probe_alignment(BlockDriverState *bs, int fd, Error **errp) > >>>> } > >>>> } > >>>>+static void raw_probe_max_write_zeroes(BlockDriverState *bs) > >>>>+{ > >>>>+ BDRVRawState *s = bs->opaque; > >>>>+ struct stat st; > >>>>+ > >>>>+ if (fstat(s->fd, &st) < 0) { > >>>>+ return; /* no problem, keep default value */ > >>>>+ } > >>>>+ if (!S_ISREG(st.st_mode) || !s->discard_zeroes) { > >>>>+ return; > >>>>+ } > >>>>+ bs->bl.max_write_zeroes = INT_MAX; > >>>>+} > >>>Peter, do you remember why INT_MAX isn't actually the default? I think > >>>the most reasonable behaviour would be that a limitation is only used if > >>>a block driver requests it, and otherwise unlimited is assumed. > >>The default (0) actually means unlimited or undefined. We introduced > >>that limit of 16MB in bdrv_co_write_zeroes to create only reasonable > >>sized requests because there is no guarantee that write zeroes is a > >>fast operation. We should set INT_MAX only if we know that write > >>zeroes of an arbitrary size is always fast. > >Well, splitting it up doesn't make it any faster. I think we can assume > >that drv->bdrv_co_write_zeroes() wants to know the full request size > >unless the driver has explicitly set bs->bl.max_write_zeroes. > > You mean sth like this: Yes, I think that's what I meant. Kevin > diff --git a/block.c b/block.c > index 61412e9..8272ef9 100644 > --- a/block.c > +++ b/block.c > @@ -3192,10 +3192,7 @@ int coroutine_fn bdrv_co_copy_on_readv(BlockDriverState *bs, > BDRV_REQ_COPY_ON_READ); > } > > -/* if no limit is specified in the BlockLimits use a default > - * of 32768 512-byte sectors (16 MiB) per request. > - */ > -#define MAX_WRITE_ZEROES_DEFAULT 32768 > +#define MAX_WRITE_ZEROES_BOUNCE_BUFFER 32768 > > static int coroutine_fn bdrv_co_do_write_zeroes(BlockDriverState *bs, > int64_t sector_num, int nb_sectors, BdrvRequestFlags flags) > @@ -3206,7 +3203,7 @@ static int coroutine_fn bdrv_co_do_write_zeroes(BlockDriverState *bs, > int ret = 0; > > int max_write_zeroes = bs->bl.max_write_zeroes ? > - bs->bl.max_write_zeroes : MAX_WRITE_ZEROES_DEFAULT; > + bs->bl.max_write_zeroes : INT_MAX; > > while (nb_sectors > 0 && !ret) { > int num = nb_sectors; > @@ -3242,7 +3239,7 @@ static int coroutine_fn bdrv_co_do_write_zeroes(BlockDriverState *bs, > if (ret == -ENOTSUP) { > /* Fall back to bounce buffer if write zeroes is unsupported */ > int max_xfer_len = MIN_NON_ZERO(bs->bl.max_transfer_length, > - MAX_WRITE_ZEROES_DEFAULT); > + MAX_WRITE_ZEROES_BOUNCE_BUFFER); > num = MIN(num, max_xfer_len); > iov.iov_len = num * BDRV_SECTOR_SIZE; > if (iov.iov_base == NULL) { > @@ -5099,11 +5096,6 @@ static void coroutine_fn bdrv_discard_co_entry(void *opaque) > rwco->ret = bdrv_co_discard(rwco->bs, rwco->sector_num, rwco->nb_sectors); > } > > -/* if no limit is specified in the BlockLimits use a default > - * of 32768 512-byte sectors (16 MiB) per request. > - */ > -#define MAX_DISCARD_DEFAULT 32768 > - > int coroutine_fn bdrv_co_discard(BlockDriverState *bs, int64_t sector_num, > int nb_sectors) > { > @@ -5128,7 +5120,7 @@ int coroutine_fn bdrv_co_discard(BlockDriverState *bs, int64_t sector_num, > return 0; > } > > - max_discard = bs->bl.max_discard ? bs->bl.max_discard : MAX_DISCARD_DEFAULT; > + max_discard = bs->bl.max_discard ? bs->bl.max_discard : INT_MAX; > while (nb_sectors > 0) { > int ret; > int num = nb_sectors; > > > > Peter ^ permalink raw reply [flat|nested] 32+ messages in thread
* Re: [Qemu-devel] [PATCH 7/7] block/raw-posix: set max_write_zeroes to INT_MAX for regular files 2015-02-02 14:16 ` Kevin Wolf @ 2015-02-02 14:20 ` Peter Lieven 2015-02-02 14:38 ` Denis V. Lunev 2015-02-02 14:49 ` Kevin Wolf 0 siblings, 2 replies; 32+ messages in thread From: Peter Lieven @ 2015-02-02 14:20 UTC (permalink / raw) To: Kevin Wolf; +Cc: Denis V. Lunev, Fam Zheng, qemu-devel, Stefan Hajnoczi Am 02.02.2015 um 15:16 schrieb Kevin Wolf: > Am 02.02.2015 um 15:12 hat Peter Lieven geschrieben: >> Am 02.02.2015 um 15:04 schrieb Kevin Wolf: >>> Am 02.02.2015 um 14:55 hat Peter Lieven geschrieben: >>>> Am 02.02.2015 um 14:23 schrieb Kevin Wolf: >>>>> Am 30.01.2015 um 09:42 hat Denis V. Lunev geschrieben: >>>>>> fallocate() works fine and could handle properly with arbitrary size >>>>>> requests. There is no sense to reduce the amount of space to fallocate. >>>>>> The bigger is the size, the better is the performance as the amount of >>>>>> journal updates is reduced. >>>>>> >>>>>> The patch changes behavior for both generic filesystem and XFS codepaths, >>>>>> which are different in handle_aiocb_write_zeroes. The implementation >>>>>> of fallocate and xfsctl(XFS_IOC_ZERO_RANGE) for XFS are exactly the same >>>>>> thus the change is fine for both ways. >>>>>> >>>>>> Signed-off-by: Denis V. Lunev <den@openvz.org> >>>>>> Reviewed-by: Max Reitz <mreitz@redhat.com> >>>>>> CC: Kevin Wolf <kwolf@redhat.com> >>>>>> CC: Stefan Hajnoczi <stefanha@redhat.com> >>>>>> CC: Peter Lieven <pl@kamp.de> >>>>>> CC: Fam Zheng <famz@redhat.com> >>>>>> --- >>>>>> block/raw-posix.c | 17 +++++++++++++++++ >>>>>> 1 file changed, 17 insertions(+) >>>>>> >>>>>> diff --git a/block/raw-posix.c b/block/raw-posix.c >>>>>> index 7b42f37..933c778 100644 >>>>>> --- a/block/raw-posix.c >>>>>> +++ b/block/raw-posix.c >>>>>> @@ -293,6 +293,20 @@ static void raw_probe_alignment(BlockDriverState *bs, int fd, Error **errp) >>>>>> } >>>>>> } >>>>>> +static void raw_probe_max_write_zeroes(BlockDriverState *bs) >>>>>> +{ >>>>>> + BDRVRawState *s = bs->opaque; >>>>>> + struct stat st; >>>>>> + >>>>>> + if (fstat(s->fd, &st) < 0) { >>>>>> + return; /* no problem, keep default value */ >>>>>> + } >>>>>> + if (!S_ISREG(st.st_mode) || !s->discard_zeroes) { >>>>>> + return; >>>>>> + } >>>>>> + bs->bl.max_write_zeroes = INT_MAX; >>>>>> +} >>>>> Peter, do you remember why INT_MAX isn't actually the default? I think >>>>> the most reasonable behaviour would be that a limitation is only used if >>>>> a block driver requests it, and otherwise unlimited is assumed. >>>> The default (0) actually means unlimited or undefined. We introduced >>>> that limit of 16MB in bdrv_co_write_zeroes to create only reasonable >>>> sized requests because there is no guarantee that write zeroes is a >>>> fast operation. We should set INT_MAX only if we know that write >>>> zeroes of an arbitrary size is always fast. >>> Well, splitting it up doesn't make it any faster. I think we can assume >>> that drv->bdrv_co_write_zeroes() wants to know the full request size >>> unless the driver has explicitly set bs->bl.max_write_zeroes. >> You mean sth like this: > Yes, I think that's what I meant. I can't find the original discussion why we added this limit. It was actually the default before we introduced BlockLimits. And, it was also the default in the unsupported path of write zeroes which created big memory allocations. This might be the reason why we introduced a limit. Peter > > Kevin > >> diff --git a/block.c b/block.c >> index 61412e9..8272ef9 100644 >> --- a/block.c >> +++ b/block.c >> @@ -3192,10 +3192,7 @@ int coroutine_fn bdrv_co_copy_on_readv(BlockDriverState *bs, >> BDRV_REQ_COPY_ON_READ); >> } >> >> -/* if no limit is specified in the BlockLimits use a default >> - * of 32768 512-byte sectors (16 MiB) per request. >> - */ >> -#define MAX_WRITE_ZEROES_DEFAULT 32768 >> +#define MAX_WRITE_ZEROES_BOUNCE_BUFFER 32768 >> >> static int coroutine_fn bdrv_co_do_write_zeroes(BlockDriverState *bs, >> int64_t sector_num, int nb_sectors, BdrvRequestFlags flags) >> @@ -3206,7 +3203,7 @@ static int coroutine_fn bdrv_co_do_write_zeroes(BlockDriverState *bs, >> int ret = 0; >> >> int max_write_zeroes = bs->bl.max_write_zeroes ? >> - bs->bl.max_write_zeroes : MAX_WRITE_ZEROES_DEFAULT; >> + bs->bl.max_write_zeroes : INT_MAX; >> >> while (nb_sectors > 0 && !ret) { >> int num = nb_sectors; >> @@ -3242,7 +3239,7 @@ static int coroutine_fn bdrv_co_do_write_zeroes(BlockDriverState *bs, >> if (ret == -ENOTSUP) { >> /* Fall back to bounce buffer if write zeroes is unsupported */ >> int max_xfer_len = MIN_NON_ZERO(bs->bl.max_transfer_length, >> - MAX_WRITE_ZEROES_DEFAULT); >> + MAX_WRITE_ZEROES_BOUNCE_BUFFER); >> num = MIN(num, max_xfer_len); >> iov.iov_len = num * BDRV_SECTOR_SIZE; >> if (iov.iov_base == NULL) { >> @@ -5099,11 +5096,6 @@ static void coroutine_fn bdrv_discard_co_entry(void *opaque) >> rwco->ret = bdrv_co_discard(rwco->bs, rwco->sector_num, rwco->nb_sectors); >> } >> >> -/* if no limit is specified in the BlockLimits use a default >> - * of 32768 512-byte sectors (16 MiB) per request. >> - */ >> -#define MAX_DISCARD_DEFAULT 32768 >> - >> int coroutine_fn bdrv_co_discard(BlockDriverState *bs, int64_t sector_num, >> int nb_sectors) >> { >> @@ -5128,7 +5120,7 @@ int coroutine_fn bdrv_co_discard(BlockDriverState *bs, int64_t sector_num, >> return 0; >> } >> >> - max_discard = bs->bl.max_discard ? bs->bl.max_discard : MAX_DISCARD_DEFAULT; >> + max_discard = bs->bl.max_discard ? bs->bl.max_discard : INT_MAX; >> while (nb_sectors > 0) { >> int ret; >> int num = nb_sectors; >> >> >> >> Peter ^ permalink raw reply [flat|nested] 32+ messages in thread
* Re: [Qemu-devel] [PATCH 7/7] block/raw-posix: set max_write_zeroes to INT_MAX for regular files 2015-02-02 14:20 ` Peter Lieven @ 2015-02-02 14:38 ` Denis V. Lunev 2015-02-02 14:49 ` Kevin Wolf 1 sibling, 0 replies; 32+ messages in thread From: Denis V. Lunev @ 2015-02-02 14:38 UTC (permalink / raw) To: Peter Lieven, Kevin Wolf; +Cc: Fam Zheng, qemu-devel, Stefan Hajnoczi On 02/02/15 17:20, Peter Lieven wrote: > Am 02.02.2015 um 15:16 schrieb Kevin Wolf: >> Am 02.02.2015 um 15:12 hat Peter Lieven geschrieben: >>> Am 02.02.2015 um 15:04 schrieb Kevin Wolf: >>>> Am 02.02.2015 um 14:55 hat Peter Lieven geschrieben: >>>>> Am 02.02.2015 um 14:23 schrieb Kevin Wolf: >>>>>> Am 30.01.2015 um 09:42 hat Denis V. Lunev geschrieben: >>>>>>> fallocate() works fine and could handle properly with arbitrary >>>>>>> size >>>>>>> requests. There is no sense to reduce the amount of space to >>>>>>> fallocate. >>>>>>> The bigger is the size, the better is the performance as the >>>>>>> amount of >>>>>>> journal updates is reduced. >>>>>>> >>>>>>> The patch changes behavior for both generic filesystem and XFS >>>>>>> codepaths, >>>>>>> which are different in handle_aiocb_write_zeroes. The >>>>>>> implementation >>>>>>> of fallocate and xfsctl(XFS_IOC_ZERO_RANGE) for XFS are exactly >>>>>>> the same >>>>>>> thus the change is fine for both ways. >>>>>>> >>>>>>> Signed-off-by: Denis V. Lunev <den@openvz.org> >>>>>>> Reviewed-by: Max Reitz <mreitz@redhat.com> >>>>>>> CC: Kevin Wolf <kwolf@redhat.com> >>>>>>> CC: Stefan Hajnoczi <stefanha@redhat.com> >>>>>>> CC: Peter Lieven <pl@kamp.de> >>>>>>> CC: Fam Zheng <famz@redhat.com> >>>>>>> --- >>>>>>> block/raw-posix.c | 17 +++++++++++++++++ >>>>>>> 1 file changed, 17 insertions(+) >>>>>>> >>>>>>> diff --git a/block/raw-posix.c b/block/raw-posix.c >>>>>>> index 7b42f37..933c778 100644 >>>>>>> --- a/block/raw-posix.c >>>>>>> +++ b/block/raw-posix.c >>>>>>> @@ -293,6 +293,20 @@ static void >>>>>>> raw_probe_alignment(BlockDriverState *bs, int fd, Error **errp) >>>>>>> } >>>>>>> } >>>>>>> +static void raw_probe_max_write_zeroes(BlockDriverState *bs) >>>>>>> +{ >>>>>>> + BDRVRawState *s = bs->opaque; >>>>>>> + struct stat st; >>>>>>> + >>>>>>> + if (fstat(s->fd, &st) < 0) { >>>>>>> + return; /* no problem, keep default value */ >>>>>>> + } >>>>>>> + if (!S_ISREG(st.st_mode) || !s->discard_zeroes) { >>>>>>> + return; >>>>>>> + } >>>>>>> + bs->bl.max_write_zeroes = INT_MAX; >>>>>>> +} >>>>>> Peter, do you remember why INT_MAX isn't actually the default? I >>>>>> think >>>>>> the most reasonable behaviour would be that a limitation is only >>>>>> used if >>>>>> a block driver requests it, and otherwise unlimited is assumed. >>>>> The default (0) actually means unlimited or undefined. We introduced >>>>> that limit of 16MB in bdrv_co_write_zeroes to create only reasonable >>>>> sized requests because there is no guarantee that write zeroes is a >>>>> fast operation. We should set INT_MAX only if we know that write >>>>> zeroes of an arbitrary size is always fast. >>>> Well, splitting it up doesn't make it any faster. I think we can >>>> assume >>>> that drv->bdrv_co_write_zeroes() wants to know the full request size >>>> unless the driver has explicitly set bs->bl.max_write_zeroes. >>> You mean sth like this: >> Yes, I think that's what I meant. > > I can't find the original discussion why we added this limit. It was > actually the default > before we introduced BlockLimits. And, it was also the default in the > unsupported path > of write zeroes which created big memory allocations. This might be > the reason why > we introduced a limit. > > Peter > my $0.02 here is that even if the patch below adds regression (though I can not imagine how at the moment after some checking), we should fix bogus driver. Personally I do not like such unnatural limitations. Den >> >> Kevin >> >>> diff --git a/block.c b/block.c >>> index 61412e9..8272ef9 100644 >>> --- a/block.c >>> +++ b/block.c >>> @@ -3192,10 +3192,7 @@ int coroutine_fn >>> bdrv_co_copy_on_readv(BlockDriverState *bs, >>> BDRV_REQ_COPY_ON_READ); >>> } >>> >>> -/* if no limit is specified in the BlockLimits use a default >>> - * of 32768 512-byte sectors (16 MiB) per request. >>> - */ >>> -#define MAX_WRITE_ZEROES_DEFAULT 32768 >>> +#define MAX_WRITE_ZEROES_BOUNCE_BUFFER 32768 >>> >>> static int coroutine_fn bdrv_co_do_write_zeroes(BlockDriverState *bs, >>> int64_t sector_num, int nb_sectors, BdrvRequestFlags flags) >>> @@ -3206,7 +3203,7 @@ static int coroutine_fn >>> bdrv_co_do_write_zeroes(BlockDriverState *bs, >>> int ret = 0; >>> >>> int max_write_zeroes = bs->bl.max_write_zeroes ? >>> - bs->bl.max_write_zeroes : >>> MAX_WRITE_ZEROES_DEFAULT; >>> + bs->bl.max_write_zeroes : INT_MAX; >>> >>> while (nb_sectors > 0 && !ret) { >>> int num = nb_sectors; >>> @@ -3242,7 +3239,7 @@ static int coroutine_fn >>> bdrv_co_do_write_zeroes(BlockDriverState *bs, >>> if (ret == -ENOTSUP) { >>> /* Fall back to bounce buffer if write zeroes is >>> unsupported */ >>> int max_xfer_len = >>> MIN_NON_ZERO(bs->bl.max_transfer_length, >>> - MAX_WRITE_ZEROES_DEFAULT); >>> + MAX_WRITE_ZEROES_BOUNCE_BUFFER); >>> num = MIN(num, max_xfer_len); >>> iov.iov_len = num * BDRV_SECTOR_SIZE; >>> if (iov.iov_base == NULL) { >>> @@ -5099,11 +5096,6 @@ static void coroutine_fn >>> bdrv_discard_co_entry(void *opaque) >>> rwco->ret = bdrv_co_discard(rwco->bs, rwco->sector_num, >>> rwco->nb_sectors); >>> } >>> >>> -/* if no limit is specified in the BlockLimits use a default >>> - * of 32768 512-byte sectors (16 MiB) per request. >>> - */ >>> -#define MAX_DISCARD_DEFAULT 32768 >>> - >>> int coroutine_fn bdrv_co_discard(BlockDriverState *bs, int64_t >>> sector_num, >>> int nb_sectors) >>> { >>> @@ -5128,7 +5120,7 @@ int coroutine_fn >>> bdrv_co_discard(BlockDriverState *bs, int64_t sector_num, >>> return 0; >>> } >>> >>> - max_discard = bs->bl.max_discard ? bs->bl.max_discard : >>> MAX_DISCARD_DEFAULT; >>> + max_discard = bs->bl.max_discard ? bs->bl.max_discard : INT_MAX; >>> while (nb_sectors > 0) { >>> int ret; >>> int num = nb_sectors; >>> >>> >>> >>> Peter > > ^ permalink raw reply [flat|nested] 32+ messages in thread
* Re: [Qemu-devel] [PATCH 7/7] block/raw-posix: set max_write_zeroes to INT_MAX for regular files 2015-02-02 14:20 ` Peter Lieven 2015-02-02 14:38 ` Denis V. Lunev @ 2015-02-02 14:49 ` Kevin Wolf 2015-02-02 15:30 ` Denis V. Lunev 1 sibling, 1 reply; 32+ messages in thread From: Kevin Wolf @ 2015-02-02 14:49 UTC (permalink / raw) To: Peter Lieven; +Cc: Denis V. Lunev, Fam Zheng, qemu-devel, Stefan Hajnoczi Am 02.02.2015 um 15:20 hat Peter Lieven geschrieben: > Am 02.02.2015 um 15:16 schrieb Kevin Wolf: > >Am 02.02.2015 um 15:12 hat Peter Lieven geschrieben: > >>Am 02.02.2015 um 15:04 schrieb Kevin Wolf: > >>>Am 02.02.2015 um 14:55 hat Peter Lieven geschrieben: > >>>>Am 02.02.2015 um 14:23 schrieb Kevin Wolf: > >>>>>Am 30.01.2015 um 09:42 hat Denis V. Lunev geschrieben: > >>>>>>fallocate() works fine and could handle properly with arbitrary size > >>>>>>requests. There is no sense to reduce the amount of space to fallocate. > >>>>>>The bigger is the size, the better is the performance as the amount of > >>>>>>journal updates is reduced. > >>>>>> > >>>>>>The patch changes behavior for both generic filesystem and XFS codepaths, > >>>>>>which are different in handle_aiocb_write_zeroes. The implementation > >>>>>>of fallocate and xfsctl(XFS_IOC_ZERO_RANGE) for XFS are exactly the same > >>>>>>thus the change is fine for both ways. > >>>>>> > >>>>>>Signed-off-by: Denis V. Lunev <den@openvz.org> > >>>>>>Reviewed-by: Max Reitz <mreitz@redhat.com> > >>>>>>CC: Kevin Wolf <kwolf@redhat.com> > >>>>>>CC: Stefan Hajnoczi <stefanha@redhat.com> > >>>>>>CC: Peter Lieven <pl@kamp.de> > >>>>>>CC: Fam Zheng <famz@redhat.com> > >>>>>>--- > >>>>>> block/raw-posix.c | 17 +++++++++++++++++ > >>>>>> 1 file changed, 17 insertions(+) > >>>>>> > >>>>>>diff --git a/block/raw-posix.c b/block/raw-posix.c > >>>>>>index 7b42f37..933c778 100644 > >>>>>>--- a/block/raw-posix.c > >>>>>>+++ b/block/raw-posix.c > >>>>>>@@ -293,6 +293,20 @@ static void raw_probe_alignment(BlockDriverState *bs, int fd, Error **errp) > >>>>>> } > >>>>>> } > >>>>>>+static void raw_probe_max_write_zeroes(BlockDriverState *bs) > >>>>>>+{ > >>>>>>+ BDRVRawState *s = bs->opaque; > >>>>>>+ struct stat st; > >>>>>>+ > >>>>>>+ if (fstat(s->fd, &st) < 0) { > >>>>>>+ return; /* no problem, keep default value */ > >>>>>>+ } > >>>>>>+ if (!S_ISREG(st.st_mode) || !s->discard_zeroes) { > >>>>>>+ return; > >>>>>>+ } > >>>>>>+ bs->bl.max_write_zeroes = INT_MAX; > >>>>>>+} > >>>>>Peter, do you remember why INT_MAX isn't actually the default? I think > >>>>>the most reasonable behaviour would be that a limitation is only used if > >>>>>a block driver requests it, and otherwise unlimited is assumed. > >>>>The default (0) actually means unlimited or undefined. We introduced > >>>>that limit of 16MB in bdrv_co_write_zeroes to create only reasonable > >>>>sized requests because there is no guarantee that write zeroes is a > >>>>fast operation. We should set INT_MAX only if we know that write > >>>>zeroes of an arbitrary size is always fast. > >>>Well, splitting it up doesn't make it any faster. I think we can assume > >>>that drv->bdrv_co_write_zeroes() wants to know the full request size > >>>unless the driver has explicitly set bs->bl.max_write_zeroes. > >>You mean sth like this: > >Yes, I think that's what I meant. > > I can't find the original discussion why we added this limit. It was actually the default > before we introduced BlockLimits. And, it was also the default in the unsupported path > of write zeroes which created big memory allocations. This might be the reason why > we introduced a limit. Commit c31cb707 added the limit to bdrv_co_do_write_zeroes(). Before, we used a bounce buffer of unbounded size. Anyway, it seems that none of us can think of a reason not to apply the patch to block.c. Let's just do it, and if it does break something, we'll figure it out. Can you send it as a proper patch? Denis, if we apply that patch, would you be okay with dropping 7/7 from this series, or would still something be missing? Kevin > >>diff --git a/block.c b/block.c > >>index 61412e9..8272ef9 100644 > >>--- a/block.c > >>+++ b/block.c > >>@@ -3192,10 +3192,7 @@ int coroutine_fn bdrv_co_copy_on_readv(BlockDriverState *bs, > >> BDRV_REQ_COPY_ON_READ); > >> } > >> > >>-/* if no limit is specified in the BlockLimits use a default > >>- * of 32768 512-byte sectors (16 MiB) per request. > >>- */ > >>-#define MAX_WRITE_ZEROES_DEFAULT 32768 > >>+#define MAX_WRITE_ZEROES_BOUNCE_BUFFER 32768 > >> > >> static int coroutine_fn bdrv_co_do_write_zeroes(BlockDriverState *bs, > >> int64_t sector_num, int nb_sectors, BdrvRequestFlags flags) > >>@@ -3206,7 +3203,7 @@ static int coroutine_fn bdrv_co_do_write_zeroes(BlockDriverState *bs, > >> int ret = 0; > >> > >> int max_write_zeroes = bs->bl.max_write_zeroes ? > >>- bs->bl.max_write_zeroes : MAX_WRITE_ZEROES_DEFAULT; > >>+ bs->bl.max_write_zeroes : INT_MAX; > >> > >> while (nb_sectors > 0 && !ret) { > >> int num = nb_sectors; > >>@@ -3242,7 +3239,7 @@ static int coroutine_fn bdrv_co_do_write_zeroes(BlockDriverState *bs, > >> if (ret == -ENOTSUP) { > >> /* Fall back to bounce buffer if write zeroes is unsupported */ > >> int max_xfer_len = MIN_NON_ZERO(bs->bl.max_transfer_length, > >>- MAX_WRITE_ZEROES_DEFAULT); > >>+ MAX_WRITE_ZEROES_BOUNCE_BUFFER); > >> num = MIN(num, max_xfer_len); > >> iov.iov_len = num * BDRV_SECTOR_SIZE; > >> if (iov.iov_base == NULL) { > >>@@ -5099,11 +5096,6 @@ static void coroutine_fn bdrv_discard_co_entry(void *opaque) > >> rwco->ret = bdrv_co_discard(rwco->bs, rwco->sector_num, rwco->nb_sectors); > >> } > >> > >>-/* if no limit is specified in the BlockLimits use a default > >>- * of 32768 512-byte sectors (16 MiB) per request. > >>- */ > >>-#define MAX_DISCARD_DEFAULT 32768 > >>- > >> int coroutine_fn bdrv_co_discard(BlockDriverState *bs, int64_t sector_num, > >> int nb_sectors) > >> { > >>@@ -5128,7 +5120,7 @@ int coroutine_fn bdrv_co_discard(BlockDriverState *bs, int64_t sector_num, > >> return 0; > >> } > >> > >>- max_discard = bs->bl.max_discard ? bs->bl.max_discard : MAX_DISCARD_DEFAULT; > >>+ max_discard = bs->bl.max_discard ? bs->bl.max_discard : INT_MAX; > >> while (nb_sectors > 0) { > >> int ret; > >> int num = nb_sectors; > >> > >> > >> > >>Peter > > ^ permalink raw reply [flat|nested] 32+ messages in thread
* Re: [Qemu-devel] [PATCH 7/7] block/raw-posix: set max_write_zeroes to INT_MAX for regular files 2015-02-02 14:49 ` Kevin Wolf @ 2015-02-02 15:30 ` Denis V. Lunev 0 siblings, 0 replies; 32+ messages in thread From: Denis V. Lunev @ 2015-02-02 15:30 UTC (permalink / raw) To: Kevin Wolf, Peter Lieven; +Cc: Fam Zheng, qemu-devel, Stefan Hajnoczi On 02/02/15 17:49, Kevin Wolf wrote: > Am 02.02.2015 um 15:20 hat Peter Lieven geschrieben: >> Am 02.02.2015 um 15:16 schrieb Kevin Wolf: >>> Am 02.02.2015 um 15:12 hat Peter Lieven geschrieben: >>>> Am 02.02.2015 um 15:04 schrieb Kevin Wolf: >>>>> Am 02.02.2015 um 14:55 hat Peter Lieven geschrieben: >>>>>> Am 02.02.2015 um 14:23 schrieb Kevin Wolf: >>>>>>> Am 30.01.2015 um 09:42 hat Denis V. Lunev geschrieben: >>>>>>>> fallocate() works fine and could handle properly with arbitrary size >>>>>>>> requests. There is no sense to reduce the amount of space to fallocate. >>>>>>>> The bigger is the size, the better is the performance as the amount of >>>>>>>> journal updates is reduced. >>>>>>>> >>>>>>>> The patch changes behavior for both generic filesystem and XFS codepaths, >>>>>>>> which are different in handle_aiocb_write_zeroes. The implementation >>>>>>>> of fallocate and xfsctl(XFS_IOC_ZERO_RANGE) for XFS are exactly the same >>>>>>>> thus the change is fine for both ways. >>>>>>>> >>>>>>>> Signed-off-by: Denis V. Lunev <den@openvz.org> >>>>>>>> Reviewed-by: Max Reitz <mreitz@redhat.com> >>>>>>>> CC: Kevin Wolf <kwolf@redhat.com> >>>>>>>> CC: Stefan Hajnoczi <stefanha@redhat.com> >>>>>>>> CC: Peter Lieven <pl@kamp.de> >>>>>>>> CC: Fam Zheng <famz@redhat.com> >>>>>>>> --- >>>>>>>> block/raw-posix.c | 17 +++++++++++++++++ >>>>>>>> 1 file changed, 17 insertions(+) >>>>>>>> >>>>>>>> diff --git a/block/raw-posix.c b/block/raw-posix.c >>>>>>>> index 7b42f37..933c778 100644 >>>>>>>> --- a/block/raw-posix.c >>>>>>>> +++ b/block/raw-posix.c >>>>>>>> @@ -293,6 +293,20 @@ static void raw_probe_alignment(BlockDriverState *bs, int fd, Error **errp) >>>>>>>> } >>>>>>>> } >>>>>>>> +static void raw_probe_max_write_zeroes(BlockDriverState *bs) >>>>>>>> +{ >>>>>>>> + BDRVRawState *s = bs->opaque; >>>>>>>> + struct stat st; >>>>>>>> + >>>>>>>> + if (fstat(s->fd, &st) < 0) { >>>>>>>> + return; /* no problem, keep default value */ >>>>>>>> + } >>>>>>>> + if (!S_ISREG(st.st_mode) || !s->discard_zeroes) { >>>>>>>> + return; >>>>>>>> + } >>>>>>>> + bs->bl.max_write_zeroes = INT_MAX; >>>>>>>> +} >>>>>>> Peter, do you remember why INT_MAX isn't actually the default? I think >>>>>>> the most reasonable behaviour would be that a limitation is only used if >>>>>>> a block driver requests it, and otherwise unlimited is assumed. >>>>>> The default (0) actually means unlimited or undefined. We introduced >>>>>> that limit of 16MB in bdrv_co_write_zeroes to create only reasonable >>>>>> sized requests because there is no guarantee that write zeroes is a >>>>>> fast operation. We should set INT_MAX only if we know that write >>>>>> zeroes of an arbitrary size is always fast. >>>>> Well, splitting it up doesn't make it any faster. I think we can assume >>>>> that drv->bdrv_co_write_zeroes() wants to know the full request size >>>>> unless the driver has explicitly set bs->bl.max_write_zeroes. >>>> You mean sth like this: >>> Yes, I think that's what I meant. >> I can't find the original discussion why we added this limit. It was actually the default >> before we introduced BlockLimits. And, it was also the default in the unsupported path >> of write zeroes which created big memory allocations. This might be the reason why >> we introduced a limit. > Commit c31cb707 added the limit to bdrv_co_do_write_zeroes(). Before, we > used a bounce buffer of unbounded size. > > Anyway, it seems that none of us can think of a reason not to apply the > patch to block.c. Let's just do it, and if it does break something, > we'll figure it out. Can you send it as a proper patch? > > Denis, if we apply that patch, would you be okay with dropping 7/7 from > this series, or would still something be missing? > > Kevin Sure. This will be even better. Something similar was implemented in v1/v2 of the patchset. Regards, Den ^ permalink raw reply [flat|nested] 32+ messages in thread
end of thread, other threads:[~2015-02-02 15:31 UTC | newest] Thread overview: 32+ messages (download: mbox.gz / follow: Atom feed) -- links below jump to the message on this page -- 2015-01-27 13:51 [Qemu-devel] [PATCH v4 0/7] eliminate data write in bdrv_write_zeroes on Linux in raw-posix.c Denis V. Lunev 2015-01-27 13:51 ` [Qemu-devel] [PATCH 1/7] block/raw-posix: create translate_err helper to merge errno values Denis V. Lunev 2015-01-27 16:50 ` Max Reitz 2015-01-27 13:51 ` [Qemu-devel] [PATCH 2/7] block/raw-posix: create do_fallocate helper Denis V. Lunev 2015-01-27 16:57 ` Max Reitz 2015-01-27 13:51 ` [Qemu-devel] [PATCH 3/7] block/raw-posix: refactor handle_aiocb_write_zeroes a bit Denis V. Lunev 2015-01-27 17:13 ` Max Reitz 2015-01-27 13:51 ` [Qemu-devel] [PATCH 4/7] block: use fallocate(FALLOC_FL_ZERO_RANGE) in handle_aiocb_write_zeroes Denis V. Lunev 2015-01-27 17:30 ` Max Reitz 2015-01-27 13:51 ` [Qemu-devel] [PATCH 5/7] block: use fallocate(FALLOC_FL_PUNCH_HOLE) & fallocate(0) to write zeroes Denis V. Lunev 2015-01-27 17:48 ` Max Reitz 2015-01-27 13:51 ` [Qemu-devel] [PATCH 6/7] block/raw-posix: call plain fallocate in handle_aiocb_write_zeroes Denis V. Lunev 2015-01-27 17:57 ` Max Reitz 2015-01-27 18:19 ` Denis V. Lunev 2015-01-27 18:24 ` Max Reitz 2015-01-27 18:33 ` Denis V. Lunev 2015-01-27 13:51 ` [Qemu-devel] [PATCH 7/7] block/raw-posix: set max_write_zeroes to INT_MAX for regular files Denis V. Lunev 2015-01-27 18:05 ` Max Reitz 2015-01-27 18:11 ` Denis V. Lunev 2015-01-28 6:39 ` Denis V. Lunev 2015-01-28 18:38 [Qemu-devel] [PATCH v5 0/7] eliminate data write in bdrv_write_zeroes on Linux in raw-posix.c Denis V. Lunev 2015-01-28 18:38 ` [Qemu-devel] [PATCH 7/7] block/raw-posix: set max_write_zeroes to INT_MAX for regular files Denis V. Lunev 2015-01-29 22:51 ` Max Reitz 2015-01-30 8:42 [Qemu-devel] [PATCH v6 0/7] eliminate data write in bdrv_write_zeroes on Linux in raw-posix.c Denis V. Lunev 2015-01-30 8:42 ` [Qemu-devel] [PATCH 7/7] block/raw-posix: set max_write_zeroes to INT_MAX for regular files Denis V. Lunev 2015-02-02 13:23 ` Kevin Wolf 2015-02-02 13:55 ` Peter Lieven 2015-02-02 14:04 ` Kevin Wolf 2015-02-02 14:12 ` Peter Lieven 2015-02-02 14:16 ` Kevin Wolf 2015-02-02 14:20 ` Peter Lieven 2015-02-02 14:38 ` Denis V. Lunev 2015-02-02 14:49 ` Kevin Wolf 2015-02-02 15:30 ` Denis V. Lunev
This is an external index of several public inboxes, see mirroring instructions on how to clone and mirror all data and code used by this external index.