From mboxrd@z Thu Jan  1 00:00:00 1970
Date: Mon, 24 Feb 2014 11:11:52 +0100
From: Kevin Wolf <kwolf@redhat.com>
Message-ID: <20140224101152.GE3775@dhcp-200-207.str.redhat.com>
References: <1393074022-32388-1-git-send-email-pl@kamp.de>
MIME-Version: 1.0
Content-Type: text/plain; charset=us-ascii
Content-Disposition: inline
In-Reply-To: <1393074022-32388-1-git-send-email-pl@kamp.de>
Subject: Re: [Qemu-devel] [RFC PATCH] block: optimize zero writes with
	bdrv_write_zeroes
List-Id: <qemu-devel.nongnu.org>
To: Peter Lieven <pl@kamp.de>
Cc: pbonzini@redhat.com, qemu-devel@nongnu.org, stefanha@redhat.com, mreitz@redhat.com

On 22.02.2014 at 14:00, Peter Lieven wrote:
> this patch tries to optimize zero write requests
> by automatically using bdrv_write_zeroes if it is
> supported by the format.
> 
> i know that there is a lot of potential for discussion, but i would
> like to know what the others think.
> 
> this should significantly speed up file system initialization and
> should speed zero write test used to test backend storage performance.
> 
> the difference can simply be tested by e.g.
> 
> dd if=/dev/zero of=/dev/vdX bs=1M
> 
> Signed-off-by: Peter Lieven <pl@kamp.de>

As you probably expected, I can't accept the patch in this form. At a
minimum, you need to introduce a boolean option to enable or disable
the zero check. (It would probably default to disabled, but we can
discuss this.)
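
To illustrate the kind of gating I mean, here is a rough standalone
sketch. None of these names exist in the tree: detect_zeroes,
zero_detect_opts, iov_is_all_zero and want_zero_write are made up for
the example, and it uses a plain struct iovec rather than QEMUIOVector
so that it compiles on its own. The point is only that the buffer scan
is skipped entirely when the option is off.

```c
#include <stdbool.h>
#include <stddef.h>
#include <sys/uio.h>

/* Hypothetical helper: true if every byte in the vector is zero. */
static bool iov_is_all_zero(const struct iovec *iov, int niov)
{
    for (int i = 0; i < niov; i++) {
        const unsigned char *p = iov[i].iov_base;
        for (size_t j = 0; j < iov[i].iov_len; j++) {
            if (p[j] != 0) {
                return false;
            }
        }
    }
    return true;
}

/* Hypothetical per-device setting; stays false unless the user opts in. */
struct zero_detect_opts {
    bool detect_zeroes;
};

/* Decide whether a write should be turned into a zero write. */
static bool want_zero_write(const struct zero_detect_opts *opts,
                            bool driver_has_write_zeroes,
                            const struct iovec *iov, int niov)
{
    /* Cheap checks come first, so the scan only runs when the option
     * is enabled and the driver can actually use the result. */
    return opts->detect_zeroes && driver_has_write_zeroes &&
           iov_is_all_zero(iov, niov);
}
```

With the option disabled, the write path pays only for a single flag
test, which is why defaulting to disabled seems like the safe choice.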

>  block.c               |    8 ++++++++
>  include/qemu-common.h |    1 +
>  util/iov.c            |   20 ++++++++++++++++++++
>  3 files changed, 29 insertions(+)
> 
> diff --git a/block.c b/block.c
> index 6f4baca..505888e 100644
> --- a/block.c
> +++ b/block.c
> @@ -3145,6 +3145,14 @@ static int coroutine_fn bdrv_aligned_pwritev(BlockDriverState *bs,
>  
>      ret = notifier_with_return_list_notify(&bs->before_write_notifiers, req);
>  
> +    if (!ret && !(flags & BDRV_REQ_ZERO_WRITE) &&
> +        drv->bdrv_co_write_zeroes && qemu_iovec_is_zero(qiov)) {
> +        flags |= BDRV_REQ_ZERO_WRITE;
> +        /* if the device was not opened with discard=on the below flag
> +         * is immediately cleared again in bdrv_co_do_write_zeroes */
> +        flags |= BDRV_REQ_MAY_UNMAP;

I'm not sure about this one. I think it is reasonable to expect that
after an explicit write of a buffer filled with zeros the block is
allocated.

In a simple qcow2-on-file case, we basically have three options for
handling all-zero writes:

- Allocate the cluster on both the qcow2 and the file level and write
  literal zeros to it. No metadata updates are involved in the next
  write to the cluster.

- Set the qcow2 zero flag, but leave the allocation in place. In theory,
  the next write just needs to remove the zero flag from the L2 table
  and that's it (though I think in practice we're doing an unnecessary
  COW).

- Set the qcow2 zero flag and unmap the cluster on both the qcow2 and
  the filesystem layer. The next write causes new allocations in both
  layers, which means multiple metadata updates and possibly added
  fragmentation. The upside is that we use less disk space if there is
  no next write to this cluster.

I think it's pretty clear that the right behaviour depends on your use
case and we can't find a one-size-fits-all solution.
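
One way to express the three options above is a per-device policy that
decides which request flags get set. This is only a sketch: the enum
zero_write_policy and apply_zero_write_policy() are invented for
illustration, the flag constants are local stand-ins with made-up
values, and whether each flag combination really produces the
described behaviour in every driver would need to be checked.

```c
/* Stand-ins for BDRV_REQ_ZERO_WRITE and BDRV_REQ_MAY_UNMAP; the numeric
 * values are made up for this sketch and are not QEMU's. */
enum {
    REQ_ZERO_WRITE = 1 << 0,
    REQ_MAY_UNMAP  = 1 << 1,
};

/* Hypothetical policy, one value per option in the list above. */
enum zero_write_policy {
    ZERO_WRITE_LITERAL,     /* allocate and write literal zeros */
    ZERO_WRITE_KEEP_ALLOC,  /* set the zero flag, keep the allocation */
    ZERO_WRITE_UNMAP,       /* set the zero flag and unmap the cluster */
};

/* Return the request flags to use for an all-zero write. */
static int apply_zero_write_policy(enum zero_write_policy policy, int flags)
{
    switch (policy) {
    case ZERO_WRITE_KEEP_ALLOC:
        return flags | REQ_ZERO_WRITE;
    case ZERO_WRITE_UNMAP:
        return flags | REQ_ZERO_WRITE | REQ_MAY_UNMAP;
    case ZERO_WRITE_LITERAL:
    default:
        /* Leave the request as an ordinary data write. */
        return flags;
    }
}
```

A boolean option would only cover two of these cases, so if we want
all three to be selectable, something enum-like is probably needed.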

> +    }
> +
>      if (ret < 0) {
>          /* Do nothing, write notifier decided to fail this request */
>      } else if (flags & BDRV_REQ_ZERO_WRITE) {

Kevin