qemu-devel.nongnu.org archive mirror
From: Ilya Dryomov <idryomov@gmail.com>
To: Peter Lieven <pl@kamp.de>
Cc: "Kevin Wolf" <kwolf@redhat.com>,
	"Daniel P. Berrangé" <berrange@redhat.com>,
	qemu-block@nongnu.org, ct@flyingcircus.io, qemu-devel@nongnu.org,
	"Paolo Bonzini" <pbonzini@redhat.com>,
	mreitz@redhat.com, "Jason Dillaman" <dillaman@redhat.com>
Subject: Re: [PATCH V4 5/6] block/rbd: add write zeroes support
Date: Fri, 2 Jul 2021 14:24:27 +0200
Message-ID: <CAOi1vP8pkgyquGggTMLKN3RirmFxQMxSe2PVa_JjJKMQddt-wA@mail.gmail.com>
In-Reply-To: <20210702090935.15300-6-pl@kamp.de>

On Fri, Jul 2, 2021 at 11:09 AM Peter Lieven <pl@kamp.de> wrote:
>
> This patch deliberately sets BDRV_REQ_NO_FALLBACK and silently ignores
> BDRV_REQ_MAY_UNMAP for older librbd versions.
>
> The rationale for this is as follows (citing Ilya Dryomov, current RBD maintainer):
> ---8<---
> a) remove the BDRV_REQ_MAY_UNMAP check in qemu_rbd_co_pwrite_zeroes()
>    and as a consequence always unmap if librbd is too old
>
>    It's not clear what qemu's expectation is but in general Write
>    Zeroes is allowed to unmap.  The only guarantee is that subsequent
>    reads return zeroes, everything else is a hint.  This is how it is
>    specified in the kernel and in the NVMe spec.
>
>    In particular, block/nvme.c implements it as follows:
>
>    if (flags & BDRV_REQ_MAY_UNMAP) {
>        cdw12 |= (1 << 25);
>    }
>
>    This sets the Deallocate bit.  But if it's not set, the device may
>    still deallocate:
>
>    """
>    If the Deallocate bit (CDW12.DEAC) is set to '1' in a Write Zeroes
>    command, and the namespace supports clearing all bytes to 0h in the
>    values read (e.g., bits 2:0 in the DLFEAT field are set to 001b)
>    from a deallocated logical block and its metadata (excluding
>    protection information), then for each specified logical block, the
>    controller:
>    - should deallocate that logical block;
>
>    ...
>
>    If the Deallocate bit is cleared to '0' in a Write Zeroes command,
>    and the namespace supports clearing all bytes to 0h in the values
>    read (e.g., bits 2:0 in the DLFEAT field are set to 001b) from
>    a deallocated logical block and its metadata (excluding protection
>    information), then, for each specified logical block, the
>    controller:
>    - may deallocate that logical block;
>    """
>
>    https://nvmexpress.org/wp-content/uploads/NVM-Express-NVM-Command-Set-Specification-2021.06.02-Ratified-1.pdf
>
> b) set BDRV_REQ_NO_FALLBACK in supported_zero_flags
>
>    Again, it's not clear what qemu expects here, but without it we end
>    up in a ridiculous situation where specifying the "don't allow slow
>    fallback" switch immediately fails all efficient zeroing requests on
>    a device where Write Zeroes is always efficient:
>
>    $ qemu-io -c 'help write' | grep -- '-[zun]'
>     -n, -- with -z, don't allow slow fallback
>     -u, -- with -z, allow unmapping
>     -z, -- write zeroes using blk_co_pwrite_zeroes
>
>    $ qemu-io -f rbd -c 'write -z -u -n 0 1M' rbd:foo/bar
>    write failed: Operation not supported
> --->8---
>
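
A minimal sketch of why the qemu-io call in (b) fails, assuming the
generic block layer rejects a zeroing request up front whenever
BDRV_REQ_NO_FALLBACK is requested but not advertised in
supported_zero_flags.  The REQ_* constants and zero_request_precheck()
are hypothetical stand-ins with illustrative values, not qemu's actual
definitions:

```c
#include <assert.h>
#include <errno.h>

/* Hypothetical stand-ins for qemu's request flags; values are illustrative. */
enum {
    REQ_MAY_UNMAP   = 1 << 0,
    REQ_NO_FALLBACK = 1 << 1,
};

/*
 * Simplified model of the pre-check: if the caller forbids a slow
 * fallback and the driver did not advertise that flag, fail before
 * the driver is ever called.
 */
static int zero_request_precheck(int supported_zero_flags, int flags)
{
    if ((flags & ~supported_zero_flags) & REQ_NO_FALLBACK) {
        return -ENOTSUP;
    }
    return 0;
}

/* Usage: a driver advertising only MAY_UNMAP vs. one advertising both. */
void zero_request_precheck_examples(void)
{
    int req = REQ_MAY_UNMAP | REQ_NO_FALLBACK;

    /* Without NO_FALLBACK advertised, "write -z -u -n" is refused. */
    assert(zero_request_precheck(REQ_MAY_UNMAP, req) == -ENOTSUP);
    /* With both flags advertised, the request reaches the driver. */
    assert(zero_request_precheck(REQ_MAY_UNMAP | REQ_NO_FALLBACK, req) == 0);
}
```

Under this model, advertising the flag costs nothing on a device where
Write Zeroes is always efficient, since the driver never needs a slow
path to begin with.
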
> Signed-off-by: Peter Lieven <pl@kamp.de>
> ---
>  block/rbd.c | 32 +++++++++++++++++++++++++++++++-
>  1 file changed, 31 insertions(+), 1 deletion(-)
>
> diff --git a/block/rbd.c b/block/rbd.c
> index be0471944a..149317d33c 100644
> --- a/block/rbd.c
> +++ b/block/rbd.c
> @@ -63,7 +63,8 @@ typedef enum {
>      RBD_AIO_READ,
>      RBD_AIO_WRITE,
>      RBD_AIO_DISCARD,
> -    RBD_AIO_FLUSH
> +    RBD_AIO_FLUSH,
> +    RBD_AIO_WRITE_ZEROES
>  } RBDAIOCmd;
>
>  typedef struct BDRVRBDState {
> @@ -705,6 +706,10 @@ static int qemu_rbd_open(BlockDriverState *bs, QDict *options, int flags,
>          }
>      }
>
> +#ifdef LIBRBD_SUPPORTS_WRITE_ZEROES
> +    bs->supported_zero_flags = BDRV_REQ_MAY_UNMAP | BDRV_REQ_NO_FALLBACK;
> +#endif
> +
>      /* When extending regular files, we get zeros from the OS */
>      bs->supported_truncate_flags = BDRV_REQ_ZERO_WRITE;
>
> @@ -827,6 +832,18 @@ static int coroutine_fn qemu_rbd_start_co(BlockDriverState *bs,
>      case RBD_AIO_FLUSH:
>          r = rbd_aio_flush(s->image, c);
>          break;
> +#ifdef LIBRBD_SUPPORTS_WRITE_ZEROES
> +    case RBD_AIO_WRITE_ZEROES: {
> +        int zero_flags = 0;
> +#ifdef RBD_WRITE_ZEROES_FLAG_THICK_PROVISION
> +        if (!(flags & BDRV_REQ_MAY_UNMAP)) {
> +            zero_flags = RBD_WRITE_ZEROES_FLAG_THICK_PROVISION;
> +        }
> +#endif
> +        r = rbd_aio_write_zeroes(s->image, offset, bytes, c, zero_flags, 0);
> +        break;
> +    }
> +#endif
>      default:
>          r = -EINVAL;
>      }
> @@ -897,6 +914,16 @@ static int coroutine_fn qemu_rbd_co_pdiscard(BlockDriverState *bs,
>      return qemu_rbd_start_co(bs, offset, count, NULL, 0, RBD_AIO_DISCARD);
>  }
>
> +#ifdef LIBRBD_SUPPORTS_WRITE_ZEROES
> +static int
> +coroutine_fn qemu_rbd_co_pwrite_zeroes(BlockDriverState *bs, int64_t offset,
> +                                      int count, BdrvRequestFlags flags)
> +{
> +    return qemu_rbd_start_co(bs, offset, count, NULL, flags,
> +                             RBD_AIO_WRITE_ZEROES);
> +}
> +#endif
> +
>  static int qemu_rbd_getinfo(BlockDriverState *bs, BlockDriverInfo *bdi)
>  {
>      BDRVRBDState *s = bs->opaque;
> @@ -1120,6 +1147,9 @@ static BlockDriver bdrv_rbd = {
>      .bdrv_co_pwritev        = qemu_rbd_co_pwritev,
>      .bdrv_co_flush_to_disk  = qemu_rbd_co_flush,
>      .bdrv_co_pdiscard       = qemu_rbd_co_pdiscard,
> +#ifdef LIBRBD_SUPPORTS_WRITE_ZEROES
> +    .bdrv_co_pwrite_zeroes  = qemu_rbd_co_pwrite_zeroes,
> +#endif
>
>      .bdrv_snapshot_create   = qemu_rbd_snap_create,
>      .bdrv_snapshot_delete   = qemu_rbd_snap_remove,
> --
> 2.17.1
>
>
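
The flag translation in the RBD_AIO_WRITE_ZEROES case can be read as
a small pure function.  This is only a sketch: SKETCH_REQ_MAY_UNMAP and
SKETCH_THICK_PROVISION are hypothetical stand-ins for
BDRV_REQ_MAY_UNMAP and RBD_WRITE_ZEROES_FLAG_THICK_PROVISION, and the
have_thick_provision argument models the compile-time #ifdef as a
runtime value:

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical stand-ins; the real values come from qemu and librbd headers. */
enum { SKETCH_REQ_MAY_UNMAP = 1 << 0 };
enum { SKETCH_THICK_PROVISION = 1 << 0 };

/*
 * Map the block-layer request flags to librbd-style zeroing flags.
 * Thick provisioning is requested only when the caller did not allow
 * unmapping and the library knows the flag; otherwise no flag is
 * passed and the library is free to unmap.
 */
static int sketch_zero_flags(int req_flags, bool have_thick_provision)
{
    if (have_thick_provision && !(req_flags & SKETCH_REQ_MAY_UNMAP)) {
        return SKETCH_THICK_PROVISION;
    }
    return 0;
}

/* Usage: the three cases the patch distinguishes. */
void sketch_zero_flags_examples(void)
{
    /* Unmapping allowed: never force allocation. */
    assert(sketch_zero_flags(SKETCH_REQ_MAY_UNMAP, true) == 0);
    /* Unmapping not allowed, flag available: keep the range allocated. */
    assert(sketch_zero_flags(0, true) == SKETCH_THICK_PROVISION);
    /* Unmapping not allowed, flag unavailable: the hint is ignored. */
    assert(sketch_zero_flags(0, false) == 0);
}
```

The last case is the "silently ignores BDRV_REQ_MAY_UNMAP for older
librbd versions" behaviour from the commit message.
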

Reviewed-by: Ilya Dryomov <idryomov@gmail.com>

Thanks,

                Ilya

