All of lore.kernel.org
 help / color / mirror / Atom feed
From: Pavel Borzenkov <pborzenkov@virtuozzo.com>
To: Eric Blake <eblake@redhat.com>
Cc: nbd-general@lists.sourceforge.net, Kevin Wolf <kwolf@redhat.com>,
	qemu-devel@nongnu.org, Stefan Hajnoczi <stefanha@redhat.com>,
	Paolo Bonzini <pbonzini@redhat.com>, Wouter Verhelst <w@uter.be>,
	"Denis V. Lunev" <den@openvz.org>
Subject: Re: [Qemu-devel] [PATCH 2/2] NBD proto: add GET_LBA_STATUS extension
Date: Thu, 24 Mar 2016 15:30:06 +0300	[thread overview]
Message-ID: <20160324123006.GF24831@phobos.sw.ru> (raw)
In-Reply-To: <56F2C3D4.6080204@redhat.com>

On Wed, Mar 23, 2016 at 10:27:00AM -0600, Eric Blake wrote:
> On 03/23/2016 08:16 AM, Denis V. Lunev wrote:
> > From: Pavel Borzenkov <pborzenkov@virtuozzo.com>
> > 
> > With the availability of sparse storage formats, it is often needed to
> > query status of a particular LBA range and read only those blocks of
> > data that are actually present on the block device.
> 
> The acronym LBA is not used elsewhere in the NBD spec; should we spell
> it out at least once?
> 
> > 
> > To provide such information, the patch adds GET_LBA_STATUS extension
> > with one new NBD_CMD_GET_LBA_STATUS command.
> > 
> > There exists a concept of data dirtiness, which is required during, for
> > example, incremental block device backup. To express this concept via
> > NBD protocol, this patch also adds additional mode of operation to
> > NBD_CMD_GET_LBA_STATUS command.
> > 
> > Since NBD protocol has no notion of block size, and to mimic SCSI "GET
> > LBA STATUS" command more closely, it has been chosen to return a list of
> > extents in the response of NBD_CMD_GET_LBA_STATUS command, instead of a
> > bitmap.
> > 
> > Signed-off-by: Pavel Borzenkov <pborzenkov@virtuozzo.com>
> > Reviewed-by: Roman Kagan <rkagan@virtuozzo.com>
> > Signed-off-by: Denis V. Lunev <den@openvz.org>
> > CC: Wouter Verhelst <w@uter.be>
> > CC: Paolo Bonzini <pbonzini@redhat.com>
> > CC: Kevin Wolf <kwolf@redhat.com>
> > CC: Stefan Hajnoczi <stefanha@redhat.com>
> > ---
> >  doc/proto.md | 82 ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
> >  1 file changed, 82 insertions(+)
> > 
> 
> >  
> > +### `GET_LBA_STATUS` extension
> > +
> > +With the availability of sparse storage formats, it is often needed to query
> > +status of a particular LBA range and read only those blocks of data that are
> > +actually present on the block device.
> > +
> > +Some storage formats and operations over such formats express a concept of
> > +data dirtiness. Whether the operation is block device mirroring,
> > +incremental block device backup or any other operation with a concept of
> > +data dirtiness, they all share a need to provide a list of LBA ranges
> > +that this particular operation treats as dirty.
> > +
> > +To provide such class of information, `GET_LBA_STATUS` extension adds new
> > +`NBD_CMD_GET_LBA_STATUS` command which returns a list of LBA ranges with
> > +their respective states.
> > +
> > +* `NBD_CMD_GET_LBA_STATUS` (7)
> > +
> > +    An LBA range status query request. Length and offset define the range
> > +    of interest. The server MUST reply with a reply header, followed
> > +    immediately by the following data:
> > +
> > +      - 32 bits, length of parameter data that follow (unsigned)
> 
> Is this length the number of descriptors, or the number of bytes
> occupied by those descriptors?  It looks like bytes (that is, with the
> current layout, this field should be a multiple of 14 unless an error is
> returned and the data is bogus).
> 
> I guess 32 bits is sufficient: transmission commands are limited to
> 32-bit length, and we are unlikely to have more than one extent per 512
> bytes of length, so even if we have a header for every single sector
> (worst-case for alternating clean/dirty sectors), as long as the
> smallest granularity of an extent is larger than the extent field, the
> 'length of parameter data' in bytes is still sufficient.
> 
> > +      - zero or more LBA status descriptors, each having the following
> 
> zero or more? [1]
> 
> > +        structure:
> > +
> > +        * 64 bits, offset (unsigned)
> > +        * 32 bits, length (unsigned)
> > +        * 16 bits, status (unsigned)
> 
> An array of these status descriptors is packed on the wire, while the
> typical C layout of an array of these structures will have padding to
> reach a nice 8-byte alignment.  Should 'status' be a 32-bit field, so
> that clients and servers do not have to pack/unpack between 14 bytes on
> the wire and 16 bytes in efficient array handling, at the expense of
> more network traffic?
> 
> Conversely, it would be possible to send less data over the wire, as
> long as we require that all LBA status descriptors cover consecutive
> offsets.  That is, having the server reply with offsets is pointless,
> since they can be reconstructed on the client by starting with the
> offset in the client's request, then adding length from each status
> field.  Is less network traffic desirable?

I think it's better to explicitly send the start of each LBA extent.
So I'll go with changing 'status' field to 32 bits to avoid
packing/unpacking issues.

> 
> > +
> > +    unless an error condition has occurred.
> > +
> > +    If an error occurs, the server SHOULD set the appropriate error code
> > +    in the error field. The server MUST then either close the
> > +    connection, or send *length of parameter data* bytes of data
> > +    (which MAY be invalid).
> > +
> > +    The type of information required by the client is passed to server in the
> > +    command flags field. If the server does not implement requested type or
> > +    have no means to express it, it MUST NOT return an error, but instead MUST
> > +    return a single LBA status descriptor with *offset* and *length* equal to
> > +    the *offset* and *length* from request, and *status* set to `0`.
> 
> [1] So in what situations will we ever return an array of zero status
> fields? On an error?  Should we make it clear that the server MUST NOT
> return 0 status fields except on an error?
> 
> Do we want to require that the server MUST reply with enough extents to
> sum up to the length of the client's request, or are we permitting a
> short reply?

While the "GET LBA STATUS" command in SCSI allows partial reply, I
believe it'd better to keep things simple and require that the server
must either return a list of extents that covers the whole requested
range, or an error.

> 
> > +
> > +    The following request types are currently defined for the command:
> > +
> > +    1. Block provisioning state
> > +
> > +    Upon receiving an `NBD_CMD_GET_LBA_STATUS` command with command flags
> > +    field set to `NBD_FLAG_GET_ALLOCATED` (0x0), the server MUST return
> 
> Here, you spell it '0x0'; in the previous patch, you spelled the command
> flag as 'bit 1' - does that mean that Block provisioning state is the
> default when no command flags are sent?  What if we later add other
> flags but still want block provisioning mode?  Wouldn't it be better to
> state that this mode is entered when the NBD_FLAG_GET_DIRTY flag is
> clear, without regards to the state of the other flags, than allowing
> this mode only when all 16 flags are zero?
> 
> For example, should we allow a flag that states that the client is
> interested only in allocated/unallocated, and that the server may
> coalesce NBD_STATE_ZEROED extents as if they were NBD_STATE_ALLOCATED
> for fewer extents reported and thus potentially less network traffic?

Actually, for this command I treat 'command flags' field not as a set of
flags, but rather as a plain number representing required mode of
operation. Probably, not a good idea as it doesn't match the rest of the
commands.

I went this way because I didn't want to allow clients to request
several modes simultaneously (e.g. provisioning + dirtiness) in the same
request. This makes server side implementation harder.

I think I'll just switch to bits to match the rest of the commands and
will add a note, that server should return EINVAL in case several modes
are requested simultaneously.

> 
> > +    the provisioning state of the device. The following provisionnig states
> 
> s/provisionnig/provisioning/
> 
> > +    are defined for the command:
> > +
> > +      - `NBD_STATE_ALLOCATED` (0x0), LBA extent is present on the block device;
> > +      - `NBD_STATE_ZEROED` (0x1), LBA extent is present on the block device
> > +        and contains zeroes;
> > +      - `NBD_STATE_DEALLOCATED` (0x2), LBA extent is not present on the
> > +        block device. A client MUST NOT make any assumptions about the
> > +        contents of the extent.
> 
> Can NBD_STATE_ALLOCATED and NBD_STATE_DEALLOCATED both be set at the
> same time, or is that an error on the server?  What do we know about an
> extent that has neither NBD_STATE_ALLOCATED nor NBD_STATE_DEALLOCATED set?
> 
> /me re-reads
> 
> Oh, you are using this as the _entire_ 16-bit status value, rather than
> as bits 0, 1, and 2 as flags.

Yes

> 
> But I think you have two binary flags (four possible states) here: it is
> quite conceivable to have a server on top of a SCSI device, where we
> know that the extent is unallocated in SCSI, but where the server will
> guarantee that it reads as all zeroes (possibly because the server
> bypasses SCSI on the NBD read commands when it knows SCSI is
> unallocated).  That is, if we set this up as two flags:
> 
> 0x1 - allocated
> 0x2 - reads as 0
> 
> then we can express four states:
> 
> 0x0 - LBA extent not present, client MUST NOT make assumptions about
> contents, and reads should not be attempted
> 0x1 - LBA extent allocated, reads will succeed but no guarantee on contents
> 0x2 - LBA extent not present, but client can treat the extent as zeroes
> and reads will succeed
> 0x3 - LBA extent present, client can treat the extent as zeroes and
> reads will succeed

I'm not sure that clients need this level of details. From client's POV
0x2 and 0x3 are the same.

> 
> Actually, we should probably pick the bit values such that 0x0 means
> allocated and readable, as the most common state, since we also require
> that the command returns a single extent over the entire length with
> status 0 if the server can't provide any further details.
> 
> I'm not familiar enough with the SCSI "GET LBA STATUS" command to know
> if your command sanely matches to that one.

Extents returned by "GET LBA STATUS" in SCSI can have three possible
state: MAPPED/ANCHORED/DEALLOCATED. These are not bits and cannot be
combined together.

> 
> > +
> > +    2. Block dirtiness state
> > +
> > +    Upon receiving an `NBD_CMD_GET_LBA_STATUS` command with command flags
> > +    field set to `NBD_FLAG_GET_DIRTY` (0x1), the server MUST return
> 
> This overlaps with the bit value for NBD_FLAG_SEND_FUA in the previous
> patch.  Is that okay?  Or should we use a different bit value, on the
> grounds that some future extension may want to use both flags
> orthogonally within the same (possibly new) command?  Again, consistency
> in the spelling ('bit 1' in the previous patch, '0x1' here), would be nice.

I couldn't think of any use case. But, just in case, I'll change the
value of the flags so they don't overlap.

-- 
Pavel

> 
> > +    the dirtiness status of the device. The following dirtiness states
> > +    are defined for the command:
> > +
> > +      - `NBD_STATE_DIRTY` (0x0), LBA extent is dirty;
> > +      - `NBD_STATE_CLEAN` (0x1), LBA extent is clean.
> 
> Again, it looks like you are using these as two entire 16-bit status
> values, rather than as two separate bits (1<<0 and 1<<1).  Another way
> of expressing it is that a single boolean flag is defined, if clear, the
> extent is dirty, if set, the extent is clean.
> 
> > +
> > +    Generic NBD client implementation without knowledge of a particular NBD
> > +    server operation MUST NOT make any assumption on the meaning of the
> > +    NBD_STATE_DIRTY or NBD_STATE_CLEAN states.
> > +
> > +The server SHOULD return `EINVAL` if it receives a `GET_LBA_STATUS` request
> > +including one or more sectors beyond the size of the device.
> 
> As mentioned in the previous mail, should we also recommend an EINVAL if
> NBD_CMD_GET_LBA_STATUS was not negotiated in the export options but the
> client sends the command anyways; and/or a requirement that the client
> must not issue the command in that case?
> 
> > +
> >  ## About this file
> >  
> >  This file tries to document the NBD protocol as it is currently
> > 
> 
> -- 
> Eric Blake   eblake redhat com    +1-919-301-3266
> Libvirt virtualization library http://libvirt.org
> 

  reply	other threads:[~2016-03-24 12:30 UTC|newest]

Thread overview: 54+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2016-03-23 14:16 [Qemu-devel] [PATCH 0/2] NBD protocol extensions: WRITE_ZEROES and GET_LBA_STATUS Denis V. Lunev
2016-03-23 14:16 ` [Qemu-devel] [PATCH 1/2] NBD proto: add WRITE_ZEROES extension Denis V. Lunev
2016-03-23 15:14   ` Eric Blake
2016-03-23 17:40     ` [Qemu-devel] [Nbd] " Wouter Verhelst
2016-03-24  7:16     ` [Qemu-devel] " Pavel Borzenkov
2016-03-24  7:36       ` [Qemu-devel] [Nbd] " Wouter Verhelst
2016-03-23 17:21   ` Wouter Verhelst
2016-03-24  7:57     ` Pavel Borzenkov
2016-03-24  8:26       ` Wouter Verhelst
2016-03-24 11:35         ` Pavel Borzenkov
2016-03-24 11:37         ` Paolo Bonzini
2016-03-24 12:31           ` Wouter Verhelst
2016-03-24 14:53         ` Eric Blake
2016-03-23 14:16 ` [Qemu-devel] [PATCH 2/2] NBD proto: add GET_LBA_STATUS extension Denis V. Lunev
2016-03-23 16:27   ` Eric Blake
2016-03-24 12:30     ` Pavel Borzenkov [this message]
2016-03-24 15:04       ` Eric Blake
2016-03-24 16:36         ` Pavel Borzenkov
2016-03-23 17:58   ` [Qemu-devel] [Nbd] " Wouter Verhelst
2016-03-23 18:14     ` Kevin Wolf
2016-03-24  8:25       ` Pavel Borzenkov
2016-03-24  8:41         ` Wouter Verhelst
2016-03-24 11:36           ` Pavel Borzenkov
2016-03-24 12:32             ` Wouter Verhelst
2016-03-24  8:43     ` Pavel Borzenkov
2016-03-24  9:33       ` Wouter Verhelst
2016-03-24 10:32         ` Alex Bligh
2016-03-24 11:58           ` Paolo Bonzini
2016-03-24 12:17             ` Alex Bligh
2016-03-24 12:32               ` Paolo Bonzini
2016-03-24 13:31                 ` Alex Bligh
2016-03-24 13:32                   ` Paolo Bonzini
2016-03-24 11:55     ` Paolo Bonzini
2016-03-24 12:43       ` Wouter Verhelst
2016-03-24 15:25       ` Eric Blake
2016-03-24 15:33         ` Paolo Bonzini
2016-03-24 15:53           ` Wouter Verhelst
2016-03-24 16:04             ` Eric Blake
2016-03-24 16:07               ` Kevin Wolf
2016-03-24 16:47                 ` Wouter Verhelst
2016-03-29  9:38                   ` Kevin Wolf
2016-03-29  9:53                     ` Wouter Verhelst
2016-03-29 10:25                     ` Paolo Bonzini
2016-03-24 22:08   ` [Qemu-devel] " Eric Blake
2016-03-25  8:49     ` [Qemu-devel] [Nbd] " Wouter Verhelst
2016-03-25  9:01       ` Alex Bligh
2016-03-28 15:58       ` Eric Blake
2016-04-04 10:32         ` Markus Pargmann
2016-04-04 10:18       ` Markus Pargmann
2016-04-04 16:54         ` Eric Blake
2016-04-04 22:17         ` Wouter Verhelst
2016-04-04 16:40   ` [Qemu-devel] " Eric Blake
2016-04-04 20:16   ` Denis V. Lunev
2016-04-04 20:36     ` [Qemu-devel] [Nbd] " Eric Blake

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=20160324123006.GF24831@phobos.sw.ru \
    --to=pborzenkov@virtuozzo.com \
    --cc=den@openvz.org \
    --cc=eblake@redhat.com \
    --cc=kwolf@redhat.com \
    --cc=nbd-general@lists.sourceforge.net \
    --cc=pbonzini@redhat.com \
    --cc=qemu-devel@nongnu.org \
    --cc=stefanha@redhat.com \
    --cc=w@uter.be \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.