All of lore.kernel.org
 help / color / mirror / Atom feed
From: Wouter Verhelst <w@uter.be>
To: Eric Blake <eblake@redhat.com>
Cc: "nbd-general@lists.sourceforge.net"
	<nbd-general@lists.sourceforge.net>,
	Kevin Wolf <kwolf@redhat.com>,
	"qemu-devel@nongnu.org" <qemu-devel@nongnu.org>,
	Pavel Borzenkov <pborzenkov@virtuozzo.com>,
	"Stefan stefanha@redhat. com" <stefanha@redhat.com>,
	Paolo Bonzini <pbonzini@redhat.com>,
	"Denis V. Lunev" <den@openvz.org>
Subject: Re: [Qemu-devel] [Nbd] [PATCH v2] doc: Add NBD_CMD_BLOCK_STATUS extension
Date: Tue, 5 Apr 2016 23:44:25 +0200	[thread overview]
Message-ID: <20160405214425.GF13913@grep.be> (raw)
In-Reply-To: <5702F992.1060909@redhat.com>

On Mon, Apr 04, 2016 at 05:32:34PM -0600, Eric Blake wrote:
> On 04/04/2016 05:08 PM, Wouter Verhelst wrote:
> > On Mon, Apr 04, 2016 at 10:54:02PM +0300, Denis V. Lunev wrote:
> >> saying about dirtiness, we would soon come to the fact, that
> >> we can have several dirtiness states regarding different
> >> lines of incremental backups. This complexity is hidden
> >> inside QEMU and it would be very difficult to publish and
> >> reuse it.
> > 
> > How about this then.
> > 
> > A reply to GET_BLOCK_STATUS containing chunks of this:
> > 
> > 32-bit length
> > 32-bit "snapshot status"
> > if bit 0 in the latter field is set, that means the block is allocated
> >   on the original device
> > if bit 1 is set, that means the block is allocated on the first-level
> >   snapshot
> > if bit 2 is set, that means the block is allocated on the second-level
> >   snapshot
> 
> The idea of allocation is orthogonal from the idea of reads as zeroes.
> That is, a client may usefully guarantee that something reads as zeroes,
> whether or not it is allocated (but knowing whether it is a hole or
> allocated will determine whether future writes to that area will cause
> file system fragmentation or be at risk of ENOSPC on thin-provisioning).
>  If we want to expose the notion of depth (and I'm not sure about that
> yet), we may want to reserve bit 0 for 'reads as zero' and bits 1-30 as
> 'allocated at depth "bit-1"' (and bit 31 as 'allocated at depth 30 or
> greater).
> 
> I don't know if the idea of depth of allocation is useful enough to
> expose in this manner; qemu could certainly advertise depth if the
> protocol calls it out, but I'm still not sure whether knowing depth
> helps any algorithms.
> 
> > 
> > If all flags are cleared, that means the block is not allocated (i.e.,
> > is a hole) and MUST read as zeroes.
> 
> That's too strong.  NBD_CMD_TRIM says that we can create holes whose
> data does not necessarily read as zeroes (and SCSI definitely has
> semantics like this - not all devices guarantee zero reads when you
> UNMAP; and WRITE_SAME has an UNMAP flag to control whether you are okay
> with the faster unmapping operation at the expense of bad reads, or
> slower explicit writes).  Hence my complaint that we have to treat
> 'reads as zero' as an orthogonal bit to 'allocated at depth X'.

Right.

It would be possible to instead mark bit 0 as NBD_ALLOC_LEVEL_DATA (i.e.,
"not just zeroes" -- since I believe a flag value of "one" to indicate
"zeroes" is bound to cause confusion), bit 1 as NBD_ALLOC_LEVEL_CURRENT
(i.e., "allocated in the current snapshot level"), and bits 2 through 31
as NBD_ALLOC_LEVEL_2 through NBD_ALLOC_LEVEL_31, with
implementation-defined meanings, as above.

> > If a flag is set at a particular level X, that means the device is dirty
> > at the Xth-level snapshot.
> > 
> > If at least one flag is set for a region, that means the data may read
> > as "not zero".
> > 
> > The protocol does not define what it means to have multiple levels of
> > snapshots, other than:
> > 
> > - Any write command (WRITE or WRITE_ZEROES) MUST NOT clear or set the
> >   Xth level flag if the Yth level flag is not also cleared at the same
> >   time, for any Y > X
> > - Any write (as above) MAY set or clear multiple levels of flags at the
> >   same time, as long as the above holds

Having thought about this on my way to work, I'm now no longer convinced
this is a good idea. I could imagine several scenarios where violating
these constraints might be useful, and none where not violating them
would buy us anything.

Instead, I suggest the following semantics:

- NBD_CMD_WRITE to a region which has NBD_ALLOC_LEVEL_DATA cleared MUST
  set that bit for that region
- NBD_CMD_WRITE_ZEROES or NBD_CMD_TRIM to a region which has
  NBD_ALLOC_LEVEL_DATA set MAY clear that bit for that region
- NBD_CMD_TRIM to a region which has one of the NBD_ALLOC_LEVEL_X flags
  set MAY just clear those flags (and no other). In that case, the
  contents of that region (when read) SHOULD be assumed to have changed,
  but is not defined, and is unlikely to read as zeroes if other
  NBD_ALLOC_LEVEL_* flags are still set.
- NBD_CMD_WRITE to a region which has NBD_ALLOC_LEVEL_CURRENT set SHOULD
  NOT fail with ENOSPC, although it MAY still fail with other errors.
- NBD_CMD_LEVEL_READ from a region which has NBD_ALLOC_LEVEL_DATA clared
  MUST read as zeroes

This may all be a bit overengineered, but it has the benefit of
being fairly generic. It also means that if the flags field is
all-zeroes, the region is "uninteresting".

Thoughts?

[...]

-- 
< ron> I mean, the main *practical* problem with C++, is there's like a dozen
       people in the world who think they really understand all of its rules,
       and pretty much all of them are just lying to themselves too.
 -- #debian-devel, OFTC, 2016-02-12

  parent reply	other threads:[~2016-04-05 21:44 UTC|newest]

Thread overview: 59+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2016-04-04 16:39 [Qemu-devel] [PATCH v2] doc: Add NBD_CMD_BLOCK_STATUS extension Eric Blake
2016-04-04 18:06 ` [Qemu-devel] [Nbd] " Alex Bligh
2016-04-04 19:34   ` Eric Blake
2016-04-04 19:54     ` Denis V. Lunev
2016-04-04 20:03       ` Alex Bligh
2016-04-04 20:08         ` Denis V. Lunev
2016-04-04 20:34           ` Eric Blake
2016-04-04 21:06             ` Denis V. Lunev
2016-04-04 21:12             ` Alex Bligh
2016-04-05 14:15         ` Paolo Bonzini
2016-04-05 15:01           ` Alex Bligh
2016-04-05 15:23             ` Paolo Bonzini
2016-04-05 15:27               ` Alex Bligh
2016-04-05 15:31                 ` Paolo Bonzini
2016-04-04 23:08       ` Wouter Verhelst
2016-04-04 23:32         ` Eric Blake
2016-04-05  7:16           ` Wouter Verhelst
2016-04-05 21:44           ` Wouter Verhelst [this message]
2016-04-05  7:13         ` Alex Bligh
2016-04-04 19:58     ` Alex Bligh
2016-04-04 20:04       ` Denis V. Lunev
2016-04-04 20:08         ` Alex Bligh
2016-04-04 20:13           ` Denis V. Lunev
2016-04-04 20:15             ` Alex Bligh
2016-04-04 20:27               ` Denis V. Lunev
2016-04-04 20:45                 ` Eric Blake
2016-04-04 21:04                   ` Denis V. Lunev
2016-04-04 21:12                     ` Alex Bligh
2016-04-04 21:17                     ` Eric Blake
2016-04-04 21:27                       ` Denis V. Lunev
2016-04-04 20:26           ` Eric Blake
2016-04-04 21:07             ` Alex Bligh
2016-04-04 21:25               ` Eric Blake
2016-04-04 22:06                 ` Alex Bligh
2016-04-04 20:22       ` Eric Blake
2016-04-05 13:38     ` Paolo Bonzini
2016-04-04 22:40 ` Wouter Verhelst
2016-04-04 23:03   ` Eric Blake
2016-04-05 13:41     ` Paolo Bonzini
2016-04-06  5:57     ` Denis V. Lunev
2016-04-06 14:08       ` Eric Blake
2016-04-05  4:05 ` [Qemu-devel] " Kevin Wolf
2016-04-05 13:43   ` Paolo Bonzini
2016-04-07 10:38     ` Vladimir Sementsov-Ogievskiy
2016-04-07 16:10       ` Eric Blake
2016-04-07 16:21         ` [Qemu-devel] [Nbd] " Alex Bligh
2016-04-08 11:35         ` [Qemu-devel] " Kevin Wolf
2016-04-09  9:08         ` [Qemu-devel] [Nbd] " Wouter Verhelst
2016-04-13 12:38         ` [Qemu-devel] " Pavel Borzenkov
2016-04-13 14:40           ` Eric Blake
2016-04-07 15:35     ` Pavel Borzenkov
2016-04-07 15:43       ` Paolo Bonzini
2016-04-05  8:51 ` Stefan Hajnoczi
2016-04-05  9:24 ` [Qemu-devel] [Nbd] " Markus Pargmann
2016-04-05 13:50   ` Paolo Bonzini
2016-04-11  5:58     ` Markus Pargmann
2016-04-05 14:14   ` Eric Blake
2016-04-05 20:50     ` Wouter Verhelst
2016-04-11  6:07       ` Markus Pargmann

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=20160405214425.GF13913@grep.be \
    --to=w@uter.be \
    --cc=den@openvz.org \
    --cc=eblake@redhat.com \
    --cc=kwolf@redhat.com \
    --cc=nbd-general@lists.sourceforge.net \
    --cc=pbonzini@redhat.com \
    --cc=pborzenkov@virtuozzo.com \
    --cc=qemu-devel@nongnu.org \
    --cc=stefanha@redhat.com \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.