Re: [PATCH] nbd/server: Advertise MULTI_CONN for shared writable exports

From: Eric Blake <eblake@redhat.com>
To: Vladimir Sementsov-Ogievskiy <vsementsov@virtuozzo.com>
Cc: kwolf@redhat.com, qemu-block@nongnu.org,
	Markus Armbruster <armbru@redhat.com>,
	qemu-devel@nongnu.org, rjones@redhat.com, nsoffer@redhat.com,
	Hanna Reitz <hreitz@redhat.com>
Subject: Re: [PATCH] nbd/server: Advertise MULTI_CONN for shared writable exports
Date: Fri, 27 Aug 2021 13:45:03 -0500
Message-ID: <20210827184503.m3lbpz56qs6mpjla@redhat.com>
In-Reply-To: <81fc3d16-b357-5a8c-45f2-682ddf253590@virtuozzo.com>

On Fri, Aug 27, 2021 at 07:58:10PM +0300, Vladimir Sementsov-Ogievskiy wrote:
> 27.08.2021 18:09, Eric Blake wrote:
> > According to the NBD spec, a server advertising
> > NBD_FLAG_CAN_MULTI_CONN promises that multiple client connections will
> > not see any cache inconsistencies: when properly separated by a single
> > flush, actions performed by one client will be visible to another
> > client, regardless of which client did the flush.  We satisfy these
> > conditions in qemu because our block layer serializes any overlapping
> > operations (see bdrv_find_conflicting_request and friends)
> 
> Not any. We serialize only write operations not aligned to request_alignment of bs (see bdrv_make_request_serialising() call in bdrv_co_pwritev_part). So, actually most of overlapping operations remain overlapping. And that's correct: it's not a Qemu work to resolve overlapping requests. We resolve them only when we are responsible for appearing of intersection: when we align requests up.

I welcome improvements on the wording.  Maybe what I should be
emphasizing is that even when there are overlapping requests, qemu
itself is multiplexing all of those requests through a single
interface into the backend, without any caching on qemu's part, and
relying on the consistency of the flush operation into that backend.

From a parallelism perspective, in file-posix.c, we don't distinguish
between two pwrite() syscalls made (potentially out-of-order) by a
single BDS client in two coroutines and two pwrite() syscalls made
by two separate BDS clients.  Either way, those two syscalls may both
be asynchronous, but both go through a single interface into the
kernel's view of the underlying filesystem or block device.  And we
implement flush via fdatasync(), for which the kernel already provides
pretty strong guarantees on cross-thread consistency.
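
To make the "single interface into the kernel" point concrete, here is
a minimal sketch in Python rather than qemu's C.  It is not how
file-posix.c is written; the helper names write_flush_read() and demo()
are made up for illustration, and the three file descriptors merely
stand in for separate writers and a reader of one file.

```python
import os
import tempfile

# Fall back to fsync() on platforms that lack fdatasync().
_flush = getattr(os, "fdatasync", os.fsync)

def write_flush_read(path):
    # Two independent descriptors stand in for two writers (two
    # coroutines of one client, or two separate clients).  The kernel
    # sees a single file either way.
    fd_a = os.open(path, os.O_RDWR)
    fd_b = os.open(path, os.O_RDWR)
    fd_r = os.open(path, os.O_RDONLY)
    try:
        os.pwrite(fd_a, b"AAAA", 0)   # writer A, offset 0
        os.pwrite(fd_b, b"BBBB", 4)   # writer B, non-overlapping offset 4
        _flush(fd_a)                  # flush issued through one descriptor only
        return os.pread(fd_r, 8, 0)   # reader uses a third descriptor
    finally:
        for fd in (fd_a, fd_b, fd_r):
            os.close(fd)

def demo():
    # Create an 8-byte scratch file, run the sequence, then clean up.
    with tempfile.NamedTemporaryFile(delete=False) as f:
        f.write(b"\0" * 8)
        path = f.name
    try:
        return write_flush_read(path)
    finally:
        os.unlink(path)
```

On a local filesystem the reader sees both writes because all three
descriptors share the kernel's view of the file; the flush is what
pushes the data to stable storage, no matter which descriptor issued it.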

But I am less certain of whether we are guaranteed cross-consistency
like that for all protocol drivers.  Is there any block driver (most
likely a networked one) where, even though we are using the same API
for all asynchronous access within the qemu coroutines, under the hood
those calls can end up at different destinations (for example, due to
network round-robin effects) and leave us seeing cache-inconsistent
views?  That is, can we ever encounter this:

-> read()
  -> kicks off networked storage call that resolves to host X
    -> host X caches the read
  <- reply
-> write()
  -> kicks off networked storage call that resolves to host Y
    -> host Y updates the file system
  <- reply
-> flush()
  -> kicks off networked storage call that resolves to host Y
    -> host Y starts flushing, but replies early
  <- reply
-> read()
  -> kicks off networked storage call that resolves to host X
    -> host X does not see effects of Y's flush yet, returns stale data

If we can encounter that, then in those situations we must not
advertise MULTI_CONN.  But I'm confident that file-posix.c does not
have that problem, and even if another driver did have that problem
(where our single API access can result in cache-inconsistent views
over the protocol, rather than flush really being effective to all
further API access to that driver), you'd think we'd be aware of it.
However, if we DO know of a place where that is the case, then now is
the time to design our QAPI control over whether to advertise NBD's
MULTI_CONN bit based on whether the block layer can warn us about a
particular block driver NOT being safe.
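
For anyone who wants to play with the hypothetical read/write/flush/read
sequence, here is a toy model of it.  Everything in it is invented for
illustration: the Host class, the reply_early switch, and the
peer-invalidation step do not correspond to any real qemu block driver
or storage protocol that I know of.

```python
class Host:
    """Hypothetical storage front-end with a private read cache."""

    def __init__(self, backend):
        self.backend = backend   # dict shared by all hosts
        self.read_cache = {}
        self.dirty = {}
        self.peers = []

    def read(self, key):
        # Serve from the local cache when possible.
        if key not in self.read_cache:
            self.read_cache[key] = self.backend[key]
        return self.read_cache[key]

    def write(self, key, value):
        self.dirty[key] = value

    def flush(self, reply_early):
        if reply_early:
            # Broken behavior: acknowledge before the data is visible
            # to other hosts.
            return
        # Well-behaved flush: publish the data and drop stale peer caches.
        self.backend.update(self.dirty)
        self.dirty.clear()
        for peer in self.peers:
            peer.read_cache.clear()

def run_scenario(reply_early):
    backend = {"blk": "old"}
    x, y = Host(backend), Host(backend)
    x.peers, y.peers = [y], [x]
    x.read("blk")            # host X caches the read
    y.write("blk", "new")    # write resolves to host Y
    y.flush(reply_early)     # flush resolves to host Y
    return x.read("blk")     # second read resolves to host X
```

In this model run_scenario(True) hands back the stale value, which is
the case where advertising MULTI_CONN would be wrong, while
run_scenario(False) returns the new one.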

But unless we come up with such a scenario, maybe all I need here is
better wording to put in the commit message to state why we think we
ARE safe in advertising MULTI_CONN.  Remember, the NBD flag only
affects how strong flush calls are: actions with a reply completed
prior to FLUSH must be visible to actions started after the reply to
FLUSH.  It does NOT require that overlapping write requests have any
particular behavior (it has always been up to the client to be careful
there, and qemu need not go out of its way to prevent client stupidity
with overlapping writes).
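
Stated as a predicate, the ordering rule looks roughly like the sketch
below.  The Op type and the logical timestamps are my own illustrative
framing, not anything from the NBD spec or qemu, and the three
operations may each arrive on a different connection.

```python
from dataclasses import dataclass

@dataclass
class Op:
    sent: int      # logical time the client sent the request
    replied: int   # logical time the client received the reply

def flush_guarantees_visibility(write, flush, read):
    # The write must have been answered before the flush was sent, and
    # the read must have been sent after the flush was answered.
    # Anything else (including overlapping writes) is left unspecified.
    return write.replied < flush.sent and flush.replied < read.sent
```

A write still in flight when the flush is sent, or a read sent before
the flush reply arrives, gets no promise from the flag.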

-- 
Eric Blake, Principal Software Engineer
Red Hat, Inc.           +1-919-301-3266
Virtualization:  qemu.org | libvirt.org