Re: [Qemu-devel] [PATCH 13/14] raw-posix: Implement image locking

From: Fam Zheng <famz@redhat.com>
To: Eric Blake <eblake@redhat.com>
Cc: qemu-devel@nongnu.org, Kevin Wolf <kwolf@redhat.com>,
	rjones@redhat.com, qemu-block@nongnu.org,
	Max Reitz <mreitz@redhat.com>
Subject: Re: [Qemu-devel] [PATCH 13/14] raw-posix: Implement image locking
Date: Tue, 1 Nov 2016 10:06:46 +0800	[thread overview]
Message-ID: <20161101020646.GA19068@lemon> (raw)
In-Reply-To: <43438fac-7c1c-443f-28fc-f49fd33fdfe0@redhat.com>

On Mon, 10/31 17:01, Eric Blake wrote:
> On 10/31/2016 10:38 AM, Fam Zheng wrote:
> > This implements open flag sensible image locking for local file
> > and host device protocol.
> > 
> > virtlockd in libvirt locks the first byte, so we start looking at the
> > file bytes from 1.
> 
> What happens if we try to use a raw file with less than 3 bytes? There's
> not much to be locked in that case (especially if we round down to
> sector sizes - the file is effectively empty) - but it's probably a
> corner case you have to be aware of.

Yes.  Not sure if it is complete (and always true) but patch 14 covers a 0 byte
test case and it seems the lock just works the same a a large file.

So I look at the kernel implementation of fcntl locks, to see if it limits lock
range to file size, but I don't see any.

I also manually tested with "touch /var/tmp/zerofile && qemu-io
/var/tmp/zerofile" and "lslocks", it indeed report the locked bytes tough the
file is 0 byte.

As another example, /dev/null is also lockable by this series, that's why I have
to add a patch to change it to null-co://.

So, I think where a zero byte file cannot be locked, if any, is a corner case.

> 
> > 
> > Quoting what was proposed by Kevin Wolf <kwolf@redhat.com>, there are
> > four locking modes by combining two bits (BDRV_O_RDWR and
> > BDRV_O_SHARE_RW), and implemented by taking two locks:
> > 
> > Lock bytes:
> > 
> > * byte 1: I can't allow other processes to write to the image
> > * byte 2: I am writing to the image
> > 
> > Lock modes:
> > 
> > * shared writer (BDRV_O_RDWR | BDRV_O_SHARE_RW): Take shared lock on
> >   byte 2. Test whether byte 1 is locked using an exclusive lock, and
> >   fail if so.
> > 
> > * exclusive writer (BDRV_O_RDWR only): Take shared lock on byte 2. Test
> >   whether byte 1 is locked using an exclusive lock, and fail if so. Then
> >   take shared lock on byte 1. I suppose this is racy, but we can
> >   probably tolerate that.
> > 
> > * reader that can tolerate writers (BDRV_O_SHARE_RW only): Don't do anything
> > 
> > * reader that can't tolerate writers (neither bit is set): Take shared
> >   lock on byte 1. Test whether byte 2 is locked, and fail if so.
> > 
> > The complication is in the transactional reopen.  To make the reopen
> > logic managable, and allow better reuse, the code is internally
> 
> s/managable/manageable/
> 
> > organized with a table from old mode to the new one.
> > 
> > Signed-off-by: Fam Zheng <famz@redhat.com>
> > ---
> >  block/raw-posix.c | 710 ++++++++++++++++++++++++++++++++++++++++++++++++++----
> >  1 file changed, 660 insertions(+), 50 deletions(-)
> > 
> 
> > +typedef enum {
> > +    /* Read only and accept other writers. */
> > +    RAW_L_READ_SHARE_RW,
> > +    /* Read only and try to forbid other writers. */
> > +    RAW_L_READ,
> > +    /* Read write and accept other writers. */
> > +    RAW_L_WRITE_SHARE_RW,
> > +    /* Read write and try to forbit other writers. */
> 
> s/forbit/forbid/
> 
> 
> >  
> > +static int raw_lock_fd(int fd, BDRVRawLockMode mode, Error **errp)
> > +{
> > +    int ret;
> > +    assert(fd >= 0);
> > +    /* Locking byte 1 avoids interfereing with virtlockd. */
> 
> s/interfereing/interfering/
> 
> 
> > +/**
> > + * Transactionally moving between possible locking states is tricky and must be
> > + * done carefully. That is mostly because downgrading an exclusive lock to
> > + * shared or unlocked is not guaranteed to be revertable. As a result, in such
> 
> s/revertable/revertible/
> 
> > + * cases we have to defer the downgraing to "commit", given that no revert will
> 
> s/downgraing/downgrading/
> 
> > + * happen after that point, and that downgrading a lock should never fail.
> > + *
> > + * On the other hand, upgrading a lock (e.g. from unlocked or shared to
> > + * exclusive lock) must happen in "prepare" because it may fail.
> > + *
> > + * Manage the operation matrix with this state transition table to make
> > + * fulfulling above conditions easier.
> 
> s/fulfulling/fulfilling/

Thanks, will fix these typos and misspells.

> 
> 
> > @@ -560,61 +1177,24 @@ static int raw_reopen_prepare(BDRVReopenState *state,
> >  
> >      raw_parse_flags(state->flags, &rs->open_flags);
> >  
> > -    rs->fd = -1;
> > -
> > -    int fcntl_flags = O_APPEND | O_NONBLOCK;
> > -#ifdef O_NOATIME
> > -    fcntl_flags |= O_NOATIME;
> > -#endif
> > -
> > -#ifdef O_ASYNC
> > -    /* Not all operating systems have O_ASYNC, and those that don't
> > -     * will not let us track the state into rs->open_flags (typically
> > -     * you achieve the same effect with an ioctl, for example I_SETSIG
> > -     * on Solaris). But we do not use O_ASYNC, so that's fine.
> > -     */
> > -    assert((s->open_flags & O_ASYNC) == 0);
> > -#endif
> 
> It looks like you are doing some code motion (refactoring into a helper
> function) mixed in with everything else; it might be worth splitting
> that into a separate commit for ease of review.

Yes, good idea.

Fam