On 10/31/2016 10:38 AM, Fam Zheng wrote:
> This implements open flag sensible image locking for local file
> and host device protocol.
> 
> virtlockd in libvirt locks the first byte, so we start looking at the
> file bytes from 1.

What happens if we try to use a raw file with less than 3 bytes? There's
not much to be locked in that case (especially if we round down to
sector sizes - the file is effectively empty) - but it's probably a
corner case you have to be aware of.

> 
> Quoting what was proposed by Kevin Wolf <kwolf@redhat.com>, there are
> four locking modes by combining two bits (BDRV_O_RDWR and
> BDRV_O_SHARE_RW), and implemented by taking two locks:
> 
> Lock bytes:
> 
> * byte 1: I can't allow other processes to write to the image
> * byte 2: I am writing to the image
> 
> Lock modes:
> 
> * shared writer (BDRV_O_RDWR | BDRV_O_SHARE_RW): Take shared lock on
>   byte 2. Test whether byte 1 is locked using an exclusive lock, and
>   fail if so.
> 
> * exclusive writer (BDRV_O_RDWR only): Take shared lock on byte 2. Test
>   whether byte 1 is locked using an exclusive lock, and fail if so. Then
>   take shared lock on byte 1. I suppose this is racy, but we can
>   probably tolerate that.
> 
> * reader that can tolerate writers (BDRV_O_SHARE_RW only): Don't do anything
> 
> * reader that can't tolerate writers (neither bit is set): Take shared
>   lock on byte 1. Test whether byte 2 is locked, and fail if so.
> 
> The complication is in the transactional reopen.  To make the reopen
> logic managable, and allow better reuse, the code is internally

s/managable/manageable/

> organized with a table from old mode to the new one.
> 
> Signed-off-by: Fam Zheng <famz@redhat.com>
> ---
>  block/raw-posix.c | 710 ++++++++++++++++++++++++++++++++++++++++++++++++++----
>  1 file changed, 660 insertions(+), 50 deletions(-)
> 

> +typedef enum {
> +    /* Read only and accept other writers. */
> +    RAW_L_READ_SHARE_RW,
> +    /* Read only and try to forbid other writers. */
> +    RAW_L_READ,
> +    /* Read write and accept other writers. */
> +    RAW_L_WRITE_SHARE_RW,
> +    /* Read write and try to forbit other writers. */

s/forbit/forbid/


>  
> +static int raw_lock_fd(int fd, BDRVRawLockMode mode, Error **errp)
> +{
> +    int ret;
> +    assert(fd >= 0);
> +    /* Locking byte 1 avoids interfereing with virtlockd. */

s/interfereing/interfering/


> +/**
> + * Transactionally moving between possible locking states is tricky and must be
> + * done carefully. That is mostly because downgrading an exclusive lock to
> + * shared or unlocked is not guaranteed to be revertable. As a result, in such

s/revertable/revertible/

> + * cases we have to defer the downgraing to "commit", given that no revert will

s/downgraing/downgrading/

> + * happen after that point, and that downgrading a lock should never fail.
> + *
> + * On the other hand, upgrading a lock (e.g. from unlocked or shared to
> + * exclusive lock) must happen in "prepare" because it may fail.
> + *
> + * Manage the operation matrix with this state transition table to make
> + * fulfulling above conditions easier.

s/fulfulling/fulfilling/


> @@ -560,61 +1177,24 @@ static int raw_reopen_prepare(BDRVReopenState *state,
>  
>      raw_parse_flags(state->flags, &rs->open_flags);
>  
> -    rs->fd = -1;
> -
> -    int fcntl_flags = O_APPEND | O_NONBLOCK;
> -#ifdef O_NOATIME
> -    fcntl_flags |= O_NOATIME;
> -#endif
> -
> -#ifdef O_ASYNC
> -    /* Not all operating systems have O_ASYNC, and those that don't
> -     * will not let us track the state into rs->open_flags (typically
> -     * you achieve the same effect with an ioctl, for example I_SETSIG
> -     * on Solaris). But we do not use O_ASYNC, so that's fine.
> -     */
> -    assert((s->open_flags & O_ASYNC) == 0);
> -#endif

It looks like you are doing some code motion (refactoring into a helper
function) mixed in with everything else; it might be worth splitting
that into a separate commit for ease of review.

-- 
Eric Blake   eblake redhat com    +1-919-301-3266
Libvirt virtualization library http://libvirt.org