On 31.10.2016 16:38, Fam Zheng wrote:
> This implements open flag sensible image locking for local file
> and host device protocol.
> 
> virtlockd in libvirt locks the first byte, so we start looking at the
> file bytes from 1.
> 
> Quoting what was proposed by Kevin Wolf <kwolf@redhat.com>, there are
> four locking modes by combining two bits (BDRV_O_RDWR and
> BDRV_O_SHARE_RW), and implemented by taking two locks:
> 
> Lock bytes:
> 
> * byte 1: I can't allow other processes to write to the image
> * byte 2: I am writing to the image
> 
> Lock modes:
> 
> * shared writer (BDRV_O_RDWR | BDRV_O_SHARE_RW): Take shared lock on
>   byte 2. Test whether byte 1 is locked using an exclusive lock, and
>   fail if so.

Testing whether something is locked would be easier by using F_OFD_GETLK
instead of actually creating an exclusive lock and then releasing it.
The reason why I proposed exclusive locks for this in the first place
was because I had the order reversed (first take the exclusive lock for
testing, then take the permanent shared lock, then release the exclusive
lock).

(And so you don't overlook it: You should be using F_OFD_GETLK, see [2])

> * exclusive writer (BDRV_O_RDWR only): Take shared lock on byte 2. Test
>   whether byte 1 is locked using an exclusive lock, and fail if so. Then
>   take shared lock on byte 1. I suppose this is racy, but we can
>   probably tolerate that.

As I said in my response to Kevin, we can make this non-racy in two ways:

(1) Just take an exclusive lock on both bytes.
(2) First start as an exclusive reader. Then upgrade the lock to that of
an exclusive writer by taking an exclusive lock on byte 2 and then
downgrading it to a shared lock.

The latter can be optimized by taking exclusive locks on both bytes and
then downgrading both to shared locks. But at that point you could just
as well do (1).

> * reader that can tolerate writers (BDRV_O_SHARE_RW only): Don't do anything
> 
> * reader that can't tolerate writers (neither bit is set): Take shared
>   lock on byte 1. Test whether byte 2 is locked, and fail if so.
> 
> The complication is in the transactional reopen.  To make the reopen
> logic managable, and allow better reuse, the code is internally
> organized with a table from old mode to the new one.
> 
> Signed-off-by: Fam Zheng <famz@redhat.com>
> ---
>  block/raw-posix.c | 710 ++++++++++++++++++++++++++++++++++++++++++++++++++----
>  1 file changed, 660 insertions(+), 50 deletions(-)
> 
> diff --git a/block/raw-posix.c b/block/raw-posix.c
> index 7c62fc3..07ab117 100644
> --- a/block/raw-posix.c
> +++ b/block/raw-posix.c
> @@ -131,8 +131,44 @@ do { \
>  
>  #define MAX_BLOCKSIZE	4096
>  
> +/* Posix file locking bytes. Libvirt takes byte 0, so start from byte 1. */
> +#define RAW_LOCK_BYTE_MIN 1
> +#define RAW_LOCK_BYTE_NO_OTHER_WRITER 1
> +#define RAW_LOCK_BYTE_WRITE     2
> +#define RAW_LOCK_BYTE_MAX 2
> +
> +/*
> + ** shared writer: Take shared lock on byte 2. Test whether byte 1 is
> + *  locked using an exclusive lock, and fail if so.
> + *
> + ** exclusive writer: Take shared lock on byte 2. Test whether byte 1 is
> + *  locked using an exclusive lock, and fail if so. Then take shared lock
> + *  on byte 1. I suppose this is racy, but we can probably tolerate that.
> + *
> + ** reader that can tolerate writers: Don't do anything
> + *
> + ** reader that can't tolerate writers: Take shared lock on byte 1. Test
> + *  whether byte 2 is locked, and fail if so.
> + */

I guess it would make sense to bring this comment into the same order as
the names are given in BDRVRawLockMode.

> +
> +typedef enum {
> +    /* Read only and accept other writers. */
> +    RAW_L_READ_SHARE_RW,
> +    /* Read only and try to forbid other writers. */
> +    RAW_L_READ,
> +    /* Read write and accept other writers. */
> +    RAW_L_WRITE_SHARE_RW,
> +    /* Read write and try to forbit other writers. */
> +    RAW_L_WRITE,
> +} BDRVRawLockMode;
> +
>  typedef struct BDRVRawState {
>      int fd;
> +    /* A dup of @fd to make manipulating lock easier, especially during reopen,
> +     * where this will accept BDRVRawReopenState.lock_fd. */
> +    int lock_fd;
> +    bool disable_lock;
> +    bool lock_on_invalidate;
>      int type;
>      int open_flags;
>      size_t buf_align;
> @@ -146,10 +182,13 @@ typedef struct BDRVRawState {
>      bool use_linux_aio:1;
>      bool has_fallocate;
>      bool needs_alignment;
> +    BDRVRawLockMode cur_lock_mode;
>  } BDRVRawState;
>  
>  typedef struct BDRVRawReopenState {
>      int fd;
> +    /* A dup of @fd used for acquiring lock. */
> +    int lock_fd;
>      int open_flags;
>  } BDRVRawReopenState;
>  
> @@ -368,6 +407,77 @@ static void raw_parse_flags(int bdrv_flags, int *open_flags)
>      }
>  }
>  
> +static int raw_lock_fd(int fd, BDRVRawLockMode mode, Error **errp)
> +{
> +    int ret;
> +    assert(fd >= 0);
> +    /* Locking byte 1 avoids interfereing with virtlockd. */

This comment should be superseded by the one above the definition of
RAW_LOCK_BYTE_*.

> +    switch (mode) {
> +    case RAW_L_READ_SHARE_RW:
> +        ret = qemu_unlock_fd(fd, RAW_LOCK_BYTE_MIN,
> +                             RAW_LOCK_BYTE_MAX - RAW_LOCK_BYTE_MIN + 1);
> +        if (ret) {
> +            error_setg_errno(errp, errno, "Failed to unlock fd");

Why not use -ret instead of errno?

> +            goto fail;
> +        }
> +        break;
> +    case RAW_L_READ:
> +        ret = qemu_lock_fd(fd, RAW_LOCK_BYTE_NO_OTHER_WRITER, 1, false);
> +        if (ret) {
> +            error_setg_errno(errp, errno, "Failed to lock share byte");
> +            goto fail;
> +        }
> +        ret = qemu_lock_fd(fd, RAW_LOCK_BYTE_WRITE, 1, true);
> +        if (ret) {
> +            error_setg_errno(errp, errno, "Failed to lock write byte exclusively");

I think we should drop errno in favor of a more useful hint like "(maybe
already opened R/W by a different process?)", but that seems a bit long
to include...

> +            goto fail;
> +        }
> +        qemu_unlock_fd(fd, RAW_LOCK_BYTE_WRITE, 1);
> +        break;
> +    case RAW_L_WRITE_SHARE_RW:
> +        ret = qemu_lock_fd(fd, RAW_LOCK_BYTE_WRITE, 1, false);
> +        if (ret) {
> +            error_setg_errno(errp, errno, "Failed to lock write byte");
> +            goto fail;
> +        }
> +        ret = qemu_lock_fd(fd, RAW_LOCK_BYTE_NO_OTHER_WRITER, 1, true);
> +        if (ret) {
> +            error_setg_errno(errp, errno, "Failed to lock share byte exclusively");
> +            goto fail;
> +        }
> +        ret = qemu_unlock_fd(fd, RAW_LOCK_BYTE_NO_OTHER_WRITER, 1);
> +        if (ret) {
> +            error_setg_errno(errp, errno, "Failed to unlock share byte");
> +            goto fail;
> +        }
> +        break;
> +    case RAW_L_WRITE:
> +        ret = qemu_lock_fd(fd, RAW_LOCK_BYTE_WRITE, 1, false);
> +        if (ret) {
> +            error_setg_errno(errp, errno, "Failed to lock write byte");
> +            goto fail;
> +        }
> +        ret = qemu_lock_fd(fd, RAW_LOCK_BYTE_NO_OTHER_WRITER, 1, true);
> +        if (ret) {
> +            error_setg_errno(errp, errno, "Failed to lock share byte exclusively");
> +            goto fail;
> +        }
> +        ret = qemu_lock_fd(fd, RAW_LOCK_BYTE_NO_OTHER_WRITER, 1, false);
> +        if (ret) {
> +            error_setg_errno(errp, errno, "Failed to downgrade share byte");
> +            goto fail;
> +        }
> +        break;
> +    default:
> +        abort();
> +    }
> +    return 0;
> +fail:
> +    qemu_unlock_fd(fd, RAW_LOCK_BYTE_NO_OTHER_WRITER, 1);
> +    qemu_unlock_fd(fd, RAW_LOCK_BYTE_WRITE, 1);
> +    return -errno;

qemu_unlock_fd() may overwrite errno. Not sure if that is intended.
Also, just returning ret would be fine.

> +}
> +
>  static void raw_parse_filename(const char *filename, QDict *options,
>                                 Error **errp)
>  {
> @@ -393,10 +503,115 @@ static QemuOptsList raw_runtime_opts = {
>              .type = QEMU_OPT_STRING,
>              .help = "host AIO implementation (threads, native)",
>          },
> +        {
> +            .name = "disable-lock",
> +            .type = QEMU_OPT_BOOL,
> +            .help = "don't lock the file",
> +        },
>          { /* end of list */ }
>      },
>  };
>  
> +static BDRVRawLockMode raw_get_lock_mode(int flags)
> +{
> +    switch (flags & (BDRV_O_RDWR | BDRV_O_SHARE_RW)) {
> +    case BDRV_O_RDWR:
> +        return RAW_L_WRITE;
> +    case BDRV_O_RDWR | BDRV_O_SHARE_RW:
> +        return RAW_L_WRITE_SHARE_RW;
> +    case BDRV_O_SHARE_RW:
> +        return RAW_L_READ_SHARE_RW;
> +    case 0:
> +        return RAW_L_READ;
> +    default:
> +        abort();
> +    }
> +}
> +
> +static int raw_dup_flags(int fd, int old_flags, int new_flags,
> +                         const char *filename, Error **errp)
> +{
> +    int ret = -1;
> +    int fcntl_flags = O_APPEND | O_NONBLOCK;
> +#ifdef O_NOATIME
> +    fcntl_flags |= O_NOATIME;
> +#endif
> +
> +#ifdef O_ASYNC
> +    /* Not all operating systems have O_ASYNC, and those that don't
> +     * will not let us track the state into rs->open_flags (typically
> +     * you achieve the same effect with an ioctl, for example I_SETSIG
> +     * on Solaris). But we do not use O_ASYNC, so that's fine.
> +     */
> +    assert((old_flags & O_ASYNC) == 0);
> +#endif
> +
> +    if ((new_flags & ~fcntl_flags) == (old_flags & ~fcntl_flags)) {
> +        /* dup the original fd */
> +        ret = qemu_dup(fd);
> +        if (ret >= 0) {
> +            if (fcntl_setfl(ret, new_flags)) {
> +                int new_fd = ret;
> +                ret = -errno;
> +                qemu_close(new_fd);
> +            }
> +        }
> +    }
> +
> +    /* If we cannot use fcntl, or fcntl failed, fall back to qemu_open() */
> +    if (ret < 0) {
> +        const char *normalized_filename = filename;
> +        ret = raw_normalize_devicepath(&normalized_filename);
> +        if (ret < 0) {
> +            error_setg_errno(errp, -ret, "Could not normalize device path");
> +        } else {
> +            assert(!(new_flags & O_CREAT));
> +            ret = qemu_open(normalized_filename, new_flags);
> +            if (ret == -1) {
> +                error_setg_errno(errp, errno, "Could not open file with new flags");
> +                ret = -errno;

I personally prefer it the other way round (ret = -errno;
error_setg_errno(errp, -ret, ...);) but error_setg_errno() now saves
errno, it's not wrong this way.

> +            }
> +        }
> +    }
> +    return ret;
> +}
> +
> +static int raw_lock_image(BlockDriverState *bs, int bdrv_flags, Error **errp)
> +{
> +    int ret;
> +    BDRVRawState *s = bs->opaque;
> +    BDRVRawLockMode lock_mode;
> +
> +    if (bdrv_flags & BDRV_O_INACTIVE) {
> +        s->disable_lock = true;
> +        s->lock_on_invalidate = true;
> +    }

Ah, I see, that's interesting (regarding my comments on when and how an
inactive BDS should be shared).

Anyway, using s->disable_lock both for the user-supplied option and for
internal stuff seems wrong. It's not so bad here, but it seems wrong in
raw_invalidate_cache().

> +    if (!s->disable_lock) {
> +        lock_mode = raw_get_lock_mode(bdrv_flags);
> +        if (!(bdrv_flags & BDRV_O_RDWR) && access(bs->filename, W_OK) != 0) {
> +            s->disable_lock = true;
> +        }

What is this code for (the above three lines)? Why do you disable the
lock if the image cannot be written to and it's opened read-only?

I suppose it's because of [1], but: What if it's just this process that
cannot write to it but others (e.g. invoked by root) can? Then this is
not a no-op.

> +    }
> +    if (!s->disable_lock && lock_mode != RAW_L_READ_SHARE_RW) {
> +        int lock_flags = s->open_flags;
> +        if (!(bdrv_flags & BDRV_O_SHARE_RW)) {
> +            lock_flags |= O_RDWR;
> +        }

This will result in the next function call trying to open an image R/W,
even if the original image was opened in read-only mode. That doesn't
seem quite right to me.

[1] So I guess above you are setting disable_lock to true so that
raw_dup_flags() will at least succeed.

In any case, it seems wrong to me to open the image R/W if the user has
requested read-only, even if it succeeds and even if we are not actually
writing to it.

[2] I guess this is required because otherwise we can't place exclusive
locks? Would using F_OFD_GETLK instead of trying to place an exclusive
lock fix this issue? (A quick test on my part says: Yes, it would;
F_OFD_GETLK ignores permissions and just seems to care about whether
there already is a lock.)

> +        ret = raw_dup_flags(s->fd, s->open_flags, lock_flags, bs->filename,

I'd propose using bs->exact_filename instead of bs->filename, because
the former is guaranteed not to be a json: filename.

While for raw-posix you are guaranteed to never to have a json: filename
in bs->filename, it's still a bit more clear.

> +                                   errp);

The indentation is off.

> +        if (ret < 0) {
> +            return ret;
> +        }
> +        s->lock_fd = ret;
> +        ret = raw_lock_fd(s->lock_fd, lock_mode, errp);
> +        if (ret) {
> +            return ret;
> +        }
> +        s->cur_lock_mode = lock_mode;
> +    }
> +    return 0;
> +}
> +
>  static int raw_open_common(BlockDriverState *bs, QDict *options,
>                             int bdrv_flags, int open_flags, Error **errp)
>  {
> @@ -440,6 +655,7 @@ static int raw_open_common(BlockDriverState *bs, QDict *options,
>      raw_parse_flags(bdrv_flags, &s->open_flags);
>  
>      s->fd = -1;
> +    s->lock_fd = -1;
>      fd = qemu_open(filename, s->open_flags, 0644);
>      if (fd < 0) {
>          ret = -errno;
> @@ -451,6 +667,15 @@ static int raw_open_common(BlockDriverState *bs, QDict *options,
>      }
>      s->fd = fd;
>  
> +    s->disable_lock = qemu_opt_get_bool(opts, "disable-lock", false);
> +
> +    if (!s->disable_lock) {
> +        ret = raw_lock_image(bs, bdrv_flags, errp);
> +        if (ret) {
> +            goto fail;
> +        }
> +    }
> +
>  #ifdef CONFIG_LINUX_AIO
>       /* Currently Linux does AIO only for files opened with O_DIRECT */
>      if (s->use_linux_aio && !(s->open_flags & O_DIRECT)) {

Pausing review here because it's already 4 am...

(I'll try to look at it later today, after I've had some sleep.)

> @@ -538,6 +763,398 @@ static int raw_open(BlockDriverState *bs, QDict *options, int flags,

[...]

> @@ -1832,6 +2420,27 @@ static int raw_get_info(BlockDriverState *bs, BlockDriverInfo *bdi)

And continuing here.

>      return 0;
>  }
>  
> +static int raw_inactivate(BlockDriverState *bs)
> +{
> +    BDRVRawState *s = bs->opaque;
> +    int r = 0;
> +
> +    if (s->cur_lock_mode != RAW_L_READ_SHARE_RW) {
> +        r = raw_lock_fd(s->lock_fd, RAW_L_READ_SHARE_RW, NULL);
> +    }
> +    return r;
> +}
> +
> +static void raw_invalidate_cache(BlockDriverState *bs, Error **errp)
> +{
> +    BDRVRawState *s = bs->opaque;
> +
> +    if (s->lock_on_invalidate) {
> +        s->disable_lock = false;

As I hinted at above, this looks wrong. If the user has specified that
this image should be unlocked, it should be unlocked.

Max

> +        raw_lock_image(bs, bdrv_get_flags(bs), errp);
> +    }
> +}
> +
>  static QemuOptsList raw_create_opts = {
>      .name = "raw-create-opts",
>      .head = QTAILQ_HEAD_INITIALIZER(raw_create_opts.head),
> @@ -1885,7 +2494,8 @@ BlockDriver bdrv_file = {
>      .bdrv_get_info = raw_get_info,
>      .bdrv_get_allocated_file_size
>                          = raw_get_allocated_file_size,
> -
> +    .bdrv_inactivate = raw_inactivate,
> +    .bdrv_invalidate_cache = raw_invalidate_cache,
>      .create_opts = &raw_create_opts,
>  };
>  
>