All of lore.kernel.org
 help / color / mirror / Atom feed
* [Qemu-devel] Safely reopening image files by stashing fds
@ 2011-08-05  8:40 Stefan Hajnoczi
  2011-08-05  9:04 ` Paolo Bonzini
                   ` (2 more replies)
  0 siblings, 3 replies; 39+ messages in thread
From: Stefan Hajnoczi @ 2011-08-05  8:40 UTC (permalink / raw)
  To: Supriya Kannery; +Cc: Kevin Wolf, Anthony Liguori, qemu-devel

We've discussed safe methods for reopening image files (e.g. useful for
changing the hostcache parameter).  The problem is that closing the file first
and then opening it again exposes us to the error case where the open fails.
At that point we cannot get to the file anymore and our options are to
terminate QEMU, pause the VM, or offline the block device.

This window of vulnerability can be eliminated by keeping the file descriptor
around and falling back to it should the open fail.

The challenge for the file descriptor approach is that image formats, like
VMDK, can span multiple files.  Therefore the solution is not as simple as
stashing a single file descriptor and reopening from it.

Here is the outline for an fd stashing mechanism that can handle reopening
multi-file images and could also solve the file descriptor passing problem for
libvirt:

1. Extract monitor getfd/closefd functionality

The monitor already supports fd stashing with getfd/closefd commands.  But the
fd stash code is part of Monitor and we need to extract it into its own object.

/* A stashed file descriptor */
typedef FDEntry {
	const char *name;
	int fd;
	QLIST_ENTRY(FDEntry) next;
} FDEntry;

/* A container for stashing file descriptors */
typedef struct FDStash {
	QLIST_HEAD(, FDEntry) fds;
} FDStash;

void fdstash_init(FDStash *stash);

/**
 * Clear stashed file descriptors and close them
 */
void fdstash_cleanup(FDStash *stash);

/**
 * Stash a file descriptor and give up ownership
 *
 * If a file descriptor is already present with the same name the old fd is
 * closed and replaced by the new one.
 */
void fdstash_give(FDStash *stash, const char *name, int fd);

/**
 * Find and take ownership of a stashed file descriptor
 *
 * Return the file descriptor or -ENOENT if not found.
 */
int fdstash_take(FDStash *stash, const char *name);

The monitor is refactored to use this code instead of open coding fd stashing.

2. Introduce a function to extract open file descriptors from an block device

Add a new .bdrv_extract_fds(BlockDriverState *bs, FDStash *stash) interface,
which defaults to calling bdrv_extract_fds(bs->file, stash).

VMDK and protocols can implement this function to support extracting open fds
from a block device.  Note that they need to dup(2) fds before giving them to
the fdstash, otherwise the fd will be closed when the block device is
closed/deleted.

3. Rework bdrv_open() to take a FDStash

Check the FDStash before opening an image file on the host file system.  This
makes it possible to open an image file and use existing stashed fds.

4. Implement bdrv_reopen()

First call bdrv_extract_fds() to stash the file descriptors, then close the
block device.  Try opening the new image but if that fails, reopen using the
stashed file descriptors.

Thoughts?

Stefan

^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: [Qemu-devel] Safely reopening image files by stashing fds
  2011-08-05  8:40 [Qemu-devel] Safely reopening image files by stashing fds Stefan Hajnoczi
@ 2011-08-05  9:04 ` Paolo Bonzini
  2011-08-05  9:27   ` Stefan Hajnoczi
  2011-08-05  9:07 ` Kevin Wolf
  2011-08-05 20:16 ` Blue Swirl
  2 siblings, 1 reply; 39+ messages in thread
From: Paolo Bonzini @ 2011-08-05  9:04 UTC (permalink / raw)
  To: Stefan Hajnoczi; +Cc: Kevin Wolf, Supriya Kannery, Anthony Liguori, qemu-devel

On 08/05/2011 10:40 AM, Stefan Hajnoczi wrote:
> 4. Implement bdrv_reopen()
>
> First call bdrv_extract_fds() to stash the file descriptors, then close the
> block device.  Try opening the new image but if that fails, reopen using the
> stashed file descriptors.

Why not do the latter unconditionally?

Paolo

^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: [Qemu-devel] Safely reopening image files by stashing fds
  2011-08-05  8:40 [Qemu-devel] Safely reopening image files by stashing fds Stefan Hajnoczi
  2011-08-05  9:04 ` Paolo Bonzini
@ 2011-08-05  9:07 ` Kevin Wolf
  2011-08-05  9:29   ` Stefan Hajnoczi
  2011-08-05 20:16 ` Blue Swirl
  2 siblings, 1 reply; 39+ messages in thread
From: Kevin Wolf @ 2011-08-05  9:07 UTC (permalink / raw)
  To: Stefan Hajnoczi; +Cc: Supriya Kannery, Anthony Liguori, qemu-devel

Am 05.08.2011 10:40, schrieb Stefan Hajnoczi:
> We've discussed safe methods for reopening image files (e.g. useful for
> changing the hostcache parameter).  The problem is that closing the file first
> and then opening it again exposes us to the error case where the open fails.
> At that point we cannot get to the file anymore and our options are to
> terminate QEMU, pause the VM, or offline the block device.
> 
> This window of vulnerability can be eliminated by keeping the file descriptor
> around and falling back to it should the open fail.
> 
> The challenge for the file descriptor approach is that image formats, like
> VMDK, can span multiple files.  Therefore the solution is not as simple as
> stashing a single file descriptor and reopening from it.

So far I agree. The rest I believe is wrong because you can't assume
that every backend uses file descriptors. The qemu block layer is based
on BlockDriverStates, not fds. They are a concept that should be hidden
in raw-posix.

I think something like this could do:

struct BDRVReopenState {
    BlockDriverState *bs;
    /* can be extended by block drivers */
};

.bdrv_reopen(BlockDriverState *bs, BDRVReopenState **reopen_state, int
flags);
.bdrv_reopen_commit(BDRVReopenState *reopen_state);
.bdrv_reopen_abort(BDRVReopenState *reopen_state);

raw-posix would store the old file descriptor in its reopen_state. On
commit, it closes the old descriptors, on abort it reverts to the old
one and closes the newly opened one.

Makes things a bit more complicated than the simple bdrv_reopen I had in
mind before, but it allows VMDK to get an all-or-nothing semantics.

Kevin

^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: [Qemu-devel] Safely reopening image files by stashing fds
  2011-08-05  9:04 ` Paolo Bonzini
@ 2011-08-05  9:27   ` Stefan Hajnoczi
  2011-08-05  9:55     ` Paolo Bonzini
                       ` (2 more replies)
  0 siblings, 3 replies; 39+ messages in thread
From: Stefan Hajnoczi @ 2011-08-05  9:27 UTC (permalink / raw)
  To: Paolo Bonzini; +Cc: Kevin Wolf, Supriya Kannery, Anthony Liguori, qemu-devel

On Fri, Aug 5, 2011 at 10:04 AM, Paolo Bonzini <pbonzini@redhat.com> wrote:
> On 08/05/2011 10:40 AM, Stefan Hajnoczi wrote:
>>
>> 4. Implement bdrv_reopen()
>>
>> First call bdrv_extract_fds() to stash the file descriptors, then close
>> the
>> block device.  Try opening the new image but if that fails, reopen using
>> the
>> stashed file descriptors.
>
> Why not do the latter unconditionally?

Because you cannot change O_DIRECT on an open fd :(.  This is why
we're going through this pain.

The only method I've found that works is to open("/proc/self/fd/X",
new_flags) but that's non-portable.

Stefan

^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: [Qemu-devel] Safely reopening image files by stashing fds
  2011-08-05  9:07 ` Kevin Wolf
@ 2011-08-05  9:29   ` Stefan Hajnoczi
  2011-08-05  9:48     ` Kevin Wolf
  0 siblings, 1 reply; 39+ messages in thread
From: Stefan Hajnoczi @ 2011-08-05  9:29 UTC (permalink / raw)
  To: Kevin Wolf; +Cc: Supriya Kannery, Anthony Liguori, qemu-devel

On Fri, Aug 5, 2011 at 10:07 AM, Kevin Wolf <kwolf@redhat.com> wrote:
> Am 05.08.2011 10:40, schrieb Stefan Hajnoczi:
>> We've discussed safe methods for reopening image files (e.g. useful for
>> changing the hostcache parameter).  The problem is that closing the file first
>> and then opening it again exposes us to the error case where the open fails.
>> At that point we cannot get to the file anymore and our options are to
>> terminate QEMU, pause the VM, or offline the block device.
>>
>> This window of vulnerability can be eliminated by keeping the file descriptor
>> around and falling back to it should the open fail.
>>
>> The challenge for the file descriptor approach is that image formats, like
>> VMDK, can span multiple files.  Therefore the solution is not as simple as
>> stashing a single file descriptor and reopening from it.
>
> So far I agree. The rest I believe is wrong because you can't assume
> that every backend uses file descriptors. The qemu block layer is based
> on BlockDriverStates, not fds. They are a concept that should be hidden
> in raw-posix.
>
> I think something like this could do:
>
> struct BDRVReopenState {
>    BlockDriverState *bs;
>    /* can be extended by block drivers */
> };
>
> .bdrv_reopen(BlockDriverState *bs, BDRVReopenState **reopen_state, int
> flags);
> .bdrv_reopen_commit(BDRVReopenState *reopen_state);
> .bdrv_reopen_abort(BDRVReopenState *reopen_state);
>
> raw-posix would store the old file descriptor in its reopen_state. On
> commit, it closes the old descriptors, on abort it reverts to the old
> one and closes the newly opened one.
>
> Makes things a bit more complicated than the simple bdrv_reopen I had in
> mind before, but it allows VMDK to get an all-or-nothing semantics.

Can you show how bdrv_reopen() would use these new interfaces?  I'm
not 100% clear on the idea.

Stefan

^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: [Qemu-devel] Safely reopening image files by stashing fds
  2011-08-05  9:29   ` Stefan Hajnoczi
@ 2011-08-05  9:48     ` Kevin Wolf
  2011-08-08 14:49       ` Stefan Hajnoczi
  0 siblings, 1 reply; 39+ messages in thread
From: Kevin Wolf @ 2011-08-05  9:48 UTC (permalink / raw)
  To: Stefan Hajnoczi; +Cc: Supriya Kannery, Anthony Liguori, qemu-devel

Am 05.08.2011 11:29, schrieb Stefan Hajnoczi:
> On Fri, Aug 5, 2011 at 10:07 AM, Kevin Wolf <kwolf@redhat.com> wrote:
>> Am 05.08.2011 10:40, schrieb Stefan Hajnoczi:
>>> We've discussed safe methods for reopening image files (e.g. useful for
>>> changing the hostcache parameter).  The problem is that closing the file first
>>> and then opening it again exposes us to the error case where the open fails.
>>> At that point we cannot get to the file anymore and our options are to
>>> terminate QEMU, pause the VM, or offline the block device.
>>>
>>> This window of vulnerability can be eliminated by keeping the file descriptor
>>> around and falling back to it should the open fail.
>>>
>>> The challenge for the file descriptor approach is that image formats, like
>>> VMDK, can span multiple files.  Therefore the solution is not as simple as
>>> stashing a single file descriptor and reopening from it.
>>
>> So far I agree. The rest I believe is wrong because you can't assume
>> that every backend uses file descriptors. The qemu block layer is based
>> on BlockDriverStates, not fds. They are a concept that should be hidden
>> in raw-posix.
>>
>> I think something like this could do:
>>
>> struct BDRVReopenState {
>>    BlockDriverState *bs;
>>    /* can be extended by block drivers */
>> };
>>
>> .bdrv_reopen(BlockDriverState *bs, BDRVReopenState **reopen_state, int
>> flags);
>> .bdrv_reopen_commit(BDRVReopenState *reopen_state);
>> .bdrv_reopen_abort(BDRVReopenState *reopen_state);
>>
>> raw-posix would store the old file descriptor in its reopen_state. On
>> commit, it closes the old descriptors, on abort it reverts to the old
>> one and closes the newly opened one.
>>
>> Makes things a bit more complicated than the simple bdrv_reopen I had in
>> mind before, but it allows VMDK to get an all-or-nothing semantics.
> 
> Can you show how bdrv_reopen() would use these new interfaces?  I'm
> not 100% clear on the idea.

Well, you wouldn't only call bdrv_reopen, but also either
bdrv_reopen_commit/abort (for the top-level caller we can have a wrapper
function that does both, but that's syntactic sugar).

For example we would have:

int vmdk_reopen()
{
    *((VMDKReopenState**) rs) = malloc();

    foreach (extent in s->extents) {
        ret = bdrv_reopen(extent->file, &extent->reopen_state)
        if (ret < 0)
            goto fail;
    }
    return 0;

fail:
    foreach (extent in rs->already_reopened) {
        bdrv_reopen_abort(extent->reopen_state);
    }
    return ret;
}

void vmdk_reopen_commit()
{
    foreach (extent in s->extents) {
        bdrv_reopen_commit(extent->reopen_state);
    }
    free(rs);
}

void vmdk_reopen_abort()
{
    foreach (extent in s->extents) {
        bdrv_reopen_abort(extent->reopen_state);
    }
    free(rs);
}

The top-level caller, which isn't a block driver, but just wants to have
an image reopened, will do something like this (as I said, this should
probably be a wrapper function in block.c):

BDRVReopenState *rs;
ret = bdrv_reopen(bs, &rs);
if (ret < 0) {
    goto fail;
}
bdrv_reopen_commit(rs);

Kevin

^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: [Qemu-devel] Safely reopening image files by stashing fds
  2011-08-05  9:27   ` Stefan Hajnoczi
@ 2011-08-05  9:55     ` Paolo Bonzini
  2011-08-05 13:03       ` Stefan Hajnoczi
  2011-08-05 13:12     ` Daniel P. Berrange
  2011-08-05 14:27     ` Christoph Hellwig
  2 siblings, 1 reply; 39+ messages in thread
From: Paolo Bonzini @ 2011-08-05  9:55 UTC (permalink / raw)
  To: Stefan Hajnoczi; +Cc: Kevin Wolf, Supriya Kannery, Anthony Liguori, qemu-devel

On 08/05/2011 11:27 AM, Stefan Hajnoczi wrote:
>>> First call bdrv_extract_fds() to stash the file descriptors, then close
>>> the block device.  Try opening the new image but if that fails, reopen using
>>> the stashed file descriptors.
>>
>> Why not do the latter unconditionally?
>
> Because you cannot change O_DIRECT on an open fd:(.  This is why
> we're going through this pain.
>
> The only method I've found that works is to open("/proc/self/fd/X",
> new_flags) but that's non-portable.

Maybe I'm missing something obvious, but so is O_DIRECT, no? :)

So for Linux you can dup the stashed file descriptors using 
/proc/self/fd and change flags directly, and for other OSes you can dup 
them using dup2 and change flags with F_SETFL.  In any case, reopening 
can always be done using the stashed descriptors (or BlockDriverStates).

Paolo

^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: [Qemu-devel] Safely reopening image files by stashing fds
  2011-08-05  9:55     ` Paolo Bonzini
@ 2011-08-05 13:03       ` Stefan Hajnoczi
  0 siblings, 0 replies; 39+ messages in thread
From: Stefan Hajnoczi @ 2011-08-05 13:03 UTC (permalink / raw)
  To: Paolo Bonzini; +Cc: Kevin Wolf, Supriya Kannery, Anthony Liguori, qemu-devel

On Fri, Aug 5, 2011 at 10:55 AM, Paolo Bonzini <pbonzini@redhat.com> wrote:
> On 08/05/2011 11:27 AM, Stefan Hajnoczi wrote:
>>>>
>>>> First call bdrv_extract_fds() to stash the file descriptors, then close
>>>> the block device.  Try opening the new image but if that fails, reopen
>>>> using
>>>> the stashed file descriptors.
>>>
>>> Why not do the latter unconditionally?
>>
>> Because you cannot change O_DIRECT on an open fd:(.  This is why
>> we're going through this pain.
>>
>> The only method I've found that works is to open("/proc/self/fd/X",
>> new_flags) but that's non-portable.
>
> Maybe I'm missing something obvious, but so is O_DIRECT, no? :)

http://www.freebsd.org/cgi/man.cgi?query=open&apropos=0&sektion=0&manpath=FreeBSD+8.1-RELEASE&format=html

> So for Linux you can dup the stashed file descriptors using /proc/self/fd
> and change flags directly, and for other OSes you can dup them using dup2
> and change flags with F_SETFL.  In any case, reopening can always be done
> using the stashed descriptors (or BlockDriverStates).

F_SETFL doesn't let you change arbitrary flags.

Stefan

^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: [Qemu-devel] Safely reopening image files by stashing fds
  2011-08-05  9:27   ` Stefan Hajnoczi
  2011-08-05  9:55     ` Paolo Bonzini
@ 2011-08-05 13:12     ` Daniel P. Berrange
  2011-08-05 14:28       ` Christoph Hellwig
  2011-08-05 14:27     ` Christoph Hellwig
  2 siblings, 1 reply; 39+ messages in thread
From: Daniel P. Berrange @ 2011-08-05 13:12 UTC (permalink / raw)
  To: Stefan Hajnoczi
  Cc: Kevin Wolf, Paolo Bonzini, Anthony Liguori, qemu-devel, Supriya Kannery

On Fri, Aug 05, 2011 at 10:27:00AM +0100, Stefan Hajnoczi wrote:
> On Fri, Aug 5, 2011 at 10:04 AM, Paolo Bonzini <pbonzini@redhat.com> wrote:
> > On 08/05/2011 10:40 AM, Stefan Hajnoczi wrote:
> >>
> >> 4. Implement bdrv_reopen()
> >>
> >> First call bdrv_extract_fds() to stash the file descriptors, then close
> >> the
> >> block device.  Try opening the new image but if that fails, reopen using
> >> the
> >> stashed file descriptors.
> >
> > Why not do the latter unconditionally?
> 
> Because you cannot change O_DIRECT on an open fd :(.  This is why
> we're going through this pain.

Hmm, I remember hearing that before, but looking at the current fcntl()
manpage, it claims you *can* change O_DIRECT using SET_FL. Perhaps this
is a newish feature, but it'd be nicer to use it if possible ?

  [man 2 fcntl]
       F_SETFL (long)
              Set  the  file status flags to the value specified by arg.  File
              access mode (O_RDONLY, O_WRONLY, O_RDWR) and file creation flags
              (i.e.,  O_CREAT,  O_EXCL, O_NOCTTY, O_TRUNC) in arg are ignored.
              On Linux this command can only  change  the  O_APPEND,  O_ASYNC,
              O_DIRECT, O_NOATIME, and O_NONBLOCK flags.
  [/man]

Regards,
Daniel
-- 
|: http://berrange.com      -o-    http://www.flickr.com/photos/dberrange/ :|
|: http://libvirt.org              -o-             http://virt-manager.org :|
|: http://autobuild.org       -o-         http://search.cpan.org/~danberr/ :|
|: http://entangle-photo.org       -o-       http://live.gnome.org/gtk-vnc :|

^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: [Qemu-devel] Safely reopening image files by stashing fds
  2011-08-05  9:27   ` Stefan Hajnoczi
  2011-08-05  9:55     ` Paolo Bonzini
  2011-08-05 13:12     ` Daniel P. Berrange
@ 2011-08-05 14:27     ` Christoph Hellwig
  2 siblings, 0 replies; 39+ messages in thread
From: Christoph Hellwig @ 2011-08-05 14:27 UTC (permalink / raw)
  To: Stefan Hajnoczi
  Cc: Kevin Wolf, Paolo Bonzini, Anthony Liguori, qemu-devel, Supriya Kannery

On Fri, Aug 05, 2011 at 10:27:00AM +0100, Stefan Hajnoczi wrote:
> > Why not do the latter unconditionally?
> 
> Because you cannot change O_DIRECT on an open fd :(.  This is why
> we're going through this pain.

You can.  What you can't right now is O_SYNC, but we're going to change
that.

^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: [Qemu-devel] Safely reopening image files by stashing fds
  2011-08-05 13:12     ` Daniel P. Berrange
@ 2011-08-05 14:28       ` Christoph Hellwig
  2011-08-05 15:24         ` Stefan Hajnoczi
  0 siblings, 1 reply; 39+ messages in thread
From: Christoph Hellwig @ 2011-08-05 14:28 UTC (permalink / raw)
  To: Daniel P. Berrange
  Cc: Kevin Wolf, Supriya Kannery, Anthony Liguori, Stefan Hajnoczi,
	qemu-devel, Paolo Bonzini

On Fri, Aug 05, 2011 at 02:12:48PM +0100, Daniel P. Berrange wrote:
> > Because you cannot change O_DIRECT on an open fd :(.  This is why
> > we're going through this pain.
> 
> Hmm, I remember hearing that before, but looking at the current fcntl()
> manpage, it claims you *can* change O_DIRECT using SET_FL. Perhaps this
> is a newish feature, but it'd be nicer to use it if possible ?

It's been there since day 1 of O_DIRECT support.

^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: [Qemu-devel] Safely reopening image files by stashing fds
  2011-08-05 14:28       ` Christoph Hellwig
@ 2011-08-05 15:24         ` Stefan Hajnoczi
  2011-08-05 15:43           ` Kevin Wolf
  0 siblings, 1 reply; 39+ messages in thread
From: Stefan Hajnoczi @ 2011-08-05 15:24 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: Kevin Wolf, Supriya Kannery, Anthony Liguori, qemu-devel, Paolo Bonzini

On Fri, Aug 5, 2011 at 3:28 PM, Christoph Hellwig <hch@lst.de> wrote:
> On Fri, Aug 05, 2011 at 02:12:48PM +0100, Daniel P. Berrange wrote:
>> > Because you cannot change O_DIRECT on an open fd :(.  This is why
>> > we're going through this pain.
>>
>> Hmm, I remember hearing that before, but looking at the current fcntl()
>> manpage, it claims you *can* change O_DIRECT using SET_FL. Perhaps this
>> is a newish feature, but it'd be nicer to use it if possible ?
>
> It's been there since day 1 of O_DIRECT support.

Sorry, my bad.  So for Linux we could just use fcntl for
block_set_hostcache and not bother with reopening.  However, we will
need to reopen should we wish to support changing O_DSYNC.

Stefan

^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: [Qemu-devel] Safely reopening image files by stashing fds
  2011-08-05 15:24         ` Stefan Hajnoczi
@ 2011-08-05 15:43           ` Kevin Wolf
  2011-08-05 15:49             ` Anthony Liguori
  0 siblings, 1 reply; 39+ messages in thread
From: Kevin Wolf @ 2011-08-05 15:43 UTC (permalink / raw)
  To: Stefan Hajnoczi
  Cc: Supriya Kannery, Anthony Liguori, qemu-devel, Paolo Bonzini,
	Christoph Hellwig

Am 05.08.2011 17:24, schrieb Stefan Hajnoczi:
> On Fri, Aug 5, 2011 at 3:28 PM, Christoph Hellwig <hch@lst.de> wrote:
>> On Fri, Aug 05, 2011 at 02:12:48PM +0100, Daniel P. Berrange wrote:
>>>> Because you cannot change O_DIRECT on an open fd :(.  This is why
>>>> we're going through this pain.
>>>
>>> Hmm, I remember hearing that before, but looking at the current fcntl()
>>> manpage, it claims you *can* change O_DIRECT using SET_FL. Perhaps this
>>> is a newish feature, but it'd be nicer to use it if possible ?
>>
>> It's been there since day 1 of O_DIRECT support.
> 
> Sorry, my bad.  So for Linux we could just use fcntl for
> block_set_hostcache and not bother with reopening.  However, we will
> need to reopen should we wish to support changing O_DSYNC.

We do wish to support that.

Anthony thinks that allowing the guest to toggle WCE is a prerequisite
for making cache=writeback the default. And this is something that I
definitely want to do for 1.0.

Kevin

^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: [Qemu-devel] Safely reopening image files by stashing fds
  2011-08-05 15:43           ` Kevin Wolf
@ 2011-08-05 15:49             ` Anthony Liguori
  2011-08-08  7:02               ` Supriya Kannery
  0 siblings, 1 reply; 39+ messages in thread
From: Anthony Liguori @ 2011-08-05 15:49 UTC (permalink / raw)
  To: Kevin Wolf
  Cc: Stefan Hajnoczi, Paolo Bonzini, Christoph Hellwig, qemu-devel,
	Supriya Kannery

On 08/05/2011 10:43 AM, Kevin Wolf wrote:
> Am 05.08.2011 17:24, schrieb Stefan Hajnoczi:
>> On Fri, Aug 5, 2011 at 3:28 PM, Christoph Hellwig<hch@lst.de>  wrote:
>>> On Fri, Aug 05, 2011 at 02:12:48PM +0100, Daniel P. Berrange wrote:
>>>>> Because you cannot change O_DIRECT on an open fd :(.  This is why
>>>>> we're going through this pain.
>>>>
>>>> Hmm, I remember hearing that before, but looking at the current fcntl()
>>>> manpage, it claims you *can* change O_DIRECT using SET_FL. Perhaps this
>>>> is a newish feature, but it'd be nicer to use it if possible ?
>>>
>>> It's been there since day 1 of O_DIRECT support.
>>
>> Sorry, my bad.  So for Linux we could just use fcntl for
>> block_set_hostcache and not bother with reopening.  However, we will
>> need to reopen should we wish to support changing O_DSYNC.
>
> We do wish to support that.
>
> Anthony thinks that allowing the guest to toggle WCE is a prerequisite
> for making cache=writeback the default. And this is something that I
> definitely want to do for 1.0.

Indeed.

Regards,

Anthony Liguori

> Kevin
>

^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: [Qemu-devel] Safely reopening image files by stashing fds
  2011-08-05  8:40 [Qemu-devel] Safely reopening image files by stashing fds Stefan Hajnoczi
  2011-08-05  9:04 ` Paolo Bonzini
  2011-08-05  9:07 ` Kevin Wolf
@ 2011-08-05 20:16 ` Blue Swirl
  2 siblings, 0 replies; 39+ messages in thread
From: Blue Swirl @ 2011-08-05 20:16 UTC (permalink / raw)
  To: Stefan Hajnoczi; +Cc: Kevin Wolf, Supriya Kannery, Anthony Liguori, qemu-devel

On Fri, Aug 5, 2011 at 8:40 AM, Stefan Hajnoczi <stefanha@gmail.com> wrote:
> We've discussed safe methods for reopening image files (e.g. useful for
> changing the hostcache parameter).  The problem is that closing the file first
> and then opening it again exposes us to the error case where the open fails.
> At that point we cannot get to the file anymore and our options are to
> terminate QEMU, pause the VM, or offline the block device.
>
> This window of vulnerability can be eliminated by keeping the file descriptor
> around and falling back to it should the open fail.
>
> The challenge for the file descriptor approach is that image formats, like
> VMDK, can span multiple files.  Therefore the solution is not as simple as
> stashing a single file descriptor and reopening from it.
>
> Here is the outline for an fd stashing mechanism that can handle reopening
> multi-file images and could also solve the file descriptor passing problem for
> libvirt:
>
> 1. Extract monitor getfd/closefd functionality
>
> The monitor already supports fd stashing with getfd/closefd commands.  But the
> fd stash code is part of Monitor and we need to extract it into its own object.
>
> /* A stashed file descriptor */
> typedef FDEntry {
>        const char *name;
>        int fd;
>        QLIST_ENTRY(FDEntry) next;
> } FDEntry;
>
> /* A container for stashing file descriptors */
> typedef struct FDStash {
>        QLIST_HEAD(, FDEntry) fds;
> } FDStash;
>
> void fdstash_init(FDStash *stash);
>
> /**
>  * Clear stashed file descriptors and close them
>  */
> void fdstash_cleanup(FDStash *stash);
>
> /**
>  * Stash a file descriptor and give up ownership
>  *
>  * If a file descriptor is already present with the same name the old fd is
>  * closed and replaced by the new one.
>  */
> void fdstash_give(FDStash *stash, const char *name, int fd);
>
> /**
>  * Find and take ownership of a stashed file descriptor
>  *
>  * Return the file descriptor or -ENOENT if not found.
>  */
> int fdstash_take(FDStash *stash, const char *name);
>
> The monitor is refactored to use this code instead of open coding fd stashing.
>
> 2. Introduce a function to extract open file descriptors from an block device
>
> Add a new .bdrv_extract_fds(BlockDriverState *bs, FDStash *stash) interface,
> which defaults to calling bdrv_extract_fds(bs->file, stash).
>
> VMDK and protocols can implement this function to support extracting open fds
> from a block device.  Note that they need to dup(2) fds before giving them to
> the fdstash, otherwise the fd will be closed when the block device is
> closed/deleted.
>
> 3. Rework bdrv_open() to take a FDStash
>
> Check the FDStash before opening an image file on the host file system.  This
> makes it possible to open an image file and use existing stashed fds.
>
> 4. Implement bdrv_reopen()
>
> First call bdrv_extract_fds() to stash the file descriptors, then close the
> block device.  Try opening the new image but if that fails, reopen using the
> stashed file descriptors.
>
> Thoughts?

I was previously thinking that the fd store should be tightly coupled
with block layer, but maybe it would be better to make this totally
generic like Avi(?) proposed, then all files (logs etc) could benefit
from this functionality.

I'd then handle fd stashing in qemu_open() etc. Stashing would not
need major changes anywhere, except reopen should be handled with
qemu_reopen() instead of qemu_close() + qemu_open() sequence. From the
other side (monitor), the interface could be like you present in 1.

^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: [Qemu-devel] Safely reopening image files by stashing fds
  2011-08-05 15:49             ` Anthony Liguori
@ 2011-08-08  7:02               ` Supriya Kannery
  2011-08-08  8:12                 ` Kevin Wolf
  0 siblings, 1 reply; 39+ messages in thread
From: Supriya Kannery @ 2011-08-08  7:02 UTC (permalink / raw)
  To: Anthony Liguori
  Cc: Kevin Wolf, Stefan Hajnoczi, qemu-devel, Christoph Hellwig,
	Paolo Bonzini

On 08/05/2011 09:19 PM, Anthony Liguori wrote:
> On 08/05/2011 10:43 AM, Kevin Wolf wrote:
>> Am 05.08.2011 17:24, schrieb Stefan Hajnoczi:
>>> On Fri, Aug 5, 2011 at 3:28 PM, Christoph Hellwig<hch@lst.de> wrote:
>>>> On Fri, Aug 05, 2011 at 02:12:48PM +0100, Daniel P. Berrange wrote:
>>>>>> Because you cannot change O_DIRECT on an open fd :(. This is why
>>>>>> we're going through this pain.
>>>>>
>>>>> Hmm, I remember hearing that before, but looking at the current
>>>>> fcntl()
>>>>> manpage, it claims you *can* change O_DIRECT using SET_FL. Perhaps
>>>>> this
>>>>> is a newish feature, but it'd be nicer to use it if possible ?
>>>>
>>>> It's been there since day 1 of O_DIRECT support.
>>>
>>> Sorry, my bad. So for Linux we could just use fcntl for
>>> block_set_hostcache and not bother with reopening. However, we will
>>> need to reopen should we wish to support changing O_DSYNC.
>>
>> We do wish to support that.
>>
>> Anthony thinks that allowing the guest to toggle WCE is a prerequisite
>> for making cache=writeback the default. And this is something that I
>> definitely want to do for 1.0.
>
> Indeed.
>

We discussed the following so far...
1. How to safely reopen image files
2. Dynamic hostcache change
3. Support for dynamic change of O_DSYNC

Since 2 is independent of 1, shall I go ahead implementing
hostcache change using fcntl.

Implementation for safely reopening image files using "BDRVReopenState"
can be done separately as a pre-requisite before implementing 3

Thanks, Supriya

> Regards,
>
> Anthony Liguori
>
>> Kevin
>>
>
>

^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: [Qemu-devel] Safely reopening image files by stashing fds
  2011-08-08  7:02               ` Supriya Kannery
@ 2011-08-08  8:12                 ` Kevin Wolf
  2011-08-09  9:22                   ` supriya kannery
  0 siblings, 1 reply; 39+ messages in thread
From: Kevin Wolf @ 2011-08-08  8:12 UTC (permalink / raw)
  To: supriyak; +Cc: qemu-devel, Stefan Hajnoczi, Christoph Hellwig, Paolo Bonzini

Am 08.08.2011 09:02, schrieb Supriya Kannery:
> On 08/05/2011 09:19 PM, Anthony Liguori wrote:
>> On 08/05/2011 10:43 AM, Kevin Wolf wrote:
>>> Am 05.08.2011 17:24, schrieb Stefan Hajnoczi:
>>>> On Fri, Aug 5, 2011 at 3:28 PM, Christoph Hellwig<hch@lst.de> wrote:
>>>>> On Fri, Aug 05, 2011 at 02:12:48PM +0100, Daniel P. Berrange wrote:
>>>>>>> Because you cannot change O_DIRECT on an open fd :(. This is why
>>>>>>> we're going through this pain.
>>>>>>
>>>>>> Hmm, I remember hearing that before, but looking at the current
>>>>>> fcntl()
>>>>>> manpage, it claims you *can* change O_DIRECT using SET_FL. Perhaps
>>>>>> this
>>>>>> is a newish feature, but it'd be nicer to use it if possible ?
>>>>>
>>>>> It's been there since day 1 of O_DIRECT support.
>>>>
>>>> Sorry, my bad. So for Linux we could just use fcntl for
>>>> block_set_hostcache and not bother with reopening. However, we will
>>>> need to reopen should we wish to support changing O_DSYNC.
>>>
>>> We do wish to support that.
>>>
>>> Anthony thinks that allowing the guest to toggle WCE is a prerequisite
>>> for making cache=writeback the default. And this is something that I
>>> definitely want to do for 1.0.
>>
>> Indeed.
>>
> 
> We discussed the following so far...
> 1. How to safely reopen image files
> 2. Dynamic hostcache change
> 3. Support for dynamic change of O_DSYNC
> 
> Since 2 is independent of 1, shall I go ahead implementing
> hostcache change using fcntl.
> 
> Implementation for safely reopening image files using "BDRVReopenState"
> can be done separately as a pre-requisite before implementing 3

Doing it separately means that we would introduce yet another callback
that is used just to change O_DIRECT. In the end we want it to use
bdrv_reopen(), too, so I'm not sure if there is a need for a temporary
solution.

Actually, once we know what we really want (I haven't seen many comments
on the BDRVReopenState suggestion yet), it should be pretty easy to
implement.

Kevin

^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: [Qemu-devel] Safely reopening image files by stashing fds
  2011-08-05  9:48     ` Kevin Wolf
@ 2011-08-08 14:49       ` Stefan Hajnoczi
  2011-08-08 15:16         ` Kevin Wolf
  0 siblings, 1 reply; 39+ messages in thread
From: Stefan Hajnoczi @ 2011-08-08 14:49 UTC (permalink / raw)
  To: Kevin Wolf; +Cc: Supriya Kannery, Anthony Liguori, qemu-devel

On Fri, Aug 5, 2011 at 10:48 AM, Kevin Wolf <kwolf@redhat.com> wrote:
> Am 05.08.2011 11:29, schrieb Stefan Hajnoczi:
>> On Fri, Aug 5, 2011 at 10:07 AM, Kevin Wolf <kwolf@redhat.com> wrote:
>>> Am 05.08.2011 10:40, schrieb Stefan Hajnoczi:
>>>> We've discussed safe methods for reopening image files (e.g. useful for
>>>> changing the hostcache parameter).  The problem is that closing the file first
>>>> and then opening it again exposes us to the error case where the open fails.
>>>> At that point we cannot get to the file anymore and our options are to
>>>> terminate QEMU, pause the VM, or offline the block device.
>>>>
>>>> This window of vulnerability can be eliminated by keeping the file descriptor
>>>> around and falling back to it should the open fail.
>>>>
>>>> The challenge for the file descriptor approach is that image formats, like
>>>> VMDK, can span multiple files.  Therefore the solution is not as simple as
>>>> stashing a single file descriptor and reopening from it.
>>>
>>> So far I agree. The rest I believe is wrong because you can't assume
>>> that every backend uses file descriptors. The qemu block layer is based
>>> on BlockDriverStates, not fds. They are a concept that should be hidden
>>> in raw-posix.
>>>
>>> I think something like this could do:
>>>
>>> struct BDRVReopenState {
>>>    BlockDriverState *bs;
>>>    /* can be extended by block drivers */
>>> };
>>>
>>> .bdrv_reopen(BlockDriverState *bs, BDRVReopenState **reopen_state, int
>>> flags);
>>> .bdrv_reopen_commit(BDRVReopenState *reopen_state);
>>> .bdrv_reopen_abort(BDRVReopenState *reopen_state);
>>>
>>> raw-posix would store the old file descriptor in its reopen_state. On
>>> commit, it closes the old descriptors, on abort it reverts to the old
>>> one and closes the newly opened one.
>>>
>>> Makes things a bit more complicated than the simple bdrv_reopen I had in
>>> mind before, but it allows VMDK to get an all-or-nothing semantics.
>>
>> Can you show how bdrv_reopen() would use these new interfaces?  I'm
>> not 100% clear on the idea.
>
> Well, you wouldn't only call bdrv_reopen, but also either
> bdrv_reopen_commit/abort (for the top-level caller we can have a wrapper
> function that does both, but that's syntactic sugar).
>
> For example we would have:
>
> int vmdk_reopen()

.bdrv_reopen() is a confusing name for this operation because it does
not reopen anything.  bdrv_prepare_reopen() might be clearer.

> {
>    *((VMDKReopenState**) rs) = malloc();
>
>    foreach (extent in s->extents) {
>        ret = bdrv_reopen(extent->file, &extent->reopen_state)
>        if (ret < 0)
>            goto fail;
>    }
>    return 0;
>
> fail:
>    foreach (extent in rs->already_reopened) {
>        bdrv_reopen_abort(extent->reopen_state);
>    }
>    return ret;
> }

> void vmdk_reopen_commit()
> {
>    foreach (extent in s->extents) {
>        bdrv_reopen_commit(extent->reopen_state);
>    }
>    free(rs);
> }
>
> void vmdk_reopen_abort()
> {
>    foreach (extent in s->extents) {
>        bdrv_reopen_abort(extent->reopen_state);
>    }
>    free(rs);
> }

Does the caller invoke bdrv_close(bs) after bdrv_prepare_reopen(bs,
&rs)?  There is more state than just the file descriptors and I'm not
sure that that gets preserved unless we add code to stash away stuff.
I'm basically hoping this interface does not require touching every
BlockDriver.

Stefan

^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: [Qemu-devel] Safely reopening image files by stashing fds
  2011-08-08 14:49       ` Stefan Hajnoczi
@ 2011-08-08 15:16         ` Kevin Wolf
  2011-08-09 10:25           ` Stefan Hajnoczi
  0 siblings, 1 reply; 39+ messages in thread
From: Kevin Wolf @ 2011-08-08 15:16 UTC (permalink / raw)
  To: Stefan Hajnoczi; +Cc: Supriya Kannery, Anthony Liguori, qemu-devel

Am 08.08.2011 16:49, schrieb Stefan Hajnoczi:
> On Fri, Aug 5, 2011 at 10:48 AM, Kevin Wolf <kwolf@redhat.com> wrote:
>> Am 05.08.2011 11:29, schrieb Stefan Hajnoczi:
>>> On Fri, Aug 5, 2011 at 10:07 AM, Kevin Wolf <kwolf@redhat.com> wrote:
>>>> Am 05.08.2011 10:40, schrieb Stefan Hajnoczi:
>>>>> We've discussed safe methods for reopening image files (e.g. useful for
>>>>> changing the hostcache parameter).  The problem is that closing the file first
>>>>> and then opening it again exposes us to the error case where the open fails.
>>>>> At that point we cannot get to the file anymore and our options are to
>>>>> terminate QEMU, pause the VM, or offline the block device.
>>>>>
>>>>> This window of vulnerability can be eliminated by keeping the file descriptor
>>>>> around and falling back to it should the open fail.
>>>>>
>>>>> The challenge for the file descriptor approach is that image formats, like
>>>>> VMDK, can span multiple files.  Therefore the solution is not as simple as
>>>>> stashing a single file descriptor and reopening from it.
>>>>
>>>> So far I agree. The rest I believe is wrong because you can't assume
>>>> that every backend uses file descriptors. The qemu block layer is based
>>>> on BlockDriverStates, not fds. They are a concept that should be hidden
>>>> in raw-posix.
>>>>
>>>> I think something like this could do:
>>>>
>>>> struct BDRVReopenState {
>>>>    BlockDriverState *bs;
>>>>    /* can be extended by block drivers */
>>>> };
>>>>
>>>> .bdrv_reopen(BlockDriverState *bs, BDRVReopenState **reopen_state, int
>>>> flags);
>>>> .bdrv_reopen_commit(BDRVReopenState *reopen_state);
>>>> .bdrv_reopen_abort(BDRVReopenState *reopen_state);
>>>>
>>>> raw-posix would store the old file descriptor in its reopen_state. On
>>>> commit, it closes the old descriptors, on abort it reverts to the old
>>>> one and closes the newly opened one.
>>>>
>>>> Makes things a bit more complicated than the simple bdrv_reopen I had in
>>>> mind before, but it allows VMDK to get an all-or-nothing semantics.
>>>
>>> Can you show how bdrv_reopen() would use these new interfaces?  I'm
>>> not 100% clear on the idea.
>>
>> Well, you wouldn't only call bdrv_reopen, but also either
>> bdrv_reopen_commit/abort (for the top-level caller we can have a wrapper
>> function that does both, but that's syntactic sugar).
>>
>> For example we would have:
>>
>> int vmdk_reopen()
> 
> .bdrv_reopen() is a confusing name for this operation because it does
> not reopen anything.  bdrv_prepare_reopen() might be clearer.

Makes sense.

> 
>> {
>>    *((VMDKReopenState**) rs) = malloc();
>>
>>    foreach (extent in s->extents) {
>>        ret = bdrv_reopen(extent->file, &extent->reopen_state)
>>        if (ret < 0)
>>            goto fail;
>>    }
>>    return 0;
>>
>> fail:
>>    foreach (extent in rs->already_reopened) {
>>        bdrv_reopen_abort(extent->reopen_state);
>>    }
>>    return ret;
>> }
> 
>> void vmdk_reopen_commit()
>> {
>>    foreach (extent in s->extents) {
>>        bdrv_reopen_commit(extent->reopen_state);
>>    }
>>    free(rs);
>> }
>>
>> void vmdk_reopen_abort()
>> {
>>    foreach (extent in s->extents) {
>>        bdrv_reopen_abort(extent->reopen_state);
>>    }
>>    free(rs);
>> }
> 
> Does the caller invoke bdrv_close(bs) after bdrv_prepare_reopen(bs,
> &rs)? 

No. Closing the old backend would be part of bdrv_reopen_commit.

Do you have a use case where it would be helpful if the caller invoked
bdrv_close?

> There is more state than just the file descriptors and I'm not
> sure that that gets preserved unless we add code to stash away stuff.
> I'm basically hoping this interface does not require touching every
> BlockDriver.

If we only want to change flags like O_DIRECT or O_SYNC, I think format
drivers (except VMDK) can use a standard implementation that just
reopens bs->file.

If we wanted bdrv_reopen to ensure that all caches are dropped etc. then
I think we need a specific implementation in all drivers unless
bdrv->bdrv_open/bdrv_close is good enough to emulate it.

Kevin

^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: [Qemu-devel] Safely reopening image files by stashing fds
  2011-08-08  8:12                 ` Kevin Wolf
@ 2011-08-09  9:22                   ` supriya kannery
  2011-08-09  9:51                     ` Kevin Wolf
  2011-10-10 18:28                     ` [Qemu-devel] " Kevin Wolf
  0 siblings, 2 replies; 39+ messages in thread
From: supriya kannery @ 2011-08-09  9:22 UTC (permalink / raw)
  To: Kevin Wolf
  Cc: supriyak, Stefan Hajnoczi, qemu-devel, Paolo Bonzini, Christoph Hellwig

Kevin Wolf wrote:
> Am 08.08.2011 09:02, schrieb Supriya Kannery:
>   
>> On 08/05/2011 09:19 PM, Anthony Liguori wrote:
>>     
>>> On 08/05/2011 10:43 AM, Kevin Wolf wrote:
>>>       
>>>> Am 05.08.2011 17:24, schrieb Stefan Hajnoczi:
>>>>         
>>>>> On Fri, Aug 5, 2011 at 3:28 PM, Christoph Hellwig<hch@lst.de> wrote:
>>>>>           
>>>>>> On Fri, Aug 05, 2011 at 02:12:48PM +0100, Daniel P. Berrange wrote:
>>>>>>             
>>>>>>>> Because you cannot change O_DIRECT on an open fd :(. This is why
>>>>>>>> we're going through this pain.
>>>>>>>>                 
>>>>>>> Hmm, I remember hearing that before, but looking at the current
>>>>>>> fcntl()
>>>>>>> manpage, it claims you *can* change O_DIRECT using SET_FL. Perhaps
>>>>>>> this
>>>>>>> is a newish feature, but it'd be nicer to use it if possible ?
>>>>>>>               
>>>>>> It's been there since day 1 of O_DIRECT support.
>>>>>>             
>>>>> Sorry, my bad. So for Linux we could just use fcntl for
>>>>> block_set_hostcache and not bother with reopening. However, we will
>>>>> need to reopen should we wish to support changing O_DSYNC.
>>>>>           
>>>> We do wish to support that.
>>>>
>>>> Anthony thinks that allowing the guest to toggle WCE is a prerequisite
>>>> for making cache=writeback the default. And this is something that I
>>>> definitely want to do for 1.0.
>>>>         
>>> Indeed.
>>>
>>>       
>> We discussed the following so far...
>> 1. How to safely reopen image files
>> 2. Dynamic hostcache change
>> 3. Support for dynamic change of O_DSYNC
>>
>> Since 2 is independent of 1, shall I go ahead implementing
>> hostcache change using fcntl.
>>
>> Implementation for safely reopening image files using "BDRVReopenState"
>> can be done separately as a pre-requisite before implementing 3
>>     
>
> Doing it separately means that we would introduce yet another callback
> that is used just to change O_DIRECT. In the end we want it to use
> bdrv_reopen(), too, so I'm not sure if there is a need for a temporary
> solution.
>
>   
Could you please explain "In the end we want it to use bdrv_reopen" at 
bit more.
When fcntl() can change O_DIRECT on  open fd , is there a  need to 
"re-open"
the image file?

Considering the current way of having separate high level commands for
changing block parameters (block_set_hostcache, and may be block_set_flush
in furture), these dynamic requests will be sequential. So wouldn't it 
be better to
avoid re-opening of image if possible for individual flag change request 
that comes in?
> Actually, once we know what we really want (I haven't seen many comments
> on the BDRVReopenState suggestion yet), it should be pretty easy to
> implement.
>
> Kevin
>   
Will work on to get an RFC patch with this proposed BDRVReopenState to 
get more
inputs.

^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: [Qemu-devel] Safely reopening image files by stashing fds
  2011-08-09  9:51                     ` Kevin Wolf
@ 2011-08-09  9:32                       ` supriya kannery
  2011-08-16 19:18                         ` [Qemu-devel] [RFC] " Supriya Kannery
  2011-08-16 19:18                         ` Supriya Kannery
  0 siblings, 2 replies; 39+ messages in thread
From: supriya kannery @ 2011-08-09  9:32 UTC (permalink / raw)
  To: Kevin Wolf
  Cc: supriyak, Stefan Hajnoczi, qemu-devel, Paolo Bonzini, Christoph Hellwig

Kevin Wolf wrote:
> Am 09.08.2011 11:22, schrieb supriya kannery:
>   
>> Kevin Wolf wrote:
>>     
>>> Am 08.08.2011 09:02, schrieb Supriya Kannery:
>>>   
>>>       
>>>> On 08/05/2011 09:19 PM, Anthony Liguori wrote:
>>>>     
>>>>         
>>>>> On 08/05/2011 10:43 AM, Kevin Wolf wrote:
>>>>>       
>>>>>           
>>>>>> Am 05.08.2011 17:24, schrieb Stefan Hajnoczi:
>>>>>>         
>>>>>>             
>>>>>>> On Fri, Aug 5, 2011 at 3:28 PM, Christoph Hellwig<hch@lst.de> wrote:
>>>>>>>           
>>>>>>>               
>>>>>>>> On Fri, Aug 05, 2011 at 02:12:48PM +0100, Daniel P. Berrange wrote:
>>>>>>>>             
>>>>>>>>                 
>>>>>>>>>> Because you cannot change O_DIRECT on an open fd :(. This is why
>>>>>>>>>> we're going through this pain.
>>>>>>>>>>                 
>>>>>>>>>>                     
>>>>>>>>> Hmm, I remember hearing that before, but looking at the current
>>>>>>>>> fcntl()
>>>>>>>>> manpage, it claims you *can* change O_DIRECT using SET_FL. Perhaps
>>>>>>>>> this
>>>>>>>>> is a newish feature, but it'd be nicer to use it if possible ?
>>>>>>>>>               
>>>>>>>>>                   
>>>>>>>> It's been there since day 1 of O_DIRECT support.
>>>>>>>>             
>>>>>>>>                 
>>>>>>> Sorry, my bad. So for Linux we could just use fcntl for
>>>>>>> block_set_hostcache and not bother with reopening. However, we will
>>>>>>> need to reopen should we wish to support changing O_DSYNC.
>>>>>>>           
>>>>>>>               
>>>>>> We do wish to support that.
>>>>>>
>>>>>> Anthony thinks that allowing the guest to toggle WCE is a prerequisite
>>>>>> for making cache=writeback the default. And this is something that I
>>>>>> definitely want to do for 1.0.
>>>>>>         
>>>>>>             
>>>>> Indeed.
>>>>>
>>>>>       
>>>>>           
>>>> We discussed the following so far...
>>>> 1. How to safely reopen image files
>>>> 2. Dynamic hostcache change
>>>> 3. Support for dynamic change of O_DSYNC
>>>>
>>>> Since 2 is independent of 1, shall I go ahead implementing
>>>> hostcache change using fcntl.
>>>>
>>>> Implementation for safely reopening image files using "BDRVReopenState"
>>>> can be done separately as a pre-requisite before implementing 3
>>>>     
>>>>         
>>> Doing it separately means that we would introduce yet another callback
>>> that is used just to change O_DIRECT. In the end we want it to use
>>> bdrv_reopen(), too, so I'm not sure if there is a need for a temporary
>>> solution.
>>>
>>>   
>>>       
>> Could you please explain "In the end we want it to use bdrv_reopen" at 
>> bit more.
>> When fcntl() can change O_DIRECT on  open fd , is there a  need to 
>> "re-open"
>> the image file?
>>     
>
> What I meant is that in the end, with a generic bdrv_reopen(), we can
> have raw-posix only call dup() and fcntl() instead of doing a
> close()/open() sequence if it can satisfy the new flags this way. But
> this would be an implementation detail and not be visible in the interface.
>
> Kevin
>   

ok
- thanks, Supriya

^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: [Qemu-devel] Safely reopening image files by stashing fds
  2011-08-09  9:22                   ` supriya kannery
@ 2011-08-09  9:51                     ` Kevin Wolf
  2011-08-09  9:32                       ` supriya kannery
  2011-10-10 18:28                     ` [Qemu-devel] " Kevin Wolf
  1 sibling, 1 reply; 39+ messages in thread
From: Kevin Wolf @ 2011-08-09  9:51 UTC (permalink / raw)
  To: supriya kannery
  Cc: supriyak, Stefan Hajnoczi, qemu-devel, Paolo Bonzini, Christoph Hellwig

Am 09.08.2011 11:22, schrieb supriya kannery:
> Kevin Wolf wrote:
>> Am 08.08.2011 09:02, schrieb Supriya Kannery:
>>   
>>> On 08/05/2011 09:19 PM, Anthony Liguori wrote:
>>>     
>>>> On 08/05/2011 10:43 AM, Kevin Wolf wrote:
>>>>       
>>>>> Am 05.08.2011 17:24, schrieb Stefan Hajnoczi:
>>>>>         
>>>>>> On Fri, Aug 5, 2011 at 3:28 PM, Christoph Hellwig<hch@lst.de> wrote:
>>>>>>           
>>>>>>> On Fri, Aug 05, 2011 at 02:12:48PM +0100, Daniel P. Berrange wrote:
>>>>>>>             
>>>>>>>>> Because you cannot change O_DIRECT on an open fd :(. This is why
>>>>>>>>> we're going through this pain.
>>>>>>>>>                 
>>>>>>>> Hmm, I remember hearing that before, but looking at the current
>>>>>>>> fcntl()
>>>>>>>> manpage, it claims you *can* change O_DIRECT using SET_FL. Perhaps
>>>>>>>> this
>>>>>>>> is a newish feature, but it'd be nicer to use it if possible ?
>>>>>>>>               
>>>>>>> It's been there since day 1 of O_DIRECT support.
>>>>>>>             
>>>>>> Sorry, my bad. So for Linux we could just use fcntl for
>>>>>> block_set_hostcache and not bother with reopening. However, we will
>>>>>> need to reopen should we wish to support changing O_DSYNC.
>>>>>>           
>>>>> We do wish to support that.
>>>>>
>>>>> Anthony thinks that allowing the guest to toggle WCE is a prerequisite
>>>>> for making cache=writeback the default. And this is something that I
>>>>> definitely want to do for 1.0.
>>>>>         
>>>> Indeed.
>>>>
>>>>       
>>> We discussed the following so far...
>>> 1. How to safely reopen image files
>>> 2. Dynamic hostcache change
>>> 3. Support for dynamic change of O_DSYNC
>>>
>>> Since 2 is independent of 1, shall I go ahead implementing
>>> hostcache change using fcntl.
>>>
>>> Implementation for safely reopening image files using "BDRVReopenState"
>>> can be done separately as a pre-requisite before implementing 3
>>>     
>>
>> Doing it separately means that we would introduce yet another callback
>> that is used just to change O_DIRECT. In the end we want it to use
>> bdrv_reopen(), too, so I'm not sure if there is a need for a temporary
>> solution.
>>
>>   
> Could you please explain "In the end we want it to use bdrv_reopen" at 
> bit more.
> When fcntl() can change O_DIRECT on  open fd , is there a  need to 
> "re-open"
> the image file?

What I meant is that in the end, with a generic bdrv_reopen(), we can
have raw-posix only call dup() and fcntl() instead of doing a
close()/open() sequence if it can satisfy the new flags this way. But
this would be an implementation detail and not be visible in the interface.

Kevin

^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: [Qemu-devel] Safely reopening image files by stashing fds
  2011-08-08 15:16         ` Kevin Wolf
@ 2011-08-09 10:25           ` Stefan Hajnoczi
  2011-08-09 10:35             ` Kevin Wolf
  0 siblings, 1 reply; 39+ messages in thread
From: Stefan Hajnoczi @ 2011-08-09 10:25 UTC (permalink / raw)
  To: Kevin Wolf; +Cc: Supriya Kannery, Anthony Liguori, qemu-devel

On Mon, Aug 8, 2011 at 4:16 PM, Kevin Wolf <kwolf@redhat.com> wrote:
> Am 08.08.2011 16:49, schrieb Stefan Hajnoczi:
>> On Fri, Aug 5, 2011 at 10:48 AM, Kevin Wolf <kwolf@redhat.com> wrote:
>>> Am 05.08.2011 11:29, schrieb Stefan Hajnoczi:
>>>> On Fri, Aug 5, 2011 at 10:07 AM, Kevin Wolf <kwolf@redhat.com> wrote:
>>>>> Am 05.08.2011 10:40, schrieb Stefan Hajnoczi:
>>>>>> We've discussed safe methods for reopening image files (e.g. useful for
>>>>>> changing the hostcache parameter).  The problem is that closing the file first
>>>>>> and then opening it again exposes us to the error case where the open fails.
>>>>>> At that point we cannot get to the file anymore and our options are to
>>>>>> terminate QEMU, pause the VM, or offline the block device.
>>>>>>
>>>>>> This window of vulnerability can be eliminated by keeping the file descriptor
>>>>>> around and falling back to it should the open fail.
>>>>>>
>>>>>> The challenge for the file descriptor approach is that image formats, like
>>>>>> VMDK, can span multiple files.  Therefore the solution is not as simple as
>>>>>> stashing a single file descriptor and reopening from it.
>>>>>
>>>>> So far I agree. The rest I believe is wrong because you can't assume
>>>>> that every backend uses file descriptors. The qemu block layer is based
>>>>> on BlockDriverStates, not fds. They are a concept that should be hidden
>>>>> in raw-posix.
>>>>>
>>>>> I think something like this could do:
>>>>>
>>>>> struct BDRVReopenState {
>>>>>    BlockDriverState *bs;
>>>>>    /* can be extended by block drivers */
>>>>> };
>>>>>
>>>>> .bdrv_reopen(BlockDriverState *bs, BDRVReopenState **reopen_state, int
>>>>> flags);
>>>>> .bdrv_reopen_commit(BDRVReopenState *reopen_state);
>>>>> .bdrv_reopen_abort(BDRVReopenState *reopen_state);
>>>>>
>>>>> raw-posix would store the old file descriptor in its reopen_state. On
>>>>> commit, it closes the old descriptors, on abort it reverts to the old
>>>>> one and closes the newly opened one.
>>>>>
>>>>> Makes things a bit more complicated than the simple bdrv_reopen I had in
>>>>> mind before, but it allows VMDK to get an all-or-nothing semantics.
>>>>
>>>> Can you show how bdrv_reopen() would use these new interfaces?  I'm
>>>> not 100% clear on the idea.
>>>
>>> Well, you wouldn't only call bdrv_reopen, but also either
>>> bdrv_reopen_commit/abort (for the top-level caller we can have a wrapper
>>> function that does both, but that's syntactic sugar).
>>>
>>> For example we would have:
>>>
>>> int vmdk_reopen()
>>
>> .bdrv_reopen() is a confusing name for this operation because it does
>> not reopen anything.  bdrv_prepare_reopen() might be clearer.
>
> Makes sense.
>
>>
>>> {
>>>    *((VMDKReopenState**) rs) = malloc();
>>>
>>>    foreach (extent in s->extents) {
>>>        ret = bdrv_reopen(extent->file, &extent->reopen_state)
>>>        if (ret < 0)
>>>            goto fail;
>>>    }
>>>    return 0;
>>>
>>> fail:
>>>    foreach (extent in rs->already_reopened) {
>>>        bdrv_reopen_abort(extent->reopen_state);
>>>    }
>>>    return ret;
>>> }
>>
>>> void vmdk_reopen_commit()
>>> {
>>>    foreach (extent in s->extents) {
>>>        bdrv_reopen_commit(extent->reopen_state);
>>>    }
>>>    free(rs);
>>> }
>>>
>>> void vmdk_reopen_abort()
>>> {
>>>    foreach (extent in s->extents) {
>>>        bdrv_reopen_abort(extent->reopen_state);
>>>    }
>>>    free(rs);
>>> }
>>
>> Does the caller invoke bdrv_close(bs) after bdrv_prepare_reopen(bs,
>> &rs)?
>
> No. Closing the old backend would be part of bdrv_reopen_commit.
>
> Do you have a use case where it would be helpful if the caller invoked
> bdrv_close?

When the caller does bdrv_close() two BlockDriverStates are never open
for the same image file.  I thought this was a property we wanted.

Also, in the block_set_hostcache case we need to reopen without
switching to a new BlockDriverState instance.  That means the reopen
needs to be in-place with respect to the BlockDriverState *bs pointer.
 We cannot create a new instance.

Stefan

^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: [Qemu-devel] Safely reopening image files by stashing fds
  2011-08-09 10:25           ` Stefan Hajnoczi
@ 2011-08-09 10:35             ` Kevin Wolf
  2011-08-09 10:50               ` Stefan Hajnoczi
  0 siblings, 1 reply; 39+ messages in thread
From: Kevin Wolf @ 2011-08-09 10:35 UTC (permalink / raw)
  To: Stefan Hajnoczi; +Cc: Supriya Kannery, Anthony Liguori, qemu-devel

Am 09.08.2011 12:25, schrieb Stefan Hajnoczi:
> On Mon, Aug 8, 2011 at 4:16 PM, Kevin Wolf <kwolf@redhat.com> wrote:
>> Am 08.08.2011 16:49, schrieb Stefan Hajnoczi:
>>> On Fri, Aug 5, 2011 at 10:48 AM, Kevin Wolf <kwolf@redhat.com> wrote:
>>>> Am 05.08.2011 11:29, schrieb Stefan Hajnoczi:
>>>>> On Fri, Aug 5, 2011 at 10:07 AM, Kevin Wolf <kwolf@redhat.com> wrote:
>>>>>> Am 05.08.2011 10:40, schrieb Stefan Hajnoczi:
>>>>>>> We've discussed safe methods for reopening image files (e.g. useful for
>>>>>>> changing the hostcache parameter).  The problem is that closing the file first
>>>>>>> and then opening it again exposes us to the error case where the open fails.
>>>>>>> At that point we cannot get to the file anymore and our options are to
>>>>>>> terminate QEMU, pause the VM, or offline the block device.
>>>>>>>
>>>>>>> This window of vulnerability can be eliminated by keeping the file descriptor
>>>>>>> around and falling back to it should the open fail.
>>>>>>>
>>>>>>> The challenge for the file descriptor approach is that image formats, like
>>>>>>> VMDK, can span multiple files.  Therefore the solution is not as simple as
>>>>>>> stashing a single file descriptor and reopening from it.
>>>>>>
>>>>>> So far I agree. The rest I believe is wrong because you can't assume
>>>>>> that every backend uses file descriptors. The qemu block layer is based
>>>>>> on BlockDriverStates, not fds. They are a concept that should be hidden
>>>>>> in raw-posix.
>>>>>>
>>>>>> I think something like this could do:
>>>>>>
>>>>>> struct BDRVReopenState {
>>>>>>    BlockDriverState *bs;
>>>>>>    /* can be extended by block drivers */
>>>>>> };
>>>>>>
>>>>>> .bdrv_reopen(BlockDriverState *bs, BDRVReopenState **reopen_state, int
>>>>>> flags);
>>>>>> .bdrv_reopen_commit(BDRVReopenState *reopen_state);
>>>>>> .bdrv_reopen_abort(BDRVReopenState *reopen_state);
>>>>>>
>>>>>> raw-posix would store the old file descriptor in its reopen_state. On
>>>>>> commit, it closes the old descriptors, on abort it reverts to the old
>>>>>> one and closes the newly opened one.
>>>>>>
>>>>>> Makes things a bit more complicated than the simple bdrv_reopen I had in
>>>>>> mind before, but it allows VMDK to get an all-or-nothing semantics.
>>>>>
>>>>> Can you show how bdrv_reopen() would use these new interfaces?  I'm
>>>>> not 100% clear on the idea.
>>>>
>>>> Well, you wouldn't only call bdrv_reopen, but also either
>>>> bdrv_reopen_commit/abort (for the top-level caller we can have a wrapper
>>>> function that does both, but that's syntactic sugar).
>>>>
>>>> For example we would have:
>>>>
>>>> int vmdk_reopen()
>>>
>>> .bdrv_reopen() is a confusing name for this operation because it does
>>> not reopen anything.  bdrv_prepare_reopen() might be clearer.
>>
>> Makes sense.
>>
>>>
>>>> {
>>>>    *((VMDKReopenState**) rs) = malloc();
>>>>
>>>>    foreach (extent in s->extents) {
>>>>        ret = bdrv_reopen(extent->file, &extent->reopen_state)
>>>>        if (ret < 0)
>>>>            goto fail;
>>>>    }
>>>>    return 0;
>>>>
>>>> fail:
>>>>    foreach (extent in rs->already_reopened) {
>>>>        bdrv_reopen_abort(extent->reopen_state);
>>>>    }
>>>>    return ret;
>>>> }
>>>
>>>> void vmdk_reopen_commit()
>>>> {
>>>>    foreach (extent in s->extents) {
>>>>        bdrv_reopen_commit(extent->reopen_state);
>>>>    }
>>>>    free(rs);
>>>> }
>>>>
>>>> void vmdk_reopen_abort()
>>>> {
>>>>    foreach (extent in s->extents) {
>>>>        bdrv_reopen_abort(extent->reopen_state);
>>>>    }
>>>>    free(rs);
>>>> }
>>>
>>> Does the caller invoke bdrv_close(bs) after bdrv_prepare_reopen(bs,
>>> &rs)?
>>
>> No. Closing the old backend would be part of bdrv_reopen_commit.
>>
>> Do you have a use case where it would be helpful if the caller invoked
>> bdrv_close?
> 
> When the caller does bdrv_close() two BlockDriverStates are never open
> for the same image file.  I thought this was a property we wanted.
> 
> Also, in the block_set_hostcache case we need to reopen without
> switching to a new BlockDriverState instance.  That means the reopen
> needs to be in-place with respect to the BlockDriverState *bs pointer.
>  We cannot create a new instance.

Yes, but where do you even get the second BlockDriverState from?

My prototype only returns an int, not a new BlockDriverState. Until
bdrv_reopen_commit() it would refer to the old file descriptors etc. and
after bdrv_reopen_commit() the very same BlockDriverState would refer to
the new ones.

Kevin

^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: [Qemu-devel] Safely reopening image files by stashing fds
  2011-08-09 10:35             ` Kevin Wolf
@ 2011-08-09 10:50               ` Stefan Hajnoczi
  2011-08-09 10:56                 ` Stefan Hajnoczi
  0 siblings, 1 reply; 39+ messages in thread
From: Stefan Hajnoczi @ 2011-08-09 10:50 UTC (permalink / raw)
  To: Kevin Wolf; +Cc: Supriya Kannery, Anthony Liguori, qemu-devel

On Tue, Aug 9, 2011 at 11:35 AM, Kevin Wolf <kwolf@redhat.com> wrote:
> Am 09.08.2011 12:25, schrieb Stefan Hajnoczi:
>> On Mon, Aug 8, 2011 at 4:16 PM, Kevin Wolf <kwolf@redhat.com> wrote:
>>> Am 08.08.2011 16:49, schrieb Stefan Hajnoczi:
>>>> On Fri, Aug 5, 2011 at 10:48 AM, Kevin Wolf <kwolf@redhat.com> wrote:
>>>>> Am 05.08.2011 11:29, schrieb Stefan Hajnoczi:
>>>>>> On Fri, Aug 5, 2011 at 10:07 AM, Kevin Wolf <kwolf@redhat.com> wrote:
>>>>>>> Am 05.08.2011 10:40, schrieb Stefan Hajnoczi:
>>>>>>>> We've discussed safe methods for reopening image files (e.g. useful for
>>>>>>>> changing the hostcache parameter).  The problem is that closing the file first
>>>>>>>> and then opening it again exposes us to the error case where the open fails.
>>>>>>>> At that point we cannot get to the file anymore and our options are to
>>>>>>>> terminate QEMU, pause the VM, or offline the block device.
>>>>>>>>
>>>>>>>> This window of vulnerability can be eliminated by keeping the file descriptor
>>>>>>>> around and falling back to it should the open fail.
>>>>>>>>
>>>>>>>> The challenge for the file descriptor approach is that image formats, like
>>>>>>>> VMDK, can span multiple files.  Therefore the solution is not as simple as
>>>>>>>> stashing a single file descriptor and reopening from it.
>>>>>>>
>>>>>>> So far I agree. The rest I believe is wrong because you can't assume
>>>>>>> that every backend uses file descriptors. The qemu block layer is based
>>>>>>> on BlockDriverStates, not fds. They are a concept that should be hidden
>>>>>>> in raw-posix.
>>>>>>>
>>>>>>> I think something like this could do:
>>>>>>>
>>>>>>> struct BDRVReopenState {
>>>>>>>    BlockDriverState *bs;
>>>>>>>    /* can be extended by block drivers */
>>>>>>> };
>>>>>>>
>>>>>>> .bdrv_reopen(BlockDriverState *bs, BDRVReopenState **reopen_state, int
>>>>>>> flags);
>>>>>>> .bdrv_reopen_commit(BDRVReopenState *reopen_state);
>>>>>>> .bdrv_reopen_abort(BDRVReopenState *reopen_state);
>>>>>>>
>>>>>>> raw-posix would store the old file descriptor in its reopen_state. On
>>>>>>> commit, it closes the old descriptors, on abort it reverts to the old
>>>>>>> one and closes the newly opened one.
>>>>>>>
>>>>>>> Makes things a bit more complicated than the simple bdrv_reopen I had in
>>>>>>> mind before, but it allows VMDK to get an all-or-nothing semantics.
>>>>>>
>>>>>> Can you show how bdrv_reopen() would use these new interfaces?  I'm
>>>>>> not 100% clear on the idea.
>>>>>
>>>>> Well, you wouldn't only call bdrv_reopen, but also either
>>>>> bdrv_reopen_commit/abort (for the top-level caller we can have a wrapper
>>>>> function that does both, but that's syntactic sugar).
>>>>>
>>>>> For example we would have:
>>>>>
>>>>> int vmdk_reopen()
>>>>
>>>> .bdrv_reopen() is a confusing name for this operation because it does
>>>> not reopen anything.  bdrv_prepare_reopen() might be clearer.
>>>
>>> Makes sense.
>>>
>>>>
>>>>> {
>>>>>    *((VMDKReopenState**) rs) = malloc();
>>>>>
>>>>>    foreach (extent in s->extents) {
>>>>>        ret = bdrv_reopen(extent->file, &extent->reopen_state)
>>>>>        if (ret < 0)
>>>>>            goto fail;
>>>>>    }
>>>>>    return 0;
>>>>>
>>>>> fail:
>>>>>    foreach (extent in rs->already_reopened) {
>>>>>        bdrv_reopen_abort(extent->reopen_state);
>>>>>    }
>>>>>    return ret;
>>>>> }
>>>>
>>>>> void vmdk_reopen_commit()
>>>>> {
>>>>>    foreach (extent in s->extents) {
>>>>>        bdrv_reopen_commit(extent->reopen_state);
>>>>>    }
>>>>>    free(rs);
>>>>> }
>>>>>
>>>>> void vmdk_reopen_abort()
>>>>> {
>>>>>    foreach (extent in s->extents) {
>>>>>        bdrv_reopen_abort(extent->reopen_state);
>>>>>    }
>>>>>    free(rs);
>>>>> }
>>>>
>>>> Does the caller invoke bdrv_close(bs) after bdrv_prepare_reopen(bs,
>>>> &rs)?
>>>
>>> No. Closing the old backend would be part of bdrv_reopen_commit.
>>>
>>> Do you have a use case where it would be helpful if the caller invoked
>>> bdrv_close?
>>
>> When the caller does bdrv_close() two BlockDriverStates are never open
>> for the same image file.  I thought this was a property we wanted.
>>
>> Also, in the block_set_hostcache case we need to reopen without
>> switching to a new BlockDriverState instance.  That means the reopen
>> needs to be in-place with respect to the BlockDriverState *bs pointer.
>>  We cannot create a new instance.
>
> Yes, but where do you even get the second BlockDriverState from?
>
> My prototype only returns an int, not a new BlockDriverState. Until
> bdrv_reopen_commit() it would refer to the old file descriptors etc. and
> after bdrv_reopen_commit() the very same BlockDriverState would refer to
> the new ones.

It seems I don't understand the API.  I thought it was:

do_block_set_hostcache()
{
    bdrv_prepare_reopen(bs, &rs);
    ...open new file and check everything is okay...
    if (ret == 0) {
        bdrv_reopen_commit(rs);
    } else {
        bdrv_reopen_abort(rs);
    }
    return ret;
}

If the caller isn't opening the new file then what's the point of
giving the caller control over prepare, commit, and abort?

Stefan

^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: [Qemu-devel] Safely reopening image files by stashing fds
  2011-08-09 10:50               ` Stefan Hajnoczi
@ 2011-08-09 10:56                 ` Stefan Hajnoczi
  2011-08-09 11:39                   ` Kevin Wolf
  0 siblings, 1 reply; 39+ messages in thread
From: Stefan Hajnoczi @ 2011-08-09 10:56 UTC (permalink / raw)
  To: Kevin Wolf; +Cc: Supriya Kannery, Anthony Liguori, qemu-devel

On Tue, Aug 9, 2011 at 11:50 AM, Stefan Hajnoczi <stefanha@gmail.com> wrote:
> On Tue, Aug 9, 2011 at 11:35 AM, Kevin Wolf <kwolf@redhat.com> wrote:
>> Am 09.08.2011 12:25, schrieb Stefan Hajnoczi:
>>> On Mon, Aug 8, 2011 at 4:16 PM, Kevin Wolf <kwolf@redhat.com> wrote:
>>>> Am 08.08.2011 16:49, schrieb Stefan Hajnoczi:
>>>>> On Fri, Aug 5, 2011 at 10:48 AM, Kevin Wolf <kwolf@redhat.com> wrote:
>>>>>> Am 05.08.2011 11:29, schrieb Stefan Hajnoczi:
>>>>>>> On Fri, Aug 5, 2011 at 10:07 AM, Kevin Wolf <kwolf@redhat.com> wrote:
>>>>>>>> Am 05.08.2011 10:40, schrieb Stefan Hajnoczi:
>>>>>>>>> We've discussed safe methods for reopening image files (e.g. useful for
>>>>>>>>> changing the hostcache parameter).  The problem is that closing the file first
>>>>>>>>> and then opening it again exposes us to the error case where the open fails.
>>>>>>>>> At that point we cannot get to the file anymore and our options are to
>>>>>>>>> terminate QEMU, pause the VM, or offline the block device.
>>>>>>>>>
>>>>>>>>> This window of vulnerability can be eliminated by keeping the file descriptor
>>>>>>>>> around and falling back to it should the open fail.
>>>>>>>>>
>>>>>>>>> The challenge for the file descriptor approach is that image formats, like
>>>>>>>>> VMDK, can span multiple files.  Therefore the solution is not as simple as
>>>>>>>>> stashing a single file descriptor and reopening from it.
>>>>>>>>
>>>>>>>> So far I agree. The rest I believe is wrong because you can't assume
>>>>>>>> that every backend uses file descriptors. The qemu block layer is based
>>>>>>>> on BlockDriverStates, not fds. They are a concept that should be hidden
>>>>>>>> in raw-posix.
>>>>>>>>
>>>>>>>> I think something like this could do:
>>>>>>>>
>>>>>>>> struct BDRVReopenState {
>>>>>>>>    BlockDriverState *bs;
>>>>>>>>    /* can be extended by block drivers */
>>>>>>>> };
>>>>>>>>
>>>>>>>> .bdrv_reopen(BlockDriverState *bs, BDRVReopenState **reopen_state, int
>>>>>>>> flags);
>>>>>>>> .bdrv_reopen_commit(BDRVReopenState *reopen_state);
>>>>>>>> .bdrv_reopen_abort(BDRVReopenState *reopen_state);
>>>>>>>>
>>>>>>>> raw-posix would store the old file descriptor in its reopen_state. On
>>>>>>>> commit, it closes the old descriptors, on abort it reverts to the old
>>>>>>>> one and closes the newly opened one.
>>>>>>>>
>>>>>>>> Makes things a bit more complicated than the simple bdrv_reopen I had in
>>>>>>>> mind before, but it allows VMDK to get an all-or-nothing semantics.
>>>>>>>
>>>>>>> Can you show how bdrv_reopen() would use these new interfaces?  I'm
>>>>>>> not 100% clear on the idea.
>>>>>>
>>>>>> Well, you wouldn't only call bdrv_reopen, but also either
>>>>>> bdrv_reopen_commit/abort (for the top-level caller we can have a wrapper
>>>>>> function that does both, but that's syntactic sugar).
>>>>>>
>>>>>> For example we would have:
>>>>>>
>>>>>> int vmdk_reopen()
>>>>>
>>>>> .bdrv_reopen() is a confusing name for this operation because it does
>>>>> not reopen anything.  bdrv_prepare_reopen() might be clearer.
>>>>
>>>> Makes sense.
>>>>
>>>>>
>>>>>> {
>>>>>>    *((VMDKReopenState**) rs) = malloc();
>>>>>>
>>>>>>    foreach (extent in s->extents) {
>>>>>>        ret = bdrv_reopen(extent->file, &extent->reopen_state)
>>>>>>        if (ret < 0)
>>>>>>            goto fail;
>>>>>>    }
>>>>>>    return 0;
>>>>>>
>>>>>> fail:
>>>>>>    foreach (extent in rs->already_reopened) {
>>>>>>        bdrv_reopen_abort(extent->reopen_state);
>>>>>>    }
>>>>>>    return ret;
>>>>>> }
>>>>>
>>>>>> void vmdk_reopen_commit()
>>>>>> {
>>>>>>    foreach (extent in s->extents) {
>>>>>>        bdrv_reopen_commit(extent->reopen_state);
>>>>>>    }
>>>>>>    free(rs);
>>>>>> }
>>>>>>
>>>>>> void vmdk_reopen_abort()
>>>>>> {
>>>>>>    foreach (extent in s->extents) {
>>>>>>        bdrv_reopen_abort(extent->reopen_state);
>>>>>>    }
>>>>>>    free(rs);
>>>>>> }
>>>>>
>>>>> Does the caller invoke bdrv_close(bs) after bdrv_prepare_reopen(bs,
>>>>> &rs)?
>>>>
>>>> No. Closing the old backend would be part of bdrv_reopen_commit.
>>>>
>>>> Do you have a use case where it would be helpful if the caller invoked
>>>> bdrv_close?
>>>
>>> When the caller does bdrv_close() two BlockDriverStates are never open
>>> for the same image file.  I thought this was a property we wanted.
>>>
>>> Also, in the block_set_hostcache case we need to reopen without
>>> switching to a new BlockDriverState instance.  That means the reopen
>>> needs to be in-place with respect to the BlockDriverState *bs pointer.
>>>  We cannot create a new instance.
>>
>> Yes, but where do you even get the second BlockDriverState from?
>>
>> My prototype only returns an int, not a new BlockDriverState. Until
>> bdrv_reopen_commit() it would refer to the old file descriptors etc. and
>> after bdrv_reopen_commit() the very same BlockDriverState would refer to
>> the new ones.
>
> It seems I don't understand the API.  I thought it was:
>
> do_block_set_hostcache()
> {
>    bdrv_prepare_reopen(bs, &rs);
>    ...open new file and check everything is okay...
>    if (ret == 0) {
>        bdrv_reopen_commit(rs);
>    } else {
>        bdrv_reopen_abort(rs);
>    }
>    return ret;
> }
>
> If the caller isn't opening the new file then what's the point of
> giving the caller control over prepare, commit, and abort?

After sending the last email I realized what I was missing:

You need the prepare, commit, and abort API in order to handle
multi-file block drivers like VMDK.

Stefan

^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: [Qemu-devel] Safely reopening image files by stashing fds
  2011-08-09 10:56                 ` Stefan Hajnoczi
@ 2011-08-09 11:39                   ` Kevin Wolf
  2011-08-09 12:00                     ` Stefan Hajnoczi
  0 siblings, 1 reply; 39+ messages in thread
From: Kevin Wolf @ 2011-08-09 11:39 UTC (permalink / raw)
  To: Stefan Hajnoczi; +Cc: Supriya Kannery, Anthony Liguori, qemu-devel

Am 09.08.2011 12:56, schrieb Stefan Hajnoczi:
> On Tue, Aug 9, 2011 at 11:50 AM, Stefan Hajnoczi <stefanha@gmail.com> wrote:
>> On Tue, Aug 9, 2011 at 11:35 AM, Kevin Wolf <kwolf@redhat.com> wrote:
>>> Am 09.08.2011 12:25, schrieb Stefan Hajnoczi:
>>>> On Mon, Aug 8, 2011 at 4:16 PM, Kevin Wolf <kwolf@redhat.com> wrote:
>>>>> Am 08.08.2011 16:49, schrieb Stefan Hajnoczi:
>>>>>> On Fri, Aug 5, 2011 at 10:48 AM, Kevin Wolf <kwolf@redhat.com> wrote:
>>>>>>> Am 05.08.2011 11:29, schrieb Stefan Hajnoczi:
>>>>>>>> On Fri, Aug 5, 2011 at 10:07 AM, Kevin Wolf <kwolf@redhat.com> wrote:
>>>>>>>>> Am 05.08.2011 10:40, schrieb Stefan Hajnoczi:
>>>>>>>>>> We've discussed safe methods for reopening image files (e.g. useful for
>>>>>>>>>> changing the hostcache parameter).  The problem is that closing the file first
>>>>>>>>>> and then opening it again exposes us to the error case where the open fails.
>>>>>>>>>> At that point we cannot get to the file anymore and our options are to
>>>>>>>>>> terminate QEMU, pause the VM, or offline the block device.
>>>>>>>>>>
>>>>>>>>>> This window of vulnerability can be eliminated by keeping the file descriptor
>>>>>>>>>> around and falling back to it should the open fail.
>>>>>>>>>>
>>>>>>>>>> The challenge for the file descriptor approach is that image formats, like
>>>>>>>>>> VMDK, can span multiple files.  Therefore the solution is not as simple as
>>>>>>>>>> stashing a single file descriptor and reopening from it.
>>>>>>>>>
>>>>>>>>> So far I agree. The rest I believe is wrong because you can't assume
>>>>>>>>> that every backend uses file descriptors. The qemu block layer is based
>>>>>>>>> on BlockDriverStates, not fds. They are a concept that should be hidden
>>>>>>>>> in raw-posix.
>>>>>>>>>
>>>>>>>>> I think something like this could do:
>>>>>>>>>
>>>>>>>>> struct BDRVReopenState {
>>>>>>>>>    BlockDriverState *bs;
>>>>>>>>>    /* can be extended by block drivers */
>>>>>>>>> };
>>>>>>>>>
>>>>>>>>> .bdrv_reopen(BlockDriverState *bs, BDRVReopenState **reopen_state, int
>>>>>>>>> flags);
>>>>>>>>> .bdrv_reopen_commit(BDRVReopenState *reopen_state);
>>>>>>>>> .bdrv_reopen_abort(BDRVReopenState *reopen_state);
>>>>>>>>>
>>>>>>>>> raw-posix would store the old file descriptor in its reopen_state. On
>>>>>>>>> commit, it closes the old descriptors, on abort it reverts to the old
>>>>>>>>> one and closes the newly opened one.
>>>>>>>>>
>>>>>>>>> Makes things a bit more complicated than the simple bdrv_reopen I had in
>>>>>>>>> mind before, but it allows VMDK to get an all-or-nothing semantics.
>>>>>>>>
>>>>>>>> Can you show how bdrv_reopen() would use these new interfaces?  I'm
>>>>>>>> not 100% clear on the idea.
>>>>>>>
>>>>>>> Well, you wouldn't only call bdrv_reopen, but also either
>>>>>>> bdrv_reopen_commit/abort (for the top-level caller we can have a wrapper
>>>>>>> function that does both, but that's syntactic sugar).
>>>>>>>
>>>>>>> For example we would have:
>>>>>>>
>>>>>>> int vmdk_reopen()
>>>>>>
>>>>>> .bdrv_reopen() is a confusing name for this operation because it does
>>>>>> not reopen anything.  bdrv_prepare_reopen() might be clearer.
>>>>>
>>>>> Makes sense.
>>>>>
>>>>>>
>>>>>>> {
>>>>>>>    *((VMDKReopenState**) rs) = malloc();
>>>>>>>
>>>>>>>    foreach (extent in s->extents) {
>>>>>>>        ret = bdrv_reopen(extent->file, &extent->reopen_state)
>>>>>>>        if (ret < 0)
>>>>>>>            goto fail;
>>>>>>>    }
>>>>>>>    return 0;
>>>>>>>
>>>>>>> fail:
>>>>>>>    foreach (extent in rs->already_reopened) {
>>>>>>>        bdrv_reopen_abort(extent->reopen_state);
>>>>>>>    }
>>>>>>>    return ret;
>>>>>>> }
>>>>>>
>>>>>>> void vmdk_reopen_commit()
>>>>>>> {
>>>>>>>    foreach (extent in s->extents) {
>>>>>>>        bdrv_reopen_commit(extent->reopen_state);
>>>>>>>    }
>>>>>>>    free(rs);
>>>>>>> }
>>>>>>>
>>>>>>> void vmdk_reopen_abort()
>>>>>>> {
>>>>>>>    foreach (extent in s->extents) {
>>>>>>>        bdrv_reopen_abort(extent->reopen_state);
>>>>>>>    }
>>>>>>>    free(rs);
>>>>>>> }
>>>>>>
>>>>>> Does the caller invoke bdrv_close(bs) after bdrv_prepare_reopen(bs,
>>>>>> &rs)?
>>>>>
>>>>> No. Closing the old backend would be part of bdrv_reopen_commit.
>>>>>
>>>>> Do you have a use case where it would be helpful if the caller invoked
>>>>> bdrv_close?
>>>>
>>>> When the caller does bdrv_close() two BlockDriverStates are never open
>>>> for the same image file.  I thought this was a property we wanted.
>>>>
>>>> Also, in the block_set_hostcache case we need to reopen without
>>>> switching to a new BlockDriverState instance.  That means the reopen
>>>> needs to be in-place with respect to the BlockDriverState *bs pointer.
>>>>  We cannot create a new instance.
>>>
>>> Yes, but where do you even get the second BlockDriverState from?
>>>
>>> My prototype only returns an int, not a new BlockDriverState. Until
>>> bdrv_reopen_commit() it would refer to the old file descriptors etc. and
>>> after bdrv_reopen_commit() the very same BlockDriverState would refer to
>>> the new ones.
>>
>> It seems I don't understand the API.  I thought it was:
>>
>> do_block_set_hostcache()
>> {
>>    bdrv_prepare_reopen(bs, &rs);
>>    ...open new file and check everything is okay...
>>    if (ret == 0) {
>>        bdrv_reopen_commit(rs);
>>    } else {
>>        bdrv_reopen_abort(rs);
>>    }
>>    return ret;
>> }
>>
>> If the caller isn't opening the new file then what's the point of
>> giving the caller control over prepare, commit, and abort?
> 
> After sending the last email I realized what I was missing:
> 
> You need the prepare, commit, and abort API in order to handle
> multi-file block drivers like VMDK.

Yes, this is whole point of separating commit out. Does the proposal
make sense to you now?

Kevin

^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: [Qemu-devel] Safely reopening image files by stashing fds
  2011-08-09 11:39                   ` Kevin Wolf
@ 2011-08-09 12:00                     ` Stefan Hajnoczi
  2011-08-09 12:24                       ` Kevin Wolf
  0 siblings, 1 reply; 39+ messages in thread
From: Stefan Hajnoczi @ 2011-08-09 12:00 UTC (permalink / raw)
  To: Kevin Wolf; +Cc: Supriya Kannery, Anthony Liguori, qemu-devel

On Tue, Aug 09, 2011 at 01:39:13PM +0200, Kevin Wolf wrote:
> Am 09.08.2011 12:56, schrieb Stefan Hajnoczi:
> > On Tue, Aug 9, 2011 at 11:50 AM, Stefan Hajnoczi <stefanha@gmail.com> wrote:
> >> On Tue, Aug 9, 2011 at 11:35 AM, Kevin Wolf <kwolf@redhat.com> wrote:
> >>> Am 09.08.2011 12:25, schrieb Stefan Hajnoczi:
> >>>> On Mon, Aug 8, 2011 at 4:16 PM, Kevin Wolf <kwolf@redhat.com> wrote:
> >>>>> Am 08.08.2011 16:49, schrieb Stefan Hajnoczi:
> >>>>>> On Fri, Aug 5, 2011 at 10:48 AM, Kevin Wolf <kwolf@redhat.com> wrote:
> >>>>>>> Am 05.08.2011 11:29, schrieb Stefan Hajnoczi:
> >>>>>>>> On Fri, Aug 5, 2011 at 10:07 AM, Kevin Wolf <kwolf@redhat.com> wrote:
> >>>>>>>>> Am 05.08.2011 10:40, schrieb Stefan Hajnoczi:
> >>>>>>>>>> We've discussed safe methods for reopening image files (e.g. useful for
> >>>>>>>>>> changing the hostcache parameter).  The problem is that closing the file first
> >>>>>>>>>> and then opening it again exposes us to the error case where the open fails.
> >>>>>>>>>> At that point we cannot get to the file anymore and our options are to
> >>>>>>>>>> terminate QEMU, pause the VM, or offline the block device.
> >>>>>>>>>>
> >>>>>>>>>> This window of vulnerability can be eliminated by keeping the file descriptor
> >>>>>>>>>> around and falling back to it should the open fail.
> >>>>>>>>>>
> >>>>>>>>>> The challenge for the file descriptor approach is that image formats, like
> >>>>>>>>>> VMDK, can span multiple files.  Therefore the solution is not as simple as
> >>>>>>>>>> stashing a single file descriptor and reopening from it.
> >>>>>>>>>
> >>>>>>>>> So far I agree. The rest I believe is wrong because you can't assume
> >>>>>>>>> that every backend uses file descriptors. The qemu block layer is based
> >>>>>>>>> on BlockDriverStates, not fds. They are a concept that should be hidden
> >>>>>>>>> in raw-posix.
> >>>>>>>>>
> >>>>>>>>> I think something like this could do:
> >>>>>>>>>
> >>>>>>>>> struct BDRVReopenState {
> >>>>>>>>>    BlockDriverState *bs;
> >>>>>>>>>    /* can be extended by block drivers */
> >>>>>>>>> };
> >>>>>>>>>
> >>>>>>>>> .bdrv_reopen(BlockDriverState *bs, BDRVReopenState **reopen_state, int
> >>>>>>>>> flags);
> >>>>>>>>> .bdrv_reopen_commit(BDRVReopenState *reopen_state);
> >>>>>>>>> .bdrv_reopen_abort(BDRVReopenState *reopen_state);
> >>>>>>>>>
> >>>>>>>>> raw-posix would store the old file descriptor in its reopen_state. On
> >>>>>>>>> commit, it closes the old descriptors, on abort it reverts to the old
> >>>>>>>>> one and closes the newly opened one.
> >>>>>>>>>
> >>>>>>>>> Makes things a bit more complicated than the simple bdrv_reopen I had in
> >>>>>>>>> mind before, but it allows VMDK to get an all-or-nothing semantics.
> >>>>>>>>
> >>>>>>>> Can you show how bdrv_reopen() would use these new interfaces?  I'm
> >>>>>>>> not 100% clear on the idea.
> >>>>>>>
> >>>>>>> Well, you wouldn't only call bdrv_reopen, but also either
> >>>>>>> bdrv_reopen_commit/abort (for the top-level caller we can have a wrapper
> >>>>>>> function that does both, but that's syntactic sugar).
> >>>>>>>
> >>>>>>> For example we would have:
> >>>>>>>
> >>>>>>> int vmdk_reopen()
> >>>>>>
> >>>>>> .bdrv_reopen() is a confusing name for this operation because it does
> >>>>>> not reopen anything.  bdrv_prepare_reopen() might be clearer.
> >>>>>
> >>>>> Makes sense.
> >>>>>
> >>>>>>
> >>>>>>> {
> >>>>>>>    *((VMDKReopenState**) rs) = malloc();
> >>>>>>>
> >>>>>>>    foreach (extent in s->extents) {
> >>>>>>>        ret = bdrv_reopen(extent->file, &extent->reopen_state)
> >>>>>>>        if (ret < 0)
> >>>>>>>            goto fail;
> >>>>>>>    }
> >>>>>>>    return 0;
> >>>>>>>
> >>>>>>> fail:
> >>>>>>>    foreach (extent in rs->already_reopened) {
> >>>>>>>        bdrv_reopen_abort(extent->reopen_state);
> >>>>>>>    }
> >>>>>>>    return ret;
> >>>>>>> }
> >>>>>>
> >>>>>>> void vmdk_reopen_commit()
> >>>>>>> {
> >>>>>>>    foreach (extent in s->extents) {
> >>>>>>>        bdrv_reopen_commit(extent->reopen_state);
> >>>>>>>    }
> >>>>>>>    free(rs);
> >>>>>>> }
> >>>>>>>
> >>>>>>> void vmdk_reopen_abort()
> >>>>>>> {
> >>>>>>>    foreach (extent in s->extents) {
> >>>>>>>        bdrv_reopen_abort(extent->reopen_state);
> >>>>>>>    }
> >>>>>>>    free(rs);
> >>>>>>> }
> >>>>>>
> >>>>>> Does the caller invoke bdrv_close(bs) after bdrv_prepare_reopen(bs,
> >>>>>> &rs)?
> >>>>>
> >>>>> No. Closing the old backend would be part of bdrv_reopen_commit.
> >>>>>
> >>>>> Do you have a use case where it would be helpful if the caller invoked
> >>>>> bdrv_close?
> >>>>
> >>>> When the caller does bdrv_close() two BlockDriverStates are never open
> >>>> for the same image file.  I thought this was a property we wanted.
> >>>>
> >>>> Also, in the block_set_hostcache case we need to reopen without
> >>>> switching to a new BlockDriverState instance.  That means the reopen
> >>>> needs to be in-place with respect to the BlockDriverState *bs pointer.
> >>>>  We cannot create a new instance.
> >>>
> >>> Yes, but where do you even get the second BlockDriverState from?
> >>>
> >>> My prototype only returns an int, not a new BlockDriverState. Until
> >>> bdrv_reopen_commit() it would refer to the old file descriptors etc. and
> >>> after bdrv_reopen_commit() the very same BlockDriverState would refer to
> >>> the new ones.
> >>
> >> It seems I don't understand the API.  I thought it was:
> >>
> >> do_block_set_hostcache()
> >> {
> >>    bdrv_prepare_reopen(bs, &rs);
> >>    ...open new file and check everything is okay...
> >>    if (ret == 0) {
> >>        bdrv_reopen_commit(rs);
> >>    } else {
> >>        bdrv_reopen_abort(rs);
> >>    }
> >>    return ret;
> >> }
> >>
> >> If the caller isn't opening the new file then what's the point of
> >> giving the caller control over prepare, commit, and abort?
> > 
> > After sending the last email I realized what I was missing:
> > 
> > You need the prepare, commit, and abort API in order to handle
> > multi-file block drivers like VMDK.
> 
> Yes, this is whole point of separating commit out. Does the proposal
> make sense to you now?

It depends on the details.  Adding more functions that every BlockDriver
must implement is bad, so it's important that we only drop this
functionality into raw-posix.c, vmdk.c, and block.c as appropriate.

I liked the idea of doing a generic FDStash type that the monitor and
bdrv_reopen() can use.  Blue's idea to hook at the qemu_open() level
takes that further.

But if we can do prepare, commit, and abort in a relatively simple way
then I'm for it.

Stefan

^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: [Qemu-devel] Safely reopening image files by stashing fds
  2011-08-09 12:00                     ` Stefan Hajnoczi
@ 2011-08-09 12:24                       ` Kevin Wolf
  2011-08-09 19:39                         ` Blue Swirl
  0 siblings, 1 reply; 39+ messages in thread
From: Kevin Wolf @ 2011-08-09 12:24 UTC (permalink / raw)
  To: Stefan Hajnoczi; +Cc: Supriya Kannery, Anthony Liguori, qemu-devel

Am 09.08.2011 14:00, schrieb Stefan Hajnoczi:
> On Tue, Aug 09, 2011 at 01:39:13PM +0200, Kevin Wolf wrote:
>> Am 09.08.2011 12:56, schrieb Stefan Hajnoczi:
>>> On Tue, Aug 9, 2011 at 11:50 AM, Stefan Hajnoczi <stefanha@gmail.com> wrote:
>>>> On Tue, Aug 9, 2011 at 11:35 AM, Kevin Wolf <kwolf@redhat.com> wrote:
>>>>> Am 09.08.2011 12:25, schrieb Stefan Hajnoczi:
>>>>>> On Mon, Aug 8, 2011 at 4:16 PM, Kevin Wolf <kwolf@redhat.com> wrote:
>>>>>>> Am 08.08.2011 16:49, schrieb Stefan Hajnoczi:
>>>>>>>> On Fri, Aug 5, 2011 at 10:48 AM, Kevin Wolf <kwolf@redhat.com> wrote:
>>>>>>>>> Am 05.08.2011 11:29, schrieb Stefan Hajnoczi:
>>>>>>>>>> On Fri, Aug 5, 2011 at 10:07 AM, Kevin Wolf <kwolf@redhat.com> wrote:
>>>>>>>>>>> Am 05.08.2011 10:40, schrieb Stefan Hajnoczi:
>>>>>>>>>>>> We've discussed safe methods for reopening image files (e.g. useful for
>>>>>>>>>>>> changing the hostcache parameter).  The problem is that closing the file first
>>>>>>>>>>>> and then opening it again exposes us to the error case where the open fails.
>>>>>>>>>>>> At that point we cannot get to the file anymore and our options are to
>>>>>>>>>>>> terminate QEMU, pause the VM, or offline the block device.
>>>>>>>>>>>>
>>>>>>>>>>>> This window of vulnerability can be eliminated by keeping the file descriptor
>>>>>>>>>>>> around and falling back to it should the open fail.
>>>>>>>>>>>>
>>>>>>>>>>>> The challenge for the file descriptor approach is that image formats, like
>>>>>>>>>>>> VMDK, can span multiple files.  Therefore the solution is not as simple as
>>>>>>>>>>>> stashing a single file descriptor and reopening from it.
>>>>>>>>>>>
>>>>>>>>>>> So far I agree. The rest I believe is wrong because you can't assume
>>>>>>>>>>> that every backend uses file descriptors. The qemu block layer is based
>>>>>>>>>>> on BlockDriverStates, not fds. They are a concept that should be hidden
>>>>>>>>>>> in raw-posix.
>>>>>>>>>>>
>>>>>>>>>>> I think something like this could do:
>>>>>>>>>>>
>>>>>>>>>>> struct BDRVReopenState {
>>>>>>>>>>>    BlockDriverState *bs;
>>>>>>>>>>>    /* can be extended by block drivers */
>>>>>>>>>>> };
>>>>>>>>>>>
>>>>>>>>>>> .bdrv_reopen(BlockDriverState *bs, BDRVReopenState **reopen_state, int
>>>>>>>>>>> flags);
>>>>>>>>>>> .bdrv_reopen_commit(BDRVReopenState *reopen_state);
>>>>>>>>>>> .bdrv_reopen_abort(BDRVReopenState *reopen_state);
>>>>>>>>>>>
>>>>>>>>>>> raw-posix would store the old file descriptor in its reopen_state. On
>>>>>>>>>>> commit, it closes the old descriptors, on abort it reverts to the old
>>>>>>>>>>> one and closes the newly opened one.
>>>>>>>>>>>
>>>>>>>>>>> Makes things a bit more complicated than the simple bdrv_reopen I had in
>>>>>>>>>>> mind before, but it allows VMDK to get an all-or-nothing semantics.
>>>>>>>>>>
>>>>>>>>>> Can you show how bdrv_reopen() would use these new interfaces?  I'm
>>>>>>>>>> not 100% clear on the idea.
>>>>>>>>>
>>>>>>>>> Well, you wouldn't only call bdrv_reopen, but also either
>>>>>>>>> bdrv_reopen_commit/abort (for the top-level caller we can have a wrapper
>>>>>>>>> function that does both, but that's syntactic sugar).
>>>>>>>>>
>>>>>>>>> For example we would have:
>>>>>>>>>
>>>>>>>>> int vmdk_reopen()
>>>>>>>>
>>>>>>>> .bdrv_reopen() is a confusing name for this operation because it does
>>>>>>>> not reopen anything.  bdrv_prepare_reopen() might be clearer.
>>>>>>>
>>>>>>> Makes sense.
>>>>>>>
>>>>>>>>
>>>>>>>>> {
>>>>>>>>>    *((VMDKReopenState**) rs) = malloc();
>>>>>>>>>
>>>>>>>>>    foreach (extent in s->extents) {
>>>>>>>>>        ret = bdrv_reopen(extent->file, &extent->reopen_state)
>>>>>>>>>        if (ret < 0)
>>>>>>>>>            goto fail;
>>>>>>>>>    }
>>>>>>>>>    return 0;
>>>>>>>>>
>>>>>>>>> fail:
>>>>>>>>>    foreach (extent in rs->already_reopened) {
>>>>>>>>>        bdrv_reopen_abort(extent->reopen_state);
>>>>>>>>>    }
>>>>>>>>>    return ret;
>>>>>>>>> }
>>>>>>>>
>>>>>>>>> void vmdk_reopen_commit()
>>>>>>>>> {
>>>>>>>>>    foreach (extent in s->extents) {
>>>>>>>>>        bdrv_reopen_commit(extent->reopen_state);
>>>>>>>>>    }
>>>>>>>>>    free(rs);
>>>>>>>>> }
>>>>>>>>>
>>>>>>>>> void vmdk_reopen_abort()
>>>>>>>>> {
>>>>>>>>>    foreach (extent in s->extents) {
>>>>>>>>>        bdrv_reopen_abort(extent->reopen_state);
>>>>>>>>>    }
>>>>>>>>>    free(rs);
>>>>>>>>> }
>>>>>>>>
>>>>>>>> Does the caller invoke bdrv_close(bs) after bdrv_prepare_reopen(bs,
>>>>>>>> &rs)?
>>>>>>>
>>>>>>> No. Closing the old backend would be part of bdrv_reopen_commit.
>>>>>>>
>>>>>>> Do you have a use case where it would be helpful if the caller invoked
>>>>>>> bdrv_close?
>>>>>>
>>>>>> When the caller does bdrv_close() two BlockDriverStates are never open
>>>>>> for the same image file.  I thought this was a property we wanted.
>>>>>>
>>>>>> Also, in the block_set_hostcache case we need to reopen without
>>>>>> switching to a new BlockDriverState instance.  That means the reopen
>>>>>> needs to be in-place with respect to the BlockDriverState *bs pointer.
>>>>>>  We cannot create a new instance.
>>>>>
>>>>> Yes, but where do you even get the second BlockDriverState from?
>>>>>
>>>>> My prototype only returns an int, not a new BlockDriverState. Until
>>>>> bdrv_reopen_commit() it would refer to the old file descriptors etc. and
>>>>> after bdrv_reopen_commit() the very same BlockDriverState would refer to
>>>>> the new ones.
>>>>
>>>> It seems I don't understand the API.  I thought it was:
>>>>
>>>> do_block_set_hostcache()
>>>> {
>>>>    bdrv_prepare_reopen(bs, &rs);
>>>>    ...open new file and check everything is okay...
>>>>    if (ret == 0) {
>>>>        bdrv_reopen_commit(rs);
>>>>    } else {
>>>>        bdrv_reopen_abort(rs);
>>>>    }
>>>>    return ret;
>>>> }
>>>>
>>>> If the caller isn't opening the new file then what's the point of
>>>> giving the caller control over prepare, commit, and abort?
>>>
>>> After sending the last email I realized what I was missing:
>>>
>>> You need the prepare, commit, and abort API in order to handle
>>> multi-file block drivers like VMDK.
>>
>> Yes, this is whole point of separating commit out. Does the proposal
>> make sense to you now?
> 
> It depends on the details.  Adding more functions that every BlockDriver
> must implement is bad, so it's important that we only drop this
> functionality into raw-posix.c, vmdk.c, and block.c as appropriate.

Yes, I agree.

> I liked the idea of doing a generic FDStash type that the monitor and
> bdrv_reopen() can use.  Blue's idea to hook at the qemu_open() level
> takes that further.

Well, having something that works for the raw-posix, the monitor and
maybe some more things is nice. Having something that works for
raw-posix, Sheepdog and rbd is an absolute requirement and I can't see
how FDStash solves that. Even raw-win32 doesn't have an int fd, but a
HANDLE hfile for its backend.

So my main concern is that file descriptors are a concept not as generic
as it needs to be to suit all block drivers.

> But if we can do prepare, commit, and abort in a relatively simple way
> then I'm for it.

Great, then let's do that.

Kevin

^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: [Qemu-devel] Safely reopening image files by stashing fds
  2011-08-09 12:24                       ` Kevin Wolf
@ 2011-08-09 19:39                         ` Blue Swirl
  2011-08-10  7:58                           ` Kevin Wolf
  0 siblings, 1 reply; 39+ messages in thread
From: Blue Swirl @ 2011-08-09 19:39 UTC (permalink / raw)
  To: Kevin Wolf; +Cc: Stefan Hajnoczi, Anthony Liguori, qemu-devel, Supriya Kannery

On Tue, Aug 9, 2011 at 12:24 PM, Kevin Wolf <kwolf@redhat.com> wrote:
> Am 09.08.2011 14:00, schrieb Stefan Hajnoczi:
>> On Tue, Aug 09, 2011 at 01:39:13PM +0200, Kevin Wolf wrote:
>>> Am 09.08.2011 12:56, schrieb Stefan Hajnoczi:
>>>> On Tue, Aug 9, 2011 at 11:50 AM, Stefan Hajnoczi <stefanha@gmail.com> wrote:
>>>>> On Tue, Aug 9, 2011 at 11:35 AM, Kevin Wolf <kwolf@redhat.com> wrote:
>>>>>> Am 09.08.2011 12:25, schrieb Stefan Hajnoczi:
>>>>>>> On Mon, Aug 8, 2011 at 4:16 PM, Kevin Wolf <kwolf@redhat.com> wrote:
>>>>>>>> Am 08.08.2011 16:49, schrieb Stefan Hajnoczi:
>>>>>>>>> On Fri, Aug 5, 2011 at 10:48 AM, Kevin Wolf <kwolf@redhat.com> wrote:
>>>>>>>>>> Am 05.08.2011 11:29, schrieb Stefan Hajnoczi:
>>>>>>>>>>> On Fri, Aug 5, 2011 at 10:07 AM, Kevin Wolf <kwolf@redhat.com> wrote:
>>>>>>>>>>>> Am 05.08.2011 10:40, schrieb Stefan Hajnoczi:
>>>>>>>>>>>>> We've discussed safe methods for reopening image files (e.g. useful for
>>>>>>>>>>>>> changing the hostcache parameter).  The problem is that closing the file first
>>>>>>>>>>>>> and then opening it again exposes us to the error case where the open fails.
>>>>>>>>>>>>> At that point we cannot get to the file anymore and our options are to
>>>>>>>>>>>>> terminate QEMU, pause the VM, or offline the block device.
>>>>>>>>>>>>>
>>>>>>>>>>>>> This window of vulnerability can be eliminated by keeping the file descriptor
>>>>>>>>>>>>> around and falling back to it should the open fail.
>>>>>>>>>>>>>
>>>>>>>>>>>>> The challenge for the file descriptor approach is that image formats, like
>>>>>>>>>>>>> VMDK, can span multiple files.  Therefore the solution is not as simple as
>>>>>>>>>>>>> stashing a single file descriptor and reopening from it.
>>>>>>>>>>>>
>>>>>>>>>>>> So far I agree. The rest I believe is wrong because you can't assume
>>>>>>>>>>>> that every backend uses file descriptors. The qemu block layer is based
>>>>>>>>>>>> on BlockDriverStates, not fds. They are a concept that should be hidden
>>>>>>>>>>>> in raw-posix.
>>>>>>>>>>>>
>>>>>>>>>>>> I think something like this could do:
>>>>>>>>>>>>
>>>>>>>>>>>> struct BDRVReopenState {
>>>>>>>>>>>>    BlockDriverState *bs;
>>>>>>>>>>>>    /* can be extended by block drivers */
>>>>>>>>>>>> };
>>>>>>>>>>>>
>>>>>>>>>>>> .bdrv_reopen(BlockDriverState *bs, BDRVReopenState **reopen_state, int
>>>>>>>>>>>> flags);
>>>>>>>>>>>> .bdrv_reopen_commit(BDRVReopenState *reopen_state);
>>>>>>>>>>>> .bdrv_reopen_abort(BDRVReopenState *reopen_state);
>>>>>>>>>>>>
>>>>>>>>>>>> raw-posix would store the old file descriptor in its reopen_state. On
>>>>>>>>>>>> commit, it closes the old descriptors, on abort it reverts to the old
>>>>>>>>>>>> one and closes the newly opened one.
>>>>>>>>>>>>
>>>>>>>>>>>> Makes things a bit more complicated than the simple bdrv_reopen I had in
>>>>>>>>>>>> mind before, but it allows VMDK to get an all-or-nothing semantics.
>>>>>>>>>>>
>>>>>>>>>>> Can you show how bdrv_reopen() would use these new interfaces?  I'm
>>>>>>>>>>> not 100% clear on the idea.
>>>>>>>>>>
>>>>>>>>>> Well, you wouldn't only call bdrv_reopen, but also either
>>>>>>>>>> bdrv_reopen_commit/abort (for the top-level caller we can have a wrapper
>>>>>>>>>> function that does both, but that's syntactic sugar).
>>>>>>>>>>
>>>>>>>>>> For example we would have:
>>>>>>>>>>
>>>>>>>>>> int vmdk_reopen()
>>>>>>>>>
>>>>>>>>> .bdrv_reopen() is a confusing name for this operation because it does
>>>>>>>>> not reopen anything.  bdrv_prepare_reopen() might be clearer.
>>>>>>>>
>>>>>>>> Makes sense.
>>>>>>>>
>>>>>>>>>
>>>>>>>>>> {
>>>>>>>>>>    *((VMDKReopenState**) rs) = malloc();
>>>>>>>>>>
>>>>>>>>>>    foreach (extent in s->extents) {
>>>>>>>>>>        ret = bdrv_reopen(extent->file, &extent->reopen_state)
>>>>>>>>>>        if (ret < 0)
>>>>>>>>>>            goto fail;
>>>>>>>>>>    }
>>>>>>>>>>    return 0;
>>>>>>>>>>
>>>>>>>>>> fail:
>>>>>>>>>>    foreach (extent in rs->already_reopened) {
>>>>>>>>>>        bdrv_reopen_abort(extent->reopen_state);
>>>>>>>>>>    }
>>>>>>>>>>    return ret;
>>>>>>>>>> }
>>>>>>>>>
>>>>>>>>>> void vmdk_reopen_commit()
>>>>>>>>>> {
>>>>>>>>>>    foreach (extent in s->extents) {
>>>>>>>>>>        bdrv_reopen_commit(extent->reopen_state);
>>>>>>>>>>    }
>>>>>>>>>>    free(rs);
>>>>>>>>>> }
>>>>>>>>>>
>>>>>>>>>> void vmdk_reopen_abort()
>>>>>>>>>> {
>>>>>>>>>>    foreach (extent in s->extents) {
>>>>>>>>>>        bdrv_reopen_abort(extent->reopen_state);
>>>>>>>>>>    }
>>>>>>>>>>    free(rs);
>>>>>>>>>> }
>>>>>>>>>
>>>>>>>>> Does the caller invoke bdrv_close(bs) after bdrv_prepare_reopen(bs,
>>>>>>>>> &rs)?
>>>>>>>>
>>>>>>>> No. Closing the old backend would be part of bdrv_reopen_commit.
>>>>>>>>
>>>>>>>> Do you have a use case where it would be helpful if the caller invoked
>>>>>>>> bdrv_close?
>>>>>>>
>>>>>>> When the caller does bdrv_close() two BlockDriverStates are never open
>>>>>>> for the same image file.  I thought this was a property we wanted.
>>>>>>>
>>>>>>> Also, in the block_set_hostcache case we need to reopen without
>>>>>>> switching to a new BlockDriverState instance.  That means the reopen
>>>>>>> needs to be in-place with respect to the BlockDriverState *bs pointer.
>>>>>>>  We cannot create a new instance.
>>>>>>
>>>>>> Yes, but where do you even get the second BlockDriverState from?
>>>>>>
>>>>>> My prototype only returns an int, not a new BlockDriverState. Until
>>>>>> bdrv_reopen_commit() it would refer to the old file descriptors etc. and
>>>>>> after bdrv_reopen_commit() the very same BlockDriverState would refer to
>>>>>> the new ones.
>>>>>
>>>>> It seems I don't understand the API.  I thought it was:
>>>>>
>>>>> do_block_set_hostcache()
>>>>> {
>>>>>    bdrv_prepare_reopen(bs, &rs);
>>>>>    ...open new file and check everything is okay...
>>>>>    if (ret == 0) {
>>>>>        bdrv_reopen_commit(rs);
>>>>>    } else {
>>>>>        bdrv_reopen_abort(rs);
>>>>>    }
>>>>>    return ret;
>>>>> }
>>>>>
>>>>> If the caller isn't opening the new file then what's the point of
>>>>> giving the caller control over prepare, commit, and abort?
>>>>
>>>> After sending the last email I realized what I was missing:
>>>>
>>>> You need the prepare, commit, and abort API in order to handle
>>>> multi-file block drivers like VMDK.
>>>
>>> Yes, this is whole point of separating commit out. Does the proposal
>>> make sense to you now?
>>
>> It depends on the details.  Adding more functions that every BlockDriver
>> must implement is bad, so it's important that we only drop this
>> functionality into raw-posix.c, vmdk.c, and block.c as appropriate.
>
> Yes, I agree.
>
>> I liked the idea of doing a generic FDStash type that the monitor and
>> bdrv_reopen() can use.  Blue's idea to hook at the qemu_open() level
>> takes that further.
>
> Well, having something that works for the raw-posix, the monitor and
> maybe some more things is nice. Having something that works for
> raw-posix, Sheepdog and rbd is an absolute requirement and I can't see
> how FDStash solves that.

For Sheepdog also network access functions would need to be hooked.
RBD seems to use librados functions for low level I/O, so that needs
some RBD specific wrappers.

> Even raw-win32 doesn't have an int fd, but a
> HANDLE hfile for its backend.

Just replace "int fd" with "FDStash fd" everywhere?

> So my main concern is that file descriptors are a concept not as generic
> as it needs to be to suit all block drivers.
>
>> But if we can do prepare, commit, and abort in a relatively simple way
>> then I'm for it.
>
> Great, then let's do that.
>
> Kevin
>
>

^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: [Qemu-devel] Safely reopening image files by stashing fds
  2011-08-09 19:39                         ` Blue Swirl
@ 2011-08-10  7:58                           ` Kevin Wolf
  2011-08-10 17:20                             ` Blue Swirl
  0 siblings, 1 reply; 39+ messages in thread
From: Kevin Wolf @ 2011-08-10  7:58 UTC (permalink / raw)
  To: Blue Swirl; +Cc: Stefan Hajnoczi, Anthony Liguori, qemu-devel, Supriya Kannery

Am 09.08.2011 21:39, schrieb Blue Swirl:
> On Tue, Aug 9, 2011 at 12:24 PM, Kevin Wolf <kwolf@redhat.com> wrote:
>> Am 09.08.2011 14:00, schrieb Stefan Hajnoczi:
>>> On Tue, Aug 09, 2011 at 01:39:13PM +0200, Kevin Wolf wrote:
>>>> Am 09.08.2011 12:56, schrieb Stefan Hajnoczi:
>>>>> On Tue, Aug 9, 2011 at 11:50 AM, Stefan Hajnoczi <stefanha@gmail.com> wrote:
>>>>>> On Tue, Aug 9, 2011 at 11:35 AM, Kevin Wolf <kwolf@redhat.com> wrote:
>>>>>>> Am 09.08.2011 12:25, schrieb Stefan Hajnoczi:
>>>>>>>> On Mon, Aug 8, 2011 at 4:16 PM, Kevin Wolf <kwolf@redhat.com> wrote:
>>>>>>>>> Am 08.08.2011 16:49, schrieb Stefan Hajnoczi:
>>>>>>>>>> On Fri, Aug 5, 2011 at 10:48 AM, Kevin Wolf <kwolf@redhat.com> wrote:
>>>>>>>>>>> Am 05.08.2011 11:29, schrieb Stefan Hajnoczi:
>>>>>>>>>>>> On Fri, Aug 5, 2011 at 10:07 AM, Kevin Wolf <kwolf@redhat.com> wrote:
>>>>>>>>>>>>> Am 05.08.2011 10:40, schrieb Stefan Hajnoczi:
>>>>>>>>>>>>>> We've discussed safe methods for reopening image files (e.g. useful for
>>>>>>>>>>>>>> changing the hostcache parameter).  The problem is that closing the file first
>>>>>>>>>>>>>> and then opening it again exposes us to the error case where the open fails.
>>>>>>>>>>>>>> At that point we cannot get to the file anymore and our options are to
>>>>>>>>>>>>>> terminate QEMU, pause the VM, or offline the block device.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> This window of vulnerability can be eliminated by keeping the file descriptor
>>>>>>>>>>>>>> around and falling back to it should the open fail.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> The challenge for the file descriptor approach is that image formats, like
>>>>>>>>>>>>>> VMDK, can span multiple files.  Therefore the solution is not as simple as
>>>>>>>>>>>>>> stashing a single file descriptor and reopening from it.
>>>>>>>>>>>>>
>>>>>>>>>>>>> So far I agree. The rest I believe is wrong because you can't assume
>>>>>>>>>>>>> that every backend uses file descriptors. The qemu block layer is based
>>>>>>>>>>>>> on BlockDriverStates, not fds. They are a concept that should be hidden
>>>>>>>>>>>>> in raw-posix.
>>>>>>>>>>>>>
>>>>>>>>>>>>> I think something like this could do:
>>>>>>>>>>>>>
>>>>>>>>>>>>> struct BDRVReopenState {
>>>>>>>>>>>>>    BlockDriverState *bs;
>>>>>>>>>>>>>    /* can be extended by block drivers */
>>>>>>>>>>>>> };
>>>>>>>>>>>>>
>>>>>>>>>>>>> .bdrv_reopen(BlockDriverState *bs, BDRVReopenState **reopen_state, int
>>>>>>>>>>>>> flags);
>>>>>>>>>>>>> .bdrv_reopen_commit(BDRVReopenState *reopen_state);
>>>>>>>>>>>>> .bdrv_reopen_abort(BDRVReopenState *reopen_state);
>>>>>>>>>>>>>
>>>>>>>>>>>>> raw-posix would store the old file descriptor in its reopen_state. On
>>>>>>>>>>>>> commit, it closes the old descriptors, on abort it reverts to the old
>>>>>>>>>>>>> one and closes the newly opened one.
>>>>>>>>>>>>>
>>>>>>>>>>>>> Makes things a bit more complicated than the simple bdrv_reopen I had in
>>>>>>>>>>>>> mind before, but it allows VMDK to get an all-or-nothing semantics.
>>>>>>>>>>>>
>>>>>>>>>>>> Can you show how bdrv_reopen() would use these new interfaces?  I'm
>>>>>>>>>>>> not 100% clear on the idea.
>>>>>>>>>>>
>>>>>>>>>>> Well, you wouldn't only call bdrv_reopen, but also either
>>>>>>>>>>> bdrv_reopen_commit/abort (for the top-level caller we can have a wrapper
>>>>>>>>>>> function that does both, but that's syntactic sugar).
>>>>>>>>>>>
>>>>>>>>>>> For example we would have:
>>>>>>>>>>>
>>>>>>>>>>> int vmdk_reopen()
>>>>>>>>>>
>>>>>>>>>> .bdrv_reopen() is a confusing name for this operation because it does
>>>>>>>>>> not reopen anything.  bdrv_prepare_reopen() might be clearer.
>>>>>>>>>
>>>>>>>>> Makes sense.
>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>> {
>>>>>>>>>>>    *((VMDKReopenState**) rs) = malloc();
>>>>>>>>>>>
>>>>>>>>>>>    foreach (extent in s->extents) {
>>>>>>>>>>>        ret = bdrv_reopen(extent->file, &extent->reopen_state)
>>>>>>>>>>>        if (ret < 0)
>>>>>>>>>>>            goto fail;
>>>>>>>>>>>    }
>>>>>>>>>>>    return 0;
>>>>>>>>>>>
>>>>>>>>>>> fail:
>>>>>>>>>>>    foreach (extent in rs->already_reopened) {
>>>>>>>>>>>        bdrv_reopen_abort(extent->reopen_state);
>>>>>>>>>>>    }
>>>>>>>>>>>    return ret;
>>>>>>>>>>> }
>>>>>>>>>>
>>>>>>>>>>> void vmdk_reopen_commit()
>>>>>>>>>>> {
>>>>>>>>>>>    foreach (extent in s->extents) {
>>>>>>>>>>>        bdrv_reopen_commit(extent->reopen_state);
>>>>>>>>>>>    }
>>>>>>>>>>>    free(rs);
>>>>>>>>>>> }
>>>>>>>>>>>
>>>>>>>>>>> void vmdk_reopen_abort()
>>>>>>>>>>> {
>>>>>>>>>>>    foreach (extent in s->extents) {
>>>>>>>>>>>        bdrv_reopen_abort(extent->reopen_state);
>>>>>>>>>>>    }
>>>>>>>>>>>    free(rs);
>>>>>>>>>>> }
>>>>>>>>>>
>>>>>>>>>> Does the caller invoke bdrv_close(bs) after bdrv_prepare_reopen(bs,
>>>>>>>>>> &rs)?
>>>>>>>>>
>>>>>>>>> No. Closing the old backend would be part of bdrv_reopen_commit.
>>>>>>>>>
>>>>>>>>> Do you have a use case where it would be helpful if the caller invoked
>>>>>>>>> bdrv_close?
>>>>>>>>
>>>>>>>> When the caller does bdrv_close() two BlockDriverStates are never open
>>>>>>>> for the same image file.  I thought this was a property we wanted.
>>>>>>>>
>>>>>>>> Also, in the block_set_hostcache case we need to reopen without
>>>>>>>> switching to a new BlockDriverState instance.  That means the reopen
>>>>>>>> needs to be in-place with respect to the BlockDriverState *bs pointer.
>>>>>>>>  We cannot create a new instance.
>>>>>>>
>>>>>>> Yes, but where do you even get the second BlockDriverState from?
>>>>>>>
>>>>>>> My prototype only returns an int, not a new BlockDriverState. Until
>>>>>>> bdrv_reopen_commit() it would refer to the old file descriptors etc. and
>>>>>>> after bdrv_reopen_commit() the very same BlockDriverState would refer to
>>>>>>> the new ones.
>>>>>>
>>>>>> It seems I don't understand the API.  I thought it was:
>>>>>>
>>>>>> do_block_set_hostcache()
>>>>>> {
>>>>>>    bdrv_prepare_reopen(bs, &rs);
>>>>>>    ...open new file and check everything is okay...
>>>>>>    if (ret == 0) {
>>>>>>        bdrv_reopen_commit(rs);
>>>>>>    } else {
>>>>>>        bdrv_reopen_abort(rs);
>>>>>>    }
>>>>>>    return ret;
>>>>>> }
>>>>>>
>>>>>> If the caller isn't opening the new file then what's the point of
>>>>>> giving the caller control over prepare, commit, and abort?
>>>>>
>>>>> After sending the last email I realized what I was missing:
>>>>>
>>>>> You need the prepare, commit, and abort API in order to handle
>>>>> multi-file block drivers like VMDK.
>>>>
>>>> Yes, this is whole point of separating commit out. Does the proposal
>>>> make sense to you now?
>>>
>>> It depends on the details.  Adding more functions that every BlockDriver
>>> must implement is bad, so it's important that we only drop this
>>> functionality into raw-posix.c, vmdk.c, and block.c as appropriate.
>>
>> Yes, I agree.
>>
>>> I liked the idea of doing a generic FDStash type that the monitor and
>>> bdrv_reopen() can use.  Blue's idea to hook at the qemu_open() level
>>> takes that further.
>>
>> Well, having something that works for the raw-posix, the monitor and
>> maybe some more things is nice. Having something that works for
>> raw-posix, Sheepdog and rbd is an absolute requirement and I can't see
>> how FDStash solves that.
> 
> For Sheepdog also network access functions would need to be hooked.
> RBD seems to use librados functions for low level I/O, so that needs
> some RBD specific wrappers.
> 
>> Even raw-win32 doesn't have an int fd, but a
>> HANDLE hfile for its backend.
> 
> Just replace "int fd" with "FDStash fd" everywhere?

And how is FDStash defined? The current proposed is this one:

/* A stashed file descriptor */
typedef FDEntry {
	const char *name;
	int fd;
	QLIST_ENTRY(FDEntry) next;
} FDEntry;

/* A container for stashing file descriptors */
typedef struct FDStash {
	QLIST_HEAD(, FDEntry) fds;
} FDStash;

So there we have the int fd again. If you want something generic, you
would have to start letting block drivers extend FDEntry with their own
fields (and how would the first bdrv_open work then?) and basically
arrive at something like a BDRVReopenState.

Kevin

^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: [Qemu-devel] Safely reopening image files by stashing fds
  2011-08-10  7:58                           ` Kevin Wolf
@ 2011-08-10 17:20                             ` Blue Swirl
  2011-08-11  7:37                               ` Kevin Wolf
  0 siblings, 1 reply; 39+ messages in thread
From: Blue Swirl @ 2011-08-10 17:20 UTC (permalink / raw)
  To: Kevin Wolf; +Cc: Stefan Hajnoczi, Anthony Liguori, qemu-devel, Supriya Kannery

On Wed, Aug 10, 2011 at 7:58 AM, Kevin Wolf <kwolf@redhat.com> wrote:
> Am 09.08.2011 21:39, schrieb Blue Swirl:
>> On Tue, Aug 9, 2011 at 12:24 PM, Kevin Wolf <kwolf@redhat.com> wrote:
>>> Am 09.08.2011 14:00, schrieb Stefan Hajnoczi:
>>>> On Tue, Aug 09, 2011 at 01:39:13PM +0200, Kevin Wolf wrote:
>>>>> Am 09.08.2011 12:56, schrieb Stefan Hajnoczi:
>>>>>> On Tue, Aug 9, 2011 at 11:50 AM, Stefan Hajnoczi <stefanha@gmail.com> wrote:
>>>>>>> On Tue, Aug 9, 2011 at 11:35 AM, Kevin Wolf <kwolf@redhat.com> wrote:
>>>>>>>> Am 09.08.2011 12:25, schrieb Stefan Hajnoczi:
>>>>>>>>> On Mon, Aug 8, 2011 at 4:16 PM, Kevin Wolf <kwolf@redhat.com> wrote:
>>>>>>>>>> Am 08.08.2011 16:49, schrieb Stefan Hajnoczi:
>>>>>>>>>>> On Fri, Aug 5, 2011 at 10:48 AM, Kevin Wolf <kwolf@redhat.com> wrote:
>>>>>>>>>>>> Am 05.08.2011 11:29, schrieb Stefan Hajnoczi:
>>>>>>>>>>>>> On Fri, Aug 5, 2011 at 10:07 AM, Kevin Wolf <kwolf@redhat.com> wrote:
>>>>>>>>>>>>>> Am 05.08.2011 10:40, schrieb Stefan Hajnoczi:
>>>>>>>>>>>>>>> We've discussed safe methods for reopening image files (e.g. useful for
>>>>>>>>>>>>>>> changing the hostcache parameter).  The problem is that closing the file first
>>>>>>>>>>>>>>> and then opening it again exposes us to the error case where the open fails.
>>>>>>>>>>>>>>> At that point we cannot get to the file anymore and our options are to
>>>>>>>>>>>>>>> terminate QEMU, pause the VM, or offline the block device.
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> This window of vulnerability can be eliminated by keeping the file descriptor
>>>>>>>>>>>>>>> around and falling back to it should the open fail.
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> The challenge for the file descriptor approach is that image formats, like
>>>>>>>>>>>>>>> VMDK, can span multiple files.  Therefore the solution is not as simple as
>>>>>>>>>>>>>>> stashing a single file descriptor and reopening from it.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> So far I agree. The rest I believe is wrong because you can't assume
>>>>>>>>>>>>>> that every backend uses file descriptors. The qemu block layer is based
>>>>>>>>>>>>>> on BlockDriverStates, not fds. They are a concept that should be hidden
>>>>>>>>>>>>>> in raw-posix.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> I think something like this could do:
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> struct BDRVReopenState {
>>>>>>>>>>>>>>    BlockDriverState *bs;
>>>>>>>>>>>>>>    /* can be extended by block drivers */
>>>>>>>>>>>>>> };
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> .bdrv_reopen(BlockDriverState *bs, BDRVReopenState **reopen_state, int
>>>>>>>>>>>>>> flags);
>>>>>>>>>>>>>> .bdrv_reopen_commit(BDRVReopenState *reopen_state);
>>>>>>>>>>>>>> .bdrv_reopen_abort(BDRVReopenState *reopen_state);
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> raw-posix would store the old file descriptor in its reopen_state. On
>>>>>>>>>>>>>> commit, it closes the old descriptors, on abort it reverts to the old
>>>>>>>>>>>>>> one and closes the newly opened one.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> Makes things a bit more complicated than the simple bdrv_reopen I had in
>>>>>>>>>>>>>> mind before, but it allows VMDK to get an all-or-nothing semantics.
>>>>>>>>>>>>>
>>>>>>>>>>>>> Can you show how bdrv_reopen() would use these new interfaces?  I'm
>>>>>>>>>>>>> not 100% clear on the idea.
>>>>>>>>>>>>
>>>>>>>>>>>> Well, you wouldn't only call bdrv_reopen, but also either
>>>>>>>>>>>> bdrv_reopen_commit/abort (for the top-level caller we can have a wrapper
>>>>>>>>>>>> function that does both, but that's syntactic sugar).
>>>>>>>>>>>>
>>>>>>>>>>>> For example we would have:
>>>>>>>>>>>>
>>>>>>>>>>>> int vmdk_reopen()
>>>>>>>>>>>
>>>>>>>>>>> .bdrv_reopen() is a confusing name for this operation because it does
>>>>>>>>>>> not reopen anything.  bdrv_prepare_reopen() might be clearer.
>>>>>>>>>>
>>>>>>>>>> Makes sense.
>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>>> {
>>>>>>>>>>>>    *((VMDKReopenState**) rs) = malloc();
>>>>>>>>>>>>
>>>>>>>>>>>>    foreach (extent in s->extents) {
>>>>>>>>>>>>        ret = bdrv_reopen(extent->file, &extent->reopen_state)
>>>>>>>>>>>>        if (ret < 0)
>>>>>>>>>>>>            goto fail;
>>>>>>>>>>>>    }
>>>>>>>>>>>>    return 0;
>>>>>>>>>>>>
>>>>>>>>>>>> fail:
>>>>>>>>>>>>    foreach (extent in rs->already_reopened) {
>>>>>>>>>>>>        bdrv_reopen_abort(extent->reopen_state);
>>>>>>>>>>>>    }
>>>>>>>>>>>>    return ret;
>>>>>>>>>>>> }
>>>>>>>>>>>
>>>>>>>>>>>> void vmdk_reopen_commit()
>>>>>>>>>>>> {
>>>>>>>>>>>>    foreach (extent in s->extents) {
>>>>>>>>>>>>        bdrv_reopen_commit(extent->reopen_state);
>>>>>>>>>>>>    }
>>>>>>>>>>>>    free(rs);
>>>>>>>>>>>> }
>>>>>>>>>>>>
>>>>>>>>>>>> void vmdk_reopen_abort()
>>>>>>>>>>>> {
>>>>>>>>>>>>    foreach (extent in s->extents) {
>>>>>>>>>>>>        bdrv_reopen_abort(extent->reopen_state);
>>>>>>>>>>>>    }
>>>>>>>>>>>>    free(rs);
>>>>>>>>>>>> }
>>>>>>>>>>>
>>>>>>>>>>> Does the caller invoke bdrv_close(bs) after bdrv_prepare_reopen(bs,
>>>>>>>>>>> &rs)?
>>>>>>>>>>
>>>>>>>>>> No. Closing the old backend would be part of bdrv_reopen_commit.
>>>>>>>>>>
>>>>>>>>>> Do you have a use case where it would be helpful if the caller invoked
>>>>>>>>>> bdrv_close?
>>>>>>>>>
>>>>>>>>> When the caller does bdrv_close() two BlockDriverStates are never open
>>>>>>>>> for the same image file.  I thought this was a property we wanted.
>>>>>>>>>
>>>>>>>>> Also, in the block_set_hostcache case we need to reopen without
>>>>>>>>> switching to a new BlockDriverState instance.  That means the reopen
>>>>>>>>> needs to be in-place with respect to the BlockDriverState *bs pointer.
>>>>>>>>>  We cannot create a new instance.
>>>>>>>>
>>>>>>>> Yes, but where do you even get the second BlockDriverState from?
>>>>>>>>
>>>>>>>> My prototype only returns an int, not a new BlockDriverState. Until
>>>>>>>> bdrv_reopen_commit() it would refer to the old file descriptors etc. and
>>>>>>>> after bdrv_reopen_commit() the very same BlockDriverState would refer to
>>>>>>>> the new ones.
>>>>>>>
>>>>>>> It seems I don't understand the API.  I thought it was:
>>>>>>>
>>>>>>> do_block_set_hostcache()
>>>>>>> {
>>>>>>>    bdrv_prepare_reopen(bs, &rs);
>>>>>>>    ...open new file and check everything is okay...
>>>>>>>    if (ret == 0) {
>>>>>>>        bdrv_reopen_commit(rs);
>>>>>>>    } else {
>>>>>>>        bdrv_reopen_abort(rs);
>>>>>>>    }
>>>>>>>    return ret;
>>>>>>> }
>>>>>>>
>>>>>>> If the caller isn't opening the new file then what's the point of
>>>>>>> giving the caller control over prepare, commit, and abort?
>>>>>>
>>>>>> After sending the last email I realized what I was missing:
>>>>>>
>>>>>> You need the prepare, commit, and abort API in order to handle
>>>>>> multi-file block drivers like VMDK.
>>>>>
>>>>> Yes, this is whole point of separating commit out. Does the proposal
>>>>> make sense to you now?
>>>>
>>>> It depends on the details.  Adding more functions that every BlockDriver
>>>> must implement is bad, so it's important that we only drop this
>>>> functionality into raw-posix.c, vmdk.c, and block.c as appropriate.
>>>
>>> Yes, I agree.
>>>
>>>> I liked the idea of doing a generic FDStash type that the monitor and
>>>> bdrv_reopen() can use.  Blue's idea to hook at the qemu_open() level
>>>> takes that further.
>>>
>>> Well, having something that works for the raw-posix, the monitor and
>>> maybe some more things is nice. Having something that works for
>>> raw-posix, Sheepdog and rbd is an absolute requirement and I can't see
>>> how FDStash solves that.
>>
>> For Sheepdog also network access functions would need to be hooked.
>> RBD seems to use librados functions for low level I/O, so that needs
>> some RBD specific wrappers.
>>
>>> Even raw-win32 doesn't have an int fd, but a
>>> HANDLE hfile for its backend.
>>
>> Just replace "int fd" with "FDStash fd" everywhere?
>
> And how is FDStash defined? The current proposed is this one:
>
> /* A stashed file descriptor */
> typedef FDEntry {
>        const char *name;
>        int fd;
>        QLIST_ENTRY(FDEntry) next;
> } FDEntry;
>
> /* A container for stashing file descriptors */
> typedef struct FDStash {
>        QLIST_HEAD(, FDEntry) fds;
> } FDStash;
>
> So there we have the int fd again. If you want something generic, you

What's the problem:
typedef struct FDEntry {
       const char *name;
#ifdef _WIN32
       HANDLE fd;
#else
       int fd;
#endif
       QLIST_ENTRY(FDEntry) next;
} FDEntry;

Renaming 'fd' to something that does not immediately look like 'int'
may be useful.

> would have to start letting block drivers extend FDEntry with their own
> fields (and how would the first bdrv_open work then?) and basically
> arrive at something like a BDRVReopenState.

I'd not handle FDEntries at block level at all (or only at raw level),
then there would not be first mover problems.

Extending the uses of FDEntries to for example network stuff
(Sheepdog) could need some kind of unified API for both file and net
operations which is not what we have now (bdrv_read/write etc. vs.
direct recv/send). Then this new API would use FDEntries instead of
fds, sockets or HANDLEs. The 'name' field could be either network
address or file path. Though maybe we are only interested in
open/connect part so unification would not be needed for other
operations.

^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: [Qemu-devel] Safely reopening image files by stashing fds
  2011-08-10 17:20                             ` Blue Swirl
@ 2011-08-11  7:37                               ` Kevin Wolf
  2011-08-11 16:21                                 ` Blue Swirl
  0 siblings, 1 reply; 39+ messages in thread
From: Kevin Wolf @ 2011-08-11  7:37 UTC (permalink / raw)
  To: Blue Swirl; +Cc: Stefan Hajnoczi, Anthony Liguori, qemu-devel, Supriya Kannery

Am 10.08.2011 19:20, schrieb Blue Swirl:
> On Wed, Aug 10, 2011 at 7:58 AM, Kevin Wolf <kwolf@redhat.com> wrote:
>> Am 09.08.2011 21:39, schrieb Blue Swirl:
>>> On Tue, Aug 9, 2011 at 12:24 PM, Kevin Wolf <kwolf@redhat.com> wrote:
>>>> Am 09.08.2011 14:00, schrieb Stefan Hajnoczi:
>>>>> I liked the idea of doing a generic FDStash type that the monitor and
>>>>> bdrv_reopen() can use.  Blue's idea to hook at the qemu_open() level
>>>>> takes that further.
>>>>
>>>> Well, having something that works for the raw-posix, the monitor and
>>>> maybe some more things is nice. Having something that works for
>>>> raw-posix, Sheepdog and rbd is an absolute requirement and I can't see
>>>> how FDStash solves that.
>>>
>>> For Sheepdog also network access functions would need to be hooked.
>>> RBD seems to use librados functions for low level I/O, so that needs
>>> some RBD specific wrappers.
>>>
>>>> Even raw-win32 doesn't have an int fd, but a
>>>> HANDLE hfile for its backend.
>>>
>>> Just replace "int fd" with "FDStash fd" everywhere?
>>
>> And how is FDStash defined? The current proposed is this one:
>>
>> /* A stashed file descriptor */
>> typedef FDEntry {
>>        const char *name;
>>        int fd;
>>        QLIST_ENTRY(FDEntry) next;
>> } FDEntry;
>>
>> /* A container for stashing file descriptors */
>> typedef struct FDStash {
>>        QLIST_HEAD(, FDEntry) fds;
>> } FDStash;
>>
>> So there we have the int fd again. If you want something generic, you
> 
> What's the problem:
> typedef struct FDEntry {
>        const char *name;
> #ifdef _WIN32
>        HANDLE fd;
> #else
>        int fd;
> #endif
>        QLIST_ENTRY(FDEntry) next;
> } FDEntry;
> 
> Renaming 'fd' to something that does not immediately look like 'int'
> may be useful.

This isn't any more generic, it merely covers two special cases instead
of only one.

You have used the fact that raw-posix and raw-win32 are never compiled
in at the same time, so that you can use an #ifdef to conditionally pull
out parts of their internal data structures into a generic struct (which
in itself is a layering violation)

It doesn't help with the other backend drivers like curl, nbd, rbd,
sheepdog and last but not least our special friend vvfat.

>> would have to start letting block drivers extend FDEntry with their own
>> fields (and how would the first bdrv_open work then?) and basically
>> arrive at something like a BDRVReopenState.
> 
> I'd not handle FDEntries at block level at all (or only at raw level),
> then there would not be first mover problems.

Well, but how does it solve the bdrv_reopen problem if it's not visible
on the block level?

> Extending the uses of FDEntries to for example network stuff
> (Sheepdog) could need some kind of unified API for both file and net
> operations which is not what we have now (bdrv_read/write etc. vs.
> direct recv/send). Then this new API would use FDEntries instead of
> fds, sockets or HANDLEs. The 'name' field could be either network
> address or file path. Though maybe we are only interested in
> open/connect part so unification would not be needed for other
> operations.

Now you have handled three special cases, but still haven't got a
generic solution.

But considering that you don't want to touch the block layer API and
therefore I don't even have an idea of what you would use FDEntries
for... Let's go back one step: What problem are you trying to solve, and
what does the API look like that you're thinking of? It doesn't seem to
be the same as Stefan suggested.

Kevin

^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: [Qemu-devel] Safely reopening image files by stashing fds
  2011-08-11  7:37                               ` Kevin Wolf
@ 2011-08-11 16:21                                 ` Blue Swirl
  0 siblings, 0 replies; 39+ messages in thread
From: Blue Swirl @ 2011-08-11 16:21 UTC (permalink / raw)
  To: Kevin Wolf; +Cc: Stefan Hajnoczi, Anthony Liguori, qemu-devel, Supriya Kannery

On Thu, Aug 11, 2011 at 7:37 AM, Kevin Wolf <kwolf@redhat.com> wrote:
> Am 10.08.2011 19:20, schrieb Blue Swirl:
>> On Wed, Aug 10, 2011 at 7:58 AM, Kevin Wolf <kwolf@redhat.com> wrote:
>>> Am 09.08.2011 21:39, schrieb Blue Swirl:
>>>> On Tue, Aug 9, 2011 at 12:24 PM, Kevin Wolf <kwolf@redhat.com> wrote:
>>>>> Am 09.08.2011 14:00, schrieb Stefan Hajnoczi:
>>>>>> I liked the idea of doing a generic FDStash type that the monitor and
>>>>>> bdrv_reopen() can use.  Blue's idea to hook at the qemu_open() level
>>>>>> takes that further.
>>>>>
>>>>> Well, having something that works for the raw-posix, the monitor and
>>>>> maybe some more things is nice. Having something that works for
>>>>> raw-posix, Sheepdog and rbd is an absolute requirement and I can't see
>>>>> how FDStash solves that.
>>>>
>>>> For Sheepdog also network access functions would need to be hooked.
>>>> RBD seems to use librados functions for low level I/O, so that needs
>>>> some RBD specific wrappers.
>>>>
>>>>> Even raw-win32 doesn't have an int fd, but a
>>>>> HANDLE hfile for its backend.
>>>>
>>>> Just replace "int fd" with "FDStash fd" everywhere?
>>>
>>> And how is FDStash defined? The current proposed is this one:
>>>
>>> /* A stashed file descriptor */
>>> typedef FDEntry {
>>>        const char *name;
>>>        int fd;
>>>        QLIST_ENTRY(FDEntry) next;
>>> } FDEntry;
>>>
>>> /* A container for stashing file descriptors */
>>> typedef struct FDStash {
>>>        QLIST_HEAD(, FDEntry) fds;
>>> } FDStash;
>>>
>>> So there we have the int fd again. If you want something generic, you
>>
>> What's the problem:
>> typedef struct FDEntry {
>>        const char *name;
>> #ifdef _WIN32
>>        HANDLE fd;
>> #else
>>        int fd;
>> #endif
>>        QLIST_ENTRY(FDEntry) next;
>> } FDEntry;
>>
>> Renaming 'fd' to something that does not immediately look like 'int'
>> may be useful.
>
> This isn't any more generic, it merely covers two special cases instead
> of only one.
>
> You have used the fact that raw-posix and raw-win32 are never compiled
> in at the same time, so that you can use an #ifdef to conditionally pull
> out parts of their internal data structures into a generic struct (which
> in itself is a layering violation)
>
> It doesn't help with the other backend drivers like curl, nbd, rbd,
> sheepdog and last but not least our special friend vvfat.

Probably vvfat would not need any FDEntry. Curl has the same problem
as RBD that it's not possible to inject a file descriptor.

>>> would have to start letting block drivers extend FDEntry with their own
>>> fields (and how would the first bdrv_open work then?) and basically
>>> arrive at something like a BDRVReopenState.
>>
>> I'd not handle FDEntries at block level at all (or only at raw level),
>> then there would not be first mover problems.
>
> Well, but how does it solve the bdrv_reopen problem if it's not visible
> on the block level?
>
>> Extending the uses of FDEntries to for example network stuff
>> (Sheepdog) could need some kind of unified API for both file and net
>> operations which is not what we have now (bdrv_read/write etc. vs.
>> direct recv/send). Then this new API would use FDEntries instead of
>> fds, sockets or HANDLEs. The 'name' field could be either network
>> address or file path. Though maybe we are only interested in
>> open/connect part so unification would not be needed for other
>> operations.
>
> Now you have handled three special cases, but still haven't got a
> generic solution.
>
> But considering that you don't want to touch the block layer API and
> therefore I don't even have an idea of what you would use FDEntries
> for... Let's go back one step: What problem are you trying to solve, and
> what does the API look like that you're thinking of? It doesn't seem to
> be the same as Stefan suggested.

Excellent point. I think I was thinking of solving SELinux/NFS and
privilege separation problems with this one. From this one step back
perspective those seem to be slightly different issues. For this
problem, BDRVReopenState could be OK. FDStash may be useful otherwise.

^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: [Qemu-devel] [RFC] Safely reopening image files by stashing fds
  2011-08-09  9:32                       ` supriya kannery
@ 2011-08-16 19:18                         ` Supriya Kannery
  2011-08-16 19:18                         ` Supriya Kannery
  1 sibling, 0 replies; 39+ messages in thread
From: Supriya Kannery @ 2011-08-16 19:18 UTC (permalink / raw)
  To: supriya kannery
  Cc: Kevin Wolf, Stefan Hajnoczi, Christoph Hellwig, qemu-devel,
	Paolo Bonzini

On 08/09/2011 03:02 PM, supriya kannery wrote:
 > Kevin Wolf wrote:
 >> Am 09.08.2011 11:22, schrieb supriya kannery:
 >>> Kevin Wolf wrote:
 >>
 >> What I meant is that in the end, with a generic bdrv_reopen(), we can
 >> have raw-posix only call dup() and fcntl() instead of doing a
 >> close()/open() sequence if it can satisfy the new flags this way. But
 >> this would be an implementation detail and not be visible in the
 >> interface.
 >>
 >> Kevin
 >
 > ok
 > - thanks, Supriya
 >

Though I started the RFC patch with defining BDRVReopenState, ended up
in enhancing struct BlockDriver with .bdrv_reopen. bdrv_reopen 
mplementation specific to respective driver is assigned to this
function pointer.

Please find the implementation of O_DIRECT flag change, done in
raw-posix.c. Similar implementation can be done for vmdk (with
bdrv_reopen_commit and bdrv_reopen_abort as internal functions in
vmdk.c for opening split files), sheepdog, nbd etc..

Request for comments on this approach.

Note: Following patch is for demonstrating the approach using
raw-posix  as sample driver. This patch is not complete.

---
  block.c           |   33 +++++++++++++++++++++++----------
  block/raw-posix.c |   34 ++++++++++++++++++++++++++++++++++
  block_int.h       |    1 +
  3 files changed, 58 insertions(+), 10 deletions(-)

Index: qemu/block/raw-posix.c
===================================================================
--- qemu.orig/block/raw-posix.c
+++ qemu/block/raw-posix.c
@@ -276,6 +276,39 @@ static int raw_open(BlockDriverState *bs
      return raw_open_common(bs, filename, flags, 0);
  }

+static int raw_reopen(BlockDriverState *bs, int flags)
+{
+    BDRVRawState *s = bs->opaque;
+    int new_fd;
+    int ret = 0;
+
+    /* Handle request for toggling O_DIRECT */
+    if ((bs->open_flags | BDRV_O_NOCACHE) ^
+        (flags | BDRV_O_NOCACHE))
+    {
+        if ((new_fd = dup(s->fd)) <=  0) {
+            return -1;
+        }
+
+        if ((ret = fcntl_setfl(new_fd, flags)) < 0) {
+            close(new_fd);
+            return ret;
+        }
+
+        /* Set new flags, so replace old fd with new one */
+        close(s->fd);
+        s->fd = new_fd;
+        s->open_flags = flags;
+        bs->open_flags = flags;
+    }
+
+    /*
+     * TBD: handle O_DSYNC and other flags for which image
+     * file has to be reopened
+     */
+    return 0;
+}
+
  /* XXX: use host sector size if necessary with:
  #ifdef DIOCGSECTORSIZE
          {
@@ -886,6 +919,7 @@ static BlockDriver bdrv_file = {
      .instance_size = sizeof(BDRVRawState),
      .bdrv_probe = NULL, /* no probe for protocols */
      .bdrv_file_open = raw_open,
+    .bdrv_reopen = raw_reopen,
      .bdrv_read = raw_read,
      .bdrv_write = raw_write,
      .bdrv_close = raw_close,
Index: qemu/block.c
===================================================================
--- qemu.orig/block.c
+++ qemu/block.c
@@ -687,20 +687,33 @@ int bdrv_reopen(BlockDriverState *bs, in
          qerror_report(QERR_DATA_SYNC_FAILED, bs->device_name);
          return ret;
      }
-    open_flags = bs->open_flags;
-    bdrv_close(bs);

-    ret = bdrv_open(bs, bs->filename, bdrv_flags, drv);
-    if (ret < 0) {
-        /* Reopen failed. Try to open with original flags */
-        qerror_report(QERR_REOPEN_FILE_FAILED, bs->filename);
-        ret = bdrv_open(bs, bs->filename, open_flags, drv);
+    /* Use driver specific reopen() if available */
+    if (drv->bdrv_reopen) {
+        if ((ret = drv->bdrv_reopen(bs, bdrv_flags)) < 0) {
+            qerror_report(QERR_REOPEN_FILE_FAILED, bs->filename);
+            return ret;
+        }
+    } else {
+        /* Use reopen procedure common for drivers */
+        open_flags = bs->open_flags;
+        bdrv_close(bs);
+
+        ret = bdrv_open(bs, bs->filename, bdrv_flags, drv);
          if (ret < 0) {
-            /* Reopen failed with orig and modified flags */
-            abort();
+            /*
+             * Reopen failed. Try to open with original flags
+            */
+            qerror_report(QERR_REOPEN_FILE_FAILED, bs->filename);
+            ret = bdrv_open(bs, bs->filename, open_flags, drv);
+            if (ret < 0) {
+                /*
+                 * Reopen with orig and modified flags failed
+                 */
+                abort();
+            }
          }
      }
-
      return ret;
  }

Index: qemu/block_int.h
===================================================================
--- qemu.orig/block_int.h
+++ qemu/block_int.h
@@ -54,6 +54,7 @@ struct BlockDriver {
      int (*bdrv_probe)(const uint8_t *buf, int buf_size, const char 
*filename);
      int (*bdrv_probe_device)(const char *filename);
      int (*bdrv_open)(BlockDriverState *bs, int flags);
+    int (*bdrv_reopen)(BlockDriverState *bs, int flags);
      int (*bdrv_file_open)(BlockDriverState *bs, const char *filename, 
int flags);
      int (*bdrv_read)(BlockDriverState *bs, int64_t sector_num,
                       uint8_t *buf, int nb_sectors);

^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: [Qemu-devel] [RFC] Safely reopening image files by stashing fds
  2011-08-09  9:32                       ` supriya kannery
  2011-08-16 19:18                         ` [Qemu-devel] [RFC] " Supriya Kannery
@ 2011-08-16 19:18                         ` Supriya Kannery
  2011-08-17 14:35                           ` Kevin Wolf
  1 sibling, 1 reply; 39+ messages in thread
From: Supriya Kannery @ 2011-08-16 19:18 UTC (permalink / raw)
  To: supriya kannery
  Cc: Kevin Wolf, Stefan Hajnoczi, Christoph Hellwig, qemu-devel,
	Paolo Bonzini

On 08/09/2011 03:02 PM, supriya kannery wrote:
 > Kevin Wolf wrote:
 >> Am 09.08.2011 11:22, schrieb supriya kannery:
 >>> Kevin Wolf wrote:
 >>
 >> What I meant is that in the end, with a generic bdrv_reopen(), we can
 >> have raw-posix only call dup() and fcntl() instead of doing a
 >> close()/open() sequence if it can satisfy the new flags this way. But
 >> this would be an implementation detail and not be visible in the
 >> interface.
 >>
 >> Kevin
 >
 > ok
 > - thanks, Supriya
 >

Though I started the RFC patch with defining BDRVReopenState, ended up
in enhancing struct BlockDriver with .bdrv_reopen. bdrv_reopen 
mplementation specific to respective driver is assigned to this
function pointer.

Please find the implementation of O_DIRECT flag change, done in
raw-posix.c. Similar implementation can be done for vmdk (with
bdrv_reopen_commit and bdrv_reopen_abort as internal functions in
vmdk.c for opening split files), sheepdog, nbd etc..

Request for comments on this approach.

Note: Following patch is for demonstrating the approach using
raw-posix  as sample driver. This patch is not complete.

- thanks, Supriya

---
  block.c           |   33 +++++++++++++++++++++++----------
  block/raw-posix.c |   34 ++++++++++++++++++++++++++++++++++
  block_int.h       |    1 +
  3 files changed, 58 insertions(+), 10 deletions(-)

Index: qemu/block/raw-posix.c
===================================================================
--- qemu.orig/block/raw-posix.c
+++ qemu/block/raw-posix.c
@@ -276,6 +276,39 @@ static int raw_open(BlockDriverState *bs
      return raw_open_common(bs, filename, flags, 0);
  }

+static int raw_reopen(BlockDriverState *bs, int flags)
+{
+    BDRVRawState *s = bs->opaque;
+    int new_fd;
+    int ret = 0;
+
+    /* Handle request for toggling O_DIRECT */
+    if ((bs->open_flags | BDRV_O_NOCACHE) ^
+        (flags | BDRV_O_NOCACHE))
+    {
+        if ((new_fd = dup(s->fd)) <=  0) {
+            return -1;
+        }
+
+        if ((ret = fcntl_setfl(new_fd, flags)) < 0) {
+            close(new_fd);
+            return ret;
+        }
+
+        /* Set new flags, so replace old fd with new one */
+        close(s->fd);
+        s->fd = new_fd;
+        s->open_flags = flags;
+        bs->open_flags = flags;
+    }
+
+    /*
+     * TBD: handle O_DSYNC and other flags for which image
+     * file has to be reopened
+     */
+    return 0;
+}
+
  /* XXX: use host sector size if necessary with:
  #ifdef DIOCGSECTORSIZE
          {
@@ -886,6 +919,7 @@ static BlockDriver bdrv_file = {
      .instance_size = sizeof(BDRVRawState),
      .bdrv_probe = NULL, /* no probe for protocols */
      .bdrv_file_open = raw_open,
+    .bdrv_reopen = raw_reopen,
      .bdrv_read = raw_read,
      .bdrv_write = raw_write,
      .bdrv_close = raw_close,
Index: qemu/block.c
===================================================================
--- qemu.orig/block.c
+++ qemu/block.c
@@ -687,20 +687,33 @@ int bdrv_reopen(BlockDriverState *bs, in
          qerror_report(QERR_DATA_SYNC_FAILED, bs->device_name);
          return ret;
      }
-    open_flags = bs->open_flags;
-    bdrv_close(bs);

-    ret = bdrv_open(bs, bs->filename, bdrv_flags, drv);
-    if (ret < 0) {
-        /* Reopen failed. Try to open with original flags */
-        qerror_report(QERR_REOPEN_FILE_FAILED, bs->filename);
-        ret = bdrv_open(bs, bs->filename, open_flags, drv);
+    /* Use driver specific reopen() if available */
+    if (drv->bdrv_reopen) {
+        if ((ret = drv->bdrv_reopen(bs, bdrv_flags)) < 0) {
+            qerror_report(QERR_REOPEN_FILE_FAILED, bs->filename);
+            return ret;
+        }
+    } else {
+        /* Use reopen procedure common for drivers */
+        open_flags = bs->open_flags;
+        bdrv_close(bs);
+
+        ret = bdrv_open(bs, bs->filename, bdrv_flags, drv);
          if (ret < 0) {
-            /* Reopen failed with orig and modified flags */
-            abort();
+            /*
+             * Reopen failed. Try to open with original flags
+            */
+            qerror_report(QERR_REOPEN_FILE_FAILED, bs->filename);
+            ret = bdrv_open(bs, bs->filename, open_flags, drv);
+            if (ret < 0) {
+                /*
+                 * Reopen with orig and modified flags failed
+                 */
+                abort();
+            }
          }
      }
-
      return ret;
  }

Index: qemu/block_int.h
===================================================================
--- qemu.orig/block_int.h
+++ qemu/block_int.h
@@ -54,6 +54,7 @@ struct BlockDriver {
      int (*bdrv_probe)(const uint8_t *buf, int buf_size, const char 
*filename);
      int (*bdrv_probe_device)(const char *filename);
      int (*bdrv_open)(BlockDriverState *bs, int flags);
+    int (*bdrv_reopen)(BlockDriverState *bs, int flags);
      int (*bdrv_file_open)(BlockDriverState *bs, const char *filename, 
int flags);
      int (*bdrv_read)(BlockDriverState *bs, int64_t sector_num,
                       uint8_t *buf, int nb_sectors);

^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: [Qemu-devel] [RFC] Safely reopening image files by stashing fds
  2011-08-16 19:18                         ` Supriya Kannery
@ 2011-08-17 14:35                           ` Kevin Wolf
  0 siblings, 0 replies; 39+ messages in thread
From: Kevin Wolf @ 2011-08-17 14:35 UTC (permalink / raw)
  To: supriyak
  Cc: Stefan Hajnoczi, Paolo Bonzini, Christoph Hellwig, qemu-devel,
	supriya kannery

Am 16.08.2011 21:18, schrieb Supriya Kannery:
> On 08/09/2011 03:02 PM, supriya kannery wrote:
>  > Kevin Wolf wrote:
>  >> Am 09.08.2011 11:22, schrieb supriya kannery:
>  >>> Kevin Wolf wrote:
>  >>
>  >> What I meant is that in the end, with a generic bdrv_reopen(), we can
>  >> have raw-posix only call dup() and fcntl() instead of doing a
>  >> close()/open() sequence if it can satisfy the new flags this way. But
>  >> this would be an implementation detail and not be visible in the
>  >> interface.
>  >>
>  >> Kevin
>  >
>  > ok
>  > - thanks, Supriya
>  >
> 
> Though I started the RFC patch with defining BDRVReopenState, ended up
> in enhancing struct BlockDriver with .bdrv_reopen. bdrv_reopen 
> mplementation specific to respective driver is assigned to this
> function pointer.
> 
> Please find the implementation of O_DIRECT flag change, done in
> raw-posix.c. Similar implementation can be done for vmdk (with
> bdrv_reopen_commit and bdrv_reopen_abort as internal functions in
> vmdk.c for opening split files), sheepdog, nbd etc..

I haven't looked at the code yet, but this can't work. I have thought of
BDRVReopenState for a reason. If you have multiple files like VMDK can
have, then you want an all-or-nothing semantics of reopen.

With only .bdrv_reopen, I can't see a way to guarantee this. If
reopening the first file succeeds and the second one goes wrong, you
have the new flags on the first file, but the old flags on the second
one. You would have to re-reopen the first file with the old flags, but
then this is not guaranteed to succeed.

Kevin

^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: [Qemu-devel] Safely reopening image files by stashing fds
  2011-08-09  9:22                   ` supriya kannery
  2011-08-09  9:51                     ` Kevin Wolf
@ 2011-10-10 18:28                     ` Kevin Wolf
  2011-10-11  5:21                       ` Supriya Kannery
  1 sibling, 1 reply; 39+ messages in thread
From: Kevin Wolf @ 2011-10-10 18:28 UTC (permalink / raw)
  To: supriya kannery
  Cc: supriyak, Paolo Bonzini, Christoph Hellwig, qemu-devel, Stefan Hajnoczi

Am 09.08.2011 11:22, schrieb supriya kannery:
> Kevin Wolf wrote:
>> Am 08.08.2011 09:02, schrieb Supriya Kannery:
>>   
>>> On 08/05/2011 09:19 PM, Anthony Liguori wrote:
>>>     
>>>> On 08/05/2011 10:43 AM, Kevin Wolf wrote:
>>>>       
>>>>> Am 05.08.2011 17:24, schrieb Stefan Hajnoczi:
>>>>>         
>>>>>> On Fri, Aug 5, 2011 at 3:28 PM, Christoph Hellwig<hch@lst.de> wrote:
>>>>>>           
>>>>>>> On Fri, Aug 05, 2011 at 02:12:48PM +0100, Daniel P. Berrange wrote:
>>>>>>>             
>>>>>>>>> Because you cannot change O_DIRECT on an open fd :(. This is why
>>>>>>>>> we're going through this pain.
>>>>>>>>>                 
>>>>>>>> Hmm, I remember hearing that before, but looking at the current
>>>>>>>> fcntl()
>>>>>>>> manpage, it claims you *can* change O_DIRECT using SET_FL. Perhaps
>>>>>>>> this
>>>>>>>> is a newish feature, but it'd be nicer to use it if possible ?
>>>>>>>>               
>>>>>>> It's been there since day 1 of O_DIRECT support.
>>>>>>>             
>>>>>> Sorry, my bad. So for Linux we could just use fcntl for
>>>>>> block_set_hostcache and not bother with reopening. However, we will
>>>>>> need to reopen should we wish to support changing O_DSYNC.
>>>>>>           
>>>>> We do wish to support that.
>>>>>
>>>>> Anthony thinks that allowing the guest to toggle WCE is a prerequisite
>>>>> for making cache=writeback the default. And this is something that I
>>>>> definitely want to do for 1.0.
>>>>>         
>>>> Indeed.
>>>>
>>>>       
>>> We discussed the following so far...
>>> 1. How to safely reopen image files
>>> 2. Dynamic hostcache change
>>> 3. Support for dynamic change of O_DSYNC
>>>
>>> Since 2 is independent of 1, shall I go ahead implementing
>>> hostcache change using fcntl.
>>>
>>> Implementation for safely reopening image files using "BDRVReopenState"
>>> can be done separately as a pre-requisite before implementing 3
>>>     
>>
>> Doing it separately means that we would introduce yet another callback
>> that is used just to change O_DIRECT. In the end we want it to use
>> bdrv_reopen(), too, so I'm not sure if there is a need for a temporary
>> solution.
>>
>>   
> Could you please explain "In the end we want it to use bdrv_reopen" at 
> bit more.
> When fcntl() can change O_DIRECT on  open fd , is there a  need to 
> "re-open"
> the image file?
> 
> Considering the current way of having separate high level commands for
> changing block parameters (block_set_hostcache, and may be block_set_flush
> in furture), these dynamic requests will be sequential. So wouldn't it 
> be better to
> avoid re-opening of image if possible for individual flag change request 
> that comes in?
>> Actually, once we know what we really want (I haven't seen many comments
>> on the BDRVReopenState suggestion yet), it should be pretty easy to
>> implement.
>>
>> Kevin
>>   
> Will work on to get an RFC patch with this proposed BDRVReopenState to 
> get more
> inputs.

Are you still going to prepare an RFC patch implementing
bdrv_reopen_prepare/commit/abort using a BDRVReopenState? Or has it even
been posted and I just missed it?

Kevin

^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: [Qemu-devel] Safely reopening image files by stashing fds
  2011-10-10 18:28                     ` [Qemu-devel] " Kevin Wolf
@ 2011-10-11  5:21                       ` Supriya Kannery
  0 siblings, 0 replies; 39+ messages in thread
From: Supriya Kannery @ 2011-10-11  5:21 UTC (permalink / raw)
  To: Kevin Wolf
  Cc: Stefan Hajnoczi, Paolo Bonzini, Christoph Hellwig, qemu-devel,
	supriya kannery

On 10/10/2011 11:58 PM, Kevin Wolf wrote:
> Am 09.08.2011 11:22, schrieb supriya kannery:
>> Kevin Wolf wrote:
>>> Am 08.08.2011 09:02, schrieb Supriya Kannery:
>>>
>>>> On 08/05/2011 09:19 PM, Anthony Liguori wrote:
>>>>
>>>>> On 08/05/2011 10:43 AM, Kevin Wolf wrote:
>>>>>
>>>>>> Am 05.08.2011 17:24, schrieb Stefan Hajnoczi:
>>>>>>
>>>>>>> On Fri, Aug 5, 2011 at 3:28 PM, Christoph Hellwig<hch@lst.de>  wrote:
>>>>>>>
>>>>>>>> On Fri, Aug 05, 2011 at 02:12:48PM +0100, Daniel P. Berrange wrote:
>>>>>>>>
>>>>>>>>>> Because you cannot change O_DIRECT on an open fd :(. This is why
>>>>>>>>>> we're going through this pain.
>>>>>>>>>>
>>>>>>>>> Hmm, I remember hearing that before, but looking at the current
>>>>>>>>> fcntl()
>>>>>>>>> manpage, it claims you *can* change O_DIRECT using SET_FL. Perhaps
>>>>>>>>> this
>>>>>>>>> is a newish feature, but it'd be nicer to use it if possible ?
>>>>>>>>>
>>>>>>>> It's been there since day 1 of O_DIRECT support.
>>>>>>>>
>>>>>>> Sorry, my bad. So for Linux we could just use fcntl for
>>>>>>> block_set_hostcache and not bother with reopening. However, we will
>>>>>>> need to reopen should we wish to support changing O_DSYNC.
>>>>>>>
>>>>>> We do wish to support that.
>>>>>>
>>>>>> Anthony thinks that allowing the guest to toggle WCE is a prerequisite
>>>>>> for making cache=writeback the default. And this is something that I
>>>>>> definitely want to do for 1.0.
>>>>>>
>>>>> Indeed.
>>>>>
>>>>>
>>>> We discussed the following so far...
>>>> 1. How to safely reopen image files
>>>> 2. Dynamic hostcache change
>>>> 3. Support for dynamic change of O_DSYNC
>>>>
>>>> Since 2 is independent of 1, shall I go ahead implementing
>>>> hostcache change using fcntl.
>>>>
>>>> Implementation for safely reopening image files using "BDRVReopenState"
>>>> can be done separately as a pre-requisite before implementing 3
>>>>
>>>
>>> Doing it separately means that we would introduce yet another callback
>>> that is used just to change O_DIRECT. In the end we want it to use
>>> bdrv_reopen(), too, so I'm not sure if there is a need for a temporary
>>> solution.
>>>
>>>
>> Could you please explain "In the end we want it to use bdrv_reopen" at
>> bit more.
>> When fcntl() can change O_DIRECT on  open fd , is there a  need to
>> "re-open"
>> the image file?
>>
>> Considering the current way of having separate high level commands for
>> changing block parameters (block_set_hostcache, and may be block_set_flush
>> in furture), these dynamic requests will be sequential. So wouldn't it
>> be better to
>> avoid re-opening of image if possible for individual flag change request
>> that comes in?
>>> Actually, once we know what we really want (I haven't seen many comments
>>> on the BDRVReopenState suggestion yet), it should be pretty easy to
>>> implement.
>>>
>>> Kevin
>>>
>> Will work on to get an RFC patch with this proposed BDRVReopenState to
>> get more
>> inputs.
>
> Are you still going to prepare an RFC patch implementing
> bdrv_reopen_prepare/commit/abort using a BDRVReopenState? Or has it even
> been posted and I just missed it?
>
> Kevin

Pls review the patchset posted at
http://lists.gnu.org/archive/html/qemu-devel/2011-10/msg01014.html

Working on further to similarly use BDRVReopenState for raw-win32 images.

thanks, Supriya

^ permalink raw reply	[flat|nested] 39+ messages in thread

end of thread, other threads:[~2011-10-11  5:22 UTC | newest]

Thread overview: 39+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2011-08-05  8:40 [Qemu-devel] Safely reopening image files by stashing fds Stefan Hajnoczi
2011-08-05  9:04 ` Paolo Bonzini
2011-08-05  9:27   ` Stefan Hajnoczi
2011-08-05  9:55     ` Paolo Bonzini
2011-08-05 13:03       ` Stefan Hajnoczi
2011-08-05 13:12     ` Daniel P. Berrange
2011-08-05 14:28       ` Christoph Hellwig
2011-08-05 15:24         ` Stefan Hajnoczi
2011-08-05 15:43           ` Kevin Wolf
2011-08-05 15:49             ` Anthony Liguori
2011-08-08  7:02               ` Supriya Kannery
2011-08-08  8:12                 ` Kevin Wolf
2011-08-09  9:22                   ` supriya kannery
2011-08-09  9:51                     ` Kevin Wolf
2011-08-09  9:32                       ` supriya kannery
2011-08-16 19:18                         ` [Qemu-devel] [RFC] " Supriya Kannery
2011-08-16 19:18                         ` Supriya Kannery
2011-08-17 14:35                           ` Kevin Wolf
2011-10-10 18:28                     ` [Qemu-devel] " Kevin Wolf
2011-10-11  5:21                       ` Supriya Kannery
2011-08-05 14:27     ` Christoph Hellwig
2011-08-05  9:07 ` Kevin Wolf
2011-08-05  9:29   ` Stefan Hajnoczi
2011-08-05  9:48     ` Kevin Wolf
2011-08-08 14:49       ` Stefan Hajnoczi
2011-08-08 15:16         ` Kevin Wolf
2011-08-09 10:25           ` Stefan Hajnoczi
2011-08-09 10:35             ` Kevin Wolf
2011-08-09 10:50               ` Stefan Hajnoczi
2011-08-09 10:56                 ` Stefan Hajnoczi
2011-08-09 11:39                   ` Kevin Wolf
2011-08-09 12:00                     ` Stefan Hajnoczi
2011-08-09 12:24                       ` Kevin Wolf
2011-08-09 19:39                         ` Blue Swirl
2011-08-10  7:58                           ` Kevin Wolf
2011-08-10 17:20                             ` Blue Swirl
2011-08-11  7:37                               ` Kevin Wolf
2011-08-11 16:21                                 ` Blue Swirl
2011-08-05 20:16 ` Blue Swirl

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.